DA DataAcuity by The Geek Network

DataAcuity — Security Posture

Status: ⚠️ Current state is NOT production-ready for data traffic. See §3 for blockers and §6 for the hardening plan.
Last audited: 2026-05-28
Owner: Tinashe Bhengu

This document is the honest accounting of what's secure on .106, what isn't, and the prioritised path to making the box safe for real customer data flowing through the BI pipeline.


1. TL;DR

.106 runs 54 Docker containers. The infrastructure for proper security exists (Traefik for TLS, Keycloak for auth, fail2ban active, internal-only Docker networks) but most services are exposed raw on the public IP without authentication because they were originally set up for dev convenience and never hardened.

Before the BI pipeline carries real customer data, the following must be true:

  1. No PostgreSQL port is reachable from the public internet
  2. No internal-only service (Loki, cAdvisor, exporters, dashboards) is reachable from the public internet
  3. Every customer-touching API has either Keycloak auth or a documented "this is intentionally public" decision
  4. Backups of data_warehouse are verified restorable
  5. Monitoring + alerting on auth failures, anomalous query volumes, and disk pressure is wired
  6. A documented incident response runbook exists

We're at 0 of 6 right now. Estimated work: 1–2 weeks for the lockdown, 1 week for monitoring + runbooks.

2. Threat model

What we're defending against, in priority order:

| Threat | Impact | Mitigation strategy |
| --- | --- | --- |
| Unauthenticated read of customer PII from data warehouse | POPIA/GDPR breach, regulatory fines, brand damage | Anonymisation pipeline (§6 of BI Pipeline doc) + lock down data_warehouse to internal-only |
| Public PostgreSQL exposure with brute-force credential attack | Total compromise of warehouse data, lateral movement to other DBs | Close port 5001/5433 to public, require VPN or jump host for direct DB access |
| Unauthenticated MCP / API abuse (geo_mcp, valhalla, markets) | Rate-limit-free scraping, data exfiltration of POIs/places, denial of service | Front everything with Traefik + API key or Keycloak token |
| Container escape via cAdvisor / Docker socket | Host compromise, all-container compromise | Close cAdvisor public port, restrict Docker socket access |
| Log exfiltration via public Loki | PII visible in app logs leaks publicly | Close Loki public port |
| Admin UI takeover (n8n, twenty, ai_brain_webui, automatisch) | Workflow tampering, CRM data leak, LLM cost runaway | Put behind Keycloak SSO via Traefik |
| Credential leak via .env files / Docker inspect | Lateral movement | Secrets in env-var-only, never in committed files; use Docker secrets where supported |
| DDoS on public services | Service degradation, cost spike | Cloudflare in front of Traefik (TBD) + per-service rate limits |
| Backup compromise | Data loss + RTO blown | Encrypted backups, off-server replication, restore drills |

Out of scope (covered elsewhere):

  • App-layer authn/authz (lives in TGN AuthAPI per CLAUDE.md)
  • App-layer business logic abuse (lives in each app's threat model)
  • Banking compliance specifics (lives in .claude-memory/banking-compliance-rules.md)

3. Current state — the audit

Audit run on .106 on 2026-05-28. Findings grouped by severity.

3.1 🚨 P0 — must fix before any production data traffic

| # | Issue | Detail |
| --- | --- | --- |
| 1 | data_warehouse PostgreSQL on port 5001 published to 0.0.0.0 | Anyone on the internet can attempt to connect. Only the password protects the warehouse. Source: docker ps shows 0.0.0.0:5001->5432/tcp |
| 2 | maps_db PostgreSQL on port 5433 published to 0.0.0.0 | Same as above for the maps database |
| 3 | loki log aggregator on port 3100 published to 0.0.0.0 | All container logs (potentially with PII in app log lines) reachable by anyone who knows the LogQL API. No auth required |
| 4 | cadvisor on port 8081 published to 0.0.0.0 | Full container introspection: anyone can see images, env vars, resource usage, command lines |
| 5 | Prometheus exporters (node-exporter :9100, nginx-exporter :9113, redis-exporters :9121/9122/9123, postgres-exporters :9187/9188) all published to 0.0.0.0 | Host and DB metrics leak, useful for attackers fingerprinting the system |
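
The findings above all come from the ports column of docker ps. As a rough sketch of how that check could be repeated after each change, the helper below flags entries published on all interfaces. The function name and the idea of feeding it the raw ports string are assumptions for illustration, not part of the audit tooling.

```python
# Hosts that mean "listen on every interface" in `docker ps` output.
PUBLIC_HOSTS = {"0.0.0.0", "::", "[::]"}


def public_bindings(ports_field):
    """Return (host_port, container_side) pairs published on all interfaces.

    `ports_field` is the comma-separated ports string for one container,
    e.g. "0.0.0.0:5001->5432/tcp, :::5001->5432/tcp".
    """
    found = []
    for entry in ports_field.split(","):
        entry = entry.strip()
        if "->" not in entry:
            continue  # exposed inside Docker only, not published on the host
        host_side, container_side = entry.split("->", 1)
        host, _, host_port = host_side.rpartition(":")
        if host in PUBLIC_HOSTS:
            found.append((host_port, container_side))
    return found


if __name__ == "__main__":
    print(public_bindings("0.0.0.0:5001->5432/tcp"))    # flagged as public
    print(public_bindings("127.0.0.1:5001->5432/tcp"))  # loopback only, not flagged
```

A container is clean for P0 purposes only when this returns an empty list for every port it publishes.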

3.2 ⚠️ P1 — must fix before customer-facing scale

| # | Issue | Detail |
| --- | --- | --- |
| 6 | geo_mcp on 5026 has no authentication | Anyone can call MCP tools (geocode, reverse_geocode, search_places, route, discover_quest, nearby POIs). Confirmed HTTP 200 on /sse with no credentials |
| 7 | valhalla on 5027 has no authentication | Africa routing free to anyone. Could be abused for free routing-as-a-service |
| 8 | maps_api on 5020 has no authentication on root or /docs | OpenAPI exposed, all endpoints callable |
| 9 | markets_api on 8000 has no authentication | Even though data is currently broken, the surface is public |
| 10 | Admin UIs published raw: n8n (5008), automatisch (5004), twenty_crm (5005), morph_convertx (5011), ai_brain_webui (5000), bio_onelink (5009), dashboard-backend (5007), api-docs (8082) | Each has its own auth (varies in strength); none are behind a unified SSO. Should all front through Traefik + Keycloak |
| 11 | No TLS certificates found at /etc/letsencrypt/live/ on the host | HTTPS termination is happening somewhere (likely Cloudflare or .118 ARR) but .106 itself doesn't terminate TLS. Container-to-container is unencrypted. Internet-to-container plaintext if hit on the host IP directly |
| 12 | tagme_api (5023) and transit_api (5030) live on .106 | Per CLAUDE.md all 36 APIs should be on .104/.105. These two are exceptions. Should either be moved or the rule updated and documented |

3.3 🟡 P2 — should fix in next 30 days

| # | Issue | Detail |
| --- | --- | --- |
| 13 | Keycloak deployed but barely wired | Master realm exists with public_key, but most services don't authenticate against it. Wiring this is the cleanest way to fix #10 |
| 14 | No verified backup restore drill for data_warehouse | /home/geektrading/backups/ exists but no documentation of when restoration was last tested |
| 15 | No documented incident response runbook | If data_warehouse is compromised, what do we do? Who's paged? Where's the rotation key? No answer |
| 16 | No anomaly alerting on auth failures or query volumes | We have Prometheus/Grafana but no rules for "unusual query rate against geo_db" or "auth fail spike on Keycloak" |
| 17 | replicator credential in plaintext in Deployment/deployment-credentials.ps1 (committed to repo) | Should be in a secret manager (Vault, AWS Secrets Manager, Azure Key Vault). Even though the repo is private, this is the wrong pattern |
| 18 | No automated PII-leak scanning on dbt models | The BI pipeline §6.3 specifies it; not yet implemented |
| 19 | n8n's 55 MB sqlite content is black-box to ops | Could contain credentials, workflows touching customer data, etc. Audit overdue |
| 20 | maps_osrm runs without data but still consumes resources and presents attack surface | Either load Africa OSM into it or decommission |
| 21 | automatisch has 0 flows but is publicly exposed | Same as #20: either commit to using it or remove |

3.4 🟢 P3 — nice to have

| # | Issue | Detail |
| --- | --- | --- |
| 22 | fail2ban is active but the policy isn't documented | Verify policy covers SSH + WebUI auth failures + Postgres |
| 23 | No CIS / hardening baseline scan ever run on the host | Run lynis audit system for a baseline |
| 24 | No SBOM / image vulnerability scan | Run trivy against every container image, schedule monthly |
| 25 | No automated TLS renewal monitoring | If Traefik's ACME fails silently, certs expire. Need an alert |
| 26 | No audit log for who SSHed and what they ran | OS-level audit (auditd, falco) not deployed |

4. What's already good ✅

Credit where due — not everything is broken:

  • Docker network isolation is correctly used for DBs: geo_db, gateway-db, keycloak_db, superset_db, twenty_db, automatisch_db, bio_db, markets_db (internal port), maps_redis, gateway-redis, twenty_redis, automatisch_redis, markets_redis — none are publicly exposed
  • maps_osrm and maps_prerender correctly internal-only
  • fail2ban is active on the host (mitigates SSH brute force)
  • Keycloak is deployed and ready to be wired — the auth infrastructure exists, just isn't used
  • Traefik is deployed with ACME — TLS infrastructure exists
  • api-gateway-external / api-gateway-internal containers exist — the gateway pattern is set up
  • Monitoring stack is comprehensive — Prometheus + Grafana + Loki + Alertmanager + multiple exporters
  • dbt and the data warehouse infrastructure are sound — schema layering pattern works
  • Replication from .104 to .105 is healthy — ETL won't have to touch the primary

5. Compliance — what regulators care about

For each regulation the BI pipeline must satisfy, what's the current security gap?

| Regulation | Requirement | Current gap |
| --- | --- | --- |
| POPIA (SA) | "Reasonable technical and organisational measures to prevent unauthorised access" | Public Postgres ports (P0 #1, #2): direct violation. Public Loki (P0 #3): likely violation if any PII in logs |
| POPIA | "Limit access to what is necessary for the purpose" | No role-based access on warehouse (everyone reading uses dwh_user): partial gap |
| POPIA (Sec 22) | Breach notification "as soon as reasonably possible" after discovery | No incident response runbook (P2 #15): gap in capability to comply |
| FICA | "Records retained 5 years, securely stored" | Backups not verified restorable (P2 #14): capability gap |
| GDPR Art. 32 | "Pseudonymisation and encryption of personal data" | Anonymisation pipeline designed but not deployed; encryption-at-rest assumed but not verified |
| GDPR Art. 25 | "Privacy by design and by default" | Public Postgres ports violate this on its face |
| GDPR Art. 30 | "Records of processing activities" | No data processing register exists |
| GDPR Art. 33-34 | "Breach notification 72 hours" | Same as POPIA: capability gap |
| PCI-DSS (if any card data) | Network segmentation, no card data in logs | Out of scope for .106 (SDPKT handles cards, on .104) but verify no card data leaks into data_warehouse via the pipeline |

Bottom line for compliance: the warehouse cannot accept production customer data until P0 items 1–3 are closed, full stop. Items 4–5 are also blockers given anti-fingerprinting expectations under "reasonable technical measures."

6. Hardening plan — phased

Sequenced so each phase is independently shippable and each phase reduces risk meaningfully.

Phase A — Lock down public ports (3-5 days)

Acceptance criteria: All 🚨 P0 issues from §3.1 closed.

Steps:

  1. Change PostgreSQL port mappings for data_warehouse and maps_db from 0.0.0.0:5001->5432 to 127.0.0.1:5001->5432. External access requires SSH tunnel or VPN.
    # in docker-compose.yml
    ports:
      - "127.0.0.1:5001:5432"  # was "5001:5432"
    
  2. Same change for loki, cadvisor, all *-exporter containers. These don't need any public access — Prometheus scrapes them from inside the Docker network already.
  3. Update Prometheus scrape configs if any used host.docker.internal:9100 style — switch to container-name targets.
  4. Verify with external nmap from a different IP that ports 5001, 5433, 3100, 8081, 9100, 9113, 9121-9123, 9187, 9188 no longer respond.
  5. Document the new access path for the ops team (SSH tunnel command, Tailscale config, or bastion host).

Estimated time: 1-2 days. Risk: low (the consumers of these ports are all internal).
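
As a complement to the nmap check in step 4, a plain TCP connect test can be scripted and re-run after every compose change. The sketch below is only an illustration: the helper names are made up, the port list is copied from step 4, and it must run from a machine outside .106 to say anything about public exposure.

```python
import socket

# Ports listed in Phase A step 4.
PHASE_A_PORTS = [5001, 5433, 3100, 8081, 9100, 9113, 9121, 9122, 9123, 9187, 9188]


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def still_exposed(host, ports=PHASE_A_PORTS, timeout=3.0):
    """Return the subset of `ports` that still accept connections on `host`."""
    return [p for p in ports if is_port_open(host, p, timeout)]


if __name__ == "__main__":
    import sys
    # Usage: python check_ports.py <public-ip-of-.106>
    exposed = still_exposed(sys.argv[1])
    print("still exposed:", exposed if exposed else "none")
```

A filtered port and a closed port both come back as not open here, so nmap remains the better tool for telling those two apart.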

Phase B — Front public APIs with Traefik + auth (5-7 days)

Acceptance criteria: All ⚠️ P1 issues from §3.2 closed. No service published raw to 0.0.0.0 except the few intentionally public.

Steps:

  1. Inventory all currently-public services — categorise as: (a) intentionally public (api-gateway-external, dataacuity_portal), (b) needs auth wrapper (geo_mcp, valhalla, maps_api, markets_api), (c) admin (n8n, twenty, ai_brain_webui, etc.) needs Keycloak SSO.
  2. Configure Traefik routes for each category, terminating TLS at Traefik. Issue Let's Encrypt certs via existing ACME setup.
  3. For category (b): wrap with an API-key middleware initially (faster), wire to Keycloak token validation as Phase B.5.
  4. For category (c): integrate Keycloak SSO via Traefik's ForwardAuth middleware (or use oauth2-proxy as the intermediate).
  5. Remove 0.0.0.0: from every docker-compose for the wrapped services. Internal Docker DNS handles container-to-Traefik routing.
  6. Verify with nmap that only ports 80 (redirect), 443 (Traefik), 22 (SSH), and 8084 (api-gateway-external if intentionally separate) respond.

Estimated time: 1 week. Risk: medium (touches every public surface — needs careful change windows).

Phase C — Backups + DR drill (2-3 days)

Acceptance criteria: P2 #14 closed.

  1. Verify the nightly backup job at /home/geektrading/backups/ is actually running and what it backs up
  2. Run a restore drill into a fresh empty container — measure RTO
  3. Set up off-server backup replication to either .118 or external S3
  4. Document the runbook for "data_warehouse is gone, restore it"
  5. Add backup-freshness alert to Prometheus + Alertmanager (warn if backup older than 26 h)
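
The freshness rule in step 5 reduces to a single comparison. A minimal sketch follows, assuming the timestamp of the latest snapshot is already available as a datetime; how it is obtained from the backup tool is left out, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Threshold from step 5: warn if the newest backup is older than 26 hours.
MAX_BACKUP_AGE = timedelta(hours=26)


def backup_is_stale(last_backup, now, max_age=MAX_BACKUP_AGE):
    """Return True when the newest backup is older than `max_age`."""
    return now - last_backup > max_age


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    yesterday = now - timedelta(hours=25)
    print(backup_is_stale(yesterday, now))  # 25 h old, inside the 26 h window
```

In Prometheus the same rule would normally be expressed against a metric holding the last successful backup timestamp; the Python form is only meant to pin down the logic.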

Phase D — Incident response + monitoring (3-5 days)

Acceptance criteria: P2 #15, #16 closed; P1 PII alerting from BI pipeline §11.2 in place.

  1. Write incident response runbook — who's paged, escalation chain, breach notification template
  2. Wire Alertmanager to actually page someone (currently it just collects)
  3. Add Prometheus rules for the alert classes in BI Pipeline §11.2 (P1 alerts page on-call, P2 alerts create tickets)
  4. Auth-failure alerts: Keycloak login failure spike, postgres auth failure spike, fail2ban ban-rate spike
  5. Anomaly alerts: warehouse query volume 3σ above baseline, disk free <20 GB, container restart loop
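
To make the "3σ above baseline" rule in step 5 concrete, here is a small sketch. It assumes the baseline is a list of recent per-interval query counts; the window length, the sampling interval and the function name are all assumptions, not decisions recorded in this plan.

```python
from statistics import mean, pstdev


def is_anomalous(baseline, current, sigmas=3.0):
    """Return True if `current` exceeds the baseline mean by more than
    `sigmas` population standard deviations."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    threshold = mean(baseline) + sigmas * pstdev(baseline)
    return current > threshold


if __name__ == "__main__":
    # Made-up per-minute query counts, for illustration only.
    recent = [100, 110, 90, 105, 95]
    print(is_anomalous(recent, 120))
    print(is_anomalous(recent, 125))
```

A perfectly flat baseline has zero deviation, so any increase at all would fire; a production rule would likely add a minimum absolute floor alongside the sigma test.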

Phase E — n8n audit + secret rotation (2-3 days)

Acceptance criteria: P2 #17, #19 closed.

  1. Copy n8n's sqlite out of the container, read it, document every workflow
  2. For workflows touching customer data, ensure they go through the BI pipeline path (not direct DB access) — refactor if needed
  3. Migrate replicator and other secrets out of Deployment/deployment-credentials.ps1 into a secret manager
  4. Rotate any credential that was visible in committed files

Phase F — Hardening baseline + image scans (2 days)

Acceptance criteria: P3 #23, #24 closed.

  1. Run lynis audit system on .106, document findings
  2. Run trivy image against every image used, document CVEs rated HIGH or above
  3. Schedule both as monthly Grafana-tracked jobs
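
One way to turn the trivy output from step 2 into a documented list is to post-process its JSON report. The sketch below assumes the report was produced with JSON output and follows the usual layout of a top-level Results list whose entries carry a Target and an optional Vulnerabilities list; the function name and the sample data are illustrative only.

```python
SEVERITY_ORDER = ["UNKNOWN", "LOW", "MEDIUM", "HIGH", "CRITICAL"]


def findings_at_or_above(report, threshold="HIGH"):
    """Return (target, vulnerability_id, severity) for each finding whose
    severity is `threshold` or worse."""
    floor = SEVERITY_ORDER.index(threshold)
    out = []
    for result in report.get("Results") or []:
        # "Vulnerabilities" can be missing or null when an image is clean.
        for vuln in result.get("Vulnerabilities") or []:
            sev = vuln.get("Severity", "UNKNOWN")
            if sev in SEVERITY_ORDER and SEVERITY_ORDER.index(sev) >= floor:
                out.append((result.get("Target", ""), vuln.get("VulnerabilityID", ""), sev))
    return out


if __name__ == "__main__":
    # Placeholder report; the IDs are not real CVEs.
    sample = {
        "Results": [
            {"Target": "example-image", "Vulnerabilities": [
                {"VulnerabilityID": "CVE-0000-0001", "Severity": "CRITICAL"},
                {"VulnerabilityID": "CVE-0000-0002", "Severity": "LOW"},
            ]},
        ]
    }
    for row in findings_at_or_above(sample):
        print(row)
```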

Phase G — Decommission dead services (1 day)

Acceptance criteria: P2 #20, #21 closed.

  1. Decide: maps_osrm. Either load Africa OSM data or decommission. It serves the same routing role as Valhalla, so it is redundant if Valhalla covers the use case
  2. Decide: automatisch — start using it or decommission
  3. Stop, remove, document the decision in a CHANGELOG

Total

Approximately 3-4 weeks of focused security work to get .106 to production-ready for data traffic.

7. Definition of "production-ready for data traffic"

For the BI pipeline to start carrying real customer data, ALL of these must be true:

  • All P0 issues from §3.1 are closed (no public Postgres, no public infra ports)
  • All P1 issues from §3.2 are closed (no unauthenticated API, no public admin UI)
  • Backups verified restorable (Phase C complete)
  • Alerting wired and tested (Phase D complete)
  • Incident response runbook exists and the on-call rotation knows where it is
  • PII-absence dbt tests are running on every marts.* model
  • Compliance sign-off recorded (a named compliance reviewer has audited a sample and signed off)
  • A documented "kill switch" — a single action that stops all data flow into the warehouse if something is wrong

Anything less is "dev / staging quality only" and the warehouse must contain only synthetic or already-anonymised data.

8. Specific service hardening notes

8.1 geo_mcp

  • Add X-API-Key header check before any tool call — keys issued per consumer (TagMe, Takemehome, Butler)
  • Rate limit: 600 calls/min per key by default, lower for browser-origin
  • Log every tool call (consumer, tool, args, latency) to Loki for audit
  • Behind Traefik with TLS; container itself only listens on Docker network
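
The key check and per-key limit described above could look roughly like the following. This is a sketch under assumptions: an in-memory sliding window in a single process, with a made-up class name and example key. The real gate may equally live in Traefik middleware rather than in the service.

```python
import time
from collections import defaultdict, deque


class ApiKeyRateLimiter:
    """Sliding-window limiter keyed by API key (default 600 calls per 60 s)."""

    def __init__(self, valid_keys, limit=600, window_s=60.0, clock=time.monotonic):
        self._valid = set(valid_keys)
        self._limit = limit
        self._window = window_s
        self._clock = clock
        self._calls = defaultdict(deque)  # api_key -> timestamps of recent calls

    def check(self, api_key):
        """Return 'ok', 'unauthorized' or 'rate_limited' for one incoming call."""
        if api_key not in self._valid:
            return "unauthorized"
        now = self._clock()
        calls = self._calls[api_key]
        # Drop timestamps that have fallen out of the window.
        while calls and now - calls[0] >= self._window:
            calls.popleft()
        if len(calls) >= self._limit:
            return "rate_limited"
        calls.append(now)
        return "ok"


if __name__ == "__main__":
    limiter = ApiKeyRateLimiter({"example-tagme-key"})
    print(limiter.check("example-tagme-key"))
    print(limiter.check("not-a-key"))
```

Keeping the clock injectable makes the window testable without sleeping. A production version would also want persistent key storage and a constant-time key comparison.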

8.2 valhalla

  • Same API-key gate
  • Disable /expansion, /trace_attributes, /height if not needed (smaller attack surface)
  • Cache aggressively (24h on routes — they don't change)

8.3 maps_api

  • Already designed for it (slowapi rate limiter present) — verify the limit is sensible (currently 60/min/IP)
  • API-key auth for /api/v2/* (the new BI-relevant proxies)
  • Disable /docs in production unless authenticated

8.4 data_warehouse

  • Move to internal-only port
  • Role separation: etl_user (write to raw.* only), dwh_user (read all, write staging/intermediate/marts/analytics), analytics_user (read analytics.* only), compliance_user (read everything + access to compliance.token_vault)
  • Connection limit per role; warn if any role hits 80% of limit
  • pg_stat_statements enabled for query analytics
  • Audit logging enabled (or at minimum log all SELECT * FROM intermediate.* access)
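
The role separation above can be written down as a matrix and turned into GRANT statements for review. The sketch below makes several assumptions: that "read all" means the five schemas named in the bullet, that "write" maps to INSERT, UPDATE and DELETE, and that compliance_user needs only SELECT on compliance.token_vault. None of this is confirmed by the warehouse itself.

```python
# Assumption: these are the schemas that "read all" refers to.
ALL_SCHEMAS = ["raw", "staging", "intermediate", "marts", "analytics"]

ROLE_MATRIX = {
    "etl_user": {"read": [], "write": ["raw"], "tables": []},
    "dwh_user": {"read": ALL_SCHEMAS,
                 "write": ["staging", "intermediate", "marts", "analytics"],
                 "tables": []},
    "analytics_user": {"read": ["analytics"], "write": [], "tables": []},
    "compliance_user": {"read": ALL_SCHEMAS, "write": [],
                        "tables": ["compliance.token_vault"]},
}


def grant_statements(matrix):
    """Render the role matrix as PostgreSQL GRANT statements."""
    stmts = []
    for role, access in matrix.items():
        schemas = set(access["read"]) | set(access["write"])
        schemas |= {t.split(".", 1)[0] for t in access["tables"]}
        for schema in sorted(schemas):
            stmts.append(f"GRANT USAGE ON SCHEMA {schema} TO {role};")
        for schema in access["read"]:
            stmts.append(f"GRANT SELECT ON ALL TABLES IN SCHEMA {schema} TO {role};")
        for schema in access["write"]:
            stmts.append(
                f"GRANT INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA {schema} TO {role};"
            )
        for table in access["tables"]:
            stmts.append(f"GRANT SELECT ON {table} TO {role};")
    return stmts


if __name__ == "__main__":
    print("\n".join(grant_statements(ROLE_MATRIX)))
```

GRANT ... ON ALL TABLES only covers tables that exist when it runs, so tables created later by dbt would also need ALTER DEFAULT PRIVILEGES or a re-run of these grants.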

8.5 keycloak

  • Master realm: admin password rotated and stored in secret manager
  • Per-app realm (or client) for: BigBruh!, n8n, Superset, Grafana, twenty_crm, dataacuity_portal, ai_brain_webui
  • Backup the realm exports nightly

8.6 superset

  • Behind Keycloak SSO via OAuth/OIDC
  • Row-level security on the warehouse connections so analysts only see appropriate marts
  • Disable Public role's access to any dataset

8.7 n8n

  • After Phase E audit, behind Keycloak
  • Encrypt the sqlite at rest (filesystem-level encryption on the volume)
  • Webhooks (the public attack surface) get their own API-key validation

9. Trusted access patterns (for ops + developers)

After hardening, how does a developer / ops engineer access .106 services?

| Need | Path |
| --- | --- |
| Web UI access (Superset, Grafana, n8n, dataacuity_portal) | https://.dataacuity.co.za → Traefik → Keycloak SSO → service |
| API call from TGN app | https://maps.dataacuity.co.za/api/v2/... with X-API-Key header → Traefik → maps_api → backend |
| Direct Postgres access (DBA, analyst) | SSH tunnel: ssh -L 5001:localhost:5001 geektrading@.106 → connect locally to localhost:5001 (after Phase A the port is bound to 127.0.0.1:5001 on the host, so the tunnel targets the host's loopback, not the container name) |
| Direct container shell (debugging) | SSH to .106 → docker exec -it <container> bash (requires being in the docker group, which is locked to named users) |
| Read logs (Loki) | Grafana UI → Loki data source → LogQL queries. No direct Loki port access |
| Prometheus metrics | Grafana UI → Prometheus data source. No direct port access |
| Backup restore (DR drill) | Pull backup from off-server location → load into a fresh container per the documented runbook |

10. Open questions — with findings and recommendations

Repo-wide search done 2026-05-28. For each question, what exists today and the recommendation:

10.1 Compliance reviewer / DPO — NOT ASSIGNED

Finding: No named DPO or compliance reviewer anywhere in the repo. Compliance rules are documented (.claude-memory/banking-compliance-rules.md, AppInfo/TrustSeal/TRUSTSEAL_IMPLEMENTATION_PLAN.md) but no person owns sign-off.

Why it matters: Anonymisation Standard §11 requires a named reviewer to sign off on every intermediate.* model before it ships. Without this role, the BI pipeline cannot legally start carrying production PII.

Recommendation: This is a hiring/appointment decision, not a technical one. Three options:

  • (a) Assign internally — likely the most senior backend/data lead with compliance training; lowest cost, real ongoing time commitment (~2 h / week)
  • (b) Contract external counsel — POPIA-specialist law firm in SA (Webber Wentzel, ENS, Bowmans all have practices). Higher cost, lower internal burden, more credible to regulators
  • (c) Hire dedicated DPO — only justifiable at scale; GDPR mandates this once you process EU PII at meaningful volume

Action: Pick (a) or (b) before BI Pipeline Phase 2 starts.

10.2 Incident response process — DOES NOT EXIST

Finding: No IR runbooks, no on-call rotation, no escalation chain, no PagerDuty / Opsgenie / similar. The only related artifacts are .claude-memory/security-audit-critical.md (which documents past P0 findings but no response procedure) and the "ONE connection attempt, then ask" rule in CLAUDE.md (a safeguard, not an IR plan).

Why it matters: POPIA Sec 22 requires breach notification "as soon as reasonably possible" after discovery, and GDPR Art 33 sets a 72-hour deadline. We can't comply with either without a defined process.

Recommendation: Build it. Doesn't need to be elaborate to start:

  1. Pager: PagerDuty free tier (5 users free) OR Opsgenie free tier (5 users free) OR self-hosted (KumaHQ exists as part of Uptime Kuma which we could deploy on .106 cheaply)
  2. Runbook: One markdown file covering: detection paths, severity classes, escalation chain (named humans with phone numbers), breach-notification template, post-incident review template
  3. Rotation: Even a 1-person "always on-call" is better than nothing; expand to 2-3 once team grows
  4. Tabletop exercise: Quarterly — pick a scenario, walk through the runbook, find gaps

Action: Build during Phase D. Estimate 1 week to ship v1.

10.3 Cloudflare in front of Traefik — PARTIAL USAGE TODAY

Finding: Cloudflare R2 (object storage) is the only Cloudflare service in use (AppInfo/Infrastructure/R2_QUICK_REFERENCE.md). No DNS, no CDN, no WAF.

Why it matters: DDoS mitigation is hard without a CDN. Traefik + fail2ban can handle slow/medium attacks but not real volumetric ones. WAF rules block common attack patterns (SQLi, XSS, path traversal) before they reach our services.

Recommendation: Yes, add Cloudflare in front of public TGN endpoints. Specifics:

  • Free tier covers most needs: DNS, basic DDoS (Layer 3/4), free SSL, basic WAF rules
  • Pro tier ($25/mo per zone) adds: image optimisation, advanced rate-limiting, WAF managed rules
  • Business tier ($200/mo per zone) adds: 100% uptime SLA, advanced WAF, bypass-cache rules
  • Most pragmatic: Free tier for most domains, Pro for maps.dataacuity.co.za once it carries paid traffic
  • TLS termination: Cloudflare terminates TLS at the edge and re-encrypts to Traefik (Full (strict) mode). Traefik's ACME stays for internal mutual TLS

Action: Onboard .106 services to Cloudflare during Phase B (Traefik wiring). Don't try to do it before Traefik is wired — Cloudflare-in-front of raw exposed ports is worse than current state.

10.4 Secret manager choice — NOTHING CENTRALISED TODAY

Finding: Secrets live in Deployment/deployment-credentials.ps1 (plaintext, committed-but-meant-to-not-be) and GitHub Actions secrets (~50 of them). No Vault, no AWS Secrets Manager, no Azure Key Vault, no Doppler.

Why it matters: The credentials file is in the repo (committed) — even with .gitignore warning, you can't unship that horse. Rotation is manual. Audit trail is git log only. This is a serious gap when banking compliance is in scope.

Recommendation: Pick one and consolidate. My ranking for this situation:

  1. HashiCorp Vault (self-hosted on .118) — best feature set, full audit, transit encryption, dynamic credentials. Cost: ops time to run it. Steep learning curve.
  2. Doppler (SaaS, free tier for small teams) — fastest to adopt, good DX, native Docker / CLI integration. Cost: $0–10/user/mo. Outsources your secrets to a third party.
  3. AWS Secrets Manager / Azure Key Vault — only if you already use that cloud for other things; otherwise adds operational surface
  4. Bitwarden Secrets Manager — newer, free self-hosted Vaultwarden + paid SM tier. Worth watching but young

For TGN's current scale and SA jurisdiction, my pick is Vault self-hosted on .118 — keeps secrets in-country, full control, no SaaS lock-in. Doppler is the second-best if simplicity matters more.

Action: Decide + start migration during Phase E (n8n audit + secret rotation are paired in the plan).

10.5 SOC2 readiness — NOT IN ACTIVE PROGRAM

Finding: Listed as a Phase 6 future item in AppInfo/TrustSeal/TRUSTSEAL_IMPLEMENTATION_PLAN.md with $25K budgeted. No active controls inventory, no auditor engagement, no timeline.

Why it matters: SOC2 is enterprise-customer table-stakes if TGN wants to sell DataAcuity / GeoGlobal / BI services to large companies. POPIA + GDPR alone are sufficient for B2C operations but limit B2B sales.

Recommendation: Don't pursue SOC2 now. Reasons:

  • The hardening work in this doc (Phases A-G) addresses ~70% of SOC2 Type II controls organically — defer formal audit until those land
  • SOC2 audit takes 6-12 months and ~$25K. Should be timed for when a specific big customer needs it
  • Trying to "build for SOC2" prematurely tends to over-engineer for hypothetical needs

If/when a deal demands it: revisit. Until then, the work we're doing aligns with future SOC2 readiness without paying the audit tax.

10.6 24/7 on-call — DOES NOT EXIST

Finding: No rotation schedule, no paging integration, no shift docs. De-facto policy is "best effort during business hours."

Why it matters: If the BI pipeline goes down at 2am and the morning's analytics are stale, that's mildly bad. If geo_db is breached at 2am and we don't notice until 9am, the attacker has had 7 unobserved hours before containment or breach notification (POPIA Sec 22, GDPR Art 33) can even start.

Recommendation: Tier it. Full 24/7 with paid shifts is overkill for current scale. But:

  • Critical alerts only at night — page only on: confirmed PII breach, all-services-down, payment gateway outage. Everything else waits for morning.
  • Single-person rotation with PagerDuty/Opsgenie scheduling
  • Define what's "critical" so the page only fires for things that truly can't wait
  • Document a 30-min response SLA for critical pages; everything else is best-effort
  • Quarterly review of pages fired — too many false positives, tune the rules; too few, expand the criteria

Action: Set up during Phase D, paired with the incident response work in §10.2.
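
The tiering above amounts to a small routing decision. The sketch below is illustrative only: the alert class labels, the night window of 18:00 to 08:00 and the daytime "notify" outcome are assumptions, since this section fixes only what pages at night.

```python
from datetime import time

# Hypothetical labels for the three critical classes named above.
CRITICAL_ALERTS = {"confirmed_pii_breach", "all_services_down", "payment_gateway_outage"}

# Assumption: "night" runs from 18:00 to 08:00 local time.
NIGHT_START = time(18, 0)
NIGHT_END = time(8, 0)


def is_night(local_time):
    """True between NIGHT_START and NIGHT_END, wrapping past midnight."""
    return local_time >= NIGHT_START or local_time < NIGHT_END


def route_alert(alert_class, local_time):
    """Return 'page', 'hold_until_morning' or 'notify' for one alert."""
    if alert_class in CRITICAL_ALERTS:
        return "page"
    if is_night(local_time):
        return "hold_until_morning"
    return "notify"


if __name__ == "__main__":
    print(route_alert("confirmed_pii_breach", time(2, 0)))
    print(route_alert("disk_pressure", time(2, 0)))
```

Keeping the critical set as explicit data makes the quarterly review of pages fired a matter of editing one list.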

| # | Decision needed | My recommendation | Block on which phase |
| --- | --- | --- | --- |
| 10.1 | Compliance reviewer | Assign internal (a) for v1, retain external counsel (b) for audit | BI Pipeline Phase 2 |
| 10.2 | Incident response | Build it (PagerDuty/Opsgenie free + 1 markdown runbook) | Security Phase D |
| 10.3 | Cloudflare | Yes, free tier; Pro for revenue-bearing domains | Security Phase B |
| 10.4 | Secret manager | HashiCorp Vault self-hosted on .118 | Security Phase E |
| 10.5 | SOC2 | Defer; revisit when a customer requires it | none (not blocking) |
| 10.6 | 24/7 on-call | Tiered: critical-only at night, single-person rotation | Security Phase D |

These need a thumbs-up before the corresponding phase ships.

11. Cross-references

  • DataAcuity_BI_Pipeline.md §6 — the anonymisation framework that this security posture supports
  • DataAcuity_Architecture_Overview.md §5 — the public exposure summary this doc expands on
  • GeoGlobal_Deployment.md §11 — service-specific hardening for geo_mcp/valhalla
  • Deployment/deployment-credentials.ps1 — the credentials file that's part of P2 #17
  • .claude-memory/banking-compliance-rules.md — the SARB/FICA/POPIA rules informing the compliance map in §5
  • .claude-memory/deploy-pattern-pgbouncer-cascade.md — connection discipline that informs Phase A and B

12. Inspection findings after Phase A — what changed (2026-05-28 PM)

Post-Phase-A inspection of the wider stack turned up a few items that the original audit missed. They affect the upcoming phases.

12.1 Traefik is not running

  • Compose file at /home/geektrading/suite/traefik/docker-compose.yml ✅ exists
  • ACME cert data at /home/geektrading/suite/traefik/acme/acme.json (140 KB) ✅ exists
  • Dynamic config at /home/geektrading/suite/traefik/config/{middlewares,services,tls}.yml ✅ exists with routes for dataacuity.co.za, traefik.dataacuity.co.za etc.
  • Container itself: not running. docker ps -a --filter name=traefik returns empty
  • Image: not present locally. docker images | grep traefik returns empty

Implication for Phase B: First task is docker pull traefik:v3.0 then docker compose -f /home/geektrading/suite/traefik/docker-compose.yml up -d. Once the cert refreshes (or proves valid from acme.json), then add the API routes.

12.2 DNS is wired correctly for Traefik

Verified dig +short from .106:

| Hostname | Resolves to |
| --- | --- |
| dataacuity.co.za | 197.97.200.106 (direct A record) |
| maps.dataacuity.co.za | CNAME → dataacuity.co.za → .106 |
| auth.dataacuity.co.za | CNAME → dataacuity.co.za → .106 |
| traefik.dataacuity.co.za | CNAME → dataacuity.co.za → .106 |

ACME HTTP-01 challenge will work once Traefik is up.

12.3 Restic backups verified healthy

/home/geektrading/backups/restic-repo inspection:

  • 358 snapshots total, daily cadence verified
  • Most recent: 2026-05-28 03:00 (this morning's backup)
  • restic check returns "no errors were found"
  • Backup script (/home/geektrading/backups/scripts/backup-databases.sh) targets only markets_db + data_warehouse — missing geo_db, maps_db, keycloak_db, superset_db, gateway-db, twenty_db, automatisch_db, bio_db
  • Snapshots are LOCAL to .106 — off-server replication is still P2 #14

Implication for Phase C: Less work than expected; the restic infrastructure is solid. Two real gaps:

  1. Expand backup-databases.sh to cover all warehouse-affecting DBs (especially geo_db)
  2. Set up off-server replication of restic-repo to .118 or to an external S3-compatible target
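
Gap 1 is a set difference between the databases that exist and the ones the script targets. A tiny sketch follows, using the names from the bullets above; in practice the inventory would come from the running containers rather than a hard-coded list, and the function name is hypothetical.

```python
# Targets currently covered by backup-databases.sh, per the finding above.
BACKED_UP = {"markets_db", "data_warehouse"}

# Inventory assembled from the databases named in this section.
ALL_DATABASES = {
    "markets_db", "data_warehouse", "geo_db", "maps_db", "keycloak_db",
    "superset_db", "gateway-db", "twenty_db", "automatisch_db", "bio_db",
}


def missing_from_backup(all_dbs, backed_up):
    """Return the databases that exist but are not backed up, sorted by name."""
    return sorted(set(all_dbs) - set(backed_up))


if __name__ == "__main__":
    for name in missing_from_backup(ALL_DATABASES, BACKED_UP):
        print(name)
```

Running the same comparison as part of the backup job would catch any database added later without a matching backup target.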

12.4 .105 → .106 PostgreSQL connectivity blocked

docker run --network data-warehouse_data_stack postgres:15-alpine psql -h 197.97.200.105 ... times out — no successful connection in 60 seconds.

Likely causes (probable order):

  1. Windows Firewall on .105 blocks inbound TCP 5432 from .106's IP
  2. Postgres on .105 (Windows IIS-hosted setup) is bound to localhost / specific IPs only
  3. Network routing between Windows servers (.104/.105) and the Ubuntu DataAcuity server (.106) requires a specific path

Implication for BI Pipeline Phase 1: The extract framework cannot run until this is resolved. Needs a workstream on the .104/.105 side: DBA opens a firewall rule + pg_hba.conf entry for .106's IP, using the existing replicator user from Deployment/deployment-credentials.ps1.

12.5 dbt warehouse is empty scaffolding

Real row counts in data_warehouse.datawarehouse (verified 2026-05-28):

| Schema | Tables | Real data? |
| --- | --- | --- |
| bronze, silver, gold (medallion) | 0 tables | Empty |
| dbt_dev_marts | 2 tables, 3 rows each | Build-verification only |
| dbt_dev_staging | empty | |
| tgn | 13 monthly event partitions, all 0 rows except tgn.events_2025_12 (4 rows) | Essentially empty |
| public | dbt metadata only | |

The dbt models listed in DataAcuity_BI_Pipeline.md §8.4 as "already running" exist as SQL but have never produced real output. The pipeline is greenfield from a data-flow perspective. The BI Pipeline doc has been corrected.

12.6 Action items added to the hardening plan

  • Phase B: pull + start Traefik before adding routes (add to §6 Phase B steps)
  • Phase C: expand backup script to cover all DBs; off-server replication (add to §6 Phase C steps)
  • BI Phase 0: get DBA to open .105 firewall + pg_hba.conf for .106 (add as pre-req)
  • DataAcuity_BI_Pipeline.md: corrected to reflect empty-warehouse reality (done 2026-05-28)

13. Change log

| Date | Change | By |
| --- | --- | --- |
| 2026-05-28 (am) | Initial document: audit findings + hardening plan | Tinashe Bhengu |
| 2026-05-28 (pm) | Phase A executed; §10 open questions investigated with recommendations | Tinashe Bhengu |
| 2026-05-28 (pm) | Added §12 inspection findings: Traefik state, restic health, .105 gap, warehouse reality | Tinashe Bhengu |