DataAcuity — Security Posture
Status: ⚠️ Current state is NOT production-ready for data traffic. See §3 for blockers and §6 for the hardening plan.
Last audited: 2026-05-28
Owner: Tinashe Bhengu
This document is the honest accounting of what's secure on .106, what isn't, and the prioritised path to making the box safe for real customer data flowing through the BI pipeline.
1. TL;DR
.106 runs 54 Docker containers. The infrastructure for proper security exists (Traefik for TLS, Keycloak for auth, fail2ban active, internal-only Docker networks) but most services are exposed raw on the public IP without authentication because they were originally set up for dev convenience and never hardened.
Before the BI pipeline carries real customer data, the following must be true:
- No PostgreSQL port is reachable from the public internet
- No internal-only service (Loki, cAdvisor, exporters, dashboards) is reachable from the public internet
- Every customer-touching API has either Keycloak auth or a documented "this is intentionally public" decision
- Backups of `data_warehouse` are verified restorable
- Monitoring + alerting on auth failures, anomalous query volumes, and disk pressure is wired
- A documented incident response runbook exists
We're at 0 of 6 right now. Estimated work: 1–2 weeks for the lockdown, 1 week for monitoring + runbooks.
2. Threat model
What we're defending against, in priority order:
| Threat | Impact | Mitigation strategy |
|---|---|---|
| Unauthenticated read of customer PII from data warehouse | POPIA/GDPR breach, regulatory fines, brand damage | Anonymisation pipeline (§6 of BI Pipeline doc) + lock down data_warehouse to internal-only |
| Public PostgreSQL exposure with brute-force credential attack | Total compromise of warehouse data, lateral movement to other DBs | Close port 5001/5433 to public, require VPN or jump host for direct DB access |
| Unauthenticated MCP / API abuse (geo_mcp, valhalla, markets) | Rate-limit-free scraping, data exfiltration of POIs/places, denial of service | Front everything with Traefik + API key or Keycloak token |
| Container escape via cAdvisor / Docker socket | Host compromise, all-container compromise | Close cAdvisor public port, restrict Docker socket access |
| Log exfiltration via public Loki | PII visible in app logs leaks publicly | Close Loki public port |
| Admin UI takeover (n8n, twenty, ai_brain_webui, automatisch) | Workflow tampering, CRM data leak, LLM cost runaway | Put behind Keycloak SSO via Traefik |
| Credential leak via .env files / Docker inspect | Lateral movement | Secrets in env-var-only, never in committed files; use Docker secrets where supported |
| DDoS on public services | Service degradation, cost spike | Cloudflare in front of Traefik (TBD) + per-service rate limits |
| Backup compromise | Data loss + RTO blown | Encrypted backups, off-server replication, restore drills |
Out of scope (covered elsewhere):
- App-layer authn/authz (lives in TGN AuthAPI per `CLAUDE.md`)
- App-layer business logic abuse (lives in each app's threat model)
- Banking compliance specifics (lives in `.claude-memory/banking-compliance-rules.md`)
3. Current state — the audit
Audit run on .106 on 2026-05-28. Findings grouped by severity.
3.1 🚨 P0 — must fix before any production data traffic
| # | Issue | Detail |
|---|---|---|
| 1 | data_warehouse PostgreSQL on port 5001 published to 0.0.0.0 | Anyone on the internet can attempt to connect. Only the password protects the warehouse. Source: docker ps shows 0.0.0.0:5001->5432/tcp |
| 2 | maps_db PostgreSQL on port 5433 published to 0.0.0.0 | Same as above for the maps database |
| 3 | loki log aggregator on port 3100 published to 0.0.0.0 | All container logs (potentially with PII in app log lines) reachable by anyone who knows the LogQL API. No auth required |
| 4 | cadvisor on port 8081 published to 0.0.0.0 | Full container introspection: anyone can see images, env vars, resource usage, command lines |
| 5 | Prometheus exporters (node-exporter :9100, nginx-exporter :9113, redis-exporters :9121/9122/9123, postgres-exporters :9187/9188) all published to 0.0.0.0 | Host and DB metrics leak, useful for attackers fingerprinting the system |
3.2 ⚠️ P1 — must fix before customer-facing scale
| # | Issue | Detail |
|---|---|---|
| 6 | geo_mcp on 5026 has no authentication | Anyone can call MCP tools (geocode, reverse_geocode, search_places, route, discover_quest, nearby POIs). Confirmed HTTP 200 on /sse with no credentials |
| 7 | valhalla on 5027 has no authentication | Africa routing free to anyone. Could be abused for free routing-as-a-service |
| 8 | maps_api on 5020 has no authentication on root or /docs | OpenAPI exposed, all endpoints callable |
| 9 | markets_api on 8000 has no authentication | Even though data is currently broken, the surface is public |
| 10 | Admin UIs published raw: n8n (5008), automatisch (5004), twenty_crm (5005), morph_convertx (5011), ai_brain_webui (5000), bio_onelink (5009), dashboard-backend (5007), api-docs (8082) | Each has its own auth (varies in strength); none are behind a unified SSO. Should all front through Traefik + Keycloak |
| 11 | No TLS certificates found at /etc/letsencrypt/live/ on the host | HTTPS termination is happening somewhere (likely Cloudflare or .118 ARR) but .106 itself doesn't terminate TLS. Container-to-container is unencrypted. Internet-to-container plaintext if hit on the host IP directly |
| 12 | tagme_api (5023) and transit_api (5030) live on .106 | Per CLAUDE.md all 36 APIs should be on .104/.105. These two are exceptions. Should either be moved or the rule updated and documented |
3.3 🟡 P2 — should fix in next 30 days
| # | Issue | Detail |
|---|---|---|
| 13 | Keycloak deployed but barely wired | Master realm exists with public_key, but most services don't authenticate against it. Wiring this is the cleanest way to fix #10 |
| 14 | No verified backup restore drill for data_warehouse | /home/geektrading/backups/ exists but no documentation of when restoration was last tested |
| 15 | No documented incident response runbook | If data_warehouse is compromised, what do we do? Who's paged? Where's the rotation key? No answer |
| 16 | No anomaly alerting on auth failures or query volumes | We have Prometheus/Grafana but no rules for "unusual query rate against geo_db" or "auth fail spike on Keycloak" |
| 17 | replicator credential in plaintext in Deployment/deployment-credentials.ps1 (committed to repo) | Should be in a secret manager (Vault, AWS Secrets Manager, Azure Key Vault). Even though the repo is private, this is the wrong pattern |
| 18 | No automated PII-leak scanning on dbt models | The BI pipeline §6.3 specifies it; not yet implemented |
| 19 | n8n's 55 MB sqlite content is black-box to ops | Could contain credentials, workflows touching customer data, etc. Audit overdue |
| 20 | maps_osrm runs without data but still consumes resources and presents attack surface | Either load Africa OSM into it or decommission |
| 21 | automatisch has 0 flows but is publicly exposed | Same as #20: either commit to using it or remove |
3.4 🟢 P3 — nice to have
| # | Issue | Detail |
|---|---|---|
| 22 | fail2ban is active but the policy isn't documented | Verify policy covers SSH + WebUI auth failures + Postgres |
| 23 | No CIS / hardening baseline scan ever run on the host | Run lynis audit system for a baseline |
| 24 | No SBOM / image vulnerability scan | Run trivy against every container image, schedule monthly |
| 25 | No automated TLS renewal monitoring | If Traefik's ACME fails silently, certs expire. Need an alert |
| 26 | No audit log for who SSHed and what they ran | OS-level audit (auditd, falco) not deployed |
4. What's already good ✅
Credit where due — not everything is broken:
- Docker network isolation is correctly used for DBs: `geo_db`, `gateway-db`, `keycloak_db`, `superset_db`, `twenty_db`, `automatisch_db`, `bio_db`, `markets_db` (internal port), `maps_redis`, `gateway-redis`, `twenty_redis`, `automatisch_redis`, `markets_redis`. None are publicly exposed
- maps_osrm and maps_prerender correctly internal-only
- `fail2ban` is active on the host (mitigates SSH brute force)
- Keycloak is deployed and ready to be wired: the auth infrastructure exists, it just isn't used
- Traefik is deployed with ACME — TLS infrastructure exists
- api-gateway-external / api-gateway-internal containers exist — the gateway pattern is set up
- Monitoring stack is comprehensive — Prometheus + Grafana + Loki + Alertmanager + multiple exporters
- dbt and the data warehouse infrastructure are sound — schema layering pattern works
- Replication from .104 to .105 is healthy — ETL won't have to touch the primary
5. Compliance — what regulators care about
For each regulation the BI pipeline must satisfy, what's the current security gap?
| Regulation | Requirement | Current gap |
|---|---|---|
| POPIA (SA) | "Reasonable technical and organisational measures to prevent unauthorised access" | Public Postgres ports (P0 #1, #2) — direct violation. Public Loki (P0 #3) — likely violation if any PII in logs |
| POPIA | "Limit access to what is necessary for the purpose" | No role-based access on warehouse (everyone reading uses dwh_user) — partial gap |
| POPIA | Notify the Information Regulator and affected data subjects "as soon as reasonably possible" after a breach (s22) | No incident response runbook (P2 #15): gap in capability to comply |
| FICA | "Records retained 5 years, securely stored" | Backups not verified restorable (P2 #14) — capability gap |
| GDPR Art. 32 | "Pseudonymisation and encryption of personal data" | Anonymisation pipeline designed but not deployed; encryption-at-rest assumed but not verified |
| GDPR Art. 25 | "Privacy by design and by default" | Public Postgres ports violate this on its face |
| GDPR Art. 30 | "Records of processing activities" | No data processing register exists |
| GDPR Art. 33-34 | "Breach notification 72 hours" | Same as POPIA — capability gap |
| PCI-DSS (if any card data) | Network segmentation, no card data in logs | Out of scope for .106 (SDPKT handles cards, on .104) but verify no card data leaks into data_warehouse via the pipeline |
Bottom line for compliance: the warehouse cannot accept production customer data until P0 items 1–3 are closed, full stop. Items 4–5 are also blockers given anti-fingerprinting expectations under "reasonable technical measures."
6. Hardening plan — phased
Sequenced so each phase is independently shippable and each phase reduces risk meaningfully.
Phase A — Lock down public ports (3-5 days)
Acceptance criteria: All 🚨 P0 issues from §3.1 closed.
Steps:
- Change PostgreSQL port mappings for `data_warehouse` and `maps_db` from `0.0.0.0:5001->5432` to `127.0.0.1:5001->5432`. External access requires SSH tunnel or VPN. In docker-compose.yml the `ports:` entry becomes `- "127.0.0.1:5001:5432"` (was `"5001:5432"`).
- Same change for `loki`, `cadvisor`, all `*-exporter` containers. These don't need any public access; Prometheus scrapes them from inside the Docker network already.
- Update Prometheus scrape configs if any used `host.docker.internal:9100` style; switch to container-name targets.
- Verify with external `nmap` from a different IP that ports 5001, 5433, 3100, 8081, 9100, 9113, 9121-9123, 9187, 9188 no longer respond.
- Document the new access path for the ops team (SSH tunnel command, Tailscale config, or bastion host).
Estimated time: 1-2 days. Risk: low (the consumers of these ports are all internal).
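As a host-side complement to the external nmap check, the PORTS column of `docker ps` can be scanned for wildcard bindings. The helper below is a minimal sketch, not existing tooling: the name `public_host_ports` is hypothetical, it assumes the column format quoted in §3.1 (`0.0.0.0:5001->5432/tcp`, plus the IPv6 wildcard forms `:::` and `[::]:`), and it ignores port ranges.

```python
import re

# One published mapping from the PORTS column of `docker ps`,
# e.g. "0.0.0.0:5001->5432/tcp" or "127.0.0.1:5001->5432/tcp".
_MAPPING = re.compile(r"(?P<host>\S+):(?P<port>\d+)->\d+/(?:tcp|udp)")

# Bind addresses that mean "every interface" (assumed spellings).
_WILDCARD_HOSTS = {"0.0.0.0", "::", "[::]"}


def public_host_ports(ports_column: str) -> list[int]:
    """Return the host ports that are published on all interfaces."""
    found = set()
    for entry in ports_column.split(","):
        match = _MAPPING.fullmatch(entry.strip())
        if match and match.group("host") in _WILDCARD_HOSTS:
            found.add(int(match.group("port")))
    return sorted(found)


if __name__ == "__main__":
    before = "0.0.0.0:5001->5432/tcp, :::5001->5432/tcp"
    after = "127.0.0.1:5001->5432/tcp"
    print(public_host_ports(before))  # wildcard binding is reported
    print(public_host_ports(after))   # loopback binding is not
```

An empty result for every container listed in §3.1 is a necessary condition for closing Phase A, but not a sufficient one: the external scan is still the authoritative check.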
Phase B — Front public APIs with Traefik + auth (5-7 days)
Acceptance criteria: All ⚠️ P1 issues from §3.2 closed. No service published raw to 0.0.0.0 except the few intentionally public.
Steps:
- Inventory all currently-public services — categorise as: (a) intentionally public (api-gateway-external, dataacuity_portal), (b) needs auth wrapper (geo_mcp, valhalla, maps_api, markets_api), (c) admin (n8n, twenty, ai_brain_webui, etc.) needs Keycloak SSO.
- Configure Traefik routes for each category, terminating TLS at Traefik. Issue Let's Encrypt certs via existing ACME setup.
- For category (b): wrap with an API-key middleware initially (faster), wire to Keycloak token validation as Phase B.5.
- For category (c): integrate Keycloak SSO via Traefik's ForwardAuth middleware (or use oauth2-proxy as the intermediate).
- Remove `0.0.0.0:` from every docker-compose for the wrapped services. Internal Docker DNS handles container-to-Traefik routing.
- Verify with `nmap` that only ports 80 (redirect), 443 (Traefik), 22 (SSH), and 8084 (api-gateway-external if intentionally separate) respond.
Estimated time: 1 week. Risk: medium (touches every public surface — needs careful change windows).
Phase C — Backups + DR drill (2-3 days)
Acceptance criteria: P2 #14 closed.
- Verify the nightly backup job at `/home/geektrading/backups/` is actually running and what it backs up
- Run a restore drill into a fresh empty container and measure RTO
- Set up off-server backup replication to either `.118` or external S3
- Document the runbook for "data_warehouse is gone, restore it"
- Add backup-freshness alert to Prometheus + Alertmanager (warn if backup older than 26 h)
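The freshness rule in the last step reduces to a single comparison. A minimal sketch of that logic follows; the function name and the way the latest snapshot time is obtained are assumptions, and only the 26 h threshold comes from the step above.

```python
from datetime import datetime, timedelta

# Threshold from the Phase C step: a daily job plus about two hours of slack.
MAX_BACKUP_AGE = timedelta(hours=26)


def backup_is_stale(last_snapshot: datetime, now: datetime,
                    max_age: timedelta = MAX_BACKUP_AGE) -> bool:
    """Return True when the newest snapshot is older than max_age."""
    return now - last_snapshot > max_age


if __name__ == "__main__":
    snapshot = datetime(2026, 5, 28, 3, 0)
    print(backup_is_stale(snapshot, datetime(2026, 5, 29, 4, 0)))  # 25 h old
    print(backup_is_stale(snapshot, datetime(2026, 5, 29, 6, 0)))  # 27 h old
```

With a nightly job, a missed run trips the check within about two hours of the expected completion time, while a run that merely finishes late does not.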
Phase D — Incident response + monitoring (3-5 days)
Acceptance criteria: P2 #15, #16 closed; P1 PII alerting from BI pipeline §11.2 in place.
- Write incident response runbook — who's paged, escalation chain, breach notification template
- Wire Alertmanager to actually page someone (currently it just collects)
- Add Prometheus rules for the alert classes in BI Pipeline §11.2 (P1 alerts page on-call, P2 alerts create tickets)
- Auth-failure alerts: Keycloak login failure spike, postgres auth failure spike, fail2ban ban-rate spike
- Anomaly alerts: warehouse query volume 3σ above baseline, disk free <20 GB, container restart loop
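For the "3σ above baseline" rule, the check compares the current value with the baseline mean plus three standard deviations. The sketch below shows the arithmetic only; the function name, the choice of population standard deviation, and the handling of a flat baseline are assumptions, and in practice the rule would more likely be expressed in Prometheus.

```python
from statistics import mean, pstdev


def exceeds_sigma(baseline: list[float], current: float,
                  sigmas: float = 3.0) -> bool:
    """True when current is more than `sigmas` standard deviations above the baseline mean."""
    if len(baseline) < 2:
        return False  # not enough history to judge
    mu = mean(baseline)
    sd = pstdev(baseline)
    if sd == 0:
        # Assumption: on a perfectly flat baseline any increase is flagged.
        return current > mu
    return current > mu + sigmas * sd


if __name__ == "__main__":
    # Illustrative queries-per-minute samples, not measured data.
    history = [100, 110, 90, 105, 95]
    print(exceeds_sigma(history, 120))
    print(exceeds_sigma(history, 125))
```

For the sample history the mean is 100 and the standard deviation is about 7.07, so the threshold sits near 121.2.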
Phase E — n8n audit + secret rotation (2-3 days)
Acceptance criteria: P2 #17, #19 closed.
- Copy n8n's sqlite out of the container, read it, document every workflow
- For workflows touching customer data, ensure they go through the BI pipeline path (not direct DB access) — refactor if needed
- Migrate `replicator` and other secrets out of `Deployment/deployment-credentials.ps1` into a secret manager
- Rotate any credential that was visible in committed files
Phase F — Hardening baseline + image scans (2 days)
Acceptance criteria: P3 #23, #24 closed.
- Run `lynis audit system` on `.106`, document findings
- Run `trivy image` against every image used, document CVEs rated HIGH and above
- Schedule both as monthly Grafana-tracked jobs
Phase G — Decommission dead services (1 day)
Acceptance criteria: P2 #20, #21 closed.
- Decide on `maps_osrm`: load Africa OSM data or decommission. It fills the same routing role as Valhalla, so it is redundant if Valhalla covers the use case
- Decide on `automatisch`: start using it or decommission
- Stop, remove, document the decision in a CHANGELOG
Total
Approximately 3-4 weeks of focused security work to get .106 to production-ready for data traffic.
7. Definition of "production-ready for data traffic"
For the BI pipeline to start carrying real customer data, ALL of these must be true:
- All P0 issues from §3.1 are closed (no public Postgres, no public infra ports)
- All P1 issues from §3.2 are closed (no unauthenticated API, no public admin UI)
- Backups verified restorable (Phase C complete)
- Alerting wired and tested (Phase D complete)
- Incident response runbook exists and the on-call rotation knows where it is
- PII-absence dbt tests are running on every `marts.*` model
- Compliance sign-off recorded (a named compliance reviewer has audited a sample and signed off)
- A documented "kill switch" — a single action that stops all data flow into the warehouse if something is wrong
Anything less is "dev / staging quality only" and the warehouse must contain only synthetic or already-anonymised data.
8. Specific service hardening notes
8.1 geo_mcp
- Add `X-API-Key` header check before any tool call, with keys issued per consumer (TagMe, Takemehome, Butler)
- Rate limit: 600 calls/min per key by default, lower for browser-origin
- Log every tool call (consumer, tool, args, latency) to Loki for audit
- Behind Traefik with TLS; container itself only listens on Docker network
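The first two points can be illustrated with a constant-time key lookup and a per-key token bucket. This is a sketch under stated assumptions, not geo_mcp's actual code: `consumer_for_key` and `TokenBucket` are hypothetical names, the key values are placeholders, and the burst size is assumed to equal the per-minute rate.

```python
import hmac
from typing import Optional


def consumer_for_key(presented: str, issued: dict[str, str]) -> Optional[str]:
    """Return the consumer whose issued key matches the presented one, else None."""
    match = None
    for consumer, key in issued.items():
        # compare_digest avoids leaking key contents through timing differences.
        if hmac.compare_digest(presented.encode(), key.encode()):
            match = consumer
    return match


class TokenBucket:
    """Per-key limiter: refills at rate_per_min, holds at most rate_per_min tokens."""

    def __init__(self, rate_per_min: float = 600.0, now: float = 0.0) -> None:
        self.rate_per_min = rate_per_min
        self.tokens = rate_per_min  # start full
        self.updated = now

    def allow(self, now: float) -> bool:
        """Consume one token if available; `now` is a timestamp in seconds."""
        elapsed = max(0.0, now - self.updated)
        refill = elapsed * self.rate_per_min / 60.0
        self.tokens = min(self.rate_per_min, self.tokens + refill)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    keys = {"TagMe": "placeholder-tagme", "Butler": "placeholder-butler"}
    print(consumer_for_key("placeholder-butler", keys))
    bucket = TokenBucket(rate_per_min=2)
    print([bucket.allow(0.0) for _ in range(3)], bucket.allow(30.0))
```

Passing the timestamp in explicitly keeps the limiter testable. A real middleware would hold one bucket per consumer and pass `time.monotonic()`.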
8.2 valhalla
- Same API-key gate
- Disable `/expansion`, `/trace_attributes`, `/height` if not needed (smaller attack surface)
- Cache aggressively (24h on routes, since they don't change)
8.3 maps_api
- Already designed for it (slowapi rate limiter present) — verify the limit is sensible (currently 60/min/IP)
- API-key auth for `/api/v2/*` (the new BI-relevant proxies)
- Disable `/docs` in production unless authenticated
8.4 data_warehouse
- Move to internal-only port
- Role separation: `etl_user` (write to `raw.*` only), `dwh_user` (read all, write `staging`/`intermediate`/`marts`/`analytics`), `analytics_user` (read `analytics.*` only), `compliance_user` (read everything + access to `compliance.token_vault`)
- Connection limit per role; warn if any role hits 80% of limit
- pg_stat_statements enabled for query analytics
- Audit logging enabled (or at minimum log all `SELECT * FROM intermediate.*` access)
8.5 keycloak
- Master realm: admin password rotated and stored in secret manager
- Per-app realm (or client) for: BigBruh!, n8n, Superset, Grafana, twenty_crm, dataacuity_portal, ai_brain_webui
- Backup the realm exports nightly
8.6 superset
- Behind Keycloak SSO via OAuth/OIDC
- Row-level security on the warehouse connections so analysts only see appropriate marts
- Disable `Public` role's access to any dataset
8.7 n8n
- After Phase E audit, behind Keycloak
- Encrypt the sqlite at rest (filesystem-level encryption on the volume)
- Webhooks (the public attack surface) get their own API-key validation
9. Trusted access patterns (for ops + developers)
After hardening, how does a developer / ops engineer access .106 services?
| Need | Path |
|---|---|
| Web UI access (Superset, Grafana, n8n, dataacuity_portal) | https:// |
| API call from TGN app | https://maps.dataacuity.co.za/api/v2/... with X-API-Key header → Traefik → maps_api → backend |
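| Direct Postgres access (DBA, analyst) | SSH tunnel: ssh -L 5001:127.0.0.1:5001 geektrading@.106 → connect locally to localhost:5001 (the forward target is resolved on .106, where the port is bound to 127.0.0.1; the Docker DNS name data_warehouse does not resolve from the host) |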
| Direct container shell (debugging) | SSH to .106 → docker exec -it <container> bash (requires being in the docker group, which is locked to named users) |
| Read logs (Loki) | Grafana UI → Loki data source → LogQL queries. No direct Loki port access |
| Prometheus metrics | Grafana UI → Prometheus data source. No direct port access |
| Backup restore (DR drill) | Pull backup from off-server location → load into a fresh container per the documented runbook |
10. Open questions — with findings and recommendations
Repo-wide search done 2026-05-28. For each question, what exists today and the recommendation:
10.1 Compliance reviewer / DPO — NOT ASSIGNED
Finding: No named DPO or compliance reviewer anywhere in the repo. Compliance rules are documented (.claude-memory/banking-compliance-rules.md, AppInfo/TrustSeal/TRUSTSEAL_IMPLEMENTATION_PLAN.md) but no person owns sign-off.
Why it matters: Anonymisation Standard §11 requires a named reviewer to sign off on every intermediate.* model before it ships. Without this role, the BI pipeline cannot legally start carrying production PII.
Recommendation: This is a hiring/appointment decision, not a technical one. Three options:
- (a) Assign internally — likely the most senior backend/data lead with compliance training; lowest cost, real ongoing time commitment (~2 h / week)
- (b) Contract external counsel — POPIA-specialist law firm in SA (Webber Wentzel, ENS, Bowmans all have practices). Higher cost, lower internal burden, more credible to regulators
- (c) Hire dedicated DPO — only justifiable at scale; GDPR mandates this once you process EU PII at meaningful volume
Action: Pick (a) or (b) before BI Pipeline Phase 2 starts.
10.2 Incident response process — DOES NOT EXIST
Finding: No IR runbooks, no on-call rotation, no escalation chain, no PagerDuty / Opsgenie / similar. The only related artifacts are .claude-memory/security-audit-critical.md (which documents past P0 findings but no response procedure) and the "ONE connection attempt, then ask" rule in CLAUDE.md (a safeguard, not an IR plan).
Why it matters: POPIA Sec 22 requires breach notification "as soon as reasonably possible", and GDPR Art 33 sets a 72-h deadline. We can't comply with either without a defined process.
Recommendation: Build it. Doesn't need to be elaborate to start:
- Pager: PagerDuty free tier (5 users free) OR Opsgenie free tier (5 users free) OR self-hosted Uptime Kuma, which we could deploy on .106 cheaply
- Runbook: One markdown file covering: detection paths, severity classes, escalation chain (named humans with phone numbers), breach-notification template, post-incident review template
- Rotation: Even a 1-person "always on-call" is better than nothing; expand to 2-3 once team grows
- Tabletop exercise: Quarterly — pick a scenario, walk through the runbook, find gaps
Action: Build during Phase D. Estimate 1 week to ship v1.
10.3 Cloudflare in front of Traefik — PARTIAL USAGE TODAY
Finding: Cloudflare R2 (object storage) is the only Cloudflare service in use (AppInfo/Infrastructure/R2_QUICK_REFERENCE.md). No DNS, no CDN, no WAF.
Why it matters: DDoS mitigation is hard without a CDN. Traefik + fail2ban can handle slow/medium attacks but not real volumetric ones. WAF rules block common attack patterns (SQLi, XSS, path traversal) before they reach our services.
Recommendation: Yes, add Cloudflare in front of public TGN endpoints. Specifics:
- Free tier covers most needs: DNS, basic DDoS (Layer 3/4), free SSL, basic WAF rules
- Pro tier ($25/mo per zone) adds: image optimisation, advanced rate-limiting, WAF managed rules
- Business tier ($200/mo per zone) adds: 100% uptime SLA, advanced WAF, bypass-cache rules
- Most pragmatic: Free tier for most domains, Pro for `maps.dataacuity.co.za` once it carries paid traffic
- TLS termination: Cloudflare terminates client TLS and re-encrypts to Traefik (Full (strict) mode). Traefik's ACME certificates stay in place as the origin certificates that Full (strict) validates
Action: Onboard .106 services to Cloudflare during Phase B (Traefik wiring). Don't try to do it before Traefik is wired — Cloudflare-in-front of raw exposed ports is worse than current state.
10.4 Secret manager choice — NOTHING CENTRALISED TODAY
Finding: Secrets live in Deployment/deployment-credentials.ps1 (plaintext, committed-but-meant-to-not-be) and GitHub Actions secrets (~50 of them). No Vault, no AWS Secrets Manager, no Azure Key Vault, no Doppler.
Why it matters: The credentials file is in the repo (committed) — even with .gitignore warning, you can't unship that horse. Rotation is manual. Audit trail is git log only. This is a serious gap when banking compliance is in scope.
Recommendation: Pick one and consolidate. My ranking for this situation:
- HashiCorp Vault (self-hosted on `.118`): best feature set, full audit, transit encryption, dynamic credentials. Cost: ops time to run it. Steep learning curve.
- Doppler (SaaS, free tier for small teams): fastest to adopt, good DX, native Docker / CLI integration. Cost: $0–10/user/mo. Outsources your secrets to a third party.
- AWS Secrets Manager / Azure Key Vault: only if you already use that cloud for other things; otherwise adds operational surface
- Bitwarden Secrets Manager: newer, free self-hosted Vaultwarden + paid SM tier. Worth watching but young
For TGN's current scale and SA jurisdiction, my pick is Vault self-hosted on .118 — keeps secrets in-country, full control, no SaaS lock-in. Doppler is the second-best if simplicity matters more.
Action: Decide + start migration during Phase E (n8n audit + secret rotation are paired in the plan).
10.5 SOC2 readiness — NOT IN ACTIVE PROGRAM
Finding: Listed as a Phase 6 future item in AppInfo/TrustSeal/TRUSTSEAL_IMPLEMENTATION_PLAN.md with $25K budgeted. No active controls inventory, no auditor engagement, no timeline.
Why it matters: SOC2 is enterprise-customer table-stakes if TGN wants to sell DataAcuity / GeoGlobal / BI services to large companies. POPIA + GDPR alone are sufficient for B2C operations but limit B2B sales.
Recommendation: Don't pursue SOC2 now. Reasons:
- The hardening work in this doc (Phases A-G) addresses ~70% of SOC2 Type II controls organically — defer formal audit until those land
- SOC2 audit takes 6-12 months and ~$25K. Should be timed for when a specific big customer needs it
- Trying to "build for SOC2" prematurely tends to over-engineer for hypothetical needs
If/when a deal demands it: revisit. Until then, the work we're doing aligns with future SOC2 readiness without paying the audit tax.
10.6 24/7 on-call — DOES NOT EXIST
Finding: No rotation schedule, no paging integration, no shift docs. De-facto policy is "best effort during business hours."
Why it matters: If the BI pipeline goes down at 2am and the morning's analytics are stale, that's mildly bad. If geo_db is breached at 2am and we don't notice until 9am, the breach-notification clock (72 h under GDPR Art 33) has already burned 7 hours.
Recommendation: Tier it. Full 24/7 with paid shifts is overkill for current scale. But:
- Critical alerts only at night — page only on: confirmed PII breach, all-services-down, payment gateway outage. Everything else waits for morning.
- Single-person rotation with PagerDuty/Opsgenie scheduling
- Define what's "critical" so the page only fires for things that truly can't wait
- Document a 30-min response SLA for critical pages; everything else is best-effort
- Quarterly review of pages fired — too many false positives, tune the rules; too few, expand the criteria
Action: Set up during Phase D, paired with the incident response work in §10.2.
Summary of recommended decisions
| # | Decision needed | My recommendation | Block on which phase |
|---|---|---|---|
| 10.1 | Compliance reviewer | Assign internal (a) for v1, retain external counsel (b) for audit | BI Pipeline Phase 2 |
| 10.2 | Incident response | Build it (PagerDuty/Opsgenie free + 1 markdown runbook) | Security Phase D |
| 10.3 | Cloudflare | Yes, free tier; Pro for revenue-bearing domains | Security Phase B |
| 10.4 | Secret manager | HashiCorp Vault self-hosted on .118 | Security Phase E |
| 10.5 | SOC2 | Defer; revisit when a customer requires it | none — not blocking |
| 10.6 | 24/7 on-call | Tiered: critical-only at night, single-person rotation | Security Phase D |
These need a thumbs-up before the corresponding phase ships.
11. Cross-references
- `DataAcuity_BI_Pipeline.md` §6: the anonymisation framework that this security posture supports
- `DataAcuity_Architecture_Overview.md` §5: the public exposure summary this doc expands on
- `GeoGlobal_Deployment.md` §11: service-specific hardening for geo_mcp/valhalla
- `Deployment/deployment-credentials.ps1`: the credentials file that's part of P2 #17
- `.claude-memory/banking-compliance-rules.md`: the SARB/FICA/POPIA rules informing the compliance map in §5
- `.claude-memory/deploy-pattern-pgbouncer-cascade.md`: connection discipline that informs Phase A and B
12. Inspection findings after Phase A — what changed (2026-05-28 PM)
Post-Phase-A inspection of the wider stack turned up a few items that the original audit missed. They affect the upcoming phases.
12.1 Traefik is not running
- Compose file at `/home/geektrading/suite/traefik/docker-compose.yml` ✅ exists
- ACME cert data at `/home/geektrading/suite/traefik/acme/acme.json` (140 KB) ✅ exists
- Dynamic config at `/home/geektrading/suite/traefik/config/{middlewares,services,tls}.yml` ✅ exists with routes for `dataacuity.co.za`, `traefik.dataacuity.co.za` etc.
- Container itself: not running. `docker ps -a --filter name=traefik` returns empty
- Image: not present locally. `docker images | grep traefik` returns empty
Implication for Phase B: First task is docker pull traefik:v3.0 then docker compose -f /home/geektrading/suite/traefik/docker-compose.yml up -d. Once the cert refreshes (or proves valid from acme.json), then add the API routes.
12.2 DNS is wired correctly for Traefik
Verified dig +short from .106:
| Hostname | Resolves to |
|---|---|
| dataacuity.co.za | 197.97.200.106 (direct A record) |
| maps.dataacuity.co.za | CNAME → dataacuity.co.za → .106 |
| auth.dataacuity.co.za | CNAME → dataacuity.co.za → .106 |
| traefik.dataacuity.co.za | CNAME → dataacuity.co.za → .106 |
ACME HTTP-01 challenge will work once Traefik is up.
12.3 Restic backups verified healthy
/home/geektrading/backups/restic-repo inspection:
- 358 snapshots total, daily cadence verified
- Most recent: `2026-05-28 03:00` (this morning's backup)
- `restic check` returns "no errors were found"
- Backup script (`/home/geektrading/backups/scripts/backup-databases.sh`) targets only `markets_db` + `data_warehouse`; missing `geo_db`, `maps_db`, `keycloak_db`, `superset_db`, `gateway-db`, `twenty_db`, `automatisch_db`, `bio_db`
- Snapshots are LOCAL to `.106`; off-server replication is still P2 #14
Implication for Phase C: Less work than expected; the restic infrastructure is solid. Two real gaps:
- Expand `backup-databases.sh` to cover all warehouse-affecting DBs (especially `geo_db`)
- Set up off-server replication of `restic-repo` to `.118` or to an external S3-compatible target
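To keep the coverage gap from reopening as databases are added, the script's targets could be compared against an expected list. A minimal sketch: the database names are the ones listed in the findings above, while the function name and the idea of maintaining an explicit expected set are assumptions.

```python
# Databases the backup script targets today, per the inspection above.
COVERED = {"markets_db", "data_warehouse"}

# Expected set: what is covered plus the databases reported as missing.
EXPECTED = COVERED | {
    "geo_db", "maps_db", "keycloak_db", "superset_db",
    "gateway-db", "twenty_db", "automatisch_db", "bio_db",
}


def uncovered(expected: set[str], covered: set[str]) -> list[str]:
    """Return databases with no backup target, sorted for stable output."""
    return sorted(expected - covered)


if __name__ == "__main__":
    for name in uncovered(EXPECTED, COVERED):
        print(f"not backed up: {name}")
```

An empty result would be one reasonable acceptance check for the first gap.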
12.4 .105 → .106 PostgreSQL connectivity blocked
docker run --network data-warehouse_data_stack postgres:15-alpine psql -h 197.97.200.105 ... times out — no successful connection in 60 seconds.
Likely causes (probable order):
- Windows Firewall on `.105` blocks inbound TCP 5432 from `.106`'s IP
- Postgres on `.105` (Windows IIS-hosted setup) is bound to localhost / specific IPs only
- Network routing between Windows servers (`.104`/`.105`) and the Ubuntu DataAcuity server (`.106`) requires a specific path
Implication for BI Pipeline Phase 1: The extract framework cannot run until this is resolved. Needs a workstream on the .104/.105 side: DBA opens a firewall rule + pg_hba.conf entry for .106's IP, using the existing replicator user from Deployment/deployment-credentials.ps1.
12.5 dbt warehouse is empty scaffolding
Real row counts in data_warehouse.datawarehouse (verified 2026-05-28):
| Schema | Tables | Real data? |
|---|---|---|
| bronze, silver, gold (medallion) | 0 tables | Empty |
| dbt_dev_marts | 2 tables, 3 rows each | Build-verification only |
| dbt_dev_staging | empty | - |
| tgn | 13 monthly event partitions, all 0 rows except tgn.events_2025_12 (4 rows) | Essentially empty |
| public | dbt metadata only | - |
The dbt models listed in DataAcuity_BI_Pipeline.md §8.4 as "already running" exist as SQL but have never produced real output. The pipeline is greenfield from a data-flow perspective. BI Pipeline doc has been corrected.
12.6 Action items added to the hardening plan
- Phase B: pull + start Traefik before adding routes (add to §6 Phase B steps)
- Phase C: expand backup script to cover all DBs; off-server replication (add to §6 Phase C steps)
- BI Phase 0: get DBA to open `.105` firewall + `pg_hba.conf` for `.106` (add as pre-req)
- DataAcuity_BI_Pipeline.md: corrected to reflect empty-warehouse reality (done 2026-05-28)
13. Change log
| Date | Change | By |
|---|---|---|
| 2026-05-28 (am) | Initial document — audit findings + hardening plan | Tinashe Bhengu |
| 2026-05-28 (pm) | Phase A executed; §10 open questions investigated with recommendations | Tinashe Bhengu |
| 2026-05-28 (pm) | Added §12 inspection findings: Traefik state, restic health, .105 gap, warehouse reality | Tinashe Bhengu |