DataAcuity — BI Data Pipeline

Owner: Tinashe Bhengu Status: 🔵 DESIGN — pipeline must be built from scratch. The existing dbt project is scaffolding, not a live ETL. Audience: anyone who'll touch the data layer — backend engineers, DBAs, dbt authors, BI dashboard builders, compliance reviewers

The single source of truth for how raw app data becomes anonymised, enriched, business-intelligence-ready data that the TGN apps and BigBruh! can consume safely.

1. Executive summary

The Geek Network has 36 APIs and ~20 apps generating user events, transactions, locations, messages, bids, check-ins, etc. on .104 (primary) with hot replication to .105. Today this data is siloed per app — usable for direct app features but not for cross-app analytics, customer-360 views, ML feature stores, or anonymised partner analytics.

The DataAcuity stack on .106 has the infrastructure for a BI pipeline: a Postgres data warehouse (data_warehouse container), a dbt project with TGN-tagged models, Superset for dashboards, Grafana/Loki/Prometheus for ops telemetry, n8n for workflow orchestration, and restic-based backups running daily. What is not yet built is the actual data flow — the warehouse contains only scaffolding (see §1.1), no nightly ETL is running, and the path from .105 → .106 PostgreSQL traffic appears to be blocked (see §1.2). What's missing is the disciplined pipeline that gets every app's data in, strips PII, enriches it with geographic / temporal / FX / cross-app context, and pushes business-safe aggregates back to .104 so the apps can consume their own intelligence at low latency.

This document specifies that pipeline.

1.1 Honest state of the warehouse (verified 2026-05-28)

data_warehouse.datawarehouse on .106:
  schemas:        bronze, silver, gold (medallion architecture, EMPTY)
                  dbt_dev_marts (3 rows in each of 2 marts — scaffolding)
                  dbt_dev_staging (empty)
                  tgn (4 rows in events_2025_12; 11 future monthly partitions all 0 rows)
                  public

  dbt models with real data:
    dbt_dev_marts.fct_tgn_daily_metrics   — 3 rows
    dbt_dev_marts.fct_tgn_source_summary  — 3 rows

  dbt models that exist in code but haven't published anywhere:
    marts/tagme/fct_tagme_checkins.sql        (in dbt_project — never run successfully)
    marts/tagme/fct_tagme_tags.sql            (same)
    marts/tagme/fct_map_ad_impressions.sql    (same)
    staging/tgn/stg_tgn__events.sql           (same)
    staging/tgn/stg_tgn__ingest_log.sql       (same)

The dbt project is named "DataAcuity DBT Project" and was authored for bronze/silver/gold schemas. The few dbt_dev_* rows that exist look like build verification, not real ETL output. Treat the warehouse as greenfield for this pipeline's purposes.

1.2 Honest state of the .105 → .106 connectivity

Verified 2026-05-28: a psql -h 197.97.200.105 -U replicator -d postgres from a container on .106's data-warehouse_data_stack network times out. Likely causes:

Windows Firewall on .105 blocks PostgreSQL inbound from .106 (most likely — TGN ops convention is that .106 is a separate stack)
PostgreSQL on .105 is bound to localhost only
Network routing between data-centre subnets requires a specific path

This is BI Phase 1's first concrete blocker. Resolving it is a separate workstream: DBA on .104/.105 opens a firewall rule + pg_hba.conf entry for .106's IP using the existing replicator user (already documented in Deployment/deployment-credentials.ps1).

2. Why we need this

Problem today	What the pipeline solves
Each app's analytics are siloed in its own DB	Unified warehouse with cross-app fact / dim tables
Raw PII sits in app DBs and would leak through any analytics surface	Anonymisation layer guarantees PII never reaches BI consumers
No FX-normalised view of revenue across SDPKT / BidBaas / WhatWeWant / Takemehome	Enrichment layer attaches FX, time, geo context to every event
BigBruh! is stuck on "live counters" because it has no warehouse to query	BigBruh! Analytics tier reads `marts.*` and embeds Superset charts
Customer-360, churn prediction, fraud detection are impossible without a unified user view	`marts.dim_user_360` joins anonymised user activity across all apps
ML training data extraction is ad-hoc and breaks compliance	A documented, audited path to ML feature stores with anonymity guarantees
Apps can't show users their own activity history without expensive in-app aggregation	Per-app analytics pushed back to `.104` lets apps render their own dashboards locally

3. Architecture overview

┌─────────────────────────────────────────────────────────────┐
│  .104 PRIMARY (read-write)                                  │
│  ───────────────────────                                    │
│  All 36 API postgres DBs (auth_db, ledger_db, tagme_db,     │
│  bidbaas_db, sdpkt_db, txtme_db, sleptondb, etc.)           │
│                                                              │
│  + analytics_db  (NEW — pushed-back marts, see §9)          │
└─────────────────────────┬───────────────────────────────────┘
                          │ streaming replication
                          ▼
┌─────────────────────────────────────────────────────────────┐
│  .105 STANDBY (read-only replica)                           │
│  ───────────────────────────                                │
│  Same 36 DBs, near-real-time. THIS is where ETL reads from. │
└─────────────────────────┬───────────────────────────────────┘
                          │ nightly + selective near-real-time
                          ▼
┌─────────────────────────────────────────────────────────────┐
│  .106 DATA WAREHOUSE STACK                                  │
│  ──────────────────────────                                 │
│                                                              │
│  ┌─────────────────────┐   ┌───────────────────────────┐    │
│  │ data_warehouse      │   │ geo_db                    │    │
│  │ (postgres :5001)    │◄──┤ 13.4M places + 3.5M POIs  │    │
│  │                     │   │ (enrichment source)       │    │
│  │ Schemas:            │   └───────────────────────────┘    │
│  │  raw.*       <─── extracted from .105                    │
│  │  staging.*   <─── lightly typed + deduped                │
│  │  intermediate.* <── anonymised + standardised            │
│  │  marts.*     <─── business-ready fact/dim tables         │
│  │  analytics.* <─── denormalised views for dashboards      │
│  │  ml.*        <─── feature stores                         │
│  └─────────────────────┘                                    │
│                              ┌────────────────────────┐     │
│  ┌─────────────────────┐    │ valhalla + geo_mcp     │     │
│  │ dbt_transform       │───▶│ (enrichment service)   │     │
│  │ (transformations)   │    └────────────────────────┘     │
│  └─────────────────────┘                                    │
│                                                              │
│  ┌─────────────────────┐   ┌───────────────────────────┐    │
│  │ Superset :5003      │   │ Grafana / Prom / Loki     │    │
│  │ (BI dashboards)     │   │ (ops telemetry)           │    │
│  └─────────────────────┘   └───────────────────────────┘    │
└─────────────────────────┬───────────────────────────────────┘
                          │ aggregated, anonymised marts
                          ▼
┌─────────────────────────────────────────────────────────────┐
│  .104 analytics_db  (push-back)                             │
│  ─────────────────                                           │
│  Per-app marts every app's API can read locally for:        │
│    - "your activity this week" surfaces                     │
│    - cohort comparisons                                     │
│    - in-app dashboards                                      │
│  ZERO raw PII. Aggregates only. K-anonymised (k≥5).         │
└─────────────────────────────────────────────────────────────┘

4. The data flow — step by step

Every record goes through six stages. Each stage is idempotent and resumable.

Capture — app writes to its own DB on .104. Already happening; no change.
Replicate — streaming replication to .105. Already happening; no change. We never read from .104 for ETL — that would risk the primary under load.
Extract — scheduled job pulls from .105 into data_warehouse.raw.* on .106. Incremental where possible (using updated_at / event_time watermarks), full snapshot for small dimension tables.
Standardise + dedup — dbt models transform raw.* → staging.*. Types coerced, encodings normalised, soft-deleted rows filtered.
Anonymise — dbt models transform staging.* → intermediate.*. PII fields hashed, tokenised, generalised, or suppressed per the rules in §6.
Enrich — dbt models transform intermediate.* → marts.*. Geo / time / FX / user-segment / cross-app context attached per §7.
Serve — marts.* are read by Superset for dashboards, by ML training jobs for feature extraction, and by the push-back job that writes selected aggregates into .104 analytics_db for app consumption.

Latency expectations:

Stage	Latency target
Replicate (.104 → .105)	< 5 s (streaming)
Extract (.105 → raw.*)	T+15 min (event tables, incremental); T+24 h (dimension snapshots)
Standardise + dedup	T+5 min after extract
Anonymise	T+5 min after standardise
Enrich	T+15 min after anonymise (geo lookups dominate)
Push-back to .104	T+30 min after marts ready

End-to-end: ~1 hour from event capture to anonymised, enriched availability in .104 analytics_db. Good enough for "this week", "today", "right now" surfaces in app UIs.

For true real-time (sub-minute) analytics — fraud flagging, live ops counters — keep using BigBruh!'s direct read-replica queries against .105. The pipeline is for intelligence, not live ops.

5. Source registry

Every TGN data source that flows through the pipeline. Source-of-truth list — update this file when a new API ships.

Source DB (on .104/.105)	Owning API	Key tables	Extract mode	Frequency
`auth_db`	AuthAPI	`users`, `sessions`, `kyc_records`, `phone_pins`	Incremental on `updated_at`	15 min
`ledger_db`	LedgerAPI	`transactions`, `accounts`, `holds`, `reversals`	Incremental on `created_at`	5 min (event-driven via WAL eventually)
`messaging_db`	MessagingAPI / TxTMe!	`conversations`, `messages` (CONTENT NEVER LEAVES — see §6.5), `participants`	Incremental + metadata-only	1 h
`apps_db`	shared TGN	`app_events`, `user_app_sessions`	Incremental	15 min
`tagme_db`	TagMeAPI	`check_ins`, `points_of_interest`, `tags`, `user_locations` (rounded), `events`	Incremental	1 h
`kiffstore_db`	KiffStoreAPI	`orders`, `products`, `inventory_events`	Incremental	1 h
`bidbaas_db`	BidBaasAPI	`auctions`, `bids`, `winners`, `payouts`	Incremental	5 min (revenue-critical)
`jobcenter_db`	TheJobCenterAPI	`job_postings`, `applications`, `placements`	Incremental	1 h
`sdpkt_db`	SdpktAPI	`wallets`, `payments`, `holds`, `merchant_settlements`	Incremental	5 min
`payfast_db`	PayfastAPI	`payment_attempts`, `itn_callbacks`	Incremental	15 min
`glocell_db`	GlocellAPI	`airtime_purchases`, `data_purchases`	Incremental	1 h
`media_db`	MediaAPI	`uploads`, `views`, `streams` (metadata only)	Incremental	1 h
`slepton_db`	SleptOnAPI	`app_installs`, `app_updates`, `developer_payouts`	Incremental	1 h
`notification_db`	NotificationAPI	`sends`, `deliveries`, `opens`	Incremental	15 min
`incentives_db`	IncentivesAPI	`rewards`, `claims`, `referrals`	Incremental	1 h
`wolverine_db`	WolverineAPI	`incidents`, `fixes`, `deploys`	Incremental	1 h
`takemehome_db`	TakemehomeAPI	`searches`, `clicks`, `bookings_attributed`	Incremental	1 h
`whatwewant_db`	WhatWeWantAPI	`lists`, `items`, `claims`, `purchases`	Incremental	1 h
(plus the remaining ~18 APIs as they go live)

Anything not listed here is not in the warehouse. Adding a source is a PR that updates this table + a new dbt source + the extract job's config.

6. Anonymisation framework

The single rule: marts.* and downstream must be safe to share with any internal team and exportable to vetted external partners without further sanitisation. The anonymisation step is the gate.

6.1 PII inventory

Every field across TGN apps falls in one of these buckets:

Bucket	Examples	Strategy
Direct identifiers	name, email, phone, SA ID, passport, IBAN, card PAN	Hash (deterministic) OR tokenise (reversible by compliance)
Quasi-identifiers	DOB, postcode, employer, IP	Generalise (DOB → 5-yr band, postcode → suburb, IP → city)
Location	GPS coordinates, addresses	Round to ~1 km grid for user-level analytics; full precision only for POI-level (which has no individual user attached)
Biometric	face embeddings, fingerprints, voice	Suppress — never extract from source
Free-text content	message bodies (TxTMe!), notes, descriptions	Suppress for marketing analytics; allowed only for explicit moderation/safety pipelines under separate access controls
Financial sensitive	account balances, tx amounts	Allowed, but joined to anonymised user_hash only
Behavioural	timestamps, click counts, route choices	Allowed (low PII risk in aggregate)
Device	device_id, IDFA, push tokens	Hash (deterministic)

6.2 Deterministic hashing standard

For any direct identifier we need to join across sources (e.g., same user_id in auth_db and tagme_db.check_ins):

-- Salt is per-source-type, stored in the compliance vault, rotated annually
user_hash = encode(digest(salt_users || ':' || user_id::text, 'sha256'), 'hex')

Properties:

Deterministic: same input → same hash, so joins still work
One-way: can't reverse without the salt
Salted: rainbow-table protection
Rotatable: annual salt rotation invalidates historical joins, which limits backward inference (a feature, not a bug — it caps how long an "anonymised" dataset stays linkable)

Hash length: full 64-char SHA-256 hex in intermediate.*, optionally truncated to 16 chars in marts.* for compactness (still 2^64 collision space, ample for our user count).

6.3 Tokenisation (for fields that must be reversible)

Some fields legal/compliance need to recover under court order (e.g., SA ID for a fraud investigation). These get tokenised instead of hashed:

sa_id_token = nextval('compliance.token_seq')
-- The mapping {sa_id → token} is stored in compliance.token_vault, encrypted at rest,
-- accessible only to two named compliance officers via audited break-glass procedure.

token_vault lives in its own DB on .118 with its own backup schedule. Tokens NEVER appear in marts.* — only in intermediate.* where access is locked down.

6.4 Generalisation rules

Source field	In `marts.*`
`date_of_birth`	`age_band` (`<18`, `18-24`, `25-34`, `35-44`, `45-54`, `55-64`, `65+`)
`address`	`city`, `country` (no street, no postcode finer than 4-digit)
`gps_lat`, `gps_lng`	rounded to 0.01° (~1.1 km grid) for user-level; full precision only for POI-level
`ip_address`	enriched to `city`, `country`, `isp_class` — IP itself dropped
`device_id`	hashed; `device_class` (Android-recent, Android-old, iOS-recent, iOS-old, Web) kept
`exact_timestamp`	full precision allowed (timestamps aren't very identifying in aggregate); but in low-volume marts, round to hour

6.5 Message content — special case

TxTMe! / MessagingAPI message bodies are SUPPRESSED end-to-end. The pipeline extracts only:

conversation_id (hashed)
participant_count
message_count_per_day
has_media (boolean)
detected_language (top-level only)

Never the body. Never media payloads. Never subject lines.

Moderation/safety pipelines that need content access do NOT go through this BI pipeline — they have a separate, more tightly governed path with explicit user consent recorded.

6.6 K-anonymity guarantee

Every marts.* table has a documented k-value:

k ≥ 5 is the default minimum for any group that could enable re-identification
Marts that aggregate to (city, age_band, day) check that every (city, age_band, day) bucket has ≥5 users; otherwise the row is suppressed or generalised further (e.g., week-level)
dbt tests (dbt_utils.expression_is_true) enforce this — a failing test blocks the model from publishing

6.7 Compliance map

Regulation	Region	Pipeline guarantee
POPIA	South Africa	Consent recorded at signup; right-to-erasure handled (§6.8); cross-border transfer logged; PI minimisation applied (only fields needed for the mart purpose flow through)
GDPR	EU users	Same as POPIA + Article 17 right-to-be-forgotten propagates within 30 days; lawful basis = legitimate interest for ops analytics, consent for marketing
FICA	SA banking	Transaction data kept linkable via compliance.token_vault for 5 years; marts use hashed identifiers only
CCPA	California	Right-to-know + right-to-delete same mechanism as GDPR

6.8 Right-to-be-forgotten

When a user requests deletion (or compliance triggers it):

AuthAPI marks users.deleted_at on .104
Next extract picks up the soft-delete flag
dbt model intermediate.deletions_pending collects all hashes belonging to that user across sources
A retention job (nightly) deletes those rows from raw.* and staging.* and re-runs dependent dbt models
compliance.token_vault entry is purged
The push-back to .104 analytics_db re-runs and drops the user from any aggregate where they're now sole member

SLA: 30 days from request to last trace removed.

7. Enrichment framework

Anonymised data becomes valuable when it's joined to context the apps couldn't easily produce themselves. Five enrichment families.

7.1 Geographic enrichment (via GeoGlobal)

For any row with (lat, lng) or address_text:

Reverse geocode via geo_mcp.reverse_geocode(lat, lng) → attach city, country, admin1, admin2
POI context via geo_mcp.interesting_nearby(lat, lng, within_km=2) → attach nearest_poi_category, nearest_poi_score
Distance to nearest population centre for "is this user in a rural or urban area"
Cached aggressively — see GeoGlobal_API_Reference.md for TTLs

This is why GeoGlobal matters here, not just for app features.

7.2 Temporal enrichment

For any timestamp:

day_of_week (Mon-Sun)
hour_of_day (local time at user's last known city)
is_weekend, is_public_holiday (per country, using dim_holidays table seeded from external source)
is_payday_week (heuristic: last week of month + 5 days)
season (Southern Hemisphere ≠ Northern)

7.3 FX enrichment

For any monetary amount in ledger_db, sdpkt_db, bidbaas_db, takemehome_db, whatwewant_db:

amount_zar — already native for SA transactions
amount_usd — converted using the FX rate at transaction time
amount_btc_sats — for crypto-comparable views

FX rates come from a dim_fx_rates_daily table. Currently the Markets API on .106 is broken (verified 2026-05-28), so as a stopgap we ingest daily ZAR/USD/EUR/GBP/BTC from a free source (e.g., exchangerate.host) into the warehouse directly. Once Markets is repaired, switch the source.

7.4 User segment enrichment

For each user_hash:

kyc_tier — no-kyc / basic / verified (joins to auth_db.kyc_records)
lifecycle_stage — new (<7d), active (7d-90d), engaged (>90d active in last 30d), dormant (no activity 30d), churned (no activity 90d)
apps_used_30d — array of app IDs the user touched
revenue_band_zar_30d — none, low, mid, high (percentile-based, recomputed weekly)
mesh_node_class — non-relay / consumer / contributor / heavy-relay (CircleAether activity)

7.5 Cross-app enrichment

The killer feature — answers "this user did X in TagMe then Y in BidBaas":

cross_app_session_id — derived from a sliding window of timestamps. If a user has events in app A and app B within 30 min, same session.
attribution_chain — for revenue events, the prior 24h of touchpoints across all apps (powers attribution models)
funnel_position — for known multi-step funnels (e.g., TagMe-checkin → BidBaas-bid → SDPKT-pay), which step the user is on

8. Warehouse schema layers

Six schemas inside data_warehouse.datawarehouse. Strict layering — a model in layer N may only depend on layers ≤ N.

0. raw.*           ─ identical to source. No transforms. PII present.
1. staging.*       ─ types + dedup + soft-delete filter. PII still present.
2. intermediate.*  ─ ANONYMISED. PII gone (hashed/tokenised/generalised/suppressed).
3. marts.*         ─ ENRICHED. Business-ready fact/dim. Safe to share internally.
4. analytics.*     ─ Denormalised views over marts, optimised for Superset dashboards.
5. ml.*            ─ Feature stores. Time-aligned, leakage-checked.

8.1 Layer 0 — `raw.*`

One table per source table, named raw.{source_db}__{table} (e.g. raw.tagme_db__check_ins)
Materialised as Postgres tables (not views) because the source isn't always reachable
Loaded by the extract job (§10), never edited by hand
Retention: 90 days. After that, only staging.* and downstream remain (which by then are PII-free).
Access: locked to the ETL service account + two named DBAs. NOT readable by dbt jobs running as dwh_user.

8.2 Layer 1 — `staging.*`

One model per raw.* table, named staging.stg_{source}__{table} (matches existing stg_tgn__events convention)
Type coercion, encoding normalisation, soft-delete filter, dedup
Materialised as views (cheap, always fresh)
PII still present — this layer is access-controlled

8.3 Layer 2 — `intermediate.*`

This is where the magic — and the compliance line — sits.
Named intermediate.int_{purpose} (e.g. intermediate.int_user_360_anon)
Materialised as tables
Implements all rules in §6
Has dbt tests for: every column in §6.1's "Direct identifiers" bucket is verified absent (regex test on each column name + content check)
Sign-off: every new intermediate.* model requires PR review by compliance reviewer

8.4 Layer 3 — `marts.*`

The deliverable layer
Named marts.fct_* for facts, marts.dim_* for dimensions, marts.agg_* for pre-aggregated summaries
Existing models (already running):
- marts.fct_tgn_daily_metrics
- marts.fct_tgn_source_summary
- marts.fct_tagme_tags
- marts.fct_tagme_checkins
- marts.fct_map_ad_impressions
New models needed for full coverage:
- marts.dim_user_360_anon — anonymised user × 30-day rollup across all apps
- marts.fct_ledger_transactions_anon — every ledger event, hashed users, FX-normalised
- marts.fct_bidbaas_auctions — auction × winner-hash × revenue
- marts.fct_sdpkt_payments — payments × merchant-hash × FX-normalised
- marts.fct_messaging_metadata — conversation-hash × metadata only (per §6.5)
- marts.fct_app_sessions — session × app × duration × outcome
- marts.fct_panik_alerts — emergency events, geo-generalised
- marts.dim_app_metadata — slowly-changing dimension of app definitions (categories, costs, etc.)
- marts.dim_fx_rates_daily — FX history
- marts.dim_holidays — country × date holiday flags
- marts.agg_revenue_daily_app_country — daily revenue rolled up to (app, country, day)
- marts.agg_user_cohorts_weekly — cohort retention for marketing
- marts.agg_funnel_app_to_app — cross-app conversion funnels

8.5 Layer 4 — `analytics.*`

Views, not tables — always fresh, no extra storage
Tailored to Superset dashboards (column names, formats, ordering match the dashboard expectations)
One per dashboard, named analytics.dash_{dashboard_name}

8.6 Layer 5 — `ml.*`

Feature stores with strict time-aligned features (no future leakage)
Versioned tables — ml.feat_user_v1, ml.feat_user_v2
Owned by the ML team (not yet staffed; ml schema is empty for now but reserved)

9. Push-back to `.104` — the `analytics_db` pattern

Apps need their own intelligence at low latency. Cross-server queries (.104 ↔ .106) over WAN are too slow. So: selected marts get pushed back to a dedicated analytics_db on .104 (and replicated to .105).

9.1 What gets pushed back

Per-app analytics: each app's API gets a schema (e.g., analytics_db.tagme.*, analytics_db.bidbaas.*) with the aggregates it needs to render in-app dashboards.
Anonymised aggregates only. No row-level PII even after our anonymisation — these are aggregate counters / rollups (e.g., "how many check-ins did this user have this week", "average bid value in this category", "your spend vs. category median").
Per-user views are pre-computed for the user_hash that the app holds locally. The app calls analytics_db.tagme.user_summary WHERE user_hash = $1. Fast.

9.2 Schema sketch

analytics_db
├── tagme.user_summary           (user_hash, checkins_7d, checkins_30d, top_category, top_city)
├── tagme.poi_popularity         (poi_id, checkins_7d, checkins_30d, trending_score)
├── bidbaas.user_summary         (user_hash, bids_30d, wins_30d, total_spend_zar_30d, category_mix)
├── bidbaas.category_pulse       (category, median_winning_bid_zar, win_rate_30d, supply_30d)
├── sdpkt.user_summary           (user_hash, sends_30d, receives_30d, top_merchant_category)
├── txtme.user_summary           (user_hash, convs_30d, messages_30d, top_lang, response_time_p50_min)
├── takemehome.user_summary      (user_hash, searches_30d, clicks_30d, top_destination_city)
├── whatwewant.user_summary      (user_hash, lists_30d, items_30d, fulfilled_30d, top_category)
├── shared.user_lifecycle        (user_hash, stage, days_since_signup, apps_30d_array)
├── shared.cross_app_funnel      (funnel_id, user_hash, step, entered_at, completed)
└── shared.fx_rates_daily        (currency_pair, date, rate_close)

9.3 Push mechanism

Nightly (or per-mart-cadence) job runs on .106:

-- Simplified — real version uses dbt_external_tables or postgres_fdw
INSERT INTO analytics_db.tagme.user_summary
SELECT user_hash, checkins_7d, ... FROM marts.agg_tagme_user_summary
ON CONFLICT (user_hash) DO UPDATE SET ...

Implementation options:

postgres_fdw — data_warehouse opens a foreign data wrapper to .104 analytics_db, INSERTs over the network. Easiest.
logical replication — set up analytics_db as a logical subscriber to specific tables in data_warehouse. Most automatic.
batch CSV via S3/MinIO — dump to file, COPY FROM on .104. Most decoupled.

Recommendation: start with postgres_fdw for simplicity, migrate to logical replication when volume justifies.

9.4 What the apps see

The apps' Postgres user (per-app, least-privilege) gets SELECT on its schema in analytics_db. They read their own summaries directly:

// In TagMeAPI
var summary = await db.QuerySingleAsync<UserSummary>(
    "SELECT * FROM analytics.tagme_user_summary WHERE user_hash = @h",
    new { h = HashUser(currentUser.Id) });

App-side hashing is identical to the pipeline's (same salt, same algorithm) so the lookup just works.

10. Orchestration

10.1 Tool choice

dbt Core CLI in cron for the transformation chain (raw → staging → intermediate → marts → analytics). Already installed in dbt_transform container.
A small Python extract service for the .105 → raw.* hop. Either:
- Custom (~200 LOC, full control), or
- Airbyte (a folder at /home/geektrading/airbyte/ exists — possibly already partially configured; verify before committing).
n8n for event-driven side workflows (e.g., "when a Wolverine incident is logged, trigger an ad-hoc dbt run on impacted models"). NOT for the main batch pipeline.

Why not n8n for the whole pipeline? n8n is excellent for spiky event-driven flows but worse than dbt for hundreds of dependent SQL transformations. Use both, in their lanes.

10.2 Schedule

Job	When	Runs as
Extract event tables (`tagme.check_ins`, `bidbaas.bids`, etc.)	every 15 min	`etl_user` on .106
Extract dimension snapshots (`auth.users`, `apps.app_metadata`)	hourly	`etl_user`
dbt `staging` + `intermediate` runs	every 30 min	`dwh_user`
dbt `marts` runs	every hour	`dwh_user`
dbt `analytics` views	always-fresh views (no schedule)	`dwh_user`
Push-back to `.104 analytics_db`	hourly (offset 15 min after marts)	`pushback_user`
FX rate ingest from exchangerate.host	daily 02:00 SAST	`etl_user`
Holiday calendar refresh	weekly Sun 03:00 SAST	`etl_user`
Right-to-be-forgotten sweep	nightly 04:00 SAST	`compliance_user`
dbt test suite (full)	nightly 05:00 SAST	`dwh_user`
Re-key (salt rotation)	annually, scheduled change-window	`compliance_user`

10.3 PgBouncer / connection discipline

Per deploy-pattern-pgbouncer-cascade.md:

All ETL connections to .105 go through PgBouncer
Read-only replicator user from Deployment/deployment-credentials.ps1 is reused (already has SELECT on all DBs)
Max parallel connections per ETL job: 4
Never restart the ETL service between 06:00 and 22:00 SAST without notifying the team — burst of reconnects can trip the PgBouncer cache cascade

11. Monitoring + data quality

11.1 Tests (in dbt)

Every model has at minimum:

not_null on all primary keys
unique on natural keys
accepted_values on enums
Freshness check: staging.stg_* must have a row with event_time within the last hour (else warn)
K-anonymity check on any marts.* that aggregates users (warn at k<10, fail at k<5)
For intermediate.* — PII-absence regex check on every column

11.2 Observability

Prometheus scrapes the extract service + dbt run job for: rows ingested, dbt model run time, test pass/fail counts
Grafana dashboard "BI Pipeline Health" — should exist; build it as Phase 0
Loki captures dbt logs for ad-hoc debugging
Alertmanager rules:
- Any intermediate.* PII check fails → page on-call (P1)
- Extract job hasn't run in >2× expected interval → page on-call (P1)
- dbt test suite freshness fails on revenue-critical marts (ledger/sdpkt/bidbaas) → page on-call (P1)
- K-anonymity warn → ticket (P2)
- Disk free on .106 data_warehouse < 30 GB → ticket (P2)

11.3 Lineage

dbt auto-generates lineage. Hosted at superset:5003 or via dbt docs serve. Every analyst should know how to find: "which sources contribute to marts.fct_bidbaas_auctions?"

12. Implementation phases

The pipeline is too big to land in one drop. Phased, each phase independently shippable.

Phase 0 — Foundation (1 week)

Verify postgres roles + connection strings for data_warehouse, markets_db, superset_db
Document the existing dbt models (the five running marts.fct_* and two staging.*)
Set up the Grafana "BI Pipeline Health" dashboard scraping what's already there
Audit n8n's existing 55 MB sqlite — what workflows are running? Document or retire each.
Create compliance.token_vault DB on .118 (empty, just the schema)
Acceptance: repo has accurate docs of current state; ops can see pipeline health

Phase 1 — Extract framework (1 week)

Build the Python extract service (or wire Airbyte if it's already half-set-up)
Configure the source registry table (§5) as the single config file the extract reads
Ship extracts for the 3 revenue-critical sources first: ledger_db, sdpkt_db, bidbaas_db
Land in raw.* on .106
Acceptance: every 5 min, fresh rows from those 3 sources land in raw.*

Phase 2 — Anonymisation layer (1 week)

Implement the hash standard + the compliance.token_vault helper functions
Write intermediate.int_users_anon joining auth_db.users across all sources with deterministic hashing
Write the PII-absence dbt test macro
Acceptance: an external reviewer audits a sample of intermediate.* rows and confirms no PII leaks

Phase 3 — Initial marts + push-back (2 weeks)

Build the 5 highest-value marts (in order):
1. marts.fct_ledger_transactions_anon
2. marts.fct_bidbaas_auctions
3. marts.fct_sdpkt_payments
4. marts.dim_user_360_anon
5. marts.agg_revenue_daily_app_country
Wire postgres_fdw to .104 analytics_db
Push the first user-summary back (analytics.bidbaas_user_summary)
Have BidBaas read it
Acceptance: BidBaas in-app shows "your bids this week" sourced from analytics_db

Phase 4 — Coverage expansion (2 weeks)

Add the rest of the source DBs to the extract registry
Build remaining marts (TagMe deeper, TxTMe! metadata, WhatWeWant, Takemehome, etc.)
Bring all per-app analytics.*_user_summary views online
Acceptance: every app has an analytics tab populated from analytics_db

Phase 5 — Enrichment expansion (1 week)

Geographic enrichment (via GeoGlobal) on all user-location events
Temporal enrichment (holidays, paydays)
FX enrichment (either fixed via exchangerate.host, or wired to a repaired Markets API)
Cross-app session derivation
Acceptance: marts.agg_funnel_app_to_app shows real cross-app journeys

Phase 6 — BI surfaces (1-2 weeks)

BigBruh! "Analytics" tab (Superset embed) — replaces "live ops counters" with rich historical views
Per-app dashboards in Superset
Public-facing aggregate dashboards (if/where wanted) — e.g., DataAcuity portal could show anonymised national-level TGN economy metrics
Acceptance: BigBruh! Analytics page ships; team retires the "live counters only" view

Phase 7 — ML feature stores (open-ended)

Define the first two ml.feat_* tables — user-churn features + fraud features
Wire to a model-training pipeline (out of scope here)
Acceptance: ML team has a documented entry point with leakage-tested features

13. Open questions to resolve before Phase 0

Markets API repair vs. external FX source — decide before Phase 5
postgres_fdw vs. logical replication for push-back — decide during Phase 3
n8n's role — keep, retire, or grow? Decide during Phase 0 audit
Public-facing aggregate dashboards — is this a TGN product? Decide before Phase 6
ML team ownership of ml.* — who, when? Defer until Phase 7

14. Cross-references

GeoGlobal_README.md — the geo enrichment service used in §7.1
GeoGlobal_API_Reference.md — exact endpoints called during enrichment
GeoGlobal_Data_Schema.md — the geo_db.geonames and interesting_locations tables
Deployment/deployment-credentials.ps1 — REPLICATION_USER and replicator password used by extract service
.claude-memory/deploy-pattern-pgbouncer-cascade.md — connection discipline rules
.claude-memory/banking-compliance-rules.md — POPIA / FICA constraints we honour in §6
code/Config/TheGeekNetworkServices.json — single source of truth for app metadata that powers marts.dim_app_metadata