Skip to content
DA DataAcuity by The Geek Network

DataAcuity — BI Data Pipeline

Owner: Tinashe Bhengu Status: 🔵 DESIGN — pipeline must be built from scratch. The existing dbt project is scaffolding, not a live ETL. Audience: anyone who'll touch the data layer — backend engineers, DBAs, dbt authors, BI dashboard builders, compliance reviewers

The single source of truth for how raw app data becomes anonymised, enriched, business-intelligence-ready data that the TGN apps and BigBruh! can consume safely.


1. Executive summary

The Geek Network has 36 APIs and ~20 apps generating user events, transactions, locations, messages, bids, check-ins, etc. on .104 (primary) with hot replication to .105. Today this data is siloed per app — usable for direct app features but not for cross-app analytics, customer-360 views, ML feature stores, or anonymised partner analytics.

The DataAcuity stack on .106 has the infrastructure for a BI pipeline: a Postgres data warehouse (data_warehouse container), a dbt project with TGN-tagged models, Superset for dashboards, Grafana/Loki/Prometheus for ops telemetry, n8n for workflow orchestration, and restic-based backups running daily. What is not yet built is the actual data flow — the warehouse contains only scaffolding (see §1.1), no nightly ETL is running, and the path from .105.106 PostgreSQL traffic appears to be blocked (see §1.2). What's missing is the disciplined pipeline that gets every app's data in, strips PII, enriches it with geographic / temporal / FX / cross-app context, and pushes business-safe aggregates back to .104 so the apps can consume their own intelligence at low latency.

This document specifies that pipeline.

1.1 Honest state of the warehouse (verified 2026-05-28)

data_warehouse.datawarehouse on .106:
  schemas:        bronze, silver, gold (medallion architecture, EMPTY)
                  dbt_dev_marts (3 rows in each of 2 marts — scaffolding)
                  dbt_dev_staging (empty)
                  tgn (4 rows in events_2025_12; 11 future monthly partitions all 0 rows)
                  public

  dbt models with real data:
    dbt_dev_marts.fct_tgn_daily_metrics   — 3 rows
    dbt_dev_marts.fct_tgn_source_summary  — 3 rows

  dbt models that exist in code but haven't published anywhere:
    marts/tagme/fct_tagme_checkins.sql        (in dbt_project — never run successfully)
    marts/tagme/fct_tagme_tags.sql            (same)
    marts/tagme/fct_map_ad_impressions.sql    (same)
    staging/tgn/stg_tgn__events.sql           (same)
    staging/tgn/stg_tgn__ingest_log.sql       (same)

The dbt project is named "DataAcuity DBT Project" and was authored for bronze/silver/gold schemas. The few dbt_dev_* rows that exist look like build verification, not real ETL output. Treat the warehouse as greenfield for this pipeline's purposes.

1.2 Honest state of the .105 → .106 connectivity

Verified 2026-05-28: a psql -h 197.97.200.105 -U replicator -d postgres from a container on .106's data-warehouse_data_stack network times out. Likely causes:

  • Windows Firewall on .105 blocks PostgreSQL inbound from .106 (most likely — TGN ops convention is that .106 is a separate stack)
  • PostgreSQL on .105 is bound to localhost only
  • Network routing between data-centre subnets requires a specific path

This is BI Phase 1's first concrete blocker. Resolving it is a separate workstream: DBA on .104/.105 opens a firewall rule + pg_hba.conf entry for .106's IP using the existing replicator user (already documented in Deployment/deployment-credentials.ps1).

2. Why we need this

Problem today What the pipeline solves
Each app's analytics are siloed in its own DB Unified warehouse with cross-app fact / dim tables
Raw PII sits in app DBs and would leak through any analytics surface Anonymisation layer guarantees PII never reaches BI consumers
No FX-normalised view of revenue across SDPKT / BidBaas / WhatWeWant / Takemehome Enrichment layer attaches FX, time, geo context to every event
BigBruh! is stuck on "live counters" because it has no warehouse to query BigBruh! Analytics tier reads marts.* and embeds Superset charts
Customer-360, churn prediction, fraud detection are impossible without a unified user view marts.dim_user_360 joins anonymised user activity across all apps
ML training data extraction is ad-hoc and breaks compliance A documented, audited path to ML feature stores with anonymity guarantees
Apps can't show users their own activity history without expensive in-app aggregation Per-app analytics pushed back to .104 lets apps render their own dashboards locally

3. Architecture overview

┌─────────────────────────────────────────────────────────────┐
│  .104 PRIMARY (read-write)                                  │
│  ───────────────────────                                    │
│  All 36 API postgres DBs (auth_db, ledger_db, tagme_db,     │
│  bidbaas_db, sdpkt_db, txtme_db, sleptondb, etc.)           │
│                                                              │
│  + analytics_db  (NEW — pushed-back marts, see §9)          │
└─────────────────────────┬───────────────────────────────────┘
                          │ streaming replication
                          ▼
┌─────────────────────────────────────────────────────────────┐
│  .105 STANDBY (read-only replica)                           │
│  ───────────────────────────                                │
│  Same 36 DBs, near-real-time. THIS is where ETL reads from. │
└─────────────────────────┬───────────────────────────────────┘
                          │ nightly + selective near-real-time
                          ▼
┌─────────────────────────────────────────────────────────────┐
│  .106 DATA WAREHOUSE STACK                                  │
│  ──────────────────────────                                 │
│                                                              │
│  ┌─────────────────────┐   ┌───────────────────────────┐    │
│  │ data_warehouse      │   │ geo_db                    │    │
│  │ (postgres :5001)    │◄──┤ 13.4M places + 3.5M POIs  │    │
│  │                     │   │ (enrichment source)       │    │
│  │ Schemas:            │   └───────────────────────────┘    │
│  │  raw.*       <─── extracted from .105                    │
│  │  staging.*   <─── lightly typed + deduped                │
│  │  intermediate.* <── anonymised + standardised            │
│  │  marts.*     <─── business-ready fact/dim tables         │
│  │  analytics.* <─── denormalised views for dashboards      │
│  │  ml.*        <─── feature stores                         │
│  └─────────────────────┘                                    │
│                              ┌────────────────────────┐     │
│  ┌─────────────────────┐    │ valhalla + geo_mcp     │     │
│  │ dbt_transform       │───▶│ (enrichment service)   │     │
│  │ (transformations)   │    └────────────────────────┘     │
│  └─────────────────────┘                                    │
│                                                              │
│  ┌─────────────────────┐   ┌───────────────────────────┐    │
│  │ Superset :5003      │   │ Grafana / Prom / Loki     │    │
│  │ (BI dashboards)     │   │ (ops telemetry)           │    │
│  └─────────────────────┘   └───────────────────────────┘    │
└─────────────────────────┬───────────────────────────────────┘
                          │ aggregated, anonymised marts
                          ▼
┌─────────────────────────────────────────────────────────────┐
│  .104 analytics_db  (push-back)                             │
│  ─────────────────                                           │
│  Per-app marts every app's API can read locally for:        │
│    - "your activity this week" surfaces                     │
│    - cohort comparisons                                     │
│    - in-app dashboards                                      │
│  ZERO raw PII. Aggregates only. K-anonymised (k≥5).         │
└─────────────────────────────────────────────────────────────┘

4. The data flow — step by step

Every record goes through six stages. Each stage is idempotent and resumable.

  1. Capture — app writes to its own DB on .104. Already happening; no change.
  2. Replicate — streaming replication to .105. Already happening; no change. We never read from .104 for ETL — that would risk the primary under load.
  3. Extract — scheduled job pulls from .105 into data_warehouse.raw.* on .106. Incremental where possible (using updated_at / event_time watermarks), full snapshot for small dimension tables.
  4. Standardise + dedup — dbt models transform raw.*staging.*. Types coerced, encodings normalised, soft-deleted rows filtered.
  5. Anonymise — dbt models transform staging.*intermediate.*. PII fields hashed, tokenised, generalised, or suppressed per the rules in §6.
  6. Enrich — dbt models transform intermediate.*marts.*. Geo / time / FX / user-segment / cross-app context attached per §7.
  7. Servemarts.* are read by Superset for dashboards, by ML training jobs for feature extraction, and by the push-back job that writes selected aggregates into .104 analytics_db for app consumption.

Latency expectations:

Stage Latency target
Replicate (.104 → .105) < 5 s (streaming)
Extract (.105 → raw.*) T+15 min (event tables, incremental); T+24 h (dimension snapshots)
Standardise + dedup T+5 min after extract
Anonymise T+5 min after standardise
Enrich T+15 min after anonymise (geo lookups dominate)
Push-back to .104 T+30 min after marts ready

End-to-end: ~1 hour from event capture to anonymised, enriched availability in .104 analytics_db. Good enough for "this week", "today", "right now" surfaces in app UIs.

For true real-time (sub-minute) analytics — fraud flagging, live ops counters — keep using BigBruh!'s direct read-replica queries against .105. The pipeline is for intelligence, not live ops.

5. Source registry

Every TGN data source that flows through the pipeline. Source-of-truth list — update this file when a new API ships.

Source DB (on .104/.105) Owning API Key tables Extract mode Frequency
auth_db AuthAPI users, sessions, kyc_records, phone_pins Incremental on updated_at 15 min
ledger_db LedgerAPI transactions, accounts, holds, reversals Incremental on created_at 5 min (event-driven via WAL eventually)
messaging_db MessagingAPI / TxTMe! conversations, messages (CONTENT NEVER LEAVES — see §6.5), participants Incremental + metadata-only 1 h
apps_db shared TGN app_events, user_app_sessions Incremental 15 min
tagme_db TagMeAPI check_ins, points_of_interest, tags, user_locations (rounded), events Incremental 1 h
kiffstore_db KiffStoreAPI orders, products, inventory_events Incremental 1 h
bidbaas_db BidBaasAPI auctions, bids, winners, payouts Incremental 5 min (revenue-critical)
jobcenter_db TheJobCenterAPI job_postings, applications, placements Incremental 1 h
sdpkt_db SdpktAPI wallets, payments, holds, merchant_settlements Incremental 5 min
payfast_db PayfastAPI payment_attempts, itn_callbacks Incremental 15 min
glocell_db GlocellAPI airtime_purchases, data_purchases Incremental 1 h
media_db MediaAPI uploads, views, streams (metadata only) Incremental 1 h
slepton_db SleptOnAPI app_installs, app_updates, developer_payouts Incremental 1 h
notification_db NotificationAPI sends, deliveries, opens Incremental 15 min
incentives_db IncentivesAPI rewards, claims, referrals Incremental 1 h
wolverine_db WolverineAPI incidents, fixes, deploys Incremental 1 h
takemehome_db TakemehomeAPI searches, clicks, bookings_attributed Incremental 1 h
whatwewant_db WhatWeWantAPI lists, items, claims, purchases Incremental 1 h
(plus the remaining ~18 APIs as they go live)

Anything not listed here is not in the warehouse. Adding a source is a PR that updates this table + a new dbt source + the extract job's config.

6. Anonymisation framework

The single rule: marts.* and downstream must be safe to share with any internal team and exportable to vetted external partners without further sanitisation. The anonymisation step is the gate.

6.1 PII inventory

Every field across TGN apps falls in one of these buckets:

Bucket Examples Strategy
Direct identifiers name, email, phone, SA ID, passport, IBAN, card PAN Hash (deterministic) OR tokenise (reversible by compliance)
Quasi-identifiers DOB, postcode, employer, IP Generalise (DOB → 5-yr band, postcode → suburb, IP → city)
Location GPS coordinates, addresses Round to ~1 km grid for user-level analytics; full precision only for POI-level (which has no individual user attached)
Biometric face embeddings, fingerprints, voice Suppress — never extract from source
Free-text content message bodies (TxTMe!), notes, descriptions Suppress for marketing analytics; allowed only for explicit moderation/safety pipelines under separate access controls
Financial sensitive account balances, tx amounts Allowed, but joined to anonymised user_hash only
Behavioural timestamps, click counts, route choices Allowed (low PII risk in aggregate)
Device device_id, IDFA, push tokens Hash (deterministic)

6.2 Deterministic hashing standard

For any direct identifier we need to join across sources (e.g., same user_id in auth_db and tagme_db.check_ins):

-- Salt is per-source-type, stored in the compliance vault, rotated annually
user_hash = encode(digest(salt_users || ':' || user_id::text, 'sha256'), 'hex')

Properties:

  • Deterministic: same input → same hash, so joins still work
  • One-way: can't reverse without the salt
  • Salted: rainbow-table protection
  • Rotatable: annual salt rotation invalidates historical joins, which limits backward inference (a feature, not a bug — it caps how long an "anonymised" dataset stays linkable)

Hash length: full 64-char SHA-256 hex in intermediate.*, optionally truncated to 16 chars in marts.* for compactness (still 2^64 collision space, ample for our user count).

6.3 Tokenisation (for fields that must be reversible)

Some fields legal/compliance need to recover under court order (e.g., SA ID for a fraud investigation). These get tokenised instead of hashed:

sa_id_token = nextval('compliance.token_seq')
-- The mapping {sa_id → token} is stored in compliance.token_vault, encrypted at rest,
-- accessible only to two named compliance officers via audited break-glass procedure.

token_vault lives in its own DB on .118 with its own backup schedule. Tokens NEVER appear in marts.* — only in intermediate.* where access is locked down.

6.4 Generalisation rules

Source field In marts.*
date_of_birth age_band (<18, 18-24, 25-34, 35-44, 45-54, 55-64, 65+)
address city, country (no street, no postcode finer than 4-digit)
gps_lat, gps_lng rounded to 0.01° (~1.1 km grid) for user-level; full precision only for POI-level
ip_address enriched to city, country, isp_class — IP itself dropped
device_id hashed; device_class (Android-recent, Android-old, iOS-recent, iOS-old, Web) kept
exact_timestamp full precision allowed (timestamps aren't very identifying in aggregate); but in low-volume marts, round to hour

6.5 Message content — special case

TxTMe! / MessagingAPI message bodies are SUPPRESSED end-to-end. The pipeline extracts only:

  • conversation_id (hashed)
  • participant_count
  • message_count_per_day
  • has_media (boolean)
  • detected_language (top-level only)

Never the body. Never media payloads. Never subject lines.

Moderation/safety pipelines that need content access do NOT go through this BI pipeline — they have a separate, more tightly governed path with explicit user consent recorded.

6.6 K-anonymity guarantee

Every marts.* table has a documented k-value:

  • k ≥ 5 is the default minimum for any group that could enable re-identification
  • Marts that aggregate to (city, age_band, day) check that every (city, age_band, day) bucket has ≥5 users; otherwise the row is suppressed or generalised further (e.g., week-level)
  • dbt tests (dbt_utils.expression_is_true) enforce this — a failing test blocks the model from publishing

6.7 Compliance map

Regulation Region Pipeline guarantee
POPIA South Africa Consent recorded at signup; right-to-erasure handled (§6.8); cross-border transfer logged; PI minimisation applied (only fields needed for the mart purpose flow through)
GDPR EU users Same as POPIA + Article 17 right-to-be-forgotten propagates within 30 days; lawful basis = legitimate interest for ops analytics, consent for marketing
FICA SA banking Transaction data kept linkable via compliance.token_vault for 5 years; marts use hashed identifiers only
CCPA California Right-to-know + right-to-delete same mechanism as GDPR

6.8 Right-to-be-forgotten

When a user requests deletion (or compliance triggers it):

  1. AuthAPI marks users.deleted_at on .104
  2. Next extract picks up the soft-delete flag
  3. dbt model intermediate.deletions_pending collects all hashes belonging to that user across sources
  4. A retention job (nightly) deletes those rows from raw.* and staging.* and re-runs dependent dbt models
  5. compliance.token_vault entry is purged
  6. The push-back to .104 analytics_db re-runs and drops the user from any aggregate where they're now sole member

SLA: 30 days from request to last trace removed.

7. Enrichment framework

Anonymised data becomes valuable when it's joined to context the apps couldn't easily produce themselves. Five enrichment families.

7.1 Geographic enrichment (via GeoGlobal)

For any row with (lat, lng) or address_text:

  • Reverse geocode via geo_mcp.reverse_geocode(lat, lng) → attach city, country, admin1, admin2
  • POI context via geo_mcp.interesting_nearby(lat, lng, within_km=2) → attach nearest_poi_category, nearest_poi_score
  • Distance to nearest population centre for "is this user in a rural or urban area"
  • Cached aggressively — see GeoGlobal_API_Reference.md for TTLs

This is why GeoGlobal matters here, not just for app features.

7.2 Temporal enrichment

For any timestamp:

  • day_of_week (Mon-Sun)
  • hour_of_day (local time at user's last known city)
  • is_weekend, is_public_holiday (per country, using dim_holidays table seeded from external source)
  • is_payday_week (heuristic: last week of month + 5 days)
  • season (Southern Hemisphere ≠ Northern)

7.3 FX enrichment

For any monetary amount in ledger_db, sdpkt_db, bidbaas_db, takemehome_db, whatwewant_db:

  • amount_zar — already native for SA transactions
  • amount_usd — converted using the FX rate at transaction time
  • amount_btc_sats — for crypto-comparable views

FX rates come from a dim_fx_rates_daily table. Currently the Markets API on .106 is broken (verified 2026-05-28), so as a stopgap we ingest daily ZAR/USD/EUR/GBP/BTC from a free source (e.g., exchangerate.host) into the warehouse directly. Once Markets is repaired, switch the source.

7.4 User segment enrichment

For each user_hash:

  • kyc_tier — no-kyc / basic / verified (joins to auth_db.kyc_records)
  • lifecycle_stage — new (<7d), active (7d-90d), engaged (>90d active in last 30d), dormant (no activity 30d), churned (no activity 90d)
  • apps_used_30d — array of app IDs the user touched
  • revenue_band_zar_30dnone, low, mid, high (percentile-based, recomputed weekly)
  • mesh_node_class — non-relay / consumer / contributor / heavy-relay (CircleAether activity)

7.5 Cross-app enrichment

The killer feature — answers "this user did X in TagMe then Y in BidBaas":

  • cross_app_session_id — derived from a sliding window of timestamps. If a user has events in app A and app B within 30 min, same session.
  • attribution_chain — for revenue events, the prior 24h of touchpoints across all apps (powers attribution models)
  • funnel_position — for known multi-step funnels (e.g., TagMe-checkin → BidBaas-bid → SDPKT-pay), which step the user is on

8. Warehouse schema layers

Six schemas inside data_warehouse.datawarehouse. Strict layering — a model in layer N may only depend on layers ≤ N.

0. raw.*           ─ identical to source. No transforms. PII present.
1. staging.*       ─ types + dedup + soft-delete filter. PII still present.
2. intermediate.*  ─ ANONYMISED. PII gone (hashed/tokenised/generalised/suppressed).
3. marts.*         ─ ENRICHED. Business-ready fact/dim. Safe to share internally.
4. analytics.*     ─ Denormalised views over marts, optimised for Superset dashboards.
5. ml.*            ─ Feature stores. Time-aligned, leakage-checked.

8.1 Layer 0 — raw.*

  • One table per source table, named raw.{source_db}__{table} (e.g. raw.tagme_db__check_ins)
  • Materialised as Postgres tables (not views) because the source isn't always reachable
  • Loaded by the extract job (§10), never edited by hand
  • Retention: 90 days. After that, only staging.* and downstream remain (which by then are PII-free).
  • Access: locked to the ETL service account + two named DBAs. NOT readable by dbt jobs running as dwh_user.

8.2 Layer 1 — staging.*

  • One model per raw.* table, named staging.stg_{source}__{table} (matches existing stg_tgn__events convention)
  • Type coercion, encoding normalisation, soft-delete filter, dedup
  • Materialised as views (cheap, always fresh)
  • PII still present — this layer is access-controlled

8.3 Layer 2 — intermediate.*

  • This is where the magic — and the compliance line — sits.
  • Named intermediate.int_{purpose} (e.g. intermediate.int_user_360_anon)
  • Materialised as tables
  • Implements all rules in §6
  • Has dbt tests for: every column in §6.1's "Direct identifiers" bucket is verified absent (regex test on each column name + content check)
  • Sign-off: every new intermediate.* model requires PR review by compliance reviewer

8.4 Layer 3 — marts.*

  • The deliverable layer
  • Named marts.fct_* for facts, marts.dim_* for dimensions, marts.agg_* for pre-aggregated summaries
  • Existing models (already running):
    • marts.fct_tgn_daily_metrics
    • marts.fct_tgn_source_summary
    • marts.fct_tagme_tags
    • marts.fct_tagme_checkins
    • marts.fct_map_ad_impressions
  • New models needed for full coverage:
    • marts.dim_user_360_anon — anonymised user × 30-day rollup across all apps
    • marts.fct_ledger_transactions_anon — every ledger event, hashed users, FX-normalised
    • marts.fct_bidbaas_auctions — auction × winner-hash × revenue
    • marts.fct_sdpkt_payments — payments × merchant-hash × FX-normalised
    • marts.fct_messaging_metadata — conversation-hash × metadata only (per §6.5)
    • marts.fct_app_sessions — session × app × duration × outcome
    • marts.fct_panik_alerts — emergency events, geo-generalised
    • marts.dim_app_metadata — slowly-changing dimension of app definitions (categories, costs, etc.)
    • marts.dim_fx_rates_daily — FX history
    • marts.dim_holidays — country × date holiday flags
    • marts.agg_revenue_daily_app_country — daily revenue rolled up to (app, country, day)
    • marts.agg_user_cohorts_weekly — cohort retention for marketing
    • marts.agg_funnel_app_to_app — cross-app conversion funnels

8.5 Layer 4 — analytics.*

  • Views, not tables — always fresh, no extra storage
  • Tailored to Superset dashboards (column names, formats, ordering match the dashboard expectations)
  • One per dashboard, named analytics.dash_{dashboard_name}

8.6 Layer 5 — ml.*

  • Feature stores with strict time-aligned features (no future leakage)
  • Versioned tables — ml.feat_user_v1, ml.feat_user_v2
  • Owned by the ML team (not yet staffed; ml schema is empty for now but reserved)

9. Push-back to .104 — the analytics_db pattern

Apps need their own intelligence at low latency. Cross-server queries (.104 ↔ .106) over WAN are too slow. So: selected marts get pushed back to a dedicated analytics_db on .104 (and replicated to .105).

9.1 What gets pushed back

  • Per-app analytics: each app's API gets a schema (e.g., analytics_db.tagme.*, analytics_db.bidbaas.*) with the aggregates it needs to render in-app dashboards.
  • Anonymised aggregates only. No row-level PII even after our anonymisation — these are aggregate counters / rollups (e.g., "how many check-ins did this user have this week", "average bid value in this category", "your spend vs. category median").
  • Per-user views are pre-computed for the user_hash that the app holds locally. The app calls analytics_db.tagme.user_summary WHERE user_hash = $1. Fast.

9.2 Schema sketch

analytics_db
├── tagme.user_summary           (user_hash, checkins_7d, checkins_30d, top_category, top_city)
├── tagme.poi_popularity         (poi_id, checkins_7d, checkins_30d, trending_score)
├── bidbaas.user_summary         (user_hash, bids_30d, wins_30d, total_spend_zar_30d, category_mix)
├── bidbaas.category_pulse       (category, median_winning_bid_zar, win_rate_30d, supply_30d)
├── sdpkt.user_summary           (user_hash, sends_30d, receives_30d, top_merchant_category)
├── txtme.user_summary           (user_hash, convs_30d, messages_30d, top_lang, response_time_p50_min)
├── takemehome.user_summary      (user_hash, searches_30d, clicks_30d, top_destination_city)
├── whatwewant.user_summary      (user_hash, lists_30d, items_30d, fulfilled_30d, top_category)
├── shared.user_lifecycle        (user_hash, stage, days_since_signup, apps_30d_array)
├── shared.cross_app_funnel      (funnel_id, user_hash, step, entered_at, completed)
└── shared.fx_rates_daily        (currency_pair, date, rate_close)

9.3 Push mechanism

Nightly (or per-mart-cadence) job runs on .106:

-- Simplified — real version uses dbt_external_tables or postgres_fdw
INSERT INTO analytics_db.tagme.user_summary
SELECT user_hash, checkins_7d, ... FROM marts.agg_tagme_user_summary
ON CONFLICT (user_hash) DO UPDATE SET ...

Implementation options:

  1. postgres_fdwdata_warehouse opens a foreign data wrapper to .104 analytics_db, INSERTs over the network. Easiest.
  2. logical replication — set up analytics_db as a logical subscriber to specific tables in data_warehouse. Most automatic.
  3. batch CSV via S3/MinIO — dump to file, COPY FROM on .104. Most decoupled.

Recommendation: start with postgres_fdw for simplicity, migrate to logical replication when volume justifies.

9.4 What the apps see

The apps' Postgres user (per-app, least-privilege) gets SELECT on its schema in analytics_db. They read their own summaries directly:

// In TagMeAPI
var summary = await db.QuerySingleAsync<UserSummary>(
    "SELECT * FROM analytics.tagme_user_summary WHERE user_hash = @h",
    new { h = HashUser(currentUser.Id) });

App-side hashing is identical to the pipeline's (same salt, same algorithm) so the lookup just works.

10. Orchestration

10.1 Tool choice

  • dbt Core CLI in cron for the transformation chain (raw → staging → intermediate → marts → analytics). Already installed in dbt_transform container.
  • A small Python extract service for the .105 → raw.* hop. Either:
    • Custom (~200 LOC, full control), or
    • Airbyte (a folder at /home/geektrading/airbyte/ exists — possibly already partially configured; verify before committing).
  • n8n for event-driven side workflows (e.g., "when a Wolverine incident is logged, trigger an ad-hoc dbt run on impacted models"). NOT for the main batch pipeline.

Why not n8n for the whole pipeline? n8n is excellent for spiky event-driven flows but worse than dbt for hundreds of dependent SQL transformations. Use both, in their lanes.

10.2 Schedule

Job When Runs as
Extract event tables (tagme.check_ins, bidbaas.bids, etc.) every 15 min etl_user on .106
Extract dimension snapshots (auth.users, apps.app_metadata) hourly etl_user
dbt staging + intermediate runs every 30 min dwh_user
dbt marts runs every hour dwh_user
dbt analytics views always-fresh views (no schedule) dwh_user
Push-back to .104 analytics_db hourly (offset 15 min after marts) pushback_user
FX rate ingest from exchangerate.host daily 02:00 SAST etl_user
Holiday calendar refresh weekly Sun 03:00 SAST etl_user
Right-to-be-forgotten sweep nightly 04:00 SAST compliance_user
dbt test suite (full) nightly 05:00 SAST dwh_user
Re-key (salt rotation) annually, scheduled change-window compliance_user

10.3 PgBouncer / connection discipline

Per deploy-pattern-pgbouncer-cascade.md:

  • All ETL connections to .105 go through PgBouncer
  • Read-only replicator user from Deployment/deployment-credentials.ps1 is reused (already has SELECT on all DBs)
  • Max parallel connections per ETL job: 4
  • Never restart the ETL service between 06:00 and 22:00 SAST without notifying the team — burst of reconnects can trip the PgBouncer cache cascade

11. Monitoring + data quality

11.1 Tests (in dbt)

Every model has at minimum:

  • not_null on all primary keys
  • unique on natural keys
  • accepted_values on enums
  • Freshness check: staging.stg_* must have a row with event_time within the last hour (else warn)
  • K-anonymity check on any marts.* that aggregates users (warn at k<10, fail at k<5)
  • For intermediate.* — PII-absence regex check on every column

11.2 Observability

  • Prometheus scrapes the extract service + dbt run job for: rows ingested, dbt model run time, test pass/fail counts
  • Grafana dashboard "BI Pipeline Health" — should exist; build it as Phase 0
  • Loki captures dbt logs for ad-hoc debugging
  • Alertmanager rules:
    • Any intermediate.* PII check fails → page on-call (P1)
    • Extract job hasn't run in >2× expected interval → page on-call (P1)
    • dbt test suite freshness fails on revenue-critical marts (ledger/sdpkt/bidbaas) → page on-call (P1)
    • K-anonymity warn → ticket (P2)
    • Disk free on .106 data_warehouse < 30 GB → ticket (P2)

11.3 Lineage

dbt auto-generates lineage. Hosted at superset:5003 or via dbt docs serve. Every analyst should know how to find: "which sources contribute to marts.fct_bidbaas_auctions?"

12. Implementation phases

The pipeline is too big to land in one drop. Phased, each phase independently shippable.

Phase 0 — Foundation (1 week)

  • Verify postgres roles + connection strings for data_warehouse, markets_db, superset_db
  • Document the existing dbt models (the five running marts.fct_* and two staging.*)
  • Set up the Grafana "BI Pipeline Health" dashboard scraping what's already there
  • Audit n8n's existing 55 MB sqlite — what workflows are running? Document or retire each.
  • Create compliance.token_vault DB on .118 (empty, just the schema)
  • Acceptance: repo has accurate docs of current state; ops can see pipeline health

Phase 1 — Extract framework (1 week)

  • Build the Python extract service (or wire Airbyte if it's already half-set-up)
  • Configure the source registry table (§5) as the single config file the extract reads
  • Ship extracts for the 3 revenue-critical sources first: ledger_db, sdpkt_db, bidbaas_db
  • Land in raw.* on .106
  • Acceptance: every 5 min, fresh rows from those 3 sources land in raw.*

Phase 2 — Anonymisation layer (1 week)

  • Implement the hash standard + the compliance.token_vault helper functions
  • Write intermediate.int_users_anon joining auth_db.users across all sources with deterministic hashing
  • Write the PII-absence dbt test macro
  • Acceptance: an external reviewer audits a sample of intermediate.* rows and confirms no PII leaks

Phase 3 — Initial marts + push-back (2 weeks)

  • Build the 5 highest-value marts (in order):
    1. marts.fct_ledger_transactions_anon
    2. marts.fct_bidbaas_auctions
    3. marts.fct_sdpkt_payments
    4. marts.dim_user_360_anon
    5. marts.agg_revenue_daily_app_country
  • Wire postgres_fdw to .104 analytics_db
  • Push the first user-summary back (analytics.bidbaas_user_summary)
  • Have BidBaas read it
  • Acceptance: BidBaas in-app shows "your bids this week" sourced from analytics_db

Phase 4 — Coverage expansion (2 weeks)

  • Add the rest of the source DBs to the extract registry
  • Build remaining marts (TagMe deeper, TxTMe! metadata, WhatWeWant, Takemehome, etc.)
  • Bring all per-app analytics.*_user_summary views online
  • Acceptance: every app has an analytics tab populated from analytics_db

Phase 5 — Enrichment expansion (1 week)

  • Geographic enrichment (via GeoGlobal) on all user-location events
  • Temporal enrichment (holidays, paydays)
  • FX enrichment (either fixed via exchangerate.host, or wired to a repaired Markets API)
  • Cross-app session derivation
  • Acceptance: marts.agg_funnel_app_to_app shows real cross-app journeys

Phase 6 — BI surfaces (1-2 weeks)

  • BigBruh! "Analytics" tab (Superset embed) — replaces "live ops counters" with rich historical views
  • Per-app dashboards in Superset
  • Public-facing aggregate dashboards (if/where wanted) — e.g., DataAcuity portal could show anonymised national-level TGN economy metrics
  • Acceptance: BigBruh! Analytics page ships; team retires the "live counters only" view

Phase 7 — ML feature stores (open-ended)

  • Define the first two ml.feat_* tables — user-churn features + fraud features
  • Wire to a model-training pipeline (out of scope here)
  • Acceptance: ML team has a documented entry point with leakage-tested features

13. Open questions to resolve before Phase 0

  1. Markets API repair vs. external FX source — decide before Phase 5
  2. postgres_fdw vs. logical replication for push-back — decide during Phase 3
  3. n8n's role — keep, retire, or grow? Decide during Phase 0 audit
  4. Public-facing aggregate dashboards — is this a TGN product? Decide before Phase 6
  5. ML team ownership of ml.* — who, when? Defer until Phase 7

14. Cross-references

  • GeoGlobal_README.md — the geo enrichment service used in §7.1
  • GeoGlobal_API_Reference.md — exact endpoints called during enrichment
  • GeoGlobal_Data_Schema.md — the geo_db.geonames and interesting_locations tables
  • Deployment/deployment-credentials.ps1REPLICATION_USER and replicator password used by extract service
  • .claude-memory/deploy-pattern-pgbouncer-cascade.md — connection discipline rules
  • .claude-memory/banking-compliance-rules.md — POPIA / FICA constraints we honour in §6
  • code/Config/TheGeekNetworkServices.json — single source of truth for app metadata that powers marts.dim_app_metadata
Something went wrong on this page. Reload