DataAcuity — BI Data Pipeline
Owner: Tinashe Bhengu Status: 🔵 DESIGN — pipeline must be built from scratch. The existing dbt project is scaffolding, not a live ETL. Audience: anyone who'll touch the data layer — backend engineers, DBAs, dbt authors, BI dashboard builders, compliance reviewers
The single source of truth for how raw app data becomes anonymised, enriched, business-intelligence-ready data that the TGN apps and BigBruh! can consume safely.
1. Executive summary
The Geek Network has 36 APIs and ~20 apps generating user events, transactions, locations, messages, bids, check-ins, etc. on .104 (primary) with hot replication to .105. Today this data is siloed per app — usable for direct app features but not for cross-app analytics, customer-360 views, ML feature stores, or anonymised partner analytics.
The DataAcuity stack on .106 has the infrastructure for a BI pipeline: a Postgres data warehouse (data_warehouse container), a dbt project with TGN-tagged models, Superset for dashboards, Grafana/Loki/Prometheus for ops telemetry, n8n for workflow orchestration, and restic-based backups running daily. What is not yet built is the actual data flow — the warehouse contains only scaffolding (see §1.1), no nightly ETL is running, and the path from .105 → .106 PostgreSQL traffic appears to be blocked (see §1.2). What's missing is the disciplined pipeline that gets every app's data in, strips PII, enriches it with geographic / temporal / FX / cross-app context, and pushes business-safe aggregates back to .104 so the apps can consume their own intelligence at low latency.
This document specifies that pipeline.
1.1 Honest state of the warehouse (verified 2026-05-28)
data_warehouse.datawarehouse on .106:
schemas: bronze, silver, gold (medallion architecture, EMPTY)
dbt_dev_marts (3 rows in each of 2 marts — scaffolding)
dbt_dev_staging (empty)
tgn (4 rows in events_2025_12; 11 future monthly partitions all 0 rows)
public
dbt models with real data:
dbt_dev_marts.fct_tgn_daily_metrics — 3 rows
dbt_dev_marts.fct_tgn_source_summary — 3 rows
dbt models that exist in code but haven't published anywhere:
marts/tagme/fct_tagme_checkins.sql (in dbt_project — never run successfully)
marts/tagme/fct_tagme_tags.sql (same)
marts/tagme/fct_map_ad_impressions.sql (same)
staging/tgn/stg_tgn__events.sql (same)
staging/tgn/stg_tgn__ingest_log.sql (same)
The dbt project is named "DataAcuity DBT Project" and was authored for bronze/silver/gold schemas. The few dbt_dev_* rows that exist look like build verification, not real ETL output. Treat the warehouse as greenfield for this pipeline's purposes.
1.2 Honest state of the .105 → .106 connectivity
Verified 2026-05-28: a psql -h 197.97.200.105 -U replicator -d postgres from a container on .106's data-warehouse_data_stack network times out. Likely causes:
- Windows Firewall on
.105blocks PostgreSQL inbound from.106(most likely — TGN ops convention is that.106is a separate stack) - PostgreSQL on
.105is bound tolocalhostonly - Network routing between data-centre subnets requires a specific path
This is BI Phase 1's first concrete blocker. Resolving it is a separate workstream: DBA on .104/.105 opens a firewall rule + pg_hba.conf entry for .106's IP using the existing replicator user (already documented in Deployment/deployment-credentials.ps1).
2. Why we need this
| Problem today | What the pipeline solves |
|---|---|
| Each app's analytics are siloed in its own DB | Unified warehouse with cross-app fact / dim tables |
| Raw PII sits in app DBs and would leak through any analytics surface | Anonymisation layer guarantees PII never reaches BI consumers |
| No FX-normalised view of revenue across SDPKT / BidBaas / WhatWeWant / Takemehome | Enrichment layer attaches FX, time, geo context to every event |
| BigBruh! is stuck on "live counters" because it has no warehouse to query | BigBruh! Analytics tier reads marts.* and embeds Superset charts |
| Customer-360, churn prediction, fraud detection are impossible without a unified user view | marts.dim_user_360 joins anonymised user activity across all apps |
| ML training data extraction is ad-hoc and breaks compliance | A documented, audited path to ML feature stores with anonymity guarantees |
| Apps can't show users their own activity history without expensive in-app aggregation | Per-app analytics pushed back to .104 lets apps render their own dashboards locally |
3. Architecture overview
┌─────────────────────────────────────────────────────────────┐
│ .104 PRIMARY (read-write) │
│ ─────────────────────── │
│ All 36 API postgres DBs (auth_db, ledger_db, tagme_db, │
│ bidbaas_db, sdpkt_db, txtme_db, sleptondb, etc.) │
│ │
│ + analytics_db (NEW — pushed-back marts, see §9) │
└─────────────────────────┬───────────────────────────────────┘
│ streaming replication
▼
┌─────────────────────────────────────────────────────────────┐
│ .105 STANDBY (read-only replica) │
│ ─────────────────────────── │
│ Same 36 DBs, near-real-time. THIS is where ETL reads from. │
└─────────────────────────┬───────────────────────────────────┘
│ nightly + selective near-real-time
▼
┌─────────────────────────────────────────────────────────────┐
│ .106 DATA WAREHOUSE STACK │
│ ────────────────────────── │
│ │
│ ┌─────────────────────┐ ┌───────────────────────────┐ │
│ │ data_warehouse │ │ geo_db │ │
│ │ (postgres :5001) │◄──┤ 13.4M places + 3.5M POIs │ │
│ │ │ │ (enrichment source) │ │
│ │ Schemas: │ └───────────────────────────┘ │
│ │ raw.* <─── extracted from .105 │
│ │ staging.* <─── lightly typed + deduped │
│ │ intermediate.* <── anonymised + standardised │
│ │ marts.* <─── business-ready fact/dim tables │
│ │ analytics.* <─── denormalised views for dashboards │
│ │ ml.* <─── feature stores │
│ └─────────────────────┘ │
│ ┌────────────────────────┐ │
│ ┌─────────────────────┐ │ valhalla + geo_mcp │ │
│ │ dbt_transform │───▶│ (enrichment service) │ │
│ │ (transformations) │ └────────────────────────┘ │
│ └─────────────────────┘ │
│ │
│ ┌─────────────────────┐ ┌───────────────────────────┐ │
│ │ Superset :5003 │ │ Grafana / Prom / Loki │ │
│ │ (BI dashboards) │ │ (ops telemetry) │ │
│ └─────────────────────┘ └───────────────────────────┘ │
└─────────────────────────┬───────────────────────────────────┘
│ aggregated, anonymised marts
▼
┌─────────────────────────────────────────────────────────────┐
│ .104 analytics_db (push-back) │
│ ───────────────── │
│ Per-app marts every app's API can read locally for: │
│ - "your activity this week" surfaces │
│ - cohort comparisons │
│ - in-app dashboards │
│ ZERO raw PII. Aggregates only. K-anonymised (k≥5). │
└─────────────────────────────────────────────────────────────┘
4. The data flow — step by step
Every record goes through six stages. Each stage is idempotent and resumable.
- Capture — app writes to its own DB on
.104. Already happening; no change. - Replicate — streaming replication to
.105. Already happening; no change. We never read from.104for ETL — that would risk the primary under load. - Extract — scheduled job pulls from
.105intodata_warehouse.raw.*on.106. Incremental where possible (using updated_at / event_time watermarks), full snapshot for small dimension tables. - Standardise + dedup — dbt models transform
raw.*→staging.*. Types coerced, encodings normalised, soft-deleted rows filtered. - Anonymise — dbt models transform
staging.*→intermediate.*. PII fields hashed, tokenised, generalised, or suppressed per the rules in §6. - Enrich — dbt models transform
intermediate.*→marts.*. Geo / time / FX / user-segment / cross-app context attached per §7. - Serve —
marts.*are read by Superset for dashboards, by ML training jobs for feature extraction, and by the push-back job that writes selected aggregates into.104 analytics_dbfor app consumption.
Latency expectations:
| Stage | Latency target |
|---|---|
| Replicate (.104 → .105) | < 5 s (streaming) |
| Extract (.105 → raw.*) | T+15 min (event tables, incremental); T+24 h (dimension snapshots) |
| Standardise + dedup | T+5 min after extract |
| Anonymise | T+5 min after standardise |
| Enrich | T+15 min after anonymise (geo lookups dominate) |
| Push-back to .104 | T+30 min after marts ready |
End-to-end: ~1 hour from event capture to anonymised, enriched availability in .104 analytics_db. Good enough for "this week", "today", "right now" surfaces in app UIs.
For true real-time (sub-minute) analytics — fraud flagging, live ops counters — keep using BigBruh!'s direct read-replica queries against .105. The pipeline is for intelligence, not live ops.
5. Source registry
Every TGN data source that flows through the pipeline. Source-of-truth list — update this file when a new API ships.
| Source DB (on .104/.105) | Owning API | Key tables | Extract mode | Frequency |
|---|---|---|---|---|
auth_db |
AuthAPI | users, sessions, kyc_records, phone_pins |
Incremental on updated_at |
15 min |
ledger_db |
LedgerAPI | transactions, accounts, holds, reversals |
Incremental on created_at |
5 min (event-driven via WAL eventually) |
messaging_db |
MessagingAPI / TxTMe! | conversations, messages (CONTENT NEVER LEAVES — see §6.5), participants |
Incremental + metadata-only | 1 h |
apps_db |
shared TGN | app_events, user_app_sessions |
Incremental | 15 min |
tagme_db |
TagMeAPI | check_ins, points_of_interest, tags, user_locations (rounded), events |
Incremental | 1 h |
kiffstore_db |
KiffStoreAPI | orders, products, inventory_events |
Incremental | 1 h |
bidbaas_db |
BidBaasAPI | auctions, bids, winners, payouts |
Incremental | 5 min (revenue-critical) |
jobcenter_db |
TheJobCenterAPI | job_postings, applications, placements |
Incremental | 1 h |
sdpkt_db |
SdpktAPI | wallets, payments, holds, merchant_settlements |
Incremental | 5 min |
payfast_db |
PayfastAPI | payment_attempts, itn_callbacks |
Incremental | 15 min |
glocell_db |
GlocellAPI | airtime_purchases, data_purchases |
Incremental | 1 h |
media_db |
MediaAPI | uploads, views, streams (metadata only) |
Incremental | 1 h |
slepton_db |
SleptOnAPI | app_installs, app_updates, developer_payouts |
Incremental | 1 h |
notification_db |
NotificationAPI | sends, deliveries, opens |
Incremental | 15 min |
incentives_db |
IncentivesAPI | rewards, claims, referrals |
Incremental | 1 h |
wolverine_db |
WolverineAPI | incidents, fixes, deploys |
Incremental | 1 h |
takemehome_db |
TakemehomeAPI | searches, clicks, bookings_attributed |
Incremental | 1 h |
whatwewant_db |
WhatWeWantAPI | lists, items, claims, purchases |
Incremental | 1 h |
| (plus the remaining ~18 APIs as they go live) |
Anything not listed here is not in the warehouse. Adding a source is a PR that updates this table + a new dbt source + the extract job's config.
6. Anonymisation framework
The single rule: marts.* and downstream must be safe to share with any internal team and exportable to vetted external partners without further sanitisation. The anonymisation step is the gate.
6.1 PII inventory
Every field across TGN apps falls in one of these buckets:
| Bucket | Examples | Strategy |
|---|---|---|
| Direct identifiers | name, email, phone, SA ID, passport, IBAN, card PAN | Hash (deterministic) OR tokenise (reversible by compliance) |
| Quasi-identifiers | DOB, postcode, employer, IP | Generalise (DOB → 5-yr band, postcode → suburb, IP → city) |
| Location | GPS coordinates, addresses | Round to ~1 km grid for user-level analytics; full precision only for POI-level (which has no individual user attached) |
| Biometric | face embeddings, fingerprints, voice | Suppress — never extract from source |
| Free-text content | message bodies (TxTMe!), notes, descriptions | Suppress for marketing analytics; allowed only for explicit moderation/safety pipelines under separate access controls |
| Financial sensitive | account balances, tx amounts | Allowed, but joined to anonymised user_hash only |
| Behavioural | timestamps, click counts, route choices | Allowed (low PII risk in aggregate) |
| Device | device_id, IDFA, push tokens | Hash (deterministic) |
6.2 Deterministic hashing standard
For any direct identifier we need to join across sources (e.g., same user_id in auth_db and tagme_db.check_ins):
-- Salt is per-source-type, stored in the compliance vault, rotated annually
user_hash = encode(digest(salt_users || ':' || user_id::text, 'sha256'), 'hex')
Properties:
- Deterministic: same input → same hash, so joins still work
- One-way: can't reverse without the salt
- Salted: rainbow-table protection
- Rotatable: annual salt rotation invalidates historical joins, which limits backward inference (a feature, not a bug — it caps how long an "anonymised" dataset stays linkable)
Hash length: full 64-char SHA-256 hex in intermediate.*, optionally truncated to 16 chars in marts.* for compactness (still 2^64 collision space, ample for our user count).
6.3 Tokenisation (for fields that must be reversible)
Some fields legal/compliance need to recover under court order (e.g., SA ID for a fraud investigation). These get tokenised instead of hashed:
sa_id_token = nextval('compliance.token_seq')
-- The mapping {sa_id → token} is stored in compliance.token_vault, encrypted at rest,
-- accessible only to two named compliance officers via audited break-glass procedure.
token_vault lives in its own DB on .118 with its own backup schedule. Tokens NEVER appear in marts.* — only in intermediate.* where access is locked down.
6.4 Generalisation rules
| Source field | In marts.* |
|---|---|
date_of_birth |
age_band (<18, 18-24, 25-34, 35-44, 45-54, 55-64, 65+) |
address |
city, country (no street, no postcode finer than 4-digit) |
gps_lat, gps_lng |
rounded to 0.01° (~1.1 km grid) for user-level; full precision only for POI-level |
ip_address |
enriched to city, country, isp_class — IP itself dropped |
device_id |
hashed; device_class (Android-recent, Android-old, iOS-recent, iOS-old, Web) kept |
exact_timestamp |
full precision allowed (timestamps aren't very identifying in aggregate); but in low-volume marts, round to hour |
6.5 Message content — special case
TxTMe! / MessagingAPI message bodies are SUPPRESSED end-to-end. The pipeline extracts only:
- conversation_id (hashed)
- participant_count
- message_count_per_day
- has_media (boolean)
- detected_language (top-level only)
Never the body. Never media payloads. Never subject lines.
Moderation/safety pipelines that need content access do NOT go through this BI pipeline — they have a separate, more tightly governed path with explicit user consent recorded.
6.6 K-anonymity guarantee
Every marts.* table has a documented k-value:
- k ≥ 5 is the default minimum for any group that could enable re-identification
- Marts that aggregate to (city, age_band, day) check that every (city, age_band, day) bucket has ≥5 users; otherwise the row is suppressed or generalised further (e.g., week-level)
- dbt tests (
dbt_utils.expression_is_true) enforce this — a failing test blocks the model from publishing
6.7 Compliance map
| Regulation | Region | Pipeline guarantee |
|---|---|---|
| POPIA | South Africa | Consent recorded at signup; right-to-erasure handled (§6.8); cross-border transfer logged; PI minimisation applied (only fields needed for the mart purpose flow through) |
| GDPR | EU users | Same as POPIA + Article 17 right-to-be-forgotten propagates within 30 days; lawful basis = legitimate interest for ops analytics, consent for marketing |
| FICA | SA banking | Transaction data kept linkable via compliance.token_vault for 5 years; marts use hashed identifiers only |
| CCPA | California | Right-to-know + right-to-delete same mechanism as GDPR |
6.8 Right-to-be-forgotten
When a user requests deletion (or compliance triggers it):
- AuthAPI marks
users.deleted_aton.104 - Next extract picks up the soft-delete flag
- dbt model
intermediate.deletions_pendingcollects all hashes belonging to that user across sources - A retention job (nightly) deletes those rows from
raw.*andstaging.*and re-runs dependent dbt models compliance.token_vaultentry is purged- The push-back to
.104 analytics_dbre-runs and drops the user from any aggregate where they're now sole member
SLA: 30 days from request to last trace removed.
7. Enrichment framework
Anonymised data becomes valuable when it's joined to context the apps couldn't easily produce themselves. Five enrichment families.
7.1 Geographic enrichment (via GeoGlobal)
For any row with (lat, lng) or address_text:
- Reverse geocode via
geo_mcp.reverse_geocode(lat, lng)→ attachcity,country,admin1,admin2 - POI context via
geo_mcp.interesting_nearby(lat, lng, within_km=2)→ attachnearest_poi_category,nearest_poi_score - Distance to nearest population centre for "is this user in a rural or urban area"
- Cached aggressively — see
GeoGlobal_API_Reference.mdfor TTLs
This is why GeoGlobal matters here, not just for app features.
7.2 Temporal enrichment
For any timestamp:
day_of_week(Mon-Sun)hour_of_day(local time at user's last known city)is_weekend,is_public_holiday(per country, usingdim_holidaystable seeded from external source)is_payday_week(heuristic: last week of month + 5 days)season(Southern Hemisphere ≠ Northern)
7.3 FX enrichment
For any monetary amount in ledger_db, sdpkt_db, bidbaas_db, takemehome_db, whatwewant_db:
amount_zar— already native for SA transactionsamount_usd— converted using the FX rate at transaction timeamount_btc_sats— for crypto-comparable views
FX rates come from a dim_fx_rates_daily table. Currently the Markets API on .106 is broken (verified 2026-05-28), so as a stopgap we ingest daily ZAR/USD/EUR/GBP/BTC from a free source (e.g., exchangerate.host) into the warehouse directly. Once Markets is repaired, switch the source.
7.4 User segment enrichment
For each user_hash:
kyc_tier— no-kyc / basic / verified (joins toauth_db.kyc_records)lifecycle_stage— new (<7d), active (7d-90d), engaged (>90d active in last 30d), dormant (no activity 30d), churned (no activity 90d)apps_used_30d— array of app IDs the user touchedrevenue_band_zar_30d—none,low,mid,high(percentile-based, recomputed weekly)mesh_node_class— non-relay / consumer / contributor / heavy-relay (CircleAether activity)
7.5 Cross-app enrichment
The killer feature — answers "this user did X in TagMe then Y in BidBaas":
cross_app_session_id— derived from a sliding window of timestamps. If a user has events in app A and app B within 30 min, same session.attribution_chain— for revenue events, the prior 24h of touchpoints across all apps (powers attribution models)funnel_position— for known multi-step funnels (e.g., TagMe-checkin → BidBaas-bid → SDPKT-pay), which step the user is on
8. Warehouse schema layers
Six schemas inside data_warehouse.datawarehouse. Strict layering — a model in layer N may only depend on layers ≤ N.
0. raw.* ─ identical to source. No transforms. PII present.
1. staging.* ─ types + dedup + soft-delete filter. PII still present.
2. intermediate.* ─ ANONYMISED. PII gone (hashed/tokenised/generalised/suppressed).
3. marts.* ─ ENRICHED. Business-ready fact/dim. Safe to share internally.
4. analytics.* ─ Denormalised views over marts, optimised for Superset dashboards.
5. ml.* ─ Feature stores. Time-aligned, leakage-checked.
8.1 Layer 0 — raw.*
- One table per source table, named
raw.{source_db}__{table}(e.g.raw.tagme_db__check_ins) - Materialised as Postgres tables (not views) because the source isn't always reachable
- Loaded by the extract job (§10), never edited by hand
- Retention: 90 days. After that, only
staging.*and downstream remain (which by then are PII-free). - Access: locked to the ETL service account + two named DBAs. NOT readable by dbt jobs running as
dwh_user.
8.2 Layer 1 — staging.*
- One model per
raw.*table, namedstaging.stg_{source}__{table}(matches existingstg_tgn__eventsconvention) - Type coercion, encoding normalisation, soft-delete filter, dedup
- Materialised as views (cheap, always fresh)
- PII still present — this layer is access-controlled
8.3 Layer 2 — intermediate.*
- This is where the magic — and the compliance line — sits.
- Named
intermediate.int_{purpose}(e.g.intermediate.int_user_360_anon) - Materialised as tables
- Implements all rules in §6
- Has dbt tests for: every column in §6.1's "Direct identifiers" bucket is verified absent (regex test on each column name + content check)
- Sign-off: every new
intermediate.*model requires PR review by compliance reviewer
8.4 Layer 3 — marts.*
- The deliverable layer
- Named
marts.fct_*for facts,marts.dim_*for dimensions,marts.agg_*for pre-aggregated summaries - Existing models (already running):
marts.fct_tgn_daily_metricsmarts.fct_tgn_source_summarymarts.fct_tagme_tagsmarts.fct_tagme_checkinsmarts.fct_map_ad_impressions
- New models needed for full coverage:
marts.dim_user_360_anon— anonymised user × 30-day rollup across all appsmarts.fct_ledger_transactions_anon— every ledger event, hashed users, FX-normalisedmarts.fct_bidbaas_auctions— auction × winner-hash × revenuemarts.fct_sdpkt_payments— payments × merchant-hash × FX-normalisedmarts.fct_messaging_metadata— conversation-hash × metadata only (per §6.5)marts.fct_app_sessions— session × app × duration × outcomemarts.fct_panik_alerts— emergency events, geo-generalisedmarts.dim_app_metadata— slowly-changing dimension of app definitions (categories, costs, etc.)marts.dim_fx_rates_daily— FX historymarts.dim_holidays— country × date holiday flagsmarts.agg_revenue_daily_app_country— daily revenue rolled up to (app, country, day)marts.agg_user_cohorts_weekly— cohort retention for marketingmarts.agg_funnel_app_to_app— cross-app conversion funnels
8.5 Layer 4 — analytics.*
- Views, not tables — always fresh, no extra storage
- Tailored to Superset dashboards (column names, formats, ordering match the dashboard expectations)
- One per dashboard, named
analytics.dash_{dashboard_name}
8.6 Layer 5 — ml.*
- Feature stores with strict time-aligned features (no future leakage)
- Versioned tables —
ml.feat_user_v1,ml.feat_user_v2 - Owned by the ML team (not yet staffed; ml schema is empty for now but reserved)
9. Push-back to .104 — the analytics_db pattern
Apps need their own intelligence at low latency. Cross-server queries (.104 ↔ .106) over WAN are too slow. So: selected marts get pushed back to a dedicated analytics_db on .104 (and replicated to .105).
9.1 What gets pushed back
- Per-app analytics: each app's API gets a schema (e.g.,
analytics_db.tagme.*,analytics_db.bidbaas.*) with the aggregates it needs to render in-app dashboards. - Anonymised aggregates only. No row-level PII even after our anonymisation — these are aggregate counters / rollups (e.g., "how many check-ins did this user have this week", "average bid value in this category", "your spend vs. category median").
- Per-user views are pre-computed for the user_hash that the app holds locally. The app calls
analytics_db.tagme.user_summary WHERE user_hash = $1. Fast.
9.2 Schema sketch
analytics_db
├── tagme.user_summary (user_hash, checkins_7d, checkins_30d, top_category, top_city)
├── tagme.poi_popularity (poi_id, checkins_7d, checkins_30d, trending_score)
├── bidbaas.user_summary (user_hash, bids_30d, wins_30d, total_spend_zar_30d, category_mix)
├── bidbaas.category_pulse (category, median_winning_bid_zar, win_rate_30d, supply_30d)
├── sdpkt.user_summary (user_hash, sends_30d, receives_30d, top_merchant_category)
├── txtme.user_summary (user_hash, convs_30d, messages_30d, top_lang, response_time_p50_min)
├── takemehome.user_summary (user_hash, searches_30d, clicks_30d, top_destination_city)
├── whatwewant.user_summary (user_hash, lists_30d, items_30d, fulfilled_30d, top_category)
├── shared.user_lifecycle (user_hash, stage, days_since_signup, apps_30d_array)
├── shared.cross_app_funnel (funnel_id, user_hash, step, entered_at, completed)
└── shared.fx_rates_daily (currency_pair, date, rate_close)
9.3 Push mechanism
Nightly (or per-mart-cadence) job runs on .106:
-- Simplified — real version uses dbt_external_tables or postgres_fdw
INSERT INTO analytics_db.tagme.user_summary
SELECT user_hash, checkins_7d, ... FROM marts.agg_tagme_user_summary
ON CONFLICT (user_hash) DO UPDATE SET ...
Implementation options:
- postgres_fdw —
data_warehouseopens a foreign data wrapper to.104 analytics_db, INSERTs over the network. Easiest. - logical replication — set up
analytics_dbas a logical subscriber to specific tables indata_warehouse. Most automatic. - batch CSV via S3/MinIO — dump to file,
COPY FROMon.104. Most decoupled.
Recommendation: start with postgres_fdw for simplicity, migrate to logical replication when volume justifies.
9.4 What the apps see
The apps' Postgres user (per-app, least-privilege) gets SELECT on its schema in analytics_db. They read their own summaries directly:
// In TagMeAPI
var summary = await db.QuerySingleAsync<UserSummary>(
"SELECT * FROM analytics.tagme_user_summary WHERE user_hash = @h",
new { h = HashUser(currentUser.Id) });
App-side hashing is identical to the pipeline's (same salt, same algorithm) so the lookup just works.
10. Orchestration
10.1 Tool choice
- dbt Core CLI in cron for the transformation chain (raw → staging → intermediate → marts → analytics). Already installed in
dbt_transformcontainer. - A small Python extract service for the .105 → raw.* hop. Either:
- Custom (~200 LOC, full control), or
- Airbyte (a folder at
/home/geektrading/airbyte/exists — possibly already partially configured; verify before committing).
- n8n for event-driven side workflows (e.g., "when a Wolverine incident is logged, trigger an ad-hoc dbt run on impacted models"). NOT for the main batch pipeline.
Why not n8n for the whole pipeline? n8n is excellent for spiky event-driven flows but worse than dbt for hundreds of dependent SQL transformations. Use both, in their lanes.
10.2 Schedule
| Job | When | Runs as |
|---|---|---|
Extract event tables (tagme.check_ins, bidbaas.bids, etc.) |
every 15 min | etl_user on .106 |
Extract dimension snapshots (auth.users, apps.app_metadata) |
hourly | etl_user |
dbt staging + intermediate runs |
every 30 min | dwh_user |
dbt marts runs |
every hour | dwh_user |
dbt analytics views |
always-fresh views (no schedule) | dwh_user |
Push-back to .104 analytics_db |
hourly (offset 15 min after marts) | pushback_user |
| FX rate ingest from exchangerate.host | daily 02:00 SAST | etl_user |
| Holiday calendar refresh | weekly Sun 03:00 SAST | etl_user |
| Right-to-be-forgotten sweep | nightly 04:00 SAST | compliance_user |
| dbt test suite (full) | nightly 05:00 SAST | dwh_user |
| Re-key (salt rotation) | annually, scheduled change-window | compliance_user |
10.3 PgBouncer / connection discipline
Per deploy-pattern-pgbouncer-cascade.md:
- All ETL connections to
.105go through PgBouncer - Read-only
replicatoruser fromDeployment/deployment-credentials.ps1is reused (already has SELECT on all DBs) - Max parallel connections per ETL job: 4
- Never restart the ETL service between 06:00 and 22:00 SAST without notifying the team — burst of reconnects can trip the PgBouncer cache cascade
11. Monitoring + data quality
11.1 Tests (in dbt)
Every model has at minimum:
not_nullon all primary keysuniqueon natural keysaccepted_valueson enums- Freshness check:
staging.stg_*must have a row withevent_timewithin the last hour (else warn) - K-anonymity check on any
marts.*that aggregates users (warn at k<10, fail at k<5) - For
intermediate.*— PII-absence regex check on every column
11.2 Observability
- Prometheus scrapes the extract service + dbt run job for: rows ingested, dbt model run time, test pass/fail counts
- Grafana dashboard "BI Pipeline Health" — should exist; build it as Phase 0
- Loki captures dbt logs for ad-hoc debugging
- Alertmanager rules:
- Any
intermediate.*PII check fails → page on-call (P1) - Extract job hasn't run in >2× expected interval → page on-call (P1)
- dbt test suite freshness fails on revenue-critical marts (ledger/sdpkt/bidbaas) → page on-call (P1)
- K-anonymity warn → ticket (P2)
- Disk free on
.106 data_warehouse< 30 GB → ticket (P2)
- Any
11.3 Lineage
dbt auto-generates lineage. Hosted at superset:5003 or via dbt docs serve. Every analyst should know how to find: "which sources contribute to marts.fct_bidbaas_auctions?"
12. Implementation phases
The pipeline is too big to land in one drop. Phased, each phase independently shippable.
Phase 0 — Foundation (1 week)
- Verify postgres roles + connection strings for
data_warehouse,markets_db,superset_db - Document the existing dbt models (the five running
marts.fct_*and twostaging.*) - Set up the Grafana "BI Pipeline Health" dashboard scraping what's already there
- Audit n8n's existing 55 MB sqlite — what workflows are running? Document or retire each.
- Create
compliance.token_vaultDB on.118(empty, just the schema) - Acceptance: repo has accurate docs of current state; ops can see pipeline health
Phase 1 — Extract framework (1 week)
- Build the Python extract service (or wire Airbyte if it's already half-set-up)
- Configure the source registry table (§5) as the single config file the extract reads
- Ship extracts for the 3 revenue-critical sources first:
ledger_db,sdpkt_db,bidbaas_db - Land in
raw.*on.106 - Acceptance: every 5 min, fresh rows from those 3 sources land in
raw.*
Phase 2 — Anonymisation layer (1 week)
- Implement the hash standard + the compliance.token_vault helper functions
- Write
intermediate.int_users_anonjoiningauth_db.usersacross all sources with deterministic hashing - Write the PII-absence dbt test macro
- Acceptance: an external reviewer audits a sample of
intermediate.*rows and confirms no PII leaks
Phase 3 — Initial marts + push-back (2 weeks)
- Build the 5 highest-value marts (in order):
marts.fct_ledger_transactions_anonmarts.fct_bidbaas_auctionsmarts.fct_sdpkt_paymentsmarts.dim_user_360_anonmarts.agg_revenue_daily_app_country
- Wire postgres_fdw to
.104 analytics_db - Push the first user-summary back (
analytics.bidbaas_user_summary) - Have BidBaas read it
- Acceptance: BidBaas in-app shows "your bids this week" sourced from analytics_db
Phase 4 — Coverage expansion (2 weeks)
- Add the rest of the source DBs to the extract registry
- Build remaining marts (TagMe deeper, TxTMe! metadata, WhatWeWant, Takemehome, etc.)
- Bring all per-app
analytics.*_user_summaryviews online - Acceptance: every app has an analytics tab populated from
analytics_db
Phase 5 — Enrichment expansion (1 week)
- Geographic enrichment (via GeoGlobal) on all user-location events
- Temporal enrichment (holidays, paydays)
- FX enrichment (either fixed via exchangerate.host, or wired to a repaired Markets API)
- Cross-app session derivation
- Acceptance:
marts.agg_funnel_app_to_appshows real cross-app journeys
Phase 6 — BI surfaces (1-2 weeks)
- BigBruh! "Analytics" tab (Superset embed) — replaces "live ops counters" with rich historical views
- Per-app dashboards in Superset
- Public-facing aggregate dashboards (if/where wanted) — e.g., DataAcuity portal could show anonymised national-level TGN economy metrics
- Acceptance: BigBruh! Analytics page ships; team retires the "live counters only" view
Phase 7 — ML feature stores (open-ended)
- Define the first two
ml.feat_*tables — user-churn features + fraud features - Wire to a model-training pipeline (out of scope here)
- Acceptance: ML team has a documented entry point with leakage-tested features
13. Open questions to resolve before Phase 0
- Markets API repair vs. external FX source — decide before Phase 5
- postgres_fdw vs. logical replication for push-back — decide during Phase 3
- n8n's role — keep, retire, or grow? Decide during Phase 0 audit
- Public-facing aggregate dashboards — is this a TGN product? Decide before Phase 6
- ML team ownership of
ml.*— who, when? Defer until Phase 7
14. Cross-references
GeoGlobal_README.md— the geo enrichment service used in §7.1GeoGlobal_API_Reference.md— exact endpoints called during enrichmentGeoGlobal_Data_Schema.md— thegeo_db.geonamesandinteresting_locationstablesDeployment/deployment-credentials.ps1—REPLICATION_USERandreplicatorpassword used by extract service.claude-memory/deploy-pattern-pgbouncer-cascade.md— connection discipline rules.claude-memory/banking-compliance-rules.md— POPIA / FICA constraints we honour in §6code/Config/TheGeekNetworkServices.json— single source of truth for app metadata that powersmarts.dim_app_metadata