Skip to content
DA DataAcuity by The Geek Network

DataAcuity — BI Anonymisation Standard

Status: 🔵 DESIGNED — to be reviewed and signed off by compliance reviewer before BI Pipeline Phase 2 ships Owner: Tinashe Bhengu Applies to: Every intermediate.* model in the data warehouse and everything downstream (marts.*, analytics.*, ml.*, push-back to .104 analytics_db) Companion docs: DataAcuity_BI_Pipeline.md §6 (overview), DataAcuity_BI_Compliance_Map.md (regulator clause mapping)

This document is the single specification that the anonymisation layer must implement and that compliance review must verify against. It exists in isolation from the BI Pipeline architecture doc so legal/compliance reviewers can audit just this surface without wading through the rest.


1. Principles

The pipeline obeys five non-negotiable principles. Everything else in this doc is consequence.

  1. PII is suppressed at the intermediate.* boundary. Downstream models MUST NOT contain raw PII. Ever. A dbt test (§7) verifies this on every model.
  2. Joins survive anonymisation. The same user's events in different source DBs must be linkable in the warehouse — but only by an irreversible hash. The salt isn't accessible to analysts.
  3. Reversibility is a separate privilege. Some fields (SA ID, IBAN) are tokenised (reversible by compliance under audit), not hashed. Tokens are stored in compliance.token_vault on .118, accessed only via break-glass procedure.
  4. Generalisation, not pseudonymisation, for quasi-identifiers. DOBs become age bands, GPS becomes a 1 km grid, addresses become cities. We don't try to hash these — generalisation gives better privacy and equivalent analytical utility.
  5. K-anonymity ≥ 5 on any user aggregate. Every group (city × age band × day, etc.) has at least 5 distinct users, or the row is suppressed/generalised further.

2. Field classification

Every field in any source DB falls into one of seven classes. The class determines treatment.

Class Definition Strategy Examples
A. Direct identifier — joinable Uniquely identifies a person AND we need to join their events across sources Hash (deterministic, salted SHA-256) user_id, email, phone, device_id, wallet_id
B. Direct identifier — recoverable Uniquely identifies AND legal/compliance may need to recover the original under court order Tokenise (sequential token in compliance.token_vault) sa_id, passport_number, iban, card_pan_last4, merchant_legal_id
C. Quasi-identifier Doesn't identify alone but can in combination Generalise to coarser bucket date_of_birth → age band; address → city; gps_lat/lng → 0.01° grid; ip_address → city, ISP class
D. Sensitive content Free text/media containing arbitrary PII Suppress (drop entirely) message bodies, voice notes, photos, document scans, biometric templates, comment text
E. Sensitive attribute Not identifying alone but high-confidentiality Pass through but log access KYC tier, kyc_failure_reason, account_status, fraud_flag, criminal_record_flag
F. Behavioural Low PII risk in aggregate Pass through timestamps, click counts, route choices, scroll positions, dwell times
G. Operational No PII Pass through unchanged server_id, app_version, country (already coarse), language

When a field doesn't obviously fit, default to D (suppress) and reclassify only after explicit review.

3. The hashing standard (Class A)

3.1 Algorithm

hash(salt, identifier) = encode(digest(salt || ':' || identifier::text, 'sha256'), 'hex')
  • salt is a per-source-type secret stored in compliance.salts table on .118. There is one salt per identifier kind (one for user_ids, one for device_ids, one for emails, etc.).
  • Salt length: 32 random bytes (256 bits), base64-encoded.
  • Hash output length: 64 hex chars (full SHA-256). marts.* MAY truncate to 16 hex chars (still 2⁶⁴ collision space) for compactness; intermediate.* keeps full length.
  • No client-side salt visibility. dbt models reference the salt via {{ env_var('SALT_USERS') }} — env vars are read at run time from the orchestrator secret manager, NEVER in version-controlled files.

3.2 Salt rotation

Trigger Action Effect
Annual scheduled rotation Generate new salt, mark old as is_retired = true with rotated_on date All hashes computed after rotation use the new salt; old hashes still join within their era
Suspected salt leak Same as above + immediately rebuild affected intermediate.* and downstream models with new salt Cuts the inference chain for an attacker who got the old salt
Salt-derived hash collision found Should never happen at our scale; if it does, escalate to security review

Salts MUST NOT be exported, backed up to insecure locations, or visible in logs.

3.3 What you can / cannot do with hashes

Operation OK? Notes
JOIN ON staging.A.user_hash = staging.B.user_hash This is the whole point — cross-source joins still work
GROUP BY user_hash (per-user aggregates) But the aggregated rows still count as user-level data — apply k-anonymity (§5)
WHERE user_hash = '<specific hash>' (lookup) ⚠️ Allowed for ops/debug; logged to audit
Reverse hash to email/user_id Mathematically infeasible without the salt; even with salt, this is the break-glass path through compliance
Truncate hash below 16 chars Increases collision risk; standardise on 16 or 64

4. The tokenisation standard (Class B)

4.1 Mechanism

  1. Source row arrives with a Class B field (e.g., sa_id = '8203035811088')
  2. intermediate.* model calls compliance.tokenise_sa_id('8203035811088') (a stored function)
  3. The function:
    • Looks up the existing token if present, returns it
    • Else generates nextval('compliance.sa_id_token_seq') (a bigint), inserts (token, encrypted_value, created_at) into compliance.token_vault.sa_id, returns the token
  4. intermediate.* model stores only the token
  5. Original value is encrypted at rest in token_vault using the compliance team's KMS key

4.2 Token format

  • bigint, monotonically increasing per token family
  • Tokens NEVER appear in marts.* — only in intermediate.* where access is locked down
  • Token families: sa_id, passport, iban, card_pan_last4, merchant_legal_id

4.3 Recovery (break-glass)

The only legitimate recovery paths:

  1. Court order / regulatory request — compliance officer files an internal ticket with the legal reference, two-person sign-off (compliance + executive), recovery happens via a logged stored procedure
  2. User support — user requested data export — same flow but with the user's verifiable consent record attached
  3. Fraud investigation — same flow but with fraud team as the second signer instead of executive

Every recovery is logged in compliance.token_access_audit with: who, when, which token, what reason, attached ticket reference. Retained 7 years.

5. Generalisation rules (Class C)

Source Becomes Rationale
date_of_birth (date) age_band{<18, 18-24, 25-34, 35-44, 45-54, 55-64, 65+} DOB + city is identifying; age band is not
date_of_birth (where age is the actual analytical need) age_years truncated to integer But check k-anonymity per (age_years, city, …)
address (full street) city, country. Suburb level OK if city has ≥10,000 users in the dataset Suburb in small towns identifies
postcode (4-digit SA) First 2 digits only OR drop Full postcode + age + sex is uniquely identifying
gps_lat, gps_lng (user position) round to 0.01° (~1.1 km grid) Sub-km GPS over a week traces a person to their home
gps_lat, gps_lng (POI position) full precision POIs are public, not personal
ip_address city, country, isp_class via geoip lookup. IP itself dropped IP is a strong identifier
device_id, idfa, gaid hash (Class A treatment) Cross-device but pseudonymous
user_agent parsed to device_class{android-recent, android-old, ios-recent, ios-old, web-desktop, web-mobile, web-tablet, other} Full UA fingerprints
phone_number hash + country_code, mobile_network retained as separate columns Hash for joins, country/network for analytics
email hash + email_domain retained Domain is useful for cohort analysis without identifying
name_first, name_last, display_name suppress (Class D) Never needed for analytics
dob-based age in low-volume cells round to 5-year band k-anonymity protection
timestamp (high-cardinality marts) round to hour for marts where day-level grouping creates groups of <5 Time uniqueness identifies in small datasets

6. Suppression rules (Class D)

These fields NEVER leave their source DB. They MAY be processed by separate, more tightly governed pipelines (moderation, fraud) under explicit consent, but never the BI pipeline.

  • Message bodies (TxTMe!, MessagingAPI, BidBaas chat, SleptOn reviews)
  • Voice notes, audio recordings, video files
  • Photos / images / document scans (KYC documents especially)
  • Biometric templates (face embeddings, fingerprint vectors, voice prints)
  • Free-text fields users can write in (display name nickname, profile bio, item descriptions, etc.) — unless the field is bounded to a controlled vocabulary
  • OAuth refresh tokens, session tokens, password hashes
  • Card numbers (full PAN), CVVs (we never store these anyway, but verify in any source)
  • Government ID images
  • Health-related notes (Panik emergency call notes can contain medical info)

Test: the extract job has a deny list of column names per source table. Any new column added to a source DB that matches a deny pattern (*_body, *_text, *_note, *_image, *_blob, password*, token*, secret*, _pem, _key) is automatically suppressed at extract time AND flagged for review.

7. PII-absence dbt tests

Every model in intermediate.* and downstream MUST pass these tests:

7.1 Column-name check

tests/generic/test_no_pii_columns.sql:

-- Macro: assert no column name matches a forbidden pattern
{% test no_pii_columns(model) %}
WITH offenders AS (
  SELECT column_name
  FROM information_schema.columns
  WHERE table_schema = '{{ model.schema }}' AND table_name = '{{ model.identifier }}'
    AND lower(column_name) ~ '^(name_first|name_last|display_name|email|phone|sa_id|passport|iban|card_pan|address_line|gps_lat|gps_lng|ip_address|user_agent|dob|date_of_birth|password|token|secret|api_key)$'
)
SELECT * FROM offenders
{% endtest %}

A failing test blocks the model from publishing.

7.2 Content regex check (sampled)

tests/generic/test_no_pii_content.sql:

-- Macro: assert string columns don't contain values that look like PII
-- Sample 1000 rows per column; if any match, fail.
{% test no_pii_content(model) %}
WITH s AS (
  SELECT * FROM {{ model }} LIMIT 1000
), offenders AS (
  SELECT 'email pattern' AS issue, column_name, val
  FROM (
    SELECT column_name, (s.*::text) AS val FROM s, lateral (
      SELECT key AS column_name FROM jsonb_each_text(to_jsonb(s))
    ) c
  ) flat
  WHERE val ~ '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'
  UNION ALL
  SELECT 'SA ID pattern' AS issue, column_name, val FROM (...)
    WHERE val ~ '^\d{13}$'
  UNION ALL
  SELECT 'IBAN pattern', ... WHERE val ~ '^[A-Z]{2}\d{2}[A-Z0-9]{4}\d{7}([A-Z0-9]?){0,16}$'
  -- ... phone, card, etc.
)
SELECT * FROM offenders
{% endtest %}

False positives possible (e.g., a transaction reference that looks like a phone number); compliance reviews each finding.

7.3 K-anonymity check

tests/generic/test_k_anon.sql:

{% test k_anon(model, group_by_columns, k=5) %}
SELECT * FROM (
  SELECT {{ group_by_columns | join(', ') }}, COUNT(DISTINCT user_hash) AS n
  FROM {{ model }}
  GROUP BY {{ group_by_columns | join(', ') }}
) WHERE n < {{ k }}
{% endtest %}

Applied to every marts.agg_* and marts.fct_* that has a user_hash column. Default k=5; sensitive marts may require k=10 or higher (specified per model).

7.4 Test execution

  • Tests run as part of every dbt build (which is what cron schedules)
  • A failing PII column or content test = build fails, no new data published, on-call paged
  • A failing k-anon test = build warns, ticket created, data published with affected rows suppressed via row filter

8. Salt vault — compliance.salts

Lives in compliance_db on .118. Schema:

CREATE TABLE compliance.salts (
  identifier_kind TEXT PRIMARY KEY,    -- 'users', 'devices', 'emails', ...
  salt_base64 TEXT NOT NULL,           -- 32 bytes base64
  is_retired BOOLEAN NOT NULL DEFAULT false,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  rotated_on TIMESTAMPTZ NULL,
  rotated_by TEXT NULL                 -- audit who rotated
);

Access:

  • dbt_runner service account: SELECT on salts WHERE NOT is_retired (read-only, current salt only)
  • compliance_admin group: full SELECT + UPDATE for rotation
  • audit_reader: read rotated_on, rotated_by only — for compliance reports

9. Token vault — compliance.token_vault

Lives in compliance_db on .118. One table per token family:

CREATE TABLE compliance.token_vault_sa_id (
  token BIGINT PRIMARY KEY,
  encrypted_value BYTEA NOT NULL,      -- AES-256-GCM with per-row IV
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  source_dbs TEXT[] NOT NULL           -- which source DBs have referenced this token
);

CREATE TABLE compliance.token_access_audit (
  id BIGSERIAL PRIMARY KEY,
  accessed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  accessed_by TEXT NOT NULL,
  token_family TEXT NOT NULL,
  token BIGINT NOT NULL,
  recovery_reason TEXT NOT NULL,
  ticket_reference TEXT NOT NULL,
  second_signer TEXT NOT NULL
);

Encryption key in dedicated KMS (or gpg-agent on .118 for initial implementation). Never in DB, never on disk in plaintext.

10. Right-to-be-forgotten flow

When a user requests deletion (POPIA Sec 24, GDPR Art 17, CCPA):

  1. AuthAPI on .104 sets users.deleted_at = NOW(), deletion_reason = 'rtbf_request_<ticket>' and emits an event to apps_db.app_events with type user.rtbf.requested
  2. Extract job picks up the change on next 15-min cycle, lands in raw.auth_db__users and raw.apps_db__app_events
  3. dbt model intermediate.deletions_pending collects all user_hash values belonging to deleted users across all source DBs
  4. Retention job (nightly 04:00 SAST) runs:
    • DELETE FROM raw.*__check_ins WHERE user_id IN (SELECT user_id FROM users_pending_delete)
    • DELETE FROM raw.*__messages_metadata WHERE user_id IN (...)
    • ...etc for every source table that has user data
    • DELETE FROM compliance.token_vault_sa_id WHERE token IN (SELECT token FROM sa_id_tokens_pending_delete)
    • DELETE FROM analytics_db.*.user_summary WHERE user_hash IN (SELECT user_hash FROM users_pending_delete)
  5. dbt re-run of all dependent models so aggregates exclude the deleted users
  6. Verification job confirms zero rows remain referencing the deleted user across all schemas, writes a compliance.deletion_completion record with timestamp
  7. Confirmation sent to the AuthAPI which forwards to the user via the original support channel

SLA: 30 days from request to last trace removed. Tracked via compliance.deletion_completion.completed_at - users.deletion_requested_at.

10.1 What can't be deleted

Some records have legal retention requirements that override RTBF:

  • Financial transactions (FICA, 5 years) — kept in compliance.retained_transactions with all identifiers tokenised (Class B). User identity recoverable only via compliance break-glass.
  • Tax records — kept per SARS requirements (5 years).
  • Court-ordered preservation — frozen via compliance.legal_hold flag.

Anonymised aggregates that have lost the user identity by construction (e.g., marts.agg_revenue_daily_app_country) are NOT affected by RTBF — they don't represent the user, just aggregate counts.

11. Compliance review process

Before any new intermediate.* model goes live:

  1. Author submits PR with: model SQL, list of source columns + their Class (per §2), justification for any Class A/B/C choices, sample anonymised output (10 rows)
  2. Compliance reviewer (named role, see open question §10 in Security_Posture) checks:
    • Every source column is correctly classified
    • Class C generalisation is sufficient (not too fine-grained)
    • Class D fields are actually suppressed (not just renamed)
    • PII-absence tests are present
    • K-anonymity test is present where applicable
  3. Either: approves, requests changes, or escalates to a second reviewer
  4. On approval, the PR can be merged; the model deploys via the standard dbt build

This process MUST NOT be skipped. A model that bypasses review is an audit finding.

12. Glossary

Term Meaning
PII Personally Identifiable Information
Quasi-identifier A field that doesn't identify alone but does in combination with others (e.g., DOB + zip + sex)
k-anonymity Privacy guarantee: every group of records sharing the same quasi-identifier values has ≥ k members
Salt Random secret added to a hash input to prevent rainbow-table attacks and to defeat known-plaintext lookups
Tokenisation Replacing a sensitive value with a non-sensitive substitute (token), with the mapping stored separately under access controls
Generalisation Replacing a specific value with a less specific bucket (DOB → age band)
Suppression Dropping a field entirely (never extracting it from source)
Break-glass A controlled, audited procedure for bypassing normal access controls in a defined emergency
RTBF Right To Be Forgotten (GDPR Art. 17 / POPIA Sec. 24)

13. Change log

Date Change By
2026-05-28 Initial document — extracted standard from BI Pipeline §6 for compliance review Tinashe Bhengu
Something went wrong on this page. Reload