DataAcuity — BI Anonymisation Standard

Status: 🔵 DESIGNED — to be reviewed and signed off by compliance reviewer before BI Pipeline Phase 2 ships Owner: Tinashe Bhengu Applies to: Every intermediate.* model in the data warehouse and everything downstream (marts.*, analytics.*, ml.*, push-back to .104 analytics_db) Companion docs: DataAcuity_BI_Pipeline.md §6 (overview), DataAcuity_BI_Compliance_Map.md (regulator clause mapping)

This document is the single specification that the anonymisation layer must implement and that compliance review must verify against. It exists in isolation from the BI Pipeline architecture doc so legal/compliance reviewers can audit just this surface without wading through the rest.

1. Principles

The pipeline obeys five non-negotiable principles. Everything else in this doc is consequence.

PII is suppressed at the intermediate.* boundary. Downstream models MUST NOT contain raw PII. Ever. A dbt test (§7) verifies this on every model.
Joins survive anonymisation. The same user's events in different source DBs must be linkable in the warehouse — but only by an irreversible hash. The salt isn't accessible to analysts.
Reversibility is a separate privilege. Some fields (SA ID, IBAN) are tokenised (reversible by compliance under audit), not hashed. Tokens are stored in compliance.token_vault on .118, accessed only via break-glass procedure.
Generalisation, not pseudonymisation, for quasi-identifiers. DOBs become age bands, GPS becomes a 1 km grid, addresses become cities. We don't try to hash these — generalisation gives better privacy and equivalent analytical utility.
K-anonymity ≥ 5 on any user aggregate. Every group (city × age band × day, etc.) has at least 5 distinct users, or the row is suppressed/generalised further.

2. Field classification

Every field in any source DB falls into one of seven classes. The class determines treatment.

Class	Definition	Strategy	Examples
A. Direct identifier — joinable	Uniquely identifies a person AND we need to join their events across sources	Hash (deterministic, salted SHA-256)	`user_id`, `email`, `phone`, `device_id`, `wallet_id`
B. Direct identifier — recoverable	Uniquely identifies AND legal/compliance may need to recover the original under court order	Tokenise (sequential token in `compliance.token_vault`)	`sa_id`, `passport_number`, `iban`, `card_pan_last4`, `merchant_legal_id`
C. Quasi-identifier	Doesn't identify alone but can in combination	Generalise to coarser bucket	`date_of_birth` → age band; `address` → city; `gps_lat/lng` → 0.01° grid; `ip_address` → city, ISP class
D. Sensitive content	Free text/media containing arbitrary PII	Suppress (drop entirely)	message bodies, voice notes, photos, document scans, biometric templates, comment text
E. Sensitive attribute	Not identifying alone but high-confidentiality	Pass through but log access	KYC tier, kyc_failure_reason, account_status, fraud_flag, criminal_record_flag
F. Behavioural	Low PII risk in aggregate	Pass through	timestamps, click counts, route choices, scroll positions, dwell times
G. Operational	No PII	Pass through unchanged	server_id, app_version, country (already coarse), language

When a field doesn't obviously fit, default to D (suppress) and reclassify only after explicit review.

3. The hashing standard (Class A)

3.1 Algorithm

hash(salt, identifier) = encode(digest(salt || ':' || identifier::text, 'sha256'), 'hex')

salt is a per-source-type secret stored in compliance.salts table on .118. There is one salt per identifier kind (one for user_ids, one for device_ids, one for emails, etc.).
Salt length: 32 random bytes (256 bits), base64-encoded.
Hash output length: 64 hex chars (full SHA-256). marts.* MAY truncate to 16 hex chars (still 2⁶⁴ collision space) for compactness; intermediate.* keeps full length.
No client-side salt visibility. dbt models reference the salt via {{ env_var('SALT_USERS') }} — env vars are read at run time from the orchestrator secret manager, NEVER in version-controlled files.

3.2 Salt rotation

Trigger	Action	Effect
Annual scheduled rotation	Generate new salt, mark old as `is_retired = true` with rotated_on date	All hashes computed after rotation use the new salt; old hashes still join within their era
Suspected salt leak	Same as above + immediately rebuild affected `intermediate.*` and downstream models with new salt	Cuts the inference chain for an attacker who got the old salt
Salt-derived hash collision found	Should never happen at our scale; if it does, escalate to security review	—

Salts MUST NOT be exported, backed up to insecure locations, or visible in logs.

3.3 What you can / cannot do with hashes

Operation	OK?	Notes
`JOIN ON staging.A.user_hash = staging.B.user_hash`	✅	This is the whole point — cross-source joins still work
`GROUP BY user_hash` (per-user aggregates)	✅	But the aggregated rows still count as user-level data — apply k-anonymity (§5)
`WHERE user_hash = '<specific hash>'` (lookup)	⚠️	Allowed for ops/debug; logged to audit
Reverse hash to email/user_id	❌	Mathematically infeasible without the salt; even with salt, this is the break-glass path through compliance
Truncate hash below 16 chars	❌	Increases collision risk; standardise on 16 or 64

4. The tokenisation standard (Class B)

4.1 Mechanism

Source row arrives with a Class B field (e.g., sa_id = '8203035811088')
intermediate.* model calls compliance.tokenise_sa_id('8203035811088') (a stored function)
The function:
- Looks up the existing token if present, returns it
- Else generates nextval('compliance.sa_id_token_seq') (a bigint), inserts (token, encrypted_value, created_at) into compliance.token_vault.sa_id, returns the token
intermediate.* model stores only the token
Original value is encrypted at rest in token_vault using the compliance team's KMS key

4.2 Token format

bigint, monotonically increasing per token family
Tokens NEVER appear in marts.* — only in intermediate.* where access is locked down
Token families: sa_id, passport, iban, card_pan_last4, merchant_legal_id

4.3 Recovery (break-glass)

The only legitimate recovery paths:

Court order / regulatory request — compliance officer files an internal ticket with the legal reference, two-person sign-off (compliance + executive), recovery happens via a logged stored procedure
User support — user requested data export — same flow but with the user's verifiable consent record attached
Fraud investigation — same flow but with fraud team as the second signer instead of executive

Every recovery is logged in compliance.token_access_audit with: who, when, which token, what reason, attached ticket reference. Retained 7 years.

5. Generalisation rules (Class C)

Source	Becomes	Rationale
`date_of_birth` (date)	`age_band` ∈ `{<18, 18-24, 25-34, 35-44, 45-54, 55-64, 65+}`	DOB + city is identifying; age band is not
`date_of_birth` (where age is the actual analytical need)	`age_years` truncated to integer	But check k-anonymity per (age_years, city, …)
`address` (full street)	`city`, `country`. Suburb level OK if city has ≥10,000 users in the dataset	Suburb in small towns identifies
`postcode` (4-digit SA)	First 2 digits only OR drop	Full postcode + age + sex is uniquely identifying
`gps_lat`, `gps_lng` (user position)	round to `0.01°` (~1.1 km grid)	Sub-km GPS over a week traces a person to their home
`gps_lat`, `gps_lng` (POI position)	full precision	POIs are public, not personal
`ip_address`	`city`, `country`, `isp_class` via geoip lookup. IP itself dropped	IP is a strong identifier
`device_id`, `idfa`, `gaid`	hash (Class A treatment)	Cross-device but pseudonymous
`user_agent`	parsed to `device_class` ∈ `{android-recent, android-old, ios-recent, ios-old, web-desktop, web-mobile, web-tablet, other}`	Full UA fingerprints
`phone_number`	hash + `country_code`, `mobile_network` retained as separate columns	Hash for joins, country/network for analytics
`email`	hash + `email_domain` retained	Domain is useful for cohort analysis without identifying
`name_first`, `name_last`, `display_name`	suppress (Class D)	Never needed for analytics
`dob`-based age in low-volume cells	round to 5-year band	k-anonymity protection
`timestamp` (high-cardinality marts)	round to hour for marts where day-level grouping creates groups of <5	Time uniqueness identifies in small datasets

6. Suppression rules (Class D)

These fields NEVER leave their source DB. They MAY be processed by separate, more tightly governed pipelines (moderation, fraud) under explicit consent, but never the BI pipeline.

Message bodies (TxTMe!, MessagingAPI, BidBaas chat, SleptOn reviews)
Voice notes, audio recordings, video files
Photos / images / document scans (KYC documents especially)
Biometric templates (face embeddings, fingerprint vectors, voice prints)
Free-text fields users can write in (display name nickname, profile bio, item descriptions, etc.) — unless the field is bounded to a controlled vocabulary
OAuth refresh tokens, session tokens, password hashes
Card numbers (full PAN), CVVs (we never store these anyway, but verify in any source)
Government ID images
Health-related notes (Panik emergency call notes can contain medical info)

Test: the extract job has a deny list of column names per source table. Any new column added to a source DB that matches a deny pattern (*_body, *_text, *_note, *_image, *_blob, password*, token*, secret*, _pem, _key) is automatically suppressed at extract time AND flagged for review.

7. PII-absence dbt tests

Every model in intermediate.* and downstream MUST pass these tests:

7.1 Column-name check

tests/generic/test_no_pii_columns.sql:

-- Macro: assert no column name matches a forbidden pattern
{% test no_pii_columns(model) %}
WITH offenders AS (
  SELECT column_name
  FROM information_schema.columns
  WHERE table_schema = '{{ model.schema }}' AND table_name = '{{ model.identifier }}'
    AND lower(column_name) ~ '^(name_first|name_last|display_name|email|phone|sa_id|passport|iban|card_pan|address_line|gps_lat|gps_lng|ip_address|user_agent|dob|date_of_birth|password|token|secret|api_key)$'
)
SELECT * FROM offenders
{% endtest %}

A failing test blocks the model from publishing.

7.2 Content regex check (sampled)

tests/generic/test_no_pii_content.sql:

-- Macro: assert string columns don't contain values that look like PII
-- Sample 1000 rows per column; if any match, fail.
{% test no_pii_content(model) %}
WITH s AS (
  SELECT * FROM {{ model }} LIMIT 1000
), offenders AS (
  SELECT 'email pattern' AS issue, column_name, val
  FROM (
    SELECT column_name, (s.*::text) AS val FROM s, lateral (
      SELECT key AS column_name FROM jsonb_each_text(to_jsonb(s))
    ) c
  ) flat
  WHERE val ~ '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'
  UNION ALL
  SELECT 'SA ID pattern' AS issue, column_name, val FROM (...)
    WHERE val ~ '^\d{13}$'
  UNION ALL
  SELECT 'IBAN pattern', ... WHERE val ~ '^[A-Z]{2}\d{2}[A-Z0-9]{4}\d{7}([A-Z0-9]?){0,16}$'
  -- ... phone, card, etc.
)
SELECT * FROM offenders
{% endtest %}

False positives possible (e.g., a transaction reference that looks like a phone number); compliance reviews each finding.

7.3 K-anonymity check

tests/generic/test_k_anon.sql:

{% test k_anon(model, group_by_columns, k=5) %}
SELECT * FROM (
  SELECT {{ group_by_columns | join(', ') }}, COUNT(DISTINCT user_hash) AS n
  FROM {{ model }}
  GROUP BY {{ group_by_columns | join(', ') }}
) WHERE n < {{ k }}
{% endtest %}

Applied to every marts.agg_* and marts.fct_* that has a user_hash column. Default k=5; sensitive marts may require k=10 or higher (specified per model).

7.4 Test execution

Tests run as part of every dbt build (which is what cron schedules)
A failing PII column or content test = build fails, no new data published, on-call paged
A failing k-anon test = build warns, ticket created, data published with affected rows suppressed via row filter

8. Salt vault — `compliance.salts`

Lives in compliance_db on .118. Schema:

CREATE TABLE compliance.salts (
  identifier_kind TEXT PRIMARY KEY,    -- 'users', 'devices', 'emails', ...
  salt_base64 TEXT NOT NULL,           -- 32 bytes base64
  is_retired BOOLEAN NOT NULL DEFAULT false,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  rotated_on TIMESTAMPTZ NULL,
  rotated_by TEXT NULL                 -- audit who rotated
);

Access:

dbt_runner service account: SELECT on salts WHERE NOT is_retired (read-only, current salt only)
compliance_admin group: full SELECT + UPDATE for rotation
audit_reader: read rotated_on, rotated_by only — for compliance reports

9. Token vault — `compliance.token_vault`

Lives in compliance_db on .118. One table per token family:

CREATE TABLE compliance.token_vault_sa_id (
  token BIGINT PRIMARY KEY,
  encrypted_value BYTEA NOT NULL,      -- AES-256-GCM with per-row IV
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  source_dbs TEXT[] NOT NULL           -- which source DBs have referenced this token
);

CREATE TABLE compliance.token_access_audit (
  id BIGSERIAL PRIMARY KEY,
  accessed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  accessed_by TEXT NOT NULL,
  token_family TEXT NOT NULL,
  token BIGINT NOT NULL,
  recovery_reason TEXT NOT NULL,
  ticket_reference TEXT NOT NULL,
  second_signer TEXT NOT NULL
);

Encryption key in dedicated KMS (or gpg-agent on .118 for initial implementation). Never in DB, never on disk in plaintext.

10. Right-to-be-forgotten flow

When a user requests deletion (POPIA Sec 24, GDPR Art 17, CCPA):

AuthAPI on .104 sets users.deleted_at = NOW(), deletion_reason = 'rtbf_request_<ticket>' and emits an event to apps_db.app_events with type user.rtbf.requested
Extract job picks up the change on next 15-min cycle, lands in raw.auth_db__users and raw.apps_db__app_events
dbt model intermediate.deletions_pending collects all user_hash values belonging to deleted users across all source DBs
Retention job (nightly 04:00 SAST) runs:
- DELETE FROM raw.*__check_ins WHERE user_id IN (SELECT user_id FROM users_pending_delete)
- DELETE FROM raw.*__messages_metadata WHERE user_id IN (...)
- ...etc for every source table that has user data
- DELETE FROM compliance.token_vault_sa_id WHERE token IN (SELECT token FROM sa_id_tokens_pending_delete)
- DELETE FROM analytics_db.*.user_summary WHERE user_hash IN (SELECT user_hash FROM users_pending_delete)
dbt re-run of all dependent models so aggregates exclude the deleted users
Verification job confirms zero rows remain referencing the deleted user across all schemas, writes a compliance.deletion_completion record with timestamp
Confirmation sent to the AuthAPI which forwards to the user via the original support channel

SLA: 30 days from request to last trace removed. Tracked via compliance.deletion_completion.completed_at - users.deletion_requested_at.

10.1 What can't be deleted

Some records have legal retention requirements that override RTBF:

Financial transactions (FICA, 5 years) — kept in compliance.retained_transactions with all identifiers tokenised (Class B). User identity recoverable only via compliance break-glass.
Tax records — kept per SARS requirements (5 years).
Court-ordered preservation — frozen via compliance.legal_hold flag.

Anonymised aggregates that have lost the user identity by construction (e.g., marts.agg_revenue_daily_app_country) are NOT affected by RTBF — they don't represent the user, just aggregate counts.

11. Compliance review process

Before any new intermediate.* model goes live:

Author submits PR with: model SQL, list of source columns + their Class (per §2), justification for any Class A/B/C choices, sample anonymised output (10 rows)
Compliance reviewer (named role, see open question §10 in Security_Posture) checks:
- Every source column is correctly classified
- Class C generalisation is sufficient (not too fine-grained)
- Class D fields are actually suppressed (not just renamed)
- PII-absence tests are present
- K-anonymity test is present where applicable
Either: approves, requests changes, or escalates to a second reviewer
On approval, the PR can be merged; the model deploys via the standard dbt build

This process MUST NOT be skipped. A model that bypasses review is an audit finding.

12. Glossary

Term	Meaning
PII	Personally Identifiable Information
Quasi-identifier	A field that doesn't identify alone but does in combination with others (e.g., DOB + zip + sex)
k-anonymity	Privacy guarantee: every group of records sharing the same quasi-identifier values has ≥ k members
Salt	Random secret added to a hash input to prevent rainbow-table attacks and to defeat known-plaintext lookups
Tokenisation	Replacing a sensitive value with a non-sensitive substitute (token), with the mapping stored separately under access controls
Generalisation	Replacing a specific value with a less specific bucket (DOB → age band)
Suppression	Dropping a field entirely (never extracting it from source)
Break-glass	A controlled, audited procedure for bypassing normal access controls in a defined emergency
RTBF	Right To Be Forgotten (GDPR Art. 17 / POPIA Sec. 24)

13. Change log

Date	Change	By
2026-05-28	Initial document — extracted standard from BI Pipeline §6 for compliance review	Tinashe Bhengu