DataAcuity — BI Anonymisation Standard
Status: 🔵 DESIGNED — to be reviewed and signed off by compliance reviewer before BI Pipeline Phase 2 ships
Owner: Tinashe Bhengu
Applies to: Every intermediate.* model in the data warehouse and everything downstream (marts.*, analytics.*, ml.*, push-back to .104 analytics_db)
Companion docs: DataAcuity_BI_Pipeline.md §6 (overview), DataAcuity_BI_Compliance_Map.md (regulator clause mapping)
This document is the single specification that the anonymisation layer must implement and that compliance review must verify against. It exists in isolation from the BI Pipeline architecture doc so legal/compliance reviewers can audit just this surface without wading through the rest.
1. Principles
The pipeline obeys five non-negotiable principles. Everything else in this doc is consequence.
- PII is suppressed at the
intermediate.*boundary. Downstream models MUST NOT contain raw PII. Ever. A dbt test (§7) verifies this on every model. - Joins survive anonymisation. The same user's events in different source DBs must be linkable in the warehouse — but only by an irreversible hash. The salt isn't accessible to analysts.
- Reversibility is a separate privilege. Some fields (SA ID, IBAN) are tokenised (reversible by compliance under audit), not hashed. Tokens are stored in
compliance.token_vaulton.118, accessed only via break-glass procedure. - Generalisation, not pseudonymisation, for quasi-identifiers. DOBs become age bands, GPS becomes a 1 km grid, addresses become cities. We don't try to hash these — generalisation gives better privacy and equivalent analytical utility.
- K-anonymity ≥ 5 on any user aggregate. Every group (city × age band × day, etc.) has at least 5 distinct users, or the row is suppressed/generalised further.
2. Field classification
Every field in any source DB falls into one of seven classes. The class determines treatment.
| Class | Definition | Strategy | Examples |
|---|---|---|---|
| A. Direct identifier — joinable | Uniquely identifies a person AND we need to join their events across sources | Hash (deterministic, salted SHA-256) | user_id, email, phone, device_id, wallet_id |
| B. Direct identifier — recoverable | Uniquely identifies AND legal/compliance may need to recover the original under court order | Tokenise (sequential token in compliance.token_vault) |
sa_id, passport_number, iban, card_pan_last4, merchant_legal_id |
| C. Quasi-identifier | Doesn't identify alone but can in combination | Generalise to coarser bucket | date_of_birth → age band; address → city; gps_lat/lng → 0.01° grid; ip_address → city, ISP class |
| D. Sensitive content | Free text/media containing arbitrary PII | Suppress (drop entirely) | message bodies, voice notes, photos, document scans, biometric templates, comment text |
| E. Sensitive attribute | Not identifying alone but high-confidentiality | Pass through but log access | KYC tier, kyc_failure_reason, account_status, fraud_flag, criminal_record_flag |
| F. Behavioural | Low PII risk in aggregate | Pass through | timestamps, click counts, route choices, scroll positions, dwell times |
| G. Operational | No PII | Pass through unchanged | server_id, app_version, country (already coarse), language |
When a field doesn't obviously fit, default to D (suppress) and reclassify only after explicit review.
3. The hashing standard (Class A)
3.1 Algorithm
hash(salt, identifier) = encode(digest(salt || ':' || identifier::text, 'sha256'), 'hex')
saltis a per-source-type secret stored incompliance.saltstable on.118. There is one salt per identifier kind (one for user_ids, one for device_ids, one for emails, etc.).- Salt length: 32 random bytes (256 bits), base64-encoded.
- Hash output length: 64 hex chars (full SHA-256).
marts.*MAY truncate to 16 hex chars (still 2⁶⁴ collision space) for compactness;intermediate.*keeps full length. - No client-side salt visibility. dbt models reference the salt via
{{ env_var('SALT_USERS') }}— env vars are read at run time from the orchestrator secret manager, NEVER in version-controlled files.
3.2 Salt rotation
| Trigger | Action | Effect |
|---|---|---|
| Annual scheduled rotation | Generate new salt, mark old as is_retired = true with rotated_on date |
All hashes computed after rotation use the new salt; old hashes still join within their era |
| Suspected salt leak | Same as above + immediately rebuild affected intermediate.* and downstream models with new salt |
Cuts the inference chain for an attacker who got the old salt |
| Salt-derived hash collision found | Should never happen at our scale; if it does, escalate to security review | — |
Salts MUST NOT be exported, backed up to insecure locations, or visible in logs.
3.3 What you can / cannot do with hashes
| Operation | OK? | Notes |
|---|---|---|
JOIN ON staging.A.user_hash = staging.B.user_hash |
✅ | This is the whole point — cross-source joins still work |
GROUP BY user_hash (per-user aggregates) |
✅ | But the aggregated rows still count as user-level data — apply k-anonymity (§5) |
WHERE user_hash = '<specific hash>' (lookup) |
⚠️ | Allowed for ops/debug; logged to audit |
| Reverse hash to email/user_id | ❌ | Mathematically infeasible without the salt; even with salt, this is the break-glass path through compliance |
| Truncate hash below 16 chars | ❌ | Increases collision risk; standardise on 16 or 64 |
4. The tokenisation standard (Class B)
4.1 Mechanism
- Source row arrives with a Class B field (e.g.,
sa_id = '8203035811088') intermediate.*model callscompliance.tokenise_sa_id('8203035811088')(a stored function)- The function:
- Looks up the existing token if present, returns it
- Else generates
nextval('compliance.sa_id_token_seq')(a bigint), inserts(token, encrypted_value, created_at)intocompliance.token_vault.sa_id, returns the token
intermediate.*model stores only the token- Original value is encrypted at rest in
token_vaultusing the compliance team's KMS key
4.2 Token format
- bigint, monotonically increasing per token family
- Tokens NEVER appear in
marts.*— only inintermediate.*where access is locked down - Token families:
sa_id,passport,iban,card_pan_last4,merchant_legal_id
4.3 Recovery (break-glass)
The only legitimate recovery paths:
- Court order / regulatory request — compliance officer files an internal ticket with the legal reference, two-person sign-off (compliance + executive), recovery happens via a logged stored procedure
- User support — user requested data export — same flow but with the user's verifiable consent record attached
- Fraud investigation — same flow but with fraud team as the second signer instead of executive
Every recovery is logged in compliance.token_access_audit with: who, when, which token, what reason, attached ticket reference. Retained 7 years.
5. Generalisation rules (Class C)
| Source | Becomes | Rationale |
|---|---|---|
date_of_birth (date) |
age_band ∈ {<18, 18-24, 25-34, 35-44, 45-54, 55-64, 65+} |
DOB + city is identifying; age band is not |
date_of_birth (where age is the actual analytical need) |
age_years truncated to integer |
But check k-anonymity per (age_years, city, …) |
address (full street) |
city, country. Suburb level OK if city has ≥10,000 users in the dataset |
Suburb in small towns identifies |
postcode (4-digit SA) |
First 2 digits only OR drop | Full postcode + age + sex is uniquely identifying |
gps_lat, gps_lng (user position) |
round to 0.01° (~1.1 km grid) |
Sub-km GPS over a week traces a person to their home |
gps_lat, gps_lng (POI position) |
full precision | POIs are public, not personal |
ip_address |
city, country, isp_class via geoip lookup. IP itself dropped |
IP is a strong identifier |
device_id, idfa, gaid |
hash (Class A treatment) | Cross-device but pseudonymous |
user_agent |
parsed to device_class ∈ {android-recent, android-old, ios-recent, ios-old, web-desktop, web-mobile, web-tablet, other} |
Full UA fingerprints |
phone_number |
hash + country_code, mobile_network retained as separate columns |
Hash for joins, country/network for analytics |
email |
hash + email_domain retained |
Domain is useful for cohort analysis without identifying |
name_first, name_last, display_name |
suppress (Class D) | Never needed for analytics |
dob-based age in low-volume cells |
round to 5-year band | k-anonymity protection |
timestamp (high-cardinality marts) |
round to hour for marts where day-level grouping creates groups of <5 | Time uniqueness identifies in small datasets |
6. Suppression rules (Class D)
These fields NEVER leave their source DB. They MAY be processed by separate, more tightly governed pipelines (moderation, fraud) under explicit consent, but never the BI pipeline.
- Message bodies (TxTMe!, MessagingAPI, BidBaas chat, SleptOn reviews)
- Voice notes, audio recordings, video files
- Photos / images / document scans (KYC documents especially)
- Biometric templates (face embeddings, fingerprint vectors, voice prints)
- Free-text fields users can write in (display name nickname, profile bio, item descriptions, etc.) — unless the field is bounded to a controlled vocabulary
- OAuth refresh tokens, session tokens, password hashes
- Card numbers (full PAN), CVVs (we never store these anyway, but verify in any source)
- Government ID images
- Health-related notes (Panik emergency call notes can contain medical info)
Test: the extract job has a deny list of column names per source table. Any new column added to a source DB that matches a deny pattern (*_body, *_text, *_note, *_image, *_blob, password*, token*, secret*, _pem, _key) is automatically suppressed at extract time AND flagged for review.
7. PII-absence dbt tests
Every model in intermediate.* and downstream MUST pass these tests:
7.1 Column-name check
tests/generic/test_no_pii_columns.sql:
-- Macro: assert no column name matches a forbidden pattern
{% test no_pii_columns(model) %}
WITH offenders AS (
SELECT column_name
FROM information_schema.columns
WHERE table_schema = '{{ model.schema }}' AND table_name = '{{ model.identifier }}'
AND lower(column_name) ~ '^(name_first|name_last|display_name|email|phone|sa_id|passport|iban|card_pan|address_line|gps_lat|gps_lng|ip_address|user_agent|dob|date_of_birth|password|token|secret|api_key)$'
)
SELECT * FROM offenders
{% endtest %}
A failing test blocks the model from publishing.
7.2 Content regex check (sampled)
tests/generic/test_no_pii_content.sql:
-- Macro: assert string columns don't contain values that look like PII
-- Sample 1000 rows per column; if any match, fail.
{% test no_pii_content(model) %}
WITH s AS (
SELECT * FROM {{ model }} LIMIT 1000
), offenders AS (
SELECT 'email pattern' AS issue, column_name, val
FROM (
SELECT column_name, (s.*::text) AS val FROM s, lateral (
SELECT key AS column_name FROM jsonb_each_text(to_jsonb(s))
) c
) flat
WHERE val ~ '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'
UNION ALL
SELECT 'SA ID pattern' AS issue, column_name, val FROM (...)
WHERE val ~ '^\d{13}$'
UNION ALL
SELECT 'IBAN pattern', ... WHERE val ~ '^[A-Z]{2}\d{2}[A-Z0-9]{4}\d{7}([A-Z0-9]?){0,16}$'
-- ... phone, card, etc.
)
SELECT * FROM offenders
{% endtest %}
False positives possible (e.g., a transaction reference that looks like a phone number); compliance reviews each finding.
7.3 K-anonymity check
tests/generic/test_k_anon.sql:
{% test k_anon(model, group_by_columns, k=5) %}
SELECT * FROM (
SELECT {{ group_by_columns | join(', ') }}, COUNT(DISTINCT user_hash) AS n
FROM {{ model }}
GROUP BY {{ group_by_columns | join(', ') }}
) WHERE n < {{ k }}
{% endtest %}
Applied to every marts.agg_* and marts.fct_* that has a user_hash column. Default k=5; sensitive marts may require k=10 or higher (specified per model).
7.4 Test execution
- Tests run as part of every
dbt build(which is what cron schedules) - A failing PII column or content test = build fails, no new data published, on-call paged
- A failing k-anon test = build warns, ticket created, data published with affected rows suppressed via row filter
8. Salt vault — compliance.salts
Lives in compliance_db on .118. Schema:
CREATE TABLE compliance.salts (
identifier_kind TEXT PRIMARY KEY, -- 'users', 'devices', 'emails', ...
salt_base64 TEXT NOT NULL, -- 32 bytes base64
is_retired BOOLEAN NOT NULL DEFAULT false,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
rotated_on TIMESTAMPTZ NULL,
rotated_by TEXT NULL -- audit who rotated
);
Access:
dbt_runnerservice account: SELECT on salts WHERE NOT is_retired (read-only, current salt only)compliance_admingroup: full SELECT + UPDATE for rotationaudit_reader: readrotated_on, rotated_byonly — for compliance reports
9. Token vault — compliance.token_vault
Lives in compliance_db on .118. One table per token family:
CREATE TABLE compliance.token_vault_sa_id (
token BIGINT PRIMARY KEY,
encrypted_value BYTEA NOT NULL, -- AES-256-GCM with per-row IV
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
source_dbs TEXT[] NOT NULL -- which source DBs have referenced this token
);
CREATE TABLE compliance.token_access_audit (
id BIGSERIAL PRIMARY KEY,
accessed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
accessed_by TEXT NOT NULL,
token_family TEXT NOT NULL,
token BIGINT NOT NULL,
recovery_reason TEXT NOT NULL,
ticket_reference TEXT NOT NULL,
second_signer TEXT NOT NULL
);
Encryption key in dedicated KMS (or gpg-agent on .118 for initial implementation). Never in DB, never on disk in plaintext.
10. Right-to-be-forgotten flow
When a user requests deletion (POPIA Sec 24, GDPR Art 17, CCPA):
- AuthAPI on
.104setsusers.deleted_at = NOW(), deletion_reason = 'rtbf_request_<ticket>'and emits an event toapps_db.app_eventswith typeuser.rtbf.requested - Extract job picks up the change on next 15-min cycle, lands in
raw.auth_db__usersandraw.apps_db__app_events - dbt model
intermediate.deletions_pendingcollects alluser_hashvalues belonging to deleted users across all source DBs - Retention job (nightly 04:00 SAST) runs:
DELETE FROM raw.*__check_ins WHERE user_id IN (SELECT user_id FROM users_pending_delete)DELETE FROM raw.*__messages_metadata WHERE user_id IN (...)- ...etc for every source table that has user data
DELETE FROM compliance.token_vault_sa_id WHERE token IN (SELECT token FROM sa_id_tokens_pending_delete)DELETE FROM analytics_db.*.user_summary WHERE user_hash IN (SELECT user_hash FROM users_pending_delete)
- dbt re-run of all dependent models so aggregates exclude the deleted users
- Verification job confirms zero rows remain referencing the deleted user across all schemas, writes a
compliance.deletion_completionrecord with timestamp - Confirmation sent to the AuthAPI which forwards to the user via the original support channel
SLA: 30 days from request to last trace removed. Tracked via compliance.deletion_completion.completed_at - users.deletion_requested_at.
10.1 What can't be deleted
Some records have legal retention requirements that override RTBF:
- Financial transactions (FICA, 5 years) — kept in
compliance.retained_transactionswith all identifiers tokenised (Class B). User identity recoverable only via compliance break-glass. - Tax records — kept per SARS requirements (5 years).
- Court-ordered preservation — frozen via
compliance.legal_holdflag.
Anonymised aggregates that have lost the user identity by construction (e.g., marts.agg_revenue_daily_app_country) are NOT affected by RTBF — they don't represent the user, just aggregate counts.
11. Compliance review process
Before any new intermediate.* model goes live:
- Author submits PR with: model SQL, list of source columns + their Class (per §2), justification for any Class A/B/C choices, sample anonymised output (10 rows)
- Compliance reviewer (named role, see open question §10 in Security_Posture) checks:
- Every source column is correctly classified
- Class C generalisation is sufficient (not too fine-grained)
- Class D fields are actually suppressed (not just renamed)
- PII-absence tests are present
- K-anonymity test is present where applicable
- Either: approves, requests changes, or escalates to a second reviewer
- On approval, the PR can be merged; the model deploys via the standard dbt build
This process MUST NOT be skipped. A model that bypasses review is an audit finding.
12. Glossary
| Term | Meaning |
|---|---|
| PII | Personally Identifiable Information |
| Quasi-identifier | A field that doesn't identify alone but does in combination with others (e.g., DOB + zip + sex) |
| k-anonymity | Privacy guarantee: every group of records sharing the same quasi-identifier values has ≥ k members |
| Salt | Random secret added to a hash input to prevent rainbow-table attacks and to defeat known-plaintext lookups |
| Tokenisation | Replacing a sensitive value with a non-sensitive substitute (token), with the mapping stored separately under access controls |
| Generalisation | Replacing a specific value with a less specific bucket (DOB → age band) |
| Suppression | Dropping a field entirely (never extracting it from source) |
| Break-glass | A controlled, audited procedure for bypassing normal access controls in a defined emergency |
| RTBF | Right To Be Forgotten (GDPR Art. 17 / POPIA Sec. 24) |
13. Change log
| Date | Change | By |
|---|---|---|
| 2026-05-28 | Initial document — extracted standard from BI Pipeline §6 for compliance review | Tinashe Bhengu |