Automating PII Tagging in dbt with OpenAI
In today’s data-driven world, organizations are collecting, transforming, and analyzing more data than ever before. With this explosion of data comes a growing responsibility: ensuring that sensitive information, especially Personally Identifiable Information (PII), is properly identified, documented, and protected. For analytics engineers and data teams using dbt, this means going beyond just building models and pipelines. It means embedding data governance and compliance directly into your workflows.
But how do you keep up with the ever-changing landscape of data, especially as new models and columns are added? Manual PII tagging is tedious, error-prone, and simply doesn’t scale. That’s where automation, powered by large language models like OpenAI’s GPT, can make a transformative difference.
In this article, we’ll walk you through how to automate PII tagging in dbt using OpenAI. We’ll cover the why, the how, and the practical steps to get you started, including code examples, best practices, and tips for integrating this process into your data engineering workflow.
Why PII Tagging Matters in dbt
PII, or Personally Identifiable Information, refers to any data that could potentially identify a specific individual. This includes names, email addresses, phone numbers, Social Security numbers, and more. With regulations like GDPR, CCPA, and HIPAA, organizations are under increasing pressure to identify, protect, and manage PII throughout the data lifecycle.
dbt (data build tool) has become the de facto standard for analytics engineering, enabling teams to transform raw data into clean, reliable datasets. dbt’s YAML-based documentation makes it easy to describe models and columns, but it doesn’t natively flag which fields contain PII. This gap can lead to compliance risks, especially as data flows downstream to BI tools, dashboards, and applications.
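For context, a typical schema.yml documents each model and column like this (the model and column names here are purely illustrative):

version: 2

models:
  - name: customers
    description: "One row per customer"
    columns:
      - name: email
        description: "The customer's email address"
      - name: signup_date
        description: "Date the customer created their account"

Nothing in this structure tells a downstream user, or a downstream tool, that email is sensitive.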
Why automate PII tagging?
- Scale: Manual tagging doesn’t scale as your data warehouse grows.
- Accuracy: Automation reduces human error and oversight.
- Compliance: Proactive PII identification is essential for regulatory compliance.
- Collaboration: Clear PII tagging helps downstream users understand data sensitivity.
Challenges of Manual PII Tagging
Let’s face it: manually reviewing every model and column in your dbt project is a daunting task. Here are some common challenges:
- Time-consuming: Reviewing hundreds of columns across dozens of models is slow and repetitive.
- Inconsistency: Different team members may interpret PII differently, leading to inconsistent tagging.
- Missed fields: It’s easy to overlook new columns or forget to update tags as models evolve.
- Documentation drift: As your dbt project grows, documentation can quickly become outdated.
The result?
Gaps in your data governance, increased compliance risk, and more headaches for your data team.
How OpenAI Can Help
OpenAI’s large language models (LLMs), such as GPT-4, are trained on vast amounts of text and can understand context, semantics, and intent. This makes them ideal for tasks like:
- Classifying column descriptions: Is this column likely to contain PII?
- Suggesting tags: Should this field be marked as sensitive?
- Automating documentation: Generating or updating metadata based on natural language descriptions.
By integrating OpenAI into your dbt workflow, you can automate the process of scanning YAML documentation, analyzing column descriptions (and even sample data), and flagging potential PII fields for review.
The Automated PII Tagging Workflow
Here’s a high-level overview of how you can automate PII tagging in dbt using OpenAI:
- Read dbt YAML Files: Parse your dbt model documentation to extract model and column metadata.
- Fetch Sample Data (Optional): For greater accuracy, retrieve a few sample values from your data warehouse for each column.
- PII Detection with OpenAI: Send column descriptions (and sample values) to OpenAI’s API for classification.
- Human-in-the-Loop Review: Prompt a user to confirm or override the model’s suggestions.
- Update YAML Files: Add or update the meta section in your YAML to flag columns as containing PII.
- Integrate with CI/CD (Optional): Run this process automatically as part of your data pipeline.
Let’s break down each step in detail.
Step-by-Step Guide: Automating PII Tagging in dbt
Set Up Your Environment
You’ll need:
- A Python environment (python>=3.7)
- The following packages: pyyaml, openai, python-dotenv, sqlalchemy, pandas
- An OpenAI API key
- Access to your dbt project’s YAML files
- (Optional) Access to your data warehouse for sample data
Install dependencies:
pip install pyyaml openai python-dotenv sqlalchemy pandas
Load Environment Variables
Store your OpenAI API key and database connection string in a .env file:
OPENAI_API_KEY=sk-...
DB_URI=postgresql://user:password@host:port/dbname
Read dbt YAML Files
Parse your dbt model YAML files to extract model and column metadata.
Fetch Sample Data (Optional but Recommended)
For each column, fetch a few sample values from your data warehouse. This gives the LLM more context and improves classification accuracy.
Classify Columns with OpenAI
Send the column description and sample values to OpenAI’s API, asking it to classify the field as PII or non-PII.
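For an email-like column, the prompt and the expected reply might look roughly like this (the description and sample values are made up):

Classify the following (respond with only the label):
text: The customer's email address
sample_values: ['jane@example.com', 'sam@example.org']
label: PII, non-PII

Expected response: PII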
Human-in-the-Loop Review
If the model flags a field as PII, prompt a user to confirm before updating the YAML.
Update YAML Files
Add a meta section with contains_pii: true to columns confirmed as PII.
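After confirmation, the column's entry in the schema.yml ends up looking something like this (illustrative column name):

      - name: email
        description: "The customer's email address"
        meta:
          contains_pii: true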
Sample Code: PII Tagging Script with OpenAI
Below is a sample script that automates PII tagging in dbt using OpenAI. It is inspired by Matthew Senick’s Gemini-based workflow, adapted here for OpenAI’s current Python SDK (v1+).
import os
import time

import yaml
import pandas as pd
from sqlalchemy import create_engine
from openai import OpenAI
from dotenv import load_dotenv

# Load environment variables
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
DB_URI = os.getenv("DB_URI")

client = OpenAI(api_key=OPENAI_API_KEY)


def get_sample_values(engine, table, column, n=5):
    """Fetch up to n non-null sample values from the specified table and column."""
    try:
        query = f'SELECT "{column}" FROM "{table}" WHERE "{column}" IS NOT NULL LIMIT {n}'
        df = pd.read_sql(query, engine)
        if not df.empty:
            return [str(val) for val in df[column].tolist()]
    except Exception as e:
        print(f"Error fetching samples for {table}.{column}: {e}")
    return []


def classify_with_openai(prompt, retries=1, wait_seconds=60):
    """Ask the model for a one-word PII / non-PII label, retrying once on failure."""
    for attempt in range(retries + 1):
        try:
            response = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": prompt}],
                max_tokens=10,
                temperature=0,
            )
            return response.choices[0].message.content.strip()
        except Exception as e:
            if attempt == retries:
                raise
            print(f"Error: {e}. Retrying in {wait_seconds} seconds...")
            time.sleep(wait_seconds)


def check_yaml_files(directory, engine):
    """Walk a directory of dbt YAML files and run the PII check on each one."""
    for filename in os.listdir(directory):
        if filename.endswith('.yml'):
            filepath = os.path.join(directory, filename)
            with open(filepath, 'r') as file:
                try:
                    data = yaml.safe_load(file)
                    print(f"Contents of {filename}:")
                    check_pii(data, filepath, engine)
                except yaml.YAMLError as exc:
                    print(f"Error reading {filename}: {exc}")


def check_pii(data, filepath, engine):
    """Classify every documented column and, after human confirmation, tag it as PII."""
    for model in data.get('models', []):
        model_name = model.get('name')
        print(f"Model: {model_name}")
        for column in model.get('columns', []):
            col_name = column.get('name')
            description = column.get('description', '')
            sample_values = get_sample_values(engine, model_name, col_name, n=5)
            prompt = (
                "Classify the following (respond with only the label):\n"
                f"text: {description}\n"
                f"sample_values: {sample_values}\n"
                "label: PII, non-PII"
            )
            label = classify_with_openai(prompt)
            print(f"Column '{col_name}' PII status: {label}")
            if label.lower() == 'pii':
                user_input = input(f"Is the column '{col_name}' PII? (y/n): ").strip().lower()
                if user_input == 'y':
                    # Preserve any existing meta keys instead of overwriting them
                    meta = column.get('meta') or {}
                    meta['contains_pii'] = True
                    column['meta'] = meta
                    with open(filepath, 'w') as file:
                        yaml.dump(data, file, default_flow_style=False, sort_keys=False)


if __name__ == "__main__":
    staging_directory = '../dbt/sample_project/models/staging'
    engine = create_engine(DB_URI)
    check_yaml_files(staging_directory, engine)
Best Practices for Automated PII Tagging
1. Use Descriptive Column Names and Descriptions
The more context you provide in your YAML, the better the LLM can classify fields. Avoid vague descriptions.
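For example, of the two hypothetical columns below, the second gives the model far more to work with than the first:

      - name: ref_fld
        description: "reference field"
      - name: customer_email
        description: "The customer's primary email address, collected at signup"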
2. Fetch Sample Data
Including real sample values helps the model distinguish between, for example, a “user_id” (non-PII) and an “email” (PII).
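As a rough illustration (the values are fabricated), samples often settle the question on their own:

user_id: ['10482', '10483', '10491']             -> opaque surrogate keys, non-PII
email:   ['jane@example.com', 'sam@example.org'] -> clearly PII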
3. Keep a Human in the Loop
Always prompt a user to confirm before updating documentation. LLMs are powerful, but not infallible.
4. Document Your Process
Add a note in your dbt project README describing how PII tagging is automated and how to review flagged fields.
5. Review Regularly
Schedule periodic reviews of your PII tags, especially after major schema changes.
Integrating PII Tagging into CI/CD
For maximum impact, integrate your PII tagging script into your CI/CD pipeline. This way, every time a model or column is added or changed, the script runs automatically, flagging potential PII for review.
Example workflow:
- On pull request, run the PII tagging script.
- If new PII is detected, require a reviewer to confirm before merging.
- Update YAML files as part of the PR.
This approach ensures that PII tagging stays up-to-date as your dbt project evolves. Here is an example GitHub Actions workflow that:
- Runs your PII tagging script
- Captures its output
- Posts a comment on the pull request with the PII check results
- Indicates if any model YAML files need updating
Save this as .github/workflows/pii_tagging.yml:
name: Automated PII Tagging in dbt

on:
  pull_request:
    paths:
      - 'dbt/sample_project/models/staging/**.yml'
      - 'pii_tagging.py'
      - '.github/workflows/pii_tagging.yml'

jobs:
  pii-tagging:
    runs-on: ubuntu-latest
    env:
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      DB_URI: ${{ secrets.DB_URI }}
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install pyyaml openai python-dotenv sqlalchemy pandas

      - name: Run PII Tagging Script and Save Output
        id: pii_check
        run: |
          python pii_tagging.py > pii_check_output.txt 2>&1 || echo "Script exited with non-zero status"
          tail -20 pii_check_output.txt # Show last 20 lines in logs for debugging

      - name: Create PR Comment with PII Check Output
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          script: |
            const fs = require('fs');
            const output = fs.readFileSync('pii_check_output.txt', 'utf8');

            let needsUpdate = false;
            if (output.includes('needs update') || output.toLowerCase().includes('update yaml')) {
              needsUpdate = true;
            }

            const body = `
            ### 🤖 Automated PII Tagging Check

            \`\`\`
            ${output}
            \`\`\`

            ${needsUpdate ? '⚠️ **Some model YAML files need updating to reflect PII findings.**' : '✅ No model YAML updates required.'}
            `;

            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body
            });
Limitations and Human-in-the-Loop
While OpenAI’s models are powerful, they’re not perfect. Here are some limitations to keep in mind:
- False positives/negatives: The model may misclassify some fields, especially with ambiguous descriptions.
- Context limitations: Without sample data, the model may struggle to classify certain fields.
- Evolving definitions: What counts as PII may change based on regulations or company policy.
Solution:
Always keep a human in the loop. Use automation to flag potential PII, but require a person to confirm before updating documentation.
Conclusion
Automating PII tagging in dbt with OpenAI is a game-changer for data governance. It saves time, reduces errors, and helps ensure compliance as your data warehouse grows. By combining the power of large language models with human oversight, you can build a scalable, reliable, and compliant data pipeline.
Ready to get started?
- Set up your environment
- Run the sample script
- Review and confirm flagged fields
- Integrate into your CI/CD pipeline
With these steps, you’ll be well on your way to a more secure, compliant, and well-documented data platform.
Discussion