Why Test-Driven Development (TDD) Matters for Analytics Engineering
Imagine you’re part of a company’s analytics engineering team. You spend hours building data models, cleaning up messy data, and making dashboards. But when someone in the company wants to answer a question, they skip your work and go straight to the raw data. Why? Because they trust the raw data more than your models.
This is a common problem. If people don’t trust the work of the analytics engineering team, then all the effort you put in doesn’t matter. So, how do you build trust? One of the best ways is by using test-driven development (TDD).
In this article, we’ll explain what TDD is, why it’s important, and how you can use it in analytics engineering. We’ll use simple language, lots of examples, and practical advice you can use right away.
What is Test-Driven Development (TDD)?
Test-driven development is a way of writing code where you write tests before you write the actual code. The idea is simple:
- Gather requirements: Figure out what the user needs.
- Write a test: Turn that requirement into a test.
- Run the test: It should fail, because you haven’t written the code yet.
- Write the code: Implement the feature or fix.
- Run all tests: Now, the new test (and all old ones) should pass.
- Repeat: Go back to step 1 for the next requirement.
This process might sound slow, but it actually saves time in the long run. It helps you catch mistakes early, makes your code more reliable, and gives you confidence that your work is correct.
Why Trust Matters
Trust is a big deal in analytics. If people don’t trust your data models, they won’t use them. Stephen M. R. Covey, in his book The Speed of Trust, says trust has four parts:
- Integrity: Are you honest?
- Intent: Do you mean well?
- Capabilities: Are you able to do the job?
- Results: Do you deliver what you promise?
To be trusted by others, you first have to trust yourself. TDD helps you build that self-trust because you know your work is solid.
How TDD Builds Confidence
Integrity
Let’s say a user finds a bug in your data model. You fix it and say, “It won’t happen again.” But how do you know? If you don’t have a test for that bug, you can’t be sure it won’t come back.
With TDD, you write a test that catches the bug. Now, you can say, “I’ve written a test that will catch this problem if it ever happens again.” That’s much more reassuring, both for you and your users.
Example: Writing a Test for a Bug
Suppose a user reports that some transactions are missing from a report. You investigate and find the problem. Before you fix it, you write a test:
-- test_missing_transactions.sql
with missing as (
    select id
    from transactions
    where date = '2024-01-01'
      and amount > 0
)
select *
from missing
-- "not exists" instead of "not in": if report.id ever contains a null,
-- "id not in (...)" returns no rows and the test silently passes
where not exists (
    select 1
    from report
    where report.id = missing.id
)
This test checks that every transaction from January 1, 2024 with a positive amount is present in the report. Any missing transactions are returned as rows, which makes the test fail.
Now, you fix the code, run the test, and it passes. If the problem ever comes back, the test will catch it.
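In dbt, a query like this saved in the tests/ directory is a singular test: it fails whenever it returns rows. While iterating on the fix, you can run just that one test by name:
dbt test --select test_missing_transactions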
Intent, Capability, and Results
Imagine you’re a data customer. You give the analytics engineer some example data and ask for a report. When you check the report, your example isn’t there. You might wonder: Does the engineer care? Are they good at their job? Can they deliver results?
Often, this isn’t because the engineer doesn’t care or isn’t capable. It’s because of regressions—fixing one thing breaks another. Without tests, these are hard to catch.
Example: Regression
Suppose you have two requirements:
- Transactions over $1000 should be flagged as “large.”
- Transactions from VIP customers should be flagged as “VIP.”
You write code for the first requirement, then add the second. But the new code accidentally breaks the first. If you have tests for both, you’ll catch the problem right away.
-- test_large_transactions.sql
select *
from transactions
where amount > 1000
  and flag <> 'large'

-- test_vip_transactions.sql
select *
from transactions
where customer_type = 'VIP'
  and flag <> 'VIP'
Every time you make a change, you run all the tests. If any fail, you know you broke something.
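With dbt, rerunning the entire suite after each change is a single command:
dbt test
Or use dbt build to rebuild the models and run their tests in one pass.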
How to Do TDD in Analytics Engineering
Start with What the Customer Needs
Talk to your users. What do they want? If something is broken, ask for an example. If they want something new, ask them to show you what it should look like—maybe in a spreadsheet.
Turn those examples into tests.
Example: Turning a Spreadsheet into a Test
Suppose a user gives you this spreadsheet:
| id | amount | flag  |
|----|--------|-------|
| 1  | 500    | small |
| 2  | 2000   | large |
| 3  | 1500   | large |
You can write a test like this:
-- test_transaction_flags.sql
with expected as (
    select 1 as id, 500 as amount, 'small' as flag
    union all
    select 2, 2000, 'large'
    union all
    select 3, 1500, 'large'
),
actual as (
    select id, amount, flag
    from transactions
    where id in (1, 2, 3)
)
select expected.*
from expected
left join actual
    on expected.id = actual.id
    and expected.amount = actual.amount
    and expected.flag = actual.flag
where actual.id is null
This test returns any expected row that doesn’t appear in the model with exactly those values, so a missing row, or a wrong amount or flag, for IDs 1–3 will make it fail.
Focus on the Output
Don’t worry about how the data gets there. Focus on what the user wants to see. Your tests should check the final output, not the intermediate steps.
Example: Output Test
If the user wants a report with the columns customer_id, total_spent, and status, write a test that checks exactly those columns.
-- test_customer_report.sql
with expected as (
    select 101 as customer_id, 3000 as total_spent, 'gold' as status
    union all
    select 102, 500, 'silver'
),
actual as (
    select customer_id, total_spent, status
    from customer_report
    where customer_id in (101, 102)
)
select expected.*
from expected
left join actual
    on expected.customer_id = actual.customer_id
    and expected.total_spent = actual.total_spent
    and expected.status = actual.status
where actual.customer_id is null
Dealing with Nulls
SQL can be tricky with null values. For example, not(column = 'value') won’t return rows where the column is null, because the comparison evaluates to null rather than false. It’s a good idea to test for nulls separately.
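A quick illustration of the trap, reusing the transactions table from earlier:
-- Intends to find every transaction not flagged 'large',
-- but rows where flag is null are silently skipped:
select *
from transactions
where not (flag = 'large')

-- To include the null rows, handle them explicitly:
select *
from transactions
where not (flag = 'large')
   or flag is null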
Example: Not Null Test
# In dbt, you can write this in your model's YAML file
models:
  - name: transactions
    columns:
      - name: amount
        tests:
          - not_null
This test will fail if any amount values are null.
Write the Test First
It’s easy to accidentally write a test that always passes. That’s why you should write the test before you write the code. The test should fail at first. Then, when you write the code, the test should pass.
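For example, this contrived test can never fail, because its filter contradicts itself:
-- A test that always passes, even if the flagging logic is wrong:
select *
from transactions
where flag = 'large'
  and flag <> 'large'
Watching a new test fail before the code exists proves it can actually catch the problem.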
Documentation-Driven Development
Writing documentation is just as important as writing tests. In dbt, you can write documentation in YAML files before you write any SQL.
Example: Documenting a Model
models:
  - name: transactions
    description: All customer transactions
    columns:
      - name: id
        description: Unique transaction ID
        tests:
          - not_null
          - unique
      - name: amount
        description: Transaction amount
        tests:
          - not_null
Take notes in this format as you talk to users. It helps you keep track of requirements and makes your documentation better.
Handling Edge Cases
Sometimes, you need to write logic for situations that don’t exist in your data yet. For example, maybe you need to handle a new type of transaction that hasn’t happened yet. There are a few ways to deal with this:
- Just write the logic: Don’t write a test for it.
- Use a non-production environment: If you have a test database, create fake data for the edge case and write a test.
- Use dbt unit tests: Define fake inputs in your model’s YAML file.
Most of the time, just writing the logic is enough. If the edge case is really important, use one of the other methods; a unit test sketch follows below.
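For the dbt unit test option, dbt 1.8+ lets you define mock inputs and the expected output in your model’s YAML. A minimal sketch, assuming a transactions model built from a raw_transactions input and a hypothetical refund edge case:
# A minimal sketch; the model, input, and column names are assumptions
unit_tests:
  - name: refunds_are_flagged
    model: transactions
    given:
      - input: ref('raw_transactions')
        rows:
          - {id: 99, amount: -50}
    expect:
      rows:
        - {id: 99, flag: 'refund'}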
Enforcing TDD in Your Team
If you’re a team leader, you might want to make TDD a requirement. Here’s one way to do it:
- All dbt models are stored in a Git repository.
- The production branch is write-protected. All changes must come through pull requests.
- Every pull request must have tests for every new feature or fix.
- Reviewers only approve pull requests if they see the tests.
This process helps make sure everyone follows TDD. Pairing it with a CI job that runs the whole test suite on every pull request makes the check automatic, as in the sketch below.
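Here’s one possible setup, a minimal sketch assuming GitHub Actions and a Postgres warehouse; the file name and profile location are assumptions to adapt to your stack:
# .github/workflows/dbt-ci.yml (hypothetical file)
name: dbt-ci
on:
  pull_request:
    branches: [main]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install dbt-core dbt-postgres
      - run: dbt deps
      # dbt build runs models and tests together; any failing test fails the check
      - run: dbt build
        env:
          DBT_PROFILES_DIR: .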
TDD for Data Engineering
TDD works great for analytics engineering, but what about data engineering? Data engineers usually just move data from one place to another, without much logic. TDD isn’t as useful here.
For example, if you’re loading data from the Facebook Ads API, you could write tests that mock the API responses. But since you’re just loading the data as-is, there’s not much to test.
Instead, focus on making your code simple and reliable. Use retry mechanisms to handle errors.
Conclusion
Test-driven development is a powerful tool for analytics engineers. It helps you build trust, catch mistakes early, and deliver better results. It’s easy to get started: just talk to your users, turn their requirements into tests, and write your code to make the tests pass.
If you’re a team leader, consider making TDD a requirement. It will make your team’s work more reliable and trusted.
Analytics engineering is the perfect place for TDD. The effort to write tests is small, but the benefits are huge. Give it a try—you’ll be glad you did.