For a long time, ETL has been the go-to workflow for data teams of all shapes and sizes. I was fortunate enough to start my analytics journey with a long-standing client of One51 Consulting, where I was exposed to an extremely rich ecosystem of data ingestion tools and orchestration techniques. The client’s ETL process involved a suite of proprietary tools and custom scripts ingesting data from various disparate sources, all on an on-prem server.
Eventually, the client made the switch to Snowflake. Equipped with modern data pipeline tools like Fivetran and with the help of One51’s accelerator tool, Composer, we were ingesting data even faster. This combination allowed us to move towards a much more flexible ELT workflow for our data transformation process, which involved transforming the raw data in a view and writing a stored procedure to load the data into a table.
However, as business needs evolved, our data warehouse grew exponentially. It became difficult to keep up with the demands of our reporting projects now that we had so many scripts to maintain. I would find myself sifting through layers and layers of transformation logic, tracing data from the source tables to our MicroStrategy reports. When an issue arose or a change was made, it was common practice to truncate these massive one-way incremental models, forcing us to re-run the ETL pipeline from the beginning; even spot-checking a data issue inevitably led to a re-run of the entire process. Transformation logic was often siloed and highly specific to each team member’s use case, and team members were unaware of scripts that had already been developed, resulting in duplicated transformation code. Report delivery latency increased, and quality control became a recurring problem. End users, whose data literacy was growing, were starting to catch on.
Where do we go from here?
My experience was not an uncommon one. Much of the maintenance overhead data teams accumulate comes down to a lack of proper change management for transformation code. It is not a matter of ELT vs ETL, but rather how we have been transforming data for so long.
It was not until mid-2022 that I was exposed to dbt. After a year of using the tool with another client, it has completely changed how I think about the transformation process. dbt is a data transformation tool built on the philosophy that “data transformation tools should conform to the practices that allow software engineers to ship faster and more reliable code”. What does this mean? It means that a good transformation process/tool should:
- Support transformation as code (SQL)
- Support standardisation of data transformation code
- Facilitate modular, accessible and version-controlled data transformation code
- Have robust testing practices throughout the development cycle
- Facilitate proactive documentation and data cataloguing
dbt puts all these capabilities firmly within the domain of the analyst, leveraging tried-and-true software engineering practices to streamline the transformation process and allow business users to make well-informed decisions on data they can trust. With dbt, we have the flexibility to meet the needs of our downstream users and to ensure consistency in our metrics through dbt’s semantic layer. Whether the destination is a machine learning solution or a BI dashboard, we can be confident that our model is sound, the data is current, and the transformed output is accurate.
dbt automates a lot of the menial work we often dread. I can easily test the columns I care about early in my DAG (directed acyclic graph) of models to catch errors before they reach my facts and dimensions. If an error is flagged there, I can be confident the issue lies with the source data, not my model. I no longer need to repeat validations when my team has already created assertions that cover them. Much of the documentation is generated by dbt itself, which infers it from my data models and data warehouse, eliminating the need for last-minute cramming to condense three months’ work into the final two weeks before handover. The most noticeable change for me is how data quality increases while maintenance time falls. No more getting bogged down in an infinite backlog of emails and Jira tickets; instead, I can focus on adding value for my clients more efficiently and reliably through continuous integration.
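As a sketch of what those early tests can look like (the model and column names here are invented for the example), dbt lets you declare assertions in a YAML properties file that lives alongside your models:

```yaml
# models/staging/_staging.yml -- hypothetical example
version: 2

models:
  - name: stg_orders            # a staging model early in the DAG
    columns:
      - name: order_id
        tests:                  # dbt's built-in generic tests
          - unique
          - not_null
      - name: customer_id
        tests:
          - relationships:      # every order must reference a known customer
              to: ref('stg_customers')
              field: customer_id
```

Running `dbt test` executes these assertions against the warehouse, so a failure on a staging model points you at the source data before it ever reaches a fact or dimension.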
Is dbt the right tool for you?
dbt Labs believes it is the right transformation tool for you if:
**Your data warehouse runs on separate hardware from the database used by your product.**
Data teams should be able to work in a development environment that is separate from production, allowing them to work independently of engineers.
**Your data is loaded into your data warehouse in its raw format.**
dbt is not an ETL tool. It assumes the data to be transformed has already been loaded into your data warehouse in its raw format. Transformations should occur in the same warehouse into which your source data is extracted, rather than moving data between warehouses.
**Your data transformation logic can be expressed in SQL.**
dbt supports a combination of SQL and Jinja, a templating language widely used in the Python ecosystem. Transformations that can’t be expressed in SQL should be done outside dbt.
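To illustrate (the model, column, and payment-method names are invented for the example), a dbt model is just a SELECT statement, with Jinja filling in model references and repetitive logic:

```sql
-- models/marts/fct_orders.sql -- hypothetical example
-- {{ ref() }} resolves to the upstream model and records the dependency in the DAG

{% set payment_methods = ['card', 'paypal', 'bank_transfer'] %}

select
    order_id,
    customer_id,
    {% for method in payment_methods %}
    sum(case when payment_method = '{{ method }}' then amount else 0 end)
        as {{ method }}_amount{{ "," if not loop.last }}
    {% endfor %}
from {{ ref('stg_payments') }}
group by order_id, customer_id
```

At compile time dbt renders the Jinja into plain SQL for your warehouse, so adding a payment method is a one-line change to the list rather than a copy-pasted CASE expression.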
**Your data can be transformed in batch.**
Many modern data pipeline and warehousing tools are built around batch processing (Fivetran, Snowflake, BigQuery, etc.). Depending on how frequently you run (refresh) your models, there can be some latency in your transformations. This is a compromise we accept, as true real-time data doesn’t suit most analytics use cases.
**You write idempotent data transformations.**
A transformation is idempotent if running it multiple times produces the same result as running it once. dbt is built on this assumption to maintain data consistency and reliability. You do not need to write CREATE, INSERT, or UPDATE statements in dbt. Models are defined as SELECT statements (no more stored procedures!).
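As a minimal sketch (names hypothetical), an entire model file is just the SELECT; dbt generates the surrounding DDL/DML based on the materialisation you configure:

```sql
-- models/marts/dim_customers.sql -- hypothetical example
-- No CREATE TABLE or INSERT here: dbt wraps this SELECT in the
-- statements required by the configured materialisation.

{{ config(materialized='table') }}

select
    customer_id,
    first_name,
    last_name
from {{ ref('stg_customers') }}
```

Re-running `dbt run` simply rebuilds the table from the same SELECT, yielding the same result each time, which is exactly the idempotency dbt assumes.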
When is it not the right fit for you?
Migrating to an ELT framework built on dbt can be long and costly. Organisations may be reluctant to move off a stored-procedure-based transformation process because it has accumulated so much momentum that it feels untouchable. It is often the organisations that ‘hit rock bottom’ in their transformation journey that use that moment as the catalyst for making the switch.
dbt caters to the technical team. The typical daily routine of the analyst moves from writing and managing SQL scripts to running Python scripts from the command line and creating pull requests with Git. There is some assumed knowledge around using dbt, and this can be a hurdle for some individuals. Additionally, setting up dbt (particularly dbt Core) in your stack can involve many moving parts, as you must consider orchestrating and scheduling your models and building a robust CI/CD pipeline.
Using a framework like dbt for data transformation doesn’t guarantee your team will develop modular data models with easy-to-interpret dependencies or valuable documentation. Some of us are stuck in our old ways or reluctant to change. Using dbt to its full potential means not only exercising the full set of features available in the tool, but also setting guidelines and running a peer-review process to ensure models are built with best practices in mind. If we don’t, we are simply moving the problem we already had onto a new tool.
dbt is not just another tool to add to your stack. Many organisations are finding great benefit in joining this ‘movement’ towards a new way of thinking about transformation processes. Data transformation is the foundation of analytics; without a consistent, accessible, and well-governed approach to it, we can’t deliver meaningful, quality data to our end users.