We've been building many different kinds of ETL pipelines for various environments for some time now. We did that using different technologies, different stacks, different everything. The only thing all of these had in common was SQL: a language that transcends the differences across almost all data and engineering stacks, domains, and even employee roles. It would be quite hard to find another language, or technology for that matter, that engineers, business analysts, and C-level executives could all be comfortable using.
Why do we build data pipelines? The answer is simple. We need to move data from point A to point B as reliably and as fast as we can, while keeping our engineering and, most importantly, our maintenance costs as low as possible. We care about maintenance costs because they can become the bottleneck as soon as data volumes start to increase. The math is simple: the more data flows through the pipeline, the bigger our costs. So even small optimizations here can yield great cost benefits in the long run.
dbt (data build tool) is a relatively small software package built by Fishtown Analytics with the specific mission of simplifying ETL workflows. The reasoning behind it is simple: instead of complex pipelines built with external tools, plus an even more complex infrastructure to contain, orchestrate, and monitor all of them, dbt brings the problem back to its roots by providing a platform to do all of the transformations within the database itself.
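To make this concrete, here is a minimal sketch of what a dbt model looks like. A model is just a SELECT statement in a file; dbt wraps it in the appropriate DDL and executes it inside the warehouse. The source and column names here are hypothetical, purely for illustration:

```sql
-- models/staging/stg_orders.sql
-- A dbt model is a plain SELECT; dbt compiles it into a
-- CREATE TABLE / CREATE VIEW statement and runs it in the warehouse.
select
    order_id,
    customer_id,
    order_total,
    created_at
from {{ source('shop', 'raw_orders') }}  -- hypothetical source
where order_total is not null
```

Running `dbt run` compiles the Jinja, resolves the dependency graph between models, and materializes each one in the database, so no external compute cluster is involved in the transformation step.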
The main benefit of a stack built with dbt is that far fewer technologies are needed to achieve the same results or better. To put it in a simpler, more ad-friendly form: do more with less! A smaller set of technologies essentially means less maintenance in the future, and we would argue that maintenance burden is one of the most vital metrics in the data world.
So what is so great about it, and how is it different from everything else? To answer that question, we need to borrow some things from data-mesh concepts, the main one being a focus on the actual data rather than on the pipelines themselves. That, plus the SQL we mentioned at the beginning of the article. These two things combined make the whole experience of operating and using a system built on top of dbt feel like a breeze.
While working on our most recent project, we were not using dbt from the start. Our day-to-day experience of building the data-processing stack was more or less conventional. Our stack mostly consisted of AWS-backed systems: we relied heavily on Kinesis streams for piping and AWS Glue for the heavy transform workloads, with sprinkles of Lambdas here and there. Our warehouse of choice was Redshift.
We started small with dbt: at first, we used it only to build top-level representations/models of the data for our data analytics and business teams. Then we used those to build a set of new domain-specific dashboards within our reporting layer. After getting a taste of it, we realized this tool could offer us much more than what we had been using it for. We started adding tests to the data within those top-layer representations to validate that they remained intact and conformed to a specific set of rules. That approach has already saved us many headaches by alerting us when something got messed up before it reached the actual dashboard users, saving a lot of engineering time and keeping our data team's reputation intact in the eyes of other teams.
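Tests like these are declared in YAML alongside the models. The sketch below uses dbt's built-in generic tests (`unique`, `not_null`); the model and column names are hypothetical:

```yaml
# models/marts/schema.yml
version: 2

models:
  - name: daily_revenue        # hypothetical reporting model
    columns:
      - name: report_date
        tests:
          - unique             # one row per day, no duplicates
          - not_null
      - name: revenue_usd
        tests:
          - not_null           # revenue must always be populated
```

Running `dbt test` compiles each of these into a SQL query against the warehouse and fails loudly when any row violates a rule, which is what lets problems surface before dashboard users see them.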
Slowly but surely, we migrated most of our incoming data pipelines to this approach as well. Instead of dumping all of the data into S3 and hoping Glue would take care of everything for us, we started to look at the incoming data more carefully and keep it in a more structured and organized way. Some of the data went directly into Redshift tables, and some remained on S3 but was exposed as external tables via Redshift Spectrum. That allowed us to keep structured data in its initial form while already having it queryable and accessible within Redshift. Additionally, one of the best features dbt has to offer is its declarative way of creating data models, which most of the time we simply deployed to the warehouse as views, giving us not a snapshot of the data at a given time but a window onto the current, actual data.
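The "window onto current data" behavior comes from choosing the view materialization. A minimal sketch, with hypothetical model names:

```sql
-- models/marts/current_inventory.sql
-- Materialized as a view: every query against this model sees the
-- live underlying data, not a point-in-time snapshot.
{{ config(materialized='view') }}

select
    sku,
    warehouse_id,
    quantity_on_hand
from {{ ref('stg_inventory') }}  -- hypothetical upstream model
```

Switching the same model to `materialized='table'` (or `'incremental'`) later is a one-line change, which is part of what makes the approach so low-maintenance.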
We've covered most of the incoming data structures with basic checks for schema conformity and data freshness. That in and of itself was a monumental achievement, since previously it took real effort to implement a monitoring solution tailored to each new pipeline. That, combined with the previously unseen velocity with which the data became accessible within the rest of the data layers, gave us a big confidence boost to keep expanding and improving in this direction, which we continue to do to this day.
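Freshness checks are declared on dbt sources. The sketch below shows the general shape; the source, schema, and timestamp column names are hypothetical:

```yaml
# models/staging/sources.yml
version: 2

sources:
  - name: shop                     # hypothetical source system
    schema: raw
    loaded_at_field: _loaded_at    # timestamp column used to judge freshness
    freshness:
      warn_after: {count: 6, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: raw_orders
```

Running `dbt source freshness` compares the max of `loaded_at_field` against the current time and warns or errors per the thresholds, turning pipeline monitoring into a few lines of configuration instead of a bespoke solution per pipeline.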
One of the main benefits of this approach, we think, is an overall reduction of complexity in the tooling stack. That simplifies upkeep, monitoring, and development, as well as onboarding and mentoring new team members to work with it. Putting a well-known technology like SQL at the front gives us a substantial base to work and build upon.
And the ability to freely test both the structure and nature of the data along with its freshness and conformity gives us confidence and provides reliability to others.