Data Processing

Meltano: the framework for next-generation data pipelines

Meltano is a full-featured data integration platform that’s poised to challenge even the most established players in the data space. By building on top of the industry’s best open-source tools and infusing them with DataOps best practices, Meltano packs a powerful punch.

What is an ELT pipeline?

An ELT pipeline is a set of processes for converting messy, non-actionable data into a format that analysts, data scientists, salespeople, and other end users can access and draw insights from. The acronym “ELT” stands for the three steps of a data pipeline (see the sketch after this list):

  • extracting raw data from a source;
  • loading it into a centralized database; and,
  • standardizing it through various transformations.
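
To make these steps concrete, here’s a purely illustrative sketch of a pipeline run expressed as shell commands. The command names below (fetch-orders, load-table, run-sql) are hypothetical placeholders, not real tools:

# Extract: pull raw data from a source system
fetch-orders --source crm > raw_orders.json

# Load: land the raw data in a centralized database
load-table raw_orders.json --table raw.orders

# Transform: standardize the loaded data for end users
run-sql transformations/clean_orders.sql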

In practice, the exact contents of each step vary on a case-by-case basis. For example, say we’re working with multiple data sources. Each of our data feeds would then require a custom script for scheduling downloads, a custom loader that massages the extracted data into our database’s schema, and a specialized transformation script.

Put briefly, getting pipelines to run smoothly requires considerable effort. Yet it’s crucial that data-oriented organizations have reliable data ingestion tools in place. Meltano is one such tool, thanks to its DataOps-friendly approach to building data pipelines.

What is Meltano?

Meltano is a self-hosted ELT solution created by GitLab. Originally built for internal use by the GitLab data team, Meltano quickly grew into an independent project when its team realized that many other organizations faced the very problems Meltano was built to solve.

With so many hosted data platforms around (e.g., Snowflake, Databricks), you may be wondering whether there’s really room for another self-hosted ELT platform. After all, commercial off-the-shelf tools are already designed to solve the data integration problem.

But unlike its hosted peers, Meltano aims to challenge the pay-to-play status quo and democratize data workflows by making ELT tooling freely accessible to everyone.

Indeed, many businesses cannot afford the fees charged by the proprietary tools that currently dominate the market. As companies ingest more data, rising maintenance costs can quickly push their investment into diminishing-returns territory. Likewise, warehouses like Snowflake only run on the major cloud providers, so if you’re not using one of them or operate in a non-standard region, you could incur serious data transfer charges. All of this prevents many businesses from realizing the full potential of their data.

Meltano’s answer to the above is an open-source tool that streamlines every step of the data ingestion workflow. Its open-source model means it’s free to use. And because Meltano processes your data locally, you don’t need to lease services from third parties, and your sensitive data never has to leave your systems.

A true ELT solution

Meltano stands out for its ability to offer solutions at every step of the ELT pipeline.

Traditionally, data pipelines have struggled with data accessibility: simply getting data from one place to another takes time that would be better spent deriving insights. By adopting DataOps principles and a distributed data mesh approach, Meltano brings together the data industry’s best practices to overcome problems common to data systems. The result is accelerated workflows, shorter cycle times, and generally more reliable data platforms.

Meltano builds pipelines in a modular fashion from best-in-class open-source tools: Singer taps serve as extractors, Singer targets as loaders, and dbt as the transformer. Singer as the underlying extraction and loading protocol makes data ingestion a breeze, while dbt makes transformations as simple as writing SQL queries.
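
Under the hood, a Singer tap writes a stream of JSON messages to standard output, which any Singer target can consume. A simplified excerpt might look like the following (the schema and field values are abbreviated for illustration):

{"type": "SCHEMA", "stream": "tags", "schema": {"properties": {"name": {"type": "string"}}}, "key_properties": ["name"]}
{"type": "RECORD", "stream": "tags", "record": {"name": "v1.78.0", "project_id": 7603319}}
{"type": "STATE", "value": {"bookmarks": {"tags": "2021-06-01T00:00:00Z"}}}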

A Meltano project is stored as a directory on your computer or in a Git repository, with the project’s configuration kept in text-based files. To change any part of the pipeline, you simply adjust the corresponding component’s configuration file. In true DevOps fashion, you can apply any modern software development practice to your Meltano project, including version control, code review, and CI/CD.
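
For example, putting a freshly initialized project under version control takes nothing more than the usual Git workflow:

git init
git add meltano.yml
git commit -m "Add initial Meltano pipeline configuration"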

This way, Meltano’s pipelines are self-contained from start to finish, and you don’t need to look elsewhere to build a separate solution for a given step of the pipeline.

Meltano in action

We’ll now illustrate the ease with which you can use Meltano to create a data integration pipeline. We assume that you’ve already installed Meltano on your machine. If not, you can follow Meltano’s official installation guide.
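
For reference, a typical installation (assuming Python 3 and pip are available) looks something like the following; consult the official guide for the currently recommended method:

python -m venv .venv
source .venv/bin/activate
pip install meltano
meltano --version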

First, we’ll initialize a Meltano project named “meltano-demo” and move into the project’s directory:

meltano init meltano-demo
cd meltano-demo

Now let’s set up the pipeline’s components, starting with an extractor that pulls the data from GitLab:

meltano add extractor tap-gitlab

We’ll configure the extractor to pull data from the meltano/meltano project. Additionally, we’ll instruct the tap to only extract data from June 1, 2021 onward, and only from the “tags” stream:

meltano config tap-gitlab set projects meltano/meltano
meltano config tap-gitlab set start_date 2021-06-01T00:00:00Z
meltano select tap-gitlab tags
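
To double-check what the tap will extract, Meltano can list the current selection rules (the exact flags may differ slightly between Meltano versions):

meltano select tap-gitlab --list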

The extractor is all set up! Now we’ll add a loader that stores the data in a JSONL file:

meltano add loader target-jsonl

We’ll need to set up a folder where the JSONL file will be stored. We can do this manually by opening the meltano.yml file and adding the required lines ourselves, or we can run the following commands:

meltano config target-jsonl set destination_path my_jsonl_files
mkdir my_jsonl_files
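
Either way, the resulting loader entry in meltano.yml should look roughly like this (the exact layout may vary between Meltano versions):

plugins:
  loaders:
  - name: target-jsonl
    config:
      destination_path: my_jsonl_files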

With the extractor and loader in place, all that’s left is to combine them into a pipeline:

meltano elt tap-gitlab target-jsonl
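
For recurring runs, it’s worth knowing that meltano elt can track extraction state so that subsequent runs only pull new data. In Meltano versions current at the time of writing, this is done by passing a job ID; check the CLI reference for your version:

meltano elt tap-gitlab target-jsonl --job_id=gitlab-tags-to-jsonl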

Finally, to verify that the pipeline ran successfully, we’ll print the first tag from the created file:

$ head -n 1 my_jsonl_files/tags.jsonl

{"name": "v1.78.0", "message": "Bump version: 1.77.0 \\u2192 1.78.0", "target": "802654553892e7bf8cc4fee78bce259a7fb741ab", "commit_id": "b45b987ed6d9c50d89da418bb916556d0efc2f10", "project_id": 7603319}

And voilà! It took us just 10 commands to create a data integration pipeline.

Meltano implementation best practices

In this section, we'll provide some tips on using Meltano to make the most of your data pipelines.

Tip 1: Keep dbt transformations separate

We recommend keeping each pipeline’s source-centered and business-centered dbt transformations separate. This separation might seem unnecessary, but a clear distinction between the two contexts pays off in the long run: segregating transformations simplifies testing and validation. You can validate sources separately from models and run schema tests, referential-integrity checks, or your own custom tests. This goes a long way toward building a reliable data platform that reflects the true state of your business’s data.
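
One way to enforce this separation is at the directory level, keeping source-centered staging models apart from business-centered marts. A hypothetical dbt project layout might look like this:

$ tree transform/models
transform/models
├── staging
│   └── stg_tags.sql
└── marts
    └── release_summary.sql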

Tip 2: Build reusable components

You’ll be working with all kinds of data sources: databases, Excel sheets, RFC 822 files, multiple APIs, you name it. You may need to write custom code for extractors and loaders, and to minimize the effort required to maintain these components, you should aim to make them reusable. Luckily, reusable data integration is the main idea behind Singer.
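
The payoff of Singer’s design is composability: once a tap or a target exists, it can be paired with any counterpart over a plain Unix pipe. The config file names below are hypothetical:

tap-gitlab --config tap_config.json | target-jsonl --config target_config.json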

Tip 3: Do not skimp on monitoring and logging

Don’t overlook the monitoring and logging parts of your data lifecycle. Putting proper monitoring and logging tools in place early on will keep you informed about what your infrastructure is doing, and it mitigates problems that arise later, since your team will be able to troubleshoot them much faster.
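
As a lightweight starting point, you can raise Meltano’s log verbosity and capture each run’s output to a file (this assumes a logs/ directory exists; the flag reflects the CLI at the time of writing):

meltano --log-level=debug elt tap-gitlab target-jsonl 2>&1 | tee logs/elt-$(date +%F).log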

Tip 4: Establish success criteria

How do you evaluate the success of your data platform? One way is to establish success criteria by which to measure your business’s performance. Consider establishing benchmarks for cycle time, speed to market, and/or developer productivity, and optimize accordingly.

Tip 5: Start small

Start with a small project or proof of concept. Such a project is not meant to implement the entire solution but rather to prove that the implementation is feasible in the first place. A microscale project lets you refine your idea and can serve as a strategic roadmap for future development.

Start implementing Meltano for your business

By now, it’s clear why Meltano is a great choice for building your data platform: it’s powerful but simple to maintain, and its open-source model makes it flexible, budget-friendly, and reliable.

Looking to get started with Meltano?

At Mighty Digital, we help organizations leverage Meltano to power their data integration pipelines. Our data infrastructure and processing experts are experienced in Meltano implementations and require minimal input on your end.

Get in touch with Mighty Digital experts today!
Connect with us