Data Warehouse

How Singer is changing the landscape of DataOps

April 28, 2022
clock
9 min read
Singer

Share

copied

The DataOps methodology is increasingly embraced by companies looking to improve product quality, customer satisfaction, staff productivity, and cost-efficiency. Created to help you maximize your DataOps implementation, Singer is the open-source ELT standard that’s revolutionizing data processing. But what is Singer, and how is it changing the DataOps landscape? This article will go over key concepts and practical examples to illustrate why you should use Singer to implement DataOps in your organization.

What is DataOps?

DataOps is a modern approach to data analytics. Data analysts often work with data that comes from various sources and in a variety of different formats. This makes data merging and analytics complex, prone to error, and ultimately less reliable than you’d like. Companies find themselves investing heavily into fixing data issues, rather than channeling those resources to get valuable information from their data.

DataOps addresses the data pipeline complexity problem by unifying a set of practices organized around three key pillars:

  • Modern software engineering approaches inspired by DevOps to automate the lifecycle of data production;
  • The Agile methodology of collaborating, learning, adapting (CLA);
  • Cross-functional teams, where people from key departments like Production, Analytics, and Operations all work together towards the goal of obtaining quality business analytics.

DataOps shifts focus away from specific technologies, and instead emphasizes people, teams, and processes as the means to achieve desired business outcomes. It is a new and wholly independent approach to organizing and preparing data analytics to reduce errors and improve product quality.

With DataOps in place, companies can get the most out of their data.

What is Singer?

In a DataOps environment, you’re constantly moving data: you first extract it from a source before transforming it and loading it into a data warehouse for analysis. But the data pipeline can be a series of parts written in different languages and which contain data in different structures. What’s more: these individual components may all be maintained by different teams. Wouldn’t it be great if, through a single standard, all of your data could be consolidated—regardless of your organization’s size and the complexity? Good news, that’s exactly what Singer offers!

Singer is a specification for data extraction and data loading scripts communication. The extracted data comes in the JSON (JavaScript Object Notation) format, with which a large number of programming languages are familiar. In fact, the Singer extracting and loading programs (referred to as taps and targets) can be written in any programming language.

The image above represents what the Singer standard offers in terms of ELT—which stands for Extract, Load, Transform. Singer acts as a unified standard tying together all the data sources and preparing the resulting data for use in a data warehouse.

The Singer protocol is open-source and is implemented in data tools like Meltano. Singer itself isn’t opinionated on how exactly a data pipeline should run, so Meltano and similar projects provide management and automation for pipelines built with Singer.

In addition to self-hosted tools like Meltano, there are also managed services like Stitch, which provide an all-in-one web-based interface to connect to your data sources and then use Singer tools to extract data and load it into a cloud-based warehouse.

Singer specification overview

The Singer standard works with data extraction scripts called Taps and data loading scripts called Targets. These scripts are usually written in Python or in another programming language, and are open-source. Taps take the data from their sources, like MySQL databases, Google Sheets, or Amazon S3 buckets, and convert it into JSON records that are printed to standard output. JSON output can be used directly with a Singer-compliant target, which will push this data into a warehouse for further analysis.

The output of a Singer tap

With Singer, the output of any tap is structured the same way. At output, data is already standardized in its structure, regardless of where it comes from. The output is made of three types of JSON messages: SCHEMA, RECORD and STATE, each essential for the Singer pipeline to function correctly.

Let’s have a look at what exactly the Singer JSON looks like. The image below is a screenshot of our example dataset—the data covers maize harvests in the member countries of the European Union. We downloaded the dataset from the European data portal and then uploaded it to a Google Sheet. Here’s how the data looks in the sheet:

As covered above, Singer taps produce three kinds of JSON messages (SCHEMA, RECORD, and STATE) from any data source, and our sheet is no exception.

SCHEMA message

The Schema is a message type that describes the shape of the data in our source. Here’s how a schema message can look (we’ve omitted some details for brevity):

{
    "streams": [
        {
            "tap_stream_id": "Sheet1",
            "stream": "Sheet1",
            "schema": {
                "selected": "true",
                "properties": {
                    ...
                },
                "crops,strucpro,geo\\time": {
                    "type": [
                        "null",
                        "string"
                    ]
                },
                "2010": {
                    "anyOf": [
                        {
                            "type": "number",
                            "multipleOf": 1e-15
                        },
                        {
                            "type": [
                                "null",
                                "string"
                            ]
                        }
                    ]
                }
            }
        },

        ...
    ]
}

The data in the schema identifies the source’s name, Sheet1, as the stream name. In Singer terms, a stream is a unit of data that comes from the tap. In the case of a Google Sheet, each sheet will have its own Stream section. In a database, for example, each table might also have its own Stream.

For each stream, the schema shows the expected type for each property. In our case, the properties are our spreadsheet’s columns. Therefore, we have the properties crops,strucpro,geo\time and 2010—the headings for the first two columns in the sheet.

Each property has its type defined in the schema. The property 2010 in our schema message can be either a number or a string*.*

Only the first two sections of the SCHEMA record are shown in our example above, but the schema contains type information for the entire sheet.

RECORD message

The Singer record provides the actual data coming from the stream. For example, here’s a row from our Google Sheet displayed as a Singer RECORD message:

{
    "type": "RECORD",
    "stream": "Sheet1",
    "record": {
        "2010": 54.2,
        "2011": 61.2,
        "2012": 53.5,
        "2013": ":",
        "2014": 55,
        "2015": 54.6,
        "2016": 58.5,
        "2017": 58.1,
        "2018": 54.1,
        "2019": 55.1,
        "2020": ":",
        "2021": ":",
        "__sdc_spreadsheet_id": "1YoRV0euSdK8Etd6ML_axTObHt4wISORKTKQow8iv7Ow",
        "__sdc_sheet_id": 0,
        "__sdc_row": 2,
        "crops,strucpro,geo\\ ime": "C1500,AR,AL"
    },
    "version": 1626974962849,
    "time_extracted": "2021-07-22T17:29:21.590751Z"
}

You can see that the fields in this JSON message include the stream name, the time when the record was extracted, and a calculated version field that helps Singer identify which fields have been recently updated. The record field contains a single row from our Google Sheet.

A new RECORD message is produced for each unit of data coming from our tap, and in our case, Singer produces one record per row.

STATE message

The state provides a summary of the data contained in the stream and the current situation of the extraction process—so that Singer knows how to proceed if a process is interrupted. Here’s an example state message:

{
    "type": "STATE",
    "value": {
        "currently_syncing": "Sheet1"
    }
}

Now that we know how Singer messages look, let’s try using Singer taps and targets to export the data from our Google Sheet into a CSV file.

Taps and targets in action

To extract our data from a Google Sheet, we’ll need to use the Singer tap called tap-google-sheets. We’ll be outputting the data to a CSV file, to which end we’ll need to use the Singer target called target-csv.

Before running Singer, we’ll need to make sure that our machine has Python and pip installed. Both the tap and target are PyPI packages maintained by the Singer community, so we can install them by running pip install:

$ pip install tap-google-sheets$ pip install target-csv

To get data from a Google Sheet, we’ll need to create a configuration file for tap-google-sheets. Here’s how our config-gsheets.json configuration file looks:

{
  "client_id": "<oauth-client-id>@vv.apps.googleusercontent.com",
  "client_secret": "<secret>",
  "refresh_token": "<token>",
  "spreadsheet_id": "1YoRV0euSdK8Etd6ML_axTObHt4wISORKTKQow8iv7Ow",
  "start_date": "2019-01-01T00:00:00Z",
  "user_agent": "tap-google-sheets <[email protected]>"
}

You can find more details on how to create the right API credentials with your Google Account in the tap-google-sheets README file on GitHub.

With the tap and target installed and the config created, we can run the discover step. This step will produce the SCHEMA record that we’ll save to a catalog.json file and will use later in the process:

$ tap-google-sheets -c config-gsheets.json --discover > catalog.json

At this point, the catalog.json file contains the SCHEMA record for our Google Sheet, similar to the one we showed in the previous section.

With the catalog created, we can now run the tap and inspect its output:

$ tap-google-sheets --config config-gsheets.json --catalog catalog.json

Here’s how our terminal window looks when we run the above command:

The tap’s output gets printed to the standard output. This includes INFO messages and JSON records—the SCHEMA record is visible in the screenshot, and RECORD messages are not shown.

After we’ve confirmed that the tap is working as expected, it’s time to connect the tap to our CSV target. The preferred way to use Singer taps and targets together is to connect them using a Unix pipe:

$ tap-google-sheets --config config-gsheets.json --catalog catalog.json | target-csv

The target-csv target creates a file in the same directory. Unless you specify a filename using one of the target-csv config options, the filename will be created from the current date, time, and Google Sheet name. In our case, the filename is Sheet1-20210722T191529.csv. Let’s have a look at the CSV file that the program created:

$ head Sheet1-20210722T191529.csv

1YoRV0euSdK8Etd6ML_axTObHt4wISORKTKQow8iv7Ow,0,2,"C1500,AR,AL",54.2,61.2,53.5,:,55.0,54.6,58.5,58.1,54.1,55.1,:,:
1YoRV0euSdK8Etd6ML_axTObHt4wISORKTKQow8iv7Ow,0,3,"C1500,AR,AT",201.14,217.1,219.7,201.92,216.32,188.73,195.25,209.48,209.9,220.69,212.6,218.64

We can see that the records from our Google Sheet were successfully inserted as rows in our CSV file.

Why Singer is promising for the DataOps landscape

One of the main issues companies face when implementing DataOps is the variety of data sources. The more sources there are, the harder it gets to put the data together and run analytics on it. This is where Singer truly shines: taps and targets allow you to work with many data sources in a standardized way, and the list of data sources and destinations that Singer supports is constantly growing.

When you use similar pipelines for all of your data, you reduce the cognitive load on your DataOps team members. They can worry less about the compatibility of different databases and file formats, and focus more on delivering value to your business through data analysis. Using tools like Singer to unify the data pipeline also opens up opportunities for greater automation and monitoring to ensure data quality.

Because data is outputted in a JSON format from Singer taps, a number of programming languages can be used to write Tap and Target scripts. This eliminates the need for profound and expensive changes to your ecosystem to implement Singer in your Analytics process.

Singer ecosystem highlights

Singer comes with a large set of components designed to adapt to your ecosystem. If you want to get data from a source, there’s likely a community-maintained tap for that data source. As of this writing, there are 156 taps available in the Singer organization on GitHub, covering everything from PostgreSQL databases to Google Sheets and LinkedIn Ads API.

The Singer ecosystem also provides a template for easily creating your own Singer targets, letting you integrate Singer with any data infrastructure you may have in place.

Implementing Singer to make the most of your data

Singer should be your tool of choice for implementing DataOps. Its flexibility, reliability, and open-source model make it a solid partner for the long run. Singer’s future is bright, as it’s continually developing and integrating more sources by the day.

Looking to implement Singer to take your DataOps processes to the next level?

At Mighty Digital, we help businesses achieve their data analytics goals by providing intelligent data management solutions. As data infrastructure and processing experts, we become an extension of your team to help you realize Singer’s potential to turn your data into a source of growth.

Our data infrastructure and processing experts work directly with Data, Engineering, and Finance teams to gather requirements for a Singer implementation, before building out all the necessary infrastructure and migrating data pipelines to Singer, with minimal involvement required on your end.

Vladyslav Hrytsenko

Vladyslav Hrytsenko

linkedIn icon
Top full-stack engineer and open-source contributor, data solutions architect. Chief Technology Officer at Mighty Digital