Grai: A Comprehensive Guide to Data Lineage in your Data Stack

Grai: A Comprehensive Guide to Data Lineage in your Data Stack

Grai is a robust data lineage solution designed to aid in managing your data stack.

This blog post guides you through configuring Grai for a data stack using DBT, Snowflake, Fivetran, and Postgres, and demonstrates how to validate data changes in your version control system to prevent outages rather than respond to them.

Setting the Scene: Your Data Stack

Let's set the scene with an overview of the data stack in question:

  1. DBT – A transformation and testing tool for your data warehouse.
  2. Snowflake – A cloud data platform that serves as the data warehouse.
  3. Fivetran – An ELT (extract, load, transform) solution that brings data into your warehouse.
  4. Postgres – The primary operational database.

Configuring Grai: Step-by-Step Guide

Now, let's configure Grai using its web app. Simply follow these steps:

Step 1: Sign Up for a Grai Account

Head over to Grai's website and sign up for an account at https://app.grai.io/register

Step 2: Add Data Sources

Once you've signed up, navigate to the "Connections" tab and click "Add Connection."

You'll need to add Snowflake, Fivetran, and Postgres. Each connection will require slightly different authentication credentials. For example, to configure Fivetran you'll need both an api_key and api_secret, you can find more information in Fivetran's documentation here.

For now we will leave the optional values like namespace empty.

Step 3: Upload your dbt manifest

Still in the "Add Connection" page, the dbt integration works slightly differently. In this case you simply need to upload an up-to-date manifest.json

Step 4: Complete the Setup Process

After setting up each integration, Grai will automatically ingest data lineage information from your data stack. Navigate to the "Graph" tab at the top of the page to see how your infrastructure connects together. It should look something like this.

Configuring GitHub: Bringing Grai into CI

Now that Grai is configured you're ready to begin using data lineage as part of your testing process. A common testing pattern is to deploy a dev database as part of the CI process. This can be done in a variety of ways - we are fond of using service containers in GitHub Actions but we've also seen companies follow dev/test/prod databases corresponding to dev/test/prod branches in git.

Whichever way you deploy you can integrate tests into CI using one of our supported GitHub Action integrations. More information is available in the docs for how to use each action.

Github Workflows

Since our production database is postgres you'll be deploying the Grai postgres action wherever in whichever repository you use to manage the production database. This is a pretty simple script we'll be placing in the ".github/workflows" folder of your repository. If you don't have any existing workflows you can create a file called "main.yaml" to work from. All that's left to do is copy the content of that yaml to your workflow file.

on:
  - pull_request
name: PostgreSQL
jobs:
  test_postgres:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v3
      - name: Run Grai Action
        uses: grai-io/grai-actions/postgres@master
        with:
          namespace: my_apps_grai_namespace
          grai-user: your_username
          grai-password: your_password
          namespace: default
          db-host: prod.db.com
          db-port: '5432'
          db-database-name: my_database
          db-user: my_user
          db-password: my_password
 

Ensure the database credentials you provide here point to the dev database you want to test, not the production database configured above. You'll also want to update your username and password.

Grai in Action: Validating a Pull Request

Now you're done! All that's left to do is begin testing your changes. Every pull request against your production database will now run tests to validate the changes against your DBT tests.

Step 1: Create a Pull Request

First, create a pull request with the proposed change in your Git repository.

Step 2: Grai Validates the Change

Grai will automatically validate the change by analyzing the impact of the proposed modification on the entire data stack.

Step 3: Grai Reports on the Issue

If a test fails or some other issue is detected, Grai will reports back in the pull request, alerting users to the detected problem and preventing any potential outages from going live.

Wrapping Up

Grai is a game-changing tool that brings data lineage to the forefront, enabling you to maintain the integrity of your data stack by validating changes and preventing potential issues. By integrating Grai into your data stack, you can ensure a more reliable and robust data pipeline.

Happy data lineage tracking!