Automating Data Lineage with Metabase
💡
Note: this guide was written as of Metabase v0.46. If you find anything no longer working please let us know in the comments or at hello@grai.io

Connecting the data used by data consumers on the business or analytics side of an organization is often one of the hardest parts of building an end-to-end data lineage story. Understanding how Metabase slots into the rest of your data stack is a critical first step towards protecting your users from production dashboard outages.

If your organization depends on Metabase as part of its data consumer story then you're in luck, because it exposes a robust API for exploring and extracting usage metadata. We are going to walk through the structure of the backend API and the code required to reconstitute your BI-layer data lineage from scratch. If you're looking for something already built and ready to use, you can alternatively use the internal library we've built here at Grai, which is available through pip.

pip install grai-source-metabase

The API

💡
If you're interested in diving in firsthand, you can find the complete Metabase API documentation here.

There are three basic concepts of the Metabase API we will need to understand:

  1. Collections
  2. Cards and Questions
  3. Tables

classDiagram
    Collection -- "*" Question : contains
    Table -- "*" Question : sources
    class Collection {
        +ID: Integer
        +Name: String
    }
    class Question {
        +ID: Integer
        +Title: String
    }
    class Table {
        +ID: Integer
        +TableName: String
    }

Collections

Collections are sets of items (questions, models, dashboards, etc.) manually defined by the Metabase administrator. They provide organizational structure to the Metabase deployment, allowing teams and other business units to structure their workflows within the tool.

Collections are organized around namespaces. We are primarily interested in Tables and Questions, both of which are stored in the default namespace, so we won't need to bother with namespaces here.

We can query for all of the Collections through the GET /api/collection/ route.
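
To make this concrete, here's a minimal sketch of authenticating and listing Collections with the Python requests library. The METABASE_URL, METABASE_USERNAME, and METABASE_PASSWORD environment variables are illustrative assumptions; swap in whatever configuration your deployment uses.

import os

import requests

# Illustrative configuration -- adjust for your own deployment.
METABASE_URL = os.environ.get("METABASE_URL", "http://localhost:3000")

# Exchange credentials for a session token, then pass it on every
# subsequent request via the X-Metabase-Session header.
session = requests.post(
    f"{METABASE_URL}/api/session",
    json={
        "username": os.environ["METABASE_USERNAME"],
        "password": os.environ["METABASE_PASSWORD"],
    },
)
session.raise_for_status()
headers = {"X-Metabase-Session": session.json()["id"]}

# List every Collection visible to this user.
collections = requests.get(f"{METABASE_URL}/api/collection/", headers=headers).json()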

Cards and Questions

Metabase Questions are effectively a query plus its resulting visualization, while Cards are components of a dashboard. Somewhat confusingly, the Metabase API /card/ route actually refers to Questions, not Cards. For our purposes this is just fine: when it comes to lineage we care most about the underlying SQL queries rather than the higher-level organization of those results in a dashboard.

So, our Metabase Questions are available through the GET /api/card/ route.
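
Continuing the sketch above (same METABASE_URL and session headers), fetching every Question is a single request:

cards = requests.get(f"{METABASE_URL}/api/card/", headers=headers).json()

# Each entry carries the references we need for lineage.
for card in cards:
    print(card["id"], card["name"], card.get("table_id"), card.get("collection_id"))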

Tables

The term table in Metabase is somewhat overloaded. In one sense, a table is a type of visualization which, rather than rendering an image or plot, displays the raw row and column data of a query.

Tables can also take on their traditional database meaning of tabular data storage. Metabase actually builds an index of tables for each database it connects to.

There are two API routes we will need for Tables (a short request sketch follows the list):

  1. GET /api/table/ which returns a list of Metabase indexed tables
  2. GET /api/table/{table_id}/query_metadata which returns metadata about an individual table.
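
Here is the request sketch promised above, again reusing the session headers from the earlier snippets. Pulling metadata for tables[0] is purely illustrative.

tables = requests.get(f"{METABASE_URL}/api/table/", headers=headers).json()

# Column-level metadata for a single table (here: just the first one returned).
metadata = requests.get(
    f"{METABASE_URL}/api/table/{tables[0]['id']}/query_metadata",
    headers=headers,
).json()
print(metadata["name"], [field["name"] for field in metadata["fields"]])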

Bringing it Together

We need to know the queries/tables associated with individual Questions, which themselves can belong to Collections. I haven't provided example API responses for each of these routes, but at a high level:

  • Tables, Collections, and Questions each have a unique id
  • Questions have a table_id, collection_id, and database_id reference
💡
Because the response objects from these APIs can be very large, we are going to simplify and provide a pydantic model for each response, which should parse correctly and provide typing information. If you're unfamiliar with pydantic, think of it like a JSON parser.

Connecting Questions to Tables

  1. Query for all Tables and their associated table_ids from GET /api/table/
  2. Query for every Question from GET /api/card/
  3. Use the table_id from (2) to connect individual Questions with Tables from (1)

GET /api/table/

Response: A list of Tables

from typing import List, Optional

from pydantic import BaseModel, Field


class Table(BaseModel):
    id: int
    name: str
    active: Optional[bool]
    display_name: Optional[str]
    table_schema: str = Field(alias="schema")
    entity_type: Optional[str]
    db_id: int
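
As a quick sketch, the raw /api/table/ response (the tables list from the earlier snippet) parses straight into this model; pydantic ignores the many extra keys in the payload by default.

# Index Tables by id for the join against Questions below.
table_index = {table.id: table for table in (Table(**t) for t in tables)}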

GET /api/card/

Response: A list of Questions

class Question(BaseModel):
    id: int
    name: str
    table_id: Optional[int]
    database_id: Optional[int]
    collection_id: Optional[int]
    archived: Optional[bool]
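
With both models in hand, a minimal join might look like the following, assuming the cards response and table_index from the earlier sketches. Some Questions (native SQL queries, for example) may not carry a table_id at all, hence the membership check.

questions = [Question(**c) for c in cards]

# Map each Question to the Table it sources, when Metabase knows one.
question_to_table = {
    question.id: table_index[question.table_id]
    for question in questions
    if question.table_id in table_index
}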

Connecting Questions to Collections

Since we already have a collection_id for each Question, all we have to do now is find all of the Collections in our Metabase instance and match each Question's collection_id to a Collection's id.

GET /api/collection/

Response: A list of Collections

class Collection(BaseModel):
    name: str
    id: int
    archived: Optional[bool]
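
Here's a sketch of that matching step, reusing the collections response and questions list from earlier. The collection listing can include entries (such as the root collection) whose id doesn't fit this model, so anything that fails validation is simply skipped.

from pydantic import ValidationError

collection_index = {}
for raw in collections:
    try:
        collection = Collection(**raw)
    except ValidationError:
        continue  # e.g. the root collection, whose id isn't an integer
    collection_index[collection.id] = collection

# Group Questions under their parent Collection.
questions_by_collection = {}
for question in questions:
    if question.collection_id in collection_index:
        questions_by_collection.setdefault(question.collection_id, []).append(question)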

Connecting Tables to Columns

We'll have to do a little more work to figure out the column names belonging to each Table. To do that we will need our last API route:

GET /api/table/{table_id}/query_metadata

Response:

class TableMetadataField(BaseModel):
    id: int
    name: str
    display_name: str
    active: bool


class TableMetadata(BaseModel):
    id: int
    name: str
    table_schema: str = Field(..., alias="schema")
    active: bool
    db_id: int
    display_name: str
    fields: Optional[List[TableMetadataField]]
💡
In the actual JSON response the table_schema value will just be called schema. However, in order to work around restricted keywords in pydantic we've had to use a Field alias here.

Each response yields a database identifier db_id, the table schema schema, the table name name, and a list of associated fields fields. Each field is a column on the table with its own name.
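
Putting that to use, a final sketch builds a table_id to column-name lookup by calling query_metadata for every table in the table_index from earlier:

# Fetch column metadata for each indexed table.
table_columns = {}
for table_id in table_index:
    response = requests.get(
        f"{METABASE_URL}/api/table/{table_id}/query_metadata",
        headers=headers,
    )
    meta = TableMetadata(**response.json())
    table_columns[meta.id] = [field.name for field in (meta.fields or [])]

Combined with question_to_table from before, this is enough to draw an edge from every Question down to the warehouse columns it ultimately depends on.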

Wrapping up

At this point we have everything needed to automate building a lineage diagram connecting Questions through to the fields they reference in the warehouse. If you're using something like Grai, you can integrate this lineage back into your CI workflows to provide alerts whenever data changes might impact Metabase users, or to identify data no longer being used for analytics.