Automating Data Lineage with Metabase
Connecting the data used by data consumers in the business or analytics end of an organization is often one of the hardest aspects of building an end to end data lineage story. Understanding how Metabase slots into the rest of your data stack is a critical first step towards protecting your users from production dashboard outages.
If your organization depends on Metabase as part of its data consumer story then you're in luck because it exposes a robust API to explore and extract usage metadata. We are going to walk through the structure of the backend API and the code required to re-constitute your BI layer data lineage from scratch. If you're looking for something already built and ready to use you can alternatively use the internal library we've built here at Grai which is available through pip.
pip install grai-source-metabase
The API
There are three basic concepts of the Metabase API we will need to understand
- Collections
- Cards and Questions
- Tables
Collections
Collections are sets of items be it questions, models, dashboards, etc... manually defined by the Metabase administrator. They provide organizational structure to the Metabase deployment allowing teams and other business units to structure their workflows within the tool.
Collections are organized around namespaces
. We are primarily interested in Tables and Questions both of which are stored in the default
namespace so we won't need to bother with them.
We can query for all of the Collections through the GET /api/collection/
route.
Cards and Questions
Metabase Questions are effectively a query, and resulting visualization where Cards are components of a dashboard. Somewhat confusingly, the Metabase API /card/
route actually refers to Questions not Cards. For our purposes this is going to be just fine because when it comes to lineage we care most about the direct SQL queries rather than the higher level organization of those results in the dashboard.
So, our Metabase Questions are available through the GET /api/card/
route.
Tables
The term table in Metabase can be somewhat overloaded. In one case, tables are a type of visualization except rather than an image or plot, it instead renders the raw row and column data of a query.
Tables can also take on their  traditional database meaning of tabular storage of data. Metabase actually builds an index of tables for each database it connects with.
There are two API routes we will need for Tables
-
GET /api/table/
which returns a list of Metabase indexed tables GET /api/table/{table_id}/query_metadata
which returns metadata about an individual table.
Bringing it Together
We need to know the queries/tables associated with individual Questions which themselves can belong to Collections. I haven't provided example API responses for each of these routes but at a high level
- Tables, Collections, and Questions each have a unique
id
- Questions have a
table_id
,collection_id
, anddatabase_id
reference
Connecting Questions to Tables
- Query for all Tables and their associated table_ids from
GET /api/tables/
- Query for every Question from
GET /api/cards/
- Use the
table_id
from (2) to connect individual Questions with Tables from (1)
GET /api/tables/
Response: A list of Tables
class Table(BaseModel):
id: int
name: str
active: Optional[bool]
display_name: Optional[str]
table_schema: str = Field(alias="schema")
entity_type: Optional[str]
db_id: int
GET /api/cards/
Response: A list of Question
class Question(BaseModel):
id: int
name: str
table_id: Optional[int]
database_id: Optional[int]
collection_id: Optional[int]
archived: Optional[bool]
Connecting Questions to Collections
Since we already have a collection_id
for each Question all we have to do now is find all of the collections in our Metabase instance and match each Questions collection_id
to each Collections id
.
GET /api/collection/
Response: A list of Collection
class Collection(BaseModel):
name: str
id: int
archived: Optional[bool]
Connecting Tables to Columns
We'll have to do a little more work to figure out column names belonging to each Table. To do that we will need our last api route
GET /api/tables/{table_id}/query_metadata
Response:
class TableMetadataField(BaseModel):
id: int
name: str
display_name: str
active: bool
class TableMetadata(BaseModel):
id: int
name: str
table_schema: str = Field(..., alias="schema")
active: bool
db_id: int
display_name: str
fields: Optional[List[TableMetadataField]]
table_schema
value will just be called schema
. However, in order to work around restricted keywords in pydantic we've had to use a Field alias here.Each response yields a database identifier db_id
, the table schema schema
, the table name name
, and a list of associated fields fields
. Each field is a column on the table with it's own name.
Wrapping up
At this point we have everything needed to automate building a lineage diagram connecting Questions through to the the fields they reference in the warehouse. If you're using something like Grai, you can integrate this lineage back into your CI workflows to provide alerts whenever data changes might impact Metabase users or to identify data no longer being used for analytics.