<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <id>https://docs.getdbt.com/blog</id>
    <title>dbt Developer Hub Blog</title>
    <updated>2026-03-26T00:00:00.000Z</updated>
    <generator>https://github.com/jpmonette/feed</generator>
    <link rel="alternate" href="https://docs.getdbt.com/blog"/>
    <subtitle>dbt Developer Hub Blog</subtitle>
    <entry>
        <title type="html"><![CDATA[The Catalog Linked Database Diaries: On Freshness and Writes]]></title>
        <id>https://docs.getdbt.com/blog/catalog-linked-databases</id>
        <link href="https://docs.getdbt.com/blog/catalog-linked-databases"/>
        <updated>2026-03-26T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Learn about catalog linked databases with Apache Iceberg.]]></summary>
        <content type="html"><![CDATA[<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="the-catalog-linked-database-diaries-on-freshness-and-writes">The Catalog Linked Database Diaries: On Freshness and Writes<a href="https://docs.getdbt.com/blog/catalog-linked-databases#the-catalog-linked-database-diaries-on-freshness-and-writes" class="hash-link" aria-label="Direct link to The Catalog Linked Database Diaries: On Freshness and Writes" title="Direct link to The Catalog Linked Database Diaries: On Freshness and Writes">​</a></h2>
<p>Last November, at dbt Summit, Jeremy introduced dbt’s multi-platform Iceberg capabilities.</p>
<div style="display:flex;justify-content:center"><iframe width="560" height="315" src="https://www.youtube.com/embed/bRJJkeJkUsE?si=mQTD3jUNpPqvwrJb" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin"></iframe></div>
<p>What intrigued us most was the promised interconnectivity of Databricks Unity Catalog and Snowflake catalog-linked databases.</p>
<p>AI’s all the rage, but another little revolution is taking shape: Teams are breaking their data storage out of vendor-specific platforms. For months, we have been chatting with users excited to adopt Iceberg as a core pillar of their data architecture. The Iceberg table format and Iceberg REST catalogs are the emerging standards powering that flexibility.</p>
<p>For dbt’s part, this shows up in two concrete use cases:</p>
<ul>
<li><strong>dbt projects at scale</strong>: Teams share one logical database, with many schemas and hundreds to thousands of tables</li>
<li><strong>Cross-platform mesh</strong>: One project in Snowflake, one in Databricks, sharing data without juggling manual refreshes or metadata pointers</li>
</ul>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="what-is-a-catalog-linked-database">What is a Catalog Linked Database?<a href="https://docs.getdbt.com/blog/catalog-linked-databases#what-is-a-catalog-linked-database" class="hash-link" aria-label="Direct link to What is a Catalog Linked Database?" title="Direct link to What is a Catalog Linked Database?">​</a></h2>
<p>In the Iceberg model, the catalog is the system of record for table metadata—schemas, snapshots, and evolution. It’s designed so multiple engines can interoperate against that same metadata layer. A catalog-linked database (CLD) is Snowflake’s way of exposing an open Apache Iceberg catalog, and all the Iceberg tables it contains, inside Snowflake as just another database.</p>
<p>The dream is for teams to share Iceberg tables across platforms without recreating their metadata pointers one by one or copying the actual data. Consider an organization whose finance team produces figures in Databricks while the marketing team works in Snowflake. The Snowflake team wants that upstream data fresh, and occasionally wants to write manual row updates back upstream. Old-school synced tables would treat Databricks as the sole source of truth. With Iceberg and a CLD-enabled architecture, both data platforms point to the same catalog-defined source of truth.</p>
<p>Some upfront configuration work in Snowflake buys you seamless cross-platform queries on seamlessly synced data objects—that’s the on-paper guarantee. (Snowflake recently published a <a href="https://docs.snowflake.com/en/user-guide/tutorials/tables-iceberg-set-up-bidirectional-access-to-unity-catalog" target="_blank" rel="noopener noreferrer">step-by-step tutorial</a> for using CLDs to enable bidirectional data sharing with Databricks — unimaginable a few years ago, and achievable today.)</p>
<p>As we developed our testing suite, we wondered what happens at scale for both reads and writes. Turn up the chaos: what happens when both are happening at once? For example, if the accounting team writes into a Databricks cluster at 6 a.m. every morning, but the synchronization step to the marketing team’s Snowflake cluster takes 2-3 hours, when will it be safe for their morning data analysis jobs to kick off?</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="our-testing-regimen">Our testing regimen<a href="https://docs.getdbt.com/blog/catalog-linked-databases#our-testing-regimen" class="hash-link" aria-label="Direct link to Our testing regimen" title="Direct link to Our testing regimen">​</a></h2>
<p>Based on our telemetry of real-world dbt projects, we see that large projects number in the hundreds of models, and some in the thousands. For the purposes of our testing, we made the worst-case assumption that <em>every single model</em> would be materialized as an Iceberg table. This is our upper bound. (It’s rare behavior in dbt projects adopting Iceberg, but a team could have legitimate reasons for choosing it.)</p>
<p>At these scales, catalog behavior, metadata operations, and refresh mechanics really start to matter. We observed latency and friction starting in the hundreds of tables, but for science, we pushed things to the extreme: we loaded 500k tables into one database and examined three questions:</p>
<ol>
<li><strong>Reading at scale:</strong> What’s the overhead when Snowflake reads tables owned elsewhere?</li>
<li><strong>Writing at scale:</strong> How does performance change when you’re creating/updating lots of tables and querying big sums of metadata?</li>
<li><strong>Freshness under change:</strong> When one platform updates data, how reliably and quickly does the other see it?</li>
</ol>
<p>We've published our full testing regimen and detailed findings, for anyone who wants to take a deeper look: <a href="https://github.com/dbt-labs/snow-dbx-iceberg-benchmark" target="_blank" rel="noopener noreferrer">https://github.com/dbt-labs/snow-dbx-iceberg-benchmark</a></p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="reads-at-scale">Reads at scale: Good performance, but only after platforms sync<a href="https://docs.getdbt.com/blog/catalog-linked-databases#reads-at-scale" class="hash-link" aria-label="Direct link to Reads at scale: Good performance, but only after platforms sync" title="Direct link to Reads at scale: Good performance, but only after platforms sync">​</a></h2>
<p>Using TPC-H queries over large benchmarking datasets, we found that once data is visible and up to date, querying those Iceberg tables from Snowflake is as fast as you’d want in any reasonable analytics workflow. Databricks querying the same data from the owning side is speedy too.</p>
<p>The catch is that “read performance” is really only half the story. In practice, what users experience is not “how fast is this query,” but “am I even querying fresh data?” When freshness slips, CLDs stop feeling like a pipe and start feeling like waiting for a package held up in customs.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="writes-and-changes-at-scale">Writes and change at scale: The compute bottleneck<a href="https://docs.getdbt.com/blog/catalog-linked-databases#writes-and-changes-at-scale" class="hash-link" aria-label="Direct link to Writes and change at scale: The compute bottleneck" title="Direct link to Writes and change at scale: The compute bottleneck">​</a></h2>
<p>When Snowflake is the one making lots of changes (creating tables, updating metadata, producing many Iceberg commits), each job runs against the upstream Databricks-owned objects. A query might take twice as long, but the data stays synchronized across both platforms. Write throughput becomes the limiting factor. In a dbt-shaped workload—many small/medium table operations rather than one giant append—this can make runs slow and fragile under contention, occasionally failing outright with errors claiming your table no longer exists.</p>
<p>Now, when Databricks is the one making changes, writes perform the same as in any ordinary Databricks workflow. The difficulty shifts to how quickly Snowflake reflects those changes.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="the-biggest-finding">The biggest finding: As scale increases, refresh latency does too<a href="https://docs.getdbt.com/blog/catalog-linked-databases#the-biggest-finding" class="hash-link" aria-label="Direct link to The biggest finding: As scale increases, refresh latency does too" title="Direct link to The biggest finding: As scale increases, refresh latency does too">​</a></h2>
<p>CLDs promise fast syncing. We found this largely true at small and medium scales. At larger scales, however, changes made by Databricks could take far longer than advertised to synchronize; in general, we experienced auto-refresh waits 2x longer than expected. When we dialed things up to 500k tables, the refresh on Snowflake for a trivial Databricks <code>INSERT</code> could take two days to propagate, and some tables seemed to get “stuck.” We eventually learned how to manually force refreshes for individual objects (i.e. hacking refresh-related settings to jog the system), but we found it difficult to predict when data updates would propagate from Databricks back to Snowflake; we mostly operated on a gut feeling of when data would arrive. (The good news: we hear the fine folks at Snowflake have ergonomic improvements on the way.)</p>
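<p>Without a hard freshness guarantee, a downstream consumer has to gate itself on observed freshness rather than on the clock. Here’s a minimal sketch of that pattern in Python (the check function is hypothetical; you’d implement it as a query for a sentinel row or snapshot ID written by the upstream job):</p>

```python
import time

def wait_for_freshness(is_fresh, timeout_s=3 * 60 * 60, poll_s=60):
    """Block until is_fresh() reports that an upstream write is visible.

    is_fresh: a callable you supply that returns True once a sentinel
    value written upstream (e.g. by the 6 a.m. Databricks job) becomes
    visible downstream (e.g. through the Snowflake CLD).
    Returns the observed propagation delay in seconds, or raises
    TimeoutError so downstream jobs fail loudly instead of silently
    reading stale data.
    """
    start = time.monotonic()
    while True:
        if is_fresh():
            return time.monotonic() - start
        if time.monotonic() - start > timeout_s:
            raise TimeoutError("upstream change never became visible")
        time.sleep(poll_s)
```

<p>It’s not elegant, but an explicit gate like this beats a gut feeling about when data will arrive.</p>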
<p>Mulling over our experiences, we believe the question of whether and how you should adopt Snowflake CLDs comes down to scale and latency:</p>
<ol>
<li>How many Iceberg tables are you syncing across multiple engines?</li>
<li>Do your workflows require that Snowflake have a near-real-time view of externally managed Iceberg? Or can you treat it as a possibly stale view, accept eventual consistency, and live without clear guarantees unless you build your own manual playbook and monitoring framework?</li>
</ol>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="interoperability-friction">Interoperability friction: Why it’s not just the metadata<a href="https://docs.getdbt.com/blog/catalog-linked-databases#interoperability-friction" class="hash-link" aria-label="Direct link to Interoperability friction: Why it’s not just the metadata" title="Direct link to Interoperability friction: Why it’s not just the metadata">​</a></h2>
<p>Two non-performance issues showed up quickly:</p>
<ul>
<li>Naming, quoting, and casing differences become friction points when dbt is generating objects that need to be understood identically by two engines. Our deep dive has given us ideas for dbt to abstract over these ergonomic challenges. In the future, users shouldn’t need to memorize the casing/quoting rules of every catalog/engine combo. For now, unfortunately, that’s just the cost of doing platform-agnostic business.</li>
<li>Metadata and refresh behavior become part of your job. You’re managing tables and the system that decides when tables “exist.” And those <code>SHOW ICEBERG TABLES</code> queries are slow.</li>
</ul>
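<p>To make the casing friction concrete: each engine canonicalizes <em>unquoted</em> identifiers differently, so the same model name can land in the shared catalog as two different strings. A toy illustration (the folding rules below are the engines’ documented defaults; quoted identifiers behave differently again):</p>

```python
def stored_identifier(name: str, engine: str) -> str:
    """Return how an unquoted identifier is canonicalized by each engine.

    Snowflake folds unquoted identifiers to upper case, while Spark-based
    engines such as Databricks fold them to lower case. dbt has to bridge
    this gap when one project's objects are read by the other engine.
    """
    if engine == "snowflake":
        return name.upper()
    if engine == "databricks":
        return name.lower()
    raise ValueError(f"unknown engine: {engine}")

# The same dbt model, seen from each side of the shared catalog:
stored_identifier("Fct_Orders", "snowflake")   # 'FCT_ORDERS'
stored_identifier("Fct_Orders", "databricks")  # 'fct_orders'
```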
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="the-takeaway">The takeaway<a href="https://docs.getdbt.com/blog/catalog-linked-databases#the-takeaway" class="hash-link" aria-label="Direct link to The takeaway" title="Direct link to The takeaway">​</a></h2>
<p>CLDs work—up to a point. They solve real, recurring problems of keeping data connected across platforms. If the number of tables or update volume is very large (on the order of tens of thousands), the pattern stops being a useful abstraction. The same goes if you depend on per-second precision for synchronizing writes. But until you approach that edge, CLDs really do make it possible to treat external Iceberg catalogs like any other database.</p>
<p>For us, that can unlock some very exciting capabilities within customers’ dbt workflows—cross-platform mesh, external sources, and maybe even running the same dbt project / DAG against multiple warehouses. We believe that Iceberg integrations will continue to improve, becoming more performant and easier to use. We need only look to the past year of features (including Snowflake CLDs and Databricks' native managed Iceberg tables, the two features that made this story possible) to be excited for what’s coming in the next one.</p>
<p>And finally, we can’t close without giving a nod to the Unity Catalog team for partnering with Snowflake on this killer feature.</p>]]></content>
        <author>
            <name>Anna Lee</name>
        </author>
        <author>
            <name>Mila Page</name>
        </author>
        <category label="iceberg" term="iceberg"/>
        <category label="catalogs" term="catalogs"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Make your AI better at data work with dbt's agent skills]]></title>
        <id>https://docs.getdbt.com/blog/dbt-agent-skills</id>
        <link href="https://docs.getdbt.com/blog/dbt-agent-skills"/>
        <updated>2026-02-05T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[We built a collection of Agent Skills to make coding agents better at using dbt projects and doing analytics work.]]></summary>
<content type="html"><![CDATA[<p>Community-driven creation and curation of best practices is perhaps <em>the</em> driving factor behind dbt and analytics engineering’s rise - transferable workflows and processes enable everyone to create and disseminate organizational knowledge. In the early days, <del>dbt Labs’</del> Fishtown Analytics’ <a href="https://github.com/dbt-labs/corp/blob/773079cc140e8636403771e0c9f56ab9be528597/dbt_style_guide.md" target="_blank" rel="noopener noreferrer">dbt_style_guide.md</a> contained foundational guidelines for anyone adopting the dbt viewpoint for the first time.</p>
<p>Today we released a collection of <a href="https://github.com/dbt-labs/dbt-agent-skills" target="_blank" rel="noopener noreferrer">dbt agent skills</a> so that <span>AI agents</span> (like Claude Code, OpenAI's Codex, Cursor, Factory or Kilo Code) can follow the same dbt best practices you would expect of any collaborator in your codebase. This matters because by extending their baseline capabilities, <strong>skills can transform generalist coding agents into highly capable data agents</strong>.</p>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:85%"><span><a href="https://docs.getdbt.com/blog/dbt-agent-skills#" data-featherlight="/img/blog/2026-02-03-dbt-agent-skills/skills-blog-infographic.png"><img data-toggle="lightbox" alt="Diagram showing how dbt agent skills transform generalist coding agents into specialized data agents capable of analytics engineering, semantic layer definition, testing, debugging, natural language querying, and migration workflows" title="dbt agent skills allow you to transform generalist coding agents into highly capable data agents" src="https://docs.getdbt.com/img/blog/2026-02-03-dbt-agent-skills/skills-blog-infographic.png?v=2"></a></span><span class="title_aGrV">dbt agent skills allow you to transform generalist coding agents into highly capable data agents</span></div>
<p>These skills encapsulate a broad swathe of hard-won knowledge from the dbt Community and the dbt Labs Developer Experience team. Collectively, they represent dozens of hours of focused work by dbt experts, backed by years of using dbt.</p>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:85%"><span><a href="https://docs.getdbt.com/blog/dbt-agent-skills#" data-featherlight="/img/blog/2026-02-03-dbt-agent-skills/with_skill.gif"><img data-toggle="lightbox" alt="A gif showing Claude using the analytics engineering skill to validate its work" title="With access to skills, agents like Claude take a systematic approach to tasks" src="https://docs.getdbt.com/img/blog/2026-02-03-dbt-agent-skills/with_skill.gif?v=2"></a></span><span class="title_aGrV">With access to skills, agents like Claude take a systematic approach to tasks</span></div>
<p>The ecosystem is rapidly evolving for both authors of skills and the agents that consume them. We believe these skills are very useful today, <em>and</em> that they will become more useful over the coming weeks and months as:</p>
<ul>
<li>skills become better embedded into agent workflows, particularly increasing the rate at which they select the right skills to use at the right time</li>
<li>wider community adoption and feedback improves the breadth and depth of available skills</li>
</ul>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="whats-included">What’s included<a href="https://docs.getdbt.com/blog/dbt-agent-skills#whats-included" class="hash-link" aria-label="Direct link to What’s included" title="Direct link to What’s included">​</a></h2>
<p>Our <a href="https://github.com/dbt-labs/dbt-agent-skills" target="_blank" rel="noopener noreferrer">agent skills repo</a> contains skills for:</p>
<ul>
<li><strong>Analytics engineering</strong>: Build and modify dbt models, write tests, explore data sources</li>
<li><strong>Semantic layer</strong>: Create metrics, dimensions, and semantic models with MetricFlow</li>
<li><strong>Platform operations</strong>: Troubleshoot job failures, configure the dbt MCP server</li>
<li><strong>Migration</strong>: Move projects from dbt Core to the dbt Fusion engine</li>
</ul>
<p>You’ll notice these skills vary in size of task and complexity. The primary <a href="https://github.com/dbt-labs/dbt-agent-skills/tree/main/skills/dbt/skills/using-dbt-for-analytics-engineering" target="_blank" rel="noopener noreferrer"><em>using dbt for analytics engineering</em> skill</a> contains information about the entire workflow loop for analytics engineering. Other skills are more focused and task dependent.</p>
<p>We plan to continue refining these and adding more skills over time. If there’s a skill that would be useful that you don’t see, please open an issue on the repo.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="quickstart">Quickstart<a href="https://docs.getdbt.com/blog/dbt-agent-skills#quickstart" class="hash-link" aria-label="Direct link to Quickstart" title="Direct link to Quickstart">​</a></h2>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="1-add-the-skills-to-your-agent">1. Add the skills to your agent<a href="https://docs.getdbt.com/blog/dbt-agent-skills#1-add-the-skills-to-your-agent" class="hash-link" aria-label="Direct link to 1. Add the skills to your agent" title="Direct link to 1. Add the skills to your agent">​</a></h3>
<p>In Claude Code, run these commands (one at a time):</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#d6deeb;--prism-background-color:#011627"><div class="codeBlockContent_m3Ux"><pre tabindex="0" class="prism-code language-bash codeBlock_qGQc thin-scrollbar" style="color:#d6deeb;background-color:#011627"><code class="codeBlockLines_p187"><span class="token-line" style="color:#d6deeb"><span class="token plain">/plugin marketplace </span><span class="token function" style="color:rgb(130, 170, 255)">add</span><span class="token plain"> dbt-labs/dbt-agent-skills</span><br></span></code></pre><div class="buttonGroup_6DOT"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#d6deeb;--prism-background-color:#011627"><div class="codeBlockContent_m3Ux"><pre tabindex="0" class="prism-code language-bash codeBlock_qGQc thin-scrollbar" style="color:#d6deeb;background-color:#011627"><code class="codeBlockLines_p187"><span class="token-line" style="color:#d6deeb"><span class="token plain">/plugin </span><span class="token function" style="color:rgb(130, 170, 255)">install</span><span class="token plain"> dbt@dbt-agent-marketplace</span><br></span></code></pre><div class="buttonGroup_6DOT"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>For agents other than Claude Code, use this command (<a href="https://nodejs.org/en/download" target="_blank" rel="noopener noreferrer">requires Node to be installed</a>):</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#d6deeb;--prism-background-color:#011627"><div class="codeBlockContent_m3Ux"><pre tabindex="0" class="prism-code language-bash codeBlock_qGQc thin-scrollbar" style="color:#d6deeb;background-color:#011627"><code class="codeBlockLines_p187"><span class="token-line" style="color:#d6deeb"><span class="token plain">npx skills </span><span class="token function" style="color:rgb(130, 170, 255)">add</span><span class="token plain"> dbt-labs/dbt-agent-skills </span><span class="token parameter variable" style="color:rgb(214, 222, 235)">--global</span><br></span></code></pre><div class="buttonGroup_6DOT"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>or just manually copy the files you want into the <a href="https://github.com/vercel-labs/skills?tab=readme-ov-file#supported-agents" target="_blank" rel="noopener noreferrer">correct path for your agent</a>.</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="2-start-a-new-agent-session">2. Start a new agent session<a href="https://docs.getdbt.com/blog/dbt-agent-skills#2-start-a-new-agent-session" class="hash-link" aria-label="Direct link to 2. Start a new agent session" title="Direct link to 2. Start a new agent session">​</a></h3>
<p>Restart your terminal to make sure the new skills are detected.</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="3-try-it-yourself">3. Try it yourself<a href="https://docs.getdbt.com/blog/dbt-agent-skills#3-try-it-yourself" class="hash-link" aria-label="Direct link to 3. Try it yourself" title="Direct link to 3. Try it yourself">​</a></h3>
<p>Try giving an instruction like:</p>
<ul>
<li>Plan and build models for my new HubSpot source tables</li>
<li>Work out why my <code>dbt build</code> just failed</li>
<li>Write unit tests based on the requirements in this GitHub issue, then create a new model that passes</li>
<li>Update <code>fct_transactions</code> to become a semantic model</li>
<li>Is there a difference in bounce rate for free vs paid email domains?</li>
</ul>
<p>We focused on tasks that are either common (daily model building, debugging) or complex (semantic layer setup, unit testing edge cases). Each skill contains high-signal knowledge, and has been validated in real-world testing and against <span>ADE-bench</span>.</p>
<p>If you just want to get started today, you can stop reading now. But there’s a whole lot to say about what skills are, why they’re useful and how we expect them to plug into the dbt workflows of today and tomorrow.</p>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>Normal cautions around agentic coding apply. Please take appropriate safeguards, particularly when working with production or sensitive data.</p></div></div>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="so-what-is-a-skill-anyway">So what is a skill, anyway?<a href="https://docs.getdbt.com/blog/dbt-agent-skills#so-what-is-a-skill-anyway" class="hash-link" aria-label="Direct link to So what is a skill, anyway?" title="Direct link to So what is a skill, anyway?">​</a></h2>
<p>You can think of skills as bundles of prompts (and scripts) which LLMs can dynamically string together to gain context or expertise on a given task.</p>
<p>In some ways, a skill is very simple - it’s a markdown file with a predefined structure. The venerable <code>dbt_style_guide.md</code> of yore would fit right in! It has a bunch of bulleted instructions, some sample code, and links out to other resources when necessary; the new Skills format does the same things. Anthropic introduced Skills in October 2025, and they are now an open standard adopted by <a href="https://github.com/vercel-labs/skills?tab=readme-ov-file#supported-agents" target="_blank" rel="noopener noreferrer">30+ agents</a>.</p>
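<p>For a feel of the format, here’s the rough shape of a (hypothetical) skill file: YAML frontmatter that tells the agent what the skill is for, followed by plain markdown instructions, with links out to heavier reference material the agent can load on demand:</p>

```markdown
---
name: dbt-style-guide
description: Conventions for naming, staging, and testing dbt models
---

# dbt style guide

- Prefix staging models with `stg_` and keep one model per source table.
- Do all renaming and type casting in staging models; join downstream.
- Before changing a model, run `dbt show` against its sources to check
  the underlying data.
- For the full testing workflow, see [testing.md](testing.md).
```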
<p>A better question than <em>what</em> might be <em>why</em>. From the <a href="https://agentskills.io/home" target="_blank" rel="noopener noreferrer">agent skills site</a>:</p>
<blockquote>
<p>Agents are increasingly capable, but often don’t have the context they need to do real work reliably. Skills solve this by giving agents access to procedural knowledge and company-, team-, and user-specific context they can load on demand.</p>
</blockquote>
<p>Here’s an <a href="https://claude.com/blog/equipping-agents-for-the-real-world-with-agent-skills" target="_blank" rel="noopener noreferrer">example skill from Anthropic</a>:</p>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:85%"><span><a href="https://docs.getdbt.com/blog/dbt-agent-skills#" data-featherlight="/img/blog/2026-02-03-dbt-agent-skills/anthropic-skills-architecture.png"><img data-toggle="lightbox" alt="Anthropic’s diagram showing how agent skills use progressive disclosure with YAML frontmatter, markdown content, and reference files" title="An example SKILL.md file for working with PDFs, which also contains references to more complex workflows to load on-demand" src="https://docs.getdbt.com/img/blog/2026-02-03-dbt-agent-skills/anthropic-skills-architecture.png?v=2"></a></span><span class="title_aGrV">An example SKILL.md file for working with PDFs, which also contains references to more complex workflows to load on-demand</span></div>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="how-do-skills-interact-with-mcp">How do skills interact with MCP?<a href="https://docs.getdbt.com/blog/dbt-agent-skills#how-do-skills-interact-with-mcp" class="hash-link" aria-label="Direct link to How do skills interact with MCP?" title="Direct link to How do skills interact with MCP?">​</a></h3>
<p>Another common question is how skills differ from MCP servers, and whether both are necessary.</p>
<ul>
<li>MCP is how you provide access to tools (especially remote tools requiring authentication)</li>
<li>Skills are how you provide context and knowledge around using those tools</li>
</ul>
<p>dbt Agent skills and the <a href="https://docs.getdbt.com/blog/introducing-dbt-mcp-server">dbt MCP server</a> are <em>complementary</em>, but you don’t have to use both to get value.</p>
<p>Consider the PDF example. Working with PDF files doesn’t require a MCP server, because the editing library can be installed locally. But you want that library to be used in a consistent way instead of the LLM inventing something from first principles every time.</p>
<p>So then why does the dbt MCP also have tools that call into the CLI? For interfaces that support MCP but not skills, it’s helpful to bake the <em>specific way the CLI commands are called</em> into the MCP server, but this is an open question and something we’re watching closely.</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="from-generalist-to-specialist">From generalist to specialist<a href="https://docs.getdbt.com/blog/dbt-agent-skills#from-generalist-to-specialist" class="hash-link" aria-label="Direct link to From generalist to specialist" title="Direct link to From generalist to specialist">​</a></h3>
<p>To summarize, the best way to think of skills is as a layered training manual. If you took a very smart generalist off the street, what would they need to be able to use and implement <em>your organization's workflows?</em></p>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:85%"><span><a href="https://docs.getdbt.com/blog/dbt-agent-skills#" data-featherlight="/img/blog/2026-02-03-dbt-agent-skills/skills-blog-diagram.png"><img data-toggle="lightbox" alt="A pyramid diagram showing three layers: Coding Agent at the base (takes autonomous actions, runs dbt commands, and looks up docs), dbt best practice skills in the middle (knows dbt best practices and workflows), and Project skills at the top (knows workflows unique to your team and data model)" title="Skills provide layered context that builds on an agent's baseline capabilities" src="https://docs.getdbt.com/img/blog/2026-02-03-dbt-agent-skills/skills-blog-diagram.png?v=2"></a></span><span class="title_aGrV">Skills provide layered context that builds on an agent's baseline capabilities</span></div>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="why-skills-matter">Why skills matter<a href="https://docs.getdbt.com/blog/dbt-agent-skills#why-skills-matter" class="hash-link" aria-label="Direct link to Why skills matter" title="Direct link to Why skills matter">​</a></h2>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="skills-allow-you-to-embed-complex-process-knowledge-that-is-non-obvious-to-agents">Skills allow you to embed complex process knowledge that is non-obvious to agents<a href="https://docs.getdbt.com/blog/dbt-agent-skills#skills-allow-you-to-embed-complex-process-knowledge-that-is-non-obvious-to-agents" class="hash-link" aria-label="Direct link to Skills allow you to embed complex process knowledge that is non-obvious to agents" title="Direct link to Skills allow you to embed complex process knowledge that is non-obvious to agents">​</a></h3>
<p>Any experienced dbt practitioner will have a number of intuitions when working with a dbt project:</p>
<ul>
<li>You want to poke around a bit and get a sense of the schema and underlying data before making any changes. Read some docs, run a couple of <code>dbt show</code> queries, that sort of thing.</li>
<li>If you’re modifying an existing model, you need to look at the underlying data and get a sense of what columns live in upstream data sources.</li>
<li>After making a new model or modifying one, you need to look at the data again, as well as run summary/aggregate statistics to see if it matches your expected shape and output.</li>
</ul>
<p>The current generation of coding agents tends not to do these things by default. Skills fix that by encoding broad dbt best practices like the ones above, but they can also provide very in-depth and nuanced guidance through supplemental reference materials, such as:</p>
<ul>
<li>Warehouse-specific configurations, like avoiding full table scans on BigQuery when discovering data</li>
<li>Variations based on the specific dbt version or engine you’re using; <code>dbt compile</code> can <a href="https://docs.getdbt.com/blog/the-levels-of-sql-comprehension">detect many SQL errors</a> when invoked from the dbt Fusion engine, but dbt Core needs to run <code>dbt build</code> for the same result.</li>
</ul>
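<p>As a rough sketch, guidance like the above might be encoded in a <code>SKILL.md</code> file. The frontmatter fields follow the Agent Skills convention; the skill name and instruction text here are illustrative, not the exact contents of our published skills:</p>
<pre><code class="language-markdown">---
name: dbt-explore-before-editing
description: Explore the schema and underlying data before and after changing dbt models
---

Before changing any model:

1. Read the relevant docs and schema.yml entries to understand the data model.
2. Run a few `dbt show` queries to inspect upstream columns and sample rows.

After creating or modifying a model:

1. Preview the output with `dbt show`.
2. Run summary/aggregate queries and compare row counts and shape against expectations.
</code></pre>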
<p>Skills can also evolve at a faster pace than frontier AI model releases, making it easier to update guidance and adapt to changes in the dbt authoring layer. We recently <a href="https://docs.getdbt.com/blog/modernizing-the-semantic-layer-spec">revamped the authoring experience for semantic models</a>; by including a skill that knows about the new syntax, we can stop your agent from using the old syntax even though the old syntax makes up the majority of training data online.</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="skills-protect-against-plausible-but-incorrect-output">Skills protect against plausible but incorrect output<a href="https://docs.getdbt.com/blog/dbt-agent-skills#skills-protect-against-plausible-but-incorrect-output" class="hash-link" aria-label="Direct link to Skills protect against plausible but incorrect output" title="Direct link to Skills protect against plausible but incorrect output">​</a></h3>
<p>If you ask an LLM to add some tests to your model, it might add an <code>accepted values</code> test. dbt’s documentation on <code>accepted_values</code> tests <a href="https://docs.getdbt.com/reference/resource-properties/data-tests#accepted_values">contains an example</a> saying that the right values on an <code>order_status</code> column are <code>['placed', 'shipped', 'completed', 'returned']</code>, and we’ve seen some models replicate this or otherwise hallucinate potential column values.</p>
<p>With a skill, you can instruct the agent to <strong>preview the data before writing tests</strong> to ensure that the output matches the real data in your warehouse.</p>
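<p>For concreteness, here is the shape of test the agent tends to produce, using the illustrative column and values from the dbt docs (the <code>data_tests</code> key is the current spelling; older projects use <code>tests</code>). The whole point of the skill is that these values should come from previewing your warehouse, not from training data:</p>
<pre><code class="language-yaml">models:
  - name: orders
    columns:
      - name: order_status
        data_tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
</code></pre>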
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="skills-allow-you-to-give-opinionated-guidance-to-agents">Skills allow you to give opinionated guidance to agents<a href="https://docs.getdbt.com/blog/dbt-agent-skills#skills-allow-you-to-give-opinionated-guidance-to-agents" class="hash-link" aria-label="Direct link to Skills allow you to give opinionated guidance to agents" title="Direct link to Skills allow you to give opinionated guidance to agents">​</a></h3>
<p>Beyond global best practices, there are also a number of opinionated decisions inside of a given team’s dbt project:</p>
<ul>
<li>What types of data tests should I have on my models?</li>
<li>When should I use the Semantic Layer vs. SQL for natural language questions?</li>
<li>How should the project be structured (stg/int/mart? Medallion? Data vault?)</li>
</ul>
<p>Our current skills are only semi-opinionated: they have opinions on how and where you should apply your data tests, but not on whether you should follow dbt’s recommended project structure or style guide. In the future, we anticipate releasing first-party opinionated guides on project and code structure, and we expect a thriving ecosystem of opinionated, community-sourced skills covering different dimensions of data work.</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="skills-allow-you-to-give-non-public-information-to-agents">Skills allow you to give non-public information to agents<a href="https://docs.getdbt.com/blog/dbt-agent-skills#skills-allow-you-to-give-non-public-information-to-agents" class="hash-link" aria-label="Direct link to Skills allow you to give non-public information to agents" title="Direct link to Skills allow you to give non-public information to agents">​</a></h3>
<p>In addition to adopting our skills, you should add some of your own.</p>
<p>Taking a smart generalist across all disciplines and turning them into a smart generalist with a specialization in dbt still isn’t enough. They also need to become a specialist in the way your company does data.</p>
<p>Obviously we can’t include company-specific knowledge in our general best-practices skills, but this is where the composability of skills comes in. You can add context about your company, your data, and the specific ins and outs of interacting with your systems, and expect it to augment what we provide.</p>
<p>Examples of questions you might like to answer in your skills:</p>
<ul>
<li>Have any default macros been overridden in my organization’s project?</li>
<li>What is my organization’s cross-project or cross-platform mesh strategy?</li>
<li>What partitioning rules should be applied to new models for a given usage pattern?</li>
</ul>
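<p>A project-specific skill can sit alongside ours and answer exactly those questions. A hypothetical sketch (the company name, macro, and conventions below are invented placeholders, not real guidance):</p>
<pre><code class="language-markdown">---
name: acme-dbt-conventions
description: Conventions specific to Acme's dbt project
---

- `generate_schema_name` is overridden in this project; never rely on the default behavior.
- Cross-project refs must go through our published public models only.
- New high-volume models should be partitioned by event date unless the usage pattern is point lookups.
</code></pre>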
<p>More to come soon on how we might support org-level skills within dbt projects.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="how-we-validated-the-dbt-agent-skills">How we validated the dbt Agent Skills<a href="https://docs.getdbt.com/blog/dbt-agent-skills#how-we-validated-the-dbt-agent-skills" class="hash-link" aria-label="Direct link to How we validated the dbt Agent Skills" title="Direct link to How we validated the dbt Agent Skills">​</a></h2>
<p>It can be challenging to assess the performance of AI workflows. There are many different ways to do this and all of them are imperfect, so we have settled on a multilayered strategy for ensuring our agent skills behave the way we want them to.</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="careful-expert-generation-and-curation-of-skills">Careful expert generation and curation of skills<a href="https://docs.getdbt.com/blog/dbt-agent-skills#careful-expert-generation-and-curation-of-skills" class="hash-link" aria-label="Direct link to Careful expert generation and curation of skills" title="Direct link to Careful expert generation and curation of skills">​</a></h3>
<p>While we <em>did</em> have some LLM assistance in generating some of the skills, these are very much not "oneshotted outputs". Each skill represents hours of crafting, reviewing, and refining by world-class dbt experts to ensure that our knowledge has been accurately encoded. Data work has a <em>lot</em> of tacit knowledge and edge cases, and this is where skills really shine.</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="hands-on-testing-of-each-skill-in-real-life-examples">Hands-on testing of each skill in real life examples<a href="https://docs.getdbt.com/blog/dbt-agent-skills#hands-on-testing-of-each-skill-in-real-life-examples" class="hash-link" aria-label="Direct link to Hands-on testing of each skill in real life examples" title="Direct link to Hands-on testing of each skill in real life examples">​</a></h3>
<p>Nothing beats hands-on usage, and so we’ve tested each skill to see how it performs in real use cases. This has helped us tune performance and identify non-obvious gaps in our instructions.</p>
<p>We were particularly thrilled when we asked the agent to make performance recommendations on one of the largest tables in our dbt project, with and without the skill. Both runs gave plausible recommendations, but the skill-assisted recommendations were more tailored and relevant to our use case, as judged by our internal data team.</p>
<div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:85%"><span><a href="https://docs.getdbt.com/blog/dbt-agent-skills#" data-featherlight="/img/blog/2026-02-03-dbt-agent-skills/skills-validation-feedback.png"><img data-toggle="lightbox" alt="A Slack screenshot from @brandon, who says 'that version excites me much, much more. the recommendations on incremental filtering on all refs, pre-aggregated int models etc. i think would make a huge impact.'" title="" src="https://docs.getdbt.com/img/blog/2026-02-03-dbt-agent-skills/skills-validation-feedback.png?v=2"></a></span></div>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="custom-suite-for-ab-testing-skills">Custom suite for A/B testing skills<a href="https://docs.getdbt.com/blog/dbt-agent-skills#custom-suite-for-ab-testing-skills" class="hash-link" aria-label="Direct link to Custom suite for A/B testing skills" title="Direct link to Custom suite for A/B testing skills">​</a></h3>
<p>We developed a <a href="https://github.com/dbt-labs/dbt-agent-skills/tree/main/evals" target="_blank" rel="noopener noreferrer">system for rapidly comparing different tool combinations</a> (MCP + skills, skills alone, no tools) to understand how they changed an agent’s output.</p>
<p>This library allows testing how variations of skills perform for a given scenario and reviewing in detail the skills and tools called by the agent.</p>
<p>We provide context to Claude Code (e.g. a dbt project or some YAML files) and we ask it to solve a problem with different setups:</p>
<ul>
<li>with different variations of a skill</li>
<li>with or without an MCP server connected</li>
<li>explicitly prompting the agent to use a skill, or leaving it to discover the skill on its own</li>
</ul>
<p>We can then either manually compare the conversations (which skills were called, what output was produced), or ask Claude Code to rate the different runs automatically.</p>
<p>One thing we discovered in this process is that Claude is much less willing to use skills in "headless" CLI invocations than in "interactive" ones where a user is talking back and forth. Because of this, we felt comfortable including the explicit prompt in benchmarking tasks.</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="benchmarking-against-ade-bench">Benchmarking against ADE-bench<a href="https://docs.getdbt.com/blog/dbt-agent-skills#benchmarking-against-ade-bench" class="hash-link" aria-label="Direct link to Benchmarking against ADE-bench" title="Direct link to Benchmarking against ADE-bench">​</a></h3>
<p>We also ran through the <span>ADE-bench</span> tasks to assess performance with and without skills. While not every skill has corresponding tasks in the benchmark (yet!), this provides helpful signal, particularly on the primary analytics engineering skill.</p>
<p>We saw modest improvements on the benchmark with skills, rising from 56% accuracy without skills to 58.5% with them. But the bigger story is not the headline number; it’s the individual tasks that skills solved which previously had 0% success rates.</p>
<p>Notably, we found <strong>significant benefits in tasks which require iterative work</strong> on top of a dbt DAG, which is one of the most common failure points we've experienced in using coding agents with dbt.</p>
<div style="display:grid;grid-template-columns:repeat(auto-fit, minmax(300px, 1fr));gap:1rem"><div style="text-align:center"><p><strong>Without skills</strong></p><p><img decoding="async" loading="lazy" alt="Without skills, agents may skip important validation steps" src="https://docs.getdbt.com/assets/images/without_skill-da56373701544fd8711c725d25d95011.gif" width="1800" height="1000" class="img_ev3q"></p></div><div style="text-align:center"><p><strong>With skills</strong></p><p><img decoding="async" loading="lazy" alt="With access to skills, agents take a systematic approach to tasks" src="https://docs.getdbt.com/assets/images/with_skill-b62eb0689de0bb0352a714958a579f2e.gif" width="1800" height="1000" class="img_ev3q"></p></div></div>
<p>For example, when asked to <a href="https://github.com/dbt-labs/ade-bench/blob/main/tasks/airbnb007/task.yaml" target="_blank" rel="noopener noreferrer">produce multiple models based on their schema.yml definition</a>, the baseline agent created 6 models at once and declared victory. The skill-using agent worked iteratively, and successfully completed the task every time.</p>
<p>On the other hand, encouraging DRY principles led the skill-using agent to intermittently reuse a column with a logic bug in <a href="https://github.com/dbt-labs/ade-bench/blob/main/tasks/f1009/task.yaml" target="_blank" rel="noopener noreferrer">this task</a>, whereas the baseline agent noticed and corrected the bug.</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="where-there-are-gaps">Where there are gaps<a href="https://docs.getdbt.com/blog/dbt-agent-skills#where-there-are-gaps" class="hash-link" aria-label="Direct link to Where there are gaps" title="Direct link to Where there are gaps">​</a></h3>
<p>Today, skill loading can be a little hit-and-miss. As with everything in AI, things are moving fast, and skills are seeing widespread adoption, so we don’t think that’s going to be a long-term issue. We’d also love to see stronger and more reliable cross-skill referencing, such as <a href="https://github.com/agentskills/agentskills/issues/90" target="_blank" rel="noopener noreferrer">what’s described here</a>.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="again-you-should-go-try-this-yourself">Again: you should go try this yourself<a href="https://docs.getdbt.com/blog/dbt-agent-skills#again-you-should-go-try-this-yourself" class="hash-link" aria-label="Direct link to Again: you should go try this yourself" title="Direct link to Again: you should go try this yourself">​</a></h2>
<p><a href="https://github.com/dbt-labs/dbt-agent-skills" target="_blank" rel="noopener noreferrer">Here’s the repo</a>, with installation instructions in the readme.</p>
<p>Agent skills have tremendous bang-for-buck for procedural tasks, especially considering how easily you can get started. We’re excited to see many people from across the Community trying them on real-world workflows, and building new skills of their own.</p>
<p>We’re also exploring ways to enable tighter integration between dbt and agent skills, as well as making it easier to manage custom skills for your specific dbt project and data.</p>
<p>The best way to stay involved is to share what you're discovering in <a href="https://getdbt.slack.com/archives/C0A1MRWEH8C/p1769918756796829" target="_blank" rel="noopener noreferrer">#topic-agentic-analytics</a> on Slack or to open up issues on the GitHub repo.</p>]]></content>
        <author>
            <name>Joel Labes</name>
        </author>
        <author>
            <name>Jason Ganz</name>
        </author>
        <category label="ai" term="ai"/>
        <category label="data_ecosystem" term="data_ecosystem"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Modernizing the Semantic Layer Spec]]></title>
        <id>https://docs.getdbt.com/blog/modernizing-the-semantic-layer-spec</id>
        <link href="https://docs.getdbt.com/blog/modernizing-the-semantic-layer-spec"/>
        <updated>2026-01-21T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Learn about the new semantic layer specification and how it improves data modeling.]]></summary>
        <content type="html"><![CDATA[<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="new-engine-who-dis">New engine, who dis?<a href="https://docs.getdbt.com/blog/modernizing-the-semantic-layer-spec#new-engine-who-dis" class="hash-link" aria-label="Direct link to New engine, who dis?" title="Direct link to New engine, who dis?">​</a></h2>
<p>It’s unlikely that anyone reading this blog has not heard about the new dbt Fusion engine — it’s been the talk of the data town since last January, culminating in Elias’s legendary live Coalesce 2025 demo of the incredible capabilities that native SQL comprehension in dbt can unlock. If you attended Coalesce, or have upgraded your project to Fusion already, you’ve likely also heard about the changes we’ve made to the authoring layer of dbt (the literal code you write in your project). As part of the major version upgrade, we took the opportunity to simplify + standardize the configuration language of dbt to be built to scale as we enter the next era of analytics engineering.</p>
<p>In particular, we wanted to reevaluate how metrics are defined in the dbt Semantic Layer. We’ve heard from numerous community members over the years that defining metrics was <em>just plain hard</em>. In conversation with internal + external users and our newest pals from SDF, we’ve come up with a redesigned YAML spec that is simpler, more integrated with the dbt configuration experience we’ve come to know and love, and built for the future.</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="whats-changing">What’s changing?<a href="https://docs.getdbt.com/blog/modernizing-the-semantic-layer-spec#whats-changing" class="hash-link" aria-label="Direct link to What’s changing?" title="Direct link to What’s changing?">​</a></h3>
<p>There are three major updates to the structure of semantic modeling in dbt:</p>
<ul>
<li><strong>Measures → Metrics:</strong> Measures are removed from the authorship spec. Simple metrics can now include aggregations and expressions, and are the primary building block for more complex metrics.</li>
<li><strong>Reducing nesting:</strong> We removed as much deep dictionary nesting as possible to simplify the look and feel of the code, and renamed keys to more directly describe their behavior.</li>
<li><strong>Standardizing on models YAML entries:</strong> Semantic annotations are embedded within the model’s YAML entry to remove the need to manage many YAML entries across many files to enrich your models with semantic metadata.</li>
</ul>
<div style="display:grid;grid-template-columns:repeat(auto-fit, minmax(300px, 1fr));gap:16px;margin:0 -20px 20px -20px;max-width:calc(100% + 40px)"><div style="padding:0 12px"><h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="legacy-implementation">Legacy implementation<a href="https://docs.getdbt.com/blog/modernizing-the-semantic-layer-spec#legacy-implementation" class="hash-link" aria-label="Direct link to Legacy implementation" title="Direct link to Legacy implementation">​</a></h3><div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#d6deeb;--prism-background-color:#011627"><div class="codeBlockContent_m3Ux"><pre tabindex="0" class="prism-code language-yaml codeBlock_qGQc thin-scrollbar" style="color:#d6deeb;background-color:#011627"><code class="codeBlockLines_p187"><span class="token-line" style="color:#d6deeb"><span class="token key atrule" style="color:rgb(255, 203, 139)">models</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">  </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> </span><span class="token key atrule" style="color:rgb(255, 203, 139)">name</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> customers</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    </span><span class="token key atrule" style="color:rgb(255, 203, 139)">description</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> Customer overview data mart</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"> offering key details for each unique customer. 
One row per customer.</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    </span><span class="token key atrule" style="color:rgb(255, 203, 139)">columns</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">      </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> </span><span class="token key atrule" style="color:rgb(255, 203, 139)">name</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> customer_id</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">        </span><span class="token key atrule" style="color:rgb(255, 203, 139)">description</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> The unique key of the orders mart.</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">      </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> </span><span class="token key atrule" style="color:rgb(255, 203, 139)">name</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> first_ordered_at</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">        </span><span class="token key atrule" style="color:rgb(255, 203, 139)">description</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> The timestamp when a customer placed their first order.</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain"></span><span 
class="token key atrule" style="color:rgb(255, 203, 139)">semantic_models</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">  </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> </span><span class="token key atrule" style="color:rgb(255, 203, 139)">name</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> customers </span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    </span><span class="token key atrule" style="color:rgb(255, 203, 139)">model</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> ref('customers')</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    </span><span class="token key atrule" style="color:rgb(255, 203, 139)">description</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> Semantic Model for Customers</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    </span><span class="token key atrule" style="color:rgb(255, 203, 139)">defaults</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">      </span><span class="token key atrule" style="color:rgb(255, 203, 139)">agg_time_dimension</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> first_ordered_at</span><br></span><span class="token-line" 
style="color:#d6deeb"><span class="token plain">    </span><span class="token key atrule" style="color:rgb(255, 203, 139)">entities</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">      </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> </span><span class="token key atrule" style="color:rgb(255, 203, 139)">name</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> customer</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">        </span><span class="token key atrule" style="color:rgb(255, 203, 139)">type</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> primary</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">        </span><span class="token key atrule" style="color:rgb(255, 203, 139)">expr</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> customer_id</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">     </span><span class="token key atrule" style="color:rgb(255, 203, 139)">dimensions</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">       </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> </span><span class="token key atrule" style="color:rgb(255, 203, 139)">name</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> first_ordered_at</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">         
</span><span class="token key atrule" style="color:rgb(255, 203, 139)">type</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> time</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">         </span><span class="token key atrule" style="color:rgb(255, 203, 139)">type_params</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">           </span><span class="token key atrule" style="color:rgb(255, 203, 139)">time_granularity</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> day</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">     </span><span class="token key atrule" style="color:rgb(255, 203, 139)">measures</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">       </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> </span><span class="token key atrule" style="color:rgb(255, 203, 139)">name</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> lifetime_spend_pretax</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">         </span><span class="token key atrule" style="color:rgb(255, 203, 139)">agg</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> sum</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain"> </span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain"> </span><span class="token key atrule" style="color:rgb(255, 203, 
139)">metrics</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">   </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> </span><span class="token key atrule" style="color:rgb(255, 203, 139)">name</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> lifetime_spend_pretax</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">     </span><span class="token key atrule" style="color:rgb(255, 203, 139)">type</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> simple</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">     </span><span class="token key atrule" style="color:rgb(255, 203, 139)">description</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> Customer's lifetime spend before tax</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">     </span><span class="token key atrule" style="color:rgb(255, 203, 139)">label</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> LTV Pre</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">tax</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">     </span><span class="token key atrule" style="color:rgb(255, 203, 139)">type_params</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">        </span><span class="token key atrule" style="color:rgb(255, 203, 139)">measure</span><span 
class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">          </span><span class="token key atrule" style="color:rgb(255, 203, 139)">name</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> lifetime_spend_pretax</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain"> </span><br></span></code></pre><div class="buttonGroup_6DOT"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div></div><div style="padding:0 12px"><h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="new-implementation">New implementation<a href="https://docs.getdbt.com/blog/modernizing-the-semantic-layer-spec#new-implementation" class="hash-link" aria-label="Direct link to New implementation" title="Direct link to New implementation">​</a></h3><div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#d6deeb;--prism-background-color:#011627"><div class="codeBlockContent_m3Ux"><pre tabindex="0" class="prism-code language-yaml codeBlock_qGQc thin-scrollbar" style="color:#d6deeb;background-color:#011627"><code class="codeBlockLines_p187"><span class="token-line" style="color:#d6deeb"><span class="token key atrule" style="color:rgb(255, 203, 139)">models</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token 
plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">  </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> </span><span class="token key atrule" style="color:rgb(255, 203, 139)">name</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> customers</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    </span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    </span><span class="token comment" style="color:rgb(99, 119, 119);font-style:italic"># enable semantic modeling on this model</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    </span><span class="token key atrule" style="color:rgb(255, 203, 139)">semantic_model</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">      </span><span class="token key atrule" style="color:rgb(255, 203, 139)">enabled</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token boolean important" style="color:rgb(255, 88, 116)">true</span><span class="token plain"> </span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    </span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    </span><span class="token comment" style="color:rgb(99, 119, 119);font-style:italic"># set default aggregation time dimension as a model property</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    </span><span class="token key atrule" style="color:rgb(255, 203, 139)">agg_time_dimension</span><span 
class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> first_ordered_at</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    </span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    </span><span class="token key atrule" style="color:rgb(255, 203, 139)">description</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> Customer overview data mart</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"> offering key details for each unique customer. One row per customer.</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    </span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    </span><span class="token key atrule" style="color:rgb(255, 203, 139)">columns</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">      </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> </span><span class="token key atrule" style="color:rgb(255, 203, 139)">name</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> customer_id</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">        </span><span class="token key atrule" style="color:rgb(255, 203, 139)">description</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> The unique key of the orders mart.</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">     
   </span><span class="token comment" style="color:rgb(99, 119, 119);font-style:italic"># annotate column as a primary entity</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">        </span><span class="token key atrule" style="color:rgb(255, 203, 139)">entity</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">          </span><span class="token key atrule" style="color:rgb(255, 203, 139)">name</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> customer</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">          </span><span class="token key atrule" style="color:rgb(255, 203, 139)">type</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> primary</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">          </span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">      </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> </span><span class="token key atrule" style="color:rgb(255, 203, 139)">name</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> first_ordered_at</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">        </span><span class="token key atrule" style="color:rgb(255, 203, 139)">description</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> The timestamp when a customer placed their first order.</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain" 
style="display:inline-block"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">        </span><span class="token comment" style="color:rgb(99, 119, 119);font-style:italic"># annotate column as a time dimension</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">        </span><span class="token key atrule" style="color:rgb(255, 203, 139)">granularity</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> day</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">        </span><span class="token key atrule" style="color:rgb(255, 203, 139)">dimension</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">          </span><span class="token key atrule" style="color:rgb(255, 203, 139)">type</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> time</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">          </span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    </span><span class="token comment" style="color:rgb(99, 119, 119);font-style:italic"># define simple metric directly within the model's YAML</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    </span><span class="token key atrule" style="color:rgb(255, 203, 139)">metrics</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token 
plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">      </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> </span><span class="token key atrule" style="color:rgb(255, 203, 139)">name</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> lifetime_spend_pretax</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">        </span><span class="token key atrule" style="color:rgb(255, 203, 139)">type</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> simple</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">        </span><span class="token key atrule" style="color:rgb(255, 203, 139)">description</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> Customer's lifetime spend before tax</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">        </span><span class="token key atrule" style="color:rgb(255, 203, 139)">label</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> LTV Pre</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">tax</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">        </span><span class="token key atrule" style="color:rgb(255, 203, 139)">agg</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> sum</span><br></span></code></pre><div class="buttonGroup_6DOT"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path 
fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div></div></div>
<p>This has a few clear benefits:</p>
<ul>
<li><strong>DRYer code:</strong> Semantic annotations now live alongside the model’s YAML entry, reducing duplicative work. If a column within the model is a dimension or entity, you can configure it as such, and the column’s properties, like its description, carry over as the description of the dimension or entity!</li>
<li><strong>Tidier YAML:</strong> A tidier metric entry is easier to write, easier to read, and makes it easier to share context across your data team. Maintaining metric code should be as easy as possible!</li>
</ul>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="is-this-the-osi-spec">Is this the OSI spec?<a href="https://docs.getdbt.com/blog/modernizing-the-semantic-layer-spec#is-this-the-osi-spec" class="hash-link" aria-label="Direct link to Is this the OSI spec?" title="Direct link to Is this the OSI spec?">​</a></h3>
<p>You may have heard some buzz that dbt joined the industry initiative called the <a href="https://www.snowflake.com/en/blog/open-semantic-interchange-ai-standard/" target="_blank" rel="noopener noreferrer">Open Semantic Interchange</a>, working with partners like Snowflake and Tableau to create an open standard for semantic metadata. This is not the OSI Spec! This is an update to the existing dbt Semantic Layer spec, designed to make it easier for dbt users to define and manage their metrics. However, we are actively exploring how we can align with the OSI spec in the future, and we see this as a step towards that goal.</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="get-started-today">Get started today<a href="https://docs.getdbt.com/blog/modernizing-the-semantic-layer-spec#get-started-today" class="hash-link" aria-label="Direct link to Get started today" title="Direct link to Get started today">​</a></h3>
<p>This new spec is <strong>live on the Fusion engine today.</strong> If you’ve <a href="https://docs.getdbt.com/guides/upgrade-to-fusion?step=1">migrated</a> onto the engine, and are curious about getting started with the dbt Semantic Layer, <a href="https://docs.getdbt.com/docs/build/latest-metrics-spec">check out our docs</a> and get started defining your metrics! This new spec will also be released to dbt Core in version 1.12, coming in the near future. dbt platform users on the dbt Core engine will be able to migrate to the new spec as soon as they upgrade to the Latest dbt version!</p>
<p>Additionally, if you’re an existing user of the semantic layer, our <a href="https://github.com/dbt-labs/dbt-autofix" target="_blank" rel="noopener noreferrer"><code>dbt-autofix</code> script</a> now has support for migrating from the legacy metrics implementation to the new one! Simply run <code>dbt-autofix deprecations --semantic-layer</code>, locally or in dbt Studio on the platform, and the vast majority of the code will be migrated automatically!</p>
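<p>As a sketch, the migration command can be run from the root of your dbt project. This assumes you have installed <code>dbt-autofix</code> from the repository linked above and that it is available on your PATH:</p>

```shell
# Sketch only: assumes dbt-autofix is installed and on your PATH
# (see the dbt-labs/dbt-autofix repository for installation instructions).
cd path/to/your/dbt/project

# Rewrite legacy semantic layer definitions to the new spec
dbt-autofix deprecations --semantic-layer
```

<p>Review the resulting diff before committing, since a small portion of definitions may still need manual migration.</p>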
<p>We’re eager for feedback! Reach out in dbt Community Slack in the <a href="https://getdbt.slack.com/archives/C046L0VTVR6" target="_blank" rel="noopener noreferrer"><code>#dbt-semantic-layer</code> channel</a> and let us know how your migration / onboarding experience goes!</p>]]></content>
        <author>
            <name>Dave Connors</name>
        </author>
        <category label="ai" term="ai"/>
        <category label="data ecosystem" term="data ecosystem"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Building the Remote dbt MCP Server]]></title>
        <id>https://docs.getdbt.com/blog/building-the-remote-dbt-mcp-server</id>
        <link href="https://docs.getdbt.com/blog/building-the-remote-dbt-mcp-server"/>
        <updated>2025-08-26T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Learn about the new remote dbt MCP server, how it was built, and how to use it to build agents.]]></summary>
        <content type="html"><![CDATA[<p>In April, we released the local <a href="https://docs.getdbt.com/blog/introducing-dbt-mcp-server">dbt MCP (Model Context Protocol) server</a> as an open source project to connect AI agents and LLMs with direct, governed access to trusted dbt assets. The dbt MCP server provides a <a href="https://docs.anthropic.com/en/docs/mcp" target="_blank" rel="noopener noreferrer">universal, open standard</a> for bridging AI systems with your structured context that keeps your agents accurate, governed, and trustworthy. Learn more in <a href="https://docs.getdbt.com/docs/dbt-ai/about-mcp">About dbt Model Context Protocol</a>.</p>
<p>Since releasing the local dbt MCP server, the dbt community has been applying it in incredible ways, including agentic conversational analytics, data catalog exploration, and dbt project refactoring. However, a key piece of feedback we received from AI engineers was that the local dbt MCP server isn’t easy to deploy or host for multi-tenanted workloads, making it difficult to build applications on top of it.</p>
<p>This is why we are excited to announce a new way to integrate with dbt MCP: <strong>the remote dbt MCP server</strong>. The remote dbt MCP server doesn’t require installing dependencies or running the dbt MCP server in your infrastructure, making it easier than ever to build and run agents. It is <strong>available today in public beta</strong> for users with dbt Starter, Enterprise, or Enterprise+ plans, ready for you to start building AI-powered applications.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="what-is-the-remote-dbt-mcp-server-">What is the Remote dbt MCP Server? <a href="https://www.getdbt.com/pricing" target="_blank" rel="noopener noreferrer" class="lifecycle_J1Zi lifecycle" style="background-color:#E5E7EB;color:#030711;cursor:pointer;transition:background-color 0.2s ease, transform 0.2s ease, text-decoration 0.2s ease;text-decoration:none" title="Go to https://www.getdbt.com/pricing">Starter</a><a href="https://www.getdbt.com/pricing" target="_blank" rel="noopener noreferrer" class="lifecycle_J1Zi lifecycle" style="background-color:#E5E7EB;color:#030711;cursor:pointer;transition:background-color 0.2s ease, transform 0.2s ease, text-decoration 0.2s ease;text-decoration:none" title="Go to https://www.getdbt.com/pricing">Enterprise</a><a href="https://www.getdbt.com/pricing" target="_blank" rel="noopener noreferrer" class="lifecycle_J1Zi lifecycle" style="background-color:#E5E7EB;color:#030711;cursor:pointer;transition:background-color 0.2s ease, transform 0.2s ease, text-decoration 0.2s ease;text-decoration:none" title="Go to https://www.getdbt.com/pricing">Enterprise +</a><a href="https://docs.getdbt.com/docs/dbt-versions/product-lifecycles" target="_blank" rel="noopener noreferrer" class="lifecycle_J1Zi lifecycle" style="background-color:#bab2ff;color:#030711;cursor:pointer;transition:background-color 0.2s ease, transform 0.2s ease, text-decoration 0.2s ease;text-decoration:none" title="Go to https://docs.getdbt.com/docs/dbt-versions/product-lifecycles">Beta</a><a href="https://docs.getdbt.com/blog/building-the-remote-dbt-mcp-server#what-is-the-remote-dbt-mcp-server-" class="hash-link" aria-label="Direct link to what-is-the-remote-dbt-mcp-server-" title="Direct link to what-is-the-remote-dbt-mcp-server-">​</a></h2>
<p>Commonly, agents and MCP servers run locally on your computer, but local-first agents are limited in the type of applications that can be built. With remote MCP, new experiences are possible. For instance, remote MCP enables server-side agents to perform long-running tasks, be shared across an organization, and be accessed through web applications -- all experiences that are far more difficult (or impossible) in a local agent architecture.</p>
<p>The remote dbt MCP server brings <strong>structured, governed context</strong> to these experiences and enables you to build innovative data applications on top of them. The remote dbt MCP server makes it possible for your agent to answer business questions with the <a href="https://docs.getdbt.com/docs/use-dbt-semantic-layer/dbt-sl">dbt Semantic Layer</a>, discover data assets with the <a href="https://docs.getdbt.com/docs/dbt-cloud-apis/discovery-api">dbt Discovery API</a>, and run natural-language queries with SQL tools. Check out our docs <a href="https://docs.getdbt.com/docs/dbt-ai/about-mcp">here</a> to learn about the full list of supported tools. These capabilities are easy to integrate in various platforms with the standardized MCP specification.</p>
<p>The remote dbt MCP server is great for application builders, but there are still times when you would want to run the dbt MCP server locally. Specifically, <strong>if you are using a local coding agent like Cursor or Claude Code, we recommend the local dbt MCP server.</strong> This ensures that the code you are writing locally matches what the agent has access to.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="the-remote-dbt-mcp-server-architecture">The Remote dbt MCP Server Architecture<a href="https://docs.getdbt.com/blog/building-the-remote-dbt-mcp-server#the-remote-dbt-mcp-server-architecture" class="hash-link" aria-label="Direct link to The Remote dbt MCP Server Architecture" title="Direct link to The Remote dbt MCP Server Architecture">​</a></h2>
<p>Hosting your own remote MCP server is non-trivial. While a local MCP server only has to consider a single-tenant experience, remote servers need to manage concurrent connections from multiple users, as well as the deployment and maintenance of the server and its infrastructure. Additionally, connections need to be securely authenticated and isolated from each other. The latest updates to the MCP spec provide a new way to communicate with MCP servers, <a href="https://modelcontextprotocol.io/specification/2025-06-18/basic/transports" target="_blank" rel="noopener noreferrer">Streamable HTTP</a>, allowing for stateless remote connections with agents. Streamable HTTP makes things easier, but deploying an MCP server is still a high lift for most data teams. With the remote dbt MCP server, we handle all of this complexity: if you are building an agentic application, all you need to worry about is making an HTTP connection to our API.</p>
<p>At the same time, we want the remote dbt MCP server to offer similar functionality to the local dbt MCP server without entirely reimplementing the tools. We met these requirements by running a Streamable HTTP MCP server and adding proxied versions of each dbt MCP tool to this server. The proxied version of each tool has the same parameters, description, and implementation as the open source version, ensuring a consistent experience. The difference is that the proxied versions are configured via HTTP headers rather than environment variables, and they connect directly to our internal APIs, which reduces latency.</p>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        "><span><a href="https://docs.getdbt.com/blog/building-the-remote-dbt-mcp-server#" data-featherlight="/img/blog/2025-08-26-building-the-remote-dbt-mcp-server/remote-dbt-mcp.png"><img data-toggle="lightbox" alt="The remote dbt MCP architecture" title="The remote dbt MCP architecture" src="https://docs.getdbt.com/img/blog/2025-08-26-building-the-remote-dbt-mcp-server/remote-dbt-mcp.png?v=2"></a></span><span class="title_aGrV">The remote dbt MCP architecture</span></div>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="the-remote-dbt-mcp-server-in-action">The Remote dbt MCP Server in Action<a href="https://docs.getdbt.com/blog/building-the-remote-dbt-mcp-server#the-remote-dbt-mcp-server-in-action" class="hash-link" aria-label="Direct link to The Remote dbt MCP Server in Action" title="Direct link to The Remote dbt MCP Server in Action">​</a></h2>
<p>Now that we have a better understanding of how the remote dbt MCP server works, let's implement it in practice by creating a simple agent loop with LangGraph in Python. We are using LangGraph as an example here, but you can use whichever language or framework you would like. Check out our <a href="https://github.com/dbt-labs/dbt-mcp/tree/main/examples" target="_blank" rel="noopener noreferrer">examples directory</a> for more resources on creating agents with the dbt MCP server, including the full example shown here.</p>
<p>The agent we implement here will be able to conduct conversational analytics grounded in <strong>structured, governed context</strong> from your dbt project. This means it can receive a user's question, search for relevant metadata with the dbt Discovery API, find important metrics with the dbt Semantic Layer API, explore the data, and return an accurate, trustworthy answer. This shows how the remote dbt MCP server can power AI applications that combine the flexibility of LLMs with the trust and consistency of your dbt assets.</p>
<p>For this example to work, you will need to install LangGraph dependencies and set an environment variable for the Anthropic API key:</p>
<div class="language-shell codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#d6deeb;--prism-background-color:#011627"><div class="codeBlockContent_m3Ux"><pre tabindex="0" class="prism-code language-shell codeBlock_qGQc thin-scrollbar" style="color:#d6deeb;background-color:#011627"><code class="codeBlockLines_p187"><span class="token-line" style="color:#d6deeb"><span class="token plain">pip </span><span class="token function" style="color:rgb(130, 170, 255)">install</span><span class="token plain"> langgraph </span><span class="token string" style="color:rgb(173, 219, 103)">"langchain[anthropic]"</span><span class="token plain"> langchain-mcp-adapters</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain"></span><span class="token builtin class-name" style="color:rgb(255, 203, 139)">export</span><span class="token plain"> </span><span class="token assign-left variable" style="color:rgb(214, 222, 235)">ANTHROPIC_API_KEY</span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token operator" style="color:rgb(127, 219, 202)">&lt;</span><span class="token plain">your-api-key</span><span class="token operator" style="color:rgb(127, 219, 202)">&gt;</span><br></span></code></pre><div class="buttonGroup_6DOT"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>First, we need to define the URL and headers that the MCP client will use. These values will depend on your specific dbt Cloud deployment. In this example, we are setting the configuration from environment variables. For more information on this configuration, refer to <a href="https://docs.getdbt.com/docs/dbt-ai/about-mcp">About dbt Model Context Protocol</a>.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#d6deeb;--prism-background-color:#011627"><div class="codeBlockContent_m3Ux"><pre tabindex="0" class="prism-code language-python codeBlock_qGQc thin-scrollbar" style="color:#d6deeb;background-color:#011627"><code class="codeBlockLines_p187"><span class="token-line" style="color:#d6deeb"><span class="token keyword" style="color:rgb(127, 219, 202)">import</span><span class="token plain"> os</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">url </span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token plain"> </span><span class="token string-interpolation string" style="color:rgb(173, 219, 103)">f"https://</span><span class="token string-interpolation interpolation punctuation" style="color:rgb(199, 146, 234)">{</span><span class="token string-interpolation interpolation">os</span><span class="token string-interpolation interpolation punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token string-interpolation interpolation">environ</span><span class="token string-interpolation interpolation punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token string-interpolation interpolation">get</span><span class="token string-interpolation interpolation punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token string-interpolation interpolation string" style="color:rgb(173, 219, 103)">'DBT_HOST'</span><span class="token string-interpolation interpolation punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token string-interpolation interpolation punctuation" style="color:rgb(199, 146, 234)">}</span><span class="token string-interpolation string" style="color:rgb(173, 219, 103)">/api/ai/v1/mcp/"</span><span class="token 
plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">headers </span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">  </span><span class="token string" style="color:rgb(173, 219, 103)">"x-dbt-user-id"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> os</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">environ</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">get</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token string" style="color:rgb(173, 219, 103)">"DBT_USER_ID"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">  </span><span class="token string" style="color:rgb(173, 219, 103)">"x-dbt-prod-environment-id"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> os</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">environ</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">get</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token string" style="color:rgb(173, 219, 103)">"DBT_PROD_ENV_ID"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span 
class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">  </span><span class="token string" style="color:rgb(173, 219, 103)">"x-dbt-dev-environment-id"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> os</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">environ</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">get</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token string" style="color:rgb(173, 219, 103)">"DBT_DEV_ENV_ID"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">  </span><span class="token string" style="color:rgb(173, 219, 103)">"Authorization"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string-interpolation string" style="color:rgb(173, 219, 103)">f"token </span><span class="token string-interpolation interpolation punctuation" style="color:rgb(199, 146, 234)">{</span><span class="token string-interpolation interpolation">os</span><span class="token string-interpolation interpolation punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token string-interpolation interpolation">environ</span><span class="token string-interpolation interpolation punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token string-interpolation interpolation">get</span><span class="token string-interpolation interpolation punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token string-interpolation interpolation string" style="color:rgb(173, 219, 
103)">'DBT_TOKEN'</span><span class="token string-interpolation interpolation punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token string-interpolation interpolation punctuation" style="color:rgb(199, 146, 234)">}</span><span class="token string-interpolation string" style="color:rgb(173, 219, 103)">"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain"></span><span class="token punctuation" style="color:rgb(199, 146, 234)">}</span><br></span></code></pre><div class="buttonGroup_6DOT"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
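<p>If you are wiring this configuration into a larger application, it can help to wrap it in a small helper. The function name below is our own, not part of any dbt library; the header names and URL path mirror the configuration shown above:</p>

```python
import os

def build_dbt_mcp_connection(env=None):
    """Build the remote dbt MCP URL and headers from environment variables.

    Hypothetical helper; the URL path and header names mirror the
    configuration shown in the snippet above.
    """
    if env is None:
        env = os.environ
    url = f"https://{env.get('DBT_HOST')}/api/ai/v1/mcp/"
    headers = {
        "x-dbt-user-id": env.get("DBT_USER_ID"),
        "x-dbt-prod-environment-id": env.get("DBT_PROD_ENV_ID"),
        "x-dbt-dev-environment-id": env.get("DBT_DEV_ENV_ID"),
        "Authorization": f"token {env.get('DBT_TOKEN')}",
    }
    return url, headers

# Example with placeholder values (substitute your own deployment's values)
url, headers = build_dbt_mcp_connection({
    "DBT_HOST": "cloud.getdbt.com",
    "DBT_USER_ID": "123",
    "DBT_PROD_ENV_ID": "456",
    "DBT_DEV_ENV_ID": "789",
    "DBT_TOKEN": "dbt-token",
})
print(url)  # https://cloud.getdbt.com/api/ai/v1/mcp/
```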
<p>Next, we need to create an MCP client, so our agent knows how to use the remote dbt MCP server.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#d6deeb;--prism-background-color:#011627"><div class="codeBlockContent_m3Ux"><pre tabindex="0" class="prism-code language-python codeBlock_qGQc thin-scrollbar" style="color:#d6deeb;background-color:#011627"><code class="codeBlockLines_p187"><span class="token-line" style="color:#d6deeb"><span class="token keyword" style="color:rgb(127, 219, 202)">from</span><span class="token plain"> langchain_mcp_adapters</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">client </span><span class="token keyword" style="color:rgb(127, 219, 202)">import</span><span class="token plain"> MultiServerMCPClient</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">client </span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token plain"> MultiServerMCPClient</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">  </span><span class="token punctuation" style="color:rgb(199, 146, 234)">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    </span><span class="token string" style="color:rgb(173, 219, 103)">"dbt"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">      </span><span class="token string" style="color:rgb(173, 219, 103)">"url"</span><span class="token punctuation" style="color:rgb(199, 
146, 234)">:</span><span class="token plain"> url</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">      </span><span class="token string" style="color:rgb(173, 219, 103)">"headers"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> headers</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">      </span><span class="token string" style="color:rgb(173, 219, 103)">"transport"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(173, 219, 103)">"streamable_http"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    </span><span class="token punctuation" style="color:rgb(199, 146, 234)">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">  </span><span class="token punctuation" style="color:rgb(199, 146, 234)">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain"></span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><br></span></code></pre><div class="buttonGroup_6DOT"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 
2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>Then, we need to get the available tools from the remote dbt MCP server.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#d6deeb;--prism-background-color:#011627"><div class="codeBlockContent_m3Ux"><pre tabindex="0" class="prism-code language-python codeBlock_qGQc thin-scrollbar" style="color:#d6deeb;background-color:#011627"><code class="codeBlockLines_p187"><span class="token-line" style="color:#d6deeb"><span class="token plain">tools </span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(127, 219, 202)">await</span><span class="token plain"> client</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">get_tools</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><br></span></code></pre><div class="buttonGroup_6DOT"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>Now, we can create our LangGraph agent.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#d6deeb;--prism-background-color:#011627"><div class="codeBlockContent_m3Ux"><pre tabindex="0" class="prism-code language-python codeBlock_qGQc thin-scrollbar" style="color:#d6deeb;background-color:#011627"><code class="codeBlockLines_p187"><span class="token-line" style="color:#d6deeb"><span class="token keyword" style="color:rgb(127, 219, 202)">from</span><span class="token plain"> langgraph</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">prebuilt </span><span class="token keyword" style="color:rgb(127, 219, 202)">import</span><span class="token plain"> create_react_agent</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain"></span><span class="token keyword" style="color:rgb(127, 219, 202)">from</span><span class="token plain"> langgraph</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">checkpoint</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">memory </span><span class="token keyword" style="color:rgb(127, 219, 202)">import</span><span class="token plain"> InMemorySaver</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">agent </span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token plain"> create_react_agent</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">  model</span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token string" style="color:rgb(173, 219, 
103)">"anthropic:claude-3-7-sonnet-latest"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">  tools</span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token plain">tools</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">  </span><span class="token comment" style="color:rgb(99, 119, 119);font-style:italic"># This allows the agent to have conversational memory.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">  checkpointer</span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token plain">InMemorySaver</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain"></span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><br></span></code></pre><div class="buttonGroup_6DOT"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>Finally, we can run the agent in a loop. This example relies on <code>print_stream_item</code>, which you can find in the full example <a href="https://github.com/dbt-labs/dbt-mcp/blob/365bc0f4c28b48510d194201370a5500d69cc5ea/examples/langgraph_agent/main.py#L11" target="_blank" rel="noopener noreferrer">here</a>. You can exit the loop by stopping the program with CTRL+C.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#d6deeb;--prism-background-color:#011627"><div class="codeBlockContent_m3Ux"><pre tabindex="0" class="prism-code language-python codeBlock_qGQc thin-scrollbar" style="color:#d6deeb;background-color:#011627"><code class="codeBlockLines_p187"><span class="token-line" style="color:#d6deeb"><span class="token comment" style="color:rgb(99, 119, 119);font-style:italic"># This config maintains the conversation thread.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">config </span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">{</span><span class="token string" style="color:rgb(173, 219, 103)">"configurable"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">{</span><span class="token string" style="color:rgb(173, 219, 103)">"thread_id"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(173, 219, 103)">"1"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">}</span><span class="token punctuation" style="color:rgb(199, 146, 234)">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain"></span><span class="token keyword" style="color:rgb(127, 219, 202)">while</span><span class="token plain"> </span><span class="token boolean" style="color:rgb(255, 88, 116)">True</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">  user_input </span><span 
class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token plain"> </span><span class="token builtin" style="color:rgb(130, 170, 255)">input</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token string" style="color:rgb(173, 219, 103)">"User &gt; "</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">  agent_response </span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token plain"> agent</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">invoke</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    </span><span class="token punctuation" style="color:rgb(199, 146, 234)">{</span><span class="token string" style="color:rgb(173, 219, 103)">"messages"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">{</span><span class="token string" style="color:rgb(173, 219, 103)">"role"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(173, 219, 103)">"user"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"> </span><span class="token string" style="color:rgb(173, 219, 103)">"content"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> user_input</span><span class="token punctuation" style="color:rgb(199, 146, 234)">}</span><span class="token punctuation" style="color:rgb(199, 146, 234)">}</span><span 
class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    config</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">  </span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">  print_stream_item</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token plain">agent_response</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><br></span></code></pre><div class="buttonGroup_6DOT"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>With our agent implemented, we can run the program and ask it a question. You should see an output like this:</p>
<div class="language-shell codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#d6deeb;--prism-background-color:#011627"><div class="codeBlockContent_m3Ux"><pre tabindex="0" class="prism-code language-shell codeBlock_qGQc thin-scrollbar" style="color:#d6deeb;background-color:#011627"><code class="codeBlockLines_p187"><span class="token-line" style="color:#d6deeb"><span class="token plain">User </span><span class="token operator" style="color:rgb(127, 219, 202)">&gt;</span><span class="token plain"> How much revenue did we </span><span class="token function" style="color:rgb(130, 170, 255)">make</span><span class="token plain"> last month?</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">Agent </span><span class="token operator" style="color:rgb(127, 219, 202)">&gt;</span><span class="token plain"> I</span><span class="token string" style="color:rgb(173, 219, 103)">'ll help you find out the revenue for last month. Let me first check what metrics are available in the dbt Semantic Layer.</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token string" style="color:rgb(173, 219, 103)">    using tool: list_metrics</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token string" style="color:rgb(173, 219, 103)">Agent &gt; I see that we have a "revenue" metric available. Let me get the dimensions for this metric to understand how I can query for last month'</span><span class="token plain">s data:</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    using tool: get_dimensions</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">Agent </span><span class="token operator" style="color:rgb(127, 219, 202)">&gt;</span><span class="token plain"> Now I</span><span class="token string" style="color:rgb(173, 219, 103)">'ll query the revenue metric for last month. 
I'</span><span class="token plain">ll use the </span><span class="token string" style="color:rgb(173, 219, 103)">"metric_time"</span><span class="token plain"> dimension with a MONTH grain:</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    using tool: query_metrics</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">Agent </span><span class="token operator" style="color:rgb(127, 219, 202)">&gt;</span><span class="token plain"> Based on the results, the total revenue </span><span class="token keyword" style="color:rgb(127, 219, 202)">for</span><span class="token plain"> last month was **</span><span class="token variable" style="color:rgb(214, 222, 235)">$102</span><span class="token plain">,379.00**.</span><br></span></code></pre><div class="buttonGroup_6DOT"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="future-work">Future Work<a href="https://docs.getdbt.com/blog/building-the-remote-dbt-mcp-server#future-work" class="hash-link" aria-label="Direct link to Future Work" title="Direct link to Future Work">​</a></h2>
<p>Now that remote dbt MCP is available in public beta, we encourage you to build agents to interact with your dbt resources, bringing <strong>structured, governed context</strong> into AI workflows without the overhead of local setup. Here are some ideas for types of agents you can build with the remote dbt MCP server:</p>
<ul>
<li>Answer business-related questions with accurate, governed metrics from dbt</li>
<li>Identify PII columns and enforce governance policies automatically</li>
<li>Review pull requests to improve code quality and expedite the review process</li>
<li>Explore metadata and catalog information to accelerate data discovery and troubleshooting</li>
<li>Provide on-call incident support to remediate issues faster</li>
</ul>
<p>We are continuing to invest in remote dbt MCP, with upcoming features like OAuth-based authentication to make remote MCP authentication &amp; authorization even easier. If you have any feedback, need help, or just want to chat, join us in the #tools-dbt-mcp channel in <a href="https://www.getdbt.com/community/join-the-community" target="_blank" rel="noopener noreferrer">our community Slack</a>.</p>]]></content>
        <author>
            <name>Devon Fulcher</name>
        </author>
        <category label="ai" term="ai"/>
        <category label="data ecosystem" term="data ecosystem"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[How to train a linear regression model with dbt and BigFrames]]></title>
        <id>https://docs.getdbt.com/blog/train-linear-dbt-bigframes</id>
        <link href="https://docs.getdbt.com/blog/train-linear-dbt-bigframes"/>
        <updated>2025-07-11T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[How to build a scalable linear regression model by combining dbt's modular orchestration with BigFrames' in-database Python execution in BigQuery.]]></summary>
        <content type="html"><![CDATA[<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="introduction-to-dbt-and-bigframes">Introduction to dbt and BigFrames<a href="https://docs.getdbt.com/blog/train-linear-dbt-bigframes#introduction-to-dbt-and-bigframes" class="hash-link" aria-label="Direct link to Introduction to dbt and BigFrames" title="Direct link to Introduction to dbt and BigFrames">​</a></h2>
<p><strong>dbt</strong>: A framework for transforming data in modern data warehouses using modular SQL or Python. dbt enables data teams to develop analytics code collaboratively and efficiently by applying software engineering best practices such as version control, modularity, portability, CI/CD, testing, and documentation. For more information, refer to <a href="https://docs.getdbt.com/docs/introduction#dbt">What is dbt?</a></p>
<p><strong>BigQuery DataFrames (BigFrames)</strong>: An open-source Python library offered by Google. BigFrames scales Python data processing by transpiling common Python data science APIs (pandas and scikit-learn) to BigQuery SQL.</p>
<p>You can read more in the <a href="https://cloud.google.com/bigquery/docs/bigquery-dataframes-introduction" target="_blank" rel="noopener noreferrer">official BigFrames guide</a> and view the <a href="https://github.com/googleapis/python-bigquery-dataframes" target="_blank" rel="noopener noreferrer">public BigFrames GitHub repository</a>.</p>
<p>By combining dbt with BigFrames via the <code>dbt-bigquery</code> adapter (referred to as <em>"dbt-BigFrames"</em>), you gain:</p>
<ul>
<li>dbt’s modular SQL and Python modeling, dependency management with <code>dbt.ref()</code>, environment configurations, and data testing. With the cloud-based dbt platform, you also get job scheduling and monitoring.</li>
<li>BigFrames’ ability to execute complex Python transformations (including machine learning) directly in BigQuery.</li>
</ul>
<p><code>dbt-BigFrames</code> utilizes the <strong>Colab Enterprise notebook executor service</strong> in a GCP project to run Python models. These notebooks execute BigFrames code, which is translated into BigQuery SQL.</p>
<blockquote>
<p>Refer to these guides to learn more: <a href="https://cloud.google.com/bigquery/docs/dataframes-dbt" target="_blank" rel="noopener noreferrer">Use BigQuery DataFrames in dbt</a> or <a href="https://docs.getdbt.com/guides/dbt-python-bigframes?step=1">Using BigQuery DataFrames with dbt Python models</a>.</p>
</blockquote>
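<p>In practice, a BigFrames-backed dbt Python model is an ordinary <code>.py</code> file in your <code>models/</code> directory. The sketch below shows the shape of such a model; the file and upstream model names (<code>ozone_features</code>, <code>stg_air_quality</code>) are placeholders, and the snippet only runs inside a dbt project configured against BigQuery:</p>

```python
# models/ozone_features.py -- hypothetical model name.
def model(dbt, session):
    # Run this model's Python via the BigFrames notebook executor.
    dbt.config(submission_method="bigframes")

    # dbt.ref() resolves the upstream model and returns a BigFrames
    # DataFrame; pandas-style calls on it compile to BigQuery SQL.
    df = dbt.ref("stg_air_quality")  # placeholder upstream model
    return df.dropna()
```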
<p>To illustrate the practical impact of combining dbt with BigFrames, the following sections explore how this integration can streamline and scale a common machine learning task: training a linear regression model on large datasets.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="the-power-of-dbt-bigframes-for-large-scale-linear-regression">The power of dbt-BigFrames for large-scale linear regression<a href="https://docs.getdbt.com/blog/train-linear-dbt-bigframes#the-power-of-dbt-bigframes-for-large-scale-linear-regression" class="hash-link" aria-label="Direct link to The power of dbt-BigFrames for large-scale linear regression" title="Direct link to The power of dbt-BigFrames for large-scale linear regression">​</a></h2>
<p>Linear regression is a cornerstone of predictive analytics, used in:</p>
<ul>
<li>Sales forecasting</li>
<li>Financial modeling</li>
<li>Demand planning</li>
<li>Real estate valuation</li>
</ul>
<p>These tasks often require processing datasets too large for traditional in-memory Python. BigFrames alone solves this, but combining it with dbt offers a structured, maintainable, and production-ready way to train models or generate batch predictions on large data.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="dbt-bigframes-with-ml-a-practical-example">“dbt-BigFrames” with ML: A practical example<a href="https://docs.getdbt.com/blog/train-linear-dbt-bigframes#dbt-bigframes-with-ml-a-practical-example" class="hash-link" aria-label="Direct link to “dbt-BigFrames” with ML: A practical example" title="Direct link to “dbt-BigFrames” with ML: A practical example">​</a></h2>
<p>We’ll walk through training a linear regression model using a <strong>dbt Python model powered by BigFrames</strong>, focusing on the structure and orchestration provided by dbt.</p>
<p>We’ll use the <code>epa_historical_air_quality</code> dataset from BigQuery Public Data (courtesy of the U.S. Environmental Protection Agency).</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="problem-statement">Problem statement<a href="https://docs.getdbt.com/blog/train-linear-dbt-bigframes#problem-statement" class="hash-link" aria-label="Direct link to Problem statement" title="Direct link to Problem statement">​</a></h3>
<p>Develop a machine learning model to predict atmospheric ozone levels using historical air quality and environmental sensor data, enabling more accurate monitoring and forecasting of air pollution trends.</p>
<p><strong>Key stages:</strong></p>
<ol>
<li><strong>Data Foundation</strong>: Transform raw source tables into an analysis-ready dataset.</li>
<li><strong>Machine Learning Analysis</strong>: Train a linear regression model on the cleaned data.</li>
</ol>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="setting-up-your-dbt-project-for-bigframes">Setting up your dbt project for BigFrames<a href="https://docs.getdbt.com/blog/train-linear-dbt-bigframes#setting-up-your-dbt-project-for-bigframes" class="hash-link" aria-label="Direct link to Setting up your dbt project for BigFrames" title="Direct link to Setting up your dbt project for BigFrames">​</a></h2>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="prerequisites">Prerequisites<a href="https://docs.getdbt.com/blog/train-linear-dbt-bigframes#prerequisites" class="hash-link" aria-label="Direct link to Prerequisites" title="Direct link to Prerequisites">​</a></h3>
<ul>
<li>A Google Cloud account</li>
<li>A dbt platform or dbt Core setup</li>
<li>Basic to intermediate SQL and Python</li>
<li>Familiarity with dbt (see the <a href="https://docs.getdbt.com/guides?level=Beginner">beginner dbt guides</a>)</li>
</ul>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="sample-profilesyml-for-bigframes">Sample <code>profiles.yml</code> for BigFrames<a href="https://docs.getdbt.com/blog/train-linear-dbt-bigframes#sample-profilesyml-for-bigframes" class="hash-link" aria-label="Direct link to sample-profilesyml-for-bigframes" title="Direct link to sample-profilesyml-for-bigframes">​</a></h3>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#d6deeb;--prism-background-color:#011627"><div class="codeBlockContent_m3Ux"><pre tabindex="0" class="prism-code language-yaml codeBlock_qGQc thin-scrollbar" style="color:#d6deeb;background-color:#011627"><code class="codeBlockLines_p187"><span class="token-line" style="color:#d6deeb"><span class="token key atrule" style="color:rgb(255, 203, 139)">my_epa_project</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">  </span><span class="token key atrule" style="color:rgb(255, 203, 139)">outputs</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    </span><span class="token key atrule" style="color:rgb(255, 203, 139)">dev</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">      </span><span class="token key atrule" style="color:rgb(255, 203, 139)">compute_region</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> us</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">central1</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">      </span><span class="token key atrule" style="color:rgb(255, 203, 139)">dataset</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> your_bq_dataset</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">      </span><span class="token key atrule" style="color:rgb(255, 203, 
139)">gcs_bucket</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> your_gcs_bucket</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">      </span><span class="token key atrule" style="color:rgb(255, 203, 139)">location</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> US</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">      </span><span class="token key atrule" style="color:rgb(255, 203, 139)">method</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> oauth</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">      </span><span class="token key atrule" style="color:rgb(255, 203, 139)">priority</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> interactive</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">      </span><span class="token key atrule" style="color:rgb(255, 203, 139)">project</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> your_gcp_project</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">      </span><span class="token key atrule" style="color:rgb(255, 203, 139)">threads</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token number" style="color:rgb(247, 140, 108)">1</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">      </span><span class="token key atrule" style="color:rgb(255, 203, 139)">type</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> bigquery</span><br></span><span 
class="token-line" style="color:#d6deeb"><span class="token plain">  </span><span class="token key atrule" style="color:rgb(255, 203, 139)">target</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> dev</span><br></span></code></pre><div class="buttonGroup_6DOT"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="sample-dbt_projectyml">Sample <code>dbt_project.yml</code><a href="https://docs.getdbt.com/blog/train-linear-dbt-bigframes#sample-dbt_projectyml" class="hash-link" aria-label="Direct link to sample-dbt_projectyml" title="Direct link to sample-dbt_projectyml">​</a></h3>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#d6deeb;--prism-background-color:#011627"><div class="codeBlockContent_m3Ux"><pre tabindex="0" class="prism-code language-yaml codeBlock_qGQc thin-scrollbar" style="color:#d6deeb;background-color:#011627"><code class="codeBlockLines_p187"><span class="token-line" style="color:#d6deeb"><span class="token key atrule" style="color:rgb(255, 203, 139)">name</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(173, 219, 103)">'my_epa_project'</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain"></span><span class="token key atrule" style="color:rgb(255, 203, 139)">version</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(173, 219, 103)">'1.0.0'</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain"></span><span class="token key atrule" style="color:rgb(255, 203, 139)">config-version</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token number" style="color:rgb(247, 140, 108)">2</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain"></span><span class="token key atrule" style="color:rgb(255, 203, 139)">models</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">  </span><span class="token key atrule" style="color:rgb(255, 203, 
139)">my_epa_project</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    </span><span class="token key atrule" style="color:rgb(255, 203, 139)">submission_method</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> bigframes</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    </span><span class="token key atrule" style="color:rgb(255, 203, 139)">notebook_template_id</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> 701881164074529xxxx  </span><span class="token comment" style="color:rgb(99, 119, 119);font-style:italic"># Optional</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    </span><span class="token key atrule" style="color:rgb(255, 203, 139)">timeout</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token number" style="color:rgb(247, 140, 108)">6000</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    </span><span class="token key atrule" style="color:rgb(255, 203, 139)">example</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">      </span><span class="token key atrule" style="color:rgb(255, 203, 139)">+materialized</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> view</span><br></span></code></pre></div></div>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="the-dbt-python-models-for-linear-regression">The dbt Python models for linear regression<a href="https://docs.getdbt.com/blog/train-linear-dbt-bigframes#the-dbt-python-models-for-linear-regression" class="hash-link" aria-label="Direct link to The dbt Python models for linear regression" title="Direct link to The dbt Python models for linear regression">​</a></h2>
<p>This project uses <strong>two modular dbt Python models</strong>:</p>
<ol>
<li><code>prepare_table.py</code> — Ingests and prepares data</li>
<li><code>prediction.py</code> — Trains the model and generates predictions</li>
</ol>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="part-1-preparing-the-table-prepare_tablepy">Part 1: Preparing the table (<code>prepare_table.py</code>)<a href="https://docs.getdbt.com/blog/train-linear-dbt-bigframes#part-1-preparing-the-table-prepare_tablepy" class="hash-link" aria-label="Direct link to part-1-preparing-the-table-prepare_tablepy" title="Direct link to part-1-preparing-the-table-prepare_tablepy">​</a></h3>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#d6deeb;--prism-background-color:#011627"><div class="codeBlockContent_m3Ux"><pre tabindex="0" class="prism-code language-python codeBlock_qGQc thin-scrollbar" style="color:#d6deeb;background-color:#011627"><code class="codeBlockLines_p187"><span class="token-line" style="color:#d6deeb"><span class="token keyword" style="color:rgb(127, 219, 202)">import</span><span class="token plain"> bigframes</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">pandas </span><span class="token keyword" style="color:rgb(127, 219, 202)">as</span><span class="token plain"> bpd</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token keyword" style="color:rgb(127, 219, 202)">def</span><span class="token plain"> </span><span class="token function" style="color:rgb(130, 170, 255)">model</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token plain">dbt</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"> session</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    dbt</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">config</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token plain">submission_method</span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token string" style="color:rgb(173, 219, 103)">"bigframes"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"> timeout</span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token number" style="color:rgb(247, 140, 108)">6000</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token 
plain">    dataset </span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token plain"> </span><span class="token string" style="color:rgb(173, 219, 103)">"bigquery-public-data.epa_historical_air_quality"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    index_columns </span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">[</span><span class="token string" style="color:rgb(173, 219, 103)">"state_name"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"> </span><span class="token string" style="color:rgb(173, 219, 103)">"county_name"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"> </span><span class="token string" style="color:rgb(173, 219, 103)">"site_num"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"> </span><span class="token string" style="color:rgb(173, 219, 103)">"date_local"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"> </span><span class="token string" style="color:rgb(173, 219, 103)">"time_local"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    param_column </span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token plain"> </span><span class="token string" style="color:rgb(173, 219, 103)">"parameter_name"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    value_column </span><span class="token operator" style="color:rgb(127, 219, 
202)">=</span><span class="token plain"> </span><span class="token string" style="color:rgb(173, 219, 103)">"sample_measurement"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    params_dfs </span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">[</span><span class="token punctuation" style="color:rgb(199, 146, 234)">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    table_param_dict </span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">        </span><span class="token string" style="color:rgb(173, 219, 103)">"co_hourly_summary"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(173, 219, 103)">"co"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">        </span><span class="token string" style="color:rgb(173, 219, 103)">"no2_hourly_summary"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(173, 219, 103)">"no2"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:#d6deeb"><span class="token plain">        </span><span class="token string" style="color:rgb(173, 219, 103)">"o3_hourly_summary"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(173, 219, 103)">"o3"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">        </span><span class="token string" style="color:rgb(173, 219, 103)">"pressure_hourly_summary"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(173, 219, 103)">"pressure"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">        </span><span class="token string" style="color:rgb(173, 219, 103)">"so2_hourly_summary"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(173, 219, 103)">"so2"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">        </span><span class="token string" style="color:rgb(173, 219, 103)">"temperature_hourly_summary"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(173, 219, 103)">"temperature"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    </span><span class="token punctuation" 
style="color:rgb(199, 146, 234)">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    </span><span class="token keyword" style="color:rgb(127, 219, 202)">for</span><span class="token plain"> table</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"> param </span><span class="token keyword" style="color:rgb(127, 219, 202)">in</span><span class="token plain"> table_param_dict</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">items</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">        param_df </span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token plain"> bpd</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">read_gbq</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token string-interpolation string" style="color:rgb(173, 219, 103)">f"</span><span class="token string-interpolation interpolation punctuation" style="color:rgb(199, 146, 234)">{</span><span class="token string-interpolation interpolation">dataset</span><span class="token string-interpolation interpolation punctuation" style="color:rgb(199, 146, 234)">}</span><span class="token string-interpolation string" style="color:rgb(173, 219, 103)">.</span><span class="token string-interpolation interpolation punctuation" style="color:rgb(199, 146, 234)">{</span><span class="token 
string-interpolation interpolation">table</span><span class="token string-interpolation interpolation punctuation" style="color:rgb(199, 146, 234)">}</span><span class="token string-interpolation string" style="color:rgb(173, 219, 103)">"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"> columns</span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token plain">index_columns </span><span class="token operator" style="color:rgb(127, 219, 202)">+</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">[</span><span class="token plain">value_column</span><span class="token punctuation" style="color:rgb(199, 146, 234)">]</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">        param_df </span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token plain"> param_df</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">sort_values</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token plain">index_columns</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">drop_duplicates</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token plain">index_columns</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">set_index</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token plain">index_columns</span><span class="token punctuation" 
style="color:rgb(199, 146, 234)">)</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">rename</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token plain">columns</span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token punctuation" style="color:rgb(199, 146, 234)">{</span><span class="token plain">value_column</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> param</span><span class="token punctuation" style="color:rgb(199, 146, 234)">}</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">        params_dfs</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">append</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token plain">param_df</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    wind_table </span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token plain"> </span><span class="token string-interpolation string" style="color:rgb(173, 219, 103)">f"</span><span class="token string-interpolation interpolation punctuation" style="color:rgb(199, 146, 234)">{</span><span class="token string-interpolation interpolation">dataset</span><span class="token string-interpolation interpolation punctuation" style="color:rgb(199, 146, 234)">}</span><span class="token string-interpolation string" style="color:rgb(173, 219, 
103)">.wind_hourly_summary"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    wind_speed_df </span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token plain"> bpd</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">read_gbq</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">        wind_table</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">        columns</span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token plain">index_columns </span><span class="token operator" style="color:rgb(127, 219, 202)">+</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">[</span><span class="token plain">value_column</span><span class="token punctuation" style="color:rgb(199, 146, 234)">]</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">        filters</span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token punctuation" style="color:rgb(199, 146, 234)">[</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token plain">param_column</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"> </span><span class="token string" style="color:rgb(173, 219, 103)">"=="</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"> </span><span 
class="token string" style="color:rgb(173, 219, 103)">"Wind Speed - Resultant"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token punctuation" style="color:rgb(199, 146, 234)">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    </span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    wind_speed_df </span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token plain"> wind_speed_df</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">sort_values</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token plain">index_columns</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">drop_duplicates</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token plain">index_columns</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">set_index</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token plain">index_columns</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">rename</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token plain">columns</span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token punctuation" style="color:rgb(199, 146, 234)">{</span><span 
class="token plain">value_column</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(173, 219, 103)">"wind_speed"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">}</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    params_dfs</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">append</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token plain">wind_speed_df</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    df </span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token plain"> bpd</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">concat</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token plain">params_dfs</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"> axis</span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token number" style="color:rgb(247, 140, 108)">1</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"> join</span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token string" style="color:rgb(173, 219, 103)">"inner"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token 
punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">cache</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    </span><span class="token keyword" style="color:rgb(127, 219, 202)">return</span><span class="token plain"> df</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">reset_index</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><br></span></code></pre></div></div>
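<p>The joining pattern in this model (deduplicate each per-pollutant table on the shared key columns, index on those keys, then column-wise inner-join the results) can be sketched with plain pandas. This is a toy illustration with made-up data and hypothetical values; the model itself uses <code>bigframes.pandas</code>, whose DataFrame API mirrors pandas:</p>

```python
import pandas as pd

# A reduced version of the model's key columns, for illustration only.
index_columns = ["state_name", "date_local"]

# Two toy per-pollutant tables; the CA row is duplicated on purpose.
co = pd.DataFrame({
    "state_name": ["CA", "CA", "NY"],
    "date_local": ["2016-01-01", "2016-01-01", "2016-01-01"],
    "sample_measurement": [0.5, 0.6, 0.4],
})
o3 = pd.DataFrame({
    "state_name": ["CA", "TX"],
    "date_local": ["2016-01-01", "2016-01-01"],
    "sample_measurement": [0.03, 0.02],
})

def prepare(df, name):
    # Keep one row per key, index on the keys, and rename the value
    # column after the pollutant, mirroring the model's chained calls.
    return (df.sort_values(index_columns)
              .drop_duplicates(index_columns)
              .set_index(index_columns)
              .rename(columns={"sample_measurement": name}))

# axis=1 with join="inner" keeps only keys present in every table:
# here, only ("CA", "2016-01-01") survives.
wide = pd.concat([prepare(co, "co"), prepare(o3, "o3")], axis=1, join="inner")
print(wide.reset_index())
```

<p>The inner join along <code>axis=1</code> is what turns many long per-pollutant tables into one wide table with a column per pollutant, keeping only observations recorded for every pollutant, which is the shape the regression model expects.</p>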
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="part-2-training-the-model-and-making-predictions-predictionpy">Part 2: Training the model and making predictions (<code>prediction.py</code>)<a href="https://docs.getdbt.com/blog/train-linear-dbt-bigframes#part-2-training-the-model-and-making-predictions-predictionpy" class="hash-link" aria-label="Direct link to part-2-training-the-model-and-making-predictions-predictionpy" title="Direct link to part-2-training-the-model-and-making-predictions-predictionpy">​</a></h3>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#d6deeb;--prism-background-color:#011627"><div class="codeBlockContent_m3Ux"><pre tabindex="0" class="prism-code language-python codeBlock_qGQc thin-scrollbar" style="color:#d6deeb;background-color:#011627"><code class="codeBlockLines_p187"><span class="token-line" style="color:#d6deeb"><span class="token keyword" style="color:rgb(127, 219, 202)">def</span><span class="token plain"> </span><span class="token function" style="color:rgb(130, 170, 255)">model</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token plain">dbt</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"> session</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    dbt</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">config</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token plain">submission_method</span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token string" style="color:rgb(173, 219, 103)">"bigframes"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"> timeout</span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token number" style="color:rgb(247, 140, 108)">6000</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token 
plain">    df </span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token plain"> dbt</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">ref</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token string" style="color:rgb(173, 219, 103)">"prepare_table"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    train_data_filter </span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token plain">df</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">date_local</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">dt</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">year </span><span class="token operator" style="color:rgb(127, 219, 202)">&lt;</span><span class="token plain"> </span><span class="token number" style="color:rgb(247, 140, 108)">2017</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    test_data_filter </span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token plain">df</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token 
plain">date_local</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">dt</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">year </span><span class="token operator" style="color:rgb(127, 219, 202)">&gt;=</span><span class="token plain"> </span><span class="token number" style="color:rgb(247, 140, 108)">2017</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token plain"> </span><span class="token operator" style="color:rgb(127, 219, 202)">&amp;</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token plain">df</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">date_local</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">dt</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">year </span><span class="token operator" style="color:rgb(127, 219, 202)">&lt;</span><span class="token plain"> </span><span class="token number" style="color:rgb(247, 140, 108)">2020</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    predict_data_filter </span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token plain">df</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">date_local</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">dt</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span 
class="token plain">year </span><span class="token operator" style="color:rgb(127, 219, 202)">&gt;=</span><span class="token plain"> </span><span class="token number" style="color:rgb(247, 140, 108)">2020</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    index_columns </span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">[</span><span class="token string" style="color:rgb(173, 219, 103)">"state_name"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"> </span><span class="token string" style="color:rgb(173, 219, 103)">"county_name"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"> </span><span class="token string" style="color:rgb(173, 219, 103)">"site_num"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"> </span><span class="token string" style="color:rgb(173, 219, 103)">"date_local"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"> </span><span class="token string" style="color:rgb(173, 219, 103)">"time_local"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    df_train </span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token plain"> df</span><span class="token punctuation" style="color:rgb(199, 146, 234)">[</span><span class="token plain">train_data_filter</span><span 
class="token punctuation" style="color:rgb(199, 146, 234)">]</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">set_index</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token plain">index_columns</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    df_test </span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token plain"> df</span><span class="token punctuation" style="color:rgb(199, 146, 234)">[</span><span class="token plain">test_data_filter</span><span class="token punctuation" style="color:rgb(199, 146, 234)">]</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">set_index</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token plain">index_columns</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    df_predict </span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token plain"> df</span><span class="token punctuation" style="color:rgb(199, 146, 234)">[</span><span class="token plain">predict_data_filter</span><span class="token punctuation" style="color:rgb(199, 146, 234)">]</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">set_index</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token plain">index_columns</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain" 
style="display:inline-block"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    X_train</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"> y_train </span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token plain"> df_train</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">drop</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token plain">columns</span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token string" style="color:rgb(173, 219, 103)">"o3"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"> df_train</span><span class="token punctuation" style="color:rgb(199, 146, 234)">[</span><span class="token string" style="color:rgb(173, 219, 103)">"o3"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    X_test</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"> y_test </span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token plain"> df_test</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">drop</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token plain">columns</span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token string" style="color:rgb(173, 219, 103)">"o3"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token punctuation" style="color:rgb(199, 146, 
234)">,</span><span class="token plain"> df_test</span><span class="token punctuation" style="color:rgb(199, 146, 234)">[</span><span class="token string" style="color:rgb(173, 219, 103)">"o3"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    X_predict </span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token plain"> df_predict</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">drop</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token plain">columns</span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token string" style="color:rgb(173, 219, 103)">"o3"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    </span><span class="token keyword" style="color:rgb(127, 219, 202)">from</span><span class="token plain"> bigframes</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">ml</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">linear_model </span><span class="token keyword" style="color:rgb(127, 219, 202)">import</span><span class="token plain"> LinearRegression</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    model </span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token plain"> LinearRegression</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token 
punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    model</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">fit</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token plain">X_train</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"> y_train</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    df_pred </span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token plain"> model</span><span class="token punctuation" style="color:rgb(199, 146, 234)">.</span><span class="token plain">predict</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token plain">X_predict</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    </span><span class="token keyword" style="color:rgb(127, 219, 202)">return</span><span class="token plain"> df_pred</span><br></span></code></pre></div></div>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="running-your-dbt-ml-pipeline">Running your dbt ML pipeline<a href="https://docs.getdbt.com/blog/train-linear-dbt-bigframes#running-your-dbt-ml-pipeline" class="hash-link" aria-label="Direct link to Running your dbt ML pipeline" title="Direct link to Running your dbt ML pipeline">​</a></h2>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#d6deeb;--prism-background-color:#011627"><div class="codeBlockContent_m3Ux"><pre tabindex="0" class="prism-code language-bash codeBlock_qGQc thin-scrollbar" style="color:#d6deeb;background-color:#011627"><code class="codeBlockLines_p187"><span class="token-line" style="color:#d6deeb"><span class="token comment" style="color:rgb(99, 119, 119);font-style:italic"># Run all models</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">dbt run</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain"></span><span class="token comment" style="color:rgb(99, 119, 119);font-style:italic"># Or run just your new models</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">dbt run </span><span class="token parameter variable" style="color:rgb(214, 222, 235)">--select</span><span class="token plain"> prepare_table prediction</span><br></span></code></pre></div></div>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="key-advantages-of-dbt-and-bigframes-for-ml">Key advantages of dbt and BigFrames for ML<a href="https://docs.getdbt.com/blog/train-linear-dbt-bigframes#key-advantages-of-dbt-and-bigframes-for-ml" class="hash-link" aria-label="Direct link to Key advantages of dbt and BigFrames for ML" title="Direct link to Key advantages of dbt and BigFrames for ML">​</a></h2>
<ul>
<li><strong>Scalability &amp; Efficiency</strong>: Handle large datasets in BigQuery via BigFrames</li>
<li><strong>Simplified Workflow</strong>: Use familiar APIs like <code>pandas</code> and <code>scikit-learn</code></li>
<li><strong>dbt Orchestration</strong>:<!-- -->
<ul>
<li>Dependency management with <code>dbt.ref()</code> and <code>dbt.source()</code></li>
<li>Scheduled retraining with <code>dbt run</code></li>
<li>Testing, documentation, and reproducibility</li>
</ul>
</li>
</ul>
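<p>As a quick local check, the year-based split logic from the model above can be exercised with plain pandas on a tiny synthetic frame (BigFrames intentionally mirrors the pandas API, so the same filters carry over; the data here is invented for illustration):</p>

```python
import pandas as pd

# Local pandas sketch of the year-based train/test/predict split used in the
# model above. The DataFrame is synthetic; on BigQuery the same boolean
# filters would run against a bigframes DataFrame instead.
df = pd.DataFrame({
    "date_local": pd.to_datetime(["2015-01-01", "2018-06-01", "2021-03-01"]),
    "o3": [0.03, 0.04, 0.05],
})

train = df[df.date_local.dt.year < 2017]
test = df[(df.date_local.dt.year >= 2017) & (df.date_local.dt.year < 2020)]
predict = df[df.date_local.dt.year >= 2020]

print(len(train), len(test), len(predict))  # 1 1 1
```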
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="conclusion-and-next-steps">Conclusion and next steps<a href="https://docs.getdbt.com/blog/train-linear-dbt-bigframes#conclusion-and-next-steps" class="hash-link" aria-label="Direct link to Conclusion and next steps" title="Direct link to Conclusion and next steps">​</a></h2>
<p>By integrating <strong>BigFrames</strong> into your <strong>dbt workflows</strong>, you can build scalable, maintainable, and production-ready machine learning pipelines. While this example used linear regression, the same principles apply across other ML use cases with <code>bigframes.ml</code>.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="feedback-and-support">Feedback and support<a href="https://docs.getdbt.com/blog/train-linear-dbt-bigframes#feedback-and-support" class="hash-link" aria-label="Direct link to Feedback and support" title="Direct link to Feedback and support">​</a></h2>
<ul>
<li>📚 <a href="https://docs.getdbt.com/docs/dbt-support">dbt Support</a></li>
<li>📨 Email feedback on BigFrames: <a href="mailto:bigframes-feedback@google.com" target="_blank" rel="noopener noreferrer">bigframes-feedback@google.com</a></li>
<li>🛠 <a href="https://github.com/googleapis/python-bigquery-dataframes" target="_blank" rel="noopener noreferrer">File issues on GitHub</a></li>
<li>📬 <a href="https://docs.google.com/forms/d/10EnDyYdYUW9HvelHYuBRC8L3GdGVl3rX0aroinbRZyc/viewform?edit_requested=true" target="_blank" rel="noopener noreferrer">Subscribe to BigFrames updates</a></li>
</ul>]]></content>
        <author>
            <name>Jialuo Chen</name>
        </author>
        <category label="analytics craft" term="analytics craft"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[The new dbt VS Code extension: The experience we've all been waiting for]]></title>
        <id>https://docs.getdbt.com/blog/vscode-extension-experience</id>
        <link href="https://docs.getdbt.com/blog/vscode-extension-experience"/>
        <updated>2025-06-25T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[How the new dbt VS Code extension finally delivers the dev experience we've always wanted.]]></summary>
        <content type="html"><![CDATA[<p>Hello, community!</p>
<p>My name is Bruno, and you might have seen me posting dbt content on LinkedIn. If you haven't, let me introduce myself. I started working with dbt more than 3 years ago. At that time, I was very new to the tool, and to understand it a bit better, I started creating resources to help me learn it. One of them, a dbt cheatsheet, was the starting point for my community journey.</p>
<p>From that cheatsheet, I went on to create all kinds of content, contributing to and engaging with the community, until I received the dbt community award twice, something I am very thankful for and proud of.</p>
<p>Since the acquisition of SDF Labs by dbt Labs, I have been waiting for the day that we would see what the result of the fusion of these two companies would be. Spoiler alert: It’s the dbt Fusion engine and it's better than I could have expected.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="the-dbt-developer-experience-in-the-pre-fusion-era">The dbt developer experience in the pre-fusion-era<a href="https://docs.getdbt.com/blog/vscode-extension-experience#the-dbt-developer-experience-in-the-pre-fusion-era" class="hash-link" aria-label="Direct link to The dbt developer experience in the pre-fusion-era" title="Direct link to The dbt developer experience in the pre-fusion-era">​</a></h2>
<p>If you've ever started a dbt project, chances are your journey began like mine did: cloning <code>jaffle_shop</code>, opening it in VS Code, and running <a href="https://docs.getdbt.com/reference/commands/run"><code>dbt run</code></a> for the first time (actually the second time, because I know you forgot to run <a href="https://docs.getdbt.com/reference/commands/deps"><code>dbt deps</code></a> in the first one). This is the dbt initiation process, our ‘hello-world’.</p>
<p>You played around with <a href="https://docs.getdbt.com/best-practices/how-we-structure/2-staging#staging-models">staging models</a>, the orders table, customers table. But let's be honest, the developer experience in that setup was always a bit... clunky.</p>
<p>You wanted to check the lineage of your project, one of the coolest features of dbt, and you had to run <a href="https://docs.getdbt.com/reference/commands/cmd-docs#dbt-docs-generate"><code>dbt docs generate</code></a>, <a href="https://docs.getdbt.com/reference/commands/cmd-docs#dbt-docs-serve"><code>serve</code></a>, and open the docs in a browser. Made some updates? Do all the steps again.</p>
<p>Did you want to check your project's metadata? You had to rely on <a href="https://docs.getdbt.com/reference/commands/cmd-docs"><code>dbt docs</code></a> (that whole process again), or build some custom solution with the <a href="https://docs.getdbt.com/reference/artifacts/manifest-json"><code>manifest.json</code></a>.</p>
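<p>For example, a bare-bones version of that kind of custom manifest solution might look like the sketch below. The <code>nodes</code> to <code>depends_on.nodes</code> shape matches recent manifest versions, but the sample manifest itself is made up, so check the schema of your own file:</p>

```python
import json

# Minimal sketch of a "custom solution" built on manifest.json: listing
# model-level dependencies. In a real project you would load the file, e.g.
# manifest = json.load(open("target/manifest.json")); the dict below is a
# made-up stand-in with the same nodes -> depends_on.nodes structure.
manifest = {
    "nodes": {
        "model.jaffle_shop.customers": {
            "depends_on": {"nodes": ["model.jaffle_shop.stg_customers"]}
        }
    }
}

# Collect (parent, child) edges from every node's declared dependencies.
edges = [
    (parent, name)
    for name, node in manifest["nodes"].items()
    for parent in node.get("depends_on", {}).get("nodes", [])
]
print(edges)  # [('model.jaffle_shop.stg_customers', 'model.jaffle_shop.customers')]
```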
<p>Moving to dbt Cloud (now called just <span>dbt platform</span>) made things smoother. It has a built-in <span>Studio IDE</span> with git integration that makes it easier to compile and preview models, an auto-updating lineage tab below the model, and much better documentation with dbt Explorer, now renamed to Catalog. And it offers a lot of other powerful features for orchestration, observability, CI/CD, and more.</p>
<p>The cloud-based <span>dbt platform</span> was a big step up, but even so, many of us still preferred to use our own dev environments. We like using our themes, our VS Code extensions, our terminals, but this would mean losing all the nice cloud features while developing. A sad trade-off.</p>
<p>We've already been to the <span>dbt platform</span> and back to the terminal, and some problems remain. Consider this all-too-common scenario when modifying a dbt model: forgetting a comma[1]. You don't learn about your mistake until dbt tries to run the model on your warehouse, and dbt can't do that until your cluster is turned on. So it's not until a full minute later that you get feedback about your missing punctuation mark.</p>
<p>[1]: because you are using trailing commas instead of leading commas, and they're harder to see, and I'm talking too much about the comma fight.</p>
<p>All this back-and-forth communication between dbt and the platform was slowing down your project.</p>
<p>That’s why this new release is such a big deal. It solves all the problems above and introduces other things I didn't know I needed until I saw it.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="the-new-era-of-dbt-development">The new era of dbt development<a href="https://docs.getdbt.com/blog/vscode-extension-experience#the-new-era-of-dbt-development" class="hash-link" aria-label="Direct link to The new era of dbt development" title="Direct link to The new era of dbt development">​</a></h2>
<p>With the acquisition of SDF Labs and a renewed focus on developer experience, dbt Labs announced its new engine, <a href="https://docs.getdbt.com/docs/fusion/about-fusion">Fusion</a>. This engine was built from scratch in Rust, and its intelligence will power up dbt no matter where you run it. There are different ways to use the Fusion engine, and the best one is the newly announced VS Code extension.</p>
<p>The Fusion engine with the VS Code extension is how folks will want to develop with dbt moving forward. I can say this feels like the experience we’ve all been waiting for.</p>
<p>After using it, it’s hard to imagine going back. Working with dbt in VS Code without this extension just doesn’t make sense anymore.</p>
<p>It comes with a lot of features to streamline your work and make you more efficient by developing faster and spending less. But let me tell you about my favorites:</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="catch-sql-errors-in-real-time">Catch SQL Errors in Real Time<a href="https://docs.getdbt.com/blog/vscode-extension-experience#catch-sql-errors-in-real-time" class="hash-link" aria-label="Direct link to Catch SQL Errors in Real Time" title="Direct link to Catch SQL Errors in Real Time">​</a></h3>
<p>There was no question which feature I was picking first. No more waiting for your platform to debug your code for you. If you misspell a column name or get a function's parameters in the wrong order, you catch those errors before you run anything.</p>
<p>This is because Fusion doesn't treat SQL code as just a string anymore; it really understands it. It also shows you some helpful information about the error.</p>
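<p>As a rough mental model only (the schema and column names below are invented, and Fusion's real analysis goes much deeper into types, functions, and dialects), the kind of pre-flight check this enables looks like:</p>

```python
# Toy version of a static column check: validate referenced column names
# against a known upstream schema before anything is sent to the warehouse.
# The "orders" schema and the misspelled column are made up for illustration.
upstream_schema = {"orders": {"order_id", "customer_id", "amount"}}

def unknown_columns(table: str, referenced: list[str]) -> list[str]:
    """Return referenced columns that don't exist in the table's schema."""
    return [col for col in referenced if col not in upstream_schema[table]]

# The typo is flagged locally, with no warehouse round trip.
print(unknown_columns("orders", ["order_id", "amuont"]))  # ['amuont']
```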
<div class="docImage_EYbW"><span><img alt="Showing function errors." title="Showing function errors." src="https://docs.getdbt.com/img/blog/2025-06-16-the-new-dbt-vscode-extension/vs_code_extension_function_error.png?v=2"></span><span class="title_aGrV">Showing function errors.</span></div>
<div class="docImage_EYbW"><span><img alt="Showing column name errors." title="Showing column name errors." src="https://docs.getdbt.com/img/blog/2025-06-16-the-new-dbt-vscode-extension/vs_code_extension_column_error.png?v=2"></span><span class="title_aGrV">Showing column name errors.</span></div>
<p>This is the greatest improvement of this engine, IMHO.</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="model-and-column-lineage">Model and Column Lineage<a href="https://docs.getdbt.com/blog/vscode-extension-experience#model-and-column-lineage" class="hash-link" aria-label="Direct link to Model and Column Lineage" title="Direct link to Model and Column Lineage">​</a></h3>
<p>My next favorite feature is the lineage view. If you were a <span>dbt</span> platform user, you would feel at home. And if you were using dbt Core, finally, no more generating <code>dbt docs</code> to visualize lineage.</p>
<p>Now there's a lineage tab that shows your project’s lineage directly in VS Code. It’s interactive and live. You can also use the lenses feature, which is pretty cool for visualizing your project by different attributes, like resource_type or materialization.</p>
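<p>To give a rough idea of where that lineage comes from (with made-up model names and SQL, and far simpler than dbt's real parser), every <code>ref()</code> in a model's SQL becomes an edge in the project DAG:</p>

```python
import re

# Rough sketch of model-level lineage extraction: each {{ ref('...') }} in a
# model's SQL is a dependency edge. Model names and SQL are invented, and
# dbt's actual parser does far more than this regex.
models = {
    "stg_orders": "select * from raw.orders",
    "orders": "select * from {{ ref('stg_orders') }}",
    "revenue": "select sum(amount) from {{ ref('orders') }}",
}

lineage = {name: re.findall(r"ref\('([^']+)'\)", sql) for name, sql in models.items()}
print(lineage)  # {'stg_orders': [], 'orders': ['stg_orders'], 'revenue': ['orders']}
```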
<div class="docImage_EYbW"><span><img alt="Project Lineage." title="Project Lineage." src="https://docs.getdbt.com/img/blog/2025-06-16-the-new-dbt-vscode-extension/vs_code_extension_project_lineage.png?v=2"></span><span class="title_aGrV">Project Lineage.</span></div>
<p>And something I was not expecting to be here, but thankfully it is, column-level lineage! Not just where columns come from, but also how they change: renamed, transformed, or passed through.</p>
<p>This is incredibly helpful for debugging transformations or understanding how that key metric is shaped across models.</p>
<div class="docImage_EYbW"><span><img alt="Column-level Lineage." title="Column-level Lineage." src="https://docs.getdbt.com/img/blog/2025-06-16-the-new-dbt-vscode-extension/vs_code_extension_cll.png?v=2"></span><span class="title_aGrV">Column-level Lineage.</span></div>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="instant-refactoring">Instant refactoring<a href="https://docs.getdbt.com/blog/vscode-extension-experience#instant-refactoring" class="hash-link" aria-label="Direct link to Instant refactoring" title="Direct link to Instant refactoring">​</a></h3>
<p>Ok, let me show you just one more thing! Have you ever wanted to rename a model or a column, but it's used in so many places downstream that you give up, because you don't want to refactor everything or you're afraid you'll break something?</p>
<p>Now, thanks to the deep dbt Fusion SQL understanding, you can rename your model or column, and the extension will refactor all downstream dependencies for you. But don't worry, before doing it, the extension allows you to see a preview of the changes, so you can be sure it is doing what you want.</p>
<video width="100%" height="100%" muted="" controls=""><source src="/img/blog/2025-06-16-the-new-dbt-vscode-extension/vs_code_extension_refactoring.webm" type="video/webm"></video>
<p>This extension brings many more features, like navigating through models instantly, autocompleting everything, renaming models or columns with a warning about how it will impact your project, and previewing models &amp; CTEs, all already covered in other blogs. And it just launched, so I believe we can expect more and more enhancements to come.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="conclusion-a-new-default">Conclusion: A New Default<a href="https://docs.getdbt.com/blog/vscode-extension-experience#conclusion-a-new-default" class="hash-link" aria-label="Direct link to Conclusion: A New Default" title="Direct link to Conclusion: A New Default">​</a></h2>
<p>This extension changes what using dbt feels like. It brings together performance, context, and interactivity in a way that finally makes dbt feel at home inside a modern developer environment. And the best part? It’s just getting started.</p>
<p>The Fusion engine is already powering a faster, smarter dbt under the hood. And it opens the door to a more fluid, confident, and intuitive development experience. Fewer context switches. Fewer gotchas. More time spent thinking about your data, not your tooling.</p>
<p>If you’ve ever built models in a text editor and wished dbt “just knew more,” this is for you.</p>
<p>If you’ve relied on the CLI but missed having true autocomplete, this is for you.</p>
<p>And if you’ve wanted the best of both worlds, the flexibility of Core with the power of Cloud, this might just become your new default. Even if you use dbt Cloud, it powers up local development with dbt Core to another level.</p>
<p>We’re incredibly excited to see how the community builds with this. Try it out. Push it. Share what’s working, and what’s missing.</p>
<p>This new extension will be constantly updated, so stay tuned for more improvements.</p>
<p><strong>This is the experience we’ve all been waiting for.</strong></p>
<p><em>Bruno is a lead Data Engineer at <a href="https://www.phdata.io/" target="_blank" rel="noopener noreferrer">phData</a>, and recently built a dbt learning platform called <a href="https://www.datagym.io/" target="_blank" rel="noopener noreferrer">DataGym.io</a>.</em></p>]]></content>
        <author>
            <name>Bruno Souza de Lima</name>
        </author>
        <category label="analytics craft" term="analytics craft"/>
        <category label="data ecosystem" term="data ecosystem"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[The Components of the dbt Fusion engine and how they fit together]]></title>
        <id>https://docs.getdbt.com/blog/dbt-fusion-engine-components</id>
        <link href="https://docs.getdbt.com/blog/dbt-fusion-engine-components"/>
        <updated>2025-05-28T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[The new engine makes it possible to decouple source code from functionality, introducing new ways to distribute functionality to the Community.]]></summary>
        <content type="html"><![CDATA[<p>Today, we announced the <a href="https://docs.getdbt.com/blog/dbt-fusion-engine">dbt Fusion engine</a>.</p>
<p>Fusion isn't just one thing — it's a set of interconnected components working together to power the next generation of analytics engineering.</p>
<p>This post maps out each piece of the Fusion architecture, explains how they fit together, and clarifies what's available to you whether you're compiling from source, using our pre-built binaries, or developing within a dbt Fusion powered product experience.</p>
<p>From the Rust engine to the VS Code extension, through to new Arrow-based adapters and Apache-licensed foundational technologies, we'll break down exactly what each component does, how each component is licensed (for why, see <a href="https://www.getdbt.com/blog/new-code-new-license-understanding-the-new-license-for-the-dbt-fusion-engine" target="_blank" rel="noopener noreferrer">Tristan's accompanying post</a>), and how you can start using it and get involved today.</p>
<p><em>This post describes the state of the world as it will be when Fusion reaches General Availability. For a look at the path to GA, read <a href="https://docs.getdbt.com/blog/dbt-fusion-engine-path-to-ga">this post</a>.</em></p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="ways-to-access">There are a number of different ways to access the dbt Fusion engine<a href="https://docs.getdbt.com/blog/dbt-fusion-engine-components#ways-to-access" class="hash-link" aria-label="Direct link to There are a number of different ways to access the dbt Fusion engine" title="Direct link to There are a number of different ways to access the dbt Fusion engine">​</a></h2>
<p>A big change between the dbt Fusion engine and the dbt Core engine is their language. Core is Python; Fusion is Rust. This is meaningful not just because of the performance benefits, but because it creates a new way for us to distribute functionality to the community.</p>
<p>To distribute a Python program, you also have to distribute its underlying source code. But Rust is a compiled language, meaning we can share either the source code or just the compiled binaries derived from that source code.</p>
<p>This means that features which would have otherwise had to stay completely proprietary for IP reasons can instead be broadly distributed in binary form. There's also a completely source-available version of dbt Fusion which will exceed dbt Core's capabilities by the time we reach GA.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="what-variants-of-the-dbt-fusion-engine-exist">What variants of the dbt Fusion engine exist?<a href="https://docs.getdbt.com/blog/dbt-fusion-engine-components#what-variants-of-the-dbt-fusion-engine-exist" class="hash-link" aria-label="Direct link to What variants of the dbt Fusion engine exist?" title="Direct link to What variants of the dbt Fusion engine exist?">​</a></h2>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="source-available-dbt-fusion-engine">Source-available dbt Fusion engine<a href="https://docs.getdbt.com/blog/dbt-fusion-engine-components#source-available-dbt-fusion-engine" class="hash-link" aria-label="Direct link to Source-available dbt Fusion engine" title="Direct link to Source-available dbt Fusion engine">​</a></h3>
<p>Artifact type: Code</p>
<p>Available at: <a href="https://github.com/dbt-labs/dbt-fusion" target="_blank" rel="noopener noreferrer">https://github.com/dbt-labs/dbt-fusion</a> (Note: this repo currently only contains the code necessary for a <code>dbt parse</code> and <code>dbt deps</code> - more will follow!)</p>
<p>License: ELv2</p>
<p>This will be the foundation of the Fusion engine - the code that lets you:</p>
<ul>
<li>Execute your <code>dbt seed/run/test/build</code></li>
<li>Render your Jinja and create your DAG</li>
<li>Connect to the adapters that render your dbt project into the DDL and DML that hits your warehouse</li>
<li>Produce the artifacts in your dbt project</li>
</ul>
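<p>Those capabilities are easiest to see in miniature. As an illustration only (this is not Fusion's Rust internals, and the model names are invented), here's how the dependency ordering at the heart of a dbt DAG can be sketched with Python's standard library:</p>

```python
from graphlib import TopologicalSorter

# Hypothetical models and the models they ref() - i.e. depend on.
deps = {
    "stg_orders": set(),
    "stg_customers": set(),
    "orders_enriched": {"stg_orders", "stg_customers"},
    "revenue_report": {"orders_enriched"},
}

# static_order() yields a build order in which every model comes
# after all of its dependencies - the essence of a dbt DAG run.
order = list(TopologicalSorter(deps).static_order())
print(order)
```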
<p>To be clear, the self-compiled binary that's available today doesn't do much yet. By the time the new engine enters general availability, its source-available components will <a href="https://docs.getdbt.com/blog/dbt-fusion-engine-path-to-ga">exceed the net capabilities of dbt Core</a>. <strong>If you are a data team running dbt Core, simply running the self-compiled version of dbt Fusion will be a pure upgrade.</strong></p>
<p>This repository will also include the code necessary for <a href="https://docs.getdbt.com/blog/the-levels-of-sql-comprehension#level-1-parsing">Level 1 SQL Comprehension</a> (the ability to parse SQL into a syntax tree).</p>
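<p>To make "parse SQL into a syntax tree" concrete: the first step of any Level 1 system is tokenization. This toy Python sketch (nothing like Fusion's actual parser) splits a statement into keyword, identifier, and punctuation tokens:</p>

```python
import re

# Toy lexer: a real SQL parser handles strings, comments, operators,
# and dialect quirks; this only recognizes three token kinds.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<kw>select|from|where)\b|(?P<ident>\w+)|(?P<punct>[*,.=]))",
    re.IGNORECASE,
)

def tokenize(sql: str):
    tokens = []
    pos = 0
    while pos < len(sql):
        m = TOKEN_RE.match(sql, pos)
        if not m:
            break
        kind = m.lastgroup  # which named group matched
        tokens.append((kind, m.group(kind)))
        pos = m.end()
    return tokens

toks = tokenize("select id, name from users")
print(toks)
```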
<p>As long as you comply with the <a href="http://www.getdbt.com/licenses-faq" target="_blank" rel="noopener noreferrer">three restrictions in ELv2</a>:</p>
<ul>
<li>✅&nbsp;You can adopt the binary into your data workflows without dbt Labs' involvement</li>
<li>✅&nbsp;You&nbsp;can see and modify the code</li>
</ul>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="precompiled-dbt-fusion-engine-binary">Precompiled dbt Fusion engine binary<a href="https://docs.getdbt.com/blog/dbt-fusion-engine-components#precompiled-dbt-fusion-engine-binary" class="hash-link" aria-label="Direct link to Precompiled dbt Fusion engine binary" title="Direct link to Precompiled dbt Fusion engine binary">​</a></h3>
<p>Artifact type: Precompiled binary</p>
<p>How to access: download following the instructions <a href="https://docs.getdbt.com/docs/local/install-dbt?version=2#get-started">here</a></p>
<p>License: ELv2</p>
<p>When you download the precompiled binary created by dbt Labs, it contains:</p>
<ul>
<li>
<p><strong>All of the functionality in the Source Available Fusion</strong></p>
</li>
<li>
<p>Additional capabilities which are derived from proprietary code (such as the <a href="https://docs.getdbt.com/blog/the-levels-of-sql-comprehension#level-2-compiling">Level 2 SQL Comprehension</a> required to compile and type-check your SQL).</p>
</li>
</ul>
<p>As long as you comply with the three restrictions in ELv2,</p>
<ul>
<li>✅&nbsp;You can&nbsp;adopt the binary into your data workflows without dbt Labs' involvement</li>
<li>❌&nbsp;But you cannot see or modify the code itself</li>
</ul>
<p><strong>The vast majority of existing dbt Core users that adopt the freely distributed components of Fusion should use the binary to do so, rather than compiling it from source code.</strong> The binary has the same permissions but more capabilities (and it saves you from having to compile it yourself). You can use it internally at your company for free, even if you are not a dbt Labs customer.</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="using-the-dbt-fusion-engine-with-a-commercial-agreement">Using the dbt Fusion engine with a commercial agreement<a href="https://docs.getdbt.com/blog/dbt-fusion-engine-components#using-the-dbt-fusion-engine-with-a-commercial-agreement" class="hash-link" aria-label="Direct link to Using the dbt Fusion engine with a commercial agreement" title="Direct link to Using the dbt Fusion engine with a commercial agreement">​</a></h3>
<p>Artifact type: Precompiled binary and managed service</p>
<p>Available at: <a href="https://docs.getdbt.com/docs/local/install-dbt?version=2#get-started">Download binary</a> and <a href="http://getdbt.com/signup" target="_blank" rel="noopener noreferrer">sign up for the service</a></p>
<p>License: ELv2 (binary) and Proprietary (service)</p>
<p>Organizations who <em>do</em> have a commercial agreement will unlock even more capabilities, but they'll use the exact same publicly-released binary discussed above. If you want to start using platform features, <a href="https://docs.getdbt.com/docs/mesh/govern/project-dependencies" target="_blank" rel="noopener noreferrer">such as dbt Mesh</a>, all you need to do is <a href="https://docs.getdbt.com/docs/cloud/configure-cloud-cli#configure-the-dbt-cloud-cli" target="_blank" rel="noopener noreferrer">download a configuration file</a>. <em>(Joel commentary - As someone who has been juggling the dbt Cloud CLI alongside dbt Core for the last couple of years, I cannot overstate how thrilled I am by this.)</em></p>
<p>Obviously there are additional cloud-backed services necessary to deliver platform-specific features, such as State-Aware Orchestration. That code is proprietary and governed by your agreement with dbt Labs.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="other-pieces-of-the-puzzle">Other pieces of the puzzle<a href="https://docs.getdbt.com/blog/dbt-fusion-engine-components#other-pieces-of-the-puzzle" class="hash-link" aria-label="Direct link to Other pieces of the puzzle" title="Direct link to Other pieces of the puzzle">​</a></h2>
<p>The dbt Fusion engine is the headline act, but its underlying technologies can be mixed and matched in a variety of ways.</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="the-dbt-vs-code-extension-and-language-server">The dbt VS Code Extension and Language Server<a href="https://docs.getdbt.com/blog/dbt-fusion-engine-components#the-dbt-vs-code-extension-and-language-server" class="hash-link" aria-label="Direct link to The dbt VS Code Extension and Language Server" title="Direct link to The dbt VS Code Extension and Language Server">​</a></h3>
<p>Artifact type: Precompiled binaries</p>
<p>How to access: <a href="https://marketplace.visualstudio.com/items?itemName=dbtLabsInc.dbt" target="_blank" rel="noopener noreferrer">Install on the VS Code marketplace</a></p>
<p>License: Proprietary</p>
<p>The dbt VS Code extension is one of the first product experiences built on top of Fusion. It is not <em>part</em> of Fusion, it is <em>powered</em> by Fusion and is part of the wider dbt platform's offerings (with a generous free tier). Specifically, the VS Code extension interacts with another brand-new binary, the dbt <a href="https://microsoft.github.io/language-server-protocol/" target="_blank" rel="noopener noreferrer">Language Server</a>.</p>
<p>The Language Server is built on top of a subset of the technology powering the extended Fusion engine: as an example, it can quickly compile SQL and interact with databases, but it defers to the dbt binary when it's time to actually run a model.</p>
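<p>For the curious, the Language Server Protocol is simple at the wire level: JSON-RPC payloads framed by a <code>Content-Length</code> header. A minimal Python sketch of that framing (the request shown is a generic LSP <code>initialize</code>, not a dbt-specific message):</p>

```python
import json

def frame(payload: dict) -> bytes:
    """Frame a JSON-RPC payload the way LSP clients and servers exchange them."""
    body = json.dumps(payload).encode("utf-8")
    return b"Content-Length: " + str(len(body)).encode() + b"\r\n\r\n" + body

msg = frame({"jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {}})
print(msg.decode())
```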
<div class="docImage_EYbW"><span><img alt="The VS Code extension interacts with the Language Server to understand your SQL, and the Fusion binary to execute your SQL." title="The VS Code extension interacts with the Language Server to understand your SQL, and the Fusion binary to execute your SQL." src="https://docs.getdbt.com/img/blog/2025-05-28-dbt-fusion-engine-components/vscode-ext-binary-roles.png?v=2"></span><span class="title_aGrV">The VS Code extension interacts with the Language Server to understand your SQL, and the Fusion binary to execute your SQL.</span></div>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="the-dbt-authoring-layer">The dbt Authoring Layer<a href="https://docs.getdbt.com/blog/dbt-fusion-engine-components#the-dbt-authoring-layer" class="hash-link" aria-label="Direct link to The dbt Authoring Layer" title="Direct link to The dbt Authoring Layer">​</a></h3>
<p>Artifact type: JSON Schema definitions</p>
<p>Available at: Git repos for <a href="https://github.com/dbt-labs/dbt-jsonschema" target="_blank" rel="noopener noreferrer">input files</a> and <a href="https://github.com/dbt-labs/schemas.getdbt.com" target="_blank" rel="noopener noreferrer">output artifacts</a></p>
<p>License: Apache 2.0</p>
<p>When you think of dbt, you're probably thinking of a combination of the Engine (described above) and the Authoring Layer.</p>
<p>The Authoring Layer is made up of everything necessary to define the <em>what</em> of a dbt project: things like the <strong>YAML specs, Artifact specs, CLI commands and flags</strong>, and <strong>macro signatures</strong>. As the user interface to dbt, the authoring layer is standard between Core and Fusion, although the Fusion engine does not include support for various behaviours and functions deprecated in earlier releases of dbt Core.</p>
<p>For the first time, we're releasing a series of definitive JSON schemas, <em>backed by the code in dbt Core and Fusion</em>, that encapsulate the acceptable content of dbt's various YAML files. These are Apache 2.0-licensed and will be particularly helpful for other tools integrating with dbt projects.</p>
<p>This joins the existing JSON schemas defining the shape of dbt's output artifacts (e.g. <code>manifest.json</code>). As we stabilize Fusion's metadata output (logging and artifacts) on the path to GA, we will update the published schemas.</p>
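<p>For a sense of what schema-backed validation buys you, here is a deliberately tiny, hand-rolled checker in Python. The real schemas are full JSON Schema documents consumed by editors and language servers; the keys below are an invented fragment of a model's properties:</p>

```python
# Toy "schema" mapping keys to expected types.
# The real dbt JSON schemas are far richer than this.
SCHEMA = {"name": str, "description": str, "config": dict}

def validate(node: dict) -> list:
    """Return a list of problems; an empty list means the node passes."""
    errors = []
    for key, expected in SCHEMA.items():
        if key not in node:
            errors.append("missing key: " + key)
        elif not isinstance(node[key], expected):
            errors.append(key + ": expected " + expected.__name__)
    return errors

model = {"name": "orders", "description": "All orders", "config": {"materialized": "table"}}
print(validate(model))
```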
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="dbt-fusion-engine-adapters">dbt Fusion engine adapters<a href="https://docs.getdbt.com/blog/dbt-fusion-engine-components#dbt-fusion-engine-adapters" class="hash-link" aria-label="Direct link to dbt Fusion engine adapters" title="Direct link to dbt Fusion engine adapters">​</a></h3>
<p>Artifact type: Source code</p>
<p>Available at: Initial code in <a href="https://github.com/dbt-labs/dbt-fusion" target="_blank" rel="noopener noreferrer"><code>dbt-fusion</code> repo</a>, with more to come</p>
<p>License: Apache 2.0 (later this year)</p>
<p>Adapters are responsible for two key tasks:</p>
<ul>
<li>Knowing how to create the appropriate SQL commands (via macros and materializations) for a data platform</li>
<li>Connecting to that target data platform and sending it SQL commands</li>
</ul>
<p>Much like Fusion is the next generation engine for dbt, we also needed next-generation <em>adapters</em> for dbt. These adapters are written in Rust and built on the Apache Arrow standard.</p>
<p>The templating of SQL commands largely carries over from macros in the dbt Core adapters. Database connectivity is another story: the dbt Fusion engine cannot use the Python classes present in each adapter, for reasons both practical and performance-related.</p>
<p>Enter the Apache Arrow ecosystem at large, and the new <a href="https://arrow.apache.org/adbc/current/index.html" target="_blank" rel="noopener noreferrer">ADBC API</a> in particular. ADBC is a future-looking platform for database connectivity, and we are leaning into it heavily with these Fusion adapters.</p>
<p>Because the ADBC standard is extremely new, not all databases are compatible with ADBC yet, and using ADBC in a Rust client isn't easy. To solve both problems, we have created a Rust client library, <code>XDBC</code>, that:</p>
<ul>
<li>Supports ODBC connections to databases where Arrow is not yet provided as an output</li>
<li>Provides generic methods for creating and managing connections to databases</li>
<li>Is useful for anyone who wants to build data tooling in Rust, inside or outside of the dbt ecosystem</li>
</ul>
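<p>If you haven't used ADBC before, the mental model is a database → connection → statement lifecycle that returns columnar (Arrow) results. The real drivers ship as separate packages, so this Python sketch uses the standard library's <code>sqlite3</code> purely to show the connect/execute/fetch shape an engine depends on:</p>

```python
import sqlite3

# Stand-in for a driver connection; an ADBC driver would hand back Arrow
# record batches instead of Python tuples, but the lifecycle looks similar.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE models (name TEXT, materialized TEXT)")
conn.execute("INSERT INTO models VALUES ('orders', 'table')")
rows = conn.execute("SELECT name, materialized FROM models").fetchall()
print(rows)
conn.close()
```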
<p>All of this will be open-sourced under the Apache 2.0 license later this year:</p>
<ul>
<li>Fusion adapters we have created</li>
<li>The XDBC library</li>
<li>We'll also continue upstreaming improvements to Apache Arrow's ADBC project</li>
</ul>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="antlr-grammars">ANTLR Grammars<a href="https://docs.getdbt.com/blog/dbt-fusion-engine-components#antlr-grammars" class="hash-link" aria-label="Direct link to ANTLR Grammars" title="Direct link to ANTLR Grammars">​</a></h3>
<p>Artifact type: g4 files</p>
<p>Available at: (repo to come, in the meantime you can discuss this in #dbt-fusion-engine in the dbt Slack)</p>
<p>License: Apache 2.0 (later this year)</p>
<p><a href="https://www.antlr.org/" target="_blank" rel="noopener noreferrer">ANTLR</a> grammars are the formal language specifications that let Fusion <a href="https://docs.getdbt.com/blog/sql-comprehension-technologies">parse</a> every SQL statement across multiple dialects. Specifically, ANTLR takes in these declarative, high-level grammars and uses them to generate a parser. The grammars have wide utility anywhere it's necessary to parse SQL – not just in Fusion – and we're releasing them under Apache 2.0 to enable the Community and others in the data ecosystem to build on top of them.</p>
<p>Most ANTLR grammars are only applicable to a single dialect, but the SDF team created a system which makes it possible to define a shared base grammar and generate each warehouse's g4 file from there. This halves the amount of work required to support a new dialect at the level of precision and robustness required.</p>
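<p>As a loose illustration of that generation step (the rule names and dialect quirks below are invented, and real g4 grammars are vastly more involved), you can think of it as a shared base rule set merged with per-dialect overrides:</p>

```python
# Invented base rules shared by all dialects.
BASE_RULES = {"IDENTIFIER": "[A-Za-z_][A-Za-z_0-9]*"}

# Invented per-dialect overrides, e.g. different quoted-identifier syntax.
DIALECT_OVERRIDES = {
    "snowflake": {"QUOTED_IDENTIFIER": "DQUOTE ~DQUOTE* DQUOTE"},
    "bigquery": {"QUOTED_IDENTIFIER": "BACKTICK ~BACKTICK* BACKTICK"},
}

def rules_for(dialect: str) -> dict:
    """Merge the shared base with one dialect's overrides."""
    merged = dict(BASE_RULES)
    merged.update(DIALECT_OVERRIDES.get(dialect, {}))
    return merged

print(sorted(rules_for("bigquery")))
```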
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="dbt-jinja">dbt-jinja<a href="https://docs.getdbt.com/blog/dbt-fusion-engine-components#dbt-jinja" class="hash-link" aria-label="Direct link to dbt-jinja" title="Direct link to dbt-jinja">​</a></h3>
<p>Artifact type: Source code</p>
<p>Available at: <a href="https://github.com/dbt-labs/dbt-fusion/tree/main/dbt-jinja" target="_blank" rel="noopener noreferrer">A subdirectory of the dbt-fusion repo</a> (but there's still work to do before it's easy to use outside of the Fusion repository)</p>
<p>License: Apache 2.0</p>
<p>Since Fusion is completely Rust-based, while Jinja is a Python project, we needed a completely new way to render all the Jinja spread through users' projects. We started by switching to <a href="https://github.com/mitsuhiko/minijinja" target="_blank" rel="noopener noreferrer">minijinja</a>: a Rust port of a subset of the original Jinja project, written by Jinja's original maintainer.</p>
<p>This subset of coverage wasn't enough to support existing dbt projects, so we created Rust-native implementations of the majority of these missing features. This achieved the best of both worlds: significant performance improvements while maintaining compatibility with users' existing codebases.</p>
<p>dbt-jinja is the most feature-complete implementation of Jinja in Rust, and is available with an Apache 2.0 license today, with a more formal release (documentation etc) later this year. It's useful whether you're building tooling to operate on top of dbt projects, or working on something completely different which just needs to render Jinja quickly.</p>
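<p>If you just want a feel for what a Jinja renderer does (this toy is neither minijinja nor dbt-jinja, and handles only bare <code>{{ variable }}</code> substitution), the core idea fits in a few lines of Python:</p>

```python
import re

def render(template: str, ctx: dict) -> str:
    """Replace each {{ name }} placeholder with its value from ctx."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: str(ctx[m.group(1)]), template)

sql = render("select * from {{ schema }}.orders", {"schema": "analytics"})
print(sql)
```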
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="how-do-i-engage-with-these-components">How do I engage with these components?<a href="https://docs.getdbt.com/blog/dbt-fusion-engine-components#how-do-i-engage-with-these-components" class="hash-link" aria-label="Direct link to How do I engage with these components?" title="Direct link to How do I engage with these components?">​</a></h2>
<p>Our <a href="https://docs.getdbt.com/community/resources/contributor-expectations">Contributors' Principles</a> remain: Building dbt is a team sport!</p>
<ul>
<li>If you want to open a PR against publicly-viewable code, you can.</li>
<li>If you want to open issues describing bugs during the Fusion engine's beta period, you can. (This is probably one of the highest-leverage things you can do!)</li>
<li>If you want to open a discussion and pitch a new way to use dbt more effectively in our new SQL-aware world, you can.</li>
<li>If you want to move upstream, and contribute to the standards underlying the dbt Fusion engine like Arrow, ADBC, Iceberg, or DataFusion, you can. You might see some familiar faces while you're there!</li>
<li>If you just want to let dbt get better and better in the background, you can do that too.</li>
<li>Want to get involved in the team building this? If the components here are uniquely interesting to you, email <a href="mailto:careers.fusion@dbtlabs.com" target="_blank" rel="noopener noreferrer">careers.fusion@dbtlabs.com</a>.</li>
</ul>
<p>If you need a hand wrapping your head around any of these new components, drop by #dbt-fusion-engine in the Community Slack - we'd love to chat.</p>]]></content>
        <author>
            <name>Jason Ganz</name>
        </author>
        <author>
            <name>Joel Labes</name>
        </author>
        <category label="analytics craft" term="analytics craft"/>
        <category label="data ecosystem" term="data ecosystem"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Path to GA: How the dbt Fusion engine rolls out from beta to production]]></title>
        <id>https://docs.getdbt.com/blog/dbt-fusion-engine-path-to-ga</id>
        <link href="https://docs.getdbt.com/blog/dbt-fusion-engine-path-to-ga"/>
        <updated>2025-05-28T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[We're moving quickly to enable as many teams as possible to start using the new dbt Fusion engine. Check out our roadmap and learn how to follow our progress.]]></summary>
        <content type="html"><![CDATA[<p>Today, we announced that the dbt Fusion engine is <a href="https://getdbt.com/blog/get-to-know-the-new-dbt-fusion-engine-and-vs-code-extension" target="_blank" rel="noopener noreferrer">available in beta</a>.</p>
<ul>
<li>If Fusion works with your project today, great! You're in for a treat 😄</li>
<li>If it's your first day using dbt, welcome! You should start on Fusion — you're in for a treat too.</li>
</ul>
<p>Today is Launch Day —&nbsp;the first day of a new era: the Age of Fusion. We expect many teams with existing projects will encounter at least one issue that will prevent them from adopting the dbt Fusion engine in production environments. That's ok!</p>
<p>We're moving quickly to unblock more teams, and we are committing that by the time Fusion reaches General Availability:</p>
<ul>
<li>We will support Snowflake, Databricks, BigQuery, Redshift&nbsp;—&nbsp;and likely also Athena, Postgres, Spark, and Trino — with the new <a href="https://docs.getdbt.com/blog/dbt-fusion-engine-components#dbt-fusion-engine-adapters">Fusion Adapter pattern</a>.</li>
<li>We will have coverage for (basically) all dbt Core functionality. Some things are impractical to replicate outside of Python, or so seldom-used that we'll be more reactive than proactive. On the other hand, many existing dbt Core behaviours will be improved by the unique capabilities of the dbt Fusion engine, such as speed and SQL comprehension. You'll see us talk about this in relevant GitHub issues, many of which we've linked below.</li>
<li>The source-available <code>dbt-fusion</code> repository will contain more total functionality than what is available in dbt Core today. (<a href="https://docs.getdbt.com/blog/dbt-fusion-engine-components#ways-to-access">Read more about this here</a>.)</li>
<li>The developer experience will be even speedier and more intuitive.</li>
</ul>
<p>These statements aren't true yet —&nbsp;but you can see where we're headed. That's what betas are for, that's the journey we're going on together, and that's why we want to have you all involved.</p>
<p><strong>We will be adding functionality rapidly over the coming weeks.</strong> In particular, keep an eye out for Databricks, BigQuery and Redshift support (in that order) in the coming weeks.</p>
<p>The most popular dbt Labs packages (<code>dbt_utils</code>, <code>audit_helper</code>, <code>dbt_external_tables</code>, <code>dbt_project_evaluator</code>) are already compatible with Fusion. Some external packages may not work out of the box, but we plan to work with package maintainers to get them ready &amp; working on Fusion.</p>
<p>So when is Fusion going to be GA? We're targeting later this year for full feature parity, but we're also hoping to approach it asymptotically&nbsp;—&nbsp;meaning that many existing dbt users can start adopting Fusion much sooner.</p>
<p>During the beta period, you may run into unanticipated (and anticipated) issues when trying to run your project on Fusion. Please share any issues in the <a href="https://github.com/dbt-labs/dbt-fusion" target="_blank" rel="noopener noreferrer">dbt-fusion</a> repository or on Slack in <a href="https://getdbt.slack.com/archives/C088YCAB6GH" target="_blank" rel="noopener noreferrer">#dbt-fusion-engine</a>, and we'll do our best to unblock you.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="can-i-use-fusion-for-my-dbt-project-today">Can I use Fusion for my dbt project today?<a href="https://docs.getdbt.com/blog/dbt-fusion-engine-path-to-ga#can-i-use-fusion-for-my-dbt-project-today" class="hash-link" aria-label="Direct link to Can I use Fusion for my dbt project today?" title="Direct link to Can I use Fusion for my dbt project today?">​</a></h2>
<p>Maybe! The biggest first question: "Is your adapter supported yet?" (If not, sit tight, we're working fast!) If so, then it depends on the exact matrix of features you currently use in your dbt project.</p>
<p>You may be able to start using Fusion immediately, you may need to make (mostly automatic) modifications to your project to resolve deprecations, or your project may not <em>yet</em> be parsable at all:</p>
<div class="filterableTableContainer_mhtg"><table><thead><tr><th>State</th><th>Description</th><th>Workaround</th><th>Resolvable by</th></tr></thead><tbody><tr><td>Unblocked</td><td>You can adopt the dbt Fusion engine with no changes to your project</td><td></td><td></td></tr><tr><td>Soft blocked</td><td>Your project parses successfully but relies on not-yet-implemented functionality</td><td>Don't invoke unsupported functions or build unsupported models</td><td>dbt Labs</td></tr><tr><td>Hard blocked by deprecations</td><td>Your project contains <a href="https://www.getdbt.com/blog/how-to-get-ready-for-the-new-dbt-engine" target="_blank" rel="noopener noreferrer">functionality deprecated in dbt Core v1.10</a></td><td>Resolve deprecations with the <a href="https://github.com/dbt-labs/dbt-autofix" target="_blank" rel="noopener noreferrer">dbt-autofix script</a> or workflow in dbt Studio</td><td>You</td></tr><tr><td>Hard blocked by known parse issues</td><td>Your project contains Python models or uses a not-yet-supported adapter</td><td>Temporarily remove Python models</td><td>dbt Labs</td></tr><tr><td>Hard blocked by unknown parse issues</td><td>Your project is probably doing something surprising with Jinja</td><td>Create an issue, consider modifying impacted code</td><td>You &amp; dbt Labs</td></tr></tbody></table></div>
<p>We're continuously removing blockers to Fusion adoption on a rolling basis during this beta period and in the leadup to a broader release. The rest of this post will go deeper into the four thematic criteria we set out above:</p>
<ul>
<li>Adapter coverage</li>
<li>Feature coverage</li>
<li>Source-available code publishing</li>
<li>Developer experience improvements</li>
</ul>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="requirement-for-ga-adapter-coverage">Requirement for GA: Adapter Coverage<a href="https://docs.getdbt.com/blog/dbt-fusion-engine-path-to-ga#requirement-for-ga-adapter-coverage" class="hash-link" aria-label="Direct link to Requirement for GA: Adapter Coverage" title="Direct link to Requirement for GA: Adapter Coverage">​</a></h2>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="databricks-bigquery-and-redshift">Databricks, BigQuery and Redshift<a href="https://docs.getdbt.com/blog/dbt-fusion-engine-path-to-ga#databricks-bigquery-and-redshift" class="hash-link" aria-label="Direct link to Databricks, BigQuery and Redshift" title="Direct link to Databricks, BigQuery and Redshift">​</a></h3>
<p>dbt Fusion's adapters are now based on the <a href="https://arrow.apache.org/adbc/current/driver/status.html" target="_blank" rel="noopener noreferrer">ADBC standard</a>, a modern, high-performance Apache project optimised for columnar analytical databases.</p>
<p>dbt Labs has developed new ADBC-compatible drivers (and a <a href="https://docs.getdbt.com/blog/dbt-fusion-engine-components#dbt-fusion-engine-adapters">supporting framework, XDBC</a>) to complement the existing, stable Snowflake driver.</p>
<p><strong>Target release dates:</strong> We expect to add support for <a href="https://github.com/dbt-labs/dbt-fusion/issues/4" target="_blank" rel="noopener noreferrer">Databricks</a>, <a href="https://github.com/dbt-labs/dbt-fusion/issues/5" target="_blank" rel="noopener noreferrer">BigQuery</a>, and <a href="https://github.com/dbt-labs/dbt-fusion/issues/6" target="_blank" rel="noopener noreferrer">Redshift</a> (in that order) in the coming weeks.</p>
<p>Where possible, Fusion adapters will support the same authentication methods and connection/credential configurations as dbt Core adapters. We've also heard loud &amp; clear feedback from dbt platform customers who have beta-tested the Fusion CLI —&nbsp;we want to figure out a way for Fusion CLI to use connection setup (config/creds) from the platform for local runs (<a href="https://github.com/dbt-labs/dbt-fusion/issues/23" target="_blank" rel="noopener noreferrer">tracking issue</a>).</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="athena-postgres-spark-and-trino">Athena, Postgres, Spark and Trino<a href="https://docs.getdbt.com/blog/dbt-fusion-engine-path-to-ga#athena-postgres-spark-and-trino" class="hash-link" aria-label="Direct link to Athena, Postgres, Spark and Trino" title="Direct link to Athena, Postgres, Spark and Trino">​</a></h3>
<p>We're aiming to support these adapters later in the year, prior to GA. Check each adapter's tracking issue (<a href="https://github.com/dbt-labs/dbt-fusion/issues/39" target="_blank" rel="noopener noreferrer">Trino</a>, <a href="https://github.com/dbt-labs/dbt-fusion/issues/39" target="_blank" rel="noopener noreferrer">Athena</a>, <a href="https://github.com/dbt-labs/dbt-fusion/issues/38" target="_blank" rel="noopener noreferrer">Spark</a>, and <a href="https://github.com/dbt-labs/dbt-fusion/issues/31" target="_blank" rel="noopener noreferrer">Postgres</a>) for specific timelines.</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="custom-adapters">Custom adapters<a href="https://docs.getdbt.com/blog/dbt-fusion-engine-path-to-ga#custom-adapters" class="hash-link" aria-label="Direct link to Custom adapters" title="Direct link to Custom adapters">​</a></h3>
<p>The short answer: Fusion's new adapter format could be extended to support community development of third-party adapters, but it's not on the near-term roadmap before GA (<a href="https://github.com/dbt-labs/dbt-fusion/issues/46" target="_blank" rel="noopener noreferrer">tracking issue</a>).</p>
<p>The longer answer: Fusion now downloads necessary drivers (part of the adapter stack) on-demand. This dynamic linking requires the drivers to be signed by dbt Labs, meaning that we need to have a system in place to review contributions of new drivers and ensure their security.</p>
<p>In the meantime, if you want to migrate a supported project to the dbt Fusion engine but have a dependency on another project using a custom adapter, you can use a <a href="https://docs.getdbt.com/docs/deploy/hybrid-setup">Hybrid project</a> to have <span>dbt Core</span> execute the unsupported part of the pipeline and then publish artifacts for downstream projects to consume.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="requirement-for-ga-feature-coverage">Requirement for GA: Feature coverage<a href="https://docs.getdbt.com/blog/dbt-fusion-engine-path-to-ga#requirement-for-ga-feature-coverage" class="hash-link" aria-label="Direct link to Requirement for GA: Feature coverage" title="Direct link to Requirement for GA: Feature coverage">​</a></h2>
<p>Feature coverage includes ensuring documented features work as expected, as well as (where possible) supporting undocumented "accidental" features.</p>
<p>Most of the time, even if your project uses an unimplemented feature, you can still take Fusion for a spin. This is because as long as your project parses, you can just skip unsupported models.</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="known-unimplemented-features">Known unimplemented features<a href="https://docs.getdbt.com/blog/dbt-fusion-engine-path-to-ga#known-unimplemented-features" class="hash-link" aria-label="Direct link to Known unimplemented features" title="Direct link to Known unimplemented features">​</a></h3>
<h4 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="python-models">Python models<a href="https://docs.getdbt.com/blog/dbt-fusion-engine-path-to-ga#python-models" class="hash-link" aria-label="Direct link to Python models" title="Direct link to Python models">​</a></h4>
<p>Python models are the one exception to that "just skip them" advice. The dbt Fusion engine does not currently support parsing Python models, which means it cannot extract the refs or configs inside those files. Instead of potentially building models out of DAG order, <strong>we've chosen to not support Python models at all for now</strong>. They're coming back though - <a href="https://github.com/dbt-labs/dbt-fusion/issues/3" target="_blank" rel="noopener noreferrer">check out the issue</a> for details.</p>
<h4 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="breadth-of-materialization-support">Breadth of Materialization Support<a href="https://docs.getdbt.com/blog/dbt-fusion-engine-path-to-ga#breadth-of-materialization-support" class="hash-link" aria-label="Direct link to Breadth of Materialization Support" title="Direct link to Breadth of Materialization Support">​</a></h4>
<p>As of today we support the most common materializations: <code>table</code>, <code>view</code>, <code>incremental</code>, <code>ephemeral</code> for models —&nbsp;plus the materializations underlying snapshots, seeds, and tests. Other native strategies (like <a href="https://github.com/dbt-labs/dbt-fusion/issues/12" target="_blank" rel="noopener noreferrer">microbatch incremental models</a>, <a href="https://github.com/dbt-labs/dbt-fusion/issues/28" target="_blank" rel="noopener noreferrer">iceberg tables</a>, <a href="https://github.com/dbt-labs/dbt-fusion/issues/27" target="_blank" rel="noopener noreferrer">materialized views/dynamic tables</a>, or <a href="https://github.com/dbt-labs/dbt-fusion/issues/15" target="_blank" rel="noopener noreferrer">stored test failures</a>) as well as <a href="https://github.com/dbt-labs/dbt-fusion/issues/17" target="_blank" rel="noopener noreferrer">custom materializations</a> are on the roadmap — check their respective issues to see when.</p>
<p>It's worth reiterating here: Even if you have models that rely on not-yet-supported materialization strategies, you can still try the dbt Fusion engine in the rest of your project. The rest of your DAG will build as normal, but unsupported strategies will raise an error if they are included in scope of <code>dbt build</code> or <code>dbt run</code>.</p>
<p>To exclude those nodes, use a command like</p>
<ul>
<li><code>dbt build --exclude config.materialized:my_custom_mat</code></li>
<li><code>dbt build --exclude config.incremental_strategy:microbatch</code></li>
</ul>
<h4 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="other-common-features">Other common features<a href="https://docs.getdbt.com/blog/dbt-fusion-engine-path-to-ga#other-common-features" class="hash-link" aria-label="Direct link to Other common features" title="Direct link to Other common features">​</a></h4>
<p>Did you know that there are over 400 documented features of dbt? <a href="https://github.com/dbeatty10" target="_blank" rel="noopener noreferrer">Doug</a> does, because he had to put them all into a Notion database.</p>
<p>Fusion already supports two-thirds of them, and we have a plan for the rest. You can follow along at <a href="https://github.com/dbt-labs/dbt-fusion/issues" target="_blank" rel="noopener noreferrer">the <code>dbt-fusion</code> repo</a>, where there are issues to track the outstanding behaviours. There's also a rough set of milestones attached, but those are subject to reordering as more teams start using Fusion and giving feedback.</p>
<p>Some of the most relevant ones include:</p>
<ul>
<li><a href="https://github.com/dbt-labs/dbt-fusion/issues/13" target="_blank" rel="noopener noreferrer">Exposures</a></li>
<li>A new <a href="https://github.com/dbt-labs/dbt-fusion/issues/7" target="_blank" rel="noopener noreferrer">stable logging system</a></li>
<li>A new <a href="https://github.com/dbt-labs/dbt-fusion/issues/9" target="_blank" rel="noopener noreferrer">local documentation experience</a> that replaces dbt-docs (!)</li>
<li><a href="https://github.com/dbt-labs/dbt-fusion/issues/10" target="_blank" rel="noopener noreferrer">Programmatic invocations</a></li>
<li><a href="https://github.com/dbt-labs/dbt-fusion/issues/25" target="_blank" rel="noopener noreferrer">Model governance</a> (contracts, constraints, access, deprecation_date)</li>
<li>A grab bag of CLI commands like <a href="https://github.com/dbt-labs/dbt-fusion/issues/22" target="_blank" rel="noopener noreferrer"><code>dbt clone</code></a>, <a href="https://github.com/dbt-labs/dbt-fusion/issues/33" target="_blank" rel="noopener noreferrer"><code>state:modified.subselector</code></a>, <a href="https://github.com/dbt-labs/dbt-fusion/issues/34" target="_blank" rel="noopener noreferrer"><code>--empty</code></a>, ...</li>
</ul>
<p>It's worth noting that <em>resolution</em> doesn't necessarily mean identical behaviours. As a couple of examples:</p>
<ul>
<li>Many of these behaviours have not been implemented yet because the Fusion engine introduces new capabilities, above all SQL comprehension, that we will leverage to provide a superior experience. A direct port-over of the feature would miss the point.</li>
<li>Others (like the events and logging system) are tightly coupled to dbt Core's Python roots — they're worth a rethink, and not worth shooting for exact 100% conformance.</li>
</ul>
<p>Here's a point-in-time snapshot of how we expect to tackle the known remaining work. Please refer to the <a href="https://github.com/dbt-labs/dbt-fusion/issues" target="_blank" rel="noopener noreferrer">repository's issues page</a> as the source of truth:</p>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:100%"><span><a href="https://docs.getdbt.com/blog/dbt-fusion-engine-path-to-ga#" data-featherlight="/img/blog/2025-05-28-dbt-fusion-engine-path-to-ga/indicative-timeline.png"><img data-toggle="lightbox" alt="An indication of the dbt Fusion engine's path to GA" title="An indication of the dbt Fusion engine's path to GA" src="https://docs.getdbt.com/img/blog/2025-05-28-dbt-fusion-engine-path-to-ga/indicative-timeline.png?v=2"></a></span><span class="title_aGrV">An indication of the dbt Fusion engine's path to GA</span></div>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="surprise-unimplemented-features">Surprise unimplemented features<a href="https://docs.getdbt.com/blog/dbt-fusion-engine-path-to-ga#surprise-unimplemented-features" class="hash-link" aria-label="Direct link to Surprise unimplemented features" title="Direct link to Surprise unimplemented features">​</a></h3>
<p>Did you know that there are also over a bajillion <em>undocumented</em> features of dbt? Since March, we've been validating the new engine's parser against projects orchestrated by the dbt platform, which has flagged hundreds of divergent behaviours and common parse bugs.</p>
<p>But we also know there is a long tail of behaviours that will only arise in the wild, and that the easiest way to get to the bottom of them will be to work with users.</p>
<p>This work will be ongoing, alongside feature support. When you start using the Fusion engine, please <a href="https://github.com/dbt-labs/dbt-fusion/issues" target="_blank" rel="noopener noreferrer">open an issue</a> if you hit an unexpected error — and please include a basic project that reproduces the error, so we can fix it!</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="requirement-for-ga-the-source-available-dbt-fusion-codebase-is-better-than-dbt-core-for-most-use-cases">Requirement for GA: The Source-available <code>dbt-fusion</code> codebase is better than <code>dbt-core</code> for most use cases<a href="https://docs.getdbt.com/blog/dbt-fusion-engine-path-to-ga#requirement-for-ga-the-source-available-dbt-fusion-codebase-is-better-than-dbt-core-for-most-use-cases" class="hash-link" aria-label="Direct link to requirement-for-ga-the-source-available-dbt-fusion-codebase-is-better-than-dbt-core-for-most-use-cases" title="Direct link to requirement-for-ga-the-source-available-dbt-fusion-codebase-is-better-than-dbt-core-for-most-use-cases">​</a></h2>
<p>By GA, the <a href="https://github.com/dbt-labs/dbt-fusion" target="_blank" rel="noopener noreferrer"><code>dbt-fusion</code> repository</a> will have the necessary (and fully source-available) components to compile a functional engine for the vast majority of dbt Core projects —&nbsp;and a faster one at that. That means that you will always have the ability to compile, use, and modify this code itself, without requiring access to the dbt Labs provided binary (although we think you'll probably just want to use the binary, for reasons detailed in the <a href="https://docs.getdbt.com/blog/dbt-fusion-engine-components">Components of the dbt Fusion engine</a> post).</p>
<p>So far, we've released the code necessary to self-compile a dbt binary that can run <code>dbt deps</code> and <code>dbt parse</code>. Throughout the beta period we will continue to prepare more code for use by those who want to view, contribute to, or modify the code for their own purposes, including what's necessary for the rest of the commands to work.</p>
<p>Beyond just the code necessary to produce a complete dbt binary, we've also committed to open-sourcing several of the underlying library components (such as dbt-jinja, dbt-serde-yaml, and the grammars necessary to produce a high-performance SQL parser). Again, check out the <a href="https://docs.getdbt.com/blog/dbt-fusion-engine-components">Components of the dbt Fusion engine</a> post for the details.</p>
<p>Some behaviours that worked in dbt Core won't have an equivalent in this new codebase. The most obvious examples are those which depended on the vagaries of Python: arbitrary callbacks on the EventManager (there's no longer an EventManager on which to register a callback!), the experimental <a href="https://github.com/dbt-labs/dbt-core/blob/main/core/dbt/plugins/manager.py" target="_blank" rel="noopener noreferrer">plugins system</a> (dynamic loading of binaries works completely differently in Rust and would require signing), or the dbt templater in SQLFluff (which hooked into dbt Core beyond the exposed interfaces - although we plan to build a <a href="https://github.com/dbt-labs/dbt-fusion/issues/11" target="_blank" rel="noopener noreferrer">fast linter ourselves</a>).</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="requirement-for-ga-the-dx-rocks">Requirement for GA: The DX rocks<a href="https://docs.getdbt.com/blog/dbt-fusion-engine-path-to-ga#requirement-for-ga-the-dx-rocks" class="hash-link" aria-label="Direct link to Requirement for GA: The DX rocks" title="Direct link to Requirement for GA: The DX rocks">​</a></h2>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="more-speed">More speed<a href="https://docs.getdbt.com/blog/dbt-fusion-engine-path-to-ga#more-speed" class="hash-link" aria-label="Direct link to More speed" title="Direct link to More speed">​</a></h3>
<p>Invocations powered by the dbt Fusion engine are already significantly faster than the same invocation in dbt Core, but there's more to do here! We know that there is still a lot of low-hanging fruit, and by GA we expect to see tasks like full project compilation complete at least twice as fast for many projects.</p>
<p>If you do some benchmarking, we're particularly interested in any situations where Fusion "pauses" on a single file for a couple of seconds. Some other things to keep in mind:</p>
<ul>
<li>Writing very large manifests is pretty slow, no matter what. Try including <code>--no-write-json</code>. We're wondering whether it makes sense to have a trimmed-down manifest by default. What do you think?</li>
<li>The <code>dbt compile</code> command involves more work in Fusion than in dbt Core, because it's doing full SQL validation. To compare <em>just</em> the SQL rendering step (the equivalent of dbt Core's <code>compile</code> command), you can try <a href="https://docs.getdbt.com/docs/fusion/new-concepts">turning off static analysis</a> with the CLI flag <code>--static-analysis off</code>.</li>
</ul>
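<p>As a rough sketch of how you might compare these locally (assuming a Fusion-powered <code>dbt</code> binary on your PATH and a project directory; <code>time</code> is the standard Unix utility):</p>
<pre><code># Full compilation, skipping the (potentially slow) manifest write
time dbt compile --no-write-json

# Just the SQL rendering step — closest to dbt Core's compile —
# by turning off Fusion's static analysis
time dbt compile --static-analysis off
</code></pre>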
<p>As a sign of what's possible, take note of the incremental recompilation used to provide real-time feedback in the VS Code extension.</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="a-more-info-dense-console-output">A more info-dense console output<a href="https://docs.getdbt.com/blog/dbt-fusion-engine-path-to-ga#a-more-info-dense-console-output" class="hash-link" aria-label="Direct link to A more info-dense console output" title="Direct link to A more info-dense console output">​</a></h3>
<p>While we were preparing for the beta release, we kept the Fusion CLI output intentionally verbose — it displays <em>everything</em> that's happening, which means errors and warnings can be pushed out of view by other status updates. We're already in the process of <a href="https://github.com/dbt-labs/dbt-fusion/issues/52" target="_blank" rel="noopener noreferrer">clearing this up a bit</a>, and we've got some funny ideas about the possibility of progress bars. However we do it, the goal should be that you see the log lines about things that need attention, and not much more.</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="your-idea-here">Your idea here<a href="https://docs.getdbt.com/blog/dbt-fusion-engine-path-to-ga#your-idea-here" class="hash-link" aria-label="Direct link to Your idea here" title="Direct link to Your idea here">​</a></h3>
<p>What feels <em>off</em> when you're using dbt Fusion? Tell us all about it — if you've got a clear idea for what's wrong and what it should be instead, feel free to jump straight to a GitHub issue. Bonus points if you've got a minimal repro project.</p>
<p>If you need to kick an idea around before opening an issue, we'll also be actively checking in on #dbt-fusion-engine (for high-level discussions) and #dbt-fusion-engine-migration (to get into the weeds of a specific bug) on Slack.</p>
<p>From now until Fusion is GA, we will be prioritizing parity with existing framework features, <em>not adding new ones.</em> Once we hit GA, we'll think about whether to transfer existing feature requests from the <code>dbt-core</code> repo to <code>dbt-fusion</code> — or maybe a third place? — stay tuned.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="following-along">Following along<a href="https://docs.getdbt.com/blog/dbt-fusion-engine-path-to-ga#following-along" class="hash-link" aria-label="Direct link to Following along" title="Direct link to Following along">​</a></h2>
<p>The path to GA for Fusion is a Community-wide effort. We want to hear from you, work with you, and get your ideas and feedback — whether that's a bug report, a feature idea, or higher-level thoughts on where Fusion should go.</p>
<ul>
<li>In Slack, we're on <a href="https://getdbt.slack.com/archives/C088YCAB6GH" target="_blank" rel="noopener noreferrer">#dbt-fusion-engine</a> and #dbt-fusion-engine-migration</li>
<li>The GitHub repo is <a href="https://github.com/dbt-labs/dbt-fusion" target="_blank" rel="noopener noreferrer">https://github.com/dbt-labs/dbt-fusion</a></li>
<li>There are a couple of dozen <em>dbt World Circuit</em> meetups happening globally during June: <a href="https://www.meetup.com/pro/dbt/" target="_blank" rel="noopener noreferrer">https://www.meetup.com/pro/dbt/</a>. (Jeremy will be speaking in Paris, Marseille, and Boston —&nbsp;come hang out!)</li>
<li>We'll be having regular office hours with a revolving cast of characters from the Developer Experience, Engineering, and Product teams. Dates will be circulated in the #dbt-fusion-engine channel.</li>
</ul>]]></content>
        <author>
            <name>Jeremy Cohen</name>
        </author>
        <author>
            <name>Joel Labes</name>
        </author>
        <category label="data ecosystem" term="data ecosystem"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Meet the dbt Fusion Engine: the new Rust-based, industrial-grade engine for dbt]]></title>
        <id>https://docs.getdbt.com/blog/dbt-fusion-engine</id>
        <link href="https://docs.getdbt.com/blog/dbt-fusion-engine"/>
        <updated>2025-05-28T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[The dbt Fusion engine delivers a next-gen developer experience by combining high-speed execution with deep understanding of your code.]]></summary>
        <content type="html"><![CDATA[<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="tldr-what-you-need-to-know">TL;DR: What You Need to Know<a href="https://docs.getdbt.com/blog/dbt-fusion-engine#tldr-what-you-need-to-know" class="hash-link" aria-label="Direct link to TL;DR: What You Need to Know" title="Direct link to TL;DR: What You Need to Know">​</a></h2>
<ul>
<li>dbt’s familiar authoring layer remains unchanged, but the execution engine beneath it is completely new.</li>
<li>The new engine is called the dbt Fusion engine — rewritten from the ground up in Rust based on technology <a href="https://www.getdbt.com/blog/dbt-labs-acquires-sdf-labs" target="_blank" rel="noopener noreferrer">from SDF</a>.  The dbt Fusion engine is substantially faster than dbt Core and has built in <a href="https://docs.getdbt.com/blog/the-levels-of-sql-comprehension">SQL comprehension technology</a> to power the next generation of analytics engineering workflows.</li>
<li>The dbt Fusion engine is currently in beta. You can try it today if you use Snowflake — with additional adapters coming starting in early June. Review our <a href="https://docs.getdbt.com/blog/dbt-fusion-engine-path-to-ga">path to general availability</a> (GA) and <a href="https://docs.getdbt.com/guides/fusion">try the quickstart</a>.</li>
<li><strong>You do not need to be a dbt Labs customer to use Fusion - dbt Core users can adopt the dbt Fusion engine today for free in your local environment.</strong></li>
<li>You can use Fusion with the <a href="https://marketplace.visualstudio.com/items?itemName=dbtLabsInc.dbt" target="_blank" rel="noopener noreferrer">new dbt VS Code extension</a>, <a href="https://docs.getdbt.com/docs/local/install-dbt?version=2#get-started">directly via the CLI</a>, or <a href="https://docs.getdbt.com/docs/dbt-versions/upgrade-dbt-version-in-cloud#dbt-fusion-engine">via dbt Studio</a>.</li>
<li>This is the beginning of a new era for analytics engineering. For a glimpse into what the Fusion engine is going to enable over the next 1 to 2 years, <a href="https://getdbt.com/blog/where-we-re-headed-with-the-dbt-fusion-engine" target="_blank" rel="noopener noreferrer">read this post</a>.</li>
</ul>
<p>Since its introduction in 2016, dbt has paved the way for the analytics engineering revolution. Teams worldwide have moved from ad hoc processes running customized SQL scripts into a mature analytics workflow based on the <a href="https://docs.getdbt.com/community/resources/viewpoint" target="_blank" rel="noopener noreferrer">dbt viewpoint</a>. dbt enables data practitioners to <em>work like software engineers</em>, building their analytics code as an asset to ship trusted data products faster.</p>
<p>dbt came to represent many things:</p>
<ul>
<li>A <strong>viewpoint</strong> on how analytics should be done</li>
<li>A <strong>workflow</strong> where data practitioners could put that viewpoint into action</li>
<li>A <strong>framework</strong> — dbt Core — that powered this workflow comprised of:<!-- -->
<ul>
<li>An authoring layer: The schema, spec, and definitions for a dbt project written in SQL, YML, and Jinja</li>
<li>An engine: The tooling via which the authoring layer was built and executed against a data platform, resolving templated code into executable SQL, building your dependency graph, and more.</li>
</ul>
</li>
</ul>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        "><span><a href="https://docs.getdbt.com/blog/dbt-fusion-engine#" data-featherlight="/img/blog/2025-05-28-dbt-fusion-engine/engine-and-authoring-layer.png"><img data-toggle="lightbox" alt="dbt is made up of two different things: authoring layer and engine." title="dbt is made up of two different things: authoring layer and engine." src="https://docs.getdbt.com/img/blog/2025-05-28-dbt-fusion-engine/engine-and-authoring-layer.png?v=2"></a></span><span class="title_aGrV">dbt is made up of two different things: authoring layer and engine.</span></div>
<p>While the authoring layer has continued to evolve nicely, giving dbt developers ever-more functionality to work with, the engine itself, dbt Core, is still built on the same technology and the same design principles it started with in 2016. This causes two fundamental problems that cannot be solved iteratively:</p>
<ol>
<li>dbt Core can be <em>slow</em>. It’s built in Python, and for larger dbt projects it can become unworkable. Even for smaller projects, powering a great developer experience requires a step change in performance.</li>
<li>The dbt engine renders SQL, but it doesn’t <em>comprehend SQL.</em> That means that any functionality relying on specifics of SQL code was impossible to build into dbt.</li>
</ol>
<p>And so it became clear that for us to power the analytics workloads of tomorrow, we weren't going to get there with incremental improvements —&nbsp;we needed to <strong>rebuild the dbt engine from scratch</strong>. We needed:</p>
<ul>
<li>An engine built for speed.</li>
<li>An engine that <em>knows about your code.</em></li>
<li>An engine that powers the next generation of developer experience.</li>
</ul>
<p>And that engine is Fusion.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="what-exactly-is-fusion">What exactly is Fusion?<a href="https://docs.getdbt.com/blog/dbt-fusion-engine#what-exactly-is-fusion" class="hash-link" aria-label="Direct link to What exactly is Fusion?" title="Direct link to What exactly is Fusion?">​</a></h2>
<p>Fusion is the new engine for dbt.</p>
<p>If the authoring layer is "what" your dbt project is supposed to do, then the engine is the "how." That includes:</p>
<ul>
<li>Rendering Jinja</li>
<li>Building dependency graphs</li>
<li>Creating artifact files</li>
<li>Communicating with databases</li>
</ul>
<p>At first glance, Fusion looks a lot like dbt Core. Your projects are built using the familiar dbt authoring layer. You still write SQL and Jinja. You still type <code>dbt run</code>. (To make it easier to try Fusion, we're also shipping an optional <code>dbtf</code> alias, since many users already have the <code>dbt</code> namespace in use).</p>
<p>But underneath that is a layer of technical depth and rigor that is entirely new to dbt, happening at the engine layer.</p>
<p>Fusion:</p>
<ul>
<li>Is fully rewritten in Rust, enabling a <a href="https://docs.getdbt.com/blog/faster-project-parsing-with-rust">dramatically faster dbt experience</a>. Fusion does not depend on Python at all. In fact, besides the adapter macros, not a single line of code is shared between dbt Core and the dbt Fusion engine. (For long-time dbt spelunkers, we've described the new structure in a <a href="https://docs.getdbt.com/blog/dbt-fusion-engine-components">separate post</a>.)</li>
<li><a href="https://docs.getdbt.com/blog/the-levels-of-sql-comprehension">Understands your SQL code.</a> It’s a true SQL <em>compiler</em> and gives dbt a full view on what the code in your dbt project means and how it will propagate across your entire data lineage.</li>
</ul>
<p>Based on the technology from <a href="https://www.getdbt.com/blog/dbt-labs-acquires-sdf-labs" target="_blank" rel="noopener noreferrer">SDF</a>, Fusion represents a step change increase in the technical capabilities of dbt.</p>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        "><span><a href="https://docs.getdbt.com/blog/dbt-fusion-engine#" data-featherlight="/img/blog/2025-05-28-dbt-fusion-engine/familiar-authoring-powerful-new-engine.png"><img data-toggle="lightbox" alt="Familiar Authoring Layer, Powerful New Engine." title="Familiar Authoring Layer, Powerful New Engine." src="https://docs.getdbt.com/img/blog/2025-05-28-dbt-fusion-engine/familiar-authoring-powerful-new-engine.png?v=2"></a></span><span class="title_aGrV">Familiar Authoring Layer, Powerful New Engine.</span></div>
<p>As a result of these capabilities, Fusion can deliver new experiences. Some of these we’re releasing today, like real-time error detection in VS Code and significant cost savings in project execution.  dbt now knows about your code!</p>
<p><strong>You probably know enough now to head on over to the quickstart and get going</strong>, but if you want to know a little more about what Fusion delivers today, keep reading.</p>
<hr>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="near-term-benefits-of-adopting-fusion">Near-term benefits of adopting Fusion<a href="https://docs.getdbt.com/blog/dbt-fusion-engine#near-term-benefits-of-adopting-fusion" class="hash-link" aria-label="Direct link to Near-term benefits of adopting Fusion" title="Direct link to Near-term benefits of adopting Fusion">​</a></h2>
<p>You can think of Fusion as the same dbt you know and love, but better and faster, and you're going to see it show up in a lot of places!</p>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        "><span><a href="https://docs.getdbt.com/blog/dbt-fusion-engine#" data-featherlight="/img/blog/2025-05-28-dbt-fusion-engine/next-gen-star.png"><img data-toggle="lightbox" alt="Functionality powered by the dbt Fusion Engine and its components" title="Functionality powered by the dbt Fusion Engine and its components" src="https://docs.getdbt.com/img/blog/2025-05-28-dbt-fusion-engine/next-gen-star.png?v=2"></a></span><span class="title_aGrV">Functionality powered by the dbt Fusion Engine and its components</span></div>
<p>So how and why should you adopt Fusion for your dbt project?</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="just-the-new-fusion-powered-dbt-cli">Just the new Fusion-powered dbt CLI<a href="https://docs.getdbt.com/blog/dbt-fusion-engine#just-the-new-fusion-powered-dbt-cli" class="hash-link" aria-label="Direct link to Just the new Fusion-powered dbt CLI" title="Direct link to Just the new Fusion-powered dbt CLI">​</a></h3>
<ul>
<li><strong>Significant performance improvements:</strong> Up to 30x faster parsing and 2x quicker full-project compilation, with near-instant recompilation of single files in the VS Code Extension. We expect continued performance gains as part of the path to GA.</li>
</ul>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="the-new-fusion-powered-dbt-fusion-cli--vs-code-extension">The new Fusion-powered dbt Fusion CLI + VS Code extension<a href="https://docs.getdbt.com/blog/dbt-fusion-engine#the-new-fusion-powered-dbt-fusion-cli--vs-code-extension" class="hash-link" aria-label="Direct link to The new Fusion-powered dbt Fusion CLI + VS Code extension" title="Direct link to The new Fusion-powered dbt Fusion CLI + VS Code extension">​</a></h3>
<p>But the real benefit of Fusion is not just going to be in the CLI itself — it’s in the ability to build net new product experiences that leverage Fusion’s capabilities. The first of these, unveiled today, is the VS Code extension, powered by <a href="https://docs.getdbt.com/blog/the-levels-of-sql-comprehension">dbt Fusion’s SQL Comprehension</a>. This extension could <em>only</em> be built on Fusion:</p>
<ul>
<li>It’s fast — the VS Code extension recompiles your entire dbt project in the background every time you save <em>any</em> file, as well as identifying errors instantly for the active file. For that to be workable, it needs to happen fast.</li>
<li>It understands SQL and functions as a compiler — it knows what columns exist in your project, which functions you are using, and the type signatures and outputs of those functions.</li>
</ul>
<p>There’s a whole host of features in the VS Code extension. Some early favorites:</p>
<ul>
<li>
<p><strong>Write code with confidence — live error detection and function autocomplete.</strong></p>
<ul>
<li>
<p>How many times have you hit <code>dbt run</code> only to realize that you typed <code>select * frmo</code>, misspelled a column name, or tried to sum the unsummable? No more! With the LSP-powered VS Code extension, you can immediately see when pesky errors sneak into your code.</p>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        "><span><a href="https://docs.getdbt.com/blog/dbt-fusion-engine#" data-featherlight="/img/blog/2025-05-28-dbt-fusion-engine/you-wouldnt-sum-a-datetime.png"><img data-toggle="lightbox" alt="You wouldn't sum a datetime." title="You wouldn't sum a datetime." src="https://docs.getdbt.com/img/blog/2025-05-28-dbt-fusion-engine/you-wouldnt-sum-a-datetime.png?v=2"></a></span><span class="title_aGrV">You wouldn't sum a datetime.</span></div>
</li>
<li>
<p>Similarly — is it <code>dateadd</code> or <code>date_add</code>? And which way around do the arguments go again? Just start typing and you'll see contextual prompts and autocomplete.</p>
</li>
</ul>
</li>
<li>
<p><strong>See how the code you’ve written iteratively progresses to your transformed data:</strong> <em>previewing CTEs and viewing compiled code</em></p>
<ul>
<li>Because the VS Code extension compiles your code every time you save, you can view the compiled code from your project in real time as you’re making edits. This is a real lifesaver when working on complex macros.</li>
<li>Writing your code with CTEs allows you to modularly split up the logic in your model. The days of swapping out the <code>final</code> CTE at the end for the name of the CTE you're debugging are over; now you can just click.</li>
</ul>
</li>
<li>
<p><strong>Traverse your project:</strong> Go-to-reference and built-in lineage</p>
<ul>
<li>Need to find out how an upstream model was defined? Or where all the inputs from the model you’re working on came from? With both the ability to jump to the model and column references <em>and</em> view model and column level lineage, it’s honestly a night and day difference.</li>
</ul>
</li>
</ul>
<video width="100%" height="100%" muted="" controls=""><source src="/img/docs/extension/go-to-definition.webm" type="video/webm"></video>
<p>I could go on and on and on — there’s so much here.</p>
<p>Taken separately, these range from quality-of-life improvements to significant changes.</p>
<p>But taken together, it actually fundamentally changes the experience of writing your dbt code. There were just <em>so many things</em> that you had to constantly be juggling in the back of your head that are now offloaded to the extension. The sum change to the experience of writing dbt code... is exceptional. I already can’t imagine working without this.</p>
<p>Of course — there’s another technology changing the experience of writing dbt (and all) code — AI. The functionality that Fusion enables dovetails perfectly with AI-assisted coding by allowing you to vet, validate, and comprehend AI-generated code more easily. Moving forward, expect even tighter coupling between Fusion and AI-based coding assistants as the speed and rigor of Fusion will help produce higher quality AI-generated code.</p>
<p>The VS Code extension is one of our first product experiences exclusively powered by the dbt Fusion engine. The extension depends on the Language Server, and the Language Server depends on Fusion's SQL comprehension capabilities. We made the decision not to support dbt Core for the VS Code Extension because existing community-built extensions have already built as much as is possible on top of dbt Core's foundation.  To get to this next level of experience, we needed Fusion.</p>
<hr>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="how-to-get-started-with-fusion">How to get started with Fusion<a href="https://docs.getdbt.com/blog/dbt-fusion-engine#how-to-get-started-with-fusion" class="hash-link" aria-label="Direct link to How to get started with Fusion" title="Direct link to How to get started with Fusion">​</a></h3>
<p>The dbt Fusion engine is currently in beta. We've written <a href="https://docs.getdbt.com/blog/dbt-fusion-engine-path-to-ga">a separate post</a> describing the path to Fusion's final release, and how you can see if your project is compatible today.</p>
<p>Whether or not you can move your existing project to Fusion today, you can jump into the VS Code extension <a href="https://docs.getdbt.com/guides/fusion">using our quickstart</a> to get a feeling for what's ahead.</p>
<ul>
<li><strong>dbt customers:</strong> Over the coming weeks, in projects eligible to start using Fusion, you’ll see a toggle in your account or receive a message from your account team. From there, <a href="https://docs.getdbt.com/docs/dbt-versions/upgrade-dbt-version-in-cloud#dbt-fusion-engine">you can activate Fusion for your environments</a>.</li>
<li><strong>To use the VS Code extension:</strong> <a href="https://docs.getdbt.com/docs/install-dbt-extension">Install the "dbt" extension</a> directly from the marketplace for automated setup and head to the quickstart. This will also automatically install the Fusion-powered CLI for you.</li>
<li><strong>To use the dbt CLI powered by Fusion:</strong> Simply <a href="https://docs.getdbt.com/docs/local/install-dbt?version=2#get-started">install Fusion</a>.</li>
</ul>
<p><em>If you are looking to migrate an existing project to Fusion, see the <a href="https://docs.getdbt.com/docs/dbt-versions/core-upgrade/upgrading-to-fusion">migration guide</a> —&nbsp;as well as the <a href="https://github.com/dbt-labs/dbt-autofix" target="_blank" rel="noopener noreferrer"><code>dbt-autofix</code></a> helper, which automatically addresses many of the changes needed to migrate to Fusion.</em></p>
<hr>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="whats-next">What's Next?<a href="https://docs.getdbt.com/blog/dbt-fusion-engine#whats-next" class="hash-link" aria-label="Direct link to What's Next?" title="Direct link to What's Next?">​</a></h2>
<p>Today’s launch is just the start. There is much left to do in both the short term and the long term.</p>
<p>Moving forward, we’re building many net-new products and evolutions of our current products that simply wouldn’t have been possible in a pre-Fusion world. This will be particularly impactful for powering AI workflows, both assisting in the creation of high-quality dbt projects and serving as the trusted interface to structured data for AI agents.</p>
<p>We’re excited to work with the Community on the evolution of Fusion. If you’ve heard talk about the early days of the dbt Community and wished you could have been around for it, you now have the opportunity to make the deep, foundational impact that is often only possible at the start of a new technical innovation cycle.</p>
<p>So get involved!</p>
<ul>
<li>Try out <a href="https://docs.getdbt.com/guides/fusion">the Fusion quickstart</a></li>
<li><a href="https://github.com/dbt-labs/dbt-fusion/issues" target="_blank" rel="noopener noreferrer">Open up a GitHub issue in <code>dbt-fusion</code></a> to report a bug or participate in the path to GA</li>
<li>Join us <a href="https://www.getdbt.com/community/join-the-community" target="_blank" rel="noopener noreferrer">on Slack</a> in #dbt-fusion-engine and share your thoughts or questions</li>
<li>Head to an <a href="https://www.meetup.com/pro/dbt/" target="_blank" rel="noopener noreferrer">in-person dbt Meetup</a> — we’re hosting the dbt World Circuit 🏎️&nbsp;around the world, where you can come talk to one of us about Fusion!</li>
</ul>]]></content>
        <author>
            <name>Jason Ganz</name>
        </author>
        <category label="analytics craft" term="analytics craft"/>
        <category label="data ecosystem" term="data ecosystem"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[AI Evaluation in dbt]]></title>
        <id>https://docs.getdbt.com/blog/ai-eval-in-dbt</id>
        <link href="https://docs.getdbt.com/blog/ai-eval-in-dbt"/>
        <updated>2025-05-04T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[How to extend dbt quality testing to monitor AI Agentic Quality]]></summary>
        <content type="html"><![CDATA[<p><strong>The AI revolution is here—but are we ready?</strong><br>
<!-- -->Across the world, the excitement around AI is undeniable.  Discussions on large language models, agentic workflows, and how AI is set to transform every industry abound, yet real-world use cases of AI in production remain few and far between.</p>
<p>A common issue blocking teams from moving AI use cases to production is the inability to evaluate the validity of AI responses in a systematic, well-governed way.
Moving AI workflows from prototype to production requires rigorous evaluation, and most organizations do not have a framework to ensure AI outputs remain high-quality, trustworthy, and actionable.</p>
<p><strong>Why AI Evaluation Matters</strong><br>
<!-- -->The more conversations we have with data teams, the clearer the problem becomes: companies don’t want to move AI into production unless they can monitor and ensure its quality once it’s there. The fear of a ‘rogue AI’ still outweighs the perceived benefits.</p>
<p>The core challenge isn’t just building AI use cases; it’s about continuously monitoring their performance and ensuring the same level of quality and reliability we’ve come to expect from other data assets.
To trust AI in production, we need structured workflows that:</p>
<ul>
<li><strong>Ensure data quality</strong> before it’s fed into AI models</li>
<li><strong>Evaluate AI-generated responses</strong> against responses known to be true</li>
<li><strong>Trigger alerts or corrective actions</strong> when AI performance drifts below acceptable thresholds</li>
</ul>
<p>Without these capabilities, AI workflows remain stuck in experimental phases, unable to meet the reliability requirements of production use cases.</p>
<p><strong>Using dbt to Build AI Evaluation Workflows</strong><br>
<!-- -->Most organizations already use dbt to transform, test, and validate their data.
Since dbt is already a trusted framework for data quality, it is natural to extend its testing capabilities to evaluate and monitor AI workflows as well.</p>
<p>Let’s walk through a simple example using <strong>dbt and Snowflake Cortex</strong> for AI evaluation.</p>
<ul>
<li><strong>Ingest Data:</strong> We start by uploading a dataset of IMDB movie reviews, along with human-labeled sentiment scores (positive or negative). This serves as our source of truth.</li>
<li><strong>Run AI Workflow:</strong> As a simple example workflow, we use Snowflake Cortex’s sentiment analysis function to classify each review.</li>
<li><strong>Evaluate AI Output versus Human Review:</strong> We create an evaluation model in dbt that uses the Cortex Complete function to compare the AI-generated sentiment to the actual human-labeled sentiment.</li>
<li><strong>Define Pass/Fail Criteria:</strong> We configure a custom dbt test to set an accuracy threshold (e.g., 75% accuracy). If AI sentiment predictions fall below this level, the test triggers a warning or error.</li>
<li><strong>Store and Visualize Results:</strong> Native dbt functionality can store test failures in the warehouse, providing traceability for further investigation and data for reporting on AI accuracy.</li>
</ul>
<p><strong>Scaling AI Evaluation with dbt</strong><br>
<!-- -->This workflow naturally extends dbt’s native testing capabilities and leverages the ability to embed Snowflake Cortex calls directly in SQL models.
In this way, users can combine the power of Snowflake Cortex with dbt’s established governance and quality framework to address the issues described above.</p>
<p>By using dbt to evaluate AI, organizations can apply the same rigorous testing principles they already use for data pipelines to ensure their AI models are production-ready and maintain quality and governance of all data assets centrally.</p>
<p><strong>What We Built</strong><br>
<!-- -->Let's walk through this example step by step to give you a sense of how it all works.
For this example, we start with a test dataset containing the input to our AI workflow, as well as a ground-truth measurement given by a human reviewer. In this example, our input is the text review of different movies, and <code>actual_sentiment</code> contains -1 for negative reviews and 1 for positive reviews.
Finally, we include a timestamp indicating when our AI provided the response, which will allow us to track our AI accuracy over time.</p>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:85%"><span><a href="https://docs.getdbt.com/blog/ai-eval-in-dbt#" data-featherlight="/img/blog/2025-04-04-ai-evaluation-and-how-dbt-can-help/ai_eval_blog_image_one.png"><img data-toggle="lightbox" alt="our input data set, including actual sentiment" title="our input data set, including actual sentiment" src="https://docs.getdbt.com/img/blog/2025-04-04-ai-evaluation-and-how-dbt-can-help/ai_eval_blog_image_one.png?v=2"></a></span><span class="title_aGrV">our input data set, including actual sentiment</span></div>
<p>The next step is to create another output table containing both the true measurement from our dataset and the value returned by our AI.
Since we can embed the Snowflake Cortex call directly in a SQL model, we can easily build this in dbt with a simple <code>ref</code> function.</p>
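<p>As a rough sketch of how such a model can look (the model and column names here are illustrative assumptions, not the exact project code):</p>

```sql
-- models/ai_sentiment.sql -- illustrative sketch, not the exact project code
select
    review_id,
    review_text,                                              -- the input fed to the AI workflow
    actual_sentiment,                                         -- human label: -1 (negative) or 1 (positive)
    snowflake.cortex.sentiment(review_text) as ai_sentiment,  -- Cortex score in [-1, 1]
    current_timestamp() as responded_at                       -- lets us track accuracy over time
from {{ ref('imdb_reviews') }}
```

Because the Cortex call is just SQL, this model runs, tests, and documents like any other dbt model.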
<div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:85%"><span><a href="https://docs.getdbt.com/blog/ai-eval-in-dbt#" data-featherlight="/img/blog/2025-04-04-ai-evaluation-and-how-dbt-can-help/ai_eval_blog_image_two.png"><img data-toggle="lightbox" alt="results of our agentic workflow" title="results of our agentic workflow" src="https://docs.getdbt.com/img/blog/2025-04-04-ai-evaluation-and-how-dbt-can-help/ai_eval_blog_image_two.png?v=2"></a></span><span class="title_aGrV">results of our agentic workflow</span></div>
<p>We also include the input to our AI workflow, along with both the AI-calculated and human-determined measurements for the dataset.
Including all these data points, while not strictly necessary, makes it clear what was fed into the AI workflow and provides easy traceability for specific responses.
We follow this same pattern again, using a dbt <code>ref</code> function to create one last dbt model in which we build the evaluation prompt, pass it to Cortex via the Cortex Complete function, and store the results.
The lion's share of the work in building this model was the prompt engineering for the evaluation prompt. We initially built the prompt directly in Snowflake Cortex to ensure it returned the type of response needed before moving the prompt into dbt.</p>
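<p>A minimal sketch of that evaluation model might look like the following. The prompt text, Cortex model name, and all object names below are illustrative assumptions, and the real evaluation prompt required considerably more engineering:</p>

```sql
-- models/ai_sentiment_eval.sql -- illustrative sketch
{% set eval_prompt = "Reply PASS if the predicted sentiment matches the actual sentiment, otherwise reply FAIL." %}

select
    review_id,
    actual_sentiment,
    ai_sentiment,
    '{{ eval_prompt }}' as evaluation_prompt,  -- materialized for full traceability
    snowflake.cortex.complete(
        'llama3.1-8b',                         -- swap models by changing this parameter
        '{{ eval_prompt }}'
            || ' Actual: ' || actual_sentiment
            || ' Predicted: ' || ai_sentiment
    ) as evaluation_result
from {{ ref('ai_sentiment') }}
```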
<div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:85%"><span><a href="https://docs.getdbt.com/blog/ai-eval-in-dbt#" data-featherlight="/img/blog/2025-04-04-ai-evaluation-and-how-dbt-can-help/ai_eval_blog_image_three.png"><img data-toggle="lightbox" alt="AI generated results automatically evaluated by one or more models" title="AI generated results automatically evaluated by one or more models" src="https://docs.getdbt.com/img/blog/2025-04-04-ai-evaluation-and-how-dbt-can-help/ai_eval_blog_image_three.png?v=2"></a></span><span class="title_aGrV">AI generated results automatically evaluated by one or more models</span></div>
<p>We chose to define our prompt as a Jinja variable rather than listing it directly in each dbt model.
This improves model readability, but it hides the prompt text from anyone reading the model.
To address this issue and provide full traceability, we materialize the prompt as a column in this table, so each output row contains not only the evaluation score but also the exact prompt used to produce it.
Regardless of where you define your evaluation prompt, including it in your dbt project means it benefits from the same change management and version control processes as the rest of your project, ensuring strong governance of your AI workflows.
Another benefit of this approach, and of the flexibility provided by dbt and Snowflake Cortex, is that you can easily swap the model used to run the evaluation. In this example we use a Llama model hosted in Snowflake Cortex, but using any other <a href="https://docs.snowflake.com/en/sql-reference/functions/complete-snowflake-cortex" target="_blank" rel="noopener noreferrer">supported model</a> is as easy as changing a function parameter.
You can even run multiple evaluations using different models by simply adding additional columns to your dbt model.</p>
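<p>For example, evaluating with two Cortex models side by side is just two columns. The model names and the upstream model here are hypothetical:</p>

```sql
-- Illustrative sketch: run the same evaluation prompt through two different models
select
    review_id,
    snowflake.cortex.complete('llama3.1-8b', evaluation_prompt)    as eval_llama,
    snowflake.cortex.complete('mistral-large2', evaluation_prompt) as eval_mistral
from {{ ref('ai_eval_prompts') }}  -- hypothetical model holding one fully built prompt per row
```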
<div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:85%"><span><a href="https://docs.getdbt.com/blog/ai-eval-in-dbt#" data-featherlight="/img/blog/2025-04-04-ai-evaluation-and-how-dbt-can-help/ai_eval_blog_image_four.png"><img data-toggle="lightbox" alt="dbt Testing evaluates AI accuracy alongside data quality" title="dbt Testing evaluates AI accuracy alongside data quality" src="https://docs.getdbt.com/img/blog/2025-04-04-ai-evaluation-and-how-dbt-can-help/ai_eval_blog_image_four.png?v=2"></a></span><span class="title_aGrV">dbt Testing evaluates AI accuracy alongside data quality</span></div>
<p>The final step here is writing a dbt <a href="https://docs.getdbt.com/best-practices/writing-custom-generic-tests">custom test</a> to find any responses failing to meet our accuracy threshold. By creating this dbt test we can ensure issues with AI accuracy are caught and flagged as part of our standard dbt runs and quality checks.
We can also easily leverage dbt’s ability to <a href="https://docs.getdbt.com/reference/resource-configs/store_failures">store test failures</a> to record quality issues found in AI processes for further investigation and triage.</p>
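<p>A minimal sketch of such a custom generic test, assuming the evaluation model exposes a boolean column that flags whether each evaluation passed (the test name and default threshold here are illustrative):</p>

```sql
-- tests/generic/test_accuracy_above.sql -- illustrative sketch
{% test accuracy_above(model, column_name, threshold=0.75) %}

-- A generic test fails when it returns rows: here, one row whenever
-- the share of passing evaluations drops below the threshold.
select accuracy
from (
    select avg(case when {{ column_name }} then 1 else 0 end) as accuracy
    from {{ model }}
)
where accuracy < {{ threshold }}

{% endtest %}
```

Applied to a model in schema YAML, this runs alongside every other dbt test in the project.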
<p>A final benefit of capturing AI evaluations as part of your dbt project is just that: your AI quality information becomes part of your dbt project, meaning quality results are available in all the same ways as any other dbt test result.
You can view this information in <span>Catalog</span>, feed it into your data catalog of choice, use the test results to trigger additional downstream processes, or visualize it in quality dashboards through BI.
As AI workflows become more commonplace, businesses need a systematic way to evaluate and monitor AI outputs, just as they do with traditional data products. Fortunately, the same principles and tools within dbt can easily be applied to AI evaluation as well.
With dbt, data teams can bridge the gap between AI experimentation and AI in production by bringing trust, reliability, and governance to AI workflows.</p>
<p>Ready to bring AI evaluation into your dbt workflow? Get started with the dbt MCP server, which makes it easy to connect your AI systems to trusted, governed data.</p>
        <author>
            <name>Kyle Dempsey</name>
        </author>
        <author>
            <name>Luis Leon</name>
        </author>
        <category label="analytics craft" term="analytics craft"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Scaling Data Pipelines for a Growth-Stage Fintech with Incremental Models]]></title>
        <id>https://docs.getdbt.com/blog/scaling-data-pipelines-fintech</id>
        <link href="https://docs.getdbt.com/blog/scaling-data-pipelines-fintech"/>
        <updated>2025-05-01T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[How Kuda leveraged dbt incremental models to reduce costs, speed up pipelines, and scale confidently.]]></summary>
        <content type="html"><![CDATA[<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="introduction">Introduction<a href="https://docs.getdbt.com/blog/scaling-data-pipelines-fintech#introduction" class="hash-link" aria-label="Direct link to Introduction" title="Direct link to Introduction">​</a></h2>
<p>Building scalable data pipelines in a fast-growing fintech can feel like fixing a bike while riding it. You must keep insights flowing even as data volumes explode. At Kuda (a Nigerian neo-bank), we faced this problem as our user base surged. Traditional batch ETL (rebuilding entire tables each run) started to buckle; pipelines took hours, and costs ballooned. We needed to keep data fresh without reprocessing everything. Our solution was to leverage dbt’s <a href="https://docs.getdbt.com/docs/build/incremental-models">incremental models</a>, which process only new or changed records. This dramatically cut run times and curbed our BigQuery costs, letting us scale efficiently.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="challenges-in-scaling">Challenges in Scaling<a href="https://docs.getdbt.com/blog/scaling-data-pipelines-fintech#challenges-in-scaling" class="hash-link" aria-label="Direct link to Challenges in Scaling" title="Direct link to Challenges in Scaling">​</a></h2>
<p>Rapid growth brought some serious scaling challenges, and the most important were:</p>
<ul>
<li>
<p><strong>Performance</strong>: Our nightly full-refresh models that once took minutes began taking hours as data grew. For example, our core transactions table became too slow to rebuild from scratch for each update. Analytics dashboards lagged, and stakeholders lost timely insights. In real-time fintech, such latency is unacceptable.</p>
</li>
<li>
<p><strong>Cost</strong>: More data and longer processing also drove up our BigQuery bills. Scanning a 2TB table every hour to grab a few MB of new data was wasteful. Under BigQuery’s on-demand pricing model, this could rack up thousands of dollars per month. We needed to increase throughput without scaling cost linearly, which meant rethinking our processing to avoid full table scans.</p>
</li>
<li>
<p><strong>Data integrity</strong>: As pipelines and dependencies multiplied, so did the risk of inconsistencies or failures. Any new approach had to maintain accuracy and consistency even as we sped things up.</p>
</li>
</ul>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="approach-incremental-models--key-strategies">Approach: Incremental Models &amp; Key Strategies<a href="https://docs.getdbt.com/blog/scaling-data-pipelines-fintech#approach-incremental-models--key-strategies" class="hash-link" aria-label="Direct link to Approach: Incremental Models &amp; Key Strategies" title="Direct link to Approach: Incremental Models &amp; Key Strategies">​</a></h2>
<p>We tackled these issues by embracing dbt’s incremental models, which process only new or updated records since the last run. Instead of monolithic daily rebuilds, our models continuously ingested changes in small bites. Below, we outline our key <a href="https://docs.getdbt.com/docs/build/incremental-strategy">incremental strategies</a> —<code>append</code>, <code>insert_overwrite</code>, and <code>merge</code> — and how we tuned performance and cost.</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="append-strategy">Append Strategy<a href="https://docs.getdbt.com/blog/scaling-data-pipelines-fintech#append-strategy" class="hash-link" aria-label="Direct link to Append Strategy" title="Direct link to Append Strategy">​</a></h3>
<p>This is the simplest incremental approach: Each run adds new rows to the existing table and never touches old rows. It's ideal for append-only data (e.g. logs or transactions that never change after insertion).</p>
<p>In dbt, using append is straightforward. We configure the model as incremental and specify <code>incremental_strategy='append'</code> (supported in some adapters like Snowflake).</p>
<blockquote>
<p><strong>Note:</strong> <code>append</code> is not currently supported in BigQuery. Always confirm <a href="https://docs.getdbt.com/docs/build/incremental-strategy#supported-incremental-strategies-by-adapter">adapter support</a> before choosing an incremental strategy.</p>
</blockquote>
<p>In the SQL, we filter the source to only new records since the last load. For example, to incrementally load new transactions:</p>
<h4 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="code-append-incremental-strategy">Code: Append Incremental Strategy<a href="https://docs.getdbt.com/blog/scaling-data-pipelines-fintech#code-append-incremental-strategy" class="hash-link" aria-label="Direct link to Code: Append Incremental Strategy" title="Direct link to Code: Append Incremental Strategy">​</a></h4>
<div class="language-jinja codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#d6deeb;--prism-background-color:#011627"><div class="codeBlockContent_m3Ux"><pre tabindex="0" class="prism-code language-jinja codeBlock_qGQc thin-scrollbar" style="color:#d6deeb;background-color:#011627"><code class="codeBlockLines_p187"><span class="token-line" style="color:#d6deeb"><span class="token plain">{{ config(</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    materialized = 'incremental',</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    incremental_strategy = 'append'</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">) }}</span><br></span></code></pre><div class="buttonGroup_6DOT"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#d6deeb;--prism-background-color:#011627"><div class="codeBlockContent_m3Ux"><pre tabindex="0" class="prism-code language-sql codeBlock_qGQc thin-scrollbar" style="color:#d6deeb;background-color:#011627"><code class="codeBlockLines_p187"><span class="token-line" style="color:#d6deeb"><span class="token keyword" style="color:rgb(127, 219, 202)">SELECT</span><span class="token plain"> </span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    transaction_id</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    customer_id</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    transaction_date</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    amount</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    </span><span class="token keyword" style="color:rgb(127, 219, 202)">status</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain"></span><span class="token keyword" style="color:rgb(127, 219, 202)">FROM</span><span class="token plain"> {{ source</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token string" style="color:rgb(173, 219, 103)">'core'</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"> </span><span 
class="token string" style="color:rgb(173, 219, 103)">'transactions'</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token plain"> }}</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">{</span><span class="token operator" style="color:rgb(127, 219, 202)">%</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(127, 219, 202)">if</span><span class="token plain"> is_incremental</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token plain"> </span><span class="token operator" style="color:rgb(127, 219, 202)">%</span><span class="token plain">}</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain"></span><span class="token keyword" style="color:rgb(127, 219, 202)">WHERE</span><span class="token plain"> transaction_date </span><span class="token operator" style="color:rgb(127, 219, 202)">&gt;</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">   </span><span class="token keyword" style="color:rgb(127, 219, 202)">SELECT</span><span class="token plain"> </span><span class="token function" style="color:rgb(130, 170, 255)">MAX</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token plain">transaction_date</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(127, 219, 202)">FROM</span><span class="token plain"> {{ this }}</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain"></span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span 
class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">{</span><span class="token operator" style="color:rgb(127, 219, 202)">%</span><span class="token plain"> endif </span><span class="token operator" style="color:rgb(127, 219, 202)">%</span><span class="token plain">}</span><br></span></code></pre><div class="buttonGroup_6DOT"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>This query appends only transactions that have a transaction date later than the maximum transaction date in the target table.</p>
<h4 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="append-incremental-model--before-incremental-run">Append Incremental Model – Before Incremental Run<a href="https://docs.getdbt.com/blog/scaling-data-pipelines-fintech#append-incremental-model--before-incremental-run" class="hash-link" aria-label="Direct link to Append Incremental Model – Before Incremental Run" title="Direct link to Append Incremental Model – Before Incremental Run">​</a></h4>
<div class="filterableTableContainer_mhtg"><table><thead><tr><th>transaction_id</th><th>customer_id</th><th>transaction_date</th><th>amount</th><th>status</th></tr></thead><tbody><tr><td>10001</td><td>C001</td><td>2023-09-28</td><td>₦12,000</td><td>completed</td></tr><tr><td>10002</td><td>C002</td><td>2023-09-28</td><td>₦5,000</td><td>completed</td></tr></tbody></table></div>
<h4 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="append-incremental-model--after-incremental-run">Append Incremental Model – After Incremental Run<a href="https://docs.getdbt.com/blog/scaling-data-pipelines-fintech#append-incremental-model--after-incremental-run" class="hash-link" aria-label="Direct link to Append Incremental Model – After Incremental Run" title="Direct link to Append Incremental Model – After Incremental Run">​</a></h4>
<div class="filterableTableContainer_mhtg"><table><thead><tr><th>transaction_id</th><th>customer_id</th><th>transaction_date</th><th>amount</th><th>status</th></tr></thead><tbody><tr><td>10001</td><td>C001</td><td>2023-09-28</td><td>₦12,000</td><td>completed</td></tr><tr><td>10002</td><td>C002</td><td>2023-09-28</td><td>₦5,000</td><td>completed</td></tr><tr><td><strong>10003</strong></td><td><strong>C003</strong></td><td><strong>2023-09-29</strong></td><td><strong>₦8,500</strong></td><td><strong>completed</strong></td></tr><tr><td><strong>10004</strong></td><td><strong>C004</strong></td><td><strong>2023-09-30</strong></td><td><strong>₦7,250</strong></td><td><strong>completed</strong></td></tr><tr><td><strong>10005</strong></td><td><strong>C005</strong></td><td><strong>2023-09-30</strong></td><td><strong>₦3,100</strong></td><td><strong>pending</strong></td></tr></tbody></table></div>
<p><strong><em>Illustration: The "Before" table shows data before an incremental run; the "After" table shows new transactions (in bold) added. No historical data is touched. Append is great for immutable data streams (like transaction logs or event streams that only ever grow).</em></strong></p>
<p>Append served us well for ingestion pipelines that just accumulate history without reprocessing old data. However, we had to guard against duplicates (if the source might resend records, we applied deduplication or unique constraints). Also, pure append doesn’t handle updates or deletions to existing records. If data can change after insertion (e.g. a transaction status moves from "pending" to "completed"), a different strategy is needed.</p>
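<p>For example, on adapters that support <code>qualify</code> (such as Snowflake), one simple way to guard an append model against re-sent records is to deduplicate on the natural key within each incremental batch, as in this sketch of the model above:</p>

```sql
-- Illustrative sketch: the append model from above, with a dedup guard
select
    transaction_id,
    customer_id,
    transaction_date,
    amount,
    status
from {{ source('core', 'transactions') }}
{% if is_incremental() %}
where transaction_date > (select max(transaction_date) from {{ this }})
{% endif %}
-- keep only the latest copy of each transaction_id in this batch
qualify row_number() over (
    partition by transaction_id
    order by transaction_date desc
) = 1
```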
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="insert-overwrite-strategy">Insert Overwrite Strategy<a href="https://docs.getdbt.com/blog/scaling-data-pipelines-fintech#insert-overwrite-strategy" class="hash-link" aria-label="Direct link to Insert Overwrite Strategy" title="Direct link to Insert Overwrite Strategy">​</a></h3>
<p>For data partitioned by date (or another key) that may need partial replacements, <code>insert_overwrite</code> is ideal. Instead of merging rows, this strategy overwrites entire partitions of the target table each run. The table must be partitioned (daily, hourly, etc.), and the model will drop and rebuild only the partitions that have new or updated data.</p>
<p>We used <code>insert_overwrite</code> for partitioned data like daily aggregates, where changes are isolated by date. For example, if a table is partitioned by <code>transaction_date</code>, an <code>insert_overwrite</code> model can refresh just the partition for <em>"2023-10-01"</em> without affecting other days.</p>
<p>Here’s how we configured a model to use <code>insert_overwrite</code> on BigQuery:</p>
<h4 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="code-insert-overwrite-strategy">Code: Insert Overwrite Strategy<a href="https://docs.getdbt.com/blog/scaling-data-pipelines-fintech#code-insert-overwrite-strategy" class="hash-link" aria-label="Direct link to Code: Insert Overwrite Strategy" title="Direct link to Code: Insert Overwrite Strategy">​</a></h4>
<div class="language-jinja codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#d6deeb;--prism-background-color:#011627"><div class="codeBlockContent_m3Ux"><pre tabindex="0" class="prism-code language-jinja codeBlock_qGQc thin-scrollbar" style="color:#d6deeb;background-color:#011627"><code class="codeBlockLines_p187"><span class="token-line" style="color:#d6deeb"><span class="token plain">{{ config(</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    materialized = 'incremental',</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    incremental_strategy = 'insert_overwrite',</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    partition_by = { 'field': 'transaction_date', 'data_type': 'date' }</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">) }}</span><br></span></code></pre><div class="buttonGroup_6DOT"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#d6deeb;--prism-background-color:#011627"><div class="codeBlockContent_m3Ux"><pre tabindex="0" class="prism-code language-sql codeBlock_qGQc thin-scrollbar" style="color:#d6deeb;background-color:#011627"><code class="codeBlockLines_p187"><span class="token-line" style="color:#d6deeb"><span class="token keyword" style="color:rgb(127, 219, 202)">SELECT</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    customer_id</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    transaction_date</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    amount</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    transaction_type</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain"></span><span class="token keyword" style="color:rgb(127, 219, 202)">FROM</span><span class="token plain"> {{ source</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token string" style="color:rgb(173, 219, 103)">'core'</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"> </span><span class="token string" style="color:rgb(173, 219, 103)">'transactions'</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token plain"> }}</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain"></span><span class="token keyword" 
style="color:rgb(127, 219, 202)">WHERE</span><span class="token plain"> transaction_date </span><span class="token operator" style="color:rgb(127, 219, 202)">&gt;=</span><span class="token plain"> _dbt_max_partition</span><br></span></code></pre><div class="buttonGroup_6DOT"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>Here, <code>partition_by</code> defines the table partition. The <code>WHERE</code> clause uses <code>_dbt_max_partition</code>, a scripting variable that dbt-bigquery declares with the latest partition already present in the target, so each run pulls only data for new or updated partitions. (In a real model you would wrap this filter in <code>{% if is_incremental() %} … {% endif %}</code> so the first build doesn't reference the variable.) On each run, BigQuery replaces any existing partition that the filtered results touch (e.g. the partition for the latest date) with the query results. Older partitions stay untouched.</p>
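<p>Under the hood, dbt-bigquery compiles a dynamic <code>insert_overwrite</code> run into a BigQuery script along these lines (a simplified sketch: the generated SQL varies by adapter version, and the dataset and table names here are illustrative):</p>
<div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#d6deeb;--prism-background-color:#011627"><div class="codeBlockContent_m3Ux"><pre tabindex="0" class="prism-code language-sql codeBlock_qGQc thin-scrollbar" style="color:#d6deeb;background-color:#011627"><code class="codeBlockLines_p187"><span class="token-line" style="color:#d6deeb"><span class="token plain">declare dbt_partitions_for_replacement array&lt;date&gt;;</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">-- 1. materialize the model's SELECT into a temp table</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">create temporary table transactions__dbt_tmp as (</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    select customer_id, transaction_date, amount, transaction_type</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    from core.transactions</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    where transaction_date &gt;= _dbt_max_partition</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">);</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">-- 2. collect the partitions present in the new results</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">set (dbt_partitions_for_replacement) = (</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    select as struct array_agg(distinct transaction_date)</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    from transactions__dbt_tmp</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">);</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">-- 3. replace exactly those partitions in the target</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">merge into analytics.transactions as dest</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">using transactions__dbt_tmp as src</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">on false</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">when not matched by source</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    and dest.transaction_date in unnest(dbt_partitions_for_replacement)</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    then delete</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">when not matched then insert row;</span><br></span></code></pre></div></div>
<p>The <code>on false</code> condition means no rows ever "match": rows in the affected partitions are deleted, and all new rows are inserted, which is what makes this a whole-partition replacement rather than a row-level upsert.</p>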
<h4 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="insert-overwrite-strategy--before-incremental-run">Insert Overwrite Strategy – Before Incremental Run<a href="https://docs.getdbt.com/blog/scaling-data-pipelines-fintech#insert-overwrite-strategy--before-incremental-run" class="hash-link" aria-label="Direct link to Insert Overwrite Strategy – Before Incremental Run" title="Direct link to Insert Overwrite Strategy – Before Incremental Run">​</a></h4>
<div class="filterableTableContainer_mhtg"><table style="display:none"><thead><tr><th>transaction_date</th><th>transaction_id</th><th>customer_id</th><th>amount</th><th>status</th></tr></thead><tbody><tr><td>2023-09-29</td><td>11001</td><td>C011</td><td>₦14,000</td><td>completed</td></tr><tr><td>2023-09-29</td><td>11002</td><td>C012</td><td>₦6,500</td><td>completed</td></tr><tr><td><strong>2023-10-01</strong></td><td><strong>12001</strong></td><td><strong>C021</strong></td><td><strong>₦8,000</strong></td><td><strong>pending</strong></td></tr><tr><td><strong>2023-10-01</strong></td><td><strong>12002</strong></td><td><strong>C022</strong></td><td><strong>₦4,200</strong></td><td><strong>completed</strong></td></tr></tbody></table><div class="tableWrapper_oiMt"><div class="searchBar_xnmH"><div class="searchContainer_fLyJ"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 512 512"><path d="M416 208c0 45.9-14.9 88.3-40 122.7L502.6 457.4c12.5 12.5 12.5 32.8 0 45.3s-32.8 12.5-45.3 0L330.7 376c-34.4 25.2-76.8 40-122.7 40C93.1 416 0 322.9 0 208S93.1 0 208 0S416 93.1 416 208zM208 352a144 144 0 1 0 0-288 144 144 0 1 0 0 288z"></path></svg><input type="text" placeholder="Search table..." class="searchInput_xT8h" aria-label="Search table" value=""></div></div><table class="filterableTable_QAKT"><thead><tr></tr></thead><tbody><tr><td colspan="5" style="text-align:center;padding:20px">Loading table...</td></tr></tbody></table></div></div>
<p><em>(Partition to be overwritten highlighted in bold.)</em></p>
<h4 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="insert-overwrite-strategy--after-incremental-run">Insert Overwrite Strategy – After Incremental Run<a href="https://docs.getdbt.com/blog/scaling-data-pipelines-fintech#insert-overwrite-strategy--after-incremental-run" class="hash-link" aria-label="Direct link to Insert Overwrite Strategy – After Incremental Run" title="Direct link to Insert Overwrite Strategy – After Incremental Run">​</a></h4>
<div class="filterableTableContainer_mhtg"><table style="display:none"><thead><tr><th>transaction_date</th><th>transaction_id</th><th>customer_id</th><th>amount</th><th>status</th></tr></thead><tbody><tr><td>2023-09-29</td><td>11001</td><td>C011</td><td>₦14,000</td><td>completed</td></tr><tr><td>2023-09-29</td><td>11002</td><td>C012</td><td>₦6,500</td><td>completed</td></tr><tr><td><strong>2023-10-01</strong></td><td><strong>12003</strong></td><td><strong>C023</strong></td><td><strong>₦8,150</strong></td><td><strong>completed</strong></td></tr><tr><td><strong>2023-10-01</strong></td><td><strong>12004</strong></td><td><strong>C024</strong></td><td><strong>₦3,900</strong></td><td><strong>completed</strong></td></tr><tr><td><strong>2023-10-01</strong></td><td><strong>12005</strong></td><td><strong>C025</strong></td><td><strong>₦5,500</strong></td><td><strong>completed</strong></td></tr></tbody></table><div class="tableWrapper_oiMt"><div class="searchBar_xnmH"><div class="searchContainer_fLyJ"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 512 512"><path d="M416 208c0 45.9-14.9 88.3-40 122.7L502.6 457.4c12.5 12.5 12.5 32.8 0 45.3s-32.8 12.5-45.3 0L330.7 376c-34.4 25.2-76.8 40-122.7 40C93.1 416 0 322.9 0 208S93.1 0 208 0S416 93.1 416 208zM208 352a144 144 0 1 0 0-288 144 144 0 1 0 0 288z"></path></svg><input type="text" placeholder="Search table..." class="searchInput_xT8h" aria-label="Search table" value=""></div></div><table class="filterableTable_QAKT"><thead><tr></tr></thead><tbody><tr><td colspan="5" style="text-align:center;padding:20px">Loading table...</td></tr></tbody></table></div></div>
<p><em>(New partition data is shown in bold, replacing the old partition.)</em></p>
<p><em>Illustration: "Before" shows a partitioned table with the October 1, 2023 partition highlighted; "After" shows that partition replaced with fresh rows. This approach lets us refresh a specific day’s data (e.g. to capture late-arriving transactions or corrections) without rebuilding the whole table.</em></p>
<p>At Kuda, <code>insert_overwrite</code> was invaluable for derived tables and rollups. For instance, our daily customer spend aggregates are updated incrementally by replacing just the latest day's data, keeping those tables accurate with minimal cost. By replacing whole partitions, we avoided complex row-by-row merges while still catching any corrections within that day (for example, a back-dated transaction on that day would be picked up when the partition is reprocessed).</p>
<p>One note on static vs. dynamic partitions: with <code>insert_overwrite</code> on BigQuery, dbt can either replace a statically configured list of partitions (via the <code>partitions</code> config) or dynamically detect which partitions the run's results touch; in both cases whole partitions are replaced, not individual rows. We mostly stuck to whole-day replacements for simplicity. It's easier to reason about "<em>each run rebuilds yesterday’s partition from scratch</em>," and it ensures we capture any late modifications for that day. This dramatically improved performance for large tables (no full table scans) while still correcting recent data when needed. We just had to align the <code>partition_by</code> field and filter logic to avoid wiping the wrong partition.</p>
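<p>When the partitions to replace are known up front (say, today and yesterday), dbt-bigquery lets you list them explicitly via the <code>partitions</code> config instead of detecting them dynamically. A sketch of the static variant (model and source names are illustrative):</p>
<div class="language-jinja codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#d6deeb;--prism-background-color:#011627"><div class="codeBlockContent_m3Ux"><pre tabindex="0" class="prism-code language-jinja codeBlock_qGQc thin-scrollbar" style="color:#d6deeb;background-color:#011627"><code class="codeBlockLines_p187"><span class="token-line" style="color:#d6deeb"><span class="token plain">{% set partitions_to_replace = [</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    'current_date',</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    'date_sub(current_date, interval 1 day)'</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">] %}</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">{{ config(</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    materialized = 'incremental',</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    incremental_strategy = 'insert_overwrite',</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    partition_by = { 'field': 'transaction_date', 'data_type': 'date' },</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    partitions = partitions_to_replace</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">) }}</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">select customer_id, transaction_date, amount, transaction_type</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">from {{ source('core', 'transactions') }}</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">where transaction_date in ({{ partitions_to_replace | join(', ') }})</span><br></span></code></pre></div></div>
<p>Because the list is fixed, dbt can skip the scan that would otherwise detect affected partitions, which can shave cost on very large tables.</p>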
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="merge-strategy">Merge Strategy<a href="https://docs.getdbt.com/blog/scaling-data-pipelines-fintech#merge-strategy" class="hash-link" aria-label="Direct link to Merge Strategy" title="Direct link to Merge Strategy">​</a></h3>
<p>For tables where new records arrive and existing ones can change, we used the merge strategy. It performs an upsert based on a unique key: new rows are inserted, and if a key already exists, specified fields are updated. This is perfect for data like customer profiles or account balances that evolve.</p>
<p>In dbt, using <code>incremental_strategy='merge'</code> requires a <code>unique_key</code> (on BigQuery or Snowflake, dbt compiles the run into a <code>MERGE</code> statement). We can also limit which columns get updated with <code>merge_update_columns</code>, or exclude certain fields with <code>merge_exclude_columns</code>. For example:</p>
<h4 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="code-merge-strategy">Code: Merge Strategy<a href="https://docs.getdbt.com/blog/scaling-data-pipelines-fintech#code-merge-strategy" class="hash-link" aria-label="Direct link to Code: Merge Strategy" title="Direct link to Code: Merge Strategy">​</a></h4>
<div class="language-jinja codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#d6deeb;--prism-background-color:#011627"><div class="codeBlockContent_m3Ux"><pre tabindex="0" class="prism-code language-jinja codeBlock_qGQc thin-scrollbar" style="color:#d6deeb;background-color:#011627"><code class="codeBlockLines_p187"><span class="token-line" style="color:#d6deeb"><span class="token plain">{{ config(</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    materialized = 'incremental',</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    incremental_strategy = 'merge',</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    unique_key = 'account_id',</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    merge_update_columns = ['balance', 'last_updated']</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">) }}</span><br></span></code></pre><div class="buttonGroup_6DOT"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#d6deeb;--prism-background-color:#011627"><div class="codeBlockContent_m3Ux"><pre tabindex="0" class="prism-code language-sql codeBlock_qGQc thin-scrollbar" style="color:#d6deeb;background-color:#011627"><code class="codeBlockLines_p187"><span class="token-line" style="color:#d6deeb"><span class="token keyword" style="color:rgb(127, 219, 202)">SELECT</span><span class="token plain"> </span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    account_id</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    balance</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    last_updated</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain"></span><span class="token keyword" style="color:rgb(127, 219, 202)">FROM</span><span class="token plain"> {{ source</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token string" style="color:rgb(173, 219, 103)">'core'</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"> </span><span class="token string" style="color:rgb(173, 219, 103)">'accounts'</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token plain"> }}</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">{</span><span class="token operator" style="color:rgb(127, 219, 202)">%</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(127, 219, 202)">if</span><span class="token plain"> is_incremental</span><span class="token punctuation" 
style="color:rgb(199, 146, 234)">(</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token plain"> </span><span class="token operator" style="color:rgb(127, 219, 202)">%</span><span class="token plain">}</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain"></span><span class="token keyword" style="color:rgb(127, 219, 202)">WHERE</span><span class="token plain"> last_updated </span><span class="token operator" style="color:rgb(127, 219, 202)">&gt;</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">   </span><span class="token keyword" style="color:rgb(127, 219, 202)">SELECT</span><span class="token plain"> </span><span class="token function" style="color:rgb(130, 170, 255)">MAX</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token plain">last_updated</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(127, 219, 202)">FROM</span><span class="token plain"> {{ this }}</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain"></span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">{</span><span class="token operator" style="color:rgb(127, 219, 202)">%</span><span class="token plain"> endif </span><span class="token operator" style="color:rgb(127, 219, 202)">%</span><span class="token plain">}</span><br></span></code></pre><div class="buttonGroup_6DOT"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg 
viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>This model selects only new or updated records (those with a <code>last_updated</code> more recent than the max in the target) and merges them into the accounts table on <code>account_id</code>. We chose to update only the <code>balance</code> and <code>last_updated</code> fields for existing accounts (to avoid overwriting other data). If an incoming <code>account_id</code> doesn’t exist in the target yet, a new row is inserted.</p>
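<p>For intuition, dbt compiles the config above into roughly this BigQuery <code>MERGE</code> (simplified; <code>analytics.accounts</code> and <code>new_data</code> stand in for the real compiled names):</p>
<div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#d6deeb;--prism-background-color:#011627"><div class="codeBlockContent_m3Ux"><pre tabindex="0" class="prism-code language-sql codeBlock_qGQc thin-scrollbar" style="color:#d6deeb;background-color:#011627"><code class="codeBlockLines_p187"><span class="token-line" style="color:#d6deeb"><span class="token plain">merge into analytics.accounts as DBT_INTERNAL_DEST</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">using (</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    -- the model's SELECT, filtered to new or updated records</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    select account_id, balance, last_updated from new_data</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">) as DBT_INTERNAL_SOURCE</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">on DBT_INTERNAL_DEST.account_id = DBT_INTERNAL_SOURCE.account_id</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">when matched then update set</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    balance = DBT_INTERNAL_SOURCE.balance,</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    last_updated = DBT_INTERNAL_SOURCE.last_updated</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">when not matched then insert (account_id, balance, last_updated)</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    values (account_id, balance, last_updated)</span><br></span></code></pre></div></div>
<p>Note how <code>merge_update_columns</code> shows up as the column list in the <code>update set</code> clause: existing rows have only those fields touched, while unmatched rows are inserted whole.</p>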
<p>Merge was our go-to for upserts. For example, we maintained a daily updated accounts table of customer statuses and balances using merge. Each day, new accounts were added, and any changes (balance updates, status changes) were merged into existing records. This prevented duplicates (which a naive append would create) and ensured one row per account with the latest info.</p>
<p>We learned to define unique keys and update columns carefully. In one case, we omitted <code>merge_exclude_columns</code> and accidentally overwrote a timestamp we meant to preserve—a quick lesson in being explicit. Merge also comes with a performance cost: each run joins the new data with the existing table. With proper clustering on the key and only a day's worth of new data, this was fine for us, but at very large scale it needs monitoring.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="optimizing-performance-and-cost">Optimizing Performance and Cost<a href="https://docs.getdbt.com/blog/scaling-data-pipelines-fintech#optimizing-performance-and-cost" class="hash-link" aria-label="Direct link to Optimizing Performance and Cost" title="Direct link to Optimizing Performance and Cost">​</a></h2>
<p>Choosing the right incremental strategy was half the battle; we also employed several performance tuning and cost optimization techniques to make our pipelines truly scale:</p>
<ul>
<li>
<p><strong>Partitioning</strong>: On large tables, we partitioned by date (or another key) so incremental runs only scan the new slice. For example, partitioning the transactions table by transaction_date meant a daily incremental load only touched that day's partition. BigQuery’s partition pruning reduced scans from entire multi-terabyte tables to just a few GB per run (e.g. 2TB down to 0.01TB), yielding huge savings. Partitioning also sped up downstream queries that filter by date.</p>
</li>
<li>
<p><strong>Clustering</strong>: We added clustering on columns frequently used in filters or joins (e.g. clustering transactions by <code>customer_id</code>). In BigQuery, clustering sorts the data by those columns, so queries filtering on them scan less data within each partition. The improvements were subtle but meaningful; some queries that once scanned tens of GBs now scan only a fraction when the table is well-clustered.</p>
</li>
<li>
<p><strong>Smart Scheduling</strong>: We tuned model run frequencies to balance freshness vs. cost. Not every model needs to run constantly. Our customer-facing tables (transactions, balances) ran hourly for near-real-time updates, whereas internal analytics models ran daily or a few times per day. Adjusting schedules avoided wasteful runs and saved on compute cost. We also used dependency-based scheduling (via dbt Cloud), so heavy models ran only after upstream data was updated, preventing runs when no new source data arrived.</p>
</li>
<li>
<p><strong>Warehouse Tuning</strong>: We optimized our warehouse compute as well. Since incremental models drastically cut per-run processing, we could use smaller clusters/slots and run more often without overspending, a big win as data volumes grew.</p>
</li>
<li>
<p><strong>Monitoring &amp; Alerting</strong>: We tracked metrics like run durations and rows processed to catch anomalies. For example, if a daily incremental model that usually adds hundreds of rows suddenly adds zero, that's a red flag (upstream failure or missing source data). Similarly, if a job that normally takes 5 minutes jumps to 50, it likely did an unintended full scan. We also watched data freshness: if an hourly model hadn't loaded new data in 3 hours, we investigated. These checks helped us catch issues early (like a stale source or a broken filter) and kept data flowing reliably.</p>
</li>
</ul>
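<p>In dbt, the partitioning and clustering described above are a few lines of model config. A sketch (the column choices are illustrative):</p>
<div class="language-jinja codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#d6deeb;--prism-background-color:#011627"><div class="codeBlockContent_m3Ux"><pre tabindex="0" class="prism-code language-jinja codeBlock_qGQc thin-scrollbar" style="color:#d6deeb;background-color:#011627"><code class="codeBlockLines_p187"><span class="token-line" style="color:#d6deeb"><span class="token plain">{{ config(</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    materialized = 'incremental',</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    incremental_strategy = 'insert_overwrite',</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    partition_by = { 'field': 'transaction_date', 'data_type': 'date' },</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    cluster_by = ['customer_id', 'transaction_type']</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">) }}</span><br></span></code></pre></div></div>
<p>With this in place, an incremental run prunes to the affected date partitions, and queries that filter on <code>customer_id</code> scan less data within each partition.</p>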
<p>With these optimizations in place, we vastly improved our pipeline speed and cost-efficiency. Instead of fearing the next data surge, we were confident our system could handle growth by design.</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="real-world-implementation-kuda-case-study">Real-World Implementation: Kuda Case Study<a href="https://docs.getdbt.com/blog/scaling-data-pipelines-fintech#real-world-implementation-kuda-case-study" class="hash-link" aria-label="Direct link to Real-World Implementation: Kuda Case Study" title="Direct link to Real-World Implementation: Kuda Case Study">​</a></h3>
<p>How did these approaches work out in practice at Kuda during hyper-growth?</p>
<p>One critical dataset was our customer transactions feed. Initially, a full daily rebuild of the transactions table took over an hour and scanned the entire history. We refactored it into an incremental model (append strategy with partition pruning). The first run built the historical backlog, and subsequent runs pulled only new transactions. The difference was night and day: the incremental job ran in minutes, and data scanned per run dropped by over 90%. Analysts saw new transactions within the hour, and our monthly BigQuery cost for that table plummeted even as data continued to grow.</p>
<p>To illustrate, here’s a simplified daily transactions summary. Initially, it contained data up to <em>Sept 30, 2023</em>:</p>
<h4 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="daily-transactions-summary--before">Daily Transactions Summary – Before<a href="https://docs.getdbt.com/blog/scaling-data-pipelines-fintech#daily-transactions-summary--before" class="hash-link" aria-label="Direct link to Daily Transactions Summary – Before" title="Direct link to Daily Transactions Summary – Before">​</a></h4>
<div class="filterableTableContainer_mhtg"><table style="display:none"><thead><tr><th>date</th><th>transactions_count</th><th>total_amount</th></tr></thead><tbody><tr><td>2023-09-28</td><td>1,045</td><td>25,100,000</td></tr><tr><td>2023-09-29</td><td>980</td><td>22,340,000</td></tr><tr><td>2023-09-30</td><td>1,102</td><td>27,500,000</td></tr></tbody></table><div class="tableWrapper_oiMt"><div class="searchBar_xnmH"><div class="searchContainer_fLyJ"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 512 512"><path d="M416 208c0 45.9-14.9 88.3-40 122.7L502.6 457.4c12.5 12.5 12.5 32.8 0 45.3s-32.8 12.5-45.3 0L330.7 376c-34.4 25.2-76.8 40-122.7 40C93.1 416 0 322.9 0 208S93.1 0 208 0S416 93.1 416 208zM208 352a144 144 0 1 0 0-288 144 144 0 1 0 0 288z"></path></svg><input type="text" placeholder="Search table..." class="searchInput_xT8h" aria-label="Search table" value=""></div></div><table class="filterableTable_QAKT"><thead><tr></tr></thead><tbody><tr><td colspan="5" style="text-align:center;padding:20px">Loading table...</td></tr></tbody></table></div></div>
<p>After the next incremental load (bringing in transactions from <em>October 1, 2023</em>), the table automatically includes the new day's metrics without recomputing prior days:</p>
<h4 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="daily-transactions-summary--after">Daily Transactions Summary – After<a href="https://docs.getdbt.com/blog/scaling-data-pipelines-fintech#daily-transactions-summary--after" class="hash-link" aria-label="Direct link to Daily Transactions Summary – After" title="Direct link to Daily Transactions Summary – After">​</a></h4>
<div class="filterableTableContainer_mhtg"><table style="display:none"><thead><tr><th>date</th><th>transactions_count</th><th>total_amount</th></tr></thead><tbody><tr><td>2023-09-28</td><td>1,045</td><td>25,100,000</td></tr><tr><td>2023-09-29</td><td>980</td><td>22,340,000</td></tr><tr><td>2023-09-30</td><td>1,102</td><td>27,500,000</td></tr><tr><td>2023-10-01</td><td>1,210</td><td>30,230,000</td></tr></tbody></table><div class="tableWrapper_oiMt"><div class="searchBar_xnmH"><div class="searchContainer_fLyJ"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 512 512"><path d="M416 208c0 45.9-14.9 88.3-40 122.7L502.6 457.4c12.5 12.5 12.5 32.8 0 45.3s-32.8 12.5-45.3 0L330.7 376c-34.4 25.2-76.8 40-122.7 40C93.1 416 0 322.9 0 208S93.1 0 208 0S416 93.1 416 208zM208 352a144 144 0 1 0 0-288 144 144 0 1 0 0 288z"></path></svg><input type="text" placeholder="Search table..." class="searchInput_xT8h" aria-label="Search table" value=""></div></div><table class="filterableTable_QAKT"><thead><tr></tr></thead><tbody><tr><td colspan="5" style="text-align:center;padding:20px">Loading table...</td></tr></tbody></table></div></div>
<p><em>Table: Example of a daily transactions summary. After an incremental load for 2023-10-01, the new day’s data appears without reprocessing previous days.</em></p>
<p>This approach kept our teams and customers up-to-date. Customer support could view nearly real-time transactions to investigate issues, and customers could generate current account statements on the fly.</p>
<p>We also used incremental models for regulatory and finance reporting. For example, our Finance team needed a daily reconciliation of balances and an end-of-day accounts table. They were fine with data being a day old, but it had to be accurate and deduplicated. We built this with a nightly incremental merge on the accounts table, merging changes from the core accounts data into a fresh daily view of account states. It provided a reliable daily snapshot of accounts. (Our Finance team never realized any fancy incremental process was involved; they just got their report each morning!)</p>
<p>During the launch of a new card product, we needed to monitor transaction declines and errors in near real-time. Our existing daily-refreshed dashboard wasn’t enough. We set up an incremental model that ingested card transaction events every 15 minutes. To ensure that no historical fixes were missed, we also scheduled a nightly full refresh of this model during the launch period. This hybrid approach gave us timely visibility and a daily catch-up for any late-arriving corrections. It proved crucial: we spotted issues (like a spike in declines from an API glitch) early and fixed them, minimizing customer impact. After the launch, we reverted to purely incremental runs once things stabilized.</p>
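<p>Operationally, that hybrid setup amounts to two scheduled jobs: a frequent incremental run plus a nightly run with dbt's <code>--full-refresh</code> flag (the model name here is illustrative):</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#d6deeb;--prism-background-color:#011627"><div class="codeBlockContent_m3Ux"><pre tabindex="0" class="prism-code language-bash codeBlock_qGQc thin-scrollbar" style="color:#d6deeb;background-color:#011627"><code class="codeBlockLines_p187"><span class="token-line" style="color:#d6deeb"><span class="token plain"># every 15 minutes during the launch window</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">dbt run --select card_transaction_events</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain"># nightly catch-up to pick up late-arriving corrections</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">dbt run --select card_transaction_events --full-refresh</span><br></span></code></pre></div></div>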
<p>The overall impact at Kuda was huge. Heavy transformations that had been close to failing became reliable again. Stakeholders noticed fresher data in their reports, and our customer satisfaction scores improved because no one saw stale data. By controlling costs and keeping pipelines efficient, we kept management and regulators happy.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="lessons-learned--best-practices">Lessons Learned &amp; Best Practices<a href="https://docs.getdbt.com/blog/scaling-data-pipelines-fintech#lessons-learned--best-practices" class="hash-link" aria-label="Direct link to Lessons Learned &amp; Best Practices" title="Direct link to Lessons Learned &amp; Best Practices">​</a></h2>
<p>Throughout this scaling journey, we learned a ton about what worked and what didn’t. Here are some of the key lessons and best practices we’d recommend to any growth-stage fintech looking to implement incremental models:</p>
<ul>
<li>
<p><strong>Choose the Right Strategy</strong>: Not all tables should use the same incremental approach. Generally, use append for insert-only data, <code>insert_overwrite</code> for data partitioned by date (or ID) where you can replace whole chunks, and merge for true upsert scenarios. If source deletes are an issue, consider <code>delete+insert</code> or handle soft deletes.</p>
</li>
<li>
<p><strong>Partition Wisely</strong>: Partitioning is critical, but pick an appropriate granularity. The right grain (hour, day, month) depends on data volume and query patterns. For us, daily partitions were often the sweet spot: small enough to reduce scanned data, but not so small as to create thousands of partitions. Always align your incremental filter with the partition field to enable pruning.</p>
</li>
<li>
<p><strong>Monitor Your Models</strong>: Implement tests or alerts on incremental models. For example, check that each run's row count is within expected bounds: if a daily incremental model that usually adds hundreds of rows suddenly adds zero, that's a red flag. Catching these issues early prevents bigger problems down the line.</p>
</li>
<li>
<p><strong>Periodic Full Refreshes</strong>: Over time, even a well-built incremental model can drift due to small errors or schema changes. We scheduled occasional full refreshes for critical models to realign them with source data, essentially giving a clean slate that catches any discrepancies or missed data. Similarly, after major logic changes, we’d do a one-time full refresh to apply the new logic across all historical data and then switch back to incremental.</p>
</li>
<li>
<p><strong>Test and Document</strong>: We treated incremental models like mission-critical code. We wrote tests to ensure the logic is sound (for instance, that after an incremental run the target's record count for a period equals the source's count for that period; if not, the filter might be wrong). We also documented each model's assumptions (e.g. "<em>this model runs incrementally; do not disable the <code>is_incremental()</code> filter in development</em>"). Good documentation helped new team members avoid breaking incremental logic.</p>
</li>
<li>
<p><strong>Design for Scale Early</strong>: Our biggest lesson was to plan for scale from the start. Now, when designing a new model or pipeline, we ask, “<em>What if the data grows 10x?</em>”. If a full refresh won't be feasible at that size, we build it incrementally from day one. It's much easier than refactoring under pressure later. This mindset, combined with dbt’s flexible incremental features, has future-proofed our pipelines. As our data continues to grow, the incremental approach should keep holding up.</p>
</li>
</ul>
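<p>The count-reconciliation check described above fits naturally as a dbt singular test: the query returns the recent days whose counts diverge, and the test fails if it returns any rows. A sketch (the <code>fct_transactions</code> model name is illustrative):</p>
<div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#d6deeb;--prism-background-color:#011627"><div class="codeBlockContent_m3Ux"><pre tabindex="0" class="prism-code language-sql codeBlock_qGQc thin-scrollbar" style="color:#d6deeb;background-color:#011627"><code class="codeBlockLines_p187"><span class="token-line" style="color:#d6deeb"><span class="token plain">-- tests/assert_incremental_counts_match.sql</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">with source_counts as (</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    select transaction_date, count(*) as n</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    from {{ source('core', 'transactions') }}</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    where transaction_date &gt;= date_sub(current_date, interval 7 day)</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    group by transaction_date</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">),</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">target_counts as (</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    select transaction_date, count(*) as n</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    from {{ ref('fct_transactions') }}</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    where transaction_date &gt;= date_sub(current_date, interval 7 day)</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">    group by transaction_date</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">)</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">select s.transaction_date</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">from source_counts as s</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">left join target_counts as t using (transaction_date)</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">where t.n is null or s.n != t.n</span><br></span></code></pre></div></div>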
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="conclusion">Conclusion<a href="https://docs.getdbt.com/blog/scaling-data-pipelines-fintech#conclusion" class="hash-link" aria-label="Direct link to Conclusion" title="Direct link to Conclusion">​</a></h2>
<p>Scaling a fintech data platform doesn’t have to mean scaling cost and runtime at the same pace. By using dbt’s incremental models—paired with optimizations like partitioning, clustering, and careful scheduling—Kuda transformed its pipelines to handle rapid growth. We kept data fresh and accurate for users without breaking the bank. Incremental processing let us handle ever-increasing volumes in bite-sized chunks, maintaining agility as the company grew.</p>
<p>If you’re at a growing company struggling with slow or expensive data jobs, give dbt’s incremental models a try. As we saw at Kuda, the payoff can be huge: faster insights, happier stakeholders, and a data platform ready for whatever the future brings. The future of data processing (in fintech and beyond) is incremental. With tools like dbt, you can ride the wave of growth instead of drowning in it.</p>]]></content>
        <author>
            <name>Adedamola Onabanjo</name>
        </author>
        <category label="analytics" term="analytics"/>
        <category label="dbt Cloud" term="dbt Cloud"/>
        <category label="fintech" term="fintech"/>
        <category label="BigQuery" term="BigQuery"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Introducing the dbt MCP Server – Bringing Structured Data to AI Workflows and Agents]]></title>
        <id>https://docs.getdbt.com/blog/introducing-dbt-mcp-server</id>
        <link href="https://docs.getdbt.com/blog/introducing-dbt-mcp-server"/>
        <updated>2025-04-21T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[We’re open‑sourcing an experimental dbt MCP server so LLMs and agents can discover, query, and run your dbt project.]]></summary>
        <content type="html"><![CDATA[<p>dbt is the standard for creating governed, trustworthy datasets on top of your structured data. <a href="https://www.anthropic.com/news/model-context-protocol" target="_blank" rel="noopener noreferrer">MCP</a> is showing increasing promise as the standard for providing context to LLMs to allow them to function at a high level in real world, operational scenarios.</p>
<p>Today, we are open sourcing an experimental version of the <a href="https://github.com/dbt-labs/dbt-mcp/tree/main" target="_blank" rel="noopener noreferrer">dbt MCP server</a>. We expect that over the coming years, structured data is going to become heavily integrated into AI workflows and that dbt will play a key role in building and provisioning this data.</p>
<p>In particular, we expect both <a href="https://roundup.getdbt.com/p/how-ai-will-disrupt-bi-as-we-know" target="_blank" rel="noopener noreferrer">Business Intelligence</a> and <a href="https://roundup.getdbt.com/p/how-ai-will-disrupt-data-engineering" target="_blank" rel="noopener noreferrer">Data Engineering</a> will be driven by AI operating on top of the context defined in your dbt projects.</p>
<p><strong>We are committed to building the data control plane that enables AI to reliably access structured data from across your entire data lineage.</strong> Over the coming months and years, data teams will increasingly focus on building the rich context that feeds into the dbt MCP server.  Both AI agents and business stakeholders will then operate on top of LLM-driven systems hydrated by the dbt MCP context.</p>
<div style="margin:40px 10px"><div class="loomWrapper_TTvb"><iframe width="640" class="loomFrame_B61a" height="400" src="https://www.loom.com/embed/28cd33da8bcc41ccbe43338d327e73d8" frameborder="0" allowfullscreen="" webkitallowfullscreen="true" mozallowfullscreen="true"></iframe></div></div>
<p>Today’s system is not a full realization of the vision in the posts shared above, but it is a meaningful step towards safely integrating your structured enterprise data into AI workflows. In this post, we’ll walk through what the dbt MCP server can do today, share some tips for getting started, and cover some of the limitations of the current implementation.</p>
<p>We believe it is important for the industry to start coalescing on best practices for safe and trustworthy ways to access your business data via LLMs.</p>
<p><strong>What is MCP?</strong></p>
<p>MCP stands for Model Context Protocol, an open protocol released by Anthropic in <a href="https://www.anthropic.com/news/model-context-protocol" target="_blank" rel="noopener noreferrer">November 2024</a> that allows AI systems to dynamically pull in context and data. Why does this matter?</p>
<blockquote>
<p>Even the most sophisticated models are constrained by their isolation from data—trapped behind information silos and legacy systems. Every new data source requires its own custom implementation, making truly connected systems difficult to scale.</p>
<p>MCP addresses this challenge. It provides a universal, open standard for connecting AI systems with data sources, replacing fragmented integrations with a single protocol. - Anthropic</p>
</blockquote>
<p>Since then, MCP has become widely adopted, with Google, Microsoft, and OpenAI all committing to support it.</p>
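<p>Concretely, MCP is built on JSON-RPC 2.0: a client asks a server which tools it offers (<code>tools/list</code>), then invokes one by name (<code>tools/call</code>). A minimal Python sketch of those two message shapes (the tool and argument names in the second call are illustrative, not the dbt MCP server's exact schema):</p>

```python
import json

def make_request(req_id, method, params=None):
    """Build a JSON-RPC 2.0 message of the kind MCP clients send."""
    msg = {"jsonrpc": "2.0", "id": req_id, "method": method}
    if params is not None:
        msg["params"] = params
    return json.dumps(msg)

# Ask the server which tools it exposes...
list_tools = make_request(1, "tools/list")
# ...then invoke one by name with arguments.
call_tool = make_request(2, "tools/call", {
    "name": "get_model_details",           # a dbt MCP discovery tool
    "arguments": {"model_name": "orders"}  # argument name is illustrative
})
```

In practice an MCP client library handles this framing for you; the point is that any client speaking this protocol can talk to any MCP server.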
<p><strong>What does the dbt MCP Server do?</strong></p>
<p>Think of it as the missing glue between:</p>
<ul>
<li><strong>Your dbt project</strong> (models, docs, lineage, Semantic&nbsp;Layer)</li>
<li><strong>Any MCP‑enabled <a href="https://modelcontextprotocol.io/clients" target="_blank" rel="noopener noreferrer">client</a></strong> (Claude Desktop Projects, Cursor, agent frameworks, custom apps, etc.)</li>
</ul>
<p>We’ve <a href="https://roundup.getdbt.com/p/semantic-layer-as-the-data-interface" target="_blank" rel="noopener noreferrer">known for a while</a> that the combination of structured data from your dbt project and LLMs is a potent one (particularly when using the dbt Semantic Layer). The question has been: what is the best way to provision this across a wide variety of LLM applications in a way that puts the power in the hands of the Community and the ecosystem, rather than having us build out a series of one-off integrations?</p>
<p>The dbt MCP server provides access to a set of <em>tools</em> that operate on top of your dbt project. These tools can be called by LLM systems to learn about your data and metadata.</p>
<p><strong>Remember, as with any AI workflows, to make sure that you are taking appropriate caution in terms of giving these access to production systems and data. Consider starting in a sandbox environment or only granting read permissions.</strong></p>
<p>There are three primary functions of the dbt MCP server today.</p>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="docImage_EYbW" style="max-width:100%"><span><a href="https://docs.getdbt.com/blog/introducing-dbt-mcp-server#" data-featherlight="/img/blog/2025-04-18-dbt-mcp-server/mcp_use_cases.png"><img data-toggle="lightbox" alt="Three use‑case pillars of the dbt MCP server" title="Three use‑case pillars of the dbt MCP server" src="https://docs.getdbt.com/img/blog/2025-04-18-dbt-mcp-server/mcp_use_cases.png?v=2"></a></span><span class="title_aGrV">Three use‑case pillars of the dbt MCP server</span></div>
<ul>
<li>Data discovery: Understand what data assets exist in your dbt project.</li>
<li>Data querying: Directly query the data in your dbt project. This has two components:<!-- -->
<ul>
<li>Use the dbt Semantic Layer for trustworthy, single source of truth reporting on your metrics</li>
<li>Execution of SQL queries for more freewheeling data exploration and development</li>
</ul>
</li>
<li>Project execution: access the dbt CLI to run your project and perform other operations</li>
</ul>
<div class="docImage_EYbW" style="max-width:100%"><span><a href="https://docs.getdbt.com/blog/introducing-dbt-mcp-server#" data-featherlight="/img/blog/2025-04-18-dbt-mcp-server/mcp_architecture_overview.png"><img data-toggle="lightbox" alt="How the dbt MCP server fits between data sources and MCP‑enabled clients" title="How the dbt MCP server fits between data sources and MCP‑enabled clients" src="https://docs.getdbt.com/img/blog/2025-04-18-dbt-mcp-server/mcp_architecture_overview.png?v=2"></a></span><span class="title_aGrV">How the dbt MCP server fits between data sources and MCP‑enabled clients</span></div>
<p>❓Do I need to be a dbt Cloud customer to use the dbt MCP server?</p>
<ul>
<li>No. The MCP server includes functionality for both dbt Cloud and dbt Core users. Over time, Cloud-specific services will be built into the MCP server where they provide differentiated value.</li>
</ul>
<p>Let’s walk through examples of these and why each can be helpful in human-driven and agent-driven use cases:</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="using-the-dbt-mcp-server-for-data-asset-discovery"><strong>Using the dbt MCP Server for Data Asset Discovery</strong><a href="https://docs.getdbt.com/blog/introducing-dbt-mcp-server#using-the-dbt-mcp-server-for-data-asset-discovery" class="hash-link" aria-label="Direct link to using-the-dbt-mcp-server-for-data-asset-discovery" title="Direct link to using-the-dbt-mcp-server-for-data-asset-discovery">​</a></h2>
<p>dbt has knowledge about the data assets that exist across your entire data stack, from raw staging models to polished analytical marts. The dbt MCP server exposes this knowledge in a way that makes it accessible to LLMs and AI agents, enabling powerful discovery capabilities:</p>
<ul>
<li><strong>For human stakeholders</strong>: Learn about your production dbt project interactively through natural language. Business users can ask questions like "What customer data do we have?" or "Where do we store marketing spend information?" and receive accurate information based on your dbt project's documentation and structure.</li>
<li><strong>For AI agent workflows</strong>: Automatically discover and understand the available data models, their relationships, and their structures without human intervention. This allows agents to autonomously navigate complex data environments and produce accurate insights. This can be useful context for any agent that needs to operate on top of information in a data platform.</li>
</ul>
<p>The data discovery tools allow LLMs to understand what data exists, how it's structured, and how different data assets relate to each other. This contextual understanding is essential for generating accurate SQL, answering business questions, and providing trustworthy data insights.</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="data-asset-discovery-tools"><strong>Data Asset Discovery Tools:</strong><a href="https://docs.getdbt.com/blog/introducing-dbt-mcp-server#data-asset-discovery-tools" class="hash-link" aria-label="Direct link to data-asset-discovery-tools" title="Direct link to data-asset-discovery-tools">​</a></h3>
<p><em>Note: you do not need to invoke any of these tools directly in your workflow. Rather, the MCP client will use the context you have provided to determine the most appropriate tool to use at a given time.</em></p>
<div class="filterableTableContainer_mhtg"><table><thead><tr><th>Tool Name</th><th>Purpose</th><th>Output</th></tr></thead><tbody><tr><td><code>get_all_models</code></td><td>Provides a complete inventory of all models in the dbt project, regardless of type</td><td>List of all model names and their descriptions</td></tr><tr><td><code>get_mart_models</code></td><td>Identifies presentation-layer models specifically designed for end-user consumption</td><td>List of mart model names and descriptions (models in the reporting layer)</td></tr><tr><td><code>get_model_details</code></td><td>Retrieves comprehensive information about a specific model</td><td>Compiled SQL, description, column names, column descriptions, and column data types</td></tr><tr><td><code>get_model_parents</code></td><td>Identifies upstream dependencies for a specific model</td><td>List of parent models that the specified model depends on</td></tr></tbody></table></div>
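<p>To make the discovery flow concrete, here is a hedged Python sketch of how an agent might chain these tools: list the available models, then pull details for one of them. The tool names come from the table above; the client object and the sample project metadata are hypothetical stand-ins for a real MCP client session:</p>

```python
class FakeMCPClient:
    """Stand-in for an MCP client session; a real client would call the server."""
    def __init__(self, project):
        self.project = project  # hypothetical project metadata

    def call_tool(self, name, arguments=None):
        if name == "get_all_models":
            return [{"name": m, "description": d["description"]}
                    for m, d in self.project.items()]
        if name == "get_model_details":
            return self.project[arguments["model_name"]]
        raise ValueError(f"unknown tool: {name}")

# Hypothetical metadata an agent might discover in a dbt project.
project = {
    "stg_payments": {"description": "Staging model for raw payments",
                     "columns": ["id", "amount"]},
    "fct_orders": {"description": "Order facts for reporting",
                   "columns": ["order_id", "revenue"]},
}

client = FakeMCPClient(project)
# Step 1: discover what exists; Step 2: drill into one model.
models = client.call_tool("get_all_models")
details = client.call_tool("get_model_details", {"model_name": "fct_orders"})
```

This two-step pattern (inventory, then drill-down) is what lets an agent answer "what customer data do we have?" without any hand-written catalog.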
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="using-the-dbt-mcp-server-for-querying-data-via-the-dbt-semantic-layer"><strong>Using the dbt MCP server for querying data via the dbt Semantic Layer</strong><a href="https://docs.getdbt.com/blog/introducing-dbt-mcp-server#using-the-dbt-mcp-server-for-querying-data-via-the-dbt-semantic-layer" class="hash-link" aria-label="Direct link to using-the-dbt-mcp-server-for-querying-data-via-the-dbt-semantic-layer" title="Direct link to using-the-dbt-mcp-server-for-querying-data-via-the-dbt-semantic-layer">​</a></h2>
<p>The <a href="https://www.getdbt.com/product/semantic-layer" target="_blank" rel="noopener noreferrer">dbt Semantic Layer</a> defines your organization's metrics and dimensions in a consistent, governed way. With the dbt MCP server, LLMs can understand and query these metrics directly, ensuring that AI-generated analyses are consistent with your organization's definitions.</p>
<ul>
<li><strong>For human stakeholders</strong>: Request metrics using natural language. Users can ask for "monthly revenue by region" and get accurate results that match your organization's standard metric definitions, with a <a href="https://roundup.getdbt.com/p/semantic-layer-as-the-data-interface" target="_blank" rel="noopener noreferrer">higher baseline of accuracy than LLM generated SQL queries</a>.</li>
<li><strong>For AI agent workflows</strong>: As agentic systems take action in the real world over a longer time horizon, they will need ways to understand the underlying reality of your business. From feeding into deep research style reports to feeding operational agents, the dbt Semantic Layer can provide a trusted underlying interface for LLM systems.</li>
</ul>
<p>By leveraging the dbt Semantic Layer through the MCP server, you ensure that LLM-generated analyses are based on rigorous definitions instantiated as code, flexibly available in any MCP-supported client.</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="semantic-layer-tools">Semantic Layer Tools:<a href="https://docs.getdbt.com/blog/introducing-dbt-mcp-server#semantic-layer-tools" class="hash-link" aria-label="Direct link to Semantic Layer Tools:" title="Direct link to Semantic Layer Tools:">​</a></h3>
<div class="filterableTableContainer_mhtg"><table><thead><tr><th>Tool Name</th><th>Purpose</th><th>Output</th></tr></thead><tbody><tr><td><code>list_metrics</code></td><td>Provides an inventory of all available metrics in the dbt Semantic Layer</td><td>Complete list of metric names, types, labels, and descriptions</td></tr><tr><td><code>get_dimensions</code></td><td>Identifies available dimensions for specified metrics</td><td>List of dimensions that can be used to group/filter the specified metrics</td></tr><tr><td><code>query_metrics</code></td><td>Executes queries against metrics in the dbt Semantic Layer</td><td>Query results based on specified metrics, dimensions, and filters</td></tr></tbody></table></div>
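<p>For example, the "monthly revenue by region" request from earlier might translate into a <code>query_metrics</code> call along these lines. The tool name comes from the table above; the argument names and the metric/dimension values are illustrative assumptions, not the server's exact schema:</p>

```python
def build_metric_query(metrics, group_by, where=None):
    """Assemble arguments for a hypothetical query_metrics tool call."""
    args = {"metrics": metrics, "group_by": group_by}
    if where:
        args["where"] = where
    return {"name": "query_metrics", "arguments": args}

# "Monthly revenue by region": one metric, grouped by a time grain
# and a dimension (names are placeholders for your project's definitions).
call = build_metric_query(
    metrics=["revenue"],
    group_by=["metric_time__month", "customer__region"],
)
```

Because the metric and dimension definitions live in the Semantic Layer, the LLM only has to pick names from <code>list_metrics</code> and <code>get_dimensions</code> rather than reinvent the aggregation logic in SQL.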
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="using-the-dbt-mcp-server-for-sql-execution-to-power-text-to-sql"><strong>Using the dbt MCP server for SQL execution to power text to sql</strong><a href="https://docs.getdbt.com/blog/introducing-dbt-mcp-server#using-the-dbt-mcp-server-for-sql-execution-to-power-text-to-sql" class="hash-link" aria-label="Direct link to using-the-dbt-mcp-server-for-sql-execution-to-power-text-to-sql" title="Direct link to using-the-dbt-mcp-server-for-sql-execution-to-power-text-to-sql">​</a></h2>
<p>While the dbt Semantic Layer provides a governed, metrics-based approach to data querying, there are many analytical needs that require more flexible, exploratory SQL queries. The dbt MCP server will soon include SQL validation and querying capabilities with rich context awareness.</p>
<ul>
<li><strong>For human stakeholders</strong>: Ask complex analytical questions that go beyond predefined metrics. Users can explore data freely while still benefiting from the LLM's understanding of their specific data models, ensuring that generated SQL is correct and optimized for your environment.</li>
<li><strong>For AI agent workflows</strong>: Generate and validate SQL against your data models automatically. Agents can create and execute complex queries that adapt to schema changes, optimize for performance, and follow your organization's SQL patterns and conventions.</li>
</ul>
<p>Unlike traditional SQL generation, queries created through the dbt MCP server will be aware of your specific data models, making them more accurate and useful for your particular environment. This capability is particularly valuable for data exploration, one-off analyses, and prototype development that might later be incorporated into your dbt project.</p>
<p>Currently, SQL execution is managed through the dbt show command; over the near term we expect to release tooling that is more performant and fit for this precise use case.</p>
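<p>The <code>dbt show</code> command can preview the results of an inline query, which is what makes it a workable stopgap for SQL execution. A small sketch of how a tool wrapper might assemble that invocation (<code>--inline</code> and <code>--limit</code> are real dbt CLI flags; the wrapper function itself is illustrative):</p>

```python
def dbt_show_command(sql, limit=5):
    """Build a `dbt show` invocation that previews an inline SQL query."""
    return ["dbt", "show", "--inline", sql, "--limit", str(limit)]

# The query can use Jinja, so ref() keeps it tied to your project's lineage.
cmd = dbt_show_command("select order_id, revenue from {{ ref('fct_orders') }}")
# A real tool would pass `cmd` to subprocess.run and return the preview output.
```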
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="using-the-dbt-mcp-server-for-project-execution"><strong>Using the dbt MCP server for project execution</strong><a href="https://docs.getdbt.com/blog/introducing-dbt-mcp-server#using-the-dbt-mcp-server-for-project-execution" class="hash-link" aria-label="Direct link to using-the-dbt-mcp-server-for-project-execution" title="Direct link to using-the-dbt-mcp-server-for-project-execution">​</a></h2>
<p>The dbt MCP server doesn't just provide access to data—it also allows LLMs and AI agents to interact directly with dbt, executing commands and managing your project.</p>
<ul>
<li><strong>For human stakeholders</strong>: Trigger dbt commands through conversational interfaces without CLI knowledge. Users can ask to "run the daily models" or "test the customer models" and get clear explanations of the results, including suggestions for fixing any issues that arise.</li>
<li><strong>For AI agent workflows</strong>: Autonomously run dbt processes in response to events. Agents can manage project execution, automatically test and validate model changes, and even debug common issues without human intervention.</li>
</ul>
<p>While the discovery and query tools operate on top of <em>environments</em> as the context source, the execution tools interact directly with the CLI, whether dbt Core or the dbt Cloud CLI.</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="project-execution-tools">Project Execution Tools<a href="https://docs.getdbt.com/blog/introducing-dbt-mcp-server#project-execution-tools" class="hash-link" aria-label="Direct link to Project Execution Tools" title="Direct link to Project Execution Tools">​</a></h3>
<div class="filterableTableContainer_mhtg"><table><thead><tr><th>Tool Name</th><th>Purpose</th><th>Output</th></tr></thead><tbody><tr><td><code>build</code></td><td>Executes the dbt build command to build the entire project</td><td>Results of the build process including success/failure status and logs</td></tr><tr><td><code>compile</code></td><td>Executes the dbt compile command to compile the project's SQL</td><td>Results of the compilation process including success/failure status and logs</td></tr><tr><td><code>list</code></td><td>Lists all resources in the dbt project</td><td>Structured list of resources within the project</td></tr><tr><td><code>parse</code></td><td>Parses the dbt project files</td><td>Results of the parsing process including success/failure status and logs</td></tr><tr><td><code>run</code></td><td>Executes the dbt run command to run models in the project</td><td>Results of the run process including success/failure status and logs</td></tr><tr><td><code>test</code></td><td>Executes tests defined in the dbt project</td><td>Results of test execution including success/failure status and logs</td></tr></tbody></table></div>
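<p>A request like "run the daily models" ultimately has to bottom out in a concrete CLI invocation. Here is a hedged sketch of the kind of translation an LLM performs before calling the <code>run</code> or <code>test</code> tools. The <code>--select</code> syntax is standard dbt node selection; the tag and model names are hypothetical:</p>

```python
def to_dbt_args(command, select=None):
    """Translate a tool call into dbt CLI arguments."""
    args = ["dbt", command]
    if select:
        args += ["--select", select]  # standard dbt node-selection syntax
    return args

# "Run the daily models" -> run everything tagged `daily` (hypothetical tag).
run_daily = to_dbt_args("run", select="tag:daily")
# "Test the customer models" -> test `customers` and its downstream children.
test_customers = to_dbt_args("test", select="customers+")
```

This is also where scoped permissions matter most: an agent that can emit <code>dbt run</code> against production needs far tighter guardrails than one that only reads metadata.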
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="getting-started">Getting Started<a href="https://docs.getdbt.com/blog/introducing-dbt-mcp-server#getting-started" class="hash-link" aria-label="Direct link to Getting Started" title="Direct link to Getting Started">​</a></h2>
<p>The dbt MCP server is now available as an experimental release. To get started:</p>
<ol>
<li>Clone the repository from GitHub: <a href="https://github.com/dbt-labs/dbt-mcp" target="_blank" rel="noopener noreferrer">dbt-labs/dbt-mcp</a></li>
<li>Follow the installation instructions in the README</li>
<li>Connect your dbt project and start exploring the capabilities</li>
</ol>
<p>We're excited to see how the community builds with and extends the dbt MCP server. Whether you're building an AI-powered BI tool, an autonomous data agent, or just exploring the possibilities of LLMs in your data workflows, the dbt MCP server provides a solid foundation for bringing your dbt context to AI applications.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="what-is-the-best-workflow-for-the-current-iteration-of-the-mcp-server">What is the best workflow for the current iteration of the MCP server?<a href="https://docs.getdbt.com/blog/introducing-dbt-mcp-server#what-is-the-best-workflow-for-the-current-iteration-of-the-mcp-server" class="hash-link" aria-label="Direct link to What is the best workflow for the current iteration of the MCP server?" title="Direct link to What is the best workflow for the current iteration of the MCP server?">​</a></h2>
<p>This early release is primarily meant to be used <em>on top of an existing dbt project to answer questions about your data and metadata -</em> roughly tracking towards the set of use cases described in this <a href="https://roundup.getdbt.com/p/how-ai-will-disrupt-bi-as-we-know" target="_blank" rel="noopener noreferrer">post</a> on the future of BI and data consumption.</p>
<p><em>Chat use case:</em></p>
<p>We suggest using Claude Desktop for this and creating a custom <a href="https://www.anthropic.com/news/projects" target="_blank" rel="noopener noreferrer">project</a> that includes a prompt explaining the use cases you are looking to cover.</p>
<p>To get this working:</p>
<ul>
<li>Follow the instructions in the README to install the MCP server</li>
<li>Validate that you have added the MCP config to your Claude Desktop config file. You should see ‘dbt’ when you go to Claude → Settings → Developer</li>
</ul>
<div class="docImage_EYbW" style="max-width:100%"><span><a href="https://docs.getdbt.com/blog/introducing-dbt-mcp-server#" data-featherlight="/img/blog/2025-04-18-dbt-mcp-server/claudedesktop_settings_dbt_mcp.png"><img data-toggle="lightbox" alt="Claude Desktop – MCP server running in Developer settings" title="Claude Desktop – MCP server running in Developer settings" src="https://docs.getdbt.com/img/blog/2025-04-18-dbt-mcp-server/claudedesktop_settings_dbt_mcp.png?v=2"></a></span><span class="title_aGrV">Claude Desktop – MCP server running in Developer settings</span></div>
<ul>
<li>Create a new project called “analytics”. Give it a description of how an end user might interact with it.</li>
</ul>
<div class="docImage_EYbW" style="max-width:100%"><span><a href="https://docs.getdbt.com/blog/introducing-dbt-mcp-server#" data-featherlight="/img/blog/2025-04-18-dbt-mcp-server/claudedesktop_project_card.png"><img data-toggle="lightbox" alt="Example Claude Desktop project connected to the dbt MCP server" title="Example Claude Desktop project connected to the dbt MCP server" src="https://docs.getdbt.com/img/blog/2025-04-18-dbt-mcp-server/claudedesktop_project_card.png?v=2"></a></span><span class="title_aGrV">Example Claude Desktop project connected to the dbt MCP server</span></div>
<ul>
<li><strong>Add a custom prompt explaining that questions in this project will likely be routed through the dbt MCP server.</strong> You’ll likely want to customize this to your particular organizational context.<!-- -->
<ul>
<li>For example: This conversation is connected to and knows about the information in your dbt Project via the dbt MCP server. When you receive a question that plausibly needs data from an external data source, you will likely want to use the tools available via the dbt MCP server to provide it.</li>
</ul>
</li>
</ul>
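<p>For reference, Claude Desktop registers MCP servers under an <code>mcpServers</code> key in its <code>claude_desktop_config.json</code> file. A Python sketch of generating that entry; the launch command, arguments, and environment variable names below are placeholders, so follow the repository README for the exact values:</p>

```python
import json

def dbt_mcp_entry(command, args, env):
    """Build the `mcpServers` config entry Claude Desktop expects."""
    return {"mcpServers": {"dbt": {"command": command, "args": args, "env": env}}}

config = dbt_mcp_entry(
    command="uv",                                # placeholder launcher
    args=["run", "dbt-mcp"],                     # placeholder args; see the README
    env={"DBT_PROJECT_DIR": "/path/to/project"}, # placeholder env var
)
print(json.dumps(config, indent=2))
```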
<p><em>Deployment considerations:</em></p>
<ul>
<li>This is an <em>experimental release</em>. We recommend that initial use should be focused on prototyping and proving value before rolling out widely across your organization.</li>
<li>Be particularly mindful with the project execution tools - remember that LLMs make mistakes and begin with permissions scoped so that you can experiment but not disrupt your data operations.</li>
<li>Start with the smallest possible use case that provides tangible value. Instead of giving this access to your entire production dbt Project, consider creating an upstream project that inherits a smaller subset of models and metrics that will power the workflow.</li>
<li>Tool selection is not yet perfectly reliable. In our testing, the model will sometimes cycle through several unnecessary tool calls or call them in the wrong order. While this can usually be fixed with more specific prompting by the end user, that goes against the spirit of letting the model dynamically select the right tool for the job. We expect this to be addressed over time via improvements in the dbt MCP server, as well as in client interfaces and the protocol itself.</li>
<li>Think carefully about when to use the Semantic Layer tools versus the SQL execution tool. SQL execution is powerful but less controllable. We’re doing a lot of hands-on testing to develop heuristics about when SQL execution is the best option, when to bake logic into the Semantic Layer, and whether new abstractions might be needed for AI workflows.</li>
<li>Tool use is powerful because it can link multiple tools together. What tools complement the dbt MCP Server? How can we use this to tie our structured data into other workflows?</li>
</ul>
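<p>One practical way to apply the scoping advice above is to expose only read-oriented tools to an experimental agent and keep the execution tools behind an explicit opt-in. A minimal sketch: the tool names come from the tables in this post, while the gating function itself is illustrative:</p>

```python
# Discovery and querying tools read metadata or data but change nothing.
READ_ONLY_TOOLS = {
    "get_all_models", "get_mart_models", "get_model_details",
    "get_model_parents", "list_metrics", "get_dimensions", "query_metrics",
}
# Execution tools can mutate warehouse state via the dbt CLI.
EXECUTION_TOOLS = {"build", "compile", "list", "parse", "run", "test"}

def allowed_tools(allow_execution=False):
    """Return the tool names an agent may call in this environment."""
    return READ_ONLY_TOOLS | (EXECUTION_TOOLS if allow_execution else set())

sandbox = allowed_tools()  # discovery and querying only, no `dbt run`
```

Starting from a read-only allowlist, then widening it per environment, mirrors the "experiment without disrupting your data operations" guidance above.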
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="the-future-of-the-dbt-mcp-and-the-correct-layers-of-abstraction-for-interfacing-with-your-data">The future of the dbt MCP and the correct layers of abstraction for interfacing with your data<a href="https://docs.getdbt.com/blog/introducing-dbt-mcp-server#the-future-of-the-dbt-mcp-and-the-correct-layers-of-abstraction-for-interfacing-with-your-data" class="hash-link" aria-label="Direct link to The future of the dbt MCP and the correct layers of abstraction for interfacing with your data" title="Direct link to The future of the dbt MCP and the correct layers of abstraction for interfacing with your data">​</a></h2>
<p>We are in the <em>very</em> early days of MCP as a protocol and determining how best to connect your structured data to LLM systems. This is an extremely exciting, dynamic time where we are working out, in real time, how to best serve this data and context.</p>
<p>We have high confidence that the approach of serving context to your AI systems via dbt will prove a durable piece of this stack. As we work with the Community on implementing this in real-world use cases, the details of the implementation, and how you access it, are likely to change. Here are some of the areas where we expect this to evolve.</p>
<p><strong>Determining the best source of context for the dbt MCP</strong>
You’ll notice that these tools have two broad information inputs: dbt Cloud APIs and the dbt CLI. We expect to continue to build on both, with dbt Cloud APIs serving as the abstraction of choice when it is desirable to operate off of a specific <a href="https://docs.getdbt.com/docs/dbt-cloud-environments" target="_blank" rel="noopener noreferrer">environment</a>.</p>
<p>There will be other use cases, particularly for dbt development, where you’ll want to operate off of your current working context. We’ll be releasing tooling for that in the near future (and welcome Community-submitted ideas and contributions). We’re eager to try out alternative methods here and to hear from the Community how you would like this context loaded in. Please feel free to experiment and share your findings with us.</p>
<p><strong>Determining the most useful tools for the dbt MCP</strong></p>
<p>What are the best and most useful set of tools to enable human in the loop and AI driven LLM access to structured data? The dbt MCP server presents our early explorations, but we anticipate that the Community will find many more.</p>
<p><strong>How to handle hosting, authentication, RBAC and more</strong></p>
<p>Currently the dbt MCP server is locally hosted, with access managed via scoped service tokens from dbt Cloud or configured locally via your CLI. We expect to continue building out systems at three levels to make this not only safe and secure, but also tailored to the needs of the specific user (human or agent) accessing the MCP server.</p>
<ol>
<li>Hosting of the MCP server: in the near future we will offer a Cloud-hosted version alongside the current local MCP server.</li>
<li>Managing data access with the MCP server: we are committed to offering safe and trustworthy access to data and data assets (think OAuth support and more).</li>
<li>User- and domain-level context: over the longer run we are looking into ways to provide user- and domain-specific knowledge about your data assets to these systems as they query them.</li>
</ol>
<p>Expect to hear more on this front on <a href="https://www.getdbt.com/resources/webinars/2025-dbt-cloud-launch-showcase" target="_blank" rel="noopener noreferrer">5/28</a>.</p>
<p>This is a new frontier for the whole Community. We need to be having open, honest discussions about how to integrate these systems into our existing workflows and open up new use cases.</p>
<p>To join the conversation, head over to #tools-dbt-mcp in the dbt Community Slack.</p>]]></content>
        <author>
            <name>Jason Ganz</name>
        </author>
        <category label="ai" term="ai"/>
        <category label="data ecosystem" term="data ecosystem"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Establishing dbt Cloud: Securing your account through SSO & RBAC]]></title>
        <id>https://docs.getdbt.com/blog/dbt-cloud-sso-rbac</id>
        <link href="https://docs.getdbt.com/blog/dbt-cloud-sso-rbac"/>
        <updated>2025-04-17T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[How to configure dbt Cloud with SSO & RBAC]]></summary>
        <content type="html"><![CDATA[<p>As a dbt Cloud admin, you’ve just upgraded to dbt Cloud on the <a href="https://www.getdbt.com/pricing" target="_blank" rel="noopener noreferrer">Enterprise plan</a> - <strong>congrats</strong>! dbt Cloud has a lot to offer such as <a href="https://docs.getdbt.com/docs/deploy/about-ci">CI/CD</a>, <a href="https://docs.getdbt.com/docs/deploy/deployments">Orchestration</a>, <a href="https://docs.getdbt.com/docs/explore/explore-projects">dbt Explorer</a>, <a href="https://docs.getdbt.com/docs/use-dbt-semantic-layer/dbt-sl">dbt Semantic Layer</a>, <a href="https://docs.getdbt.com/best-practices/how-we-mesh/mesh-1-intro">dbt Mesh</a>, <a href="https://docs.getdbt.com/docs/cloud/canvas">Visual Editor</a>, <a href="https://docs.getdbt.com/docs/cloud/dbt-copilot">dbt Copilot</a>, and so much more. <em><strong>But where should you begin?</strong></em></p>
<p>We strongly recommend that, as you start adopting dbt Cloud functionality, you make it a priority to set up Single Sign-On (SSO) and Role-Based Access Control (RBAC). This foundational step enables your organization to keep your data pipelines secure, onboard users into dbt Cloud with ease, and optimize cost savings for the long term.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="authentication-vs-authorization">Authentication vs. Authorization<a href="https://docs.getdbt.com/blog/dbt-cloud-sso-rbac#authentication-vs-authorization" class="hash-link" aria-label="Direct link to Authentication vs. Authorization" title="Direct link to Authentication vs. Authorization">​</a></h2>
<p>Before we dig into SSO, RBAC, and more — let’s go over how they map into two foundational security concepts.</p>
<ul>
<li><strong>Authentication:</strong> <a href="https://docs.getdbt.com/blog/dbt-cloud-sso-rbac#single-sign-on-sso">SSO</a> is configured to gate authentication - it verifies (via an IdP) that users are who they say they are and can log into the specified dbt Cloud account.</li>
<li><strong>Authorization:</strong> <a href="https://docs.getdbt.com/blog/dbt-cloud-sso-rbac#role-based-access-control-via-idp">RBAC</a> is an authorization model - it controls what users can see and do within dbt Cloud based on their assigned licenses, groups, and permission sets.</li>
</ul>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="single-sign-on-sso">Single Sign-On (SSO)<a href="https://docs.getdbt.com/blog/dbt-cloud-sso-rbac#single-sign-on-sso" class="hash-link" aria-label="Direct link to Single Sign-On (SSO)" title="Direct link to Single Sign-On (SSO)">​</a></h2>
<p>Your SSO configuration steps will depend on your IdP, so we encourage you to start at our <a href="https://docs.getdbt.com/docs/cloud/manage-access/sso-overview">SSO Overview</a> page and find the doc under that section specific to your IdP.</p>
<p>Regardless of what IdP you use, one of the first things you should do as a dbt Cloud admin is set the <strong>login slug</strong> value. This should be a <em>unique company identifier</em>.</p>
<p>Keep in mind that whatever value you set will be appended to the end of the SSO login URL your users use to sign into dbt Cloud. For example:</p>
<ul>
<li>If I set my login slug to <code>mynewco</code></li>
<li>My SSO login URL will look something like <code>https://cloud.getdbt.com/enterprise-login/mynewco</code>.</li>
</ul>
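<p>As a quick illustration, the login URL is just your slug appended to a fixed path. The Python helper below is hypothetical (dbt Cloud doesn’t ship such a function), and the host may differ if your account is on a region-specific or single-tenant instance:</p>

```python
def sso_login_url(slug: str, host: str = "cloud.getdbt.com") -> str:
    """Build the SP-initiated SSO login URL for a dbt Cloud login slug.

    Hypothetical helper for illustration only; `host` varies for
    region-specific or single-tenant dbt Cloud instances.
    """
    return f"https://{host}/enterprise-login/{slug}"

print(sso_login_url("mynewco"))
# https://cloud.getdbt.com/enterprise-login/mynewco
```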
<p>At first glance, this screen has a lot of info and fields — but with the <a href="https://docs.getdbt.com/docs/cloud/manage-access/sso-overview">SSO docs</a> in hand, dbt Cloud admins are ready to start setting up smooth, scalable workflows.</p>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:85%"><span><a href="https://docs.getdbt.com/blog/dbt-cloud-sso-rbac#" data-featherlight="/img/blog/2025-04-10-sso-and-rbac/1_sso_config.png"><img data-toggle="lightbox" alt="dbt Cloud's SSO configuration page" title="dbt Cloud's SSO configuration page" src="https://docs.getdbt.com/img/blog/2025-04-10-sso-and-rbac/1_sso_config.png?v=2"></a></span><span class="title_aGrV">dbt Cloud's SSO configuration page</span></div>
<p>Let’s break this down at a high level to make it more digestible:</p>
<ol>
<li>After setting the desired login slug, a <em>dbt Cloud admin</em> will go to the dbt Cloud SSO configuration page and copy/paste everything under the <strong>Identity provider values</strong> section and will share the values with the <em>IdP admin</em>.</li>
<li>The <em>IdP admin</em> will create a <a href="https://docs.getdbt.com/docs/cloud/manage-access/set-up-sso-saml-2.0#creating-the-application">dbt Cloud app</a> and then provide the values under the <strong>dbt configuration</strong> section to the <em>dbt Cloud admin</em>.<!-- -->
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>tip</div><div class="admonitionContent_BuS1"><p>Refer to the appropriate setup docs for <a href="https://docs.getdbt.com/docs/cloud/manage-access/set-up-sso-google-workspace">Google Workspace</a>, <a href="https://docs.getdbt.com/docs/cloud/manage-access/set-up-sso-okta">Okta</a>, <a href="https://docs.getdbt.com/docs/cloud/manage-access/set-up-sso-microsoft-entra-id">Microsoft Entra ID</a>, or <a href="https://docs.getdbt.com/docs/cloud/manage-access/set-up-sso-saml-2.0">SAML 2.0</a>.</p></div></div>
</li>
<li>The <em>dbt Cloud admin</em> will fill in those values into the SSO configuration page under the <strong>dbt configuration</strong> section and click <strong>Save</strong> to complete the process.</li>
</ol>
<p>After completing this process:</p>
<ul>
<li>We <em>strongly</em> advise validating that the SSO flow works: paste the SSO login URL (it should look like <code>https://cloud.getdbt.com/enterprise-login/dbtlabs</code>) into a private browser window and try to log into your account via the IdP.</li>
<li>If the SSO flow isn’t working as expected, an account admin will still be able to log in with a password to correct the configuration.</li>
</ul>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>tip</div><div class="admonitionContent_BuS1"><p>Be aware of our <a href="https://docs.getdbt.com/docs/cloud/manage-access/sso-overview#sso-enforcement" target="_blank" rel="noopener noreferrer">SSO enforcement policy</a> — once SSO is configured, all non-admin users will have to log in via SSO as a security best practice, while account admins, by default, can still authenticate with a password in lieu of <a href="https://docs.getdbt.com/docs/cloud/manage-access/mfa">multi-factor authentication (MFA)</a>.</p></div></div>
<p>Once you've set up SSO successfully, you have additional ways to onboard your users into dbt Cloud on top of sending out an email invite:</p>
<ul>
<li>Provide users the SSO login URL to access dbt Cloud. This is also known as the <em>SP-initiated flow</em> (SP stands for Service Provider; in this case, it would be dbt Cloud).</li>
<li>Provision the dbt Cloud app for users to access on their IdP’s dashboard. This is also known as the <em>IdP-initiated flow</em>.</li>
</ul>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:85%"><span><a href="https://docs.getdbt.com/blog/dbt-cloud-sso-rbac#" data-featherlight="/img/blog/2025-04-10-sso-and-rbac/2_sso_flows.png"><img data-toggle="lightbox" alt="SSO flows into dbt Cloud" title="SSO flows into dbt Cloud" src="https://docs.getdbt.com/img/blog/2025-04-10-sso-and-rbac/2_sso_flows.png?v=2"></a></span><span class="title_aGrV">SSO flows into dbt Cloud</span></div>
<p>Get stuck setting up SSO? <a href="mailto:support@getdbt.com" target="_blank" rel="noopener noreferrer">Open a support ticket</a>, and one of our Customer Solutions Engineers will be happy to help you!</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="licenses-and-groups">Licenses and Groups<a href="https://docs.getdbt.com/blog/dbt-cloud-sso-rbac#licenses-and-groups" class="hash-link" aria-label="Direct link to Licenses and Groups" title="Direct link to Licenses and Groups">​</a></h2>
<p>In dbt Cloud, there are two main levers to control user access:</p>
<ul>
<li><a href="https://docs.getdbt.com/blog/dbt-cloud-sso-rbac#licenses">Licenses</a></li>
<li><a href="https://docs.getdbt.com/blog/dbt-cloud-sso-rbac#groups">Groups</a></li>
</ul>
<p>As a prerequisite, both of these should be set <em>before</em> configuring RBAC. Let’s get into them!</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="licenses">Licenses<a href="https://docs.getdbt.com/blog/dbt-cloud-sso-rbac#licenses" class="hash-link" aria-label="Direct link to Licenses" title="Direct link to Licenses">​</a></h3>
<p>There are three <a href="https://docs.getdbt.com/docs/cloud/manage-access/seats-and-users">license types</a> in dbt Cloud:</p>
<ul>
<li><strong>Developer:</strong> User can be granted&nbsp;<em>any</em>&nbsp;permissions.</li>
<li><strong>Read-Only:</strong> User has read-only permissions applied to all dbt Cloud resources regardless of the role-based permissions that the user is assigned.</li>
<li><strong>IT:</strong> User has Security Admin and Billing Admin&nbsp;permissions&nbsp;applied, regardless of the group permissions assigned.</li>
</ul>
<p>Odds are that the majority of your users will be developers or analysts who’ll need Developer licenses. You can assign default licenses to users based on the groups that they’re in on the IdP side under <strong>Account Settings</strong> --&gt; <strong>Groups &amp; Licenses</strong> --&gt; <strong>License mappings</strong>.</p>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:85%"><span><a href="https://docs.getdbt.com/blog/dbt-cloud-sso-rbac#" data-featherlight="/img/blog/2025-04-10-sso-and-rbac/3_license_mapping_example.png"><img data-toggle="lightbox" alt="An example license mapping" title="An example license mapping" src="https://docs.getdbt.com/img/blog/2025-04-10-sso-and-rbac/3_license_mapping_example.png?v=2"></a></span><span class="title_aGrV">An example license mapping</span></div>
<p>If a user is in multiple groups with different license types assigned, they will be granted the highest license type — Developer.</p>
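<p>In other words, license resolution takes the maximum over a precedence order. Here’s a minimal sketch of that rule. Note that the relative ranking of <code>IT</code> and <code>Read-Only</code> below is an assumption for illustration; the only documented rule is that Developer is the highest:</p>

```python
# Illustrative sketch of how a user's effective license could be resolved
# across multiple group mappings. Only "Developer is the highest license
# type" comes from the docs; the IT vs. Read-Only ordering is assumed here.
LICENSE_RANK = {"Read-Only": 0, "IT": 1, "Developer": 2}

def effective_license(licenses_from_groups):
    """Return the highest-ranked license among a user's group mappings."""
    return max(licenses_from_groups, key=LICENSE_RANK.__getitem__)

print(effective_license(["Read-Only", "Developer"]))  # Developer
```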
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="groups">Groups<a href="https://docs.getdbt.com/blog/dbt-cloud-sso-rbac#groups" class="hash-link" aria-label="Direct link to Groups" title="Direct link to Groups">​</a></h3>
<p>Groups are used to manage permissions. They define what a user can see and do across projects and environments. We recommend reviewing our <a href="https://docs.getdbt.com/docs/cloud/manage-access/enterprise-permissions" target="_blank" rel="noopener noreferrer">available permissions sets</a> and determining which are applicable to your dbt Cloud user base.</p>
<p>Keep in mind that group permissions are additive — if a user belongs to multiple groups, they’ll inherit all of the permissions assigned to each of those groups.</p>
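<p>Conceptually, additive permissions behave like a set union across the user’s groups. A minimal sketch (the group and permission names here are made up for illustration):</p>

```python
def effective_permissions(group_permissions, memberships):
    """Union the permission sets of every group a user belongs to."""
    perms = set()
    for group in memberships:
        perms |= group_permissions.get(group, set())
    return perms

# Hypothetical groups and permission names, for illustration only
group_permissions = {
    "analysts": {"view-metadata", "develop"},
    "job-admins": {"manage-jobs"},
}
print(sorted(effective_permissions(group_permissions, ["analysts", "job-admins"])))
# ['develop', 'manage-jobs', 'view-metadata']
```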
<p>Navigating to the Groups &amp; Licenses page in dbt Cloud, you’ll see three default groups — Everyone, Member, and Owner. There’s also an option to create your own groups in the top right.</p>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:85%"><span><a href="https://docs.getdbt.com/blog/dbt-cloud-sso-rbac#" data-featherlight="/img/blog/2025-04-10-sso-and-rbac/4_default_dbt_cloud_groups.png"><img data-toggle="lightbox" alt="The out-of-the-box dbt Cloud groups you may use" title="The out-of-the-box dbt Cloud groups you may use" src="https://docs.getdbt.com/img/blog/2025-04-10-sso-and-rbac/4_default_dbt_cloud_groups.png?v=2"></a></span><span class="title_aGrV">The out-of-the-box dbt Cloud groups you may use</span></div>
<p>Here’s a brief primer on the default groups:</p>
<ul>
<li><strong>Owner:</strong> This group is for individuals responsible for the entire account and will give them elevated account admin privileges. You cannot change the permissions.</li>
<li><strong>Member:</strong> This group is for the general members of your organization, who will also have full developer access to the account. You cannot change the permissions. By default, dbt Cloud adds new users to this group.</li>
<li><strong>Everyone:</strong> A general group for all members of your organization. Customize the permissions to fit your organizational needs. By default, dbt Cloud adds new users to this group and only grants user access to their personal profile.</li>
</ul>
<p>While we recommend creating your own groups and deleting the defaults to better tailor access to your business’s needs, you should only delete the defaults <em>after</em> your own groups have been created and permission sets have been associated with them. These default groups are a means of getting users started in dbt Cloud. To sum up what they do: the Owner group gives users full account admin access, while the Everyone and Member groups give users full developer access.</p>
<p>To help get you started, these are the main permission sets that should be assigned to most users:</p>
<table><thead><tr><th><strong>User persona</strong></th><th><strong>Permission set</strong></th></tr></thead><tbody><tr><td>dbt Cloud Admin</td><td>Account Admin</td></tr><tr><td>dbt Developer</td><td>Developer</td></tr><tr><td>dbt Analyst</td><td>Analyst</td></tr></tbody></table>
<p>You can also use groups to control which projects and environments users can access.</p>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:85%"><span><a href="https://docs.getdbt.com/blog/dbt-cloud-sso-rbac#" data-featherlight="/img/blog/2025-04-10-sso-and-rbac/5_new_dbt_cloud_group.png"><img data-toggle="lightbox" alt="Creating a new dbt Cloud group" title="Creating a new dbt Cloud group" src="https://docs.getdbt.com/img/blog/2025-04-10-sso-and-rbac/5_new_dbt_cloud_group.png?v=2"></a></span><span class="title_aGrV">Creating a new dbt Cloud group</span></div>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="role-based-access-control-via-idp">Role-Based Access Control via IdP<a href="https://docs.getdbt.com/blog/dbt-cloud-sso-rbac#role-based-access-control-via-idp" class="hash-link" aria-label="Direct link to Role-Based Access Control via IdP" title="Direct link to Role-Based Access Control via IdP">​</a></h2>
<p>If you made it this far, thanks for staying with me! We’re now ready to configure RBAC, which assigns users to the right groups (and effectively the right permission sets) after they authenticate into dbt Cloud. This hinges on the <em>SSO group mapping(s)</em> you’ll find within a group.</p>
<p>As an example, let’s say I want specific users placed in this group, whose SSO group mapping is <code>dbt-developer</code>. Note that you can also specify more than one mapping.</p>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:85%"><span><a href="https://docs.getdbt.com/blog/dbt-cloud-sso-rbac#" data-featherlight="/img/blog/2025-04-10-sso-and-rbac/6_sso_group_mapping.png"><img data-toggle="lightbox" alt="Configuring a SSO group mapping within a group" title="Configuring a SSO group mapping within a group" src="https://docs.getdbt.com/img/blog/2025-04-10-sso-and-rbac/6_sso_group_mapping.png?v=2"></a></span><span class="title_aGrV">Configuring a SSO group mapping within a group</span></div>
<p>Here’s what we do to make it happen:</p>
<ol>
<li>Have your IdP admin create a <code>dbt-developer</code> group in the IdP.</li>
<li>Assign users who should be in the dbt Cloud group to that IdP group.</li>
<li>Have users sign into dbt Cloud to confirm they get assigned to that group.</li>
</ol>
<p>Easy enough, right? Just make sure these two conditions are met for RBAC to work properly between your IdP and dbt Cloud:</p>
<ul>
<li>Group names must match exactly</li>
<li>Group name casing must be identical</li>
</ul>
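<p>Because matching is exact and case-sensitive, it behaves like plain string equality with no trimming or case folding. A quick sketch (the group names are hypothetical):</p>

```python
def matched_mappings(idp_groups, sso_group_mappings):
    """Return the IdP groups that match a dbt Cloud SSO group mapping.

    Matching is plain, case-sensitive string equality: no trimming
    and no case folding.
    """
    mappings = set(sso_group_mappings)
    return [group for group in idp_groups if group in mappings]

print(matched_mappings(["dbt-developer", "DBT-Developer"], ["dbt-developer"]))
# ['dbt-developer']  -- the casing variant does not match
```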
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:85%"><span><a href="https://docs.getdbt.com/blog/dbt-cloud-sso-rbac#" data-featherlight="/img/blog/2025-04-10-sso-and-rbac/7_okta_sso_group_mapping_example.png"><img data-toggle="lightbox" alt="Making an SSO group mapping work with your identity provider" title="Making an SSO group mapping work with your identity provider" src="https://docs.getdbt.com/img/blog/2025-04-10-sso-and-rbac/7_okta_sso_group_mapping_example.png?v=2"></a></span><span class="title_aGrV">Making an SSO group mapping work with your identity provider</span></div>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="automate-sso--rbac-introducing-scim">Automate SSO &amp; RBAC: Introducing SCIM<a href="https://docs.getdbt.com/blog/dbt-cloud-sso-rbac#automate-sso--rbac-introducing-scim" class="hash-link" aria-label="Direct link to Automate SSO &amp; RBAC: Introducing SCIM" title="Direct link to Automate SSO &amp; RBAC: Introducing SCIM">​</a></h2>
<p>We have exciting news — <a href="https://docs.getdbt.com/docs/cloud/manage-access/scim">System for Cross-Domain Identity Management (SCIM)</a> support will be generally available in May 2025 (for SCIM-compliant IdPs &amp; Okta)! If you’re unfamiliar with SCIM, you can think of it as automated user provisioning in dbt Cloud. It makes user data more secure and simplifies the admin and user experience by automating the user identity and group lifecycle.</p>
<p>Here’s why you should care about SCIM as a dbt Cloud admin:</p>
<ol>
<li><strong>Improved admin and end-user experience</strong> — By automating user onboarding and offboarding, SCIM saves time for dbt Cloud admins who manage multiple users on a weekly basis. If a user is added or removed in the IdP, their license and user account are automatically added to or removed from dbt Cloud.</li>
<li><strong>Simplified RBAC with group management</strong> — Admins can simplify access control management by using SCIM to update group membership. Currently, SSO group mapping enables admins to add new users to groups when they are JIT provisioned. SCIM would build on that functionality to allow group management not only for new users but also for existing users.</li>
</ol>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="closing-thoughts">Closing thoughts<a href="https://docs.getdbt.com/blog/dbt-cloud-sso-rbac#closing-thoughts" class="hash-link" aria-label="Direct link to Closing thoughts" title="Direct link to Closing thoughts">​</a></h2>
<p>Securing your account through SSO and RBAC should be one of your first priorities after getting on the Enterprise plan.</p>
<p>Not only does it keep your data safe, it also allows you to onboard users into your account at scale. While this may be just the beginning of your dbt Cloud journey, putting in the work to check off this crucial step ensures that users are leveraging dbt responsibly at an enterprise-grade level!</p>
        <author>
            <name>Brian Jan</name>
        </author>
        <category label="dbt tutorials" term="dbt tutorials"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Getting Started with git Branching Strategies and dbt]]></title>
        <id>https://docs.getdbt.com/blog/git-branching-strategies-with-dbt</id>
        <link href="https://docs.getdbt.com/blog/git-branching-strategies-with-dbt"/>
        <updated>2025-03-10T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[How to configure dbt Cloud with common git strategies]]></summary>
        <content type="html"><![CDATA[<p>Hi! We’re Christine and Carol, Resident Architects at dbt Labs. Our day-to-day
work is all about helping teams reach their technical and business-driven goals.
Collaborating with a broad spectrum of customers ranging from scrappy startups
to massive enterprises, we’ve gained valuable experience guiding teams to
implement architecture which addresses their major pain points.</p>
<p>The information we’re about to share isn't just from our own experiences - we
frequently collaborate with other experts like Taylor Dunlap and Steve Dowling,
who have contributed greatly to this guidance. They serve as the critical bridge
between implementation and business outcomes, ultimately helping teams align on
a comprehensive technical vision by identifying problems and solutions.</p>
<p><strong>Why are we here?</strong><br>
<!-- -->We help teams with dbt architecture, which encompasses the tools, processes and
configurations used to start developing and deploying with dbt. There’s a lot of
decision making that happens behind the scenes to standardize on these pieces -
much of which is informed by understanding what we want the development workflow
to look like. The focus on having the <em><strong>perfect</strong></em> workflow often gets teams
stuck in heaps of planning and endless conversations, which slows or even
stops development momentum. If this sounds familiar, we hope our guidance gives
you the confidence to take steps to unblock development - even when you don’t
have everything figured out yet!</p>
<p>There are three major tools that play an important role in dbt development:</p>
<ul>
<li><strong>A repository</strong><br>
<!-- -->Contains the code we want to change or deploy, along with tools for change management processes.</li>
<li><strong>A data platform</strong><br>
<!-- -->Contains data for our inputs (loaded from other systems) and databases/schemas for our outputs, as well as permission management for data objects.</li>
<li><strong>A dbt project</strong><br>
<!-- -->Helps us manage development and deployment processes of our code to our data platform (and other cool stuff!)</li>
</ul>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:85%"><span><a href="https://docs.getdbt.com/blog/git-branching-strategies-with-dbt#" data-featherlight="/img/blog/2025-01-28-git-branching-strategies-and-dbt/1_dbt_eco.png"><img data-toggle="lightbox" alt="dbt's relationship to git and the data platform" title="dbt's relationship to git and the data platform" src="https://docs.getdbt.com/img/blog/2025-01-28-git-branching-strategies-and-dbt/1_dbt_eco.png?v=2"></a></span><span class="title_aGrV">dbt's relationship to git and the data platform</span></div>
<p>No matter how you end up <strong>defining</strong> your development workflow, these major steps are always present:</p>
<ul>
<li><strong>Development</strong>: How teams make and test changes to code</li>
<li><strong>Quality Assurance</strong>: How teams ensure changes work and produce expected outputs</li>
<li><strong>Promotion</strong>: How teams move changes to the next stage</li>
<li><strong>Deployment</strong>: How teams surface changes to others</li>
</ul>
<p>This article will be focusing mainly on the topic of git and your repository, how
code corresponds to populating your data platform, and the common dbt configurations
we implement to make this happen. We’ll also be pinning ourselves to the steps of
the development workflow throughout.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="why-we-should-focus-on-git">Why we should focus on git<a href="https://docs.getdbt.com/blog/git-branching-strategies-with-dbt#why-we-should-focus-on-git" class="hash-link" aria-label="Direct link to Why we should focus on git" title="Direct link to Why we should focus on git">​</a></h2>
<p>Source control (and git in particular) is foundational to modern development with
or without dbt. It facilitates collaboration between teams of any size and makes
it easy to maintain oversight of the code changes in your project. Understanding
these controlled processes and what code looks like at each step makes
understanding how we need to configure our data platform and dbt much easier.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="️-how-to-just-get-started-️">⭐️ How to “just get started” ⭐️<a href="https://docs.getdbt.com/blog/git-branching-strategies-with-dbt#%EF%B8%8F-how-to-just-get-started-%EF%B8%8F" class="hash-link" aria-label="Direct link to ⭐️ How to “just get started” ⭐️" title="Direct link to ⭐️ How to “just get started” ⭐️">​</a></h2>
<p>This article will be talking about git topics in depth — this will be helpful if
your team is familiar with some of the options and needs help considering
the tradeoffs. If you’re getting started for the first time and don’t have strong
opinions, <strong>we recommend starting with Direct Promotion</strong>.</p>
<p>Direct Promotion is the foundation of all git branching strategies, works well
with basic git knowledge, requires the least amount of provisioning, and can easily
evolve into another strategy if or when your team needs it. We understand this
recommendation can provoke some thoughts of “what if?”. <strong>We urge you to think
about starting with direct promotion like getting a suit tailored</strong>. Your
developers can wear it while you’re figuring out the adjustments, and this is a much
more informative step forward because it lets us see how the suit functions
<em>in motion —</em> the resulting adjustments can be starkly different from what we
thought we’d need when it was static.</p>
<p>The best part of ‘just getting started’
is that it’s not hard to change configurations in dbt for your git strategy
later on (and we'll cover this), so don’t think of this as a critical decision
that will result in months of broken development and re-configuration if you
don’t get it right immediately. Truly, changing your git strategy can be done in
a matter of minutes in dbt Cloud.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="branching-strategies">Branching strategies<a href="https://docs.getdbt.com/blog/git-branching-strategies-with-dbt#branching-strategies" class="hash-link" aria-label="Direct link to Branching strategies" title="Direct link to Branching strategies">​</a></h2>
<p>Once a repository has its initial commit, it always starts with one default
branch, typically called <code>main</code> or <code>master</code> — we’ll be calling the
default branch <code>main</code> in our upcoming examples. The <code>main</code> branch is <em>always the
final destination where we aim to land our changes, and it most often
corresponds to the term "production"</em> - another term you'll see us use throughout.</p>
<p><em><strong>How we want our workflow to look getting our changes from development to
<code>main</code> is the big discussion</strong></em>. Our process needs to consider all the steps in our
workflow: development, quality assurance, promotion, and deployment.
<strong>Branching Strategies</strong> define what this process looks like. We at dbt are not
reinventing the wheel - a number of common strategies have already been defined,
implemented, iterated on, and tested for at least a decade.</p>
<p>There are two major strategies that encompass all forms of branching strategies:
<strong>Direct Promotion</strong> and <strong>Indirect Promotion</strong>. We’ll start by laying these two
out simply:</p>
<ul>
<li>What is the strategy?</li>
<li>How does the development workflow of the strategy look to a team?</li>
<li>Which <strong>repository branching rules and helpers</strong> help us in this strategy?</li>
<li>How do we commonly configure <strong>dbt Cloud</strong> for this strategy?</li>
<li>How do branches and dbt processes map to our <strong>data platform</strong> with this strategy?</li>
</ul>
<p>Then, we’ll end by comparing the strategies and covering some frequently asked questions.</p>
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>Know before you go</div><div class="admonitionContent_BuS1"><p>There are <em>many</em> ways to configure each tool (especially dbt) to accomplish what you need. The upcoming
strategy details were deliberately written to provide what we think are the minimal standards
to get teams up and running quickly. These are starter configurations and practices which
are easy to tweak and adjust later on. Expanding on these configurations is
an exercise left to the reader!</p></div></div>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="direct-promotion">Direct promotion<a href="https://docs.getdbt.com/blog/git-branching-strategies-with-dbt#direct-promotion" class="hash-link" aria-label="Direct link to Direct promotion" title="Direct link to Direct promotion">​</a></h2>
<p><strong>Direct promotion</strong> means we only keep one long-lived branch
in our repository — in our case, <code>main</code>. Here’s the workflow for this strategy:</p>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:85%"><span><a href="https://docs.getdbt.com/blog/git-branching-strategies-with-dbt#" data-featherlight="/img/blog/2025-01-28-git-branching-strategies-and-dbt/2_direct_git.png"><img data-toggle="lightbox" alt="Direct promotion branching strategy" title="Direct promotion branching strategy" src="https://docs.getdbt.com/img/blog/2025-01-28-git-branching-strategies-and-dbt/2_direct_git.png?v=2"></a></span><span class="title_aGrV">Direct promotion branching strategy</span></div>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="how-does-the-development-workflow-look-to-a-team">How does the development workflow look to a team?<a href="https://docs.getdbt.com/blog/git-branching-strategies-with-dbt#how-does-the-development-workflow-look-to-a-team" class="hash-link" aria-label="Direct link to How does the development workflow look to a team?" title="Direct link to How does the development workflow look to a team?">​</a></h3>
<p>Layout:</p>
<ul>
<li><code>feature</code> is the developer’s unique branch where task-related changes happen</li>
<li><code>main</code> is the branch that contains our “production” version of code</li>
</ul>
<p>Workflow:</p>
<ul>
<li><strong>Development</strong>: I create a <code>feature</code> branch from <code>main</code> to make, test, and personally review changes</li>
<li><strong>Quality Assurance</strong>: I open a pull request comparing my <code>feature</code> against <code>main</code>, which is then reviewed by peers (required), stakeholders, or subject matter experts (SMEs). We highly recommend including stakeholders or SMEs for feedback during PR in this strategy because the next step changes <code>main</code>.</li>
<li><strong>Promotion</strong>: After all required approvals and checks, I merge my changes to <code>main</code></li>
<li><strong>Deployment</strong>: Others can see and use my changes in <code>main</code> after I merge and <code>main</code> is deployed</li>
</ul>
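<p>As a sketch, the four workflow steps above map onto plain git commands roughly like this. Repository, branch, and file names here are illustrative, the pull-request review itself happens in your git host rather than on the command line, and a local bare repository stands in for the hosted remote:</p>

```shell
set -eu
workdir="$(mktemp -d)"
cd "$workdir"

# A local bare repository stands in for the hosted remote (e.g. GitHub).
git init -q --bare origin.git
git clone -q origin.git repo
cd repo
git config user.email "dev@example.com"
git config user.name "Dev"
git commit -q --allow-empty -m "initial production state"
git branch -M main
git push -q -u origin main

# Development: create a feature branch from main and commit a change.
git checkout -qb feature/add-orders-model main
echo "select 1 as id" > orders.sql
git add orders.sql
git commit -q -m "Add orders model"
git push -q -u origin feature/add-orders-model

# Quality Assurance: open a PR of feature/add-orders-model against main
# in the git host; reviews and CI checks happen there.

# Promotion: after approval, the PR merge lands the change on main
# (simulated here with a local merge).
git checkout -q main
git merge -q --no-ff feature/add-orders-model -m "Merge feature/add-orders-model"
git push -q origin main

# Deployment: a dbt Cloud job running against main now builds the new model.
```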
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="repository-branching-rules-and-helpers">Repository branching rules and helpers<a href="https://docs.getdbt.com/blog/git-branching-strategies-with-dbt#repository-branching-rules-and-helpers" class="hash-link" aria-label="Direct link to Repository branching rules and helpers" title="Direct link to Repository branching rules and helpers">​</a></h3>
<p>At a minimum, we like to set up:</p>
<ul>
<li><strong>Branch protection</strong> on <code>main</code> (<a href="https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/managing-protected-branches/about-protected-branches" target="_blank" rel="noopener noreferrer">like these settings for GitHub</a>), requiring:<!-- -->
<ul>
<li>a pull request (no direct commits to <code>main</code>)</li>
<li>pull requests must have at least 1 reviewer's approval</li>
</ul>
</li>
<li><strong>A PR template</strong> (<a href="https://docs.getdbt.com/blog/analytics-pull-request-template" target="_blank" rel="noopener noreferrer">such as our boiler-plate PR template</a>) for <code>feature</code> PRs against <code>main</code></li>
</ul>
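<p>Branch protection on the git host is the authoritative control, but some teams also add a local <code>pre-push</code> hook as a safety net against an accidental <code>git push origin main</code> from a laptop. Here is a minimal sketch; the hook body between the <code>HOOK</code> markers is the reusable part, and the surrounding commands just set up a throwaway repo to demonstrate it:</p>

```shell
set -eu
workdir="$(mktemp -d)"
cd "$workdir"
git init -q --bare origin.git
git clone -q origin.git repo
cd repo
git config user.email "dev@example.com"
git config user.name "Dev"
git commit -q --allow-empty -m "initial"
git branch -M main

# The hook: refuse direct pushes to protected branches.
cat > .git/hooks/pre-push <<'HOOK'
#!/bin/sh
protected="main"
while read -r local_ref local_sha remote_ref remote_sha; do
  branch="${remote_ref#refs/heads/}"
  for p in $protected; do
    if [ "$branch" = "$p" ]; then
      echo "pre-push: direct pushes to '$p' are blocked; open a pull request" >&2
      exit 1
    fi
  done
done
exit 0
HOOK
chmod +x .git/hooks/pre-push

# A direct push to main is rejected by the hook...
if git push -q origin main 2>/dev/null; then
  echo "push to main allowed"
else
  echo "push to main blocked"
fi

# ...while feature branches push normally.
git checkout -qb feature/example
git push -q -u origin feature/example
```

<p>This is only a convenience for the person pushing; the server-side protection settings linked above remain the rule that actually gets enforced.</p>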
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="dbt-cloud-processes-and-environments">dbt Cloud processes and environments<a href="https://docs.getdbt.com/blog/git-branching-strategies-with-dbt#dbt-cloud-processes-and-environments" class="hash-link" aria-label="Direct link to dbt Cloud processes and environments" title="Direct link to dbt Cloud processes and environments">​</a></h3>
<p>Here’s our branching strategy again, but now with the dbt Cloud processes we want to incorporate:</p>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:85%"><span><a href="https://docs.getdbt.com/blog/git-branching-strategies-with-dbt#" data-featherlight="/img/blog/2025-01-28-git-branching-strategies-and-dbt/3_direct_dbt_deployment.png"><img data-toggle="lightbox" alt="Direct Promotion strategy with dbt cloud processes denoted" title="Direct Promotion strategy with dbt cloud processes denoted" src="https://docs.getdbt.com/img/blog/2025-01-28-git-branching-strategies-and-dbt/3_direct_dbt_deployment.png?v=2"></a></span><span class="title_aGrV">Direct Promotion strategy with dbt cloud processes denoted</span></div>
<p>In order to create the jobs in our diagram, we need dbt Cloud environments. Here are the common configurations for this setup:</p>
<div class="filterableTableContainer_mhtg"><table style="display:none"><thead><tr><th>Environment Name</th><th><a href="https://docs.getdbt.com/docs/dbt-cloud-environments#types-of-environments" target="_blank" rel="noopener noreferrer">Environment Type</a></th><th><a href="https://docs.getdbt.com/docs/deploy/deploy-environments#staging-environment" target="_blank" rel="noopener noreferrer">Deployment Type</a></th><th>Base Branch</th><th>Will handle…</th></tr></thead><tbody><tr><td>Development</td><td>development</td><td>-</td><td><code>main</code></td><td>Operations done in the IDE (including creating feature branches)</td></tr><tr><td>Continuous Integration</td><td>deployment</td><td>General</td><td><code>main</code></td><td>A continuous integration job</td></tr><tr><td>Production</td><td>deployment</td><td>Production</td><td><code>main</code></td><td>A deployment job</td></tr></tbody></table><div class="tableWrapper_oiMt"><div class="searchBar_xnmH"><div class="searchContainer_fLyJ"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 512 512"><path d="M416 208c0 45.9-14.9 88.3-40 122.7L502.6 457.4c12.5 12.5 12.5 32.8 0 45.3s-32.8 12.5-45.3 0L330.7 376c-34.4 25.2-76.8 40-122.7 40C93.1 416 0 322.9 0 208S93.1 0 208 0S416 93.1 416 208zM208 352a144 144 0 1 0 0-288 144 144 0 1 0 0 288z"></path></svg><input type="text" placeholder="Search table..." class="searchInput_xT8h" aria-label="Search table" value=""></div></div><table class="filterableTable_QAKT"><thead><tr></tr></thead><tbody><tr><td colspan="5" style="text-align:center;padding:20px">Loading table...</td></tr></tbody></table></div></div>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="data-platform-organization">Data platform organization<a href="https://docs.getdbt.com/blog/git-branching-strategies-with-dbt#data-platform-organization" class="hash-link" aria-label="Direct link to Data platform organization" title="Direct link to Data platform organization">​</a></h3>
<p>Now we need to focus on where we want to build things in our data platform. For that,
we need to set our <strong>database</strong> and <strong>schema</strong> settings on the environments.
Here’s our diagram again, but now mapping how we want our objects to populate
from our branches to our data platform:</p>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:85%"><span><a href="https://docs.getdbt.com/blog/git-branching-strategies-with-dbt#" data-featherlight="/img/blog/2025-01-28-git-branching-strategies-and-dbt/4_direct_data_population.png"><img data-toggle="lightbox" alt="Direct Promotion strategy with branch relations to data platform objects" title="Direct Promotion strategy with branch relations to data platform objects" src="https://docs.getdbt.com/img/blog/2025-01-28-git-branching-strategies-and-dbt/4_direct_data_population.png?v=2"></a></span><span class="title_aGrV">Direct Promotion strategy with branch relations to data platform objects</span></div>
<p>Taking the table we created previously for our dbt Cloud environment, let's further
map environment configurations to our data platform:</p>
<div class="filterableTableContainer_mhtg"><table style="display:none"><thead><tr><th>Environment Name</th><th><strong>Database</strong></th><th><strong>Schema</strong></th></tr></thead><tbody><tr><td>Development</td><td><code>development</code></td><td>User-specified in Profile Settings &gt; Credentials</td></tr><tr><td>Continuous Integration</td><td><code>development</code></td><td>Any safe default, like <code>dev_ci</code> (it doesn’t even have to exist). The job we intend to set up will override the schema here anyway to denote the unique PR.</td></tr><tr><td>Production</td><td><code>production</code></td><td><code>analytics</code></td></tr></tbody></table><div class="tableWrapper_oiMt"><div class="searchBar_xnmH"><div class="searchContainer_fLyJ"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 512 512"><path d="M416 208c0 45.9-14.9 88.3-40 122.7L502.6 457.4c12.5 12.5 12.5 32.8 0 45.3s-32.8 12.5-45.3 0L330.7 376c-34.4 25.2-76.8 40-122.7 40C93.1 416 0 322.9 0 208S93.1 0 208 0S416 93.1 416 208zM208 352a144 144 0 1 0 0-288 144 144 0 1 0 0 288z"></path></svg><input type="text" placeholder="Search table..." class="searchInput_xT8h" aria-label="Search table" value=""></div></div><table class="filterableTable_QAKT"><thead><tr></tr></thead><tbody><tr><td colspan="5" style="text-align:center;padding:20px">Loading table...</td></tr></tbody></table></div></div>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>We are showing environment configurations here, but a default database will be set at the highest level in a <strong><a href="https://docs.getdbt.com/docs/cloud/connect-data-platform/about-connections" target="_blank" rel="noopener noreferrer">connection</a></strong> (which is a required setting of an environment). <em>Deployment</em> environments can override a connection's database setting when needed.</p></div></div>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="direct-promotion-example">Direct promotion example<a href="https://docs.getdbt.com/blog/git-branching-strategies-with-dbt#direct-promotion-example" class="hash-link" aria-label="Direct link to Direct promotion example" title="Direct link to Direct promotion example">​</a></h3>
<p><em>In this example, Steve uses the term “QA” for the environment that builds the changed code from feature branch pull requests. This is equivalent to our ‘Continuous Integration’ environment — a great example of choosing names that make the most sense for your team!</em></p>
<div style="margin:40px 10px"><div class="loomWrapper_TTvb"><iframe width="640" class="loomFrame_B61a" height="400" src="https://www.loom.com/embed/59c71a9549b5497f99ef86622aad945e" frameborder="0" allowfullscreen="" webkitallowfullscreen="true" mozallowfullscreen="true"></iframe></div></div>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="indirect-promotion">Indirect promotion<a href="https://docs.getdbt.com/blog/git-branching-strategies-with-dbt#indirect-promotion" class="hash-link" aria-label="Direct link to Indirect promotion" title="Direct link to Indirect promotion">​</a></h2>
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>A note about Indirect Promotion</div><div class="admonitionContent_BuS1"><p>Indirect Promotion introduces more steps of ownership, so this branching strategy
works best when you can identify people who have a great understanding of git to
handle branch management. Additionally, the <em><strong>time from development to production
is lengthier</strong></em> due to the workload of these new steps, so it requires good
project management. We expand more on this later, but it’s an important call out
as this is where we see unprepared teams struggle most.</p></div></div>
<p><strong>Indirect promotion</strong> adds other long-lived branches that derive from <code>main</code>.
The simplest version of indirect promotion is a two-trunk <em>hierarchical</em> structure
— this is the one we see implemented most commonly in indirect workflows.</p>
<p><em>Hierarchical promotion</em> means promoting changes back along the same path the branches were derived. Example:</p>
<ul>
<li>a middle branch is derived from <code>main</code></li>
<li>feature branches derive from the middle branch</li>
<li>feature branches merge back to the middle branch</li>
<li>the middle branch merges back to <code>main</code></li>
</ul>
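<p>Sketched with plain git, the four steps above look like the following. Branch and file names are illustrative, and in practice the merges into the middle branch and <code>main</code> happen through pull requests rather than local merges:</p>

```shell
set -eu
workdir="$(mktemp -d)"
cd "$workdir"
git init -q repo
cd repo
git config user.email "dev@example.com"
git config user.name "Dev"
git commit -q --allow-empty -m "production baseline"
git branch -M main

# 1. Derive the middle branch from main.
git checkout -qb qa main

# 2. Derive a feature branch from the middle branch and land a change.
git checkout -qb feature/new-model qa
echo "select 1 as id" > new_model.sql
git add new_model.sql
git commit -q -m "Add new model"

# 3. Merge the feature back to the middle branch (the feature PR).
git checkout -q qa
git merge -q --no-ff feature/new-model -m "Merge feature/new-model into qa"

# 4. Merge the middle branch back to main (the release PR).
git checkout -q main
git merge -q --no-ff qa -m "Release: merge qa into main"
```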
<p>Some common names for a middle branch as seen in the wild are:</p>
<ul>
<li><code>qa</code>: Quality Assurance</li>
<li><code>uat</code>: User Acceptance Testing</li>
<li><code>staging</code> or <code>preprod</code>: Common software development terminology</li>
</ul>
<p>We’ll be calling our middle branch <code>qa</code> throughout the rest of this article.</p>
<p>Here’s the workflow for this strategy:</p>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:85%"><span><a href="https://docs.getdbt.com/blog/git-branching-strategies-with-dbt#" data-featherlight="/img/blog/2025-01-28-git-branching-strategies-and-dbt/6_indirect_git.png"><img data-toggle="lightbox" alt="Indirect Promotion branching strategy" title="Indirect Promotion branching strategy" src="https://docs.getdbt.com/img/blog/2025-01-28-git-branching-strategies-and-dbt/6_indirect_git.png?v=2"></a></span><span class="title_aGrV">Indirect Promotion branching strategy</span></div>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="how-does-the-development-workflow-look-to-a-developer">How does the development workflow look to a developer?<a href="https://docs.getdbt.com/blog/git-branching-strategies-with-dbt#how-does-the-development-workflow-look-to-a-developer" class="hash-link" aria-label="Direct link to How does the development workflow look to a developer?" title="Direct link to How does the development workflow look to a developer?">​</a></h3>
<p>Changes from our direct promotion workflow are highlighted in <mark style="background-color:#d6eaf8">blue</mark>.</p>
<p>Layout:</p>
<ul>
<li><code>feature</code> is the developer’s unique branch where task-related changes happen</li>
<li>
<mark style="background-color:#d6eaf8"><code style="background-color:#aed6f1">qa</code> contains approved changes from developers’ <code style="background-color:#aed6f1">feature</code> branches, which will be merged to main and enter production together once additional testing is complete.<code style="background-color:#aed6f1">qa</code> is always ahead of <code style="background-color:#aed6f1">main</code> in changes.</mark>
</li>
<li><code>main</code> is the branch that contains our “production” version of code</li>
</ul>
<p>Workflow:</p>
<ul>
<li><strong>Development</strong>: I create a <code>feature</code> branch from <mark style="background-color:#d6eaf8"><code>qa</code></mark> to make, test, and personally review changes</li>
<li><strong>Quality Assurance:</strong> I open a pull request comparing my <code>feature</code> branch to <mark style="background-color:#d6eaf8"><code>qa</code></mark>, which is then reviewed by peers and <mark style="background-color:#d6eaf8"><em>optionally</em></mark> subject matter experts or stakeholders</li>
<li><strong>Promotion</strong>: After all required approvals and checks, I can merge my changes to <mark style="background-color:#d6eaf8"><code>qa</code></mark></li>
<li>
<mark style="background-color:#d6eaf8"><strong>Quality Assurance</strong>: SMEs or other stakeholders can review my changes in <code style="background-color:#aed6f1">qa</code> when I merge my <code style="background-color:#aed6f1">feature</code></mark>
</li>
<li>
<mark style="background-color:#d6eaf8"><strong>Promotion:</strong> Once QA specialists give their approval of <code style="background-color:#aed6f1">qa</code>’s version of data, a <strong>release manager</strong> opens a pull request using <code style="background-color:#aed6f1">qa</code>’s branch targeting <code style="background-color:#aed6f1">main</code> (we define this as a <strong>“release”</strong>)</mark>
</li>
<li><strong>Deployment</strong>: Others can see and use my changes (<mark style="background-color:#d6eaf8">and other’s changes</mark>) in <code>main</code> <mark style="background-color:#d6eaf8">after <code style="background-color:#aed6f1">qa</code> is merged to <code style="background-color:#aed6f1">main</code></mark> and <code>main</code> is deployed</li>
</ul>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="repository-branching-rules-and-helpers-1">Repository branching rules and helpers<a href="https://docs.getdbt.com/blog/git-branching-strategies-with-dbt#repository-branching-rules-and-helpers-1" class="hash-link" aria-label="Direct link to Repository branching rules and helpers" title="Direct link to Repository branching rules and helpers">​</a></h3>
<p>At a minimum, we like to set up:</p>
<ul>
<li><strong>Branch protection</strong> on <code>main</code> and <code>qa</code> (<a href="https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/managing-protected-branches/about-protected-branches" target="_blank" rel="noopener noreferrer">like these settings for GitHub</a>), requiring:<!-- -->
<ul>
<li>a pull request (no direct commits to <code>main</code> or <code>qa</code>)</li>
<li>pull requests must have at least 1 reviewer's approval</li>
</ul>
</li>
<li><strong>A PR template</strong> (<a href="https://docs.getdbt.com/blog/analytics-pull-request-template" target="_blank" rel="noopener noreferrer">such as our boiler-plate PR template</a>) for <code>feature</code> PRs against <code>qa</code></li>
<li><strong>A PR template</strong> (<a href="https://github.com/dbt-labs/dbt-proserv/blob/main/.github/release_pull_request_template.md" target="_blank" rel="noopener noreferrer">such as our boiler-plate PR template for releases</a>) for <code>qa</code> PRs against <code>main</code></li>
</ul>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="dbt-cloud-processes-and-environments-1">dbt Cloud processes and environments<a href="https://docs.getdbt.com/blog/git-branching-strategies-with-dbt#dbt-cloud-processes-and-environments-1" class="hash-link" aria-label="Direct link to dbt Cloud processes and environments" title="Direct link to dbt Cloud processes and environments">​</a></h3>
<p>Here’s our branching strategy again, but now with the dbt Cloud processes we want to incorporate:</p>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:85%"><span><a href="https://docs.getdbt.com/blog/git-branching-strategies-with-dbt#" data-featherlight="/img/blog/2025-01-28-git-branching-strategies-and-dbt/7_indirect_dbt_deployment.png"><img data-toggle="lightbox" alt="Indirect Promotion strategy with dbt cloud processes denoted" title="Indirect Promotion strategy with dbt cloud processes denoted" src="https://docs.getdbt.com/img/blog/2025-01-28-git-branching-strategies-and-dbt/7_indirect_dbt_deployment.png?v=2"></a></span><span class="title_aGrV">Indirect Promotion strategy with dbt cloud processes denoted</span></div>
<p>In order to create the jobs in our diagram, we need dbt Cloud environments. Here are the common configurations for this setup:</p>
<div class="filterableTableContainer_mhtg"><table style="display:none"><thead><tr><th>Environment Name</th><th><a href="https://docs.getdbt.com/docs/dbt-cloud-environments#types-of-environments" target="_blank" rel="noopener noreferrer">Environment Type</a></th><th><a href="https://docs.getdbt.com/docs/deploy/deploy-environments#staging-environment" target="_blank" rel="noopener noreferrer">Deployment Type</a></th><th>Base Branch</th><th>Will handle…</th></tr></thead><tbody><tr><td>Development</td><td>development</td><td>-</td><td><code>qa</code></td><td>Operations done in the IDE (including creating feature branches)</td></tr><tr><td>Feature CI</td><td>deployment</td><td>General</td><td><code>qa</code></td><td>A continuous integration job</td></tr><tr><td>Quality Assurance</td><td>deployment</td><td>Staging</td><td><code>qa</code></td><td>A deployment job</td></tr><tr><td>Release CI</td><td>deployment</td><td>General</td><td><code>main</code></td><td>A continuous integration job</td></tr><tr><td>Production</td><td>deployment</td><td>Production</td><td><code>main</code></td><td>A deployment job</td></tr></tbody></table><div class="tableWrapper_oiMt"><div class="searchBar_xnmH"><div class="searchContainer_fLyJ"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 512 512"><path d="M416 208c0 45.9-14.9 88.3-40 122.7L502.6 457.4c12.5 12.5 12.5 32.8 0 45.3s-32.8 12.5-45.3 0L330.7 376c-34.4 25.2-76.8 40-122.7 40C93.1 416 0 322.9 0 208S93.1 0 208 0S416 93.1 416 208zM208 352a144 144 0 1 0 0-288 144 144 0 1 0 0 288z"></path></svg><input type="text" placeholder="Search table..." class="searchInput_xT8h" aria-label="Search table" value=""></div></div><table class="filterableTable_QAKT"><thead><tr></tr></thead><tbody><tr><td colspan="5" style="text-align:center;padding:20px">Loading table...</td></tr></tbody></table></div></div>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="data-platform-organization-1">Data platform organization<a href="https://docs.getdbt.com/blog/git-branching-strategies-with-dbt#data-platform-organization-1" class="hash-link" aria-label="Direct link to Data platform organization" title="Direct link to Data platform organization">​</a></h3>
<p>Now we need to focus on where we want to build things in our data platform. For that,
we need to set our <strong>database</strong> and <strong>schema</strong> settings on the environments.
There are two common setups for mapping code, but before we get into those,
remember this note from direct promotion:</p>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>We are showing environment configurations here, but a default database will be set at the highest level in a <strong><a href="https://docs.getdbt.com/docs/cloud/connect-data-platform/about-connections" target="_blank" rel="noopener noreferrer">connection</a></strong> (which is a required setting of an environment). <em>Deployment</em> environments can override a connection's database setting when needed.</p></div></div>
<ul>
<li>
<p><strong>Configuration 1</strong>: A 1:1 mapping of <code>qa</code> and <code>main</code> assets</p>
<p>In this pattern, the CI schemas are populated in a database <em>outside</em> of Production and QA. This is usually done to keep the databases aligned with what’s been merged on their corresponding branches.</p>
<p>Here’s our diagram, now mapping to the data platform with this pattern:</p>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:85%"><span><a href="https://docs.getdbt.com/blog/git-branching-strategies-with-dbt#" data-featherlight="/img/blog/2025-01-28-git-branching-strategies-and-dbt/8_indirect_data_population.png"><img data-toggle="lightbox" alt="Indirect Promotion branches and how they relate to 1\:1 organization in the data platform" title="Indirect Promotion branches and how they relate to 1\:1 organization in the data platform" src="https://docs.getdbt.com/img/blog/2025-01-28-git-branching-strategies-and-dbt/8_indirect_data_population.png?v=2"></a></span><span class="title_aGrV">Indirect Promotion branches and how they relate to 1\:1 organization in the data platform</span></div>
<p>Here are our configurations for this pattern:</p>
<div class="filterableTableContainer_mhtg"><table style="display:none"><thead><tr><th>Environment Name</th><th><strong>Database</strong></th><th><strong>Schema</strong></th></tr></thead><tbody><tr><td>Development</td><td><code>development</code></td><td>User-specified in Profile Settings &gt; Credentials</td></tr><tr><td>Feature CI</td><td><code>development</code></td><td>Any safe default, like <code>dev_ci</code> (it doesn’t even have to exist). The job we intend to set up will override the schema here anyway to denote the unique PR.</td></tr><tr><td>Quality Assurance</td><td><code>qa</code></td><td><code>analytics</code></td></tr><tr><td>Release CI</td><td><code>development</code></td><td>A safe default</td></tr><tr><td>Production</td><td><code>production</code></td><td><code>analytics</code></td></tr></tbody></table><div class="tableWrapper_oiMt"><div class="searchBar_xnmH"><div class="searchContainer_fLyJ"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 512 512"><path d="M416 208c0 45.9-14.9 88.3-40 122.7L502.6 457.4c12.5 12.5 12.5 32.8 0 45.3s-32.8 12.5-45.3 0L330.7 376c-34.4 25.2-76.8 40-122.7 40C93.1 416 0 322.9 0 208S93.1 0 208 0S416 93.1 416 208zM208 352a144 144 0 1 0 0-288 144 144 0 1 0 0 288z"></path></svg><input type="text" placeholder="Search table..." class="searchInput_xT8h" aria-label="Search table" value=""></div></div><table class="filterableTable_QAKT"><thead><tr></tr></thead><tbody><tr><td colspan="5" style="text-align:center;padding:20px">Loading table...</td></tr></tbody></table></div></div>
</li>
<li>
<p><strong>Configuration 2</strong>: A reflection of the workflow initiative</p>
<p>In this pattern, the CI schemas populate in a <code>qa</code> database because it’s a step in quality assurance.
Here’s our diagram, now mapping to the data platform with this pattern:</p>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:85%"><span><a href="https://docs.getdbt.com/blog/git-branching-strategies-with-dbt#" data-featherlight="/img/blog/2025-01-28-git-branching-strategies-and-dbt/9_alt_indirect_data_population.png"><img data-toggle="lightbox" alt="Indirect Promotion branches and how they relate to workflow initiative organization in the data platform" title="Indirect Promotion branches and how they relate to workflow initiative organization in the data platform" src="https://docs.getdbt.com/img/blog/2025-01-28-git-branching-strategies-and-dbt/9_alt_indirect_data_population.png?v=2"></a></span><span class="title_aGrV">Indirect Promotion branches and how they relate to workflow initiative organization in the data platform</span></div>
<p>Here are our configurations for this pattern:</p>
<div class="filterableTableContainer_mhtg"><table style="display:none"><thead><tr><th>Environment Name</th><th><strong>Database</strong></th><th><strong>Schema</strong></th></tr></thead><tbody><tr><td>Development</td><td><code>development</code></td><td>User-specified in Profile Settings &gt; Credentials</td></tr><tr><td>Feature CI</td><td><code>qa</code></td><td>Any safe default, like <code>dev_ci</code> (it doesn’t even have to exist). The job we intend to set up will override the schema here anyway to denote the unique PR.</td></tr><tr><td>Quality Assurance</td><td><code>qa</code></td><td><code>analytics</code></td></tr><tr><td>Release CI</td><td><code>qa</code></td><td>A safe default</td></tr><tr><td>Production</td><td><code>production</code></td><td><code>analytics</code></td></tr></tbody></table><div class="tableWrapper_oiMt"><div class="searchBar_xnmH"><div class="searchContainer_fLyJ"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 512 512"><path d="M416 208c0 45.9-14.9 88.3-40 122.7L502.6 457.4c12.5 12.5 12.5 32.8 0 45.3s-32.8 12.5-45.3 0L330.7 376c-34.4 25.2-76.8 40-122.7 40C93.1 416 0 322.9 0 208S93.1 0 208 0S416 93.1 416 208zM208 352a144 144 0 1 0 0-288 144 144 0 1 0 0 288z"></path></svg><input type="text" placeholder="Search table..." class="searchInput_xT8h" aria-label="Search table" value=""></div></div><table class="filterableTable_QAKT"><thead><tr></tr></thead><tbody><tr><td colspan="5" style="text-align:center;padding:20px">Loading table...</td></tr></tbody></table></div></div>
</li>
</ul>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="indirect-promotion-example">Indirect promotion example<a href="https://docs.getdbt.com/blog/git-branching-strategies-with-dbt#indirect-promotion-example" class="hash-link" aria-label="Direct link to Indirect promotion example" title="Direct link to Indirect promotion example">​</a></h3>
<p><em>In this example, Steve uses the term “UAT” to define the automatic deployment of the middle branch and “QA” to define what’s built from feature branch pull requests. He also defines a database for each (with four databases total - one for development schemas, one for CI schemas, one for middle branch deployments, and one for production deployments) — we wanted to show you this example as it speaks to how configurable these processes are apart from our standard examples.</em></p>
<div style="margin:40px 10px"><div class="loomWrapper_TTvb"><iframe width="640" class="loomFrame_B61a" height="400" src="https://www.loom.com/embed/0e03faf9f8f7434fbe01eaf7b818e507" frameborder="0" allowfullscreen="" webkitallowfullscreen="true" mozallowfullscreen="true"></iframe></div></div>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="what-did-indirect-promotion-change">What did indirect promotion change?<a href="https://docs.getdbt.com/blog/git-branching-strategies-with-dbt#what-did-indirect-promotion-change" class="hash-link" aria-label="Direct link to What did indirect promotion change?" title="Direct link to What did indirect promotion change?">​</a></h2>
<p>You’ve probably noticed there is one overall theme of adding our additional branch, and that’s supporting our <em>Quality Assurance</em> initiative. Let’s break it down:</p>
<ul>
<li>
<p><strong>Development</strong></p>
<p>While no one will be developing in the <code>qa</code> branch itself, it still needs oversight, just like a <code>feature</code> branch does, to stay in sync with its base branch. This is because a change made directly to <code>main</code> (like a hotfix or an accidental merge) won’t immediately surface in our <code>feature</code> branches, since they are based on <code>qa</code>’s version of the code. For this reason, <code>qa</code> needs to stay in sync with any change to <code>main</code>.</p>
</li>
<li>
<p><strong>Quality Assurance</strong></p>
<p>There are now <em>two places</em> where quality can be reviewed (<code>feature</code> and <code>qa</code>) before changes hit production. <code>qa</code> is typically leveraged in at least one of these ways for more quality assurance work:</p>
<ul>
<li>Testing and reviewing how end-to-end changes are performing over time</li>
<li>Deploying the full image of the <code>qa</code> changes to a centralized location. Some common reasons to deploy <code>qa</code> code are:<!-- -->
<ul>
<li>Leveraging <a href="https://docs.getdbt.com/reference/node-selection/defer" target="_blank" rel="noopener noreferrer">deferral</a> and <a href="https://docs.getdbt.com/docs/deploy/advanced-ci" target="_blank" rel="noopener noreferrer">Advanced</a> comparison features in CI</li>
<li>Testing builds from environment-specific data sets (dynamic sources)</li>
<li>Creating staging versions of workbooks in your BI tool.
This is most relevant when your BI tool doesn’t handle changing underlying schemas well. Some tools have good controls for grabbing a production workbook for development, switching the underlying schema to a <code>dbt_cloud_pr_#</code> schema, and reflecting those changes without breaking anything. Others will break every column selection in your workbook, even if the structure is the same. For this reason, it is sometimes easier to create one “staging” workbook and always point it at a database built from QA code; changes can then be reflected and reviewed in that workbook before the code changes reach production.</li>
<li>Sharing changes with folks who want to see or test them, but aren’t personas included in the review process.
For instance, you may have a subject matter expert who reviews and approves alongside developers and understands the process of looking at <code>dbt_cloud_pr</code> schemas. If this person then tells the teammates who will use those changes that they’ve just been approved, the team might ask if there is a way they can also see the changes. Since the CI schema is dropped after merge, they would need to wait to see the change in production if there is no process deploying the middle branch.</li>
</ul>
</li>
</ul>
</li>
<li>
<p><strong>Promotion</strong></p>
<p>There are now two places where code needs to be promoted:</p>
<ul>
<li>From <code>feature</code> to <code>qa</code> by a developer and peer (and optionally SMEs or stakeholders)</li>
<li>From <code>qa</code> to <code>main</code> by a release manager and SMEs or stakeholders</li>
</ul>
<p>Additionally, approved changes from feature branches are promoted together from <code>qa</code>.</p>
</li>
<li>
<p><strong>Deployment</strong></p>
<p>There are now two major branches code can be deployed from:</p>
<ul>
<li><code>qa</code>: The “working” version with changes; <code>feature</code> branches merge here</li>
<li><code>main</code>: The “production” version</li>
</ul>
<p>Because our changes collect on the <code>qa</code> branch, our deployment process
changes from continuous deployment (“streaming” changes to <code>main</code> through direct
promotion) to continuous delivery (“batched” changes to <code>main</code>).
Julia Schottenstein does a great job explaining the differences <a href="https://www.getdbt.com/blog/adopting-ci-cd-with-dbt-cloud" target="_blank" rel="noopener noreferrer">here</a>.</p>
</li>
</ul>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="comparing-branching-strategies">Comparing branching strategies<a href="https://docs.getdbt.com/blog/git-branching-strategies-with-dbt#comparing-branching-strategies" class="hash-link" aria-label="Direct link to Comparing branching strategies" title="Direct link to Comparing branching strategies">​</a></h2>
<p>Since most teams can make <strong>direct promotion</strong> work, we’ll list some key flags for when we start thinking about <strong>indirect promotion</strong> with a team:</p>
<ul>
<li>They speak about having a dedicated environment for QA, UAT, staging, or pre-production work.</li>
<li>They ask how they can test changes end-to-end and over time before things hit production.</li>
<li>Their developers aren’t the same, or the only, folks who are checking data outputs for validity - especially if those other folks are more familiar with performing validations from other tools (like BI dashboards).</li>
<li>Their different environments aren’t working with identical data. Like software environments, they may have limited or scrubbed versions of production data depending on the environment.</li>
<li>They have a schedule in mind for making changes “public”, and want to hold features back from being seen or usable until then.</li>
<li>They have very high-stakes data consumption.</li>
</ul>
<p>If you fit any of these, you likely fit into an indirect promotion strategy.</p>
<p><strong>Strengths and Weaknesses</strong></p>
<p>We highly recommend that you choose your branching strategy based on which <em>best supports your workflow needs</em> over any perceived pros and cons — when these are put in the context of your team’s structure and technical skills, you’ll find some aren’t strengths or weaknesses at all!</p>
<ul>
<li>
<p><strong>Direct promotion</strong></p>
<p>Strengths</p>
<ul>
<li>Much faster in terms of seeing changes - once the PR is merged and deployed, the changes are “in production”.</li>
<li>Changes don’t get stuck in a middle branch that’s pending the acceptance of someone else’s validation on data output.</li>
<li>Management is mainly distributed - every developer owns their own branch and ensures it’s in sync with what’s in <code>main</code>.</li>
<li>There are no releases to worry about, so no extra processes to manage.</li>
</ul>
<p>Weaknesses</p>
<ul>
<li>It can present challenges for testing changes end-to-end or over time in an environment that isn't production. Our desire to build only modified and directly impacted models, to reduce the number of models executed in CI, goes against the grain of full end-to-end testing, and our CI mechanism (which executes only upon a pull request or new commit) won’t help us test over time.</li>
<li>It can be more difficult for differing schedules or technical abilities when it comes to review. It’s essential in this strategy to include stakeholders or subject matter experts on pull requests <em>before merge,</em> because the next step is production. Additionally, some tools aren’t great at switching databases and schemas even if the shape of the data is the same. Constant breakage of reports for review can be too much overhead.</li>
<li>It can be harder to test configurations or job changes before they hit production, especially if things function a bit differently based on environment.</li>
<li>It can be harder to share code that works fully but isn’t a full reflection of a complete task. Changes need to be agreed upon to go to production so others can pull them in, otherwise developers need to know how to pull these in from other branches that aren’t <code>main</code> (and be aware of staying in sync or risk merge conflicts).</li>
</ul>
</li>
<li>
<p><strong>Indirect promotion</strong></p>
<p>Strengths</p>
<ul>
<li>There’s a dedicated environment to test end-to-end changes over time.</li>
<li>Data outputs can be reviewed either with a developer on PR or once things are in the middle branch.</li>
<li>Review from other tools is much easier because we have the option of deploying our middle branch to a centralized location. “Staging” reports can be set up to always refer to this location for reviewing changes, and processes for creating new reports can flow from staging to production.</li>
<li>Configurations and job changes can be tested with production-like parameters before they actually hit production.</li>
<li>Changes merged to the middle branch for shared development won't be reflected in production. Consumers of <code>main</code> will be none the wiser about the things developers do for ease of collaboration.</li>
</ul>
<p>Weaknesses</p>
<ul>
<li>Changes can be slower to get to production due to the extra processes intended for the middle branch. To keep things moving, there should be someone (or a group of people) in place who fully owns managing the changes, validation status, and release cycle.</li>
<li>Changes that are valid can get stuck behind other changes that aren’t - having a good plan in place for how the team should handle this scenario is essential, because this conundrum can hold up getting things to production.</li>
<li>There’s extra management of any new trunks, which will need ownership - without someone (or a group of people) who is knowledgeable, it can be confusing to understand what needs to be done and how to do it when things get out of sync.</li>
<li>It can require additional compute in the form of scheduled jobs in the QA environment, as well as an additional CI job from <code>qa</code> &gt; <code>main</code> for testing releases before they're merged.</li>
</ul>
</li>
</ul>
<h1>Further enhancements</h1>
<p>Once you have your basic configurations in place, you can further tweak your project by considering which other features will be helpful for your needs:</p>
<ul>
<li>Continuous Integration:<!-- -->
<ul>
<li><a href="https://docs.getdbt.com/docs/deploy/ci-jobs#set-up-ci-jobs" target="_blank" rel="noopener noreferrer">Only running and testing changed models</a> and their dependencies</li>
<li>Using <a href="https://docs.getdbt.com/reference/commands/clone" target="_blank" rel="noopener noreferrer">dbt clone</a> to get a copy of large incrementals in CI</li>
</ul>
</li>
<li>Development and Deployment:<!-- -->
<ul>
<li>Using <a href="https://docs.getdbt.com/docs/build/custom-schemas" target="_blank" rel="noopener noreferrer">schema configurations</a> in the project to add more separation in a database</li>
<li>Using <a href="https://docs.getdbt.com/docs/build/custom-databases" target="_blank" rel="noopener noreferrer">database configurations</a> in the project to switch databases for model builds</li>
</ul>
</li>
</ul>
<h1>Frequently asked git questions</h1>
<p><strong>General</strong></p>
<details class="details_lb9f alert alert--info details_b_Ee" data-collapsed="true"><summary>How do you prevent developers from changing specific files?</summary><div><div class="collapsibleContent_i85q"><p></p><p>Many git providers have a CODEOWNERS feature which can be leveraged to tag appropriate reviewers when certain files or folders are changed.</p><p></p></div></div></details>
<details class="details_lb9f alert alert--info details_b_Ee" data-collapsed="true"><summary>How do you execute other types of checks in the development workflow?</summary><div><div class="collapsibleContent_i85q"><p></p><p>Auto-formatting and linting are both <a href="https://docs.getdbt.com/docs/cloud/studio-ide/lint-format#format" target="_blank" rel="noopener noreferrer">features available in dbt Cloud's IDE</a>. You can enable linting <a href="https://docs.getdbt.com/docs/deploy/continuous-integration#sql-linting" target="_blank" rel="noopener noreferrer">within your CI job</a>.</p><p>Other types of checks are typically implemented through external pipelines, and usually through the git provider due to the alignment of where these checks are desired in the development workflow. Many git providers have pipeline features available, such as GitHub Actions or GitLab CI/CD Pipelines. Here's an example that <a href="https://medium.com/@durgeshm01722/add-a-branch-naming-pattern-status-check-to-your-github-prs-660c53331b68" target="_blank" rel="noopener noreferrer">checks that a branch name follows a pattern upon a pull request event</a>.</p><p></p></div></div></details>
<details class="details_lb9f alert alert--info details_b_Ee" data-collapsed="true"><summary>How do you revert changes?</summary><div><div class="collapsibleContent_i85q"><p></p><p>This is an action performed outside of dbt through git operations, but an immediate solution can be implemented using git tags/releases until your code is fixed to your liking:</p><ul>
<li>Apply a git tag (a feature on most git platforms) on the commit SHA that you want to roll back to</li>
<li>Use the tag as your <code>custom branch</code> on your production environment in dbt Cloud. Your jobs will now check out the code at this point in time.</li>
<li>Now you can work as normal. Fix things through the development workflow or have a knowledgeable person revert the changes through git, it doesn’t matter - production is pinned to the previous state until you change the custom branch back to <code>main</code>!</li>
</ul><p></p></div></div></details>
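To make the rollback recipe above concrete, here is a minimal sketch in a throwaway local repository (the file contents, commit messages, and the tag name "v1-rollback" are all invented for illustration):

```shell
# Minimal sketch in a throwaway repo; all names here are invented.
set -e
demo=$(mktemp -d) && cd "$demo" && git init -q
git config user.email dev@example.com && git config user.name "A Developer"

echo "select 1" > model.sql && git add . && git commit -qm "known-good state"
good_sha=$(git rev-parse HEAD)
echo "select oops" > model.sql && git commit -qam "bad change"

# Tag the last good commit; in dbt Cloud, set this tag as the production
# environment's custom branch (after pushing it: git push origin v1-rollback)
git tag v1-rollback "$good_sha"
git show v1-rollback:model.sql   # production is pinned to the known-good state
```

Once main is fixed through your normal workflow, point the custom branch back to main and clean up the tag.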
<p><strong>Indirect promotion-specific</strong></p>
<details class="details_lb9f alert alert--info details_b_Ee" data-collapsed="true"><summary>How do you make releases?</summary><div><div class="collapsibleContent_i85q"><p></p><p>In our examples, we noted that our definition of a release is a pull request from <code>qa</code> to <code>main</code>, and this is opened from the git platform.</p><p><strong>Having the source branch as <code>qa</code> on your pull request will also incorporate any new merges to <code>qa</code> while your PR stays open, possibly resulting in other features being promoted to <code>main</code> unintentionally once you merge.</strong> Because of this, it’s important that the person opening a release stays up to date on merges and last runs to ensure the validity of changes before the release is merged. There are two options we like to implement to make this easier:</p><ul>
<li>A CI job for pull requests against <code>main</code>: this will run a CI job comparing our middle branch to <code>main</code> at release time, and will rerun when there are any new merges to <code>qa</code>. Not only that, but the status will show on our pull request and we can leverage other features like <a href="https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/collaborating-on-repositories-with-code-quality-features/about-status-checks" target="_blank" rel="noopener noreferrer">GitHub's required status checks</a> to further ensure we're only merging successful and tested changes.</li>
<li>An <a href="https://docs.getdbt.com/docs/deploy/merge-jobs" target="_blank" rel="noopener noreferrer">on-merge job</a> using our <code>qa</code> environment. This will run a job any time someone merges. You may opt for this if you’d rather not wait on a CI pipeline to finish when you open a release. However, this will not put a status on a release PR, so we wouldn't be able to block anyone from merging a release based on run status. When using this method, the release owner should still stay up to date with merges and the status of the latest run before merging.</li>
</ul><p></p></div></div></details>
<details class="details_lb9f alert alert--info details_b_Ee" data-collapsed="true"><summary>Hierarchical promotion introduces changes that may not be ready for production yet, which holds up releases. How do you manage that?</summary><div><div class="collapsibleContent_i85q"><p></p><p>The process of choosing specific commits to move to another branch is called <strong>Cherry Picking</strong>.</p><link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:85%"><span><a href="https://docs.getdbt.com/blog/git-branching-strategies-with-dbt#" data-featherlight="/img/blog/2025-01-28-git-branching-strategies-and-dbt/11_cherry_picking.png"><img data-toggle="lightbox" alt="Cherry Picking diagram" title="Cherry Picking diagram" src="https://docs.getdbt.com/img/blog/2025-01-28-git-branching-strategies-and-dbt/11_cherry_picking.png?v=2"></a></span><span class="title_aGrV">Cherry Picking diagram</span></div><p>You may be tempted to change to a less standard branching strategy to avoid this - our colleague Grace Goheen has <a href="https://docs.getdbt.com/blog/the-case-against-git-cherry-picking" target="_blank" rel="noopener noreferrer">written some thoughts on this</a> and provided examples - it’s a worthwhile read!</p><p>dbt does not perform cherry-picking operations; this needs to be done from a command line interface or your git platform’s user interface, if the option is available. We align with Grace on this one — not only does cherry picking require a very good understanding of git operations and the state of the branches, but when it isn’t done with care, it can introduce a host of other issues that can be hard to resolve. What we tend to see is that the CI processes we’ve exemplified instead shift the definition of the first PR’s approval: not only can it be approved for coding and syntax by a peer, but also for its output by selecting from objects built within the CI schema. This eliminates a lot of the issues with encountering code that can’t be merged to production.</p><p>We also implement other features that help us in trying times:</p><ul>
<li>The <a href="https://docs.getdbt.com/reference/node-selection/exclude" target="_blank" rel="noopener noreferrer"><code>--exclude</code></a> command flag helps us omit building models in a job</li>
<li>The <a href="https://docs.getdbt.com/reference/resource-configs/enabled" target="_blank" rel="noopener noreferrer"><code>enabled</code></a> configuration helps us keep models from being executed in any job for a longer-term solution</li>
<li>Using <a href="https://docs.getdbt.com/docs/mesh/govern/model-contracts" target="_blank" rel="noopener noreferrer">contracts</a> and <a href="https://docs.getdbt.com/docs/mesh/govern/model-versions" target="_blank" rel="noopener noreferrer">versions</a> helps alleviate breaking code changes between teams in dbt Mesh</li>
<li><a href="https://docs.getdbt.com/docs/build/unit-tests" target="_blank" rel="noopener noreferrer">Unit tests</a> and <a href="https://docs.getdbt.com/docs/build/data-tests" target="_blank" rel="noopener noreferrer">data tests</a>, along with forming best practices around the minimum requirements for every model, helps us continuously test our expectations (see the <a href="https://hub.getdbt.com/tnightengale/dbt_meta_testing/latest/" target="_blank" rel="noopener noreferrer">dbt_meta_testing</a> package)</li>
<li>Using the <a href="https://hub.getdbt.com/dbt-labs/audit_helper/latest" target="_blank" rel="noopener noreferrer">dbt audit helper</a> package or <a href="https://docs.getdbt.com/docs/deploy/advanced-ci" target="_blank" rel="noopener noreferrer">enabling advanced CI on our continuous integration jobs</a> helps us understand the impacts our changes make to the original data set</li>
</ul><p>If you find yourself needing to cherry-pick regularly, assessing your review and quality assurance processes and where they happen in your pipeline can be very helpful in determining how to avoid it.</p><p></p></div></div></details>
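For reference, this is what the operation itself looks like in a throwaway repo (the branch names, file names, and commit messages are invented); per the advice above, reach for it sparingly:

```shell
# Sketch of cherry-picking one validated commit from qa to main.
set -e
demo=$(mktemp -d) && cd "$demo" && git init -q
git config user.email dev@example.com && git config user.name "A Developer"
git checkout -qb main
echo "select 1" > base.sql && git add . && git commit -qm "base"

# Two merged features sit on qa, but only one is validated for production
git checkout -qb qa
echo "select 2" > ready.sql && git add . && git commit -qm "feature A (validated)"
ready_sha=$(git rev-parse HEAD)
echo "select 3" > not_ready.sql && git add . && git commit -qm "feature B (blocked)"

# Apply only the validated commit to main; -x records the source commit
git checkout -q main
git cherry-pick -x "$ready_sha"
ls   # base.sql and ready.sql; not_ready.sql stays behind on qa
```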
<details class="details_lb9f alert alert--info details_b_Ee" data-collapsed="true"><summary>What if a bad change made it all the way into production?</summary><div><div class="collapsibleContent_i85q"><p></p><p>The process of fixing <code>main</code> directly is called a <strong>hotfix</strong>. This needs to be done with git locally or with your git platform’s user interface, because dbt’s IDE is based on the branch you set for your developers to base from (in our case, <code>qa</code>).</p><p>The pattern for hotfixes in hierarchical promotion looks like this:</p><link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:85%"><span><a href="https://docs.getdbt.com/blog/git-branching-strategies-with-dbt#" data-featherlight="/img/blog/2025-01-28-git-branching-strategies-and-dbt/12_hotfixes.png"><img data-toggle="lightbox" alt="Hotfix diagram" title="Hotfix diagram" src="https://docs.getdbt.com/img/blog/2025-01-28-git-branching-strategies-and-dbt/12_hotfixes.png?v=2"></a></span><span class="title_aGrV">Hotfix diagram</span></div><p>Here’s how it’s typically performed:</p><ol>
<li>Create a branch from <code>main</code>, then make the change and test the fix.</li>
<li>Open a PR to <code>main</code>, get the fix approved, then merged. The fix is now live.</li>
<li>Check out <code>qa</code>, and <code>git pull</code> to ensure it’s up to date with what’s on the remote.</li>
<li>Merge <code>main</code> into <code>qa</code>: <code>git merge main</code>.</li>
<li><code>git push</code> the changes back to the remote.</li>
<li>At this point in our example, developers will be flagged in dbt Cloud’s IDE that there is a change on their base branch and can “Pull from remote”. However, if you implement more than one middle branch, you will need to continue resolving your branches hierarchically until you update the branch that developers base from.</li>
</ol><p></p></div></div></details>
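The numbered steps above can be sketched end to end in a throwaway repo, with a local bare repository standing in for your git provider (all names here are invented):

```shell
# Hotfix flow for hierarchical promotion, simulated locally.
set -e
demo=$(mktemp -d) && cd "$demo"
git init -q --bare origin.git                  # stands in for GitHub/GitLab
git clone -q origin.git work && cd work
git config user.email dev@example.com && git config user.name "A Developer"

# Seed main and the qa middle branch
git checkout -qb main
echo "select 1" > model.sql && git add . && git commit -qm "init"
git push -q origin main
git checkout -qb qa && git push -q origin qa

# 1-2. Branch from main, fix, and merge back (the PR step, simplified)
git checkout -q main && git checkout -qb hotfix
echo "select 1 -- fixed" > model.sql && git commit -qam "hotfix"
git checkout -q main && git merge -q hotfix && git push -q origin main

# 3-5. Bring qa up to date with the fix and push
git checkout -q qa && git pull -q origin qa
git merge -q main -m "sync hotfix from main" && git push -q origin qa
git show origin/qa:model.sql   # qa now contains the hotfix
```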
<details class="details_lb9f alert alert--info details_b_Ee" data-collapsed="true"><summary>What if we want to use more than one middle branch in our strategy?</summary><div><div class="collapsibleContent_i85q"><p></p><p>In our experience, using more than one middle branch is rarely needed. The more steps you are away from <code>main</code>, the more hurdles you’ll need to jump through to get back to it. If your team isn’t properly equipped, this ends up putting a lot of overhead on development operations. For this reason, we don’t recommend more branches if you can help it. The teams who are successful with more trunks have plenty of folks who can properly dedicate time and management to these processes.</p><link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:85%"><span><a href="https://docs.getdbt.com/blog/git-branching-strategies-with-dbt#" data-featherlight="/img/blog/2025-01-28-git-branching-strategies-and-dbt/13_more_branches.png"><img data-toggle="lightbox" alt="A git strategy with more branches" title="A git strategy with more branches" src="https://docs.getdbt.com/img/blog/2025-01-28-git-branching-strategies-and-dbt/13_more_branches.png?v=2"></a></span><span class="title_aGrV">A git strategy with more branches</span></div><p>This structure is mostly desired when there are requirements for different teams to use different versions of data (i.e., scrubbed data) while working with the same code changes. It allows each team to have a dedicated environment for deployments. Example:</p><ol>
<li>Developers work off of mocked data for their <code>feature</code> branches and merge to <code>qa</code> for end-to-end and over-time testing of all merged changes using the mocked data before releasing to <code>preproduction</code>.</li>
<li>Once <code>qa</code> is merged to <code>preproduction</code>, the underlying data being used switches to using scrubbed production data and other personas can start looking at and reviewing how this data is functioning before it hits production.</li>
<li>Once <code>preproduction</code> is merged to <code>main</code>, the underlying data being used switches to production data sets.</li>
</ol><p>To show a comparison, this same use case can be covered with a simpler branching strategy through the use of git tags and <a href="https://docs.getdbt.com/docs/build/environment-variables" target="_blank" rel="noopener noreferrer">dbt environment variables</a> to switch source data:</p><ul>
<li>
<p>Indirect Promotion:</p>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:85%"><span><a href="https://docs.getdbt.com/blog/git-branching-strategies-with-dbt#" data-featherlight="/img/blog/2025-01-28-git-branching-strategies-and-dbt/14_indirect_tagging.png"><img data-toggle="lightbox" alt="Tagging in Indirect Promotion" title="Tagging in Indirect Promotion" src="https://docs.getdbt.com/img/blog/2025-01-28-git-branching-strategies-and-dbt/14_indirect_tagging.png?v=2"></a></span><span class="title_aGrV">Tagging in Indirect Promotion</span></div>
</li>
<li>
<p>Direct Promotion:</p>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:85%"><span><a href="https://docs.getdbt.com/blog/git-branching-strategies-with-dbt#" data-featherlight="/img/blog/2025-01-28-git-branching-strategies-and-dbt/15_direct_tagging.png"><img data-toggle="lightbox" alt="Tagging in Direct Promotion" title="Tagging in Direct Promotion" src="https://docs.getdbt.com/img/blog/2025-01-28-git-branching-strategies-and-dbt/15_direct_tagging.png?v=2"></a></span><span class="title_aGrV">Tagging in Direct Promotion</span></div>
</li>
</ul><p>Which option you choose depends on how your team would like to manage the changes. No matter the reason for more branches, these points are always relevant to plan out:</p><ul>
<li>Can we accurately describe the use case of each branch?</li>
<li>Who owns the oversight of any new branches?</li>
<li>Who are the major players in the promotion process between each branch and what are they responsible for?</li>
<li>Which major branches do we want dbt Cloud deployment jobs for?</li>
<li>Which PR stages do we want continuous integration jobs on?</li>
<li>Which major branch rules or PR templates do we need to add?</li>
</ul><p>By answering these questions, you should be able to follow our same guidance from our examples for setting up your additional branches.</p><p></p></div></div></details>
<p><strong>Direct promotion-specific</strong></p>
<details class="details_lb9f alert alert--info details_b_Ee" data-collapsed="true"><summary>We need a middle environment and don’t want to change our branching strategy! Is there any way to reflect what’s in development?</summary><div><div class="collapsibleContent_i85q"><p></p><p>git releases/tags are a mechanism that helps you label a specific commit SHA. <em>Deployment environments</em> in dbt Cloud can use these just like they can a custom branch. Teams leverage this either to pin their environments to code at a certain point in time or to keep as a roll-back option, if needed.</p><p>We can use the pinning method to create our middle environment. Example:</p><ul>
<li>We create a release tag, <code>v2</code>, from our repository.</li>
<li>We specify <code>v2</code> as our branch in our Production environment’s <strong>custom branch</strong> setting.
Jobs using Production will now check out code at <code>v2</code>.</li>
<li>We set up an environment called “QA”, with the <strong>custom branch</strong> setting as <code>main</code>. For the database and schema, we specify the <code>qa</code> database and <code>analytics</code> schema. Jobs created using this environment will check out code from <code>main</code> and build it to <code>qa.analytics</code>.</li>
</ul><link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:85%"><span><a href="https://docs.getdbt.com/blog/git-branching-strategies-with-dbt#" data-featherlight="/img/blog/2025-01-28-git-branching-strategies-and-dbt/16_direct_tagging_middle_env.png"><img data-toggle="lightbox" alt="Tagging in Direct Promotion to create a middle environment" title="Tagging in Direct Promotion to create a middle environment" src="https://docs.getdbt.com/img/blog/2025-01-28-git-branching-strategies-and-dbt/16_direct_tagging_middle_env.png?v=2"></a></span><span class="title_aGrV">Tagging in Direct Promotion to create a middle environment</span></div><div style="margin:40px 10px"><div class="loomWrapper_TTvb"><iframe width="640" class="loomFrame_B61a" height="400" src="https://www.loom.com/embed/dfe057bf92b2498eb1e653c32fc72e93" frameborder="0" allowfullscreen="" webkitallowfullscreen="true" mozallowfullscreen="true"></iframe></div></div><p></p></div></div></details>
<details class="details_lb9f alert alert--info details_b_Ee" data-collapsed="true"><summary>How do we change from a direct promotion strategy to an indirect promotion strategy?</summary><div><div class="collapsibleContent_i85q"><p></p><p>Here’s the additional setup steps in a nutshell (using the name <code>qa</code> for our middle environment) - for more details be sure to read through the indirect promotion section:</p><ul>
<li>git Platform<!-- -->
<ul>
<li>Create a new branch called <code>qa</code>, which is derived from <code>main</code></li>
<li>Protect <code>qa</code> with branch protection rules</li>
</ul>
</li>
<li>dbt Cloud<!-- -->
<ul>
<li>Development: Switch your environment to use the <strong>custom branch</strong> option and specify <code>qa</code>. This will base developers now off of <code>qa</code> code.</li>
<li>Continuous Integration: If you have an existing job for this, ensure the <strong>custom branch</strong> is changed to <code>qa</code>. This will change the CI job’s trigger to occur on pull requests to <code>qa</code>.</li>
</ul>
</li>
</ul><p><strong>At this point, your developers will be following the indirect promotion workflow and you can continue working on things in the background.</strong> You may still need to set up a database, database permissions, environments, deployment jobs, etc. Here is a short checklist to help you out! Refer back to our section on indirect promotion for many more details:</p><ul>
<li>
<p><strong>Decide if you want to deploy QA code.</strong> Many folks will deploy so they can make use of deferral and Advanced CI features. If so:</p>
<ul>
<li>Create the database where the objects will build</li>
<li>Set up a service account for QA and give it all the proper permissions to create and modify the contents within this database. It should also have select-only access to raw data.</li>
<li>Set up an environment for QA in dbt Cloud, being sure to connect it to the database and schema you want your deployments to build in.</li>
<li>Set up any deployment jobs using the QA environment.</li>
<li>If you want to use deferral or advanced features in CI, be sure that you first have a successful run in QA and then set your deferral setting on your CI job to the QA environment.</li>
</ul>
</li>
<li>
<p><strong>Decide if you want CI on release pull requests (from <code>qa</code> to <code>main</code>). If so:</strong></p>
<ul>
<li>Set up an environment called “Release CI”</li>
<li>Set up the continuous integration job using the “Release CI” environment</li>
<li>If you want to leverage deferral or advanced CI features, defer to your production environment.</li>
</ul>
</li>
</ul><p></p></div></div></details>]]></content>
        <author>
            <name>Christine Berger</name>
        </author>
        <author>
            <name>Carol Ohms</name>
        </author>
        <author>
            <name>Taylor Dunlap</name>
        </author>
        <author>
            <name>Steve Dowling</name>
        </author>
        <category label="analytics craft" term="analytics craft"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Parser, Better, Faster, Stronger: A peek at the new dbt engine]]></title>
        <id>https://docs.getdbt.com/blog/faster-project-parsing-with-rust</id>
        <link href="https://docs.getdbt.com/blog/faster-project-parsing-with-rust"/>
        <updated>2025-02-19T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Remember how dbt felt when you had a small project? You pressed enter and stuff just happened immediately? We're bringing that back.]]></summary>
        <content type="html"><![CDATA[<p>Remember how dbt felt when you had a small project? You pressed enter and stuff just happened immediately? We're bringing that back.</p>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:100%"><span><a href="https://docs.getdbt.com/blog/faster-project-parsing-with-rust#" data-featherlight="/img/blog/2025-02-19-faster-project-parsing-with-rust/parsing_10k.gif"><img data-toggle="lightbox" alt="Benchmarking tip: always try to get data that's good enough that you don't need to do statistics on it" title="Benchmarking tip: always try to get data that's good enough that you don't need to do statistics on it" src="https://docs.getdbt.com/img/blog/2025-02-19-faster-project-parsing-with-rust/parsing_10k.gif?v=2"></a></span><span class="title_aGrV">Benchmarking tip: always try to get data that's good enough that you don't need to do statistics on it</span></div>
<p>After a <a href="https://docs.getdbt.com/blog/the-levels-of-sql-comprehension">series of deep dives</a> into the <a href="https://docs.getdbt.com/blog/sql-comprehension-technologies">guts of SQL comprehension</a>, let's talk about speed a little bit. Specifically, I want to talk about one of the most annoying slowdowns as your project grows: project parsing.</p>
<p>When you're waiting a few seconds or a few minutes for things to start happening after you invoke dbt, it's because parsing isn't finished yet. But Lukas' <a href="https://www.getdbt.com/resources/webinars/accelerating-dbt-with-sdf" target="_blank" rel="noopener noreferrer">SDF demo at last month's webinar</a> didn't have a big wait, so why not?</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="a-primer-on-parsing">A primer on parsing<a href="https://docs.getdbt.com/blog/faster-project-parsing-with-rust#a-primer-on-parsing" class="hash-link" aria-label="Direct link to A primer on parsing" title="Direct link to A primer on parsing">​</a></h2>
<p>Parsing your project (remember: <a href="https://docs.getdbt.com/blog/the-levels-of-sql-comprehension">not your SQL</a>!) is how dbt builds the dependency graph of models and macros. If you've ever looked at a <code>manifest.json</code> and noticed all the <code>depends_on</code> blocks, that's what we're talking about.</p>
<p>Without the resolved dependencies, dbt can't filter down to a subset of your project – this is why parsing is always an all-or-nothing affair. You can't do <code>dbt parse --select my_model+</code> because parsing is what works out what's on the other side of that plus. (Of course, most projects use partial parsing, so they aren't starting from scratch every time.)</p>
<p>All those refs and macros are defined in Jinja. I don't know if you've ever thought about how Jinja gets from curly braces into text, but it's pretty weird! It's actually a two-step process: first it gets converted into Python code, and then that Python code is <em>itself run to generate a string</em>!</p>
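<p>As a toy illustration of that two-step model (invented for illustration; this is not Jinja's actual implementation), here's a sketch in which a template is first translated into Python source, and that source is then executed to produce the rendered string:</p>

```python
import re

# Toy sketch of Jinja's two-step rendering model (invented, not real Jinja):
# step 1 translates the template into Python source; step 2 runs that source.
def compile_template(template):
    """Translate '{{ name }}' placeholders into Python source that renders them."""
    body = "parts = []\n"
    pos = 0
    for match in re.finditer(r"\{\{\s*(\w+)\s*\}\}", template):
        body += f"parts.append({template[pos:match.start()]!r})\n"
        body += f"parts.append(str(context[{match.group(1)!r}]))\n"
        pos = match.end()
    body += f"parts.append({template[pos:]!r})\n"
    body += "result = ''.join(parts)\n"
    return body

def render(template, context):
    # Step 2: execute the generated Python to produce the final string.
    namespace = {"context": context}
    exec(compile_template(template), namespace)
    return namespace["result"]

print(render("select * from {{ table }}", {"table": "orders"}))  # select * from orders
```

<p>Real Jinja does considerably more (expressions, filters, control flow, sandboxing), but the shape is the same: the template becomes Python code, and rendering means running that code.</p>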
<p>This is kinda slow. Not so much as a one-off, but a project with 10,000 nodes might have 15-20,000 dependencies, so every millisecond adds up.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="what-if-we-wanted-it-to-be-faster">What if we wanted it to be faster?<a href="https://docs.getdbt.com/blog/faster-project-parsing-with-rust#what-if-we-wanted-it-to-be-faster" class="hash-link" aria-label="Direct link to What if we wanted it to be faster?" title="Direct link to What if we wanted it to be faster?">​</a></h2>
<p>Since running the code is slow, one way to get results faster is to not run the code. Since v1.0, dbt's parser has <a href="https://github.com/dbt-labs/dbt-core/blob/main/docs/guides/parsing-vs-compilation-vs-runtime.md#:~:text=Simple%20Jinja%2DSQL%20models%20(using%20just%20ref()%2C%20source()%2C%20%26/or%20config()%20with%20literal%20inputs)%20are%20also%20statically%20analyzed%2C%20using%20a%20thing%20we%20built.%20This%20is%20very%20fast%20(~0.3%20ms)" target="_blank" rel="noopener noreferrer">used a static analyzer</a> to resolve refs when possible, which is <a href="https://docs.getdbt.com/reference/parsing#:~:text=For%20now%2C%20the%20static%20parser,speedup%20in%20the%20model%20parser" target="_blank" rel="noopener noreferrer">about 3x faster</a> than going through the whole rigmarole above.</p>
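<p>The core idea of static analysis can be sketched in a few lines (a drastically simplified, hypothetical version; dbt's actual static parser is far more sophisticated): for models that only call <code>ref()</code> with literal string arguments, the dependencies can be read straight out of the raw text without executing any Jinja at all:</p>

```python
import re

# A drastically simplified, hypothetical take on static analysis: pull
# literal ref() arguments out of the raw model text without rendering Jinja.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*'([^']+)'\s*\)\s*\}\}")

def static_refs(model_sql):
    """Return the upstream models this file depends on, found by pattern matching."""
    return REF_PATTERN.findall(model_sql)

sql = "select * from {{ ref('stg_orders') }} join {{ ref('stg_customers') }} using (order_id)"
print(static_refs(sql))  # ['stg_orders', 'stg_customers']
```

<p>Pattern matching like this only works when the argument is a literal; a ref built up from variables or macros forces a fall back to full Jinja evaluation, which is why static analysis could only ever cover a subset of models.</p>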
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:100%"><span><a href="https://docs.getdbt.com/blog/faster-project-parsing-with-rust#" data-featherlight="/img/blog/2025-02-19-faster-project-parsing-with-rust/evaluation_strategies_1.png"><img data-toggle="lightbox" alt="" title="" src="https://docs.getdbt.com/img/blog/2025-02-19-faster-project-parsing-with-rust/evaluation_strategies_1.png?v=2"></a></span></div>
<p>The other way you could get the result faster is to run the code faster.</p>
<p>The original author of Jinja also wrote <a href="https://github.com/mitsuhiko/minijinja" target="_blank" rel="noopener noreferrer">minijinja</a> – a Rust implementation of a subset of the original Jinja library.</p>
<p>This is not the post for a deep dive on <em>why</em> Rust and Python have such different performance characteristics, but the key takeaway is that <a href="https://github.com/mitsuhiko/minijinja/tree/main/benchmarks" target="_blank" rel="noopener noreferrer">minijinja can <em>fully evaluate</em> a ref 30 times faster</a> than today's dbt can even <em>statically analyze</em> it.</p>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:100%"><span><a href="https://docs.getdbt.com/blog/faster-project-parsing-with-rust#" data-featherlight="/img/blog/2025-02-19-faster-project-parsing-with-rust/evaluation_strategies_2.png"><img data-toggle="lightbox" alt="" title="" src="https://docs.getdbt.com/img/blog/2025-02-19-faster-project-parsing-with-rust/evaluation_strategies_2.png?v=2"></a></span></div>
<p>Our analysis in the leadup to dbt v1.0 showed that the static analyzer could handle 60% of models. Evaluating refs 30x faster in 60% of models would itself be great.</p>
<p>But recall that static analysis was the workaround for evaluating Jinja being slow. Since <strong>we can now evaluate Jinja faster than we can statically analyze it</strong>, let's just<sup>†</sup> evaluate everything!</p>
<p><sup>†</sup>The word "just" is doing a <em>lot</em> of heavy lifting here. In practice, there's a lot happening behind the scenes to get both the performance of minijinja and the ability to process the full range of capabilities of a dbt project. Another story for another day.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="what-does-this-mean-in-practice">What does this mean in practice?<a href="https://docs.getdbt.com/blog/faster-project-parsing-with-rust#what-does-this-mean-in-practice" class="hash-link" aria-label="Direct link to What does this mean in practice?" title="Direct link to What does this mean in practice?">​</a></h2>
<p>As you saw at the top of the post, I've been running some synthetic projects against an early build of the new dbt engine, and it's pretty snappy: <strong>parsing a 10,000 model project in under 600ms</strong>. Let's see how it goes with some other common project sizes:</p>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:100%"><span><a href="https://docs.getdbt.com/blog/faster-project-parsing-with-rust#" data-featherlight="/img/blog/2025-02-19-faster-project-parsing-with-rust/parse_time_comparison_linear.png"><img data-toggle="lightbox" alt="You might have to squint, but I promise there's a yellow line on each of those groups" title="You might have to squint, but I promise there's a yellow line on each of those groups" src="https://docs.getdbt.com/img/blog/2025-02-19-faster-project-parsing-with-rust/parse_time_comparison_linear.png?v=2"></a></span><span class="title_aGrV">You might have to squint, but I promise there's a yellow line on each of those groups</span></div>
<p>Even a 20,000-model project finished parsing in about a second. The equivalent cold parse takes well over a minute, and a partial parse (with no changed files) took about 12 seconds.</p>
<p>Let's look at one more comparison: <strong>100k models</strong>. I need to break out the log scale for this one:</p>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:100%"><span><a href="https://docs.getdbt.com/blog/faster-project-parsing-with-rust#" data-featherlight="/img/blog/2025-02-19-faster-project-parsing-with-rust/parse_time_comparison_log.png"><img data-toggle="lightbox" alt="" title="" src="https://docs.getdbt.com/img/blog/2025-02-19-faster-project-parsing-with-rust/parse_time_comparison_log.png?v=2"></a></span></div>
<p>The new dbt engine parsed our 100,000 model example project in under 10 seconds, compared with almost 20 minutes.</p>
<p>Let me be clear: I do not think you should put 100,000 models into your project! I mostly ran that one for the lols. But back in the realm of project sizes that actually exist:</p>
<ul>
<li>If your project isn't currently eligible for partial parsing, cold parses in Rust are fast enough to make it a moot point.</li>
<li>Regardless of how your project parses today, your project will feel like it's a couple of orders of magnitude smaller than it is.</li>
</ul>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="were-just-getting-started">We're just getting started<a href="https://docs.getdbt.com/blog/faster-project-parsing-with-rust#were-just-getting-started" class="hash-link" aria-label="Direct link to We're just getting started" title="Direct link to We're just getting started">​</a></h2>
<p>Speed is just one benefit to come from this integration, and pales in comparison to, say, <a href="https://roundup.getdbt.com/p/the-power-of-a-plan-how-logical-plans" target="_blank" rel="noopener noreferrer">the importance of logical plans</a>. But it sure is fun!</p>
<p>The teams are still hard at work integrating the two tools, and we'll have more to share on how the developer experience will change thanks to SDF's tech at our <a href="https://www.getdbt.com/resources/webinars/dbt-developer-day" target="_blank" rel="noopener noreferrer">Developer Day event in March</a>.</p>]]></content>
        <author>
            <name>Joel Labes</name>
        </author>
        <category label="data ecosystem" term="data ecosystem"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[The key technologies behind SQL Comprehension]]></title>
        <id>https://docs.getdbt.com/blog/sql-comprehension-technologies</id>
        <link href="https://docs.getdbt.com/blog/sql-comprehension-technologies"/>
        <updated>2025-01-24T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[The technologies that power the three levels of SQL comprehension. ]]></summary>
        <content type="html"><![CDATA[<p>You ever wonder what’s <em>really</em> going on in your database when you fire off a (perfect, efficient, full-of-insight) SQL query to your database?</p>
<p>OK, probably not 😅. Your personal tastes aside, we’ve been talking a <em>lot</em> about SQL Comprehension tools at dbt Labs in the wake of our acquisition of SDF Labs, and think that the community would benefit if we included them in the conversation too! We recently published a <a href="https://docs.getdbt.com/blog/the-levels-of-sql-comprehension" target="_blank" rel="noopener noreferrer">blog that talked about the different levels of SQL Comprehension tools</a>. If you read that, you may have encountered a few new terms you weren’t super familiar with.</p>
<p>In this post, we’ll talk about the technologies that underpin SQL Comprehension tools in more detail. Hopefully, you come away with a deeper understanding of and appreciation for the hard work that your computer does to turn your SQL queries into actionable business insights!</p>
<p>Here’s a quick refresher on the levels of SQL comprehension:</p>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:85%"><span><a href="https://docs.getdbt.com/blog/sql-comprehension-technologies#" data-featherlight="/img/blog/2025-01-23-levels-of-sql-comprehension/validation_all_levels.png"><img data-toggle="lightbox" alt="The three levels of SQL Comprehension, with example SQL." title="The three levels of SQL Comprehension, with example SQL." src="https://docs.getdbt.com/img/blog/2025-01-23-levels-of-sql-comprehension/validation_all_levels.png?v=2"></a></span><span class="title_aGrV">The three levels of SQL Comprehension, with example SQL.</span></div>
<p>Each of these levels is powered by a distinct set of technologies. It’s useful to explore these technologies in the context of the SQL Comprehension tool you are probably most familiar with: a database! A database, as you might have guessed, has the deepest possible SQL comprehension abilities as well as SQL <em>execution</em> abilities — it contains all necessary technology to translate a SQL query text into rows and columns.</p>
<p>Here’s a simplified diagram to show your query’s fantastic voyage of translation into tabular data:</p>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:85%"><span><a href="https://docs.getdbt.com/blog/sql-comprehension-technologies#" data-featherlight="/img/blog/2025-01-24-sql-comprehension-technologies/full_translation_flow.png"><img data-toggle="lightbox" alt="A flow chart showing a SQL query's journey to raw data." title="A flow chart showing a SQL query's journey to raw data." src="https://docs.getdbt.com/img/blog/2025-01-24-sql-comprehension-technologies/full_translation_flow.png?v=2"></a></span><span class="title_aGrV">A flow chart showing a SQL query's journey to raw data.</span></div>
<p>First, databases use a <strong>parser</strong> to translate SQL code into a <strong>syntax tree.</strong> This enables syntax validation + error handling.</p>
<p>Second, database <strong>compilers</strong> <strong>bind</strong> metadata to the syntax tree to create a fully validated <strong>logical plan.</strong> This enables a complete understanding of the operations required to generate your dataset, including information about the datatypes that are input and output during SQL execution.</p>
<p>Third, the database <strong>optimizes</strong> and <strong>plans</strong> the operations defined by a logical plan, generating a <strong>physical plan</strong> that maps the logical steps to physical hardware, then executes the steps with data to finally return your dataset!</p>
<p>Let’s explore each of these levels in more depth!</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="level-1-parsing">Level 1: Parsing<a href="https://docs.getdbt.com/blog/sql-comprehension-technologies#level-1-parsing" class="hash-link" aria-label="Direct link to Level 1: Parsing" title="Direct link to Level 1: Parsing">​</a></h2>
<p>At Level 1, SQL comprehension tools use a <strong>parser</strong> to translate SQL code into a <strong>syntax tree.</strong> This enables syntax validation + error handling. <em>Key Concepts: Intermediate Representations, Parsers, Syntax Trees</em></p>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:85%"><span><a href="https://docs.getdbt.com/blog/sql-comprehension-technologies#" data-featherlight="/img/blog/2025-01-24-sql-comprehension-technologies/parser.png"><img data-toggle="lightbox" alt="Parsers can model the grammar and structure of code." title="Parsers can model the grammar and structure of code." src="https://docs.getdbt.com/img/blog/2025-01-24-sql-comprehension-technologies/parser.png?v=2"></a></span><span class="title_aGrV">Parsers can model the grammar and structure of code.</span></div>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="intermediate-representations">Intermediate representations<a href="https://docs.getdbt.com/blog/sql-comprehension-technologies#intermediate-representations" class="hash-link" aria-label="Direct link to Intermediate representations" title="Direct link to Intermediate representations">​</a></h3>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>tip</div><div class="admonitionContent_BuS1"><p><strong>Intermediate representations</strong> are data objects created during the process of <em>compiling</em> code.</p></div></div>
<p>Before we dive into the specific technologies, we should define a key concept in computer science that’s very relevant to understanding how this entire process works under the hood: an <a href="https://en.wikipedia.org/wiki/Intermediate_representation" target="_blank" rel="noopener noreferrer"><strong>Intermediate Representation (IR)</strong></a>. When code is executed on a computer, it has to be translated from the human-readable code we write to the machine-readable code that actually does the work that the higher-level code specifies, in a process called <em>compiling</em>. As a part of this process, your code will be translated into a number of different objects as the program runs; each of these is called an <em>intermediate representation.</em></p>
<p>To provide an example / analogy that will be familiar to dbt users, think about what your intermediate models are in the context of your dbt DAG — a translated form of your source data created in the process of synthesizing your final data marts. These models are effectively an intermediate representation. We’re going to talk about a few different types of IRs in this post, so it’s useful to know about them now before we get too deep!</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="parsers">Parsers<a href="https://docs.getdbt.com/blog/sql-comprehension-technologies#parsers" class="hash-link" aria-label="Direct link to Parsers" title="Direct link to Parsers">​</a></h3>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>tip</div><div class="admonitionContent_BuS1"><p><strong>Parsers</strong> are programs that translate raw code into <em>syntax trees</em>.</p></div></div>
<p>All programming languages require a parser, which is often the first step in the translation process from human-readable to machine-readable code. Parsers are programs that can map the syntax, or grammar, of your code into a syntax tree, and understand whether the code you wrote follows the basic rules of the language.</p>
<p>In computing, parsers have a few underlying pieces of technology that build the syntax tree that understands the relationships between your variables, functions, and classes, etc. The components of a parser include:</p>
<ul>
<li><strong>a lexer</strong>, which takes raw code strings and returns lists of tokens recognized in the code (in SQL, <code>SELECT</code>, <code>FROM</code>, and <code>sum</code> would be examples of tokens recognized by a lexer)</li>
<li><strong>a parser</strong>, which takes the list of tokens generated by the lexer and builds the syntax tree based on the grammatical rules of the language (e.g. a <code>SELECT</code> must be followed by one or more column expressions, a <code>FROM</code> must reference a table, CTE, or subquery, etc.).</li>
</ul>
<p>In other words, the lexer first detects the tokens that are present in a SQL query (is there a filter? which functions are called?) and the parser is responsible for mapping the dependencies between them.</p>
<p>A quick vocab note: while technically, the parser is only the component that translates tokens into a syntax tree, the word “parser” has come to be shorthand for the whole process of lexing and parsing.</p>
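<p>Here's what the lexing half might look like for a tiny slice of SQL (a toy sketch; production lexers also handle comments, quoting rules, numeric literals, and dialect quirks):</p>

```python
import re

# A toy lexer for a tiny slice of SQL. Each piece of raw text becomes a
# (kind, value) token; the parser later consumes this flat list.
KEYWORDS = {"select", "from", "where", "group", "by", "as"}
TOKEN_RE = re.compile(r"(\w+)|(\S)")  # a word, or any single non-space symbol

def lex(sql):
    tokens = []
    for word, symbol in TOKEN_RE.findall(sql):
        if word:
            kind = "KEYWORD" if word.lower() in KEYWORDS else "IDENT"
            tokens.append((kind, word))
        else:
            tokens.append(("SYMBOL", symbol))
    return tokens

print(lex("select order_id, sum(amount) as total from order_items"))
```

<p>Note that the lexer is happy to tokenize <code>from select (</code> even though it's gibberish; enforcing the grammar is the parser's job.</p>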
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="syntax-trees">Syntax trees<a href="https://docs.getdbt.com/blog/sql-comprehension-technologies#syntax-trees" class="hash-link" aria-label="Direct link to Syntax trees" title="Direct link to Syntax trees">​</a></h3>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>tip</div><div class="admonitionContent_BuS1"><p><strong>Syntax trees</strong> are a representation of a unit of language according to a set of grammatical rules.</p></div></div>
<p>Your first introduction to understanding syntactical rules probably came when you learned how to diagram sentences in your grade school grammar classes! Diagramming the parts of speech in a sentence and mapping the dependencies between each of its components is precisely what a parser does — the resulting representation of the sentence is a syntax tree. Here’s a silly example:</p>
<blockquote>
<p><code>My cat jumped over my lazy dog</code></p>
</blockquote>
<p>By parsing this sentence according to the rules of the English language, we can get this syntax tree:</p>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:85%"><span><a href="https://docs.getdbt.com/blog/sql-comprehension-technologies#" data-featherlight="/img/blog/2025-01-24-sql-comprehension-technologies/sentence_syntax_tree.png"><img data-toggle="lightbox" alt="Apologies to my mother, an english teacher, who likely takes umbrage with this simplified example" title="Apologies to my mother, an english teacher, who likely takes umbrage with this simplified example" src="https://docs.getdbt.com/img/blog/2025-01-24-sql-comprehension-technologies/sentence_syntax_tree.png?v=2"></a></span><span class="title_aGrV">Apologies to my mother, an english teacher, who likely takes umbrage with this simplified example</span></div>
<p>Let’s do the same thing with simple SQL query:</p>
<div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#d6deeb;--prism-background-color:#011627"><div class="codeBlockContent_m3Ux"><pre tabindex="0" class="prism-code language-sql codeBlock_qGQc thin-scrollbar" style="color:#d6deeb;background-color:#011627"><code class="codeBlockLines_p187"><span class="token-line" style="color:#d6deeb"><span class="token keyword" style="color:rgb(127, 219, 202)">select</span><span class="token plain"> </span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">  order_id</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"> </span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">  </span><span class="token function" style="color:rgb(130, 170, 255)">sum</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token plain">amount</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(127, 219, 202)">as</span><span class="token plain"> total_order_amount</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain"></span><span class="token keyword" style="color:rgb(127, 219, 202)">from</span><span class="token plain"> order_items</span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain"></span><span class="token keyword" style="color:rgb(127, 219, 202)">where</span><span class="token plain"> </span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain">  date_trunc</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token string" style="color:rgb(173, 219, 103)">'year'</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"> ordered_at</span><span class="token punctuation" 
style="color:rgb(199, 146, 234)">)</span><span class="token plain"> </span><span class="token operator" style="color:rgb(127, 219, 202)">=</span><span class="token plain"> </span><span class="token string" style="color:rgb(173, 219, 103)">'2025-01-01'</span><span class="token plain"></span><br></span><span class="token-line" style="color:#d6deeb"><span class="token plain"></span><span class="token keyword" style="color:rgb(127, 219, 202)">group</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(127, 219, 202)">by</span><span class="token plain"> </span><span class="token number" style="color:rgb(247, 140, 108)">1</span><br></span></code></pre><div class="buttonGroup_6DOT"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>
<p>By parsing this query according to the rules of the SQL language, we get something that looks like this:</p>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:85%"><span><a href="https://docs.getdbt.com/blog/sql-comprehension-technologies#" data-featherlight="/img/blog/2025-01-24-sql-comprehension-technologies/sql_syntax_tree.png"><img data-toggle="lightbox" alt="This is a simplified syntax tree — This was made by hand, and may not be exactly what the output of a real SQL parser looks like!" title="This is a simplified syntax tree — This was made by hand, and may not be exactly what the output of a real SQL parser looks like!" src="https://docs.getdbt.com/img/blog/2025-01-24-sql-comprehension-technologies/sql_syntax_tree.png?v=2"></a></span><span class="title_aGrV">This is a simplified syntax tree — This was made by hand, and may not be exactly what the output of a real SQL parser looks like!</span></div>
<p>The syntax trees produced by parsers are a very valuable type of intermediate representation; with a syntax tree, you can power features like syntax validation, code linting, and code formatting, since those tools only need knowledge of the <em>syntax</em> of the code you’ve written to work.</p>
<p>However, parsers also dutifully parse <em>syntactically correct code</em> that <em>means nothing at all</em>. To illustrate this, consider the <a href="https://en.wikipedia.org/wiki/Colorless_green_ideas_sleep_furiously" target="_blank" rel="noopener noreferrer">famous sentence</a> developed by linguistics + philosophy professor Noam Chomsky:</p>
<blockquote>
<p><code>Colorless green ideas sleep furiously</code></p>
</blockquote>
<p>That’s a perfectly valid, diagrammable, parsable sentence according to the rules of the English language. But that means <em>absolutely nothing</em>. In SQL engines, you need a way to imbue a syntax tree with additional metadata to understand whether or not it represents executable code. As described in our first post, Level 1 SQL Comprehension tools are not designed to provide this context. They can only provide pure syntax validation. Level 2 SQL Comprehension tools augment these syntax trees with <em>meaning</em> by fully <strong>compiling</strong> the SQL.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="level-2-compiling">Level 2: Compiling<a href="https://docs.getdbt.com/blog/sql-comprehension-technologies#level-2-compiling" class="hash-link" aria-label="Direct link to Level 2: Compiling" title="Direct link to Level 2: Compiling">​</a></h2>
<p>At Level 2, SQL comprehension tools use a <strong>compiler</strong> to <strong>bind</strong> metadata to the syntax tree to create a fully validated <strong>logical plan.</strong>  <em>Key concepts: Binders, Logical Plans, Compilers</em></p>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:85%"><span><a href="https://docs.getdbt.com/blog/sql-comprehension-technologies#" data-featherlight="/img/blog/2025-01-24-sql-comprehension-technologies/compiler.png"><img data-toggle="lightbox" alt="" title="" src="https://docs.getdbt.com/img/blog/2025-01-24-sql-comprehension-technologies/compiler.png?v=2"></a></span></div>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="binders">Binders<a href="https://docs.getdbt.com/blog/sql-comprehension-technologies#binders" class="hash-link" aria-label="Direct link to Binders" title="Direct link to Binders">​</a></h3>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>tip</div><div class="admonitionContent_BuS1"><p>In SQL <em>compilers</em>, <strong>binders</strong> are programs that enhance + resolve <em>syntax trees</em> into <em>logical plans.</em></p></div></div>
<p>In compilers, <em>binders</em> (also called <em>analyzers</em> or <em>resolvers</em>) combine additional metadata with a syntax tree representation and produce a richer, validated, <em>executable</em> intermediate representation. In the above English language example, in our heads, we’re <em>binding</em> our knowledge of the definitions of each of the words to the structure of the sentence, after which, we can derive <em>meaning</em>.</p>
<p>Binders are responsible for this process of resolution. They must bind additional information about the components of the written code (their types, their scopes, their memory implications) to the code you wrote to produce a valid, executable unit of computation.</p>
<p>In the case of a SQL binder, a major part of its job is to combine <em>warehouse schema information</em>, like column <em>datatypes</em>, with the <em>type signatures</em> of the warehouse operators described by the syntax tree to bring full <em>type awareness</em> to the syntax tree. It’s one thing to recognize a <code>substring</code> function in a query; it’s another to <em>understand</em> that a <code>substring</code> <em>must</em> operate on string data, and <em>always</em> produces string data, and will fail if you pass it an integer.</p>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:85%"><span><a href="https://docs.getdbt.com/blog/sql-comprehension-technologies#" data-featherlight="/img/blog/2025-01-24-sql-comprehension-technologies/binder.png"><img data-toggle="lightbox" alt="" title="" src="https://docs.getdbt.com/img/blog/2025-01-24-sql-comprehension-technologies/binder.png?v=2"></a></span></div>
<p>In this example, while the syntax tree knows only that the <code>x</code> column is aliased as <code>u</code>, the binder knows that <code>x</code> is a column of type <code>int</code> and that the resulting column <code>u</code> must therefore also be of type <code>int</code>. Similarly, it knows that the filter condition will produce a <code>bool</code> value, and therefore requires its two arguments to have compatible datatypes. Luckily, the binder can also see that <code>x</code> and <code>0</code> are both of type <code>int</code>, so we’re confident this is a fully valid expression. This layer of validation, powered by metadata, is referred to as <em>type awareness.</em></p>
<p>In addition to being able to trace the way datatypes will flow and change through a set of SQL operations, the function signatures allow the binder to fully validate that you’ve provided valid arguments to a function, inclusive of the acceptable types of columns provided to the function (e.g. <code>split_part</code> can’t work on an <code>int</code> field) as well as valid function configurations (e.g. the acceptable date parts for <code>datediff</code> includes <code>'nanosecond'</code> but not <code>'dog_years'</code>).</p>
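<p>To make the idea concrete, here’s a toy sketch of a binder’s signature check. The function catalog and type names are invented for illustration — not any real warehouse’s metadata:</p>

```python
# Toy function catalog: name -> (argument types, return type).
# These signatures are illustrative, not any real warehouse's metadata.
SIGNATURES = {
    "substring": (("string", "int", "int"), "string"),
    "split_part": (("string", "string", "int"), "string"),
}

def bind_call(name, arg_types):
    """Validate a function call against the catalog and infer its result type."""
    if name not in SIGNATURES:
        raise NameError(f"unknown function: {name}")
    expected, return_type = SIGNATURES[name]
    if tuple(arg_types) != expected:
        raise TypeError(f"{name} expects {expected}, got {tuple(arg_types)}")
    return return_type

print(bind_call("substring", ("string", "int", "int")))  # prints: string
# bind_call("split_part", ("int", "string", "int")) raises TypeError --
# split_part can't operate on an int field.
```

<p>The real machinery is far richer (scopes, nullability, implicit casts), but the shape is the same: metadata in, validated types out.</p>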
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="logical-plan">Logical plan<a href="https://docs.getdbt.com/blog/sql-comprehension-technologies#logical-plan" class="hash-link" aria-label="Direct link to Logical plan" title="Direct link to Logical plan">​</a></h3>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>tip</div><div class="admonitionContent_BuS1"><p>In SQL <em>compilers</em>, <strong>logical plans</strong> define the validated, resolved set of data processing operations defined by a SQL query.</p></div></div>
<p>The output of a binder is a richer intermediate representation that can be executed in a low-level language; in the case of database engines, this IR is known as a <em>logical plan</em>.</p>
<p>Critically, as a result of the binder’s work of mapping data types to the syntax tree, logical plans have <em>full data type awareness</em> — logical plans can tell you precisely how data flows through an analysis, and can pinpoint when datatypes may change as a result of, say, an aggregation operation.</p>
<div class="docImage_EYbW" style="max-width:85%"><span><a href="https://docs.getdbt.com/blog/sql-comprehension-technologies#" data-featherlight="/img/blog/2025-01-24-sql-comprehension-technologies/logical_plan.png"><img data-toggle="lightbox" alt="" title="" src="https://docs.getdbt.com/img/blog/2025-01-24-sql-comprehension-technologies/logical_plan.png?v=2"></a></span></div>
<p>You can see we’ve gotten a more specific description of how to generate the dataset. Rather than simply mapping the SQL keywords and their dependencies, we have a resolved set of operations, in this case scanning a table, filtering the result, and projecting the values in the <code>x</code> column with an alias of <code>u</code>.</p>
<p>The logical plan contains a precise logical description of the computation your query defines, and validates that it can be executed. Logical plans describe their operations as <a href="https://en.wikipedia.org/wiki/Relational_algebra" target="_blank" rel="noopener noreferrer"><em>relational algebra</em></a>, which is what enables these plans to be fully optimized — the steps in a logical plan can be rearranged and reduced with mathematical equivalency to ensure they are as efficient as possible.</p>
<p>This plan can be very helpful to you as a developer, especially when it’s available before you execute the query. If you’ve ever run an <code>explain</code> statement in your database, you’ve viewed a logical plan! You know exactly what operations will be executed, and critically, you know that they are valid! This ability to validate code before computing anything is referred to as <em>static analysis</em>.</p>
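<p>You can try this yourself with SQLite from Python’s standard library. The exact shape of <code>explain</code> output varies by database, but the idea is the same: the plan is available before a single row is computed.</p>

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table some_table (x int)")

# EXPLAIN QUERY PLAN reports the resolved plan without executing the query.
plan = con.execute(
    "explain query plan select x as u from some_table where x > 0"
).fetchall()

# Each row's last column is a human-readable plan step, e.g. a scan of some_table.
for step in plan:
    print(step[-1])
```
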
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="compilers">Compilers<a href="https://docs.getdbt.com/blog/sql-comprehension-technologies#compilers" class="hash-link" aria-label="Direct link to Compilers" title="Direct link to Compilers">​</a></h3>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>tip</div><div class="admonitionContent_BuS1"><p><strong>Compilers</strong> are programs that translate high-level language to low-level language. <em>Parsers</em> and <em>binders</em> together constitute compilers.</p></div></div>
<p>Taken together, a parser plus a binder constitute a <em>compiler,</em> a program that takes in high-level code (one that is optimized for human readability, like SQL) and outputs low-level code (one that is optimized for machine readability + execution).  In SQL compilers, this output is the logical plan.</p>
<p>A compiler definitionally gives you a deeper understanding of a query’s behavior than a parser alone. We’re now able to trace the data flows and operations that we were abstractly expressing when we wrote our SQL query. The compiler incrementally enriches its understanding of the original SQL string and produces a logical plan, which enables static analysis and validation of your SQL logic.</p>
<p>We are, however, not all the way down the rabbit hole — a compiler-produced logical plan contains the full instructions for how to execute a piece of code, but no sense of how to actually execute those steps! There’s one more translation required for the rubber to fully meet the motherboard.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="level-3-executing">Level 3: Executing<a href="https://docs.getdbt.com/blog/sql-comprehension-technologies#level-3-executing" class="hash-link" aria-label="Direct link to Level 3: Executing" title="Direct link to Level 3: Executing">​</a></h2>
<p><em>At Level 3, the database’s <strong>execution engine</strong> translates the logical plan into a <strong>physical plan</strong>, which can finally be executed to return a dataset.</em> <em>Key concepts: Optimization and Planning, Engines, Physical plans</em></p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="optimization-and-planning">Optimization and planning<a href="https://docs.getdbt.com/blog/sql-comprehension-technologies#optimization-and-planning" class="hash-link" aria-label="Direct link to Optimization and planning" title="Direct link to Optimization and planning">​</a></h3>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>tip</div><div class="admonitionContent_BuS1"><p>A logical plan goes through a process of <strong>optimization and planning</strong> that maps its operations to the physical hardware that is going to execute each step.</p></div></div>
<p>Once the database has a resolved logical plan, it goes through a process of optimization and planning. As mentioned, because logical plans are expressed as relational algebraic expressions, the database can execute equivalent steps in whichever order is most efficient.</p>
<p>Let’s think of a simple example SQL statement:</p>
<pre><code class="language-sql">select
  *
from a
join b on a.id = b.a_id
join c on b.id = c.b_id</code></pre>
<p>The logical plan will contain steps to join the tables together exactly as defined in SQL — great! Let’s suppose, however, that table <code>a</code> is several orders of magnitude larger than each of the other two. In that case, the order of joining makes a huge difference in the performance of the query! If we join <code>a</code> and <code>b</code> first, we produce an enormous intermediate result <code>ab</code> that must then be joined with <code>c</code>. If instead we join <code>b</code> and <code>c</code> first, and join the much smaller result <code>bc</code> with table <code>a</code>, we get the same result <code>abc</code> at a fraction of the cost!</p>
<p>Layering in the knowledge of the physical characteristics of the objects referenced in a query to ensure efficient execution is the job of the optimization and planning stage.</p>
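<p>Some back-of-envelope arithmetic shows why the planner cares. The cardinalities here are made up, and the cost model is deliberately crude (each join costs the sum of its input sizes, and every row finds exactly one match):</p>

```python
# Hypothetical table sizes for the query above: a is enormous, b and c are small.
rows_a, rows_b, rows_c = 100_000_000, 10_000, 1_000

# Plan 1: (a join b) join c -- the huge intermediate result ab is processed again.
rows_ab = rows_a  # assume every row of a matches exactly one row of b
cost_plan_1 = (rows_a + rows_b) + (rows_ab + rows_c)

# Plan 2: (b join c) join a -- the small join happens first.
rows_bc = rows_b  # assume every row of b matches exactly one row of c
cost_plan_2 = (rows_b + rows_c) + (rows_bc + rows_a)

print(cost_plan_1)  # prints: 200011000
print(cost_plan_2)  # prints: 100021000
```

<p>Under these toy assumptions, the second plan touches roughly half as many rows — and real optimizers weigh far more than row counts.</p>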
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="physical-plan">Physical plan<a href="https://docs.getdbt.com/blog/sql-comprehension-technologies#physical-plan" class="hash-link" aria-label="Direct link to Physical plan" title="Direct link to Physical plan">​</a></h3>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>tip</div><div class="admonitionContent_BuS1"><p>A <strong>physical plan</strong> is the intermediate representation that contains all the information necessary to execute the query.</p></div></div>
<p>Once we do the work to decide on the optimal plan with details about the physical characteristics of the data, we get one final intermediate representation: the physical plan. Think about the operations defined by a logical plan — we may know that we have a <code>TableScan</code> operation of a table called <code>some_table</code>. A physical plan is able to map that operation to <em>specific data partitions</em> in <em>specific data storage locations</em>. The physical plan also contains information relevant to memory allocation so the engine can plan accordingly — as in the previous example, it knows the second join will be a lot more resource intensive!</p>
<p>Think about what your data platform of choice has to do when you submit a validated SQL query: the last mile step is deciding which partitions of data on which of its servers should be scanned, how they should be joined and aggregated to ultimately generate the dataset you need. Physical plans are among the last intermediate representations created along the way to actually returning data back from a database.</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="execution">Execution<a href="https://docs.getdbt.com/blog/sql-comprehension-technologies#execution" class="hash-link" aria-label="Direct link to Execution" title="Direct link to Execution">​</a></h3>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>tip</div><div class="admonitionContent_BuS1"><p>A query engine can <strong>execute</strong> a <em>physical plan</em> and return tabular data</p></div></div>
<p>Once a physical plan is generated, all that’s left to do is run it! The database engine executes the physical plan, and fetches, combines, and aggregates your data into the format described by your SQL code. The way that the engine accomplishes this can vary significantly depending on the architecture of your database! Some databases are “single node” in that there is a single computer doing all the work; others are “distributed” and can federate the work across many working compute nodes.</p>
<p>In general, the engine must:</p>
<ol>
<li>
<p><strong>Allocate resources</strong> — In order to run your query, a computer must be online and available to do so! This step allocates CPU to each of the operations in the physical plan, whether one single node or many nodes execute the full query task.</p>
</li>
<li>
<p><strong>Read data into memory</strong> — The tables referenced are scanned as efficiently as possible, and the rows are processed. This may happen in stages, depending on whether the tasks are distributed or happening within one single node.</p>
</li>
<li>
<p><strong>Execute operations</strong> — Once the required data is read into memory, it flows through a pipeline of the nodes in your physical plan. More than 50 years of work has gone into optimizing these steps for different data structures and in-memory representations — everything from row-oriented databases, to columnar, to time series, geo-spatial, and graph. But fundamentally, there are five common operations:</p>
<ol>
<li>
<p><strong>Projection</strong> — Extract only the columns or expressions that the user requested (e.g. <code>order_id</code>).</p>
</li>
<li>
<p><strong>Filtering</strong> — Rows that don’t meet your <code>WHERE</code> condition are dropped.</p>
</li>
<li>
<p><strong>Joining</strong> — If your query involves multiple tables, the engine merges or joins them—this could be a hash join, sort-merge join, or even a nested loop join depending on data statistics.</p>
</li>
<li>
<p><strong>Aggregation</strong> — If you have an aggregation like <code>SUM(amount)</code> or <code>COUNT(*)</code>, the engine groups rows by the specified columns and calculates the aggregated values.</p>
</li>
<li>
<p><strong>Sorting / Window Functions</strong> — If the query uses <code>ORDER BY</code>, <code>RANK()</code>, or other window functions, the data flows into those operators next.</p>
</li>
</ol>
</li>
<li>
<p><strong>Merge and return results</strong> — The last mile step is generating the tabular dataset. In the case of distributed systems, this may require combining the results from several nodes into a single result.</p>
</li>
</ol>
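<p>As a mental model (nothing more — real engines are vastly more sophisticated), the scan → filter → aggregate pipeline above can be sketched over plain Python rows:</p>

```python
# Illustrative single-node pipeline: scan -> filter -> aggregate.
orders = [
    {"order_id": 1, "status": "paid", "amount": 30},
    {"order_id": 2, "status": "void", "amount": 10},
    {"order_id": 3, "status": "paid", "amount": 15},
]

def scan(table):
    """Read rows 'into memory' one at a time."""
    yield from table

def filter_rows(rows, predicate):
    """Drop rows that fail the WHERE condition."""
    return (row for row in rows if predicate(row))

def aggregate(rows, key, value):
    """GROUP BY key, SUM(value)."""
    totals = {}
    for row in rows:
        totals[row[key]] = totals.get(row[key], 0) + row[value]
    return totals

result = aggregate(
    filter_rows(scan(orders), lambda row: row["status"] == "paid"),
    key="status",
    value="amount",
)
print(result)  # prints: {'paid': 45}
```

<p>Each operator pulls rows from the one beneath it — the same “volcano”-style shape many engines use, minus everything that makes them fast.</p>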
<p>Finally! Actionable business insights, right in the palm of your hand!</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="looking-ahead">Looking ahead<a href="https://docs.getdbt.com/blog/sql-comprehension-technologies#looking-ahead" class="hash-link" aria-label="Direct link to Looking ahead" title="Direct link to Looking ahead">​</a></h2>
<p>That’s probably more about databases than you bargained for! I know this is a lot to absorb, but the best data practitioners have a deep understanding of their tools, and all of this is extremely relevant to the next evolution of data tooling and data work. Next time you run a query, don't forget to thank your database for all the hard work it's doing for you.</p>
        <author>
            <name>Dave Connors</name>
        </author>
        <category label="data ecosystem" term="data ecosystem"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[The Three Levels of SQL Comprehension: What they are and why you need to know about them]]></title>
        <id>https://docs.getdbt.com/blog/the-levels-of-sql-comprehension</id>
        <link href="https://docs.getdbt.com/blog/the-levels-of-sql-comprehension"/>
        <updated>2025-01-23T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Parsers, compilers, executors, oh my! What it means when we talk about 'understanding SQL'.]]></summary>
        <content type="html"><![CDATA[<p>Ever since <a href="https://www.getdbt.com/blog/dbt-labs-acquires-sdf-labs" target="_blank" rel="noopener noreferrer">dbt Labs acquired SDF Labs last week</a>, I've been head-down diving into their technology and making sense of it all. The main thing I knew going in was "SDF understands SQL". It's a nice pithy quote, but the specifics are <em>fascinating.</em></p>
<p>For the next era of Analytics Engineering to be as transformative as the last, dbt needs to move beyond being a <a href="https://en.wikipedia.org/wiki/Preprocessor" target="_blank" rel="noopener noreferrer">string preprocessor</a> and into fully comprehending SQL. <strong>For the first time, SDF provides the technology necessary to make this possible.</strong> Today we're going to dig into what SQL comprehension actually means, since it's so critical to what comes next.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="what-is-sql-comprehension">What is SQL comprehension?<a href="https://docs.getdbt.com/blog/the-levels-of-sql-comprehension#what-is-sql-comprehension" class="hash-link" aria-label="Direct link to What is SQL comprehension?" title="Direct link to What is SQL comprehension?">​</a></h2>
<p>Let’s call any tool that can look at a string of text, interpret it as SQL, and extract some meaning from it a <em>SQL Comprehension tool.</em></p>
<p>Put another way, SQL Comprehension tools <strong>recognize SQL code and deduce more information about that SQL than is present in the <a href="https://www.postgresql.org/docs/current/sql-syntax-lexical.html" target="_blank" rel="noopener noreferrer">tokens</a> themselves</strong>. Here’s a non-exhaustive set of behaviors and capabilities that such a tool might have for a given <a href="https://blog.sdf.com/p/sql-dialects-and-the-tower-of-babel" target="_blank" rel="noopener noreferrer">dialect</a> of SQL:</p>
<ul>
<li>Identify constituent parts of a query.</li>
<li>Create structured artifacts for their own use or for other tools to consume in turn.</li>
<li>Check whether the SQL is valid.</li>
<li>Understand what will happen when the query runs: things like what columns will be created, what datatypes they have, and what DDL is involved.</li>
<li>Execute the query and return data (unsurprisingly, your database is a tool that comprehends SQL!)</li>
</ul>
<p>By building on top of tools that truly understand SQL, it is possible to create systems that are much more capable, resilient and flexible than we’ve seen to date.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="the-levels-of-sql-comprehension">The Levels of SQL Comprehension<a href="https://docs.getdbt.com/blog/the-levels-of-sql-comprehension#the-levels-of-sql-comprehension" class="hash-link" aria-label="Direct link to The Levels of SQL Comprehension" title="Direct link to The Levels of SQL Comprehension">​</a></h2>
<p>When you look at the capabilities above, you can imagine some of those outcomes being achievable with <a href="https://github.com/joellabes/mode-dbt-exposures/blob/main/generate_yaml.py#L52" target="_blank" rel="noopener noreferrer">one line of regex</a> and some that are only possible if you’ve literally built a database. Given that range of possibilities, we believe that “can you comprehend SQL” is an insufficiently precise question.</p>
<p>A better question is “to what level can you comprehend SQL?” To that end, we have identified different levels of capability. Each level deals with a key artifact (or more precisely - a specific "<a href="https://en.wikipedia.org/wiki/Intermediate_representation" target="_blank" rel="noopener noreferrer">intermediate representation</a>"). And in doing so, each level unlocks specific capabilities and more in-depth validation.</p>
<div class="filterableTableContainer_mhtg"><table><thead><tr><th>Level</th><th>Name</th><th>Artifact</th><th>Example Capability Unlocked</th></tr></thead><tbody><tr><td>1</td><td>Parsing</td><td>Syntax Tree</td><td>Know what symbols are used in a query.</td></tr><tr><td>2</td><td>Compiling</td><td>Logical Plan</td><td>Know what types are used in a query, and how they change, regardless of their origin.</td></tr><tr><td>3</td><td>Executing</td><td>Physical Plan + Query Results</td><td>Know how a query will run on your database, all the way to calculating its results.</td></tr></tbody></table></div>
<p>At Level 1, you have a baseline comprehension of SQL. By parsing the string of SQL into a Syntax Tree, it’s possible to <strong>reason about the components of a query</strong> and identify whether you've <strong>written syntactically legal code</strong>.</p>
<p>At Level 2, the system produces a complete Logical Plan. A logical plan knows about every function that’s called in your query, the datatypes being passed into them, and what every column will look like as a result (among many other things). Static analysis of this plan makes it possible to <strong>identify almost every error before you run your code</strong>.</p>
<p>Finally, at Level 3, a tool can actually <strong>execute a query and modify data</strong>, because it understands all the complexities involved in answering the question "how does the exact data passed into this query get transformed/mutated".</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="can-i-see-an-example">Can I see an example?<a href="https://docs.getdbt.com/blog/the-levels-of-sql-comprehension#can-i-see-an-example" class="hash-link" aria-label="Direct link to Can I see an example?" title="Direct link to Can I see an example?">​</a></h2>
<p>This can feel pretty theoretical based on descriptions alone, so let’s look at a basic Snowflake query.</p>
<p>A system at each level of SQL comprehension understands progressively more about the query, and that increased understanding enables it to <strong>say with more precision whether the query is valid</strong>.</p>
<p>To tools at lower levels of comprehension, some elements of a query are effectively a black box - their syntax tree has the contents of the query but cannot validate whether everything makes sense. <strong>Remember that comprehension is deducing more information than is present in the plain text of the query; by comprehending more, you can validate more.</strong></p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="level-1-parsing">Level 1: Parsing<a href="https://docs.getdbt.com/blog/the-levels-of-sql-comprehension#level-1-parsing" class="hash-link" aria-label="Direct link to Level 1: Parsing" title="Direct link to Level 1: Parsing">​</a></h3>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="docImage_EYbW" style="max-width:100%"><span><a href="https://docs.getdbt.com/blog/the-levels-of-sql-comprehension#" data-featherlight="/img/blog/2025-01-23-levels-of-sql-comprehension/level_1.png"><img data-toggle="lightbox" alt="" title="" src="https://docs.getdbt.com/img/blog/2025-01-23-levels-of-sql-comprehension/level_1.png?v=2"></a></span></div>
<p>A parser recognizes that a function called <code>dateadd</code> has been called with three arguments, and knows the contents of those arguments.</p>
<p>However, without knowledge of the <a href="https://en.wikipedia.org/wiki/Type_signature#Signature" target="_blank" rel="noopener noreferrer">function signature</a>, it has no way to validate whether those arguments are valid types, whether three is the right number of arguments, or even whether <code>dateadd</code> is an available function. This also means it can’t know what the datatype of the created column will be.</p>
<p>Parsers are intentionally flexible in what they will consume - their purpose is to make sense of what they're seeing, not nitpick. Most parsers describe themselves as “non-validating”, because true validation requires compilation.</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="level-2-compiling">Level 2: Compiling<a href="https://docs.getdbt.com/blog/the-levels-of-sql-comprehension#level-2-compiling" class="hash-link" aria-label="Direct link to Level 2: Compiling" title="Direct link to Level 2: Compiling">​</a></h3>
<div class="docImage_EYbW" style="max-width:100%"><span><a href="https://docs.getdbt.com/blog/the-levels-of-sql-comprehension#" data-featherlight="/img/blog/2025-01-23-levels-of-sql-comprehension/level_2.png"><img data-toggle="lightbox" alt="" title="" src="https://docs.getdbt.com/img/blog/2025-01-23-levels-of-sql-comprehension/level_2.png?v=2"></a></span></div>
<p>Extending beyond a parser, a compiler <em>does</em> know the function signatures. It knows that on Snowflake, <code>dateadd</code> is a function which takes three arguments: a <code>datepart</code>, an <code>integer</code>, and an <code>expression</code> (in that order).</p>
<p>A compiler also knows what types a function can return without actually running the code (this is called <a href="https://en.wikipedia.org/wiki/Static_program_analysis" target="_blank" rel="noopener noreferrer">static analysis</a>, we’ll get into that another day). In this case, because <code>dateadd</code>’s return type depends on the input expression and our expression isn’t explicitly cast, the compiler just knows that the <code>new_day</code> column can be <a href="https://docs.snowflake.com/en/sql-reference/functions/dateadd#returns" target="_blank" rel="noopener noreferrer">one of three possible datatypes</a>.</p>
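<p>Here’s a toy version of that signature check, with the argument kinds written as plain strings (a real compiler works over a much richer type system):</p>

```python
# Snowflake's dateadd takes (datepart, integer, expression), in that order.
DATEADD_SIGNATURE = ("datepart", "integer", "expression")

def check_dateadd(arg_kinds):
    """Return a list of binding errors for a dateadd call (empty if valid)."""
    errors = []
    for position, (got, expected) in enumerate(zip(arg_kinds, DATEADD_SIGNATURE), 1):
        if got != expected:
            errors.append(f"argument {position}: expected {expected}, got {got}")
    return errors

# dateadd('day', 1, getdate()) -- the kinds line up, so no errors:
print(check_dateadd(("datepart", "integer", "expression")))  # prints: []
# dateadd('day', getdate(), 1) -- parses fine, but fails to bind:
print(check_dateadd(("datepart", "expression", "integer")))
```
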
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="level-3-executing">Level 3: Executing<a href="https://docs.getdbt.com/blog/the-levels-of-sql-comprehension#level-3-executing" class="hash-link" aria-label="Direct link to Level 3: Executing" title="Direct link to Level 3: Executing">​</a></h3>
<div class="docImage_EYbW" style="max-width:100%"><span><a href="https://docs.getdbt.com/blog/the-levels-of-sql-comprehension#" data-featherlight="/img/blog/2025-01-23-levels-of-sql-comprehension/level_3.png"><img data-toggle="lightbox" alt="" title="" src="https://docs.getdbt.com/img/blog/2025-01-23-levels-of-sql-comprehension/level_3.png?v=2"></a></span></div>
<p>A tool with execution capabilities knows everything about this query and the data that is passed into it, including how functions are implemented. Therefore it can perfectly represent the results as run on Snowflake. Again, that’s what databases do. A database is a Level 3 tool.</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="review">Review<a href="https://docs.getdbt.com/blog/the-levels-of-sql-comprehension#review" class="hash-link" aria-label="Direct link to Review" title="Direct link to Review">​</a></h3>
<p>Let’s review the increasing validation capabilities unlocked by each level of comprehension, and notice that over time <strong>the black boxes completely disappear</strong>:</p>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:100%"><span><a href="https://docs.getdbt.com/blog/the-levels-of-sql-comprehension#" data-featherlight="/img/blog/2025-01-23-levels-of-sql-comprehension/validation_all_levels.png"><img data-toggle="lightbox" alt="" title="" src="https://docs.getdbt.com/img/blog/2025-01-23-levels-of-sql-comprehension/validation_all_levels.png?v=2"></a></span></div>
<p>In a toy example like this one, the distinctions between the different levels might feel subtle. As you move away from a single query and into a full-scale project, the functionality gaps become more pronounced. That’s hard to demonstrate in a blog post, but fortunately there’s an easier option: look at some failing queries. How a query is broken determines what level of tool is necessary to recognize the error.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="so-lets-break-things">So let’s break things<a href="https://docs.getdbt.com/blog/the-levels-of-sql-comprehension#so-lets-break-things" class="hash-link" aria-label="Direct link to So let’s break things" title="Direct link to So let’s break things">​</a></h2>
<p>As the great analytics engineer Tolstoy <a href="https://en.wikipedia.org/wiki/Anna_Karenina_principle" target="_blank" rel="noopener noreferrer">once noted</a>, “All correctly written queries are alike; each incorrectly written query is incorrect in its own way”.</p>
<p>Consider these three invalid queries:</p>
<ul>
<li><code>selecte dateadd('day', 1, getdate()) as tomorrow</code> (Misspelled keyword)</li>
<li><code>select dateadd('day', getdate(), 1) as tomorrow</code> (Wrong order of arguments)</li>
<li><code>select cast('2025-01-32' as date) as tomorrow</code> (Impossible date)</li>
</ul>
<p>Tools that comprehend SQL can catch errors. But they can't all catch the same errors! Each subsequent level will catch more subtle errors in addition to those from <em>all prior levels</em>. That's because the levels are additive — each level contains and builds on the knowledge of the ones below it.</p>
<p>Each of the above queries requires progressively greater SQL comprehension abilities to identify the mistake.</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="parser-level-1-capture-syntax-errors">Parser (Level 1): Capture Syntax Errors<a href="https://docs.getdbt.com/blog/the-levels-of-sql-comprehension#parser-level-1-capture-syntax-errors" class="hash-link" aria-label="Direct link to Parser (Level 1): Capture Syntax Errors" title="Direct link to Parser (Level 1): Capture Syntax Errors">​</a></h3>
<p>Example: <code>selecte dateadd('day', 1, getdate()) as tomorrow</code></p>
<p>Parsers know that <code>selecte</code> is <strong>not a valid keyword</strong> in Snowflake SQL, and will reject it.</p>
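As a toy sketch (not how any real SQL parser is implemented), Level 1 checking boils down to recognizing valid tokens before knowing anything about functions, types, or data:

```python
# A minimal sketch of Level 1 (parsing): reject a statement whose first
# token is not a recognized keyword. Real parsers build a full syntax
# tree, but the principle is the same: no knowledge of types or data.
KEYWORDS = {"select", "from", "where", "group", "by", "order"}

def starts_with_valid_keyword(sql: str) -> bool:
    first_token = sql.strip().split()[0].lower()
    return first_token in KEYWORDS

print(starts_with_valid_keyword("selecte dateadd('day', 1, getdate()) as tomorrow"))  # False
print(starts_with_valid_keyword("select dateadd('day', 1, getdate()) as tomorrow"))   # True
```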
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="compiler-level-2-capture-compilation-errors">Compiler (Level 2): Capture Compilation Errors<a href="https://docs.getdbt.com/blog/the-levels-of-sql-comprehension#compiler-level-2-capture-compilation-errors" class="hash-link" aria-label="Direct link to Compiler (Level 2): Capture Compilation Errors" title="Direct link to Compiler (Level 2): Capture Compilation Errors">​</a></h3>
<p>Example: <code>select dateadd('day', getdate(), 1) as tomorrow</code></p>
<p>To a parser, this looks fine - all the parentheses and commas are in the right places, and we’ve spelled <code>select</code> correctly this time.</p>
<p>A compiler, on the other hand, recognizes that the <strong>function arguments are out of order</strong> because:</p>
<ul>
<li>It knows that the second argument (<code>value</code>) needs to be a number, but that <code>getdate()</code> returns a <code>timestamp_ltz</code>.</li>
<li>Likewise, it knows that a number is not a valid date/time expression for the third argument.</li>
</ul>
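The same reasoning can be sketched in a few lines of Python (a toy model, not a real compiler; real compilers track many more types and overloads). The compiler holds a signature table and an expression-type table, and checks each argument against the expected type without executing anything:

```python
# Toy Level 2 (compilation) check: compare each argument's inferred
# type against the function signature, without running the query.
SIGNATURES = {
    # dateadd(date_or_time_part, value, date_or_time_expr)
    "dateadd": ["part", "number", "timestamp"],
}
EXPR_TYPES = {"'day'": "part", "1": "number", "getdate()": "timestamp"}

def check_call(func: str, args: list) -> list:
    errors = []
    for position, (arg, expected) in enumerate(zip(args, SIGNATURES[func]), start=1):
        actual = EXPR_TYPES[arg]
        if actual != expected:
            errors.append(f"arg {position}: expected {expected}, got {actual}")
    return errors

print(check_call("dateadd", ["'day'", "getdate()", "1"]))
# ['arg 2: expected number, got timestamp', 'arg 3: expected timestamp, got number']
```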
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="executor-level-3-capture-data-errors">Executor (Level 3): Capture Data Errors<a href="https://docs.getdbt.com/blog/the-levels-of-sql-comprehension#executor-level-3-capture-data-errors" class="hash-link" aria-label="Direct link to Executor (Level 3): Capture Data Errors" title="Direct link to Executor (Level 3): Capture Data Errors">​</a></h3>
<p>Example: <code>select cast('2025-01-32' as date) as tomorrow</code></p>
<p>Again, the parser signs off on this as valid SQL syntax.</p>
<p>But this time the compiler also thinks everything is fine! Remember that a compiler checks the signature of a function. It knows that <code>cast</code> takes a source expression and a target datatype as arguments, and it's checked that both these arguments are of the correct type.</p>
<p>It even has an overload that knows that strings can be cast into dates, but since it can’t do any validation of those strings’ <em>values</em> it doesn’t know <strong>January 32nd isn’t a valid date</strong>.</p>
<p>To actually know whether some data can be processed by a SQL query, you have to, well, process the data. Data errors can only be captured by a Level 3 system.</p>
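Here is Level 3 in miniature, with Python's datetime standing in for the database's execution engine: the invalid date only surfaces when the value is actually processed.

```python
# Only at execution time do we learn that January 32nd doesn't exist.
# Parsers and compilers see a well-formed string of the right type.
from datetime import datetime

def try_cast_date(value: str):
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError:
        return None  # a data error, invisible to parsers and compilers

print(try_cast_date("2025-01-31"))  # 2025-01-31
print(try_cast_date("2025-01-32"))  # None
```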
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="conclusion">Conclusion<a href="https://docs.getdbt.com/blog/the-levels-of-sql-comprehension#conclusion" class="hash-link" aria-label="Direct link to Conclusion" title="Direct link to Conclusion">​</a></h2>
<p>Building your mental model of the levels of SQL comprehension – why they matter, how they're achieved and what they’ll unlock for you – is critical to understanding the coming era of data tooling.</p>
<p>In introducing these concepts, we’re still just scratching the surface. There's a lot more to discuss:</p>
<ul>
<li>Going deeper on the specific nuances of each level of comprehension</li>
<li>How each level actually works, including the technologies and artifacts that power each level</li>
<li>How this is all going to roll into a step change in the experience of working with data</li>
<li>What it means for doing great data work</li>
</ul>
<p>To learn more, check out <a href="https://docs.getdbt.com/blog/sql-comprehension-technologies">The key technologies behind SQL Comprehension</a>.</p>
<p>Over the coming days, you'll hear more about all of this from the dbt Labs team - both familiar faces and our new friends from SDF Labs.</p>
<p>This is a special moment for the industry and the community. It's alive with possibilities, with ideas, and with new potential. We're excited to navigate this new frontier with all of you.</p>]]></content>
        <author>
            <name>Joel Labes</name>
        </author>
        <category label="data ecosystem" term="data ecosystem"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Why I wish I had a control plane for my renovation]]></title>
        <id>https://docs.getdbt.com/blog/wish-i-had-a-control-plane-for-my-renovation</id>
        <link href="https://docs.getdbt.com/blog/wish-i-had-a-control-plane-for-my-renovation"/>
        <updated>2025-01-21T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[When I think back to my renovation, I realize how much smoother it would've been if I’d had a control plane for the entire process.]]></summary>
        <content type="html"><![CDATA[<p>When my wife and I renovated our home, we chose to take on the role of owner-builder. It was a bold (and mostly naive) decision, but we wanted control over every aspect of the project. What we didn’t realize was just how complex and exhausting managing so many moving parts would be.</p>
<link href="/css/featherlight-styles.css" type="text/css" rel="stylesheet"><div class="
          docImage_EYbW
          
          
          
          
        " style="max-width:70%"><span><a href="https://docs.getdbt.com/blog/wish-i-had-a-control-plane-for-my-renovation#" data-featherlight="/img/blog/2024-12-22-why-i-wish-i-had-a-control-plane-for-my-renovation/control-plane.png"><img data-toggle="lightbox" alt="My wife pondering our sanity" title="My wife pondering our sanity" src="https://docs.getdbt.com/img/blog/2024-12-22-why-i-wish-i-had-a-control-plane-for-my-renovation/control-plane.png?v=2"></a></span><span class="title_aGrV">My wife pondering our sanity</span></div>
<p>We had to coordinate multiple elements:</p>
<ul>
<li>The <strong>architects</strong>, who designed the layout, interior, and exterior.</li>
<li>The <strong>architectural plans</strong>, which outlined what the house should look like.</li>
<li>The <strong>builders</strong>, who executed those plans.</li>
<li>The <strong>inspectors</strong>, <strong>councils</strong>, and <strong>energy raters</strong>, who checked whether everything met the required standards.</li>
</ul>
<p>Each piece was critical — without the plans, there’s no shared vision; without the builders, the plans don’t come to life; and without inspections, mistakes go unnoticed.</p>
<p>But as an inexperienced project manager, I was also the one responsible for stitching everything together:</p>
<ul>
<li>Architects handed me detailed plans, builders asked for clarifications.</li>
<li>Inspectors flagged issues that were often too late to fix without extra costs or delays.</li>
<li>On top of all this, I also don't speak "builder".</li>
</ul>
<p>So what should have been quick and collaborative conversations turned into drawn-out processes because there was no unified system to keep everyone on the same page.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="in-many-ways-this-mirrors-how-data-pipelines-operate">In many ways, this mirrors how data pipelines operate<a href="https://docs.getdbt.com/blog/wish-i-had-a-control-plane-for-my-renovation#in-many-ways-this-mirrors-how-data-pipelines-operate" class="hash-link" aria-label="Direct link to In many ways, this mirrors how data pipelines operate" title="Direct link to In many ways, this mirrors how data pipelines operate">​</a></h2>
<ul>
<li>The <strong>architects</strong> are the engineers — designing how the pieces fit together.</li>
<li>The <strong>architectural plans</strong> are your dbt code — the models, tests, and configurations that define what your data should look like.</li>
<li>The <strong>builders</strong> are the compute layers (for example, Snowflake, BigQuery, or Databricks) that execute those transformations.</li>
<li>The <strong>inspectors</strong> are the monitoring tools, which focus on retrospective insights like logs, job performance, and error rates.</li>
</ul>
<p>Here’s the challenge: monitoring tools, by their nature, look backward. They’re great at telling you what happened, but they don’t help you plan or declare what should happen. And when these roles (plans, execution, and monitoring) are siloed, teams are left trying to manually stitch them together, often wasting time troubleshooting issues or coordinating workflows.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="what-makes-dbt-cloud-different">What makes dbt Cloud different<a href="https://docs.getdbt.com/blog/wish-i-had-a-control-plane-for-my-renovation#what-makes-dbt-cloud-different" class="hash-link" aria-label="Direct link to What makes dbt Cloud different" title="Direct link to What makes dbt Cloud different">​</a></h2>
<p><a href="https://www.getdbt.com/product/dbt-cloud" target="_blank" rel="noopener noreferrer">dbt Cloud</a> unifies these perspectives into a single <a href="https://www.getdbt.com/blog/data-control-plane-introduction" target="_blank" rel="noopener noreferrer">control plane</a>, bridging proactive and retrospective capabilities:</p>
<ul>
<li><strong>Proactive planning</strong>: In dbt, you declare the desired <a href="https://docs.getdbt.com/reference/node-selection/state-selection" target="_blank" rel="noopener noreferrer">state</a> of your data before jobs even run — your architectural plans are baked into the pipeline.</li>
<li><strong>Retrospective insights</strong>: dbt Cloud surfaces <a href="https://docs.getdbt.com/docs/deploy/run-visibility" target="_blank" rel="noopener noreferrer">job logs</a>, performance metrics, and test results, providing the same level of insight as traditional monitoring tools.</li>
</ul>
<p>But the real power lies in how dbt integrates these two perspectives. Transformation logic (the plans) and monitoring (the inspections) are tightly connected, creating a continuous feedback loop where issues can be identified and resolved faster, and pipelines can be optimized more effectively.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="why-does-this-matter">Why does this matter?<a href="https://docs.getdbt.com/blog/wish-i-had-a-control-plane-for-my-renovation#why-does-this-matter" class="hash-link" aria-label="Direct link to Why does this matter?" title="Direct link to Why does this matter?">​</a></h2>
<ol>
<li><strong>The silo problem</strong>: Many organizations rely on separate tools for transformation and monitoring. This fragmentation creates blind spots, making it harder to identify and resolve issues.</li>
<li><strong>Integrated workflows</strong>: dbt Cloud eliminates these silos by connecting transformation and monitoring logic in one place. It doesn’t just report on what happened; it ties those insights directly to the proactive plans that define your pipeline.</li>
<li><strong>Operational confidence</strong>: With dbt Cloud, you can trust that your data pipelines are not only functional but aligned with your business goals, monitored in real-time, and easy to troubleshoot.</li>
</ol>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="why-i-wish-i-had-a-control-plane-for-my-renovation">Why I wish I had a control plane for my renovation<a href="https://docs.getdbt.com/blog/wish-i-had-a-control-plane-for-my-renovation#why-i-wish-i-had-a-control-plane-for-my-renovation" class="hash-link" aria-label="Direct link to Why I wish I had a control plane for my renovation" title="Direct link to Why I wish I had a control plane for my renovation">​</a></h2>
<p>When I think back to my renovation, I realize how much smoother it would have been if I’d had a control plane for the entire process. There are firms that specialize in design-and-build projects, in-house architects, engineers, and contractors. The beauty of these firms is that everything is under one roof, so you know they’re communicating seamlessly.</p>
<p>In my case, though, my architect, builder, and engineer were all completely separate, which meant I was the intermediary. I was the pigeon service shuttling information between them, and it was exhausting. Discussions that should have taken minutes stretched into weeks and sometimes even months because there was no centralized communication.</p>
<p>dbt Cloud is like having that design-and-build firm for your data pipelines. It’s the control plane that unites proactive planning with retrospective monitoring, eliminating silos and inefficiencies. With dbt Cloud, you don’t need to play the role of the pigeon service — it gives you the visibility, integration, and control you need to manage modern data workflows effortlessly.</p>]]></content>
        <author>
            <name>Mark Wan</name>
        </author>
        <category label="analytics craft" term="analytics craft"/>
        <category label="data ecosystem" term="data ecosystem"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Test smarter not harder: Where should tests go in your pipeline?]]></title>
        <id>https://docs.getdbt.com/blog/test-smarter-where-tests-should-go</id>
        <link href="https://docs.getdbt.com/blog/test-smarter-where-tests-should-go"/>
        <updated>2024-12-09T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Testing your data should drive action, not accumulate alerts. We take our testing framework developed in our last post and make recommendations for where tests ought to go at each transformation stage.]]></summary>
        <content type="html"><![CDATA[<p>👋&nbsp;Greetings, dbt’ers! It’s Faith &amp; Jerrie, back again to offer tactical advice on <em>where</em> to put tests in your pipeline.</p>
<p>In <a href="https://docs.getdbt.com/blog/test-smarter-not-harder">our first post</a> on refining testing best practices, we developed a prioritized list of data quality concerns. We also documented first steps for debugging each concern. This post will guide you on where specific tests should go in your data pipeline.</p>
<p><em>Note that we are constructing this guidance based on how we <a href="https://docs.getdbt.com/best-practices/how-we-structure/1-guide-overview#guide-structure-overview">structure data at dbt Labs.</a></em> You may use a different modeling approach—that’s okay! Translate our guidance to your data’s shape, and let us know in the comments section what modifications you made.</p>
<p>First, here are our opinions on where specific tests should go:</p>
<ul>
<li>Source tests should be fixable data quality concerns. See the <a href="https://docs.getdbt.com/blog/test-smarter-where-tests-should-go#sources">callout box below</a> for what we mean by “fixable”.</li>
<li>Staging tests should be business-focused anomalies specific to individual tables, such as accepted ranges or ensuring sequential values. In addition to these tests, your staging layer should clean up any nulls, duplicates, or outliers that you can’t fix in your source system. You generally don’t need to test your cleanup efforts.</li>
<li>Intermediate and marts layer tests should be business-focused anomalies resulting specifically from joins or calculations. You may also consider adding additional primary key and not null tests on columns where it’s especially important to protect the grain.</li>
</ul>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="where-should-tests-go-in-your-pipeline">Where should tests go in your pipeline?<a href="https://docs.getdbt.com/blog/test-smarter-where-tests-should-go#where-should-tests-go-in-your-pipeline" class="hash-link" aria-label="Direct link to Where should tests go in your pipeline?" title="Direct link to Where should tests go in your pipeline?">​</a></h2>
<p><img decoding="async" loading="lazy" alt="A horizontal, multicolored diagram that shows examples of where tests ought to be placed in a data pipeline." src="https://docs.getdbt.com/assets/images/testing_pipeline-5654a8c833a4fe25846d9b32605b7d09.png" width="2701" height="1327" class="img_ev3q"></p>
<p>This diagram above outlines where you might put specific data tests in your pipeline. Let’s expand on it and discuss where each type of data quality issue should be tested.</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="sources">Sources<a href="https://docs.getdbt.com/blog/test-smarter-where-tests-should-go#sources" class="hash-link" aria-label="Direct link to Sources" title="Direct link to Sources">​</a></h3>
<p>Tests applied to your sources should indicate <em>fixable-at-the-source-system</em> issues. If your source tests flag source system issues that aren’t fixable, remove the test and mitigate the problem in your staging layer instead.</p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>What does fixable mean?</div><div class="admonitionContent_BuS1"><p>We consider a "fixable-at-the-source-system" issue to be something that:</p><ul>
<li>You yourself can fix in the source system.</li>
<li>You know the right person to fix it and have a good enough relationship with them that you know you can <em>get it fixed.</em></li>
</ul><p>You may have issues that can <em>technically</em> get fixed at the source, but it won't happen till the next planning cycle, or you need to develop better relationships to get the issue fixed, or something similar. This demands a more nuanced approach than we'll cover in this post. If you have thoughts on this type of situation, let us know!</p></div></div>
<p>Here’s our recommendation for what tests belong on your sources.</p>
<ul>
<li>Source freshness: testing data freshness for sources that are critical to your pipelines.<!-- -->
<ul>
<li>If any sources feed into any of the “top 3” <a href="https://docs.getdbt.com/blog/test-smarter-not-harder#how-to-prioritize-data-quality-concerns-in-your-pipeline" target="_blank" rel="noopener noreferrer">priority categories</a> in our last post, use <a href="https://docs.getdbt.com/docs/deploy/source-freshness" target="_blank" rel="noopener noreferrer"><code>dbt source freshness</code></a> in your job execution commands and set the severity to <code>error</code>. That way, if source freshness fails, so does your job.</li>
<li>If none of your sources feed into high priority categories, set your source freshness severity to <code>warn</code> and add source freshness to your job execution commands. That way, you still get source freshness information but stale data won't fail your pipeline.</li>
</ul>
</li>
<li>Data hygiene: tests that are <em>fixable</em> in the source system (see our note above on “fixability”).<!-- -->
<ul>
<li>Examples:<!-- -->
<ul>
<li>Duplicate customer records that can be deleted in the source system</li>
<li>Missing values, such as a customer name or email address, that can be entered into the source system</li>
<li>Primary key testing where duplicates are removable in the source system</li>
</ul>
</li>
</ul>
</li>
</ul>
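As a sketch, source freshness and fixable hygiene tests might be configured like this in a sources YAML file (the source, table, and column names here are hypothetical):

```yaml
# models/staging/_sources.yml (hypothetical names)
sources:
  - name: jaffle_shop
    loaded_at_field: _etl_loaded_at
    freshness:
      error_after: {count: 24, period: hour}  # stale data fails the job
    tables:
      - name: customers
        columns:
          - name: customer_id
            data_tests:
              - unique    # duplicates we can delete in the source system
              - not_null
```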
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="staging">Staging<a href="https://docs.getdbt.com/blog/test-smarter-where-tests-should-go#staging" class="hash-link" aria-label="Direct link to Staging" title="Direct link to Staging">​</a></h3>
<p>In the staging layer, your models should be cleaning up or mitigating data issues that can't be fixed at the source. Your tests should be focused on business anomaly detection.</p>
<ul>
<li>Data cleanup and issue mitigation: Use our <a href="https://docs.getdbt.com/best-practices/how-we-structure/2-staging" target="_blank" rel="noopener noreferrer">best practices around staging layers</a> to clean things up. Don’t add tests to your cleanup efforts. If you’re filtering out nulls in a column, adding a not_null test is repetitive!  🌶️</li>
<li>Business-focused anomaly examples: these are data quality issues you <em>should</em> test for in your staging layer, because they fall outside of your business’s defined norms. These might be:<!-- -->
<ul>
<li>Values inside a single column that fall outside of an acceptable range. For example, a store selling a greater quantity of limited-edition items than they received in their stock delivery.</li>
<li>Values that should always be positive actually are. A negative transaction amount that isn’t classified as a return, for example, would fail this test and spur further investigation into the offending transaction.</li>
<li>An unexpected uptick in volume of a quantity column beyond a pre-defined percentage. This might look like a store’s customer volume spiking unexpectedly and outside of expected seasonal norms. This is an anomaly that could indicate a bug or modeling issue.</li>
</ul>
</li>
</ul>
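A business-focused anomaly test on a staging model might look like this (the model and column names are hypothetical, and the test assumes the dbt_utils package is installed):

```yaml
# models/staging/_stg_models.yml (hypothetical names)
models:
  - name: stg_transactions
    columns:
      - name: quantity_sold
        data_tests:
          # anomaly: quantities should fall inside the accepted range
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 10000
```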
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="intermediate-if-applicable">Intermediate (if applicable)<a href="https://docs.getdbt.com/blog/test-smarter-where-tests-should-go#intermediate-if-applicable" class="hash-link" aria-label="Direct link to Intermediate (if applicable)" title="Direct link to Intermediate (if applicable)">​</a></h3>
<p>In your intermediate layer, focus on data hygiene and anomaly tests for new columns. Don’t re-test passthrough columns from sources or staging. Here are some examples of tests you might put in your intermediate layer based on the use cases of intermediate models we <a href="https://docs.getdbt.com/best-practices/how-we-structure/3-intermediate#intermediate-models">outline in this guide</a>.</p>
<ul>
<li>Intermediate models often re-grain models to prepare them for marts.<!-- -->
<ul>
<li>Add a primary key test to any re-grained models.</li>
<li>Additionally, consider adding a primary key test to models where the grain <em>has remained the same</em> but has been <em>enriched.</em> This helps future-proof your enriched models against future developers who may not be able to glean your intention from SQL alone.</li>
</ul>
</li>
<li>Intermediate models may perform a first set of joins or aggregations to reduce complexity in a final mart.<!-- -->
<ul>
<li>Add simple anomaly tests to verify the behavior of your sets of joins and aggregations. This may look like:<!-- -->
<ul>
<li>An <a href="https://docs.getdbt.com/reference/resource-properties/data-tests#accepted_values">accepted_values</a> test on a newly calculated categorical column.</li>
<li>A <a href="https://github.com/dbt-labs/dbt-utils#mutually_exclusive_ranges-source" target="_blank" rel="noopener noreferrer">mutually_exclusive_ranges</a> test on two columns whose values behave in relation to one another (ex: asserting age ranges do not overlap).</li>
<li>A <a href="https://github.com/dbt-labs/dbt-utils#not_constant-source" target="_blank" rel="noopener noreferrer">not_constant</a> test on a column whose value should be continually changing (ex: page view counts on website analytics).</li>
</ul>
</li>
</ul>
</li>
<li>Intermediate models may isolate complex operations.<!-- -->
<ul>
<li>The anomaly tests we list above may suffice here.</li>
<li>You might also consider <a href="https://docs.getdbt.com/docs/build/unit-tests">unit testing</a> any particularly complex pieces of SQL logic.</li>
</ul>
</li>
</ul>
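Putting a couple of these together, an intermediate model's tests might be configured like this (the model and column names are hypothetical):

```yaml
# models/intermediate/_int_models.yml (hypothetical names)
models:
  - name: int_customers_enriched
    columns:
      - name: customer_id
        data_tests:
          - unique    # protect the grain of the re-grained model
          - not_null
      - name: customer_segment   # newly calculated categorical column
        data_tests:
          - accepted_values:
              values: ['new', 'active', 'churned']
```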
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="marts">Marts<a href="https://docs.getdbt.com/blog/test-smarter-where-tests-should-go#marts" class="hash-link" aria-label="Direct link to Marts" title="Direct link to Marts">​</a></h3>
<p>Marts layer testing will follow the same hygiene-or-anomaly pattern as staging and intermediate. Similar to your intermediate layer, you should focus your testing on net-new columns in your marts layer. This might look like:</p>
<ul>
<li>Unit tests: validate especially complex transformation logic. For example:<!-- -->
<ul>
<li>Calculating dates in a way that feeds into forecasting.</li>
<li>Customer segmentation logic, especially logic that has a lot of CASE-WHEN statements.</li>
</ul>
</li>
<li>Primary key tests: focus on where your mart's granularity has changed from its staging/intermediate inputs.<!-- -->
<ul>
<li>Similar to the intermediate models above, you may also want to add primary key tests to models whose grain hasn’t changed, but have been enriched with other data. Primary key tests here communicate your intent.</li>
</ul>
</li>
<li>Business focused anomaly tests: focus on <em>new</em> calculated fields, such as:<!-- -->
<ul>
<li>Singular tests on high-priority, high-impact tables where you have a specific problem you want forewarning about.<!-- -->
<ul>
<li>This might be something like fuzzy matching logic to detect when the same person is using multiple email addresses to extend a free trial beyond its acceptable end date.</li>
</ul>
</li>
<li>A test for calculated numerical fields that shouldn’t vary by more than a certain percentage in a week.</li>
<li>A calculated ledger table that follows certain business rules; for example, today’s running total of spend must always be greater than yesterday’s.</li>
</ul>
</li>
</ul>
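A unit test on a complex piece of segmentation logic might be sketched like this (the model, input, columns, and rows are all hypothetical):

```yaml
# models/marts/_fct_models.yml (hypothetical names and values)
unit_tests:
  - name: test_customer_segmentation
    model: fct_customers
    given:
      - input: ref('stg_orders')
        rows:
          - {customer_id: 1, lifetime_spend: 5000}
    expect:
      rows:
        - {customer_id: 1, segment: 'vip'}
```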
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="cicd">CI/CD<a href="https://docs.getdbt.com/blog/test-smarter-where-tests-should-go#cicd" class="hash-link" aria-label="Direct link to CI/CD" title="Direct link to CI/CD">​</a></h3>
<p>All of the testing you’ve applied in your different layers is the manual work of constructing your framework. CI/CD is where it gets automated.</p>
<p>You should run a <a href="https://docs.getdbt.com/best-practices/best-practice-workflows#run-only-modified-models-to-test-changes-slim-ci">slim CI</a> to optimize your resource consumption.</p>
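In practice, a slim CI run builds only modified models and their downstream dependents, deferring unchanged upstream references to production artifacts. A typical invocation might look like this (the artifacts path is hypothetical; dbt Cloud CI jobs handle the state comparison for you):

```shell
dbt build --select state:modified+ --defer --state path/to/prod-artifacts
```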
<p>With CI/CD and your regular production runs, your testing framework can be on autopilot. 😎</p>
<p>If and when you encounter failures, consult your trusty testing framework doc you built in our <a href="https://docs.getdbt.com/blog/test-smarter-not-harder">earlier post</a>.</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="advanced-ci">Advanced CI<a href="https://docs.getdbt.com/blog/test-smarter-where-tests-should-go#advanced-ci" class="hash-link" aria-label="Direct link to Advanced CI" title="Direct link to Advanced CI">​</a></h3>
<p>In the early stages of your smarter testing journey, start with dbt Cloud’s built-in flags for <a href="https://docs.getdbt.com/docs/deploy/advanced-ci">advanced CI</a>. In PRs with advanced CI enabled, dbt Cloud will flag what has been modified, added, or removed in the “compare changes” section. These three flags offer confidence and evidence that your changes are what you expect. Then, hand your changes off for peer review. Advanced CI helps jump-start your colleague’s review by bringing all of the implications of the change into one place.</p>
<p>We consider usage of Advanced CI beyond the modified, added, or changed gut checks to be an advanced (heh) testing strategy, and look forward to hearing how you use it.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="wrapping-it-all-up">Wrapping it all up<a href="https://docs.getdbt.com/blog/test-smarter-where-tests-should-go#wrapping-it-all-up" class="hash-link" aria-label="Direct link to Wrapping it all up" title="Direct link to Wrapping it all up">​</a></h2>
<p>Judicious data testing is like training for a marathon. It’s not productive to go run 20 miles a day and hope that you’ll be marathon-ready and uninjured. Similarly, throwing data tests randomly at your data pipeline without careful thought is not going to tell you much about your data quality.</p>
<p>Runners go into marathons with training plans. Analytics engineers who care about data quality approach the issue with a plan, too.</p>
<p>As you try out some of the guidance above here, remember that your testing needs are going to evolve over time. Don’t be afraid to revise your original testing strategy.</p>
<p>Let us know your thoughts on these strategies in the comments section. Try them out, and share your thoughts to help us refine them.</p>]]></content>
        <author>
            <name>Faith McKenna</name>
        </author>
        <author>
            <name>Jerrie Kumalah Kenney</name>
        </author>
        <category label="analytics craft" term="analytics craft"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Test smarter not harder: add the right tests to your dbt project]]></title>
        <id>https://docs.getdbt.com/blog/test-smarter-not-harder</id>
        <link href="https://docs.getdbt.com/blog/test-smarter-not-harder"/>
        <updated>2024-11-11T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Testing your data should drive action, not accumulate alerts. We synthesized countless customer experiences to build a repeatable testing framework.]]></summary>
        <content type="html"><![CDATA[<p>The <a href="https://www.getdbt.com/resources/guides/the-analytics-development-lifecycle" target="_blank" rel="noopener noreferrer">Analytics Development Lifecycle (ADLC)</a> is a workflow for improving data maturity and velocity. Testing is a key phase here. Many dbt developers tend to focus on <a href="https://www.getdbt.com/blog/building-a-data-quality-framework-with-dbt-and-dbt-cloud" target="_blank" rel="noopener noreferrer">primary keys and source freshness.</a> We think there is a more holistic and in-depth path to tread. Testing is a key piece of the ADLC, and it should drive data quality.</p>
<p>In this blog, we’ll walk through a plan to define data quality. This will look like:</p>
<ul>
<li>identifying <em>data hygiene</em> issues</li>
<li>identifying <em>business-focused anomaly</em> issues</li>
<li>identifying <em>stats-focused anomaly</em> issues</li>
</ul>
<p>Once we have <em>defined</em> data quality, we’ll move on to <em>prioritize</em> those concerns. We will:</p>
<ul>
<li>think through each concern in terms of the breadth of impact</li>
<li>decide if each concern should be at error or warning severity</li>
</ul>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="who-are-we">Who are we?<a href="https://docs.getdbt.com/blog/test-smarter-not-harder#who-are-we" class="hash-link" aria-label="Direct link to Who are we?" title="Direct link to Who are we?">​</a></h3>
<p>Let’s start with introductions - we’re Faith and Jerrie, and we work on dbt Labs’s training and services teams, respectively. By working closely with countless companies using dbt, we’ve gained unique perspectives on the landscape.</p>
<p>The training team collates the problems organizations are thinking about today and gauges how our solutions fit. These are shorter engagements, which means we see the data world shift and change in real time. Resident Architects spend much more time with teams, crafting in-depth solutions, figuring out where those solutions are helping, and where problems still need to be addressed. In short: trainers identify patterns in the problems data teams face, and Resident Architects dive deep on solutions.</p>
<p>Today, we’ll guide you through a particularly thorny problem: testing.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="why-testing">Why testing?<a href="https://docs.getdbt.com/blog/test-smarter-not-harder#why-testing" class="hash-link" aria-label="Direct link to Why testing?" title="Direct link to Why testing?">​</a></h2>
<p>Mariah Rogers broke early ground on data quality and testing in her <a href="https://www.youtube.com/watch?v=hxvVhmhWRJA" target="_blank" rel="noopener noreferrer">Coalesce 2022 talk</a>. We’ve seen similar talks again at Coalesce 2024, like <a href="https://www.youtube.com/watch?v=iCG-5vqMRAo" target="_blank" rel="noopener noreferrer">this one</a> from the data team at Aiven and <a href="https://www.youtube.com/watch?v=5bRG3y9IM4Q&amp;list=PL0QYlrC86xQnWJ72sJlzDqPS0peE7j9Ed&amp;index=71" target="_blank" rel="noopener noreferrer">this one</a> from the co-founder at Omni Analytics. These talks share a common theme: testing your dbt project too much can get out of control quickly, leading to alert fatigue.</p>
<p>In our customer engagements, we see <em>wildly different approaches</em> to testing data. We’ve definitely seen what Mariah, the Aiven team, and the Omni team have described, which is so many tests that errors and alerts just become noise. We’ve also seen the opposite end of the spectrum—only primary keys being tested. From our field experiences, we believe there’s room for a middle path.
This desire for a better approach to data quality and testing isn’t just anecdotal to Coalesce talks or to dbt’s training and services teams. The dbt community has long called for a more intentional approach - data quality is on the industry’s mind! In fact, <a href="https://www.getdbt.com/resources/reports/state-of-analytics-engineering-2024" target="_blank" rel="noopener noreferrer">57% of respondents</a> to dbt’s 2024 State of Analytics Engineering survey said that data quality is a predominant issue in their day-to-day work.</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="what-does-dta-qual1ty-even-mean">What does d@tA qUaL1Ty even mean?!<a href="https://docs.getdbt.com/blog/test-smarter-not-harder#what-does-dta-qual1ty-even-mean" class="hash-link" aria-label="Direct link to What does d@tA qUaL1Ty even mean?!" title="Direct link to What does d@tA qUaL1Ty even mean?!">​</a></h3>
<p>High-quality data is <em>trusted</em> and <em>used frequently.</em> It doesn’t get argued over or endlessly scrutinized for matching to other data. Data <em>testing</em> should lead to higher data <em>quality</em> and insights, period.</p>
<p>Best practices in data quality are still nascent. That said, a lot of important baseline work has been done here. There are <a href="https://medium.com/@AtheonAnalytics/mastering-data-testing-with-dbt-part-1-689b2a025675" target="_blank" rel="noopener noreferrer">case</a> <a href="https://medium.com/@AtheonAnalytics/mastering-data-testing-with-dbt-part-2-c4031af3df18" target="_blank" rel="noopener noreferrer">studies</a> on implementing dbt testing well. dbt Labs also has an <a href="https://learn.getdbt.com/courses/advanced-testing" target="_blank" rel="noopener noreferrer">Advanced Testing</a> course, emphasizing that testing should spur action and be focused and informative enough to help address failures. You can even enforce testing best practices and dbt Labs’s own best practices using the <a href="https://hub.getdbt.com/tnightengale/dbt_meta_testing/latest/" target="_blank" rel="noopener noreferrer">dbt_meta_testing</a> or <a href="https://github.com/dbt-labs/dbt-project-evaluator" target="_blank" rel="noopener noreferrer">dbt_project_evaluator</a> packages and dbt Explorer’s <a href="https://docs.getdbt.com/docs/explore/project-recommendations" target="_blank" rel="noopener noreferrer">Recommendations</a> page.</p>
<p>The missing piece is still cohesion and guidance for everyday practitioners to help develop their testing framework.</p>
<p>To recap, we’re going to start with:</p>
<ul>
<li>identifying <em>data hygiene</em> issues</li>
<li>identifying <em>business-focused anomaly</em> issues</li>
<li>identifying <em>stats-focused anomaly</em> issues</li>
</ul>
<p>Next, we’ll prioritize. We will:</p>
<ul>
<li>think through each concern in terms of the breadth of impact</li>
<li>decide if each concern should be at error or warning severity</li>
</ul>
<p>Get a pen and paper (or a Google Doc) and join us in constructing your own testing framework.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="identifying-data-quality-issues-in-your-pipeline">Identifying data quality issues in your pipeline<a href="https://docs.getdbt.com/blog/test-smarter-not-harder#identifying-data-quality-issues-in-your-pipeline" class="hash-link" aria-label="Direct link to Identifying data quality issues in your pipeline" title="Direct link to Identifying data quality issues in your pipeline">​</a></h2>
<p>Let’s start our framework by <em>identifying</em> types of data quality issues.</p>
<p>In our daily work with customers, we find that data quality issues tend to fall into one of three broad buckets: <em>data hygiene, business-focused anomalies,</em> and <em>stats-focused anomalies.</em> Read the bucket descriptions below, and list 2-3 data quality concerns in your own business context that fall into each bucket.</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="bucket-1-data-hygiene">Bucket 1: Data hygiene<a href="https://docs.getdbt.com/blog/test-smarter-not-harder#bucket-1-data-hygiene" class="hash-link" aria-label="Direct link to Bucket 1: Data hygiene" title="Direct link to Bucket 1: Data hygiene">​</a></h3>
<p><em>Data hygiene</em> issues are concerns you address in your <a href="https://docs.getdbt.com/best-practices/how-we-structure/2-staging" target="_blank" rel="noopener noreferrer">staging layer.</a> Hygienic data meets your expectations around formatting, completeness, and granularity requirements. Here are a few examples.</p>
<ul>
<li><em>Granularity:</em> primary keys are unique and not null. Duplicates throw off calculations.</li>
<li><em>Completeness:</em> columns that should always contain text, <em>do.</em> Incomplete data often has to get excluded, reducing your overall analytical power.</li>
<li><em>Formatting:</em> email addresses always have a valid domain. Incorrect emails may affect things like marketing outreach.</li>
</ul>
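<p>To make these hygiene checks concrete, they map neatly onto dbt’s generic tests in a model properties file. Here’s a sketch - the model and column names are hypothetical, and the email check assumes the <a href="https://hub.getdbt.com/calogica/dbt_expectations/latest/" target="_blank" rel="noopener noreferrer">dbt-expectations</a> package is installed:</p>
<pre><code class="language-yaml"># models/staging/_stg_customers.yml (hypothetical names)
version: 2

models:
  - name: stg_customers
    columns:
      - name: customer_id
        tests:
          - unique      # granularity: one row per customer
          - not_null
      - name: customer_name
        tests:
          - not_null    # completeness: text columns are populated
      - name: email
        tests:
          # formatting: assumes the dbt-expectations package
          - dbt_expectations.expect_column_values_to_match_regex:
              regex: '[^@]+@[^@]+\.[^@]+'
</code></pre>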
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="bucket-2-business-focused-anomalies">Bucket 2: Business-focused anomalies<a href="https://docs.getdbt.com/blog/test-smarter-not-harder#bucket-2-business-focused-anomalies" class="hash-link" aria-label="Direct link to Bucket 2: Business-focused anomalies" title="Direct link to Bucket 2: Business-focused anomalies">​</a></h3>
<p><em>Business-focused anomalies</em> catch unexpected behavior. You can flag unexpected behavior by clearly defining <em>expected</em> behavior. <em>Business-focused anomalies</em> are when aspects of the data differ from what you know to be typical in your business. You’ll know what’s typical either through your own analyses, your colleagues’ analyses, or things your stakeholder homies point out to you.</p>
<p>Since business-focused anomaly testing is set by a human, it will be fluid and need to be adjusted periodically. Here’s an example.</p>
<p>Imagine you’re a sales analyst. Generally, you know that if your daily sales amount goes up or down by more than 20% daily, that’s bad. Specifically, it’s usually a warning sign for fraud or the order management system (OMS) dropping orders. You set a test in dbt to fail if any given day’s sales amount is a delta of 20% from the previous day. This works for a while.</p>
<p>Then, you have a stretch of 3 months where your test fails 5 times a week! Every time you investigate, it turns out to be valid consumer behavior. You’re suddenly in hypergrowth, and sales are legitimately increasing that much.</p>
<p>Your 20%-change fraud and OMS failure detector is no longer valid. You need to investigate anew which sales spikes or drops indicate fraud or OMS problems. Once you figure out a new threshold, you’ll go back and adjust your testing criteria.</p>
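<p>One way to implement a check like this is a singular test: a SQL file in your <code>tests/</code> directory that fails when it returns rows. Here’s a sketch, with hypothetical model and column names:</p>
<pre><code class="language-sql">-- tests/assert_daily_sales_delta_within_threshold.sql
-- Fails when any day's sales swing more than 20% from the prior day.
with daily as (
    select
        order_date,
        sum(sales_amount) as daily_sales
    from {{ ref('fct_orders') }}
    group by order_date
),

deltas as (
    select
        order_date,
        daily_sales,
        lag(daily_sales) over (order by order_date) as prev_sales
    from daily
)

select *
from deltas
where prev_sales is not null
  and abs(daily_sales - prev_sales) / prev_sales > 0.20
</code></pre>
<p>When your understanding of “normal” shifts, the 0.20 threshold is the only thing you need to adjust.</p>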
<p>Although your data’s expected behavior will shift over time, you should still commit to defining business-focused anomalies to grow your understanding of what is normal for your data.</p>
<p>Here’s how to identify potential anomalies.</p>
<p>Start at your business intelligence (BI) layer. Pick 1-3 dashboards or tables that you <em>know</em> are used frequently. List these 1-3 dashboards or tables. For each dashboard or table you have, identify 1-3 “expected” behaviors that your end-users rely on.  Here are a few examples to get you thinking:</p>
<ul>
<li>Revenue numbers should not change by more than X% in Y amount of time. This could indicate fraud or OMS problems.</li>
<li>Monthly active users should not decline more than X% after the initial onboarding period. This might indicate user dissatisfaction, usability issues, or users not finding a feature valuable.</li>
<li>Exam passing rates should stay above Y%. A decline below that threshold may indicate that recent content changes or technical issues are affecting understanding or accessibility.</li>
</ul>
<p>You should also consider what data issues you have had in the past! Look through recent data incidents and pick out 3 or 4 to guard against next time. These might be in a #data-questions channel or perhaps a DM from a stakeholder.</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="bucket-3-stats-focused-anomalies">Bucket 3: Stats-focused anomalies<a href="https://docs.getdbt.com/blog/test-smarter-not-harder#bucket-3-stats-focused-anomalies" class="hash-link" aria-label="Direct link to Bucket 3: Stats-focused anomalies" title="Direct link to Bucket 3: Stats-focused anomalies">​</a></h3>
<p><em>Stats-focused anomalies</em> are fluctuations that go against your expected volumes or metrics. Some examples include:</p>
<ul>
<li>Volume anomalies. This could be site traffic spiking in a way that suggests illicit behavior, or site traffic dropping one day and then doubling the next, indicating that a chunk of data was not loaded properly.</li>
<li>Dimensional anomalies, like too many product types underneath a particular product line that may indicate incorrect barcodes.</li>
<li>Column anomalies, like sale values more than a certain number of standard deviations from a mean, that may indicate improper discounting.</li>
</ul>
<p>Overall, stats-focused anomalies can indicate system flaws, illicit site behavior, or fraud, depending on your industry. They also tend to require more advanced testing practices than we are covering in this blog. We feel stats-based anomalies are worth exploring once you have a good handle on your data hygiene and business-focused anomalies. We won’t give recommendations on stats-focused anomalies in this post.</p>
<h2 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="how-to-prioritize-data-quality-concerns-in-your-pipeline">How to prioritize data quality concerns in your pipeline<a href="https://docs.getdbt.com/blog/test-smarter-not-harder#how-to-prioritize-data-quality-concerns-in-your-pipeline" class="hash-link" aria-label="Direct link to How to prioritize data quality concerns in your pipeline" title="Direct link to How to prioritize data quality concerns in your pipeline">​</a></h2>
<p>Now, you have a written and categorized list of data hygiene concerns and business-focused anomalies to guard against. It’s time to <em>prioritize</em> which quality issues deserve to fail your pipelines.</p>
<p>To prioritize your data quality concerns, think about real-life impact. A couple of guiding questions to consider are:</p>
<ul>
<li>Are your numbers <em>customer-facing?</em> For example, maybe you work with temperature-tracking devices. Your customers rely on these devices to show them average temperatures on perishable goods like strawberries in transit. What happens if the temperature of the strawberries reads as 300°C when they know their refrigerated truck was working just fine? How is your brand perception impacted when the numbers are wrong?</li>
<li>Are your numbers <em>used to make financial decisions?</em> For example, is the marketing team relying on your numbers to choose how to spend campaign funds?</li>
<li>Are your numbers <em>executive-facing?</em> Will executives use these numbers to reallocate funds or shift priorities?</li>
</ul>
<p>We think these 3 categories above constitute high-impact, pipeline-failing events, and should be your top priorities. Of course, adjust priority order if your business context calls for it.</p>
<p>Consult your list of data quality issues against the categories above. Decide and mark whether each one is customer-facing, used for financial decisions, or executive-facing. Mark any data quality issues in those categories as “error”. These are your pipeline-failing events.</p>
<p>If any data quality concerns fall outside of these 3 categories, we classify them as <strong>nice-to-knows</strong>. <strong>Nice-to-know</strong> data quality testing <em>can</em> be helpful. But if you don’t have a <em>specific action you can immediately take</em> when a nice-to-know quality test fails, the test <em>should be a warning, not an error.</em></p>
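<p>In dbt, this error-versus-warning decision maps directly onto a test’s <code>severity</code> config. A minimal sketch, with hypothetical model and column names:</p>
<pre><code class="language-yaml">models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique:
              config:
                severity: error  # customer-facing: fail the pipeline
      - name: coupon_code
        tests:
          - not_null:
              config:
                severity: warn   # nice-to-know: alert, don't block
</code></pre>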
<p>You could also remove nice-to-know tests altogether. Data testing should drive action. The more alerts you have in your pipeline, the less action you will take. Configure alerts with care!</p>
<p>However, we do think nice-to-know tests are worth keeping <em>if and only if</em> you are gathering evidence for action you plan to take within the next 6 months, like product feature research. In a scenario like that, those tests should still be set to warning.</p>
<h3 class="anchor anchorWithHideOnScrollNavbar_WYt5" id="start-your-action-plan">Start your action plan<a href="https://docs.getdbt.com/blog/test-smarter-not-harder#start-your-action-plan" class="hash-link" aria-label="Direct link to Start your action plan" title="Direct link to Start your action plan">​</a></h3>
<p>Now, your data quality concerns are listed and prioritized. Next, add 1 or 2 initial debugging steps you will take if/when the issues surface. These steps should get added to your framework document. Additionally, consider adding them to a <a href="https://discourse.getdbt.com/t/is-it-possible-to-add-a-description-to-singular-tests/5472/4" target="_blank" rel="noopener noreferrer">test’s description.</a></p>
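<p>Newer versions of dbt support a <code>description</code> property on data tests, which is a handy home for those first debugging steps. A sketch - the names are hypothetical, and the range test assumes the dbt_utils package is installed:</p>
<pre><code class="language-yaml">models:
  - name: fct_orders
    columns:
      - name: revenue
        tests:
          # assumes the dbt_utils package is installed
          - dbt_utils.accepted_range:
              min_value: 0
              description: >
                If this fails, run the compiled test SQL to find the
                offending rows, then trace them back through staging to
                see where negative values first appear.
</code></pre>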
<p>This step is <em>important.</em> Data quality testing should spur action, not accumulate alerts. Listing initial debugging steps for each concern will refine your list to the most critical elements.</p>
<p>If you can't identify an action step for any quality issue, <em>remove it</em>. Put it on a backlog and research what you can do when it surfaces later.</p>
<p>Here are a few examples from our list of unexpected behaviors above.</p>
<ul>
<li>For calculated field X, a value above Y or below Z is not possible.
<ul>
<li><em>Debugging initial steps</em>
<ul>
<li>Use dbt test SQL or recent test results in dbt Explorer to find problematic rows</li>
<li>Check these rows in staging and first transformed model</li>
<li>Pinpoint where unusual values first appear</li>
</ul>
</li>
</ul>
</li>
<li>Revenue shouldn’t change by more than X% in Y amount of time.
<ul>
<li><em>Debugging initial steps:</em>
<ul>
<li>Check recent revenue values in staging model</li>
<li>Identify transactions near min/max values</li>
<li>Discuss outliers with sales ops team</li>
</ul>
</li>
</ul>
</li>
</ul>
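<p>For the “find problematic rows” step, dbt’s <code>store_failures</code> config is worth knowing: it persists a test’s failing rows to a table you can query directly. A sketch for <code>dbt_project.yml</code> on newer dbt versions (older versions use the <code>tests:</code> key instead):</p>
<pre><code class="language-yaml"># dbt_project.yml (fragment)
data_tests:
  +store_failures: true  # write failing rows to a queryable table
</code></pre>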
<p>You now have written out a prioritized list of data quality concerns, as well as action steps to take when each concern surfaces. Next, consult <a href="http://hub.getdbt.com/" target="_blank" rel="noopener noreferrer">hub.getdbt.com</a> and find tests that address each of your highest priority concerns. <a href="https://hub.getdbt.com/calogica/dbt_expectations/latest/" target="_blank" rel="noopener noreferrer">dbt-expectations</a> and <a href="https://hub.getdbt.com/dbt-labs/dbt_utils/latest/" target="_blank" rel="noopener noreferrer">dbt_utils</a> are great places to start.</p>
<p>The data tests you’ve marked as “errors” above should get error-level severity. Any concerns falling into that nice-to-know category should either <em>not get tested</em> or have their tests <em>set to warning.</em></p>
<p>Your data quality priorities list is a living reference document. We recommend linking it in your project’s README so that you can go back and edit it as your testing needs evolve. Additionally, developers in your project should have easy access to this document. Maintaining good data quality is everyone’s responsibility!</p>
<p>As you try these ideas out, come to the dbt Community Slack and let us know what works and what doesn’t. Data is a community of practice, and we are eager to hear what comes out of yours.</p>]]></content>
        <author>
            <name>Faith McKenna</name>
        </author>
        <author>
            <name>Jerrie Kumalah Kenney</name>
        </author>
        <category label="analytics craft" term="analytics craft"/>
    </entry>
</feed>