Better Programming

Advice for programmers.

dbt v1.5 — the 3 Big New Things

Jack C
Published in Better Programming
6 min read · Apr 24, 2023

The CEO of dbt Labs recently announced that dbt 1.5 will be released at the end of April, with some major changes included. In this article, I want to summarise the 3 main new features being released in dbt 1.5 (including data contracts).

First off, let’s start with the use case — you have a single dbt project with thousands of models, and it’s becoming harder to understand who owns what data models and get your changes shipped.

So, one route you can consider is having embedded data teams (e.g. Finance, Sales, Product) own certain data models by placing them in subfolders. You could even split your big dbt project into sub-projects and import them as packages (like software “services”). For example:

SELECT * FROM {{ ref('sales_models', 'opportunities') }}

This 2nd approach still effectively couples the 2 dbt projects together, so you can’t change the opportunities.sql model in your sales_models dbt project without breaking everything downstream.

dbt 1.5 includes 3 new things to help with running dbt at scale:

  1. Access: define who (and what) can ref models from a dbt project
  2. Contracts: specify what columns, datatypes, and constraints apply to a model. Think dbt tests, but run before a model is built
  3. Versions: the ability to create multiple versions of a single dbt model, so it’s easier to make changes without breaking downstream models

I’ll go through these in more detail below, but in my view — you don’t have to split your dbt project into multiple projects to get value out of these new features.

These features were built by dbt Labs with a view to helping data pipelines be treated like “services”: splitting big dbt projects into smaller dbt projects, each owned by a team.

In this approach, each team would be able to rapidly iterate on their own dbt project and, with contracts/access/versioning, not have to worry about breaking changes flowing to or from the other dbt projects they rely on (or that rely on them).

This might work for you, but this might not, and I don’t take a stance in this article!

1. Access

Set who / what can reference (ref) a dbt model:

  • private: Model can only be referenced within the same group
  • protected: Model can only be referenced within the same package/project (this is the default)
  • public: Model can be referenced by any group, package, or project

Groups are defined in yml files, and models are assigned access & groups within yml files, e.g.:

Image source — dbt documentation
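As a rough sketch of the sort of yml shown in the dbt documentation (the group, owner, and model names here are hypothetical):

```yaml
# Groups are defined with a name and an owner
groups:
  - name: finance
    owner:
      name: Finance Team
      email: finance@example.com   # hypothetical owner contact

# Models are then assigned a group and an access level
models:
  - name: fct_revenue_detail
    group: finance
    access: private    # only models in the finance group can ref() this
  - name: dim_customers
    group: finance
    access: public     # any group, package, or project can ref() this
```

Models with no explicit access config stay at the default, protected — referenceable anywhere within the same project.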

What’s the use case? It’s a clean way of surfacing which models other projects/people can (and can’t!) use. The result is your team has fewer things with downstream dependencies, making it faster to develop as there’s less to worry about breaking.

Using private vs. protected achieves this clean split within a dbt project, whereas using public vs. private/protected achieves this split if you have multiple dbt projects.

2. Contracts

Data contracts have become somewhat of a hot topic in data, and I’m going to attempt to oversimplify it as “an agreement between data producers and data consumers on how the data should be structured”.

If you split your single dbt project into multiple sub-projects (“services”), and your project relies on the dim_customers table from the customer_success package, then you want to know that the data will always contain a certain set of columns, with given data types, and constraints.

This would even be relevant if you didn’t go down the route of splitting your single dbt project — if you owned everything in models/finance and that relied on models in models/customer_success you’d probably want the same confidence that the data models will be consistent.

dbt tests go some of the way to solving these problems — but importantly, they differ from contracts in a couple of major ways:

  • Tests are run after a model is built, whereas contracts are run before a model is built
  • If you remove a column in a SQL file but leave it in the yml, dbt won’t break — it’ll just report that the column doesn’t exist. If the yml is under a contract, however, dbt will fail the build because a column is missing

In dbt, contracts have 3 components: the contracts themselves, columns, and constraints. Here’s an example:

Image source — dbt documentation
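As a minimal sketch (the model and column names here are hypothetical), a contracted model’s yml looks something like this:

```yaml
models:
  - name: dim_customers
    config:
      contract:
        enforced: true   # dbt checks the model against this spec before building it
    columns:
      - name: customer_id
        data_type: int
        constraints:
          - type: not_null   # one of the constraint types; support varies by warehouse
      - name: country_name
        data_type: string
```

Note that once `enforced: true` is set, every column needs a `data_type` — the contract is the full column specification, not a partial one.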

Once under contract, a data model has to contain the specified columns, with the specified data types and constraints. As mentioned earlier, contracts are enforced before a model is run — so if it fails, the model won’t run!

You can read the documentation to see the types of constraints available.

What’s the use case? I feel this is useful regardless of whether you have multiple dbt projects or are just a small Data team of <5 people. Being able to ensure your data model has specific columns, constraints, and data types, before running it feels very useful for critical data pipelines!

3. Versions

This builds off contracts. If you have downstream users of a data model and want to make changes to the columns, without breaking anything or doing lots of refactoring downstream, then in dbt 1.5 you will be able to create separate versions of a model that coexist.

So, let’s say you changed the dim_customers model (which is under contract) to remove a column called country_name. You could then version the model by creating a dim_customers_v1 and a dim_customers_v2 — and specifying the differences in the yml file:

Image source — dbt documentation
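A sketch of what that versioned yml might look like, assuming dim_customers_v1.sql and dim_customers_v2.sql files exist side by side:

```yaml
models:
  - name: dim_customers
    latest_version: 2
    config:
      contract:
        enforced: true
    columns:
      - name: customer_id
        data_type: int
      - name: country_name    # the column being removed in v2
        data_type: string
    versions:
      - v: 1                  # defined in dim_customers_v1.sql
      - v: 2                  # defined in dim_customers_v2.sql
        columns:
          - include: all
            exclude: [country_name]   # v2 = v1's columns minus country_name
```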

This is effectively the same as creating an entirely new dbt model, but without creating a new yml entry for the 2nd model. Note — at this point, both models coexist as separate tables!
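Downstream models can then pin to a specific version in their ref, or leave it unpinned to follow whatever latest_version points at — a sketch, reusing the hypothetical dim_customers model:

```sql
-- pinned: keeps selecting from v1 even after v2 ships
select * from {{ ref('dim_customers', v=1) }}

-- unpinned: resolves to the latest_version declared in the yml
select * from {{ ref('dim_customers') }}
```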

One thing you might immediately think here — what if I don’t want v1/2/3 all over my data models? dbt has an optional configuration that handles this:

Image source — dbt documentation

So when you “switch off” v1, you could then put defined_in: dim_customers under v2!
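A sketch of that configuration, assuming the latest version now lives in a plain dim_customers.sql file:

```yaml
models:
  - name: dim_customers
    latest_version: 2
    versions:
      - v: 1                        # still defined in dim_customers_v1.sql
      - v: 2
        defined_in: dim_customers   # defined in dim_customers.sql, no suffix
```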

What’s the use case? For big organizations with lots of downstream dependencies between teams, this is a lifesaver for being able to develop and ship new things quickly as a team. Previously, building and shipping a data model change and handling its downstream impacts had to be done in a single code change, whereas now they can be handled separately.

However, I can see some pitfalls of this approach if not used carefully:

  • Downstream users pointing at different data models: especially painful if you change filtering logic between versions! I think using the defined_in configuration, so there is always one clearly “live” version of the table, is most useful here
  • Refactoring being left to the downstream users: if it’s easy to change your model without breaking anything, then the temptation is to do just that and leave the job of migrating to the new version to your downstream users

You could argue that the 2nd point is a valid thing to be owned by downstream users, but if you make a truly big change (e.g. relationship of orders <> customers is no longer 1:1) then it’s a big responsibility to be passing on!

In summary, this release is a huge step by dbt towards allowing larger data teams to handle complicated dbt ecosystems — but whilst the new features are clearly quite powerful, they need careful thought before implementation!

Thanks for reading.

New to dbt, or someone who wants to learn the advanced concepts? My dbt course is now live on Udemy (link), and covers everything from basic to advanced dbt — including building a full project from scratch, 7 hours of video, and 50 downloadable course materials!


Written by Jack C

I write about Data Analytics and Analytics Engineering
