Background
The State of Data (as of 2021)
Whether you’re a Fortune 500 company managing petabytes or a startup just beginning to collect data, organizations across the US are investing in data warehouses and data lakes. The value of data is now universally recognized—terms like "big data" and "my data" are everywhere. Data is the foundation for leveraging technologies like AI and machine learning.
"A string of pearls is only valuable when strung together"
Collecting data is just the first step; the real value comes from putting it to use. Most companies start by extracting key metrics to inform business decisions. As organizations mature, they move beyond simple reporting—using machine learning to build new product features, developing internal tools, and finding innovative ways to drive business value.
To add a more personal perspective, the company I currently work for has been building and using a BigQuery-based data warehouse for nearly three years. During this time, a significant amount of historical data has accumulated in BigQuery, and we've been leveraging machine learning to solve many problems.
However, as various use cases emerge from this accumulated data, problems inevitably begin to surface.
- Data Quality

  Data changes continuously over time, and its quality can degrade. There are generally two scenarios: 1) natural data drift, and 2) errors caused by faulty logic or unhandled external factors. In most cases, edge cases that fall outside human expectations, or simple human mistakes, lead to data contamination.

- Data Version Management

  As mentioned above, as data continuously changes, data pipelines change as well. While Spark-based pipeline code is typically managed through git, managing the resulting data is not straightforward (data warehouses usually track versions through table labels or table columns). SQL-based pipelines are even more challenging: unless SQL is managed as code, tracking changes is difficult, and reviewing a complex pipeline of 500+ lines of SQL line by line in peer review is not easy.

- Data Ownership

  One might assume data belongs to everyone in the company, but the reality is different. The more critical the data, the more important it becomes to have someone responsible for fixing issues when they arise, and someone with ownership to review and deploy new versions when logic changes, just like with code.

- Data Dependency Management

  As the amount of data grows, more logic emerges that creates downstream tables based on joins with fact/dimension tables or on the update status of other tables (typically expressible as DAGs). This can lead to side effects when someone deletes or modifies existing tables and datasets. Without a clear picture of data dependencies, it becomes difficult to manage data safely.
The Need for Systematic Data Management
Several data-related terms have emerged to address the problems mentioned above.
- Data Catalog

  Source: Complete Guide to Data Catalog Tools and Architecture

  Data catalogs help manage data effectively by collecting metadata. AWS Glue Catalog is a representative data catalog offered by a cloud provider, while BigQuery and Snowflake have their data warehouses serve the catalog role as well. Databricks' Delta Lake is also known to provide similar functionality.

- Data Observability

  Source: https://newrelic.com

  Data observability provides visualizations of data pipelines and data distributions, making it easier for people to understand their data and manage data quality efficiently. New Relic and Databand are well-known solutions in this space.

- Data Lineage

  Source: Data Lineage - Mapping Your Data Journey | Subsurface

  Data lineage, like a catalog, is a way of managing metadata efficiently. It visualizes data flow and implements lineage tracking to address data version management and dependency identification problems.

Source: dbt introduction
In addition, modern data infrastructure is seeing an influx of various terms like Data Governance and Data Mesh. Rather than applying all of them, it's better to focus on solving actual problems with methodologies that fit your specific situation.
For more information, check out:
Understanding Modern Data Infrastructure
This series covers various modern data stacks.
Today's topic, dbt, addresses the problems mentioned above through Data Lineage and focuses on the T (Transform) in ELT pipelines.
What is dbt?
dbt stands for data build tool and is designed to make transformation easy, specifically SQL-based transformation in the extract-transform-load process.
dbt (data build tool) - Transform data in your warehouse
(It also supports a plugin for Apache Spark clusters, but rather than running Python/Scala jobs, it appears to communicate with a Thrift/HTTP server and execute HiveQL.)
It's widely used by companies like GitLab, Grailed, Slack, and Notion. (However, use cases in Korea are not yet well-known)
In summary:
A SQL-based Transform tool focused solely on the T in ETL
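Concretely, a dbt "model" is just a SELECT statement saved as a SQL file; dbt compiles it and materializes the result as a table or view in the warehouse. A minimal sketch (the file and table names here are hypothetical, not from the example project used later):

-- models/active_users.sql
select
    user_id,
    count(*) as login_count
from {{ ref('stg_logins') }}   -- ref() declares a dependency on another model
group by user_id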
CLI vs Cloud
dbt has CLI mode and Cloud mode.
CLI mode installs dbt as a Python package. It uses YAML for managing table metadata and unit tests, Jinja templates for various macros and UDFs, and calculates dependencies between tables. CLI mode is also free(!). More details will be covered below.
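For reference, a typical CLI session looks roughly like this (a sketch; the model name is a placeholder, and each command is covered in more detail later in this post):

$ dbt run                      # build all models in the project
$ dbt run --models my_model    # build only the specified model
$ dbt test                     # run the tests defined in the YAML files
$ dbt docs generate            # compile docs metadata (manifest.json, catalog.json)
$ dbt docs serve               # browse the generated docs on a local web server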
https://www.loom.com/embed/05f9c34b17c74c97b98286683dafd420
Cloud mode is a paid service. It provides all the features available in CLI mode, plus an integrated SQL development environment (IDE). Through the web IDE, analysts and SQL users can run unit tests on their tables, open pull requests as part of their workflow, and perform basic Git branch operations just like software developers, which makes the overall SQL development process easier to maintain. It also supports scheduled (cron) jobs. dbt Cloud pricing is as follows.

Source: https://www.getdbt.com/pricing/
Personally, I feel dbt Cloud's strongest selling point is that it simplifies the tedious aspects of dbt work (checking compiled queries, reviewing results, git operations, etc.) through its web interface. However, dbt CLI can handle most features as well. (Not to mention, $50 per person per month is quite expensive...)
What are the advantages and disadvantages?
Based on my experience implementing dbt in an actual company:
Advantages
- A variety of data unit tests can make existing pipelines more robust
- Intuitive Docs UI reduces communication costs for finding data
- Long SQL can be broken down into reusable modules
- (Not dbt-specific) Managing SQL through code and development cycles allows systematic management of queries and tables through SQL reviews and style conventions
- Various macros and open-source tools exist
- Uses existing data warehouse resources more efficiently without requiring separate cloud or on-premises resources
- Reduces overhead for SQL-based data operations, allowing focus on table modeling and architectural considerations
Disadvantages
- Requires systematic management of YAML, which can lead to manual tasks
- Has a learning curve, requiring internal training
- It's a tool focused only on Transform, so other tools must be used alongside it for Extract or Load
- When using only the CLI, monitoring which queries are currently running is not easy
  - (That said, adapters such as BigQuery's provide mechanisms like a max bytes billed limit, so query costs can be kept reasonably well under control)
When to consider adopting dbt
- When there are many SQL-based data queries and table creations
- When using MPP-based data warehouses in ELT pipeline environments
  - Traditional ETL pipelines often combine Transform and Load (for example, using PySpark to df.read and then df.write...); dbt is a library for transformation, with limited support for loading
- When there are many people in the company who work with SQL
  - Since dbt performs data transformation based on SQL, the more people who work with SQL, the greater the productivity impact
- When systematic management of derived tables is needed due to high query costs
  - With BigQuery and Snowflake, creating tables is much easier than with a traditional RDBMS. The trade-off is that tables get created indiscriminately and the same table ends up being built in multiple places, leading to unnecessary data scanning and high query costs
  - Once you start managing tables systematically through dbt, it becomes easier to see which tables are currently managed, who owns them, and which columns and tests are used to manage them. In the long term, except for unavoidable cases, this reduces both the effort needed to find data and the query costs
When not to consider adopting dbt
- When there aren't many people in the company who work with SQL
- When Hadoop-based data warehouses and Spark-based data processing are more familiar and well-maintained
- When the speed of handling ad-hoc requests is more important than systematic metadata and table management
Understanding dbt's Basic Features
The following explanation is based on dbt CLI.
All code and development environment configurations can be found in the GitHub repository below.
The configuration for demonstrating dbt's features is as follows:
- dbt codebase
- Postgres Docker for DW mocking
- (optional) devcontainer
- Python 3.8.x based dbt initial setup

Install dependencies using pyenv and poetry:
# cd <PROJECT_NAME>
$ pyenv shell 3.8.2
$ poetry config virtualenvs.in-project true  # Install dependencies in .venv
$ poetry new dbt  # Create a new poetry project
$ poetry add "dbt-core==0.19.1" "dbt-postgres==0.19.1"

# cd <PROJECT_NAME>/dbt
$ dbt init --adapter postgres core  # Create a dbt project
$ cd core && tree
.
├── README.md
├── analysis
├── data
├── dbt_project.yml
├── macros
├── models
│   └── example
│       ├── my_first_dbt_model.sql
│       ├── my_second_dbt_model.sql
│       └── schema.yml
├── snapshots
└── tests
Additionally, the databases and warehouses currently supported by dbt are as follows:
- BigQuery
- Snowflake
- Postgres
- Redshift
- MS SQL
- Oracle
- Presto
- Apache Spark (Thrift, HTTP Server)
- Microsoft Azure Synapse DW
- Dremio
- ClickHouse
(In dbt, these are expressed as adapters)
dbt Project Structure
├── README.md
├── analysis
├── data
├── dbt_project.yml
├── macros
├── models
│   └── example
│       ├── my_first_dbt_model.sql
│       ├── my_second_dbt_model.sql
│       └── schema.yml
├── snapshots
└── tests
This is the structure of an initial dbt project.
- data contains static files for data feeding, such as CSV files. You can load them into the data warehouse/database via insert queries by executing the dbt seed command. (Example)

- analysis contains SQL files used for analysis only, not for creating tables. These queries are not actually executed by the dbt run command; they are used only for compile checks (dbt compile).

  Example: analysis/customer_count.sql

  with customers as (
      select * from {{ ref('raw_customer') }}
  )

  select count(1) from customers

  This query is compiled as follows (target/compiled/core/analysis/customer_count.sql):

  with customers as (
      select * from "dbt"."transformed"."raw_customer"
  )

  select count(1) from customers
- dbt_project.yml contains metadata for the dbt project. (Documentation)

- macros is a directory containing user-defined Jinja template macros.

  Example:

  {% macro cents_to_dollars(column_name, precision=2) %}
      ({{ column_name }} / 100)::numeric(16, {{ precision }})
  {% endmacro %}

  The above macro can be used in models as follows:

  select
      id as payment_id,
      {{ cents_to_dollars('amount') }} as amount_usd,
      ...
  from app_data.payments

- models is where the tables created with dbt are gathered. It is dbt's main working directory. (Example)
- snapshots creates tables that capture point-in-time snapshots of source tables.

  Example:

  {% snapshot orders_snapshot %}

  {{
      config(
        target_database='analytics',
        target_schema='snapshots',
        unique_key='id',
        strategy='timestamp',
        updated_at='updated_at',
      )
  }}

  select * from {{ source('jaffle_shop', 'orders') }}

  {% endsnapshot %}

  The above SQL file creates a model called orders_snapshot through the dbt snapshot command. When the query is first executed via dbt snapshot, the data at that point in time is stored along with dbt_valid_from and dbt_valid_to columns. From the second run onwards, new rows are inserted into the table. It is mainly used with source tables (e.g., when you want to keep daily snapshots of tables that are continuously loaded by daily batches).

- tests contains definitions of tests for models and snapshots.
dbt Terminology
- Model

  Simply put, a model is a table. When you create a SQL file in the models directory, it is compiled into SQL syntax like CREATE TABLE AS or CREATE OR REPLACE TABLE and adds a table to the connected data warehouse or database. (It is therefore a good idea to manage schemas and privileges carefully.) If you don't use the incremental option, dbt overwrites the existing table (like a TRUNCATE INSERT in BigQuery), so be careful.
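  As an illustration, a sketch of a model using the incremental option (the table and column names here are hypothetical) might look like this:

  {{ config(materialized='incremental', unique_key='event_id') }}

  select
      event_id,
      user_id,
      event_type,
      created_at
  from {{ ref('stg_events') }}

  {% if is_incremental() %}
  -- on incremental runs, only pick up rows newer than what the table already contains
  where created_at > (select max(created_at) from {{ this }})
  {% endif %}

  With this configuration, dbt only inserts (or merges on the unique_key) the new rows instead of rebuilding the whole table on every run.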
- Source

  Refers to the source data/tables used when creating models. Unlike models, sources are not created or overwritten by dbt. Best practice recommends not referencing sources directly; instead, create staging tables that rename columns from the source tables according to your conventions and, if necessary, reshape them into an easier-to-work-with form through simple where clauses or joins.
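  For illustration, a source is declared in YAML and then referenced with the source() function; the names below are hypothetical (the stg_orders example later in this post reads from a seeded table via ref() instead, since the demo data is loaded with dbt seed):

  # models/staging/sources.yml
  version: 2

  sources:
    - name: jaffle_shop       # logical source name
      schema: raw             # schema where the raw tables live
      tables:
        - name: orders

  -- in a staging model such as stg_orders.sql
  select * from {{ source('jaffle_shop', 'orders') }}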
- Schema

  People familiar with RDBMS might confuse this with the database - schema - table hierarchy. In dbt, a schema refers to the specification of the tables that models will create (e.g., table name, description, column names, tests, ...).

  Example schema.yml:

  version: 2

  models:
    - name: transactions
      description: Table containing detailed information by order
      columns:
        - name: order_id
          tests:
            - unique
            - not_null
          description: This is a unique identifier for an order
        - name: customer_id
          description: Foreign key to the customers table
          tests:
            - not_null
            - relationships:
                to: ref('stg_customers')
                field: customer_id
        - name: customer_name
          description: customers full name
          tests:
            - not_null
- Profile

  This is the metadata for connecting dbt to a DW or DB. It's good to maintain a 1:1 relationship between a dbt project and a profile. More detailed information can be found in dbt's documentation. By default, when you create a dbt project, it creates and uses ~/.dbt/profiles.yml. However, in collaborative environments there may be constraints on using a different profile in each local environment, so I personally recommend creating a profiles yaml inside the project and overriding it with the --profiles-dir flag.

  Example profile yaml:

  config:
    send_anonymous_usage_stats: False
    use_colors: True

  dbt:
    target: dev
    outputs:
      dev:
        type: postgres
        host: warehouse
        port: 5432
        user: dbt
        password: dbt
        dbname: dbt
        schema: transformed
        threads: 3
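  Assuming the YAML above is saved as profiles.yml in the project root rather than under ~/.dbt, you would then point dbt at it like this:

  $ dbt run --profiles-dir .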
- Target

  It's most accurate to think of this as the environment in which dbt will run. Most projects define targets such as dev/prod or dev/live to distinguish between configurations for development and production. Settings applied when dbt commands run, such as how many threads to use and (for some adapters) the max bytes billed, are set per target.
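  Building on the profile example above, a sketch of adding a separate prod output (the connection values are hypothetical, and the password is read from an environment variable via env_var) might look like this; dbt run --target prod then switches environments:

  dbt:
    target: dev
    outputs:
      dev:
        type: postgres
        host: warehouse
        port: 5432
        user: dbt
        password: dbt
        dbname: dbt
        schema: transformed
        threads: 3
      prod:
        type: postgres
        host: prod-warehouse                             # hypothetical production host
        port: 5432
        user: dbt
        password: "{{ env_var('DBT_PROD_PASSWORD') }}"   # injected at runtime
        dbname: dbt
        schema: analytics
        threads: 8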
Initial Data Loading
- Create a CSV file in the data directory
- Run the dbt seed command
- A table based on the CSV file is created in the target, just like a model
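As a small sketch (the columns shown are hypothetical), a seed file and the command look like this:

# data/raw_customer.csv
id,name
1,Alice
2,Bob

$ dbt seed    # loads every CSV in the data directory into the target schema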

Creating Your First Model
- Create models/staging/stg_orders.sql

  with source as (
      select * from {{ ref('raw_orders') }}
  ),

  renamed as (
      select
          id as order_id,
          user_id as customer_id,
          order_date,
          status
      from source
  )

  select * from renamed

- Execute the dbt run command
- Check the table

- For conventions on table layering, such as separating staging/mart, it's good to refer to the guide by Fishtown Analytics, the creators of dbt.
- For model selection tips: if you pass +model_name to the --models option, dbt builds that model along with all of the tables it depends on. The related syntax is well organized in GitLab's dbt guide. For example, +transactions rebuilds tables like payments and orders that the transactions table is connected to, together with transactions itself.
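A quick sketch of this selection syntax, using the transactions model from this post:

$ dbt run --models +transactions    # transactions plus everything it depends on (upstream)
$ dbt run --models transactions+    # transactions plus everything that depends on it (downstream)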
Documentation
You can create a webpage to view the metadata of dbt models created so far with two commands.
dbt docs generate
This command compiles models, draws a dependency graph, and stores the metadata in the target folder (this is the default, but you can change it to another name in dbt_project.yml) in the form of manifest.json and catalog.json. It also creates an index.html file to visualize it.
dbt docs serve
Launches a simple web server so you can view it on localhost.

dbt docs serve → localhost:8080
As shown in the screenshot above, you can easily view the metadata, column information, and dependency relationships of the transactions model created in the example above in webpage format. The description is currently empty, but you can leave documentation about the table here in markdown format, such as what kind of table it is, simple logic explanations, etc. (Reference)

You can view table dependencies in more detail in fullscreen view.
Creating Your First Test
There are two main types of tests in dbt.
- Tests before running dbt run
- Tests after running dbt run
For case 1, tests range from basic checks, such as whether data has landed properly in source tables and whether specific columns are non-null or unique, to custom test macros covering more elaborate cases (for example, checking whether today's row count deviates significantly from the distribution of daily row counts over a recent period; I'll explain this in more detail on the company's tech blog).
For case 2, it tests whether data was created as expected for tables created through the dbt run command. Like case 1, you can test the pipeline using various test cases.
You can add tests by simply adding a line or so of code to schema.yml (case 2) or source.yml (case 1).
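For case 1, a sketch of source-level tests (the source, schema, and column names here are hypothetical) could look like this; running dbt test then checks the raw table independently of building downstream models:

# models/staging/sources.yml
version: 2

sources:
  - name: jaffle_shop
    schema: raw
    tables:
      - name: orders
        columns:
          - name: id
            tests:
              - unique
              - not_null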
As an example, let's add test cases to check if data was properly loaded into the transactions model created above (case 2)
version: 2

models:
  - name: transactions
    description: Table containing detailed information by order
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
        description: This is a unique identifier for an order
      - name: customer_id
        description: Foreign key to the customers table
        tests:
          - not_null
          - relationships:
              to: ref('stg_customers')
              field: customer_id
      - name: customer_name
        description: customers full name
        tests:
          - not_null
      - name: order_date
        description: Date (UTC) that the order was placed
      - name: status
        tests:
          - accepted_values:
              values:
                ["placed", "shipped", "completed", "return_pending", "returned"]
      - name: amount
        description: Total amount (AUD) of the order
        tests:
          - not_null
      - name: credit_card_amount
        description: Amount of the order (AUD) paid for by credit card
        tests:
          - not_null
      - name: coupon_amount
        description: Amount of the order (AUD) paid for by coupon
        tests:
          - not_null
      - name: bank_transfer_amount
        description: Amount of the order (AUD) paid for by bank transfer
        tests:
          - not_null
      - name: gift_card_amount
        description: Amount of the order (AUD) paid for by gift card
        tests:
          - not_null
As shown above, you can write both documentation and test cases for models in the yml file.
To run tests, use the dbt test command

The example test above only used test cases provided by default in dbt. To see specifically what queries were used to perform the tests, you can look at the target/compiled directory. Let's examine how the test for the status column in schema.yml was converted to a query:
with all_values as (
select distinct
status as value_field
from "dbt"."transformed"."transactions"
),
validation_errors as (
select
value_field
from all_values
where value_field not in (
'placed','shipped','completed','return_pending','returned'
)
)
select count(*) as validation_errors
from validation_errors
The query returns a count: if it is 0 the test passes, otherwise it fails. In this way, dbt tests columns and tables simply by checking whether a SQL query returns 0 or something else, so you can cover a wide range of test cases with basic SQL knowledge alone.
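For instance, in the dbt version used in this post (0.19.x), a custom schema test is just a macro named test_<name> that returns a count and passes when that count is 0. A sketch (the macro and column names are hypothetical):

-- macros/test_not_negative.sql
{% macro test_not_negative(model, column_name) %}

select count(*)
from {{ model }}
where {{ column_name }} < 0

{% endmacro %}

Referencing not_negative under a column's tests in schema.yml would then fail the build whenever that column contains negative values.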
Conclusion
As we've seen above, dbt helps solve various problems in modern data infrastructure using only SQL. Instead of requiring costly processing in separate distributed systems, it delegates the heavy lifting to the data warehouse, builds modular and reusable blocks with Jinja templates, adds documentation and test cases with a few lines of YAML, and exposes metadata through generated web pages, reducing the communication cost between the teams that manage data and those that consume it.
I wanted to cover more detailed content in this article, but the article became too long just covering background explanations and basic features, so I plan to organize more advanced features and tips and learnings gained from operating in actual production environments in the next article. For example:
- Creating fast and stable SQL-based data pipelines with Airflow custom dbt operator
- Managing SQL coding conventions with pre-commit + sqlfluff
- More efficient data management of dbt codebase with custom macros and dbt packages
- More detailed model layer separation and easy-to-maintain dbt tips
- Building dbt-based Data Observability dashboards
As mentioned above, many companies are still operating Hadoop-based data infrastructure or environments where it's difficult to apply dbt. As I habitually say, there's no silver bullet. New technologies emerge daily in the data industry, and dbt could also be just a fad.
Adding my personal thoughts, there seems to be a trend of many more teams building data infrastructure around the ELT paradigm rather than ETL (not that dbt can't be used in ETL at all...), and it's a technology worth considering as an alternative to Apache Spark, which has a broad user base but a steep learning curve.