This blog was written together with Arbin Bici.
If you are part of the tech community, chances are high that you have dealt with databases and SQL at least once. So most of us can relate to issues like the following:
Did I commit this transaction?
My data is duplicated, and I don’t know how to sanitize it.
I have no documentation for this project. Let me reverse engineer the ELT every time I need to change something.
We often make mistakes while dealing with these issues too, and then we need someone to double-check the outcome. This type of work usually falls within the scope of the Analytics Engineer, who uses dbt or similar tools to boost productivity.
What is dbt?
dbt is an acronym for Data Build Tool and specializes in the transformation (T) part of an ELT process. With dbt, you transform data that has already been loaded into the warehouse and shape it the way you want using classic SELECT statements.
What makes dbt so useful?
dbt materializes queries as relations, so you do not need to worry about the DDL. This way you can focus solely on defining models (SQL queries) that transform the data to meet the business requirements. It also brings the power of software engineering to SQL through Jinja templates.
Developing locally
Let’s assume that we are interested in some information regarding Covid in the past two years. For this, we have a source API that provides data, which is not entirely meaningful at first sight.
The documentation of the API we are using is: https://about-corona.net/documentation
For the analytical process, we are using dbt. The main goal is to find the death rate during the pandemic.
The project structure we have created looks like this:
Firstly, set up the environment:
git clone git@github.com:data-max-hq/covid-dbt-analytics.git
cd covid-dbt-analytics
cd demo
Right now, a record of our data looks like this:
To create a database instance, you must start up a Postgres container with the command below:
docker compose up postgres
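For reference, the postgres service in the project's docker-compose.yml might look roughly like the sketch below; the image tag, credentials, and port mapping are assumptions, so check the actual file in the repository:

services:
  postgres:
    image: postgres:14          # image tag is an assumption
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: postgres
    ports:
      - "5432:5432"             # expose Postgres on its default port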
Populate the dataset from the API
There is a script in the project, get_data.py, that reads from the API and writes to the covid_data.csv file under the seeds folder.
Execute:
python get_data.py -i al de it
This call will return Covid data for Albania, Germany, and Italy. The arguments are the countries' two-letter country codes.
Create sources
Seeds are CSV files in dbt that can be loaded as tables in the database. To create these tables, execute the following command:
dbt deps # install dbt dependencies
dbt seed --profiles-dir ./profiles
The profiles.yml file holds the configuration of the database connection; it specifies that all tables generated by this command are to be created in the source schema of the Postgres database.
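As an illustration, a minimal profiles.yml for this setup might look like the sketch below; the profile name, credentials, and thread count are assumptions, so refer to the actual file under ./profiles:

covid_dbt_analytics:            # profile name is an assumption; it must match dbt_project.yml
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      port: 5432
      user: postgres
      password: postgres
      dbname: postgres
      schema: source            # per the description above, seeds land in the source schema
      threads: 4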
After seeding, you should have this view on your database instance:
These two tables are going to be our source. To tell dbt this fact, we create a src_covid_data.yml configuration file in the structure below:
|-- demo
|-- models
|-- |--marts
|-- |-- |--core
|-- |-- |-- |-- (facts and dimensions are stored here)
|-- |--staging
|-- |-- |-- src_covid_data.yml
|-- |-- |-- stg_models.yml
|-- |-- |-- *.sql (intermediate models are stored here)
Once defined as a source, these tables can be referred to within the project as shown in the following Jinja function:
{{ source('source', 'covid_data') }}
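For reference, a minimal src_covid_data.yml could look roughly like the sketch below; the schema and the second seed table are assumptions based on the source() call above:

version: 2

sources:
  - name: source                # the source name used in {{ source('source', ...) }}
    schema: source              # schema where dbt seed created the tables
    tables:
      - name: covid_data
      # the second seed table is listed here in the same way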
Prepare source data
Now that we have a source, we will do some transformations to demonstrate the power of dbt. For this use case, we aggregate the records to show, for each country, the total daily number of people infected, recovered, and deceased. We create a SQL file called stg_prepared_source under the staging folder. This model queries the source and aggregates the data into the desired shape.
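A rough sketch of what stg_prepared_source.sql could contain; the column names (country, date, confirmed, recovered, deaths) are assumptions based on the description above, not necessarily the real CSV headers:

-- models/staging/stg_prepared_source.sql (illustrative sketch)
select
    country,
    date,
    sum(confirmed) as total_confirmed,
    sum(recovered) as total_recovered,
    sum(deaths)    as total_deaths
from {{ source('source', 'covid_data') }}
group by country, date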
To check the actual query that runs in the database, you can compile the model using the following command:
dbt compile --select stg_prepared_source --profiles-dir ./profiles
Look under the target folder to see the SQL queries generated by dbt.
Execute the following to create all models:
dbt run --profiles-dir ./profiles
To only run this model, execute the following command:
dbt run --select stg_prepared_source --profiles-dir ./profiles
Note: We have configured in the dbt_project.yml that all models under the staging folder will be materialized as views.
Expect to see a view object named stg_prepared_source in the public schema of the database. However, this configuration can be overridden in the model's file by adding
{{ config(materialized = 'table') }}
The second staging model, stg_deaths_per_month, aggregates the number of deaths per month in each country. For this one, we used an incremental model to process only the latest data. A model can be materialized as a view, a table, an ephemeral model, or an incremental model.
Views and tables are rebuilt from scratch every time the dbt run command is executed, which means the same records are re-processed repeatedly.
To process only the new data, one can use the incremental materialization type and define a filtering rule inside the model.
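A typical incremental filter looks roughly like the sketch below; the column names and the upstream model are assumptions, not necessarily what the project uses:

-- models/staging/stg_deaths_per_month.sql (illustrative sketch)
{{ config(materialized='incremental') }}

select
    country,
    date_trunc('month', date) as month,
    sum(total_deaths) as deaths
from {{ ref('stg_prepared_source') }}
{% if is_incremental() %}
  -- only process rows newer than the latest month already in this table;
  -- a real model would also handle the partially loaded current month (e.g. with a unique_key)
  where date > (select max(month) from {{ this }})
{% endif %}
group by 1, 2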
This means that only the data that fits the “where” condition will be inserted or updated in the model.
To bypass this condition and refresh the whole model, execute the following command:
dbt run --full-refresh --profiles-dir ./profiles
Calculate the death rate
The main purpose of this project was to find the death rate during the pandemic. We know that the formula for this is the following:
death_rate = deaths/population*100
To calculate the death rate, we have built a macro called calc_death_rate(n), where n defines the scale of the rate (i.e., per how many people it is expressed).
Macros are functions that are either defined by the developer inside the macros folder or part of a package that is imported.
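To give an idea of the shape, a macro like calc_death_rate could be defined roughly as below; the column names (deaths, population) and the exact interpretation of n are assumptions, not necessarily the project's implementation:

-- macros/calc_death_rate.sql (illustrative sketch)
-- renders a SQL expression for the death rate per n people;
-- assumes the calling model exposes deaths and population columns
{% macro calc_death_rate(n) %}
    (deaths::numeric / nullif(population, 0)) * {{ n }}
{% endmacro %}

A model can then call it inside its select list, for example as {{ calc_death_rate(100000) }} to get a rate per 100,000 people.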
For the most common functionalities, there are predefined macros that can be imported with packages. Inside the project, there is a packages.yml file where we have listed the packages we want to install.
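For reference, packages.yml has the shape sketched below; the package and version shown here are only an example, not necessarily what the project pins:

packages:
  - package: dbt-labs/dbt_utils
    version: 0.8.6              # illustrative version, not necessarily the one used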
To install these packages, execute: dbt deps
Macros are then called inside Jinja template tags.
Under the marts folder, there is the monthly_death_rate.sql model, whose third column uses the calc_death_rate(n) macro; it takes n as the number of people and returns the death rate per n people.
A macro is executed automatically if it is called by another object. To run a macro independently, one can run it as shown below:
dbt run-operation <macro_name> --profiles-dir ./profiles
In this case, the macro will be called by the model and the model will be created with the run command:
dbt run --profiles-dir ./profiles
At this point, the fct_monthly_death_rate table looks as shown in the picture below:
Analyses
There may be cases where you just need to save the query but not materialize it. For this purpose, you can save the SQL file under the analyses folder. These files will be compiled but not executed. This way they will be stored and versioned like other dbt objects. To compile one, run the following command:
dbt compile --select new_confirmed --profiles-dir ./profiles
Tests
Tests are SQL queries applied to the data to check for logical mistakes.
Types:
Generic: parameterized, reusable tests (including the built-in unique, not_null, accepted_values, and relationships)
Singular: one-off custom tests written as SQL queries
Generic tests are applied inside the configuration yaml files. They must be assigned to a model or column to run.
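For example, applying the built-in not_null generic test to a column in stg_models.yml could look like the sketch below; the model and column names are assumptions:

version: 2

models:
  - name: stg_prepared_source
    columns:
      - name: country
        tests:
          - not_null            # built-in generic test applied to this column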
Singular tests are defined as SQL files under the tests folder.
Singular test written by the developer:
How dbt interprets it:
If this query returns any rows, the test is said to have failed.
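As a hypothetical illustration, a singular test that fails whenever a negative death count appears could look like this; the model and column names are assumptions:

-- tests/assert_no_negative_deaths.sql (illustrative sketch)
select *
from {{ ref('stg_prepared_source') }}
where total_deaths < 0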
Run all tests:
dbt test --profiles-dir ./profiles
Run singular tests:
dbt test --select test_type:singular
Run generic tests:
dbt test --select test_type:generic
Documents
dbt offers a way of documenting your project with descriptions and relations configured in the yaml files.
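Descriptions are added next to the model and column definitions in the same yaml files used for sources and tests; a small sketch with assumed names and wording:

version: 2

models:
  - name: fct_monthly_death_rate
    description: "Monthly death rate per country, derived from the Covid source data"
    columns:
      - name: death_rate
        description: "Deaths per n people, calculated with the calc_death_rate macro"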
To create and version the documentation, execute the following command:
dbt docs generate --profiles-dir ./profiles
To view it, execute the following:
dbt docs serve --profiles-dir ./profiles
This will start a web server on port 8080 (the default) to serve the documentation locally.
A blue circle button in the bottom right corner will show a visualization of the data flow and dependencies between database objects.
The DAG (directed acyclic graph) of the data flow
Streamlit Integration
Streamlit is an open-source framework used to turn data scripts into shareable web applications with just a few lines of Python code. It can visualize whatever data or ML models you want. In our case, we used it to show the process of dbt creating its models and to display the data as a Pandas DataFrame. We also used an Altair chart that shows, for each selected country, the number of deaths per month since January 2020.
Since the whole application is dockerized, we can use only the following command (while in the demo folder) to start both the Streamlit app and the PostgreSQL database:
docker compose up --build
Now we can go to localhost:8501. Select all the countries you want and click on Get Data.
The output should look as follows:
At this point, we can see the dataframe that shows all the information we get from the API for each country, and then the graph that shows the number of deaths for each month of every selected country.
You can clear the output by clicking the clear button, or clear the selected countries plus the output by pressing the small cross sign in the selection box.
When finished, we stop the docker containers using:
docker compose down
The following diagram shows the architecture of the whole app:
Summary
In this demo, we were able to turn a simple dataset with Covid-related data into a line graph that shows how many deaths per month occurred in each chosen country. We went from a CSV to multiple staging and target models of different materializations, plus analyses, macros, tests, and documentation, all within one dbt project.
Once properly configured, dbt can:
determine the order in which the staging and target tables should be created
generate the DDL to build the database objects
keep track of documentation and serve a good-looking graph that shows the data flow
run automatic tests and snapshots
define and call functions
compile, run and generate SQL queries with Jinja template tags
Then we could visualize the entire process and output of dbt using Streamlit.