
AutoGen - Chat to dbt data
Talk data to me
One dbt documentation, one sql-bot and one priest sit at a bar…
Imaging a world where you could just ask questions to your data warehouse by just asking them out loud.
> Hey data, how many users did we have last week?
> 1234
> Hey data, how much money did we earn last month?
> 100000€
and so on…
The idea
We got everything that we need with:
- Your dbt documentation provides the chat bot with the information about your data warehouse
- AutoGen provides the orchestration layer to connect different chatbots, your dbt documentation and your data warehouse
Note
This is not a novel or even new idea. Text-to-sql bots have been plenty around for a while now. (See, Better than a Data Analyst or chat-dbt )
The extension to the given is that via. AutoGen you can directly query the data in chat.
At the moment integrating chatbot capability into all possible tools is the big roadblock to use AI even more.
RAG RAG CITY, Bitch!

All our dbt model definition and documentation is in .yml
-files. Furthermore it is most likely not public information nor is any publicly available chat bot trained on this.
To put it in a nutshell our chatbot does not know from our data model.
Note
As this could be sensible information we might not want to share this with any chatbot API.
The chatbot might be trained on data that descries a similar named data model to ours, but one that vastly differs from our.
To ensure we use the correct documentation and protect sensible data encapsulating
Here comes the RAG (retrieval augmented generation) into the equation. RAG approaches have shown to be more factual, reliable and controllable in that regard. (See Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks )
Data engineer BOT
Note
This bot has connection to our dbt documentation, retrieves information from it and provides other bots with it.
The bot is not responsible for the actual data retrieval, writing sql querys or any other programming.
# Data engineer bot
data_engenieer = RetrieveUserProxyAgent(
name="data_engenieer",
is_termination_msg=termination_msg,
system_message="SQL Assistant has extra content retrieval power for solving difficult problems concerning aviable data sources. You write sql querys. Only write querys containg tables and columns provided to you. Reply `TERMINATE` in the end when everything is done.",
llm_config=llm_config,
max_consecutive_auto_reply=3,
human_input_mode="Never",
retrieve_config={
"task": "qa",
"docs_path": "docs",
"chunk_token_size": 1000,
"model": config_list[0]["model"],
"client": chromadb.PersistentClient(path="/tmp/chromadb"),
"collection_name": "groupchat",
"get_or_create": True,
},
code_execution_config=False
)
We are using the RetrieveUserProxyAgent
.
We assume that the dbt documentation is present in docs
.
For test purposes we placed the following default documentation in docs
:
version: 2
models:
- name: customers
description: This table has basic information about a customer, as well as some derived facts based on a customer's orders
columns:
- name: customer_id
description: This is a unique identifier for a customer
tests:
- unique
- not_null
- name: first_name
description: Customer's first name. PII.
- name: last_name
description: Customer's last name. PII.
- name: first_order
description: Date (UTC) of a customer's first order
- name: most_recent_order
description: Date (UTC) of a customer's most recent order
- name: number_of_orders
description: Count of the number of orders a customer has placed
- name: total_order_amount
description: Total value (AUD) of a customer's orders
- name: orders
description: This table has basic information about orders, as well as some derived facts based on payments
columns:
- name: order_id
tests:
- unique
- not_null
description: This is a unique identifier for an order
- name: customer_id
description: Foreign key to the customers table
tests:
- not_null
- relationships:
to: ref('customers')
field: customer_id
- name: order_date
description: Date (UTC) that the order was placed
- name: status
description: '{{ doc("orders_status") }}'
tests:
- accepted_values:
values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
- name: amount
description: Total amount (AUD) of the order
tests:
- not_null
- name: credit_card_amount
description: Amount of the order (AUD) paid for by credit card
tests:
- not_null
- name: coupon_amount
description: Amount of the order (AUD) paid for by coupon
tests:
- not_null
- name: bank_transfer_amount
description: Amount of the order (AUD) paid for by bank transfer
tests:
- not_null
- name: gift_card_amount
description: Amount of the order (AUD) paid for by gift card
tests:
- not_null
Data analyst BOT
Note
This bot has connection to our dbt documentation, retrieves information from it and provides other bots with it.
Before we define the bot we need to define and register a exec_sql
function to our chatbots.
For this we extend our llm_config
with the following:
{"config_list": config_list,
"timeout": 60,
"seed": 11,
"temperature": 0,
"functions": [
{
"name": "sql",
"description": "run a sql cell against a postgres database",
"parameters": {
"type": "object",
"properties": {
"cell": {
"type": "string",
"description": "Valid sql query cell to execute.",
}
},
"required": ["cell"], },
},]
}
The definition of the real exec_sql
depends on your set up.
The data analyst bot is given by:
# Data analyst bot
data_analyst = AssistantAgent(
name="data_analyst",
is_termination_msg=termination_msg,
system_message="You are a senior data analyst/data_engenieer. You only execute sql querys. Only use the functions you have been provided with. Reply `TERMINATE` in the end when everything is done.",
llm_config=llm_config,
human_input_mode="Never",
code_execution_config={"work_dir": "coding"},
)
To register the exce_sql function just add the following:
data_analyst.register_function(
function_map={
"sql": exec_sql,
}
)
Start the chat
Know we got everything we need and ask our chat bots a question.
Lets say we are interested in a potential costumer named John Mayer of the jaffel_shop
data defined above.
This triggers our chat:
## Define PROBLEM
PROBLEM = "How much did John Mayer order last month from the Jaffel shop?"
data_engenieer.initiate_chat(
manager,
problem=PROBLEM,
n_results=3,
)
Chat
Trying to create collection.
INFO:autogen.retrieve_utils:Found 1 chunks.
WARNING:chromadb.segment.impl.vector.local_persistent_hnsw:Number of requested results 3 is greater than number of elements in index 1, updating n_results = 1
doc_ids: [['doc_0']]
Adding doc_id doc_0 to context.
data_engenieer (to chat_manager):
You're a retrieve augmented chatbot. You answer user's questions based on your own knowledge and the
context provided by the user.
If you can't answer the question with or without the current context, you should reply exactly `UPDATE CONTEXT`.
You must give as short an answer as possible.
User's question is: How much did John Mayer order last month from the Jaffel shop?
Context is: version: 2
models:
- name: customers
description: This table has basic information about a customer, as well as some derived facts based on a customer's orders
columns:
- name: customer_id
description: This is a unique identifier for a customer
tests:
- unique
- not_null
- name: first_name
description: Customer's first name. PII.
- name: last_name
description: Customer's last name. PII.
- name: first_order
description: Date (UTC) of a customer's first order
- name: most_recent_order
description: Date (UTC) of a customer's most recent order
- name: number_of_orders
description: Count of the number of orders a customer has placed
- name: total_order_amount
description: Total value (AUD) of a customer's orders
- name: orders
description: This table has basic information about orders, as well as some derived facts based on payments
columns:
- name: order_id
tests:
- unique
- not_null
description: This is a unique identifier for an order
- name: customer_id
description: Foreign key to the customers table
tests:
- not_null
- relationships:
to: ref('customers')
field: customer_id
- name: order_date
description: Date (UTC) that the order was placed
- name: status
description: '{{ doc("orders_status") }}'
tests:
- accepted_values:
values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
- name: amount
description: Total amount (AUD) of the order
tests:
- not_null
- name: credit_card_amount
description: Amount of the order (AUD) paid for by credit card
tests:
- not_null
- name: coupon_amount
description: Amount of the order (AUD) paid for by coupon
tests:
- not_null
- name: bank_transfer_amount
description: Amount of the order (AUD) paid for by bank transfer
tests:
- not_null
- name: gift_card_amount
description: Amount of the order (AUD) paid for by gift card
tests:
- not_null
--------------------------------------------------------------------------------
>>>>>>>> USING AUTO REPLY...
data_engenieer (to chat_manager):
***** Suggested function Call: sql *****
Arguments:
{
"cell": "SELECT SUM(o.amount) as total_order_last_month FROM orders o JOIN customers c ON o.customer_id = c.customer_id WHERE c.first_name = 'John' AND c.last_name = 'Mayer' AND DATE_TRUNC('month', o.order_date) = DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')"
}
****************************************
--------------------------------------------------------------------------------
>>>>>>>> USING AUTO REPLY...
>>>>>>>> EXECUTING FUNCTION sql...
data_analyst (to chat_manager):
***** Response from calling function "sql" *****
1000
************************************************
--------------------------------------------------------------------------------
Please give feedback to chat_manager. Press enter to skip and use auto-reply, or type 'exit' to stop the conversation:
We see that the data_engenieer
bot retrieves the information,
the data_analyst
comes up with an sql query and executes the query against a database. The sql is valid and coherent with the given documentation.
Conclusion
We have seen the power of RAG and integrating code execution directly into a chatbot system.