AutoGen - Chat to dbt data

AutoGen - Chat to dbt data

Talk data to me

One dbt documentation, one sql-bot and one priest sit at a bar…

Imaging a world where you could just ask questions to your data warehouse by just asking them out loud.

> Hey data, how many users did we have last week?
> 1234
> Hey data, how much money did we earn last month?
> 100000€

and so on…

The idea

We got everything that we need with:

  • Your dbt documentation provides the chat bot with the information about your data warehouse
  • AutoGen provides the orchestration layer to connect different chatbots, your dbt documentation and your data warehouse

Note

This is not a novel or even new idea. Text-to-sql bots have been plenty around for a while now. (See, Better than a Data Analyst or chat-dbt )

The extension to the given is that via. AutoGen you can directly query the data in chat.

At the moment integrating chatbot capability into all possible tools is the big roadblock to use AI even more.

RAG RAG CITY, Bitch!

some_cool_rapper_stuff

All our dbt model definition and documentation is in .yml-files. Furthermore it is most likely not public information nor is any publicly available chat bot trained on this.

To put it in a nutshell our chatbot does not know from our data model.

Note

As this could be sensible information we might not want to share this with any chatbot API.

The chatbot might be trained on data that descries a similar named data model to ours, but one that vastly differs from our.

To ensure we use the correct documentation and protect sensible data encapsulating

Here comes the RAG (retrieval augmented generation) into the equation. RAG approaches have shown to be more factual, reliable and controllable in that regard. (See Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks )

Data engineer BOT

Note

This bot has connection to our dbt documentation, retrieves information from it and provides other bots with it.

The bot is not responsible for the actual data retrieval, writing sql querys or any other programming.

# Data engineer bot
data_engenieer = RetrieveUserProxyAgent(
    name="data_engenieer",
    is_termination_msg=termination_msg,
    system_message="SQL Assistant has extra content retrieval power for solving difficult problems concerning aviable data sources. You write sql querys. Only write querys containg tables and columns provided to you. Reply `TERMINATE` in the end when everything is done.",
    llm_config=llm_config,
    max_consecutive_auto_reply=3,
    human_input_mode="Never",
    retrieve_config={
        "task": "qa",
        "docs_path": "docs",
        "chunk_token_size": 1000,
        "model": config_list[0]["model"],
        "client": chromadb.PersistentClient(path="/tmp/chromadb"),
        "collection_name": "groupchat",
        "get_or_create": True,
    },
    code_execution_config=False
)

We are using the RetrieveUserProxyAgent . We assume that the dbt documentation is present in docs. For test purposes we placed the following default documentation in docs:

version: 2

models:
  - name: customers
    description: This table has basic information about a customer, as well as some derived facts based on a customer's orders

    columns:
      - name: customer_id
        description: This is a unique identifier for a customer
        tests:
          - unique
          - not_null

      - name: first_name
        description: Customer's first name. PII.

      - name: last_name
        description: Customer's last name. PII.

      - name: first_order
        description: Date (UTC) of a customer's first order

      - name: most_recent_order
        description: Date (UTC) of a customer's most recent order

      - name: number_of_orders
        description: Count of the number of orders a customer has placed

      - name: total_order_amount
        description: Total value (AUD) of a customer's orders

  - name: orders
    description: This table has basic information about orders, as well as some derived facts based on payments

    columns:
      - name: order_id
        tests:
          - unique
          - not_null
        description: This is a unique identifier for an order

      - name: customer_id
        description: Foreign key to the customers table
        tests:
          - not_null
          - relationships:
              to: ref('customers')
              field: customer_id

      - name: order_date
        description: Date (UTC) that the order was placed

      - name: status
        description: '{{ doc("orders_status") }}'
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']

      - name: amount
        description: Total amount (AUD) of the order
        tests:
          - not_null

      - name: credit_card_amount
        description: Amount of the order (AUD) paid for by credit card
        tests:
          - not_null

      - name: coupon_amount
        description: Amount of the order (AUD) paid for by coupon
        tests:
          - not_null

      - name: bank_transfer_amount
        description: Amount of the order (AUD) paid for by bank transfer
        tests:
          - not_null

      - name: gift_card_amount
        description: Amount of the order (AUD) paid for by gift card
        tests:
          - not_null

Data analyst BOT

Note

This bot has connection to our dbt documentation, retrieves information from it and provides other bots with it.

Before we define the bot we need to define and register a exec_sql function to our chatbots. For this we extend our llm_config with the following:

{"config_list": config_list,
                  "timeout": 60,
                     "seed": 11,
              "temperature": 0,

              "functions": [
            {
            "name": "sql",
            "description": "run a sql cell against a postgres database",
            "parameters": {
                "type": "object",
                "properties": {
                    "cell": {
                        "type": "string",
                        "description": "Valid sql query cell to execute.",
                    }
                },
                "required": ["cell"], },
            },]
              }

The definition of the real exec_sql depends on your set up.

The data analyst bot is given by:

# Data analyst bot
data_analyst = AssistantAgent(
    name="data_analyst",
    is_termination_msg=termination_msg,
    system_message="You are a senior data analyst/data_engenieer. You only execute sql querys. Only use the functions you have been provided with.   Reply `TERMINATE` in the end when everything is done.",
    llm_config=llm_config,
    human_input_mode="Never",
    code_execution_config={"work_dir": "coding"},
)

To register the exce_sql function just add the following:

    data_analyst.register_function(
            function_map={
                "sql": exec_sql,
            }
        )

Start the chat

Know we got everything we need and ask our chat bots a question. Lets say we are interested in a potential costumer named John Mayer of the jaffel_shop data defined above.

This triggers our chat:

## Define PROBLEM
PROBLEM = "How much did John Mayer order last month from the Jaffel shop?"

data_engenieer.initiate_chat(
    manager,
    problem=PROBLEM,
    n_results=3,
)

Chat

Trying to create collection.
INFO:autogen.retrieve_utils:Found 1 chunks.
WARNING:chromadb.segment.impl.vector.local_persistent_hnsw:Number of requested results 3 is greater than number of elements in index 1, updating n_results = 1
doc_ids:  [['doc_0']]
Adding doc_id doc_0 to context.
data_engenieer (to chat_manager):

You're a retrieve augmented chatbot. You answer user's questions based on your own knowledge and the
context provided by the user.
If you can't answer the question with or without the current context, you should reply exactly `UPDATE CONTEXT`.
You must give as short an answer as possible.

User's question is: How much did John Mayer order last month from the Jaffel shop?

Context is: version: 2

models:
  - name: customers
    description: This table has basic information about a customer, as well as some derived facts based on a customer's orders

    columns:
      - name: customer_id
        description: This is a unique identifier for a customer
        tests:
          - unique
          - not_null

      - name: first_name
        description: Customer's first name. PII.

      - name: last_name
        description: Customer's last name. PII.

      - name: first_order
        description: Date (UTC) of a customer's first order

      - name: most_recent_order
        description: Date (UTC) of a customer's most recent order

      - name: number_of_orders
        description: Count of the number of orders a customer has placed

      - name: total_order_amount
        description: Total value (AUD) of a customer's orders

  - name: orders
    description: This table has basic information about orders, as well as some derived facts based on payments

    columns:
      - name: order_id
        tests:
          - unique
          - not_null
        description: This is a unique identifier for an order

      - name: customer_id
        description: Foreign key to the customers table
        tests:
          - not_null
          - relationships:
              to: ref('customers')
              field: customer_id

      - name: order_date
        description: Date (UTC) that the order was placed

      - name: status
        description: '{{ doc("orders_status") }}'
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']

      - name: amount
        description: Total amount (AUD) of the order
        tests:
          - not_null

      - name: credit_card_amount
        description: Amount of the order (AUD) paid for by credit card
        tests:
          - not_null

      - name: coupon_amount
        description: Amount of the order (AUD) paid for by coupon
        tests:
          - not_null

      - name: bank_transfer_amount
        description: Amount of the order (AUD) paid for by bank transfer
        tests:
          - not_null

      - name: gift_card_amount
        description: Amount of the order (AUD) paid for by gift card
        tests:
          - not_null




--------------------------------------------------------------------------------

>>>>>>>> USING AUTO REPLY...
data_engenieer (to chat_manager):

***** Suggested function Call: sql *****
Arguments:
{
"cell": "SELECT SUM(o.amount) as total_order_last_month FROM orders o JOIN customers c ON o.customer_id = c.customer_id WHERE c.first_name = 'John' AND c.last_name = 'Mayer' AND DATE_TRUNC('month', o.order_date) = DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')"
}
****************************************

--------------------------------------------------------------------------------

>>>>>>>> USING AUTO REPLY...

>>>>>>>> EXECUTING FUNCTION sql...
data_analyst (to chat_manager):

***** Response from calling function "sql" *****
1000
************************************************

--------------------------------------------------------------------------------

Please give feedback to chat_manager. Press enter to skip and use auto-reply, or type 'exit' to stop the conversation:

We see that the data_engenieer bot retrieves the information, the data_analyst comes up with an sql query and executes the query against a database. The sql is valid and coherent with the given documentation.

Conclusion

We have seen the power of RAG and integrating code execution directly into a chatbot system.

Related Posts

AutoGen - First Contact

AutoGen - First Contact

What is AutoGen? AutoGen is a framework designed for developing LLM applications that utilize multiple conversable agents to solve tasks, with features allowing seamless human interaction.

Read More