Combining text-to-SQL structured analysis and semantic search for a vacation plan
Summer is here! Schools are out. Time to plan for a family vacation!
LlamaIndex recently released a new feature to combine text-to-SQL with semantic search for Retrieval Augmented Generation. In this article, let's explore this new feature with our vacation plan use case.
First, let's look at some basic concepts.
Structured data refers to data that can be organized into a formatted repository, typically a database. Such data can be stored in database tables with rows and columns. It has relational keys and can easily be mapped into pre-designed fields. Typical structured data stores include relational databases such as PostgreSQL, the Snowflake data warehouse, etc.
Unstructured data refers to data that is not organized in a predefined manner or does not have a predefined data model. Thus, it is not a good fit for a mainstream relational database. Unstructured data is increasingly prevalent in IT systems and is used by organizations in various business intelligence and analytics applications. Examples include YouTube videos, audio files, Word, PDF, and text files, and media logs.
Text-to-SQL is a task in natural language processing (NLP) where the goal is to automatically generate SQL queries from natural language text. The task involves converting the text input into a structured representation and then using this representation to generate a semantically correct SQL query that can be executed on a database.
Now, let's get into how we can use our vacation plan as a use case for experimenting with LlamaIndex's new feature of combining text-to-SQL structured analysis with semantic search to query both structured and unstructured data.
Assume we have three national parks on our vacation plan list: Glacier National Park, Yellowstone National Park, and Rocky Mountain National Park. We have gathered basic data such as the average June temperature and the highest elevation point for these parks in a database table. This is our structured data.
We have also identified three corresponding YouTube videos on what to do in these national parks. This is our unstructured data. Note this is a demo app only. In reality, you can do a vacation plan via ChatGPT, but for demo purposes, we only want advice on things to do in these national parks from these three YouTube videos.
We would like to be able to ask questions against the structured data only, such as "Which national park has the highest elevation point?".
We would also like to ask questions about our unstructured YouTube video data only, such as "What are the things to do in Yellowstone National Park?".
Most of all, we would like to ask questions such as "Tell me the things to do in the highest elevation point national park" or "Tell me the lakes in the national park with the highest elevation point". Such questions touch both our structured data in the database table (highest elevation point) and our unstructured data in our chosen YouTube videos (things to do or information related to the lakes).
Let's explore how LlamaIndex handles all three types of queries in one single query engine, SQLAutoVectorQueryEngine.
Handling all of the above three types of queries in one query engine is made possible by LlamaIndex's SQLAutoVectorQueryEngine. Let's peek under the hood to see how exactly this SQLAutoVectorQueryEngine works its magic.
SQLAutoVectorQueryEngine is a powerful engine that allows you to combine insights from your structured tables with your unstructured data. It first calls a selector, which decides whether to query your structured tables for insights. Once it does, it can infer a corresponding query to the vector store to fetch the relevant documents.
This engine can leverage a SQL database and a vector store to fulfill complex natural language queries over structured and unstructured data. In addition, it is smart enough to handle single-engine queries against either the SQL database or the vector store, providing a comprehensive solution for users.
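To make that flow concrete, here is a rough, simplified sketch of the decision logic described above. It is an illustration only, not the engine's actual source code; the selector, transform, and combine arguments are hypothetical placeholders for the engine's internal components.
# Conceptual sketch only: a simplified illustration of the flow described above,
# not the actual SQLAutoVectorQueryEngine implementation. All arguments are
# hypothetical placeholders for the engine's internal components.
def answer(question, selector, sql_engine, vector_engine, transform, combine):
    # the LLM-based selector picks which engine to hit first
    choice = selector.select(["sql engine", "vector engine"], question)
    if choice == "sql engine":
        sql_response = sql_engine.query(question)         # text-to-SQL over structured data
        follow_up = transform(question, sql_response)     # optional follow-up question
        if follow_up is None:
            return sql_response                           # the SQL response alone suffices
        vector_response = vector_engine.query(follow_up)  # semantic search over documents
        return combine(sql_response, vector_response)     # synthesize the final answer
    return vector_engine.query(question)                  # purely semantic question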
Let's take a closer look at each of these three scenarios through some sample questions and their corresponding high-level diagrams, which peek into the inner workings of SQLAutoVectorQueryEngine.
Scenario 1: queries against structured data only
Sample question: Which national park has the highest elevation point?
Scenario 2: queries against unstructured data only
Sample question: What are the things to do in Yellowstone National Park?
Scenario 3: queries against both structured and unstructured data
Sample question: Tell me the lakes in the national park with the highest elevation point
Studying LlamaIndex's official sample notebook on SQLAutoVectorQueryEngine, we capture the detailed implementation steps in the diagram below:
Following the implementation steps above, let's dive into the detailed implementation of our vacation plan app.
Define LLM and service_context
Let's first define our LLM and service_context. Notice we use gpt-4 in this demo. More on that later in the "A Few Key Observations" section.
# imports (assuming LlamaIndex 0.6.x-era APIs and LangChain's ChatOpenAI)
from langchain.chat_models import ChatOpenAI
from llama_index import LLMPredictor, ServiceContext

chunk_size = 1024
llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0, model_name="gpt-4", streaming=True))
service_context = ServiceContext.from_defaults(chunk_size=chunk_size, llm_predictor=llm_predictor)
Load structured data into the SQL database
First, let's create our SQL database table for national parks. For demo and simplicity purposes, we're using SQLite. In reality, you are welcome to use whichever relational database you prefer. NoSQL databases are not yet supported as of this writing.
We create a table named national_parks, with the following columns:
park_name
average_june_temperature
elevation_highest_point
We then use SQLAlchemy, the Python SQL toolkit and Object Relational Mapper, to insert some data into the national_parks table. In reality, this can be pointed at any existing SQL database whose structured data you want queried.
from sqlalchemy import create_engine, MetaData, Table, Column, String

engine = create_engine("sqlite:///national_parks.db", future=True)
metadata_obj = MetaData()
metadata_obj.drop_all(engine)

# create national_parks table
table_name = "national_parks"
national_parks_table = Table(
    table_name,
    metadata_obj,
    Column("park_name", String(50), primary_key=True),
    Column("average_june_temperature", String(20)),
    Column("elevation_highest_point", String(20))
)
metadata_obj.create_all(engine)

# print tables
metadata_obj.tables.keys()

# insert data
from sqlalchemy import insert
rows = [
    {"park_name": "Glacier National Park", "average_june_temperature": "60-70", "elevation_highest_point": "10,000"},
    {"park_name": "Yellowstone National Park", "average_june_temperature": "60-75", "elevation_highest_point": "11,000"},
    {"park_name": "Rocky Mountain National Park", "average_june_temperature": "60-75", "elevation_highest_point": "14,000"},
]
for row in rows:
    stmt = insert(national_parks_table).values(**row)
    with engine.connect() as connection:
        cursor = connection.execute(stmt)
        connection.commit()

with engine.connect() as connection:
    cursor = connection.exec_driver_sql("SELECT * FROM national_parks")
    print(cursor.fetchall())
Load unstructured data from YouTube into Chroma
Next, let's load our chosen YouTube videos' data into our vector database, Chroma. We're experimenting with Chroma in this demo due to some of the limitations of the Pinecone starter plan, such as only one index being allowed. Chroma is an AI-native open-source embedding database. It is very intuitive to use. I highly recommend it.
import chromadb

chroma_client = chromadb.Client()
chroma_collection = chroma_client.create_collection("national-parks-things-to-do")
Now, let's load our three YouTube videos into youtube_documents using YoutubeTranscriptReader from llama_hub.
from llama_hub.youtube_transcript.base import YoutubeTranscriptReader

# loads data on things to do in these 3 national parks from YouTube videos
youtube_loader = YoutubeTranscriptReader()
youtube_documents = youtube_loader.load_data(ytlinks=['https://www.youtube.com/watch?v=poBfOPFGgUU',
                                                      'https://www.youtube.com/watch?v=KLWv8TsKcGc',
                                                      'https://www.youtube.com/watch?v=UV4tENBS0mQ'])
Build the SQL index for structured data
Based on our SQLite database table national_parks, we construct a SQLDatabase, then pass it along with the table name national_parks into SQLStructStoreIndex to build our SQL index for structured data. See the snippet below:
# build sql index
sql_database = SQLDatabase(engine, include_tables=["national_parks"])
sql_index = SQLStructStoreIndex.from_documents(
    [],
    sql_database=sql_database,
    table_name="national_parks",
)
Build the vector index for unstructured data
We first build an empty vector_index from the Chroma vector store, whose collection points to national-parks-things-to-do, which we created above. We then loop through our youtube_documents and insert them into vector_index. Each document has the park's metadata attached.
chunk_size = 1024
text_splitter = TokenTextSplitter(chunk_size=chunk_size)
node_parser = SimpleNodeParser(text_splitter=text_splitter)

# build vector index
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
vector_index = VectorStoreIndex([], storage_context=storage_context)

# insert documents into vector index; each document has metadata of the park attached
parks = ['Glacier National Park', 'Yellowstone National Park', 'Rocky Mountain National Park']
for park, youtube_document in zip(parks, youtube_documents):
    nodes = node_parser.get_nodes_from_documents([youtube_document])
    # add metadata to each node
    for node in nodes:
        node.extra_info = {"title": park}
    vector_index.insert_nodes(nodes)
Create the SQL query engine from the SQL index
Now, let's create a SQL query engine from the sql_index we created above:
sql_query_engine = sql_index.as_query_engine(synthesize_response=True)
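If you want to sanity-check this engine on its own before wiring up the combined engine, a minimal sketch (assuming the national_parks table above) looks like this:
# quick standalone test of the SQL query engine
response = sql_query_engine.query("Which national park has the highest elevation point?")
print(response)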
Create the vector query engine from the vector index
Similarly, let's create our vector query engine from the vector_index. We construct a VectorStoreInfo to hold the metadata, specifically title. We then pass both vector_index and vector_store_info to construct a VectorIndexAutoRetriever, which uses an LLM to set vector store query parameters automatically. From there, we call RetrieverQueryEngine to construct our retriever_query_engine.
from llama_index.indices.vector_store.retrievers import VectorIndexAutoRetriever
from llama_index.vector_stores.types import MetadataInfo, VectorStoreInfo
from llama_index.query_engine.retriever_query_engine import RetrieverQueryEngine

vector_store_info = VectorStoreInfo(
    content_info='things to do in different national parks',
    metadata_info=[
        MetadataInfo(
            name='title',
            type='str',
            description='The name of the national park'),
    ]
)
vector_auto_retriever = VectorIndexAutoRetriever(vector_index, vector_store_info=vector_store_info)

retriever_query_engine = RetrieverQueryEngine.from_args(
    vector_auto_retriever, service_context=service_context
)
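As with the SQL engine, you can give the vector query engine a quick standalone spin; this is a minimal sketch assuming the YouTube transcripts loaded above:
# quick standalone test of the vector query engine
response = retriever_query_engine.query("What are the things to do in Yellowstone National Park?")
print(response)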
Create the SQL query tool
Moving on, let's create our SQL query tool, which is required for constructing SQLAutoVectorQueryEngine. We pass in our sql_query_engine and a description:
sql_tool = QueryEngineTool.from_defaults(
    query_engine=sql_query_engine,
    description=(
        'Useful for translating a natural language query into a SQL query over a table containing: '
        'national_parks, containing the average_june_temperature/elevation_highest_point of each national park'
    )
)
Create the vector query tool
We also need to create our vector query tool by passing in the retriever_query_engine and a description:
vector_tool = QueryEngineTool.from_defaults(
    query_engine=retriever_query_engine,
    description=f'Useful for answering semantic questions about different national parks',
)
Define SQLAutoVectorQueryEngine
Finally, all our ducks are in a row, and we're ready to define our SQLAutoVectorQueryEngine:
query_engine = SQLAutoVectorQueryEngine(
    sql_tool,
    vector_tool,
    service_context=service_context
)
The complete code for our vacation plan app can be found in my GitHub repo. Time to spin it up and take a test drive. Let's ask three questions corresponding to the three scenarios for querying structured, unstructured, and a combination of structured and unstructured data. Let's see how SQLAutoVectorQueryEngine performs.
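For reference, each of the three scenarios below is just a call to query_engine.query(); a minimal sketch, assuming the query_engine defined above:
# Scenario 1: structured data only
print(query_engine.query("Which national park has the highest elevation point?"))
# Scenario 2: unstructured data only
print(query_engine.query("What are the things to do in Yellowstone National Park?"))
# Scenario 3: both structured and unstructured data
print(query_engine.query("Tell me the lakes in the national park with the highest elevation point"))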
Structured data query
This query goes solely against the SQL query engine. Let's examine how its text-to-SQL translation went by looking at the console output:
Here are a few things to point out:
- Our question in natural language, "which national park has the highest elevation point?", got translated into the SQL "SELECT park_name, elevation_highest_point FROM national_parks ORDER BY elevation_highest_point DESC". See the SQL statement highlighted in yellow in the screenshot above.
- The SQL response, also highlighted in yellow right after the SQL statement, accurately presented the answer based on the data in our database table: "Rocky Mountain National Park has the highest elevation point of 14,000 feet".
- Query transformation has nothing more to transform, since the SQL response alone answers the question and there is no need to hit the vector query engine.
Unstructured data query
Very nice answer! Love the bullet points! Let's examine its console output in the screenshot below:
Key points to mention:
- The first line, "Querying other query engine: Useful for answering semantic questions about different national parks", tells us that SQLAutoVectorQueryEngine is hitting our vector query engine, not the SQL query engine as in the structured data query scenario above.
- The auto_retriever module (VectorIndexAutoRetriever) is called with the metadata filter title set to "Rocky Mountain National Park", and it uses top_k retrieval to pass the query parameters to the vector query engine.
- The vector query engine response is formatted in bullet points, as highlighted in purple in the screenshot above.
Both structured and unstructured data query
This question asks about both the structured data for the national park with "the highest elevation point" and the unstructured data about "the lakes" from the YouTube video. The answer is excellent.
Let's look at the details of the console output. A few things to notice:
- The selector started by hitting the SQL query engine first; see line 1 in the screenshot below starting with SQL Query.
- Text-to-SQL translation went from the natural language "the national park with the highest elevation point" to the SQL "SELECT park_name, elevation_highest_point FROM national_parks WHERE elevation_highest_point = (SELECT MAX(elevation_highest_point) FROM national_parks);". Well done!
- Query transformation (blue line) comes up with a more detailed question based on the SQL response of Rocky Mountain National Park having the highest elevation point of 14,000 feet: "what are the names of the lakes in Rocky Mountain National Park?"
- The auto_retriever module is called to filter by the metadata title and use top_k retrieval to pass the query parameters to the vector query engine.
Excellent! All three scenarios are handled beautifully.
Even though our code has been implemented and tested, we still have a few key questions and observations. Let's continue our deep dive.
Observation #1: How does the query engine selector determine whether to call the SQL query engine or the vector query engine first?
I explored the SQLAutoVectorQueryEngine source code to try to find the logic where the selector decides whether it needs to first hit the SQL query engine or the vector query engine for a given query. From analyzing the code, I saw it ultimately boils down to the logic in the class LLMSingleSelector.
The decision-making process in LLMSingleSelector is primarily driven by the LLM-based predictor. The LLM predictor considers the prompt, the number of choices, the context (the choices text), and the query input to generate a prediction. The prediction from the LLM model is then parsed using the output parser defined in the prompt to determine the selected choice.
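As a rough illustration of that decision step (not the library's actual source), the selection boils down to something like the following, where llm_predict stands in for the underlying LLM call and the choices are simply the descriptions we give our two query tools:
# Rough illustration of LLMSingleSelector's decision logic (not the actual source).
# llm_predict is a hypothetical placeholder for a call to the underlying LLM.
def select_engine(llm_predict, query, choices):
    numbered = "\n".join(f"({i + 1}) {c}" for i, c in enumerate(choices))
    prompt = (
        "Some choices are given below:\n"
        f"{numbered}\n"
        "Using only these choices, return the number of the choice "
        f"most relevant to the question: {query}\n"
    )
    answer = llm_predict(prompt)    # e.g. "1"
    return int(answer.strip()) - 1  # parsed into a zero-based index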
I reached out to Jerry Liu, co-founder and CEO of LlamaIndex, and was happy to get his confirmation that the selector does indeed rely on the LLM to determine whether to call the SQL query engine or the vector query engine first based on a query.
Observation #2: Limitation of gpt-3.5-turbo in handling queries using the joint query engines
While experimenting with our vacation plan app, I noticed gpt-3.5-turbo doesn't always give accurate answers to questions meant for the joint query engines. For example, when I asked questions such as "Tell me the lakes in the national park with the highest elevation point," I got the response "The lakes in national parks with the highest elevation points are Lake Tahoe in Nevada, with an elevation of 10,881 feet, and Crater Lake in Oregon, with an elevation of 8,929 feet," not the answer I was looking for at all. See the screenshot below:
The console output, see the screenshot below, reveals the limitation of this model in handling such joint query engines:
Notice the SQL query itself was constructed incorrectly to begin with. Why did gpt-3.5-turbo construct the WHERE clause with park_name LIKE '%lake%' in its text-to-SQL interpretation? It doesn't make sense. Also, "Transformed query given SQL response: None" is not what we're looking for either. We're looking for the transformer to develop a more detailed question based on the SQL response, so it can query the vector query engine to get the final answer. In this experiment, we see that the text-to-SQL did not translate well, resulting in the SQL response being inaccurate and the transformer malfunctioning.
As we saw in the section above, "Both structured and unstructured data query," gpt-4 handles this same question beautifully and returns the correct answer.
Observation #3: Limitation of both gpt-3.5-turbo and gpt-4 in handling range queries with the joint query engines
When it comes to range queries, such as "Tell me the lakes in the national park with the lowest average June temperature," neither gpt-3.5-turbo nor gpt-4 gave a satisfactory answer.
gpt-3.5-turbo answered "The lakes in the national parks with the lowest average June temperature are Glacier National Park (60–70 degrees), Yellowstone National Park (60–75 degrees), and Rocky Mountain National Park (60–75 degrees)." Not the answer I was looking for. See the screenshot below:
Its console output, see the screenshot below, proves that the SQL statement (highlighted in yellow) is incorrect. It retrieves all national parks ordered by average_june_temperature ascending, whereas I was asking for the national park with the lowest average June temperature, not all three national parks. The intermediate step of query transformation was then thrown off because the SQL statement was incorrect.
It's worth noting that gpt-3.5-turbo's limitation in handling range queries is only observed when it needs to call the joint query engines. It does, however, handle range queries in the single SQL query engine correctly, as you can see in the screenshot below for the simple query "which national park has the lowest average June temperature?":
The console output shows the correct SQL statement "SELECT park_name, average_june_temperature FROM national_parks ORDER BY average_june_temperature ASC LIMIT 1;". Notice "LIMIT 1," which narrows the result down to the one national park I was looking for.
gpt-4 has similar limitations in handling range queries against the joint query engines. Here is the answer given by gpt-4 for the same question "Tell me the lakes in the national park with the lowest average June temperature":
Yikes! That's not the answer I was looking for. Let's see where it went wrong by analyzing its console output:
Similar to gpt-3.5-turbo, the culprit lies in the incorrect SQL statement, highlighted in yellow in the screenshot above. I don't need all national parks returned; I'm only looking for the one with the lowest average June temperature. Even though the intermediate step of query transformation was hard at work, since the wrong SQL response was fed into it, it produced the wrong answer.
Now, let's see what gpt-4 does with a simple query against the single SQL query engine, "which national park has the lowest average June temperature?":
Its answer is spot on. Verifying its console output, we see that the correct SQL statement with "LIMIT 1" was constructed. This behavior of gpt-4 mimics gpt-3.5-turbo.
So, we can conclude that both gpt-3.5-turbo and gpt-4 handle range queries correctly when using the single SQL query engine, but when given range queries that need to call both the SQL query engine and the vector query engine, neither model does a good job.
Observation #4: Importance of prompt engineering when calling the joint query engines
I noticed that when querying both the SQL query engine and the vector query engine, I had to be careful about wording the prompt. I first tried the question "what are the things to do in the national park which has the highest elevation point?", and I got an error.
The console showed the following detailed error message:
As you can see, it threw "chromadb.errors.NoDatapointsException: No datapoints found for the supplied filter {"title": "highest elevation point"}". It looks like the engine somehow mixed up the title with the wrong phrase from the prompt.
I then reworded my prompt to "Tell me the things to do in the highest elevation point national park," and I got the following correct response:
Console output detailing the steps SQLAutoVectorQueryEngine used to arrive at the answer:
Observation #5: Metadata title is case sensitive
I also noticed that the title in the metadata info is case-sensitive. See the screenshots below: when asked "What are the things to do in Glacier National park?", notice "park" is in lower case. We get the NoDatapointsException error, and you can see the console output in the screenshot below. When I corrected "park" to uppercase, I got the expected answer.
Our national parks vacation plan demo app was only meant to experiment with and explore LlamaIndex's SQLAutoVectorQueryEngine, together with OpenAI, which can handle structured analysis and semantic search combined, as well as structured analysis or semantic search individually.
In reality, there could be numerous use cases where your organization needs to query data from databases and augment the query result with unstructured data, be it research papers, an internal knowledge base, or multimodal data such as video, audio, or image files. Whatever your use cases may be, I encourage you to give SQLAutoVectorQueryEngine a try!
We explored LlamaIndex's SQLAutoVectorQueryEngine in this article. We started with some basic concepts, then introduced our national park vacation plan use case, where we needed to do structured analysis via a SQL database, semantic search over three YouTube videos, and a combination of queries over structured and unstructured data.
We dived into the detailed implementation of how to use SQLAutoVectorQueryEngine to address our three-scenario requirements. We tested our implementation successfully. We also noted a few key observations for a deeper understanding of this great query engine.
Overall, this vacation plan use case is simply meant to demonstrate the power of LlamaIndex's SQLAutoVectorQueryEngine, which is designed to handle structured analysis, semantic search, and a combination of both! Numerous use cases can spring from the features offered by SQLAutoVectorQueryEngine. I encourage you to explore this powerful engine and experience what amazing work it can do!
The complete source code for this article can be found in my GitHub repo.
Happy coding!