Combining text-to-SQL structured analytics and semantic search for a vacation plan
Summer is here! Schools are out. Time to plan a family vacation!
LlamaIndex recently released a new feature that combines text-to-SQL with semantic search for Retrieval Augmented Generation. In this article, let's explore this new feature with our vacation plan use case.
First, let's look at some basic concepts.
Structured data refers to data that can be organized into a formatted repository, typically a database. Such data can be stored in database tables with rows and columns. It has relational keys and can easily be mapped into pre-designed fields. Typical structured data stores include relational databases such as PostgreSQL, the Snowflake data warehouse, etc.
Unstructured data refers to data that is not organized in a predefined manner or does not have a predefined data model, so it is not a good fit for a mainstream relational database. Unstructured data is increasingly prevalent in IT systems and is used by organizations in various business intelligence and analytics applications. Examples include YouTube videos, audio files, Word documents, PDFs, text files, and media logs.
Text-to-SQL is a task in natural language processing (NLP) where the goal is to automatically generate SQL queries from natural language text. The task involves converting the text input into a structured representation and then using that representation to generate a semantically correct SQL query that can be executed against a database.
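For instance, a text-to-SQL system is expected to produce a mapping like the one sketched below (purely illustrative, previewing the national_parks table we build later in this article):

# purely illustrative: the kind of mapping a text-to-SQL system is expected to produce
natural_language_question = "Which national park has the highest elevation point?"
generated_sql = (
    "SELECT park_name FROM national_parks "
    "ORDER BY elevation_highest_point DESC LIMIT 1;"
)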
Now, let's get into how we can use our vacation plan as a use case for experimenting with LlamaIndex's new feature of combining text-to-SQL structured analytics with semantic search to query both structured and unstructured data.
Assume we have three national parks on our vacation plan list: Glacier National Park, Yellowstone National Park, and Rocky Mountain National Park. We have gathered basic data such as the average June temperature and the highest elevation point for these parks in a database table. This is our structured data.
We have also identified three corresponding YouTube videos on what to do in these national parks. This is our unstructured data. Note this is a demo app only. In reality, you could plan a vacation via ChatGPT, but for demo purposes, we only want advice on things to do in these national parks from these three YouTube videos.
We would like to be able to ask questions against the structured data only, such as "Which national park has the highest elevation point?".
We would also like to ask questions against our unstructured YouTube video data only, such as "What are the things to do in Yellowstone National Park?".
Most of all, we'd like to ask questions such as "Tell me the things to do in the highest elevation point national park" or "Tell me the lakes in the national park with the highest elevation point". Such questions touch both our structured data in the database table (highest elevation point) and our unstructured data in the chosen YouTube videos (things to do or information related to the lakes).
Let's explore how LlamaIndex handles all three types of queries in one single query engine, SQLAutoVectorQueryEngine.
Handling all three types of queries above in one query engine is made possible by LlamaIndex's SQLAutoVectorQueryEngine. Let's peek under the hood to see exactly how SQLAutoVectorQueryEngine works its magic.
SQLAutoVectorQueryEngine is a powerful engine that allows you to combine insights from your structured tables with your unstructured data. It first calls a selector, which decides whether to query your structured tables for insights. Once it does, it can infer a corresponding query to the vector store to fetch the relevant documents.
This engine can leverage a SQL database and a vector store to fulfill complex natural language queries over structured and unstructured data. In addition, it is smart enough to handle single-engine queries against either the SQL database or the vector store, providing a comprehensive solution for users.
Let's take a closer look at each of these three scenarios through some sample questions and their corresponding high-level diagrams, which peek into the inner workings of SQLAutoVectorQueryEngine.
Scenario 1: queries against structured data only
Sample question: Which national park has the highest elevation point?
Scenario 2: queries against unstructured data only
Sample question: What are the things to do in Yellowstone National Park?
Scenario 3: queries against both structured and unstructured data
Sample question: Tell me the lakes in the national park with the highest elevation point.
Studying LlamaIndex's official sample notebook on SQLAutoVectorQueryEngine, we capture the detailed implementation steps in the diagram below:
Following the implementation steps above, let's dive into the detailed implementation of our vacation plan app.
Define LLM and service_context
Let's first define our LLM and service_context. Notice we use gpt-4 in this demo. More on that later in the "A Few Key Observations" section.
from langchain.chat_models import ChatOpenAI
from llama_index import LLMPredictor, ServiceContext

chunk_size = 1024
llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0, model_name="gpt-4", streaming=True))
service_context = ServiceContext.from_defaults(chunk_size=chunk_size, llm_predictor=llm_predictor)
Load structured data into a SQL database
First, let's create our SQL database table for the national parks. For demo and simplicity purposes, we are using SQLite. In reality, you are welcome to use whichever relational database you prefer. NoSQL databases are not yet supported as of this writing.
We create a table named national_parks, with the following columns:
- park_name
- average_june_temperature
- elevation_highest_point
We then use SQLAlchemy, the Python SQL toolkit and Object Relational Mapper, to insert some data into the national_parks table. In reality, this could point to any existing SQL database whose structured data you want queried.
from sqlalchemy import create_engine, MetaData, Table, Column, String

engine = create_engine("sqlite:///national_parks.db", future=True)
metadata_obj = MetaData()
metadata_obj.drop_all(engine)

# create national_parks table
table_name = "national_parks"
national_parks_table = Table(
    table_name,
    metadata_obj,
    Column("park_name", String(50), primary_key=True),
    Column("average_june_temperature", String(20)),
    Column("elevation_highest_point", String(20))
)
metadata_obj.create_all(engine)

# print tables
metadata_obj.tables.keys()
# insert data
from sqlalchemy import insert

rows = [
    {"park_name": "Glacier National Park", "average_june_temperature": "60-70", "elevation_highest_point": "10,000"},
    {"park_name": "Yellowstone National Park", "average_june_temperature": "60-75", "elevation_highest_point": "11,000"},
    {"park_name": "Rocky Mountain National Park", "average_june_temperature": "60-75", "elevation_highest_point": "14,000"},
]
for row in rows:
    stmt = insert(national_parks_table).values(**row)
    with engine.connect() as connection:
        cursor = connection.execute(stmt)
        connection.commit()

with engine.connect() as connection:
    cursor = connection.exec_driver_sql("SELECT * FROM national_parks")
    print(cursor.fetchall())
Load unstructured data from YouTube into Chroma
Next, let's load our chosen YouTube videos' data into our vector database, Chroma. We are experimenting with Chroma in this demo because of some limitations of the Pinecone starter plan, such as only one index being allowed. Chroma is an AI-native open-source embedding database. It is very intuitive to use, and I highly recommend it.
import chromadb

chroma_client = chromadb.Client()
chroma_collection = chroma_client.create_collection("national-parks-things-to-do")
Now, let's load our three YouTube videos into youtube_documents using YoutubeTranscriptReader from llama_hub.
from llama_hub.youtube_transcript.base import YoutubeTranscriptReader

# loads data on things to do in these 3 national parks from the YouTube videos
youtube_loader = YoutubeTranscriptReader()
youtube_documents = youtube_loader.load_data(ytlinks=['https://www.youtube.com/watch?v=poBfOPFGgUU',
                                                      'https://www.youtube.com/watch?v=KLWv8TsKcGc',
                                                      'https://www.youtube.com/watch?v=UV4tENBS0mQ'])
Build SQL index for structured data
Based on our SQLite database table national_parks, we construct a SQLDatabase and pass it, along with the table name national_parks, into SQLStructStoreIndex to construct our SQL index for the structured data. See the snippet below:
from llama_index import SQLDatabase
from llama_index.indices.struct_store import SQLStructStoreIndex  # import paths may vary by llama_index version

# build sql index
sql_database = SQLDatabase(engine, include_tables=["national_parks"])
sql_index = SQLStructStoreIndex.from_documents(
    [],
    sql_database=sql_database,
    table_name="national_parks",
)
Build vector index for unstructured data
We first build an empty vector_index from the Chroma vector store, whose collection points to national-parks-things-to-do, which we created above. We then loop through our youtube_documents and insert them into vector_index. Each document has the metadata of its park attached.
from llama_index import StorageContext, VectorStoreIndex
from llama_index.node_parser import SimpleNodeParser
from llama_index.text_splitter import TokenTextSplitter  # import paths may vary by llama_index version
from llama_index.vector_stores import ChromaVectorStore

chunk_size = 1024
text_splitter = TokenTextSplitter(chunk_size=chunk_size)
node_parser = SimpleNodeParser(text_splitter=text_splitter)

# build vector index
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
vector_index = VectorStoreIndex([], storage_context=storage_context)

# insert documents into the vector index; each document has the metadata of its park attached
parks = ['Glacier National Park', 'Yellowstone National Park', 'Rocky Mountain National Park']

for park, youtube_document in zip(parks, youtube_documents):
    nodes = node_parser.get_nodes_from_documents([youtube_document])
    # add metadata to each node
    for node in nodes:
        node.extra_info = {"title": park}
    vector_index.insert_nodes(nodes)
Create SQL query engine from SQL index
Now, let's create a SQL query engine from the sql_index we created above:
sql_query_engine = sql_index.as_query_engine(synthesize_response=True)
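As a quick sanity check, we can query this engine on its own. The snippet below is a minimal usage sketch, assuming the standard query() interface:

# minimal usage sketch: ask the SQL engine a structured-data-only question
response = sql_query_engine.query("Which national park has the highest elevation point?")
print(response)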
Create vector query engine from vector index
Similarly, let's create our vector query engine from the vector_index. We construct a VectorStoreInfo to hold the metadata, specifically title. We then pass both vector_index and vector_store_info in to construct a VectorIndexAutoRetriever, which uses an LLM to set vector store query parameters automatically. From there, we call RetrieverQueryEngine to construct our retriever_query_engine.
from llama_index.indices.vector_store.retrievers import VectorIndexAutoRetriever
from llama_index.vector_stores.types import MetadataInfo, VectorStoreInfo
from llama_index.query_engine.retriever_query_engine import RetrieverQueryEngine

vector_store_info = VectorStoreInfo(
    content_info='things to do in different national parks',
    metadata_info=[
        MetadataInfo(
            name='title',
            type='str',
            description='The name of the national park'),
    ]
)
vector_auto_retriever = VectorIndexAutoRetriever(vector_index, vector_store_info=vector_store_info)

retriever_query_engine = RetrieverQueryEngine.from_args(
    vector_auto_retriever, service_context=service_context
)
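This engine can also be exercised by itself. Below is a minimal usage sketch (not from the original notebook), using the Scenario 2 style question:

# minimal usage sketch: ask the vector engine a semantic question on its own
response = retriever_query_engine.query("What are the things to do in Yellowstone National Park?")
print(response)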
Create SQL query tool
Moving on, let's create our SQL query tool, which is required for constructing the SQLAutoVectorQueryEngine. We pass in our sql_query_engine and a description:
from llama_index.tools import QueryEngineTool

sql_tool = QueryEngineTool.from_defaults(
    query_engine=sql_query_engine,
    description=(
        'Useful for translating a natural language query into a SQL query over a table containing: '
        'national_parks, containing the average_june_temperature/elevation_highest_point of each national park'
    )
)
Create vector query tool
We also need to create our vector query tool by passing in the retriever_query_engine and a description:
vector_tool = QueryEngineTool.from_defaults(
    query_engine=retriever_query_engine,
    description='Useful for answering semantic questions about different national parks',
)
Define SQLAutoVectorQueryEngine
Finally, all our ducks are in a row, and we are ready to define our SQLAutoVectorQueryEngine:
from llama_index.query_engine import SQLAutoVectorQueryEngine

query_engine = SQLAutoVectorQueryEngine(
    sql_tool,
    vector_tool,
    service_context=service_context
)
The complete code for our vacation plan app can be found in my GitHub repo. Time to spin it up and take a test drive. Let's ask three questions corresponding to the three scenarios for querying structured, unstructured, and a combination of structured and unstructured data, and see how SQLAutoVectorQueryEngine performs. The sketch below shows how those queries are issued.
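The following is a minimal sketch (assumed usage based on the standard query() interface, not code from the original notebook) of running the three test questions against the combined engine:

# minimal usage sketch: run the three scenario questions through the combined engine
questions = [
    "Which national park has the highest elevation point?",                      # structured data only
    "What are the things to do in Yellowstone National Park?",                   # unstructured data only
    "Tell me the lakes in the national park with the highest elevation point",   # both
]
for question in questions:
    response = query_engine.query(question)
    print(f"Q: {question}\nA: {response}\n")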
Structured data query
This query goes solely against the SQL query engine. Let's examine how its text-to-SQL translation went by looking at the console output:
Here are a few things to point out:
- Our question in natural language, "which national park has the highest elevation point?", got translated into the SQL "SELECT park_name, elevation_highest_point FROM national_parks ORDER BY elevation_highest_point DESC." See the SQL statement highlighted in yellow in the screenshot above.
- The SQL response, also highlighted in yellow right after the SQL statement, accurately presented the answer based on the data in our database table: "Rocky Mountain National Park has the highest elevation point of 14,000 feet".
- Query transformation has nothing more to transform, since the SQL response alone answers the question and there is no need to hit the vector query engine.
Unstructured data query
Very nice answer! Love the bullet points! Let's examine its console output in the screenshot below:
Key points to mention:
- The first line, "Querying other query engine: Useful for answering semantic questions about different national parks", tells us that SQLAutoVectorQueryEngine is hitting our vector query engine, not the SQL query engine as in the structured data query scenario above.
- The auto_retriever module (VectorIndexAutoRetriever) is called with the metadata filter title set to "Rocky Mountain National Park," and it uses top_k retrieval to pass query parameters to the vector query engine.
- The vector query engine's response is formatted in bullet points, as highlighted in purple in the screenshot above.
Both structured and unstructured data query
This question asks about both the structured data for the national park with "the highest elevation point" and the unstructured data about "the lakes" from the YouTube video. The answer is excellent.
Let's look at the details of the console output. A few things to notice:
- The selector started by hitting the SQL query engine first; see line 1 in the screenshot below, starting with SQL Query.
- Text-to-SQL translation turned the natural language "the national park with the highest elevation point" into the SQL "SELECT park_name, elevation_highest_point FROM national_parks WHERE elevation_highest_point = (SELECT MAX(elevation_highest_point) FROM national_parks);". Well done!
- Query transformation (blue line) comes up with a more detailed question based on the SQL response that Rocky Mountain National Park has the highest elevation point of 14,000 feet: "what are the names of the lakes in Rocky Mountain National Park?"
- The auto_retriever module is called to filter by the metadata title, and it uses top_k retrieval to pass query parameters to the vector query engine.
Perfect! All three scenarios are handled beautifully.
Even though our code has been implemented and tested, we still have a few key questions and observations. Let's continue our deep dive.
Observation #1: How does the query engine selector determine whether to call the SQL query engine or the vector query engine first?
I explored the SQLAutoVectorQueryEngine source code to try to find the logic where the selector decides whether it needs to first hit the SQL query engine or the vector query engine for a query. From analyzing the code, I saw it ultimately boils down to the logic in the class LLMSingleSelector.
The decision-making process in LLMSingleSelector is primarily driven by the LLM-based predictor. The LLM predictor considers the prompt, the number of choices, the context (the choices' text), and the query input to generate a prediction. The prediction from the LLM is then parsed using the output parser defined in the prompt to determine the selected choice.
I reached out to Jerry Liu, co-founder and CEO of LlamaIndex, and was happy to get his confirmation that the selector does indeed rely on the LLM to determine whether to call the SQL query engine or the vector query engine first based on a query.
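To get a feel for that routing step, here is a rough sketch of exercising LLMSingleSelector directly with our two tool descriptions as choices. This is not code from the article; the module path and the select() signature are assumptions that may differ across llama_index versions:

# assumed API: LLMSingleSelector picks one choice for a query using the LLM
from llama_index.selectors.llm_selectors import LLMSingleSelector

selector = LLMSingleSelector.from_defaults()
choices = [
    "Useful for translating a natural language query into a SQL query over the national_parks table",
    "Useful for answering semantic questions about different national parks",
]
result = selector.select(choices, query="Which national park has the highest elevation point?")
print(result)  # which choice (engine) the LLM routed the query to, plus its reasoning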
Observation #2: Limitation of gpt-3.5-turbo in handling queries using the joint query engines
While experimenting with our vacation plan app, I noticed gpt-3.5-turbo doesn't always give accurate answers to questions meant for the joint query engines. For example, when I asked questions such as "Tell me the lakes in the national park with the highest elevation point," I got the response "The lakes in national parks with the highest elevation points are Lake Tahoe in Nevada, with an elevation of 10,881 feet, and Crater Lake in Oregon, with an elevation of 8,929 feet," not the answer I was looking for at all. See the screenshot below:
The console output, see the screenshot below, shows the limitation of this model in handling such joint query engines:
Notice the SQL query itself was constructed incorrectly to begin with. Why did gpt-3.5-turbo construct the WHERE clause with park_name LIKE '%lake%' in its text-to-SQL interpretation? It doesn't make sense. Also, "Transformed query given SQL response: None" is not what we are looking for either. We want the transformer to develop a more detailed question based on the SQL response, so it can query the vector query engine to get the final answer. In this experiment, we see that the text-to-SQL translation did not go well, resulting in an inaccurate SQL response and a malfunctioning transformer.
As we saw in the section above, "Both structured and unstructured data query," gpt-4 handles this same question beautifully and returns the correct answer.
Observation #3: Limitation of both gpt-3.5-turbo and gpt-4 in handling range queries with the joint query engines
When it comes to range queries, such as "Tell me the lakes in the national park with the lowest average June temperature," neither gpt-3.5-turbo nor gpt-4 gave a satisfactory answer.
gpt-3.5-turbo answered "The lakes in the national parks with the lowest average June temperature are Glacier National Park (60–70 degrees), Yellowstone National Park (60–75 degrees), and Rocky Mountain National Park (60–75 degrees)." Not the answer I was looking for. See the screenshot below:
Its console output, see the screenshot below, proves that the SQL statement (highlighted in yellow) is incorrect. It retrieves all national parks ordered by average_june_temperature ascending, whereas I was asking for the national park with the lowest average June temperature, not all three national parks. The intermediate step of query transformation was messed up because the SQL statement was incorrect.
It's worth noting that gpt-3.5-turbo's limitation in handling range queries is only observed when it needs to call the joint query engines. It does, however, handle range queries with the single SQL query engine correctly, as you can see in the screenshot below for the simple query "which national park has the lowest average June temperature?":
The console output shows the correct SQL statement "SELECT park_name, average_june_temperature FROM national_parks ORDER BY average_june_temperature ASC LIMIT 1;". Notice "LIMIT 1," which narrows down to the single national park I was looking for.
gpt-4 has similar limitations in handling range queries against the joint query engines. Here is the answer given by gpt-4 for the same question, "Tell me the lakes in the national park with the lowest average June temperature":
Yikes! That's not the answer I was looking for. Let's see where it went wrong by analyzing its console output:
Similar to gpt-3.5-turbo, the culprit lies in the incorrect SQL statement, highlighted in yellow in the screenshot above. I don't need all national parks returned; I am only looking for the one with the lowest average June temperature. Even though the intermediate step of query transformation was hard at work, since the wrong SQL response was fed into it, it produced the wrong answer.
Now, let's see what gpt-4 does with a simple query against the single SQL query engine, "which national park has the lowest average June temperature?":
Its answer is spot on. Verifying its console output, we see that the correct SQL statement with "LIMIT 1" was constructed. This behavior of gpt-4 mirrors gpt-3.5-turbo.
So, we can conclude that both gpt-3.5-turbo and gpt-4 handle range queries correctly when using the single SQL query engine, but when given range queries that need to call both the SQL query engine and the vector query engine, neither model does a good job.
Observation #4: Importance of prompt engineering for calling the joint query engines
I noticed that when querying both the SQL query engine and the vector query engine, I had to be careful about wording the prompt. I first tried the question "what are the things to do in the national park which has the highest elevation point?", and I got an error.
The console showed the following detailed error message:
As you can see, it threw "chromadb.errors.NoDatapointsException: No datapoints found for the supplied filter {'title': 'highest elevation point'}." Looks like the engine somehow mixed up the title with the wrong phrase in the prompt.
I then reworded my prompt to "Tell me the things to do in the highest elevation point national park," and I got the following correct response:
Console output detailing the steps SQLAutoVectorQueryEngine used to arrive at the answer:
Observation #5: Metadata title is case sensitive
I also noticed that the title in the metadata info is case-sensitive. See the screenshots below: when asked "What are the things to do in Glacier National park?", notice "park" is in lower case. We get the NoDatapointsException error, and you can see the console output in the screenshot below. After I corrected "park" to uppercase, I got the expected answer.
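One possible mitigation, purely a sketch on my part and not from the original demo, is to make the expected casing explicit in the metadata description, so the auto-retriever is nudged toward generating a filter value that matches the stored titles exactly:

# hypothetical tweak: spell out the exact, case-sensitive titles in the metadata description
from llama_index.vector_stores.types import MetadataInfo, VectorStoreInfo

vector_store_info = VectorStoreInfo(
    content_info='things to do in different national parks',
    metadata_info=[
        MetadataInfo(
            name='title',
            type='str',
            description=(
                'The exact, case-sensitive name of the national park, one of: '
                '"Glacier National Park", "Yellowstone National Park", '
                '"Rocky Mountain National Park"'
            ),
        ),
    ],
)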
Our national parks vacation plan demo app was only meant to experiment with and explore LlamaIndex's SQLAutoVectorQueryEngine, together with OpenAI, which can handle both structured analytics and semantic search combined, as well as structured analytics or semantic search individually.
In reality, there could be numerous use cases where your organization needs to query data from databases and augment the query result with unstructured data, be it research papers, an internal knowledge base, or multimodal data such as video, audio, or image files. Whatever your use cases may be, I encourage you to give SQLAutoVectorQueryEngine a try!
We explored LlamaIndex's SQLAutoVectorQueryEngine in this article. We started with some basic concepts, then introduced our national park vacation plan use case, where we needed to do structured analytics via a SQL database, semantic search over three YouTube videos, and a combination of queries over structured and unstructured data.
We dove into the detailed implementation of how to use SQLAutoVectorQueryEngine to address our three-scenario requirements. We tested our implementation successfully. We also noted a few key observations for further understanding of this great query engine.
Overall, this vacation plan use case is only meant to demonstrate the power of LlamaIndex's SQLAutoVectorQueryEngine, which is designed to handle structured analytics, semantic search, and a combination of both! Numerous use cases can spring from the features offered by SQLAutoVectorQueryEngine. I encourage you to explore this powerful engine and experience the amazing work it can do!
The complete source code for this article can be found in my GitHub repo.
Happy coding!

