Part of GPT-RAG
The diagram below provides an overview of the document ingestion pipeline, which handles various document types, preparing them for indexing and retrieval.
Workflow
- The `ragindex-indexer-chunk-documents` indexer reads new documents from the `documents` blob container.
- For each document, it calls the `document-chunking` function app to segment the content into chunks and generate embeddings using the ADA model.
- Finally, each chunk is indexed in the AI Search Index.
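Conceptually, each indexed chunk pairs a piece of text with its embedding and source metadata. The sketch below illustrates that shape; the field names are hypothetical, not the actual index schema defined by the GPT-RAG templates:

```python
# Illustrative sketch of a chunk record as it might land in the AI Search index.
# Field names here are assumptions for illustration, not the real index schema.

def make_chunk_record(doc_name: str, seq: int, text: str, embedding: list[float]) -> dict:
    """Pair a chunk of text with its embedding and source metadata."""
    return {
        "id": f"{doc_name}-{seq}",       # unique key per chunk
        "source_document": doc_name,     # blob the chunk came from
        "content": text,                 # chunk text used at retrieval time
        "content_vector": embedding,     # embedding of the chunk text
    }

record = make_chunk_record("handbook.pdf", 0, "Employees accrue 20 vacation days.", [0.01] * 1536)
print(record["id"])  # handbook.pdf-0
```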
The `document_chunking` function breaks documents into smaller segments called chunks.
When a document is submitted, the system identifies its file type and selects the appropriate chunker to divide it into chunks suitable for that specific type.
- For `.pdf` files, the system uses the DocAnalysisChunker with the Document Intelligence API, which extracts structured elements, like tables and sections, converting them into Markdown. LangChain splitters then segment the content based on sections. When Document Intelligence API 4.0 is enabled, `.docx` and `.pptx` files are processed with this chunker as well.
- For image files such as `.bmp`, `.png`, `.jpeg`, and `.tiff`, the DocAnalysisChunker performs Optical Character Recognition (OCR) to extract text before chunking.
- For specialized formats, specific chunkers are applied: `.vtt` files (video transcriptions) are handled by the TranscriptionChunker, which chunks content by time codes; `.xlsx` files (spreadsheets) are processed by the SpreadsheetChunker, which chunks by rows or sheets.
- For text-based files like `.txt`, `.md`, `.json`, and `.csv`, the LangChainChunker uses LangChain splitters to divide the content by paragraphs or sections.
This setup ensures each document is processed by the most suitable chunker, leading to efficient and accurate chunking.
Important: The file extension determines the choice of chunker as outlined above.
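The extension-to-chunker mapping can be pictured as a simple dispatch table. This is a simplified illustration using the chunker names from the prose above, not the repository's actual selection logic; the fallback choice is an assumption:

```python
# Simplified sketch of extension-based chunker selection.
# Chunker names mirror the documentation above; the real logic lives in the
# document_chunking function app and may differ in detail.

CHUNKER_BY_EXTENSION = {
    # Document Intelligence API (structure extraction; OCR for images)
    "pdf": "DocAnalysisChunker",
    "bmp": "DocAnalysisChunker", "png": "DocAnalysisChunker",
    "jpeg": "DocAnalysisChunker", "tiff": "DocAnalysisChunker",
    # Document Intelligence API 4.0 only
    "docx": "DocAnalysisChunker", "pptx": "DocAnalysisChunker",
    # Specialized formats
    "vtt": "TranscriptionChunker",
    "xlsx": "SpreadsheetChunker",
    # Text-based formats handled by LangChain splitters
    "txt": "LangChainChunker", "md": "LangChainChunker",
    "json": "LangChainChunker", "csv": "LangChainChunker",
}

def select_chunker(filename: str) -> str:
    """Pick a chunker from the file extension (fallback is an assumption)."""
    ext = filename.rsplit(".", 1)[-1].lower()
    return CHUNKER_BY_EXTENSION.get(ext, "LangChainChunker")

print(select_chunker("report.pdf"))   # DocAnalysisChunker
print(select_chunker("meeting.vtt"))  # TranscriptionChunker
```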
Customization
The chunking process is customizable. You can modify existing chunkers or create new ones to meet specific data processing needs, optimizing the pipeline.
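As a sketch of what a custom chunker could look like, the class below splits plain text on blank lines and merges short paragraphs up to a size limit. The class shape and method name are hypothetical; a real implementation should follow the base class and registration conventions used in the repository's chunking module:

```python
# Hypothetical custom chunker: splits plain text on blank lines and merges
# short paragraphs. The interface is illustrative, not the repo's actual base class.

class ParagraphChunker:
    """Chunk a document into paragraphs, merging small ones up to max_chars."""

    def __init__(self, max_chars: int = 1000):
        self.max_chars = max_chars

    def get_chunks(self, text: str) -> list[str]:
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        chunks: list[str] = []
        current = ""
        for p in paragraphs:
            # Start a new chunk when adding this paragraph would exceed the limit.
            if current and len(current) + len(p) + 2 > self.max_chars:
                chunks.append(current)
                current = p
            else:
                current = f"{current}\n\n{p}" if current else p
        if current:
            chunks.append(current)
        return chunks

chunker = ParagraphChunker(max_chars=50)
print(len(chunker.get_chunks("First paragraph.\n\nSecond paragraph.\n\n" + "x" * 60)))  # 2
```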
If you are using the few-shot or few-shot scaled NL2SQL strategies in your orchestration component, you may want to index NL2SQL content for use during the retrieval step. The idea is that this content will aid in SQL query creation with these strategies. More details about these NL2SQL strategies can be found in the orchestrator repository.
The NL2SQL Ingestion Process indexes three content types:
- query: Examples of queries for both few-shot and few-shot scaled strategies.
- table: Descriptions of tables for the few-shot scaled scenario.
- column: Descriptions of columns for the few-shot scaled scenario.
Note
If you are using the few-shot strategy, you will only need to index queries.
Each item—whether a query, table, or column—is represented in a JSON file with information specific to the query, table, or column, respectively.
Here’s an example of a query file:
```json
{
    "question": "What are the top 5 most expensive products currently available for sale?",
    "query": "SELECT TOP 5 ProductID, Name, ListPrice FROM SalesLT.Product WHERE SellEndDate IS NULL ORDER BY ListPrice DESC",
    "selected_tables": [
        "SalesLT.Product"
    ],
    "selected_columns": [
        "SalesLT.Product-ProductID",
        "SalesLT.Product-Name",
        "SalesLT.Product-ListPrice",
        "SalesLT.Product-SellEndDate"
    ],
    "reasoning": "This query retrieves the top 5 products with the highest selling prices that are currently available for sale. It uses the SalesLT.Product table, selects relevant columns, and filters out products that are no longer available by checking that SellEndDate is NULL."
}
```
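Before indexing, it can be useful to sanity-check query files against the fields shown in the example. A minimal validation sketch; the required-field list is derived from the example above and is not an official schema:

```python
import json

# Fields every query file is expected to carry, per the example above (assumption).
REQUIRED_FIELDS = {"question", "query", "selected_tables", "selected_columns", "reasoning"}

def validate_query_file(raw: str) -> dict:
    """Parse a query JSON file and check that the expected fields are present."""
    item = json.loads(raw)
    missing = REQUIRED_FIELDS - item.keys()
    if missing:
        raise ValueError(f"query file missing fields: {sorted(missing)}")
    return item

sample = json.dumps({
    "question": "What are the top 5 most expensive products currently available for sale?",
    "query": "SELECT TOP 5 ProductID, Name, ListPrice FROM SalesLT.Product "
             "WHERE SellEndDate IS NULL ORDER BY ListPrice DESC",
    "selected_tables": ["SalesLT.Product"],
    "selected_columns": ["SalesLT.Product-ProductID", "SalesLT.Product-Name"],
    "reasoning": "Retrieves the top 5 products currently for sale by price.",
})
print(validate_query_file(sample)["question"])
```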
In the nl2sql directory of this repository, you can find additional examples of queries, tables, and columns for the following Adventure Works sample SQL Database tables.
Sample Adventure Works Database Tables
Note
You can deploy this sample database in your Azure SQL Database.
The diagram below illustrates the NL2SQL data ingestion pipeline.
Workflow
This outlines the ingestion workflow for query elements.
Note: The workflow for tables and columns is similar; just replace queries with tables or columns in the steps below.
- The AI Search `queries-indexer` scans for new query files (each containing a single query) within the `queries` folder in the `nl2sql` storage container.
  Note: Files are stored in the `queries` folder, not in the root of the `nl2sql` container. This setup also applies to `tables` and `columns`.
- The `queries-indexer` then uses the `#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill` to create a vectorized representation of the question text using the Azure OpenAI Embeddings model.
  Note: For query items, the question itself is vectorized. For tables and columns, their descriptions are vectorized.
- Finally, the indexed content is added to the `nl2sql-queries` index.
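The vectorization step can be pictured as attaching an embedding of the question to the stored item. The toy sketch below uses a deterministic stand-in for the embedding call, since the real pipeline performs this inside AI Search via the AzureOpenAIEmbeddingSkill; the field names are illustrative:

```python
import hashlib

def fake_embed(text: str, dim: int = 8) -> list[float]:
    """Stand-in for the Azure OpenAI embedding call: a deterministic pseudo-vector.
    The real skillset calls the Azure OpenAI Embeddings model instead."""
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255.0 for b in digest[:dim]]

def to_index_document(query_item: dict) -> dict:
    """Shape a query item for the nl2sql-queries index: the question is vectorized."""
    return {
        "question": query_item["question"],
        "query": query_item["query"],
        "question_vector": fake_embed(query_item["question"]),  # illustrative field name
    }

doc = to_index_document({
    "question": "What are the top 5 most expensive products currently available for sale?",
    "query": "SELECT TOP 5 ProductID, Name, ListPrice FROM SalesLT.Product "
             "WHERE SellEndDate IS NULL ORDER BY ListPrice DESC",
})
print(len(doc["question_vector"]))  # 8
```

For table and column items, the same shaping would vectorize the description field instead of the question.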
- Provision the infrastructure and deploy the solution using the GPT-RAG template.
- Redeployment Steps:
  - Prerequisites:
    - Azure Developer CLI
    - PowerShell (Windows only)
    - Git
    - Python 3.11
  - Redeployment commands: run `azd auth login`, `azd env refresh`, then `azd deploy`.
    Note: Use the same environment name, subscription, and region as the initial deployment when running `azd env refresh`.
- Instructions for testing the data ingestion component locally in VS Code. See the Local Deployment Guide.
- Refer to the GPT-RAG Admin & User Guide for instructions.
- See GPT-RAG Admin & User Guide for reindexing instructions.
Here are the formats supported by each chunker. The file extension determines which chunker is used.
**Doc Analysis Chunker**

Extension | Doc Int API Version
---|---
pdf | 3.1, 4.0
bmp | 3.1, 4.0
jpeg | 3.1, 4.0
png | 3.1, 4.0
tiff | 3.1, 4.0
xlsx | 4.0
docx | 4.0
pptx | 4.0
**LangChain Chunker**

Extension | Format
---|---
md | Markdown document
txt | Plain text file
html | HTML document
shtml | Server-side HTML document
htm | HTML document
py | Python script
json | JSON data file
csv | Comma-separated values file
xml | XML data file
**Transcription Chunker**

Extension | Format
---|---
vtt | Video transcription
**Spreadsheet Chunker**

Extension | Format
---|---
xlsx | Spreadsheet