Is your feature request related to a problem? Please describe.
This feature aims to enhance Data Prepper to support Retrieval-Augmented Generation (RAG) use cases by implementing the following components:
A Vector Embedding Processor: leverages services such as AWS Bedrock and OpenAI to generate vector embeddings, facilitating integration with vector databases.
Codecs for Unstructured Data: enable Data Prepper to ingest additional file formats such as PDF and HTML.
Advanced Chunking Strategies: based on the 5 Levels of Text Splitting, to improve the relevance of text chunks for embedding and retrieval tasks. For details of the chunking strategy, refer to the link.
These enhancements will enable Data Prepper to handle complex unstructured data pipelines, allowing users to implement sophisticated retrieval and generation workflows for various applications.
Describe the solution you'd like
Sub-Issues
Issue 1: Implement Vector Embedding Processor - Use Bedrock, OpenAI, etc., for Embedding Generation
Objective: Create a processor that generates vector embeddings for input text chunks. The processor will support multiple embedding services, such as AWS Bedrock and OpenAI, and will leverage asynchronous processing for improved throughput.
Key Features:
Configurable embedding source selection.
Batch processing support with individual chunk embeddings.
Error handling and scalability optimizations.
Implementation:
Similar to the AWS Lambda processor in Data Prepper, we will need to handle calls to external services (Bedrock, OpenAI, Hugging Face, etc.) to get vector embeddings; a sketch of the async call pattern follows.
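As an illustration only (not the final processor design), the sketch below shows asynchronous embedding calls to Amazon Bedrock with the AWS SDK for Java v2. The class name, the request body shape, and the choice of the Titan embedding model are assumptions for this example; the model ID would come from processor configuration.

```java
import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.services.bedrockruntime.BedrockRuntimeAsyncClient;
import software.amazon.awssdk.services.bedrockruntime.model.InvokeModelRequest;
import software.amazon.awssdk.services.bedrockruntime.model.InvokeModelResponse;

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class BedrockEmbeddingClient {

    private final BedrockRuntimeAsyncClient bedrock = BedrockRuntimeAsyncClient.create();

    // Assumed model ID for the sketch; in the processor this is a configurable embedding source.
    private static final String MODEL_ID = "amazon.titan-embed-text-v1";

    /** Fires one async InvokeModel call per chunk so a batch of chunks is embedded concurrently. */
    public List<CompletableFuture<InvokeModelResponse>> embedChunks(List<String> chunks) {
        return chunks.stream()
                .map(chunk -> InvokeModelRequest.builder()
                        .modelId(MODEL_ID)
                        // Titan text-embedding models take a JSON body with an "inputText" field;
                        // the response body carries the vector under an "embedding" array.
                        .body(SdkBytes.fromUtf8String("{\"inputText\": " + quote(chunk) + "}"))
                        .build())
                .map(bedrock::invokeModel)
                .collect(Collectors.toList());
    }

    // Minimal JSON string escaping for the sketch; a real processor would use a JSON library.
    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}
```

Returning the futures rather than blocking per chunk is what gives the throughput benefit noted above; error handling and retries would wrap these futures in the actual processor.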
Issue 2: Add Codec for Unstructured Data (PDF, HTML, and Other Formats)
Objective: Expand Data Prepper’s ingestion capabilities to handle additional unstructured data formats like PDF and HTML, which are commonly used in RAG use cases.
Key Features:
Modular codec design for flexibility.
Basic in-memory text extraction for standard documents.
Integration with AWS Textract for advanced text extraction needs (e.g., handling scanned documents, tables, images).
Implementation:
For basic PDF documents, we can use libraries like Apache Tika or Apache PDFBox.
For advanced parsing, such as reading tabular data, receipts, or forms, it is best to use external services like AWS Textract or Aryn's partitioning service, which use an OCR + ML solution to read unstructured data. A sketch of the basic in-memory path follows.
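For the basic (non-Textract) path, here is a minimal sketch of what the codec's core extraction step could look like using the Apache Tika facade, which auto-detects the format and so covers PDF and HTML through one code path. The surrounding class and method names are illustrative assumptions, not the actual codec interface.

```java
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

import java.io.IOException;
import java.io.InputStream;

/** Illustrative codec core: extracts plain text from PDF, HTML, and other detected formats. */
public class UnstructuredDocumentExtractor {

    // Tika's facade auto-detects the content type, keeping the codec design modular:
    // new formats are handled without format-specific branches here.
    private final Tika tika = new Tika();

    public String extractText(InputStream documentStream) throws IOException, TikaException {
        // parseToString buffers the extracted text in memory, matching the "basic in-memory
        // text extraction" feature; scanned documents, tables, and forms would instead be
        // routed to an external service such as Textract.
        return tika.parseToString(documentStream);
    }
}
```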
Issue 3: Add Chunking Strategies Based on the 5 Levels of Text Splitting
Objective: Implement advanced chunking strategies inspired by the 5 Levels of Text Splitting (link). This will improve the semantic relevance of text chunks, enabling higher-quality embeddings and more effective retrieval.
Key Features:
Support for five levels of chunking: character, word, sentence, paragraph, and semantic unit.
Overlap handling to maintain context across chunks.
Configurable chunking parameters for customization.
Implementation:
Use existing frameworks like LangChain to leverage their chunking strategy implementations (langchain on java); a minimal sketch of the simplest level follows.
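As a concrete reference for the simplest of the five levels (fixed-size character splitting with overlap), here is a self-contained sketch. The class and parameter names are illustrative; in practice the LangChain-style splitters referenced above would supply this and the higher levels (sentence, paragraph, semantic).

```java
import java.util.ArrayList;
import java.util.List;

/** Level-1 splitting: fixed-size character chunks with a configurable overlap. */
public class FixedSizeChunker {

    public static List<String> chunk(String text, int chunkSize, int overlap) {
        if (overlap >= chunkSize) {
            throw new IllegalArgumentException("overlap must be smaller than chunkSize");
        }
        List<String> chunks = new ArrayList<>();
        int step = chunkSize - overlap;   // each chunk starts 'step' characters after the last
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + chunkSize, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break;                    // last chunk reached the end of the text
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Example: 10-character chunks with a 3-character overlap to preserve context.
        chunk("Data Prepper ingests and transforms data.", 10, 3).forEach(System.out::println);
    }
}
```

The overlap is what maintains context across chunk boundaries, and chunkSize/overlap are exactly the kind of configurable chunking parameters called out in the key features.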
To implement this feature, we will need to address the following sub-issues:
[ ] Issue 1: Implement Vector Embedding Processor - Use Bedrock, OpenAI, etc., for Embedding Generation
[ ] Issue 2: Add Codec for Unstructured Data (PDF, HTML, and Other Formats)
[ ] Issue 3: Add Chunking Strategies Based on the 5 Levels of Text Splitting
Additional context
RAG - https://aws.amazon.com/what-is/retrieval-augmented-generation/