Is your feature request related to a problem? Please describe.
This feature aims to enhance Data Prepper to support Retrieval-Augmented Generation (RAG) use cases by implementing the following components:
A Vector Embedding Processor: leverages services such as AWS Bedrock and OpenAI to generate vector embeddings, facilitating integration with vector databases.
Codecs for Unstructured Data: enable Data Prepper to ingest additional file formats such as PDF and HTML.
Advanced Chunking Strategies: based on the 5 Levels of Text Splitting, to improve the relevance of text chunks for embedding and retrieval tasks. For details of the chunking strategy, refer to the link.
These enhancements will enable Data Prepper to handle complex unstructured data pipelines, allowing users to implement sophisticated retrieval and generation workflows for various applications.
Describe the solution you'd like
Sub-Issues
Issue 1: Implement Vector Embedding Processor - Use Bedrock, OpenAI, etc., for Embedding Generation
Objective: Create a processor that generates vector embeddings for input text chunks. The processor will support multiple embedding services, such as AWS Bedrock and OpenAI, and will leverage asynchronous processing for improved throughput.
Key Features:
Configurable embedding source selection.
Batch processing support with individual chunk embeddings.
Error handling and scalability optimizations.
Implementation:
Similar to the AWS Lambda processor in Data Prepper, we will need to handle calls to external services (Bedrock, OpenAI, Hugging Face, etc.) to get vector embeddings; a sketch of the async call pattern follows.
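As an illustration only (not the final processor design), the sketch below shows asynchronous embedding calls to Amazon Bedrock with the AWS SDK for Java v2. The class name, the request body shape, and the choice of the Titan embedding model are assumptions for this example; the model ID would come from processor configuration.

```java
import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.services.bedrockruntime.BedrockRuntimeAsyncClient;
import software.amazon.awssdk.services.bedrockruntime.model.InvokeModelRequest;
import software.amazon.awssdk.services.bedrockruntime.model.InvokeModelResponse;

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class BedrockEmbeddingClient {

    private final BedrockRuntimeAsyncClient bedrock = BedrockRuntimeAsyncClient.create();

    // Assumed model ID for the sketch; in the processor this is a configurable embedding source.
    private static final String MODEL_ID = "amazon.titan-embed-text-v1";

    /** Fires one async InvokeModel call per chunk so a batch of chunks is embedded concurrently. */
    public List<CompletableFuture<InvokeModelResponse>> embedChunks(List<String> chunks) {
        return chunks.stream()
                .map(chunk -> InvokeModelRequest.builder()
                        .modelId(MODEL_ID)
                        // Titan text-embedding models take a JSON body with an "inputText" field;
                        // the response body carries the vector under an "embedding" array.
                        .body(SdkBytes.fromUtf8String("{\"inputText\": " + quote(chunk) + "}"))
                        .build())
                .map(bedrock::invokeModel)
                .collect(Collectors.toList());
    }

    // Minimal JSON string escaping for the sketch; a real processor would use a JSON library.
    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}
```

Returning the futures rather than blocking per chunk is what gives the throughput benefit noted above; error handling and retries would wrap these futures in the actual processor.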
Issue 2: Add Codec for Unstructured Data (PDF, HTML, and Other Formats)
Objective: Expand Data Prepper’s ingestion capabilities to handle additional unstructured data formats like PDF and HTML, which are commonly used in RAG use cases.
Key Features:
Modular codec design for flexibility.
Basic in-memory text extraction for standard documents.
Integration with AWS Textract for advanced text extraction needs (e.g., handling scanned documents, tables, images).
Implementation:
For basic PDF documents, we can use libraries like Apache Tika or Apache PDFBox.
For advanced parsing, such as reading tabular data, receipts, or forms, it is best to use external services like AWS Textract or Aryn's partitioning service, which use an OCR + ML solution to read unstructured data. A sketch of the basic in-memory path follows.
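For the basic (non-Textract) path, here is a minimal sketch of what the codec's core extraction step could look like using the Apache Tika facade, which auto-detects the format and so covers PDF and HTML through one code path. The surrounding class and method names are illustrative assumptions, not the actual codec interface.

```java
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

import java.io.IOException;
import java.io.InputStream;

/** Illustrative codec core: extracts plain text from PDF, HTML, and other detected formats. */
public class UnstructuredDocumentExtractor {

    // Tika's facade auto-detects the content type, keeping the codec design modular:
    // new formats are handled without format-specific branches here.
    private final Tika tika = new Tika();

    public String extractText(InputStream documentStream) throws IOException, TikaException {
        // parseToString buffers the extracted text in memory, matching the "basic in-memory
        // text extraction" feature; scanned documents, tables, and forms would instead be
        // routed to an external service such as Textract.
        return tika.parseToString(documentStream);
    }
}
```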
Issue 3: Add Chunking Strategies Based on the 5 Levels of Text Splitting
Objective: Implement advanced chunking strategies inspired by the 5 Levels of Text Splitting (link). This will improve the semantic relevance of text chunks, enabling higher-quality embeddings and more effective retrieval.
Key Features:
Support for five levels of chunking: character, word, sentence, paragraph, and semantic unit.
Overlap handling to maintain context across chunks.
Configurable chunking parameters for customization.
Implementation:
Use existing frameworks like LangChain to leverage their chunking strategy implementations (langchain on java); a minimal sketch of the simplest level follows.
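As a concrete reference for the simplest of the five levels (fixed-size character splitting with overlap), here is a self-contained sketch. The class and parameter names are illustrative; in practice the LangChain-style splitters referenced above would supply this and the higher levels (sentence, paragraph, semantic).

```java
import java.util.ArrayList;
import java.util.List;

/** Level-1 splitting: fixed-size character chunks with a configurable overlap. */
public class FixedSizeChunker {

    public static List<String> chunk(String text, int chunkSize, int overlap) {
        if (overlap >= chunkSize) {
            throw new IllegalArgumentException("overlap must be smaller than chunkSize");
        }
        List<String> chunks = new ArrayList<>();
        int step = chunkSize - overlap;   // each chunk starts 'step' characters after the last
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + chunkSize, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break;                    // last chunk reached the end of the text
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Example: 10-character chunks with a 3-character overlap to preserve context.
        chunk("Data Prepper ingests and transforms data.", 10, 3).forEach(System.out::println);
    }
}
```

The overlap is what maintains context across chunk boundaries, and chunkSize/overlap are exactly the kind of configurable chunking parameters called out in the key features.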
To implement this feature, we will need to address the following sub-issues:
[ ] Issue 1: Implement Vector Embedding Processor - Use Bedrock, OpenAI, etc., for Embedding Generation
[ ] Issue 2: Add Codec for Unstructured Data (PDF, HTML, and Other Formats)
[ ] Issue 3: Add Chunking Strategies Based on the 5 Levels of Text Splitting
Additional context
RAG - https://aws.amazon.com/what-is/retrieval-augmented-generation/