Feature Request: Add RAG Support in Data Prepper #5126

Open
srikanthjg opened this issue Oct 28, 2024 · 0 comments
Labels
enhancement New feature or request

Is your feature request related to a problem? Please describe.

This feature aims to enhance Data Prepper to support Retrieval-Augmented Generation (RAG) use cases by implementing the following components:

  1. A Vector Embedding Processor that uses services such as AWS Bedrock and OpenAI to generate vector embeddings, facilitating integration with vector databases.
  2. Codecs for Unstructured Data that enable Data Prepper to ingest additional file formats such as PDF and HTML.
  3. Advanced Chunking Strategies based on the 5 Levels of Text Splitting, to improve the relevance of text chunks for embedding and retrieval tasks. For details of the chunking strategy, refer to: link

These enhancements will enable Data Prepper to handle complex unstructured data pipelines, so that users can build retrieval and generation workflows on top of it.

Describe the solution you'd like
Sub-Issues
Issue 1: Implement Vector Embedding Processor - Use Bedrock, OpenAI, etc., for Embedding Generation
Objective: Create a processor that generates vector embeddings for input text chunks. The processor will support multiple embedding services, such as AWS Bedrock and OpenAI, and will leverage asynchronous processing for improved throughput.
Key Features:
Configurable embedding source selection.
Batch processing support with individual chunk embeddings.
Error handling and scalability optimizations.

Implementation:
Similar to the AWS Lambda processor in Data Prepper, this processor will need to handle calls to external services (Bedrock, OpenAI, Hugging Face, etc.) to obtain vector embeddings.
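
To make the key features above more concrete, here is a minimal Python sketch of batched, asynchronous embedding with a selectable provider and per-batch error handling. It is an illustration only, not Data Prepper code: `fake_embed`, `PROVIDERS`, `embed_chunks`, and all parameter names are hypothetical, and the "provider" is a local stand-in rather than a real Bedrock or OpenAI call.

```python
import asyncio
import hashlib
from typing import Awaitable, Callable, Dict, List

# Hypothetical provider signature: a batch of texts in, one vector per text out.
EmbedFn = Callable[[List[str]], Awaitable[List[List[float]]]]


async def fake_embed(texts: List[str]) -> List[List[float]]:
    """Stand-in for a remote embedding call; not a real service API."""
    await asyncio.sleep(0)  # yield control, as a network call would
    vectors = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        vectors.append([b / 255.0 for b in digest[:4]])  # deterministic 4-dim vector
    return vectors


# Hypothetical registry backing "configurable embedding source selection".
PROVIDERS: Dict[str, EmbedFn] = {"fake": fake_embed}


async def embed_chunks(chunks: List[str], provider: str = "fake",
                       batch_size: int = 2, max_concurrency: int = 4) -> List[dict]:
    """Embed chunks in batches; return one record per chunk, in input order."""
    embed = PROVIDERS[provider]
    semaphore = asyncio.Semaphore(max_concurrency)  # cap in-flight requests

    async def run_batch(batch: List[str]) -> List[dict]:
        async with semaphore:
            try:
                vectors = await embed(batch)
            except Exception as exc:
                # A failed batch marks only its own chunks as failed.
                return [{"text": t, "embedding": None, "error": str(exc)} for t in batch]
        return [{"text": t, "embedding": v, "error": None} for t, v in zip(batch, vectors)]

    batches = [chunks[i:i + batch_size] for i in range(0, len(chunks), batch_size)]
    results = await asyncio.gather(*(run_batch(b) for b in batches))
    return [record for batch_result in results for record in batch_result]


records = asyncio.run(embed_chunks(["alpha", "beta", "gamma"]))
```

Each chunk keeps its own embedding even though requests are sent in batches, and the semaphore bounds concurrency so a large input does not flood the external service.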

Issue 2: Add Codec for Unstructured Data (PDF, HTML, and Other Formats)

Objective: Expand Data Prepper’s ingestion capabilities to handle additional unstructured data formats like PDF and HTML, which are commonly used in RAG use cases.
Key Features:
Modular codec design for flexibility.
Basic in-memory text extraction for standard documents.
Integration with AWS Textract for advanced text extraction needs (e.g., handling scanned documents, tables, images).

Implementation:
For basic PDF documents, we can use libraries like Apache Tika or Apache PDFBox.
For advanced parsing, such as reading tabular data, receipts, or forms, it is best to use external services like AWS Textract or Aryn's partitioning service, which combine OCR and ML to read unstructured data.
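
As a rough illustration of the modular codec idea, the Python sketch below maps a content type to an in-memory text extractor and implements a basic HTML extractor with the standard library. The registry `CODECS` and the functions `extract_text`, `extract_html`, and `extract_plain` are hypothetical names; PDF extraction and Textract integration are deliberately left out, since they depend on external libraries and services.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, List


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_html(raw: bytes) -> str:
    """Basic in-memory extraction; text fragments are joined with single spaces."""
    parser = _TextExtractor()
    parser.feed(raw.decode("utf-8", errors="replace"))
    parser.close()
    return " ".join(parser.parts)


def extract_plain(raw: bytes) -> str:
    return raw.decode("utf-8", errors="replace").strip()


# Hypothetical registry: a new format is added by registering one more function.
CODECS: Dict[str, Callable[[bytes], str]] = {
    "text/html": extract_html,
    "text/plain": extract_plain,
}


def extract_text(raw: bytes, content_type: str) -> str:
    """Dispatch to the codec registered for the given content type."""
    try:
        codec = CODECS[content_type]
    except KeyError:
        raise ValueError(f"no codec registered for {content_type}") from None
    return codec(raw)
```

Under this design, a PDF codec or a Textract-backed codec would simply be another entry in the registry, which keeps the basic and advanced extraction paths interchangeable.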

Issue 3: Add Chunking Strategies Based on the 5 Levels of Text Splitting

Objective: Implement advanced chunking strategies inspired by the 5 Levels of Text Splitting (link). This will improve the semantic relevance of text chunks, enabling higher-quality embeddings and more effective retrieval.
Key Features:
Support for five levels of chunking: character, word, sentence, paragraph, and semantic unit.
Overlap handling to maintain context across chunks.
Configurable chunking parameters for customization.

Implementation:
Use an existing framework such as LangChain to reuse its chunking strategy implementations, for example LangChain on Java.
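
For illustration, sentence-level chunking with overlap could look like the Python sketch below. It is hand-rolled and does not use LangChain; the function names, the `max_sentences` and `overlap` parameters, and the naive regex sentence splitter are all assumptions made for this example.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_sentences(text: str, max_sentences: int = 3, overlap: int = 1) -> List[str]:
    """Group sentences into chunks; consecutive chunks share `overlap` sentences."""
    if max_sentences < 1:
        raise ValueError("max_sentences must be at least 1")
    if not 0 <= overlap < max_sentences:
        raise ValueError("overlap must be in [0, max_sentences)")

    sentences = split_sentences(text)
    step = max_sentences - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + max_sentences]))
        if start + max_sentences >= len(sentences):
            break  # this window already reached the last sentence
    return chunks


chunks = chunk_sentences("One. Two. Three. Four. Five.", max_sentences=3, overlap=1)
# chunks == ["One. Two. Three.", "Three. Four. Five."]
```

The shared sentence ("Three.") is what carries context across the chunk boundary; the same sliding-window idea applies at the character, word, or paragraph level by swapping the splitter.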

To implement this feature, we will need to address the following sub-issues:
[ ] Issue 1: Implement Vector Embedding Processor - Use Bedrock, OpenAI, etc., for Embedding Generation
[ ] Issue 2: Add Codec for Unstructured Data (PDF, HTML, and Other Formats)
[ ] Issue 3: Add Chunking Strategies Based on the 5 Levels of Text Splitting

Additional context
RAG - https://aws.amazon.com/what-is/retrieval-augmented-generation/
