Releases: deepset-ai/haystack
v1.12.1
⭐ Highlights
Large Language Models with PromptNode
Introducing PromptNode, a new feature that brings the power of large language models (LLMs) to various NLP tasks. PromptNode is an easy-to-use, customizable node you can run on its own or in a pipeline. We've designed the API to be user-friendly and suitable for everyday experimentation, but also fully compatible with production-grade Haystack deployments.
By setting a prompt template for a PromptNode, you define what task you want it to do. This way, you can have multiple PromptNodes in your pipeline, each performing a different task. But that's not all. You can also inject the output of one PromptNode into the input of another one.
Out of the box, we support both Google Flan-T5 and OpenAI GPT-3 models, and you can even mix and match these models in your pipelines.
from haystack.nodes.prompt import PromptNode
# Initialize the node:
prompt_node = PromptNode("google/flan-t5-base") # try also 'text-davinci-003' if you have an OpenAI key
prompt_node("What is the capital of Germany?")
This node can do a lot more than simply querying LLMs: it can manage prompt templates, run batches, share models among instances, be chained together in pipelines, and more. Check its documentation for details!
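The template-and-chaining idea can be illustrated without any model at all. The sketch below is plain Python, not the PromptNode API: make_node, echo_model, and the two templates are hypothetical names, and the "model" is a stand-in that only echoes its prompt. It shows one node's output being fed into the next node's template.

```python
from string import Template
from typing import Callable


def make_node(template: str, model: Callable[[str], str]) -> Callable[..., str]:
    """Return a callable that fills `template` and sends the prompt to `model`."""
    prompt_template = Template(template)

    def node(**variables: str) -> str:
        # Fill the placeholders, then hand the finished prompt to the model.
        return model(prompt_template.substitute(**variables))

    return node


def echo_model(prompt: str) -> str:
    # Stand-in for an LLM call (assumption): it only wraps the prompt it received.
    return f"<answer to: {prompt}>"


# Two "nodes" that share one model but perform different tasks.
summarize = make_node("Summarize: $text", echo_model)
translate = make_node("Translate to German: $text", echo_model)

summary = summarize(text="Haystack 1.12 adds PromptNode.")
translation = translate(text=summary)  # output of one node becomes input of the next
```

With a real model behind the callable, the same wiring would let each node focus on a single task while the pipeline decides how outputs flow between them.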
Support for BM25Retriever in InMemoryDocumentStore
InMemoryDocumentStore has always been the go-to document store for small prototypes. The addition of BM25 support makes it officially one of the document stores to support all Retrievers available to Haystack, just like FAISS and Elasticsearch-like stores, but without the external dependencies. Don't use it in production deployments that handle millions of documents, though. It's not the fastest document store out there.
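For readers unfamiliar with BM25, the ranking function behind this kind of retrieval can be sketched in a few lines of plain Python. This is an illustrative Okapi BM25 scorer under assumed defaults (k1 = 1.5, b = 0.75, lowercase whitespace tokenization); bm25_scores is a hypothetical helper, not Haystack's implementation, and it does not use the InMemoryDocumentStore API.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document in `docs` against `query` with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates via k1; b normalizes for document length.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
        scores.append(score)
    return scores


docs = [
    "Berlin is the capital of Germany",
    "Paris is the capital of France",
    "Haystack is an open source NLP framework",
]
scores = bm25_scores("capital of Germany", docs)
best = max(range(len(docs)), key=scores.__getitem__)  # index of the top-ranked document
```

Because scoring only needs term counts, this style of retrieval works without embeddings or an external search service, which is why it suits an in-memory store for prototypes.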
🏆 Honorable mention to @anakin87 for this outstanding contribution, among many many others! 🏆
Haystack is always open to external contributions, and every little bit is appreciated. Don't know where to start? Have a look at the Contributors Guidelines.
Extended support for Cohere and OpenAI embeddings
We enabled EmbeddingRetriever to use the latest Cohere multilingual embedding models and OpenAI embedding models.
Simply use the model's full name (along with your API key) in EmbeddingRetriever to get them:
from haystack.nodes import EmbeddingRetriever

# api_key holds your Cohere or OpenAI API key
# Cohere
retriever = EmbeddingRetriever(embedding_model="multilingual-22-12", batch_size=16, api_key=api_key)
# OpenAI
retriever = EmbeddingRetriever(embedding_model="text-embedding-ada-002", batch_size=32, api_key=api_key, max_seq_len=8191)
Speeding up dense searches in batch mode (Elasticsearch and OpenSearch)
Whenever you need to execute multiple dense searches at once, ElasticsearchDocumentStore and OpenSearchDocumentStore can now do it in parallel. This not only speeds up run_batch and eval_batch for dense pipelines when used with those document stores but also significantly speeds up multi-embedding retrieval pipelines like, for example, MostSimilarDocumentsPipeline.
For this, we measured a speedup of up to 49% on a realistic dataset.
Under the hood, our newly introduced query_by_embedding_batch document store function uses msearch to unleash the full power of your Elasticsearch/OpenSearch cluster.
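To give a rough idea of what a multi-search request looks like, the sketch below builds the newline-delimited body that Elasticsearch's _msearch endpoint expects: one header line and one query line per search. The helper name build_msearch_body, the index name, the embedding field name, and the script_score query are assumptions for illustration (Elasticsearch flavor; OpenSearch k-NN queries use a different syntax), not necessarily the request Haystack sends.

```python
import json
from typing import List


def build_msearch_body(
    index: str,
    query_embeddings: List[List[float]],
    top_k: int = 10,
    embedding_field: str = "embedding",  # assumed name of the dense vector field
) -> str:
    """Build an NDJSON body with one header/query pair per query embedding."""
    lines = []
    for vector in query_embeddings:
        # Header line: which index this individual search targets.
        lines.append(json.dumps({"index": index}))
        # Body line: a cosine-similarity script_score query for this vector.
        lines.append(json.dumps({
            "size": top_k,
            "query": {
                "script_score": {
                    "query": {"match_all": {}},
                    "script": {
                        "source": f"cosineSimilarity(params.query_vector, '{embedding_field}') + 1.0",
                        "params": {"query_vector": vector},
                    },
                }
            },
        }))
    # The multi-search body must end with a newline.
    return "\n".join(lines) + "\n"


body = build_msearch_body("document", [[0.1, 0.2], [0.3, 0.4]], top_k=5)
```

Sending all searches in one request lets the cluster process them concurrently instead of paying one round trip per query, which is where a batch speedup would come from.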
⚠️ Deprecated Docker images discontinued
1.12 is the last release we're shipping with the old Docker images deepset/haystack-cpu, deepset/haystack-gpu, and their respective tags. We'll remove the corresponding, deprecated Docker files /Dockerfile, /Dockerfile-GPU, and /Dockerfile-GPU-minimal from the codebase after the release.
What's Changed
Pipeline
- fix: ParsrConverter fails on pages without text by @anakin87 in #3605
- fix: Convert eval metrics to python float by @tstadel in #3612
- feat: add support for BM25Retriever in InMemoryDocumentStore by @anakin87 in #3561
- chore: fix return type of aggregate_labels by @tstadel in #3617
- refactor: change MultiModal retriever to be of type DenseRetriever by @mayankjobanputra in #3598
- fix: Move entire forward pass of TableQA within torch.no_grad() by @sjrl in #3636
- feat: add offsets_in_context to evaluation result by @julian-risch in #3640
- bug: Use tqdm auto instead of plain tqdm by @vblagoje in #3672
- fix: monkey patch for SklearnQueryClassifier by @anakin87 in #3678
- feat: Update table reader tests to check the answer scores by @sjrl in #3641
- feat: Adds all_terms_must_match parameter to BM25Retriever at runtime by @ugm2 in #3627
- fix: fix PreProcessor split_by schema by @ZanSara in #3680
- refactor: Generate JSON schema when missing by @masci in #3533
- refactor: replace torch.no_grad with torch.inference_mode (where possible) by @anakin87 in #3601
- Adjust get_type() method for pipelines by @vblagoje in #3657
- refactor: improve Multilabel design by @anakin87 in #3658
- feat: Update cohere embedding models #3704 by @vblagoje #3704
- feat: Enable text-embedding-ada-002 for EmbeddingRetriever #3721 by @vblagoje #3721
- feat: Expand LLM support with PromptModel, PromptNode, and PromptTemplate by @vblagoje in #3667
DocumentStores
- fix: Flatten DocumentClassifier output in SQLDocumentStore by @anakin87 in #3273
- refactor: move milvus tests to their own module by @masci in #3596
- feat: store metadata using JSON in SQLDocumentStore by @masci in #3547
- fix: Pin faiss-cpu as 1.7.3 seems to have problems by @masci in #3603
- refactor: Move InMemoryDocumentStore tests to their own class by @masci in #3614
- chore: remove redundant tests by @masci in #3620
- refactor: Weaviate query with filters by @ZanSara in #3628
- fix: use 9200 as the default port in launch_opensearch() by @masci in #3630
- fix: revert Weaviate query with filters and improve tests by @ZanSara in #3646
- feat: add query_by_embedding_batch by @tstadel in #3546
- refactor: filters type by @tstadel in #3682
- fix: pinecone metadata format by @jamescalam in #3660
- fix: fixing broken BM25 support with Weaviate - fixes #3720 #3723 by @zoltan-fedor #3723
Documentation
- fix: fixing the url for document merger by @TuanaCelik in #3615
- docs: Reformat code blocks in docstrings by @brandenchan in #3580
Contributors to Tutorials
- fix: Tutorial 2, finetune a model, distillation code by Benvii deepset-ai/haystack-tutorials#69
- chore: Update 01_Basic_QA_Pipeline.ipynb by gsajko deepset-ai/haystack-tutorials#63
Other Changes
- test: add test to check id_hash_keys is not ignored by @julian-risch in #3577
- fix: remove beir from all-gpu by @ZanSara in #3669
- feat: Update DocumentMerger and TextIndexingPipeline imports by @brandenchan in #3599
- fix: pin espnet in the audio extra by @ZanSara in #3693
- refactor: update Squad data by @espoirMur in #3513
- Update CONTRIBUTING.md by @TuanaCelik in #3624
- fix: revamp colab extra dependencies by @masci in #3626
- refactor: remove test extra by @ZanSara in #3679
- fix: remove beir from the base GPU image by @ZanSara in #3692
- feat: Bump transformers version to remove torch scatter dependency by @sjrl in #3703
New Contributors
- @espoirMur made their first contribution in #3513
Full Changelog: v1.11.1...v1.12.1
v1.12.0
v1.12.0rc1
⭐ Highlights
Large Language Models with PromptNode
Introducing PromptNode, a new feature that brings the power of large language models (LLMs) to various NLP tasks. PromptNode is an easy-to-use, customizable node you can run on its own or in a pipeline. We've designed the API to be user-friendly and suitable for everyday experimentation, but also fully compatible with production-grade Haystack deployments.
By setting a prompt template for a PromptNode, you define what task you want it to do. This way, you can have multiple PromptNodes in your pipeline, each performing a different task. But that's not all. You can also inject the output of one PromptNode into the input of another one.
Out of the box, we support both Google Flan-T5 and OpenAI GPT-3 models, and you can even mix and match these models in your pipelines.
from haystack.nodes.prompt import PromptNode
# Initialize the node:
prompt_node = PromptNode("google/flan-t5-base") # try also 'text-davinci-003' if you have an OpenAI key
prompt_node("What is the capital of Germany?")
This node can do a lot more than simply querying LLMs: it can manage prompt templates, run batches, share models among instances, be chained together in pipelines, and more. Check its documentation for details!
Support for BM25Retriever in InMemoryDocumentStore
InMemoryDocumentStore has always been the go-to document store for small prototypes. The addition of BM25 support makes it officially one of the document stores to support all Retrievers available to Haystack, just like FAISS and Elasticsearch-like stores, but without the external dependencies. Don't use it in production deployments that handle millions of documents, though. It's not the fastest document store out there.
🏆 Honorable mention to @anakin87 for this outstanding contribution, among many many others! 🏆
Haystack is always open to external contributions, and every little bit is appreciated. Don't know where to start? Have a look at the Contributors Guidelines.
Extended support for Cohere and OpenAI embeddings
We enabled EmbeddingRetriever to use the latest Cohere multilingual embedding models and OpenAI embedding models.
Simply use the model's full name (along with your API key) in EmbeddingRetriever to get them:
from haystack.nodes import EmbeddingRetriever

# api_key holds your Cohere or OpenAI API key
# Cohere
retriever = EmbeddingRetriever(embedding_model="multilingual-22-12", batch_size=16, api_key=api_key)
# OpenAI
retriever = EmbeddingRetriever(embedding_model="text-embedding-ada-002", batch_size=32, api_key=api_key, max_seq_len=8191)
Speeding up dense searches in batch mode (Elasticsearch and OpenSearch)
Whenever you need to execute multiple dense searches at once, ElasticsearchDocumentStore and OpenSearchDocumentStore can now do it in parallel. This not only speeds up run_batch and eval_batch for dense pipelines when used with those document stores but also significantly speeds up multi-embedding retrieval pipelines like, for example, MostSimilarDocumentsPipeline.
For this, we measured a speedup of up to 49% on a realistic dataset.
Under the hood, our newly introduced query_by_embedding_batch document store function uses msearch to unleash the full power of your Elasticsearch/OpenSearch cluster.
⚠️ Deprecated Docker images discontinued
1.12 is the last release we're shipping with the old Docker images deepset/haystack-cpu, deepset/haystack-gpu, and their respective tags. We'll remove the corresponding, deprecated Docker files /Dockerfile, /Dockerfile-GPU, and /Dockerfile-GPU-minimal from the codebase after the release.
What's Changed
Pipeline
- fix: ParsrConverter fails on pages without text by @anakin87 in #3605
- fix: Convert eval metrics to python float by @tstadel in #3612
- feat: add support for BM25Retriever in InMemoryDocumentStore by @anakin87 in #3561
- chore: fix return type of aggregate_labels by @tstadel in #3617
- refactor: change MultiModal retriever to be of type DenseRetriever by @mayankjobanputra in #3598
- fix: Move entire forward pass of TableQA within torch.no_grad() by @sjrl in #3636
- feat: add offsets_in_context to evaluation result by @julian-risch in #3640
- bug: Use tqdm auto instead of plain tqdm by @vblagoje in #3672
- fix: monkey patch for SklearnQueryClassifier by @anakin87 in #3678
- feat: Update table reader tests to check the answer scores by @sjrl in #3641
- feat: Adds all_terms_must_match parameter to BM25Retriever at runtime by @ugm2 in #3627
- fix: fix PreProcessor split_by schema by @ZanSara in #3680
- refactor: Generate JSON schema when missing by @masci in #3533
- refactor: replace torch.no_grad with torch.inference_mode (where possible) by @anakin87 in #3601
- Adjust get_type() method for pipelines by @vblagoje in #3657
- refactor: improve Multilabel design by @anakin87 in #3658
- feat: Update cohere embedding models #3704 by @vblagoje #3704
- feat: Enable text-embedding-ada-002 for EmbeddingRetriever #3721 by @vblagoje #3721
DocumentStores
- fix: Flatten DocumentClassifier output in SQLDocumentStore by @anakin87 in #3273
- refactor: move milvus tests to their own module by @masci in #3596
- feat: store metadata using JSON in SQLDocumentStore by @masci in #3547
- fix: Pin faiss-cpu as 1.7.3 seems to have problems by @masci in #3603
- refactor: Move InMemoryDocumentStore tests to their own class by @masci in #3614
- chore: remove redundant tests by @masci in #3620
- refactor: Weaviate query with filters by @ZanSara in #3628
- fix: use 9200 as the default port in launch_opensearch() by @masci in #3630
- fix: revert Weaviate query with filters and improve tests by @ZanSara in #3646
- feat: add query_by_embedding_batch by @tstadel in #3546
- refactor: filters type by @tstadel in #3682
- fix: pinecone metadata format by @jamescalam in #3660
- fix: fixing broken BM25 support with Weaviate - fixes #3720 #3723 by @zoltan-fedor #3723
Documentation
- fix: fixing the url for document merger by @TuanaCelik in #3615
- docs: Reformat code blocks in docstrings by @brandenchan in #3580
Contributors to Tutorials
- fix: Tutorial 2, finetune a model, distillation code by Benvii deepset-ai/haystack-tutorials#69
- chore: Update 01_Basic_QA_Pipeline.ipynb by gsajko deepset-ai/haystack-tutorials#63
Other Changes
- test: add test to check id_hash_keys is not ignored by @julian-risch in #3577
- fix: remove beir from all-gpu by @ZanSara in #3669
- feat: Update DocumentMerger and TextIndexingPipeline imports by @brandenchan in #3599
- fix: pin espnet in the audio extra by @ZanSara in #3693
- refactor: update Squad data by @espoirMur in #3513
- Update CONTRIBUTING.md by @TuanaCelik in #3624
- fix: revamp colab extra dependencies by @masci in #3626
- refactor: remove test extra by @ZanSara in #3679
- fix: remove beir from the base GPU image by @ZanSara in #3692
- feat: Bump transformers version to remove torch scatter dependency by @sjrl in #3703
New Contributors
- @espoirMur made their first contribution in #3513
Full Changelog: v1.11.1...v1.12.0rc1
v1.11.1
What's Changed
Full Changelog: v1.11.0...v1.11.1
v1.11.1rc1
What's Changed
Full Changelog: v1.11.0...v1.11.1rc1
v1.11.0
⭐ Highlights
Expanding Haystack’s LLM support further with the new CohereEmbeddingEncoder (#3453)
Now you can easily create document and query embeddings using Cohere’s large language models: if you have a Cohere account, all you have to do is set the name of one of the supported models (small, medium, or large) and add your API key to the EmbeddingRetriever component in your pipelines (see docs).
Extracting headlines from Markdown and PDF files (#3445 #3488)
Using the MarkdownConverter or the ParsrConverter, you can set the parameter extract_headlines to True to extract the headlines out of your files together with their start position in the file and their level. Headlines are stored as a list of dictionaries in the Document's meta field "headlines" and are structured as follows:
{
    "headline": <THE HEADLINE STRING>,
    "start_idx": <IDX OF HEADLINE START IN document.content>,
    "level": <LEVEL OF THE HEADLINE>
}
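One way to put this metadata to use is to look up which headline a given character position falls under. The sketch below is plain Python operating on a list shaped like the structure above; headline_at is a hypothetical helper, and the sample content, offsets, and level values are made up for illustration rather than produced by a converter.

```python
from bisect import bisect_right
from typing import Dict, List, Optional


def headline_at(headlines: List[Dict], position: int) -> Optional[Dict]:
    """Return the last headline that starts at or before `position`, if any."""
    ordered = sorted(headlines, key=lambda h: h["start_idx"])
    starts = [h["start_idx"] for h in ordered]
    # bisect_right finds the first headline starting after `position`;
    # the one just before it is the section that contains the position.
    index = bisect_right(starts, position) - 1
    return ordered[index] if index >= 0 else None


# Made-up stand-in for document.content and document.meta["headlines"].
content = "Intro\nSome text.\nSetup\nInstall it.\n"
headlines = [
    {"headline": "Intro", "start_idx": 0, "level": 0},
    {"headline": "Setup", "start_idx": content.index("Setup"), "level": 0},
]

section = headline_at(headlines, content.index("Install"))
```

The same lookup could be used, for example, to label an extracted answer with the section it came from, based on the answer's offset in the document content.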
Introducing the proposals design process (#3333)
We've introduced the proposal design process for substantial changes. A proposal is a single Markdown file that explains why a change is needed and how it would be implemented. You can find a detailed explanation of the process and a proposal template in the proposals directory.
⚠️ Breaking change: removing Milvus1DocumentStore
From this version onwards, Haystack no longer supports version 1 of Milvus. We still support Milvus version 2. We removed Milvus1DocumentStore and renamed Milvus2DocumentStore to MilvusDocumentStore.
What's Changed
Breaking Changes
- bug: removed duplicated meta "name" field addition to content before embedding in update_embeddings workflow by @mayankjobanputra in #3368
- BREAKING CHANGE: remove Milvus1DocumentStore along with support for Milvus < 2.x by @masci in #3552
Pipeline
- fix: Fix the error of wrong page numbers when documents contain empty pages. by @brunnurs in #3330
- bug: change type of split_by to Literal including None by @julian-risch in #3389
- Fix: update pyworld pin by @anakin87 in #3435
- feat: send event if number of queries exceeds threshold by @vblagoje in #3419
- Feat: allow decreasing size of datasets loaded from BEIR by @ugm2 in #3392
- feat: add __contains__ to Span by @ZanSara in #3446
- Bug: Fix prompt length computation by @Timoeller in #3448
- Add indexing pipeline type by @vblagoje in #3461
- fix: warning if doc store similarity function is incompatible with Sentence Transformers model by @anakin87 in #3455
- feat: Add CohereEmbeddingEncoder to EmbeddingRetriever by @vblagoje in #3453
- feat: Extraction of headlines in markdown files by @bogdankostic in #3445
- bug: replace decorator with counter attribute for pipeline event by @julian-risch in #3462
- feat: add document_store to all BaseRetriever.retrieve() and BaseRetriever.retrieve_batch() implementations by @ZanSara in #3379
- refactor: TableReader by @sjrl in #3456
- fix: do not reference package directory in PDFToTextOCRConverter.convert() by @ZanSara in #3478
- feat: Create the TextIndexingPipeline by @brandenchan in #3473
- refactor: remove YAML save/load methods for subclasses of BaseStandardPipeline by @ZanSara in #3443
- fix: strip whitespaces safely from FARMReader's answers by @ZanSara in #3526
DocumentStores
- Document Store test refactoring by @masci in #3449
- fix: support long texts for labels in ElasticsearchDocumentStore by @anakin87 in #3346
- feat: add SQLDocumentStore tests by @masci in #3517
- refactor: Refactor Weaviate tests by @masci in #3541
- refactor: Pinecone tests by @masci in #3555
- fix: write metadata to SQL Document Store when duplicate_documents!="overwrite" by @anakin87 in #3548
- fix: Elasticsearch / OpenSearch brownfield function does not incorporate meta by @tstadel in #3572
- fix: discard metadata fields if not set in Weaviate by @masci in #3578
UI / Demo
Documentation
- docs: Extend utils API docs coverage by @brandenchan in #3402
- refactor: simplify Summarizer, add Document Merger by @anakin87 in #3452
- feat: introduce proposal design process by @masci in #3333
Other Changes
- fix: Update env variable for model caching timeout by @sjrl in #3405
- feat: Add exponential backoff decorator; apply it to OpenAI requests by @vblagoje in #3398
- fix: improve Document __repr__ by @anakin87 in #3385
- fix: disabling telemetry prevents writing config by @julian-risch in #3465
- refactor: Change no_answer attribute by @anakin87 in #3411
- feat: Speed up reader tests by @sjrl in #3476
- fix: pattern to match tags push by @masci in #3469
- fix: using onnx converter on XLMRoberta architecture by @sjrl in #3470
- feat: Add headline extraction to ParsrConverter by @bogdankostic in #3488
- refactor: upgrade actions version by @ZanSara in #3506
- docs: Update docker readme by @brandenchan in #3531
- refactor: refactor FAISS tests by @masci in #3537
- feat: include error message in HaystackError telemetry events by @vblagoje in #3543
- fix: [rest_api] support TableQA in the endpoint /documents/get_by_filters by @ju-gu in #3551
- bug: fix release number by @mayankjobanputra in #3559
- refactor: Generate JSON schema when missing by @masci in #3533
New Contributors
- @brunnurs made their first contribution in #3330
- @mayankjobanputra made their first contribution in #3368
Full Changelog: v1.10.0...v1.11.0rc1
v1.11.0rc1
⭐ Highlights
Expanding Haystack’s LLM support further with the new CohereEmbeddingEncoder (#3453)
Now you can easily create document and query embeddings using Cohere’s large language models: if you have a Cohere account, all you have to do is set the name of one of the supported models (small, medium, or large) and add your API key to the EmbeddingRetriever component in your pipelines (see docs).
Extracting headlines from Markdown and PDF files (#3445 #3488)
Using the MarkdownConverter or the ParsrConverter, you can set the parameter extract_headlines to True to extract the headlines out of your files together with their start position in the file and their level. Headlines are stored as a list of dictionaries in the Document's meta field "headlines" and are structured as follows:
{
    "headline": <THE HEADLINE STRING>,
    "start_idx": <IDX OF HEADLINE START IN document.content>,
    "level": <LEVEL OF THE HEADLINE>
}
Introducing the proposals design process (#3333)
We've introduced the proposal design process for substantial changes. A proposal is a single Markdown file that explains why a change is needed and how it would be implemented. You can find a detailed explanation of the process and a proposal template in the proposals directory.
⚠️ Breaking change: removing Milvus1DocumentStore
From this version onwards, Haystack no longer supports version 1 of Milvus. We still support Milvus version 2. We removed Milvus1DocumentStore and renamed Milvus2DocumentStore to MilvusDocumentStore.
What's Changed
Breaking Changes
- bug: removed duplicated meta "name" field addition to content before embedding in update_embeddings workflow by @mayankjobanputra in #3368
- BREAKING CHANGE: remove Milvus1DocumentStore along with support for Milvus < 2.x by @masci in #3552
Pipeline
- fix: Fix the error of wrong page numbers when documents contain empty pages. by @brunnurs in #3330
- bug: change type of split_by to Literal including None by @julian-risch in #3389
- Fix: update pyworld pin by @anakin87 in #3435
- feat: send event if number of queries exceeds threshold by @vblagoje in #3419
- Feat: allow decreasing size of datasets loaded from BEIR by @ugm2 in #3392
- feat: add __contains__ to Span by @ZanSara in #3446
- Bug: Fix prompt length computation by @Timoeller in #3448
- Add indexing pipeline type by @vblagoje in #3461
- fix: warning if doc store similarity function is incompatible with Sentence Transformers model by @anakin87 in #3455
- feat: Add CohereEmbeddingEncoder to EmbeddingRetriever by @vblagoje in #3453
- feat: Extraction of headlines in markdown files by @bogdankostic in #3445
- bug: replace decorator with counter attribute for pipeline event by @julian-risch in #3462
- feat: add document_store to all BaseRetriever.retrieve() and BaseRetriever.retrieve_batch() implementations by @ZanSara in #3379
- refactor: TableReader by @sjrl in #3456
- fix: do not reference package directory in PDFToTextOCRConverter.convert() by @ZanSara in #3478
- feat: Create the TextIndexingPipeline by @brandenchan in #3473
- refactor: remove YAML save/load methods for subclasses of BaseStandardPipeline by @ZanSara in #3443
- fix: strip whitespaces safely from FARMReader's answers by @ZanSara in #3526
DocumentStores
- Document Store test refactoring by @masci in #3449
- fix: support long texts for labels in ElasticsearchDocumentStore by @anakin87 in #3346
- feat: add SQLDocumentStore tests by @masci in #3517
- refactor: Refactor Weaviate tests by @masci in #3541
- refactor: Pinecone tests by @masci in #3555
- fix: write metadata to SQL Document Store when duplicate_documents!="overwrite" by @anakin87 in #3548
- fix: Elasticsearch / OpenSearch brownfield function does not incorporate meta by @tstadel in #3572
- fix: discard metadata fields if not set in Weaviate by @masci in #3578
UI / Demo
Documentation
- docs: Extend utils API docs coverage by @brandenchan in #3402
- refactor: simplify Summarizer, add Document Merger by @anakin87 in #3452
- feat: introduce proposal design process by @masci in #3333
Other Changes
- fix: Update env variable for model caching timeout by @sjrl in #3405
- feat: Add exponential backoff decorator; apply it to OpenAI requests by @vblagoje in #3398
- fix: improve Document __repr__ by @anakin87 in #3385
- fix: disabling telemetry prevents writing config by @julian-risch in #3465
- refactor: Change no_answer attribute by @anakin87 in #3411
- feat: Speed up reader tests by @sjrl in #3476
- fix: pattern to match tags push by @masci in #3469
- fix: using onnx converter on XLMRoberta architecture by @sjrl in #3470
- feat: Add headline extraction to ParsrConverter by @bogdankostic in #3488
- refactor: upgrade actions version by @ZanSara in #3506
- docs: Update docker readme by @brandenchan in #3531
- refactor: refactor FAISS tests by @masci in #3537
- feat: include error message in HaystackError telemetry events by @vblagoje in #3543
- fix: [rest_api] support TableQA in the endpoint /documents/get_by_filters by @ju-gu in #3551
- bug: fix release number by @mayankjobanputra in #3559
- refactor: Generate JSON schema when missing by @masci in #3533
New Contributors
- @brunnurs made their first contribution in #3330
- @mayankjobanputra made their first contribution in #3368
Full Changelog: v1.10.0...v1.11.0rc1
v1.10.0
⭐ Highlights
Expanding Haystack's LLM support with the new OpenAIEmbeddingEncoder (#3356)
Now you can easily create document and query embeddings using large language models: if you have an OpenAI account, all you have to do is set the name of one of the supported models (ada, babbage, davinci or curie) and add your API key to the EmbeddingRetriever component in your pipelines (see docs).
Multimodal retrieval is here! (#2891)
Multimodality with Haystack just made a big leap forward with the addition of MultiModalRetriever: a Retriever that can handle different modalities for query and documents independently. Take it for a spin and experiment with new Document formats, like images. You can now use the same Retriever for text-to-image, text-to-table, and text-to-text retrieval but also image similarity, table similarity, and more! Feed your favorite multimodal model to MultiModalRetriever and see it in action.
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes.retriever.multimodal import MultiModalRetriever

retriever = MultiModalRetriever(
    document_store=InMemoryDocumentStore(embedding_dim=512),
    query_embedding_model="sentence-transformers/clip-ViT-B-32",
    query_type="text",
    document_embedding_models={"image": "sentence-transformers/clip-ViT-B-32"},
)
Multi-platform Docker images
Starting with 1.10, we're making the deepset/haystack images available for linux/amd64 and linux/arm64.
⚠️ Breaking change in embed_queries method (#3252)
We've renamed the text argument of the embed_queries method in DensePassageRetriever and EmbeddingRetriever to queries.
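If you call embed_queries from your own code, the rename means keyword calls using text=... need to be updated. One possible way to bridge both spellings during a migration is a small wrapper like the one below. It is only a sketch: embed_queries_compat and FakeRetriever are hypothetical names used for illustration, and the stand-in retriever returns dummy vectors instead of real embeddings.

```python
import warnings
from typing import Any, List, Optional


def embed_queries_compat(
    retriever: Any,
    queries: Optional[List[str]] = None,
    text: Optional[List[str]] = None,
):
    """Call `retriever.embed_queries` with the new keyword, tolerating the old one."""
    if text is not None:
        # Old spelling: warn and map it onto the new keyword.
        warnings.warn("`text` was renamed to `queries`", DeprecationWarning, stacklevel=2)
        if queries is None:
            queries = text
    if queries is None:
        raise TypeError("embed_queries_compat() requires `queries`")
    return retriever.embed_queries(queries=queries)


class FakeRetriever:
    """Stand-in for a real retriever (assumption): one dummy number per query."""

    def embed_queries(self, queries: List[str]) -> List[List[float]]:
        return [[float(len(query))] for query in queries]


new_style = embed_queries_compat(FakeRetriever(), queries=["ab"])
old_style = embed_queries_compat(FakeRetriever(), text=["abc"])
```

Once all call sites pass queries, the wrapper can be dropped and the retriever called directly.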
What's Changed
Breaking Changes
Pipeline
- fix: ONNX FARMReader model conversion is broken by @vblagoje in #3211
- bug: JoinDocuments nodes produce incorrect results if preceded by another JoinDocuments node by @JeffRisberg in #3170
- fix: eval() with add_isolated_node_eval=True breaks if no node supports it by @tstadel in #3347
- feat: extract label aggregation by @tstadel in #3363
- feat: Add OpenAIEmbeddingEncoder to EmbeddingRetriever by @vblagoje in #3356
- fix: stable YAML schema generation by @ZanSara in #3388
- fix: Update how schema is ordered by @sjrl in #3399
- feat: MultiModalRetriever by @ZanSara in #2891
DocumentStores
- feat: FAISS in OpenSearch: Support HNSW for cosine by @tstadel in #3217
- feat: add support for Elasticsearch 7.16.2 by @masci in #3318
- refactor: remove dead code from FAISSDocumentStore by @anakin87 in #3372
- fix: allow same vector_id in different indexes for SQL-based Document stores by @anakin87 in #3383
UI / Demo
Documentation
- docs: Fix a docstring in ray.py by @tanertopal in #3282
Other Changes
- refactor: make TransformersDocumentClassifier output consistent between different types of classification by @anakin87 in #3224
- Classify pipeline's type based on its components by @vblagoje in #3132
- docs: sync Haystack API with Readme by @brandenchan in #3223
- fix: MostSimilarDocumentsPipeline doesn't have pipeline property by @vblagoje in #3265
- bug: make ElasticSearchDocumentStore use batch_size in get_documents_by_id by @anakin87 in #3166
- refactor: better tests for TransformersDocumentClassifier by @anakin87 in #3270
- fix: AttributeError in TranslationWrapperPipeline by @nickchomey in #3290
- refactor: remove Inferencer multiprocessing by @vblagoje in #3283
- fix: opensearch script score with filters by @tstadel in #3321
- feat: Adding filters param to MostSimilarDocumentsPipeline run and run_batch by @JacdDev in #3301
- feat: add multi-platform Docker images by @masci in #3354
- fix: Added checks for DataParallel and WrappedDataParallel by @sjrl in #3366
- fix: QuestionGenerator generates wrong document questions for non-default num_queries_per_doc parameter by @vblagoje in #3381
- bug: Adds better way of checking query in BaseRetriever and Pipeline.run() by @ugm2 in #3304
- feat: Updated EntityExtractor to handle long texts and added better postprocessing by @sjrl in #3154
- docs: Add comment about the generation of no-answer samples in FARMReader training by @brandenchan in #3404
- feat: Speed up integration tests (nodes) by @sjrl in #3408
- fix: Fix the error of wrong page numbers when documents contain empty pages. by @brunnurs in #3330
- bug: change type of split_by to Literal including None by @julian-risch in #3389
- feat: Add exponential backoff decorator; apply it to OpenAI requests by @vblagoje in #3398
New Contributors
- @tanertopal made their first contribution in #3282
- @JeffRisberg made their first contribution in #3170
- @JacdDev made their first contribution in #3301
- @hsm207 made their first contribution in #3351
- @ugm2 made their first contribution in #3304
- @brunnurs made their first contribution in #3330
Full Changelog: v1.9.1...v1.10.0rc1
v1.10.0rc1
⭐ Highlights
Expanding Haystack's LLM support with the new OpenAIEmbeddingEncoder (#3356)
Now you can easily create document and query embeddings using large language models: if you have an OpenAI account, all you have to do is set the name of one of the supported models (ada, babbage, davinci or curie) and add your API key to the EmbeddingRetriever component in your pipelines.
Multimodal retrieval is here! (#2891)
Multimodality with Haystack just made a big leap forward with the addition of MultiModalRetriever: a Retriever that can handle different modalities for query and documents independently. Take it for a spin and experiment with new Document formats, like images. You can now use the same Retriever for text-to-image, text-to-table, and text-to-text retrieval but also image similarity, table similarity, and more! Feed your favorite multimodal model to MultiModalRetriever and see it in action.
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes.retriever.multimodal import MultiModalRetriever

retriever = MultiModalRetriever(
    document_store=InMemoryDocumentStore(embedding_dim=512),
    query_embedding_model="sentence-transformers/clip-ViT-B-32",
    query_type="text",
    document_embedding_models={"image": "sentence-transformers/clip-ViT-B-32"},
)
Multi-platform Docker images
Starting with 1.10, we're making the deepset/haystack images available for linux/amd64 and linux/arm64.
⚠️ Breaking change in embed_queries method (#3252)
We've renamed the text argument of the embed_queries method in DensePassageRetriever and EmbeddingRetriever to queries.
What's Changed
Breaking Changes
Pipeline
- fix: ONNX FARMReader model conversion is broken by @vblagoje in #3211
- bug: JoinDocuments nodes produce incorrect results if preceded by another JoinDocuments node by @JeffRisberg in #3170
- fix: eval() with add_isolated_node_eval=True breaks if no node supports it by @tstadel in #3347
- feat: extract label aggregation by @tstadel in #3363
- feat: Add OpenAIEmbeddingEncoder to EmbeddingRetriever by @vblagoje in #3356
- fix: stable YAML schema generation by @ZanSara in #3388
- fix: Update how schema is ordered by @sjrl in #3399
- feat: MultiModalRetriever by @ZanSara in #2891
DocumentStores
- feat: FAISS in OpenSearch: Support HNSW for cosine by @tstadel in #3217
- feat: add support for Elasticsearch 7.16.2 by @masci in #3318
- refactor: remove dead code from FAISSDocumentStore by @anakin87 in #3372
- fix: allow same vector_id in different indexes for SQL-based Document stores by @anakin87 in #3383
UI / Demo
Documentation
- docs: Fix a docstring in ray.py by @tanertopal in #3282
Other Changes
- refactor: make TransformersDocumentClassifier output consistent between different types of classification by @anakin87 in #3224
- Classify pipeline's type based on its components by @vblagoje in #3132
- docs: sync Haystack API with Readme by @brandenchan in #3223
- fix: MostSimilarDocumentsPipeline doesn't have pipeline property by @vblagoje in #3265
- bug: make ElasticSearchDocumentStore use batch_size in get_documents_by_id by @anakin87 in #3166
- refactor: better tests for TransformersDocumentClassifier by @anakin87 in #3270
- fix: AttributeError in TranslationWrapperPipeline by @nickchomey in #3290
- refactor: remove Inferencer multiprocessing by @vblagoje in #3283
- fix: opensearch script score with filters by @tstadel in #3321
- feat: Adding filters param to MostSimilarDocumentsPipeline run and run_batch by @JacdDev in #3301
- feat: add multi-platform Docker images by @masci in #3354
- fix: Added checks for DataParallel and WrappedDataParallel by @sjrl in #3366
- fix: QuestionGenerator generates wrong document questions for non-default num_queries_per_doc parameter by @vblagoje in #3381
- bug: Adds better way of checking query in BaseRetriever and Pipeline.run() by @ugm2 in #3304
- feat: Updated EntityExtractor to handle long texts and added better postprocessing by @sjrl in #3154
- docs: Add comment about the generation of no-answer samples in FARMReader training by @brandenchan in #3404
- feat: Speed up integration tests (nodes) by @sjrl in #3408
- fix: Fix the error of wrong page numbers when documents contain empty pages. by @brunnurs in #3330
- bug: change type of split_by to Literal including None by @julian-risch in #3389
- feat: Add exponential backoff decorator; apply it to OpenAI requests by @vblagoje in #3398
New Contributors
- @tanertopal made their first contribution in #3282
- @JeffRisberg made their first contribution in #3170
- @JacdDev made their first contribution in #3301
- @hsm207 made their first contribution in #3351
- @ugm2 made their first contribution in #3304
- @brunnurs made their first contribution in #3330
Full Changelog: v1.9.1...v1.10.0rc1
v1.9.1
What's Changed
- fix: Allow less restrictive values for parameters in Pipeline configurations by @bogdankostic in #3345
Full Changelog: v1.9.0...v1.9.1rc1