v0.8.0
⭐ Highlights
This is a major Haystack release with many new features. The release blog post has a detailed summary. Below are the top highlights:
Milvus Document Store
Milvus is an open-source vector database. With the MilvusDocumentStore contributed by @lalitpagaria, embedding-based Retrievers like the DensePassageRetriever or EmbeddingRetriever can use production-ready Milvus servers for large-scale deployments.
Knowledge Graph
An experimental integration for knowledge graphs is introduced using GraphDB. The GraphDBKnowledgeGraph stores triples and executes SPARQL queries. It can be integrated with Text2SparqlRetriever to convert natural language queries to SPARQL.
Pipeline configuration with YAML
Pipelines can now be configured with YAML. This makes it easier to share query & indexing configurations, reproduce setups, A/B test Pipelines, and move from development to the production environment.
REST APIs
The REST APIs are revamped to use Pipelines for querying and for indexing files. The YAML configurations are in rest_api/pipelines.yaml. The new API endpoints are more generic to accommodate custom Pipeline configurations.
Confidence Scores
Answers now have a probability score that is better calibrated to the model's confidence. It ranges from 0 to 1, where 0 signifies very low confidence and 1 signifies very high confidence.
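
As a rough illustration of how a calibrated score could be used downstream, the sketch below keeps only the answers above a confidence threshold. It assumes answers are plain dicts with a probability key, as in the /query response format; the helper name filter_confident, the 0.6 threshold, and the sample answers are made up for this example and are not part of Haystack.

```python
from typing import Any, Dict, List


def filter_confident(answers: List[Dict[str, Any]], threshold: float = 0.5) -> List[Dict[str, Any]]:
    """Keep answers whose probability is at least `threshold`, most confident first."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie within the 0-1 range of the probability score")
    kept = [a for a in answers if a.get("probability", 0.0) >= threshold]
    return sorted(kept, key=lambda a: a["probability"], reverse=True)


# Made-up answers, only for demonstration.
answers = [
    {"answer": "Hamburg", "probability": 0.64},
    {"answer": "Munich", "probability": 0.31},
    {"answer": "Berlin", "probability": 0.92},
]
confident = filter_confident(answers, threshold=0.6)
print([a["answer"] for a in confident])
```

A suitable threshold depends on the model and the data, so it is worth tuning on labeled examples rather than fixing it up front.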
Web Crawler
A Selenium-based web crawler is now part of Haystack, thanks to @DIVYA-19 for the contribution. It takes a list of URLs as input and converts the extracted text to Haystack Documents.
⚠️ Breaking Changes
REST APIs
The REST APIs got a major revamp with this release.
- The /doc-qa & /faq-qa endpoints are replaced with a more generic POST /query endpoint. This new endpoint uses Pipelines under the hood, which can be configured at rest_api/pipeline.yaml.
- The new /query endpoint expects a single query per request instead of a list of query strings.
  The new request format is:

      { "query": "Why did the revenue change?" }

  and the response looks like this:

      {
        "query": "Why did the revenue change?",
        "answers": [
          {
            "answer": "rapid technological change and evolving industry standards",
            "question": null,
            "score": 0.543937623500824,
            "probability": 0.014070278964936733,
            "context": "tion process. The market for our products is intensely competitive and is characterized by rapid technological change and evolving industry standards.",
            "offset_start": 91,
            "offset_end": 149,
            "offset_start_in_doc": 511,
            "offset_end_in_doc": 569,
            "document_id": "f30273b2-4d49-40d8-8824-43b3b6a0ea57",
            "meta": { "_split_id": "7" }
          },
          { // other answers }
        ]
      }

- The /doc-qa-feedback & /faq-qa-feedback endpoints are replaced with a new generic /feedback endpoint.
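
For clients that used to send several queries at once, a migration could look roughly like the sketch below: it emits one JSON request body per query and picks the most probable answer from a response. The function names build_query_requests and best_answer are hypothetical, the sample response is a trimmed-down, illustrative version of the format shown above, and actually sending the bodies to POST /query is left to the HTTP client of your choice.

```python
import json
from typing import Any, Dict, List, Optional


def build_query_requests(queries: List[str]) -> List[str]:
    """Turn a legacy list of query strings into one JSON request body per query."""
    return [json.dumps({"query": q}) for q in queries]


def best_answer(response_body: str) -> Optional[Dict[str, Any]]:
    """Return the answer with the highest probability, or None if there are no answers."""
    answers = json.loads(response_body).get("answers", [])
    return max(answers, key=lambda a: a["probability"], default=None)


# One request body per query instead of a single request carrying a list.
bodies = build_query_requests(["Why did the revenue change?", "Who is the CEO?"])

# Trimmed-down response in the new format; the values are illustrative.
sample_response = json.dumps({
    "query": "Why did the revenue change?",
    "answers": [
        {"answer": "evolving industry standards", "probability": 0.009},
        {"answer": "rapid technological change", "probability": 0.014},
    ],
})
top = best_answer(sample_response)
print(bodies[0])
print(top["answer"])
```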
Created At Timestamp
Previously, all documents/labels in SQLDocumentStore and FAISSDocumentStore had a field called created to store the creation timestamp, while ElasticsearchDocumentStore did not have any timestamp field. Now, all document stores have a created_at field for documents and labels.
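
Code that reads exported documents or labels as plain dicts may still look for the old key. A minimal sketch of a compatibility helper follows, assuming records are ordinary dicts; the function name rename_created_field and the sample record are illustrative and not part of Haystack, and the sketch does not touch any database schema.

```python
from typing import Any, Dict


def rename_created_field(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `record` with the legacy `created` key renamed to `created_at`."""
    migrated = dict(record)
    if "created" in migrated:
        legacy_value = migrated.pop("created")
        # An existing created_at value wins over the legacy one.
        migrated.setdefault("created_at", legacy_value)
    return migrated


# Made-up record, only for demonstration.
legacy = {"id": "doc-1", "text": "Hello", "created": "2021-03-01 12:00:00"}
migrated = rename_created_field(legacy)
print(migrated)
```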
RAGenerator
The top_k_answers parameter in the RAGenerator is renamed to top_k for consistency across Haystack components.
Custom Query for Elasticsearch
The placeholder terms in custom_query should not have quotes around them. For details, see #762 (Remove quotes around placeholders in Elasticsearch custom query) under Detailed Changes.
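
One way to see why unquoted placeholders are convenient: if a placeholder is filled with a JSON-encoded value, the encoder already supplies the quotes and escapes special characters. The sketch below demonstrates this with string.Template from the Python standard library. The ${query} placeholder syntax and the query body are assumptions made for illustration, and the sketch does not claim to mirror how ElasticsearchDocumentStore performs the substitution internally.

```python
import json
from string import Template

# Hypothetical custom query: the ${query} placeholder has no quotes around it.
custom_query = Template("""
{
    "query": {
        "match": {
            "text": ${query}
        }
    }
}
""")

user_query = 'What is "Haystack"?'

# json.dumps adds the surrounding quotes and escapes the inner ones,
# so the substituted body is valid JSON.
body = custom_query.substitute(query=json.dumps(user_query))
parsed = json.loads(body)
print(parsed["query"]["match"]["text"])
```

With a quoted placeholder, the same substitution would produce doubled quotes and the body would no longer parse as JSON.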
🤓 Detailed Changes
Pipeline
- Fix execution of Pipelines with parallel nodes #901 (@oryx1729)
- Add abstract run method to basecomponent #887 (@tholor)
- Add support for parallel paths in Pipeline #884 (@oryx1729)
- Add runtime parameters to component initialization #873 (@oryx1729)
- Add support for indexing pipelines #816 (@oryx1729)
- Adding translator with many generic input parameter support #782 (@lalitpagaria)
- Fix building Pipeline with YAML #800 (@oryx1729)
- Load Pipeline with YAML config file #785 (@oryx1729)
- Add evaluation nodes for Pipelines #904 (@brandenchan)
- Fix passing a list as parameter value in Pipeline YAML #952 (@oryx1729)
Document Store
- Fixes elasticsearch auth #871 (@grafke)
- Allow more options for elasticsearch client (auth, multiple hosts) #845 (@tholor)
- Fix ElasticsearchDocumentStore.query_by_embedding() #823 (@oryx1729)
- Introduce incremental updates for embeddings in document stores #812 (@oryx1729)
- Add method to get metadata values for a key from Elasticsearch #776 (@oryx1729)
- Fix refresh behaviour for Elasticsearch delete #794 (@oryx1729)
- Milvus integration #771 (@lalitpagaria)
- Add flag for use of window queries in SQLDocumentStore #768 (@oryx1729)
- Remove quotes around placeholders in Elasticsearch custom query #762 (@oryx1729)
- Fix delete_all_documents for the SQLDocumentStore #761 (@oryx1729)
Retriever
- Improve dpr conversion #826 (@Timoeller)
- Fix DPR training batch size #898 (@brandenchan)
- Upgrade FAISS to 1.7.0 #834 (@tholor)
- Allow non-standard Tokenizers (e.g. CamemBERT) for DPR via new arg #811 (@psorianom)
Modeling
- Add model versioning support #784 (@brandenchan)
- Improve preprocessing and adding of eval data #780 (@Timoeller)
- SQuAD to DPR dataset converter #765 (@psorianom)
- Remove RAG todos after transformers update #781 (@Timoeller)
- Update farm version #936 (@Timoeller)
REST API
- Refactor REST APIs to use Pipelines #922 (@oryx1729)
- Add PDF converter in Dockerfiles #877 (@oryx1729)
- Update GPU Dockerimage (Cuda 11, Fix faiss) #836 (@tholor)
- Add API endpoint to export accuracy metrics from user feedback + created_at timestamp #803 (@tholor)
- Fix file upload API #808 (@oryx1729)
File Converter
- Add Markdown file convertor #875 (@lalitpagaria)
- Fix encoding for pdftotext (Russian characters, German umlauts etc). Fix version in download instructions #813 (@tholor)
Crawler
Knowledge Graph
- knowledge graph example #934 (@julian-risch)
Annotation Tool
- Annotation Tool: data is not persisted when using local version #853 #855 (@venuraja79)
Search UI
CI
- Revamp CI #825 (@oryx1729)
- Fix mypy typing #792 (@oryx1729)
- Fix pdftotext dependency in CI #788 (@tholor)
Misc Fixes
- Adding indentation to markup files #947 (@julian-risch)
- Reduce precision in pipeline eval print functions #943 (@lewtun)
- Fix division by zero error in EvalRetriever #938 (@lewtun)
- Logged warning in Faiss and Milvus for filters #913 (@peteradorjan)
- fixed "cannot allocate memory" exception by specifying max_processes #910 (@mosheber)
- Fix error when is_impossible not exist #870 (@voidful)
- Fix validation for split_respect_sentence_boundary in Preprocessor #869 (@oryx1729)
- Fix boolean progress_bar for disabling tqdm progressbar #863 (@tholor)
- Remove conditional import of FAISS for Windows #819 (@oryx1729)
- Make tqdm progress bars optional (less verbose prod logs) #796 (@tholor)
- Fix error when is_impossible not is_impossible and json dump encoding error #868 (@voidful)
- fix download ntlk preprocessor #852 (@mrtunguyen)
Documentation
- Add Milvus to the retriever / document store table #931 (@lewtun)
- Fixing inconsistency #926 (@guillim)
- Better default value for mp chunksize #923 (@Timoeller)
- Run Grammarly over README.md #890 (@peterdemin)
- Remove tf-idf youtube link #888 (@ms10596)
- Add Milvus Documentation #838 (@brandenchan)
- Fix link to Quick Demo in ToC. #831 (@aantti)
- Revamp Readme #820 (@brandenchan)
- Update tutorials (torch versions, ES version, replace Finder with Pipeline) #814 (@tholor)
- Choose correct similarity fns during benchmark runs & re-run benchmarks #773 (@brandenchan)
- Docs v0.7.0 #757 (@PiffPaffM)
- Fix top_k param in RAG tutorials #906 (@Timoeller)
- Integrate sentence transformers into benchmarks #843 (@Timoeller)
🙏 Thanks to our contributors
A big thank you to all the contributors for this release: @aantti, @brandenchan, @DIVYA-19, @grafke, @guillim, @julian-risch, @lalitpagaria, @lewtun, @mosheber, @mrtunguyen, @ms10596, @oryx1729, @peteradorjan, @PiffPaffM, @psorianom, @tholor, @Timoeller, @venuraja79, and @voidful.
We would like to thank everyone who participated in the insightful discussions on GitHub and our community Slack!