v0.8.0

@oryx1729 released this 13 Apr 15:04

⭐ Highlights

This is a major Haystack release with many new features. The release blog post has a detailed summary. Below are the top highlights:

Milvus Document Store

Milvus is an open-source vector database. With the MilvusDocumentStore contributed by @lalitpagaria, embedding-based Retrievers such as the DensePassageRetriever or EmbeddingRetriever can use production-ready Milvus servers for large-scale deployments.

Knowledge Graph

This release introduces an experimental knowledge graph integration based on GraphDB. The GraphDBKnowledgeGraph stores triples and executes SPARQL queries. It can be combined with the Text2SparqlRetriever to convert natural language queries to SPARQL.

Pipeline configuration with YAML

Pipelines can now be configured with YAML. This makes it easier to share query and indexing configurations, reproduce setups, A/B test Pipelines, and move from development to production.

REST APIs

The REST APIs are revamped to use Pipelines for querying and for indexing files. The YAML configuration lives in rest_api/pipelines.yaml. The new API endpoints are more generic so that they can accommodate custom Pipeline configurations.

Confidence Scores

Answers now have a probability score that is better calibrated to the model's confidence. It ranges from 0 to 1, where 0 signifies very low confidence and 1 very high confidence.
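
Because the score lies between 0 and 1, a caller can compare it against a fixed cutoff. The following is a minimal sketch, assuming answers are plain dicts with a probability key as in the /query response; filter_confident and the 0.5 default are hypothetical and not part of Haystack.

```python
from typing import Any, Dict, List


def filter_confident(answers: List[Dict[str, Any]], threshold: float = 0.5) -> List[Dict[str, Any]]:
    """Keep answers whose probability meets the threshold (hypothetical helper)."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0, 1]")
    # Answers without a probability are treated as having zero confidence.
    return [a for a in answers if a.get("probability", 0.0) >= threshold]


# Illustrative data only; these values are made up for the sketch.
answers = [
    {"answer": "A", "probability": 0.91},
    {"answer": "B", "probability": 0.07},
]
confident = filter_confident(answers)
```

A suitable cutoff depends on the model and the data, so the default here is only a placeholder.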

Web Crawler

A Selenium-based web crawler is now part of Haystack, thanks to a contribution from @DIVYA-19. It takes a list of URLs as input and converts the extracted text to Haystack Documents.

⚠️ Breaking Changes

REST APIs

The REST APIs got a major revamp with this release.

  • The /doc-qa & /faq-qa endpoints are replaced with a more generic POST /query endpoint. This new endpoint uses Pipelines under the hood, which can be configured in rest_api/pipelines.yaml.

  • The new /query endpoint expects a single query per request instead of a list of query strings.
    The new request format is:

    {
        "query": "Why did the revenue change?"
    }

    and the response looks like this:

    {
        "query": "Why did the revenue change?",
        "answers": [
            {
                "answer": "rapid technological change and evolving industry standards",
                "question": null,
                "score": 0.543937623500824,
                "probability": 0.014070278964936733,
                "context": "tion process. The market for our products is intensely competitive and is characterized by rapid technological change and evolving industry standards.",
                "offset_start": 91,
                "offset_end": 149,
                "offset_start_in_doc": 511,
                "offset_end_in_doc": 569,
                "document_id": "f30273b2-4d49-40d8-8824-43b3b6a0ea57",
                "meta": {
                    "_split_id": "7"
                }
            },
            {
                // other answers
            }
        ]
    }
  • The /doc-qa-feedback & /faq-qa-feedback endpoints are replaced with a new generic /feedback endpoint.
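
The request and response shapes above can be exercised with the standard library alone. The sketch below builds the POST body for /query and checks that, for the sample answer, offset_start and offset_end index into context. The base URL http://localhost:8000 and the helper names are assumptions, and the request is constructed but never sent.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: where the REST API is served


def build_query_request(query: str) -> urllib.request.Request:
    """Build (but do not send) a POST /query request carrying a single query."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def answer_span(answer: dict) -> str:
    """Slice the answer text out of its context using the answer's offsets."""
    return answer["context"][answer["offset_start"]:answer["offset_end"]]


request = build_query_request("Why did the revenue change?")

# Trimmed copy of the first answer in the sample response (single-spaced context).
sample_answer = {
    "answer": "rapid technological change and evolving industry standards",
    "context": (
        "tion process. The market for our products is intensely competitive "
        "and is characterized by rapid technological change and evolving "
        "industry standards."
    ),
    "offset_start": 91,
    "offset_end": 149,
}
span = answer_span(sample_answer)
```

To send the request against a running server, pass it to urllib.request.urlopen and decode the body with json.loads.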

Created At Timestamp

Previously, all documents/labels in SQLDocumentStore and FAISSDocumentStore had a field called created to store the creation timestamp, while ElasticsearchDocumentStore did not have any timestamp field. Now, all document stores have a created_at field for documents and labels.
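
For code that reads exported records directly, the rename amounts to moving one key. The following is a minimal sketch over plain dicts; migrate_timestamp is a hypothetical helper, the record is made up, and nothing here describes how Haystack itself migrates an existing database.

```python
from typing import Any, Dict


def migrate_timestamp(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with a legacy `created` key renamed to `created_at`.

    If `created_at` is already present, the record is copied unchanged.
    """
    migrated = dict(record)
    if "created" in migrated and "created_at" not in migrated:
        migrated["created_at"] = migrated.pop("created")
    return migrated


# Illustrative record only.
legacy = {"id": "doc-1", "created": "2021-04-13 15:04:00"}
migrated = migrate_timestamp(legacy)
```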

RAGenerator

The top_k_answers parameter in the RAGenerator is renamed to top_k for consistency across Haystack components.

Custom Query for Elasticsearch

The placeholder terms in custom_query should not have quotes around them. See more details here.
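
One way to see why the quotes are unnecessary: if each value is JSON-encoded before substitution, a string already arrives with its own quotes and a list arrives as a JSON array. The sketch below shows this with string.Template from the standard library. The query body, the field names, and render_custom_query are assumptions for illustration, and Haystack's internal substitution may differ.

```python
import json
from string import Template

# Placeholders are written without surrounding quotes.
CUSTOM_QUERY = """
{
    "size": 10,
    "query": {
        "bool": {
            "should": [{"multi_match": {"query": ${query}, "fields": ["text", "title"]}}],
            "filter": [{"terms": {"year": ${years}}}]
        }
    }
}
"""


def render_custom_query(template: str, **values) -> dict:
    """JSON-encode each value, substitute it into the template, and parse the result."""
    encoded = {key: json.dumps(value) for key, value in values.items()}
    return json.loads(Template(template).substitute(**encoded))


rendered = render_custom_query(
    CUSTOM_QUERY,
    query='Why did "revenue" change?',
    years=["2019", "2020"],
)
```

With quoted placeholders, the same substitution would wrap the already-quoted string in a second pair of quotes and turn the list into invalid JSON.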

🤓 Detailed Changes

Pipeline

Document Store

  • Fixes elasticsearch auth #871 (@grafke)
  • Allow more options for elasticsearch client (auth, multiple hosts) #845 (@tholor)
  • Fix ElasticsearchDocumentStore.query_by_embedding() #823 (@oryx1729)
  • Introduce incremental updates for embeddings in document stores #812 (@oryx1729)
  • Add method to get metadata values for a key from Elasticsearch #776 (@oryx1729)
  • Fix refresh behaviour for Elasticsearch delete #794 (@oryx1729)
  • Milvus integration #771 (@lalitpagaria)
  • Add flag for use of window queries in SQLDocumentStore #768 (@oryx1729)
  • Remove quotes around placeholders in Elasticsearch custom query #762 (@oryx1729)
  • Fix delete_all_documents for the SQLDocumentStore #761 (@oryx1729)

Retriever

Modeling

REST API

File Converter

  • Add Markdown file converter #875 (@lalitpagaria)
  • Fix encoding for pdftotext (Russian characters, German umlauts etc). Fix version in download instructions #813 (@tholor)

Crawler

Knowledge Graph

Annotation Tool

Search UI

  • Fix UI when API returns fewer answers than expected #828 (@tholor)

CI

Misc Fixes

  • Adding indentation to markup files #947 (@julian-risch)
  • Reduce precision in pipeline eval print functions #943 (@lewtun)
  • Fix division by zero error in EvalRetriever #938 (@lewtun)
  • Logged warning in Faiss and Milvus for filters #913 (@peteradorjan)
  • Fix "cannot allocate memory" exception by specifying max_processes #910 (@mosheber)
  • Fix error when is_impossible does not exist #870 (@voidful)
  • Fix validation for split_respect_sentence_boundary in Preprocessor #869 (@oryx1729)
  • Fix boolean progress_bar for disabling tqdm progressbar #863 (@tholor)
  • Remove conditional import of FAISS for Windows #819 (@oryx1729)
  • Make tqdm progress bars optional (less verbose prod logs) #796 (@tholor)
  • Fix error when is_impossible not is_impossible and json dump encoding error #868 (@voidful)
  • Fix download nltk preprocessor #852 (@mrtunguyen)

Documentation

🙏 Thanks to our contributors

A big thank you to all the contributors for this release: @aantti, @brandenchan, @DIVYA-19, @grafke, @guillim, @julian-risch, @lalitpagaria, @lewtun, @mosheber, @mrtunguyen, @ms10596, @oryx1729, @peteradorjan, @PiffPaffM, @psorianom, @tholor, @Timoeller, @venuraja79, and @voidful.

We would like to thank everyone who participated in the insightful discussions on GitHub and our community Slack!