20 Jan 16:24

c6f23dc

v1.1.0

⭐ Highlights

Model Distillation for Reader Models

With the new model distillation features, you don't need to choose between accuracy and speed! Now you can compress a large reader model (teacher) into a smaller model (student) while retaining most of the teacher's performance. For example, deepset/tinybert-6l-768d-squad2 is twice as fast as bert-base with an F1 reduction of only 2%.

To distil your own model, just follow these steps:

Call python augment_squad.py --squad_path <your dataset> --output_path <output> --multiplication_factor 20 where augment_squad.py is our data augmentation script.
Run student.distil_intermediate_layers_from(teacher, data_dir="dataset", train_filename="augmented_dataset.json") where student is a small model and teacher is a highly accurate, larger reader model.
Run student.distil_prediction_layer_from(teacher, data_dir="dataset", train_filename="dataset.json") with the same teacher and student.

For more information on what kinds of students and teachers you can use and on model distillation in general, just take a look at this guide.
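To give a feel for what "retaining the teacher's performance" means during training, here is a minimal, library-free sketch of a soft-target distillation loss. It is not Haystack's implementation: the function names, the temperature value, and the use of a plain KL divergence are assumptions made for illustration, and the loss functions that ship with Haystack may differ.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) computed on temperature-softened distributions."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(teacher, student))


# Hypothetical logits for one training example.
teacher_logits = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]    # disagrees with the teacher

print(distillation_loss(close_student, teacher_logits))
print(distillation_loss(far_student, teacher_logits))
```

The student that mimics the teacher's output distribution gets the smaller loss, which is the signal a distillation step would minimise.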

Integrated vs. Isolated Pipeline Evaluation Modes

When you evaluate a pipeline, you can now use two different evaluation modes and create an automatic report that shows the results of both. The integrated evaluation (default) shows what result quality users will experience when running the pipeline. The isolated evaluation mode additionally shows what the maximum result quality of a node could be if it received the perfect input from the preceding node. That way, you can find out whether the retriever or the reader in an ExtractiveQAPipeline is the bottleneck.

eval_result = pipeline.eval(labels=eval_labels, add_isolated_node_eval=True)
pipeline.print_eval_report(eval_result)

================== Evaluation Report ==================
=======================================================
                      Query
                        |
                      Retriever
                        |
                        | recall_single_hit:   ...
                        |
                      Reader
                        |
                        | f1 upper  bound:   0.78
                        | f1:   0.65
                        |
                      Output

As the upper-bound F1 score of the reader differs a lot from its actual F1 score in this report, you would need to improve the predictions of the retriever node to achieve the full performance of this pipeline. Our updated evaluation tutorial lists all the steps to generate an evaluation report with all the metrics you need and the upper bounds for each individual node. The guide explains the two evaluation modes in detail.
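The difference between the two modes can be sketched without Haystack. In the toy below, the retriever, the reader, the corpus, and the exact-match metric are all hypothetical stand-ins; the only point is that the isolated score feeds the reader the gold document, while the integrated score feeds it whatever the retriever returned.

```python
import re


def tokens(text):
    return re.findall(r"\w+", text.lower())


def toy_retriever(query, corpus):
    """Return the document sharing the most distinct words with the query."""
    query_words = set(tokens(query))
    return max(corpus, key=lambda doc: len(query_words & set(tokens(doc))))


def toy_reader(document):
    """Naively answer with the last word of the document."""
    return re.findall(r"\w+", document)[-1]


def evaluate(samples, corpus):
    """Return (integrated accuracy, isolated upper-bound accuracy) of the reader."""
    integrated_hits = 0
    isolated_hits = 0
    for query, gold_document, gold_answer in samples:
        # Integrated: the reader sees what the retriever actually returned.
        integrated_hits += toy_reader(toy_retriever(query, corpus)) == gold_answer
        # Isolated: the reader sees the perfect input, i.e. the gold document.
        isolated_hits += toy_reader(gold_document) == gold_answer
    return integrated_hits / len(samples), isolated_hits / len(samples)


corpus = [
    "The capital of Germany is Berlin",
    "The capital of France is Paris",
    "Berlin hosts a large film festival every year",
]
samples = [
    ("What is the capital of Germany?", corpus[0], "Berlin"),
    ("Where is the seat of the French government?", corpus[1], "Paris"),
    ("Which city hosts the film festival?", corpus[2], "Berlin"),
]

integrated, upper_bound = evaluate(samples, corpus)
print(f"accuracy: {integrated:.2f}, upper bound: {upper_bound:.2f}")
```

A gap between the two numbers points at the retriever; a low upper bound points at the reader itself.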

Row-Column-Intersection Model for TableQA

Now you can use a Row-Column-Intersection model on your own tabular data. To try it out, just replace the declaration of your TableReader:

reader = RCIReader(row_model_name_or_path="michaelrglass/albert-base-rci-wikisql-row",
                   column_model_name_or_path="michaelrglass/albert-base-rci-wikisql-col")

The RCIReader requires two separate models: one for rows and one for columns. Working on each column and row separately allows it to be used on much larger tables. Unlike the TableReader, it is also able to return meaningful confidence scores.
Please note, however, that it currently does not support aggregations over multiple cells and that it is a bit slower than other approaches.
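The row-column-intersection idea itself fits in a few lines. The sketch below replaces the two trained models with a hypothetical word-overlap scorer, and the column scorer only looks at the header, which a trained column model would not be limited to. It illustrates the principle, not the RCIReader's actual code.

```python
import re


def overlap(question, text):
    """Toy relevance score: number of distinct words shared by question and text."""
    question_words = set(re.findall(r"\w+", question.lower()))
    text_words = set(re.findall(r"\w+", text.lower()))
    return len(question_words & text_words)


def answer_cell(question, header, rows):
    # Score every row on its own, using only that row's cell values.
    row_scores = [overlap(question, " ".join(row)) for row in rows]
    # Score every column on its own; this toy scorer only looks at the header.
    column_scores = [overlap(question, name) for name in header]
    best_row = max(range(len(rows)), key=row_scores.__getitem__)
    best_column = max(range(len(header)), key=column_scores.__getitem__)
    # The answer is the cell where the best row and the best column intersect.
    return rows[best_row][best_column]


header = ["City", "Country", "Population"]
rows = [
    ["Berlin", "Germany", "3.6 million"],
    ["Paris", "France", "2.1 million"],
    ["Rome", "Italy", "2.8 million"],
]

print(answer_cell("What is the population of Paris?", header, rows))
```

Because rows and columns are scored independently, no single model input has to hold the whole table, which is why the approach scales to larger tables.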

Advanced File Converters

Haystack now has two file converters that extract text and tables from a file (PDF or DOCX): the ParsrConverter, which is based on the open-source Parsr tool by axa-group and is new in this release, and the AzureConverter, which we improved. Both return a list of dictionaries containing one dictionary for each table detected in the file and one dictionary containing the text of the file. This format matches the document format and can be used right away for TableQA (see the guide).

converter = ParsrConverter()
docs = converter.convert(file_path="samples/pdf/sample_pdf_1.pdf")

⚠️ Breaking Changes

Custom id hashing on documentstore level by @ArzelaAscoIi in #1910
Implement proper FK in MetaDocumentORM and MetaLabelORM to work on PostgreSQL by @ZanSara in #1990

🤓 Detailed Changes

Pipeline

Extend TranslationWrapper to work with QA Generation by @julian-risch in #1905
Add nDCG to pipeline.eval()'s document metrics by @tstadel in #2008
change column order for evaluation dataframe by @ju-gu in #1957
Add isolated node eval mode in pipeline eval by @julian-risch in #1962
introduce node_input param by @tstadel in #1854
Add ParsrConverter by @bogdankostic in #1931
Add improvements to AzureConverter by @bogdankostic in #1896

Models

Prevent wrapping DataParallel in second DataParallel by @bogdankostic in #1855
Enable batch mode for SAS cross encoders by @tstadel in #1987
Add RCIReader for TableQA by @bogdankostic in #1909
distinguish intermediate layer & prediction layer distillation phases with different parameters by @MichelBartels in #2001
Add TinyBERT data augmentation by @MichelBartels in #1923
Adding distillation loss functions from TinyBERT by @MichelBartels in #1879

DocumentStores

Raise exception if Elasticsearch search_fields have wrong datatype by @tstadel in #1913
Support custom headers per request in pipeline by @tstadel in #1861
Fix retrieving documents in WeaviateDocumentStore with content_type=None by @bogdankostic in #1938
Fix Numba TypingError in normalize_embedding for cosine similarity by @bogdankostic in #1933
Fix loading a saved FAISSDocumentStore by @bogdankostic in #1937
Propagate duplicate_documents to base class initialization by @yorickvanzweeden in #1936
Fix vector_id collision in FAISS by @yorickvanzweeden in #1961
Unify vector_dim and embedding_dim parameter in Document Store by @mathew55 in #1922
Align similarity scores across document stores by @MichelBartels in #1967
Bugfix - save_to_yaml for OpenSearchDocumentStore by @ArzelaAscoIi in #2017
Fix elasticsearch scores if they are 0.0 by @tstadel in #1980

REST API

Rely api healthcheck on status code rather than json decoding by @fabiolab in #1871
Bump version in REST api by @tholor in #1875

UI / Demo

Replace SessionState with Streamlit built-in by @yorickvanzweeden in #2006
Fix demo deployment by @askainet in #1877
Add models to demo docker image by @ZanSara in #1978

Documentation

Update pydoc-markdown-file-classifier.yml by @brandenchan in #1856
Create v1.0 docs by @brandenchan in #1862
Fix typo by @amotl in #1869
Correct bug with encoding when generating Markdown documentation issue #1880 by @albertovilla in #1881
Minor typo by @javier in #1900
Fixed the grammatical issue in optimization guides #1940 by @eldhoittangeorge in #1941
update link to annotation tool docu by @julian-risch in #2005
Extend Tutorial 5 with Upper Bound Reader Eval Metrics by @julian-risch in #1995
Add distillation to finetuning tutorial by @MichelBartels in #2025
Add ndcg and eval_mode to docs by @tstadel in #2038
Remove hard-coded variables from the Tutorial 15 by @dmigo in #1984

Other Changes

upgrade transformers to 4.13.0 by @julian-risch in #1659
Fix typo in the Windows CI UI deps by @ZanSara in #1876
Exchanged min...

Contributors

javier, askainet, and 23 other contributors


08 Dec 08:05

tholor

v1.0.0

8cb513c

1.0.0

🎁 Haystack 1.0

We worked hard to bring you an early Christmas present: 1.0 is out! In the last months, we re-designed many essential parts of Haystack, introduced new features, and simplified many user-facing methods. We believe Haystack is now much easier to use and a solid base for many exciting upcoming features that we plan. This release is a major milestone on our journey with you, the community, and we want to thank you again for all the great contributions, discussions, questions, and bug reports that helped us to build a better Haystack. This journey has just started 🚀

⭐ Highlights

Improved Evaluation of Pipelines

Evaluation helps you find out how well your system is doing on your data. This includes Pipeline level evaluation to ensure that the system's output is really what you're after, but also Node level evaluation so that you can figure out whether it's your Reader or Retriever that is holding back the performance.

In this release, evaluation is much simpler and cleaner to perform. All the functionality is now baked into the Pipeline class and you can kick off the process by providing Label or MultiLabel objects to the Pipeline.eval() method.

eval_result = pipeline.eval(
    labels=labels,
    params={"Retriever": {"top_k": 5}},
)

The output is an EvaluationResult object which stores each Node's prediction for each sample in a Pandas DataFrame - so you can easily inspect granular predictions and potential mistakes without re-running the whole thing. There is an EvaluationResult.calculate_metrics() method which returns the relevant metrics for your evaluation, and you can print a convenient summary report, as the calls below show.

metrics = eval_result.calculate_metrics()

pipeline.print_eval_report(eval_result)

If you'd like to start evaluating your own systems on your own data, check out our Evaluation Tutorial!

Table QA

A lot of valuable information is stored in tables - we've heard this again and again from the community. While they are an efficient structured data format, it hasn't been possible to search for table contents using traditional NLP techniques. But now, with the new TableTextRetriever and TableReader our users have all the tools they need to query for relevant tables and perform Question Answering.

The TableTextRetriever is the result of our team's research into table retrieval methods which you can read about in this paper that was presented at EMNLP 2021. Behind the scenes, it uses three transformer-based encoders - one for text passages, one for tables, and one for the query. However, in Haystack, you can swap it in for any other dense retrieval model and start working with tables. The TableReader is built upon the TAPAS model and, when handed Documents that contain tables, it can return a single cell as an answer or perform an aggregation operation on a set of cells to form a final answer.

retriever = TableTextRetriever(
    document_store=document_store,
    query_embedding_model="deepset/bert-small-mm_retrieval-question_encoder",
    passage_embedding_model="deepset/bert-small-mm_retrieval-passage_encoder",
    table_embedding_model="deepset/bert-small-mm_retrieval-table_encoder",
    embed_meta_fields=["title", "section_title"]
)

reader = TableReader(
    model_name_or_path="google/tapas-base-finetuned-wtq",
    max_seq_len=512
)

Have a look at the Table QA documentation if you'd like to learn more or dive into the Table QA tutorial to start unlocking the information in your table data.

Improved Debugging of Pipelines & Nodes

We've made debugging much simpler and also more informative! As long as your node receives a boolean debug argument, it can propagate its input, output or even some custom information to the output of the pipeline. It is now a built-in feature of all existing nodes and can also easily be inherited by your custom nodes.

result = pipeline.run(
    query="Who is the father of Arya Stark?",
    params={
        "debug": True
    }
)

{'ESRetriever': {'input': {'debug': True,
                           'query': 'Who is the father of Arya Stark?',
                           'root_node': 'Query',
                           'top_k': 1},
                 'output': {'documents': [<Document: {'content': "\n===In the Riverlands===\nThe Stark army reaches the Twins, a bridge strong", ...}>]
                            ...}

To find out more about this feature, check out debugging. To learn how to define custom debug information, have a look at custom debugging.
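As a rough illustration of the mechanism, the sketch below wraps a node function and attaches its input and output when a debug flag is set. The wrapper, the "_debug" key name, and the toy node are assumptions made for this example and do not mirror Haystack's internal code.

```python
def run_node(name, node_fn, inputs, debug=False):
    """Run one node and, if requested, attach its input and output as debug info."""
    output = node_fn(**inputs)
    result = dict(output)
    if debug:
        # "_debug" is an assumed key name for this sketch.
        result["_debug"] = {name: {"input": dict(inputs), "output": dict(output)}}
    return result


def toy_retriever(query, top_k=1):
    """Hypothetical node: pretend to retrieve top_k documents."""
    return {"documents": [f"document about {query!r}"][:top_k]}


inputs = {"query": "Who is the father of Arya Stark?", "top_k": 1}
result = run_node("ESRetriever", toy_retriever, inputs, debug=True)
print(result["_debug"]["ESRetriever"]["input"])
```

Without the flag, the same call returns only the node's regular output, so the debug information adds no overhead to normal runs.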

FARM Migration

Those of you following Haystack from its first days will know that Haystack first evolved out of the FARM framework. While FARM is designed to handle diverse NLP models and tasks, Haystack gives full end-to-end support to search and question answering use cases with a focus on coordinating all components that take a proof-of-concept into production.

Haystack has always relied on FARM for much lower-level processing and modeling. To reduce the implementation overhead and simplify debugging, we have migrated the relevant parts of FARM into the new haystack/modeling package.

⚠️ Breaking Changes & Migration Guide

Migration to v1.0

With the release of v1.0, we decided to make some bold changes.
We believe this has brought a significant improvement in usability and makes the project more future-proof.
While this does come with a few breaking changes, we do our best to guide you on how to go from v0.x to v1.0.
For more details see the Migration Guide and if you need more guidance, just reach out via Slack.

New Package Structure & Changed Imports

Due to the ever-increasing number of Nodes and Document Stores being integrated into Haystack,
we felt the need to implement a repository structure that makes it easier to navigate to what you're looking for. We've also shortened the length of the imports.

haystack.document_stores

All Document Stores can now be directly accessed from here
Note the pluralization of document_store to document_stores

haystack.nodes

This directory directly contains any class that can be used as a node
This includes File Converters and PreProcessors

haystack.pipelines

This contains all the base, custom and pre-made pipeline classes
Note the pluralization of pipeline to pipelines

haystack.utils

Any utility functions

➡️ For the large majority of imports, the old style still works but this will be deprecated in future releases!

Primitive Objects

Instead of relying on dictionaries, Haystack now standardizes more of the inputs and outputs of Nodes using the following primitive classes:

With these, there is now support for data structures beyond text and the REST API schema is built around their structure.
Using these classes also allows for the autocompletion of fields in your IDE.

Tip: To see examples of these primitive classes being returned, have a look at Ready-Made Pipelines.

Many of the fields in these classes have also been renamed or removed.
You can see a more comprehensive list of them in this Github issue.
Below, we will go through a few cases that are likely to impact established workflows.

Input Document Format

This dictionary schema used to be the recommended way to prepare your data to be indexed.
Now we strongly recommend using our dedicated Document class as a replacement.
The text field has been renamed content to accommodate for cases where it is used for another data format,
for example in Table QA.


v0.x:

doc = {
    'text': 'DOCUMENT_TEXT_HERE',
    'meta': {'name': DOCUMENT_NAME, ...}
}

v1.0:

doc = Document(
    content='DOCUMENT_TEXT_HERE',
    meta={'name': DOCUMENT_NAME, ...}
)

From here, you can take the same steps to write Documents into your Document Store.

document_store.write_documents([doc])
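If you have a lot of v0.x-style dictionaries lying around, a small helper can do the renaming for you. The helper below is hypothetical (it is not part of Haystack) and only covers the text to content rename described above; other renamed or removed fields would need their own handling.

```python
def migrate_document_dict(old_doc):
    """Return a shallow copy of a v0.x-style dict with 'text' renamed to 'content'."""
    new_doc = dict(old_doc)
    if "text" in new_doc:
        new_doc["content"] = new_doc.pop("text")
    return new_doc


old_doc = {"text": "DOCUMENT_TEXT_HERE", "meta": {"name": "example.txt"}}
new_fields = migrate_document_dict(old_doc)
print(new_fields)

# Assumption: with Haystack v1.0 installed, these fields could then be
# passed on as keyword arguments, e.g. Document(**new_fields).
```

The original dictionary is left untouched, so you can run the helper over existing data without side effects.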

Response format of Reader

All Reader Nodes now return Answer objects instead of dictionaries.


v0.x:

[
    {
        'answer': 'Fang',
        'score': 13.26807975769043,
        'probability': 0.9657130837440491,
        'context': """Криволапик (Kryvolapyk, kryvi lapy "crooked paws")
            ===Fang (Hagrid's dog)===
            *Chinese (PRC): 牙牙 (ya2 ya) (from 牙 "tooth", 牙,"""
    }
]

v1.0:

[
    <Answer {'answer': 'Eddard', 'type': 'extractive', 'score': 0.9946763813495636, 'co...

Contributors

vblagoje, tholor, and 25 other contributors


16 Sep 08:31

tholor

v0.10.0

30dc010

v0.10.0

⭐ Highlights

🚀 Making Pipelines more scalable

You can now easily scale and distribute Haystack Pipelines thanks to the new integration of the Ray framework (https://ray.io/).
Ray allows distributing a Pipeline's components across a cluster of machines. The individual components of a Pipeline can be independently scaled. For instance, an extractive QA Pipeline deployment can have three replicas of the Reader and a single replica for the Retriever. It enables efficient resource utilization by horizontally scaling Components. You can use Ray via the new RayPipeline class (#1255)

To set the number of replicas, add replicas in the YAML config for the node in a pipeline:

components:
    ...

pipelines:
  - name: ray_query_pipeline
    type: RayPipeline
    nodes:
      - name: ESRetriever
        replicas: 2  # number of replicas to create on the Ray cluster
        inputs: [ Query ]

A RayPipeline currently can only be created with a YAML Pipeline config:

from haystack.pipeline import RayPipeline
pipeline = RayPipeline.load_from_yaml(path="my_pipelines.yaml", pipeline_name="my_query_pipeline")
pipeline.run(query="What is the capital of Germany?")

See docs for more details

😍 Making Pipelines more user-friendly

The old Pipeline design came with a couple of flaws:

Impossible to route certain parameters (e.g. top_k) to dedicated nodes
Incorrect parameters in pipeline.run() are silently swallowed
Hard to understand what is in **kwargs when working with node.run() methods
Hard to debug

We tackled those with a big refactoring of the Pipeline class and changed how data is passed between nodes #1321.
This comes now with a few breaking changes:

Component params like top_k, no_ans_boost for Pipeline.run() must be passed in a params dict

pipeline.run(query="Why?", params={"top_k":10, "no_ans_boost":0.5})

Component specific top_ks like top_k_reader, top_k_retriever are now replaced with top_k. To disambiguate, the params can be "targeted" to a specific node.

pipeline.run(query="Why?", params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}})

See breaking changes section and the docs for details
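One way to picture how such a params dict could be resolved per node is sketched below. The function and its precedence rule (node-targeted values override global ones) are assumptions for illustration; Haystack's actual routing logic may differ, for example in which nodes receive global parameters.

```python
def params_for_node(params, node_name, node_names):
    """Collect the parameters a node would receive from a pipeline-level params dict."""
    resolved = {}
    # Keys that are not node names are treated as global parameters.
    for key, value in params.items():
        if key not in node_names:
            resolved[key] = value
    # Parameters targeted at this node override global ones.
    resolved.update(params.get(node_name, {}))
    return resolved


node_names = ["Retriever", "Reader"]
params = {"top_k": 10, "Reader": {"top_k": 5}}

print(params_for_node(params, "Retriever", node_names))
print(params_for_node(params, "Reader", node_names))
```

In this sketch the Retriever falls back to the global top_k, while the Reader picks up its own targeted value.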

📈 Better evaluation metric for QA: Semantic Answer Similarity (SAS)

The evaluation of question answering models compares ground-truth annotations with model predictions. However, as of today, this comparison is mostly lexical-based and therefore misses out on answers that have no lexical overlap but are still semantically similar, thus treating correct answers as false. This underestimation of the true performance of models hinders user acceptance in applications and complicates a fair comparison of different models. Therefore, there is a need for an evaluation metric that is based on semantics instead of pure string similarity. In our recent EMNLP paper, we proposed "SAS", a cross-encoder-based metric for the estimation of semantic answer similarity. We compared it to seven existing metrics and found that it correlates better with human judgement. See our paper #1338

You can use it in Haystack like this:

...
# initialize the node with a SAS model
eval_reader = EvalAnswers(sas_model="sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

# define a pipeline 
p = Pipeline()
p.add_node(component=retriever, name="ESRetriever", inputs=["Query"])
p.add_node(component=eval_retriever, name="EvalDocuments", inputs=["ESRetriever"])
p.add_node(component=reader, name="QAReader", inputs=["EvalDocuments"])
p.add_node(component=eval_reader, name="EvalAnswers", inputs=["QAReader"])
...

See our updated Tutorial 5 for a full example.
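To see the blind spot of lexical metrics that SAS addresses, consider a plain token-level F1. The implementation below is a common textbook formulation written for this example, not the exact code Haystack uses, and the answer pairs are made up.

```python
import re
from collections import Counter


def token_f1(prediction, truth):
    """Token-level F1 between a predicted answer and a ground-truth answer."""
    pred_tokens = re.findall(r"\w+", prediction.lower())
    true_tokens = re.findall(r"\w+", truth.lower())
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


# Same meaning, no shared tokens: the lexical metric scores it as wrong.
print(token_f1("a hundred percent", "100 %"))
# Partial lexical overlap is rewarded.
print(token_f1("Lord Eddard Stark", "Eddard Stark"))
```

The first pair would be judged correct by a human, which is exactly the case a semantic metric is meant to catch.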

🤯 New nodes: Doc Classifier, Re-Ranker, QuestionGenerator & more

More nodes, more use cases:

FARMClassifier node for Document Classification: tag a document at indexing time or add a class downstream in your inference pipeline #1265
SentenceTransformersRanker: Re-Rank your documents after retrieval to maximize the relevance of your results. This implementation uses the popular sentence-transformer models #1209
QuestionGenerator: Question Answering systems are trained to find an answer given a question and a document; but with the recent advances in generative NLP, there are now models that can read a document and suggest questions that can be answered by that document. All this power is available to you now via the QuestionGenerator class.
QuestionGenerator models can be trained using Question Answering datasets. Instead of predicting answers, the QuestionGenerator takes the document as input and is trained to output the questions. This can be useful when you want to add "autosuggest" questions in your search bar or accelerate labeling processes. See docs (#1267)

🔭 Better support for OpenSearch

We now support Approximate nearest neighbour (ANN) search in OpenSearch (#1225) and fixed some initialization issues.

📑 New Tutorials

Tutorial 13 - Question Generation: Jupyter notebook | Colab | Python
Tutorial 14 - Query Classifier: Jupyter notebook | Colab | Python

⚠️ Breaking Changes

`probability` field removed from results #1340

Having two fields, probability and score, in answers / documents returned from nodes often caused confusion.
From now on we'll only have one field called score that is in range [0,1]. In QA results, this field is populated with the old probability value, so you can simply switch to this one. These fields have changed in Python and REST API.

Old:

{
  "query": "Who is the father of Arya Stark?",
  "answers": [
    {
      "answer": "Lord Eddard Stark",
      "score": 14.684528350830078,
      "probability": 0.9044522047042847,
      "context": ...,
      ...
    },
   ...
   ]
}

New:

{
  "query": "Who is the father of Arya Stark?",
  "answers": [
    {
      "answer": "Lord Eddard Stark",
      "score": 0.9044522047042847,
      "context": ...,
      ...
    },
   ...
   ]
}
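If you post-process stored results in the old format, the change boils down to one mapping. The helper below is a hypothetical convenience function written for this note, not something Haystack provides.

```python
def migrate_answer(old_answer):
    """Drop 'probability' and store its value in 'score', as in the new format."""
    new_answer = {key: value for key, value in old_answer.items() if key != "probability"}
    if "probability" in old_answer:
        new_answer["score"] = old_answer["probability"]
    return new_answer


old_answer = {
    "answer": "Lord Eddard Stark",
    "score": 14.684528350830078,
    "probability": 0.9044522047042847,
}
print(migrate_answer(old_answer))
```

Answers that are already in the new format pass through unchanged.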

Removed `Finder` #1326

After being deprecated a few months ago, Finder is now gone - R.I.P

Params in `Pipeline.run()` #1321

Component params like top_k, no_ans_boost for Pipeline.run() must be passed in a params dict

Old:

pipeline.run(query="Why?", top_k_retriever=10, no_ans_boost=0.5)

New:

pipeline.run(query="Why?", params={"top_k":10, "no_ans_boost":0.5})

Component specific top_ks like top_k_reader, top_k_retriever are now replaced with top_k. To disambiguate, the params can be "targeted" to a specific node.
Old:

pipeline.run(query="Why?", top_k_retriever=10, top_k_reader=5)

New:

pipeline.run(query="Why?", params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}})

Also, custom nodes must not have **kwargs in their run methods anymore and should only return the data (e.g. answers) they produce themselves.

🤓 Detailed Changes

Crawler

Serialize crawler output to JSON #1284
Add Crawler support for indexing pipeline #1360

Converter

Add ImageToTextConverter and PDFToTextOCRConverter that utilize OCR #1349

Preprocessor

Add PreProcessor optional language parameter. #1160
Improve preprocessing logging #1263
Make PreProcessor.process() work on lists of documents #1163

Pipeline

Add Ray integration for Pipelines #1255
MostSimilarDocumentsPipeline introduced #1413
QoL function: access certain nodes in pipeline #1441
Refactor replicas config for Ray Pipelines #1378
Add simple docs2answer node to allow FAQ style QA / Doc search in API #1361
Allow for batch indexing when using Pipelines fix #1168 #1231

Document Stores

Implement OpenSearch ANN [#12...

Contributors

hammer, demarant, and 25 other contributors


21 Jun 16:50

julian-risch

v0.9.0

9e4d7bf

v0.9.0

⭐ Highlights

Long-Form Question Answering (LFQA)

Haystack now provides LFQA with a Seq2SeqGenerator for generative QA and a Retribert Retriever thanks to community member @vblagoje. #1086
If you would like to ask questions where the answer is not a short phrase explicitly given in one of the documents but a more elaborate answer, then LFQA is interesting for you. These elaborate answers are generated by combining information from multiple relevant documents.

Document Re-Ranking

For pure "semantic document search" use cases that do not need question answering functionality but only document ranking, there is now a new type of node: Ranker. While the Retriever is a perfect fit for document retrieval, we can further improve its results with the Ranker. #1025
To this end, the Ranker uses a pre-trained model to calculate the semantic similarity of the question and each of the top-k retrieved documents. Documents with a high semantic similarity are ranked higher. The combination of a Retriever and Ranker is especially powerful if you combine a sparse retriever, e.g., ElasticsearchRetriever based on BM25 and a dense Ranker.
A pipeline with a Ranker and Retriever can be set up in just a few lines of code:

...
retriever = ElasticsearchRetriever(document_store=document_store)
ranker = FARMRanker(model_name_or_path="deepset/gbert-base-germandpr-reranking")

p = Pipeline()
p.add_node(component=retriever, name="ESRetriever", inputs=["Query"])
p.add_node(component=ranker, name="Ranker", inputs=["ESRetriever"])
...
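The two-stage idea behind this pipeline can be sketched without any models. In the toy below, a word-overlap count plays the role of the sparse retriever and a Jaccard similarity stands in for the Ranker's trained model; both scorers and the corpus are assumptions chosen so that re-ranking visibly changes the order.

```python
import re


def words(text):
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query, corpus, top_k):
    """Cheap first stage: rank by the number of distinct words shared with the query."""
    ranked = sorted(corpus, key=lambda doc: len(words(query) & words(doc)), reverse=True)
    return ranked[:top_k]


def jaccard(query, doc):
    """Stand-in relevance score; a real Ranker would use a trained model here."""
    query_words, doc_words = words(query), words(doc)
    return len(query_words & doc_words) / len(query_words | doc_words)


def rerank(query, candidates, score_fn):
    """Second stage: re-order only the retrieved candidates with a finer score."""
    return sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)


corpus = [
    "The history of Germany and of the capital markets of Europe in the age of industry",
    "Berlin is the capital of Germany",
    "Paris is the capital of France",
]
query = "capital of Germany"

candidates = retrieve(query, corpus, top_k=2)
reranked = rerank(query, candidates, jaccard)
print(reranked[0])
```

The expensive scorer only ever sees the top-k candidates, which is what keeps the combination practical on large document collections.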

Weaviate

Thanks to a contribution by our community member @venuraja79, Weaviate is integrated into Haystack as another DocumentStore. #1064
It allows a combination of vector search and scalar filtering, i.e., you can filter for a certain tag and do dense retrieval on that subset. After starting a Weaviate server with docker, it's as simple as:

from haystack.document_store import WeaviateDocumentStore
document_store = WeaviateDocumentStore()

Haystack uses the most recent Weaviate version 1.4.0 and the updating of embeddings has also been optimized #1181

Query Classifier

Some search applications need to distinguish between keyword queries and longer textual questions that come in. If you only want to route longer questions to the Reader branch in order to maximize the accuracy of results and minimize computation efforts/costs and route keyword queries to a Document Retriever, you can do that now with a QueryClassifier node thanks to a contribution by @shahrukhx01. #1099
You could use it as shown in this exemplary pipeline:

New Tutorials

Tutorial 11: Pipelines #991
Tutorial 12: Generative QA with LFQA #1086

⚠️ Breaking Changes

Remove Python 3.6 support #1059
Refactor REST APIs to use Pipelines #922
Bump to FARM 0.8.0, torch 1.8.1 and transformers 4.6.1 #1192

🤓 Detailed Changes

Connector

Add crawler to get texts from websites #775

Preprocessor

Add white space normalization warning #1022
Preserve whitespace during PreProcessor.split() #1121
Fix equality check in preprocessor #969

Pipeline

Add validation for root node in Pipeline #987
Fix passing a list as parameter value in Pipeline YAML #952
Add export of Pipeline YAML config #1003
Add config to JoinDocuments node to allow yaml export in pipelines #1134

Document Stores

Integrate Weaviate as another DocumentStore #957 #1064
Add OpenDistro init #1101
Rename all document stores delete_all_documents() method to delete_documents #1047
Fix Elasticsearch connection for non-admin users #1028
Fix update_embeddings() for FAISSDocumentStore #978
Feature: Enable AWS Elasticsearch IAM connection #965
Fix optional FAISS import #971
Make FAISS import conditional #970
Benchmark milvus #850
Improve Milvus HNSW Performance #1127
Update Milvus benchmarks #1128
Upgrade milvus to 1.1.0 #1066
Update tests for FAISSDocumentStore #999
Add L2 support for FAISS HNSW #1138
Improve the speed of FAISSDocumentStore.delete_documents() #1095
Add options for handling duplicate documents (skip, fail, overwrite) #1088
Update Embeddings - Use update instead of replace #1181
Improve the progress bar in update_embeddings() + Fix filters in update_embeddings() #1063
Using text hash as id to prevent document duplication #1000

Retriever

DPR Training parameter #989
Removed single_model_path; added infer_tokenizer to dpr load() #1060
Integrate sentence transformers into benchmarks #843
added use_amp to the train method, in order to use mixed precision training #1048

Ranker

Re-ranking component for document search without QA #1025
Remove quickfix from reader and ranker #1196
Distinguish labels for calculating similarity scores #1124

Query Classifier

Fix typo in Query Classifier Exception Message #1190
Add QueryClassifier incl. baseline models #1099

Reader

Filtering duplicate answers #1021
Add ONNXRuntime support #157
Remove unused function _get_pseudo_prob #1201

Generator

Integrate LFQA with Haystack - inferencing #1086

Evaluation Nodes

Reduce precision in pipeline eval print functions #943
Fix division by zero error in EvalRetriever #938
Add evaluation nodes for Pipelines #904
Add More top_k handling to EvalDocuments #1133
Prevent merge of same questions on different documents during evaluation #1119

REST API

adding root_path option #982
Add PDF converter dependencies Docker #1107
Disable Gunicorn preload option #960

User Interface

change file-upload response to sidebar #1018
Add File Upload Functionality in UI #995
Streamlit UI Evaluation mode #920
Fix evaluation mode in UI #1024
Fix typo in streamlit UI #1106

Documentation and Tutorials

Add about sections to Tutorial 12 #1195
Tutorial update #1166
Documentation update #1162
Add FAQ page #1151
Refresh API docs #1152
Add docu of confidence scores and calibration method #1131
Adding indentation to markup files #947
Update preprocessing.md #1087
Add badges to readme [#1136](...


13 Apr 15:04

oryx1729

v0.8.0

bba1d80

v0.8.0

⭐ Highlights

This is a major Haystack release with many new features. The release blog post has a detailed summary. Below are the top highlights:

Milvus Document Store

Milvus is an open-source vector database. With the MilvusDocumentStore contributed by @lalitpagaria, embedding based Retrievers like the DensePassageRetriever or EmbeddingRetriever can use production-ready Milvus servers for large-scale deployments.

Knowledge Graph

An experimental integration for KnowledgeGraphs is introduced using GraphDB. The GraphDBKnowledgeGraph stores Triples and executes SPARQL queries. It can be integrated with Text2SparqlRetriever to convert natural language queries to SPARQL.

Pipeline configuration with YAML

The Pipelines can now be configured with YAML. This enables easier sharing of query & indexing configuration, reproducible setups, A/B testing of Pipelines, and moving from development to the production environment.

REST APIs

The REST APIs are revamped to use Pipelines for Query & Indexing files. The YAML configurations are in the rest_api/pipelines.YAML. The new API endpoints are more generic to accommodate custom Pipeline configurations.

Confidence Scores

The answers now have a probability score that is better calibrated to the model's confidence. It has a range of 0-1, with 0 signifying very low confidence and 1 very high confidence.

Web Crawler

A Selenium based web crawler is now part of Haystack, thanks to @DIVYA-19 for the contribution. It takes as input a list of URLs and converts extracted text to Haystack Documents.

⚠️ Breaking Changes

REST APIs

The REST APIs got a major revamp with this release.

The /doc-qa & /faq-qa endpoints are replaced with a more generic POST /query endpoint. This new endpoint uses Pipelines under the hood, which can be configured at rest_api/pipeline.yaml.

The new /query endpoint expects a single query per request instead of a list of query strings.
The new request format is:

{
    "query": "Why did the revenue change?"
}

and the response looks like this:

{
    "query": "Why did the revenue change?",
    "answers": [
        {
            "answer": "rapid technological change and evolving industry standards",
            "question": null,
            "score": 0.543937623500824,
            "probability": 0.014070278964936733,
            "context": "tion process. The market for our products is intensely competitive and is characterized by rapid technological change and     evolving industry standards.",
            "offset_start": 91,
            "offset_end": 149,
            "offset_start_in_doc": 511,
            "offset_end_in_doc": 569,
            "document_id": "f30273b2-4d49-40d8-8824-43b3b6a0ea57",
            "meta": {
                "_split_id": "7"
            }
        },
        {
             // other answers
        }
    ]
}
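As a rough illustration of the new request format, the following sketch builds a POST /query request with Python's standard library and picks the highest-probability answer from a response shaped like the one above. The base URL http://localhost:8000 and the helper names build_query_request and best_answer are assumptions made for this example, not part of Haystack.

```python
import json
import urllib.request


def build_query_request(base_url: str, query: str) -> urllib.request.Request:
    """Build (but do not send) a POST /query request carrying a single query string."""
    payload = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/query",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def best_answer(response: dict) -> dict:
    """Return the answer with the highest probability from a /query response."""
    return max(response["answers"], key=lambda a: a["probability"])


# Hypothetical local deployment; adjust host and port to your setup.
request = build_query_request("http://localhost:8000", "Why did the revenue change?")

# To actually send it (requires a running REST API):
# with urllib.request.urlopen(request) as resp:
#     result = json.load(resp)
#     print(best_answer(result)["answer"])
```

Note that the payload holds one query string, not a list, which is the main difference from the old endpoints.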

The /doc-qa-feedback & /faq-qa-feedback endpoints are replaced with a new generic /feedback endpoint.

Created At Timestamp

Previously, all documents/labels in SQLDocumentStore and FAISSDocumentStore had a field called created to store the creation timestamp, while ElasticsearchDocumentStore did not have any timestamp field. Now, all document stores have a created_at field for documents and labels.

RAGenerator

The top_k_answers parameter in the RAGenerator is renamed to top_k for consistency across Haystack components.

Custom Query for Elasticsearch

The placeholder terms in custom_query should not have quotes around them. See more details here.

🤓 Detailed Changes

Pipeline

Fix execution of Pipelines with parallel nodes #901 (@oryx1729)
Add abstract run method to basecomponent #887 (@tholor)
Add support for parallel paths in Pipeline #884 (@oryx1729)
Add runtime parameters to component initialization #873 (@oryx1729)
Add support for indexing pipelines #816 (@oryx1729)
Adding translator with many generic input parameter support #782 (@lalitpagaria)
Fix building Pipeline with YAML #800 (@oryx1729)
Load Pipeline with YAML config file #785 (@oryx1729)
Add evaluation nodes for Pipelines #904 (@brandenchan)
Fix passing a list as parameter value in Pipeline YAML #952 (@oryx1729)

Document Store

Fixes elasticsearch auth #871 (@grafke)
Allow more options for elasticsearch client (auth, multiple hosts) #845 (@tholor)
Fix ElasticsearchDocumentStore.query_by_embedding() #823 (@oryx1729)
Introduce incremental updates for embeddings in document stores #812 (@oryx1729)
Add method to get metadata values for a key from Elasticsearch #776 (@oryx1729)
Fix refresh behaviour for Elasticsearch delete #794 (@oryx1729)
Milvus integration #771 (@lalitpagaria)
Add flag for use of window queries in SQLDocumentStore #768 (@oryx1729)
Remove quotes around placeholders in Elasticsearch custom query #762 (@oryx1729)
Fix delete_all_documents for the SQLDocumentStore #761 (@oryx1729)

Retriever

Improve dpr conversion #826 (@Timoeller)
Fix DPR training batch size #898 (@brandenchan)
Upgrade FAISS to 1.7.0 #834 (@tholor)
Allow non-standard Tokenizers (e.g. CamemBERT) for DPR via new arg #811 (@psorianom)

Modeling

Add model versioning support #784 (@brandenchan)
Improve preprocessing and adding of eval data #780 (@Timoeller)
SQuAD to DPR dataset converter #765 (@psorianom)
Remove RAG todos after transformers update #781 (@Timoeller)
Update farm version #936 (@Timoeller)

REST API

Refactor REST APIs to use Pipelines #922 (@oryx1729)
Add PDF converter in Dockerfiles #877 (@oryx1729)
Update GPU Dockerimage (Cuda 11, Fix faiss) #836 (@tholor)
Add API endpoint to export accuracy metrics from user feedback + created_at timestamp #803 (@tholor)
Fix file upload API #808 (@oryx1729)

File Converter

Add Markdown file converter #875 (@lalitpagaria)
Fix encoding for pdftotext (Russian characters, German umlauts etc). Fix version in download instructions #813 (@tholor)

Crawler

Add crawler to get texts from websites #775 (@DIVYA-19)

Knowledge Graph

knowledge graph example #934 (@julian-risch)

Annotation Tool

Annotation Tool: data is not persisted when using local version #853 #855 (@venuraja79)

Search UI

Fix UI when API returns fewer answers than expected #828 (@tholor)

CI

Revamp CI #825 (@oryx1729)
Fix mypy typing #792 (@oryx1729)
Fix pdftotext dependency in CI #788 (@tholor)

Misc Fixes

Adding indentation to markup files #947 (@julian-risch)
Reduce precision in pipeline eval print functions #943 (@lewtun)
Fix division by zero error in EvalRetriever #938 (@lewtun)
Logged warning in Faiss and Milvus for filters #913 (@peteradorjan)
Fixed "cannot allocate memory" exception by specifying max_processes #910 (@mosheber)
Fix error when is_impossible not exist [#870](https://github.com/deepset-ai/haystack/pu...

21 Jan 17:42

tholor

v0.7.0

5081542

v0.7.0

⭐ Highlights

New Slack Channel

As many people in the community asked us for it, we decided to open a Slack channel!
Join us to ask questions, show what you've built with Haystack, and exchange ideas with like-minded folks!

👉 https://haystack.deepset.ai/community/join

Optimizing Memory + CPU consumption of documentstores for large datasets (#733)

Working with large datasets can exhaust local memory. Therefore, we ...

... add batch_size parameters to most methods of the document store, so that only smaller chunks of documents are loaded at a time
... add a get_all_documents_generator() method that "streams" documents one by one from your document store.
Both help to lower the memory footprint significantly, especially when calling methods like update_embeddings() on datasets with more than 1 million docs.
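The idea behind both changes can be sketched in plain Python. The snippet below is not Haystack's implementation; iter_batches and the generated documents are stand-ins chosen for illustration. It shows how a generator yields fixed-size chunks so that only one batch is held in memory at a time.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def iter_batches(items: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield successive lists of at most batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# A generator expression stands in for a document store that streams documents one by one.
documents = ({"id": i, "text": f"doc {i}"} for i in range(10))

batch_sizes = [len(batch) for batch in iter_batches(documents, batch_size=4)]
print(batch_sizes)  # [4, 4, 2]
```

Because the source is consumed lazily, the full list of documents is never materialized; only the current batch is.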

Add Simple Demo UI (#671)

Thanks to our community member @tanmaylaud, we now have a great and simple UI that allows you to easily try your search pipelines. Ask questions, see the results, change basic config params, debug the API response and give your colleagues a better flavor of what you are building ...

Support for summarization models (#698)

Thanks to another community contribution from @lalitpagaria we now also support summarization models like PEGASUS in Haystack. You can use them ...

... standalone:

docs = [Document(text="PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. "
                      "The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by "
                      "the shutoffs which were expected to last through at least midday tomorrow.")]

summarizer = TransformersSummarizer(model_name_or_path="google/pegasus-xsum")
summary = summarizer.predict(documents=docs, generate_single_summary=False)

... as a node in your pipeline:

...
pipeline.add_node(component=summarizer, name="Summarizer", inputs=["Retriever"])

... by simply calling a predefined pipeline that first retrieves and then summarizes the resulting docs:

...
pipe = SearchSummarizationPipeline(summarizer=summarizer, retriever=retriever)
pipe.run()

We see many interesting use cases around search for it. For example, running semantic document search and displaying the summary of docs as a "preview" in the results.

New Tutorials

Wonder how to train a DPR retriever on your own domain dataset? Check out this new tutorial!
Proper preprocessing (Cleaning, Splitting etc.) of docs can have a big impact on your performance. Check out this new tutorial to learn more about it.

⚠️ Breaking Changes

Dropping `index_buffer_size` from FAISSDocumentStore

We removed the arg index_buffer_size from the init of FAISSDocumentStore. "Buffering" is now handled via the new batch_size arguments that you can pass to most methods like write_documents(), update_embeddings() and get_all_documents().

Renaming of Preprocessor arg

Old:

PreProcessor(..., split_stride=5)

New:

PreProcessor(..., split_overlap=5)

🤓 Detailed Changes

Preprocessing / File Conversion

Using PreProcessor functions on eval data #751

DocumentStore

Support filters for DensePassageRetriever + InMemoryDocumentStore #754
use Path class in add_eval_data of haystack.document_store.base.py #745
Make batchwise adding of evaluation data possible #717
Change signature and docstring for ca_certs parameter #730
Rename label id field for elastic & add UPDATE_EXISTING_DOCUMENTS to API config #728
Fix SQLite errors in tests #723
Add support for custom embedding field for InMemoryDocumentStore #640
Using Columns names instead of ORM to get all documents #620

Other

Generate docstrings and deploy to branches to Staging (Website) #731
Script for releasing docs #736
Increase FARM to Version 0.6.2 #755
Reduce memory consumption of fetch_archive_from_http #737
Add links to more resources #746
Fix Tutorial 9 #734
Adding a guard that prevents the tutorial from being executed in every subprocess on windows #729
Add ID to Label schema #727
Automate docstring and tutorial generation with every push to master #718
Pass custom label index name to REST API #724
Correcting pypi download badge #722
Fix GPU docker build #703
Remove sourcerer.io widget #702
Haystack logo is not visible on github mobile app #697
Update pipeline documentation and readme #693
Enable GPU args in tutorials #692
Add docs v0.6.0 #689

Big thanks to all contributors ❤️ !

@Rob192 @antoniolanza1996 @tanmaylaud @lalitpagaria @Timoeller @tanaysoni @bogdankostic @aantti @brandenchan @PiffPaffM @julian-risch

17 Dec 06:53

tholor

v0.6.0

5b81738

v0.6.0

⭐ Highlights

Flexible Pipelines powered by DAGs (#596)

In order to build modern search pipelines, you need two things: powerful building blocks and a flexible way to stick them together.
While we always had great building blocks in Haystack, we didn't have a good way to stick them together so far. That's why we put a lot of thought into it over the last weeks and came up with a new Pipeline class that enables many new search scenarios beyond QA. The core idea: you can build a Directed Acyclic Graph (DAG) where each node is one "building block" (Reader, Retriever, Generator ...). Here's a simple example for a "standard" Open-Domain QA Pipeline:

p = Pipeline()
p.add_node(component=retriever, name="ESRetriever1", inputs=["Query"])
p.add_node(component=reader, name="QAReader", inputs=["ESRetriever1"])
res = p.run(query="What did Einstein work on?", top_k_retriever=1)

You can draw the DAG to better inspect what you are building:

p.draw(path="custom_pipe.png")

Multiple retrievers

You can now also use multiple Retrievers and join their results:

p = Pipeline()
p.add_node(component=es_retriever, name="ESRetriever", inputs=["Query"])
p.add_node(component=dpr_retriever, name="DPRRetriever", inputs=["Query"])
p.add_node(component=JoinDocuments(join_mode="concatenate"), name="JoinResults", inputs=["ESRetriever", "DPRRetriever"])
p.add_node(component=reader, name="QAReader", inputs=["JoinResults"])
res = p.run(query="What did Einstein work on?", top_k_retriever=1)

Custom nodes

You can easily build your own custom nodes. Just respect the following requirements:

Add a method run(self, **kwargs) to your class. **kwargs will contain the output from the previous node in your graph.
Do whatever you want within run() (e.g. reformatting the query)
Return a tuple that contains your output data (for the next node) and the name of the outgoing edge, e.g. (output_dict, "output_1").
Add a class attribute outgoing_edges = 1 that defines the number of output options from your node. You only need a higher number here if you have a decision node (see below).
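A minimal sketch of a node that follows these requirements. The class name QueryLowercaser and its behaviour are made up for illustration; only the run(**kwargs) method, the (output, edge name) return tuple and the outgoing_edges attribute come from the list above.

```python
class QueryLowercaser:
    """Hypothetical custom node that normalizes the incoming query."""

    # One outgoing edge: this node is not a decision node.
    outgoing_edges = 1

    def run(self, **kwargs):
        # kwargs holds the output of the previous node, e.g. the query.
        output = dict(kwargs)
        output["query"] = output["query"].strip().lower()
        # Return the data for the next node and the name of the outgoing edge.
        return output, "output_1"


node = QueryLowercaser()
result, edge = node.run(query="  What did Einstein work on?  ", top_k_retriever=1)
```

Such a node could then be added with p.add_node(component=QueryLowercaser(), name="QueryLowercaser", inputs=["Query"]), in the same way as the nodes above.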

Decision nodes

Or you can add decision nodes where only one "branch" is executed afterwards. This allows you, for example, to classify an incoming query and, depending on the result, route it to different modules:

    class QueryClassifier():
        outgoing_edges = 2

        def run(self, **kwargs):
            if "?" in kwargs["query"]:
                return (kwargs, "output_1")

            else:
                return (kwargs, "output_2")

    pipe = Pipeline()
    pipe.add_node(component=QueryClassifier(), name="QueryClassifier", inputs=["Query"])
    pipe.add_node(component=es_retriever, name="ESRetriever", inputs=["QueryClassifier.output_1"])
    pipe.add_node(component=dpr_retriever, name="DPRRetriever", inputs=["QueryClassifier.output_2"])
    pipe.add_node(component=JoinDocuments(join_mode="concatenate"), name="JoinResults",
                  inputs=["ESRetriever", "DPRRetriever"])
    pipe.add_node(component=reader, name="QAReader", inputs=["JoinResults"])
    res = pipe.run(query="What did Einstein work on?", top_k_retriever=1)

Default Pipelines (replacing the "Finder")

Last but not least, we added some "Default Pipelines" that allow you to run standard patterns with very few lines of code.
This is replacing the Finder class which is now deprecated.

from haystack.pipeline import DocumentSearchPipeline, ExtractiveQAPipeline, FAQPipeline, GenerativeQAPipeline

# Extractive QA
qa_pipe = ExtractiveQAPipeline(reader=reader, retriever=retriever)
res = qa_pipe.run(query="When was Kant born?", top_k_retriever=3, top_k_reader=5)

# Document Search
doc_pipe = DocumentSearchPipeline(retriever=retriever)
res = doc_pipe.run(query="Physics Einstein", top_k_retriever=1)

# Generative QA
doc_pipe = GenerativeQAPipeline(generator=rag_generator, retriever=retriever)
res = doc_pipe.run(query="Physics Einstein", top_k_retriever=1)

# FAQ based QA
doc_pipe = FAQPipeline(retriever=retriever)
res = doc_pipe.run(query="How can I change my address?", top_k_retriever=3)

We plan many more features around the new pipelines incl. parallelized execution, distributed execution, definition via YAML files, dry runs ...

New DocumentStore for the Open Distro of Elasticsearch (#676)

From now on we also support the Open Distro of Elasticsearch. This allows you to use many of the hosted Elasticsearch services (e.g. from AWS) more easily with Haystack. Usage is similar to the regular ElasticsearchDocumentStore:

document_store = OpenDistroElasticsearchDocumentStore(host="localhost", port="9200", ...)

⚠️ Breaking Changes

As Haystack is extending from QA to further search types, we decided to rename all parameters from question to query.
This includes for example the predict() methods of the Readers but also several other places. See #614 for details.

🤓 Detailed Changes

Preprocessing / File Conversion

Redone: Fix concatenation of sentences in PreProcessor. Add stride for word-based splits with sentence boundaries #641
Add needed whitespace before sentence start #582

DocumentStore

Scale dot product into probabilities #667
Add refresh_type param for Elasticsearch update_embeddings() #630
Add return_embedding parameter for get_all_documents() #615
Adding support for update_existing_documents to sql and faiss document stores #584
Add filters for delete_all_documents() #591

Retriever

Fix saving tokenizers in DPR training + unify save and load dirs #682
fix a typo, num_negatives -> num_positives #681
Refactor DensePassageRetriever._get_predictions #642
Move DPR embeddings from GPU to CPU straight away #618
Add MAP retriever metric for open-domain case #572

Reader / Generator

add GPU support for rag #669
Enable dynamic parameter updates for the FARMReader #650
Add option in API Config to configure if reader can return "No Answer" #609
Fix various generator issues #590

Pipeline

Add support for building custom Search Pipelines #596
Add set_node() for Pipeline #659
Add support for aggregating scores in JoinDocuments node #683
Add pipelines for GenerativeQA & FAQs #645

Other

Cleanup Pytest Fixtures #639
Add latest benchmark run #652
Fix image links in tutorial #663
Update query arg in Tutorial 7 #656
Fix benchmarks #648
Add link to FAISS Info in documentation #643
Improve User Feedback Documentation #539
Add formatting checks for shell scripts #627
Update md files for API docs #631
Clean API docs and increase coverage #621
Add boxes for recommendations #629
Automate benchmarks via CML #518
Add contributor hall of fame #628
README: Fix link to roadmap #626
Fix docstring examples #604
Cleaning the api docs #616
Fix link to DocumentStore page #613
Make more changes to documentation #578
Remove column in benchmark website #608
Make benchmarks clearer #606
Fixing defaults configs for rest_apis #583
Allow list of filter values in REST API #568
Fix CI bug due to new Elasticsearch release and new model release #579
Update Colab Torch Version [#576](https://github.com/deepset...

06 Nov 10:28

tholor

v0.5.0

99e924a

v0.5.0

Highlights

💬 Generative Question Answering via RAG (#484)

Thanks to our community member @lalitpagaria, Haystack now also supports generative QA via Retrieval Augmented Generation ("RAG").
Instead of "finding" the answer within a document, these models generate the answer. In that sense, RAG takes a similar approach to GPT-3, but it comes with two huge advantages for real-world applications:
a) it has a manageable model size
b) the answer generation is conditioned on retrieved documents, i.e. the model can easily adjust to domain documents even after training has finished (in contrast: GPT-3 relies on the web data seen during training)

Example:

    question = "who got the first nobel prize in physics?"

    # Retrieve related documents from retriever
    retrieved_docs = retriever.retrieve(query=question)

    # Now generate answer from question and retrieved documents
    predicted_result = generator.predict(
        question=question,
        documents=retrieved_docs,
        top_k=1
    )

You can already play around with it in this minimal tutorial.

We are looking forward to improving this class of models further in the next months and already plan a tighter integration into the Finder class.

↗️ Better DPR (incl. training) (#527)

We migrated the existing DensePassageRetriever to its own pipeline based on FARM. This allows better modularization and, most importantly, simple training of DPR models! You can either train models from scratch or take an existing DPR model and fine-tune it on your own domain data. The required training data consists of queries and positive passages (i.e. passages that are related to your query / contain the answer), and the format complies with the one in the original DPR codebase.

Example:

dense_passage_retriever.train(self,
                              data_dir: str,
                              train_filename: str,
                              dev_filename: str = None,
                              test_filename: str = None,
                              batch_size: int = 16,
                              embed_title: bool = True,
                              num_hard_negatives: int = 1,
                              n_epochs: int = 3)
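For orientation, here is a sketch of what a single training sample could look like. The field names follow the JSON layout of the original DPR codebase as far as we know (question, answers, positive_ctxs, negative_ctxs, hard_negative_ctxs); the sample itself and the file name train.json are made up for illustration, so check the DPR repository for the authoritative format.

```python
import json

# One illustrative training sample in the layout of the original DPR codebase (assumed field names).
sample = {
    "question": "who got the first nobel prize in physics?",
    "answers": ["Wilhelm Conrad Röntgen"],
    "positive_ctxs": [
        {
            "title": "Nobel Prize in Physics",
            "text": "The first Nobel Prize in Physics was awarded in 1901 to Wilhelm Conrad Röntgen.",
        }
    ],
    "negative_ctxs": [],
    "hard_negative_ctxs": [
        {
            "title": "Nobel Prize in Chemistry",
            "text": "The first Nobel Prize in Chemistry was awarded in 1901 to Jacobus Henricus van 't Hoff.",
        }
    ],
}

# The training file is a JSON list of such samples.
serialized = json.dumps([sample], ensure_ascii=False, indent=2)

# Hypothetical file name; it would be passed as train_filename together with data_dir.
# with open("train.json", "w", encoding="utf-8") as f:
#     f.write(serialized)
```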

Future improvements: At the moment training is only supported on single GPUs. We will add support for Multi-GPU Training via DDP soon.

📊 New Benchmarks

Happy to introduce a new benchmark section on our website!
Do you wonder if you should use BERT, RoBERTa or MiniLM for your reader? Is it worth using DPR for retrieval instead of Elastic's BM25? How would this impact speed and accuracy?

See the relevant metrics here to guide your decision:
👉 https://haystack.deepset.ai/bm/benchmarks

We will extend this section over time with more models, metrics and key parameters.

⚠️ Breaking Changes

Consistent parameter naming for TransformersReader #510

# old
TransformersReader(model="distilbert-base-uncased-distilled-squad", ...)

# new
TransformersReader(model_name_or_path="distilbert-base-uncased-distilled-squad", ...)

FAISS: Remove phi normalization, support more index types #467

New default index type is "Flat" and params have changed slightly:

# old
FAISSDocumentStore(
        sql_url: str = "sqlite:///",
        index_buffer_size: int = 10_000,
        vector_size: int = 768,
        faiss_index: Optional[IndexHNSWFlat] = None,
)

# new
FAISSDocumentStore(
        sql_url: str = "sqlite:///",
        index_buffer_size: int = 10_000,
        vector_dim: int = 768,
        faiss_index_factory_str: str = "Flat",
        faiss_index: Optional[faiss.swigfaiss.Index] = None,
        return_embedding: Optional[bool] = True,
        **kwargs,
)

DPR signature

Splitting max_seq_len into two independent params.
Removing remove_sep_tok_from_untitled_passages param.

# old
DensePassageRetriever(
                 document_store: BaseDocumentStore,
                 query_embedding_model: str = "facebook/dpr-question_encoder-single-nq-base",
                 passage_embedding_model: str = "facebook/dpr-ctx_encoder-single-nq-base",
                 max_seq_len: int = 256,
                 use_gpu: bool = True,
                 batch_size: int = 16,
                 embed_title: bool = True,
                 remove_sep_tok_from_untitled_passages: bool = True
                 )

# new
DensePassageRetriever(
                 document_store: BaseDocumentStore,
                 query_embedding_model: Union[Path, str] = "facebook/dpr-question_encoder-single-nq-base",
                 passage_embedding_model: Union[Path, str] = "facebook/dpr-ctx_encoder-single-nq-base",
                 max_seq_len_query: int = 64,
                 max_seq_len_passage: int = 256,
                 use_gpu: bool = True,
                 batch_size: int = 16,
                 embed_title: bool = True,
                 use_fast_tokenizers: bool = True,
                 similarity_function: str = "dot_product"
                 )

Detailed Changes

Preprocessing / File Conversion

Add preprocessing pipeline #473
Restructure checks in PreProcessor #504
Updated the example code to Indexing PDF / Docx files #502
Fix meta data = None in PreProcessor #496
add explicit encoding mode to file_converter/txt.py #478
Skip file conversion if file type is not supported #456

DocumentStore

Add support for MySQL database #556
Allow configuration of Elasticsearch Analyzer (e.g. for other languages) #554
Add support to return embedding #514
Fix scoring in Elasticsearch for dot product #517
Allow filters for get_document_count() #512
Make creation of label index optional #490
Fix update_embeddings function in FAISSDocumentStore #481
FAISS Store: allow multiple write calls and fix potential memory leak in update_embeddings #422
Enable bulk operations on vector IDs for FAISSDocumentStore #460
fixing ElasticsearchDocumentStore initialisation #415
bug: filters on a query_by_embedding #464

Retriever

DensePassageRetriever: Add Training, Refactor Inference to FARM modules #527
Fix retriever evaluation metrics #547
Add save and load method for DPR #550
Typo in dense.py comment #545
Make returning predictions in Finder & Retriever eval() possible #524
Make title info optional when evaluating on QA data #494
Make sentence-transformers usage more user-friendly #439

Reader

Fix FARMReader.eval() handling of no_answers #531
Added automatic mixed precision (AMP) support for reader training from Haystack side #463
Update ONNX conversion for FARMReader #438

Other

Fix sentencepiece dependencies in Dockerfiles #553
Update Dockerfile #537
Removing (deprecated) warnings from the Haystack codebase. #530
Pytest fix memory leak and put pytest marker on slow tests #520
[enhancement] Create deploy_website.yml #450
Add Docker Images & Setup for the Annotation Tool #444

REST API

Make filter value optional in REST API #497
Add Elasticsearch Query DSL compliant Query API #471
Allow configuration of log level in REST API #541
Add create_index and similarity metric to api config #493
Add deepcopy for meta dicts in answers #485
Fix windows platform installation #480
Update GPU docker & fix race condition with multiple workers #436

Documentation / Benchmarks / Tutorials

New readme #534
Add ...

21 Sep 09:01

tholor

v0.4.0

c5f1f9a

v0.4.0

Highlights

💥 New Project Website & Documentation

As the project is growing, we have more and more content that doesn't fit in GitHub.
In this first version of the website, we focused on documentation incl. quick start, usage guides and the API reference.
In the future, we plan to extend this with benchmarks, FAQs, use cases, and other content that helps you to build your QA system.

👉 https://haystack.deepset.ai

📈 Scalable dense retrieval: FAISSDocumentStore

With recent performance gains of dense retrieval methods (learn more about it here), we need document stores that efficiently store vectors and find the most similar ones at query time. While Elasticsearch can also handle vectors, it quickly reaches its limits when dealing with larger datasets. We evaluated a couple of projects (FAISS, Scann, Milvus, Jina ...) that specialize in approximate nearest neighbour (ANN) algorithms for vector similarity. We decided to implement FAISS as it's easy to run in most environments.
We will likely add one of the heavier solutions (e.g. Jina or Milvus) later this year.

The FAISSDocumentStore uses FAISS to handle embeddings and SQL to store the actual texts and meta data.

Usage:

document_store = FAISSDocumentStore(sql_url="sqlite:///",        # SQL DB for text + meta data
                                    vector_size=768)             # Dimensionality of your embeddings

📃 More input file formats: Apache Tika File Converter (#314 )

Thanks to @dany-nonstop you can now extract text from many file formats (docx, pptx, html, epub, odf ...) via Apache Tika.

Usage:

Start Apache Tika Server

docker run -d -p 9998:9998 apache/tika

Do Conversion in Haystack

tika_converter = TikaConverter(
        tika_url = "http://localhost:9998/tika",
        remove_numeric_tables = False,
        remove_whitespace = False,
        remove_empty_lines = False,
        remove_header_footer = False,
        valid_languages = None,
    )
>>> dict = tika_converter.convert(file_path=Path("test/samples/pdf/sample_pdf_1.pdf"))
>>> dict
{
  "text": "everything on page one \f then page two \f ...",
  "meta": {"Content-Type": "application/pdf", "Creation-Date": "2020-06-02T12:27:28Z", ...}
}

Breaking changes

Restructuring / Renaming of modules (Breaking changes!) (#379)

We've restructured the package to make the usage more intuitive and terminology more consistent.

Rename database module -> document_store
Split indexing module into -> file_converter and preprocessor
Move Document, Label and Multilabel classes into -> schema and simplify import to from haystack import Document, Label, Multilabel

File converters (#393)

Refactoring major parts of the file converters. Not returning pages anymore, but rather adding page break symbols that can be accessed further down the pipeline.

Old:

>>> pages, meta = Fileconverter.extract_pages(file_path=Path("..."))

New:

>>> dict = Fileconverter.convert(file_path="...", meta={"name": "some_name", "category": "news"})
>>> dict
{
  "text": "everything on page one \f then page two \f ...",
  "meta": {"name": "..."}
}
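If you still need per-page texts, the page break symbol makes them easy to recover. A minimal sketch, assuming the form feed character "\f" marks page breaks as in the example output above; split_pages is a hypothetical helper name, not a Haystack function.

```python
def split_pages(text: str) -> list:
    """Split converter output into pages on the form feed page break symbol."""
    return [page.strip() for page in text.split("\f")]


converted = {
    "text": "everything on page one \f then page two \f and page three",
    "meta": {"name": "some_name", "category": "news"},
}

pages = split_pages(converted["text"])
print(pages)  # ['everything on page one', 'then page two', 'and page three']
```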

DensePassageRetriever (#308)

Refactored from the original Facebook code to the transformers code base. The models are now loaded from the Hugging Face model hub.
The signature has therefore changed to:

retriever = DensePassageRetriever(document_store=document_store,
                                  query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
                                  passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
                                  use_gpu=True,
                                  embed_title=True,
                                  remove_sep_tok_from_untitled_passages=True)

Deprecate Tags for Document Stores (#286)

We removed the "tags" field that in the past could be associated with Documents and used for filtering your search.
Instead, we now use the more general concept of "meta", where you can supply any custom fields and filter for them at runtime.

Old:

dict = {"text": "some", "tags": ["category1", "category2"]}

New:

dict = {"text": "some", "meta": {"category": ["1", "2"]}}
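Existing document dicts can be migrated with a few lines of Python. This is only a sketch: tags_to_meta and the target key "category" are our own choices for illustration, and how you map old tags to meta fields depends on your data. Unlike the example above, the helper keeps the tag values unchanged.

```python
def tags_to_meta(doc: dict, meta_key: str = "category") -> dict:
    """Return a copy of doc with its old "tags" list moved into "meta" under meta_key."""
    migrated = {key: value for key, value in doc.items() if key != "tags"}
    meta = dict(migrated.get("meta") or {})
    if "tags" in doc:
        meta[meta_key] = list(doc["tags"])
    migrated["meta"] = meta
    return migrated


old_doc = {"text": "some", "tags": ["category1", "category2"]}
new_doc = tags_to_meta(old_doc)
print(new_doc)  # {'text': 'some', 'meta': {'category': ['category1', 'category2']}}
```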

Details

Document Stores

Add FAISS Document Store #253
Fix type casting for vectors in FAISS #399
Fix duplicate vector ids in FAISS #395
Fix document filtering in SQLDocumentStore #396
Move retriever probability calculations to document_store #389
Add FAISS query scores #368
Raise Exception if filters used for FAISSDocumentStore query #338
Add refresh_type arg to ElasticsearchDocumentStore #326
Improve speed for SQLDocumentStore #330
Fix indexing of metadata for FAISS/SQL Document Store #310
Ensure exact match when filtering by meta in Elasticsearch #311
Deprecate Tags for Document Stores #286
Add option to update existing documents when indexing #285
Cast document_ids as strings #284
Add method to update meta fields for documents in Elasticsearch #242
Custom mapping write doc fix #297

Retriever

DPR (Dense Retriever) for InMemoryDocumentStore #316 #332
Refactor DPR from FB to Transformers codebase #308
Restructure update embeddings #304
Added title during DPR passage embedding && ElasticsearchDocumentStore #298
Add eval for Dense Passage Retriever & Refactor handling of labels/feedback #243
Fix type of query_emb in DPR.retrieve() #247
Fix return type of EmbeddingRetriever to numpy array #245

Reader

More robust Reader eval by limiting max answers and creating no answer labels #331
Aggregate multiple no answers in MultiLabel #324
Add "no answer" aggregation to Transformersreader #259
Align TransformersReader with FARMReader #319
Datasilo use all cores for preprocessing #303
Batch prediction in evaluation #137
Aggregate label objects for same questions #292
Add num_processes to reader.train() to configure multiprocessing #271
Added support for unanswerable questions in TransformersReader #258

Preprocessing

Refactor file converter interface #393
Add Tika Converter #314

Finder

Add index arg to Finder.get_answers() and _via_similar_questions() #362

Documentation

Create documentation website #272
Use port 8000 in documentation #357
Documentation #343
Convert Documentation to markdown #386
Add logo to readme #384
Refactor the DPR tutorial to use FAISS #317
Make Tutorials Work on Colab GPUs #322

Other

Exclude embedding fields from the REST API #390
Fix test suite dependency issue on MacOS #374
Add Gunicorn timeout #364
Bump FARM version to 0.4.7 #340
Add Tests for MultiLabel #318
Modified search endpoints logs to dump json #290
Add export answers to CSV function #266

Big thanks to all contributors ♥️

@antoniolanza1996, @dany-nonstop, @philipp-bode, @lalitpagaria, @PiffPaffM, @brandenchan, @tanaysoni, @Timoeller, @tholor, @bogdankostic, @maxupp, @kolk, @venuraja79, @karimjp

16 Jul 12:30

tholor

0.3.0

a6ec430

0.3.0

🔍 Dense Passage Retrieval

Glad to introduce the new Dense Passage Retriever (aka DPR).
Using dense embeddings of texts is a powerful alternative for scoring the similarity of texts. This retriever uses two BERT models: one to embed your query, one to embed your passage. This Dual-Encoder architecture can deal much better with the different nature of query and texts (length, syntax ...). It was published by Karpukhin et al. and shows impressive performance, especially if there's no direct overlap between tokens in your queries and your texts.

retriever = DensePassageRetriever(document_store=document_store,
                                  embedding_model="dpr-bert-base-nq",
                                  do_lower_case=True, use_gpu=True)
retriever.retrieve(query="What is cosine similarity?")
# returns: [Document, Document]

See Tutorial 6 for more details
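To make the scoring step concrete, here is a toy sketch of how a dual-encoder ranks passages once query and passages are embedded. The three-dimensional vectors are invented stand-ins for real BERT embeddings, the dot product is assumed as the similarity measure, and rank_passages is a hypothetical helper, not Haystack's API.

```python
from typing import Dict, List, Tuple


def dot(a: List[float], b: List[float]) -> float:
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_passages(
    query_emb: List[float], passage_embs: Dict[str, List[float]], top_k: int
) -> List[Tuple[str, float]]:
    """Score every passage embedding against the query embedding and return the top_k best."""
    scored = [(passage_id, dot(query_emb, emb)) for passage_id, emb in passage_embs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Invented embeddings; in DPR the query and passage encoders would produce these.
query_embedding = [1.0, 0.0, 1.0]
passage_embeddings = {
    "passage_a": [0.5, 0.5, 2.0],  # score 2.5
    "passage_b": [0.0, 3.0, 0.0],  # score 0.0
    "passage_c": [1.0, 1.0, 0.5],  # score 1.5
}

top = rank_passages(query_embedding, passage_embeddings, top_k=2)
print([passage_id for passage_id, _ in top])  # ['passage_a', 'passage_c']
```

No token overlap is involved: the ranking depends only on how close the vectors are, which is why this approach can match queries and passages that share no words.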

📊 Evaluation

We introduce the option to evaluate your reader, retriever, and the combination of both. While there's usually a good understanding of the reader's performance, the interplay with the retriever is what really matters in practice. You want to answer: Is my retriever a bottleneck? Is it worth increasing top_k for the retriever? How do different retrievers compare in performance? What is the effect on speed?
The new eval() is a first step towards answering those questions and gives a comprehensive picture of your pipeline. Stay tuned for more enhancements here.

document_store.add_eval_data("../data/nq/nq_dev_subset_v2.json")
...
retriever.eval(top_k=10)
reader.eval(document_store=document_store, device=device)
finder.eval(top_k_retriever=10, top_k_reader=10)

See Tutorial 5 for more details
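One way to reason about the "is my retriever a bottleneck?" question is retriever recall: the share of queries for which at least one relevant document appears among the top_k results. The sketch below computes it on invented data; recall_at_k and the document ids are hypothetical, and this is not the implementation behind eval().

```python
from typing import Dict, List, Set


def recall_at_k(ranked: Dict[str, List[str]], relevant: Dict[str, Set[str]], top_k: int) -> float:
    """Fraction of queries with at least one relevant document among the first top_k results."""
    if not ranked:
        return 0.0
    hits = sum(
        1
        for query, doc_ids in ranked.items()
        if relevant.get(query, set()) & set(doc_ids[:top_k])
    )
    return hits / len(ranked)


# Invented retriever output (best match first) and gold labels.
ranked_results = {
    "q1": ["d3", "d7", "d1"],
    "q2": ["d2", "d9", "d4"],
    "q3": ["d5", "d6", "d8"],
    "q4": ["d1", "d2", "d3"],
}
gold = {"q1": {"d1"}, "q2": {"d2"}, "q3": {"d0"}, "q4": {"d2"}}

for k in (1, 2, 3):
    print(k, recall_at_k(ranked_results, gold, top_k=k))
# 1 0.25
# 2 0.5
# 3 0.75
```

If recall keeps rising as top_k grows, the reader may benefit from seeing more documents, at the cost of speed.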

📄 Basic Support for PDF and Docx Files

You can now index PDF and docx files more easily to your DocumentStore. We introduce a new BaseConverter class that offers basic cleaning functions (e.g. removing footers or tables). Its file-format-specific child classes (e.g. PDFToTextConverter) handle the actual extraction of the text.

# PDF
from haystack.indexing.file_converters.pdf import PDFToTextConverter
converter = PDFToTextConverter(remove_header_footer=True, remove_numeric_tables=True, valid_languages=["de","en"])
pages = converter.extract_pages(file_path=file)  # `file` is the path to your PDF
# => list of str, one per page

# DOCX
from haystack.indexing.file_converters.docx import DocxToTextConverter
converter = DocxToTextConverter()
paragraphs = converter.extract_pages(file_path=file)  # `file` is the path to your docx file
# => list of str, one per paragraph (as docx has no direct notion of pages)
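
The split between shared cleaning in a base class and format-specific extraction in child classes can be sketched as follows. This is an illustration of the pattern, not Haystack's code. `SketchBaseConverter`, `FormFeedTextConverter`, `convert`, and `strip_repeated_edge_lines` are hypothetical names. The header/footer heuristic (drop a first or last line that is identical on every page) is an assumption and may differ from what the real converters do.

```python
from abc import ABC, abstractmethod
from typing import List


def strip_repeated_edge_lines(pages: List[str]) -> List[str]:
    """Drop the first/last line of every page when it is identical on all pages."""
    split_pages = [page.splitlines() for page in pages]
    if len(split_pages) < 2 or any(len(lines) < 3 for lines in split_pages):
        return pages  # too little evidence to call anything a header or footer
    same_first = len({lines[0] for lines in split_pages}) == 1
    same_last = len({lines[-1] for lines in split_pages}) == 1
    cleaned = []
    for lines in split_pages:
        start = 1 if same_first else 0
        end = len(lines) - 1 if same_last else len(lines)
        cleaned.append("\n".join(lines[start:end]))
    return cleaned


class SketchBaseConverter(ABC):
    """Hypothetical base class: owns the cleaning options shared by all formats."""

    def __init__(self, remove_header_footer: bool = False) -> None:
        self.remove_header_footer = remove_header_footer

    @abstractmethod
    def extract_pages(self, file_path: str) -> List[str]:
        """Format-specific step: return the raw text, one string per page."""

    def convert(self, file_path: str) -> List[str]:
        """Extract pages, then apply the shared cleaning if it is enabled."""
        pages = self.extract_pages(file_path)
        return strip_repeated_edge_lines(pages) if self.remove_header_footer else pages


class FormFeedTextConverter(SketchBaseConverter):
    """Hypothetical child class: plain-text files with form feeds between pages."""

    def extract_pages(self, file_path: str) -> List[str]:
        with open(file_path, encoding="utf-8") as handle:
            return handle.read().split("\f")


pages = [
    "ACME Report\nIntro text\nConfidential",
    "ACME Report\nMore text\nConfidential",
]
print(strip_repeated_edge_lines(pages))
```

Keeping the cleaning in the base class means a converter for a new file format only has to implement `extract_pages`.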

And there's much more that happened ...

Preprocessing

Added Support for Docx Files #225
Add PDF parser for indexing #109
Adjust PDF conversion subprocess for Python v3.6 #194
Fix boundary condition in detection of header/footer in file converters #165

Retriever

Refactor DPR for latest transformers version & change init arg gpu -> use_gpu for DPR and EmbeddingRetriever #239
Add dummy retriever for benchmarking / reader-only settings #235
Fix id for documents returned by the TfidfRetriever #232
Tutorial for Dense Passage Retriever #186
Fix device arg for sentence transformers #124
Fix embeddings from sentence-transformers (type cast & gpu flags) #121
Adding metadata to be returned from tfidf retriever #122

Reader

Add ONNXRuntime support #157
Fix multi gpu training via Dataparallel #234
Fix document id missing in farm inference output #174
Add document meta for Transformer Reader #114
Fix naming of offset in answers of TransformersReader (for consistency with FARMReader) #204
Adjust to farm handling of no answer #170

DocumentStores

Move document_name attribute to meta #217
Remove meta field when writing documents in Elasticsearch #240
Harmonize meta data handling across doc stores #214
Add filtering by tags for InMemoryDocumentStore #108
Make FAQ question field customizable #146
Increase timeout for Elasticsearch bulk indexing #119
Add embedding query for InMemoryDocumentStore #112
Increase timeout for bulk indexing in ES #130
Add custom port to ElasticsearchDocumentStore #129
Remove hard-coded embedding field #107

REST API

Move out REST API from PyPI package #160
Fix format of /export-doc-qa-feedback to comply with SQuAD #241
Create file upload directory in the REST API #166
Add API endpoint to upload files #154
Missing PORT and SCHEME for elasticsearch to run the API #134
Add EMBEDDING_MODEL_FORMAT in API config #152
Add success response for successful file upload API #195
Add response time in logs #201
Fix rest api in Docker image after refactoring #178

Other

Upgrade to new FARM / Transformers / PyTorch versions #212
Fix Evaluation Dataset #233
Remove mutation of documents in write_documents() #231
Remove mutation of results dict in print_answers() #230
Make doc name optional #100
Fix Dockerfile to build successfully without models directory #210
Docker available for TransformsReader Class #180
Fix embedding method in FAQ-QA Tutorial #220
Add more tests #213
Update docstring for embedding_field and embedding_dim #208
Make "meta" field generic for Document Schema #102
Update tutorials #200
Upgrade FARM version #172
Fix for installing PyTorch on Windows OS #159
Remove Literal type hint #156
Remove PyMuPDF dependency #148
Add missing type hints #138
Add a GitHub Action to start Elasticsearch instance for Build workflow #142
Correct field in evaluation tutorial #139
Update Haystack version in tutorials #136
Fix evaluation #132
Add stalebot #131
Add Reader/Retriever validations in Finder #113
Add document metadata for FAQ style QA #106
Add basic tutorial for FAQ-based QA & batch comp. of embeddings #98
Make saving more explicit in tutorial #95

Thanks to all contributors for working on this and shaping Haystack together: @skirdey @guillim @antoniolanza1996 @F4r1n @arthurbarros @elyase @anirbansaha96 @Timoeller @bogdankostic @tanaysoni @brandenchan

Releases: deepset-ai/haystack

v1.1.0

⭐ Highlights

Model Distillation for Reader Models

Integrated vs. Isolated Pipeline Evaluation Modes

Row-Column-Intersection Model for TableQA

Advanced File Converters

⚠️ Breaking Changes

🤓 Detailed Changes

Pipeline

Models

DocumentStores

REST API

UI / Demo

Documentation

Other Changes

Contributors

1.0.0

🎁 Haystack 1.0

⭐ Highlights

Improved Evaluation of Pipelines

Table QA

Improved Debugging of Pipelines & Nodes

FARM Migration

⚠️ Breaking Changes & Migration Guide

Migration to v1.0

New Package Structure & Changed Imports

Primitive Objects

Input Document Format

Response format of Reader

Contributors

v0.10.0

⭐ Highlights

🚀 Making Pipelines more scalable

😍 Making Pipelines more user-friendly

📈 Better evaluation metric for QA: Semantic Answer Similarity (SAS)

🤯 New nodes: Doc Classifier, Re-Ranker, QuestionGenerator & more

🔭 Better support for OpenSearch

📑 New Tutorials

⚠️ Breaking Changes

probability field removed from results #1340

Removed Finder #1326

Params in Pipeline.run() #1321

🤓 Detailed Changes

Crawler

Converter

Preprocessor

Pipeline

Document Stores

Contributors

v0.9.0

⭐ Highlights

Long-Form Question Answering (LFQA)

Document Re-Ranking

Weaviate

Query Classifier

New Tutorials

⚠️ Breaking Changes

🤓 Detailed Changes

Connector

Preprocessor

Pipeline

Document Stores

Retriever

Ranker

Query Classifier

Reader

Generator

Evaluation Nodes

REST API

User Interface

Documentation and Tutorials

v0.8.0

⭐ Highlights

Milvus Document Store

Knowledge Graph

Pipeline configuration with YAML

REST APIs

Confidence Scores

Web Crawler

`probability` field removed from results #1340

Removed `Finder` #1326

Params in `Pipeline.run()` #1321

Dropping `index_buffer_size` from FAISSDocumentStore