v0.6.0
⭐ Highlights
Flexible Pipelines powered by DAGs (#596)
In order to build modern search pipelines, you need two things: powerful building blocks and a flexible way to stick them together.
While we always had great building blocks in Haystack, we didn't have a good way to stick them together so far. That's why we put a lot of thought into it over the last weeks and came up with a new Pipeline class that enables many new search scenarios beyond QA. The core idea: you can build a Directed Acyclic Graph (DAG) where each node is one "building block" (Reader, Retriever, Generator ...). Here's a simple example for a "standard" Open-Domain QA Pipeline:
p = Pipeline()
p.add_node(component=retriever, name="ESRetriever1", inputs=["Query"])
p.add_node(component=reader, name="QAReader", inputs=["ESRetriever1"])
res = p.run(query="What did Einstein work on?", top_k_retriever=1)
You can draw the DAG to better inspect what you are building:
p.draw(path="custom_pipe.png")
Multiple retrievers
You can now also use multiple Retrievers and join their results:
p = Pipeline()
p.add_node(component=es_retriever, name="ESRetriever", inputs=["Query"])
p.add_node(component=dpr_retriever, name="DPRRetriever", inputs=["Query"])
p.add_node(component=JoinDocuments(join_mode="concatenate"), name="JoinResults", inputs=["ESRetriever", "DPRRetriever"])
p.add_node(component=reader, name="QAReader", inputs=["JoinResults"])
res = p.run(query="What did Einstein work on?", top_k_retriever=1)
Custom nodes
You can easily build your own custom nodes. Just respect the following requirements:
- Add a method run(self, **kwargs) to your class. **kwargs will contain the output from the previous node in your graph.
- Do whatever you want within run() (e.g. reformatting the query).
- Return a tuple that contains your output data (for the next node) and the name of the outgoing edge: output_dict, "output_1"
- Add a class attribute outgoing_edges = 1 that defines the number of output options from your node. You only need a higher number here if you have a decision node (see below).
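As a rough illustration of the requirements above, a custom node might look like the following sketch. The class name QueryNormalizer and its whitespace-cleaning behavior are hypothetical; only the run(self, **kwargs) signature, the (output, edge name) return tuple, and the outgoing_edges attribute come from the list above. It also assumes that the incoming kwargs contain a "query" key, as in the pipeline examples.

```python
class QueryNormalizer:
    """Hypothetical custom node: trims and collapses whitespace in the query."""

    # One output option, since this node never branches.
    outgoing_edges = 1

    def run(self, **kwargs):
        # Assumption: the previous node (or the pipeline input) passes "query".
        query = kwargs.get("query", "")
        output = dict(kwargs)
        output["query"] = " ".join(query.split())
        # Return the data for the next node plus the name of the outgoing edge.
        return output, "output_1"


node = QueryNormalizer()
result, edge = node.run(query="  What did   Einstein work on? ", top_k_retriever=1)
print(result["query"], edge)
```

Such a node could then be registered like any other component, for example with p.add_node(component=QueryNormalizer(), name="QueryNormalizer", inputs=["Query"]).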
Decision nodes
You can also add decision nodes, where only one "branch" is executed afterwards. This allows you, for example, to classify an incoming query and route it to different modules depending on the result:
class QueryClassifier():
    outgoing_edges = 2

    def run(self, **kwargs):
        if "?" in kwargs["query"]:
            return (kwargs, "output_1")
        else:
            return (kwargs, "output_2")
pipe = Pipeline()
pipe.add_node(component=QueryClassifier(), name="QueryClassifier", inputs=["Query"])
pipe.add_node(component=es_retriever, name="ESRetriever", inputs=["QueryClassifier.output_1"])
pipe.add_node(component=dpr_retriever, name="DPRRetriever", inputs=["QueryClassifier.output_2"])
pipe.add_node(component=JoinDocuments(join_mode="concatenate"), name="JoinResults",
              inputs=["ESRetriever", "DPRRetriever"])
pipe.add_node(component=reader, name="QAReader", inputs=["JoinResults"])
res = pipe.run(query="What did Einstein work on?", top_k_retriever=1)
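To make the routing contract more concrete, here is a toy, library-free sketch of how an edge name returned by a decision node can select exactly one downstream branch. It is not Haystack's Pipeline implementation: QuestionRouter, Tagger, and run_branch are hypothetical names, and the Tagger nodes merely stand in for real retrievers.

```python
class QuestionRouter:
    """Toy decision node mirroring the QueryClassifier above."""

    outgoing_edges = 2

    def run(self, **kwargs):
        edge = "output_1" if "?" in kwargs["query"] else "output_2"
        return kwargs, edge


class Tagger:
    """Hypothetical stand-in for a retriever: records which branch ran."""

    outgoing_edges = 1

    def __init__(self, label):
        self.label = label

    def run(self, **kwargs):
        return {**kwargs, "handled_by": self.label}, "output_1"


def run_branch(decision_node, branches, **kwargs):
    """Run the decision node, then only the branch matching its edge name."""
    output, edge = decision_node.run(**kwargs)
    result, _ = branches[edge].run(**output)
    return result


# Same wiring idea as above: output_1 -> ESRetriever, output_2 -> DPRRetriever.
branches = {"output_1": Tagger("ESRetriever"), "output_2": Tagger("DPRRetriever")}
question = run_branch(QuestionRouter(), branches, query="What did Einstein work on?")
keywords = run_branch(QuestionRouter(), branches, query="Physics Einstein")
print(question["handled_by"], keywords["handled_by"])
```

In this sketch, the query containing a question mark reaches only the node behind output_1, while the keyword-style query reaches only the node behind output_2.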
Default Pipelines (replacing the "Finder")
Last but not least, we added some "Default Pipelines" that allow you to run standard patterns with very few lines of code. They replace the Finder class, which is now deprecated.
from haystack.pipeline import (DocumentSearchPipeline, ExtractiveQAPipeline, FAQPipeline,
                               GenerativeQAPipeline, JoinDocuments, Pipeline)
# Extractive QA
qa_pipe = ExtractiveQAPipeline(reader=reader, retriever=retriever)
res = qa_pipe.run(query="When was Kant born?", top_k_retriever=3, top_k_reader=5)
# Document Search
doc_pipe = DocumentSearchPipeline(retriever=retriever)
res = doc_pipe.run(query="Physics Einstein", top_k_retriever=1)
# Generative QA
doc_pipe = GenerativeQAPipeline(generator=rag_generator, retriever=retriever)
res = doc_pipe.run(query="Physics Einstein", top_k_retriever=1)
# FAQ based QA
doc_pipe = FAQPipeline(retriever=retriever)
res = doc_pipe.run(query="How can I change my address?", top_k_retriever=3)
We plan many more features around the new pipelines, including parallelized execution, distributed execution, definition via YAML files, and dry runs.
New DocumentStore for the Open Distro of Elasticsearch (#676)
From now on we also support the Open Distro of Elasticsearch. This allows you to use many of the hosted Elasticsearch services (e.g. from AWS) more easily with Haystack. Usage is similar to the regular ElasticsearchDocumentStore:
document_store = OpenDistroElasticsearchDocumentStore(host="localhost", port="9200", ...)
⚠️ Breaking Changes
As Haystack is extending from QA to further search types, we decided to rename all parameters from question to query. This includes, for example, the predict() methods of the Readers, but also several other places. See #614 for details.
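If your own code still passes question=..., one way to ease the transition is a small shim around your wrapper functions. The following is only a sketch: the decorator accept_legacy_question and the stand-in predict function are hypothetical and are not part of Haystack. It simply forwards a legacy question keyword as query and emits a DeprecationWarning.

```python
import functools
import warnings


def accept_legacy_question(func):
    """Hypothetical decorator: map a legacy question= keyword onto query=."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if "question" in kwargs:
            if "query" in kwargs:
                raise TypeError("pass either 'query' or 'question', not both")
            warnings.warn("'question' was renamed to 'query'",
                          DeprecationWarning, stacklevel=2)
            kwargs["query"] = kwargs.pop("question")
        return func(*args, **kwargs)

    return wrapper


@accept_legacy_question
def predict(query, top_k=1):
    # Stand-in for your own wrapper around a Reader; not a Haystack API.
    return {"query": query, "top_k": top_k}


with warnings.catch_warnings():
    warnings.simplefilter("ignore", DeprecationWarning)
    legacy = predict(question="When was Kant born?")
current = predict(query="When was Kant born?")
```

Both calls produce the same result here, so old call sites keep working while they are migrated to the new parameter name.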
🤓 Detailed Changes
Preprocessing / File Conversion
- Redone: Fix concatenation of sentences in PreProcessor. Add stride for word-based splits with sentence boundaries #641
- Add needed whitespace before sentence start #582
DocumentStore
- Scale dot product into probabilities #667
- Add refresh_type param for Elasticsearch update_embeddings() #630
- Add return_embedding parameter for get_all_documents() #615
- Add support for update_existing_documents to SQL and FAISS document stores #584
- Add filters for delete_all_documents() #591
Retriever
- Fix saving tokenizers in DPR training + unify save and load dirs #682
- Fix a typo: num_negatives -> num_positives #681
- Refactor DensePassageRetriever._get_predictions #642
- Move DPR embeddings from GPU to CPU straight away #618
- Add MAP retriever metric for open-domain case #572
Reader / Generator
- Add GPU support for RAG #669
- Enable dynamic parameter updates for the FARMReader #650
- Add option in API Config to configure if reader can return "No Answer" #609
- Fix various generator issues #590
Pipeline
- Add support for building custom Search Pipelines #596
- Add set_node() for Pipeline #659
- Add support for aggregating scores in JoinDocuments node #683
- Add pipelines for GenerativeQA & FAQs #645
Other
- Cleanup Pytest Fixtures #639
- Add latest benchmark run #652
- Fix image links in tutorial #663
- Update query arg in Tutorial 7 #656
- Fix benchmarks #648
- Add link to FAISS Info in documentation #643
- Improve User Feedback Documentation #539
- Add formatting checks for shell scripts #627
- Update md files for API docs #631
- Clean API docs and increase coverage #621
- Add boxes for recommendations #629
- Automate benchmarks via CML #518
- Add contributor hall of fame #628
- README: Fix link to roadmap #626
- Fix docstring examples #604
- Cleaning the api docs #616
- Fix link to DocumentStore page #613
- Make more changes to documentation #578
- Remove column in benchmark website #608
- Make benchmarks clearer #606
- Fixing defaults configs for rest_apis #583
- Allow list of filter values in REST API #568
- Fix CI bug due to new Elasticsearch release and new model release #579
- Update Colab Torch Version #576
❤️ Big thanks to all contributors!
@sadakmed @Krak91 @icy @lalitpagaria @guillim @tanaysoni @tholor @Timoeller @PiffPaffM @bogdankostic