feat: update docs for jina embeddings v3 #2803

Merged: 5 commits, Sep 19, 2024
92 changes: 71 additions & 21 deletions site/en/embeddings/embed-with-jina.md
@@ -24,22 +24,44 @@ Then, instantiate the `JinaEmbeddingFunction`:
from pymilvus.model.dense import JinaEmbeddingFunction

jina_ef = JinaEmbeddingFunction(
-    model_name="jina-embeddings-v2-base-en", # Defaults to `jina-embeddings-v2-base-en`
-    api_key=JINAAI_API_KEY # Provide your Jina AI API key
+    model_name="jina-embeddings-v3", # Defaults to `jina-embeddings-v3`
+    api_key=JINAAI_API_KEY, # Provide your Jina AI API key
+    task="retrieval.passage", # Specify the task
+    dimensions=1024, # Defaults to 1024
)
```

__Parameters__:

- `model_name` (*string*)

-    The name of the Jina AI embedding model to use for encoding. You can specify any of the available Jina AI embedding model names, for example, `jina-embeddings-v2-base-en`, `jina-embeddings-v2-small-en`, etc. If you leave this parameter unspecified, `jina-embeddings-v2-base-en` will be used. For a list of available models, refer to [Jina Embeddings](https://jina.ai/embeddings).
+    The name of the Jina AI embedding model to use for encoding. You can specify any of the available Jina AI embedding model names, for example, `jina-embeddings-v3`, `jina-embeddings-v2-base-en`, etc. If you leave this parameter unspecified, `jina-embeddings-v3` will be used. For a list of available models, refer to [Jina Embeddings](https://jina.ai/embeddings).

- `api_key` (*string*)

The API key for accessing the Jina AI API.

-To create embeddings for documents, use the `encode_documents()` method:
+- `task` (*string*)
+
+    The type of input passed to the model. Required for embedding models v3 and higher.
+
+    - `"retrieval.passage"`: Used to encode large documents in retrieval tasks at indexing time.
+    - `"retrieval.query"`: Used to encode user queries or questions in retrieval tasks.
+    - `"classification"`: Used to encode text for text classification tasks.
+    - `"text-matching"`: Used to encode text for similarity matching, such as measuring similarity between two sentences.
+    - `"clustering"`: Used for clustering or reranking tasks.
+
+- `dimensions` (*int*)
+
+    The number of dimensions the resulting output embeddings should have. Defaults to 1024. Only supported for embedding models v3 and higher.
+
+- `late_chunking` (*bool*)
+
+    This parameter controls whether to use the late chunking method [introduced by Jina AI](https://arxiv.org/abs/2409.04701) for encoding a batch of sentences. Defaults to `False`. When set to `True`, the Jina AI API concatenates all sentences in the input field and feeds them to the model as a single string. Internally, the model embeds this long concatenated string and then performs late chunking, returning a list of embeddings that matches the size of the input list.
+
+To create embeddings for documents, use the `encode_documents()` method. This method is designed for document embeddings in asymmetric retrieval tasks, such as indexing documents for search or recommendation tasks. This method uses `retrieval.passage` as the task.
+
```python
docs = [
@@ -59,18 +81,18 @@ print("Dim:", jina_ef.dim, docs_embeddings[0].shape)
The expected output is similar to the following:

```python
-Embeddings: [array([-4.88487840e-01, -4.28095880e-01, 4.90086500e-01, -1.63274320e-01,
-       3.43437800e-01, 3.21476880e-01, 2.83173790e-02, -3.10403670e-01,
-       4.76985040e-01, -1.77410420e-01, -3.84803180e-01, -2.19224200e-01,
-       -2.52898000e-01, 6.62411900e-02, -8.58173100e-01, 1.05221800e+00,
+Embeddings: [array([9.80641991e-02, -8.51697400e-02, 7.36531913e-02, 1.42558888e-02,
+       -2.23589484e-02, 1.68494112e-03, -3.50753777e-02, -3.11530549e-02,
+       -3.26012149e-02, 5.04568312e-03, 3.69836427e-02, 3.48948985e-02,
+       8.19722563e-03, 5.88679723e-02, -6.71099266e-03, -1.82369724e-02,
...
-       -2.04462400e-01, 7.14229800e-01, -1.66823000e-01, 8.72551440e-01,
-       5.53560140e-01, 8.92506300e-01, -2.39408610e-01, -4.22413560e-01,
-       -3.19551350e-01, 5.59153850e-01, 2.44338100e-01, -8.60452100e-01])]
-Dim: 768 (768,)
+       2.48654783e-02, 3.43279652e-02, -1.66154150e-02, -9.90478322e-03,
+       -2.96043139e-03, -8.57473817e-03, -7.39028037e-04, 6.25024503e-03,
+       -1.08831357e-02, -4.00776342e-02, 3.25369164e-02, -1.42691191e-03])]
+Dim: 1024 (1024,)
```
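
The `late_chunking` behavior described in the parameter list can be pictured with a small sketch. This is not the Jina AI implementation: it assumes the model yields one contextualized vector per token of the concatenated input and that each sentence's embedding is the mean of the token vectors in its span. The helper `late_chunk`, the toy token vectors, and the span boundaries are all hypothetical.

```python
def late_chunk(token_embeddings, spans):
    """Mean-pool token embeddings over each (start, end) token span."""
    pooled = []
    for start, end in spans:
        tokens = token_embeddings[start:end]
        dim = len(tokens[0])
        # Average each dimension across the tokens that belong to this chunk.
        pooled.append([sum(t[d] for t in tokens) / len(tokens) for d in range(dim)])
    return pooled


# Stand-in token-level output for the concatenated input: 6 tokens, 2 dimensions.
# These are toy values, not real model output.
token_embeddings = [
    [1.0, 0.0], [3.0, 2.0],              # tokens of sentence 1
    [0.0, 4.0], [2.0, 0.0], [4.0, 2.0],  # tokens of sentence 2
    [5.0, 5.0],                          # token of sentence 3
]
spans = [(0, 2), (2, 5), (5, 6)]  # hypothetical token boundaries per sentence

chunk_embeddings = late_chunk(token_embeddings, spans)
print(chunk_embeddings)  # → [[2.0, 1.0], [2.0, 2.0], [5.0, 5.0]]
```

The result has one vector per input sentence, which matches the statement that the returned list has the same size as the input list.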

-To create embeddings for queries, use the `encode_queries()` method:
+To create embeddings for queries, use the `encode_queries()` method. This method is designed for query embeddings in asymmetric retrieval tasks, such as search queries or questions. This method uses `retrieval.query` as the task.

```python
queries = ["When was artificial intelligence founded",
@@ -85,13 +107,41 @@ print("Dim", jina_ef.dim, query_embeddings[0].shape)
The expected output is similar to the following:

```python
-Embeddings: [array([-5.99164660e-01, -3.49827350e-01, 8.22405160e-01, -1.18632730e-01,
-       5.78107540e-01, 1.09789170e-01, 2.91604200e-01, -3.29306450e-01,
-       2.93779640e-01, -2.17880800e-01, -6.84535440e-01, -3.79752000e-01,
-       -3.47541800e-01, 9.20846100e-02, -6.13804400e-01, 6.31312800e-01,
+Embeddings: [array([8.79201014e-03, 1.47551354e-02, 4.02722731e-02, -2.52991207e-02,
+       1.12719582e-02, 3.75947170e-02, 3.97946090e-02, -7.36681819e-02,
+       -2.17952449e-02, -1.16298944e-02, -6.83426252e-03, -5.12507409e-02,
+       5.26071340e-02, 6.75181448e-02, 3.92445624e-02, -1.40817231e-02,
...
-       -1.84993740e-02, 9.38629150e-01, 2.74858470e-02, 1.09396360e+00,
-       3.96270750e-01, 7.44445800e-01, -1.95404050e-01, -6.08383200e-01,
-       -3.75076300e-01, 3.87512200e-01, 8.11889650e-01, -3.76407620e-01])]
-Dim 768 (768,)
+       8.81703943e-03, 4.24629413e-02, -2.32944116e-02, -2.05193572e-02,
+       -3.22035812e-02, 2.81896023e-03, 3.85326855e-02, 3.64372656e-02,
+       -1.65050142e-02, -4.26847413e-02, 2.02664156e-02, -1.72684863e-02])]
+Dim 1024 (1024,)
```
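
Query embeddings and passage embeddings are meant to be compared with each other. The sketch below shows that comparison with toy vectors standing in for real `encode_queries()` and `encode_documents()` output. It assumes the embeddings are unit length, so the dot product equals cosine similarity. `rank_documents` is a hypothetical helper, not part of pymilvus.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted by descending dot-product score."""
    scores = [dot(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Toy unit-length vectors standing in for real embeddings.
query_vec = [1.0, 0.0]
doc_vecs = [[0.0, 1.0], [0.6, 0.8], [1.0, 0.0]]

ranking = rank_documents(query_vec, doc_vecs)
print(ranking)  # → [2, 1, 0]
```

In a real deployment the vector database does this ranking; the sketch only illustrates what the score between a query vector and a passage vector means.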

+To create embeddings of inputs for similarity matching (such as STS or symmetric retrieval tasks), text classification, clustering, or reranking tasks, use the appropriate `task` parameter value when instantiating the `JinaEmbeddingFunction` class.
+
+
+```python
+from pymilvus.model.dense import JinaEmbeddingFunction
+
+jina_ef = JinaEmbeddingFunction(
+    model_name="jina-embeddings-v3", # Defaults to `jina-embeddings-v3`
+    api_key=JINAAI_API_KEY, # Provide your Jina AI API key
+    task="text-matching",
+    dimensions=1024, # Defaults to 1024
+)
+
+texts = [
+    "Follow the white rabbit.", # English
+    "Sigue al conejo blanco.", # Spanish
+    "Suis le lapin blanc.", # French
+    "跟着白兔走。", # Chinese
+    "اتبع الأرنب الأبيض.", # Arabic
+    "Folge dem weißen Kaninchen.", # German
+]
+
+embeddings = jina_ef(texts)
+
+# Compute similarities
+print(embeddings[0] @ embeddings[1].T)
+```
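
The snippet above compares only the first two texts. A possible extension to all pairs is sketched below with plain Python lists standing in for the returned arrays. The vectors are toy values, and `cosine` and `similarity_matrix` are hypothetical helpers. Cosine similarity is used instead of a raw dot product so that the sketch does not rely on the embeddings being normalized.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def similarity_matrix(vectors):
    """Pairwise cosine similarities; entry [i][j] compares vectors i and j."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


# Toy vectors: the first two point in the same direction, the third is orthogonal.
vectors = [[3.0, 4.0], [6.0, 8.0], [-4.0, 3.0]]
sims = similarity_matrix(vectors)

for row in sims:
    print([round(s, 3) for s in row])
```

For multilingual inputs such as the translations above, high off-diagonal values would indicate that the model places the translations close together.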
25 changes: 18 additions & 7 deletions site/en/integrations/integrate_with_jina.md
@@ -42,13 +42,18 @@ Jina AI's core embedding model, excels in understanding detailed text, making it
from pymilvus.model.dense import JinaEmbeddingFunction

jina_api_key = "<YOUR_JINA_API_KEY>"
-ef = JinaEmbeddingFunction("jina-embeddings-v2-base-en", jina_api_key)
+ef = JinaEmbeddingFunction(
+    "jina-embeddings-v3",
+    jina_api_key,
+    task="retrieval.passage",
+    dimensions=1024
+)

query = "what is information retrieval?"
doc = "Information retrieval is the process of finding relevant information from a large collection of data or documents."

-qvecs = ef.encode_queries([query])
-dvecs = ef.encode_documents([doc])
+qvecs = ef.encode_queries([query]) # This method uses `retrieval.query` as the task
+dvecs = ef.encode_documents([doc]) # This method uses `retrieval.passage` as the task
```
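
The `dimensions` argument in the snippet above requests a fixed output size from the API. As a rough picture of what a smaller size means, the sketch below truncates a vector client-side and rescales it to unit length. It assumes Matryoshka-style embeddings, where the leading components carry most of the information. The API's own truncation may differ, and `truncate_and_normalize` is a hypothetical helper.

```python
import math


def truncate_and_normalize(vec, dims):
    """Keep the first `dims` components and rescale the result to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    # Guard against an all-zero prefix, which cannot be normalized.
    return [x / norm for x in head] if norm else head


# Toy 4-dimensional vector standing in for a full-size embedding.
full = [3.0, 4.0, 12.0, 0.5]
short = truncate_and_normalize(full, 2)
print(short)  # → [0.6, 0.8]
```

Whichever size is chosen, the vector field in the Milvus collection needs the same dimension as the embeddings stored in it.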

## Bilingual Embeddings
@@ -120,8 +125,14 @@ from pymilvus.model.dense import JinaEmbeddingFunction
from pymilvus import MilvusClient

jina_api_key = "<YOUR_JINA_API_KEY>"
-ef = JinaEmbeddingFunction("jina-embeddings-v2-base-en", jina_api_key)
-DIMENSION = 768 # size of jina-embeddings-v2-base-en
+DIMENSION = 1024 # `jina-embeddings-v3` supports flexible embedding sizes (32, 64, 128, 256, 512, 768, 1024), allowing for truncating embeddings to fit your application.
+ef = JinaEmbeddingFunction(
+    "jina-embeddings-v3",
+    jina_api_key,
+    task="retrieval.passage",
+    dimensions=DIMENSION,
+)
+

doc = [
"In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.",
@@ -130,7 +141,7 @@ doc = [
"The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems.",
]

-dvecs = ef.encode_documents(doc)
+dvecs = ef.encode_documents(doc) # This method uses `retrieval.passage` as the task

data = [
{"id": i, "vector": dvecs[i], "text": doc[i], "subject": "history"}
@@ -162,7 +173,7 @@ With all data in Milvus vector database, we can now perform semantic search by g

```python
queries = "What event in 1956 marked the official birth of artificial intelligence as a discipline?"
-qvecs = ef.encode_queries([queries])
+qvecs = ef.encode_queries([queries]) # This method uses `retrieval.query` as the task

res = milvus_client.search(
collection_name=COLLECTION_NAME, # target collection