[Bug]: The count query returns a value that is greater than the actual size by 1 #37789

Open
1 task done
r0x07k opened this issue Nov 19, 2024 · 3 comments
Labels: kind/bug, triage/needs-information


r0x07k commented Nov 19, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4.9
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka): rocksmq
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus + java
- OS(Ubuntu or CentOS): Windows
- CPU/Memory: N/A
- GPU: N/A
- Others: N/A

Current Behavior

The count query returns a value that is greater than the actual size by 1. This issue occurs with both the Python and Java SDKs.

To reproduce, I created a collection containing 54 elements (double-checked the size before creation). However, the count query incorrectly returns 55.

Python example:

client.query(
    collection_name=name,
    filter="",
    output_fields=["count(*)"]
)

# Output:
[{'count(*)': 55}]

The actual number of entities is 54, as verified below:

res = client.query(
    collection_name=name,
    filter="",
    limit=100,
    output_fields=["id", "text", "metadata"]
)
len(res)  # Returns 54
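
The two checks above can be wrapped in one helper so both numbers come from the same client and collection. This is only a sketch: `compare_counts` is a hypothetical name, `client` is assumed to be a pymilvus `MilvusClient` (or any object with a compatible `query` method), and `limit` must be larger than the collection size or the second number will be truncated.

```python
def compare_counts(client, collection_name, limit=100):
    """Return (count_star, rows_returned) for one collection.

    Hypothetical helper: `client` is assumed to expose the same `query`
    keyword arguments used in the snippets above.
    """
    # Number reported by the count(*) aggregation.
    count_res = client.query(
        collection_name=collection_name,
        filter="",
        output_fields=["count(*)"],
    )
    count_star = count_res[0]["count(*)"]

    # Number of rows actually returned by a plain query.
    rows = client.query(
        collection_name=collection_name,
        filter="",
        limit=limit,
        output_fields=["id"],
    )
    return count_star, len(rows)
```

With the collection described in this report, `compare_counts(client, name)` would be expected to return `(55, 54)`.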

The Java SDK also incorrectly returns 55.

var queryReq = QueryReq.builder()
    .collectionName(collection.getCollectionId())
    .filter("")
    .outputFields(Arrays.asList("count(*)"))
    .build();
var queryResp = client.query(queryReq);
var res = Integer.parseInt(queryResp.getQueryResults().getFirst().getEntity().entrySet().iterator().next().getValue().toString()); // Returns 55

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

r0x07k added the kind/bug and needs-triage labels Nov 19, 2024

yanliang567 (Contributor) commented

@r0x07k A quick question: do you have two entities with the same id? You can check by running a count(*) query for each id:

client.query(
    collection_name=name,
    filter="id==123xxx",
    output_fields=["count(*)"]
)

If not, please attach the Milvus logs and a birdwatcher backup for investigation. See https://github.com/milvus-io/birdwatcher for how to back up etcd with birdwatcher.

/assign @r0x07k

yanliang567 added the triage/needs-information label and removed the needs-triage label Nov 19, 2024

r0x07k commented Nov 19, 2024

@yanliang567 All IDs are unique.

from collections import Counter

res = client.query(
    collection_name=name,
    filter="",
    limit=100,
    output_fields=["id", "text", "metadata"]
)
ids = [item['id'] for item in res]
id_counts = Counter(ids)
# Avoid shadowing the built-in `id` inside the comprehension.
duplicate_ids = {pk for pk, count in id_counts.items() if count > 1}  # is empty
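
One caveat about this check: a later reply in this thread states that query() returns de-duplicated results while count(*) does not. If that holds, a Counter over query() output cannot reveal a duplicated primary key. The toy model below illustrates the point in plain Python. It is not Milvus code, and `stored_rows`, `toy_count`, and `toy_query` are hypothetical names used only for illustration.

```python
from collections import Counter

# Toy model only: plain Python, not Milvus internals.
stored_rows = [
    {"id": "a", "text": "first"},
    {"id": "b", "text": "second"},
    {"id": "b", "text": "second, inserted again"},  # duplicated primary key
]


def toy_count(rows):
    """Count raw rows, duplicates included (models count(*) as described in the thread)."""
    return len(rows)


def toy_query(rows):
    """Return one row per primary key (models a de-duplicating query())."""
    by_pk = {}
    for row in rows:
        # Which copy survives is an arbitrary choice in this toy model.
        by_pk[row["id"]] = row
    return list(by_pk.values())


result = toy_query(stored_rows)
duplicates = {pk for pk, n in Counter(r["id"] for r in result).items() if n > 1}

print(toy_count(stored_rows))  # 3
print(len(result))             # 2
print(duplicates)              # set()
```

In this model the raw count exceeds the query length by one, yet the duplicate check on the query output still comes back empty, which matches the symptom reported here.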

Will these logs be helpful?

2024-11-19 17:04:58 [2024/11/19 22:04:58.276 +00:00] [INFO] [querycoordv2/services.go:197] ["load collection request received"] [traceID=9f681275cf250fa0794d4ef083caa450] [collectionID=453902469135355765] [replicaNumber=1] [resourceGroups="[]"] [refreshMode=false] [schema="name:\"_89786e37_71fe_4ef2_8908_eb3220d0f4de\" fields:<fieldID:100 name:\"id\" is_primary_key:true data_type:VarChar type_params:<key:\"max_length\" value:\"36\" > > fields:<fieldID:101 name:\"text\" data_type:VarChar type_params:<key:\"max_length\" value:\"65535\" > > fields:<fieldID:102 name:\"metadata\" data_type:JSON > fields:<fieldID:103 name:\"vector\" data_type:FloatVector type_params:<key:\"dim\" value:\"768\" > > fields:<fieldID:104 name:\"$meta\" description:\"dynamic schema\" data_type:JSON is_dynamic:true > enable_dynamic_field:true "] [fieldIndexes="[453902469135355771,453902469135355772]"]
2024-11-19 17:04:58 [2024/11/19 22:04:58.276 +00:00] [WARN] [meta/coordinator_broker.go:118] ["failed to get collection level load info"] [collectionID=453902469135355765] [error="collection property not found: collection.replica.number"]
2024-11-19 17:04:58 [2024/11/19 22:04:58.276 +00:00] [WARN] [meta/coordinator_broker.go:125] ["failed to get collection level load info"] [collectionID=453902469135355765] [error="collection property not found: collection.replica.number"]
2024-11-19 17:04:58 [2024/11/19 22:04:58.276 +00:00] [INFO] [rootcoord/root_coord.go:2739] ["received request to describe database "] [traceID=9f681275cf250fa0794d4ef083caa450] [dbName=default]
2024-11-19 17:04:58 [2024/11/19 22:04:58.276 +00:00] [INFO] [rootcoord/root_coord.go:2763] ["done to describe database"] [traceID=9f681275cf250fa0794d4ef083caa450] [dbName=default] [ts=454047537099440131]
2024-11-19 17:04:58 [2024/11/19 22:04:58.276 +00:00] [WARN] [meta/coordinator_broker.go:139] ["failed to get database level load info"] [collectionID=453902469135355765] [error="database property not found: database.replica.number"]
2024-11-19 17:04:58 [2024/11/19 22:04:58.276 +00:00] [WARN] [meta/coordinator_broker.go:148] ["failed to get database level load info"] [collectionID=453902469135355765] [error="database property not found: database.resource_groups"]
2024-11-19 17:04:58 [2024/11/19 22:04:58.276 +00:00] [INFO] [job/scheduler.go:150] ["start to pre-execute job"] [traceID=9f681275cf250fa0794d4ef083caa450] [collectionID=453902469135355765]
2024-11-19 17:04:58 [2024/11/19 22:04:58.276 +00:00] [INFO] [job/scheduler.go:158] ["start to execute job"] [traceID=9f681275cf250fa0794d4ef083caa450] [collectionID=453902469135355765]
2024-11-19 17:04:58 [2024/11/19 22:04:58.276 +00:00] [INFO] [meta/failed_load_cache.go:107] ["FailedLoadCache removes cache"] [collectionID=453902469135355765]
2024-11-19 17:04:58 [2024/11/19 22:04:58.277 +00:00] [INFO] [job/scheduler.go:144] ["start to post-execute job"] [traceID=9f681275cf250fa0794d4ef083caa450] [collectionID=453902469135355765]
2024-11-19 17:04:58 [2024/11/19 22:04:58.277 +00:00] [INFO] [job/scheduler.go:146] ["job finished"] [traceID=9f681275cf250fa0794d4ef083caa450] [collectionID=453902469135355765]
2024-11-19 17:04:58 [2024/11/19 22:04:58.278 +00:00] [INFO] [querycoordv2/services.go:57] ["show collections request received"] [traceID=bb4e81dd93a3ff824ba5977831c4cf7a] [collections="[453902469135355765]"]

yanliang567 (Contributor) commented

The query() API returns de-duplicated results, so getting 54 rows is expected. But query(..., output_fields=["count(*)"]) is a bit special: it returns the count without de-duplication. So please try this:

res = client.query(
    collection_name=c_name,
    filter="",
    limit=100,
    output_fields=["id", "text"]
)

for item in res:
    id = item.get("id", None)
    count = client.query(
        collection_name=c_name,
        filter=f"id=='{id}'",              # my id is varchar in this sample
        output_fields=["count(*)"]
    )
    print(f"id: {id}, count: {count}")
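
The loop above prints a count for every id. A variant that collects only the ids whose count exceeds 1 may be easier to read on larger collections. This is a sketch under assumptions: `find_duplicated_ids` is a hypothetical helper, `client` is assumed to be a pymilvus `MilvusClient` (or any object with a compatible `query` method), and the quoting in the filter assumes a VarChar primary key whose values contain no single quotes.

```python
def find_duplicated_ids(client, collection_name, pk_field="id", limit=100):
    """Return {primary_key: count} for every key whose count(*) is above 1.

    Hypothetical helper built from the per-id check suggested in this thread.
    """
    # Fetch the (de-duplicated) primary keys first.
    rows = client.query(
        collection_name=collection_name,
        filter="",
        limit=limit,
        output_fields=[pk_field],
    )

    duplicated = {}
    for row in rows:
        pk = row[pk_field]
        res = client.query(
            collection_name=collection_name,
            filter=f"{pk_field} == '{pk}'",  # quotes assume a VarChar primary key
            output_fields=["count(*)"],
        )
        n = res[0]["count(*)"]
        if n > 1:
            duplicated[pk] = n
    return duplicated
```

An empty result would suggest the extra count does not come from a duplicated primary key, in which case the logs and birdwatcher backup requested earlier become the next thing to look at.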
