This repository has been archived by the owner on Nov 13, 2024. It is now read-only.

importing 146 GB faiss ivf flat index fails after 40% #87

Open

gland1 opened this issue May 31, 2024 · 21 comments

gland1 commented May 31, 2024

Current Behavior

Deployed Milvus with milvus-operator on 3 servers.
Tried to import a Faiss IVF flat index (built from a 200M wiki dataset), 146 GB in size.
It failed due to the 16 GB max file size limit.
Increased the max file size to 1024 GB.
Tried again, and it failed after 40% was done.

This is the error shown:

[2024/05/31 18:46:22.983 +03:00] [ERROR] [dbclient/milvus2x.go:206] ["[Loader] Check Milvus bulkInsertState Error"] [error="rpc error: code = Unknown desc = stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:556 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:570 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:107 github.com/milvus-io/milvus/internal/distributed/datacoord/client.wrapGrpcCall[...]\n/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:737 github.com/milvus-io/milvus/internal/distributed/datacoord/client.(*Client).GetImportProgress\n/go/src/github.com/milvus-io/milvus/internal/proxy/impl.go:6071 github.com/milvus-io/milvus/internal/proxy.(*Proxy).GetImportProgress\n/go/src/github.com/milvus-io/milvus/internal/proxy/impl.go:4649 github.com/milvus-io/milvus/internal/proxy.(*Proxy).GetImportState\n/go/src/github.com/milvus-io/milvus/internal/distributed/proxy/service.go:1018 github.com/milvus-io/milvus/internal/distributed/proxy.(*Server).GetImportState\n/go/pkg/mod/github.com/milvus-io/milvus-proto/go-api/[email protected]/milvuspb/milvus.pb.go:13136 github.com/milvus-io/milvus-proto/go-api/v2/milvuspb._MilvusService_GetImportState_Handler.func1\n/go/src/github.com/milvus-io/milvus/internal/proxy/connection/util.go:60 github.com/milvus-io/milvus/internal/proxy/connection.KeepActiveInterceptor: empty grpc client: find no available datacoord, check datacoord state"] [stack="github.com/zilliztech/milvus-migration/core/dbclient.(*Milvus2x).WaitBulkLoadSuccess\n\t/home/runner/work/milvus-migration/milvus-migration/core/dbclient/milvus2x.go:206\ngithub.com/zilliztech/milvus-migration/core/loader.(*Milvus2xLoader).loadDataOne\n\t/home/runner/work/milvus-migration/milvus-migration/core/loader/milvus2x_loader.go:198\ngithub.com/zilliztech/milvus-migration/core/loader.(*Milvus2xLoader).loadDataBatch.func1\n\t/home/runner/work/milvus-migration/milvus-migration/core/loader/milvus2x_loader.go:180\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/home/runner/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75"]
load error: rpc error: code = Unknown desc = stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace
/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:556 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call
/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:570 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall
/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:107 github.com/milvus-io/milvus/internal/distributed/datacoord/client.wrapGrpcCall[...]
/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:737 github.com/milvus-io/milvus/internal/distributed/datacoord/client.(*Client).GetImportProgress
/go/src/github.com/milvus-io/milvus/internal/proxy/impl.go:6071 github.com/milvus-io/milvus/internal/proxy.(*Proxy).GetImportProgress
/go/src/github.com/milvus-io/milvus/internal/proxy/impl.go:4649 github.com/milvus-io/milvus/internal/proxy.(*Proxy).GetImportState
/go/src/github.com/milvus-io/milvus/internal/distributed/proxy/service.go:1018 github.com/milvus-io/milvus/internal/distributed/proxy.(*Server).GetImportState
/go/pkg/mod/github.com/milvus-io/milvus-proto/go-api/[email protected]/milvuspb/milvus.pb.go:13136 github.com/milvus-io/milvus-proto/go-api/v2/milvuspb._MilvusService_GetImportState_Handler.func1
/go/src/github.com/milvus-io/milvus/internal/proxy/connection/util.go:60 github.com/milvus-io/milvus/internal/proxy/connection.KeepActiveInterceptor: empty grpc client: find no available datacoord, check datacoord state

Expected Behavior

Migration should succeed.

Steps To Reproduce

See the description above.

Environment

3-node Kubernetes cluster on bare-metal servers.

Anything else?

No response


gland1 commented May 31, 2024

Also,
each server has a total of 128 GB of memory.
I see that the datanode on server one is growing in memory; it has reached 83 GB and is still rising in the current attempt.


gland1 commented Jun 1, 2024

The operation fails after the datanode on server1 consumes too much memory and gets evicted.

@lentitude2tk (Collaborator)

The operation fails after the datanode on server1 consumes too much memory and gets evicted.

Hello user, I need to confirm the following information, please provide it:

  1. Is your milvus instance version 2.3?
  2. Did you enable PartitionKey or specify PartitionNum for the collection you imported?


gland1 commented Jun 3, 2024

The operation fails after the datanode on server1 consumes too much memory and gets evicted.

Hello user, I need to confirm the following information, please provide it:

  1. Is your milvus instance version 2.3?
  2. Did you enable PartitionKey or specify PartitionNum for the collection you imported?

Hi,
I'm using Milvus 2.4.1.
I did not specify PartitionKey or PartitionNum.
Do you think using partitions could work around this?


bigsheeper commented Jun 3, 2024

Hello @gland1, please use the following command to capture the memory information of the datanode when its memory usage is high:

go tool pprof {datanode_ip}:9091/debug/pprof/heap

After execution, you should see a pprof file generated.

Just provide the generated pprof file.
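
For reference, here is a minimal sketch of one way to capture and inspect that heap profile in a Kubernetes deployment like this one; the namespace and pod name are placeholders to adjust, and port-forwarding is only one way to reach the datanode's metrics port (9091):

# Forward the datanode's metrics/pprof port to the local machine (pod name is a placeholder).
kubectl -n kioxia port-forward pod/<datanode-pod> 9091:9091 &

# While memory usage is high, save the heap profile to a file...
curl -o datanode-heap.pprof http://localhost:9091/debug/pprof/heap

# ...and list the largest allocators locally.
go tool pprof -top datanode-heap.pprof

The saved datanode-heap.pprof file is what would be attached to the issue for analysis.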

@lentitude2tk (Collaborator)

@gland1

  1. During the time when the error was reported, was there only one import task in progress in the Milvus instance, and no other operations?
  2. When you adjusted the single-file size limit, did you change any other parameters, such as datanode.import.readBufferSizeInMB? (A config sketch follows this list.)
  3. You can capture pprof at the high-memory point, and we will help you analyze the memory usage. For the use of pprof, please refer to: https://medium.com/@luanrubensf/heap-dump-in-go-using-pprof-ae9837e05419
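
For clarity, a sketch of where those import parameters sit in the config section of the Milvus CR; the maxImportFileSizeInGB value is the one the reporter raised, while the readBufferSizeInMB line is only illustrative (per the answer below, it was left unchanged):

  config:
    dataNode:
      import:
        maxImportFileSizeInGB: 1024   # raised from the 16 GB limit the first attempt hit
        readBufferSizeInMB: 16        # illustrative; reportedly left at its default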


gland1 commented Jun 3, 2024

Hello @gland1, please use the following command to capture the memory information of the datanode when its memory usage is high:

go tool pprof {datanode_ip}:9091/debug/pprof/heap

After execution, you should see a pprof file generated.

Just provide the generated pprof file.

It will take some time to reach this state, as I'm now trying to load the dataset via inserts. (By the way, this also fails after a while
due to a timeout, and I have to record where I stopped and continue from there.
It looks like when Pulsar starts flushing the write cache to disk, things become very slow and it finally fails on a timeout.)


gland1 commented Jun 3, 2024

@gland1

  1. During the time when the error was reported, was there only one import task in progress in the Milvus instance, and no other operations?
  2. When you adjusted the single-file size limit, did you change any other parameters, such as datanode.import.readBufferSizeInMB?
  3. You can capture pprof at the high-memory point, and we will help you analyze the memory usage. For the use of pprof, please refer to: https://medium.com/@luanrubensf/heap-dump-in-go-using-pprof-ae9837e05419
    • Yes, just one import task.
    • No, no other parameters were changed.
    • Yes, see my previous comment.


gland1 commented Jun 16, 2024

I've tried to recreate this:
the migration kept hanging at 70%, and I saw the datacoord log blowing up to more than 30 GB.
The reason seems to be that packets larger than what etcd will accept are being sent:
{"log":"[2024/06/16 21:05:57.618 +00:00] [WARN] [etcd/etcd_kv.go:665] ["value size large than 100kb"] [key=datacoord-meta/statslog/450500795239284436/450500795239284437/450500795239296710/100] [value_size(kb)=1120]\n","stream":"stdout","time":"2024-06-16T21:05:57.619059105Z"}
{"log":"{"level":"warn","ts":"2024-06-16T21:05:57.620Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00118ca80/milvus3-etcd.kioxia:2379","attempt":0,"error":"rpc error: code = ResourceExhausted desc = trying to send message larger than max (2589626 vs. 2097152)"}\n","stream":"stderr","time":"2024-06-16T21:05:57.620224606Z"}
{"log":"[2024/06/16 21:05:57.618 +00:00] [WARN] [etcd/etcd_kv.go:665] ["value size large than 100kb"] [key=datacoord-meta/binlog/450500795239284436/450500795239284437/450500795239296710/101] [value_size(kb)=358]\n","stream":"stdout","time":"2024-06-16T21:05:57.784447554Z"}
{"log":"[2024/06/16 21:

@xiaofan-luan (Collaborator)

@lentitude2tk please try to reproduce this in-house and see what we can improve.


lentitude2tk commented Jun 17, 2024

@lentitude2tk please try to reproduce this in-house and see what we can improve.

OK, I will find the relevant people to try to reproduce this issue with this data volume in-house.
Additionally, I would like to confirm some information, @gland1: when you perform the import, does the collection already have relevant indexes? If so, you can try setting the parameter dataCoord.import.waitForIndex to false for testing, or you can drop the index before performing the data import.
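
For illustration, a sketch of where that flag could be set in the config section of the Milvus CR, alongside the reporter's other overrides shown further down (for testing only, per the suggestion above):

  config:
    dataCoord:
      import:
        waitForIndex: false   # do not wait for index building before marking the import complete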

@lentitude2tk (Collaborator)

@gland1 Could you please let us know whether you are using a public wiki dataset? If so, could you provide us with the link? Additionally, could you share the migration.yaml configuration you are using with milvus-migration (sensitive information can be redacted)? We will reproduce the issue you encountered locally and work on resolving it.

@zhuwenxing (Collaborator)

@lentitude2tk
If the file exported from upstream is very large, will the migration tool split it into a list of smaller files and then do the bulk insert?


lentitude2tk commented Jun 17, 2024

@lentitude2tk If the file exported from upstream is very large, will the migration tool split it into a list of smaller files and then do the bulk insert?

Per the user's feedback of "hanging at 70%": for version 2.4, 70% indicates that bulkInsert has completed and the index is currently being built. Therefore, the question is why buildIndex is hanging.


gland1 commented Jun 17, 2024

Hi,
please note I'm using an unusually large maxSize for segments: 80 GB.

This is the full Milvus YAML I'm using:
# This is a sample to deploy a milvus cluster in milvus-operator's default configurations.
apiVersion: milvus.io/v1beta1
kind: Milvus
metadata:
  name: milvus3
  namespace: kioxia
  labels:
    app: milvus
spec:
  mode: cluster
  dependencies:
    etcd:
      inCluster:
        values:
          persistence:
            storageClass: standard
            size: 10Gi
          volumePermissions:
            enabled: true
    storage:
      inCluster:
        values:
          replicas: 3
          persistence:
            storageClass: standard-thin
            size: 10Ti

    pulsar:      
      inCluster:
        values:                       
          zookeeper:
            replicaCount: 3
            volumes:
              data:
                size: 40Gi
          broker:
            replicaCount: 3
            resources:
              limits:
                cpu: 4
                memory: 16Gi   
            configData:                                                                                                                                 
              PULSAR_MEM: >                                                                                                                             
                -Xms128m -Xmx256m -XX:MaxDirectMemorySize=256m                                                    
              PULSAR_GC: >     
                -XX:+IgnoreUnrecognizedVMOptions                  
                -XX:+UseG1GC                                                                                                         
                -XX:MaxGCPauseMillis=10                                                                                                                   
                -Dio.netty.leakDetectionLevel=disabled                                                                                                    
                -Dio.netty.recycler.linkCapacity=1024                                                                                                   
                -XX:+ParallelRefProcEnabled                                                                                                             
                -XX:+UnlockExperimentalVMOptions                                                                  
                -XX:+DoEscapeAnalysis                                                          
                -XX:ParallelGCThreads=4                                                                                              
                -XX:ConcGCThreads=4                                                                                                                       
                -XX:G1NewSizePercent=50                                                                                                                   
                -XX:+DisableExplicitGC                                                                                                                  
                -XX:-ResizePLAB                                                                                                                         
                -XX:+ExitOnOutOfMemoryError                                                                       
                -XX:+PerfDisableSharedMem                                          
          bookkeeper:
            replicaCount: 3
            configData:                                                                                                                                   
              # we use `bin/pulsar` for starting bookie daemons                                                                                           
              PULSAR_MEM: >                                                                                                                             
                -Xms128m                                                                                                                                
                -Xmx256m                                                                                                                                  
                -XX:MaxDirectMemorySize=256m                                                                                                            
              PULSAR_GC: >        
                -XX:+IgnoreUnrecognizedVMOptions                  
                -XX:+UseG1GC                                                                                                         
                -XX:MaxGCPauseMillis=10                                                                                                                   
                -XX:+ParallelRefProcEnabled                                                                                                             
                -XX:+UnlockExperimentalVMOptions                                                                                                        
                -XX:+DoEscapeAnalysis                                                                                                                   
                -XX:ParallelGCThreads=4                                                                                                                 
                -XX:ConcGCThreads=4                                                                                                                     
                -XX:G1NewSizePercent=50                                                                                                                 
                -XX:+DisableExplicitGC                                                                                                                    
                -XX:-ResizePLAB                                                                                                                         
                -XX:+ExitOnOutOfMemoryError                                                                                                             
                -XX:+PerfDisableSharedMem                                                                                                               
                -verbosegc                                                                                                                              
                -Xloggc:/var/log/bookie-gc.log                                                                                       
                -XX:G1LogLevel=finest                               
            resources:
              limits:
                cpu: 4
                memory: 16Gi             
             
  components:
    proxy:
      replicas: 3
      serviceType: LoadBalancer
    queryNode:
      replicas: 3
      volumeMounts:
      - mountPath: /var/lib/milvus/data
        name: disk
      volumes:
      - name: disk
        hostPath:
          path: "/var/lib/milvus/data"
          type: DirectoryOrCreate
    indexNode:
      replicas: 3
      env:
        - name: LOCAL_STORAGE_SIZE
          value: "300"
      volumeMounts:
      - mountPath: /var/lib/milvus/data
        name: disk
      volumes:
      - name: disk
        hostPath:
          path: "/var/lib/milvus/data"
          type: DirectoryOrCreate      
    dataCoord:
      replicas: 1
    indexCoord:
      replicas: 1
    dataNode:
      replicas: 3
      
  config:
    log:
      file:
        maxAge: 10
        maxBackups: 20
        maxSize: 100            
      format: text
      level: warn
    common:
      DiskIndex:
        BeamWidthRatio: 8
        BuildNumThreadsRatio: 1
        LoadNumThreadRatio: 8
        MaxDegree: 28
        PQCodeBudgetGBRatio: 0.04
        SearchCacheBudgetGBRatio: 0.1
        SearchListSize: 50
    proxy:
      grpc:
        serverMaxRecvSize: 2147483648   # 2GB
        serverMaxSendSize: 2147483648
        clientMaxRecvSize: 2147483648
        clientMaxSendSize: 2147483648
    dataNode:
      import:
        maxImportFileSizeInGB: 1024
    queryNode:
      segcore:
        knowhereThreadPoolNumRatio: 1
    queryCoord:
      loadTimeoutSeconds: 1200
    dataCoord:
      segment:
        maxSize: 81920
        diskSegmentMaxSize: 81920
        sealProportion: 0.9
        smallProportion: 0.5
      compaction:
        rpcTimeout: 180
        timeout: 5600
        levelzero:
          forceTrigger:
            maxSize: 85899345920

This is the migration yaml:
dumper: # configs for the migration job.
  worker:
    limit: 16
    workMode: faiss # operational mode of the migration job.
    reader:
      bufferSize: 1024
    writer:
      bufferSize: 1024
loader:
  worker:
    limit: 16
source: # configs for the source Faiss index.
  mode: local
  local:
    faissFile: /var/lib/milvus/vector-files/ivfflat_base.50M_lists7100.faissindex

target: # configs for the target Milvus collection.
  create:
    collection:
      name: wiki50M2
      shardsNums: 12
      dim: 768
      metricType: L2
  mode: remote
  remote:
    outputDir: testfiles/output/
    cloud: aws
    endpoint: 10.42.0.104:9000
    region: ap-southeast-1
    bucket: milvus3
    ak: minioadmin
    sk: minioadmin
    useIAM: false
    useSSL: false
    checkBucket: true
  milvus2x:
    endpoint: 172.16.10.111:19530

As for the dataset: we carved out 50M vectors from the 88M-vector wiki-all NVIDIA dataset, available at:
https://docs.rapids.ai/api/raft/nightly/wiki_all_dataset/


gland1 commented Jun 17, 2024

Any idea how I can stop the migration?

@xiaocai2333

The segment maxSize is too large in the configuration; 1024 MB is the recommended size.
@gland1

@xiaocai2333

If the segments are too large, there will be too many binlog files, and some atomic operations cannot be completed. In addition, roughly 80 GB × 4 = 320+ GB of memory is required when building the index.
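
For comparison, a sketch of the dataCoord.segment block with the recommended size in place of the 81920 MB used in the configuration above (values in MB; the diskSegmentMaxSize line is illustrative rather than a specific recommendation from this thread):

  config:
    dataCoord:
      segment:
        maxSize: 1024              # recommended segment size, instead of 81920
        diskSegmentMaxSize: 2048   # illustrative; also far smaller than the 81920 used above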


gland1 commented Jun 17, 2024

I'm using a DiskANN index, so it should require less memory to build.
We are investigating horizontal scaling, so we wanted as few segments as possible at first.
But I'll soon try with smaller segments.

Is it possible to stop the migration?

@lentitude2tk (Collaborator)

I'm using a DiskANN index, so it should require less memory to build. We are investigating horizontal scaling, so we wanted as few segments as possible at first. But I'll soon try with smaller segments.

Is it possible to stop the migration?

If your target collection is a PoC/testing collection, you can delete the collection, which will cause the entire migration task to fail (and thereby stop).

@bigsheeper

@gland1 Why is it necessary to have as few segments as possible when investigating horizontal scaling?

Large segments would result in many side effects, as detailed here: milvus-io/milvus#33808 (comment)
