This repository has been archived by the owner on Nov 13, 2024. It is now read-only.

importing 146 GB faiss ivf flat index fails after 40% #87

Open

gland1 opened this issue May 31, 2024 · 21 comments

gland1 commented May 31, 2024

Current Behavior

Deployed Milvus with milvus-operator on 3 servers.
Tried to import a Faiss IVF flat index (built from a 200M wiki dataset), 146 GB in size.
It failed due to the 16 GB max file size limit.
Increased the max file size to 1024 GB.
Tried again, and it failed after 40% was done.

This is the error shown:

[2024/05/31 18:46:22.983 +03:00] [ERROR] [dbclient/milvus2x.go:206] ["[Loader] Check Milvus bulkInsertState Error"] [error="rpc error: code = Unknown desc = stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:556 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:570 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:107 github.com/milvus-io/milvus/internal/distributed/datacoord/client.wrapGrpcCall[...]\n/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:737 github.com/milvus-io/milvus/internal/distributed/datacoord/client.(*Client).GetImportProgress\n/go/src/github.com/milvus-io/milvus/internal/proxy/impl.go:6071 github.com/milvus-io/milvus/internal/proxy.(*Proxy).GetImportProgress\n/go/src/github.com/milvus-io/milvus/internal/proxy/impl.go:4649 github.com/milvus-io/milvus/internal/proxy.(*Proxy).GetImportState\n/go/src/github.com/milvus-io/milvus/internal/distributed/proxy/service.go:1018 github.com/milvus-io/milvus/internal/distributed/proxy.(*Server).GetImportState\n/go/pkg/mod/github.com/milvus-io/milvus-proto/go-api/[email protected]/milvuspb/milvus.pb.go:13136 github.com/milvus-io/milvus-proto/go-api/v2/milvuspb._MilvusService_GetImportState_Handler.func1\n/go/src/github.com/milvus-io/milvus/internal/proxy/connection/util.go:60 github.com/milvus-io/milvus/internal/proxy/connection.KeepActiveInterceptor: empty grpc client: find no available datacoord, check datacoord state"] [stack="github.com/zilliztech/milvus-migration/core/dbclient.(*Milvus2x).WaitBulkLoadSuccess\n\t/home/runner/work/milvus-migration/milvus-migration/core/dbclient/milvus2x.go:206\ngithub.com/zilliztech/milvus-migration/core/loader.(*Milvus2xLoader).loadDataOne\n\t/home/runner/work/milvus-migration/milvus-migration/core/loader/milvus2x_loader.go:198\ngithub.com/zilliztech/milvus-migration/core/loader.(*Milvus2xLoader).loadDataBatch.func1\n\t/home/runner/work/milvus-migration/milvus-migration/core/loader/milvus2x_loader.go:180\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/home/runner/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75"]
load error: rpc error: code = Unknown desc = stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace
/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:556 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call
/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:570 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall
/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:107 github.com/milvus-io/milvus/internal/distributed/datacoord/client.wrapGrpcCall[...]
/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:737 github.com/milvus-io/milvus/internal/distributed/datacoord/client.(*Client).GetImportProgress
/go/src/github.com/milvus-io/milvus/internal/proxy/impl.go:6071 github.com/milvus-io/milvus/internal/proxy.(*Proxy).GetImportProgress
/go/src/github.com/milvus-io/milvus/internal/proxy/impl.go:4649 github.com/milvus-io/milvus/internal/proxy.(*Proxy).GetImportState
/go/src/github.com/milvus-io/milvus/internal/distributed/proxy/service.go:1018 github.com/milvus-io/milvus/internal/distributed/proxy.(*Server).GetImportState
/go/pkg/mod/github.com/milvus-io/milvus-proto/go-api/[email protected]/milvuspb/milvus.pb.go:13136 github.com/milvus-io/milvus-proto/go-api/v2/milvuspb._MilvusService_GetImportState_Handler.func1
/go/src/github.com/milvus-io/milvus/internal/proxy/connection/util.go:60 github.com/milvus-io/milvus/internal/proxy/connection.KeepActiveInterceptor: empty grpc client: find no available datacoord, check datacoord state

Expected Behavior

Migration should succeed.

Steps To Reproduce

See the description above.

Environment

3-node Kubernetes cluster on bare-metal servers.

Anything else?

No response


gland1 commented May 31, 2024

Also,
each server has a total of 128 GB of memory.
I see that the datanode on server one is growing in memory; it has reached 83 GB and is still rising in the current attempt.


gland1 commented Jun 1, 2024

The operation fails after the datanode on server1 consumes too much memory and gets evicted.

@lentitude2tk (Collaborator)

The operation fails after the datanode on server1 consumes too much memory and gets evicted.

Hello user, I need to confirm the following information, please provide it:

  1. Is your milvus instance version 2.3?
  2. Did you enable PartitionKey or specify PartitionNum for the collection you imported?


gland1 commented Jun 3, 2024

The operation fails after the datanode on server1 consumes too much memory and gets evicted.

Hello user, I need to confirm the following information, please provide it:

  1. Is your milvus instance version 2.3?
  2. Did you enable PartitionKey or specify PartitionNum for the collection you imported?

Hi,
I'm using Milvus 2.4.1.
I did not specify PartitionKey or PartitionNum.
Do you think using partitions could work around this?


bigsheeper commented Jun 3, 2024

Hello @gland1, please use the following command to capture the memory information of the datanode when its memory usage is high:

go tool pprof {datanode_ip}:9091/debug/pprof/heap

After execution, you should see a pprof file generated.

Just provide the generated pprof file.
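
For reference, here is a minimal sketch of one way to capture and inspect that heap profile in a Kubernetes deployment like this one; the namespace and pod name are placeholders to adjust, and port-forwarding is only one way to reach the datanode's metrics port (9091):

# Forward the datanode's metrics/pprof port to the local machine (pod name is a placeholder).
kubectl -n kioxia port-forward pod/<datanode-pod> 9091:9091 &

# While memory usage is high, save the heap profile to a file...
curl -o datanode-heap.pprof http://localhost:9091/debug/pprof/heap

# ...and list the largest allocators locally.
go tool pprof -top datanode-heap.pprof

The saved datanode-heap.pprof file is what would be attached to the issue for analysis.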

@lentitude2tk (Collaborator)

@gland1

  1. During the time when the error was reported, was there only one import task in progress in the Milvus instance, and no other operations?
  2. When you adjusted the single-file size limit, did you change any other parameters, such as datanode.import.readBufferSizeInMB? (A config sketch follows this list.)
  3. You can capture pprof at the high-memory point, and we will help you analyze the memory usage. For the use of pprof, please refer to: https://medium.com/@luanrubensf/heap-dump-in-go-using-pprof-ae9837e05419
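
For clarity, a sketch of where those import parameters sit in the config section of the Milvus CR; the maxImportFileSizeInGB value is the one the reporter raised, while the readBufferSizeInMB line is only illustrative (per the answer below, it was left unchanged):

  config:
    dataNode:
      import:
        maxImportFileSizeInGB: 1024   # raised from the 16 GB limit the first attempt hit
        readBufferSizeInMB: 16        # illustrative; reportedly left at its default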


gland1 commented Jun 3, 2024

Hello @gland1, please use the following command to capture the memory information of the datanode when its memory usage is high:

go tool pprof {datanode_ip}:9091/debug/pprof/heap

After execution, you should see a pprof file generated.

Just provide the generated pprof file.

It will take some time to reach this state, as I'm now trying to load the dataset via inserts. (By the way, this also fails after a while
due to a timeout, and I have to record where I stopped and continue from there.
It looks like when Pulsar starts flushing the write cache to disk, things become very slow and it finally fails on a timeout.)


gland1 commented Jun 3, 2024

@gland1

  1. During the time when the error was reported, was there only one import task in progress in the Milvus instance, and no other operations?
  2. When you adjusted the single-file size limit, did you change any other parameters, such as datanode.import.readBufferSizeInMB?
  3. You can capture pprof at the high-memory point, and we will help you analyze the memory usage. For the use of pprof, please refer to: https://medium.com/@luanrubensf/heap-dump-in-go-using-pprof-ae9837e05419
    • Yes, just one import task.
    • No, no other parameters were changed.
    • Yes, see my previous comment.


gland1 commented Jun 16, 2024

I've tried to recreate this:
the migration kept hanging at 70%, and I saw the datacoord log blowing up to more than 30 GB.
The reason seems to be that packets larger than what etcd will accept are being sent:
{"log":"[2024/06/16 21:05:57.618 +00:00] [WARN] [etcd/etcd_kv.go:665] ["value size large than 100kb"] [key=datacoord-meta/statslog/450500795239284436/450500795239284437/450500795239296710/100] [value_size(kb)=1120]\n","stream":"stdout","time":"2024-06-16T21:05:57.619059105Z"}
{"log":"{"level":"warn","ts":"2024-06-16T21:05:57.620Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00118ca80/milvus3-etcd.kioxia:2379","attempt":0,"error":"rpc error: code = ResourceExhausted desc = trying to send message larger than max (2589626 vs. 2097152)"}\n","stream":"stderr","time":"2024-06-16T21:05:57.620224606Z"}
{"log":"[2024/06/16 21:05:57.618 +00:00] [WARN] [etcd/etcd_kv.go:665] ["value size large than 100kb"] [key=datacoord-meta/binlog/450500795239284436/450500795239284437/450500795239296710/101] [value_size(kb)=358]\n","stream":"stdout","time":"2024-06-16T21:05:57.784447554Z"}
{"log":"[2024/06/16 21:

@xiaofan-luan (Collaborator)

@lentitude2tk please try to reproduce this in-house and see what we can improve.


lentitude2tk commented Jun 17, 2024

@lentitude2tk please try to reproduce this in-house and see what we can improve.

OK, I will find the relevant people to try to reproduce this issue with this data volume in-house.
Additionally, I would like to confirm some information, @gland1: when you perform the import, does the collection already have relevant indexes? If so, you can try setting the parameter dataCoord.import.waitForIndex to false for testing, or you can drop the index before performing the data import.
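
For illustration, a sketch of where that flag could be set in the config section of the Milvus CR, alongside the reporter's other overrides shown further down (for testing only, per the suggestion above):

  config:
    dataCoord:
      import:
        waitForIndex: false   # do not wait for index building before marking the import complete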

@lentitude2tk (Collaborator)

@gland1 Could you please let us know whether you are using a public wiki dataset? If so, could you provide us with the link? Additionally, could you share the migration.yaml configuration you are using with milvus-migration (sensitive information can be redacted)? We will reproduce the issue you encountered locally and work on resolving it.

@zhuwenxing (Collaborator)

@lentitude2tk
If the file exported from upstream is very large, will the migration tool split it into a list of smaller files and then do the bulk insert?


lentitude2tk commented Jun 17, 2024

@lentitude2tk If the file exported from upstream is very large, will the migration tool split it into a list of smaller files and then do the bulk insert?

Per the user's feedback of "hanging at 70%": for version 2.4, 70% indicates that bulkInsert has completed and the index is currently being built. Therefore, the question is why buildIndex is hanging.


gland1 commented Jun 17, 2024

Hi,
please note I'm using an unusually large maxSize for segments: 80 GB.

This is the full Milvus YAML I'm using:
# This is a sample to deploy a milvus cluster in milvus-operator's default configurations.
apiVersion: milvus.io/v1beta1
kind: Milvus
metadata:
  name: milvus3
  namespace: kioxia
  labels:
    app: milvus
spec:
  mode: cluster
  dependencies:
    etcd:
      inCluster:
        values:
          persistence:
            storageClass: standard
            size: 10Gi
          volumePermissions:
            enabled: true
    storage:
      inCluster:
        values:
          replicas: 3
          persistence:
            storageClass: standard-thin
            size: 10Ti

    pulsar:      
      inCluster:
        values:                       
          zookeeper:
            replicaCount: 3
            volumes:
              data:
                size: 40Gi
          broker:
            replicaCount: 3
            resources:
              limits:
                cpu: 4
                memory: 16Gi   
            configData:                                                                                                                                 
              PULSAR_MEM: >                                                                                                                             
                -Xms128m -Xmx256m -XX:MaxDirectMemorySize=256m                                                    
              PULSAR_GC: >     
                -XX:+IgnoreUnrecognizedVMOptions                  
                -XX:+UseG1GC                                                                                                         
                -XX:MaxGCPauseMillis=10                                                                                                                   
                -Dio.netty.leakDetectionLevel=disabled                                                                                                    
                -Dio.netty.recycler.linkCapacity=1024                                                                                                   
                -XX:+ParallelRefProcEnabled                                                                                                             
                -XX:+UnlockExperimentalVMOptions                                                                  
                -XX:+DoEscapeAnalysis                                                          
                -XX:ParallelGCThreads=4                                                                                              
                -XX:ConcGCThreads=4                                                                                                                       
                -XX:G1NewSizePercent=50                                                                                                                   
                -XX:+DisableExplicitGC                                                                                                                  
                -XX:-ResizePLAB                                                                                                                         
                -XX:+ExitOnOutOfMemoryError                                                                       
                -XX:+PerfDisableSharedMem                                          
          bookkeeper:
            replicaCount: 3
            configData:                                                                                                                                   
              # we use `bin/pulsar` for starting bookie daemons                                                                                           
              PULSAR_MEM: >                                                                                                                             
                -Xms128m                                                                                                                                
                -Xmx256m                                                                                                                                  
                -XX:MaxDirectMemorySize=256m                                                                                                            
              PULSAR_GC: >        
                -XX:+IgnoreUnrecognizedVMOptions                  
                -XX:+UseG1GC                                                                                                         
                -XX:MaxGCPauseMillis=10                                                                                                                   
                -XX:+ParallelRefProcEnabled                                                                                                             
                -XX:+UnlockExperimentalVMOptions                                                                                                        
                -XX:+DoEscapeAnalysis                                                                                                                   
                -XX:ParallelGCThreads=4                                                                                                                 
                -XX:ConcGCThreads=4                                                                                                                     
                -XX:G1NewSizePercent=50                                                                                                                 
                -XX:+DisableExplicitGC                                                                                                                    
                -XX:-ResizePLAB                                                                                                                         
                -XX:+ExitOnOutOfMemoryError                                                                                                             
                -XX:+PerfDisableSharedMem                                                                                                               
                -verbosegc                                                                                                                              
                -Xloggc:/var/log/bookie-gc.log                                                                                       
                -XX:G1LogLevel=finest                               
            resources:
              limits:
                cpu: 4
                memory: 16Gi             
             
  components:
    proxy:
      replicas: 3
      serviceType: LoadBalancer
    queryNode:
      replicas: 3
      volumeMounts:
      - mountPath: /var/lib/milvus/data
        name: disk
      volumes:
      - name: disk
        hostPath:
          path: "/var/lib/milvus/data"
          type: DirectoryOrCreate
    indexNode:
      replicas: 3
      env:
        - name: LOCAL_STORAGE_SIZE
          value: "300"
      volumeMounts:
      - mountPath: /var/lib/milvus/data
        name: disk
      volumes:
      - name: disk
        hostPath:
          path: "/var/lib/milvus/data"
          type: DirectoryOrCreate      
    dataCoord:
      replicas: 1
    indexCoord:
      replicas: 1
    dataNode:
      replicas: 3
      
  config:
    log:
      file:
        maxAge: 10
        maxBackups: 20
        maxSize: 100            
      format: text
      level: warn
    common:
      DiskIndex:
        BeamWidthRatio: 8
        BuildNumThreadsRatio: 1
        LoadNumThreadRatio: 8
        MaxDegree: 28
        PQCodeBudgetGBRatio: 0.04
        SearchCacheBudgetGBRatio: 0.1
        SearchListSize: 50
    proxy:
      grpc:
        serverMaxRecvSize: 2147483648   # 2GB
        serverMaxSendSize: 2147483648
        clientMaxRecvSize: 2147483648
        clientMaxSendSize: 2147483648
    dataNode:
      import:
        maxImportFileSizeInGB: 1024
    queryNode:
      segcore:
        knowhereThreadPoolNumRatio: 1
    queryCoord:
      loadTimeoutSeconds: 1200
    dataCoord:
      segment:
        maxSize: 81920
        diskSegmentMaxSize: 81920
        sealProportion: 0.9
        smallProportion: 0.5
      compaction:
        rpcTimeout: 180
        timeout: 5600
        levelzero:
          forceTrigger:
            maxSize: 85899345920

This is the migration yaml:
dumper: # configs for the migration job.
  worker:
    limit: 16
    workMode: faiss # operational mode of the migration job.
    reader:
      bufferSize: 1024
    writer:
      bufferSize: 1024
loader:
  worker:
    limit: 16
source: # configs for the source Faiss index.
  mode: local
  local:
    faissFile: /var/lib/milvus/vector-files/ivfflat_base.50M_lists7100.faissindex

target: # configs for the target Milvus collection.
  create:
    collection:
      name: wiki50M2
      shardsNums: 12
      dim: 768
      metricType: L2
  mode: remote
  remote:
    outputDir: testfiles/output/
    cloud: aws
    endpoint: 10.42.0.104:9000
    region: ap-southeast-1
    bucket: milvus3
    ak: minioadmin
    sk: minioadmin
    useIAM: false
    useSSL: false
    checkBucket: true
  milvus2x:
    endpoint: 172.16.10.111:19530

As for the dataset: we carved out 50M vectors from the 88M-vector wiki-all NVIDIA dataset, available at:
https://docs.rapids.ai/api/raft/nightly/wiki_all_dataset/


gland1 commented Jun 17, 2024

Any idea how I can stop the migration?

@xiaocai2333

The segment maxSize is too large in the configuration; 1024 MB is the recommended size.
@gland1

@xiaocai2333

If the segments are too large, there will be too many binlog files, and some atomic operations cannot be completed. In addition, roughly 80 GB × 4 = 320+ GB of memory is required when building the index.
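
For comparison, a sketch of the dataCoord.segment block with the recommended size in place of the 81920 MB used in the configuration above (values in MB; the diskSegmentMaxSize line is illustrative rather than a specific recommendation from this thread):

  config:
    dataCoord:
      segment:
        maxSize: 1024              # recommended segment size, instead of 81920
        diskSegmentMaxSize: 2048   # illustrative; also far smaller than the 81920 used above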


gland1 commented Jun 17, 2024

I'm using a DiskANN index, so it should require less memory to build.
We are investigating horizontal scaling, so we wanted as few segments as possible at first.
But I'll soon try with smaller segments.

Is it possible to stop the migration?

@lentitude2tk (Collaborator)

I'm using a DiskANN index, so it should require less memory to build. We are investigating horizontal scaling, so we wanted as few segments as possible at first. But I'll soon try with smaller segments.

Is it possible to stop the migration?

If your target collection is a PoC/testing collection, you can delete the collection, which will cause the entire migration task to fail (and thereby stop).

@bigsheeper

@gland1 Why is it necessary to have as few segments as possible when investigating horizontal scaling?

Large segments would result in many side effects, as detailed here: milvus-io/milvus#33808 (comment)
