zilliztech / milvus-migration


importing 146 GB faiss ivf flat index fails after 40% #87

Open gland1 opened 1 month ago

gland1 commented 1 month ago

Current Behavior

Deployed milvus operator on 3 servers and tried to import a faiss ivf flat index (from a 200M wiki dataset), 146 GB in size. The import failed due to the 16 GB max file size limit, so I increased the max file size to 1024 GB, tried again, and it failed after 40% was done.
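For reference, I raised the limit via the dataNode import setting in the Milvus config; this is a minimal sketch of the override (the same value appears in the full yaml posted later in this thread):

config:
  dataNode:
    import:
      maxImportFileSizeInGB: 1024   # default limit is 16 GB, which the 146 GB faiss file exceeded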

This is the error shown:

[2024/05/31 18:46:22.983 +03:00] [ERROR] [dbclient/milvus2x.go:206] ["[Loader] Check Milvus bulkInsertState Error"] [error="rpc error: code = Unknown desc = stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:556 github.com/milvus-io/milvus/internal/util/grpcclient.(ClientBase[...]).Call\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:570 github.com/milvus-io/milvus/internal/util/grpcclient.(ClientBase[...]).ReCall\n/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:107 github.com/milvus-io/milvus/internal/distributed/datacoord/client.wrapGrpcCall[...]\n/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:737 github.com/milvus-io/milvus/internal/distributed/datacoord/client.(Client).GetImportProgress\n/go/src/github.com/milvus-io/milvus/internal/proxy/impl.go:6071 github.com/milvus-io/milvus/internal/proxy.(Proxy).GetImportProgress\n/go/src/github.com/milvus-io/milvus/internal/proxy/impl.go:4649 github.com/milvus-io/milvus/internal/proxy.(Proxy).GetImportState\n/go/src/github.com/milvus-io/milvus/internal/distributed/proxy/service.go:1018 github.com/milvus-io/milvus/internal/distributed/proxy.(Server).GetImportState\n/go/pkg/mod/github.com/milvus-io/milvus-proto/go-api/v2@v2.4.2/milvuspb/milvus.pb.go:13136 github.com/milvus-io/milvus-proto/go-api/v2/milvuspb._MilvusService_GetImportState_Handler.func1\n/go/src/github.com/milvus-io/milvus/internal/proxy/connection/util.go:60 github.com/milvus-io/milvus/internal/proxy/connection.KeepActiveInterceptor: empty grpc client: find no available datacoord, check datacoord state"] [stack="github.com/zilliztech/milvus-migration/core/dbclient.(Milvus2x).WaitBulkLoadSuccess\n\t/home/runner/work/milvus-migration/milvus-migration/core/dbclient/milvus2x.go:206\ngithub.com/zilliztech/milvus-migration/core/loader.(Milvus2xLoader).loadDataOne\n\t/home/runner/work/milvus-migration/milvus-migration/core/loader/milvus2x_loader.go:198\ngithub.com/zilliztech/milvus-migration/core/loader.(Milvus2xLoader).loadDataBatch.func1\n\t/home/runner/work/milvus-migration/milvus-migration/core/loader/milvus2x_loader.go:180\ngolang.org/x/sync/errgroup.(Group).Go.func1\n\t/home/runner/go/pkg/mod/golang.org/x/sync@v0.5.0/errgroup/errgroup.go:75"] load error: rpc error: code = Unknown desc = stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:556 github.com/milvus-io/milvus/internal/util/grpcclient.(ClientBase[...]).Call /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:570 github.com/milvus-io/milvus/internal/util/grpcclient.(ClientBase[...]).ReCall /go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:107 github.com/milvus-io/milvus/internal/distributed/datacoord/client.wrapGrpcCall[...] 
/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:737 github.com/milvus-io/milvus/internal/distributed/datacoord/client.(Client).GetImportProgress /go/src/github.com/milvus-io/milvus/internal/proxy/impl.go:6071 github.com/milvus-io/milvus/internal/proxy.(Proxy).GetImportProgress /go/src/github.com/milvus-io/milvus/internal/proxy/impl.go:4649 github.com/milvus-io/milvus/internal/proxy.(Proxy).GetImportState /go/src/github.com/milvus-io/milvus/internal/distributed/proxy/service.go:1018 github.com/milvus-io/milvus/internal/distributed/proxy.(Server).GetImportState /go/pkg/mod/github.com/milvus-io/milvus-proto/go-api/v2@v2.4.2/milvuspb/milvus.pb.go:13136 github.com/milvus-io/milvus-proto/go-api/v2/milvuspb._MilvusService_GetImportState_Handler.func1 /go/src/github.com/milvus-io/milvus/internal/proxy/connection/util.go:60 github.com/milvus-io/milvus/internal/proxy/connection.KeepActiveInterceptor: empty grpc client: find no available datacoord, check datacoord state

Expected Behavior

migration should succeed

Steps To Reproduce

see description

Environment

3-node k8s cluster on bare-metal servers

Anything else?

No response

gland1 commented 1 month ago

Also, each server has a total of 128 GB of memory. I see that the datanode on server one keeps growing in memory; it has reached 83 GB and is still rising in the current attempt.

gland1 commented 1 month ago

The operation fails after the datanode on server1 consumes too much memory and gets evicted.

lentitude2tk commented 1 month ago

The operation fails after the datanode on server1 consumes too much memory and gets evicted.

Hello, I need to confirm the following information; please provide it:

  1. Is your milvus instance version 2.3?
  2. Did you enable PartitionKey or specify PartitionNum for the collection you imported?
gland1 commented 1 month ago

The operation fails after the datanode on server1 consumes too much memory and gets evicted.

Hello, I need to confirm the following information; please provide it:

  1. Is your milvus instance version 2.3?
  2. Did you enable PartitionKey or specify PartitionNum for the collection you imported?

Hi, I'm using milvus 2.4.1. I did not specify PartitionKey or PartitionNum. Do you think using partitions could work around this?

bigsheeper commented 1 month ago

Hello @gland1 , please use the following command to capture the memory information of the datanode when its memory usage is high:

go tool pprof {datanode_ip}:9091/debug/pprof/heap

After execution, you should see that a pprof file has been generated.

Just provide the generated pprof file.

lentitude2tk commented 1 month ago

@gland1

  1. During the time when the error was reported, was there only one import task in progress in the milvus instance, and no other operations?
  2. When you adjusted the single file size parameter, did you change any other parameters, such as datanode.import.readBufferSizeInMB?
  3. You can capture pprof at the high memory point, and we will help you analyze the memory usage. For the use of pprof, please refer to: https://medium.com/@luanrubensf/heap-dump-in-go-using-pprof-ae9837e05419
gland1 commented 1 month ago

Hello @gland1 , please use the following command to capture the memory information of the datanode when its memory usage is high:

go tool pprof {datanode_ip}:9091/debug/pprof/heap

After execution, you should see that a pprof file has been generated.

Just provide the generated pprof file.

It will take some time to reach this state, as I'm now trying to load the dataset by inserts. (BTW, this also fails after a while due to a timeout, and I have to record where I stopped and continue from there. It looks like when pulsar starts flushing the write cache to disk, things become very slow and it finally fails on a timeout.)

gland1 commented 1 month ago

@gland1

  1. During the time when the error was reported, was there only one import task in progress in the milvus instance, and no other operations?
  2. When you adjusted the single file size parameter, did you change any other parameters, such as datanode.import.readBufferSizeInMB?
  3. You can capture pprof at the high memory point, and we will help you analyze the memory usage. For the use of pprof, please refer to: https://medium.com/@luanrubensf/heap-dump-in-go-using-pprof-ae9837e05419

  1. Yes, just one import task.
  2. No, no other parameters were changed.
  3. Yes, see the previous comment.

gland1 commented 3 weeks ago

I've tried to recreate this: the migration kept hanging at 70%, and I saw the datacoord log blowing up to more than 30G. The reason seems to be that it sends larger packets than what etcd will accept:

{"log":"[2024/06/16 21:05:57.618 +00:00] [WARN] [etcd/etcd_kv.go:665] [\"value size large than 100kb\"] [key=datacoord-meta/statslog/450500795239284436/450500795239284437/450500795239296710/100] [value_size(kb)=1120]\n","stream":"stdout","time":"2024-06-16T21:05:57.619059105Z"}
{"log":"{\"level\":\"warn\",\"ts\":\"2024-06-16T21:05:57.620Z\",\"logger\":\"etcd-client\",\"caller\":\"v3@v3.5.5/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0xc00118ca80/milvus3-etcd.kioxia:2379\",\"attempt\":0,\"error\":\"rpc error: code = ResourceExhausted desc = trying to send message larger than max (2589626 vs. 2097152)\"}\n","stream":"stderr","time":"2024-06-16T21:05:57.620224606Z"}
{"log":"[2024/06/16 21:05:57.618 +00:00] [WARN] [etcd/etcd_kv.go:665] [\"value size large than 100kb\"] [key=datacoord-meta/binlog/450500795239284436/450500795239284437/450500795239296710/101] [value_size(kb)=358]\n","stream":"stdout","time":"2024-06-16T21:05:57.784447554Z"}
{"log":"[2024/06/16 21:

xiaofan-luan commented 3 weeks ago

@lentitude2tk please try to reproduce this in house and see what we can improve

lentitude2tk commented 3 weeks ago

@lentitude2tk please try to reproduce this in house and see what we can improve

OK, I will ask the relevant people to try to reproduce this issue with this data volume in-house. Additionally, I would like to confirm some information, @gland1: when you perform the import, does the collection already have the relevant indexes? If so, you can try setting the parameter dataCoord.import.waitForIndex to false for testing, or you can drop the index before performing the data import.
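A minimal sketch of where that flag would be set, following the same config layout as the Milvus yaml posted below (only the parameter path named above is assumed):

config:
  dataCoord:
    import:
      waitForIndex: false   # do not make the import task wait for index building to finish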

lentitude2tk commented 3 weeks ago

@gland1 Could you please let us know if you are using a public dataset on wiki? If so, could you provide us with the link? Additionally, could you share the migration.yaml configuration you are using with milvus-migration (sensitive information can be ignored)? We will reproduce the issue you encountered locally and work on resolving it.

zhuwenxing commented 3 weeks ago

@lentitude2tk If the file exported from upstream is very large, will the migration tool split it into a list of smaller files and then do the bulk insert?

lentitude2tk commented 3 weeks ago

@lentitude2tk If the file exported from upstream is very large, will the migration tool split it into a list of smaller files and then do the bulk insert?

User feedback: "hanging at 70%". For version 2.4, 70% indicates that bulkInsert has been completed and it is currently in the process of building the index. Therefore, the issue lies in why buildIndex is hanging.

gland1 commented 3 weeks ago

Hi, please note that I'm using an unusually large maxSize for segments: 80 GB.

This is the full milvus yaml I'm using:

# This is a sample to deploy a milvus cluster in milvus-operator's default configurations.

apiVersion: milvus.io/v1beta1
kind: Milvus
metadata:
  name: milvus3
  namespace: kioxia
  labels:
    app: milvus
spec:
  mode: cluster
  dependencies:
    etcd:
      inCluster:
        values:
          persistence:    
            storageClass: standard
            size: 10Gi
            volumePermissions:
              enabled: true
    storage:      
      inCluster:
        values:
          replicas: 3
          persistence:    
            storageClass: standard-thin
            size: 10Ti       

    pulsar:      
      inCluster:
        values:                       
          zookeeper:
            replicaCount: 3
            volumes:
              data:
                size: 40Gi
          broker:
            replicaCount: 3
            resources:
              limits:
                cpu: 4
                memory: 16Gi   
            configData:                                                                                                                                 
              PULSAR_MEM: >                                                                                                                             
                -Xms128m -Xmx256m -XX:MaxDirectMemorySize=256m                                                    
              PULSAR_GC: >     
                -XX:+IgnoreUnrecognizedVMOptions                  
                -XX:+UseG1GC                                                                                                         
                -XX:MaxGCPauseMillis=10                                                                                                                   
                -Dio.netty.leakDetectionLevel=disabled                                                                                                    
                -Dio.netty.recycler.linkCapacity=1024                                                                                                   
                -XX:+ParallelRefProcEnabled                                                                                                             
                -XX:+UnlockExperimentalVMOptions                                                                  
                -XX:+DoEscapeAnalysis                                                          
                -XX:ParallelGCThreads=4                                                                                              
                -XX:ConcGCThreads=4                                                                                                                       
                -XX:G1NewSizePercent=50                                                                                                                   
                -XX:+DisableExplicitGC                                                                                                                  
                -XX:-ResizePLAB                                                                                                                         
                -XX:+ExitOnOutOfMemoryError                                                                       
                -XX:+PerfDisableSharedMem                                          
          bookkeeper:
            replicaCount: 3
            configData:                                                                                                                                   
              # we use `bin/pulsar` for starting bookie daemons                                                                                           
              PULSAR_MEM: >                                                                                                                             
                -Xms128m                                                                                                                                
                -Xmx256m                                                                                                                                  
                -XX:MaxDirectMemorySize=256m                                                                                                            
              PULSAR_GC: >        
                -XX:+IgnoreUnrecognizedVMOptions                  
                -XX:+UseG1GC                                                                                                         
                -XX:MaxGCPauseMillis=10                                                                                                                   
                -XX:+ParallelRefProcEnabled                                                                                                             
                -XX:+UnlockExperimentalVMOptions                                                                                                        
                -XX:+DoEscapeAnalysis                                                                                                                   
                -XX:ParallelGCThreads=4                                                                                                                 
                -XX:ConcGCThreads=4                                                                                                                     
                -XX:G1NewSizePercent=50                                                                                                                 
                -XX:+DisableExplicitGC                                                                                                                    
                -XX:-ResizePLAB                                                                                                                         
                -XX:+ExitOnOutOfMemoryError                                                                                                             
                -XX:+PerfDisableSharedMem                                                                                                               
                -verbosegc                                                                                                                              
                -Xloggc:/var/log/bookie-gc.log                                                                                       
                -XX:G1LogLevel=finest                               
            resources:
              limits:
                cpu: 4
                memory: 16Gi             

  components:
    proxy:
      replicas: 3
      serviceType: LoadBalancer
    queryNode:
      replicas: 3
      volumeMounts:
      - mountPath: /var/lib/milvus/data
        name: disk
      volumes:
      - name: disk
        hostPath:
          path: "/var/lib/milvus/data"
          type: DirectoryOrCreate
    indexNode:
      replicas: 3
      env:
        - name: LOCAL_STORAGE_SIZE
          value: "300"
      volumeMounts:
      - mountPath: /var/lib/milvus/data
        name: disk
      volumes:
      - name: disk
        hostPath:
          path: "/var/lib/milvus/data"
          type: DirectoryOrCreate      
    dataCoord:
      replicas: 1
    indexCoord:
      replicas: 1
    dataNode:
      replicas: 3

  config:
    log:
      file:
        maxAge: 10
        maxBackups: 20
        maxSize: 100            
      format: text
      level: warn
    common:
      DiskIndex:
        BeamWidthRatio: 8
        BuildNumThreadsRatio: 1
        LoadNumThreadRatio: 8
        MaxDegree: 28
        PQCodeBudgetGBRatio: 0.04
        SearchCacheBudgetGBRatio: 0.1
        SearchListSize: 50
    proxy:
      grpc:
        serverMaxRecvSize: 2147483648   # 2GB
        serverMaxSendSize: 2147483648
        clientMaxRecvSize: 2147483648
        clientMaxSendSize: 2147483648
    dataNode:
      import:
        maxImportFileSizeInGB: 1024
    queryNode:
      segcore:
        knowhereThreadPoolNumRatio: 1
    queryCoord:
      loadTimeoutSeconds: 1200
    dataCoord:
      segment:
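        # note: 81920 MB = 80 GB per segment, far above the 1024 MB recommended later in this thread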
        maxSize: 81920
        diskSegmentMaxSize: 81920
        sealProportion: 0.9
        smallProportion: 0.5
      compaction:
        rpcTimeout: 180
        timeout: 5600
        levelzero:
          forceTrigger:
            maxSize: 85899345920

This is the migration yaml:

dumper: # configs for the migration job.
  worker:
    limit: 16
    workMode: faiss    # operational mode of the migration job.
    reader:
      bufferSize: 1024
    writer:
      bufferSize: 1024
loader:
  worker:
    limit: 16
source: # configs for the source Faiss index.
  mode: local
  local:
    faissFile: /var/lib/milvus/vector-files/ivfflat_base.50M_lists7100.faissindex

target: # configs for the target Milvus collection.
  create:
    collection:
      name: wiki50M2
      shardsNums: 12
      dim: 768
      metricType: L2
  mode: remote
  remote:
    outputDir: testfiles/output/
    cloud: aws
    endpoint: 10.42.0.104:9000
    region: ap-southeast-1
    bucket: milvus3
    ak: minioadmin
    sk: minioadmin
    useIAM: false
    useSSL: false
    checkBucket: true
  milvus2x:
    endpoint: 172.16.10.111:19530

As for the dataset: we carved 50M out of the 88M wiki-all nvidia dataset available at: https://docs.rapids.ai/api/raft/nightly/wiki_all_dataset/

gland1 commented 3 weeks ago

Any idea how I can stop the migration?

xiaocai2333 commented 3 weeks ago

The segment maxSize is too large in the configuration; 1024MB is the recommended size. @gland1
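A minimal sketch of the suggested change, mirroring the dataCoord.segment section of the yaml above (the 1024 value comes from the recommendation here, not from any other source):

config:
  dataCoord:
    segment:
      maxSize: 1024   # recommended size in MB; the original config used 81920 (80 GB)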

xiaocai2333 commented 3 weeks ago

If the segment is too large, there will be too many binlog files, and some atomic operations cannot be completed. In addition, 80 GB * 4 = 320 GB+ of memory is required when building the index.

gland1 commented 3 weeks ago

I'm using a DiskANN index, so it should require less memory to build. We are trying to investigate horizontal scaling, so we wanted as few segments as possible at first. But I'll soon try with smaller segments.

Is it possible to stop the migration ?

lentitude2tk commented 3 weeks ago

I'm using a DiskANN index, so it should require less memory to build. We are trying to investigate horizontal scaling, so we wanted as few segments as possible at first. But I'll soon try with smaller segments.

Is it possible to stop the migration ?

If your target collection is a PoC/testing collection, you can choose to delete your collection, thus causing the entire migration task to fail and stop.

bigsheeper commented 3 weeks ago

@gland1 Why is it necessary to have as few segments as possible when investigating horizontal scaling?

Large segments would result in many side effects, as detailed here: https://github.com/milvus-io/milvus/issues/33808#issuecomment-2171028738