Open gland1 opened 5 months ago
Also, each server has a total of 128G. I see that the datanode on server one has grown to 83G of memory and is still rising in the current attempt.
The operation fails after the datanode on server1 uses too much memory and gets evicted.
Hello, I need to confirm the following information, please provide it:
- Is your milvus instance version 2.3?
- Did you enable PartitionKey or specify PartitionNum for the collection you imported?
Hi, I'm using Milvus 2.4.1. I did not specify PartitionKey or PartitionNum. Do you think using partitions can work around this?
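(For context, a partition key has to be declared when the collection is created; the following is a minimal pymilvus sketch of what that looks like. The field names, dimension, and partition count here are illustrative, not the values used in this issue.)

# Minimal pymilvus sketch: declaring a partition-key field at collection creation.
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection

connections.connect(host="localhost", port="19530")  # placeholder address

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema(name="doc_group", dtype=DataType.INT64, is_partition_key=True),  # the partition key
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
]
schema = CollectionSchema(fields, description="example collection with a partition key")

# num_partitions controls how many physical partitions back the partition key
collection = Collection(name="wiki_pk_example", schema=schema, num_partitions=16)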
Hello @gland1 , please use the following command to capture the memory information of the datanode when its memory usage is high:
go tool pprof {datanode_ip}:9091/debug/pprof/heap
After it runs, you should see that a pprof file has been generated.
Just provide the generated pprof file.
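If go tool pprof is not installed where the datanode runs, the raw heap profile can also be fetched over HTTP and attached directly; a small sketch (the datanode address is a placeholder; this assumes the metrics port 9091 is reachable):

# Sketch: download the datanode heap profile over HTTP and save it to a file so it
# can be attached here or analyzed later with `go tool pprof datanode_heap.pprof`.
import urllib.request

DATANODE_IP = "10.0.0.1"  # placeholder: actual datanode address
url = f"http://{DATANODE_IP}:9091/debug/pprof/heap"

with urllib.request.urlopen(url, timeout=60) as resp, open("datanode_heap.pprof", "wb") as out:
    out.write(resp.read())

print("saved datanode_heap.pprof")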
It will take some time to reach this state, as I'm now trying to load the dataset with inserts (btw, this also fails after a while due to timeout, and I have to record where I stopped and continue from there; it looks like when Pulsar starts flushing the write cache to disk, things become very slow and finally fail on timeout).
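A rough sketch of the record-and-resume loop I mean (pymilvus; the collection name, batch size, and the load_batch() loader are placeholders):

# Sketch: insert in batches and persist the last completed offset, so a failed run
# can resume from where it stopped instead of starting over.
import os, random
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")  # placeholder address
collection = Collection("wiki50M2")                  # placeholder collection name

CHECKPOINT = "insert_offset.txt"
BATCH = 10_000
TOTAL = 50_000_000
DIM = 768

def load_batch(start, count):
    # Placeholder loader: replace with code that reads rows [start, start+count)
    # from the real dataset. Returns column-oriented data: [ids, vectors].
    ids = list(range(start, start + count))
    vectors = [[random.random() for _ in range(DIM)] for _ in range(count)]
    return [ids, vectors]

# Resume from the last recorded offset if a previous run was interrupted.
offset = int(open(CHECKPOINT).read()) if os.path.exists(CHECKPOINT) else 0
while offset < TOTAL:
    count = min(BATCH, TOTAL - offset)
    collection.insert(load_batch(offset, count))
    offset += count
    with open(CHECKPOINT, "w") as f:  # record progress only after a successful insert
        f.write(str(offset))
collection.flush()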
@gland1
- During the time when the error was reported, was there only one import task in progress in the milvus instance, and no other operations?
- When you adjusted the parameters of the single file size, did you change other parameters, such as datanode.import.readBufferSizeInMB?
- You can capture pprof at the high memory point, and we will help you analyze the memory usage. For the use of pprof, please refer to: https://medium.com/@luanrubensf/heap-dump-in-go-using-pprof-ae9837e05419
1) Yes, just one import task. 2) No, no other parameters were changed. 3) Yes, see my previous comment.
I've tried to recreate it. The migration kept hanging at 70%, and I saw the datacoord log blowing up to more than 30G. The reason seems to be sending larger packets than what etcd will accept:

{"log":"[2024/06/16 21:05:57.618 +00:00] [WARN] [etcd/etcd_kv.go:665] [\"value size large than 100kb\"] [key=datacoord-meta/statslog/450500795239284436/450500795239284437/450500795239296710/100] [value_size(kb)=1120]\n","stream":"stdout","time":"2024-06-16T21:05:57.619059105Z"}
{"log":"{\"level\":\"warn\",\"ts\":\"2024-06-16T21:05:57.620Z\",\"logger\":\"etcd-client\",\"caller\":\"v3@v3.5.5/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0xc00118ca80/milvus3-etcd.kioxia:2379\",\"attempt\":0,\"error\":\"rpc error: code = ResourceExhausted desc = trying to send message larger than max (2589626 vs. 2097152)\"}\n","stream":"stderr","time":"2024-06-16T21:05:57.620224606Z"}
{"log":"[2024/06/16 21:05:57.618 +00:00] [WARN] [etcd/etcd_kv.go:665] [\"value size large than 100kb\"] [key=datacoord-meta/binlog/450500795239284436/450500795239284437/450500795239296710/101] [value_size(kb)=358]\n","stream":"stdout","time":"2024-06-16T21:05:57.784447554Z"}
{"log":"[2024/06/16 21:
@lentitude2tk please try to reproduce this in house and see what we can improve
OK, I will find the relevant people to try to reproduce this issue with this data volume in-house. Additionally, I would like to confirm some information, @gland1: when you perform the import, does the collection have relevant indexes? If so, you can try setting the parameter dataCoord.import.waitForIndex to false for testing, or you can drop the index before performing the data import.
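For the second option, dropping the index before the import could look roughly like this (pymilvus sketch; the collection name, field name, and index params are placeholders):

# Sketch: release and drop the existing index before the bulk import, then recreate
# it afterwards once the data is in place.
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")  # placeholder address
collection = Collection("wiki50M2")                  # placeholder collection name

collection.release()   # an index cannot be dropped while the collection is loaded
collection.drop_index()

# ... run the import / migration here ...

collection.create_index(
    field_name="embedding",  # placeholder vector field name
    index_params={"index_type": "DISKANN", "metric_type": "L2", "params": {}},
)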
@gland1 Could you please let us know if you are using a public dataset on wiki? If so, could you provide us with the link? Additionally, could you share the migration.yaml configuration you are using with milvus-migration (sensitive information can be ignored)? We will reproduce the issue you encountered locally and work on resolving it.
@lentitude2tk If the file exported from upstream is very large, will the migration tool split it into small-size file list then do bulk insert?
User feedback: "hanging at 70%". For version 2.4, 70% indicates that bulkInsert has completed and the task is currently building the index. Therefore, the issue lies in why buildIndex is hanging.
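To confirm whether index building is what is actually stuck, the index building progress on the target collection can be polled; a small pymilvus sketch (the collection name is a placeholder):

# Sketch: poll the index building progress to see whether the index build is advancing.
import time
from pymilvus import connections, utility

connections.connect(host="localhost", port="19530")  # placeholder address

while True:
    progress = utility.index_building_progress("wiki50M2")  # placeholder collection name
    print(progress)  # typically reports total_rows and indexed_rows
    if progress.get("indexed_rows", 0) >= progress.get("total_rows", 1):
        break
    time.sleep(30)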
Hi, please note that I'm using an unusually large maxSize for segments: 80G.
This is the full milvus yaml I'm using:
apiVersion: milvus.io/v1beta1
kind: Milvus
metadata:
  name: milvus3
  namespace: kioxia
  labels:
    app: milvus
spec:
  mode: cluster
  dependencies:
    etcd:
      inCluster:
        values:
          persistence:
            storageClass: standard
            size: 10Gi
          volumePermissions:
            enabled: true
    storage:
      inCluster:
        values:
          replicas: 3
          persistence:
            storageClass: standard-thin
            size: 10Ti
    pulsar:
      inCluster:
        values:
          zookeeper:
            replicaCount: 3
            volumes:
              data:
                size: 40Gi
          broker:
            replicaCount: 3
            resources:
              limits:
                cpu: 4
                memory: 16Gi
            configData:
              PULSAR_MEM: >
                -Xms128m -Xmx256m -XX:MaxDirectMemorySize=256m
              PULSAR_GC: >
                -XX:+IgnoreUnrecognizedVMOptions
                -XX:+UseG1GC
                -XX:MaxGCPauseMillis=10
                -Dio.netty.leakDetectionLevel=disabled
                -Dio.netty.recycler.linkCapacity=1024
                -XX:+ParallelRefProcEnabled
                -XX:+UnlockExperimentalVMOptions
                -XX:+DoEscapeAnalysis
                -XX:ParallelGCThreads=4
                -XX:ConcGCThreads=4
                -XX:G1NewSizePercent=50
                -XX:+DisableExplicitGC
                -XX:-ResizePLAB
                -XX:+ExitOnOutOfMemoryError
                -XX:+PerfDisableSharedMem
          bookkeeper:
            replicaCount: 3
            configData:
              # we use `bin/pulsar` for starting bookie daemons
              PULSAR_MEM: >
                -Xms128m
                -Xmx256m
                -XX:MaxDirectMemorySize=256m
              PULSAR_GC: >
                -XX:+IgnoreUnrecognizedVMOptions
                -XX:+UseG1GC
                -XX:MaxGCPauseMillis=10
                -XX:+ParallelRefProcEnabled
                -XX:+UnlockExperimentalVMOptions
                -XX:+DoEscapeAnalysis
                -XX:ParallelGCThreads=4
                -XX:ConcGCThreads=4
                -XX:G1NewSizePercent=50
                -XX:+DisableExplicitGC
                -XX:-ResizePLAB
                -XX:+ExitOnOutOfMemoryError
                -XX:+PerfDisableSharedMem
                -verbosegc
                -Xloggc:/var/log/bookie-gc.log
                -XX:G1LogLevel=finest
            resources:
              limits:
                cpu: 4
                memory: 16Gi
  components:
    proxy:
      replicas: 3
      serviceType: LoadBalancer
    queryNode:
      replicas: 3
      volumeMounts:
        - mountPath: /var/lib/milvus/data
          name: disk
      volumes:
        - name: disk
          hostPath:
            path: "/var/lib/milvus/data"
            type: DirectoryOrCreate
    indexNode:
      replicas: 3
      env:
        - name: LOCAL_STORAGE_SIZE
          value: "300"
      volumeMounts:
        - mountPath: /var/lib/milvus/data
          name: disk
      volumes:
        - name: disk
          hostPath:
            path: "/var/lib/milvus/data"
            type: DirectoryOrCreate
    dataCoord:
      replicas: 1
    indexCoord:
      replicas: 1
    dataNode:
      replicas: 3
  config:
    log:
      file:
        maxAge: 10
        maxBackups: 20
        maxSize: 100
      format: text
      level: warn
    common:
      DiskIndex:
        BeamWidthRatio: 8
        BuildNumThreadsRatio: 1
        LoadNumThreadRatio: 8
        MaxDegree: 28
        PQCodeBudgetGBRatio: 0.04
        SearchCacheBudgetGBRatio: 0.1
        SearchListSize: 50
    proxy:
      grpc:
        serverMaxRecvSize: 2147483648 # 2GB
        serverMaxSendSize: 2147483648
        clientMaxRecvSize: 2147483648
        clientMaxSendSize: 2147483648
    dataNode:
      import:
        maxImportFileSizeInGB: 1024
    queryNode:
      segcore:
        knowhereThreadPoolNumRatio: 1
    queryCoord:
      loadTimeoutSeconds: 1200
    dataCoord:
      segment:
        maxSize: 81920
        diskSegmentMaxSize: 81920
        sealProportion: 0.9
        smallProportion: 0.5
      compaction:
        rpcTimeout: 180
        timeout: 5600
        levelzero:
          forceTrigger:
            maxSize: 85899345920
This is the migration yaml:

dumper: # configs for the migration job.
  worker:
    limit: 16
    workMode: faiss # operational mode of the migration job.
    reader:
      bufferSize: 1024
    writer:
      bufferSize: 1024
loader:
  worker:
    limit: 16
source: # configs for the source Faiss index.
  mode: local
  local:
    faissFile: /var/lib/milvus/vector-files/ivfflat_base.50M_lists7100.faissindex
target: # configs for the target Milvus collection.
  create:
    collection:
      name: wiki50M2
      shardsNums: 12
      dim: 768
      metricType: L2
  mode: remote
  remote:
    outputDir: testfiles/output/
    cloud: aws
    endpoint: 10.42.0.104:9000
    region: ap-southeast-1
    bucket: milvus3
    ak: minioadmin
    sk: minioadmin
    useIAM: false
    useSSL: false
    checkBucket: true
  milvus2x:
    endpoint: 172.16.10.111:19530
As for the dataset: we carved 50M vectors out of the 88M-vector wiki-all NVIDIA dataset available at: https://docs.rapids.ai/api/raft/nightly/wiki_all_dataset/
Any idea how I can stop the migration?
segment maxSize is too large in the configuration. 1024MB is the recommended size. @gland1
If the segment is too large, there will be too many binlog files, and some atomic operations cannot be completed. In addition, building the index for an 80GB segment requires roughly 80 * 4 = 320GB+ of memory.
I'm using a diskANN index, so it should require less memory to build. We're trying to investigate horizontal scaling, so we wanted as few segments as possible at first. But I'll soon try with a smaller segment size.
Is it possible to stop the migration?
If your target collection is a PoC testing collection, you can choose to delete the collection, which will cause the entire migration task to fail and stop.
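From pymilvus, that would look roughly like this (sketch; the collection name is a placeholder):

# Sketch: drop the target collection so the running migration task fails and stops.
# Only do this for a disposable PoC collection.
from pymilvus import connections, utility

connections.connect(host="localhost", port="19530")  # placeholder address

if utility.has_collection("wiki50M2"):  # placeholder collection name
    utility.drop_collection("wiki50M2")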
@gland1 Why is it necessary to have as few segments as possible when investigating horizontal scaling?
Large segments would result in many side effects, as detailed here: https://github.com/milvus-io/milvus/issues/33808#issuecomment-2171028738
Current Behavior
Deployed the Milvus operator on 3 servers and tried to import a Faiss IVF flat index (from the 200M wiki dataset), size 146GB. It failed due to the 16G max file size. Increased the max file size to 1024G, tried again, and it failed after 40% was done.
This is the error shown:
[2024/05/31 18:46:22.983 +03:00] [ERROR] [dbclient/milvus2x.go:206] ["[Loader] Check Milvus bulkInsertState Error"] [error="rpc error: code = Unknown desc = stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:556 github.com/milvus-io/milvus/internal/util/grpcclient.(ClientBase[...]).Call\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:570 github.com/milvus-io/milvus/internal/util/grpcclient.(ClientBase[...]).ReCall\n/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:107 github.com/milvus-io/milvus/internal/distributed/datacoord/client.wrapGrpcCall[...]\n/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:737 github.com/milvus-io/milvus/internal/distributed/datacoord/client.(Client).GetImportProgress\n/go/src/github.com/milvus-io/milvus/internal/proxy/impl.go:6071 github.com/milvus-io/milvus/internal/proxy.(Proxy).GetImportProgress\n/go/src/github.com/milvus-io/milvus/internal/proxy/impl.go:4649 github.com/milvus-io/milvus/internal/proxy.(Proxy).GetImportState\n/go/src/github.com/milvus-io/milvus/internal/distributed/proxy/service.go:1018 github.com/milvus-io/milvus/internal/distributed/proxy.(Server).GetImportState\n/go/pkg/mod/github.com/milvus-io/milvus-proto/go-api/v2@v2.4.2/milvuspb/milvus.pb.go:13136 github.com/milvus-io/milvus-proto/go-api/v2/milvuspb._MilvusService_GetImportState_Handler.func1\n/go/src/github.com/milvus-io/milvus/internal/proxy/connection/util.go:60 github.com/milvus-io/milvus/internal/proxy/connection.KeepActiveInterceptor: empty grpc client: find no available datacoord, check datacoord state"] [stack="github.com/zilliztech/milvus-migration/core/dbclient.(Milvus2x).WaitBulkLoadSuccess\n\t/home/runner/work/milvus-migration/milvus-migration/core/dbclient/milvus2x.go:206\ngithub.com/zilliztech/milvus-migration/core/loader.(Milvus2xLoader).loadDataOne\n\t/home/runner/work/milvus-migration/milvus-migration/core/loader/milvus2x_loader.go:198\ngithub.com/zilliztech/milvus-migration/core/loader.(Milvus2xLoader).loadDataBatch.func1\n\t/home/runner/work/milvus-migration/milvus-migration/core/loader/milvus2x_loader.go:180\ngolang.org/x/sync/errgroup.(Group).Go.func1\n\t/home/runner/go/pkg/mod/golang.org/x/sync@v0.5.0/errgroup/errgroup.go:75"] load error: rpc error: code = Unknown desc = stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:556 github.com/milvus-io/milvus/internal/util/grpcclient.(ClientBase[...]).Call /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:570 github.com/milvus-io/milvus/internal/util/grpcclient.(ClientBase[...]).ReCall /go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:107 github.com/milvus-io/milvus/internal/distributed/datacoord/client.wrapGrpcCall[...] 
/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:737 github.com/milvus-io/milvus/internal/distributed/datacoord/client.(Client).GetImportProgress /go/src/github.com/milvus-io/milvus/internal/proxy/impl.go:6071 github.com/milvus-io/milvus/internal/proxy.(Proxy).GetImportProgress /go/src/github.com/milvus-io/milvus/internal/proxy/impl.go:4649 github.com/milvus-io/milvus/internal/proxy.(Proxy).GetImportState /go/src/github.com/milvus-io/milvus/internal/distributed/proxy/service.go:1018 github.com/milvus-io/milvus/internal/distributed/proxy.(Server).GetImportState /go/pkg/mod/github.com/milvus-io/milvus-proto/go-api/v2@v2.4.2/milvuspb/milvus.pb.go:13136 github.com/milvus-io/milvus-proto/go-api/v2/milvuspb._MilvusService_GetImportState_Handler.func1 /go/src/github.com/milvus-io/milvus/internal/proxy/connection/util.go:60 github.com/milvus-io/milvus/internal/proxy/connection.KeepActiveInterceptor: empty grpc client: find no available datacoord, check datacoord state
Expected Behavior
migration should succeed
Steps To Reproduce
Environment
Anything else?
No response