ydb-platform / ydb

YDB is an open source Distributed SQL Database that combines high availability and scalability with strong consistency and ACID transactions
https://ydb.tech
Apache License 2.0
3.87k stars, 533 forks

One cluster node keeps restarting #91

Open oleg68 opened 2 years ago

oleg68 commented 2 years ago

Hello!

I have a 3-node ydb cluster with mirror-3dc.

After running for some time, one of the nodes started crashing right after startup:

Jul 11 13:35:45 ydbs3 ydbd[15650]: Starting Kikimr r-1 built by Unknown user
Jul 11 13:35:45 ydbs3 ydbd[15650]: No Keys in KeyConfig! Encrypted group DsProxies will not start
Jul 11 13:35:45 ydbs3 ydbd[15650]: No Keys in PDiskKeyConfig! Encrypted pdisks will not start
Jul 11 13:35:46 ydbs3 ydbd[15650]: GRpc memory quota temporarily disabled due to issues with grpc quoter
Jul 11 13:35:48 ydbs3 ydbd[15650]: VERIFY failed (2022-07-11T13:35:48.169987+0300): VDISK[82000000:_:2:0:0]: addr# {ChunkIdx: 135 Offset: 341376 Size: 156957} State# FreeChunks: [] CHAINS: {ChunkSize# 135249920 AppendBlockSize# 4064 MinHugeBlobInBytes# 65536 MaxBlobInBytes# 10485760 {CHAIN {[SlotSize, >
Jul 11 13:35:48 ydbs3 ydbd[15650]:   ydb/core/blobstorage/vdisk/huge/blobstorage_hullhugeheap.cpp:810
Jul 11 13:35:48 ydbs3 ydbd[15650]:   RecoveryModeAllocate(): requirement it != FreeChunks.end() failed
Jul 11 13:35:48 ydbs3 ydbd[15650]: NPrivate::InternalPanicImpl(int, char const*, char const*, int, int, int, TBasicStringBuf<char, std::__y1::char_traits<char> >, char const*, unsigned long)+677 (0xC2B58F5)
Jul 11 13:35:48 ydbs3 ydbd[15650]: NPrivate::Panic(NPrivate::TStaticBuf const&, int, char const*, char const*, char const*, ...)+549 (0xC2AD6D5)
Jul 11 13:35:48 ydbs3 ydbd[15650]: NKikimr::NHuge::THeap::RecoveryModeAllocate(NKikimr::TDiskPart const&)+527 (0x156F868F)
Jul 11 13:35:48 ydbs3 ydbd[15650]: NKikimr::NHuge::THullHugeKeeperPersState::Apply(NActors::TActorContext const&, unsigned long, NKikimr::NHuge::TPutRecoveryLogRec const&)+1542 (0x156EE8F6)
Jul 11 13:35:48 ydbs3 ydbd[15650]: NKikimr::TRecoveryLogReplayer::HandleHugeLogoBlob(NActors::TActorContext const&, NKikimr::NPDisk::TLogRecord const&)+82 (0x158110B2)
Jul 11 13:35:48 ydbs3 ydbd[15650]: NKikimr::TRecoveryLogReplayer::DispatchLogRecord(NActors::TActorContext const&, NKikimr::NPDisk::TLogRecord const&)+1213 (0x1580FB8D)
Jul 11 13:35:48 ydbs3 ydbd[15650]: NKikimr::TRecoveryLogReplayer::InterruptableLogRecordsDispatch(NActors::TActorContext const&)+89 (0x1580F0C9)
Jul 11 13:35:48 ydbs3 ydbd[15650]: NKikimr::TRecoveryLogReplayer::Handle(TAutoPtr<NActors::TEventHandle<NKikimr::NPDisk::TEvReadLogResult>, TDelete>&, NActors::TActorContext const&)+559 (0x1580EDEF)
Jul 11 13:35:48 ydbs3 ydbd[15650]: void NActors::TExecutorThread::Execute<NActors::TMailboxTable::THTSwapMailbox>(NActors::TMailboxTable::THTSwapMailbox*, unsigned int)+3743 (0xD5D93BF)
Jul 11 13:35:48 ydbs3 ydbd[15650]: NActors::TExecutorThread::ThreadProc()+1086 (0xD5D2C0E)
Jul 11 13:35:48 ydbs3 ydbd[15650]: ??+0 (0xC2B64E6)
Jul 11 13:35:48 ydbs3 ydbd[15650]: ??+0 (0x7F819B47718A)
Jul 11 13:35:48 ydbs3 ydbd[15650]: clone+67 (0x7F819B1A6DB3)
Jul 11 13:35:48 ydbs3 systemd[1]: Started Process Core Dump (PID 15704/UID 0).

How can I recover the node from this state?

mvgorbunov commented 2 years ago

What size disks do you use in your configuration?

oleg68 commented 2 years ago

3*60 GB on each node; total 9*60 GB

mvgorbunov commented 2 years ago

A 60 GB disk is not enough. We recommend using disks of at least 80 GB for testing purposes: https://ydb.tech/ru/docs/cluster/system-requirements

uh-zuh commented 2 years ago

Why do you need such a large amount of disk space? For example, if I need to store only 1 GB of data, I would expect 2 GB on each disk to be more than enough.

oleg68 commented 2 years ago

The problem is that even for a small amount of data I need not just 80 GB, but 9*80 GB = 720 GB.
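For reference, the arithmetic behind these figures can be sketched as below. This is only an illustration: the function name `total_raw_capacity_gb` is made up for this example, and the inputs (3 nodes, 3 disks per node, 60 GB and 80 GB disks) are the numbers mentioned in this thread, not values read from any YDB configuration.

```python
def total_raw_capacity_gb(nodes: int, disks_per_node: int, disk_size_gb: int) -> int:
    """Raw disk space provisioned across the cluster (hypothetical helper)."""
    return nodes * disks_per_node * disk_size_gb


NODES = 3           # from the thread: 3-node cluster
DISKS_PER_NODE = 3  # from the thread: 3 disks on each node

# Current setup with 60 GB disks versus the recommended 80 GB minimum.
current = total_raw_capacity_gb(NODES, DISKS_PER_NODE, 60)
recommended = total_raw_capacity_gb(NODES, DISKS_PER_NODE, 80)

print(f"current: {current} GB, recommended minimum: {recommended} GB")
```

So moving from 60 GB to 80 GB disks raises the total provisioned space from 540 GB to 720 GB, regardless of how little data is actually stored.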

uh-zuh commented 2 years ago

According to the documentation, 80 GB is only for testing purposes.

To store 1 GB of data with fault tolerance you need 3*9*800 GB = 21600 GB.