Open bmatican opened 1 year ago
Hi @bmatican, I talked to you earlier on Slack for this issue. I am just formally commenting here to sort of "claim" this issue with @eafdeafd as other UT Austin students are also contributing to YugabyteDB.
Thank you, Albert
Sounds good @albertchokor and thanks again for your interest in YB! @Huqicheng can you provide Albert with some pointers to
Thanks @albertchokor! Hope the answers to the following questions can help you get started.
Where the TSTabletManager code for handling tablet metadata is
Please refer to TSTabletManager::Init
calling OpenTabletMeta
inside src/yb/tserver/ts_tablet_manager.cc
Status TSTabletManager::Init() {
...
deque<RaftGroupMetadataPtr> metas;
// First, load all of the tablet metadata. We do this before we start
// submitting the actual OpenTablet() tasks so that we don't have to compete
// for disk resources, etc, with bootstrap processes and running tablets.
MonoTime start(MonoTime::Now());
for (const string& tablet_id : tablet_ids) {
RaftGroupMetadataPtr meta;
RETURN_NOT_OK_PREPEND(OpenTabletMeta(tablet_id, &meta),
"Failed to open tablet metadata for tablet: " + tablet_id);
if (PREDICT_FALSE(!CanServeTabletData(meta->tablet_data_state()))) {
RETURN_NOT_OK(HandleNonReadyTabletOnStartup(meta));
continue;
}
The non tablet metadata files should be filtered out but it doesn't filter out files with suffix '_pak'
bool IsValidTabletId(const std::string& fname) {
if (fname.find(kTmpInfix) != string::npos) {
LOG(WARNING) << "Ignoring tmp file in tablet metadata dir: " << fname;
return false;
}
if (HasPrefixString(fname, ".")) {
// Hidden file or ./..
VLOG(1) << "Ignoring hidden file in tablet metadata dir: " << fname;
return false;
}
return true;
}
Where any corresponding tests for this functionality are
ts_tablet_manager related tests are under src/yb/tserver/ts_tablet_manager-test.cc
.
After fixing the issue, it's better to create a unit test under src/yb/tserver/ts_tablet_manager-test.cc
.
How to even artificially (locally) repro this issue, eg: create a bad file in a particular path and restart the server
First start a local universe with yb-ctl --rf=1 start --data_dir YOUR_DB_PATH
Connect to db with ysqlsh -h HOST
and create a table.
Then stop the local universe with yb-ctl --rf=1 stop --data_dir YOUR_DB_PATH
Under YOUR_DB_PATH/node-1/disk-1/yb-data/tserver/tablet-meta
, you should find all tablet metadata named with corresponding tablet id. Just rename any of them with _pak
suffix.
Restart the local universe with yb-ctl --rf=1 start --data_dir YOUR_DB_PATH
and you should see it failed.
Try to find fatal logs under YOUR_DB_PATH/node-1/disk-1/yb-data/tserver/logs
.
@Huqicheng Thank you for the info! I'll let you know if I need anything else!
@Huqicheng For creating a unit test on this issue, would it not suffice to write a dummy file with the _pak
suffix in TestListTablets
in fs_manager-test.cc
as shown below? Since we are updating IsValidTabletId()
in fs_manager.cc
from your post above to check for the _pak
suffix? I'll also tag @bmatican to see what he thinks about this.
TEST_F(FsManagerTestBase, TestListTablets) {
auto tablet_ids = ASSERT_RESULT(fs_manager()->ListTabletIds());
ASSERT_EQ(0, tablet_ids.size());
string path = fs_manager()->GetRaftGroupMetadataDirs()[0];
std::unique_ptr<WritableFile> writer;
ASSERT_OK(env_->NewWritableFile(
JoinPathSegments(path, "foo.tmp"), &writer));
ASSERT_OK(env_->NewWritableFile(
JoinPathSegments(path, "foo.tmp.abc123"), &writer));
ASSERT_OK(env_->NewWritableFile(
JoinPathSegments(path, ".hidden"), &writer));
// here we would put some file with _pak suffix
ASSERT_OK(env_->NewWritableFile(
JoinPathSegments(path, "a_tablet_sort_of_pak"), &writer));
ASSERT_OK(env_->NewWritableFile(
JoinPathSegments(path, "a_tablet_sort_of"), &writer));
tablet_ids = ASSERT_RESULT(fs_manager()->ListTabletIds());
ASSERT_EQ(1, tablet_ids.size()) << tablet_ids;
}
If it is better to create a unit test under src/yb/tserver/ts_tablet_manager-test.cc
instead, how would I go about showing that the _pak
suffix files are ignored for tablet creation/loading?
@albertchokor It looks good to me actually.
Jira Link: DB-5989
Description
We recently had a customer incident where a file in the tablet-meta directory was mistakenly backed up to a new name in the same directory. Unfortunately, our current code expects that the file name is the actual tablet UUID and validates that against the data in the PB itself:
Warning: Please confirm that this issue does not contain any sensitive information