Open mbautin opened 2 years ago
To specify LLVM_PROFILE_FILE, we can reuse the logic we currently use to name fatal-failure reports, e.g.
$ find java -name "*.profraw"
java/yb-pgsql/target/surefire-reports_org.yb.pgsql.TestPgWriteRestart__testWriteRestart/org.yb.pgsql.TestPgWriteRestart.testWriteRestart..m-1.127.97.172.113-port11946.profraw
java/yb-pgsql/target/surefire-reports_org.yb.pgsql.TestPgWriteRestart__testWriteRestart/org.yb.pgsql.TestPgWriteRestart.testWriteRestart..m-2.127.122.93.35-port24351.profraw
java/yb-pgsql/target/surefire-reports_org.yb.pgsql.TestPgWriteRestart__testWriteRestart/org.yb.pgsql.TestPgWriteRestart.testWriteRestart..m-3.127.48.200.186-port28965.profraw
java/yb-pgsql/target/surefire-reports_org.yb.pgsql.TestPgWriteRestart__testWriteRestart/org.yb.pgsql.TestPgWriteRestart.testWriteRestart..ts-1.127.88.193.225-port19059.profraw
java/yb-pgsql/target/surefire-reports_org.yb.pgsql.TestPgWriteRestart__testWriteRestart/org.yb.pgsql.TestPgWriteRestart.testWriteRestart..ts-2.127.198.58.159-port19906.profraw
java/yb-pgsql/target/surefire-reports_org.yb.pgsql.TestPgWriteRestart__testWriteRestart/org.yb.pgsql.TestPgWriteRestart.testWriteRestart..ts-3.127.206.37.75-port18182.profraw
(there is a bug causing a double dot in the name).
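A minimal sketch of how that naming could work, mirroring the fatal-failure report convention. REPORT_DIR, TEST_ID, and DAEMON_ID are hypothetical placeholder names, not existing variables in the build scripts:

```shell
# Hypothetical sketch: derive the raw-profile path from the test and daemon
# identifiers, mirroring the fatal-failure report naming shown above.
REPORT_DIR="java/yb-pgsql/target/surefire-reports_org.yb.pgsql.TestPgWriteRestart__testWriteRestart"
TEST_ID="org.yb.pgsql.TestPgWriteRestart.testWriteRestart"
DAEMON_ID="m-1"   # master 1; tservers would use ts-1, ts-2, ...

# %p is expanded by the profile runtime to the PID of the writing process.
export LLVM_PROFILE_FILE="${REPORT_DIR}/${TEST_ID}.${DAEMON_ID}.%p.profraw"
echo "$LLVM_PROFILE_FILE"
```

The daemon would be started with this environment variable set, so each process writes its own uniquely named profile.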
When looking at these files' sizes, there are a couple of issues:
I'm looking into this now. I rebased your change for the time being, and it seems to work as it did before.
Files for masters are zero-sized. This is probably because the profile data is not flushed before the process is killed. We need some hook to flush profile files before killing the process, e.g. using these APIs.
I can reproduce this. My guess is that we are missing -fprofile-instr-generate -fcoverage-mapping
during linking, so the write-out-on-exit handler is never installed. I will try to fix this in CMakeLists.txt.
Edit: The linking step was not enough. In the test output I see that we are killing yb-master with SIGKILL, which means the process can't catch and handle the signal. Let me check whether I can fix this by waiting longer in coverage builds, since writing out the coverage information will take some time:
I0222 14:12:01.467499 81167744 external_mini_cluster.cc:2226] { daemon_id: m-1 bound_rpc: :0 } Starting Shutdown()
I0222 14:12:01.467522 81167744 external_mini_cluster.cc:2269] Killing /Users/deen/code/yugabyte-db/build/debugcov-clang-dynamic-arm64-ninja/tests-pgwrapper/../bin/yb-master with pid 47052 with SIGKILL
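One way to avoid losing profiles to SIGKILL would be to escalate gradually in coverage builds. This is only a sketch, not what external_mini_cluster currently does; graceful_kill is a hypothetical helper:

```shell
# Hypothetical sketch for coverage builds: send SIGTERM first and use SIGKILL
# only as a last resort, so the profile runtime (or our SIGTERM hook) can flush.
graceful_kill() {
  pid=$1
  timeout=${2:-30}   # seconds to wait for the profile flush
  kill -TERM "$pid" 2>/dev/null || return 0   # already gone
  i=0
  while [ "$i" -lt "$timeout" ]; do
    kill -0 "$pid" 2>/dev/null || return 0    # exited cleanly
    sleep 1
    i=$((i + 1))
  done
  kill -KILL "$pid" 2>/dev/null               # give up
}

# Demo on a throwaway process:
sleep 100 &
PID=$!
graceful_kill "$PID" 5
```

The real fix would live in the C++ mini-cluster code, but the escalation logic would look much the same.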
As an aside, if we expect many crashes during testing and want more accurate coverage:
> “%c” expands out to nothing, but enables a mode in which profile counter updates are continuously synced to a file. This means that if the instrumented program crashes, or is killed by a signal, perfect coverage information can still be recovered. Continuous mode does not support value profiling for PGO, and is only supported on Darwin at the moment. Support for Linux may be mostly complete but requires testing, and support for Windows may require more extensive changes: please get involved if you are interested in porting this feature.
But I wouldn't set it for now, I'm guessing Linux will be the target arch.
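For reference, enabling continuous mode is just a matter of the file-name pattern (a sketch; the path is a placeholder):

```shell
# "%c" in the pattern enables continuous mode: counters are synced to the
# file as the program runs, so profiles survive crashes and SIGKILL.
# (Darwin-only per the Clang docs quoted above.)
export LLVM_PROFILE_FILE="/tmp/coverage/%c%p.profraw"
echo "$LLVM_PROFILE_FILE"
```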
Files for tservers in a single test can reach more than 1 GB (1179436792 bytes). We need to postprocess these files as part of collecting results after running the test.
I can also see this; it should compress well, so I will try that. Edit: xz -9 takes one of these files from 1.3 GB down to 101 MB.
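Compressing after each test could be a one-liner along these lines. This is a sketch on a scratch directory standing in for the java/ report tree; the compression step is guarded in case xz is not installed:

```shell
# Scratch directory as a stand-in for the java/ test-report tree:
DIR=$(mktemp -d)
printf 'fake profile data' > "$DIR/demo.profraw"

# Compress every non-empty raw profile in place (foo.profraw -> foo.profraw.xz);
# decompress with `xz -d` before feeding the files to llvm-profdata.
if command -v xz >/dev/null; then
  find "$DIR" -name '*.profraw' -size +0c -exec xz -9 {} +
fi
ls "$DIR"
```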
Do we also want to instrument our postgres server or just tserver and master? Edit: I guess so.
> Edit: The linking step was not enough. In the test output I see that we are killing yb-master with SIGKILL, which means we can't catch and handle this. Let me check if I can fix this by waiting longer when we are in coverage builds, since writing out the coverage information will take some time.
This was caused by us using SIGTERM to terminate master processes (and sometimes tservers too). I fixed it by hooking into the glog signal handler to flush the profile data there.
There is a cleaner way: sending a FlushCoverageRequestPB, which I made the C++ tests use. I wasn't sure how to do that from the Java tests, though, and given how frequently we use SIGTERM in tests it is better to handle it anyway (I'm not a big fan of this approach in general, though).
> Do we also want to instrument our postgres server or just tserver and master? Edit: I guess so.
Done.
There was also the issue of tservers running multiple times even within one test run and overwriting their old profraw files; I added the PID to the file name to keep the old ones around until we merge them.
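A sketch of what that looks like on disk and how the per-process files get folded back together. Paths are placeholders, and the merge is guarded because the stub files here contain no real profile data:

```shell
DIR=$(mktemp -d)   # stand-in for a test's report directory

# Two incarnations of the same tserver, distinguished by PID in the name,
# so the second run no longer overwrites the first run's profile:
touch "$DIR/ts-1.1234.profraw" "$DIR/ts-1.5678.profraw"

if command -v llvm-profdata >/dev/null; then
  # In a real run this folds all incarnations into one indexed profile:
  llvm-profdata merge -sparse "$DIR"/ts-1.*.profraw -o "$DIR/ts-1.profdata" || true
fi
ls "$DIR"
```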
Now the next step is to create an actual coverage report from this and check whether it looks reasonable. If so, it would be interesting to run it with all of our tests and get unified coverage.
With %c I can finally get reliable coverage. We are killing processes with signals left and right, and handling all of them correctly is difficult; apparently I have not managed to do so yet. The SIGKILLs obviously can't be handled at all, so those would have to be changed for coverage builds. I'll have to check whether %c also works on Linux.
report1.zip (sorry for the double packing; I hit the GitHub file size limit)
An initial HTML report of ./yb_build.sh debugcov --java-test TestPgWriteRestart is attached.
There is a remaining issue of warning: 45415 functions have mismatched data, which happens because of shared library usage. I will check whether dependency_graph.py link-whole-program can help improve accuracy here.
Report with all ctest runs: I had to run this merge loop in parallel with the tests to keep the profraw files' disk usage down, since otherwise my disk would fill up (multiple TB):
i=0
sleep 180   # let the first tests produce some profiles
while true; do
  sleep 60
  # zsh: collect raw profiles last modified more than 3 minutes ago
  # (${=X} relies on zsh word splitting; paths must not contain spaces)
  X=$(find . -name '*profraw*' -mtime +3m)
  [ -n "$X" ] || continue
  /Library/Developer/CommandLineTools/usr/bin/llvm-profdata merge -sparse ${=X} -o ctest$i.profdata && rm ${=X}
  i=$((i+1))
done
Then finally generated the report:
/Library/Developer/CommandLineTools/usr/bin/llvm-profdata merge -sparse ctest*.profdata -o ctest_combined.profdata && rm ctest[0-9]*.profdata
OBJS=$(for i in build/debugcov-clang-dynamic-arm64-ninja/postgres/bin/* build/debugcov-clang-dynamic-arm64-ninja/bin/* build/debugcov-clang-dynamic-arm64-ninja/tests-*/* build/debugcov-clang-dynamic-arm64-ninja/**/*.dylib; do echo -n "-object $i "; done)
/Library/Developer/CommandLineTools/usr/bin/llvm-cov show ${=OBJS} --instr-profile ctest_combined.profdata --format=html --output-dir=ctest
Next I'm trying to do the same with all java tests.
I'm struggling to get all the Java tests to run; so far the run always stops at some point. Is it possible that the Java tests are not that stable on macOS in general? Is there a way to tell the test runner to keep going after failures instead of skipping the rest?
$ ./yb_build.sh debugcov --java-tests
[...]
[INFO] Running com.yugabyte.driver.core.TestPartitionMetadata
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 44.837 s - in com.yugabyte.driver.core.TestPartitionMetadata
[INFO] Running com.datastax.driver.core.TestSystemTables
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 11.347 s - in com.datastax.driver.core.TestSystemTables
[INFO]
[INFO] Results:
[INFO]
[ERROR] Failures:
[ERROR] TestLoadBalancingPolicy.testHashFunction:101
[ERROR] TestCreateTable.testCreateTableNumTablets:219
[ERROR] TestIndex.testInKeyword:2524->doTestInKeyword:2466->BaseCQLTest.assertQuery:578->BaseCQLTest.assertQuery:574 expected:<...4c93-8f61-eabfa4a803[e2, first, second]Row[5b6962dd-3f90-4c93-8f61-eabfa4a80310, NULL, NULL]]> but was:<...4c93-8f61-eabfa4a803[10, NULL, NULL]Row[5b6962dd-3f90-4c93-8f61-eabfa4a803e2, first, second]]>
[ERROR] TestIndex.testInKeywordInPrepared:2534->doTestInKeyword:2466->BaseCQLTest.assertQuery:578->BaseCQLTest.assertQuery:574 expected:<...4c93-8f61-eabfa4a803[e2, first, second]Row[5b6962dd-3f90-4c93-8f61-eabfa4a80310, NULL, NULL]]> but was:<...4c93-8f61-eabfa4a803[10, NULL, NULL]Row[5b6962dd-3f90-4c93-8f61-eabfa4a803e2, first, second]]>
[ERROR] TestIndex.testInKeywordInPrepared_Transactional:2540->doTestInKeyword:2466->BaseCQLTest.assertQuery:578->BaseCQLTest.assertQuery:574 expected:<...4c93-8f61-eabfa4a803[e2, first, second]Row[5b6962dd-3f90-4c93-8f61-eabfa4a80310, NULL, NULL]]> but was:<...4c93-8f61-eabfa4a803[10, NULL, NULL]Row[5b6962dd-3f90-4c93-8f61-eabfa4a803e2, first, second]]>
[ERROR] TestIndex.testInKeyword_Transactional:2529->doTestInKeyword:2466->BaseCQLTest.assertQuery:578->BaseCQLTest.assertQuery:574 expected:<...4c93-8f61-eabfa4a803[e2, first, second]Row[5b6962dd-3f90-4c93-8f61-eabfa4a80310, NULL, NULL]]> but was:<...4c93-8f61-eabfa4a803[10, NULL, NULL]Row[5b6962dd-3f90-4c93-8f61-eabfa4a803e2, first, second]]>
[ERROR] Errors:
[ERROR] TestAudit>BaseCQLTest.setUpCqlClient:166->BaseCQLTest.connectWithTestDefaults:146 » NoHostAvailable
[ERROR] TestAudit>BaseCQLTest.setUpCqlClient:166->BaseCQLTest.connectWithTestDefaults:146 » NoHostAvailable
[ERROR] TestAudit>BaseCQLTest.setUpCqlClient:166->BaseCQLTest.connectWithTestDefaults:146 » NoHostAvailable
[ERROR] TestAudit>BaseCQLTest.setUpCqlClient:166->BaseCQLTest.connectWithTestDefaults:146 » NoHostAvailable
[ERROR] TestAudit>BaseCQLTest.setUpCqlClient:166->BaseCQLTest.connectWithTestDefaults:146 » NoHostAvailable
[ERROR] TestAudit>BaseCQLTest.setUpCqlClient:166->BaseCQLTest.connectWithTestDefaults:146 » NoHostAvailable
[ERROR] TestAudit>BaseCQLTest.setUpCqlClient:166->BaseCQLTest.connectWithTestDefaults:146 » NoHostAvailable
[ERROR] TestAudit>BaseCQLTest.setUpCqlClient:166->BaseCQLTest.connectWithTestDefaults:146 » NoHostAvailable
[ERROR] TestAudit>BaseCQLTest.setUpCqlClient:166->BaseCQLTest.connectWithTestDefaults:146 » NoHostAvailable
[ERROR] TestAudit>BaseCQLTest.setUpCqlClient:166->BaseCQLTest.connectWithTestDefaults:146 » NoHostAvailable
[ERROR] TestDecimalDataType.testConversionsLimits:616 » InvalidQuery Invalid Arguments...
[ERROR] TestVarIntDataType.testConversionsLimits:489 » InvalidQuery Invalid Arguments....
[INFO]
[ERROR] Tests run: 989, Failures: 6, Errors: 12, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for YugaByte 0.8.15-SNAPSHOT:
[INFO]
[INFO] YugaByte ........................................... SUCCESS [ 1.352 s]
[INFO] YB Annotations ..................................... SUCCESS [ 0.927 s]
[INFO] YB Java Client ..................................... SUCCESS [19:01 min]
[INFO] YB commandline interface ........................... SUCCESS [ 0.753 s]
[INFO] YB CQL Support ..................................... FAILURE [ 04:04 h]
[INFO] YB CQL Support for 4.x Driver ...................... SKIPPED
[INFO] YB PostgreSQL Support .............................. SKIPPED
[INFO] YB Load Tester ..................................... SKIPPED
[INFO] YB Jedis Tests ..................................... SKIPPED
[INFO] YB CDC Connector ................................... SKIPPED
[INFO] YB Multi API ....................................... SKIPPED
[INFO] YB Manual Support .................................. SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 04:23 h
[INFO] Finished at: 2022-03-08T03:59:49+01:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.22.2:test (default-test) on project yb-cql: There are test failures.
[ERROR]
[ERROR] Please refer to /Users/deen/code/yugabyte-db/java/yb-cql/target/surefire-reports for the individual test results.
[ERROR] Please refer to dump files (if any exist) [date].dump, [date]-jvmRun[N].dump and [date].dumpstream.
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn <args> -rf :yb-cql
[2022-03-08T03:59:49 common-build-env.sh:848 build_yb_java_code_filter_save_output] (End of Java build output)
real 263m37.143s
user 196m55.482s
sys 65m23.727s
I'll try again with <yb.forked.test.process.timeout.sec>7200</yb.forked.test.process.timeout.sec>
in java/pom.xml
With ./yb_build.sh debugcov --java-tests --rjtms --tp 8
I appear to be getting further now; the tests have been running for over 12 hours. Hopefully this run will get through all the Java tests. Next I can add automatic report generation to yb_build.
The Java tests have now finished: Java compilation and tests time: 76177 seconds
(about 21 hours!), with 34% line coverage.
ctests and java tests combined: http://coverage.yugabyte.com.s3-website-us-west-2.amazonaws.com/reports/2.13.0.0-b42/index.html (45% line coverage)
Note that there is still some inaccuracy in the reported coverage (warning: 689576 functions have mismatched data), but it is probably close enough to reality for our purposes. In my experience, to get rid of this warning and obtain more accurate coverage information, we would have to link all of the code into a single static executable.
Code coverage on macOS works; the first report has been uploaded (it should be public soon).
But it should also work on Linux, since we have more infrastructure available there. I have two approaches that I am evaluating in parallel:
I created another coverage report last week: http://coverage.yugabyte.com.s3-website-us-west-2.amazonaws.com/reports/2.13.1.0-b60/ Pending some internal discussions, I'm pausing this project.
We need to generate code coverage reports. We can use Clang source-based coverage: https://clang.llvm.org/docs/SourceBasedCodeCoverage.html
This will be a new build type (or multiple new build types, based on debug and release). The main compiler arguments are -fprofile-instr-generate -fcoverage-mapping. Each process (unit tests, masters and tservers in external mini-clusters) will then output "raw profiles" (.profraw files) into locations specified by the LLVM_PROFILE_FILE environment variable. These raw profiles then need to be "indexed" to create .profdata files, using a command like llvm-profdata merge -sparse foo.profraw -o foo.profdata (I think the Clang docs call these files "coverage reports", although they can only be viewed using the llvm-cov show tool). For proper integration into our testing pipeline, we need to ship per-test indexed reports so that we can generate various kinds of reports (line-oriented vs. file-level summaries).
E.g. line-oriented:
vs. file-level:
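For reference, the two flavors correspond roughly to two llvm-cov subcommands. This is a sketch with placeholder paths, guarded because llvm-cov and the inputs may not be available:

```shell
BIN=build/latest/bin/yb-master   # placeholder instrumented binary
PROF=foo.profdata                # placeholder indexed profile

if command -v llvm-cov >/dev/null && [ -f "$BIN" ] && [ -f "$PROF" ]; then
  # line-oriented: annotated source with per-line execution counts
  llvm-cov show "$BIN" --instr-profile="$PROF" --format=html --output-dir=cov_html
  # file-level: one summary row per source file plus a totals row
  llvm-cov report "$BIN" --instr-profile="$PROF"
else
  echo "skipping: llvm-cov or inputs not available"
fi
```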
We then need to decide where to upload these reports, how to host them, and how to compare code coverage between builds.
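For the build-to-build comparison, one lightweight option is llvm-cov export with --summary-only, which emits a small JSON coverage summary per build that can be stored alongside the report and diffed. A sketch with placeholder paths, guarded as above:

```shell
BIN=build/latest/bin/yb-master     # placeholder instrumented binary
PROF=ctest_combined.profdata       # placeholder indexed profile

if command -v llvm-cov >/dev/null && [ -f "$BIN" ] && [ -f "$PROF" ]; then
  # Per-file and overall totals as JSON; store one per build and compare.
  llvm-cov export "$BIN" --instr-profile="$PROF" --summary-only > coverage-summary.json
else
  echo "skipping: llvm-cov or inputs not available"
fi
```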