yoshihikosuzuki / ClassPro

A K-mer classifier for HiFi reads
GNU General Public License v3.0
8 stars · 0 forks

ClassPro aborts when merging files #5

Open RNieuwenhuis opened 1 month ago

RNieuwenhuis commented 1 month ago

Hi @yoshihikosuzuki ,

I am using ClassPro like this: $ ClassPro -v -T16 -NResults/13_profile_reads/Profile_17_mers -r7000 Results/11_preprocess_fastq/My_file.fastq.gz and I get this output:

Info about inputs:
    # of sequence files   = 1
    First (path,root,ext) = (Results/11_preprocess_fastq, My_file, .fastq.gz)
    FASTK outputs' root   = Results/13_profile_reads/Profile_17_mers
    Otput .class file     = Results/11_preprocess_fastq/My_file.class
    Temp dir path         = /path/to/current/working/directory/
    Total # of reads      = 2900325
    # of reads per thread = 181271
Global histogram inspection:
    Tallest peak count    = 29 (# of k-mers = 275545588)
    Estimated (H,D) cov   = (14,29)
    Estimated R-threshold = 55
Error model not specified. Use the default error model.
Classifying 17-mers...
Resources for phase:  698:58.786 (m:s.ms) user  3:39.675 (m:s.ms) sys  47:26.496 (m:s.ms) wall  1481.1%  15 MB max rss

Merging files...
free(): invalid pointer
Aborted (core dumped)

Is this a bug? I used the Linux binaries from your release v1.0.2.
Most importantly, the .class file is generated and has the same number of lines as the input FASTQ, so it seems complete. Your help would be appreciated; having the program exit with status 0 would be great.

Kind regards,

Ronald

yoshihikosuzuki commented 1 month ago

Hi @RNieuwenhuis,

Thanks for the bug report. It does indeed look like a bug at the very end of the program. Whether the .class file is complete depends on at which free() call the error happens, but if it has the same number of reads as the input, it should be complete.

As you suggested, I would like to fix it, but to do so I would at least need to know where the error occurred (using valgrind, for example) and, if possible, the entire input dataset. Would that be possible for you?

Best regards, Yoshi

RNieuwenhuis commented 1 month ago

Hi @yoshihikosuzuki ,

I ran Valgrind as follows: valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes ClassPro -v -T16 -NResults/13_profile_f1_reads/Profile_17_mers -r7000 Results/13_profile_f1_reads/My_file.fastq.gz 2> valgrind_output.txt

I got the following output. valgrind_output.txt

It says valgrind could not continue. The machine I work on has 750 GB of RAM and 375 GB of allocated swap.

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 3089437
max locked memory       (kbytes, -l) 65536
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 3089437
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

I don't see any problematic limits. Could you perhaps provide some suggestions on how to properly analyze ClassPro using Valgrind? Any valgrind command-line arguments or switches that can prevent this crash?

Kind regards,

Ronald

yoshihikosuzuki commented 1 month ago

Hi @RNieuwenhuis

Thanks for running valgrind. First, to get line-number information in the valgrind output, you need to turn off optimization by changing line 2 of the Makefile from:

CFLAGS = -O3 -Wall -Wextra -Wno-unused-function

to:

CFLAGS = -O0 -Wall -Wextra -Wno-unused-function

and recompiling ClassPro with make.
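As a side note, editing the Makefile may not even be necessary: in GNU make, a variable given on the command line overrides a plain CFLAGS = assignment inside the Makefile. A small sketch (the throwaway Makefile below is only for demonstration); adding -g is also worth considering, since valgrind needs debug info to map addresses to file and line numbers:

```shell
# Demo with a throwaway Makefile: a command-line variable overrides
# a plain `CFLAGS = ...` assignment inside the Makefile.
printf 'CFLAGS = -O3 -Wall\nall:\n\t@echo CFLAGS is $(CFLAGS)\n' > /tmp/demo.mk
make -f /tmp/demo.mk                  # prints: CFLAGS is -O3 -Wall
make -f /tmp/demo.mk CFLAGS="-O0 -g"  # prints: CFLAGS is -O0 -g

# So for ClassPro one could likely run, without editing the Makefile:
# make clean && make CFLAGS="-O0 -g -Wall -Wextra -Wno-unused-function"
```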

Then, I typically use valgrind with the following options:

valgrind --tool=memcheck --leak-check=full --show-leak-kinds=all --track-origins=yes --log-file=valgrind.log ${COMMAND}

However, the above will take some of your time, and I would say the bug is marginal: you can proceed with the current output file, since it looks complete, although of course no bug should occur at all.

Best regards, Yoshi

RNieuwenhuis commented 1 month ago

Hi Yoshi,

I followed your instructions and played around with some valgrind command-line arguments, increasing memory limits, etc., but after several attempts it still crashes with an out-of-memory error. This is the output: valgrind.log

I think I cannot help you with debugging the original error because I don't have a smaller dataset. Let me know if I can do anything else.

The reason I would really like to fix this issue is that I made ClassPro part of a workflow, and when it does not exit with 0 the pipeline will not finish. To that end I also added ClassPro to bioconda (this PR), based on the binaries. Feel free to improve on the recipe, though: it currently relies on the binaries of your release, but using the build system would allow much more flexibility. I also could not get the test to work properly; I used the FastK recipe as a template, since your CLI is similar, but had no success.

Kind regards,

Ronald

yoshihikosuzuki commented 1 month ago

Hello Ronald,

Thanks for investigating it. The out-of-memory error would be caused not by ClassPro but by Valgrind. As for the smaller dataset, could you just run e.g. head -10000 or seqkit sample -p0.01 on your input data to make a smaller input?
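One caveat about subsampling (a sketch using the file names from this thread): a FASTQ record spans 4 lines and the input is gzip-compressed, so head has to run on the decompressed stream and keep a multiple of 4 lines; the seqkit call uses its documented -p/-s/-o options:

```shell
# 10,000 reads = 40,000 FASTQ lines; head must run on the
# decompressed stream, not on the .gz file itself.
zcat My_file.fastq.gz | head -n 40000 | gzip > Sample.fastq.gz

# Or sample ~1% of the reads with seqkit (-s fixes the random seed):
seqkit sample -p 0.01 -s 11 My_file.fastq.gz -o Sample.fastq.gz
```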

Actually, I don't know much about bioconda, but would a temporary workaround like ClassPro ... || true work? (It's obviously not ideal, though.) And what specifically was the reason you could not make the test work?

Thank you so much for your help. Best, Yoshi

RNieuwenhuis commented 1 month ago

Hi Yoshi,

Thanks for your quick reply. Yes, the out-of-memory error is a valgrind error. It is well documented that running a program under valgrind greatly increases its memory footprint.

I did consider subsampling my FASTQ but wondered whether the coverage would become too low. Would ClassPro still run to completion at very low coverage, albeit with a useless classification result? In other words: how much does ClassPro rely on a proper k-mer histogram?

About bioconda: I initially used ClassPro 2>&1 >/dev/null | grep "Usage: ClassPro", based on the FastK recipe and its similar CLI. Why it failed is unknown to me; the CI tools don't report much, and the build worked when I tested it locally, so unfortunately I cannot be more concrete about why that test fails. I would love to test installing it through conda, since the bioconda recipe PR has already been reviewed and approved, but bioconda currently seems to have issues with its CI pipeline, as all tests have been failing for a week.
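For what it's worth, one subtlety of that test: redirections apply left to right, so 2>&1 >/dev/null sends only stderr into the pipe, and the grep succeeds only if the usage text is printed on stderr. If ClassPro prints its usage on stdout (I have not checked which it does), the test would fail for that reason alone. A quick demonstration with a stand-in function:

```shell
# Redirections apply left to right: `2>&1` first points stderr at the
# current stdout (the pipe), then `>/dev/null` discards stdout. The
# grep therefore only sees stderr. `fake_tool` is a stand-in here.
fake_tool() {
    echo "this goes to stdout"
    echo "Usage: ClassPro [options]" >&2
}
fake_tool 2>&1 >/dev/null | grep "Usage: ClassPro"   # matches
echo "grep exit status: $?"                          # 0
```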

I will test with a subsample of my data and report back here.

yoshihikosuzuki commented 1 month ago

Hi Ronald,

For a dataset with very low coverage the output will be useless (~15x is the minimum for meaningful results), but it can still be used for debugging.

I saw the PR on bioconda, and is this the error?

[error]No hosted parallelism has been purchased or granted. To request a free parallelism grant, please fill out the following form https://aka.ms/azpipelines-parallelism-request

I'm not sure exactly what this means, but do we need to fill out the form (because ClassPro uses multiple threads)?

RNieuwenhuis commented 1 month ago

Hi Yoshi,

> For a dataset with very low coverage the output will be useless (~15x is the minimum for meaningful results), but it can still be used for debugging.

Good to know, then I'll try that.

> I saw the PR on bioconda, and is this the error?

Correct, that is the current error. It seems to be a general error unrelated to ClassPro's use of multi-threading. There have been no merges for a week because every CircleCI check ends with this error. In the Gitter channel they said they were in discussion with Microsoft, but things were progressing slowly. Filling out the form was not necessary, as it is a bioconda-wide issue. The latest update on the issue was this:

[screenshot of the latest bioconda CI status update]

RNieuwenhuis commented 1 month ago

Update: bioconda CI is working again.


First, I selected 100k reads from my file into Sample.fastq.gz. When I run ClassPro on them using the existing FastK profile, it fails with Cannot load 100001-th read. When I generate a new FastK profile from these 100k reads and use that profile for ClassPro, I get [ERROR] Could not find any peak count >= 10 in the histogram. Revise data and use the -c option.

So I resorted to mapping my reads and selecting only the reads mapping to the first 2 Mbp of the genome. This resulted in some 6,000 reads. Testing ClassPro on these reads reproduced the original error of an invalid pointer given to free().

Still, valgrind fails with an out-of-memory error on this small dataset.

valgrind.log

It is fair to say that I somehow cannot analyze ClassPro with valgrind, whatever the test set. I can offer you my test data if you want.

yoshihikosuzuki commented 1 month ago

Hi Ronald,

I'm glad conda works well now.

For the "free" error, I suspect it is an environment-specific problem, so it may well run without the error in my environment, although of course I can try the small dataset.

Can you provide the versions of your GCC and OS? I will look into it.

RNieuwenhuis commented 1 month ago
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
$ gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.