wdecoster / chopper

confuse problem #31

Closed lihaicheng7003 closed 2 months ago

lihaicheng7003 commented 2 months ago

Running

chopper -q 10 -l 500 --threads 8 -i 202404181017140_read.fq.gz |gzip > 202404181017140_read.filter.fq.gz

prints

Kept 3 reads out of 3 reads

and 202404181017140_read.filter.fq.gz contains only 3 reads. I have confirmed that 202404181017140_read.fq.gz contains many more reads.

I'm confused about what is happening.

lihaicheng7003 commented 2 months ago

gunzip -c 202404181017140_read.fq.gz|chopper -q 10 -l 500 | gzip >202404181017140_read.filter.fq.gz

This command works.

Additional information:

202404181017140_read.fq.gz: gzip compressed data, extra field, last modified: Sat Apr 20 13:45:16 2024, max compression

wdecoster commented 2 months ago

Do you mean that the counting of reads is wrong, or do you also get a different set of reads passing the filter for the second command?

lihaicheng7003 commented 2 months ago

With the first command (chopper reading the file directly via -i), 202404181017140_read.filter.fq.gz ends up with only three reads, which is obviously wrong: 202404181017140_read.fq.gz has about two million reads.

lihaicheng7003 commented 2 months ago

I didn't change anything else (same file, same environment). Just switching to the second command (gunzip -c piped into chopper) gave 1.98 million reads in 202404181017140_read.filter.fq.gz, and this result should be correct.

lihaicheng7003 commented 2 months ago

_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
_sysroot_linux-64_curr_repodata_hack 3                   haa98f57_10    defaults
binutils                  2.40                 h4852527_0    conda-forge
binutils_impl_linux-64    2.40                 ha885e6a_0    conda-forge
binutils_linux-64         2.40                 hdade7a5_3    conda-forge
biopython                 1.79                     pypi_0    pypi
c-compiler                1.7.0                hd590300_1    conda-forge
ca-certificates           2024.3.11            h06a4308_0    defaults
certifi                   2021.5.30        py36h06a4308_0    defaults
chopper                   0.8.0                hdcf5f25_0    bioconda
clang                     14.0.6               h06a4308_1    defaults
clang-14                  14.0.6          default_hc6dbbc7_1    defaults
cxx-compiler              1.7.0                h00ab1b0_1    conda-forge
gcc                       12.3.0               h915e2ae_7    conda-forge
gcc_impl_linux-64         12.3.0               h58ffeeb_7    conda-forge
gcc_linux-64              12.3.0               h6477408_3    conda-forge
gxx                       12.3.0               h915e2ae_7    conda-forge
gxx_impl_linux-64         12.3.0               h2a574ab_7    conda-forge
gxx_linux-64              12.3.0               h4a1b8e8_3    conda-forge
kaleido                   0.2.1                    pypi_0    pypi
kernel-headers_linux-64   3.10.0              h57e8cba_10    defaults
ld_impl_linux-64          2.40                 h55db66e_0    conda-forge
libclang-cpp14            14.0.6          default_hc6dbbc7_1    defaults
libffi                    3.3                  he6710b0_2    defaults
libgcc-devel_linux-64     12.3.0             h0223996_107    conda-forge
libgcc-ng                 13.2.0               h77fa898_7    conda-forge
libgomp                   13.2.0               h77fa898_7    conda-forge
libllvm14                 14.0.6               hef93074_0    defaults
libsanitizer              12.3.0               hb8811af_7    conda-forge
libstdcxx-devel_linux-64  12.3.0             h0223996_107    conda-forge
libstdcxx-ng              13.2.0               hc0a3c3a_7    conda-forge
libzlib                   1.2.13               hd590300_5    conda-forge
nanoget                   1.19.1                   pypi_0    pypi
nanomath                  1.3.0                    pypi_0    pypi
nanoplot                  1.42.0                   pypi_0    pypi
ncurses                   6.4                  h6a678d5_0    defaults
numpy                     1.19.5                   pypi_0    pypi
openssl                   1.1.1w               h7f8727e_0    defaults
packaging                 21.3                     pypi_0    pypi
pandas                    1.1.5                    pypi_0    pypi
pip                       21.2.2           py36h06a4308_0    defaults
plotly                    5.18.0                   pypi_0    pypi
pyarrow                   6.0.1                    pypi_0    pypi
pyparsing                 3.1.2                    pypi_0    pypi
pysam                     0.22.1                   pypi_0    pypi
python                    3.6.13               h12debd9_1    defaults
python-dateutil           2.9.0.post0              pypi_0    pypi
python-deprecated         1.1.0                    pypi_0    pypi
pytz                      2024.1                   pypi_0    pypi
readline                  8.2                  h5eee18b_0    defaults
scipy                     1.5.4                    pypi_0    pypi
setuptools                58.0.4           py36h06a4308_0    defaults
six                       1.16.0                   pypi_0    pypi
sqlite                    3.45.3               h5eee18b_0    defaults
sysroot_linux-64          2.17                h57e8cba_10    defaults
tenacity                  8.2.2                    pypi_0    pypi
tk                        8.6.14               h39e8969_0    defaults
wheel                     0.37.1             pyhd3eb1b0_0    defaults
xz                        5.4.6                h5eee18b_1    defaults
zlib                      1.2.13               hd590300_5    conda-forge

JMencius commented 2 months ago

Hi @lihaicheng7003, sorry about this issue. Could you send part of your input fastq.gz file to my email, zjmeng22@m.fudan.edu.cn, so I can help figure out what is going wrong?

JMencius commented 2 months ago

Hi, I tried to reproduce your error using the built-in test.fastq in chopper/test-data/, but could not. Note that I built chopper with cargo build, but I don't think that makes any difference. Here are my results:

$ ./chopper -q 10 -l 500 -i /test-data/test.fastq > a.fastq
Kept 205 reads out of 250 reads
$ ./chopper -q 10 -l 500 -i /test-data/test.fastq.gz > b.fastq
Kept 205 reads out of 250 reads
$ gunzip -c /test-data/test.fastq.gz | ./chopper -q 10 -l 500 > c.fastq
Kept 205 reads out of 250 reads
$ gunzip -c /test-data/test.fastq.gz | ./chopper -q 10 -l 500 |gzip >  d.fastq.gz
Kept 205 reads out of 250 reads

I also checked the exact line count of each output FASTQ file.

$ wc -l *.fastq
     820 a.fastq
     820 b.fastq
     820 c.fastq
     820 d.fastq

I am confused too. Please do send your file if there are no privacy concerns.

lihaicheng7003 commented 2 months ago

I found the reason. The file I was using (202404181017140_read.fq.gz) may have been compressed using some special compression method, causing Chopper to fail to parse it correctly. gzip can decompress this file, so the second command works as expected.
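The `file` output earlier in this thread reports an "extra field" in the gzip header. One way to look closer, sketched here in Python, is to parse the first gzip header as laid out in RFC 1952 and list any extra subfields it carries. A `BC` subfield is what BGZF writers such as bgzip add, but whether this particular file is BGZF is not established anywhere in the thread. The function names `gzip_extra_subfields` and `looks_like_bgzf` are made up for illustration.

```python
import struct

FEXTRA = 0x04  # RFC 1952 flag bit: an "extra field" follows the fixed header


def gzip_extra_subfields(data: bytes):
    """Return [(subfield_id, payload), ...] from the first gzip header in data.

    Returns an empty list when the FEXTRA flag is not set.
    """
    if len(data) < 10 or data[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    flags = data[3]
    if not flags & FEXTRA:
        return []
    # XLEN (little-endian, 2 bytes) sits right after the 10-byte fixed header.
    (xlen,) = struct.unpack_from("<H", data, 10)
    extra = data[12:12 + xlen]
    subfields = []
    pos = 0
    # Each subfield is: 2-byte ID, 2-byte little-endian length, then payload.
    while pos + 4 <= len(extra):
        sub_id = extra[pos:pos + 2].decode("latin-1")
        (sub_len,) = struct.unpack_from("<H", extra, pos + 2)
        subfields.append((sub_id, extra[pos + 4:pos + 4 + sub_len]))
        pos += 4 + sub_len
    return subfields


def looks_like_bgzf(data: bytes) -> bool:
    """True if the first header carries the 2-byte 'BC' subfield used by BGZF."""
    return any(
        sub_id == "BC" and len(payload) == 2
        for sub_id, payload in gzip_extra_subfields(data)
    )
```

Passing the first 64 KiB or so of the file, read in binary mode, is enough for this check, since it only inspects the first header.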

lihaicheng7003 commented 2 months ago

Sorry, I can't provide the complete file as it's too large. Since I don't know what method or tool was used to compress it, I also can't compress a small portion of reads into a similar format. It may not be possible to provide a test file.
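A small reproduction file can be built without knowing the original tool, under one assumption that this thread does not confirm: that the input is a multi-member gzip file, meaning several gzip streams concatenated back to back. The Python sketch below builds such a file from synthetic reads and compares a decoder that stops after the first member with one that reads all of them. All function names are hypothetical, and nothing here describes how Chopper itself reads its input.

```python
import gzip
import zlib


def make_multimember_fastq_gz(n_reads: int, reads_per_member: int) -> bytes:
    """Build synthetic FASTQ and compress it as concatenated gzip members."""
    records = [
        f"@read{i}\nACGTACGT\n+\nIIIIIIII\n".encode("ascii")
        for i in range(n_reads)
    ]
    members = []
    for start in range(0, n_reads, reads_per_member):
        chunk = b"".join(records[start:start + reads_per_member])
        # Each gzip.compress call yields one complete, independent gzip member.
        members.append(gzip.compress(chunk))
    return b"".join(members)


def count_reads(fastq_bytes: bytes) -> int:
    """Count FASTQ records, assuming exactly four lines per record."""
    return fastq_bytes.count(b"\n") // 4


def decompress_first_member(data: bytes) -> bytes:
    """Decode only the first gzip member; the rest is left in unused_data."""
    decoder = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
    return decoder.decompress(data)


def decompress_all_members(data: bytes) -> bytes:
    """Decode every member; gzip.decompress handles concatenated streams."""
    return gzip.decompress(data)
```

With `make_multimember_fastq_gz(10, 3)`, the first-member decoder yields 3 reads while the multi-member decoder yields all 10. Writing those bytes to a .fq.gz file gives a small input for comparing the `-i` form of the command with the `gunzip -c` pipeline.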

lihaicheng7003 commented 2 months ago

Thank you for your response. The situation I encountered isn't a problem with Chopper; it's a problem with my file format.