wdecoster / nanocomp

Comparison of multiple long read datasets
MIT License

Error printing whole nanocomp report #76

Closed NikoLichi closed 6 months ago

NikoLichi commented 6 months ago

Hi Wouter,

I used NanoComp to compare different runs on technical samples. It runs well until it writes the full report. Please see the errors below. This may be related to issue #41. I will give the create_feather script a try and see how it goes...

Thanks for any other help, Niko


2024-03-20 19:02:08,621 Writing html report.
2024-03-20 19:02:08,794 Error tokenizing data. C error: Expected 2 fields in line 21, saw 53
Traceback (most recent call last):
  File "/home/itg/niko/miniconda3/envs/nanopack/lib/python3.9/site-packages/nanocomp/NanoComp.py", line 85, in main
    make_report(plots, settings["path"], stats_df=stats_df)
  File "/home/itg/niko/miniconda3/envs/nanopack/lib/python3.9/site-packages/nanocomp/NanoComp.py", line 410, in make_report
    html_content.append(utils.stats2html(path + "NanoStats.txt"))
  File "/home/itg/niko/miniconda3/envs/nanopack/lib/python3.9/site-packages/nanocomp/utils.py", line 31, in stats2html
    df = pd.read_csv(statsf, sep=":", header=None, names=["feature", "value"])
  File "/home/itg/niko/miniconda3/envs/nanopack/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/itg/niko/miniconda3/envs/nanopack/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 626, in _read
    return parser.read(nrows)
  File "/home/itg/niko/miniconda3/envs/nanopack/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1923, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "/home/itg/niko/miniconda3/envs/nanopack/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
    chunks = self._reader.read_low_memory(nrows)
  File "parsers.pyx", line 838, in pandas._libs.parsers.TextReader.read_low_memory
  File "parsers.pyx", line 905, in pandas._libs.parsers.TextReader._read_rows
  File "parsers.pyx", line 874, in pandas._libs.parsers.TextReader._tokenize_rows
  File "parsers.pyx", line 891, in pandas._libs.parsers.TextReader._check_tokenize_status
  File "parsers.pyx", line 2061, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 2 fields in line 21, saw 53

Traceback (most recent call last):
  File "/home/itg/niko/miniconda3/envs/nanopack/bin/NanoComp", line 10, in <module>
    sys.exit(main())
  File "/home/itg/niko/miniconda3/envs/nanopack/lib/python3.9/site-packages/nanocomp/NanoComp.py", line 85, in main
    make_report(plots, settings["path"], stats_df=stats_df)
  File "/home/itg/niko/miniconda3/envs/nanopack/lib/python3.9/site-packages/nanocomp/NanoComp.py", line 410, in make_report
    html_content.append(utils.stats2html(path + "NanoStats.txt"))
  File "/home/itg/niko/miniconda3/envs/nanopack/lib/python3.9/site-packages/nanocomp/utils.py", line 31, in stats2html
    df = pd.read_csv(statsf, sep=":", header=None, names=["feature", "value"])
  File "/home/itg/niko/miniconda3/envs/nanopack/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/itg/niko/miniconda3/envs/nanopack/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 626, in _read
    return parser.read(nrows)
  File "/home/itg/niko/miniconda3/envs/nanopack/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1923, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "/home/itg/niko/miniconda3/envs/nanopack/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
    chunks = self._reader.read_low_memory(nrows)
  File "parsers.pyx", line 838, in pandas._libs.parsers.TextReader.read_low_memory
  File "parsers.pyx", line 905, in pandas._libs.parsers.TextReader._read_rows
  File "parsers.pyx", line 874, in pandas._libs.parsers.TextReader._tokenize_rows
  File "parsers.pyx", line 891, in pandas._libs.parsers.TextReader._check_tokenize_status
  File "parsers.pyx", line 2061, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 2 fields in line 21, saw 53

NikoLichi commented 6 months ago

Hi Wouter,

I tried again with the feather files and got exactly the same errors. All the other files are produced, but not the final compiled report.

How can this be solved?

This is the command I am using after creating the feather file:

NanoComp -t 32 --verbose -f pdf --feather $FILEIN -p NovoVsGeneC -o NovoVsGenC/qualCont/nanopack --names $PREFIX

Thanks and all the best, Niko

wdecoster commented 6 months ago

Hi Niko,

That is quite remarkable. Could you share the NanoStats.txt file?

Thanks, Wouter

NikoLichi commented 6 months ago

Hi Wouter,

If you mean the .log file from the NanoComp run, it is attached below. Otherwise, please let me know which file you are referring to.

All the best, Niko

NovoVsGeneC_5don5TimPYFNanoComp_20240322_1426.log

wdecoster commented 6 months ago

Was there no NanoStats.txt file? That should also be generated in /NovoVsGenC/qualCont/nanopack/

NikoLichi commented 6 months ago

I completely missed that file among all the other files, sorry. Here it is.

NovoVsGeneC_5don5TimPYFNanoStats.txt

wdecoster commented 6 months ago

Aha, now I see! The read identifiers on line 21 look like "141:329|2a491583-c20d-48e9-8ccc-49afe630be59". Is this a duplex run?

NikoLichi commented 6 months ago

Oh... interesting. No, this is not a duplex run. This is cDNA-seq from an RNA isolation protocol.

This is the output after trimming and orienting the reads with Pychopper, which adds those identifiers (e.g., "141:329|") before the actual read identifier.

But... NanoComp is able to process all the metrics with those headers into separate files (HTML and PDF); only the combined global output (HTML) fails.
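
For anyone who would rather remove the prefix before running NanoComp, here is a minimal Python sketch. It assumes the prefix always has the form digits, colon, digits, pipe, which is inferred only from the single example "141:329|" in this thread. The helper names `strip_prefix` and `clean_header` are hypothetical and are not part of Pychopper or NanoComp.

```python
import re

# Assumed prefix shape: "<digits>:<digits>|" at the start of the read ID.
# This is inferred from the one example in this thread, not from Pychopper docs.
PREFIX = re.compile(r"^\d+:\d+\|")


def strip_prefix(read_id):
    """Return the read ID without a leading '<digits>:<digits>|' prefix."""
    return PREFIX.sub("", read_id, count=1)


def clean_header(header):
    """Strip the prefix from a FASTQ header line that starts with '@'."""
    if header.startswith("@"):
        return "@" + strip_prefix(header[1:])
    return header


print(strip_prefix("141:329|2a491583-c20d-48e9-8ccc-49afe630be59"))
print(clean_header("@141:329|2a491583-c20d-48e9-8ccc-49afe630be59"))
```

IDs that do not start with this pattern are returned unchanged, so the helpers are safe to apply to a mix of prefixed and unprefixed reads.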

wdecoster commented 6 months ago

Yes, most of the time it doesn't care about that ":", but the colon is used as the field separator when generating the HTML report from the NanoStats file. Let me see if I can come up with an easy fix.
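
To illustrate why the extra colon matters: the call in the traceback is `pd.read_csv(statsf, sep=":", ...)`, and a ":" separator treats every colon on a line as a field boundary. The sketch below uses plain string operations to mimic that and to show a more tolerant split on the first colon only. The sample lines are hypothetical, not copied from the attached NanoStats.txt, and the helper names are made up for this example; this is not necessarily how the issue was fixed in NanoComp.

```python
# Hypothetical lines, loosely modeled on a "feature: value" stats file.
# They are NOT copied from the NanoStats.txt attached to this issue.
lines = [
    "Mean read length: 1000.0",
    "1: 45.1 (1200; 141:329|2a491583-c20d-48e9-8ccc-49afe630be59)",
]


def split_every_colon(line):
    """Mimic a sep=":" parse: every colon starts a new field."""
    return [field.strip() for field in line.split(":")]


def split_first_colon(line):
    """Tolerant alternative: only the first colon separates feature and value."""
    feature, _, value = line.partition(":")
    return [feature.strip(), value.strip()]


for line in lines:
    # Number of fields under each strategy.
    print(len(split_every_colon(line)), len(split_first_colon(line)))
# prints:
# 2 2
# 3 2
```

The second line yields three fields under the naive split because of the colon inside the read identifier, which is the same kind of mismatch the parser reports as "Expected 2 fields".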

wdecoster commented 6 months ago

As a workaround, could you see if running with --tsv_stats fixes things?

NikoLichi commented 6 months ago

Yes! It worked! Thanks!

I have already run the command on two datasets, and it works fine! I'll keep this option in mind when working with Pychopper-processed reads.

All the best, Niko

wdecoster commented 6 months ago

Thanks for the feedback!