zolotarovgl / GeneExt

GeneExt - Gene extension for improved scRNA-seq data counting
GNU General Public License v3.0
2 stars 3 forks source link

EmptyDataError: No columns to parse from file #6

Closed CaiCheng1996 closed 8 months ago

CaiCheng1996 commented 8 months ago

I have got the error: Traceback (most recent call last): File "geneext.py", line 721, in nrow = helper.get_tsv_nrow(genicpeaksfile) File "geneext/helper.py", line 1172, in get_tsv_nrow df = pd.read_csv(bedfile,sep = '\t') File "geneext/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 948, in read_csv return _read(filepath_or_buffer, kwds) File "geneext/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 611, in _read parser = TextFileReader(filepath_or_buffer, kwds) File "geneext/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1448, in init self._engine = self._make_engine(f, self.engine) File "geneext/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1723, in _make_engine return mapping[engine](f, self.options) File "geneext/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 93, in init self._reader = parsers.TextReader(src, **kwds) File "parsers.pyx", line 586, in pandas._libs.parsers.TextReader.cinit pandas.errors.EmptyDataError: No columns to parse from file

And the log file shows: Running subsampling ... done Running macs2 ... done Filtering macs2 peaks ...

And I can see the files "allpeaks.bed", "allpeaks_coverage.bed" in the tmp/, but the genic_peaks.bed is empty. How could that happen? Is there someting wrong with gtf or bam?

zolotarovgl commented 8 months ago

This error should have been fixed in the last commit from Dec 13, 2023. Please, run git pull in your GeneExt directory and try again.

Explanation: I am assuming you are running the tool in the --orphan mode? Then, this error usually arises due to an error in bedtools and is related to the orders of chromosomes in the genome file being different from the chromosome order in peaks file (e.g. some chromosomes have no called peaks => different chr order which leads to bedtools error.

Please, let me know if the error persists.

CaiCheng1996 commented 8 months ago

Oh, thank you! There were somgthing wrong in my chr name.

zolotarovgl commented 8 months ago

Great! Let me know if the tool finishes without errors. Also, if you have any other issues / questions, I'm happy to help.

Grisha

CaiCheng1996 commented 8 months ago

Thank you for your help! I do have one another question about the --peak_perc. As the manual introduce, GeneExt will calculate per-base coverage distribution for genic peaks and then used to filter intergenic peaks, and there is an example density-plot, and I do see the file genic_peaks.bed in the tmp/. However, my question is where is the intergenic peaks information? So far I can't do the same density-plot and set the cutoff of --peak_perc.

Regards.

zolotarovgl commented 8 months ago

Thanks for your great question! So far, the tool doesn't output such plots as we wanted to minimize the number of dependencies (e.g. even a minimal base-R plot will require an R to be installed in the environment). However, you can generate such plot using the peak coverage files in the tmp/ directory:

The following code generates a histogram similar to those the Fig.2 of the preprint:

peak_perc = 25

genic = read.table('tmp/genic_peaks.bed')
noov = read.table('tmp/allpeaks_noov.bed')

noov = noov[!noov$V4 %in% intersect(genic$V4,noov$V4),]

genic_cov = genic[,7]
noov_cov = noov[,7]

png('peak_coverage.png')
plot(density(log10(genic_cov)),col = 'red',main = 'Peak coverage',sub = paste0('peak perc: ',peak_perc),xlab = 'log10 Normalized peak coverage')
lines(density(log10(noov_cov)),col = 'blue')
thr = quantile(genic_cov,peak_perc/100)
abline(v = log10(thr), lty = 2)

dev.off()

Please, let me know if this works for you.

peak_coverage