yangence / circfull

a tool to detect and quantify full-length circRNA isoforms from circFL-seq

multiprocessing error(s) with #6

Closed bredward closed 2 years ago

bredward commented 2 years ago

Hi there,

I keep having an issue with circfull RG not running to completion. It seems to be related to multiprocessing; however, the error still occurs when I reduce the number of threads. I've done a little research to see if it's something I can resolve myself, but I have not found a reasonable explanation for what could be causing the problem. If you have any thoughts on how to troubleshoot this (see log pasted below), or if this looks at all familiar, your advice would be greatly appreciated!

For reference, here is the log during the running process and the specific error that I'm getting at the bottom. I should also note that I've tried setting the # threads lower (< the default) and that doesn't seem to make a difference. I've also tried splitting the fastq input into smaller files and it doesn't seem to help either - I see the error message regardless.

2022-06-01 11:33:36

Check fastq file

2022-06-01 11:33:36

Check anno file

2022-06-01 11:33:36

Check genome file

2022-06-01 11:33:36

Align fastq to reference genome: alignFastq

2022-06-01 12:22:14

Transform SAM to BAM: sam2bam

2022-06-01 12:22:17

Analyze SAM file: explainFL

2022-06-01 12:22:36

Filter and classify candidates: filterFL

2022-06-01 12:22:43 #### |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| | ETA: 0:00:00

Adjust normal: adjExplainNormal
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/bredward/miniconda3/envs/circfull/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/bredward/miniconda3/envs/circfull/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/home/bredward/miniconda3/envs/circfull/lib/python3.9/site-packages/circfull-0.0.8-py3.9.egg/circfull/RG_adjExplainNormal.py", line 30, in getNewFL
    exon1=createIntervals(each1['exon_start'],each1['exon_end'])
  File "/home/bredward/miniconda3/envs/circfull/lib/python3.9/site-packages/circfull-0.0.8-py3.9.egg/circfull/RG_adjExplainNormal.py", line 7, in createIntervals
    x=I.empty()
AttributeError: module 'intervals' has no attribute 'empty'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/bredward/miniconda3/envs/circfull/bin/circfull", line 33, in <module>
    sys.exit(load_entry_point('circfull==0.0.8', 'console_scripts', 'circfull')())
  File "/home/bredward/miniconda3/envs/circfull/lib/python3.9/site-packages/circfull-0.0.8-py3.9.egg/circfull/circFL_main.py", line 37, in main
  File "/home/bredward/miniconda3/envs/circfull/lib/python3.9/site-packages/circfull-0.0.8-py3.9.egg/circfull/RG.py", line 87, in RG
  File "/home/bredward/miniconda3/envs/circfull/lib/python3.9/site-packages/circfull-0.0.8-py3.9.egg/circfull/RG_adjExplainNormal.py", line 60, in adjExplainNormal
  File "/home/bredward/miniconda3/envs/circfull/lib/python3.9/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/bredward/miniconda3/envs/circfull/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
AttributeError: module 'intervals' has no attribute 'empty'

Again, I appreciate your help with this. I am looking forward to utilizing the full pipeline as soon as I am able to resolve this!

-Bri Edwards

colinliuzelin commented 2 years ago

Hi, Thank you for your question. I think your issue was caused by the wrong 'intervals' module. Several Python packages install a module named intervals, and more than one of them may be present in your environment. I suggest you first uninstall any module named intervals (such as intervals, pyinterval) or create a new conda environment. Then please install python-intervals with 'pip install python-intervals'. You can type "import intervals as I; I.empty()" to test it. Best, Zelin
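The one-line test above can be extended to also report where the imported module lives, which helps spot a conflicting package. This is only a minimal sketch: `check_intervals_module` and `ModuleReport` are hypothetical names made up for illustration and are not part of circfull or python-intervals. It only checks that an `empty` attribute exists, which is the attribute the traceback complains about.

```python
import importlib
from typing import NamedTuple, Optional


class ModuleReport(NamedTuple):
    """Result of probing a module (hypothetical helper type)."""
    importable: bool
    has_empty: bool
    location: Optional[str]


def check_intervals_module(name: str = "intervals") -> ModuleReport:
    """Report whether `name` imports and exposes an `empty` attribute."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        # Nothing called `name` is installed in this environment.
        return ModuleReport(False, False, None)
    # `__file__` shows which installed package actually provided the module.
    location = getattr(module, "__file__", None)
    return ModuleReport(True, hasattr(module, "empty"), location)


if __name__ == "__main__":
    print(check_intervals_module())
```

If `has_empty` comes back as `False`, the `location` field shows which installed package is shadowing the expected one.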

bredward commented 2 years ago

Hi again,

Thank you for responding so quickly! I created a new conda environment without the 'python-intervals' package, as you suggested, and used pip to install python-intervals instead, and the issue was resolved! However, now I am finding a new error that, if I understand correctly, seems to be related to the path(s) I am using for my output directory when running in cRG mode.

Currently I have a series of files that I have run through RG and DNSC, but now when I try to feed those into cRG, I am getting this error:

2022-06-07 14:39:48

Check DNSC directory

2022-06-07 14:39:48

Make query sequence: createFastq

2022-06-07 14:39:48

Check fastq file

2022-06-07 14:39:48

Check anno file

2022-06-07 14:39:48

Check genome file

2022-06-07 14:39:48

Align fastq to reference genome: alignFastq

2022-06-07 14:40:20

Transform SAM to BAM: sam2bam

2022-06-07 14:40:20

Analyze SAM file: explainFL

2022-06-07 14:40:20

Filter and classify candidates: filterFL

2022-06-07 14:40:20 #### | ETA: --:--:--

Adjust normal: adjExplainNormal
Traceback (most recent call last):
  File "/home/bredward/miniconda3/envs/circfull.new/bin/circfull", line 8, in <module>
    sys.exit(main())
  File "/home/bredward/miniconda3/envs/circfull.new/lib/python3.9/site-packages/circfull/circFL_main.py", line 43, in main
    cRG.cRG(docopt(cRG.doc, version=version))
  File "/home/bredward/miniconda3/envs/circfull.new/lib/python3.9/site-packages/circfull/cRG.py", line 65, in cRG
    RG(options)
  File "/home/bredward/miniconda3/envs/circfull.new/lib/python3.9/site-packages/circfull/RG.py", line 87, in RG
    adjExplainNormal(genome,RG_outPrefix,thread)
  File "/home/bredward/miniconda3/envs/circfull.new/lib/python3.9/site-packages/circfull/RG_adjExplainNormal.py", line 44, in adjExplainNormal
    FLdf=pd.read_csv(outPrefix+"explainFL_Normal.txt",sep='\t')
  File "/home/bredward/miniconda3/envs/circfull.new/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/bredward/miniconda3/envs/circfull.new/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/bredward/miniconda3/envs/circfull.new/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 575, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/bredward/miniconda3/envs/circfull.new/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 933, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/home/bredward/miniconda3/envs/circfull.new/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1217, in _make_engine
    self.handles = get_handle(  # type: ignore[call-overload]
  File "/home/bredward/miniconda3/envs/circfull.new/lib/python3.9/site-packages/pandas/io/common.py", line 789, in get_handle
    handle = open(
FileNotFoundError: [Errno 2] No such file or directory: 'split.1_2/cRG/RG/explainFL_Normal.txt'

It seems to be looking for the 'explainFL_Normal.txt' file in the RG folder, which I do have in my RG run output directory, but the path is not pointing to this folder -- it is pointing to the cRG directory. So far I have stored the output of both RG and DNSC in the same directory, and from what I understood, I would use this same directory for the cRG input & output...is that correct? For reference, here is the line I am running:

circfull cRG -t $thread -g ../ref_genome_files/Lotusjaponicus_Gifu_v1.2_genome.fa -a ../ref_genome_files/sort.gtf.gz -f split.1_2 -o split.1_2

And here is the structure of my output directory after running cRG and getting this error:

split.1_2
├── DNSC
│   ├── TideHunter.tab
│   ├── TideHunter_Pass.tab
│   ├── novoCluster.txt
│   ├── novoseq.fa
│   ├── raw2raw.paf
│   ├── raw2raw.sort.paf
│   ├── rawseq.fa
│   ├── test_1.fa
│   └── tmp
├── RG
│   ├── BS_Normal.txt
│   ├── BS_Normal_adj.txt
│   ├── ExonEdict.npy
│   ├── ExonEdict_fsj.npy
│   ├── ExonSdict.npy
│   ├── ExonSdict_fsj.npy
│   ├── circFL_Normal.bed
│   ├── circFL_Normal.txt
│   ├── circSeq.fa
│   ├── circSeq.th
│   ├── constructFL_Normal.txt
│   ├── constructFL_Normal_adj.txt
│   ├── explainFL.txt
│   ├── explainFL_ID2Type.txt
│   ├── explainFL_Normal.txt
│   ├── explainFL_Normal_adj.txt
│   ├── explainFL_noprimary.txt
│   ├── fusion
│   │   └── tmp
│   ├── result_Normal.txt
│   ├── strandDict.npy
│   ├── strandDict_fsj.npy
│   ├── test.minimap2.sam
│   └── tmp
└── cRG
    ├── RG
    │   ├── explainFL.txt
    │   ├── explainFL_ID2Type.txt
    │   ├── explainFL_noprimary.txt
    │   ├── fusion
    │   │   └── tmp
    │   ├── test.minimap2.sam
    │   └── tmp
    └── pseudo.fq
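One way to see how far the cRG run got is to list the files that exist in the completed RG folder but not yet in cRG/RG. The following is only a rough sketch: `missing_outputs` is a hypothetical helper written for illustration, not a circfull function, and it compares file names only, not contents.

```python
import os
from typing import List


def missing_outputs(reference_dir: str, candidate_dir: str) -> List[str]:
    """Return names of regular files in reference_dir that are absent from candidate_dir."""
    reference_files = {
        name
        for name in os.listdir(reference_dir)
        # Skip subfolders such as tmp or fusion; only plain files are compared.
        if os.path.isfile(os.path.join(reference_dir, name))
    }
    candidate_names = set(os.listdir(candidate_dir))
    return sorted(reference_files - candidate_names)
```

For the layout above, a call such as `missing_outputs("split.1_2/RG", "split.1_2/cRG/RG")` would list `explainFL_Normal.txt` among the files the cRG run has not written.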

Any help/thoughts on this would be GREATLY appreciated!

colinliuzelin commented 2 years ago

Hi! I have some thoughts, but I'm not sure. I suggest you run this command

cut -f 1 split.1_2/cRG/RG/explainFL_ID2Type.txt |sort |uniq -c

to check whether any normal reads were selected. If you do not see any 'N' type, it means cRG did not get any results. Normally, this error wouldn't happen unless your dataset is very small. Please let me know your findings.
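If it is more convenient to do the same check from Python, the cut/sort/uniq pipeline can be approximated as below. This is a sketch under assumptions: `count_first_column` is a hypothetical helper, the file is taken to be tab-separated with the type label in the first column (as the cut command implies), and blank lines are skipped.

```python
from collections import Counter


def count_first_column(path: str) -> Counter:
    """Count how often each value appears in the first tab-separated column."""
    counts: Counter = Counter()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # ignore blank lines
            first_field = line.split("\t", 1)[0]
            counts[first_field] += 1
    return counts
```

Calling `count_first_column("split.1_2/cRG/RG/explainFL_ID2Type.txt")` and checking whether the result has an `'N'` key gives the same information as the command above.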