williamritchie / IRFinder

Detecting intron retention from RNA-Seq experiments

Error: Type checker found wrong number of fields while tokenizing data line (while building reference) #167

Open sheilazuniga opened 2 years ago

sheilazuniga commented 2 years ago

Hi,

I'm not sure why this is happening. I call:

IRFinder -m BuildRefFromSTARRef \
  -r /my_dir/GRCh38/irfinder/ \
  -x /my_dir/GRCh38/genome-index_STAR \
  -e /my_dir/GRCh38/extra-input-files-IRFinder/RNA.SpikeIn.ERCC.fasta.gz \
  -b /my_dir/GRCh38/extra-input-files-IRFinder/Human_hg38_nonPolyA_ROI.bed

I got:

merging from 64 files and 64 in-memory blocks...
Error: Type checker found wrong number of fields while tokenizing data line.
Perhaps you have extra TAB at the end of your line? Check with "cat -t"
Error: Type checker found wrong number of fields while tokenizing data line.
Perhaps you have extra TAB at the end of your line? Check with "cat -t"
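The error message suggests looking for a trailing TAB with `cat -t`. As a rough alternative, here is a minimal Python sketch that scans a tab-delimited file (plain or gzip-compressed) and reports lines that end with a TAB or whose field count differs from the first data line. The function names are hypothetical. Which file actually trips the type checker is not established in this thread, so the path to check is up to the reader.

```python
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "r")


def find_suspect_lines(path):
    """Return (line_number, reason) pairs for lines that look malformed.

    Assumption: a data line is suspect if it ends with a TAB, or if its
    field count differs from that of the first well-formed data line.
    """
    suspects = []
    expected = None
    with open_text(path) as handle:
        for number, raw in enumerate(handle, start=1):
            line = raw.rstrip("\r\n")
            # Skip blank lines and common BED header lines.
            if not line or line.startswith(("#", "track", "browser")):
                continue
            if line.endswith("\t"):
                suspects.append((number, "trailing TAB"))
                continue
            fields = len(line.split("\t"))
            if expected is None:
                expected = fields
            elif fields != expected:
                suspects.append((number, f"{fields} fields, expected {expected}"))
    return suspects
```

For example, `find_suspect_lines("Human_hg38_nonPolyA_ROI.bed")` would return an empty list if every data line has the same number of fields and none ends with a TAB.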

The output file contains the following:

Launching reference build process. The full build might take hours.
<Phase 1: STAR Reference Preparation>
Oct 27 17:01:21 ... copying the genome FASTA file...
Oct 27 17:01:37 ... copying the transcriptome GTF file...
Oct 27 17:01:42 ... copying the STAR reference folder...
<Phase 2: Mapability Calculation>
Oct 27 17:03:11 ... mapping genome fragments back to genome...
Oct 27 17:29:03 ... sorting aligned genome fragments...
Oct 27 17:35:40 ... indexing aligned genome fragments...
Oct 27 17:36:02 ... filtering aligned genome fragments by chromosome/scaffold...
Oct 27 17:39:12 ... merging filtered genome fragments...
Oct 27 17:39:57 ... calculating regions for exclusion...
Oct 27 17:45:01 ... cleaning temporary files...
<Phase 3: IRFinder Reference Preparation>
Oct 27 17:45:03 ... building Ref 1...
Oct 27 17:45:21 ... building Ref 2...
Oct 27 17:45:23 ... building Ref 3...
Oct 27 17:45:23 ... building Ref 4...
Oct 27 17:45:29 ... building Ref 5...
Oct 27 17:45:39 ... building Ref 6...
Oct 27 17:45:39 ... building Ref 7...
Oct 27 17:45:41 ... building Ref 8...
Oct 27 17:45:41 ... building Ref 9...
Oct 27 17:45:41 ... building Ref 10c...
Oct 27 17:45:41 ... building Ref 11c...

And when I look in the output directory, I see:

-rw-r--r-- 1 szm group 5.2K Oct 27 17:45 ref-ROI.bed
-rw-r--r-- 1 szm group    0 Oct 27 17:45 intergenic.ROI.bed
-rw-r--r-- 1 szm group  81M Oct 27 17:45 ref-cover.bed
-rw-r--r-- 1 szm group 5.8M Oct 27 17:45 ref-sj.ref
-rw-r--r-- 1 szm group 6.5M Oct 27 17:45 ref-read-continues.ref
-rw-r--r-- 1 szm group  31M Oct 27 17:45 exclude.omnidirectional.bed
-rw-r--r-- 1 szm group  22M Oct 27 17:45 exclude.directional.bed
-rw-r--r-- 1 szm group  13M Oct 27 17:45 introns.unique.bed

Could you please help me?

Thank you in advance,

Sheila

dg520 commented 2 years ago

@sheilazuniga What is your OS environment? Have you tried to re-compile the irfinder core?

sheilazuniga commented 2 years ago

Thanks for your answer. I'm working on a cluster running NAME="Red Hat Enterprise Linux" VERSION="8.5 (Ootpa)". I haven't recompiled it; I installed it using conda.

dg520 commented 2 years ago

@sheilazuniga The conda package is maintained by a third party so I don't know how it is built. The only suggestion I can provide here is to download the GitHub version and try to recompile according to the wiki page.

yangjywhu commented 1 year ago

Hi @dg520, I have recompiled according to the wiki page and encountered the same problem (in both v1.3.1 and v1.2.6):

Error: Type checker found wrong number of fields while tokenizing data line.
Perhaps you have extra TAB at the end of your line? Check with "cat -t"

In v1.3.1:

IRFinder -m BuildRefFromSTARRef \
  -r result/index/IRFinder \
  -x index_STAR \
  -e RNA.SpikeIn.ERCC.fasta.gz \
  -b Human_hg38_nonPolyA_ROI.bed

I got this output information:

Launching reference build process. The full build might take hours.
<Phase 1: STAR Reference Preparation>
Mar 30 21:46:32 ... copying the genome FASTA file...
Mar 30 21:46:48 ... copying the transcriptome GTF file...
Mar 30 21:46:54 ... copying the STAR reference folder...
<Phase 2: Mapability Calculation>
Mar 30 21:49:27 ... mapping genome fragments back to genome...
Mar 31 03:37:00 ... sorting aligned genome fragments...
[bam_sort_core] merging from 60 files and 60 in-memory blocks...
Mar 31 04:10:37 ... indexing aligned genome fragments...
Mar 31 04:13:09 ... filtering aligned genome fragments by chromosome/scaffold...
Mar 31 05:24:51 ... merging filtered genome fragments...
Mar 31 05:26:47 ... calculating regions for exclusion...
Mar 31 05:35:16 ... cleaning temporary files...
<Phase 3: IRFinder Reference Preparation>
Mar 31 05:35:38 ... building Ref 1...
Mar 31 05:36:28 ... building Ref 2...
Mar 31 05:36:32 ... building Ref 3...
Mar 31 05:36:32 ... building Ref 4...
Error: Type checker found wrong number of fields while tokenizing data line.
Perhaps you have extra TAB at the end of your line? Check with "cat -t"
Mar 31 05:36:46 ... building Ref 5...
Error: Type checker found wrong number of fields while tokenizing data line.
Perhaps you have extra TAB at the end of your line? Check with "cat -t"
Mar 31 05:37:04 ... building Ref 6...
Mar 31 05:37:05 ... building Ref 7...
Mar 31 05:37:09 ... building Ref 8...
Mar 31 05:37:10 ... building Ref 9...
Mar 31 05:37:10 ... building Ref 10c...
Mar 31 05:37:10 ... building Ref 11c...
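In the log above, the error lines appear right after `building Ref 4...` and `building Ref 5...`. A minimal Python sketch for pulling that association out of a saved log is shown below. It assumes the log lines are in the order the tool printed them (stdout and stderr can interleave differently when captured separately), and the function name is hypothetical. The pattern also accepts the `Build Ref N` form that the v1.2.6 log uses.

```python
import re

# Matches "building Ref 4..." (v1.3.1 style) or "Build Ref 4" (v1.2.6 style).
STEP_PATTERN = re.compile(r"building Ref (\w+)\.\.\.|Build Ref (\w+)")


def steps_with_errors(log_text):
    """Return Ref step labels that are followed by an Error line
    before the next step starts, in order of first appearance."""
    current = None
    flagged = []
    for line in log_text.splitlines():
        match = STEP_PATTERN.search(line)
        if match:
            current = match.group(1) or match.group(2)
        elif line.startswith("Error:") and current is not None:
            if current not in flagged:
                flagged.append(current)
    return flagged
```

Knowing which steps emit the error narrows down which intermediate files are worth inspecting, although the thread does not say which files those steps read.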

The output directory looks like the following. Does this mean the build ran successfully?

total 156M
-rw-r--r--. 1 yangjiayi zhoulab  23M Mar 31 21:08 exclude.directional.bed
-rw-r--r--. 1 yangjiayi zhoulab  30M Mar 31 21:08 exclude.omnidirectional.bed
-rw-r--r--. 1 yangjiayi zhoulab    0 Mar 31 21:11 intergenic.ROI.bed
-rw-r--r--. 1 yangjiayi zhoulab  13M Mar 31 21:08 introns.unique.bed
-rw-r--r--. 1 yangjiayi zhoulab  80M Mar 31 21:08 ref-cover.bed
-rw-r--r--. 1 yangjiayi zhoulab 6.4M Mar 31 21:08 ref-read-continues.ref
-rw-r--r--. 1 yangjiayi zhoulab 5.2K Mar 31 21:11 ref-ROI.bed
-rw-r--r--. 1 yangjiayi zhoulab 5.7M Mar 31 21:08 ref-sj.ref

In v1.2.6:

IRFinder -m BuildRefProcess \
  -r result/index/IRFinder \
  -e RNA.SpikeIn.ERCC.fasta.gz \
  -b Human_hg38_nonPolyA_ROI.bed

I got this output information:

Launching reference build process. The full build should take at least one hour.
Usage : /beegfs/zhoulab/yangjiayi/project/fracRNA/reference/human/.snakemake/conda/ee893424cf93d5f01a4ab9e5e1b2aa21_/opt/irfinder-1.2.6/bin/util/IRFinder-BuildRefFromEnsembl mode threads STAR-executable base_ftp_url_of_ensembl_genome+gtf output_directory(must not exist) additional_genome_reference(eg: ERCC) non_polyA_genes-as-bed region_blacklist-as-bed
Usage example: /beegfs/zhoulab/yangjiayi/project/fracRNA/reference/human/.snakemake/conda/ee893424cf93d5f01a4ab9e5e1b2aa21_/opt/irfinder-1.2.6/bin/util/IRFinder-BuildRefFromEnsembl BuildRef 12 STAR "ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/" "IRFinder/REF/Human" "Refernce-ERCC.fa.gz" [non_polyA_genes.bed] [blacklist.bed]
    STAR --runMode genomeGenerate --genomeDir /beegfs/zhoulab/yangjiayi/project/fracRNA/reference/human/result/index/IRFinder/STAR --genomeFastaFiles /beegfs/zhoulab/yangjiayi/project/fracRNA/reference/human/result/index/IRFinder/genome.fa --sjdbGTFfile /beegfs/zhoulab/yangjiayi/project/fracRNA/reference/human/result/index/IRFinder/transcripts.gtf --sjdbOverhang 150 --runThreadN 28 &>> log-star-build-ref.log
    STAR version: 2.7.10b   compiled: 2022-11-01T09:53:26-04:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Apr 02 11:41:03 ..... started STAR run
!!!!! WARNING: Could not move Log.out file from ./Log.out into /beegfs/zhoulab/yangjiayi/project/fracRNA/reference/human/result/index/IRFinder/STAR/Log.out. Will keep ./Log.out

Apr 02 11:41:04 ... starting to generate Genome files
Apr 02 11:42:08 ..... processing annotations GTF
Apr 02 11:42:37 ... starting to sort Suffix Array. This may take a long time...
Apr 02 11:42:54 ... sorting Suffix Array chunks and saving them to disk...
Apr 02 16:00:28 ... loading chunks from disk, packing SA...
Apr 02 16:04:37 ... finished generating suffix array
Apr 02 16:04:37 ... generating Suffix Array index
Apr 02 16:07:48 ... completed Suffix Array index
Apr 02 16:07:48 ..... inserting junctions into the genome indices
Apr 02 16:19:35 ... writing Genome to disk ...
Apr 02 16:19:37 ... writing Suffix Array to disk ...
Apr 02 16:20:51 ... writing SAindex to disk
Apr 02 16:21:00 ..... finished successfully
Star genome build result: 0
Commence STAR mapping run for mapability.
Sun Apr  2 16:21:01 CST 2023

real    302m8.719s
user    263m51.874s
sys 25m59.972s
Completed STAR run.
Sun Apr  2 21:23:10 CST 2023
Commence Coverage calculation.

real    16m42.891s
user    12m53.464s
sys 3m48.529s

real    0m3.080s
user    0m2.942s
sys 0m0.059s
Completed coverage exclusion calculation.
Sun Apr  2 21:39:56 CST 2023
Mapability result: 0
Build Ref 1
Build Ref 2
Build Ref 3
Build Ref 4
Error: Type checker found wrong number of fields while tokenizing data line.
Perhaps you have extra TAB at the end of your line? Check with "cat -t"
Build Ref 5
Error: Type checker found wrong number of fields while tokenizing data line.
Perhaps you have extra TAB at the end of your line? Check with "cat -t"
Build Ref 6
Build Ref 7
Build Ref 8
Build Ref 9
Build Ref 10c
Build Ref 11c
/beegfs/zhoulab/yangjiayi/project/fracRNA/reference/human/.snakemake/conda/ee893424cf93d5f01a4ab9e5e1b2aa21_/opt/irfinder-1.2.6/bin/util/Build-BED-refs.sh: line 186: [: missing `]'
COMPLETE
Ref build result: 0
ALL DONE

The output directory looks like the following:

total 156M
-rw-r--r--. 1 yangjiayi zhoulab  23M Apr  2 21:37 exclude.directional.bed
-rw-r--r--. 1 yangjiayi zhoulab  30M Apr  2 21:37 exclude.omnidirectional.bed
-rw-r--r--. 1 yangjiayi zhoulab    0 Apr  2 21:41 intergenic.ROI.bed
-rw-r--r--. 1 yangjiayi zhoulab  13M Apr  2 21:37 introns.unique.bed
-rw-r--r--. 1 yangjiayi zhoulab  80M Apr  2 21:38 ref-cover.bed
-rw-r--r--. 1 yangjiayi zhoulab 6.4M Apr  2 21:38 ref-read-continues.ref
-rw-r--r--. 1 yangjiayi zhoulab 5.2K Apr  2 21:41 ref-ROI.bed
-rw-r--r--. 1 yangjiayi zhoulab 5.7M Apr  2 21:38 ref-sj.ref
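Every directory listing in this thread shows `intergenic.ROI.bed` at 0 bytes. The thread does not establish whether an empty file there is expected or a symptom of the error, so the following is only a small Python sketch for spotting zero-byte files in a reference directory. The function name is hypothetical.

```python
from pathlib import Path


def empty_files(directory):
    """Return the sorted names of zero-byte regular files in a directory."""
    return sorted(
        entry.name
        for entry in Path(directory).iterdir()
        if entry.is_file() and entry.stat().st_size == 0
    )
```

Calling `empty_files("result/index/IRFinder")` on a directory like the one listed above would return `["intergenic.ROI.bed"]`, assuming the directory holds only the files shown.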

My OS information:

LSB Version:    :core-4.1-amd64:core-4.1-noarch
Distributor ID: CentOS
Description:    CentOS Linux release 7.7.1908 (Core)
Release:        7.7.1908
Codename:       Core

What can I do to fix it? Can I quantify IR with this reference?

Thanks, Yang

dg520 commented 1 year ago

@yangjywhu Could this extra TAB problem be due to the version of Bedtools you used? I have never seen it before, so I cannot judge whether your reference has been prepared successfully. I would suggest updating your Bedtools or trying a clean installation of Bedtools v2.30.0 (make sure the new Bedtools is the default one that gets called).
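To check which Bedtools is picked up by default and what version it reports, a minimal Python sketch follows. It relies on `bedtools --version` printing a string such as `bedtools v2.30.0`. The function names are hypothetical, and the comparison against 2.30.0 simply mirrors the version suggested above.

```python
import re
import shutil
import subprocess


def parse_bedtools_version(text):
    """Extract (major, minor, patch) from text such as 'bedtools v2.30.0'."""
    match = re.search(r"v(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def installed_bedtools(executable="bedtools"):
    """Return (path, version) for the bedtools found first on PATH."""
    path = shutil.which(executable)
    if path is None:
        raise FileNotFoundError(f"{executable} not found on PATH")
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=True
    )
    return path, parse_bedtools_version(result.stdout + result.stderr)


if __name__ == "__main__":
    path, version = installed_bedtools()
    print(path, version, "matches 2.30.0:", version == (2, 30, 0))
```

Running this inside the same environment that launches IRFinder (for example the activated conda environment) shows the executable IRFinder would most likely call, assuming it resolves `bedtools` through `PATH`.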