Closed yuanwsy closed 3 months ago
I use devtools::install_github("yuntianf/Longcellsrc") to download the Longcellsrc first ,and successfully downloaded Longcellpre. Another question : have the Longcellpre instruction be uploaded yet?
Hi yuan, I'm writing the document for this package, but I could first update a simplified version so you can use it on your data.One thing to notice is that I'm still tesing this pipeline to make sure its feasiblility, so it may run into some bugs for some kind of data. I will complete the test as soon as possible in the recent two days, if you want to use before that and meet any problems, please don't hesitate to contact me, I will help debug.
Thanks
Thank you! and I do have some problems when using this tool: I provide gtf_path = "~/gencode.vM34.primary_assembly.annotation.gtf" genome_path = "~/GRCm39.genome.fa.gz" minimap_bed_path = "~/gencode.vM34.primary_assembly.annotation.bed" genome_name = "m39" toolkit = "3" and my data 、barcodes....... and run command like: RunLongcellPre(fastq = fastq,barcode_path = barcodes,toolkit = toolkit, genome_path = genome_path,genome_name = genome_name, gtf_path = gtf_path,minimap_bed_path = minimap_bed_path,work_dir = work_dir, samtools = samtools, minimap2 = minimap2,bedtools = bedtools)
then the situation is :
[1] "Start to build exon annotation based on the gtf file."
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Error in mutate()
:
ℹ In argument: id = exon_id(unique(strand), n())
.
ℹ In group 1: gene = "ENSMUSG00000000"
.
Caused by error in if (strand == "+") ...
:
! the condition has length > 1
Run rlang::last_trace()
to see where the error occurred.
and run the rlang::last_trace():
rlang::last_trace(drop = FALSE) <error/dplyr:::mutate_error> Error in
mutate()
: ℹ In argument:id = exon_id(unique(strand), n())
. ℹ In group 1:gene = "ENSMUSG00000000"
. Caused by error inif (strand == "+") ...
: ! the condition has length > 1Backtrace: ▆
- ├─LongcellPre::RunLongcellPre(...)
- │ ├─base::do.call(annotation, Param)
- │ └─LongcellPre (local)
<fn>
(...)- │ └─LongcellPre:::createAnnotation(...)
- │ └─LongcellPre::gtf2bed(...)
- │ └─temp %>% group_by(gene) %>% ...
- ├─dplyr::mutate(., id = exon_id(unique(strand), n()))
- ├─dplyr:::mutate.data.frame(., id = exon_id(unique(strand), n()))
- │ └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
- │ ├─base::withCallingHandlers(...)
- │ └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
- │ └─mask$eval_all_mutate(quo)
- │ └─dplyr (local) eval()
- ├─LongcellPre:::exon_id(unique(strand), n())
- └─base::.handleSimpleError(...)
- └─dplyr (local) h(simpleError(msg, call))
- └─rlang::abort(message, class = error_class, parent = parent, call = error_call)
Do you know where the error comes form?the gtf file I provide?
Hi, Thank you for pointing this out. I haven't benchmarked this pipeline with mouse dataset, so there may be some unanticipated bugs. I have fixed the bug for the annotation function for mouse gtf, it should work this time. Please let me know if you meet any other problems.
Thanks
It seems that I have to download some packages manually when I update the Longcellpre Because when I update the packages,It always reminds me just like this:
Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) :
there is no package called ‘peakRAM’
Calls:
and I have also got some warning because of the package function have some confliction: Warning: replacing previous import ‘dplyr::union’ by ‘igraph::union’ when loading ‘LongcellPre’ Warning: replacing previous import ‘dplyr::as_data_frame’ by ‘igraph::as_data_frame’ when loading ‘LongcellPre’ Warning: replacing previous import ‘dplyr::groups’ by ‘igraph::groups’ when loading ‘LongcellPre’
and the confliction seems to make the library error:
library(LongcellPre) Error in completeSubclasses(classDef2, class1, obj, where) : trying to get slot "subclasses" from an object of a basic class ("NULL") with no slots In addition: Warning messages: 1: replacing previous import ‘dplyr::union’ by ‘igraph::union’ when loading ‘LongcellPre’ 2: replacing previous import ‘dplyr::as_data_frame’ by ‘igraph::as_data_frame’ when loading ‘LongcellPre’ 3: replacing previous import ‘dplyr::groups’ by ‘igraph::groups’ when loading ‘LongcellPre’ Error: package or namespace load failed for ‘LongcellPre’: .onLoad failed in loadNamespace() for 'dbplyr', details: call: setClass(cl, contains = c(prevClass, "VIRTUAL"), where = where) error: error in contained classes ("character") for class “ident”; class definition removed from ‘dbplyr’
Hi,
You can automatically install the dependency while you update the package by setting install_github("yuntianf/LongcellPre",dependency=TRUE)
.
The confliction should be due to I import the whole dplyr
and igraph
in the NAMESPACE, I will update it later but it shouldn't influence the installation and usage.
For the error of dbplyr
, I didn't meet that. LongcellPre
doesn't import that package. I think I need to do some search to figure out this problem.
I have updated the namespace and the warnings disappeared, not sure if it still collides with dbplyr
, you could have a try.
still the same problem, the whole process like this:
library(future) library(Longcellsrc) library(LongcellPre) fastq = "~/fastq.gz" barcodes = "~/barcodes.tsv.gz" gtf_path = "~/gencode.vM34.primary_assembly.annotation.gtf" genome_path = "~/GRCm39.genome.fa.gz" minimap_bed_path = "~/gencode.vM34.primary_assembly.annotation.bed" genome_name = "mm39" toolkit = "3" work_dir = "~/OE/" RunLongcellPre(fastq = fastq,barcode_path = barcodes,toolkit = toolkit, genome_path = genome_path,genome_name = genome_name, gtf_path = gtf_path,minimap_bed_path = minimap_bed_path,work_dir = work_dir, samtools = samtools, minimap2 = minimap2,bedtools = bedtools) [1] "Start to build exon annotation based on the gtf file." Import genomic features from the file as a GRanges object ... OK Prepare the 'metadata' data frame ... OK Make the TxDb object ... OK Error in
mutate()
: ℹ In argument:id = exon_id(unique(strand), n())
. ℹ In group 1:gene = "ENSMUSG00000000001"
. Caused by error inn()
: ! could not find function "n" Runrlang::last_trace()
to see where the error occurred. Warning message: In .get_cds_IDX(mcols0$type, mcols0$phase) : The "phase" metadata column contains non-NA values for features of type stop_codon. This information was ignored.
It seems to stiil inadequacy for mouse genome
Hi,
I didn't meet this problem in my testing, but I update the namespace to specifcally import n()
from dplyr
, please tell me if it still doesn't work.
Thanks! It seems to work now ,
RunLongcellPre(fastq = fastq,barcode_path = barcodes,toolkit = toolkit, genome_path = genome_path,genome_name = genome_name, gtf_path = gtf_path,minimap_bed_path = minimap_bed_path,work_dir = work_dir, samtools = samtools, minimap2 = minimap2,bedtools = bedtools) [1] "Start to build exon annotation based on the gtf file." Import genomic features from the file as a GRanges object ... OK Prepare the 'metadata' data frame ... OK Make the TxDb object ... OK LongcellPre would be applied with 1 threads in sequential mode. Start to do barcode match:In total 67419192 reads out of 96853581 reads are identified with a valid adapter. Within them, 67418347 reads are identified with a valid tag region. start to build index index finished! The size of index is 3892
and it stucks here for about 12h, is this situation right?
Hi,
Glad to see it worked on your side.
But you data seems quite large, I would suggest to first subsample a smaller data, like 100k reads to test if LongcellPre could work on your data. Then you could run your whole data with parallization by setting the cores
and strategy
parameter in RunLongcellPre
.
You could refer those two parameters in future
package.
For me, worked with mouse genome (at leas in a small downsampled fastq as @yuntianf recommended, haven't tried full) using https://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/latest/mm10.fa.gz and https://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/genes/mm10.ensGene.gtf.gz. With other versions of the genome, I had the same issue.
Hi @rustino, Sorry could you please specify what issues you met? Then I could figure out what happened behind.
Sure @yuntianf, and sorry for the confusion.
I meant this problem:
fastq = "fastq.fastq.gz"
barcodes = "/white_list/737K-august-2016.txt"
gtf_path = "mm10.knownGene.gtf"
genome_path = "mm10.fa"
genome_name = "mm10"
toolkit = 3
work_dir = "/outdir/"
# specify the path for those tools
samtools = "/opt/conda/bin/samtools"
minimap2 = "/opt/conda/bin/minimap2"
bedtools = "/opt/conda/bin/bedtools"
> RunLongcellPre(fastq = fastq, barcode_path = barcodes,toolkit = toolkit,
genome_path = genome_path,genome_name = genome_name,
gtf_path = gtf_path, ,work_dir = work_dir,
samtools = samtools, minimap2 = minimap2, bedtools = bedtools)
[1] "Start to build exon annotation based on the gtf file."
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Error in `mutate()`:
ℹ In argument: `id = exon_id(unique(strand), n())`.
ℹ In group 323: `gene = "A0A087WNV3"`.
Caused by error in `if (strand == "+") ...`:
! the condition has length > 1
Run `rlang::last_trace()` to see where the error occurred.
Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
The "phase" metadata column contains non-NA values for features of type
stop_codon. This information was ignored.
It was resolved when I switch from mm10.knownGene.gtf to mm10.ensGene.gtf (I edited my previous reply, I'm going back and forth with this program but seems to be working now).
Hi @rustino, Thanks for pointing this out! I test this program all based on ensembl gene id so I didn't meet this bug. I will test annotation with other kinds of gene names to generalize it.
Will there be some notes note me that the RunLongcellPre gets to an end? or after all this command it comes to an end: There are 1012578 sequences identified with a barcode There are 1010005 sequences identified with a barcode There are 1012324 sequences identified with a barcode There are 1009746 sequences identified with a barcode
Hi @yuanwsy , When LongcellPre finishes, it will cast out "LongcellPre pipeline finished!". The notes you showed indicate the barcode match step has been finished.
I got an error that haven't figure out how to mix it now,do you know why?
[bam_sort_core] merging from 0 files and 64 in-memory blocks... ***** WARNING: File /data/R02/yuanwsy/longcellpre/OEsubsample3//annotation/cover.bed has inconsistent naming convention for record: GL456210.1 157534 157862
***** WARNING: File /data/R02/yuanwsy/longcellpre/OEsubsample3//annotation/cover.bed has inconsistent naming convention for record: GL456210.1 157534 157862
***** WARNING: File /data/R02/yuanwsy/longcellpre/OEsubsample3//annotation/temp.bed has inconsistent naming convention for record: GL456354.1 25923 27230 - ENSMUSG00000074720
***** WARNING: File /data/R02/yuanwsy/longcellpre/OEsubsample3//annotation/temp.bed has inconsistent naming convention for record: GL456354.1 25923 27230 - ENSMUSG00000074720
Start to extract isoforms:Error in .getOneSeqFromBSgenomeMultipleSequences(x, names[i], start[i], : sequence JH584304.1 not found
Hi @yuanwsy,
I think this is an error raised by getSeq
from the package BSgenome
. I guess the JH584304.1
is passed to the names
parameter which should represent the chromosome. Could you please check if JH584304.1
is in the chr
column in you bed file?
And please keep the genome_name
in RunLongcellPre
the same version as the genome you mapped to.
Yeah,I found it in the genecover.bed: JH584304.1 52190 59690 - ENSMUSG00000095041 the JH584304.1 is in the chr column
and I have kept the genome_name in RunLongcellPre the same version as the genome I mapped to(GRCm39-mm39) so I have to delete that line?or do something else?
Could you check you bam file header to see if it also contains JH584304.1
? I guess this JH584304.1
doesn't exist in the BSgenome, so if you are not specifically interested this gene, you could delete it in the gene bed.
I try another annotation.gtf and it works but I meet another problem:
Start to extract isoforms:Isoform extraction took 3.30 mins
Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
1 188.583 6.1 740.9
Error in $<-
:
! Assigned data rep(unit, (nrow(data)%/%length(unit) + 1))[1:nrow(data)]
must be compatible with existing data.
✖ Existing data has 7210 rows.
✖ Assigned data has 36464 rows.
ℹ Only vectors of size 1 are recycled.
Caused by error in vectbl_recycle_rhs_rows()
:
! Can't recycle input of size 36464 to size 7210.
Run rlang::last_trace()
to see where the error occurred.
rlang::last_trace() <error/tibble_error_assign_incompatible_size> Error in
$<-
: ! Assigned datarep(unit, (nrow(data)%/%length(unit) + 1))[1:nrow(data)]
must be compatible with existing data. ✖ Existing data has 7210 rows. ✖ Assigned data has 36464 rows. ℹ Only vectors of size 1 are recycled. Caused by error invectbl_recycle_rhs_rows()
: ! Can't recycle input of size 36464 to size 7210.Backtrace: ▆
- └─LongcellPre::RunLongcellPre(...)
- ├─base::do.call(umi_count_corres, Param)
- └─LongcellPre (local)
<fn>
(...)- └─LongcellPre:::genes_distribute(data, cores, gene)
- ├─base::
$<-
(*tmp*
, "group", value =<int>
)- └─tibble:::
$<-.tbl_df
(*tmp*
, "group", value =<int>
)- └─tibble:::tbl_subassign(...)
- └─tibble:::vectbl_recycle_rhs_rows(value, fast_nrow(xo), i_arg = NULL, value_arg, call)
I got exactly the same error as @yuanwsy. (I copy here the first part of it).
Error in
$<-: ! Assigned data
rep(unit, (nrow(data)%/%length(unit) +
1))[1:nrow(data)]must be compatible with existing data. ✖ Existing data has 13774 rows. ✖ Assigned data has 2032053 rows. ℹ Only vectors of size 1 are recycled. Caused by error in
vectbl_recycle_rhs_rows(): ! Can't recycle input of size 2032053 to size 13774.
Hi @rustino @yuanwsy , Thanks for pointing this out. This is a bug in the parallization and I have fixed it. Now it should work.
Hi @yuntianf, just let you know that the problem with parallelization is solved. I'll let you know if I run into more issues. Thanks!
It seems to take up too much memory to run the whole pipeline at the same time in my data,especially the step of barcodematch,do you have any recommendation for me to run?(just like if this step can be seperate into several steps to run
Hi @yuanwsy, This is expected as you have a really large dataset, and currently LongcellPre would store the barcode matching result as an intermediate variable, which leads to large memory usage. I could optimize it to read the fastq and store the result stage by stage, but it takes time. For your time efficiency, I would recommend to first split your fastq files into smaller sub-fastqs. Then you could run LongcellPre in seperate steps for each sub-fastq:
### In R
# replace your input path here
fastq = "sub_fastq.fq.gz"
barcodes = "path of your input cell barcode whitelist"
gtf_path = "path of your gtf annotation"
genome_path = "path of your genome annotation"
minimap_bed_path = "path of your bed annotation for minimap2, can be generated from gtf" //unnecessary
genome_name = "the genome name used for mapping, ex. hg38"
toolkit = your 10X sequencing toolkit
work_dir = "The output directory"
# create annotation
init(work_dir)
refer = annotation(gtf_path = gtf_path, work_dir = work_dir ,overwrite = FALSE)
gene_bed = refer[[1]]
gtf = refer[[2]]
# barcode match
plan(strategy = "multisession",workers = 4)
reads_bc = reads_extract_bc(fastq_path = fastq, barcode_path = barcodes , gene_bed = gene_bed,
genome_path = genome_path, genome_name = genome_name, toolkit = toolkit,
minimap_bed_path = minimap_bed_path, work_dir = work_dir, adapter = NULL)
Then in shell you can concat the output from the barcode match step and read it in R as the input for the UMI deduplication step.
### In R
reads_bc = read.table(file.path(work_dir,"BarcodeMatch/BarcodeMatch.txt"),header = TRUE,sep = "\t")
qual = read.table(file.path(work_dir,"BarcodeMatch/adapterNeedle.txt"), header = TRUE, sep = "\t")
plan(strategy = "multisession",workers = 4)
umi_count_corres(data = reads_bc,qual = qual,dir = file.path(work_dir,"/out/"),gene_bed = gene_bed,gtf = gtf)
Hope above steps could help with the memory problem.
Best
Thanks for your help,but when I trying to seperate the step, I try to use the readRDS to read the annotation file for barcodematch step,but if I do so ,the barcode match step come to an error like this gene_bed <- readRDS("/gene_bed.rds") Start to do barcode match:Error: Not compatible with requested type: [type=character; target=integer]. Timing stopped at: 0.002 0 0.002
and when I only do the annotation step ,It will also come to an error: refer = annotation(gtf_path = gtf_path, work_dir = work_dir ,overwrite = FALSE) [1] "Start to build exon annotation based on the gtf file." Import genomic features from the file as a GRanges object ... OK Prepare the 'metadata' data frame ... OK Make the TxDb object ... OK Error in file(file, ifelse(append, "a", "w")) : cannot open the connection
Thanks for your help,but when I trying to seperate the step, I try to use the readRDS to read the annotation file for barcodematch step,but if I do so ,the barcode match step come to an error like this gene_bed <- readRDS("/gene_bed.rds") Start to do barcode match:Error: Not compatible with requested type: [type=character; target=integer]. Timing stopped at: 0.002 0 0.002
For this one, could you please make sure your path for input gene bed is correct, as /gene_bed.rds
seems to be in the root path. If it's correct, could you please print a head part of this file?
and when I only do the annotation step ,It will also come to an error: refer = annotation(gtf_path = gtf_path, work_dir = work_dir ,overwrite = FALSE) [1] "Start to build exon annotation based on the gtf file." Import genomic features from the file as a GRanges object ... OK Prepare the 'metadata' data frame ... OK Make the TxDb object ... OK Error in file(file, ifelse(append, "a", "w")) : cannot open the connection
And for this one, according to the output, it seems that it stops at the step to write the output file, could you please make sure you have the permission to write the output in your output folder?
yes,the path is right, and the head part of the file looks like this : chr start end width strand gene id chr1 3143476 3144545 1070 + ENSMUSG00000102693 1 chr1 3172239 3172348 110 + ENSMUSG00000064842 1
and the second question,you mean I should use the init(work_dir) command to make sure I have the permission to write the output in your output folder?or in other way?
The init(work_dir)
will create serveral subfolders to save the output, if you didn't run it, the output folder doesn't exist, thus the result file couldn't be saved.
The
init(work_dir)
will create serveral subfolders to save the output, if you didn't run it, the output folder doesn't exist, thus the result file couldn't be saved.
Thank you ,I got it ,and could you tell me about the package that contains the init() function? I have not found the init() function
Now I have exported the init
function into the NAMESPACE in LongcellPre, please install the new package and try it again.
still,there is something wrong when seperate these steps, when I run the annotation generation step independently, the barcode match step will get some error: Start to do barcode match:Error: Not compatible with requested type: [type=character; target=integer]. Timing stopped at: 0.001 0.001 0.024 Execution halted
the same error as above,but if I use the runlongcellpre command , the error will not come
Hi @yuanwsy, This is expected as you have a really large dataset, and currently LongcellPre would store the barcode matching result as an intermediate variable, which leads to large memory usage. I could optimize it to read the fastq and store the result stage by stage, but it takes time. For your time efficiency, I would recommend to first split your fastq files into smaller sub-fastqs. Then you could run LongcellPre in seperate steps for each sub-fastq:
### In R # replace your input path here fastq = "sub_fastq.fq.gz" barcodes = "path of your input cell barcode whitelist" gtf_path = "path of your gtf annotation" genome_path = "path of your genome annotation" minimap_bed_path = "path of your bed annotation for minimap2, can be generated from gtf" //unnecessary genome_name = "the genome name used for mapping, ex. hg38" toolkit = your 10X sequencing toolkit work_dir = "The output directory" # create annotation init(work_dir) refer = annotation(gtf_path = gtf_path, work_dir = work_dir ,overwrite = FALSE) gene_bed = refer[[1]] gtf = refer[[2]] # barcode match plan(strategy = "multisession",workers = 4) reads_bc = reads_extract_bc(fastq_path = fastq, barcode_path = barcodes , gene_bed = gene_bed, genome_path = genome_path, genome_name = genome_name, toolkit = toolkit, minimap_bed_path = minimap_bed_path, work_dir = work_dir, adapter = NULL)
Then in shell you can concat the output from the barcode match step and read it in R as the input for the UMI deduplication step.
### In R reads_bc = read.table(file.path(work_dir,"BarcodeMatch/BarcodeMatch.txt"),header = TRUE,sep = "\t") qual = read.table(file.path(work_dir,"BarcodeMatch/adapterNeedle.txt"), header = TRUE, sep = "\t") plan(strategy = "multisession",workers = 4) umi_count_corres(data = reads_bc,qual = qual,dir = file.path(work_dir,"/out/"),gene_bed = gene_bed,gtf = gtf)
Hope above steps could help with the memory problem.
Best
Sorry, I found a typo that I missed a s
in the parameter barcode_path = barcodes
in reads_extract_bc
, I tested the seperate orders and it worked on my side.
after I update the Longcellpre package,I met a new error,and all of my inputs hava never changed: Start to do barcode match:In total 0 reads out of 0 reads are identified with a valid adapter. Within them, 0 reads are identified with a valid tag region. Error in names(x) <- value : 'names' attribute [2] must be the same length as the vector [1]
It is weird,do you know why?
after I update the Longcellpre package,I met a new error,and all of my inputs hava never changed: Start to do barcode match:In total 0 reads out of 0 reads are identified with a valid adapter. Within them, 0 reads are identified with a valid tag region. Error in names(x) <- value : 'names' attribute [2] must be the same length as the vector [1]
It is weird,do you know why?
It's hard to trace the bug based on only the error message, could you please show me the code you run LongcellPre?
library(future) library(Longcellsrc) library(LongcellPre) fastq = "/data/fulllength.fastq.gz" barcodes = "/data/barcodes.tsv.gz" gtf_path = "/data/gencode.vM33.basic.annotation.gtf.gz" genome_path = "/data/GRCm39.genome.fa.gz" genome_name = "mm39" toolkit = "3" work_dir = "/data/fulloutput/" samtools = "/usr/bin/samtools" bedtools = "~/miniconda3/envs/R4.1.2/bin/bedtools" minimap2 = "~/miniconda3/envs/R4.1.2/bin/minimap2" RunLongcellPre(fastq = fastq,barcode_path = barcodes,toolkit = toolkit, genome_path = genome_path,genome_name = genome_name, gtf_path = gtf_path,work_dir = work_dir, samtools = samtools, minimap2 = minimap2,bedtools = bedtools,cores = 32, strategy="multicore")
all of the code for running is above,thanks
This is weird, I only did some modifications in the NAMESPACE, which shouldn't influence the barcode match step. And I couldn't reproduce your error on my side with my test data. If you could share a small subset of your fastq file and also the cell barcode and gene_bed.rds to me, I could do some further diagnosis.
Hi,
Thanks for developing this very useful and much needed tool! Is this version of longcellpre ok to use yet? when I download the longcellpre,I can not download successful the error looks like this: Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) : there is no package called ‘Longcellsrc’ Calls: ... loadNamespace -> withRestarts -> withOneRestart -> doWithOneRestart
Execution halted
ERROR: lazy loading failed for package ‘LongcellPre’
thanks for your help