Open Romeo1-1 opened 2 years ago
I am traveling, so it will be a couple of days before I can look in detail at the first queries.
Do note the message in the main README: development of linCNV stopped when 10x discontinued their scCNV platform. Some things are not stable and we won't be fixing anything at this point, unfortunately (truly, we'd love to have the platform back!).
The manifest flatfile is from our core facility; you probably won't have one, and I don't remember the details of what columns it would require to make your own.
No problem! I totally understand. I'm trying to get results from samples that were analysed a year ago, but the fact that 10x discontinued the platform truly makes it harder! And compared to scRNAseq, I find that the packages are overall less "user-friendly" (I'm not criticizing your pipeline, which I find user-friendly).
And which parts of the pipeline are not working? I probably won't need the "crosstable option"; getting a cell x bin matrix would already be perfect for me right now.
Thank you in advance
Thanks for your patience. Below I copy option sets that ran without error the last time I used linCNV (earlier in 2022) for the bin and analyze actions. Those are the only actions I have used recently in our work; I cannot vouch for the working/non-working state of anything else.
The error you are getting is a file check error. The check function is apparently not getting any value for the file name when checking a file option. My guess is that an option specification was missing or malformed (perhaps by you, perhaps in the programming). Suggestions to rectify it would be to check all options and paths. Also, try using long-form option names. Sorry if this is our bug, but see the "not stable" disclaimer! I do know the program will run the bin and analyze actions if everything is set.
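As a rough aid for the "check all options and paths" suggestion, here is a minimal Python sketch that reports file options whose value is empty or does not exist on disk. It is not part of linCNV; the function name `find_missing_paths` and the example values are hypothetical, and the option names are simply copied from the option listings in this thread.

```python
import os

def find_missing_paths(options):
    """Return (option, path) pairs whose path is empty or not found on disk."""
    problems = []
    for name, path in options.items():
        # An empty or missing value is the situation the file check complains about
        if not path or not os.path.exists(path):
            problems.append((name, path))
    return problems

# Hypothetical values for illustration only; substitute your own option values
example_options = {
    "--gc-file": "/path/to/hg38.gc5Base.bin_1000.bed.gz",
    "--mappability-file": "",
}
for name, path in find_missing_paths(example_options):
    print(f"{name}: missing or empty path {path!r}")
```

Running something like this over every file and directory option before launching the pipeline should show quickly whether the problem is a path on your side or something in the program.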
Regarding the file requirements - as long as you had a BED format file with spans and score columns, I think it should work. README.md says "The size of bins in these files is not important, e.g. 1 kb bins". I haven't reviewed the code recently, but the eventual bins used for scanning the genome coverage are larger than 1kb, so it is presumably aggregating the GC and mappability scores over the bins based on the matching BED file rows (e.g. bedtools intersect).
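To illustrate what aggregating small-bin scores into larger bins could look like, here is a small Python sketch that computes a length-weighted mean score per fixed-size bin. This is an assumption about the general idea only, not linCNV's actual code (which I have not reviewed recently); the function name `aggregate_scores` and the toy rows are made up for the example.

```python
from collections import defaultdict

def aggregate_scores(rows, bin_size):
    """Length-weighted mean score per fixed-size bin.

    rows: iterable of (chrom, start, end, score) in 0-based, half-open
    BED coordinates. Returns {(chrom, bin_start): mean_score}.
    """
    weighted = defaultdict(float)  # sum of score * covered bases
    covered = defaultdict(int)     # number of covered bases
    for chrom, start, end, score in rows:
        pos = start
        while pos < end:
            # Split each row at large-bin boundaries so it can span several bins
            bin_start = (pos // bin_size) * bin_size
            piece_end = min(end, bin_start + bin_size)
            n_bases = piece_end - pos
            weighted[(chrom, bin_start)] += score * n_bases
            covered[(chrom, bin_start)] += n_bases
            pos = piece_end
    return {key: weighted[key] / covered[key] for key in weighted}

# Toy 1 kb rows (made-up scores) aggregated into 2 kb bins
rows = [
    ("chr1", 0, 1000, 0.40),
    ("chr1", 1000, 2000, 0.60),
    ("chr1", 2000, 3000, 0.50),
]
print(aggregate_scores(rows, 2000))
```

Because the mean is weighted by the number of bases each input row contributes, input rows of differing sizes would be handled the same way, which is consistent with the README statement that the input bin size is not important.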
Here is a bit of info on the files we use. They are bigger than I would want to add to this git repo, but if this is where you get stuck we could figure out a way to get them to you.
$ ls -lh /home/wilsonte_lab/clubhouse/genomes/hg38/hg38.gc5Base.bin_1000.bed.gz
-rw-r--r-- 1 wilsonte wilsonte_lab 19M Feb 12 2020 /home/wilsonte_lab/clubhouse/genomes/hg38/hg38.gc5Base.bin_1000.bed.gz
$ ls -lh /home/wilsonte_lab/clubhouse/genomes/hg38/hg38.kmer_50.bin_1000.bed.gz
-rw-r--r-- 1 wilsonte wilsonte_lab 18M Feb 12 2020 /home/wilsonte_lab/clubhouse/genomes/hg38/hg38.kmer_50.bin_1000.bed.gz
Here are the options sets for a working run:
linCNV bin
help options
compute options
-p,--n-cpu 6
-r,--ram-per-cpu 2G
-t,--tmp-dir /treehouse/wilsonte_lab/ssd/tmp
-T,--tmp-dir-large /home/wilsonte_lab/clubhouse/tmp
main options
-n,--data-name Fearon_26887_wt
-o,--output-dir /treehouse/wilsonte_lab/umms-glover/data/linCNV/projects/Fearon_070522/Fearon_26887_wt
shared options
-Q,--min-mapq 5
-P,--ploidy 2
bin options
-R,--cell-ranger-dir /treehouse/wilsonte_lab/path-wilsonte-turbo/globus_from_agc/6075-SA/10x_analysis_6075-SA/Sample_6075-SA-1
-w,--weight-per-cell 10
genome options
-g,--genome mm10
-G,--genome-dir /home/wilsonte_lab/clubhouse/genomes/mm10
-X,--gap-file /home/wilsonte_lab/clubhouse/genomes/mm10/mm10.gap.bed
-B,--bad-regions-file /treehouse/wilsonte_lab/ssd/genomes/Blacklist/lists/mm10-blacklist.v2.bed
-m,--gc-file /home/wilsonte_lab/clubhouse/genomes/mm10/mm10.gc5Base.bin_1000.bed.gz
-M,--mappability-file /home/wilsonte_lab/clubhouse/genomes/mm10/mm10.kmer_50.bin_1000.bed.gz
bwa /garage/wilsonte_lab/bin/bwa/bwa-0.7.17/bwa
samtools /garage/wilsonte_lab/bin/samtools/samtools-1.9/samtools-1.9/samtools
bedtools /garage/wilsonte_lab/bin/bedtools/bedtools_v2.28.0/bedtools2/bin/bedtools
pigz /usr/bin/pigz
Rscript /garage/wilsonte_lab/bin/R/wilson/R-4.2.0/bin/Rscript
linCNV analyze
help options
compute options
-p,--n-cpu 16
-r,--ram-per-cpu 2G
-t,--tmp-dir /treehouse/wilsonte_lab/ssd/tmp
-T,--tmp-dir-large /home/wilsonte_lab/clubhouse/tmp
main options
-n,--data-name Fearon_26887_wt
-o,--output-dir /treehouse/wilsonte_lab/umms-glover/data/linCNV/projects/Fearon_070522/Fearon_26887_wt
normalize options
-c,--min-modal-cn 0.25
-b,--min-mappability 0.25
-x,--max-excluded-bases 1000
-a,--min-allele-depth 2
scan options
-S,--n-scan-bins 100
segment options
-s,--transition-prob 1e-06
genome options
-g,--genome mm10
-G,--genome-dir /home/wilsonte_lab/clubhouse/genomes/mm10
-X,--gap-file /home/wilsonte_lab/clubhouse/genomes/mm10/mm10.gap.bed
-B,--bad-regions-file /treehouse/wilsonte_lab/ssd/genomes/Blacklist/lists/mm10-blacklist.v2.bed
-m,--gc-file /home/wilsonte_lab/clubhouse/genomes/mm10/mm10.gc5Base.bin_1000.bed.gz
-M,--mappability-file /home/wilsonte_lab/clubhouse/genomes/mm10/mm10.kmer_50.bin_1000.bed.gz
shared options
-Q,--min-mapq 5
-P,--ploidy 2
bin options
-R,--cell-ranger-dir /treehouse/wilsonte_lab/path-wilsonte-turbo/globus_from_agc/6075-SA/10x_analysis_6075-SA/Sample_6075-SA-1
-w,--weight-per-cell 10
bwa /garage/wilsonte_lab/bin/bwa/bwa-0.7.17/bwa
samtools /garage/wilsonte_lab/bin/samtools/samtools-1.9/samtools-1.9/samtools
bedtools /garage/wilsonte_lab/bin/bedtools/bedtools_v2.28.0/bedtools2/bin/bedtools
pigz /usr/bin/pigz
Rscript /garage/wilsonte_lab/bin/R/wilson/R-4.2.0/bin/Rscript
Hello, thank you very much for your answer !
After reviewing the code, I think that the GC and mappability files are causing trouble. I am also missing a gap file, but that was easier to find. Do you think you could send them to me via WeTransfer, please?
Thank you in advance
I have attempted to post the hg38 and mm10 files in question on Mendeley Data. They are in moderation, but I believe the url for access once approved will be:
Thank you very much, I'll keep you posted!
It is public on Mendeley Data now.
I'm very sorry, but even with your files I still get the error. The only file I am missing when I compare to your "output" is the "gap file", even though it's not listed as "required". I'm currently trying to make one myself, but it's still unclear how. As I understand it, I should make one using bedtools genomecov on my hg38.fa file, right? I'll keep you updated once it's ready.
No, the gap file is the missing regions of the genome (runs of N bases), you can download file gap.txt.gz from UCSC: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/
It is a trivial matter to convert that file to BED format using awk, perl, python or whatever tool you prefer to parse the columns.
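For example, a minimal Python sketch of that conversion, assuming the UCSC gap table column order is bin, chrom, chromStart, chromEnd, ix, n, size, type, bridge (check the gap.sql file in the same UCSC directory to confirm). The function name `gap_table_to_bed` is just for illustration, and I am assuming a plain 3-column BED is sufficient for the gap file option.

```python
import gzip
import sys

def gap_table_to_bed(gap_txt_gz, bed_out):
    """Write chrom, start, end from a UCSC gap table dump as 3-column BED.

    Assumes tab-separated columns: bin, chrom, chromStart, chromEnd, ...
    Returns the number of rows written.
    """
    n_rows = 0
    with gzip.open(gap_txt_gz, "rt") as src, open(bed_out, "w") as dst:
        for line in src:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 4:
                continue  # skip blank or malformed lines
            chrom, start, end = fields[1], fields[2], fields[3]
            dst.write(f"{chrom}\t{start}\t{end}\n")
            n_rows += 1
    return n_rows

if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python gap_to_bed.py gap.txt.gz gap.bed
    print(gap_table_to_bed(sys.argv[1], sys.argv[2]), "gap regions written")
```

UCSC table start coordinates are already 0-based, so the three columns can be copied unchanged into BED.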
Thank you very much for your help! The gap file was indeed required to start the process. The analysis is still running, so I'm not sure yet that the pipeline is fully working, but I'll keep you posted.
Thanks a lot again!
Hello again, I'm very sorry to bother you again, but I have an issue with the "analyze" part.
COMMAND STEP SCRIPT DATE
bin 1 bin/bin.sh Fri Sep 9 18:42:03 MDT 2022
setting bin endpoints and counting reads per bin per cell
Error: No cells marked as accepted! Did you manually mark cells in web interface?
Execution halted
This is the command I'm running, very similar to yours. I tried to find an R Shiny app in the linCNV directory but couldn't find one. Is there something I should do?
linCNV analyze
-p,--n-cpu 4
-r,--ram-per-cpu 4G
-t,--tmp-dir /tmp
-T,--tmp-dir-large /tmp
main options
-n,--data-name xxx
-o,--output-dir /mnt/h/xxx/outs/linCNV
genome options
-g,--genome hg38
-G,--genome-dir /mnt/h/Genome
-X,--gap-file /mnt/h/Genome/gap.bed
-B,--bad-regions-file /mnt/h/Genome/hg38.blacklist.bed
-m,--gc-file /mnt/h/Genome/hg38.gc5Base.bin_1000.bed
-M,--mappability-file /mnt/h/Genome/hg38.kmer_50.bin_1000.bed
normalize options
-c,--min-modal-cn 0.25
-b,--min-mappability 0.25
-x,--max-excluded-bases 1000
-a,--min-allele-depth 2
segment options
-s,--transition-prob 1e-06
shared options
-Q,--min-mapq 5
-P,--ploidy 2
scan options
-S,--n-scan-bins 100
bin options
-R,--cell-ranger-dir /mnt/h/xxx/outs
-w,--weight-per-cell 10
bwa /home/chris/miniconda3/bin/bwa
samtools /home/chris/miniconda3/bin/samtools
bedtools /home/chris/miniconda3/bin/bedtools
pigz /home/chris/miniconda3/bin/pigz
Rscript /usr/bin/Rscript
Thank you in advance
I thought you might ask that. The pipeline as currently implemented has a manual step for marking acceptable cells to analyze again. Marking is accomplished in the mark_cells Shiny app here in the repo:
https://github.com/wilsonte-umich/linCNV/tree/master/_server
I'm sorry if this is cumbersome, but it's where the project stopped (the long-term thought was to get better at automating the cell selection to avoid the manual step).
If you don't want to mark cells (although it IS educational!), you can also manually create/edit the cell marking file to accept all cells and then analyze should run fine (or you could hack the code to bypass the accepted cell check).
Hello @wilsonte-umich ,
I'm sorry for the delay; I got involved in other projects and didn't have time to get back to you.
I've not been able to run the RShiny app. After running:
library(shiny)
setwd("/home/chris/Python/linCNV")
runApp("_server/mark_cells")
I get:
An error has occurred! object 'cellTypes' not found
At first I thought it might come from the fact that I run R on Windows and Python/this pipeline through WSL2. But even after installing RStudio on WSL2, I get the same error.
I tried putting "mark_cells" in different folders, as in the output folder of the "bin" command, but I got the same error every time.
I didn't even try to hack the code because I'm not good enough at coding.
Finally, I decided to modify the Rdata file from the "bin" output and was able to run the "analyze" command. It seems to have worked perfectly!
But now I face the same problem as before: when running "heat_map" or "qc_plots" I get the error "object 'projects' not found". The error probably comes from my end because I probably ran the RShiny app once or twice before.
Thank you again in advance for your help,
I believe (haven't done it recently) you need to modify and run local.R to run the server on your computer. https://github.com/wilsonte-umich/linCNV/blob/master/_server/mark_cells/local.R
Sorry it isn't well documented, but this project's development ended a while ago in an incomplete state; it's on GitHub "as is".
I'm also not sure it is a tool I'd recommend anymore. I've been looking at some of our 10x scCNV data recently and seeing things that will be challenging for the method used in linCNV. I'm writing new tools, but it will be many months before there is something sharable (it's a side project).
Hello, thank you for this pipeline, which seems very interesting!
I have 3 questions:
About the input files, especially the "gc" and "mappability" bed files. It's the first package I have found that needs those 2 files. I found the mappability bed file for hg38 on this website (https://bismap.hoffmanlab.org/) and the GC file here (https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips). I had to convert the bw file to wig and then to bed. This seems like a lot of pre-processing, and the two files do not have the same "bin" size. Did I do this correctly, or will it be a problem downstream?
About the "bin" command: when I use the dry run there's no problem, but when I run the command (cf. below), I get
Do you know why it isn't working?
Thanks a lot in advance,