Help with input files & error in bin command

Romeo1-1 commented 2 years ago

Hello, thank you for this pipeline that seems very interesting !

I have 3 questions :

About the input files, especially the "gc" and "mappability" BED files: this is the first package I have found that needs those two files. I found the mappability file for hg38 on this website (https://bismap.hoffmanlab.org/) and the GC file here (https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips). I had to convert the bw to wig and then to a BED file. This seems like a lot of pre-processing, and the two files do not have the same "bin" size. Did I do this correctly, or will that be a problem downstream?
About the "bin" command: when I use the dry run there is no problem, but when I run the command (see below), I get:

./linCNV bin -p 4 -n test -o /mnt/h/test/outs/linCNV -R /mnt/h/test/outs -g hg38 -G /mnt/h/Genome -B /mnt/h/Genome/hg38.blacklist.bed -m /mnt/h/Genome/hg38.gc.bed -M /mnt/h/Genome/k50.umap.bed

compute options
    -p,--n-cpu               4
    -r,--ram-per-cpu         4G
    -t,--tmp-dir             /tmp
    -T,--tmp-dir-large       /tmp
main options
    -n,--data-name           test
    -o,--output-dir          /mnt/h/test/outs/linCNV
shared options
    -Q,--min-mapq            5
    -P,--ploidy              2
genome options
    -g,--genome              hg38
    -G,--genome-dir          /mnt/h/Genome
    -B,--bad-regions-file    /mnt/h/Genome/hg38.blacklist.bed
    -m,--gc-file             /mnt/h/Genome/hg38.gc.bed
    -M,--mappability-file    /mnt/h/Genome/k50.umap.bed
bin options
    -R,--cell-ranger-dir     /mnt/h/test/outs
    -w,--weight-per-cell     10

bwa            /home/xxx/miniconda3/bin/bwa
samtools       /home/xxx/miniconda3/bin/samtools
bedtools       /usr/bin/bedtools
pigz           /home/xxx/miniconda3/bin/pigz
Rscript        /usr/bin/Rscript

Use of uninitialized value $fileName in -e at ./_workflow/launcher_utilities.pl line 121.
Use of uninitialized value $fileName in concatenation (.) or string at ./_workflow/launcher_utilities.pl line 121.

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
file not found:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Do you know why it isn't working ?

Out of curiosity, I tried the "crosstable" option, but I didn't have a 'manifest-file'. I couldn't find it on Google either. Where can I find it?

Thank you a lot in advance,

wilsonte-umich commented 2 years ago

I am traveling, it will be a couple of days before I will be able to look in detail at the first queries.

Do note the message in the main README: development of linCNV stopped when 10x discontinued their scCNV platform. Some things are not stable, and unfortunately we won't be fixing anything at this point (truly, we'd love to have the platform back!).

The manifest flat file is from our core facility, so you probably won't have one. I don't remember the details of which columns it would require to make your own.

Romeo1-1 commented 2 years ago

No problem ! I totally understand. I'm trying to get results from samples that were analysed a year ago, but the fact that 10x discontinued the platform truly makes it harder ! And compared to scRNAseq I find overall that the packages are less "user-friendly" (I'm not criticizing your pipeline which I find user-friendly)

And which parts of the pipeline are not working? I probably won't need the "crosstable" option; getting a cell x bin matrix would already be perfect for me right now.

Thank you in advance

wilsonte-umich commented 2 years ago

Thanks for your patience. Below I copy option sets that ran without error the last time I used linCNV (earlier in 2022) for the bin and analyze actions. Those are the only actions I have used recently in our work; I cannot vouch for the working or non-working state of anything else.

The error you are getting is a file check error. The check function is apparently not getting any value for the file name when checking a file option. My guess is that an option specification was missing or malformed (perhaps by you, perhaps in the programming). Suggestions to rectify it would be to check all options and paths. Also, try using long-form option names. Sorry if this is our bug, but see the "not stable" disclaimer! I do know the program will run the bin and analyze actions if everything is set.
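To illustrate the kind of check that is failing here, below is a minimal Python sketch of a pre-flight test for file-valued options. It is not linCNV's code (the real check is in the Perl file launcher_utilities.pl); the function name `check_file_options` and the `FILE_OPTIONS` list are hypothetical, with option names copied from the option reports in this thread. The idea is that an option with no value at all gets reported by name, instead of producing an empty "file not found:" message.

```python
import os

# Hypothetical list of file-valued options, copied from the long-form
# option names shown in the option reports in this thread.
FILE_OPTIONS = ["gap-file", "bad-regions-file", "gc-file", "mappability-file"]


def check_file_options(options, required=FILE_OPTIONS):
    """Return a list of problems found; an empty list means every check passed.

    `options` maps long-form option names (without the leading dashes)
    to the values given on the command line.
    """
    problems = []
    for name in required:
        path = options.get(name)
        if not path:
            # The option was never set: report its name explicitly.
            problems.append(f"--{name}: no value given")
        elif not os.path.isfile(path):
            problems.append(f"--{name}: file not found: {path}")
    return problems


if __name__ == "__main__":
    # Paths taken from the command in the first post; no gap file is given.
    opts = {
        "bad-regions-file": "/mnt/h/Genome/hg38.blacklist.bed",
        "gc-file": "/mnt/h/Genome/hg38.gc.bed",
        "mappability-file": "/mnt/h/Genome/k50.umap.bed",
    }
    for problem in check_file_options(opts):
        print(problem)
```

Running a check like this before launching the pipeline separates "option not supplied" from "path is wrong", which the error output above does not distinguish.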

Regarding the file requirements: as long as you have a BED-format file with span and score columns, I think it should work. README.md says "The size of bins in these files is not important, e.g. 1 kb bins". I haven't reviewed the code recently, but the eventual bins used for scanning the genome coverage are larger than 1 kb, so it is presumably aggregating the GC and mappability scores over the bins based on the matching BED file rows (e.g. bedtools intersect).
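For anyone who wants to bring GC and mappability files with different span sizes onto a common grid before running the pipeline, here is a minimal Python sketch of that kind of aggregation. It is only an illustration under stated assumptions, not what linCNV does internally: the function name `aggregate_to_bins` is hypothetical, input rows are assumed to be (chrom, start, end, score) in 0-based half-open BED coordinates, and each bin's score is the mean over the bases actually covered by input rows (uncovered bases are ignored, and the last bin is not clipped to the chromosome length).

```python
from collections import defaultdict


def aggregate_to_bins(rows, bin_size=1000):
    """Average scored BED intervals over fixed-size bins.

    rows: iterable of (chrom, start, end, score) tuples.
    Returns a sorted list of (chrom, bin_start, bin_end, mean_score),
    where the mean is weighted by the number of bases each input
    interval contributes to the bin.
    """
    weighted_sum = defaultdict(float)  # (chrom, bin_start) -> sum of score * bases
    covered = defaultdict(int)         # (chrom, bin_start) -> bases covered
    for chrom, start, end, score in rows:
        pos = start
        while pos < end:
            bin_start = (pos // bin_size) * bin_size
            # Part of the interval that falls inside this bin.
            piece_end = min(end, bin_start + bin_size)
            n_bases = piece_end - pos
            weighted_sum[(chrom, bin_start)] += score * n_bases
            covered[(chrom, bin_start)] += n_bases
            pos = piece_end
    return [
        (chrom, bin_start, bin_start + bin_size,
         weighted_sum[(chrom, bin_start)] / covered[(chrom, bin_start)])
        for (chrom, bin_start) in sorted(covered)
    ]
```

An interval that straddles a bin boundary is split, so each bin only receives the bases that fall inside it. The same result can be obtained with bedtools; the sketch just makes the arithmetic explicit.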

Here is a bit of info on the files we use. They are bigger than I would want to add to this git repo, but if this is where you get stuck we could figure out a way to get them to you.

$ ls -lh /home/wilsonte_lab/clubhouse/genomes/hg38/hg38.gc5Base.bin_1000.bed.gz
-rw-r--r-- 1 wilsonte wilsonte_lab 19M Feb 12  2020 /home/wilsonte_lab/clubhouse/genomes/hg38/hg38.gc5Base.bin_1000.bed.gz

$ ls -lh /home/wilsonte_lab/clubhouse/genomes/hg38/hg38.kmer_50.bin_1000.bed.gz
-rw-r--r-- 1 wilsonte wilsonte_lab 18M Feb 12  2020 /home/wilsonte_lab/clubhouse/genomes/hg38/hg38.kmer_50.bin_1000.bed.gz

Here are the options sets for a working run:

linCNV bin

help options
compute options
    -p,--n-cpu               6
    -r,--ram-per-cpu         2G
    -t,--tmp-dir             /treehouse/wilsonte_lab/ssd/tmp
    -T,--tmp-dir-large       /home/wilsonte_lab/clubhouse/tmp
main options
    -n,--data-name           Fearon_26887_wt
    -o,--output-dir          /treehouse/wilsonte_lab/umms-glover/data/linCNV/projects/Fearon_070522/Fearon_26887_wt
shared options
    -Q,--min-mapq            5
    -P,--ploidy              2
bin options
    -R,--cell-ranger-dir     /treehouse/wilsonte_lab/path-wilsonte-turbo/globus_from_agc/6075-SA/10x_analysis_6075-SA/Sample_6075-SA-1
    -w,--weight-per-cell     10
genome options
    -g,--genome              mm10
    -G,--genome-dir          /home/wilsonte_lab/clubhouse/genomes/mm10
    -X,--gap-file            /home/wilsonte_lab/clubhouse/genomes/mm10/mm10.gap.bed
    -B,--bad-regions-file    /treehouse/wilsonte_lab/ssd/genomes/Blacklist/lists/mm10-blacklist.v2.bed
    -m,--gc-file             /home/wilsonte_lab/clubhouse/genomes/mm10/mm10.gc5Base.bin_1000.bed.gz
    -M,--mappability-file    /home/wilsonte_lab/clubhouse/genomes/mm10/mm10.kmer_50.bin_1000.bed.gz

bwa            /garage/wilsonte_lab/bin/bwa/bwa-0.7.17/bwa
samtools       /garage/wilsonte_lab/bin/samtools/samtools-1.9/samtools-1.9/samtools
bedtools       /garage/wilsonte_lab/bin/bedtools/bedtools_v2.28.0/bedtools2/bin/bedtools
pigz           /usr/bin/pigz
Rscript        /garage/wilsonte_lab/bin/R/wilson/R-4.2.0/bin/Rscript

linCNV analyze

help options
compute options
    -p,--n-cpu               16
    -r,--ram-per-cpu         2G
    -t,--tmp-dir             /treehouse/wilsonte_lab/ssd/tmp
    -T,--tmp-dir-large       /home/wilsonte_lab/clubhouse/tmp
main options
    -n,--data-name           Fearon_26887_wt
    -o,--output-dir          /treehouse/wilsonte_lab/umms-glover/data/linCNV/projects/Fearon_070522/Fearon_26887_wt
normalize options
    -c,--min-modal-cn        0.25
    -b,--min-mappability     0.25
    -x,--max-excluded-bases  1000
    -a,--min-allele-depth    2
scan options
    -S,--n-scan-bins         100
segment options
    -s,--transition-prob     1e-06
genome options
    -g,--genome              mm10
    -G,--genome-dir          /home/wilsonte_lab/clubhouse/genomes/mm10
    -X,--gap-file            /home/wilsonte_lab/clubhouse/genomes/mm10/mm10.gap.bed
    -B,--bad-regions-file    /treehouse/wilsonte_lab/ssd/genomes/Blacklist/lists/mm10-blacklist.v2.bed
    -m,--gc-file             /home/wilsonte_lab/clubhouse/genomes/mm10/mm10.gc5Base.bin_1000.bed.gz
    -M,--mappability-file    /home/wilsonte_lab/clubhouse/genomes/mm10/mm10.kmer_50.bin_1000.bed.gz
shared options
    -Q,--min-mapq            5
    -P,--ploidy              2
bin options
    -R,--cell-ranger-dir     /treehouse/wilsonte_lab/path-wilsonte-turbo/globus_from_agc/6075-SA/10x_analysis_6075-SA/Sample_6075-SA-1
    -w,--weight-per-cell     10

bwa            /garage/wilsonte_lab/bin/bwa/bwa-0.7.17/bwa
samtools       /garage/wilsonte_lab/bin/samtools/samtools-1.9/samtools-1.9/samtools
bedtools       /garage/wilsonte_lab/bin/bedtools/bedtools_v2.28.0/bedtools2/bin/bedtools
pigz           /usr/bin/pigz
Rscript        /garage/wilsonte_lab/bin/R/wilson/R-4.2.0/bin/Rscript

Romeo1-1 commented 2 years ago

Hello, thank you very much for your answer !

After reviewing the code, I think the GC and mappability files are causing the trouble. I am also missing a gap file, but that was easier to find. Could you please send them to me via WeTransfer?

Thank you in advance

wilsonte-umich commented 2 years ago

I have attempted to post the hg38 and mm10 files in question on Mendeley Data. They are in moderation, but I believe the url for access once approved will be:

https://data.mendeley.com/datasets/jr36ntmzsh

Romeo1-1 commented 2 years ago

Thank you very much, I'll keep you posted!

wilsonte-umich commented 2 years ago

It is public on Mendeley Data now.

Romeo1-1 commented 2 years ago

I'm very sorry, but even with your files I still get the error. The only file I am missing when I compare with your "output" is the "gap file", even though it is not listed as "required". I'm currently trying to make one myself, but it's still unclear how. As I understand it, I should make one using bedtools genomecov on my hg38.fa file, right? I'll keep you updated once it's ready.
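The comparison described here (a failing option report against a working one) can also be done mechanically. The Python sketch below is a hypothetical helper, not part of linCNV: it assumes option lines have the shape printed in this thread (a short flag, a comma, a long flag, whitespace, then a value without spaces) and lists the long option names that appear in the reference report but not in yours.

```python
import re

# Matches report lines such as "    -X,--gap-file            /path/to/gap.bed".
# Section headers like "genome options" do not match and are skipped.
OPTION_LINE = re.compile(r"^\s*-\w,--([a-z0-9-]+)\s+(\S+)\s*$")


def parse_option_report(text):
    """Return {long_option_name: value} for every option line in the report."""
    options = {}
    for line in text.splitlines():
        match = OPTION_LINE.match(line)
        if match:
            options[match.group(1)] = match.group(2)
    return options


def missing_options(mine, reference):
    """Long option names present in `reference` but absent from `mine`."""
    return sorted(set(parse_option_report(reference)) - set(parse_option_report(mine)))
```

Pasting the two reports into strings and calling `missing_options(my_report, working_report)` returns the names to look at first. Options that differ only in value (for example hg38 versus mm10) are not reported.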

wilsonte-umich commented 2 years ago

No, the gap file lists the missing regions of the genome (runs of N bases). You can download the file gap.txt.gz from UCSC: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/

It is a trivial matter to convert that file to BED format using awk, perl, python or whatever tool you prefer to parse the columns.
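As one possible way to do that conversion, here is a minimal Python sketch. It assumes the standard UCSC gap table layout (tab-separated columns bin, chrom, chromStart, chromEnd, ix, n, size, type, bridge, with 0-based starts, so the coordinates can be copied into BED unchanged). The function name `gap_table_to_bed` and the choice to keep the gap type as a fourth column are illustrative; whether linCNV needs exactly three columns or a particular sort order has not been confirmed here.

```python
import gzip


def gap_table_to_bed(in_path, out_path):
    """Convert a UCSC gap table dump (gap.txt.gz) to a sorted 4-column BED file.

    Returns the number of rows written.
    """
    rows = []
    with gzip.open(in_path, "rt") as src:
        for line in src:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 8:
                continue  # skip blank or malformed lines
            # Columns (0-based index): 1 = chrom, 2 = chromStart, 3 = chromEnd, 7 = type
            rows.append((fields[1], int(fields[2]), int(fields[3]), fields[7]))
    rows.sort()  # by chromosome name, then by start coordinate
    with open(out_path, "w") as dst:
        for chrom, start, end, gap_type in rows:
            dst.write(f"{chrom}\t{start}\t{end}\t{gap_type}\n")
    return len(rows)
```

A call such as `gap_table_to_bed("gap.txt.gz", "hg38.gap.bed")` (file names are placeholders) writes the BED file next to the download.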

Romeo1-1 commented 2 years ago

Thank you very much for your help! The gap file was indeed required to start the process. The analysis is still running, so I'm not sure yet that the pipeline is fully working, but I'll keep you posted.

Thanks a lot again !

Romeo1-1 commented 2 years ago

Hello again, I'm very sorry to bother you again, but I have an issue with the "analyze" part.

COMMAND STEP    SCRIPT  DATE
bin     1       bin/bin.sh      Fri Sep  9 18:42:03 MDT 2022

setting bin endpoints and counting reads per bin per cell
Error: No cells marked as accepted! Did you manually mark cells in web interface?
Execution halted

This is the command I'm running, very similar to yours. I tried to find an R Shiny app in the linCNV directory but couldn't find one. Is there something I should do?

linCNV analyze

    -p,--n-cpu               4
    -r,--ram-per-cpu         4G
    -t,--tmp-dir             /tmp
    -T,--tmp-dir-large       /tmp
main options
    -n,--data-name           xxx
    -o,--output-dir          /mnt/h/xxx/outs/linCNV
genome options
    -g,--genome              hg38
    -G,--genome-dir          /mnt/h/Genome
    -X,--gap-file            /mnt/h/Genome/gap.bed
    -B,--bad-regions-file    /mnt/h/Genome/hg38.blacklist.bed
    -m,--gc-file             /mnt/h/Genome/hg38.gc5Base.bin_1000.bed
    -M,--mappability-file    /mnt/h/Genome/hg38.kmer_50.bin_1000.bed
normalize options
    -c,--min-modal-cn        0.25
    -b,--min-mappability     0.25
    -x,--max-excluded-bases  1000
    -a,--min-allele-depth    2
segment options
    -s,--transition-prob     1e-06
shared options
    -Q,--min-mapq            5
    -P,--ploidy              2
scan options
    -S,--n-scan-bins         100
bin options
    -R,--cell-ranger-dir     /mnt/h/xxx/outs
    -w,--weight-per-cell     10

bwa            /home/chris/miniconda3/bin/bwa
samtools       /home/chris/miniconda3/bin/samtools
bedtools       /home/chris/miniconda3/bin/bedtools
pigz           /home/chris/miniconda3/bin/pigz
Rscript        /usr/bin/Rscript

Thank you in advance

wilsonte-umich commented 2 years ago

I thought you might ask that. The pipeline as currently implemented has a manual step for marking acceptable cells to analyze again. Marking is accomplished in the mark_cells Shiny app here in the repo:

https://github.com/wilsonte-umich/linCNV/tree/master/_server

I'm sorry if this is cumbersome, but it's where the project stopped (the long-term thought was to get better at automating the cell selection to avoid the manual step).

If you don't want to mark cells (although it IS educational!), you can also manually create/edit the cell marking file to accept all cells and then analyze should run fine (or you could hack the code to bypass the accepted cell check).

Romeo1-1 commented 2 years ago

Hello @wilsonte-umich ,

I'm sorry for the delay, I got involved in other projects and didn't have time to get back to you.

I've not been able to run the RShiny App. After running :

library(shiny)
setwd("/home/chris/Python/linCNV")
runApp("_server/mark_cells")

I get :

An error has occurred! object 'cellTypes' not found

At first I thought it may come from the fact that I run R on Windows and Python/this pipeline through WSL2. But even after installing Rstudio on WSL2 I get the same error.

I tried putting "mark_cells" in different folders, as in the output folder of the "bin" command, but I got the same error every time.

I didn't even try to hack the code because I'm not good enough at coding.

Finally, I decided to modify the Rdata file from the "bin" output and have been able to run the "analyze" command. It seems to have worked perfectly !

But now I face the same problem as before: when running "heat_map" or "qc_plots", I get the error object 'projects' not found. The error probably comes from my end, because I probably ran the R Shiny app once or twice before.

Thank you again in advance for your help,

wilsonte-umich commented 2 years ago

I believe (haven't done it recently) you need to modify and run local.R to run the server on your computer. https://github.com/wilsonte-umich/linCNV/blob/master/_server/mark_cells/local.R

Sorry it isn't well documented, but this project's development ended a while ago in an incomplete state, it's on GitHub "as is".

I'm also not sure it is a tool I'd recommend anymore. I've been looking at some of our 10x scCNV data recently and seeing things that will be challenging for the method used in linCNV. I'm writing new tools, but it will be many months before there is something shareable (it's a side project).
