wheaton5 / souporcell

Clustering scRNAseq by genotypes
MIT License
153 stars 44 forks source link

support for 10x atac-seq #35

Open micans opened 4 years ago

micans commented 4 years ago

10x atac-seq data does not have UMI. Souporcell gives:

Traceback (most recent call last): File "/opt/souporcell/souporcell_pipeline.py", line 104, in <module> assert float(num_umi) / float(num_read_test) > 0.5, "Less than 50% of first 100000 reads have UMI tag (UB), turn on -- AssertionError: Less than 50% of first 100000 reads have UMI tag (UB), turn on --ignore True to ignore

When run with --ignore True this leads to

0 loaded 0 counts, is this a problem?
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', src/libcore/option.rs:378:21
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.

A workaround, as you mentioned, is to add dummy UB tags with incremented integer value for each read.

wheaton5 commented 4 years ago

Looking to add this this weekend. Will update when complete and tested.

wheaton5 commented 4 years ago

--no_umi option now available.

micans commented 4 years ago

Fantastic, thank you very much, will run it this evening or tomorrow.

wheaton5 commented 4 years ago

Reopen if you have issues, otherwise I'm closing this now.

Best, Haynes

micans commented 4 years ago

Hello, I get this error when running it:

Traceback (most recent call last):
  File "/opt/souporcell/renamer.py", line 61, in <module>
    fastq.write(read.seq+"\n")
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
Traceback (most recent call last):
  File "/opt/souporcell/souporcell_pipeline.py", line 500, in <module>
    (region_fastqs, all_fastqs) = make_fastqs(args)
  File "/opt/souporcell/souporcell_pipeline.py", line 186, in make_fastqs
    assert not(procs[index].returncode), "renamer subprocess terminated abnormally with code " + str(procs[index].returncode)
AssertionError: renamer subprocess terminated abnormally with code 1

Stdout was:

END8738154 ok
checking modules
imports done
checking bam for expected tags
checking fasta
restarting pipeline in existing directory soc
generating fastqs with cell barcodes and umis in readname

Output created was:

farm5-head2,/data/END8738154/soc () ls -l
total 22209964
-rw-r--r-- 1 svd cellgeni 6141127589 Mar 10 11:26 souporcell_fastq_0_0.fq
-rw-r--r-- 1 svd cellgeni 1975836298 Mar 10 11:21 souporcell_fastq_1_0.fq
-rw-r--r-- 1 svd cellgeni 1324725022 Mar 10 11:20 souporcell_fastq_2_0.fq
-rw-r--r-- 1 svd cellgeni  341972656 Mar 10 11:21 souporcell_fastq_2_1.fq
-rw-r--r-- 1 svd cellgeni 1243642626 Mar 10 11:20 souporcell_fastq_3_0.fq
-rw-r--r-- 1 svd cellgeni 3084480084 Mar 10 11:24 souporcell_fastq_3_1.fq
-rw-r--r-- 1 svd cellgeni 2375807667 Mar 10 11:21 souporcell_fastq_4_0.fq
-rw-r--r-- 1 svd cellgeni  104089175 Mar 10 11:19 souporcell_fastq_5_0.fq
-rw-r--r-- 1 svd cellgeni 2790825214 Mar 10 11:22 souporcell_fastq_5_1.fq
-rw-r--r-- 1 svd cellgeni 1806936452 Mar 10 11:21 souporcell_fastq_6_0.fq
-rw-r--r-- 1 svd cellgeni   45621282 Mar 10 11:19 souporcell_fastq_7_0.fq
-rw-r--r-- 1 svd cellgeni 1507879354 Mar 10 11:21 souporcell_fastq_7_1.fq

Command line was:

singularity exec -B /full/path/to/souporcell.sif souporcell_pipeline.py -i possorted_genome_bam.bam -b barcodes.tsv -f /full/path/to/genome.fa -t 8 -o soc -k 4 --no_umi True

I hope I haven't missed something obvious, am a bit stuck. read.seq is None apparently!

wheaton5 commented 4 years ago

Um... Not sure. First of all I would try deleting the output directory and restarting. It looks like you restarted on a partially complete run and given that the code has changed maybe something about intermediate outputs didn't meet expectations.

micans commented 4 years ago

Mmmm. I did delete the END8738154/soc directory. There is a small chance I did it for the wrong sample, I'll do it again.

wheaton5 commented 4 years ago

I know that bams can have entries with null chrom/pos/mapq etc if they are unmapped. But I have not heard of reads with null sequences. It is of course technically possible in the format, but I don't know why anyone would write out a null sequence bam entry. I can put a check in for that, but I suspect something else is wrong if we are running into that and just ignoring those reads is ignoring an input data problem or something else.

micans commented 4 years ago

I've done it again with the same result. Also did samtools view possorted_genome_bam.bam | tail -n 200 > ttt which looks fine (silly test). I'm checked my genome file before, is there any way the genome file could interact with this? What I'll do now is to use the workaround that you mentioned and tee it to a sam file as well. If the workaround leads to the same error I can look at the sam file to see if there is a problem there.

wheaton5 commented 4 years ago

If you point me at the bam file location I can run a simple test.

micans commented 4 years ago

(Sent it via mail)

wheaton5 commented 4 years ago

I don't see an email from you 1hr later. Checked spam and all mail.

micans commented 4 years ago

Sorry, now sent to your actual e-mail address.

wheaton5 commented 4 years ago

Looks like there are in fact reads with no sequence, but they are mapped lol. This must be a quirk of the order or processing, all of these reads have long TR tags which is adapter sequence that has been trimmed off the read. I will look into this a bit more and may bring it to the attention of the 10x cellranger team, but I guess I can just ignore these reads in souporcell. I'll let you know when I have a solution.

wheaton5 commented 4 years ago

An example

A00968:59:HNCV3DMXX:2:2230:8938:32283 69 chr2 32916508 0 = 32916508 0 * MC:Z:49M AS:i:0XS:i:0 CR:Z:ACCTGCTGTTCAGTAC CY:Z:FFFFFFFFFFFFFFFF CB:Z:ACCTGCTGTTCAGTAC-1 BC:Z:CCTACCAT QT:Z:FFFFFF:F TR:Z:CTTCTCTTATACACATCTCCGAGCCCACGAGACCCTACCATCTCGCGTAT TQ:Z:F:FFFFFFFFFFFFF:FFFFFFFFFFFFF:FFFFFFFFFFF,:F,FFFFF GP:i:-1 MP:i:281872929 MQ:i:0 RG:Z:cellranger-atac120_count_33264_WSSS_END8738154_GRCh38-1_2_0:MissingLibrary:1:HNCV3DMXX:1

okay so it is marked as unmapped. That must be because it's mate is mapped and by default when you have one read from a pair mapped and the other unmapped, you put the unmapped read directly after the mapped one and give it that location. This is pretty dumb but whatever we can deal with it.

wheaton5 commented 4 years ago

2 line fix, now just wait an hour for the singularity build...

micans commented 4 years ago

Thanks a lot Haynes for checking this so quickly. And as I type this I see the 2 line fix coming in. Fab. We have 16 samples we are going to run this on, your support of course very much appreciated.

wheaton5 commented 4 years ago
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1MwJ2wDUuAIAZW3aRqI6zzCa7S9PN0-YI' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1MwJ2wDUuAIAZW3aRqI6zzCa7S9PN0-YI" -O souporcell.sif && rm -rf /tmp/cookies.txt
micans commented 4 years ago

Got it, now running. Edit - as it was running happily I have now submitted all 16 samples. I'm off for three days, but will try to update this later today whether they all ran to conclusion.

micans commented 4 years ago

Out of 16 jobs 11 succeeded, one is still running, and four failed. I've attached the concatenated error logs for those four; there seem to be three different types of error in there. I'm off on holidays now, will be back on Monday and can supply more information then. In case you're interested the analysis directories are all (two levels) under the data directory in the file location I sent, with log files in data/logs and scripts in data/../actions. I've noticed e.g. in the first doublets.err file this line: 44 loaded 0 counts, is this a problem?, but otherwise haven't looked for any patterns or sample QC. One of the errors is [E::bgzf_read] Read block operation failed with error 33 after 0 of 4 bytes; I don't think it's a quota issue as another sample completed successfully later. errors.txt

micans commented 4 years ago

Update: Very likely these four samples should have been run with different -k. This does not entirely negate the report I'd say, but of course I'll be running it with the intended k now.

wheaton5 commented 4 years ago

Sorry for the delay on this. Will put some effort towards it starting tomorrow. I just fled to the US to wait out the coronavirus with family.

micans commented 4 years ago

Hey, glad you're home with family. I've run SoC in a few different ways now. When I ran with the common variants file I got this:

***** WARNING: File /lustre/scratch117/cellgen/cellgeni/TIC-atacseq/tic-564/data/filtered_2p_1kgenomes_GRCh38.vcf has inconsistent naming convention for record:
1       16103   .       T       G       .       PASS    AF=0.02;AC=126;NS=2548;AN=5096;EAS_AF=0.0;EUR_AF=0.04;AFR_AF=0.03;AMR_AF=0.03;SAS_AF=0.02;VT=SNP;DP=29994

***** WARNING: File /lustre/scratch117/cellgen/cellgeni/TIC-atacseq/tic-564/data/filtered_2p_1kgenomes_GRCh38.vcf has inconsistent naming convention for record:
1       16103   .       T       G       .       PASS    AF=0.02;AC=126;NS=2548;AN=5096;EAS_AF=0.0;EUR_AF=0.04;AFR_AF=0.03;AMR_AF=0.03;SAS_AF=0.02;VT=SNP;DP=29994

Traceback (most recent call last):
  File "/opt/souporcell/souporcell_pipeline.py", line 535, in <module>
    doublets(args, ref_mtx, alt_mtx, cluster_file)
  File "/opt/souporcell/souporcell_pipeline.py", line 481, in doublets
    subprocess.check_call(["troublet", "--alts", alt_mtx, "--refs", ref_mtx, "--clusters", cluster_file], stdout = dub, stderr = err)
  File "/usr/local/envs/py36/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['troublet', '--alts', 'soc/alt.mtx', '--refs', 'soc/ref.mtx', '--clusters', 'soc/clusters_tmp.tsv']' returned non-zero exit status 101.

I wondered whether running with the common variants file you provide, in combination with atac should in principle work?

There are a lot of unassigned cells, in the thousands.

Perhaps the quickest way to insight may be if you look at the data or even run SoC on it? I'm of course very happy to run it myself with any suggestions. My setup is this: In the actions directory in the path that I sent earlier there are two scripts: A wrapper script around SoC called spoon.sh and a submit script submit.sh. The submit script sets -k to different values for different groups of samples. The wrapper script essentially does this:

singularity exec -B /lustre/path $im souporcell_pipeline.py -i $psb -b $bcd -f $fa -t $CPU -o soc2 -k $thek --no_umi True

I've tried running it with --skip_remap True --common_variants $cv as well (that didn't work). Latest outputs are in the soc2 directories under the sample IDs in the data directory. Grepping for unassigned: grep -c unassigned */soc2/clusters.tsv the result indicates > 50% cells unassigned.

annamath commented 4 years ago

Are there any updates for this issue? I am getting similar errors and I also tried with other common variants files, with different allele frequencies: 0.02%, 0.05%, 0.0005%. But the error remains. I noticed that the output file common_variants_covered.vcf is empty. The command I used is bellow: singularity exec souporcell.sif souporcell_pipeline.py --bam /data/sample4_counts/sample4/outs/possorted_bam.bam --out_dir /data/sample4_genotype_souporcell --fasta /data/refdata-cellranger-atac-GRCh38-1.2.0/fasta/genome.fa --common_variants /data/<common_variants.vcf> --barcodes /data/barcodes.tsv --threads 8 --clusters 8 --ploidy 2 --skip_remap SKIP_REMAP --no_umi NO_UMI

Thanks! Anna

wheaton5 commented 4 years ago

Hi @annamath definitely a bug was introduced in my recent set of commits. Fix incoming today. But not 100% sure that bug is what you are experiencing.

Also --skip_remap True and --no_umi True, not like you have it. I should make an assert to check this or better make it a flag and not a boolean parameter.

wheaton5 commented 4 years ago

fix live

micans commented 4 years ago

Hi, I ran this on 16 samples, and the results are still mostly unassigned. This is a summary plus small section of clusters.tsv for one of the samples:

     16 doublet
   2050 singlet
   5377 unassigned
farm5-head2,atacseq/tic-564/data () head -n 20 SAMPLEID/soc2/clusters.tsv | cut -b 1-50
barcode status  assignment      log_prob_singleton      log_p
AAACGAAAGCGTCTGC-1      unassigned      3/1     -2.3037945038006
AAACGAAAGTATCTGC-1      singlet 2       -5.908980259021121      -8
AAACGAAAGTCGATAA-1      singlet 1       -6.536346148212079      -1
AAACGAAAGTTCCCGG-1      singlet 0       -3.8533101076467293     -
AAACGAACAGCTTACA-1      unassigned      0       -2.211613314788154
AAACGAACAGGGTACA-1      unassigned      0/1     -0.6931471805599
AAACGAACATAGTCCA-1      unassigned      3       -1.038083762039585
AAACGAACATCATGTG-1      unassigned      3       -0.724895878874830
AAACGAAGTAAACGAT-1      unassigned      3/0     -1.7217667427088
AAACGAAGTAACGGCA-1      unassigned      0/3     -4.7620291728093
AAACGAAGTTTAAGGA-1      singlet 2       -6.426728248148587      -7
AAACGAATCCCAGCAG-1      unassigned      2       -3.957924569022875
AAACGAATCGAGTGTT-1      singlet 0       -2.0553437645803196     -
AAACGAATCGGGAATG-1      unassigned      1       -4.003515409130633
AAACGAATCTTGTACT-1      unassigned      0/2     -5.1253132827478
AAACTCGAGAGAACCC-1      unassigned      0       -0.917749393490980
AAACTCGAGGAATGGA-1      unassigned      0/3     -6.8841289350546
AAACTCGCAAACCTAC-1      unassigned      2       -2.00935754787678
AAACTCGCACCCTTAC-1      singlet 3       -9.637265287929914      -1

In the above I'm slightly curious about e.g. unassigned<TAB>3. This was with the most recent image downloaded after 'fix live' (990707712 bytes). I can provide all sorts of information or paths to the data in Sanger lustre.

wheaton5 commented 4 years ago

Okay I'll run one of these myself to be able to debug.

wheaton5 commented 4 years ago

@micans how many clusters in sample END8738154?

edit: looks like u ran with 4 before, so ill test with that.

wheaton5 commented 4 years ago

I got 972/3500 unassigned. But I think this may be due to the nature of scATACseq with pileups at a few locations. If a cell only covers a handful of SNPs it may be hard to assign. I will look into it more tomorrow.

micans commented 4 years ago

That sounds better than what I had (for the ones where I did run k=2); is there any chance you could run the sample ID ending in 56 as well? My results for that (k=2) is

      6 doublet
   2270 singlet
   1792 unassigned

In the logs I see that /opt/souporcell/souporcell/target/release/souporcell -k 2 -a soc2/alt.mtx -r soc2/ref.mtx --restarts 100 -b barcodes.tsv --min_ref 4 --min_alt 4 --threads 8 was run (so correct k).

Actually the owner told me "there is one where most of the cells are coming form one donor, but the rest should be fine".

The sample ending 63 is another one I ran with k=2, the results there are

      4 doublet
   1422 singlet
   5310 unassigned

These counts were obtained with cut -f 2 clusters.tsv | sort | uniq -c | grep -v status

yannaquino commented 4 years ago

Hi! I am also trying to use souporcell to demultiplex scATAC-seq data. I have downloaded the latest version of your software and am using the --no_umi tag. This is what the command line looks like:

singularity exec souporcell_pipeline.py \ -i ${TASKDIR}/cellRanger/${SAMPLE}/outs/possorted_bam.bam \ -b ${TASKDIR}/souporcell/passing/${SAMPLE}_passing_barcodes.tsv \ -f ${TASKDIR}/souporcell/genome.fa \ -o ${TASKDIR}/souporcell/passing/${SAMPLE}v2 \ -k 8 \ --no_umi True \ -t ${SLURM_CPUS_PER_TASK} Everything seems to work fine through the remapping stage. Then, I get this error on the freebayes subprocess:

Traceback (most recent call last): File "/opt/souporcell/souporcell_pipeline.py", line 551, in <module> final_vcf = freebayes(args, bam, fasta) File "/opt/souporcell/souporcell_pipeline.py", line 412, in freebayes assert not(procs[index].returncode), "freebayes subprocess terminated abnormally with code " + str(procs[index].returncode) AssertionError: freebayes subprocess terminated abnormally with code -6 Do you have any idea about what that may be due to?

Thanks a lot for souporcell and the active feedback!

wheaton5 commented 4 years ago

@micans @yannaquino Sorry I missed these posts somehow. Just saw them as I was going through and closing resolved issues. Are you all still having problems?

gokceneraslan commented 3 years ago

Any ideas on whether remapping is necessary or not for 10X ATAC data, given that cellranger atac uses BWA-MEM and not STAR?

wheaton5 commented 3 years ago

While I have not tested BWA-MEM, it should in principle work and not need the remapping step.

scfurl commented 3 years ago

Hi,

Thanks @wheaton5 for SoC! I have had enormous success with scRNA on other occasions, but as mentioned above, with cellranger-atac output i get a large number of unassigned.

singlet 137 status 1 unassigned 13297

The vcf file has 3579 variants, and I invoked SoC as follows:

singularity exec $sif souporcell_pipeline.py \ -i possorted_bam.bam -b barcodes.tsv -f $REF \ -t 15 -o $OUTDIR -k 2 --no_umi True

The cells were pbmc mixed from 2 donors so k should only be 2. I made a barcodes.tsv file myself with the barcodes I know are cells (and not droplets).

Any help or thoughts would be greatly appreciated!!