yanailab / celseq2

Generate the UMI count matrix from CEL-Seq2 sequencing data
https://yanailab.github.io/celseq2/
BSD 3-Clause "New" or "Revised" License

demo data in CEL-Seq2 #39

Closed yueli8 closed 4 years ago

yueli8 commented 4 years ago

Hello,

Thank you for developing such nice software: CEL-Seq2.

Where can I download the demo data listed in the Experiment Table:

lane1-R1.fastq.gz, lane1-R2.fastq.gz, lane2-R1.fastq.gz, and lane2-R2.fastq.gz

in the website:

https://github.com/yanailab/celseq2

or the files referenced in config_single_lib.yaml:

R1: '/ifs/home/yy1533/Lab/cel-seq-pipe/demo/data/7_S1_L001_R1_001.fastq.gz'
R2: '/ifs/home/yy1533/Lab/cel-seq-pipe/demo/data/7_S1_L001_R2_001.fastq.gz'

Thank you in advance for your help!

Best,

Yue

Puriney commented 4 years ago

Hello Yue, sorry, I did not provide the demo data on GitHub. It was meant to be a small CEL-Seq2 sequencing FASTQ file. I'm afraid I no longer work at NYU, so I cannot provide you with the data. However, here are two solutions.

  1. If possible, you could use the data from the original CEL-Seq2 paper. It seems each single cell has its own FASTQ, so you would need to merge them to run the celseq2 pipeline.

  2. Alternatively, I found that I had implemented a generator of dummy CEL-Seq2 sequencing reads. It randomly generates a pair of R1 + R2 fastq.gz files, plus a GTF and a FASTA for a dummy species.

Once you install celseq2, these Bash commands are available to you.

# generate dummy species
celseq2-dummy-species --gtf a/b/c.gtf --fasta a/b/c.fasta
# generate dummy CEL-seq2 reads
celseq2-simulate --gtf a/b/c.gtf --fasta a/b/c.fasta --savetor1 R1.fq --savetor2 R2.fq

This way you will have everything you technically need to run the celseq2 pipeline.
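For the first option, merging the per-cell files could look roughly like the sketch below. It is only an illustration: the function name `merge_cell_fastqs` and the `<cell>_R1.fastq.gz` / `<cell>_R2.fastq.gz` naming are assumptions, not something celseq2 provides or requires. It relies on the fact that concatenated gzip files are still a valid gzip stream.

```shell
# Hypothetical helper: merge per-cell gzipped FASTQ pairs into one R1/R2 pair.
# Assumes every cell has both <cell>_R1.fastq.gz and <cell>_R2.fastq.gz
# inside the input directory.
merge_cell_fastqs() {
    indir=$1       # directory holding the per-cell files
    outprefix=$2   # output prefix; keep it OUTSIDE indir so the glob never matches it

    # Globs expand in sorted order, so R1 and R2 are concatenated in the
    # same cell order and the read pairs stay aligned.
    cat "$indir"/*_R1.fastq.gz > "${outprefix}_R1.fastq.gz" || return 1
    cat "$indir"/*_R2.fastq.gz > "${outprefix}_R2.fastq.gz" || return 1
}
```

Usage would be something like `merge_cell_fastqs cells merged/all_cells`, which writes `merged/all_cells_R1.fastq.gz` and `merged/all_cells_R2.fastq.gz` (the `merged` directory has to exist already).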

Best, Yun

yueli8 commented 4 years ago

Hello, Yun,

Thank you so much for your detailed explanation!

Thank you again and really appreciated!

Best,

Yue

li@-Desktop-590-p0xxx:~/celseq2-master$ celseq2-simulate --gtf Danio_rerio.GRCz10.87.gtf --fasta Danio_rerio.GRCz11.dna.toplevel.fa --savetor1 SRR9609653_1.fastq --savetor2 SRR9609653_2.fastq
Traceback (most recent call last):
  File "/home/li/.local/bin/celseq2-simulate", line 8, in <module>
    sys.exit(main())
  File "/home/li/.local/lib/python3.6/site-packages/celseq2/dummy_CELSeq2_reads.py", line 264, in main
    gzip=args.gzip)
  File "/home/li/.local/lib/python3.6/site-packages/celseq2/dummy_CELSeq2_reads.py", line 183, in dummy_CELSeq2
    min_qual=default_qual)
  File "/home/li/.local/lib/python3.6/site-packages/celseq2/dummy_CELSeq2_reads.py", line 110, in dummy_readquality
    assert not length is None, 'Specify length.'
AssertionError: Specify length.
li@-Desktop-590-p0xxx:~/celseq2-master$ head -n 12 SRR9609653_1.fastq 
@MN00336:7:00CELSEQ2:1:78888:10964:367 1:N:0:ATGCGC
AAAAAAAGACTC
+
,@/9?C9:+86E
@MN00336:7:00CELSEQ2:1:20266:32663:7 1:N:0:ATGCGC
AAAAATAGACTC
+
.-?/D.;8,+GI
@MN00336:2:00CELSEQ2:1:42122:46947:341 1:N:0:ATGCGC
AAAAAGAGACTC
+
.1<76.0.4;;/

li@-Desktop-590-p0xxx:~/celseq2-master$ head -n 12 SRR9609653_2.fastq 
@MN00336:7:00CELSEQ2:1:78888:10964:367 2:N:0:ATGCGC
CATTCTTCACAGAGTATCTGCAGGTATTGATCACCCCTGATCAGTTATTG
+
;B80ED6-::B=:<6,,HE4F;.E=G@->CF5;@0EC,=60=I<.:CHA,
@MN00336:7:00CELSEQ2:1:20266:32663:7 2:N:0:ATGCGC
GCGGAAGCCCCGCTCCAGCTAAACACCCAGTGGTTCGCTCTCAGAGTTAT
+
.AIGF5A:H@EE6A-1H7@<3=G5500D@H?H1016I19,+;>:GEH<3/
@MN00336:2:00CELSEQ2:1:42122:46947:341 2:N:0:ATGCGC
GCAGCTCCTGCTTTCATTCAATTAGTGTGATTACAGACAAACGTCATCAA
+
22-9HDC954+1I/<8AI4HB:<88D->@A/B?C>-B?01CD,0I7,9@:
Puriney commented 4 years ago

No. My simulation is really naive, so it only works for the dummy species I generated. Here it seems you ran it directly on zebrafish.

yueli8 commented 4 years ago

Hello, Puriney,

Thank you so much for your message!

  1. I downloaded the GTF and FASTA files for C. elegans.

The output only mentions chr1; is that correct?

li@Desktop-590-p0xxx:~/celseq2-master$ celseq2-dummy-species --gtf Caenorhabditis_elegans.WBcel235.99.gtf --fasta Caenorhabditis_elegans.WBcel235.dna.toplevel.fa 
End of chr1 is 4200
  2. I downloaded only the C. elegans samples from GSE78779 and merged them together into GSE78779_C_elegans.fastq. There is still an error:
li@Desktop-590-p0xxx:~/celseq2-master$ celseq2-simulate --gtf Caenorhabditis_elegans.WBcel235.99.gtf  --fasta Caenorhabditis_elegans.WBcel235.dna.toplevel.fa --savetor1 GSE78779_C_elegans.fastq 
usage: celseq2-simulate [-h] [--gtf FILE] [--fasta FILE] --savetor1 FILE
                        --savetor2 FILE [--expected-alignment FILE] [--gzip]
                        [--test] [--verbose]
celseq2-simulate: error: the following arguments are required: --savetor2

But when I used the following command, it produced only one FASTQ file, SRR3196087_1.fastq, not the R1 + R2 pair.


li@Desktop-590-p0xxx:~/sratoolkit.2.9.6-1-ubuntu64/bin$ fastq-dump --split-files SRR3196087.sra 
Read 1273706 spots for SRR3196087.sra
Written 1273706 spots for SRR3196087.sra

Thank you again and really appreciate any of your help!

Best,

Yue

Puriney commented 4 years ago

The simulation creates a fake genome of my own design, so don't use any real species data. The goal is simply to be able to run the pipeline.

  1. Generate a fake species:

celseq2-dummy-species --gtf fake_species.gtf --fasta fake_species.fasta

This command will generate new GTF and FASTA files for the fake genome.

  2. Generate dummy CEL-Seq2 reads:

celseq2-simulate --gtf fake_species.gtf --fasta fake_species.fasta --savetor1 R1.fq --savetor2 R2.fq

This command will generate random reads for the fake species.
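As an optional sanity check on the simulated pair (this is not part of celseq2), one could confirm that R1 and R2 list the same read IDs in the same order. The sketch below assumes plain, uncompressed FASTQ with four lines per record, like the `head` output earlier in this thread; `fastq_ids` and `check_fastq_pair` are hypothetical helper names.

```shell
# Print the read ID (the header text before the first space) of every record.
# Assumes uncompressed FASTQ with exactly four lines per record.
fastq_ids() {
    awk 'NR % 4 == 1 { print $1 }' "$1"
}

# Return 0 only if both files carry the same read IDs in the same order.
check_fastq_pair() {
    r1_ids=$(mktemp) && r2_ids=$(mktemp) || return 2
    fastq_ids "$1" > "$r1_ids"
    fastq_ids "$2" > "$r2_ids"
    if cmp -s "$r1_ids" "$r2_ids"; then
        status=0
    else
        status=1
    fi
    rm -f "$r1_ids" "$r2_ids"
    return "$status"
}
```

For example, `check_fastq_pair R1.fq R2.fq && echo "read IDs match"`.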

From now on, treat the fake species as you would human/mouse genome data and do the routine pre-processing.

  1. Run bowtie2 or STAR to index fake_species.fasta.

  2. Follow the tutorial to set up the pipeline.

  3. Use R1.fq, R2.fq, fake_species.gtf, and the aligner index to run the pipeline.
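Put together, the data-preparation part of these steps might be scripted roughly as follows. This is a sketch under assumptions: it uses bowtie2 (not STAR) for the index, and the function name `make_fake_dataset`, the working directory, and the index prefix `fake_species_bt2` are made up for illustration. The celseq2 flags are the ones shown in this thread; setting up and launching the pipeline itself is left to the tutorial.

```shell
# Hypothetical wrapper: build the fake species, simulate reads, index the genome.
# Assumes celseq2 and bowtie2 are installed and on PATH.
make_fake_dataset() {
    workdir=${1:-fake_run}   # hypothetical default directory name

    # Stop early if any required tool is missing.
    for tool in celseq2-dummy-species celseq2-simulate bowtie2-build; do
        command -v "$tool" >/dev/null 2>&1 || {
            echo "missing tool: $tool" >&2
            return 1
        }
    done

    mkdir -p "$workdir" || return 1

    # Generate the GTF and FASTA of the fake species.
    celseq2-dummy-species --gtf "$workdir/fake_species.gtf" \
                          --fasta "$workdir/fake_species.fasta" || return 1

    # Generate dummy CEL-Seq2 reads for that fake species.
    celseq2-simulate --gtf "$workdir/fake_species.gtf" \
                     --fasta "$workdir/fake_species.fasta" \
                     --savetor1 "$workdir/R1.fq" \
                     --savetor2 "$workdir/R2.fq" || return 1

    # Build a bowtie2 index; the prefix name is an arbitrary choice.
    bowtie2-build "$workdir/fake_species.fasta" \
                  "$workdir/fake_species_bt2" || return 1
}
```

Calling `make_fake_dataset fake_run` would leave the GTF, FASTA, read pair, and bowtie2 index in `fake_run/`, ready to be referenced from the pipeline configuration described in the tutorial.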

yueli8 commented 4 years ago

Hello, Puriney,

Thank you so much for your great help!

It works!

Thank you again!

Best,

Yue

Puriney commented 4 years ago

Glad to hear that. Thanks for using our tool.