Downloading FASTQ data from GEO/SRA

This post grew out of a discussion on bioinfo-core, whose full name is "Managers, Staff, and Scientists of Bioinformatics, Data Science and Research IT core facilities". If you are interested, you can subscribe to the list at https://mailman.open-bio.org/mailman/listinfo/bioinfo-core. One of the threads in February 2019 was titled "Downloading fastq data from GEO/SRA".

I learned a lot from the thread, so I decided to summarize it here to help everyone get to know the many tricks for downloading GEO/SRA data.

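Problems you may run into when downloading data:

Can FASTQ data be downloaded directly, instead of downloading the SRA file first and then converting it to FASTQ?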

But I feel SRA should actually just offer an easy service for querying sample sheets, instead of me having to use a third-party library. Or maybe it exists and I just missed it.

Labrador

Labrador (https://www.bioinformatics.babraham.ac.uk/projects/labrador/) is a web-based tool for managing public datasets and automating their processing.

SRA explorer

Note that I ported some of this functionality from Labrador into a stand-alone tool: https://ewels.github.io/sra-explorer/

It doesn't have any extra functionality though; it basically just finds URLs. If the ENA has a consistent FTP URL structure, then it should be simple to extend it to work with direct FastQ downloads too.

SRA Explorer (https://github.com/ewels/sra-explorer) aims to make datasets within the Sequence Read Archive more accessible. Once you have the accession for the data you want, it lets you collect SRA datasets and get a quick bash download script for either SRA files or, now, FastQ files (courtesy of the ENA API).

I was inspired by Hubert’s code snippet with the ENA API call to get download links and have just extended sra-explorer to use this too.

Now you can quickly get a bash script with direct FastQ download commands, complete with nice filenames: https://ewels.github.io/sra-explorer/
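The idea of turning an ENA API response into download commands can be sketched in Python. This is only an illustration, not the code used by sra-explorer: it assumes the ENA Portal API `filereport` endpoint with the `read_run` result type and the `run_accession` and `fastq_ftp` fields, and it assumes that `fastq_ftp` holds semicolon-separated paths without the `ftp://` scheme. All function names are hypothetical.

```python
import csv
import io
import urllib.parse
import urllib.request

# Assumed endpoint of the ENA Portal API file report.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def build_filereport_url(accession):
    """Build the query URL for a study, sample, or run accession."""
    params = {
        "accession": accession,
        "result": "read_run",
        "fields": "run_accession,fastq_ftp",
        "format": "tsv",
    }
    return ENA_FILEREPORT + "?" + urllib.parse.urlencode(params)


def parse_filereport(tsv_text):
    """Return {run_accession: [fastq_path, ...]} from a TSV report."""
    links = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        # Paired-end runs list several files separated by ';'.
        paths = [p for p in (row.get("fastq_ftp") or "").split(";") if p]
        links[row["run_accession"]] = paths
    return links


def download_commands(links):
    """Turn the parsed links into wget lines for a bash script."""
    commands = []
    for run in sorted(links):
        for path in links[run]:
            # Assumption: the report omits the scheme, so add it here.
            commands.append(f"wget -nc ftp://{path}")
    return commands


def fetch_fastq_links(accession):
    """Query the API over the network and parse the response."""
    with urllib.request.urlopen(build_filereport_url(accession)) as resp:
        return parse_filereport(resp.read().decode("utf-8"))
```

Calling `download_commands(fetch_fastq_links(accession))` with a study or run accession would then give one `wget` line per FASTQ file, which can be written to a shell script.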

I also fixed a bug discovered by Simon that was causing sra-explorer to fail in Firefox and Edge browsers, so if you weren’t able to get any results earlier today it may be worth having another go.

SRAdbV2: R interface to the NCBI SRA metadata

Download FASTQ directly via a workaround that goes through the ENA archive
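As a rough sketch of this workaround, the FASTQ location on the ENA FTP server can be derived from a run accession alone. The layout assumed here (a directory named after the first six characters of the accession, plus a zero-padded subdirectory built from the digits beyond the sixth for longer accessions) reflects my understanding of the ENA convention and should be checked against the ENA documentation; the function names are hypothetical.

```python
import re

# Assumed root of the ENA FASTQ FTP tree.
ENA_FASTQ_ROOT = "ftp://ftp.sra.ebi.ac.uk/vol1/fastq"


def ena_fastq_dir(run_accession):
    """Return the assumed FTP directory for a run such as SRR1234567."""
    if not re.fullmatch(r"[SED]RR\d{6,9}", run_accession):
        raise ValueError(f"not a run accession: {run_accession!r}")
    prefix = run_accession[:6]
    # Accessions with more than six digits get an extra subdirectory
    # made of the remaining digits, zero-padded to three characters.
    extra_digits = len(run_accession) - 9
    if extra_digits == 0:
        return f"{ENA_FASTQ_ROOT}/{prefix}/{run_accession}"
    subdir = run_accession[-extra_digits:].zfill(3)
    return f"{ENA_FASTQ_ROOT}/{prefix}/{subdir}/{run_accession}"


def ena_fastq_urls(run_accession, paired=False):
    """Return the expected file URLs for a single- or paired-end run."""
    base = ena_fastq_dir(run_accession)
    if paired:
        return [f"{base}/{run_accession}_{mate}.fastq.gz" for mate in (1, 2)]
    return [f"{base}/{run_accession}.fastq.gz"]
```

Guessing URLs this way avoids a metadata query, but it cannot tell whether a run is single- or paired-end or whether the files exist, so querying the ENA API for the actual links is the more robust route.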

Main content of the link:

downloading_fastq_GEO.pdf

enaBrowserTools

A collection of scripts to assist in the retrieval of data from the ENA Browser

We have a script that takes the GSM accession of each sample in a CSV file, downloads all the associated SRA files, converts them to FASTQ, and merges them if they come from the same sample.

It is available here:

https://github.com/bcbio/bcbio-nextgen/blob/master/scripts/bcbio_prepare_samples.py

You will need to install bcbio-nextgen to use it though, and have fastq-dump in your path. But you can install it with bioconda and it should work.

The file should look like this (project1.csv); only the first two columns are mandatory:

samplename,description,tissue,sequencer,comments
GSM458535,ER0000000103,Melanoblast,Illumina.Genome.Analyzer.II,
GSM458536,ER0000000104,Melanocyte,Illumina.Genome.Analyzer.II,

And once bcbio-nextgen is installed, you can run it like this:

bcbio_prepare_samples.py --out merged --csv project1.csv
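Before launching the script, it may be worth checking that the sample sheet has the layout described above. The following is a minimal sketch of such a check, based only on the statement that the first two columns (samplename and description) are mandatory; it is not the validation that bcbio-nextgen itself performs, and `check_sample_sheet` is a hypothetical helper.

```python
import csv
import io

# The two mandatory columns, in the order shown in the example file.
REQUIRED_COLUMNS = ("samplename", "description")


def check_sample_sheet(text):
    """Return (samplename, description) pairs, or raise ValueError."""
    reader = csv.DictReader(io.StringIO(text))
    header = [name.strip() for name in (reader.fieldnames or [])]
    if tuple(header[:2]) != REQUIRED_COLUMNS:
        raise ValueError(f"first two columns must be {REQUIRED_COLUMNS}")
    pairs = []
    # Data rows start on line 2 of the file, after the header.
    for line_no, row in enumerate(reader, start=2):
        name = (row.get("samplename") or "").strip()
        desc = (row.get("description") or "").strip()
        if not name or not desc:
            raise ValueError(f"line {line_no}: missing mandatory value")
        pairs.append((name, desc))
    return pairs
```

To check a file on disk, read it first, for example `check_sample_sheet(open("project1.csv").read())`.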

References

February 2019 Archives by subject: https://mailman.open-bio.org/mailman/private/bioinfo-core/2019-February/subject.html

xie186 / miscellaneous_note