ytchen0323 / cloud-scale-bwamem

Apache License 2.0

Question: why is the FASTQ stored in ADAM format after upload? #14

Open xubo245 opened 7 years ago

xubo245 commented 7 years ago

Question: why is the FASTQ stored in ADAM format after upload?

Is it because the Parquet format takes less space and is more universal and friendlier for subsequent operations?

I would like to know the reason. Thanks.

ytchen0323 commented 7 years ago

I think we provide both ADAM and a schema for the FASTQ format. The reason to use ADAM is that it provides cluster-level tools for sorting, duplicate marking, indel realignment, base recalibration, and so on. That is what we need after CS-BWAMEM (alignment) in the data pre-processing step.

In addition, I believe ADAM also leverages Parquet.

I am not sure about your use case after alignment. Could you describe it?

xubo245 commented 7 years ago

ADAM also leverages Parquet and Avro. ADAM provides different formats for FASTA, FASTQ, SAM, and so on.

If CS-BWAMEM supported the ADAM format before alignment, we could use ADAM to analyze the FASTQ data. It would also avoid a format conversion after CS-BWAMEM (alignment).
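
To make the idea of structured FASTQ records before alignment concrete, here is a minimal sketch in plain Python. It is not ADAM's or CS-BWAMEM's actual schema or API: the `FastqRecord` type, its field names, and the `parse_fastq` helper are hypothetical, and the parser assumes the common four-lines-per-read FASTQ layout (no wrapped sequence lines).

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    """Hypothetical record layout; field names are illustrative only."""
    name: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per read, assuming four lines per read."""
    it = (line.rstrip("\r\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(it, None)
        separator = next(it, None)
        quality = next(it, None)
        if sequence is None or separator is None or quality is None:
            raise ValueError("truncated FASTQ record: " + header)
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch: " + header)
        yield FastqRecord(header[1:], sequence, quality)


if __name__ == "__main__":
    sample = ["@read1\n", "ACGT\n", "+\n", "IIII\n"]
    for record in parse_fastq(sample):
        print(record.name, record.sequence, record.quality)
```

Records in this shape could then be written to a columnar store or handed to downstream tools, which is the kind of pre-alignment analysis described above; how that maps onto ADAM's real record types is left open here.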

xubo245 commented 7 years ago

I want to set up a system that includes read mapping, variant analysis, and simple disease analysis, based on Spark, Alluxio, and HDFS.

ytchen0323 commented 7 years ago

We wrote some glue logic to leverage ADAM after alignment. You can use "merge" or "sort" on the command line to write the output in ADAM format.

The only problem is that we do not support the latest ADAM version. But if you use the ADAM version specified in the .pom file, it should work.

xubo245 commented 7 years ago

I want to leverage ADAM before alignment ... (I used merge before; if there are no subfiles, the merge time can be avoided, but I don't know the reason.)

Yes, I changed it to adam-0.18.2, but I have not tried adam-0.21.0.

Thank you for replying.