Open xubo245 opened 7 years ago
I think we provide both ADAM and a schema for the FASTQ format. The reason to use ADAM is that it provides tools for sorting, duplicate marking, indel realignment, base recalibration, and so on at the cluster level. That's what we need after CS-BWAMEM (alignment) in the data pre-processing step.
In addition, I believe ADAM also leverages Parquet.
I am not sure about your use case after alignment. Maybe you can provide your use case?
ADAM also leverages Parquet and Avro. ADAM provides different formats for FASTA, FASTQ, SAM, and so on.
If CS-BWAMEM supported the ADAM format before alignment, we could use ADAM to analyze the FASTQ data. It would also avoid a format conversion after CS-BWAMEM (alignment).
I want to set up a system that includes read mapping, variant analysis, and simple disease analysis, based on Spark, Alluxio, and HDFS.
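To make the idea of a schema for FASTQ concrete, here is a minimal Python sketch that turns the four-line FASTQ layout into flat records with named fields. The `FastqRecord` class and its field names are hypothetical, chosen only for illustration; this is not ADAM's actual Avro schema, and it is not how CS-BWAMEM reads its input.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class FastqRecord:
    """Hypothetical flat record for one read (not ADAM's real schema)."""
    name: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per 4-line FASTQ entry: @name, bases, +, qualities."""
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        try:
            sequence, separator, quality = next(it), next(it), next(it)
        except StopIteration:
            raise ValueError("truncated FASTQ record") from None
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(name=header[1:], sequence=sequence, quality=quality)


# Small in-memory example; a real file would be passed as an open file handle.
example = ["@read1\n", "ACGT\n", "+\n", "IIII\n"]
for record in parse_fastq(example):
    print(record.name, len(record.sequence))
```

Once reads are in a flat, typed form like this, writing them to a columnar format such as Parquet is a separate step handled by whatever library does the storage.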
We wrote some glue logic to leverage ADAM after alignment. You can use "merge" or "sort" on the command line to write the output in ADAM format.
The only problem is that we do not support the latest ADAM version. But if you use the ADAM version specified in the .pom file, it should work.
I want to leverage ADAM before alignment ... (I used merge before; if there are no subfiles, the merge time can be avoided. I don't know the reason.)
Yes, I changed it to adam-0.18.2, but I haven't tried adam-0.21.0.
Thank you for replying.
Question: why is the FASTQ data stored in ADAM format after upload?
Is it because the Parquet format takes less space and is more universal and friendlier for subsequent operations?
I would like to know the reason, please. Thanks.