template/
of this repository) containing shell scripts that can be used for semi-automated genome assembly, quality assessment, and manual curation with HiFi + Omni-C reads (or only HiFi reads).template/config.sh
) that stores species-specific parameters.There is no need for any "installation" procedure because this workflow is after all merely a collection of bash scripts (plus a simple control of the SLURM job manager). All you need to do is thus clone this repository:
git clone https://github.com/yoshihikosuzuki/ant-asm-workflow
and then copy the template/
directory in this repository to somewhere you wish to work with a specific sample:
# NOTE: Assuming <your-working-dir-name> does not exist yet
cp -r ant-asm-workflow/template <your-working-dir-name>
The copied directory is the place you run the workflow.
Below we start with an abstract description about the overall structure of the working directory. Then, we provide practical information and commands on how to work with it.
For each sample to be assembled, we propose the following directory structure to generate and compare different types of assemblies while dividing the tasks in genome assembly.
:information_source: IMPORTANT:
- Below, those surrounded by
<
>
are the names up to you or up to tools you used.- However, on the other hand, the structure of the directories and the names of the files MUST be exactly the same as described below except those surrounded by
<
>
. DO NOT change them unless you know how everything works.->
means a symbolic link (made with$ ln -s
).- Multiple directories (e.g.
<assembler-1>/
,<assembler-2>/
, ..., under01-asm/
) mean multiple assemblies with different assemblers or different parameters to be compared.
<your-working-dir-name>
├── 00-data/
│ ├── hifi/
│ │ └── hifi.fastq -> <path-to-your-hifi-reads>.fastq
│ │
│ └── omnic/
│ ├── omnic_R1_001.fastq -> <path-to-your-omnic-reads>_R1_001.fastq
│ └── omnic_R2_001.fastq -> <path-to-your-omnic-reads>_R2_001.fastq
│
├── 01-asm/
│ ├── <assembler-1>/
│ │ ├── hifi.fastq -> ../../00-data/hifi/hifi.fastq
│ │ ...
│ │
│ ├── <assembler-2>/
| ...
│
├── 10-contigs/
│ ├── <assembly-name-1>/
│ │ ├── contigs.fasta -> ../../01-asm/<assembler-N>/<contig-fasta-file>
│ │ ...
│ │
│ ├── <assembly-name-2>/
| ...
│
├── 11-scaf/
│ ├── <scaf-tool-1>/
│ │ ├── omnic_R1_001.fastq -> ../../00-data/omnic/omnic_R1_001.fastq
│ │ ├── omnic_R2_001.fastq -> ../../00-data/omnic/omnic_R2_001.fastq
│ │ ├── contigs.fasta* -> ../../10-contigs/<assembly-name-N>/contigs.fasta*
│ │ ...
│ │
│ ├── <scaf-tool-2>/
| ...
│
│── 20-scaffolds/
│ ├── <scaf-name-1>/
│ │ ├── scaffolds.fasta -> ../../11-scaf/<scaf-tool-N>/<scaf-fasta-file>
│ │ ...
│ │
│ ├── <scaf-name-2>/
│ ...
│
└── 21-curation/
├── <curation-name-1>/
│ ├── scaffolds.fasta -> ../../11-scaf/<scaf-tool-N>/output_for_curation/scaffolds.fasta
│ ├── merged_nodups.txt -> ../../11-scaf/<scaf-tool-N>/output_for_curation/merged_nodups.txt
│ ├── scaffolds.review.assembly
│ ├── ...
│
├── <curation-name-2>/
....
A short description about the role of each directory is as follows:
00-data/
: Storage and QC of readshifi.fastq
for HiFi; andomnic_R1_001.fastq
and omnic_R2_001.fastq
for Omni-C.01-asm/
: Contig assembly00-data/
; or10-contigs/
: Quality assessment of contigs01-asm/
, and11-scaf/
: Scaffolding00-data/
and a smylink to a draft contig assembly at 10-contigs/
.20-scaffolds/
: Quality assessment of scaffolds11-scaf/
,21-curation/
: Manual curationscaffolds.review.assembly
). Other required files are also symlinked.To automatically generate the directory structure above and to provide scripts for the tools we adopt, we have a ready-made template directory named template/
in this GitHub repository. The contents in the template directory (except symlinks of data files and marginal scripts for removing intermediate files) are as follows:
template/
|-- config.sh
|-- 00-data
| |-- hifi
| | |-- run_all.sh
| | |-- run_fastk.sh
| | `-- run_genescope.sh
| `-- omnic
| |-- run_all.sh
| |-- run_fastk.sh
| `-- run_genescope.sh
|-- 01-asm
| |-- hicanu
| | `-- run_hicanu.sh
| |-- hifiasm
| | `-- run_hifiasm.sh
| |-- hifiasm-phased
| | `-- run_hifiasm.sh
| |-- ipa
| | `-- run_ipa.sh
| |-- peregrine
| | `-- run_peregrine.sh
| `-- template-purge-dups
| |-- run_purge_dups.sh
| `-- run_purge_dups_plot.sh
|-- 10-contigs
| |-- template
| | |-- 00-make_index.sh
| | |-- 01-busco
| | | `-- run_busco.sh
| | |-- 02-merqury
| | | `-- run_merquryfk.sh
| | |-- 02-merquryfk
| | | `-- run_merquryfk.sh
| | |-- 04-winnowmap
| | | `-- run_winnowmap.sh
| | |-- 05-deepvariant
| | | `-- run_deepvariant.sh
| | |-- 06-mapqv
| | | `-- run_mapqv.sh
| | |-- 07-asset
| | | `-- run_asset.sh
| | |-- output_for_curation
| | |-- output_for_stats
| | `-- run_all.sh
| `-- template-phased
| |-- 00-make_index.sh
| |-- 01-busco-all
| | `-- run_busco.sh
| |-- 01-busco-hap1
| | `-- run_busco.sh
| |-- 01-busco-hap2
| | `-- run_busco.sh
| |-- 02-merqury
| | `-- run_merqury.sh
| |-- 02-merquryfk
| | `-- run_merquryfk.sh
| |-- 04-winnowmap
| | `-- run_winnowmap.sh
| |-- 05-deepvariant
| | `-- run_deepvariant.sh
| |-- 06-mapqv
| | `-- run_mapqv.sh
| |-- 07-asset
| | `-- run_asset.sh
| |-- 10-mummer
| | `-- run_mummer.sh
| |-- output_for_stats
| `-- run_all.sh
|-- 11-scaf
| |-- template-3ddna
| | |-- output
| | |-- output_for_curation
| | `-- run_3ddna.sh
| |-- template-omnic-partition
| | `-- run_omnic_partition.sh
| `-- template-salsa
| |-- output
| |-- output_for_curation
| `-- run_salsa.sh
|-- 20-scaffolds
| |-- template
| | |-- 00-make_index.sh
| | |-- 01-busco
| | | `-- run_busco.sh
| | |-- 02-merqury
| | | `-- run_merqury.sh
| | |-- 02-merquryfk
| | | `-- run_merquryfk.sh
| | |-- 04-winnowmap
| | | `-- run_winnowmap.sh
| | |-- 05-deepvariant
| | | `-- run_deepvariant.sh
| | |-- 06-mapqv
| | | `-- run_mapqv.sh
| | |-- 07-asset
| | | `-- run_asset.sh
| | |-- 09-telomere
| | | `-- run_make_telomere_bed.sh
| | |-- output_for_curation
| | |-- output_for_stats
| | `-- run_all.sh
| `-- template-phased
| |-- 00-make_index.sh
| |-- 01-busco-all
| | `-- run_busco.sh
| |-- 01-busco-hap1
| | `-- run_busco.sh
| |-- 01-busco-hap2
| | `-- run_busco.sh
| |-- 02-merqury
| | `-- run_merqury.sh
| |-- 02-merquryfk
| | `-- run_merquryfk.sh
| |-- 04-winnowmap
| | `-- run_winnowmap.sh
| |-- 05-deepvariant
| | `-- run_deepvariant.sh
| |-- 06-mapqv
| | `-- run_mapqv.sh
| |-- 07-asset
| | `-- run_asset.sh
| |-- 09-telomere
| | `-- run_make_telomere_bed.sh
| |-- 10-mummer
| | `-- run_mummer.sh
| |-- output_for_curation
| |-- output_for_stats
| `-- run_all.sh
`-- 21-curation
`-- template
`-- make_curated_scaffolds.sh
The workflow is NOT (and will never be) fully automatic for many reasons. That is, you need to run the following commands step-by-step, although most of the steps can be easily executed using ready-made shell scripts.
:information_source: IMPORTANT:
In the commnds below, you need to replace each name surrounded by
<
>
(e.g. path to an input data file, name of an assembly) with something up to you. On the other hand, however, you must NOT change the names of the other files/directories.:memo: NOTE:
&&
at the end of a line below means "execute the next command if the previous command finishes successfully." You do not have to add it if you are running each line interactively in a shell.
# Assuming <your-working-dir-name> does not exist yet
cp -r <path-to>/ant-asm-workflow/template/ <your-working-dir-name>
cd <your-working-dir-name>
cd 00-data/ &&
cd hifi/ &&
ln -sf <path-to-your-hifi-reads>.fastq ./hifi.fastq &&
cd ..
cd omnic/ &&
ln -sf <path-to-your-omnic-reads>_R1_001.fastq ./omnic_R1_001.fastq &&
ln -sf <path-to-your-omnic-reads>_R2_001.fastq ./omnic_R2_001.fastq &&
cd ..
cd ..
:memo: NOTE:
You can update the scripts in your working directory while keeping the other files as they are (which is useful when the workflow itself is updated) by:
# NOTE: <your-working-dir-name> needs to exist rsync -auv <path-to>/ant-asm-workflow/template/ <your-working-dir-name>/ # For example, to update scripts in `10-contigs/<assembly-name>`: cd <your-working-dir-name>/10-contigs/ rsync -auv template/ <assembly-name>/
:warning: WARNING: Do NOT forget
/
aftertemplate
in the command above.
Some species-dependent (as well as computer-environment-specific) settings are written in a single file, <your-working-dir-name>/config.sh
, where the default values are for ants.
If you wish to apply this workflow to species other than ants, you need to change them, especially BUSCO's lineage (hymenoptera_odb10
by default) and telomeric motif sequence (TTAGG
by default whereas human is TTAGGG
), as necessary. The most significant parameters are as follows:
Variable Name | Default Value | Description |
---|---|---|
AUTO_DEL |
true |
If true , intermediate files are deleted after each job |
HIC_ENZYME_NAME |
Parameter for 3D-DNA. Empty for Omni-C | |
BUSCO_DB |
hymenoptera_odb10 |
BUSCO's lineage database name |
TELOMERE_MOTIF |
TTAGG |
For finding telomeres from scaffolds |
cd 00-data/ &&
cd hifi/ &&
sbatch run_all.sh && # Will submit several child jobs using SLURM
cd ..
cd omnic/ &&
sbatch run_all.sh &&
cd ..
cd ..
After submitting these jobs, you can immediately proceed to the next step 2 (i.e. You do not have to wait for them to finish).
After these jobs (and their child jobs) finish, check the GeneScope plot and the Smudgeplot for each of HiFi and Omni-C:
:information_source: IMPORTANT output files in this step for QC:
00-data/hifi/
hifi.genescope/
: GeneScope for HiFi reads
summary.txt
: Fitting resulttransformed_linear_plot.png
: k-mer (k=40 by default) count histogram with the fittinghifi.genomescope_L*_U*_smudgeplot.png
: Smudgeplot for HiFi reads00-data/omnic/
omnic.genescope/
: GeneScope for Omni-C reads
summary.txt
: Fitting resulttransformed_linear_plot.png
: k-mer (k=21 by default) count histogram with the fittingomnic.genomescope_L*_U*_smudgeplot.png
: Smudgeplot for Omni-C reads
:warning: WARNING about purge_dups:
You need to run a purge_dups task (not in the directory you ran hifiasm/HiCanu/etc. but) in a separated, new subdirectory right under
01-asm/
, as described just below.
cd 01-asm/ &&
# NOTE: <assembler> = hifiasm, hicanu, etc. If you wish to try different parameters with the same assembler, then make a new directory.
cd <assembler>/ &&
sbatch run_<assembler>.sh &&
cd ..
# To run purge_dups, do the followings after the assembly finishes
cp -r template-purge-dups/ <assembler>-pd &&
cd <assembler>-pd &&
# NOTE: <contig-fasta-file> depends on the assembler
ln -sf ../<assembler>/<contig-fasta-file> ./contigs.fasta &&
sbatch run_purge_dups_plot.sh &&
# After run_purge_dups_plot.sh finishes, here write the values of `L`, `M`, `U` in run_purge_dups.sh based on `PB.hist.plot`
sbatch run_purge_dups.sh &&
cd ..
cd ..
The hifiasm-phased
directory is for haplotype-phased assembly, and the others are for haplotype-merged assembly. For a phased assembly, you need to run purge_dups for each of the two haplotype-specific contigs independently.
The assembly stats such as contig N50 length is written at the bottom of the log file, e.g. hifiasm/hifiasm.log
.
:warning: WARNING:
Only a single assembly to be assessed must be placed under each subdirectory. That is, for example, if you wish to evaluate all of the i) primary contigs, ii) purged primary contigs, and iii) primary unitigs generated by the same assembler, then you need to make a distinct directory for each of them.
For a haplotype-merged assembly:
cd 10-contigs/ &&
# NOTE: <assembly-name> = an arbitrary name representing an assembly generated in 01-asm/ (i.e. hifiasm, hifiasm-p_ctg, hicanu-pd, etc.)
cp -r templete/ <assembly-name> &&
cd <assembly-name> &&
# NOTE: <contig-fasta-file> = "purged.fa" for those after purge_dups; otherwise, something specific to an assembler
ln -sf ../../01-asm/<assembly-dir>/<contig-fasta-file> ./contigs.fasta &&
sbatch run_all.sh &&
cd ..
cd ..
For a haplotype-phased assembly:
cd 10-contigs/ &&
cp -r templete-phased/ <assembly-name> &&
cd <assembly-name> &&
ln -sf ../../01-asm/<assembly-dir>/<contig-hap1-fasta-file> ./contigs.hap1.fasta &&
ln -sf ../../01-asm/<assembly-dir>/<contig-hap2-fasta-file> ./contigs.hap2.fasta &&
sbatch run_all.sh &&
cd ..
cd ..
For each contig assembly, after run_all.sh
finishes, check i) assembly stats such as contig N50 length, ii) BUSCO score, iii) Merqury QV, iv) mapping QV, and v) reliable block N50:
:information_source: IMPORTANT output files in this step for quality metrics (Use
$ cat
to see the results):Haplotype-merged assembly:
10-contigs/<assembly-name>/output_for_stats/
busco_score
: BUSCO resultmerqury_qv
: Merqury's K-mer based QVmapping_qv
: Mapping based QVreliable_block_n50
: Reliable block N50 lengthHaplotype-phased assembly:
10-contigs/<assembly-name>/output_for_stats/
busco_score_hap1
: BUSCO result for haplotype 1 contigsbusco_score_hap2
: BUSCO result for haplotype 2 contigsbusco_score_all
: BUSCO result for haplotype 1 + 2 contigsmerqury_qv
: Merqury's K-mer based QVmapping_qv
: Mapping based QVreliable_block_n50
: Reliable block N50 length
:warning: WARNING:
A single subdirectory must be only for a single scaffolding task. That is, you must NOT make nested directories like
hifiasm/salsa/
norhifiasm/3ddna
. Instead, you need to makehifiasm-salsa
andhifiasm-3ddna
right under11-scaf/
. Likewise, if you wish to try different parameters with the same scaffolding tool, then make a new directory.
For a haplotype-merged assembly:
cd 11-scaf/ &&
# NOTE: <scaf-tool> = {3ddna, salsa}, <assembly-name> = a directory name in 10-contigs/.
cp -r template-<scaf-tool>/ <assembly-name>-<scaf-tool> &&
cd <assembly-name>-<scaf-tool> &&
ln -sf ../../10-contigs/<assembly-name>/contigs.fasta* . && # WARNING: Do not forget * !!!
sbatch run_<scaf-tool>.sh &&
cd ..
cd ..
For a haplotype-phased assembly, you need to first split the Omni-C reads into two sets of haplotype-specific reads, which can be used for both SALSA and 3D-DNA:
cd 11-scaf/ &&
cp -r template-omnic-partition/ <assembly-name>-omnic-partition &&
cd <assembly-name>-omnic-partition &&
ln -sf ../../10-contigs/<assembly-name>/contigs.* . && # NOTE: contigs.hap* must exist
sbatch run_partition.sh &&
cd ..
cd ..
and then do scaffolding for each haplotype just like the haplotype-merged assembly using the partitioned Omni-C reads:
cp -r template-<scaf-tool>/ <assembly-name>-<scaf-tool> &&
cd <assembly-name>-<scaf-tool> &&
ln -sf ../../10-contigs/<assembly-name>/contigs.hap<N>.fasta* . && # WARNING: Do not forget * !!!
ln -sf ../<assembly-name>-omnic-partition/omnic_R1_001.hap<N>.fastq omnic_R1_001.fastq &&
ln -sf ../<assembly-name>-omnic-partition/omnic_R2_001.hap<N>.fastq omnic_R2_001.fastq &&
sbatch run_<scaf-tool>.sh &&
cd ..
:information_source: IMPORTANT output files in this step:
For scaffold evaluation (in Step 6):
11-scaf/<scaf-name>/output/
scaffolds_FINAL.fasta
: Final scaffoldsFor manual curation (in Step 7)::
11-scaf/<scaf-name>/output_for_curation/
scaffolds.fasta
: Final scaffoldsscaffolds.hic
: Omni-C contact matrix editable in Juiceboxscaffolds.assembly
: Index file for Juiceboxmerged_nodups.txt
: Raw contact information for curated scaffold generation
:warning: WARNING:
Only one scaffold FASTA per subdirectory, just like above.
For a haplotype-merged assembly:
cd 20-scaffolds/ &&
# NOTE: <scaf-name> = an arbitrary name representing an assembly generated in 11-scaf/ (i.e. hifiasm-salsa, hifiasm-pd-3ddna, etc.)
cp -r templete/ <scaf-name> &&
cd <scaf-name> &&
# NOTE: <scaf-fasta-file> depends on the scaffolding tool
ln -sf ../../11-scaf/<scaf-dir>/<scaf-fasta-file> ./scaffolds.fasta &&
sbatch run_all.sh &&
cd ..
cd ..
For a haplotype-phased assembly:
cd 20-scaffolds/ &&
cp -r templete/ <scaf-name> &&
cd <scaf-name> &&
ln -sf ../../11-scaf/<scaf-dir>/<scaf-hap1-fasta-file> ./scaffolds.hap1.fasta &&
ln -sf ../../11-scaf/<scaf-dir>/<scaf-hap2-fasta-file> ./scaffolds.hap2.fasta &&
sbatch run_all.sh &&
cd ..
cd ..
For each scaffold assembly, after run_all.sh
finishes, check i) assembly stats such as scaffold N50 length, ii) BUSCO score, iii) Merqury QV, iv) mapping QV, and v) reliable block N50:
:information_source: IMPORTANT output files in this step for quality metrics (Use
$ cat
to see the results):For haplotype-merged assembly:
20-scaffolds/<scaf-name>/
busco_score
: BUSCO resultmerqury_qv
: Merqury's K-mer based QVmapping_qv
: Mapping based QVreliable_block_n50
: Reliable block N50 lengthFor haplotype-phased assembly:
20-scaffolds/<scaf-name>/
busco_score_hap1
: BUSCO result for haplotype 1 scaffoldsbusco_score_hap2
: BUSCO result for haplotype 2 scaffoldsbusco_score_all
: BUSCO result for haplotype 1 + 2 scaffoldsmerqury_qv
: Merqury's K-mer based QVmapping_qv
: Mapping based QVreliable_block_n50
: Reliable block N50 length
Along with these qualty metrics, we generate several .bed files compatible (only) for Juicebox to help manual curation:
:information_source: IMPORTANT output files in this step for manual curation (in Step 7):
20-scaffolds/<scaf-name>/output_for_curation/
scaffolds[.hap<N>].gaps.JBAT.bed
: Locations of sequence gaps (=N
bases) between contigsscaffolds[.hap<N>].unreliable.JBAT.bed
: Locations of unreliable regions (where read coverage is very low)scaffolds[.hap<N>].telomere.filtered.JBAT.bed
: Locations of long tandem arrays of telomeric motifs
Input files:
11-scaf/<scaf-name>/output_for_curation/
scaffolds.hic
: Hi-C contact matrix (per haplotype if haplotype-phased assembly)scaffolds.assembly
: Hi-C index file (per haplotype if haplotype-phased assembly)20-scaffolds/<scaf-name>/output_for_curation/
scaffolds[.hap<N>].gaps.JBAT.bed
: Sequence gapsscaffolds[.hap<N>].unreliable.JBAT.bed
: Unreliable regionsscaffolds[.hap<N>].telomere.filtered.JBAT.bed
: Long telomere arraysSteps:
.hic
and .assembly
files in JBAT..bed
files of sequence gaps, unreliable regions, and tandem arrays as 1D tracks (via View
→ Show Annotation Panel
→ Add Local
).Enable straight edge
via Right mouse click..assembly
file and run the following commands:cd 21-curation/ &&
cp -r templete/ <curation-name> &&
cd <curation-name> &&
ln -sf ../../11-scaf/<scaf-dir>/output_for_curation/* . &&
mv <your-reviewed-assembly> scaffolds.review.assembly &&
sbatch run_curation.sh &&
cd ..
cd ..
This will generate scaffolds.FINAL.fasta
, which contains curated scaffolds.
Example of JBAT manipulation: