Issue assembling plant genome with NECAT

LeoVincenzi commented 1 year ago

Hi, I'm working on a plant genome and I'm trying to assemble it with NECAT, but the final assembly I obtain is really inconsistent. The expected genome size is 1.2 Gbp and I'm working with Oxford Nanopore reads. The starting data for the assembly are reported in the following table:

The obtained results are the following:

The command I ran was /opt/NECAT/Linux-amd64/bin/necat.pl assemble config.txt and the config file was set up as follows:

PROJECT=Plant_genome
ONT_READ_LIST=read_list.txt
GENOME_SIZE=1200000000
THREADS=15
MIN_READ_LENGTH=3000
PREP_OUTPUT_COVERAGE=28
OVLP_FAST_OPTIONS=-n 500 -z 20 -b 2000 -e 0.5 -j 0 -u 1 -a 1000
OVLP_SENSITIVE_OPTIONS=-n 500 -z 10 -e 0.5 -j 0 -u 1 -a 1000
CNS_FAST_OPTIONS=-a 2000 -x 4 -y 12 -l 1000 -e 0.5 -p 0.8 -u 0
CNS_SENSITIVE_OPTIONS=-a 2000 -x 4 -y 12 -l 1000 -e 0.5 -p 0.8 -u 0
TRIM_OVLP_OPTIONS=-n 100 -z 10 -b 2000 -e 0.5 -j 1 -u 1 -a 400
ASM_OVLP_OPTIONS=-n 100 -z 10 -b 2000 -e 0.5 -j 1 -u 0 -a 400
NUM_ITER=1
CNS_OUTPUT_COVERAGE=28
CLEANUP=1
USE_GRID=true
GRID_NODE=8
GRID_OPTIONS=
SMALL_MEMORY=0
FSA_OL_FILTER_OPTIONS=
FSA_ASSEMBLE_OPTIONS=
FSA_CTG_BRIDGE_OPTIONS=
POLISH_CONTIGS=true

I would like to understand why the resulting assembly is so poor and how I can improve it. Could the parameters used for this dataset be inadequate?

lemene commented 1 year ago

Hi, 28X is slightly less than the coverage of Nanopore reads expected by NECAT (>=40X). This affects the integrity of the assembly. Using the following parameter may improve the assembly:

FSA_OL_FILTER_OPTIONS=--min_coverage 2

The value 2 can be replaced by 1 or 3.
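For a sense of scale on the coverage point, here is a rough back-of-the-envelope sketch. It takes GENOME_SIZE=1200000000 from the config in the first post and treats coverage simply as total read bases divided by genome size; the function name is made up for illustration and is not part of NECAT.

```python
# Rough estimate only: coverage = total read bases / genome size.
GENOME_SIZE = 1_200_000_000  # GENOME_SIZE from the posted config


def bases_for_coverage(genome_size: int, coverage: int) -> int:
    """Total read bases needed to reach the given depth (hypothetical helper)."""
    return genome_size * coverage


current = bases_for_coverage(GENOME_SIZE, 28)  # depth used in this run
target = bases_for_coverage(GENOME_SIZE, 40)   # depth suggested above

print(f"28X is about {current / 1e9:.1f} Gbp of reads")
print(f"40X is about {target / 1e9:.1f} Gbp of reads")
print(f"difference is about {(target - current) / 1e9:.1f} Gbp")
```

Under these assumptions, going from 28X to 40X on a 1.2 Gbp genome means roughly 14 Gbp of additional read data.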

The folders 4-fsa, 5-align_contigs and 6-bridge_contigs need to be renamed or deleted before running the command necat.pl bridge cfgfile. This will skip the error correction step and reassemble the corrected reads.
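As a minimal sketch of the renaming step, the helper below moves the three folders aside instead of deleting them, so the earlier results are kept. It assumes the NECAT output folder is named after the PROJECT value (Plant_genome) and that the folders sit directly inside it; the function name and the ".old" suffix are made up for illustration.

```python
from pathlib import Path

# Folder names as given in the comment above.
STAGE_DIRS = ["4-fsa", "5-align_contigs", "6-bridge_contigs"]


def set_aside_stage_dirs(project_dir, suffix=".old"):
    """Rename existing assembly-stage folders and return the new names.

    Hypothetical helper, not part of NECAT. Folders that do not exist
    are skipped; an existing backup is never overwritten.
    """
    renamed = []
    for name in STAGE_DIRS:
        src = Path(project_dir) / name
        if not src.is_dir():
            continue
        dst = src.with_name(name + suffix)
        if dst.exists():
            raise FileExistsError(f"{dst} already exists; choose another suffix")
        src.rename(dst)
        renamed.append(dst.name)
    return renamed


if __name__ == "__main__":
    # Assumption: the project folder is named after PROJECT in the config.
    print(set_aside_stage_dirs("Plant_genome"))
```

After the folders are out of the way, the bridge command would be run from the shell with the same config file as before, for example necat.pl bridge config.txt.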

LeoVincenzi commented 1 year ago

Hi, thanks to your suggestion, we ended up with an assembly of the desired size and with a high N50 value. I've also noticed that the 'bridge' step improves contiguity, doubling the N50 that we could get from the 'assemble' step. I would also like to ask how the parameter FSA_OL_FILTER_OPTIONS affects the assembly: I suppose it is involved in handling the overlapping regions, but if we start from a high coverage (40x), why should we consider a minimum coverage with such a low value (1, 2, 3, ...)?

lemene commented 1 year ago

Hi @LeoVincenzi, the assembler calculates the coverage of each read. If the coverage is less than the threshold min_coverage, the read and its related overlaps are filtered out. The assembler can calculate a value for the threshold automatically, but sometimes that value is not appropriate. In our experience, min_coverage = 3 is not a bad choice.
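To make the filtering rule concrete, here is a toy sketch. It is not NECAT's implementation: the per-read coverage values are simply given as input, the read names and numbers are invented, and an overlap is dropped whenever either of its two reads is dropped.

```python
def filter_by_min_coverage(read_coverage, overlaps, min_coverage):
    """Toy version of the rule described above (not NECAT code).

    read_coverage: dict mapping read name -> coverage of that read
    overlaps:      list of (read_a, read_b) pairs
    Returns the reads that pass and the overlaps between passing reads.
    """
    kept_reads = {r for r, cov in read_coverage.items() if cov >= min_coverage}
    kept_overlaps = [
        (a, b) for a, b in overlaps if a in kept_reads and b in kept_reads
    ]
    return kept_reads, kept_overlaps


# Invented example data: two well-supported reads, two weakly supported ones.
read_coverage = {"r1": 12, "r2": 2, "r3": 1, "r4": 9}
overlaps = [("r1", "r2"), ("r1", "r4"), ("r2", "r3"), ("r3", "r4")]

for threshold in (1, 2, 3):
    reads, ovl = filter_by_min_coverage(read_coverage, overlaps, threshold)
    print(threshold, sorted(reads), ovl)
```

In this toy data, each step from 1 to 3 removes one more weakly supported read together with its overlaps, while the well-supported reads are untouched. Read this way, the threshold works as a floor for poorly supported reads rather than as a target sequencing depth, which may be why small values are used even when the overall coverage is high.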

lemene commented 1 year ago

Some raw reads are broken into multiple corrected reads in the error correction step. The unbroken raw reads are used to bridge the contigs, so the assembler can output contigs with a longer N50.

xiaochuanle / NECAT

Issue assembling plant genome with NECAT #47