zx0223winner / HSDFinder

a tool to predict highly similar duplicates (HSDs) in eukaryotes
MIT License

What criteria were used to collect the HSDs in HSDatabase? #7

Open zx0223winner opened 1 year ago

zx0223winner commented 1 year ago

Although there is no golden rule to distinguish partial duplicates from more complete ones, candidate HSDs tend to differ by less than 50% in amino acid length and to share conserved domains with similar function.

To balance HSD detection sensitivity and accuracy, we improved duplicate gene detection and reduced the “snowball effect” by using a series of combo thresholds, from 90%_10aa to 90%_100aa through 50%_10aa to 50%_100aa (X%_Yaa means X% amino acid identity within a Y aa length difference). The combo threshold was built from a series of thresholds: E + (D + (C + (B + A))).

A = 90%_100aa+(90%_70aa+(90%_50aa+(90%_30aa+90%_10aa)))
B = 80%_100aa+(80%_70aa+(80%_50aa+(80%_30aa+80%_10aa)))
C = 70%_100aa+(70%_70aa+(70%_50aa+(70%_30aa+70%_10aa)))
D = 60%_100aa+(60%_70aa+(60%_50aa+(60%_30aa+60%_10aa)))
E = 50%_100aa+(50%_70aa+(50%_50aa+(50%_30aa+50%_10aa)))
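The nesting above reads from the innermost term outward, so the strictest threshold (90%_10aa) comes first and the most relaxed (50%_100aa) comes last. As a minimal sketch of that ordering (the names `IDENTITIES`, `LENGTH_DIFFS` and `combo_order` are made up for illustration and are not part of HSDFinder):

```python
from itertools import product

# Identity cut-offs for groups A..E and length-difference cut-offs,
# both listed strictest first (assumed reading of the nested notation).
IDENTITIES = [90, 80, 70, 60, 50]
LENGTH_DIFFS = [10, 30, 50, 70, 100]


def combo_order():
    """Return the 25 thresholds in the order they are layered on."""
    return [f"{ident}%_{diff}aa" for ident, diff in product(IDENTITIES, LENGTH_DIFFS)]


order = combo_order()
print(order[:5])   # group A, innermost term first
print(len(order))  # total number of thresholds
```

Group A fills the first five positions, then B, C, D and E follow in the same inner-to-outer pattern.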

zx0223winner commented 1 year ago

A combination of thresholds was used to acquire a larger dataset of HSD candidates. Hits from an all-against-all protein sequence similarity search using BLASTP (E-value cutoff of ≤1e-10) were filtered to those within a given amino acid length difference and above a given pairwise amino acid identity. HSD candidates were added one after another at different homology assessment metrics (i.e., HSDs identified at more relaxed thresholds were treated more strictly than those found using more conservative thresholds).
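To make the pairwise filter concrete, here is a rough sketch under stated assumptions: the BLASTP hits are already parsed into records, protein lengths are available in a dictionary, and the cut-offs are inclusive (`>=` identity, `<=` length difference). The `Hit` type, the `passes` function and the toy data are hypothetical; HSDFinder's own code may handle these details differently.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    pident: float   # percent pairwise identity reported by BLASTP
    evalue: float


def passes(hit, lengths, min_identity, max_len_diff, max_evalue=1e-10):
    """Return True if a hit satisfies one X%_Yaa threshold."""
    if hit.query == hit.subject:          # ignore self-hits
        return False
    len_diff = abs(lengths[hit.query] - lengths[hit.subject])
    return (hit.evalue <= max_evalue
            and hit.pident >= min_identity
            and len_diff <= max_len_diff)


# Toy data, invented for illustration only.
lengths = {"g1": 300, "g2": 305, "g3": 380}
hits = [
    Hit("g1", "g2", 93.0, 1e-50),
    Hit("g1", "g3", 91.0, 1e-40),
    Hit("g2", "g3", 55.0, 1e-12),
    Hit("g1", "g1", 100.0, 0.0),
]

kept_90_10 = [h for h in hits if passes(h, lengths, 90, 10)]
kept_50_100 = [h for h in hits if passes(h, lengths, 50, 100)]
print(len(kept_90_10), len(kept_50_100))
```

With this toy data the strict 90%_10aa threshold keeps only the g1/g2 pair, while 50%_100aa keeps all three non-self pairs.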

For example, HSDs identified at a threshold of 90%_30aa were added on to those identified at a threshold of 90%_10aa (denoted as “90%_30aa+90%_10aa”); any redundant HSD candidates picked out at this combo threshold were removed if the group from the more relaxed threshold (i.e., 90%_30aa) was identical to, or contained the same gene copies as, a group from the stricter cut-off (i.e., 90%_10aa).
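One possible reading of this step, sketched in code: an HSD from the relaxed threshold is dropped whenever it shares at least one gene copy with an HSD already kept from the stricter threshold. That overlap rule is an interpretation of the paragraph above, and `add_relaxed` is a hypothetical helper, not the script used for HSDatabase.

```python
def add_relaxed(strict_hsds, relaxed_hsds):
    """Layer relaxed-threshold HSDs on top of stricter ones.

    Each HSD is a set of gene IDs. A relaxed HSD is kept only if none
    of its genes already appears in a stricter HSD (assumed rule).
    """
    seen = set().union(*strict_hsds)      # every gene already assigned
    kept = list(strict_hsds)
    for hsd in relaxed_hsds:
        if seen.isdisjoint(hsd):
            kept.append(hsd)
    return kept


# Toy groups, invented for illustration only.
hsds_90_10 = [{"g1", "g2"}, {"g5", "g6"}]
hsds_90_30 = [{"g1", "g2"}, {"g1", "g2", "g3"}, {"g7", "g8"}]

combined = add_relaxed(hsds_90_10, hsds_90_30)
print(combined)
```

Here only the g7/g8 group is new at 90%_30aa; the other two repeat genes already captured at 90%_10aa and are discarded. The combined list then serves as the "stricter" input when the next threshold is layered on.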

Moreover, any HSD candidates pinpointed at the combo threshold (90%_30aa+90%_10aa) were removed if the minimum gene copy length was less than half of the maximum gene copy length for that HSD, or if the HSD candidate had gene copies with incomplete conserved domains (i.e., different numbers of Pfam domains). After filtering the combo threshold (90%_30aa+90%_10aa), we added on a more relaxed threshold, 90%_50aa (i.e., 90%_50aa+(90%_30aa+90%_10aa)), and then carried out the same HSD candidate removal/filtering process.
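The two removal rules in this paragraph can be written as a small predicate. This is only a sketch: `keep_hsd` is a hypothetical name, and the per-gene protein lengths and Pfam domain counts are assumed to be available as dictionaries.

```python
def keep_hsd(hsd, lengths, pfam_counts):
    """Return True if an HSD candidate survives both filters.

    hsd         -- iterable of gene IDs in one candidate group
    lengths     -- gene ID -> protein length in amino acids
    pfam_counts -- gene ID -> number of Pfam domains found
    """
    lens = [lengths[g] for g in hsd]
    # Rule 1: shortest copy must be at least half the longest copy.
    if min(lens) < max(lens) / 2:
        return False
    # Rule 2: every copy must carry the same number of Pfam domains.
    return len({pfam_counts[g] for g in hsd}) == 1


# Toy data, invented for illustration only.
lengths = {"a": 100, "b": 50, "c": 49, "d": 100}
pfam_counts = {"a": 2, "b": 2, "c": 2, "d": 1}

print(keep_hsd(["a", "b"], lengths, pfam_counts))  # lengths 100 vs 50, same domains
print(keep_hsd(["a", "c"], lengths, pfam_counts))  # 49 is below half of 100
print(keep_hsd(["a", "d"], lengths, pfam_counts))  # domain counts differ
```

Only the first candidate passes; the second fails the length rule and the third fails the domain rule.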

To minimize redundancy and to acquire a larger dataset of HSD candidates, we processed each selected species with the following combination of thresholds: E + (D + (C + (B + A))).

zx0223winner commented 1 year ago

At the same time, since you have already mastered the usage of HSDFinder, if you are interested in detecting more duplicates in your fish genomes, or are worried about missing any important duplicate genes, I would suggest you read the criteria we used to collect duplicates in HSDatabase: https://github.com/zx0223winner/HSDFinder/issues/7

To acquire more HSDs for each of your species, I will need you to re-run HSDFinder with different thresholds; right now you only have 90_10 for each of your species (e.g., Aven.hsd.species.txt). Here, 90_10 represents 90% amino acid identity within a 10 aa length difference. The complete set of 25 files for each of your species is:

90_10; 90_30; 90_50; 90_70; 90_100;
80_10; 80_30; 80_50; 80_70; 80_100;
70_10; 70_30; 70_50; 70_70; 70_100;
60_10; 60_30; 60_50; 60_70; 60_100;
50_10; 50_30; 50_50; 50_70; 50_100

You can do the batch work locally or run one threshold at a time online. In total, for your 14 species you will end up with 350 HSD files. Please name each file “species_name.number_number.txt” as shown below and place each set of 25 HSD files in its own species folder (14 folders in total); I have a custom script that can process all the files at once.

Arabidopsis_thaliana.50_100.txt Arabidopsis_thaliana.50_10.txt Arabidopsis_thaliana.50_30.txt Arabidopsis_thaliana.50_50.txt Arabidopsis_thaliana.50_70.txt
Arabidopsis_thaliana.60_100.txt Arabidopsis_thaliana.60_10.txt Arabidopsis_thaliana.60_30.txt Arabidopsis_thaliana.60_50.txt Arabidopsis_thaliana.60_70.txt
Arabidopsis_thaliana.70_100.txt Arabidopsis_thaliana.70_10.txt Arabidopsis_thaliana.70_30.txt Arabidopsis_thaliana.70_50.txt Arabidopsis_thaliana.70_70.txt
Arabidopsis_thaliana.80_100.txt Arabidopsis_thaliana.80_10.txt Arabidopsis_thaliana.80_30.txt Arabidopsis_thaliana.80_50.txt Arabidopsis_thaliana.80_70.txt
Arabidopsis_thaliana.90_100.txt Arabidopsis_thaliana.90_10.txt Arabidopsis_thaliana.90_30.txt Arabidopsis_thaliana.90_50.txt Arabidopsis_thaliana.90_70.txt
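Before handing the folders over, it may help to check that each species folder really holds all 25 files. The helper below is a small sketch for that purpose; `expected_names` and `missing_files` are hypothetical names, and the layout (one folder per species, files named as above) is taken from the naming convention described here.

```python
from pathlib import Path

IDENTITIES = [90, 80, 70, 60, 50]
LENGTH_DIFFS = [10, 30, 50, 70, 100]


def expected_names(species):
    """List the 25 file names expected for one species."""
    return [f"{species}.{ident}_{diff}.txt"
            for ident in IDENTITIES
            for diff in LENGTH_DIFFS]


def missing_files(folder, species):
    """Return expected file names that are not present in `folder`."""
    folder = Path(folder)
    return [name for name in expected_names(species)
            if not (folder / name).is_file()]


names = expected_names("Arabidopsis_thaliana")
print(len(names))    # files per species
print(names[0])      # first expected file name
```

Calling `missing_files("path/to/species_folder", "species_name")` on each of the 14 folders should return an empty list once all 350 files are in place.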