zx0223winner / HSDFinder

a tool to predict highly similar duplicates (HSDs) in eukaryotes
MIT License

How to visualize HSDs data in line graph? How to find common HSDs among species? Shall I rerun the HSD finder with new parameters? #8

Open zx0223winner opened 1 year ago

zx0223winner commented 1 year ago

My question is: how do I visualize the HSDFinder outputs as you show in step 4 of the manual?

The example in step 4 was made with Microsoft Office Excel, but you are free to explore other types of graphs (bar, line, pie) using an R script. To create that line graph, you do need HSDFinder results at several different thresholds, as I suggested, for each species or at least for one species.
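If you would rather script the plot than use Excel, here is a minimal Python sketch of the idea: count the HSDs at each threshold and draw one line per species. It assumes that each non-empty line of an HSDFinder output file is one HSD, and that you have kept one output file per threshold. The file names, the threshold labels and the function names are hypothetical, and matplotlib is only one possible plotting library.

```python
def count_hsds(lines):
    """Count HSDs, assuming one HSD per non-empty line (an assumption)."""
    return sum(1 for line in lines if line.strip())


def load_counts(files_by_threshold):
    """Map threshold label -> HSD count.

    `files_by_threshold` is a dict such as
    {"90_10": "fishA.90_10.txt", "80_20": "fishA.80_20.txt"};
    these labels and paths are hypothetical examples.
    """
    counts = {}
    for label, path in files_by_threshold.items():
        with open(path) as handle:
            counts[label] = count_hsds(handle)
    return counts


def plot_counts(counts_per_species, out_png="hsd_counts.png"):
    """Draw one line per species: threshold on x, number of HSDs on y."""
    # Imported here so the counting helpers work without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # write to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for species, counts in counts_per_species.items():
        labels = list(counts)
        values = [counts[label] for label in labels]
        ax.plot(labels, values, marker="o", label=species)
    ax.set_xlabel("Threshold")
    ax.set_ylabel("Number of HSDs")
    ax.legend()
    fig.savefig(out_png, dpi=150)
```

To use it, build one dictionary per species with `load_counts(...)`, collect them as `{"fishA": counts_a, "fishB": counts_b}`, and pass that to `plot_counts(...)`.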

How do I find the common HSDs present in all 15 genomes, so that I can proceed with selection analysis?

Selection analysis such as dN/dS is beyond the scope of the current HSDFinder, so I cannot help with that. I would recommend reading about PAML (http://abacus.gene.ucl.ac.uk/software/paml.html) for that analysis. To find common HSDs, my suggestion is to open the heatmap tabular output file (.tsv) and sort it by the 2nd column; you can then find the HSDs that share the same KO pathway number across the different fish genomes. A custom downstream Python script (HSDicipher) will be made available to process that .tsv file, and I will post the link here when it is ready.
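Until that script is available, the sorting step can be approximated with a short Python sketch. It groups the rows of the .tsv by KO number and keeps the KO numbers seen in every species. The KO number is taken from the 2nd column, as described above; which column holds the species label is an assumption you have to supply yourself, and the function names are hypothetical. If your file has a header row, drop it first (for example `rows[1:]`).

```python
import csv
from collections import defaultdict


def read_tsv(path):
    """Read a tab-separated file into a list of rows (lists of strings)."""
    with open(path, newline="") as handle:
        return list(csv.reader(handle, delimiter="\t"))


def common_kos(rows, species_col, ko_col=1):
    """Return the KO numbers that occur in every species found in `rows`.

    ko_col=1 is the 2nd column; species_col depends on your file layout
    and is an assumption the caller must provide.
    """
    species_by_ko = defaultdict(set)
    all_species = set()
    for row in rows:
        if len(row) <= max(species_col, ko_col):
            continue  # skip short or malformed rows
        species = row[species_col].strip()
        ko = row[ko_col].strip()
        if not species or not ko:
            continue  # skip rows without a species label or KO number
        all_species.add(species)
        species_by_ko[ko].add(species)
    return sorted(ko for ko, seen in species_by_ko.items() if seen == all_species)
```

For example, `common_kos(read_tsv("heatmap.tsv"), species_col=0)` would list the KO numbers shared by all species in a hypothetical `heatmap.tsv` whose first column is the species label. The HSDs carrying those KO numbers are then the candidates for downstream selection analysis.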

I'm very afraid that I will get even more HSDs from the 15 genomes if I rerun with new parameters.

That’s correct, when you relax your threshold, you will have more HSDs.

Anyway, I ran with the default parameters as per the instructions on the GitHub page. Do you think I should rerun HSDFinder with new parameters?

You can stick with the current criteria (90_10) or re-run with more thresholds to collect more HSDs, as I suggested in https://github.com/zx0223winner/HSDFinder/issues/7. It all depends on the goal of your project, the time you want to put into the analysis, and the expectations of the journal you want to publish in. Are you trying to find as many duplicates as possible, or just get a taste of those that are nearly identical in your species? If you are worried about missing important duplicate genes, my suggestion is to try the different thresholds I mentioned, because with 90_10 alone you risk missing other duplicates. If you follow my last email, you can acquire a more complete HSD list for each species.

~Xi