zheminzhou / PEPPAN

Phylogeny Enhanced Prediction of PAN-genome
https://doi.org/10.1101/2020.01.03.894154
GNU General Public License v3.0

Relationship between Example .gff and .matrix files? #3

Open cizydorczyk opened 4 years ago

cizydorczyk commented 4 years ago

Hello,

Let me start off by saying the tool looks great! I've had a major problem dealing with incomplete/mis-annotated/truncated genes in pangenome analyses, and no other tools have been designed to deal with such issues -- I think this is a great advantage of PEPPA!

Before trying the tool out on my own dataset, I tried running the provided dataset. It runs just fine, but perhaps I am misunderstanding something about the output. The .gff file produced by PEPPA.py contains thousands of entries, but the .matrix file produced by PEPPA_parser.py contains only slightly more than 200 genes, and many ortholog groups noted in the .gff file are absent from the .matrix file.

Is this because this is a reduced/sample dataset designed to run quickly? The pangenome is reported as 223 genes, with a core genome of 31 genes and an average of 88 genes per genome. In a full analysis, all genes identified in the .gff would be included in the .matrix file, provided they pass pseudogene filtering etc., would they not?

Thank you, Conrad