Relationship between Example .gff and .matrix files?

Hello,

Let me start off by saying the tool looks great! I've had a major problem dealing with incomplete/mis-annotated/truncated genes in pangenome analyses, and no other tools have been designed to deal with such issues -- I think this is a great advantage of PEPPA!

Before trying the tool on my own dataset, I ran the provided example dataset. It runs just fine, but perhaps I am misunderstanding something about the output. The .gff file produced by PEPPA.py contains thousands of entries, but the .matrix file produced by PEPPA_parser.py contains only a little over 200 genes, and many ortholog groups noted in the .gff file are absent from the .matrix file.

Is this because the example is a reduced/sample dataset designed to run quickly? The pangenome is reported as 223 genes, with a core genome of 31 genes and an average of 88 genes per genome. In a full analysis, all genes identified in the .gff would be included in the .matrix file, provided they pass pseudogene filtering etc., would they not?

Thank you, Conrad

zheminzhou / PEPPAN