Closed: jacodela closed this issue 6 years ago
As mentioned in the README for the pan-genome-analysis repository: "files required for visualizing the pan-genome using pan-genome-visualization." Please see the explanation for each type of the output files there.
Instructions regarding the visualization application have been described in the README for the pan-genome-visualization repository: https://github.com/wdingx/pan-genome-visualization#send-your-own-data-to-the-local-server
The summary statistics for all gene clusters can be found in geneCluster.json. One can write a small script to extract the needed information. Alternatively, the clustering result file (./data/YourSpecies/allclusters_final.tsv) is easy to parse: each line represents one gene cluster, with its genes separated by tabs. The data for a Venn diagram can also be extracted from that file.
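As a rough illustration of parsing the clustering file, here is a minimal Python sketch. It assumes each gene ID has the form `strain|locus_tag`, so that the strain name is the part before the first `|`; that format is an assumption and should be checked against your own file. The function names are made up for this example and are not part of panX. Clusters are identified by their zero-based line index because the file itself carries no cluster names.

```python
import io
from collections import Counter, defaultdict


def read_clusters(handle, sep="|"):
    """Return one set of strain names per non-empty cluster line.

    Assumes tab-separated gene IDs shaped like 'strain|locus_tag'
    (an assumption; adjust `sep` if your IDs differ).
    """
    clusters = []
    for line in handle:
        genes = [g for g in line.rstrip("\n").split("\t") if g]
        if not genes:
            continue
        clusters.append({g.split(sep, 1)[0] for g in genes})
    return clusters


def sharing_histogram(clusters):
    """Map 'number of strains' -> 'number of clusters present in that many strains'."""
    return Counter(len(strains) for strains in clusters)


def clusters_per_strain(clusters):
    """Map strain -> set of cluster indices, usable as input for a Venn diagram."""
    per_strain = defaultdict(set)
    for idx, strains in enumerate(clusters):
        for strain in strains:
            per_strain[strain].add(idx)
    return dict(per_strain)


# Tiny made-up example standing in for allclusters_final.tsv.
demo = "A|g1\tB|g7\tC|g3\nA|g2\tB|g8\nC|g4\n"
clusters = read_clusters(io.StringIO(demo))
hist = sharing_histogram(clusters)
per_strain = clusters_per_strain(clusters)

# Clusters shared by strains A and C (set intersection = Venn overlap).
shared_a_c = per_strain["A"] & per_strain["C"]
```

To run this on real data, replace the `io.StringIO(demo)` handle with `open("./data/YourSpecies/allclusters_final.tsv")`. The per-strain sets can then be handed to any Venn or UpSet plotting tool, although with many strains a histogram like `hist` is usually easier to read.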
Thanks for your quick answer! I'm aware of the instructions regarding the visualization application, which I have running. However, I find PanX somewhat restrictive in terms of what can be done from the visualization, and I wish to complement my analyses using different software. I think that the explanation you have just provided, together with the ability to download the gene cluster table generated in the visualization in an easy-to-read format (e.g. csv), without having to parse the allclusters_final.tsv and the geneCluster.json, would be of great use for the users.
Each visualization application has its own focus. PanX is meant to offer interactive exploration of large datasets of bacterial genomes, and its design focuses on combining breadth and depth: various summary statistics for gene clusters, the comparative panel for the species tree and gene trees, gene presence/absence patterns, gain/loss events, strain-associated metadata and more. A Venn diagram isn't an optimal way to visually compare hundreds or thousands of strains.
All the files in the vis folder (such as geneCluster.json) are needed for data visualization on the browser.
"without having to parse ... would be of great use for the users."
I've written a simple script that can be used to convert geneCluster.json into a csv file. The script and its description will be updated tomorrow.
The script can be found here: https://github.com/neherlab/pan-genome-analysis/blob/master/scripts/helper_functions.py
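For readers who only want a sense of what such a conversion involves, here is a minimal Python sketch. It is not the linked script. It assumes the JSON file holds a top-level array of flat objects, one per gene cluster, and the field names in the demo (`clusterId`, `count`, `annotation`) are invented for illustration and may not match the real keys in geneCluster.json.

```python
import csv
import io
import json


def clusters_to_csv(records, out):
    """Write a list of flat dicts to `out` as CSV.

    The header is the union of all keys in first-seen order;
    keys missing from a record are written as empty cells.
    """
    fieldnames = []
    for rec in records:
        for key in rec:
            if key not in fieldnames:
                fieldnames.append(key)
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    for rec in records:
        writer.writerow(rec)


# Made-up stand-in for the contents of geneCluster.json (hypothetical keys).
demo_json = """[
  {"clusterId": "GC_1", "count": 3, "annotation": "hypothetical protein"},
  {"clusterId": "GC_2", "count": 1}
]"""
records = json.loads(demo_json)

buffer = io.StringIO()
clusters_to_csv(records, buffer)
csv_lines = buffer.getvalue().splitlines()
```

For a real file, load the records with `json.load(open(path))` and pass a file opened with `newline=""` as `out`. The resulting table can be read directly in R with `read.csv`.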
Thanks a lot for your help and swift responses.
As far as I can tell from reading the PanX docs, there's currently not much information describing the data in the output files, or how to use the visualization tool beyond a few short animations. For example, is there a single file that I can use to perform analyses in software such as R, or should I gather and combine information from different files? How can I determine the number and names of genes shared by a given number of strains in my analysis to obtain summary statistics? If I want to create a Venn diagram comparing the multiple genomes in my analysis, can I obtain this data from the visualization or any of the files?