wdingx / pan-genome-visualization


Documentation on output files or download from visualization #4

Closed: jacodela closed this issue 6 years ago

jacodela commented 6 years ago

As far as I can tell from reading the PanX docs, there's currently not much information describing the data in the output files, nor how to use the visualization tool beyond a few short animations. For example, is there a single file that I can use to perform analyses in software such as R, or should I gather and combine information from different files? How can I determine the number and names of genes shared by a given number of strains in my analysis to obtain summary statistics? If I want to create a Venn diagram comparing the multiple genomes in my analysis, can I obtain this data from the visualization or from any of the files?

wdingx commented 6 years ago

The README for the pan-genome-analysis repository lists the "files required for visualizing the pan-genome using pan-genome-visualization." Please see the explanation of each type of output file there.

Instructions regarding the visualization application have been described in the README for the pan-genome-visualization repository: https://github.com/wdingx/pan-genome-visualization#send-your-own-data-to-the-local-server

The summary statistics for all gene clusters can be found in geneCluster.json. One can write a small script to extract the needed information. Alternatively, it's easy to parse the clustering result file, ./data/YourSpecies/allclusters_final.tsv, in which each line is one gene cluster with its genes separated by tabs. The data for a Venn diagram can also be extracted from that file.
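As a rough illustration of the parsing described above, here is a minimal sketch that reads a tab-separated cluster file, counts how many clusters occur in exactly k strains, and builds per-strain cluster sets that could feed a Venn diagram. It assumes each gene ID has the form `strain|locus_tag`; that separator, the function names, and the toy data are all illustrative assumptions, so check them against your own allclusters_final.tsv before relying on the result.

```python
import io
from collections import Counter, defaultdict


def read_clusters(handle, sep="|"):
    """Return one set of strain names per cluster (one cluster per line).

    ASSUMPTION: each gene ID looks like '<strain><sep><locus_tag>'.
    Adjust `sep` if your gene IDs are built differently.
    """
    clusters = []
    for line in handle:
        genes = [g for g in line.rstrip("\n").split("\t") if g]
        if not genes:
            continue  # skip blank lines
        clusters.append({g.split(sep, 1)[0] for g in genes})
    return clusters


def clusters_by_strain_count(clusters):
    """Count how many clusters are present in exactly k strains."""
    return Counter(len(strains) for strains in clusters)


def clusters_per_strain(clusters):
    """Map each strain to the set of cluster indices it carries."""
    per_strain = defaultdict(set)
    for idx, strains in enumerate(clusters):
        for strain in strains:
            per_strain[strain].add(idx)
    return dict(per_strain)


# Toy data standing in for allclusters_final.tsv (hypothetical strains A, B, C).
demo = io.StringIO(
    "A|g1\tB|g1\tC|g1\n"
    "A|g2\tB|g2\n"
    "C|g3\n"
    "A|g4\tA|g5\tB|g4\n"
)
clusters = read_clusters(demo)
histogram = clusters_by_strain_count(clusters)
per_strain = clusters_per_strain(clusters)

# Clusters shared by strains A and C (one region of a Venn diagram).
shared_a_c = per_strain["A"] & per_strain["C"]
```

To run it on real data, replace the `io.StringIO` object with `open("./data/YourSpecies/allclusters_final.tsv")`. The per-strain sets can then be intersected with ordinary set operations, or exported for plotting in R.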

jacodela commented 6 years ago

Thanks for your quick answer! I'm aware of the instructions regarding the visualization application, which I have running. However, I find PanX somewhat restrictive in terms of what can be done from the visualization, and I wish to complement my analyses using different software. I think that the explanation you have just provided, together with the ability to download the gene cluster table generated in the visualization in a format that is easy to read (e.g. csv), without having to parse allclusters_final.tsv and geneCluster.json, would be of great use to users.

wdingx commented 6 years ago

Each visualization application has its own focus. PanX is meant to offer interactive exploration of large datasets of bacterial genomes, and its design focuses on combining breadth and depth: various summary statistics for gene clusters, the comparative panel for the species tree and gene trees, gene presence/absence patterns, gain/loss events, strain-associated metadata and more. A Venn diagram isn't the optimal way to visually compare hundreds or thousands of strains.

All the files in the vis folder (such as geneCluster.json) are needed for data visualization on the browser.

"without having to parse ... would be of great use for the users."

I've written a simple script that can be used to convert geneCluster.json into a csv file. The script and description will be updated tomorrow.

wdingx commented 6 years ago

The script can be found here: https://github.com/neherlab/pan-genome-analysis/blob/master/scripts/helper_functions.py
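For readers who only want the general idea, the following is a minimal sketch of a JSON-to-CSV conversion. It is not the linked script. It assumes the JSON file holds a top-level list of flat objects, which may not match the real geneCluster.json layout, and the keys `cluster`, `count`, and `note` in the toy data are hypothetical.

```python
import csv
import io
import json


def json_records_to_csv(records, out_handle):
    """Write a list of flat dicts as CSV and return the column names.

    ASSUMPTION: `records` is a list of dicts with scalar values.
    The header is the union of all keys, in first-seen order.
    """
    fieldnames = []
    for rec in records:
        for key in rec:
            if key not in fieldnames:
                fieldnames.append(key)
    writer = csv.DictWriter(
        out_handle, fieldnames=fieldnames, restval="", lineterminator="\n"
    )
    writer.writeheader()
    for rec in records:
        writer.writerow(rec)  # missing keys are filled with restval
    return fieldnames


# Toy input with hypothetical keys; real files may use different ones.
demo_json = (
    '[{"cluster": "GC_1", "count": 3},'
    ' {"cluster": "GC_2", "count": 2, "note": "x"}]'
)
buffer = io.StringIO()
columns = json_records_to_csv(json.loads(demo_json), buffer)
csv_text = buffer.getvalue()
```

For a real file, load the records with `json.load(open(path))` and pass a file opened with `newline=""` as `out_handle`. Nested values, if present, would need flattening first.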

jacodela commented 6 years ago

Thanks a lot for your help and swift responses.