Add heatmap and heatmap script

krish8484 commented 4 years ago

Description

This PR aims to add heatmap (Probabilities vs Sequence) with custom labels or sequence logos as axis. Fixes #16 (issue)

Type of change

Please delete options that are not relevant.

[x] New feature (non-breaking change which adds functionality)

Checklist:

[x] My code follows the style guidelines of this project
[x] I have performed a self-review of my own code
[x] I have commented my code, particularly in hard-to-understand areas
[ ] My changes generate no new warnings
[ ] I have added tests that prove my fix is effective or that my feature works
[ ] New and existing unit tests pass locally with my changes
[x] I have not reduced the existing code coverage

krish8484 commented 4 years ago

I have used ggplot2 in R to plot a heatmap. It has been saved at tests\unit\. Currently about 75% of the map is greyed out because we dont have the information in the tsv file. For example, if the binding position for some motif is 15-19 then we have only the required probability for positions 15-19 but we dont have any data about the probability on 0-14 and 20-65, so NA data had to be added there. The map certainly represents the data in tsv file correctly though. The script is in workflow\scripts\. Do let me know the changes needed in the graph and I will also start the plotting of sequence logos simultaneously.

AngryMaciek commented 4 years ago

In principle it is good, just need some aesthetical madifications:

As for the color bar - It would make more sense to have white=0, red=1. so then the intensity of the red color would mark the binding probability posterior.
Could you please remove the grey boxes on top/bottom, this is really not neccessary for the presentation.
Yaxis should not be labelled as "probability" since on that axis we present distinct motifs. They all have their own name so it would be better just to remove the Yaxis name. Pass an empty string?
Please remove the plot title from the image as well. It is actually uncommon to annotate titles to plots within scientific papers (unlike from presentations, where titles are more welcome)
Its OK to add NA to places where we do not have data - lets try how it would look. Or maybe empty strings as well? that way we could cheat a little and indeed do not flood the plot with a multidude of NAs.

Apart from these - you are good to add sequence logos. Let's try it out, we have to compose the information somehow... Maybe we could have the annotations on Yaxis as: [LOGO] [MOTIF NAME] [.....................] ?

krish8484 commented 4 years ago

Hey, I have made all the suggested changes and the graph definitely looks better now. For the sequence logos, I am planning to use python's logomaker, so I will be writing a separate script for that, save the logos as images and import it in the R script. To plot the logos, we need to give files such as those in tests/integration/pwm as input. So the python script will take 2 inputs, first will be the tsv file and the other one will be the directory containing matrix files? Is this fine, or do we have a better idea?

Also, in the matrix files, those matrices are probability matrices, aren't they?

AngryMaciek commented 4 years ago

Yep, it looks better but we can still improve it:

Is it possible to force the heatmap cells square? I know that in Python there is a argument (sth. like aspect_ratio) that enforces such behaviour. I think this would be highly recommended - the plot would look the same, regardless of the number of motifs analyzed.
The overwhelming grey parts are irritating me. I am considering having all-white cells where there was no information about any binding. Is it possible to set NA color to white? What do you think?
Given all cells are white we might need some separation between cells (to avoid them blending together into one big white blob). Is it possible to set a very thin line that would separate the cells?

Also, one previous point:

Its OK to add NA to places where we do not have data - lets try how it would look. Or maybe empty strings as well? that way we could cheat a little and indeed do not flood the plot with a multidude of NAs.

Regarding the plotting strategy:
Are you sure this is necessary? If you plot all the sequence logos separately and then compose it all together you have much of a work overhead: you will have to create 2 rules, 2 scripts, 2 unit tests and 2 YAML env files. In my opinion it would be easier and faster for you to generate the whole thing on the fly in a one plotting step. Please take a look at this section: https://www.bioconductor.org/packages/devel/bioc/vignettes/motifStack/inst/doc/motifStack_HTML.html#plot_motifs_with_ggplot2 It seems to be possible to integrate them with ggplot2. Could you also use it for annotation as well?

Yes, text files per motif contain binding probabilities for each of the four bases (A,C,G,T).

krish8484 commented 4 years ago

Made the grids square, background is white and now they are separated by a thin black line. What changes are expected in the graph by changing NA to empty string?

For the sequence logos, I tried building something with the help of the link you posted, but motifstack does not work with most of the R versions and I ran into a plethora of library errors. On the other hand, we can get what we want from logomaker.

The way I was planning this is by calling the python script from the R script with the filepath for the input matrix using the system command in R just to plot the logos. By this way, we wont need two rules.

I think it will work. What do you think?

AngryMaciek commented 4 years ago

What changes are expected in the graph by changing NA to empty string?

Forget it, Mihaela is right, the number would be too small, we do not need to annotate the heatmap cells with scores.

On the other hand, we can get what we want from logomaker.

OK, if you are sure then let's go with logomaker.

Since it turns out it is not that trivial to add the logos to the heatmap it would be better to move that task away, for possible future, and close this feature branch with the heatmap we currently have. That way we should reach first working version soon (and this is important for us - to have the repository working asap). We may add the sequence logos later.

The way I was planning this is by calling the python script from the R script with the filepath for the input matrix using the system command in R just to plot the logos. By this way, we wont need two rules.

Bad practice. When you are building a computationl workflow for data analysis all your analysis steps should be smeared over one abstraction level. So as we are using snakemake we should have all our processing steps encapsulated as a separate rules - the pipeline expands in width. Calling one Python script from another R script is not a very good design - it's like recursive calls of data analysis (in depth) instead of sequential, as it should be.

So in the end for the incorporation of sequence logos (if we decide to still pursue this) into the heatmap there will be two more feature branches needes:

one rule to generate the logos
modification of the heatmap plotting.

But let's close this issue first.

krish8484 commented 4 years ago

I have made the changes. Note that the md5sums are failing for the pdf files because unlike txt and tsv files, pdf files also contain the data of the date of creation (and date of modification). Since we are creating the pdf files with the R script, the date of creation will always change and therefore the md5sum will never be the same whenever it is reproduced.

AngryMaciek commented 4 years ago

Does svg contain these metadata as well? I think not. If not - could you please change the vector graphics format for the final plot? It should all run through then... The same would apply for the unit test of that new processing step.

krish8484 commented 4 years ago

The last test is passing on my machine, don't know what's causing it to fail on the server.

AngryMaciek commented 4 years ago

Dear Krish, Why did you move away some of the YAML recipes for internal conda v-envs dedicated to snakemake rules under ./envs? Right now some files are under ./envs and some are under ./workflow/envs (as they should be) and its a mess. What is the logic here?

Also, I do not see you building and activating all snakemake v-envs in the travis CI, as I asked. I cannot check the md5 issue before you address these v-env inconsistencies.

krish8484 commented 4 years ago

Apologies, I misunderstood the path for the env files, I will change it. Also, I activated then env file for combine_result rule, but I am not sure for which scripts should I activate motevo.yml and bash.yml for? The last test I was referring to is the the last one in .travis.

AngryMaciek commented 4 years ago

Apologies, obviously we do not have scripts that woudl require unit tests with that two v-envs, you mentioned... Bash is just there for simple bash commands, motevo is a separate package which we trust and testing it is not on our side.

AngryMaciek commented 4 years ago

@krish8484 : I have cloned the repo, checked out this feature branch and I can confirm that the integration test fails due to incorrect md5sum check for the plot, just as on the CI... This is strange, since md5-checking the same plot but while unit-testing raises no error. This suggest that there is no problem with svg format and the metadata as previously, right? I do not know why does the integration test work for you but not me, neither CI. Could you please delete the directory, clone fresh repo, check out the branch and try yourself again?

krish8484 commented 4 years ago

Yes, there is certainly no issue with the metadata as it was in case of pdf. I too cloned a fresh repo and reran the tests. I am still getting a positive result, maybe my system is behaving differently(I have no clue why).

AngryMaciek commented 4 years ago

I have described the plots checksums problem in #24 .
We would not want a minor detail like that to stop our work so for now we will turn off the CI tests.

zavolanlab / bindz-rbp