zavolanlab / bindz-rbp

RBP module for bindz, a bioinformatics tool to detect regulators' binding sites on RNA sequences.
https://github.com/zavolanlab/bindz-rbp
Apache License 2.0
6 stars 1 forks source link

Add heatmap and heatmap script #22

Closed krish8484 closed 4 years ago

krish8484 commented 4 years ago

Description

This PR aims to add heatmap (Probabilities vs Sequence) with custom labels or sequence logos as axis. Fixes #16 (issue)

Type of change

Please delete options that are not relevant.

Checklist:

krish8484 commented 4 years ago

I have used ggplot2 in R to plot a heatmap. It has been saved at tests\unit\. Currently about 75% of the map is greyed out because we dont have the information in the tsv file. For example, if the binding position for some motif is 15-19 then we have only the required probability for positions 15-19 but we dont have any data about the probability on 0-14 and 20-65, so NA data had to be added there. The map certainly represents the data in tsv file correctly though. The script is in workflow\scripts\. Do let me know the changes needed in the graph and I will also start the plotting of sequence logos simultaneously.

AngryMaciek commented 4 years ago

In principle it is good, just need some aesthetical madifications:

Apart from these - you are good to add sequence logos. Let's try it out, we have to compose the information somehow... Maybe we could have the annotations on Yaxis as: [LOGO] [MOTIF NAME] [.....................] ?

krish8484 commented 4 years ago

Hey, I have made all the suggested changes and the graph definitely looks better now. For the sequence logos, I am planning to use python's logomaker, so I will be writing a separate script for that, save the logos as images and import it in the R script. To plot the logos, we need to give files such as those in tests/integration/pwm as input. So the python script will take 2 inputs, first will be the tsv file and the other one will be the directory containing matrix files? Is this fine, or do we have a better idea?

Also, in the matrix files, those matrices are probability matrices, aren't they?

AngryMaciek commented 4 years ago

Yep, it looks better but we can still improve it:

Also, one previous point:

Its OK to add NA to places where we do not have data - lets try how it would look. Or maybe empty strings as well? that way we could cheat a little and indeed do not flood the plot with a multidude of NAs.

Regarding the plotting strategy:
Are you sure this is necessary? If you plot all the sequence logos separately and then compose it all together you have much of a work overhead: you will have to create 2 rules, 2 scripts, 2 unit tests and 2 YAML env files. In my opinion it would be easier and faster for you to generate the whole thing on the fly in a one plotting step. Please take a look at this section: https://www.bioconductor.org/packages/devel/bioc/vignettes/motifStack/inst/doc/motifStack_HTML.html#plot_motifs_with_ggplot2 It seems to be possible to integrate them with ggplot2. Could you also use it for annotation as well?

Yes, text files per motif contain binding probabilities for each of the four bases (A,C,G,T).

krish8484 commented 4 years ago

Made the grids square, background is white and now they are separated by a thin black line. What changes are expected in the graph by changing NA to empty string?

For the sequence logos, I tried building something with the help of the link you posted, but motifstack does not work with most of the R versions and I ran into a plethora of library errors. On the other hand, we can get what we want from logomaker.

The way I was planning this is by calling the python script from the R script with the filepath for the input matrix using the system command in R just to plot the logos. By this way, we wont need two rules.

I think it will work. What do you think?

AngryMaciek commented 4 years ago

What changes are expected in the graph by changing NA to empty string?

Forget it, Mihaela is right, the number would be too small, we do not need to annotate the heatmap cells with scores.

On the other hand, we can get what we want from logomaker.

OK, if you are sure then let's go with logomaker.

Since it turns out it is not that trivial to add the logos to the heatmap it would be better to move that task away, for possible future, and close this feature branch with the heatmap we currently have. That way we should reach first working version soon (and this is important for us - to have the repository working asap). We may add the sequence logos later.

The way I was planning this is by calling the python script from the R script with the filepath for the input matrix using the system command in R just to plot the logos. By this way, we wont need two rules.

Bad practice. When you are building a computationl workflow for data analysis all your analysis steps should be smeared over one abstraction level. So as we are using snakemake we should have all our processing steps encapsulated as a separate rules - the pipeline expands in width. Calling one Python script from another R script is not a very good design - it's like recursive calls of data analysis (in depth) instead of sequential, as it should be.

So in the end for the incorporation of sequence logos (if we decide to still pursue this) into the heatmap there will be two more feature branches needes:

But let's close this issue first.

krish8484 commented 4 years ago

I have made the changes. Note that the md5sums are failing for the pdf files because unlike txt and tsv files, pdf files also contain the data of the date of creation (and date of modification). Since we are creating the pdf files with the R script, the date of creation will always change and therefore the md5sum will never be the same whenever it is reproduced.

AngryMaciek commented 4 years ago

Does svg contain these metadata as well? I think not. If not - could you please change the vector graphics format for the final plot? It should all run through then... The same would apply for the unit test of that new processing step.

krish8484 commented 4 years ago

The last test is passing on my machine, don't know what's causing it to fail on the server.

AngryMaciek commented 4 years ago

Dear Krish, Why did you move away some of the YAML recipes for internal conda v-envs dedicated to snakemake rules under ./envs? Right now some files are under ./envs and some are under ./workflow/envs (as they should be) and its a mess. What is the logic here?

Also, I do not see you building and activating all snakemake v-envs in the travis CI, as I asked. I cannot check the md5 issue before you address these v-env inconsistencies.

krish8484 commented 4 years ago

Apologies, I misunderstood the path for the env files, I will change it. Also, I activated then env file for combine_result rule, but I am not sure for which scripts should I activate motevo.yml and bash.yml for? The last test I was referring to is the the last one in .travis.

AngryMaciek commented 4 years ago

Apologies, obviously we do not have scripts that woudl require unit tests with that two v-envs, you mentioned... Bash is just there for simple bash commands, motevo is a separate package which we trust and testing it is not on our side.

AngryMaciek commented 4 years ago

@krish8484 : I have cloned the repo, checked out this feature branch and I can confirm that the integration test fails due to incorrect md5sum check for the plot, just as on the CI... This is strange, since md5-checking the same plot but while unit-testing raises no error. This suggest that there is no problem with svg format and the metadata as previously, right? I do not know why does the integration test work for you but not me, neither CI. Could you please delete the directory, clone fresh repo, check out the branch and try yourself again?

krish8484 commented 4 years ago

Yes, there is certainly no issue with the metadata as it was in case of pdf. I too cloned a fresh repo and reran the tests. I am still getting a positive result, maybe my system is behaving differently(I have no clue why).

AngryMaciek commented 4 years ago

I have described the plots checksums problem in #24 .
We would not want a minor detail like that to stop our work so for now we will turn off the CI tests.