Closed krish8484 closed 4 years ago
I have used ggplot2
in R to plot a heatmap. It has been saved at tests\unit\
. Currently about 75% of the map is greyed out because we dont have the information in the tsv file. For example, if the binding position for some motif is 15-19 then we have only the required probability for positions 15-19 but we dont have any data about the probability on 0-14 and 20-65, so NA data had to be added there. The map certainly represents the data in tsv file correctly though. The script is in workflow\scripts\
. Do let me know the changes needed in the graph and I will also start the plotting of sequence logos simultaneously.
In principle it is good, just need some aesthetical madifications:
Apart from these - you are good to add sequence logos. Let's try it out, we have to compose the information somehow... Maybe we could have the annotations on Yaxis as: [LOGO] [MOTIF NAME] [.....................] ?
Hey, I have made all the suggested changes and the graph definitely looks better now.
For the sequence logos, I am planning to use python
's logomaker
, so I will be writing a separate script for that, save the logos as images and import it in the R script.
To plot the logos, we need to give files such as those in tests/integration/pwm
as input. So the python script will take 2 inputs, first will be the tsv file and the other one will be the directory containing matrix files? Is this fine, or do we have a better idea?
Also, in the matrix files, those matrices are probability matrices, aren't they?
Yep, it looks better but we can still improve it:
NA
color to white? What do you think?Also, one previous point:
Its OK to add NA to places where we do not have data - lets try how it would look. Or maybe empty strings as well? that way we could cheat a little and indeed do not flood the plot with a multidude of NAs.
Regarding the plotting strategy:
Are you sure this is necessary? If you plot all the sequence logos separately and then compose it all together you have much of a work overhead: you will have to create 2 rules, 2 scripts, 2 unit tests and 2 YAML env files. In my opinion it would be easier and faster for you to generate the whole thing on the fly in a one plotting step.
Please take a look at this section:
https://www.bioconductor.org/packages/devel/bioc/vignettes/motifStack/inst/doc/motifStack_HTML.html#plot_motifs_with_ggplot2
It seems to be possible to integrate them with ggplot2
. Could you also use it for annotation as well?
Yes, text files per motif contain binding probabilities for each of the four bases (A,C,G,T).
Made the grids square, background is white and now they are separated by a thin black line. What changes are expected in the graph by changing NA to empty string?
For the sequence logos, I tried building something with the help of the link you posted, but motifstack
does not work with most of the R versions and I ran into a plethora of library errors.
On the other hand, we can get what we want from logomaker
.
The way I was planning this is by calling the python
script from the R
script with the filepath for the input matrix using the system
command in R just to plot the logos. By this way, we wont need two rules.
I think it will work. What do you think?
What changes are expected in the graph by changing NA to empty string?
Forget it, Mihaela is right, the number would be too small, we do not need to annotate the heatmap cells with scores.
On the other hand, we can get what we want from logomaker.
OK, if you are sure then let's go with logomaker
.
Since it turns out it is not that trivial to add the logos to the heatmap it would be better to move that task away, for possible future, and close this feature branch with the heatmap we currently have. That way we should reach first working version soon (and this is important for us - to have the repository working asap). We may add the sequence logos later.
The way I was planning this is by calling the python script from the R script with the filepath for the input matrix using the system command in R just to plot the logos. By this way, we wont need two rules.
Bad practice. When you are building a computationl workflow for data analysis all your analysis steps should be smeared over one abstraction level. So as we are using snakemake we should have all our processing steps encapsulated as a separate rules - the pipeline expands in width. Calling one Python
script from another R
script is not a very good design - it's like recursive calls of data analysis (in depth) instead of sequential, as it should be.
So in the end for the incorporation of sequence logos (if we decide to still pursue this) into the heatmap there will be two more feature branches needes:
But let's close this issue first.
I have made the changes. Note that the md5sums are failing for the pdf files because unlike txt and tsv files, pdf files also contain the data of the date of creation (and date of modification). Since we are creating the pdf files with the R
script, the date of creation will always change and therefore the md5sum will never be the same whenever it is reproduced.
Does svg
contain these metadata as well? I think not. If not - could you please change the vector graphics format for the final plot? It should all run through then...
The same would apply for the unit test of that new processing step.
The last test is passing on my machine, don't know what's causing it to fail on the server.
Dear Krish,
Why did you move away some of the YAML recipes for internal conda v-envs dedicated to snakemake rules under ./envs
? Right now some files are under ./envs
and some are under ./workflow/envs
(as they should be) and its a mess. What is the logic here?
Also, I do not see you building and activating all snakemake v-envs in the travis CI, as I asked. I cannot check the md5 issue before you address these v-env inconsistencies.
Apologies, I misunderstood the path for the env
files, I will change it.
Also, I activated then env
file for combine_result
rule, but I am not sure for which scripts should I activate motevo.yml
and bash.yml
for?
The last test I was referring to is the the last one in .travis
.
Apologies, obviously we do not have scripts that woudl require unit tests with that two v-envs, you mentioned... Bash is just there for simple bash commands, motevo is a separate package which we trust and testing it is not on our side.
@krish8484 : I have cloned the repo, checked out this feature branch and I can confirm that the integration test fails due to incorrect md5sum check for the plot, just as on the CI...
This is strange, since md5-checking the same plot but while unit-testing raises no error. This suggest that there is no problem with svg
format and the metadata as previously, right?
I do not know why does the integration test work for you but not me, neither CI. Could you please delete the directory, clone fresh repo, check out the branch and try yourself again?
Yes, there is certainly no issue with the metadata as it was in case of pdf
.
I too cloned a fresh repo and reran the tests. I am still getting a positive result, maybe my system is behaving differently(I have no clue why).
I have described the plots checksums problem in #24 .
We would not want a minor detail like that to stop our work so for now we will turn off the CI tests.
Description
This PR aims to add heatmap (Probabilities vs Sequence) with custom labels or sequence logos as axis. Fixes #16 (issue)
Type of change
Please delete options that are not relevant.
Checklist: