I will have some time until the weekend to do the layer/faceting refactoring.
I've also looked into scales once and found them tricky: currently we handle scales by letting matplotlib do the work (label/gridline placement and label formatting). Going one layer deeper (i.e. doing these computations in ggplot) will mean quite a lot more code...
I think we will need to find a way to do it incrementally.
You can base the layer refactoring off #266, it is ready enough. I will put up a status update.
I'll put together some kind of roadmap on this.
So, if I interpret your comments in #266 correctly, you want to do this PR first, before the layer/facet refactoring?
Right. Otherwise it would just be adding more cruft that would make the problem harder to fix. And given what would be involved, it would also require some kind of freeze on the codebase, because even the initial step that gets the basics in would involve modifying many parts. That would make merging hellish.
So, do you want to take this? Then I will hold off on the rest of #221 (layer/faceting refactoring).
@glamp, @EricChiang
I can do the first step (plain refactor and placeholders for what we can anticipate, no new stuff) after we agree on the way forward and get everyone on board. Thereafter, more changes can be made on an ongoing basis. We would need to make sure subsequent PRs don't take any shortcuts with regards to the organisation.
One step I want to see is scale adding after doing the transformations in each layer: if x is a date column, add `scale_x_date`, and so on.

The first step would include:

- Refactoring `assign_visual_mapping` and the `assign_*` functions into the scales. Each visual aesthetic would then have a default scale instance, where the `palette` attribute of the scale would do the mapping (see the sketch below).
- A `panel` instance would be introduced and it would hold the ranges for each facet.
- `geom._plot_unit` and `stat.calculate` would then take a range as the second argument. The functions themselves can then be modified later on (when scale training guarantees correct range values) to make use of the range.
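To make that first step concrete, here is a rough sketch of the idea — every name is provisional, none of it is current code:

```python
# A rough sketch (provisional names): each aesthetic gets a default
# scale whose palette does the mapping that the assign_* functions
# do today.
import pandas as pd

class scale_colour_discrete(object):
    aesthetic = 'color'

    def __init__(self, palette=None):
        # palette: a function mapping n -> a list of n colours
        self.palette = palette or (lambda n: ['#1f77b4', '#ff7f0e',
                                              '#2ca02c', '#d62728'][:n])
        self.range = []   # the trained discrete domain

    def train(self, series):
        # widen the domain; called once per layer (and later per facet)
        for value in pd.unique(series):
            if value not in self.range:
                self.range.append(value)

    def map(self, series):
        # domain -> aesthetic range, replacing assign_colors
        lookup = dict(zip(self.range, self.palette(len(self.range))))
        return series.map(lookup)

def scales_add_defaults(scales, data, aes):
    # e.g. if x is a date column, add a scale_x_date (stub here)
    if 'x' in aes and data[aes['x']].dtype.kind == 'M':
        scales.append('scale_x_date')

# usage
df = pd.DataFrame({'cyl': [4, 6, 8, 4]})
sc = scale_colour_discrete()
sc.train(df['cyl'])
df['color'] = sc.map(df['cyl'])
```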
Probable issues: the transformations should happen before the `stat` calculations, but the present transformations happen at the end of the plot building and are delegated to matplotlib, i.e. `ax.set_xscale('log', basex=self.scale_x_log)`.

Any stuff worth highlighting that I've left out?
Actually, I would like to see a high-level overview of how scales are used during draw (kind of like "first the transforms, ... scales for ..., send the data for each layer to the layer plotting function"). I haven't looked into the scales part of ggplot2, so I have no real understanding of what is needed here :-)
Here are my notes, assuming a ggplot object `gg` has a list of scales that were added by the user, or defaults if nothing was added. After `print(gg)`:

An example from ggplot2:

```r
# facet grid, free scale
> gg <- ggplot(mtcars, aes(mpg, wt, colour = factor(cyl))) + geom_point()
> gg = gg + facet_grid(vs ~ am, scales = "free")
> x = print(gg)
> x$panel$layout
PANEL ROW COL vs am SCALE_X SCALE_Y
1 1 1 1 0 0 1 1
2 2 1 2 0 1 2 1
3 3 2 1 1 0 1 2
4 4 2 2 1 1 2 2
# facet wrap, free scale
> gg <- ggplot(mtcars, aes(mpg, wt, colour = factor(cyl))) + geom_point()
> gg = gg + facet_wrap(vs ~ am, scales = "free")
> x = print(gg)
> x$panel$layout
PANEL ROW COL vs am SCALE_X SCALE_Y AXIS_X AXIS_Y
4 1 1 1 0 0 1 1 TRUE TRUE
1 2 1 2 0 1 2 2 TRUE TRUE
3 3 2 1 1 0 3 3 TRUE TRUE
2 4 2 2 1 1 4 4 TRUE TRUE
# facet wrap
> gg <- ggplot(mtcars, aes(mpg, wt, colour = factor(cyl))) + geom_point()
> gg = gg + facet_wrap(vs ~ am)
> x = print(gg)
> x$panel$layout
PANEL ROW COL vs am SCALE_X SCALE_Y AXIS_X AXIS_Y
4 1 1 1 0 0 1 1 FALSE TRUE
1 2 1 2 0 1 1 1 FALSE FALSE
3 3 2 1 1 0 1 1 TRUE TRUE
2 4 2 2 1 1 1 1 TRUE FALSE
# Notes
# -----
# For all the plots the 4 facets are ordered as follows
#######
# 1 2 #
# 3 4 #
#######
# it's clear how the scales are shared.
```
(… the `geom`-specific columns from the more general `stat` computations)

The layout table makes a good organisational tool. I think if we adopt it, we can add the `ax`es for plotting in there. On the whole, it would save us from the deeply nested control statements in `ggplot.draw`.
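Something like this, sketched with pandas and matplotlib — the column names mirror the R output above, and the `ax` column and everything else here is made up for illustration:

```python
# A sketch of the layout table idea with the axes attached; this is
# facet_grid with free x scales, 2 rows x 2 cols.
import itertools
import pandas as pd
import matplotlib.pyplot as plt

nrow, ncol = 2, 2
layout = pd.DataFrame(
    [{'PANEL': i + 1, 'ROW': r + 1, 'COL': c + 1}
     for i, (r, c) in enumerate(itertools.product(range(nrow),
                                                  range(ncol)))])
layout['SCALE_X'] = layout['COL']   # free x: one x scale per column
layout['SCALE_Y'] = 1               # fixed y: all panels share scale 1

fig, axs = plt.subplots(nrow, ncol)
layout['ax'] = [axs[r - 1, c - 1]
                for r, c in zip(layout['ROW'], layout['COL'])]

# drawing becomes a flat loop instead of nested control statements
for _, panel in layout.iterrows():
    panel['ax'].set_title('PANEL %d' % panel['PANEL'])
```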
I still haven't quite understood the reason for the reset and retraining. The rationale seems to be that after calculating the stats the ranges may be off, but it hasn't quite clicked for me yet.
The problem I see is with "use the scale to transform the data": in the end, I think this means that we are using mpl as a "canvas" and do the actual painting (plot a line here and here, a rect there and there, ...) ourselves. In my opinion we should not go that far down the stack, but keep ggplot as a wrapper around the high-level functions of mpl and not build a replacement.

Back to this case: I think we should do most of the above, but as a first step we should do it using mpl's scaling methods and not send transformed data to mpl.

I'm also not sure we should do the faceting rewrite first (#259). Most of the above is not scale specific but more about fixing bad things like the `assign_*` thingies or the limit calculations. I wonder if it would be helpful to do such things in small steps?
Introduce `layer` and rewrite the draw method to use that (probably in #259).
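A minimal sketch of the `layer` I have in mind (names provisional):

```python
# Minimal sketch of the layer object: the unit the rewritten draw
# method would iterate over.
class layer(object):
    def __init__(self, geom, stat, data=None, mapping=None,
                 position='identity'):
        self.geom = geom          # how to draw
        self.stat = stat          # what to compute first
        self.data = data          # layer data, else the plot's data
        self.mapping = mapping    # layer aes, else the plot's aes
        self.position = position  # e.g. 'stack', 'dodge'

    def compute_statistic(self, data, ranges):
        return self.stat.calculate(data, ranges)

    def draw(self, data, ranges, ax):
        self.geom._plot_unit(data, ranges, ax)
```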
cc: @glamp
To an extent, the ggplot (or more generally, the grammar of graphics) methodology pushes us closer to using Matplotlib as a canvas for drawing. There is a geom for points, for polygons, for lines and for rectangles. But all of these can be plotted with relatively high-level operations using Matplotlib.
Roughly, the levels are:
1. Canvas lines and painting instance -> Artist
2. Specialised painting instances -> Collections and Patches
3. Graphing functions -> e.g. hist, hist2d, boxplot, hexbin, ...
The level 3 functions involve statistics/summaries and then plot representations. But ggplot has stats which can be mixed with various plotters, so we have to work at level 2, preferably with collections and not patches. For some cases Matplotlib even has convenience wrappers around the collections, e.g. `scatter`, `vlines` and `hlines`. Some of the wrappers are nice [1], others have small annoyances [2]. Yet the collection instances themselves can be just as simple to use.
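For example, vertical segments via the collection directly (level 2) instead of the `ax.vlines` wrapper — a sketch:

```python
# Sketch: the same vertical segments ax.vlines would draw, but via
# a LineCollection directly, so nothing pads the canvas behind our
# back.
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection

x = [1, 2, 3]
ymin = [0.0, 1.0, 0.5]
ymax = [2.0, 3.0, 1.5]
segments = [[(xi, lo), (xi, hi)]
            for xi, lo, hi in zip(x, ymin, ymax)]

fig, ax = plt.subplots()
ax.add_collection(LineCollection(segments, colors='black'))
ax.set_xlim(0.5, 3.5)   # we control the limits ourselves
ax.set_ylim(0, 3)
plt.show()
```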
Where we should worry about duplication of functionality is the stuff around the canvas, i.e. the ticks & labels around the axes. Matplotlib does an excellent job on this and it should keep doing it. At present, the problem is with the faceting: to give the facets common axes, we have had to do it ourselves. But when we get to handling the scales and panel correctly, and do the training, we will be able to give Matplotlib the correct limits per facet and it should produce the correct ticks. So no more tick calculation for us, hopefully.
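i.e. the goal state is as simple as this — a sketch where `panel_ranges` stands in for whatever structure scale training ends up producing:

```python
# Sketch: per-facet limits handed to matplotlib; its locators then
# produce the ticks, with no tick calculation on our side.
import matplotlib.pyplot as plt

panel_ranges = [{'x': (0, 10), 'y': (0, 1)},
                {'x': (0, 5),  'y': (0, 2)}]

fig, axs = plt.subplots(1, 2)
for ax, rng in zip(axs, panel_ranges):
    ax.set_xlim(*rng['x'])
    ax.set_ylim(*rng['y'])
plt.show()
```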
The transform thing is tricky and here are the points of contention to think about: `stat='identity'` and `position='identity'`.

A possible solution to get the best of both may be to normalise the positions (x, xmax, xmin, xend, xintercept, y, ymax, ymin, yend, yintercept) according to the x and y ranges, so that they fall in the [0, 1] range of the axes coordinate system, and do the plotting with `transform=ax.transAxes`. I think mpld3 would have no problem with this as it does the same kinds of manipulation.
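A sketch of that normalisation, assuming the x and y ranges are already known from scale training:

```python
# Sketch: positions normalised into [0, 1] with the trained ranges,
# then plotted in the axes coordinate system.
import matplotlib.pyplot as plt

x = [10.0, 20.0, 30.0]
y = [1.0, 4.0, 9.0]
x_range, y_range = (10.0, 30.0), (0.0, 10.0)  # from scale training

def rescale(vals, rng):
    lo, hi = rng
    return [(v - lo) / (hi - lo) for v in vals]

fig, ax = plt.subplots()
ax.plot(rescale(x, x_range), rescale(y, y_range), 'o',
        transform=ax.transAxes)
# the ticks/labels would then come from the ranges, not the data
plt.show()
```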
I think facetting, being more of a partitioning task, should come later on. If we are organised in how we do the rest, the facetting should be simpler and more pleasant to redo. Plus, all this should be done in small pieces, and that is where the proper scales and panel instances come in. Since they are just structures, having them in place makes it easier to implement the rest.
[1]: `ax.scatter` as a wrapper is a real convenience.

[2]: In PR #266 the switch was made to use `ax.vlines` and `ax.hlines`. All they add to the `LineCollection` instance is type checking and extending the canvas. The plot info that comes in is already in the right form, so the type checking is not needed, and the canvas extension adds a lot of space that makes the plot look awkward. In this particular case the real collection instance is the better long-term choice.
Reading the transforms tutorial, I get the feeling that we should use that to implement scales: we can use the "data" coordinate system, and for "new" scales we can add custom scales to matplotlib (http://matplotlib.org/1.3.1/devel/add_new_projection.htm). Using such a system would probably also integrate with mpl extensions like maps.
So in my mind the `scale_...`s become objects which add an mpl scale to the ggplot object, which is then used to configure each axis. This would also give us the ticks and limits "for free".
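For illustration, the registration mechanism from that tutorial, with a made-up square-root scale — this is plain matplotlib, not proposed ggplot code:

```python
# Sketch: registering a custom scale with matplotlib, after which
# ax.set_xscale('sqrt') works like the built-in 'log'.
import numpy as np
from matplotlib import scale as mscale
from matplotlib import transforms as mtransforms

class SqrtTransform(mtransforms.Transform):
    input_dims = output_dims = 1

    def transform_non_affine(self, a):
        return np.sqrt(np.clip(a, 0, None))

    def inverted(self):
        return SquareTransform()

class SquareTransform(mtransforms.Transform):
    input_dims = output_dims = 1

    def transform_non_affine(self, a):
        return np.asarray(a) ** 2

    def inverted(self):
        return SqrtTransform()

class SqrtScale(mscale.ScaleBase):
    name = 'sqrt'

    def get_transform(self):
        return SqrtTransform()

    def set_default_locators_and_formatters(self, axis):
        pass  # keep matplotlib's defaults in this sketch

mscale.register_scale(SqrtScale)
# after this: ax.set_xscale('sqrt')
```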
But I'm more and more convinced we should a) do this after introducing the `layer` and b) do a small prototype first (not the complete ggplot style, but simply take some array, do some data manipulations and then use the manipulated data to plot -> a short version of what happens on `draw()`).
Regarding boxplots: I looked at the `bxp` method on the matplotlib Axes and I'm really not looking forward to duplicating that code... I think we should hold off on such things until we have done the easier steps first.
With the need for normalisations, I had a feeling we were moving into projections territory. Although projections are avoidable for now -- until we need to do coordinate transforms -- we might as well start exploring all that they could offer.
Yes, free ticks: we will already have managed the limits properly. Plus, Matplotlib's tick generation is better than ggplot2's for non-linear scales. We may have to do "ticks" for categorical histograms, but that is the odd case.
Either the first step with the scales or introducing the `layer` can come first. The first reason I want the scales first is to limit the changes/bugfixes that would otherwise go into parts we now consider obsolete; the second is that scale training can then be done during the `layer` introduction. However, the two do not conflict.
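For clarity, what I mean by scale training, sketched for a continuous scale:

```python
# Scale training for a continuous scale: the range is widened over
# each layer's data (and each facet, when the scales are not free).
class scale_x_continuous(object):
    def __init__(self):
        self.range = None

    def train(self, values):
        lo, hi = min(values), max(values)
        if self.range is None:
            self.range = [lo, hi]
        else:
            self.range = [min(self.range[0], lo),
                          max(self.range[1], hi)]
```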
A proper boxplot implementation can serve as a good test of our code organisation: right from `stat_boxplot`, what it CREATES and how we store those results (points, lines and boxes) into the dataframe, all the way to (maybe) `geom_boxplot._plot_unit` re-using the other geoms to do the plotting. Plus, if using notches, dealing with that. When the time comes, it may be a good idea to think about it along with `geom_violin`.
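Roughly what `stat_boxplot` would CREATE in the dataframe — a sketch with pandas; ggplot2's real stat also produces a few more columns (outliers, notches, widths):

```python
# Sketch: per-group boxplot statistics stored back into a dataframe,
# the raw material geom_boxplot (and geom_violin later) would draw from.
import pandas as pd

def stat_boxplot(df, x, y):
    def summarise(g):
        q1, q2, q3 = g[y].quantile([0.25, 0.5, 0.75])
        iqr = q3 - q1
        return pd.Series({
            'ymin': g[y][g[y] >= q1 - 1.5 * iqr].min(),
            'lower': q1, 'middle': q2, 'upper': q3,
            'ymax': g[y][g[y] <= q3 + 1.5 * iqr].max()})
    return df.groupby(x).apply(summarise).reset_index()

# geom_boxplot._plot_unit could then draw these with the rect and
# segment geoms rather than matplotlib's bxp
```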
This is quite interesting: http://cpsievert.github.io/2014/06/visualizing-ggplot2-internals-with-shiny-and-d3/
Somewhere we also need to fix discrete values and tick labels: right now you can add labels for all discrete values, but only 5 or so are shown. That's because the default Locator only shows 5, and we probably need to add the right locator somewhere. In my code I currently do that in matplotlib:

```python
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator

# place a tick at every multiple of 1, i.e. at every discrete value
plt.gca().xaxis.set_major_locator(MultipleLocator())
```
I am chipping away at this -- slowly -- and it will be a while before anything is working. It will be a much wider refactoring and will include facetting. It will touch on #259 and maybe even make all of it unnecessary.
The best way to get a plot process that adheres to the process described in Section 2 of http://vita.had.co.nz/papers/layered-grammar.pdf is to mimic https://github.com/hadley/ggplot2/blob/master/R/plot-build.r, which is the "high level" itemized above.
There is no way to avoid it, this will be very destructive.
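For reference, the order of operations in plot-build.r, transcribed as a Python skeleton — every method below is hypothetical, only the sequence matters; step 5 is the reset/retraining that came up earlier:

```python
# The order of operations in ggplot2's plot-build.r, as a skeleton.
def build(plot):
    data = [layer.layer_data(plot.data) for layer in plot.layers]

    # 1. map aesthetics and assign each row to a facet panel
    data = [l.map_aesthetics(d) for l, d in zip(plot.layers, data)]
    data = [plot.facet.assign_panels(d) for d in data]

    # 2. apply the scale transformations BEFORE the statistics
    data = [plot.scales.transform_df(d) for d in data]

    # 3. train x/y, then compute the statistics per panel
    plot.panel.train_position(data, plot.scales)
    data = [l.compute_statistic(d, plot.panel)
            for l, d in zip(plot.layers, data)]

    # 4. geom setup and position adjustments
    data = [l.setup_geom_data(d) for l, d in zip(plot.layers, data)]
    data = [l.adjust_position(d) for l, d in zip(plot.layers, data)]

    # 5. the stats may have created values outside the original
    #    ranges (e.g. bar heights), hence the reset and retraining
    plot.panel.reset_scales()
    plot.panel.train_position(data, plot.scales)
    data = plot.panel.map_position(data)

    # 6. train and map the non-position scales (colour, size, ...)
    plot.scales.train_df(data)
    data = [plot.scales.map_df(d) for d in data]
    return data
```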
Possibly #248 and #306 are related -- I keep running into that issue. I have another variant where adding a `geom_text` messes the scales up and moves the graph viewport away from most of the points, seemingly at random.
I think the problem(s) may be coming from more than one source. In the plot building pipeline there isn't a clear distinction between what should be handled by ggplot and what is handled by matplotlib. The best solution is to fix the underlying architecture to adhere to the same process as ggplot2. This is what I am trying to do in #360; it is big (almost a complete rewrite) and I haven't made any commits lately.
@has2k1 no worries, not trying to pressure you, just trying to tie bugs together and investigate further so if/when you tackle this you'll have less hunting to do.
One thing I'd appreciate is knowing where I'd start looking if I wanted to fix the outputs manually -- like, after the matplotlib elements are generated, where/how would I intercept them before they get repr()'d, so I can, say, manually fix the viewport or investigate what's going on?
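Something along these lines is what I'm after (assuming the plot is drawn onto the current pyplot figure, which I believe it is):

```python
# Grab the live axes after ggplot has drawn, before display.
import matplotlib.pyplot as plt

# ... after the ggplot object has drawn itself ...
ax = plt.gca()
print(ax.get_xlim(), ax.get_ylim())  # inspect the viewport
ax.set_xlim(0, 50)                   # or fix it by hand
plt.show()
```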
@has2k1 Perfect, thank you so much.
Currently, the scales that are present are just holders for settings passed onto the ggplot object at `__radd__` time. The scales need to be more complete, with knowledge of the data domain and the functions to map the domain onto the aesthetic range, i.e. the current `assign_*` functions need to be refactored into the scales.

The scales should be added to the ggplot object and should remain separate objects until plot time, e.g. under `ggplot.scales`. Also, only the ggplot scales should know about the transformations; the matplotlib scales available through `ax` should know nothing.

All this is key for, among other things, a straightforward facetting implementation and statistic computations. Scale training will also be easy this way, and it needs to be done to avoid a bunch of edge cases, e.g. `geom_text` and `geom_bar`
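In miniature (provisional names):

```python
# Scales as objects that know their transformation, collected under
# ggplot.scales at __radd__ time; the mpl scales on `ax` never hear
# about any of this.
import math

class scale(object):
    aesthetic = None

    @staticmethod
    def transform(x):
        return x  # identity by default

    def __radd__(self, gg):
        # `gg + scale_x_log10()` lands here
        gg.scales.append(self)
        return gg

class scale_x_log10(scale):
    aesthetic = 'x'
    transform = staticmethod(math.log10)
```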