I will have some time until the weekend to do the layer/faceting refactoring.
I've also looked into scales once and found them tricky: currently we handle scales by letting matplotlib do the work (label/gridline placement and label formatting). Going one layer deeper (i.e. doing these computations in ggplot) will mean quite a lot more code...
I think we will need to find a way to do it incrementally.
You can base the layer refactoring off #266, it is ready enough. I will put up a status update.
I'll put together some kind of roadmap on this.
So, if I interpret your comments in #266 correctly, you want to do this PR first, before the layer/facet refactoring?
Right. Otherwise it would just be adding more cruft that would make the problem harder to fix. And given what would be involved, it would also require some kind of freeze on the codebase, because even the initial step that gets the basics in would involve modifying many parts. That would make merging hellish.
So, do you want to take this? Then I will hold off on the rest of #221 (layer/faceting refactoring).
@glamp, @EricChiang
I can do the first step (plain refactor and placeholders for what we can anticipate, no new stuff) after we agree on the way forward and get everyone on board. Thereafter, more changes can be made on an ongoing basis. We would need to make sure subsequent PRs don't take any shortcuts with regards to the organisation.
One step I want to see is scale adding after doing the transformations in each layer: if x is a date column, add `scale_x_date`, and so on.

The first step would include:

- Refactoring `assign_visual_mapping` and the `assign_*` functions into the scales. Each visual aesthetic would then have a default scale instance, where the `palette` attribute of the scale would do the mapping (see the sketch below).
- A `panel` instance would be introduced and it would hold the ranges for each facet.
- `geom._plot_unit` and `stat.calculate` would then take a range as the second argument. The functions themselves can then be modified later on (when scale training guarantees correct range values) to make use of the range.
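To make that first step concrete, here is a rough sketch of the idea — every name is provisional, none of it is current code:

```python
# A rough sketch (provisional names): each aesthetic gets a default
# scale whose palette does the mapping that the assign_* functions
# do today.
import pandas as pd

class scale_colour_discrete(object):
    aesthetic = 'color'

    def __init__(self, palette=None):
        # palette: a function mapping n -> a list of n colours
        self.palette = palette or (lambda n: ['#1f77b4', '#ff7f0e',
                                              '#2ca02c', '#d62728'][:n])
        self.range = []   # the trained discrete domain

    def train(self, series):
        # widen the domain; called once per layer (and later per facet)
        for value in pd.unique(series):
            if value not in self.range:
                self.range.append(value)

    def map(self, series):
        # domain -> aesthetic range, replacing assign_colors
        lookup = dict(zip(self.range, self.palette(len(self.range))))
        return series.map(lookup)

def scales_add_defaults(scales, data, aes):
    # e.g. if x is a date column, add a scale_x_date (stub here)
    if 'x' in aes and data[aes['x']].dtype.kind == 'M':
        scales.append('scale_x_date')

# usage
df = pd.DataFrame({'cyl': [4, 6, 8, 4]})
sc = scale_colour_discrete()
sc.train(df['cyl'])
df['color'] = sc.map(df['cyl'])
```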
Probable issues: the transformations should happen before the `stat` calculations, but the present transformations happen at the end of the plot building and are delegated to matplotlib, i.e. `ax.set_xscale('log', basex=self.scale_x_log)`.

Any stuff worth highlighting that I've left out?
Actually, I would like to see a high-level overview of how scales are used during draw (kind of like "first the transforms, ... scales for ..., send the data for each layer to the layer plotting function"). I haven't looked into the scales part of ggplot2, so I have no real understanding of what is needed here :-)
Here are my notes, assuming a ggplot object `gg` has a list of scales that were added by the user, or defaults if nothing was added. After `print(gg)`:

An example from ggplot2:

```r
# facet grid, free scale
> gg <- ggplot(mtcars, aes(mpg, wt, colour = factor(cyl))) + geom_point()
> gg = gg + facet_grid(vs ~ am, scales = "free")
> x = print(gg)
> x$panel$layout
PANEL ROW COL vs am SCALE_X SCALE_Y
1 1 1 1 0 0 1 1
2 2 1 2 0 1 2 1
3 3 2 1 1 0 1 2
4 4 2 2 1 1 2 2
# facet wrap, free scale
> gg <- ggplot(mtcars, aes(mpg, wt, colour = factor(cyl))) + geom_point()
> gg = gg + facet_wrap(vs ~ am, scales = "free")
> x = print(gg)
> x$panel$layout
PANEL ROW COL vs am SCALE_X SCALE_Y AXIS_X AXIS_Y
4 1 1 1 0 0 1 1 TRUE TRUE
1 2 1 2 0 1 2 2 TRUE TRUE
3 3 2 1 1 0 3 3 TRUE TRUE
2 4 2 2 1 1 4 4 TRUE TRUE
# facet wrap
> gg <- ggplot(mtcars, aes(mpg, wt, colour = factor(cyl))) + geom_point()
> gg = gg + facet_wrap(vs ~ am)
> x = print(gg)
> x$panel$layout
PANEL ROW COL vs am SCALE_X SCALE_Y AXIS_X AXIS_Y
4 1 1 1 0 0 1 1 FALSE TRUE
1 2 1 2 0 1 1 1 FALSE FALSE
3 3 2 1 1 0 1 1 TRUE TRUE
2 4 2 2 1 1 1 1 TRUE FALSE
# Notes
# -----
# For all the plots the 4 facets are ordered as follows
#######
# 1 2 #
# 3 4 #
#######
# it's clear how the scales are shared.
```
(… the `geom`-specific columns from the more general `stat` computations)

The layout table makes a good organisational tool. I think if we adopt it, we can add the `ax`es for plotting in there. On the whole, it would save us from the deeply nested control statements in `ggplot.draw`.
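Something like this, sketched with pandas and matplotlib — the column names mirror the R output above, and the `ax` column and everything else here is made up for illustration:

```python
# A sketch of the layout table idea with the axes attached; this is
# facet_grid with free x scales, 2 rows x 2 cols.
import itertools
import pandas as pd
import matplotlib.pyplot as plt

nrow, ncol = 2, 2
layout = pd.DataFrame(
    [{'PANEL': i + 1, 'ROW': r + 1, 'COL': c + 1}
     for i, (r, c) in enumerate(itertools.product(range(nrow),
                                                  range(ncol)))])
layout['SCALE_X'] = layout['COL']   # free x: one x scale per column
layout['SCALE_Y'] = 1               # fixed y: all panels share scale 1

fig, axs = plt.subplots(nrow, ncol)
layout['ax'] = [axs[r - 1, c - 1]
                for r, c in zip(layout['ROW'], layout['COL'])]

# drawing becomes a flat loop instead of nested control statements
for _, panel in layout.iterrows():
    panel['ax'].set_title('PANEL %d' % panel['PANEL'])
```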
I still haven't quite understood the reason for the reset and retraining. The rationale seems to be that after calculating the stats the ranges may be off, but it hasn't quite clicked for me yet.
The problem I see is with "use the scale to transform the data": in the end, I think this means that we are using mpl as a "canvas" and do the actual painting (plot a line here and here, a rect there and there, ...) ourselves. In my opinion we should not go that far down the stack, but keep ggplot as a wrapper around the high-level functions of mpl and not build a replacement.

Back to this case: I think we should do most of the above, but as a first step we should do it using mpl's scaling methods and not send transformed data to mpl.

I'm also not sure we should do the faceting rewrite first (#259). Most of the above is not scale specific but more about fixing bad things like the `assign_*` thingies or the limit calculations. I wonder if it would be helpful to do such things in small steps?
Introduce `layer` and rewrite the draw method to use that (probably in #259).
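A minimal sketch of the `layer` I have in mind (names provisional):

```python
# Minimal sketch of the layer object: the unit the rewritten draw
# method would iterate over.
class layer(object):
    def __init__(self, geom, stat, data=None, mapping=None,
                 position='identity'):
        self.geom = geom          # how to draw
        self.stat = stat          # what to compute first
        self.data = data          # layer data, else the plot's data
        self.mapping = mapping    # layer aes, else the plot's aes
        self.position = position  # e.g. 'stack', 'dodge'

    def compute_statistic(self, data, ranges):
        return self.stat.calculate(data, ranges)

    def draw(self, data, ranges, ax):
        self.geom._plot_unit(data, ranges, ax)
```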
cc: @glamp
To an extent, the ggplot (or more generally, the grammar of graphics) methodology pushes us closer to using Matplotlib as a canvas for drawing. There is a geom for points, for polygons, for lines and for rectangles. But all of these can be plotted with relatively high-level operations using Matplotlib.
Roughly, the levels are:
1. Canvas lines and painting instance -> Artist
2. Specialised painting instances -> Collections and Patches
3. Graphing functions -> e.g. hist, hist2d, boxplot, hexbin, ...
The level 3 functions involve statistics/summaries and then plot representations. But ggplot has stats which can be mixed with various plotters, so we have to work at level 2, preferably with collections and not patches. For some cases Matplotlib even has convenience wrappers around the collections, e.g. `scatter`, `vlines` and `hlines`. Some of the wrappers are nice [1], others have small annoyances [2]. Yet the collection instances themselves can be just as simple to use.
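For example, vertical segments via the collection directly (level 2) instead of the `ax.vlines` wrapper — a sketch:

```python
# Sketch: the same vertical segments ax.vlines would draw, but via
# a LineCollection directly, so nothing pads the canvas behind our
# back.
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection

x = [1, 2, 3]
ymin = [0.0, 1.0, 0.5]
ymax = [2.0, 3.0, 1.5]
segments = [[(xi, lo), (xi, hi)]
            for xi, lo, hi in zip(x, ymin, ymax)]

fig, ax = plt.subplots()
ax.add_collection(LineCollection(segments, colors='black'))
ax.set_xlim(0.5, 3.5)   # we control the limits ourselves
ax.set_ylim(0, 3)
plt.show()
```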
Where we should worry about duplication of functionality is the stuff around the canvas, i.e. the ticks & labels around the axes. Matplotlib does an excellent job on this and it should keep doing it. At present, the problem is with the faceting: to give the facets common axes, we have had to do it ourselves. But when we get to handling the scales and panel correctly, and do the training, we will be able to give Matplotlib the correct limits per facet and it should produce the correct ticks. So no more tick calculation for us, hopefully.
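i.e. the goal state is as simple as this — a sketch where `panel_ranges` stands in for whatever structure scale training ends up producing:

```python
# Sketch: per-facet limits handed to matplotlib; its locators then
# produce the ticks, with no tick calculation on our side.
import matplotlib.pyplot as plt

panel_ranges = [{'x': (0, 10), 'y': (0, 1)},
                {'x': (0, 5),  'y': (0, 2)}]

fig, axs = plt.subplots(1, 2)
for ax, rng in zip(axs, panel_ranges):
    ax.set_xlim(*rng['x'])
    ax.set_ylim(*rng['y'])
plt.show()
```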
The transform thing is tricky and here are the points of contention to think about: `stat='identity'` and `position='identity'`.

A possible solution to get the best of both may be to normalise the positions (x, xmax, xmin, xend, xintercept, y, ymax, ymin, yend, yintercept) according to the x and y ranges, so that they fall in the [0, 1] range of the axes coordinate system, and do the plotting with `transform=ax.transAxes`. I think mpld3 would have no problem with this as it does the same kinds of manipulation.
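A sketch of that normalisation, assuming the x and y ranges are already known from scale training:

```python
# Sketch: positions normalised into [0, 1] with the trained ranges,
# then plotted in the axes coordinate system.
import matplotlib.pyplot as plt

x = [10.0, 20.0, 30.0]
y = [1.0, 4.0, 9.0]
x_range, y_range = (10.0, 30.0), (0.0, 10.0)  # from scale training

def rescale(vals, rng):
    lo, hi = rng
    return [(v - lo) / (hi - lo) for v in vals]

fig, ax = plt.subplots()
ax.plot(rescale(x, x_range), rescale(y, y_range), 'o',
        transform=ax.transAxes)
# the ticks/labels would then come from the ranges, not the data
plt.show()
```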
I think facetting, being more of a partitioning task, should come later on. If we are organised in how we do the rest, the facetting should be simpler and more pleasant to redo. Plus, all this should be done in small pieces, and that is where the proper scales and panel instances come in. Since they are just structures, having them in place makes it easier to implement the rest.
[1]: `ax.scatter` as a wrapper is a real convenience.

[2]: In PR #266 the switch was made to use `ax.vlines` and `ax.hlines`. All they add to the `LineCollection` instance is type checking and extending the canvas. The plot info that comes in is already in the right form, so the type checking is not needed, and the canvas extension adds a lot of space that makes the plot look awkward. In this particular case the real collection instance is the better long-term choice.
Reading the transforms tutorial, I get the feeling that we should use that to implement scales: we can use the "data" coordinate system, and for "new" scales we can add custom scales to matplotlib (http://matplotlib.org/1.3.1/devel/add_new_projection.htm). Using such a system would probably also integrate with mpl extensions like maps.
So in my mind the `scale_...`s become objects which add an mpl scale to the ggplot object, which is then used to configure each axis. This would also give us the ticks and limits "for free".
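For illustration, the registration mechanism from that tutorial, with a made-up square-root scale — this is plain matplotlib, not proposed ggplot code:

```python
# Sketch: registering a custom scale with matplotlib, after which
# ax.set_xscale('sqrt') works like the built-in 'log'.
import numpy as np
from matplotlib import scale as mscale
from matplotlib import transforms as mtransforms

class SqrtTransform(mtransforms.Transform):
    input_dims = output_dims = 1

    def transform_non_affine(self, a):
        return np.sqrt(np.clip(a, 0, None))

    def inverted(self):
        return SquareTransform()

class SquareTransform(mtransforms.Transform):
    input_dims = output_dims = 1

    def transform_non_affine(self, a):
        return np.asarray(a) ** 2

    def inverted(self):
        return SqrtTransform()

class SqrtScale(mscale.ScaleBase):
    name = 'sqrt'

    def get_transform(self):
        return SqrtTransform()

    def set_default_locators_and_formatters(self, axis):
        pass  # keep matplotlib's defaults in this sketch

mscale.register_scale(SqrtScale)
# after this: ax.set_xscale('sqrt')
```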
But I'm more and more convinced we should a) do this after introducing the `layer` and b) do a small prototype first (not the complete ggplot style, but simply take some array, do some data manipulations and then use the manipulated data to plot -> a short version of what happens on `draw()`).
Regarding boxplots: I looked at the `bxp` method on the matplotlib Axes and I'm really not looking forward to duplicating that code... I think we should hold off on such things until we have done the easier steps first.
With the need for normalisations, I had a feeling we were moving into projections territory. Although projections are avoidable for now -- until we need to do coordinate transforms -- we might as well start exploring all that they could offer.
Yes, free ticks: we will already have managed the limits properly. Plus, Matplotlib's tick generation is better than ggplot2's for non-linear scales. We may have to do "ticks" for categorical histograms, but that is the odd case.
Either the first step with the scales or introducing the `layer` can come first. The first reason I want the scales first is to limit the changes/bugfixes that would otherwise go into parts we now consider obsolete; the second is that scale training can then be done during the `layer` introduction. However, the two do not conflict.
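For clarity, what I mean by scale training, sketched for a continuous scale:

```python
# Scale training for a continuous scale: the range is widened over
# each layer's data (and each facet, when the scales are not free).
class scale_x_continuous(object):
    def __init__(self):
        self.range = None

    def train(self, values):
        lo, hi = min(values), max(values)
        if self.range is None:
            self.range = [lo, hi]
        else:
            self.range = [min(self.range[0], lo),
                          max(self.range[1], hi)]
```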
A proper boxplot implementation can serve as a good test of our code organisation: right from `stat_boxplot`, what it CREATES and how we store those results (points, lines and boxes) into the dataframe, all the way to (maybe) `geom_boxplot._plot_unit` re-using the other geoms to do the plotting. Plus, if using notches, dealing with that. When the time comes, it may be a good idea to think about it along with `geom_violin`.
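Roughly what `stat_boxplot` would CREATE in the dataframe — a sketch with pandas; ggplot2's real stat also produces a few more columns (outliers, notches, widths):

```python
# Sketch: per-group boxplot statistics stored back into a dataframe,
# the raw material geom_boxplot (and geom_violin later) would draw from.
import pandas as pd

def stat_boxplot(df, x, y):
    def summarise(g):
        q1, q2, q3 = g[y].quantile([0.25, 0.5, 0.75])
        iqr = q3 - q1
        return pd.Series({
            'ymin': g[y][g[y] >= q1 - 1.5 * iqr].min(),
            'lower': q1, 'middle': q2, 'upper': q3,
            'ymax': g[y][g[y] <= q3 + 1.5 * iqr].max()})
    return df.groupby(x).apply(summarise).reset_index()

# geom_boxplot._plot_unit could then draw these with the rect and
# segment geoms rather than matplotlib's bxp
```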
This is quite interesting: http://cpsievert.github.io/2014/06/visualizing-ggplot2-internals-with-shiny-and-d3/
Somewhere we also need to fix discrete values and tick labels: right now you can add labels for all discrete values, but only 5 or so are shown. That's because the default Locator only shows 5, and we probably need to add the right locator somewhere. In my code I currently do that in matplotlib:

```python
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator

# place a tick at every multiple of 1, i.e. at every discrete value
plt.gca().xaxis.set_major_locator(MultipleLocator())
```
I am chipping away at this -- slowly -- and it will be a while before anything is working. It will be a much wider refactoring and will include facetting. It will touch on #259 and maybe even make all of it unnecessary.
The best way to get a plot process that adheres to the process described in Section 2 of http://vita.had.co.nz/papers/layered-grammar.pdf is to mimic https://github.com/hadley/ggplot2/blob/master/R/plot-build.r, which is the "high level" itemized above.
There is no way to avoid it, this will be very destructive.
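For reference, the order of operations in plot-build.r, transcribed as a Python skeleton — every method below is hypothetical, only the sequence matters; step 5 is the reset/retraining that came up earlier:

```python
# The order of operations in ggplot2's plot-build.r, as a skeleton.
def build(plot):
    data = [layer.layer_data(plot.data) for layer in plot.layers]

    # 1. map aesthetics and assign each row to a facet panel
    data = [l.map_aesthetics(d) for l, d in zip(plot.layers, data)]
    data = [plot.facet.assign_panels(d) for d in data]

    # 2. apply the scale transformations BEFORE the statistics
    data = [plot.scales.transform_df(d) for d in data]

    # 3. train x/y, then compute the statistics per panel
    plot.panel.train_position(data, plot.scales)
    data = [l.compute_statistic(d, plot.panel)
            for l, d in zip(plot.layers, data)]

    # 4. geom setup and position adjustments
    data = [l.setup_geom_data(d) for l, d in zip(plot.layers, data)]
    data = [l.adjust_position(d) for l, d in zip(plot.layers, data)]

    # 5. the stats may have created values outside the original
    #    ranges (e.g. bar heights), hence the reset and retraining
    plot.panel.reset_scales()
    plot.panel.train_position(data, plot.scales)
    data = plot.panel.map_position(data)

    # 6. train and map the non-position scales (colour, size, ...)
    plot.scales.train_df(data)
    data = [plot.scales.map_df(d) for d in data]
    return data
```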
Possibly #248 and #306 are related -- I keep running into that issue. I have another variant where adding a `geom_text` messes the scales up and moves the graph viewport away from most of the points, seemingly at random.
I think the problem(s) may be coming from more than one source. In the plot building pipeline there isn't a clear distinction between what should be handled by ggplot and what is handled by matplotlib. The best solution is to fix the underlying architecture to adhere to the same process as ggplot2. This is what I am trying to do in #360; it is big (almost a complete rewrite) and I haven't made any commits lately.
@has2k1 no worries, not trying to pressure you, just trying to tie bugs together and investigate further so if/when you tackle this you'll have less hunting to do.
One thing I'd appreciate is knowing where I'd start looking if I wanted to fix the outputs manually -- like, after the matplotlib elements are generated, where/how would I intercept them before they get repr()'d, so I can, say, manually fix the viewport or investigate what's going on?
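Something along these lines is what I'm after (assuming the plot is drawn onto the current pyplot figure, which I believe it is):

```python
# Grab the live axes after ggplot has drawn, before display.
import matplotlib.pyplot as plt

# ... after the ggplot object has drawn itself ...
ax = plt.gca()
print(ax.get_xlim(), ax.get_ylim())  # inspect the viewport
ax.set_xlim(0, 50)                   # or fix it by hand
plt.show()
```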
@has2k1 Perfect, thank you so much.
Currently, the scales that are present are just holders for settings passed onto the ggplot object at `__radd__` time. The scales need to be more complete, with knowledge of the data domain and the functions to map the domain onto the aesthetic range, i.e. the current `assign_*` functions need to be refactored into the scales.

The scales should be added to the ggplot object and should remain separate objects until plot time, e.g. under `ggplot.scales`. Also, only the ggplot scales should know about the transformations; the matplotlib scales available through `ax` should know nothing.

All this is key for, among other things, a straightforward facetting implementation and statistic computations. Scale training will also be easy this way, and it needs to be done to avoid a bunch of edge cases, e.g. `geom_text` and `geom_bar`
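In miniature (provisional names):

```python
# Scales as objects that know their transformation, collected under
# ggplot.scales at __radd__ time; the mpl scales on `ax` never hear
# about any of this.
import math

class scale(object):
    aesthetic = None

    @staticmethod
    def transform(x):
        return x  # identity by default

    def __radd__(self, gg):
        # `gg + scale_x_log10()` lands here
        gg.scales.append(self)
        return gg

class scale_x_log10(scale):
    aesthetic = 'x'
    transform = staticmethod(math.log10)
```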