yhat / ggpy

ggplot port for python
http://yhat.github.io/ggpy/
BSD 2-Clause "Simplified" License

Subsetting data in plot #116

Closed kevindavenport closed 10 years ago

kevindavenport commented 10 years ago

In R I can plot additional points based on some other criteria using the R subset command as follows:

%%R -i DF_diff_xy # list object to be transferred to python here
install.packages("ggplot2") # Had to add this for some reason, shouldn't be necessary
library(ggplot2)
df = data.frame(DF_diff_xy)
plot = ggplot(df, aes(x = X, y = Y)) + 
geom_point(alpha = .8, color = 'dodgerblue',size = 5) +
geom_point(data=subset(df, Y >= 6.7 | X >= 4), color = 'red',size = 6) +
theme(axis.text.x = element_text(size= rel(1.5),angle=90, hjust=1)) +
ggtitle('Distance Pairs with outliers highlighted in red')
print(plot)

In Python, my thinking was that I could specify a row slice of a dataframe for the additional highlight, like so:

from ggplot import *

ggplot(DF_diff_xy, aes(x = 'X', y ='Y')) + \
    geom_point(alpha=1, size=100, color='dodgerblue') + \
    geom_point(data = DF_diff_xy[:1], alpha=1, color='black')

This didn't work, however. Any ideas?

Thanks, Kevin Davenport http://kldavenport.com
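For reference, the pandas counterpart of R's subset(df, Y >= 6.7 | X >= 4) is a boolean mask rather than a positional slice such as DF_diff_xy[:1]. A minimal sketch of the difference, assuming a small made-up frame in place of DF_diff_xy (the real values are not shown in this thread):

```python
import pandas as pd

# Made-up stand-in for DF_diff_xy; the actual data is not part of the thread.
df = pd.DataFrame({"X": [1.0, 2.5, 4.2, 3.0], "Y": [2.0, 7.1, 1.5, 3.3]})

# Positional slice: always the first row, whatever its values are.
first_row = df[:1]

# Boolean mask: keeps the rows where either condition holds.
# The parentheses are required because | binds more tightly than >=.
outliers = df[(df.Y >= 6.7) | (df.X >= 4)]

# The same filter with DataFrame.query, which reads closer to R's subset().
outliers_q = df.query("Y >= 6.7 or X >= 4")
```

Whether a frame built this way can be passed to a geom depends on the plotting library accepting a data argument per geom, which is the subject of the replies that follow.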

jankatins commented 10 years ago

Currently it is not possible to specify data per geom.

@glamp Currently the ggplot._get_layers() method is not really the equivalent of ggplot2's "layer": the real layer information is in ggplot.geoms. But since the data is set in ggplot._get_layers(...) and the geoms are iterated afterwards, the geom (the real "layer") can't set its own dataset. I would suggest changing the iteration to

for geom in geoms:
    _data = geom.data if geom.data is not None else self.data
    for sub_layer in self._get_layers(_data):
         [...]

geom.__init__() would then pop data from the args and save it to geom.data.
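The suggestion above can be sketched in isolation. This is only an illustration under stated assumptions: Geom, Plot, and layers() are hypothetical stand-ins, not the ggpy classes, and plain lists of dicts replace DataFrames so the sketch has no dependencies.

```python
class Geom:
    """Hypothetical stand-in for a geom; not the ggpy class."""

    def __init__(self, **kwargs):
        # Pop the optional per-geom dataset so it is not mistaken for a parameter.
        self.data = kwargs.pop("data", None)
        self.params = kwargs


class Plot:
    """Hypothetical stand-in for the ggplot object."""

    def __init__(self, data):
        self.data = data
        self.geoms = []

    def __add__(self, geom):
        # Mutates and returns self to keep the sketch short.
        self.geoms.append(geom)
        return self

    def layers(self):
        """Yield (geom, data) pairs, preferring the geom's own dataset."""
        for geom in self.geoms:
            # Explicit None check: `geom.data or self.data` would raise for a
            # pandas DataFrame, whose truth value is ambiguous.
            data = geom.data if geom.data is not None else self.data
            yield geom, data


all_rows = [{"X": 1, "Y": 2}, {"X": 5, "Y": 7}]
subset = [row for row in all_rows if row["Y"] >= 6.7 or row["X"] >= 4]

plot = Plot(all_rows) + Geom(color="dodgerblue") + Geom(data=subset, color="red")
resolved = [(geom.params["color"], len(data)) for geom, data in plot.layers()]
```

Here the first geom falls back to the plot-level data (two rows), while the second uses its own subset (one row).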

jankatins commented 10 years ago

OK, it's (of course... :-/ ) not that easy. When you do that (and transform the geom's data with the aes, as ggplot.__init__() does; geom-specific aes mapping is not implemented yet either), there is an error: the plotting code assumes that some "assigned colors" exist, but for a new dataset they don't. So this also requires looking into how colors and so on are assigned...

One way would be to refactor the assign_*(gg) functions to build_*_mapping(data, aes, legend, gg), which would set the needed columns in data based on the passed in aes and gg (only manual color mapping and so on...). But that's for tomorrow...
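A rough sketch of what such a build_*_mapping function might look like for color. Everything here is an assumption for illustration: the signature is simplified compared to the build_*_mapping(data, aes, legend, gg) proposed above, the palette argument is invented, and the data is a list of dicts rather than a DataFrame.

```python
from itertools import cycle


def build_color_mapping(data, aes, palette=("red", "green", "blue")):
    """Return (rows with a 'color' key added, legend dict).

    Hypothetical helper: `data` is a list of dicts and `aes` maps aesthetic
    names to column names. Rows are copied, so the input is left untouched.
    """
    column = aes.get("color")
    if column is None:
        # No color aesthetic: nothing to assign.
        return [dict(row) for row in data], {}

    # One palette color per distinct value, in order of first appearance.
    legend = {}
    colors = cycle(palette)
    for row in data:
        if row[column] not in legend:
            legend[row[column]] = next(colors)

    mapped = [dict(row, color=legend[row[column]]) for row in data]
    return mapped, legend


rows = [{"X": 1, "kind": "a"}, {"X": 2, "kind": "b"}, {"X": 3, "kind": "a"}]
mapped, legend = build_color_mapping(rows, {"x": "X", "color": "kind"})
```

Because the function takes the data as an argument instead of reading it from the plot object, it could be called once per geom dataset, which is what per-geom data requires.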

jankatins commented 10 years ago

This can be closed

kevindavenport commented 10 years ago

Awesome Jan, thank you for your contribution. I think I can update http://kldavenport.com/mahalanobis-distance-and-outliers/ now :)

jankatins commented 10 years ago

Let's see if it works for you :-)

jankatins commented 10 years ago

Downloaded your ipynb and ran it here: it works :-)

jankatins commented 10 years ago

Just for reference, here are the changes I had to make:

# Needed because in the latest pandas, Series are no longer numpy arrays...
# see https://github.com/pydata/pandas/issues/5698
xydata = DF_diff_xy.values
xycols = DF_diff_xy.columns
--
%%R -i xydata,xycols # list object to be transferred to python here
install.packages("ggplot2") # Had to add this for some reason, shouldn't be necessary
library(ggplot2)
df = data.frame(xydata)
names(df) <- c(xycols)
plot = ggplot(df, aes(x = X, y = Y)) + 
geom_point(alpha = .8, color = 'dodgerblue',size = 5) +
geom_point(data=subset(df, Y >= 6.7 | X >= 4), color = 'red',size = 6) +
theme(axis.text.x = element_text(size= rel(1.5),angle=90, hjust=1)) +
ggtitle('Distance Pairs with outliers highlighted in red')
print(plot)
--
from ggplot import *

ggplot(DF_diff_xy, aes(x = 'X', y ='Y')) + \
    geom_point(alpha=1, size=100, color='dodgerblue') + \
    geom_point(data=DF_diff_xy[(DF_diff_xy.Y >= 6.7) | (DF_diff_xy.X >= 4)], alpha=1, size=100, color='red')

kevindavenport commented 10 years ago

Just tried it, works perfectly! Will start updating my blog post now :)