wearepal / data-science-types

Mypy stubs, i.e., type information, for numpy, pandas and matplotlib
Apache License 2.0
202 stars 52 forks source link

slicing with pandas .loc #71

Open PaddyAlton opened 4 years ago

PaddyAlton commented 4 years ago

Hi! Cool project, and great to finally be able to not just ignore missing imports in mypy.ini!

I've just been getting started today trying out the library with some pre-existing code that heavily uses pandas. I know it's a work-in-progress so was not too surprised to get a few errors.

Some of these were just missing bits of functionality that would, I think, be straightforward additions - things like pd.date_range, pd.to_datetime, pd.tseries and the like. I'm hoping to find some time to contribute, since I see you encourage it. :)

I was a bit less sure about about the results I got for this pattern: df.loc[:, "column_name"] =, for which mypy threw this:

error: Invalid index type "Tuple[slice, str]" for "_LocIndexerFrame"; expected type "Tuple[Union[str, str_], Union[str, str_]]"

This was solveable with a minor refactor, but I had to just type: ignore this one:

error: No overload variant of "__getitem__" of "_LocIndexerFrame" matches argument type "slice"

which appeared when doing df.loc["2018-10":, "column_name"]/df.loc[datetime_object:, "column_name"] (using the date-time index slicing functionality)

I wondered whether supporting slices is unfeasible or something you'd hope to include? .loc's behaviour is pretty complex so I appreciate type-annotating fully would be painful!

A related issue was this mypy error:

error: Invalid index type "Tuple[Series[bool], str]" for "_LocIndexerFrame"; expected type "Tuple[Union[Union[Series[bool], ndarray[bool_], List[bool]], List[str]], Union[Union[Series[bool], ndarray[bool_], List[bool]], List[str]]]"

This one comes about from df.loc[boolean_series, "column_name"] and my examination of the expected type showed me I could refactor to df.loc[boolean_series, ["column_name"]] to get the same functionality - looks to me as though, as implemented, the type annotations allow you to either pass two collections to .loc or two labels, but not a mixture?

Just want to check what's in-scope for this project before slinging PRs around!

tmke8 commented 4 years ago

Hi!

pandas indexing is indeed incredibly complex and a major source of headaches for this project.

Currently, .loc indexing accepts these kinds of inputs:

_MaskType = Union[Series[bool], _np.ndarray[_np.bool_], List[bool]]
_StrLike = Union[str, _np.str_]

class _LocIndexerFrame:
    # get item
    @overload
    def __getitem__(self, idx: _MaskType) -> DataFrame: ...
    @overload
    def __getitem__(self, idx: _StrLike) -> Series: ...
    @overload
    def __getitem__(self, idx: List[_StrLike]) -> DataFrame: ...
    @overload
    def __getitem__(
        self, idx: Tuple[Union[_MaskType, List[str]], Union[_MaskType, List[str]]],
    ) -> DataFrame: ...
    @overload
    def __getitem__(self, idx: Tuple[_StrLike, _StrLike]) -> float: ...

We should definitely accept slices here. This should be as easy as adding slice to the _MaskType:

_MaskType = Union[slice, Series[bool], _np.ndarray[_np.bool_], List[bool]]

(slice is a built-in type that's used for slices.)

df.loc[boolean_series, "column_name"] should also definitely work. We would need to add a new overload for that. Probably like this:

    @overload
    def __getitem__(
        self, idx: Tuple[Union[_MaskType, _StrLike], Union[_MaskType, _StrLike]],
    ) -> Series: ...

Although I'm not sure whether this returns a Series or a DataFrame.

I will add these to the library in the coming days, but you're also very welcome to send a PR if you like :)

Some of these were just missing bits of functionality that would, I think, be straightforward additions - things like pd.date_range, pd.to_datetime, pd.tseries and the like. I'm hoping to find some time to contribute, since I see you encourage it. :)

We don't have any type stubs for datetime functionality yet. So you might need to add some type stubs for datetime objects, but otherwise it should be straight-forward.

Before you spend too much time on it, I should warn you though that pandas seems to have started adding type annotations to the code itself. However, right now they don't expose this type information to the user (which is why mypy complains). Also, their type annotations are very permissive because they want to capture all allowed inputs. But yeah, the pandas aspect of this library soon might not be needed anymore :D

PaddyAlton commented 4 years ago

Thanks :)

Ah, I was wondering about that - diving into the pandas library GitHub to understand how they implement indexing did show that they have lots of type-annotations, so I didn't quite follow why my type-checker wasn't picking them up.