ycroissant / plm

Panel Data Econometrics with R
GNU General Public License v2.0
49 stars 13 forks source link

Allowing data-masking computations on pdata.frame or: implementing the pdata.frame as a data frame of pseries in a memory efficient way. #15

Closed SebKrantz closed 2 years ago

SebKrantz commented 2 years ago

As already presented to Kevin in e-mail conversation, an idea I had for enhancing the pdata.frame to allow pseries methods to apply inside data masking functions e.g. with(pdata, lag(variable)) or lm(var1 ~ lag(var2) + between(var3), pdata), was to make the pdata.fame a data frame of pseries. The problem with this is obviously memory efficiency: Attaching the index and names attribute to every column in the data frame inflates the size of the frame multiple times its original size. Having thought more about this, I present here a solution I find both simple and elegant:

I create an external pointer to the "index" attribute of the pdata.frame (using a bit of C code) and attach it to every column inside the frame, and add a class 'pseries' to each column. The pdata.frame() function was modified, adding the lines

    iptr <- .Call(C_create_eptr, index)
    mkps <- function(x) {
      attr(x, "index") <- iptr
      class(x) <- c("pseries", class(x))
      x
    }
    x <- lapply(x, mkps)

right before the attributes are attached at the end of the function. Then, I created a function getpsidx() to retrieve the "index" attribute of its argument as follows:

getpsidx <- function(x) {
  ix <- attr(x, "index")
  switch(typeof(ix), externalptr = .Call(C_get_eptr, ix), ix)
}

If ixis an external pointer, the C function simply fetches the index from it (which works globally, independent of the evaluation environment; the 'index' attribute is protected from garbage collection as long as the pointer exists). Otherwise, if the index is already a list (or something else), it is just returned.

Finally, I modify all pseries methods in the package, replacing attr(x, "index") or similar with getpsidx(x) or similar, and I have also changed $<-.pdata.frame as follows:

"$<-.pdata.frame" <- function(x, name, value) {
  if (!inherits(value, "pseries")) class(value) <- c("pseries", class(value))
  attr(value, "index") <- .Call(C_create_eptr, attr(x, "index"))
  "$<-.data.frame"(x, name, value)
}

Note that this modification only applies to pseries stored in the frame. I do not modify the $.pdata.frame or [[.pdata.frame mehods or any other methods which create proper pseries when pulling them out of the frame.

With the modification presented here, data-masking computations work fine (the pseries methods apply properly), it is memory efficient (the size of the frame remains the same), and it does not impact in any way the operation of the package. Of course, the change requires the addition of a src/ directory and a single C file (named extptr.c that defines and exports the callable C functions create_eptr and get_eptr).

SebKrantz commented 2 years ago

@eddelbuettel perhaps if you have time you could have a brief look if the external pointer is instituted properly for the purpose here. I am particularly still wondering if the finalizer makes sense.

eddelbuettel commented 2 years ago

Hi @SebKrantz -- I can try to take a look. We use them in several packages, but I ended up needing explicit finalizers in only one (after valgrind slapped my fingers). So I give you a firm shrug :man_shrugging: too and recommend testing. Also paging @enchufa2 who knows his way around XPtr and back a few times over.

SebKrantz commented 2 years ago

Thanks @eddelbuettel thats good to know that it might not even be necessary. The external pointer is used here to point to and protect the 'index' attribute of a pdata.frame (a list of 2 factors). The code I used here is very simple, I just need to make sure that the 'index' is never removed from memory as long as the pointer is around, but of course if the pdata.frame was deleted alltogether, including the external pointers, it should be possible to garbage collect. I though that's what the finalizer was for, but I might be wrong.

eddelbuettel commented 2 years ago

Yes, that is precise what should happen and generally does.

Enchufa2 commented 2 years ago

The thing that may or may not be required is the onexit=TRUE argument. Not used here, but for reference, Rcpp historically sets it to FALSE, because it is generally not needed (after all, when the session quits, all the memory is returned to the OS anyway). However, sometimes the order in which this memory is freed matters. And this is what Dirk was talking about, and that's why we recently exposed RCPP_USE_FINALIZE_ON_EXIT (which could become the new default at some point).

TL;DR, in general (and probably in this case too), onexit=TRUE is not required. But when it is, it hits hard, and setting it rarely harms, so... I would set it, as you did. :)

SebKrantz commented 2 years ago

Thanks a lot @Enchufa2 for the great explanation, really appreciate it. Then I am confident that this will work, but we will test it of course (It's up to plm maintainers now to take this further).

SebKrantz commented 2 years ago

This has now been implemented in collapse: https://sebkrantz.github.io/collapse/reference/indexing.html. Feel free to take advantage of it. the to_plm() function turns indexed series and frames into regular plm objects...