xdf-modules / pyxdf

Python package for working with XDF files
BSD 2-Clause "Simplified" License

Simplify clock sync #71

Closed cbrnr closed 4 years ago

cbrnr commented 4 years ago

While going through the _clock_sync function, I found an opportunity to simplify the code a bit. Furthermore, it seems like winsor_threshold was not really used as intended - instead of winsorizing, the code merely divided the clock values by it. Therefore, I've removed the parameter.

cboulay commented 4 years ago

I'm guessing there's a reason Christian wanted the winsorizing in the Python XDF importer. Let's get his input. @chkothe

chkothe commented 4 years ago

Hey Clemens and Chad -- that parameter is indeed relevant for the robust estimation on the next line (it's the delta parameter of the Huber loss function, above which deviations from the mean are treated as potential outliers and thus handled robustly).
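For reference, the standard definition of the Huber loss with threshold δ (added here for context; not quoted from the pyxdf code) is quadratic for small residuals and linear for large ones:

```math
L_\delta(r) =
\begin{cases}
  \tfrac{1}{2}\, r^2 & \text{if } |r| \le \delta, \\
  \delta \left( |r| - \tfrac{\delta}{2} \right) & \text{otherwise,}
\end{cases}
```

so residuals up to δ are penalized quadratically (L2) and larger ones linearly (L1), which is the behavior discussed in the following comments.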

cboulay commented 4 years ago

@cbrnr It isn't used only for the clock values, but also for X two lines above.

As best I can tell, values below rho will have their residuals penalized with L2, whereas values above rho will have their residuals penalized with L1. Instead of passing this value to the robust regression function as the rho value, you divide both x and y by the value and use the default rho=1. Is this equivalent? Am I interpreting the intention properly? If so, do you think we could simply pass the value as the robust regression function's argument? Are there any numerical precision considerations I'm missing?

While we're on this topic, is that still considered 'winsorizing'? There doesn't seem to be any clipping. Maybe we should rename the parameter?
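To make the equivalence question above concrete, here is a minimal, hypothetical sketch (plain NumPy/SciPy, not pyxdf's _robust_fit; the data and the huber_fit helper are made up for illustration). Since huber(δ, r) = δ² · huber(1, r/δ), dividing both X and y by the threshold and fitting with a unit threshold should give the same coefficients as fitting the original data with an explicit threshold, up to an overall scaling of the loss:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import huber  # huber(delta, r): quadratic below delta, linear above

def huber_fit(X, y, delta):
    """Illustrative Huber-loss regression (not the ADMM solver in pyxdf)."""
    def loss(b):
        return huber(delta, y - X @ b).sum()
    b0 = np.linalg.lstsq(X, y, rcond=None)[0]  # ordinary least squares as a starting point
    return minimize(loss, b0).x

rng = np.random.default_rng(0)
t = rng.uniform(0, 100, 200)                   # stand-in for clock times
X = np.column_stack([np.ones_like(t), t])      # intercept + slope design matrix
y = 0.05 + 1e-4 * t + rng.normal(0, 1e-3, t.size)
y[::25] += 0.5                                 # a few gross outliers

delta = 0.005
b_direct = huber_fit(X, y, delta)               # explicit Huber threshold
b_scaled = huber_fit(X / delta, y / delta, 1.0) # scaled data, threshold fixed at 1
print(b_direct)
print(b_scaled)  # should agree with b_direct up to optimizer tolerance
```

Whether this carries over exactly to the ADMM-based _robust_fit, where rho plays a different role (see below), is the open question in this thread.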

cbrnr commented 4 years ago

I don't see where winsor_threshold is used except for dividing X and y - that just scales the clock times and offsets by an equal amount. _robust_fit is then called on X and y with the default rho=1. If rho should indeed be rho / winsor_threshold (or some other scaled value), then I think it would be better to integrate that into the parameter itself.

But this is more or less an API decision. Currently, I find it really difficult to see how this algorithm works when part of a parameter needs to be worked into the data before the function is applied. I'll revert this change and only keep the simplified reshaping of X.

chkothe commented 4 years ago

@cbrnr yes, the intention of those divisions should have been documented better, my fault. @cboulay the purpose of the rho parameter is actually unrelated (it's a coupling parameter in the alternating-directions method), although it's possible that it does ultimately simplify in the way you describe (I haven't checked it myself) -- normally, the Huber threshold would be a third parameter that would have to be worked into the optimizer (I think some of the 1's in the code would have to be replaced by it). Now, while that would make things more explicit, and perhaps allow the robust fit method to be used in other contexts, for now it would be a purely internal reshuffling, and we'd need to implement some strong tests to ensure that the behavior stays identical (it's easy for breakages to go unnoticed in that sort of code, especially if the test data don't exhibit the kinds of pathologies the method is designed to protect against).

chkothe commented 4 years ago

@cboulay re the naming, indeed it's a bit of a misnomer, although there's a subtle correspondence between the two methods (one trims the input data, the other trims an iterate). I think one could document that it's technically a misnomer and/or have a transition period with a deprecation warning. In fact, if you guys want to rename it, maybe one shouldn't rename it to huber_threshold (since we might swap out that method some day, which would incur yet another API breakage) but rather to something like robust_threshold. For now I'd personally err on the side of not breaking things for users without a strong reason (in line with the Linux mantra of never breaking user space) -- as in, there has to be enough upside to make up for things erroring out for scores of data analysts, and the cascade of issues that comes with that.

cbrnr commented 4 years ago

I agree that we don't have to break/change anything for this function because it's not part of the public API. However, adding some explanations would help a great deal.

cbrnr commented 4 years ago

I've reverted all changes related to winsor_threshold and left only the array creation simplification.

cbrnr commented 4 years ago

I think it's the other way round: /= is in-place division, whereas X / winsor_threshold allocates a new array. But like you said, it's no big deal here anyway.

cboulay commented 4 years ago

I know I've encountered situations where X /= a didn't work because of a shape mismatch with an intermediate variable that numpy creates, but X = X / a worked fine. I guess that doesn't mean the latter doesn't make a copy. Maybe it's because of the copy that it works better in some cases? Anyway, this is way off-topic. I think the change is fine as is.
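For reference, here is a minimal NumPy sketch (unrelated to the pyxdf arrays in this PR; the values are made up) of two situations where X /= a fails while X = X / a works, which may be what was encountered above:

```python
import numpy as np

# 1) dtype casting: true division of an integer array produces floats,
#    which cannot be written back into the integer array in place.
X = np.arange(6)        # integer dtype
Y = X / 2.0             # fine: allocates a new float array
try:
    X /= 2.0            # fails: cannot cast the float result back into X
except TypeError as err:
    print("in-place division failed:", err)

# 2) broadcasting: if the result broadcasts to a larger shape than X,
#    it cannot be stored back into X in place.
X = np.ones(3)
a = np.ones((3, 1))
Z = X / a               # fine: result has shape (3, 3)
try:
    X /= a              # fails: a (3, 3) result does not fit into a shape-(3,) array
except ValueError as err:
    print("in-place division failed:", err)
```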