I'm guessing there's a reason Christian wanted the winsorizing in the python xdf importer. Let's get his input. @chkothe
Hey Clemens and Chad -- that parameter is indeed relevant for the robust estimation on the next line (it's the delta parameter of the Huber loss function, above which deviations from the mean are treated as potential outliers and thus penalized robustly).
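For reference, a minimal sketch of that loss (generic numpy, not the importer's code; `huber_loss` and the `delta` keyword are just my names for what's described above):

```python
import numpy as np

def huber_loss(residuals, delta=1.0):
    """Quadratic (L2) for |r| <= delta, linear (L1-like) beyond it."""
    r = np.abs(residuals)
    return np.where(r <= delta,
                    0.5 * r ** 2,               # small deviations: ordinary least squares
                    delta * (r - 0.5 * delta))  # large deviations: treated as potential outliers
```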
@cbrnr It isn't used only for the clock values, but also for `X` two lines above.
As best as I can tell, residuals below `rho` are penalized by L2, whereas residuals above `rho` are penalized by L1. Instead of passing this value to the robust regression function as the `rho` argument, you divide both `X` and `y` by the value and use the default `rho=1`. Is this equivalent? Am I interpreting the intention properly? If so, do you think we could simply pass the value to the robust regression function as an argument? Are there any numerical precision considerations I'm missing?
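If that reading is right, the division trick should be exactly equivalent: the Huber loss at threshold 1 of `r / t` equals the loss at threshold `t` of `r` divided by `t**2`, so the minimizing coefficients are unchanged. A quick sanity check of that claim (generic code, not pyxdf's `_robust_fit`):

```python
import numpy as np
from scipy.optimize import minimize

def huber(r, delta):
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * a ** 2, delta * (a - 0.5 * delta))

def huber_fit(X, y, delta):
    """Minimize the summed Huber loss of y - X @ beta (for illustration only)."""
    obj = lambda beta: huber(y - X @ beta, delta).sum()
    return minimize(obj, np.zeros(X.shape[1]), method="Nelder-Mead",
                    options={"xatol": 1e-8, "fatol": 1e-8, "maxiter": 2000}).x

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), np.linspace(0, 1, 50)])
y = 2.0 + 3.0 * X[:, 1] + rng.normal(0, 0.01, 50)
y[::10] += 5.0                                    # inject a few outliers

t = 0.1                                           # the threshold in question
print(huber_fit(X, y, delta=t))                   # pass the threshold explicitly
print(huber_fit(X / t, y / t, delta=1.0))         # divide the data, keep delta = 1
# the two coefficient vectors should agree up to optimizer tolerance
```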
While we're on this topic, is that still considered 'winsorizing'? There doesn't seem to be any clipping. Maybe we should rename the parameter?
I don't see where `winsor_threshold` is used except for dividing `X` and `y` - that just scales the clock values and offsets by an equal amount. `_robust_fit` is then called with `X` and `y` and the default `rho=1`. If `rho` should indeed be `rho / winsor_threshold` (or some other scaled value), then I think it would be better to integrate that into the parameter itself.

But this is more or less an API decision; currently I find it really difficult to see how this algorithm works if part of a parameter needs to be worked into the data before applying the function. I'll revert this change and only leave the simplified reshaping of `X`.
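To make the "scales the values by an equal amount" point concrete (plain numpy, not the pyxdf code): for an ordinary least-squares fit the division really would be a no-op, and it only has an effect here because the robust loss has an absolute threshold:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(20), rng.random(20)])
y = X @ [1.0, 2.0] + rng.normal(0, 0.1, 20)

# Ordinary least squares: dividing X and y by the same constant changes nothing,
# which is why the division looks like a no-op at first glance...
b1 = np.linalg.lstsq(X, y, rcond=None)[0]
b2 = np.linalg.lstsq(X / 7.0, y / 7.0, rcond=None)[0]
print(np.allclose(b1, b2))  # True

# ...whereas a Huber-type fit has an absolute threshold, so dividing the data
# effectively sets that threshold relative to the residuals (see the check above).
```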
@cbrnr yes the intention of those divisions should have been documented better, my fault. @cboulay the purpose of the rho parameter is actually unrelated (it's a coupling parameter in the alternating-directions method), although it's possible that it does ultimately simplify in the way you describe (without having checked it myself) -- but normally the Huber threshold would be a third parameter that would have to be worked into the optimizer (I think some of the 1's in the code would have to be replaced by that). Now, while that would make things more explicit, and perhaps allow the robust fit method to be used in other contexts, for now it'd be a purely internal reshuffling, and we'd need to implement some strong tests to ensure that the behavior is identical (since it's easy for breakages to go unnoticed in that sort of code, especially if the test data don't exhibit the kinds of pathologies that the method is designed to protect against).
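To sketch where such a third parameter would sit, here is a generic ADMM Huber fit in the standard (Boyd et al.) formulation -- this is NOT pyxdf's actual `_robust_fit`, and the `delta` keyword is the hypothetical extra parameter; with `delta = 1` the `delta` factors in the z-update reduce to the 1's mentioned above:

```python
import numpy as np

def admm_huber_fit(X, y, delta=1.0, rho=1.0, iters=200):
    """Generic ADMM Huber fitting sketch (assumes X has full column rank).

    rho is the ADMM coupling parameter; delta is the Huber threshold.
    """
    n, d = X.shape
    beta, z, u = np.zeros(d), np.zeros(n), np.zeros(n)
    XtX = X.T @ X                      # cached for the beta-update
    for _ in range(iters):
        # beta-update: least squares against the current consensus target
        beta = np.linalg.solve(XtX, X.T @ (y + z - u))
        # z-update: proximal operator of the Huber loss (threshold delta, weight 1/rho)
        v = X @ beta - y + u
        small = np.abs(v) <= delta * (1.0 + 1.0 / rho)
        z = np.where(small, v * rho / (rho + 1.0), v - (delta / rho) * np.sign(v))
        # u-update: running sum of the constraint residual X @ beta - y - z
        u += X @ beta - y - z
    return beta
```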
@cboulay re the naming, indeed it's a bit of a misnomer, although there's a subtle correspondence between the two methods (one trimming the input data and the other trimming an iterate). I think one could document that it's technically a misnomer and/or have a transition period with a deprecation warning. In fact if you guys want to rename it, maybe one shouldn't even rename it to huber_threshold then (since we might swap that method some day, which would incur yet another API breakage) but more like robust_threshold. For now I'd personally err on the side of not breaking things for users without a strong reason (in line with the Linux mantra of never breaking user space) -- as in, there has to be enough on the upside to make up for things erroring out for scores of data analysts, and the cascade of issues coming with that.
I agree that we don't have to break/change anything for this function because it's not part of the public API. However, adding some explanations would help a great deal.
I've reverted all changes related to `winsor_threshold` and left only the array creation simplification.
I think it's the other way round: `/=` is in-place division, whereas `X / winsor_threshold` allocates a new array. But like you said, it's no big deal here anyway.
I know I've encountered situations where `X /= a` didn't work because of a shape mismatch in an intermediate variable that numpy creates, but `X = X / a` worked fine. I guess this doesn't mean that the latter doesn't make a copy. Maybe it's because of the copy that it works better in some cases? Anyway, this is way off-topic. I think the change is fine as is.
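For what it's worth, a couple of cases I think fall into that category (plain numpy behavior, unrelated to the pyxdf code): the in-place form fails whenever the result cannot be written back into the existing buffer, e.g. dtype or broadcasting mismatches:

```python
import numpy as np

X = np.arange(5)     # integer dtype
# X /= 2.0           # raises: the float result cannot be cast back into the int buffer
X = X / 2.0          # fine: a new float array is allocated and rebound to X

X = np.ones(3)
a = np.ones((3, 1))
# X /= a             # raises: broadcasting would need to grow X to shape (3, 3)
X = X / a            # fine: X is simply rebound to the new (3, 3) result
```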
While going through the `_clock_sync` function, I found an opportunity to simplify the code a bit. Furthermore, it seems like `winsor_threshold` was not really used as intended - instead of winsorizing, it merely divided the clock values by its value. Therefore, I've removed the parameter.