welch / tdigest

tdigest: javascript implementation of Dunning's T-Digest for streaming quantile approximation
MIT License
69 stars 11 forks source link

[BUG FIX] Continuous centroid sometimes set to string #10

Closed domhauton closed 5 years ago

domhauton commented 5 years ago

While processing a large data set (attached) the percentile calculations return an equation string.

By casting the centroid mean to a string we can prevent this issue.

screen shot 2018-12-05 at 15 42 53

Dataset: my_set.csv.zip

Test Code:

const TDigest = require('./tdigest.js').TDigest;
const csv = require('csv-parser');
const fs = require('fs');

const results = [];

let tdigest = new TDigest();

fs.createReadStream('/Users/dominichauton/Desktop/my_set.csv')
    .pipe(csv())
    .on('data', function(x){
        let value = x['change rate'];
        tdigest.push(value);
        // tdigest.compress()
    })
    .on('end', () => {
        tdigest.compress();
        console.log(tdigest.summary());
        // console.log("P50 ~ "+tdigest.percentile(0.5));
        // console.log("P80 ~ "+tdigest.percentile(0.8));
        // console.log("P90 ~ "+tdigest.percentile(0.9));
        // console.log("P99 ~ "+tdigest.percentile(0.99));
        // console.log("P999 ~ "+tdigest.percentile(0.999));
    });

Result pre-fix:

approximating 199999 samples using 695 centroids
min = 1012878967
Q1  = 3147195358.4957838
Q2  = 396488133-394579145.32856387
Q3  = 497638555-249710228.21663383
max = 999859484

Result post-fix:

approximating 695 samples using 695 centroids
min = 3
Q1  = 469536336
Q2  = 999859484
Q3  = 1539650282
max = 2628793125
welch commented 5 years ago

Hi,

I appreciate your attention to this!

I've always considered it poor form for a numeric utility to quietly coerce improper input like that. Slightly better would be to never switch out of discrete mode if the input is not a numeric type, but even this is another hidden assumption to deal with incorrect input. Cleanest to expect numeric data rather than guessing or coercing.