zambezi / grid

D3 component for drawing grids
MIT License

grid.datum should also accept an object #20

Closed HoraceShmorace closed 8 years ago

HoraceShmorace commented 8 years ago

When working with large datasets, it would be more performant and convenient if grid.datum() could also accept an object, not just an array. This would allow for updating of specific datapoints in specific rows without having to loop through the array to find the correct row (even Array.find and Array.filter are just looping behind the scenes).

So in addition to this usage:

grid.datum([
  { id: 'row1', name: 'First row', foo: 'bar' },
  // ... more rows here
  { id: 'row1000', name: 'One-thousandth row', foo: 'bar' }
]);

... we could also do:

grid.datum({
  'row1': { name: 'First row', foo: 'bar' },
  // ... more rows here
  'row1000': { name: 'One-thousandth row', foo: 'bar' }
});

That way, rather than having to loop through to find the row we want to update, we could directly address it.

// assuming we cached our data in a variable named 'data'
data['row799'].foo = 'baz';
grid.datum(data);
Poetro commented 8 years ago

The order of object keys is not defined, so there is no consistent way to loop through them, unless you do a sort on them.

The Object.keys() method returns an array of a given object's own enumerable properties, in the same order as that provided by a for...in loop (the difference being that a for-in loop enumerates properties in the prototype chain as well).

The for...in statement iterates over the enumerable properties of an object, in arbitrary order. For each distinct property, statements can be executed.
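To make the ordering caveat concrete, here's a small sketch (the variable names are illustrative). Modern engines enumerate integer-like keys in ascending numeric order no matter when they were inserted, and insertion order for plain string keys was only standardized in ES2015:

```javascript
// Integer-like keys are enumerated in ascending numeric order,
// regardless of the order they were inserted in.
var byNumericId = {}
byNumericId[2] = { name: 'second inserted' }
byNumericId[1] = { name: 'first inserted' }
console.log(Object.keys(byNumericId)) // ['1', '2'] -- not insertion order

// Non-integer string keys do keep insertion order in modern engines,
// but that guarantee only arrived with ES2015.
var byStringId = { row2: {}, row1: {} }
console.log(Object.keys(byStringId)) // ['row2', 'row1']
```

So an object keyed by numeric-looking ids could silently reorder the rows.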

Poetro commented 8 years ago

Also you can directly address any element of an array.

data[798].foo = 'baz'
grid.datum(data)
HoraceShmorace commented 8 years ago

There doesn't need to be a 'consistent way' to loop through object keys. Just do it in the order they appear, top to bottom.

And addressing an array via index is not realistic for a real-world implementation. Currently, we get a collection of items (rows), and then we get updates for each individual row every 10 seconds. So if we have 1000 items, every 10 seconds we're looping through them 1000 times.
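To illustrate the cost being described, here is a runnable sketch of the linear scan each update currently requires. The names `rows` and `applyTick` are made up for the example, not part of the grid API:

```javascript
// Build a dataset of 1000 rows, keyed only by a string id.
var rows = []
for (var i = 1; i <= 1000; i++) {
  rows.push({ id: 'row' + i, foo: 'bar' })
}

// Each incoming update carries only an id, so finding its row
// means scanning the array: O(n) work per update.
function applyTick(rows, tick) {
  for (var j = 0; j < rows.length; j++) {
    if (rows[j].id === tick.id) {
      rows[j].foo = tick.foo
      return rows[j]
    }
  }
}

applyTick(rows, { id: 'row799', foo: 'baz' })
console.log(rows[798].foo) // 'baz'
```

With 1000 rows each ticking every 10 seconds, that scan runs roughly 100 times a second.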

HoraceShmorace commented 8 years ago

Or get the keys via Object.keys, and sort them.

mstade commented 8 years ago

Wait, I'm too dumb to get this – how is getting array items via index any different from getting object properties via key? They are the same thing, no?

HoraceShmorace commented 8 years ago

I'll point out that other grids do this. This would make any real-time application with a large dataset significantly more performant.

HoraceShmorace commented 8 years ago

@mstade - Because if you're getting real-time data updates for individual rows, you don't know the index of the row.

Poetro commented 8 years ago

@HoraceShmorace I don't see why setting the datum as an object would be faster from the grid's perspective. It would still have to build an array to loop through it.

mstade commented 8 years ago

Oh wait I think I understand you better now @HoraceShmorace. Let me see if I can break it down.

You have a reference to the row itself – by reference I presume not an actual reference to the row object, but an id of some kind – but don't know its index in the dataset? So what you have to do then is to loop through the array, find the needle by checking id values, update the row when you find it, then render the grid again?

Whereas if you had the dataset being a map instead, you could just index by id directly, thereby not needing the loop to find the row in question?

HoraceShmorace commented 8 years ago

@Poetro - The performance gain in implementation for real-time data applications would significantly outweigh the performance lost to the grid doing that conversion.

HoraceShmorace commented 8 years ago

Again, imagine thousands of rows, all individually updating every 5-10 seconds. It's not even realistic to do with an array. The looping required is incredibly wasteful.

mstade commented 8 years ago

Oh yeah for sure. Let's make an example of this, it'll make a nice test case. A question, are there insertions and deletions as well, or just updates?

HoraceShmorace commented 8 years ago

Yeah, all three.

Edit: Actually, not really. Here's the flow (it's a securities trading application):

  1. We get an array of securities from the server. This list will rarely change, only when the user adds more or fewer securities to their list (so additions/deletions technically happen on the server, resulting in a new dataset). This is cached in a data variable.
  2. We call grid.datum(data).
  3. We loop through our data array, and subscribe to real-time updates for each individual security, every 10 seconds (or less).
  4. As each update is received (potentially 1000s), we loop through the data array to find the affected row (by a security id).
  5. We update the row, and call grid.datum(data) again.
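As a runnable sketch of that flow, with the server fetch, the real-time feed, and the grid all stubbed out so the data flow is visible (every name here is illustrative, not the zambezi/grid API):

```javascript
// Minimal stand-in for the grid: records what it was given and
// how many times it was asked to render.
function fakeGrid() {
  var renders = 0
  return {
    datum: function (d) { this.data = d; renders++; return this },
    renders: function () { return renders }
  }
}

var grid = fakeGrid()
var data = [                                 // 1. securities from the server
  { id: 'AAPL', last: 100 },
  { id: 'MSFT', last: 50 }
]
grid.datum(data)                             // 2. initial render

function onUpdate(update) {                  // 3. the feed pushes an update
  var row = data.find(function (d) {         // 4. linear scan by security id
    return d.id === update.id
  })
  if (row) {
    row.last = update.last                   // 5. update the row...
    grid.datum(data)                         //    ...and render again
  }
}

onUpdate({ id: 'MSFT', last: 51 })
console.log(data[1].last) // 51
```

Step 4 is the scan the issue is about: every tick pays it, even though the row it wants is already known by id.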
Poetro commented 8 years ago

Looping through thousands of rows to find one and update it is much faster than rendering the grid itself. And you can keep references to rows in a Map by id.

var map = new Map()
data.forEach(function (datum) { map.set(datum.id, datum) })
// ...
map.get('row799').foo = 'baz'
grid.datum(data)
mstade commented 8 years ago

@HoraceShmorace thanks for that detail. I've got an example running right now with 1000 rows, each individual row updated every (1000 + 100 * Math.random())ms with some random data. It's pretty snappy to be honest. I'll see if I can make it work in a block and show you, it may very well be a bad example so it'd be good to get your feedback.

HoraceShmorace commented 8 years ago

@mstade - It's not only about performance (and sure, a stand-alone example might be snappy outside of a larger application), but also about implementation complexity.

mstade commented 8 years ago

No I get that @HoraceShmorace, but I want to get the example up so we have something to work with. If we were to implement support for objects as datasets, we'd have a concrete use case to test it against.

mstade commented 8 years ago

Alright, apologies for the delay. Here's the demo I mentioned, where the grid is populated with 1000 rows that are all individually updated on a separate timer. Hit the start updates button to kick it into gear once loaded. For every tick, there's a loop to find the row assigned to that tick, update it with some random data in the username column, render the grid and duck out of the loop. For good measure the dataset is shuffled, although I doubt this makes much of a difference.

Granted this is a contrived example, but it fares much better than I thought it would and the code is hardly complex. It looks like it stutters a fair bit, but I think that's more the perceived performance than actual performance – shortening the interval duration makes the stutter effect go away. Scrolling the grid while it's updating is pretty good too, at least on my machine, more so in Safari than Chrome but both are pretty good. I expected severe jank but can't say I'm experiencing that – mileage may certainly vary. I did not test in Firefox.

Now I guess the million dollar question: is this a representative example of your use case? I think so by reading your explanation. Note however that I only ever do grid.call(table) since the dataset doesn't grow or shrink. I did a quick test with grid.datum(rows).call(table) as well and it seemed to be equally snappy, but I may be lying to make friends. :o)

I'm sure there's plenty of room for performance improvements across the board in this component, but I've got to say this isn't half bad. Upping the rows to 5000 definitely starts to add some very noticeable lag, but still workable as a naive solution. I tried removing the render call from the row update loops and instead have it just run naively every 1000ms – surprisingly this seemed to have quite little effect on at least perceived performance. I also took a quick gander at the memory profile and saw a nice sawtooth pattern all the way, no obvious leaks at least.

A penny for your thoughts @HoraceShmorace. (I'm cheap!)

mstade commented 8 years ago

(For bonus fun: sort by the username column.)

gabrielmontagne commented 8 years ago

Hi, thanks for the suggestion. I understand that you'd like to keep an index of your rows, and that's perfectly cool -- makes a lot of sense for your use case. But I don't think we should implement this at the grid level, because the first thing we'll do on the grid side is to explode the object into rows and forget about your map.

The grid itself doesn't need the rows to be constant, or even indexed or unique. Many times the grid is used with completely fresh data sets that might even "stand in" for the data that was already there -- merging and keeping the data on the client being only one use case -- so object references would be meaningless then. The only "id" the grid cares about is the index in the array you pass it.

I'd suggest, as has been suggested above, that you implement a simple layer for your model, perhaps driven by something as simple as _.indexBy, something along these lines:

  1. you get the array
  2. you index it
  3. you give the grid the array
  4. you update your rows through your index
  5. you call the grid again -- because the array has the reference to the same rows as your index, the grid will update as you expect.
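Those five steps can be sketched in plain JavaScript, with a hand-rolled reduce standing in for _.indexBy and the grid calls left as comments (all names illustrative):

```javascript
var rows = [                                    // 1. you get the array
  { id: 'row1', foo: 'bar' },
  { id: 'row2', foo: 'bar' }
]

var rowById = rows.reduce(function (map, row) { // 2. you index it
  map[row.id] = row
  return map
}, {})

// 3. you give the grid the array: grid.datum(rows)

rowById['row2'].foo = 'baz'                     // 4. update through the index

// 5. you call the grid again: grid.datum(rows)
//    The array holds the same row objects as the index, so the
//    mutation in step 4 is already visible to the grid.
console.log(rows[1].foo) // 'baz'
```

The whole trick is shared references: the map and the array point at the same row objects, so an O(1) lookup through the map updates what the grid renders.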

I know this would not be hard to implement on the grid side, but it would stretch the API -- and would make it less D3-idiomatic and more confusing (as it'd somehow hint that row ids are "understood" by the grid -- which they are not). It would not make the grid itself more performant, even though it would make the grid simpler for you to consume in your use case.

gabrielmontagne commented 8 years ago

Another thing to keep in mind is that the grid does care very much about row order, and even though we "kind of know" that keys on certain browsers will be enumerated in the same order they were added, as @Poetro mentioned above, the order is not defined and we shouldn't trust it.

mstade commented 8 years ago

Thanks for that @gabrielmontagne. Looking at my demo, which is admittedly (and purposefully) quite naive – can you see any obvious ways to improve performance or the general pattern? Not so much in terms of loop or no loop, but things like minimizing calls to the table function in this example, or other generic performance considerations like that?

cristiano-belloni commented 8 years ago

Hi guys. We implemented what @gabrielmontagne was writing about before.

The implementation is in this block, and it provides direct modification via id on change without looping, as requested. Thanks to the fact that we're modifying references, we don't need to provide an object, just index the array before creating the table.

gabrielmontagne commented 8 years ago

@mstade , regarding your question about your rig, we've exposed a couple of optimization hooks for row updates and we've used your rig to test them. Please see the live block here https://bl.ocks.org/gabrielmontagne/d9bca200af7ee7330ed82fa804551565 and the proposed PR #22.

HoraceShmorace commented 8 years ago

I'd argue that there's real value to the consumers of this component to take a more declarative approach to implementation than this highly imperative approach. It's trivial to accept an object as the argument, even if it comes with the caveat that row order can't be guaranteed, so the implementer should then explicitly sort the grid themselves (this actually seems like a best practice either way).

HoraceShmorace commented 8 years ago

I'd also argue that Map isn't well enough supported, and "indexing" a JavaScript array is not actually a normal thing, because that's precisely what objects are for.

However, for now we'll just convert our data object to an array at the last minute.
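That last-minute conversion can be as small as the following sketch (illustrative names, with an explicit sort since, as noted above, plain-object key order shouldn't be relied on for row order):

```javascript
// Keep the data keyed by id for O(1) updates...
var data = {
  row2: { id: 'row2', foo: 'bar' },
  row1: { id: 'row1', foo: 'bar' }
}

data['row2'].foo = 'baz'                 // direct update by id, no scan

// ...and flatten to an array only when handing it to the grid,
// sorting the keys so the row order is deterministic.
var rows = Object.keys(data)
  .sort()
  .map(function (key) { return data[key] })

console.log(rows.map(function (r) { return r.id })) // ['row1', 'row2']
// grid.datum(rows) -- hand the array to the grid as usual
```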

mstade commented 8 years ago

@HoraceShmorace did you check the _.indexBy demo? It's pretty clean and definitely less mechanical than the demo I did. The update code is just what you asked for and the prep-work is trivial. Even if you don't use underscore, the prep work is really just:

var rowById = indexBy(rows, 'id')

function indexBy(rows, key) {
  return rows.reduce(function(map, row) {
    map[row[key]] = row
    return map
  }, {})
}

The only thing you have to be careful about is that if you add, remove, or replace a row, the corresponding change also has to happen in both the array and the rowById map. The naive approach is to just regenerate the rowById map after having changed the dataset; I doubt it'd be a severe performance hit if it's done as rarely as you say.

I'm keen to agree with @gabrielmontagne's stance on this and say this isn't really something that should be implemented in the grid. I'll be closing this issue now. Happy to reconsider if you push a PR with the changes you envision. A lot of good commentary and follow-on work came out of this thread, so thank you very much for starting it!