zond / god

A Go database
Other
531 stars 38 forks source link

Set Intersection Theoretical Question. #3

Closed ai10 closed 11 years ago

ai10 commented 11 years ago

I'm interested in repetitively executing a matching operation on a K x K (K = ~8000) matrix of Sets to determine best matches by Set intersection. The Sets may each have on the order 100-2000 elements each. And would change regularly. And potentially the number of matrices would grow and be processed in parallel.

At first I thought to do this using Redis Set Intersection SINT with a O(N^2) Lua script, then I started to think I should look into keeping track of the sets with Redis and piping them in groups to a GPU for batches of Set Compare ops.

Now I'm trying to figure out if radix-tree / mirror arrangement is better suited. With some simple straight forward paper scratching I'm thinking that it does not.

It seems that you are addressing a problem of Set operations with your design, or is this simply a noted side benefit?

Do you think the radix tree / mirror arrangement would have an advantage for operations for my use case?

zond commented 11 years ago

I am not 100% certain I understand what it is you are trying to do. Let me try to rephrase it, to see if I am getting it:

You have about 8000 volatile sets, and you want to find the intersections of all of them?

Yeah, this would take (unless you do something outside the naive solution, I guess) about 64e6 intersections?

Redis SINTER seems to be O(N*M) where N is the smallest set, and M is the number of sets.

If my guesstimate (I have not spent lots of time doing this formally or properly) is correct, god intersections are O(S+log(S1)+...log(Sn)) where S is the size of the result set and Sn is the size of set n.

This could mean that god is more efficient than Redis at this, but there may be constant factors that affect this more than I can guess.

Also, doing it in a GPU sounds like a very efficient operation if you do it right, just due to being able to throw lots and lots of transistors on the problem.

Regarding what I address with my design: I think that the set and sorted set operations are one of the best things in Redis, so they were a 'must have' for me from the beginning, so they could be considered one of the major focus points with god.

Regarding mirror trees: If I understand your problem correctly, I don't see how you would use them?

ai10 commented 11 years ago

Yeah, 64e6 intersection per batch is what I've looked at. Though I think there are approaches using extra tables that could reduce the exhaustive number, I'm not sure they give much advantage in worst case computations.

I'm going to look more closely at your intersection code. Just going through the architecture document I incorrectly inferred that radix and mirror had a part in your Set algorithms and saw only symmetric result using the mirror.

Redis docs publish a 'worst case' N*M; maybe your calculation presumes evenly distributed data i.e. balanced tree best case? (I'm a rusty few years out of school on this stuff.)

Nonetheless I'm very impressed by what I see you are doing, though Redis is a high standard to compete with.

zond commented 11 years ago

Ok, well any optimization reducing the number of intersections you have to do sounds worthwhile to explore :)

Maybe I ought to improve the architecture document - all data in god is contained in radix trees. Each node contains one big radix tree (even if the different nodes contain different parts of it), and each sub tree is a radix tree contained in the big radix tree.

Mirror trees are only useful if you want to sort the data in your trees based on their values as well, and they cost you a lot of performance, since all updates will remove from the mirror tree, update the main tree and then update the mirror tree...

How do you mean symmetric results?

Yeah, perhaps Redis is mostly faster than the worst case. My implementation depends heavily on everything being both sorted and easily skippable to begin with. Maybe I should extend the docs there to describe what is actually happening.

Yeah, I am a bit scared about contending with Redis, but I guess you have to aim for the stars etc :)

ai10 commented 11 years ago

Forgive me God creator for being a little bit evil but yeah, I have a patent pending on reducing the number of intersections which is based on statistical expectations about the underlying nature of the data. And I might be able to squeeze out even more. But ultimately it always seems to break down into buckets that need a exhaustive comparisons because of the potentially heavy importance of statistical outliers. And it looks like GPU processing is the best bet for O(1) GB buckets at this last stage.

But I'm keen that this God project would be good for larger aggregates of data and my imagination is running wild about single hash bloom filters for each radix tree. And I think its cool that an API server can be baked right in.

Maybe i'm being loose with the term symmetric result but a simple instance looked like an unbalanced tree would mirror into a unbalanced tree; and I've boggled for a moment about some possible primal dual algorithm which could somehow reduce worst case scenario to best case scenario like a balanced tree; but could not see it through.

I'm not a Ph.D. on this CS or Math stuff though, I'm more of a crackpot electrical engineer with a master in computational math who's into this sort of thing more intuitively but without formal proofs.

I'm getting excited about this God project. Maybe I'll be able to contribute something worthwhile: some newb questions for a FAQ, uncover unnoticed issues, write some unit test, or something. Thanks for the sharing your good work and answering my questions.

zond commented 11 years ago

Forgive me God creator for being a little bit evil but yeah, I have a patent pending on reducing the number of intersections which is based on statistical expectations about the underlying nature of the data. And I might be able to squeeze out even more. But ultimately it always seems to break down into buckets that need a exhaustive comparisons because of the potentially heavy importance of statistical outliers. And it looks like GPU processing is the best bet for O(1) GB buckets at this last stage.

Haha, I am not all happy about the project name. I chose it when the project was less serious and more of just a crazy idea...

But I wish you the best of luck with your optimizations and GPU processing :)

But I'm keen that this God project would be good for larger aggregates of data and my imagination is running wild about single hash bloom filters for each radix tree. And I think its cool that an API server can be baked right in.

What did you want to do with the Bloom filters?

Maybe i'm being loose with the term symmetric result but a simple instance looked like an unbalanced tree would mirror into a unbalanced tree; and I've boggled for a moment about some possible primal dual algorithm which could somehow reduce worst case scenario to best case scenario like a balanced tree; but could not see it through.

Yeah, I started out with treaps, and other rebalancing binary trees. But the Merkle part really requires trees where the same data => the same structure :/

Anyway, I wouldn't worry about the radixes being unbalanced. The radix algorithm is quite good even in worst cases. Look it up on wikipedia ;)

I'm not a Ph.D. on this CS or Math stuff though, I'm more of a crackpot electrical engineer with a master in computational math who's into this sort of thing more intuitively but without formal proofs.

Almost like me :p

I'm getting excited about this God project. Maybe I'll be able to contribute something worthwhile: some newb questions for a FAQ, uncover unnoticed issues, write some unit test, or something. Thanks for the sharing your good work and answering my questions.

Please, any help or input is appreciated!

ai10 commented 11 years ago

My thought on the single hash bloom filter would be to use it as a fuzzy set compare against the entire node (or sub tree) by bit matching a bloom filter trained on S1 against a bloom filter trained on a node to get a number of potential matches among the agglomeration of sets on the node. bloom filters do give significant false positives so the filter would need be somewhat large I think at least 8*(total_items) bits - but this could be a constant time first step in finding neighborhoods of sets where they would most likely have commonalities. The problem is the statistical significance is not very high if the expected size of set intersection is insufficient. 'curse of dimensionality'

zond commented 11 years ago

Ah, I see. I think.

So you wanted to train a bloom filter on a bunch of sets, to be able to quickly filter out those that it's worthwhile to actually do intersections with? Clever :)

ai10 commented 11 years ago

Yeah, that was one idea.

Have you considered a name change to something like 'goradix' or 'gorad' which also seems generally available as a .io domain name? (s%4 man) , gorad.io seems like a awesome trademark for someone into music too ;)

zond commented 11 years ago

That one I haven't considered. Although I feel that the fact that the data structure I use are radix trees is a mere detail, so maybe the project shouldn't be named after it?

I considered naming it "Go Database!" but felt it was a bit presumptuous to name it as though it was the only go database. Though "god" may be considered presumptuous as well I guess :)

I also considered naming it BjörnDB after a friend of mine who fixed my car...

ai10 commented 11 years ago

^^ gorad.io !!!

zond commented 11 years ago

Haha gorad was nice! I have to consider that, but it is promising. Maybe rename it tonight if I make up my mind...

zond commented 11 years ago

Heh, I was about to register the domain, when I saw that .io domains cost 60 GBP a year :O

ai10 commented 11 years ago

Yeah, at least its a barrier to all these cyber squatters looking for a windfall.

I'm not with iron.io, but I noticed they are looking for Go developers to join their team, and seem open to remote people, and might be keen for you to continue to work on this project since it might be consistent with their overall product offering.

Do you have a lot of other time commitments beyond this project?

If I arrange for the purchase of gorad.io will you change the project name?

zond commented 11 years ago

True that. But I'll hold off on investing in domain names until I at least have a few bona fide users of the product :)

Oh, well, I work as an independent contractor so I am always open for opportunities. And being paid to work on something like this would be dreamy :D

ai10 commented 11 years ago

jobs.iron.io

If you might apply I'll send them a letter, a 'second', ... can often be a little persuasive for them to look at your code, I think they'll like it.

zond commented 11 years ago

Oh, they seem to do really cool stuff! And I know and like SF! (even if I am not currently able to move there, I have been there a few times on business lately)

I applied, and referred to this thread ;)

Also: I just now noticed (it's hard to follow these threads from a mobile screen) that you offered to buy the domain gorad.io.

Yes, yes I would change the name to gorad if you bought gorad.io. And yes, I have lots of commitments, but the only immutable ones are my wife and child :D