yhf8377 / cuda-fp-growth

A CUDA-based FP-Growth algorithm implementation
Apache License 2.0

large dataset problem #5

Open LidaMsv opened 2 years ago

LidaMsv commented 2 years ago

Hello, I'm using your source code for my current project, and I have a problem building the FPHeaderTable for a large real-world dataset named webdocs. I think that because there are about 5M unique items in this dataset, the header table is never built. My question is: how large a dataset can your code handle, and what should I do to build the HeaderTable? Thanks in advance.

yhf8377 commented 2 years ago

Hello.

This project was a coding practice of mine from a few years ago, when I was working on a CUDA-related association rule mining task. However, I ended up using a different approach and never completed this project, so the code most likely contains lots of bugs and is not even fully functional. Feel free to build on top of my work, but do not expect my code to work out of the box.

Regarding the memory requirement, you will need to do your own calculations based on the size of the internal data structures and the number of instances needed to hold your input data. Again, as I never finished this project, I did not optimize the memory footprint either.


LidaMsv commented 1 year ago

Hello again. Thank you for your answer; now I know what I want from your code :) I have another question, if you don't mind: why did you transpose the constructed bitmap? What difference did it make? Did you need the transpose in order to build the radix tree, or was there another reason? Thank you again for answering me.

yhf8377 commented 1 year ago

Honestly, I don't remember...

There are two ways of representing items in transactions. The first is an "item" map, where each row represents an item and the bits indicate which transactions contain that item. For example, item A: 00100001 (reading the least significant bit as transaction #0) would indicate that transactions #0 and #5 contain item A. The other is a "transaction" map, where each row is a transaction and the bits indicate which items it contains. For example, transaction #0: 00010001 would indicate that transaction #0 contains items A and E.

The item map is useful when counting the frequency of a specific itemset. For example, to count the frequency of itemset A+B+C, you bit-AND the item maps for A, B and C, then sum a __popc() call across all words of the result. Only when a transaction contains every item in the set is the corresponding result bit 1, so the number of set bits is the support count of that itemset.

I don't remember why I transposed the item map into the transaction map. Maybe it was required by the algorithms described in the papers I referenced... I really don't remember.

LidaMsv commented 1 year ago

Thank you for your complete explanation. Actually, I found the answer in the last paper you referenced: the transpose is exactly what the algorithm in that paper requires. Thanks again for your time :)