weiyinwei / MMGCN

MMGCN: Multi-modal Graph Convolution Network for Personalized Recommendation of Micro-video

Data Pre-Processing Code #55

Open harshgeek4coder opened 1 year ago

harshgeek4coder commented 1 year ago

Hey there, @weiyinwei. Thanks for your paper and the approach to tackling multi-modal deep learning here. My question is: I can see and get the datasets mentioned, but there is no code for pre-processing them. Could you kindly provide the data pre-processing code - the steps through which you built the graph (train.npy, etc.)?

harshgeek4coder commented 1 year ago

Also, @weiyinwei - a small request: can you kindly send or push a sample of these data files? If possible, it would be great. Thanks

rohnson1999 commented 1 year ago

I tried to find dataset clues in the author's whole GitHub repository, but I couldn't find any files (not even processing scripts) about it. It seems that due to some copyright issues, the author can't share the data with us. But I tried to follow the author's methods mentioned in their paper. It turns out I collected 6,184,294 records in MovieLens10M, which is about five times bigger than their 1,239,508.

harshgeek4coder commented 1 year ago

> I tried to find dataset clue in author's whole github repository, I couldn't find any file(even process scripts) about it. Seems like due to some copyright issues, author can't share data with us. But I try to follow author's methods mentioned in their paper. Turns out, I collect 6,184,294 records in MovieLens10M, which is 6 times bigger than their 1,239,508.

Hey @rohnson1999, I appreciate your input here. I understand what you meant, and I agree, but I am also concerned with how the author built the graphs - for example, the .npy files. It is also necessary to know how to process and build those .npy files as graphs, so they can be passed to the data loader and eventually to the graph network. @weiyinwei, I'd really appreciate your input here.

Thanks

rohnson1999 commented 1 year ago

I am also curious why their user count in MovieLens is 55,485, because if you read the file ratings.dat in Python (https://grouplens.org/datasets/movielens/10m/), you will see there are 69,878 unique users. It is reasonable to cut the item count because some movies' trailers and descriptions are missing, but I don't understand why they cut the user count.

rohnson1999 commented 1 year ago

Based on my understanding, the three modality .npy files (audio, text, and keyframe pictures) are NumPy array files. You can use VGGish, Sentence2Vec, and ResNet50 to extract the respective features and eventually get these .npy files. But in general, you first have to crawl the corresponding movie trailers and movie descriptions from imdb.com, and then run deep neural models over them to get the features.
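Not the authors' actual pipeline, but a minimal sketch of the last step: assuming per-item feature vectors have already been extracted (VGGish is typically 128-d, Sentence2Vec around 300-d, ResNet50 pooled features 2048-d), the per-modality .npy files could be assembled like this. The file names and dictionaries here are placeholders of my own.

```python
import numpy as np

# Hypothetical extracted features, keyed by item id. In practice these would
# come from VGGish / Sentence2Vec / ResNet50 runs over the crawled trailers
# and descriptions; here they are random stand-ins with typical dimensions.
num_items = 2
audio_feats = {i: np.random.rand(128) for i in range(num_items)}
text_feats = {i: np.random.rand(300) for i in range(num_items)}
visual_feats = {i: np.random.rand(2048) for i in range(num_items)}

def stack_features(feats, num_items):
    """Stack per-item vectors into one (num_items, dim) array, row i = item i."""
    return np.stack([feats[i] for i in range(num_items)])

# Placeholder file names - the repo's actual names may differ.
np.save("FeatureAudio.npy", stack_features(audio_feats, num_items))
np.save("FeatureText.npy", stack_features(text_feats, num_items))
np.save("FeatureVideo.npy", stack_features(visual_feats, num_items))
```

The key point is that row i of each matrix must correspond to the same item i across all three modalities, so items missing any one modality have to be dropped before stacking.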

rohnson1999 commented 1 year ago

I spent two months trying to build my MovieLens dataset, but there is just too little information.

harshgeek4coder commented 8 months ago

Hi @rohnson1999, were you able to obtain or write code for preprocessing the data into the mentioned format - positive interactions of users against items - for this MMGCN paper?

weiyinwei commented 8 months ago

@rohnson1999 Yes, we should remove the items without features in all three modalities. Also, after removing such items, some users may have an empty interaction history, so we remove these users as well. @harshgeek4coder The .npy files store the user-item pairs, which correspond to the edges, i.e., <head_node, tail_node>, in the graph.

harshgeek4coder commented 8 months ago

> @rohnson1999 Yes, we should remove the items without the features in three modalities. And also, after removing such items, some users may have an empty interaction history. So, we also remove these users. @harshgeek4coder The .npy files store the user-item pairs which correspond the edge, i.e., <head_node, tail_node> , in the graph.

Hi @weiyinwei, thanks for the reply. I had one question: this edge, which is a positive interaction - is it unidirectional or bidirectional? For graph neural networks, building the edges and passing them through further layers might require edges in both directions.

If possible, can you provide any sample pre-processing code for the MovieLens dataset? Thanks a ton!
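Not speaking for the authors, but a common convention in GNN code is to store each user-item interaction once and mirror it when building the message-passing graph. A sketch, assuming item ids are offset by the number of users so both node types share one id space (the variable names and toy numbers are illustrative):

```python
import numpy as np

num_users = 3
# user-item pairs as they might be stored in a file like train.npy
# (one direction only: user -> item)
edges = np.array([[0, 0], [0, 1], [2, 1]])

# offset item ids so users (0..num_users-1) and items share one node-id space
heads = edges[:, 0]
tails = edges[:, 1] + num_users

# mirror every edge so messages flow both user->item and item->user,
# giving a 2 x (2E) array of (source, destination) node ids
bidirectional = np.concatenate(
    [np.stack([heads, tails]), np.stack([tails, heads])], axis=1
)
print(bidirectional.shape)  # (2, 6): six directed edges from three interactions
```

Storing only one direction on disk and mirroring at load time keeps the file half the size while still giving the network both-sided edges.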

241416 commented 1 month ago

> I tried to find dataset clue in author's whole github repository, I couldn't find any file(even process scripts) about it. Seems like due to some copyright issues, author can't share data with us. But I try to follow author's methods mentioned in their paper. Turns out, I collect 6,184,294 records in MovieLens10M, which is 6 times bigger than their 1,239,508.

> Hey @rohnson1999 , appreciate your input here. I understand what you meant. I agree, but I am also concerned with how the author built graphs - for example, the .npy files -> that is also kind of necessary to know how to process and build those.npy files as graphs to pass it to data loader and then to graph network eventually. @weiyinwei , Really Appreciate your inputs here.

The author mentioned it only briefly in the paper; I guess the conversion from raw videos to visual/textual data was done by the data provider. I'm also curious about that extraction process.