zyang1580 / CoLLM

The implementation for the work "CoLLM: Integrating Collaborative Embeddings into Large Language Models for Recommendation".
BSD 3-Clause "New" or "Revised" License

Inconsistency between ML-1M statistics in the paper and the released processed dataset #2

Closed: dekoponTree closed this issue 1 week ago

dekoponTree commented 8 months ago

The paper reports that the ML-1M dataset has 839 users and 3,256 items.


These statistics are inconsistent with the released datasets, as the following script shows:

import pandas as pd

train_ = pd.read_pickle('train_ood2.pkl')
valid_ = pd.read_pickle('valid_ood2.pkl')
test_ = pd.read_pickle('test_ood2.pkl')

uids = set(train_.uid.unique()) | set(valid_.uid.unique()) | set(test_.uid.unique())
iids = set(train_.iid.unique()) | set(valid_.iid.unique()) | set(test_.iid.unique())

print(len(uids), len(iids)) # 838, 3255

zyang1580 commented 8 months ago

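The statistics in the paper include one additional padding ID, so each reported count is one higher than the number of distinct IDs in the released data (839 = 838 + 1 users; 3,256 = 3,255 + 1 items).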
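
For readers who want to check the off-by-one themselves, here is a minimal sketch of how a reserved padding ID changes the counts. It uses small made-up DataFrames rather than the released `.pkl` files, and it assumes the padding ID is 0 and never appears in the data; the thread does not say which value is used. `PAD_ID` and `count_ids` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Toy stand-ins for train_ood2.pkl / valid_ood2.pkl / test_ood2.pkl (made-up values).
train_ = pd.DataFrame({"uid": [1, 2, 3], "iid": [1, 2, 2]})
valid_ = pd.DataFrame({"uid": [2, 4], "iid": [3, 1]})
test_ = pd.DataFrame({"uid": [4, 1], "iid": [4, 3]})

# Assumption: ID 0 is reserved for padding and is absent from the data.
PAD_ID = 0


def count_ids(column, include_padding=False):
    """Count distinct IDs in `column` across all three splits."""
    ids = set()
    for df in (train_, valid_, test_):
        ids |= set(df[column].unique())
    if include_padding:
        # Reserving a padding slot adds exactly one ID to the vocabulary.
        ids.add(PAD_ID)
    return len(ids)


users_in_data = count_ids("uid")
items_in_data = count_ids("iid")
users_with_pad = count_ids("uid", include_padding=True)
items_with_pad = count_ids("iid", include_padding=True)

print(users_in_data, items_in_data)    # distinct IDs observed in the splits
print(users_with_pad, items_with_pad)  # each count grows by one with padding
```

Run against the released files, the first pair would correspond to the 838 / 3,255 printed by the script in the issue, and the second pair to the 839 / 3,256 reported in the paper, provided the padding ID really is absent from the data.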