Open akeyhero opened 11 months ago
I would suggest https://huggingface.co/datasets/shunk031/JGLUE, which includes exactly the same data as the original MARC-ja:
```python
from datasets import load_dataset

dataset = load_dataset("shunk031/JGLUE", name="MARC-ja")
```
Thank you for the suggestion. I'm the maintainer of the shunk031/JGLUE repository. Unfortunately, that code also uses https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_multilingual_JP_v1_00.tsv.gz to load the dataset (ref. https://github.com/shunk031/huggingface-datasets_JGLUE/issues/9).
I have personally reported this issue to the AWS representative and am awaiting a response.
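For anyone who wants to check whether the upstream file has come back, here is a minimal sketch using only the Python standard library. It sends a HEAD request to the S3 URL quoted above and reports the HTTP status. The names `MARC_JA_SOURCE_URL` and `source_status` are hypothetical helpers made up for this illustration; they are not part of JGLUE or `datasets`.

```python
import urllib.error
import urllib.request

# URL quoted earlier in this thread; the constant name is hypothetical.
MARC_JA_SOURCE_URL = (
    "https://s3.amazonaws.com/amazon-reviews-pds/tsv/"
    "amazon_reviews_multilingual_JP_v1_00.tsv.gz"
)


def source_status(url, opener=urllib.request.urlopen, timeout=10):
    """Return the HTTP status code of a HEAD request, or None if unreachable.

    `opener` is injectable so the function can be exercised without
    touching the network (an assumption made for this sketch).
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (e.g. 403 or 404).
        return err.code
    except urllib.error.URLError:
        # DNS failure, refused connection, timeout, and similar.
        return None
```

Calling `source_status(MARC_JA_SOURCE_URL)` returns the status code the server answers with, or `None` when the host cannot be reached at all. A success status would suggest the file is being served again, although this only checks reachability, not that the contents match the original corpus.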
Sorry for the confusion!! And thank you for your quick followup @shunk031 !
Thank you for your report. We will wait for the response.
The following post says that "Amazon has decided to stop distributing the multilingual reviews dataset." We will wait for an official announcement. https://huggingface.co/datasets/amazon_reviews_multi/discussions/4#64c3898db63057f1fd3ce1a0
Thank you for the great benchmark.
Amazon Reviews Corpus seems to be inaccessible.
and with the command from https://registry.opendata.aws/amazon-reviews-ml/
~~We may be able to move to HuggingFace: https://huggingface.co/datasets/amazon_reviews_multi (I cannot verify that it would generate the same dataset as the original one.)~~ (also not available)