yahoojapan / JGLUE

JGLUE: Japanese General Language Understanding Evaluation
Creative Commons Attribution Share Alike 4.0 International

Unable to generate MARC-ja because of 403 Forbidden #10

Open akeyhero opened 11 months ago

akeyhero commented 11 months ago

Thank you for the great benchmark.

The Amazon Reviews Corpus seems to be inaccessible:

$ wget https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_multilingual_JP_v1_00.tsv.gz
--2023-07-31 15:22:11--  https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_multilingual_JP_v1_00.tsv.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.98.53, 52.216.41.112, 52.216.249.70, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.98.53|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2023-07-31 15:22:11 ERROR 403: Forbidden.

The command from https://registry.opendata.aws/amazon-reviews-ml/ fails as well:

$ aws s3 ls --no-sign-request s3://amazon-reviews-ml/

An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied
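
For anyone who wants to re-check the source URL from a script rather than with wget, here is a minimal sketch using only the Python standard library. The names `http_status`, `opener`, and `MARC_JA_SOURCE` are illustrative and not part of JGLUE. The `opener` parameter is an assumption added so the function can be exercised without network access.

```python
import urllib.error
import urllib.request

# URL taken from the wget command above.
MARC_JA_SOURCE = (
    "https://s3.amazonaws.com/amazon-reviews-pds/tsv/"
    "amazon_reviews_multilingual_JP_v1_00.tsv.gz"
)


def http_status(url, opener=urllib.request.urlopen, timeout=10):
    """Send a HEAD request to `url` and return the HTTP status code.

    `opener` defaults to urllib.request.urlopen; it is injectable
    (a hypothetical convenience) so a fake can be passed in tests.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses;
        # the status code is available on the exception.
        return err.code


# Example call (needs network access):
#   print(http_status(MARC_JA_SOURCE))
```

A HEAD request avoids downloading the archive if the file becomes reachable again. Network-level failures such as DNS errors are not caught here and will surface as `urllib.error.URLError`.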

~~We may be able to move to HuggingFace: https://huggingface.co/datasets/amazon_reviews_multi (I cannot verify that it reproduces the original dataset.)~~ (also not available)

kaisugi commented 11 months ago

I would suggest https://huggingface.co/datasets/shunk031/JGLUE, which includes exactly the same data as the original MARC-ja:

from datasets import load_dataset
dataset = load_dataset("shunk031/JGLUE", name="MARC-ja")

shunk031 commented 11 months ago

Thank you for the suggestion. I'm the maintainer of the shunk031/JGLUE repository. Unfortunately, that code also uses https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_multilingual_JP_v1_00.tsv.gz to load the dataset (ref. https://github.com/shunk031/huggingface-datasets_JGLUE/issues/9).

I have personally reported this issue to the AWS representative and am awaiting a response.

kaisugi commented 11 months ago

Sorry for the confusion! And thank you for the quick follow-up, @shunk031!

tomohideshibata commented 11 months ago

Thank you for your report. We will wait for the response.

tomohideshibata commented 11 months ago

The following post says that "Amazon has decided to stop distributing the multilingual reviews dataset." We will wait for an official announcement. https://huggingface.co/datasets/amazon_reviews_multi/discussions/4#64c3898db63057f1fd3ce1a0