Multilingual Amazon Reviews Corpus (MARC) {En, Ja, De, Fr, Es, Zh} [2015-2019] text classification. Each record contains the review text, the review title, the star rating, an anonymized reviewer ID, an anonymized product ID, and the coarse-grained product category (e.g., 'books', 'appliances'). Each language is balanced across the 5 possible star ratings. Data split per language (train/dev/test): 200,000 / 5,000 / 5,000.
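The record layout and the balance property described for MARC can be sketched as follows. This is a minimal illustration, not the corpus's actual schema: the field names, the `Review` class, and both helper functions are hypothetical, and the three split sizes are assumed to be train, development, and test sizes per language.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical record type; field names are illustrative assumptions,
# not the corpus's real column names.
@dataclass(frozen=True)
class Review:
    review_body: str
    review_title: str
    stars: int              # 1 to 5
    reviewer_id: str        # anonymized
    product_id: str         # anonymized
    product_category: str   # e.g. "books", "appliances"
    language: str           # e.g. "en", "de"

# Per-language split sizes from the entry above (split names are assumed).
SPLIT_SIZES = {"train": 200_000, "dev": 5_000, "test": 5_000}
NUM_STAR_RATINGS = 5

def expected_per_rating(split: str) -> int:
    """Reviews per star rating if a split is perfectly balanced."""
    return SPLIT_SIZES[split] // NUM_STAR_RATINGS

def is_balanced(reviews: list[Review]) -> bool:
    """True if all five star ratings occur, each equally often."""
    counts = Counter(r.stars for r in reviews)
    return set(counts) == set(range(1, NUM_STAR_RATINGS + 1)) and len(set(counts.values())) == 1
```

Under these assumptions, a balanced split of 200,000 reviews implies 40,000 reviews per star rating in each language, and a 5,000-review split implies 1,000 per rating.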
Building Educational Applications (BEA)
TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages
GitHub
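A linguistically diverse dataset in which both questions and answers are natural: the asker genuinely wants to know the answer, and the answer is given in the same local language. People are prompted to ask questions in a natural, open-ended way, and the answers are then located in Wikipedia.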
Wino-X: Multilingual Winograd Schemas for Commonsense Reasoning and Coreference Resolution
Evaluates coreference resolution (CoR) and commonsense reasoning (CSR) in NMT models and multilingual language models (MLLMs).
Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks
Tests generalization to tasks and instruction-following ability: 1,616 tasks with expert-written instructions. Model: Tk-Instruct.