xiaoman-zhang / PMC-VQA

PMC-VQA is a large-scale medical visual question-answering dataset, which contains 227k VQA pairs of 149k images that cover various modalities or diseases.
MIT License

Split generation exception on fetching the `PMC-VQA` data #18

Open nicolay-r opened 7 months ago

nicolay-r commented 7 months ago

Dear @xiaoman-zhang, I am trying to download the dataset using the `datasets` library.

With Python 3.10 and datasets==2.15.0, I load the dataset as follows:

import datasets
from pathlib import Path
datasets.config.DOWNLOADED_DATASETS_PATH = "./data"
dataset = datasets.load_dataset("xmcmic/PMC-VQA", split='train[:10]')

I run into the following split generation error:

File "/home/datasets/PMC-VQA/venv/lib/python3.10/site-packages/datasets/table.py", line 2290, in cast_table_to_schema
    raise ValueError(f"Couldn't cast\n{table.schema}\nto\n{features}\nbecause column names don't match")
ValueError: Couldn't cast
index: int64
Figure_path: string
Caption: string
Question: string
Choice A: string
Choice B: string
Choice C: string
Choice D: string
Answer: string
split: string
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 1408
to
{'Figure_path': Value(dtype='string', id=None), 'Question': Value(dtype='string', id=None), 'Answer': Value(dtype='string', id=None), 'Choice A': Value(dtype='string', id=None), 'Choice B': Value(dtype='string', id=None), 'Choice C': Value(dtype='string', id=None), 'Choice D': Value(dtype='string', id=None), 'Answer_label': Value(dtype='string', id=None)}
because column names don't match
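
To make the mismatch explicit, the two column lists reported in the error can be compared directly. This is a minimal sketch: the column names are copied from the traceback above, and the variable names are purely illustrative.

```python
# Columns found in the file that failed to cast (copied from the error message).
file_columns = [
    "index", "Figure_path", "Caption", "Question",
    "Choice A", "Choice B", "Choice C", "Choice D",
    "Answer", "split",
]

# Columns of the schema the cast was targeting (copied from the error message).
expected_columns = [
    "Figure_path", "Question", "Answer",
    "Choice A", "Choice B", "Choice C", "Choice D",
    "Answer_label",
]

# Set differences show which columns exist on one side only.
only_in_file = sorted(set(file_columns) - set(expected_columns))
only_in_expected = sorted(set(expected_columns) - set(file_columns))

print("Only in the file:", only_in_file)
print("Only in the expected schema:", only_in_expected)
```

The file carries `index`, `Caption` and `split`, which the target schema lacks, while the schema expects `Answer_label`, which the file lacks. This suggests that the data files in the repository do not all share one header, although I have not verified this against the files themselves.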

Is this expected behaviour, and what would you recommend for accessing the dataset?

Thank you for your assistance!