Open ChawDoe opened 1 year ago
I use Image.open() and BytesIO() to read the bytes data. However, num_patches is not available in the data. How can I get it? Thanks.
Hi, how are you trying to load the data? You shouldn't have to manually load the bytes from disk. You can use the HF datasets library like this:
from datasets import load_dataset
wiki_dataset = load_dataset(
    "Team-PIXEL/rendered-wikipedia-english",
    split="train",
    use_auth_token=True,
    streaming=True,
)
# print the first dataset entry
print(next(iter(wiki_dataset)))
# prints {'pixel_values': <PIL.PngImagePlugin.PngImageFile image mode=L size=8464x16 at 0x7F3FFEA7F5E0>, 'num_patches': 469}
You could use streaming=False if you prefer to download everything to disk first, and then index the data using print(wiki_dataset[0]).
The same applies to the rendered BookCorpus dataset. So this is how you would load the dataset, and then you can use the dataset's map or set_transform to apply transformations. I recommend using set_transform, so that the transformations are applied on the fly every time the dataloader fetches examples.
I used the pre-rendered data, which I manually downloaded from the website.
I found that each entry is a dict with the keys {'path', 'bytes'} rather than an RGB image. How can I apply data transforms to it?
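For a record of this shape, one possible approach is to decode the bytes with Pillow before applying any transform. This is only a sketch, assuming the 'bytes' field holds the encoded image file contents (for example PNG); decode_record and the in-memory sample record are hypothetical and not part of the dataset or the PIXEL code.

```python
from io import BytesIO

from PIL import Image


def decode_record(record):
    """Decode a {'path', 'bytes'} record into an RGB PIL image."""
    # Assumption: record["bytes"] contains the raw encoded image file.
    image = Image.open(BytesIO(record["bytes"]))
    return image.convert("RGB")


# Hypothetical record built in memory so the sketch runs without the dataset:
# a blank 32x16 grayscale image encoded as PNG.
buffer = BytesIO()
Image.new("L", (32, 16), color=255).save(buffer, format="PNG")
record = {"path": None, "bytes": buffer.getvalue()}

rgb = decode_record(record)
print(rgb.mode, rgb.size)
```

The decoded image can then go through the same kind of transform function as any other PIL image.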