xplip / pixel

Research code for pixel-based encoders of language (PIXEL)
https://arxiv.org/abs/2207.06991
Apache License 2.0

Unable to Load Data #9

Open ChawDoe opened 1 year ago

ChawDoe commented 1 year ago
  File "scripts/training/run_pretraining.py", line 465, in preprocess_images
    examples["pixel_values"] = [transforms(image) for image in examples[image_column_name]]  # bytes, path
  File "scripts/training/run_pretraining.py", line 465, in <listcomp>
    examples["pixel_values"] = [transforms(image) for image in examples[image_column_name]]  # bytes, path
  File "/usr/local/lib/python3.8/dist-packages/torchvision/transforms/transforms.py", line 61, in __call__
    img = t(img)
  File "/usr/local/lib/python3.8/dist-packages/torchvision/transforms/transforms.py", line 437, in __call__
    return self.lambd(img)
  File "/cpfs/shared/research/public-data/cv/driving/pixel/pixel-main/src/pixel/pixel_utils/transforms.py", line 211, in <lambda>
    transforms = [Lambda(lambda img: img.convert("RGB") if img.mode != "RGB" else img)]
AttributeError: 'dict' object has no attribute 'mode'

I used the pre-rendered data, which I manually downloaded from the website.
I found that each example is a dict with the keys {'path', 'bytes'}, not an RGB image. How can I apply the data transforms?

ChawDoe commented 1 year ago

I use Image.open() and BytesIO() to read the bytes data. However, num_patches is not available in the data; how can I get it? Thanks.
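
Roughly, the decoding I do looks like this (a minimal sketch; decode_example is just an illustrative name):

from io import BytesIO
from PIL import Image

def decode_example(example):
    # example is a dict like {'path': ..., 'bytes': ...}; 'bytes' holds the
    # encoded image data, so decode it with PIL and convert to RGB
    image = Image.open(BytesIO(example["bytes"]))
    return image.convert("RGB") if image.mode != "RGB" else image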

xplip commented 1 year ago

Hi, how are you trying to load the data? You shouldn't have to manually load the bytes from disk. You can use the HF datasets library like this:

from datasets import load_dataset

wiki_dataset = load_dataset(
    "Team-PIXEL/rendered-wikipedia-english",
    split="train",
    use_auth_token=True,
    streaming=True
)

# print the first dataset entry
print(next(iter(wiki_dataset)))

# prints {'pixel_values': <PIL.PngImagePlugin.PngImageFile image mode=L size=8464x16 at 0x7F3FFEA7F5E0>, 'num_patches': 469}

You could use streaming=False if you prefer to download everything to disk first, and then index the data using print(wiki_dataset[0]).
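
For example, the non-streaming variant would be (a minimal sketch; it downloads the full split to the local HF cache first):

from datasets import load_dataset

wiki_dataset = load_dataset(
    "Team-PIXEL/rendered-wikipedia-english",
    split="train",
    use_auth_token=True,
    streaming=False
)

# random access works once everything is cached locally;
# each entry again has the decoded 'pixel_values' image and 'num_patches'
print(len(wiki_dataset))
print(wiki_dataset[0])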

The same applies to the rendered BookCorpus dataset. So this is how you would load the dataset; you can then use the datasets library's map or set_transform to apply the transformations. I recommend using set_transform so that the transformations are applied on the fly each time the dataloader fetches a batch.
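
For illustration, a minimal sketch of the set_transform route (the transforms pipeline below is only a stand-in for the one built in src/pixel/pixel_utils/transforms.py, and preprocess_images mirrors the function in scripts/training/run_pretraining.py):

from datasets import load_dataset
from torchvision.transforms import Compose, Lambda, ToTensor

wiki_dataset = load_dataset(
    "Team-PIXEL/rendered-wikipedia-english",
    split="train",
    use_auth_token=True,
    streaming=False  # set_transform requires a regular (non-streaming) Dataset
)

# stand-in transform pipeline; the repo builds its own in src/pixel/pixel_utils/transforms.py
transforms = Compose([
    Lambda(lambda img: img.convert("RGB") if img.mode != "RGB" else img),
    ToTensor(),
])

def preprocess_images(examples):
    # set_transform passes a batch dict; the image column is already decoded to PIL images
    examples["pixel_values"] = [transforms(image) for image in examples["pixel_values"]]
    return examples

# applied lazily: the transforms run every time examples are fetched, e.g. by a DataLoader
wiki_dataset.set_transform(preprocess_images)

print(wiki_dataset[0]["pixel_values"].shape)  # e.g. torch.Size([3, 16, 8464])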