webis-de / small-text

Active Learning for Text Classification in Python
https://small-text.readthedocs.io/
MIT License

Dataset cloning wraps the label #35

Closed by chschroeder 1 year ago

chschroeder commented 1 year ago

Bug description

Selecting a subset of a dataset and then cloning it wraps each label in a superfluous zero-dimensional numpy ndarray (the label is shown as array(0) instead of 0). This affects PytorchTextClassificationDataset and TransformersDataset.

Edit: I noticed this because clf.predict() on the cloned dataset raised TypeError: len() of unsized object.
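For context, this is the error NumPy raises when len() is called on a zero-dimensional array. A minimal sketch using plain NumPy only (this is not small-text code, and the variable names are illustrative):

```python
import numpy as np

label = 0                  # a plain scalar label
wrapped = np.array(label)  # the same label wrapped in a zero-dimensional ndarray

print(str(wrapped), repr(wrapped))  # 0 array(0)
print(wrapped.ndim, wrapped.shape)  # 0 ()

message = None
try:
    len(wrapped)  # a zero-dimensional array has no length
except TypeError as err:
    message = str(err)

print(message)  # len() of unsized object
```

The str() of the wrapped label looks identical to the scalar, so the wrapping only becomes visible through repr() or through code that calls len() on the label.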

Steps to reproduce

Example for TransformersDataset:

import unittest

# Import path assumed for small-text 1.3.0.
from small_text.integrations.transformers.datasets import TransformersDataset
from tests.utils.datasets import random_transformer_dataset


class CloneBugTest(unittest.TestCase):

    def test_asd(self):
        dataset = random_transformer_dataset(num_samples=20,
                                             multi_label=False,
                                             num_classes=3)
        indices = [0, 1]

        # Select a subset first, then clone it.
        dataset_cloned = dataset[indices].clone()

        first_label = dataset.data[0][TransformersDataset.INDEX_LABEL]
        first_label_cloned = dataset_cloned.data[0][TransformersDataset.INDEX_LABEL]

        # Both lines should print the same values.
        print(first_label, str(first_label), repr(first_label))
        print(first_label_cloned, str(first_label_cloned), repr(first_label_cloned))

Output:

0 0 0
0 0 array(0)

Expected behavior

Expected Output:

0 0 0
0 0 0
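The expected behavior amounts to labels never being zero-dimensional arrays. As a rough sketch of what undoing the wrapping could look like, assuming plain NumPy: unwrap_label below is a hypothetical helper and not part of small-text, and it deliberately leaves arrays with one or more dimensions (for example multi-label vectors) untouched.

```python
import numpy as np


def unwrap_label(label):
    """Hypothetical helper: turn a zero-dimensional ndarray into a Python scalar."""
    if isinstance(label, np.ndarray) and label.ndim == 0:
        return label.item()
    return label


print(repr(unwrap_label(np.array(0))))          # 0
print(repr(unwrap_label(0)))                    # 0
print(repr(unwrap_label(np.array([1, 0, 1]))))  # array([1, 0, 1])
```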

Environment:

Python version: 3.8
small-text version: 1.3.0
small-text integrations (e.g., transformers): transformers
PyTorch version (if applicable): -
Installation (pip, conda, or from source): pip
CUDA version (if applicable): -

Additional information

--