webis-de / small-text

Active Learning for Text Classification in Python
https://small-text.readthedocs.io/
MIT License

Batch size in greedy coreset batching is different than expected #50

Closed chschroeder closed 10 months ago

chschroeder commented 10 months ago

Bug description

To reduce memory usage, greedy coreset is computed in batches. The number and size of the batches are currently based on the wrong set.

So far, the method has failed silently but gracefully, producing batches of a different size than expected. The exception is when the number of unlabeled indices is less than the batch size, in which case it raises an error similar to the following:

<...>
  File "/path/to/site-packages/small_text/query_strategies/coresets.py", line 131, in sample
    return greedy_coreset(embeddings, indices_unlabeled, indices_labeled, n,
  File "/path/to/site-packages/small_text/query_strategies/coresets.py", line 79, in greedy_coreset
    dist = dist_func(batch, x[indices_s], normalized=normalized)
  File "/path/to/site-packages/small_text/query_strategies/coresets.py", line 25, in _euclidean_distance
    return pairwise_distances(a, b, metric='euclidean')
  File "/path/to/site-packages/sklearn/metrics/pairwise.py", line 2195, in pairwise_distances
    return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
  File "/path/to/site-packages/sklearn/metrics/pairwise.py", line 1765, in _parallel_pairwise
    return func(X, Y, **kwds)
  File "/path/to/site-packages/sklearn/metrics/pairwise.py", line 310, in euclidean_distances
    X, Y = check_pairwise_arrays(X, Y)
  File "/path/to/site-packages/sklearn/metrics/pairwise.py", line 165, in check_pairwise_arrays
    X = check_array(
  File "/path/to/site-packages/sklearn/utils/validation.py", line 969, in check_array
    raise ValueError(
ValueError: Found array with 0 sample(s) (shape=(0, 768)) while a minimum of 1 is required by check_pairwise_arrays.
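
To illustrate the kind of mismatch described above, here is a minimal pure-Python sketch. It is not the code from `coresets.py`. The helpers `split_evenly` and `batch_sizes` are hypothetical, and the sketch assumes that the number of batches is derived from the size of one set (here: all embeddings) while the split is applied to another (the unlabeled indices). Under that assumption, batches come out smaller than the requested batch size, and with very few unlabeled indices some batches are empty, which is the kind of input that triggers the `ValueError` above.

```python
import math


def split_evenly(items, num_batches):
    """Split items into num_batches nearly equal parts; parts may be empty."""
    base, extra = divmod(len(items), num_batches)
    batches, start = [], 0
    for i in range(num_batches):
        size = base + (1 if i < extra else 0)
        batches.append(items[start:start + size])
        start += size
    return batches


def batch_sizes(num_total, num_unlabeled, batch_size, use_correct_set):
    """Return the sizes of the batches the unlabeled indices are split into.

    Hypothetical helper: with use_correct_set=False, the number of batches is
    computed from the total number of embeddings (the assumed wrong set).
    """
    reference = num_unlabeled if use_correct_set else num_total
    num_batches = max(1, math.ceil(reference / batch_size))
    unlabeled = list(range(num_unlabeled))
    return [len(batch) for batch in split_evenly(unlabeled, num_batches)]


# 1000 embeddings, 50 unlabeled, batch size 100:
# the wrong set yields ten small batches instead of one batch of 50.
sizes_wrong = batch_sizes(1000, 50, 100, use_correct_set=False)
sizes_right = batch_sizes(1000, 50, 100, use_correct_set=True)

# Only 5 unlabeled indices: the wrong set now yields empty batches.
sizes_empty = batch_sizes(1000, 5, 100, use_correct_set=False)
```

In this sketch, basing the batch count on the set that is actually being split keeps every batch non-empty as long as at least one unlabeled index exists.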

Steps to reproduce

--

Environment:

small-text version: 1.3.x, 2.0.0-dev

Additional information

--