yzhangcs / parser

:rocket: State-of-the-art parsers for natural language.
https://parser.yzhang.site/
MIT License

kmeans #25

Closed attardi closed 4 years ago

attardi commented 4 years ago

I get this error when training on the Tamil treebank:

  File "/project/piqasso/tools/biaffine-parser/parser/utils/alg.py", line 18, in kmeans
    assert len(d) >= k, f"unable to assign {len(d)} datapoints to {k} clusters"
AssertionError: unable to assign 25 datapoints to 32 clusters

Using the debugger, I found that kmeans(x, k) is invoked with len(x) = 80 and k = 32. At line 10, the call

d, indices, f = x.unique(return_inverse=True, return_counts=True)

returns:

d = tensor([ 6., 7., 8., 10., 11., 12., 13., 14., 15., 16., 17., 18., 19., 20., 21., 22., 23., 24., 26., 27., 32., 33., 35., 45., 51.])
len(d) = 25

f = tensor([4, 1, 1, 5, 1, 7, 6, 4, 8, 8, 2, 2, 1, 8, 5, 2, 1, 2, 3, 3, 1, 2, 1, 1, 1])
len(f) = 25
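
The failing check can be reproduced without PyTorch. The sketch below is only an illustration, not the library's code: it rebuilds a list equivalent to x from the values and counts reported above (the original order is not recoverable and does not matter here) and compares the number of distinct lengths with k. The variable names n_points, n_unique, and can_cluster are made up for this example.

```python
from collections import Counter

# Unique sentence lengths (d) and their counts (f), copied from the debugger output.
d = [6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
     21, 22, 23, 24, 26, 27, 32, 33, 35, 45, 51]
f = [4, 1, 1, 5, 1, 7, 6, 4, 8, 8, 2, 2, 1, 8, 5, 2, 1, 2, 3, 3, 1, 2, 1, 1, 1]

# Rebuild a list equivalent to x by repeating each length by its count.
x = [length for length, count in zip(d, f) for _ in range(count)]

counts = Counter(x)       # plays the role of x.unique(return_counts=True)
n_points = len(x)         # total number of sentences
n_unique = len(counts)    # number of distinct sentence lengths
k = 32                    # number of clusters requested

# Same condition as the assertion at line 18 of alg.py.
can_cluster = n_unique >= k
print(f"{n_points} sentences, {n_unique} distinct lengths, k = {k}, ok = {can_cluster}")
```

The 80 sentences collapse to 25 distinct lengths, so the condition len(d) >= k is false for k = 32.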

With other treebanks it works fine.

Thank you for the nice and useful project.

yzhangcs commented 4 years ago

Hi, thanks for reporting this issue. The treebank contains only 25 distinct sentence lengths, so the sentences can be assigned to at most 25 buckets (as shown in the error log), which is fewer than the 32 clusters requested. You can fix this error by setting a smaller number of buckets, e.g., --buckets=16.
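
One way to pick a safe value before training is to cap the requested bucket count at the number of distinct sentence lengths. The helper below is a hypothetical sketch, not part of the parser; the name safe_num_buckets and its arguments are assumptions made for illustration.

```python
def safe_num_buckets(lengths, requested):
    """Return a bucket count that cannot exceed the number of distinct lengths.

    Hypothetical helper, not part of the parser: `lengths` is any iterable of
    sentence lengths and `requested` is the bucket count the user asked for.
    """
    n_unique = len(set(lengths))
    if n_unique == 0:
        raise ValueError("no sentence lengths given")
    return min(requested, n_unique)


# Toy data: four sentences with three distinct lengths.
print(safe_num_buckets([6, 6, 7, 8], 32))
```

With the Tamil treebank from this issue, such a cap would give 25, so any value up to that, including the suggested 16, passes the assertion.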