In week01, the data contains two gibberish lines:
index 403687 'parisflatlist\n'
index 35227 'AOSDHIADSOIHADSO DASODASHDASOH\n'
If the embeddings for these lines are not replaced with zeros (as the function description explicitly requires), averaging in get_phrase_embedding may produce NaN (division by zero). That requirement is easy to miss, though, and it is not mentioned in the seminar video recording. The NaN only surfaces later, in a dot product inside find_nearest.
I'd suggest adding an assert before find_nearest. It would save students time and hint that data_vectors was built incorrectly.
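A minimal sketch of such a check, assuming data_vectors is a 2-D NumPy array with one phrase embedding per row. The helper name check_vectors and the wording of the message are hypothetical, not part of the course code:

```python
import numpy as np

def check_vectors(data_vectors):
    """Fail early if any phrase embedding contains NaN or inf.

    Hypothetical helper: assumes data_vectors has shape (n_phrases, dim).
    """
    data_vectors = np.asarray(data_vectors)
    # Indices of rows that contain at least one non-finite value.
    bad_rows = np.where(~np.isfinite(data_vectors).all(axis=1))[0]
    assert len(bad_rows) == 0, (
        f"data_vectors has {len(bad_rows)} row(s) with NaN/inf, "
        f"first indices: {bad_rows[:5].tolist()}. "
        "Check that get_phrase_embedding returns zeros for phrases "
        "with no known tokens."
    )
```

Calling check_vectors(data_vectors) right before find_nearest would point at the offending rows instead of letting the NaN propagate through the dot product.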