attardi opened this issue 8 years ago
It's because we do SGD with mini-batches, and each mini-batch has sentences of varying lengths. One could sort/group the batches by sentence length, and then there would be no need to pad (as is often done in NMT).
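For anyone curious, here's a minimal sketch of that sort/group idea: bucket sentences by length so each mini-batch is homogeneous and needs no padding. The function name `make_batches` is just illustrative, not from this repo.

```python
import random

def make_batches(sentences, batch_size):
    """Group sentences of equal length into mini-batches (illustrative sketch).

    `sentences` are lists of word indices; batches drawn from one bucket
    all share a length, so no padding is required.
    """
    buckets = {}
    for sent in sentences:
        buckets.setdefault(len(sent), []).append(sent)

    batches = []
    for same_length in buckets.values():
        for i in range(0, len(same_length), batch_size):
            batches.append(same_length[i:i + batch_size])
    random.shuffle(batches)  # keep SGD stochastic across batch order
    return batches

# Each printed batch contains sentences of a single length
sents = [[1, 2], [3, 4], [5, 6, 7], [8, 9, 10], [11]]
for batch in make_batches(sents, batch_size=2):
    print(batch)
```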
A follow-up question: if the allowed sentence length n is greater than the actual length of a sentence, what are the vectors for the remaining positions? Are they set to zero, or are their elements given random values?
```
Traceback (most recent call last):
  File "conv_net_sentence.py", line 311, in
```
You should change the line `train = np.array(train, dtype="int")` as follows: `train = np.array(train, dtype="object")`.
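In case it helps others hitting this: it's the usual ragged-array behavior in newer NumPy. A minimal reproduction, assuming the rows of `train` end up with unequal lengths:

```python
import numpy as np

rows = [[1, 2, 3], [4, 5]]  # ragged: rows of unequal length

# With an explicit int dtype, NumPy refuses to build a ragged array:
try:
    np.array(rows, dtype="int")
except ValueError as e:
    print("dtype='int' fails:", e)

# dtype="object" stores each row as a Python object instead:
arr = np.array(rows, dtype="object")
print(arr.shape)  # (2,) -- a 1-D array of lists
```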
Why do you pad all sentences to the same length, currently fixed at 56? It should not be necessary, since the paper says the "pooling scheme naturally deals with variable sentence lengths". Shouldn't padding depend on the filter size? Right now it is fixed at 5 in the call `make_idx_data_cv(revs, word_idx_map, i, max_l=56, k=300, filter_h=5)`. BTW: k is not used.
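For reference, here is a rough sketch of the padding scheme that `filter_h` implies; the function name `pad_sentence` and the exact layout are my reconstruction, not necessarily the repo's `get_idx_from_sent`. A filter of height h can only cover the edge words in all of its windows if there are h - 1 extra positions on each side, and padding everything out to a common `max_l` lets the batch be stacked into one matrix (with index 0 presumably mapping to a zero vector).

```python
def pad_sentence(idx, max_l, filter_h):
    """Pad a list of word indices (index 0 = padding).

    Sketch: filter_h - 1 leading zeros so a filter of height filter_h
    can slide over the first word, then trailing zeros up to the common
    length max_l + 2 * (filter_h - 1).
    """
    pad = filter_h - 1
    x = [0] * pad + list(idx)
    x += [0] * (max_l + 2 * pad - len(x))
    return x

print(pad_sentence([7, 8, 9], max_l=6, filter_h=5))
# [0, 0, 0, 0, 7, 8, 9, 0, 0, 0, 0, 0, 0, 0]
```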