zhiguowang / BiMPM

BiMPM: Bilateral Multi-Perspective Matching for Natural Language Sentences
Apache License 2.0

Big Data Problem #33

Open xljhtq opened 6 years ago

xljhtq commented 6 years ago

When I load a file with a lot of data, I run into a problem: free memory keeps shrinking because of the sorting algorithm in the preprocessing step. What should I do to optimize it?

zhiguowang commented 6 years ago

I think one solution is to modify the "InstanceBatch" class in "SentenceMatchDataStream.py". Right now, my code loads all the data into memory and pads all variables beforehand (https://github.com/zhiguowang/BiMPM/blob/master/src/SentenceMatchDataStream.py#L165). However, the padding takes up a lot of memory.

One way to fix this is to skip padding while loading the data and instead pad each batch right before you use it. This line (https://github.com/zhiguowang/BiMPM/blob/master/src/SentenceMatchTrainer.py#L92) may be a good place to insert your padding function.
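
To illustrate the idea of padding on demand, here is a minimal sketch. It is not the actual BiMPM code: `LazyBatch`, `pad_2d`, and the field names are hypothetical, and the real "InstanceBatch" class holds more variables than shown here. The sketch assumes each sentence is stored as a plain list of word indices and that 0 is the padding value.

```python
import numpy as np


def pad_2d(sequences, pad_value=0, dtype=np.int32):
    """Pad variable-length index lists into one (batch, max_len) array."""
    max_len = max((len(s) for s in sequences), default=0)
    out = np.full((len(sequences), max_len), pad_value, dtype=dtype)
    for i, seq in enumerate(sequences):
        out[i, :len(seq)] = seq
    return out


class LazyBatch:
    """Hypothetical batch that stores only raw, unpadded index lists."""

    def __init__(self, sent1_idx, sent2_idx, labels):
        # Nothing is padded here, so memory use is proportional to the
        # actual number of tokens rather than batch_size * max_len.
        self.sent1_idx = sent1_idx
        self.sent2_idx = sent2_idx
        self.labels = labels

    def padded(self):
        """Build the padded arrays only when the batch is about to be used."""
        return {
            "sent1": pad_2d(self.sent1_idx),
            "sent2": pad_2d(self.sent2_idx),
            "sent1_len": np.array([len(s) for s in self.sent1_idx], dtype=np.int32),
            "sent2_len": np.array([len(s) for s in self.sent2_idx], dtype=np.int32),
            "labels": np.array(self.labels, dtype=np.int32),
        }


# Usage: keep LazyBatch objects in memory, pad inside the training loop.
batch = LazyBatch([[1, 2, 3], [4]], [[5], [6, 7]], [0, 1])
arrays = batch.padded()  # temporary; can be discarded after the step
```

With this layout the padded arrays exist only for the batch currently being processed, and they can be garbage-collected once the training step finishes.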