yzhangcs / parser

:rocket: State-of-the-art parsers for natural language.
https://parser.yzhang.site/
MIT License

Endless RAM consumption #73

Closed: MinionAttack closed this issue 3 years ago

MinionAttack commented 3 years ago

Hi,

I am testing SuPar in several languages. After running several tests with 100-dimensional embeddings, I am now repeating the experiments with 300-dimensional embeddings.

The problem is that as soon as SuPar starts, during the "INFO Building the fields" phase, it begins consuming RAM at an alarming rate. My test machine has 64 GB of RAM and a 122 GB swap partition, and in less than 4 minutes all the RAM was in use and swap usage had reached 30 GB and was still climbing.

I am using the Universal Dependencies files for English (EWT), and for the embeddings I'm using the fastText vectors for English (after deleting the header line by hand; a small script for this is sketched below).
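For reference, the header line can also be stripped with a short script instead of by hand (a minimal sketch; the output file name is just an example):

# Drop the fastText header line ("<num_vectors> <dim>") so the file starts
# directly with the first vector row.
with open('cc.en.300.vec', 'r', encoding='utf-8') as src, \
        open('cc.en.300.noheader.vec', 'w', encoding='utf-8') as dst:
    next(src)  # skip the header line
    for line in src:
        dst.write(line)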

This is the configuration file:

[Data]
encoder = 'lstm'
feat = ['tag', 'char']

[Network]
n_embed = 300
n_char_embed = 100
n_char_hidden = 100
n_feat_embed = 100
embed_dropout = .33
n_lstm_hidden = 400
n_lstm_layers = 3
encoder_dropout = .33
n_arc_mlp = 500
n_rel_mlp = 100
mlp_dropout = .33

[Optimizer]
lr = 2e-3
mu = .9
nu = .9
eps = 1e-12
weight_decay = 0
clip = 5.0
min_freq = 2
fix_len = 20
decay = .75
decay_steps = 5000
update_steps = 1

And this is the command to launch the parser (shell script):

#!/usr/bin/env bash
python -m supar.cmds.biaffine_dep train --build --device 0 --conf config/ud.biaffine.dep.lstm.char.ini \
--n-embed 300 --feat tag char --encoder lstm --unk unk \
--embed data/Pruebas/Embeddings/English/cc.en.300.vec \
--train data/Pruebas/UD/English-EWT/en_ewt-ud-train.conll \
--dev data/Pruebas/UD/English-EWT/en_ewt-ud-dev.conll \
--test data/Pruebas/UD/English-EWT/en_ewt-ud-test.conll \
--path models/Pruebas/English-EWT/Iteracion_1

And the parser, just before the "INFO Building the fields" message, prints:

---------------------+-------------------------------
Param                |             Value             
---------------------+-------------------------------
encoder              |              lstm             
feat                 |        ['tag', 'char']        
n_embed              |              300              
n_char_embed         |              100              
n_char_hidden        |              100              
n_feat_embed         |              100              
embed_dropout        |              0.33             
n_lstm_hidden        |              400              
n_lstm_layers        |               3               
encoder_dropout      |              0.33             
n_arc_mlp            |              500              
n_rel_mlp            |              100              
mlp_dropout          |              0.33             
lr                   |             0.002             
mu                   |              0.9              
nu                   |              0.9              
eps                  |             1e-12             
weight_decay         |               0               
clip                 |              5.0              
min_freq             |               2               
fix_len              |               20              
decay                |              0.75             
decay_steps          |              5000             
update_steps         |               1               
tree                 |             False             
proj                 |             False             
partial              |             False             
mode                 |             train             
path                 | models/Pruebas/English-EWT/Iteracion_1
device               |               0               
seed                 |               1               
threads              |               16              
local_rank           |               -1              
build                |              True             
punct                |             False             
max_len              |              None             
buckets              |               32              
train                | data/Pruebas/UD/English-EWT/en_ewt-ud-train.conll
dev                  | data/Pruebas/UD/English-EWT/en_ewt-ud-dev.conll
test                 | data/Pruebas/UD/English-EWT/en_ewt-ud-test.conll
embed                | data/Pruebas/Embeddings/English/cc.en.300.vec
unk                  |              unk              
bert                 |        bert-base-cased        
---------------------+-------------------------------

Regards.

yzhangcs commented 3 years ago

@MinionAttack Hi, this happens because the fastText embeddings are huge. You can refer to #36 to solve the issue.
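One way to shrink the file before training (a sketch of the general idea, not necessarily what #36 describes; the paths are the ones from the command above, and the filtered file name is just an example) is to keep only the vectors for tokens that actually occur in the treebank:

def conll_vocab(*paths):
    # Collect the FORM column of every regular token in the CoNLL files,
    # skipping comment lines, multiword ranges (e.g. "1-2") and empty nodes ("1.1").
    vocab = set()
    for path in paths:
        with open(path, 'r', encoding='utf-8') as f:
            for line in f:
                cols = line.rstrip('\n').split('\t')
                if len(cols) > 1 and cols[0].isdigit():
                    vocab.add(cols[1])
    return vocab

vocab = conll_vocab('data/Pruebas/UD/English-EWT/en_ewt-ud-train.conll',
                    'data/Pruebas/UD/English-EWT/en_ewt-ud-dev.conll',
                    'data/Pruebas/UD/English-EWT/en_ewt-ud-test.conll')

# Keep only the embedding rows whose token appears in the treebank
# (plus the row used for --unk, if the file contains one).
with open('data/Pruebas/Embeddings/English/cc.en.300.vec', 'r', encoding='utf-8') as src, \
        open('data/Pruebas/Embeddings/English/cc.en.300.filtered.vec', 'w', encoding='utf-8') as dst:
    for line in src:
        token = line.split(' ', 1)[0]
        if token in vocab or token == 'unk':
            dst.write(line)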

MinionAttack commented 3 years ago

Thanks for the reply. To make a fair comparison, and since the other parser I'm using limits the vocabulary to 200K embeddings, I've changed:

    @classmethod
    def load(cls, path, unk=None):

        with open(path, 'r') as f:
            lines = [line for line in f]

        splits = [line.split() for line in lines]
        tokens, vectors = zip(*[(s[0], list(map(float, s[1:])))
                                for s in splits])

        return cls(tokens, vectors, unk=unk)

to

    @classmethod
    def load(cls, path, unk=None):
        MAX_VOCABULARY_SIZE = 200000

        with open(path, 'r') as f:
            lines = [line for line in f]

        if len(lines) > MAX_VOCABULARY_SIZE:
            lines = lines[:MAX_VOCABULARY_SIZE]

        splits = [line.split() for line in lines]
        tokens, vectors = zip(*[(s[0], list(map(float, s[1:])))
                                for s in splits])

        return cls(tokens, vectors, unk=unk)

And now the RAM consumption is normal.
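For what it's worth, the same cap can be applied while streaming the file, so the full list of lines is never held in memory. This is only a sketch of an alternative to the patch above (the max_vocab parameter is hypothetical, not part of SuPar), meant as a drop-in replacement for the load classmethod shown above:

    @classmethod
    def load(cls, path, unk=None, max_vocab=200000):
        # Read at most max_vocab rows and parse them on the fly, instead of
        # materialising every line of the (very large) .vec file first.
        tokens, vectors = [], []
        with open(path, 'r') as f:
            for i, line in enumerate(f):
                if i >= max_vocab:
                    break
                s = line.split()
                tokens.append(s[0])
                vectors.append(list(map(float, s[1:])))

        return cls(tokens, vectors, unk=unk)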