yandex / odyssey

Scalable PostgreSQL connection pooler
BSD 3-Clause "New" or "Revised" License

Problem of memory #421

Open flionet89 opened 2 years ago

flionet89 commented 2 years ago

Good afternoon! I ran into a memory leak. I installed Odyssey and routed all of our microservices through it, about 100 pods in total. As a result, Odyssey does not release memory.

I am using this config: conf yandex.txt

The memory graph looks like this: image

The log has this message:

What could the problem be?

cleocir commented 2 years ago

Same issue here with version 1.2

image

x4m commented 2 years ago

Oops, sorry, somehow I missed this issue.

Do you have something specific in your workload? So far I know about one leak in error passing from server to client, but do not have a reproduction yet.

alexdyukov commented 2 years ago

Only auth_query and nothing more. Does the system worker keep connections without ever freeing them? At start:

1 2022-06-20T11:13:22Z info [none none] (stats) clients 0
1 2022-06-20T11:13:22Z info [none none] (stats) worker[0]: msg (0 allocated, 0 cached, 3 freed, 0 cache_size), coroutines (1 active, 0 cached), clients_processed: 0
1 2022-06-20T11:13:25Z info [none none] (stats) system worker: msg (5 allocated, 0 cached, 1 freed, 0 cache_size), coroutines (3 active, 0 cached) startup errors 0

After a few hours:

1 2022-06-20T16:11:04Z info [none none] (stats) clients 0
1 2022-06-20T16:11:04Z info [none none] (stats) worker[0]: msg (0 allocated, 0 cached, 5951 freed, 0 cache_size), coroutines (1 active, 0 cached), clients_processed: 0
1 2022-06-20T16:11:07Z info [none none] (stats) system worker: msg (5953 allocated, 0 cached, 1 freed, 0 cache_size), coroutines (3 active, 0 cached) startup errors 0

And in the Kubernetes stats, memory usage keeps increasing: photo_2022-06-20_19-20-55
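One way to read the counters quoted above is to sum them across all workers rather than looking at the system worker alone. The sketch below assumes (this is not confirmed anywhere in the thread) that a message allocated on one thread can be freed on another, so only the totals are meaningful. The names `STATS_RE` and `outstanding_messages` are made up for this illustration; the log lines are copied from the comment above.

```python
import re

# Matches the "msg (...)" part of an Odyssey stats line as quoted in this thread.
STATS_RE = re.compile(
    r"\(stats\) (?P<who>system worker|worker\[\d+\]): "
    r"msg \((?P<allocated>\d+) allocated, (?P<cached>\d+) cached, "
    r"(?P<freed>\d+) freed"
)


def outstanding_messages(log_lines):
    """Return total allocated minus total freed messages across all workers.

    Assumption: allocations and frees may happen on different threads,
    so per-worker counters are summed before subtracting.
    """
    allocated = 0
    freed = 0
    for line in log_lines:
        match = STATS_RE.search(line)
        if match:
            allocated += int(match["allocated"])
            freed += int(match["freed"])
    return allocated - freed


AT_START = [
    "1 2022-06-20T11:13:22Z info [none none] (stats) worker[0]: msg (0 allocated, 0 cached, 3 freed, 0 cache_size), coroutines (1 active, 0 cached), clients_processed: 0",
    "1 2022-06-20T11:13:25Z info [none none] (stats) system worker: msg (5 allocated, 0 cached, 1 freed, 0 cache_size), coroutines (3 active, 0 cached) startup errors 0",
]
AFTER_HOURS = [
    "1 2022-06-20T16:11:04Z info [none none] (stats) worker[0]: msg (0 allocated, 0 cached, 5951 freed, 0 cache_size), coroutines (1 active, 0 cached), clients_processed: 0",
    "1 2022-06-20T16:11:07Z info [none none] (stats) system worker: msg (5953 allocated, 0 cached, 1 freed, 0 cache_size), coroutines (3 active, 0 cached) startup errors 0",
]

print("outstanding at start:", outstanding_messages(AT_START))
print("outstanding after hours:", outstanding_messages(AFTER_HOURS))
```

Under that assumption the totals differ by the same small amount in both snapshots, which would suggest the growth seen in Kubernetes is not explained by these message counters alone. This is only a reading of the quoted lines, which were also logged a few seconds apart, not a confirmed diagnosis.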

vilyansky commented 1 year ago

Hello, I am facing the same problem.

memoryleaksodyssey

What kind of diagnostics can be done?

GRouslan commented 1 year ago

Same problem. Version 1.3

image image

flionet89 commented 1 year ago

Good afternoon. Is there any understanding of what the problem might be?

x4m commented 1 year ago

Yes, the root cause is understood: most likely it is a memory leak in the auth_query implementation. I have not gotten to it yet, because issues 487 and 483 are currently in my scope.

evkuzin commented 1 year ago

This is still happening. I can easily reproduce it with:

while true
do
sudo -u postgres pgbench -c 500 -r -T10 -j40 -h 127.0.0.1 -U testuser test
done

With this running, the RSS of the Odyssey process grows steadily.
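To put numbers on the growth while the pgbench loop runs, the RSS of the Odyssey process can be sampled at a fixed interval. The following is a minimal sketch, assuming a Linux host where `/proc/<pid>/status` exposes a `VmRSS` line; the function names and the default interval are illustrative and not part of Odyssey or of the original report.

```python
import re
import time
from pathlib import Path


def parse_vmrss_kib(status_text):
    """Return the VmRSS value (in kB as reported by the kernel), or None if absent."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kib(pid):
    """Read the current RSS of a process from /proc (Linux only)."""
    return parse_vmrss_kib(Path(f"/proc/{pid}/status").read_text())


def sample_rss(pid, interval_s=10.0, count=6):
    """Collect `count` RSS samples, sleeping `interval_s` seconds between reads."""
    samples = []
    for i in range(count):
        samples.append(read_rss_kib(pid))
        if i < count - 1:
            time.sleep(interval_s)
    return samples


# Example call (the PID is a placeholder, find the real one with `pgrep odyssey`):
# print(sample_rss(12345, interval_s=10.0, count=30))
```

A series that keeps climbing after the connection pools have warmed up, rather than levelling off, is the pattern described in this issue.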

ilya-maltsev commented 1 year ago

In fact, there are leaks without auth_query, especially if you increase stack_size

evkuzin commented 1 year ago

@x4m, any chance you could look into it in the near future? 🙏

x4m commented 1 year ago

I'll definitely look into this, but it does not reproduce for me with regular pgbench.

ilya-maltsev commented 1 year ago

@x4m Can you show the odyssey.conf that you used in your tests?

evkuzin commented 1 year ago

Here is mine:

daemonize no
pid_file "/var/lib/odyssey/odyssey.pid"
locks_dir "/run/odyssey"
graceful_die_on_errors no
#enable_online_restart yes
bindwith_reuseport yes
log_file "/var/log/postgresql/odyssey.log"
log_format "%p %t %l [%i %s] (%c) %m\n"
log_to_stdout no
log_syslog no
log_syslog_ident "odyssey"
log_syslog_facility "daemon"
log_debug no
log_config no
log_session no
log_query no
log_stats yes
stats_interval 60
promhttp_server_port 7777
workers 5
resolvers 2
readahead 8192
cache_coroutine 5
coroutine_stack_size 8
nodelay yes
keepalive 15
keepalive_keep_interval 75
keepalive_probes 9
keepalive_usr_timeout 0
bindwith_reuseport yes
unix_socket_dir "/var/run/postgresql"
unix_socket_mode "0777"

listen {
    host "*"
    port 5432
    backlog 256
    tls "allow"
    tls_ca_file "/etc/ca-certificates/root.crt"
    tls_key_file "/etc/odyssey/odyssey.key"
    tls_cert_file "/etc/odyssey/odyssey.crt"
    compression no
}

listen {
    port 5432
    backlog 256
    compression no
}

storage "postgres_server" {
    type "remote"
    host "127.0.0.1"
    port 5433
}

storage "postgres_server_unixsock" {
    type "remote"
    port 5433
}

storage "local" {
    type "local"
}

database default {
    user default {
        authentication "md5"
        auth_query "SELECT uname, phash FROM user_lookup($1)"
        auth_query_db "postgres"
        auth_query_user "odyssey"
        storage "postgres_server"
        pool "transaction"
        pool_size 100
        pool_timeout 0
        pool_ttl 60
        pool_discard yes
        pool_cancel yes
        pool_rollback yes
        pool_client_idle_timeout 0
        pool_idle_in_transaction_timeout 0
        client_fwd_error yes
        application_name_add_host yes
        reserve_session_server_connection yes
        server_lifetime 3600
        log_debug no
        quantiles "0.99,0.95,0.5"
    }
    user "odyssey" {
        authentication "none"
        storage "postgres_server_unixsock"
        pool "session"
        pool_routing "internal"
    }
}
database "odyssey_console" {
    user "prometheus" {
        authentication "md5"
        auth_query "SELECT uname, phash FROM user_lookup($1)"
        auth_query_db "postgres"
        auth_query_user "odyssey"
        storage "local"
        pool "session"
        role "stat"
    }
}

evkuzin commented 1 year ago

Did you manage to reproduce it? May I help somehow with reproducing it?

x4m commented 1 year ago

I'm overwhelmed by the number of tasks, sorry... currently we are working on making auth_query better (support for SCRAM, caching etc), I hope we will track this leak in that project.

evkuzin commented 1 year ago

Hi! How is the work going? When shall we expect the new version?

evkuzin commented 1 year ago

@x4m, in case it helps: disabling auth_query and turning off all logging let it run more or less stably. So I don't think it is exactly an auth_query issue. It might be a machinarium issue.

x4m commented 1 year ago

Recently I fixed one leak in query cancelling (#527), but so far there is no further progress on this issue. @evkuzin, is there some specific kind of logging that seems to leak? I do not even have a reproduction. Some folks hinted that lowering pool_ttl might highlight some leaks, though I have not been able to reproduce it yet.

evkuzin commented 1 year ago

Thank you for looking into it!

Currently, it works for me if I disable all possible logging everywhere.

image

This graph shows the amount of free memory on the node. Both replicas hit OOM. One was then restarted with no logging and a static user config, and the second (the one eating all the memory) with a config like the one I posted above. Try the config above and watch the RSS of Odyssey:

while true
do
pgbench -n -P2 -C -c 500 -t100000 -r -j100 -h ODYSSEY_HOST -U test sbtest
done

while true
do
sudo ps -axo pid,rss | grep $(pgrep odyssey) | awk '{print $2}'; sleep 10
done

8808
24760
25176
25088
25136
25044
25356
25432
25452
25288
25316
25320
25704
25736
25708
25564
25628
25836
25764
25876
25776
25800
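The samples above can be turned into a rough growth rate with a least-squares line. This sketch assumes the values are KiB as printed by `ps` on Linux, that consecutive samples are about 10 seconds apart (the `sleep 10` in the loop, ignoring the time `ps` itself takes), and that the first value, 8808, was taken before the load started and can be dropped as warm-up. The function name `slope_per_second` is made up for this illustration.

```python
def slope_per_second(samples, interval_s):
    """Least-squares slope of evenly spaced samples, in sample units per second."""
    n = len(samples)
    xs = [i * interval_s for i in range(n)]
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# RSS values copied from the comment above (assumed to be KiB).
RSS_KIB = [
    8808, 24760, 25176, 25088, 25136, 25044, 25356, 25432, 25452, 25288, 25316,
    25320, 25704, 25736, 25708, 25564, 25628, 25836, 25764, 25876, 25776, 25800,
]

# Assumption: the first sample predates the load, so it is excluded from the fit.
steady = RSS_KIB[1:]
rate_kib_s = slope_per_second(steady, interval_s=10.0)
print(f"{rate_kib_s:.2f} KiB/s, about {rate_kib_s * 3600 / 1024:.1f} MiB/hour")
```

The fitted slope is positive, on the order of a few KiB per second over this short window. A few minutes of data is too little to distinguish a leak from slow cache growth, so a longer run would be needed before drawing conclusions.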

evkuzin commented 1 year ago

I'll try to reproduce it tomorrow with a build from master that includes your fix.

evkuzin commented 1 year ago

I think my problem is that I am somehow building the binary incorrectly. Could that be the case? I opened another issue:

https://github.com/yandex/odyssey/issues/538

evkuzin commented 1 year ago

And yes, the memory has stopped leaking (at least on the test bench where the previous version leaked). Thank you!

Object905 commented 2 months ago

Tried disabling logging and Prometheus. It still leaks with auth_query (built from "Possible fix for mem leak", https://github.com/yandex/odyssey/pull/685). Static user configuration is too much hassle for my case. I have to restart the poolers roughly once a day because of this.

12 hours of leaking: image

ramili4 commented 1 month ago

"And yes, the memory has stopped leaking (at least on the test bench where the previous version leaked). Thank you!"

Just by disabling logging?