Open flionet89 opened 2 years ago
Same issue here with version 1.2
Oops, sorry, somehow I missed this issue.
Do you have something specific in your workload? So far I know about one leak in error passing from server to client, but do not have a reproduction yet.
Only auth_query, nothing more. Does the system worker hold on to connections without freeing anything? At start:
1 2022-06-20T11:13:22Z info [none none] (stats) clients 0
1 2022-06-20T11:13:22Z info [none none] (stats) worker[0]: msg (0 allocated, 0 cached, 3 freed, 0 cache_size), coroutines (1 active, 0 cached), clients_processed: 0
1 2022-06-20T11:13:25Z info [none none] (stats) system worker: msg (5 allocated, 0 cached, 1 freed, 0 cache_size), coroutines (3 active, 0 cached) startup errors 0
After a few hours:
1 2022-06-20T16:11:04Z info [none none] (stats) clients 0
1 2022-06-20T16:11:04Z info [none none] (stats) worker[0]: msg (0 allocated, 0 cached, 5951 freed, 0 cache_size), coroutines (1 active, 0 cached), clients_processed: 0
1 2022-06-20T16:11:07Z info [none none] (stats) system worker: msg (5953 allocated, 0 cached, 1 freed, 0 cache_size), coroutines (3 active, 0 cached) startup errors 0
And in the Kubernetes stats, memory usage keeps increasing:
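To check whether the msg counters themselves are out of balance, the allocated and freed values can be summed across all workers in one stats snapshot. The sketch below assumes the log format shown above; `msg_balance` is a hypothetical helper name, not part of Odyssey. For both snapshots quoted above the sum works out to one outstanding message (5 allocated vs. 4 freed at start, 5953 vs. 5952 a few hours later), which may mean that messages allocated by the system worker are freed by worker[0] and that the memory growth comes from somewhere else.

```shell
# Hypothetical helper: reads Odyssey "(stats)" log lines on stdin and sums
# the msg "allocated" and "freed" counters over every worker line.
# Assumes the log format quoted in this thread.
msg_balance() {
  awk '
    /\(stats\)/ && / msg \(/ {
      # "5953 allocated" + 0 converts the leading number to an integer
      if (match($0, /[0-9]+ allocated/)) allocated += substr($0, RSTART, RLENGTH) + 0
      if (match($0, /[0-9]+ freed/))     freed     += substr($0, RSTART, RLENGTH) + 0
    }
    END {
      printf "allocated=%d freed=%d outstanding=%d\n", allocated, freed, allocated - freed
    }
  '
}

# Usage (the log path is an assumption; feed it the lines of one snapshot):
#   grep '(stats)' /var/log/postgresql/odyssey.log | tail -n 3 | msg_balance
```

If the outstanding count stays flat while RSS grows, the msg counters alone probably do not explain the leak.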
Hello, I'm facing the same problem.
What kind of diagnostics can be done?
Same problem. Version 1.3
Good afternoon. Is there any understanding of what the problem might be?
Yes, there is: the leak is most likely in auth_query. But I haven't gotten around to it yet; I'm looking at 487 and 483.
To summarize for anyone following along: Ruslan asked whether the root cause is known. The root cause is probably a memory leak in the auth_query implementation. I'm not actively working on this leak right now, because issues 487 and 483 are in my scope.
This is still happening. I can reproduce it easily with:
while true
do
sudo -u postgres pgbench -c 500 -r -T10 -j40 -h 127.0.0.1 -U testuser test
done
Then watch the RSS memory grow steadily.
In fact, there are leaks even without auth_query, especially if you increase stack_size.
@x4m, any chance you could look into it in the near future? 🙏
I'll definitely look into this. But it does not reproduce for me with regular pgbench.
@x4m Can you show the odyssey.conf that you used in your tests?
Here is mine:
daemonize no
pid_file "/var/lib/odyssey/odyssey.pid"
locks_dir "/run/odyssey"
graceful_die_on_errors no
#enable_online_restart yes
bindwith_reuseport yes
log_file "/var/log/postgresql/odyssey.log"
log_format "%p %t %l [%i %s] (%c) %m\n"
log_to_stdout no
log_syslog no
log_syslog_ident "odyssey"
log_syslog_facility "daemon"
log_debug no
log_config no
log_session no
log_query no
log_stats yes
stats_interval 60
promhttp_server_port 7777
workers 5
resolvers 2
readahead 8192
cache_coroutine 5
coroutine_stack_size 8
nodelay yes
keepalive 15
keepalive_keep_interval 75
keepalive_probes 9
keepalive_usr_timeout 0
bindwith_reuseport yes
unix_socket_dir "/var/run/postgresql"
unix_socket_mode "0777"
listen {
    host "*"
    port 5432
    backlog 256
    tls "allow"
    tls_ca_file "/etc/ca-certificates/root.crt"
    tls_key_file "/etc/odyssey/odyssey.key"
    tls_cert_file "/etc/odyssey/odyssey.crt"
    compression no
}
listen {
    port 5432
    backlog 256
    compression no
}
storage "postgres_server" {
    type "remote"
    host "127.0.0.1"
    port 5433
}
storage "postgres_server_unixsock" {
    type "remote"
    port 5433
}
storage "local" {
    type "local"
}
database default {
    user default {
        authentication "md5"
        auth_query "SELECT uname, phash FROM user_lookup($1)"
        auth_query_db "postgres"
        auth_query_user "odyssey"
        storage "postgres_server"
        pool "transaction"
        pool_size 100
        pool_timeout 0
        pool_ttl 60
        pool_discard yes
        pool_cancel yes
        pool_rollback yes
        pool_client_idle_timeout 0
        pool_idle_in_transaction_timeout 0
        client_fwd_error yes
        application_name_add_host yes
        reserve_session_server_connection yes
        server_lifetime 3600
        log_debug no
        quantiles "0.99,0.95,0.5"
    }
    user "odyssey" {
        authentication "none"
        storage "postgres_server_unixsock"
        pool "session"
        pool_routing "internal"
    }
}
database "odyssey_console" {
    user "prometheus" {
        authentication "md5"
        auth_query "SELECT uname, phash FROM user_lookup($1)"
        auth_query_db "postgres"
        auth_query_user "odyssey"
        storage "local"
        pool "session"
        role "stat"
    }
}
Did you manage to reproduce it? May I help somehow with reproducing it?
I'm overwhelmed by the number of tasks, sorry... Currently we are working on making auth_query better (support for SCRAM, caching, etc.), and I hope we will track down this leak as part of that project.
Hi! How is the work going? When shall we expect the new version?
@x4m in case it helps: disabling the auth query and turning off all logging let it run more or less stably. So I don't think it's exactly an auth query issue. It might be a machinarium issue.
Recently I fixed one leak in query cancelling (#527), but so far there is no further progress on this issue. @evkuzin, is there some specific kind of logging that seems to leak? I do not even have a reproduction. Some folks hinted that lowering pool_ttl might highlight some leaks, though I could not reproduce it yet.
Thank you for looking into it!
Currently, it works for me if I disable all possible logging everywhere.
This picture shows the amount of free memory on the node. Both replicas hit OOM; then one was restarted with no logs and a static users config, and the second (the one eating all the memory) with a config like the one I posted above. Try the config above and watch the RSS for Odyssey:
while true
do
pgbench -n -P2 -C -c 500 -t100000 -r -j100 -h ODYSSEY_HOST -U test sbtest
done
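while true
do
# print Odyssey's RSS in KB; pgrep -o picks the oldest matching process,
# and ps -p avoids grep matching the PID as a substring of other lines
ps -o rss= -p "$(pgrep -o odyssey)"; sleep 10
done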
8808
24760
25176
25088
25136
25044
25356
25432
25452
25288
25316
25320
25704
25736
25708
25564
25628
25836
25764
25876
25776
25800
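To put a number on the growth, the samples can be summarized with a small helper. This is only a sketch: it assumes one RSS value in KB per line, as printed by the loop above, and `rss_growth` is a hypothetical name. The optional argument is the number of leading samples to skip as warm-up (default 1).

```shell
# Hypothetical helper: reads RSS samples (KB, one per line) on stdin and
# reports how much RSS changed between the first and last kept sample.
rss_growth() {
  skip=${1:-1}   # number of leading warm-up samples to ignore
  awk -v skip="$skip" '
    NR > skip && $1 ~ /^[0-9]+$/ {
      if (n == 0) first = $1
      last = $1
      n++
    }
    END {
      if (n == 0) { print "no samples"; exit 1 }
      printf "samples=%d first_kb=%d last_kb=%d delta_kb=%d\n", n, first, last, last - first
    }
  '
}

# Usage (rss.log is an assumed file holding the loop output):
#   rss_growth 1 < rss.log
```

For the 22 samples above, skipping the first one gives a change from 24760 KB to 25800 KB, that is +1040 KB across 20 sampling intervals of roughly 10 seconds each.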
I'll try to reproduce it tomorrow with a build from master and your fix.
It seems to me that my problem is that I'm building the binary incorrectly somehow. Could that be the case? I opened another issue.
And yes, the memory stopped leaking (at least on the test stand where the previous version leaked). Thank you!
Tried disabling logging and prometheus. It still leaks with auth_query (built from "Possible fix for mem leak", https://github.com/yandex/odyssey/pull/685).
Static user configuration is too much hassle for my case. I have to restart the poolers roughly once a day because of this.
12 hours of leaking:
And yes, the memory stopped leaking (at least on the test stand where the previous version leaked). Thank you!
Just by disabling logging?
Good afternoon! I ran into a memory leak issue. I installed Odyssey and routed all of our microservices through it, about 100 pods. As a result, Odyssey does not release memory. I am using this config: conf yandex.txt The graph looks like this: The log has this message: Can you tell me what the problem might be?