zadam / trilium

Build your personal knowledge base with Trilium Notes
GNU Affero General Public License v3.0

(Bug report) Docker healthcheck creates gigantic amounts of defunct processes, ultimately freezing server #3582

Closed: asoftbird closed this issue 1 year ago

asoftbird commented 1 year ago

Trilium Version

0.57.3

What operating system are you using?

Other Linux

What is your setup?

Local + server sync

Operating System Version

SMP Debian 4.19.194-2 (2021-06-21) x86_64 GNU/Linux

Description

I'm running Trilium through Docker and I've noticed that the health check almost always reports "unhealthy", even mere minutes after starting. Trilium doesn't seem to experience any issues and runs just fine, but I do get tons of zombie processes that take up PIDs.

A small excerpt from docker inspect trilium:

"State": {
            "Status": "running",
            "Running": true,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 20145,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "2023-01-31T19:35:08.128875654Z",
            "FinishedAt": "2023-01-31T19:34:30.920671605Z",
            "Health": {
                "Status": "unhealthy",
                "FailingStreak": 61,
                "Log": [
                    {
                        "Start": "2023-01-31T21:03:44.762876913+01:00",
                        "End": "2023-01-31T21:03:44.86980051+01:00",
                        "ExitCode": 1,
                        "Output": ""
                    },
                    {
                        "Start": "2023-01-31T21:04:14.873137152+01:00",
                        "End": "2023-01-31T21:04:14.975751672+01:00",
                        "ExitCode": 1,
                        "Output": ""
                    },

Upon inspecting the code for the healthcheck, I found that /api/health-check is simply checked for a 200 response. Visiting the API endpoint of my Trilium site from an external computer works just fine, but running wget 127.0.0.1:8080/api/health-check on the server itself returns no data:

> wget 127.0.0.1:8080/api/health-check
--2023-01-31 21:15:55--  http://127.0.0.1:8080/api/health-check
Connecting to 127.0.0.1:8080... connected.
HTTP request sent, awaiting response... No data received.
Retrying.

This health check also seems to create a zombie process for each failed check. (I don't know if this happens for successful checks as well; I've never had one!) Every 30 seconds a new PID is created, which aligns with the timestamps in the docker inspect trilium output.
(screenshot: process list showing the accumulating defunct processes)
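For anyone wanting to confirm the same symptom, a quick way to list zombie processes directly (assuming a procps-style ps; the awk filter keeps rows whose state column starts with Z):

```shell
# List zombie (state "Z") processes: PID, state, and command name.
ps -eo pid,stat,comm | awk '$2 ~ /^Z/'
```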

I run the trilium container with the following command:
docker run --name=trilium -d -p 0.0.0.0:8080:8080 -v ~/trilium-data:/home/node/trilium-data -v /etc/letsencrypt:/home/node/trilium-cert zadam/trilium:0.57.3

Is this my server configuration being wonky or is this trilium being unable to connect to itself somehow?

(I am aware I am not running the latest version; however, I looked through the changelogs and couldn't find anything that seemed relevant. I haven't been able to update yet.)

Error logs

19:35:08.864 DB size: 160235 KB
19:35:11.025 Trusted reverse proxy: false
19:35:11.044 App HTTPS server starting up at port 8080
19:35:11.045 { "appVersion": "0.57.3", "dbVersion": 197, "syncVersion": 26, "buildDate": "2022-12-02T22:44:40+01:00", "buildRevision": "42cd33369407a1a64eeeced4a0729b5fcd7cac47", "dataDirectory": "/home/node/trilium-data", "clipperProtocolVersion": "1.0", "utcDateTime": "2023-01-31T19:35:08.861Z" }
19:35:11.045 CPU model: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, logical cores: 4 freq: 2199 Mhz
19:35:11.054 Listening on port 8080
19:35:11.091 Slow query took 34ms: SELECT noteId, title, type, mime, isProtected, dateCreated, dateModified, utcDateCreated, utcDateModified FROM notes WHERE isDeleted = 0
19:35:11.155 Slow query took 28ms: SELECT attributeId, noteId, type, name, value, isInheritable, position, utcDateModified FROM attributes WHERE isDeleted = 0
19:35:11.181 Becca (note cache) load took 125ms
19:35:15.215 Table counts: notes: 1033, note_revisions: 1712, branches: 1047, attributes: 2110, etapi_tokens: 1
19:35:15.311 Slow query took 91ms: SELECT id, entityId FROM entity_changes LEFT JOIN notes ON entityId = noteId WHERE entity_changes.isErased = 0 AND entity_changes.entityName = 'notes' AND noteId IS NULL
19:35:16.524 Slow query took 1134ms: SELECT notes.noteId, notes.type, notes.mime FROM notes JOIN note_contents USING (noteId) WHERE isDeleted = 0 AND isProtected = 0 AND content IS NULL
19:35:16.550 All consistency checks passed with no errors detected (took 1351ms)
19:46:05.288 Generated CSRF token 5EG44gh4-5U0Ee9NA8f-C4yKGdLutAgh2XmCs with secret undefined
19:46:05.682 200 GET /api/options with 6731 bytes took 3ms
19:46:05.690 304 GET /api/tree with 43490 bytes took 4ms
19:46:05.702 304 GET /api/keyboard-actions with 11935 bytes took 2ms
19:46:05.706 304 GET /api/script/widgets with 2717 bytes took 3ms
19:46:05.709 304 GET /api/keyboard-shortcuts-for-notes with 2 bytes took 3ms
19:46:05.798 200 POST /api/tree/load with 3965 bytes took 2ms
19:46:05.984 200 POST /api/tree/load with 2844 bytes took 2ms
19:46:05.997 200 POST /api/tree/load with 7080 bytes took 3ms
19:46:06.003 304 GET /api/search/%23bookmarked%20or%20%23bookmarkFolder with 16 bytes took 5ms
19:46:06.069 200 POST /api/tree/load with 4471 bytes took 2ms
19:46:06.075 200 POST /api/tree/load with 8886 bytes took 5ms
19:46:06.137 200 POST /api/tree/load with 8472 bytes took 2ms
19:46:06.989 304 GET /api/note-map/JDcS1eZGIjVY/backlink-count with 11 bytes took 2ms
19:46:06.991 304 GET /api/notes/JDcS1eZGIjVY with 2916 bytes took 2ms
19:46:07.982 304 GET /api/script/startup with 861 bytes took 2ms
20:00:03.361 Generated CSRF token b863jFA3-312zfQ1EY0X5Ga43SD0uJOyIbpE with secret undefined
20:00:03.773 304 GET /api/tree with 43490 bytes took 3ms
20:00:03.777 304 GET /api/options with 6731 bytes took 4ms
20:00:03.780 304 GET /api/keyboard-actions with 11935 bytes took 2ms
20:00:03.782 304 GET /api/keyboard-shortcuts-for-notes with 2 bytes took 1ms
20:00:03.784 304 GET /api/script/widgets with 2717 bytes took 1ms
20:00:03.841 200 POST /api/tree/load with 3965 bytes took 2ms
20:00:04.035 200 POST /api/tree/load with 2844 bytes took 1ms
20:00:04.039 200 POST /api/tree/load with 7080 bytes took 3ms
20:00:04.041 304 GET /api/search/%23bookmarked%20or%20%23bookmarkFolder with 16 bytes took 1ms
20:00:04.085 200 POST /api/tree/load with 4471 bytes took 3ms
20:00:04.098 200 POST /api/tree/load with 8886 bytes took 4ms
20:00:04.149 200 POST /api/tree/load with 8472 bytes took 2ms
20:00:04.977 304 GET /api/note-map/JDcS1eZGIjVY/backlink-count with 11 bytes took 1ms
20:00:04.979 304 GET /api/notes/JDcS1eZGIjVY with 2916 bytes took 1ms
20:00:06.006 304 GET /api/script/startup with 861 bytes took 2ms
20:00:09.456 200 GET /api/health-check with 15 bytes took 2ms
20:00:15.684 304 GET /api/health-check with 15 bytes took 1ms
20:00:16.938 304 GET /api/health-check with 15 bytes took 1ms

Note: The last three lines occurred only after manually navigating to https://[external-ip]/api/health-check.

T012m3n7oR commented 1 year ago

I can see in your log that it shows HTTPS running on 8080. Are you actually running HTTPS on 8080, or is it HTTP on 8080? If you're running HTTPS and you try to curl the same endpoint over plain HTTP (127.0.0.1 with no scheme defaults to http), I would expect it to fail, as nothing on 8080 is serving plain HTTP.

asoftbird commented 1 year ago

> I can see in your log that it shows HTTPS running on 8080. Are you actually running HTTPS on 8080, or is it HTTP on 8080? If you're running HTTPS and you try to curl the same endpoint over plain HTTP (127.0.0.1 with no scheme defaults to http), I would expect it to fail, as nothing on 8080 is serving plain HTTP.

Might be correct, given the output of the following commands for http and https:

$ curl -k https://127.0.0.1:8080/api/health-check
{"status":"ok"}

$ curl 127.0.0.1:8080/api/health-check
curl: (52) Empty reply from server

In any case, this was just a guess as to the cause, and the issue persists; after a few days of runtime, my server froze again, and I saw literally thousands of defunct zombie processes in my process list. The output of ps aux -H has 16250 lines.

Upon checking the latest docker status log, I found just about as many failed health checks (it sends one every 30 seconds, iirc):

"Status":"unhealthy","FailingStreak":16517
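As a sanity check (my arithmetic, not from the log): a failing streak of 16517 at one check every 30 seconds corresponds to a bit over 5 days of runtime, which matches the "few days" before the freeze:

```shell
# 16517 failed checks * 30 s per check, converted to whole days
echo $(( 16517 * 30 / 86400 ))  # prints 5, i.e. between 5 and 6 days
```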

What's going wrong here? It just keeps freezing my server after a few days of runtime, even without using it.
Can I at least disable this health check anywhere to see if it helps somehow?

zadam commented 1 year ago

Hi, it's not possible to disable the healthcheck without rebuilding the image.

In the dev branch (future 0.59) I reimplemented the healthcheck in node.js (from bash), which makes it possible to access the configuration (hostname, port). But specifically for built-in TLS the healthcheck will remain disabled for now (it will probably need some special handling).
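For illustration only, a scheme-aware check along those lines can be sketched in shell. This is not the actual node.js implementation; the hostname, port, and the HTTPS fallback with certificate checks disabled are all assumptions:

```shell
#!/bin/sh
# Hypothetical sketch (NOT the real Trilium healthcheck): try plain HTTP
# first, then HTTPS with certificate verification disabled, since the
# built-in TLS setup typically uses a self-signed certificate.
check_health() {
  host="${1:-127.0.0.1}"
  port="${2:-8080}"
  curl -fs -m 5 -o /dev/null "http://$host:$port/api/health-check" \
    || curl -fsk -m 5 -o /dev/null "https://$host:$port/api/health-check"
}

# In a Dockerfile HEALTHCHECK this would be used as: check_health || exit 1
```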

asoftbird commented 1 year ago

> Hi, it's not possible to disable the healthcheck without rebuilding the image.
>
> In the dev branch (future 0.59) I reimplemented the healthcheck in node.js (from bash), which makes it possible to access the configuration (hostname, port). But specifically for built-in TLS the healthcheck will remain disabled for now (it will probably need some special handling).

Ah, nice.

In the meantime, what can I do to make sure the process cleans up its defunct processes?

zadam commented 1 year ago

@asoftbird TBH I don't know.
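One hedged workaround worth trying (my suggestion, not something confirmed in this thread): Docker's --init flag runs a minimal init process (tini) as PID 1 inside the container, and an init process reaps orphaned children, so the defunct healthcheck processes should no longer accumulate. Adapted from the docker run command posted earlier:

```shell
# Same container as before, but with --init so tini reaps zombie children.
# Whether this fully fixes the accumulation here is untested.
docker run --init --name=trilium -d -p 0.0.0.0:8080:8080 \
  -v ~/trilium-data:/home/node/trilium-data \
  -v /etc/letsencrypt:/home/node/trilium-cert \
  zadam/trilium:0.57.3
```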

asoftbird commented 1 year ago

For the time being, I've resorted to restarting the docker containers once the count of defunct processes gets beyond an arbitrary threshold, running hourly.
Dropping this lil snippet here in case anyone else runs into this issue and needs the sysadmin equivalent of Flex Seal.

#!/bin/bash
# Count zombie ("defunct") processes; the [d] trick keeps grep from matching itself.
defunct_processes=$(ps -ef | grep -c '[d]efunct')
echo "defunct processes: ${defunct_processes}/500"

if [ "$defunct_processes" -gt 500 ]
then
  echo 'too many defunct processes, restarting'
  docker restart $(docker ps -a -q -f status=running)
fi
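To run it hourly as described, a crontab entry could look like this (the script path is a placeholder):

```shell
# crontab -e
# min hour dom mon dow  command
0 * * * * /path/to/restart-defunct.sh
```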

It's something!

sigaloid commented 1 year ago

Hi, oddly enough, I just checked the defunct process count on Trilium.cc and I do not have that many defunct processes. All containers are running 0.58.8, which doesn't seem (?) to have the fix from dev yet; if the health checks were regularly failing, they would multiply many times across all of the containers. I only have 13 defunct processes, though. I'm using podman; perhaps there's a difference in the way it handles these defunct processes?

zadam commented 1 year ago

New healthcheck is in 0.59.0-beta.

asoftbird commented 1 year ago

Seems to be resolved after updating to v0.59.2!