Fix deployed to staging, and ParsePdf.who_iris running now. Memory usage seems to be holding below 512MB:
Every 2.0s: kubectl top pod -l app=airflow-worker ast325: Fri Sep 6 12:45:20 2019
NAME CPU(cores) MEMORY(bytes)
airflow-worker-6588fd668b-cvxnn 921m 475Mi
Well...it held for a while. Task died at 1h29m:
[Fri Sep 6 13:05:54 2019] /usr/local/bin/ cpuset=d626adfdaf059ddc755bf0ef60d81b8a316b5e22f580c320c975bfaba08eb4e4 mems_allowed=0
[Fri Sep 6 13:05:54 2019] CPU: 1 PID: 15519 Comm: /usr/local/bin/ Not tainted 4.14.123-111.109.amzn2.x86_64 #1
[Fri Sep 6 13:05:54 2019] Hardware name: Amazon EC2 t3.large/, BIOS 1.0 10/16/2017
[Fri Sep 6 13:05:54 2019] Call Trace:
[Fri Sep 6 13:05:54 2019] dump_stack+0x5c/0x82
[Fri Sep 6 13:05:54 2019] dump_header+0x94/0x229
[Fri Sep 6 13:05:54 2019] oom_kill_process+0x223/0x420
[Fri Sep 6 13:05:54 2019] out_of_memory+0x2af/0x4d0
[Fri Sep 6 13:05:54 2019] mem_cgroup_out_of_memory+0x49/0x80
[Fri Sep 6 13:05:54 2019] mem_cgroup_oom_synchronize+0x2ed/0x330
[Fri Sep 6 13:05:54 2019] ? mem_cgroup_css_online+0x30/0x30
[Fri Sep 6 13:05:54 2019] pagefault_out_of_memory+0x32/0x77
[Fri Sep 6 13:05:54 2019] __do_page_fault+0x4b4/0x4c0
[Fri Sep 6 13:05:54 2019] ? async_page_fault+0x2f/0x50
[Fri Sep 6 13:05:54 2019] async_page_fault+0x45/0x50
[Fri Sep 6 13:05:54 2019] RIP: 86b6a8b0:0x2
[Fri Sep 6 13:05:54 2019] RSP: 0006:000000000000000c EFLAGS: 7f119f21c4c0
[Fri Sep 6 13:05:54 2019] Task in /kubepods/burstable/pod4ded6b32-d099-11e9-b6ff-02611a391c58/d626adfdaf059ddc755bf0ef60d81b8a316b5e22f580c320c975bfaba08eb4e4 killed as a result of limit of /kubepods/burstable/pod4ded6b32-d099-11e9-b6ff-02611a391c58
[Fri Sep 6 13:05:54 2019] memory: usage 3145720kB, limit 3145728kB, failcnt 31375
[Fri Sep 6 13:05:54 2019] memory+swap: usage 3145720kB, limit 9007199254740988kB, failcnt 0
[Fri Sep 6 13:05:54 2019] kmem: usage 14432kB, limit 9007199254740988kB, failcnt 0
[Fri Sep 6 13:05:54 2019] Memory cgroup stats for /kubepods/burstable/pod4ded6b32-d099-11e9-b6ff-02611a391c58: cache:0KB rss:0KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
[Fri Sep 6 13:05:54 2019] Memory cgroup stats for /kubepods/burstable/pod4ded6b32-d099-11e9-b6ff-02611a391c58/97f11a79f4bbad4bf0089d08600549ca2da7c4ae92422128416985df4c25156d: cache:0KB rss:44KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:44KB inactive_file:0KB active_file:0KB unevictable:0KB
[Fri Sep 6 13:05:54 2019] Memory cgroup stats for /kubepods/burstable/pod4ded6b32-d099-11e9-b6ff-02611a391c58/d626adfdaf059ddc755bf0ef60d81b8a316b5e22f580c320c975bfaba08eb4e4: cache:8KB rss:3131236KB rss_huge:0KB shmem:8KB mapped_file:8KB dirty:0KB writeback:0KB swap:0KB inactive_anon:8KB active_anon:3131204KB inactive_file:0KB active_file:0KB unevictable:0KB
[Fri Sep 6 13:05:54 2019] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[Fri Sep 6 13:05:54 2019] [13808] 0 13808 256 1 4 2 0 -998 pause
[Fri Sep 6 13:05:54 2019] [14125] 33 14125 98775 25832 165 4 0 610 [celeryd: celer
[Fri Sep 6 13:05:54 2019] [14184] 33 14184 96310 24392 165 4 0 610 /usr/local/bin/
[Fri Sep 6 13:05:54 2019] [14185] 33 14185 98597 21759 153 4 0 610 [celeryd: celer
[Fri Sep 6 13:05:54 2019] [14186] 33 14186 98534 21725 153 4 0 610 [celeryd: celer
[Fri Sep 6 13:05:54 2019] [15513] 33 15513 121060 29848 179 4 0 610 /usr/local/bin/
[Fri Sep 6 13:05:54 2019] [15519] 33 15519 788398 713097 1514 6 0 610 /usr/local/bin/
[Fri Sep 6 13:05:54 2019] Memory cgroup out of memory: Kill process 15519 (/usr/local/bin/) score 1518 or sacrifice child
[Fri Sep 6 13:05:54 2019] Killed process 15519 (/usr/local/bin/) total-vm:3153592kB, anon-rss:2824400kB, file-rss:27988kB, shmem-rss:0kB
[Fri Sep 6 13:05:55 2019] oom_reaper: reaped process 15519 (/usr/local/bin/), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Logs from the time of the OOM indicate the last PDF processed. Maybe it's bad? Lessee...
[2019-09-06 13:03:58,113] {logging_mixin.py:95} INFO - [2019-09-06 13:03:58,113] {main.py:202} INFO - 698a20fc77d3056a51a60fc04a3c8485 --- 69
[2019-09-06 13:03:58,375] {logging_mixin.py:95} INFO - [2019-09-06 13:03:58,375] {main.py:202} INFO - 695c5825f9f7fa2f97b505ddfc1682cb --- 69
[2019-09-06 13:04:01,952] {logging_mixin.py:95} INFO - [2019-09-06 13:04:01,952] {main.py:202} INFO - 699e5a3af976e61dd4bd88f90de94934 --- 69
[2019-09-06 13:04:02,073] {logging_mixin.py:95} INFO - [2019-09-06 13:04:02,073] {main.py:202} INFO - 695a224a7ed725976f5c8827665cf513 --- 69
[2019-09-06 13:06:00,958] {logging_mixin.py:95} INFO - [2019-09-06 13:06:00,957] {local_task_job.py:105} INFO - Task exited with return code -9
Well...at least it's an interesting PDF!
s3://datalabs-staging/reach-airflow/output/policy/spider/who_iris/spider-who_iris/pdf/69/695a224a7ed725976f5c8827665cf513.pdf
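(To poke at that file locally, it can be pulled down with s3cmd get or a couple of lines of boto3; a sketch, assuming credentials for the staging bucket are already configured:)
import boto3

# Pull the suspect PDF down for local inspection (same bucket/key as the URL above).
s3 = boto3.client('s3')
s3.download_file(
    'datalabs-staging',
    'reach-airflow/output/policy/spider/who_iris/spider-who_iris/pdf/69/695a224a7ed725976f5c8827665cf513.pdf',
    '695a224a7ed725976f5c8827665cf513.pdf',
)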
Planting a manifest with just the files whose hashes start with 69, to see whether this is due to a slow leak or to one of the files in this set.
# /tmp/filter-manifest.py: keep only the "69" bucket of the scrape manifest.
import json
import sys

d = json.load(sys.stdin)
d['content'] = {k: v for k, v in d['content'].items() if k == '69'}
json.dump(d, sys.stdout)
cat policytool-scrape--scraper-who_iris.json | python3 /tmp/filter-manifest.py > new.json
hblanks@ast325:~$ cat new.json
{"metadata": {"organisation": "who_iris", "start-time": "2019-09-04T15:41:35.252616", "stop-time": "2019-09-04T20:52:03.642647"}, "content": {"69": ["69cdafad1885e9e744c2e9ead9fb019d", "691737fceb5cb4fe66965c7eb657c32b", "6949dfb3ebdab35802527f454eaa9bdd", "691f01821959f5bcf979bdcd12d0ad30", "69e5997da17756757c5281c15fdb7b0d", "69dddb5f4ee713d0148934f40332a9a3", "697c05fd4b49ecd3442994a5ce99fbb7", "69598759ef3e7aef9dbcb4a5cc3f34ee", "69efa6970c22c54be06896ff55c487d9", "69f6dfb0e5f441ed6f9b644a8e14a06a", "69f08fa812084888773c826bee399350", "695b212320104c9de4e9baadf42f0500", "69880b6fc838c2443d27c0603ef5a889", "6985c378c3dabf2bc2b0bceac8494b0b", "699854cd293535b035da981eadb01bd3", "69ab1d2a9defcb5be84195a696d3fb76", "6913e8a6e18fff6b9251b29206308c43", "69ccd74ba12e97dfdac6d9670e9a1df7", "69f48f4ae0168f3c59008f85c9297441", "690105e9cdc9e335d1acc26ce71759f4", "69145330daf4278ec5b59bfc394e2f7b", "6988df8ee4481ce39f75da07ab6cbbc8", "69597d564ca41142769e82be46f89bc5", "69774ddba4b61e9d2fb609471ff33786", "6914fcabb522a56fef40669e9e07504a", "690c77b6479af284e3c484a57e8b8c3c", "698a20fc77d3056a51a60fc04a3c8485", "695c5825f9f7fa2f97b505ddfc1682cb", "699e5a3af976e61dd4bd88f90de94934", "695a224a7ed725976f5c8827665cf513", "6970edec1c74decd95b9d91bd75debee", "69ba99ce60e490ca41f3fe27116f7e0e", "6950b51f16c9aa0bdd7bf0222ba84e5b", "69d58de9ef2fe7e35ce4707daf684ff9", "6921c79e11a869647ae999684782633c", "690de57f746c544b8131e9b51b906ff3", "69b605ebb0971da993107a409f1a1983", "69768202fcb00a80a5849348d550ed8c", "693d8d57ff3badb03b2f3e4dbad5a69f", "69cff6d309ef56dd59ab913414fee206", "6989ad85f08fe1df70b007693297ff0e", "69cff584af22d31fee4be46a5189bc2e", "69572980264b3c5010c90044c21d8ccc", "69336c46609672a8afee24c4591a4eae", "693486cbcf4d0895ee7fbfceb47286cf", "69311998c8d55128b671d5a632be928a"]}}
s3cmd cp s3://datalabs-staging/reach-airflow/output/policy/spider/who_iris/spider-who_iris/policytool-scrape--scraper-who_iris.json s3://datalabs-staging/reach-airflow/output/policy/spider/who_iris/spider-who_iris/policytool-scrape--scraper-who_iris.json.bak
s3cmd put new.json s3://datalabs-staging/reach-airflow/output/policy/spider/who_iris/spider-who_iris/policytool-scrape--scraper-who_iris.json
And then running the task manually in Airflow...here we go.
Memory usage in the pod is initially at ~500MB, but swells to 1.8GB when we get to the PDF above. Bit of a hassle figuring out whether it's Python or the shell-out we do that's using all the memory...adding htop to the pod by ssh'ing into the node, then docker exec -ti -u root $CONTAINER /bin/bash.
And...yes...it's not the shell-out. It's us.
1 [|||| 3.4%] Tasks: 20, 2 thr; 2 running
2 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%] Load average: 0.52 0.33 0.17
Mem[||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 1.33G/7.69G] Uptime: 17 days, 22:30:03
Swp[ 0K/0K]
PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
8408 www-data 20 0 900M 606M 28068 R 99.7 7.7 0:23.09 /usr/local/bin/python3 /usr/local/bin/airflow run policy ParsePdf.who_iris 2019-09-04T15:41:24.016117+00:00 --job_id 357 --pool default_
8401 root 20 0 23176 3488 2740 R 0.0 0.0 0:00.04 htop
8402 www-data 20 0 472M 116M 27300 S 0.7 1.5 0:02.05 /usr/local/bin/python3 /usr/local/bin/airflow run policy ParsePdf.who_iris 2019-09-04T15:41:24.016117+00:00 --local --pool default_pool
1 www-data 20 0 385M 100M 23548 S 0.0 1.3 5:12.87 [celeryd: celery@airflow-worker-6588fd668b-cvxnn:MainProcess] -active- (worker)
14 www-data 20 0 376M 97568 23456 R 0.0 1.2 0:32.67 [celeryd: celery@airflow-worker-6588fd668b-cvxserve_logs
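To narrow down which allocations inside that Python process balloon on this file, something like tracemalloc could be wrapped around the parse step. A rough sketch only; parse_pdf here is a stand-in for whatever ParsePdf.who_iris actually calls per document, not the real function:
import tracemalloc

from somewhere import parse_pdf  # stand-in import, not the real module

tracemalloc.start()
parse_pdf('695a224a7ed725976f5c8827665cf513.pdf')
snapshot = tracemalloc.take_snapshot()
# Top allocation sites still holding memory after the parse.
for stat in snapshot.statistics('lineno')[:10]:
    print(stat)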
Tracking in https://github.com/wellcometrust/policytool/issues/227, as this bug is different from the incremental leak filed here.
An incremental memory (and file) leak was causing larger ParsePdf tasks to fail once the number of parsed PDFs got large.
Specifically, ParsePdf.who_iris failed at 2019-09-04 21:15:38.840725+00:00, and we had an OOM in the pod then; same for ParsePdf.parliament at 2019-09-05T14:41:28.156105+00:00:
These tasks were running in a pod provisioned with 2GB RAM. WHO had a concurrent task running (the parliament spider); parliament did not:
(So even in the case where the scraper is not running, we can blow out the memory usage.)
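For the incremental variant, the trend can be made visible well before the OOM killer steps in by logging peak RSS and the open file descriptor count between documents. A standard-library-only sketch; where the call goes in the parse loop is illustrative, not the actual ParsePdf code:
import os
import resource

def log_leak_indicators(tag):
    # Peak RSS of this process; ru_maxrss is reported in kB on Linux.
    peak_rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Open file descriptors for this process (Linux /proc interface).
    open_fds = len(os.listdir('/proc/self/fd'))
    print('%s: peak_rss=%dkB open_fds=%d' % (tag, peak_rss_kb, open_fds))

# e.g. call log_leak_indicators(pdf_hash) after each PDF in the parse loop.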