wellcometrust / reach

Wellcome tool to parse references scraped from policy documents using machine learning
MIT License

ParsePdf for parliament (and WHO) is using > 2GB RAM b/c of incremental leak #224

Closed. hblanks closed this issue 5 years ago.

hblanks commented 5 years ago

An incremental memory (and file) leak was causing larger ParsePdf tasks to fail once the number of parsed PDFs got large.

Specifically, ParsePdf.who_iris failed at 2019-09-04 21:15:38.840725+00:00 and we had an OOM in the pod at that time; the same happened for ParsePdf.parliament at 2019-09-05T14:41:28.156105+00:00:

www-data@airflow-worker-6dbbc4f96b-jcwxb:/var/log$ dmesg -T  | grep oom_reap
[Wed Sep  4 21:15:27 2019] oom_reaper: reaped process 11559 (/usr/local/bin/), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[Thu Sep  5 14:41:21 2019] oom_reaper: reaped process 20989 (/usr/local/bin/), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

These tasks were running in a pod provisioned with 2GB RAM. The WHO run had a concurrent task (the parliament spider) running; the parliament run did not:

(screenshot: pod memory usage graph)

(So even when the scraper is not running, we can blow out the memory usage.)
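
As a side note, a quick way to confirm this kind of incremental leak is to log the task's RSS after each parsed PDF and watch whether it climbs steadily. A minimal sketch follows; parse_pdf and pdf_paths are placeholders, not the actual ParsePdf code, and the /proc read is Linux-only:

# Sketch only: log resident set size after each parsed PDF to spot a
# monotonic climb. parse_pdf() and pdf_paths are placeholders.
import logging
import os

logger = logging.getLogger(__name__)

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")

def current_rss_mb():
    # Current resident set size, read from /proc/self/statm (Linux only).
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * PAGE_SIZE / (1024 * 1024)

def parse_all(pdf_paths, parse_pdf):
    # A steady climb across files points at a per-document leak rather
    # than one pathological file.
    for i, path in enumerate(pdf_paths):
        parse_pdf(path)
        logger.info("parsed %d/%d (%s), RSS now %.0f MB",
                    i + 1, len(pdf_paths), path, current_rss_mb())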

hblanks commented 5 years ago

Fix deployed to staging, and ParsePdf.who_iris is running now. Memory usage seems to be holding below 512MB:

Every 2.0s: kubectl top pod -l app=airflow-worker                                      ast325: Fri Sep  6 12:45:20 2019

NAME                              CPU(cores)   MEMORY(bytes)
airflow-worker-6588fd668b-cvxnn   921m         475Mi

hblanks commented 5 years ago

Well...it held for a while. Task died at 1h29m:

[Fri Sep  6 13:05:54 2019] /usr/local/bin/ cpuset=d626adfdaf059ddc755bf0ef60d81b8a316b5e22f580c320c975bfaba08eb4e4 mems_allowed=0
[Fri Sep  6 13:05:54 2019] CPU: 1 PID: 15519 Comm: /usr/local/bin/ Not tainted 4.14.123-111.109.amzn2.x86_64 #1
[Fri Sep  6 13:05:54 2019] Hardware name: Amazon EC2 t3.large/, BIOS 1.0 10/16/2017
[Fri Sep  6 13:05:54 2019] Call Trace:
[Fri Sep  6 13:05:54 2019]  dump_stack+0x5c/0x82
[Fri Sep  6 13:05:54 2019]  dump_header+0x94/0x229
[Fri Sep  6 13:05:54 2019]  oom_kill_process+0x223/0x420
[Fri Sep  6 13:05:54 2019]  out_of_memory+0x2af/0x4d0
[Fri Sep  6 13:05:54 2019]  mem_cgroup_out_of_memory+0x49/0x80
[Fri Sep  6 13:05:54 2019]  mem_cgroup_oom_synchronize+0x2ed/0x330
[Fri Sep  6 13:05:54 2019]  ? mem_cgroup_css_online+0x30/0x30
[Fri Sep  6 13:05:54 2019]  pagefault_out_of_memory+0x32/0x77
[Fri Sep  6 13:05:54 2019]  __do_page_fault+0x4b4/0x4c0
[Fri Sep  6 13:05:54 2019]  ? async_page_fault+0x2f/0x50
[Fri Sep  6 13:05:54 2019]  async_page_fault+0x45/0x50
[Fri Sep  6 13:05:54 2019] RIP: 86b6a8b0:0x2
[Fri Sep  6 13:05:54 2019] RSP: 0006:000000000000000c EFLAGS: 7f119f21c4c0
[Fri Sep  6 13:05:54 2019] Task in /kubepods/burstable/pod4ded6b32-d099-11e9-b6ff-02611a391c58/d626adfdaf059ddc755bf0ef60d81b8a316b5e22f580c320c975bfaba08eb4e4 killed as a result of limit of /kubepods/burstable/pod4ded6b32-d099-11e9-b6ff-02611a391c58
[Fri Sep  6 13:05:54 2019] memory: usage 3145720kB, limit 3145728kB, failcnt 31375
[Fri Sep  6 13:05:54 2019] memory+swap: usage 3145720kB, limit 9007199254740988kB, failcnt 0
[Fri Sep  6 13:05:54 2019] kmem: usage 14432kB, limit 9007199254740988kB, failcnt 0
[Fri Sep  6 13:05:54 2019] Memory cgroup stats for /kubepods/burstable/pod4ded6b32-d099-11e9-b6ff-02611a391c58: cache:0KB rss:0KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
[Fri Sep  6 13:05:54 2019] Memory cgroup stats for /kubepods/burstable/pod4ded6b32-d099-11e9-b6ff-02611a391c58/97f11a79f4bbad4bf0089d08600549ca2da7c4ae92422128416985df4c25156d: cache:0KB rss:44KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:44KB inactive_file:0KB active_file:0KB unevictable:0KB
[Fri Sep  6 13:05:54 2019] Memory cgroup stats for /kubepods/burstable/pod4ded6b32-d099-11e9-b6ff-02611a391c58/d626adfdaf059ddc755bf0ef60d81b8a316b5e22f580c320c975bfaba08eb4e4: cache:8KB rss:3131236KB rss_huge:0KB shmem:8KB mapped_file:8KB dirty:0KB writeback:0KB swap:0KB inactive_anon:8KB active_anon:3131204KB inactive_file:0KB active_file:0KB unevictable:0KB
[Fri Sep  6 13:05:54 2019] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[Fri Sep  6 13:05:54 2019] [13808]     0 13808      256        1       4       2        0          -998 pause
[Fri Sep  6 13:05:54 2019] [14125]    33 14125    98775    25832     165       4        0           610 [celeryd: celer
[Fri Sep  6 13:05:54 2019] [14184]    33 14184    96310    24392     165       4        0           610 /usr/local/bin/
[Fri Sep  6 13:05:54 2019] [14185]    33 14185    98597    21759     153       4        0           610 [celeryd: celer
[Fri Sep  6 13:05:54 2019] [14186]    33 14186    98534    21725     153       4        0           610 [celeryd: celer
[Fri Sep  6 13:05:54 2019] [15513]    33 15513   121060    29848     179       4        0           610 /usr/local/bin/
[Fri Sep  6 13:05:54 2019] [15519]    33 15519   788398   713097    1514       6        0           610 /usr/local/bin/
[Fri Sep  6 13:05:54 2019] Memory cgroup out of memory: Kill process 15519 (/usr/local/bin/) score 1518 or sacrifice child
[Fri Sep  6 13:05:54 2019] Killed process 15519 (/usr/local/bin/) total-vm:3153592kB, anon-rss:2824400kB, file-rss:27988kB, shmem-rss:0kB
[Fri Sep  6 13:05:55 2019] oom_reaper: reaped process 15519 (/usr/local/bin/), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
hblanks commented 5 years ago

Logs from the time of the OOM indicate the last PDF processed. Maybe it's bad? Lessee...

[2019-09-06 13:03:58,113] {logging_mixin.py:95} INFO - [2019-09-06 13:03:58,113] {main.py:202} INFO - 698a20fc77d3056a51a60fc04a3c8485  --- 69
[2019-09-06 13:03:58,375] {logging_mixin.py:95} INFO - [2019-09-06 13:03:58,375] {main.py:202} INFO - 695c5825f9f7fa2f97b505ddfc1682cb  --- 69
[2019-09-06 13:04:01,952] {logging_mixin.py:95} INFO - [2019-09-06 13:04:01,952] {main.py:202} INFO - 699e5a3af976e61dd4bd88f90de94934  --- 69
[2019-09-06 13:04:02,073] {logging_mixin.py:95} INFO - [2019-09-06 13:04:02,073] {main.py:202} INFO - 695a224a7ed725976f5c8827665cf513  --- 69
[2019-09-06 13:06:00,958] {logging_mixin.py:95} INFO - [2019-09-06 13:06:00,957] {local_task_job.py:105} INFO - Task exited with return code -9
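
For reference, the last parsed hash can be pulled out of a worker log like the one above with a quick script along these lines (a sketch; the script name and log path in the usage line are illustrative):

# Sketch: print the last PDF hash logged by main.py before the OOM.
import re
import sys

# Matches the "main.py" INFO lines above: "<md5 hash>  --- <prefix>"
HASH_RE = re.compile(r"\{main\.py:\d+\} INFO - ([0-9a-f]{32})\s+---")

last = None
for line in sys.stdin:
    m = HASH_RE.search(line)
    if m:
        last = m.group(1)

print(last)

cat <ParsePdf task log> | python3 /tmp/last-hash.py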
hblanks commented 5 years ago

Well...at least it's an interesting PDF!

(image: smoking-kills — screenshot of the PDF in question)

s3://datalabs-staging/reach-airflow/output/policy/spider/who_iris/spider-who_iris/pdf/69/695a224a7ed725976f5c8827665cf513.pdf
hblanks commented 5 years ago

Planting a manifest with just the files whose hashes start with 69, to see whether this is due to a slow leak or to one of the files in this set.

# /tmp/filter-manifest.py: keep only the manifest entries whose hash
# prefix is '69', so the task re-parses just that subset of PDFs.
import json
import sys

d = json.load(sys.stdin)
d['content'] = {k: v for k, v in d['content'].items() if k == '69'}
json.dump(d, sys.stdout)

cat policytool-scrape--scraper-who_iris.json | python3 /tmp/filter-manifest.py > new.json
hblanks@ast325:~$ cat new.json 
{"metadata": {"organisation": "who_iris", "start-time": "2019-09-04T15:41:35.252616", "stop-time": "2019-09-04T20:52:03.642647"}, "content": {"69": ["69cdafad1885e9e744c2e9ead9fb019d", "691737fceb5cb4fe66965c7eb657c32b", "6949dfb3ebdab35802527f454eaa9bdd", "691f01821959f5bcf979bdcd12d0ad30", "69e5997da17756757c5281c15fdb7b0d", "69dddb5f4ee713d0148934f40332a9a3", "697c05fd4b49ecd3442994a5ce99fbb7", "69598759ef3e7aef9dbcb4a5cc3f34ee", "69efa6970c22c54be06896ff55c487d9", "69f6dfb0e5f441ed6f9b644a8e14a06a", "69f08fa812084888773c826bee399350", "695b212320104c9de4e9baadf42f0500", "69880b6fc838c2443d27c0603ef5a889", "6985c378c3dabf2bc2b0bceac8494b0b", "699854cd293535b035da981eadb01bd3", "69ab1d2a9defcb5be84195a696d3fb76", "6913e8a6e18fff6b9251b29206308c43", "69ccd74ba12e97dfdac6d9670e9a1df7", "69f48f4ae0168f3c59008f85c9297441", "690105e9cdc9e335d1acc26ce71759f4", "69145330daf4278ec5b59bfc394e2f7b", "6988df8ee4481ce39f75da07ab6cbbc8", "69597d564ca41142769e82be46f89bc5", "69774ddba4b61e9d2fb609471ff33786", "6914fcabb522a56fef40669e9e07504a", "690c77b6479af284e3c484a57e8b8c3c", "698a20fc77d3056a51a60fc04a3c8485", "695c5825f9f7fa2f97b505ddfc1682cb", "699e5a3af976e61dd4bd88f90de94934", "695a224a7ed725976f5c8827665cf513", "6970edec1c74decd95b9d91bd75debee", "69ba99ce60e490ca41f3fe27116f7e0e", "6950b51f16c9aa0bdd7bf0222ba84e5b", "69d58de9ef2fe7e35ce4707daf684ff9", "6921c79e11a869647ae999684782633c", "690de57f746c544b8131e9b51b906ff3", "69b605ebb0971da993107a409f1a1983", "69768202fcb00a80a5849348d550ed8c", "693d8d57ff3badb03b2f3e4dbad5a69f", "69cff6d309ef56dd59ab913414fee206", "6989ad85f08fe1df70b007693297ff0e", "69cff584af22d31fee4be46a5189bc2e", "69572980264b3c5010c90044c21d8ccc", "69336c46609672a8afee24c4591a4eae", "693486cbcf4d0895ee7fbfceb47286cf", "69311998c8d55128b671d5a632be928a"]}}
s3cmd cp s3://datalabs-staging/reach-airflow/output/policy/spider/who_iris/spider-who_iris/policytool-scrape--scraper-who_iris.json s3://datalabs-staging/reach-airflow/output/policy/spider/who_iris/spider-who_iris/policytool-scrape--scraper-who_iris.json.bak
s3cmd put new.json s3://datalabs-staging/reach-airflow/output/policy/spider/who_iris/spider-who_iris/policytool-scrape--scraper-who_iris.json

And then running the task manually in Airflow...here we go.

hblanks commented 5 years ago

Memory usage in the pod is initially at ~500MB, but swells to 1.8GB when we get to the PDF above. It's a bit of a hassle figuring out whether it's Python or the shell-out we do that's using all the memory... adding htop to the pod by ssh'ing to the node, then running docker exec -ti -u root $CONTAINER /bin/bash.
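
For the record, an alternative to htop for the same check is to compare the task process's RSS with that of its children; a sketch below, assuming psutil can be installed in the container (which is an assumption about the image):

# Sketch: compare the airflow task process's own RSS with that of its
# child processes (i.e. any shell-outs), to see which side holds the memory.
import psutil

def report(pid):
    proc = psutil.Process(pid)
    print("%d %s: %.0f MB" % (
        pid, " ".join(proc.cmdline()[:2]), proc.memory_info().rss / 2**20))
    for child in proc.children(recursive=True):
        print("  child %d %s: %.0f MB" % (
            child.pid, child.name(), child.memory_info().rss / 2**20))

# e.g. report(<pid of the `airflow run ... ParsePdf.who_iris` process>)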

hblanks commented 5 years ago

And...yes...it's not the shell-out. It's us.

  1  [||||                                                                                 3.4%]   Tasks: 20, 2 thr; 2 running
  2  [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]   Load average: 0.52 0.33 0.17
  Mem[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||               1.33G/7.69G]   Uptime: 17 days, 22:30:03
  Swp[                                                                                    0K/0K]

  PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
 8408 www-data   20   0  900M  606M 28068 R 99.7  7.7  0:23.09 /usr/local/bin/python3 /usr/local/bin/airflow run policy ParsePdf.who_iris 2019-09-04T15:41:24.016117+00:00 --job_id 357 --pool default_
 8401 root       20   0 23176  3488  2740 R  0.0  0.0  0:00.04 htop
 8402 www-data   20   0  472M  116M 27300 S  0.7  1.5  0:02.05 /usr/local/bin/python3 /usr/local/bin/airflow run policy ParsePdf.who_iris 2019-09-04T15:41:24.016117+00:00 --local --pool default_pool
    1 www-data   20   0  385M  100M 23548 S  0.0  1.3  5:12.87 [celeryd: celery@airflow-worker-6588fd668b-cvxnn:MainProcess] -active- (worker)
   14 www-data   20   0  376M 97568 23456 R  0.0  1.2  0:32.67 [celeryd: celery@airflow-worker-6588fd668b-cvxserve_logs
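
Next step for narrowing it down: since the growth is inside our own Python process, taking tracemalloc snapshots around the parse of the offending file should show which call sites account for it. A rough sketch (parse_pdf here is a stand-in for whatever ParsePdf actually calls per document):

# Sketch: snapshot allocations before and after parsing one document
# and print the call sites that grew the most.
import tracemalloc

def profile_parse(parse_pdf, path):
    tracemalloc.start(25)  # keep 25 frames of traceback per allocation
    before = tracemalloc.take_snapshot()

    parse_pdf(path)

    after = tracemalloc.take_snapshot()
    for stat in after.compare_to(before, "lineno")[:15]:
        print(stat)
    tracemalloc.stop()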
hblanks commented 5 years ago

Tracking in https://github.com/wellcometrust/policytool/issues/227, as that bug is different from the incremental leak filed here.