zachowj / xfinity-data-usage

Fetch Xfinity data usage and serve it via an HTTP endpoint, publish it to MQTT or post it to an URL.
https://hub.docker.com/r/zachowj/xfinity-data-usage
MIT License
53 stars 15 forks source link

CPU usage is periodically left very high until container restart #57

Closed xconverge closed 2 years ago

xconverge commented 2 years ago

Every few days, something happens that leaves chromium instances with very high CPU usage

I also realized that all of the chromium instances are leftover and still running too!?

I setup a cronjob to restart this container nightly due to this issue

I think there "might" be a relation to #54, I typically see this error in the log prior to restarting the container

Here is a small start to debugging when found in this state:

root@0d139d67fa28:/home/node/app# ps aux                            
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
node         1  0.0  0.0   2144     0 ?        Ss   00:00   0:00 dumb-init -- node dist/app.js
node         8  0.0  0.6 11165692 64292 ?      Ssl  00:00   0:04 node dist/app.js
node       354 40.3  1.3 1796296 131668 ?      SLsl 16:01  44:49 /usr/lib/chromium/chromium --show-component-extension-options 
node       368  0.0  1.0 402036 108208 ?       SL   16:01   0:00 /usr/lib/chromium/chromium --type=zygote --no-zygote-sandbox -
node       369  0.0  1.0 402036 107868 ?       SL   16:01   0:00 /usr/lib/chromium/chromium --type=zygote --no-sandbox --headle
node       389 92.4  0.7 38905496 76520 ?      Rl   16:01 102:38 /usr/lib/chromium/chromium --type=renderer --no-sandbox --disa
node       391 92.6  0.7 38905496 76028 ?      Sl   16:01 102:50 /usr/lib/chromium/chromium --type=renderer --no-sandbox --disa
node       405  0.0  0.4 774580 42024 ?        Sl   16:01   0:00 /usr/lib/chromium/chromium --type=gpu-process --field-trial-ha
node       423  0.0  0.0   3868  3244 pts/0    Ss+  17:47   0:00 bash
root       439  0.0  0.0   3868  3304 pts/1    Ss   17:48   0:00 bash
root       806  0.0  0.0   7640  2780 pts/1    R+   17:52   0:00 ps aux

Is there any more info I can get to supplement this info for this issue next time it happens? Does anyone else experience this?

zachowj commented 2 years ago

I'm not seeing any high CPU usage and my container has been running for over 6 months. I get the TimeoutError but I don't see it hurting my setup.

xconverge commented 2 years ago

I am not sure how to best pin it down, the nightly restart of the container mitigates it plenty for me. I can move the container to another server in a bit and try that for a few days/weeks

xconverge commented 2 years ago

I am running a fork with updated dependencies and a few more puppeteer flags (mostly --no-zygote) I will report back or open a PR if my problems go away.

My initial report is not fully accurate, the CPU seems to go back to normal after a successful update, and my interval is 4 hrs, so if whatever the issue is occurs, the cpu is pinned for that 4 hrs, and then there is another chance for it to resolve itself or continue to be pinned for another 4 hours.

I saw the same thing on 2 different hosts/environments/os. I am guessing you wouldnt notice this if your interval is pretty low and if you have a bunch of cpu overhead as well

xconverge commented 2 years ago

New theory....based on the logs, I seem to periodically be getting stuck at this page.goto and the waitUntil never resolves I will look into adding a timeout mechanism or something maybe, but there are a lot of page loads in here that wait with no timeout, so maybe others get hungup periodically as well:

    private async authenticate() {
        logger.debug(`Loading (${LOGIN_URL})`);
        const page = await this.getPage();

        const response = await page.goto(LOGIN_URL, { waitUntil: ['networkidle0', 'load', 'domcontentloaded'] });
zachowj commented 2 years ago

https://pptr.dev/api/puppeteer.page.goto

The default timeout for goto is 30 seconds.

xconverge commented 2 years ago

https://pptr.dev/api/puppeteer.page.goto

The default timeout for goto is 30 seconds.

Thanks, I see, that doesnt really explain the log I was looking at then, I will run things for another day or so (I have debug logging enabled now) so that should be a bit more helpful

xconverge commented 2 years ago

Ok its definitely not timing out in 30 seconds...from the same page you sent:

The default value can be changed by using the [Page.setDefaultNavigationTimeout()](https://pptr.dev/api/puppeteer.page.setdefaultnavigationtimeout) or [Page.setDefaultTimeout()](https://pptr.dev/api/puppeteer.page.setdefaulttimeout) methods.

and in getPage() we are calling:

await this.#page.setDefaultNavigationTimeout(this.#pageTimeout);

I will try defining pageTimeout in my config.yml to play with it a bit (I know it is trying to also set it to 30 by default but that doesnt seem to be working)

zachowj commented 2 years ago

If it's not defined in the config is defaults here.

https://github.com/zachowj/xfinity-data-usage/blob/8c8c035076e2aa33d25774bef0c4f95c294dd4a1/src/config.ts#L32

xconverge commented 2 years ago

If it's not defined in the config is defaults here.

https://github.com/zachowj/xfinity-data-usage/blob/8c8c035076e2aa33d25774bef0c4f95c294dd4a1/src/config.ts#L32

Yes I saw that and was what I meant by "(I know it is trying to also set it to 30 by default but that doesnt seem to be working)" but I definitely have a log message that was like this for over 5 minutes, I restarted the container at 16:49:38

2022-07-19 16:40:27 [info] : Xfinity Update every 240 mins

2022-07-19 16:40:27 [info] : Http server will be started

2022-07-19 16:40:27 [info] : Will publish to MQTT (Home Assistant) on new data

2022-07-19 16:40:27 [info] : http server started

2022-07-19 16:40:28 [verbose] : Fetching Data

2022-07-19 16:40:28 [debug] : Loading (https://customer.xfinity.com)

2022-07-19 16:40:28 [info] : Connected to MQTT

2022-07-19 16:49:38 [info] : Xfinity Update every 240 mins

2022-07-19 16:49:38 [info] : Http server will be started

2022-07-19 16:49:38 [info] : Will publish to MQTT (Home Assistant) on new data

2022-07-19 16:49:38 [info] : http server started

2022-07-19 16:49:38 [verbose] : Fetching Data

2022-07-19 16:49:38 [debug] : Loading (https://customer.xfinity.com)

2022-07-19 16:49:38 [info] : Connected to MQTT

2022-07-19 16:49:42 [debug] : Login redirects: undefined

2022-07-19 16:49:42 [debug] : Logging in
zachowj commented 2 years ago

You could try dropping in a logger after each line and using the screenshot function to see if the page looks like it is supposed.

If you're running this in a docker you'll want to switch these comments.

https://github.com/zachowj/xfinity-data-usage/blob/8c8c035076e2aa33d25774bef0c4f95c294dd4a1/src/xfinity.ts#L302-L303

xconverge commented 2 years ago

What do you think about this:

https://github.com/zachowj/xfinity-data-usage/blob/8c8c035076e2aa33d25774bef0c4f95c294dd4a1/src/xfinity.ts#L247

needs a "this.#page === undefined ||" or some changed logic. The request seems to be getting stuck here

zachowj commented 2 years ago

That's what the ? is doing.

if(this.#page !== false || this.#page.isClosed() !== false) {
xconverge commented 2 years ago

sorry for rubber ducking hard, I will get some more time on it and report back...I was missing quite a few log statements so with what I have so far it looks like potentially hanging up on browser.newPage but I will wait until I am sure before looking into it more

xconverge commented 2 years ago

Ok so sometimes it just fails to get a new page

https://github.com/zachowj/xfinity-data-usage/blob/8c8c035076e2aa33d25774bef0c4f95c294dd4a1/src/xfinity.ts#L249

and then when the interval is up, it tries again but fails again, so it never works again until the container is restarted

If I restart the container 10 times, it does this maybe 1 out of the 10 times

I added dumpio: true to get some more debug from puppeteer but it gives me nothing helpful. I also added --disable-gpu because there were quite a few gpu errors happening

working:

2022-07-20 11:48:16 [info] : Xfinity Update every 240 mins

2022-07-20 11:48:16 [info] : Http server will be started

2022-07-20 11:48:16 [info] : Will publish to MQTT (Home Assistant) on new data

2022-07-20 11:48:16 [info] : http server started

2022-07-20 11:48:16 [verbose] : Fetching Data

2022-07-20 11:48:16 [debug] : Loading (https://customer.xfinity.com)

2022-07-20 11:48:16 [debug] : Page:

2022-07-20 11:48:16 [debug] : undefined

2022-07-20 11:48:16 [debug] : Getting browser...

2022-07-20 11:48:17 [info] : Connected to MQTT

[0720/114817.340246:ERROR:browser_main_loop.cc(540)] Failed to open an X11 connection.

[0720/114817.343741:ERROR:bus.cc(393)] Failed to connect to the bus: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory

[0720/114817.349331:WARNING:audio_manager_linux.cc(69)] Falling back to ALSA for audio output. PulseAudio is not available or could not be initialized.

DevTools listening on ws://127.0.0.1:40309/devtools/browser/ea76ae77-cb77-4f37-9d17-18445f98584e

2022-07-20 11:48:17 [debug] : Got browser!

2022-07-20 11:48:17 [debug] : Getting new page...

2022-07-20 11:48:17 [debug] : Got new page!

2022-07-20 11:48:17 [debug] : Setting cookies, user agent, and viewport...

2022-07-20 11:48:17 [debug] : Cookies, user agent, and viewport set!

2022-07-20 11:48:17 [debug] : Calling setDefaultNavigationTimeout with a value of:

2022-07-20 11:48:17 [debug] : 30000

2022-07-20 11:48:17 [debug] : Got Page

[0720/114817.984491:INFO:CONSOLE(1)] "A parser-blocking, cross site (i.e. different eTLD+1) script, https://static.cimcontent.net/data-layer/?appId=resi_myaccount, is invoked via document.write. The network request for this script MAY be blocked by the browser in this or a future page load due to poor network connectivity. If blocked in this page load, it will be confirmed in a subsequent console message. See https://www.chromestatus.com/feature/5718547946799104 for more details.", source: https://customer.xfinity.com/ (1)

[0720/114817.992400:INFO:CONSOLE(1)] "A parser-blocking, cross site (i.e. different eTLD+1) script, https://static.cimcontent.net/data-layer/?appId=resi_myaccount, is invoked via document.write. The network request for this script MAY be blocked by the browser in this or a future page load due to poor network connectivity. If blocked in this page load, it will be confirmed in a subsequent console message. See https://www.chromestatus.com/feature/5718547946799104 for more details.", source: https://customer.xfinity.com/ (1)

[0720/114818.825904:INFO:CONSOLE(0)] "Uncaught (in promise) NetworkError: A network error occurred.", source: https://customer.xfinity.com/ (0)

2022-07-20 11:48:21 [debug] : Login redirects: 0

2022-07-20 11:48:21 [debug] : Loading Usage https://customer.xfinity.com/apis/csp/account/me/services/internet/usage?filter=internet

2022-07-20 11:48:21 [debug] : Page:

2022-07-20 11:48:21 [debug] : [object Object]

2022-07-20 11:48:21 [debug] : Using existing page:

2022-07-20 11:48:21 [debug] : [object Object]

2022-07-20 11:48:24 [debug] : Saving cookies for next fetch

2022-07-20 11:48:24 [debug] : Page:

2022-07-20 11:48:24 [debug] : [object Object]

2022-07-20 11:48:24 [debug] : Using existing page:

2022-07-20 11:48:24 [debug] : [object Object]

2022-07-20 11:48:24 [verbose] : Data retrieved

2022-07-20 11:48:24 [info] : Usage updated: 482 GB. Next update at 3:48:16 PM

not working:

2022-07-20 11:49:46 [info] : Xfinity Update every 240 mins

2022-07-20 11:49:46 [info] : Http server will be started

2022-07-20 11:49:46 [info] : Will publish to MQTT (Home Assistant) on new data

2022-07-20 11:49:46 [info] : http server started

2022-07-20 11:49:46 [verbose] : Fetching Data

2022-07-20 11:49:46 [debug] : Loading (https://customer.xfinity.com)

2022-07-20 11:49:46 [debug] : Page:

2022-07-20 11:49:46 [debug] : undefined

2022-07-20 11:49:46 [debug] : Getting browser...

2022-07-20 11:49:46 [info] : Connected to MQTT

[0720/114947.268915:ERROR:browser_main_loop.cc(540)] Failed to open an X11 connection.

[0720/114947.276692:ERROR:bus.cc(393)] Failed to connect to the bus: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory

[0720/114947.287006:WARNING:audio_manager_linux.cc(69)] Falling back to ALSA for audio output. PulseAudio is not available or could not be initialized.

DevTools listening on ws://127.0.0.1:34947/devtools/browser/27007cf9-5efe-48b2-869e-b908ee0945e3

2022-07-20 11:49:47 [debug] : Got browser!

2022-07-20 11:49:47 [debug] : Getting new page...

also not working when I let it run for the interval to see if it fixed itself (prior to adding some of the debug flags)

022-07-19 18:40:59 [info] : Xfinity Update every 240 mins

2022-07-19 18:40:59 [info] : Http server will be started

2022-07-19 18:40:59 [info] : Will publish to MQTT (Home Assistant) on new data

2022-07-19 18:40:59 [info] : http server started

2022-07-19 18:40:59 [verbose] : Fetching Data

2022-07-19 18:40:59 [debug] : Loading (https://customer.xfinity.com)

2022-07-19 18:40:59 [debug] : Page:

2022-07-19 18:40:59 [debug] : undefined

2022-07-19 18:40:59 [debug] : Getting browser...

2022-07-19 18:40:59 [info] : Connected to MQTT

2022-07-19 18:41:00 [debug] : Got browser!

2022-07-19 18:41:00 [debug] : Getting new page...

2022-07-19 22:40:59 [verbose] : Fetching Data

2022-07-19 22:40:59 [debug] : Loading (https://customer.xfinity.com)

2022-07-19 22:40:59 [debug] : Page:

2022-07-19 22:40:59 [debug] : undefined

2022-07-19 22:40:59 [debug] : Getting browser...

2022-07-19 22:40:59 [debug] : Got browser!

2022-07-19 22:40:59 [debug] : Getting new page...
xconverge commented 2 years ago

I changed a lot at once after some reading on newPage issues and will report back

  1. running alpine 3.16 instead of buster, image is 390MB instead of 600+
  2. Using "pipe: true" to use pipes instead of websockets for puppeteer

I am fine managing/using my own fork forever if no one else has this problem

xconverge commented 2 years ago

Looking promising so far, I will keep an eye on it over the next few days

https://github.com/zachowj/xfinity-data-usage/compare/main...xconverge:xfinity-data-usage:changesForNewpageLockup

I am considering this solved for me, let me know if I should open an MR or if it is too large of a change without solid engineering behind it, I definitely jumped around a bit...

zachowj commented 2 years ago

I have no objections to making the changes if it works across the builds.

What type of hardware are you running this on? If I recall correctly I couldn't get the arm7 to build successfully with chromium when using alpine.

I'm interested in learning which steps actually fixed your issue.

I really only see 3 major changes that I think would change anything:

  1. puppeteer updated
  2. switch to alpine/possible a different version of chromium
  3. chromium startup flags
xconverge commented 2 years ago

I will get some days of runtime on this, I am happy managing my own fork for myself (and also thankful for you making this available in the first place!)

I am currently running it on a pi4 x64 (armv8) but was running it on a synology NAS running dsm7 before (both systems exhibited the same issue periodically)

I suspect "pipe: true" as the launch parameter is the main fix, but (2) is a strong contender (it is chromium 102.x I believe).

(1) didnt really change anything other than potential compatibility with chromium versions.

The other startup flags (no-zygote, single process, and disable-gpu) all make sense and removed a few weird behaviors/errors I saw when I had dumpio: true but did not change the behavior.

Looking at puppeteer issues related to "newpage hangs", "newpage stalls" pointed me to a LOT of different related looking issues.

xconverge commented 2 years ago

Just following up here, I have had 0 problems since I switched to alpine/my fork, but I dont think it is worth redoing all of the builds etc for different platforms if no one else has this issue, 0 additional thumbs ups since I opened it close to 6 months ago...I will continue to run what works for me, thanks a lot for the image!

zachowj commented 2 years ago

It probably would be nice to move all the builds to alpine. If I find some time I might see if I can get it working.