widdix / mastodon-on-aws

Host your own Mastodon instance on AWS
https://cloudonaut.io/mastodon-on-aws/

Enable auto-scaling for sidekiq #20

Open · michaelwittig opened this issue 1 year ago

michaelwittig commented 1 year ago

see

    My Sidekiq task is regularly pegging at 100% CPU utilization... definitely need some guidance on configuring scaling...

Originally posted by @scrappydog in https://github.com/widdix/mastodon-on-aws/issues/1#issuecomment-1331097244

    @scrappydog Same for us. I'm not sure if that is an issue. It likely doesn't matter if the background tasks utilize all resources as long as they finish without much delay. For us, we see spikes to 100% but only for minutes. Do you see the same pattern?
[Screenshot: Sidekiq CPU utilization graph, 2022-11-28 09:42]

Originally posted by @michaelwittig in https://github.com/widdix/mastodon-on-aws/issues/1#issuecomment-1331127095

    That looks very similar to utilization on my instance.

My inner system admin really "wants" to add another task... but I agree as long as jobs are completing in a reasonable time it's not an immediate issue.

BUT we are running tiny instances for testing... we NEED a way to scale up... :-)

Originally posted by @scrappydog in https://github.com/widdix/mastodon-on-aws/issues/1#issuecomment-1331147201

    I bumped the CPU allocation up on the Sidekiq task to CPU .5 vCPU | Memory 3 GB... 

This feels happier for now... but it doesn't address the real scalability question...

Originally posted by @scrappydog in https://github.com/widdix/mastodon-on-aws/issues/1#issuecomment-1331393233

    ![image](https://user-images.githubusercontent.com/125875/204807795-541c039e-3b58-4bb2-922f-5f1e3d528938.png)

Upgraded about halfway through this graph... definitely a lot better!

Originally posted by @scrappydog in https://github.com/widdix/mastodon-on-aws/issues/1#issuecomment-1332150543

scrappydog commented 1 year ago

[Screenshot] Status update after a couple of days with the Sidekiq task at 0.5 vCPU | 3 GB memory.

compuguy commented 1 year ago

There is a way to do auto-scaling for most of the Sidekiq queues, except for the scheduler queue, since you can only have one of those. This article helped me with some of my experiments with scaling Sidekiq: https://nora.codes/post/scaling-mastodon-in-the-face-of-an-exodus/. At a minimum you need 1 GB of memory for each instance. I'm not sure about the thread count, though. The default is 5, but it might make sense to reduce it to 2 based on the amount of CPU units each container instance has (I'm using 0.5 for each).
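
For reference, a minimal sketch of what target-tracking scaling for the Sidekiq ECS service could look like in CloudFormation. The `Cluster`/`SidekiqService` references and the 60% target are placeholders, not values taken from this repo's template:

```yaml
# Sketch only: scales the Sidekiq ECS service on average CPU utilization.
# Resource names (Cluster, SidekiqService) are placeholders for whatever the
# template actually defines; the scheduler queue must stay on a single task.
SidekiqScalableTarget:
  Type: 'AWS::ApplicationAutoScaling::ScalableTarget'
  Properties:
    MinCapacity: 1
    MaxCapacity: 4
    ResourceId: !Sub 'service/${Cluster}/${SidekiqService.Name}'
    ScalableDimension: 'ecs:service:DesiredCount'
    ServiceNamespace: ecs
    RoleARN: !Sub 'arn:aws:iam::${AWS::AccountId}:role/aws-service-role/ecs.application-autoscaling.amazonaws.com/AWSServiceRoleForApplicationAutoScaling_ECSService'
SidekiqCpuScalingPolicy:
  Type: 'AWS::ApplicationAutoScaling::ScalingPolicy'
  Properties:
    PolicyName: sidekiq-cpu-target-tracking
    PolicyType: TargetTrackingScaling
    ScalingTargetId: !Ref SidekiqScalableTarget
    TargetTrackingScalingPolicyConfiguration:
      PredefinedMetricSpecification:
        PredefinedMetricType: ECSServiceAverageCPUUtilization
      TargetValue: 60
      ScaleInCooldown: 300
      ScaleOutCooldown: 60
```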

vesteinn commented 1 year ago

Were you able to integrate these changes into the CloudFormation configuration @compuguy? After increasing the Cpu and Memory flags I'm still seeing full load.

compuguy commented 1 year ago

I honestly went down a different road @vesteinn. I moved the mail and scheduler queues to their own separate service, with 0.25 vCPU and 0.5 GB of memory. You can only have one scheduler queue per Mastodon instance, so I left it with the mail queue, which wasn't using much CPU or RAM. Then I made the SidekiqService container run the rest of the needed queues with `AppCommand: 'bash,-c,bundle exec sidekiq -q default -q pull -q push -q ingress'` and 0.5 vCPU and 1 GB of memory (see: https://github.com/compuguy/mastodon-on-aws/blob/istoleyourpw-deploy/mastodon.yaml#L269). Memory seems to be fine, but I still get way too many CPUUtilizationTooHighAlarms, especially when trends are updating. On the bright side, it is scaling up the instances when needed. I'm thinking of upping it to 1 vCPU, which would require upping the memory per container to 2 GB. Here's a CPU utilization chart for the past week:

[Screenshot: Sidekiq CPU utilization over the past week, 2022-12-11]
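
To make that split concrete, a rough sketch of the two Sidekiq services as nested cfn-modules stacks. Only the `AppCommand`, `Cpu`, and `Memory` values come from the comment above; the resource names, `TemplateURL`, queue name `mailers`, and remaining parameters are assumptions, so check against the actual mastodon.yaml:

```yaml
# Sketch only: queue layout described above. Everything except AppCommand,
# Cpu, and Memory is a placeholder for the template's real nested-stack setup.
SidekiqService:            # default, pull, push, ingress: safe to scale out
  Type: 'AWS::CloudFormation::Stack'
  Properties:
    TemplateURL: './node_modules/@cfn-modules/fargate-service/module.yml'  # assumption
    Parameters:
      AppCommand: 'bash,-c,bundle exec sidekiq -q default -q pull -q push -q ingress'
      Cpu: '0.5'
      Memory: '1'
SidekiqSchedulerService:   # scheduler + mailers: exactly one task, never scaled out
  Type: 'AWS::CloudFormation::Stack'
  Properties:
    TemplateURL: './node_modules/@cfn-modules/fargate-service/module.yml'  # assumption
    Parameters:
      AppCommand: 'bash,-c,bundle exec sidekiq -q scheduler -q mailers'
      Cpu: '0.25'
      Memory: '0.5'
```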

pegli commented 1 year ago

I wanted to share an incident report I created after a member of my instance reported problems uploading videos:

https://hub.montereybay.social/blog/degraded-service-video-transcoding-failures.html

tl;dr: iPhone video transcoding with ffmpeg was causing the CPU and memory usage to spike on the Sidekiq service. Changing vCPUs from 0.25 -> 0.5 and memory from 0.5 GB -> 1 GB in the Task Definition and redeploying that service resolved the issue, at least temporarily.

My instance is still pretty small at 19 users. If anyone would like me to report additional statistics, let me know what you want to see -- I'm happy to share operational metrics.
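
For anyone adjusting the same knobs, the change above maps to the task-level `Cpu`/`Memory` settings of a Fargate task definition, expressed in CPU units and MiB. A minimal sketch, assuming a plain `AWS::ECS::TaskDefinition` rather than whatever parameter the nested module actually exposes; image tag and role are placeholders:

```yaml
# Sketch only: 0.25 vCPU / 0.5 GB -> 0.5 vCPU / 1 GB at the task level.
# Fargate expresses CPU in units (256 = 0.25 vCPU) and memory in MiB.
SidekiqTaskDefinition:
  Type: 'AWS::ECS::TaskDefinition'
  Properties:
    RequiresCompatibilities: [FARGATE]
    NetworkMode: awsvpc
    Cpu: '512'      # was '256' (0.25 vCPU)
    Memory: '1024'  # was '512' (0.5 GB)
    ExecutionRoleArn: !Ref ExecutionRoleArn  # placeholder; logging etc. omitted
    ContainerDefinitions:
      - Name: sidekiq
        Image: 'tootsuite/mastodon:v4.0.2'  # placeholder tag
        Command: ['bash', '-c', 'bundle exec sidekiq']
```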

michaelwittig commented 1 year ago

@pegli We increased memory from 0.5 to 1 GB in #16. The CPU is still at 0.25, which is not a lot of horsepower :)

Yes, we are interested in metrics! RequestCountPerTarget for both ALB target groups (web and streaming) as well as CPU and memory of web, streaming and sidekiq.

pegli commented 1 year ago

At your service! https://hub.montereybay.social/Operations.html now has a public CloudWatch dashboard with all of those metrics.

michaelwittig commented 1 year ago

@pegli That's cool :) Do you mind sharing the JSON definition (open the dashboard in the CloudWatch UI, click Actions -> View/edit source) of the dashboard? We could add it to the template.
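
Until that JSON source is shared, a rough idea of how such a dashboard could be embedded in the template. The single widget shown tracks Sidekiq CPU only, and the cluster/service dimension values are placeholders; the real dashboard would add RequestCountPerTarget for both target groups plus CPU and memory for web and streaming:

```yaml
# Sketch only: CloudWatch dashboard resource with one ECS CPU widget.
# ClusterName/ServiceName values are placeholders for the stack's real names.
OperationsDashboard:
  Type: 'AWS::CloudWatch::Dashboard'
  Properties:
    DashboardName: mastodon-operations
    DashboardBody: !Sub |
      {
        "widgets": [
          {
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
              "title": "Sidekiq CPU utilization",
              "region": "${AWS::Region}",
              "stat": "Average",
              "period": 300,
              "metrics": [
                ["AWS/ECS", "CPUUtilization", "ClusterName", "mastodon-cluster", "ServiceName", "sidekiq"]
              ]
            }
          }
        ]
      }
```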