widdix / aws-cf-templates

Free Templates for AWS CloudFormation
https://templates.cloudonaut.io/
Apache License 2.0
2.75k stars 1.38k forks

Support for http2 on EC2 Instances / mod_ssl Not Enabled / Warning Regarding Elastic File System (EFS) Speed Slowdown Causing Crashes - See Final Post #168

Closed ghost closed 6 years ago

ghost commented 6 years ago

@michaelwittig @andreaswittig

I came across two interesting errors in the server logs I wanted to ask you about...

1. The mpm module (prefork.c) is not supported by mod_http2. The mpm determines how things are processed in your server. HTTP/2 has more demands in this regard and the currently selected mpm will just not do. This is an advisory warning. Your server will continue to work, but the HTTP/2 protocol will be inactive.

2. mod_ssl does not seem to be enabled

For 1:

I can see that http2 is enabled for both CloudFront and the Load Balancer, but it's not enabled on the EC2 instances due to this issue with the mpm module. I've read that this can be corrected by using a different MPM like "Event."

Here is some good information about this for Apache: https://http2.pro/doc/Apache#prefork-http2

This page above has the following warning as well:

If you are using PHP, it is likely that PHP is integrated to Apache via the mod_php module, which requires the prefork MPM. If you switch out from prefork MPM, you will need to use PHP as FastCGI.

If both CloudFront and the Load Balancer are using http2, shouldn't the EC2 instances also be running it? What do you think?

It seems to me that this has to be enabled from within the launch configuration so that Apache is using the correct modules. Do you think this is something that should be changed within your templates so that http2 is supported across the board?

For 2:

The fact that I am seeing an error for the mod_ssl module tells me that Apache is trying to use this for some reason, but I do not know why. I know how to enable additional modules for php using the template, but I wanted to ask you about this specific module and why I am seeing this error before I make any changes.

I look forward to hearing your thoughts on both these things...

michaelwittig commented 6 years ago

hi,

  1. CloudFront does http2 for you if browsers support it. no need to do this on the backend.

  2. you don't need to terminate SSL/TLS on Apache WebServer. CloudFront and Load Balancer do this for you. no need to do this on the backend.

Does this make sense?

ghost commented 6 years ago

Hi Michael,

Good to hear from you and thanks for your reply!

1. I had a similar thought, but could there be any advantages to running http2 across the board? Or would this instead cause potential problems?

2. I realize that SSL is being provided by CloudFront and the Load Balancer, but why is the server displaying this error as if it wants mod_ssl to be installed?

I thought you should know...

I found both these errors while investigating the cause of a recent crash of our site (luckily the site is not yet live). It's the second time we've had such a crash. It manifests itself as a database connectivity error, when there seems to be nothing wrong with the DB. Our server instances would terminate due to health check failures within minutes of launch. The only way I could resolve this was with a complete termination and recreation of the entire stack. Trying to restore the DB separately from snapshots had no effect.

I was very alarmed when this most recent crash occurred so I started looking closely into the logs. That's when I saw these issues (even though they are obviously unrelated). I couldn't find any evidence of what caused the actual crashes. I am almost certain it was not a true DB issue as we restored the stack from the same DB that had previously failed. It's very unsettling to see this kind of instability and have no idea of the cause. If you have any advice whatsoever regarding this I would greatly appreciate it!

ghost commented 6 years ago

@michaelwittig @andreaswittig

I'm closing this thread, but to anyone who has experienced crashes of the type I described here, we did discover the source of the problem, and it was the Elastic File System (EFS).

All of the issues we experienced were due to the EFS slowing to a crawl, which had the effect of choking the entire stack. We were able to determine this by temporarily relocating the files in /var/www/html off the EFS and onto an EC2 instance. As soon as the files were moved off the EFS, all the problems disappeared. This proved it was indeed the EFS causing all the issues.

We believe that the EFS slowed down dramatically (due to lack of burst credits or perhaps another cause), and that this caused a cascade of failures in the stack. Everything we saw (the HTTP requests, the huge number of DB connections, the failure of the ELB, the DB connection error, etc.) happened because the EFS was not responding fast enough.

After doing some research of my own I discovered that our problems were not unique, and that many people have experienced similar speed/performance related issues using the EFS. There are many people who believe the EFS is not capable of properly supporting a production environment or any complex website for that matter. I must admit after my own experiences I agree with this assessment.

We are currently re-engineering our stack to circumvent the EFS for almost all files, except for a small number of dynamic files which must be shared between all instances. All of our static files (WordPress core, plugins, themes, etc.) will now be directly hosted on the EC2 instances. We're changing the deployment pipeline to automatically generate an AMI which will pull our static files from an S3 bucket where we are deploying to. That updated AMI is then used to create new EC2 instances. In effect we are no longer bootstrapping the AMI and are instead preparing it ahead of time during deployments.

The CloudFormation templates here are amazing, and I highly recommend them. However, the use of the EFS to host WordPress is a dangerous point of failure (especially in a production environment). I have seen this in numerous crashes now over the past year, all of which I believe were caused by the EFS slowing down. For this reason I highly recommend that anyone using these templates move their static files away from the EFS, or otherwise closely monitor the speed/performance of the EFS to avoid it choking the entire stack.

I hope this helps anyone who experiences the same problems I have. It took a great deal of investigation for us to determine it was the EFS causing our troubles.

michaelwittig commented 6 years ago

Can you share the CloudWatch Metrics of the EFS File System during the slowness?

ghost commented 6 years ago

@michaelwittig

I would have, but we already replaced the stack after completing the initial investigation. I had to find a way to bring it all back online for our developers (this happened in our development environment this time).

I did ask the person who helped us investigate this if he had looked at the EFS metrics at the time we were experiencing the slow-down. He said he did in fact look at it, but that everything looked normal. I believe we were being heavily throttled by AWS, or perhaps something else was going on. Whatever the cause was, the speed was incredibly slow. It basically caused every single system to fail in a cascade.

We're currently working on re-engineering around the EFS, but are not completely eliminating it. We're going to still use it for the dynamic directories in /wp-content/uploads/ in conjunction with the WP Offload S3 plugin. We're hoping that the EFS will hold up with relatively few files in it this time.

Our plan is to closely monitor the EFS after the new stack version is running, to see if there are any slowdowns or anything out of the ordinary. We're also looking into the issue of the burst credits.

The development environment is currently using the original stack version (all EFS), and it will be this way for another 2-4 weeks. So it is possible the same problem will recur. If it does I'll be taking a much closer look at the EFS metrics.

I'll keep you posted here with anything new I learn. This is definitely a serious EFS issue that can affect this particular stack because it stores the entire WP install in the EFS. One other thing, I have noticed 5XX alarms that occur on the ELB now and then in combination with instances terminating/re-launching. I think this is likely caused by the same EFS issue, but it recovers when a new instance spins up.

I'll definitely keep you filled in if I learn anything new. Thanks for following up here Michael!

ghost commented 6 years ago

@michaelwittig

I wanted to let you know that we had another occurrence of this problem and this time I was able to view the EFS metrics during the failure. It was indeed caused by a depletion of burst credits. I now know that this was the cause of every major failure I've had over the past year. The burst credits apparently start out in the TB range but then just steadily drop over weeks and months. It can take a while, but eventually they run out and when that happens there is no recovering the EFS. The throughput drops to 0 and it is impossible to take any actions for recovery.

The only procedure I have found that works after such a failure is to change the logical ID of the EFS, which forces CloudFormation to replace the EFS instance. When a new instance is created it resets the burst credit balance. Naturally after doing this you need to restore all your files. I also found that the template will reinstall a default version of WordPress when replacing the EFS in this fashion. But short of replacing the entire stack this is the only way to recover the EFS after it runs out of burst credits.

I've done extensive research on this issue and the only preventative measure you can take is to place dummy data on the EFS so you get higher limits. The more data you have in the file system the higher a limit tier they give you.

There's a table on this page below that outlines the limits based on file system size: https://docs.aws.amazon.com/efs/latest/ug/performance.html
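
To get a feel for how long the credits can last, here is a rough sketch in Python. It assumes the bursting-mode figures AWS published at the time (a baseline of 50 KiB/s per GiB stored and a starting balance of about 2.1 TiB of credits). Those constants, the function names, and the example workload are assumptions for illustration only, so check them against the table on the page above before relying on the result.

```python
# Rough model of EFS burst credits. The constants below are ASSUMPTIONS
# based on AWS's published bursting-mode figures; verify against current docs.
KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3
TIB = 1024 ** 4

BASELINE_BYTES_PER_SEC_PER_GIB = 50 * KIB  # assumed: 50 KiB/s earned per GiB stored
INITIAL_CREDIT_BYTES = 2.1 * TIB           # assumed starting credit balance


def baseline_throughput(stored_bytes):
    """Bytes per second of throughput earned by the amount of data stored."""
    return (stored_bytes / GIB) * BASELINE_BYTES_PER_SEC_PER_GIB


def days_until_depleted(stored_bytes, avg_throughput, credits=INITIAL_CREDIT_BYTES):
    """Days until the credit balance reaches zero at a constant average load.

    Returns None when the load is at or below the baseline, because the
    balance then never shrinks.
    """
    net_drain = avg_throughput - baseline_throughput(stored_bytes)
    if net_drain <= 0:
        return None
    return credits / net_drain / 86400


if __name__ == "__main__":
    # Hypothetical WordPress install: 2 GiB stored, averaging 0.5 MiB/s of I/O.
    stored = 2 * GIB
    load = 0.5 * MIB
    print(f"baseline: {baseline_throughput(stored) / KIB:.0f} KiB/s")
    print(f"credits last roughly {days_until_depleted(stored, load):.0f} days")
```

The point of the sketch is the shape of the problem: a small file system earns very little baseline throughput, so even a modest steady load drains the balance over a period of weeks to months, which matches what I saw.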

I'm moving our solution away from the EFS so this won't pose a problem for us in the future. But I'm hoping some of what I wrote here will help others who are using these templates as-is. The EFS is simply not suited to host a WordPress website which has tens of thousands of files and relies upon fast I/O.

To anyone dealing with this problem consider moving all of the static files onto the EC2 instances. Only use the EFS for files that must be shared between the servers.

Here's a good article on this subject: https://www.jeffgeerling.com/blog/2018/getting-best-performance-out-amazon-efs

michaelwittig commented 6 years ago

thanks for investigating this issue. There is an alarm defined in the template EFSFileSystemBurstCreditBalanceTooLowAlarm that should warn you before you run out of credits. Do you set the ParentAlertStack parameter of the WordPress stack and if yes, have you received an alert?

Out of curiosity: Why wasn't it an option for you to upload let's say 50-100 GB of dummy data to EFS to increase performance?
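
For a sense of scale, the dummy-data idea can be sketched in Python as below. It assumes the bursting-mode baseline of 50 KiB/s per GiB stored that AWS documented at the time. The function names and the price passed to `monthly_cost` are hypothetical, so substitute the current figures for your region.

```python
import math

KIB = 1024
MIB = 1024 ** 2

# ASSUMPTION: bursting-mode baseline of 50 KiB/s per GiB stored.
BASELINE_BYTES_PER_SEC_PER_GIB = 50 * KIB


def baseline_mib_per_sec(stored_gib):
    """Baseline throughput in MiB/s for a file system holding stored_gib."""
    return stored_gib * BASELINE_BYTES_PER_SEC_PER_GIB / MIB


def dummy_gib_needed(target_bytes_per_sec, real_data_gib):
    """Whole GiB of filler needed so the baseline covers the target load."""
    total_gib = target_bytes_per_sec / BASELINE_BYTES_PER_SEC_PER_GIB
    return max(0, math.ceil(total_gib - real_data_gib))


def monthly_cost(gib, price_per_gib_month):
    """Storage cost of the filler; the price is supplied by the caller."""
    return gib * price_per_gib_month


if __name__ == "__main__":
    for size in (50, 100):
        print(f"{size} GiB stored -> {baseline_mib_per_sec(size):.2f} MiB/s baseline")
    filler = dummy_gib_needed(5 * MIB, real_data_gib=2)
    # 0.30 per GiB-month is a placeholder price, not a quoted AWS rate.
    print(f"filler for 5 MiB/s: {filler} GiB, about {monthly_cost(filler, 0.30):.2f} per month")
```

Under these assumptions, 50-100 GB of filler buys a baseline of only a few MiB/s, and the filler is billed like any other stored data.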

ghost commented 6 years ago

@michaelwittig

Yes, I've always used the ParentAlertStack. For some reason I never received the CloudWatch alarm for the burst credits. In fact, I didn't receive it on past failures of this type either. That alarm is definitely not working. I thought about this a while back and checked all my emails, and I never received the alarm. I wish I had because then I maybe could have taken preventative measures.
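
Since the alarm never reached me, an independent check on the credit trend would have helped. The following is only a sketch: it fits a straight line to samples of the EFS BurstCreditBalance metric and projects when the balance would hit zero. Fetching the samples from CloudWatch is left out, and the function name and sample values are hypothetical.

```python
from datetime import datetime, timedelta


def project_depletion(samples):
    """Estimate when the burst credit balance reaches zero.

    samples: list of (datetime, credit_bytes) pairs, oldest first.
    Returns a datetime, or None if the fitted trend is flat or rising.
    Uses an ordinary least-squares line, which assumes a steady load.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    t0 = samples[0][0]
    xs = [(t - t0).total_seconds() for t, _ in samples]
    ys = [float(balance) for _, balance in samples]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    if slope >= 0:
        return None
    intercept = mean_y - slope * mean_x
    # The line crosses zero where intercept + slope * x == 0.
    return t0 + timedelta(seconds=-intercept / slope)


if __name__ == "__main__":
    start = datetime(2018, 1, 1)
    # Made-up daily readings, in bytes, for illustration only.
    readings = [(start + timedelta(days=d), 2.0e12 - d * 3.0e10) for d in range(7)]
    print("projected depletion:", project_depletion(readings))
```

Run on a schedule against real metric data, something like this could give weeks of warning even if the alarm notification is not delivered.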

Regarding using the dummy data to increase EFS performance...

To be frank, I only learned about that a few weeks ago. By the time I learned that trick I had already decided to re-engineer our stack away from the EFS. We're moving to a containerized solution where all of the static files will be on the instances. That accounts for 99% of the application. There are a few specialized files/folders (mostly in /uploads) that still need to be shared between the servers, and for that we're still using the EFS. What we're doing is mounting each of those folders to a separate EFS instance. I'm going to monitor them closely to see if we need to use dummy data. My goal is to eventually eliminate the EFS completely, but I need to deal with each of those special folders one by one. For now we're hoping that keeping all the static files off the EFS will stabilize the entire stack. (By the way, to anyone else reading this, use the WP Offload S3 plugin to offload the WordPress media files.)

One thing is for certain... anyone using these templates as-is needs to use dummy data. It wouldn't be a bad idea to write this into the template itself. You could have it create enough dummy data to get the EFS into the second performance tier. I really think this is a necessity for anyone running a WordPress site completely within the EFS. Making this part of the stack could save a lot of people from experiencing the kind of failures that I did. Having an alarm for those credits is simply not enough.

jorgevazquez commented 5 years ago

I know this is an old thread but these problems could have been avoided by provisioning IOPS for the EFS in the first place. Leaving the field at 0 is asking for trouble.

ghost commented 5 years ago

@jorgevazquez

Yes, this is an old thread, but I appreciate you adding valuable info.

I'm not familiar with IOPS, but will look into it tomorrow. I'm still of the opinion that hosting a website with as many files as WordPress from the EFS alone is not wise. Perhaps your recommendation here would have solved that.

I can confirm that the dummy data solution does work, but it is also quite expensive, as I learned the hard way.

We're currently using the Elastic Container Service, and we containerized all our static files (practically every file in WP). It completely solved the issue with the EFS for us. We are still using the EFS for files shared between all the servers, and so far there are no issues with that. There just aren't enough files in the EFS to deplete the burst credits. At the moment I don't need any dummy data, which is saving a lot of money too.

I can only say good things about Docker and the EFS. I had to get some qualified help to set it all up, but it was well worth it. The containers run beautifully, and we have never had a single problem. I have to do my deployments manually (first to an S3 bucket), but frankly this has more advantages than disadvantages. It forces us to be very careful with deployments. We're also using some plugin based solutions like the WP Offload Media and WP Offload SES plugins, which are a complete necessity.

Anyway, if anyone reading this ever wants to know more about our stack, just ask! ;)

~ Michael

go2smartphone commented 4 years ago

I have the same issues when using the m5 tier.