webiny / webiny-js

Open-source serverless enterprise CMS. Includes a headless CMS, page builder, form builder, and file manager. Easy to customize and expand. Deploys to AWS.
https://www.webiny.com

Improve page SSR generation and caching mechanisms #622

Closed adrians5j closed 4 years ago

adrians5j commented 4 years ago

The goal of this issue is to define an improved version of page SSR generation and caching mechanisms.

How it works currently

Once a page request arrives, the site Lambda first checks whether SSR content for the requested URL is already stored in its local cache (a plain object in memory). If so, it returns that; otherwise it calls the SSR Lambda, caches the result locally, and returns it. Additionally, the response will include proper headers instructing the CDN to cache it for a short period of time, so that subsequent requests don't even reach the Lambda function.
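In rough code, the current flow could be sketched like this (a minimal illustration; the function names, cache shape, and `max-age` value are assumptions, not the actual Webiny source):

```javascript
// Hypothetical sketch of the current site Lambda flow.
const localCache = {}; // plain object kept in Lambda memory

function cdnHeaders() {
  // Instruct the CDN to cache the response for a short period so that
  // subsequent requests don't reach the Lambda at all.
  return { "Cache-Control": "public, max-age=30" };
}

async function handleRequest(url, invokeSsrLambda) {
  // 1. Serve from the in-memory cache if we already have the HTML.
  if (localCache[url]) {
    return { body: localCache[url], headers: cdnHeaders() };
  }
  // 2. Otherwise call the SSR Lambda, cache the result, and return it.
  const html = await invokeSsrLambda(url);
  localCache[url] = html;
  return { body: html, headers: cdnHeaders() };
}
```

The in-memory cache only survives as long as the Lambda container is warm, which is part of why the CDN-level caching matters.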

Possible improvement

Once the user (page editor) publishes the page, we immediately execute SSR for it, and save the output directly into the database (we could add a new model for this - PbPageCache).

Once a real user visits the site, the Lambda function responsible for returning the HTML won't execute SSR anymore; we are removing this step. Instead, it will simply fetch the SSR content for the requested URL from the PageBuilder service.

We could either create a separate field in the GQL schema, or just create a simple REST endpoint on the PageBuilder component, which would receive the URL for which the SSR content needs to be returned. E.g.:

GET https://cloudfront.my-api-xyz.com/page-builder/ssr/my-super-url-that-needs-html/right-now

The advantage of using a GET and a simple REST route is that we can additionally cache the result on the CDN, thus making subsequent requests from the site Lambda extremely fast. 🚀

But for "v1", we could just try a simple GQL field; the output that the site Lambda returns will be cached on the site CDN anyway.
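If we went the REST route, the handler's path parsing could look roughly like this (purely illustrative; the route prefix matches the example URL above, but the function is hypothetical):

```javascript
// Hypothetical path parsing for the proposed REST route.
// GET /page-builder/ssr/<page-url> returns the stored SSR HTML.
function parseSsrPath(requestPath) {
  const prefix = "/page-builder/ssr/";
  if (!requestPath.startsWith(prefix)) return null;
  // Everything after the prefix is the page URL whose SSR HTML we need.
  return "/" + requestPath.slice(prefix.length);
}
```

Because the route is a plain GET keyed on the path, the CDN can cache each page's SSR HTML independently with no extra work.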

TTL and recreating cache

When returning the SSR content, if we detect that the cache about to be sent is more than X seconds old, we will recreate it, but in an async fashion: we will still send the old page SSR, but will also trigger the Lambda (InvocationType: "Event") that recreates the page SSR.
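This is a stale-while-revalidate pattern, and could be sketched like so (the TTL value, entry shape, and function names are assumptions for illustration, not the actual data model):

```javascript
// Illustrative stale-while-revalidate check for the SSR cache.
const TTL_SECONDS = 80; // the "X seconds" from the description above

function serveFromSsrCache(entry, nowMs, triggerAsyncRebuild) {
  const ageSeconds = (nowMs - entry.savedOn) / 1000;
  if (ageSeconds > TTL_SECONDS) {
    // Serve the stale HTML anyway, but kick off an async rebuild -
    // the equivalent of invoking a Lambda with InvocationType: "Event".
    triggerAsyncRebuild(entry.url);
  }
  return entry.html;
}
```

The visitor who triggers the rebuild still gets the old HTML instantly; the next visitor gets the refreshed snapshot.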

Additional features (TBD)

We can discuss these once we discuss the main topic. The ideas are:

1. "invalidate cache" button in page builder

Although cool, this is problematic since we cannot do anything on the CDN programmatically. We can only confidently "clear" the cache in the database.

2. Disable caching for admins, so they can always see the up-to-date version

This might be a cool feature, although it might require some auth work.

Or... one thing that just came to mind: we could append an additional query param when previewing pages from Page Builder, which would force-ignore the cache and force SSR generation on each request.

Currently, when you click "Preview" in Page Builder, you are redirected to a URL that looks like this: https://customsite.z1.webiny.com/welcome-to-webiny?preview=5dd2efe269c40200014709c8.

Maybe we could skip SSR cache reading when preview is ON, or something along those lines...

adrians5j commented 4 years ago

Could somebody share an opinion on the following...

> Once a real user visits the site, the Lambda function that's responsible for returning the HTML won't execute SSR anymore. We are removing this step. This time it will simply fetch the SSR content for the requested URL, from the PageBuilder service.

This means that the PageBuilder service needs to be able to generate the SSR output on demand (and save it into the DB of course).

I was thinking we could maybe add some kind of a query param, that would force the site Lambda to return the SSR HTML, which PageBuilder could then save into the DB.

For example: mysite.com/about-us?ssr=true. This is a super simple way of doing it. If needed, we could send a more sophisticated value via query params to trigger the SSR.

Note: PageBuilder in the Admin app actually knows the URL of the site, so I guess there shouldn't be any problems with implementation.

Any thoughts on this?

Pavel910 commented 4 years ago

Or maybe an HTTP header? Just to keep the URL in its original form and avoid any potential param collisions, etc. Something like X_WEBINY_SSR=true?
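The header check on the site Lambda side would be trivial; a sketch, assuming the header name proposed above (the handler shape is hypothetical, and note that Node normalizes incoming header names to lowercase):

```javascript
// Hypothetical check for the proposed SSR-trigger header.
function shouldForceSsr(requestHeaders) {
  // Node lower-cases incoming header names, so check the normalized key.
  return requestHeaders["x_webiny_ssr"] === "true";
}
```

Keeping the trigger out of the URL also means the CDN cache key (the path) stays clean.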

adrians5j commented 4 years ago

I like this, cool! @Pavel910

SvenAlHamad commented 4 years ago

Originally we used the header, so we can do the same here again. Another note: the Lambda that receives the user's request needs to check the TTL of the page in the cache. If the content is older than X, it still needs to be able to somehow trigger the SSR function to refresh the snapshot.

adrians5j commented 4 years ago

Yes, the explained logic is valid, but it will be contained in the PageBuilder GQL field that's responsible for returning the SSR HTML. It will also determine whether the cached SSR HTML is stale. If it is, it will still return it as the response to the received request, but will also trigger an async re-fetch process, meaning the new SSR HTML will be served on the next request.

Interestingly enough, this gives us a nice little option too: it enables us to add cache TTL adjustment in the PageBuilder settings, so the user would be able to tweak it more easily. Just a thought...

@SvenAlHamad

SvenAlHamad commented 4 years ago

Sounds good. Also, the TTL could in that case even be controlled per page, but let's keep those options aside for the next iteration.

adrians5j commented 4 years ago

Yeah, per-page TTLs sound even more awesome!

But I agree, next iteration it is.

SvenAlHamad commented 4 years ago

Another edge case might occur here, where the Lambda that receives a user request would need to trigger SSR. When a user installs Webiny (a fresh installation), the SSR cache will most likely be empty. How will we handle that? We could maybe pre-populate the cache in the database upon installation.

This then leaves another edge case: what about users running the current Webiny version, with pages they've already created and published? How do we handle that?

Pavel910 commented 4 years ago

I don't think initial setup should bother with the cache, since we only have 3-4 demo pages. Nobody is going to publish that to production before going into the admin app and fiddling with pages.

Pavel910 commented 4 years ago

As for existing sites, our site handler should be able to request a render from the API. Or maybe we should have a "Regenerate all pages" button somewhere in the backend, so you can simply click it and have your cache rebuilt in a few minutes?

SvenAlHamad commented 4 years ago

In that case, can we fall back to client-side rendering and then trigger the SSR job in the background? I don't think there is a need for a "Regenerate all pages" function just yet.

Pavel910 commented 4 years ago

Except bots will not be happy about that. I think at this point, as a fallback to the "no cache found" problem, we can simply run SSR directly from the site handler when that case happens. But we can try both during implementation and see what feels better.

SvenAlHamad commented 4 years ago

That should also work fine.

adrians5j commented 4 years ago

I agree with @Pavel910; the "no cache found" issue will only happen once for each page, and after a single visit, it won't occur anymore.

But I'm thinking we can do this automatically on page save: if there is no cache at all, just create it. Working on this as we speak, so I'll let you know how it goes. I'm not talking about constant recreation, only when the cache is totally empty, which is the case for all newly created pages.

Pavel910 commented 4 years ago

@sven is talking about the case where you already have 100 published pages and you update to the new SSR (like Roman). There will be no save event, so we just need to have a fallback.

adrians5j commented 4 years ago

Ah I see....

OK.

roman-vabishchevych commented 4 years ago

And we need to think about PagesList cache, not only PageBuilder.

Pavel910 commented 4 years ago

@roman-vabishchevych could you elaborate please?

SvenAlHamad commented 4 years ago

@roman-vabishchevych this is handled through this SSR mechanism. Basically you'll get a static snapshot of your whole page. If that page has a page list component inside, it will be part of the snapshot so it won't add any delay to the render.

roman-vabishchevych commented 4 years ago

For example, on https://d306wk9ds9z2uk.cloudfront.net/uk/team I have requests for a lot of other pages, and sometimes this takes time (3-6 sec).

roman-vabishchevych commented 4 years ago

> @roman-vabishchevych this is handled through this SSR mechanism. Basically you'll get a static snapshot of your whole page. If that page has a page list component inside, it will be part of the snapshot so it won't add any delay to the render.

Will this depend on lambda cold-start too?

Now I am trying to investigate these lags. https://lumigo.io/blog/how-to-improve-aws-lambda-cold-start-performance/

When I hosted v1 webiny on my DigitalOcean in docker, it worked very fast (600-800ms).

SvenAlHamad commented 4 years ago

Yes, so my previous comment is still true. The snapshot will have all the content inside.

In terms of the lag, the snapshot will also be cached on the CDN, so the lag should be especially small, and if there is one, it should be under 1s, as the Lambdas are not inside a VPC, so the cold start is much shorter.

SvenAlHamad commented 4 years ago

@roman-vabishchevych you can have a look at my blog post about an earlier version where we implemented the same solution; it will give you an idea of the performance gains you can expect: https://medium.com/hackernoon/how-to-ssr-in-a-serverless-environment-and-make-your-visitors-400-happier-5a2a101ecb15

SvenAlHamad commented 4 years ago

Just another comment for @doitadrian who’s implementing this. Please make sure to:

  1. Bypass the SSR cache when a page is requested with a ?preview query parameter.
  2. In the same case, also make sure to send a no-cache header downstream to the CDN.

In terms of UX, when a user publishes a page, be that via the page editor or the page list screen, display a message along the lines of: "The page is published, but it can take up to 30 seconds until the new content is visible." This message is needed because, although we recreate the SSR snapshot once the page is published, the CDN might prevent the user from seeing the new content immediately.
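The two preview-mode rules above could be sketched like this (the handler shape, the fallback origin, and the `max-age` value are illustrative assumptions; only the `?preview` param and the no-cache behavior come from the thread):

```javascript
// Hypothetical response builder honoring the two preview-mode rules.
function buildResponse(url, cachedHtml, renderFresh) {
  // The base origin is only needed so relative paths parse; it's arbitrary.
  const isPreview = new URL(url, "https://example.com").searchParams.has("preview");
  if (isPreview) {
    return {
      body: renderFresh(),                      // 1. bypass the SSR cache
      headers: { "Cache-Control": "no-cache" }, // 2. tell the CDN not to cache
    };
  }
  // Normal visitors get the cached snapshot with a short CDN TTL.
  return { body: cachedHtml, headers: { "Cache-Control": "public, max-age=30" } };
}
```
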

adrians5j commented 4 years ago

Sounds good @SvenAlHamad.

adrians5j commented 4 years ago

A short update

In general, everything we discussed here is implemented, with some very nice extras and optimizations too. One of the most important features is link preloading. Basically, we noticed that even with SSR, when navigating from one page to another, we still get the loading overlay, because the page content is fetched on demand, in other words, when users click on links.

So in order to eliminate these long page transitions, we implemented a client-side mechanism that prefetches links when needed. If you have a page with three links on it, the pages behind them will be prefetched, and clicking any of the three makes the transition immediate.
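The core of such a mechanism is just deduplicated fire-and-forget fetching; a minimal sketch (the DOM scanning and the actual fetch are stubbed as parameters, and all names are illustrative, not the shipped implementation):

```javascript
// Hypothetical prefetcher core: fetch each link target at most once.
function createPrefetcher(fetchPage) {
  const prefetched = new Set();
  return function prefetch(href) {
    if (prefetched.has(href)) return; // already warmed, skip
    prefetched.add(href);
    // Fire-and-forget; the result lands in the client-side cache so the
    // transition is immediate when the user actually clicks the link.
    fetchPage(href);
  };
}
```

In the browser, this would typically be driven by scanning rendered links (or an IntersectionObserver) and calling `prefetch` for each candidate.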

Plans

1. New caching mechanism

Although everything we discussed here is implemented, unfortunately, we still see one recurring problem that needs to be solved - the notorious cold starts.

When Lambda functions are warm, serving pages is fast enough, no problem there. But when opening the site after a longer period of inactivity, the initial page load can take 5-7 seconds, and in some cases even longer (we've seen 10+ second cases). This happens because, in order to return the stale cached SSR HTML (stale cache from the DB; nothing is generated in real time here, as discussed), we still have to invoke three Lambda functions, which takes time:

  1. initial site Lambda that serves the content
  2. site Lambda invokes the ssrCache Lambda (basic Apollo service)
  3. the request to the ssrCache happens over Apollo Gateway Lambda

Because of this, we've decided to take additional steps to fix this. We are going to introduce more aggressive, but still smart, caching techniques.

How it works

For starters, we are going to define long cache TTLs (e.g. 1 month) for every kind of page that gets returned via SSR. So basically, pages will always be served super fast from CloudFront.

And from the admin side, we will purge the cache for the specific URL each time the user publishes a page from the Admin area.

But of course, there's that issue with dynamic data on pages. For example, if page A contains a pages-list page element and a new page B is published, the pages-list on page A should include it. So we need a way to refresh pages whose linked content has changed.

For this, we agreed to implement the following async event-based solution.

Continuing with the case explained above: when a new page B is published, we will trigger an async process that finds all pages affected by this action and marks them as "dirty". This simply means that the cache is no longer valid and needs to be refreshed. You might ask yourself: "Instead of marking pages as 'dirty', why not just purge the cache for all relevant pages immediately?" This is because CDNs have limits on the number of paths that can be purged at the same time; we don't want to hit that limit, and we also want to avoid doing any massive purges.

After that, once a user visits the actual page, an async API call will be triggered (which won't affect site load speed / UX) to check whether the page is dirty. If so, the page will be purged from the cache (again, all asynchronously), and on the next refresh the user will get the new content. This way we don't send a massive number of requests to the CDN, and caches are regenerated on demand.
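The dirty-flag bookkeeping described above could be sketched like this (the store, entry shape, and purge callback are illustrative stand-ins, not the actual PbPageCache model or CDN API):

```javascript
// Hypothetical "dirty" cache bookkeeping for the new mechanism.
function createSsrCacheStore(purgeFromCdn) {
  const entries = new Map(); // url -> { html, dirty }

  return {
    save(url, html) {
      entries.set(url, { html, dirty: false });
    },
    // Called async when e.g. a new page is published: mark affected
    // pages instead of purging them all at once (CDN purge limits).
    markDirty(urls) {
      for (const url of urls) {
        const entry = entries.get(url);
        if (entry) entry.dirty = true;
      }
    },
    // Called async when a visitor hits a page: purge only if dirty,
    // so purges are spread out and happen on demand.
    onVisit(url) {
      const entry = entries.get(url);
      if (entry && entry.dirty) {
        purgeFromCdn(url);
        entry.dirty = false;
      }
    },
  };
}
```
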

2. Improve link prefetching

Currently, whenever a link is detected, its content will be prefetched. We don't want that in all cases, which is why we will implement a prefetch prop, with which you will be able to avoid any kind of prefetching. Not sure at the moment; maybe we'll even create a different component for these links.

Start of implementation

I will start with the implementation today. Hoping to have it done by the end of this / beginning of next week. Will keep you updated.

kisg commented 4 years ago

I have a different (alternative) suggestion: Why not use S3 as the page cache (together with any media and other assets)?

Whenever a page is modified, just store the SSR page data on S3 with the correct site structure and serve the site from there. To make sure that the actual site API service (which could live at a different api.sitename.domain.com address, so the browser will load it in parallel) is warm when needed, each page would end with a snippet that calls a Webiny-specific tracking API.

No need to set long TTLs or manually invalidate CloudFront caches, etc.; serving a static site from S3 is a standard CloudFront use case.

Pages that require authentication could be solved by this method: https://blog.octo.com/en/authorisation-for-aws-s3-static-website/

For client side rendered pages / apps, the boilerplate HTML that will load the app could still be published to S3, and loaded from there.

What do you think?

adrians5j commented 4 years ago

Hi @kisg, thank you for your thoughts!

Basically we are already doing this, but instead of storing the SSR output directly on S3, we store it in the database and cache it on the CDN. Serving pages from a CDN edge will always be faster (and cheaper) because it's closer to the real user.

We will be doing a release of this feature very soon, so stay tuned.

P.S. I will check the link you provided.

kisg commented 4 years ago

Hi @doitadrian, thank you for the quick reply!

I understand what you are doing, but with your current solution, when the CDN needs to pull page content from the source location, it has to query the site Lambda function, and this leads to a potentially slow path if the Lambda is not running.

My suggestion is to eliminate this slow path by using S3 as the source for the website. Of course, the pages would still be cached by the CDN, but even the slow path (CDN -> S3 request with an If-Modified-Since header) is much faster than CDN -> Lambda (needs to start) -> MongoDB. It also uses fewer resources and is more resilient to failures (e.g. the basic site can stay online even if the dynamic parts, such as Lambdas and the database, are offline for any reason).
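For reference, the conditional-GET behavior that If-Modified-Since enables works roughly like this (purely illustrative; S3 and CloudFront implement this for static objects out of the box, so this logic would not need to be hand-written):

```javascript
// Sketch of If-Modified-Since handling on the origin side.
function conditionalGet(lastModifiedMs, ifModifiedSinceHeader) {
  if (ifModifiedSinceHeader) {
    const since = Date.parse(ifModifiedSinceHeader);
    // HTTP dates have one-second resolution; if the resource hasn't
    // changed since the given date, reply 304 with no body at all.
    if (!Number.isNaN(since) && lastModifiedMs <= since) {
      return { status: 304 };
    }
  }
  return { status: 200 }; // modified (or no header): send the full response
}
```
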

Pavel910 commented 4 years ago

Having a Lambda gives us more control over the request; apps consist not only of pages from Page Builder, but also of custom app routes the API is not aware of. Those pages still require SSR capabilities, and with a Lambda we can in fact render them as regular pages. With S3, we can only have pages that come from Page Builder, since we know those in advance and can render them when a page is published.

Also, the Lambda that handles the request is very small in size and doesn't have much of a cold start; it's a very simple Lambda that only routes traffic and immediately returns data from the DB. Having the page cache in the DB also gives us the ability to purge/update the cache whenever we see fit.

We'll see how our solution behaves, and if necessary, we'll do another round of improvements.

Thanks for the input ❤️

kisg commented 4 years ago

Thanks to you all for your great work on Webiny! It's a beautiful piece of software, that is why I am here. :)

I don't want to persuade you to change / replace what you are doing, however I do plan to implement this method I described because then a Webiny site would get all the speed and reliability advantages of a static website (e.g. as if it was built using Gatsby), and all the nice dynamic features, like forms, page builder ... etc.

It would still be possible to handle the dynamic paths using the site lambda simply by adding the necessary page rules (CloudFlare terminology, I am not sure about CloudFront) to the CDN configuration.

Would you consider accepting a contribution with such an alternative implementation as an optional (configurable) feature?

Pavel910 commented 4 years ago

Absolutely, depending on the implementation we can always support multiple setups so developers can pick what suits their project best. We'd be glad to help and provide input along the way, so feel free to begin working on it and in the end we can discuss if it is something that can be a simple configuration switch, or maybe a separate set of plugins.

We're releasing the new SSR system tomorrow, so give it a try before starting your work on a different implementation, and let us know what you think :)

adrians5j commented 4 years ago

Hey @kisg, thanks again for your thoughts!

Just wanted to confirm a few things that @Pavel910 brought up, and add a few of my own thoughts.

As he mentioned, this might work if the only things that need to be SSRed are pages created with the Page Builder. But that's not the case. Users can in fact choose not to use PB at all, and maybe only utilize the Form Builder app, coding simple React pages with forms embedded in them. In that case, as you've also mentioned, we would still need a site Lambda to do the SSR.

Note that the SSR HTML is stored in a database per URL, meaning it doesn't really matter what's behind it, be it a page created via PB or a simple hand-coded login form.

Additionally, storing SSR HTML in a database enables us to do searches, and store additional data about every cache entry. For example...

There are other events that can cause SSR HTML invalidation, for example when a user updates the main menu. When that happens, we actually only invalidate the URLs that contain that specific menu, which is more effective than invalidating the whole site. The same happens with forms: when a form is changed, we again just invalidate the caches that contain it. So what I'm trying to say is, I'm not sure we could have that kind of functionality with your approach.

What I also wanted to add is that, in some cases, when doing SSR, different users might receive different SSR HTML based on the sent request headers. I'm not sure you can accomplish that via your method. But then again, solutions for this can differ depending on the app and requirements.

BTW, one thing that caught my eye is the If-Modified-Since header you mentioned. I've just read a bit about it, and I think it could be helpful for our current implementation. So thanks for that! 👍