withastro / adapters

Home for Astro's core maintained adapters
67 stars 38 forks source link

Cloudflare: _routes.json is truncated #374

Open schummar opened 2 months ago

schummar commented 2 months ago

Astro Info

Astro                    v4.11.5
Node                     v20.12.2
System                   Linux (x64)
Package Manager          npm
Output                   hybrid
Adapter                  @astrojs/cloudflare
Integrations             auto-import
                         @astrojs/mdx
                         @astrojs/react
                         @astrojs/tailwind
                         @astrojs/sitemap
                         @playform/compress

Describe the Bug

Hi, I have worked on automatic _routes.json generation some time ago. I tried to make the file as small as possible, because Cloudflare imposes a surprising limit of 100 rules. I guess there were issues or the feature or it was not possible to maintain with the big refactor or something. In any case, it's back to the old, simple way of generating the file.

Simple is good of course, but the way the limit is handled now is: After 100, the rest of the rules is simply cut off. I was very surprised that my page was not working as expected - because many of my static routes resulted in function invocations. That can be a lot slower and it can become really expensive, if one it not careful.

What's the expected result?

Hi, I have worked on automatic _routes.json generation some time ago. I tried to make the file as small as possible, because Cloudflare imposes a surprising limit of 100 rules. I guess there were issues or the feature or it was not possible to maintain with the big refactor or something. In any case, it's back to the old, simple way of generating the file now.

Simple is good of course, but the way the limit is handled now is: After 100, the rest of the rules is simply cut off. I was very surprised that my page was not working as expected. A lot of static routes are resulting in function invocations instead! That can be a lot slower and expensive if you have a bit more traffic. I think, truncating is not the best solution. It is a common use case that sites have many static pages and only a few dynamic ones.

Would you be interested in investigating a more efficient approach once again? If so, what were the reasons/problems that prompted that simplification?

Link to Minimal Reproducible Example

https://stackblitz.com/edit/github-qvzkwv?file=dist%2F_routes.json

Participation

alexanderniebuhr commented 2 months ago

Yeah we needed to rethink the implementation, due to some limitations. Could you explain what a more efficient approach is in your opinion?

schummar commented 2 months ago

Sure. I realize there are probably some special cases to consider, but on the surface it could work like this - that's how I built it at the time:

  1. A list of static routes and a list of dynamic routes is calculated - or I guess this info is already available somewhere
  2. Try to merge entries in each list if they have common prefixes - but only as far this does not create a conflict with the other list
  3. Either use the include list for dynamic routes, with some exclude entries where needed. Or use the exclude list for static routes, with some include entries where needed. Whichever produces the smaller result.

Example: Let's say we have src/pages/index.astro src/pages/static/[id].astro <- With 1000 ids supplied via getStaticPaths src/pages/dynamic/[id].astro <- with export const prerender = false;

Current output - with 90% of static pages becoming dynamic when they should not:

{
  "version": 1,
  "include": [
    "/*"
  ],
  "exclude": [
    "/",
    "/_astro/*",
    "/favicon.svg",
    "/static/0",
    "/static/1",
   ...
    "/static/95"
  ]
}

Nicer output - which was produced with version <= 9:

{
  "version": 1,
  "include": [
    "/_image",
    "/dynamic/*"
  ],
  "exclude": []
}

If there are a few more dynamic routes and the static ones become the exception, it could also become:

{
  "version": 1,
  "include": [
    "/*"
  ],
  "exclude": [
    "/static/*"
  ]
}

And if there is a mix of dynamic and static routes in the same folder, something like:

{
  "version": 1,
  "include": [
    "/mostlyDynamic/*"
  ],
  "exclude": [
    "/mostlyDynamic/exceptThisOne" // excludes override includes if I remember correctly
  ]
}

Yes, that brings more complexity into the code, but right now the automatic generation is a bit misleading in my opinion, because it simply does not do what you expect. So if you are not against it in principle, I could try my luck.

I would help to be aware of any special cases that might not be obvious. Like

alexanderniebuhr commented 2 months ago

Either use the include list for dynamic routes, with some exclude entries where needed. Or use the exclude list for static routes, with some include entries where needed. Whichever produces the smaller result.

This is a false assumption. We can't go by the size only. Some project specific circumstances, force to use a /* for the include array, regardless of the amount of rules. In fact other frameworks route everything towards the SSR function, us having some excludes is actually already better.

Example: Let's say we have src/pages/index.astro src/pages/static/[id].astro <- With 1000 ids supplied via getStaticPaths src/pages/dynamic/[id].astro <- with export const prerender = false;

Actually for your example the current output is the correct one. To be clear I'm open for a better approach, but we need to discuss this further. Compatibility goes above optimization here (means if we loose some static pages in the exclude array, but won't have an error for other pages, which would otherwise be not routed correctly, we would always prefer the one which routes more things dynamically)

I'm happy to give more insights, but I hope this makes sense.

schummar commented 2 months ago

This is a false assumption. We can't go by the size only. Some project specific circumstances, force to use a /* for the include array, regardless of the amount of rules. In fact other frameworks route everything towards the SSR function, us having some excludes is actually already better.

By choosing the smaller result I mean between multiple correct options, the smaller one should be chosen. Of course, not throwing errors and always returning a proper result is the most important thing. But if we can make sure of that, a smaller solution or rather a more complete solution would be better, right?

Actually for your example the current output is the correct one. To be clear I'm open for a better approach, but we need to discuss this further. Compatibility goes above optimization here (means if we loose some static pages in the exclude array, but won't have an error for other pages, which would otherwise be not routed correctly, we would always prefer the one which routes more things dynamically)

I'm happy to give more insights, but I hope this makes sense.

Correct in the sense that users always see the correct page - yes. Correct in the sense that performance and costs are as you would expect - no. This really can be a problem for even moderately popular sites. 100k requests per day is not unrealistic - since you would possibly have many requests per page visit if you invoke a function for almost every page and many of the assets on them. Hybrid mode is one of the cool things about Astro, right? That you can have both the speed and efficiency of static pages and the power of functions at the same time in a neat package. If you build with Astro and deploy using the official Astro Cloudflare adapter, you expect everything to behave as it does in your dev environment.

So the question is, is it possible to guarantee correctness (at least when using Astro's own mechanisms - if you use custom functions, I guess anything goes). I'm interested in the circumstances that "force to use a /* for the include array". I mean if one uses a src/pages/[whatever].astro file, sure everything needs to be handled dynamically except for some explicit routes. Then the solution cannot be optimized. That fact is well known and can be considered. Are the other circumstances where we cannot know this?

alexanderniebuhr commented 1 month ago

Only have time to answer quickly, will answer in detail later.

But if the user doesn't have a pre-rendered static /404 route, we have to use /*. Only if the user has such static 404 route we can optimize the routes file.

schummar commented 1 month ago

Ah, ok. What does the function do in that case?

From the Cloudflare docs (https://developers.cloudflare.com/pages/configuration/serving-pages):

Not Found behavior You can define a custom page to be displayed when Pages cannot find a requested file by creating a 404.html file. Pages will then attempt to find the closest 404 page. If one is not found in the same directory as the route you are currently requesting, it will continue to look up the directory tree for a matching 404.html file, ending in /404.html. This means that you can define custom 404 paths for situations like /blog/404.html and /404.html, and Pages will automatically render the correct one depending on the situation.

Single-page application (SPA) rendering If your project does not include a top-level 404.html file, Pages assumes that you are deploying a single-page application. This includes frameworks like React, Vue, and Angular. Pages’ default single-page application behavior matches all incoming paths to the root (/), allowing you to capture URLs like /about or /help and respond to them from within your SPA.

When I test this without custom 404 on Cloudflare and access a route that does not exists, I simply see the homepage. Not the same Astro generated 404 page that I see when developing.

alexanderniebuhr commented 1 month ago

Exactly what is described, this is not the correct behavior. Similar case happens if you have static routes and potential dynamic routes under the same subpaths. So to summarize, we are interested in every idea to make this better, as long as it is compatible and guarantee correctness for every scenario. If we would have to decide if we need a documentation which tells users, what not to do vs. just routing to SSR, we would always want to fallback to SSR instead of the limitation. I would like to protect you from doing a lot of work which ends up not usable, so if you would like to work on this, I would suggest you start with defining all logic in pseudo code first, so we can check all of that before you invest resources in the actual implementation, but it's up to you, how you would like to continue. I would probably update the current logic anyways in the future too.

On another note, I wonder if the routes.json which would be generated for you, if you add a static 404 page, is more aligned with the old one 🤷

schummar commented 1 month ago

Exactly what is described, this is not the correct behavior.

I was testing with the current version.

Similar case happens if you have static routes and potential dynamic routes under the same subpaths.

Yes, those cases need to be handled correctly. And that - figuring out which wildcard path can clash with wich other wildcard path - will be the main difficulty.

So to summarize, we are interested in every idea to make this better, as long as it is compatible and guarantee correctness for every scenario.

I will think about it. And if it seems doable, I will present my ideas. Thanks for the feedback and hints so far 😀

On another note, I wonder if the routes.json which would be generated for you, if you add a static 404 page, is more aligned with the old one 🤷

We have a static 404 page, but the output is as posted earlier.

alexanderniebuhr commented 1 month ago

If you have a static 404 page, which results in dist/404.html, then the only reason why you get a /* in the include array is, that there is no "safe/correct" way to generate a _routes.json with exclude only.

P.S.: Another note here, I have some early access to upcoming Cloudflare changes, and it seems like there might be a new mode we need to support, which doesn't support a _routes.json file at all potentially, because it will route all assets for free, without defining them. So whatever we do, it should be correct and potentially also opt-in with the default being - "route everything through the function"

schummar commented 2 weeks ago

If you have a static 404 page, which results in dist/404.html, then the only reason why you get a /* in the include array is, that there is no "safe/correct" way to generate a _routes.json with exclude only.

Took some time, but now I got around to try my luck. And in fact, there was a small issue that prevented the routes generation from behaving like you said. So without big/dangerous changes, we can already do a lot better. See #423

P.S.: Another note here, I have some early access to upcoming Cloudflare changes, and it seems like there might be a new mode we need to support, which doesn't support a _routes.json file at all potentially, because it will route all assets for free, without defining them. So whatever we do, it should be correct and potentially also opt-in with the default being - "route everything through the function"

Maybe you are talking about the new "Static Assets" and "Frameworks" feature, that is in the Cloudflare Workers docs now. Looks interesting. Not sure if this will work for Pages as well, but the thought of not bothering with a _routes.json, but instead having Cloudflare detecting static assets automatically would be very cool indeed.