yuzutech / kroki

Creates diagrams from textual descriptions!
https://kroki.io
MIT License
2.87k stars 211 forks source link

Update `compileTime` used for caching? #1062

Open MarcelWaldvogel opened 2 years ago

MarcelWaldvogel commented 2 years ago

There is a hardcoded compileTime in the caching file, which is used in the Last-Modified header of the returned files:

https://github.com/yuzutech/kroki/blob/c4e0bb386f4742bd788d5fa6b53c2d49100f9e4c/server/src/main/java/io/kroki/server/response/Caching.java#L15

As the returned diagram will only depend on the input string and the software version used, this seems like a good approach. However, the date in there is Wed Sep 4 07:12:58 UTC 2019 and probably some pieces of the rendering have changed in these 2+ years, so that this variable should be updated regularly.

What comes to mind right now:

What do you think?

ggrossetie commented 2 years ago

I think we should find a way to include the diagram version:

https://github.com/yuzutech/kroki/blob/c4e0bb386f4742bd788d5fa6b53c2d49100f9e4c/server/src/main/java/io/kroki/server/service/DiagramService.java#L17

The last modification date depends on:

The main issue is that Last-Modified is an HTTP date, not a unique identifier. Using Kroki compile time or start time is inaccurate.

MDN Web Docs states:

The Last-Modified response HTTP header contains a date and time when the origin server believes the resource was last modified. It is used as a validator to determine if the resource is the same as the previously stored one. Less accurate than an ETag header, it is a fallback mechanism.

Since ETag header is fully supported, I guess we could remove this header? https://caniuse.com/?search=etag

However, and as mentioned above, we will need to compute the ETag from the input string, diagram options and the diagram library version.

MarcelWaldvogel commented 2 years ago

So, Last-Modified would just be the current date? (I think dropping it would not be helpful/necessary)

I see no ETag handling in the code, just ETag creation. So, the caching and If-Match handling (and If-Modified-Since for Last-Modified above) is entire left to caches (proxy, client)?

Then, why not just calculate the ETag as a strong hash of the response? Switching to input parameters (input string, options, versions) seems more complicated and only makes sense when Kroki itself can forego repeated processing by handling If-Match/If-Modified-Since itself?

ggrossetie commented 2 years ago

So, Last-Modified would just be the current date? (I think dropping it would not be helpful/necessary)

I don't think we should set Last-Modified if we cannot reliably compute a value.

If we really want to keep it then I guess we should use the release/build date (for instance, using Maven maven.build.timestamp).

I see no ETag handling in the code, just ETag creation. So, the caching and If-Match handling (and If-Modified-Since for Last-Modified above) is entire left to caches (proxy, client)?

Currently yes but I think the gateway server should implement this logic and return a 304 Not Modified.

Then, why not just calculate the ETag as a strong hash of the response?

Because we want to save processing time. If we need to generate an image from the input in order to calculate an ETag that's a bit unfortunate. Since generating an image is a pure function (i.e., it always return the same image from the same input, given that the diagram library used is the same), we should leverage that and calculate the ETag using the input.

Switching to input parameters (input string, options, versions) seems more complicated and only makes sense when Kroki itself can forego repeated processing by handling If-Match/If-Modified-Since itself?

That's the plan 👍

MarcelWaldvogel commented 2 years ago

I'm in favor of outsourcing complexity to someone who already has implemented it instead of implementing it on my own.

So, unless this is a real issue (I do not know the load on kroki.io, but it should be no problem at all for a local installation), I would add a frontend proxy with enough on-disk cache, and the problem is solved, however we create Last-Modified and the ETag. Then, also, the Last-Modified may be set to the current time (which is fine, as it will not be used and touching the file on disk would have the same result) and ETag can just be calculated from the output. (I got the impression that configuration variables can be all over the HTTP request, in all kinds of formats/capitalizations; but this is just from the glance when trying to make header checks work.)

If it becomes a performance problem (or already is?), then looking at Kroki-side caching might make sense, but having more data on the workload would be great (how often is the same file requested, how long does generating take, …).

ggrossetie commented 2 years ago

I'm in favor of outsourcing complexity to someone who already has implemented it instead of implementing it on my own.

Me too but in this case, NGINX (or really any reverse/frontend proxy) cannot do much with ETag.

So, unless this is a real issue (I do not know the load on kroki.io, but it should be no problem at all for a local installation), I would add a frontend proxy with enough on-disk cache, however we create Last-Modified and the ETag

As far as I know, NGINX won't use ETag but Expires and Cache-Control on GET requests.

Then, also, the Last-Modified may be set to the current time (which is fine, as it will not be used and touching the file on disk would have the same result)

To quote MDN: "The Last-Modified response HTTP header contains a date and time when the origin server believes the resource was last modified."

https://kroki.io/graphviz/svg/eNpLyUwvSizIUHBXqPZIzcnJ17ULzy_KSanlAgB1EAjQ is basically a static image (we could generate it at build/compilation time). So the resource was last modified when Kroki was released/built.

and ETag can just be calculated from the output. (I got the impression that configuration variables can be all over the HTTP request, in all kinds of formats/capitalizations; but this is just from the glance when trying to make header checks work.)

Again, if we do that we cannot save processing time which is a missed opportunity. We can indeed pass diagram options using query parameters, HTTP headers or in the body of POST requests but they are normalized as a JSON object before calling the pure function:

https://github.com/yuzutech/kroki/blob/a22288b54ed92e0ea5076082af06d3a1ec5c87be/server/src/main/java/io/kroki/server/service/DiagramService.java#L20

This function will return the same output given the same input. So the ETag can be generated just before calling convert if the request contains a If-None-Match header. If the generates ETag matches the If-None-Match value, then we can return a 304 without a body! (and without calling convert).

We should implement this logic here:

https://github.com/yuzutech/kroki/blob/c4e0bb386f4742bd788d5fa6b53c2d49100f9e4c/server/src/main/java/io/kroki/server/service/DiagramHandler.java#L65

And there:

https://github.com/yuzutech/kroki/blob/c4e0bb386f4742bd788d5fa6b53c2d49100f9e4c/server/src/main/java/io/kroki/server/service/DiagramHandler.java#L100

If it becomes a performance problem (or already is?), then looking at Kroki-side caching might make sense, but having more data on the workload would be great (how often is the same file requested, how long does generating take, …).

We don't want to cache images (i.e., save on disk) on the Kroki server but we do want to be as efficient as possible. If we can skip work, respond faster and save bandwidth then we should do it.

ggrossetie commented 2 years ago

Anyway, this issue was about Last-Modified and compileTime. Let's focus on that. I think compileTime should be equal to maven.build.timestamp. We can add a reference in: https://github.com/yuzutech/kroki/blob/main/server/src/main/resources/application.properties

app.buildTime=${maven.build.timestamp}

Then, we can (statically) load /application.properties and get the value of app.buildTime.

I think we should create a dedicated issue for ETag.

MarcelWaldvogel commented 2 years ago

It seems that the design for Last-Modified is clear, i.e., that it should be the time at which the code was built, as the result will only depend on the request and the code. I'm spinning off the caching discussion to a new issue.

(One open question about the Last-Modified time: When we have external containers, such as Mermaid, will they create their own Last-Modified or will Kroki set its build time? The former would be preferred, in case the Mermaid container was updated independently.)

ggrossetie commented 2 years ago

It seems that the design for Last-Modified is clear, i.e., that it should be the time at which the code was built, as the result will only depend on the request and the code. I'm spinning off the caching discussion to a new issue.

👍🏻

(One open question about the Last-Modified time: When we have external containers, such as Mermaid, will they create their own Last-Modified or will Kroki set its build time? The former would be preferred, in case the Mermaid container was updated independently.)

Ideally yes but, currently, we do release all containers all together so it's not really an issue 😄

MarcelWaldvogel commented 2 years ago

Build time as Last-Modified is now included in #1065, together with tests.

ggrossetie commented 2 years ago

@MarcelWaldvogel I believe the initial scope of this issue was resolved in #1065. Should we close this issue?