zazuko / kopflos

kopflos - Linked Data APIs
MIT License

API Documentation caching #104

Open tpluscode opened 1 year ago

tpluscode commented 1 year ago

I'm looking into ways to improve Hydra APIs by using cache headers. A first-order recommendation is to use versioned assets with a long-lived, immutable cache. I think this fits the most common case, where the API Documentation remains static at least until the server app is restarted.

Implementing this behaviour would require three changes.

First, add a query string to the API Documentation link header, possibly something like a UNIX timestamp:

-link: </api>; rel="http://www.w3.org/ns/hydra/core#apiDocumentation"
+link: </api?v=123456789>; rel="http://www.w3.org/ns/hydra/core#apiDocumentation"

Second, add a Cache-Control header to the API Documentation response itself:

Cache-Control: max-age=31536000, immutable

Lastly, serve the triples with every occurrence of the /api URI rewritten to /api?v=123456789, so that the client can correctly find the documentation resource in the representation.

This should allow proxies to cache the API documentation.

Here is the diff that solved my problem:

diff --git a/node_modules/hydra-box/lib/middleware/apiHeader.js b/node_modules/hydra-box/lib/middleware/apiHeader.js
index 8751fbc..e9b799c 100644
--- a/node_modules/hydra-box/lib/middleware/apiHeader.js
+++ b/node_modules/hydra-box/lib/middleware/apiHeader.js
@@ -1,16 +1,29 @@
 const { Router } = require('express')
+const $rdf = require('rdf-ext')
+
+const timestamp = Date.now()

 function factory (api) {
   const router = new Router()

+  const timeDependentApiId = $rdf.namedNode(`${api.term.value}?v=${timestamp}`)
+  const dataset = api.dataset.map(({ subject, predicate, object, graph }) => {
+    return $rdf.quad(
+      subject.equals(api.term) ? timeDependentApiId : subject,
+      predicate,
+      object.equals(api.term) ? timeDependentApiId : object,
+      graph)
+  })
+
   router.use((req, res, next) => {
-    res.setLink(api.term.value, 'http://www.w3.org/ns/hydra/core#apiDocumentation')
+    res.setLink(timeDependentApiId.value, 'http://www.w3.org/ns/hydra/core#apiDocumentation')

     next()
   })

   router.get(api.path, (req, res, next) => {
-    res.dataset(api.dataset).catch(next)
+    res.setHeader('cache-control', 'max-age=31536000, immutable')
+    res.dataset(dataset).catch(next)
   })

   return router

This issue body was partially generated by patch-package.

tpluscode commented 1 year ago

Having experimented with this approach a little, I had limited success. The problem with a query string is that the resulting URI is a different identifier, which caused me trouble on the client when trying to find the documentation resource.
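To illustrate the mismatch, here is a minimal sketch. It is not the actual client code: the `resources` map, the example.com URLs and the `withoutQuery` helper are all made up for the example. It only shows that an exact-match lookup by the versioned URI misses a resource identified by the unversioned one.

```javascript
// Hypothetical client-side index: resources keyed by their exact identifier.
const resources = new Map([
  ['https://example.com/api', { type: 'hydra:ApiDocumentation' }]
])

// The versioned URI as it would be advertised in the Link header.
const linkTarget = 'https://example.com/api?v=123456789'

// An exact-match lookup fails, because the query string makes this a different URI.
const direct = resources.get(linkTarget)

// Hypothetical workaround: strip the query before the lookup.
// A generic client has no way of knowing that it should do this.
function withoutQuery (iri) {
  const url = new URL(iri)
  url.search = ''
  return url.toString()
}

const normalized = resources.get(withoutQuery(linkTarget))
```

This is why the first patch also has to rewrite the subject and object terms in the served dataset, and why anything that still refers to the unversioned URI stops matching.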

A different approach I tried uses a shorter cache age together with an ETag. This appears to work nicely:

diff --git a/node_modules/hydra-box/lib/middleware/apiHeader.js b/node_modules/hydra-box/lib/middleware/apiHeader.js
index 8751fbc..33546b7 100644
--- a/node_modules/hydra-box/lib/middleware/apiHeader.js
+++ b/node_modules/hydra-box/lib/middleware/apiHeader.js
@@ -1,15 +1,32 @@
 const { Router } = require('express')
+const $rdf = require('rdf-ext')
+const etag = require('etag')
+const toCanonical = require('rdf-dataset-ext/toCanonical.js')
+const preconditions = require('express-preconditions')

 function factory (api) {
   const router = new Router()

+  const apiEtag = etag(toCanonical(api.dataset))
+
   router.use((req, res, next) => {
     res.setLink(api.term.value, 'http://www.w3.org/ns/hydra/core#apiDocumentation')

     next()
   })

-  router.get(api.path, (req, res, next) => {
+  router.get(api.path,
+    preconditions({
+      async stateAsync() {
+        return {
+          etag: apiEtag
+        }
+      }
+    }),
+    (req, res, next) => {
+
+    res.setHeader('cache-control', 'max-age=30, stale-while-revalidate=30')
+    res.setHeader('etag', apiEtag)
     res.dataset(api.dataset).catch(next)
   })
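For readers unfamiliar with ETag revalidation, the mechanism the patch relies on can be sketched with Node's standard library alone. This is an illustration only, not how `etag` or `express-preconditions` are implemented: `computeEtag` and `revalidate` are hypothetical helpers, and the canonical string stands in for the output of `toCanonical(api.dataset)`.

```javascript
const crypto = require('crypto')

// Hypothetical helper: derive a strong ETag from a canonical serialization.
// Equal input always yields an equal tag, so the tag changes only when the content does.
function computeEtag (canonical) {
  const hash = crypto.createHash('sha1').update(canonical).digest('base64')
  return `"${hash}"`
}

// Hypothetical helper: pick the status code for a GET request,
// given the client's If-None-Match header and the current ETag.
// If-None-Match uses weak comparison, so a leading W/ is ignored.
function revalidate (ifNoneMatch, currentEtag) {
  if (!ifNoneMatch) return 200

  const strip = tag => tag.trim().replace(/^W\//, '')
  const matches = ifNoneMatch
    .split(',')
    .some(tag => tag.trim() === '*' || strip(tag) === strip(currentEtag))

  return matches ? 304 : 200
}

// Stand-in for the canonical N-Quads of the API Documentation dataset.
const sampleEtag = computeEtag('<http://example.com/api> <http://example.com/p> "o" .\n')
```

With `max-age=30`, a cache can reuse the response for 30 seconds. After that it sends `If-None-Match` and, as long as the documentation has not changed, receives a body-less 304 instead of the full set of triples.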

There is no single way to configure caching, and some APIs may choose not to cache at all. I was thinking that maybe hydra-box could introduce extension points to plug in middleware before the get(api.path) handler? Something like:

-function factory (api) {
+function factory (api, ...beforeApi) {

-  router.get(api.path, (req, res, next) => {
+  router.get(api.path, ...beforeApi, (req, res, next) => {
    res.dataset(api.dataset).catch(next)
  })
}

For the configuration above, I would provide the preconditions middleware and a second one that sets the cache-control and etag headers to my liking.
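A rough sketch of what that second middleware might look like follows. Everything here is an assumption for illustration: `cacheHeaders`, `runChain` and `fakeResponse` are made-up names, and `runChain` only imitates how a router would call the `...beforeApi` handlers ahead of the final one, so the example runs without Express.

```javascript
// Hypothetical middleware factory: sets whichever caching headers the API author chooses.
function cacheHeaders ({ cacheControl, etag }) {
  return (req, res, next) => {
    res.setHeader('cache-control', cacheControl)
    if (etag) {
      res.setHeader('etag', etag)
    }
    next()
  }
}

// Stand-in for a router: calls each handler in order, advancing when next() is called.
function runChain (handlers, req, res) {
  let index = 0
  const next = (err) => {
    if (err) throw err
    const handler = handlers[index++]
    if (handler) handler(req, res, next)
  }
  next()
}

// Minimal fake response that records headers and the body.
function fakeResponse () {
  const headers = {}
  return {
    headers,
    body: null,
    setHeader (name, value) { headers[name.toLowerCase()] = value },
    send (body) { this.body = body }
  }
}

// The extra middleware that would be passed as `...beforeApi`.
const beforeApi = [
  cacheHeaders({ cacheControl: 'max-age=30, stale-while-revalidate=30', etag: '"abc"' })
]

// The final handler stands in for `res.dataset(api.dataset)`.
const res = fakeResponse()
runChain([...beforeApi, (req, res) => res.send('api documentation')], {}, res)
```

The point of the design is that hydra-box would stay agnostic about caching: an API that wants no caching passes nothing, and the default handler behaves exactly as it does today.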