Draft for new configuration format

webp-sh / webp_server_go

Go version of WebP Server. A tool that will serve your JPG/PNG/BMP/SVGs as WebP/AVIF format with compression, on-the-fly.

https://docs.webp.sh

GNU General Public License v3.0

1.83k stars 175 forks source link

Draft for new configuration format #217

Closed n0vad3v closed 1 year ago

n0vad3v commented 1 year ago

Let's discuss for a newer configuration format, currently our config.json looks like this:

{
  "HOST": "127.0.0.1",
  "PORT": "3333",
  "QUALITY": "80",
  "IMG_PATH": "/path/to/pics",
  "EXHAUST_PATH": "/path/to/exhaust",
  "ALLOWED_TYPES": ["jpg","png","jpeg","bmp"],
  "ENABLE_AVIF": false,
  "ENABLE_EXTRA_PARAMS": false
}

However, there are more features that might need configuration, such as below:

Limit global concurrent convert count by https://github.com/webp-sh/webp_server_go/issues/75, maybe we should add CONCURRENT_COUNT to it?
Add Multiple backend in Proxy Mode by https://github.com/webp-sh/webp_server_go/issues/206
Maybe we can support multiple path descibed as https://docs.webp.sh/usage/multipath/
Will we need to add a MAX_EXHAUST_SIZE to limit the max exhaust directory size, related issue: https://github.com/webp-sh/webp_server_go/issues/214
Maybe we can use gocache wrapper instead to use other caching services in the future more easily, by @bugfest in https://github.com/webp-sh/webp_server_go/pull/216#issuecomment-1565649006

For multiple path and multiple backend part, I'm personally in favor of pattern looks like this:

{
  "HOST": "127.0.0.1",
  "PORT": "3333",
  "IMG_PATH": "/opt/pics",
  "EXHAUST_PATH": "/path/to/exhaust",
  "IMG_MAP": {
    "/i": "/home/ubuntu/pic",
    "/img": "/home/ubuntu/pic2",
    "/": "/home/pic1",
    "www.example.com": "http://origin.example.com:8000",
  },
  "ALLOWED_TYPES": ["jpg","png","jpeg"],
  "ENABLE_AVIF": false,
  "ENABLE_EXTRA_PARAMS": false
}

As this provides more backward compatabilities to previous versions, however there's still a need that given the config above, what will be the users' case when using this, like:

How will user write reverse proxy config to make both local path and remote path work together
How to determine whether a key in IMG_MAP is local path or remote path
Will WebP Server Go need to handle TLS when key in IMG_MAP is started with https:// and users are using it listening directly on port 443/80 to serve traffic.

However we may still need to discuss whether those above are real user scenarios.

bugfest commented 1 year ago

I agree using a collapsed IMG_MAP map to configure both modes (direct & proxy) might simpler way for the user to write configurations. If this is the preferred way, then I suggest using the scheme prefix + domain + port + [uri] scheme://my.domain.tld:port[/uri] to represent proxy mode. Additionally, when using only the uri portion, we assume it's a local path. We might add some checks during the start phase to ensure all these local paths (relative to the current work directory or absolute) exist.

As webp_server_go is not handling SSL/TLS termination, we could safely assume the key entry must start by http:// and port has to match the one defined at PORT. This is quite close to the current way to identify proxy mode, so might be easier for users to transition to this model. I'd also start discouraging users to use IMG_PATH to setup proxy mode with a future deprecation warning, e.g.: https://github.com/webp-sh/webp_server_go/pull/207/files#diff-bd900498995223167f47bb6639ed2b46cc5ee16446a026ad46a2a02c8d96651cR62-R64

In summary, a valid config might look like:

{
  "HOST": "127.0.0.1",
  "PORT": "3333",
  "IMG_PATH": "/opt/pics",
  "EXHAUST_PATH": "/path/to/exhaust",
  "IMG_MAP": {
    "/i": "/home/ubuntu/pic",
    "/img": "/home/ubuntu/pic2",
    "/": "/home/pic1",
    "test": "/home/test", # Assets at /home/test served at http://localhost:3333/test
    "www.example.com": "http://origin.example.com:8000", # Assets at http://localhost:3333/www.example.com proxied to http://origin.example.com:8000
    "http://www.example.com:3333": "http://origin.example.com:8000", # Assets at http://www.example.com:3333 proxied to http://origin.example.com:8000
  },
  "ALLOWED_TYPES": ["jpg","png","jpeg"],
  "ENABLE_AVIF": false,
  "ENABLE_EXTRA_PARAMS": false
}

I also want to take the opportunitity to raise a question about the map precedence. Given the following IMG_MAP:

{
    "/img": "/home/ubuntu/pic2",
    "/img/1": "/home/pic3"
}

where should http://<host:port>/img/1/2.jpg be looked for? /home/ubuntu/pic2/1/2.jpg or /home/pic3/2.jpg?
where should http://<host:port>/img/10.jpg be looked for? /home/ubuntu/pic2/1.jpg or /home/pic3/0.jpg?

bugfest commented 1 year ago

Answering myself to the path question (and sorry for the noise if this is not 100% config related) we can:

assume all the keys have to be existing directories
if more than one rule/path applies, the most specific wins

So that, given:

{
    "/img": "/home/ubuntu/pic2",
    "/img/1": "/home/pic3"
}

http://<host:port>/img/1/2.jpg matches both /img/ and /img/1/ uris. /img/1/ is the most specific and wins, so that we serve /home/pic3/2.jpg
http://<host:port>/img/10.jpg matches /img/ uri, we serve /home/ubuntu/pic2/10.jpg

n0vad3v commented 1 year ago

I agree with the [the most specific wins] policy and the config format you suggested, I have some other concerns:

Given "http://www.example.com:3333": "http://origin.example.com:8000", # Assets at http://www.example.com:3333 proxied to http://origin.example.com:8000 line, how should user understand the :3333 here and "PORT": "3333" above
Now we have only "EXHAUST_PATH": "/path/to/exhaust", if using multiple host/image_path, how should we design this EXHAUST_PATH to make sure the converted files didn't collide

bugfest commented 1 year ago

* Given `"http://www.example.com:3333": "http://origin.example.com:8000", # Assets at http://www.example.com:3333 proxied to http://origin.example.com:8000` line, how should user understand the `:3333` here and `"PORT": "3333"` above

We can make the port optional so we can let users use http://www.example.com in the config and we assume it can be translated as http://www.example.com[:3333] given that currently only one port is allowed.

* Now we have only `"EXHAUST_PATH": "/path/to/exhaust"`, if using multiple host/image_path, how should we design this `EXHAUST_PATH` to make sure the converted files didn't collide

In my original PR I suggested creating a subdirectory with the hostname string: https://github.com/webp-sh/webp_server_go/pull/207/files#diff-d1e1aef472c68045d1f3c4f68a932e458699037baa4e68d5ca07629d4792bee0R150

Furthermore, we can store all these optimized assets using hashes so that the URI characters are not a problem in the filesystem: https://github.com/webp-sh/webp_server_go/pull/207/files#diff-d1e1aef472c68045d1f3c4f68a932e458699037baa4e68d5ca07629d4792bee0R146-R148

n0vad3v commented 1 year ago

Furthermore, we can store all these optimized assets using hashes so that the URI characters are not a problem in the filesystem

Agree with it, but currently we are only hashing the reqURI:

// "mypic/123.jpg?someother=200&somebugs=200" -> 9c415345dd2456a20a43a1318fccab904a9d1257a305c3a563849c2283b78f8f
reqURIwithQueryHash := HashPath(reqURIwithQuery)

If we are serving multiple hostname, how should we arrange exhaust directory.

Currently I have an idea for the structure for it, suppose we have config like this:

{
  "HOST": "127.0.0.1",
  "PORT": "3333",
  "IMG_PATH": "/opt/pics",
  "EXHAUST_PATH": "/path/to/exhaust",
  "IMG_MAP": {
    "/i": "/home/ubuntu/pic",
    "/img": "/home/ubuntu/pic2",
    "/": "/home/pic1",
    "test": "/home/test",
    "www.example.com": "http://origin.example.com:8000",
    "http://www.example.com": "http://origin.example.com:8000",
  },
  "ALLOWED_TYPES": ["jpg","png","jpeg"],
  "ENABLE_AVIF": false,
  "ENABLE_EXTRA_PARAMS": false
}

For all the non-proxy part, all the exhaust files goes under /path/to/exhaust, like /path/to/exhaust/9c415345dd2456a20a43a1318fccab904a9d1257a305c3a563849c2283b78f8f, since we are hashing them, should have no collision here.
For proxy mode like www.example.com, the exhaust files are under /path/to/exhaust/origin.example.com/9c415345dd2456a20a43a1318fccab904a9d1257a305c3a563849c2283b78f8f-12345, the 12345 is an example the remote Etag hash. (And this is like the current https://webp.se/ working mechanism, which is a WebP Server Go Fork that supports multiple backends and works purely for proxy mode)
For other hostname like "http://www.example.com", since they have the same proxy origin as "www.example.com": "http://origin.example.com:8000",, it will share the same exhaust directory.

bugfest commented 1 year ago

* For all the non-proxy part, all the exhaust files goes under `/path/to/exhaust`, like `/path/to/exhaust/9c415345dd2456a20a43a1318fccab904a9d1257a305c3a563849c2283b78f8f`, since we are hashing them, should have no collision here.

agree

* For proxy mode like `www.example.com`, the exhaust files are under `/path/to/exhaust/origin.example.com/9c415345dd2456a20a43a1318fccab904a9d1257a305c3a563849c2283b78f8f-12345`, the `12345` is an example the remote Etag hash. (And this is like the current https://webp.se/ working mechanism, which is a WebP Server Go Fork that supports multiple backends and works purely for proxy mode)

* For other hostname like `"http://www.example.com"`, since they have the same proxy origin as ` "www.example.com": "http://origin.example.com:8000",`, it will share the same exhaust directory.

Partially agree, we might need to add scheme and port:

...
"IMG_MAP": {
    "http://www.example0.com": "http://origin.example.com:8001",
    "http://www.example1.com": "http://origin.example.com:8001",
    "http://www.example2.com": "https://origin.example.com:8002",
  },
...

as even being hosted in the same domain the server behind the port might expose completely different content.

/path/to/exhaust/http_origin.example.com_8001/<hashed-cache-file>
/path/to/exhaust/https_origin.example.com_8002/<hashed-cache-file>

If we extract the host from the key instead, we ensure they're independent in the cache. Also ensures if the origin provides different content based on the incoming Host headers

/path/to/exhaust/www.example0.com/<hashed-cache-file>
/path/to/exhaust/www.example1.com/<hashed-cache-file>
/path/to/exhaust/www.example2.com/<hashed-cache-file>

BennyThink commented 1 year ago

{
  "HOST": "127.0.0.1",
  "PORT": "3333",
  "IMG_PATH": "/opt/pics",
  "EXHAUST_PATH": "/path/to/exhaust",
  "IMG_MAP": {
    "/i": "/home/ubuntu/pic",
    "/img": "/home/ubuntu/pic2",
    "/": "/home/pic1",
    "test": "/home/test",
    "www.example.com": "http://origin.example.com:8000",
    "http://www.example.com": "http://origin.example.com:8000",
  },
  "ALLOWED_TYPES": ["jpg","png","jpeg"],
  "ENABLE_AVIF": false,
  "ENABLE_EXTRA_PARAMS": false
}

This seems good to me. In every key of IMG_MAP, if it starts with http(s) we would assume it's proxyMode and the value must be pointed to a source origin server. Otherwise we would assume it's local mode, request starts with this key will be mapped to corresponding source dir.

One more thing, in order to avoid overlay2 or other file system max filename limit, we should consider hash, and put every key/value in different but corresponding folder:

hash converted webp/avif, i.e. exhaust/test/hash("test/img/2.timestamp.webp")
hash remote-raw image file, i.e. remote-raw/example1.com/hash("wp-content/1.jpg")

Do we need to implement this before or after refactor to package mode?

bugfest commented 1 year ago

One more thing, in order to avoid overlay2 or other file system max filename limit, we should consider hash, and put every key/value in different but corresponding folder:
1. hash converted webp/avif, i.e. `exhaust/test/hash("test/img/2.timestamp.webp")`

2. hash remote-raw image file, i.e. `remote-raw/example1.com/hash("wp-content/1.jpg")`

Is there any specific value on having the images at exhaust in a format readable by the host/user? I'm exploring https://github.com/dgraph-io/badger to keep track of the remote images metadata and we could use it easily to save the blobs there. If that'd be an option we could get rid of all these non-ACID and FS limitations all at once.

bugfest commented 1 year ago

Also, I'm not convinced about keeping the directory as it might lead to security problems: 1) Opens the possibility of directory transversal attacks 2) If we parse + clean the URI, we might remove characters that the remote system requires to provide a valid response. This would complicate troubleshooting of proxy mode

If the only reason is to keep a reasonable amount of files per directory, we could use the N-first characters to create file buckets so that, e.g: with N=4:

a2e9897a617d33aabcbe55d8bf6b087acbc97af4
a2e9d460bb3c7d98845187c716a30db81c44b615
a9ea3856de277e671f27d2974f7dcb98cefed4d9
0e0334129b1bb45902add335eca175730f35d0d4
16b552347e02010d4c1d8cfda3f8acbb14a01719
16b5e579684c4fd1d474d08c326a42a4a0a2a588
16b50aa7743c014c1a1d5aa0e5f2aa33d3ae8f3b

Can be stored as:

a2e9/
  a2e9897a617d33aabcbe55d8bf6b087acbc97af4
  a2e9d460bb3c7d98845187c716a30db81c44b615
a9ea/
  a9ea3856de277e671f27d2974f7dcb98cefed4d9
0e03/
  0e0334129b1bb45902add335eca175730f35d0d4
16b5/
  16b552347e02010d4c1d8cfda3f8acbb14a01719
  16b5e579684c4fd1d474d08c326a42a4a0a2a588
  16b50aa7743c014c1a1d5aa0e5f2aa33d3ae8f3b

bugfest commented 1 year ago

Hi @n0vad3v, @BennyThink - Can you review my proposal of not keeping the dir structure from the URI? https://github.com/webp-sh/webp_server_go/issues/217#issuecomment-1566203968

n0vad3v commented 1 year ago

@bugfest Sorry we've missed this issue, recently I and @BennyThink were discussing the exhaust file format,

any specific value on having the images at exhaust in a format readable by the host/user

Indeed, that might be not necessary to keep the exhaust path files human readable, and making the filename unify (both in proxy-mode and non-proxy-mode) does make sense.

we could use the N-first characters to create file buckets so that, e.g: with N=4:

I personally don't think it's necessary, just 'flatten' all the path should be OK.

Current situation (as of version 0.9.3)

Non Proxy Mode

Origin image at ./pics/picsa/123.jpg
Converted ./exhaust/picsa/123.1677986846.webp_width=0&height=0

Proxy Mode

Origin image at https://docs.webp.sh/images/123.jpg
Downloaded image at ./remote-raw/9471b9724d0122a4-b8f999ab5acd69d03f5e904b1b84eb79210536
Converted ./exhaust/images/123.1677986846.webp_width=0&height=0

b8f999ab5acd69d03f5e904b1b84eb79210536 is combined from server response of etag and length, b8f999ab5acd69d03f5e904b1b84eb79 is etag and 210536 is length. 9471b9724d0122a4 is hash content of https://docs.webp.sh/images/123.jpg.

So.. the new solution should have the following feature:

Unified converted image path format for both proxy mode and non-proxy mode.
Be able to check if original image has changed in non-proxy mode (currently it's done by checking timestamp, encoded in exhaust filename like 123.1677986846.webp

Hmmm, I think we might need some time to propose a good solution for this. 🤔 (Ideas are welcomed.)

bugfest commented 1 year ago

Hi @n0vad3v I still have some concerns when using parts of the URI in the filesystem. The most notorious case is when the file name contains scape characters interpreted or stripped by the OS, e.g. "+" sign. I found myself into this problem again while rewriting PR #226 code.

I've been thinking into the metadata tracking problem (etag, timestamp, cache/must-revalidate headers) for a while now. I believe adding more metadata to the filename in the remote-raw and exhaust folders cannot be easily extended and would require a redesign. I stumble upon these ideas:

Creating a small metadata DB in memory that we could also easily serialize into the file system. Adding keys to the struct and updates might be complex or require to wipe the DB.
Use some in-disk (sqlite?) with DDL/DML support to store and query this data.
Use an external key/value DB like memcache to store this data

Another option is to drop this metadata tracking and just fetch the image everytime by default. We can then move this complexity to a different system to track it: e.g. Varnish. Webp_server_go will use varnish as remote and varnish will decide when to retrieve the asset again or provide a cached version.

This last option, might be a bit controversial at first, but will allow us to reduce the complexity of the implementation a lot.

BennyThink commented 1 year ago

I don't really think involve external system(varnish) is a good idea because it would make the deployment more dependent on other project, not lightweight anymore.

Also, saving in some kind of db is good. Yes it will solve many problems. However that means we need to maintain that db, I believe it's more convenience to find our solution in filesystem itself.

bugfest commented 1 year ago

Fair point. What about if we use a small meta file per asset?

We can save the structure in json so that can be easily parsed/patched. Example:

Non Proxy Mode

Origin image at ./pics/picsa/123.jpg
Converted to:
- ./exhaust/ca3e31219832cd956c2fecf05ac411f738c58712.webp
- ./exhaust/ca3e31219832cd956c2fecf05ac411f738c58712.avif
Metadata:
- ./metadb/ca3e31219832cd956c2fecf05ac411f738c58712.json

./exhaust/ca3e31219832cd956c2fecf05ac411f738c58712.json contents stores the file metadata:

{ 
"remote": false,
"url": "file://picsa/123.jpg",
"ts": 1677986846,
"etag": "abc...123",
"length": 123456,
"width": 0,
"height": 0
}

Proxy Mode

Origin image at https://docs.webp.sh/images/123.jpg
Downloaded image at ./remote-raw/b8f999ab5acd69d03f5e904b1b84eb79210536
Converted:
- ./exhaust/b8f999ab5acd69d03f5e904b1b84eb79210536.webp
- ./exhaust/b8f999ab5acd69d03f5e904b1b84eb79210536.avif
Metadata:
- ./metadb/b8f999ab5acd69d03f5e904b1b84eb79210536.json

./metadb/b8f999ab5acd69d03f5e904b1b84eb79210536.json contents stores the file metadata:

{ 
"remote": true,
"url": "https://docs.webp.sh/images/123.jpg",
"ts": 1677986846,
"etag": "b8f999ab5acd69d03f5e904b1b84eb79",
"length": 210536,
"width": 0,
"height": 0
}

The metadata file could be stored in any other place.

This way we can store all metadata without an external system. The Filesystem problems due to special characters go away and we have a flexible way to track all this information we can later use to implement cleanups, backups and stats.

BennyThink commented 1 year ago

that sounds good, we can save every thing we want in a json file.

BennyThink commented 1 year ago

Hello @bugfest , we have implemented metadata:

{
  "id": "2836a47938985a2c",
  "path": "/123.png?width=300\u0026height=300",
  "checksum": "5c1a310832cec0f5"
}

I guess it's time to aim for new configuration format.

bugfest commented 1 year ago

Great! I'll finish the lazy PR and I'll start with the multi-backend one 🎉