Closed n0vad3v closed 1 year ago
I agree using a collapsed IMG_MAP
map to configure both modes (direct & proxy) might simpler way for the user to write configurations. If this is the preferred way, then I suggest using the scheme prefix + domain + port + [uri] scheme://my.domain.tld:port[/uri]
to represent proxy mode. Additionally, when using only the uri
portion, we assume it's a local path. We might add some checks during the start phase to ensure all these local paths (relative to the current work directory or absolute) exist.
As webp_server_go
is not handling SSL/TLS termination, we could safely assume the key entry must start by http://
and port has to match the one defined at PORT
. This is quite close to the current way to identify proxy mode, so might be easier for users to transition to this model. I'd also start discouraging users to use IMG_PATH
to setup proxy mode with a future deprecation warning, e.g.: https://github.com/webp-sh/webp_server_go/pull/207/files#diff-bd900498995223167f47bb6639ed2b46cc5ee16446a026ad46a2a02c8d96651cR62-R64
In summary, a valid config might look like:
{
"HOST": "127.0.0.1",
"PORT": "3333",
"IMG_PATH": "/opt/pics",
"EXHAUST_PATH": "/path/to/exhaust",
"IMG_MAP": {
"/i": "/home/ubuntu/pic",
"/img": "/home/ubuntu/pic2",
"/": "/home/pic1",
"test": "/home/test", # Assets at /home/test served at http://localhost:3333/test
"www.example.com": "http://origin.example.com:8000", # Assets at http://localhost:3333/www.example.com proxied to http://origin.example.com:8000
"http://www.example.com:3333": "http://origin.example.com:8000", # Assets at http://www.example.com:3333 proxied to http://origin.example.com:8000
},
"ALLOWED_TYPES": ["jpg","png","jpeg"],
"ENABLE_AVIF": false,
"ENABLE_EXTRA_PARAMS": false
}
I also want to take the opportunitity to raise a question about the map precedence. Given the following IMG_MAP
:
{
"/img": "/home/ubuntu/pic2",
"/img/1": "/home/pic3"
}
http://<host:port>/img/1/2.jpg
be looked for? /home/ubuntu/pic2/1/2.jpg
or /home/pic3/2.jpg
?http://<host:port>/img/10.jpg
be looked for? /home/ubuntu/pic2/1.jpg
or /home/pic3/0.jpg
?Answering myself to the path question (and sorry for the noise if this is not 100% config related) we can:
So that, given:
{
"/img": "/home/ubuntu/pic2",
"/img/1": "/home/pic3"
}
http://<host:port>/img/1/2.jpg
matches both /img/
and /img/1/
uris. /img/1/
is the most specific and wins, so that we serve /home/pic3/2.jpg
http://<host:port>/img/10.jpg
matches /img/
uri, we serve /home/ubuntu/pic2/10.jpg
I agree with the [the most specific wins] policy and the config format you suggested, I have some other concerns:
"http://www.example.com:3333": "http://origin.example.com:8000", # Assets at http://www.example.com:3333 proxied to http://origin.example.com:8000
line, how should user understand the :3333
here and "PORT": "3333"
above"EXHAUST_PATH": "/path/to/exhaust"
, if using multiple host/image_path, how should we design this EXHAUST_PATH
to make sure the converted files didn't collide* Given `"http://www.example.com:3333": "http://origin.example.com:8000", # Assets at http://www.example.com:3333 proxied to http://origin.example.com:8000` line, how should user understand the `:3333` here and `"PORT": "3333"` above
We can make the port optional so we can let users use http://www.example.com
in the config and we assume it can be translated as http://www.example.com[:3333]
given that currently only one port is allowed.
* Now we have only `"EXHAUST_PATH": "/path/to/exhaust"`, if using multiple host/image_path, how should we design this `EXHAUST_PATH` to make sure the converted files didn't collide
In my original PR I suggested creating a subdirectory with the hostname string: https://github.com/webp-sh/webp_server_go/pull/207/files#diff-d1e1aef472c68045d1f3c4f68a932e458699037baa4e68d5ca07629d4792bee0R150
Furthermore, we can store all these optimized assets using hashes so that the URI characters are not a problem in the filesystem: https://github.com/webp-sh/webp_server_go/pull/207/files#diff-d1e1aef472c68045d1f3c4f68a932e458699037baa4e68d5ca07629d4792bee0R146-R148
Furthermore, we can store all these optimized assets using hashes so that the URI characters are not a problem in the filesystem
Agree with it, but currently we are only hashing the reqURI:
// "mypic/123.jpg?someother=200&somebugs=200" -> 9c415345dd2456a20a43a1318fccab904a9d1257a305c3a563849c2283b78f8f
reqURIwithQueryHash := HashPath(reqURIwithQuery)
If we are serving multiple hostname, how should we arrange exhaust directory.
Currently I have an idea for the structure for it, suppose we have config like this:
{
"HOST": "127.0.0.1",
"PORT": "3333",
"IMG_PATH": "/opt/pics",
"EXHAUST_PATH": "/path/to/exhaust",
"IMG_MAP": {
"/i": "/home/ubuntu/pic",
"/img": "/home/ubuntu/pic2",
"/": "/home/pic1",
"test": "/home/test",
"www.example.com": "http://origin.example.com:8000",
"http://www.example.com": "http://origin.example.com:8000",
},
"ALLOWED_TYPES": ["jpg","png","jpeg"],
"ENABLE_AVIF": false,
"ENABLE_EXTRA_PARAMS": false
}
/path/to/exhaust
, like /path/to/exhaust/9c415345dd2456a20a43a1318fccab904a9d1257a305c3a563849c2283b78f8f
, since we are hashing them, should have no collision here.www.example.com
, the exhaust files are under /path/to/exhaust/origin.example.com/9c415345dd2456a20a43a1318fccab904a9d1257a305c3a563849c2283b78f8f-12345
, the 12345
is an example the remote Etag hash. (And this is like the current https://webp.se/ working mechanism, which is a WebP Server Go Fork that supports multiple backends and works purely for proxy mode)"http://www.example.com"
, since they have the same proxy origin as "www.example.com": "http://origin.example.com:8000",
, it will share the same exhaust directory.* For all the non-proxy part, all the exhaust files goes under `/path/to/exhaust`, like `/path/to/exhaust/9c415345dd2456a20a43a1318fccab904a9d1257a305c3a563849c2283b78f8f`, since we are hashing them, should have no collision here.
agree
* For proxy mode like `www.example.com`, the exhaust files are under `/path/to/exhaust/origin.example.com/9c415345dd2456a20a43a1318fccab904a9d1257a305c3a563849c2283b78f8f-12345`, the `12345` is an example the remote Etag hash. (And this is like the current https://webp.se/ working mechanism, which is a WebP Server Go Fork that supports multiple backends and works purely for proxy mode) * For other hostname like `"http://www.example.com"`, since they have the same proxy origin as ` "www.example.com": "http://origin.example.com:8000",`, it will share the same exhaust directory.
Partially agree, we might need to add scheme and port:
...
"IMG_MAP": {
"http://www.example0.com": "http://origin.example.com:8001",
"http://www.example1.com": "http://origin.example.com:8001",
"http://www.example2.com": "https://origin.example.com:8002",
},
...
as even being hosted in the same domain the server behind the port might expose completely different content.
/path/to/exhaust/http_origin.example.com_8001/<hashed-cache-file>
/path/to/exhaust/https_origin.example.com_8002/<hashed-cache-file>
If we extract the host from the key instead, we ensure they're independent in the cache. Also ensures if the origin provides different content based on the incoming Host
headers
/path/to/exhaust/www.example0.com/<hashed-cache-file>
/path/to/exhaust/www.example1.com/<hashed-cache-file>
/path/to/exhaust/www.example2.com/<hashed-cache-file>
{
"HOST": "127.0.0.1",
"PORT": "3333",
"IMG_PATH": "/opt/pics",
"EXHAUST_PATH": "/path/to/exhaust",
"IMG_MAP": {
"/i": "/home/ubuntu/pic",
"/img": "/home/ubuntu/pic2",
"/": "/home/pic1",
"test": "/home/test",
"www.example.com": "http://origin.example.com:8000",
"http://www.example.com": "http://origin.example.com:8000",
},
"ALLOWED_TYPES": ["jpg","png","jpeg"],
"ENABLE_AVIF": false,
"ENABLE_EXTRA_PARAMS": false
}
This seems good to me. In every key of IMG_MAP
, if it starts with http(s)
we would assume it's proxyMode and the value must be pointed to a source origin server. Otherwise we would assume it's local mode, request starts with this key will be mapped to corresponding source dir.
One more thing, in order to avoid overlay2 or other file system max filename limit, we should consider hash, and put every key/value in different but corresponding folder:
exhaust/test/hash("test/img/2.timestamp.webp")
remote-raw/example1.com/hash("wp-content/1.jpg")
Do we need to implement this before or after refactor to package mode?
One more thing, in order to avoid overlay2 or other file system max filename limit, we should consider hash, and put every key/value in different but corresponding folder:
1. hash converted webp/avif, i.e. `exhaust/test/hash("test/img/2.timestamp.webp")` 2. hash remote-raw image file, i.e. `remote-raw/example1.com/hash("wp-content/1.jpg")`
Is there any specific value on having the images at exhaust in a format readable by the host/user? I'm exploring https://github.com/dgraph-io/badger to keep track of the remote images metadata and we could use it easily to save the blobs there. If that'd be an option we could get rid of all these non-ACID and FS limitations all at once.
Also, I'm not convinced about keeping the directory as it might lead to security problems: 1) Opens the possibility of directory transversal attacks 2) If we parse + clean the URI, we might remove characters that the remote system requires to provide a valid response. This would complicate troubleshooting of proxy mode
If the only reason is to keep a reasonable amount of files per directory, we could use the N-first characters to create file buckets so that, e.g: with N=4:
a2e9897a617d33aabcbe55d8bf6b087acbc97af4
a2e9d460bb3c7d98845187c716a30db81c44b615
a9ea3856de277e671f27d2974f7dcb98cefed4d9
0e0334129b1bb45902add335eca175730f35d0d4
16b552347e02010d4c1d8cfda3f8acbb14a01719
16b5e579684c4fd1d474d08c326a42a4a0a2a588
16b50aa7743c014c1a1d5aa0e5f2aa33d3ae8f3b
Can be stored as:
a2e9/
a2e9897a617d33aabcbe55d8bf6b087acbc97af4
a2e9d460bb3c7d98845187c716a30db81c44b615
a9ea/
a9ea3856de277e671f27d2974f7dcb98cefed4d9
0e03/
0e0334129b1bb45902add335eca175730f35d0d4
16b5/
16b552347e02010d4c1d8cfda3f8acbb14a01719
16b5e579684c4fd1d474d08c326a42a4a0a2a588
16b50aa7743c014c1a1d5aa0e5f2aa33d3ae8f3b
Hi @n0vad3v, @BennyThink - Can you review my proposal of not keeping the dir structure from the URI? https://github.com/webp-sh/webp_server_go/issues/217#issuecomment-1566203968
@bugfest Sorry we've missed this issue, recently I and @BennyThink were discussing the exhaust
file format,
any specific value on having the images at exhaust in a format readable by the host/user
Indeed, that might be not necessary to keep the exhaust path files human readable, and making the filename unify (both in proxy-mode and non-proxy-mode) does make sense.
we could use the N-first characters to create file buckets so that, e.g: with N=4:
I personally don't think it's necessary, just 'flatten' all the path should be OK.
Non Proxy Mode
./pics/picsa/123.jpg
./exhaust/picsa/123.1677986846.webp_width=0&height=0
Proxy Mode
https://docs.webp.sh/images/123.jpg
./remote-raw/9471b9724d0122a4-b8f999ab5acd69d03f5e904b1b84eb79210536
./exhaust/images/123.1677986846.webp_width=0&height=0
b8f999ab5acd69d03f5e904b1b84eb79210536
is combined from server response of etag and length, b8f999ab5acd69d03f5e904b1b84eb79
is etag and 210536
is length.
9471b9724d0122a4
is hash content of https://docs.webp.sh/images/123.jpg
.
So.. the new solution should have the following feature:
123.1677986846.webp
Hmmm, I think we might need some time to propose a good solution for this. 🤔 (Ideas are welcomed.)
Hi @n0vad3v I still have some concerns when using parts of the URI in the filesystem. The most notorious case is when the file name contains scape characters interpreted or stripped by the OS, e.g. "+" sign. I found myself into this problem again while rewriting PR #226 code.
I've been thinking into the metadata tracking problem (etag, timestamp, cache/must-revalidate headers) for a while now. I believe adding more metadata to the filename in the remote-raw and exhaust folders cannot be easily extended and would require a redesign. I stumble upon these ideas:
Another option is to drop this metadata tracking and just fetch the image everytime by default. We can then move this complexity to a different system to track it: e.g. Varnish. Webp_server_go will use varnish as remote and varnish will decide when to retrieve the asset again or provide a cached version.
This last option, might be a bit controversial at first, but will allow us to reduce the complexity of the implementation a lot.
I don't really think involve external system(varnish) is a good idea because it would make the deployment more dependent on other project, not lightweight anymore.
Also, saving in some kind of db is good. Yes it will solve many problems. However that means we need to maintain that db, I believe it's more convenience to find our solution in filesystem itself.
Fair point. What about if we use a small meta file per asset?
We can save the structure in json
so that can be easily parsed/patched. Example:
Non Proxy Mode
Origin image at ./pics/picsa/123.jpg
Converted to:
- ./exhaust/ca3e31219832cd956c2fecf05ac411f738c58712.webp
- ./exhaust/ca3e31219832cd956c2fecf05ac411f738c58712.avif
Metadata:
- ./metadb/ca3e31219832cd956c2fecf05ac411f738c58712.json
./exhaust/ca3e31219832cd956c2fecf05ac411f738c58712.json
contents stores the file metadata:
{
"remote": false,
"url": "file://picsa/123.jpg",
"ts": 1677986846,
"etag": "abc...123",
"length": 123456,
"width": 0,
"height": 0
}
Proxy Mode
Origin image at https://docs.webp.sh/images/123.jpg
Downloaded image at ./remote-raw/b8f999ab5acd69d03f5e904b1b84eb79210536
Converted:
- ./exhaust/b8f999ab5acd69d03f5e904b1b84eb79210536.webp
- ./exhaust/b8f999ab5acd69d03f5e904b1b84eb79210536.avif
Metadata:
- ./metadb/b8f999ab5acd69d03f5e904b1b84eb79210536.json
./metadb/b8f999ab5acd69d03f5e904b1b84eb79210536.json
contents stores the file metadata:
{
"remote": true,
"url": "https://docs.webp.sh/images/123.jpg",
"ts": 1677986846,
"etag": "b8f999ab5acd69d03f5e904b1b84eb79",
"length": 210536,
"width": 0,
"height": 0
}
The metadata file could be stored in any other place.
This way we can store all metadata without an external system. The Filesystem problems due to special characters go away and we have a flexible way to track all this information we can later use to implement cleanups, backups and stats.
that sounds good, we can save every thing we want in a json file.
Hello @bugfest , we have implemented metadata:
{
"id": "2836a47938985a2c",
"path": "/123.png?width=300\u0026height=300",
"checksum": "5c1a310832cec0f5"
}
I guess it's time to aim for new configuration format.
Great! I'll finish the lazy PR and I'll start with the multi-backend one 🎉
Let's discuss for a newer configuration format, currently our
config.json
looks like this:However, there are more features that might need configuration, such as below:
CONCURRENT_COUNT
to it?MAX_EXHAUST_SIZE
to limit the max exhaust directory size, related issue: https://github.com/webp-sh/webp_server_go/issues/214For multiple path and multiple backend part, I'm personally in favor of pattern looks like this:
As this provides more backward compatabilities to previous versions, however there's still a need that given the config above, what will be the users' case when using this, like:
https://
and users are using it listening directly on port 443/80 to serve traffic.However we may still need to discuss whether those above are real user scenarios.