wincent / masochist

⛓ Website infrastructure for over-engineers
MIT License

Cache more stuff in Redis #218

Open wincent opened 6 months ago

wincent commented 6 months ago

Whether or not I do this will depend on how far I go with:

but I think we're relying a wee bit too heavily on Git as a source of truth for some information in the current implementation. The original idea was:

So that's why we use Redis sorted sets: so we can do things like get an ordered list of blog posts (ordered by created-at), wiki articles (ordered by updated-at), or tags (ordered by count of tagged items), etc. But we end up underutilizing Redis, IMO, and going to Git to answer questions like "when was this object created and updated?" when we load content blobs into memcached. The idea is that the first look-up will be "slow" (but not too slow, because Git is fast), and subsequent look-ups will be lightning fast because of memcached. But I don't like that we only get away with this because the content history isn't too deep (thousands of commits instead of millions) and because we use pretty aggressive path-limiting.
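To make the sorted-set idea concrete, here's a minimal self-contained sketch. The key name is hypothetical (not the actual schema), and the Maps are a stand-in for Redis's ZADD/ZREVRANGE so the example runs without a server:

```javascript
// In-memory stand-in for the two sorted-set operations the index relies
// on (ZADD and ZREVRANGE in real Redis); key names are hypothetical.
const zsets = new Map();

function zadd(key, score, member) {
  if (!zsets.has(key)) zsets.set(key, new Map());
  zsets.get(key).set(member, score); // re-adding a member updates its score
}

function zrevrange(key, start, stop) {
  const entries = [...(zsets.get(key) ?? new Map())];
  entries.sort((a, b) => b[1] - a[1]); // highest score (newest) first
  return entries.slice(start, stop + 1).map(([member]) => member);
}

// At index time, score each post by its created-at timestamp...
zadd('blog:created-at', 1609459200, 'posts/new-year');
zadd('blog:created-at', 1577836800, 'posts/old-year');
zadd('blog:created-at', 1612137600, 'posts/latest');

// ...so "the 10 newest posts" is a single index read, with no Git walk.
console.log(zrevrange('blog:created-at', 0, 9));
// → ['posts/latest', 'posts/new-year', 'posts/old-year']
```

The point of the shape is that ordering questions ("newest posts", "most-used tags") become one O(log N) range read against the index rather than a traversal of repository history.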

I know that caching more things forces you to come to grips with one of the two hard problems (cache invalidation), but the thing is, we're already relying on Redis and on the cache invalidation being robust. I think we should start treating Redis not just as an index for some things, but as an index for all things, and that includes created-at and updated-at metadata for all blobs. Really, the only things I want to hit Git for are reading contents (fast) and performing git-grep-based searches (also fast).
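A hedged sketch of what "Redis as the index for all things" could look like for blob metadata. The key layout is hypothetical, and the Map is a stand-in for Redis hashes (HSET/HGETALL in the real thing):

```javascript
// In-memory stand-in for Redis hashes; 'blob:<path>' is a hypothetical
// key layout, not the actual schema.
const hashes = new Map();

function hset(key, fields) {
  hashes.set(key, {...(hashes.get(key) ?? {}), ...fields});
}

function hgetall(key) {
  return hashes.get(key) ?? {};
}

// At index time, write each blob's metadata once...
hset('blob:wiki/Redis', {'created-at': 1577836800, 'updated-at': 1612137600});

// ...so request-time code never has to ask Git "when was this created?".
console.log(hgetall('blob:wiki/Redis'));
// → { 'created-at': 1577836800, 'updated-at': 1612137600 }
```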

It's true that in #180 I'll be going mostly static (apart from searches), so it's possible I won't do anything here and instead opt to finish that, but I wanted to jot it down anyway.

I think the other thing that's bothering me is all the rev-parse calls. Those are lightning fast too, but once again, it feels like something we should just ask Redis for, given that it already has to have an up-to-date index (and really, we only ever call rev-parse to aid us in answering some other question that we really should be answering by asking Redis...).
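The rev-parse replacement is simple in sketch form: record HEAD once per index run (a plain SET in real Redis), and have request-time code read that back instead of spawning a git process per question. Names here are hypothetical and the Map is a stand-in for Redis string keys:

```javascript
// Stand-in for Redis string keys; 'head' is a hypothetical key name.
const strings = new Map();

// The indexer calls this once, with the SHA it just indexed.
function recordIndexedHead(sha) {
  strings.set('head', sha);
}

// Request-time code asks the index, not `git rev-parse HEAD`.
function currentHead() {
  return strings.get('head');
}

recordIndexedHead('0123456789abcdef0123456789abcdef01234567');
console.log(currentHead());
// → '0123456789abcdef0123456789abcdef01234567'
```

A side benefit of this shape is that the answer is always the commit the index was built from, so readers can't observe a HEAD that's newer than the index.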

wincent commented 5 months ago

I'm thinking I can replace the complicated incremental indexing with simpler and faster full indexing. I only need to read the first few hundred bytes of each blob in HEAD to grab the current tags, and I can get all the date info with a single git-log call[^something]. Say I have 10k commits: I think the latter can be loaded and parsed in a single-digit number of seconds even on a wimpy EC2 instance, and if I have, say, 4k files, even with slow I/O (say, 100ms per file) we're still only talking about 400 seconds. In reality, I hope it's quite a bit faster than that, so the actual time for a full reindex might be as low as 40s, which would be fine.

Update: A quick test of this shows I can do a full re-index on my local machine using this approach in about 2.2s.

[^something]: Something like:

```
git log --format='%H %at %ct' --name-status --no-merges --no-renames -z --diff-filter=ADM -- content
```
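The `-z` (NUL-separated) output is fiddly to show inline, so here's a hedged sketch that parses the newline-separated variant of the same command (same `--format`, no `-z`) into per-path created-at/updated-at. It leans on the fact that `git log` emits commits newest-first: the first commit touching a path supplies updated-at (committer time, `%ct`), and the oldest `A` entry supplies created-at (author time, `%at`). Illustrative shape only, not the actual implementation:

```javascript
// Parse `git log --format='%H %at %ct' --name-status` output into a Map
// of path → {createdAt, updatedAt}.  Commits arrive newest-first.
function parseLog(output) {
  const info = new Map();
  let authoredAt = 0;
  let committedAt = 0;
  for (const line of output.split('\n')) {
    if (!line) continue;
    const header = line.match(/^([0-9a-f]{40}) (\d+) (\d+)$/);
    if (header) {
      authoredAt = Number(header[2]);
      committedAt = Number(header[3]);
      continue;
    }
    const [status, path] = line.split('\t');
    if (!info.has(path)) {
      // First (newest) commit touching this path fixes updated-at.
      info.set(path, {createdAt: null, updatedAt: committedAt});
    }
    if (status === 'A') {
      // Each older "A" overwrites, so the original creation wins.
      info.get(path).createdAt = authoredAt;
    }
  }
  return info;
}

// Two commits, newest first: a modification, then the original add.
const sample = [
  'a'.repeat(40) + ' 1700000200 1700000300',
  '',
  'M\tcontent/blog/post.md',
  '',
  'b'.repeat(40) + ' 1600000000 1600000100',
  '',
  'A\tcontent/blog/post.md',
].join('\n');

console.log(parseLog(sample));
// → Map { 'content/blog/post.md' => { createdAt: 1600000000, updatedAt: 1700000300 } }
```

One pass over this output is enough to populate the whole date index, which is what makes the full reindex cheap.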