zulip / zulip

Zulip server and web application. Open-source team chat that helps teams stay productive and focused.
https://zulip.com
Apache License 2.0

database: Add denormalized Stream.subscriber_count field. #17101

Open · showell opened this issue 3 years ago

showell commented 3 years ago

This is a good issue for somebody with good knowledge of postgres triggers, Django ORM migrations, and the Zulip subscriber model.

Right now the only way to count the subscribers for a particular Zulip stream is to query the Subscription model. Not only is this query somewhat expensive at times (especially if you're trying to do it for multiple streams), it's also a bit error-prone if you don't know the difference between stream_id and recipient_id, or if you forget to check the is_active flags on both the subscription and the user.
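
Concretely, a correct count today has to look something like this sketch (given a Stream object `stream`; `active` is the subscription's is_active flag):

```python
from zerver.models import Subscription

# Join through stream.recipient_id (not stream.id!) and check both
# activity flags -- forgetting either gives a wrong count.
subscriber_count = Subscription.objects.filter(
    recipient_id=stream.recipient_id,
    active=True,
    user_profile__is_active=True,
).count()
```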

It would be nice if we just had a field on Stream called subscriber_count and a trigger on Subscription that updated Stream.subscriber_count any time a user added/removed/activated/deactivated a subscription. We would possibly also need a trigger to run when UserProfile.is_active flipped.

I believe we want actual database-side triggers, as opposed to Django post_save hooks, because the latter are subject to races.
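
As a sketch, the database-side version might be installed from a Django migration roughly like this (the function, trigger, and subscriber_count column names are hypothetical, and it recounts rather than increments so that concurrent changes converge):

```python
from django.db import migrations

TRIGGER_SQL = """
CREATE OR REPLACE FUNCTION zerver_recount_subscribers() RETURNS trigger AS $$
DECLARE
    rid integer;
BEGIN
    -- Works for INSERT, UPDATE, and DELETE rows (CASE is evaluated lazily).
    rid := CASE WHEN TG_OP = 'DELETE' THEN OLD.recipient_id
                ELSE NEW.recipient_id END;
    UPDATE zerver_stream
    SET subscriber_count = (
        SELECT COUNT(*) FROM zerver_subscription
        WHERE zerver_subscription.recipient_id = rid
          AND zerver_subscription.active
    )
    WHERE recipient_id = rid;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER zerver_subscriber_count_trigger
AFTER INSERT OR DELETE OR UPDATE OF active ON zerver_subscription
FOR EACH ROW EXECUTE FUNCTION zerver_recount_subscribers();
"""

class Migration(migrations.Migration):
    dependencies = [("zerver", "0001_initial")]  # placeholder dependency
    operations = [migrations.RunSQL(TRIGGER_SQL)]
```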

This work is probably valuable as its own project, since the extra column on Stream would take up very little database space, and it would be immediately useful for Zulip admins who wanted to do ad-hoc queries. But then the follow-up would be to send the subscriber counts to the webapp, so that the webapp never has to do its own O(N) work to calculate subscriber counts on a stream.

See https://chat.zulip.org/#narrow/stream/2-general/topic/bulky.20subscriber.20data/near/1104411 for more discussion.

Note that postgres has a feature for computed columns (https://www.postgresql.org/docs/12/ddl-generated-columns.html), but generated columns can only reference other columns of the same row, so they can't aggregate over Subscription and don't seem powerful enough for this use case.

showell commented 3 years ago

@mateuszmandera and I talked over PMs, and we think we probably want to do this mostly with post_save triggers, rather than doing a native database trigger. I'll let him elaborate. We also fear any increment/decrement approaches would be vulnerable to races, so we would probably just have our post_save do the O(N) computation every time somebody subscribes/unsubscribes.
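
That post_save shape might look roughly like this sketch (receiver name hypothetical; it assumes the proposed subscriber_count field and the active flag on Subscription):

```python
from django.db.models import Count, OuterRef, Subquery
from django.db.models.signals import post_save
from django.dispatch import receiver

from zerver.models import Stream, Subscription

@receiver(post_save, sender=Subscription)
def recount_stream_subscribers(sender, instance, **kwargs):
    # Full recount rather than increment/decrement: racing saves all
    # converge on the correct value once the last handler runs.
    active_count = Subquery(
        Subscription.objects.filter(
            recipient_id=OuterRef("recipient_id"), active=True
        )
        .order_by()
        .values("recipient_id")
        .annotate(n=Count("pk"))
        .values("n")
    )
    Stream.objects.filter(recipient_id=instance.recipient_id).update(
        subscriber_count=active_count
    )
```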

timabbott commented 3 years ago

Can you share the reasoning? I believe post_save triggers have the risk that they'll be executed in a different transaction -- perhaps you mean pre_save triggers? A native database trigger to increment/decrement I believe would also be safe from that sort of issue.

I should also mention another option is to just cache this detail in a stream cache, and flush that cache when subscribers change.

showell commented 3 years ago

I think @mateuszmandera and I are mostly worried about native database triggers from a maintenance standpoint. We're not sure that people will know to look for that code, and it may require more expertise to change the triggers than if we just did stuff in Django.

An important thing we would have to figure out if we went the native-database route is this: how do we invalidate the Stream cache when subscriber_count changes? We would probably have to have Django code know when subscriber info is changing and then invalidate the cached Stream entries accordingly. And then we have two mechanisms to maintain instead of one.

I'm not sure we even need a pre_save/post_save hook here. I'm more in favor of a simple approach where we have some helper function called update_subscriber_counts_for_stream_ids(stream_ids). We'd just call that in any codepath where we are subscribing/unsubscribing to a stream. Normally it would just be a single stream id, but we'd also call it when we deactivate users.

The function would delegate most of the work to postgres, so even things that are theoretically quadratic (e.g. updating all M streams for a deactivated user, each with its O(N) subscriber count) should be fast. And these updates would run inside a transaction.

If some obscure codepath like soft deactivation got overlooked, it wouldn't be the end of the world, as the next subscribe/unsubscribe event would fix the count with current info.

The query would be the SQL equivalent of this:


-- Sketch: recount subscribers for the affected streams in one statement.
UPDATE zerver_stream SET subscriber_count = (
    SELECT COUNT(*) FROM zerver_subscription
    WHERE recipient_id = zerver_stream.recipient_id AND active
) WHERE id IN %(stream_ids)s;

showell commented 3 years ago

And then the update function is basically:


sql = ...  # the recount UPDATE sketched above
execute(sql)
bulk_invalidate_streams_in_cache(stream_ids)
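
Fleshed out, the helper might look like the following sketch (UPDATE_SQL stands for the recount statement above, and bulk_invalidate_streams_in_cache is the hypothetical cache helper):

```python
from django.db import connection, transaction

def update_subscriber_counts_for_stream_ids(stream_ids):
    # Recount inside a transaction, then flush the cached Stream rows.
    with transaction.atomic(), connection.cursor() as cursor:
        # psycopg2 adapts a tuple to the (1, 2, 3) form that IN expects.
        cursor.execute(UPDATE_SQL, {"stream_ids": tuple(stream_ids)})
    bulk_invalidate_streams_in_cache(stream_ids)
```

Every subscribe/unsubscribe/deactivate codepath would then just call this with the affected stream ids.
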
timabbott commented 3 years ago

I think we can solve the maintenance problem by just documenting the triggers in comments by the field definition in models.py. The cache invalidation problem is a real concern, though of course solvable with Django code next to all the writes.

I think if we're going to do something not using a database trigger, we should consider using a cache rather than the database; cache seems like the right abstraction for "computed value that might be a bit stale".

mateuszmandera commented 3 years ago

The concern we had regarding the caching approach was that we send subscriber counts for a lot of streams on page load - so that would be a lot of cache queries.

On the triggers vs post_save question - if we're doing recount queries to update subscriber_count (as opposed to increment queries), then races aren't too big of a deal: the subscriber count can only be off for a short moment, until the last trigger (whether an actual database trigger or post_save) runs and sets the count to the correct value. So that might make the disadvantage of post_save negligible.

showell commented 3 years ago

I'm pretty anti-cache for this. Caches are good for single values, but we typically get subscriber counts in bulk, during page load, and different users have different sets of streams, so it's not a precise fit to have a big blob of subscriber counts that we cache (not to mention that it would churn somewhat frequently).

I think the database is the right abstraction for this. And I would still advocate the simple strategy of just having Django call update_subscriber_counts_for_stream_ids(stream_ids) in the appropriate places.

timabbott commented 3 years ago

> The concern we had regarding the caching approach was that we send subscriber counts for a lot of streams on page load - so that would be a lot of cache queries.

I don't understand that concern -- we can easily do bulk cache queries with generic_bulk_cached_fetch.
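
For instance, something like this sketch (the key format and the recount helper are made up for illustration, and the generic_bulk_cached_fetch argument names are from memory):

```python
from zerver.lib.cache import generic_bulk_cached_fetch

def subscriber_count_cache_key(stream_id: int) -> str:
    return f"stream_subscriber_count:{stream_id}"  # hypothetical key format

def recount_subscribers_for_streams(stream_ids):
    ...  # hypothetical: recount from the database for any cache misses

stream_ids = [1, 2, 3]  # the streams needed for the page load
counts = generic_bulk_cached_fetch(
    cache_key_function=subscriber_count_cache_key,
    query_function=recount_subscribers_for_streams,
    object_ids=stream_ids,
)
```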

It is relevant that we already fetch all the other subscribers to a stream when editing its subscriptions anyway, in order to send peer_add events; that means bulk_add|remove_subscriptions could update these fields with just the cheap database query to write to the Stream object. But we could just as easily do a cache write at that point instead.

(And then we'd need the somewhat more annoying work in the deactivate/reactivate user code paths)

andersk commented 3 years ago

(Serializing my thoughts from #backend.)

We should be able to make subscription count queries plenty fast enough without denormalization, using a database index. If I add this index:

--- a/zerver/models.py
+++ b/zerver/models.py
@@ -29,7 +29,7 @@ from django.contrib.auth.models import AbstractBaseUser, PermissionsMixin, UserM
 from django.core.exceptions import ValidationError
 from django.core.validators import MinLengthValidator, RegexValidator, URLValidator, validate_email
 from django.db import models, transaction
-from django.db.models import CASCADE, Manager, Q, Sum
+from django.db.models import CASCADE, Index, Manager, Q, Sum
 from django.db.models.query import QuerySet
 from django.db.models.signals import post_delete, post_save
 from django.utils.functional import Promise
@@ -2306,6 +2306,9 @@ class Subscription(models.Model):
     wildcard_mentions_notify: Optional[bool] = models.BooleanField(null=True, default=None)

     class Meta:
+        indexes = [
+            Index(fields=["recipient", "user_profile"]),
+        ]
         unique_together = ("user_profile", "recipient")

     def __str__(self) -> str:

and the corresponding Django-generated migration, then the single SQL query generated by this Django ORM code:

from django.db.models import Count

from zerver.models import Stream

for stream in (
    Stream.objects
    .annotate(subscriber_count=Count("recipient__subscription"))
    .all()
):
    print(stream.name, stream.subscriber_count)

is planned by PostgreSQL as follows:

zulip=> EXPLAIN SELECT "zerver_stream"."id", "zerver_stream"."name", "zerver_stream"."realm_id", "zerver_stream"."date_created", "zerver_stream"."deactivated", "zerver_stream"."description", "zerver_stream"."rendered_description", "zerver_stream"."recipient_id", "zerver_stream"."invite_only", "zerver_stream"."history_public_to_subscribers", "zerver_stream"."is_web_public", "zerver_stream"."stream_post_policy", "zerver_stream"."is_in_zephyr_realm", "zerver_stream"."email_token", "zerver_stream"."message_retention_days", "zerver_stream"."first_message_id", COUNT("zerver_subscription"."id") AS "subscriber_count" FROM "zerver_stream" LEFT OUTER JOIN "zerver_recipient" ON ("zerver_stream"."recipient_id" = "zerver_recipient"."id") LEFT OUTER JOIN "zerver_subscription" ON ("zerver_recipient"."id" = "zerver_subscription"."recipient_id") GROUP BY "zerver_stream"."id";
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=140.61..140.75 rows=14 width=811)
   Group Key: zerver_stream.id
   ->  Nested Loop Left Join  (cost=0.70..139.70 rows=182 width=807)
         ->  Nested Loop Left Join  (cost=0.29..105.38 rows=14 width=807)
               ->  Seq Scan on zerver_stream  (cost=0.00..1.14 rows=14 width=803)
               ->  Index Only Scan using zerver_recipient_pkey on zerver_recipient  (cost=0.29..7.45 rows=1 width=4)
                     Index Cond: (id = zerver_stream.recipient_id)
         ->  Index Scan using zerver_subs_recipie_2d5a8e_idx on zerver_subscription  (cost=0.42..1.92 rows=53 width=8)
               Index Cond: (zerver_recipient.id = recipient_id)
(9 rows)

which looks like it’s going to be fast (O(n log_b m) or something).

Tim points out that we may want to filter the counts for active users, which makes this a bit more complicated -- I think the simplest approach would be to copy the is_active column into the Subscription table and use that as an index condition.
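
With such a column in place (is_user_active is a hypothetical name for it), the filter could live entirely in the indexed Subscription table, e.g.:

```python
from django.db.models import Count, Q

from zerver.models import Stream

# Both flags live on Subscription, so the count stays one indexed query.
streams = Stream.objects.annotate(
    subscriber_count=Count(
        "recipient__subscription",
        filter=Q(
            recipient__subscription__active=True,
            recipient__subscription__is_user_active=True,
        ),
    )
)
```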

timabbott commented 3 years ago

Having caught up on the relevant threads, the next step in this direction is https://github.com/zulip/zulip/pull/17214.

(We'll probably end up doing that + appropriate indexes rather than a new field, I think).