Stats endpoints, some indices

Will close #9

See notes at:

https://gist.github.com/lfborjas/2fd2d237d5600b392231ae2c472017bb

[x] Initial queries for keyword and emotion stats
[x] Configuration and indexes for the above
[x] Model tests.
[ ] Endpoints (with sane min/max?)
[ ] Endpoint tests.
[x] further analyses of the dreamFilters query, and any necessary indexes.

Notes for the future:

Multilingual support?

To make the best use of our indexes, and be explicit, we're using english as the text search dictionary, and maintaining an index for title || ' ' || description in english. As per the documentation, we should be able to actually support multiple languages:

It is possible to set up more complex expression indexes wherein the configuration name is specified by another column, e.g.: CREATE INDEX pgweb_idx ON pgweb USING GIN (to_tsvector(config_name, body));

Which would mean that we'd need to store the language preference with the UserAccount entity, and propagate to each Dream row accordingly. The text search functions would then take this value as a parameter. This would be cool for our international users!

More indexes?

Currently, dream.user_id isn't indexed. Should we be adding that one? For my small test data, it doesn't seem like it's used (user size is ~3% of dream size) -- but it should speed up lookups by user id, probably?
Do we need a partial index on is_private? Not sure what the distribution would be, see https://hakibenita.com/sql-tricks-application-dba -- if there's fewer private dreams than public, maybe it's helpful?

A query like this would do the trick:

SELECT attname, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'dream' AND attname='is_private';

In my test data it of course yields:

 attname   | n_distinct | most_common_vals | most_common_freqs 
------------+------------+------------------+-------------------
 is_private |          2 | {f,t}            | {0.5,0.5}
(1 row)

The linked article also mentions the notion of correlation, and how it affects the choice of an index scan vs. a more expensive operation. A correlation of 1 means that fewer pages may be read, and an index scan will likely be used. Here's how it looks for my test data (which doesn't reflect real trends!)

undercurrent_dev=# SELECT tablename, attname, correlation from pg_stats where tablename = 'dream';
 tablename |   attname    | correlation  
-----------+--------------+--------------
 dream     | id           |   0.99999905
 dream     | is_lucid     |    0.5080458
 dream     | user_id      | -0.020969613
 dream     | title        |   0.32992414
 dream     | description  |   0.35647446
 dream     | is_nightmare |   0.51219034
 dream     | is_recurring |   0.50512606
 dream     | is_private   |   0.49398795
 dream     | is_starred   |   0.50084716
 dream     | emotions     |   0.17618044
 dream     | dreamed_at   |    0.3357743
 dream     | created_at   |            1
 dream     | updated_at   |            1
(13 rows)

References

Appendix: some illustrative query plans

undercurrent_dev=# explain analyze select c.word,                                                                                                                                                                                                           count (dream.id) filter (where is_lucid = true) as are_lucid,                                                                                                                                                                                               count (dream.id) filter (where is_nightmare = true) as are_nightmare,                                                                                                                                                                                   count (dream.id) filter (where is_recurring = true) as are_recurring,                                                                                                                                                                                   count (*) as total_dreams                                                                                                                                                                                                                           from dream join                                                                                                                                                                                                                                             (select word, ndoc from ts_stat($$select to_tsvector('english_simple', title || ' ' || description) from dream$$)) as c                                                                                                                                 on to_tsvector ('english_simple', title || ' ' || description) @@ to_tsquery('english_simple', c.word)                                                                                                                                              where dreamed_at between '2020-01-01' and '2020-06-01' group by c.word order by total_dreams desc                                                                                                                                                       limit 10;
                                                                                               QUERY PLAN                                                                                               
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=184326.85..184326.87 rows=10 width=64) (actual time=70.200..70.204 rows=10 loops=1)
   ->  Sort  (cost=184326.85..184327.35 rows=200 width=64) (actual time=70.197..70.199 rows=10 loops=1)
         Sort Key: (count(*)) DESC
         Sort Method: quicksort  Memory: 26kB
         ->  HashAggregate  (cost=184320.52..184322.52 rows=200 width=64) (actual time=70.156..70.174 rows=14 loops=1)
               Group Key: ts_stat.word
               ->  Nested Loop  (cost=0.48..182160.52 rows=172800 width=43) (actual time=47.008..63.092 rows=14989 loops=1)
                     ->  Function Scan on ts_stat  (cost=0.03..100.03 rows=10000 width=32) (actual time=46.694..46.701 rows=15 loops=1)
                     ->  Bitmap Heap Scan on dream  (cost=0.45..18.04 rows=17 width=44) (actual time=0.264..0.917 rows=999 loops=15)
                           Recheck Cond: (to_tsvector('english_simple'::regconfig, (((title)::text || ' '::text) || (description)::text)) @@ to_tsquery('english_simple'::regconfig, ts_stat.word))
                           Filter: ((dreamed_at >= '2020-01-01 00:00:00-05'::timestamp with time zone) AND (dreamed_at <= '2020-06-01 00:00:00-04'::timestamp with time zone))
                           Rows Removed by Filter: 500
                           Heap Blocks: exact=1107
                           ->  Bitmap Index Scan on word_cloud_index  (cost=0.00..0.45 rows=26 width=0) (actual time=0.245..0.245 rows=1499 loops=15)
                                 Index Cond: (to_tsvector('english_simple'::regconfig, (((title)::text || ' '::text) || (description)::text)) @@ to_tsquery('english_simple'::regconfig, ts_stat.word))
 Planning Time: 0.325 ms
 Execution Time: 70.329 ms
(17 rows)

undercurrent_dev=# 
undercurrent_dev=#  select emotion #>> '{}', count (*) as c from dream cross join lateral jsonb_array_elements (emotions::jsonb) as emotion
undercurrent_dev-#     where to_tsvector ('english_simple', title || ' ' || description) @@ to_tsquery ('english_simple', 'smol')
undercurrent_dev-#     and dreamed_at between '2020-01-01' and '2020-06-01'
undercurrent_dev-#     group by emotion order by c desc limit 1
undercurrent_dev-# ;
 ?column? |  c  
----------+-----
 joy      | 576
(1 row)

undercurrent_dev=# explain analyze select emotion #>> '{}', count (*) as c from dream cross join lateral jsonb_array_elements (emotions::jsonb) as emotion                                                                                                  where to_tsvector ('english_simple', title || ' ' || description) @@ to_tsquery ('english_simple', 'smol')                                                                                                                                              and dreamed_at between '2020-01-01' and '2020-06-01'                                                                                                                                                                                                    group by emotion order by c desc limit 1                                                                                                                                                                                                            ;                                                                                                                                                                                                                                                                                                                                         QUERY PLAN                                                                                   ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit  (cost=115.64..115.65 rows=1 width=72) (actual time=11.483..11.483 rows=1 loops=1)   ->  Sort  (cost=115.64..115.89 rows=100 width=72) (actual time=11.476..11.476 rows=1 loops=1)         Sort Key: (count(*)) DESC
         Sort Method: top-N heapsort  Memory: 25kB
         ->  HashAggregate  (cost=113.89..115.14 rows=100 width=72) (actual time=11.459..11.462 rows=4 loops=1)
               Group Key: emotion.value
               ->  Nested Loop  (cost=8.20..105.39 rows=1700 width=32) (actual time=0.348..8.472 rows=1920 loops=1)
                     ->  Bitmap Heap Scan on dream  (cost=8.20..71.39 rows=17 width=20) (actual time=0.333..2.428 rows=1152 loops=1)
                           Recheck Cond: (to_tsvector('english_simple'::regconfig, (((title)::text || ' '::text) || (description)::text)) @@ '''smol'''::tsquery)
                           Filter: ((dreamed_at >= '2020-01-01 00:00:00-05'::timestamp with time zone) AND (dreamed_at <= '2020-06-01 00:00:00-04'::timestamp with time zone))
                           Rows Removed by Filter: 576
                           Heap Blocks: exact=85
                           ->  Bitmap Index Scan on word_cloud_index  (cost=0.00..8.19 rows=26 width=0) (actual time=0.311..0.311 rows=1728 loops=1)
                                 Index Cond: (to_tsvector('english_simple'::regconfig, (((title)::text || ' '::text) || (description)::text)) @@ '''smol'''::tsquery)
                     ->  Function Scan on jsonb_array_elements emotion  (cost=0.00..1.00 rows=100 width=32) (actual time=0.003..0.004 rows=2 loops=1152)
 Planning Time: 0.308 ms
 Execution Time: 11.573 ms
(17 rows)

undercurrent_dev=# explain analyze select * from dream join user_account on dream.user_id = user_account.id where is_private = false and is_lucid = true and dreamed_at <= '2020-03-02' and dreamed_at >= '2020-01-01' and emotions @> '["joy"]' and user_id = 8 order by dream.id desc limit 1000;                                                                                                                                                                                                                                                                                                             QUERY PLAN                                                                                                 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit  (cost=21.53..21.54 rows=1 width=219) (actual time=0.625..0.626 rows=2 loops=1)   ->  Sort  (cost=21.53..21.54 rows=1 width=219) (actual time=0.623..0.623 rows=2 loops=1)         Sort Key: dream.id DESC         Sort Method: quicksort  Memory: 25kB         ->  Nested Loop  (cost=12.80..21.52 rows=1 width=219) (actual time=0.524..0.605 rows=2 loops=1)               ->  Bitmap Heap Scan on dream  (cost=12.80..16.82 rows=1 width=98) (actual time=0.503..0.520 rows=2 loops=1)
                     Recheck Cond: ((user_id = 8) AND (emotions @> '["joy"]'::jsonb))
                     Filter: ((NOT is_private) AND is_lucid AND (dreamed_at <= '2020-03-02 00:00:00-05'::timestamp with time zone) AND (dreamed_at >= '2020-01-01 00:00:00-05'::timestamp with time zone))
                     Rows Removed by Filter: 12
                     Heap Blocks: exact=13
                     ->  BitmapAnd  (cost=12.80..12.80 rows=1 width=0) (actual time=0.483..0.483 rows=0 loops=1)
                           ->  Bitmap Index Scan on idx_user_id  (cost=0.00..4.51 rows=31 width=0) (actual time=0.054..0.054 rows=28 loops=1)
                                 Index Cond: (user_id = 8)
                           ->  Bitmap Index Scan on idx_dream_emotions  (cost=0.00..8.04 rows=5 width=0) (actual time=0.424..0.424 rows=2592 loops=1)
                                 Index Cond: (emotions @> '["joy"]'::jsonb)
               ->  Seq Scan on user_account  (cost=0.00..4.69 rows=1 width=113) (actual time=0.010..0.039 rows=1 loops=2)
                     Filter: (id = 8)
                     Rows Removed by Filter: 134
 Planning Time: 1.280 ms
 Execution Time: 0.739 ms
(20 rows)

undercurrent_dev=# explain analyze select * from dream join user_account on dream.user_id = user_account.id where to_tsvector('english', title || ' ' || description) @@ websearch_to_tsquery('english', 'ALPACA PROPHECY??') and emotions @> '["joy"]';                                                                              QUERY PLAN                                                                             -------------------------------------------------------------------------------------------------------------------------------------------------------------------- Hash Join  (cost=16.28..21.15 rows=1 width=211) (actual time=1.229..1.469 rows=292 loops=1)   Hash Cond: (user_account.id = dream.user_id)   ->  Seq Scan on user_account  (cost=0.00..4.35 rows=135 width=113) (actual time=0.011..0.034 rows=135 loops=1)   ->  Hash  (cost=16.27..16.27 rows=1 width=98) (actual time=1.204..1.204 rows=292 loops=1)         Buckets: 1024  Batches: 1  Memory Usage: 50kB         ->  Bitmap Heap Scan on dream  (cost=12.00..16.27 rows=1 width=98) (actual time=0.530..1.084 rows=292 loops=1)
               Recheck Cond: (to_tsvector('english'::regconfig, (((title)::text || ' '::text) || (description)::text)) @@ '''alpaca'' & ''propheci'''::tsquery)
               Filter: (emotions @> '["joy"]'::jsonb)
               Rows Removed by Filter: 293
               Heap Blocks: exact=85
               ->  Bitmap Index Scan on idx_dream_ftsearch  (cost=0.00..12.00 rows=1 width=0) (actual time=0.503..0.503 rows=585 loops=1)
                     Index Cond: (to_tsvector('english'::regconfig, (((title)::text || ' '::text) || (description)::text)) @@ '''alpaca'' & ''propheci'''::tsquery)
 Planning Time: 5.492 ms
 Execution Time: 1.579 ms
(14 rows)

zrcadlo / undercurrent_api