workfloworchestrator / orchestrator-core

The workflow orchestrator core repository
Apache License 2.0
40 stars 15 forks source link

[Bug]: Fix Subscriptions search for the different postgres versions #602

Closed pboers1988 closed 5 months ago

pboers1988 commented 6 months ago

What happened?

Using more complex search syntax, like the | operator test fail as the database does not return a result. This test also used to be flaky, so it might be caused by a race condition somewhere, however higher versions of postgres do seem to make the results more predictable, indicating a relationship to the database version.

See the result of a matrix test run, where the Python version and Postgres version are manipulated. https://github.com/workfloworchestrator/orchestrator-core/actions/runs/8555513507/attempts/2?pr=599

Version

alpha

What Postgres version are you seeing the problem on?

Postgres 11, Postgres 12, Postgres 13

pboers1988 commented 6 months ago

After the first initial runs, we can start to see a pattern. The behaviour of to_tsquery seems to be different from postgres 14 onwards. Further more it seems that test_multiple_subscriptions is flaky.

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=0c4f355c6a5fd437f71349f2f3d5d491382572b7

dmarjenburgh commented 6 months ago

There is a difference in how Postgres parses the input to to_tsquery between versions <14 and >=14.

We should be able to search for a single subscription_id with either of the following queries:

This is handled by orchestrator-core by parsing the string and passing the result to to_tsquery like so: to_tsquery('simple', 'subscription <-> id <-> 99933227-08bb-46f5-960c-6a2ec3ddaab4')

The result of this expression is:

-- Postgres <14
'subscription' <-> 'id' <-> ( '99933227' & '-08' & 'bb-46f5-960c-6a2ec3ddaab4' & 'bb' & '46f5' & '960c' & '6a2ec3ddaab4' )

-- Postgres >=14
'subscription' <-> 'id' <-> ( '99933227' <-> '-08' <-> 'bb-46f5-960c-6a2ec3ddaab4' <-> 'bb' <-> '46f5' <-> '960c' <-> '6a2ec3ddaab4' )

The Postgres<14 result means searching for the tokens subscription followed by id followed by 99933227 AND -08 AND ... . This is nonsensical and will never yield a result.

In contrast, the Postgres>=14 result means all the tokens should be consecutive, which is exactly what we need and gives the correct result.

EDIT: The relevant commit mentioning and fixing this issue is here: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=0c4f355c6a5fd437f71349f2f3d5d491382572b7

dmarjenburgh commented 5 months ago

There is another issue, which is that Postgres tokenizes text using a default parser which can read parts of a UUID as a number.

For example, consider the output of the following query:

SELECT * from ts_debug('simple', replace('0e12d6bb-ef55-4e1e-9b10-5cf1afeef440', '-', '_'));
alias description token dictionaries dictionary lexemes
sfloat Scientific notation 0e12 {simple} simple {0e12}
numword Word, letters and digits d6bb {simple} simple {d6bb}
blank Space symbols _ {} null null
numword Word, letters and digits ef55 {simple} simple {ef55}
blank Space symbols _ {} null null
sfloat Scientific notation 4e1 {simple} simple {4e1}
asciiword Word, all ASCII e {simple} simple {e}
blank Space symbols _ {} null null
numword Word, letters and digits 9b10 {simple} simple {9b10}
blank Space symbols _ {} null null
numword Word, letters and digits 5cf1afeef440 {simple} simple {5cf1afeef440}

This is the behavior of the default parser (the only one shipping with Postgres). Adding your own parser entails creating a procedure in C and requires superuser privileges on the database. Not a good option.

This has consequences for querying on a UUID. For Postgres >=14, we can just query for the entire UUID, as the query is parsed in the same way as the tsvector, but in Postgres <14, we would have to do it manually or some records can not be found with a UUID search...

I have implemented a workaround for this in #603, which emits separate db queries to ts_parse and let the db parse parts of the search query so that it matches with the ts_vector (This is the normal behavior in Postgres >=14)