Closed pboers1988 closed 5 months ago
After the first initial runs, we can start to see a pattern. The behaviour of to_tsquery
seems to be different from postgres 14 onwards. Further more it seems that test_multiple_subscriptions
is flaky.
There is a difference in how Postgres parses the input to to_tsquery
between versions <14 and >=14.
We should be able to search for a single subscription_id with either of the following queries:
subscriptionId:99933227-08bb-46f5-960c-6a2ec3ddaab4
subscription_id:99933227-08bb-46f5-960c-6a2ec3ddaab4
This is handled by orchestrator-core by parsing the string and passing the result to to_tsquery
like so:
to_tsquery('simple', 'subscription <-> id <-> 99933227-08bb-46f5-960c-6a2ec3ddaab4')
The result of this expression is:
-- Postgres <14
'subscription' <-> 'id' <-> ( '99933227' & '-08' & 'bb-46f5-960c-6a2ec3ddaab4' & 'bb' & '46f5' & '960c' & '6a2ec3ddaab4' )
-- Postgres >=14
'subscription' <-> 'id' <-> ( '99933227' <-> '-08' <-> 'bb-46f5-960c-6a2ec3ddaab4' <-> 'bb' <-> '46f5' <-> '960c' <-> '6a2ec3ddaab4' )
The Postgres<14 result means searching for the tokens subscription
followed by id
followed by 99933227
AND -08
AND ... . This is nonsensical and will never yield a result.
In contrast, the Postgres>=14 result means all the tokens should be consecutive, which is exactly what we need and gives the correct result.
EDIT: The relevant commit mentioning and fixing this issue is here: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=0c4f355c6a5fd437f71349f2f3d5d491382572b7
There is another issue, which is that Postgres tokenizes text using a default parser which can read parts of a UUID as a number.
For example, consider the output of the following query:
SELECT * from ts_debug('simple', replace('0e12d6bb-ef55-4e1e-9b10-5cf1afeef440', '-', '_'));
alias | description | token | dictionaries | dictionary | lexemes |
---|---|---|---|---|---|
sfloat | Scientific notation | 0e12 | {simple} | simple | {0e12} |
numword | Word, letters and digits | d6bb | {simple} | simple | {d6bb} |
blank | Space symbols | _ | {} | null | null |
numword | Word, letters and digits | ef55 | {simple} | simple | {ef55} |
blank | Space symbols | _ | {} | null | null |
sfloat | Scientific notation | 4e1 | {simple} | simple | {4e1} |
asciiword | Word, all ASCII | e | {simple} | simple | {e} |
blank | Space symbols | _ | {} | null | null |
numword | Word, letters and digits | 9b10 | {simple} | simple | {9b10} |
blank | Space symbols | _ | {} | null | null |
numword | Word, letters and digits | 5cf1afeef440 | {simple} | simple | {5cf1afeef440} |
This is the behavior of the default parser (the only one shipping with Postgres). Adding your own parser entails creating a procedure in C and requires superuser privileges on the database. Not a good option.
This has consequences for querying on a UUID. For Postgres >=14, we can just query for the entire UUID, as the query is parsed in the same way as the tsvector, but in Postgres <14, we would have to do it manually or some records can not be found with a UUID search...
I have implemented a workaround for this in #603, which emits separate db queries to ts_parse
and let the db parse parts of the search query so that it matches with the ts_vector (This is the normal behavior in Postgres >=14)
What happened?
Using more complex search syntax, like the
|
operator test fail as the database does not return a result. This test also used to be flaky, so it might be caused by a race condition somewhere, however higher versions of postgres do seem to make the results more predictable, indicating a relationship to the database version.See the result of a matrix test run, where the Python version and Postgres version are manipulated. https://github.com/workfloworchestrator/orchestrator-core/actions/runs/8555513507/attempts/2?pr=599
Version
alpha
What Postgres version are you seeing the problem on?
Postgres 11, Postgres 12, Postgres 13