Open aravind-nallan-yb opened 2 years ago
This range (as @pkj415 pointed out) on the 3rd statement looks like a invalid range that should yield no matches. We are asking for x > +ve number AND x < -ve number
ycqlsh> select count(*) from inventory.inventory where token(gtin13,location_id) > 6148820866244280320 and token(gtin13,location_id) <= -9223372036854775808;
I asked @aravind-nallan-yb to instead query a non-circular/non-confusing set of ranges, for example these 4 ranges:
token() >= -9xx and token() < -6xxx
token() >= -6xx and token() < 0
token() >= 0 and token() < 6xx
token() >= 6xx and token <= 9xx
and he confirmed that it works cleanly.. and the return values add up to the expected 1000000 number.
CC: @m-iancu , @tedyu
@kmuthukk @aravind-nallan-yb This behavior for INT64_MIN (token(gtin13,location_id) <= -9223372036854775808
) looks odd, but it was required for Cassandra compatibility -- and it was also used by some connectors (specifically Presto).
See
So I think the 4 statements above (in the original message) are technically almost correct. I believe the miss is that on the 4th statement the:
select count(*) from inventory.inventory where token(gtin13,location_id) > -9223372036854775808 ...
should really be a non-strict inequality (>=
) condition on -9223372036854775808
because, as explained in my message above, the token(gtin13,location_id) <= -9223372036854775808
condition on the third statement does not actually include the case where token is -9223372036854775808
.
@aravind-nallan-yb do you mind testing that?
Note: It might be that the fix in #312 Fix handling of min value as token upper bound is not actually correct w.r.t. to vanilla Cassandra and we were supposed to really include the = -9223372036854775808
case in that.
@aravind-nallan-yb Maybe we can run the same scenario in vanilla Cassandra 3.11 to compare?
cc @kmuthukk @tedyu
Also, in my example:
token() >= -9xx and token() < -6xxx
token() >= -6xx and token() < 0
token() >= 0 and token() < 6xx
token() >= 6xx and token < 9xx
the last, I think, statement actually needs to be.
token() >= 6xx and token <= 9xx
@aravind-nallan-yb - can you check that also?
UPDATE: I fixed this inline in my previous message.
The above combination of queries checks out. Also the spark counts are good too. So we can safely chalk this up to dsbulk issue.
scala> sc.cassandraTable("inventory","inventory").cassandraCount 2021-11-19 08:27:45 WARN NettyUtil:73 - Found Netty's native epoll transport, but not running on linux-based operating system. Using NIO instead. res0: Long = 1000000
As per the Cassandra code (see here), it seems that the query with token <= -9223372036854775808
works as a wrap-around and does include the case when the token is exactly -9223372036854775808
.
This is true for any range interval where the lower bound is greater than the upper bound.
So the condition token <= -9223372036854775808
may not really mean no restriction.
For queries like:
select count(*) from inventory.inventory where token(gtin13,location_id) > 6148820866244280320 and token(gtin13,location_id) <= -9223372036854775808;
i.e. when the upper bound of token(k) is INT64_MIN (-9223372036854775808), the output should count all rows that have token > 6148820866244280320 as well as those rows that have token <= INT64_MIN.
This behaviour is required for Cassandra compatibility -- and it is also used by some connectors (specifically Presto) and tools like Dsbulk as mentioned above.
In Cassandra, the output of above query counts all the records that have token > 6148820866244280320 and records that have token = INT64_MIN (btw, Cassandra does not assign INT64_MIN token to any record as we can see here).
In Yugabyte-db, the output of above query counts all the records that have token > 6148820866244280320, but it does not count those records that have token = INT64_MIN, which it should for YBDB to work with tools like Dsbulk.
The rows that are missing for above query in YBDB are the ones that have token = INT64_MIN.
If this issue can be fixed on server-side, we may not have to fix the individual tools that rely on this behaviour (i.e. to count the rows that have token = INT64_MIN when above query is run).
Jira Link: DB-1159
Description
The table has 1M rows.
Now break down the tokens into four ranges and make sure the sum of the counts matches the total number of rows.
If i add up the results of individual range queries, it comes out to 999988. It misses 12 rows.