yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com
Other
8.99k stars 1.07k forks source link

[YCQL] Full table counts using token ranges do not match data in table #10637

Open aravind-nallan-yb opened 2 years ago

aravind-nallan-yb commented 2 years ago

Jira Link: DB-1159

Description

The table has 1M rows.

ycqlsh> select count(*) from inventory.inventory;

 count
---------
 1000000

Now break down the tokens into four ranges and make sure the sum of the counts matches the total number of rows.

ycqlsh> select count(*) from inventory.inventory where token(gtin13,location_id) > -6149102341220990976 and token(gtin13,location_id) <= 0;

 count
--------
 333192

(1 rows)
ycqlsh> select count(*) from inventory.inventory where token(gtin13,location_id) > 0 and token(gtin13,location_id) <= 6148820866244280320
   ... ;

 count
--------
 333762

(1 rows)
ycqlsh> select count(*) from inventory.inventory where token(gtin13,location_id) > 6148820866244280320 and token(gtin13,location_id) <= -9223372036854775808;

 count
--------
 166585

(1 rows)
ycqlsh> select count(*) from inventory.inventory where token(gtin13,location_id) > -9223372036854775808 and token(gtin13,location_id) <= -6149102341220990976;

 count
--------
 166449

(1 rows)

If i add up the results of individual range queries, it comes out to 999988. It misses 12 rows.

kmuthukk commented 2 years ago

This range (as @pkj415 pointed out) on the 3rd statement looks like a invalid range that should yield no matches. We are asking for x > +ve number AND x < -ve number

ycqlsh> select count(*) from inventory.inventory where token(gtin13,location_id) > 6148820866244280320 and token(gtin13,location_id) <= -9223372036854775808;

I asked @aravind-nallan-yb to instead query a non-circular/non-confusing set of ranges, for example these 4 ranges:

token() >= -9xx and token() < -6xxx
token() >= -6xx and token() < 0
token() >= 0 and token() < 6xx
token() >= 6xx and token <= 9xx

and he confirmed that it works cleanly.. and the return values add up to the expected 1000000 number.

CC: @m-iancu , @tedyu

m-iancu commented 2 years ago

@kmuthukk @aravind-nallan-yb This behavior for INT64_MIN (token(gtin13,location_id) <= -9223372036854775808) looks odd, but it was required for Cassandra compatibility -- and it was also used by some connectors (specifically Presto). See

m-iancu commented 2 years ago

So I think the 4 statements above (in the original message) are technically almost correct. I believe the miss is that on the 4th statement the:

select count(*) from inventory.inventory where token(gtin13,location_id) > -9223372036854775808 ...

should really be a non-strict inequality (>=) condition on -9223372036854775808 because, as explained in my message above, the token(gtin13,location_id) <= -9223372036854775808 condition on the third statement does not actually include the case where token is -9223372036854775808.

@aravind-nallan-yb do you mind testing that?

Note: It might be that the fix in #312 Fix handling of min value as token upper bound is not actually correct w.r.t. to vanilla Cassandra and we were supposed to really include the = -9223372036854775808 case in that.

@aravind-nallan-yb Maybe we can run the same scenario in vanilla Cassandra 3.11 to compare?

cc @kmuthukk @tedyu

kmuthukk commented 2 years ago

Also, in my example:

token() >= -9xx and token() < -6xxx
token() >= -6xx and token() < 0
token() >= 0 and token() < 6xx
token() >= 6xx and token < 9xx

the last, I think, statement actually needs to be.

token() >= 6xx and token <= 9xx

@aravind-nallan-yb - can you check that also?

UPDATE: I fixed this inline in my previous message.

aravind-nallan-yb commented 2 years ago

The above combination of queries checks out. Also the spark counts are good too. So we can safely chalk this up to dsbulk issue. scala> sc.cassandraTable("inventory","inventory").cassandraCount 2021-11-19 08:27:45 WARN NettyUtil:73 - Found Netty's native epoll transport, but not running on linux-based operating system. Using NIO instead. res0: Long = 1000000

ashetkar commented 2 years ago

As per the Cassandra code (see here), it seems that the query with token <= -9223372036854775808 works as a wrap-around and does include the case when the token is exactly -9223372036854775808.

This is true for any range interval where the lower bound is greater than the upper bound.

So the condition token <= -9223372036854775808 may not really mean no restriction.

HarshDaryani896 commented 2 years ago

For queries like:

select count(*) from inventory.inventory where token(gtin13,location_id) > 6148820866244280320 and token(gtin13,location_id) <= -9223372036854775808;

i.e. when the upper bound of token(k) is INT64_MIN (-9223372036854775808), the output should count all rows that have token > 6148820866244280320 as well as those rows that have token <= INT64_MIN.

This behaviour is required for Cassandra compatibility -- and it is also used by some connectors (specifically Presto) and tools like Dsbulk as mentioned above.

In Cassandra, the output of above query counts all the records that have token > 6148820866244280320 and records that have token = INT64_MIN (btw, Cassandra does not assign INT64_MIN token to any record as we can see here).

In Yugabyte-db, the output of above query counts all the records that have token > 6148820866244280320, but it does not count those records that have token = INT64_MIN, which it should for YBDB to work with tools like Dsbulk.

The rows that are missing for above query in YBDB are the ones that have token = INT64_MIN.

If this issue can be fixed on server-side, we may not have to fix the individual tools that rely on this behaviour (i.e. to count the rows that have token = INT64_MIN when above query is run).