Implement `vector<>` data type

Jadw1 commented 3 weeks ago

Add support for the vector type. A vector is a fixed-length collection with a specified element type, e.g. VECTOR<INT, 5>.

The implementation should:

extend CQL syntax
adjust built-in type hierarchy and implement all necessary abstractions
add support for serialization/deserialization
other required adjustments...
add unit tests similar to other data types

As a result, a user should be able to use the vector type in the same way as any other data type.

Note: None of Scylla's drivers support the vector type. Until the driver team (or we) add this functionality, we will probably have to use Cassandra's driver.

Apache Cassandra issue: https://issues.apache.org/jira/browse/CASSANDRA-18504

Patch adding another data type to Scylla (some code might be outdated): https://github.com/scylladb/scylladb/commit/509626fe08a49bf9312d1abb6f888c97cbadba1a

piodul commented 3 weeks ago

Link to the specification of the protocol (explains how vector values should be serialized): https://github.com/apache/cassandra/blob/15ed18e9d49f48e88f40b90c156248b8b697c7e2/doc/native_protocol_v5.spec#L1210-L1215
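To make the serialization requirement concrete, here is a minimal sketch of a round trip for a vector of ints. It assumes that elements of a fixed-size type are written back to back as big-endian values with no per-element length prefix; that layout should be checked against the linked spec before relying on it. The helper names `serialize_int_vector` and `deserialize_int_vector` are hypothetical and do not exist in Scylla.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper, not part of Scylla: encodes a vector<int, N> value as
// N big-endian 32-bit integers laid out back to back (assumed layout for
// fixed-size element types; verify against the protocol spec).
template <std::size_t N>
std::vector<std::uint8_t> serialize_int_vector(const std::array<std::int32_t, N>& values) {
    std::vector<std::uint8_t> out;
    out.reserve(N * 4);
    for (std::int32_t v : values) {
        const auto u = static_cast<std::uint32_t>(v);
        out.push_back(static_cast<std::uint8_t>(u >> 24));
        out.push_back(static_cast<std::uint8_t>(u >> 16));
        out.push_back(static_cast<std::uint8_t>(u >> 8));
        out.push_back(static_cast<std::uint8_t>(u));
    }
    return out;
}

// Hypothetical inverse of the helper above. Because the dimension is part of
// the type, the expected byte count is known up front and can be validated.
template <std::size_t N>
std::array<std::int32_t, N> deserialize_int_vector(const std::vector<std::uint8_t>& bytes) {
    if (bytes.size() != N * 4) {
        throw std::invalid_argument("unexpected byte count for vector value");
    }
    std::array<std::int32_t, N> values{};
    for (std::size_t i = 0; i < N; ++i) {
        const std::uint32_t u = (std::uint32_t(bytes[4 * i]) << 24)
                              | (std::uint32_t(bytes[4 * i + 1]) << 16)
                              | (std::uint32_t(bytes[4 * i + 2]) << 8)
                              | std::uint32_t(bytes[4 * i + 3]);
        values[i] = static_cast<std::int32_t>(u);
    }
    return values;
}
```

Element types with a variable size (e.g. text) would need some way to delimit elements, which this sketch deliberately leaves out.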

piodul commented 1 week ago

I propose structuring the implementation in the following way. In my opinion, the safest approach is to go over the phases and implement them in order, revisiting earlier ones if adjustments turn out to be needed. The pieces here are coupled tightly enough that I don't think splitting the work into smaller PRs makes sense.

Introduce support for representing the vector datatype internally

This will be a rather big part which requires a thorough reading of the code in the ./types/ directory (although some relevant files live in other directories).

The goal is to be able to represent vector types internally in Scylla. The most important components are data_value and abstract_type, both defined in ./types/types.hh.

data_value is a type which can represent any value that CQL allows. It can hold a value of an arbitrary CQL type (e.g. int, blob, set, etc.).
abstract_type is a dynamic representation of a CQL type.

First, I recommend getting familiar with what this support looks like for some "native" type (i.e. a type that is not a collection) and then looking at how lists and sets are supported. Look at the definition and implementation of the following:

int32_type_impl (defined in ./concrete_types.hh)
list_type_impl (defined in ./types/list_type_impl.hh)
data_type (defined in ./types/types.hh)

At this point, you can implement a vector_type_impl and extend data_type so that you can create a data_type which is a vector, and you can get a vector back out of it (via visit).

You may have to implement more after all, but I'm not sure what will be needed, and the above are required for certain. I recommend proceeding with the later steps and adding more to the types module as needed, then reworking it when preparing the PR for review.
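As a rough picture of the shape this could take, here is a self-contained toy model of a type hierarchy with a vector type added to it. Everything prefixed with `toy_` is invented for illustration; the real abstract_type and data_type in ./types/types.hh carry far more (serialization, comparison, validation), and the real visit may look quite different.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Toy model only: stands in for the kind tag carried by a dynamic type.
enum class toy_kind { int32, vector };

class toy_abstract_type {
    toy_kind _kind;
public:
    explicit toy_abstract_type(toy_kind k) : _kind(k) {}
    virtual ~toy_abstract_type() = default;
    toy_kind kind() const { return _kind; }
    virtual std::string cql_name() const = 0;
};

// Types are shared and immutable, so they are passed around as pointers.
using toy_data_type = std::shared_ptr<const toy_abstract_type>;

class toy_int32_type final : public toy_abstract_type {
public:
    toy_int32_type() : toy_abstract_type(toy_kind::int32) {}
    std::string cql_name() const override { return "int"; }
};

// A vector type is parameterized by its element type and a fixed dimension.
class toy_vector_type final : public toy_abstract_type {
    toy_data_type _elements;
    std::size_t _dimension;
public:
    toy_vector_type(toy_data_type elements, std::size_t dimension)
        : toy_abstract_type(toy_kind::vector)
        , _elements(std::move(elements))
        , _dimension(dimension) {}
    const toy_data_type& elements_type() const { return _elements; }
    std::size_t dimension() const { return _dimension; }
    std::string cql_name() const override {
        return "vector<" + _elements->cql_name() + ", " + std::to_string(_dimension) + ">";
    }
};

// Kind-based dispatch to the concrete type; the visitor must return the same
// type from every overload.
template <typename Visitor>
auto toy_visit(const toy_abstract_type& t, Visitor&& v) {
    switch (t.kind()) {
    case toy_kind::int32:
        return v(static_cast<const toy_int32_type&>(t));
    case toy_kind::vector:
        return v(static_cast<const toy_vector_type&>(t));
    }
    throw std::logic_error("unknown type kind");
}
```

The point of the sketch is that adding a new type touches three places: the kind enum, a new concrete class, and every kind-based dispatch such as the visit function.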

Extend CQL grammar to be able to express the vector type

Now that you have an internal representation of the vector type, you can implement the necessary syntax so that you can create a table with a column of the vector type. Start by adding the syntax and work your way down the abstractions, implementing what is needed. After this point, you should be able to create a table with a vector column and, most likely, write to and read from the table (by using bind markers, i.e. ? signs in the query).
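To illustrate what the new type syntax has to carry, here is a small standalone parser for strings of the form `vector<element_type, dimension>`. It is only a sketch: the real change belongs in Scylla's CQL grammar, and `parse_vector_type` and `parsed_vector_type` are hypothetical names. Rejecting a dimension of zero is an assumption, and overflow of very large dimensions is not handled.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical result type: the element type is kept as unparsed text.
struct parsed_vector_type {
    std::string element_type;
    std::size_t dimension;
};

// Parses "vector<elem, N>" (keyword is case-insensitive). Returns nullopt on
// anything that does not match that shape.
std::optional<parsed_vector_type> parse_vector_type(std::string_view s) {
    auto trim = [](std::string_view v) {
        while (!v.empty() && std::isspace(static_cast<unsigned char>(v.front()))) v.remove_prefix(1);
        while (!v.empty() && std::isspace(static_cast<unsigned char>(v.back()))) v.remove_suffix(1);
        return v;
    };
    s = trim(s);
    constexpr std::string_view keyword = "vector";
    if (s.size() < keyword.size()) return std::nullopt;
    for (std::size_t i = 0; i < keyword.size(); ++i) {
        if (std::tolower(static_cast<unsigned char>(s[i])) != keyword[i]) return std::nullopt;
    }
    s = trim(s.substr(keyword.size()));
    if (s.size() < 2 || s.front() != '<' || s.back() != '>') return std::nullopt;
    s = s.substr(1, s.size() - 2);
    // Split at the last comma so element types that contain commas
    // themselves (e.g. map<int, text>) stay intact.
    const auto comma = s.rfind(',');
    if (comma == std::string_view::npos) return std::nullopt;
    const auto elem = trim(s.substr(0, comma));
    const auto dim = trim(s.substr(comma + 1));
    if (elem.empty() || dim.empty()) return std::nullopt;
    std::size_t n = 0;
    for (char c : dim) {
        if (!std::isdigit(static_cast<unsigned char>(c))) return std::nullopt;
        n = n * 10 + static_cast<std::size_t>(c - '0');
    }
    if (n == 0) return std::nullopt;  // assumption: dimension must be positive
    return parsed_vector_type{std::string(elem), n};
}
```

Unlike list or set, the type takes two parameters of different kinds (a type and an integer), which is the part the grammar has to learn.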

Extend CQL grammar to be able to express vector literals

This will require delving into the cql3 layer.

The first thing that should be done there is renaming list to list_or_vector in the collection_constructor::style_type enum.

Then, go over all occurrences of list_or_vector and fix those places up:

fmt::formatter<cql3::expr::expression::printer>::format - no need to change, assuming that the syntax is the same as for lists
do_evaluate(const collection_constructor& collection, ...) - evaluate_list could be changed to evaluate_list_or_vector, and adjusted accordingly
try_prepare_expression - list_prepare_expression should be changed in a similar way to evaluate_list from the previous point
test_assignment - ditto
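The idea behind sharing one code path can be sketched as follows: the literal `[1, 2, 3]` looks the same for both types, so the receiving column's type decides how it is interpreted, and a vector receiver additionally pins the element count. This is a toy model restricted to int elements; the `toy_` names are invented here and do not correspond to the cql3 expression classes.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <variant>
#include <vector>

// Toy receiver types: what the literal is being assigned to.
struct toy_list_of_int {};
struct toy_vector_of_int { std::size_t dimension; };
using toy_receiver = std::variant<toy_list_of_int, toy_vector_of_int>;

// Evaluates an already-parsed literal such as [1, 2, 3]. A list accepts any
// number of elements; a vector accepts exactly `dimension` of them.
std::vector<std::int32_t> toy_evaluate_list_or_vector(
        const std::vector<std::int32_t>& elements, const toy_receiver& receiver) {
    if (const auto* vec = std::get_if<toy_vector_of_int>(&receiver)) {
        if (elements.size() != vec->dimension) {
            throw std::invalid_argument("literal has the wrong number of elements for the vector type");
        }
    }
    return elements;
}
```

In the real code the same distinction would presumably also affect which serialized form is produced, which this sketch does not model.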

Tests

Some tests that use the Python driver would be appreciated. For now, you can use the upstream Python driver in place of the Scylla fork if the fork does not support vector types. These tests could actually be developed in parallel with the other steps and, for now, only run against Cassandra. Running them against a working implementation will make sure that the tests make sense.

There is also an option to write boost unit tests. There are some tests of this kind for data_value and abstract_type abstractions - check out ./test/boost/types_test.cc and ./test/boost/user_types_test.cc. It is a good idea to write at least some of those before reaching the last stage.

Jadw1 commented 1 week ago

> There is also an option to write boost unit tests. There are some tests of this kind for data_value and abstract_type abstractions - check out ./test/boost/types_test.cc and ./test/boost/user_types_test.cc. It is a good idea to write at least some of those before reaching the last stage.

Boost tests are a good way to check the vector type implementation, especially in the first stage, when the CQL layer doesn't support the vector type yet. The types module is independent of the rest of the database system, so you can validate the implementation without spinning up the whole system (for instance, the test cases in test/boost/types_test.cc don't use cql_env).

zpp-2024-vector-search / scylladb

Implement `vector<>` data type #2