Open Jadw1 opened 3 weeks ago
Link to the specification of the protocol (explains how vector values should be serialized): https://github.com/apache/cassandra/blob/15ed18e9d49f48e88f40b90c156248b8b697c7e2/doc/native_protocol_v5.spec#L1210-L1215
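To make the serialization question concrete, here is a minimal sketch for one case only: a vector of 4-byte CQL ints. It assumes that elements of a fixed-size type are written back to back in big-endian order with no per-element length prefix; that assumption should be checked against the linked spec before relying on it. The function names are hypothetical and are not part of Scylla or Cassandra.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical helper, not a Scylla or Cassandra API.
// ASSUMPTION: fixed-size elements are concatenated in big-endian order,
// without any length prefix per element. Verify against the spec.
template <std::size_t N>
std::vector<std::uint8_t> serialize_int_vector(const std::array<std::int32_t, N>& v) {
    std::vector<std::uint8_t> out;
    out.reserve(N * 4);
    for (std::int32_t e : v) {
        auto u = static_cast<std::uint32_t>(e);
        out.push_back(static_cast<std::uint8_t>(u >> 24));
        out.push_back(static_cast<std::uint8_t>(u >> 16));
        out.push_back(static_cast<std::uint8_t>(u >> 8));
        out.push_back(static_cast<std::uint8_t>(u));
    }
    assert(out.size() == N * 4);  // a vector of N ints always takes N * 4 bytes
    return out;
}

// Inverse of the function above. Returns std::nullopt when the buffer
// length does not match the fixed dimension N.
template <std::size_t N>
std::optional<std::array<std::int32_t, N>> deserialize_int_vector(const std::vector<std::uint8_t>& in) {
    if (in.size() != N * 4) {
        return std::nullopt;
    }
    std::array<std::int32_t, N> v{};
    for (std::size_t i = 0; i < N; ++i) {
        std::uint32_t u = (static_cast<std::uint32_t>(in[4 * i]) << 24)
                        | (static_cast<std::uint32_t>(in[4 * i + 1]) << 16)
                        | (static_cast<std::uint32_t>(in[4 * i + 2]) << 8)
                        |  static_cast<std::uint32_t>(in[4 * i + 3]);
        v[i] = static_cast<std::int32_t>(u);
    }
    return v;
}
```

Because the dimension is part of the type, the deserializer can reject a buffer of the wrong size without reading any element count from the wire.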
I propose structuring the implementation as follows. In my opinion, the safest approach is to go through the phases and implement them in order, revisiting earlier ones if adjustments are needed. The pieces are coupled tightly enough that I don't think splitting the work into smaller PRs makes sense.
Introduce support for representing the vector datatype internally
This will be a rather big part that requires a thorough reading of the code in the ./types/ directory (although some relevant files are in other directories).
The goal is to be able to represent vector types internally in Scylla. The most important components are data_value and abstract_type, both defined in ./types/types.hh.
- data_value is a type which can represent any value that CQL allows. It can hold a value of an arbitrary CQL type (e.g. int, blob, set).
- abstract_type is a dynamic representation of a CQL type.
First, I recommend getting familiar with what this support looks like for a "native" type (i.e. a type that is not a collection), and then looking at how lists and sets are supported. Look at the definition and implementation of the following:
- int32_type_impl (defined in ./concrete_types.hh)
- set_type_impl (defined in ./types/list_type_impl.hh)
- data_type (defined in ./types/types.hh)
At this point, you can implement a vector_type_impl and extend data_type so that you can create a data_type which is a vector, and you can get a vector back out of it (via visit).
You may have to implement more after all; I'm not sure what else will be needed, but the above are certainly required. I recommend proceeding with the later steps, adding more to the types module as needed, and then reworking it when preparing the PR for review.
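As a rough illustration of the shape of this first phase, here is a toy model of a type hierarchy with a vector type, a value that carries its type, and a visit-style dispatch. Everything lives in a `sketch` namespace: the names only echo the ones mentioned above, and none of the definitions are Scylla's actual ones (the real classes have many more responsibilities, such as serialization and comparison).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <string>
#include <type_traits>
#include <utility>
#include <variant>
#include <vector>

namespace sketch {  // toy stand-ins, NOT the definitions from ./types/types.hh

struct abstract_type {
    virtual ~abstract_type() = default;
    virtual std::string name() const = 0;
};
using data_type = std::shared_ptr<const abstract_type>;

struct int32_type_impl final : abstract_type {
    std::string name() const override { return "int"; }
};

// A vector type is described by its element type and its fixed dimension.
struct vector_type_impl final : abstract_type {
    data_type elements;
    std::size_t dimension;
    vector_type_impl(data_type e, std::size_t d) : elements(std::move(e)), dimension(d) {}
    std::string name() const override {
        return "vector<" + elements->name() + ", " + std::to_string(dimension) + ">";
    }
};

// Dispatches on the concrete type, in the spirit of a visit() helper.
template <typename Visitor>
auto visit(const abstract_type& t, Visitor&& v) {
    if (auto* vec = dynamic_cast<const vector_type_impl*>(&t)) { return v(*vec); }
    if (auto* i32 = dynamic_cast<const int32_type_impl*>(&t)) { return v(*i32); }
    throw std::logic_error("type not handled by this sketch");
}

// Returns the dimension for vector types and 0 for everything else.
inline std::size_t dimension_of(const abstract_type& t) {
    return visit(t, [](const auto& concrete) -> std::size_t {
        using T = std::decay_t<decltype(concrete)>;
        if constexpr (std::is_same_v<T, vector_type_impl>) {
            return concrete.dimension;
        } else {
            return 0;
        }
    });
}

// A value paired with its type; simplified to int elements only.
struct data_value {
    data_type type;
    std::variant<std::int32_t, std::vector<std::int32_t>> value;
};

// Builds a vector value, rejecting element counts that differ from the
// dimension declared by the type.
inline data_value make_vector_value(std::shared_ptr<const vector_type_impl> type,
                                    std::vector<std::int32_t> elems) {
    if (elems.size() != type->dimension) {
        throw std::invalid_argument("element count does not match vector dimension");
    }
    return data_value{std::move(type), std::move(elems)};
}

}  // namespace sketch
```

The point of the sketch is the invariant: unlike a list, a vector value is only valid if its element count equals the dimension stored in its type, so that check belongs wherever values are constructed or deserialized.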
Extend CQL grammar to be able to express the vector type
Now that you have an internal representation of the vector type, you can implement the necessary syntax so that you can create a table with a column of the vector type. Start by adding the syntax and work your way down the abstractions, implementing what is needed. After this point, you should be able to create a table with a vector datatype and, most likely, be able to write to / read from the table (by using bind markers, i.e. ? signs in the query).
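To show what the new type syntax has to capture, here is a small hand-written parser for strings such as `vector<int, 5>`. It is only an illustration of the two pieces of information the grammar must extract (element type and dimension); it is not how the actual CQL grammar is written, and `vector_type_spec` and `parse_vector_type` are hypothetical names. It assumes a simple, non-nested element type name and a strictly positive dimension.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical result type: what the parser needs to hand to the type layer.
struct vector_type_spec {
    std::string element_type;  // lower-cased element type name, e.g. "int"
    std::size_t dimension;     // fixed number of elements
};

// Parses "vector<ELEM, N>" (keyword and type name are case-insensitive).
// ASSUMPTIONS: ELEM is a plain identifier (no nested types) and N > 0.
// Overflow of very long digit strings is not handled in this sketch.
std::optional<vector_type_spec> parse_vector_type(std::string_view s) {
    auto is = [](int (*pred)(int), char c) { return pred(static_cast<unsigned char>(c)) != 0; };
    auto skip_ws = [&] {
        while (!s.empty() && is(std::isspace, s.front())) { s.remove_prefix(1); }
    };
    auto consume = [&](char c) {
        skip_ws();
        if (s.empty() || s.front() != c) { return false; }
        s.remove_prefix(1);
        return true;
    };

    skip_ws();
    constexpr std::string_view keyword = "vector";
    if (s.size() < keyword.size()) { return std::nullopt; }
    for (std::size_t i = 0; i < keyword.size(); ++i) {
        if (std::tolower(static_cast<unsigned char>(s[i])) != keyword[i]) { return std::nullopt; }
    }
    s.remove_prefix(keyword.size());
    if (!consume('<')) { return std::nullopt; }

    skip_ws();
    std::string elem;
    while (!s.empty() && (is(std::isalnum, s.front()) || s.front() == '_')) {
        elem.push_back(static_cast<char>(std::tolower(static_cast<unsigned char>(s.front()))));
        s.remove_prefix(1);
    }
    if (elem.empty() || !consume(',')) { return std::nullopt; }

    skip_ws();
    std::size_t dim = 0;
    while (!s.empty() && is(std::isdigit, s.front())) {
        dim = dim * 10 + static_cast<std::size_t>(s.front() - '0');
        s.remove_prefix(1);
    }
    if (dim == 0 || !consume('>')) { return std::nullopt; }

    skip_ws();
    if (!s.empty()) { return std::nullopt; }  // trailing garbage
    assert(!elem.empty() && dim > 0);         // guaranteed by the checks above
    return vector_type_spec{elem, dim};
}
```

For example, `parse_vector_type("VECTOR<INT, 5>")` yields element type `int` and dimension 5, while `vector<int>` and `vector<int, 0>` are rejected.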
Extend CQL grammar to be able to express vector literals
This will require delving into the cql3 layer.
The first thing that should be done there is changing the name list to list_or_vector in the collection_constructor::style_type enum.
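The reason for the rename is that a bracketed literal such as `[1, 2, 3]` looks the same whether the target is a list or a vector, so only the receiving type can tell them apart. The sketch below models that idea with toy types in a `sketch` namespace. The enumerators other than list_or_vector, the `receiver` struct, and the int-only elements are assumptions made for illustration and do not reflect the real cql3 definitions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <vector>

namespace sketch {  // toy model, NOT the real cql3::expr definitions

// The renamed style: one syntactic form shared by lists and vectors.
// The remaining enumerators are assumed for this sketch.
enum class style_type { list_or_vector, set, map };

// A parsed literal, simplified to int elements.
struct collection_constructor {
    style_type style;
    std::vector<std::int32_t> elements;
};

// Hypothetical description of the receiving column type:
// std::nullopt means "list", a value means "vector of that dimension".
struct receiver {
    std::optional<std::size_t> vector_dimension;
};

// Evaluates a bracketed literal against its receiver. Lists accept any
// number of elements; vectors must match the declared dimension exactly.
inline std::vector<std::int32_t> evaluate_list_or_vector(const collection_constructor& c,
                                                         const receiver& r) {
    if (c.style != style_type::list_or_vector) {
        throw std::invalid_argument("not a list or vector literal");
    }
    if (r.vector_dimension && c.elements.size() != *r.vector_dimension) {
        throw std::invalid_argument("wrong number of elements for vector");
    }
    return c.elements;
}

}  // namespace sketch
```

With this shape, the same literal evaluates successfully for a list receiver and for a vector receiver of dimension 3, and fails for a vector receiver of any other dimension.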
Then, go over all occurrences of list_or_vector and fix those places up:
- fmt::formatter<cql3::expr::expression::printer>::format - no need to change, assuming that the syntax is the same as for lists
- do_evaluate(const collection_constructor& collection, ...) - evaluate_list could be changed to evaluate_list_or_vector, and adjusted accordingly
- try_prepare_expression - list_prepare_expression should be changed in a similar way as evaluate_list from the previous point
- test_assignment - ditto
Tests
Some tests that use the Python driver would be appreciated. For now, you can use the upstream driver in place of the Scylla fork if the fork does not support vector types. These tests could be developed in parallel with the other steps and, for now, only run against Cassandra; running them against a valid implementation will make sure that the tests make sense.
There is also an option to write boost unit tests. There are some tests of this kind for the data_value and abstract_type abstractions - check out ./test/boost/types_test.cc and ./test/boost/user_types_test.cc. It is a good idea to write at least some of those before reaching the last stage.
Boost tests are a good way to check the vector type implementation, especially in the first stage, when the CQL layer doesn't support the vector type yet. The types module is independent of the rest of the database, so you can validate the implementation without spinning up the whole system (for instance, the test cases in test/boost/types_test.cc don't use cql_env).
Add support for the vector type. A vector is a fixed-length collection with a specified element type, e.g. VECTOR<INT, 5>.
The implementation should:
As a result, a user should be able to use the vector type in the same way as any other data type.
Note: None of Scylla's drivers supports the vector type. Until the driver team (or we) add this functionality, we're probably forced to use Cassandra's driver.
Apache Cassandra issue: https://issues.apache.org/jira/browse/CASSANDRA-18504
Patch adding another data type to Scylla (some code might be outdated): https://github.com/scylladb/scylladb/commit/509626fe08a49bf9312d1abb6f888c97cbadba1a