Adding Session Support - Githubissues

maldrake commented 3 years ago

@wolf4ood,

I'm working on something for which support for sessions would be helpful. I'd like to be able to do multiple queries / traversals within a single transaction, because the latter queries depend on the results returned by earlier queries. I'm happy to do the work to prepare the PR, but first I wanted to ask your thoughts on how I might go about it.

As far as I can tell from the reading the Apache documentation and some of their client code, it looks as though to use sessions:

the requests all need to come on the same connection (I'm not sure about this, though)
the message to the server must have a processor field set to session
the message to the server must have a session field set to a unique session identifier, recommended to be a UUID to minimize the risk of collision
at the end of the session, the client must send a message with an op of close to close the session and the transaction

Including the processor and session fields in the message looks fairly simple. I think I would add those fields to the Message enum variants and make corresponding changes to the various message_with_args* functions.

The part I'm less certain about is changes to GremlinClient and aio::GremlinClient, because of the connection pooling. The connection pool is currently encapsulated within the execute and submit_traversal functions, so there's no guarantee that the caller will get the same connection to the server for successive queries or traversals.

I couldn't tell for certain from the Apache documentation, so I'm not completely certain that queries and traversals in the same session have to arrive over the same connection. If not, the change is much easier. One approach would allow the addition of session information (i.e. a session ID) to a variant of the execute and submit_traversal functions. I would also add a close_session function to client to be able to close out the session.

My best guess from the documentation is that this won't be enough, though, because the messages in a session must come from the same connection. If that's true:

One option would be to add a separate SessionedClient, which operates without a pool in the execute and submit_traversal functions. Instead it would have a single Connection. I think from the documentation that after a session is closed, the same connection can be reused later either sessionless or with a new session, so one could conceivably have a pool of SessionedClients. I'm not certain, though. The upside is that it doesn't change existing GremlinClient behavior, although it might be a bit complex to get traversals to work with either an existing GremlinClient or the new SessionedClient.
Another option would be to restructure the code so that the pool is a pool of multiple GremlinClient structs that each have a single Connection, rather than a single GremlinClient struct with a pool of multiple Connection structs. This way, the client only has one connection, and the user of the gremlin-client library uses a pool to get a GremlinClient. The advantage of this approach is that there's no creation of a separate SessionedClient, which probably reduces code duplication and keep things a little simpler inside traversals where submit_traversal is called. However, it is more of a breaking change for clients that are currently relying on GremlinClient to manage the connection pool, and it's more complex for users of the library to create and manage the pool.

In general, is session support a feature you'd be interested in? If so... you know a lot more about Gremlin, Gremlin-based servers, and gremlin-rs than I do, so does the above sound mostly correct, or did I miss something important? Assuming I'm not too far off, if I work on a PR, do you have a preference about the design direction I go for handling the client pooling issue?

And above all -- many thanks again for this project!

wolf4ood commented 3 years ago

Hi @maldrake

thanks for raising this issue. Session support would be a cool feature indeed. I didn't investigate much the session on protocol, but i can dig more. I think that the session is decoupled from the connection but i might be wrong. I would try to check this first before making design decision. But in case if the session is bounded to the single connection yes option 1 or 2 probably are needed.

I will try to check in other clients too :)

jdeepee commented 3 years ago

Coming in here to say that I would be very happy to help with the implementation of this feature, this is much needed for our project also.

@wolf4ood have you had anymore clarity looking at other GLV's? I will also do some research over this weekend.

jdeepee commented 3 years ago

I just quickly glanced at the python GLV and I can see that if sessions are enabled then the client is only allowing a pool size of 1.

(https://github.com/apache/tinkerpop/blob/master/gremlin-python/src/main/python/gremlin_python/driver/client.py)

I guess this is a pretty good indication to the fact that session must be sent over the same connection.

wolf4ood commented 3 years ago

Hi @jdeepee not yet, I should have some time this weekend. But indeed the python one should give us some indication that the session is bounded to the connection.

maldrake commented 3 years ago

Looking into this a little further, I think we're going to need to bind a session to a particular client and connection. According to the Tinkerpop documentation (https://tinkerpop.apache.org/docs/current/reference/#sessions):

If there are multiple Gremlin Server instances, communication from the client to the server must be bound to the server that the session was initialized in. Gremlin Server does not share session state as the transactional context of a Graph is bound to the thread it was initialized in.

More broadly than just Tinkerpop, other graph databases seem to offer options for clustering as well. If someone is doing something like DNS round-robin to load balance connections, then different client instances may be connected to different server instances, which don't share sessions.

Looking back over the two options above, I think my preference would be for option 2: pull the pool outside of the client, so that you have a pool of clients, with each client having a single connection. Then give the client struct three additional functions, to begin, commit, and rollback a transaction. Does that sound like a good way to proceed?

wolf4ood commented 3 years ago

How the python/js and other users of GLV with sessions that limit the pool to one are suppose to use that client over multiple requests? Do the users need to pool manually?

wolf4ood commented 3 years ago

@jdeepee @maldrake I've checked the JS one and actually the client has only one WS connection. So i guess if you use session you need to handle manually the pool by yourself

maldrake commented 3 years ago

Agreed. I read the code the same way. This week is turning out to be busier that I expected, but I have some time off from my day job coming up in the next couple weeks. I'll try to get a draft PR together as a starting point for review and discussion.

wolf4ood / gremlin-rs

Adding Session Support #113