wg / lettuce

Scalable Java Redis client
Apache License 2.0
229 stars 61 forks source link

Binary redis codec + JUnit test #8

Closed sscarduzio closed 11 years ago

sscarduzio commented 11 years ago

This adds support for reading and writing non-UTF8 strings from and to Redis. This is absolutely necessary if you want to store anything that is not a simple string.

Use cases:

sscarduzio commented 11 years ago

@wg Is there anything I need to amend to get this merged? I'd like to include Lettuce as a maven dependency in my project at a certain point.

wg commented 11 years ago

Hi Simone, thanks for the pull request and apologies for the delayed response! In general I think RedisCodec implementations tend to be application-specific and should be part of the application's codebase rather than lettuce's. You should be able to use stock lettuce from maven as a dependency and then pass in your own custom codec.

In this case it's difficult for me to say whether most apps that want to store binary values would also use byte[] keys, i.e. maybe keys should still be UTF-8 or even ASCII... and I don't want to be maintaining a bunch of codecs with minor differences =) Plus when storing binary data it would be desirable to have the codec itself do any necessary (de)serialization directly from the ByteBuffer when possible

sscarduzio commented 11 years ago

I see what you mean, but unfortunately when you use protobuffer segregating deserialization in codec is a problem because whenever you want to deserialize something, you have to call a method inside a specific message class.

For example if you expect a calendar event from the content of this particular Redis key: CalendarEvent.parseFrom(theByteArray)

So again, I if you use protobuffer, I don't see a particularly pretty way to achieve a application specific binary codec.

The same goes with the suggestion of deserializing directly from the ByteBuffer. No real gain with it because deserialization needs anyway all the data at once (as a byte[], indeed), not chunks. So you would have to copy everything in a byte[] at a certain point. Which is anyway much less processing than passing everything through a UTF8 encoder. I suppose.

Related question: what's the need to re-encode every key in UTF8? Especially when it is received already as a String (as for UTF8StringCodec.java)

Thanks for the review and your comments! Lettuce is really cool :)

-Simone.

wg commented 11 years ago

Yeah, storing protobuf blobs is one of the original inspirations for codecs, though I haven't actually tried implementing it. I'd probably try using protobuf's extensions or unions, or just store a 2-byte prefix that identifies the protobuf message type, then the codec could call the appropriate deserializer. Keeping the type explicit in the data seems like a good idea.

Many libraries that deal with binary data do take byte[]s rather than ByteBuffers, but I suspect there are some designed for high performance that deal directly with ByteBuffers to reduce the amount of copying that needs to occur, allow for off-heap storage, etc.

Regarding the UTF-8 codec I'm sure you know that Java's Strings use 16-bit characters, so using Strings as keys and/or values means they must be converted to/from bytes.

sscarduzio commented 11 years ago

Sure they need to be converted to bytes at a certain point, but why not simply reading all the byte array and creating a new string like new String(ba, "UTF-8").

I see you reading from the byte buffer chunk by chunk into a CharBuffer and flipping it before calling a toString. https://github.com/wg/lettuce/blob/master/src/main/java/com/lambdaworks/redis/codec/Utf8StringCodec.java#L51

As far as I know you will end up having one copy of the data in the char buffer and a copy in the new object created when you call .toString()

I rarely deal with stuff this low level, so maybe you could share the reasoning behind the funky stuff :) Just out of personal interest.

Thanks!

wg commented 11 years ago

Well the String(byte[], String charset) constructor does basically the same thing as Utf8StringCodec, except that it must allocate a ByteBuffer to store the byte[], allocate a CharBuffer to store the codec output, deal with encoder caching, etc. The lettuce codec should be much more efficient because it avoids the intermediate copy in a byte[], doesn't need to allocate a new ByteBuffer or CharBuffer, doesn't need to look up the Charset, etc.