whale2 / async-kinesis-client

Python Kinesis Client library utilising asyncio
MIT License

Kinesis PutRecords Limits #2

Closed · bisoldi closed this issue 5 years ago

bisoldi commented 5 years ago

On line 103 of kinesis_producer.py, you have:

# I hope I'm implementing this correctly, as there are different hints about maximum data sizes
# in boto3 docs and general AWS docs
if record_size > MAX_RECORD_SIZE:
    raise ValueError(
        'Record # {} exceeded max record size of {}; size={}; record={}'.format(
            n, MAX_RECORD_SIZE, record_size, datum))

I think that's incorrect. You're comparing the size of the data alone against the 1 MB per-record limit, when it sounds like it should be the size of the data plus the partition key and explicit hash key. You use the correct methodology (assuming my understanding is right) on line 12, where you compare the total datum size against the max request size (5 MB):

datum_size = utils._sizeof(datum)

if self.buf_size + datum_size > MAX_BATCH_SIZE:
    resp.append(await self.flush())

Below is from the AWS Kinesis PutRecords API documentation:

Each PutRecords request can support up to 500 records. Each record in the request can be as large as 1 MiB, up to a limit of 5 MiB for the entire request, including partition keys. Each shard can support writes up to 1,000 records per second, up to a maximum data write total of 1 MiB per second.

And here is the sample request the documentation provides:

{
   "Records": [ 
      { 
         "Data": blob,
         "ExplicitHashKey": "string",
         "PartitionKey": "string"
      }
   ],
   "StreamName": "string"
}

The way I'm reading the documentation, the "record" that can be a maximum of 1MB includes the Data, ExplicitHashKey and PartitionKey.

I think you can remove record_size = utils._sizeof(datum.get('Data')) and compare datum_size against MAX_RECORD_SIZE.
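
For illustration, a minimal sketch of that change, reusing the names from the snippets above (utils._sizeof, MAX_RECORD_SIZE, datum, n) - a sketch, not a tested patch:

# Suggested change: measure the whole datum
# (Data + PartitionKey + ExplicitHashKey) against the per-record limit,
# mirroring the existing batch-size check.
datum_size = utils._sizeof(datum)

if datum_size > MAX_RECORD_SIZE:
    raise ValueError(
        'Record # {} exceeded max record size of {}; size={}; record={}'.format(
            n, MAX_RECORD_SIZE, datum_size, datum))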

whale2 commented 5 years ago

I still think the documentation is not very clear on this particular point, but thanks for the heads-up. I think I'll just try calling the API with a record that big to find out what the real limit is.

bisoldi commented 5 years ago

Totally agree on the quality of the documentation. If it helps, I am going to put in a tech support request asking for clarification, and I'll post the answer here.

whale2 commented 5 years ago

I played with it a bit. So far, with boto3==1.9.49 and botocore==1.12.49, the size of the partition key is ignored. I was able to put records of exactly 1048576 bytes and got an exception for anything bigger. I tried partition keys of various sizes and, unless a key is longer than the allowed 256 characters, no exception is raised. Adding an ExplicitHashKey does not change anything.
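
A minimal sketch of that kind of probe (assuming working AWS credentials and an existing stream; 'test-stream' and the partition key are placeholders):

import boto3
from botocore.exceptions import ClientError

client = boto3.client('kinesis')

for extra in (0, 1):
    # Exactly 1 MiB first, then 1 MiB + 1 byte
    data = b'x' * (1048576 + extra)
    try:
        client.put_record(StreamName='test-stream',
                          Data=data,
                          PartitionKey='some-partition-key')
        print('accepted {} bytes'.format(len(data)))
    except ClientError as e:
        print('rejected {} bytes: {}'.format(len(data), e))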

With batch put_records() it is essentially the same: max 1 MB per record and max 5 MB for all records combined, no matter what the partition keys are.

However, in async-kinesis-client I'm currently calculating the record size wrong: since Data is just a plain bytes or bytearray value, a simple len() should be used. Going to fix this soon.
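
A minimal sketch of what that fix could look like (MAX_RECORD_SIZE and datum as in the snippets above, the rest illustrative):

# Data is a plain bytes/bytearray payload, so len() gives its exact size;
# a generic size helper such as sys.getsizeof() would also count Python
# object overhead.
record_size = len(datum.get('Data', b''))

if record_size > MAX_RECORD_SIZE:
    raise ValueError('Record exceeds max record size of {} bytes'.format(MAX_RECORD_SIZE))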

bisoldi commented 5 years ago

Wow, that's amazing... blows my mind, because that's not what the docs imply (as confusing as they are).

Anyways, I already submitted the ticket, so I'll be happy to share the response.

whale2 commented 5 years ago

Size calculation reworked in 0.1.3. @bisoldi - I'd still love to know AWS's response, though. Thanks. Closing.

bisoldi commented 5 years ago

@whale2 Just to close the loop on this... It seems the documentation does not match reality.

My email to AWS:

I'm seeking clarification on the PutRecords API limitations, specifically the individual record size limit. Below is the relevant paragraph:

"Each PutRecords request can support up to 500 records. Each record in the request can be as large as 1 MiB, up to a limit of 5 MiB for the entire request, including partition keys. Each shard can support writes up to 1,000 records per second, up to a maximum data write total of 1 MiB per second."

My question is, does the 1MB limit per record INCLUDE the partition key and explicit hash key? Or does it include only the data blob itself?

And their response:

To answer your question, the 1MB upper limit that applies to the size of each record includes the data blob (the payload before base64-encoding) as well as the partition key [1]. Therefore, we can look at it as follows: size of data blob + partition key <= 1MB

References: [1] https://docs.aws.amazon.com/kinesis/latest/APIReference/API_PutRecord.html#API_PutRecord_RequestParameters
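
In code terms, their answer boils down to something like this (an illustrative check, not the library's actual implementation):

MAX_RECORD_SIZE = 1024 * 1024  # 1 MiB

def is_within_limit(data: bytes, partition_key: str) -> bool:
    # Per AWS support: size of data blob + partition key must be <= 1 MiB.
    # Partition keys are Unicode strings, so measure their UTF-8 encoding.
    return len(data) + len(partition_key.encode('utf-8')) <= MAX_RECORD_SIZE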

whale2 commented 5 years ago

@bisoldi I think I should check deeper. Maybe it is boto3 that doesn't raise the exception, while Kinesis actually trims or ignores a record that is already 1 MB before the partition key is counted. Shame on me, I didn't check whether I really received those messages.