Loris load testing and capacity planning

wellcomecollection / platform

Wellcome Collection Digital Platform

https://developers.wellcomecollection.org/

Loris load testing and capacity planning #537

Closed jtweed closed 7 years ago

alexwlchan commented 7 years ago

As discussed last week, I think there are two key traffic types to test:

Random access into segments of the collection that aren’t in cache
Repeated access to a handful of images (e.g. in a /explore article)

I’m going to investigate using Siege as a load tester, because I already have the docs locally and it seems easy to configure.

alexwlchan commented 7 years ago

Hitting a static endpoint, the initial limiting factor is CPU power on my VPS:

$ docker run -it wellcome/siege siege --concurrent=100 --time=5M --benchmark https://iiif.wellcomecollection.org/image
Transactions:              72292 hits
Availability:             100.00 %
Elapsed time:             299.75 secs
Data transferred:          15.18 MB
Response time:              0.31 secs
Transaction rate:         241.17 trans/sec
Throughput:             0.05 MB/sec
Concurrency:               73.67
Successful transactions:       72334
Failed transactions:               0
Longest transaction:            1.23
Shortest transaction:           0.05
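As a sanity check on numbers like these, the Siege summary can be parsed and the transaction rate re-derived from the hit count and elapsed time. This is only a sketch: `parse_siege_summary` is a hypothetical helper, not part of Siege, and it assumes the summary keeps the `Label: value unit` layout shown above.

```python
import re


def parse_siege_summary(text):
    """Map each 'Label: value unit' line of a Siege summary to a float."""
    stats = {}
    for line in text.splitlines():
        match = re.match(r"\s*([A-Za-z ]+):\s+([0-9.]+)", line)
        if match:
            stats[match.group(1).strip()] = float(match.group(2))
    return stats


# A few lines copied from the run above.
SUMMARY = """\
Transactions:              72292 hits
Availability:             100.00 %
Elapsed time:             299.75 secs
Transaction rate:         241.17 trans/sec
Failed transactions:               0
"""

stats = parse_siege_summary(SUMMARY)

# The reported rate should match hits divided by elapsed seconds.
derived_rate = stats["Transactions"] / stats["Elapsed time"]
print(f"derived rate: {derived_rate:.2f} trans/sec")
```

Comparing the derived rate with the reported `Transaction rate` is a cheap way to confirm that a pasted summary belongs to the run it is attributed to.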

alexwlchan commented 7 years ago

I also have the ability to request lots of full-sized images, like so – I generated image URLs for the first 1000 V images (which are in fullsize.txt):

$ docker run -v (pwd)/fullsize.txt:/urls.txt -it wellcome/siege siege -f /urls.txt --concurrent=20 --time=10S --benchmark
Lifting the server siege...
Transactions:                578 hits
Availability:             100.00 %
Elapsed time:             299.72 secs
Data transferred:        1077.42 MB
Response time:             48.84 secs
Transaction rate:           1.93 trans/sec
Throughput:             3.59 MB/sec
Concurrency:               94.19
Successful transactions:         578
Failed transactions:               0
Longest transaction:          120.94
Shortest transaction:           3.04

Much slower – but Loris still holds up.

Tooling is now done – Python allows us to generate a sample of URLs for whatever profile we like. So what sort of tests would we like to do?
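For illustration, generating a URL file like fullsize.txt might look roughly like the sketch below. The base URL comes from the Siege command earlier in the thread; the identifier scheme (`V` plus a zero-padded 7-digit number, with a `.jpg` suffix) and the use of `full` as the size are assumptions, and `iiif_url`, `first_v_images` and `write_url_file` are hypothetical names rather than the actual tooling.

```python
BASE_URL = "https://iiif.wellcomecollection.org/image"


def iiif_url(image_id, region="full", size="full", rotation="0"):
    """Build an IIIF Image API style URL: id/region/size/rotation/quality.format."""
    return f"{BASE_URL}/{image_id}.jpg/{region}/{size}/{rotation}/default.jpg"


def first_v_images(count):
    """Return the first `count` image IDs under the assumed V-number scheme."""
    # Assumption: IDs look like V0000001, V0000002, ...
    return [f"V{n:07d}" for n in range(1, count + 1)]


def write_url_file(path, urls):
    """Write one URL per line, the format Siege reads with -f."""
    with open(path, "w") as f:
        f.write("\n".join(urls) + "\n")


urls = [iiif_url(image_id) for image_id in first_v_images(1000)]
print(urls[0])
```

Calling `write_url_file("fullsize.txt", urls)` would then produce a file that can be mounted into the Siege container as in the command above.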

alexwlchan commented 7 years ago

Here’s a finger-in-the-air test proposal: we split the traffic for the test (by constructing a suitably weighted list of URLs) in the following way:

0.01% = hits to /image
35% = an assortment of thumbnail images, like you might load for the search page
35% = repeated access to a small pool of images, at full resolution or common thumbnail sizes, like you might get from reading /explore articles
20% = other full-sized images from Miro, a random selection, like you’d see on individual works/item pages
10% = “weird” requests that include crops, rotations, and so on – so we see what impact lots of image processing might have
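A weighted URL list along these lines could be sketched with `random.choices`, which treats the weights as relative, so they do not need to sum to exactly 100. The pool contents below are placeholders: the image IDs, thumbnail size, crop region and pool sizes are all assumptions for illustration, and `build_url_list` is a hypothetical helper.

```python
import random

BASE = "https://iiif.wellcomecollection.org/image"


def url(n, region="full", size="full", rotation="0"):
    # Assumption: IDs are "V" plus a zero-padded 7-digit number.
    return f"{BASE}/V{n:07d}.jpg/{region}/{size}/{rotation}/default.jpg"


# Each pool is (relative weight, candidate URLs); weights follow the split above.
POOLS = {
    "static": (0.01, [BASE]),
    "thumbnails": (35, [url(n, size="300,") for n in range(1, 51)]),
    "repeated": (35, [url(n) for n in range(1, 6)]),
    "fullsize": (20, [url(n) for n in range(100, 200)]),
    "weird": (10, [url(n, region="0,0,500,500", rotation="90") for n in range(1, 21)]),
}


def build_url_list(pools, total, seed=0):
    """Pick a pool per request by weight, then a random URL from that pool."""
    rng = random.Random(seed)  # seeded so a test run can be reproduced
    names = list(pools)
    weights = [pools[name][0] for name in names]
    picks = rng.choices(names, weights=weights, k=total)
    return [rng.choice(pools[name][1]) for name in picks]


test_urls = build_url_list(POOLS, total=10_000)
print(len(test_urls), "URLs generated")
```

Keeping the "repeated" pool small is what makes it exercise the cache, while the larger pools approximate random access into uncached parts of the collection.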

Run that test for, say, 15 minutes, and we say it’s good enough if Loris serves more than 20k successful requests (which averages to ~22 requests per second).
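The arithmetic behind that threshold, as a small sketch (the constant and function names are illustrative only):

```python
TEST_DURATION_SECONDS = 15 * 60  # a 15-minute run
TARGET_REQUESTS = 20_000         # proposed minimum of successful requests

# Average rate Loris must sustain to reach the target.
required_rate = TARGET_REQUESTS / TEST_DURATION_SECONDS
print(f"required average rate: {required_rate:.1f} requests/sec")


def good_enough(successful_requests):
    """Apply the proposed criterion: more than 20k successful requests."""
    return successful_requests > TARGET_REQUESTS
```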

Does that sound plausible?

alexwlchan commented 7 years ago

Oh, and we should probably watch the CPU/memory use on the container to see how it does as we’re going along.

alexwlchan commented 7 years ago

Moving this back to Next for now – we need to make some changes to our Loris architecture before we’re ready to do load testing.

alexwlchan commented 7 years ago

Notes from our discussion just now:

99th percentile response time: 1.5s
95th percentile response time: 1s
99% of requests return 200 or 304
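One way these targets could be checked against recorded results is sketched below. It assumes the percentile figures are upper bounds on response time, that results are available as `(status_code, seconds)` pairs, and it uses the nearest-rank percentile definition; other definitions would give slightly different values. `percentile` and `meets_targets` are hypothetical helpers.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_targets(results):
    """results is a non-empty list of (status_code, seconds) tuples."""
    times = [seconds for _, seconds in results]
    ok = sum(1 for status, _ in results if status in (200, 304))
    return (
        percentile(times, 99) <= 1.5
        and percentile(times, 95) <= 1.0
        and ok / len(results) >= 0.99
    )


# Made-up sample: mostly fast responses, a few slower ones, one outlier.
sample = [(200, 0.5)] * 95 + [(304, 1.2)] * 4 + [(200, 3.0)]
print("targets met:", meets_targets(sample))
```

A single slow outlier does not fail the sample here, because the 99th percentile under nearest-rank is the 99th of 100 sorted values.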

And tests we want to run:

[ ] With and without CloudFront
[ ] Push it to breaking point
[ ] At one task
[ ] At two tasks

alexwlchan commented 7 years ago

Also, we probably want to expand the pool of images; currently only the first 1000 V images are under test.

kenoir commented 7 years ago

I'd say this is well and truly done!