wg / wrk

Modern HTTP benchmarking tool

Configurable timeout length #28

Closed. bhauer closed this issue 11 years ago.

bhauer commented 11 years ago

Hi again @wg. It has been brought to our attention that in some of the results from our framework benchmarks, Wrk reported socket timeouts that we failed to capture in our automatic scraping of results, which we ultimately turn into charts and tables.

For example, see the raw results we captured for Go at 256 concurrency on EC2 (scroll down to the bottom of this file):

https://github.com/TechEmpower/FrameworkBenchmarks/blob/master/results/ec2/20130404200504/json/go/raw

Would you be able to make the timeout period configurable? A quick spot-check of the code suggests (and I am probably wrong here!) that it's set to 2,000ms. I am probably wrong because other results such as the Cake tests show maximum latency higher than 2 seconds:

https://github.com/TechEmpower/FrameworkBenchmarks/blob/master/results/ec2/20130404200504/query/cake/raw

(Incidentally, that file has many more socket timeouts reported.)

If you think we're doing something wrong unrelated to the timeout, we'd love your advice.

Some additional questions:

  1. Can you give a brief description of the difference between a read error and a timeout? Is a "read error" indicative of the server responding with a 500 response?
  2. Can you double-check the Max req/sec per thread calculation? It looks to always be an integer but is rendered with decimal-point precision. E.g., within that Go results file linked above, at 128 concurrency, the max is "8.00k."
  3. Related to the above, do you have any conjecture as to why we see a thread's max req/sec that is so much higher than the average req/sec across all threads?

Thanks again for all of your help and I apologize for looping you in here. Please let me know how I can return the favor.

wg commented 11 years ago

Hello @bhauer,

Making the timeout interval a command-line option does seem like a reasonable thing to do. You are correct that the current timeout constant (SOCKET_TIMEOUT_MS) is 2s; however, timeouts are also only checked every 2s (TIMEOUT_INTERVAL_MS), so the number of timeouts isn't comprehensive, just an indicator of problems.
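Roughly speaking the check works like this (a simplified sketch of the mechanism, not a verbatim excerpt from the source; SOCKET_TIMEOUT_MS and TIMEOUT_INTERVAL_MS are the real constants, but the struct and function names below are invented for illustration). A slow request that starts and finishes between two checks is never counted:

    /* Hypothetical sketch of a periodic timeout check, not wrk's actual code. */
    #include <stddef.h>
    #include <stdint.h>

    #define SOCKET_TIMEOUT_MS   2000   /* a pending request is "late" after 2s */
    #define TIMEOUT_INTERVAL_MS 2000   /* ...but the check only runs every 2s  */

    typedef struct {
        int      has_pending_request;  /* response not yet fully received  */
        uint64_t request_start_ms;     /* when the in-flight request began */
    } connection;

    /* Called once per TIMEOUT_INTERVAL_MS from each thread's event loop. */
    static size_t count_timeouts(const connection *conns, size_t n, uint64_t now_ms) {
        size_t late = 0;
        for (size_t i = 0; i < n; i++) {
            if (conns[i].has_pending_request &&
                now_ms - conns[i].request_start_ms >= SOCKET_TIMEOUT_MS)
                late++;
        }
        return late;
    }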

connect/read/write errors are actual failures of calls to connect()/read()/write(), but timeouts just mean a complete response wasn't received in time. Note that the request does continue even when a timeout occurs.

HTTP status codes other than 2xx or 3xx are reported on a separate line if any occur.

Yes, the request rate is recorded at second granularity so max req/sec will always be a whole number. It's printed as a decimal to match the format of the other columns and rows.
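In simplified form (again just a sketch, not the actual bookkeeping in wrk; the names are invented), each thread effectively does something like this, which is why the max can only ever be a whole number:

    #include <stdint.h>

    typedef struct {
        uint64_t requests_this_second;  /* reset once per elapsed second  */
        uint64_t max_requests_per_sec;  /* a whole number by construction */
    } thread_rate;

    static void on_request_complete(thread_rate *t) {
        t->requests_this_second++;
    }

    /* Called once per elapsed second. */
    static void on_second_elapsed(thread_rate *t) {
        if (t->requests_this_second > t->max_requests_per_sec)
            t->max_requests_per_sec = t->requests_this_second;
        t->requests_this_second = 0;
    }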

I'm not quite sure what you mean about one thread's max req/sec being much higher than the average result, but I could guess that might occur with some sort of unfair scheduling.

bhauer commented 11 years ago

Hi @wg! As always, thanks for the prompt reply.

Would you mind taking a look at the following from the Go results for 256 concurrency on EC2?

Making 100000 requests to http://10.253.68.28:8080/json
  4 threads and 256 connections
  Thread Stats Avg Stdev Max +/- Stdev
    Latency 7.81ms 6.15ms 26.54ms 50.00%
    Req/Sec 2.94k 2.58k 7.00k 46.75%
  100000 requests in 7.26s, 13.35MB read
  Socket errors: connect 0, read 0, write 0, timeout 53
Requests/sec: 13769.23
Transfer/sec: 1.84MB

Since a request continues even if there is a timeout, and the timeout is 2 seconds (with a possible addition of up to 2 seconds due to sampling resolution), am I correct to assume that the maximum latency should be something over 2s here? Didn't those 53 timed-out requests take more than 2 seconds each? But the report says the maximum was 26.54ms. What am I misunderstanding?

You're probably right about some unfairness of scheduling for a thread reporting a significantly greater maximum than the average. I guess the max column for req/sec just seems odd to me since it's a whole number and unexpectedly disproportionate in this case.

wg commented 11 years ago

Ahh, I see what you're getting at: the thread statistics are based on random sampling and aren't comprehensive. Every 100ms (SAMPLE_INTERVAL_MS) a random connection is sampled for each thread, and a maximum of 100,000 (SAMPLES) samples are retained until the end of the test.
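In outline (a sketch only, not a verbatim excerpt; SAMPLE_INTERVAL_MS and SAMPLES are the real constants, the structures are simplified for illustration):

    #include <stdint.h>
    #include <stdlib.h>

    #define SAMPLE_INTERVAL_MS 100      /* one sample per thread per 100ms    */
    #define SAMPLES            100000   /* at most this many samples are kept */

    typedef struct {
        uint64_t latest_latency_us;     /* latency of the last completed request */
    } connection;

    typedef struct {
        connection *conns;
        size_t      conn_count;
        uint64_t    samples[SAMPLES];
        size_t      sample_count;
    } bench_thread;

    /* Called from each thread's event loop every SAMPLE_INTERVAL_MS:
     * pick one connection at random and record its latest latency. */
    static void record_sample(bench_thread *t) {
        if (t->sample_count >= SAMPLES || t->conn_count == 0) return;
        connection *c = &t->conns[(size_t)rand() % t->conn_count];
        t->samples[t->sample_count++] = c->latest_latency_us;
    }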

bhauer commented 11 years ago

Aha! That helps my understanding.

Although it seems that increasing the timeout will do nothing but remove that message from the "Socket errors" report at the end (since requests proceed anyway), I'd nevertheless like to keep this as a feature request, if you're willing to add it.

At the very least, that would put readers of the file at ease that we aren't rendering invalid data by ignoring the timeouts reported by Wrk. (That is, after all, what I suspected once I realized we had these error messages in our raw output files.)

jameslyo commented 11 years ago

So just to clarify my understanding here -- this implies that for the test cited above (100k requests taking 7.26s, with a reported average latency of 7.8ms), the reported statistics are actually based on a sample of roughly 72 connections spread evenly in time throughout the run, and from those 72 samples the latency max, mean, and stdev are calculated. Is that correct? Or -- I suppose it's 4x that, because it's sampled per thread (and the run used 4 threads)? (i.e. ~288 samples)

bhauer commented 11 years ago

That is how I understand it now too. Although, as I continue to think about it, my confusion comes back. I am surprised that, if 53 requests took longer than 2 seconds (that is, long enough to trigger the timeout warning), not one of those requests was caught by the ~288 random samples taken. I haven't worked the numbers carefully enough to know how statistically likely it would be to encounter one of those long requests, but naively it seems that the longer a request goes, the more likely it is to be "caught" by the random sampling.
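For what it's worth, here is a very rough back-of-the-envelope, under assumptions I can't verify: that each of the ~288 samples is an independent, uniform draw from the 100,000 completed requests, and that the 53 timeouts correspond to 53 distinct slow requests.

    P(a single sample misses all 53)  = 1 - 53/100000        = 0.99947
    P(all ~288 samples miss them)     = (1 - 53/100000)^288  ≈ 0.86
    P(at least one slow request seen) ≈ 0.14

So even under those assumptions, the most likely outcome is that none of the slow requests show up in the sample at all, which would at least be consistent with a reported maximum of 26.54ms.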

For the time being, I think my inclination for our framework benchmarks is to:

I think that even with that last point, in general, high-RPS platforms and frameworks tend to be low-latency and vice-versa. So I am comfortable with all of this. @jameslyo and @wg, do you think that's reasonable?

jameslyo commented 11 years ago

Average throughput should be the inverse of average latency. What is interesting about latency is really a characterization of the distribution, which unfortunately this doesn't provide as much of as I'd like to see -- but it's better than nothing. @wg, if you ever have the time, providing a way to get more complete latency numbers would be a nice feature to add. If you do go down that road, median and percentile calculations would be wonderful (something like 25, 50, 75, 90, 99, 99.9), as that characterizes the latency distribution much better than an average does. Just something to keep in mind.
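To sketch what I mean (purely hypothetical -- this is not an existing wrk feature, and the names are mine): given whatever latency samples were retained, reporting percentiles is essentially a sort plus an index:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    static int cmp_u64(const void *a, const void *b) {
        uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
        return (x > y) - (x < y);
    }

    /* Print a handful of percentiles from the retained latency samples (in us). */
    static void print_percentiles(uint64_t *samples, size_t n) {
        static const double pcts[] = { 25, 50, 75, 90, 99, 99.9 };
        if (n == 0) return;
        qsort(samples, n, sizeof(samples[0]), cmp_u64);
        for (size_t i = 0; i < sizeof(pcts) / sizeof(pcts[0]); i++) {
            size_t idx = (size_t)((pcts[i] / 100.0) * (double)(n - 1));
            printf("%5.1f%%  %llu us\n", pcts[i],
                   (unsigned long long)samples[idx]);
        }
    }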

As far as long requests being more likely to be sampled: I'd have to read the code to see how the sample is performed. If it's a sample from a bucket of "completed connections" that gets flushed and refilled every 100ms, then no, the 2s request has no more chance of being picked than any other. If it's a sample of outstanding requests, then it could be picked -- but then you would have to wait until it finished to record its result (making the implementation a bit tricky), and you would have to be careful not to pick it again. But then yes, there would be a substantially increased chance of picking a longer request for inclusion. I think I worked it out to still be only about an 8% chance, though there are too many assumptions in that to hold me to it.

@bhauer I think it's fine for now. I think it just means the latency numbers are less useful than they first appeared, but they are still something interesting to look at. Being clear about how they are calculated and how the results could be skewed (in either direction) as a result would be good (now that we know!). It would have alleviated my concern when I first saw the throughput/latency disparity in the numbers.

bhauer commented 11 years ago

@jameslyo Good point about the sample most likely being taken at the end of a request, and therefore being no more likely to catch a 2-second request than a 5-millisecond one. Without knowing the Wrk code here either, I agree that seems the most plausible.

Agreed on the need for clarity in the messaging. Bottom line: still interesting to have, but take the data with a grain of salt, especially with the very high performance frameworks that are more likely to have their exceptional cases missed by the Wrk latency sampler.

See my follow-up reply back in our repo for some more thoughts.

wg commented 11 years ago

@bhauer, @jameslyo interesting discussion! In my experience a web framework benchmark needs to run for longer than 7.26s to get an accurate feel for performance. Many times I've seen great performance up front then serious degradation over time.

wrk is designed for somewhat more intensive tests, say 100m requests over 60s, and it should scale up to tests that run for an hour or two. This is why I took a sampling approach and why the number of samples and the sampling intervals are tuned as they are.

bhauer commented 11 years ago

Hi @wg. Interesting points!

My colleague Pat (@pfalls1) and I have discussed this problem we have: given the orders of magnitude of performance that our suite of frameworks spans, a run of 100,000 requests can complete in mere seconds in one case but take hours in another. We ended up with 100,000 as our request count so that it felt reasonably high but didn't take days for the slower frameworks to complete.

Stupid question: does Wrk have an alternate mode that runs for a specified number of seconds? We could alter our tests to run for, say, 3 minutes each. We have approximately 15 tests per framework and around 45 frameworks at last count. This would also allow us to complete a full run in a predictable window of time (~45 minutes x ~45 frameworks = ~34 hours, which is a little less than our last full run time on EC2).

stormbrew commented 11 years ago

I'd just like to chime in and say that being able to run a benchmark on a time basis instead of a request basis is a great idea. Having to choose a number of requests forces you to guess at the result of the very thing you're trying to determine (rps) for no real gain. As you say, benchmarks should be run for a longer time than a few seconds.

jameslyo commented 11 years ago

@wg I'll agree here. I don't think it was really intended to be so short. I think the notion was to run the same number of requests through a bunch of different frameworks and see how they perform. Some run incredibly slowly and some much, much faster. The faster ones complete very quickly -- and in this context, as a result, they aren't sampled very much.

As I write this, @bhauer has just asked my next question, which was: why is the benchmark configured as a number of requests if the sampling is time-based? Why not let it run for a fixed amount of time? Or, alternatively, why not sample a fixed percentage of finished requests? Or use a config flag to switch between the two modes (time or requests) and base the sampling on whichever the user prefers: if they want 20 minutes of runtime, use time-based sampling; if they want 100M requests, use request-based sampling. That would be pretty useful in many contexts, I suspect.

wg commented 11 years ago

@bhauer @stormbrew @jameslyo yeah, I'm also thinking that using time, rather than requests, is a much better basis for running a benchmark. I pretty much always tune the number of requests based on how patient I'm feeling anyway.

defanator commented 11 years ago

Where is the "Vote" button here? :) +1 to the idea of adding something like "-d 10m" (d=duration)

wg commented 11 years ago

Thanks for your input everyone! And thanks for "-d/--duration" @defanator, perfect name for the option.

I've just pushed a commit that makes time, rather than requests, the basis for the benchmark duration. It makes a lot more sense and also resulted in quite a bit cleaner code, but do note that the -r option has been eliminated in favor of -d.

I also included a commit that adds a --timeout option to set the socket/request timeout.
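With both commits in place, an invocation along the lines of the earlier runs would look something like the following (the flag values are only an example, and the exact argument format for --timeout is whatever the updated usage text says):

    wrk -t 4 -c 256 -d 3m --timeout 5s http://10.253.68.28:8080/json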