Open zmoog opened 11 months ago
First, I'll read https://esrally.readthedocs.io/en/stable/quickstart.html to learn the Rally basics.
pipx install esrally
Checking the existing Rally benchmark for K8s at https://github.com/elastic/k8s-integration-infra/tree/cloudnative/cloudnative
Using a local cluster.
Create a track
$ esrally create-track --track="container-logs" --target-hosts=https://localhost:9200 --data-stream="logs-kubernetes.container_logs*" --client-options="use_ssl:true,verify_certs:false,ca_certs:false,basic_auth_user:'elastic',basic_auth_password:'changeme'"
____ ____
/ __ \____ _/ / /_ __
/ /_/ / __ `/ / / / / /
/ _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
/____/
[INFO] Connected to Elasticsearch cluster [5fbd93b394ed] version [8.10.0-SNAPSHOT].
Extracting documents for index [.ds-logs-kubernetes.co... 1000/1000 docs [100.0% done]
Extracting documents for index [.ds-logs-kubernete... 353379/353379 docs [100.0% done]
Extracting documents for index [.ds-logs-kubernetes.co... 1000/1000 docs [100.0% done]
Extracting documents for index [.ds-logs-kubernetes.... 10185/10185 docs [100.0% done]
[INFO] Track container-logs has been created. Run it with: esrally --track-path=/Users/zmoog/code/projects/elastic/k8s-integration-infra/container-logs/tracks/container-logs
---------------------------------
[INFO] SUCCESS (took 113 seconds)
---------------------------------
Checking the results:
$ find .
.
./tracks
./tracks/container-logs
./tracks/container-logs/.ds-logs-kubernetes.container_logs.agents-default-2023.08.30-000001-documents.json
./tracks/container-logs/.ds-logs-kubernetes.container_logs-default-2023.08.30-000001-documents.json.bz2
./tracks/container-logs/.ds-logs-kubernetes.container_logs.agents-default-2023.08.30-000001-documents-1k.json
./tracks/container-logs/.ds-logs-kubernetes.container_logs-default-2023.08.30-000001-documents-1k.json
./tracks/container-logs/.ds-logs-kubernetes.container_logs.agents-default-2023.08.30-000001-documents-1k.json.bz2
./tracks/container-logs/.ds-logs-kubernetes.container_logs.agents-default-2023.08.30-000001.json
./tracks/container-logs/.ds-logs-kubernetes.container_logs-default-2023.08.30-000001-documents.json
./tracks/container-logs/.ds-logs-kubernetes.container_logs-default-2023.08.30-000001-documents-1k.json.bz2
./tracks/container-logs/.ds-logs-kubernetes.container_logs-default-2023.08.30-000001.json
./tracks/container-logs/track.json
./tracks/container-logs/.ds-logs-kubernetes.container_logs.agents-default-2023.08.30-000001-documents.json.bz2
List tracks
$ esrally list tracks --track-path=tracks/container-logs
____ ____
/ __ \____ _/ / /_ __
/ /_/ / __ `/ / / / / /
/ _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
/____/
Available tracks:
Name Description Documents Compressed Size Uncompressed Size
-------------- ------------------------------------------ ----------- ----------------- -------------------
container-logs Tracker-generated track for container-logs 363,564 15.8 MB 931.1 MB
-------------------------------
[INFO] SUCCESS (took 0 seconds)
-------------------------------
List races
esrally list races
wait_for_status
from green
to yellow
since this is a single node cluster.$ esrally race \
--target-hosts=localhost:9200 \
--track-path=tracks/container-logs \
--client-options="use_ssl:true,verify_certs:false,ca_certs:false,basic_auth_user:'elastic',basic_auth_password:'changeme'" \
--pipeline=benchmark-only \
--kill-running-processes
____ ____
/ __ \____ _/ / /_ __
/ /_/ / __ `/ / / / / /
/ _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
/____/
[INFO] Race id is [63e80dc7-a658-4b67-90c5-a73e198f62ac]
[INFO] Racing on track [container-logs] and car ['external'] with version [8.10.0-SNAPSHOT].
[WARNING] merges_total_time is 195945 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] merges_total_throttled_time is 8581 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] indexing_total_time is 965545 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] refresh_total_time is 254226 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] flush_total_time is 2317453 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
Running delete-index [100% done]
Running create-index [100% done]
Running cluster-health [100% done]
Running bulk [100% done]
------------------------------------------------------
_______ __ _____
/ ____(_)___ ____ _/ / / ___/_________ ________
/ /_ / / __ \/ __ `/ / \__ \/ ___/ __ \/ ___/ _ \
/ __/ / / / / / /_/ / / ___/ / /__/ /_/ / / / __/
/_/ /_/_/ /_/\__,_/_/ /____/\___/\____/_/ \___/
------------------------------------------------------
| Metric | Task | Value | Unit |
|---------------------------------------------------------------:|-------:|-----------------:|-------:|
| Cumulative indexing time of primary shards | | 18.0194 | min |
| Min cumulative indexing time across primary shards | | 0 | min |
| Median cumulative indexing time across primary shards | | 0.0466667 | min |
| Max cumulative indexing time across primary shards | | 5.87675 | min |
| Cumulative indexing throttle time of primary shards | | 0 | min |
| Min cumulative indexing throttle time across primary shards | | 0 | min |
| Median cumulative indexing throttle time across primary shards | | 0 | min |
| Max cumulative indexing throttle time across primary shards | | 0 | min |
| Cumulative merge time of primary shards | | 3.68822 | min |
| Cumulative merge count of primary shards | | 1148 | |
| Min cumulative merge time across primary shards | | 0 | min |
| Median cumulative merge time across primary shards | | 0.0114667 | min |
| Max cumulative merge time across primary shards | | 1.04943 | min |
| Cumulative merge throttle time of primary shards | | 0.259733 | min |
| Min cumulative merge throttle time across primary shards | | 0 | min |
| Median cumulative merge throttle time across primary shards | | 0 | min |
| Max cumulative merge throttle time across primary shards | | 0.259733 | min |
| Cumulative refresh time of primary shards | | 4.27292 | min |
| Cumulative refresh count of primary shards | | 30297 | |
| Min cumulative refresh time across primary shards | | 0 | min |
| Median cumulative refresh time across primary shards | | 0.0077 | min |
| Max cumulative refresh time across primary shards | | 0.879883 | min |
| Cumulative flush time of primary shards | | 38.6595 | min |
| Cumulative flush count of primary shards | | 28281 | |
| Min cumulative flush time across primary shards | | 3.33333e-05 | min |
| Median cumulative flush time across primary shards | | 0.143767 | min |
| Max cumulative flush time across primary shards | | 7.96648 | min |
| Total Young Gen GC time | | 3.002 | s |
| Total Young Gen GC count | | 144 | |
| Total Old Gen GC time | | 0 | s |
| Total Old Gen GC count | | 0 | |
| Store size | | 0.996763 | GB |
| Translog size | | 0.0564239 | GB |
| Heap used for segments | | 0 | MB |
| Heap used for doc values | | 0 | MB |
| Heap used for terms | | 0 | MB |
| Heap used for norms | | 0 | MB |
| Heap used for points | | 0 | MB |
| Heap used for stored fields | | 0 | MB |
| Segment count | | 652 | |
| Total Ingest Pipeline count | | 298632 | |
| Total Ingest Pipeline time | | 24.567 | s |
| Total Ingest Pipeline failed | | 0 | |
| Min Throughput | bulk | 741.76 | docs/s |
| Mean Throughput | bulk | 7042.58 | docs/s |
| Median Throughput | bulk | 7071.05 | docs/s |
| Max Throughput | bulk | 9845.35 | docs/s |
| 50th percentile latency | bulk | 3459.26 | ms |
| 90th percentile latency | bulk | 6476.21 | ms |
| 100th percentile latency | bulk | 9193.49 | ms |
| 50th percentile service time | bulk | 3459.26 | ms |
| 90th percentile service time | bulk | 6476.21 | ms |
| 100th percentile service time | bulk | 9193.49 | ms |
| error rate | bulk | 16.25 | % |
[WARNING] Error rate is 16.25 for operation 'bulk'. Please check the logs.
--------------------------------
[INFO] SUCCESS (took 80 seconds)
--------------------------------
Since the error rate for the operation "bulk" is quite high (16.25%), I want to try this using a full fledged cluster instead of the cluster created locally by elastic-package for development.
Unfortunately, running rally on one of my test cluster in Elastic Cloud does not work and fails systematically:
$ esrally race
--target-hosts=https://workshop.es.eastus2.azure.elastic-cloud.com \
--track-path=tracks/container-logs \
--client-options="basic_auth_user:'esrally',basic_auth_password:'<REDACTED>'" \
--pipeline=benchmark-only \
--kill-running-processes \
--on-error=abort
____ ____
/ __ \____ _/ / /_ __
/ /_/ / __ `/ / / / / /
/ _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
/____/
[INFO] Race id is [12fae33c-e13c-469b-a23f-06b75c5bec98]
[INFO] Racing on track [container-logs] and car ['external'] with version [8.8.0].
[WARNING] merges_total_time is 2842617 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] merges_total_throttled_time is 240370 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] indexing_total_time is 12771173 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] refresh_total_time is 90916 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] flush_total_time is 4302426 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
Running delete-index [100% done]
Running create-index [100% done]
Running cluster-health [100% done]
Running bulk [ 0% done]
[WARNING] Could not terminate all internal processes within timeout. Please check and force-terminate all Rally processes.
[ERROR] Cannot race. Error in load generator [0]
Cannot run task [bulk]: Request returned an error. Error type: transport, Description: network connection timed out
Getting further help:
*********************
* Check the log files in /Users/zmoog/.rally/logs for errors.
* Read the documentation at https://esrally.readthedocs.io/en/2.8.0/.
* Ask a question on the forum at https://discuss.elastic.co/tags/c/elastic-stack/elasticsearch/rally.
* Raise an issue at https://github.com/elastic/rally/issues and include the log files in /Users/zmoog/.rally/logs.
---------------------------------
[INFO] FAILURE (took 139 seconds)
---------------------------------
The "bulk" step never moves beyond 0% and fails with a "network connection timed out" error.
The problem is that my home network and Internet connection (especially the uplink speed) are a bottleneck. Increasing the timeout from 60 to 120 seconds allows Rally to complete the race, but the throughput numbers could be better.
Here's the run with the client option timeout:120
:
esrally race
--target-hosts=https://<REDACTED> \
--track-path=tracks/container-logs \
--client-options="timeout:120,basic_auth_user:'esrally',basic_auth_password:'<REDACTED>'" \
--pipeline=benchmark-only \
--kill-running-processes \
--on-error=abort
Maybe we should take the metworking into account.
It seems this time the "bulk" task is progressing!
Okay, this time the race completed, but the numbers a very low:
____ ____
/ __ \____ _/ / /_ __
/ /_/ / __ `/ / / / / /
/ _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
/____/
[INFO] Race id is [4df59e6f-908f-4eca-8314-733db335d469]
[INFO] Racing on track [container-logs] and car ['external'] with version [8.8.0].
[WARNING] merges_total_time is 2842617 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] merges_total_throttled_time is 240370 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] indexing_total_time is 12771173 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] refresh_total_time is 90947 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] flush_total_time is 4302426 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
Running delete-index [100% done]
Running create-index [100% done]
Running cluster-health [100% done]
Running bulk [100% done]
------------------------------------------------------
_______ __ _____
/ ____(_)___ ____ _/ / / ___/_________ ________
/ /_ / / __ \/ __ `/ / \__ \/ ___/ __ \/ ___/ _ \
/ __/ / / / / / /_/ / / ___/ / /__/ /_/ / / / __/
/_/ /_/_/ /_/\__,_/_/ /____/\___/\____/_/ \___/
------------------------------------------------------
| Metric | Task | Value | Unit |
|---------------------------------------------------------------:|-------:|-----------------:|-------:|
| Cumulative indexing time of primary shards | | 212.853 | min |
| Min cumulative indexing time across primary shards | | 0 | min |
| Median cumulative indexing time across primary shards | | 0 | min |
| Max cumulative indexing time across primary shards | | 108.049 | min |
| Cumulative indexing throttle time of primary shards | | 0 | min |
| Min cumulative indexing throttle time across primary shards | | 0 | min |
| Median cumulative indexing throttle time across primary shards | | 0 | min |
| Max cumulative indexing throttle time across primary shards | | 0 | min |
| Cumulative merge time of primary shards | | 47.377 | min |
| Cumulative merge count of primary shards | | 1787 | |
| Min cumulative merge time across primary shards | | 0 | min |
| Median cumulative merge time across primary shards | | 0 | min |
| Max cumulative merge time across primary shards | | 23.6138 | min |
| Cumulative merge throttle time of primary shards | | 4.00617 | min |
| Min cumulative merge throttle time across primary shards | | 0 | min |
| Median cumulative merge throttle time across primary shards | | 0 | min |
| Max cumulative merge throttle time across primary shards | | 2.0599 | min |
| Cumulative refresh time of primary shards | | 1.51613 | min |
| Cumulative refresh count of primary shards | | 251216 | |
| Min cumulative refresh time across primary shards | | 0.0001 | min |
| Median cumulative refresh time across primary shards | | 0.000416667 | min |
| Max cumulative refresh time across primary shards | | 0.292683 | min |
| Cumulative flush time of primary shards | | 71.7071 | min |
| Cumulative flush count of primary shards | | 55177 | |
| Min cumulative flush time across primary shards | | 0 | min |
| Median cumulative flush time across primary shards | | 0.0003 | min |
| Max cumulative flush time across primary shards | | 17.3939 | min |
| Total Young Gen GC time | | 6.905 | s |
| Total Young Gen GC count | | 259 | |
| Total Old Gen GC time | | 0 | s |
| Total Old Gen GC count | | 0 | |
| Store size | | 97.3903 | GB |
| Translog size | | 3.1041e-05 | GB |
| Heap used for segments | | 0 | MB |
| Heap used for doc values | | 0 | MB |
| Heap used for terms | | 0 | MB |
| Heap used for norms | | 0 | MB |
| Heap used for points | | 0 | MB |
| Heap used for stored fields | | 0 | MB |
| Segment count | | 2665 | |
| Total Ingest Pipeline count | | 353379 | |
| Total Ingest Pipeline time | | 19.309 | s |
| Total Ingest Pipeline failed | | 0 | |
| Min Throughput | bulk | 72.45 | docs/s |
| Mean Throughput | bulk | 609.51 | docs/s |
| Median Throughput | bulk | 639.76 | docs/s |
| Max Throughput | bulk | 670.41 | docs/s |
| 50th percentile latency | bulk | 52924.8 | ms |
| 90th percentile latency | bulk | 63883.9 | ms |
| 100th percentile latency | bulk | 73875.6 | ms |
| 50th percentile service time | bulk | 52924.8 | ms |
| 90th percentile service time | bulk | 63883.9 | ms |
| 100th percentile service time | bulk | 73875.6 | ms |
| error rate | bulk | 0 | % |
---------------------------------
[INFO] SUCCESS (took 561 seconds)
---------------------------------
I am setting up an EC2 instance to run this test again.
Copying the track to the EC2 instance:
scp -i path/to/key.pem -r container-logs ec2-user@ec2-1-2-3-4.eu-west-1.compute.amazonaws.com:/home/ec2-user/path/to/whatever
Now using the following setup:
esrally race \
--target-hosts=https://<REDACTED> \
--track-path=tracks/container-logs \
--client-options="use_ssl:true,basic_auth_user:'elastic',basic_auth_password:'<REDACTED>'" \
--pipeline=benchmark-only \
--kill-running-processes \
--on-error=abort
____ ____
/ __ \____ _/ / /_ __
/ /_/ / __ `/ / / / / /
/ _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
/____/
[INFO] Race id is [aa3645d4-0bb9-4918-946d-47d648e87b40]
[INFO] Racing on track [container-logs] and car ['external'] with version [8.10.0].
Running delete-index [100% done]
Running create-index [100% done]
Running cluster-health [100% done]
Running bulk [100% done]
------------------------------------------------------
_______ __ _____
/ ____(_)___ ____ _/ / / ___/_________ ________
/ /_ / / __ \/ __ `/ / \__ \/ ___/ __ \/ ___/ _ \
/ __/ / / / / / /_/ / / ___/ / /__/ /_/ / / / __/
/_/ /_/_/ /_/\__,_/_/ /____/\___/\____/_/ \___/
------------------------------------------------------
| Metric | Task | Value | Unit |
|---------------------------------------------------------------:|-------:|-----------------:|-------:|
| Cumulative indexing time of primary shards | | 0.00015 | min |
| Min cumulative indexing time across primary shards | | 0 | min |
| Median cumulative indexing time across primary shards | | 0 | min |
| Max cumulative indexing time across primary shards | | 8.33333e-05 | min |
| Cumulative indexing throttle time of primary shards | | 0 | min |
| Min cumulative indexing throttle time across primary shards | | 0 | min |
| Median cumulative indexing throttle time across primary shards | | 0 | min |
| Max cumulative indexing throttle time across primary shards | | 0 | min |
| Cumulative merge time of primary shards | | 0 | min |
| Cumulative merge count of primary shards | | 0 | |
| Min cumulative merge time across primary shards | | 0 | min |
| Median cumulative merge time across primary shards | | 0 | min |
| Max cumulative merge time across primary shards | | 0 | min |
| Cumulative merge throttle time of primary shards | | 0 | min |
| Min cumulative merge throttle time across primary shards | | 0 | min |
| Median cumulative merge throttle time across primary shards | | 0 | min |
| Max cumulative merge throttle time across primary shards | | 0 | min |
| Cumulative refresh time of primary shards | | 0.0024 | min |
| Cumulative refresh count of primary shards | | 1448 | |
| Min cumulative refresh time across primary shards | | 0 | min |
| Median cumulative refresh time across primary shards | | 0 | min |
| Max cumulative refresh time across primary shards | | 0.00103333 | min |
| Cumulative flush time of primary shards | | 0.0100333 | min |
| Cumulative flush count of primary shards | | 15 | |
| Min cumulative flush time across primary shards | | 0.00025 | min |
| Median cumulative flush time across primary shards | | 0.000316667 | min |
| Max cumulative flush time across primary shards | | 0.00306667 | min |
| Total Young Gen GC time | | 0.67 | s |
| Total Young Gen GC count | | 18 | |
| Total Old Gen GC time | | 0 | s |
| Total Old Gen GC count | | 0 | |
| Store size | | 0.000137327 | GB |
| Translog size | | 1.1269e-06 | GB |
| Heap used for segments | | 0 | MB |
| Heap used for doc values | | 0 | MB |
| Heap used for terms | | 0 | MB |
| Heap used for norms | | 0 | MB |
| Heap used for points | | 0 | MB |
| Heap used for stored fields | | 0 | MB |
| Segment count | | 4 | |
| Total Ingest Pipeline count | | 353379 | |
| Total Ingest Pipeline time | | 17.575 | s |
| Total Ingest Pipeline failed | | 0 | |
| Min Throughput | bulk | 883.99 | docs/s |
| Mean Throughput | bulk | 7917.35 | docs/s |
| Median Throughput | bulk | 8278.41 | docs/s |
| Max Throughput | bulk | 9813.75 | docs/s |
| 50th percentile latency | bulk | 3405.52 | ms |
| 90th percentile latency | bulk | 4475.19 | ms |
| 100th percentile latency | bulk | 6197.52 | ms |
| 50th percentile service time | bulk | 3405.52 | ms |
| 90th percentile service time | bulk | 4475.19 | ms |
| 100th percentile service time | bulk | 6197.52 | ms |
| error rate | bulk | 0 | % |
--------------------------------
[INFO] SUCCESS (took 82 seconds)
--------------------------------
I am adding support for the new routing rules in the Kubernetes integration.
Adding routing involve using the new reroute processor in the container-logs integration. We want to check the performance impact of adding it.