zmoog / public-notes

Apache License 2.0
0 stars 1 forks source link

Figure out how to benchmark the ingest performance for the Kubernetes integration #44

Open zmoog opened 11 months ago

zmoog commented 11 months ago

I am adding support for the new routing rules in the Kubernetes integration.

Adding routing involve using the new reroute processor in the container-logs integration. We want to check the performance impact of adding it.

zmoog commented 11 months ago

First, I'll read https://esrally.readthedocs.io/en/stable/quickstart.html to learn the Rally basics.

zmoog commented 11 months ago

Install

pipx install esrally
zmoog commented 11 months ago

Checking the existing Rally benchmark for K8s at https://github.com/elastic/k8s-integration-infra/tree/cloudnative/cloudnative

zmoog commented 10 months ago

Cluster

Using a local cluster.

Tracks

Create a track

$ esrally create-track --track="container-logs" --target-hosts=https://localhost:9200 --data-stream="logs-kubernetes.container_logs*" --client-options="use_ssl:true,verify_certs:false,ca_certs:false,basic_auth_user:'elastic',basic_auth_password:'changeme'"

    ____        ____
   / __ \____ _/ / /_  __
  / /_/ / __ `/ / / / / /
 / _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
                /____/

[INFO] Connected to Elasticsearch cluster [5fbd93b394ed] version [8.10.0-SNAPSHOT].

Extracting documents for index [.ds-logs-kubernetes.co...     1000/1000 docs [100.0% done]
Extracting documents for index [.ds-logs-kubernete...     353379/353379 docs [100.0% done]
Extracting documents for index [.ds-logs-kubernetes.co...     1000/1000 docs [100.0% done]
Extracting documents for index [.ds-logs-kubernetes....     10185/10185 docs [100.0% done]

[INFO] Track container-logs has been created. Run it with: esrally --track-path=/Users/zmoog/code/projects/elastic/k8s-integration-infra/container-logs/tracks/container-logs

---------------------------------
[INFO] SUCCESS (took 113 seconds)
---------------------------------

Checking the results:

$ find .
.
./tracks
./tracks/container-logs
./tracks/container-logs/.ds-logs-kubernetes.container_logs.agents-default-2023.08.30-000001-documents.json
./tracks/container-logs/.ds-logs-kubernetes.container_logs-default-2023.08.30-000001-documents.json.bz2
./tracks/container-logs/.ds-logs-kubernetes.container_logs.agents-default-2023.08.30-000001-documents-1k.json
./tracks/container-logs/.ds-logs-kubernetes.container_logs-default-2023.08.30-000001-documents-1k.json
./tracks/container-logs/.ds-logs-kubernetes.container_logs.agents-default-2023.08.30-000001-documents-1k.json.bz2
./tracks/container-logs/.ds-logs-kubernetes.container_logs.agents-default-2023.08.30-000001.json
./tracks/container-logs/.ds-logs-kubernetes.container_logs-default-2023.08.30-000001-documents.json
./tracks/container-logs/.ds-logs-kubernetes.container_logs-default-2023.08.30-000001-documents-1k.json.bz2
./tracks/container-logs/.ds-logs-kubernetes.container_logs-default-2023.08.30-000001.json
./tracks/container-logs/track.json
./tracks/container-logs/.ds-logs-kubernetes.container_logs.agents-default-2023.08.30-000001-documents.json.bz2

List tracks

$ esrally list tracks --track-path=tracks/container-logs

    ____        ____
   / __ \____ _/ / /_  __
  / /_/ / __ `/ / / / / /
 / _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
                /____/

Available tracks:

Name            Description                                 Documents    Compressed Size    Uncompressed Size
--------------  ------------------------------------------  -----------  -----------------  -------------------
container-logs  Tracker-generated track for container-logs  363,564      15.8 MB            931.1 MB

-------------------------------
[INFO] SUCCESS (took 0 seconds)
-------------------------------

List races

esrally list races
$ esrally race \
    --target-hosts=localhost:9200 \
    --track-path=tracks/container-logs \
    --client-options="use_ssl:true,verify_certs:false,ca_certs:false,basic_auth_user:'elastic',basic_auth_password:'changeme'" \
    --pipeline=benchmark-only \
    --kill-running-processes

    ____        ____
   / __ \____ _/ / /_  __
  / /_/ / __ `/ / / / / /
 / _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
                /____/

[INFO] Race id is [63e80dc7-a658-4b67-90c5-a73e198f62ac]
[INFO] Racing on track [container-logs] and car ['external'] with version [8.10.0-SNAPSHOT].

[WARNING] merges_total_time is 195945 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] merges_total_throttled_time is 8581 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] indexing_total_time is 965545 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] refresh_total_time is 254226 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] flush_total_time is 2317453 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
Running delete-index                                                           [100% done]
Running create-index                                                           [100% done]
Running cluster-health                                                         [100% done]
Running bulk                                                                   [100% done]

------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------

|                                                         Metric |   Task |            Value |   Unit |
|---------------------------------------------------------------:|-------:|-----------------:|-------:|
|                     Cumulative indexing time of primary shards |        |     18.0194      |    min |
|             Min cumulative indexing time across primary shards |        |      0           |    min |
|          Median cumulative indexing time across primary shards |        |      0.0466667   |    min |
|             Max cumulative indexing time across primary shards |        |      5.87675     |    min |
|            Cumulative indexing throttle time of primary shards |        |      0           |    min |
|    Min cumulative indexing throttle time across primary shards |        |      0           |    min |
| Median cumulative indexing throttle time across primary shards |        |      0           |    min |
|    Max cumulative indexing throttle time across primary shards |        |      0           |    min |
|                        Cumulative merge time of primary shards |        |      3.68822     |    min |
|                       Cumulative merge count of primary shards |        |   1148           |        |
|                Min cumulative merge time across primary shards |        |      0           |    min |
|             Median cumulative merge time across primary shards |        |      0.0114667   |    min |
|                Max cumulative merge time across primary shards |        |      1.04943     |    min |
|               Cumulative merge throttle time of primary shards |        |      0.259733    |    min |
|       Min cumulative merge throttle time across primary shards |        |      0           |    min |
|    Median cumulative merge throttle time across primary shards |        |      0           |    min |
|       Max cumulative merge throttle time across primary shards |        |      0.259733    |    min |
|                      Cumulative refresh time of primary shards |        |      4.27292     |    min |
|                     Cumulative refresh count of primary shards |        |  30297           |        |
|              Min cumulative refresh time across primary shards |        |      0           |    min |
|           Median cumulative refresh time across primary shards |        |      0.0077      |    min |
|              Max cumulative refresh time across primary shards |        |      0.879883    |    min |
|                        Cumulative flush time of primary shards |        |     38.6595      |    min |
|                       Cumulative flush count of primary shards |        |  28281           |        |
|                Min cumulative flush time across primary shards |        |      3.33333e-05 |    min |
|             Median cumulative flush time across primary shards |        |      0.143767    |    min |
|                Max cumulative flush time across primary shards |        |      7.96648     |    min |
|                                        Total Young Gen GC time |        |      3.002       |      s |
|                                       Total Young Gen GC count |        |    144           |        |
|                                          Total Old Gen GC time |        |      0           |      s |
|                                         Total Old Gen GC count |        |      0           |        |
|                                                     Store size |        |      0.996763    |     GB |
|                                                  Translog size |        |      0.0564239   |     GB |
|                                         Heap used for segments |        |      0           |     MB |
|                                       Heap used for doc values |        |      0           |     MB |
|                                            Heap used for terms |        |      0           |     MB |
|                                            Heap used for norms |        |      0           |     MB |
|                                           Heap used for points |        |      0           |     MB |
|                                    Heap used for stored fields |        |      0           |     MB |
|                                                  Segment count |        |    652           |        |
|                                    Total Ingest Pipeline count |        | 298632           |        |
|                                     Total Ingest Pipeline time |        |     24.567       |      s |
|                                   Total Ingest Pipeline failed |        |      0           |        |
|                                                 Min Throughput |   bulk |    741.76        | docs/s |
|                                                Mean Throughput |   bulk |   7042.58        | docs/s |
|                                              Median Throughput |   bulk |   7071.05        | docs/s |
|                                                 Max Throughput |   bulk |   9845.35        | docs/s |
|                                        50th percentile latency |   bulk |   3459.26        |     ms |
|                                        90th percentile latency |   bulk |   6476.21        |     ms |
|                                       100th percentile latency |   bulk |   9193.49        |     ms |
|                                   50th percentile service time |   bulk |   3459.26        |     ms |
|                                   90th percentile service time |   bulk |   6476.21        |     ms |
|                                  100th percentile service time |   bulk |   9193.49        |     ms |
|                                                     error rate |   bulk |     16.25        |      % |

[WARNING] Error rate is 16.25 for operation 'bulk'. Please check the logs.

--------------------------------
[INFO] SUCCESS (took 80 seconds)
--------------------------------
zmoog commented 10 months ago

Since the error rate for the operation "bulk" is quite high (16.25%), I want to try this using a full fledged cluster instead of the cluster created locally by elastic-package for development.

zmoog commented 10 months ago

Unfortunately, running rally on one of my test cluster in Elastic Cloud does not work and fails systematically:

$ esrally race 
    --target-hosts=https://workshop.es.eastus2.azure.elastic-cloud.com \
    --track-path=tracks/container-logs \
    --client-options="basic_auth_user:'esrally',basic_auth_password:'<REDACTED>'" \ 
    --pipeline=benchmark-only \
    --kill-running-processes \
    --on-error=abort

    ____        ____
   / __ \____ _/ / /_  __
  / /_/ / __ `/ / / / / /
 / _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
                /____/

[INFO] Race id is [12fae33c-e13c-469b-a23f-06b75c5bec98]
[INFO] Racing on track [container-logs] and car ['external'] with version [8.8.0].

[WARNING] merges_total_time is 2842617 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] merges_total_throttled_time is 240370 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] indexing_total_time is 12771173 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] refresh_total_time is 90916 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] flush_total_time is 4302426 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
Running delete-index                                                           [100% done]
Running create-index                                                           [100% done]
Running cluster-health                                                         [100% done]
Running bulk                                                                   [  0% done]

[WARNING] Could not terminate all internal processes within timeout. Please check and force-terminate all Rally processes.
[ERROR] Cannot race. Error in load generator [0]
        Cannot run task [bulk]: Request returned an error. Error type: transport, Description: network connection timed out

Getting further help:
*********************
* Check the log files in /Users/zmoog/.rally/logs for errors.
* Read the documentation at https://esrally.readthedocs.io/en/2.8.0/.
* Ask a question on the forum at https://discuss.elastic.co/tags/c/elastic-stack/elasticsearch/rally.
* Raise an issue at https://github.com/elastic/rally/issues and include the log files in /Users/zmoog/.rally/logs.

---------------------------------
[INFO] FAILURE (took 139 seconds)
---------------------------------

The "bulk" step never moves beyond 0% and fails with a "network connection timed out" error.

zmoog commented 10 months ago

The problem is that my home network and Internet connection (especially the uplink speed) are a bottleneck. Increasing the timeout from 60 to 120 seconds allows Rally to complete the race, but the throughput numbers could be better.

Here's the run with the client option timeout:120:

esrally race 
    --target-hosts=https://<REDACTED> \
    --track-path=tracks/container-logs \
    --client-options="timeout:120,basic_auth_user:'esrally',basic_auth_password:'<REDACTED>'" \
    --pipeline=benchmark-only \
    --kill-running-processes \
    --on-error=abort

Maybe we should take the metworking into account.

It seems this time the "bulk" task is progressing!

Okay, this time the race completed, but the numbers a very low:

    ____        ____
   / __ \____ _/ / /_  __
  / /_/ / __ `/ / / / / /
 / _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
                /____/

[INFO] Race id is [4df59e6f-908f-4eca-8314-733db335d469]
[INFO] Racing on track [container-logs] and car ['external'] with version [8.8.0].

[WARNING] merges_total_time is 2842617 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] merges_total_throttled_time is 240370 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] indexing_total_time is 12771173 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] refresh_total_time is 90947 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] flush_total_time is 4302426 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
Running delete-index                                                           [100% done]
Running create-index                                                           [100% done]
Running cluster-health                                                         [100% done]
Running bulk                                                                   [100% done]

------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------

|                                                         Metric |   Task |            Value |   Unit |
|---------------------------------------------------------------:|-------:|-----------------:|-------:|
|                     Cumulative indexing time of primary shards |        |    212.853       |    min |
|             Min cumulative indexing time across primary shards |        |      0           |    min |
|          Median cumulative indexing time across primary shards |        |      0           |    min |
|             Max cumulative indexing time across primary shards |        |    108.049       |    min |
|            Cumulative indexing throttle time of primary shards |        |      0           |    min |
|    Min cumulative indexing throttle time across primary shards |        |      0           |    min |
| Median cumulative indexing throttle time across primary shards |        |      0           |    min |
|    Max cumulative indexing throttle time across primary shards |        |      0           |    min |
|                        Cumulative merge time of primary shards |        |     47.377       |    min |
|                       Cumulative merge count of primary shards |        |   1787           |        |
|                Min cumulative merge time across primary shards |        |      0           |    min |
|             Median cumulative merge time across primary shards |        |      0           |    min |
|                Max cumulative merge time across primary shards |        |     23.6138      |    min |
|               Cumulative merge throttle time of primary shards |        |      4.00617     |    min |
|       Min cumulative merge throttle time across primary shards |        |      0           |    min |
|    Median cumulative merge throttle time across primary shards |        |      0           |    min |
|       Max cumulative merge throttle time across primary shards |        |      2.0599      |    min |
|                      Cumulative refresh time of primary shards |        |      1.51613     |    min |
|                     Cumulative refresh count of primary shards |        | 251216           |        |
|              Min cumulative refresh time across primary shards |        |      0.0001      |    min |
|           Median cumulative refresh time across primary shards |        |      0.000416667 |    min |
|              Max cumulative refresh time across primary shards |        |      0.292683    |    min |
|                        Cumulative flush time of primary shards |        |     71.7071      |    min |
|                       Cumulative flush count of primary shards |        |  55177           |        |
|                Min cumulative flush time across primary shards |        |      0           |    min |
|             Median cumulative flush time across primary shards |        |      0.0003      |    min |
|                Max cumulative flush time across primary shards |        |     17.3939      |    min |
|                                        Total Young Gen GC time |        |      6.905       |      s |
|                                       Total Young Gen GC count |        |    259           |        |
|                                          Total Old Gen GC time |        |      0           |      s |
|                                         Total Old Gen GC count |        |      0           |        |
|                                                     Store size |        |     97.3903      |     GB |
|                                                  Translog size |        |      3.1041e-05  |     GB |
|                                         Heap used for segments |        |      0           |     MB |
|                                       Heap used for doc values |        |      0           |     MB |
|                                            Heap used for terms |        |      0           |     MB |
|                                            Heap used for norms |        |      0           |     MB |
|                                           Heap used for points |        |      0           |     MB |
|                                    Heap used for stored fields |        |      0           |     MB |
|                                                  Segment count |        |   2665           |        |
|                                    Total Ingest Pipeline count |        | 353379           |        |
|                                     Total Ingest Pipeline time |        |     19.309       |      s |
|                                   Total Ingest Pipeline failed |        |      0           |        |
|                                                 Min Throughput |   bulk |     72.45        | docs/s |
|                                                Mean Throughput |   bulk |    609.51        | docs/s |
|                                              Median Throughput |   bulk |    639.76        | docs/s |
|                                                 Max Throughput |   bulk |    670.41        | docs/s |
|                                        50th percentile latency |   bulk |  52924.8         |     ms |
|                                        90th percentile latency |   bulk |  63883.9         |     ms |
|                                       100th percentile latency |   bulk |  73875.6         |     ms |
|                                   50th percentile service time |   bulk |  52924.8         |     ms |
|                                   90th percentile service time |   bulk |  63883.9         |     ms |
|                                  100th percentile service time |   bulk |  73875.6         |     ms |
|                                                     error rate |   bulk |      0           |      % |

---------------------------------
[INFO] SUCCESS (took 561 seconds)
---------------------------------
zmoog commented 10 months ago

I am setting up an EC2 instance to run this test again.

Copying the track to the EC2 instance:

scp -i path/to/key.pem -r container-logs ec2-user@ec2-1-2-3-4.eu-west-1.compute.amazonaws.com:/home/ec2-user/path/to/whatever
zmoog commented 10 months ago

Now using the following setup:

esrally race \
    --target-hosts=https://<REDACTED> \
    --track-path=tracks/container-logs \
    --client-options="use_ssl:true,basic_auth_user:'elastic',basic_auth_password:'<REDACTED>'" \
    --pipeline=benchmark-only \
    --kill-running-processes \
    --on-error=abort

    ____        ____
   / __ \____ _/ / /_  __
  / /_/ / __ `/ / / / / /
 / _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
                /____/

[INFO] Race id is [aa3645d4-0bb9-4918-946d-47d648e87b40]
[INFO] Racing on track [container-logs] and car ['external'] with version [8.10.0].

Running delete-index                                                           [100% done]
Running create-index                                                           [100% done]
Running cluster-health                                                         [100% done]
Running bulk                                                                   [100% done]

------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------

|                                                         Metric |   Task |            Value |   Unit |
|---------------------------------------------------------------:|-------:|-----------------:|-------:|
|                     Cumulative indexing time of primary shards |        |      0.00015     |    min |
|             Min cumulative indexing time across primary shards |        |      0           |    min |
|          Median cumulative indexing time across primary shards |        |      0           |    min |
|             Max cumulative indexing time across primary shards |        |      8.33333e-05 |    min |
|            Cumulative indexing throttle time of primary shards |        |      0           |    min |
|    Min cumulative indexing throttle time across primary shards |        |      0           |    min |
| Median cumulative indexing throttle time across primary shards |        |      0           |    min |
|    Max cumulative indexing throttle time across primary shards |        |      0           |    min |
|                        Cumulative merge time of primary shards |        |      0           |    min |
|                       Cumulative merge count of primary shards |        |      0           |        |
|                Min cumulative merge time across primary shards |        |      0           |    min |
|             Median cumulative merge time across primary shards |        |      0           |    min |
|                Max cumulative merge time across primary shards |        |      0           |    min |
|               Cumulative merge throttle time of primary shards |        |      0           |    min |
|       Min cumulative merge throttle time across primary shards |        |      0           |    min |
|    Median cumulative merge throttle time across primary shards |        |      0           |    min |
|       Max cumulative merge throttle time across primary shards |        |      0           |    min |
|                      Cumulative refresh time of primary shards |        |      0.0024      |    min |
|                     Cumulative refresh count of primary shards |        |   1448           |        |
|              Min cumulative refresh time across primary shards |        |      0           |    min |
|           Median cumulative refresh time across primary shards |        |      0           |    min |
|              Max cumulative refresh time across primary shards |        |      0.00103333  |    min |
|                        Cumulative flush time of primary shards |        |      0.0100333   |    min |
|                       Cumulative flush count of primary shards |        |     15           |        |
|                Min cumulative flush time across primary shards |        |      0.00025     |    min |
|             Median cumulative flush time across primary shards |        |      0.000316667 |    min |
|                Max cumulative flush time across primary shards |        |      0.00306667  |    min |
|                                        Total Young Gen GC time |        |      0.67        |      s |
|                                       Total Young Gen GC count |        |     18           |        |
|                                          Total Old Gen GC time |        |      0           |      s |
|                                         Total Old Gen GC count |        |      0           |        |
|                                                     Store size |        |      0.000137327 |     GB |
|                                                  Translog size |        |      1.1269e-06  |     GB |
|                                         Heap used for segments |        |      0           |     MB |
|                                       Heap used for doc values |        |      0           |     MB |
|                                            Heap used for terms |        |      0           |     MB |
|                                            Heap used for norms |        |      0           |     MB |
|                                           Heap used for points |        |      0           |     MB |
|                                    Heap used for stored fields |        |      0           |     MB |
|                                                  Segment count |        |      4           |        |
|                                    Total Ingest Pipeline count |        | 353379           |        |
|                                     Total Ingest Pipeline time |        |     17.575       |      s |
|                                   Total Ingest Pipeline failed |        |      0           |        |
|                                                 Min Throughput |   bulk |    883.99        | docs/s |
|                                                Mean Throughput |   bulk |   7917.35        | docs/s |
|                                              Median Throughput |   bulk |   8278.41        | docs/s |
|                                                 Max Throughput |   bulk |   9813.75        | docs/s |
|                                        50th percentile latency |   bulk |   3405.52        |     ms |
|                                        90th percentile latency |   bulk |   4475.19        |     ms |
|                                       100th percentile latency |   bulk |   6197.52        |     ms |
|                                   50th percentile service time |   bulk |   3405.52        |     ms |
|                                   90th percentile service time |   bulk |   4475.19        |     ms |
|                                  100th percentile service time |   bulk |   6197.52        |     ms |
|                                                     error rate |   bulk |      0           |      % |

--------------------------------
[INFO] SUCCESS (took 82 seconds)
--------------------------------