zmoog / public-notes

Apache License 2.0

Figure out how to performance test the aws-s3 input in polling mode #77

Open zmoog opened 7 months ago

zmoog commented 7 months ago

I want to run performance tests on the aws-s3 input using Elastic Agent 8.12.0 to ingest CloudTrail logs.

The test will run on the following EC2 instance.

| Instance type | Architecture | vCPU | Memory (GiB) | Network bandwidth (Gbps) |
|---------------|--------------|------|--------------|--------------------------|
| c7g.2xlarge   | arm64        | 8    | 16           | Up to 15                 |

The goals are:

Authentication

We will use an EC2 instance profile only, so there will be no authentication-related options in the integration settings.
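With an instance profile, the input picks up credentials from the instance metadata service, so the only IAM work is attaching a role with read access to the bucket. A minimal policy sketch (the bucket name is redacted in this issue, so `<REDACTED-BUCKET>` is a placeholder):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::<REDACTED-BUCKET>"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::<REDACTED-BUCKET>/*"
    }
  ]
}
```

`s3:ListBucket` covers the polling (bucket listing) side and `s3:GetObject` the object downloads.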

Dataset

As a test dataset, I will use an S3 bucket containing 1.2M objects of Cloudtrail logs.

I set up the integration to process only S3 objects created in 2024 (301,254 objects at the time of the test). To achieve this, I set the file selector to match the following regex:

/CloudTrail/[a-z]{2}-[a-z]+-\d+/2024/.*$
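A quick way to sanity-check the selector before pointing the agent at 1.2M objects is to run the regex against a few sample keys. The keys below are hypothetical, following the usual `AWSLogs/<account>/CloudTrail/<region>/<year>/...` layout:

```python
import re

# Same pattern as the file selector above.
selector = re.compile(r"/CloudTrail/[a-z]{2}-[a-z]+-\d+/2024/.*$")

keys = [
    "AWSLogs/123456789012/CloudTrail/us-east-1/2024/02/09/x.json.gz",         # 2024 log: matches
    "AWSLogs/123456789012/CloudTrail/us-east-1/2023/12/31/x.json.gz",         # wrong year: no match
    "AWSLogs/123456789012/CloudTrail-Digest/us-east-1/2024/02/09/x.json.gz",  # digest prefix: no match
]

for key in keys:
    print(key, bool(selector.search(key)))
```

Note the digest key does not match because the pattern requires a literal `/CloudTrail/` segment, which is why the policy lists `/CloudTrail-Digest/` and `/CloudTrail-Insight/` as separate selectors.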

Here is the agent policy:

  - id: aws-s3-cloudtrail-fb93f962-4474-4f15-9032-0473c01ae54b
    name: aws-21
    revision: 12
    type: aws-s3
    use_output: 6a7e2784-665c-4208-afac-380bac62351e
    meta:
      package:
        name: aws
        version: 2.8.5
    data_stream:
      namespace: sdh3988.3
    package_policy_id: fb93f962-4474-4f15-9032-0473c01ae54b
    streams:
      - id: aws-s3-aws.cloudtrail-fb93f962-4474-4f15-9032-0473c01ae54b
        data_stream:
          dataset: aws.cloudtrail
          type: logs
        bucket_arn: '<REDACTED>'
        number_of_workers: 100
        bucket_list_interval: 1m
        file_selectors:
          - regex: '/CloudTrail/[a-z]{2}-[a-z]+-\d+/2024/.*$'
            expand_event_list_from_field: Records
          - regex: '/CloudTrail-Digest/[a-z]{2}-[a-z]+-\d+/2024/.*$'
          - regex: '/CloudTrail-Insight/[a-z]{2}-[a-z]+-\d+/2024/.*$'
            expand_event_list_from_field: Records
        expand_event_list_from_field: Records
        content_type: application/json
        tags:
          - forwarded
          - aws-cloudtrail
        publisher_pipeline.disable_host: true

Test process

zmoog commented 7 months ago

Run 1

In the first run, the goal was to evaluate the impact of interrupting processing and resuming from the start.

I saved the list of the existing objects in the bucket in the s3_objects_list-2.csv file. Here's the number of objects for 2024:

$ cat s3_objects_list-2.csv | grep '/2024/' | wc -l
  301254
zmoog commented 7 months ago

16:19

Started a new test run.

The total number of S3 objects in the bucket for the year 2024 is:

$ cat s3_objects_list-2.csv | grep '/2024/' | wc -l
  301254
$ cat logs/elastic-agent-* | grep "Non-zero" | jq -r '[.["@timestamp"],.component.id,.monitoring.metrics.filebeat.events.active,.monitoring.metrics.libbeat.pipeline.events.active,.monitoring.metrics.libbeat.output.events.total,.monitoring.metrics.libbeat.output.events.acked,.monitoring.metrics.libbeat.output.events.failed//0] | @tsv' | sort | grep s3 | tail

2024-02-09T15:13:35.286Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     8       8       19626   20102   0
2024-02-09T15:14:05.286Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     437     437     21367   21355   0
2024-02-09T15:14:35.286Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     482     482     20594   20301   0
2024-02-09T15:15:05.285Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     844     844     22555   22173   0
2024-02-09T15:15:35.286Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     837     837     26651   26678   0
2024-02-09T15:16:05.286Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     0       0       21808   22476   0
2024-02-09T15:16:35.286Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     1757    1757    30058   28340   0
2024-02-09T15:17:05.286Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     1560    1560    35085   35498   0
2024-02-09T15:17:35.285Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     1189    1189    27336   27506   0
2024-02-09T15:18:05.286Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     1142    1142    30204   30254   0

Let's check how many S3 objects have already landed in the logs-aws.cloudtrail-sdh3988.2 data stream:

GET logs-aws.cloudtrail-sdh3988.2/_search
{
  "size": 0,
  "aggs": {
    "unique_categories": {
      "cardinality": {
        "field": "aws.s3.object.key"
      }
    }
  }
}
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "unique_categories": {
      "value": 26523
    }
  }
}
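Worth noting: the `cardinality` aggregation is approximate (it uses HyperLogLog++), so with hundreds of thousands of unique keys the reported count can be off by a couple of percent. If exactness matters when comparing against the 301254-object baseline, raising `precision_threshold` (it caps at 40000) reduces the error at the cost of some memory:

```json
GET logs-aws.cloudtrail-sdh3988.2/_search
{
  "size": 0,
  "aggs": {
    "unique_categories": {
      "cardinality": {
        "field": "aws.s3.object.key",
        "precision_threshold": 40000
      }
    }
  }
}
```

Below the threshold the count is close to exact; above it, accuracy degrades gradually.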

16:33

$ cat logs/elastic-agent-* | grep "Non-zero" | jq -r '[.["@timestamp"],.component.id,.monitoring.metrics.filebeat.events.active,.monitoring.metrics.libbeat.pipeline.events.active,.monitoring.metrics.libbeat.output.events.total,.monitoring.metrics.libbeat.output.events.acked,.monitoring.metrics.libbeat.output.events.failed//0] | @tsv' | sort | grep s3 | tail

2024-02-09T15:28:35.286Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     3523    3523    37945   35049   0
2024-02-09T15:29:05.285Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     3447    3447    66879   67887   0
2024-02-09T15:29:35.286Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     2788    2788    59590   59566   0
2024-02-09T15:30:05.286Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     3164    3164    55884   55892   0
2024-02-09T15:30:35.286Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     3227    3227    44460   44420   0
2024-02-09T15:31:05.286Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     2806    2806    49356   48918   0
2024-02-09T15:31:35.286Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     3358    3358    46105   46817   0
2024-02-09T15:32:05.286Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     3284    3284    49197   48054   0
2024-02-09T15:32:35.286Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     3260    3260    49180   50323   0
2024-02-09T15:33:05.286Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     0       0       43971   46019   0
GET logs-aws.cloudtrail-sdh3988.2/_search
{
  "size": 0,
  "aggs": {
    "unique_categories": {
      "cardinality": {
        "field": "aws.s3.object.key"
      }
    }
  }
}
// ----
{
  "took": 2976,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "unique_categories": {
      "value": 128941
    }
  }
}

16:55

I changed the output config, accidentally restarting the input! 🙈

The input has probably restarted listing the objects from the beginning, and the unique S3 object count has stopped growing:

$ cat logs/elastic-agent-* | grep "Non-zero" | jq -r '[.["@timestamp"],.component.id,.monitoring.metrics.filebeat.events.active,.monitoring.metrics.libbeat.pipeline.events.active,.monitoring.metrics.libbeat.output.events.total,.monitoring.metrics.libbeat.output.events.acked,.monitoring.metrics.libbeat.output.events.failed//0] | @tsv' | sort | grep s3 | tail
2024-02-09T15:52:01.045Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     100     100     600     1       0
2024-02-09T15:52:31.045Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     100     100     500             0
2024-02-09T15:53:01.045Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     100     100     500             0
2024-02-09T15:53:31.046Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     100     100     600             0
2024-02-09T15:54:01.045Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     100     100     500     1       0
2024-02-09T15:54:31.045Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     100     100     500     1       0
2024-02-09T15:55:01.045Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     100     100     500             0
2024-02-09T15:55:31.046Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     100     100     600     1       0
2024-02-09T15:56:01.045Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     100     100     500             0
GET logs-aws.cloudtrail-sdh3988.2/_search
{
  "size": 0,
  "aggs": {
    "unique_categories": {
      "cardinality": {
        "field": "aws.s3.object.key"
      }
    }
  }
}
// ----
{
  "took": 89,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "unique_categories": {
      "value": 146141
    }
  }
}

16:57

The restart increased .monitoring.metrics.libbeat.output.events.total, but there are still no .monitoring.metrics.libbeat.output.events.acked:

$ cat logs/elastic-agent-* | grep "Non-zero" | jq -r '[.["@timestamp"],.component.id,.monitoring.metrics.filebeat.events.active,.monitoring.metrics.libbeat.pipeline.events.active,.monitoring.metrics.libbeat.output.events.total,.monitoring.metrics.libbeat.output.events.acked,.monitoring.metrics.libbeat.output.events.failed//0] | @tsv' | sort | grep s3 | tail

2024-02-09T15:54:31.045Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     100     100     500     1       0
2024-02-09T15:55:01.045Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     100     100     500             0
2024-02-09T15:55:31.046Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     100     100     600     1       0
2024-02-09T15:56:01.045Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     100     100     500             0
2024-02-09T15:56:31.046Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     794     794     1667    1       0
2024-02-09T15:57:01.045Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     764     764     3583            0
2024-02-09T15:57:31.045Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     764     764     3495            0
2024-02-09T15:58:01.045Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     796     796     4238            0
2024-02-09T15:58:31.045Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     638     638     3445            0
2024-02-09T15:59:01.045Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     445     445     3388            0

Hypothesis: the state registry may not have been updated, so the input treats objects as new even though they have already been processed.

17:34

The unique S3 object count started increasing again:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "unique_categories": {
      "value": 146153
    }
  }
}

However, still no acks:

$ cat logs/elastic-agent-* | grep "Non-zero" | jq -r '[.["@timestamp"],.component.id,.monitoring.metrics.filebeat.events.active,.monitoring.metrics.libbeat.pipeline.events.active,.monitoring.metrics.libbeat.output.events.total,.monitoring.metrics.libbeat.output.events.acked,.monitoring.metrics.libbeat.output.events.failed//0] | @tsv' | sort | grep s3 | tail

2024-02-09T16:31:01.045Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     1156    1156    12122           0
2024-02-09T16:31:31.046Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     1093    1093    4819            0
2024-02-09T16:32:01.046Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     1199    1199    5338            0
2024-02-09T16:32:31.045Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     1056    1056    5769            0
2024-02-09T16:33:01.045Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     1056    1056    6117            0
2024-02-09T16:33:31.045Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     1357    1357    5768            0
2024-02-09T16:34:01.045Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     1244    1244    6144            0
2024-02-09T16:34:31.045Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     1038    1038    6069            0
2024-02-09T16:35:01.045Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     1042    1024    6096            0
2024-02-09T16:35:31.046Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     23      23      5206            0

18:38

After a while, the input probably caught up to where it was before the restart, and now it's crunching data with acks:

$ cat logs/elastic-agent-* | grep "Non-zero" | jq -r '[.["@timestamp"],.component.id,.monitoring.metrics.filebeat.events.active,.monitoring.metrics.libbeat.pipeline.events.active,.monitoring.metrics.libbeat.output.events.total,.monitoring.metrics.libbeat.output.events.acked,.monitoring.metrics.libbeat.output.events.failed//0] | @tsv' | sort | grep s3 | tail

2024-02-09T17:33:31.045Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     0       0       34911   36511   0
2024-02-09T17:34:01.046Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     585     585     3209    2624    0
2024-02-09T17:34:31.045Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     78      78      2983    3568    0
2024-02-09T17:35:01.045Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     592     592     3159    3159    0
2024-02-09T17:35:31.045Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     708     708     2778    2778    0
2024-02-09T17:36:01.045Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     593     593     3264    3264    0
2024-02-09T17:36:31.045Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     617     617     3106    3106    0
2024-02-09T17:37:01.045Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     455     455     3042    3042    0
2024-02-09T17:37:31.045Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     527     527     2988    2988    0
2024-02-09T17:38:01.046Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     682     682     2963    2963    0

The object count is increasing:

{
  "took": 268,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "unique_categories": {
      "value": 158827
    }
  }
}

[Screenshot: CleanShot 2024-02-09 at 18:41:14]

19:04

More progress

{
  "took": 1816,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "unique_categories": {
      "value": 192835
    }
  }
}

19:39

{
  "took": 2838,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "unique_categories": {
      "value": 249732
    }
  }
}
$ cat logs/elastic-agent-* | grep "Non-zero" | jq -r '[.["@timestamp"],.component.id,.monitoring.metrics.filebeat.events.active,.monitoring.metrics.libbeat.pipeline.events.active,.monitoring.metrics.libbeat.output.events.total,.monitoring.metrics.libbeat.output.events.acked,.monitoring.metrics.libbeat.output.events.failed//0] | @tsv' | sort | grep s3 | tail

2024-02-09T18:36:31.046Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     1166    1166    5822    5822    0
2024-02-09T18:37:01.046Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     1081    1081    5021    5021    0
2024-02-09T18:37:31.046Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     1161    1161    5380    5380    0
2024-02-09T18:38:01.045Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     1085    1085    5957    5957    0
2024-02-09T18:38:31.046Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     12841   12801   30337   23937   0
2024-02-09T18:39:01.045Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     12810   12801   116800  116800  0
2024-02-09T18:39:31.045Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     12863   12801   116800  116800  0
2024-02-09T18:40:01.046Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     12854   12801   107200  107200  0
2024-02-09T18:40:31.045Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     12828   12801   118400  118400  0
2024-02-09T18:41:01.045Z        aws-s3-6a7e2784-665c-4208-afac-380bac62351e     12837   12801   105600  105600  0

20:00

{
  "took": 4259,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "unique_categories": {
      "value": 288266
    }
  }
}

21:57

Finished!

{
  "took": 474,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "unique_categories": {
      "value": 300483
    }
  }
}

The process:

started: 16:10
restarted: 16:55
completed: 20:10
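As a sanity check on throughput, the wall-clock numbers above work out to roughly 21 objects/s. This is a rough figure, since the four hours include the time lost to the accidental restart:

```python
from datetime import timedelta

objects = 301_254                # 2024 objects in the bucket (see the count above)
wall_clock = timedelta(hours=4)  # 16:10 -> 20:10, restart included

print(f"{objects / wall_clock.total_seconds():.1f} objects/s")  # -> 20.9 objects/s
```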

[Screenshot: CleanShot 2024-02-09 at 21:59:57]

zmoog commented 7 months ago

Run 2

The process:

started: 22:00
completed: 02:30

No Agent restarts.

[Screenshot: CleanShot 2024-02-12 at 11:04:44]

zmoog commented 7 months ago

Run 2 details

[Screenshot: CleanShot 2024-02-12 at 11:12:56]

zmoog commented 7 months ago

Run 3

Goal

Settings

I removed the regex that filtered for objects from 2024. In this run, the Agent will start from the beginning of the bucket, which contains 1.3M objects.


Input

id: ae3df6d8-b182-48f9-9cab-d36d9c824133
revision: 73
outputs:
  6a7e2784-665c-4208-afac-380bac62351e:
    type: elasticsearch
    preset: latency
  default:
    type: elasticsearch
    preset: balanced
inputs:
  - id: aws-s3-cloudtrail-fb93f962-4474-4f15-9032-0473c01ae54b
    name: aws-21
    revision: 16
    type: aws-s3
    use_output: 6a7e2784-665c-4208-afac-380bac62351e
    meta:
      package:
        name: aws
        version: 2.11.3
    data_stream:
      namespace: sdh3988.r3
    package_policy_id: fb93f962-4474-4f15-9032-0473c01ae54b
    streams:
      - id: aws-s3-aws.cloudtrail-fb93f962-4474-4f15-9032-0473c01ae54b
        data_stream:
          dataset: aws.cloudtrail
          type: logs
        bucket_arn: '<REDACTED>'
        number_of_workers: 100
        bucket_list_interval: 1m
        file_selectors:
          - regex: /CloudTrail/
            expand_event_list_from_field: Records
          - regex: /CloudTrail-Digest/
          - regex: /CloudTrail-Insight/
            expand_event_list_from_field: Records
        expand_event_list_from_field: Records
        content_type: application/json
        tags:
          - forwarded
          - aws-cloudtrail
        publisher_pipeline.disable_host: true

Output

Preset: latency

zmoog commented 7 months ago

Run 3 outcome

started: 2024-02-12 11:03 (UTC)
current: 2024-02-14 03:00 (UTC)

events: 52,952,571 hits
objects: 1,325,888

With the run 3 settings, the agent took over two days to process 1.3M objects and ingest 53M events.
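For comparison with run 1 (roughly 21 objects/s over four hours), a quick back-of-the-envelope using the figures above, treating the ~40 hours between the start and the "current" timestamp as the elapsed time:

```python
from datetime import datetime

events = 52_952_571
objects = 1_325_888
elapsed = datetime(2024, 2, 14, 3, 0) - datetime(2024, 2, 12, 11, 3)

print(f"{events / objects:.1f} events/object")               # -> 39.9 events/object
print(f"{objects / elapsed.total_seconds():.1f} objects/s")  # -> 9.2 objects/s
```

The lower objects/s rate versus run 1 is expected, given the latency output preset and the much larger listing scope.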
