In the first run, the goal was to evaluate the impact of interrupting the processing and resuming it from the start.
I saved the list of the existing objects in the bucket in the s3_objects_list-2.csv file. Here's the total number of S3 objects in the bucket for the year 2024:
$ cat s3_objects_list-2.csv | grep '/2024/' | wc -l
301254
Started a new test run.
$ cat logs/elastic-agent-* | grep "Non-zero" | jq -r '[.["@timestamp"],.component.id,.monitoring.metrics.filebeat.events.active,.monitoring.metrics.libbeat.pipeline.events.active,.monitoring.metrics.libbeat.output.events.total,.monitoring.metrics.libbeat.output.events.acked,.monitoring.metrics.libbeat.output.events.failed//0] | @tsv' | sort | grep s3 | tail
2024-02-09T15:13:35.286Z aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4 8 8 19626 20102 0
2024-02-09T15:14:05.286Z aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4 437 437 21367 21355 0
2024-02-09T15:14:35.286Z aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4 482 482 20594 20301 0
2024-02-09T15:15:05.285Z aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4 844 844 22555 22173 0
2024-02-09T15:15:35.286Z aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4 837 837 26651 26678 0
2024-02-09T15:16:05.286Z aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4 0 0 21808 22476 0
2024-02-09T15:16:35.286Z aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4 1757 1757 30058 28340 0
2024-02-09T15:17:05.286Z aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4 1560 1560 35085 35498 0
2024-02-09T15:17:35.285Z aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4 1189 1189 27336 27506 0
2024-02-09T15:18:05.286Z aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4 1142 1142 30204 30254 0
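As an aside, these per-interval samples can be collapsed into a single figure by summing the acked counts across all matching log lines; a minimal jq -s variation on the same pipeline (not part of the original test run, and it assumes one JSON document per log line, as above):
$ cat logs/elastic-agent-* | grep "Non-zero" | grep s3 | jq -s 'map(.monitoring.metrics.libbeat.output.events.acked // 0) | add'
Dividing the resulting sum by the elapsed wall-clock time gives an average events/second for the run.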
Let's check how many s3 objects already landed in the logs-aws.cloudtrail-sdh3988.2
data stream:
GET logs-aws.cloudtrail-sdh3988.2/_search
{
"size": 0,
"aggs": {
"unique_categories": {
"cardinality": {
"field": "aws.s3.object.key"
}
}
}
}
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 10000,
"relation": "gte"
},
"max_score": null,
"hits": []
},
"aggregations": {
"unique_categories": {
"value": 26523
}
}
}
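Note that the cardinality aggregation is approximate (it is HyperLogLog++ based), so counts in the hundreds of thousands can be off by a small percentage. A variant of the same query that trades memory for accuracy by raising precision_threshold (40000 is the maximum Elasticsearch accepts):
GET logs-aws.cloudtrail-sdh3988.2/_search
{
  "size": 0,
  "aggs": {
    "unique_categories": {
      "cardinality": {
        "field": "aws.s3.object.key",
        "precision_threshold": 40000
      }
    }
  }
}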
$ cat logs/elastic-agent-* | grep "Non-zero" | jq -r '[.["@timestamp"],.component.id,.monitoring.metrics.filebeat.events.active,.monitoring.metrics.libbeat.pipeline.events.active,.monitoring.metrics.libbeat.output.events.total,.monitoring.metrics.libbeat.output.events.acked,.monitoring.metrics.libbeat.output.events.failed//0] | @tsv' | sort | grep s3 | tail
2024-02-09T15:28:35.286Z aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4 3523 3523 37945 35049 0
2024-02-09T15:29:05.285Z aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4 3447 3447 66879 67887 0
2024-02-09T15:29:35.286Z aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4 2788 2788 59590 59566 0
2024-02-09T15:30:05.286Z aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4 3164 3164 55884 55892 0
2024-02-09T15:30:35.286Z aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4 3227 3227 44460 44420 0
2024-02-09T15:31:05.286Z aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4 2806 2806 49356 48918 0
2024-02-09T15:31:35.286Z aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4 3358 3358 46105 46817 0
2024-02-09T15:32:05.286Z aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4 3284 3284 49197 48054 0
2024-02-09T15:32:35.286Z aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4 3260 3260 49180 50323 0
2024-02-09T15:33:05.286Z aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4 0 0 43971 46019 0
GET logs-aws.cloudtrail-sdh3988.2/_search
{
"size": 0,
"aggs": {
"unique_categories": {
"cardinality": {
"field": "aws.s3.object.key"
}
}
}
}
// ----
{
"took": 2976,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 10000,
"relation": "gte"
},
"max_score": null,
"hits": []
},
"aggregations": {
"unique_categories": {
"value": 128941
}
}
}
16:55 I changed the output config, accidentally restarting the input! 🙈
The input probably restarted listing the objects from the beginning, and the number of unique s3 objects stopped increasing:
$ cat logs/elastic-agent-* | grep "Non-zero" | jq -r '[.["@timestamp"],.component.id,.monitoring.metrics.filebeat.events.active,.monitoring.metrics.libbeat.pipeline.events.active,.monitoring.metrics.libbeat.output.events.total,.monitoring.metrics.libbeat.output.events.acked,.monitoring.metrics.libbeat.output.events.failed//0] | @tsv' | sort | grep s3 | tail
2024-02-09T15:52:01.045Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 100 100 600 1 0
2024-02-09T15:52:31.045Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 100 100 500 0
2024-02-09T15:53:01.045Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 100 100 500 0
2024-02-09T15:53:31.046Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 100 100 600 0
2024-02-09T15:54:01.045Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 100 100 500 1 0
2024-02-09T15:54:31.045Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 100 100 500 1 0
2024-02-09T15:55:01.045Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 100 100 500 0
2024-02-09T15:55:31.046Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 100 100 600 1 0
2024-02-09T15:56:01.045Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 100 100 500 0
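With output totals climbing but almost nothing acked, it is worth grepping the same logs for output-side errors; a rough starting point (the pattern is a guess, since the exact message text varies by output type and version):
$ cat logs/elastic-agent-* | grep -iE "failed to publish|connect|retry" | tail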
GET logs-aws.cloudtrail-sdh3988.2/_search
{
"size": 0,
"aggs": {
"unique_categories": {
"cardinality": {
"field": "aws.s3.object.key"
}
}
}
}
// ----
{
"took": 89,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 10000,
"relation": "gte"
},
"max_score": null,
"hits": []
},
"aggregations": {
"unique_categories": {
"value": 146141
}
}
}
The .monitoring.metrics.libbeat.output.events.total counter increased, but there are still no .monitoring.metrics.libbeat.output.events.acked:
$ cat logs/elastic-agent-* | grep "Non-zero" | jq -r '[.["@timestamp"],.component.id,.monitoring.metrics.filebeat.events.active,.monitoring.metrics.libbeat.pipeline.events.active,.monitoring.metrics.libbeat.output.events.total,.monitoring.metrics.libbeat.output.events.acked,.monitoring.metrics.libbeat.output.events.failed//0] | @tsv' | sort | grep s3 | tail
2024-02-09T15:54:31.045Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 100 100 500 1 0
2024-02-09T15:55:01.045Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 100 100 500 0
2024-02-09T15:55:31.046Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 100 100 600 1 0
2024-02-09T15:56:01.045Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 100 100 500 0
2024-02-09T15:56:31.046Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 794 794 1667 1 0
2024-02-09T15:57:01.045Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 764 764 3583 0
2024-02-09T15:57:31.045Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 764 764 3495 0
2024-02-09T15:58:01.045Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 796 796 4238 0
2024-02-09T15:58:31.045Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 638 638 3445 0
2024-02-09T15:59:01.045Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 445 445 3388 0
Hypothesis: maybe the state registry was not updated, so the input thinks the objects are new when they aren't.
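If that hypothesis holds, the persisted aws-s3 states should lag behind the listing. A way to eyeball this, as a sketch only: the registry path below is for a standalone Filebeat layout, and the filebeat::aws-s3:: key prefix is an assumption about how the input names its state entries, so both may need adjusting for an Elastic Agent install:
$ jq -r 'select(.k? // "" | startswith("filebeat::aws-s3::")) | .k' data/registry/filebeat/log.json | tail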
The unique s3 object count started increasing again:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 10000,
"relation": "gte"
},
"max_score": null,
"hits": []
},
"aggregations": {
"unique_categories": {
"value": 146153
}
}
}
However, still no acks:
$ cat logs/elastic-agent-* | grep "Non-zero" | jq -r '[.["@timestamp"],.component.id,.monitoring.metrics.filebeat.events.active,.monitoring.metrics.libbeat.pipeline.events.active,.monitoring.metrics.libbeat.output.events.total,.monitoring.metrics.libbeat.output.events.acked,.monitoring.metrics.libbeat.output.events.failed//0] | @tsv' | sort | grep s3 | tail
2024-02-09T16:31:01.045Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 1156 1156 12122 0
2024-02-09T16:31:31.046Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 1093 1093 4819 0
2024-02-09T16:32:01.046Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 1199 1199 5338 0
2024-02-09T16:32:31.045Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 1056 1056 5769 0
2024-02-09T16:33:01.045Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 1056 1056 6117 0
2024-02-09T16:33:31.045Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 1357 1357 5768 0
2024-02-09T16:34:01.045Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 1244 1244 6144 0
2024-02-09T16:34:31.045Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 1038 1038 6069 0
2024-02-09T16:35:01.045Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 1042 1024 6096 0
2024-02-09T16:35:31.046Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 23 23 5206 0
After a while, the input probably caught up to where it was before, and it is now crunching data with acks:
$ cat logs/elastic-agent-* | grep "Non-zero" | jq -r '[.["@timestamp"],.component.id,.monitoring.metrics.filebeat.events.active,.monitoring.metrics.libbeat.pipeline.events.active,.monitoring.metrics.libbeat.output.events.total,.monitoring.metrics.libbeat.output.events.acked,.monitoring.metrics.libbeat.output.events.failed//0] | @tsv' | sort | grep s3 | tail
2024-02-09T17:33:31.045Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 0 0 34911 36511 0
2024-02-09T17:34:01.046Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 585 585 3209 2624 0
2024-02-09T17:34:31.045Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 78 78 2983 3568 0
2024-02-09T17:35:01.045Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 592 592 3159 3159 0
2024-02-09T17:35:31.045Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 708 708 2778 2778 0
2024-02-09T17:36:01.045Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 593 593 3264 3264 0
2024-02-09T17:36:31.045Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 617 617 3106 3106 0
2024-02-09T17:37:01.045Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 455 455 3042 3042 0
2024-02-09T17:37:31.045Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 527 527 2988 2988 0
2024-02-09T17:38:01.046Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 682 682 2963 2963 0
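The catch-up phase is easier to see as a backlog figure, total minus acked per sample; a small variation on the same jq pipeline (not from the original run) that should stay large while the input replays already-ingested objects and drop back toward zero once it catches up:
$ cat logs/elastic-agent-* | grep "Non-zero" | grep s3 | jq -r '[.["@timestamp"],((.monitoring.metrics.libbeat.output.events.total // 0) - (.monitoring.metrics.libbeat.output.events.acked // 0))] | @tsv' | sort | tail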
The object count is increasing:
{
"took": 268,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 10000,
"relation": "gte"
},
"max_score": null,
"hits": []
},
"aggregations": {
"unique_categories": {
"value": 158827
}
}
}
More progress:
{
"took": 1816,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 10000,
"relation": "gte"
},
"max_score": null,
"hits": []
},
"aggregations": {
"unique_categories": {
"value": 192835
}
}
}
{
"took": 2838,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 10000,
"relation": "gte"
},
"max_score": null,
"hits": []
},
"aggregations": {
"unique_categories": {
"value": 249732
}
}
}
$ cat logs/elastic-agent-* | grep "Non-zero" | jq -r '[.["@timestamp"],.component.id,.monitoring.metrics.filebeat.events.active,.monitoring.metrics.libbeat.pipeline.events.active,.monitoring.metrics.libbeat.output.events.total,.monitoring.metrics.libbeat.output.events.acked,.monitoring.metrics.libbeat.output.events.failed//0] | @tsv' | sort | grep s3 | tail
2024-02-09T18:36:31.046Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 1166 1166 5822 5822 0
2024-02-09T18:37:01.046Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 1081 1081 5021 5021 0
2024-02-09T18:37:31.046Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 1161 1161 5380 5380 0
2024-02-09T18:38:01.045Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 1085 1085 5957 5957 0
2024-02-09T18:38:31.046Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 12841 12801 30337 23937 0
2024-02-09T18:39:01.045Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 12810 12801 116800 116800 0
2024-02-09T18:39:31.045Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 12863 12801 116800 116800 0
2024-02-09T18:40:01.046Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 12854 12801 107200 107200 0
2024-02-09T18:40:31.045Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 12828 12801 118400 118400 0
2024-02-09T18:41:01.045Z aws-s3-6a7e2784-665c-4208-afac-380bac62351e 12837 12801 105600 105600 0
{
"took": 4259,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 10000,
"relation": "gte"
},
"max_score": null,
"hits": []
},
"aggregations": {
"unique_categories": {
"value": 288266
}
}
}
Finished! The final unique-key count below (300,483) is close to the 301,254 objects listed for 2024; the small gap is consistent with the cardinality aggregation being approximate.
{
"took": 474,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 10000,
"relation": "gte"
},
"max_score": null,
"hits": []
},
"aggregations": {
"unique_categories": {
"value": 300483
}
}
}
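For an exact rather than approximate comparison against the bucket listing, the distinct keys could be paged out of Elasticsearch with a composite aggregation and diffed against s3_objects_list-2.csv; a sketch of the first page (paginate by feeding after_key back into the request):
GET logs-aws.cloudtrail-sdh3988.2/_search
{
  "size": 0,
  "aggs": {
    "keys": {
      "composite": {
        "size": 10000,
        "sources": [
          { "s3_key": { "terms": { "field": "aws.s3.object.key" } } }
        ]
      }
    }
  }
}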
Timings:
First process: started 16:10, restarted 16:55, completed 20:10 (about 4 hours).
Second process: started 22:00, completed 02:30 (about 4.5 hours), with no Agent restarts.
Run 2 details
I removed the regex that filtered objects from 2024. In this run, the Agent will start from the beginning of the bucket, which contains 1.3M objects.
Agent policy:
id: ae3df6d8-b182-48f9-9cab-d36d9c824133
revision: 73
outputs:
  6a7e2784-665c-4208-afac-380bac62351e:
    type: elasticsearch
    preset: latency
  default:
    type: elasticsearch
    preset: balanced
inputs:
  - id: aws-s3-cloudtrail-fb93f962-4474-4f15-9032-0473c01ae54b
    name: aws-21
    revision: 16
    type: aws-s3
    use_output: 6a7e2784-665c-4208-afac-380bac62351e
    meta:
      package:
        name: aws
        version: 2.11.3
    data_stream:
      namespace: sdh3988.r3
    package_policy_id: fb93f962-4474-4f15-9032-0473c01ae54b
    streams:
      - id: aws-s3-aws.cloudtrail-fb93f962-4474-4f15-9032-0473c01ae54b
        data_stream:
          dataset: aws.cloudtrail
          type: logs
        bucket_arn: '<REDACTED>'
        number_of_workers: 100
        bucket_list_interval: 1m
        file_selectors:
          - regex: /CloudTrail/
            expand_event_list_from_field: Records
          - regex: /CloudTrail-Digest/
          - regex: /CloudTrail-Insight/
            expand_event_list_from_field: Records
        expand_event_list_from_field: Records
        content_type: application/json
        tags:
          - forwarded
          - aws-cloudtrail
        publisher_pipeline.disable_host: true
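A note on this config: file_selectors are evaluated in order, and an object is processed with the settings of the first regex it matches, so Digest objects are read without Records expansion while CloudTrail and Insight objects are expanded. To get a feel for how the listed keys split across the three selectors, the saved listing can be grepped (a sketch, assuming s3_objects_list-2.csv holds one key per line):
for re in '/CloudTrail/' '/CloudTrail-Digest/' '/CloudTrail-Insight/'; do
  printf '%-24s %s\n' "$re" "$(grep -c -- "$re" s3_objects_list-2.csv)"
done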
Preset: latency
Started: 2024-02-12 11:03 (UTC); current: 2024-02-14 03:00 (UTC)
Events: 52,952,571 hits; objects: 1,325,888
With the run 3 settings, the agent spent more than two days processing 1.3M objects and ingesting 53M events, a sustained average on the order of 300 events per second.
I want to run performance tests on the aws-s3 input using Elastic Agent 8.12.0 to ingest CloudTrail logs. The test will run on the following EC2 instance: c7g.2xlarge.
The goals are:
Authentication
We will use an EC2 instance profile only, so there will be no authentication-related options in the integration settings.
Dataset
As a test dataset, I will use an S3 bucket containing 1.2M objects of CloudTrail logs.
I set up the integration to process only S3 objects created in 2024 (301,254 objects at the time of the test). To achieve this, I updated the file selector regex to match the following expression:
Here is the agent policy:
Test process