I need to learn how to set up and run the OTel Collector first.
Since the Elastic Stack natively supports the OpenTelemetry protocol (OTLP), we will send the metrics directly to it.
I will use Docker and Docker Compose as the deployment method, using https://github.com/open-telemetry/opentelemetry-demo/blob/main/docker-compose.minimal.yml as a template.
We'll use the `otel/opentelemetry-collector-contrib:0.91.0` image and a volume to mount a custom config file named `./src/otelcollector/otelcol-config.yml`.
```yaml
# docker-compose.yml
services:
  # OpenTelemetry Collector
  otelcol:
    image: otel/opentelemetry-collector-contrib:0.91.0
    container_name: otel-col
    deploy:
      resources:
        limits:
          memory: 125M
    restart: unless-stopped
    command: [ "--config=/etc/otelcol-config.yml" ]
    volumes:
      - ./src/otelcollector/otelcol-config.yml:/etc/otelcol-config.yml
    ports:
      - "4317" # OTLP over gRPC receiver
      - "4318" # OTLP over HTTP receiver
    environment:
      - ELASTIC_APM_SERVER_ENDPOINT=<DEPLOYMENT>.apm.eastus2.azure.elastic-cloud.com:443
      - ELASTIC_APM_SECRET_TOKEN=<TOKEN>
```
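Note that `- "4317"` publishes the container port on an ephemeral host port; if you want to reach the collector from the host on the standard ports, map them explicitly, e.g.:

```yaml
    ports:
      - "4317:4317" # OTLP over gRPC receiver
      - "4318:4318" # OTLP over HTTP receiver
```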
We are setting up the OTel Collector to pull metrics from the Azure Monitor API and send them to Elasticsearch using OTLP via the APM server.
Here's the config file:
```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:
        cors:
          allowed_origins:
            - "http://*"
            - "https://*"
  azuremonitor:
    subscription_id: "${subscription_id}"
    tenant_id: "${tenant_id}"
    client_id: "${client_id}"
    client_secret: "${client_secret}"
    cloud: AzureCloud
    # resource_groups:
    #   - "${resource_group1}"
    #   - "${resource_group2}"
    services:
      - "Microsoft.Compute/virtualMachines"
    collection_interval: 60s
    initial_delay: 1s

exporters:
  logging:
    loglevel: warn
  otlp/elastic:
    # Elastic APM server https endpoint without the "https://" prefix
    endpoint: "${ELASTIC_APM_SERVER_ENDPOINT}"
    headers:
      # Elastic APM Server secret token
      Authorization: "Bearer ${ELASTIC_APM_SECRET_TOKEN}"

service:
  pipelines:
    metrics:
      receivers: [otlp, azuremonitor]
      exporters: [logging, otlp/elastic]
```
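The `${subscription_id}`-style placeholders are expanded from environment variables, so the Azure credentials also need to be present in the container environment. A minimal sketch of the extra `environment` entries for docker-compose.yml (the variable names are just the ones the config above references; fill in your own values):

```yaml
    environment:
      # Azure service principal credentials consumed by the azuremonitor receiver
      - subscription_id=<SUBSCRIPTION_ID>
      - tenant_id=<TENANT_ID>
      - client_id=<CLIENT_ID>
      - client_secret=<CLIENT_SECRET>
```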
Bring up the stack to run the OTel Collector:
```console
$ docker-compose up
[+] Running 1/0
 ✔ Container otel-col  Created  0.0s
Attaching to otel-col
otel-col | 2023-12-28T10:46:54.287Z info service@v0.91.0/telemetry.go:86 Setting up own telemetry...
otel-col | 2023-12-28T10:46:54.287Z info service@v0.91.0/telemetry.go:203 Serving Prometheus metrics {"address": ":8888", "level": "Basic"}
otel-col | 2023-12-28T10:46:54.287Z info exporter@v0.91.0/exporter.go:275 Deprecated component. Will be removed in future releases. {"kind": "exporter", "data_type": "metrics", "name": "logging"}
otel-col | 2023-12-28T10:46:54.287Z warn common/factory.go:68 'loglevel' option is deprecated in favor of 'verbosity'. Set 'verbosity' to equivalent value to preserve behavior. {"kind": "exporter", "data_type": "metrics", "name": "logging", "loglevel": "warn", "equivalent verbosity level": "Basic"}
otel-col | 2023-12-28T10:46:54.287Z info receiver@v0.91.0/receiver.go:296 Development component. May change in the future. {"kind": "receiver", "name": "azuremonitor", "data_type": "metrics"}
otel-col | 2023-12-28T10:46:54.287Z info service@v0.91.0/service.go:145 Starting otelcol-contrib... {"Version": "0.91.0", "NumCPU": 4}
otel-col | 2023-12-28T10:46:54.287Z info extensions/extensions.go:34 Starting extensions...
otel-col | 2023-12-28T10:46:54.288Z warn internal@v0.91.0/warning.go:40 Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks {"kind": "receiver", "name": "otlp", "data_type": "metrics", "documentation": "https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks"}
otel-col | 2023-12-28T10:46:54.288Z info otlpreceiver@v0.91.0/otlp.go:83 Starting GRPC server {"kind": "receiver", "name": "otlp", "data_type": "metrics", "endpoint": "0.0.0.0:4317"}
otel-col | 2023-12-28T10:46:54.288Z warn internal@v0.91.0/warning.go:40 Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks {"kind": "receiver", "name": "otlp", "data_type": "metrics", "documentation": "https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks"}
otel-col | 2023-12-28T10:46:54.288Z info otlpreceiver@v0.91.0/otlp.go:101 Starting HTTP server {"kind": "receiver", "name": "otlp", "data_type": "metrics", "endpoint": "0.0.0.0:4318"}
otel-col | 2023-12-28T10:46:54.288Z info service@v0.91.0/service.go:171 Everything is ready. Begin running and processing data.
otel-col | 2023-12-28T10:47:02.252Z info MetricsExporter {"kind": "exporter", "data_type": "metrics", "name": "logging", "resource metrics": 1, "metrics": 1284, "data points": 1324}
```
Here's a highlight from the OTel Collector config:
```yaml
azuremonitor:
  subscription_id: "${subscription_id}"
  tenant_id: "${tenant_id}"
  client_id: "${client_id}"
  client_secret: "${client_secret}"
  cloud: AzureCloud
  services:
    - "Microsoft.Compute/virtualMachines"
  collection_interval: 60s
  initial_delay: 1s
```
In this scenario, I am collecting `Microsoft.Compute/virtualMachines` metrics using a `collection_interval` of 60s.
It's worth noting that all `Microsoft.Compute/virtualMachines` metrics have the `PT1M` time grain, so in this scenario the `collection_interval` matches the time grain.
I want to check if the `azuremonitorreceiver` also skips metric collections, creating gaps, when the time grain and the collection interval are the same.
I am looking for values of the metric `azure_network_in_total` for the VM `rajvi-test-sql-vm` in the data stream `apm.app.unknown`.
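To spot gaps, a date histogram over the data stream should show which 1-minute buckets have no documents. Here's a sketch using Kibana Dev Tools; the concrete data stream name (assuming the `default` namespace) and the metric field name are assumptions about how the APM server maps OTel metrics:

```
GET metrics-apm.app.unknown-default/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "range": { "@timestamp": { "gte": "now-60m" } } },
        { "exists": { "field": "azure_network_in_total" } }
      ]
    }
  },
  "aggs": {
    "per_minute": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "1m",
        "min_doc_count": 0
      }
    }
  }
}
```

Buckets with `doc_count: 0` are the minutes where the receiver did not deliver a value.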
In the last 60 minutes, I can see the receiver does not get a metric value every minute.
I tried to compare the values for a single metric (network in total) for one specific VM (`rajvi-test-sql-vm`) over the last 60 minutes on the following systems: Metricbeat and the Azure Monitor receiver, both collecting compute VM metrics with a `PT1M` time grain and a 60-second collection interval (the Metricbeat side is sketched below). They collected metrics running concurrently on the same machine, using the same credentials.
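For reference, here's a sketch of the Metricbeat side of the comparison, assuming the stock azure module with the `compute_vm` metricset and the same credentials passed as environment variables:

```yaml
# modules.d/azure.yml (sketch)
- module: azure
  metricsets: ["compute_vm"]
  enabled: true
  period: 60s # same 60-second interval as the azuremonitor receiver
  subscription_id: "${subscription_id}"
  tenant_id: "${tenant_id}"
  client_id: "${client_id}"
  client_secret: "${client_secret}"
```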
Here's the source of the metrics on the Azure Portal:
Over the 60-minute time window, we can see a few (6) missing values.
Here is the same time window with the values for the same metric and resource:
These are the same values that were missing on Azure.
The Azure Monitor receiver has many more missing values.
When the time grain has exactly the same duration as the collection interval (for example, a `PT1M` time grain and a 60-second interval), there is a risk that a slightly shorter or longer actual interval causes the metricset/receiver to skip a collection, creating gaps.
I want to collect metric values using the `azuremonitorreceiver` from the opentelemetry-collector-contrib project.
The goal is to evaluate how it works while collecting `PT5M` metrics using a 5-minute collection interval.
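Here's a sketch of the receiver configuration for that test: the only real change is raising `collection_interval` to 5 minutes, and the service entry is a placeholder, since the test needs a resource type that exposes `PT5M` metrics:

```yaml
azuremonitor:
  subscription_id: "${subscription_id}"
  tenant_id: "${tenant_id}"
  client_id: "${client_id}"
  client_secret: "${client_secret}"
  cloud: AzureCloud
  services:
    - "<SERVICE_WITH_PT5M_METRICS>" # placeholder: pick a resource type with a PT5M time grain
  collection_interval: 300s # 5 minutes, matching the PT5M time grain
  initial_delay: 1s
```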