
Figure out how to collect Azure metrics using the azuremonitorreceiver from the OTel Collector #67

Open zmoog opened 6 months ago

zmoog commented 6 months ago

I want to collect metric values using the azuremonitorreceiver from the opentelemetry-collector-contrib project.

The goal is to evaluate how it works while collecting PT5M metrics using a 5-minute collection interval.

zmoog commented 6 months ago

OTel Collector

I need to learn how to set up and run the OTel Collector first.

Overview

Since the Elastic Stack natively supports the OpenTelemetry protocol (OTLP), we will send the metrics directly to the Elastic Stack.

(architecture diagram: metrics flow from the OTel Collector to the Elastic Stack over OTLP)

Deployment model

I will use Docker and Docker Compose as the deployment method, using https://github.com/open-telemetry/opentelemetry-demo/blob/main/docker-compose.minimal.yml as a template.

Docker Compose

We'll use the otel/opentelemetry-collector-contrib:0.91.0 image and a volume to provide a custom config file named ./src/otelcollector/otelcol-config.yml.

# docker-compose.yml
services:
  # OpenTelemetry Collector
  otelcol:
    image: otel/opentelemetry-collector-contrib:0.91.0
    container_name: otel-col
    deploy:
      resources:
        limits:
          memory: 125M
    restart: unless-stopped
    command: [ "--config=/etc/otelcol-config.yml" ]
    volumes:
      - ./src/otelcollector/otelcol-config.yml:/etc/otelcol-config.yml
    ports:
      - "4317"          # OTLP over gRPC receiver
      - "4318"          # OTLP over HTTP receiver
    environment:
      # Elastic APM server endpoint without the "https://" prefix
      - ELASTIC_APM_SERVER_ENDPOINT=<DEPLOYMENT>.apm.eastus2.azure.elastic-cloud.com:443
      - ELASTIC_APM_SECRET_TOKEN=<TOKEN>
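
To keep the real endpoint and token out of docker-compose.yml, one option is Docker Compose's variable substitution from an .env file next to the compose file. A minimal sketch, assuming a hypothetical .env that is not committed to version control:

# .env
ELASTIC_APM_SERVER_ENDPOINT=<DEPLOYMENT>.apm.eastus2.azure.elastic-cloud.com:443
ELASTIC_APM_SECRET_TOKEN=<TOKEN>

# docker-compose.yml, environment section using substitution
    environment:
      - ELASTIC_APM_SERVER_ENDPOINT=${ELASTIC_APM_SERVER_ENDPOINT}
      - ELASTIC_APM_SECRET_TOKEN=${ELASTIC_APM_SECRET_TOKEN}

Compose reads the .env file automatically and expands the ${...} references before starting the container.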

OTel Collector config file

We are setting up the OTel Collector to pull metrics from the Azure Monitor API and send them to Elasticsearch over OTLP through the APM server.

Here's the config file:

receivers:
  otlp:
    protocols:
      grpc:
      http:
        cors:
          allowed_origins:
            - "http://*"
            - "https://*"
  azuremonitor:
    subscription_id: "${subscription_id}"
    tenant_id: "${tenant_id}"
    client_id: "${client_id}"
    client_secret: "${client_secret}"
    cloud: AzureCloud
    # resource_groups:
    #   - "${resource_group1}"
    #   - "${resource_group2}"
    services:
      - "Microsoft.Compute/virtualMachines"
    collection_interval: 60s
    initial_delay: 1s

exporters:
  logging:
    loglevel: warn 
  otlp/elastic: 
    # Elastic APM server https endpoint without the "https://" prefix
    endpoint: "${ELASTIC_APM_SERVER_ENDPOINT}"  
    headers:
      # Elastic APM Server secret token
      Authorization: "Bearer ${ELASTIC_APM_SECRET_TOKEN}"    

service:
  pipelines:
    metrics:
      receivers: [otlp, azuremonitor]
      exporters: [logging, otlp/elastic]
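
The ${subscription_id}, ${tenant_id}, ${client_id}, and ${client_secret} references are expanded by the collector from environment variables, so the container needs them set. A minimal sketch of the extra docker-compose.yml environment entries (the placeholder names and values are assumptions):

    environment:
      # Azure service principal used by the azuremonitor receiver
      - subscription_id=<AZURE_SUBSCRIPTION_ID>
      - tenant_id=<AZURE_TENANT_ID>
      - client_id=<AZURE_CLIENT_ID>
      - client_secret=<AZURE_CLIENT_SECRET>
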
zmoog commented 6 months ago

Collect metrics

Bring up the stack to run the OTel Collector:

$ docker-compose up
[+] Running 1/0
 ✔ Container otel-col  Created                                                                                                                                                                                                                                 0.0s 
Attaching to otel-col
otel-col  | 2023-12-28T10:46:54.287Z    info    service@v0.91.0/telemetry.go:86 Setting up own telemetry...
otel-col  | 2023-12-28T10:46:54.287Z    info    service@v0.91.0/telemetry.go:203        Serving Prometheus metrics      {"address": ":8888", "level": "Basic"}
otel-col  | 2023-12-28T10:46:54.287Z    info    exporter@v0.91.0/exporter.go:275        Deprecated component. Will be removed in future releases.       {"kind": "exporter", "data_type": "metrics", "name": "logging"}
otel-col  | 2023-12-28T10:46:54.287Z    warn    common/factory.go:68    'loglevel' option is deprecated in favor of 'verbosity'. Set 'verbosity' to equivalent value to preserve behavior.      {"kind": "exporter", "data_type": "metrics", "name": "logging", "loglevel": "warn", "equivalent verbosity level": "Basic"}
otel-col  | 2023-12-28T10:46:54.287Z    info    receiver@v0.91.0/receiver.go:296        Development component. May change in the future.        {"kind": "receiver", "name": "azuremonitor", "data_type": "metrics"}
otel-col  | 2023-12-28T10:46:54.287Z    info    service@v0.91.0/service.go:145  Starting otelcol-contrib...     {"Version": "0.91.0", "NumCPU": 4}
otel-col  | 2023-12-28T10:46:54.287Z    info    extensions/extensions.go:34     Starting extensions...
otel-col  | 2023-12-28T10:46:54.288Z    warn    internal@v0.91.0/warning.go:40  Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks        {"kind": "receiver", "name": "otlp", "data_type": "metrics", "documentation": "https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks"}
otel-col  | 2023-12-28T10:46:54.288Z    info    otlpreceiver@v0.91.0/otlp.go:83 Starting GRPC server    {"kind": "receiver", "name": "otlp", "data_type": "metrics", "endpoint": "0.0.0.0:4317"}
otel-col  | 2023-12-28T10:46:54.288Z    warn    internal@v0.91.0/warning.go:40  Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks        {"kind": "receiver", "name": "otlp", "data_type": "metrics", "documentation": "https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks"}
otel-col  | 2023-12-28T10:46:54.288Z    info    otlpreceiver@v0.91.0/otlp.go:101        Starting HTTP server    {"kind": "receiver", "name": "otlp", "data_type": "metrics", "endpoint": "0.0.0.0:4318"}
otel-col  | 2023-12-28T10:46:54.288Z    info    service@v0.91.0/service.go:171  Everything is ready. Begin running and processing data.
otel-col  | 2023-12-28T10:47:02.252Z    info    MetricsExporter {"kind": "exporter", "data_type": "metrics", "name": "logging", "resource metrics": 1, "metrics": 1284, "data points": 1324}
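
As a quick sanity check (not part of the original setup), the collector's own telemetry on :8888, shown in the startup logs above, can confirm the azuremonitor receiver is producing data points. A sketch, assuming port 8888 is published to the host (for example by adding "8888:8888" to the ports list):

# The azuremonitor receiver should appear here with a growing counter
# once the first collection cycle completes.
$ curl -s http://localhost:8888/metrics | grep otelcol_receiver_accepted_metric_points
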
zmoog commented 6 months ago

Checking the results

Overview

Here's a highlight from the OTel Collector config:

  azuremonitor:
    subscription_id: "${subscription_id}"
    tenant_id: "${tenant_id}"
    client_id: "${client_id}"
    client_secret: "${client_secret}"
    cloud: AzureCloud
    services:
      - "Microsoft.Compute/virtualMachines"
    collection_interval: 60s
    initial_delay: 1s

In this scenario, I am collecting Microsoft.Compute/virtualMachines metrics using a collection_interval of 60s.

It's worth noting that all Microsoft.Compute/virtualMachines metrics have the PT1M time grain, so in this scenario the collection_interval matches the time grain.

Goal

I want to check whether the azuremonitorreceiver also skips metric collections, creating gaps, when the time grain and the collection interval have the same duration.

Results

I am looking for values for the metric azure_network_in_total for the VM rajvi-test-sql-vm in the data stream apm.app.unknown.

In the last 60 mins, I can see the receiver does not get a metric value every minute:

CleanShot 2023-12-27 at 18 38 05@2x
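
One way to quantify the gaps is a per-minute date histogram over the documents carrying the metric. A rough Kibana Dev Tools sketch, assuming the documents are searchable under an index pattern matching the apm.app.unknown data stream and that the metric is stored in a field named azure_network_in_total (both names taken from above; adjust to the actual mapping):

POST apm.app.unknown*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "exists": { "field": "azure_network_in_total" } },
        { "range": { "@timestamp": { "gte": "now-60m" } } }
      ]
    }
  },
  "aggs": {
    "per_minute": {
      "date_histogram": { "field": "@timestamp", "fixed_interval": "1m" }
    }
  }
}

Minutes with a zero doc count correspond to the gaps visible in the screenshot.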

zmoog commented 6 months ago

I tried to compare the values for a single metric (Network In Total) for one specific VM (rajvi-test-sql-vm) over the last 60 minutes across the systems listed below.

Both Metricbeat and the Azure Monitor receiver collect compute VM metrics with a PT1M time grain using a 60-second collection interval. They collected metrics while running concurrently on the same machine and using the same credentials.

Azure Metrics

Here's the source of the metrics on Azure Portal:

CleanShot 2023-12-29 at 15 30 00@2x

Over the 60-minute time window, we can see a few (6) missing values.

Metricbeat 8.13.0-SNAPSHOT

Here is the same time window with the values for the same metric and resource:

CleanShot 2023-12-29 at 15 31 08@2x

The missing values are the same ones that were missing on Azure.

OTel Collector 0.91 using the azure monitor receiver

CleanShot 2023-12-29 at 15 32 49@2x

The azure monitor receiver has many more missing values.

zmoog commented 6 months ago

Conclusions

When the time grain has the exact duration of the collection interval (for example, a PT1M time grain with a 60-second interval), there is a risk that a slightly shorter or longer effective collection interval causes the metricset/receiver to skip a collection, creating gaps.