wundergraph / cosmo

The open-source solution to building, maintaining, and collaborating on GraphQL Federation at Scale. The alternative to Apollo Studio and GraphOS.
https://cosmo-docs.wundergraph.com/
Apache License 2.0

ServiceName column missing in the Otel Table in Clickhouse #1166

Closed anmolghosh closed 1 month ago

anmolghosh commented 1 month ago

Component(s)

otelcollector

Component version

0.17.1

wgc version

0.40.2

controlplane version

0.107.0

router version

0.109.1

What happened?

Description

We have Cosmo deployed in a Kubernetes cluster using the standard Helm chart provided by the WunderGraph team. Looking at the otelcollector logs, we see the following errors:

insert sum metrics fail:code: 16, message: No such column ServiceName in table cosmo.otel_metrics_sum
insert gauge metrics fail:code: 16, message: No such column ServiceName in table cosmo.otel_metrics_gauge
insert histogram metrics fail:code: 16, message: No such column ServiceName in table cosmo.otel_metrics_histogram

Running a count query on each of the above tables in ClickHouse shows that all three tables contain zero records.

Also, reviewing the schema in the following file confirms that the ServiceName column was never defined in the first place: https://github.com/wundergraph/cosmo/blob/main/controlplane/db/schema.sql

Steps to Reproduce

Install WunderGraph Cosmo using the standard Helm chart with otelcollector enabled. Prometheus is kept disabled on all components.

Expected Result

There are no errors in the otelcollector logs, and records are inserted into the corresponding ClickHouse tables.

Actual Result

All inserts into otel_metrics_sum, otel_metrics_gauge and otel_metrics_histogram fail with the error:

 No such column ServiceName in table

Environment information

Environment

WunderGraph Cosmo stack running on an AWS EKS Kubernetes cluster. The stack is deployed using the Helm chart:

oci://ghcr.io/wundergraph/cosmo/helm-charts/cosmo

Versions:

Helm Chart: 0.11.1
cdn: 0.10.1
controlplane: 0.107.0
studio: 0.88.1
router: 0.109.1
otelcollector: 0.17.1
graphqlmetrics: 0.22.0

Router configuration

version: "1"
headers:
  all: # Header rules for all origin requests.
    request:
      - op: "propagate"
        named: Authorization
      - op: "propagate"
        named: ClientKey
      - op: "propagate"
        named: refresh_token
      - op: "propagate"
        named: Traceparent
subgraph_error_propagation:
  enabled: true
  mode: "pass-through"
traffic_shaping:
  all:
    request_timeout: 120s
cors:
  allow_headers:
    - Origin
    - Content-Length
    - Content-Type
    - ClientKey
    - Refresh_Token

Router execution config

No response

Log output

2024-09-13T00:56:28+05:30 {"level":"info","ts":1726169188.0839438,"caller":"exporterhelper/retry_sender.go:118","msg":"Exporting failed. Will retry the request after interval.","kind":"exporter","data_type":"metrics","name":"clickhouse","error":"insert sum metrics fail:code: 16, message: No such column ServiceName in table cosmo.otel_metrics_sum (fa3d81a4-19e7-4ef0-9832-dabab1acd4c5)\ninsert gauge metrics fail:code: 16, message: No such column ServiceName in table cosmo.otel_metrics_gauge (a3cb4634-2048-4bad-9ac6-8828c9631d21)\ninsert histogram metrics fail:code: 16, message: No such column ServiceName in table cosmo.otel_metrics_histogram (edbfb99c-5432-418d-875c-1538d263e1f8)","interval":"35.094430427s"}
2024-09-13T00:57:03+05:30 {"level":"info","ts":1726169223.1858504,"caller":"exporterhelper/retry_sender.go:118","msg":"Exporting failed. Will retry the request after interval.","kind":"exporter","data_type":"metrics","name":"clickhouse","error":"insert gauge metrics fail:code: 16, message: No such column ServiceName in table cosmo.otel_metrics_gauge (a3cb4634-2048-4bad-9ac6-8828c9631d21)\ninsert sum metrics fail:code: 16, message: No such column ServiceName in table cosmo.otel_metrics_sum (fa3d81a4-19e7-4ef0-9832-dabab1acd4c5)\ninsert histogram metrics fail:code: 16, message: No such column ServiceName in table cosmo.otel_metrics_histogram (edbfb99c-5432-418d-875c-1538d263e1f8)","interval":"18.860788202s"}
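To see at a glance which tables and columns the exporter is complaining about, the log lines above can be parsed with a few lines of Python. This is only a sketch: the helper name `missing_columns` is hypothetical, the sample is an abbreviated copy of the first log line, and it assumes every line follows the `<timestamp> <json>` shape shown here.

```python
import json
import re

# Matches the ClickHouse error fragment seen in the log output above.
MISSING_COLUMN = re.compile(r"No such column (\w+) in table ([\w.]+)")


def missing_columns(log_line: str) -> set[tuple[str, str]]:
    """Return the (table, column) pairs reported missing in one log line."""
    # Assumption: the JSON payload starts at the first "{" after the timestamp.
    payload = json.loads(log_line[log_line.index("{"):])
    error_text = payload.get("error", "")
    return {(table, column) for column, table in MISSING_COLUMN.findall(error_text)}


# Abbreviated copy of the first log line from this report (query IDs omitted).
sample = (
    r'2024-09-13T00:56:28+05:30 {"level":"info","name":"clickhouse",'
    r'"error":"insert sum metrics fail:code: 16, message: No such column ServiceName in table cosmo.otel_metrics_sum\n'
    r'insert gauge metrics fail:code: 16, message: No such column ServiceName in table cosmo.otel_metrics_gauge\n'
    r'insert histogram metrics fail:code: 16, message: No such column ServiceName in table cosmo.otel_metrics_histogram"}'
)

for table, column in sorted(missing_columns(sample)):
    print(f"{table}: missing column {column}")
```

Run against the full collector log, this should list the same three tables, all missing ServiceName.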

Additional context

No response

github-actions[bot] commented 1 month ago

WunderGraph commits fully to Open Source and we want to make sure that we can help you as fast as possible. The roadmap is driven by our customers and we have to prioritize issues that are important to them. You can influence the priority by becoming a customer. Please contact us here.

StarpTech commented 1 month ago

Hi @anmolghosh, this must be related to an old migration issue. I assume you have been running Cosmo for a while. You should be able to fix this by adding the column manually to the tables. The schema definition can be found here.
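A rough sketch of what adding the column manually could look like, written as a small Python helper that only prints the ALTER statements for the three tables named in the errors. The helper name is hypothetical, and the column type `LowCardinality(String)` is an assumption that has not been checked against the schema definition; verify it there before running anything against a real ClickHouse instance.

```python
# Tables named in the collector errors in this issue.
TABLES = ["otel_metrics_sum", "otel_metrics_gauge", "otel_metrics_histogram"]


def add_column_statements(
    database: str = "cosmo",
    column: str = "ServiceName",
    # Assumption: the real type must be taken from the schema definition.
    column_type: str = "LowCardinality(String)",
) -> list[str]:
    """Build one ALTER TABLE statement per affected table."""
    return [
        f"ALTER TABLE {database}.{table} ADD COLUMN IF NOT EXISTS {column} {column_type};"
        for table in TABLES
    ]


for statement in add_column_statements():
    print(statement)
```

The `IF NOT EXISTS` guard makes the statements safe to re-run on a table that already has the column.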

anmolghosh commented 1 month ago

Hi @StarpTech

Yes, we have been running it for some time now. We are also in the process of migrating ClickHouse from a Kubernetes-hosted instance to a cloud-managed one. I assume that running the migrations from scratch on the new cloud instance should fix any table inconsistencies.

anmolghosh commented 1 month ago

Hi @StarpTech

Is there any reason the table schema is not updated in either the schema file or the migration folder:

Since Cosmo's Helm chart runs the ClickHouse migration job by default using dbmate when OtelCollector is enabled, this will never execute because of the CREATE TABLE IF NOT EXISTS clause. Ref:

So even someone setting up Cosmo from scratch might face a similar issue.
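The general behaviour behind this concern can be shown in a minimal, self-contained sketch: a CREATE TABLE IF NOT EXISTS statement is skipped when the table already exists, so it never adds columns to an older table. The example uses Python's built-in SQLite as a stand-in for ClickHouse, and the table layout (a single MetricName column) is purely illustrative, not the real Cosmo schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Older deployment: the table already exists without the ServiceName column.
conn.execute("CREATE TABLE otel_metrics_sum (MetricName TEXT)")

# A later migration that relies on IF NOT EXISTS is silently skipped,
# because a table with that name is already present.
conn.execute(
    "CREATE TABLE IF NOT EXISTS otel_metrics_sum (ServiceName TEXT, MetricName TEXT)"
)

# Inspect the columns that actually exist after both statements.
columns = [row[1] for row in conn.execute("PRAGMA table_info(otel_metrics_sum)")]
print(columns)  # ['MetricName']
```

The second statement succeeds without error, which is why a mismatch like this only surfaces later, when inserts reference the missing column.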

anmolghosh commented 1 month ago

Hi @StarpTech

I see the team has added Helm hooks to handle it. I will do a fresh run and see how it goes.

Thanks again.

Suggestion: We should either delete the tables from controlplane/db/schema.sql, if the goal is to create the initial tables with it, or update the file so that it is in sync with the latest schema.

StarpTech commented 1 month ago

Hi, yes this should fix it.

Is there any reason the table schema is not updated in either the schema file or the migration folder:

This file is no longer in active use and is considered obsolete, so we plan to clean it up.

Unfortunately, the OTEL collector migration was not included, and the fix has only been applied for customers who encountered issues. We understand this is not ideal, and we will make every effort to prevent such situations in the future.

Since Cosmo's Helm chart runs the ClickHouse migration job by default using dbmate when OtelCollector is enabled, this will never execute because of the CREATE TABLE IF NOT EXISTS clause. Ref:

The OTEL collector applies its own migrations. There is no automated mechanism yet to modify the OTEL collector schema after the initial migration.

StarpTech commented 1 month ago

Feel free to reopen the issue if you think it was not resolved.