ytyou / ticktock

TickTockDB is an OpenTSDB-like time series database, with much better performance.
GNU General Public License v3.0
72 stars 8 forks source link

TT crashes with SIGSEGV when executing a query #45

Closed jens-ylja closed 1 year ago

jens-ylja commented 1 year ago

I have some curious crashes with SIGSEGV (11) when TT executes a query issued via Grafana.

After several hours of struggling around, I drilled it down to the following simple procedure (reproducible with just telnet and curl):

  1. start a new clean TT instance
  2. put a measurement using the OpenTSDB line protocol: put sensor.status 1681147860 0 type=s10e device=s10e
  3. execute a query to verify: curl 'http://localhost:6182/api/query?start=0&m=first:sensor.status' -> you should see one metric
  4. put another measurement using the OpenTSDB line protocol - note, this has one tag more: put sensor.status 1681147870 0 type=s10e device=s10e _field=val
  5. re-execute the query -> you should see two metrics now - they differ in timestamp and the additional tag
  6. put a third measurement using the Influx line protocol: sensor.status,device=s10e,type=s10e val=0 1681147680
  7. re-execute the query -> crash

You can execute this in any order of steps 2, 4 and 6. You can even leave step 4 out. If - on the other side - running 4 and 6 only, all works well -> you will see two metrics which are completely identical except the time stamps.

One can think - what a crazy idea to mix OpenTSDB and Influx line protocols. But I was unsure which one to use is the better idea. Using OpenTSDB would use consistent APIs for read and write. But when having a series of fields within a measurement, the Influx line protocol seems to be much more compact.

Note: I ran this using TT version 0.11.4 on ARM v71. I'll try the behaviour with 0.11.6 within the next couple of days.

ylin30 commented 1 year ago

I have some curious crashes with SIGSEGV (11) when TT executes a query issued via Grafana.

Thanks. We will try to repro in our side here.

  1. put another measurement using the OpenTSDB line protocol - note, this has one tag more: put sensor.status 1681147870 0 type=s10e device=s10e _field=val

_field is a reserved word for Influx line protocol. It is actually mapped to a metric in OpenTSDB protocol. We don't expect such a usage. This might be the cause. If so, we might consider blocking such a usage in the future.

You can execute this in any order of steps 2, 4 and 6. You can even leave step 4 out. If - on the other side - running 4 and 6 only, all works well -> you will see two metrics which are completely identical except the time stamps.

One can think - what a crazy idea to mix OpenTSDB and Influx line protocols. But I was unsure which one to use is the better idea. Using OpenTSDB would use consistent APIs for read and write. But when having a series of fields within a measurement, the Influx line protocol seems to be much more compact.

Yah, it is really a stress testing mixture of Opentsdb and influx line protocol. Influx line protocol is definitely more compact and the preferred way to write. TickTockDB also performs much better if fields in one line are always in the same order between different requests. For example, this is good.

cpu,host=rpi,id=1 usr=10,sys=20,idle=70 1465839830000
cpu,host=rpi,id=1 usr=10,sys=20,idle=70 1465839890000

This is bad:

cpu,host=rpi,id=1 usr=10,sys=20,idle=70 1465839830000
cpu,host=rpi,id=1 sys=20,idle=70,usr=10 1465839890000

Note: I ran this using TT version 0.11.4 on ARM v71. I'll try the behaviour with 0.11.6 within the next couple of days.

I suspect 0.11.6 might also crash in this scenario.

We will get back to you once we confirm the repro. Thanks!

jens-ylja commented 1 year ago

Regarding the usage of _field in OpenTSDB protocol uses. This was intended to be able to use the same queries (e.g. in Grafana) regardless of the line protocol used for writing the data. Otherwise the queries and/or evaluation of the results must differ depending on which line protocol was used when writing the data.

ylin30 commented 1 year ago

Regarding the usage of _field in OpenTSDB protocol uses. This was intended to be able to use the same queries (e.g. in Grafana) regardless of the line protocol used for writing the data. Otherwise the queries and/or evaluation of the results must differ depending on which line protocol was used when writing the data.

Fair. You are right.

_field is a reserved word for Influx line protocol. It is actually mapped to a metric in OpenTSDB protocol.

I was wrong in this comment above. It is actually considered a new time series but still a tag.

So far I tried 0.11.5 in RPI-zero-w (ARM V6 - 32bit) and it doesn't crash although the query results look fishy to me. m=first:sensor.status It seems the first keyword is not processed properly. I am resetting to 0.11.4 and see if TT crash.

ylin30 commented 1 year ago

@jens-ylja I can repro the crash in 0.11.4 in RPI-zero-w (ArmV6 32bit) now. You can try 0.11.5 or latest version 0.11.6 with no crash. Nevertheless, there are still a couple of problems we need to resolve. Mainly:

  1. Opentsdb put using _field (such as put sensor.status 1681147870 0 type=s10e device=s10e _field=val) will be blocked. It will cause mismatch in meta data. Reads with _field are still allowed.
  2. WE will double check queries like m=first:sensor.status.

Thanks for your efforts. It really helps us improve TickTockDB.

ytyou commented 1 year ago
  1. This is indeed a bug. We have made a fix in the 'dev' branch. It needs further testing before it can be released;
  2. 'first' and 'last' can not be used for aggregation; it only make sense when used to perform downsampling;
  3. Using '_field' in OpenTSDB style put request is still allowed, but we are considering to ban it. Maybe there's a good reason to support it...
jens-ylja commented 1 year ago

@ytyou - thanks for reply. Regarding your points:

  1. This is indeed a bug. We have made a fix in the 'dev' branch. It needs further testing before it can be released;

Good to hear I could help to find and fix a bug.

  1. 'first' and 'last' can not be used for aggregation; it only make sense when used to perform downsampling;

For my queries none of the aggregations make sense at all. I have a couple of these sensor.status series, each with a different type and device tag set - the values are discrete (enum values). I want to query them all together but do not want them to be aggregated. Unfortunately with Grafana I'm required to provide an aggregation. But none of these aggregations makes sense at all for enumeration values. The only ones working - providing all sensor.status series as separate metrics - are first and last. A none aggregation would be nice but I cannot configure it to Grafana.

  1. Using '_field' in OpenTSDB style put request is still allowed, but we are considering to ban it. Maybe there's a good reason to support it...

My reason was the query compatibility between metrics written with OpenTSDB line protocol and these written with the Influx one. But as @ylin30's suggestion was to use the Influx protocol - this compatibility isn't a reason any longer (for me).

ytyou commented 1 year ago

We will add a "none" aggregator. Thanks for the suggestion.

ylin30 commented 1 year ago
  1. 'first' and 'last' can not be used for aggregation; it only make sense when used to perform downsampling;

For my queries none of the aggregations make sense at all. I have a couple of these sensor.status series, each with a different type and device tag set - the values are discrete (enum values). I want to query them all together but do not want them to be aggregated. Unfortunately with Grafana I'm required to provide an aggregation. But none of these aggregations makes sense at all for enumeration values. The only ones working - providing all sensor.status series as separate metrics - are first and last. A none aggregation would be nice but I cannot configure it to Grafana.

@jens-ylja Are you looking for a way to retrieve all original individual data points in related time series? If so, some other users raised the same question before. A none aggregator would be a good fit though OpenTSDB doesn't provide it. So far our workaround is to explicitly add all tags in query, such as type=* device=*.

jens-ylja commented 1 year ago

@jens-ylja Are you looking for a way to retrieve all original individual data points in related time series? If so, some other users raised the same question before. A none aggregator would be a good fit though OpenTSDB doesn't provide it. So far our workaround is to explicitly add all tags in query, such as type=* device=*.

@ylin30 According to http://opentsdb.net/docs/build/html/user_guide/query/aggregators.html OpenTSDB seems to have a none aggregator since version 2.3 - Am I wrong with this?

I checked the type=* device=* notation - it does the job :)

Thanks Jens

ytyou commented 1 year ago

@jens-ylja We have just released v0.11.7, which should have all the fixes mentioned in this ticket. A "none" aggregator is added, although you may need to refresh the page a few times for Grafana to ditch its cache and pull from TickTockDB again.

jens-ylja commented 1 year ago

@ytyou I switched to 0.11.7 now and can confirm the none aggregator to work -> removed the type=* device=* query tags. I additionally can confirm the first and last aggregators are not accepted any longer (except for down sampling).

I didn't verified the "crash test" - but as I understand, the tag name _field shouldn't be allowed any longer. I'll give feedback after trying this too.

jens-ylja commented 1 year ago

I just re-tested the origin scenario with TT version 0.11.7:

  1. put sensor.status 1681147860 0 type=s10e device=s10e
  2. curl 'http://localhost:5000/api/query?start=0&m=none:sensor.status' -> this returned a single metric -> correct
  3. put sensor.status 1681147870 0 type=s10e device=s10e _field=val
  4. curl 'http://localhost:5000/api/query?start=0&m=none:sensor.status' -> this returned two metrics -> as expected
  5. sensor.status,device=s10e,type=s10e val=0 1681147680
  6. curl 'http://localhost:5000/api/query?start=0&m=none:sensor.status' -> this returned two metrics (one with two values) -> as expected

Seems this issue is fixed. I close it.