ytyou / ticktock

TickTockDB is an OpenTSDB-like time series database, with much better performance.
GNU General Public License v3.0

tt process hangs after ping request #63

Open corecell opened 7 months ago

corecell commented 7 months ago

Hello there!

First of all, the project doesn't build due to this error: include/kv.h:60:46: error: ‘std::string’ has not been declared. It can be fixed simply by adding #include <string> to include/kv.h.

After building, I simply ran ./bin/tt -c conf/tt.conf and, in another console, ran curl -XPOST http://127.0.0.1:6182/api/admin?cmd=ping. After that, the tt process started using 100% of one of the four CPU cores.

Checked versions: v0.11.8 - v0.12.1
Build command: # make -f Makefile.ubuntu all
Build log: https://pastebin.com/ZscYapa5
System info: https://pastebin.com/qDnux7hs
$ ./bin/tt -c conf/tt.conf output: https://pastebin.com/34BP8YDw
ticktock.log (log.level = DEBUG): https://pastebin.com/mD5rWitT

ylin30 commented 7 months ago

That's a serious problem. Thanks for the report. It seems you are running TT on a Dell OptiPlex with Arch Linux. We have never tested it on Arch Linux before. We will get back to you ASAP.

ylin30 commented 7 months ago

Using a Docker container based on the official Arch Linux image, I can reproduce the build failure but not the 100% CPU usage problem. TT works fine after the ping succeeds.

I noticed this in your log: 2023-12-06 11:09:38.228 [INFO] [http_listener_10] Interrupted (11), shutting down.... There was a segmentation fault, so TT didn't start up correctly in your case.

We don't have a Dell OptiPlex to reproduce this on, but as a next step we will test in a VM instead of Docker. There are lots of warnings in the build even after the #include <string> fix in include/kv.h. The warnings may also be related to the segmentation fault.

ytyou commented 7 months ago

TT crashes as soon as you ping it (using admin/ping.sh). The crash is caused by the "alignas(64)" attribute on TcpConnection. In file include/tcp.h, TcpConnection is defined as

class alignas(64) TcpConnection : public Recyclable

For some reason, ArchLinux does not like alignas(64). A quick work-around is to remove it as follows:

class TcpConnection : public Recyclable

And then re-compile everything. This should unblock you. In the meantime, we will figure out a permanent solution.

Thanks!
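
The thread does not establish why alignas(64) crashes on Arch Linux. One general hazard with over-aligned types, offered here only as an assumption and not as a diagnosis of TT, is that the compiler may assume 64-byte alignment while the object's storage comes from an allocator that guarantees less. The sketch below uses a hypothetical stand-in type (AlignedConn, not TT's TcpConnection) and hypothetical helpers (is_aligned, make_aligned, destroy_aligned) to show how such a type can be allocated with its alignment honored and how to check an address. It relies on posix_memalign, so it assumes a POSIX system.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <stdlib.h>  // posix_memalign, free (POSIX)

// Hypothetical stand-in for an over-aligned class; NOT TickTockDB's TcpConnection.
struct alignas(64) AlignedConn {
    char buffer[100];
};

// Returns true if the address p is a multiple of `alignment`.
bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Allocates storage that honors alignof(T), then constructs a T in it.
template <typename T>
T* make_aligned() {
    void* raw = nullptr;
    // posix_memalign needs a power-of-two alignment that is a multiple of sizeof(void*).
    if (posix_memalign(&raw, alignof(T), sizeof(T)) != 0) {
        throw std::bad_alloc();
    }
    return new (raw) T();  // placement new into the aligned storage
}

// Destroys the object and releases storage obtained from make_aligned().
template <typename T>
void destroy_aligned(T* p) {
    p->~T();
    free(p);
}
```

Calling make_aligned<AlignedConn>() and passing the result to is_aligned(ptr, 64) is one way to check at runtime whether an object really sits on a 64-byte boundary. The same check applied to objects from an existing allocation path would show whether that path honors the requested alignment.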

corecell commented 7 months ago

Yeah, I have a Dell Optiplex 3050 as a small home server.

> A quick work-around is to remove it as follows: class TcpConnection : public Recyclable

This work-around works fine; the pong answer was received as expected.

I ran the Docker image (latest tag), and something seems wrong even with Docker on Arch Linux. docker ps output:
cf794e6bdf19 ytyou/ticktock:latest "/opt/ticktock/scrip…" 2 days ago Up 8 minutes (health: starting) 6180/tcp, 6183/tcp, 0.0.0.0:6181-6182->6181-6182/tcp, 0.0.0.0:6181->6181/udp, :::6181-6182->6181-6182/tcp, :::6181->6181/udp practical_euclid

Log from Docker after the ping request: https://pastebin.com/u33pVQUp
In this situation, however, no CPU core is loaded at 100%. I can provide any other information if needed.

Thanks!

ytyou commented 7 months ago

We use 'ping' to determine the health of the TT container. Since the latest image does not include the work-around, ping always fails and Docker never considers the container healthy. Hence the "starting" status. We are working on releasing an update, including a new Docker image. So stay tuned.

Thank you.

ylin30 commented 7 months ago

> A quick work-around is to remove it as follows: class TcpConnection : public Recyclable
>
> This work-around works fine; the pong answer was received as expected.

Good to hear that.

> I ran the Docker image (latest tag), and something seems wrong even with Docker on Arch Linux. docker ps output:
> cf794e6bdf19 ytyou/ticktock:latest "/opt/ticktock/scrip…" 2 days ago Up 8 minutes (health: starting) 6180/tcp, 6183/tcp, 0.0.0.0:6181-6182->6181-6182/tcp, 0.0.0.0:6181->6181/udp, :::6181-6182->6181-6182/tcp, :::6181->6181/udp practical_euclid
>
> Log from Docker after the ping request: https://pastebin.com/u33pVQUp
> In this situation, however, no CPU core is loaded at 100%. I can provide any other information if needed.

As @ytyou pointed out, the container is stuck in the "starting" state because ping doesn't work. Did you succeed in pinging TT from outside the Docker container? I can't find any ping request in your Docker log, so I wonder whether something is wrong with the port mapping between the host and the container. Can you try the ping from inside the container instead?

BTW, I see you are using TT's official Docker image (Ubuntu inside the container) on an Arch Linux host. What I did was the opposite: I ran a container with Arch Linux inside it on an Ubuntu host, and I had to git pull TT inside the container.

ylin30 commented 7 months ago

@corecell We are close to a major release (v0.20) which will dramatically improve read performance (especially for huge queries). Would you mind sticking with the work-around for the time being? We might not have the resources (including time, hardware, and experience with Arch) to ship a minor release for Arch on top of v0.12. Testing (e.g., regression tests) before a release takes a long time, especially since Arch is a new OS to us and new problems might pop up unexpectedly.

Docker may be a good solution. Please let me know if you can ping inside docker, as suggested above.

corecell commented 6 months ago

> Did you succeed in pinging TT from outside the Docker container?

@ylin30 No, the ping was unsuccessful from both outside and inside the container. The ports were checked and are accessible from outside the container. See the logs (nmap, netstat, tcpdump).

iptables counts zero valid packets after any ping is executed:
0 0 ACCEPT 6 -- !docker0 docker0 0.0.0.0/0 172.17.0.2 tcp dpt:6182
iptraf-ng shows requests with a stalled ACK flag: --A-

There is an interesting detail in the tcpdump output:
optiplex.37756 > 172.17.0.2.6182: Flags [.], cksum 0x584c (incorrect -> 0xa7e0), ack 3541126866, win 251, options [nop,nop,TS val 1775403397 ecr 1795857315], length 0
The cksum 0x584c (incorrect -> 0xa7e0) part looks like the root cause of the problem.
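
For context on what tcpdump is comparing when it prints "incorrect": the TCP checksum is the 16-bit ones' complement sum defined in RFC 1071. Below is a minimal sketch of that arithmetic only. The function name internet_checksum is hypothetical, and a real TCP checksum additionally covers a pseudo-header built from the IP addresses, which is omitted here. Note also that the tcpdump manual says outgoing packets can be flagged as bad on interfaces that compute checksums in hardware, so whether the mismatch above is a real fault is not established by this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// RFC 1071 Internet checksum over a byte buffer, read as big-endian 16-bit words.
// Sketch intended for packet-sized buffers; the TCP pseudo-header is NOT included.
std::uint16_t internet_checksum(const std::uint8_t* data, std::size_t len) {
    std::uint32_t sum = 0;
    std::size_t i = 0;
    // Add up all complete 16-bit words.
    for (; i + 1 < len; i += 2) {
        sum += (static_cast<std::uint32_t>(data[i]) << 8) | data[i + 1];
    }
    // An odd trailing byte is treated as the high byte of a zero-padded word.
    if (i < len) {
        sum += static_cast<std::uint32_t>(data[i]) << 8;
    }
    // Fold the carries back into the low 16 bits (end-around carry).
    while (sum >> 16) {
        sum = (sum & 0xFFFFu) + (sum >> 16);
    }
    // The checksum is the ones' complement of the folded sum.
    return static_cast<std::uint16_t>(~sum & 0xFFFFu);
}
```

A receiver verifies a segment by running the same sum over the data with the checksum field included; a result of zero from this function means the stored checksum matches.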

> Would you mind sticking with the work-around for the time being?

Yes, sure, I'll do my best.

Thanks!