Open zshuangyan opened 5 years ago
This morning I found that the falcon-hbs process had died again. After restarting it, its CPU usage stayed between 600% and 800%:
```
[zhangshuangyan@gs-server-5693 ~]$ top
top - 11:34:29 up 178 days, 21:36,  1 user,  load average: 8.21, 9.62, 9.59
Tasks: 308 total,   4 running, 304 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.7 us,  0.2 sy,  0.0 ni, 98.9 id,  0.2 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 16261276 total, 11915604 free,  2886908 used,  1458764 buff/cache
KiB Swap: 16777212 total, 15668760 free,  1108452 used. 13054704 avail Mem

  PID USER   PR NI    VIRT    RES  SHR S  %CPU %MEM    TIME+    COMMAND
52125 root   20  0 2505800 1.606g 4564 S 656.2 10.4  432:03.15  falcon-hbs
 4023 mysql  20  0 4751528 203796 3832 S  68.8  1.3 1586:43     mysqld
```
Using Go's profiling tool pprof, I collected 30 seconds of CPU statistics:
```
[zhangshuangyan@gs-server-5693 ~]$ go tool pprof http://localhost:6031/debug/pprof/profile?seconds=30
Fetching profile over HTTP from http://localhost:6031/debug/pprof/profile?seconds=30
Saved profile in /home/zhangshuangyan@gridsum.com/pprof/pprof.falcon-hbs.samples.cpu.001.pb.gz
File: falcon-hbs
Type: cpu
Time: Apr 9, 2019 at 11:14am (CST)
Duration: 30.05s, Total samples = 3.33mins (665.49%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top20
Showing nodes accounting for 123.15s, 61.58% of 199.97s total
Dropped 261 nodes (cum <= 1s)
Showing top 20 nodes out of 117
      flat  flat%   sum%        cum   cum%
    23.26s 11.63% 11.63%     23.26s 11.63%  runtime.heapBitsForObject /usr/local/go/src/runtime/mbitmap.go
     9.95s  4.98% 16.61%     23.42s 11.71%  runtime.scanobject /usr/local/go/src/runtime/mgcmark.go
     9.07s  4.54% 21.14%     12.24s  6.12%  runtime.mapaccess2_faststr /usr/local/go/src/runtime/hashmap_fast.go
     7.47s  3.74% 24.88%     10.63s  5.32%  runtime.gcmarknewobject /usr/local/go/src/runtime/mgcmark.go
     6.46s  3.23% 28.11%      8.30s  4.15%  runtime.step /usr/local/go/src/runtime/symtab.go
     6.35s  3.18% 31.28%     67.12s 33.57%  runtime.mallocgc /usr/local/go/src/runtime/malloc.go
     5.76s  2.88% 34.17%     16.68s  8.34%  runtime.scanblock /usr/local/go/src/runtime/mgcmark.go
     5.30s  2.65% 36.82%      5.30s  2.65%  runtime.greyobject /usr/local/go/src/runtime/mbitmap.go
     5.19s  2.60% 39.41%      5.19s  2.60%  runtime.memmove /usr/local/go/src/runtime/memmove_amd64.s
     5.14s  2.57% 41.98%     13.44s  6.72%  runtime.pcvalue /usr/local/go/src/runtime/symtab.go
     4.67s  2.34% 44.32%     41.24s 20.62%  runtime.gentraceback /usr/local/go/src/runtime/traceback.go
     4.64s  2.32% 46.64%     10.04s  5.02%  runtime.sweepone /usr/local/go/src/runtime/mgcsweep.go
     4.63s  2.32% 48.95%    146.76s 73.39%  github.com/open-falcon/falcon-plus/modules/hbs/db.QueryHostMetricTags /home/zhangshuangyan@gridsum.com/go/src/github.com/open-falcon/falcon-plus/modules/hbs/db/alarm.go
     4.51s  2.26% 51.21%      4.54s  2.27%  runtime.greyobject /usr/local/go/src/runtime/mgcmark.go
     4.44s  2.22% 53.43%      7.71s  3.86%  runtime.mapaccess1_faststr /usr/local/go/src/runtime/hashmap_fast.go
     3.94s  1.97% 55.40%     21.47s 10.74%  runtime.writebarrierptr_prewrite1 /usr/local/go/src/runtime/mbarrier.go
     3.39s  1.70% 57.09%      3.39s  1.70%  runtime.memeqbody /usr/local/go/src/runtime/asm_amd64.s
     3.11s  1.56% 58.65%      3.11s  1.56%  runtime.heapBitsSetType /usr/local/go/src/runtime/mbitmap.go
     3.07s  1.54% 60.18%    111.35s 55.68%  runtime.systemstack /usr/local/go/src/runtime/asm_amd64.s
     2.80s  1.40% 61.58%     12.42s  6.21%  runtime.shade /usr/local/go/src/runtime/mgcmark.go
```
As the profile shows, github.com/open-falcon/falcon-plus/modules/hbs/db.QueryHostMetricTags accounts for 73.39% of cumulative CPU time. Here is the source of that function:
```go
func QueryHostMetricTags() (map[string]map[string]string, error) {
	hostMetricTags := make(NestedMap)
	sql := "select metric, hosts, tags from alarm_tag where is_active=1"
	rows, err := DB.Query(sql)
	if err != nil {
		log.Println(fmt.Sprintf("ERROR: %s\nsql: %s", err, sql))
		return nil, err
	}
	defer rows.Close()

	for rows.Next() {
		var (
			metric string
			hosts  string
			tags   string
		)
		err = rows.Scan(&metric, &hosts, &tags)
		if err != nil {
			log.Println("ERROR:", err)
			continue
		}
		for _, host := range parseHosts(hosts) {
			value := hostMetricTags.setdefault(host, metric, "")
			if value == "" {
				hostMetricTags[host][metric] = tags
			} else {
				hostMetricTags[host][metric] += "," + tags
			}
		}
	}

	for host, metrics := range hostMetricTags {
		for metric, tags := range metrics {
			hostMetricTags[host][metric] = parseTags(tags)
		}
	}
	return hostMetricTags, nil
}
```
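For context, `NestedMap` and its `setdefault` helper are defined elsewhere in the hbs codebase; the names come from the snippet above, but the bodies below are my assumption about what they likely look like (a Go analogue of Python's `dict.setdefault`):

```go
package main

import "fmt"

// NestedMap maps host -> metric -> tags, as used by QueryHostMetricTags.
type NestedMap map[string]map[string]string

// setdefault returns the existing value for (host, metric). When the key
// is absent it stores fallback (creating the inner map if needed) and
// returns fallback -- analogous to Python's dict.setdefault.
func (m NestedMap) setdefault(host, metric, fallback string) string {
	inner, ok := m[host]
	if !ok {
		inner = make(map[string]string)
		m[host] = inner
	}
	if v, ok := inner[metric]; ok {
		return v
	}
	inner[metric] = fallback
	return fallback
}

func main() {
	m := make(NestedMap)
	fmt.Println(m.setdefault("host1", "cpu.idle", "")) // key absent: stores and returns ""
	m["host1"]["cpu.idle"] = "env=prod"
	fmt.Println(m.setdefault("host1", "cpu.idle", "")) // key present: returns "env=prod"
}
```

Note that each call to QueryHostMetricTags rebuilds this whole structure from scratch and grows tag strings by repeated `+=` concatenation, which matches the large share of samples the profile attributes to runtime.mallocgc and the GC mark/sweep functions.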
The function queries all active entries (is_active=1) in the alarm_tag table and builds a nested map from them. Since top also showed high CPU load on MySQL, I suspected the alarm_tag table had grown large, so I checked it in the database:
```
mysql> select count(*) from alarm_tag;
+----------+
| count(*) |
+----------+
|    23812 |
+----------+
1 row in set (0.02 sec)
```
The table had grown to over twenty thousand rows, and every minute more than a thousand agents remotely invoke this method to fetch the alarm policies for their hosts, which produced the high CPU usage and iowait.
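A rough back-of-the-envelope estimate (assuming, per the text above, on the order of 1,000 agents each triggering one full-table query per minute) shows why this scales badly:

```go
package main

import "fmt"

func main() {
	const rows = 23812  // alarm_tag row count from the query above
	const agents = 1000 // order-of-magnitude agent count from the text

	// Each remote call re-reads and re-parses every active row,
	// so the per-minute work is roughly rows * agents row scans,
	// each of which also allocates strings and map entries.
	fmt.Println(rows * agents) // ~23.8 million row scans per minute
}
```

The cost is multiplicative: either a large table or a high call rate alone is tolerable, but both together are not.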
After checking with the alerting team, it turned out the rows were leftover data from a stress test. Once the table was cleared, falcon-hbs CPU usage returned to normal:
```
[zhangshuangyan@gs-server-5693 falcon]$ top
top - 13:55:56 up 178 days, 23:58,  2 users,  load average: 1.46, 7.26, 9.41
Tasks: 308 total,   1 running, 307 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.4 us,  0.3 sy,  0.0 ni, 99.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 16261276 total, 13528336 free,  1227588 used,  1505352 buff/cache
KiB Swap: 16777212 total, 15699900 free,  1077312 used. 14705352 avail Mem

  PID USER   PR NI    VIRT    RES  SHR S %CPU %MEM    TIME+    COMMAND
 4023 mysql  20  0 4751528 209184 4548 S  3.0  1.3 1633:34     mysqld
36410 root   20  0  469104  27036 3952 S  2.7  0.2    0:01.16  falcon-hbs
```
Takeaways from this investigation:

- The cost of QueryHostMetricTags grows with both the size of alarm_tag and the number of agents calling it, so leftover bulk data translates directly into hbs CPU load.
- Most of the profiled time was spent inside the Go runtime (runtime.mallocgc plus GC mark/sweep), i.e. rebuilding the nested map and concatenating tag strings on every call amplified the problem through allocation and GC pressure.
- Data written during load testing should be cleaned up afterwards, since it can silently degrade production services long after the test ends.
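Beyond cleaning up the data, one obvious hardening step would be to cache the query result and refresh it on a timer, so each agent RPC reads memory instead of re-running the full-table query. This is only my sketch of the idea, not necessarily how falcon-hbs implements it; `query` below stands in for db.QueryHostMetricTags:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// tagCache holds the latest host->metric->tags result so agent RPCs
// read memory instead of hitting MySQL on every heartbeat.
type tagCache struct {
	mu   sync.RWMutex
	data map[string]map[string]string
}

func (c *tagCache) Get() map[string]map[string]string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.data
}

func (c *tagCache) Set(d map[string]map[string]string) {
	c.mu.Lock()
	c.data = d
	c.mu.Unlock()
}

// refreshLoop re-runs the DB query once per interval instead of once
// per agent call; query stands in for db.QueryHostMetricTags here.
func refreshLoop(c *tagCache, interval time.Duration,
	query func() map[string]map[string]string, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			c.Set(query())
		case <-stop:
			return
		}
	}
}

func main() {
	c := &tagCache{}
	c.Set(map[string]map[string]string{"host1": {"cpu.idle": "env=prod"}})
	fmt.Println(c.Get()["host1"]["cpu.idle"])
}
```

With this shape, a large alarm_tag table costs one query per refresh interval regardless of how many agents poll, turning the multiplicative cost above into an additive one.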