falcon-hbs cpu负载过高的定位

今天早上发现falcon-hbs进程又不在了，重新启动falcon-hbs进程后，发现其占用的cpu持续在600%~800%之间。

[zhangshuangyan@gs-server-5693 ~]$ top

top - 11:34:29 up 178 days, 21:36,  1 user,  load average: 8.21, 9.62, 9.59
Tasks: 308 total,   4 running, 304 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.7 us,  0.2 sy,  0.0 ni, 98.9 id,  0.2 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 16261276 total, 11915604 free,  2886908 used,  1458764 buff/cache
KiB Swap: 16777212 total, 15668760 free,  1108452 used. 13054704 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                 
52125 root      20   0 2505800 1.606g   4564 S 656.2 10.4 432:03.15 falcon-hbs                                                                                                                              
 4023 mysql     20   0 4751528 203796   3832 S  68.8  1.3   1586:43 mysqld

使用golang的定位工具pprof，获取30s内cpu的统计信息

[zhangshuangyan@gs-server-5693 ~]$ go tool pprof http://localhost:6031/debug/pprof/profile?seconds=30
Fetching profile over HTTP from http://localhost:6031/debug/pprof/profile?seconds=30
Saved profile in /home/zhangshuangyan@gridsum.com/pprof/pprof.falcon-hbs.samples.cpu.001.pb.gz
File: falcon-hbs
Type: cpu
Time: Apr 9, 2019 at 11:14am (CST)
Duration: 30.05s, Total samples = 3.33mins (665.49%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top20
Showing nodes accounting for 123.15s, 61.58% of 199.97s total
Dropped 261 nodes (cum <= 1s)
Showing top 20 nodes out of 117
      flat  flat%   sum%        cum   cum%
    23.26s 11.63% 11.63%     23.26s 11.63%  runtime.heapBitsForObject /usr/local/go/src/runtime/mbitmap.go
     9.95s  4.98% 16.61%     23.42s 11.71%  runtime.scanobject /usr/local/go/src/runtime/mgcmark.go
     9.07s  4.54% 21.14%     12.24s  6.12%  runtime.mapaccess2_faststr /usr/local/go/src/runtime/hashmap_fast.go
     7.47s  3.74% 24.88%     10.63s  5.32%  runtime.gcmarknewobject /usr/local/go/src/runtime/mgcmark.go
     6.46s  3.23% 28.11%      8.30s  4.15%  runtime.step /usr/local/go/src/runtime/symtab.go
     6.35s  3.18% 31.28%     67.12s 33.57%  runtime.mallocgc /usr/local/go/src/runtime/malloc.go
     5.76s  2.88% 34.17%     16.68s  8.34%  runtime.scanblock /usr/local/go/src/runtime/mgcmark.go
     5.30s  2.65% 36.82%      5.30s  2.65%  runtime.greyobject /usr/local/go/src/runtime/mbitmap.go
     5.19s  2.60% 39.41%      5.19s  2.60%  runtime.memmove /usr/local/go/src/runtime/memmove_amd64.s
     5.14s  2.57% 41.98%     13.44s  6.72%  runtime.pcvalue /usr/local/go/src/runtime/symtab.go
     4.67s  2.34% 44.32%     41.24s 20.62%  runtime.gentraceback /usr/local/go/src/runtime/traceback.go
     4.64s  2.32% 46.64%     10.04s  5.02%  runtime.sweepone /usr/local/go/src/runtime/mgcsweep.go
     4.63s  2.32% 48.95%    146.76s 73.39%  github.com/open-falcon/falcon-plus/modules/hbs/db.QueryHostMetricTags /home/zhangshuangyan@gridsum.com/go/src/github.com/open-falcon/falcon-plus/modules/hbs/db/alarm.go
     4.51s  2.26% 51.21%      4.54s  2.27%  runtime.greyobject /usr/local/go/src/runtime/mgcmark.go
     4.44s  2.22% 53.43%      7.71s  3.86%  runtime.mapaccess1_faststr /usr/local/go/src/runtime/hashmap_fast.go
     3.94s  1.97% 55.40%     21.47s 10.74%  runtime.writebarrierptr_prewrite1 /usr/local/go/src/runtime/mbarrier.go
     3.39s  1.70% 57.09%      3.39s  1.70%  runtime.memeqbody /usr/local/go/src/runtime/asm_amd64.s
     3.11s  1.56% 58.65%      3.11s  1.56%  runtime.heapBitsSetType /usr/local/go/src/runtime/mbitmap.go
     3.07s  1.54% 60.18%    111.35s 55.68%  runtime.systemstack /usr/local/go/src/runtime/asm_amd64.s
     2.80s  1.40% 61.58%     12.42s  6.21%  runtime.shade /usr/local/go/src/runtime/mgcmark.go

可以看到，github.com/open-falcon/falcon-plus/modules/hbs/db.QueryHostMetricTags这个方法占用了73.39%的cpu，我们看下这个方法的源码：

func QueryHostMetricTags() (map[string]map[string]string, error) {
    hostMetricTags := make(NestedMap)

    sql := "select metric, hosts, tags from alarm_tag where is_active=1"
    rows, err := DB.Query(sql)
    if err != nil {
        log.Println(fmt.Sprintf("ERROR: %s\nsql: %s", err, sql))
        return nil, err
    }

    defer rows.Close()
    for rows.Next() {
        var (
            metric string
            hosts  string
            tags   string
        )

        err = rows.Scan(&metric, &hosts, &tags)
        if err != nil {
            log.Println("ERROR:", err)
            continue
        }

        for _, host := range parseHosts(hosts) {
            value := hostMetricTags.setdefault(host, metric, "")
            if value == "" {
                hostMetricTags[host][metric] = tags
            } else {
                hostMetricTags[host][metric] += "," + tags
            }
        }
    }

    for host, metrics := range hostMetricTags {
        for metric, tags := range metrics {
            hostMetricTags[host][metric] = parseTags(tags)
        }
    }

    return hostMetricTags, nil
}

主要是查询alarm_tag表中状态为1的条目，得到一个嵌套字典，结合top统计中MySQL的cpu负载也很高，推断alarm_tag这个表中的数据应该比较大，进入数据库查看alarm_tag表的统计信息：

mysql> select count(*) from alarm_tag;
+----------+
| count(*) |
+----------+
|    23812 |
+----------+
1 row in set (0.02 sec)

数据量已经达到上万条，而每分钟会有上千台agent远程调用这个方法查询主机管理的告警策略，造成了较高的cpu和iowait。

跟告警模块沟通后，发现是压测造的数据，把表中的数据清空后，falcon-hbs的cpu恢复正常。

[zhangshuangyan@gs-server-5693 falcon]$ top

top - 13:55:56 up 178 days, 23:58,  2 users,  load average: 1.46, 7.26, 9.41
Tasks: 308 total,   1 running, 307 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.4 us,  0.3 sy,  0.0 ni, 99.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 16261276 total, 13528336 free,  1227588 used,  1505352 buff/cache
KiB Swap: 16777212 total, 15699900 free,  1077312 used. 14705352 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                 
 4023 mysql     20   0 4751528 209184   4548 S   3.0  1.3   1633:34 mysqld                                                                                                                                  
36410 root      20   0  469104  27036   3952 S   2.7  0.2   0:01.16 falcon-hbs

从这次定位得出的结论：

falcon-hbs目前能够支撑配置TAG的告警策略数量有限，不能达到万的数量级
可以加一层缓存，降低cpu和io

zshuangyan / blog

falcon-hbs cpu负载过高的定位 #33