You may need to set image pull progress deadline higher for dockershim #31

This Thursday, we encountered an issue where we could not create pods from a 3+GB docker image. Here is what it looks like when we run kubectl describe:

(The domain names below have been renamed to follow company policy.)

Events:
  Type     Reason   Age                  From                                           Message
  ----     ------   ----                 ----                                           -------
  Normal   Killing  24m                  kubelet, tess-node-2g2mc.sutarts.rno.yabe.com  Killing container with id docker://cassini-pod-jaghrr:Container spec hash changed (704561258 vs 269937958).. Container will be killed and recreated.
  Warning  Failed   7m44s (x3 over 14m)  kubelet, tess-node-2g2mc.sutarts.rno.yabe.com  Failed to pull image "ecr.vip.yabec3.com/cassini/cassini:5.0.13-7": rpc error: code = Canceled desc = context canceled
  Warning  Failed   7m44s (x3 over 14m)  kubelet, tess-node-2g2mc.sutarts.rno.yabe.com  Error: ErrImagePull
  Normal   BackOff  7m7s (x5 over 14m)   kubelet, tess-node-2g2mc.sutarts.rno.yabe.com  Back-off pulling image "ecr.vip.yabe.com/cassini/cassini:5.0.13-7"
  Warning  Failed   7m7s (x5 over 14m)   kubelet, tess-node-2g2mc.sutarts.rno.yabe.com  Error: ImagePullBackOff
  Normal   Pulling  6m56s (x4 over 24m)  kubelet, tess-node-2g2mc.sutarts.rno.yabe.com  pulling image "ecr.vip.yabec3.com/cassini/cassini:5.0.13-7"

We can see that the image pull is failing, and it keeps failing after several retries.

It looks like our registry server cannot handle the load of 60+ nodes (a full rack) pulling a 3+GB image simultaneously, even though we have 5 cache pods running locally inside the cluster. By design, the cache pods fetch data from remote storage and proxy it to the end clients; once an image layer is available in the cache pods, no further transfer should involve the remote storage at all.
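
To make the design concrete, here is a minimal sketch of the idea behind those cache pods: an HTTP server that serves image blobs from local disk and, on a miss, fetches them from the remote registry once while keeping a copy. It is not our actual cache pod implementation; the upstream URL, cache directory, and port are placeholders.

package main

import (
    "io"
    "log"
    "net/http"
    "os"
    "path/filepath"
)

// A minimal pull-through blob cache: serve a layer from local disk if we have
// it, otherwise fetch it from the remote registry once and keep a copy.
const (
    upstream = "https://remote-registry.example.com" // placeholder remote storage
    cacheDir = "/var/cache/blobs"                    // placeholder cache location
)

func serveBlob(w http.ResponseWriter, r *http.Request) {
    local := filepath.Join(cacheDir, filepath.Base(r.URL.Path))

    // Cache hit: stream the blob straight from local disk.
    if f, err := os.Open(local); err == nil {
        defer f.Close()
        io.Copy(w, f)
        return
    }

    // Cache miss: fetch from the remote registry once...
    resp, err := http.Get(upstream + r.URL.Path)
    if err != nil {
        http.Error(w, err.Error(), http.StatusBadGateway)
        return
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        w.WriteHeader(resp.StatusCode)
        io.Copy(w, resp.Body)
        return
    }

    // ...and tee the bytes to disk while proxying them to the client, so every
    // later pull of this blob is served locally.
    f, err := os.Create(local)
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    defer f.Close()
    io.Copy(w, io.TeeReader(resp.Body, f))
}

func main() {
    http.HandleFunc("/", serveBlob)
    log.Fatal(http.ListenAndServe(":5000", nil))
}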

Maybe it is caused by continuous failures while the cache pods fetch from remote storage for the first time? To test that out, I tried the pull manually and observed a transfer rate of around 2MB/s, which is far below what I expected, but the pull did not fail.

I was getting a little worried: people were blocked, and I could not figure out exactly where the issue was. To make some progress, I realized I could at least rule out the possibility that the initial download was the problem, so I deployed a controller to populate the image data into the cache pods before any real client pulls it, hoping that all pulls could then be served locally right away. Luckily, running the controller was easy, since it had been developed a long time ago.

I had to make a few modifications since the tool had aged, but then the controller ran successfully. I created a spec, which the controller watches in order to fetch and cache the image data proactively. After a while, I confirmed that all the layers were synced successfully, which means they were all available locally. But, again, it did not help: kubelet was still reporting image pull failure events. What the heck?

Worried and hopeless, I started to increase the number of cache pods, from 5 to 20, then 40, and eventually, crazily, to 100, but still no luck.

So what could I do? People were waiting on me; the provisioning was still going on, yet all the new pods sat in ImagePullBackOff status. As a last resort, I had to help people out, so I logged in to every node and ran the pull manually, from 5:00pm to 11:00pm. Sounds crazy, right? I promised myself I would not do that again.

That was the end of Thursday: all the provisioned pods were running fine, but that was not the end of the story. Provisioning would continue on Friday; did I really have to spend the whole day doing the same manual pulling, this time on more than 300 nodes? No. I started to think it through: why did my manual pull work while kubelet's did not? Maybe something was wrong in kubelet rather than in our caching infrastructure.

I then looked into the kubelet code where the image pulling logic resides, and soon realized there might really be something wrong.

func (d *kubeDockerClient) PullImage(image string, auth dockertypes.AuthConfig, opts dockertypes.ImagePullOptions) error {
    // RegistryAuth is the base64 encoded credentials for the registry
    base64Auth, err := base64EncodeAuth(auth)
    if err != nil {
        return err
    }
    opts.RegistryAuth = base64Auth
    ctx, cancel := d.getCancelableContext()
    defer cancel()
    resp, err := d.client.ImagePull(ctx, image, opts)
    if err != nil {
        return err
    }
    defer resp.Close()
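    // The reporter below shares cancel() with the ImagePull request above: if it
    // does not see a new progress message within d.imagePullProgressDeadline, it
    // calls cancel() and the pull aborts, which surfaces as the
    // "rpc error: code = Canceled desc = context canceled" seen in the events.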
    reporter := newProgressReporter(image, cancel, d.imagePullProgressDeadline)
    reporter.start()
    defer reporter.stop()
    decoder := json.NewDecoder(resp)
    for {
        var msg dockermessage.JSONMessage
        err := decoder.Decode(&msg)
        if err == io.EOF {
            break
        }
        if err != nil {
            return err
        }
        if msg.Error != nil {
            return msg.Error
        }
        reporter.set(&msg)
    }
    return nil
}

Focus on the reporter part and we can see: the progress reporter will cancel the request if it does not see any message come out of the pull session for longer than the image pull progress deadline. Could that be what happened in our case? Remember, I had been running image pulls manually for several hours during that horrible Thursday, and I do remember periods where the session seemed to be stuck but eventually went through.
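
To make that failure mode concrete, here is a small self-contained sketch of the watchdog pattern the reporter uses. It is not the actual progressReporter code; the names (progressWatchdog, observe, run) and the tiny intervals are made up for illustration, but the mechanism is the same: remember when the last message arrived and cancel the pull once the silence exceeds the deadline.

// Not the real progressReporter from dockershim: names, intervals and the
// simulated pull below are made up, but the mechanism is the same.
package main

import (
    "context"
    "fmt"
    "sync"
    "time"
)

type progressWatchdog struct {
    mu       sync.Mutex
    lastSeen time.Time          // when the last progress message arrived
    cancel   context.CancelFunc // cancels the in-flight image pull
    deadline time.Duration      // e.g. kubelet's --image-pull-progress-deadline (default 1m)
}

// observe records that a new progress message came out of the pull session.
func (w *progressWatchdog) observe() {
    w.mu.Lock()
    defer w.mu.Unlock()
    w.lastSeen = time.Now()
}

// run checks periodically; if no message was seen within the deadline, it
// cancels the context and the pull fails with "context canceled".
func (w *progressWatchdog) run(ctx context.Context, every time.Duration) {
    ticker := time.NewTicker(every)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            w.mu.Lock()
            idle := time.Since(w.lastSeen)
            w.mu.Unlock()
            if idle > w.deadline {
                fmt.Printf("no pull progress for %v, cancelling\n", idle.Round(time.Second))
                w.cancel()
                return
            }
        }
    }
}

func main() {
    ctx, cancel := context.WithCancel(context.Background())
    w := &progressWatchdog{lastSeen: time.Now(), cancel: cancel, deadline: 2 * time.Second}
    go w.run(ctx, 500*time.Millisecond)

    // Simulate a pull: progress messages stream for about a second, then the
    // session goes silent (think of docker verifying/extracting a huge layer).
    for i := 0; i < 10; i++ {
        w.observe()
        time.Sleep(100 * time.Millisecond)
    }
    <-ctx.Done()
    fmt.Println("pull aborted:", ctx.Err()) // context canceled
}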

That was an exciting moment; it looked like I had found the root cause. So I did one more manual pull (didn't I say I wouldn't do that again?) and measured the time spent in each phase. There are three key phases to care about: downloading the layers, verifying their checksums, and extracting them.

Note that nothing changes in the console during the verifying phase, yet how long did the pull stay in that phase? 1m14s! Meanwhile, the default deadline configured on kubelet is just 1 minute:

$ /usr/local/bin/kubelet --help | grep image-pull-progress-deadline
      --image-pull-progress-deadline duration                                                                     If no pulling progress is made before this deadline, the image pulling will be cancelled. (default 1m0s)

By this point I was able to confirm what was going on, and we could solve the issue by setting --image-pull-progress-deadline higher (5m). That is exactly what I did for the rest of Friday: I applied the change to all 2000 nodes in that cluster, and everyone is happy now.
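
For completeness, this is roughly what the change looks like on a node where kubelet runs under systemd and its unit expands $KUBELET_EXTRA_ARGS in ExecStart (as kubeadm-style setups do); the drop-in path and variable name below are examples, not a universal recipe:

# /etc/systemd/system/kubelet.service.d/10-image-pull-deadline.conf (example path)
[Service]
Environment="KUBELET_EXTRA_ARGS=--image-pull-progress-deadline=5m"

$ sudo systemctl daemon-reload
$ sudo systemctl restart kubelet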
