wxdwfc / rlib

RLib is a header-only library for easier usage of RDMA.

[qp_impl.hpp:131] poll till completion error: 12 transport retry counter exceeded #3

Open minghust opened 3 years ago

minghust commented 3 years ago

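Hello, and thank you for open-sourcing rlib! I ran into a problem while using it:

Background: on the client (machine 1), thread t1 establishes two RC QP connections (QP1 and QP2) to the server (machine 2) and then spawns a new thread, t2. After that, t1 uses QP1 to issue one-sided RDMA READs to the server, while t2 uses QP2 to issue one-sided RDMA WRITEs. The READs and WRITEs from t1 and t2 run in parallel and do not conflict with each other.

This should work, but t2's RDMA WRITE never succeeds (I can tell because the server-side memory region is not modified), and t2 gets a "transport retry counter exceeded" error when polling the CQ.

According to the RDMA Aware Networks Programming User Manual (Rev 1.7), this error means: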

6.2.13 IBV_WC_RETRY_EXC_ERR This event is generated when a sender is unable to receive feedback from the receiver. This means that either the receiver just never ACKs sender messages in a specified time period, or it has been disconnected or it is in a bad state which prevents it from responding.
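For readers decoding the numeric code in the log line ("error: 12"), the following is a minimal sketch that maps a few work-completion status values to their libibverbs names. The numeric values mirror `enum ibv_wc_status` in `<infiniband/verbs.h>`; the `WcStatus` enum and the `describe_wc_status` helper are hypothetical names made up for this illustration and are not part of rlib or libibverbs. In real code, `ibv_wc_status_str()` from libibverbs already returns a readable string.

```cpp
#include <string>

// Subset of work-completion status codes. The numbers mirror
// enum ibv_wc_status in libibverbs; the C++ names here are illustrative only.
enum WcStatus : int {
    kSuccess        = 0,   // IBV_WC_SUCCESS
    kWrFlushErr     = 5,   // IBV_WC_WR_FLUSH_ERR
    kRemAccessErr   = 10,  // IBV_WC_REM_ACCESS_ERR
    kRetryExcErr    = 12,  // IBV_WC_RETRY_EXC_ERR
    kRnrRetryExcErr = 13,  // IBV_WC_RNR_RETRY_EXC_ERR
};

// Hypothetical helper: returns the libibverbs enumerator name for the
// handful of codes listed above, and a generic label for anything else.
std::string describe_wc_status(int status) {
    switch (status) {
        case kSuccess:        return "IBV_WC_SUCCESS";
        case kWrFlushErr:     return "IBV_WC_WR_FLUSH_ERR";
        case kRemAccessErr:   return "IBV_WC_REM_ACCESS_ERR";
        case kRetryExcErr:    return "IBV_WC_RETRY_EXC_ERR";
        case kRnrRetryExcErr: return "IBV_WC_RNR_RETRY_EXC_ERR";
        default:              return "other status (" + std::to_string(status) + ")";
    }
}
```

Calling `describe_wc_status(12)` yields the name quoted from the manual above, which confirms that the failure is reported on the requester side after the responder did not acknowledge in time.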

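Strangely, if t2 uses QP1 for its RDMA WRITE, the write succeeds and polling works fine. (Note that QP1 and QP2 were each connected successfully by two separate calls to the connect function of class RRCQP.)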

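However, I do not want t1 and t2 to share one RC QP, because they would contend for the completion queue. For example, t1 might poll t2's ack and conclude that its own RDMA READ has completed, when the remote data may not have been read yet.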
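The contention described here is commonly handled by tagging each request with a `wr_id` (a real field of both `ibv_send_wr` and `ibv_wc`) and routing each completion to the thread that posted it. Below is a minimal, self-contained sketch of that routing only. It does not touch any RDMA device, and every name in it (`Completion`, `make_wr_id`, `owner_of`, `CompletionDemux`) is hypothetical. Whether rlib's `post_send_to_mr` lets the caller choose the `wr_id` is not confirmed by this thread, so treat this as an illustration of the idea rather than as rlib usage.

```cpp
#include <cstdint>
#include <deque>
#include <mutex>
#include <unordered_map>

// Stand-in for the two ibv_wc fields this sketch cares about.
struct Completion {
    uint64_t wr_id;  // tag chosen by the poster when the request was posted
    int status;      // 0 means success
};

// Assumed encoding: upper 16 bits identify the owning thread,
// lower 48 bits are a per-thread sequence number.
inline uint64_t make_wr_id(uint16_t owner, uint64_t seq) {
    return (static_cast<uint64_t>(owner) << 48) | (seq & 0xFFFFFFFFFFFFULL);
}

inline uint16_t owner_of(uint64_t wr_id) {
    return static_cast<uint16_t>(wr_id >> 48);
}

// Routes completions drained from a shared queue to per-owner inboxes, so a
// thread never mistakes another thread's completion for its own.
class CompletionDemux {
public:
    // Called by whichever thread happened to drain the shared queue.
    void deliver(const Completion& c) {
        std::lock_guard<std::mutex> guard(mu_);
        inbox_[owner_of(c.wr_id)].push_back(c);
    }

    // Called by a thread looking for its own completions. Returns false if
    // nothing addressed to `owner` has arrived yet.
    bool take(uint16_t owner, Completion& out) {
        std::lock_guard<std::mutex> guard(mu_);
        auto it = inbox_.find(owner);
        if (it == inbox_.end() || it->second.empty()) return false;
        out = it->second.front();
        it->second.pop_front();
        return true;
    }

private:
    std::mutex mu_;
    std::unordered_map<uint16_t, std::deque<Completion>> inbox_;
};
```

In this sketch, a thread that polls a completion belonging to someone else would call `deliver` and keep polling, while each thread checks `take` with its own owner id before concluding that its request has finished. Using one QP per thread, as the poster intends, avoids the need for this bookkeeping altogether.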

I hope you can help. Thanks!

wxdwfc commented 3 years ago

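Hello,

Could you provide concrete code that reproduces the problem? From the description alone, I cannot see what is wrong.

P.S. This project has moved to https://github.com/wxdwfc/rlibv2 and is maintained there. If it is convenient, I recommend using the new version.

Thanks!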

minghust commented 3 years ago

Sure, here is the code. Server side:

void Server::RDMAConnect(std::string& client_ip,
                         int client_port,
                         int client_id) {
    // Server has already registered two separate memory regions
    /************************************* RDMA Connection ***************************************/
    RDMA_LOG(INFO) << "Waiting for RDMA connecting compute nodes...";
    auto qp0 = rdma_ctrl->create_rc_qp(QPIdx{.node_id = client_id, .worker_id = 0, .index = 0},
                                           rdma_ctrl->get_device(),
                                           nullptr);
    while (qp0->connect(client_ip, client_port) != SUCC) {
        usleep(2000);
    }
    auto qp1 = rdma_ctrl->create_rc_qp(QPIdx{.node_id = client_id, .worker_id = 0, .index = 1},
                                          rdma_ctrl->get_device(),
                                          nullptr);
    while (qp1->connect(client_ip, client_port) != SUCC) {
        usleep(2000);
    }
    RDMA_LOG(INFO) << "Server: QP connected!";
}

Client side, thread t1 (QP setup):

    void PairQPConnect(RdmaCtrl* rdma_ctrl,
                       RemoteNode& remote_node, // struct RemoteNode {int node_id; std::string ip; int port;};
                       MemoryAttr remote_mr0, // has been prefetched via QP::get_remote_mr()
                       MemoryAttr remote_mr1, // has been prefetched via QP::get_remote_mr()
                       RNicHandler* opened_rnic) {
        // Create the two queue pairs
        MemoryAttr local_mr = rdma_ctrl->get_local_mr(CLIENT_MR_ID); // CLIENT_MR_ID is a magic number
        RCQP* qp0 = rdma_ctrl->create_rc_qp(
            QPIdx{.node_id = remote_node.node_id, .worker_id = 0, .index = 0},
            opened_rnic,
            &local_mr);
        qp0->bind_remote_mr(remote_mr0);

        RCQP* qp1 = rdma_ctrl->create_rc_qp(
            QPIdx{.node_id = remote_node.node_id, .worker_id = 0, .index = 1},
            opened_rnic,
            &local_mr);
        qp1->bind_remote_mr(remote_mr1);

        // Queue pair connection, exchange queue pair info via TCP
        while (qp0->connect(remote_node.ip, remote_node.port) != SUCC) {
            usleep(2000);
        }
        while (qp1->connect(remote_node.ip, remote_node.port) != SUCC) {
            usleep(2000);
        }
        RDMA_LOG(INFO) << "Client: QP connected!";
        qp0_array[remote_node.node_id] = qp0;
        qp1_array[remote_node.node_id] = qp1;
    }

Client side, thread t1 (RDMA READ):

int node_id = GetRemoteNodeID();
RCQP* qp = qp0_array[node_id];
size_t data_size = 1024;
char* read_buf = (char*) Rmalloc(data_size);
memset(read_buf, 0, data_size);
uint64_t remote_offset = 0;
auto rc = qp->post_send_to_mr(local_mr, remote_mr0, IBV_WR_RDMA_READ, read_buf, data_size, remote_offset, IBV_SEND_SIGNALED);
if (rc != SUCC) {
    RDMA_LOG(ERROR) << "client: post read fail. rc=" << rc;
}
ibv_wc wc{};
rc = qp->poll_till_completion(wc, no_timeout);
if (rc != SUCC) {
    RDMA_LOG(ERROR) << "client: poll read fail. rc=" << rc;
}
Rfree(read_buf);

Client side, thread t2 (RDMA WRITE):

int node_id = GetRemoteNodeID();
RCQP* qp = qp1_array[node_id];
size_t data_size = 1024;
char* write_buf = (char*) Rmalloc(data_size);
memset(write_buf, 0, data_size); // zero the write buffer (the original cleared read_buf, which is not defined in t2)
uint64_t remote_offset = 0;
auto rc = qp->post_send_to_mr(local_mr, remote_mr1, IBV_WR_RDMA_WRITE, write_buf, data_size, remote_offset, IBV_SEND_SIGNALED);
if (rc != SUCC) {
    RDMA_LOG(ERROR) << "client: post write fail. rc=" << rc;
}
ibv_wc wc{};
rc = qp->poll_till_completion(wc, no_timeout); // ERROR: [qp_impl.hpp:131] poll till completion error: 12 transport retry counter exceeded
if (rc != SUCC) {
    RDMA_LOG(ERROR) << "client: poll write fail. rc=" << rc;
}
Rfree(write_buf);

If I change RCQP* qp = qp1_array[node_id]; in t2 to RCQP* qp = qp0_array[node_id];, the problem goes away.

wxdwfc commented 3 years ago

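Hello,

Currently, if you want to connect multiple QPs in rlib, I recommend going through RdmaCtrl (./rdma_ctrl.hpp); see the link_symmetric_rcqps function for details. Connecting QPs individually has not been thoroughly tested in rlib.

If you want to connect QPs individually, you can use https://github.com/wxdwfc/rlibv2. That repo makes it fairly easy to create a QP on its own (see https://github.com/wxdwfc/rlibv2/blob/master/examples/rc_write/client.cc). Apart from connection setup, v2 is used in essentially the same way as rlib, and it has been tested fairly thoroughly and is stable.

Finally, since rlib has moved to https://github.com/wxdwfc/rlibv2, I still recommend using rlibv2: I no longer maintain rlib, whereas v2 has a dedicated maintainer.

Thanks!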

minghust commented 3 years ago

OK, thanks for the explanation!