Closed AI-Ahmed closed 1 year ago
I found this answer exciting and really intuitive (sorry if my question was silly; linear algebra is essential to dive deep into this)!
I thought that $r_k$ was a square matrix, but now I understand that $r_k^{\top}$ is a $1 \times n$ matrix while $r_k$ is an $n \times 1$ matrix.
That means $$r_k^{\top} r_k = \sum_{i=1}^{n} r_{k,i} \cdot r_{k,i},$$ i.e., a scalar. Ref: https://math.stackexchange.com/questions/1853808/product-of-a-vector-and-its-transpose-projections
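Just to check my own understanding, here is a tiny PyTorch sketch (not from the repo, just something I tried) showing that for 1-D tensors the `torch.sum(r * r)` pattern and an explicit dot product give the same scalar, so no transpose is needed:

```python
import torch

# For a 1-D tensor r, r^T r is just the sum of elementwise products,
# so no explicit transpose is needed in the code.
n = 5
r = torch.randn(n)

rdotr_via_sum = torch.sum(r * r)  # the pattern used in the code (rdotr)
rdotr_via_dot = torch.dot(r, r)   # the same scalar written as a dot product

print(torch.allclose(rdotr_via_sum, rdotr_via_dot))  # True
```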
If there is anything else, please let me know!
I have been searching to understand more about the conjugate gradient algorithm. It was a really ingenious idea from Schulman, Prof. Pieter Abbeel, et al.
Thanks also to you all for contributing to this and implementing the algorithm.
When I looked at the mathematical algorithm, there were things that confused me!
In the algorithm, there is `rdotr`, but I don't understand why we didn't transpose one of the `r` vectors before multiplying it by itself. The same question applies to `(torch.sum(p*z) + 1e-8)`. Please, if I am missing something, direct me. Thanks.
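For context, this is roughly the conjugate gradient loop I have in mind (a minimal sketch from my own reading, not the repo's exact code; the name `Avp` for the Hessian-vector product function is my own):

```python
import torch

def conjugate_gradient(Avp, b, n_iters=10, tol=1e-10):
    """Solve A x = b using only Avp(v) = A @ v, without forming A explicitly."""
    x = torch.zeros_like(b)
    r = b.clone()            # residual r_0 = b - A x_0 = b, since x_0 = 0
    p = r.clone()            # initial search direction
    rdotr = torch.dot(r, r)  # r^T r as a plain dot product of 1-D tensors

    for _ in range(n_iters):
        z = Avp(p)                                  # Hessian-vector product A p
        alpha = rdotr / (torch.sum(p * z) + 1e-8)   # step size; 1e-8 guards against division by zero
        x += alpha * p
        r -= alpha * z
        new_rdotr = torch.dot(r, r)
        if new_rdotr < tol:
            break
        p = r + (new_rdotr / rdotr) * p             # update search direction
        rdotr = new_rdotr
    return x
```

With this reading, `rdotr` is already the scalar $r^{\top} r$, which is why no explicit transpose appears in the code.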