Data mismatch on block volume when multiple pods share the same PVC

This issue was detected while running the K8S CSI E2E test suite InitMultiVolumeTestSuite against a CSI driver that supports the RWX (ReadWriteMany) access mode.

Test steps:

Set up a multi-node neonsan cluster with k8s and neonsan-csi installed.
Create a storage class.
Create a PVC pvc1 with the above storage class, volume mode = Block, access mode = ReadWriteMany.
Create pod1 on node1 and pod2 on node2, both attaching pvc1 as a block volume.
Write some data to the block device on node1, then read the same range back on both node1 and node2.
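The PVC and pod steps above could be expressed roughly as the manifests below. This is a sketch, not the manifests the E2E suite actually generated: the storage class name, node names, container image, and devicePath are placeholders chosen for illustration. The script only writes the YAML to a file.

```shell
#!/bin/sh
# Sketch: write hypothetical manifests for the RWX block-volume test.
# All defaults below are placeholders, not values from the failing run.
SC_NAME="${SC_NAME:-neonsan-sc}"    # hypothetical storage class name
NODE1="${NODE1:-node1}"             # replace with real node names
NODE2="${NODE2:-node2}"
OUT="${OUT:-rwx-block.yaml}"

pvc_manifest() {
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc1
spec:
  storageClassName: ${SC_NAME}
  volumeMode: Block            # raw block device, no filesystem
  accessModes:
    - ReadWriteMany            # RWX: attachable on several nodes
  resources:
    requests:
      storage: 5Gi
EOF
}

pod_manifest() {
  # $1 = pod name, $2 = node the pod is pinned to
  cat <<EOF
---
apiVersion: v1
kind: Pod
metadata:
  name: $1
spec:
  nodeName: $2
  containers:
    - name: app
      image: busybox
      command: ["sleep", "3600"]
      volumeDevices:           # block volumes use volumeDevices, not volumeMounts
        - name: data
          devicePath: /dev/xvda
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: pvc1
EOF
}

{
  pvc_manifest
  pod_manifest pod1 "$NODE1"
  pod_manifest pod2 "$NODE2"
} > "$OUT"
```

The resulting file can then be applied with kubectl apply -f rwx-block.yaml in the target namespace.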

Expected Result:

node1 and node2 should read exactly the same data, because both pods use the same PVC and therefore share the same underlying neonsan storage blocks.

Actual Result:

The data read on node1 and node2 does not match. node1 returns the newly written data, while node2 returns stale data.

Test Env: 172.31.30.10, ssh 192.168.101.174-176

Logs:

root@testr01n01:~# kubectl -n multivolume-7887 get pvc
NAME                            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                    AGE
neonsan.csi.qingstor.com2s7jp   Bound    pvc-b88eb38a-6ff7-4a1c-a062-eef382aa53cc   5Gi        RWX            multivolume-7887-neonsan8wx4p   137m
root@testr01n01:~# kubectl -n multivolume-7887 get pod -o wide
NAME                                                    READY   STATUS    RESTARTS   AGE    IP             NODE         NOMINATED NODE   READINESS GATES
security-context-0cce6ebe-968e-4db2-ae00-f8a9a8d911ca   1/1     Running   0          139m   10.233.98.51   testr01n01   <none>           <none>
test-pod2                                               1/1     Running   0          38m    10.233.73.59   testr01n02   <none>           <none>

node1:

root@testr01n01:~# echo "i love china" | dd of=/dev/qbd7 bs=64 count=1
0+1 records in
0+1 records out
13 bytes copied, 7.0572e-05 s, 184 kB/s
root@testr01n01:~# head -c 64 /dev/qbd7
i love china
ay
O�0~�$f��R�1��dy��6n�u  #1�^;S�S�ϕC����q
m��root@testr01n01:~#
root@testr01n01:~#

node2:

root@testr01n02:~# qbd -l | grep b88eb38a-6ff7-4a1c-a062-eef382aa53cc
49  0x87a000000 qbd49   tcp://kube/pvc-b88eb38a-6ff7-4a1c-a062-eef382aa53cc /etc/neonsan/qbd.conf   0   0   0   0
root@testr01n02:~# head -c 64 /dev/qbd49
test write data 
O�0~�$f��R�1��dy��6n�u  #1�^;S�S�ϕC����q
m��root@testr01n02:~#
root@testr01n02:~# blockdev --flushbufs /dev/qbd49
root@testr01n02:~# head -c 64 /dev/qbd49
i love china
ay
O�0~�$f��R�1��dy��6n�u  #1�^;S�S�ϕC����q
m��root@testr01n02:~#

As the log shows, node2 kept reading stale data until blockdev --flushbufs was executed on it. Requiring a flush on the reading node does not make sense and is not practical either: users cannot run this command every time new data is written from a different node.
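One way to narrow this down is to compare a normal (buffered) read with a read that bypasses the node's buffer cache. The helper below is a minimal sketch: head_sum is a hypothetical name, and it assumes GNU dd (for iflag=direct) and sha256sum are available on the nodes.

```shell
#!/bin/sh
# head_sum DEVICE [direct]
# Prints the SHA-256 of the first 64 bytes of DEVICE.
# With "direct", the read uses O_DIRECT (GNU dd iflag=direct) and so
# skips the local buffer cache; otherwise it is an ordinary buffered read.
head_sum() {
  dev="$1"
  mode="${2:-buffered}"
  if [ "$mode" = "direct" ]; then
    # O_DIRECT needs an aligned block size, so read 4 KiB and trim.
    dd if="$dev" bs=4096 count=1 iflag=direct 2>/dev/null
  else
    dd if="$dev" bs=4096 count=1 2>/dev/null
  fi | head -c 64 | sha256sum | cut -d' ' -f1
}

# Example (device names taken from the log above):
#   node1: head_sum /dev/qbd7 direct
#   node2: head_sum /dev/qbd49
#   node2: head_sum /dev/qbd49 direct
```

If the direct read on node2 matches node1 while the buffered read does not, that would suggest the stale bytes come from node2's local cache rather than from the storage backend. Adding conv=fsync to the dd write on node1 would additionally rule out data still sitting in the writer's cache.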

Every node that shares the PVC should read the current data without having to flush any buffers.

This issue needs to be fixed; otherwise we cannot claim that neonsan supports RWX in k8s.

thanks.

yunify / qingstor-csi #46