yunify / qingstor-csi

neonsan csi plugin for kubernetes
Apache License 2.0

Data mismatch on block volume when multiple pods share the same PVC #46

Open stoneshi-yunify opened 4 years ago

stoneshi-yunify commented 4 years ago

This issue was detected while running the K8S CSI E2E test suite InitMultiVolumeTestSuite with the CSI driver advertising the RWX (ReadWriteMany) access mode.

Test steps:

Expected Result:

Actual Result:

Test Env: 172.31.30.10, ssh 192.168.101.174-176

Logs:

root@testr01n01:~# kubectl -n multivolume-7887 get pvc
NAME                            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                    AGE
neonsan.csi.qingstor.com2s7jp   Bound    pvc-b88eb38a-6ff7-4a1c-a062-eef382aa53cc   5Gi        RWX            multivolume-7887-neonsan8wx4p   137m
root@testr01n01:~# kubectl -n multivolume-7887 get pod -o wide
NAME                                                    READY   STATUS    RESTARTS   AGE    IP             NODE         NOMINATED NODE   READINESS GATES
security-context-0cce6ebe-968e-4db2-ae00-f8a9a8d911ca   1/1     Running   0          139m   10.233.98.51   testr01n01   <none>           <none>
test-pod2                                               1/1     Running   0          38m    10.233.73.59   testr01n02   <none>           <none>

node1:

root@testr01n01:~# echo "i love china" | dd of=/dev/qbd7 bs=64 count=1
0+1 records in
0+1 records out
13 bytes copied, 7.0572e-05 s, 184 kB/s
root@testr01n01:~# head -c 64 /dev/qbd7
i love china
ay
O�0~�$f��R�1��dy��6n�u  #1�^;S�S�ϕC����q
m��root@testr01n01:~#
root@testr01n01:~#

node2:

root@testr01n02:~# qbd -l | grep b88eb38a-6ff7-4a1c-a062-eef382aa53cc
49  0x87a000000 qbd49   tcp://kube/pvc-b88eb38a-6ff7-4a1c-a062-eef382aa53cc /etc/neonsan/qbd.conf   0   0   0   0
root@testr01n02:~# head -c 64 /dev/qbd49
test write data 
O�0~�$f��R�1��dy��6n�u  #1�^;S�S�ϕC����q
m��root@testr01n02:~#
root@testr01n02:~# blockdev --flushbufs /dev/qbd49
root@testr01n02:~# head -c 64 /dev/qbd49
i love china
ay
O�0~�$f��R�1��dy��6n�u  #1�^;S�S�ϕC����q
m��root@testr01n02:~#

As the logs show, node2 kept reading stale data until blockdev --flushbufs was executed. Running the flush command on the reading node is neither reasonable nor practical: users cannot run it every time new data is written from a different node.

Every node sharing the same PVC should read the latest data without having to flush any buffers.

This issue should be fixed; otherwise we cannot claim that neonsan supports RWX in k8s.

thanks.

stoneshi-yunify commented 4 years ago

I discussed this with the neonsan developers. When doing IO to a shared block volume, writes must use O_DIRECT to bypass the system cache so that the data actually reaches the device. This is the responsibility of the application on top of the volume, which in this case is the K8S E2E test suite.
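For reference, here is a minimal sketch of what O_DIRECT IO could look like with GNU dd, using its oflag=direct and iflag=direct flags. The helper names write_direct and read_direct are hypothetical, and the 512-byte block size is an assumed sector size, not a value confirmed for neonsan. Reading with iflag=direct on the other node is my own addition, on the assumption that the reader also has to bypass its local page cache. This has not been verified against a neonsan volume.

```shell
#!/usr/bin/env bash
# Sketch only: write_direct and read_direct are hypothetical helper names,
# and BLOCK_SIZE=512 is an assumed logical sector size.

BLOCK_SIZE=512

# Write one block to a device or file with O_DIRECT.
# conv=sync pads the short input with NUL bytes up to BLOCK_SIZE, because
# O_DIRECT transfers generally have to be sector-aligned.
# notrunc only matters when the target is a regular file.
write_direct() {
    local target=$1 text=$2
    printf '%s\n' "$text" |
        dd of="$target" bs="$BLOCK_SIZE" count=1 \
           conv=sync,notrunc oflag=direct status=none
}

# Read the first block back with O_DIRECT, bypassing the local page cache.
read_direct() {
    local target=$1
    dd if="$target" bs="$BLOCK_SIZE" count=1 iflag=direct status=none
}
```

With the devices from the logs above, this would be write_direct /dev/qbd7 "i love china" on node1, followed by read_direct /dev/qbd49 | head -c 64 on node2, with no blockdev --flushbufs in between.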

Luckily, this issue was also detected by the k8s community and fixed 17 days ago, see https://github.com/kubernetes/kubernetes/pull/94881 for more details.

No official k8s build containing the fix has been released yet. We will wait a few days for one and revisit this issue then.