zmc / ceph-devstack


Support for multiple OSDs on a single testnode #6

Open VallariAg opened 3 months ago

VallariAg commented 3 months ago

When I ran a test with a config that had multiple OSDs on a single testnode, I got the following error:

2024-05-15T12:29:41.414 INFO:teuthology.orchestra.run.7fa8c2843ce2.stdout:Created osd(s) 0 on host '7fa8c2843ce2'
2024-05-15T12:29:42.108 DEBUG:teuthology.orchestra.run.7fa8c2843ce2:osd.0> sudo journalctl -f -n 0 -u ceph-1b2586ac-12b6-11ef-945e-d6d5f423fdc9@osd.0.service
2024-05-15T12:29:42.110 INFO:tasks.cephadm:{Remote(name='ubuntu@7fa8c2843ce2'): [], Remote(name='ubuntu@c3bfc4209056'): ['/dev/loop3'], Remote(name='ubuntu@db02dd5eef59'): ['/dev/loop0'], Remote(name='ubuntu@de36ba4bccc7'): ['/dev/loop1']}
2024-05-15T12:29:42.110 INFO:tasks.cephadm:ubuntu@7fa8c2843ce2
2024-05-15T12:29:42.110 INFO:tasks.cephadm:[]
2024-05-15T12:29:42.110 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
  File "/teuthology/teuthology/contextutil.py", line 30, in nested
    vars.append(enter())
  File "/usr/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/root/src/github.com_vallariag_ceph_0c8b425a40783ee42c035ea9fbe29647e90f007f/qa/tasks/cephadm.py", line 1072, in ceph_osds
    assert devs   ## FIXME ##
AssertionError

Each testnode had one loop device: https://pastebin.com/raw/8z5gj0CU (ls /dev output)

The above problem happens because my job config puts multiple OSDs on one node (osd.0 and osd.1 deployed on the same host), but each testnode container has only one device available that can be zapped for OSD deployment. With the ceph-devstack setup, the teuthology function get_scratch_devices() returned one device for each testnode, so the mapping (devs_by_remote) looks like this:

{Remote(name='ubuntu@7fa8c2843ce2'): ['/dev/loop2'], 
Remote(name='ubuntu@c3bfc4209056'): ['/dev/loop3'], 
Remote(name='ubuntu@db02dd5eef59'): ['/dev/loop0'], 
Remote(name='ubuntu@de36ba4bccc7'): ['/dev/loop1']}

And because we pop the loop device from devs_by_remote above when the 1st OSD is deployed, the 2nd OSD on the same testnode has no device left to deploy onto. I reran my test with a one-OSD-per-node config and that worked (the test got through the Ceph setup okay).
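To make the failure concrete, here is a minimal sketch of the device-consumption behavior described above (this is not the actual qa/tasks/cephadm.py code; the host/role layout is hard-coded for illustration). The second OSD role on a remote that only has one scratch device finds an empty list and trips the assert:

# Minimal sketch of the device-consumption logic described above.
# Hostnames, roles, and devices mirror the failing run but are hard-coded here.
devs_by_remote = {
    "ubuntu@7fa8c2843ce2": ["/dev/loop2"],   # this host gets osd.0 AND osd.1
    "ubuntu@c3bfc4209056": ["/dev/loop3"],
    "ubuntu@db02dd5eef59": ["/dev/loop0"],
    "ubuntu@de36ba4bccc7": ["/dev/loop1"],
}
osd_roles_by_remote = {
    "ubuntu@7fa8c2843ce2": ["osd.0", "osd.1"],  # two OSDs, one device
    "ubuntu@c3bfc4209056": ["osd.2"],
    "ubuntu@db02dd5eef59": ["osd.3"],
    "ubuntu@de36ba4bccc7": ["osd.4"],
}

for remote, osd_roles in osd_roles_by_remote.items():
    for role in osd_roles:
        devs = devs_by_remote[remote]
        assert devs, f"no scratch device left on {remote} for {role}"
        dev = devs.pop()  # consumed for good; never returned to the pool
        print(f"deploying {role} on {remote} using {dev}")

Running this fails exactly at osd.1: /dev/loop2 was already popped for osd.0, so devs is empty for the second role on that host.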

As for a proper solution... does this mean we should create more loop devices per testnode in ceph-devstack? Let me know, I'd love to pick up this issue. It would be a good gateway to understanding more of ceph-devstack.
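If more loop devices per testnode is the right direction, a rough sketch of what that could look like on a testnode follows (the backing-file directory, size, and count are placeholders, not ceph-devstack's current behavior):

# Rough sketch, assuming the fix is "attach N loop devices per testnode".
# Paths, sizes, and the device count are illustrative only.
import subprocess
from pathlib import Path

def create_loop_devices(count, size_gb=4, backing_dir="/var/lib/ceph-devstack/loops"):
    """Create `count` sparse backing files and attach each to a free loop device."""
    Path(backing_dir).mkdir(parents=True, exist_ok=True)
    devices = []
    for i in range(count):
        backing = Path(backing_dir) / f"osd-{i}.img"
        # Sparse file: blocks are allocated lazily, so extra devices stay cheap on disk.
        subprocess.run(["truncate", "-s", f"{size_gb}G", str(backing)], check=True)
        # `losetup --find --show` attaches the file to the next free /dev/loopX
        # and prints the resulting device path.
        out = subprocess.run(["losetup", "--find", "--show", str(backing)],
                             check=True, capture_output=True, text=True)
        devices.append(out.stdout.strip())
    return devices

print(create_loop_devices(count=2))

With one device per OSD that the job config places on a node, get_scratch_devices() would return enough entries for every osd role on that remote.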