OSD Replacement Procedures

First, identify which disks need to go through the replacement procedure.

  • To see which osds are down and out, run ceph osd tree down out:
[09:28][root@p06253939p44623 (production:ceph/beesly/osd*24) ~]# ceph osd tree down out
ID  CLASS WEIGHT     TYPE NAME                        STATUS REWEIGHT PRI-AFF
 -1       5589.18994 root default                                             
 -2       4428.02979     room 0513-R-0050                                     
 -6        917.25500         rack RA09                                        
 -7        131.03999             host p06253939j03957                         
430          5.45999                 osd.430            down        0 1.00000
-19        131.03999             host p06253939s09190                         
 24          5.45999                 osd.24             down        0 1.00000
405          5.45999                 osd.405            down        0 1.00000
 -9        786.23901         rack RA13                                        
-11        131.03999             host p06253939b84659                         
101          5.45999                 osd.101            down        0 1.00000
-32        131.03999             host p06253939u19068                         
577          5.45999                 osd.577            down        0 1.00000
-14        895.43903         rack RA17                                        
-34        125.58000             host p06253939f99921                         
742          5.45999                 osd.742            down        0 1.00000
-22        125.58000             host p06253939h70655                         
646          5.45999                 osd.646            down        0 1.00000
659          5.45999                 osd.659            down        0 1.00000
718          5.45999                 osd.718            down        0 1.00000
-26        131.03999             host p06253939v20205                         
650          5.45999                 osd.650            down        0 1.00000
-33        131.03999             host p06253939w66726                         
362          5.45999                 osd.362            down        0 1.00000
654          5.45999                 osd.654            down        0 1.00000
  • Check the tickets for these machines in ServiceNow. The ones of interest are named: [GNI] exception.scsi_blockdevice_driver_error_reported or exception.nonwriteable_filesystems.
    • If the repair service has replaced the disk(s), this will be noted in the ticket, and you can continue with the next step.
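The down osds in the tree output can also be pulled out non-interactively. A small sketch, assuming the listing was first saved to ~/osd-tree.out (the awk field positions match the column layout shown above):

```shell
# Save the listing once, then filter it; in this layout the osd name is the
# 3rd whitespace-separated field and its status the 4th.
ceph osd tree down out > ~/osd-tree.out
awk '$3 ~ /^osd\./ && $4 == "down" {print $3}' ~/osd-tree.out
```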

On the OSD:

LVM formatting using ceph-volume

  • Simple format: the osd is a logical volume on a single disk

This is sample output from listing the disks in LVM fashion. Notice that each osd consists of exactly one device (disk), and that these devices do not use any ssds for a performance boost.

(Ceph volume listing takes some time to complete)

[13:55][root@p05972678e21448 (production:ceph/erin/osd*30) ~]# ceph-volume lvm list

===== osd.335 ======

  [block]    /dev/ceph-d0befafb-3e7c-4ffc-ab25-ef01c48e69ac/osd-block-c2fd8d2e-8a38-42b7-a03c-9285f2b973ba

      type                      block
      osd id                    335
      cluster fsid              eecca9ab-161c-474c-9521-0e5118612dbb
      cluster name              ceph
      osd fsid                  c2fd8d2e-8a38-42b7-a03c-9285f2b973ba
      encrypted                 0
      cephx lockbox secret      
      block uuid                PXCHQW-4aXo-isAR-NdYU-3FQ2-E18Q-whJa92
      block device              /dev/ceph-d0befafb-3e7c-4ffc-ab25-ef01c48e69ac/osd-block-c2fd8d2e-8a38-42b7-a03c-9285f2b973ba
      vdo                       0
      crush device class        None
      devices                   /dev/sdw

===== osd.311 ======

  [block]    /dev/ceph-b7be4aa7-0c7f-4786-a214-66116420a2cc/osd-block-1bfad506-c450-4116-8ba5-ac356be87a9e

      type                      block
      osd id                    311
      cluster fsid              eecca9ab-161c-474c-9521-0e5118612dbb
      cluster name              ceph
      osd fsid                  1bfad506-c450-4116-8ba5-ac356be87a9e
      encrypted                 0
      cephx lockbox secret      
      block uuid                O5fYcf-aGW8-NVWC-lr5G-BsuC-Yx3H-WZl24a
      block device              /dev/ceph-b7be4aa7-0c7f-4786-a214-66116420a2cc/osd-block-1bfad506-c450-4116-8ba5-ac356be87a9e
      vdo                       0
      crush device class        None
      devices                   /dev/sdt

This is an example of an osd that uses an ssd for its metadata: it has a db device on which the metadata is stored.

[14:04][root@p06253939e35392 (production:ceph/dwight/osd*24) ~]# ceph-volume lvm list


====== osd.29 ======

  [block]    /dev/ceph-f06dffa2-a9d8-47da-9af2-b4d4f9260557/osd-block-dff889e7-5db5-4c5e-9aab-151e8ad17b48

      type                      block
      osd id                    29
      cluster fsid              dd535a7e-4647-4bee-853d-f34112615f81
      cluster name              ceph
      osd fsid                  dff889e7-5db5-4c5e-9aab-151e8ad17b48
      db device                 /dev/sdac3
      encrypted                 0
      db uuid                   9762cd49-8f1c-4c29-88ca-ff78f6bdd35c
      cephx lockbox secret      
      block uuid                HuzwbL-mVvi-Ubve-1C5D-fjeh-dmZq-ivNNnY
      block device              /dev/ceph-f06dffa2-a9d8-47da-9af2-b4d4f9260557/osd-block-dff889e7-5db5-4c5e-9aab-151e8ad17b48
      crush device class        None
      devices                   /dev/sdk

  [  db]    /dev/sdac3

      PARTUUID                  9762cd49-8f1c-4c29-88ca-ff78f6bdd35c

====== osd.88 ======

  [block]    /dev/ceph-14f3cd75-764b-4fa4-96a3-c2976d0ad0e5/osd-block-f19541f6-42b2-4612-a700-ec5ac8ed4558

      type                      block
      osd id                    88
      cluster fsid              dd535a7e-4647-4bee-853d-f34112615f81
      cluster name              ceph
      osd fsid                  f19541f6-42b2-4612-a700-ec5ac8ed4558
      db device                 /dev/sdab6
      encrypted                 0
      db uuid                   f0b652e1-0161-4583-a50b-45a0a2348e9a
      cephx lockbox secret      
      block uuid                cHqcZG-wsON-P9Lw-4pTa-R1pd-GUwR-iqCMBg
      block device              /dev/ceph-14f3cd75-764b-4fa4-96a3-c2976d0ad0e5/osd-block-f19541f6-42b2-4612-a700-ec5ac8ed4558
      crush device class        None
      devices                   /dev/sdu

  [  db]    /dev/sdab6

      PARTUUID                  f0b652e1-0161-4583-a50b-45a0a2348e9a

One approach is to take an ssd and partition it simply, attaching each partition to an osd. If the ssd breaks, e.g. the disk fails, all the osds that use it are rendered useless, so each of them has to be replaced. The ssd may also be formatted through LVM, in which case the metadata database part looks like this:

[  db]    /dev/cephrocks/cache-b0bf5753-07b7-40eb-bf3a-cc6f35de1d85

    type                      db
    osd id                    220
    cluster fsid              e7681812-f2b2-41d1-9009-48b00e614153
    cluster name              ceph
    osd fsid                  81f9ed48-d27d-44b6-9ac0-f04799b5d0d5
    db device                 /dev/cephrocks/cache-b0bf5753-07b7-40eb-bf3a-cc6f35de1d85
    encrypted                 0
    db uuid                   wZnT18-ZcHh-jcKn-pHic-LCrL-Gpqh-Srq1VL
    cephx lockbox secret      
    block uuid                z8CwSn-Iap5-3xDX-v0NA-9smI-EHx1-EGxdAR
    block device              /dev/ceph-6bb8b94b-4974-44b0-ae6c-667896807328/osd-a59a8661-c966-443b-9384-b2676a3d42d8
    vdo                       0
    crush device class        None
    devices                   /dev/md125
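To locate every osd backed by a failed shared db device (the "lvm list + grep" mentioned in the procedure below), a short awk pass over the saved listing also works. A sketch; /dev/sdac3 is just the example device from above:

```shell
# Walk ~/ceph-volume.out: remember the current "====== osd.NN ======" header,
# and print that osd when its "db device" line points at the failed device.
db_dev=/dev/sdac3    # example: the failed ssd partition or LV path
awk -v dev="$db_dev" '
    /^=+ +osd\./              { id = $2 }   # entering a new osd section
    /db device/ && $NF == dev { print id }  # this osd uses the failed db
' ~/ceph-volume.out
```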

Replacement procedure: one disk per osd

  1. Since ceph-volume lvm list is slow, save its output to ~/ceph-volume.out and work with that file instead.
  2. Check whether the ssd device exists and has failed.
  3. Check if it is used as a metadata database for osds, or as a regular osd.
    1. If it is a metadata database:
      1. Locate all osds that use it (lvm list + grep)
      2. Follow the procedure for each affected osd
    2. Otherwise, treat it as a regular osd (normal replacement)
  4. Mark out the osd: ceph osd out $OSD_ID
  5. Destroy the osd: ceph osd destroy $OSD_ID --yes-i-really-mean-it
  6. Stop the osd daemon: systemctl stop ceph-osd@$OSD_ID
  7. Unmount the filesystem: umount /var/lib/ceph/osd/ceph-$OSD_ID
  8. If the osd uses a metadata database (on an ssd):
    1. If it is a regular partition, remove the partition.
    2. If it is an LV, remove it:
      1. eg for "/dev/cephrocks/cache-b0bf5753-07b7-40eb-bf3a-cc6f35de1d85"
      2. lvremove cephrocks/cache-b0bf5753-07b7-40eb-bf3a-cc6f35de1d85
  9. Run ceph-volume lvm zap /dev/sdXX --destroy
  10. If ceph-volume fails to list the defective devices or zap the disks, you can get the information you need through lvs -o devices,lv_tags | grep type=block, and use vgremove on the osd block's volume group instead.
  11. If you cannot get any information about the defective devices through ceph-volume or lvs, list the working osds and unmount the unused directories with:
    $ umount `ls -d -1 /var/lib/ceph/osd/* | grep -v -f <(grep -Po '(?<=osd\.)[0-9]+' ceph-volume.out)`
    
  12. Now wait until the devices have been replaced. Skip this step if they have already been replaced.
  13. If the osd had its metadata database elsewhere (on an ssd) and that ssd is managed through LVM, prepare it first. For naming we use cache-`uuid -v4`. Recreate the LV you removed at step 8 with: lvcreate -n $name -l 100%FREE $VG. LVM has three layers: PVs, the physical devices (e.g. /dev/sda); VGs, volume groups that contain one or more PVs; and LVs, the "partitions" of a VG. For simplicity we use one PV per VG and one LV per VG. If you have more than one LV per VG, recreate each with the corresponding fraction, e.g. 25%VG instead of 100%FREE for 4 LVs per VG.
  14. Recreate the osd using ceph-volume, reusing a destroyed osd's id from the same host:
    $ ceph-volume lvm create --bluestore --data /dev/sdXXX --block.db (VG/LV or ssd partition) --osd-id XXX
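Steps 4 through 9 can be gathered into one reviewable sequence. The sketch below only prints the commands rather than running them; the id, data device and db LV are made-up example values, not taken from a real host:

```shell
OSD_ID=24                  # example: id of the down osd
DATA_DEV=/dev/sdw          # example: its failed data disk
DB_LV=cephrocks/cache-b0bf5753-07b7-40eb-bf3a-cc6f35de1d85  # example; only if it has a db

# Print the teardown sequence for review; drop the echos (or pipe to sh) to execute.
echo "ceph osd out $OSD_ID"
echo "ceph osd destroy $OSD_ID --yes-i-really-mean-it"
echo "systemctl stop ceph-osd@$OSD_ID"
echo "umount /var/lib/ceph/osd/ceph-$OSD_ID"
echo "lvremove $DB_LV"
echo "ceph-volume lvm zap $DATA_DEV --destroy"
```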
    
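Recreating the removed db LV (step 13) can be sketched as below; it prints the command instead of running it, $VG is an example volume group name, and uuidgen stands in for the uuid -v4 naming used above:

```shell
VG=cephrocks                 # example: volume group on the replaced ssd
LV="cache-$(uuidgen)"        # cache-<uuid> naming convention from step 13
echo lvcreate -n "$LV" -l 100%FREE "$VG"    # review, then run without the echo
```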

Replacement procedure: two disks striped (raid 0) per osd

  1. Run this script with the defective device: ceph-scripts/tools/lvm_stripe_remove.sh /dev/sdXXX (it does not accept a list of devices).
  2. The script reports what cleanup it did. You will need the 2nd and 3rd lines, which are the two disks that make up the failed osd, and the last line, which is the osd id.
  3. If the script fails at any point, open it; it is documented, so you can follow the steps manually.
  4. If you have more than one osd to replace, repeat steps 1 and 2; step 5 can be done at the end for all of them.
  5. Once all the disks have been replaced and are working, pass the whole set of disks to this script:
    ceph-scripts/ceph-volume/striped-osd-prepare.sh /dev/sd[a-f]
    
    It uses ls internally, so you can use wildcards instead of typing each /dev/sdX.
  6. It outputs a list of commands to be executed in order. Run all of them EXCEPT the ceph-volume create command: first append --osd-id XXX (the id of the destroyed osd) to the end of that line, then run it.
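Appending --osd-id to the generated create line (step 6) can also be done mechanically. A sketch, assuming the prepare script's output was saved to prepare.out (a hypothetical file name; the id is an example):

```shell
OSD_ID=24    # example: the destroyed osd's id to reuse
# Print only the ceph-volume create line with --osd-id appended, ready to run.
sed -n "/ceph-volume lvm create/s/\$/ --osd-id $OSD_ID/p" prepare.out
```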