Extend Ceph Storage for Kubernetes Cluster

Scenario:

Four worker nodes, each with a 25 GB raw disk, are used in a Ceph block cluster. As we are running low on space, we will extend the raw disks to 50 GB and update rook-ceph accordingly.

Ceph OSD Management

Ceph Object Storage Daemons (OSDs) are the heart and soul of the Ceph storage platform. Each OSD manages a local device, and together they provide the distributed storage. Rook automates the creation and management of OSDs based on the desired state in the CephCluster CR, hiding as much of the complexity as possible. This guide walks through some of the scenarios where additional OSD configuration is required.
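The desired state the operator reconciles against lives in the CephCluster CR; in this cluster the CR is named rook-ceph (as used later in this post), so it can be inspected with:

kubectl -n rook-ceph get cephcluster rook-ceph -o yaml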

OSD Health

The rook-ceph-tools pod provides a simple environment to run Ceph tools. The ceph commands mentioned in this document should be run from the toolbox.

Once the toolbox pod is created, connect to it and run the ceph commands to analyze the health of the cluster, in particular the OSDs and placement groups (PGs). Some common commands to analyze OSDs include:

ceph status
ceph osd tree
ceph osd status
ceph osd df
ceph osd utilization

kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- bash

Status Before:

[istacey@master001 ~]$ kubectl -n rook-ceph exec -it rook-ceph-tools-5d9d5db5bc-npz4r  -- ceph status
  cluster:
    id:     13c5138f-f2f6-46ea-8ee0-4966330ac081
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 13h)
    mgr: a(active, since 13h)
    osd: 4 osds: 4 up (since 13h), 4 in (since 13h)

  data:
    pools:   2 pools, 129 pgs
    objects: 5.12k objects, 19 GiB
    usage:   63 GiB used, 37 GiB / 100 GiB avail
    pgs:     129 active+clean

  io:
    client:   60 KiB/s wr, 0 op/s rd, 1 op/s wr


[istacey@master001 ~]$ kubectl --kubeconfig=/home/istacey/.kube/config-hr -n rook-ceph exec -it rook-ceph-tools-5d9d5db5bc-npz4r -- ceph osd  status
ID  HOST        USED  AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE
 0  worker002  10.9G  14.0G      1      151k      0        0   exists,up
 1  worker003  10.9G  14.0G      0        0       0        0   exists,up
 2  worker004  11.1G  13.8G      0        0       0        0   exists,up
 3  worker001  9.98G  15.0G      0        0       0        0   exists,up


istacey@worker001:~$ lsblk | grep sdb -A1
sdb                                                                                                     8:16   0   25G  0 disk
└─ceph--f067bb6e--522a--48c6--a2a8--8930d15dc02f-osd--block--dc871464--0a16--484a--8fa8--b723eec178f1 253:10   0   25G  0 lvm

Raw Disk Extended:

istacey@worker001:~$ lsblk | grep sdb -A2 
sdb                                                             8:16   0   50G  0 disk
└─ceph--f067bb6e--522a--48c6--a2a8--8930d15dc02f-osd--block--dc871464--0a16--484a--8fa8--b723eec178f1
                                                              253:10   0   25G  0 lvm
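If the kernel does not see the new size immediately after the virtual disk is grown, a device rescan usually helps (shown here for a SCSI-attached disk; the exact path depends on the hypervisor and disk type):

# On the worker node, as root: ask the kernel to re-read the device size
echo 1 > /sys/class/block/sdb/device/rescan
lsblk /dev/sdb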

Remove the OSDs (one at a time):

https://github.com/rook/rook/blob/master/Documentation/ceph-osd-mgmt.md#remove-an-osd

To remove an OSD due to a failed disk or other re-configuration, consider the following to ensure the health of the data through the removal process:

  • Confirm you will have enough space on your cluster after removing your OSDs to properly handle the deletion
  • Confirm the remaining OSDs and their placement groups (PGs) are healthy in order to handle the rebalancing of the data
  • Do not remove too many OSDs at once
  • Wait for rebalancing between removing multiple OSDs

If all the PGs are active+clean and there are no warnings about being low on space, this means the data is fully replicated and it is safe to proceed. If an OSD is failing, the PGs will not be perfectly clean and you will need to proceed anyway.
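Before scaling anything down, both conditions can be confirmed from the toolbox, for example:

kubectl -n rook-ceph exec -it rook-ceph-tools-5d9d5db5bc-npz4r -- ceph pg stat
kubectl -n rook-ceph exec -it rook-ceph-tools-5d9d5db5bc-npz4r -- ceph osd df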

Scale down rook-ceph-operator and the deployment of the OSD being removed:

[istacey@master001 ~]$ kubectl get deployment -n rook-ceph | grep opera
rook-ceph-operator                   1/1     1            1           77d

[istacey@master001 ~]$ kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0
deployment.apps/rook-ceph-operator scaled

[istacey@master001 ~]$ kubectl get deployment -n rook-ceph | grep opera
rook-ceph-operator                   0/0     0            0           77d

[istacey@master001 ~]$ kubectl -n rook-ceph exec -it rook-ceph-tools-5d9d5db5bc-npz4r  -- ceph status | egrep 'health|osds|usage'
    health: HEALTH_OK
    osd: 4 osds: 4 up (since 13h), 4 in (since 13h)
    usage:   63 GiB used, 37 GiB / 100 GiB avail

[istacey@master001 ~]$ kubectl get deployment -n rook-ceph | grep osd
rook-ceph-osd-0                      1/1     1            1           38h
rook-ceph-osd-1                      1/1     1            1           77d
rook-ceph-osd-2                      1/1     1            1           77d
rook-ceph-osd-3                      1/1     1            1           77d

[istacey@master001 ~]$ kubectl -n rook-ceph scale deployment rook-ceph-osd-0 --replicas=0
deployment.apps/rook-ceph-osd-0 scaled

[istacey@master001 ~]$ kubectl get deployment -n rook-ceph | grep osd
rook-ceph-osd-0                      0/0     0            0           38h
rook-ceph-osd-1                      1/1     1            1           77d
rook-ceph-osd-2                      1/1     1            1           77d
rook-ceph-osd-3                      1/1     1            1           77d

Mark the OSD down:

[istacey@master001 ~]$ kubectl -n rook-ceph exec -it rook-ceph-tools-5d9d5db5bc-npz4r  -- ceph osd down osd.0
osd.0 is already down.

[istacey@master001 ~]$ kubectl -n rook-ceph exec -it rook-ceph-tools-5d9d5db5bc-npz4r  -- ceph status | egrep 'health|osds|usage'
    health: HEALTH_WARN
            1 osds down
            1 host (1 osds) down
    osd: 4 osds: 3 up (since 101s), 4 in (since 13h); 1 remapped pgs
    usage:   63 GiB used, 37 GiB / 100 GiB avail

[istacey@master001 ~]$ kubectl -n rook-ceph exec -it rook-ceph-tools-5d9d5db5bc-npz4r  -- ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME           STATUS  REWEIGHT  PRI-AFF
-1         0.09760  root default
-9         0.02440      host worker001
 3    hdd  0.02440          osd.3           up   1.00000  1.00000
-3         0.02440      host worker002
 0    hdd  0.02440          osd.0         down   1.00000  1.00000
-5         0.02440      host worker003
 1    hdd  0.02440          osd.1           up   1.00000  1.00000
-7         0.02440      host worker004
 2    hdd  0.02440          osd.2           up   1.00000  1.00000

Mark the OSD as out:

[istacey@master001 ~]$ kubectl -n rook-ceph exec -it rook-ceph-tools-5d9d5db5bc-npz4r  -- ceph osd out osd.0
marked out osd.0.

Wait for the data to finish backfilling to other OSDs.

ceph status will indicate the backfilling is done when all of the PGs are active+clean.
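To follow the progress without re-running the command by hand, a simple loop like the following can be used (a convenience sketch only; adjust the interval and the grep pattern as needed):

TOOLS=$(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}')
watch -n 10 "kubectl -n rook-ceph exec $TOOLS -- ceph status | egrep 'health|pgs'"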

[istacey@master001 ~]$ kubectl -n rook-ceph exec -it rook-ceph-tools-5d9d5db5bc-npz4r  -- ceph status
  cluster:
    id:     13c5138f-f2f6-46ea-8ee0-4966330ac081
    health: HEALTH_WARN
            Degraded data redundancy: 3171/15372 objects degraded (20.628%), 80 pgs degraded, 80 pgs undersized

  services:
    mon: 3 daemons, quorum a,b,c (age 13h)
    mgr: a(active, since 13h)
    osd: 4 osds: 3 up (since 4m), 3 in (since 96s); 80 remapped pgs

  data:
    pools:   2 pools, 129 pgs
    objects: 5.12k objects, 19 GiB
    usage:   50 GiB used, 25 GiB / 75 GiB avail
    pgs:     3171/15372 objects degraded (20.628%)
             78 active+undersized+degraded+remapped+backfill_wait
             49 active+clean
             2  active+undersized+degraded+remapped+backfilling

  io:
    client:   71 KiB/s wr, 0 op/s rd, 1 op/s wr
    recovery: 6.5 MiB/s, 1 objects/s

Backfilling is done:

[istacey@master001 ~]$ kubectl -n rook-ceph exec -it rook-ceph-tools-5d9d5db5bc-npz4r  -- ceph status | egrep 'health|osds|usage'
    health: HEALTH_OK
    osd: 4 osds: 3 up (since 22m), 3 in (since 19m)
    usage:   62 GiB used, 13 GiB / 75 GiB avail

Remove the OSD from the Ceph cluster

[istacey@master001 ~]$ kubectl -n rook-ceph exec -it rook-ceph-tools-5d9d5db5bc-npz4r  -- ceph osd purge osd.0 --yes-i-really-mean-it
purged osd.0

Note that osd.0 was on worker002:

[istacey@master001 ~]$ kubectl get pods -n rook-ceph -o wide | grep osd | grep -v prepare
rook-ceph-osd-1-6c468554f4-8btvj                      1/1     Running     3          26h   10.42.171.207    worker003   <none>           <none>
rook-ceph-osd-2-5f8ffcd5bb-p44d4                      1/1     Running     1          25h   10.42.64.205     worker004   <none>           <none>
rook-ceph-osd-3-5d8b989cb-4hf8h                       1/1     Running     5          27h   10.42.7.26       worker001   <none>           <none>

Zap the disk

https://github.com/rook/rook/blob/master/Documentation/ceph-teardown.md#zapping-devices

As root, clean and prepare the disk on the VM:

DISK="/dev/sdb"
dd if=/dev/zero of="$DISK" bs=1M count=100 oflag=direct,dsync
ls /dev/mapper/ceph-* | xargs -I% -- dmsetup remove %
rm -rf /dev/ceph-*
rm -rf /dev/mapper/ceph--*
partprobe $DISK


[root@worker002 ~]# lsblk

NAME                MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sdb                   8:16   0   50G  0 disk

Scale back up and let the OSD rejoin:

[istacey@master001 ~]$ kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=1
deployment.apps/rook-ceph-operator scaled

[istacey@master001 ~]$ kubectl -n rook-ceph scale deployment rook-ceph-osd-0 --replicas=1
deployment.apps/rook-ceph-osd-0 scaled

[istacey@master001 ~]$ kubectl -n rook-ceph get deployment | egrep 'rook-ceph-operator|rook-ceph-osd'
rook-ceph-operator                   1/1     1            1           77d
rook-ceph-osd-0                      1/1     1            1           39h
rook-ceph-osd-1                      1/1     1            1           77d
rook-ceph-osd-2                      1/1     1            1           77d
rook-ceph-osd-3                      1/1     1            1           77d

[istacey@master001 ~]$ kubectl -n rook-ceph exec -it rook-ceph-tools-5d9d5db5bc-npz4r  -- ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME           STATUS  REWEIGHT  PRI-AFF
-1         0.12199  root default
-9         0.02440      host worker001
 3    hdd  0.02440          osd.3           up   1.00000  1.00000
-3         0.04880      host worker002
 0    hdd  0.04880          osd.0           up   1.00000  1.00000
-5         0.02440      host worker003
 1    hdd  0.02440          osd.1           up   1.00000  1.00000
-7         0.02440      host worker004
 2    hdd  0.02440          osd.2           up   1.00000  1.00000

istacey@worker002:~$ lsblk | grep sdb -A1
sdb                                                                                                     8:16   0   50G  0 disk
└─ceph--ea8115b7--5418--41b9--b4d3--d6e22526dbb1-osd--block--68cfcb49--f858--46f2--979f--dc266e4e6cf0 253:10   0   50G  0 lvm

Wait for the rebalance to finish…

Rebalancing done:

[istacey@master001 ~]$ kubectl -n rook-ceph exec -it rook-ceph-tools-5d9d5db5bc-npz4r  -- ceph status
  cluster:
    id:     13c5138f-f2f6-46ea-8ee0-4966330ac081
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 14h)
    mgr: a(active, since 14h)
    osd: 4 osds: 4 up (since 32m), 4 in (since 32m)

  task status:

  data:
    pools:   2 pools, 129 pgs
    objects: 5.12k objects, 19 GiB
    usage:   63 GiB used, 62 GiB / 125 GiB avail
    pgs:     129 active+clean

  io:
    client:   73 KiB/s wr, 0 op/s rd, 2 op/s wr

[istacey@master001 ~]$ kubectl -n rook-ceph exec -it rook-ceph-tools-5d9d5db5bc-npz4r -- ceph df
--- RAW STORAGE ---
CLASS  SIZE     AVAIL   USED    RAW USED  %RAW USED
hdd    125 GiB  62 GiB  59 GiB    63 GiB      50.14
TOTAL  125 GiB  62 GiB  59 GiB    63 GiB      50.14

--- POOLS ---
POOL                   ID  PGS  STORED  OBJECTS  USED    %USED  MAX AVAIL
device_health_metrics   1    1     0 B        0     0 B      0     15 GiB
replicapool             3  128  19 GiB    5.12k  58 GiB  55.89     15 GiB

[istacey@master001 ~]$ kubectl -n rook-ceph exec -it rook-ceph-tools-5d9d5db5bc-npz4r  -- ceph osd status
ID  HOST        USED  AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE
 0  worker002  20.5G  29.4G      1      156k      0        0   exists,up
 1  worker003  13.2G  11.7G      0        0       0        0   exists,up
 2  worker004  14.3G  10.6G      0        0       0        0   exists,up
 3  worker001  14.5G  10.4G      0     4095       0        0   exists,up

Repeat for the next three OSDs…
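For reference, the per-OSD sequence boils down to the sketch below (OSD_ID and the toolbox pod lookup are placeholders, and the waits for backfill and rebalance still have to be confirmed with ceph status before moving on):

OSD_ID=1   # the OSD being replaced
TOOLS=$(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}')

# Stop the operator and the OSD deployment
kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0
kubectl -n rook-ceph scale deployment rook-ceph-osd-${OSD_ID} --replicas=0

# Mark the OSD down and out, then wait until all PGs are active+clean again
kubectl -n rook-ceph exec -it $TOOLS -- ceph osd down osd.${OSD_ID}
kubectl -n rook-ceph exec -it $TOOLS -- ceph osd out osd.${OSD_ID}

# Remove the OSD from the cluster, then zap /dev/sdb on its node (see "Zap the disk")
kubectl -n rook-ceph exec -it $TOOLS -- ceph osd purge osd.${OSD_ID} --yes-i-really-mean-it

# Scale back up and wait for the rebalance to finish
kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=1
kubectl -n rook-ceph scale deployment rook-ceph-osd-${OSD_ID} --replicas=1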

Ideally the operator will automatically create the new OSD within a few minutes of adding the new device or updating the CR. If you don't see a new OSD automatically created, restart the operator (by deleting the operator pod) to trigger the OSD creation.
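One way to do that, assuming the default app=rook-ceph-operator pod label:

kubectl -n rook-ceph delete pod -l app=rook-ceph-operator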

Extra step after hitting an issue:

After the scaling operations, one OSD pod was left in an error state and storage was not available on that node. The fix was to edit the CephCluster CR with kubectl.

Edit with kubectl and remove the node entry:

kubectl edit CephCluster rook-ceph -n rook-ceph 

    - deviceFilter: sdb
      name: worker001
      resources: {}
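If it is unclear why an OSD was not created, the osd-prepare job logs for the node usually explain it (the app=rook-ceph-osd-prepare label assumed here is the Rook default):

kubectl -n rook-ceph logs -l app=rook-ceph-osd-prepare --tail=50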

End result:

[istacey@master001 ~]$ kubectl -n rook-ceph exec -it rook-ceph-tools-5d9d5db5bc-npz4r  -- ceph status
  cluster:
    id:     13c5138f-f2f6-46ea-8ee0-4966330ac081
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum a,b,c (age 3h)
    mgr: a(active, since 22h)
    osd: 4 osds: 4 up (since 94m), 4 in (since 94m)
 
  data:
    pools:   2 pools, 33 pgs
    objects: 5.12k objects, 19 GiB
    usage:   63 GiB used, 137 GiB / 200 GiB avail
    pgs:     33 active+clean
 
  io:
    client:   49 KiB/s wr, 0 op/s rd, 1 op/s wr
 
[istacey@master001 ~]$ kubectl -n rook-ceph exec -it rook-ceph-tools-5d9d5db5bc-npz4r  -- ceph osd status
ID  HOST        USED  AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE
 0  worker002  12.7G  37.2G      0        0       0        0   exists,up
 1  worker003  16.2G  33.7G      1     24.7k      0        0   exists,up
 2  worker004  15.2G  34.7G      0        0       0        0   exists,up
 3  worker001  18.6G  31.3G      0      819       0        0   exists,up

[istacey@master001 ~]$ kubectl -n rook-ceph exec -it rook-ceph-tools-5d9d5db5bc-npz4r  -- ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME           STATUS  REWEIGHT  PRI-AFF
-1         0.09760  root default
-9         0.02440      host worker001
 3    hdd  0.02440          osd.3           up   1.00000  1.00000
-3         0.02440      host worker002
 0    hdd  0.02440          osd.0           up   1.00000  1.00000
-5         0.02440      host worker003
 1    hdd  0.02440          osd.1           up   1.00000  1.00000
-7         0.02440      host worker004
 2    hdd  0.02440          osd.2           up   1.00000  1.00000

References:

https://github.com/rook/rook/blob/master/Documentation/ceph-osd-mgmt.md#remove-an-osd

https://github.com/rook/rook/issues/2997

https://docs.ceph.com/en/mimic/rados/operations/add-or-rm-osds/

https://www.cloudops.com/blog/the-ultimate-rook-and-ceph-survival-guide/