Scenario:
4 worker nodes, each with a 25 GB raw disk in use in a Ceph block cluster. As we are running low on space, we will extend the raw disks to 50 GB and update Rook-Ceph accordingly.
Ceph OSD Management
Ceph Object Storage Daemons (OSDs) are the heart and soul of the Ceph storage platform. Each OSD manages a local device, and together they provide the distributed storage. Rook automates the creation and management of OSDs based on the desired state in the CephCluster CR, hiding as much of the complexity as possible. This guide walks through some of the scenarios where more OSD configuration may be required.
OSD Health
The rook-ceph-tools pod provides a simple environment to run Ceph tools. The ceph commands mentioned in this document should be run from the toolbox.
Once the toolbox pod is created, connect to it to execute the ceph commands and analyze the health of the cluster, in particular the OSDs and placement groups (PGs). Some common commands to analyze OSDs include:
ceph status
ceph osd tree
ceph osd status
ceph osd df
ceph osd utilization
kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- bash
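For convenience, the health checks above can also be run non-interactively from the admin host. A minimal sketch, assuming the toolbox deployment carries the app=rook-ceph-tools label used above:
# Resolve the toolbox pod name once and reuse it
TOOLS=$(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}')
# Run each OSD health check through the toolbox without opening a shell
for cmd in "ceph status" "ceph osd tree" "ceph osd status" "ceph osd df" "ceph osd utilization"; do
  echo "=== $cmd ==="
  kubectl -n rook-ceph exec "$TOOLS" -- $cmd
done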
Status Before:
[istacey@master001 ~]$ kubectl -n rook-ceph exec -it rook-ceph-tools-5d9d5db5bc-npz4r -- ceph status
cluster:
id: 13c5138f-f2f6-46ea-8ee0-4966330ac081
health: HEALTH_OK
services:
mon: 3 daemons, quorum a,b,c (age 13h)
mgr: a(active, since 13h)
osd: 4 osds: 4 up (since 13h), 4 in (since 13h)
data:
pools: 2 pools, 129 pgs
objects: 5.12k objects, 19 GiB
usage: 63 GiB used, 37 GiB / 100 GiB avail
pgs: 129 active+clean
io:
client: 60 KiB/s wr, 0 op/s rd, 1 op/s wr
[istacey@master001 ~]$ kubectl --kubeconfig=/home/istacey/.kube/config-hr -n rook-ceph exec -it rook-ceph-tools-5d9d5db5bc-npz4r -- ceph osd status
ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
0 worker002 10.9G 14.0G 1 151k 0 0 exists,up
1 worker003 10.9G 14.0G 0 0 0 0 exists,up
2 worker004 11.1G 13.8G 0 0 0 0 exists,up
3 worker001 9.98G 15.0G 0 0 0 0 exists,up
istacey@worker001:~$ lsblk | grep sdb -A1
sdb 8:16 0 25G 0 disk
└─ceph--f067bb6e--522a--48c6--a2a8--8930d15dc02f-osd--block--dc871464--0a16--484a--8fa8--b723eec178f1 253:10 0 25G 0 lvm
Raw Disk Extended:
istacey@worker001:~$ lsblk | grep sdb -A2
sdb 8:16 0 50G 0 disk
└─ceph--f067bb6e--522a--48c6--a2a8--8930d15dc02f-osd--block--dc871464--0a16--484a--8fa8--b723eec178f1 253:10 0 25G 0 lvm
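Note: depending on the hypervisor, the guest may need a device rescan before lsblk reports the new size. A sketch, assuming /dev/sdb is a SCSI-attached virtual disk (adjust for your environment):
# Ask the kernel to re-read the size of the grown virtual disk
echo 1 | sudo tee /sys/class/block/sdb/device/rescan
lsblk /dev/sdb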
Remove the OSDs (one at a time):
https://github.com/rook/rook/blob/master/Documentation/ceph-osd-mgmt.md#remove-an-osd
To remove an OSD due to a failed disk or other re-configuration, consider the following to ensure the health of the data through the removal process:
- Confirm you will have enough space on your cluster after removing your OSDs to properly handle the deletion
- Confirm the remaining OSDs and their placement groups (PGs) are healthy in order to handle the rebalancing of the data
- Do not remove too many OSDs at once
- Wait for rebalancing between removing multiple OSDs
If all the PGs are active+clean and there are no warnings about being low on space, this means the data is fully replicated and it is safe to proceed. If an OSD is failing, the PGs will not be perfectly clean and you will need to proceed anyway.
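A quick way to confirm this from the admin host before touching an OSD, a sketch using the toolbox pod name from this walkthrough:
# All PGs must be reported as active+clean before an OSD is taken out
kubectl -n rook-ceph exec rook-ceph-tools-5d9d5db5bc-npz4r -- ceph pg stat
# e.g. "129 pgs: 129 active+clean; ..." -- any degraded/backfill states mean: wait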
Scale down rook-ceph-operator and the OSD deployments:
[istacey@master001 ~]$ kubectl get deployment -n rook-ceph | grep opera
rook-ceph-operator 1/1 1 1 77d
[istacey@master001 ~]$ kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0
deployment.apps/rook-ceph-operator scaled
[istacey@master001 ~]$ kubectl get deployment -n rook-ceph | grep opera
rook-ceph-operator 0/0 0 0 77d
[istacey@master001 ~]$ kubectl -n rook-ceph exec -it rook-ceph-tools-5d9d5db5bc-npz4r -- ceph status | egrep 'health|osds|usage'
health: HEALTH_OK
osd: 4 osds: 4 up (since 13h), 4 in (since 13h)
usage: 63 GiB used, 37 GiB / 100 GiB avail
[istacey@master001 ~]$ kubectl get deployment -n rook-ceph | grep osd
rook-ceph-osd-0 1/1 1 1 38h
rook-ceph-osd-1 1/1 1 1 77d
rook-ceph-osd-2 1/1 1 1 77d
rook-ceph-osd-3 1/1 1 1 77d
[istacey@master001 ~]$ kubectl -n rook-ceph scale deployment rook-ceph-osd-0 --replicas=0
deployment.apps/rook-ceph-osd-0 scaled
[istacey@master001 ~]$ kubectl get deployment -n rook-ceph | grep osd
rook-ceph-osd-0 0/0 0 0 38h
rook-ceph-osd-1 1/1 1 1 77d
rook-ceph-osd-2 1/1 1 1 77d
rook-ceph-osd-3 1/1 1 1 77d
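Before touching the OSD in Ceph, it is worth confirming that the osd.0 pod is really gone. A simple sketch:
# The deployment was scaled to 0, so no osd.0 pod should be listed any more
kubectl -n rook-ceph get pods | grep rook-ceph-osd-0 || echo "osd.0 pod no longer running"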
Down and out the OSD
[istacey@master001 ~]$ kubectl -n rook-ceph exec -it rook-ceph-tools-5d9d5db5bc-npz4r -- ceph osd down osd.0
osd.0 is already down.
[istacey@master001 ~]$ kubectl -n rook-ceph exec -it rook-ceph-tools-5d9d5db5bc-npz4r -- ceph status | egrep 'health|osds|usage'
health: HEALTH_WARN
1 osds down
1 host (1 osds) down
osd: 4 osds: 3 up (since 101s), 4 in (since 13h); 1 remapped pgs
usage: 63 GiB used, 37 GiB / 100 GiB avail
[istacey@master001 ~]$ kubectl -n rook-ceph exec -it rook-ceph-tools-5d9d5db5bc-npz4r -- ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.09760 root default
-9 0.02440 host worker001
3 hdd 0.02440 osd.3 up 1.00000 1.00000
-3 0.02440 host worker002
0 hdd 0.02440 osd.0 down 1.00000 1.00000
-5 0.02440 host worker003
1 hdd 0.02440 osd.1 up 1.00000 1.00000
-7 0.02440 host worker004
2 hdd 0.02440 osd.2 up 1.00000 1.00000
### Mark the OSD as out:
[istacey@master001 ~]$ kubectl -n rook-ceph exec -it rook-ceph-tools-5d9d5db5bc-npz4r -- ceph osd out osd.0
marked out osd.0.
Wait for the data to finish backfilling to other OSDs.
ceph status will indicate the backfilling is done when all of the PGs are active+clean.
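One way to poll for that condition instead of re-running ceph status by hand, a minimal sketch using the toolbox pod name from this walkthrough:
# Loop until ceph status no longer mentions degraded or backfilling PGs
TOOLS=rook-ceph-tools-5d9d5db5bc-npz4r
while kubectl -n rook-ceph exec "$TOOLS" -- ceph status | grep -qE 'degraded|backfill'; do
  echo "backfill still in progress..."; sleep 30
done
echo "all PGs active+clean"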
[istacey@master001 ~]$ kubectl -n rook-ceph exec -it rook-ceph-tools-5d9d5db5bc-npz4r -- ceph status
cluster:
id: 13c5138f-f2f6-46ea-8ee0-4966330ac081
health: HEALTH_WARN
Degraded data redundancy: 3171/15372 objects degraded (20.628%), 80 pgs degraded, 80 pgs undersized
services:
mon: 3 daemons, quorum a,b,c (age 13h)
mgr: a(active, since 13h)
osd: 4 osds: 3 up (since 4m), 3 in (since 96s); 80 remapped pgs
data:
pools: 2 pools, 129 pgs
objects: 5.12k objects, 19 GiB
usage: 50 GiB used, 25 GiB / 75 GiB avail
pgs: 3171/15372 objects degraded (20.628%)
78 active+undersized+degraded+remapped+backfill_wait
49 active+clean
2 active+undersized+degraded+remapped+backfilling
io:
client: 71 KiB/s wr, 0 op/s rd, 1 op/s wr
recovery: 6.5 MiB/s, 1 objects/s
### backfilling is done:
[istacey@master001 ~]$ kubectl -n rook-ceph exec -it rook-ceph-tools-5d9d5db5bc-npz4r -- ceph status | egrep 'health|osds|usage'
health: HEALTH_OK
osd: 4 osds: 3 up (since 22m), 3 in (since 19m)
usage: 62 GiB used, 13 GiB / 75 GiB avail
Remove the OSD from the Ceph cluster
[istacey@master001 ~]$ kubectl -n rook-ceph exec -it rook-ceph-tools-5d9d5db5bc-npz4r -- ceph osd purge osd.0 --yes-i-really-mean-it
purged osd.0
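After the purge it is worth double-checking that osd.0 is gone from both the CRUSH map and the auth database. A sketch:
# osd.0 should no longer appear in the tree or in the auth list
kubectl -n rook-ceph exec rook-ceph-tools-5d9d5db5bc-npz4r -- ceph osd tree
kubectl -n rook-ceph exec rook-ceph-tools-5d9d5db5bc-npz4r -- ceph auth ls | grep 'osd\.0' || echo "osd.0 auth entry removed"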
### Note osd.0 is on worker002:
[istacey@master001 ~]$ kubectl get pods -n rook-ceph -o wide | grep osd | grep -v prepare
rook-ceph-osd-1-6c468554f4-8btvj 1/1 Running 3 26h 10.42.171.207 worker003 <none> <none>
rook-ceph-osd-2-5f8ffcd5bb-p44d4 1/1 Running 1 25h 10.42.64.205 worker004 <none> <none>
rook-ceph-osd-3-5d8b989cb-4hf8h 1/1 Running 5 27h 10.42.7.26 worker001 <none> <none>
Zap the disk
https://github.com/rook/rook/blob/master/Documentation/ceph-teardown.md#zapping-devices
As root, clean and prepare the disk on the VM:
DISK="/dev/sdb"
# Wipe the start of the disk so no Ceph metadata or LVM signatures survive
dd if=/dev/zero of="$DISK" bs=1M count=100 oflag=direct,dsync
# Remove the ceph-volume device-mapper mappings left behind by the old OSD
ls /dev/mapper/ceph-* | xargs -I% -- dmsetup remove %
# Clean up the stale LVM device nodes
rm -rf /dev/ceph-*
rm -rf /dev/mapper/ceph--*
# Have the kernel re-read the partition table
partprobe $DISK
[root@worker002 ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdb 8:16 0 50G 0 disk
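Before handing the device back to Rook, it can be verified that no Ceph LVM or device-mapper leftovers remain. A sketch:
# No ceph-* device-mapper entries and no remaining signatures on the disk
sudo dmsetup ls | grep ceph || echo "no ceph device-mapper entries left"
sudo wipefs /dev/sdb   # prints nothing if no filesystem/LVM signatures remain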
Scale back up and let the OSD rejoin:
[istacey@master001 ~]$ kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=1
deployment.apps/rook-ceph-operator scaled
[istacey@master001 ~]$ kubectl -n rook-ceph scale deployment rook-ceph-osd-0 --replicas=1
deployment.apps/rook-ceph-osd-0 scaled
[istacey@master001 ~]$ kubectl -n rook-ceph get deployment | egrep 'rook-ceph-operator|rook-ceph-osd'
rook-ceph-operator 1/1 1 1 77d
rook-ceph-osd-0 1/1 1 1 39h
rook-ceph-osd-1 1/1 1 1 77d
rook-ceph-osd-2 1/1 1 1 77d
rook-ceph-osd-3 1/1 1 1 77d
[istacey@master001 ~]$ kubectl -n rook-ceph exec -it rook-ceph-tools-5d9d5db5bc-npz4r -- ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.12199 root default
-9 0.02440 host worker001
3 hdd 0.02440 osd.3 up 1.00000 1.00000
-3 0.04880 host worker002
0 hdd 0.04880 osd.0 up 1.00000 1.00000
-5 0.02440 host worker003
1 hdd 0.02440 osd.1 up 1.00000 1.00000
-7 0.02440 host worker004
2 hdd 0.02440 osd.2 up 1.00000 1.00000
istacey@worker002:~$ lsblk | grep sdb -A1
sdb 8:16 0 50G 0 disk
└─ceph--ea8115b7--5418--41b9--b4d3--d6e22526dbb1-osd--block--68cfcb49--f858--46f2--979f--dc266e4e6cf0 253:10 0 50G 0 lvm
Wait for the rebalance…
Rebalancing done:
[istacey@master001 ~]$ kubectl -n rook-ceph exec -it rook-ceph-tools-5d9d5db5bc-npz4r -- ceph status
cluster:
id: 13c5138f-f2f6-46ea-8ee0-4966330ac081
health: HEALTH_OK
services:
mon: 3 daemons, quorum a,b,c (age 14h)
mgr: a(active, since 14h)
osd: 4 osds: 4 up (since 32m), 4 in (since 32m)
task status:
data:
pools: 2 pools, 129 pgs
objects: 5.12k objects, 19 GiB
usage: 63 GiB used, 62 GiB / 125 GiB avail
pgs: 129 active+clean
io:
client: 73 KiB/s wr, 0 op/s rd, 2 op/s wr
[istacey@master001 ~]$ kubectl -n rook-ceph exec -it rook-ceph-tools-5d9d5db5bc-npz4r -- ceph df
--- RAW STORAGE ---
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 125 GiB 62 GiB 59 GiB 63 GiB 50.14
TOTAL 125 GiB 62 GiB 59 GiB 63 GiB 50.14
--- POOLS ---
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
device_health_metrics 1 1 0 B 0 0 B 0 15 GiB
replicapool 3 128 19 GiB 5.12k 58 GiB 55.89 15 GiB
[istacey@master001 ~]$ kubectl -n rook-ceph exec -it rook-ceph-tools-5d9d5db5bc-npz4r -- ceph osd status
ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
0 worker002 20.5G 29.4G 1 156k 0 0 exists,up
1 worker003 13.2G 11.7G 0 0 0 0 exists,up
2 worker004 14.3G 10.6G 0 0 0 0 exists,up
3 worker001 14.5G 10.4G 0 4095 0 0 exists,up
Repeat for the next 3 OSDs, one at a time; a condensed version of the sequence is sketched below.
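For reference, the per-OSD sequence used above, condensed into one sketch (the OSD id and toolbox pod name are placeholders to adjust for each iteration):
OSD_ID=1                                   # repeat for 1, 2 and 3
TOOLS=rook-ceph-tools-5d9d5db5bc-npz4r

kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0
kubectl -n rook-ceph scale deployment "rook-ceph-osd-${OSD_ID}" --replicas=0

kubectl -n rook-ceph exec "$TOOLS" -- ceph osd down "osd.${OSD_ID}"
kubectl -n rook-ceph exec "$TOOLS" -- ceph osd out "osd.${OSD_ID}"
# ...wait until all PGs are active+clean again, then:
kubectl -n rook-ceph exec "$TOOLS" -- ceph osd purge "osd.${OSD_ID}" --yes-i-really-mean-it

# On the worker that hosted osd.${OSD_ID}: extend the raw disk and zap /dev/sdb as shown above

kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=1
kubectl -n rook-ceph scale deployment "rook-ceph-osd-${OSD_ID}" --replicas=1
# ...wait for the new, larger OSD to rejoin and for rebalancing to finish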
The operator ideally will automatically create the new OSD within a few minutes of adding the new device or updating the CR. If you don’t see a new OSD automatically created, restart the operator (by deleting the operator pod) to trigger the OSD creation.
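If that becomes necessary, the operator pod can be deleted like this (a sketch, assuming the operator deployment carries the usual app=rook-ceph-operator label):
# The deployment recreates the pod, which triggers a fresh reconcile and OSD prepare run
kubectl -n rook-ceph delete pod -l app=rook-ceph-operator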
Extra step after hitting an issue:
After the scaling operations, one OSD pod was stuck in an error state and storage was not available on that node; the fix was to edit the CephCluster CR with kubectl.
### Edit with kubectl and remove the node entry:
kubectl edit CephCluster rook-ceph -n rook-ceph
- deviceFilter: sdb
name: worker001
resources: {}
End result:
[istacey@master001 ~]$ kubectl -n rook-ceph exec -it rook-ceph-tools-5d9d5db5bc-npz4r -- ceph status
cluster:
id: 13c5138f-f2f6-46ea-8ee0-4966330ac081
health: HEALTH_OK
services:
mon: 3 daemons, quorum a,b,c (age 3h)
mgr: a(active, since 22h)
osd: 4 osds: 4 up (since 94m), 4 in (since 94m)
data:
pools: 2 pools, 33 pgs
objects: 5.12k objects, 19 GiB
usage: 63 GiB used, 137 GiB / 200 GiB avail
pgs: 33 active+clean
io:
client: 49 KiB/s wr, 0 op/s rd, 1 op/s wr
[istacey@master001 ~]$ kubectl -n rook-ceph exec -it rook-ceph-tools-5d9d5db5bc-npz4r -- ceph osd status
ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
0 worker002 12.7G 37.2G 0 0 0 0 exists,up
1 worker003 16.2G 33.7G 1 24.7k 0 0 exists,up
2 worker004 15.2G 34.7G 0 0 0 0 exists,up
3 worker001 18.6G 31.3G 0 819 0 0 exists,up
[istacey@master001 ~]$ kubectl -n rook-ceph exec -it rook-ceph-tools-5d9d5db5bc-npz4r -- ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.09760 root default
-9 0.02440 host worker001
3 hdd 0.02440 osd.3 up 1.00000 1.00000
-3 0.02440 host worker002
0 hdd 0.02440 osd.0 up 1.00000 1.00000
-5 0.02440 host worker003
1 hdd 0.02440 osd.1 up 1.00000 1.00000
-7 0.02440 host worker004
2 hdd 0.02440 osd.2 up 1.00000 1.00000
References:
https://github.com/rook/rook/blob/master/Documentation/ceph-osd-mgmt.md#remove-an-osd
https://github.com/rook/rook/issues/2997
https://docs.ceph.com/en/mimic/rados/operations/add-or-rm-osds/
https://www.cloudops.com/blog/the-ultimate-rook-and-ceph-survival-guide/