[Ceph] Troubleshoot Ceph PG Incomplete
Investigating
You can check globally using ceph -s
and you can see there is incomplete or inactive but for more detail you can see by using this command. On this command result osd 100 is the primary of pg 19.64
ceph pg dump_stuck inactive
output:
ok
PG_STAT STATE UP UP_PRIMARY ACTING ACTING_PRIMARY
19.64 incomplete [100,85,89] 100 [100,85,89] 100
Or you can use this command for more details
ceph pg ls incomplete
output:
PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG STATE SINCE VERSION REPORTED UP ACTING SCRUB_STAMP DEEP_SCRUB_STAMP
19.64 297 0 0 0 1245708288 0 0 1328 incomplete 24m 69498'1328 204713:96240 [100,85,89]p100 [100,85,89]p100 2022-04-04T04:47:37.785708+0000 2021-10-21T06:24:54.701161+0000
You can also check where is the PG lives
root@juju-bf8473-19-lxd-0:~# ceph osd lspools
1 default.rgw.buckets.data
2 default.rgw.control
3 default.rgw.data.root
4 default.rgw.gc
5 default.rgw.log
6 default.rgw.intent-log
7 default.rgw.meta
8 default.rgw.usage
9 default.rgw.users.keys
10 default.rgw.users.email
11 default.rgw.users.swift
12 default.rgw.users.uid
13 default.rgw.buckets.extra
14 default.rgw.buckets.index
15 .rgw.root
16 gnocchi
17 cinder-ceph
18 glance
19 scbench
20 cinder-ceph-ssd
21 device_health_metrics
22 testbench
23 rbd-kubernetes
24 k8s-uat
PG start with pool id follow by other numbers, for example in this case pg 19.64 lives on sbench
First Workaround
Set flag on ceph cluster
sudo ceph osd set noout
sudo ceph osd set norebalance
sudo ceph osd set nobackfill
sudo ceph osd set norecover
Check the pg query
ceph pg {pg.id} query | grep -Ew 'state|peer'
Try restart the osd one by one
ceph osd find {osd.id}
systemctl restart ceph-osd@{osd.id}
After that check the cluster status
Second Workaround
If the first workaround still not helping try use this but you must aware that there is possibility of data will be loss and you still need to set the ceph flags
Stop the primary osd daemon
systemctl stop ceph-osd@{osd.id}
After that mark complete the pg
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-{osd.id} --op mark-complete --pgid {pg.id}
Then you can start the osd daemon again and check the ceph cluster
systemctl start ceph-osd@{osd.id}
References
- https://www.oreilly.com
- https://lists.ceph.io
- https://medium.com/opsops
Happy, Enjoy Ngoprek ~
Comments