[Ceph] Troubleshoot Ceph PG Incomplete
Investigating
You can check the cluster globally with ceph -s and see whether any PGs are incomplete or inactive, but for more detail use the command below. In this output, osd.100 is the primary of pg 19.64.
ceph pg dump_stuck inactive
output:
ok
PG_STAT  STATE       UP           UP_PRIMARY  ACTING       ACTING_PRIMARY
19.64    incomplete  [100,85,89]         100  [100,85,89]             100
Or you can use this command for even more detail:
ceph pg ls incomplete
output:
PG     OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES       OMAP_BYTES*  OMAP_KEYS*  LOG   STATE       SINCE  VERSION     REPORTED      UP               ACTING           SCRUB_STAMP                      DEEP_SCRUB_STAMP
19.64      297         0          0        0  1245708288            0           0  1328  incomplete    24m  69498'1328  204713:96240  [100,85,89]p100  [100,85,89]p100  2022-04-04T04:47:37.785708+0000  2021-10-21T06:24:54.701161+0000
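A quicker summary of the affected PGs is also available from ceph health detail; the grep here is only a convenience to filter for the incomplete state:
ceph health detail | grep -i incomplete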
You can also check which pool the PG lives in:
root@juju-bf8473-19-lxd-0:~# ceph osd lspools
1 default.rgw.buckets.data
2 default.rgw.control
3 default.rgw.data.root
4 default.rgw.gc
5 default.rgw.log
6 default.rgw.intent-log
7 default.rgw.meta
8 default.rgw.usage
9 default.rgw.users.keys
10 default.rgw.users.email
11 default.rgw.users.swift
12 default.rgw.users.uid
13 default.rgw.buckets.extra
14 default.rgw.buckets.index
15 .rgw.root
16 gnocchi
17 cinder-ceph
18 glance
19 scbench
20 cinder-ceph-ssd
21 device_health_metrics
22 testbench
23 rbd-kubernetes
24 k8s-uat
A PG ID starts with the pool ID followed by the PG number within that pool, so in this case pg 19.64 lives in the scbench pool (pool 19).
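To confirm the mapping, you can also ask the cluster directly: ceph pg map prints which OSDs the PG is mapped to, and the pool's replica size tells you how many copies are expected. The commands below simply reuse the pool and PG from this example:
ceph pg map 19.64
ceph osd pool get scbench size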
First Workaround
Set the following flags on the Ceph cluster so it does not start rebalancing or recovery while you work:
sudo ceph osd set noout
sudo ceph osd set norebalance
sudo ceph osd set nobackfill
sudo ceph osd set norecover
Check the PG's query output for its state and peering information:
ceph pg {pg.id} query | grep -Ew 'state|peer'
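If jq is available, you can also inspect the full peering information instead of grepping; recovery_state is where pg query usually reports what the PG is blocked on (exact field names can vary between releases):
ceph pg 19.64 query | jq '.state, .recovery_state'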
Try restarting the OSDs in the PG's acting set one by one; ceph osd find shows which host each OSD runs on:
ceph osd find {osd.id}
systemctl restart ceph-osd@{osd.id}
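If you want to script this step, here is a minimal sketch that walks the acting set from the example above (100, 85, 89), restarts each OSD on its host, and re-checks the PG before moving on. The SSH access and the jq field name are assumptions, so adjust them to your environment:
for id in 100 85 89; do
    # look up the host this OSD runs on (field name may differ between Ceph releases)
    host=$(ceph osd find ${id} | jq -r '.crush_location.host')
    # restart the OSD daemon on that host
    ssh "${host}" "systemctl restart ceph-osd@${id}"
    # give peering some time, then check whether the PG is still incomplete
    sleep 60
    ceph pg ls incomplete
done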
After that, check the cluster status again with ceph -s.
Second Workaround
If the first workaround does not help, try the following, but be aware that it can cause data loss. The Ceph flags from the first workaround still need to be set.
Stop the primary OSD daemon (osd.100 in this example):
systemctl stop ceph-osd@{osd.id}
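While the daemon is stopped, it can be worth exporting the PG from that OSD first so you have a copy of its data before marking it complete; the output file path below is only an example:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-{osd.id} --pgid {pg.id} --op export --file /root/{pg.id}.export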
After that, mark the PG as complete:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-{osd.id} --op mark-complete --pgid {pg.id}
Then you can start the OSD daemon again and check the Ceph cluster:
systemctl start ceph-osd@{osd.id}
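Once the PG is active again and the cluster has recovered, remember to unset the flags that were set earlier:
sudo ceph osd unset noout
sudo ceph osd unset norebalance
sudo ceph osd unset nobackfill
sudo ceph osd unset norecover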
Happy, Enjoy Ngoprek ~