Ceph Block Store incident and associated service incidents
Description
As noted on OTG:0054944, on Thursday 20 February 2020 at around 10:00, parts of the Ceph infrastructure went down.
Ceph is used as the backing store for VMs and provides attached disks and shared filesystems to many services; consequently, several CERN site and VO services were affected.
Impact
What was affected (Grid-facing services)
- The CERN Batch Service was unavailable for job submission. Longer-running jobs eventually failed after losing contact with the batch schedds for too long. (OTG:0054948)
- The CVMFS Stratum-0 was unavailable for some hours (OTG:005496)
- The WLCG monitoring infrastructure was degraded, recovering the backlog some hours later (OTG:0054946)
- Multiple VO services, both local ones and those supporting the experiments' distributed production and analysis, were unavailable.
- Other non-Grid services were affected and can be found on the IT-SSB Service Incident page
- Notably lxplus, AIADM, acron, some AFS volumes, GitLab and file access in Indico.
What was not affected (Grid-facing services)
- The EOS Grid services were not affected
- aiadm-homeless, a server for administering agile infrastructure machines with minimum external dependencies
Timeline of the incident
LXBATCH
- 10:10: lxbatch submissions blocked
- 18:00: lxbatch job submissions open again
- 19:30: lxbatch back to normal levels
Analysis
- In most cases, secondary service degradation was due to a direct dependency on Ceph.
- In a few cases (e.g. Indico file access) the effect was tertiary, due to affected AFS volumes hosted on VMs.
- The services' dependencies (on Ceph) were reviewed by IT management at the C5 Service Meeting (PDF) and generally found to be reasonable and a required component of the service delivery.
Root Cause
- The root cause of the Ceph outage was determined after one week of investigations.
- Versions of the LZ4 compression library earlier than 1.8.2 have an encoding bug when compressing fragmented memory buffers.
- CentOS 7 ships version 1.7.5.
- The outage occurred when a central cluster metadata structure (the osdmap) was incorrectly compressed, corrupting 4 bits; the Ceph OSD daemons then threw a CRC exception and exited.
- On Tuesday 25 February the LZ4 library was first suspected and subsequently disabled. LZ4 will remain disabled.
- Block storage data previously compressed is not expected to be at risk, because it is normally compressed from a consolidated memory buffer.
- We are highly confident in the root cause analysis:
- We are able to reproduce the exact bit corruption in a new Ceph unit test.
- Ceph upstream has developed a workaround for the buggy LZ4, and is getting input from the LZ4 developer.
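The failure mode described above, a few corrupted bits tripping a checksum on read and crashing the daemon, can be illustrated with a minimal sketch. This is not Ceph code: it uses Python's `zlib.crc32` and hypothetical `store`/`load` helpers to mimic how an OSD verifies the osdmap's checksum and aborts on a mismatch.

```python
import zlib

def store(data: bytes) -> tuple[bytes, int]:
    """Persist a blob together with its CRC32 checksum."""
    return data, zlib.crc32(data)

def load(data: bytes, expected_crc: int) -> bytes:
    """Verify the checksum on read; raise on mismatch, as an OSD would."""
    if zlib.crc32(data) != expected_crc:
        raise ValueError("CRC mismatch: stored data is corrupt")
    return data

osdmap = b"example osdmap payload" * 4
blob, crc = store(osdmap)

# Flip 4 bits in one byte, mimicking the effect of the LZ4 encoding bug.
corrupted = bytearray(blob)
corrupted[10] ^= 0b00001111

try:
    load(bytes(corrupted), crc)
except ValueError as e:
    print(e)  # at this point the real daemon would exit
```

CRC32 detects any burst error of up to 32 bits, so a 4-bit corruption like this one is always caught; the problem in the incident was not detection but that every OSD reading the same corrupted map crashed.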
Follow up
Actions:
- Reconfigure aiadm servers, which provide hosts for infrastructure tools, to have no dependencies on external storage. SSB at OTG:0055070
- Ceph: keep LZ4 compression disabled until an upstream fix becomes available.
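The second action can be sketched with Ceph's standard BlueStore configuration options. This is an illustrative fragment, assuming a cluster using the centralized config database; the exact mechanism used at CERN is not stated here.

```shell
# Switch BlueStore compression away from LZ4 cluster-wide
# (e.g. to snappy), or disable compression entirely.
ceph config set osd bluestore_compression_algorithm snappy

# Alternative: turn compression off until a fixed lz4 package ships.
ceph config set osd bluestore_compression_mode none
```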
Links
Ceph OTG:
Linked OTGs: