Mastering Ceph Operations:
Upmap and the Mgr Balancer
Dan van der Ster | CERN IT | Ceph Day Berlin | 12 November 2018
Already >250PB physics data
Ceph @ CERN Update
CERN Ceph Clusters (Size, Version)
• OpenStack Cinder/Glance Production: 5.5PB, luminous
• Satellite data centre (1000km away): 1.6PB, luminous
• Hyperconverged KVM+Ceph: 16TB, luminous
• CephFS (HPC+Manila) Production: 0.8PB, luminous
• Client Scale Testing: 0.4PB, luminous
• Hyperconverged HPC+Ceph: 0.4PB, luminous
• CASTOR/XRootD Production: 4.4PB, luminous
• CERN Tape Archive: 0.8TB, luminous
• S3+SWIFT Production (4+2 EC): 2.3PB, luminous
Stable growth in RBD, S3, CephFS
CephFS Scale Testing
• Two activities testing performance:
• HPC storage with CephFS, BoF at SuperComputing in
Dallas right now! (Pablo Llopis, CERN IT)
• Scale testing ceph-csi with k8s (10000 cephfs clients!)
RBD Tuning
Rackspace / CERN Openlab Collaboration
Performance assessment tools
• ceph-osd benchmarking suite
• rbd top for identifying active clients
Studied performance impacts:
• various hdd configurations
• Flash for block.db, wal, dm-*
• hyperconverged configurations
Target real use-cases at CERN
• database applications
• monitoring and data analysis
Upmap and the Mgr Balancer
Background: 2-Step Placement
1. RANDOM: Map an object to a PG uniformly at random.
2. CRUSH: Map each PG to a set of OSDs using CRUSH.
Do we really need PGs?
• Why don’t we map objects to OSDs directly with CRUSH?
• If we did that, all OSDs would be coupled (peered) with all
others.
• Any simultaneous failure of 3 OSDs would lead to some data loss.
• Consider a 1000-OSD cluster:
• roughly 1000³ possible 3-OSD combinations (≈1.66×10⁸ distinct triples), but only #PGs of them are relevant for
data loss.
• …
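To make the arithmetic concrete (the PG count below is a hypothetical example, not from the slides):

```python
# With 3-way replication spread over all possible OSD triples, any 3 failed
# OSDs would lose some object; PGs restrict data to only #PGs of those triples.
import math

osds, replicas = 1000, 3
pgs = 16384          # hypothetical total number of PGs in the cluster

print(f"possible 3-OSD combinations: {math.comb(osds, replicas):,}")   # 166,167,000
print(f"combinations actually holding data: {pgs:,}")
```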
Do we really need CRUSH?
• Why don’t we just distribute the PG mappings directly?
• CRUSH provides a language for describing data placement
rules according to your infrastructure.
• The “failure-domain” part of CRUSH is always perfect: e.g. it will
never put two copies on the same host (unless you tell it to…)
• The uniformity part of CRUSH is imperfect: uneven osd utilizations
are a fact of life. Perfection requires an impractical number of PGs.
Do we really need CRUSH?
• Why don’t we just distribute the PG mappings directly?
• We can! Now in luminous with upmap!
2.1-Step Placement
RANDOM → CRUSH → UPMAP
• Upmap entries are per-PG exceptions stored in the OSDMap that override CRUSH's PG→OSD choices, without touching the CRUSH rules themselves.
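For reference, an upmap exception can also be applied and removed by hand with the pg-upmap-items commands (luminous+); the PG id and OSD ids below are hypothetical:

```python
# Apply and remove one upmap exception by hand. Adapt pgid/OSD ids to your cluster.
import subprocess

pgid = "1.2f"                  # hypothetical PG
from_osd, to_osd = "4", "17"   # move this PG's copy from osd.4 to osd.17

# Override CRUSH's choice for this one PG:
subprocess.run(["ceph", "osd", "pg-upmap-items", pgid, from_osd, to_osd], check=True)

# Drop the exception again; the PG falls back to its plain CRUSH mapping:
subprocess.run(["ceph", "osd", "rm-pg-upmap-items", pgid], check=True)
```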
Data Imbalance
• User Question:
• ceph df says my cluster is 50% full overall but the
cephfs_data pool is 73% full. What’s wrong?
• A pool is effectively full once its first (most-full) OSD is full.
• ceph df therefore reports pool usage based on the most-full OSD in that pool.
• The gap between the two numbers shows that the OSDs are imbalanced.
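This is easy to verify: the pool is limited by its most-full OSD, not by the average. A small sketch (the JSON field names are assumed from ceph osd df -f json; check them on your version):

```python
# A pool's reported fullness is driven by its most-full OSD, not the average.
# Field names assumed from 'ceph osd df -f json'; verify on your version.
import json, subprocess

out = subprocess.run(["ceph", "osd", "df", "-f", "json"],
                     capture_output=True, check=True, text=True)
utils = [n["utilization"] for n in json.loads(out.stdout)["nodes"]]

print(f"mean OSD utilization: {sum(utils) / len(utils):.1f}%")
print(f"max  OSD utilization: {max(utils):.1f}%   <- this is what limits the pool")
```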
Data Balancing
• Luminous added a data balancer.
• The general idea is to slowly move PGs from the most full to the least full
OSDs.
• Two ways to accomplish this, i.e. two balancer modes:
• crush-compat: internally tweak the effective OSD weights (a compat weight-set), e.g. to make under-full disks look bigger.
• Compatible with legacy clients.
• upmap: move PGs precisely where we want them (without breaking failure-domain rules).
• Compatible with luminous+ clients.
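In upmap mode, the balancer's work is visible as pg_upmap_items entries in the OSDMap. A sketch to list them (JSON field names assumed from ceph osd dump -f json; verify against your release):

```python
# List the upmap exceptions currently in the OSDMap (what the upmap balancer creates).
import json, subprocess

out = subprocess.run(["ceph", "osd", "dump", "-f", "json"],
                     capture_output=True, check=True, text=True)
for item in json.loads(out.stdout).get("pg_upmap_items", []):
    moves = ", ".join(f"{m['from']}->{m['to']}" for m in item["mappings"])
    print(item["pgid"], moves)
```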
Turning on the balancer (luminous)
• My advice: Do whatever you can to use the upmap balancer:
ceph osd set-require-min-compat-client luminous
ceph mgr module ls
ceph mgr module enable balancer
ceph config-key set mgr/balancer/begin_time 0830
ceph config-key set mgr/balancer/end_time 1800
ceph config-key set mgr/balancer/max_misplaced 0.005
ceph config-key set mgr/balancer/upmap_max_iterations 2
ceph balancer mode upmap
ceph balancer on
Luminous limitation: if the balancer config doesn't take effect, restart the active ceph-mgr.
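Once enabled, the balancer can be sanity-checked with its status and eval commands; a trivial wrapper, assuming the ceph CLI is on the path:

```python
# Sanity-check that the balancer is enabled and making progress.
import subprocess

for cmd in (["ceph", "balancer", "status"],
            ["ceph", "balancer", "eval"]):   # eval: lower score = better balanced
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)
```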
Success!
• It really works
• On our largest clusters, we have recovered hundreds of
terabytes of capacity using the upmap balancer.
But wait, there’s more…
Ceph Operations at Scale
• Cluster is nearly full (need to add capacity)
• Cluster has old tunables (need to set to optimal)
• Cluster has legacy osd reweights (need to reweight all to 1.0)
• Cluster data placement changes (need to change crush ruleset)
• Operations like those above often involve:
• large amounts of data movement
• lasting several days or weeks
• unpredictable impact on users (and the cluster)
• and no easy rollback!
WE ARE HERE → LEAP OF FAITH → WE WANT TO BE HERE
A brief interlude…
• “remapped” PGs are fully replicated (normally “clean”), but
CRUSH wants to move them to a new set of OSDs
• 4815 active+remapped+backfill_wait
• “norebalance” is a cluster state to tell Ceph *not* to make
progress on any remapped PGs
• ceph osd set norebalance
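A quick way to see the up-versus-acting difference for remapped PGs (the JSON layout of ceph pg ls differs between releases, hence the fallback):

```python
# Show where CRUSH wants each remapped PG ("up") versus where its data
# currently is ("acting"). 'ceph pg ls -f json' returns a bare list in
# luminous and nests under "pg_stats" in later releases.
import json, subprocess

out = subprocess.run(["ceph", "pg", "ls", "remapped", "-f", "json"],
                     capture_output=True, check=True, text=True)
data = json.loads(out.stdout)
pgs = data["pg_stats"] if isinstance(data, dict) else data

for pg in pgs:
    print(pg["pgid"], "up:", pg["up"], "acting:", pg["acting"])
```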
Adding capacity
• Step 1: set the norebalance flag
• Step 2: ceph-volume lvm create…
• Step 3: 4518 active+remapped+backfill_wait
• Step 4: ???
• Step 5: HEALTH_OK
What if we could “upmap” those remapped PGs back to where the data is now?
Adding capacity with upmap
• But you may be wondering:
• We’ve added new disks to the cluster, but they have zero PGs, so
what’s the point?
• The good news is that the upmap balancer will automatically
notice the new OSDs are under full, and will gradually move
PGs to them.
• In fact, the balancer simply *removes* the upmap entries we created
to keep PGs off of the new OSDs.
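A minimal sketch of this "upmap remapped" idea: for each remapped PG, generate pg-upmap-items entries that point it from where CRUSH now wants it ("up") back to where the data already is ("acting"). This is not CERN's actual tool, it only handles simple replicated pools, and the generated commands should be reviewed before being applied:

```python
# Sketch of the "upmap remapped" trick: emit one pg-upmap-items command per
# remapped PG, mapping it from its new "up" OSDs back to its "acting" OSDs.
# Not CERN's actual tool; simple replicated pools only; review before applying.
import json, subprocess

def remapped_pgs():
    out = subprocess.run(["ceph", "pg", "ls", "remapped", "-f", "json"],
                         capture_output=True, check=True, text=True)
    data = json.loads(out.stdout)
    return data["pg_stats"] if isinstance(data, dict) else data

for pg in remapped_pgs():
    pairs = []
    for u, a in zip(pg["up"], pg["acting"]):
        if u != a:                      # remap from the new OSD back to the current one
            pairs += [str(u), str(a)]
    if pairs:
        print("ceph osd pg-upmap-items", pg["pgid"], " ".join(pairs))
        # subprocess.run(["ceph", "osd", "pg-upmap-items", pg["pgid"], *pairs], check=True)
```

Once these entries are in place the cluster is immediately HEALTH_OK again, and the balancer later removes them itself as it gradually fills the new OSDs.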
Changing tunables
• Step 1: set the norebalance flag
• Step 2: ceph osd crush tunables optimal
• Step 3: 4815162/20935486 objects misplaced
• Step 4: ???
• Step 5: HEALTH_OK
Such an intensive backfilling would not be transparent to the users…
Instead, as with adding capacity, we upmap the remapped PGs back to where the data currently is and return to HEALTH_OK.
And remember, the balancer will slowly move PGs around to where they need to be.
Other use-cases: Legacy OSD Weights
• Remove legacy OSD reweights:
• ceph osd reweight-by-utilization is a legacy feature for
balancing OSDs with a [0,1] reweight factor.
• When we set the reweights back to 1.0, many PGs will become
active+remapped
• We can upmap them back to where they are.
• Bonus: Since those reweights helped balance the OSDs, this
acts as a shortcut to find the right set of upmaps to balance a
cluster.
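A sketch of the first half of that procedure: find OSDs with a legacy reweight below 1.0 and print the commands to reset them (field names assumed from ceph osd df -f json; run with norebalance set, then upmap the remapped PGs back as above):

```python
# Find OSDs carrying a legacy [0,1] reweight and print commands to reset them to 1.0.
# Field names assumed from 'ceph osd df -f json'; verify on your version.
import json, subprocess

out = subprocess.run(["ceph", "osd", "df", "-f", "json"],
                     capture_output=True, check=True, text=True)
for node in json.loads(out.stdout)["nodes"]:
    if node["reweight"] < 1.0:
        print(f"ceph osd reweight {node['id']} 1.0   # currently {node['reweight']:.4f}")
```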
Other use-cases: Placement Changes
• We had a pool with the first 2 replicas in Room A and 3rd
replica in Room B.
• For capacity reasons, we needed to move all three
replicas into Room A.
Other use-cases: Placement Changes
• What did we do?
1. Create the new crush ruleset
2. Set the norebalance flag
3. Set the pool’s crush rule to be the new one…
• This puts *every* PG in “remapped” state.
4. Use upmap to map those PGs back to where they are (3rd replica in
the wrong room)
5. Gradually remove those upmap entries to slowly move all PGs fully
into Room A (see the throttled-removal sketch below).
6. HEALTH_OK
We moved several hundred TBs of data
without users noticing, and could pause
with HEALTH_OK at any time.
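A sketch of the gradual-removal step: drain a small batch of upmap entries per run, letting a little data move each time (the batch size is an arbitrary example; JSON field names assumed from ceph osd dump -f json):

```python
# Drain the upmap exceptions in small batches so data migrates slowly into
# Room A; re-run (e.g. from cron) while the cluster stays HEALTH_OK.
import json, subprocess

BATCH = 10   # hypothetical pace; tune to what your cluster and users tolerate

out = subprocess.run(["ceph", "osd", "dump", "-f", "json"],
                     capture_output=True, check=True, text=True)
entries = json.loads(out.stdout).get("pg_upmap_items", [])

for item in entries[:BATCH]:
    subprocess.run(["ceph", "osd", "rm-pg-upmap-items", item["pgid"]], check=True)

print(f"removed {len(entries[:BATCH])} upmap entries, {len(entries[BATCH:])} remaining")
```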
No leap of faith required
WE ARE HERE → UPMAP → WE WANT TO BE HERE
What’s next for “upmap remapped”
• We find this capability to be super useful!
• Wrote external tools to manage the upmap entries.
• It can be tricky!
• After some iteration with upstream we could share…
• Possibly contribute as a core feature?
• Maybe you can think of other use-cases?
Thanks!