
Cloud Computing

Abstract

Over the past few years, cloud computing has become a popular mechanism for providing computing resources to end users. Within WLCG specifically, the infrastructure as a service (IaaS) approach has been attractive to some experiments as a possible source of external computing resources. Some sites have now started to experiment with internal clouds, and some of these clouds have been opened to dedicated users from the experiments so that they can try out alternative access methods. This model was presented within the TEG.

Introduction

Clouds are nowadays a popular model for providing computing resources in industry. Several sites already have some limited experience with running internal clouds. As an example, CERN has been running an infrastructure called "lxcloud" for more than a year, which is used for virtualizing parts of the batch resources and which provides EC2 access as a pilot service for selected users from different VOs. During a scalability test, the infrastructure was scaled up to 16,000 virtual machines, all configured as standard worker nodes and connecting to LSF.

An internal cloud is fully transparent to users. Grid services become applications which use machines provided by the internal IaaS cloud, and Grid service managers become users of this general infrastructure of the site. For end users, there is no difference in the services they see.

The picture changes if sites decide to open their internal cloud to users by granting them access to the resources through a cloud interface.

Definitions

Many definitions of what exactly a cloud is are in circulation, and clouds are often defined by their features. The main features include dynamism and elasticity, in the sense that clouds can grow (and shrink) dynamically, depending on demand. Another important principle is "you pay for what you use": users pay for resources only for as long as they use them. Clouds often give users the impression of having infinite resources. Users expect them to be responsive and available when needed.

Clouds come in different manifestations:

Infrastructure as a Service (IaaS)
Platform as a Service (PaaS)
Software as a Service (SaaS)

Some ideas from cloud computing, specifically the Infrastructure as a Service approach, can be of interest for managing sites, in particular if this approach is combined with large-scale virtualization of computer centers. Virtualization is often called an enabling technology for IaaS clouds. It has to be stressed, though, that virtualization, which is already a reality at many sites for production services, must not be confused with cloud computing.

Assumptions

The model presented here is based on an IaaS infrastructure.
The model assumes that the applications which are run on the infrastructure are non-critical, for example CPU-intensive applications which can die and simply be restarted.

Use cases

General explications

List of use cases

Identifier	Actors	Pre-conditions	Scenario	Outcome	(Optional) What to avoid
1.1	site	fully virtualized infrastructure	complete decoupling of physical and virtual resources	transparent rollout of fast, modern hardware with modern operating systems, without the need to migrate the applications to the new OS too quickly, resulting in reduced cost and better hardware support
1.2	site	fully virtualized infrastructure	virtualization of batch resources	dynamic and automatic adaptation of the batch farm composition to the requirements of incoming jobs, resulting in increased job throughput and increased resource usage
1.3	site	fully virtualized infrastructure	service consolidation	consolidation of long lived and critical services with low resource requirements on the same hardware resulting in increased resource usage
2.1	site	non-optimal CPU resource utilization	mixing of CPU intensive and less demanding applications on the same physical node using virtualization	increased CPU utilization, increase of delivered CPU cycles	avoid overloading/overcommitting of hypervisors
2.2	sites	non-optimal CPU resource utilization	a site runs a build service which runs expensive software builds each night. The same site runs a login service which is used during the day	reduced amount of required resources through elastic shifting of resources on a day/night cycle
3.	site	scheduling of intrusive updates on hosts with many different applications	automation of intrusive updates by dynamic destroy/creation of virtual machines	significantly reduced operational cost	inefficiencies due to short VM lifetimes
4.	site	internal cloud supporting cloud bursting	high peak demands accessing the local capacities	processing of peak demands by buying-in external resources	increased cost
5.	VOs	frameworks which work on clouds	a university gets extra funding for resources on clouds outside WLCG, and students don't get enough resources inside WLCG due to priorities	reduced pressure on sites from less important end users
6.	all	ability to make use of industry-standard interfaces for resource access	VOs can run their frameworks both on clouds at sites and on commercial clouds	reduced cost by exploiting standard technologies
7.	site	fully virtualized infrastructure	complete decoupling of physical and virtual resources	transparent rollout of fast, modern hardware with modern operating systems, without the need to migrate the applications to the new OS too quickly, resulting in reduced cost and better hardware support
8.	site	fully virtualized infrastructure	virtual batch resources	dynamic and automatic adaptation of the batch farm composition to the requirements of incoming jobs, resulting in increased job throughput and increased resource usage
9.	VOs	access to internal clouds at sites	fast access to development machines within the LCG network for software development and tests	quick access (timescale: seconds) to development machines which can be scrapped easily when done
10.	VOs	access to internal clouds at sites	fast access to build machines for software development, dynamically growing and shrinking build clusters for software builds and deployment	reduced time to build, test and deploy software releases
11.	VOs	access to internal clouds at sites	fast creation of complete test infrastructures for new developments	reduced time to production
12.1	sites	fully virtualized infrastructure supporting live migration	hardware intervention/hardware exchange	easy freeing of machines which need hardware interventions, resulting in reduced cost due to reduced downtime (very short draining time)
12.2	sites	fully automatic hardware error detection	hardware detects an issue (e.g. using SMART) and frees itself by moving the applications and data on it somewhere else	reduced cost due to reduced downtime (very short draining time) and more efficient use during the lifetime of the hardware
13.1	VOs	coordination of cloud features between sites	endorsed WLCG-specific (or even VO-specific) images which can run at any site	increased consistency of worker nodes across sites	avoid overly specialized images
13.2	VOs	coordination of cloud features between sites	virtual machines can be instantiated at any WLCG site, depending on where the data is closest	efficient use of resources across sites, shorter processing time for jobs
13.3	sites	pilot job replacement	VOs fire up one image at each site, which builds up a dynamic and elastic cross-WLCG processing farm. Each instance connects directly to the VO's central queue to get a fitting payload	increased efficiency due to reduced overhead
14.	VOs	support for DMZ at sites for data preservation	ability to run old software releases on old operating systems, for example for data preservation	longer access to old data
15.1	sites	security and trust	a security update for running images is available and can be deployed to running instances	ensure system security
15.2	VOs	security and trust	a user is running a malicious application. The VOs and the sites work together to identify and stop him	traceability
15.3	sites	security and trust	during a security incident a site is asked to provide the syslog information from running VMs	traceability and system security
16	sites	hardware replacements	once batches of machines have reached the end of their lifetime, they can be transparently replaced by new machines	efficient hardware management, entirely transparent to the user and without reduction of available resources
17	sites	existing site policies	a site requires services with limited security features, for example NFS, which require restricting access to site services	continued support for special applications
18	sites	idle or rapidly free-able resources	a site has received a new batch of resources for computing or storage. Instead of leaving them idle while waiting for final allocation, they join the infrastructure (in a special zone) and are made available for applications like BOINC	exploitation of resources which would otherwise be idle

Requirements

General explications

List of requirements

requirements for the infrastructure

Identifier	Requirement	Originating use cases	(Optional) Comments
1.	applications and infrastructure should be strictly separated from each other	1.1, 7, 16
2.1	the infrastructure must support non-critical applications with a finite lifetime (batch, development)	1.2, 9, 10, 11
2.2	the infrastructure must support long-lived applications which can be live-migrated to a different machine	1.3, 16
3.	the infrastructure must have a small but sufficient amount of free resources to which running machines can be migrated	1.3, 12.1, 12.2, 16	such resources can be used by very short-lived applications. The point is that they can be freed quickly if needed, so that live migration of machines can actually work.
4.	the placement algorithms of the cloud controller have to be aware of the current usage of a given hypervisor, its free resources and the requirements of the applications running on it	1.3, 2.1, 2.2, 12.2	implications for selecting a cloud controller and designing the internal cloud infrastructure
5.	the infrastructure must provide a standard entry point for IaaS access to pilot users	6, 9, 10, 11
6.	strong authentication is required for the IaaS entry point	9, 10, 11, 15.1, 15.2, 15.3
7.1	the part of the infrastructure which supports less critical number-crunching applications (e.g. batch worker nodes) must be fast and responsive, in the sense that a VM which has been placed is available within seconds	1.2, 8, 13.2, 13.3	inefficiencies in starting up VMs may add up to a large waste of resources at large scale
7.2	the part of the infrastructure which supports less critical number-crunching applications (e.g. batch worker nodes) must be efficient, in the sense that any virtualization overhead is tuned to be as small as possible (CPU, I/O and network)	1.2, 8, 13.2, 13.3
8.	the site should make sure to virtualize as many local applications as possible, and enable them to run on the cloud infrastructure	2.1
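
Requirement 4 can be illustrated with a minimal placement sketch. It is not taken from any particular cloud controller: the `Hypervisor` and `VMRequest` classes and the "least-loaded host that still fits" policy are assumptions chosen for illustration only. The sketch refuses to overcommit, in line with the "what to avoid" column of use case 2.1.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hypervisor:
    # Hypothetical description of one physical host.
    name: str
    total_cores: int
    used_cores: int
    total_ram_gb: int
    used_ram_gb: int

    def free_cores(self) -> int:
        return self.total_cores - self.used_cores

    def free_ram_gb(self) -> int:
        return self.total_ram_gb - self.used_ram_gb


@dataclass
class VMRequest:
    # Hypothetical resource requirements of one virtual machine.
    cores: int
    ram_gb: int


def place(request: VMRequest, hypervisors: List[Hypervisor]) -> Optional[Hypervisor]:
    """Return the least-loaded hypervisor that fits the request, or None."""
    candidates = [
        h for h in hypervisors
        if h.free_cores() >= request.cores and h.free_ram_gb() >= request.ram_gb
    ]
    if not candidates:
        return None  # no host fits: do not overcommit
    # Prefer the host with the lowest fraction of cores in use.
    best = min(candidates, key=lambda h: h.used_cores / h.total_cores)
    # Book the resources so that later placements see the updated usage.
    best.used_cores += request.cores
    best.used_ram_gb += request.ram_gb
    return best
```

A real controller would also have to weigh I/O and network load and the nature of the applications already running on the host; this sketch only considers cores and memory.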

requirements for experiment frameworks

Identifier	Requirement	Originating use cases	(Optional) Comments
Comment	Describe the requirement for the service	Comma separated list of use case identifiers from the table above	additional comments on this requirement
9.	experiment frameworks must be secure	15.2, 17
10.	experiment frameworks must ensure traceability of user payload	15.2
11.	experiments must be able to run on virtual machines	2, 4, 5, 6, 7, 13.1, 13.2, 13.3
12.1	applications must have a limited lifetime	3, 7, 8, 12.1, ...
12.2	applications must be able to know their remaining lifetime	3, 7, 8, 12.1	this has implications for the contextualization of the images, as proposed by HEPiX
12.3	applications should be able to terminate and clean up after themselves when they are done	3, 12.1	may require a mechanism which allows applications to destroy the VM they are running on
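
Requirements 12.1 to 12.3 can be sketched as a payload loop running inside a VM. This is only an illustration under stated assumptions: the way the end time reaches the VM (for example through contextualization) is not specified here, and `vm_end_time`, `max_payload_seconds` and the `shutdown` hook are hypothetical names.

```python
import time
from typing import Callable, Iterable


def run_payloads(
    payloads: Iterable[Callable[[], None]],
    vm_end_time: float,
    max_payload_seconds: float,
    shutdown: Callable[[], None],
    now: Callable[[], float] = time.time,
) -> int:
    """Run payloads while enough lifetime remains, then clean up.

    Returns the number of payloads that were run.
    """
    completed = 0
    try:
        for payload in payloads:
            remaining = vm_end_time - now()
            if remaining < max_payload_seconds:
                break  # not enough lifetime left for another payload
            payload()
            completed += 1
    finally:
        # Hypothetical hook: release resources or ask for the VM to be destroyed.
        shutdown()
    return completed
```

The `now` parameter exists only so that the loop can be exercised with a simulated clock.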

requirements for images (see ref. [1])

Identifier	Requirement	Originating use cases	(Optional) Comments
Comment	Describe the requirement for the service	Comma separated list of use case identifiers from the table above	additional comments on this requirement
12.	shared images must not contain any site-specific information	13.1,13.2,13.3
13.1	shared images must not allow users to gain root access	17
13.2	sites need to be able to inspect and black-list images for their site	17
14.	images must not contain any secret information (e.g. root passwords) so that they can be freely shared between sites	13.1, 13.2, 13.3
15.	images must come with a certain amount of meta-data which describes what they are, and what is required to run them	13.1,13.2,13.3
16.	shared images must come from a trusted source	17
17.	it must be possible to revoke an image, and the revocation must take immediate effect at all sites supporting this image	15.1
18.	it must be possible to deploy critical updates to running instances	15.1
19.	images should be as small as possible, and as generic as possible so that they can be shared between VOs	13.1,13.2,13.3
20.	sites must have the possibility to customize images at instantiation time	15.1,15.3	this is the site contextualization phase as proposed by HEPiX
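
The image requirements on trusted sources, revocation and site black-lists (13.2, 15, 16 and 17) could be combined into a site-side check before instantiation. The following is a minimal sketch and not the HEPiX mechanism itself: the `ImageMetadata` fields and the three sets passed to `check_image` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Set, Tuple


@dataclass(frozen=True)
class ImageMetadata:
    # Hypothetical minimal metadata shipped with an image (requirement 15).
    image_id: str      # unique identifier of the image
    endorser: str      # party vouching for the image
    description: str   # what the image is
    min_ram_gb: int    # what is required to run it


def check_image(
    meta: ImageMetadata,
    trusted_endorsers: Set[str],
    revoked_ids: Set[str],
    site_blacklist: Set[str],
) -> Tuple[bool, str]:
    """Decide whether a site may instantiate the image, with a reason."""
    if meta.endorser not in trusted_endorsers:
        return False, "untrusted source"
    if meta.image_id in revoked_ids:
        return False, "revoked by endorser"
    if meta.image_id in site_blacklist:
        return False, "black-listed by site"
    return True, "ok"
```

For revocation to take immediate effect, the set of revoked identifiers would have to be refreshed from the endorser before every instantiation; how that list is distributed is outside the scope of this sketch.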

External Resources

Many issues have already been addressed by the virtualization working group in HEPiX [1]. These include:

endorsement and trust
image exchange policies
contextualization
passing of information to applications inside the VMs

Several prototypes for internal clouds already exist within WLCG, e.g. [2]. While CMS has stated within the TEG that they are currently not interested in cloud computing, ATLAS has a variety of ongoing cloud computing activities, which are documented in [3]. A relevant external resource is a publication on security and privacy in public clouds recently published by NIST [4].

Impact

Pilot job frameworks

It should be possible to significantly reduce the overhead caused by pilot job frameworks. The pilot role is moved to the cloud factory for the specific VO and is thus separated from the payload. The payload itself is run in a sandbox. UID switching within a single job (glExec) should no longer be required.

Data locality

A general problem which needs to be considered in the cloud approach is data locality. In the described scenario, the experiment frameworks are responsible for dispatching payload jobs to VMs running at sites where the data to be read is available.
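
As a rough sketch of such a dispatch decision, an experiment framework could pick, among the sites holding a replica of the required dataset, the one with the most idle VMs. The replica catalogue and the idle-VM counts are hypothetical inputs here; real frameworks would obtain them from their own data management and monitoring systems.

```python
from typing import Dict, Optional, Set


def pick_site(
    dataset: str,
    replicas: Dict[str, Set[str]],
    idle_vms: Dict[str, int],
) -> Optional[str]:
    """Return a site that holds the dataset and has idle VMs, or None."""
    sites = [
        site for site in replicas.get(dataset, set())
        if idle_vms.get(site, 0) > 0
    ]
    if not sites:
        return None  # no site with both the data and free capacity
    # Sort first so that ties are broken deterministically by site name.
    return max(sorted(sites), key=lambda site: idle_vms[site])
```

When no suitable site exists, the framework has to decide between waiting, starting new VMs at a site that holds the data, or reading the data remotely; the sketch leaves that choice open.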

WMS and Computing elements

The intelligence of the WMS is moved to the experiment frameworks. The computing element role is largely taken over by the cloud entry points of the sites.

Authentication and authorization

The authentication and authorization mechanisms which exist in current cloud controllers may not match the requirements of WLCG and need to be reviewed.

Image exchange

Trust chains for image exchange have been discussed intensively within HEPiX [1], and policies have been established.

Image instantiation

The need for image contextualization by the site at instantiation time has been discussed extensively within HEPiX [1]. A mechanism for doing this has been proposed, and is used for example by CERNVM [5].

Recommendations

Sites are free to use virtualization. If they do, it must be transparent to users.
Sites which decide to go beyond virtualization and set up an internal cloud are encouraged to open part of the resources via EC2 access to VOs which are interested in this.
In order to profit most from clouds at different sites, some level of synchronization between sites is needed. This includes the sharing of images and the definition of common virtual machine templates.
In the described IaaS models, experiments need to integrate their frameworks with clouds, and sites need to adapt their applications to use internal clouds. We recommend founding a body which involves all parties to synchronize these efforts, with the aim of avoiding independent developments and multiple solutions for the same problems.
Accounting: Cloud access does not come for free. The TEG recommends that
- VOs making use of cloud resources are fully accounted for their cloud usage.
- fixed quotas per VO are set up to ensure that each VO gets its pledge.
Possibilities to enable fair share at the cloud level need to be investigated.
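
The accounting recommendation can be sketched as a small ledger that enforces a fixed core quota per VO and accumulates core-hours for accounting. The class and its method names are assumptions for illustration; fair share, which still needs to be investigated, is deliberately not modelled.

```python
from collections import defaultdict
from typing import Dict


class VOLedger:
    """Hypothetical per-VO quota enforcement and usage accounting."""

    def __init__(self, quota_cores: Dict[str, int]) -> None:
        self.quota_cores = dict(quota_cores)           # fixed quota per VO
        self.in_use: Dict[str, int] = defaultdict(int)
        self.core_hours: Dict[str, float] = defaultdict(float)

    def start_vm(self, vo: str, cores: int) -> bool:
        """Admit the VM only if the VO stays within its quota."""
        if self.in_use[vo] + cores > self.quota_cores.get(vo, 0):
            return False
        self.in_use[vo] += cores
        return True

    def stop_vm(self, vo: str, cores: int, hours: float) -> None:
        """Release the cores and account the usage in core-hours."""
        self.in_use[vo] -= cores
        self.core_hours[vo] += cores * hours
```

A fixed quota guarantees each VO its pledge but leaves unused capacity idle, which is why fair share at the cloud level is listed as an open point.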

Conclusions

Private clouds are mainly interesting for sites, as they offer ways to potentially reduce the cost of managing their resources. Opening such internal clouds to specific users inside WLCG comes almost for free for the sites, and offers a modern way of accessing resources at different sites. Private clouds are entirely transparent to end users. Opening them to pilot users possibly offers a way to reduce the overhead which comes with pilot jobs, and can thus help to reduce operational cost. Although the base technology already exists today, and several proof-of-concept studies have been completed successfully, investigations have only just started, so experience is still limited. Many issues need to be identified and discussed before this model can be used for large-scale production. In the long term, though, clouds may have great potential for future WLCG processing.

References

[1] [http://hepix.org][HEPiX]

[2] Cloud-like infrastructures within HEP

[3] [https://twiki.cern.ch/twiki/bin/view/Atlas/CloudcomputingRnD][ATLAS cloud activities]

[4] [http://www.nist.gov/customcf/get_pdf.cfm?pub_id=909494][NIST recommendations for clouds]

[5] [http://cernvm.cern.ch/portal/][CERNVM]

-- UlrichSchwickerath - 05-Mar-2012

Topic revision: r8 - 2012-06-19 - UlrichSchwickerath
