Cloud Computing

Abstract

In recent years, cloud computing has become a popular mechanism for providing computing resources to end users. Within WLCG specifically, the infrastructure as a service (IaaS) approach has been attractive to some experiments as a possible source of external computing resources. Some sites have now started to experiment with internal clouds, and some of these clouds have been opened to dedicated users from the experiments so that they can try out alternative access methods. This model was presented within the TEG.

Introduction

Clouds are nowadays a popular model for providing computing resources in industry. Several sites already have some limited experience with running internal clouds. As an example, CERN has been running an infrastructure called "lxcloud" for more than a year, which is used for virtualizing parts of the batch resources and which provides EC2 access as a pilot service for selected users from different VOs. During a scalability test, the infrastructure was scaled up to 16,000 virtual machines, all configured as standard worker nodes and connected to LSF.

An internal cloud is fully transparent to users. Grid services become applications which use machines provided by the internal IaaS cloud, and Grid service managers become users of this general infrastructure of the site. For end users, there is no difference in the services they see.

The picture changes if sites decide to open their internal cloud to users by granting them access to the resources through a cloud interface.

Definitions

There are many definitions in circulation of what exactly a cloud is. Often, clouds are defined via their features. The main features include dynamic behaviour and elasticity, in the sense that clouds can grow (and shrink) dynamically, depending on demand. Another important principle is "you get what you pay for", in the sense that users pay for resources only for as long as they use them. Clouds often give users the feeling of having infinite resources. Users expect them to be responsive and available when needed.

Clouds come in different manifestations:

  • Infrastructure as a Service (IaaS)
  • Platform as a Service (PaaS)
  • Software as a Service (SaaS)

Some cloud concepts, specifically the Infrastructure as a Service approach, can be of interest for managing sites, in particular if this approach is combined with large-scale virtualization of computer centers. Virtualization is often called an enabling technology for IaaS clouds. It has to be stressed, though, that virtualization, which is already a reality for production services at many sites, must not be confused with cloud computing.

Assumptions

  • The model presented here is based on an IaaS infrastructure.
  • The model assumes that the applications which are run on the infrastructure are non-critical, for example CPU-intensive number crunching applications which can die and simply be restarted.

Use cases

General explications

List of use cases

| Identifier | Actors | Pre-conditions | Scenario | Outcome | (Optional) What to avoid |
| 1.1 | site | fully virtualized infrastructure | complete decoupling of physical and virtual resources | transparent rollout of fast and modern hardware on modern operating systems without the need to migrate the applications to the new OS too quickly, resulting in reduced cost and better hardware support |  |
| 1.2 | site | fully virtualized infrastructure | virtualization of batch resources | dynamic and automatic adaptation of the batch farm composition according to the requirements of incoming jobs, resulting in increased job throughput and increased resource usage |  |
| 1.3 | site | fully virtualized infrastructure | service consolidation | consolidation of long-lived and critical services with low resource requirements on the same hardware, resulting in increased resource usage |  |
| 2.1 | site | non-optimal CPU resource utilization | mixing of CPU-intensive and less demanding applications on the same physical node using virtualization | increased CPU utilization, increase of delivered CPU cycles | avoid overloading/overcommitting of hypervisors |
| 2.2 | sites | non-optimal CPU resource utilization | a site runs a build service which runs expensive software builds each night. The same site runs a login service which is used during the day | reduced amount of required resources by elastic shifting of resources on a day/night cycle |  |
| 3. | site | scheduling of intrusive updates on hosts with many different applications | automation of intrusive updates by dynamic destruction/creation of virtual machines | significantly reduced operational cost | inefficiencies due to short VM lifetimes |
| 4. | site | internal cloud supporting cloud bursting | high peak demands accessing the local capacities | processing of peak demands by buying in external resources | increased cost |
| 5. | VOs | frameworks which work on clouds | a university gets extra funding for resources on clouds outside WLCG, and students don't get enough resources inside WLCG due to priorities | reduced pressure on sites from less important end users |  |
| 6. | all | ability to make use of industry-standard interfaces for resource access | VOs can run their frameworks both on clouds at sites and on commercial clouds | reduced cost by exploiting standard technologies |  |
| 7. | site | fully virtualized infrastructure | complete decoupling of physical and virtual resources | transparent rollout of fast and modern hardware on modern operating systems without the need to migrate the applications to the new OS too quickly, resulting in reduced cost and better hardware support |  |
| 8. | site | fully virtualized infrastructure | virtual batch resources | dynamic and automatic adaptation of the batch farm composition according to the requirements of incoming jobs, resulting in increased job throughput and increased resource usage |  |
| 9. | VOs | access to internal clouds at sites | fast access to development machines within the LCG network for software development and tests | quick access (timeline: seconds) to development machines which can be scrapped easily when done |  |
| 10. | VOs | access to internal clouds at sites | fast access to build machines for software development, dynamically growing and shrinking build clusters for software builds and deployment | reduced time to build, test and deploy software releases |  |
| 11. | VOs | access to internal clouds at sites | fast creation of complete test infrastructures for new developments | reduced time to production |  |
| 12.1 | sites | fully virtualized infrastructure supporting live migration | hardware intervention/hardware exchange | easy freeing of machines which need hardware interventions, resulting in reduced cost due to reduced downtime (very short draining time) |  |
| 12.2 | sites | fully automatic hardware error detection | hardware detects an issue (e.g. using SMART) and frees itself by moving the applications and data on it somewhere else | reduced cost due to reduced downtime (very short draining time) and more efficient use during the lifetime of the hardware |  |
| 13.1 | VOs | coordination of cloud features between sites | endorsed WLCG-specific (or even VO-specific) images which can run at any site | increased consistency of worker nodes across sites | avoid too specialized images |
| 13.2 | VOs | coordination of cloud features between sites | virtual machines can be instantiated at any WLCG site, depending on where the data is closest | efficient use of resources across sites, shorter processing time for jobs |  |
| 13.3 | sites | pilot job replacement | VOs fire up one image at each site which builds up a dynamic and elastic cross-WLCG processing farm. Each instance directly connects to the VO's central queue to get a fitting payload | increased efficiency due to reduced overhead |  |
| 14. | VOs | support for DMZ at sites for data preservation | ability to run old software releases on old operating systems, for example for data preservation | longer access to old data |  |
| 15.1 | sites | security and trust | a security update for running images is available and can be deployed to running instances | ensure system security |  |
| 15.2 | VOs | security and trust | a user is running a malicious application. The VOs and the sites work together to identify and stop the user | traceability |  |
| 15.3 | sites | security and trust | during a security incident a site is asked to provide the syslog information from running VMs | traceability and system security |  |
| 16 | sites | hardware replacements | once batches of machines have reached the end of their lifetime, they can be transparently replaced by new machines | efficient hardware management, entirely transparent to the user and without reduction of available resources |  |
| 17 | sites | existing site policies | a site requires services with limited security features, for example NFS, which require restricting access to site services | continued support for special applications |  |
| 18 | sites | idle or rapidly free-able resources | a site received a new batch of resources for computing or storage. Instead of leaving them idle while waiting for final allocation, they join the infrastructure (in a special zone) and are made available for applications like BOINC | exploitation of resources which would otherwise be idle |  |

Requirements

General explications

List of requirements

requirements for the infrastructure

| Identifier | Requirement | Originating use cases | (Optional) Comments |
| 1. | applications and infrastructure should be strictly separated from each other | 1.1, 7, 16 |  |
| 2.1 | the infrastructure must support uncritical applications with finite lifetime (batch, development) | 1.2, 9, 10, 11 |  |
| 2.2 | the infrastructure must support long-lived applications which can be live-migrated to a different machine | 1.3, 16 |  |
| 3. | the infrastructure must have a small but sufficient amount of free resources to which running machines can be migrated | 1.3, 12.1, 12.2, 16 | such resources can be used by very short-lived applications. The point here is that they can be freed quickly if needed so that live migration of machines can actually work. |
| 4. | the placement algorithms of the cloud controller have to be aware of the current usage of a given hypervisor, its free resources and the requirements of the applications running on it | 1.3, 2.1, 2.2, 12.2 | implications for selecting a cloud controller and designing the internal cloud infrastructure |
| 5. | the infrastructure must provide a standard entry point for IaaS access to pilot users | 6, 9, 10, 11 |  |
| 6. | strong authentication is required for the IaaS entry point | 9, 10, 11, 15.1, 15.2, 15.3 |  |
| 7.1 | the part of the infrastructure which supports less critical number crunching applications (e.g. batch worker nodes) must be fast and responsive, in the sense that a VM which has been placed is available within seconds | 1.2, 8, 13.2, 13.3 | inefficiencies in starting up VMs may sum up to a large waste of resources at large scale |
| 7.2 | the part of the infrastructure which supports less critical number crunching applications (e.g. batch worker nodes) must be efficient, in the sense that any virtualization overhead is tuned to be as small as possible (CPU, I/O and network) | 1.2, 8, 13.2, 13.3 |  |
| 8. | the site should make sure to virtualize as many local applications as possible, and enable them to run on the cloud infrastructure | 2.1 |  |
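
Requirements 3 and 4 above can be illustrated with a minimal placement sketch. It is not taken from any existing cloud controller: the names `Hypervisor`, `place_vm` and `reserve_cores` are hypothetical, only CPU cores and memory are considered, and overcommitting is simply forbidden. The site-wide reserve stands for the free resources that must remain available as a target for live migration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hypervisor:
    """Hypothetical view of one hypervisor as seen by the placement logic."""
    name: str
    total_cores: int
    total_mem_gb: int
    used_cores: int = 0
    used_mem_gb: int = 0

    def free_cores(self) -> int:
        return self.total_cores - self.used_cores

    def free_mem_gb(self) -> int:
        return self.total_mem_gb - self.used_mem_gb


def place_vm(hypervisors: List[Hypervisor], cores: int, mem_gb: int,
             reserve_cores: int = 0) -> Optional[Hypervisor]:
    """Place a VM on the hypervisor with the most free cores, without overcommitting.

    reserve_cores is the number of free cores that must remain across the
    whole site after the placement, so that live migration keeps a target.
    Returns the chosen hypervisor, or None if the request cannot be placed.
    """
    site_free = sum(h.free_cores() for h in hypervisors)
    if site_free - cores < reserve_cores:
        return None  # placing this VM would eat into the migration reserve
    candidates = [h for h in hypervisors
                  if h.free_cores() >= cores and h.free_mem_gb() >= mem_gb]
    if not candidates:
        return None  # no single hypervisor can host the VM
    best = max(candidates, key=lambda h: h.free_cores())
    best.used_cores += cores
    best.used_mem_gb += mem_gb
    return best
```

A real cloud controller would also take I/O, network and the requirements of already running applications into account, which is why the choice of controller is listed as an implication of requirement 4.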

requirements for experiment frameworks

| Identifier | Requirement | Originating use cases | (Optional) Comments |
| Comment | Describe the requirement for the service | Comma separated list of use case identifiers from the table above | additional comments on this requirement |
| 9. | experiment frameworks must be secure | 15.2, 17 |  |
| 10. | experiment frameworks must ensure traceability of user payload | 15.2 |  |
| 11. | experiments must be able to run on virtual machines | 2, 4, 5, 6, 7, 13.1, 13.2, 13.3 |  |
| 12.1 | applications must have a limited lifetime | 3, 7, 8, 12.1, ... |  |
| 12.2 | applications must be able to know about their remaining lifetime | 3, 7, 8, 12.1 | this has implications for the contextualization of the images, as proposed by HEPiX |
| 12.3 | applications should be able to terminate and clean up after themselves when they are done | 3, 12.1 | may require a mechanism which allows applications to destroy the VM they are running on |
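
Requirements 12.1 to 12.3 can be sketched as a payload loop that only starts a new payload if the remaining lifetime of the VM is long enough. This is an illustration under stated assumptions, not the mechanism of any experiment framework: the deadline is assumed to be passed to the application (for example during contextualization), and `run_until_deadline`, `margin` and the payload estimates are hypothetical.

```python
import time
from typing import Callable, Iterable, List, Tuple

# A payload is described by a name and an estimated runtime in seconds (assumed known).
Payload = Tuple[str, float]


def remaining_lifetime(deadline: float, now: float) -> float:
    """Seconds left before the VM reaches the end of its lifetime (never negative)."""
    return max(0.0, deadline - now)


def run_until_deadline(payloads: Iterable[Payload], deadline: float,
                       run: Callable[[str], None],
                       clock: Callable[[], float] = time.time,
                       margin: float = 60.0) -> List[str]:
    """Run payloads in order while the remaining lifetime covers estimate plus margin.

    Returns the names of the payloads that were run. The caller is expected
    to clean up and shut the VM down once this function returns.
    """
    done: List[str] = []
    for name, estimate in payloads:
        if remaining_lifetime(deadline, clock()) < estimate + margin:
            break  # not enough time left: stop cleanly instead of being killed
        run(name)
        done.append(name)
    return done
```

The `clock` parameter exists only so that the behaviour can be checked with a simulated clock; in normal use the default `time.time` applies.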

requirements for images (see ref. [1])

| Identifier | Requirement | Originating use cases | (Optional) Comments |
| Comment | Describe the requirement for the service | Comma separated list of use case identifiers from the table above | additional comments on this requirement |
| 12. | shared images must not contain any site-specific information | 13.1, 13.2, 13.3 |  |
| 13.1 | shared images must not allow users to gain root access | 17 |  |
| 13.2 | sites need to be able to inspect and black-list images for their site | 17 |  |
| 14. | images must not contain any secret information (e.g. root passwords) so that they can be freely shared between sites | 13.1, 13.2, 13.3 |  |
| 15. | images must come with a certain amount of meta-data which describes what they are, and what is required to run them | 13.1, 13.2, 13.3 |  |
| 16. | shared images must come from a trusted source | 17 |  |
| 17. | it must be possible to revoke an image, and the revocation must have an immediate effect on sites supporting this image | 15.1 |  |
| 18. | it must be possible to deploy critical updates to running instances | 15.1 |  |
| 19. | images should be as small as possible, and as generic as possible so that they can be shared between VOs | 13.1, 13.2, 13.3 |  |
| 20. | sites must have the possibility to customize images at instantiation time | 15.1, 15.3 | this is the site contextualization phase as proposed by HEPiX |
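
The image requirements above (trusted source, revocation, site black-listing, no secrets, no site-specific information) can be combined into a site-side acceptance check. The following is a minimal sketch and not the HEPiX implementation: `ImageMetadata`, `SitePolicy` and all field names are hypothetical, and the meta-data is assumed to be truthful, which in practice requires endorsement by a trusted party.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple


@dataclass(frozen=True)
class ImageMetadata:
    """Hypothetical minimal meta-data shipped with a shared image."""
    image_id: str
    endorser: str
    has_secrets: bool = False
    has_site_specific_config: bool = False


@dataclass
class SitePolicy:
    """Hypothetical site policy used to decide whether an image may be instantiated."""
    trusted_endorsers: Set[str] = field(default_factory=set)
    revoked_images: Set[str] = field(default_factory=set)      # revoked by the endorser
    blacklisted_images: Set[str] = field(default_factory=set)  # banned locally by the site

    def check(self, image: ImageMetadata) -> Tuple[bool, str]:
        """Return (accepted, reason) for the given image."""
        if image.image_id in self.revoked_images:
            return False, "revoked by endorser"
        if image.image_id in self.blacklisted_images:
            return False, "black-listed by site"
        if image.endorser not in self.trusted_endorsers:
            return False, "untrusted source"
        if image.has_secrets or image.has_site_specific_config:
            return False, "not shareable"
        return True, "ok"
```

Evaluating the check at every instantiation, rather than once at image import, is one way to make a revocation take effect immediately for new instances; already running instances would still need the update mechanism of requirement 18.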

External Resources

Many issues have already been addressed by the virtualization working group in HEPiX [1]. These include:
  • endorsement and trust
  • image exchange policies
  • contextualization
  • passing of information to applications inside the VMs

Several prototypes for internal clouds already exist within WLCG, e.g. [2]. While CMS has stated within the TEG that they are currently not interested in cloud computing, ATLAS has a variety of ongoing cloud computing activities, which are documented in [3]. A relevant external resource is a publication on security and privacy in public clouds recently published by NIST [4].

Impact

Pilot job frameworks

It should be possible to significantly reduce the overhead caused by pilot job frameworks. The pilot role is moved to the cloud factory for the specific VO and thus separated from the payload. The payload itself is run in a sandbox. The UID switching within a single job (glExec) should no longer be required.

Data locality

A general problem which needs to be considered in the cloud approach is data locality. In the described scenario the experiment frameworks are responsible for dispatching payload jobs to VMs running at sites where the data to be read is available.
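
The dispatching decision described above can be sketched as follows. This is a simplified illustration, not the logic of any existing experiment framework: the function `pick_site`, the replica catalogue `replicas` and the per-site `free_slots` counts are hypothetical inputs that a framework would have to obtain from its own data management and from the VMs it runs at the sites.

```python
from typing import Dict, Optional, Set


def pick_site(dataset: str, replicas: Dict[str, Set[str]],
              free_slots: Dict[str, int]) -> Optional[str]:
    """Choose a site that holds the dataset and has a free VM slot.

    Among the eligible sites the one with the most free slots wins; ties are
    broken alphabetically so that the result is deterministic. Returns None
    if no site holding the data has free capacity.
    """
    eligible = [site for site in replicas.get(dataset, set())
                if free_slots.get(site, 0) > 0]
    if not eligible:
        return None
    # max() returns the first maximum, so sorting first fixes the tie-break.
    return max(sorted(eligible), key=lambda site: free_slots[site])
```

When `pick_site` returns `None`, the framework has to decide between waiting for capacity at a site holding the data and reading the data remotely; that policy is deliberately left out of the sketch.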

WMS and Computing elements

The intelligence of the WMS is moved to the experiment frameworks. The computing element role is largely taken over by the cloud entry points of the sites.

Authentication and authorization

The authentication and authorization mechanisms which exist in current cloud controllers may not match the requirements of WLCG and need to be reviewed.

Image exchange

Trust chains for image exchange have been intensively discussed within HEPiX [1], and policies have been established.

Image instantiation

The need for image contextualization at instantiation time by the site has been extensively discussed within HEPiX [1]. A mechanism to do that has been proposed, and is used for example by CERNVM [5].

Recommendations

  • Sites are free to use virtualization. If they do, it must be transparent to users.
  • Sites which decide to go beyond virtualization and set up an internal cloud are encouraged to open part of the resources via EC2 access to VOs which are interested in this.
  • In order to profit most from clouds at different sites, some level of synchronization between sites is needed. This includes the sharing of images and the definition of common virtual machine templates.
  • In the described IaaS models, experiments need to integrate their frameworks with clouds, and sites need to adapt their applications to use internal clouds. We recommend founding a body which involves all parties to synchronize these efforts, with the aim of avoiding independent developments and multiple solutions for the same problems.
  • Accounting: Cloud access does not come for free. The TEG recommends that
    • VOs making use of cloud resources are fully accounted for their cloud usage.
    • fixed quotas per VO are set up to ensure that each VO gets its pledge.
  • Possibilities to enable fair share at the cloud level need to be investigated.
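
The accounting recommendations can be illustrated with a small bookkeeping sketch. It assumes, as a simplification, that a quota is expressed as a maximum number of concurrently used cores per VO and that usage is accounted in core-hours; the class `CloudAccounting` and the VO names are hypothetical. Fair share, which the last recommendation leaves open, is not modelled.

```python
from collections import defaultdict
from typing import Dict


class CloudAccounting:
    """Hypothetical per-VO quota enforcement and usage accounting."""

    def __init__(self, core_quota: Dict[str, int]) -> None:
        self.core_quota = dict(core_quota)                       # fixed quota per VO
        self.running_cores: Dict[str, int] = defaultdict(int)    # cores currently in use
        self.core_hours: Dict[str, float] = defaultdict(float)   # accounted usage

    def start_vm(self, vo: str, cores: int) -> bool:
        """Admit the VM only if the VO stays within its quota."""
        if self.running_cores[vo] + cores > self.core_quota.get(vo, 0):
            return False
        self.running_cores[vo] += cores
        return True

    def stop_vm(self, vo: str, cores: int, hours: float) -> None:
        """Release the cores and account the usage in core-hours."""
        self.running_cores[vo] -= cores
        self.core_hours[vo] += cores * hours
```

With fixed quotas, cores left unused by one VO are not offered to another, which is the inefficiency that a fair-share mechanism at the cloud level would have to address.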

Conclusions

Private clouds are mainly interesting for sites, as they offer ways to potentially reduce the cost of managing their resources. Opening such internal clouds to specific users inside WLCG comes almost for free for the sites, and offers a modern way of accessing resources at different sites. Private clouds are entirely transparent to end users. Opening them to pilot users possibly offers a way to reduce the overhead which comes with pilot jobs, and thus can help to reduce operational cost. Although the base technology already exists today, and several proof-of-concept studies have been successfully completed, investigations have only just started, so experience is still limited. Many issues need to be identified and discussed before this model can be used for large-scale production. In the long term, though, clouds may have great potential for future WLCG processing.

References

[1] HEPiX, http://hepix.org

[2] Cloud-like infrastructures within HEP

[3] ATLAS cloud activities, https://twiki.cern.ch/twiki/bin/view/Atlas/CloudcomputingRnD

[4] NIST recommendations for clouds, http://www.nist.gov/customcf/get_pdf.cfm?pub_id=909494

[5] CERNVM, http://cernvm.cern.ch/portal/

-- UlrichSchwickerath - 05-Mar-2012

Topic revision: r8 - 2012-06-19 - UlrichSchwickerath