Integrating Cloud Computing resources into the WLCG experiment frameworks

This page collects the activities driven by the IT-ES group to integrate Cloud resources from different providers with the experiment frameworks of the LHC experiments at CERN.

Meetings

Regular meetings are held to monitor the progress of the proofs of concept and to converge on a common strategy and solution for the experiments involved.

13 June 2013

  • Helix nebula
    • The tests with the SlipStream Blue Box started this week and, although there are still configuration issues, it looks like we can resolve them by the end of the month. After the PoC a decision has to be made to select one Blue Box as the final product.
  • ATLAS Cloud testing: the current focus is to puppetize the ATLAS worker node so that it can run on any computing resource. Once this is completed, scale tests on Grizzly will start, both on the AI and on the HLT.
  • AgileInfrastructure
    • looking into the contextualization of virtual machines through Puppet for the CMS and ATLAS use cases; this is a topic very specific to the CERN Cloud infrastructure.
    • HammerCloud tests for CMS are continuing, still very successful.
    • Two issues were found: the possibility to impersonate any user with an account in OpenStack at CERN, a security issue that was reported and then fixed by the AI team in the OpenStack project (ticket AI-2448); and some of the booted instances getting stuck in BUILD (ticket AI-2370; a minimal monitoring sketch follows this list).
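The BUILD-stuck instances mentioned above can be spotted from the client side. Below is a minimal sketch, assuming python-novaclient; the credentials, Keystone endpoint and 30-minute threshold are placeholders, not the actual AI settings.

```python
# Minimal sketch: flag OpenStack instances that have been stuck in BUILD for too
# long (cf. ticket AI-2370). Credentials, endpoint and threshold are placeholders.
from datetime import datetime, timedelta

from novaclient.v1_1 import client

nova = client.Client("myuser", "mypassword", "mytenant",
                     "https://keystone.example.org:5000/v2.0")

threshold = timedelta(minutes=30)
now = datetime.utcnow()

for server in nova.servers.list():
    if server.status != "BUILD":
        continue
    created = datetime.strptime(server.created, "%Y-%m-%dT%H:%M:%SZ")
    if now - created > threshold:
        # Stuck instance: just report it; deleting or rebuilding is an operator decision.
        print("instance %s (%s) in BUILD since %s" % (server.name, server.id, server.created))
```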

15 May 2013

  • HelixNebula testing of the Blue Box: starting to use the EnStratius solution with VMs deployed into the T-Systems cloud infrastructure. The latest status is:
    • the image for the virtual machines is ready for both the ATLAS and the CMS use case
    • the first benchmarking jobs were completed last week (3 VMs)
    • started scaling up the cluster sizes, reaching 40 VMs in total for CMS and ATLAS (20 VMs each).
    • executed about 5500 ATLAS production jobs and about 18000 CMS analysis jobs with 99% job efficiency.
  • The AgileInfrastructure testing status was recently reported for the CMS tests; the main points are:
    • testing with HammerCloud standard jobs, together with dynamic provisioning of resources through the EC2 interface, has started
    • first results look very good: no drop in efficiency compared to other sites, as can be seen in the attached document;
    • for the next tests we will switch to longer jobs and increase the scale
    • the Ibex release of the AgileInfrastructure is proving to be more stable than the previous Hamster release, but we need to keep running the tests and verify this over a longer period;
  • DeltaCloud service is being investigated
    • the main interest comes from the fact that DeltaCloud provides a common interface to various cloud interfaces used as backend; the common interface can be of various types, including EC2;
    • we were able to set up DeltaCloud and connect it to the AgileInfrastructure nova interface
    • trying to connect from DeltaCloud to the EC2 interface; some issues on the AgileInfrastructure side are being tracked here.
  • Contextualization through CloudInit
    • first version with ganglia/condor/cvmfs modules is ready; first tests worked fine
    • need to integrate these in our testing
    • need to disseminate the modules for further usage and to include them in CernVM (a minimal boot-with-user-data sketch follows this list).
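The EC2-based provisioning and the CloudInit contextualization above come together when a VM is booted with a user-data payload. Below is a minimal sketch, assuming boto; the endpoint, image id, key pair and the user-data content are placeholders rather than the real production values.

```python
# Minimal sketch: boot a worker VM through an EC2-style endpoint and hand it a
# cloud-init user-data payload at boot time. Endpoint, image id, key pair and
# payload are placeholders, not the values used in the actual tests.
import boto
from boto.ec2.regioninfo import RegionInfo

USER_DATA = """#cloud-config
# illustrative cloud-init payload; the real contextualization would use the
# ganglia/condor/cvmfs modules mentioned above
runcmd:
 - [ service, condor, start ]
"""

region = RegionInfo(name="cern-ai", endpoint="ec2.example.org")
conn = boto.connect_ec2(aws_access_key_id="ACCESS_KEY",
                        aws_secret_access_key="SECRET_KEY",
                        is_secure=False, port=8773,
                        path="/services/Cloud", region=region)

reservation = conn.run_instances("ami-00000001",
                                 min_count=1, max_count=1,
                                 key_name="mykey",
                                 instance_type="m1.large",
                                 user_data=USER_DATA)
for instance in reservation.instances:
    print("booted %s, state %s" % (instance.id, instance.state))
```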

13 January 2013

  • AgileInfrastructure Cloud testing: common testing between ATLAS and CMS to run experiment jobs on the new CERN infrastructure (which also provides useful system testing from the AI point of view)
    • ATLAS is running a standard production queue with HammerCloud tests and real production (evgen, merge, pile, reco, simul, validation), constantly using 200 VMs (4 cores, 8GB RAM, 80GB disk). In the last month (2.1.2013-1.2.2013) ~15k CPU days of ATLAS jobs were run on the Agile Infrastructure.
    • CMS is running test jobs, both simulation (Pythia simulation tests running ~1h/job) and analysis (~8h test jobs reading /DoubleElectron/Run2012B-PromptReco-v1/AOD from EOS), reaching a peak of 200 VMs (4 cores, 8GB RAM, 80GB disk).
    • Job efficiency for the ATLAS-PanDA use case is high, while CPU efficiency is not, probably because of I/O. This is under investigation with dedicated HammerCloud tests (the two metrics are illustrated in the sketch after this list).
    • CMS job submission is now being tested through GlideinWMS, which provides dynamic provisioning of Virtual Machines in the cloud, fully handling the VM life cycle depending on the job queue. First functional tests are fine, but scaling up is not possible at the moment because of high-load issues.
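Since the items above distinguish job efficiency from CPU efficiency, here is a tiny sketch of how the two metrics relate; the record fields and numbers are illustrative and not the PanDA or HammerCloud schema.

```python
# Minimal sketch: job efficiency vs CPU efficiency from a list of job records.
# The field names and numbers are illustrative only.
jobs = [
    {"status": "finished", "cpu_time": 21500.0, "wall_time": 28600.0},
    {"status": "finished", "cpu_time": 20100.0, "wall_time": 29900.0},
    {"status": "failed",   "cpu_time":   300.0, "wall_time":  3600.0},
]

finished = [j for j in jobs if j["status"] == "finished"]

# Job efficiency: fraction of jobs that ended successfully.
job_eff = float(len(finished)) / len(jobs)

# CPU efficiency: CPU time over wall-clock time for the successful jobs; a low
# value despite a high job efficiency typically points to I/O wait, as noted above.
cpu_eff = sum(j["cpu_time"] for j in finished) / sum(j["wall_time"] for j in finished)

print("job efficiency %.1f%%, CPU efficiency %.1f%%" % (100 * job_eff, 100 * cpu_eff))
```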

17 October 2012

  • ATLAS
    • Helix Nebula Proof of Concepts
      • All proofs of concept finished successfully: in total 40k CPU days were run on the providers. Atos and T-Systems received a report with the results and feedback. The results were also presented at the September GDB meeting and at the HEPiX workshop.
      • Participation in the HelixNebula TechArc meetings: definition of requirements from CERN/ATLAS and alignment with the other institutes.
    • CloudScheduler in Stratuslab@LAL: CloudScheduler is a dynamic provisioning tool written by UVic. Hasier (IT-ES Summer Student) added a plugin so that CloudScheduler can be used with Stratuslab. The Stratuslab test instance was shaky during the summer. UVic will include the plugin in their running CloudScheduler instance and use StratusLab resources. We want to evaluate alternatives to CloudScheduler, since its setup and configuration are not trivial.
    • CERN OpenStack evaluation: getting Katharzyna Kucharczyk (Technical Student) up to speed to evaluate the CERN Agile Infrastructure for ATLAS. The CernVM image has been uploaded to OpenStack and contextualisation is being worked on. We want to first boot machines manually to run PanDA jobs (a manual-boot sketch is given after this list) and in a second stage integrate with the Pilot Factory for VM lifecycle management. Integration with the Pilot Factory is an alternative to CloudScheduler; it is under development at BNL and is supposed to simplify the setup.
    • ATLAS Cloud Computing Coordination: PanDA queues for MC simulation were demonstrated in HelixNebula and are now standard operation at UVic and BNL. US development of instant T3 analysis clusters is pushing for OpenStack as the most promising middleware. No news on storage; waiting for the CERN IT strategy. We put the CERN IT Puppet Configuration group in contact with the US ATLAS people doing similar work; a meeting during ATLAS SW&C promised future collaboration.
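A minimal sketch of the manual-boot step mentioned above, assuming python-novaclient; the image and flavour names, credentials and user-data are placeholders, not the values actually used on the Agile Infrastructure.

```python
# Minimal sketch: boot the uploaded CernVM image by hand and pass it a simple
# contextualization script. All names, credentials and the payload are placeholders.
from novaclient.v1_1 import client

nova = client.Client("myuser", "mypassword", "mytenant",
                     "https://keystone.example.org:5000/v2.0")

image = nova.images.find(name="cernvm-batch-node")   # illustrative image name
flavor = nova.flavors.find(name="m1.large")

userdata = "#!/bin/sh\n# illustrative contextualization hook\nservice condor start\n"

server = nova.servers.create("panda-worker-01", image, flavor,
                             key_name="mykey", userdata=userdata)
print("requested %s, status %s" % (server.name, server.status))
```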

  • CMS
    • Tests on StratusLab@LAL.
      • the latest tests demonstrated the feasibility of running CMS workflows (both analysis and simulation) on a cloud provider
      • developed a python prototype to automatically provision new VMs as worker nodes, depending on the current job queue
      • it was not possible to interface the cloud resources with the CMS pilot system: GlideinWMS uses Condor-G, which requires an EC2 interface that StratusLab does not provide by default
      • HammerCloud was deployed offsite by Ramon and some tests were performed
        • it was not possible to scale up given the unreliability of the cloud provider, which was often in downtime or suffering from load issues
      • the tests were then stopped given the difficulty of having a reliable system.
    • Tests on lxCloud@CERN. (work done with Marek Denis, IT-ES technical student)
      • evaluated the contextualization of the CernVM batch node image as a worker node: a summary table has been produced *here*
        • iterated with the CernVM team to understand how to satisfy our contextualization use case, and also to fix some issues found
        • trying to understand the gap between what we are currently doing and the HEPiX standards for contextualization
      • executed first jobs by hand to validate the VM: both analysis and simulation jobs worked fine
      • started contextualizing and deploying extra services:
        • a squid proxy server to reduce network latencies and outbound connections: the VMs are set to use it for cvmfs and frontier access (a contextualization sketch is given after this list)
        • a ganglia head node to monitor the running VMs
    • Tests on OpenStack@CERN, agile infrastructure. (work done with Marek Denis, IT-ES technical student)
      • started deploying virtual machines a few days ago; currently debugging the first issues
    • Next Steps:
      • the work will mainly focus on the OpenStack-based AgileInfrastructure, also in order to test this new IT service; the main steps will be:
        • contextualize VMs for the execution of CMS jobs
        • interface the CMS pilot system (GlideinWMS) with the EC2 interface provided by OpenStack in order to automatically provision VMs for job execution
        • start running first tests for simulation jobs, followed by analysis jobs
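A minimal sketch of the squid contextualization described above: point the worker VM at a local squid for both cvmfs and frontier. The proxy host is a placeholder; /etc/cvmfs/default.local is the standard cvmfs location, while the frontier side depends on the CMS site-local configuration and is only outlined in a comment.

```python
# Minimal sketch: configure a worker VM to use a local squid proxy for cvmfs and
# frontier access. The squid host is a placeholder.
SQUID = "http://squid.example.org:3128"

# cvmfs reads its proxy list from /etc/cvmfs/default.local.
with open("/etc/cvmfs/default.local", "w") as cfg:
    cfg.write('CVMFS_HTTP_PROXY="%s"\n' % SQUID)

# For frontier, the same squid has to appear as a <proxy url="..."/> entry inside
# the <frontier-connect> block of the CMS site-local-config.xml; that file's exact
# location and layout are site specific, so it is not rewritten here.
print("remember to add %s to the frontier-connect proxy list" % SQUID)
```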

18 July 2012

  • ATLAS
    • Helix Nebula Proof of Concepts
      • ATOS: On standby - all resources dedicated to the other PoCs. We have asked ATOS when the resources will be allocated for ATLAS.
      • CloudSigma: 100 core cluster (20 machines with 5 cores each) running standard ATLAS MC simulation. Good performance in the last 10 days; however, we had to restart the CVMFS clients on all VMs. During the meeting we discussed the possibility of running some HammerCloud analysis tests with big input files that would have to be copied from EOS.
        States: activated:14 running:98 holding:1 finished:4563 failed:439 cancelled:114
      • T-Systems: Ramon is scaling up the cluster. He did not manage to use the Zimory API because of authentication problems (T-Systems has already been notified but there is no solution yet), therefore he has to clone VMs manually through the web interface. Currently the T-Systems cluster consists of 264 cores (33 VMs with 8 cores each) and is running standard ATLAS MC simulation. We have asked Jurry to ramp up to 800 cores and 1.2TB RAM. Rodney has also asked the ATLAS production managers to assign similar tasks as we had in CloudSigma in order to produce comparable results. However, performance in the last 10 days was bad; at first sight many errors seem to be related to the small disk configuration on the VMs. We will redeploy once we can use the API.
         States: running:98 holding:3 transferring:98 finished:4230 failed:3900 cancelled:91
    • CloudScheduler in Stratuslab at LAL: Hasier Rodriguez (CERN IT-ES summer student) is evaluating the possibility of using the UVic CloudScheduler for dynamic provisioning in Stratuslab at LAL. CloudScheduler watches a Condor queue and, according to its depth and the number/status of the VMs, boots up or shuts down VMs (a sketch of this loop is given after this list). Hasier has written the Stratuslab API for CloudScheduler and is now testing it. We are having some condor configuration issues and are following these up with the UVic developers and Rodney Walker. Again, this exercise proves the need for a common interface layer such as DeltaCloud.
    • HDFS in the cloud: Anatoly Kurachenko (ATLAS summer student) is working on setting up an HDFS cluster in Stratuslab at LAL (the place where we can get resources most easily). He has manually installed HDFS on 2 CernVM nodes using the standard disks that come with the VMs and is able to run a Hello World Hadoop exercise. The installation still has to be automated. Now we have to define some benchmark tests to verify the efficiency of this storage (measure dq2-get efficiencies for small, medium and large files). We have to re-discuss how to use the HDFS cluster, but some initial proposals are:
      • Use the cloud site similarly to US T3 data sinks.
      • Modify the PanDA Pilot workflow to skip the stage out of output files from the worker node to the related storage. Instead a dedicated node in the cloud would copy the output of all worker nodes, which is available on HDFS, to the related storage element. Obviously this setup is interesting for CMS since they already use such a mechanism (Asynchronous Stage Out).
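Below is a minimal sketch of the CloudScheduler-style loop described above: inspect the depth of the Condor queue and decide how many VMs to add or retire. The condor_q parsing, the jobs-per-VM ratio and the boot/retire hooks are all illustrative assumptions, not the actual CloudScheduler implementation.

```python
# Minimal sketch: size a VM pool from the Condor queue depth. The condor_q
# parsing, ratios and actions are illustrative, not the real CloudScheduler code.
import math
import re
import subprocess

JOBS_PER_VM = 8      # e.g. one single-core job per core on an 8-core VM (assumption)
current_vms = 10     # illustrative count of VMs already booted


def queue_depth():
    """Return (idle, running) job counts from the condor_q summary line."""
    out = subprocess.check_output(["condor_q"]).decode()
    match = re.search(r"(\d+) idle, (\d+) running", out)
    return (int(match.group(1)), int(match.group(2))) if match else (0, 0)


idle, running = queue_depth()
wanted = int(math.ceil(float(idle + running) / JOBS_PER_VM))

if wanted > current_vms:
    print("boot %d more VMs" % (wanted - current_vms))    # call the cloud API here
elif wanted < current_vms:
    print("retire %d idle VMs" % (current_vms - wanted))  # drain and shut them down here
```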

  • CMS
    • Recently we started working on running CMS jobs in the Cloud. The provider being evaluated is StratusLab.
      • First we prepared a CernVM and contextualized it to get the needed software versions and proper configurations for condor and cvmfs. This step also included the configuration of site-dependent parameters such as the Trivial File Catalog (for the mapping between logical and physical file names) and the list of frontier/squid servers.
      • In the next step we needed to connect the CMS workload management tools to the batch nodes started on StratusLab. CMS uses GlideinWMS, which can talk to an EC2 interface, but StratusLab doesn't provide this interface by default. So, since CRAB can also submit directly to a condor cluster, a condor master was deployed on a dedicated VM inside StratusLab's Cloud and all batch nodes were configured with condor pointing to it. Once the configuration and deployment were automated and validated, CRAB was used to submit jobs to the Condor cluster. The tests were split into two phases, running on the LAL instance of StratusLab:
        • The first phase included the submission of Monte Carlo simulation jobs, which did not need any input data. The functional test was successful and the jobs ran without any problem.
        • The second phase included the submission of real analysis jobs, reading data through xrootd. The worker node site configuration was changed to set the xrootd server to be used (in this context the Bari xrootd redirector was used) and to provide the credentials needed for the xrootd handshake. More than 200 jobs were submitted and ran without any problem (a configuration sketch is given after this list).
      • Next steps of the tests will:
        • evaluate the possibility of having an EC2-like interface provided by StratusLab and having GlideinWMS interact with it;
        • run using HammerCloud in order to automate the submission, increase the scale and retrieve a standard set of metrics that can be used for comparison with other WLCG sites or among different Cloud providers;
        • understand the impact in terms of CPU efficiency and network latency when deploying a squid proxy server inside the Cloud provider and when contacting a remote one;
        • evaluate how to handle the output data management: how to optimize the retrieval of output data (e.g. using a shared file system such as HDFS and retrieving the data asynchronously);
    • Once the work with StratusLab can be considered complete, the next phase will be to replicate the very same tests with lxcloud, which has a proper EC2 interface that can be directly used by GlideinWMS through Condor-G.
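A minimal sketch of the analysis-job side of the xrootd test above: a CMSSW configuration fragment that reads its input through an xrootd redirector instead of local storage. The redirector host and the file path are placeholders, not the actual Bari redirector or the dataset used in the test.

```python
# Minimal sketch: CMSSW configuration reading input over xrootd. The redirector
# and file path are placeholders.
import FWCore.ParameterSet.Config as cms

process = cms.Process("CloudAnalysisTest")

# Read the input file through an xrootd redirector (placeholder URL).
process.source = cms.Source("PoolSource",
    fileNames=cms.untracked.vstring(
        "root://xrootd-redirector.example//store/placeholder/file.root"
    )
)

# Keep the functional test short by limiting the number of events.
process.maxEvents = cms.untracked.PSet(input=cms.untracked.int32(100))
```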

-- MattiaCinquilli - 18-Jul-2012

Topic attachments
  • AI_CommonWF.pdf (PDF, 141.5 K, 2013-01-31, MattiaCinquilli) - Common ATLAS/CMS workflow to test the AgileInfrastructure@CERN
  • AI_tests_HC_CMS.pdf (PDF, 190.0 K, 2013-05-17, MattiaCinquilli) - Report about AI's CMS testing
  • ATLASjobperstate.pdf (PDF, 57.3 K, 2013-01-31, MattiaCinquilli)
  • CMSJobperstate.pdf (PDF, 54.0 K, 2013-01-31, MattiaCinquilli)
  • hc_cms_20130513.txt (text, 2.9 K, 2013-05-17, MattiaCinquilli) - HammerCloud result testing for CMS jobs on AI