WLCG Operations Coordination Minutes - 20 June 2013

Agenda

https://indico.cern.ch/conferenceDisplay.py?confId=255445

Attendance

  • Local: Maria Girone (chair), Andrea Sciabà (secretary/chair), Ian Fisk, Nicolò Magini, Maria Dimou, Felix Lee, Andrew Mc Nab, Raja Nandakumar, Stefan Roiser, Ikuo Ueda, Simone Campana, Anthony Hesnaux, Alexandre Beche, Domenico Giordano, Maite Barroso Lopez, Helge Meinhard, Ulrich Schwickerath, Maarten Litmaath
  • Remote: Alessandra Forti, Christoph Wissing, Alessandro Cavalli, Yury Lazin, Renaud Vernet, Catalin Dumitrescu, Reda Tafirout, Massimo Sgaravatto, Alessandra Doria, Matt Doidge, Peter Solagna, Michel Jouvin, Pepe Flix, Gareth Smith, Javier Sanchez, Jeremy Coles, Daniele Bonacorsi, Stephen Burke, Julia Andreeva, Valery Mitsyn, Isidro Gonzalez Caballero

News

Middleware news and baseline versions

https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions

Highlights:

  • There is a new EMI-3 version of StoRM that fixes most issues and is recommended for deployment. It works well with FTS-3.

Maarten reports that some sites installed the EMI-3 WN and ran into issues with voms-proxy-info, for which a fix will be released soon. He points out that we cannot recommend the EMI-3 WN for production and asks if it should be certified by a dedicated task force or if we can rely on the experiments to do the necessary tests. Alessandra mentions that it is used in production at some sites (DESY for CMS) with no issues reported, but of course the usability of the EMI-3 WN depends on the experiment and the storage system at the site.

Maria G. proposes keeping track of the experience of sites with EMI-3 in the same twiki as for the SL6 migration, which is agreed on. Maarten suggests adding a compatibility matrix, as was done for the gLite-to-EMI migration.

Tier-1 Grid services

Storage deployment

| Site | Status | Recent changes | Planned changes |
| CERN | CASTOR: v2.1.13-9-2 and SRM-2.11 for all instances; EOS: ALICE (EOS 0.2.29 / xrootd 3.2.7), ATLAS (EOS 0.2.28 / xrootd 3.2.7 / BeStMan2-2.2.2), CMS (EOS 0.2.29 / xrootd 3.2.7 / BeStMan2-2.2.2), LHCb (EOS 0.2.29 / xrootd 3.2.7 / BeStMan2-2.2.2) | | |
| ASGC | CASTOR 2.1.13-9; CASTOR SRM 2.11-2; DPM 1.8.6-1; xrootd 3.2.7-1 | None | None |
| BNL | dCache 2.2.10 (Chimera, Postgres 9 w/ hot backup); http (aria2c) and xrootd/Scalla on each pool | None | None |
| CNAF | StoRM 1.8.1 (Atlas, CMS, LHCb) | | Tests ongoing with new release, and bug fixing. |
| FNAL | dCache 1.9.5-23 (PNFS, postgres 8 with backup, distributed SRM) httpd=2.2.3; Scalla xrootd 2.9.7/3.2.7-2.osg; Oracle Lustre 1.8.6; EOS 0.2.30-1/xrootd 3.2.7-2.osg with Bestman 2.2.2.0.10 | | |
| IN2P3 | dCache 2.2.12-1 (Chimera) on SL6 core servers and 2.2.13-1 on pool nodes; Postgres 9.1; xrootd 3.0.4 | Upgrade to 2.2.13-1 (3rd golden release) | |
| KIT | dCache: atlassrm-fzk.gridka.de 1.9.12-11 (Chimera), cmssrm-fzk.gridka.de 1.9.12-17 (Chimera), lhcbsrm-kit.gridka.de 1.9.12-24 (Chimera); xrootd (version 20100510-1509_dbg and 3.2.6) | | Upgrading all dCache instances and the PostgreSQL databases around the GridKa "firewall downtime" in CW 30 (22nd-26th July) |
| NDGF | dCache 2.3 (Chimera) on core servers. Mix of 2.3 and 2.2 versions on pool nodes. | | |
| NL-T1 | dCache 2.2.7 (Chimera) (SURFsara), DPM 1.8.6 (NIKHEF) | | |
| PIC | dCache head nodes (Chimera) and doors at 1.9.12-23 | head nodes to 1.9.12-23 | Next upgrade to 2.2 in September |
| RAL | CASTOR 2.1.12-10; 2.1.13-9 (tape servers); SRM 2.11-1 | | Upgrading all instances to CASTOR 2.1.13-9 by end of July. |
| TRIUMF | dCache 2.2.13 (Chimera), pool/door 2.2.10 | | |

FTS deployment

| Site | Version | Recent changes | Planned changes |
| CERN | 2.2.8 - transfer-fts-3.7.12-1 | | |
| ASGC | 2.2.8 - transfer-fts-3.7.12-1 | | |
| BNL | 2.2.8 - transfer-fts-3.7.10-1 | None | None |
| CNAF | 2.2.8 - transfer-fts-3.7.12-1 | | |
| FNAL | 2.2.8 - transfer-fts-3.7.12-1 | | |
| IN2P3 | 2.2.8 - transfer-fts-3.7.12-1 | | |
| KIT | 2.2.8 - transfer-fts-3.7.12-1 | | |
| NDGF | 2.2.8 - transfer-fts-3.7.12-1 | | |
| NL-T1 | 2.2.8 - transfer-fts-3.7.12-1 | | |
| PIC | 2.2.8 - transfer-fts-3.7.12-1 | | |
| RAL | 2.2.8 - transfer-fts-3.7.12-1 | | |
| TRIUMF | 2.2.8 - transfer-fts-3.7.12-1 | | |
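
As a small illustration of how the table above can be checked for stragglers, the following sketch (not part of the meeting material) finds the sites whose transfer-fts build differs from the most common one. The dictionary simply transcribes the Version column by hand; the function name `find_outliers` is made up for this example.

```python
from collections import Counter

# transfer-fts build per site, transcribed from the FTS deployment table.
FTS_VERSIONS = {
    "CERN": "3.7.12-1", "ASGC": "3.7.12-1", "BNL": "3.7.10-1",
    "CNAF": "3.7.12-1", "FNAL": "3.7.12-1", "IN2P3": "3.7.12-1",
    "KIT": "3.7.12-1", "NDGF": "3.7.12-1", "NL-T1": "3.7.12-1",
    "PIC": "3.7.12-1", "RAL": "3.7.12-1", "TRIUMF": "3.7.12-1",
}


def find_outliers(versions):
    """Return (most common version, {site: version} for sites that differ)."""
    most_common, _count = Counter(versions.values()).most_common(1)[0]
    outliers = {site: v for site, v in versions.items() if v != most_common}
    return most_common, outliers


if __name__ == "__main__":
    baseline, outliers = find_outliers(FTS_VERSIONS)
    print("most common build:", baseline)
    for site, version in sorted(outliers.items()):
        print(site, "runs", version)
```

With the data above, BNL is the only site reported as differing from the common build.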

LFC deployment

| Site | Version | OS, distribution | Backend | WLCG VOs | Upgrade plans |
| BNL | 1.8.0-1 for T1 and 1.8.3.1-1 for US T2s | SL5, gLite | ORACLE 11gR2 | ATLAS | None |
| CERN | 1.8.6-1 | SLC6, EMI2 | Oracle 11 | ATLAS, LHCb, OPS, ATLAS Xroot federations | |

Other site news

Data management provider news

Experiment operations review and plans

ALICE

  • Russian network: on June 4 the GEANT link to Moscow was cut to 100 Mbit/s (GGUS:94540) and since then there have been many job failures at most of the Russian sites due to timeouts. To alleviate the congestion and avoid job loss, the most affected ALICE sites have been closed for job processing since Sunday evening, June 9:
    • IHEP
    • JINR
    • MEPHI
    • PNPI
  • KISTI: splendid participation in raw data production cycles for p-Pb periods!
  • KIT: the site firewall was overloaded again on Tue June 11 due to jobs switching to remote storage elements after suffering errors with the local SE. KIT Xrootd and network experts investigated, and at least one configuration error on the SE was discovered and fixed, though occasional SE test failures are still observed, which then lead to a temporary demotion of the local SE in favor of remote SEs. Whenever the firewall traffic went up significantly, the concurrent jobs cap was temporarily lowered. At the moment it is again a lot lower than the nominal value, viz. 4k vs. 6k+. To be continued.
  • CVMFS deployment is about to get started: thanks Stefan!

Maria G. suggests adding KISTI to the list of Tier-1 sites in the Tier-1 Grid services report.

ATLAS

WebDAV access
ATLAS has sent a mail asking for input from the other experiments so that we can provide consistent requirements to sites. (WLCGOpsMinutes130530)
  • ATLAS plan/request in WLCG Operations planning March 2013
  • The ATLAS view is that authenticated WebDAV access is needed for Storage Management operations (file renaming for the Rucio migration). ATLAS has no special suggestions for sites on how to provide it (e.g. no special ports are required; the information just needs to be provided to ATLAS).
    • If there are no contradicting requirements from other experiments, this will be the message from ATLAS to the sites.

Stefan says that LHCb is interested in WebDAV. Ian says that CMS will continue looking at WebDAV. Maarten says that it's also interesting for ALICE, but more for EOS at CERN than for external sites.

There is a discussion on the port to be used for WebDAV; most sites would use port 80 (443 for authenticated access) but some prefer using different ports to avoid being targeted by web robots. Nicolò will contact the developers of the storage system implementations to understand if they have any recommendations.
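
To make the port discussion concrete, here is a minimal sketch of how a client could build a WebDAV endpoint URL from the conventional defaults (80 for plain HTTP, 443 for HTTPS) while honouring a site-specific override. The host name, the paths and the function name `webdav_url` are hypothetical, and nothing here reflects an agreed WLCG convention.

```python
# Conventional default ports; a site may publish a different one.
DEFAULT_PORTS = {"http": 80, "https": 443}


def webdav_url(host, path, authenticated=True, port=None):
    """Build a WebDAV URL, omitting the port when it is the scheme default."""
    scheme = "https" if authenticated else "http"
    if port is None or port == DEFAULT_PORTS[scheme]:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    if not path.startswith("/"):
        path = "/" + path
    return f"{scheme}://{netloc}{path}"


if __name__ == "__main__":
    # Hypothetical storage element using the default authenticated port.
    print(webdav_url("se.example.org", "/atlas/datafile"))
    # Hypothetical site that moved unauthenticated access to port 8080.
    print(webdav_url("se.example.org", "/atlas/datafile",
                     authenticated=False, port=8080))
```

Keeping the port as an optional per-site value is the point of the sketch: the experiments only need the information to be published, whichever port a site chooses.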

CMS

  • CMS also suffering from the network issues all over Russia
    • Transfer link commissioning of upcoming Tier-1 at JINR postponed
    • Usage of the Russian Tier-2 reduced to a minimum

  • Processing Status and Plans
    • Presently some MC production ongoing
    • Preparation of reprocessing of 2011 still ongoing

  • HLT Cloud
    • Stable running with ~6000 jobs
    • Resource expected to be ready for production
    • 1st use case: 2011 reprocessing of data

  • SITECONF hosting moving from CVS to GIT
    • Some site specific configurations managed by site admins
    • Move from CVS to GIT (at CERN) in preparation
    • Detailed discussion and instructions for sites via CMS HyperNews

  • DPM Configuration for CMSSW
    • CMS strongly recommends using the xroot protocol for local access
    • xroot works without any 'hacks' with CMSSW
    • rfio needs hacks with LD_PRELOAD (SL6 hack different from SL5)

  • CVMFS migration statistics
    • Missing 2 Tier-1 sites
      • T0_CH_CERN: Client prepared, just config switch needed
      • T1_US_FNAL: Likely to move in Fall with SL6 migration
    • 19 Tier-2 (about a third)
      • Some sites are partly switched
      • Often done/planned together with SL6 migration
    • Tier-3 sites will be addressed in the next weeks

  • SHA-2 Migration contacts for CMS
    • Mine Altunay (FNAL)
    • Diego Da Silva Gomes (FNAL, based at CERN)

  • "Out of Savannah"
    • Christoph Wissing got arrested

Domenico reports that many DPM sites that enabled xrootd as data access protocol are not producing any useful monitoring information. He will circulate the deployment instructions, to be followed by at least the main DPM sites.

LHCb

  • Thanks to Maria Alandes for her work: she managed to get the CPUTIME limit set correctly at all the sites used by LHCb (no more 999999).
  • Staging at GRIDKA has improved but is still not ideal.
  • Waiting for the python 2.7 bindings for SLC5 and SLC6 for lfc, dpm, gfal, ... (GGUS:94546)
  • The number of web servers serving our software for installation (DIRAC/LHCbDirac) on the Grid has increased by 2 new instances. In the future this will be done using CVMFS, as for the other LHCb applications.
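
The 999999 mentioned in the first item is treated here as a placeholder published in place of a real CPU time limit. A minimal sketch of the kind of check involved, assuming the published limits have already been collected into a dictionary (the site names and values below are invented, and how the values are read from the information system is deliberately left out):

```python
PLACEHOLDER = 999999  # value cited in the minutes as the bogus CPUTIME limit


def sites_with_placeholder(limits):
    """Return the sorted names of sites whose limit is missing or a placeholder."""
    return sorted(
        site for site, limit in limits.items()
        if limit is None or limit == PLACEHOLDER
    )


if __name__ == "__main__":
    # Invented example data, one published limit per (hypothetical) site.
    published = {"SITE-A": 2880, "SITE-B": 999999, "SITE-C": None, "SITE-D": 4320}
    for site in sites_with_placeholder(published):
        print(site, "does not publish a usable CPUTIME limit")
```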

Machine/job features discussion

For details see the slides.

Stefan proposes a plan for the implementation and deployment of the Machine/JobFeatures system, already discussed in the previous meeting and at the latest GDB. He proposes a task force joining efforts from the experiments (testing of implementations), sites (implementation) and the CERN IT SDC-OL section (coordination, documentation, etc.). He recalls that all the LHC VOs have expressed interest in this functionality. Moreover, it should not interfere in any way with normal site operations.

Reda asks how certain it is that experiments would actually use it. Simone comments that, like for the Information System, it critically depends on how reliable the information is and the plan should also include validation of the information.

Maria G. invites Stefan to contact the interested parties offline and have a mandate and list of members ready for the planning meeting on July 4th. Michel urges people to read the summary of the latest GDB, where the most important use cases were discussed extensively.

Task Force reports

SL6

  • Sites status
    • Contacted a number of sites again to ask for progress; several updated the table yesterday. In particular, the USATLAS and Italy sections have been updated for most if not all sites.
    • Added stats section to the deployment page
  • OS/MW combinations
    • SL6/EMI-2 is working at several sites
    • The SL6/EMI-2 tarball is in test at 1 UK site and has gone live at another for ATLAS production and LHCb.
    • SL6/EMI-3 is in test at 4 UK sites; so far only one problem has been detected (see below). ATLAS, CMS and LHCb are all supported by these sites. DESY-HH supports ATLAS and CMS, and is already in production for CMS. CMS also did some successful testing at CNAF-T1 and will investigate how many of their sites installed SL6/EMI-3 without problems.
  • Problems and solutions
    • Mounting CVMFS in rw is no longer necessary. RH has released a new kernel that restores the open() function to its previous behaviour (INC299258). kernel-2.6.32-358.11.1.el6 is already in the SL production repositories and has already been installed at several UK sites.
    • EMI-3 voms-proxy-info stopped using the X509_USER_PROXY env var, and the output format of the timeleft field was changed (GGUS:94878). This affected ATLAS production and other things like the WLCG VOBOX. The client has already been restored to its previous behaviour. The new version is in the testing repository and should be released at the end of the month. It has been tested at 3 of the SL6/EMI-3 UK sites, and ATLAS releases could be validated with it. All 3 sites are also passing ATLAS HC tests for production and LHCb SAM tests. Although the problem was solved quickly, it could have been avoided altogether had these unnecessary changes not been made.
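
For context, the behaviour that clients traditionally rely on is that the proxy file is taken from X509_USER_PROXY when that variable is set, and from the per-user default /tmp/x509up_u<uid> otherwise. The sketch below illustrates that lookup order only; it is not the voms-proxy-info implementation, and the function name `resolve_proxy_path` is an assumption made for the example.

```python
import os


def resolve_proxy_path(environ=None, uid=None):
    """Return the proxy file a client would be expected to use.

    Lookup order: X509_USER_PROXY if set and non-empty, otherwise the
    conventional per-user default /tmp/x509up_u<uid>.
    """
    environ = os.environ if environ is None else environ
    explicit = environ.get("X509_USER_PROXY")
    if explicit:
        return explicit
    if uid is None:
        uid = os.getuid()  # POSIX only
    return f"/tmp/x509up_u{uid}"


if __name__ == "__main__":
    # The pilot or job wrapper exports the variable: it must win.
    print(resolve_proxy_path({"X509_USER_PROXY": "/var/spool/job/proxy.pem"}, uid=1000))
    # No variable set: fall back to the per-user default location.
    print(resolve_proxy_path({}, uid=1000))
```

A tool that skips the first step silently picks up the wrong credentials (or none) in batch environments where the proxy lives outside /tmp, which is consistent with the impact on production jobs described above.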

SHA-2 migration

  • The SHA-2 CERN CA will be part of next week's IGTF 1.54 release
    • When it has become available, the temporary CA rpms will be withdrawn and the SHA-2 readiness testing instructions updated accordingly
  • ATLAS have made very good progress with testing their central services and so far everything worked: thanks Simone! Details:
    • AGIS (django based) has been tested and validated. Reading and writing were tested via CLI, REST and Web UI
    • PanDA (apache + mod_gridsite) has been tested and validated using Tadashi's curl function made available for the purpose
    • DDM (apache + mod_gridsite) has been tested for a subset of use cases. The full test suite will be run. No issues observed so far.
    • Pandamon, Prodsys and AMI are pending validation.
    • My own (Simone) conclusion
      • If you use apache + the most recent version of mod_gridsite you are OK. But please try it yourself.
      • Maarten's instructions on the Twiki are complete and can be used out of the box, together with the CA and VOMS setups

FTS-3

  • Task force meeting on Jun 19th
    • https://svnweb.cern.ch/trac/fts3/wiki/Minutes_19_09_2013
    • New features demonstrated (overwrite, retry error logs, history table querying, monitoring)
    • Discussion on distributed deployment model: message bus implementation preferred to share config and state between FTS3 servers
    • Discussion on new features requested by ATLAS: bulk deletion, multi-hop, pre-staging (for processing)
  • Migration plan
    • https://svnweb.cern.ch/trac/fts3/wiki/Migration
    • Migrate fraction of the activity (in "FTS2-like mode") as soon as RAL and CERN prod servers are ready
    • After ~2 weeks, migrate all UK and CERN activity
    • After ~1 month, start to include other FTS3 servers and regions.

gLExec

  • A gLExec deployment plan was presented in the Management Board meeting of June 18
    • After a short discussion the plan was given the go-ahead
    • Who will help take care of the various regions?
  • A page to track the deployment status is now available
    • ALICE gLExec tests have started and links to the results will be added shortly
      • Most of the sites currently fail the tests for various reasons
    • The current MyWLCG links will shortly be replaced with more reliable links that also will lead directly to the results for the given site
      • The time stamps will subsequently be (manually) adjusted

Maarten stresses the fact that the plan foresees making the gLExec SAM tests critical starting from October; experience shows that setting up gLExec and passing the tests is not difficult, it is well documented and it does not risk affecting any production workflow (while it is only being tested). Concerning the tarball distribution, it is used by a small minority of sites, and in case of problems, the overall impact is moderate and they can be dealt with as needed.

Reda says that it will be difficult for him to "sell" gLExec to the Canadian sites if ATLAS is not committed to using it. Maarten replies that the decision to proceed with the deployment of gLExec was taken in the WLCG Management Board and agreed by the ATLAS computing coordinator. Moreover, progress is being made in supporting gLExec in ATLAS. According to Alessandra many sites are not happy about installing gLExec but they will do it in order to satisfy the MB requirements and pass the critical tests. Ueda confirms that ATLAS agrees with the MB decision. Helge, Maarten and Michel close the discussion, concluding that the MB decision must be implemented, that enabling gLExec is proven to be easy enough and that sites not meeting the target date for making the SAM tests critical would risk being automatically marked as unavailable.

perfSONAR

Anthony presents his work on a Google Maps-based monitoring page for the various pS instances in WLCG.

  • perfSONAR v3.3 was released last week. We will have a TF meeting tomorrow to discuss how to proceed and we will contact the sites quickly. We will report on the progress at the next meeting.

XRootD deployment

  • RPMs for third party plugins available in WLCG repo http://linuxsoft.cern.ch/wlcg/
    • VOMS-XRootD plugin (from G. Ganis)
    • gled-xrdmon - the Collector of UDP detailed monitoring packets (from M. Tadel)

  • Outcomes of ATLAS Software and Computing week
    • Some infrastructural issues have been observed in the deployment of FAX in the EU. The same issues have not been spotted in the US. People are now actively working on identifying and curing those issues.
    • ATLAS would like to proceed with cautious adoption of FAX for some production use cases in a controlled environment:
      • Fallback from production and analysis jobs in case of data access failure. This has been activated at US and some UK sites.
      • Panda brokering will be modified to relax the job-go-to-data constraint and allow some jobs to read data remotely through FAX. Work in progress. Some more discussion needed.
    • Comparison tests have been run, looking at performance in event processing between the current Panda file access method and FAX. For DPM sites, using xrootd direct I/O rather than the current method (lcg-cp download) showed a performance improvement of up to 3x in event rate. ATLAS is to discuss a change of the default access method with DPM sites.

Tracking Tools Evolution

  • On the savannah-to-jira migration front we are working in close collaboration with savannah expert Victor Diez to overcome some issues listed in Savannah:134651#comment40. As soon as we have answers and general use recipes we'll document them on our TrackingToolsEvolution twiki.
  • There is consensus on the proposal, presented by Marian B. at the last 'real' meeting, to merge the SAM/Nagios GGUS Support Unit (SU) with the LHC Experiment Dashboard one. The new proposed name for the SU is Grid Monitoring as it covers all Grid projects. Savannah:138241 was created after the meeting to proceed with this development.

Maria D. indicates July 10 as a likely (but not certain) date for the deployment of the new SU, which will need to be properly documented via a FAQ list.

AOB

Action list

  1. Add perfSONAR to the baseline versions list as soon as the latest Release Candidate becomes production-ready.
    • done
  2. Build a list by experiment of the Tier-2's that need to upgrade dCache to 2.2
    • done for ATLAS: list updated on 17-05-2013
    • done for CMS: list updated on 20-06-2013
    • not applicable to LHCb, nor to ALICE
    • Maarten: EGI and OSG will track sites that need to upgrade away from unsupported releases.
  3. Inform sites that they need to install the latest Frontier/squid RPM by May at the latest (done for CMS and ATLAS, status monitored)
  4. Maarten will look into SHA-2 testing by the experiments when experts can obtain VOMS proxies for their VO.
  5. Tracking tools TF members who own savannah projects to list them and submit them to the savannah and jira developers if they wish to migrate them to jira. AndreaV and MariaD to report on their experience from the migration of their own savannah trackers.
  6. CMS to appoint a contact point for Savannah:131565 and dependent dev. items for the savannah-ggus bridge replacement by a GGUS-only solution tailored to CMS ops needs.
    • done
  7. Investigate how to separate Disk and Tape services in GOCDB
  8. Agree with IT-PES on a clear timeline to migrate OPS and the LHC VOs from VOMRS to VOMS-Admin
  9. For the experiments to give feedback on the machine/job information specifications
  10. Experiments interested in using WebDAV should contact ATLAS to organise a common discussion
  11. Create a twiki for the FTS-3 task force
    • done
  12. Give feedback to the IT-SDC proposal to unify WLCG monitoring support in GGUS
    • done
  13. Add KISTI to the list of Tier-1 sites in the Grid Services report
  14. Contact the storage system developers to find out which are the default/recommended ports for WebDAV
  15. Circulate the instructions to enable the xrootd monitoring on DPM

-- AndreaSciaba - 17-Jun-2013

Topic attachments
| Attachment | History | Size | Date | Who | Comment |
| dcachesites_cms_20130620.txt | r1 | 1.8 K | 2013-06-20 - 17:18 | NicoloMagini | CMS dcache T2s - updated 20130620 |
Topic revision: r35 - 2013-06-25 - AndreaSciaba
 