WLCG Operations Coordination Minutes - 16 May 2013

Agenda

Attendance

  • Local: Andrea Sciabà (secretary), Andrea Valassi, Maarten Litmaath, Maria Dimou, Oliver Keeble, Stefan Roiser, Felix Lee, Maite Barroso Lopez
  • Remote: Pepe Flix (chair), Sang Un Ahn, Massimo Sgaravatto, Valery Mitsyn, Gareth Smith, Alessandra Forti, Christoph Wissing, Alessandra Doria, Ilya Lyalin, Tiziana Ferrari, Alessandro Cavalli, Thomas Hartmann, Andreas Petzold, Ron Trompert, Peter Solagna, Di Qing, Luca Dell'Agnello, Burt Holzman, Luis Emiro Linares Garcia, Dave Dykstra

News

  • Thanks to the new regional contacts joining the WLCG OpsCoordination efforts! They are: Thomas Hartmann (T1-Germany), Alessandra Doria (T2-Italy), Massimo Sgaravatto (T2-Italy), Jan Erik Sundermann (T2-Germany). Concerning T2s, so far we have contacts in Germany, France, Italy, UK... Other regions are more than welcome! Please tell your colleagues and ask them to join.

Middleware news and baseline versions (N. Magini)

https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions

Highlights:

  • a new version of lcg-utils, 1.15.0, fixes a timeout bug affecting in particular ATLAS.

Tier-1 Grid services (N. Magini)

Storage deployment

| Site | Status | Recent changes | Planned changes |
| CERN | CASTOR: v2.1.13-9 and SRM-2.11 for all instances; EOS: ALICE (EOS 0.2.29 / xrootd 3.2.7), ATLAS (EOS 0.2.28 / xrootd 3.2.7 / BeStMan2-2.2.2), CMS (EOS 0.2.29 / xrootd 3.2.7 / BeStMan2-2.2.2), LHCb (EOS 0.2.29 / xrootd 3.2.7 / BeStMan2-2.2.2) | None | None |
| ASGC | CASTOR 2.1.13-9; CASTOR SRM 2.11-2; DPM 1.8.6-1; xrootd 3.2.7-1 | None | None |
| BNL | dCache 1.9.12.10 (Chimera, Postgres 9 w/ hot backup); http (aria2c) and xrootd/Scalla on each pool | | Upgrade to dCache 2.2-10 May 28, 2013 |
| CNAF | StoRM 1.8.1 (ATLAS, CMS, LHCb) | | Upgrade to StoRM 1.12.0 planned for late May |
| FNAL | dCache 1.9.5-23 (PNFS, postgres 8 with backup, distributed SRM) httpd=2.2.3; Scalla xrootd 2.9.7/3.2.7-2.osg; Oracle Lustre 1.8.6; EOS 0.2.30-1/xrootd 3.2.7-2.osg with Bestman 2.2.2.0.10 | | Upgrade to dCache 2 + Chimera this summer |
| IN2P3 | dCache 1.9.12-16 (Chimera) on SL6 core servers and 1.9.12-25 on pool nodes; Postgres 9.1; xrootd 3.0.4 | None | Upgrade to 2.2-10+ in June 2013 (3rd golden release) |
| KIT | dCache: atlassrm-fzk.gridka.de 1.9.12-11 (Chimera), cmssrm-fzk.gridka.de 1.9.12-17 (Chimera), lhcbsrm-kit.gridka.de 1.9.12-24 (Chimera); xrootd (version 20100510-1509_dbg and 3.2.6) | None | None |
| NDGF | dCache 2.3 (Chimera) on core servers. Mix of 2.3 and 2.2 versions on pool nodes. | | |
| NL-T1 | dCache 2.2.7 (Chimera) (SURFsara), DPM 1.8.6 (NIKHEF) | | |
| PIC | dCache 1.9.12-20 (Chimera) - doors at 1.9.12-23 | None | Head nodes to be migrated to 1.9.12-23 by mid-June |
| RAL | CASTOR 2.1.12-10; 2.1.13-9 (tape servers); SRM 2.11-1 | None | Upgrading all instances to CASTOR 2.1.13-9 by end of June. |
| TRIUMF | dCache 2.2.10 (Chimera) | None | None |

FTS deployment

| Site | Version | Recent changes | Planned changes |
| CERN | 2.2.8 - transfer-fts-3.7.12-1 | | |
| ASGC | 2.2.8 - transfer-fts-3.7.12-1 | | |
| BNL | 2.2.8 - transfer-fts-3.7.10-1 | None | None |
| CNAF | 2.2.8 - transfer-fts-3.7.12-1 | | |
| FNAL | 2.2.8 - transfer-fts-3.7.12-1 | | |
| IN2P3 | 2.2.8 - transfer-fts-3.7.12-1 | | |
| KIT | 2.2.8 - transfer-fts-3.7.12-1 | | |
| NDGF | 2.2.8 - transfer-fts-3.7.12-1 | | |
| NL-T1 | 2.2.8 - transfer-fts-3.7.12-1 | | |
| PIC | 2.2.8 - transfer-fts-3.7.12-1 | None | None |
| RAL | 2.2.8 - transfer-fts-3.7.12-1 | | |
| TRIUMF | 2.2.8 - transfer-fts-3.7.12-1 | | |

LFC deployment

| Site | Version | OS, distribution | Backend | WLCG VOs | Upgrade plans |
| BNL | 1.8.0-1 for T1 and 1.8.3.1-1 for US T2s | SL5, gLite | Oracle | ATLAS | Oracle upgrade to 11 May 28, 2013 |
| CERN | 1.8.6-1 | SLC6, EMI2 | Oracle 11 | ATLAS, LHCb, OPS, ATLAS Xroot federations | |

Other site news

Data management provider news

Experiment operations review and plans

ALICE (M. Litmaath)

  • CERN: submission to SLC6 queue was stopped Tue May 7 evening after job failure reports, possibly due to presence of incompatible Xrootd client code in standard directories; other SL6 sites do not show such problems and probably do not have Xrootd clients installed; still to be debugged...
  • various T1 and T2 sites have upgraded their gLite VOBOX to a WLCG VOBOX
    • reinstallation from scratch is not required for ALICE
  • CNAF: switched back to Torrent Fri May 10 late afternoon after upgrade to an SL6 VOBOX
    • high job throughput and efficiency in the subsequent days, thanks!
    • a spare VOBOX has been reinstalled with SL5, again allowing use of the shared SW area when needed (the WN are still on SL5)
  • KISTI: many jobs quickly failed due to the absence of SL6 versions of certain analysis packages on the ALICE SW distribution servers
    • the missing packages will be added and job brokering improvements will be looked into

Pepe asks if the problem seen at KISTI might exist elsewhere: Maarten answers that it is possible. So far it was spotted only at KISTI, which, being a large site, can see such issues exposed very clearly, while on smaller sites they might not stand out. The affected packages belong to older software releases that are less frequently used.

ATLAS

  • NTR: apologies from ATLAS, no one could attend today's meeting.

CMS (C. Wissing)

  • SL6 Migration
    • CMS-only Tier-2 sites are encouraged to migrate WNs as soon as possible

  • gLExec
    • Used by CMS where it is working
    • Plan to require it by the end of the year
    • Sites are encouraged to enable it

  • Squid WLCG Monitoring
    • 67 sites in new monitoring
    • 40 still pending - the CMS sites contacted have each received a Savannah ticket
    • Being followed in regular operations meetings

  • perfSonar
    • Not much progress
    • Waiting for new release with support for mesh configuration

  • Global DBparam reset for PhEDEx
    • Campaign started
    • All sites contacted

  • Standing items
    • Continuing commissioning of Russian Tier1
    • Interest in common condor_g submission probe for SAM

Pepe asks if the sites in the WLCG squid monitoring list are only Tier-2's. Christoph answers the list includes a mixture of Tier-2's and Tier-3's but he needs to check if it contains also Tier-1's. [AndreaS after the meeting: yes, T1's are in the list and some still need to upgrade]

LHCb (S. Roiser)

  • Operations
    • The ongoing incremental stripping campaign is progressing well; the work is being carried out at Tier-1 sites. As expected, the limiting factor is the tape systems at the sites; the applications to execute are rather short (~ 6 hours / file). In total 1/3 of the data has been processed in 20 days, which is slightly ahead of schedule.
    • Request to install the HepOSlibs meta-rpm to satisfy VO binary runtime dependencies on the OS layer. Especially recommended in the course of the currently ongoing SL6 upgrade campaign. EGI broadcast https://operations-portal.egi.eu/broadcast/archive/id/934
  • Issues
    • lxplus6 nodes found with cvmfs not working correctly (INC:295744) [Maite reports that the problem has been already solved]
    • problem with certificates in the grid cvmfs repository: voms-proxy-init is not working because of an expired CRL (GGUS:93975). The AFS version is working and the procedure to update them has been given; it is currently being put in place.
    • LHCb relies on information from GOCDB for handling downtimes of sites automatically. In a recent case a site was upgrading only its tape system but had to declare a downtime of the SRM service. As a consequence all file access was banned, although disk-only access was working. Is it possible to have a more fine-grained distinction (tape/disk) for services in GOCDB? (GGUS:93966)

Concerning the last point, Maarten explains that the usual practice in case of unavailability of just the tape backend is to declare the downtime as "at risk", but it is not satisfactory. AlessandraF adds that registering a new service type for tape is also the solution recommended by the GOCDB developers. It is agreed to follow up on the LHCb request, deemed reasonable by everybody.
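
A minimal sketch of the tape/disk distinction requested above, assuming a separate tape service type were available in the downtime records. The record layout, the service type name `SRM.tape` and the mapping table are illustrative assumptions; they are not the actual GOCDB schema nor the LHCb implementation.

```python
# Hypothetical downtime records: field names and service types are assumptions.
DISK = "disk"
TAPE = "tape"

# Assumed mapping from a declared service type to the storage classes it affects.
AFFECTED = {
    "SRM": {DISK, TAPE},   # a single service type covers disk and tape
    "SRM.tape": {TAPE},    # hypothetical tape-only service type
}


def banned_storage(downtimes):
    """Return the storage classes to ban, given the active downtimes of a site."""
    banned = set()
    for dt in downtimes:
        if dt["severity"] != "OUTAGE":
            continue  # "at risk" style declarations do not ban anything
        banned |= AFFECTED.get(dt["service_type"], set())
    return banned


# A tape-only outage would leave disk access untouched.
print(banned_storage([{"service_type": "SRM.tape", "severity": "OUTAGE"}]))
```

With only the generic `SRM` type, the same outage bans both storage classes, which is the behaviour LHCb reported.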

Thomas reports that LHCb is not using the tape system at KIT efficiently because too few staging requests are submitted in each bunch, preventing the tape system from optimising the staging requests. Stefan agreed to discuss with the LHCb production team if it is possible to increase the number of requests per bunch.
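
As an illustration of the point about bunch sizes, the sketch below groups files to be staged into bunches of a configurable size. The function name, file names and bunch sizes are hypothetical; the minutes do not state the sizes LHCb uses or those KIT would prefer.

```python
def bunch_requests(files, bunch_size):
    """Split a list of files to stage into bunches of at most bunch_size files."""
    if bunch_size < 1:
        raise ValueError("bunch_size must be positive")
    return [files[i:i + bunch_size] for i in range(0, len(files), bunch_size)]


# Hypothetical example: the same 10 files submitted as many small bunches
# or as fewer, larger ones that give the tape system more room to reorder.
files = [f"file{i}" for i in range(10)]
small = bunch_requests(files, 2)
large = bunch_requests(files, 5)
print(len(small), len(large))
```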

Task Force reports

SL6 (A. Forti)

  • CERN added 2000 slots to the initial 5000 on LXBATCH.
  • There is also an issue on lxplus.cern.ch, reported on the IT Status Board, causing intermittent login failures; it is still under investigation.
  • Since last update, only one Tier-2 has migrated to SL6.

Pepe adds that LHCb jobs are running fine on the new SL6 resources. In the ensuing discussion, these points are clarified:

  • not all Tier-1's already provide SL6 resources: only three offer SL6 capacity; soon they will be joined by BNL, NIKHEF and IN2P3. The rest are still in the testing phase.
  • LHCb encourages all its Tier-1's to provide at least some SL6 resources as soon as possible.
  • only NL-T1 said that they will migrate in "big bang mode". IN2P3 and RAL will migrate smoothly. PIC will start with 10-15% of the resources and after a few weeks proceed with the rest. The other sites did not say.
  • migration needs to be carefully coordinated to avoid multiple Tier-1's migrating (and being unavailable) at the same time: use the SL6 TF twiki to announce plans.
  • in the case of PIC, one full day of downtime is to be expected, preceded by 1.5 days of queue draining.
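
The coordination point above amounts to checking that the unavailability windows of different Tier-1's do not overlap. A minimal sketch follows, assuming each site announces a downtime start, a draining period and a downtime length. Only the PIC durations (1.5 days of draining, 1 day of downtime) come from the discussion; all dates and the other site names are hypothetical.

```python
from datetime import datetime, timedelta
from itertools import combinations


def unavailable_window(downtime_start, drain_days, downtime_days):
    """Window with reduced or no capacity: queue draining plus the downtime itself."""
    start = downtime_start - timedelta(days=drain_days)
    end = downtime_start + timedelta(days=downtime_days)
    return start, end


def overlapping_sites(windows):
    """Return the pairs of site names whose windows overlap."""
    clashes = []
    for (a, (a_start, a_end)), (b, (b_start, b_end)) in combinations(sorted(windows.items()), 2):
        if a_start < b_end and b_start < a_end:
            clashes.append((a, b))
    return clashes


# Hypothetical dates, for illustration only.
windows = {
    "PIC": unavailable_window(datetime(2013, 6, 10), drain_days=1.5, downtime_days=1),
    "SiteB": unavailable_window(datetime(2013, 6, 10, 12), drain_days=0, downtime_days=1),
    "SiteC": unavailable_window(datetime(2013, 6, 20), drain_days=0, downtime_days=1),
}
print(overlapping_sites(windows))
```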

CVMFS (S. Roiser)

  • New CVMFS version 2.1.10 is finishing testing now and will be put in production on Thursday next week (23 May). This version is good for sites with and without NFS support. [this version fixes some important bugs and it is considered mandatory for all sites]
  • Infrastructure changes
    • Enhanced release procedure put in place: in addition to code checking, local functional testing and stress testing, a HammerCloud-based stress test is being set up on a multi-core, worker-node-like machine. Afterwards a "burn in" test on candidate sites is done for at least 48 hours.
    • A SAM/NAGIOS probe has been developed and is currently used in preprod by LHCb. Other VOs have been informed and showed interest in also deploying it. It tests the deployed version, I/O errors, connectivity to Stratum1 via proxies, used file descriptors, etc.
  • Deployment Status
    • 5 sites didn't meet the deployment target for ATLAS/LHCb ( 1 ATLAS, 2 LHCb, 2 ATLAS/LHCb, see https://twiki.cern.ch/twiki/bin/view/LCG/CvmfsDeploymentStatus for details)
    • In contact with ALICE about the sites to contact for CVMFS deployment. Rough statistics show ~ 30 sites not handled by the TF so far, another ~ 30 sites have CVMFS installed already for another VO.
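
To illustrate the kind of checks listed for the SAM/NAGIOS probe, here is a minimal sketch that maps already-collected measurements to a Nagios-style status. The metric names, the thresholds and the choice of which condition is a warning or critical are assumptions; the logic of the actual probe is not described in these minutes. Only the exit code convention (0 OK, 1 WARNING, 2 CRITICAL) and the 2.1.10 version are taken as given.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # conventional Nagios plugin exit codes


def parse_version(text):
    """Turn a dotted version such as '2.1.10' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def evaluate(metrics, baseline="2.1.10", max_fd_fraction=0.8):
    """Map hypothetical, already-collected measurements to (status, message)."""
    if metrics["io_errors"] > 0:
        return CRITICAL, "I/O errors reported"
    if not metrics["stratum1_reachable"]:
        return CRITICAL, "no Stratum1 reachable via proxies"
    if parse_version(metrics["version"]) < parse_version(baseline):
        return WARNING, "client older than baseline"
    if metrics["used_fds"] > max_fd_fraction * metrics["max_fds"]:
        return WARNING, "file descriptor usage high"
    return OK, "all checks passed"


# Hypothetical measurements from a worker node.
sample = {
    "version": "2.1.10",
    "io_errors": 0,
    "stratum1_reachable": True,
    "used_fds": 100,
    "max_fds": 1000,
}
print(evaluate(sample))
```

Versions are compared as integer tuples so that 2.1.10 sorts after 2.1.9, which a plain string comparison would get wrong.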

gLExec (M. Litmaath)

  • the deployment status and concerns were presented in this week's MB meeting
  • the MB decided the deployment needs to continue and proposed ramping up the campaign:
    • ticket all WLCG sites that do not pass the tests
    • the tests are to become critical after a target date, currently Sep 1 [no distinction among sites will be done based on how much the VOs they support are ready to use gLExec]
  • the campaign will be tracked on a dedicated page (tbd)

SHA-2 migration (M. Litmaath)

Christoph will make sure that somebody from CMS will actively contribute.

Concerning the timescale, there aren't yet any deadlines: officially, SHA-2 certificates are not expected to work before August 1, but it doesn't imply they need to work as of that date. There are a lot of middleware service instances that need upgrading to SHA-2 compliant versions. Maarten expects that the WLCG infrastructure can fully support SHA-2 some time in the autumn. Tiziana adds that in the EUGridPMA meeting this week (summary) there was an unofficial decision by which all CAs will issue SHA-2 certificates by default from October 1, and we should prepare for that. Exact dates will be defined in the official minutes.

Tracking tools evolution (M. Dimou)

Experiments are invited to inform the GGUS dev. team if they foresee TEAMers/ALARMers who belong to more than one experiment. The issue arose with case GGUS:94084 and was also discussed at the WLCG 'daily' Meeting (minutes). The current consensus is that this is a very rare case and that associating the VO name with the TEAMer/ALARMer DN would require changes to all parts of the GGUS system (database structure, logic layer, user interface), so no action will be taken for the moment. If the registration in multiple VOs is done using different DNs, the problem no longer exists.

Http Proxy Discovery (D. Dykstra)

AndreaV asks what exactly the use cases addressed by the proposal are. Dave explains that it will not be used by the Frontier/squid monitoring or to replace the proxy lookup mechanisms used by the ATLAS and CMS Frontier production systems; it is rather intended to be used for CVMFS, by jobs running on opportunistic resources, e.g. in clouds (foreseen by ATLAS and CMS), and for some CMS applications not related to Frontier. Maarten comments that even Frontier might consider using WPAD if this led to a simplification.
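
For readers unfamiliar with WPAD, the sketch below shows the generic DNS-based convention of deriving candidate `wpad.dat` URLs from a host name. It illustrates the general mechanism only and is not necessarily what the proposal specifies; the host name is hypothetical, and stopping at two domain labels is a simplification of how real clients decide where to stop.

```python
def wpad_candidates(fqdn):
    """List candidate WPAD URLs for a host, walking up its DNS domain."""
    labels = fqdn.strip(".").split(".")
    urls = []
    # Drop the host label, then shorten the domain one label at a time,
    # stopping while at least two labels remain (simplifying assumption).
    for i in range(1, len(labels) - 1):
        domain = ".".join(labels[i:])
        urls.append(f"http://wpad.{domain}/wpad.dat")
    return urls


# Hypothetical worker node name.
for url in wpad_candidates("wn001.farm.example.org"):
    print(url)
```

A client would try these URLs in order and use the first proxy auto-configuration file it can retrieve.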

News from EGI operations (P. Solagna and T. Ferrari)

For details, see the slides.

Highlights:

  • EGI will test for SHA-2 compliance of the production services with a new SAM test that will read the version information from the top BDII or other appropriate sources
  • all the central EGI operations tools (GOCDB, GGUS, Accounting and Operations portals and DB, SAM, MSG brokers) are SHA-2-ready
  • it is urgent to understand the timeline for migrating the OPS VO (and the LHC VOs) from VOMRS (not SHA-2 compliant) to VOMS-Admin [Maarten adds that according to VOMS-Admin developer Andrea Ceccanti all major issues encountered in earlier tests of the VOMRS-to-VOMSAdmin migration have been fixed]
  • UMD 3.0.0 was released last Tuesday (see some important caveats about the APEL clients)
  • There are several updates in the pipeline for UMD-2 and UMD-3, the most urgent ones to be released end of next week
  • A whole new bunch of ARC SAM metrics will be included in SAM Update-22: some replacing old ones, some brand new, and a couple to be obsoleted (GGUS:88829). This impacts WLCG because the old probes are part of the WLCG_CREAM_LCGCE_CRITICAL SAM profile. Feedback is welcome until 27-05-2013 via GGUS to the Operations SU [no objections were expressed from a WLCG point of view]

Tiziana asks Oliver which DPM and LFC versions are SHA-2 compliant. Oliver will check.

News from other WLCG working groups

AOB

Action list

  1. Find out why there are no SLC6 resources at CERN published in the BDII and how many SLC6 worker nodes are actually installed in lxbatch.
    • DONE: 10% of the computing capacity in SLC6 (as of 16.05.13: 189 24-core nodes and 304 8-core nodes in SLC6 batch, about 7k slots); one CREAM CE (ce208.cern.ch) configured to submit to SLC6 resources
  2. Add perfSONAR to the baseline versions list as soon as the latest Release Candidate becomes production-ready.
  3. Build a list by experiment of the Tier-2's that need to upgrade dCache to 2.2
    • done for ATLAS: list updated on 17-05-2013
    • done for CMS: list updated on 03-05-2013
    • not applicable to LHCb, nor to ALICE
    • Maarten: EGI and OSG will track sites that need to upgrade away from unsupported releases.
  4. Inform sites that they need to install the latest Frontier/squid RPM by May at the latest (done for CMS and ATLAS, status monitored)
  5. Maarten will look into SHA-2 testing by the experiments when experts can obtain VOMS proxies for their VO.
  6. Tracking tools TF members who own savannah projects to list them and submit them to the savannah and jira developers if they wish to migrate them to jira. AndreaV and MariaD to report on their experience from the migration of their own savannah trackers.
  7. CMS to appoint a contact point for Savannah:131565 and dependent dev. items for the savannah-ggus bridge replacement by a GGUS-only solution tailored to CMS ops needs.
  8. Investigate how to separate Disk and Tape services in GOCDB
  9. Update the CVMFS baseline version to the latest version
  10. Agree with IT-PES on a clear timeline to migrate OPS and the LHC VOs from VOMRS to VOMS-Admin
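
The "about 7k slots" figure quoted under action 1 can be checked from the node counts given there, assuming one job slot per core (an assumption; the minutes do not state the slot-to-core ratio).

```python
def total_slots(node_groups):
    """Sum slots over (node_count, cores_per_node) pairs, assuming one slot per core."""
    return sum(count * cores for count, cores in node_groups)


# Node counts from action 1: 189 24-core nodes and 304 8-core nodes.
slc6_nodes = [(189, 24), (304, 8)]
print(total_slots(slc6_nodes))  # 6968
```

The result is consistent with the 5000 + 2000 LXBATCH slots reported by the SL6 task force.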

Chat room comments

-- AndreaSciaba - 13-May-2013

Topic revision: r26 - 2013-05-17 - AndreaSciaba
 