
WLCG Operations Coordination Minutes - 29 August 2013

Agenda

https://indico.cern.ch/conferenceDisplay.py?confId=263201

Attendance

  • Local: Maria Girone (chair), Andrea Sciabà (secretary),
  • Remote:

News

Middleware news and baseline versions

https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions

Tier-1 Grid services

Storage deployment

| Site | Status | Recent changes | Planned changes |
| CERN | CASTOR: v2.1.13-9-2 and SRM-2.11 for all instances (SRM-2.11-2 for LHCb); EOS: ALICE (EOS 0.2.37 / xrootd 3.2.8), ATLAS (EOS 0.2.38 / xrootd 3.2.8 / BeStMan2-2.2.2), CMS (EOS 0.2.29 / xrootd 3.2.7 / BeStMan2-2.2.2), LHCb (EOS 0.2.29 / xrootd 3.2.7 / BeStMan2-2.2.2) | | |
| ASGC | CASTOR 2.1.13-9, CASTOR SRM 2.11-2, DPM 1.8.6-1, xrootd 3.2.7-1 | None | None |
| BNL | dCache 2.2.10 (Chimera, Postgres 9 w/ hot backup); http (aria2c) and xrootd/Scalla on each pool | None | None |
| CNAF | StoRM 1.11.2 emi3 (ATLAS, LHCb), StoRM 1.8.1 (CMS) | ATLAS-LHCb update | CMS update: September |
| FNAL | dCache 1.9.5-23 (PNFS, Postgres 8 with backup, distributed SRM), httpd 2.2.3; Scalla xrootd 2.9.7/3.2.7.slc; Oracle Lustre 1.8.6; EOS 0.3.1-5 / xrootd 3.3.3-1.slc5 with BeStMan 2.2.2.0.10 | | |
| IN2P3 | dCache 2.2.12-1 (Chimera) on SL6 core servers and 2.2.13-1 on pool nodes; Postgres 9.1; xrootd 3.0.4 | | |
| KISTI | xrootd v3.2.6 on SL5 for disk pools; xrootd 20100510-1509_dbg on SL6 for tape pool; DPM 1.8.6 | | xrootd upgrade foreseen for tape (20100510-1509_dbg -> v3.1.1) in September |
| KIT | dCache: atlassrm-fzk.gridka.de 2.6.5-1, cmssrm-fzk.gridka.de 2.6.5-1, lhcbsrm-kit.gridka.de 2.6.5-1; xrootd: alice-tape-se.gridka.de 20100510-1509_dbg, alice-disk-se.gridka.de 3.2.6, ATLAS FAX xrootd proxy 3.3.1-1 | dCache upgrade to 2.6.5 end of July | alice-tape-se.gridka.de will be upgraded to xrootd 3.2.6 in Sep/Oct; the ATLAS FAX xrootd proxy will be upgraded to 3.3.3 during September |
| NDGF | dCache 2.3 (Chimera) on core servers; mix of 2.3 and 2.2 versions on pool nodes | | |
| NL-T1 | dCache 2.2.7 (Chimera) (SURFsara), DPM 1.8.6 (NIKHEF) | | |
| PIC | dCache head nodes (Chimera) and doors at 1.9.12-23; xrootd 3.3.1-1 | | Next upgrade to 2.2 on 16 September |
| RAL | CASTOR 2.1.12-10, 2.1.13-9 (tape servers), SRM 2.11-1 | | |
| TRIUMF | dCache 2.2.13 (Chimera), pool/door 2.2.10 | | |

FTS deployment

| Site | Version | Recent changes | Planned changes |
| CERN | 2.2.8 - transfer-fts-3.7.12-1 | | |
| ASGC | 2.2.8 - transfer-fts-3.7.12-1 | | |
| BNL | 2.2.8 - transfer-fts-3.7.10-1 | None | None |
| CNAF | 2.2.8 - transfer-fts-3.7.12-1 | | |
| FNAL | 2.2.8 - transfer-fts-3.7.12-1 | | |
| IN2P3 | 2.2.8 - transfer-fts-3.7.12-1 | | |
| KIT | 2.2.8 - transfer-fts-3.7.12-1 | | |
| NDGF | 2.2.8 - transfer-fts-3.7.12-1 | | |
| NL-T1 | 2.2.8 - transfer-fts-3.7.12-1 | | |
| PIC | 2.2.8 - transfer-fts-3.7.12-1 | None | None |
| RAL | 2.2.8 - transfer-fts-3.7.12-1 | | |
| TRIUMF | 2.2.8 - transfer-fts-3.7.12-1 | | |

LFC deployment

| Site | Version | OS, distribution | Backend | WLCG VOs | Upgrade plans |
| BNL | 1.8.3.1-1 for T1 and US T2s | SL6, gLite | Oracle 11gR2 | ATLAS | None |
| CERN | 1.8.6-1 | SLC6, EMI2 | Oracle 11 | ATLAS, LHCb, OPS, ATLAS xroot federations | |

Other site news

Data management provider news

Experiment operations review and plans

ALICE

  • significant and efficient grid usage over the summer break
  • CVMFS
    • deployment campaign: 33 tickets closed, 19 open
    • 1 small site (60 job slots) has been running it in production for 2 weeks, with good job success rates
    • next: switch a big T2
  • CERN
    • job submissions by ALICE and other VOs failing for many hours due to Argus incidents on July 21-22 (GGUS:95914) and Aug 23 (INC:365019)
  • KIT
    • 23 production files lost due to a damaged tape; as they already had 2 other replicas, only the catalog entries needed to be adjusted
    • local SE access still unstable, but with reduced impact since the firewall upgrade on July 24-25; jobs capped at 6k since Aug 12
  • RAL
    • on Fri Aug 23 many WNs ran out of memory due to one ALICE user's broken jobs, which allocated huge amounts of memory in a very short time
    • to stop the issue, ALICE jobs were banned for the weekend
    • the guilty tasks were removed and their owner has been admonished

ATLAS

  • Distributed Computing operations proceeded as usual in the past month. Worth noting are the ongoing efforts invested in the FTS3 commissioning and integration by ADC and DDM Ops; more details in the FTS3 task force report.
  • as a reminder, ATLAS reports the "daily" issues at the WLCG "daily" meeting; all reports are available here: https://twiki.cern.ch/twiki/bin/viewauth/Atlas/ADCOperationsDailyReports2013

CMS

  • Dirk Hufnagel joined the machine feature task force for CMS

  • HammerCloud
    • stopped submitting jobs on August 21, when the certificate used to authenticate to the glideinWMS expired; fixed on August 28
  • Argus calls timing out: INC:365019
    • August 23: two CERN hosts used for CMS analysis job submission became unusable because gsissh connections were failing. About 5K CMS pilots also died (condor Held) due to glexec failures. The problem apparently disappeared around noon, without any action taken
    • August 23: ALICE job submission was also badly affected from ~08:30 until ~14:00, causing CERN to become almost fully drained of ALICE jobs. Moreover, there were also sporadic AuthZ errors before and afterwards, so it seems the Argus machinery was/is flaky…
    • This issue seems to be created by high user activity, and the way CERN currently set up the HA Argus is not able to cope with this load. The system worked as designed in the sense that the overloaded servers were taken out of the alias and the load moved on to the next server while the previous one was draining.

  • T1 prioritization changes:
    • Following changes have been implemented recently:
      • the t1production role is not used anymore after the glideinWMS frontends were merged; only the production role is used from now on
        • prioritization is done at the frontend level => fractional shares within the CMS T1 allocations are not needed anymore; simple prioritization is sufficient
      • disk/tape separation allows opening the T1 CPU resources for analysis accessing samples on the disk endpoints
    • Updated prioritization policy: simple prioritization without shares in CMS CPU allocation at the sites
      • available CPU resources should be assigned first to production role work, then to pilot role analysis work, then to analysis work with any other role or no role within the CMS VO
      • analysis access to T1 CPU resources should prefer glexec and therefore the pilot role
    • All documented here https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompOpsPoliciesVOMSRoles

  • CVMFS: working with the last remaining ~10 T1/T2 sites to switch to CVMFS
    • Black holes due to CVMFS are seen from time to time; site admins have been instructed to prepare debug tarballs and send them to the CVMFS development team
  • glexec: after the summer break, we will restart working with the sites to enable glexec
    • by late fall, we would like to make the CMS glexec SAM tests critical and rely on glexec in our global glideIn WMS pool
  • Savannah->GGUS:
    • work continues to migrate all ticket activities to GGUS, working with the GGUS development team on:
      • new web input mask, test instance and implementation of additional CMS internal support units
    • list of open tickets: SAV:138640, SAV:134411, SAV:134413, SAV:134415, SAV:134416
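The T1 prioritization policy described above (production role first, then pilot-role analysis, then any other role or no role) amounts to a simple ordering. The sketch below is illustrative only, not CMS or glideinWMS code; all names and the job representation are hypothetical.

```python
# Illustrative sketch (not CMS code) of the simple T1 prioritization policy:
# production role first, then pilot-role analysis, then anything else.

PRIORITY_ORDER = {"production": 0, "pilot": 1}

def job_priority(voms_role):
    """Lower value = higher priority; unknown or absent roles rank last."""
    return PRIORITY_ORDER.get(voms_role, 2)

def order_jobs(jobs):
    """Sort (job_id, voms_role) pairs according to the policy."""
    return sorted(jobs, key=lambda j: job_priority(j[1]))

jobs = [("j1", None), ("j2", "pilot"), ("j3", "production"), ("j4", "lcgadmin")]
print(order_jobs(jobs))
# [('j3', 'production'), ('j2', 'pilot'), ('j1', None), ('j4', 'lcgadmin')]
```

No fractional shares appear anywhere: as the notes say, prioritization at the frontend level makes a plain ordering sufficient.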

LHCb

  • FTS3 test OK
  • python bindings for grid middleware nearly ready for python 2.7
  • the incremental stripping campaign will start mid-September and run for 2 months
  • GridKa: tape families set for the next round of data taking, in order to improve tape access

Task Force reports

CVMFS

  • Very good progress on the ALICE site migration to CVMFS (campaign started in July)
    • 31 of 51 sites have deployed CVMFS; most of the remaining sites promise to deploy this fall
    • 2 sites have not acknowledged the ticket yet (ITEP, IN-DAE-VECC-02)
  • New GGUS support unit "CVMFS" added under new category "File System"

SHA-2 migration

  • EGI Operations Management Board Aug 27 meeting
    • good progress for CREAM, VOMS, WMS
    • little progress for StoRM, dCache
    • discussion outcomes:
      • aim for Dec 1 deadline for compliance of services
      • ask IGTF/EuGridPMA to delay SHA-2 timeline by 2 months
  • CERN's VOMS servers can generate SHA-2 proxies!
    • but SHA-2 certs need to be registered as secondary certs for now, as described here
    • migration from VOMRS to the SHA-2 compliant VOMS-Admin is foreseen to happen in the autumn
  • ATLAS and LHCb SW looks ready for SHA-2
  • CMS
    • DBS2 will not be able to deal with SHA-2
    • it is scheduled to be retired in Nov
  • ALICE: various SW still to be checked
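One practical check related to the migration above is verifying whether a given certificate or proxy is actually SHA-2 signed, e.g. by inspecting the signature algorithm reported by "openssl x509 -noout -text". The sketch below parses such output; the sample text is illustrative, and feeding it real output is left to the reader.

```python
# Hedged sketch: decide from "openssl x509 -noout -text" output whether a
# certificate/proxy uses a SHA-2 digest (sha224/sha256/sha384/sha512).

import re

SHA2_DIGESTS = {"sha224", "sha256", "sha384", "sha512"}

def is_sha2(openssl_text):
    """True if the Signature Algorithm line names a SHA-2 digest."""
    m = re.search(r"Signature Algorithm:\s*(\w+)", openssl_text)
    if not m:
        return False
    alg = m.group(1).lower()  # e.g. "sha256withrsaencryption"
    return any(alg.startswith(d) for d in SHA2_DIGESTS)

sample = "    Signature Algorithm: sha256WithRSAEncryption\n"
print(is_sha2(sample))  # True
```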

SL6

  • Tier1
    • Done: 8/16 (Alice 4/10, Atlas 6/13, CMS 3/9, LHCb 4/9)
    • In progress: CCIN2P3 and NDGF (both as planned)
    • TW is upgrading and is already in production in ATLAS with 50% of the resources; the other half will be done by the end of the week.
    • Others: 5 Tier-1s have set 2013-10-31 as their deadline. They might be able to do it sooner (for example, KIT and CNAF commented they are working towards September). RAL is deciding whether to go down the ARC-CE/Condor route or to keep Torque/Maui: in the first case it will just be a matter of rolling out WNs; in the second case a big-bang upgrade needs to be organised. RRC-KI-T1 and FNAL haven't finalised their plans.
    • Tier1s status: https://twiki.cern.ch/twiki/bin/view/LCG/SL6DeploymentSites#Tier1s
  • Tier2
    • Done: 48/129 (Alice 11/39, Atlas 28/89, CMS 22/65, LHCb 13/45)
    • A number of sites that set 2013-07-31 or 2013-08-31 as deadlines, or marked themselves as in progress, are actually postponing.
    • 7 sites are still marked with "no plans and no testing"; tickets will be opened for these sites starting from 1 September.
    • A few sites are waiting for Quattor templates. Investigating when these templates will be available.
    • Tier2s status: https://twiki.cern.ch/twiki/bin/view/LCG/SL6DeploymentSites#Tier2s
  • EMI-3
  • HS06
    • Reminder that sites are requested to run the HS06 benchmark and update the value in the BDII. Sites which have already done it see increases of up to 25%, depending on the hardware. It would help the community if sites added their results to the HEPiX page http://w3.hepix.org/benchmarks/doku.php?id=bench:results_sl6_x86_64_gcc_445; however, it is not clear how sites can get access to it, as the normal HEPiX login doesn't work. Investigating with the HEPiX people.
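As an aside on the BDII publication mentioned above, a site's published benchmark can be checked by parsing the LDIF returned by an ldapsearch against the BDII. The sketch below is illustrative only: the GLUE2BenchmarkType/GLUE2BenchmarkValue attribute names are assumed from the GLUE 2.0 schema, the sample LDIF is made up, and real LDIF may order attributes differently.

```python
# Hedged sketch: pull HS06 benchmark values out of LDIF text, assuming the
# GLUE 2.0 attribute names GLUE2BenchmarkType / GLUE2BenchmarkValue and that
# the Type line precedes the Value line (real BDII output may differ).

def hs06_values(ldif_text):
    """Return the HS06 benchmark values found in the given LDIF text."""
    values, pending = [], False
    for line in ldif_text.splitlines():
        line = line.strip()
        if line.lower() == "glue2benchmarktype: hep-spec06":
            pending = True                      # next Value line is HS06
        elif pending and line.lower().startswith("glue2benchmarkvalue:"):
            values.append(float(line.split(":", 1)[1]))
            pending = False
    return values

sample = """\
dn: GLUE2BenchmarkID=example,o=glue
GLUE2BenchmarkType: hep-spec06
GLUE2BenchmarkValue: 10.5
"""
print(hs06_values(sample))  # [10.5]
```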

Tracking Tools

Brainstorming is starting now about how to implement the possibility to notify multiple sites via a single GGUS ticket. You are invited to contribute comments in Savannah:138299.

This is a complex development so we have to be clear about the requirements. We also need to understand how often this bulk submission is likely to be needed. Summarising what we have so far:

  1. To avoid mistakes and abuse, the privilege should be offered only to expert supporters/submitters.
  2. The submitter name should remain their own (no TPM involvement), so they control the future exchanges.
  3. The Subject and Description fields should be typed once and cloned to all individual ticket instances.
  4. The official site name, as in GOCDB or OIM, should still be offered for selection to the expert submitter, so they don't risk mistyping complex names, e.g. AEGIS03-ELEF-LEDA (maybe not a WLCG name but the prettiest I found).
  5. The possibility to select >1 ROC/NGI (not only "Notify Site") should also be foreseen.
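The requirements above can be summarised in a small sketch. This is purely illustrative, not GGUS code: the function, the site list, and all field names are hypothetical, and it only models requirements 1-4 (authorisation, preserved submitter, cloned subject/description, official site names).

```python
# Illustrative sketch (not GGUS code) of the requested bulk-submission
# behaviour: one subject/description cloned into per-site tickets, with the
# submitter preserved and site names validated against an official list.

OFFICIAL_SITES = {"AEGIS03-ELEF-LEDA", "CERN-PROD", "FZK-LCG2"}  # e.g. from GOCDB/OIM

def bulk_submit(submitter, subject, description, sites, authorized_experts):
    if submitter not in authorized_experts:        # requirement 1
        raise PermissionError("bulk submission is restricted to expert submitters")
    unknown = set(sites) - OFFICIAL_SITES          # requirement 4
    if unknown:
        raise ValueError("unknown site names: %s" % sorted(unknown))
    # requirements 2 and 3: same submitter, cloned Subject/Description
    return [{"site": s, "submitter": submitter,
             "subject": subject, "description": description}
            for s in sorted(sites)]
```

Requirement 5 (notifying more than one ROC/NGI) would be a straightforward extension of the same cloning step.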

FTS-3

  • ATLAS, CMS, LHCb are now all using FTS3 for functional tests or real production.
  • FTS3 is now used in ATLAS for all production activity, covering approximately 30% of the overall transfers.
  • The integration with the experiments is an iterative process (O(10) patches in the last 2 months).
  • we experienced problems starting from around 15 August. The first problem observed was not having enough ACTIVE transfers on some links (links with plenty of SUBMITTED transfers).
    • this has been understood and solved: the FTS3 DB was not optimized properly.
    • a patch was applied on 22 August, but this update had side effects which led to links getting completely stuck; some of the jobs were not reporting back their statuses.
    • the problems have been understood in the past days, and just yesterday (the 28th) a new FTS3 update was released and deployed at RAL.
      • we are evaluating the performance right now; it seems that the amount of errors is much higher than before, and there are links with very low throughput which have an impressively high number of ACTIVE transfers, O(100)
      • since this morning we agreed with the FTS3 devs to start putting more load, with ATLAS functional tests, on the fts3.cern.ch production endpoint, which runs with a MySQL backend (i.e. similar to RAL), to try to reproduce, and then solve, the issues.
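The starved-link symptom described in this report (links with many SUBMITTED transfers but almost no ACTIVE ones) can be illustrated with a small state-counting sketch. This is not FTS3 code; the transfer records, thresholds, and function name are made up for illustration.

```python
# Hedged sketch of the FTS3 symptom above: count per-link transfer states
# and flag links that have plenty of SUBMITTED transfers but almost no
# ACTIVE ones. Records are hypothetical (link, state) tuples.

from collections import Counter

def suspicious_links(transfers, min_submitted=100, max_active=2):
    """Return links that look starved: many SUBMITTED, hardly any ACTIVE."""
    counts = Counter((link, state) for link, state in transfers)
    links = {link for link, _ in counts}
    return sorted(
        link for link in links
        if counts[(link, "SUBMITTED")] >= min_submitted
        and counts[(link, "ACTIVE")] <= max_active
    )
```

The same counting, done the other way around (low throughput but O(100) ACTIVE transfers), would flag the stuck links seen after the 22 August patch.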

XrootD Deployment

  • New version of GLED Collector for XRootD detailed monitor (M. Tadel)
    • Change log relative to 1.4.0:
      • external updates:
      • root 5.34.07 -> 5.34.09
      • apr  1.4.6 -> 1.4.8
      • activemq-cpp 3.7.0 -> 3.7.1
      • resolve clients with numeric IPs (needed as more and more servers are getting configured to not do the reverse lookup)
    • RPMs available in the WLCG repo.

  • dCache xrootd door monitoring plugin
    • new version with improved functionality will be made available in the WLCG repo by next week (announcement will be circulated)
    • In the meantime, a few ATLAS sites (MWT2 and AGLT2) have already deployed the new version.

gLExec

  • 39 tickets closed and verified, 55 still open, many on hold until mid/late autumn
  • Deployment tracking page
    • various countries all done, in particular France!

AOB

Action list

  1. Build a list by experiment of the Tier-2's that need to upgrade dCache to 2.2
    • done for ATLAS: list updated on 17-05-2013
    • done for CMS: list updated on 20-06-2013
    • not applicable to LHCb, nor to ALICE
    • Maarten: EGI and OSG will track sites that need to upgrade away from unsupported releases.
  2. Inform sites that they need to install the latest Frontier/squid RPM by May at the latest ( done for CMS and ATLAS, status monitored)
  3. Maarten will look into SHA-2 testing by the experiments when experts can obtain VOMS proxies for their VO.
  4. Tracking tools TF members who own Savannah projects should list them and submit them to the Savannah and JIRA developers if they wish to migrate them to JIRA. AndreaV and MariaD to report on their experience from the migration of their own Savannah trackers.
  5. Investigate how to separate Disk and Tape services in GOCDB
  6. Agree with IT-PES on a clear timeline to migrate OPS and the LHC VOs from VOMRS to VOMS-Admin
  7. Experiments interested in using WebDAV should contact ATLAS to organise a common discussion
  8. Contact the storage system developers to find out which are the default/recommended ports for WebDAV
  9. Circulate the instructions to enable the xrootd monitoring on DPM

-- AndreaSciaba - 28-Aug-2013

Topic revision: r36 - 2013-08-29 - AlessandraForti
 