WLCG Operations Coordination Minutes - September 18th, 2014

Agenda

Attendance

  • local: Maria Alandes (chair), Nicolo Magini (secretary), Marian Babik, Tsung-Hsun Wu, Andrea Manzi, Maria Dimou, Maarten Litmaath, Xavi Espinal
  • Christoph Wissing (CMS), Burt Holzman (FNAL), Maite Barroso (Tier-0), Jeremy Coles (GridPP), Di Qing, Catherine Biscarat, Pepe Flix (PIC), Michael Ernst (BNL), Natalia Ratnikova, Renaud Vernet, Michel Jouvin, Rob Quick (OSG), Ron Trompert (NL-T1)

Operations News

  • At the last MB, held on Tuesday, WLCG Operations presented a set of preliminary ideas on where WLCG Operations has room for improvement and could save sites some effort. The MB decided that WLCG Operations will investigate these ideas in more detail and carry out a survey among sites to find out more specifically where operational effort currently goes. The survey will help to identify areas where operational costs could be optimised. It will be prepared in the next few days, and sites will be contacted and asked to fill it in.
  • The lack of support for certain ARGUS components was presented at the MB and a series of actions were defined to try to find a solution for this problem.

Middleware News

  • Baselines:
    • New EMI release last week
      • Fix for Apel Parsers 2.3.1 (compressed accounting log parsing failure) released in EMI-3
      • New BDII version (1.6.0) released in EMI-3, containing small fixes for GLUE2 and configuration changes to improve performance; to be set as baseline once it is also released in UMD
      • New version of the EMI-3 UI and WN available. New dependencies such as gfal2-util and ginfo are included, along with a new bouncycastle-mail version dependency for SHA-2 VOMS. The updated meta packages are not yet in UMD.
    • Castor new release 2.1.14-14, already installed at CERN
    • FTS3 new release 3.2.27, installed at CERN already
    • Frontier/Squid 2.7.STABLE9-19, enhancement and bug fixes

  • MW Issues:
    • Storm and new GlueValidator check: There is an issue in the info-provider for Storm which has been discovered when running a check introduced in the new glue-validator version. The glue-validator is integrated as a Nagios Operational probe for EGI and for the moment it has been downgraded waiting for a fix for the Storm info-provider or a workaround inside the glue-validator

  • Maria Alandes comments that a fix for the GlueValidator has been developed and is waiting for a decision on how to proceed before deployment.
  • Maarten comments that the bouncycastle-mail update should be automatic whenever yum update is run, because it is a standard system dependency (viz. of voms-clients3), i.e. no explicit dependency is needed.

  • T0 and T1 services
    • FTS2 decommissioning
      • Done at ASGC, NDGF-T1 and PIC
    • CERN
      • EOSCMS upgraded to 0.3.35
      • CASTOR upgraded to 2.1.14-14
      • FTS upgraded to 3.2.27 (going to be upgraded on other sites in the coming days)
    • NDGF-T1
      • dCache upgrades to 2.10.4-SNAPSHOT
    • PIC
      • n2n ATLAS plugin upgraded

  • PIC reports the decommissioning of their FTS2 server.

Oracle Deployment

  • Maarten comments that the Oracle team reported that migration to GoldenGate is progressing well.

Tier 0 News

  • AFS UI:
    • Ongoing work to understand the use cases. A very common one is to get the CRLs; the normal practice for grid services is to have the CA certs locally and have the standard fetch-crl cronjob regularly updating the CRLs. We have documented how to configure this both with Quattor (https://cern.service-now.com/service-portal/article.do?n=KB0002773) and Puppet (https://cern.service-now.com/service-portal/article.do?n=KB0002772)
    • Access statistics: taken over a 124-minute sampling period on the AFS server that hosts exclusively the two AFS UI volumes. Lots of stats and reads from non-CERN hosts: ific.uv.es, unige.ch, in2p3 and a few others. The CERN ones are mostly batch nodes and lxplus. Looking at users, sorting distinct users by phonebook entry:
      • 50 PH-UCM, 14 PH-UAT, 10 Other, 7 PH-ULB, 3 PH-CMG, 1 PH-UAI, 1 PH-LBD, 1 PH-LBC, 1 PH-AIP, 1 IT-SDC, 1 IT-PES, 1 IT-ES
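For sites that replace the AFS UI with locally installed CA certs and a fetch-crl cronjob, a quick sanity check is to look for CRL files that have not been refreshed recently. The sketch below is only an illustration: it assumes the conventional directory /etc/grid-security/certificates and the usual `.r0` suffix for CRL files, and it uses the file modification time as a rough proxy for freshness rather than parsing the CRL itself. The function name `stale_crls` and the 24-hour threshold are hypothetical choices.

```python
import os
import time


def stale_crls(cert_dir, max_age_hours=24.0, now=None):
    """Return the sorted names of CRL files (*.r0) older than max_age_hours.

    Age is measured from the file modification time, which is only a
    rough proxy for when fetch-crl last refreshed the CRL.
    """
    now = time.time() if now is None else now
    stale = []
    for name in os.listdir(cert_dir):
        if not name.endswith(".r0"):
            continue  # skip CA certs, signing policies, etc.
        mtime = os.path.getmtime(os.path.join(cert_dir, name))
        age_hours = (now - mtime) / 3600.0
        if age_hours > max_age_hours:
            stale.append(name)
    return sorted(stale)


if __name__ == "__main__":
    # Assumed conventional location; adjust for the local setup.
    cert_dir = "/etc/grid-security/certificates"
    if os.path.isdir(cert_dir):
        for name in stale_crls(cert_dir):
            print("stale CRL:", name)
```

A check like this could run from cron or a monitoring probe; a non-empty result suggests that the fetch-crl cronjob is not running or cannot reach the CRL distribution points.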

  • WMS for EGI SAM: With the decommissioning of the WMS instances for SAM scheduled for October 1st, the only remaining use case is availability and reliability monitoring for EGI; for this we will use a catch-all WMS provided by EGI

  • plus5/batch5: User input has been analysed; some 40 to 50 tickets have been opened. A common pattern is the need for a platform to build binaries under SLC5; we have provided a KB article describing how to set up a private VM under OpenStack with all the relevant RPMs (see https://cern.service-now.com/service-portal/article.do?n=KB0002752). This appears to address many users' concerns. Of the remaining tickets, a large fraction comes from CMS and TOTEM users. We are setting up a dedicated meeting with CMS and TOTEM representatives in order to better understand the situation and clarify the next steps to be taken. Other user communities raising concerns are NA48, NA49 and NA61, with whom we will also meet.

  • Maite added the statistics of AFS UI users in attachment to the agenda.
  • New ACTION on experiments to follow up in 2 weeks about the use cases for the AFS UI.

Tier 1 Feedback

Tier 2 Feedback

  • GridPP: Raises a question about the move from gfal to gfal2. There were indications that gfal (no longer supported) would be removed from EPEL in early November. Yet gfal2 has only just arrived in the WN release (3.1.0) leaving only a short time to check for issues. Responses on the WLCG ops coordination list now suggest gfal will not be removed in November - in that case there is no problem for discussion or need to discuss whether the WN baseline should now be changed from v3.0.0 to v3.1.0.

  • Confirmed that the gfal packages will not be removed from the repositories.
  • Andrea Manzi comments that the WN version 3.1.0 including gfal2 will be declared baseline as soon as it is released also in UMD.

Experiments Reports

ALICE

  • CERN: investigation of job failure rates and inefficiencies
    • RSS limit was increased to 2.3 GB - thanks!
    • more jobs were still seen getting killed unexpectedly
    • further details will be provided to the batch team

ATLAS

  • apologies: nobody from ATLAS can be present due to the ongoing ATLAS SW and Computing week.
  • no major news to report with respect to 2 weeks ago: many improvements/steps done on the commissioning and integration of Rucio and ProdSys2, but still a long way to go (on the order of several weeks)
  • activities as usual.

CMS

  • Processing overview:
    • Upgrade (Phase2) processing with high priority at Tier-1
  • Bad incident with xrootd fallback test on Sep 16
    • Unfortunate version of test caused exit state error
    • Site readiness and SAM results will be corrected
  • Space monitoring at sites
    • Want to close 'unknowns' in the storage accounting
      • Present Phedex space accounting is (known to be) incomplete
      • Only official datasets are tracked, no user data
    • Will be based on storage dumps
      • Upload locally pre-processed data to a central DB
      • Privacy of space usage of individual users can be kept
        • Site controls level of aggregation to be exposed
      • Same storage dumps can be used for consistency checks (locally stored files vs catalog)
    • Prototype ready
    • Request sites to join
    • More details: https://twiki.cern.ch/twiki/bin/view/CMSPublic/SpaceMonSiteAdmin
  • AFS-UI at CERN
    • Services are being migrated to SL6 (with no dependency on AFS-UI)
    • Extending the UI availability beyond Oct 31st is still preferred
    • Request a supported Grid UI in CVMFS
      • CRAB3 clients outside CERN
      • Opportunistic resources (via Parrot)
  • Closing of lxplus5/lxbatch5
    • CMS answer sent by the Computing coordinator
      • "I get back to you about the SLC5 support. CMS would like to keep a number of interactive machines on SLC5. We expect order of 10 virtual machines would satisfy the CMS use cases on SLC5. These will be used to compile software and allow legacy analyses from Run1 to be completed, with a timescale of Spring 2015."

  • AAA exercise
    • Recommended monitoring: http://dashb-wlcg-transfers.cern.ch/ui/#
      • Known to be incomplete, but should indicate global trends
      • Select VO cms and xrootd
    • Time line
      • Sep 5-20: 'Overflow' running in US
        • Job sent to the data -> redirect elsewhere when waiting in the queue
        • Done at the level of 1% of analysis jobs in the US for over a year
        • Increased rate to ~15% with peaks to 25% (typical US-T2 analysis job load is 11k)
      • Sep 21-28: HammerCloud test for European/Asian region at 10% WAN access
      • First half of Oct: Overflow in the US and HC test in EU/Asia at 20% level
  • Reminders for sites
    • Update xrootd fallback configuration
      • Opened tickets to various sites; quite a few took action - thanks!
    • Add "Phedex Node Name" to site configuration
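To give a feeling for the scale of the AAA overflow exercise described above, the quoted percentages can be converted into approximate job counts. This is a back-of-the-envelope sketch that assumes the percentages apply directly to the typical US-T2 analysis load of 11k jobs quoted in the minutes; the function name `overflow_jobs` is hypothetical.

```python
def overflow_jobs(total_jobs, percent):
    """Approximate number of jobs redirected at a given overflow percentage."""
    return round(total_jobs * percent / 100)


# Typical US-T2 analysis job load quoted in the minutes.
US_T2_ANALYSIS_JOBS = 11000

# Overflow levels mentioned for the US: long-standing 1%, ~15%, peaks of 25%.
for label, percent in [("long-standing", 1), ("increased", 15), ("peak", 25)]:
    jobs = overflow_jobs(US_T2_ANALYSIS_JOBS, percent)
    print(f"{label} ({percent}%): ~{jobs} jobs reading data over the WAN")
```

Under this assumption the increase corresponds to going from roughly a hundred overflow jobs to between one and three thousand at a time.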

  • Maria Alandes asks about the number of sites missing in AAA monitoring, Christoph answers that it should be around 10 (number to be checked).
  • Andrea Manzi comments that the UI version 3.7.3 is now available in cvmfs grid.cern.ch, including CAs/CRLs/lsc which are automatically updated and monitored; latest 3.10 version can also be deployed. Multiple UI versions deployed in parallel in directory tree: Maarten suggests to add a link to the 'default' version aligned with baselines. Christoph and Michel ask to clarify who is responsible in case of issues e.g. with syncing the CRL updates to the cvmfs repository. Andrea answers that the UI part is covered by himself and another SDC member; not clear if the support chain (ticketing etc.) is in place for the CRLs. ACTION: find out who is responsible for fetch-crl in CVMFS UI.

LHCb

  • LHCb representatives absent due to collaboration week.

Ongoing Task Forces and Working Groups

gLExec Deployment TF

  • Michel asks about the progress on the VO side in gLExec adoption. Maarten reports significant progress from ATLAS: the pilot is now able to use gLExec if it is found working, and otherwise falls back to traditional mode. LHCb implemented gLExec support a long time ago but has not yet started pushing to use it at the sites.
  • Maarten comments that the ARGUS support should be clarified before pushing further.

Machine/Job Features

  • NR

Middleware Readiness WG

Multicore Deployment

Accounting: Publishing multicore accounting to APEL works. ARC CEs publish correctly. For CREAM CEs it works only on an EMI-3 CE, and it has to be enabled in the configuration.

Edit /etc/apel/parser.cfg and set the attribute parallel=true.

Sites that were already running multicore before upgrading and/or applying this modification need to reparse and republish the corrected accounting.
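The configuration change above can also be scripted. The following is a minimal sketch, not an official APEL procedure: it assumes that parser.cfg is an INI-style file and that the `parallel` attribute lives in a section named `batch` (the section name is an assumption; check the local file). The helper name `enable_parallel` is hypothetical. Note that rewriting the file with configparser drops any comments it contains, so it is safer to try this on a copy first.

```python
import configparser


def enable_parallel(cfg_path, section="batch"):
    """Set parallel=true in the given section of an INI-style parser.cfg.

    The section name is an assumption; comments in the original file
    are not preserved when it is rewritten.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(cfg_path)
    if not parser.has_section(section):
        parser.add_section(section)
    parser.set(section, "parallel", "true")
    with open(cfg_path, "w") as handle:
        parser.write(handle)


# Example (run against a copy rather than the live file):
#   enable_parallel("/tmp/parser.cfg.copy")
```

After the change, the reparse and republish step mentioned above is still needed for sites that were already running multicore jobs.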

SHA-2 Migration TF

  • introduction of the new VOMS servers
    • by the Sep 15 deadline only a few T2 were failing the preprod tests
      • open tickets (~8 on Sep 18)
      • most of the working T3 sites were also OK on time
    • SAM prod machines use the new servers since Tue Sep 16
    • now we need to check experiments' job and data systems ASAP
    • we can gradually open the node firewalls to selected "customers" per experiment
      • e.g. lxplus-testing.cern.ch
    • we could keep the new servers blocked for outside CERN a while longer
    • EGI are running a campaign for Ops via the NGIs and ROCs
      • first try to ensure sites are configured OK
      • then switch SAM-Nagios instances, probably around the end of Sep
    • the old VOMS servers will continue running until the end of Nov
      • their final SHA-1 host certs expire by then

  • Maarten comments that LHCb and ALICE services are ~OK with the new VOMS servers.
  • Maarten will keep open the tickets on the sites failing SAM tests after the switch to the new VOMS servers.

WMS Decommissioning TF

  • Deployment of the Condor-based SAM probes planned on Wed 1st of October 2014
  • Discussing with ATLAS to also make ARC-CE tests critical for WLCG monthly reports on the same date (4 sites that would be affected are: ARNES, SE-SNIC-T2, SiGNET, UNIBE-LHEP)

  • Marian comments that there are fewer errors on ARC-CEs at ATLAS sites with Condor probes compared to WMS probes; the main reason is the different BDII query.
  • Michel comments that it looks like an issue with the probe, since some ARC-CEs which fail the tests are running fine in production.

IPv6 Validation and Deployment TF

  • NR

Squid Monitoring and HTTP Proxy Discovery TFs

  • No progress to report today.
    • OSG attempted to consolidate the Squid field in the OIM but still needs to adjust it further.
    • The automated squid monitor still needs the same small changes reported last meeting.

Network and Transfer Metrics WG

  • Kick-off meeting minutes and slides available at https://indico.cern.ch/event/336520/
  • The meeting had very good participation, including experiments, the ESNet Science Engagement Group (perfSONAR development team), Panda, PhEDEx, FTS, FAX as well as the majority of the perfSONAR regional contacts. An initial overview of the current status of network and transfer metrics was presented, and a list of topics and tasks to work on in the short term was proposed. Very good feedback was received and we have agreed on the topics to discuss at the follow-up meetings.
  • Please check Twiki for updated task table
  • 5 sites received tickets on running an outdated version of perfSONAR

  • Marian reports progress on item T2.3 "WLCG perfSONAR configuration interface"
  • Marian reports that one site upgraded perfSonar since the last time.
  • Marian comments that the "perfSonar Operations area" of the working group will provide documentation on how to spot and debug problematic transfers. But the "metrics area" will also contribute to determine which metrics to use.

Action list

  1. ONGOING on the WLCG middleware officer: to take the steps needed to enable the CVMFS UI distribution as a viable replacement for the AFS UI.
    • The CVMFS repository grid.cern.ch contains the emi-ui-3.7.3 for SL6 (path /cvmfs/grid.cern.ch/emi-ui-3.7.3-1_sv6v1) and also provides CA certs, CRLs and VOMS lsc files. Given the new UI release we can also plan to upload the UI v3.10.0.
    • TODO: clarify the responsibilities (including ticketing etc.) for the maintenance of the CVMFS UI, in particular running fetch-crl
  2. ONGOING on the experiments: check the usage statistics of the AFS UI, report on the use cases at the next meeting.
  3. ONGOING on Andrea S.: to understand with EGI if it is possible to bypass the validation of VO card changes in the case of the LHC VOs. Agreed by Peter Solagna.
  4. CLOSED on the WLCG monitoring team: status of the CondorG probes for SAM to be able to decommission SAM WMS. Probes are in pre-production, deployment to production in October looks safe.
  5. ONGOING on the WLCG monitoring team: evaluate whether SAM works fine with HTCondor CE. Report about showstoppers. Status: the SAM team made a proposal on the steps to be taken to enable SAM. ATLAS is following up to make sure that the new CEs are correctly visible in AGIS, while for the CMS VO feed they will be taken directly from OIM. The plan is at first to test HTCondor-CEs in preproduction and later switch to production. It is not foreseen to monitor GT5 and HTCondor endpoints on the same host at the same time.
    • No showstopper for SAM. Need to discover topology; publishing queues in BDII not necessary for SAM probes since Condor can choose the queue based on the proxy.
  6. ONGOING on Alessandro DG: find out from OSG about plans for publication of HTCondor CE in information system, and report findings to WLCG Ops. To be followed up with Michael Ernst and Brian Bockelman.

  • About the last two action items, Marian reports about a meeting between ATLAS and Brian Bockelman, next in 1 month. Some information will be published about HTCondor-CEs in infosys. Rob Quick reports that he deployed a CondorC collector (currently just collecting info from OSG CEs, not yet publishing) and is evaluating running it as a service. Michel suggests to publish the information in Glue2, Marian comments there was no conclusion about Glue1 vs Glue2. Alessandro to report status for ATLAS at next meeting.

AOB

  • (MariaD) Confirmed "GGUS status and news" talk at the GDB on October 8. Please write to ggus-info@cern.ch about any issues you would like presented, explained or clarified in this presentation: issues of interest to the experiments we support and/or to WLCG Ops in general. Also, please say now if you wish to have a meeting session with the GGUS dev. team about candidate new features before or after the presentation.

-- MariaALANDESPRADILLO - 20 Jun 2014

Topic attachments

  • reads.hosts.cern.txt (r1, 2.5 K, 2014-09-18 16:07, MaiteBarroso)
  • reads.hosts.noncern.txt (r1, 2.9 K, 2014-09-18 16:07, MaiteBarroso)
  • stats.hosts.cern.txt (r1, 2.5 K, 2014-09-18 16:08, MaiteBarroso)
  • stats.hosts.noncern.txt (r1, 2.9 K, 2014-09-18 16:07, MaiteBarroso)
  • statsreads.users.txt (r1, 3.8 K, 2014-09-18 16:07, MaiteBarroso)
Topic revision: r30 - 2018-02-28 - MaartenLitmaath