WLCG Operations Coordination Minutes, December 3rd 2015

Highlights

  • Please register for the workshop at https://indico.cern.ch/e/WLCG-Workshop-Lisbon-2016
  • RedHat just provided a fix for openldap, to avoid current crashes affecting Top BDII and ARC-CE. Tests of this fix can now start.
  • The Information Systems (InfoSys) TF Future Use Cases Document is now ready in the WLCG Document Repository.
  • The French and Spanish sites can now have host certificates compliant with the Globus change in validation; Spain will use the Terena CA, France has introduced the feature in its CA
  • MJF and Infosys TFs are harmonising their position about definitions for number of cores and HS06 values
  • VOMS is not working with IPv6, probably due to a relatively recent change (GGUS:117987).

Agenda

Attendance

  • local: Andrea Sciabà (chair), Maria Dimou (minutes), Maarten Litmaath, Andrea Manzi, Marc Slater, Jérôme Belleman, Maria Alandes.
  • remote: Alessandra Doria, Michael Ernst, Christoph Wissing, Dave Mason, Vincenzo Spinoso, Peter Gronbech, Andreas Petzold, Andrew McNab, Di Qing, Massimo Sgaravatto, Ulf Bobson Severin Tigerstedt, Josep (Pepe) Flix, Julia Andreeva, Andrea Valassi, Gareth Smith, Antonio Yzquierdo, Jeremy Coles (part).
  • apologies: Catherine Biscarat, Alessandro Di Girolamo, David Cameron

Operations News

Middleware News

  • Baselines:
    • NTR

  • Issues:
    • Some news from RedHat regarding the openldap crash affecting Top BDII and ARC-CE: they provided us with the fix to test today. We will communicate any news during the next ops or ops coord meetings.

  • T0 and T1 services
    • JINR
      • Minor postgres upgrade to 9.4.5
    • PIC
      • Major dCache upgrade to v 2.13.13

Tier 0 News

  • Working on adding more capacity to Condor cluster.
  • LSF 9 software deployed on all worker nodes for both the ATLAS T0 instance and the main one.

DB News

Tier 1 Feedback

  • CC-IN2P3 (France, C. Biscarat) - Globus host certificate validation change - our CA is now able to deliver compliant certificates, which allow aliases to be declared in the AltName. We are in the process of changing the certificate on the affected hosts (2 SRM servers).
  • PIC (Spain, J. Flix) - Globus host certificate validation change - PIC is now using TCS (Terena) certificates, and we have already solved the issues for all of the hosts with aliases. We are in the process of migrating the remaining machines to TCS certificates, which will happen rather soon.
  • PIC (Spain, J. Casals) - Fixed some issues in the Nagios-plugin for SAM3: https://github.com/jcasals/nagios-plugins-lcgsam
  • NDGF-T1(Nordics, Tigerstedt) will upgrade dCache to 2.14.x on 14.12.2015, full day outage.
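As an illustration of the host certificate item above, here is a minimal Python sketch of the kind of check a site could run before replacing a certificate: is every alias under which a service is contacted listed in the certificate's AltName? It assumes the certificate is available as a dictionary in the format returned by Python's ssl.SSLSocket.getpeercert(). The host names are hypothetical, wildcard entries are deliberately not handled, and the exact matching rules applied by Globus are not described in these minutes.

```python
def names_in_certificate(cert):
    """Return the lower-cased DNS names listed in the subjectAltName field."""
    return {
        value.lower()
        for kind, value in cert.get("subjectAltName", ())
        if kind == "DNS"
    }


def alias_is_covered(cert, contacted_name):
    """True if the name used to contact the service appears in the AltName."""
    return contacted_name.lower() in names_in_certificate(cert)


# Hypothetical certificate of an SRM server that is also reachable via an alias.
example_cert = {
    "subject": ((("commonName", "srm01.example.org"),),),
    "subjectAltName": (
        ("DNS", "srm01.example.org"),
        ("DNS", "srm.example.org"),
    ),
}

for name in ("srm.example.org", "srm-old.example.org"):
    status = "covered" if alias_is_covered(example_cert, name) else "NOT covered"
    print(f"{name}: {status}")
```

A host whose alias is reported as not covered would need a new certificate that lists the alias.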

Tier 2 Feedback

Experiments Reports

ALICE

  • generally normal to high activity
  • so far the heavy ion run has been smooth from the grid perspective!
    • reco jobs run very successfully
    • their RSS memory consumption has remained below a maximum of ~2.5 GB
      • we have to see what happens at the planned higher beam intensities
    • this has allowed the use of normal queues at various T1
    • at CERN the fraction of two-core jobs is being lowered in steps
  • CERN: submission to HTCondor CE in production since yesterday evening
  • CERN: TEAM ticket GGUS:118062 opened Monday evening. ALICE was severely impacted by an OpenStack issue:
    • the standard build system could not be used to release analysis updates
    • a local mini build system was put together for the most urgent cases
    • thanks to the OpenStack team for solving the complex issue as fast as possible!

ATLAS

  • HeavyIon data taking: in general everything is OK, with nothing in particular to worry about. Tier-0 performance in terms of events/second reconstructed by the whole cluster is quite low (a few tens of Hz); huge I/O wait was observed on the Wigner nodes with spinning disks. Those nodes have now been configured by the ATLAS Tier-0 to run fewer jobs than in their standard batch configuration, and performance has improved.
  • Reprocessing test now ongoing: the plan is to launch a full reprocessing campaign on the 14th of December. The plan is quite tight since there are still problems with the release; please stay tuned at the Tuesday ADC weekly meetings over the coming weeks.

CMS

  • Heavy Ion Run is ongoing
  • Permission problem in EOS: GGUS:118027
    • Issue with mapping
    • Fixed by EOS team during the weekend - Thanks!
  • Some storage pools disappearing from the network: GGUS:118082, GGUS:118037
    • Investigated by CERN storage and network teams
    • Only seen by CMS?
  • CMS Tier-0 workflows are driving some CERN Openstack hardware to its limits: GGUS:118056
  • Staging problem at KIT: GGUS:117910
    • Led to too many queued transfers within the CMS transfer system
      • Quite some overall performance degradation, also affecting transfers where KIT is not involved
    • Situation is improving (at KIT and globally)
  • Had a little ticketing campaign for DPM sites to move to DPM 1.8.10
    • Earlier versions have issues with recent global/regional redirectors

LHCb

  • Operations
    • Currently processing pp reference run
    • Finished 13TeV pp data processing
    • Will be starting processing of Heavy Ion runs soon
    • Significant MC generation in-coming

  • Issues
    • Problems with users accessing files at IN2P3 were experienced last Tuesday afternoon. They were assumed to be due to CA issues, as both went away at the same time, but more likely the fix was just coincidental (GGUS:118077)
    • Problem with an RRCKI tape being put offline, which prevented access to certain files, is now solved

  • Developments
    • MC simulation workflows have been executed successfully on commercial clouds, on both DBCE (up to 600 simultaneous jobs running) and Azure (up to 1000 simultaneous jobs running, high rate of stalled jobs under investigation).

Ongoing Task Forces and Working Groups

gLExec Deployment TF

  • NTR

Machine/Job Features TF

  • Ongoing discussions clarifying key/value pairs: some changes, some expanded definitions
  • Attempting to be consistent with WLCG Information Systems Evolution TF
  • 2nd draft of HSF technical note to record the communication procedure and the key/value pair definitions
  • Deployed at several batch sites, and many VM-based installations (all the ones using Vac/Vcycle)
  • DIRAC reading time limit information from MJF in LHCb pilot jobs and pilot VMs.
  • Next steps: review experience with implementations and installations, and update in view of the technical note discussions.
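To illustrate how a pilot might read a time limit from MJF, here is a minimal Python sketch. It assumes that the JOBFEATURES environment variable points to a local directory in which each key is a file whose content is the value, and that the wall-clock limit is published under the key wall_limit_secs. Both are assumptions for illustration: the key/value definitions were still under discussion at the time of this meeting, and this is not the DIRAC implementation.

```python
import os
import tempfile


def read_feature(directory, key):
    """Return the value stored for `key`, or None if the key is not published."""
    try:
        with open(os.path.join(directory, key)) as handle:
            return handle.read().strip()
    except OSError:
        return None


def wall_limit_seconds(environ=None):
    """Return the wall-clock limit in seconds, or None if it cannot be read."""
    environ = os.environ if environ is None else environ
    directory = environ.get("JOBFEATURES")
    if not directory:
        return None
    try:
        # "wall_limit_secs" is an assumed key name (see the note above).
        return int(read_feature(directory, "wall_limit_secs"))
    except (TypeError, ValueError):
        return None


# Demonstration with a temporary directory standing in for $JOBFEATURES.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "wall_limit_secs"), "w") as handle:
        handle.write("172800\n")
    print(wall_limit_seconds({"JOBFEATURES": demo_dir}))
```

Returning None when the variable or the key is missing lets the caller fall back to whatever other source of limits it has.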

HTTP Deployment TF

Information System Evolution


  • The Future Use Cases Document is now ready in the WLCG Document Repository (PDF). There is general agreement that a central information system owned by WLCG is an interesting idea. For some VOs the requirement is stronger than for others, but all VOs agree that they would rely on a central information system that provides good quality information. Activities like WLCG Monitoring and Operations will definitely rely on such a tool. The WLCG Information System should:
    • Cache information from heterogeneous resources by regularly collecting information from primary data sources for WLCG service discovery (Now GOCDB, OIM and BDII, but the list of primary resources can evolve in the future).
    • Provide a consistent interface for all interested WLCG clients offering an intermediate layer between the sources of information maintained by EGI and OSG.
    • Include grid and non grid resources, like HPC and Clouds and be flexible enough to be able to include new types of resources.
    • Validate information before it gets published, applying corrective actions if necessary.
    • Log information, namely when, how and by whom information was provided.
  • Starting to prepare a Roadmap to GLUE 2.0 so that VOs and WLCG clients start consuming GLUE 2.0 information and we can plan at some point the decommissioning of GLUE 1.3.
    • EGI presented their plans to move to GLUE 2.0. Main showstopper is GLUE 2 WMS that was never tested in production. EGI is now trying to understand its actual use.
    • Waiting for OSG input about their plans to provide information to WLCG once they stop publishing in the BDII and whether we could expect information published in GLUE 2 after the implementation of the ClassAds to GLUE 2 translator.
  • Ongoing discussion to agree on a better definition of the GLUE 2 attributes defining HS06 (GLUE2BenchmarkValue) and Logical CPUs (GLUE2ExecutionEnvironmentLogicalCPUs), so that sites clearly understand what they are expected to publish in these attributes.
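The "validate before publishing" and "log when, how and by whom" requirements listed above can be sketched in Python for the two GLUE 2 attributes under discussion. The validation rule used here (both values must be positive numbers) and the record layout are assumptions made for illustration only. The agreed definitions were still being discussed, and the source name and values in the example are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Attribute names are taken from the minutes; the casters are an assumption.
CHECKED_ATTRIBUTES = (
    ("GLUE2ExecutionEnvironmentLogicalCPUs", int),
    ("GLUE2BenchmarkValue", float),
)


@dataclass
class CachedRecord:
    values: dict           # attributes that passed validation
    source: str            # by whom / from where the information came
    collected_at: datetime  # when it was collected
    problems: list         # human-readable validation failures


def validate(raw, source, now=None):
    """Validate the checked attributes and keep provenance with the result."""
    values, problems = {}, []
    for name, caster in CHECKED_ATTRIBUTES:
        if name not in raw:
            problems.append(f"{name}: missing")
            continue
        try:
            number = caster(raw[name])
        except (TypeError, ValueError):
            problems.append(f"{name}: not a number (got {raw[name]!r})")
            continue
        if number <= 0:
            problems.append(f"{name}: must be positive (got {number})")
            continue
        values[name] = number
    collected_at = now or datetime.now(timezone.utc)
    return CachedRecord(values, source, collected_at, problems)


# Hypothetical entry as it might be collected from a primary source.
record = validate(
    {"GLUE2ExecutionEnvironmentLogicalCPUs": "16", "GLUE2BenchmarkValue": "160.5"},
    source="BDII",
)
print(record.values, record.problems)
```

Keeping the rejected attributes as a list of problems, rather than raising an error, leaves room for the corrective actions mentioned in the requirements.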

Alessandra Doria (Napoli) expressed the sites' appreciation for the TF's definitions' dissemination via the lcg-rollout mailing list and the poll for feedback from the sites.

IPv6 Validation and Deployment TF


  • The NAGIOS service set-up is still being tuned.
  • VOMS still doesn't work with IPv6. There is a ticket to follow this up. No problem with voms-admin.
  • The ARGUS - IPv6 status was discussed at the ARGUS collaboration meeting on December 2nd. Extract from the minutes: IPv6 support: not really tested but no problem expected as Java has a good IPv6 support and as ARGUS is binding to all interfaces/addresses. Sharing a lot of network code with VOMS Admin that is IPv6 compliant: the only known issues are with VOMS that is not using the same code.

Middleware Readiness WG


The http://indico.cern.ch/e/MW-Readiness_14 meeting yesterday, Dec. 2nd, was virtual. Summary:

  • New MW versions are now being tested via the ATLAS workflow. The dCache v.2.10.44 and v.2.14.0 and HTCondor v.8.4.1 verifications are already completed.
  • BDII v. 5.2.23 is the new MW product being verified at Brunel on CentOS7; its verification is completed at GRIF-IRFU.
  • The ARC-CE v.5.0.3 verification is completed for CMS.
  • The ARGUS Collaboration met twice since our last meeting: on Nov. 6th and Dec. 2nd.
  • Please comment on the suggested date for the next meeting: Wednesday 20th January 2016 at 4pm CET. Objections with alternative dates should be sent to wlcg-ops-coord-wg-middleware@cern.ch

Full minutes are here with details per product and site. The new verified top-BDII is now in UMD (done by EGI).

Multicore Deployment

Network and Transfer Metrics WG


RFC proxies

  • CMS have switched test pilot factories to RFC proxies

Squid Monitoring and HTTP Proxy Discovery TFs

  • Nothing to report

Action list

| Creation date | Description | Responsible | Status | Comments |
| 2015-06-04 | Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing | Andrea Manzi | ONGOING | GGUS:114076 is now closed. However, host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. Otherwise we will get hit early next year when the change finally comes in Globus 6.1. 15 GGUS tickets opened for SRM and Myproxy certificates not correct, 6 already closed. OSG and EGI contacted (Maarten alerted the few affected EGI sites as well). 5-6 tickets still open at the 2015-12-03 meeting. All for sites which have no technical issues to proceed. |
| 2015-10-01 | Follow up on reporting of number of processors with PBS | John Gordon | CLOSED | Everyone uses the development instance. |
| 2015-10-01 | Define procedure to avoid concurrent OUTAGE downtimes at Tier-1 sites | SCOD team | ONGOING | A Google calendar is not yet available, therefore the only way for the moment is to check GOCDB and OIM for downtimes using the links that will be provided for convenience in the minutes of the 3 o'clock operations meeting |
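The check for concurrent OUTAGE downtimes described in the last action can be sketched in Python as an interval overlap test. The sketch assumes the downtimes have already been collected (for example by hand from GOCDB and OIM) into simple (site, start, end) tuples. It does not query either service. Only the NDGF-T1 date comes from these minutes: its exact start and end times, and the other two sites, are hypothetical.

```python
from datetime import datetime
from itertools import combinations


def overlapping_outages(downtimes):
    """Return the pairs of different sites whose outage windows intersect."""
    clashes = []
    for first, second in combinations(downtimes, 2):
        site_a, start_a, end_a = first
        site_b, start_b, end_b = second
        # Two intervals overlap when each one starts before the other ends.
        if site_a != site_b and start_a < end_b and start_b < end_a:
            clashes.append((site_a, site_b))
    return clashes


downtimes = [
    # Full-day outage announced in these minutes; the times are assumed.
    ("NDGF-T1", datetime(2015, 12, 14, 0, 0), datetime(2015, 12, 15, 0, 0)),
    # Hypothetical entries for illustration.
    ("T1-A", datetime(2015, 12, 14, 8, 0), datetime(2015, 12, 14, 12, 0)),
    ("T1-B", datetime(2015, 12, 16, 9, 0), datetime(2015, 12, 16, 17, 0)),
]

for site_a, site_b in overlapping_outages(downtimes):
    print(f"Concurrent OUTAGE: {site_a} and {site_b}")
```

With the sample data, only the NDGF-T1 and T1-A windows are reported as concurrent.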

Specific actions for experiments

| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |

Specific actions for sites

| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |
| 2015-11-05 | ATLAS would like to ask sites to provide consistency checks of storage dumps. More information and More details | ATLAS | - | Status not clear at the 2015-12-03 Ops Coord meeting (ATLAS absent) | None | - |

AOB

-- MariaDimou - 2015-12-01

Topic revision: r37 - 2018-02-28 - MaartenLitmaath