WLCG Operations Coordination Minutes - November 6th, 2014

Agenda

Attendance

  • local: Nicolò Magini (secretary, CMS), Andrea Sciabà, Alessandro Di Girolamo (ATLAS), Andrea Manzi (Middleware officer), Maarten Litmaath (ALICE), Pepe Flix (PIC), Antonio Perez-Calero Yzquierdo (PIC), Nathalie Rauschmayr, Marian Babik, Felix Lee (ASGC), Marcin Blaszczyk, Dave Dykstra

  • remote: Alessandra Forti (chair), Yury Lazin (RRC-KI-T1), Andrej Filipcic (ATLAS), Massimo Sgaravatto, Michael Ernst (BNL), Jeremy Coles (GridPP), Rob Quick (OSG), Thomas Hartmann (KIT), Ulf Tigerstedt (NDGF), Di Qing, Alessandro Cavalli (INFN-T1), Renaud Vernet (CCIN2P3), Burt Holzman (FNAL)

Operations News

Middleware News

  • the WLCG repository is planned to become signed next week
    • the change will be announced to all WLCG sites
    • it is backward-compatible: old configurations ignore the added signatures
    • new repository files will be available in rpms that also provide the public key of the repository

  • Baselines:
    • gfal/lcg-util has been unsupported since 1 November (removed from the baseline table). The replacement, gfal2/gfal2-util, is already available in the EMI repos and will be added to UMD this week (together with new metapackages for UI and WN).
    • new reinstallation instructions for perfSONAR are available. All information has been broadcast by the Network and Transfer Metrics WG
  • MW Issues:
    • nothing to report
  • T0 and T1 services
    • CERN
      • Castor upgrade to 2.1.14-15 by the end of November
    • IN2P3
      • dCache upgrade to 2.6.36 on pools. xrootd configuration for FAX and AAA to respect EU privacy policy
      • FTS2 decommissioned
      • FTS3 upgraded to 3.2.29
    • BNL
      • FTS3 upgraded to 3.2.29
    • JINR-T1
      • srm-cms-mss.jinr-t1.ru upgrade to 2.10 on 07-11-2014 (dCache tape)
    • KIT
      • Update of dCache for CMS and LHCb to 2.6.35. FAX configuration updated to meet EU privacy policy.
      • Update of xrootd configuration for AAA to respect EU privacy policy is pending
    • NL-T1
      • dCache upgraded to 2.6.37
    • RRC-KI-T1
      • Tape dCache upgraded to 2.6, LHCb dCache upgraded to 2.6.34
      • planning to upgrade disk dCache for ATLAS to 2.6 and tape dCache to 2.10
    • Triumf
      • Tape System upgrade, adding more tape capacity on Nov. 12th

  • Pepe Flix adds that PIC is planning the upgrade to dCache 2.10 in December, pending the validation of the needed fix for the N2N plugin for the FAX federation.
  • NDGF is planning an upgrade of dCache from 2.10.9 to 2.10.10; and 2.11 is in testing. Not planning to use N2N.
  • Agreed to test integration of federation plugins with new storage versions as part of the middleware readiness activities.

Oracle Deployment

  • IT-DB new hardware installations in: CERN computer centre and Wigner.
  • Timeline: testing in October, production move - by the end of 2014. Schedule will be updated accordingly.
  • Following table includes only those DB services that concern WLCG

| Database | Comment | Destination | Upgrade plans, dates |
| ATONR | Data Guard for Atlas Online | Wigner | Done |
| ATLR | Data Guard for Atlas Offline | Wigner | Done |
| ADCR | Data Guard for ADC | Wigner | 10-15 Nov |
| CMSR | Data Guard for CMS offline | Wigner | Done |
| LHCBR | Data Guard for LHCB Offline | Wigner | |
| LCGR | Data Guard for WLCG | Wigner | |
| CMSONR | Data Guard for CMS Online | Wigner | Done |
| LHCBONR | Data Guard for LHCB Online | Wigner | |
| CASTORNS | Data Guard for Castor Nameserver | Wigner | Done |
| ATONR | Active Data Guard for Atlas Online | CERN CC | Done |
| ALIONR | Active Data Guard for Alice Online | CERN CC | 10-15 Nov |
| ADCR | Active Data Guard for ADC | CERN CC | 10-15 Nov |
| CERNDB1 | CERNDB1/EDMSDB/ACCDB | CERN CC | Tuesday, 18th November 18:00-19:00 |

  • The CMSONR Data Guard in Wigner is not an active instance; it is used for backup. Dave Dykstra suggests also having a CMSONR Active Data Guard instance in Wigner, for redundancy with the Meyrin CMSONR ADG serving conditions through Frontier. Marcin to follow up with DB group.
  • Alessandro Di Girolamo gives a heads-up to DB team about upcoming migration from DQ2 to Rucio, which could change load patterns on ADCR.

Tier 0 News

  • Alessandro Di Girolamo reminds that the experiments will give a review of the criticality of services like VOMS, but criticality of underlying services e.g. DB on demand is the responsibility of the sites. Maite agrees that it needs internal discussion.

  • AFS UI statistics after closing lxplus5: some clients that showed up in random 3-second sampling were vocms228, vocms06, lx1b21.physik.rwth-aachen.de, vocms237, vocms31, it-hammercloud-submit-atlas-01

  • Full statistics on current AFS UI usage to be provided when possible.
  • Andrea Sciabà comments that vocms06/228 are going to be retired in a few days.

  • SNOW intervention this Saturday morning, GGUS will be available: A new major release will be applied to Service-now at CERN, called the Eureka release, next Saturday, November 8. The planned intervention and time schedule is described at: https://cern.service-now.com/service-portal/view-outage.do?n=OTG0015527
  • During the first part of the intervention, approximately from 06:00 to 09:00, the system will be completely unavailable. During the second part, approximately from 09:00 to 12:00, performance may be degraded. *During the intervention time, it is recommended to use GGUS to respond to WLCG tickets originated that way (instead of answering through SNOW)*. The ticket transfer from GGUS to SNOW will be interrupted, hence some tickets may go out of sync. This will be fixed after the upgrade. WLCG alarm tickets should not be impacted by the SNOW intervention.

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • very high activity during the last 11 days
    • 69k concurrent jobs reached Nov 5
  • recent changes caused each pilot (job agent) to exit after running a single task
    • now it again runs as many tasks as it can fit
  • CERN: the batch team found a bug that caused the CEs to publish bad job numbers occasionally
    • ALICE job submissions depend on such numbers being reasonably accurate
  • RAL: the ARC CEs now handle almost all ALICE jobs
    • the SAM tests still need some work, should be fine soon
  • the new VOMS servers have been in use across WLCG for a few weeks

  • Maarten Litmaath comments that it's not yet clear if the bug in CE publication is CERN-specific or potentially affecting other LSF sites.

ATLAS

  • Workload on the Grid: over the past weeks there were not many jobs. This week the Grid is full, but we again foresee a shortage of jobs in a few days. MCORE slots are also affected by this lack of workload. MC15 will likely start at the end of the year
  • Critical Services: we were contacted by WLCG ops coords and will update the table, but some points (like the underlying services on which the critical services rely) are still outside this review
  • many stability issues with Panda and Rucio servers due to package upgrades (nss, openssl): disable the automatic upgrades for critical services and do it in a controlled manner on preprod/test boxes
  • rucio, prodsys2 progress: migration to rucio starting mid November
  • prodsys2 used for 10% of MC tasks, group production, analysis
    • testing the MC production full chain in Prodsys2/Rucio ongoing, not fully completed yet
  • multi-core: still waiting for sites to enable it everywhere
    • resource limits passing to the batch queues:
      • need an agreement on glue parameters
      • need example implementation for all the systems
      • APF and aCT ready

  • Andrea Sciabà asks that updates about service criticality be sent to him instead of being made directly in the twiki.
  • Andrea will officially present the review of critical services at GDB/MB once the information from all experiments is collected; sites can get a preview if interested.

CMS

  • This week is CMS Offline and Computing week (at CERN)
    • Many people are even more busy
  • Processing overview:
    • New production campaign of MINIAOD (Tier-1 and Tier-2 sites)
    • Preparations for new campaigns (PHYS14 and another round of Upgrade) ongoing
  • List of critical WLCG services reviewed by CMS this week
    • Computing coordination send update to Andrea
    • Still to be understood from CMS side
      • CVMFS reliability when a Stratum-1 goes down (or falls behind in terms of status)
      • CVMFS Stratum-0 needs to include the machine used by CMS for publishing
  • Service Migrations
    • AFS-UI
      • Were the access statistics redone after the closing of lxplus5?
    • Testing of new VOMS server infrastructure
      • Little progress, no problems seen
        • CERN PhEDEx Debug transfers now running with proxy from voms2
      • Will ramp up to meet the known deadline (end of November)
  • Operational items
    • Reconfiguration campaign of xrootd for European sites started beginning of this week
  • Data transfer test T0->Tier-1 and Tape staging exercise planned for November/December
    • Details will be communicated to sites via CMS contacts (and relevant CMS HN)

LHCb

apologies from LHCb, due to the ongoing LHCb Computing Workshop we cannot attend the meeting today

  • Stripping 21 - "legacy Run1 data set" stripping
    • pre-staging of input data is progressing well (~ 1/3 of 2012 data is staged)
    • verification of the production is ongoing - start of the campaign estimated during next week
  • IPv6 - fixes done for the DIRAC framework which should enable it to run in IPv6 mode
    • dedicated dual-stack certification node has been requested for larger scale testing
  • New VOMS servers firewalls for LHCb have been opened world-wide.
    • The upcoming LHCbDIRAC patch release will use both old and new servers
    • If all goes well the release after will only use the new servers
  • voms-admin tests to be done

Ongoing Task Forces and Working Groups

gLExec Deployment TF

  • over the last few months the use of gLExec by PanDA has been completely revised and improved by Eddie Karavakis
  • for each site the usage can be configured:
    • never
    • always (mainly for testing)
    • only when it works (as determined on the spot by the pilot)
  • a deployment campaign already covers the T1 sites and 17 T2
    • to gain insight in the effects on stability and success rates across WLCG
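
The three per-site usage settings listed above can be read as a small decision rule in the pilot. The following is only an illustrative sketch in Python, not the actual PanDA pilot code: the names `GlexecMode`, `use_glexec` and the `probe` callable (standing in for the pilot's on-the-spot check that gLExec works) are all hypothetical.

```python
from enum import Enum
from typing import Callable


class GlexecMode(Enum):
    """Hypothetical per-site setting mirroring the three options in the minutes."""
    NEVER = "never"
    ALWAYS = "always"
    WHEN_WORKING = "only when it works"


def use_glexec(mode: GlexecMode, probe: Callable[[], bool]) -> bool:
    """Return True if the pilot should run the payload through gLExec.

    `probe` is an assumed stand-in for the on-the-spot test done by the
    pilot; it is only called in WHEN_WORKING mode.
    """
    if mode is GlexecMode.NEVER:
        return False
    if mode is GlexecMode.ALWAYS:
        # Mainly for testing: use gLExec even if it might fail.
        return True
    try:
        return bool(probe())
    except Exception:
        # A failing probe is treated as "gLExec does not work here".
        return False
```

With such a rule, a site configured as "only when it works" falls back to running the payload without gLExec whenever the probe fails, which is what lets the setting be enabled without affecting job success rates.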

  • Maarten and Alessandro clarify that the deployment campaign involves creating new Panda queues with gLExec enabled, so it doesn't require additional action from the sites. ATLAS will decide on the migration procedure based on the experience.

Machine/Job Features

  • NR

Multicore Deployment

  • Passing parameters to the batch system: discussion started within the TF, so far only few sites participated.
  • ATLAS:
    • 40% of the production resources used have been occupied by multicore since the beginning of September.
    • Still 37 sites to move. Work is needed on both the site and ATLAS sides to set up the queues
    • ATLAS is very interested in the outcome of the passing parameters discussion. Setting them up on the ATLAS side will not take long once the parameters are decided
  • CMS:
    • Multithreaded reconstruction application ready, first test workflows run at PIC, now moving to submission tests at T1s scale
    • Deployment to T2s: multicore pilot submission tests to be started with some candidate sites
  • Multicore TF presentation next week at the GDB

  • On Alessandro's question, Andrea Sciabà and Alessandra Forti remind that the task force mandate includes deployment, so the TF itself can follow up with sites for the deployment of its recommendations on job attributes.
  • Alessandro asks if Panda can start sending job attributes. Alessandra comments that this will not have bad effects since sites will currently not use them. Pepe and Antonio suggest to wait until the TF recommendations are more mature.
  • Alessandro reminds that some recommended attributes are in Glue2 schema, which is not deployed in OSG. Rob Quick comments that currently there is no use case for Glue2 in OSG; if experiments have a request it will be evaluated. Rob Quick will be involved in the TF for the Glue 1.3 <-> Glue 2 decision.
  • Rob Quick will involve the HTCondor-CE developers in the task force, to understand if HTC-CE supports passing job attributes.

Middleware Readiness WG

  • Pepe Flix reminds that the PIC dCache test instance is not dimensioned for stress tests; this is OK for the MWR WG activities.

SHA-2 Migration TF

  • introduction of the new VOMS servers
    • Ops and ALICE are done
    • LHCb date to be agreed
    • CMS - any news?
    • ATLAS - any news?
    • OSG will do a release on Nov 11 with the new servers added to the vomses files for ATLAS and CMS (GGUS:109265)
      • voms-proxy-init may then try the new servers first
      • it would be best if by that time the firewall is open also for those VOs!

IPv6 Validation and Deployment TF

  • NR

Squid Monitoring and HTTP Proxy Discovery TFs

Network and Transfer Metrics WG

  • Marian to check read permissions on Jira issues.
  • Alessandro asks if experiments can start to use perfSONAR monitoring in operations. Marian answers that with the upcoming deployment the metrics should be usable; the documentation will include a troubleshooting guide for experiments at all levels: site perfSONARs, data store, MaDDash.

Action list

  1. ONGOING on the experiments: check the usage statistics of the AFS UI, report on the use cases at the next meeting.
    • Maite to provide full updated statistics after lxplus5 closure.
  2. ONGOING on the WLCG monitoring team: evaluate whether SAM works fine with HTCondor CE. Status: HTCondor CE tests enabled on SAM CMS pre-production, plan is to go to production next week - all sites publishing sam_uri in OIM will be tested via HTCondor (all others via GRAM).
    • Ongoing discussions on publication in AGIS for ATLAS.
  3. ONGOING on experiment representatives - report on voms-admin test feedback
  4. ONGOING on Andrea Sciabà - review the critical services table
  5. CLOSED on MW officer - bridge communication between MW Readiness WG and EGI UMD Staged Rollout. MW Officer regularly attending UMD meetings; in recent case with dCache upgrade, the sites were taking the update directly from dCache.org rather than UMD to fix a critical bug.
  6. ONGOING on Alessandro Di Girolamo - suggest to monitoring team to organize common meeting with experiments to collect requirements for new SLS dashboards.
  7. CLOSED on Marian Babik - organize meeting to define details of IPv6 testing activities with SAM and perfSONAR. Status: perfSONAR in dual-stack recommended to ALL sites in the current install/update instructions for perfSONAR 3.4+. SAM coordination meeting will be held tomorrow (IPv6 is on the agenda).

AOB

  • Rob Quick comments that OSG is ready to send multicore accounting data to APEL. Rob Quick and Alessandra Forti to contact APEL team to ask if APEL is ready to receive it.
  • Burt Holzman asks to review the information flow to OSG for the VOMS incident. Rob Quick suggests to notify the OSG GOC at goc@openscienceNOSPAMPLEASE.org and then OSG will take care of broadcasting to its users. Agreed - mail to goc@openscienceNOSPAMPLEASE.org to be added to CERN procedures in case of outage of 'global' services like VOMS or myproxy.

  • Next meeting on November 20th

-- NicoloMagini - 2014-11-03

Topic revision: r25 - 2018-02-28 - MaartenLitmaath
 