WLCG Operations Coordination Minutes - February 19th, 2015

Agenda

Attendance

  • local: Maria Dimou (chair), Andrea Sciaba' (minutes), Alexei Zhelezov (LHCb), Andrea Manzi (MW officer), Maarten Litmaath (ALICE), Oliver Keeble (IT-SDC), Prasanth Kothuri (IT-DB), Alberto Aimar (IT-SDC), Alessandro Di Girolamo (ATLAS)

  • remote: Renaud Vernet (IN2P3-CC), Yury Lazin (NRC-KI), Maite Barroso (Tier-0), Ron Trompert (NL-T1), Alessandra Forti (WLCG ops coordinator), Christoph Wissing (CMS), Jeremy Coles (GridPP), Di Qing (TRIUMF), Catherine Biscarat (IN2P3-CC), Gareth Smith (RAL), Andrej Filipcic (ATLAS)

Operations News

  • Speakers for the WLCG Operations Workshop in Okinawa are now being contacted. See the current agenda version here.
  • Sites whose survey answers showed a misunderstanding of the fields' meaning have been contacted. Some corrections were received.
  • The new date for VOMRS decommissioning is March 2nd.
  • We shall discuss later today the proposal to form a new HTTP Deployment Task Force.
  • Savannah was closed down this morning at 9am CET. Last updates are summarised here.

Middleware News

  • Baselines:
    • New UMD release this week (UMD 3.11.0):
      • APEL client (parser and publisher) v1.3.1, several fixes including corrected handling of multiple FQANs in blah logs
      • APEL-SSM 2.1.5, bug fixes + removal of SSLv3 support
      • CREAM 1.6.4 (also verified by MW Readiness), including bug fixes
      • Gfal-utils 1.1.0, enhancements and bug fixes
      • DPM 1.8.9, several high-impact performance improvements in the core components

  • MW Issues:
    • The Argus failures seen with the latest Java versions (GGUS:111505), reported at the last meeting, have been fixed; the new version of ARGUS-PAP (1.6.4) is now in EMI
    • There is a DPM issue affecting one LHCb site in Brazil, where the checksum is not stored after an incoming transfer (GGUS:111493); a fix is under development (a small gfal2 checksum-query sketch follows this list)
    • FTS3 suffers from an issue reported by ATLAS, in which shares are not respected. The developers are working on a fix.

  • T0 and T1 services
    • CERN
      • EOSAlice and EOSLHCb upgraded to EOS Aquamarine version.
    • KIT
      • upgraded xrootd for ALICE to v4.0.4
    • JINR-T1
      • both dCache instances will be moved to new hardware in early March
    • IN2P3
      • Upgrade to 2.10.14+ on 24/02/2015
    • RRC-KI-T1
      • Upgraded dCache for ATLAS to 2.10.17
      • planned upgrade of dCache for LHCb to 2.10.17
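
For context on the DPM checksum issue above: a client can ask the storage endpoint for the checksum it has recorded and compare it with the catalogue value. The sketch below is a minimal illustration using the gfal2 Python bindings; the SURL, the expected value and the surrounding logic are hypothetical, not taken from the affected site.

    import gfal2

    # Hypothetical SURL and catalogue value, for illustration only.
    SURL = "srm://storage.example.org/dpm/example.org/home/lhcb/data/file.dst"
    EXPECTED_ADLER32 = "0a1b2c3d"

    ctx = gfal2.creat_context()
    try:
        # Ask the storage endpoint for the ADLER32 checksum it has stored.
        stored = ctx.checksum(SURL, "ADLER32")
    except gfal2.GError as exc:
        print("checksum query failed: %s" % exc)
    else:
        ok = stored.lower() == EXPECTED_ADLER32.lower()
        print("stored=%s expected=%s match=%s" % (stored, EXPECTED_ADLER32, ok))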

Tier 0 News

  • LHCb LFC migration: LHCb will migrate from LFC to DFC on February 23rd; after that, LFC and DFC will be kept running in parallel for some time, until LHCb are fully confident that we can retire the LFC.
    • the only LFC instance left will then be the "shared" one; from then on we will be monitoring its usage, with the aim of retiring it at the same time as the LHCb one and fully decommissioning the LFC service at CERN
  • VOMRS decommissioning and replacement by VOMS-admin: after discussion with the experiments, and due to different constraints, the final date to decommission VOMRS will be Monday March 2nd

Maarten pointed out that EGI still relies on the "shared" LFC instance for the OPS tests, therefore Maite will contact EGI.

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • normal to high activity
  • CERN: ALARM ticket (GGUS:111753) Fri evening Feb 13 when all the old gLite VOBOXes were found to suffer SSL errors on job submission.
    • The trouble was thought to be due to a Java update on the CEs on Thu/Fri:
      • SSLv3 got disabled by default.
      • Re-enabling it did not help, though.
    • The one WLCG VOBOX in use kept working OK.
    • It was decided to phase out the old VOBOXes a few weeks ahead of schedule, replacing them with more WLCG VOBOXes (done).
  • KIT: ALARM ticket (GGUS:111813) Tue evening Feb 17 because of GGUS ticket update e-mail flood
    • cured in the evening (thanks!), solved Wed morning

ATLAS

  • exploiting almost all the resources.
    • single core
    • multicore: we still see that there is room for improvement in terms of how many resources we really get
      • we are in contact with Alessandra on this.
      • we have agreed to extend the job length from 100 to 200 events, which would double the duration of the jobs, thus limiting the starvation
    • analysis is running as usual
  • MC15a is in the final validation stages: the bulk of 1G events will be submitted in one or two weeks, all multicore; this workload is foreseen to last approximately 3 months
  • ATLAS cosmics data taking ongoing (M8).
    • Tier-0 running smoothly with the dedicated LSF master (2.5k slots). Data from P1 are still going to CASTOR.
    • Data replication to the Grid has been setup with the RPG (Replication Policy on the Grid) tool.
  • FTS3: great collaboration with the experts, but we are nevertheless suffering from bugs, and we do not understand why they were not spotted before: shares are not respected, i.e. Tier-0 data, DataConsolidation and User transfers all end up in a single queue, so many user requests could block e.g. Tier-0 export (a conceptual shares sketch follows this list). For now we will work around this by changing priorities on the ATLAS side, but this is not a clean solution, only a temporary one.
  • SARA data loss: data produced with ProdSys2 were automatically recovered, but "old" data produced with ProdSys1 need manual task injection into ProdSys2 - ongoing now.
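
For context on the FTS3 shares problem above: activity shares are relative weights per activity within a VO, and when they are honoured the scheduler serves queued transfers roughly in proportion to those weights rather than from a single FIFO queue. The sketch below is a purely conceptual illustration of such weighted selection, not FTS3 code; the activity names and weights are hypothetical.

    import random

    # Hypothetical activity shares for one VO: relative weights, not percentages.
    SHARES = {"Tier-0 Export": 0.6, "Data Consolidation": 0.3, "User": 0.1}

    def pick_activity(shares, queued):
        """Pick the next activity to serve, in proportion to its configured weight.

        shares: activity name -> relative weight
        queued: activity name -> number of transfers currently waiting
        Returns None when nothing is queued.
        """
        candidates = {name: w for name, w in shares.items() if queued.get(name, 0) > 0}
        if not candidates:
            return None
        names = list(candidates)
        weights = [candidates[n] for n in names]
        return random.choices(names, weights=weights, k=1)[0]

    # With honoured shares, a flood of User transfers cannot starve Tier-0 Export.
    queued = {"Tier-0 Export": 20, "Data Consolidation": 500, "User": 3000}
    print(pick_activity(SHARES, queued))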

Alessandro clarified that ATLAS is already testing EOS for the Tier-0.

CMS

  • Cosmics data taking with magnet off started last week (CRUZET: Cosmic RUn at ZEro Tesla)
    • Continuous Tier-0 operation
  • Problems with Tier-0 virtual machine provisioning
    • The switch from http to https in the OpenStack endpoint was somehow missed (SNOW:0732469)
  • Problems reading CMS streamer files from CASTOR last weekend, GGUS:111754
  • Establishing proper communication channels for OpenStack problems
    • The OpenStack infrastructure is mission critical for the CMS Tier-0
  • Production/Processing overview
    • Upgrade Digi-Reco on Tier-1s
    • Bigger Run2 MC production campaign ongoing at Tier-2s
  • Usage of CRC32 checksum
    • CMS presently provides both CRC32 and Adler32 checksums
    • Since Adler32 is the one primarily used, dropping CRC32 is being considered
    • It is still being checked whether CRC32 is used anywhere, e.g. at sites for transfer validation (a short checksum sketch follows this list)
  • Tape staging test at Tier-1 sites
    • Finished successfully: CNAF, PIC and RAL
    • Ongoing: KIT, CCIN2P3 and FNAL
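
For reference on the checksum discussion above, both Adler32 and CRC32 can be computed with the Python standard library. The minimal sketch below (the file name and the helper are hypothetical) renders them as 8-digit hex strings.

    import zlib

    def file_checksums(path, chunk_size=1 << 20):
        """Compute Adler32 and CRC32 of a file incrementally (hypothetical helper)."""
        adler, crc = 1, 0  # zlib's documented starting values
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                adler = zlib.adler32(chunk, adler)
                crc = zlib.crc32(chunk, crc)
        # Mask to 32 bits and render as 8-digit hex strings.
        return "%08x" % (adler & 0xFFFFFFFF), "%08x" % (crc & 0xFFFFFFFF)

    print(file_checksums("/tmp/example.dat"))  # hypothetical file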

Alessandro asked if CMS is writing to EOS in the Tier-0 tests; Christoph said that this is the case, though CASTOR is still used at some level.

Alessandro asked for more details on the CMS tape tests, to compare the targets with the ATLAS needs. Christoph will retrieve the information and add it to these minutes.

Andrea S. asked if CMS is seeing any problem with the FTS3 fair shares: Christoph said that there is no evidence of that, and that CMS is not using intra-VO shares at this time. Maarten added that the buggy code may also affect inter-VO fair shares, but Alessandro pointed out that this would be difficult to observe, as very few links are shared by more than one VO.

Maite asked if CMS expects to use other channels besides GGUS to report issues to the OpenStack team; Christoph answered that people in CMS will be instructed to only use GGUS.

LHCb

  • Operations:
    • "Run1 Legacy stripping" finalization
  • Other:
    • migrate our file catalog from the LFC to the DIRAC File Catalog (DFC) next Monday; the draining phase starts this weekend

Proposal of an HTTP deployment task force (Oliver Keeble)

Oliver explained the motivations for an HTTP deployment task force and proposed a mandate, to be approved by the experiments and the sites (see slides). The TF would take care of 1) identifying the functionality set required by the experiments, 2) providing recipes and recommendations to the sites to enable such functionality and 3) coordinating and tracking the deployment in WLCG.

Jeremy argued that it might be too early to form a task force, as we are still at the testing stage. Oliver asked the experiments to give their opinion on the need for a wide deployment. Maria decided to define an action: by the next meeting, and after offline discussions, each experiment should give a statement on the need for the deployment of HTTP. Alessandro asked how much effort would be needed from the experiments, and Oliver replied that it would mostly be limited to defining the required functionality, i.e. a relatively small effort, not more than what is already being dedicated to this topic, but in a more formalised form.

Ongoing Task Forces and Working Groups

gLExec Deployment TF

  • gLExec in PanDA:
    • testing campaign is covering 54 sites so far
      • including almost all of the big sites in EGI, plus BNL in OSG
      • main sites currently being debugged: RAL and RALPP
      • more sites to be added in the coming weeks

Andrea S. asked if the gLExec SAM tests run by ATLAS help in finding issues; Maarten confirmed that they are useful but do not cover some corner cases.

SHA-2

  • VOMS configuration reminders will be broadcast when VOMRS has been stopped (March 2)

Machine/Job Features

  • NTR

Middleware Readiness WG

  • Thanks to the new sites that installed the MW Package Reporter. Instructions here.
  • The more Package Reporter installations we have, the better we can prototype and suggest a functional display view of the results.
  • Very good progress is being made at QMUL for the StoRM Readiness verification via the ATLAS workflow.
  • The MW Officer is working on the MW database view that will fetch candidate release versions for Readiness verification from the repositories and display them.
  • Remember our next meeting in one month, on Wednesday March 18th at 4pm CET. Please check the Action List.

Alberto proposed to drop the name "package reporter" and use "Pakiti client": the suggestion was accepted, but it cannot be implemented until some ongoing work with the Pakiti client is finished.

Answering Alessandro, Andrea M. clarified that the package reporter provides information on the installed packages, but not on the configuration or on what functionality is actually available (e.g. HTTP access).

Multicore Deployment

IPv6 Validation and Deployment TF

Squid Monitoring and HTTP Proxy Discovery TFs

  • Nothing to report

Network and Transfer Metrics WG

NTR.

Action list

  • ONGOING on the WLCG monitoring team: evaluate whether SAM works fine with the HTCondor-CE. Status: HTCondor-CE tests enabled in production on SAM CMS; sites publishing sam_uri in OIM will be tested via HTCondor (all others via GRAM). The number of CMS sites publishing an HTCondor-CE is increasing. For ATLAS, Alessandro announced that a solution has been found with the USATLAS experts to publish the HTCondor-CEs in the BDII and OIM in a way that satisfies both ATLAS and SAM needs. It is not a long-term solution, but it should be good for the next six months. Before closing the action, though, we need a confirmation by Marian.
    • Ongoing discussions on publication in AGIS for ATLAS.
  • ONGOING on experiment representatives - report on voms-admin test feedback
    • Experiment feedback and feature requests collected in GGUS:110227
  • NEW on Oliver and the experiment representatives - HTTP deployment task force
    • The experiments should give their position on the need for, and the mandate of, the task force

AOB

  • The next meeting is on March 5th.

-- AndreaSciaba - 2015-02-17
