Week of 090427

WLCG Baseline Versions

WLCG Service Incident Reports

  • This section lists WLCG Service Incident Reports from the previous weeks (new or updated only).

GGUS Team / Alarm Tickets during last week

Weekly VO Summaries of Site Availability

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Jean-Philippe, Nick, Jan, Harry, MariaDZ, MariaG, Stephane, Alessandro, Julia, Roberto, Patricia);remote(Jeff, Gareth, Gang, Luca, Michael).

Experiments round table:

  • ATLAS reports - Alessandro and Stephane will share attendance from now on. This is a detector muon week, so ATLAS will probably start cosmics data taking on Wednesday. There is a small issue with the length of time it takes directly routed GGUS tickets to reach US Tier2 sites. MariaDZ reported that BNL should be migrating from GOCDB to the OSG OIM, which will affect this. Michael added that this has been delayed by the ATLAS reprocessing campaign but is now scheduled for week 21 (second half of May), and that BNL staff are aware of the ticket routing issue.

  • ALICE - Production is running smoothly with 10000 concurrent jobs. The French WMS (at GRIF) has been downgraded from the (uncertified) 3.2 level and should soon be ready to resume production. A small directory configuration issue was found at IN2P3 over the weekend, so one module of AliEn will be replaced on all AliEn servers during this week.

  • LHCb reports - They will be testing the merging of MC09 files this week in order to optimise the CPU/wall-clock ratio. Several problems at FZK: an unstable WMS, several worker nodes with certificate problems and an unresponsive SRM endpoint. The SRM endpoint at CNAF was not returning the correct TURL; Luca had looked at this ticket and could not see an error, while Roberto added that it is the opening of the TURL that is not working, and both agreed to investigate further. JT asked whether their current test was over, since LHCb jobs had dried up over the weekend; the reply was to expect merging jobs later this week.

Sites / Services round table:

  • ASGC: SRM ramping up and down during the past three days due to the 'Big Id' problem. The site has put a cron job in place to fix the missing ids every 10 minutes, but the problem still exists. This is a CASTOR bug and the CASTOR development team is already working on it; more time is needed.

  • ASGC (Jason):
    • catalog service
      • LFC purged and continuing with the new table space recreated last Friday
      • considering migrating other VOs away from the production instance serving ATLAS
      • adding LFC redundancy later this week
      • ACL updated this morning to add default user/group access control for /atlas/Role=production and /atlas/Role=lcgadmin
    • FTS upgrade from 2.0 to 2.1, next week.
    • SAM availability drop due to a skewed system clock; fixed last Thursday
    • S2 SAM availability drop
      • due to the bigId issue (Savannah ticket opened by CASTOR dev: https://savannah.cern.ch/support/?106879); data registration sometimes fails with a missing id type. The majority of the missing ids refer to the userfile type in the id2type table
      • restarting the SRM server/daemon suppresses the error, but a new patch from CASTOR development is needed to resolve the OCCI 'returning into' clause issue (a rough sketch of such a restart watchdog follows this report)
    • ATLAS
      • ATLAS data cleanup
      • T2 DPM data cleanup - help from J-P with a script handling the purging of data from a specific VO, e.g. ATLAS
    • SAM availability drop due to the ACLs of the catalog after migrating to the new table space

    • CMS
      • some data files could not be migrated to tape (CompOps Savannah ticket)
      • manual fix by Data Ops
      • drop of SAM availability due to incorrect group permissions on the .globus dir (old LCG CEs).
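
As a very rough illustration of the restart mitigation mentioned in the S2 item above (restarting the SRM daemon suppresses the missing-id errors until the CASTOR patch arrives), a 10-minute cron job could look like the sketch below. This is not the actual ASGC cron job: the log location, the error signature and the restart command are placeholders that would have to be adapted to the real installation.

    #!/usr/bin/env python
    # Sketch of a cron-driven watchdog for the 'Big Id' symptom described above.
    # Everything configurable here is a placeholder, NOT the real ASGC setup:
    # the log file, the error pattern and the restart command are assumptions.
    import subprocess

    SRM_LOG = "/var/log/srm/srm.log"                 # placeholder log location
    ERROR_PATTERN = "missing id"                     # placeholder error signature
    RESTART_CMD = ["service", "srm", "restart"]      # placeholder restart command

    def recent_errors(logfile, pattern, lines=2000):
        """Return True if the pattern appears in the tail of the log file."""
        try:
            tail = subprocess.Popen(["tail", "-n", str(lines), logfile],
                                    stdout=subprocess.PIPE).communicate()[0]
        except OSError:
            return False
        return pattern in tail.decode("utf-8", "replace")

    if __name__ == "__main__":
        # intended to be run from cron, e.g. every 10 minutes: */10 * * * *
        if recent_errors(SRM_LOG, ERROR_PATTERN):
            subprocess.call(RESTART_CMD)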

  • NL-T1 (JT): During the scheduled weekend power-off downtime the intention was to keep the site visible by having an LDAP server on the backup power supply, but unfortunately its network switch was not on backup power.

  • CERN Databases (MG): the 3-monthly Oracle security patches are being prepared.

AOB: (MariaDZ) VO admins of the LHC experiment VOs should go to https://lcg-voms.cern.ch:8443/vo/YourVOname/vomrs, create the groups "TEAM" and "ALARM" and add the DNs of their authorised teamers and alarmers. If you can't remember who is who, please contact Guenter.Grein@iwr.fzk.de for the Teamers' list and read the Alarmers' list in the Alarms twiki.

NT: asked whether there was interest from the experiments in a generic SAM test to check whether sites have installed the worker node 'hostname.lsc' files that obviate the need to renew the VOMS server certificates annually. Apparently voms-proxy-info returns a non-zero return code in this case; this is tested by LHCb, although it does not affect job execution. A rough sketch of such a check is given below.
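
A minimal sketch of what such a generic worker-node check could look like follows. This is not an existing SAM probe: the vomsdir location and the example VO list are assumptions about a typical gLite worker node; the exit-code behaviour of voms-proxy-info is the one mentioned above.

    #!/usr/bin/env python
    # Sketch of a generic worker-node probe for the 'hostname.lsc' setup
    # discussed above. NOT an existing SAM test: the vomsdir path and the
    # example VO list are assumptions about a typical gLite worker node.
    import os
    import subprocess
    import sys

    VOMSDIR = "/etc/grid-security/vomsdir"   # assumed default location of *.lsc files
    VOS = ["atlas", "alice", "cms", "lhcb"]  # VOs to check; adjust per site

    def lsc_files_present(vo):
        """True if at least one <hostname>.lsc file exists for the given VO."""
        vodir = os.path.join(VOMSDIR, vo)
        if not os.path.isdir(vodir):
            return False
        return len([f for f in os.listdir(vodir) if f.endswith(".lsc")]) > 0

    def proxy_verifies():
        """voms-proxy-info exits 0 when the proxy and its VOMS attributes verify."""
        return subprocess.call(["voms-proxy-info", "-all"]) == 0

    if __name__ == "__main__":
        missing = [vo for vo in VOS if not lsc_files_present(vo)]
        proxy_ok = proxy_verifies()
        if missing:
            print("WARNING: no .lsc files found for: " + ", ".join(missing))
        if not proxy_ok:
            print("WARNING: voms-proxy-info returned a non-zero exit code")
        sys.exit(1 if (missing or not proxy_ok) else 0)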

Tuesday:

Attendance: local(Jan, Jamie, Dirk, Sophie, Stephane, Jean-Philippe, Harry, Andrea, Olof, Patricia, Roberto);remote(Gang, Michel, Jeremy, Michael, Gareth, JT).

Experiments round table:

  • ATLAS STEP preparation (Graeme Stewart)
Dear All

I have two pieces of news important to sites re. the STEP09 challenges.

The first is for T1s, where a new project, step09, has been defined.
Tape families should be set up for this project with the expectation that all data here can be deleted. This project will produce files in both the ATLASDATATAPE and ATLASMCTAPE areas.

The second more concerns the EGEE Tier-2s. Here we would like sites to

1. Enable support for the /atlas/Role=pilot VOMS role to enable wide testing of analysis through the panda system.

2. Configure their fairshare targets for ATLAS to be:

50% for /atlas/Role=production
25% for /atlas/Role=pilot
25% for /atlas

Further information about STEP09 is being gathered in the twiki, where these requests are recorded. There will be a discussion on STEP09 at this week's ADC Operations meeting.

https://twiki.cern.ch/twiki/bin/view/Atlas/Step09

We will likely ask for one or two sites per cloud to perform a more detailed analysis of which jobs were run during STEP to evaluate the performance of the different infrastructures and their interaction at the sites.

Thanks in advance

Graeme

  • ATLAS reports (Stephane) - Information on the first transfers of muon calibration data can be found at https://prod-grid-logger.cern.ch/elog/ATLAS+Computer+Operations+Logbook/3260. Two main problems: during the night there was a problem with a disk server at RAL (GGUS ticket) - the disk was isolated, some data were not accessible and the disk will stay offline until tomorrow. Gareth: failure of the server - one disk failed and during the rebuild another one did too; power cycling got it back and the initial rebuild of the RAID array completed. Another suspicious disk means another rebuild before returning to production, likely tomorrow. Stephane: another, smaller issue with CNAF - no file size was returned (GGUS ticket) - expecting feedback from CNAF. Taipei is ready to start as soon as the SE is ready. Export of muon calibration data to Germany, Italy, the US and BNL has started. Gang: still waiting for some scripts from the CASTOR team to delete data on the SE. JPB: discussed this with Jason; the script JPB sent for DPM is also usable for the LFC.
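
For illustration only, a per-VO cleanup helper of the kind discussed above could be sketched as below. This is not the actual script sent by Jean-Philippe: it only enumerates deletion candidates via the dpns-ls command line tool, the path prefix and VO group are placeholders, the ls-style column layout is an assumption, and the deletion step itself is deliberately left out because the right tool depends on the DPM/LFC setup.

    #!/usr/bin/env python
    # Minimal sketch (NOT the actual cleanup script) of a per-VO purge helper:
    # walk a DPNS directory tree and list the files owned by a given VO group.
    # Assumptions: DPNS_HOST points at the DPM name server, 'dpns-ls -l' prints
    # ls -l style columns, and deletion is left as a site-specific final step.
    import subprocess
    import sys

    def walk(path, vo_group, candidates):
        """Recursively collect files under 'path' whose group matches vo_group."""
        out = subprocess.Popen(["dpns-ls", "-l", path],
                               stdout=subprocess.PIPE).communicate()[0]
        for line in out.decode("utf-8", "replace").splitlines():
            fields = line.split()
            if len(fields) < 5:
                continue
            mode, group, name = fields[0], fields[3], fields[-1]
            child = "%s/%s" % (path, name)
            if mode.startswith("d"):
                walk(child, vo_group, candidates)
            elif group.startswith(vo_group):
                candidates.append(child)

    if __name__ == "__main__":
        # hypothetical usage: purge_candidates.py /dpm/example.org/home/atlas atlas
        top, vo = sys.argv[1], sys.argv[2]
        found = []
        walk(top, vo, found)
        for f in found:
            print(f)  # replace with the appropriate deletion command for your setup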

  • CMS reports (Daniele) - In brief:
    • quite some consistency checks and files invalidation activity in progress in several sites
      • in particular: CMS DataOps is supporting Gang/ASGC to check the consistency of the datasets on-site
    • a zlib.pm packaging problem is causing some operational overload at some sites in MC prod: being addressed at the Offline/CMSSW level
    • MWGR Custodial data slow to show up on T1_IT_CNAF_MSS: Castor MigHunter issues recently, now should be ok - need to confirm via ops though
    • minor transfer issues in some routes, all being followed up among CMS contacts
    • CE issues at T2_US_Florida: no OIM notification, so CRAB users ran into it, but the problem is known and is being investigated

No detailed ticket report today.

  • ALICE (Patricia) - Problem with AliEn authentication at 20:00. Some operations were performed manually - this should have been transparent but it blocked the DB for some time; once the service was restarted the problems disappeared. Grid33 (WMS 3.2) was put in production yesterday and jobs that could not be killed by the system started to appear; the job wrapper of WMS 3.2 is not running properly, so the system is in drain mode. Michel: the last problem mentioned was minor, affecting only LSF, and is now fixed. The system is unstable but should have no impact on ALICE as there are 2 WMS: if the 3.2 WMS is drained, jobs should go to the other one.

  • LHCb reports (Roberto) - T1 WMS instances - all of them - show the same main symptom: all WMS operations time out after 2 minutes. At PIC an issue with the CRL, promptly fixed. SRM endpoint at GridKa: issue with TURLs (GGUS ticket). Lyon: the gsidcap server closes the connection while a file is being read by the application. Sophie: LHCb is having problems getting virtual IDs. Roberto: new groups were being created in VOMS but some details regarding the LFC were forgotten - rolled back.

Sites / Services round table:

  • NL-T1 (JT) - the bandwidth-limited SARA-NIKHEF link was increased to 10 Gb yesterday for ATLAS. The 3D DB upgrade gear is on site. Question on the ATLAS LFC: NL-T1 has a separate LFC for ATLAS but cannot tell it to use more than 99 DB connections - can this be changed? JPB: no, but he will look at ways to improve that.

  • ASGC:
    • SRM temporarily recovered from 'Big id' problem after restarting the SRM server.
    • Waiting for scripts from the CASTOR team to delete ATLAS data on the SE.

  • RAL (Gareth) - 'at risk' for CASTOR VDQM today, and Oracle patches tomorrow will mean another 'at risk'.

  • CERN (Olof) - a short scheduled LSF intervention is needed at 07:00 to upgrade to the latest patch suite. It is an intrusive intervention for the time it takes to fail over to the secondary master; no jobs will be lost.

AOB:

Wednesday

Attendance: local(MariaD, Jean-Philippe, Harry, Alessandro, Diana, Antonio, Dirk);remote(Gang Qin, Joel Closier, Onno (SARA), Michel (GRIF), Jeremy (GridPP), Daniele (CMS), Gareth (RAL), Luca (CNAF), Michael (BNL)).

Experiments round table:

  • ATLAS reports - (Alessandro) Cosmic data taking this week and next: no central (Tier-0) DPD production (apart from the track and muon ntuples) and no central distribution (apart from the muon ntuples and calibration).

  • CMS reports - (Daniele) CMS ticket review restarting. File-level consistency checks ongoing with CNAF; still some corruption and files to be recalled, several tickets open. FZK: problems with tape recalls due to a lack of pre-staging tactics on the CMS side; these will be implemented by CMS before STEP. Still some T2 transfer errors due to a few problematic files in data sets.

  • ALICE (reported after the meeting by Patricia) - Setup of the new WMS at CERN (gswms01) inside the ALICE production is in progress. After the testing phase the system is ready to enter production at CERN only (as a starting point). A new minor upgrade of AliEn v2.16 was performed this afternoon at FZK; this upgrade will now be distributed to all sites centrally from CERN, following a procedure that is transparent for the sites.

  • LHCb reports - (Joel) Since Monday WMS problems at several T1s: timeout problems during submission. All sites are aware of the problem and some fixes went in, to be checked by LHCb. Joel pointed out that sites should take any necessary steps so that these fixes are retained after the next update or release. Joel reminded about open tickets at FZK and Lyon. The invalid TURL problem at CNAF should be fixed now; LHCb will confirm.

Sites / Services round table:

  • SARA: issue with SRM late night/early morning caused by a full file system. Solved since the morning.
  • GRIF: New WMS problems with the job wrapper. Now fixed and back in production.
  • RAL: one disk server back in production after problems.
  • CNAF: CASTOR - some problems with an rfio process on a disk server: the process apparently hangs for a long time. The CASTOR dev team is in contact with CNAF and has suggested steps for further analysis.

Question by Joel: Can RAL & CNAF confirm that WMS for LHCb has been fixed? RAL: working on it. CNAF: will get back by email.

  • GRIF: Michel requested that a summary of the LHCb WMS problems be circulated once the issues are fully understood.
  • BNL: major move from the non-space-token area to the space-token-controlled area. Time consuming, as the user mapping had to be adjusted before the operation, so there was some increased error rate while the production queue was affected (permission errors from panda). This has now been fixed and no open issues are left.

AOB: Antonio described the glite release plans. A summary can be found at https://twiki.cern.ch/twiki/bin/view/LCG/ScmPps

Thursday

Attendance: local(Alessandro, Harry, Diana, MariaD, Jamie, JanI, Olof, Dirk);remote(Gang, Jeremy, Michael, Michel, Gareth, Daniele, Luca).

Experiments round table:

  • ATLAS reports - (Alessandro) No issues to report (spontaneous applause).

  • CMS reports - (Daniele) PIC is raising worries about the large (O(100) GB) files distributed by CMS recently, because of insufficient WN storage space; CMS is discussing options to limit the currently unlimited file size. Some staging problems at FZK and IN2P3. Ongoing debugging at CNAF for files with migration problems.

  • ALICE -

Sites / Services round table:

  • RAL - (Gareth) another disk server out with problems (part of the ATLAS MC disk); being worked on. ATLAS asked: did RAL declare a downtime? No, it was just announced on the email list, as previous discussions on the granularity of announcements had concluded that announcing individual disk servers would generate too much traffic. Jamie suggested coming back to this topic for discussion in a larger forum at one of the next operations workshops.

  • CNAF - (Luca) CMS migration problems being addressed, but not yet fixed. The WMS problems reported by LHCb are now fixed.

  • CERN - (JanI) CMS export server has been down due to a configuration error. This had no impact on data transfers.

AOB:

Friday

  • No meeting - public holiday (at least in some places...)

-- JamieShiers - 24 Apr 2009
