Week of 091130

WLCG Service Incidents, Interventions and Availability

VO Summaries of Site Availability SIRs & Broadcasts
ALICE ATLAS CMS LHCb WLCG Service Incident Reports Broadcast archive

GGUS section

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

General Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions Weekly joint operations meeting minutes

Additional Material:

STEP09 ATLAS ATLAS logbook CMS WLCG Blogs


Monday:

Attendance: local(Julia, Daniele, Simone, Maria, Dirk, Sophie, Jan, Patricia, Jan, Roberto, David, Andrew); remote(Ron, Angela, Xavier, Rolf, Gonzalo, Gareth, Jason, Brian, Gang).

Experiments round table:

  • ATLAS (Simone): Three problems, possibly all related to FTS. Problem 1: no data transfers from CERN to BNL since last Friday, with segmentation faults on the agents. Ticket opened and developers are looking at the core file. Fell back to FTS 2.1. Worries in the experiment as KIT and BNL are on FTS 2.2. Problem 2: CERN to CNAF transfers failing with "channel does not exist"; traced to CNAF not publishing the right endpoint. Site notified; now OK. Problem 3: KIT to TRIUMF transfers not progressing. No error detected; querying works... Ticket opened.
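
For the "channel does not exist" case, a usual first step is to check which service endpoints the site actually publishes in the information system. Below is a minimal, illustrative sketch (Python with the python-ldap module) of such a lookup against a top-level BDII; the BDII host, the GLUE 1.3 service type and the site substring are assumptions for illustration, not values taken from these minutes.

<verbatim>
# Minimal sketch: list the service endpoints a site publishes in a top-level BDII.
# Assumptions (not from the minutes): python-ldap is installed, the BDII host/port
# below are reachable, and the site publishes GLUE 1.3 GlueService entries.
import ldap

BDII_URL = "ldap://lcg-bdii.cern.ch:2170"   # hypothetical top-level BDII
BASE_DN = "o=grid"                          # GLUE 1.3 base DN
SERVICE_TYPE = "org.glite.FileTransfer"     # assumed FTS service type in GLUE 1.3

def list_fts_endpoints(site_substring):
    """Return the published FTS endpoints whose URL contains site_substring."""
    conn = ldap.initialize(BDII_URL)
    conn.simple_bind_s()  # anonymous bind is enough for a read-only BDII query
    query = "(&(objectClass=GlueService)(GlueServiceType=%s))" % SERVICE_TYPE
    results = conn.search_s(BASE_DN, ldap.SCOPE_SUBTREE, query,
                            ["GlueServiceEndpoint"])
    endpoints = []
    for _dn, attrs in results:
        for ep in attrs.get("GlueServiceEndpoint", []):
            if site_substring.lower() in ep.decode().lower():
                endpoints.append(ep.decode())
    return endpoints

if __name__ == "__main__":
    # If the list comes back empty or with a stale URL, a "channel does not
    # exist" symptom on the consuming FTS would be the expected consequence.
    print(list_fts_endpoints("cnaf"))
</verbatim>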

  • CMS reports (Daniele): FNAL export problems to European T2 sites (like Pisa); should be fixed now. IN2P3 problems with merge jobs during MC production; also no space for CMSSW releases. Squid problems at PIC, which will possibly require Squid to be restarted. Transfer issues related to some RAW data files > 10 GB; being bypassed. Minor issues at T2s as described in the minutes.

  • ALICE (Patricia, before the meeting) - Good performance of the T0 services during the whole weekend. Reconstruction operations at T0 are going on. ce201 (CREAM-CE) is back in production. The running job profile of ALICE using both CREAM-CE and LCG-CE is available under: T0 running jobs profile. During the weekend the LB at KIT (lb-2-fzk.gridka.de) has been failing, leaving the ALICE agents in status READY forever. The issue is being followed at this moment by the regional expert of ALICE. Angela (KIT): possibly this is related to the WMS. KIT asked whether the WMS is critical. Patricia confirmed that the local WMS are listed by the experiment as critical services. A local ticket has been opened and will be followed up.

Sites / Services round table:

  • NL-T1 (Ron): NTR
  • IN2P3 (Rolf): NTR
  • KIT (Xavier): downtime tomorrow for the ATLAS dCache pool for the upgrade to the new dCache version.
  • PIC (Gonzalo): NTR
  • NDGF (Roger): NTR
  • RAL: double disk failure in a disk array for LHCb.
  • ASGC: NTR

Services: Simone: the CERN to GSI network activity is not a central activity, so the relevant ATLAS people will be contacted.

AOB: (MariaD, following analysis by GGUS developer Guenter Grein): The tickets are synchronized well. Two tickets have not yet been updated by OSG.

Tuesday:

Attendance: local(Luca, Harry, Simone, Jacek, Sophie, Maria, Jamie, Roberto, Ale, Maria, Daniele, Olof); remote(Gonzalo, Gareth, Ronald, Jeremy).

Experiments round table:

  • ATLAS - Good news! Many issues from yesterday sorted. CNAF: the endpoint publication problem; FTS caches endpoints and everything now looks fine. Intervention at FZK this morning; as a consequence the slowness of FZK-TRIUMF transfers disappeared. Cause and cure unknown, but 1-1 correlated with the intervention. It would be good if someone could report tomorrow on FTS 2.2 at CERN. Sophie: but the service was downgraded? Simone: switched to FTS 2.1 at CERN on Friday; some sites are still running FTS 2.2 and we would therefore like to understand the problem. Sophie: Gav is aware, will come back on this.

  • CMS reports - Not much new; some tickets closed yesterday, e.g. FNAL. IN2P3: still investigating problems with merge jobs. Still some problems with CMS SW releases and space in the s/w areas at sites. Other T1 issue: PIC needs to restart Squid instance(s) on site; should be upgraded and fixed tomorrow. T2 level: closing tickets to some RU T2s (BDII visibility problems, PhEDEx agents not running); both now fixed, as is the BR UERJ site. Some discussion on Twiki access for non-CMS-ers.

  • ALICE - Completing migration off out-of-warranty VO boxes (now a more reliable configuration of 4 boxes).

  • LHCb reports - At the ongoing LHCb week the plans until the end of 2009 have been presented. The two-fold primary goal by the end of this year is to power ON the full detector (RICH included) and calibrate/polish all channels. This week they plan to switch on some other sub-detectors (VELO and ST included) to take alignment data with magnet OFF. This alignment with collisions will proceed until 10K events are collected (~1/2 hour) and should have only a marginal impact on OFFLINE and Grid activities. For that reason LHCb is eagerly looking for stable beams (maybe starting from Thursday?). Running at very low rate the L0HLT and then physics stripping of MB and c/b-inclusive events from MC09 (order of tens of jobs). The OFFLINE system is basically empty.

Sites / Services round table:

  • FZK: Today we had a downtime for the ATLAS dCache instance and all nodes were upgraded to the latest version, 1.9.5-9. After the upgrade all ATLAS transfers failed due to an unknown cause (transfers with dteam credentials succeeded). To correct the situation we downgraded the SRM node again and all transfers succeed now. The investigation of this problem is not yet finished.
  • PIC: ntr
  • RAL: declared at risk for 3D today due to memory faults in one of the Oracle RAC nodes; being fixed. As GOCDB is down you can't see this at-risk... Roberto: LHCb DST incident yesterday, any news? Gareth: double disk failure on RAID5; one disk lost at 05:30, a second at 06:00. No way found to recover; data lost. Replacement disks are on the way; the server will be rebuilt and checked out again. (See the RAID-5 sketch after this list.)
  • ASGC: ntr
  • NL-T1: ntr
  • IN2P3: ntr
  • CNAF: as Simone reported from the ATLAS side, the problem with their endpoint which "poisoned" the info-system is solved. Backend and frontend were reconfigured this morning to fix it.

  • CERN: DB - streaming of CMS online to offline not working since late yesterday evening - memory fragmentation. Simone - locked session in Panda DB. People from Panda looking into it. Under investigation.
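
As background to the RAL double-disk failure reported above: RAID5 keeps a single XOR parity block per stripe, so it can reconstruct exactly one failed disk; once a second disk dies before the rebuild completes, the stripe is unrecoverable. The toy Python sketch below is purely illustrative (the 4-disk layout and block contents are invented, not RAL's actual array).

<verbatim>
# Toy illustration of RAID-5 parity: one XOR parity block per stripe lets you
# rebuild a single failed data block, but not two. Purely illustrative; the
# block contents and 4-disk layout are made up, not RAL's configuration.
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, byte_tuple) for byte_tuple in zip(*blocks))

# A single stripe on a 4-disk array: three data blocks plus their parity.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# One disk lost: the missing block is the XOR of the survivors and the parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
assert rebuilt == data[1]          # single failure: recoverable

# Two disks lost before the rebuild finishes: the single parity equation cannot
# determine two unknown blocks, so the stripe (and the data) is gone.
</verbatim>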

AOB:

  • DB - the coordination of databases for physics is changing. A planning meeting will introduce the new GL, Tony Cass. An e-mail will also be circulated to the T1 sites.

Wednesday

Attendance: local(Andrea, Jacek, Jamie, Maria, Roberto, Daniele, Lola, Maria, Sophie, Nick, Ale, Olof, Luca); remote(John Kelly, Angela, Roger Oscarsson (NDGF), Pepe Flix (PIC), Marc, Jason, Onno).

Experiments round table:

  • ATLAS - Good afternoon. Services after the power cut: the central catalog came back ok; site services were restarted this morning. Some failures in WMSs, restarted early this morning. All now working ok. T1s: failures caused by RAL this morning, now solved (late morning). Transfer errors in the PIC cloud (at T2s) will be investigated. Sophie: is the WMS failure in SLS or are these real failures? A: real failures during the night; monitoring showing errors this morning. Sophie: so the monitoring was bad but the WMSs are ok. Ale: similar issue with SRM ATLAS, marked as red but working ok (circa 09:30). (SLS refresh?) Sophie: Jan Iven is looking at it...

  • CMS reports - A few general items: recovering central CMS services after the CERN power cut. In reasonably good shape now, but some response and action times did not match our critical service list (more than a few hours for some critical VO boxes). Site status board: Pepe & Andrea are leading the site readiness team. New features have been introduced without notice (new column & state). This has generated quite a few related bugs; we need to avoid uncommunicated changes being made in production! Manually fixing all CMS scripts that make plots and tables available; such changes must be communicated beforehand. An ad hoc meeting will be proposed to follow up. Twiki privs: action on protecting CMS Twikis. There is a recipe that should work, but on a person-by-person basis; another solution is "view access" for all members of an e-group. T1: issue with merge MC jobs at IN2P3 closed; CMS SW releases & disk space at IN2P3 still on-going. Work at IN2P3 on an increase of the disk space allocation; on the CMS side try to push on quicker deprecation of older releases. Delay in upgrading Squid at PIC (but overload due to siteSB problems). T2: good news from the TIFR net admins; network problems should be cured (router at Mumbai congested), will take a few more days to clear; 1 Gbps via CERN to FNAL in the next few days. Closing some problems at Bari (filesystem crash); Beijing-RAL T1 transfer issues fixed; RU T2s, pending for some time, also fixed.

  • ALICE - Power cut: 3 VO boxes required a restart; everything now fine. Some problems with VOALICE9, not fully recovered. This machine hosts the primary CASTOR gateway; sysadmins are investigating...

  • LHCb reports - Realised there had been a power cut after receiving mails that the space on certain space tokens was 0. The disk server was restarted manually (not on critical power). DIRAC services all restarted ok. This is an LHCb week. Reprocessing and stripping in the last 24 hours: 80 jobs (see SSB). Prevalently user activity going on. One of the LHCb dedicated WMS services (wms216) has run into trouble submitting to CREAM CEs. It seems bug #59054 is the reason for this problem and machines at CERN (but not only) must be upgraded with the patch. Nick: does this bug only affect LHCb? A: in principle it should affect all. Sophie: aware of the problem.

Sites / Services round table:

  • RAL: problems with misconfigured SRM - fixed. Caused file transfer failures. Quite a few overnight callouts due to CERN power failure.
  • KIT: FTS - since yesterday we see some security errors, not seen for a long time. A certificate soon expiring? Hopefully not a deeper bug.
  • NDGF: something from yesterday - a small SRM problem around lunch due to a config error. Solved yesterday.
  • ASGC: continue security patching. Rolling. Hope to finish in 2 days.
  • NL-T1: the SARA SRM was very slow this morning; user errors were generated due to timeouts. Did some config tuning, which seems to have helped. ATLAS is running Hammercloud tests at a few sites including SARA, and SARA is performing very poorly; curious to find out why! GGUS ticket 53806 was submitted by Ron; someone from ATLAS has now responded to it. Ale: seems to be an issue with the LFC, can you check whether the LFC is working ok? (A simple check is sketched after this list.)
  • PIC: from CMS side need to reinstall (SQUID) servers.
  • CNAF: ntr

  • CERN - DB: the physics DBs survived, as did all production DBs; each has at least 2 nodes on critical power. All restarted before 05:00. Around 09:00 a sysadmin restarted all 3 nodes of the ATLAS online RAC; the DB was not available for 10' (the nodes were in a "no contact" alarm state). Problem with AFS, solved by someone coming on-site. Restored batch, then CASTOR, then grid services. Everything back since this morning; if anything is wrong please open a ticket. Roberto: would the AFS problem have impacted cron jobs as well? A: where were they running? lxplus was down... Dashboards: all down around 07:00; sent mail to VO box support, came back without problems. Apologies for the update of the SSB; nothing to do with the power cut, but it introduced problems. Will make sure this doesn't happen again. Ale: the data distribution dashboard management had a lockfile permission problem. FTS 2.2: the developers sent a report; they have an idea of what was wrong but would like to reproduce it.
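
On the request above to check whether the SARA LFC is working: a minimal health check is to verify the daemon answers on its TCP port and then attempt a trivial catalogue listing with the lfc-ls client. The sketch below is illustrative only; the host name, port and test path are assumptions, not values from these minutes, and the listing step needs a valid grid proxy.

<verbatim>
# Minimal sketch of a basic LFC health check: first verify the daemon accepts
# TCP connections, then try a trivial listing with the lfc-ls client.
# Host name, port and test path are assumptions for illustration.
import os
import socket
import subprocess

LFC_HOST = "lfc.grid.sara.nl"   # hypothetical host name for the SARA LFC
LFC_PORT = 5010                 # assumed default LFC daemon port
TEST_PATH = "/grid/atlas"       # hypothetical catalogue path with read access

def lfc_alive():
    """True if the LFC daemon accepts TCP connections."""
    try:
        with socket.create_connection((LFC_HOST, LFC_PORT), timeout=10):
            return True
    except OSError:
        return False

def lfc_lists():
    """True if a simple catalogue listing succeeds (needs a valid grid proxy)."""
    env = dict(os.environ, LFC_HOST=LFC_HOST)
    return subprocess.call(["lfc-ls", TEST_PATH], env=env) == 0

if __name__ == "__main__":
    print("daemon reachable:", lfc_alive())
    print("listing works:   ", lfc_lists())
</verbatim>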

Release report: deployment status wiki page

AOB:

Thursday

Attendance: local(Andrew, Jan, Lola, Sophie, Harry, Ale, Roberto, Maria, Nick, Jamie, JP, Simone, Stephane, Julia); remote(Xavier, Roger, Gang, Jason, Gareth, Jeremy, Brian).

Experiments round table:

  • ATLAS - Not a lot to report today. Final day of ADC s/w meeting. Functional tests stopped as machine didn't restart ok after power cut. Restarted yesterday pm and now all ok.

  • ALICE - Opened Remedy ticket 645721 because the VO boxes are not in the production partition. These machines are now quattorized; the ticket requests putting them into production. Sophie: you can use SMS to set the production state (as VOC), and to increase the criticality to >50 (VOALICE 11, 13, 14). Sophie: the importance can also be changed by the VOC.

  • LHCb reports - There should be stable beam starting on Saturday morning and LHCb is going to take data with the magnetic field ON. Since the SQLDDDB cannot give the correct value of the magnetic field, it is very important to be sure the online DB is propagated correctly to all sites and that production is set up to use Oracle. This means that ConditionDB access and its stream replication become crucial this weekend. The operations team has been alerted to stay tuned in order not to be the show-stopper for physicists having a look at first data. Received a request to schedule an intervention on CASTORLHCB to apply level 2.1.8-17 and on SRM-LHCB to apply 2.8-5 for this Tuesday (8 December). Considering that even small hiccups have to be avoided by Monday, LHCb strongly discouraged doing it on either Monday or Tuesday and rather suggested this afternoon's slot. It has been decided to carry it out at 14:00 today. The intervention is, however, not proving to be as transparent as claimed, with many users complaining about CASTOR unavailability after 10 minutes ;-). The intervention was not in GOCDB, even as UNSCHEDULED. Jan: it has now been put in, done at about the same time as the announcement was sent. Roberto: the AMGA server was rebooted after the power cut and came back to life hammering the LHCb DB; this service is not used anymore and should be removed. The issue reported yesterday about a bug affecting submission to CREAM CEs has a patch (3489) already certified; it is now in staged rollout, ready to be deployed at sites. Nick: several sites already have it in production; it is publicly available and will be in the standard production repository in a week or so. Roberto: RAL disk outage on Monday - request for a post-mortem on this incident. Gareth: yes, we'll do it!
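
Since the ConditionDB stream replication is flagged as crucial for the weekend, one basic monitoring check on the databases involved is to look at the status of the Oracle Streams capture and apply processes. The sketch below (Python with cx_Oracle) is illustrative only: the connection string is a placeholder and it assumes an account allowed to read the DBA_CAPTURE and DBA_APPLY views; it is not LHCb's or CERN IT-DB's actual monitoring.

<verbatim>
# Illustrative check of Oracle Streams replication health. Connection string
# and privileges are assumptions; it simply reports whether the capture and
# apply processes are ENABLED.
import cx_Oracle

DSN = "monitor/secret@conddb-replica"   # placeholder credentials / TNS alias

def streams_status():
    conn = cx_Oracle.connect(DSN)
    try:
        cur = conn.cursor()
        cur.execute("SELECT capture_name, status FROM dba_capture")
        captures = cur.fetchall()
        cur.execute("SELECT apply_name, status FROM dba_apply")
        applies = cur.fetchall()
        return captures, applies
    finally:
        conn.close()

if __name__ == "__main__":
    captures, applies = streams_status()
    for name, status in captures + applies:
        # Anything not ENABLED here would explain stale conditions at the sites.
        print("%-30s %s" % (name, status))
</verbatim>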

Sites / Services round table:

  • KIT: found out 30' ago CMS has severe problems with verification of credentials for FTS transfers - working on it. No other problems.
  • NDGF: ntr
  • RAL: ntr
  • ASGC: ntr
  • NL-T1: yesterday we talked about the ATLAS Hammercloud tests. SARA was not performing well for two reasons: one of the pool nodes had collapsed, and the LFC had problems with connections to Oracle. Services were restarted yesterday afternoon and since then all is ok. NIKHEF: tests not as expected; one pool node had crashed there too, and it is unclear why DPM was not aware of this. New storage should be available in a matter of days, taking the bandwidth to storage from 20 Gbps to 120 Gbps. SARA: the SRM will be down in the second week of January (Jan 11 onwards) for the migration from PNFS to Chimera; also submitted to GOCDB.

  • CERN: recovering from the power cut. The sysadmins have a lot of work; if you have requests please open an ITCM ticket.

AOB:

Friday

Attendance: local(Simone, Jamie, Harry, Ricardo, Roberto, Lola, Patricia, Ale, Maria, Ewan, Julia, Giuseppe, Sophie); remote(Tristan Suerink (NL-T1), Xavier Mol (KIT), Jason, Jeremy, Gareth, Roger Oscarsson (NDGF)).

Experiments round table:

  • ATLAS - SRM ATLAS gave 2 bursts of inefficiency: yesterday around 21:00 and today around 10:00. Short in duration; the expert was informed via the standard procedure. The ATLAS scratch disk pool is having some trouble, still to be understood: ATLAS is using the pool in a "heavy" way and the pool is too small for this activity. The SRM is shared, so if the SRM is in trouble for the scratch disk then the T0 pool is also in trouble. Discussing with the CASTOR developers; many areas to tune before the weekend. From the ATLAS point of view we need to understand the usage of this pool - increase it a bit? From the CASTOR point of view, is there a way to decouple SRM use of the user & production areas? Giuseppe: the pool is "over-used" (as reported above); go to a different pool or add more boxes. Simone: it is not space but activity. If increasing the size of the pool increases the effective bandwidth, e.g. an extra 50-100 TB of space, fine - can fill in the request now, but would not like to do this unnecessarily. Ale: all Tier1s can suffer from the same problem. Giuseppe: it is a kind of warning that the service is running close to capacity; the pool in any case is overloaded and we need to look at its usage. Tier1s: CASTOR Tier1s should not have a problem as they use gridftp in external mode; this is seen now because we have moved to internal mode at CERN (since a year!?). To be taken offline... [ Small problem with a DQ2 machine - short downtime. ]
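
The "more space buys more bandwidth" argument above follows from disk pools being built out of disk servers: adding capacity adds servers, and the aggregate throughput scales roughly with the server count. A toy estimate is sketched below; all numbers are invented for illustration and are not CASTOR or ATLAS figures.

<verbatim>
# Toy estimate of how adding capacity to a disk pool also adds bandwidth,
# because capacity arrives as extra disk servers. All numbers are invented
# for illustration and are not CASTOR or ATLAS figures.
CAPACITY_PER_SERVER_TB = 20.0    # assumed usable TB per disk server
BANDWIDTH_PER_SERVER_MBS = 250.0 # assumed sustained MB/s per disk server

def pool_bandwidth(total_capacity_tb):
    """Aggregate pool throughput if capacity scales linearly with server count."""
    n_servers = total_capacity_tb / CAPACITY_PER_SERVER_TB
    return n_servers * BANDWIDTH_PER_SERVER_MBS

if __name__ == "__main__":
    current = pool_bandwidth(100)        # hypothetical current scratchdisk size
    expanded = pool_bandwidth(100 + 75)  # after adding ~75 TB as discussed
    print("current pool:  %.0f MB/s" % current)
    print("expanded pool: %.0f MB/s" % expanded)
</verbatim>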

  • CMS reports - Daniele sent apologies (clash with a meeting in Bologna).

  • ALICE - Thanks to FIO for assistance in including the VO boxes in SMS production status; increasing the priority of these boxes to >50. Will add a VO box at SARA with the latest gLite release. Also made a request to pxsupport to retire a VO box and include another in the MyProxy server. Ricardo: will check with Sophie. Sites: production on-going; some sites are not yet migrated to SL5 and will be contacted by the experiment; this migration should be completed before 2010. Patricia: still several Tier2s; need at least a schedule of the plan. Roughly 10 sites are involved.

  • LHCb reports - Not much activity, mainly user activity. T0: VOLHCB09 has an alarm with swap full and we cannot log in to the machine to see what has happened; the sysadmin is looking at the problem. Joel sent a mail to the administrator of the machine lxbra2509 (hosting the old LHCb AMGA service) to stop and completely remove the AMGA software from there. Farida agreed to do so.

Sites / Services round table:

  • NL-T1: FTS traffic for ATLAS going very well!
  • KIT: ntr
  • RAL: problem this morning on FTS - DB session stuck. Resolved ~10:00. No transfers for some hours before.
  • NDGF: some problem with SAM tests for SRM; they fail, but only with the SAM certificate. ATLAS transfers run smoothly. Sent an email to the SAM people.
  • ASGC: ntr

  • CERN: ntr here either.

AOB:

  • Plan for going to higher intensities tomorrow at 16:00 - experiments preparing for data!

-- JamieShiers - 27-Nov-2009
