Week of 090824

WLCG Service Incidents, Interventions and Availability

VO Summaries of Site Availability: ALICE, ATLAS, CMS, LHCb
SIRs & Broadcasts: WLCG Service Incident Reports, Broadcast archive

GGUS section

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information
CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, Weekly joint operations meeting minutes

Additional Material:

STEP09 ATLAS ATLAS logbook CMS WLCG Blogs


Monday:

Attendance: local(Jamie, Lola, Simone, Gang, Ewan, Edoardo, Patricia, Roberto, Julia, Harry, Andrea, Jean-Philippe, Jacek, Maria, Dirk, Markus, Diana, Olof (chair));remote(Gonzalo, Michael, Andreas Heiss, John Kelly, Jos, Ronald).

Experiments round table:

  • ATLAS - (Simone) Announcements:
    • NIKHEF - Hoang informed ATLAS that the DC move has completed. Ronald confirms that the move has completed and the WNs are being brought back.
    • A new version of the ATLAS site services has been running in the French cloud for a week. It will now be changed to use LFC bulk inserts.
    • Reminder: tomorrow there will be the name change for BNL at 15:00 CEST (09:00 EDT). Sites are reminded to run the script 15 minutes after the start, i.e. 15:15. The procedure is documented at https://twiki.cern.ch/twiki/bin/view/LCG/FtsProcedureSiteNameChange

  • ALICE - (Patricia) still finding small issues in the ALIEN configuration which prevent production from starting. This is a central configuration fault, not a problem at the sites (many sites have asked). The two WMSes for ALICE use in France are completely dead. For grid33 @ LAL a GGUS ticket 51107 has been submitted. Patricia discovered just before this meeting that the other one was also dead (a GGUS ticket will be submitted). CREAM CE issues at CERN: the ticket for ce201 had to be reopened over the weekend. It seems to be a problem related to grid proxies... POST-MEETING: France has confirmed that both WMS@FRANCE are in scheduled downtime.

Sites / Services round table:

  • PIC (Gonzalo) - NTR
  • BNL (Michael) - name change already mentioned
  • FZK (Andreas) - short downtime for the CE pointing to the SL5 cluster. Still migrating the remaining WNs to SL5, which will continue until the end of the week. SE downtime for an SRM upgrade, scheduled for 30 minutes tomorrow morning.
  • RAL (John) - scheduled an 'at risk' intervention on one of the tape robots tomorrow to repair the damage from the leak.
  • NIKHEF - NTR
  • ASGC (Gang) - CMS migration started again. 2000 files have been migrated.
  • CERN (Ewan) - CASTORCMS default pool was overloaded this morning.

  • Network: (Edoardo)
      Since 24/8/2009 14:00 CEST, the CERN LHCOPN routers have been announcing the prefix 128.142.0.0/16 together with the previous, more specific 128.142.128.0/17.
      The BGP announcement of the old prefix will be stopped only when all the Tier1s have acknowledged that they are receiving and using the new /16.
      The new address space is needed for the new servers that will soon be installed in the CERN computer centre.

      The intervention is documented here: https://gus.fzk.de/pages/ticket_lhcopn_details.php?ticket=50916
  • A new router will be put into production tomorrow, starting at 10am.

  • middleware: (Markus) WMS 3.2 released
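
Illustration for the network item above: a minimal Python sketch of how the new /16 and the old /17 announcements relate, and why routers keep preferring the more specific /17 while both are announced. The helper name `best_match` and the sample addresses are made up for illustration and are not part of the intervention; real BGP route selection involves more than prefix length.

```python
import ipaddress

# The two prefixes named in the network announcement.
new_prefix = ipaddress.ip_network("128.142.0.0/16")
old_prefix = ipaddress.ip_network("128.142.128.0/17")

# The old /17 lies entirely inside the new /16.
covered = old_prefix.subnet_of(new_prefix)


def best_match(address, prefixes):
    """Return the longest (most specific) prefix containing address, or None.

    Hypothetical helper: models only longest-prefix matching, not full BGP.
    """
    addr = ipaddress.ip_address(address)
    candidates = [p for p in prefixes if addr in p]
    return max(candidates, key=lambda p: p.prefixlen, default=None)


announced = [new_prefix, old_prefix]

# Sample addresses chosen for illustration only.
in_old_range = best_match("128.142.200.1", announced)  # covered by both prefixes
in_new_range = best_match("128.142.10.1", announced)   # covered by the /16 only

if __name__ == "__main__":
    print("old /17 inside new /16:", covered)
    print("128.142.200.1 ->", in_old_range)
    print("128.142.10.1  ->", in_new_range)
```

Addresses in the upper half of the range still match the /17 until its announcement is withdrawn, which is why the old prefix can be dropped only once every Tier1 accepts the /16.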

AOB:

  • Maria: Do ATLAS and BNL want to run an alarm ticket test late tomorrow afternoon, after the name change, to ensure the workflow is OK after the change? Simone and Michael agree.

Tuesday:

Attendance: local(Simone, Gang, Dirk, Jamie, Maria, Harry, Patricia, Roberto, Maria, Julia, Dan, Olof (chair));remote(Gonzalo, Tiju, Daniele, Brian, Andreas).

Experiments round table:

  • ATLAS - (Simone) only concern: one tier-2 (Milano) failing all transfers due to SRM errors. FTS 2.2 testing: no problems seen here at CERN and the checksum functionality was successfully tested yesterday with dCache sites. Still need to test with DPM. In summary, the tests look promising. Roberto: did you test with StoRM? No. Simone thinks we need to foresee a stress-test with FTS 2.2 in the first two weeks of September. Brian: does this mean that ATLAS will require a minimum version of DPM 1.7.2? Simone: we still need to verify that it works with DPM. In any case, the ATLAS production system does not require the checksum checks but will use them if the SE supports them.

  • ALICE - (Patricia) the CREAM CE problems at CERN have been fixed. More than 1000 concurrent jobs successfully run so far today. Concerning the issue with WMS in France reported yesterday: the sites were in scheduled downtime so it was normal that they were down. In order to cope with similar situations in the future ALICE can now use two WMS at CERN for submitting to the French CEs.

Sites / Services round table:

  • TRIUMF (Reda/offline): Yesterday we experienced reverse DNS lookup failures affecting all of our Grid services which are in the 206.12.1.0/24 address space (nodes on the lightpaths). The problem affected all sites outside British Columbia province. There was a configuration issue with a BCNET server which took some time to resolve after a few iterations with the NOC. This essentially resulted in 0 MB/s traffic to/from TRIUMF lasting about 7 hours, starting at about 5PM UTC.
  • PIC (Gonzalo) - NTR. Scheduled downtime at PIC as announced. Maria: forwarded a mail to the wlcg-operations list last week concerning the GOCDB problem reported by Gonzalo about not being able to broadcast early announcements for scheduled downtime. As there was no answer, Maria suggests now to follow up the issue with the GOCDB developers. This was agreed by the meeting.
  • RAL (Tiju): NTR
  • ASGC (Gang): the CASTOR operation team found a problem with the 'expert' system but the CMS migrations are still stalled and have been for 24 hours. There is 110TB of disk space available, so space is not yet critical.
  • FZK (Andreas): problem with PBS server this morning. Job submission didn't work for a few minutes. The SRM intervention scheduled this morning was successful.
  • CERN (Olof): LFC upgrade
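
Illustration for the TRIUMF item above: a minimal Python sketch of what a reverse DNS lookup for a host in 206.12.1.0/24 involves. The sample host address is simply the first address in that range, used for illustration; the helper names `ptr_name` and `reverse_lookup` are made up here, and the sketch is not the check TRIUMF or the NOC used.

```python
import ipaddress
import socket


def ptr_name(address):
    """Return the DNS name queried for a reverse (PTR) lookup of address."""
    return ipaddress.ip_address(address).reverse_pointer


def reverse_lookup(address):
    """Return the hostname for address, or None if the reverse lookup fails."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(address)
        return hostname
    except OSError:
        # Covers socket.herror and socket.gaierror, e.g. when no PTR record resolves.
        return None


# Address space named in the TRIUMF report; the chosen host is illustrative only.
network = ipaddress.ip_network("206.12.1.0/24")
sample = str(next(iter(network.hosts())))

if __name__ == "__main__":
    print(sample, "->", ptr_name(sample))
    print("resolved hostname:", reverse_lookup(sample))
```

A remote service that verifies the client hostname this way would see `None` during such an outage, which may explain why transfers stopped even though the lightpaths themselves were up.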

AOB:

  • Simone: as agreed yesterday ATLAS will send an ALARM and TEAM ticket to BNL this afternoon after the name change.

Wednesday

Attendance: local(Jean-Philippe, Jan, Simone, Oliver, Maria, Gang, Antonio, Harry, Cedric, Roberto, Diana, Julia, Jamie, Olof (chair));remote(Gonzalo, Daniele, Michael, Tiju, Ron, Jos).

Experiments round table:

  • ATLAS - (Simone) BNL name change: situation of FTS (as known) - CERN, PIC, ASGC, IN2P3 applied the script. SARA applied it but the channels still existed; this is being followed up at the site. RAL and TRIUMF do not do a daily BDII update and will therefore apply the script at the next intervention. NDGF and CNAF have not answered but the channels are working. GGUS test: the name has not been updated, still BNL-ATLAS-T2. The ALARM ticket test will therefore be postponed until tomorrow. Maria: not sure it will be possible by then? Diana: GGUS has not been updated but that can be done tomorrow. However, another issue is that the resource grouping used for publishing sites in the EGEE sense has nothing to do with GGUS. This will be followed up offline. TEAM ticket: several BNL-... names exist, which is confusing and needs to be understood/fixed.

  • CMS reports - (Daniele) MC production run throughout August: 265M raw events in 26 days in the Tier-2s. Remarkable results according to physicists. Transfers Tier-2 -> Tier-1 ~90TB. Correction to minutes submitted by Daniele after meeting: T0->Tier-*: 90 TB, T1->Tier-*: 210 TB, T2->Tier-*: 110 TB. Further details in the CMS twiki. Some transfer errors are being investigated. SAM status: Russian sites seem to have suffered some instabilities in August - being followed up with sites. All problems have been tracked in savannah tickets, which have doubled in numbers during August.

  • ALICE -

  • LHCb reports - (Roberto)
    • The request for reshuffling of the CASTORLHCB disk pools has been fulfilled by the CERN CASTOR operations team. One open question, which needs to be confirmed by Philippe: it seems that reducing the lhcbraw space by removing out-of-warranty hardware is sufficient and it shouldn't be necessary to drain + remove other servers.
    • PIC and NL-T1: the ROOT - dCache client incompatibility problem has been solved by running a specific version of the client packaged with the LHCb software. LHCb would like the sites to install the client natively on the nodes. Being followed up by the sites.

Sites / Services round table:

  • PIC (Gonzalo) - NTR
  • BNL (Michael) - NTR
  • RAL (Tiju) - NTR
  • NL-T1 (Ron) - next Monday the tape backend will be down for the whole day due to a tape robot change. The FTS issue with the name change has been solved.
  • FZK (Jos) - some dCache disk nodes went down due to hardware problems. The data is also on tape so access should not be affected. TEAM ticket about FTS transfers to Tier-2 sites in the US, which required some effort on the FZK side.
  • ASGC (Gang) - NTR
  • CERN (Jan) -
    • BNL name change went ok on the FTS @ CERN.
    • EGEE broadcast about emergency kernel update. Will be rolled out tomorrow. Simone: kernel patch of the VOBoxes? Jan: everything will be prepared in Cdb but it is up to the VO contacts to deploy the new kernel on their own machines. A notification will be sent out

  • Middleware (Antonio): last Monday new gLite was released for production. A complete redesign of the WMS service. Feedback from sites participating in the pilot is good. It has the advantage that it can submit to CREAM. Next release for production to be carried out this week (tomorrow or next Monday): yaim core change impacts all services. Main thing for the WN is the glExec wrapper scripts together with a corresponding fix to the LCAS interface. Harry: what are CERN plans for the deployment of the new WMS?

AOB:

  • (MariaDZ) Appended file with email thread with GOCDB developer on broadcast tool behaviour change. Jamie: We need an early notification about changes like that?

Thursday

Attendance: local(Jean-Philippe, Lola, Jamie, Harry, Maria, Diana, MariaDZ, Gavin, Roberto, Andrea, Simone, Cedric, Gang, Olof (chair));remote(Andreas, Daniele, Brian, Gareth, Gonzalo).

Experiments round table:

  • ATLAS - (Cedric) start ESD processing next week unless any problems found. Testing ALARM ticket to BNL at 5pm today. FTS notification for the BNL name change: still no news from CNAF. The channel seems to work, though.

  • CMS reports - (Daniele) tickets from last 24hrs. An ALARM ticket was opened to Tier-0 yesterday evening because of the queues having been set inactive following the emergency reboot yesterday. This was not expected from the message that had been sent out that said that only public batch queues were closed. Tape migration problems at ASGC. A ticket has been opened and the ASGC operation team is requesting help from CASTOR development team. Gang: the problem has not been solved. CASTOR development has not yet been requested to investigate. More than 40 tickets opened with the Tier-2s: some due to holidays.

  • ALICE -

  • LHCb reports - (Roberto) zero production going on, only a few user jobs for the distributed analysis activity. Follow-up concerning CASTOR: there is no need to reduce the LHCBRAW space. Second follow-up: problem at IN2P3 where disk-resident files cannot be accessed. Thirdly: any news from PIC or SARA about the issue reported yesterday concerning the dCache client installation? Gonzalo: no news but will follow up after the meeting.

Sites / Services round table:

  • ASGC (Gang) - CASTOR tape migration problem reported for CMS above.
  • FZK (Andreas) - network problems affecting file transfers from CERN to ftp servers at FZK. Human error, which has been fixed.
    • Correction received from Andreas after the meeting: The network problem caused SAM tests to fail because the routing between the hosts running the tests (e.g. monb002.cern.ch) and FZK was broken. Regular FTS transfers from/to CERN were not affected.
  • RAL (Gareth) - outage this morning for deploying the kernel upgrade. GOCDB announcement for migration of WNs to SL5. Scheduled for week of 14th of September. Bank holidays in UK tomorrow. Roberto: will RAL still support SL4 resources? Yes, there will be some left. Brian: following up ticket with ATLAS for cleaning up connections from monbox@cern after the LFC migration. Simone: not aware of the issue (yet).
    • Correction submitted by Gareth after meeting: The UK bank holiday is next Monday, and RAL will be closed Monday and Tuesday (i.e. 31st August, 1st September), not tomorrow.
  • PIC (Gonzalo) - for some days we have seen intermittent failures in some of the CEs at PIC. It is the Replica Manager test that fails when trying to replicate files to the DPM server at CERN. A ping to the DPM server shows high packet loss. A GGUS ticket 51180 has been opened and is being followed up by Maarten. Gavin: we see the same problem here at CERN.
  • CERN (Gavin) - we had to stop the LXBATCH service yesterday night. In order to limit the impact we tried to drain the jobs before the reboot. However, this may not have been the desired action and in particular CMS submitted an ALARM ticket for rebooting the Tier-0 production farm late in the evening. A post-mortem analysis is being produced.

AOB:

  • Gareth: regarding the packet loss issue reported above: we also see some packet loss on the OPN starting from Tuesday this week. For historical reasons we ping srm-dteam.cern.ch (two nodes behind it). Gavin: this could be because SRM nodes are being rebooted for memory upgrades, although a node would then normally be taken out of the load-balanced alias.

Friday

Attendance: local(Cedric, Simone, Alessandro, Harry, Gang, Ewan, David, Olof (chair));remote(Gonzalo, Angela, Onno Zweers, Daniele, Gareth, Brian).

Experiments round table:

  • ATLAS - (Cedric) production, which was low during the week, has started to ramp up. One (FTS) problem with PIC yesterday, which seems to have affected CMS too. Gonzalo: confirms this was a human error causing the DB to be rebooted. It affected the FTS, LFC and 3D databases. After the database had recovered there was still a problem with the FTS, and it was later found that it required a restart of the tomcat server. Twiki page for SL5 migration for ATLAS: https://twiki.cern.ch/twiki/bin/view/Atlas/SL5Migration . The ALARM ticket to BNL wasn't submitted yesterday but will be submitted today. Gareth: we are planning some maintenance work on the 3D database at RAL and we are proposing a date of 3rd of September, which is during the reprocessing. We would schedule a few (3) hours of downtime, though in reality it is probably only really down for 10 minutes. Is that a problem for ATLAS? Simone: this means having the Oracle database hosting the condition data down? Yes. OK, then it is a problem. However, the reprocessing timescale isn't fully clear yet so it may still be possible. To be negotiated between ATLAS and the DBAs at RAL.

  • CMS reports - (Daniele) Loadtest transfers PIC -> CALTECH: problem disappeared but not understood what the cause was. Ticket details in CMS report.
    • Olof: status of CMS tape problem at ASGC? Gang: It is a three-step process, currently at step two: the local experts are collecting information and asking advice from CASTOR experts at CERN. If the problem is not solved by next week, ASGC will ask the CASTOR development team to help with the migrations.
    • Angela: CMS contact reports a problem with FTS. It is unclear why, because it is OK for other VOs. Daniele: seems to be a problem with proxy renewal for the CMS contact person. Angela: Apparently the same problem was also seen for a second user...

  • ALICE -

  • LHCb reports - Apologies from Roberto, who could not attend. However, all details are in the LHCb twiki. Roberto thanks PIC and NL-T1 for their prompt reaction to the dCache client issue: the LHCb report states exactly what LHCb would like to have installed.

Sites / Services round table:

  • PIC (Gonzalo) - follow-up of the issue reported yesterday causing SE SAM test failures. It is the replication test from the CE test. The destination SRM (dpm) fails with a timeout. Maarten said yesterday that the network people had been contacted but there is no news yet. Olof: will check with the networking team after this meeting.
    • Checking with Maarten after the meeting: the trouble came from the new LCG core router introduced on Tuesday; it has been taken out an hour ago and SAM tests have started succeeding.
  • FZK (Angela) - currently problem with proxy delegation with the CREAM-CEs. Currently no idea what the problem is.
  • SARA (Onno) - few issues:
    • Power failure on Wednesday afternoon causing several problems:
      • VOMS failure due to glitches in the switches without redundant PS.
      • The tape backend was also affected, by a problem with the filesystem.
      • Oracle 3D service affected: one node rebooted because it had no redundant PS. Fortunately another node in the cluster took over so it should have been transparent.
    • Reboots of WNs for kernel security upgrade yesterday
    • CE of the GINA cluster had problem with a file server
    • Last night NIKHEF had a problem with SRM daemon on the DPM SE. Because of this they were failing SRM transfers.
    • Finally: next Monday there is maintenance on the tape backend.
  • RAL (Gareth) - NTR, Brian: no update to the issues with LFC and monbox. Alessandro: from the point of view of ATLAS we don't use the old LFC. However, if you still see connections from ATLAS please submit a ticket (51093). Brian will check again.
  • CERN (Ewan): NTR

AOB:

  • After the change of site name from BNL-LCG2 to BNL-ATLAS in the FTS last Tuesday, the BNL-LCG2 site name was not changed in GOCDB, as can be seen at: https://goc.gridops.org/site/list?id=9 . The GridView team would like to know if the BNL-LCG2 name can also be changed to BNL-ATLAS in GOCDB. This change is important for GridView to keep track of the history of the BNL-LCG2 site in their monthly availabilities. It is just a site name change in GOCDB, so it's not adding services or any other modification.
    • No BNL person connected to the phone conf but in principle changing the site name in GOCDB should be transparent from their perspective. Alessandro: ATLAS is in favor of doing the change

-- JamieShiers - 2009-08-24

Topic attachments
Attachment | History | Size | Date | Who | Comment
GOCdb_broadcast_tool.eml (Microsoft Outlook e-mail file) | r1 | 8.4 K | 2009-08-26 - 10:07 | MariaDimou | GOCdb_broadcast_tool_missing_buttons
Topic revision: r12 - 2009-08-29 - MaartenLitmaath
 