Week of 090323

WLCG Baseline Versions

WLCG Service Incident Reports

  • This section lists WLCG Service Incident Reports from the previous weeks (new or updated only).

GGUS Team / Alarm Tickets during last week

Weekly VO Summaries of Site Availability

Daily WLCG Operations Call details

To join the call at 15.00 CE(S)T, Monday to Friday inclusive (in CERN 513 R-068), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Jean-Philippe, Maarten, Roberto);remote(Angela/FZK, Gareth/RAL, Alessandro).

Experiments round table:

  • ATLAS (Alessandro) - a few problems during the weekend: on the Panda Monitor machine voatlas21, 2 processes were automatically killed for overload; the problem is currently being followed up with the Panda Monitor developers. Many "critical" errors on the PIC VOBOX - not really critical, as they are due to subscriptions to non-existent datasets; the message is to be changed. SRM server down at MPPMU; pretty smooth otherwise.

Gareth is worried about raising an alarm ticket to push Tier1s to update FTS in case of proxy delegation errors; this was also discussed in Prague. But as a site not updating could lead to significant unavailability, Alessandro proposes to send a team ticket about the FTS upgrade now; Maarten will check how to have the FTS release pushed from PPS to Production (easier for sites to take); if sites do not upgrade and many errors are seen, an alarm ticket will be sent.

  • ALICE -

  • LHCb (Roberto) - quiet week: low-level Monte Carlo and random analysis; several sites banned as unable to upload data from WNs to the SE. Because of the problem seen with some sites a couple of weeks ago, LHCb is running lcg_util transfers with a single stream, but this is very bad from the performance point of view (0.3 MB/s from WN to SE); will try to test the sites one by one outside of the Dirac framework to see if there is some network misconfiguration. It could also be due to a wrong setting of GLOBUS_TCP_PORT_RANGE on the WNs (see the sketch below). Roberto will give the list of problematic sites (GGUS 46946).
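A minimal by-hand sketch of the single-stream workaround, assuming lcg_util is available on the WN; the port range, VO setting and SURL below are illustrative placeholders, not LHCb's actual Dirac configuration:

    # Assumed example port range; it must match the outbound ports actually
    # opened by the site on the WN firewall.
    export GLOBUS_TCP_PORT_RANGE="20000,25000"

    # Upload a local file to the SE with a single GridFTP stream (-n 1),
    # mirroring the current LHCb workaround; a larger -n restores multi-stream copies.
    lcg-cp -v --vo lhcb -n 1 \
      file:$PWD/testfile \
      srm://se.example.org/dpm/example.org/home/lhcb/user/testfile

Running the same copy with different -n values from a problematic WN should show whether the slowdown is stream- or port-range-related at that site.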

Sites / Services round table: RAL: the large I/O activity from MCDISK reported last week was activity on the LAN (SE to WN), which is why Alessandro did not notice it in his monitoring. RAL had a one-hour unscheduled downtime today because of a fibre channel problem on the Oracle RAC servers for Castor; the problem for CMS took a bit longer to fix, but everything should be OK now.

AOB: (MariaDZ) It's show time folks! In https://twiki.cern.ch/twiki/bin/view/EGEE/SA1_USAG#Periodic_ALARM_ticket_testing_ru one reads: "Next round in 2009: Maria to remind the VOs on Mar 23rd. The tests will run in the week of Mar 30th, the GDB being planned for April 8th. Related ticket savannah #107452." Any questions, let's discuss them tomorrow in this meeting. Alarm tests should start right after CHEP'09.

Tuesday:

Attendance: local(Miguel, Eva, Alessandro, Jean-Philippe, Roberto, Olof, MariaD, Ignacio);remote(Luca/CNAF, Angela/FZK, Gareth/RAL).

Experiments round table:

  • ATLAS (Alessandro) - Monday was quiet - reprocessing will start on Thursday - problem at FZK due to Munich and Freiburg (ticket issued) - GRIF/LAL MCDISK full: need to understand why - Stephane restarted the *aod subscriptions; this will increase the data rate on MCDISK.

  • ALICE -

  • LHCb (Roberto) - load on the LFC at CERN increased because of the retries on failed uploads: one LFC node has been added by FIO - one file access issue in Lyon: file not yet online after 2 days - SRM problem at CERN (load?): uploads failing, ticket submitted, Shaun investigating.

Sites / Services round table:

Databases (Eva): DB down in Taiwan for several weeks: removed from replication - replication to RAL affected by a power cut - replication to CNAF will be stopped next week for the scheduled site downtime.

A new version of FTS is going from PPS to Production (patches #2760/2761). Sites should start installing it.

AOB: (MariaDZ) As discussed with Luca, if CNAF is on scheduled downtime during the whole of the ALARM testing week, the site should be omitted from this round of tests. I have added this common-sense phrase to the 'testing rules' https://twiki.cern.ch/twiki/bin/view/EGEE/SA1_USAG#Periodic_ALARM_ticket_testing_ru (point 4). About CNAF rejecting the LHCb alarm tests in the last round (early March), expecting the email to be signed with Roberto's certificate, the documentation https://gus.fzk.de/pages/ggus-docs/PDF/1560_Alarm_Ticket_Process.pdf (sections 3.1.2 and 3.2.1) explains:

  • The Authorised ALARMER signs with his/her certificate (section 3.1.2.)
  • The email is sent to a specific site alarm mail address and signed with the GGUS certificate. (section 3.2.1.).
This document, included in https://gus.fzk.de/pages/docu.php#8, is also linked from the 'testing rules' https://twiki.cern.ch/twiki/bin/view/EGEE/SA1_USAG#Periodic_ALARM_ticket_testing_ru (point 5).

  • RAL (Gareth): major outage due to 2 power glitches: nothing is available at RAL now - looked at the BDII problem reported earlier: it seems to be similar to the one reported by NIKHEF; Maarten suggests installing the latest BDII RPM from the PPS repository; several sites have already installed it.

  • CNAF (Luca): CNAF on scheduled downtime next week - queues will be stopped this Friday afternoon - jobs should be able to complete - power off on Monday morning - some services like mail will stay up - the Tier1 may, optimistically, be back on Thursday.

Wednesday

Attendance: local(Eva, Jean-Philippe, Alessandro, Antonio, Roberto, Olof);remote(Gareth/RAL).

Experiments round table:

  • ATLAS (Alessandro) - reprocessing starts tomorrow under James Cashmore's coordination - some information is still missing: memory required per job (2.2 or 2.4 GB) and the number of CPUs to be used at each site - files data08_cos* will be used for reprocessing (520 TB distributed between the sites).

  • ALICE -

  • LHCb (Roberto) - as reported yesterday, an extra LFC frontend node (lfc201) was installed; the LFC logs have been analyzed and showed suboptimal use of LFC features, especially around sessions and transactions (probably due to recent changes in the Dirac DM framework); the usage is lower today but Sophie will continue to monitor carefully - the file access problem in Lyon was fixed by Lionel (dCache internal locking problem); ticket closed - currently 3 LFC frontends at CERN; Olof wondering if the third one will be needed in the long term; answer: no.

Sites / Services round table:

  • RAL (Gareth): unscheduled downtime at RAL since the double power trip yesterday - CASTOR was up at 14:30 today and everything should be up in the coming hour, including batch - problems with the broadcast mechanism, which did not work for several hours; fixed now.

  • DB (Eva): replication to the RAL LFC restarted yesterday afternoon after the power cut - replication of ATLAS to RAL restarted only this morning because of a DB corruption problem at RAL.

  • gLite releases: - gLite 3.1 Update 42 on 31st March: FTS 2.1 (proxy delegation fix), BDII (reduction of cache size from 1 GB to 50 MB to reduce the footprint), new VOMS host certificate (see below), VDT 1.6.1 release 9 (touches all services) - on 6th April: a new gLite release with CREAM and ICE/WMS fixes.

LHC VOMS Service
On March 31st at 09:00 UTC a routine host certificate change will happen on lcg-voms.cern.ch. The intervention will be transparent and will last only a few minutes. As usual, by this time ALL VOMS-aware services must have deployed the /etc/grid-security/vomsdir/*.lsc file method or have upgraded the lcg-vomscerts package to version 5.4.0-1. (As usual, the .lsc method is not available to the WMProxy.) A standard GOCDB "at risk" has been entered. An illustrative .lsc sketch follows below.
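For sites choosing the .lsc route, a minimal illustrative sketch is shown below; the DNs are placeholders and must be taken from the actual new host certificate announced above:

    # Illustrative only: one two-line .lsc file per VOMS server under
    # /etc/grid-security/vomsdir/ (placeholder DNs, check the real certificate).
    # Line 1: subject DN of the VOMS host certificate.
    # Line 2: subject DN of the CA that issued it.
    cat > /etc/grid-security/vomsdir/lcg-voms.cern.ch.lsc <<'EOF'
    /DC=ch/DC=cern/OU=computers/CN=lcg-voms.cern.ch
    /DC=ch/DC=cern/CN=CERN Trusted Certification Authority
    EOF

Because the .lsc file only lists DNs, a host certificate renewal that keeps the same subject requires no further action on the service side, which is why this method makes such interventions transparent.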

AOB:

Thursday

Attendance: local(Alessandro, Jean-Philippe, Eva, Roberto);remote(Angela/FZK, Brian/RAL, Gareth/RAL).

Experiments round table:

  • ATLAS (Alessandro) - quiet week - 100% success for the weekly functional tests - critical errors seen on the Site Services monitoring machine, due to a race condition between the deletion of a dataset and another subscription to it - reprocessing will only start next week because there are still issues with some muon algorithms.

  • ALICE -

  • LHCb (Roberto) - Dirac DM problem fixed, situation back to normal for the LFC with 15 concurrent requests; can we remove the third server? Will check. SRM problem at CERN and RAL with the LHCB_USER space token; too few servers (two) for the load; the use of disk servers configured with "external gridFTP" does not help; still waiting for the new gfal/lcg_util version to be able to go back to the "internal gridFTP" server configuration.

Sites / Services round table:

  • RAL (Brian): disks for ATLAS_MCDISK full at one UK Tier2; 2 TB added, but what should be done in this case? Reply by Alessandro: sites are free to increase or decrease the disk space provided, but should send the updated configuration to Alexei - in the longer term we should see how to redefine the subscriptions according to the new space distribution.

  • RAL (Gareth): everything OK now after the double power trip on Tuesday, but SAM tests were still failing after the CE recovery (for both the ATLAS and OPS VOs); need to understand why - tape robot problem from 05:00 to 10:00 this morning; OK now.

AOB:

Friday

Attendance: local(Eva, Jean-Philippe);remote(Angela/FZK, Alessandro, Gareth/RAL).

Experiments round table:

  • ATLAS (Alessandro) - many production tasks aborted yesterday because of double replication of data - many deletions launched - the deletions caused an overload on the NDGF SRM server - NDGF was taken out of the deletion agent - they understood the problem: "The problem was a DB schema mismatch which lead to indexes not being used for the locationinfo and retentionpolicy tables in chimera. Not using indexes makes for very slow DB updates, clogging everything up... This is fixed now and should be working as expected." - as the INFN T1 is in scheduled downtime, the LFC has been replicated in Roma; functional tests to the calib disk Tier2 to verify the LFC behaviour. For now it seems OK (still only a few files transferred and registered, but no errors) - normal operation expected over the Grid during the weekend.

  • ALICE -

  • LHCb (Roberto by mail) - as the activity is currently low, LHCb will wait for the next simulation run before giving the OK to remove the third LFC frontend server at CERN.

Sites / Services round table:

  • FZK (Angela): network connection between CNAF and GridKa down yesterday because of a broken switch in Switzerland, now repaired - quite a few SAM tests failing for BDII and sBDII - wondering why there is no backup link between CNAF and GridKa (CNAF decision? To be checked).

  • RAL (Gareth): at risk for CASTOR because of a tape robot control system problem, should be up in the next few hours; currently no tape activity is possible - also a disk problem on an FTS server - both problems may still be a consequence of the power glitches earlier this week.

AOB:

-- JamieShiers - 19 Mar 2009
