Week of 081110

WLCG Service Incident Reports

  • This section lists WLCG Service Incident Reports from the previous weeks (new or updated only).
  • Status can be received, due or open

| Site | Date | Duration | Service | Impact | Report | Assigned to | Status |
| NL-T1 | 21 Oct | 12 hours | most | down | | NL-T1 | received |
| ASGC | 25 Oct | many days | CASTOR | down | http://indico.cern.ch/conferenceDisplay.py?confId=44840 | ASGC | received |
| SARA | 28 Oct | 7 hours | SE/SRM/tape b/e | down | https://twiki.cern.ch/twiki/pub/LCG/WLCGDailyMeetingsWeek081103/post_mortem_tape-system_outage_25_10_in_NL.pdf | SARA | received |
| PIC | 31 Oct | 10 hours | SRM | down | | PIC | received |

GGUS Team / Alarm Tickets

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Nick, Maria, Andrea, Jamie, Maria, Flavia, Patricia, Gavin, Jan, Ed, Simone, Roberto, Olof, Steve);remote(Michael, JT, Michel).

elog review:

Experiments round table:

  • ATLAS (Simone) - TEAM tickets: 4, all from ATLAS; Simone will follow up. The situation has improved a lot; basically no outstanding issue to report. Bottom line: the last problem causing inefficiencies (FZK with pnfs overload) disappeared when the site upgraded from PostgreSQL 8.2 to 8.3. Can observe that this really helped! Basically no problems over the whole weekend. Some problems remain, such as NDGF - Lyon, but for the rest things look good. Q: transferring smoothly to StoRM? A: Yes. Flavia - transfers CNAF - BNL go over the generic link and not the OPN; T1-T1 transfers should use the OPN. CNAF has adopted a different strategy with different links: 10 Gb to FZK and through FZK to SARA (and elsewhere). 10G requested for all T1s but not yet available... Don't want to interfere with T0-T1 transfers. The CNAF-BNL link was tested and seems OK - 800 Mb/s. The problem ATLAS sees is in the reverse direction. Michael - still under investigation; running iperf tests. Michel (GRIF) - problem in France: users with the production role cannot be mapped on the CE. Did something change in ATLAS? Started at the beginning of November... Not related to the last gLite update. Open a GGUS ticket and discuss offline.

  • CMS (Andrea) - in the last day of the cosmics run - CRAFT ends tomorrow morning at 07:00. CERN is failing a basic SAM test because the version of the local site catalog installed is not the same as the one in CVS - "normal" - it won't change until the end of the run (fixed tomorrow). Transfers: issues CERN-FNAL - quality 0 - checksum errors, cause unknown; a problem with PhEDEx, to be investigated. Transfers to FZK failed yesterday because a dCache head node crashed due to running out of memory; fixed. CAF: 18% free - need to clean up. Flavia - what kind of checksum? A: have to check... (see the checksum sketch at the end of this round table).

  • LHCb (Roberto) - Some important points to report today:
    1. The transfer load generator finds:
      1. Error transferring to/from StoRM at CNAF: https://gus.fzk.de/ws/ticket_info.php?ticket=43407 (heavy load on GPFS)
      2. Errors to/from GridKA: https://gus.fzk.de/ws/ticket_info.php?ticket=43408 (fixed over the weekend - same problem as above...)
      3. All transfers into or out of SARA are failing (other than CERN-SARA which is suspiciously perfect). https://gus.fzk.de/ws/ticket_info.php?ticket=43409
    2. Activities running
      After the FEST09 MC simulation completed, the central DIRAC services were upgraded over the weekend and the MC dummy production was restarted. Now running 9K jobs around the grid.
    3. AOB
      Some non-LHCb-specific sites (e.g. Napoli_ATLAS) were reporting thousands of LHCb jobs flooding their queues. Those sites are used in opportunistic mode and were publishing wrong information, attracting many jobs.
    4. (at meeting) MC dummy production up to 9K jobs.
    5. Some sites in Italy are complaining that they are being flooded - they appear 'too attractive' in the IS.
    6. Known problem with streaming to GridKA - tests on LFC from LHCb failing.
    7. PPS CASTOR instance - can we use the FTS production server? JT - ask Philippe to bring this up to the MB. Jeremy - as the (CE) SAM tests are not working for LHCb, what should sites look at? A: relax the CE tests - the only critical tests are very basic ones. If they are failing there are problems (O/S checks, queue & IS check, as well as job submission). JT - should we be worried if there are no jobs? A: didn't check today for NIKHEF; only running the dummy MC production. Will restart some activity and would like to keep free slots at the T1s. JT - the LHCb pilot role - is this still a priority, and if so how high? A: need to cross-check with Philippe. Simone - discuss in the GDB, as ATLAS will have a similar request.

  • ALICE (Patricia) - Regarding the WMS announced by GRIF on Friday: please send the DN. A: OK. Once this is done, the whole ALICE production will go through that WMS. AliROOT had some issues last week - resolved. Sites are (re-)joining production. NIKHEF had to be managed manually; only 7 jobs at NIKHEF, slowly ramping up. Will be announced to all sites this week; will include 2 workarounds for bugs found by Maarten. SLC5 exercise - ran over the weekend completely stably. Upgrade to 64 bits this week? Q: many sites with a CREAM CE? A: only 1 (so far...)
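
The CMS item above mentions CERN-FNAL transfer failures due to checksum errors, with the checksum type still to be confirmed. As a purely illustrative aid (not part of the minutes), the sketch below shows how a per-file checksum mismatch could be spotted, assuming Adler32 checksums are compared between the source and destination copies; the file paths are hypothetical.

    import zlib

    def adler32_of(path, chunk_size=1024 * 1024):
        """Compute the Adler32 checksum of a file, reading it in chunks."""
        value = 1  # Adler32 starting value
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                value = zlib.adler32(chunk, value)
        return value & 0xFFFFFFFF

    # Hypothetical local copies of the same file at the source and destination sites.
    src = adler32_of("/data/source_copy.root")
    dst = adler32_of("/data/destination_copy.root")
    if src != dst:
        print("checksum mismatch: source %08x vs destination %08x" % (src, dst))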

Sites round table:

  • GRIF (Michel) - opened GGUS ticket 43445 on the ATLAS production role mapping problem.

  • NL-T1 (JT) - lattice QCD very happy - they have got 1000 CPUs to use!

  • CERN (Jan) - pre-production endpoints for SLC4. Hope testing will progress quickly so that the SLC3 endpoints can be phased out by the end of November!

Services round table:

  • DB (Maria) - workshop tomorrow and day after. CASTOR, DB services at T0, T1 review. DBA session on Wednesday. Don't see problems with streams for GridKA.

AOB:

Tuesday:

Attendance: local();remote().

elog review:

Experiments round table:

  • LHCb - The StoRM service at CNAF was problematic (down) yesterday and the day before. LHCb noticed that no announcement of the downtime or of the subsequent recovery intervention was submitted in the GOCDB, which does not comply with the MoU. StoRM is now back up, but the correct procedure was not followed.

Sites round table:

  • BNL - Part of the BNL network is down; not only dCache but also several other services are affected (Monday 16:43 UTC). Debugging on a primary network switch resulted in a loss of some services. Back at 17:27 UTC.

Services round table:

AOB:

Wednesday

Attendance: local();remote().

elog review:

Experiments round table:

  • ALICE: Upgrading all sites to the newest version of the WLCG plugin to ensure the usage of several WMS per site. Several minor fixes have also been included. Testing phase of the module at: CERN, Torino, SARA, IN2P3, JINR, Legnaro and Prague. In addition, the FIO team announced to us the migration of the SLC5 nodes in the PPS to 64 bits. Since yesterday ALICE has been testing the nodes and running real jobs, with no notable problems to report.

  • ATLAS: Since yesterday, ASGC has been failing all transfers again. Jason reported that this is again an Oracle problem, but with a different error code with respect to the one seen in the weeks before. Jason is in contact with the CASTOR and Oracle experts at CERN to sort this out.

Sites round table:

  • IN2P3 (Marc HAUSARD) - Due to maintenance work on our dCache software, CCIN2P3 reports a scheduled downtime of the SRM service tomorrow (Thursday 13/11).
    As a consequence of the SRM unavailability, there will also be a scheduled downtime on the computing elements from Wednesday 18:00 UTC until Thursday 13/11 19:00 UTC.

Services round table:

AOB:

Thursday

Attendance: local();remote().

elog review:

Experiments round table:

  • ATLAS (Simone)
    1. The BNL-CNAF network performance is still under investigation, but see the summary from Michael:
      "In summary, it looks like we hit a problem somewhere in/between ESnet and Geant. We have submitted a ticket to ESnet. Two of their engineers were assigned to work on the problem."
      I can provide a long, detailed email from Dantong with the technical details if needed.
    2. ASGC performance has improved, but efficiency is still of the order of 70%, and from time to time I see an obscure (to me) error like the one below (see also the diagnostic sketch at the end of this round table).
      Jason is on the problem:

[FTS] FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during PREPARATION phase: [GENERAL_FAILURE] stage_put error: Error in insert request: Database error, Oracle code: 1578
ORA-01578: ORACLE data block corrupted (file # 10, block # 407142)
ORA-01110: data file 10: '/u02/oradata/castordb/stager_01.dbf'
Statement was: INSERT INTO PutStartRequest (subreqId, diskServer, fileSystem, fileId, nsHost, flags, userName, euid, egid, mask, pid, machine, svcClassName, userTag, reqId, creationTime, lastModificationTime, id, svcClass, client) VALUES (:1,:2,:3,:4,:5,:6,:7,:8,:9,:10,:11,:12,:13,:14,:15,:16,:17,ids_seq.nextval,:18,:19) RETURNING id INTO :20
and parameters' values were:
subreqId: 1847864927
diskServer: c2fs026.grid.sinica.edu.tw
fileSystem: /diskpool/data01/
fileId: 11870603
nsHost: c2ns
flags: 0
userName: stage
euid: 506
egid: 506
mask: 0
pid: 17790
machine: c2fs026.grid.sinica.edu.tw
svcClassName: atlasPrdD1T0
userTag:
reqId: 491c15d1-0000-1000-acf7-ab95ddd3b454
creationTime: 0
lastModi]

    3. CNAF has been showing intermittent behavior over the last 24h for transfers to StoRM. The problem is a generic SRM destination error, which I have been told boils down to a StoRM front-end problem.
    4. In agreement with Maria and Diana, I submitted alarm tickets to all ATLAS T1s to check whether there is some flaw in the procedure at the sites. I did this during the daytime, but I forgot that 11 AM CET is night-time in Vancouver, and I woke up the site admin. I would like to present my apologies.

  • LHCb (Roberto) - Again CNAF StoRM seems to be problematic: it recovered yesterday, but today it is back to failures. Opened a GGUS ticket against the LFC at GridKA because of failures in the replication of the LFC information.
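
The ORA-01578 error quoted above points at a corrupted data block (file 10, block 407142) in the CASTOR stager database at ASGC. As a hedged illustration only (not the procedure ASGC reported using), the sketch below shows how such corruptions could be listed from Python via the cx_Oracle client, assuming DBA access to the V$DATABASE_BLOCK_CORRUPTION view (populated e.g. by an RMAN VALIDATE run); the connection details are placeholders.

    import cx_Oracle  # Oracle client library; needs an Oracle client installation

    # Placeholder credentials and DSN - not the real ASGC stager DB settings.
    conn = cx_Oracle.connect("dba_user", "password", "dbhost/castordb")
    cur = conn.cursor()
    cur.execute(
        "SELECT file#, block#, blocks, corruption_type "
        "FROM v$database_block_corruption ORDER BY file#, block#"
    )
    for file_no, block_no, blocks, corruption_type in cur:
        # An entry with file# 10 and block# 407142 would match the ORA-01578 above.
        print("file %d block %d (%d blocks): %s" % (file_no, block_no, blocks, corruption_type))
    cur.close()
    conn.close()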

Sites round table:

Services round table:

  • DB (Miguel) - The intervention scheduled for today (LCG production DB - 13 nov - 10am (CET) - rolling at CERN) finished successfully. The production database cluster is now running with the latest security patches and several bug fixes.
    Reminder that on the 18th November the validation and development system will not be available from 8:00 to 17:00, due to electrical maintenance in the computer vault. This will affect the following services:
    • INT11R - INT11R_LB, INT11R_NOLB, LCG_CERT_SAM_INT11R, LCG_DPM_INT11R, LCG_FTS_INT11R, LCG_FTS_PPS_INT11R, LCG_GRIDVIEW_INT11R, LCG_LDIF_INT11R, LCG_LFC_INT11R, LCG_MESSAGING_INT11R, LCG_SAM_INT11R, LCG_VOMS_INT11R
    • INT6R - LCG_DEV_DB

AOB:

Friday

Attendance: local();remote().

elog review:

Experiments round table:

  • ATLAS (Simone) - all problems mentioned yesterday (TW CASTOR problems and CNAF intermittent behavior) have been solved. All transfers are currently running at near 100% efficiency.

Sites round table:

Services round table:

AOB:

-- JamieShiers - 10 Nov 2008
