Week of 120528

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
  Change assessments: CASTOR Change Assessments

General Information

  General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
  GGUS Information: GgusInformation
  LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday - No meeting - Whit Monday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

Tuesday

Attendance: local(Alex, Elisa, Ian, Ikuo, Luc, Maarten, Mike, Xavier E);remote(Jeff, Jeremy, Jhen-Wei, Lisa, Michael, Rob, Rolf, Tiju, Ulf, Xavier M).

Experiments round table:

  • ATLAS reports -
    • LHC/ATLAS - physics data taking. Important and urgent MC production on-going (not new).
    • WLCG services
      • GOCDB (GGUS:82543) site/downtime info not accessible. Solved.
    • T0
      • CERN EOS (GGUS:82557 verified) open/create error: No space left on device
        • solution: raised the file quota from 8 to 9 million in DATADISK
    • T1
      • SARA SE instability: GGUS:82032 -> SARA removed (blacklisted) from T0 export until the SE issue is solved
      • TAIWAN-LCG2_MCTAPE back in T0 (SAV:128845)
      • TRIUMF MCTAPE (GGUS:82556) Free=3.774 Total=25.0. Solved.
      • TRIUMF reported 1 TAPE lost
      • INFN-T1 DATADISK (GGUS:82517) invalid path -- likely too many directories; empty directories are being removed
      • BNL-OSG2 (GGUS:82539 assigned) failed to contact on remote SRM -- apparently solved, but no further response in the ticket after the initial one: "The hardware of the SRM database had a disk problem. We are working on it."
        • Michael: on Fri there was a problem with the SSD hosting the SRM DB, which was resolved within 1 hour; will check what happened with updating the ticket
        • Ikuo: the important matter was that the problem was solved very quickly; there may also have been confusion with GGUS:82538, which was closed before further updates could be added

  • CMS reports -
    • LHC machine / CMS detector
      • Physics data taking
    • CERN / central services and T0
      • NTR
    • Tier-1/2:
      • Storage problem over the weekend, tracked to a user overloading the system. Since resolved.
        • Ian: the affected site was FNAL; it looked like a PNFS overload
      • Simulation remains the highest priority for the week
    • Other:
      • Ian Fisk is CRC

  • LHCb reports -
    • User analysis, DataReprocessing of 2012 data and prompt reconstruction at T1s ongoing
    • MC production at Tier-2s
    • New GGUS (or RT) tickets
    • T0:
      • ntr
    • T1:
      • SARA: unscheduled downtime this morning
      • IN2P3: ongoing investigation about corrupted files (GGUS:82247)
        • Elisa: the last affected file dated from March 20
        • Rolf: LHCb does not test for checksum errors immediately after their transfers, while ATLAS and CMS do, thereby avoiding problems later
        • Elisa: LHCb will implement such checks for FTS transfers; it may take a few weeks before the new code is in production; still, such problems are not normal! (A minimal post-transfer checksum check is sketched after this report.)
        • Rolf: not clear where the problem is exactly, it may not be the site's fault
        • Elisa: what about stageout from the WN?
        • Ikuo: ATLAS also test the checksums for those transfers
        • Ian: for CMS it depends on the site; we have not often seen such problems and will rather detect them in PhEDEx when an affected data set is transferred
    • Central services
      • request to LFC support (GGUS:81924)
        • reassigned to CERN LFC support after the meeting
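
The discussion above concerns verifying checksums right after a transfer, before the replica is accepted. Below is a minimal, hypothetical sketch of such a check in Python, assuming the catalogue stores Adler32 checksums as 8-digit hex strings; the function names, file path and checksum value are illustrative only, not LHCb's actual transfer code.

    import zlib

    def adler32_hex(path, chunk_size=1024 * 1024):
        """Compute the Adler32 checksum of a file as an 8-digit lowercase hex string."""
        value = 1  # Adler32 is seeded with 1
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                value = zlib.adler32(chunk, value)
        return "%08x" % (value & 0xFFFFFFFF)

    def verify_replica(local_path, catalogue_checksum):
        """Compare a freshly transferred replica against the catalogue checksum.

        A mismatch should trigger a re-transfer and a ticket instead of
        silently registering a possibly corrupted replica.
        """
        actual = adler32_hex(local_path)
        expected = catalogue_checksum.lower()
        if actual != expected:
            print("checksum mismatch for %s: got %s, expected %s"
                  % (local_path, actual, expected))
            return False
        return True

    # Hypothetical usage (illustrative path and checksum):
    # verify_replica("/storage/lhcb/RAW/run123456.raw", "1a2b3c4d")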

Sites / Services round table:

  • ASGC - ntr
  • BNL - nta
  • FNAL - ntr
  • GridPP - ntr
  • IN2P3 - nta
  • KIT - ntr
  • NDGF
    • ATLAS reported they could not get certain files from tape, due to a problem with the tape system in Bergen; some data might be lost
    • the main OPN link to CERN is down for ongoing work; currently connected through backup route via SARA
  • NLT1
    • as the SARA SRM problem (GGUS:82490) started after an upgrade, a downgrade was performed to see if that helps
    • today's unscheduled downtime (11:30-13:30 UTC) was to deal with network issues
  • OSG - ntr
  • RAL
    • Yesterday, Castor-LHCb was switched to using the Transfer Manager. Tomorrow we will switch Castor-Gen (used by ALICE).
    • Request to other Tier-1s: if you have enabled hyper-threading on your batch farm, please get in contact with Alastair Dewhurst (alastair.dewhurst@cern.ch)

  • dashboards
    • only CNAF and KIT still to fill out the Active MQ patch Doodle poll
      • Xavier M: KIT should have the patch running
      • Mike: the KIT FTS still has a problem, viz. a 57% deficit of transfers reported via Active MQ
  • grid services - ntr
  • storage - ntr

AOB: (MariaDZ) GGUS Release tomorrow! Usual round of ALARM and user test tickets as per Savannah:128369.

Wednesday

Attendance: local(Alessandro, Alex, Elisa, Jan, Luc, Maarten, Marcin, Maria D, Massimo, Mike);remote(Giovanni, Gonzalo, Ian, Jhen-Wei, Lisa, Michael, Pavel, Rob, Rolf, Ron, Tiju).

Experiments round table:

  • ATLAS reports -
    • LHC/ATLAS - physics data taking (No stable beam since 10:30 yesterday). Important and urgent MC production on-going (not new).
    • WLCG services
      • NTR
    • T0
      • NTR
    • T1
      • SARA SE instability: GGUS:82490. Correlation with data deletion. SARA removed from T0 export until the SE issue is solved.
      • INFN-T1 unscheduled downtime (SRM). A priori ATLAS not affected; as a precaution INFN-T1 removed from T0 export.
      • Slow transfers IN2P3-CC -> TRIUMF: 2 bad machines at Lyon identified (GGUS:82363). Dedicated FTS monitoring might be needed.
        • Luc: difficult to track the time spent on a given transfer
        • Alessandro: the info is probably available via the message bus, but so far it is not clear to us; we need to be able to select transfers taking longer than N hours (one possible selection is sketched after this report)
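
As a rough illustration of the selection Alessandro describes, the sketch below filters transfer-completion records by duration. It assumes the bus messages have already been dumped to a file with one JSON document per line; the field names (src, dst, t_start, t_end) and the file name are hypothetical and would have to be mapped onto the real FTS message schema.

    import json

    def slow_transfers(lines, threshold_hours=6):
        """Yield (record, duration) for transfers longer than the threshold.

        Assumed (hypothetical) fields per JSON record:
          src, dst        - source and destination site names
          t_start, t_end  - start/end times as Unix timestamps (seconds)
        """
        limit = threshold_hours * 3600
        for line in lines:
            line = line.strip()
            if not line:
                continue
            rec = json.loads(line)
            duration = rec["t_end"] - rec["t_start"]
            if duration > limit:
                yield rec, duration

    # Hypothetical usage: flag slow IN2P3-CC -> TRIUMF transfers
    with open("fts_transfer_messages.jsonl") as f:
        for rec, duration in slow_transfers(f, threshold_hours=6):
            if rec["src"] == "IN2P3-CC" and rec["dst"] == "TRIUMF":
                print("%s -> %s took %.1f h" % (rec["src"], rec["dst"], duration / 3600.0))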

  • CMS reports -
    • LHC machine / CMS detector
      • Physics data taking
    • CERN / central services and T0
      • Lost a disk server in the T0streamer pool. Lots of errors in repack and Express. A team ticket was created and the problem is being addressed.
    • Tier-1/2:
      • HammerCloud failure at T1_FR_IN2P3. Exit code indicates failure to open an input file.
      • T1_IT_CNAF is expected back up soon

  • ALICE reports -
    • NIKHEF jobs using Torrent since yesterday evening.
    • SARA job submission was blocked overnight by a stuck resource BDII on a CREAM CE (GGUS:82618)
    • CNAF: no jobs running since last night; 299 waiting.

  • LHCb reports -
    • User analysis, prompt reconstruction and stripping at T1s ongoing
    • MC production at Tier-2s
    • New GGUS (or RT) tickets
    • T1:
      • IN2P3 ticket for jobs killed due to memory limit has been closed (GGUS:82544)
        • Elisa: is the 5 GB memory limit for the long queue at IN2P3 per process or per job?
        • Rolf: per process group, i.e. per job

Sites / Services round table:

  • ASGC - ntr
  • BNL
    • brief VOMS outage this morning; a hanging daemon had to be restarted
  • CNAF
    • work ongoing to recover from earthquake fallout; not clear when affected services can be restored for CMS and ALICE
    • at-risk downtime 09:00-12:00 UTC on May 31 for network backbone maintenance
  • FNAL - ntr
  • IN2P3 - ntr
  • KIT
    • the FTS manager tried to apply the Active MQ patch, but it was no longer in the repository; the FTS developers have been contacted
  • NLT1
    • still investigating the slowness of the SRM (GGUS:82490); LHCb are hammering the service with srmLs requests at ~10 Hz, which may not fully explain the problem, but certainly needs fixing
      • Elisa: we will look into tuning our clients better to lower the rate (see the throttling sketch at the end of this round table)
      • Alessandro: correlation with deletion activity?
      • Ron: unclear, but deletions do not come with their own srmLs overload; note that the Chimera service is shared between the experiments
  • OSG - ntr
  • PIC - ntr
  • RAL
    • this morning's CASTOR upgrade went OK

  • dashboards - nta
  • databases
    • yesterday the ATLAS PVSS service was moved to the 3rd node of the ATLAS online DB, after which nodes 1 and 2 were rebooted to clear alarms from the ATLAS DCS about event buffering; the underlying cause is not understood, but the issue was also cured by a reboot the previous time, a few months ago
    • today nodes 1 and 2 of the WLCG RAC were rebooted after agreement with the Dashboard team; other services were not affected
  • GGUS/SNOW
    • a new GGUS release happened today; a summary of the test alarms and test tickets with attachments will be given tomorrow
  • grid services
    • 3 extra WMS nodes were added for CMS, after recent user activities caused stress on the sandbox file systems of the other machines
  • storage
    • no news yet for CMS t0streamer disk server; the hope is that all files can be recovered
    • EOSCMS had a deadlock early afternoon, cured by a restart after some debugging
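
On the client-side tuning mentioned under NLT1 above (lowering the ~10 Hz srmLs rate), the sketch below shows one simple way to throttle a polling loop; the target rate, the SURL list and the srm_ls call are placeholders, not LHCb's actual client code.

    import time

    class RateLimiter:
        """Enforce a minimum spacing between successive calls (client-side throttle)."""

        def __init__(self, max_calls_per_second):
            self.min_interval = 1.0 / max_calls_per_second
            self._last_call = 0.0

        def wait(self):
            """Sleep just long enough to stay under the configured rate."""
            now = time.monotonic()
            delay = self._last_call + self.min_interval - now
            if delay > 0:
                time.sleep(delay)
            self._last_call = time.monotonic()

    # Hypothetical usage: poll at 2 requests/second instead of ~10 Hz
    limiter = RateLimiter(max_calls_per_second=2)
    for surl in ["srm://example-srm-endpoint/some/path"]:  # placeholder SURLs
        limiter.wait()
        # srm_ls(surl)  # placeholder for the real SRM client call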

AOB:

Thursday

Attendance: local(Alex, Luc, Maarten, Marcin, Maria D, Mike, Xavier E);remote(Andreas M, Elisa, Gonzalo, Ian, Jeremy, Jhen-Wei, Joel, John, Kyle, Lisa, Michael, Onno, Paolo, Rolf, Ulf).

Experiments round table:

  • ATLAS reports -
    • LHC/ATLAS - physics data taking (many short fills yesterday). Important and urgent MC production on-going (not new).
    • WLCG services
      • NTR
    • T0
      • NTR
    • T1
      • SARA SE instability: GGUS:82490. Situation more stable now (lower LHCb activity?), but Destination errors persist. Still removed from T0 export.
      • INFN-T1 scheduled downtime (9:00-12:00) for backbone network intervention. Still removed from T0 export.

  • CMS reports -
    • LHC machine / CMS detector
      • Physics ongoing
    • CERN / central services and T0
      • NTR
    • Tier-1/2:
      • HammerCloud failure at T1_ES_PIC. SAV:129052
      • T1_IT_CNAF still doesn't report green in the site status board
        • Ian: CNAF seems to be working again for CMS (see CNAF comment below)
      • Error in migration at RAL, but looks like an understood backlog

  • LHCb reports -
    • User analysis, prompt reconstruction and stripping at T1s ongoing
    • MC production at Tier-2s
    • New GGUS (or RT) tickets
    • T1:
      • SARA: decreased the rate of job submission for merging jobs, in order to reduce the pressure on the SRM a bit
      • CERN: RAW transfers failing randomly (GGUS:82745)
        • Xavier: affected service class? we will look into it
        • Joel: lhcbtape; we did various tests with RFIO and GridFTP; could there be a problem with just 1 machine?
          • the problem turned out to be on the LHCb side

Sites / Services round table:

  • ASGC
    • on Mon there will be a downtime (0:00-10:00 UTC) for core switch maintenance and CASTOR DB table index reconstruction; all services will be unavailable
  • BNL - ntr
  • CNAF
    • today's network maintenance went OK
    • the ATLAS StoRM production back end had a problem around noon, solved in 30 min; under investigation
    • work in progress to restore storage services for CMS and ALICE
  • FNAL - ntr
  • GridPP - ntr
  • IN2P3
    • working on the issue of corrupted LHCb files, examining FTS logs to understand the checksum mismatch rates
  • KIT
    • FTS Active MQ rpms are available again and will be deployed this afternoon
  • NDGF
    • tape problem in Bergen: 20 tapes are suspect, 2 out of 4 LTO drives died; no news today due to a strike; will try to get lists of lost files for ATLAS and/or ALICE
    • direct OPN link has been restored
  • NLT1
    • SRM slowness: LHCb decreasing the job rate helps for pinpointing the problem; can the ATLAS T0 export be restarted to see how that goes now?
      • Luc: we will first try targeting just the DATADISK space, because it is less critical; DATATAPE would come later
      • Joel: do you plan to give each experiment its own SRM?
      • Onno: we will consider that; for now we intend to move the DB to better HW, as it is easier and a good idea anyway
      • Joel: KIT have given ATLAS and CMS their own SRMs and are doing the same for LHCb now, as the limits were reached
      • Onno: we can contact KIT for advice as needed
  • OSG
    • yesterday's 2 GGUS test alarm tickets went OK
  • PIC
  • RAL
    • CMS CASTOR upgrade to use Transfer Manager went OK this morning

  • dashboards
    • CNAF still to fill out FTS Active MQ patch Doodle poll
      • Paolo: will inform FTS admin
  • databases - ntr
  • GGUS/SNOW
    • see AOB
  • grid services - ntr
  • storage
    • yesterday's crashed t0streamer disk server: waiting for vendor to replace a controller; all files still inaccessible, but hopefully recoverable

AOB: (MariaDZ) The GGUS release yesterday went well. Of the test tickets, 5 are waiting to be closed by WLCG supporters (CERN, OSG, NGI_IT); ticket numbers are in Savannah:128369#comment4 . Our "Did you know?..." this month explains the meaning of GGUS ticket number colours in email notifications and search results. The gus.fzk.de domain decommissioning was abandoned to avoid any user inconvenience; details in https://ggus.eu/pages/news_detail.php?ID=461 . Other highlights of the release include a standard Cc address in LHCb TEAM tickets and new Support Units for Experiment Dashboards and REBUS; details in https://ggus.eu/pages/news_detail.php?ID=458

Friday

Attendance: local(Alessandro, Cedric, David, Elisa, Jan, Luc, Maarten, Marcin);remote(Gareth, Gonzalo, Ian, Jhen-Wei, Lisa, Michael, Onno, Rob, Roger, Rolf, Xavier M).

Experiments round table:

  • ATLAS reports -
    • LHC/ATLAS - physics data taking (yesterday a 10-hour fill). Important and urgent MC production on-going (not new).
    • WLCG services
      • NTR
    • T0
      • NTR
    • T1
      • SARA SE instability: GGUS:82490. Destination errors still occurring. Put back into T0 export for disk only.
      • INFN-T1 stable. Put back into T0 export for disk only. If stable over the weekend, tape will also be put back.

  • CMS reports -
    • LHC machine / CMS detector
      • Physics ongoing
    • CERN / central services and T0
      • Delayed Prompt Reconstruction for an additional 24 hours to solve a calibration issue
    • Tier-1/2:
      • No samples transferred to RAL were being marked as migrated to tape (GGUS:82775)
        • Gareth: the files are indeed not being migrated, looking into it
      • Tickets for consistency checks on storage for May were issued

  • ALICE reports -
    • CNAF back since noon

  • LHCb reports -
    • User analysis, prompt reconstruction and stripping at T1s ongoing
    • MC production at Tier-2s
    • New GGUS (or RT) tickets
    • T1:
      • IN2P3: 26k files unavailable due to a hardware problem (GGUS:82751)
        • Rolf: the machine had crashed due to an unknown cause; its microcode has been updated at the request of the manufacturer, to get more information if it crashes again; currently the machine is in production and its files are available again
        • Elisa: we had made those files temporarily invisible for our users, now we will make them available again

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • CNAF
    • CMS and ALICE services are OK now after a successful intervention on the corresponding storage HW
    • the FTS Active MQ patch Doodle poll has been completed
  • FNAL - ntr
  • IN2P3 - nta
  • KIT
    • FTS Active MQ patch has been deployed
  • NDGF
    • tape problem in Bergen:
      • broken drives were replaced
      • some files may be lost, more news ASAP
  • NLT1
    • SARA outage 9:00-12:00 UTC on Mon:
      • moving the SRM name space database to more powerful hardware
      • network firmware updates
      • CREAM and WN maintenance
  • OSG - ntr
  • PIC - ntr
  • RAL
    • Overnight we had a problem with the LHCb SRMs. This was triggered by a daemon taking all the memory, which in turn was triggered by stopping the Castor DLF (logging) database.
    • We are investigating a problem with migration of files to tape for CMS.
    • RAL Closed next Monday & Tuesday (4/5 June).
    • Announced in GOC DB: Castor Atlas instance moving to use "Transfer Manager" scheduler next Thursday (7th June).
    • We are planning some interventions - not yet in the GOC DB as still being finalised:
      • Wednesday 13th June: update Castor to version 2.1.11-9 (Castor expected down from morning to early afternoon); also update the database behind LFC & FTS to Oracle 11g (these services expected down for the whole working day).
      • Tuesday 19th June: Site networking upgrade. RAL Tier1 disconnected from rest of world for up to 3 hours in the morning.
      • Wednesday 27th June: Update Castor databases to Oracle 11g. Castor down whole of working day.

  • dashboards
    • FTS Active MQ patch: all sites look OK, KIT still to be checked again
  • databases - ntr
  • storage
    • CMS t0streamer disk server: waiting on vendor call
      • Ian: recovery of the files on that server does not look urgent

AOB:

-- JamieShiers - 11-Apr-2012
