Week of 110822

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
  • Change assessments: CASTOR Change Assessments

General Information

  • General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Jamie, Dirk, Cedric, Hong, Ian, Mike, MariaDZ, Lola, Eva, Alex); remote(Michael, Gonzalo, Onno, Tiju, Jhen-Wei, Giovanni, Catalina, Vladimir, Rob).

Experiments round table:

  • ATLAS reports -
  • T0/Central Services
    • ntr
  • T1 sites
    • BNL: problem mentioned on Friday (LFC not reachable from outside: GGUS:73642) fixed (firewall issue)
    • Files that cannot be staged at TRIUMF GGUS:73645 (fixed), GridKa GGUS:73678 (fixed), PIC GGUS:73680, IN2P3-CC GGUS:73683
    • SARA: tape downtime. T0 export to the site has been stopped.
    • IN2P3-CC: problem at Lyon on IN2P3-CC_VL (GGUS:73461)
    • RAL :
      • SRM suffering from some deadlocks GGUS:73644 [ Tiju - fixed on Saturday night ]
      • Disk server gdss230, part of the atlasStripDeg (d1t0) space token, is currently unavailable at the RAL Tier1 (Sunday 21 August 2011 11:25 BST).
  • T2 sites
    • CA-SCINET-T2 : SRM problems GGUS:73684 (Thunderstorm activity in the Toronto area caused power glitches which took out our cooling system at the datacentre, hence all systems were shutdown.)
    • IL-TAU-HEP : SRM problem GGUS:73681


  • CMS reports -
  • CERN / central services
    • CERN production (CERN-PROD) dropped out of the BDII. A ticket was submitted. This has only a small analysis impact (see the BDII query sketch after this report).
  • T1 sites:
    • We were trying to get pilots running on the new whole node queue at IN2P3. Looks like a CREAM issue and experts are working
    • File consistency problem reported at FNAL
  • T2 sites:
    • NTR
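
For reference, the "dropped out of the BDII" symptom above can be checked by querying the top-level BDII over LDAP for the site's Glue entry. A minimal sketch, assuming the python ldap3 package, the standard Glue 1.3 attribute names, and the usual top-level BDII endpoint and port (the hostname and port are assumptions here, not taken from these minutes):

```python
# Minimal sketch: check whether a site (e.g. CERN-PROD) is published in a top-level BDII.
# Hostname, port and attribute names are assumptions (standard Glue 1.3 conventions).
from ldap3 import Server, Connection

BDII_HOST = "lcg-bdii.cern.ch"   # assumed top-level BDII endpoint
BDII_PORT = 2170                 # conventional BDII LDAP port

def site_published(site_name: str) -> bool:
    server = Server(BDII_HOST, port=BDII_PORT)
    conn = Connection(server, auto_bind=True)   # anonymous bind, as the BDII allows
    conn.search(
        search_base="o=grid",
        search_filter=f"(&(objectClass=GlueSite)(GlueSiteUniqueID={site_name}))",
        attributes=["GlueSiteUniqueID"],
    )
    found = len(conn.entries) > 0
    conn.unbind()
    return found

if __name__ == "__main__":
    print("CERN-PROD published:", site_published("CERN-PROD"))
```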


  • LHCb reports - Ongoing processing of data.
    • T1
      • IN2P3 :
        • "slow access to data" solved
      • PIC :
        • Problems with data access.
        • "Problems transferring files to CERN" (GGUS:73576) was solved.
      • GRIDKA :
        • Faulty connections from 192.108.46.248 (GGUS:73630) still open.

Sites / Services round table:

  • BNL - ntr
  • PIC - comment on the LHCb situation: just back from vacation; it is true that last week we were a bit thin and unresponsive for LHCb, as the contact person had to be away. Starting today, acting as the LHCb backup contact. Regarding the GGUS ticket opened for files transferred PIC-CERN and failing: the situation is understood - some transfers from tape were hanging and a restart was needed; about 100 files were affected, mainly ATLAS but also LHCb. Still following up - some issues remain but will check.
  • NL-T1 - ntr
  • ASGC - ntr
  • CNAF - concerning the CE problems reported on Friday: the SAM test failure was due to a front-end machine which had a disk problem. No update yet on the other test failure.
  • FNAL - investigating a mismatch of 8K files between PhEDEx and the local data system. Initial analysis: those files were somehow moved, but nothing else.
  • OSG - ntr

  • CERN - looking into the missing CERN-PROD entry in the BDII

AOB:

Tuesday:

Attendance: local(Jamie, Oliver, Hong, Mike, Manuel, Lola); remote(Joel, Kyle, Xavier, Michael, Tore, Tiju, Ronald, Catalina, Shu-Ting, Gonzalo, Giovanni).

Experiments round table:

  • ATLAS reports -
  • T0/Central Services
    • 2 central service interventions yesterday:
      1. ATLAS DDM central catalogue intervention: temporary glitches in ADC activities
      2. Frontier intervention: not well advertised to ADC shifters/experts (Savannah:85879)
    • T0 data export service (Santa-Claus) upgrade this morning
    • Files have locality "unavailable" in the CERN-PROD_DATADISK space token (GGUS:73746) [ latest news from the ticket: a disk server was inaccessible but is now back in production ]
  • T1 sites
    • INFN: fail to contact SRM endpoint (GGUS:73619)
    • IN2P3-CC: job failures (GGUS:73461) and tape staging issue (GGUS:73683). Site claimed both problems should be fixed now.
    • PIC:
      • Transient file transfer failures from CERN-PROD_TZERO to PIC_DATADISK due to SRM overload. Problem fixed.
      • Massive job failures due to a missing ATLAS release (GGUS:73732); software re-installation in progress. [ Gonzalo - related to the ongoing installation of an ATLAS release. A number of jobs are failing as this release is not yet ready. There is a Savannah ticket where the ATLAS team is following this. Jobs still seem to be sent, so the queue has been stopped for the time being. ]
  • T2 sites:
    • ntr


  • CMS reports -
  • CERN / central services
    • CERN Production dropped out of BDII. This only has a small analysis impact. Solved by BDII restart, GGUS:73700
    • tomorrow, Aug. 24th, WLCG production database (LCGR) rolling intervention, service will be available but degraded, shifters and users informed. VOMS proxy init - will it fail or hang? A: should work ok during intervention (Manuel)
  • T1 sites:
    • Still trying to get pilots running on the new whole node queue at IN2P3. Looks like a CREAM issue and experts are working.
    • File consistency problem resolved at FNAL. The problem was that the adler32 checksum calculation crashed for 12 production jobs. The jobs were not declared as failures and resubmitted, so the files were registered without an adler32 checksum. File metadata and the production infrastructure have been corrected (see the checksum sketch after this report).
  • T2 sites:
    • NTR
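
As context for the checksum issue above: adler32 is the checksum registered alongside each file, and it can be recomputed from the file content. A minimal sketch of such a recomputation, assuming only Python's standard zlib module (the file path and chunk size below are illustrative, not taken from these minutes):

```python
# Minimal sketch: recompute the adler32 checksum of a file, chunk by chunk,
# and format it as the 8-hex-digit string commonly stored in file catalogues.
# The path used in the example is purely illustrative.
import zlib

def file_adler32(path: str, chunk_size: int = 4 * 1024 * 1024) -> str:
    value = 1  # adler32 starting value
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"

if __name__ == "__main__":
    print(file_adler32("/tmp/example.dat"))  # illustrative path
```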


  • ALICE reports -
    • T0 site
      • ALICE has agreed to the CASTOR upgrade on 31st August, as proposed last week by Massimo
      • GGUS:73759: big mismatch between the number of jobs reported by MonaLisa and the information provider. Under investigation
    • T1 sites
      • Nothing to report
    • T2 sites
      • Sao Paulo. GGUS:73676. The site administrator's certificate was invalidated by the VOMRS CA synchronizer. The registration appears as 'Approved' but he cannot create a proxy as lcgadmin. Solved

  • LHCb reports - Ongoing processing of data.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 1

Issues at the sites and services

  • T0
  • T1
    • PIC :
    • SARA :
      • Problem with the lhcb-user space token: there are 5 space tokens, 2 of them with data but only 1 active


Sites / Services round table:

  • BNL - ntr
  • KIT - ntr
  • NDGF -
  • RAL - ntr
  • NL-T1 - ntr
  • FNAL - we investigated the 8K-file mismatch between PhEDEx and the local file system - it was just a name change. Reminder of the downtime on Thursday, ~4h.
  • ASGC - ntr
  • PIC - ntr
  • CNAF - ntr
  • OSG - ntr

AOB:

Wednesday

Attendance: local(Hong, Jamie, Maria, Mike, Oliver, Luca, Fernando, Alex, Edoardo, MariaDZ); remote(Michael, Joel, Tore, Onno, Kyle, Gonzalo, John, Claudia, Catalina, Soichi, Shu-Ting).

Experiments round table:

  • ATLAS reports -
  • T0/Central Services
    • A hypervisor incident this morning caused T0 export services to be down for a few hours
    • 2 DB rolling interventions this afternoon
      1. ATLARC
      2. LCGR - we were reminded of this yesterday by the CMS report; it would be appreciated if IT-DB could give a brief reminder in the daily WLCG Operations Meeting one day before the intervention. Also, since the intervention will affect/degrade the LFC and VOMS services, it would be nice if an "at-risk" downtime could be declared in GOCDB. [ Luca - this has been on the status board for a few days and was communicated to the ATLAS DBAs. For the LCG DBs we put it on the status board and announce ~1 week in advance. Will try to add a reminder. ]
  • T1 sites
    • IN2P3-CC: still see files not being staged from tape, GGUS:73683 reopened
  • T2 sites
    • CA-SCINET-T2: failed to contact remote SRM, GGUS:73684 reopened

  • CMS reports -
  • CERN / central services
  • T1 sites:
    • Still trying to get pilots running on the new whole node queue at IN2P3. Looks like a CREAM issue and experts are working.
    • tomorrow, Aug. 25th, 3 PM to 7 PM CERN time, FNAL maintenance downtime
    • having trouble with some WMS at CNAF, tracked in savannah: SAVANNAH:123072
  • T2 sites:
    • NTR


  • ALICE reports -
    • General Information: 6 MC cycles, Pass0 and Pass1 reco and a couple of analysis trainings ongoing. ~33K Running jobs
    • T0 site
      • Mismatch between the number of jobs reported by MonaLisa and the information provider solved yesterday. GGUS:73759 solved
    • T1 sites
      • Nothing to report
    • T2 sites
      • Usual operations

  • LHCb reports - Ongoing processing of data; finalisation of the current production to clear an old data access problem

Sites / Services round table:

  • BNL - ntr
  • NDGF - ntr
  • NL-T1 - a small follow-up on the issue reported by LHCb yesterday: there are 5 entries in our DB for the LHCb user space token, only one of which is active. The rest are historical - kept by dCache but no longer working. We are trying to find out why some files are not in any space token; this will take some time.
  • PIC - the situation with ATLAS is that the Panda queues are still closed; waiting for the release of 16.2.7 to be finalized, then will open the queues again
  • RAL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • ASGC - ntr
  • OSG - Soichi from OSG operations has questions about the BDII. He is talking to David Collados about GGUS:73058. OSG is working to upgrade its production version of the BDII/CMON software to the latest version and would have to switch to the UMD repository; they did not know about this and wanted to make sure that this is the right approach! [ MariaDZ - will take this offline with David ]

  • CERN DB - ATLAS archive and tag DBs patched, WLCG now, tomorrow rolling security patches. Offline DBs for ALICE and COMPASS. Had a crash of one instance of ADCR - instance 3 - at 12:54, due to an Oracle bug. DQ2 and Panda do not run on that instance.

AOB:

Thursday

Attendance: local(Alex, Oliver, Jamie, Hong, Fernando, Steve, Eva, Massimo, Gavin, Lola); remote(Jhen-Wei, Joel, Ronald, Jeremy, Tiju, Rolf, Catalin, Pavel, Gonzalo, Rob, Claudia).

Experiments round table:

  • ATLAS reports -
  • T0/Central Services
    • The AMI (read-only) instance at CERN did not start up properly after the hypervisor incident yesterday
    • 2 out of the 3 load-balanced machines for the Panda server have dropped out of the load balancing, causing high load on the remaining machine and therefore service hangs. The reason is that "/tmp" on those 2 machines is filled up by temporary files created by Panda. Experts are investigating. Production and distributed analysis activities are affected (see the /tmp monitoring sketch after this report).
    • Reprocessed AOD/DESD data distribution to T1s was started yesterday and is progressing well.
    • One of our key production users cannot access CERN CASTOR through SRM with his own certificate (GGUS:73794), probably related to the CA certificate expiration 2 days ago. He needs to register input files at the CERN SRM for MC production.
  • T1 sites
    • IN2P3-CC: the last part of the RAW-to-ESD jobs of data reprocessing (the part involving tape staging) finished this morning. The local batch system was re-configured by the site to avoid failures of file-merging jobs caused by high load on the WNs.
    • PIC got ATLAS release 16.6.7.6.1 reinstalled last night, production queue re-opened.
  • T2 sites
    • ntr
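
Regarding the Panda "/tmp" issue above: the failure mode is simply a filesystem filling up and knocking nodes out of the load-balanced alias. A small monitoring sketch of the kind that could flag this early; the threshold, "stale" age and path are illustrative assumptions, and this is not the actual Panda operations tooling:

```python
# Minimal sketch: warn when /tmp usage crosses a threshold and list the largest
# stale files, so operators can spot runaway temporary files early.
# Threshold and "stale" age are illustrative choices.
import os
import shutil
import time

TMP_PATH = "/tmp"
USAGE_THRESHOLD = 0.90        # warn above 90% full (assumption)
STALE_AGE_SECONDS = 6 * 3600  # files untouched for 6h count as stale (assumption)

def check_tmp(path: str = TMP_PATH) -> None:
    usage = shutil.disk_usage(path)
    fraction_used = (usage.total - usage.free) / usage.total
    if fraction_used < USAGE_THRESHOLD:
        return
    now = time.time()
    stale = []
    for entry in os.scandir(path):
        try:
            if entry.is_file(follow_symlinks=False):
                st = entry.stat(follow_symlinks=False)
                if now - st.st_mtime > STALE_AGE_SECONDS:
                    stale.append((st.st_size, entry.path))
        except OSError:
            continue  # file may disappear while scanning
    print(f"WARNING: {path} is {fraction_used:.0%} full")
    for size, name in sorted(stale, reverse=True)[:10]:
        print(f"  {size / 1e6:8.1f} MB  {name}")

if __name__ == "__main__":
    check_tmp()
```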


  • CMS reports -
  • CERN / central services
    • today, Aug. 25th, 2 PM to 5 PM CERN time, CASTORCMS upgrades http://itssb.web.cern.ch/planned-intervention/castorcms-update/25-08-2011, users informed
    • There looks to be a new Apache weakness in 'Range' headers (CVE-2011-3192). See (1) for a good description of possible steps to mitigate the problem while the httpd developers are working on a patch. CMS will apply the first suggested mitigation on our web services cluster today. Thanks to Lassi Tuura for finding this (an illustrative sketch of the mitigation logic follows this report).
  • T1 sites:
    • Still trying to get pilots running on the new whole node queue at IN2P3. Looks like a CREAM issue and experts are working.
    • today, Aug. 25th, 3 PM to 7 PM CERN time, FNAL maintenance downtime
  • T2 sites:
    • NTR
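
On the Apache 'Range' header weakness (CVE-2011-3192) mentioned above: the published mitigations amount to dropping or ignoring Range headers that request an excessive number of byte ranges. In production this is done with Apache configuration directives; the Python sketch below only illustrates the decision logic, and the threshold of 5 ranges is a commonly cited value from the advisory, not the CMS configuration:

```python
# Minimal sketch: decide whether a Range request header looks like a
# CVE-2011-3192-style attack (too many byte ranges) and should be ignored.
# This mirrors the logic of the suggested Apache mitigation; it is not the
# CMS production configuration.
MAX_RANGES = 5  # commonly cited limit on byte ranges per request

def range_header_is_suspicious(range_header: str) -> bool:
    """Return True if the Range header asks for more than MAX_RANGES ranges."""
    if not range_header.lower().startswith("bytes="):
        return False  # not a byte-range request
    ranges = [r for r in range_header[len("bytes="):].split(",") if r.strip()]
    return len(ranges) > MAX_RANGES

if __name__ == "__main__":
    benign = "bytes=0-1023"
    attack = "bytes=" + ",".join(f"{i}-" for i in range(200))
    print(range_header_is_suspicious(benign))  # False
    print(range_header_is_suspicious(attack))  # True
```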


  • ALICE reports - General information: 6 MC cycles, Pass0 and Pass1 reco and a couple of analysis trainings ongoing. ~33K Running jobs
    • T0 site
      • Mismatch between the number of jobs reported by MonaLisa and the information provider solved yesterday. GGUS:73759 solved
    • T1 sites
      • Nothing to report
    • T2 sites
      • Usual operations


  • LHCb reports -
  • Ongoing processing of data; finalisation of the current production to clear an old data access problem

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0
  • T1
    • SARA : is it possible to assign the 1TB of the "historical" LHCB-USER space token to LHCB-USER? (GGUS:73087)
    • SARA in unscheduled downtime this morning - the information was received 3h later! [ Ronald - SARA declared the downtime as soon as possible but the portal delayed the notification. ]


Sites / Services round table:

  • ASGC - reminder of the downtime for the CASTOR upgrade, from next Monday 10:00 UTC until Tuesday. Downtime next Tuesday for network maintenance, 05:00 - 10:00 UTC. Both are in GOCDB
  • BNL -
  • NL-T1 - concerning the unscheduled downtime: this was a consequence of MSS tests that went wrong. Earlier this week we performed a migration which went fine, apart from this problem this morning, so we are still AT RISK until the tests are finished
  • IN2P3 - ntr
  • RAL - ntr
  • FNAL - downtime ongoing, ntr otherwise
  • PIC - ntr
  • CNAF - ntr
  • KIT - ntr
  • GridPP - ntr
  • OSG - following up on the item brought up yesterday. We heard back about which distribution we should use for the BDII and CMON software. [ MariaDZ - have put the answer in the GGUS ticket opened yesterday ]

  • CERN FTS T2 Service Proposal:
    • Migrate last 2 of the 3 FTS agent nodes from SL4 to SL5 on Wednesday 31st August, 08:00 to 10:00 UTC.
    • During the intervention newly submitted transfers will queue up rather than being processed. Otherwise it is expected to be transparent.
    • The Tier2 service has been partially running SL5 for a couple of months already for some other active channels.
    • Rollback is quite possible.
    • Affected channels: CERN-MUNICHMPPMU CERN-UKT2 CERN-GENEVA CERN-NCP NCP-CERN CERN-DESY DESY-CERN CERN-MICHIGAN CERN-INFNROMA1 CERN-INFNNAPOLIATLAS CERN-MUNICH CERN-KOLKATA KOLKATA-CERN INDIACMS-CERN CERN-INDIACMS CERN-JINR CERN-PROTOVINO JINR-CERN PROTOVINO-CERN CERN-KI KI-CERN CERN-SINP SINP-CERN CERN-TROITSK TROITSK-CERN STAR-TROITSK CERN-ITEP ITEP-CERN STAR-ITEP CERN-PNPI PNPI-CERN STAR-PNPI STAR-PROTOVINO STAR-JINR STAR-KI STAR-SINP CERN-IFIC CERN-STAR STAR-CERN
    • Migration will be confirmed or not on Monday 29th at 15:00 CEST meeting.

  • CERN central services
    • gLite WMS for CMS back to normal, status of jobs up-to-date on all WMS nodes.

  • CERN grid - on Tuesday at 09:00 we would like to migrate the myproxy service to a new service. Should be transparent. Oli - if registered with the old service, will users be registered with the new one? A: yes, the old directory will be synced over.

  • CERN grid - move the backing NFS store of the batch service to better hardware. This will require a stop - not a draining - of the batch service. Cannot quit or query jobs during this period. Proposed for the morning of Wednesday, i.e. 31st August.

  • CERN storage - intervention going on, first part finished, DB part started, on track so far.

  • CERN DB - LCG, ATLAS archive and PDB upgraded with latest patches, will continue with other DBs next week.

AOB:

Friday

Attendance: local(Hong, Massimo, Eva, Jamie, Mike, Dirk, Oliver, Maria, Nilo, Alex); remote(Ulf, Lorenzo, Gonzalo, Gareth, Michael, Onno, Jhen-Wie, Rolf, Joel, Rob).

Experiments round table:

  • ATLAS reports -
  • T0/Central Services
    • Preparing for the LFC DB migration / CASTORATLAS defragmentation next Monday - users have been notified; the Panda queue for the CERN grid site will be closed on Sunday and data transfers will be stopped on Monday.
    • The proposed schedule for the CERN FTS-T2 intervention (Wednesday morning next week) is fine with ATLAS
    • new data project tag: mc11_900GeV
  • T1 sites
    • IN2P3-CC: unscheduled downtime due to a power cut in the machine room, now recovered. There are still issues with output-merging jobs from data reprocessing
    • Some files at RAL MCTAPE are in status "UNAVAILABLE", preventing files from being staged from tape (GGUS:73850)
  • T2 sites
    • ntr


  • CMS reports -
  • CERN / central services
    • CASTOR upgrades yesterday worked fine, we didn't get any complaints, thanks. [ but GGUS:73848 arrived just before the meeting: the CASTORCMS default service class is degraded - Massimo is looking into it ] Overload; some "top users" with 7K pending requests; would like to understand why before contacting the users.
    • next week:
      • Tuesday, Aug. 30., 9 AM CERN time, migrate myproxy.cern.ch to new service, transparent for users (users registered with the old service will stay registered with the new service due to sync)
      • Planned: Wednesday, Aug. 31., complete SL5 migration of CERN T2 FTS, 8 AM to 10 AM UTC, transfer requests during this time will be queued, will be confirmed in Monday's WLCG call
      • Planned: Wednesday, Aug. 31., morning, move backing NFS store of batch system to better hardware, requires stop of batch system, running jobs keep on running, requested feedback from CMS by noon, Monday, Aug. 29.
  • T1 sites:
    • FNAL downtime seems to have worked fine as well
    • next week
      • Monday/Tuesday, Aug. 29./30., ASGC downtime to upgrade Castor
      • Wednesday, Aug. 31., PIC at risk due to network operations on firewall testing a patch that solves some problems with multicast traffic
  • T2 sites:
    • NTR



  • LHCb reports - Ongoing processing of data; finalisation of the current production to clear an old data access problem

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0
  • T1
    • CERN : pending (GGUS:73810) [ Massimo - working on it so no news ]
    • SARA : pending (GGUS:73087) [ Onno - will update ticket ]
    • IN2P3 : Unscheduled downtime


Sites / Services round table:

  • NDGF - one machine room on backup today, down due to cooling problems; another issue: one of the Swedish ATLAS clusters is working badly because ATLAS keeps sending a lot of short jobs that process 15GB of data and then quit, thrashing the storage on that cluster. We are tracking down the user. Tapes in Denmark are working again - a tape controller crashed on Wednesday. Also, ALICE transfers to NDGF have not been working since March, but it took until this week for ALICE to tell us!!! [ Hong - do you know if these are analysis jobs or production? A: no ]
  • CNAF - ntr
  • PIC - yesterday we finally saw that ATLAS installed the release that had been pending for a couple of days, so we could open the Panda queues again. Did this around noon; there was then a very fast increase of ATLAS jobs, and we got 1500 merge jobs which generated high load on pnfs. This caused a degradation of the storage service for a couple of hours. Jobs were working, but CMS saw some SAM tests timing out. Recovered and now OK
  • RAL - ticket from ATLAS about unavailable files: 1 disk server was out of production; resolved just before the meeting and the ticket set as solved. RAL is closed Monday and Tuesday next week, so will not phone in unless there is a particular reason
  • BNL - a major hurricane is approaching the NE US, with Long Island directly in its path, so there is a fair amount of emergency planning. Impact expected Sunday. Cooling and power to the lab may be turned down, and we may be forced to take down all facility services on Sunday afternoon. Also planning a pnfs->Chimera intervention for Wed/Thu that may spill into Friday. [ Rob - if you need any help with communications, use my cell. ]
  • NL-T1 - yesterday we had an unscheduled downtime. After the downtime had finished, a few pool nodes had crashed. We restarted them, but it is possible that some files were unavailable after the downtime.
  • ASGC - ntr
  • IN2P3 - as mentioned, we had a power cut (an external one). For some time our local batteries and generators worked well, and power was back before there was a major impact, but our cooling machinery did not restart, which caused a breakdown of nearly all worker nodes. We are still investigating the reasons. The first machines were back at 09:30 and the situation was completely re-established at 11:00. We will produce a SIR as usual once we know more about the cooling failure.
  • FNAL - yesterday's downtime completed on time. We were unable to do the hardware upgrade of the pnfs server; working with the vendor to resolve this.
  • OSG - ntr

  • CERN DB: next Mon/Tue we will upgrade all ATLAS, CMS and LHCb production DBs with the latest Oracle security patches. Rolling.

  • CERN storage: next week, intervention on LHCb on Tuesday - 2.1.11.2, enabling the transfer manager; September 1st for ALICE. On Monday, maintenance on the DB for ATLAS and also the CASTOR version to be upgraded; this is a long operation on the DB, so it will take all morning. On Wednesday, a N/S maintenance intervention on the DB, not a move to new DB hardware; this is an upgrade that should be transparent.

AOB:

-- JamieShiers - 18-Jul-2011
