Week of 110822

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
  • Change assessments: CASTOR Change Assessments

General Information

  • General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Jamie, Dirk, Cedric, Hong, Ian, Mike, MariaDZ, Lola, Eva, Alex); remote(Michael, Gonzalo, Onno, Tiju, Jhen-Wei, Giovanni, Catalina, Vladimir, Rob).

Experiments round table:

  • ATLAS reports -
  • T0/Central Services
    • ntr
  • T1 sites
    • BNL: problem mentioned on Friday (LFC not reachable from outside: GGUS:73642) fixed (firewall issue)
    • Files that cannot be staged at TRIUMF GGUS:73645 (fixed), GridKa GGUS:73678 (fixed), PIC GGUS:73680, IN2P3-CC GGUS:73683
    • SARA: tape downtime. T0 export to the site has been stopped.
    • IN2P3-CC: problem at Lyon on IN2P3-CC_VL (GGUS:73461)
    • RAL :
      • SRM suffering from some deadlocks GGUS:73644 [ Tiju - fixed on Saturday night ]
      • Disk server gdss230, part of the atlasStripDeg (d1t0) space token, is currently unavailable at the RAL Tier1 (Sunday 21 August 2011 11:25 BST).
  • T2 sites
    • CA-SCINET-T2 : SRM problems GGUS:73684 (Thunderstorm activity in the Toronto area caused power glitches which took out our cooling system at the datacentre, hence all systems were shutdown.)
    • IL-TAU-HEP : SRM problem GGUS:73681


  • CMS reports -
  • CERN / central services
    • CERN production (CERN-PROD) dropped out of the BDII. A ticket was submitted. This has only a small analysis impact (see the BDII query sketch after this report).
  • T1 sites:
    • We were trying to get pilots running on the new whole node queue at IN2P3. Looks like a CREAM issue and experts are working
    • File consistency problem reported at FNAL
  • T2 sites:
    • NTR
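
For reference, the "dropped out of the BDII" symptom above can be checked by querying the top-level BDII over LDAP for the site's Glue entry. A minimal sketch, assuming the python ldap3 package, the standard Glue 1.3 attribute names, and the usual top-level BDII endpoint and port (the hostname and port are assumptions here, not taken from these minutes):

```python
# Minimal sketch: check whether a site (e.g. CERN-PROD) is published in a top-level BDII.
# Hostname, port and attribute names are assumptions (standard Glue 1.3 conventions).
from ldap3 import Server, Connection

BDII_HOST = "lcg-bdii.cern.ch"   # assumed top-level BDII endpoint
BDII_PORT = 2170                 # conventional BDII LDAP port

def site_published(site_name: str) -> bool:
    server = Server(BDII_HOST, port=BDII_PORT)
    conn = Connection(server, auto_bind=True)   # anonymous bind, as the BDII allows
    conn.search(
        search_base="o=grid",
        search_filter=f"(&(objectClass=GlueSite)(GlueSiteUniqueID={site_name}))",
        attributes=["GlueSiteUniqueID"],
    )
    found = len(conn.entries) > 0
    conn.unbind()
    return found

if __name__ == "__main__":
    print("CERN-PROD published:", site_published("CERN-PROD"))
```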


  • LHCb reports - Ongoing processing of data.
    • T1
      • IN2P3 :
        • "slow access to data" solved
      • PIC :
        • Problems with data access.
        • "Problems transferring files to CERN" (GGUS:73576) was solved.
      • GRIDKA :
        • Faulty connections from 192.108.46.248 (GGUS:73630) still open.

Sites / Services round table:

  • BNL - ntr
  • PIC - comment on the LHCb situation: just back from vacation; it is true that last week we were a bit thin and unresponsive for LHCb, as the contact person had to be away. Starting today, acting as the LHCb backup contact. Regarding the GGUS ticket opened for files transferred PIC-CERN and failing: the situation is understood - some transfers from tape were hanging and a restart was needed; about 100 files were affected, mainly ATLAS but also LHCb. Still following up - some issues remain but will check.
  • NL-T1 - ntr
  • ASGC - ntr
  • CNAF - concerning the CE problems reported on Friday: the SAM test failure was due to a front-end machine which had a disk problem. No update yet on the other test failure.
  • FNAL - investigating a mismatch of 8K files between PhEDEx and the local data system. Initial analysis: those files were somehow moved, but nothing else.
  • OSG - ntr

  • CERN - looking into the missing CERN-PROD entry in the BDII

AOB:

Tuesday:

Attendance: local(Jamie, Oliver, Hong, Mike, Manuel, Lola); remote(Joel, Kyle, Xavier, Michael, Tore, Tiju, Ronald, Catalina, Shu-Ting, Gonzalo, Giovanni).

Experiments round table:

  • ATLAS reports -
  • T0/Central Services
    • 2 central service interventions yesterday:
      1. ATLAS DDM central catalogue intervention: temporary glitches in ADC activities
      2. Frontier intervention: not well advertised to ADC shifters/experts (Savannah:85879)
    • T0 data export service (Santa-Claus) upgrade this morning
    • Files have locality "unavailable" in the CERN-PROD_DATADISK space token (GGUS:73746) [ latest news from the ticket: a disk server was inaccessible but is now back in production ]
  • T1 sites
    • INFN: fail to contact SRM endpoint (GGUS:73619)
    • IN2P3-CC: job failures (GGUS:73461) and tape staging issue (GGUS:73683). Site claimed both problems should be fixed now.
    • PIC:
      • Transient file transfer failures from CERN-PROD_TZERO to PIC_DATADISK due to SRM overload. Problem fixed.
      • Massive job failures due to a missing ATLAS release (GGUS:73732); software re-installation in progress. [ Gonzalo - related to the ongoing installation of an ATLAS release. A number of jobs are failing as this release is not yet ready. There is a Savannah ticket where the ATLAS team is following this. Jobs still seem to be sent, so the queue has been stopped for the time being. ]
  • T2 sites:
    • ntr


  • CMS reports -
  • CERN / central services
    • CERN Production dropped out of BDII. This only has a small analysis impact. Solved by BDII restart, GGUS:73700
    • tomorrow, Aug. 24th, WLCG production database (LCGR) rolling intervention, service will be available but degraded, shifters and users informed. VOMS proxy init - will it fail or hang? A: should work ok during intervention (Manuel)
  • T1 sites:
    • Still trying to get pilots running on the new whole node queue at IN2P3. Looks like a CREAM issue and experts are working.
    • File consistency problem resolved at FNAL. The problem was that the adler32 checksum calculation crashed for 12 production jobs. The jobs were not declared as failures and resubmitted, so the files were registered without an adler32 checksum. File metadata and the production infrastructure have been corrected (see the checksum sketch after this report).
  • T2 sites:
    • NTR
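
As context for the checksum issue above: adler32 is the checksum registered alongside each file, and it can be recomputed from the file content. A minimal sketch of such a recomputation, assuming only Python's standard zlib module (the file path and chunk size below are illustrative, not taken from these minutes):

```python
# Minimal sketch: recompute the adler32 checksum of a file, chunk by chunk,
# and format it as the 8-hex-digit string commonly stored in file catalogues.
# The path used in the example is purely illustrative.
import zlib

def file_adler32(path: str, chunk_size: int = 4 * 1024 * 1024) -> str:
    value = 1  # adler32 starting value
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"

if __name__ == "__main__":
    print(file_adler32("/tmp/example.dat"))  # illustrative path
```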


  • ALICE reports -
    • T0 site
      • ALICE has agreed to the CASTOR upgrade on 31st August, as proposed last week by Massimo
      • GGUS:73759: big mismatch between the number of jobs reported by MonaLisa and the information provider. Under investigation
    • T1 sites
      • Nothing to report
    • T2 sites
      • Sao Paulo. GGUS:73676. The site administrator's certificate was invalidated by the VOMRS CA synchronizer. The registration appears as 'Approved' but he cannot create a proxy as lcgadmin. Solved

  • LHCb reports - Ongoing processing of data.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 1

Issues at the sites and services

  • T0
  • T1
    • PIC :
    • SARA :
      • Problem with the lhcb-user space token: there are 5 space tokens, 2 of them with data but only 1 active


Sites / Services round table:

  • BNL - ntr
  • KIT - ntr
  • NDGF -
  • RAL - ntr
  • NL-T1 - ntr
  • FNAL - we investigated the 8K-file mismatch between PhEDEx and the local file system - it was just a name change. Reminder of the downtime on Thursday, ~4h.
  • ASGC - ntr
  • PIC - ntr
  • CNAF - ntr
  • OSG - ntr

AOB:

Wednesday

Attendance: local(Hong, Jamie, Maria, Mike, Oliver, Luca, Fernando, Alex, Edoardo, MariaDZ); remote(Michael, Joel, Tore, Onno, Kyle, Gonzalo, John, Claudia, Catalina, Soichi, Shu-Ting).

Experiments round table:

  • ATLAS reports -
  • T0/Central Services
    • A hypervisor incident this morning caused T0 export services to be down for a few hours
    • 2 DB rolling interventions this afternoon
      1. ATLARC
      2. LCGR - we were reminded of this yesterday by the CMS report; it would be appreciated if IT-DB could give a brief reminder in the daily WLCG Operations Meeting one day before the intervention. Also, since the intervention will affect/degrade the LFC and VOMS services, it would be nice if an "at-risk" downtime could be declared in GOCDB. [ Luca - this has been on the status board for a few days and was communicated to the ATLAS DBAs. For the LCG DBs we put it on the status board and announce ~1 week in advance. Will try to add a reminder. ]
  • T1 sites
    • IN2P3-CC: still see files not being staged from tape, GGUS:73683 reopened
  • T2 sites
    • CA-SCINET-T2: failed to contact remote SRM, GGUS:73684 reopened

  • CMS reports -
  • CERN / central services
  • T1 sites:
    • Still trying to get pilots running on the new whole node queue at IN2P3. Looks like a CREAM issue and experts are working.
    • tomorrow, Aug. 25th, 3 PM to 7 PM CERN time, FNAL maintenance downtime
    • having trouble with some WMS at CNAF, tracked in savannah: SAVANNAH:123072
  • T2 sites:
    • NTR


  • ALICE reports -
    • General Information: 6 MC cycles, Pass0 and Pass1 reco and a couple of analysis trainings ongoing. ~33K Running jobs
    • T0 site
      • Mismatch between the number of jobs reported by MonaLisa and the information provider solved yesterday. GGUS:73759 solved
    • T1 sites
      • Nothing to report
    • T2 sites
      • Usual operations

  • LHCb reports - Ongoing processing of data; finalisation of the current production to clear an old data access problem

Sites / Services round table:

  • BNL - ntr
  • NDGF - ntr
  • NL-T1 - a small follow-up on the issue reported by LHCb yesterday: there are 5 entries in our DB for the LHCb user space token, only one of which is active. The rest are historical - kept by dCache but no longer working. We are trying to find out why some files are not in any space token; this will take some time.
  • PIC - the situation with ATLAS is that the Panda queues are still closed; waiting for the release of 16.2.7 to be finalized, then will open the queues again
  • RAL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • ASGC - ntr
  • OSG - Soichi from OSG operations has questions about the BDII. He is talking to David Collados about GGUS:73058. OSG is working to upgrade its production version of the BDII/CMON software to the latest version and would have to switch to the UMD repository; they did not know about this and wanted to make sure that this is the right approach! [ MariaDZ - will take this offline with David ]

  • CERN DB - ATLAS archive and tag DBs patched, WLCG now, tomorrow rolling security patches. Offline DBs for ALICE and COMPASS. Had a crash of one instance of ADCR - instance 3 - at 12:54, due to an Oracle bug. DQ2 and Panda do not run on that instance.

AOB:

Thursday

Attendance: local(Alex, Oliver, Jamie, Hong, Fernando, Steve, Eva, Massimo, Gavin, Lola); remote(Jhen-Wei, Joel, Ronald, Jeremy, Tiju, Rolf, Catalin, Pavel, Gonzalo, Rob, Claudia).

Experiments round table:

  • ATLAS reports -
  • T0/Central Services
    • The AMI (read-only) instance at CERN did not start up properly after the hypervisor incident yesterday
    • 2 out of the 3 load-balanced machines for the Panda server have dropped out of the load balancing, causing high load on the remaining machine and therefore service hangs. The reason is that "/tmp" on those 2 machines is filled up by temporary files created by Panda. Experts are investigating. Production and distributed analysis activities are affected (see the /tmp monitoring sketch after this report).
    • Reprocessed AOD/DESD data distribution to T1s was started yesterday and is progressing well.
    • One of our key production users cannot access CERN CASTOR through SRM with his own certificate (GGUS:73794), probably related to the CA certificate expiration 2 days ago. He needs to register input files at the CERN SRM for MC production.
  • T1 sites
    • IN2P3-CC: the last part of the RAW-to-ESD jobs of data reprocessing (the part involving tape staging) finished this morning. The local batch system was re-configured by the site to avoid failures of file-merging jobs caused by high load on the WNs.
    • PIC got ATLAS release 16.6.7.6.1 reinstalled last night, production queue re-opened.
  • T2 sites
    • ntr
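
Regarding the Panda "/tmp" issue above: the failure mode is simply a filesystem filling up and knocking nodes out of the load-balanced alias. A small monitoring sketch of the kind that could flag this early; the threshold, "stale" age and path are illustrative assumptions, and this is not the actual Panda operations tooling:

```python
# Minimal sketch: warn when /tmp usage crosses a threshold and list the largest
# stale files, so operators can spot runaway temporary files early.
# Threshold and "stale" age are illustrative choices.
import os
import shutil
import time

TMP_PATH = "/tmp"
USAGE_THRESHOLD = 0.90        # warn above 90% full (assumption)
STALE_AGE_SECONDS = 6 * 3600  # files untouched for 6h count as stale (assumption)

def check_tmp(path: str = TMP_PATH) -> None:
    usage = shutil.disk_usage(path)
    fraction_used = (usage.total - usage.free) / usage.total
    if fraction_used < USAGE_THRESHOLD:
        return
    now = time.time()
    stale = []
    for entry in os.scandir(path):
        try:
            if entry.is_file(follow_symlinks=False):
                st = entry.stat(follow_symlinks=False)
                if now - st.st_mtime > STALE_AGE_SECONDS:
                    stale.append((st.st_size, entry.path))
        except OSError:
            continue  # file may disappear while scanning
    print(f"WARNING: {path} is {fraction_used:.0%} full")
    for size, name in sorted(stale, reverse=True)[:10]:
        print(f"  {size / 1e6:8.1f} MB  {name}")

if __name__ == "__main__":
    check_tmp()
```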


  • CMS reports -
  • CERN / central services
    • today, Aug. 25th, 2 PM to 5 PM CERN time, CASTORCMS upgrades http://itssb.web.cern.ch/planned-intervention/castorcms-update/25-08-2011, users informed
    • There looks to be a new Apache weakness in 'Range' headers (CVE-2011-3192). See (1) for a good description of possible steps to mitigate the problem while the httpd developers are working on a patch. CMS will apply the first suggested mitigation on our web services cluster today. Thanks to Lassi Tuura for finding this (an illustrative sketch of the mitigation logic follows this report).
  • T1 sites:
    • Still trying to get pilots running on the new whole node queue at IN2P3. Looks like a CREAM issue and experts are working.
    • today, Aug. 25th, 3 PM to 7 PM CERN time, FNAL maintenance downtime
  • T2 sites:
    • NTR
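
On the Apache 'Range' header weakness (CVE-2011-3192) mentioned above: the published mitigations amount to dropping or ignoring Range headers that request an excessive number of byte ranges. In production this is done with Apache configuration directives; the Python sketch below only illustrates the decision logic, and the threshold of 5 ranges is a commonly cited value from the advisory, not the CMS configuration:

```python
# Minimal sketch: decide whether a Range request header looks like a
# CVE-2011-3192-style attack (too many byte ranges) and should be ignored.
# This mirrors the logic of the suggested Apache mitigation; it is not the
# CMS production configuration.
MAX_RANGES = 5  # commonly cited limit on byte ranges per request

def range_header_is_suspicious(range_header: str) -> bool:
    """Return True if the Range header asks for more than MAX_RANGES ranges."""
    if not range_header.lower().startswith("bytes="):
        return False  # not a byte-range request
    ranges = [r for r in range_header[len("bytes="):].split(",") if r.strip()]
    return len(ranges) > MAX_RANGES

if __name__ == "__main__":
    benign = "bytes=0-1023"
    attack = "bytes=" + ",".join(f"{i}-" for i in range(200))
    print(range_header_is_suspicious(benign))  # False
    print(range_header_is_suspicious(attack))  # True
```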


  • ALICE reports - General information: 6 MC cycles, Pass0 and Pass1 reco and a couple of analysis trainings ongoing. ~33K Running jobs
    • T0 site
      • Mismatch between the number of jobs reported by MonaLisa and the information provider solved yesterday. GGUS:73759 solved
    • T1 sites
      • Nothing to report
    • T2 sites
      • Usual operations


  • LHCb reports -
  • Ongoing processing of data; finalisation of the current production to clear an old data access problem

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0
  • T1
    • SARA : is it possible to assign the 1TB of the "historical" LHCB-USER space token to LHCB-USER? (GGUS:73087)
    • SARA in unscheduled downtime this morning - the information was received 3h later! [ Ronald - SARA declared the downtime as soon as possible but the portal delayed the notification. ]


Sites / Services round table:

  • ASGC - reminder of the downtime for the CASTOR upgrade, from next Monday 10:00 UTC until Tuesday. Downtime next Tuesday for network maintenance, 05:00 - 10:00 UTC. Both are in GOCDB
  • BNL -
  • NL-T1 - concerning the unscheduled downtime: this was a consequence of MSS tests that went wrong. Earlier this week we performed a migration which went fine, apart from this problem this morning, so we are still AT RISK until the tests are finished
  • IN2P3 - ntr
  • RAL - ntr
  • FNAL - downtime ongoing, ntr otherwise
  • PIC - ntr
  • CNAF - ntr
  • KIT - ntr
  • GridPP - ntr
  • OSG - following up on the item brought up yesterday. We heard back about which distribution we should use for the BDII and CMON software. [ MariaDZ - have put the answer in the GGUS ticket opened yesterday ]

  • CERN FTS T2 Service Proposal:
    • Migrate last 2 of the 3 FTS agent nodes from SL4 to SL5 on Wednesday 31st August, 08:00 to 10:00 UTC.
    • During the intervention newly submitted transfers will queue up rather than being processed. Otherwise it is expected to be transparent.
    • The Tier2 service has been partially running SL5 for a couple of months already for some other active channels.
    • Rollback is quite possible.
    • Affected channels: CERN-MUNICHMPPMU CERN-UKT2 CERN-GENEVA CERN-NCP NCP-CERN CERN-DESY DESY-CERN CERN-MICHIGAN CERN-INFNROMA1 CERN-INFNNAPOLIATLAS CERN-MUNICH CERN-KOLKATA KOLKATA-CERN INDIACMS-CERN CERN-INDIACMS CERN-JINR CERN-PROTOVINO JINR-CERN PROTOVINO-CERN CERN-KI KI-CERN CERN-SINP SINP-CERN CERN-TROITSK TROITSK-CERN STAR-TROITSK CERN-ITEP ITEP-CERN STAR-ITEP CERN-PNPI PNPI-CERN STAR-PNPI STAR-PROTOVINO STAR-JINR STAR-KI STAR-SINP CERN-IFIC CERN-STAR STAR-CERN
    • Migration will be confirmed or not on Monday 29th at 15:00 CEST meeting.

  • CERN central services
    • gLite WMS for CMS back to normal, status of jobs up-to-date on all WMS nodes.

  • CERN grid - on Tuesday at 09:00 we would like to migrate the myproxy service to a new service. Should be transparent. Oli - if registered with the old service, will users be registered with the new one? A: yes, the old directory will be synced over.

  • CERN grid - move the backing NFS store of the batch service to better hardware. This will require a stop - not a draining - of the batch service. Cannot quit or query jobs during this period. Proposed for the morning of Wednesday, i.e. 31st August.

  • CERN storage - intervention going on, first part finished, DB part started, on track so far.

  • CERN DB - LCG, ATLAS archive and PDB upgraded with latest patches, will continue with other DBs next week.

AOB:

Friday

Attendance: local(Hong, Massimo, Eva, Jamie, Mike, Dirk, Oliver, Maria, Nilo, Alex); remote(Ulf, Lorenzo, Gonzalo, Gareth, Michael, Onno, Jhen-Wie, Rolf, Joel, Rob).

Experiments round table:

  • ATLAS reports -
  • T0/Central Services
    • Preparing for the LFC DB migration / CASTORATLAS defragmentation next Monday - users have been notified; the Panda queue for the CERN grid site will be closed on Sunday and data transfers will be stopped on Monday.
    • The proposed schedule for the CERN FTS-T2 intervention (Wednesday morning next week) is fine with ATLAS
    • new data project tag: mc11_900GeV
  • T1 sites
    • IN2P3-CC: unscheduled downtime due to a power cut in the machine room, now recovered. There are still issues with output-merging jobs from data reprocessing
    • Some files at RAL MCTAPE are in status "UNAVAILABLE", preventing files from being staged from tape (GGUS:73850)
  • T2 sites
    • ntr


  • CMS reports -
  • CERN / central services
    • CASTOR upgrades yesterday worked fine, we didn't get any complaints, thanks. [ but GGUS:73848 arrived just before the meeting: the CASTORCMS default service class is degraded - Massimo is looking into it ] Overload; some "top users" with 7K pending requests; would like to understand why before contacting the users.
    • next week:
      • Tuesday, Aug. 30., 9 AM CERN time, migrate myproxy.cern.ch to new service, transparent for users (users registered with the old service will stay registered with the new service due to sync)
      • Planned: Wednesday, Aug. 31., complete SL5 migration of CERN T2 FTS, 8 AM to 10 AM UTC, transfer requests during this time will be queued, will be confirmed in Monday's WLCG call
      • Planned: Wednesday, Aug. 31., morning, move backing NFS store of batch system to better hardware, requires stop of batch system, running jobs keep on running, requested feedback from CMS by noon, Monday, Aug. 29.
  • T1 sites:
    • FNAL downtime seems to have worked fine as well
    • next week
      • Monday/Tuesday, Aug. 29./30., ASGC downtime to upgrade Castor
      • Wednesday, Aug. 31., PIC at risk due to network operations on firewall testing a patch that solves some problems with multicast traffic
  • T2 sites:
    • NTR



  • LHCb reports - Ongoing processing of data; finalisation of the current production to clear an old data access problem

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0
  • T1
    • CERN : pending (GGUS:73810) [ Massimo - working on it so no news ]
    • SARA : pending (GGUS:73087) [ Onno - will update ticket ]
    • IN2P3 : Unscheduled downtime


Sites / Services round table:

  • NDGF - one machine room on backup today, down due to cooling problems; another issue: one of the Swedish ATLAS clusters is working badly because ATLAS keeps sending a lot of short jobs that process 15GB of data and then quit, thrashing the storage on that cluster. We are tracking down the user. Tapes in Denmark are working again - a tape controller crashed on Wednesday. Also, ALICE transfers to NDGF have not been working since March, but it took until this week for ALICE to tell us!!! [ Hong - do you know if these are analysis jobs or production? A: no ]
  • CNAF - ntr
  • PIC - yesterday we finally saw that ATLAS installed the release that had been pending for a couple of days, so we could open the Panda queues again. Did this around noon; there was then a very fast increase of ATLAS jobs, and we got 1500 merge jobs which generated high load on pnfs. This caused a degradation of the storage service for a couple of hours. Jobs were working, but CMS saw some SAM tests timing out. Recovered and now OK
  • RAL - ticket from ATLAS about unavailable files: 1 disk server was out of production; resolved just before the meeting and the ticket set as solved. RAL is closed Monday and Tuesday next week, so will not phone in unless there is a particular reason
  • BNL - a major hurricane is approaching the NE US, with Long Island directly in its path, so there is a fair amount of emergency planning. Impact expected Sunday. Cooling and power to the lab may be turned down, and we may be forced to take down all facility services on Sunday afternoon. Also planning a pnfs->Chimera intervention for Wed/Thu that may spill into Friday. [ Rob - if you need any help with communications, use my cell. ]
  • NL-T1 - yesterday we had an unscheduled downtime. After the downtime had finished, a few pool nodes had crashed. We restarted them, but it is possible that some files were unavailable after the downtime.
  • ASGC - ntr
  • IN2P3 - as mentioned, we had a power cut (an external one). For some time our local batteries and generators worked well, and power was back before there was a major impact, but our cooling machinery did not restart, which caused a breakdown of nearly all worker nodes. We are still investigating the reasons. The first machines were back at 09:30 and the situation was completely re-established at 11:00. We will produce a SIR as usual once we know more about the cooling failure.
  • FNAL - yesterday's downtime completed on time. We were unable to do the hardware upgrade of the pnfs server; working with the vendor to resolve this.
  • OSG - ntr

  • CERN DB: next Mon/Tue we will upgrade all ATLAS, CMS and LHCb production DBs with the latest Oracle security patches. Rolling.

  • CERN storage: next week, intervention on LHCb on Tuesday - 2.1.11.2, enabling the transfer manager; September 1st for ALICE. On Monday, maintenance on the DB for ATLAS and also the CASTOR version to be upgraded; this is a long operation on the DB, so it will take all morning. On Wednesday, a N/S maintenance intervention on the DB, not a move to new DB hardware; this is an upgrade that should be transparent.

AOB:

-- JamieShiers - 18-Jul-2011
