Week of 120514

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
  • Change assessments: CASTOR Change Assessments

General Information

  • General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday

Attendance: local(Jamie, Maarten, Rajah, Lukasz, Simone, Claudio, Ignacio, Massimo, Alex, MariaDZ, Eva);remote(Michael, Lisa, Roger, Paolo, Jhen-Wei, Onno, Tiju, Rob, Dimitri).

Experiments round table:

  • ATLAS reports -
  • T0/Central Services
    • NTR
  • T1s/CalibrationT2s
    • RAL storage started failing at 17:00 on Friday with an authentication error (CRL). GGUS:82148. RAL re-ran the script by hand to fetch CRLs, but an ~80% failure rate remained. On Friday evening we decided not to escalate to ALARM, since the problem had already been reported and in any case no dataset was subscribed to RAL TAPE. On Saturday morning RAL experts fixed the problem by re-running the CRL script on all components (SRM, gridFTP). The issue is not completely understood.
    • On Friday afternoon, a problem in transferring data from/to NIKHEF was spotted. GGUS:82143. The problem appeared to be selective and to affect only NIKHEF-Canada transfers, so people started investigating the network. NL network experts spotted a problem in the OPN link NIKHEF/SARA -> TRIUMF, where all packets were dropped on the hop just after gw-sara-local.lhc-tier1.triumf.ca towards srm.triumf.ca. The NL network experts excluded a problem in the NL segment and passed the ticket to CA. CA fixed the issue promptly (the problem was in the BGP session between TRIUMF and SARA). The network issue had been present since ~06:00 Wednesday (UTC) but from the ATLAS point of view it was masked by several concurrent factors (low CA-NL traffic, error flooding from SARA-MATRIX, an issue in the functional-test machinery).
    • NAPOLI (a calibration Tier-2) suffered a power cut and was unavailable for 1 day over the weekend.
    • A DPM disk server failure at ASGC reduced the ATLAS capacity at the TW T1 by 140 TB (in the GROUPDISK and DATADISK space tokens). The problem needs one week for a fix. Some iteration is needed with the TW people to understand the impact on existing data. [ Jhen-Wei: the reduction affects GROUPDISK and SCRATCHDISK. In DPM we configure multiple controllers for all DPM servers. Last week one controller had a hardware failure, making throughput slow. To minimize the impact on ATLAS this disk server was set read-only, keeping read requests but disallowing new write requests. Hence the reduction in logical space for ATLAS, but there should be no impact on existing data (dual access paths), though maybe some performance impact. No data lost, but possibly slow access. ]
    • The ATLAS DB monitoring reported blocking sessions several times in various ATLAS databases: ADCR (instances 1 and 2), ATLR and ATONR. From the application point of view, no issue has been observed. A discussion with the ATLAS DB experts is ongoing (a query sketch is shown below).
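
      As an illustration only, a minimal sketch (Python with cx_Oracle; the DSN, credentials and query are placeholders, not the actual ATLAS DB monitoring) of how blocking sessions can be listed from v$session:

        # Minimal sketch: list sessions currently blocked by another session.
        # Assumes cx_Oracle is installed and the account can read v$session.
        import cx_Oracle

        def list_blocking_sessions(user, password, dsn):
            """Return (sid, serial#, username, blocking_session, seconds_in_wait) rows."""
            with cx_Oracle.connect(user=user, password=password, dsn=dsn) as conn:
                cur = conn.cursor()
                cur.execute(
                    "SELECT sid, serial#, username, blocking_session, seconds_in_wait "
                    "FROM v$session WHERE blocking_session IS NOT NULL"
                )
                return cur.fetchall()

        # Example (placeholder connect string):
        # rows = list_blocking_sessions("monitor", "secret", "dbhost.example.cern.ch:1521/ADCR")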

  • CMS reports -
  • LHC machine / CMS detector
    • Physics data taking
  • CERN / central services and T0
    • Tier-0 suspended operations during CMSR intervention (upgrade of offline DB)
  • Tier-1/2:
    • Still file-access errors at ASGC (intermittent timeouts when accessing files; many jobs fail).



  • LHCb reports -
  • Data reprocessing of 2012 data at T1s with new alignment
  • MC simulation at Tier-2s
  • Prompt reprocessing of data

  • T0
    • Disk servers down at CERN. This also created a few problems for LHCb FTS transfers, due to the logic used for submitting the FTS requests (a DIRAC issue - being looked at for improvement). [ Massimo - two cases: one disk server had problems from Monday and was being drained but broke completely on Friday; the list of unavailable files is in the ticket. The alarm ticket on Saturday concerns the same server; the ticket refers to raw data but only .dst files were on that machine. Rajah - will explain offline: there is some new logic in our data-management system which the people on call did not fully understand. Massimo - last night another machine crashed; we are trying to recover it and will bring in new machines. ]
  • T1
    • SARA : Problem with FTS (GGUS:82178).
    • IN2P3 : Various corrupted files - some files apparently get corrupted (checksum mismatch) after some time, even if the file seems fine right after being transferred to IN2P3; a verification sketch is given after this list. There is an old, verified ticket (GGUS:80338), but the problem is ongoing. The IN2P3 contact (Aresh) is aware of this. Corruption is seen in files from different disk servers. [ Rolf - we still see a lot of short-running pilot jobs here, so we have limited the number of running jobs to avoid overloading the CEs. Rajah - is this a limitation of SGE? Rolf - no; there are too many short-running pilot jobs, over 90%. ]
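
      As an illustration of the kind of check involved, a minimal sketch (Python; the file path and reference checksum are placeholders, and the catalogue lookup itself is not shown) that recomputes a file's adler32 and compares it with the value recorded at transfer time:

        # Minimal sketch: recompute an adler32 checksum and compare it with the
        # value stored in the catalogue when the file was first transferred.
        import zlib

        def adler32_hex(path, blocksize=1024 * 1024):
            """Return the adler32 of a file as an 8-character lowercase hex string."""
            value = 1  # adler32 starting value
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(blocksize), b""):
                    value = zlib.adler32(chunk, value)
            return "%08x" % (value & 0xFFFFFFFF)

        def is_corrupted(path, catalogue_checksum):
            # catalogue_checksum is a hex string, e.g. "0a1b2c3d" (placeholder)
            return adler32_hex(path) != catalogue_checksum.lower().zfill(8)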


Sites / Services round table:

  • BNL - ntr
  • FNAL - ntr
  • CNAF - ntr
  • ASGC - nta
  • NDGF - ntr
  • IN2P3 - nta
  • NL-T1 - two things: the SARA SRM goes into scheduled maintenance in ~10 minutes from now to activate an MSS configuration change; the FTS issue reported by LHCb and also noticed by ATLAS is probably caused by load on the SRM. The main ticket, GGUS:82032, is probably caused by the storage hosting the namespace DB, which is not optimally configured. We can try to reconfigure it using LVM, hopefully transparently, but it could take a while and decrease performance. The second option is to schedule a downtime and move the DB to a faster machine temporarily; it would have to be moved back later, so two downtimes, each an hour or so - TBD. Rajah - has this load on the SRM been seen regularly in the last month or so? Onno - we have had issues for some months now, recently getting worse. Rajah - could this mean that jobs are having trouble getting TURLs?
  • RAL - two things: the ATLAS issue is still under investigation; it also affected LHCb and CMS. Tomorrow from 10:00 to 11:00 local time there will be a downtime on the CASTOR LHCb instance to upgrade the tape gateway.
  • KIT - ntr
  • OSG - ntr

  • CERN storage - two interventions: Wed 16th - agreed with CMS to patch EOS CMS to fix a bug. Wed 23rd - intervention on the DB behind the name server - should be transparent - at risk for ~1h, most likely 10:00-11:00.

  • CERN grid - the MyProxy service was slow last week; solved by blacklisting one (biomed) user who was using 4 computers at 3 sites. PRODLFC ATLAS - increased the number of IPs exposed by the alias - now 4 out of 5 are exposed (see the sketch below).
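
    As an illustration only, a minimal sketch (Python; the alias name below is a placeholder, not the real PRODLFC alias) of how to list the IPs currently exposed by a DNS load-balanced alias:

      # Minimal sketch: list the IPv4 addresses currently returned for an alias.
      # "prod-lfc-alias.example.cern.ch" is a placeholder hostname.
      import socket

      def exposed_ips(alias):
          infos = socket.getaddrinfo(alias, None, socket.AF_INET)
          return sorted({info[4][0] for info in infos})

      # Example: print(exposed_ips("prod-lfc-alias.example.cern.ch"))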

  • CERN DB - Oracle security patch being applied
AOB: (MariaDZ) The file ggus-tickets.xls with the total numbers of GGUS tickets per experiment per week is up to date and attached to the WLCGOperationsMeetings page. There was one real ALARM last week, GGUS:82166, raised on Sunday 2012/05/13 by LHCb against CERN; detailed drills are attached to this page. As the 2012/05/15 WLCG MB was cancelled again, the drills for the new MB date 2012/06/05 will cover 12 weeks (14 real ALARMs so far since the last MB).

Tuesday

Attendance: local(Raja, Mike, Jamie, Alessandro, Eva, Massimo, Alex, Maarten);remote(Michael, Ronald, Lisa, Xavier, Jhen-Wei, Roger, Rolf, Tiju, Ian, Jeremy, Rob).

Experiments round table:

  • ATLAS reports -
  • ATLAS general: data-taking period; as much stable beam as possible is foreseen in the coming days.
  • T0/Central Services
    • NTR
  • T1s/CalibrationT2s
    • SARA-MATRIX storage instabilities GGUS:82032 (GGUS:82167 and GGUS:82169 are related to this first ticket). Onno replied that they have scheduled a downtime (23rd of May) to move the SRM database to other hardware.
    • Transfers from Romania to TRIUMF were failing due to a possible network problem between TRIUMF and Romania: the TRIUMF experts could not download the CRL of the Romanian CA on the FTS server. This problem has now been patched, i.e. they found a workaround to fetch the CRL, but it still needs to be sorted out properly. GGUS:82154
    • LCGR DB: yesterday's rolling intervention took one hour more than expected, but then everything was OK. Today the same interventions on ADCR and ATLR went smoothly.

  • CMS reports -
  • LHC machine / CMS detector
    • Physics data taking
  • CERN / central services and T0
    • Nothing to report
  • Tier-1/2:
    • We are asking some sites to check production priority since we're not getting as many slots as we expected.

  • ALICE reports -
    • NIKHEF VOBOX unavailable because of scheduled maintenance

  • LHCb reports -
  • Data reprocessing of 2012 data at T1s with new alignment
  • MC simulation at Tier-2s
  • Prompt reprocessing of data

  • T0
    • Retiring disk servers : Waiting for information from CERN before deciding what to do about the files which are currently unavailable. [ We now have a list of files; only 1 file is not recovered - AOK ]
  • T1
    • GridKa : Problems accessing data (GGUS:82200) - restarted an LFC daemon which had died and had not restarted automatically. OK now.
    • IN2P3 :
      • Continuing problem with corrupted files - offline discussions ongoing.
      • Would like to know the DN of the "short" jobs coming into IN2P3 as reported yesterday, and if possible a plot of the extent of the problem. We cannot see any significant problem from the LHCb side - we are doing the same thing for IN2P3 as for all other sites. [ Rolf - the problem with short pilot jobs disappeared yesterday; the shortest-running jobs now take ~1000 seconds. For the corrupted files please open another ticket, as the previous one is closed and marked as verified. ]
      • Pilots failing at IN2P3-T2 (GGUS:82228). Apparently error-152

Sites / Services round table:

  • BNL - short incident yesterday affecting data processing caused by an NFS appliance that hung and needed a restart
  • NL-T1 - site maintenance at NIKHEF going to plan, about to restart services
  • FNAL - ntr
  • KIT - ntr
  • ASGC - regarding the DPM disk problem reported yesterday, the vendor came to fix it this morning and the problematic server returned to normal and is running smoothly. If no errors are observed we will restore the ATLAS settings. Reminder: scheduled downtime this Friday to patch the DPM head node and upgrade the DPM disk servers, 07:00 - 13:00 UTC. [ Alessandro - in case of future problems could we please discuss the actions a bit more, e.g. the space limits that were applied. ]
  • NDGF - ntr
  • IN2P3 - next week, on 22 May, we will have a batch outage, with a slowdown starting in the evening of the 21st; the mass storage system will be down all day on 22 May and only disk files will be available through dCache. There may also be some network problems for ~1h in the morning - everything else is in the downtime notifications.
  • RAL - successfully upgraded CASTOR for LHCb this morning; tomorrow morning the same for ATLAS and CMS. Tape gateway upgrade
  • GridPP - ntr
  • OSG - ntr

  • CERN DB - yesterday during the patching of LCGR there was a problem in the configuration of the listener - a human mistake; the DB was unavailable for ~1h. Today during the ATLAS patching there was a problem with one of the ATLR nodes, which died during the intervention - waiting for a motherboard replacement.

  • CERN dashboards - sites need to patch FTS to fix the ActiveMQ bug; only 2 have done so far. The Tier-0 FTS was patched on Friday and since then we have received no data from it.
AOB:

Wednesday

Attendance: local(Jamie, Rajah, Massimo, Alessandro, Eddie, Mike, Luca);remote(Michael, Stephane, Alexander, Paolo, Lisa, Roger, John, Pavel, Jhen-Wei, MariaDZ, Ian).

Experiments round table:

  • ATLAS reports -
    • NDGF-T1: GGUS:82231 (and GGUS:82238): high rate of jobs failing: "cvmfs problem on titan nodes, investigating". [ Roger - looks more like a permission problem on the LFC. Ale - will check; maybe some confusion between the two tickets. ]
    • RAL GGUS:82148: the site had to update the CRL manually. It is the second time this has happened in the past week; why is still to be understood.
    • TRIUMF - Romanian Cert. Auth. : GGUS:82154 : confirmation of an asymmetric network path affecting the TRIUMF FTS for Romania -> Canada transfers.
    • To all the Tier-0/1s running FTS servers: please check yesterday's WLCG daily call AOB and apply the patches for activemq-cpp and for gridftp v2.
    • It looks like the CRL renewal at CERN is done less than a day before the old one expires. This creates problems, as the CRL can be updated only once per day (it depends on a cron job at the Grid site); see the sketch after this list.
    • RAL: some raw data were not written to DATATAPE and have to be recovered from CERN. Reference for the files probably not written to tape: https://savannah.cern.ch/bugs/?94647
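
      Regarding the CRL renewal timing above, a minimal sketch (assuming PEM-encoded *.r0 CRL files under the conventional /etc/grid-security/certificates directory and the Python "cryptography" package; this is not the fetch-crl tool itself) that flags CRLs closer to expiry than one daily cron cycle:

        # Minimal sketch: warn about CRLs that expire within one daily fetch cycle.
        # Assumes PEM-encoded .r0 files; the directory is the conventional location.
        import datetime
        import glob
        from cryptography import x509

        MARGIN = datetime.timedelta(hours=24)  # one daily cron cycle

        def crls_expiring_soon(cert_dir="/etc/grid-security/certificates"):
            now = datetime.datetime.utcnow()
            soon = []
            for path in glob.glob(cert_dir + "/*.r0"):
                with open(path, "rb") as f:
                    crl = x509.load_pem_x509_crl(f.read())
                if crl.next_update - now < MARGIN:
                    soon.append((path, crl.next_update))
            return soon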


  • CMS reports -
  • LHC machine / CMS detector
    • Physics data taking
  • CERN / central services and T0
    • There was a failure of the CRL publication that impacted the central web servers at CERN (amongst other things). There was an alarm ticket issued last night since it was after hours, and these pages host services used for monitoring data taking. Computing operations said it was forwarded, and Maarten commented, but the person who resolved the issue did not comment or close the ticket even today. I closed it myself after asking for an update.
  • Tier-1/2:
    • Selected larger Tier-2s were asked to increase the production share for 3 weeks.


  • LHCb reports -
  • Data reprocessing of 2012 data at T1s with new alignment
  • MC simulation at Tier-2s
  • Prompt reprocessing of data

  • T0
    • Still "Input data resolution" problems by jobs at CERN. Suspect one of the faulty diskservers still does not have the files there recovered. [ Massimo - one diskserver not yet recovered; files declared not reachable. ]
  • T1
    • IN2P3 :
      • Continuing problem with corrupted files - in addition to the offline discussions, a GGUS ticket has been opened (GGUS:82247). [ Rolf - thanks for opening the ticket; we will do an initial update with all the history up to now, and our T1 coordinator will lead a small working group on this issue. ]
    • SARA :
      • The problem with the SRMs continues even after yesterday's unscheduled intervention on the PostgreSQL DB (GGUS:82178).
      • The disk cache in front of tape is very low (0.8 TB free out of 30 TB). Is there anything to be done about this? When does garbage collection start? Also see : https://sls.cern.ch/sls/service.php?id=SARA_LHCb-Tape [ Alexander - please open a GGUS ticket about the cache issue. ]

Sites / Services round table:

  • BNL - ntr
  • NL-T1 - ntr
  • CNAF - ntr
  • FNAL - ntr
  • NDGF - downtime on Friday for a reboot of the cache pools; this will affect the availability of some data.
  • RAL - had a data loss today due to a disk server failure; we tried to migrate the files but 34 files had bad checksums. CRL issue: we have a cron job that runs between midnight and 08:00; the jobs ran fine but all disk servers had an expired CRL - probably as in the ATLAS report. Today's downtimes all completed successfully.
  • KIT - ntr
  • ASGC - ntr
  • OSG - GGUS changes to SOAP in SAV:127763; please keep us updated! Maria - announced last week; not yet scheduled for release. A request has been opened for Service-Now. Only after all systems have done enough testing will we schedule the release, so at least one month from now!

  • CERN - ntr

AOB:

Thursday

No meeting - CERN closed

AOB:

Friday

Attendance: local (Andrea, Massimo, Stephane, Eva, Mike, Raja, Ulrich); remote (Michael/BNL, Alexander/NLT1, John/RAL, Mete/NDGF, Jhen-Wei/ASGC, Lisa/FNAL, Jeremy/GridPP, Rob/OSG, Paolo/CNAF; Ian/CMS).

Experiments round table:

  • ATLAS reports -
    • The backlog of T0 export to SARA reached a day -> SARA is excluded from T0 data export until the storage issue is solved. ATLAS will make sure that the current backlog is drained today. NIKHEF is not affected.
    • Lost files at RAL (did not reach tape) : Savannah:94647 : RAW data recovered as expected; 18 MC files completely lost.
    • RAL (GGUS:82317) : problem staging simulation files from tape. [John: this is related to the previous issue; the GGUS ticket is being updated with the list of lost files.]
    • FZK (GGUS:82306) : 2 FTS channels were stopped for more than 24 hours, partially blocking the simulation activity. Explanation : 'Probably a problem with logrotate'.
    • Many sites were automatically blacklisted by HammerCloud last night, related to an upgrade of the PanDA pilot conflicting with the installed 'AtlasSetup'. The pilot update was rolled back this morning and the sites were automatically unblacklisted after a few hours. The problem was solved quickly enough to avoid any impact on production activity.

  • CMS reports -
    • LHC machine / CMS detector
      • Physics data taking
    • CERN / central services and T0
      • Experimenting with running backfill simulation during CERN running. Seems to work and doesn't negatively impact Tier-0 running
    • Tier-1/2:
      • Selected larger Tier-2s were asked to increase the production share for 3 weeks.
    • Other:
      • Nicolo Magini will take over CRC on Monday. Much of CMS Computing will be at CHEP next week.

  • LHCb reports -
    • Data reprocessing of 2012 data at T1s with new alignment
    • MC simulation at Tier-2s
    • Prompt reprocessing of data
    • Disabled filling mode for pilots for next 10 days.
    • T0
      • Still "Input data resolution" problems by jobs at CERN. Waiting for the files from the faulty diskserver to be recovered.
    • T1
      • RAL : Problem with batch system / aborted pilots (GGUS:82304)
      • SARA :
        • Continuing huge problems with access to data (GGUS:82178). No new GGUS ticket opened.
        • Disk cache in front of tape very low (GGUS:82267). Solved.

Sites / Services round table:

  • KIT - closed today for a bridge holiday ("pont")
  • Michael/BNL: ntr
  • Alexander/NLT1: ntr
  • John/RAL: several problems on Wednesday related to the new CRL published by CERN; these should be published in advance, and follow-up is needed on this. CERN should publish the new CRL at least a couple of days before the old ones expire, because the sites must copy the new lists. Note that some CRLs will expire this Saturday, so something needs to be done urgently. [Stephane: this is related to GGUS:82148 for ATLAS, but it is a RAL ticket. Will open a ticket against CERN. Ulrich: thanks, will then dispatch it to the CA experts. Andrea after the meeting: opened INC:130120. ]
    • [Raja: status with batch? John: fixed a few issues, but others are still being followed up.]
  • Mette/NDGF: ntr
  • Jhen-Wei/ASGC: dpm maintenance finished two hours ago
  • Lisa/FNAL: issues with some jobs that seem to fail in the Condor queue (there are two savannah tickets about this), still being followed up
  • Jeremy/GridPP: ntr
  • Rob/OSG: ntr
  • Paolo/CNAF: ntr

  • CERN FTS - We added support for the na62.vo.gridpp.ac.uk VO today to all FTS instances at CERN.
  • Eva/Databases: ntr
  • Mike/Dashboard: ntr
  • Ulrich/Grid: ntr
  • Massimo/Storage: intervention on DB behind CASTOR on Tuesday, will affect all instances, will be marked as at risk in GOCDB
    • [Raja: what is the status of LHCb diskserver? Massimo: vendor will only come on Tuesday, sorry if it is not before]

AOB: none

-- JamieShiers - 11-Apr-2012

Topic attachments
  • ggus-data.ppt (PowerPoint, 2564.5 K, attached 2012-05-09 17:59 by MariaDimou) - GGUS slides for the MB. Drills complete, totals not yet available.