Week of 080707

Open Actions from last week:

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Kors, Andrea, Maria, Jamie, Harry, Steve, Gavin, Patricia);remote(Gonzalo, Derek).

elog review:

Experiments round table:

  • ATLAS (Kors) - on shift this weekend. Very bad weekend. 1st problem - Taipei. Weird problem: 150 TB of disk filled, but only ~10 TB expected. Where did the other 140 TB go? Finally got Jason, who sent a namespace dump - some files copied up to 80 times, apparently duplicated somewhere in the SE catalogue (a sketch of this kind of duplicate accounting follows the list below). Added another 20 TB but this filled too. Problem now identified. Harry - why was this not seen during May? A: it was, but not followed up. Also, the calibration disk in Rome filled up. 2nd problem - no traffic BNL - TRIUMF. Michael picked it up: an OPN problem where the failover didn't work. Mail sent to Edoardo. 3rd problem - CNAF. Friday night, low efficiency and T1-T2 traffic down; LFC down? Emails sent - no reply (dead? broken? ...). Saturday lunchtime - used the alarm mail; no reaction. Waited until Saturday evening, then raised a GGUS alarm. Sunday morning - new alarm. Sunday evening - still no reply; Simone mailed Luca directly, who picked it up. Problem still not fully solved: StoRM fixed, LFC not. There is no way to see whether anyone has picked up an alarm / ticket - feedback is needed.

  • CMS (Andrea) - CRUZET 3 should have started at 09:00; no feedback yet, more info tomorrow. Expect ~100 TB in total over 7-10 days. Last Friday a PhEDEx consistency-tool campaign was started, checking the consistency of the information in the PhEDEx database, DBS, etc. (a minimal comparison sketch follows the list below). Volunteer sites were found to test first; after a week of testing it will go to all sites.

  • ALICE (Patricia) - production restarted after the MyProxy problems; post-mortem below. Testing the CREAM CE in production at KIT.
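
For reference on the Taipei issue above: the kind of duplicate-replica accounting that exposes "150 TB used but only ~10 TB expected" can be done from a plain namespace dump. The sketch below is illustrative only; the dump format (one "<path> <size_in_bytes>" line per stored file) and the assumption that copies of the same logical file share a basename are assumptions of this sketch, not the actual Taipei dump format.

```python
#!/usr/bin/env python
"""Account for duplicated copies in a storage-element namespace dump.

Illustrative sketch only: assumes a plain-text dump with one line per
stored file, formatted as "<path> <size_in_bytes>", and assumes that
copies of the same logical file share the same basename. The real
Taipei dump format may well differ.
"""
import sys
from collections import defaultdict


def wasted_space(dump_path):
    """Group dump entries by basename and compare total vs unique space."""
    copies = defaultdict(list)                  # basename -> list of sizes
    with open(dump_path) as dump:
        for line in dump:
            parts = line.split()
            if len(parts) != 2:
                continue                        # skip malformed lines
            path, size = parts[0], int(parts[1])
            copies[path.rsplit("/", 1)[-1]].append(size)

    total = sum(sum(sizes) for sizes in copies.values())
    unique = sum(sizes[0] for sizes in copies.values())
    most_copied, most_sizes = max(copies.items(), key=lambda item: len(item[1]))
    return total, unique, most_copied, len(most_sizes)


if __name__ == "__main__":
    total, unique, name, count = wasted_space(sys.argv[1])
    print("space used       : %.1f TB" % (total / 1e12))
    print("space if unique  : %.1f TB" % (unique / 1e12))
    print("most copied file : %s (%d copies)" % (name, count))
```

A file copied up to 80 times, as reported above, would show up directly in the "most copied file" line.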

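Also for reference, a catalogue consistency comparison of the sort the PhEDEx campaign performs can be reduced to a set comparison of logical file names. The sketch below is not the actual PhEDEx consistency tool; it assumes each catalogue (PhEDEx, DBS, ...) has been exported to a flat text file with one LFN per line.

```python
#!/usr/bin/env python
"""Compare two catalogue exports for consistency.

Illustrative sketch only, not the actual PhEDEx consistency tool:
assumes each catalogue (e.g. PhEDEx and DBS) has been exported to a
flat text file with one logical file name per line.
"""
import sys


def load(path):
    """Read one LFN per line, ignoring blank lines."""
    with open(path) as export:
        return set(line.strip() for line in export if line.strip())


if __name__ == "__main__":
    phedex, dbs = load(sys.argv[1]), load(sys.argv[2])
    print("only in PhEDEx : %d" % len(phedex - dbs))
    print("only in DBS    : %d" % len(dbs - phedex))
    print("in both        : %d" % len(phedex & dbs))
```
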
Sites round table:

Core services (CERN) report:

  • License server intervention - should have been transparent but hit LSF and hence CASTOR. A new license feature was involved - not yet understood why... Not thought to be related to the move(?!)

DB services (CERN) report:

  • Upgrade to Oracle 10.2.0.4 at PIC tomorrow and at RAL on the 17th. Broadcasts should have been sent... More than 4 hours of downtime of the related services in each case.

Monitoring / dashboard report:

  • FTM & GridView - select FTS & sites. Not all data seems to be coming through entire chain...

Release update:

AOB:

Tuesday:

Attendance: local(Jean-Philippe, Flavia, Harry, Jamie, Sophie, Luca (special guest!), Miguel, Patricia, Jan);remote(Michael, JT).

elog review:

Experiments round table:

  • ALICE (Patricia) - closed all tasks regarding the FDR - it is over! Run 3 continuing until real data taking. From the detector point of view everything is ready! JT - it looks like ALICE jobs stopped running Thursday morning; is that OK? A: OK as of this morning; will check again.

Sites round table:

  • CNAF (Luca, by e-mail) - Kors, we received your e-mail alarm and also our internal alarms.
    We started on Saturday morning addressing the problems, i.e. LFC, GPFS, LSF and so on. The cause was a fault on a card in the core switch, causing, from Friday night, continuous flapping of one of our main backbone links and hence instability in all services. We failed to answer the e-mail on Saturday, even though we were trying to fix the problems (but Extreme support, despite the gold service contract, could not change the card before yesterday afternoon), and it was only on Sunday evening that we answered your e-mail. I am very sorry for this. We are following this issue up internally to make sure we do not repeat this mistake next time.

    luca

  • CNAF (Luca - live) - it started 2 weeks ago with the UPS incident: the power cut caused major problems for about a week. Presumably this also caused hardware problems on the core switches - not certain, but highly probable... Last weekend the centre was essentially unusable, as the interconnect between the 2 core switches was continuously flapping; as a consequence the LSF and GPFS clusters were destroyed. Despite the "gold contract" the card on the switch was only changed yesterday. Today the switch broke again - waiting since ~midday for another intervention to change another part of the main core switch. There was no follow-up of the alarm e-mail from Kors, even though people were working on the problem (without success, as the hardware couldn't be fixed by them...); no one thought to answer the e-mail - clearly unacceptable for users. Internal post-mortem tomorrow to follow up these issues; a draft post-mortem is being written now... Hope this is the last incident - 3 full weekends of service interventions - intervention procedures have to be reviewed.

  • BNL - TRIUMF connectivity (e-mail thread) - https://twiki.cern.ch/twiki/pub/LCG/WLCGDailyMeetingsWeek080707/BNL_-_Triumf_connectivity-July8.txt. Michael - saw the mail from Simone and got in touch with networking. No suspicious configuration or hardware issue was found, so contacted ESnet operations. There was a major outage in the Seattle area (Seattle & Denver!) affecting carrier services from several providers; National Lambda Rail (NLR) is part of the backbone. Suspected hardware in Denver - a broken switch with no spare part at that location - so the primary path could not be counted on. Why was the backup path via the OPN not working? This had been discovered some time ago and TRIUMF and BNL network engineers had tried to find a solution, which was not possible because the planned outages of the primary path were too short. Yesterday was a good opportunity to study it more closely: by the East Coast afternoon it was understood to be probably a software problem with the Cisco equipment. While the primary path is down somewhere, the interfaces "believe" the link is up, which means the failover to the backup does not happen; if the interface is shut down, the failover works. In short, the equipment does not detect that the primary path is down (a sketch probing for this "interface up but path dead" state follows the list below). A manual shutdown of the TRIUMF interface restored connectivity; traffic has been flowing again since 5pm Eastern. Q: more problems affecting BNL? A: the extra complexity in ESnet - the bandwidth reservation system - has been the cause of some of these (the last 2 problems). In contact with ESnet, providing feedback and being pretty demanding, including with Bill Johnson. Given the poor service, ESnet will remove the dependency of such paths on NLR.

  • NIKHEF (JT) - lots of LHCb jobs!
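
On the failover point above: the failure mode is "link layer up, end-to-end path dead", so any failover keyed only on interface state never fires. The watchdog below is a hypothetical Python sketch of a probe for exactly that state; the host name, the interface name and the use of Linux ping/sysfs are assumptions, and the real resolution was a manual interface shutdown at TRIUMF, not a script like this.

```python
#!/usr/bin/env python
"""Watchdog sketch for 'interface up but path dead' failures.

Hypothetical illustration of the BNL-TRIUMF failure mode: the local
interface still reports 'up', so interface-state-based failover never
triggers, while end-to-end traffic is actually dead. Host and
interface names are assumptions, not the real BNL/TRIUMF setup.
"""
import subprocess
import time

PRIMARY_PEER = "far-end-host.example.org"   # hypothetical host behind the primary path
CHECK_PERIOD = 60                           # seconds between probes


def path_alive(host):
    """Return True if a single ICMP probe to the far end succeeds."""
    return subprocess.call(
        ["ping", "-c", "1", "-W", "5", host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0


def link_reports_up(interface="eth0"):
    """Return True if the local interface claims to be up (Linux sysfs)."""
    try:
        with open("/sys/class/net/%s/operstate" % interface) as state:
            return state.read().strip() == "up"
    except IOError:
        return False


if __name__ == "__main__":
    while True:
        if link_reports_up() and not path_alive(PRIMARY_PEER):
            # Exactly the state that blocked automatic failover: the
            # interface looks healthy, so only an operator action
            # (e.g. shutting the interface down by hand) moves traffic
            # to the backup path.
            print("ALERT: interface up but primary path dead - manual failover needed")
        time.sleep(CHECK_PERIOD)
```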

Core services (CERN) report:

  • LCG OPN intervention now completed.

  • Problem (again) due to move of license servers - glitch in LSF scheduling affecting batch and CASTOR for some minutes.

  • RB for SIXT - when? This week(!) Not obvious.

  • CASTOR (Flavia) - prestaging tests on the new SRM are not successful at the moment: a bug in the CASTOR back-end is not fixed in the "latest release" (bringOnline doesn't return success); the request/poll pattern being exercised is sketched after this list. Jan - 2.1.7-10 has the fix but is not yet deployed; it is under deployment testing but there is no schedule yet.

  • DTEAM (Jan) - directory with 200K test files; the S2 tests in 2007 were not cleaning up properly.

  • CMS has been removed from SRM v1, as well as several non-HEP VOs...
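
For context on the CASTOR/SRM item above: prestaging exercises the srmBringOnline request-and-poll pattern sketched below. The srm_client object and its method names are hypothetical stand-ins for a real SRM client; only the control flow is meant to be illustrative, and the back-end bug shows up as the request never reaching SRM_SUCCESS even once the files are on disk.

```python
"""srmBringOnline request/poll pattern (sketch).

The srm_client object and its methods are hypothetical stand-ins for a
real SRM client library; only the control flow is illustrative.
"""
import time

SUCCESS = "SRM_SUCCESS"
IN_PROGRESS = ("SRM_REQUEST_QUEUED", "SRM_REQUEST_INPROGRESS")


def prestage(srm_client, surls, poll_interval=30, timeout=3600):
    """Ask the SRM to bring files online and poll until done or timed out."""
    token = srm_client.bring_online(surls)          # hypothetical call
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = srm_client.request_status(token)   # hypothetical call
        if status == SUCCESS:
            return True
        if status not in IN_PROGRESS:
            raise RuntimeError("bringOnline failed with status %s" % status)
        time.sleep(poll_interval)
    return False    # never reached success within the timeout
```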

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

Wednesday

Attendance: local();remote().

elog review:

Experiments round table:

Sites round table:

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

Thursday

Attendance: local(Simone, Sophie, Harry, Andrea);remote(Julien).

elog review:

Experiments round table:

  • ATLAS (Simone): yesterday, after a scheduled maintenance at PIC, a known FTS problem reappeared in which credentials are continuously re-delegated (the intended delegation check is sketched after this list). It seems that this happens every time the FTS server goes down and comes back up; it was cured by the PIC people. At the same time the local LFC went down and had to be restarted; there is a suspicious time correlation. Two hours ago CNAF restarted all the critical services: things look fine so far and files are being transferred. There is a problem with exporting data from BNL to other sites, while BNL has no trouble importing data; to be discussed this afternoon with Michael. RAL is finishing migrating all files from SRMv1 to SRMv2, so it is not participating right now.

  • CMS (Andrea): CRUZET3 ongoing, transferring at ~400 MB/s to Tier-1 sites and CAF. Suffering from a problem in some CERN VO boxes, due to AFS tokens expiring prematurely. Already in contact with CERN IT people to find a solution. The effect was that the output of merge jobs was not properly registered in the Dataset Bookkeeping System; they had to be registered manually. For the moment, the relevant AFS files have been moved to a local disk. Investigating a general slowness in DBS. Found a configuration problem which caused failures in the registration of some data files in DBS and a premature closing of data blocks, and eventually an event count lower in PhEDEx than in DBS.
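
On the FTS delegation problem above: the intended behaviour is to re-delegate a credential only when the proxy already held by the FTS server is close to expiry. The sketch below uses hypothetical fts helper calls to show that check; the bug reported at PIC behaves as if the check always fails after an FTS restart, so every submission re-delegates.

```python
"""Intended FTS delegation logic (sketch).

The fts and local_proxy objects are hypothetical stand-ins; the point
is the remaining-lifetime check. The problem seen at PIC behaves as if
this check always fails after an FTS restart, so credentials are
re-delegated on every submission instead of only near expiry.
"""
import datetime

RENEW_MARGIN = datetime.timedelta(hours=4)   # re-delegate when less than this remains


def ensure_delegation(fts, local_proxy):
    """Delegate a fresh proxy to the FTS server only when needed."""
    expiry = fts.delegated_proxy_expiry()            # hypothetical call
    now = datetime.datetime.utcnow()
    if expiry is None or expiry - now < RENEW_MARGIN:
        fts.delegate(local_proxy)                    # hypothetical call
        return "delegated"
    return "reused existing delegation"
```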

Sites round table:

  • IN2P3 (Julien): a downtime is scheduled for all services on the 15th, lasting six hours from 09:00.

Core services (CERN) report (Sophie):

  • The SIXT VO is now supported by the CERN WMS and RB machines, to be tested by Patricia.
  • Simone asked if it is possible to reschedule an intrusive CASTOR upgrade, currently scheduled for next Wednesday, because it conflicts with an ATLAS analysis tutorial. ATLAS will propose an alternative date.

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

Friday

Attendance: local(James, Simone, Jean-Philippe, Gavin, Harry, Sophie);remote(Derek, Michel).

elog review: nothing new

Experiments round table:

ATLAS: CNAF is back up and running, though they do not seem to be processing their backlog from the previous days: the subscriptions are there but are not being picked up. Reprocessing tests (involving tape recall) have been stopped while M. Branco fixes a workflow problem. They should resume on Monday, for which SARA is ready, INFN is nearly ready and IN2P3 will wait until after they have performed a dCache upgrade in mid-week. M. Ernst reported that the storage server problem at BNL reported yesterday was fixed this morning, hopefully permanently. SC said that files were still not being transferred from BNL and had to be reminded that it was still morning at BNL, so the fix was recent.

Sites round table:

Core services (CERN) report: There will be a Linux kernel upgrade on Monday plus a new OpenAFS client. The CASTOR upgrade for ATLAS that was scheduled for midweek has been brought forward to Monday, from 14.00 to 16.00; this is to avoid a fatal data size overflow problem.

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

-- JamieShiers - 07 Jul 2008
