Week of 090330

WLCG Baseline Versions

WLCG Service Incident Reports

  • This section lists WLCG Service Incident Reports from the previous weeks (new or updated only).

GGUS Team / Alarm Tickets during last week

Weekly VO Summaries of Site Availability

Daily WLCG Operations Call details

To join the call at 15.00 CE(S)T, Monday to Friday inclusive (in CERN 513 R-068), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Jean-Philippe, Julia, Jan, Miguel, Maria, Jamie, Olof, Harry, Alessandro, Patricia, Maria, Nick, Simone, Markus);remote(Michael, Angela, Daniele, Gareth, Michel).

Experiments round table:

  • ATLAS (Alessandro) - Quiet weekend except for the RAL unscheduled downtime. Most important issue today: all T1s are asked to ensure the FTS patch for the delegation issue is put in production! Miguel - will deploy at CERN tomorrow - 'transparent'.

  • CMS reports (Daniele) - still catching up with things from the weekend and during CHEP. CNAF T1 in scheduled downtime. Relatively high availability over the last 24 and 48h; only IN2P3 not 100%. Looking for some correlation between CE tests and issues. Some tickets being opened - tails of transfers to/from T1s but involving specific T2s. GGUS ticket to FZK last week was promptly investigated and fixed.

  • ALICE (Patricia) - During CHEP the experiment was quite busy with CREAM systems: CERN - stopped for a while; RAL; also some other sites. RAL - scheduled downtime; also some AliEn issues, now solved - this will also be a 'CREAM week'. [ VO box tests failing with "Write access interdit within the software area for sgm accounts" for voalice03 and voalice06 - to be checked. Tests may eventually be removed(?) ]

  • LHCb - No production going on during the weekend or today. Plan is to run at low level an MC dummy production with the last stable version of the LHCb software. A couple of main points to recall from last week: a third LFC node was added to cope with an anomalous increase of load; to be confirmed by LHCb (with another round of MC jobs) whether this can be released or must be kept. SRM LHCb_USER space at CERN and RAL: it looks like they are under-dimensioned and this was again causing performance degradation last week. Discussions ongoing in LHCb on how to redistribute disk servers according to the needs and usage patterns of the various space tokens.

Sites / Services round table:

  • FZK (Angela) - tried to update FTS for the delegation issue - couldn't get back in production as Tomcat cannot connect to the DB. Miguel - will check with Gavin. Angela - the error message we see is not on the known issues page. Old system normally switched off tomorrow - hope to get the new service up soon! [ Detailed report sent to fts-support. ]

  • RAL (Gareth) - 1) problem with tape handling on Friday - resolved by the end of the afternoon. Also had a problem with FTS, resolved late Friday evening - DB moved to a new machine (s/w mirrored disks and problems with both disks - unplanned move to a spare machine). Left FTS running with the number of channels reduced over the weekend - now fully back. 2) Significant outage earlier due to 2 power glitches - the above problems are thought to be knock-on effects from these.

  • CERN (Miguel) We will update all CERN-PROD FTS services to the latest patch on Tuesday 31st March at 09.00, fixing the delegation issue. There should be no service downtime. Ref: gLite 3.1 27.03.2009 - 3.1 Update 42

  • DB (Maria) - CNAF split off because of its scheduled downtime. Problem this morning with the LHCb online DB: the sysadmin had an intervention to add non-DB storage but an error brought the DB down unexpectedly.

AOB: (MariaDZ) ALARM testing week starts TODAY! Instructions: https://twiki.cern.ch/twiki/bin/view/EGEE/SA1_USAG#Periodic_ALARM_ticket_testing. Reports this Friday at the latest; f2f MB and GDB being planned for April 7th and 8th. Related ticket: http://savannah.cern.ch/support/?107452. Additional documentation on ALARM processing was included in the AOB of https://twiki.cern.ch/twiki/bin/view/LCG/WLCGDailyMeetingsWeek090323#Tuesday

Very important strategic USAG this Thursday April 2nd at 9:30. All 1st line user support (TPM in GGUS terms) is subject to change. Details in: http://indico.cern.ch/conferenceDisplay.py?confId=55267

Tuesday:

Attendance: local(Jan, Gavin, Miguel, Roberto, Harry, Jamie, Nick, Alessandro, Steve, Patricia, Julia, Simone, Jean-Philippe);remote(Angela, Gareth, Michael, Michel, Daniele).

Experiments round table:

  • ATLAS (Ale) - expert on call noticed T1 problems due to downtimes not being correctly published. Ricardo Rocha found a bug in the dashboard agent - now fixed. ATLAS FTS service update: CERN, FZK, SARA ok; still to be updated are RAL, PIC, TRIUMF, BNL, NDGF & Lyon. Lyon is testing the new version and we are using it in DDM. Status not known for ASGC & CNAF. Reprocessing - link with more details - sign-off period expired & now almost ready to start. Expect reprocessing jobs in the middle of the week. More complete list of input datasets in the Twiki. Harry - after disk resident copies have been removed? A: yes, in discussion with T1s. Harry - bulk pre-staging? Ale - ramp-up period then 2K jobs prestaging in parallel. Simone - 10-50 files per request; the first job starts only when all files in a bunch are staged. Jobs come from a central place at CERN. Polling via srmls is configurable and hence largely reduced. Hopefully this will be in production by the time the exercise starts (a minimal sketch of this prestage-and-poll pattern follows the reprocessing note below). Brian - downtime issue - any reason why within the UK there was a sudden flow of data in/out of T2s around midnight last night? Simone - no, no action taken - will investigate.

ATLAS FTS servers update
   OK: CERN, FZK, SARA
   To be updated: RAL, PIC, TRIUMF, BNL, NDGF, LYON (testing the patched version)
   Not known: ASGC and CNAF

Reprocessing:
The sign-off period having expired, we are now almost ready to start. Sites should expect reprocessing jobs in the middle of this week. A complete list of the input dataset containers involved can be found here:
https://twiki.cern.ch/twiki/pub/Atlas/DataPreparationReprocessing/reproMarch09_inputnew.txt
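
As an illustration of the prestage-and-poll flow described above (bunches of 10-50 files, infrequent polling via srmls, a job released only once its whole bunch is staged), here is a minimal Python sketch. The helpers bring_online(), srm_ls_online() and submit_job() are hypothetical stand-ins for the site SRM client and the production system; this is not the actual DDM/prestaging code.

   import time

   BUNCH_SIZE = 50        # 10-50 files per request, per the report above
   POLL_INTERVAL = 600    # seconds between srmls polls; configurable, kept large

   def prestage_bunch(surls, bring_online, srm_ls_online):
       """Ask for a bunch of files from tape and poll until all are on disk."""
       bring_online(surls)                 # one bulk bring-online request per bunch
       pending = set(surls)
       while pending:
           time.sleep(POLL_INTERVAL)       # infrequent polling keeps SRM load low
           pending -= set(srm_ls_online(pending))   # drop files now reported online
       return list(surls)                  # whole bunch staged: job may start

   def run_prestaging(all_surls, bring_online, srm_ls_online, submit_job):
       """Walk through the input file list bunch by bunch."""
       for i in range(0, len(all_surls), BUNCH_SIZE):
           bunch = all_surls[i:i + BUNCH_SIZE]
           submit_job(prestage_bunch(bunch, bring_online, srm_ls_online))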

  • CMS reports (Daniele) - no incremental update on the twiki wrt previous days. Main issues: 1) 2 T1s out of 7 fully down, ASGC & CNAF, trying to arrange operations accordingly; 2) FNAL not routing to Nebraska, trying to transfer from RAL & IN2P3 but over poorer links - trying to understand why; 3) still some issues with FZK related to dCache pool hopping, affecting T1-T1 transfers, e.g. to IN2P3. Artem gave explanations - have to wait until the datasets for export can be staged back from tape; 4) some files not migrated to CERN tapes for a few days - Nicolo Magini has opened a Remedy ticket. T2 issues - disk full etc. Miguel - the CASTOR file is corrupted, as in the last case - this is why it is not being committed to tape. Looking if other files are like this. Need a procedure to detect this better; need an upgrade of the CASTOR client to give a clearer error that the file is corrupted (a checksum-scan sketch follows this round table).

  • ALICE (Patricia) - clarification of the issue reported yesterday re VO box tests (s/w area access). Only seen from one VO box: during the upgrade of voalice03 to the latest CREAM CE clients, gsiklog was missed - it is needed to obtain a token to write in the s/w area at CERN (i.e. AFS). Now ok.

  • LHCb (Roberto) - confirm that production activity is essentially just random analysis activity. Discussion concerning the LHCb user disk pool at CERN & RAL - the load on this space token is too high - the current setup with 2 disk servers cannot cope with the activity. Can this activity be split across several space tokens? If not, expect to ask CERN IT for suggestions on the best configuration...
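
Regarding the corrupted CASTOR file in the CMS report above: a minimal sketch of the kind of checksum scan that could flag such files, assuming an adler32 catalogue checksum is available for comparison. The file list and checksum source are placeholders, not an actual CMS or CASTOR tool.

   import os
   import zlib

   def adler32_of(path, blocksize=1024 * 1024):
       """Recompute the adler32 of a local copy of the file."""
       value = 1
       with open(path, 'rb') as f:
           for block in iter(lambda: f.read(blocksize), b''):
               value = zlib.adler32(block, value)
       return format(value & 0xffffffff, '08x')

   def find_suspect_files(files):
       """files: iterable of (path, expected_adler32_hex, expected_size)."""
       suspects = []
       for path, expected_cksum, expected_size in files:
           if os.path.getsize(path) != expected_size:
               suspects.append((path, 'size mismatch'))       # e.g. 0-length files
           elif adler32_of(path) != expected_cksum.lower():
               suspects.append((path, 'checksum mismatch'))    # likely corrupted
       return suspects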

Sites / Services round table:

  • FZK (Angela) - switched off the old instance - new instance running but quite unstable; reinstalled yesterday, twice, not sure what the problem is. Simone - ATLAS has been using the old FTS server until now; why not keep it running until the new one stabilises? Angela - you switched a few weeks ago! Ale - what is its name? New FTS-FZK - we have seen a lot of transfers for several weeks. Gav - are you waiting for some comments from fts-support? Angela - if CERN & SARA upgraded without problems maybe it really is FZK specific. Simone - confirm that ATLAS is using this new instance. Angela - very little CMS activity.

  • CERN (Steve) - upgraded the VOMS host certificate - all a bit frantic! Thanks to sites for updating promptly! Miguel - ideally 2 weeks lead time to schedule this; should be tested in PPS and rolled out. Nick - the procedure has changed. Steve - something got 'lost' between certification & PPS. The GridSite part of WMS can't use .lsc files - other components can - will follow up with the developer.
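
For reference on the .lsc mechanism mentioned above: an .lsc file holds two lines, the subject DN of the VOMS server host certificate and the DN of its issuing CA, so services can validate VOMS attributes without a local copy of the host certificate itself. Below is a minimal sketch that reads one such file; the path and comparison values are illustrative assumptions, not the actual CERN configuration.

   # Illustrative path only; such files live under /etc/grid-security/vomsdir/<vo>/
   LSC_PATH = "/etc/grid-security/vomsdir/alice/voms.cern.ch.lsc"

   def read_lsc(path=LSC_PATH):
       """Return (subject_dn, issuer_dn) from a two-line .lsc file."""
       with open(path) as f:
           lines = [l.strip() for l in f if l.strip() and not l.startswith('#')]
       return lines[0], lines[1]

   def matches_new_cert(path, new_subject_dn, new_issuer_dn):
       """Check whether the .lsc already reflects the renewed VOMS host certificate."""
       return read_lsc(path) == (new_subject_dn, new_issuer_dn)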

AOB: Concerns the CERN site; this came up with the ALICE ALARM testing. See today's comment by MariaDZ in ticket https://gus.fzk.de/ws/ticket_info.php?ticket=47510:

The addresses to which ALARM tickets are directly sent are listed here:
https://twiki.cern.ch/twiki/bin/view/LCG/OperationsAlarmsPage
For the ALICE case this is alice-operator-alarm@cern.ch

This is exactly what happened. You can see at the top of the ticket history:
Sent ALARM mail to mail address alice-operator-alarm@cern.ch 

You (all LHC experiments) have decided what the content of these mailing lists should be. 
Yours obviously includes Patricia, which is why she received the mail.

The GGUS ticket 'assignment' as such was done (correctly) to ROC_CERN
which is responsible for the 'Affected Site'. 

Now, if some services should be added to this mail address and if negotiation is necessary with FIO
for the level of response time committed, we should discuss this in WLCG operations.

This is a CERN site issue, not a GGUS one.

yours
maria

Wednesday

Attendance: local(Sophie, Gavin, Jamie, Jean-Philippe, MariaDZ, Antonio, Alessandro, Andrea);remote(Gonzalo, Angela, Michael, Derek, Daniele).

Experiments round table:

  • ATLAS (Alessandro) - All T1s have been asked to clean the disk buffer so that reprocessing jobs will take input from tape. Ready to start MC08 hits merging; most of this merging will take data from disk and write output to ATLASMCTAPE - should not affect reprocessing operations. Issue open with BNL. Michael - we observe that due to the MC reprocessing, hits files are being accessed that had been written to tape in the past. Very small - 30MB files. Received tasks late last week demanding tens of thousands of these files from tape; one request on Saturday was for 36K hits files per day. The storage system delivered these files, but given the number of files, making the reprocessing jobs runnable results in under-utilization of CPU resources in the US cloud. ATLAS have taken care that these files will be merged in the future and only merged files will go to tape. Now that reprocessing of cosmics will start in the next 24h or so we have competing tasks and processing routes in terms of access to tape: raw data from tape for cosmics reprocessing and a huge number of small-file requests. There is no prioritization mechanism and so timely delivery from tape cannot be guaranteed. Huge queue - usually 10-20K requests - optimised on the basis of files present on the same cartridge; no physics-stream related prioritization. BNL is working to allow this - expected in April (an illustrative sketch of cartridge-grouped, priority-ordered recalls follows this round table). Ale - are you going to implement this in PandaMover or DDM? Michael - not specific; couple this to ATLAS task IDs and priorities. Yesterday just after 17:00 local time the Panda DB servers - those that still serve the US, FR and some other clouds - lost network connectivity and were not accessible from outside. Cyber security at BNL is very vigilant - based on their dynamic web pages both servers were blocked without warning (a warning normally comes a few days prior to this): a) not warned b) already after hours - took about 1 hour to get a person at home to unblock these nodes. More forensics will be performed this morning. No explanation why - the nodes fulfil all cybersecurity criteria. Notified by Simone of problems with callbacks - one of the services had died - corrected quickly (40') - not heard back since so believe all is in order. Ale - last point: at today's ATLAS run control meeting the expert Alexei Klimentov reported that a cosmics run with the complete configuration can be expected between May 4 - 11 (weeks 19 & 20). Will also involve the T0.

  • CMS reports (Daniele) - 3 top site-related comments: 1) still have 2 of 7 T1s fully down; ASGC extended by at least one day from the end of yesterday, and CNAF is also in scheduled downtime. 2) Major problems in some US T2s: major data access failure in dCache at San Diego; scheduled downtime of the MIT cluster; FNAL not routing data to Nebraska as yesterday - Brian has been debugging, mostly due to some tuning in FTS/PhEDEx. Opening GGUS tickets in addition to Savannah ones. 3) Warsaw: permission denied when writing data transferred from IN2P3.

  • ALICE (Patricia) - Lyon T1 reported this morning that ALICE was submitting a huge number of jobs through one CE and blocking the queue. The issue: the VO box uses the local queue plus some small queues for T2s. Two issues: ALICE asks the info system - if one queue gives 0 weighting ALICE will continue submitting. [ more ] Made a test with a similar JDL - see the same symptom: the WMS was only taking 1 queue out of those declared with the same weight. Will repeat the tests with WMSs at CERN to see if it is a WMS or queue issue - report to the ALICE task force list. Opened a ticket for GRIF - 47550 - the WMS in France for ALICE is overloaded and won't accept new requests. To discuss during the ALICE TF meeting at 14:00. CERN CREAM CE started to have issues this morning - ce202/1 - submitted a Remedy ticket through lsf support - answered and following up directly.

  • LHCb (Roberto) - production activity == 0; only a few user jobs. Tomorrow a physics group meeting will decide on DC09 - the new MC physics production. Waiting for a new version of the simulation programme expected 2 weeks from now. DIRAC fixed for several recent problems. MC dummy production will restart in a few days. Hopefully some production jobs will start - some sites are asking...
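
To illustrate the kind of prioritization discussed in the ATLAS/BNL item above (recall requests optimised by cartridge, with ATLAS task priorities to be taken into account), here is a small Python sketch. It is purely illustrative - the field names and the priority scheme are assumptions, and this is not the actual BNL/HPSS recall machinery.

   from collections import defaultdict

   def order_recalls(requests):
       """requests: iterable of dicts with 'file', 'cartridge' and 'priority' keys.
       Group by cartridge so one mount serves many files, then serve the
       cartridge holding the most urgent request first."""
       by_cartridge = defaultdict(list)
       for req in requests:
           by_cartridge[req['cartridge']].append(req)
       groups = sorted(by_cartridge.values(),
                       key=lambda grp: max(r['priority'] for r in grp),
                       reverse=True)
       return [r['file'] for grp in groups for r in grp]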

Sites / Services round table:

  • CNAF (Luca) - there was a problem with the LHCb gridftp configuration at CNAF (related to an old version of YAIM). After having solved this, we think there is a problem in the LHCb test execution order and (I think) Roberto agreed to review it: this was before CHEP and now, with our downtime, we have no additional news. I put Elisabetta in cc (she was working on this problem on our side).

  • TRIUMF (Di) - TRIUMF applied the FTS patch for the delegation issue yesterday afternoon and everything is working fine.

  • LHCOPN (Edoardo) - INTERVENTION:
The two CERN LHCOPN routers connecting CERN to the Tier1s will be rebooted for a software upgrade, one after the other.

DATE AND TIME:
Wednesday the 8th of April 2009 between 14.00 and 15.30 CEST

IMPACT:
Each router will be down for 30 minutes during the maintenance window.
Connectivity to the Tier1s should be available via backup paths; performance may be degraded.

Kind regards

Edoardo Martelli

  • Release update (Antonio) - nothing significantly different from last week. The FTS delegation patch release has gone out. New version of VDT - an issue was reported from DESY but turned out to be unrelated to the update; upgrading to VDT 1.6.1 is safe. The update includes a BDII fix (memory footprint). New host certificates for VOMS - urgent to upgrade! The old certs have been removed from the lcg-voms server. No schedule change - had to invert the content of updates 43 & 44. A set of patches to YAIM will come with update 43 on April 6, containing a new version of YAIM with service discovery. Several fixes in dependencies - possibly one for the VO box if certified in time. New dynamic info plug-in for Sun Grid Engine. Just after Easter, a stable version of CREAM & ICE - this is the version of CREAM tried out in the pilot.

AOB:

Thursday

Attendance: local(Andrea, Ricardo, Jan, Gavin, Alessandro, Roberto, Julia);remote(Michael, Derek, Gonzalo, Daniele).

Experiments round table:

  • ATLAS (Alessandro): Team ticket to CERN-PROD about 0-length files. Subscriptions have been stopped for these files. Followed up with ATLAS and CASTOR (now understood - ticket closed). A scan through CASTOR will look for other files of the same type. Question: written using SRM? No - LSF jobs writing directly to CASTOR with rfio. Reprocessing has started - the first tasks launched are test jobs (a few for each cloud to verify that the setups are all OK). These are test tasks - if they fail it won't be accounted against the site. Waiting for the responsible to give the green light to continue.

  • CMS (Daniele) - Full report. No new tickets opened related to site problems - still following up on open ones. 2 out of 7 T1s down (ASGC, CNAF in scheduled downtime). Looking at site behaviour to understand how to deal with corrupted files. Possible solution: run checksums on these files and work out a procedure on how to fix them. Impact: one bad file causes lots of trouble with many retries -> e.g. transfer quality gets messed up artificially and we end up unfairly blaming sites. Report on the GGUS alarm test for CMS: started yesterday (previous test March 9th). Sites were not warned of the exact date - some did not know - and sites reacted well! This time, alarms were sent at 7pm site time - the response was very satisfactory - see the webpage for results. Observation: no SMS at CNAF (this was expected as the site was in scheduled downtime, otherwise fine). All sites closed tickets well: max 2 hours, min 30 mins. Site admins closed the tickets, apart from PIC (the CMS contact closed it, but the site admin action was OK). Q (Roberto): did CMS send a test to CNAF as well? Yes, to all 7 T1s and CERN.

  • ALICE - No report.

  • LHCb (Roberto) - Monte Carlo dummy production ongoing (problems fixed in DIRAC). We had an issue with one diskserver in the CERN 'lhcbuser' space causing user jobs to fail when accessing files on it (waiting for a vendor intervention). Q to FIO: can FIO communicate to the experiment when there is a temporary downtime due to hardware? A) Happens all the time - no easy way to do it. Post-meeting update (OB): The daily diskpool dumps, e.g. https://castor.web.cern.ch/castor/DiskPoolDump/lhcb.lhcbdata.last, contain the information on whether a disk copy is available or not (be aware that an unavailable diskcopy can still have a replica on another server or on tape); a minimal sketch of checking such a dump follows below. The dumps are not part of the CASTOR software and are provided on a best-effort basis. If a diskserver is unavailable for a long time or lost you are informed - otherwise, for transient things, we don't inform. GGUS alarm test - reproduced the result from a few weeks ago. Good response from almost all sites. NIKHEF/SARA issue again (Jeff closed a ticket for SARA). Bad answer from CNAF: "Not authorised to trigger alarm in system" - the ticket was not assigned to anyone at CNAF. Need to follow up. Q (Daniele): any problems with other sites? No problems for other sites - just CNAF to follow up. Did anyone else from LHCb try? Not yet, just Roberto - someone else from LHCb could try.
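
A minimal sketch of checking the daily diskpool dump mentioned in the LHCb item above. The dump URL is the one quoted there; the assumption that the dump is plain text with one disk copy per line and a recognisable availability flag is mine, not a documented format.

   from urllib.request import urlopen

   DUMP_URL = "https://castor.web.cern.ch/castor/DiskPoolDump/lhcb.lhcbdata.last"

   def copies_for(file_fragment, url=DUMP_URL):
       """Return dump lines mentioning the file, split into available/unavailable."""
       available, unavailable = [], []
       for raw in urlopen(url):
           line = raw.decode('utf-8', 'replace').rstrip()
           if file_fragment not in line:
               continue
           # availability flag assumed to appear verbatim in the line
           (unavailable if 'UNAVAILABLE' in line.upper() else available).append(line)
       return available, unavailable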

Sites / Services round table:

  • BNL (Michael): No issue to report - one little glitch due to new CERN VOMS certificate, affected the LFC - corrected thanks to site admins in US, close to midnight.

  • RAL (Derek): No issues.

  • PIC (Gonzalo): No issues

No service issues.

No AOB.

Friday

Attendance: local(Jean-Philippe, Dirk, Jamie, Nick, Alessandro, Julia, Jan, Patricia, Gavin);remote(Luca, Gonzalo, Michael, Gareth, Michel, Angela).

Experiments round table:

  • ATLAS (Alessandro) - Good afternoon! Observed a transient problem with the IN2P3 T1. The problem caused inefficiency - ticket submitted - related to SRM. First time we have seen the error message "all ready slots are taken and ready thread q is full". The problem disappeared but we would like an understanding from the site... Also observed a transient problem at RAL - GGUS ticket - inefficiency between 30-70% for data export from RAL to other T1s. Reprocessing: the bulk campaign has started! First results were shown during the ADC operations meeting yesterday: links from the meeting to understand the various tasks. Luca: how long will the campaign last? A: depends on the data volume at each site but expect ~2 weeks - will include CNAF as soon as it is out of scheduled downtime; will submit tasks for the Italian T1 as soon as it is ok. GGUS ticket - a document in the GGUS pages to which Lyon T1 people refer - the official document seems not updated; Lyon T1 people don't want to forward Team tickets to T2(s). Brian@RAL: kernel panic on a stager disk server. Some of the reprocessing tasks seem to be picking up files from DATADISK and not recalling them from tape - reprocessing is supposed to be from tape? A: yes, so very strange. Brian - will gather debug info and send a mail to the atlas reprocessing list. [ In the official document https://gus.fzk.de/pages/ggus-docs/PDF1540_FAQ_for_team_tickets.pdf it is written that only T1s receive team tickets. Is it possible to update it to include T2s? ]

  • CMS reports - Daniele had a clash with the CMS Computing Resource Board today.

  • ALICE (Patricia) - A 2nd WMS in France was put in production yesterday. The CREAM CE issue at CERN reported yesterday is solved. ALICE has made an official request to have at least one of the CREAM CEs serving SLC4 resources, to have access to more resources (the WNs should be SLC4).

  • LHCb - No report.

Sites / Services round table:

  • CNAF (Luca) - yesterday evening electric power came back; started switching on the h/w this morning and reconfiguring all services. DB ok, starting up CASTOR and GPFS. Should be back - barring serious issues - Monday morning as scheduled. CNAF & LHCb alarm ticket issue - had another round today with another alarm ticket from LHCb, but not from Roberto, and this one succeeded! We check only the signature of the GGUS portal - apparently all tickets sent by Roberto fail to be verified. Problem under investigation. Michel - not a CRL expiration problem; the CERN CRL is valid only for 5 days. Luca - but we verify the GGUS signature. Suspicion was the verification of the submitter cert - checked with the responsible at CNAF - the log shows we verify the signature. OK except for Roberto's alarms!

  • PIC (Gonzalo) - yesterday applied the FTS delegation patch. Also T1 alarm tests - checking the results at PIC for the test by CMS (48h ago) - Daniele reported quite a delay - not from the MoD but from the CMS contact. The problem was that the GGUS ticket didn't trigger an SMS to the operator: the filter was not correct - the list of authorized DNs is updated periodically, and Daniele's DN, updated 20 days ago, was not yet in our filtering! (A minimal sketch of such a DN allowlist check follows this round table.) Luca - AFAIK we only have to verify the GGUS signature and not names - the person is verified on the portal. We need to check the update mechanism / frequency!

  • GRIF (Michel) - just installed the latest WMS 3.2 patches on a test WMS - some news next week! Supposed to fix the remaining problems with match-making performance.
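
A minimal sketch of the kind of DN filter described in the PIC item above: an SMS is only triggered when the alarm submitter's DN appears in a periodically refreshed allowlist, which is why a stale list silently drops a newly authorized contact. The function and file names are illustrative assumptions, not PIC's actual implementation.

   def load_authorized_dns(path='authorized_alarm_dns.txt'):
       """Read the locally cached allowlist (refreshed periodically from the VO lists)."""
       with open(path) as f:
           return {line.strip() for line in f if line.strip() and not line.startswith('#')}

   def should_trigger_sms(submitter_dn, authorized_dns):
       """Only page the operator for DNs on the allowlist; a stale cache means a
       recently authorized submitter (e.g. a DN added 20 days ago) is filtered out."""
       return submitter_dn in authorized_dns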

AOB:

-- JamieShiers - 30 Mar 2009

Topic attachments
   IN2P3_02april2009_WLCG_incident_report.doc (48.5 K, 2009-04-14, HarryRenshall) - Service Incident Report on IN2P3 tape robotics failure
   LHCB__GGUS_ALARM_ticket_test_results.pdf (77.8 K, 2009-04-02, JamieShiers)
   wlcg-servicereport-Mar31.ppt (2364.5 K, 2009-03-30, JamieShiers)