Week of 090112

WLCG Baseline Versions

WLCG Service Incident Reports

  • This section lists WLCG Service Incident Reports from the previous weeks (new or updated only).
  • Status can be received, due or open

Site Date Duration Service Impact Report Assigned to Status

GGUS Team / Alarm Tickets during last week

Daily WLCG Operations Call details

To join the call at 15:00 CE(S)T, Monday to Friday inclusive (in CERN 513 R-068), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Jamie, MariaDZ, Nick, Maria, Markus, Harry, Jean-Philippe, Gavin, Julia, Simone);remote(Michael, Gonzalo, Michel, JT, Gareth).

elog review:

Experiments round table:

  • ATLAS (Simone) - two problems observed over the w/e. 1st: transferring data to FZK. After investigation, found 2 instances of the ATLAS deletion service running concurrently: the FZK cloud runs its own instance, and another was also running at CERN due to a misconfiguration. Now switched off. 2nd problem: the now long-standing issue of data transfers NDGF - FZK. Problem at the Globus layer: the two sites receive and ship data fine with all other sites, but not with each other. Notified by the expert - no follow-up by this morning; the follow-up triggered a response - being looked at but no solution yet. Transfers to Taiwan: 20% failures, user error "unknown error while stating file" (at Taiwan). Markus: ftp error number? A: think it is still at the SRM layer. Will contact Jason. Last point, about MC production: there might be a problem with "scout jobs" - these are sent to check whether the environment is OK and are then followed by an "avalanche" of normal jobs.

  • CMS (Daniele) - Data ops team keeping the T0 saturated. Repacked all CRAFT data twice and CRUZET 3-4 times; results written to disk-only pools. On the T0, CERN resources have been pretty good except for LSF instabilities over the w/e of Jan 3-4 - also fixed. Data handling lessons learned - mostly CMS-specific. MC production: Summer08 still running, something like 208M / 300M (reconstructed / produced); Fall08 basically finished - some assignments pending; Winter09 fastsim: 45 production requests assigned, 44 done, 1 skipped as badly placed. >300M events produced! Issues: a couple of T2 sites with problems, fixed or by-passed. LCG performed nicely; a generic report about slowing down of overall production when many jobs are in the "submitted" state. Successful job throughput satisfactory. Reprocessing at T1: CRAFT & CRAFT skims at IN2P3, FZK and PIC; at IN2P3 some storage and SRM issues prevented the skims from finishing efficiently. Many issues with running glide-ins, seen at IN2P3; also tried to move to gLite submission but it was not "good enough" - no tickets were opened, so it is not clear why. Have to run reconstruction / reprocessing at the "custodial sites" - about 3 of the 7 Tier1s. Glide-in monitoring is hard for sites to see the status of, hence it is hard for them to help. Network issue FNAL - FZK not promptly resolved. Routing issue? Patch -> another path. Transfer system: no major issues, 175 TB transferred. Just one event - the PhEDEx CASTOR agents at CERN were not responsive on Friday morning 10 days ago; auto-resolved. Hot topic: the PIC s/w area - NFS-based area and the CMS SCRAM config; some workaround will be in place soon! Gonzalo: there is an env variable that points to the s/w area. A: discussion is on-going at the moment about what would be a good solution, either site- or expt-side.

  • LHCb (Roberto) - 1. The issue reported last week with the LFC was due to DIRAC's handling of registration: DIRAC first checks whether the GUID is registered and, if not, tries to register it; if it is, it "gives up" - so the problem was on the client side (a minimal sketch of this check-then-register logic follows after this list). 2. Success in registering files allowed new dummy MC production during the w/e; reached a new record of 15K concurrent WLCG running jobs. It was claimed at the production meeting that this production ran fairly smoothly. 3. A request about packages installed on SL5: LHCb needs SL4-compiled applications to run on SL5, which means at least the gcc 3.4 compiler should be installed together with the SL5 m/w. Markus - should go to the AF.

  • ALICE (Patricia) - after the problems and issues reported last Friday, a post-mortem meeting has been prepared for tomorrow; it will also be reported at the GDB on Wednesday. Sites are beginning to respond. Russian sites are putting a new WMS in production today; they have seen the same high-load problems as at CERN. The rest of the sites are still seeing the same problems -> post-mortem.
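
A minimal, self-contained Python sketch of the client-side check-then-register logic described in the LHCb report above. The class and function names are placeholders for illustration only; this is not the actual DIRAC or LFC client API.

    # Hypothetical stand-in for the file catalogue, keyed by GUID.
    class InMemoryCatalogue:
        def __init__(self):
            self._by_guid = {}

        def lookup_guid(self, guid):
            """Return the catalogue entry for a GUID, or None if it is unknown."""
            return self._by_guid.get(guid)

        def add_file(self, lfn, guid, surl):
            self._by_guid[guid] = {"lfn": lfn, "surl": surl}

    def register_replica(catalogue, lfn, guid, surl):
        """Register a file only if its GUID is not yet catalogued; otherwise give up quietly."""
        if catalogue.lookup_guid(guid) is not None:
            return "already-registered"   # client-side decision, not a server error
        catalogue.add_file(lfn, guid, surl)
        return "registered"

    if __name__ == "__main__":
        cat = InMemoryCatalogue()
        print(register_replica(cat, "/lhcb/MC/file1", "guid-0001", "srm://site/f1"))  # registered
        print(register_replica(cat, "/lhcb/MC/file1", "guid-0001", "srm://site/f1"))  # already-registered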

Sites round table:

  • PIC (Gonzalo) - comment on an issue found at PIC during Xmas, related to the NFS shared area. It was first noticed before Xmas with LHCb: LHCb jobs using sqlite files through NFS - discussed at this meeting. After some discussion LHCb agreed to stop using sqlite, but this will take some time (Roberto: no longer using sqlite in the s/w area). Gonzalo - saw the same symptoms over Xmas; think some job caused this but not completely sure. Also looking into CMS accessing sqlite through NFS, and possibly ATLAS (???). In general a serious problem for us - any such job or process will cause NFS to lock up and generate hanging processes that block the computing farm; we were in this situation last week. Today all farm nodes plus the NFS server had to be rebooted - the only way to get back to a healthy situation. Ask the experiments to stop this use of sqlite - understand they are working on it. Will implement a solution, which is to mount the area R/O on the WNs, with a few dedicated WNs for the SGM accounts that mount it R/W. As the problem is severe, want to try this. First problem: having special WNs is always a problem. 2nd: all/some experiments issue SAM tests using the SGM account, hence they would probe only these special WNs. Other sites affected? This is an issue that can affect all sites using NFS - worth discussing globally, e.g. at the GDB. Michel (GRIF) - completely agree we have to discuss this; similar problem with MPI jobs. Need to discuss efficient access to the shared area; should not just say "no use of the shared area for high throughput". Addendum: in trying to explore a solution to avoid this problem, thinking of using the VO_<exp>_SW_DIR environment variable to tell jobs where the s/w area is - but the VOs use hard-coded paths, so this variable is of little use as things stand (problem reported by CMS/Daniele). Experiments - please avoid using absolute paths in the installation procedure (see the sketch after this list). JT - so just a bug on the experiment side. Simone - checked whether ATLAS is doing this: it is both SCRAM and CMT, not the experiments; they do read the env variable, but once the installation is over they set absolute paths. Don't expect an immediate solution... Hence also an issue for LHCb (yes).

  • RAL (Gareth) - new building - moving beginning of March (3rd) - 5 week delay. Several hundred servers to move! More details later... ATLAS file transfer tests will start Wed - have been making some changes to CASTOR to improve performance in anticipation. (Still ATLAS plan).

  • CERN (Jan) - CASTOR degraded for ~45' between 14:00 and 15:00: the LCG router was put back in production at 14:10 but did not behave as expected and was removed at 14:55; users might have experienced degradation during those 45'.
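
As an illustration of the point about hard-coded paths in the PIC report above, a short Python sketch of resolving the experiment software area from the site-provided environment variable instead of an absolute path baked in at installation time. The variable name follows the usual WLCG VO_<EXP>_SW_DIR convention; the fallback path is purely illustrative.

    import os

    def software_area(vo="cms", fallback="/opt/exp_soft/cms"):
        """Return the software area for a VO, preferring the site's environment variable."""
        env_name = "VO_%s_SW_DIR" % vo.upper()        # e.g. VO_CMS_SW_DIR
        return os.environ.get(env_name, fallback)     # illustrative fallback only

    if __name__ == "__main__":
        print("Software would be resolved under:", software_area("cms"))

Jobs and installation tools that build paths this way at run time would follow the shared area wherever the site chooses to mount it, which is exactly what hard-coded absolute paths prevent.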

Services round table:

  • DB (Maria) - still ongoing problem with Taiwan. Split off Friday afternoon. Plan intervention for LHCb online DB to fix problem on storage for backups for online cluster. Wed resync of conditions from ATLAS offline RAC to Tier1 sites - not transparent as for the moment no data replicated to Tier1 sites due to problem just before Xmas. Fixed online to offline < Xmas but postponed until > Xmas for Tier1s.

AOB

Tuesday:

Attendance: local(Jean-Philippe, Nick, Harry, Jamie, Ewan, Sophie, Jan, JT, Patricia, Luca, Olof, Simone, Uli, Steve);remote(Michael, Gareth).

elog review:

Experiments round table:

  • ALICE (Patricia) - just finished the post-mortem meeting on the WMS issues seen during Xmas; this will be explained in detail at the GDB tomorrow. The overload that the experiment caused on the myproxy server and the WMS issues are not correlated. A change in the code (the ALICE submission model) is being made now and will be ready by the end of this week - this should solve the myproxy overload. WMS - huge backlogs - still need to be studied in detail. Some ideas - e.g. a huge # of jobs sent to some sites. WMS 103 / 9 are old nodes running gLite 3.0 - now deprecated. New WMS machines will be put in production and we will see how things evolve. The # of jobs being submitted should not be the root cause - strict tests were done prior to putting the WMS into production. Ewan & Sophie are in contact with the developers - will test together with them. Nick - why gLite 3.0? A: these machines had not yet been migrated to new h/w; that is now available and they will be migrated. Ewan - it is now better to put WMS & LB on the same box; the new boxes (8-core) are sufficiently powerful. To be followed up...

  • ATLAS (Simone) - to confirm that the T1-T1 10M file test will start tomorrow. Each site holds 1K datasets of 100 files each; every dataset must be distributed everywhere, so each T1 exports 1M files and the exercise amounts to 10M file transfers in total (a back-of-the-envelope check follows after this list). Every T1 should be able to digest a transfer rate of 100K files / day - this is the number required for data distribution and MC production. Size = 10-20 MB/file, so this is a test of services like FTS/SRM, not a throughput test in terms of MB/s. Same as the December test, which had to be stopped due to site problems. The exercise will last ~10 days, with 1K datasets subscribed per day for 10 days. Pointers to monitoring will be provided. Is the new version of the site services ready? If all is OK it will be deployed on all VO boxes at CERN, coordinating with the US for the VO boxes at US sites - this is the short-medium term plan. Analysis tests have been ongoing for 2 months; will now compare access using local protocols vs copying the file locally to the WN and POSIX open. Some results show that local copy is better - if true this would relieve a lot of problems with the different file access protocols. JT - don't you already know the answer? Simone - in the old way one had to copy all files and then access them one by one; Athena can now copy the 2nd file whilst processing the first, etc. JT - Hong had a study to do as stated above: a factor of 15 performance hit with the remote access protocol - the program was reading a factor of 15 more data than is in the file. On a file-by-file basis local access is more convenient; if you copy everything you need lots of local disk, but copying ahead and then deleting solves this. Read-ahead related problems? Olof - RFIO doesn't do read-ahead on the server side but on the client, configured in the shift.conf file; the plan was to have a library call but... There is an env variable to switch off client buffering - used by CMS? Olof - will you test xrootd also? A: yes, a team of 3 people will test this here at CERN; a presentation on Thursday on preliminary results - will circulate a pointer. Brian (RAL) - 100K files per day? A: each site holds 1K datasets, each dataset = 100 files, and each dataset goes from each site to each of the 10 other sites: 1K x 100 x 10 = 1M files exported from each site to everyone else. Brian - trying to work out how many SRM transactions this will be... Luca - 100K files per day at each T1: continuous activity or bunches? A: subscriptions are injected every morning, at about 1 sec per subscription; the peak load will depend on the configuration of the related FTS servers. Brian - comment: one thing noticed last time was that we get heavy SRM load also from concurrent processing by jobs on the WNs - will there be a mechanism for when there is heavy MC and/or analysis? A: MC will not be heavy during the next 10 days. For analysis, tests are coordinated by Dan & Johannes at cloud request; maybe ask for only little tests at T1s during the next 10 days (T2s are fine).

  • LHCb (Roberto) -
    • About the long-standing issue with the shared area at CNAF (CNAF-T1 and CNAF-T2): we opened a GGUS ticket when the problem first appeared (before Xmas) and we have kept the site banned from the production mask, since SAM jobs are continuously failing by timing out when accessing the application software (an indication that the problem is not fully understood). As also reported at some point before the Xmas break, CNAF was providing the shared area for s/w applications via GPFS, which also serves local user data; the overload caused by this non-grid activity was affecting grid access to the software and preventing proper usage of the centre.
    • Also during CCRC, plots like the one below (cnafCPUEfficiency.png) were able to demonstrate the strong limitation that the intermittent access to the shared area was (and still is) causing.
    • CNAF management told me that this time the problem seems to be due to the ATLAS application overloading the shared area, affecting LHCb too.
    • Luca - learned this yesterday and asked the local LHCb contact to check; the problem is still there - a sharp timeout of 300s for preparing the job. We have a weak s/w area. Yesterday's problem was a clash between ATLAS & LHCb activity: if we try the operation from a WN with no ATLAS jobs, it is OK. WNs can have up to 10 concurrent jobs (quad-core dual-processor), and in this case access is slow, so we think the problem is on the client side; the shared area is not that performant. A small tender was run over Xmas - now deciding on the newer h/w to buy for the shared area to cope with this problem; the tender evaluation ends tomorrow. LHCb don't want to increase the timeout... JT - did you try the trick of marking the shared area R/O? Yes, it didn't help; the problem is now more on the client side. We put the whole farm R/O except some nodes (beginning of Dec) kept R/W for s/w installation, and debugged extensively with ATLAS jobs. Noticed, as already mentioned last year, that CMT use by ATLAS overloads the shared s/w area... After that we doubled the connection to the shared area and didn't observe saturation of the link any more; the limit is in the I/O ops of the h/w, hence the new tender for better gear. Yesterday we noticed the clash between ATLAS & LHCb use. Definitely a problem - the time jobs need to load the s/w is too high; on the other hand we think the limit of 300s is very short... Bottom line: the idea is to buy new, much more powerful h/w, move part of the software area there and overcome the problem by "brute force". We understand the problem with the shared area is common with other centres.
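
A back-of-the-envelope check (in Python) of the numbers quoted in the ATLAS report above, assuming 10 peer destinations and 10 exporting T1s as quoted in the exchange with Brian; only the inputs given in the minutes are used.

    datasets_per_site = 1_000        # 1K datasets held at each T1
    files_per_dataset = 100          # each dataset = 100 files
    peers = 10                       # each dataset goes to each of the 10 other sites
    sites = 10                       # T1s exporting data, as quoted
    days = 10                        # duration of the exercise

    files_per_site = datasets_per_site * files_per_dataset   # 100,000 files held per site
    exported_per_site = files_per_site * peers                # 1,000,000 files exported per site
    imported_per_site = files_per_site * peers                # 1,000,000 files imported per site
    daily_import_rate = imported_per_site // days             # 100,000 files/day per site
    total_transfers = exported_per_site * sites               # ~10,000,000 transfers in total

    print(f"{exported_per_site:,} files exported per T1")
    print(f"{imported_per_site:,} files imported per T1 ({daily_import_rate:,} files/day)")
    print(f"{total_transfers:,} file transfers across the whole exercise")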

Sites round table:

Services round table:

  • WMS (CERN - Ewan) - new WMS nodes. Anything beginning with WMS1* retires on Monday 26th Jan - the same day the RBs retire. Simone - can you check 1 week before to see which ones are still in use and by whom? ATLAS has been asked internally to switch, but there is always some tail... Sophie - yes, doing this...

AOB:

Wednesday

Attendance: local(Eva, Maria, Jamie, Luca);remote(Michael, Gareth).

elog review:

Experiments round table:

  • ATLAS (Simone) - the 10M file test started this morning as announced; I will post an update late in the afternoon. No burning issues to report.

Sites round table:

  • BNL (Michael) - due to the 10M file test the SRM transaction rate / file transfer completion rate is extremely high: ~14K per hour (~4 Hz) - much higher than before! Good news that SRM as implemented today can sustain this.

  • CNAF (Luca) - discussed yesterday the timeout on the shared s/w area - a quick workaround is an extension of the timeout on the LHCb side - waiting to meet Roberto again to conclude. Tomorrow morning we will close the tender for the new h/w for the s/w area.

Services round table:

  • DB (Eva) - problem with Taiwan on the ATLAS streams setup; the DB is still down. It was first split from the main setup, but it is now outside the streams recovery window, so it cannot be resynchronised using streams as/when the DB comes back up. If the DB cannot be recovered - suggest recreation; transportable tablespaces can be used to re-instantiate streams. Maria - generally speaking it is rather urgent that Taiwan moves - like the other T1 sites - to ASM; we would like to understand the plan for this & the plan for a DBA, and the move away from OCFS2 -> ASM. Situation for ATLAS: conditions are not being updated and the DB itself is down. Luca - what was the problem at CERN? Eva - not at CERN, the problem is at Taiwan. Luca - we had a problem with the backup of the conditions DB 2 weekends ago; discovered the problem was due to a failure of the streaming process (?). Maria - didn't hear anything from Barbara. Eva - replication to CNAF has always been working. Luca - will investigate with the CNAF DBAs. Eva - today we are setting up replication of the ATLAS online accounts between the offline DB (T0) and the T1 sites (conditions). Should be finished this week...

AOB:

Thursday

Attendance: local(Nick, Harry, Jamie, Jean-Philippe, Julia, Gav, Jan, Andrea, Roberto, Simone, Maria, Maria);remote(Michael, Gareth, JT, Daniele, Andreas).

elog review:

Experiments round table:

  • CMS (Daniele) - following up on some issues seen with T1s, e.g. PIC. Bypass solution for the problem with s/w installed on the NFS-mounted area (due to sqlite): mount the area R/O and restrict R/W to the CMS swman user. The fix can be deployed once there is a CMSSW release, and reprocessing can then continue. Problems at other sites (FZK etc.) are shifting from storage to framework / glide-in issues, so nothing in particular for the sites. In the weekly overview of sites we will focus a bit more on T2s from next week. This morning was the first kickoff meeting of the Asian instance of the FacOps meeting (India, Pakistan, China, Taiwan) - a repetition of the Monday afternoon facilities meetings with some extracted information for this region.

  • ATLAS (Simone) - see https://twiki.cern.ch/twiki/pub/LCG/WLCGDailyMeetingsWeek090112/10Mfiles.ppt. The attached file gives details of the 10M file test, including observations on day 1. Metrics defined: all sites import 1M files within 10 days, i.e. a rate of 100K files per day; other metric: after 10 days all sites hold a copy of 95% of the datasets. Issues: FZK is experiencing an SRM problem and went into unscheduled downtime to fix it. The channels in FTS@FZK were closed so their FTS does not schedule transfers towards FZK, and the site services pushing to FZK were also asked to stop (and did); FZK is still used as a source but not as a destination. There will be a big backlog when it comes back to life. The error is an SRM timeout in prepare-to-xxx. Andreas - no detailed info yet - all the dCache people are at a workshop debugging with the developers. The SRM is under extremely high load - srmls calls from some script? To figure out which type of SRM access causes the problems, all access was switched off; tuning of the SRM service and the underlying DB is being done with the developers. The situation is improving but not yet fixed... Simone - DN of the person doing the srmls? Andrea - something done by the stage-out module of the CMS production agent. Andreas - FTS channels switched back to active a few minutes ago. Andrea - the bug was fixed in CVS on Jan 5; when will it be in production? Simone - TRIUMF does not authorize Kors' certificate to change the priority of FTS jobs - ticket sent. Those are the site issues. Performance: BNL, CERN, CNAF, NDGF, PIC at 200% of expected; ASGC, RAL, SARA in line (130%); Lyon 20% lower - looking into it; TRIUMF very slow - FTS at TRIUMF shows only 2 FTS transfers from the other T1s, a setting from the old days? Should be checked... LFC: registration backlog for BNL and ASGC. Checked with BNL - now registering at 2 Hz; ASGC at 0.5 Hz; European sites at 5-10 Hz. RTT? ATLAS uses bulk queries but not bulk registration ("method not suitable") - each registration = 4 calls to the LFC, so as this is not a bulk operation, if the LFC is "far away" you start seeing performance problems (a rough model of this follows after this list). This would also be the case for TRIUMF, but there is no backlog there due to the above. JPB - the next LFC will have a bulk method - 1 round trip - to register files. Simone - no one filed a Savannah bug; if it is in the pipeline, good. JPB - better to file a bug anyway. Simone - BNL moved 200K files but did not complete as many registrations due to the above.

  • LHCb (Roberto) - not much to report. Running 10K dummy MC production jobs. Some sites are failing due to file(s) missing in /tmp - a cleanup script? That is the first guess - under investigation. Follow-up on the CNAF shared area: pending the new h/w, for the time being LHCb will increase the timeout by a factor of 2; this should help bring back CNAF, which has been banned since before Xmas. New WMS@CERN for LHCb: had problems logging into the LB; Ewan fixed it immediately - OK. JT - only 30 jobs from you guys! A: have to check...
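
A rough Python model of the registration-rate discussion in the ATLAS report above: with 4 LFC calls per file registration (the figure quoted in the minutes) and one network round trip per call, the achievable rate is bounded by the round-trip time, which is why distant catalogues lag behind and why a bulk (single round-trip) method helps. The round-trip times below are illustrative assumptions, not measurements.

    def per_file_rate(rtt_ms, calls_per_file=4):
        """Upper bound on registrations/second when each call costs one round trip."""
        return 1000.0 / (rtt_ms * calls_per_file)

    def bulk_rate(rtt_ms, files_per_bulk=100):
        """Upper bound when a whole bulk of registrations fits in a single round trip."""
        return files_per_bulk * 1000.0 / rtt_ms

    for label, rtt in [("intra-Europe (~20 ms)", 20),
                       ("transatlantic (~120 ms)", 120),
                       ("Europe-Asia (~300 ms)", 300)]:
        print(f"{label}: per-file ~{per_file_rate(rtt):.1f} Hz, bulk of 100 ~{bulk_rate(rtt):.0f} Hz")

With these illustrative latencies the per-file ceiling comes out at roughly 12 Hz nearby, ~2 Hz across the Atlantic and well under 1 Hz to Asia, which is at least consistent in shape with the 5-10 Hz, 2 Hz and 0.5 Hz figures reported.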

Sites round table:

  • NL-T1 (JT) - only few ALICE jobs.

  • BNL (Michael) - due to the heavy SRM operation we ran out of log space for the Catalina component of the SRM. Around 04:15 we were notified by monitoring that the 90% threshold had been hit; 10 GB were left but filled up in 20'. Remarkable - did not expect such a rapid increase! Added 0.5 TB to the log space and hope it will not recur!

  • GRIF (Michel) - problems with WMS for ALICE - know that there are problems - please send details / feedback.

Services round table:

  • DB (Maria) - restored the online conditions for ATLAS; replication is now working online -> T1 except for Taiwan. That site is out of the replication as the DBAs were not able to bring the DB up again after a block corruption, and it is now outside the 10-day streams recovery window.

AOB:

  • GGUS (Maria) - would like regular alarm tests. LHCb have given test cases. All VOs could / should generate an alarm at a scheduled date. Instructions.

Friday

Attendance: local(Uli, Jamie, Lawrence, Simone, Jean-Philippe, Roberto, Nick, Patricia);remote(Michael, Gareth).

elog review:

Experiments round table:

  • ATLAS (Simone) - good news! FZK came back to life sometime at the beginning of the morning; sites are now importing from FZK & vice versa. There is a still-standing issue between FZK & NDGF - they cannot exchange data. This was "hidden" in previous days but is now back; there is a GGUS ticket with many updates - needs further follow-up. Lyon is now in trouble: import/export at 0%, SRM timeouts on both put & get. General remark about the 10M file exercise: the rates of day 1 dropped quite a bit on day 2; we are now at a level where the fastest sites import at the rate fixed in the metrics while the slower ones are below the metric. This is a problem that has nothing to do with the sites but with the "grid services" - the DDM site services and their handling of backlogs with many failures: the system starts well from scratch but then degrades. Will it stabilise at a reasonable rate or continue to get worse? This is part of the exercise - a debugging exercise with Miguel - some accesses of the site services to MySQL were fixed - more debugging to come. Michael - question regarding a possible repetition of the test: is the plan to rerun it after the known issues are fixed? A: to be discussed next week at the jamboree - it needs some preparation but otherwise there is no reason not to. Michael: in favour. Gareth: question about plans for the w/e - continue as is? A: yes. Gareth - as an aside, yesterday morning we increased the # of FTS slots per channel and saw some improvement in the rate of processing at that point; clearly there are other issues... Simone - the effect you see is probably some channels being full and some not fast enough - performance in the machinery itself.

  • CMS (Daniele) - reproc ongoing. No major site issues other than ATLAS reports. FZK etc. Main issues are on CMS side. Taking care of reprocessing and understanding CMS issues. Qs: more updates on timing of SL5? GGUS flow tests?

  • ALICE (Patricia) - the issue from yesterday with the WMS in France was followed up by email. The WMS for ALICE was announced, registered with the myproxy server and put in production; in case of issues we will get in contact with the site - thanks for the fast work. The new submission model will be in CVS next week (with Subatech for now). WMS214/5 at CERN have been tested - OK; they will be put in production over the w/e, following the deprecation of the old gLite 3.0 nodes.

  • LHCb (Roberto) - 32-bit/64-bit incompatibility: SAM test jobs are failing at PIC and CNAF because some 32-bit libraries used by ROOT to open/access files were missing on the 64-bit O/S. This will impact all VOs using the AA applications. For the time being an extra requirement has been put on the VO card, but this will affect others. This spawned another issue regarding the dCache client packaging: both the 32- and 64-bit libs are put in the same path - a dCache bug has been opened.

Sites round table:

  • ASGC (Jason) - the migration to ASM (for Oracle) is already in progress. We hope to bring up a new cluster serving 3D soon and, once this has stabilised, we will continue the migration of the grid service cluster as well as the CASTOR DB. Besides the migration to ASM, we are also discussing the backup policy, and we are negotiating with the vendor for a more robust RAID subsystem. We hope to have a short intervention in the future to migrate the drive media again.

  • FZK (Doris) - To clarify the situation at GridKa: we had massive problems with SRM last weekend, which we solved together with the dCache developers by tuning the SRM database parameters, and the system worked fine again from Sunday until early Wednesday morning. Then we ran into the same problem again. There are several causes for these problems, the most important being massive srmls requests, which hammer the SRM. If the performance of the SRM is OK, then the load is in turn too high for the PNFS database, so tuning the SRM only creates problems on PNFS and vice versa; solving one problem, we run immediately into the next. Furthermore we had some trouble because one experiment tried to write into full pools. This was a kind of "denial of service" attack, since many worker nodes kept repeating this request and the SRM SpaceManager had to refuse it. After detecting this problem we worked together with the experiment representative to free up some space, which relaxed the situation.

    The current situation, however, is that we have reduced the number of SRM threads allowed at a time, to give the ones already accessing the system a chance to finish (a generic sketch of this kind of throttling follows at the end of this round table). This seems to have relaxed the situation; we still have transfers failing because we run into the limit and reject those transfers, but it seems this is the only way to protect ourselves from failing completely. In my opinion we should learn to interpret the SAM tests: in our case a failing SAM test currently does not mean we are down, but fully loaded. Next week we will try to optimise some of our parameters to see if we can still cope with more requests, but currently it seems better to let the system relax and to finish the piled-up jobs.

    We (experiments, developers and administrators) should all work together to find a solution that does not stress the storage system unnecessarily; tuning by itself does not solve this problem in the long run.

  • LSF@CERN (Simone) - a batch of 10 ATLAS jobs launching the functional tests got stuck; they restarted after Uli intervened on the LSF configuration.
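
A generic Python sketch of the request-throttling idea described in the GridKa report above (this is not dCache code): cap the number of requests served concurrently and reject the overflow immediately, so the requests already in flight get a chance to finish instead of everything timing out under load. The thread count is illustrative.

    import threading
    import time

    MAX_CONCURRENT = 4                       # illustrative stand-in for the SRM thread limit
    slots = threading.BoundedSemaphore(MAX_CONCURRENT)

    def handle_request(request_id):
        """Serve a request if a slot is free; otherwise reject it up front."""
        if not slots.acquire(blocking=False):
            print(f"request {request_id}: rejected (system fully loaded, not down)")
            return
        try:
            time.sleep(0.1)                  # stand-in for the actual storage work
            print(f"request {request_id}: completed")
        finally:
            slots.release()

    threads = [threading.Thread(target=handle_request, args=(i,)) for i in range(10)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

Rejecting excess requests quickly, rather than queueing them until they time out, is what makes a failing SAM test in this situation mean "fully loaded" rather than "down".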

Services round table:

AOB:

  • Reminder: on Monday the SLC5 CEs (128/9) will go into production! Nick - how many WNs behind them? A: 58 dual quad-cores.

-- JamieShiers - 12 Jan 2009

Topic attachments
  • 10Mfiles.ppt (PowerPoint, 184.5 K, 2009-01-15, JamieShiers)
  • cnafCPUEfficiency.png (PNG, 55.2 K, 2009-01-13, JamieShiers)