Week of 080901

Open Actions from last week:

Daily WLCG Operations Call details

To join the call at 15:00 CE(S)T, Monday to Friday inclusive (in CERN 513 R-068), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Ignacio, Simone, James, Roberto, Steve, Jamie, Harry, Jean-Philippe, Markus, Gavin, Riccardo, Jan); remote(Jeremy, Daniele, Derek).

elog review:

Experiments round table:

  • ATLAS - RAL situation follow-up. On Friday afternoon RAL asked to be brought back into the functional tests, having understood and fixed the problem. Re-introduced at low rate (5%), then increased to ~50% of nominal. On Saturday T1-T1 transfers were re-introduced - also ok. On Sunday the replication of FDR raw data, in preparation for the reprocessing tests, was restarted - also ok. The quarantine of RAL finished this morning; now getting the new FDR AODs and DPDs. Derek - we think we know the problem (inside Oracle) but not why it is occurring. Stable enough for ATLAS data. 2nd point - mostly for BNL - Friday pm CERN time, communication problems between services at CERN and BNL (DB services at either end). Investigated all weekend; it boiled down to a network problem - expect more info from Michael. Some services on the OPN ok, others not...

  • CMS - didn't yet collect much info from this weekend. Magnet at 3T. First bending news. Just one issue - migration of data to the CAF not inserted in the local DBS instance. Issue debugged and understood over the weekend. Quite some backlog - will take 2-3 days to recover. Computing shifts now with non-experts using a checklist. Continue with 1 expert + 1 non-expert 09:00 - 23:00 and decide later how to proceed / adapt.

  • LHCb - couple of issues. NIKHEF - struggling over the weekend with the SRM and gsidcap server; a problem creating the 'control line' prevented access to data. Also GridKA - metadata query timeouts. These are the only 2 sites where the new 'CCRC' tests are failing; all others ok. PIC - downtime tomorrow, prepared carefully by draining the site first. Thanks a lot for following the procedures correctly!

  • ALICE - quiet weekend; only minor issues regarding the usage of the WMS at CERN and some other small issues at some T2 sites.

Sites round table:

Core services (CERN) report:

  • Confirm Wednesday morning SRM maintenance: all endpoints unavailable simultaneously for up to 1 hour. Q: increase the timeout to RAL? A: no. Q (Simone) - after last week's jamboree it was decided that ATLAS will start using the SLC4 FTS (when ready!). Gavin - being installed right now.

DB services (CERN) report:

  • Installed RH4 update 7 last week on all validation RACs. It has been deployed for a week now, hence would like to push it to the production DBs as of next week (rolling operation). CNAF has announced a <4 hour intervention tomorrow for the upgrade to Oracle 10.2.0.4 for ATLAS, on the 4th for FTS & LFC and on the 3rd for CASTOR. BNL - upcoming intervention on the 3rd for a storage upgrade.

Monitoring / dashboard report:

Release update:

AOB:

Tuesday:

Attendance: local(Jamie, Gavin, Maria, Jean-Philippe, Simone, Nick); remote(Joel, Michael, Xavier, Derek, JT, Michel).

elog review:

Experiments round table:

  • LHCb - nothing special to report. At NIKHEF still a problem accessing data - the gsidcap door was restarted, which seems to have solved the problem. The number of CEs for LHCb at CERN is quite low - only 5 job slots.

  • ATLAS - jobs at CERN running very well :-) Not much to report. PIC is in scheduled downtime, hence sites trying to get data from there see problems. Problem yesterday exporting data to NDGF - a 'known' problem(?). The storage setup there is quite unbalanced between disk and tape, and there is a single FTS channel: once a lot of tape data flows, this will also limit disk throughput (an illustrative sketch follows after this list). Discussion yesterday on whether to have separate disk/tape channels. Michel Jouvin proposed the solution used at GRIF at the time of CCRC'08; Alexei Klimentov should have communicated that this solution was around. Michel - just contact me! Gavin - GGUS ticket about transfers to NDGF timing out just as they are about to complete - same issue? Nevertheless NDGF got all the intended data... Derek - seeing a lot of transfer failures from RAL, thought to be due to the faulty disk server reported last week. Lost about 4000 files - passing the file IDs to ATLAS. Simone - ATLAS will take care of cleaning up the catalogue. Permanent loss.

  • ALICE - yesterday at 16:30 there was an unscheduled power cut in the ALICE server room which unfortunately lasted beyond the UPS protection time. This stopped production for about 2h (it affected the ALICE central services). Once the power was back, the central services were restored. A post-mortem analysis this morning showed that the job losses were minimal, mostly jobs waiting for input data that timed out. At this moment we are checking the situation of all sites which are not fully in production.
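
Illustrative aside (not part of the minutes): a minimal Python sketch of why a single shared FTS channel lets slow tape-bound transfers throttle disk throughput, as raised in the ATLAS/NDGF item above. All numbers (slot count, per-file times, file size) are invented for illustration and are not NDGF or FTS settings.

    # Back-of-envelope model: files on one FTS channel share a fixed number of
    # concurrent-file slots; tape-bound files hold a slot much longer, so they
    # squeeze out disk-bound transfers even when they are a minority.
    CHANNEL_SLOTS = 20      # concurrent-file slots on the single channel (assumed)
    DISK_FILE_TIME = 60.0   # seconds per file to the disk endpoint (assumed)
    TAPE_FILE_TIME = 600.0  # seconds per file when tape staging is involved (assumed)
    FILE_SIZE_GB = 1.0      # file size used for the throughput estimate (assumed)

    def disk_throughput(tape_fraction):
        """Estimated steady-state disk throughput (GB/s) when a fraction
        `tape_fraction` of the submitted files is tape-bound."""
        w_tape = tape_fraction * TAPE_FILE_TIME
        w_disk = (1.0 - tape_fraction) * DISK_FILE_TIME
        disk_slots = CHANNEL_SLOTS * w_disk / (w_tape + w_disk)
        return disk_slots * FILE_SIZE_GB / DISK_FILE_TIME

    for frac in (0.0, 0.3, 0.7):
        print(f"tape-bound fraction {frac:.0%}: disk throughput ~{disk_throughput(frac) * 1000:.0f} MB/s")

    # With separate disk and tape channels, the disk slots would be independent
    # of the tape load - which is the point of the GRIF-style setup mentioned above.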

Sites round table:

  • NL-T1 - Q: is LHCb still having problems? Joel - reopened the GGUS ticket; the reply was that the gsidcap door had been restarted. Have to check if it is now ok; cc'ed to JT. Also no ALICE jobs - related to the power cut reported above.

  • RAL - ongoing CASTOR problems; intervention tomorrow. A config change was made to the RAC for ATLAS/LHCb and the constraint-violation problem has not been seen since. Will apply it to the RAC hosting the nameserver (common across all instances).

Core services (CERN) report:

  • FTS service - shuffling boxes around. Should be 'completely transparent'. Swap SLC3 boxes reaching end of warranty with others still in warranty.

DB services (CERN) report:

  • Scheduled interventions today: ATLAS conditions 07:00-10:00, LHCb conditions 12:00-15:00, LHCb read-only LFC 12:00-15:00 - all upgrades to 10.2.0.4. Followed tomorrow by the CASTOR DB upgrade to 10.2.0.4. Also PIC tomorrow 09:00-12:00 - Oracle July 2008 CPU patch for both databases and an ASM correction for ATLAS.

Monitoring / dashboard report:

Release update:

AOB:

Wednesday:

Attendance: local(Jamie, Harry, Maria, Andrea, Gavin, Jean-Philippe, Julia, Simone, Ignacio); remote(Daniele, Michael, Joel, Jonathon Wheeler, Pepe).

elog review:

Experiments round table:

  • CMS - https://prod-grid-logger.cern.ch/elog/CCRC%2708+Logbook/454 Entering the last mid-week exercise before Sep 10, reprocessing the latest 3T data. Still following up on the issues reported yesterday in the elog; some already closed, some still under investigation - will summarize tomorrow. Q: should CMS contact James regarding the elogger request? Julia - talked with James. Will it be run as a service? That would mean a different group... An elogger can be set up now, but the long-term arrangement is still an open question.

  • LHCb - concerning NIKHEF, we succeed in accessing files, but needing to restart services daily is a big problem. Improvement required - it will be a big problem during data taking and/or with long jobs. Q: NIKHEF / SARA split? A: perhaps due to another problem, e.g. a single server. Each time the service is restarted we can access data, then it suddenly decays again. Many GGUS tickets open; at some stage this needs to be resolved. Not seen at other dCache sites... Discovered this morning a short interruption of the LHCb SRM endpoint - the message was missed in GOCDB - apologies. Yesterday at the MB reminded the Tier1s that space tokens are not yet all in place - some are still missing. Roberto will open a GGUS ticket for each space token - some sites are still using CCRC space tokens. Also in contact with Ulrich about jobs - the situation improved a bit. Still a problem with the AGM account at CERN - debugging with Ulrich.

  • ATLAS - the biggest problem ATLAS is facing is monitoring - it basically stopped working last night (the dashboard for T0 and export). Huge load of callbacks and timeouts at the server side. The developers spent a lot of time, even late at night, trying to understand the problem, and got some help from the DDM developers this morning. Is the problem at the site-services level or the dashboard level? Still some problems - from time to time the error plots show empty results. No problems (seen) at the sites. Traffic from CNAF and ASGC is problematic as there are scheduled downtimes at both. Also errors in the morning from the scheduled intervention at CERN; the ATLAS export on-call did not notice and sent a GGUS ticket - please reply 'site in scheduled downtime' and close. From time to time these downtimes do not get displayed in the dashboard. Monitoring needs to be improved, but the developers have more urgent things to look at right now...

  • ALICE - if you take a look at MonALISA (http://pcalimonitor.cern.ch) you will see all the sites in yellow. Production is fully stopped, waiting for the next physics production cycle, which should begin at the end of the week (more news after tomorrow's TF meeting). However you will see Legnaro and CERN in production: these are mostly user jobs running some analysis on demand, not the production itself.

Sites round table:

  • RAL - had a problem with CASTOR last night as an LSF partition filled up; it affected the LHCb and GEN instances. Jan - problem understood and solved? A - yes, due to a large logfile that could not be log-rotated; once it was deleted, ok. Also CMS suffering from the DB constraint problem. In downtime at the moment with a fix to the nameserver that might solve this problem(?) - will know in half an hour or so... Stopped batch work as a consequence and will reopen it when the work is over.

Core services (CERN) report:

  • The SRM intervention went according to schedule; forgot to put the LHCb endpoint in downtime in GOCDB. The intervention went smoothly and services were back at 11:00. For services at CERN, working on the latest gLite release.

  • FTS service - made the necessary h/w moves so that the servers are under warranty.

DB services (CERN) report:

  • Ongoing intervention on the ATLAS online and offline clusters for the rolling kernel upgrade to RH4 update 7. Ongoing intervention at CNAF (upgrade to 10.2.0.4) and a firmware upgrade for storage at BNL. Tomorrow CNAF upgrades FTS/LFC to 10.2.0.4. Also the LHCb rolling kernel upgrade tomorrow 08:30 - 10:30.

Monitoring / dashboard report:

  • Need to investigate further problems reported by ATLAS.

Release update:

AOB:

Thursday:

Attendance: local(Harry, Jamie, Jean-Philippe, Ignacio, Julia, Simone); remote(Daniele, Michael, Yannick/Strasbourg, Joel, Michel).

elog review:

Experiments round table:

  • CMS - elog entry summarizing activity over the last 24h: https://prod-grid-logger.cern.ch/elog/CCRC%2708+Logbook/456. Mid-week exercise started around the time of the last call. Transfers FNAL/CNAF (custodial). Transfer rates initially low; some errors to FNAL being investigated, possibly related to a stuck process. Migration to tape at CERN was not very prompt, but the backlog should now be digested. 1 GB/s network peak at CERN yesterday, then going down; 100 MB/s out of CERN. CMS CAF2 - 1 node down, Peter Kreuzer following up. Some transfer errors and some 'file exists' errors (CAF), fixed already. T1 workflows: FZK has some analysis tests that are not ok; job robot not running at RAL - data operations action pending. T2 workflows: Belgium, Beijing, Korea etc. - some transfers failing, job robot problems etc.; following up case by case. Julia - elogger - discussed again with James; this will be set up. To whom should configuration questions be addressed? Contact Daniele.

  • ATLAS - the situation now looks quite good. The monitoring problem of yesterday is solved - detailed email from Ricardo, will upload it. Bottom line: all dashboard functionality is now there; patches for missing information etc. have been applied; downtimes from GOCDB are now displayed (again?). Other things to report: the only (small) issue is the scheduled downtime at Taipei yesterday - still not fully recovered. They upgraded to CASTOR 2.1.7, the same version as RAL, who struggled for a couple of weeks - are they aware of the problems / issues? Ignacio - in contact with the developers. They have a problem but it is different to RAL's - 32/64-bit related. Other ongoing activities: distribution of FDR2 data to all sites - happened, complete, ok! RAW data from the June FDR is now being distributed to all T1s; the programme is to run a massive reprocessing test on FDR + cosmic data, aiming to simulate pre-staging + job load + conditions DB access. Now deployed the DM instance that moves user data: new dashboard instance, new servers at CERN. Should see relatively low traffic, but little experience so far! Q: endpoints? A: different space tokens: 1) pledged resources 2) unpledged.

  • LHCb - GGUS tickets submitted to have all space tokens set up at the T1s. New GGUS ticket on the behaviour of the WMS at CERN (WMS 101/106) - not behaving well; Maarten is following up. LSF problems at CERN (priority & shares) are not so bad. Still debugging the handling of tokens with Ulrich.

Sites round table:

  • CNAF - brief report on the alarm-ticket problem for the IT-ROC. Guenter@GGUS had implemented a version that did not push alarm tickets to the local ROC systems (but only to the site contact mailing list). Unfortunately he forgot to change this later and nobody in Italy noticed it during the tests. Two tests were done to check the workflow on the 17th of July (tickets #38614 and #38633); both tickets were closed by the INFN-T1 site manager via the GGUS interface. Fortunately only one real alarm ticket (#39994) has been received, and the problem has been raised and solved (thanks to Guenter, Diana and Roberto). Two new tests have been done to check the workflow, successfully (tickets #40154 and #40160).

  • RAL - expect post-mortem on analysis and resolution of CASTOR / Oracle problems - preferably before MB of next week!

Core services (CERN) report:

DB services (CERN) report:

  • The interventions at CNAF and BNL went fine. The kernel upgrade for the ATLAS DBs yesterday went as foreseen. Intervention today at PIC to 10.2.0.4 & the July patch; we now have a homogeneous setup for Oracle version + patch. 2 interventions next week, for CMS online/offline (kernel upgrade) and for WLCG, again to re-establish uniformity for the CERN servers. Will then freeze for the startup...

Monitoring / dashboard report:

Release update:

  • Worrying e-mail concerning the new versions of gfal / lcg_utils and the reading of information from the information system when no space token is used (a rough illustration follows below). Update 30 fixes most of the problems reported in July. Top-level BDIIs running update 30 are not compatible with this - an EGEE broadcast today explains this.
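
Aside (not part of the minutes): a rough sketch of how one might inspect what a top-level BDII publishes for a VO's space tokens, i.e. the kind of information the newer gfal / lcg_utils fall back to when no space token is supplied. The BDII host, the VO and the GLUE 1.3 attribute names below are assumptions for illustration; this is not how gfal itself builds its query.

    import subprocess

    BDII = "ldap://lcg-bdii.cern.ch:2170"  # assumed top-level BDII endpoint
    VO = "lhcb"                            # example VO

    # Query the GlueVOInfo objects for this VO and keep the space-token tag and path.
    cmd = [
        "ldapsearch", "-x", "-LLL", "-H", BDII, "-b", "o=grid",
        f"(&(objectClass=GlueVOInfo)(GlueVOInfoAccessControlBaseRule=VO:{VO}))",
        "GlueVOInfoTag", "GlueVOInfoPath",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True).stdout

    # Print the published (space token, path) pairs so missing or stale
    # CCRC-era tokens stand out at a glance.
    for line in out.splitlines():
        if line.startswith(("GlueVOInfoTag:", "GlueVOInfoPath:")):
            print(line)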

AOB:

Friday:

Attendance: local(Oliver, Julia, Maria, Jamie, Jean-Philippe, Patricia, Simone); remote(Michael, Jonathan, Pepe, Daniele, Luca, anonymous).

elog review:

Experiments round table:

  • ALICE - announced at the ALICE TF a new AliEn version, fully backwards compatible. Installed at CERN and at some small sites; up to now all ok. It will probably be officially announced to the sites next week. The upgrade procedure has changed: operations inside the VO boxes are much easier, with a single command line to perform a full upgrade. Some minor issues with services at some Tier2s - following up directly with the sites. CREAM - asked FZK to put the ALICE production queue behind a CREAM CE, which is now fully supported in AliEn; 120 jobs running concurrently, performance stable and good.

  • ATLAS - 'problem of notification': BNL reported that GGUS tickets opened with the team account were ending up in the wrong (informational) mailing list. Mistake - GGUS was sending mails to the wrong list. Yesterday did a test with team and alarm tickets; the alarm ticket was OK (Dantong). Savannah ticket opened for the change. Meanwhile all reports from shifters are 'manually' forwarded to the right list at BNL. Michael - thanks! Simone - another point re BNL: 2 days ago Ricardo installed another front end for the dashboards to receive callbacks from BNL. Hiro reported yesterday that callbacks were piling up. Ricardo is currently away; Ben is taking care of it, in contact with Hiro. Callbacks are coming in very slowly; monitoring of BNL traffic is currently 'dubious'(?). Michael - this concerns traffic within the US cloud, not just BNL. Julia - related to the problem with the CERN-BNL network connection? A: don't know - Ben is looking into it. Michael - no, it is not related to this; it must be something else - we have restored the configuration as of Friday morning. German cloud: observed that a lot of transfers served by the FTS at FZK fail due to an authentication problem that looks like the old delegation bug. Strange, as a mechanism to refresh proxies has been in place for about a year; checked the crons - ok (a rough sketch of such a check follows after this list). Various interventions by Simone/FZK didn't help - Akos is following up directly with FZK.

  • CMS - usual elog with the main activities, with some links to Savannah tickets. Fast ticket closure for T1 tickets, a bit slower for T2s - can we do better with the T2s? Nothing crucial, but the elogs over the past ~2 weeks show most problems at T1s solved within 24 hours, which is not true for the T2s...
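
Aside (not part of the minutes): a minimal sketch of the kind of proxy-lifetime check a refresh cron can run to catch the symptom of the 'old delegation bug' mentioned in the ATLAS report above, i.e. a delegated proxy quietly running out of lifetime. The proxy path and the 6-hour threshold are assumptions; this is not the actual FZK/FTS cron.

    import subprocess
    import sys

    PROXY = "/opt/glite/var/proxies/atlas.proxy"  # assumed location of the renewed proxy
    MIN_SECONDS = 6 * 3600                        # renew well before expiry (assumption)

    # voms-proxy-info -timeleft prints the remaining proxy lifetime in seconds.
    result = subprocess.run(
        ["voms-proxy-info", "-file", PROXY, "-timeleft"],
        capture_output=True, text=True,
    )
    try:
        timeleft = int(result.stdout.strip())
    except ValueError:
        timeleft = 0

    if result.returncode != 0 or timeleft < MIN_SECONDS:
        # A real cron would trigger re-delegation / renewal and raise an alarm here.
        print(f"proxy {PROXY} has only {timeleft}s left - renewal needed", file=sys.stderr)
        sys.exit(1)
    print(f"proxy ok: {timeleft}s remaining")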

Sites round table:

  • RAL - another, unrelated DB problem last night; it affected ATLAS, LHCb & GEN for a couple of hours. Another parameter was added on the Oracle side and the DB restarted.

  • CNAF - performed the upgrade (Oracle July CPU patch); still to apply the patch to the ATLAS conditions DB - next Monday. Next week or the week after will perform the upgrade to CASTOR 2.1.7, which would involve a downtime for all VOs.

Core services (CERN) report:

DB services (CERN) report:

  • Upgrading the CMS online cluster to the latest RH kernel. Downtime today - under investigation (unexpected, as this is normally a rolling operation); more news on Monday. On Monday, kernel upgrade for the PDB cluster and CMS offline; on Tuesday, WLCG and the ATLAS + LHCb downstream-capture boxes.

Monitoring / dashboard report:

Release update:

AOB:
