Week of 101122

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
  • Change assessments: CASTOR Change Assessments

General Information

  • General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Andrea V, Manuel, Jan, Dirk, Maarten, Jamie, Nilo, Flavia, Maria, Alessandro, Douglas, Harry, David, Eddie, Jacek, Lola, Roberto, MariaDZ, Ignacio);remote(John, CNAF, Andrea S, Ian F, Michael, Ulf, Jon, Kit, Joel, Rob, Rolf, Stefano).

Experiments round table:

  • ATLAS reports -
    • Nov 20-22 (Sat-Mon)
      • HI run started on Saturday
    • T0:
      • Problem with DB
        • Sat Afternoon: ATLR DB: PVSS apply handler was stopped on create index statement. PVSS replication between online and offline was affected.
        • Sat Night : ATLR DB: All instances were affected. PVSS replication between online and offline was affected.
        • Sun: ATLR DB: High load on instance 3
          • All problems were solved, thanks to the PhyDB team. PhyDBAs, ATLAS DBAs and PANDA developers will discuss how to tune the DB.
      • CERN-PROD_TZERO: File cannot be transferred to BNL (GGUS:64473) [ Jan - this one file arrived with a bad checksum. Probably more to discuss with the DAQ people. Ale - this file is an ESD. Most probably the transfer from batch to storage is where it went wrong. Should activate checksums on those transfers. Declare it as lost, as said by Doug. Should discuss how to activate those checksums (see the checksum sketch after this report). ]
      • FTS channel CERN-INFN-T1: the only channel where transfers were still failing after INFN-T1 came back online following its networking problem (GGUS:64477)
      • AFS120 was very slow on Saturday evening, problem gone on Sunday GGUS:64471
    • T1s:
      • INFN-T1: internal network was down, GGUS:64459
      • FZK: failed srm-bring-online requests, GGUS:64365, ongoing problem.
      • LYON: most ATLAS activities are resumed.
    • T2s:
      • No major problem
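
    A note on the checksum discussion under the CERN-PROD_TZERO item above: the checksum normally used for these grid transfers is Adler-32, and verifying a transfer only requires comparing the value computed at the source with the one recorded at the destination. Below is a minimal sketch in Python; the helper name and file path are illustrative only, not part of any production tool.

        import zlib

        def adler32_of_file(path, chunk_size=1024 * 1024):
            # Adler-32 is seeded with 1; update it chunk by chunk so large
            # files do not have to fit in memory.
            value = 1
            with open(path, "rb") as f:
                while True:
                    chunk = f.read(chunk_size)
                    if not chunk:
                        break
                    value = zlib.adler32(chunk, value)
            return "%08x" % (value & 0xffffffff)

        # A transfer is bad if the source and destination values differ, e.g.
        # print(adler32_of_file("/path/to/local/copy"))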

  • CMS reports -
    • Experiment activity
      • Slow start for HI. A couple of good fills but not stressing the system
    • CERN and Tier
      • Quiet at CERN. Issue over the weekend with load on default pool. Team ticket submitted to get list of people writing. Tracked to a single user writing 16k small files to Castor. User encouraged to move to a disk-only pool.
    • Tier1 issues and plans
      • Mopping up production at Tier-1s
      • T1_ES_PIC issue identified on one of the Squids. Ticket submitted
    • Tier-2 Issues
      • No big Tier-2 Issues
      • We will run a pile-up test with 16 minimum-bias events at selected sites to see how the application performs. We'll notify sites.
    • Stephen Gowdy is the new CRC, starting tomorrow.
    • Eddie - CE and SRM tests at CNAF failing Sunday due to networking problems at INFN GGUS:64459 & GGUS:64462. FNAL - SRM tests failing Fri-Sat due to heavy usage of SE GGUS:64463.

  • ALICE reports -
    • T0 site
      • Nothing to report
    • T1 sites
      • FZK: GGUS:64470. During the weekend there were problems with cream-1 and cream-3. Production has been stopped since yesterday. /tmp on cream-1-fzk was full; old files have been deleted and tomcat was restarted on both CEs. However, there are still problems with submission [ Xavier - the host has problems with proxy renewal - in unscheduled downtime until 18:00 today ]
    • T2 sites
      • Usual operations

  • LHCb reports - SAM tests for CE and CREAMCE showed a problem on Saturday with the libraries used by m/w clients to submit jobs. The problem was fixed on Sunday morning. Saturday's tests were not running on the WNs and sites might have noticed that as missing results.
    • T0
      • none
    • T1 site issues:
      • CNAF : network problem at the centre since Saturday evening; people were working on it during the weekend.
      • SARA : SRM tests failing on Friday; Friday-Saturday there were still shared area problems at IN2P3 (Eddie)

Sites / Services round table:

  • RAL - tape robot unavailable tomorrow. In GOCDB as at risk - 10:00 - 14:00
  • BNL -ntr
  • NDGF - problem at a site in Denmark this w/e where the cluster kept accepting jobs from ALICE even though the queue was full. Instead of 40 jobs it accepted 28K jobs. Being cleaned up at the moment; not accepting any more jobs for now. Being looked into.
  • FNAL - recovered from power outage; experiencing SRM and pnfs overload - trying to alleviate; FNAL-UCSD network circuit is down.
  • KIT - besides the srm bring-online problem (see GGUS ticket), problems with the DATADISK space token; 10 ATLAS pools had to be restarted for that. Downtime on 1 Dec: FTS and LFC will be down 09:00 - 10:00 UTC; at risk only.
  • IN2P3 - ntr
  • CNAF - this w/e had a problem on one of the core switches. A tricky problem - it took from Sat 13:00 until Sun pm to understand where the problem was. It was located on an Extreme Networks Black Diamond, dropping packets on one module. Connections were moved off the failed module while waiting for the swap. Ale - observed that part of the storage - DATADISK - was working ok but DATATAPE was not; http timeouts from FTS. Understood? A: had to move the CNAF-CERN connection from one module to another; checked this morning. Luca - strange behaviour, inexplicable to me. Only with transfers to/from CERN. Traceroutes between all of the involved elements were regular. Also tried to transfer from a node on the OPN and succeeded, but not from lxplus. Analyzed traffic at packet level with tcpdump: could observe traffic coming from CERN but nothing arriving at the StoRM f/e. Enabled verbose logging in the f/e, including a restart of the endpoint, which apparently fixed the problem (but it is not clear why).
  • NL-T1 - ntr
  • PIC - the LHCb CE SAM errors seen last night (from 21h to 24h, local time) were due to timeouts accessing the shared software area. The NFS was overloaded in that period. In the next days we will migrate the ATLAS SW area to another controller, and we hope this will solve these SW area load issues in the future.
  • OSG - ongoing BDII saga: BNL returned to publishing the full d/s on Thursday and has been stable since, so we are back to a stable state wrt BDII. The next item is to return the BDII timeout to its original value. It is a short week in the US, so preferably not now; we will start talking on Monday about returning the BDII timeouts to their original values.

  • CERN FTS Monitor The CERN instance of the Lyon FTS Monitor service http://fts-monitor.cern.ch/ is now ready to be migrated to a new SLC5 host. This includes an update to the latest Lyon FTS monitor. The new service is available for preview at https://fts300.cern.ch/ Please provide any comments now; a date for the alias change will be published shortly. The active SLS sensors will also migrate to this new host when the alias changes.

  • AFS - Rainer looking at it. Arne working on automated response.

AOB:

Tuesday:

Attendance: local(Pszemek, Manuel, Flavia, Harry, Roberto, Maria, Jamie, MariaDZ, Maarten, Simone, Alessandro, David, Eddie, Andrea, Jan);remote(Pepe, Stefano, Ronald, Rob, Joel, Jon, Ulf, Michal).

Experiments round table:

  • ATLAS reports -
    • T0:
      • More central database issues: PVSS DCS data streaming was falling behind all day Mon. Problem was certain sql selects putting load on the server, and slowing down the streaming. Back to normal by Mon. night.
      • Central database load still spiking due to the Panda service. Indexing of high-use tables is being redone (some indexes added and restructured, some dropped). Limits put on selects, out of concern for the load, blocked the ability to monitor jobs and had to be removed soon after.
    • T1:
      • RAL lost a file server yesterday, announced around 19:00. A large fraction of the storage is corrupted and files were lost, but which files is unknown. File recovery is going on today; by early afternoon whatever can't be recovered will be declared lost. [ John Kelly - identified 130 files that are more important; they are being copied off and checksummed. Waiting for that to finish. Douglas - apparently several disk servers from this batch have failed. Plan? ]
      • INFN had fixed network problems, all space tokens put on-line. Problems with datatape appeared, but seemed to be temporary, no new errors since late last night.
    • T2:
      • Unscheduled outage at Liverpool not reported to GOCDB, caused failures in transfers.
      • Missing release installs at Milano are causing job failures there.

  • CMS reports - Experiment activity
    • HI Run was getting better
    • CERN and Tier0
      • GGUS:64511 SRM issue last night seems to have solved itself
      • GGUS:64315 t0export disk server down since last week with only copy of two EDM RAW collision files
    • Tier1 issues and plans
      • Mopping up production at Tier-1s
      • T1_ES_PIC issue identified on one of the Squids. Savannah:117951, one of the three servers has a disk problem
    • Tier-2 Issues
      • No big Tier-2 Issues
      • We will run a pile-up test with 16 minimum-bias events at selected sites to see how the application performs. We'll notify sites.

  • ALICE reports -
    • T0 site
      • Nothing to report
    • T1 sites
      • FZK: GGUS:64470 in progress. For the moment submission is only through cream-ce3. There is an unscheduled intervention until tomorrow morning on cream-ce1.
    • T2 sites
      • Usual operations

  • LHCb reports - Quiet w/e. New releases of DaVinci and LHCbDirac are available; validation will be performed this afternoon. How to proceed to test cernvmfs at the T1s? [ Put on the agenda of the next T1SCM - answers from T1 sites at that meeting ]

Sites / Services round table:

  • BNL - ntr
  • FNAL - ntr
  • NDGF - the Nagios server at SARA that we use had certificate problems this morning - a lot of errors for all of our sites, some still going on. The Myproxy server at CERN is not very happy about this. 6-7 GGUS tickets are tracking this - hopefully fixed by tomorrow. Maarten - did the DN of the Nagios server change? A: the DN of the Nagios tests was changed. Maarten - that might require an update of the Myproxy information.
  • NL-T1 - ntr
  • CNAF - ntr. Alessandro observed intermittent storage failures; one of the SAM machines is always failing tests. Maybe the storage doesn't accept proxies signed by the BNL VOMS server. A: working on it.
  • PIC - ticket reported by CMS one squid server out of 3 not responding.
  • KIT - problem with one of the CREAM CEs (cream-1-fzk): new jobs stay forever in registered or idle status. Proxy renewal? Disk space filled up over the w/e - many jobs not submitted. The CREAM CE is in downtime until tomorrow. Maarten - are you running the latest version? A: maybe not the latest from this update; saw some fixes about proxy renewal. Maarten - it would be good to upgrade the CREAM nodes. Dimitri - the other CEs don't have this problem; an option to update. Maarten - the fixes were quite important - the problems seen are not always reproducible; some CEs might work fine whilst others "crumble". The latest release has fixes that make this a lot less probable.
  • RAL - in addition to the ATLAS d/s problems, the intervention on the tape robot today went well. A similar AT RISK tomorrow on the FTS servers for a glite 3.2 upgrade.
  • OSG - going through a maintenance period for our Twiki. After today will be using x509 for edit privileges.

  • CERN - lost a disk server for ATLAS yesterday. Monitoring did not catch that this box was dead, and it had been for several days. It needs another h/w intervention. Simone - is there a list of files? Jan - will put it in GGUS. Tomorrow there is a non-transparent intervention on CASTOR T3 at 09:30 (DB upgrade). Not in GOCDB as there is no SRM interface.

  • IS - trying to consolidate the information displayed by GSTAT. Opened a ticket against CNAF as the info is not accurate. Will continue with the T1s and T2s.

AOB:

Wednesday

Attendance: local(Harry(chair), Eddie, DavidT, Roberto, Simone, Maarten, Flavia, Doug, Manuel, Luca, Massimo, Stephen, Eduardo, Alessandro);remote(Stefano(CNAF), Michael(BNL), Jon(FNAL), Rob(OSG), Ulf(NDGF), Alexander(NL-T1), Pepe(PIC), Rolf(IN2P3), Dimitri(KIT)).

Experiments round table:

  • ATLAS reports -
  • T0:
    • Looked like transfer problems from T0-T1 yesterday, turned out to be expired grid cert. for monitor processes. Cert. was replaced and transfers turned out to be all good.
    • More discussions with database people and Panda support, a more detailed report at end of week.

  • T1:
    • Disk server problems for RAL are complete: they were able to recover 337 files; all others are marked as lost. Production has restarted, and tasks missing files are aborted and will have to be redone. The problematic generation of file servers has been put into read-only mode, so no new data goes there. In two weeks, new servers will be rushed online and data will be moved off. They are currently 150TB behind the ATLAS storage agreements, and it will take 3-4 weeks to get back up to the desired levels.
    • Network problems for data transfer to INFN are back again, and an alarm ticket was posted at 10:00. Site responded quickly, and problems were worked out. Space tokens were blacklisted, but brought back after fixes were put in place and tested. Alarm ticket will be closed by time of this meeting.
    • IN2P3 put back in full service today, and new data from T0 is being sent there again. Site had been in low-use for past week. Problems were seen in trying to handle the new reprocessing effort, and many services were turned off until this worked well. Now need to watch services there for the rest of the week to see if things are working again.

  • T2:
    • Victoria-Westgrid offline due to disk failures. I'd like to point this out because Bryan Caron, for the Canadian cloud, announced this to us before we saw the problem, created a site Savannah ticket, and listed the outage in GOCDB. ATLAS systems saw the outage in GOCDB and blacklisted the site. Very few failures occurred, and very little effort was required to manage the tier-2 failure. This should be the standard for how tier-2s are managed in all clouds. (GGUS:64536)
    • Frascati set offline, problems with job running on batch and access to storage.
    • TR-10-ULAKBIM is still out, they requested to be turned on today, but they are still not passing SAM tests. (GGUS:64426)
    • UKI-LT2-Brunel was out last night due to internal network issues to a dataserver. (GGUS:64572)
    • DE/PSNC set offline due to most jobs failing with storage access issues from the worker nodes.
    • RU-Protvino-IHEP set offline, problem with access to DATADISK locally.

  • CMS reports -
  • Experiment activity
    • HI Run was getting better

  • CERN and Tier0
    • GGUS:64315 t0export disk server down since last week with only copy of two EDM RAW collision files.
    • Still some issues with DEFAULT pool, opened a ticket.

  • Tier1 issues and plans
    • Mopping up production at Tier-1s
    • T1_ES_PIC issue identified on one of the Squids. Savannah:117951, Fixed.
    • T1_US_FNAL transfers issues. Large tape queue for exports. Many sites providing files with wrong checksums thought to be due to source sites having bad copies of the files, Savannah:118013, under investigation.

  • Tier-2 Issues
    • No big Tier-2 Issues
    • We will run a pile-up test with 16 minimum-bias events at selected sites to see how the application performs. We'll notify sites.

  • Dashboard report (Eddie): The CMS software installation SAM test was failing yesterday on a CE at ASGC, trying to access a software release that had not yet been installed, Savannah:117339, now solved.

  • ALICE reports -
  • GENERAL INFORMATION: pass1 reconstruction of one of the HI runs is ongoing at T0, plus an MC cycle

  • T0 site: Nothing to report

  • T1 sites
    • FZK: this midday cream-1 was back in production, so it is working and in good shape. Thanks

  • T2 sites: Nothing to report

  • LHCb reports -
  • Experiment activities: Validation is in progress and going well, including at IN2P3 !!! However there are many job failures on the time limit at IN2P3. Rolf reported that the local LHCb support is working on this and suggested Joel contact them or submit a ticket.

Sites / Services round table:

  • RAL:
    • Follow-up by Andrew Sansum on yesterday's ATLAS disk server failure: In the minutes today (Tuesday) Douglas noted that several disk servers in one of our batches failed and asked if we had a plan. John wasn’t able to say exactly what our plan was (my fault for not briefing him properly). In fact we decided yesterday afternoon to remove the whole batch as soon as is practicable and replace them with reserve capacity that is sitting on our machine room floor having already completed acceptance tests. We are now working through the mechanics and scheduling of this task and the relative urgency between VOs and service classes. We are also looking at what temporary measures we may have to take before we actually get the servers out of production. Realistically I expect we can do some of the smaller service classes in the next week or two but ATLAS is going to take longer – we are working on the details now and Alastair Dewhurst will be talking the issues through with ATLAS directly over the coming days. We’ll do what we can to reduce the inconvenience but obviously this is not going to be pain free for anyone. I apologise – it's very frustrating for all of us but clearly vital now that these servers are pulled out of operation until fixed.
    • FTS upgrade is going ahead today.
    • Yesterday’s minutes gave a figure of 130 files identified as important, the correct figure should be 830.

  • CNAF: This morning there was a problem with transfers to ATLAS DATATAPE failing. This was due to a misconfiguration related to BNL VOMS references disappearing from the endpoints - now fixed and working again. They will close the ticket if transfers keep succeeding for a few more hours. Joel reported that CNAF had told him they were seeing a lot of jobs from LHCb while the LHCb monitoring showed only a few jobs there. Stefano agreed to follow up on this.

  • FNAL:
    • As CMS reported they are investigating srm overloads at FNAL
    • A reminder that FNAL and other US sites will be on a holiday schedule through Sunday (for Thanksgiving) so they will be running at an emergencies only response level.

  • NDGF:
    • Had a major power cut in Linkoping yesterday evening, probably blizzard-induced. The site stabilised within 20 minutes after power was restored, with one host damaged.
    • Scheduled updates of disk pools completed today with no problems.

  • NL-T1: This morning a file-server serving ATLAS and LHCb crashed with a kernel panic but it was brought back online in 15 minutes.

  • OSG: The bdii has remained stable so hopefully all is now well. As reported by FNAL OSG operations will be on holiday hours over Thanksgiving till Monday.

  • KIT: Still have problems with one of their CREAM-CEs (submission to the batch server gives error 127) so their downtime has been extended till tomorrow.

  • CERN CASTOR: Oracle backend database for the CERN T3 service (analysis) was upgraded this morning.

  • CERN batch: Will be retiring 5 SLC4 CEs on Monday starting by putting them in draining mode.

  • Information System - Open GGUS tickets #64595 against PIC, #64593 against IN2P3-CC and #64592 against FZK-LCG2 because of missing or inconsistent information published concerning installed capacity of SEs.

  • Frontier: A vulnerability was detected in the Tomcat Manager for version 6.0.29 (the version used by the frontier rpms). New frontier-tomcat rpms that correct this vulnerability have been released. The new rpms v1.0-5 can be found here.

  • ATLAS Frontier servers at CERN have been upgraded to the new frontier-tomcat rpms.

AOB:

Thursday

Attendance: local(Harry(chair), Doug, DaveT, Eddie, Lola, Massimo, Stephen, Manuel, Alessandro, Roberto, MariaD, Flavia);remote(Stefano(CNAF), Ulf(NDGF), Rolf(IN2P3), Ronald(NL-T1), Jeremy(GridPP), Foued(KIT), John(RAL), Joel(LHCb)).

Experiments round table:

  • ATLAS reports -
  • T0:
    • Nothing significant to report today.

  • T1:
    • INFN transfer and storage problems came back last night, causing large numbers of transfer failures for about an hour. Site reported that it was a temp. storage outage. No problems observed since then.

  • T2:
    • SRM contact problems in Milano, but it seems to be ok for most of today.
    • SRM contact problems with QMUL, and not getting much response from site on issues.
    • SRM contact problems with RAL-LCG2 off and on.
    • Outage at Manchester to deal with internal failures for a couple of hours today. Correctly reported and ticketed by the site.

  • CMS reports -
  • Experiment activity
    • Good run overnight

  • CERN and Tier0
    • No open issues.
    • GGUS:64315 t0export disk server down since last week with only copy of two EDM RAW collision files. Files successfully recovered and reconstructed yesterday.
    • Still some issues with DEFAULT pool, opened a ticket. Was due to a user having the same file open 140 times.

  • Tier1 issues and plans
    • Mopping up production at Tier-1s
    • T1_US_FNAL transfers issues. Large tape queue for exports. Many sites providing files with wrong checksums, Savannah:118013. Exports look like they've recovered.

  • Tier-2 Issues
    • No big Tier-2 Issues
    • We will run a pile-up test with 16 minimum-bias events at selected sites to see how the application performs. We'll notify sites.

  • ALICE reports -
  • GENERAL INFORMATION: pass1 reconstruction of one of the HI runs is ongoing at T0, plus an MC cycle

  • T0 site: Nothing to report

  • T1 sites
    • FZK: this morning there were several issues with the proxy and the Cluster Monitor:
      • couldn't connect to the myproxy server - happens from time to time at FZK: solved
      • vobox-proxy query: strange behaviour. Probably the database was corrupted. Fixed by registering the proxies again and restarting the services. Could be due to an AliEn module, so an ALICE problem.
    • The AliEn CE was sometimes losing its connection with the ClusterMonitor, both at FZK and IN2P3, so this is under investigation.

  • T2 sites: Nothing to report

  • LHCb reports -
  • Experiment activities: Validation is in progress and going well. Reprocessing will start this afternoon and is expected to run for a few weeks, depending on the availability of the sites and how well they sustain the rates.

  • T0
    • LFC : if the expert is not around nothing happens. Is this acceptable? (CT727281, requesting a manual change in the LFC database, was put ON HOLD due to the absence of the expert [Fwd: Strange SE in LFC]). Maria queried why this ticket was submitted directly to the CERN PRMS system rather than through GGUS. Joel had sent it directly to lfc.support@cern.ch but will prefer GGUS tickets in future.

  • T1 site issues:
    • SARA : jobs aborted for CREAM CE (GGUS:64618)
    • GRIDKA : jobs aborted for CREAM CE (GGUS:64617)
    • IN2P3: After discussions the LHCb timeout for worker node access to the software shared area has been increased to 1 hour.

Sites / Services round table:

  • KIT: Have extended the downtime of the CREAMCE-1 as some jobs were aborting (see the LHCb report).

  • CERN CASTOR:
    • ATLAS and CMS issues from yesterday will be solved in different ways with the files being available again soon.
    • There will have to be two important interventions in January: one will affect all activities with an Oracle upgrade of the name server backend but it will be short so around the 6 January is suggested. The second will be a tape-robot stoppage of about 24-hours so affecting all data in that robot but which has some flexibility so it would be good to know experiment activity plans for January. Stephen for CMS reported they would be reprocessing their heavy-ion data in January.

  • CERN Squid/Frontier: New rpms will be announced later today.

  • GGUS Open Issues (MariaD): Thursday daily meetings will now be used to review open GGUS issues even if there is no T1SCM meeting. This week ALICE said no issues, LHCb said nothing apart from the shared SW area at IN2P3, and there is no input from the others, which probably means no issues. She reported one open ticket from CSCS (Swiss Supercomputer centre, a CMS Tier-2), who have submitted GGUS:63676 requesting support for a Torque batch system issue. Such support is thought to come from Nikhef on a best-efforts basis and has already been escalated in GGUS. Could NL-T1 please take a look? Ronald agreed to follow up.

  • AOB: SLC4 VObox support ends in a week's time. See S. Traylen's mailing. Stephen queried why the alias lxvoadm could not be reused for lxvoadm5. This might mean creating a new lxvoadm4 alias for the overlap period - to be followed up.

Friday

Attendance: local(Harry(chair), Doug, Eddie, Manuel, Massimo, Maarten, Lola, Simone, Roberto, Alessandro);remote(Michael(BNL), Xavier(KIT), Joel(LHCb), Stefano(CNAF), John(RAL), Pepe(PIC), Onno(NL-T1), Rolf(IN2P3), Stephen(CMS), Jeremy(GridPP)).

Experiments round table:

  • ATLAS reports -
  • T0:
    • GGUS outage, database access problems there, started around 13:00, back by 14:00 today.
    • lxvoadm5 use discussed, the login access list needs to be defined. Currently about 120 ATLAS users can login to lxvoadm, so people want to clean up the login access on the machine (down to maybe 50 users). Discussions on how to do this, and which lists to define are on-going.
    • Oracle load issues again for a few hours yesterday, caused by Panda use. This seemed to be load from both the panda server and panda monitor services. Later in the day the Oracle load was fine, and as far as people could tell the access and use were the same; it seems the load is high enough that short load spikes can take hours to settle down. Some notes on the changes decided on in the daily meetings this week (as requested; a sketch of the batched-removal pattern appears after this list):
      • The main load issues this week resulted from the database structure: high-use tables are kept small, holding only current job records; once jobs reach a final state (approx. 3 days), records are copied to archive tables and removed. The ability to remove records fast enough broke down, and these tables grew from 3 days to 7 days of records.
      • Special server routines were set up to remove records asynchronously from the copy, in large bulk removes of 10k at a time. Still not able to keep up at the start of the week. Indexes on the high-use tables were restructured, unused or underused indexes dropped, some indexes changed to bitmap indexes, and there was talk of some multi-column indexes (not sure if those were really set up).
      • The server routines for record removal were changed to a 500k batch size; after this, removal was able to catch up by the end of the week. I believe this will stabilize at 100k.
      • Talk of setting up daily partitions on the high-use tables, so removal can happen on lower-use older partitions; not sure if this will go into production.
      • Use of index hinting was re-done in the monitor to reduce load on larger selects on the archive tables.
      • Squid caches were set up in front of the servers, with forced caching of some status selects done by pilots.
      • There was also discussion of server-side routines for the large monitor selects.
      • Reducing the load in general is a continuing issue, but the current server will be replaced in early Jan (which may require a two-day Panda outage) and load patterns will have to be evaluated again after that.
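
      As an illustration of the batched-removal pattern described above, here is a minimal sketch in Python using cx_Oracle. The table and column names (jobs_active, jobstatus, modificationtime) are hypothetical stand-ins rather than the real Panda schema, and the real archiving logic is more involved; the point is only that deleting in bounded batches, with a commit after each batch, keeps locks and undo small:

          import cx_Oracle  # Oracle client bindings; assumes an Oracle client is installed

          # Hypothetical table/column names, for illustration only.
          DELETE_BATCH_SQL = """
              DELETE FROM jobs_active
               WHERE jobstatus IN ('finished', 'failed', 'cancelled')
                 AND modificationtime < SYSDATE - 3
                 AND ROWNUM <= :batch_size
          """

          def purge_final_state_records(dsn, user, password, batch_size=10000):
              # Remove final-state records (already copied to the archive tables)
              # in bounded batches, committing after each batch.
              conn = cx_Oracle.connect(user, password, dsn)
              cur = conn.cursor()
              total = 0
              while True:
                  cur.execute(DELETE_BATCH_SQL, batch_size=batch_size)
                  deleted = cur.rowcount
                  conn.commit()
                  total += deleted
                  if deleted < batch_size:
                      break  # less than a full batch left, so the backlog is drained
              conn.close()
              return total

      Raising the batch size (the change from 10k towards 500k mentioned above) trades fewer round trips against longer individual transactions; how far this can go depends on undo capacity and concurrent load.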

  • T1:
    • Jobs failing at Taiwan due to release access issues. This was traced to nfs mounting on a certain pool of WNs and fixed, now in testing. This is affecting the HI reprocessing there. (GGUS:64645)
    • Jobs failing at INFN-T1 with storage access problems, and some SAM tests failing. Reported at 12:43 UTC today; SAM tests are still red, no response from the site yet. (GGUS:64660)

  • T2:
    • TR-10-ULAKBIM is still down, although site is working with cloud contacts to pass ATLAS SAM tests, so probably getting there.
    • RO-07-NIPNE storage is full, and site SRM access issues are blocking the ability to clean disks. Site blacklisted for now.
    • Frascati and PSNC fixed their problems, and the sites are back online.

  • CMS reports -
  • Experiment activity: Good running

  • CERN and Tier0: No open issues.

  • Tier1 issues and plans
    • Mopping up production at Tier-1s
    • T1_US_FNAL transfers issues. Large tape queue for exports. Many sites providing files with wrong checksums, Savannah:118013. Exports look like they've recovered.

  • Tier-2 Issues
    • No big Tier-2 Issues
    • We will run a pile-up test with 16 minimum-bias events at selected sites to see how the application performs. We'll notify sites.

  • ALICE reports -
  • GENERAL INFORMATION: pass1 reconstruction of one of the HI runs is ongoing at T0, plus an MC cycle

  • T0 site
    • During December there will be a downtime to upgrade AliEn. It will last 1 or 2 days
    • There are no definite plans yet for when to schedule the January CASTOR changes mentioned yesterday. It was agreed to coordinate with Massimo after the end of the heavy ion run.

  • T1 sites
    • FZK: this morning there were some issues again with the proxy on one of the voboxes. It is a persistent problem we are investigating. The next AliEn version, 2.19, will solve the issues reported yesterday about the Cluster Monitor, which collapses if there are too many jobs.

  • T2 sites: Nothing to report

  • LHCb reports -
  • Experiment activities: Reprocessing has started and is going well. It will be ramped up over the weekend to keep the sites busy.

  • T0:
    • Since this morning svn commits have been failing - this is urgent and should be addressed before the weekend. Manuel explained this is related to egroups authorisation and the svn team is working on it.

  • T1 site issues:
    • SARA : [SE][srmRm][SRM_AUTHORIZATION_FAILURE] Permission denied (GGUS:64659)
    • IN2P3 : We had a discussion with IN2P3 this morning and we agreed on a plan. Since some fraction of the jobs are finishing successfully, we ask them to remove from our SHARE the WNs which are not behaving properly, apply to these "faulty" WNs the same modifications as are configured on the successful ones, and then put them back in production.

Sites / Services round table:

  • GGUS: The ggus outage today was due to the backend database cluster rebooting by itself and then not coming back cleanly. Neither event is fully understood yet - more news on Monday.

  • CNAF: Not all ATLAS jobs are failing so we are letting them run while storage experts examine the failures. There was also an lfc problem but this is now thought to be unrelated.

  • KIT: Since midday CREAM-CE1 is out of unscheduled downtime and back in production.

  • RAL: All the vulnerable disk servers are now in read-only mode and testing of the replacement machines will start next week.

  • NL-T1: The srm authorisation failure reported by LHCb is being investigated.

  • IN2P3: Concerning the modifications to worker nodes requested by LHCb, an important factor is the AFS cache size, which we will increase to either 32 or 64 GB (not yet decided). There are also kernel parameter changes to fix a problem, seen over a year ago, of having too many i/o requests.

AOB: LHCb was looking for volunteer sites to evaluate the cernvmfs file system. Both NIKHEF and RAL have volunteered and first results look encouraging.

-- JamieShiers - 17-Nov-2010
