Week of 101101

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Jan, Luca, Steve, Simone, Jarka, Jamie, Maria, MariaDZ, Lola, Eddie, Maarten, Alessandro);remote(Michael, Reda, Ron, Renato, Gareth, Rob, Jon, IanF, Andreas Davour).

Experiments round table:

  • ATLAS reports -
    • Ongoing issues
      • CNAF-BNL network problem (slow transfers) GGUS:61440.
    • Nov 1st (Mon)
      • ATLAS general information: reprocessing campaign started on Friday (in the best ATLAS tradition)
      • Issues with T0 and T1s
        • IN2P3-CC is still not stable. Error rate fluctuating from a few percent to almost 100%. Lyon tried to reduce the number of job slots but this does not seem to help. This problem needs attention (GGUS:63638 and GGUS:63631).
        • One ESD file cannot be exported from CERN because of a failed checksum verification (GGUS:63638). It looks like the file is really corrupted due to a HW problem (RAID controller reset). ATLAS T0 people have been notified.
        • Staging problem at NDGF-T1 (reported on Friday) appeared again over the weekend. The NDGF team understood the root cause yesterday and deployed a fix today (together with an already scheduled intervention).
        • PIC FTS "lost" all channels during the night. Actually, the channels had been put in DRAIN in preparation for tomorrow's downtime, but too much in advance (one day too early). They are back to Active now. ATLAS suggests this is not done again tomorrow ("DRAIN" refuses new transfers but there is no way for DDM to know that channels are draining); closing them would be preferable. [ Maarten - Closed sounds worse than Drain. Simone - if Closed, transfers pile up on this channel if there is no other source; if in Drain, submission of FTS jobs fails, so a huge number of failures shows up on the dashboard. Neither is ideal. Best: schedule the downtime in GOCDB and let ATLAS DDM decide (see the GOCDB sketch after this report). ]
      • Central Services
        • ATLAS DDM dashboard was reporting only failures on Friday. This was fixed following the Dashboard Maintenance instructions
        • ATLAS Production dashboard is broken since Friday. Email sent to Dashboard Support. [ Eddie - Not maintained at CERN - main developer is in Wuppertal. ]
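    A minimal sketch of the "schedule the downtime in GOCDB and let DDM decide" idea discussed above: query the GOCDB programmatic interface for ongoing downtimes of a site before using its FTS channels. The endpoint URL, the XML layout and the site name below are assumptions for illustration, not details taken from these minutes.

        # Hedged sketch: check GOCDB for ongoing downtimes of a site before using it.
        # The endpoint URL, the get_downtime parameters and the XML layout are
        # assumptions for illustration; consult the GOCDB PI documentation.
        import urllib.request
        import xml.etree.ElementTree as ET

        GOCDB_PI = "https://goc.egi.eu/gocdbpi/public/"  # assumed endpoint

        def ongoing_downtimes(site):
            """Return (service_type, severity, description) tuples for ongoing downtimes."""
            url = f"{GOCDB_PI}?method=get_downtime&topentity={site}&ongoing_only=yes"
            with urllib.request.urlopen(url, timeout=30) as resp:
                tree = ET.parse(resp)
            return [
                (dt.findtext("SERVICE_TYPE", default=""),
                 dt.findtext("SEVERITY", default=""),
                 dt.findtext("DESCRIPTION", default=""))
                for dt in tree.getroot().findall("DOWNTIME")
            ]

        if __name__ == "__main__":
            site = "pic"  # hypothetical site name
            downtimes = ongoing_downtimes(site)
            if downtimes:
                print(f"{site}: {len(downtimes)} ongoing downtime(s) - avoid the site instead of draining channels early")
            else:
                print(f"{site}: no ongoing downtime, keep FTS channels Active")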

  • CMS reports -
    • Experiment activity
      • Reasonably quiet weekend
      • Getting ready for HI Running
    • CERN and Tier0
      • Smooth operations
    • Tier1 issues
      • Reprocessing ongoing
      • Transfer failures from IN2P3 to MIT (GGUS and Savannah). Could impact export of HI MC - hence escalated.
    • Tier2 Issues
      • large production ongoing
    • AOB
      • CRC changes to Peter Kreuzer tomorrow.

  • ALICE reports -
    • T0 site
      • On Friday night there was a lack of synchronization between the read-only and read-write AFS areas. It was solved within a few hours by Maarten and confirmed by ALICE experts.
      • This morning some manual intervention was needed on voalice09, the xrootd redirector machine for ALICE, to restart the redirector because it was stuck.
    • T1 sites
      • ntr
    • T2 sites
      • Downtime of a few hours at Hiroshima on 02-11. Some hardware of an SE broke and parts must be replaced.
      • Usual operations concerning Bologna T2

  • LHCb reports - Weekend without Data taking. Reconstruction proceeding. Few MC productions.
    • T0
      • Last Wednesday GGUS:63514. Diskserver problem. Files set "invisible" for user, who can use replicas. Waiting for update.
    • T1 site issues:
      • IN2P3: "seg-faulting" problem fixed. (GGUS:63573 and GGUS:62732 closed)
      • RAL: GGUS:63468 (Opened 26/10) SRM service instability. RAW data had size "zero". [ Gareth - had a few problems with LHCb SRMs. As mentioned, a recurrence of a problem with some files recorded with 0 file size. Last week some badly formed TURLs were returned by the SRM. On Friday rebooted the CASTOR head nodes to try to resolve this. Ran into problems afterwards with CASTOR timeouts - failed many SAM tests. More problems Friday afternoon and evening. First problems were a DB connection issue, the others load. Post-mortem being prepared. ]

Sites / Services round table:

  • KIT - ntr [ public holiday in Germany today! ]
  • TRIUMF - as Simone mentioned doing reprocessing for ATLAS. So far smooth over 4500 jobs over w/e. Less than 1% failure rate - dominated by application errors. About 7K jobs in queue.
  • BNL - all services running smoothly. BNL technicians working on primary electrical supply and running on diesel power until tonight
  • NL-T1 - had an issue with the site BDII at SARA, fixed Sunday a.m.
  • RAL - had some problems with ATLAS SRM this morning - SRMs crashing. Cause is a call to the status of a bringonline request - known to crash this version! Local ATLAS contact will bring this up at the ATLAS meeting shortly. Some batch problems since yesterday evening - SAM tests on CEs failing - having difficulties contacting some aspect of the batch service - working on it.
  • FNAL - ntr
  • NDGF - 3 comments: 1) GGUS:63583. In contact with the main dCache developer who had found a bug which could create this kind of behaviour. This morning a patch was applied during the planned outage - should be solved and will close the ticket tomorrow if OK. 2) r/o pools under investigation, not in write mode yet. 3) unplanned outage - probably forgot to transmit that Stockholm pools were offline all day - will affect availability of data. Online tonight.
  • OSG - ntr

  • CERN storage: repeat tomorrow at 14:00 of the intervention on the ATLAS head node. Also tomorrow a scheduled intervention on an ALICE network switch - affects 12 machines. 10 diskservers will be unavailable while the switch is replaced. Simone - please send the usual email as well as GOCDB.

AOB:

Tuesday:

Attendance: local(Eddie, Julia, Peter, Lola, Miguel, Maarten, Simone, Maria, Jamie, MariaDZ, Jarka, Alessandro, Gavin, Steve, Luca);remote(Elisabetta, Gang, Tiju, IanF, Rolf, Gonzalo, Federico, Reda, Kyle, Ronald, Jeremy, Michael, Jon, Andreas, Dimitri).

Experiments round table:

  • ATLAS reports -
    • IN2P3-CC: high load on dCache issues (GGUS:63627), dump of activity promised for this afternoon, thank you! Other issues reported in GGUS:63631 and GGUS:63626. Do you have any update on those tickets, please? [ IN2P3 will follow up ]
    • ATLAS had some problems with SS between midnight and 4am; the problem is being investigated. Suspect correlation with CERN CC issues observed during the night, e.g. AFS glitch. Can CERN please comment?
    • ATLAS DDM dashboard was empty early this morning; it was fixed later in the morning. Developers are trying to find the underlying issue.

  • CMS reports -
    • Experiment activity
      • Aligned to current LHC plan, namely to have a pp fill on Wednesday evening and on Thursday morning switch to the HI data-taking mode.
    • CERN and Tier0
      • putting configurations in place to guarantee a smooth transition to HI running
    • Tier1 issues
      • Storage pool issues at IN2P3 (since upgrade on Sep 22)
        • Affecting transfers, e.g. to MIT (see Savannah ticket 117603)
        • Workaround is downgrade to old situation, but apparently some data was left on pools where the workaround has not yet been applied
        • Same issue is also affecting CMS processing at CCIN2P3
      • KIT failing CMS-CE SAM test (see GGUS ticket 63689)
      • Request by RAL to move to gLite 3.2 FTS: OK for CMS; however, for file exports the destination sites need to change their FTS mapfile for RAL and test the new endpoint. Once this test is done at several UK T2s, it can go to production.
    • Tier2 Issues
      • No outstanding issue, large production ongoing
    • AOB

  • ALICE reports -
    • T0 site
      • Nothing to report
    • T1 sites
      • We observed that since yesterday no jobs have been running at SARA for ALICE. Under investigation (but should be...)
    • T2 sites
      • Usual operations

  • LHCb reports - Weekend without Data taking. Reconstruction proceeding. Few MC productions.
    • T0
      • NTR
    • T1 site issues:
      • RAL: GGUS:63468 (Opened 26/10) SRM service instability. RAW data had size "zero". Waiting for an update. [ Just been declared as unsolved. Tiju - after restarting the services the problem had not been seen again. This is why the ticket was closed as unsolved. ]

Sites / Services round table:

  • IN2P3 - ntr
  • TRIUMF - ntr
  • NL-T1 - ntr
  • BNL - yesterday reported an intervention associated with the primary power feeder. At the end of the intervention, when switching back to utility power, had an outage. Had to bring back all ATLAS Tier1 services. Done for the majority of services within 2 hours and the remainder within the third hour.
  • FNAL - 2 things: 1) an internal switch crashed yesterday and the supervisor module was replaced. 200 WNs offline for 2 hours. Today is 2 Nov - the original date for the extended downtime. As the LHC technical stop changed, moved the downtime to next Monday. Changed also in the OSG downtime system. This did not seem to work - flagged as down in GridView. Jobs not sent etc. Not in downtime but flagged as such... Remember when the downtime was changed in OSG there was some issue - Nov 2 was going to be flagged as "secret" (not anymore...). Kyle - not sure about that. Saw ticket. Peter - will check who should follow up.
  • RAL - yesterday reported problems with batch server - rebooted and now ok. This morning downtime to upgrade ATLAS SRM to 2.8.6. To solve problems with ATLAS yesterday. Also had a downtime for some work on tape robots - completed successfully
  • PIC - ntr - in middle of scheduled intervention. Planned to be back around 18:00 this afternoon
  • ASGC - ntr
  • CNAF - last night we had a problem on a machine affecting the ALICE s/w area. NFS issue. The problem blocked the node for at least 5 hours.
  • NDGF - addendum to yesterday's news. R/O pools still stuck. Every time we turn them to R/W, GPFS crashes. Trying to find out why... ATLAS data there but should be available for reading.
  • KIT - problem with our CEs this morning. Main problem on 1 of the CEs - the disk was full of log files from Globus GRAM - happens if jobs are submitted to the CE with globus-job-submit or some other Globus tools. Cleared the home dir and it seems OK. In contact with CMS to find out why it happened.
  • GridPP - ntr
  • OSG - report that BNL was sending stale data to the CERN BDII today. Restarted Tomcat, which fixed the problem. Michael - we have more info about the nature of these outages. CEMon dies, most likely due to memory shortage. The process runs out of memory - will add some physical memory to get around this.

  • CERN Storage - AFS / Kerberos issue mentioned by ATLAS. SSB mentions an upgrade of some Kerberos packages. Ticket # to follow up? Jarka - we did not file any ticket; checked the ATLAS AFS monitor and it was green all night. Saw something on an ATLAS mailing list. Miguel - forward to afs or kerberos support. Alessandro - "strange" - didn't file a ticket as still investigating. Saw many different problems on some of the central services. Some seem to be more network related. Our services do not rely on AFS. Error message "cannot contact" - not 100% sure what to report. There was an intervention on the ALICE switch - went OK. Currently at risk for ATLAS as a head node is being serviced. Failed over and will fail back within an hour or so.

  • CMS side: FNAL seems fine. Jon - some things work and some do not... Take a look at GridView. We are yellow. Working with OSG to get the downtime removed.

AOB:

Wednesday

Attendance: local(Alessandro, DB, Eddie, Edoardo, Jarka, Lola, Maarten, Miguel, Peter, Roberto, Simone, Steve);remote(Andreas, Dimitri, Federico, Gang, Gonzalo, Jon, Michael, Onno, Reda, Rob, Rolf, Tiju).

Experiments round table:

  • ATLAS reports -
    • IN2P3-CC: high load on dCache issues (GGUS:63627): dump of activity sent this morning. dCache developers contacted, deeper system investigation today. Any findings in the meantime, please?
      • Rolf: ticket updated around noon for more statistics, still issues with tape mounting/scheduling system, work ongoing; AFS SW area issues possibly dependent on WN system version, different versions in use on different WN
      • Simone: dCache at Lyon has been unstable for weeks, intermittent failures esp. affecting the disk parts - is there an activity to try and understand where all this trouble is coming from? other sites are also affected when data needs to be transferred/copied, Lyon may end up blacklisted
      • Rolf: the matters are under investigation
    • Reprocessing activities in progress.
    • Last pp stable beam planned for tonight, start of lead ion beam commissioning planned for tomorrow.

  • CMS reports -
    • Experiment activity
      • Ready for last pp-collision running of the year, planned for late afternoon/evening : 50 ns bunches fill for collisions with 12+4x24 pattern (8 hours in total)
      • tomorrow (Thursday) 12:00 : start Heavy Ion commissioning (Ion source should be back)
    • CERN and Tier0
      • putting configurations in place to guarantee a smooth transition to HI running
    • Tier1 issues
      • IN2P3 : storage pool issues (since upgrade on Sep 22)
        • still affecting the CMS pile-up production, which is moving on very slowly for the remaining jobs due to that issue
        • problem with transfers from IN2P3 to MIT: GGUS:63826 (Savannah:117603)
      • KIT : issue of CMS jobs flooding home directories (gatekeeper) of the CEs (see GGUS:63689) : workaround found by the local admins (auto-cleaning), however CMS still needs to make sure such (WMAgent) test workflows are getting under control (relevant CMS operators are currently working on that)
      • RAL : CMS-CE glitch at 23:00 - 05:00 UTC : fixed by local admins, by changing the maximum number of files that can be staged from 1000 to 800
      • Near future planning : start re-reco round end of the week
    • Tier2 Issues
      • No outstanding issue, large production ongoing
    • AOB
      • nothing to report

  • ALICE reports -
    • T0 site
      • Nothing to report
    • T1 sites
      • NIKHEF. AliEn services had to be restarted this morning because the CE was stuck. The JOBSID file from CE.db had some 360 000 values; cleaning up this file and restarting the services fixed the problem.
      • Work ongoing to debug ALICE work flow at SARA
    • T2 sites
      • Usual daily operations

  • LHCb reports -
    • Experiment activities: Mostly MC and user jobs
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 1
      • T2: 0
    • Issues at the sites and services

Sites / Services round table:

  • ASGC - ntr
  • BNL
    • BDII update incident during the night due to tomcat out of memory, will be restarted as needed, more memory will be added
  • CNAF - ntr
  • FNAL
    • several GGUS test tickets received: what to do with them?
      • after the meeting those test tickets turned out to have been for CERN, sent by Peter; Guenter Grein of the GGUS team is investigating
    • yesterday's downtime could not be fully cleared, 18 hours remain: this should be corrected in the monthly availability calculations
      • Rob: correction will be sent
  • IN2P3
    • LHCb ticket being worked on
    • CMS ticket? added to CMS report after the meeting
  • KIT
    • only CMS LCG-CE flooding issue (GGUS:63689, cured by cleanup, not yet understood)
  • NDGF
    • RAID crash at Stockholm, rebuild ongoing, 1 pool offline; ATLAS potentially affected
  • NLT1
    • ALICE jobs did not start at SARA, temporarily fixed by fair share modification, working on proper solution; now each job exits after 3 minutes, to be debugged by ALICE (was also seen at NIKHEF)
    • ATLASDATADISK was full at NIKHEF, ATLAS recovered space by removing old data
  • OSG - ntr
  • PIC
    • yesterday when the scheduled downtime ended a large number of jobs starting simultaneously caused the shared area to become overloaded, causing a lot of job failures for ATLAS, LHCb and others
      • Maarten: do you plan to do something about the shared area?
      • Gonzalo: issue depends on job type, not always problems when batch queues are reopened
      • Maarten: try different areas for different experiments?
      • Gonzalo: we will look into that possibility
  • RAL - ntr
  • TRIUMF - ntr

  • CASTOR - ntr
  • dashboards - ntr
  • databases
    • LCGR execution plan instabilities affected FCR application, due to SQL profile misleading the optimizer, now corrected
  • grid services - ntr
  • networks - ntr

AOB:

  • updates incorporated in preceding items

Thursday

Attendance: local(Eddie, Alessandro, Jamie, Maria, Simone, Jarka, MariaDZ, Steve, Carlos, Lola, Nicolo, Miguel);remote(Rolf, Foued, Michael, Jon, Reda, Gang, Onno, Jeremy, Gonzalo, Gareth, Federico, Rob).

Experiments round table:

  • ATLAS reports -
    • IN2P3-CC: three issues in total. Can IN2P3-CC please comment on ongoing work to solve these issues?
      • high load on dCache (GGUS:63627)
      • tape - SRMV2STAGER errors (GGUS:63896), manual intervention plans mentioned yesterday in ADC ops list
      • SW area (AFS issue)
    • Taiwan-LCG2: SRMv2 daemon died (GGUS:63836), site fixed the problem promptly. Thank you!
    • INFN-T1: some ATLAS reprocessing jobs do not start because of 'pending' inputs (GGUS:63884). The site responded that the files are in 'ONLINE_AND_NEARLINE' status and recommended restarting the jobs. [ Simone - the problem is that the ATLAS staging agent is in a funny state for that site. The agent insists that the files are offline. Did a MySQL intervention at 13:30 - have to check if this was successful. It was meant to solve an emergency situation. Intervention OK - solved but not understood. ] [ Michael - BDII issue: Maarten found that, as the system is case sensitive, a non-LHC VO spoiled propagation from the OSG to the WLCG BDIIs. Corrected about 30' ago. The point still is - and we would like to discuss this further with OSG - why such entries are accepted by the OSG BDII but not properly propagated. Rob - should discuss within OSG. Will make a note to discuss it in the production meeting next week. ]
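    As a rough illustration of the BDII propagation check mentioned in the bracketed discussion above, the sketch below asks two information systems whether they publish a given site's GlueSite entry. The BDII host names and the site name are placeholders, and ldapsearch (openldap-clients) is assumed to be installed; this is not the procedure actually used at the meeting.

        # Hedged sketch: compare whether a site's GlueSite entry is visible in two BDIIs.
        # Host names and the site name are placeholders for illustration.
        import subprocess

        def site_visible(bdii_host, site_name, port=2170):
            """Return True if the given BDII publishes a GlueSite entry for site_name."""
            cmd = [
                "ldapsearch", "-x", "-LLL",
                "-H", f"ldap://{bdii_host}:{port}",
                "-b", "mds-vo-name=local,o=grid",      # commonly used top-level base (assumption)
                f"(GlueSiteUniqueID={site_name})",
                "GlueSiteUniqueID",
            ]
            out = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
            return "GlueSiteUniqueID:" in out.stdout

        if __name__ == "__main__":
            site = "BNL-ATLAS"                          # placeholder site name
            for host in ("lcg-bdii.cern.ch", "is.grid.iu.edu"):  # placeholder BDII hosts
                status = "publishes" if site_visible(host, site) else "does NOT publish"
                print(f"{host} {status} {site}")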

  • CMS reports -
    • Experiment activity
      • We did not get the expected last pp-collisions fill (50 ns bunches) last night
      • Now starting Heavy Ion commissioning, expecting first collisions on Sunday morning
    • CERN and Tier0
      • Making new CMSSW application release to handle HI processing, which fixes some bugs (e.g. xrootd client). Expecting to be ready for Sunday.
    • Tier1 issues
      • IN2P3-CC : storage issues (see GGUS:63826 , warning : Subject of ticket not describing the full issue)
        • issue with upgraded pool (Sep 22) that needed to be downgraded
        • on the same ticket, it was specified by Damien Mercier (CMS contact) that staging from MSS to DISKS is really slow, and site experts are still working on the issue
        • besides transfers inefficiencies, this affects also CMS processing at IN2P3-CC :
          • 600 jobs completed last night in ~8h, using 1000 slots. That corresponds to the MC production speed, while most of these jobs (500) were actually file-based processing jobs, which are expected to go much faster.
          • note that the site admins made some support effort for CMS, by bringing online all the MinBias files needed for this pile-up processing and by providing CMS with a few more CPUs to speed up the production. Yet, as mentioned above, the processing is not as fast as expected.
      • Near future CMS plans at Tier-1s : (a) finishing pile-up processing mainly at IN2P3-CC and INFN, (b) start large re-reco round at all sites end of the week, (c) Heavy Ion MC processing at IN2P3 (which is depending on when (a) will finish)
    • Tier2 Issues
    • AOB
      • Savannah-GGUS bridging tool :
        • Successful optimization of the bridging tool yesterday: now only a single button is needed for CMS users to get the whole machinery working, namely: the GGUS robot triggers a GGUS ticket, the GGUS ticket is cross-referenced in the Savannah ticket, the Savannah ticket status is put "on hold", and the closure of the Savannah ticket is triggered upon the closure of the GGUS ticket. Thanks a lot to Guenter Grein for his ongoing support!
        • Yesterday during bridging tests we had a little incident: a temporary DB mismatch during a GGUS service upgrade triggered wrong assignment of bridged GGUS tickets, which went to FNAL instead of CERN, as noticed and reported by Jon Bakken in yesterday's WLCG call. Sorry for spamming FNAL! The problem was promptly understood and fixed. [ Steve - should we close them? A: yes. ]

  • ALICE reports -
    • T0 site
      • Last night there was a problem with PackMan that stopped production at CERN for some hours. The AFS software area was full so quota needed to be increased and some clean up was done.
    • T1 sites
      • RAL issue last Monday: It seems that the new 2.1.9 instance is now working. We will ask the RAL experts for confirmation and close the ticket.
    • T2 sites
      • Usual daily operations

  • LHCb reports - Experiment activities: Mostly user jobs, a few MC and the reconstruction tail
    • T0
      • NTR [ some delays in migration of LHCb data to tape because of bad checksums in aborted gridftp transfers. Something that CASTOR 2.1.9-9 fixes. Probably good to upgrade the LHCb instance to the latest version in the coming days. See the checksum sketch after this report. ]
    • T1 site issues:
      • RAL: Diskserver out of production for few hours (GGUS:63902)
      • PIC: As reported by pic yesterday, we are experiencing a shared area slowness.
      • NIKHEF: GGUS:63816 (opened yesterday). Pilots were aborted because jobs exceeded the memory limits. Closed now.
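    Since bad checksums come up both here (tape migration) and in Monday's ATLAS report (corrupted ESD file), here is a minimal sketch of the verification involved: computing the adler32 checksum of a local file and comparing it with the value recorded in a catalogue. The file path and the expected value are made-up examples, not data from the minutes.

        # Hedged sketch: verify a file's adler32 checksum against a catalogue value.
        # The file path and expected checksum below are made-up examples.
        import zlib

        def adler32_of(path, chunk_size=1024 * 1024):
            """Compute the adler32 checksum of a file, returned as 8 hex digits."""
            value = 1  # adler32 seed
            with open(path, "rb") as f:
                while True:
                    chunk = f.read(chunk_size)
                    if not chunk:
                        break
                    value = zlib.adler32(chunk, value)
            return format(value & 0xFFFFFFFF, "08x")

        if __name__ == "__main__":
            path = "/data/example.ESD.pool.root"   # hypothetical file
            expected = "1a2b3c4d"                  # hypothetical catalogue value
            actual = adler32_of(path)
            print("OK" if actual == expected else f"MISMATCH: {actual} != {expected}")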

Sites / Services round table:

  • IN2P3 - one of the ATLAS issues, the one with the s/w area (new ticket from today), was already covered by another ticket before. This is very similar to the LHCb s/w area problem; the working group currently in place for the LHCb problem will also handle this one - same people involved. Other tickets: the dCache people are in contact with the dCache developers, which means that they are looking at our site. Ongoing - this was the action of yesterday afternoon. Ticket on the staging problem: a staging schedule is in place which had some problems yesterday. The manual intervention described yesterday was to get out of this problem - not the reason for the problem, but an action to avoid such situations as much as possible. This should be solved today - will verify. CMS: storage issue still ongoing. In addition, for ATLAS there is something that has to be confirmed - problems with ATLAS arrived when ATLAS was in a peak period here, above the MoU engagement. Will try to work around this by seeing whether we have to limit the # of jobs for ATLAS at the site to MoU levels, to see whether this has an impact on the errors. Simone - the # of jobs in Lyon was reduced to 500 a few days ago, which is surely below the MoU, but problems persisted over the w/e. Sure, look into it, but there is a standing problem with storage in Lyon since end September which looks more related to the dCache upgrade at that time. Rolf - several ongoing problems. One of the events is certainly the upgrade at the end of September, but this is already under investigation with the dCache developers. Under certain circumstances we also have high load from other experiments. Will isolate LHCb from ATLAS during s/w installation as this has shown to be a source of problems for both experiments. Hopefully this will solve / clarify the outstanding problems. Hard to handle all this together.
  • KIT - ntr
  • BNL - ntr
  • FNAL - the availability numbers from last Tuesday's site downtime are not corrected yet - please can this be done. Rob - will check on the status of those records being resent. Jon - was told a few days ago that, as the downtime was deleted, it was impossible to retransmit the info. Might have to be done by hand on the WLCG side. Rob - will check with Scott and will mail the SAM team directly if so.
  • TRIUMF - ntr
  • ASGC - yesterday afternoon ASGC unstable - some CMS files could not be restored. Downtime for net maintenance finished this morning. From midnight until 10 this morning.
  • NL-T1 - NIKHEF has a broken switch which causes problems on some WNs. Being replaced.
  • PIC - as mentioned by LHCb, had a 2nd episode of NFS area overload. Started yesterday at 17:00. Puzzled as we didn't see a change in ATLAS reprocessing - 1-2K ATLAS jobs and suddenly NFS started to be saturated again. Happened during the night. A couple of hundred ATLAS analysis jobs started around that time. Stopped this activity around 10:00 this morning. Don't have 100% certainty that the two are correlated, but the situation improved. Maybe ATLAS activities are changing too... Related to this: in the GGUS ticket from ATLAS, when the first NFS overload appeared (correlated with the opening of the batch queues after the downtime), they mentioned that they were doing quite a large # of CMT setups. Is this going to change? If so, would like to know when. Simone - will have to check and get back to you. Jarka - some activity in development.
  • RAL - ntr
  • GridPP - ntr
  • OSG - ntr

  • CERN AFS UI - The current gLite 3.2 AFS UI will be updated on Tuesday 9th November at 09:00 UTC.
               previous_3.2  current_3.2  new_3.2
        Now    3.2.1-0       3.2.6-0      3.2.8-0
        After  3.2.6-0       3.2.8-0      3.2.8-0
    The 3.2.8-0 release has been in place for 3 weeks of testing and has been in use by CMS during this time. [ Nico - some minor problem which will be reported. ] The very old 3.2.1-0 release will be archived and subsequently deleted in a month or two's time.
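    To make the table above concrete, the sketch below simply reads where each AFS UI link currently points, before and after the switch. The base AFS path and the link layout are assumptions for illustration, not confirmed by these minutes.

        # Hedged sketch: show which gLite release each AFS UI link resolves to.
        # The base path is an assumption for illustration; adjust to the real location.
        import os

        BASE = "/afs/cern.ch/project/gd/LCG-share"  # assumed location of the UI links

        for link in ("previous_3.2", "current_3.2", "new_3.2"):
            path = os.path.join(BASE, link)
            try:
                target = os.readlink(path)          # e.g. a versioned directory (hypothetical layout)
                print(f"{link:14s} -> {target}")
            except OSError as exc:
                print(f"{link:14s} -> not readable here: {exc}")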

GGUS Issues

GGUS:61440 - ATLAS CNAF-BNL Transfers. Created 2010/08/23, assigned to ROC_Italy, last update 2010/11/02, status: Re-opened. The ticket was re-opened to understand how the problem was (?) fixed. A description of what was changed on the NREN side is requested in the diary but remains unanswered.
GGUS:62696 - CMS xrootd errors. Created 2010/10/02, assigned to ROC_CERN, last update 2010/11/04, status: In progress. Let's plan the fix for the "bouncing" clients installation in the CMS redirector in 2 weeks. Please define the date in the ticket.
GGUS:61706 - CMS Maradona error in the CERN CEs. Created 2010/08/31, assigned to ROC_CERN, last update 2010/10/14, status: On hold. User ticket, 'less urgent'. Errors still fluctuate. CERN IT services are waiting for a hotfix from the vendor (Platform LSF).
GGUS:63450 - CMS no active links to SLS from service map. Created 2010/10/22, assigned to ROC_CERN, last update 2010/11/03, status: In progress. User ticket, urgent! CMS uses a map different from the default 'Critical Service map'. The CERN IT specialist on Service Level Status is investigating the difference between the 2 maps (URLs).
GGUS:59880 - LHCb shared area performance problems at IN2P3. Created 2010/07/08, assigned to NGI_France, last update 2010/10/28, status: In progress. User ticket, very urgent! Discussed at the last T1SCM of 2010/10/28; the experiment provided reasons why a solution is needed urgently, no news from the site.
GGUS:62800 - LHCb problems to install experiment software in the shared area at IN2P3. Created 2010/10/06, assigned to NGI_France, last update 2010/11/02, status: Waiting for reply. TEAM ticket, top priority! The experiment is to provide the site with the script that checks and installs missing packages.

AOB:

Friday

Attendance: local(Steve, Jacek, Peter, Alessandro, Jarka, Jamie, Gavin, Lola, Eddie, Ignacio);remote(Michael, Jon, Xavier, Gonzalo, Reda, CNAF, Rolf, Tiju, Jeremy, Ronald, Gang, Rob, Federico).

Experiments round table:

  • ATLAS reports -
    • IN2P3-CC: two issues remain. Can IN2P3-CC please comment on ongoing work to solve these issues?
      • tape - SRMV2STAGER errors (GGUS:63896), issue vanished from DDM dashboard, staging has been done today around 5am UTC.
      • high load on dCache (GGUS:63627) - can you please comment?
      • SW area (AFS issue) - jobs timing out and failing in the SW setup phase (GGUS:63917) - can you please comment? A new pilot has been released today, with a new feature added: the timeout limit in the python version localisation was increased from 600 s to 3600 s to alleviate AFS timeout problems. [ Rolf - for the tape staging problem, this was a problem of the tape mount scheduler, which was not configured appropriately for the load. This has been changed for higher load. For the dCache problem we are waiting for news from the dCache developers. For the timeouts in the setup phase of jobs - not an installation problem, which is another thing - they seem to be associated with a high load on WNs when several ATLAS jobs are starting or running at the same time. Apparently due to the GAUDI? framework you are using. Also seen by LHCb. The issue arises in the setup phase of the jobs, as many files are opened then - 5K files opened by 18 jobs; # of different files = 1400. This fills the AFS cache. We could increase the cache size and run some more jobs. Will try to limit the # of ATLAS jobs on a single WN and will try to find another solution in cooperation with ATLAS.FR to see whether we can avoid opening that many files in the setup phase of jobs. Alessandro - this 1K files is not clear. CMT is touching 1K files? Rolf - yes, CMT. Ale - this is standard behaviour of ATLAS jobs - it has always been like this. Rolf - with >18 jobs on a WN this creates problems. As the # of jobs / WN increases, the problem intensifies. Still trying to find a solution. Ale - it would be interesting to know if other sites with a different s/w area solution have the same / similar problems. ]
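    A rough way to reproduce the numbers quoted by Rolf above (total opens versus distinct files during job setup) is to trace the setup step and count open calls. The sketch below assumes strace is available on the WN and uses a placeholder setup command; it is an illustration, not the site's actual diagnostic.

        # Hedged sketch: count total and distinct files opened by a job-setup command,
        # along the lines of the per-WN numbers quoted above. Assumes strace is installed;
        # the setup command is a placeholder.
        import re
        import subprocess
        import tempfile

        SETUP_CMD = ["bash", "-c", "source ./setup.sh"]   # placeholder for the real CMT setup

        with tempfile.NamedTemporaryFile(suffix=".strace") as trace:
            subprocess.run(
                ["strace", "-f", "-e", "trace=open,openat", "-o", trace.name] + SETUP_CMD,
                check=False,
            )
            opens = []
            pattern = re.compile(r'open(?:at)?\((?:AT_FDCWD, )?"([^"]+)"')
            with open(trace.name) as f:
                for line in f:
                    m = pattern.search(line)
                    if m and "= -1" not in line:          # keep only successful opens
                        opens.append(m.group(1))

        print(f"total opens: {len(opens)}, distinct files: {len(set(opens))}")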

  • ALICE reports - GENERAL INFORMATION: Pass 0, pass 1 and pass 2 reconstruction activities ongoing. The last 'corrections' MC is presently running. All software and the Grid are ready for production, on standby for first collisions.
    • T0 site - Nothing to report
    • T1 sites - Nothing to report (GGUS:63612 against RAL solved)
    • T2 sites - Usual operations

  • LHCb reports - Mostly user jobs, a few MC and the reconstruction tail
    • T0
    • T1 site issues:
      • CE at PIC for which tests always fail. Gonzalo - this is a test CE, not in production. Eddie - it still affects site availability. Ale - it should be published as maintenance and not production, otherwise SAM will want to run tests on it.

Sites / Services round table:

  • BNL - ntr for Tier1. Update on CNAF-BNL networking issue. Network engineers say problems in direction from BNL to CNAF now resolved. Now working on other direction - hopefully will be resolved in next few days so can switch back from alternate route to production route.
  • FNAL - 1) reminder that next Monday there is a scheduled downtime from 06:00 local time until 00:00 that day. Don't expect it to last that long - first downtime since last November. 2) The downtime that was incorrectly charged to us is still on the books and we are still flagged as being down. Rob - is this a full FNAL downtime or just the Tier1? A: USCMS T1. Talked to Wojcik yesterday, who said he would remove the bad item from the DB and then the GridView people need to refresh. Maarten - have sent a reminder. By the end of this month I am sure it will have been resolved. Jon - we use this info at FNAL every Monday!
  • KIT - yesterday evening around 17:30 lost a PBS server. Don't know why - f/s got corrupted. Had to reinstall on new h/w. Final steps being done, will be back at full capacity soon. Lost around 8K jobs as a result. (EDIT by Xavier: We lost a PBS and not a "PVSS")
  • PIC - ntr
  • TRIUMF - ntr
  • CNAF - ntr
  • IN2P3 - just a clarification - the timeout problem we are suffering from now apparently has been there for a long time. Only difference is that it was previously rare - frequency higher now and more jobs impacted.
  • RAL - ntr
  • NL-T1 - 2 downtimes: 1st this Monday, a pool node at SARA will have its firmware upgraded. On Nov 16 network maintenance at SARA - the entire site will be unavailable all day.
  • ASGC - ntr
  • GridPP - ntr
  • OSG - 2 BDII issues: 1) found a problem (Maarten) that stopped the BNL SE from publishing - corrected at BNL - duplicate GUMS entries removed. 2) BNL is doing regular restarts to stop info becoming stale.

  • CERN - will upgrade CASTOR LHCb Monday 10 - 12. Transparent intervention - maybe some delay in tape migrations

AOB:

-- JamieShiers - 28-Oct-2010
