Week of 110207

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information, GGUS Information, LHC Machine Information, CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs, GgusInformation, Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Massimo, Nicolo, Jan, AndreaS, MariaG, Stefan, Jamie, Ale, Stephane, Mike, Eva, Maarten, MariaDZ, Dirk);remote(Francesco/CNAF, Elena/ATLAS, Tore/NDGF, Maria Francesca/NDGF, Xavier/PIC, Jon/FNAL, Rolf/IN2P3, Ron/NL-T1, Daniele/CMS, John/RAL, Jeremy/GridPP, Suijan/ASGC, Rob/OSG, Dimitri/KIT, Paolo/CNAF).

Experiments round table:

  • ATLAS reports - Elena
    • ATLAS production and analysis jobs were successfully running across ATLAS sites
    • T1s
      • PIC GGUS:66409: Migration of data from damaged pools ended this weekend. PIC is producing the list of lost files (a minimal catalog-vs-storage consistency sketch follows this report). Thanks.
      • FZK GGUS:66783: 20% of ATLAS jobs are failing because of a network problem. GGUS:67085: Some ATLAS files were unavailable this morning. One of the machines hosting several ATLAS disk-only pools was restarted. Thanks for the quick fix.
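
A minimal Python sketch of the catalog-vs-storage consistency check referred to above (lost-file handling at PIC). This is an illustration only, not the actual PIC or ATLAS tooling: the dump file names and the one-path-per-line dump format are assumptions.

    # Hypothetical consistency check between a file catalog dump and a storage
    # namespace dump: "lost" files are registered but absent on storage,
    # "dark" files sit on storage without being registered in the catalog.

    def load_paths(dump_file):
        """Read one file path per line, skipping blank lines."""
        with open(dump_file) as f:
            return {line.strip() for line in f if line.strip()}

    def compare(catalog_dump, storage_dump):
        catalog = load_paths(catalog_dump)   # what the experiment thinks it has
        storage = load_paths(storage_dump)   # what is actually on the pools
        lost = catalog - storage             # re-replicate from other sites or declare lost
        dark = storage - catalog             # candidates for cleanup
        return lost, dark

    if __name__ == "__main__":
        lost, dark = compare("catalog_dump.txt", "storage_dump.txt")  # placeholder dump files
        print(f"{len(lost)} lost files, {len(dark)} dark files")
        for path in sorted(lost):
            print("LOST", path)

Files found to be lost could then be fed to the deletion service and re-replicated from other sites, as described in the PIC item above.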

  • CMS reports - Nicolo
    • CERN
      • No data taking
      • Ongoing debugging effort with IT-DB to understand DB connection issues in T0 workflow injectors
    • T1s
      • MC reprocessing with 3_9_7 is in the tails, some long-running jobs aborting and resubmitted.
      • FNAL - import of old datasets for custodial storage was failing due to missing target directory, fixed SAV:119077
      • RAL - AT RISK to deploy the gridftp checksumming upgrade (see the checksum verification sketch after this report)
    • T2s
      • 7TeV production ongoing.
      • T2_UK_SGrid_Bristol SCHEDULED downtime (OUTAGE) [ 2011-02-04 08:00 to 2011-02-11 17:00 UTC ] - computing room maintenance
      • T2_US_Florida SCHEDULED downtime (Outage) [ 2011-02-04 23:00 to 2011-02-07 15:00 UTC ] - SRM
      • T2_FR_CCIN2P3 SCHEDULED downtime (AT_RISK) [ 2011-02-06 05:00 to 2011-02-08 01:00 UTC ] - CE queues draining
      • T2_BR_UERJ UNSCHEDULED downtime (Outage) [ 2011-02-06 12:30 to 2011-02-07 12:30 UTC ] - CE and PhEDEx
      • T2_CH_CSCS SCHEDULED downtime (OUTAGE) [ 2011-02-07 07:30 to 2011-02-08 12:00 UTC ] - CE and SRM
      • T2_UA_KIPT UNSCHEDULED downtime (OUTAGE) [ 2011-02-07 09:00 to 2011-02-09 12:00 UTC ] - CE and SRM
      • T2_BE_UCL: SAM CE errors for read-only /tmp SAV:119070
      • T2_US_Nebraska: SAM/JobRobot errors for cert expiration on auth server, fixed SAV:119072
    • CMS CRC-on-duty from Tuesday, Feb 8th: Daniele Bonacorsi
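
In connection with the RAL gridftp checksumming upgrade mentioned above, the following is a minimal sketch of client-side checksum verification after a transfer, assuming ADLER32 (the checksum type commonly used by gridftp/SRM storage). The file path and the expected value are placeholders, not real data.

    # Compute the ADLER32 checksum of a local file and compare it with the
    # value published by the storage element (placeholder values below).
    import zlib

    def adler32_of(path, chunk_size=1 << 20):
        """ADLER32 of a file, read in 1 MB chunks to keep memory use low."""
        value = 1  # adler32 seed value
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                value = zlib.adler32(chunk, value)
        return format(value & 0xFFFFFFFF, "08x")

    if __name__ == "__main__":
        expected = "0a1b2c3d"                             # placeholder catalogue value
        local = adler32_of("/tmp/transferred_file.root")  # placeholder file
        print("OK" if local == expected else f"MISMATCH: {local} != {expected}")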

  • ALICE reports - Maarten
    • T0 site
      • The AFS read-write software area was inaccessible this morning. Possible reasons:
        • AFS release failures: a release is done right after every software update, and if updates are too frequent, two releases can overlap and crash, causing the failure.
        • A bug in the AFS server code.
      • AFS experts are looking into it and we are checking the AFS AliEn module to mitigate the effect
    • T1 sites
      • nothing to report
    • T2 sites
      • Usual operations

  • LHCb reports - Stefan
    • Experiment activities:
      • MC production: smooth operations during the w/e
    • T0
      • Observed a non-negligible fraction of user jobs failing at CERN. Investigating.
    • T1
      • RAL: transparent intervention on the remaining gridftp server to use the correct checksum
      • SARA: The fraction of user jobs failing at NIKHEF dropped after the timeout of the dcap connection was increased. (GGUS:66287)
      • GridKA: almost all data management activities there were failing (also confirmed by SAM). Opened a GGUS ticket on Saturday (GGUS:67079). Again this is an overload problem.
      • GridKA: misleading information advertised by the Information System is erroneously attracting pilots to the already full site. (GGUS:67106)
        • Dimitri: which values? Stefan: Number of waiting jobs.
        • Mike: saw today that a lot of SRM tests are failing, also at IN2P3. Roberto has taken the SRM tests out of the critical area and continues the investigation.
    • T2 site issues:
      • NTR

Sites / Services round table:

  • CNAF: NTR
  • NDGF: NTR
  • PIC: finished migration and will reinstall the corrupted pools. Expect 600-800 lost files, but waiting for the final list from the scrub. Pointing out the need for a consistency check between all catalogs and storage. PIC will produce a SIR by the end of this week.
  • FNAL: NTR
  • IN2P3 : NTR
  • NL-T1 : NTR
  • RAL : gridftp upgrade went well.
  • ASGC : NTR
  • KIT : NTR
  • GridPP - NTR
  • OSG - ticket for South West T2 failures now closed - a logging partition had filled up.
  • CERN
    • Eva: tomorrow transparent intervention on ATLAS DB
    • Jan: EOS intervention today went OK, with a slight overrun of the expected schedule
    • Maarten: Issue with slow lxadm nodes has been raised internally and is being worked on.
    • Massimo: reminder of the tape system intervention scheduled for tomorrow

AOB:

Tuesday:

Attendance: local(Alessandro, Daniele, Edoardo, Jan, Luca, Maarten, Maria D, Mike, Stefan, Stephane);remote(Dimitri, Francesco, Gareth, Gonzalo, Jeremy, Jon, Maria Francesca, Michael, Paolo, Rob, Rolf, Ronald, Suijian, Tore).

Experiments round table:

  • ATLAS reports -
    • MC production rate is decreasing. Not obvious that production will ramp up before the weekend
    • HI reprocessing still foreseen next week. RAW data are replicated between T1s to spread the load
    • Use service certificate (instead of Kors Bos) for Functional Test DDM transfers. Issues with FZK GGUS:67139, PIC GGUS:67138 and RAL GGUS:67137. Seems OK for other sites except IN2P3-CC (site is down)
    • T1s
      • PIC GGUS:66409: 1054 files declared as lost by PIC. List was fed to deletion service. ATLAS is recovering files available in other sites.
        • Gonzalo: list might not be complete yet, could even shrink a bit (hot files also present on other pools); final list later today
      • FZK GGUS:66783: Still 20% of ATLAS jobs are failing because of a network problem. No recent update in the GGUS ticket.
        • Dimitri: network problem is under investigation
      • IN2P3-CC: All activities in FR cloud stopped since LFC is down
  • CMS reports -
    • CERN / Central services
      • No data taking. Waiting for next MWGR (could be starting next Thursday, TBC)
      • CMSWEB production upgrade this morning: completed and OK
      • Issues with a few CMS FNAL colleagues and their CERN accounts; Martti has not yet completely excluded a relation to the new account management system introduced by CERN-IT at the end of last year (with some accounts being somehow missed), but more recent investigations seem to point to the CMS online side; following up in that direction at the time of this call
    • Tier-0
      • Still off. Ongoing debugging effort with IT-DB to understand DB connection issues in T0 workflow injectors
        • Latest progress: Jacek suggested to Dirk to simplify our transaction/session handling; we tried what was suggested and it seems to work. Not definitive but promising
      • HI zero-suppression starting soon. It will then be sent to FNAL and Vanderbilt T2 for reprocessing
    • Tier-1's
      • Tails of MC reprocessing with 3_9_7, some long-running jobs aborting and resubmitted
      • RAL T1: identified some (4) corrupted files which could be the cause of a wrongly diagnosed slow tape migration
      • downtimes
        • T1_FR_CCIN2P3 SCHEDULED downtime (OUTAGE) [ 2011-02-08 07:00 to 2011-02-08 18:00 UTC ] - Outage of database servers and data migration
        • T1_ES_PIC SCHEDULED downtime (OUTAGE) [ 2011-02-08 08:01 to 2011-02-08 12:00 UTC ] - FTS services: Database upgrade
        • T1_UK_RAL SCHEDULED downtime (AT_RISK) [ 2011-02-08 09:00 to 2011-02-08 16:00 UTC ] - At-risk to roll out changes to WAN tuning for disk servers in cmsWanIn and cmsWanOut.
    • Tier-2's
      • 7 TeV MC production ongoing, new request just came yesterday
      • downtimes
        • T2_FR_CCIN2P3 SCHEDULED downtime (OUTAGE) [ 2011-02-08 01:00 to 2011-02-08 18:00 UTC ] - general outage for maintenance (see T1)
        • T2_UK_SGrid_Bristol SCHEDULED downtime (OUTAGE) [ 2011-02-04 08:00 to 2011-02-11 17:00 UTC ] - Entire Building PowerOut and Machine Room Maintenance
        • T2_BR_UERJ SCHEDULED downtime (Outage) [ 2011-02-07 12:30 to 2011-02-08 12:30 UTC ] - Problems in external switch make CE/PhEDEx nodes unavailable
        • T2_RU_RRC_KI SCHEDULED downtime (OUTAGE) [ 2011-02-08 11:00 to 2011-02-08 12:00 UTC ] - hardware upgrade at storage nodes
        • T2_UA_KIPT UNSCHEDULED downtime (OUTAGE) [ 2011-02-07 09:00 to 2011-02-09 12:00 UTC ] - Hardware problem (RAID Controller issues on a disk server)
    • Miscellanea
      • Testing a new algorithm that considers both CREAM-CE and LCG-CE for SAM site availability (a toy sketch of the idea follows this report). Julia and the Dashboard colleagues reported at yesterday's Computing Ops meeting. The testing window is one (this) week. CMS power users plus CSP shifters have been given instructions and asked to report documented inconsistencies (if any).
      • Vidyo testing: in progress within IT; if the tests are OK, CMS will try it out at one of the next Computing Ops meetings
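
A toy Python illustration of the CREAM-CE / LCG-CE availability idea mentioned above, not the actual Dashboard implementation: both CE flavours are folded into a single critical "CE" service, which counts as available if at least one instance of either flavour passes its SAM tests, and the site is available if every critical service type is. The site name, hostnames and test results below are invented placeholders.

    # Placeholder SAM results: per site, per service flavour, per instance.
    SAM_RESULTS = {
        "T2_XX_Example": {
            "CREAM-CE": {"cream01.example.org": False},
            "LCG-CE":   {"ce01.example.org": True},
            "SRM":      {"srm.example.org": True},
        },
    }

    def service_up(instances):
        """A service type is up if at least one instance passed its tests."""
        return any(instances.values())

    def site_available(site):
        flavours = SAM_RESULTS[site]
        # Fold CREAM-CE and LCG-CE into one critical "CE" service.
        ce_instances = {**flavours.get("CREAM-CE", {}), **flavours.get("LCG-CE", {})}
        critical = {"CE": ce_instances, "SRM": flavours.get("SRM", {})}
        return all(service_up(instances) for instances in critical.values())

    print(site_available("T2_XX_Example"))  # True: the working LCG-CE covers the failing CREAM-CE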

  • ALICE reports -
    • T0 site
      • Tomorrow 9th February around 11 AM we are going to do the migration of one of the five ALICE xrootd servers. If everything is OK, we will do the other 4 on Thursday
    • T1 sites
      • FZK: one of the CREAM-CEs is not working
      • IN2P3: scheduled downtime announced yesterday
    • T2 sites
      • Usual operations

  • LHCb reports -
    • Experiment activities:
      • Running at a very high sustained rate (25K jobs in parallel) over the last days. Introduced a load-balancing mechanism for some internal DIRAC servers that would not sustain this rate otherwise.
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 0
      • T2: 0
    • Issues at the sites and services
      • T0
        • NTR
      • T1
        • PIC: Observed early this morning a spike of user jobs failing at input data resolution. This is consistent with the SRM service problem also reported by SAM at around 2:00 am this morning.
        • IN2P3: Major downtime today. Banned all services from the LHCb production mask.
        • GridKA: Problem with the SRM database, which got corrupted (GGUS:67079). This explains the poor performance observed since Saturday.
        • GridKA: misleading information advertised by the Information System is erroneously attracting pilots to the already full site (GGUS:67106; see the BDII query sketch after this report).
          • Dimitri: which BDII? our site BDII seems to look OK, but our top-level BDII seems to have outdated information?!
          • Stefan: will check which BDII is used; values of 444444 are "OK", whereas values of 0 are really bad
          • Maarten: please ensure your own top-level BDII is OK, open a GGUS ticket for the BDII developers as needed
      • T2 site issues:
        • NTR
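
A minimal sketch of the kind of BDII check discussed above, using python-ldap to read what a top-level BDII publishes for a CE's waiting and running jobs. The BDII and CE hostnames are placeholders. By the GLUE convention a published value of 444444 means "unknown" and is treated as harmless by brokers, whereas a wrongly published 0 waiting jobs makes an already full site look attractive to pilots.

    # Query a top-level BDII (placeholder hostname) for GlueCEState attributes.
    import ldap

    BDII_URI = "ldap://top-bdii.example.org:2170"  # placeholder top-level BDII
    BASE_DN = "o=grid"

    def ce_queue_state(ce_host):
        conn = ldap.initialize(BDII_URI)
        results = conn.search_s(
            BASE_DN,
            ldap.SCOPE_SUBTREE,
            f"(&(objectClass=GlueCEState)(GlueCEUniqueID={ce_host}*))",
            ["GlueCEUniqueID", "GlueCEStateWaitingJobs", "GlueCEStateRunningJobs"],
        )
        for dn, attrs in results:
            waiting = int(attrs["GlueCEStateWaitingJobs"][0].decode())
            running = int(attrs["GlueCEStateRunningJobs"][0].decode())
            print(dn, "waiting:", waiting, "running:", running)

    ce_queue_state("ce.example.org")  # placeholder CE hostname

Comparing the output of the site BDII and the top-level BDII for the same CE is one way to see whether stale or wrong values are introduced along the publishing chain.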

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • GridPP - ntr
  • IN2P3
    • all maintenance work went OK except for the Oracle upgrade: a HW problem with a disk was discovered, which will delay the end of the scheduled maintenance
  • KIT - nta
  • NDGF - ntr
  • NLT1
    • ALICE jobs at SARA were killed because of excessive memory usage
      • Maarten: also other sites have suffered such problems; AliEn 2.19 has protections against that, but they have not been enabled yet (need to be very careful with that); this issue should go away in the near future
  • OSG
    • top-level BDII deployment plan being reviewed, we will contact Flavia and Laurence accordingly
  • PIC - nta
  • RAL
    • reminder: outage of FTS/LFC/3D DB tomorrow for Oracle upgrade

  • CASTOR
    • tape library intervention ongoing (also tomorrow)
  • Dashboards - new services deployed:
    • Updated versions of the ATLAS and CMS Job Monitoring interactive interfaces have been deployed; currently being validated by CMS, but already in production for ATLAS. Performance was improved by redesigning the SQL queries.
    • New version of CMS's Site Usability interface has been deployed on the validation server. New algorithm calculates availability on the basis of SAM test results, considering CREAM and classical CEs as services of the same critical type. Should enter production in ~1 week, after validation.
    • A new version of the fully redesigned CMS critical service map has been deployed on the validation server.
  • databases
    • rolling upgrade of ADCR DB done because of PanDA suffering Oracle bug 6887866, not seen on other 10.2.0.5 instances; more information during T1 coordination meeting this Thu
  • GGUS/SNOW
    • tests have been stopped because of a SNOW upgrade; small adjustment on GGUS side needed, developers are in the loop
    • next tests will be on Thu
  • networks
    • packet loss on the KIT-CERN link being investigated (see ATLAS report): the link itself looks clean; the affected WNs at KIT are behind a NAT

AOB:

Wednesday

Attendance: local(Daniele, Mattia, Jan, Jamie, Eva, Stefan, Ale, Lola, Ricardo, Dirk);remote(Jon/FNAL, Xavier/KIT, Maria Francesca/NDGF, Tore/NDGF, Rolf/IN2P3, John/RAL, Ronald/NL-T1, Rob/OSG, Suijan/ASGC, Daniele/CNAF).

Experiments round table:

  • ATLAS reports - Ale
    • Use service certificate (instead of Kors Bos) for Functional Test DDM transfers:
      • Problem confirmed for PIC, FZK and RAL. For the moment, the service certificate is no longer used for FTS transfers until the problem is solved.
      • Ale Di Girolamo is following the issue with FTS support. This problem is critical for ATLAS
        • Jan: got a ticket about the email extension in the cert; discussion is back in GGUS. Jan will send details to Ale.
    • T1s
      • PIC : Summary of lost files (GGUS:66409): 217 dark files / 560 files recovered from other sites / 393 files completely lost (removed from dataset definition) (More details)
      • FZK (GGUS:66783) : Still pending
      • FR cloud back in production after IN2P3-CC downtime
      • RAL : All activities in the UK cloud stopped. An opportunity for some UK T2 sites to carry out interventions.

  • CMS reports - Daniele
    • CMS CRC-on-duty from Feb 8th to Feb 14th: Daniele Bonacorsi
    • CERN / Central services
      • No data taking. Waiting for next MWGR (could be today/tomorrow). Magnet ramp foreseen today.
    • Tier-0
      • T0 issues investigation in progress (see previous reports)
      • tuning for the HeavyIon zero-suppression run (starting this week)
      • a tape migration backlog seems to be accumulating at T0_CH_CERN_MSS (PhEDEx: ~5 TB requested, ~200 GB currently queued)
        • being monitored by CSP shifters, no ticket so far
    • Tier-1's
      • FNAL->RAL transfer of one block not progressing (SAV:119148)
    • Tier-2's
      • nothing special to report
    • Miscellanea
      • CREAM-CE / LCG-CE checks of SAM site availability: so far the shifters did not report any inconsistencies.
        • It's being kept monitored for 1 week (today: day 2)
      • Ale: ATLAS is interested in using the same algorithm and ideas as CMS and will stay in contact.

  • ALICE reports - Lola
    • T0 site
      • Migration to SLC5 of the xrootd services postponed until Monday at the experiment's request: urgent analysis of a data run is needed
    • T1 sites
      • NIKHEF: AliEn services were not working since last night and needed to be restarted manually
    • T2 sites
      • Several operations

  • LHCb reports - Stefan
    • MC and user activity.
    • T0
      • LFC-RO: one user was not able to connect to the service (GGUS:67167). The LFC managers report that the user presented a revoked certificate. This is odd, as the certificate is valid in VOMS. Asked CA support at CERN to look at the certificate's revocation status (Remedy:745836)
    • T1
      • NIKHEF: One user's jobs are systematically killed by the batch system. Opened a GGUS ticket to ask the sysadmins there for more information (GGUS:67160)
      • NIKHEF: Since 4pm yesterday MC production jobs seem to fail due to a timeout in setting up the runtime environment.
      • GridKA: Quick upgrade of dCache this morning to fix the corrupted SRM database problem spotted in the previous days.
      • GridKA: misleading information advertised by the Information System erroneously attracting pilots to the already full site (GGUS:67106). Now the situation seems better; has anything changed on the GridKA side?
      • RAL: LFC/FTS/3D Oracle backend intervention. Set the services relying on it inactive.
      • IN2P3: back from the downtime, re-enabled the services there.
    • T2 site issues:
      • NTR

Sites / Services round table:

  • KIT:
    • problems with internal DNS until 7 p.m. local time, which caused several failures involving internal communication
    • emails sent to KIT mailing lists (e.g. xyz@listsNOSPAMPLEASE.kit.edu) have not been delivered since yesterday
    • the ATLAS network problems are still under investigation
    • problems with the GridFTP doors for gridka-dcache.fzk.de were resolved together with the DNS problems
    • one GridFTP door of atlassrm-fzk.gridka.de, which had been out of production since last week, is back online again (defective RAM)
    • one machine hosting a dCache write-to-tape pool each for ATLAS and CMS had to be rebooted yesterday
    • short downtime from 10 to 11 a.m. for repairing corrupted SRM database tables of gridka-dcache.fzk.de - this should resolve and prevent the performance issues of the last days
  • FNAL - NTR
  • NDGF - tomorrow 1h downtime from 12:30 due to a dCache update
  • IN2P3 - small problem yesterday during the downtime; avoided a delay by doing only a partial upgrade of the Oracle hardware. Since Friday local FTS support has been working on transfer issues. No ticket yet, but will report once a solution is found.
  • RAL - in downtime for the DB upgrade, but might finish early
  • NL-T1 - the problems at NIKHEF reported above were due to an NFS box becoming unreachable. The batch system was also affected. Now all is stable and jobs are running. The downtime scheduled for next week will be rescheduled to March.
  • OSG - NTR
  • ASGC - NTR
  • CNAF - NTR
  • CERN/Eva - last DB upgrade for LCG integration DB to 10.2.0.5 will be done tomorrow.

AOB:

  • Jamie: tomorrow's T1 coordination meeting will include an update on FTS and Oracle server versions and a presentation by Roger Bailey on the conclusions from the Chamonix workshop and plans for the LHC run in 2011 / 2012.

Thursday

Attendance: local(Ricardo, Mike, Stephane, Ale, Maarten, Jan, Jacek, Stefan, Jamie, MariaG, Massimo, Dirk);remote(Michael/BNL, Gonzalo/PIC, Tore/NDGF, Jon/FNAL, Maria Francesca/NDGF, Rolf/IN2P3, Ronald/NL-T1, Foued/KIT, Jeremy/GridPP, John/RAL, Paolo/CNAF, Daniele/CMS, Suijan/ASGC).

Experiments round table:

  • ATLAS reports - Stephane
    • Use service certificate (instead of Kors Bos) for Functional Test DDM transfers: negotiations to define new DNs without an email address
    • ALARM ticket
      • PIC (GGUS:67205) : Problem transferring data from PIC (some gridftp servers affected: 'gridftp daemon were dead'). Solved
    • T1s
      • PIC (GGUS:67200) : Problem transferring some files from PIC: 'AsynWait error'. Traced to a specific pool, which was restarted. Seems better for the moment but not perfect
      • BNL-RAL (GGUS: 67214) : Temporary network problem between BNL and RAL which affected transfers
      • FZK (GGUS:66783) : The error rate is now at the percent level. This was not expected, as help is still being requested in the GGUS ticket

  • CMS reports - Daniele
    • CERN / Central services
      • No collisions. CMS magnet at 2 T now. The 10-day global run with cosmics has not started yet.
    • Tier-0
      • the HeavyIon zero-suppression run has started (~2k running jobs, ~30k in the queue)
      • the apparent tape migration backlog as from yesterday's report turned out to be just fine. Please ignore.
    • Tier-1's
      • MC production continues with no major site/infrastructure issues
      • T2_BE_IIHE->T1_FR_IN2P3 transfer not progressing, not sure yet at which end the problem is (SAV:119172)
    • Tier-2's
      • MC production and analysis go on with no major site/infrastructure issues
      • T2_RU_INR: failing the CE analysis SAM test (file open error) and JR at 77% efficiency (files missing) (SAV:119169)
      • T2_ES_IFCA: problems with CREAM-CE at IFCA (outage). The other two LCG-CEs are ok.
    • Miscellanea
      • CREAM-CE / LCG-CE checks of SAM site availability: current CSP shifter provided a script to do that, CRC is validating it
        • It's being kept monitored for 1 week (today: day 3)
      • [ CMS CRC-on-duty from Feb 8th to Feb 14th: Daniele Bonacorsi ]

  • ALICE reports - Maarten
    • T0 site
      • CAF extension is over and the data re-staging will start this Saturday 12th February at 11 AM.
    • T1 sites
      • Nothing to report
    • T2 sites
      • Several operations

  • LHCb reports - Stefan
    • Experiment activities:
      • MC and user activity smooth operations.
      • Issue in synchronizing VOMS-GGUS. Some teamers did not get propagated to GGUS.
    • T0
      • Some AFS slowness observed. This might explain why an increased number of MC jobs are failing at CERN (order of 10%)
    • T1
      • NIKHEF: Discovered a problem with the NFS-mounted home directories for pool accounts on some WNs. They had to restart some of them and kill jobs. (GGUS:67160)
      • IN2P3: picked up happily after the downtime.
    • T2 site issues:
      • NTR

Sites / Services round table:

  • BNL - in the process of adding disk space to dCache, which requires new controller firmware. Instead of a downtime, the site will install it transparently, with 10-20 minutes of unavailability that should not affect DDM replication or job execution
  • PIC - data corruption (GGUS:66409) recovered on the evening of 8 Feb. Most affected files were put back online, but 1136 files have been lost. Alarm this night for gridftp: 3 of the 4 ftp doors died; a manual restart of the doors fixed it. Also fixed some problematic transfers by rebooting an overloaded pool.
  • NDGF - ntr
  • FNAL - had 9 files missing from non-custodial pool which were blocking transfers to RAL - fixed
  • IN2P3 - ntr
  • NL-T1 - ntr
  • KIT - yesterday a new CREAM CE was put into production
  • GridPP - ntr
  • RAL - oracle intervention of yesterday went well
  • CNAF - problem for ATLAS caused by the StoRM backend - now fixed and confirmed by ATLAS
  • ASGC - ntr
  • CERN - 3 new CREAM CEs put into production. The ILC VO has been added to FTS.

AOB:

Friday

Attendance: local(Ricardo, Jamie, MariaG, Stephane, Mike, Maarten, Massimo, Jan, Ale, Dirk);remote(Xavier/PIC, Jon/FNAL, Xavier/KIT, Ulf/NDGF, Suijan/ASGC, Onno/NL-T1, Jeremy/GridPP, Gareth/RAL, Rolf/IN2P3, Rob/OSG, Daniele/CMS, Paolo/CNAF).

Experiments round table:

  • ATLAS reports -
    • Use service certificate (instead of Kors Bos) for Functional Test DDM transfers: update after the meeting
      • Asked the CERN CA to change the robot DN and have now received the new cert. The new subject is
      • subject : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=ddmadmin/CN=531497/CN=Robot: ATLAS Data Management
      • All T1s should add this new DN to their FTS with the same privileges as Kors's DN
        • CASTOR T1s should check their grid-mapfiles (a grid-mapfile check sketch follows this report). ATLAS will start testing on Monday.
    • GGUS : Cannot retrieve or post tickets
    • T1s
      • FZK (GGUS:67240) : dccp problem during the night
      • SARA/NIKHEF (GGUS:67256) : Transfer problems from NIKHEF to SARA : Transfer time-out
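
A minimal sketch of the grid-mapfile check suggested above: verify that the new DDM robot DN is present and see which local account it is mapped to. The path is the conventional location and the parsing assumes the simple quoted-DN format; real site setups may generate mappings differently (e.g. via LCMAPS), so treat this purely as an illustration.

    # Look up a DN in a grid-mapfile (lines of the form: "DN" account).
    NEW_DN = ("/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=ddmadmin/CN=531497"
              "/CN=Robot: ATLAS Data Management")

    def dn_mapped(mapfile="/etc/grid-security/grid-mapfile", dn=NEW_DN):
        """Return the mapped local account, or None if the DN is absent."""
        with open(mapfile) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                if line.startswith(f'"{dn}"'):
                    # everything after the closing quote is the mapping target
                    return line.split('"')[2].strip()
        return None

    if __name__ == "__main__":
        account = dn_mapped()
        print(f"mapped to {account}" if account else "DN not found - access would be refused")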

  • CMS reports -
    • CMS / CERN / central services
      • CMS magnet at 4 T now. Stable. Global run with cosmics (CRAFT11) ongoing now.
      • CMS gLite-WMS as monitored by GridMap degraded to about 88% on Feb 10th afternoon. It recovered within an hour.
    • Tier-0
      • the HeavyIon zero-suppression run continues smoothly. Expected to continue over the weekend and - with the current resources - to last ~18 days.
        • currently using ~2.5k T0 slots (plot), castor load -> out: ~3.5 GB/s ; in: 0.8 GB/s with peaks at >2.5 GB/s (plot). For resources, in contact with Bernd
      • subdetectors going in DAQ one by one for the cosmic global run. T0 checks resumed by shifters. Expected to continue over the weekend.
    • Tier-1's
      • KIT has JR efficiency at ~70% (JR job failures with Condor-Maradona errors, ~11% today)
      • PIC has <90% SAM Site Availability: sporadic failures, nothing worrisome, CMS contact informed and checking
    • Tier-2's
      • NTR - business as usual
    • Miscellanea
      • CREAM-CE / LCG-CE checks of SAM site availability: a couple of observations by shifters - noted down, will wrap up at the end
        • It's being kept monitored for 1 week (today: day 4)
    • plans for the weekend: HI-ZS run will continue. Global run will continue. MC prod at Tiers as usual. T0 resource utilization will be the thing to keep an eye on, mostly.

  • ALICE reports -
    • T0 site
      • Nothing to report
    • T1 sites
      • IN2P3: Services needed to be restarted yesterday manually after the intervention.
    • T2 sites
      • Cyfronet: GGUS:67254. The information provider is reporting wrong values (444444 jobs waiting and none running)
      • Several operations
    • IN2P3 - still a problem with ALICE jobs coming through CREAM - investigating

  • LHCb reports -
    • Experiment activities:
      • MC and user activity smooth operations.
    • T0
      • Some AFS slowness observed. This might explain why an increased number of MC jobs are failing at CERN (order of 20%). Will throttle the number of MC jobs
      • ce114.cern.ch: many pilots aborting with Globus error 12. This is symptomatic of a misconfiguration of the gatekeeper (GGUS:67253)
    • T1
      • CNAF: a bunch of jobs failed at around 10pm yesterday while setting up the runtime environment. The problem, also spotted by SAM, was gone around 11pm. Confirmed to be related to a shared area problem
      • GridKA: the issue with the information system attracting jobs (GGUS:67106) is gone. It now publishes all GLUE values correctly and the rank computed on top of them is rightly unattractive. The site should not be flooded any further.
    • T2 site issues:
      • NTR
    • Massimo: slow AFS at CERN - need more info. Roberto: will follow up offline.

Sites / Services round table:

  • Xavier/PIC - SIR uploaded to https://twiki.cern.ch/twiki/bin/view/LCG/WLCGServiceIncidents
  • Jon/FNAL - NTR
  • Xavier/KIT - rebooted one file server - some disk-only pools were offline for a while
  • Ulf/NDGF - network outage Tuesday evening, splitting the T1 in two pieces. A downtime has been scheduled. Will take data.
  • Suijan/ASGC - NTR
  • Onno/NL-T1 - NTR
  • Gareth/RAL - NTR
  • Rolf/IN2P3 - NTR
  • Paolo/CNAF - NTR
  • Jeremy/GridPP - NTR
  • Rob/OSG - downtime calculation problem for BNL. An OSG software release had a bug which affected some sites. A patch has been applied and a new version is expected next week.
  • Jan/CERN: an error in the info provider caused failed OPS tests for SRM.

AOB:

  • Massimo: GGUS/SNOW integration is progressing. Will switch to the SNOW link on Monday. Would like one experiment to submit test tickets; ATLAS volunteered.

-- JamieShiers - 02-Feb-2011
