Week of 110523

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability SIRs, Open Issues & Broadcasts Change assessments
ALICE ATLAS CMS LHCb WLCG Service Incident Reports WLCG Service Open Issues Broadcast archive CASTOR Change Assessments

General Information

General Information GGUS Information LHC Machine Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions WLCG Blogs   GgusInformation Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Xavier, Michal, Nicolo, Ricardo, Jacek, Juri, Massimo, Ignacio, MariaDZ, Dirk);remote(Michael/BNL, Lorenzo/CNAF, Jon/FNAL, Gonzalo/PIC, Rob/OSG, Huang/ASGC, Jeremy/GridPP, Vladimir/LHCb, Tiju/RAL, Onno/NL-T1, Rolf/IN2P3, Xavier/KIT).

Experiments round table:

  • ATLAS reports -
    • CERN/T0/Central services
      • Issue with acrontab/AFS at CERN fixed. GGUS:70735 verified.
      • T0/CASTOR: problems in retrieving files from the ATLAS T0MERGE pool since ~0:30 May 22. ALARM GGUS:70774 filed 2011-05-22 10:53 UTC, solved 2011-05-22 11:45. Disk server lxfsrc5306 had become unresponsive and was rebooted.
    • Tier-1
      • FZK-LCG2:
        • DATADISK to INFN-T1 functional test (FT) failures: "SOURCE error during TRANSFER phase: [TRANSFER_TIMEOUT] globus_ftp_client_size: Connection timed out". GGUS:70758 filed May 20 at ~3pm, solved at ~9:30pm. No problems found at the site; the problematic files were transferred to the site and no further errors were seen.
        • MCTAPE ~700 FT failures for several minutes: "SRMV2STAGER: Job failed with forced timeout after 21000 seconds". GGUS:70764 assigned: 05-21 09:33.
      • TAIWAN-LCG2:
        • MCTAPE large amount of staging failures: "SRMV2STAGER:StatusOfBringOnlineReques:Error caught in processPrepareRequest. ORA-01422:exact fetch returns more than requested number of rows ORA-06512: at "STAGER.PROCESSPREPAREREQUEST". Savannah:82483 filed on Friday, GGUS:70763 filed on Saturday. One tape (A00888) had a media error; vendor support was requested.
        • Power surge in DC at 23:40, May 21. Unscheduled downtime for Taiwan T1 and T2. Excluded from DDM automatically and queues set offline by the shifter. Extended till 9:00am, May 23. Elog 25606, 25618, 25620.
    • Xavier/ATLAS - AFS slowness for some jobs - was solved at 13:00. Ignacio/Ricardo: root cause was LDAP auth server overload from batch.
    • Xavier - KIT had one tape stuck in drive which had to be released manually - all fine now.
    • Huang - power incident at ASGC: recovered most services in 2 h, but needed more time for FTS recovery. Now all back.

  • CMS reports -
    • LHC / CMS detector
      • Highest recorded instantaneous luminosity at the LHC: 1.075 * 10^33 cm^-2 s^-1
      • CMS acquired ~60/pb since Friday evening
    • CERN / central services
      • Like many others, CMS processing was affected by the central lxplus issue at CERN this morning
      • CASTOR/T0 file copying (rfcp) is still an issue (already reported Friday): ongoing work between CMS and CERN/IT, see INC:033080
        • Massimo: managed to reconstruct the history of the problematic files - the root cause is the network; checking with development whether there was an additional race condition.
      • PhEDEx central agents affected by "ORA-01652: unable to extend temp segment by 128 in tablespace TEMP" errors over the weekend. The TEMP tablespace was extended this morning by IT-DB; monitoring the logs to verify whether the issue is solved, see INC:040410
        • Jacek: the temporary tablespace is shared by many applications and one of them started to abuse that space. Now looking into the creation of individual temp tablespaces (a sketch of what that could look like follows this report).
    • Tier-0 / CAF
      • Tier-0 busy processing latest data
    • Tier-1
      • MC and Data re-reconstruction ongoing.
    • Tier-2
      • CMS internally reminding its Tier-2s to install gLExec on WN + ARGUS/SCAS/GUMS before the end-of-June deadline (see CMS twiki poll)
      • MC production and analysis in progress
    • AOB :
      • Today: Nicolo Magini reporting for CMS
      • Next CMS CRC-on-Duty reporting here from tomorrow on: Stephen Gowdy
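
    A minimal sketch (Python with cx_Oracle) of the kind of separation Jacek mentions above: giving an application schema its own temporary tablespace so that one application cannot exhaust the shared TEMP space and trigger ORA-01652. All names, paths and sizes below (dba_user, cmsr, phedex_temp, cms_phedex, the TEMPFILE path) are hypothetical placeholders, not the actual CERN database configuration.

      import cx_Oracle

      # Connect as a DBA; credentials and service name are placeholders.
      conn = cx_Oracle.connect("dba_user", "dba_password", "cmsr", mode=cx_Oracle.SYSDBA)
      cur = conn.cursor()

      # Create a dedicated temporary tablespace for the application schema.
      cur.execute("""
          CREATE TEMPORARY TABLESPACE phedex_temp
          TEMPFILE '/oradata/cmsr/phedex_temp01.dbf' SIZE 2G
          AUTOEXTEND ON NEXT 512M MAXSIZE 20G""")

      # Point the application account at it; other schemas keep the shared TEMP.
      cur.execute("ALTER USER cms_phedex TEMPORARY TABLESPACE phedex_temp")

      conn.close()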

  • ALICE reports -
    • T0 site
      • User job errors and a job backlog developed during the weekend due to a problem with the switch in front of the central AliEn services; the problem was resolved around noon today.
    • T1 sites
      • NTR
    • T2 sites
      • Several operations

  • LHCb reports -
    • Experiment activities:
      • Data Taking is active. Processing and reprocessing are running.
    • T0
    • T1
    • T2

Sites / Services round table:

  • Michael/BNL - ntr
  • Lorenzo/CNAF - ntr
  • Jon/FNAL - ntr
  • Gonzalo/PIC - ntr
  • Huang/ASGC - ntr
  • Tiju/RAL - ntr
  • Onno/NL-T1 - two downtimes this week: tomorrow Oracle maintenance affecting FTS and LFC; on Thursday at risk for SRM due to certificate renewal.
  • Rolf/IN2P3 - reminder: scheduled downtime for all services apart from broadcast and portal; no notifications for other downtimes will be sent during that time.
  • Xavier/KIT - unscheduled downtime for cream1 until 6pm tomorrow; also a downtime 9:00-12:30 for FTS tomorrow

  • Rob/OSG - asked about the BDII upgrade plans and would like to stay informed. Ricardo: will do it later this week - a firm date by tomorrow.
  • Jeremy/GridPP - ntr

  • Jacek/CERN - last night LHCb streaming to SARA stopped due to a configuration problem. The problem was fixed manually; in touch with Oracle support to prevent this issue from reappearing in the future.
  • Michal/CERN - observed CMS dashboard performance degradation. Jacek: related to execution plan instability - will investigate further with app owners.

AOB: (MariaDZ) GGUS Release this Wednesday May 25th. Downtime published in GOCDB and announced via the CIC portal. This is a major release due to the GGUS Remedy (AR) system upgrade.

Tuesday:

Attendance: local(Xavier, Stephen, Massimo, Ricardo, Michal, Ale, Maarten, Eva, Edoardo, MariaDZ, Stefan, Dirk);remote(Gonzalo/PIC, Jon/FNAL, Ronald/NL-T1, Kyle/OSG, Vladimir/LHCb, Tiju/RAL, Rolf/IN2P3, Paolo/CNAF, Dimitri/KIT, Christian/NDGF).

Experiments round table:

  • ATLAS reports -
    • CERN/T0/Central services
      • NTR
    • Tier-1:
      • IN2P3: in scheduled downtime affecting SRM, CEs, FTS and LFC. FR cloud site services for data handling were turned off before the start (7:00am). The FR cloud was stopped 9 hours in advance in Panda to drain jobs and prevent data processing failures. Still in downtime.
      • SARA: in scheduled downtime affecting FTS and LFC. The NL cloud was stopped 9 hours in advance in Panda to drain jobs and prevent data processing failures. No intervention on site services for data handling; only a few hours in downtime. Recovered fine after the downtime in both data transfer and data processing.
      • FZK-LCG2: in scheduled downtime affecting FTS. No intervention on site services for data handling; only a few hours in downtime. Recovered fine after the downtime.
      • NDGF: All transfers involving NDGF-T1_MCTAPE have been failing for many hours (Failed to pin file [rc=32,msg=HSM script failed) GGUS:70822 (remains assigned). This seems different from the issue observed some days ago (Failed to pin file [rc=149,msg=No pool candidates available/configured/left for CACHE]) GGUS:70511 that was closed yesterday.

  • CMS reports -
    • LHC / CMS detector
    • CERN / central services
      • CASTOR/T0 file copying (rfcp) is still an issue (already reported Friday): ongoing work between CMS and CERN/IT, see INC:033080
        • Massimo: the analysis of the problem points to a network glitch as the root cause and no further CASTOR problem. Will leave the ticket open for some time, as CMS had several cases and it is not clear why only CMS is affected. Stephen: there may be a new occurrence today - if so, will update the ticket.
      • Tier-0 saw another Oracle problem; CMS was told it was due to some other user swamping the system - same cause as for PhEDEx?
      • CASTORCMS busy serving LHE files; will see whether this can be alleviated
    • Tier-0 / CAF
      • Tier-0 busy processing latest data
    • Tier-1
      • MC and Data re-reconstruction ongoing.
    • Tier-2
      • MC production and analysis in progress
    • AOB :
      • None
    • Stephen: one PB of data just disappeared from status display? Massimo: will follow up.

  • ALICE reports -
    • T0 site
      • Although the problem with the switch reported yesterday was solved, there is still a job backlog. At the moment the number of running jobs is increasing and the number of jobs in the task queue is decreasing slowly
    • T1 sites
      • Nothing to report
    • T2 sites
      • Several operations

  • LHCb reports -
    • Experiment activities:
      • Data Taking is active. Processing and reprocessing are running.
    • T0
      • CERN: problem with CE ce130.cern.ch (GGUS:70748)
        • possibly due to high load on the CE - the problem has now disappeared.
    • T1
    • T2
    • Stefan: some confusion was caused by the recent IN2P3 intervention; would like to improve notification when an intervention runs into problems. Rolf: the downtime had been correctly notified, but the transparent intervention turned out to be non-transparent due to a problem with a draining script. The site discovered the problem at the same time as the LHCb ticket arrived and followed up immediately.

Sites / Services round table:

  • Gonzalo/PIC - ntr
  • Jon/FNAL - ntr
  • Ronald/NL-T1 - ntr
  • Tiju/RAL - ntr
  • Rolf/IN2P3 - still in downtime - proceeding as planned.
  • Paolo/CNAF - ntr
  • Dimitri/KIT - extended downtime for cream-1
  • Christian/NDGF - investigating the ticket filed by ATLAS

  • Kyle/OSG - asked for the date and time of the BDII changes. Ricardo: the CERN BDII will be done tomorrow, the top-level BDII on Thursday (afternoon in both cases).

  • Eva/CERN - CERN to SARA replication problems: rolling patch will be applied asap by SARA
  • Michal/CERN: migration of ATLAS dashboard DB ongoing - downtime of 5h with no data updates during that period.

AOB: (MariaDZ)

  • GGUS Release tomorrow. All regular ALARM tests to American sites will take place one day later, to give the OSG developers time to adjust their configuration to the new GGUS AR during working hours in their time zones.
  • A DOEgrids host certificate request was received from the University of Houston-Downtown (http://uhd.edu/). The requestor, Hong Lin, cannot be found in any LHC or CERN database. Does OSG have collaborators at this university?
    • Kyle will follow up on this.

Wednesday

Attendance: local(Xavier, Ricardo, Stephen, Michal, Lola, Luca, Massimo, MariaDZ, Dirk);remote(Gonzalo/PIC, Felix/ASGC, John/RAL, Vladimir/LHCb, Jon/FNAL, Onno/NL-T1, Rolf/IN2P3, Lorenzo/CNAF, Rob/OSG, Dimitri/KIT, Christian/NDGF).

Experiments round table:

  • ATLAS reports - Xavier
    • CERN/T0/Central services
      • NTR
    • Tier-1
      • IN2P3: Extended downtime yesterday until 01:00am. Site and cloud services are fully back in production; observing a good fraction of errors due to I/O failures (WN-storage) and LFC glitches (not possible to file a GGUS ticket while GGUS is in scheduled downtime, but shifters are collecting info).
      • TAIWAN: Reactivated RAW data export to TAPE (availability set to 100 in Santa Claus)
      • NDGF: tape recall failures follow-up GGUS:70822. Site informs that everything points to another broken tape. Library audit in progress. Endpoint remains whitelisted until clarification.
    • AOB: a scheduling configuration change in Panda resulted in decreasing job numbers at some sites. This has been corrected since 8am.
    • Rolf: investigating the ATLAS problems - apparently a problem with the LFC; will follow up via a GGUS ticket once GGUS is back. Xavier: currently a 40% failure rate

  • CMS reports - Stephen
    • LHC / CMS detector
    • CERN / central services
      • CASTOR trouble yesterday afternoon/evening, due to a bad database optimiser plan. GGUS:70843.
    • Tier-0 / CAF
      • Tier-0 busy processing latest data
    • Tier-1
      • MC and Data re-reconstruction ongoing.
    • Tier-2
      • MC production and analysis in progress

  • ALICE reports - Lola
    • General Information: Yesterday there was a problem with one of the Job Brokers in the Central Services which caused errors at most of the sites. The Job Broker was restarted and the problem went away, but the cause of the recent issues with the Central Services remains under investigation.
    • T0 site
      • Nothing to report
    • T1 sites
      • Nothing to report
    • T2 sites
      • Several operations

  • LHCb reports - Vladimir
    • Experiment activities:
      • Data Taking is active. Processing and reprocessing are running. Certification of new Dirac version.
    • T0
    • T1
    • T2

Sites / Services round table:

  • Gonzalo/PIC - ntr
  • Felix/ASGC - We found that one of our CASTOR SRM servers did not rejoin the NIS domain properly after the power surge, which made our CASTOR SRM service unstable for a long while yesterday. It should be fixed now.
  • John/RAL - ntr
  • Jon/FNAL - ntr
  • Onno/NL-T1 - reminder: tomorrow at risk for sara srm
  • Rolf/IN2P3 - yesterday's downtime took longer because the site had to update the microcode of the disk arrays, and this operation took longer than predicted by the vendor. The T1 is shown with availability 0 for the month of May - a bug in gstat? The site used to configure the site BDII for LHCb in a special way, which is not accepted by gstat. A ticket (GGUS:70225) has been open since May but with no action from the gstat people. The BDII configuration has now been changed back, but this may disadvantage LHCb. Vladimir: will check.
  • Lorenzo/CNAF - ntr
  • Dimitri/KIT - fts was successfully updated yesterday
  • Christian/NDGF - tomorrow a fibre cut - some ATLAS data will be unavailable. ATLAS tape issue: might not be a problem with the tape but with TSM - will create a ticket once GGUS is back.
  • Rob/OSG - the OSG BDII was down yesterday from 14:00 to 15:00 UTC after a system update (a new version of OpenLDAP came in) - OpenLDAP was rolled back. Currently the ticket exchange is not working - Gunther is working on this. No alarm tickets recently, so no effect yet.
  • Ricardo/CERN - the CERN BDII was upgraded today at 14:30. Downtimes are registered for tomorrow: production and top-level BDII (a simple probe for checking that a BDII responds is sketched after this list).
  • Luca/CERN: after the DB services came back at IN2P3, some missing transactions were discovered. The LHCb LFC DB has been loaded from another DB; now looking into the reasons for the replication loss.
  • Michal/CERN: the dashboard DB migration finished - afterwards the dashboard exceeded its tablespace volume limit while catching up. Now OK and filling the gaps with logged data - no data was lost.
  • MariaDZ: can we discuss offline what to do about the certificate request from Houston? Rob: will look at this.
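
    For reference, a minimal probe of the kind one might use to confirm that a top-level BDII answers and is populated after such an upgrade; the host name, timeout and filter below are illustrative assumptions, not part of the changes discussed above.

      import ldap

      # GLUE 1.3 information is published under "o=grid" on port 2170;
      # counting GlueSite entries is a quick check that the BDII is populated.
      con = ldap.initialize("ldap://lcg-bdii.cern.ch:2170")  # assumed top-level BDII alias
      con.set_option(ldap.OPT_NETWORK_TIMEOUT, 10)
      con.simple_bind_s("", "")  # anonymous bind
      sites = con.search_s("o=grid", ldap.SCOPE_SUBTREE,
                           "(objectClass=GlueSite)", ["GlueSiteUniqueID"])
      print("%d sites published" % len(sites))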

AOB: (MariaDZ on behalf of Guenter Grein) Due to some interface problems the post-release GGUS ALARM tests are postponed at least until tomorrow.

Thursday

Attendance: local(Xavier, MariaDZ, Ricardo, Mike, Jan, Dirk);remote(Jon/FNAL, Ronald/NL-T1, John/RAL, Rob/OSG, Lorenzo/CNAF, Rolf/IN2P3, Gonzalo/PIC, Christian/NDGF, Vladimir/LHCb, Stephen/CMS, Dimitri/KIT, Todd/ASGC).

Experiments round table:

  • ATLAS reports - Xavier
    • CERN/T0/Central services
      • CERN cloud set brokeroff in Panda for draining.
    • Tier-1
      • IN2P3: The issues found yesterday with LFC access and SRM-WN communication have vanished. The site reported occasional LFC overloads, but this has apparently been cured. We are interested in knowing what caused the overload on the LFC (Savannah ticket).
      • INFN-T1: problems accessing SRM were found starting yesterday afternoon (~14h30); no GGUS filed (GGUS in scheduled downtime) but a Savannah ticket was (link). The problems disappeared after three hours.
      • NDGF: follow-up on HSM issue - being addressed by site experts GGUS:70822.
        • Christian: no news yet from NDGF - investigating

  • CMS reports -
    • LHC / CMS detector
      • No beam till this evening
      • Taking cosmics
    • CERN / central services
      • Cleaning up T0EXPORT pool to remove 1.3PB, will be used for EOS migration
    • Tier-0 / CAF
      • Tier-0 busy processing latest data
    • Tier-1
      • MC and Data re-reconstruction ongoing.
    • Tier-2
      • MC production and analysis in progress

  • ALICE reports -
    • T0 site
      • The job backlog has been processed.
    • T1 sites
      • KIT: ALICE::FZK::TAPE is full, experts have been contacted.
    • T2 sites
      • Several operations

  • LHCb reports -
    • Experiment activities:
      • No beam - no data. Processing and reprocessing are running. Certification of new Dirac version. TransferAgent stuck again, but all files from the backlog have been transferred now.
    • T0
    • T1
      • GRIDKA CREAMCE (GGUS:70835)
      • IN2P3 LFC RO Mirror is down; as a result most MC jobs from French sites failed to upload output data files.
    • T2

Sites / Services round table:

  • Jon/FNAL - ntr
  • Ronald/NL-T1 - SARA was at risk for replacement of certs and broken hw - all successfully done.
  • John/RAL - ntr
  • Lorenzo/CNAF - the ATLAS problem was likely due to an overload of the network between CNAF and KIT. Network people have been working on this and now everything is fine.
  • Rolf/IN2P3 - ntr
  • Gonzalo/PIC - incident last night with batch unavailable for 12h, triggered by a faulty procedure for taking out nodes which made the batch system get stuck. Recovered this morning; reviewing the procedure.
  • Christian/NDGF - ntr
  • Dimitri/KIT - ntr

  • Rob/OSG: the GGUS upgrade was applied; the patch which had been tried in the test system did not work in the production system. By 2:00 local time the patch was made to work. Worried, as the GGUS test system does not look like the production setup (it does not have all updates yet). Maria will follow up. All tickets held during the downtime are being sent now. Will follow the top-level BDII change and will test on the OSG side.

  • Ricardo/CERN: the site BDII was upgraded this morning - the top-level BDII is being upgraded now.
  • Mike/CERN: problem with the availability calculation for ALICE - the tests are green but the result is red - looking at this.
  • MariaDZ: Jarka discovered that the new GGUS release introduced a bug in searches - all sites are now returned instead of the filtered sites (Savannah:121168). Propose to defer the alarm tests to Monday, as the developer and Maria are not available on Friday.

AOB:

Friday

Attendance: local(Xavier, Jarka, Pedro, Gavin, Eva, Maarten, Michal, Jan, Dirk);remote(Michael/BNL, Stephen/CMS, Onno/NL-T1, Jon/FNAL, Xavier/KIT, Huang/ASGC, Gonzalo/PIC, Gareth/RAL, Rolf/IN2P3, Vladimir/LHCb, Lorenzo/CNAF, Christian/NDGF, Rob/OSG).

Experiments round table:

  • ATLAS reports - Xavier
    • CERN/T0/Central services
      • CERN cloud set online for HLT jobs.
    • Tier-1
      • IN2P3: We are interested in knowing what caused the overload on the LFC on wednesday after the downtime.
        • Rolf/IN2P3: local atlas support will follow up
      • INFN-T1: we ask the site to provide further info on this network saturation: did this happen at the OPN level, or did a network problem cause routing through the backup link? (link)
        • Lorenzo/CNAF: yes, there was a failover to the backup link
      • Jan: transfer problems for EOS import from NDGF due to firewall config issues - should be fixed now
      • Dirk: a few individual ATLAS users caused problems due to high-frequency Kerberos ticket renewals with POOL applications. The POOL developers have been informed and will investigate.

  • CMS reports -
    • LHC / CMS detector
    • CERN / central services
      • CVS issues yesterday, supposed to be fixed - CMS will check.
    • Tier-0 / CAF
      • Tier-0 busy processing latest data
    • Tier-1
      • MC and Data re-reconstruction ongoing.
    • Tier-2
      • MC production and analysis in progress
    • AOB :
      • None
    • Maarten: three additional WMS nodes have been put in to decrease the load on the existing ones. Will also check whether tuning of the MySQL DB would increase scalability.

  • ALICE reports -
    • General information: We still see the availability of the T0 and T1s as red in the dashboard, even after including back the test we had removed from the availability calculation. Results are published for only 2 of the 4 tests that should be shown. One is red and we are investigating the reason, but the other 2 are missing.
      • will investigate further with the dashboard team - one test may not be executed right now...
    • T0 site
      • Nothing to report
    • T1 sites
      • FZK
        • unscheduled downtime for the three CREAM-CEs that ALICE is using at the site.
        • the SE won't be available until next week
    • T2 sites
      • Usual operations

  • LHCb reports -
    • Experiment activities:
      • Waiting for beam, data. Processing and reprocessing are running. Certification of new Dirac version.
    • T0
    • T1
    • T2

Sites / Services round table:

  • Michael/BNL - ntr
  • Onno/NL-T1 - ntr
  • Jon/FNAL - ntr
  • Xavier/KIT - ntr
  • Huang/ASGC - ntr
  • Gonzalo/PIC - scheduled intervention on 8th June (whole day) to move to more powerful h/w for dCache. Received the GGUS ticket for the alarm test but not the alarm; same for other sites. This is likely due to the absence of MariaDZ and the GGUS developers today. Jon + Gareth: the alarm test should be shifted to Tuesday, as Monday is a holiday in the US and UK.
    • GGUS:71007 opened to request the change of date - GGUS developers agreed to do the test on Tue May 31.
  • Gareth/RAL - ntr
  • Rolf/IN2P3 - ntr
  • Lorenzo/CNAF - ntr
  • Christian/NDGF - downtime on monday for Norway - some atlas data will be unavailable
  • Rob/OSG - ntr (both ticketing and bdii changes worked fine)
  • Eva/CERN - ATLAS integration DB will be upgraded on Monday - ADC_R will switch back on Tuesday

AOB:

-- JamieShiers - 18-May-2011
