Week of 101018

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Harry(chair), Jan, Yuri, Graeme, Carlos, Ulrich, Alessandro, Patricia, Lola, MariaD);remote(Michael(BNL), Barbara(CNAF), Jon(FNAL), Gonzalo(PIC), Xavier(KIT), Rolf(IN2P3), Marie-Christine(CMS), Kyle(OSG), Gang(ASGC), Tiju(RAL), Onno(NL-T1), Roberto(LHCb)).

Experiments round table:

  • ATLAS

  • T0
    • CERN: ssh access to the lxvoadm machine failed. BUG:117331. A request was sent to "vobox.support" on Oct.16 at ~7pm. (This was due to a partial network switch failure.)
    • The ATLAS production dashboard issue was still present on Oct.18 at 00:20; experts informed. BUG:73904. Scheduled downtime on Oct.19 (2h) for an Oracle DB upgrade.

  • T1s
    • Taiwan-LCG2: transfer errors on SCRATCHDISK (Elog:18322); the site was informed and is working on it. SE unavailable on Oct.16 at 11:20; GGUS:63170 solved.
    • BNL: transfer failures, unable to contact the remote SRM. GGUS:63164 verified: the SRM and pnfs services were restarted on Oct.16 at 02:40. File transfer failures from DATADISK with HTTP_TIMEOUT to the SRM; GGUS:63177 solved on Oct.17 at 3:41.
    • NDGF DATADISK transfer failures related to GRIDFTP_ERROR. GGUS:63179 assigned Oct.17 at 07:30. No reply yet.
    • INFN-T1 MCDISK transfer failures: "Source file/user checksum mismatch". GGUS:63175.

  • CMS
    • Experiment activity: Preparing for the last fill before the technical stop; a high rate (800-1000 kHz) is expected on Stream A when data taking resumes.

  • CERN and Tier0
    • The T0 PromptReco injector crashed on Saturday night due to memory exhaustion; the job was relaunched.
    • GGUS:62696: occasional errors when opening files (via xrootd) on the Tier-0 processing pools; on hold.
    • The CASTOR intervention for the heavy-ion upgrade is planned for tomorrow starting at 9:00.

  • Tier1 issues
    • Reprocessing at ASGC still ongoing with glitches - talking to site.

  • Tier2 Issues

  • MC production
    • large production ongoing

  • ALICE

  • T0 site
    • Good behavior of the services in general during the last weekend
    • voalice10 (xrootd) showed some error messages last night. The issues were again associated with the firewall (reported last Friday). The problem was solved this morning. An iptables configuration will be prepared in the quattor profile of this machine (see the sketch below).
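
As a rough illustration of the kind of firewall rule involved (not the actual voalice10/quattor configuration; the ports and policy below are assumptions), an iptables sketch opening the default xrootd port might look like:

    # Hypothetical sketch only - not the actual quattor-managed configuration.
    # Allow loopback and established traffic, ssh, and the default xrootd port
    # (1094/tcp); reject everything else inbound.
    iptables -A INPUT -i lo -j ACCEPT
    iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
    iptables -A INPUT -p tcp --dport 22 -j ACCEPT
    iptables -A INPUT -p tcp --dport 1094 -j ACCEPT
    iptables -A INPUT -j REJECT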

  • T1 sites
    • CNAF: AliEn user proxy expiration during the weekend. Services restarted this morning
    • Good behavior of the T1 sites in general

  • T2 sites
    • Usual operations concerning GRIF and UNAM in particular. In contact with the sites.

  • LHCb
    • Experiment activities: Mostly user jobs in the morning; reconstruction of new data taken yesterday was not problematic.

  • T0
    • The data consistency suite running at CERN is trapping files with incorrectly computed checksums (due to a bug also hitting the RAL installation). The same suite also reports files damaged due to problems during upload; the cause of the latter still has to be investigated.
    • LHCb Service Status Board: the feeders have to be restarted

  • T1 site issues:
    • CNAF: A user reported jobs having problems accessing data. A couple of WNs were not properly configured, with GPFS not mounted properly.
    • CNAF: Many jobs stalled; an LSF configuration issue.
    • IN2P3: received the list of files on a faulty disk server. The data manager removed them at the catalogue level. (GGUS:63024)
    • RAL: Half of the jobs were stalling (and then eventually killed either by the DIRAC watchdog or by the LRMS for exceeding the wall clock time). Investigation concluded this was a DIRAC issue, not a site issue; a new DIRAC patch will be put into production.

Sites / Services round table:

  • BNL: Have defined a maintenance window tomorrow for rearranging power distribution to about 20% of worker nodes. They will be drained overnight then off for about 8 hours.

  • CNAF: For the LHCb LSF issue a bug in the LSF license server is suspected. They are in touch with LSF support.

  • PIC: On Saturday morning had an accidental cooling stoppage affecting about 60% of worker nodes where running batch jobs had to be killed. Full capacity was restored after 2-3 hours.

  • IN2P3: The LHCb disk server failure ticket is being followed up.

  • ASGC: There are 3 files CMS jobs cannot access. Looks like an inconsistency in CASTOR - being investigated.

  • CERN CC: Annual power tests are scheduled for 1 November. They should be transparent but all grid services have been put at-risk for that day.

  • CERN CASTOR: ALICE and CMS CASTOR upgrades tomorrow from 09:00. Restrictions on ATLAS users mounting tapes in stagepublic have been put in place. Following the rescheduling of the LHC technical stop, any possible heavy-ion high-rate data-taking testing with ALICE and CMS will have to be rescheduled.

AOB: The latest LHC schedule is to be found at https://espace.cern.ch/be-dep/BEDepartmentalDocuments/BE/2010-LHC-schedule_v1.9.pdf indicating a technical stop from Tuesday 19 October to Friday 22 October followed by about two weeks of proton physics leading to Heavy Ion setting up starting Friday 5 November.

Tuesday:

Attendance: local(Harry(chair), Edward, Maarten, Alessandro, Jan, Ignacio, Lola, MariaD);remote(Michael(BNL), Gonzalo(PIC), Roberto(LHCb), Jon(FNAL), Marie-Christine(CMS), Ronald(NL-T1), Tore(NDGF), Tiju(RAL), Kyle(OSG), Graeme(ATLAS), Dimitri(KIT)).

Experiments round table:

  • Ongoing issues
    • CNAF-BNL network problem (slow transfers) GGUS:61440 , GGUS:63134. Failures due to transfer timeouts for the large files (>2GB, often ~4GB).

  • ATLAS
    • Magnets off for maintenance and shifts reduced during LHC technical stop.

  • T0
    • Bulk reconstruction of weekend data is finishing, so data export to T1s proceeds as normal.
    • Access permissions on the CERN ATLAS Twiki have changed so only members of ATLAS can read - being investigated.

  • T1
    • IN2P3-CC report recovery from "locality is unavailable" problems. GGUS:63138, GGUS:63180.
    • Update from Hiro on CNAF-BNL network problem. GGUS:61440. Situation is improving though not quite solved.
    • INFN-T1 checksum problems on one file - no response from site. GGUS:63184. Suspicion is of file corruption during transfer from worker node to storage.

  • Middleware
    • There is a serious, but easy to fix, bug in the gLite 3.2 WN build that affects T2s using SuSE as their base release. This includes some important ATLAS T2 sites (MPP and LRZ) so ATLAS would like the issue escalated. GGUS:61106. (Note the original ticket dates from August).

  • CMS
    • Experiment activity: Technical stop.

  • Tier1 issues
    • Various rereco certifications ongoing.
    • Problem with the site BDII at PIC early this morning: the site had disappeared from the dashboard. Fixed (Savannah SR #117347: SAM tests visibly in error for 3h).

  • Tier2 Issues
    • Nothing to report

  • MC production
    • large production ongoing

  • AOB
    • HI Tests likely to be Thursday, with ALICE
    • New CRC: Stefano Belforte, Oct 20-25

  • ALICE

  • T0 site
    • Small operations on voalice13 to start up the proper PackMan service. Production is ramping up.
    • No issues with voalice10 today (reported yesterday). SE@CERN performing well.

  • T1 sites
    • All SE@T1 sites reporting fine in ML, no issues for the workload management system

  • T2 sites
    • UNAM and GRIF issues reported yesterday; both sites were in downtime yesterday.
    • Following up today on the operations at the Bologna T2, Legnaro and Clermont.

  • LHCb
    • Experiment activities: No data received last night. Reconstruction running to completion. Some delays related to the problem at IN2P3 (RAW data not available because of the disk server) and to the problem at RAL (stalling jobs).

  • T0
    • The CASTOR test suite run has trapped two data inconsistencies on CASTOR - probably bad uploads of MC files from worker nodes.
    • LHCb SSB: the feeders have to be restarted

  • T1 site issues:
    • IN2P3: Shared area issue preventing installation of the latest version of the LHCb application (GGUS:63234). This problem with the shared area at Lyon must be escalated.
    • IN2P3: LHCb want a clear estimate of when the disk server will be back in service in order to decide how to handle the few remaining data to reconstruct.
    • RAL: The problem of stalling jobs was due to a bug introduced in DIRAC that broke the estimation of the CPU time.

Sites / Services round table:

  • PIC: The site BDII was not publishing any data overnight for about 8 hours; fixed by reconfiguring. This does not seem to have caused major problems for the experiments. The experiment SAM SRM tests did not show any failures while the ops ones did. Alessandro stated that within the SAM framework, tests to a site whose BDII is not visible are not published. This should be followed up. (A sketch of a basic site-BDII check follows below.)
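
For reference, a minimal sketch of how one might check whether a site BDII is publishing anything at all (the hostname and site name are placeholders; port 2170 and the mds-vo-name base are the usual gLite site-BDII defaults):

    # Hypothetical hostname and site name - substitute those of the site being checked.
    ldapsearch -x -LLL -H ldap://site-bdii.example.es:2170 \
        -b "mds-vo-name=EXAMPLE-SITE,o=grid" "(objectClass=GlueSite)" GlueSiteName
    # An empty result (or a connection error) indicates the site BDII is not publishing.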

  • FNAL: Network connection between FNAL and KIT was down for 3 hours last night.

  • RAL: One disk server for LHCb is down.

  • KIT: Have rescheduled the downtime of a cream-ce and an lcg-ce from this Thursday to next Thursday. Due to a cluster split.

  • CERN CASTOR
    • Have upgraded alice and cms castor versions to 2.1.9 in preparation for the HI run.
    • A 24 hour test of recording ALICE and CMS simulated raw data to tape (no data export) at 2 GB/sec each is now scheduled from 14.00 on Thursday 21 October.

  • AOB: Maarten reported that ATLAS tests at RAL of the gLite 3.2 FTS have found a critical bug: on delegation, proxies of an unsupported version are created. Sites are advised to stay with the working 3.1 version while this is fixed. The problem lies in the web service component, not the agent.

AOB:

Wednesday

Attendance: local(Renato, Harry(chair), Graeme, Patricia, Maarten, Gavin, Edward, Ulrich, Jan, Ignacio, Vawid, MariaD, Alessandro, Roberto);remote(Michael(BNL), Gonzalo(PIC), Jon(FNAL), Rolf(IN2P3), Gang(ASGC), Tiju(RAL), Kyle(OSG), Stefano(CMS), Onno(NL-T1), Jens(NDGF), Luca(CNAF), Dimitri(KIT)).

Experiments round table:

  • ATLAS
    • No data taking during LHC technical stop.

  • T1
    • IN2P3-CC locality problem is not completely solved, GGUS:62782. We continue to report files we see fail, but could the site please check from their side so that we don't have to tediously report transfer by transfer.
    • INFN-T1 corrupted file, site has confirmed this is the only copy and the checksum is bad - it will be purged. GGUS:63184.
    • TAIWAN-LCG2 errors on transfers to BNL ("source file failed on the SRM with error [SRM_FAILURE]"). GGUS:63286.
    • RAL-LCG2 disk server down, reported by the site to be for memory change.
    • RAL-LCG2 SRM went down for ~1 hour. GGUS:63291. Now solved.

  • Middleware
    • FTS on gLite3.2 testing in the UK was stopped after progress in understanding the proxy delegation problem.

  • Central Services
    • DDM functional tests did not run for ~24 hours due to a bad service restart - now fixed.
    • Twiki access restrictions were put in place yesterday which, by default, limit ATLAS twiki access to ATLAS authors only. We are relaxing these restrictions on a page by page basis, but please let us know if something you think you should be able to see is inaccessible (email atlas-adc-expert@cern.ch).

  • CMS
    • Experiment activity: Technical stop.

  • CERN and Tier0
    • Opened GGUS:63246 for BDII problems.

  • Tier1 issues
    • Various rereco certifications ongoing.
    • Problem in site-BDII at PIC yesterday was fixed SAV:117347
    • CMS T1_FR_IN2P3 is experiencing low transfer rates on some disk pools (under control)

  • Tier2 Issues
    • Nothing to report

  • MC production
    • large production ongoing

  • AOB
    • HI tests Thursday 10:00 to Friday 10:00, with ALICE. The CASTOR team had this scheduled for 14:00 to 14:00 but CMS need to stop earlier. To be confirmed with ALICE when they can start.
    • CRC: Stefano Belforte, Oct 20-26

  • ALICE

  • GENERAL INFORMATION:
    • Pass 0 and Pass 1 reconstruction activities are ongoing together with two MC cycles. Good status of all T1 (and T0) SEs in MonALISA.

  • T0 site
    • GGUS:63282: All CREAM-CEs at CERN were failing this morning. (This was due to transient LSF master blockages.)

  • T1 sites
    • All T1 sites in production, no remarkable issues

  • T2 sites
    • Usual operations applied to several T2 sites with no remarkable issues

  • LHCb
    • Experiment activities: An impressive number of MC jobs (65K run in the last 24 hours) with a very small failure rate (~1%) (see plots in the report link above).

  • T0
    • isscvs.cern.ch is no longer accessible outside CERN (CT719539). Issue immediately handled and problem understood to be related to the CERN firewall setup for CVS servers.

  • T1 site issues:
    • GRIDKA: observed instabilities (SAM and real activities perfectly correlated in timing) on the SRM endpoint (GGUS:63253)
    • RAL: one disk server of lhcbMdst service class was not available yesterday (GGUS:63230)
    • IN2P3: any news from SUN (GGUS:63024)? LHCb have since had a report that SUN has left and expect to have the service back tomorrow morning (some 34000 files are concerned).

Sites / Services round table:

  • BNL: Michael reported a new RHEL5 root exploit in the GNU dynamic linker under certain setuid configurations (probably present on many hosts). The CVE number is CVE-2010-3847, and BNL have already deployed a workaround that can be applied during normal operations with no reboot. Maarten reported that this exploit is already under discussion by the EGI security teams and may lead to some temporary loss of capacity as sites decide how to respond.

  • PIC: had another overnight service incident when 4 of their 5 dcap doors went down at the same time around midnight causing many running jobs to fail. The doors have been restarted and the cause is under investigation.

  • NL-T1: They also have a workaround in place for the root exploit reported by BNL.

  • NDGF: Will be patching some pools in the next few days for the root exploit. They are also aware of a new, easy-to-use kernel root exploit (CVE-2010-3904) for which they are preparing kernel patches (to blacklist one kernel module; see the sketch below).
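
For reference, CVE-2010-3904 concerns the RDS (Reliable Datagram Sockets) protocol implementation, and the mitigation commonly applied at the time was to stop the rds kernel module from being loaded. A minimal sketch (the file name is arbitrary and the exact modprobe configuration location may differ per distribution):

    # /etc/modprobe.d/disable-rds.conf  (hypothetical file name)
    # Prevent the vulnerable RDS module from being loaded (CVE-2010-3904).
    install rds /bin/true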

  • CNAF: Have put an at-risk downtime for a storage upgrade - should be transparent. The open ggus ticket from ATLAS yesterday (for a corrupted file) did not get into the Italian ROC ticketing system so this needs to be checked.

  • CERN Batch: A fix for the LSF problem reported above (transient master blockages affecting the CREAM-CEs) will be deployed today, together with a patch for the CMS Maradona problem. Work is ongoing on both of the root exploits reported above.

  • CERN CASTOR: Preparing for the heavy-ion running test tomorrow. Both the ALICE and CMS raw data pools will have 40 tape drives associated to guarantee their rates of 2 GB/sec to tape.

  • CERN SRM: SLS monitoring of the CERN SRM has shown various unavailabilities that are thought to be a monitoring-only issue. It affected c2public, which is used by all the SRM tests.

  • CERN databases: There was an intervention today on the ATLAS online cluster with a switch replacement. Maria reported that she has been getting Oracle errors from the CERN CA for many hours, but nothing is shown on SLS. The ticket she submitted will be followed up.

AOB:

Thursday

Attendance: local(Harry(chair), Uli, Zbygzek, Edward, Graeme, Maarten, Jan, Patricia, Alessandro, MariaD);remote(Michael(BNL), Jon(FNAL), Roberto(LHCb), Ronal(NL-T1), Tore(NDGF), Rolf(IN2P3), Tiju(RAL), Elizabeth(OSG), Gang(ASGC), Barbara(INFN), Foued(KIT), Stefano(CMS)).

Experiments round table:

  • ATLAS
    • Some cosmics runs, but no export to T1s.

  • T1
    • IN2P3-CC locality problem still not completely solved, GGUS:62782.
    • TAIWAN-LCG2 errors on transfers to BNL solved, disk server overload. GGUS:63286.
    • TAIWAN-LCG2: jobs failing due to a software area overload last night. The site reduced the number of job slots for ATLAS. Recovered with reduced capacity. BUG:74033.
    • RAL-LCG2 SRM problems yesterday - report from site that this was being caused by an ATLAS user pulling data from RAL into an NDGF cluster. User was suitably chastised and investigation ongoing.

  • Middleware
    • FTS on gLite3.2 - some UK sites were not accepting ATLAS VOMS credentials issued by the BNL VOMS server. Sites have been alerted.
      • Reminder to all sites to ensure all 3 ATLAS VOMS servers are properly configured (see the sketch after this section). Alessandro reported that ATLAS have temporarily added a SAM test that uses the BNL VOMS server in order to identify any remaining misconfigured sites.
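
As a rough illustration of what "properly configured" means (the host names, port and DNs below are placeholders, not the real BNL server parameters): each ATLAS VOMS server needs a vomses entry on UIs/VOBOXes and an LSC file under the vomsdir so that services accept attributes signed by that server.

    # Hypothetical vomses entry - format: "alias" "host" "port" "host certificate DN" "VO"
    "atlas" "voms.example.bnl.gov" "15003" "/DC=org/DC=example/CN=voms.example.bnl.gov" "atlas"

    # Hypothetical /etc/grid-security/vomsdir/atlas/voms.example.bnl.gov.lsc
    # (VOMS server host certificate DN on the first line, issuing CA DN on the second):
    /DC=org/DC=example/CN=voms.example.bnl.gov
    /DC=org/DC=example/CN=Example CA

    # Quick client-side check that ATLAS VOMS attributes can be obtained and read back:
    voms-proxy-init -voms atlas
    voms-proxy-info -all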

  • CMS
    • Experiment activity: Technical stop.

  • CERN and Tier0
    • HI test in progress. No info yet. For CMS this started around 10:00.

  • Tier1 issues
    • Reprocessing ongoing, no issues

  • Tier2 Issues
    • large production ongoing, no issues
    • User analysis reached an average of 100K jobs/day last week.

  • MC production
    • large production ongoing

  • AOB
    • CRC: Stefano Belforte, Oct 20-26

  • ALICE

  • GENERAL INFORMATION:
    • Same reconstruction and MC activities ongoing with no major issues to report.

  • T0 site
    • CREAM-CEs all back in production after the issue reported yesterday. For today a transparent patch (to LSF) was expected, no issues found. Uli confirmed the patch (a new master batch daemon in fact) had been deployed successfully.
    • The ALICE HI testing was scheduled to start at 14:00.

  • T1 sites
    • All T1 sites in production
    • RAL has just reported that jobs of a particular ALICE user were filling the memory on worker nodes. The user will be contacted.

  • T2 sites
    • no remarkable issues to report

  • LHCb
    • Experiment activities: No MC, no reconstruction; validation of the new reprocessing is ongoing. No major issues.

  • T0
    • none

  • T1 site issues:
    • IN2p3: just got a report that the failed diskserver (GGUS:63024) is now back online. LHCb will re-enable the files in their catalogue and resume pending jobs.

Sites / Services round table:

  • NDGF: Had some ATLAS and ALICE data unavailable today due to disk problems. Fixed about an hour ago.

  • IN2P3: The ATLAS locality problem is being addressed and the LHCb disk server ticket was closed about 11:00.

  • KIT: Had a problem with their nfs server used for software distribution. Has been fixed but left some worker nodes in a strange state. Being worked on.

  • CERN databases: Have performed a transparent application of the latest Oracle security patches on INT11R database (integration database for WLCG projects).

  • CERN CASTOR: The Heavy Ion running tests of ALICE and CMS have started. A coordination meeting with IT and both experiments has just been held. CMS started at 10 but had to stop and relaunch. Seeing little activity yet. CMS have given us their flow description - starts with the tostreamer pool and to t0export from where data will go to tape.

AOB: (MariaDZ) A large number of new GGUS Support Units come into operation with the major GGUS Release 8.0 on 2010/10/27. They are listed on 2 slides at https://gus.fzk.de/pages/news_detail.php?ID=420, linked from the GGUS home page under Latest News.

Friday

Attendance: local(Edward, Zbygzek, Harry(chair), Graeme, Uli, Maarten, Lola, Jan, Alessandro);remote(Conference system failure: email reports (mostly conference failure or ntr) from FNAL, BNL, LHCb, NL-T1, CMS, RAL, KIT, OSG).

Experiments round table:

  • ATLAS
    • ATLAS combined runs will start this afternoon, but no physics fills expected from LHC until tomorrow daytime.

  • T2
    • SAM tests switched to using BNL-signed VOMS extensions. 11 T2s ticketed.

  • Central Services
    • GOCDB downtimes are not propagating correctly into the ATLAS Grid Information System, so the DDM automatic blacklisting system is not being fed. BUG:74295.

  • CMS
    • Experiment activity: Technical stop. Preparation for data taking is ongoing; expect to be ready.

  • CERN and Tier0
    • HI test completed. The CASTOR test was successful. We had some hiccups due to some peculiarities of the data playback, not expected to happen with real data. The overall latency to tape was longer than expected and will be looked at next week by the CMS Tier0 people.
    • A problem with Oracle last evening affected PhEDEx and Frontier (and indirectly CRAB job submission), as reported by PhEDEx expert/admin/developer Nicolo' from Taipei, but nothing appeared on the IT SSB/SLS. It was made trickier to figure out at the time because the critical-service GridMap http://servicemap.cern.ch/ccrc08/servicemap1.html lost its interactivity; BUG:74280 opened.

  • Tier1 issues
    • Reprocessing ongoing, no issues

  • Tier2 Issues
    • large production ongoing, no issues
    • User analysis reached an average of 100K jobs/day last week.

  • MC production
    • large production ongoing

  • ALICE

  • GENERAL INFORMATION:
    • HI exercise together with CMS went quite well from the ALICE side. More information and details during the next ops. meeting on Monday

  • T0 site
    • Scheduled operations on the ALICE voalicefsXX and voalice13 nodes on Monday from 9:00 to 11:00
    • Minor operations foreseen also for voalice11
    • No issues observed at the T0 services

  • T1 sites
    • No remarkable issues to report

  • T2 sites
    • Usual operations at these sites

  • LHCb
    • Experiment activities: Some MC activity; charm stripping launched yesterday; validation of the new reprocessing is ongoing. No major issues.

  • T0
    • A large number of jobs (but not all) seemed to be failing with "SetupProject.sh Execution Failed" (an AFS-like issue) in several productions. The problem seems to have resolved itself; it was transitory, between 8:30 and 9:30 UTC.

  • T1 site issues:
    • GRIDKA (GGUS:63353): The LHCb_DST space token is running out of space. Not all of the 2010 pledged resources have been allocated.
    • IN2P3 (GGUS:63024): Disk server back in production. Setting the files visible in the catalogue in order to run the remaining reconstruction and merging.
    • IN2P3 (GGUS:63234, verified yesterday): Again a software area issue, with more jobs timing out while setting up the environment. This definitely has to be addressed.

Sites / Services round table:

  • BNL: ntr

  • FNAL:
    • FNAL-TIFR network link down for a few hours yesterday - now that I've asked to be notified of network outages, there are lots of them!
    • FNAL FTS & PhEDEx transfers were affected by the Oracle outage at CERN yesterday.

  • NL-T1: ntr

  • OSG: ntr

  • KIT:
    • GridKa services were still suffering from the NFS issues from Thursday. However, we managed to stabilise the systems and we have also instructed our on-call engineers to solve this kind of problem if it occurs during the weekend.
    • About GGUS:63353 - LHCb_DST space token running out of space: LHCb currently has more disk space installed and usable than was pledged for 2010. Therefore we cannot increase the size of the space token any further. If LHCb needs more disk space, this has to be requested in the TAB.

  • CERN databases: The CMS offline database was rebooted by clusterware. Two services required manual intervention - one was down for 15 minutes and another for 40 minutes.

  • CERN CASTOR: Jan Iven gave a preliminary report on the Heavy Ion tests. ALICE ran pretty smoothly, including making disk-to-disk copies (peaking at 7 GB/sec), and are happy with their rates to tape. CMS were surprised that it took 6 hours for data from the pit to reach tape; this will be analysed. They were writing large files (around 20 GB) and so got a good rate of 100 MB/sec to individual tape drives.

AOB:

-- JamieShiers - 16-Oct-2010
