Week of 100823

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
  • Change assessments: CASTOR Change Assessments

General Information

  • General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Jamie, Peter, Elena, Patricia, Massimo, Harry, Ignacio, Ewan, Jean-Philippe, MariaDZ, Jacek, Edward, Edoardo, Alessandro, Andrea, Miguel);remote(Michael, Catalin, Angela, Kyle, Gang, Daniele, Pepe, Gareth, Ron, Anonymous (CNAF?) ).

Experiments round table:

  • ATLAS reports -
    • Ongoing issues:
      • FZK-NDGF transfer issue GGUS:60437 (updated 2010-08-04)
      • INFN-BNL network problem
    • Aug 21 - Aug 23 (Sat - Mon)
      • Physics run with 48 bunches per beam started on Friday, and a new one started last night.
      • Tier-0 - Central services
        • Problem with the online DB (PVSS) on Sunday morning. One of the online DB machines was rebooted by the Physics DB support team; a further problem was fixed by the ATLAS online DB expert. [ Jacek - DB problem on Sunday morning, starting around 06:00 with replication stuck due to a bug in Oracle. The work-around was to reboot the node where the replication processes were running. This worked OK, but a bit later another node crashed for unknown reasons. Today one node of the same DB was rebooted again, this time due to a sysadmin mistake. ]
      • Tier-1 issues:
        • SARA-MATRIX: GGUS:61265 still ongoing (Oracle DB problems for LFC and FTS).
        • Taiwan-LCG2:
          • SRM problem on Saturday, GGUS:61314. The problem was solved in 30 minutes (thanks!): the SRM was overloaded and the site added one more disk server to production.
          • Monday 02:00 - 04:30 UTC: unscheduled power outage ( https://goc.gridops.org/downtime/list?id=88305449 ). All main services were affected; they are now back online.
        • INFN-T1: regarding the ALARM GGUS:61305 (LFC daemon crashed): from the logs it was understood that the daemon was killed by a query carrying a proxy with too many FQANs - the same problem already observed at PIC. As said on Friday, it is a bug in the LFC. ATLAS normally uses a workaround, but this time it had been forgotten. Perhaps the priority of the bug should be raised to avoid this problem happening again (a minimal FQAN-counting sketch follows this report).
        • BNL: an SRM problem was seen on Monday morning, GGUS:61359. The problem was fixed in ~1h, thanks! [ Kyle - this has happened in the past: it was opened as a critical ticket, which woke someone up. Ale - this ticket was submitted by the ATLAS shifter at P1. The problem was that files from CERN were not reaching the BNL datadisk. The files were functional tests; however, real data follows the same path, which is why the ticket was a TEAM and not an ALARM. From ATLAS - thanks very much for the immediate answer, but we feel the criticality was wrong. Michael - agree with Alessandro that this was critical for T1 operations. MariaDZ - there was a long discussion in this meeting concerning criticality; the conclusion was that a ticket should be an ALARM when someone needs to be woken up, and other tickets should not wake people up. ]
      • Tier-2 issues:
        • No major problem
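The LFC crash above was triggered by a proxy carrying an unusually long FQAN list. As an illustration only (this is not the actual ATLAS workaround, which is not described in these minutes), the Python sketch below counts the FQANs of the current VOMS proxy via voms-proxy-info --fqan and flags a long list before any catalogue query is made. The threshold value and the assumption that a VOMS client is installed on the node are both hypothetical.

```python
# Minimal sketch: flag proxies with long FQAN lists before querying the LFC.
# Assumes a VOMS client (voms-proxy-info) is available on the node.
import subprocess

MAX_FQANS = 10  # hypothetical threshold; the limit tolerated by the LFC is not stated in the minutes

def proxy_fqans():
    """Return the FQANs reported by 'voms-proxy-info --fqan', one per list entry."""
    out = subprocess.Popen(["voms-proxy-info", "--fqan"],
                           stdout=subprocess.PIPE).communicate()[0]
    return [line.strip() for line in out.decode().splitlines() if line.strip()]

if __name__ == "__main__":
    fqans = proxy_fqans()
    print("proxy carries %d FQAN(s)" % len(fqans))
    if len(fqans) > MAX_FQANS:
        print("WARNING: long FQAN list - known to crash unpatched LFC daemons (GGUS:61305)")
```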

  • CMS reports -
    • Central infrastructure
      • Nothing to report
    • Experiment activity
      • Smooth Data Taking
    • Tier1 Issues
      • T1_TW_ASGC : more SAM SRM instabilities during the week-end, causing transfer inefficiencies.
        • Apparently this was because the local space filled up on the SRM server, so the SRM server behaved abnormally. Since Aug 23, 06:00 UTC, things look stable.
        • Note: since the problem was intermittent, it was continuously watched by CMS shifters but no GGUS ticket was opened (only Savannah ticket 116338, which is now closed).
      • T1_TW_ASGC : SAM CE errors and 0% JobRobot efficiency since Aug 23, 04:30 UTC, following their short unscheduled downtime
        • Opened Savannah ticket 116356, bridged to GGUS:61367
        • Latest news : problem was due to a power outage
    • Tier2 Issues
      • "hungry CMS user" overloading WNs at JINR-LCG2 : jobs were using up to 50GiB on WNs scratch area, while according to CMS VO card the scratch area per WN should be 10GB.
      • Although the user first acknowledged his error, he then resubmitted the same kind of jobs, which caused the site admin to ban him from the site! We need to contact the user to make sure his workflows are under control (an illustrative scratch-usage guard follows this report).
      • See GGUS:61241 and GGUS:61293
    • MC production
      • On-going.
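Related to the JINR-LCG2 scratch-space incident above, here is an illustrative Python sketch (not a CMS or CRAB tool) of the kind of guard a job wrapper could apply to stay within the 10 GB scratch limit quoted from the CMS VO card instead of growing to ~50 GiB. Reading the scratch location from TMPDIR and aborting the job on overflow are assumptions made for the example.

```python
# Illustrative guard: abort a job whose scratch directory exceeds the VO-card limit.
import os, sys

SCRATCH_LIMIT_BYTES = 10 * 1000 ** 3  # 10 GB, as quoted from the CMS VO card

def dir_size(path):
    """Total size in bytes of all regular files below path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # file disappeared while walking
    return total

def check_scratch(path):
    used = dir_size(path)
    if used > SCRATCH_LIMIT_BYTES:
        sys.stderr.write("scratch usage %.1f GB exceeds the VO-card limit, aborting\n" % (used / 1e9))
        sys.exit(1)

if __name__ == "__main__":
    # Hypothetical: take the job scratch area from TMPDIR, falling back to the cwd.
    check_scratch(os.environ.get("TMPDIR", "."))
```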

  • ALICE reports - GENERAL INFORMATION: High activity during the weekend, with a stable level of more than 20K concurrent jobs maintained throughout. In addition, a new MC cycle is scheduled.
    • T0 site
      • Stable production during the weekend. This morning ce203 was failing at submission time; ticket GGUS:61366 has been submitted to track the problem (the other two CREAM nodes are working, so production is not affected).
    • T1 sites
      • IN2P3-CC. GGUS:61354. During the weekend ALICE reported the bad performance of the local CREAM-CE used by the experiment (cccreamceli02.in2p3.fr). The site is providing an additional CREAM-CE, cccreamceli03.in2p3.fr (never used by ALICE, but with a significant number of resources behind it), which is not performing well for the moment. The ALICE contact person at the site has been asked to check the system.
    • T2 sites
      • Operations needed during the weekend in Poznan, Cyfronet and UCT (Cape Town). The local CEs were failing in Poznan and Cyfronet and some proxy renewal problems were found in UCT

  • LHCb reports -
    • Experiment activities:
      • Reprocessing, MonteCarlo and analysis jobs.
      • There is a need to adjust the trigger rate to cope with the very high throughput from ONLINE. If the LHC continues to provide such good fills and such high luminosity we might run into trouble with backlogs forming. While the 60 MB/s rate in the computing model is not rigid, at most 100 MB/s is acceptable. Internal discussion to cut the rate by a factor of 2 (without affecting physics data).
      • Observed a general reduction in the number of running jobs at T1's: last week 8K running concurrently, now fewer than 3K. CERN (absorbing the load from SARA) shows the worst situation: 3K waiting jobs vs 500-700 effectively running. We do not see anything particularly wrong with the pilot submission itself that might slow down or affect the number of running jobs (apart from a slightly higher failure rate due to recent problems at Lyon). Has a fair-share limit been hit?
    • T0 site issues:
      • Last Saturday several RAW files were discovered that could not be staged (GGUS:61346). It looks like it was a glitch on a disk server.
      • xrootd problem (16/08), GGUS:61184. Now using the castor:// protocol on all space tokens except mdst (where we keep using xroot://); the issue seems to have gone (because the service is less heavily used), but the question of why we saw these instabilities with the xroot manager on castorlhcb is still open. [ Miguel - setting up an xroot manager and will contact LHCb so that we can debug together in more detail. ]
      • Low number of running jobs due to a low number of pilots - a CREAM CE seems to be acting as a "blackhole" (see details in the LHCb report).
    • T1 site issues:
      • RAL: received notification that one of the disk servers for lhcbMdst is unavailable (gdss468). The ~20K files there have already been marked as problematic and won't be used until the disk server recovers.
      • IN2P3: All jobs to their CREAM CE were aborting. The service was restarted, but the reason is not clear. (GGUS:61358)
      • SARA Oracle DB: NIKHEF and SARA banned. Reconstruction jobs diverted to CERN (which is starting to become really overloaded too). Transfers from other T1's are also not going through (FTS unavailable). Only RAW data from CERN are arriving, but they cannot be processed because the ConditionDB is also unavailable. Any news?

Sites / Services round table:

  • BNL - nothing to add
  • FNAL - had some failing transfers on Friday night - stuck as the stager did not see the files as available - fixed in 5-6 hours
  • KIT - ATLAS: over the weekend we saw a few production jobs consuming > 40 GB of local disk. Is this normal ATLAS activity? The local contact said these might be skimming jobs, which can use more disk space than normally required. FZK - 50 GB would be the limit for such jobs; where does this limit come from? It affects other jobs and other VOs, so we are not happy. Elena - will review and get back to you. Angela - we killed a few of these jobs which were affecting others. [ Ale - please can you send some example jobs to the mailing list. ]
  • ASGC - as mentioned, we had a power outage this morning at about 01:40 at Academia Sinica. Recovered in ~2h. Some tests have continued to fail since the recovery - perhaps some interaction with the previously failing tests?
  • PIC - ntr
  • RAL - disk server problem for LHCb. A few servers from one batch had failed; when tested they showed no problems and hence went back into service. Doing the same with this disk server.
  • NL-T1 - Discovered during the restore of the Oracle DB that we have corruption of logical blocks which cannot be recovered. Trying to get back as much as possible, but it will not be without data loss.
  • OSG - having an issue with transferring records to WLCG SAM. Tried a resend, which did not work, so reverting to an older version of the scripts. Dialogue ongoing.

  • CERN DB - two incidents: yesterday afternoon the LHCb online DB was down for 1.5 hours due to a power cut. This morning, since around 08:15, replication to the ATLAS T1 sites has not been working. The problem is in the sending DB - hard to estimate when this will be fixed.

  • CERN Network - tomorrow morning we will configure the CERN-RAL backup link. The operation will be transparent unless we get permission to test the backup. Gareth - declared an AT RISK for this; keen to test this failover. It will be tested between 09:00 and 10:00 Geneva time.

AOB:

  • Alarm raised by ATLAS, GGUS:61305, which appears to be fixed but is still open. It was raised against CNAF for the LFC problem. Elena - we will close this ticket.

Tuesday:

Attendance: local(Jamie, Peter, Luca, Peter Love, Roberto, Ewan, Harry, Alessandro, Edward, Jean-Philippe, MariaDZ, Lola, Patricia);remote(Michael, Xavier, Gang, Ronald, Catalin, Stefano, Kyle, Pepe, Tiju).

Experiments round table:

  • ATLAS reports -
    • Ongoing issues
      • FZK-NDGF transfer issue GGUS:60437 (updated 2010-08-23)
      • INFN-BNL network problem GGUS:61440 (updated 2010-08-24)
      • SARA oracle outage GGUS:61265 (NL cloud offline)
      • RAL-NDGF functional test errors GGUS:61306 (updated 2010-08-23)
    • Aug 24 (Tue)
      • BNL dCache nameserver issues; no ticket, but the impact is being monitored (elog 16200).
      • Problem with Streams replication to the Tier-1s; Oracle involved. [ Luca - replication is blocked at the downstream offline capture, which will not start up - a Level 1 SR is open. A couple of hours ago the analyst seemed to have found the problem; a cleanup method is being prepared, which we hope to apply later this afternoon. ]
      • See ongoing issues above

  • CMS reports -
    • Central infrastructure
      • Nothing to report
    • Experiment activity
      • LHC planning to do last 3 fills of the week with higher intensities (up to 1.2E+31), hence we are expecting an increase in input rates by 30%. CMS should be able to handle that but we are watching closely.
    • Tier1 Issues
      • T1_TW_ASGC : more SAM SRM instabilities, causing transfer inefficiencies and SAM errors. In fact, SRM stability at ASGC over the past week has been rather poor; see the attached plot. As a consequence, Savannah tickets 116338 and 116356 were re-opened.
        • ASGC_SAM_SRM_since_1_week.png
        • Is there a general issue with SRM at ASGC? [ Gang - working on this, cannot comment much at this moment - the problem is still not solved. Peter - from the tickets it seems that one of the problems kept re-occurring; is there some more general issue? Ale - ATLAS has also reported such problems: 5 team tickets opened and closed. The site is very pro-active in fixing problems, but then another problem occurs. ]
    • Tier2 Issues
      • The (young) user that was banned from JINR-LCG2 has acknowledged his errors, but was not able to kill his jobs via the CRAB analysis tool. He has raised the issue on the internal CMS CRAB support forum, which should lead to a proper closure of the case covered in GGUS:61293.
    • MC production
      • Very large (1000M - 1 billion) MC production about to be launched at T1s and T2s
    • AOB
      • CNAF BDII issues (see Savannah ticket 116378) : "egee-bdii.cnaf.infn.it" stuck and not containing any CE tagged with a particular CMS application release (CMSSW_3_8_1_patch3)
        • Maybe worth updating to the new BDII version, as CERN did last week? [ Luca - updated the BDII version 2 hours ago. By the way, have these problems been advertised? We discovered "by chance" that there was a new version. Upgraded, and it should now work. ]
      • Next CMS Computing Run Coordinator starting tomorrow : Marie-Christine Sawley

  • ALICE reports -
    • T0 site
      • Ticket GGUS:61366 submitted yesterday. Problem solved (a CREAM CE at CERN was failing).
      • Ticket GGUS:61516 submitted just before the meeting: the DB of ce203 is reporting more than 25K jobs running.
    • T1 sites
      • Ticket GGUS:61354 submitted yesterday. Problem solved. In addition, the problem found with the 2nd CREAM-CE was also solved and both systems have been in production since yesterday with no issues to report. The remaining ALICE submission to the LCG-CE in Lyon was fully deprecated yesterday afternoon.
      • CNAF: today's raw data transfers to the T1 are failing. After the FTD problem reported yesterday, ALICE is retrying the transfer of old runs to CNAF; the FTD log files report that the corresponding files are not staged. Experts are trying to optimize the transfer staging reported by AliEn: the goal is that such transfers will be retried at least once, after one hour, before being declared failed.
    • T2 sites
      • Usual operations applied to different T2 sites with no remarkable issues to mention

  • LHCb reports -
    • The overall slowness in dispatching jobs reported yesterday was (most likely) due to concurrent factors related to various gLite WMSes at different T1's (now removed from the mask), which led to a wrong view of the real status of the jobs (reported as scheduled when they were actually completed) and thus prevented DIRAC from submitting more. These jobs will be analysed and hints will be reported to the service providers for further investigation.
    • T0 site issues:
      • All pilot jobs aborting against ce203.cern.ch (GGUS:61423). This was the reason for the observed low submission rate. The problem seems to be due to a known bug in CREAM. [ Ewan - appears to be a memory leak. ]
    • T1 site issues:
      • IN2P3: Shared area: problems again preventing software installation (GGUS:59880 and GGUS:61045).
      • GRIDKA: issue with one CREAM CE (high number of pilot jobs aborting in the last week). GGUS ticket not issued yet. [ Xavier - please provide the IDs of the aborting pilot jobs in the GGUS ticket when you open it. ]
      • CNAF: Issue with direct submission against CREAMCE reported yesterday seems to be due to a certificate. Under investigation.
      • CNAF: Due to VMs being switched off incorrectly, many transfers from other T1's were failing. [ Luca - it was a problem on the SRM as reported; the GPFS load issue has been solved. On the CE problem - we did not find any certificate problem, the certificate is still valid; we cannot understand what the problem is. ]

Sites / Services round table:

  • BNL - as Peter mentioned, there were problems with transfers, and jobs aborted due to overload of the namespace (NS) component of dCache. Investigations found that a method heavy on the NS was being used; this can be changed since we have a path-to-internal-ID translation method. In the process of implementing that, so the NS is not used at this level. This will go into a big release and we expect good performance. Problem solved.
  • KIT - in the morning one of the tape libraries was broken; it has been fixed provisionally. It will be shut down tomorrow so that external technicians can do an extended check from 10:00 to 15:00. Reading from this library will not be possible; writing will go to one of the two other libraries.
  • ASGC - scheduled downtime this Friday for network config postponed to Tue Aug 31.
  • NL-T1 - the Oracle RAC DB is still down; attempts to recover the corrupt logs from backups failed. SARA and Oracle are investigating whether the backups are corrupt. JPB - which DB? It affects the whole RAC DB.
  • FNAL - ntr. Q: what was the problem with the BDII and how was it solved? We have some tickets open but did not get an answer.
  • CNAF - working on BNL transfer problems. Last week Ricardo made many tests between CNAF and GARR and GARR and BNL. Still working to better understand what is happening.
  • PIC - ntr
  • RAL - this morning at risk for testing RAL - CERN backup OPN link. Tests ok. Diskserver out for LHCb - now back in production.
  • OSG - SAM uploading issue from yesterday now resolved.

  • CERN DB - nothing to add

AOB: (MariaDZ)

  • Ale - from ATLAS, we would like to answer the question from FZK yesterday about production jobs using > 40 GB of disk space on WNs. Those jobs are standard ATLAS production jobs - group production in particular. There has been some discussion about whether they could cause problems. Meanwhile, if FZK does not want such jobs we will change the scheduling so that they run elsewhere.

Wednesday

Attendance: local(Peter L., Jamie, Marie-Christine, Patricia, Edoardo, Andrea, Harry, Ricardo, Alessandro, Renato, MariaDZ, Ignacio, Luca);remote(Onno, Rolf, Angela, Gang, Catalin, Michael, Kyle, Tiju, Luca, Pepe).

Experiments round table:

  • ATLAS reports -
    • Tier-1
      • FZK-NDGF transfer issue GGUS:60437 (OK but not understood) [ Angela - concerning this ticket we think we have found the problem - we see some network errors on one of the file servers where the slow transfers come from - investigating whether it is a hardware failure. ]
      • RAL-NDGF functional test GGUS:61306 (OK but not understood)
    • BNL (Michael) - did some more investigation (on the CNAF-BNL transfer issue) using dedicated FTS channels. We now believe that the problem is in the path beyond GARR, probably on the borderline between GARR and Geant2. In the process of getting Geant2 engineering involved. Luca (CNAF) - still waiting for feedback from GARR; hopefully it will be solved soon.

  • CMS reports -
    • Central infrastructure
      • Nothing to report (did upgrade of T0 monitoring service this morning - went well and finished at 13:00)
    • Experiment activity
      • Since yesterday long physics run with record lumi at 350 nb-1
      • Actively monitoring input rates
    • Tier1 Issues
      • T1_TW_ASGC : tickets closed, SRM back to normal.
    • Tier2 Issues
    • MC production
      • Very large (1000M) MC production about to be launched at T1s and T2s
    • AOB
      • New CRC on duty: Marie-Christine Sawley

  • ALICE reports - General information: Currently, there are 3 MC cycles running in the system (high number of running jobs in the system)
    • T0 site
      • Ticket reported yesterday concerning the bad performance of the CREAM DB of ce203 (GGUS:61516). The ticket is in progress; CREAM experts have asked for a list of possible jobids reported in a wrong state.
      • An intervention on the CASTOR name server to upgrade it to 2.1.9-8 is foreseen for Wednesday 1st September (starting at 09:00, 4 hours "at risk"). Agreed by the ALICE online team.

    • T1 sites
      • Raw data transfers ongoing over the last 24h to IN2P3, FZK and CNAF. In addition, the results of the SE tests in ML (over the last 24h) are as expected. No issues to report in terms of job performance.
    • T2 sites
      • Issues found today in Birmingham (already solved).

  • LHCb reports -
    • Experiment activities:
      • Mainly reprocessing/reconstruction of real data, with heavy analysis activity competing with it at the T1's. Low-level MC.
    • T0 site issues:
      • LFC: observed a degradation of performance when querying the LFC (GGUS:61551).
      • xroot is also not working for the lhcbmdst space, where it had been decided to use this protocol (GGUS:61184). Strong commitment from the IT-DSS people to debug it.
    • T1 site issues:
      • RAL: SRM reporting incorrect checksum information (null) GGUS:61532
      • pic: one of the problematic gLite WMS reported yesterday was at pic. It was simply overloaded. GGUS:61517
      • IN2P3: Shared area: keeping this entry for traceability. Work in progress but problem still there. (GGUS:59880 and GGUS:61045)
      • GRIDKA: the shifter evaluated that the number of failures in the last period is not so worrying. So no ticket has been issued.
      • GRIDKA: Some users observed some instability with dCache (GGUS:61544). Prompt reaction from sysadmins that did not observe any particular issue their side.
      • CNAF: Issue with direct submission against CREAMCE reported yesterday seems to be due to a certificate. Under investigation.
      • SARA: Eagerly waiting for SARA to recover so that the remaining 900 merging jobs, which have to be delivered to users by the end of this week, can be completed. Given this top-priority activity and the situation with the SARA RAC, it has been decided to work around the DB dependency and run these merging jobs at SARA (where the data to be merged are) by pointing to another condDB.

Sites / Services round table:

  • NL-T1 - Oracle DB problem: still in contact with Oracle support and still checking whether the backups are corrupt. Tomorrow morning we will decide how to proceed; if not successful by then we will start reconstructing a new database. Tomorrow morning we will also have some network maintenance, affecting the storage and compute clusters plus the backup link to the OPN. Alessandro - discussed the LFC issue: for ATLAS the LFC is the critical service, the information is not available elsewhere (it holds the information for the whole cloud).
  • IN2P3 - ntr
  • KIT - broken tape lib now fixed
  • ASGC - SRM problem at ASGC finally identified. Misconfig of CMS production role.
  • FNAL - ntr
  • BNL - during the technical stop next week, on Wednesday, we plan a network intervention. It is designed to be transparent but we will declare an AT RISK.
  • RAL - ntr
  • CNAF - the issue with slow transfers to BNL was already discussed; the problem with direct submission observed by LHCb is still open. The certificate reported as the probable origin of the problem is OK, and we cannot reproduce the problem. Testing to try to find the cause.
  • PIC - today we would like to comment that ATLAS jobs are generating a lot of files under the /tmp area which are no longer deleted after the job finishes. GGUS:57236 (open since July) - we are deleting these files from /tmp - is ATLAS aware? Will they fix it? LHCb ticket on the WMS - closed yesterday. Maybe use an alternate WMS to avoid overloading this one; the load has been reduced since the restart of the service.
  • OSG - also wondering about the BDII issue FNAL brought up yesterday and how it was solved; we haven't seen an update. Ricardo - at CERN we deployed a new version of the BDII RPM, which was provided by Laurence. Don't know if it has been released.

  • CERN DB - the problem with replication to the T1s was fixed yesterday at 19:00; the latency went to zero at midnight (maximum latency 35h). A post-mortem is being prepared. Investigating some issues on DB2 which caused high load; in contact with the DQ2 developers.

  • CERN Storage - sent announcement mail for CASTOR NS intervention on 1 Sep.

  • Network - RAL reported yesterday that the backup network tests went well. The primary link to RAL failed and the backup was used!

AOB:

  • There will be a Tier1 Service Coordination meeting tomorrow - agenda.
  • open/pending SIRs to be discussed:
    • SIR received for CERN vault cooling issues
    • SIR being prepared for LHCb online DB problem after power cut
    • SIR requested for LFC/FTS DB problem at NL-T1
  • also key open GGUS tickets

  • BDII version installed at CERN last week is 5.0.9-1.

Thursday

Attendance: local(Peter L, Jamie, Harry, Patricia, Roberto, Eddie, Ewan, Andrea, Massimo, Ignacio, Marie-Christine, Dirk, Jacek, Nicolo, MariaDZ);remote(Xavier, Rolf, Kyle, Pepe, Tiju, Michael, Catalin, Gang, Ron, Stefano).

Experiments round table:

  • ATLAS reports -
    • Ongoing issues
      • INFN-BNL network problem GGUS:61440
      • SARA oracle outage GGUS:61265 (NL cloud offline)
      • FZK-NDGF transfer issue GGUS:60437 (OK pending new FT transfers)
      • RAL-NDGF functional test GGUS:61306 (IP added to whois object, pending new FT transfers)
    • Aug 26 (Thu)
      • Tier-1
        • RAL disk server gdss81, part of atlasStripInput (DATADISK), offline pending fsck. [ Tiju - the partition went read-only - checking what the problem is. No update for the moment. ]

  • CMS reports -
    • Central infrastructure
    • Experiment activity
      • Smooth data taking
    • Tier1 Issues
      • Nothing to report
    • Tier2 Issues
      • Nothing to report
    • MC production
      • Very large (1000M) MC production ongoing T1s and T2s

  • ALICE reports -
    • T0 site
      • Ticket concerning the bad performance of the CREAM DB of ce203 (GGUS:61516). The issue has been reassigned to the CREAM-CE developers. Since the issue has also been observed at some other ALICE T2 sites, the service experts are trying to fix the problem. In the meantime all ALICE sites have been checked to ensure that the source of information used at each site is the BDII.
    • T1 sites
      • Bad performance of the local NDGF SE today in ML; transfers to the site are also failing today. The origin of the problem seems to be the current TFD patch under test at the moment (to be discussed during today's TF meeting).
    • T2 sites
      • Nothing remarkable to report

  • LHCb reports -
    • Experiment activities:
      • Due to the exceptional amount of data to be processed/reprocessed and merged for users, and backlogs forming at the pit (due to the connection to CASTOR being limited to 1 Gb/s and the high-level-trigger cuts being too relaxed), LHCb is putting heavy load on many services (central DIRAC, grid T0 and T1 services).
      • (Rather for the T1 coordination meeting.) LHCb wonders about a clear procedure for the case where a disk server becomes unavailable for an expectedly long period (> half a day). If nothing is already formally defined, a good procedure could be:
        • Disk server crashes.
        • Estimation of the severity of the outage.
        • The disk server will be unavailable for a long while (e.g. a week, sent back to the vendor).
        • Try to recover the data from tape (if possible) or from other disk servers (in case copies are available elsewhere) and copy them to another disk server of the same space token.
        • Announce the incident to the affected community (through agreed channels; a GGUS ticket to LHCb support, for example, is how LHCb would like to receive these announcements) and transmit with the announcement the list of unrecoverable files.
      • T0 site issues:
        • Requested another space token (LHCb_DST, T0D1, no tape) to handle small un-merged files (which will be scrapped anyway after the merging). This is to avoid migrating smallish files to tape, as happens now with M-DST (T1D1) serving these files. It does not need to be open to the outside, but it should have good throughput to the WNs.
        • CERN: A patched version of the xroot redirector has been deployed on castorlhcb (which should fix the problem reported last week). Re-enabled the xroot:// protocol on the lhcbmdst service class (GGUS:61184).
      • T1 site issues:
        • RAL: lhcbuser space token running out of disk space: as an emergency measure another diskserver has been added.
        • RAL: Since yesterday two more disk servers are out of the game (gdss470 for the lhcbMdst and gdss475 for the lhcbUser space tokens). These pair with gdss468, reported last week, which seems still not fully recovered (working at reduced capacity).
          • From RAL: Due to the problems caused by high load on LHCb diskservers, we have reduced the number of jobslots on all LHCb diskservers by 25%. The last two diskservers that went down, gdss470(lhcbMdst) and gdss475(lhcbUser), are back in production with their jobslots reduced by 50%. We will monitor these machines and review these limits tomorrow.
        • IN2P3: Shared area: keeping this entry for traceability. Work in progress but problem still there. (GGUS:59880 and GGUS:61045)
        • CNAF: Issue with direct submission against CREAMCE. It is OK now
        • CNAF: FTS transfers failing to the CNAF LHCb_MC_M-DST space token (GGUS:61571); the space tokens there are getting full.

Sites / Services round table:

  • KIT - ntr
  • IN2P3 - ntr
  • PIC - we still have the problem with ATLAS jobs that fill /tmp; ATLAS wanted to have a look at this. Peter L - did you see the Savannah ticket? It is a bug in CMT which will be fixed in the next release - and this comment is from a few months ago! Can you run a tmpwatch cleanup (see the sketch after this round table)? Yes - we can survive with that until the new release.
  • RAL - nta
  • BNL - ntr
  • FNAL - ntr
  • ASGC - ntr
  • NL-T1 - the DB problem seems to boil down to Oracle ASM or the underlying storage. Doing another restore to see if we can restart the DBs on a single Oracle instance; trying to solve the problem rather than just bring the service back. By the end of the afternoon we will send a message about the plans on how to proceed. Roberto - we submit a bunch of pilots but they are not matching the site; many waiting jobs - GGUS:61583
  • CNAF - the farming and storage people are working on the LHCb ticket related to FTS failures. It seems to be related to a particular VM implementation on the WNs; hope to solve the problem in 1-2 days. Roberto - many transfers are failing as the destination space token is full.
  • OSG - ntr
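Regarding the tmpwatch suggestion in the PIC item above: below is a minimal Python sketch of a tmpwatch-style sweep of stale files under /tmp (e.g. the CMT leftovers tracked in GGUS:57236). It is an illustration only, not PIC's actual cleanup job; the 240-hour cutoff and the dry-run default are assumptions.

```python
# tmpwatch-style sweep: list (or, with --delete, remove) files under /tmp
# whose modification time is older than the cutoff.
import os, sys, time

MAX_AGE_HOURS = 240  # hypothetical cutoff, similar to a typical cron'd tmpwatch

def stale_files(top, max_age_hours):
    cutoff = time.time() - max_age_hours * 3600
    for root, _dirs, files in os.walk(top):
        for name in files:
            path = os.path.join(root, name)
            try:
                if os.lstat(path).st_mtime < cutoff:
                    yield path
            except OSError:
                pass  # vanished while scanning

if __name__ == "__main__":
    delete = "--delete" in sys.argv[1:]
    for path in stale_files("/tmp", MAX_AGE_HOURS):
        if delete:
            try:
                os.remove(path)
            except OSError:
                pass
        else:
            print(path)  # dry run: report only
```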

AOB:

Friday

Attendance: local(Marie-Christine, Jamie, Peter, Ewan, Jacek, Gavin, Andrea, Edward, Roberto, Patricia, Renato, Harry);remote(Stefano, Michael, Angela, Rolf, Gang, Kyle, Ronald, Gareth, Vera, Catalin).

Experiments round table:

  • ATLAS reports -
    • Ongoing issues
      • INFN-BNL network problem GGUS:61440
      • SARA oracle outage GGUS:61265 (DB restore up to 18th August)
      • FZK-NDGF transfer issue GGUS:60437 (test pending)
      • RAL-NDGF functional test GGUS:61306 (FT transfers continue to fail)
    • Aug 27 (Fri)
      • Central ops
        • LSF slow dispatching jobs to atlast0 (ALARM sent: GGUS:61607). [ Gavin - the specific issue was solved by a reconfiguration of the batch system; a ticket is open with Platform - need to debug with them. ]

  • CMS reports -
    • Experiment activity
      • Had problem with DQM uploads not working; expired certificate; corrected now; RECO DQM failures resubmitted on CAF
    • Central infrastructure
    • Tier1 issues
      • Nothing to report
    • Tier2 Issues
    • MC production
      • Very large (1000M) MC production ongoing T1s and T2s

  • ALICE reports - GENERAL INFORMATION: Grid activities ongoing: Pass1 reconstruction and TPC calibration activities together with 3 MC cycles
    • T0 site
      • Ticket GGUS:61516 in progress. For the rest of the services nothing to report.
    • T1 sites
      • All T1 sites currently in production
    • T2 sites
      • Operations were required in Catania and TrigGrid yesterday afternoon to bring the sites back into production.

  • LHCb reports - NOTE: Very intense activity in WLCG these days, with backlogs of data to recover and reprocessing+merging of old data to be delivered soon to our users; more explanation in this entry of the LHCb elog. The observed poor performance of disk servers/SRMs at the T1's is due to the exceptional number of merging jobs and the associated merged-data transfers (1 GB/s sustained) across the T1s.
    • T0 site issues: none
    • T1 site issues:
      • RAL: the reduced regime of their disk servers is affecting users, whose access requests are queued waiting for a slot.
      • CNAF: The FQAN /lhcb/Role=user is not supported on their CREAM CEs (this being the ultimate cause of the problems observed this week). GGUS:61590
      • CNAF: Transfers to CNAF are OK now (GGUS:61571)
      • SARA: huge backlog of jobs to run. Their BDII reports 3 jobs as waiting; for LHCb that kills us, preventing pilot submission to SARA (due to the rank). GGUS:61588
      • SARA: Yesterday SRM overloaded affecting transfers to SARA (GGUS:61586) then SRM dead (GGUS:61598).
      • SARA: still problems (initialization phase plus conditionDB access) preventing jobs from running smoothly (GGUS:61609).
      • IN2P3: huge number of pilot jobs aborting against the CREAM CE; the service seems to be down (GGUS:61605). [ Rolf - we suffered an outage of the national network (RENATER) from 10:50 to 11:10; the incident reported in the ticket falls in that period. Checking whether there were other errors. ]

Sites / Services round table:

  • CNAF - ntr
  • BNL - ntr
  • KIT - ntr; Q: is the VOMS server for the LHC VOs running on SL4 or SL5? A: SLC4, gLite 3.1. Hope to move to gLite 3.2/SL5 by the end of the year.
  • IN2P3 - nta
  • ASGC - ntr
  • NL-T1 - update on the DB issues at SARA. As reported by ATLAS, we succeeded in restoring the DB up to the 18th. Then something happened: some old corrupt data entered the DB again. We have removed all the old data and are restoring again; it will take a couple of hours and it is not certain that it will complete today.
  • RAL - this weekend is a holiday weekend and RAL is closed Monday and Tuesday next week; probably no one will phone in. Issues for LHCb - still running and will run through the weekend with a reduced number of job slots, plus a limit on the number of batch jobs to keep the load on the disk servers steady over the weekend. One disk server is out for ATLAS and still being worked on: of its 3 partitions, 2 look OK and 1 is still being checked out.
  • NDGF - ntr except for ticket GGUS:61036 - OK at the NDGF end but not at RAL. Gareth - it is not as simple as the network being down; it is very localised. Peter - is any issue open with the network provider? A: don't know; there has been some discussion, but will chase it up.
  • FNAL - ntr
  • OSG - ntr

AOB:

-- JamieShiers - 23-Aug-2010

Topic attachments
  • ASGC_SAM_SRM_since_1_week.png (PNG, 49.6 K, 2010-08-24 14:47, PeterKreuzer)
  • GGUS_OSG_Priority.txt (txt, 5.3 K, 2010-08-24 14:44, MariaDimou)