Week of 100719

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Miguel, Xavier, Alexei, Steven, Ale, Roberto, Jamie, Patricia, Simone, Gavin, Carlos, Massimo, Harry, JPB, Giuseppe, Jacek, Andrea, Maarten, Ulrich, Flavia, Pepe, Nilo, Dirk);remote(Michael/BNL, Jon/FNAL, Gonzalo/PIC, Xavier/KIT, Gang/ASGC, SIP?, Rolf/IN2P3, Tiju/RAL, Rob/OSG, Christian/NDGF, CNAF, Ron/NL-T1, Daniele).

Experiments round table:

  • ATLAS reports - Xavier
    • Tier-0: Nice fills and luminosity records during the weekend. The extra data distribution exhausted the ATLAS-T0 network (T0ATLAS reached 6.5GB/s). Write failures started on Fri/20h30. Situation alleviated at around Sat/1:00AM (the number of files on the FTS CERN-CERN channel was reduced by 1/3).
    • Problems getting data from ATLAS T0MERGE pool on Saturday evening. Alarm ticket sent to CASTOR support (Sat/21h32): GGUS 60197
      • Fixed on Sat/22:57 !
    • BNL instabilities in dCache, database recreated (outage from Fri/20h30 to Sat/2:30AM) - site people very proactive in communication. ADC ops was aware of every single action.
    • ASGC had a small glitch (Sun/22:00), reported immediately to the ATLAS CR, and quickly recovered after 30 minutes.
    • Backlog of Datasets from Tier-0 TZ/DAQ to CERN-PROD_DATADISK - possible side effect of reducing #files in FTS' CERN-CERN channel.
    • GGUS: reopened tickets not received at the site - needs follow-up. GGUS ticket created to keep track of this: GGUS 60219
      • BNL: GGUS 60170
      • The same happened on Thursday for the ALARM ticket sent to NDGF: GGUS 60108
    • Long run yesterday (32h, 61.2nb^-1) with Luminosity record - expect high volume of data in the next hours (x30 express stream)
    • Miguel: reminder that T0ATLAS is sized to operate between 3 and 4 GB/s, not 6.5 GB/s. Simone: ATLAS may need access to FTS parameters for emergency tuning; right now this exists only for CERN-CERN, but similar rights would be needed for the T1 FTS parameters. Gavin: yes, will be provided. (See the back-off sketch after this report.)
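The emergency mitigation described above (cutting the number of concurrent files on the FTS CERN-CERN channel) amounts to a proportional back-off. Below is a minimal sketch of that arithmetic only; the function name and the 4 GB/s target are illustrative assumptions, not the actual FTS tooling.

```python
def reduced_file_count(current_files: int,
                       observed_rate_gbps: float,
                       target_rate_gbps: float = 4.0) -> int:
    """Scale the number of concurrent channel files so that the expected
    throughput drops to the target rate (assumes throughput is roughly
    proportional to the number of concurrent transfers)."""
    if observed_rate_gbps <= target_rate_gbps:
        return current_files  # channel already within budget
    scale = target_rate_gbps / observed_rate_gbps
    return max(1, int(current_files * scale))

# Example (hypothetical numbers): a channel running 30 concurrent files
# at 6.5 GB/s would be cut to 18 files, i.e. by roughly one third.
print(reduced_file_count(30, 6.5))  # -> 18
```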

  • CMS reports - Pepe
    • Tier-0
      • [CLOSED] Alarm GGUS #60204: ALL CASTORCMS instances in RED, no data coming in from Point 5. An anomalous activity degraded the service in the whole CASTOR CMS instance: it triggered several thousand disk-to-disk copies from a single user, resulting in about 30K jobs pending in the default service class. CERN/IT killed all of them and CASTOR recovered. CMS warned the user and is trying to avoid this occurring again by introducing a limit to prevent users from making too many requests (a sketch of such a per-user throttle follows this report). Ticket was opened "2010-07-18 14:19 UTC" and action taken by "2010-07-18 14:26 UTC".
    • Tier-1s
      • [CLOSED] Team GGUS #60154: Tape backlog at CNAF - Muonia datasets not written to tape. We cannot process data until blocks are closed (which means transferred to tape). The problem was fixed during the weekend. It was due to a bug related to the TSM clients which prevented file migrations from completing properly. Team ticket opened "2010-07-15 20:27 UTC" and fixed by "2010-07-18 06:27 UTC".
      • [OPEN] Savannah #115696 / GGUS #60126: KIT: Cannot install CMSSW due to issues with NFS locking. It seems the problem persists.
    • Tier-2s
      • [CLOSED] Savannah #115724: SAM test errors in T2_BR_SPRACE. A network driver problem induced a glitch in the dCache system. Solved.
      • [ OPEN ]Savannah #115672: JobRobot with 0% and SAM error in T2_IN_TIFR. Site working to fix it.
      • [ OPEN ]Savannah #115694: Frontier seems down in T2_US_MIT although SAM tests are ok. Investigating.
    • Notes:
      • Tier-1s asked to configure new /cms/Role=t1production role. A few sites deployed it already (PIC, ASGC,KIT,IN2P3). In progress.
      • Slow AFS access behaviour at CERN. GGUS team ticket still open, possibly to track more issues: [OPEN] https://gus.fzk.de/ws/ticket_info.php?ticket=59728
    • Flavia: does CMS really run frontier at T2? Pepe: no, just SQUID caches.
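As mentioned in the CASTORCMS item above, CMS intends to limit how many requests a single user can have queued. The sketch below is a minimal, purely illustrative per-user throttle; the 1000-request cap and the class name are assumptions, not the actual CASTOR or CMS implementation.

```python
from collections import defaultdict

class UserRequestThrottle:
    """Reject new disk-to-disk copy requests from a user once that user
    already has too many requests pending (illustrative only)."""

    def __init__(self, max_pending_per_user: int = 1000):
        self.max_pending = max_pending_per_user
        self.pending = defaultdict(int)  # user DN -> number of pending requests

    def submit(self, user: str) -> bool:
        """Return True if the request is accepted, False if throttled."""
        if self.pending[user] >= self.max_pending:
            return False
        self.pending[user] += 1
        return True

    def complete(self, user: str) -> None:
        """Call when one of the user's requests finishes."""
        if self.pending[user] > 0:
            self.pending[user] -= 1

# Example: requests beyond the per-user cap are refused long before
# 30K jobs can pile up in the service class.
throttle = UserRequestThrottle(max_pending_per_user=1000)
accepted = [throttle.submit("user_dn") for _ in range(1001)]
print(accepted.count(True), accepted.count(False))  # -> 1000 1
```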

  • ALICE reports - Patricia
    • Weekend Summary: Usual Pass1 reconstruction activities ongoing. In addition two new MC cycles started up during the weekend (15M pp events requested for two different generators). A large number of analysis trains also running.
    • In terms of transfers, raw data registration at T1 sites has continued during the weekend (mostly to FZK and NDGF), with peaks around 200MB/s and around 45TB of raw data registered in CASTOR.
    • T0 site
      • Large number of jobs executed during the weekend, good behavior of the T0 resources
    • T1 sites
      • All T1 sites in production, no errors found by ML in the local SEs (except for CNAF, which is increasing the available disk space for ALICE). At CNAF the ALICE experts had to kill all submitted agents on Sunday morning (preventing new agents from entering the system).
    • T2 sites
      • No remarkable issues to report

  • LHCb reports - Roberto
    • Experiment activities:
      • Reconstruction productions progressing despite the internal issue in DIRAC with many jobs (70K) waiting in the central Task Queue. Some MC productions running concurrently (~10K jobs running in the system).
    • Issues at the sites and services
    • T0 site issues:
    • T1 site issues:
      • GRIDKA: now running at full capacity the reconstruction of the runs associated to the site, according to its CPU share. No problems.
      • NIKHEF: "no HOME set" issue: GGUS:60211. Restarted the nscd daemons stuck on some WNs.
      • IN2p3: shared area instabilities: GGUS:59880. The problem seems related to a site-internal script that temporarily makes WNs unavailable. Requested to increase the timeouts to 20-30 minutes.
      • SARA: received a list of files lost following the recent data corruption incident.
      • pic: in SD tomorrow.
      • RAL: received the early announcement of a major upgrade of their CASTOR instance to 2.1.9. LHCb will give any help needed to validate it; by the time of this intervention an instance of the LHCb HC will most likely be in place to contribute to the validation.
    • T2 site issues:
      • none

Sites / Services round table:

  • Michael/BNL - experienced consistency problems with the PostgreSQL DB for the namespace. The DB was restored from a recent backup, which took a few hours, but the service should be fine now. Investigating why the follow-up on GGUS ticket 60170 did not reach BNL.
  • Jon/FNAL - demonstrator tests with xrootd have shown the need for frequent restarts.
  • Gonzalo/PIC - reminder of the scheduled downtime tomorrow from 6am to 6pm.
  • Xavier/KIT - ntr (Xavier/ATLAS: still one pool offline - any info on this?) - KIT: this data (scratch) was already declared lost on Friday.
  • Gang/ASGC: debugging slow staging for ATLAS, suspect a DB inconsistency and will install the missing Oracle monitoring agent to help with the analysis.
  • Rolf/IN2P3: ntr
  • Tiju/RAL: intervention on CASTOR tape today completed successfully. The scheduled intervention for work on the transformers has been cancelled. New date still being discussed.
  • Christian/NDGF: ntr
  • CNAF: ntr
  • Ron/NL-T1: reminder: maintenance on Wed. The problem with the nscd daemon being stuck (related to the site LDAP service) is fixed now.
  • Rob/OSG: Working with BNL on resolving the GGUS ticket problems mentioned by BNL.
  • Miguel/CERN : Applied latest castor/xroot patches to CASTORCMS stager today. Also finished important DB update for nameserver DB to allow usage of more cache memory. Similar castor/xroot patches planned tomorrow for CASTORATLAS and CASTORCERNT3, the day after for CASTORLHCB and CASTORALICE.
  • Gavin/CERN: the CERN CC vault has a cooling problem. Will reduce batch and notify VOBOX owners with the list of nodes which need to be shut down.

AOB:

Tuesday:

Attendance: local(Zbigniew, Pepe, Ale, Roberto, Manuel, Gavin, Lola, Harry, Miguel, Patricia, Iuri, Jean-Philippe, Maarten, Nilo, Stephane, Simone, Julia, Dirk);remote(Jon/FNAL, Daniele/CNAF, Michael/BNL, Gang/ASGC, Vladimir/LHCb, Kyle/OSG, Ronald/NL-T1, Rolf/IN2P3, Tiju/RAL, Christian/NDGF, Xavier/KIT, Daniele/CMS)

Experiments round table:

  • ATLAS reports - Iuri
    • No major issues or problems.
    • CERN/Tier-0: slowly recovering from the backlog of collision data to CERN.
    • Group production jobs for ICHEP are running at several T1s, should be completed quickly.
      • TAIWAN-LCG2 has some stage-in/out job failures (GGUS:60231). Thanks to the site team for the investigation and the quick action: the number of jobs accessing the disk servers was temporarily reduced to decrease the load. Hopefully the automatically re-submitted jobs will finish successfully.
    • BNL: a minor issue with file transfer failures to NET2 DATADISK is under investigation (GGUS:60249).

  • CMS reports - Pepe
    • Tier-0
      • The Tier-0 went down this morning to move back to the read-write AFS software area. It took less than 30 minutes. Note: the RO AFS area will be rebuilt shortly. Users at CERN should go back to using the default area. The default area is expected to become RO next week, but there will be an announcement. Users will need to take some action after the change; they have been notified.
    • Tier-1s
      • KIT notification: dCache headnode crash. T1_DE_KIT is in emergency shutdown from 9:00-13:00h. (For details see here: http://grid.fzk.de/monitoring/status/status.php). Update 20-07-2010 11:39 : The downtime is finished now officially. All our checks and the CMS SAM tests are green. No problems left. CMS acknowledges.
      • [CLOSED] Team GGUS #60154: Tape backlog at CNAF - Muonia datasets not written to tape. Note: the non-custodial datasets were unsuspended yesterday afternoon. Rather high rates from PIC->CNAF observed (350 MB/s for 3 hours). Later data from other T1s arrived. In total, over the last 24h, 11 TB were transferred and 14 TB copied from disk to MSS (average of 200 MB/s).
      • [OPEN] Savannah #115696 / GGUS #60126: KIT: Cannot install CMSSW due to issues with NFS locking. Was re-opened by CMS. KIT is now running nfslockd on the appropriate fileserver and the commands now succeed without any error message. CMS will give it a try today.
    • Tier-2s
      • [ OPEN ]Savannah #115793: transfers to T2_BR_UERJ failing: Could not submit to FTS. Probably expired proxy on the vobox.
      • [ OPEN ]Savannah #115786: timeouts in transfers from T1_UK_RAL to T2_FI_HIP. FTS parameters at RAL STAR-FIHIPT2 FTS channel are being adjusted.
      • [ CLOSED ]Savannah #115785: T2_CH_CSCS - PhEDEx agents down for more than 10h. NFS server caused some issues with phedex agents, that needed a restart.
      • [ OPEN ]Savannah #115763: T2_FR_IPHC -> T1_US_FNAL_Buffer transfers failing: SOURCE file doesn't exist . Site asked to check those dataset files.
      • [ CLOSED ]Savannah #115762: T2_US_MIT to T1_US_FNAL_Buffer transfers failing: checksum mis-match. Two files were invalidated in DBS.
      • [ CLOSED ]Savannah #115761: T2_CN_Beijing -> T1_US_FNAL_Buffer transfer error. One file was lost at last dcache disk problem. Not able to recover it. File invalidated in DBS.
      • [ CLOSED ]Savannah #115759: T2_US_Wisconsin to T1_IT_CNAF_Buffer transfer error: SOURCE file does not exist. It was a SRM glitch. The file eventually transferred.
      • [ OPEN ]Savannah #115672: JobRobot with 0% and SAM error in T2_IN_TIFR. Site working to fix it. Apparently these errors are solved by now (indeed since a few days, although no actions were written to ticket). Site asked to provide information.
      • [ OPEN ]Savannah #115694: Frontier seems down in T2_US_MIT although SAM tests are ok. Investigating.
    • Notes
      • During yesterday's cooling incident we asked CERN/IT to keep all our webtools backends up. We are revising machine criticalities, as we know some of our machines are at the wrong importance levels from the CERN/IT point of view.
      • Tier-1s asked to configure new /cms/Role=t1production role. A few sites deployed it already (PIC, ASGC,KIT,IN2P3). In progress.
      • Slow AFS access behaviour at CERN. GGUS team ticket still open, possibly to track more issues: [OPEN] https://gus.fzk.de/ws/ticket_info.php?ticket=59728
    • Weekly scope plan (post-ICHEP-wave)
      • Tier-0: Technical stop, upgrades and tests
      • Tier-1: running backfill if nothing else comes up / StorageConsistency checks at T1s. Sites have been contacted.
      • Tier-2: very low MC production activity

  • ALICE reports - Patricia
    • GENERAL INFORMATION: Usual reconstruction, analysis trains and 2 MC cycles ongoing. Large Grid activity. Raw data transfer activities also ongoing (low rate with NDGF)
    • T0 site
      • Issue reported yesterday: a large number of ALICE jobs waiting in the CREAM queues (30K jobs). The origin was a bad response from the CREAM DB, which ALICE interpreted as empty queues. The production through CREAM was stopped and this morning the queues are drained. In addition the ALICE code has been improved to avoid this situation in the future (see the guard sketch after this report). The system will be put back in production this afternoon.
      • The CC cooling problem reported yesterday afternoon affecting the experimental VOBOXES did not affect the Alice production (importance of the nodes higher than 50). The only affected activity concerned the CAF nodes.
    • T1 sites
      • All T1 sites in production with good performance of the local SEs. Only CNAF is still pending. (Ticket GGUS:60169)
    • T2 sites
      • Grenoble: the ALICE user proxy expired at the local VOBOX. The ALICE contact person at the site was informed about this issue.
      • IN2P3-IRES: Several sites including IRES have observed a lot of very long jobs (over 40h CPU time). These jobs are apparently stuck in endless loops. Procedure: kill all these jobs and, as a preventive measure, put a 24h CPU time limit on the ALICE queues.
      • Rolf: can you confirm that problematic jobs also arrive at IN2P3. Patricia: yes, also happened at Lyon.
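The code improvement mentioned in the T0 item above boils down to not treating a failed or implausible queue-status query as an empty queue. The sketch below illustrates that guard only; the function and parameter names are hypothetical and this is not the actual ALICE/AliEn code.

```python
from typing import Optional

def should_submit_agents(query_ok: bool, waiting_jobs: Optional[int],
                         max_waiting: int = 100) -> bool:
    """Only submit new job agents when the queue query succeeded and the
    number of waiting jobs is below a sane threshold. A failed query is
    treated as 'do not submit', never as 'queue is empty'."""
    if not query_ok or waiting_jobs is None:
        return False          # bad response from the CREAM DB: back off
    return waiting_jobs < max_waiting

# Example: a broken status query no longer triggers a flood of agents.
print(should_submit_agents(query_ok=False, waiting_jobs=None))  # -> False
print(should_submit_agents(query_ok=True, waiting_jobs=20))     # -> True
```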

  • LHCb reports - Vladimir
    • Experiment activities: some MC production ongoing and a few remaining runs to be reco-stripped. Activities dominated by user analysis; a new group (lhcb_conf) has been added in DIRAC with the highest priority, to get resources for running jobs with highest precedence ahead of the forthcoming ICHEP conference.
    • T0 site issues:
      • The cooling problem in the vault reported yesterday also affected some LHCb VOBOXes, some of them hosting critical services for LHCb (sandboxing service). Through CDB templates the priority of these nodes has been raised while other VOBOX priorities have been relaxed.
      • Keeping for traceability the still ongoing VOMS-VOMRS synchronisation issue (GGUS:60017), awaiting expert feedback.
    • T1 site issues:
      • SARA: due to an AT RISK intervention on the SARA storage, some user jobs were failing: they were declared stalled and then killed.
    • T2 site issues:
      • none
    • Roberto: the 3D DB at RAL went down (now up again). Also hitting again the MC output upload problem at UK sites and the Barcelona T2.

Sites / Services round table:

  • Jon/FNAL - External generators were being used in our grid computing rooms yesterday while the normal power was being worked on (8 hours of work). One of the external generators failed, causing the cooling units to be unable to remove the heat in the room. Facilities ordered a power shutdown at ~11:30 AM Chicago time to prevent damage. Repairs were made and service was restored between 3-4 PM. Another 45 minutes of work is scheduled for today at 7 AM - we are not expecting problems. We notified CMS when the Monday incident started and stopped and logged it in the OSG downtime pages. Interestingly, because just the worker nodes were down, the SAM tests queued in our system and then ran successfully when power was restored (within the SAM timeout period). So just like a previous 8-minute real outage getting flagged as a 5-hour SAM outage, this time it worked in reverse - a 4-hour worker-node-only outage was buffered and we were flagged as 100% available. Please note that data operations and SE operations were not affected at all by this power outage.
  • Di/Triumf (by email)
    • We plan to upgrade the DDN controller and disk firmware of our storage system during the LHC technical stop next week. The downtime will start at 17:00 UTC and end at 23:00 UTC on Wednesday (July 21). During this period the whole dCache system will not be available. Please let us know if this causes problems for you.
    • In addition, we will move the FTS database to a new Oracle RAC next Monday, so FTS will not be available from 17:00 UTC to 19:00 UTC on Monday (July 19).
  • Daniele/CNAF: the storage requested by ALICE has just been added. On 27 July there will be an FTS downtime due to an Oracle DB intervention (already in GOCDB).
  • Michael/BNL - ntr
  • Gang/ASGC - ntr
  • Kyle/OSG - investigation of the GGUS propagation issues showed that postfix was not running on the goc ticket exchange. more details at https://twiki.grid.iu.edu/bin/view/Operations/GOCServiceOutageJuly162010
  • Ronald/NL-T1: 3 disk servers with Infiniband problems.
  • Rolf/IN2P3 - ntr
  • Tiju/RAL - Problems encountered with planned Oracle intervention: had to roll back - short service stop will be required. The intervention on the CASTOR tape system went well.
  • Christian/ NDGF, ntr
  • Xavier/KIT - ntr
  • Gavin/CERN: Preparing a SIR about the cooling incident and the resulting degradation of other services (e.g. CASTOR).

AOB:

  • A service incident report for the recent issues with propagating GGUS tickets has been requested and is being prepared

Wednesday

Attendance: local(Iuri, Ale, Stephen, Gavin, Julia, JPB, Miguel, Nilo, Marcin, Lola, Edoardo, Maarten, Simone, Dirk);remote(Daniele/CNAF, Vladimir/LHCb, Gonzalo/PIC, Jon/FNAL, Michael/BNL, Gang/ASGC, Tiju/RAL, Kyle/OSG, Ron/NL-T1, Daniele/CMS, Christian/NDGF).

Experiments round table:

  • ATLAS reports - Iuri
    • T0:
      • Since 9:00 this morning, checksums have been switched on for GridFTP-CASTOR internal usage. No problems so far. Tomorrow morning, checksum verification for FTS transfers to CERN will be switched on for functional tests (FT) if everything goes well (a checksum-comparison sketch follows this report).
    • T1s:
      • CNAF confirmed that the current version of StoRM installed at CNAF should be able to compute the checksum for FTS transfers. Checksum activation is now done for FT to the INFN-T1_DATADISK and INFN-T1_DATATAPE space tokens. The first functional test file transfers to the site unfortunately failed with a "Destination file/user checksum mismatch" error. GGUS:60312 was filed several hours ago and is under investigation. The site team discovered that checksums were activated only on D1T0; they are working to activate them on D0T1 too.
      • CNAF: file transfer failures from T0 due to a gridftp transfer error. A fair number of these files transfer successfully on the 2nd or 3rd attempt, but some still fail after three. GGUS:60287 was submitted this morning. The site team confirmed they have some problem with LDAP; people are working to fix it.
      • RAL yesterday had a problem with Ogma, the ATLAS 3D database. The database was up and working fine but the streaming from CERN didn't work. Experts are trying to resend the missing LCRs via a temporary Streams setup.
      • some high priority group production tasks successfully produced the data at TAIWAN-LCG2, BNL and other sites.
      • BNL - scheduled downtime (dCache,network) ~9h.
      • TRIUMF - scheduled downtime (dCache) ~6h.
    • T2s:
      • OU_OCHEP_SWT2 File transfer failures due to http timeout to SRM. GGUS:60272. The site confirmed that with the manual command with gsiftp protocol, after some time the file starts to copy. Looks like the problem is with the authentication/authorization of proxy in the Bestman-Gateway system somehow, according to the bestman error logs. gridftp doesn't seem to have that problem, only bestman.
      • LIPLisbon->PIC File transfer failures with source error, request timeout and some connection to SRM failures. GGUS:60286. The site reported at 10am that SRM problem is fixed.
      • HEPHY-UIBK site has some issue with SRM (GGUS:59962); the site team is investigating the problem.
    • Daniele: the transfer problems last night were caused by one LDAP server being out of sync - users disappeared from LDAP, among them the storm user used for transfers. Service was restored during the morning. The site is implementing SMS-based alarming to avoid a recurrence of this problem.
    • Marcin/DB: this morning one listener of the ATLAS offline DB was stuck - fixed after a few minutes.
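The checksum verification being enabled in the T0 item above compares a checksum computed at the destination with the catalogue/source value after the transfer; for ATLAS and CASTOR this is typically an adler32 checksum. Below is a minimal sketch of that comparison in Python; the function names are assumptions and this is not the actual FTS or CASTOR code.

```python
import zlib

def adler32_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the adler32 checksum of a file, returned as 8 lowercase hex
    digits (the format commonly stored by storage systems)."""
    value = 1  # adler32 seed
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return format(value & 0xFFFFFFFF, "08x")

def checksums_match(local_path: str, stored_checksum: str) -> bool:
    """Compare the locally computed checksum against the catalogue value,
    as a transfer-verification step would."""
    return adler32_of_file(local_path) == stored_checksum.lower()

# Example (hypothetical file and catalogue value):
# print(checksums_match("/data/file.root", "12345678"))
```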

  • CMS reports - Stephen
    • Tier-0
      • Very busy in the last 24 hours with the backlog of transfers from Point 5 (due to upgrades there).
    • Tier-1s
      • [CLOSED] Savannah #115696 / GGUS #60126: KIT: Cannot install CMSSW due to issues with NFS locking. Successfully installed yesterday.
      • [OPEN] Savannah #115745: CNAF: Transfers failing to many sites. Due to the number of pool accounts (probably already fixed).
    • Tier-2s
      • [ OPEN ]Savannah #115811: SAM CE errors at T2_FI_HIP. Due to only SLC4 available on certain clusters.
      • [ OPEN ]Savannah #115808: Analysis SAM CE error at T2_TR_METU.
      • [ OPEN ]Savannah #115806: Transfers from T2_US_Caltech to T2_RU_JINR failing.
      • [ CLOSED ]Savannah #115793: transfers to T2_BR_UERJ failing: Could not submit to FTS. Successful transfers.
      • [ OPEN ]Savannah #115786: timeouts in transfers from T1_UK_RAL to T2_FI_HIP. FTS parameters at RAL STAR-FIHIPT2 FTS channel are being adjusted.
      • [CLOSED] Savannah #115763: T2_FR_IPHC -> T1_US_FNAL_Buffer transfers failing: SOURCE file doesn't exist. Files lost and invalidated.
      • [ CLOSED ]Savannah #115672: JobRobot with 0% and SAM error in T2_IN_TIFR. Site working to fix it. Apparently these errors are solved by now (indeed since a few days, although no actions were written to ticket). Due to bandwidth limits, site attempting to expand limit.
      • [ OPEN ]Savannah #115694: Frontier seems down in T2_US_MIT although SAM tests are ok. Investigating (there was a power cut)
    • Notes
      • Tier-1s asked to configure new /cms/Role=t1production role. A few sites deployed it already (PIC, ASGC,KIT,IN2P3). In progress.
      • Slow AFS access behaviour at CERN. GGUS team ticket still open, possibly to track more issues: [OPEN] https://gus.fzk.de/ws/ticket_info.php?ticket=59728
    • Weekly scope plan (post-ICHEP-wave)
      • Tier-0: Technical stop, upgrades and tests
      • Tier-1: running backfill if nothing else comes up / StorageConsistency checks at T1s. Sites have been contacted.
      • Tier-2: very low MC production activity
    • Daniele: is there a ticket for the role creation? Stephen: yes, a Savannah ticket against each site (not GGUS).

  • ALICE reports - Lola
    • T0 site
      • Ticket GGUS:60309 submitted this morning: the blparser service for ce203.cern.ch is not alive.
    • T1 sites
      • CNAF: Ticket GGUS:60169 concerning the increase of the available disk space for ALICE: CLOSED and MonaLisa is again publishing good results for this SE
    • T2 sites
      • Minor and usual operations applied to different T2 sites. No important issues to report.

  • LHCb reports - Vladimir
    • Experiment activities:
    • ~10K MC jobs plus user analysis. Reconstruction almost done. The backlog of jobs formed in DIRAC has been drained by restarting one agent.
    • T0 site issues:
      • none
      • Keeping for traceability the still ongoing VOMS-VOMRS synchronisation issue (GGUS:60017), awaiting expert feedback.
    • T1 site issues:
      • CNAF: GGUS:60314 ("HOME not set" issue, as seen at NIKHEF)
      • RAL: following the intervention on Oracle yesterday, the LFC there was unreachable (GGUS:60274).
      • IN2p3: Shared area still problematic (SAM jobs failing there, old ticket open GGUS:59880)
      • IN2p3: Inconsistency between BDII and SAMDB/GOCDB (GGUS:60321).
    • T2 site issues:
      • none

Sites / Services round table:

  • Daniele/CNAF - the farming team is planning a GPFS upgrade on the worker nodes. The plan is to proceed in blocks of WN boxes to avoid a general downtime (which would last a week). How should this be announced? Maarten: an at-risk in GOCDB should suffice. Inform the experiments about the fraction of WNs which will be unavailable during the upgrade and its total duration.
  • Gonzalo/PIC - ntr
  • Jon/FNAL - ntr
  • Michael/BNL - in scheduled downtime - all progressing well and expect to not use the full 9h intervention slot (estimate 6h)
  • Gang/ASGC - ntr
  • Tiju/RAL - RAL had to stop LFC and FTS around 3.30 local time for the rollback of the DB intervention. All services came back after 15 minutes apart from the ATLAS 3D streams. Similar DB interventions planned for today and tomorrow have been cancelled.
  • Kyle/OSG - ntr
  • Ron/NL-T1 - yesterday busy with scheduled maintenance - all going according to plan
  • Christian/NDGF - ntr

AOB:

Thursday

Attendance: local(Iuri, Stephane, Patricia, Roberto, Maarten, Zbigniew, Douglas, Jean-Philippe, Ale, Nicolo, Manuel, Dirk);remote(Jon/FNAL, Michael/BNL, Gonzalo/PIC, Gang/ASGC, Kyle/OSG, Tiju/RAL, Stephen/CMS, Xavier/KIT, Daniele/CNAF, Rolf/IN2P3, Vladimir/LHCb, Christian/NDGF, Tristan/NL-T1).

Experiments round table:

  • ATLAS reports
    • T0:
      • 24 hours after the start of the GridFTP/CASTOR checksum, checksum verification for FTS transfers has been switched on for functional test transfers T0->CERN. If it works over the week-end, it will be switched on for all CERN DDM endpoints.
    • T1s:
      • PIC file transfer failures due to dCache pool problems reported at ~midnight. GGUS:60349. As these failures continued in the morning ALARM GGUS:60359 was submitted. Fixed very quickly in ~1h.
      • CNAF checksum for TAPE space tokens is pending permanent activation by CNAF : GGUS:60312 is updated.
      • RAL the issue with Ogma DB is resolved and DB at RAL is now back in sync with ATLAS Conditions.
      • BNL soon after the downtime completed, some file transfer failures and test production jobs failures were seen. GGUS:60351 filed at 3am, at 4am the problem was solved, connectivity and name resolution service at BNL is restored.
      • FZK: a small number of file transfer failures due to source errors, observed by the shifter during the night, is under investigation.
    • T2s:
      • Australia-ATLAS file transfer failure due to globus connection timeouts. GGUS:60344 was submitted at ~8pm, solved at ~10pm: one disk server fell over, rebooted.
    • Michael: BNL got briefly disconnected from ESNET during heavy thunderstorm in New England - the problem took place outside of the site.
    • Zbigniew: after the RAL DB intervention for ATLAS some change records were missing, most likely due to a known and reported Oracle bug. The DB has been re-synced with the help of the CERN team. After a reminder to Streams development we got confirmation that the originating problem in Oracle has now been fixed; the CERN team will follow up with a test of the patch as soon as it becomes available. In the meantime the ATLAS responsibles have confirmed that the RAL DB is working correctly.
    • Gonzalo: the disk space in dCache at PIC is separated into two pools (old/new hardware). New data was given preference to go to the new, more performant pool. Around midnight the new pool got full, but for some reason dCache did not use the available space in the older pool. The site is investigating with dCache development whether this is a software or a configuration bug. In the meantime PIC forced dCache to use the free space as a workaround (see the spill-over sketch below).
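The PIC issue above is about spill-over logic: preferring the newer pool group but falling back to the older one when the preferred group is full, instead of failing writes. The sketch below shows only the intended behaviour; it is not dCache's actual pool-selection code, and all names and numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Pool:
    name: str
    free_bytes: int

def choose_pool(preferred: List[Pool], fallback: List[Pool],
                file_size: int) -> Optional[Pool]:
    """Pick a pool with enough free space, preferring the newer pools but
    spilling over to the older ones rather than failing the write."""
    for group in (preferred, fallback):
        candidates = [p for p in group if p.free_bytes >= file_size]
        if candidates:
            return max(candidates, key=lambda p: p.free_bytes)
    return None  # no pool can take the file -> transfer fails

# Example: new pools full, the file still lands on an old pool.
new_pools = [Pool("new01", 0)]
old_pools = [Pool("old01", 500 * 10**9)]
print(choose_pool(new_pools, old_pools, 2 * 10**9).name)  # -> old01
```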

  • CMS reports - Stephen
    • Tier-0
      • Normal operation.

    • Tier-1s
      • [ OPEN ]Savannah #115745: CNAF: Transfers failing to many sites. Not clear if this is actually fixed.

    • Tier-2s
      • [ OPEN ]Savannah #115845: File may be corrupt at T2_US_Wisconsin.
      • [ OPEN ]Savannah #115844: Single user has authentication issue at CIEMAT.
      • [ OPEN ]Savannah #115841: Transfers to Aachen failing. Mostly in Debug, current error in Prod may be source site issue.
      • [ OPEN ]Savannah #115839: Frontier activity from Warsaw not using local cache.
      • [ OPEN ]Savannah #115811: SAM CE errors at T2_FI_HIP. Due to only SLC4 available on certain clusters.
      • [ OPEN ]Savannah #115808: Analysis SAM CE error at T2_TR_METU. Green now, but no feedback from site.
      • [ OPEN ]Savannah #115806: Transfers from T2_US_Caltech to T2_RU_JINR failing. Partial transfer perhaps blocking retries. No feedback from T2_RU_JINR.
      • [ OPEN ]Savannah #115786: timeouts in transfers from T1_UK_RAL to T2_FI_HIP. Looks good now except for source sites with other known problems.
      • [ OPEN ]Savannah #115694: Frontier seems down in T2_US_MIT although SAM tests are ok. Investigating (there was a power cut)

    • Notes
      • Tier-1s asked to configure new /cms/Role=t1production role. A few sites deployed it already (PIC, ASGC, KIT, IN2P3, FNAL, CNAF, waiting for RAL). In progress.
      • Slow AFS access behaviour at CERN. GGUS team ticket still open, possibly to track more issues: [OPEN] https://gus.fzk.de/ws/ticket_info.php?ticket=59728
      • Will impose rate limit in CASTOR at CERN on Monday for normal users

    • Weekly scope plan (post-ICHEP-wave)
      • Tier-0: End of technical stop and tests ongoing
      • Tier-1: running backfill if nothing else comes up / StorageConsistency checks at T1s. Sites have been contacted.
      • Tier-2: very low MC production activity
    • Tiju: role implementation at RAL is work in progress
    • Daniele: CNAF problem is fixed - VO should check

  • ALICE reports - Patricia
    • GENERAL INFORMATION: 90% of the currently running MC cycles completed. New MC requests foreseen for the weekend.
    • FOR ALL SITES: Several sites have seen very long jobs (over 46h) running and have asked what to do with such jobs. These are pathological jobs and even if they finish correctly their outputs cannot be used. Site admins can kill these jobs when they detect them at their sites. Also, as a preventive measure, sites can decrease the CPU time limit of the ALICE queues to 24h (see the sketch after this report).
    • T0 site
      • Issue reported yesterday through GGUS:60309. SOLVED and the solution has been already validated
    • T1 sites
      • CNAF: low rate of running jobs although the central TQ still has requests compatible with the site. Yesterday the ALICE contact person announced an upgrade of GPFS on the WNs.
    • T2 sites
      • Torino-INFN: site out of production. The current CREAM system is being updated to the latest 1.6 version.
      • Madrid-CIEMAT: Same operations ongoing
    • Daniele: the request to communicate the plan for the GPFS intervention discussed yesterday has been forwarded. The farm team deals with one rack at a time - each rack is different - therefore there is no estimate of the impact on VOs, nor of the duration or end time of the intervention, as it depends on the time needed to drain the racks. The site decided not to register an at-risk period. VOs should open GGUS tickets with CNAF to check whether the reduction of WN resources is consistent with the CNAF intervention.
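The "kill pathological jobs" advice above can be automated by flagging jobs whose consumed CPU time exceeds a threshold. The sketch below is a minimal illustration; the record format and helper name are hypothetical, only the 24h/46h figures come from the report.

```python
from datetime import timedelta

def jobs_to_kill(jobs, cpu_limit=timedelta(hours=24)):
    """Return the IDs of jobs whose consumed CPU time exceeds the limit.
    `jobs` is an iterable of (job_id, cpu_time) pairs, with cpu_time a
    timedelta as it might come from the batch system accounting."""
    return [job_id for job_id, cpu_time in jobs if cpu_time > cpu_limit]

# Example with made-up accounting records:
sample = [("job-1", timedelta(hours=3)),
          ("job-2", timedelta(hours=46)),   # stuck in an endless loop
          ("job-3", timedelta(hours=25))]
print(jobs_to_kill(sample))  # -> ['job-2', 'job-3']
```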

  • LHCb reports - Roberto
    • Experiment activities:
      • A lot of MC production ran in the last 24 hours and there is a lot of new data to be reconstructed and made available to users for ICHEP. Playing with internal priorities to boost the reconstruction activity.
      • All T1 site admins supporting LHCb are invited to update their WMSes with patch 3621. This has been running happily in production at CERN for weeks and would cure the CREAM CE problems.
    • T0 site issues:
      • Keeping for traceability the still ongoing VOMS-VOMRS synchronisation issue (GGUS:60017), awaiting expert feedback.
      • Manuel: this ticket does not correspond to any particular problem but rather to the fact that suspended users are (as they ought to be) deleted from VOMS to prevent them from using their certificates. The ticket was closed last Tuesday (20/07/2010).
    • T1 site issues:
      • CNAF: GGUS:60314 ("HOME not set" issue, as seen at NIKHEF). Solved: problem with LDAP.
      • RAL: reported some SAM jobs for FTS are failing.
      • PIC: noticed some SAM tests failing because of the shared area.
      • SARA: the queues there, after the intervention, are not matched by the LHCb production JDL (GGUS:60343).
      • IN2p3: shared area issue: going to deploy a solution adopted at CERN across all WNs (GGUS:59880).
      • IN2p3: inconsistency between BDII and SAMDB/GOCDB (GGUS:60321).
      • IN2p3: the CREAM CE cccreamceli03.in2p3.fr seems not to be publishing properly and hence is not matched (GGUS:60366).
      • IN2P3: all transfers to IN2p3 were failing. SRM bug, needed a restart at Lyon (GGUS:60329).
    • T2 site issues:
      • none
    • Vladimir: a new problem at SARA due to storage h/w failures. LHCb has banned SARA due to many job failures and would like to be informed of the progress, to re-enable the site.
    • Gonzalo: the PIC SAM test failure seems an isolated glitch (one test failed at 6am - the next one was OK). Will try to understand the reason.

Sites / Services round table:

  • Jon/FNAL - ntr
  • Michael/BNL - ntr
  • Gonzalo/PIC - ntr
  • Gang/ASGC - ATLAS pre-staging problem understood and solved. After fixing a tape-related problem (outside CASTOR), reading from the problematic tapes is now working.
  • Kyle/OSG - during yesterday's ALARM test all tickets were received, but the expected SMS to Michael was not. Michael: the phone number in the OSG DB is correct and other SMS have been received; not clear yet why the notification did not work.
  • Tiju/RAL - ntr
  • Xavier/KIT - an unsuccessful update of the job submission system with a security fix caused the system to be down for 2h. The site rolled back the server-side update, keeping just the client-side one (which closes the security problem). No running jobs were lost.
  • Daniele/CNAF - ntr
  • Rolf/IN2P3 - will follow up on the tickets mentioned above.
  • Christian/NDGF - heads-up on cooling intervention in Copenhagen next week. Some ATLAS tape data will be unavailable during the intervention.
  • Tristan/NL-T1 - one storage node defective (vendor working on it). On backup link to NDGF after problem with normal path.

AOB:

Friday

Attendance: local(Iuri, Jamie, Stephen, Lola, Patricia, Douglas, Julia, Jean-Philippe, Roberto, Edward, Nilo, Ale, Jacek, Manuel, Dirk);remote(Jon/FNAL, Michael/BNL, Tristan/NL-T1, Torre/NDGF, Xavier/KIT, Paolo/CNAF, Alexander/NL-T1, Rolf/IN2P3, Gang/ASGC, Vladimir/LHCb, Rob/OSG, Gonzalo/PIC, Tiju/RAL).

Experiments round table:

  • ATLAS reports - Iuri
    • General notes:
      • Group and centrally managed production runs smoothly: overall efficiency is 98% for the past 12h.
      • File transfer DDM efficiency is also very good.
    • T1s:
      • PIC: a minor issue with file transfer failures within the ES cloud. GGUS:60399 verified. Only a small number of files produced the errors, and the DDM efficiency was not bad for the transfers to T2s. These transfers finally succeeded on the next attempts; no failures anymore.
      • BNL: some file transfer failures from/to BNL due to high load on the PostgreSQL DB associated with the dCache namespace manager. Reported at 3:45am, resolved at ~5:30am. Transfer efficiency is back to normal (~95%). No GGUS filed, just 2 new Elog records.
    • T2s:
      • CA-SCINET-T2->PIC file transfer failures. GGUS:60368 solved: SRM restarted, some issue with the services that didn't fully recover after the disk pools being down for a couple of days due to the disk controller failures.
      • UKI-LT2-QMUL file transfer failures, destination errors with connection to SRM failure. GGUS:60401 assigned at ~5am. Site confirmed that one of their disk servers has crashed. Under investigation.
      • GOE->FZK file transfer failures with gridftp copy wait connection timeouts under investigation. GGUS:60381.
      • INFN-ROMA1: production jobs failed during ~2 hours yesterday evening with stage-in/out errors. GGUS:60398 assigned at ~10pm. The site team investigated: a large ATLAS file transfer volume saturated the local WAN at 1GBps, no other problems observed. In ~2 hours everything was back to normal, the traffic decreased and no job failures were seen anymore.

  • CMS reports - Stephen
    • Tier-0
      • Normal operation.
      • Tomorrow CMS may run for an hour without zero suppression (heavy-ion tests), which will stress the P5->IT link (among other things).
    • Tier-1s
      • [ OPEN ]Savannah #115876: JobRobot & SAM CE Errors at T1_TW_ASGC. The hand of god?
      • [ OPEN ]Savannah #115873: SAM CE prod & sft test jobs expiring for T1_DE_KIT
      • [ OPEN ]Savannah #115745: CNAF: Transfers failing to many sites. Site reports it is fixed and will provide more information later.
    • Tier-2s
      • [ OPEN ]Savannah #115875: File missing at T2_US_UCSD. Analysis results file, may be problem producing it.
      • [ OPEN ]Savannah #115872: Corrupted file at T2_DE_RWTH.
      • [ CLOSED ]Savannah #115845: File may be corrupt at T2_US_Wisconsin. Deleted and invalidated.
      • [ CLOSED ]Savannah #115844: Single user has authentication issue at CIEMAT. lcg-expiregridmapdir cron wasn't running.
      • [ OPEN ]Savannah #115841: Transfers to Aachen failing. Mostly in Debug, current errors in Prod are a source site issue.
      • [ OPEN ]Savannah #115839: Frontier activity from Warsaw not using local cache. Was small analysis cluster, advised to use existing cache in Warsaw or setup a new one.
      • [ OPEN ]Savannah #115811: SAM CE errors at T2_FI_HIP. Due to only SLC4 available on certain clusters.
      • [ OPEN ]Savannah #115808: Analysis SAM CE error at T2_TR_METU. Reported due to a faulty disk array losing files, they are clearing up.
      • [ OPEN ]Savannah #115806: Transfers from T2_US_Caltech to T2_RU_JINR failing. Partial transfer perhaps blocking retries. No feedback from T2_RU_JINR (still).
      • [ OPEN ]Savannah #115786: timeouts in transfers from T1_UK_RAL to T2_FI_HIP. Looks good now except for source sites with other known problems.
      • [ CLOSED ]Savannah #115694: Frontier seems down in T2_US_MIT although SAM tests are ok. Doesn't look like there was a problem with Frontier.

    • Notes
      • Tier-1s asked to configure new /cms/Role=t1production role. A few sites deployed it already (PIC, ASGC, KIT, IN2P3, FNAL, CNAF, waiting for RAL). In progress.
      • Slow AFS access behaviour at CERN. GGUS team ticket still open, possibly to track more issues: [OPEN] https://gus.fzk.de/ws/ticket_info.php?ticket=59728
      • Will impose rate limit in CASTOR at CERN on Monday for normal users

    • Weekly scope plan (post-ICHEP-wave)
      • Tier-0: End of technical stop and tests ongoing
      • Tier-1: running backfill if nothing else comes up / StorageConsistency checks at T1s. Sites have been contacted.
      • Tier-2: very low MC production activity

  • ALICE reports - Patricia
    • T0 site
      • Basically using the CREAM CE resources, no issues to report.
    • T1 sites
    • T2 sites
      • No new issues to report

  • LHCb reports - Vladimir
    • Experiment activities:
    • MC production running with the lowest priority, to allow reconstruction and analysis to run first.
    • Issues at the sites and services
    • T0 site issues:
      • none
    • T1 site issues:
      • RAL: suffering from jobs failing because they exceed the memory limit (which is a bit lower than what is required with the new data). Requested to increase the memory limit to 3GB (GGUS:60413).
      • SARA: downtime at risk, many users affected when accessing data. The storage was banned yesterday. What is the status of this intervention? (We need to run reconstruction at SARA.)
      • IN2p3: shared area issue: 25% of reconstruction jobs failed today because of it (GGUS:59880).
      • IN2p3: CREAMCE cccreamceli03.in2p3.fr. Corrected the information, tbc (GGUS:60366)
    • T2 site issues:
      • Shared area issue at some sites.
    • Rolf/IN2P3 - both problems are being worked on: a patch is needed on the worker nodes to improve the response (ticket updated).
    • Roberto: LHCb is exchanging a list of hot files with SARA - as only 3% of the storage is affected by their problem, some file rebalancing may allow use of the remaining part of the site.

Sites / Services round table:

  • Jon/FNAL - ntr
  • Michael/BNL - ntr. Rob: should we create a ticket about the missing SMS notification discussed yesterday? Michael: yes, please.
  • Alexander/NL-T1 - storage server still down; the vendor has not yet been able to solve the issue.
  • Torre/NDGF - ntr
  • Xavier/KIT - Atlas data access problems since yesterday - GGUS:60412 documents the problem.
  • Paolo/CNAF - The ticket about CREAM was solved a few minutes ago. The ticket about the low job number for ALICE will be updated soon.
  • Rolf/IN2P3 - ntr
  • Gang/ASGC - CMS SAM test problem - now recovered (related to the max wall clock time).
  • Tiju/RAL - ntr
  • Rob/OSG - ntr
  • Gonzalo/PIC - ntr
  • Jacek/CERN - from 6h to 10h one node of the CMS online DB was in a strange state; some applications suffered TCP timeouts. Problem with the ATLAS DB at TRIUMF after the storage migration - DB streaming does not work yet. Waiting for contact with the site experts.

AOB:

-- JamieShiers - 15-Jul-2010
