Week of 110314

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability SIRs, Open Issues & Broadcasts Change assessments
ALICE ATLAS CMS LHCb WLCG Service Incident Reports WLCG Service Open Issues Broadcast archive CASTOR Change Assessments

General Information

General Information GGUS Information LHC Machine Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions WLCG Blogs   GgusInformation Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Mattia, Jerka, Ale, Maria, Jamie, Maarten, Jan, Dirk, Ricardo, Simone, MariaDZ, Manuel);remote(Michael, Daniele, Roberto, Jon, Gareth, Ulf, Felix, Lorenzo, Gonzalo, Rolf, Onno, Stefano, Dimitri, Rob).

Experiments round table:

  • ATLAS reports - Weekend report:
    • CERN AFS low response: filed GGUS:68546, which triggered SNOW:INC020649. Can someone from the AFS team comment if there is any update, please? [ Jan - the ticket went to the helpdesk this morning, as it was filed after working hours. The helpdesk has posed a question ("do you still have the problem?"), but since that feedback is not fed back into GGUS, no answer can be expected there. ]
    • CERN-PROD SRM deletion slow: filed GGUS:68588, which triggered SNOW:INC021154. Can someone please comment if there is any update? [ Jan - will always answer in GGUS while the bi-directional gateway is not there. MariaDZ - the gateway is developed but its test period has not finished. ]
    • On Sunday 13th March 2011, at 18:04 CET, stable beams were declared. First 7 TeV data taking with stable beams in 2011!
    • ATLAS may start bulk HI reprocessing today.
    • Upgrade of PSI eLog to 2.9.0 is recommended by CERN Security Team.

  • CMS reports -
    • LHC / CMS detector
      • First stable beams of 2011 at 18:05 on 13th March 2011
        • Before: 1.38 TeV ramp & collisions, finally stable beams at 3.5 TeV
        • Then: filling scheme 1 probe + 3 nominal bunches (2-2.5 um), L ~1.2E30 cm-2s-1 for ATLAS/CMS.
        • First fill dumped after 7 hours at 01:00.
        • Refill – during the ramp shift crews realized that one bunch was in the wrong bucket: dump and restart.
        • Second fill in at 05:10. Bunch intensities slightly higher. L ~ 1.5E30 cm-2s-1 for ATLAS/CMS. Lost at 07:20 – electrical network (and more..)
        • Impressive speed of ramp and squeeze. Availability of the machine at 80% afterwards.
    • CERN / central services
      • CASTORCMS service availability degraded to <90% (as seen from SLS) (GGUS:68557). Opened because Ignacio suggested/asked for this at one of last week's WLCG calls. It was made clear in the ticket that this was not necessarily a problem, possibly just load. Further checks (from both CMS and the CASTOR team, thanks) showed it was indeed due to simultaneous heavy use of the t1transfer pool. (Comment: I have the impression we could be wasting precious time of the CASTOR Ops team by opening GGUS tickets in such cases, but if it helps them we will do as suggested.) [ Miguel - if the CMS shifter or ops can understand the degradation from the monitoring then there is no need to open a ticket. A ticket is useful when it is not clear where the degradation is coming from. ]
      • ELOG: no new notification of the 'cernmx' errors, still asking shifters to monitor it though - thanks to Stefan Roiser for his help
    • Tier-0 / CAF
      • Rates OK - very high at the beginning of the first run: 888 Hz - then soon 540 Hz, to be kept an eye on. Express ~50 Hz, running fine, no crashes
      • Prompt reco will start tonight or tomorrow morning
    • Tier-1
      • IN2P3: a couple of tickets still open, to be discussed at today's Ops meeting (slow progress in deletions, see SAV:119628; CERN-IN2P3 transfer degradation, see SAV:119259)
      • CNAF: accumulating migration backlog to tapes since Friday (SAV:119715), FIXED over the weekend. All smooth now.
    • Tier-2
      • NTR
    • [ CMS CRC-on-duty from Mar 8th to Mar 14th: Daniele Bonacorsi. Next CRC: Ian Fisk ]

  • ALICE reports - General Information: little activity for ALICE during the weekend. Since this morning we are running RAW data and Pass 0 reconstruction, and this afternoon a couple of MC cycles started. Tomorrow we expect to start the Pass 2 reconstruction of the LHC10h run
    • T0 site
      • GGUS:68549: large mismatches between what LSF reports for ALICE and what MonALISA shows as active jobs. There were a few hundred jobs stuck on the WNs running the 'arc' command. It looks like the problem comes from the gssklog code; the AFS experts have already been contacted. [ Jan - noted a few peaks of 5 GB/s on the ALICE disk pool. All went fine ]
    • T1 sites
      • RAL: GGUS:68578. The local CREAM-CE was reporting the famous 44444 jobs waiting. There was an occasional problem related to some timeout when querying the CEs; restarting the batch system component solved it. CLOSED
    • T2 sites
      • Usual operations

  • LHCb reports - MC productions running at full speed.
    • T0
      • DaVinci jobs failing for some production at CERN. Application devs looking at that.
      • cvmfs - installed on some batch nodes at the moment - over the weekend it proved to be OK for LHCb, so we can go ahead and enable LHCb to use cvmfs on the rest of the configured nodes.
    • T1
      • RAL: network intervention tomorrow.
    • T2 site issues:
      • 3 T2 sites failed for MC productions; GGUS tickets have been opened for these sites
    • SAM tests: IN2P3 having failures related to s/w installation - this has decreased the availability of the site; Roberto / Stefan have been notified

Sites / Services round table:

  • BNL - ntr
  • FNAL - ntr
  • RAL - had a couple of things over the w/e. 1 ATLAS disk0tape1 disk server out of action - transparent to the VO. Fault on 1 out of 3 SRM nodes for LHCb - out of production over the w/e but the 2 remaining nodes continued. May have an FTS issue but no more news. Outage on network routers scheduled in GOCDB for tomorrow morning.
  • NDGF - had network problems on Saturday 14:00 - 16:40 UTC: most of the Danish pools and ARC-CE resources went offline due to an unannounced network maintenance that should have lasted at most 5 minutes.
  • ASGC - ntr
  • CNAF - ntr
  • PIC - ntr
  • IN2P3 - ntr
  • NL-T1 - ntr
  • KIT - ntr
  • OSG - continuing to work on correcting BNL SAM records. Have corrected unknown and scheduled maintenance times. Some more corrections for Feb are being prepared. Caught and corrected a couple of bugs, but aware of another one from the OSG All Hands meeting last week.

  • CERN storage - update for CASTOR ALICE on Wednesday - new xroot code and ALICE transfer plug-in. Doing a risk assessment. For info, for LHCb we have seen a number of DB issues on the SRM; these are fixed in a later release, so we should prepare an update. Currently low level but it can blow up quite quickly.

AOB:

Tuesday:

Attendance: local(Jamie, Mattia, Simone, Marcin, Stefan, Maarten, Pedro, Alessandro, Jan, Philippe);remote(Jon, Michael, John, Ulf, Gonzalo, Ronald, Felix, Rolf, Ian, Rob, Vladimir Sapunenko).

Experiments round table:

  • ATLAS reports -
    • CERN AFS low response: filed GGUS:68546, which triggered SNOW:INC020649.
    • CERN-PROD SRM deletion slow: filed GGUS:68588, which triggered SNOW:INC021154.
      • One ATLAS application is thought to hammer the CASTOR name server (from the trigcom account). The person responsible for the application has been contacted; according to them, the application lists the content of one CASTOR directory every 30 mins. CASTOR support has been asked to provide further details (names of files with timestamps etc ...) [ Jan - not fully understood what is happening but see below ]
      • Discovered a wrong execution plan in the SRM database, presumably linked to the update to 2.10 (when the Oracle stats were lost). CASTOR experts working on this; a sketch of how such statistics are typically regathered is shown after this list. [ Jan - the DB execution plan fix seems to have solved the problem according to our monitoring. Ale - in the last hour the deletion rate increased considerably - now 4 Hz. ]
    • Problems accessing files from RAL for 90 minutes yesterday afternoon (around the time of this meeting). Initially the problem was thought to be FTS, but RAL then confirmed the issue was in the CASTOR SRM. Promptly fixed.
    • ATLAS Site Services had three short degradations in availability during the night (periods of a few minutes, down to 30%). This means the sensor could not collect the metrics (nothing was wrong with the services per se). Looks like a glitch in Lemon (or AFS). ATLAS experts are investigating and will escalate if needed.
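    A note on the execution-plan issue above: when optimizer statistics are lost, Oracle can keep using a poor plan until the statistics are regathered and the cached cursors are invalidated. A minimal sketch of regathering table statistics from Python (via cx_Oracle) is below; the connection details, schema and table names are hypothetical, and the actual fix applied by the CASTOR DB team may well have been different (e.g. restoring the old statistics or correcting the plan directly).

        import cx_Oracle  # Oracle client library, assumed to be installed and configured

        # Hypothetical connection details -- for illustration only.
        conn = cx_Oracle.connect("srm_admin", "secret", "srm-db.example.org/SRMDB")
        cur = conn.cursor()

        # Regather optimizer statistics for one (hypothetical) table and invalidate
        # cached cursors so dependent statements are re-parsed with a fresh plan.
        cur.execute("""
            BEGIN
              DBMS_STATS.GATHER_TABLE_STATS(
                ownname       => :owner,
                tabname       => :table_name,
                cascade       => TRUE,    -- also gather index statistics
                no_invalidate => FALSE);  -- invalidate dependent cursors now
            END;""",
            owner="CASTOR_SRM", table_name="SRM_SUBREQUEST")

        conn.close()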

  • CMS reports -
    • LHC / CMS detector
      • Expecting 32 bunches today
      • So far CMS trigger is running at a higher rate than expected in stable operations, but expected for now
      • The 48h calibration hold is operational - reco jobs from Sunday afternoon just entering the system. No crashes in the express dataset line
    • CERN / central services
      • Lost the Offline DB again. This is the second failure in 1 week. Many offline services impacted. Only 2 things in the data taking chain are really critical: CASTOR and Oracle. Within 15' jobs relying on the conditions cached in FroNTier start failing. Good to get a post-mortem of this. [ Marcin - since Friday's failure the CMS offline DB was on standby h/w. Today at 11:50 it experienced a serious multi-disk failure of the standby h/w. The original DB was recovered on the original h/w and the service was failed back to it. 1.5h downtime for the DB. Now the DB is OK and working on the original h/w. The problem on Friday was caused by a power cut; today's problem was caused by a multi-disk failure. A SIR will be provided soon - looking to see why we experienced these failures. ]
    • Tier-0 / CAF
      • Prompt reco started and the success rate looks OK.
    • Tier-1
      • Production problem at FNAL with inconsistent software distributions. Should be fixed with the rollout of CVMFS.
    • Tier-2
      • NTR

  • ALICE reports - General Information: ClusterMonitor service is failing at some sites. It looks like there is a configuration problem for that service. Ongoing
    • T0 site
      • GGUS:68549: There were a few hundred jobs stuck on the WNs running the 'arc' command. The gssklog problem was solved. Ticket closed, but we still see the mismatch between MonALISA and what the site reports.
    • T1 sites
      • CNAF GGUS:68604: VOBox was not reachable yesterday morning. Problem solved
      • CNAF GGUS:68625: after the VOBox was back there was a problem with the proxy delegation to the CREAM-CE. One of the CREAM-CEs had to be reconfigured.
      • IN2P3: affected by the ClusterMonitor problem
    • T2 sites
      • Usual operations

  • LHCb reports - MC productions running without problems
    • T0
      • DaVinci problems (reconstruction program) have been understood: they were due to a mismatch in Python versions. A new configuration fixing the problem has been deployed on the Grid sites.
      • cvmfs - after successful tests during the w/e on a subset of migrated nodes the cvmfs usage at CERN has been extended to all nodes currently equipped with cvmfs.
    • T1
      • NTR
    • T2 site issues:
      • Chasing up problems with T2 sites on MC productions
    • IN2P3 SAM test issues reported yesterday - problems were coming from tests: now fine.

Sites / Services round table:

  • FNAL - ntr
  • BNL - ntr
  • RAL - downtime this morning, everything seems ok.
  • NDGF - ntr
  • PIC - ntr
  • NL-T1 - ntr
  • ASGC - ntr
  • IN2P3 - network failure yesterday evening at about 21:00. The connection to the national network provider was broken; other connections, in particular to CERN, were not affected. It lasted 30' and was due to a h/w failure on a switch. Running jobs had no problems; some jobs stayed longer in the queues - the only effect on production that we saw.
  • KIT - had an outage of the dCache SRM instance this morning. Restarted the services; after 1h the services were back, and we are investigating the reason. Possibly h/w problems.
  • OSG - ntr

  • CERN storage: xroot upgrade for ALICE tomorrow morning (1h downtime registered in GOCDB). Would ask that this is not counted against site availability, as the notice was <24h because the experiment asked us to do this quickly. Waiting for a date from LHCb to upgrade their SRM instance.

AOB: (MariaDZ on duty as a CERN guide at this time today) Put some comments in GGUS:68331 about https://cern.service-now.com/nav_to.do?uri=u_request_fulfillment.do?sysparm_query=number=RQF0002233 both on the workflow and the (not yet offered) solution. Measures were also taken to handle creation of duplicate tickets when submitters put bla.support@cern.ch in Cc: (see https://savannah.cern.ch/support/?118728#comment2). TEAM tickets' email notification to CERN Service Managers is being discussed with PES. Please wait for the relevant T1SCM presentation this week.

Wednesday

Attendance: local(Maria, Jamie, Simone, Fernando, Ricardo, Jan, Alessandro, Eva, Nilo, Maarten, Roberto, Mattia, Lola);remote(Michael, Jon, Felix, Gonzalo, Ulf, John, Ian, Jeremy, Onno, Rolf, Foued, Rob).

Experiments round table:

  • CMS reports -
    • LHC / CMS detector
      • Still hoping for 32 bunches today
      • CMS trigger is still running at a higher rate than expected in stable operations, but only 1-2 more fills
      • The 48h calibration hold is operational. The beam spot was calculated and added in prompt reco
    • CERN / central services
      • Recovered from database loss yesterday
    • Tier-0 / CAF
      • Problem with the prompt reco injector that seems to be related to an Oracle upgrade. Database team notified. The CMS code has changed since the fall and is now crashing ~2x a day.
    • Tier-1
      • Data being distributed custodially to Tier-1s
      • New pp data being dwarfed by the HI transfers to FNAL: 600-800 MB/s of data for the HI run, far exceeding the pp load
      • Problem with stager at ASGC, SAV:119798: Transfers not flowing from ASGC, but seems to be recovering
    • Tier-2
      • NTR

  • ALICE reports - General Information: ClusterMonitor service is failing at some sites. It looks like there is a configuration problem for that service. Ongoing
    • T0 site
      • Nothing to report [ the mismatch reported yesterday is understood - LSF reported correctly; a large fraction of the jobs were not accounted for in MonALISA as they had been killed by the production manager. A "known deficit" that may be fixed later this year ]
    • T1 sites
    • T2 sites
      • Several operations

  • LHCb reports -
    • MC productions running without major problems
    • Validation of workflows for Collision 11 data taking is ongoing.
    • T0
      • SAM test jobs on CVMFS seem to have a problem finding files. It seems related to a cache issue; Steve is working on a patch.
      • Intervention on SRM-lhcb: not before next Monday (we have to drain a backlog of data from ONLINE to CASTOR). In any case it will depend on the LHC schedule.
    • T1
      • IN2P3 having problems with shared s/w area and tests
    • T2 site issues:
      • NTR

Sites / Services round table:

  • BNL - ntr about the T1. Update on BNL-CNAF connectivity: after re-engineering the link and careful testing, the routing of traffic has been restored to the original circuit. It is now operational and can be moved back to production. Achieved 80% of the available bandwidth with no failures, consistent with the counters looked at by the network engineers. The providers have been asked to restore the original routing. Will keep the ticket open for 1 more month; if there are no more failures it will be closed. Simone - what was the problem? Michael - numerous issues: a buffer size mismatch between network providers, several upgrades made, faulty equipment. The whole circuit was re-engineered, basically a new circuit.
  • FNAL - ntr
  • ASGC - [ text missing ] 2 disk servers have errors; one is understood, the other is still being worked on.
  • PIC - issue mentioned by ATLAS: between midnight and 04:00 six disk servers failed (hung) and were rebooted this morning. This caused ~400 ATLAS reprocessing jobs to fail. The cause of the issue is in the GGUS ticket and needs to be followed up. Looks like a kernel issue (CentOS 5.6), possibly some kernel parameter. Since 10:00 this morning jobs are running fine.
  • NDGF - ntr
  • RAL - lost disk server last night - ATLAS datadisk. Since meeting began has been put back into production.
  • NL-T1 - ntr
  • KIT - 3 of the CREAM CEs are now in a 1h downtime to reduce the job load
  • IN2P3 - ntr
  • GridPP ntr
  • OSG - ntr

  • CERN Storage - the xroot update for ALICE this morning went OK.

AOB: (MariaDZ) GGUS:68331 contains some updates. This URI is now fixed in yesterday's AOB as well (had access restricted).

Thursday

Attendance: local(Ian, Philippe, Stefan, Maarten, Jamie, Maria, Dan, Fernando, Stephane, Mike, Pedro, Jan, Lola, Alessandro, Nilo, Maite, MariaDZ);remote(Michael, Jon, Gonzalo, Vladimir, Ronald, DImitri, Ulf, Tiju, Rolf, Rob, Jhen-Wei).

Experiments round table:

  • ATLAS reports -
    • CASTOR deletion speed: the statement that free space does not increase after deletion is a monitoring artifact. Has to do with disk servers offline and the way SRM reports total and free space. [ Issue understood and closed ]
    • TAIWAN-LCG2 is suffering from high load on disk servers after the recall from tape. Jobs fail with timeout in accessing files from staging buffer. Site managers are adding disk servers to share the load (GGUS:68714)
    • Another 5 disk servers crashed last night at PIC (affecting approx 150 reprocessing jobs). The crash happened at nearly the same time as the previous night. PIC investigating (GGUS:68681) [ Gonzalo - as reported, the issue reproduced today at a very similar time to yesterday. Looks like some daily cron process. Not completely clear what is happening. The current suspicion is that a DB update process is looking at pnfs via an NFS mount. Not able to reproduce it on the command line, but something funny is happening in cron. People are still working on it. Trying to solve... ]
    • Timeouts accessing data from the tape buffer at FZK-LCG2. Many jobs were trying to access files from a few servers in the tape buffer and the servers exhausted the max number of available configured connections. Parameters will be tuned at the next burst. Any reason why the data ended up in a few servers to begin with? [ Dimitri - will forward issue and get feedback ]

  • CMS reports -
    • LHC / CMS detector
      • Still hoping for 32 bunches today
    • CERN / central services
      • Problem found in one of the CMS data service components. The problem causes all files injected into PhEDEx to be registered for FNAL, ignoring the original site allocation. We were forced to shut down new central registrations until it is fixed. This impacts new data and store results; they will not be transferred immediately, since they are not injected. Expected to be repaired and cleaned this afternoon.
      • GGUS:68710: Either certificate or hostname wrong for CMS e-logbook [ Stefan - will follow up ]
    • Tier-0 / CAF
      • Problem with the prompt reco injector identified as an Oracle issue. A patch has been identified; waiting for a restart. [ Eva - the problem reported is a known bug affecting 10.2.0.5. The patch has been on INTR for one month and will be applied on production asap ]
    • Tier-1
      • NTR
    • Tier-2
      • NTR

  • ALICE reports -
    • T0 site
      • We applied a patch to the Cluster Monitor to avoid the high memory consumption it has suffered for a while
    • T1 sites
      • FZK: GGUS:68704. Submission had been failing since last night on cream-1 and cream-3. Our CE AliEn service was still trying to submit because the CREAMs were not put in maintenance, so for the service they were in production. Please, when something like that happens, put the CEs in maintenance, otherwise we will keep submitting. [ Dimitri - we declared a downtime in GOCDB. Maarten - you have to put the batch system in drain mode so the info providers will not advertise the CEs as in production; see the sketch after this report ]
      • FZK: GGUS:68698. Submission NAGIOS test was not working using a particular certificate, the one that NAGIOS uses for the tests, due to a misconfigured range of accounts in /etc/sudoers.
    • T2 sites
      • TriGrid: site was down for a couple of days because AliEn was not installed. Waiting for the site admin to do it, we do not have permission
      • Usual operations
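    Regarding Maarten's point on drain mode above: declaring a downtime in GOCDB alone does not stop the information providers from advertising the CE as in production; the underlying batch queue has to stop accepting and dispatching work. A minimal sketch of what that could look like on a Torque/PBS-based site is below (the queue name, and doing it from Python via subprocess, are illustrative assumptions; the exact procedure is site- and batch-system-specific).

        import subprocess

        QUEUE = "alice"  # hypothetical name of the queue used by the ALICE VO

        def run_qmgr(setting: str) -> None:
            """Apply one qmgr setting on the Torque/PBS server (needs manager rights)."""
            subprocess.run(["qmgr", "-c", setting], check=True)

        def drain_queue(queue: str) -> None:
            # Stop accepting new submissions into the queue ...
            run_qmgr(f"set queue {queue} enabled = False")
            # ... and stop dispatching already-queued jobs to the worker nodes.
            run_qmgr(f"set queue {queue} started = False")

        if __name__ == "__main__":
            drain_queue(QUEUE)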

  • LHCb reports -
    • MC productions running without major problems
    • Validation of workflows for Collision 11 data taking is ongoing.
    • T0
      • SAM test jobs on CVMFS seem to have a problem finding files. It seems related to a cache issue; Steve is working on a patch.
      • Intervention on SRM-lhcb. The system is drained; the proposal is to have the intervention on Monday morning - still to be confirmed (the previous proposal of Friday was rejected)
    • T1
      • Problem with pilot jobs aborting at PIC (GGUS:68721)
      • FTS problems CERN -> CNAF replicating MC-DST are currently under investigation
      • SAM failures at IN2P3 now cleared up - site green for past 2 days
    • T2 site issues:
      • NTR

Sites / Services round table:

  • BNL - ntr
  • FNAL - ntr
  • PIC - issue mentioned by Stefan: a power glitch affected a few nodes (extra capacity at a datacentre without UPS). 15-20 nodes went down and came back in a strange state, causing some job failures for LHCb and CMS; the nodes were removed and the system was then fine again. Those nodes were configured to restart automatically after a power glitch; they will now stay off until a manual intervention brings them back properly
  • KIT - problems with 2 CREAM CEs (CREAM-1 and CREAM-2). Investigating. One of them has out-of-memory Java errors, probably due to the >6 GB MySQL DB on that host; this machine has been put in downtime until tomorrow.
  • CNAF - ntr
  • NL-T1 - ntr
  • NDGF - ntr
  • RAL - ntr
  • IN2P3 - ntr
  • ASGC - load issue: we have one disk server assigned to the pool; another one will be assigned to the staging pool very soon. Hope this improves the situation
  • OSG - ntr

  • CERN - ntr

AOB:

Friday

Attendance: local(Mattia, Maria, Jamie, Jan, Simone, Maarten, Lola, Alessandro, Massimo, Dan, Eva, Philippe, Fernando, Ian);remote(Michael, Ulf, Jon, Xavier, Rolf, Roberto, Gonzalo, Jhen-Wei, Tiju, Onno, Vladimir, Lorenzo).

Experiments round table:

  • ATLAS reports -
    • Issue accessing staged files at INFN-T1. The garbage collector does not count the staging of a file as "one access", therefore it keeps old files (already accessed by jobs) while deleting new ones. A patch has been put in place on the fly by the INFN-T1 experts. Looks OK now.
    • The TAIWAN disk buffer overload (reported yesterday) seems to have re-appeared. The TAIWAN site admins verified the problem was due to some problematic disk server.
    • Some files needed multiple attempts (SRMBringOnline) before being staged at SARA (up to 20). The SARA MSS was very busy at that time and was staging at very high throughput. SARA experts are investigating how to optimize further.
    • RAL LFC: 2 out of 3 machines in the LFC load balancing were "lost" for a few hours (looks like an effect of a network hardware issue during the night). The machines are now back in production. No negative effect observed on production activities during the degraded period.
    • GGUS was not allowing TEAM people to update TEAM tickets yesterday night. Gunter confirmed there was a problem in one module and fixed it.
    • For approx 20 mins this morning it was not possible to log into Savannah. Problem solved before someone in ATLAS could file a complaint.

  • CMS reports -
    • LHC / CMS detector
      • Still hoping for 32 bunches today
    • CERN / central services
      • Problems in CMS Data Injector are fixed. Broken blocks are re-injected.
      • Meeting with IT-ES yesterday, and an operational agreement reached on how to support the web redirector
    • Tier-0 / CAF
      • NTR
    • Tier-1
      • NTR
    • Tier-2
      • NTR

  • ALICE reports -
    • CERN
      • GGUS:68549 - still found > 1k jobs stuck in AFS on exit
      • Trying to understand and resolve discrepancies between MonALISA and LSF: many jobs were found stuck in the 'SAVING' state; the timeout/retry policy needs improvements (a sketch follows after this report). These mismatches have been seen since the change from AFS to Torrent for the s/w area 2 weeks ago, though it is not clear whether there is a strong correlation. Jobs still have some AFS references, which will now be weeded out. In principle the mechanism hasn't changed for many years, but as we no longer depend on this functionality we will switch it off (hopefully early next week). On the MonALISA / LSF mismatch: the LSF-reported jobs really are running, whereas MonALISA considers only a fraction of them useful. The upload of output to the SE at the end of a job has no timeout and can get stuck for hours on an unresponsive SE. We have to address this soon - experts are on it. There is always a remote SE involved; high load or some other condition can cause this.
      • 11 new machines added to CAF
    • T1
      • KIT - GGUS:68704: should be fixed, pending verification on ALICE side
      • KIT - GGUS:68698: Nagios submission OK now
    • T2
      • usual operations
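    On the upload timeout issue above: the kind of per-attempt timeout plus retry policy being discussed could look roughly like the sketch below. The upload callable, timeout values and retry count are illustrative assumptions, not the actual AliEn implementation.

        import concurrent.futures
        import time

        def upload_with_retries(upload, attempts=3, per_attempt_timeout=600, backoff=60):
            # `upload` is any zero-argument callable performing the transfer to the
            # storage element (hypothetical here); returns its result, or raises
            # after the last attempt.
            for attempt in range(1, attempts + 1):
                # Fresh single-worker pool per attempt, so a timed-out (but still
                # running) transfer thread does not block the next attempt.
                pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
                future = pool.submit(upload)
                try:
                    return future.result(timeout=per_attempt_timeout)
                except concurrent.futures.TimeoutError:
                    print(f"attempt {attempt}: no result after {per_attempt_timeout}s, abandoning it")
                except Exception as exc:
                    print(f"attempt {attempt} failed: {exc}")
                finally:
                    # Do not wait for a stuck thread; Python cannot kill it -- it ends
                    # whenever the underlying transfer call finally returns.
                    pool.shutdown(wait=False)
                if attempt < attempts:
                    time.sleep(backoff)  # simple fixed back-off between attempts
            raise RuntimeError("upload did not succeed within the retry budget")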

  • LHCb reports -
    • MC productions running at lower pace
    • Validation of workflows for Collision 11 data taking is ongoing.
    • By mistake 17K sdst files (output of the reconstruction and input for any eventual re-stripping campaign) have been deleted at 5 sites. We asked the affected sites to try and do their best to recover these data from their tape systems. It is not urgent (we do not have a clear idea when the next reprocessing will happen), but it would be vital to recover as much of these data as possible. PIC has done it already. CERN, RAL and GridKA will look at it; IN2P3 reports there is no way to recover them. [ Rolf - just discovering this now; will follow up ]
    • T0
      • Asked IT-PES for a way to discriminate at job submission between WNs mounting AFS and those mounting CVMFS, in order to steer SAM s/w-installation jobs to the right shared-area flavour.
    • T1
      • NTR
    • T2 site issues:
      • NTR

Sites / Services round table:

  • BNL - ntr
  • NDGF -
  • FNAL - ntr
  • KIT - had some problems with the CREAM service. CREAM-1 & -3 are back online; CREAM-2 will stay in downtime until Monday. Currently staging the erroneously removed files; when done we will get them onto the VO box, after which they will have to be written back into dCache again
  • IN2P3 - minor network problem at 17:45, lasting 5'. A switch connecting others rebooted for no apparent reason. No real impact on production. If you saw something at that time this might be the cause
  • PIC - ntr
  • ASGC - ntr
  • RAL - yesterday night had problem with network switch. Many services degraded for ~3h. All back now.
  • NL-T1 - a few things: 1) reminder that on Tuesday 22nd there is a scheduled downtime for the SARA SRM and CREAM CE. 2) Have a problem on some WNs - less local disk space than expected, probably a configuration error during a local reinstall. We keep them running as ATLAS is doing a lot of work and we don't want to interrupt that; it only affects some jobs. During Tuesday's downtime we will probably have some time to investigate. 3) Working on the decommissioning of an ATLAS MC space token. Not all space from this token has been fully released, so we asked for assistance from the dCache developers. Yesterday we tried some procedures to release the space, but they triggered some bugs in dCache and caused two SRM crashes. The SRM was restarted immediately, but we will now wait for the dCache developers to come up with a proper procedure
  • CNAF - ATLAS already reported about staging problems. NTA

  • CERN storage - still trying to recover the LHCb files. The majority will probably be OK; we need to check whether any tapes have been recycled. If not, all should be fine.

AOB:

  • Summer time in Europe starts on Sunday March 27th 2011 ("Spring forward, Fall back").

-- JamieShiers - 09-Mar-2011
