WLCG Tier1 Service Coordination Minutes - 19 May 2011
Attendance
Action list review
Release update
Data Management & Other Tier1 Service Issues
WLCG Baseline Versions
Status of open GGUS tickets
The meeting will not take place this time. The e-group wlcg-service-coordination is asked to comment on issues offline. Are the "Type of Problem" values in https://twiki.cern.ch/twiki/bin/view/LCG/WLCGCriticalServices#GGUS_Type_of_Problem_field necessary and sufficient for TEAM and ALARM tickets? Comments to Maria Dimou please.
Review of recent / open SIRs and other open service issues
Conditions data access and related services
Database services
- Experiment reports:
- ATLAS:
- The first instance of the ATLAS offline database (ADCR) crashed on Sunday (08.05). The issue was caused by an internal database error and is now under investigation. Services remained available on the surviving nodes while the instance restarted and were relocated back to instance one after it came back into operation.
- We had three hangs of the ATLAS offline DB (ADCR) during which the service was not available: Monday 16th between 16:25 and 17:10, Monday 16th between 21:50 and 23:30, and Tuesday 17th between 01:50 and 02:40. No data was lost apart from uncommitted data. All incidents were caused by an unusual reaction of ASM to a broken disk (itstor737 disk 3). ASM did not properly initiate a rebalance operation during the first incident and was affected by some problems during the second and third. After the incidents a normal rebalance finished and we attempted to forcefully evict the problematic disk. An SR has been opened on this issue.
- On 18th of May around 11:20 the ADCR DB experienced another disk failure during a rebalancing operation, which did not finish. The decision was taken to switch the DB over to the standby cluster. The switchover completed successfully after several minor issues and the DB was back in operation at 13:05. IN2P3 reported that AMI applications were not able to reach the DB. It turned out that the DB was not visible from outside CERN. We requested that the port on the firewall be opened, and this was done the next day (the morning of 19th of May).
| Site | Status, recent changes, incidents, ... | Planned interventions |
| ASGC | | |
| BNL | CPU April 2011 and OS kernel patches deployed in VOMS and Conditions database clusters. Applied Streams patch (9232517) in Conditions Database. | Apply CPU April 2011 to the LFC and FTS cluster and the standby database cluster. |
| CNAF | | |
| KIT | | |
| IN2P3 | | |
| NDGF | Nothing to report | None |
| PIC | Nothing to report | Planning to apply the April CPU patch in two weeks' time. No exact date yet. |
| RAL | April CPU (+ recommended patches) has been applied on CASTOR DB and 3D. Started testing the new HW that will be used for the Data Guard configuration. | |
| SARA | Nothing to report | On the 24th of May: upgrade to 10.2.0.5 and application of CPU April 2011 and all other recommended patches. |
| TRIUMF | | |
AOB
--
JamieShiers - 05-May-2011