WLCG Tier1 Service Coordination Minutes - 5 May 2011
Attendance
Action list review
Release update
Data Management & Other Tier1 Service Issues
| Site | Status | Recent changes | Planned changes |
| CERN | CASTOR 2.1.10 (all); SRM 2.10-x; xrootd: ALICE 2.1.10 update 1, others: 2.1.9-7 |  | Planned move to SLC5 nodes for the Name Server (May 10). Some Oracle security patches will also be deployed in the coming days |
| ASGC | CASTOR 2.1.10-0; SRM 2.10-2; DPM 1.8.0-1 | CASTOR upgraded for ATLAS and CMS | 7/5: 8-hour downtime due to DPM DB optimisation; move DPM from T2 to T1, after which all transfers from/to ASGC DPM will go via the FTS T1 channel (shared with CASTOR) |
| BNL | dCache 1.9.5-23 (PNFS, Postgres 9) | None | Transition to Chimera planned in summer 2011 |
| CNAF | StoRM 1.5.6-3 SL4 (CMS, LHCb, ALICE); StoRM 1.6 SL5 (ATLAS) |  |  |
| FNAL | dCache 1.9.5-23 (PNFS); httpd=1.9.5.-25; Scalla xrootd 2.9.1/1.4.2-4; Oracle Lustre 1.8.3 | None | None |
| IN2P3 | dCache 1.9.5-24 (Chimera) on all core servers and pool nodes |  |  |
| KIT | dCache (admin nodes): 1.9.5-15 (Chimera), 1.9.5-24 (PNFS); dCache (pool nodes): 1.9.5-9 through 1.9.5-24 |  |  |
| NDGF | dCache 1.9.12 |  |  |
| NL-T1 | dCache 1.9.5-23 (Chimera) (SARA), DPM 1.7.3 (NIKHEF) |  |  |
| PIC | dCache 1.9.5-25 (PNFS, Postgres 9) |  |  |
| RAL | CASTOR 2.1.10-0, 2.1.9-1 (tape servers); SRM 2.10-2, 2.8-6 | None | ALICE SRM upgrade to 2.10-2 on 10/5/11. Update to support T10KC tape media during May |
| TRIUMF | dCache 1.9.5-21 with Chimera namespace |  |  |
CASTOR news
CERN operations
Development
xrootd news
dCache news
- SRM overwrite for dCache at SARA: dCache provides the feature, but Ron has to decide whether to enable it. The dCache team is currently documenting how overwrite works in dCache with respect to the tape backend. We expect this to be available within days.
StoRM news
- Checksum and recommended versions: for SL5 the only version is 1.6.2. For SL4, checksum support is available since 1.5.*. The checksum is calculated in two ways that can coexist:
- via GridFTP on the fly
- via the Checksummer service (on demand)
The checksum is returned whenever it is defined as a file attribute, and it is calculated by either of the above options.
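The coexistence of the two calculation paths can be illustrated with a minimal sketch. It is not StoRM code: the `ChecksumStore` class, its method names, and the choice of Adler-32 as the algorithm are all assumptions made for illustration; the minutes do not state which algorithm StoRM uses or how the attribute is stored.

```python
import zlib


def adler32_hex(data: bytes) -> str:
    """Return the Adler-32 checksum of data as 8 hex digits (algorithm is an assumption)."""
    return format(zlib.adler32(data) & 0xFFFFFFFF, "08x")


class ChecksumStore:
    """Hypothetical stand-in for per-file checksum attributes on a storage element."""

    def __init__(self):
        self._attrs = {}  # path -> hex checksum

    def on_transfer(self, path, chunks):
        """On-the-fly path: update a running checksum while the chunks stream in."""
        value = 1  # Adler-32 starting value
        for chunk in chunks:
            value = zlib.adler32(chunk, value)
        self._attrs[path] = format(value & 0xFFFFFFFF, "08x")

    def on_demand(self, path, read_file):
        """On-demand path: compute only if no attribute is defined yet."""
        if path not in self._attrs:
            self._attrs[path] = adler32_hex(read_file(path))
        return self._attrs[path]

    def get(self, path):
        """Return the checksum if it is defined as an attribute, else None."""
        return self._attrs.get(path)
```

Either path ends by setting the same attribute, which is why a later query can return the value without knowing which mechanism produced it.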
FTS news
DPM news
LFC news
- LFC 1.8.0-1 for gLite 3.1: waiting for rebuild of the meta package with the correct VOMS libraries (1.9.10-14)
LFC deployment
| Site | Version | OS, n-bit | Backend | Upgrade plans |
| ASGC | 1.8.0-1 | SLC5 64-bit | Oracle | None |
| BNL | 1.8.0-1 | SL5 64-bit | Oracle | None |
| CERN | 1.7.3 | SLC4 64-bit | Oracle | Upgrade to SLC5 64-bit pending |
| CNAF | 1.7.4-7 | SL5 64-bit | Oracle |  |
| FNAL | N/A |  |  | Not deployed at Fermilab |
| IN2P3 | 1.8.0-1 | SL5 64-bit | Oracle 11g | Oracle DB migrated to 11g on Feb. 8th |
| KIT | 1.7.4-7 | SL5 64-bit | Oracle | Oracle backend migration pending |
| NDGF | 1.7.4.7-1 | Ubuntu 9.10 64-bit | MySQL | None |
| NL-T1 | 1.7.4-7 | CentOS5 64-bit | Oracle |  |
| PIC | 1.7.4-7 | SL5 64-bit | Oracle |  |
| RAL | 1.7.4-7 | SL5 64-bit | Oracle |  |
| TRIUMF | 1.7.3-1 | SL5 64-bit | MySQL |  |
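As a reading aid for the LFC deployment table, the sketch below lists the sites whose reported version is older than a chosen reference. The version strings are copied from the table; using 1.8.0-1 as the reference is only an assumption for illustration (the minutes do not declare it a required baseline), and the helper names `parse_version` and `sites_below` are hypothetical.

```python
import re

# Versions as reported in the LFC deployment table; "N/A" means not deployed.
LFC_VERSIONS = {
    "ASGC": "1.8.0-1", "BNL": "1.8.0-1", "CERN": "1.7.3", "CNAF": "1.7.4-7",
    "FNAL": "N/A", "IN2P3": "1.8.0-1", "KIT": "1.7.4-7", "NDGF": "1.7.4.7-1",
    "NL-T1": "1.7.4-7", "PIC": "1.7.4-7", "RAL": "1.7.4-7", "TRIUMF": "1.7.3-1",
}


def parse_version(text):
    """Turn '1.7.4-7' into (1, 7, 4, 7); return None if there are no digits."""
    parts = re.findall(r"\d+", text)
    return tuple(int(p) for p in parts) if parts else None


def sites_below(versions, reference):
    """Return the sorted sites whose version compares lower than the reference."""
    ref = parse_version(reference)
    result = []
    for site, version in versions.items():
        parsed = parse_version(version)
        if parsed is not None and parsed < ref:
            result.append(site)
    return sorted(result)


print(sites_below(LFC_VERSIONS, "1.8.0-1"))
```

Comparing tuples of integers avoids the pitfalls of plain string comparison, and sites without a deployed LFC are skipped rather than reported.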
Experiment issues
WLCG Baseline Versions
FTS & overwrite mode 10'
Consistency of storage elements & LFC 15'
Status of open GGUS tickets
GGUS - Service Now interface: update
Review of recent / open SIRs and other open service issues
VOMS SL4->5 migration
Conditions data access and related services
Database services
- Patching recommendations: the patches listed below were or will be applied on top of vanilla 10.2.0.5 at CERN and are also recommended for T1 databases:
- 9232517 PROPAGATION MISSING MESSAGES AFTER DESTINATION QUEUE OWNERSHIP SHIFT ON RAC - highly recommended for T1 DBs
- 9184754 SGA corruption / ORA-600 [ktcccenxt] / dump using Lobs
- 9577583 FALSE ORA-942 OR OTHER ERRORS WITH MULTIPLE SCHEMAS HAVING IDENTICAL OBJECTS
- 7612454 DSS PERF REGRESSIONS IN SERIAL DIRECT READS
- 8684595 GETTING ORA-01115, 27069, WHILE RUNNING PQ, WHEN AUTOEXTEND IS ON
- 8970313 STALE FILE CACHE IN RAC ENV AFTER TABLESPACE DROP AND RECREATE
- 9586877 THE FIX FOR BUG 7526851 (& BUG 8494071) NEEDS REWORKING TO AVOID ORA-904 ERRORS
- Additionally, here are some important fixes for OEM agents that monitor 10.2.0.5 DBs:
- 9282414 - 10.2.0.5 PSU 2 for OEM agent
- 10170020 - alert_log fix for 10.2.0.5 DBs
- Patching status:
- All test and integration DBs have been patched
- Production DBs will follow next week during the technical stop.
- One test and one integration DB (int11r) have already been upgraded to Oracle 11.2.0.2
- Experiment reports:
- ALICE:
- ATLAS:
- ATLASDD, which holds ATLAS geometry data, was added to the ATLAS conditions replication to Tier1s on Thursday 14.04 at 10 AM. The operation required 2 hours of replication downtime.
- On Monday morning (18.04), the ATLAS and LHCb replications of conditions data to T1s were affected by a Streams process deadlock that occurred after the short weekly maintenance stop of the replication service. A restart of the downstream instances was required to clear the lock.
- CMS:
- The CMS offline production database got stuck on Thursday April 28th at around 23:30. Investigation showed that the hang was related to a library cache issue on node number 4 of the cluster. Restarting the node (at 0:50) resolved the situation. The root cause of the problem is still not understood. The issue affected all CMS offline applications, which could not process the load for 1 hour and 20 minutes.
- LHCb:
- Other:
- A problem with the LCG integration database (int6r) started after a disk failure on 26th of April at 20:38. One hour later, this normally transparent failure caused the int6r integration DB to hang for an unknown reason, following SCSI errors reported by the OS. The monitoring systems informed us about the problem, but no immediate reaction was required as this is not a production DB. Further analysis and recovery were performed the following morning. A database restart was required to clear the locking conditions that caused the freeze. From the user point of view, the DB was not available between April 26 21:45 and April 27 9 a.m.
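For sites following the patching recommendations listed above, a minimal sketch of a cross-check is shown here. The patch numbers and titles are taken from the lists in these minutes; the helper `missing_patches` is hypothetical, and the set of installed patch numbers is assumed to be gathered separately from the site's own Oracle inventory.

```python
# Patches recommended on top of vanilla 10.2.0.5, as listed in these minutes.
RECOMMENDED_DB_PATCHES = {
    9232517: "Propagation missing messages after destination queue ownership shift on RAC",
    9184754: "SGA corruption / ORA-600 [ktcccenxt] / dump using LOBs",
    9577583: "False ORA-942 or other errors with multiple schemas having identical objects",
    7612454: "DSS perf regressions in serial direct reads",
    8684595: "ORA-01115, 27069 while running PQ when autoextend is on",
    8970313: "Stale file cache in RAC env after tablespace drop and recreate",
    9586877: "Rework of the fix for bug 7526851 (and 8494071) to avoid ORA-904",
}

# Fixes for OEM agents that monitor 10.2.0.5 databases.
RECOMMENDED_AGENT_PATCHES = {
    9282414: "10.2.0.5 PSU 2 for OEM agent",
    10170020: "alert_log fix for 10.2.0.5 DBs",
}


def missing_patches(installed, recommended):
    """Return the recommended patch numbers not present in the installed set."""
    return sorted(number for number in recommended if number not in installed)


# Example with a made-up inventory: only two of the DB patches are installed.
installed_example = {9232517, 9184754}
for number in missing_patches(installed_example, RECOMMENDED_DB_PATCHES):
    print(number, RECOMMENDED_DB_PATCHES[number])
```

Keeping the database and agent lists separate mirrors the minutes, since the two sets apply to different Oracle homes.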
| Site | Status, recent changes, incidents, ... | Planned interventions |
| ASGC | CASTOR 2.1.10 upgrade done. | None |
| BNL | * Successful migration of the RAC hosting the Conditions Database hardware, BNL Oracle Enterprise Manager hardware and LFC-FTS Standby database cluster to the new datacenter room (04/27/11) * Follow-up on the apply process and gather_stats_jobs contention observed 04/13/11: after analyzing the collected log and trace files and looking at the timeline of the events involving the two processes, it appears that Bug 3642294 is affecting gather_stats_jobs. This is being followed up with Oracle via SR. | Apply quarterly Oracle security patches on database services. |
| CNAF | Updated a local DB (Lemon) to the latest PSUs for 10.2.0.5 without any problem. Waiting with WLCG DBs. | None |
| KIT |  | Looking for a new date for migration of FTS/LFC RAC to new hardware. |
| IN2P3 |  |  |
| NDGF |  |  |
| PIC | Nothing to report | None |
| RAL | Nothing to report | 10th May 2011 - 09:00 and 11:00 BST (UTC+1) - Patching ATLAS conditions, LHCb conditions and LHCb LFC |
| SARA | Nothing to report (Liberation Day in the Netherlands) | Upgrade to 10.2.0.5 expected on June 24 |
| TRIUMF | Nothing to report (not participating this Thursday) | None |
AOB
--
JamieShiers - 29-Apr-2011