Tier0 Services Required for Service Challenge 4 and Initial LHC Service
The following table gives a high-level overview of the services required for SC4 and the Initial LHC Service.
Some of these services are required at Tier1s and / or Tier2s. A list of services required per site will be produced at a later date.
It is intended as a first step toward understanding the service issues and their implications for middleware enhancements, hardware requirements, etc. The focus is redundancy, high availability and scalability, achieved where possible in software (which makes the hardware part much easier and much more flexible).
Please see An Overview of LCG 2 Middleware (Oct 2004); an update on the timescale of end Sep 2005 will be prepared.
Issues that need to be addressed include:
- criticality (critical, high, medium, low), defining the acceptable downtime, where:
- C = critical: < 1 hour,
- H = high: < 4 hours,
- M = medium: < 24 hours,
- L = low: < 1 week (or some similar scale)
(proposed by Tim Bell - maybe these should be aligned with the parameters for minimum levels of T0 service in the MoU (page A3.2) - Jamie)
- disaster recovery (e.g. is it necessary to have the machines for the service in different locations?)
- service supports high availability (i.e. like BDII where the software can automatically provide for HA or where this needs to be implemented as a standby machine)
- external accessibility required?
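As an illustration only, the proposed C/H/M/L downtime scale above could be encoded as follows; the function and constant names are hypothetical, not part of any existing tool:

```python
from datetime import timedelta

# Illustrative sketch of the proposed criticality scale: maximum
# acceptable outage duration per criticality class.
MAX_DOWNTIME = {
    "C": timedelta(hours=1),   # critical: < 1 hour
    "H": timedelta(hours=4),   # high:     < 4 hours
    "M": timedelta(hours=24),  # medium:   < 24 hours
    "L": timedelta(weeks=1),   # low:      < 1 week
}

def within_target(criticality: str, outage: timedelta) -> bool:
    """Return True if an outage stayed within the proposed target."""
    return outage < MAX_DOWNTIME[criticality]
```

For example, a 30-minute outage of a critical (C) service is within target, while a 2-hour one is not.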
Many of the services also include / rely on a database component: some Oracle, some MySQL. These issues also have to be addressed.
To be added:
- recovery procedures defined Y/N, tested Y/N
- expected lifetime of service; foreseen replacement service
Also need:
- Level 1, 2 & 3 procedures;
- Mailing lists (standards?)
- Documentation, FAQ, ...
- Monitoring, including comparison of delivered service level with agreed level
- ...
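The monitoring bullet above (comparing the delivered service level with the agreed level) could be sketched like this; the 99% target and the function names are illustrative assumptions, not figures from the MoU:

```python
# Hypothetical sketch: compare delivered availability over a period
# against an agreed target. The 0.99 default is an example value only.
def delivered_availability(total_hours: float, downtime_hours: float) -> float:
    """Fraction of the period the service was actually up."""
    return (total_hours - downtime_hours) / total_hours

def meets_agreed_level(total_hours: float, downtime_hours: float,
                       agreed: float = 0.99) -> bool:
    """True if delivered availability meets or exceeds the agreed level."""
    return delivered_availability(total_hours, downtime_hours) >= agreed
```

For a 30-day month (720 hours), 5 hours of downtime meets a 99% target; 10 hours does not.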
Also need to identify a service manager / coordinator for each service and assign it to an organisational unit
Assign ownership of each service to carry deployment forward
Software supplier(s) also to be added, dependencies etc.
gLite components: R-GMA, VOMS, FTS
gLite migration TBD: RB, CE
Others: N/A
| ID | Service Name | Acronym | Purpose | Contact Information | Current Situation | Growth | Availability Issues | Criticality (C/H/M/L) |
| 1 | ResourceBroker | RB | Farms out jobs to sites + logging and book-keeping | David Smith | 20 machines with raid array | | Concern | C |
| 2 | MyProxy | | Renew / acquire credentials | Maarten Litmaath | | | Long-running jobs cannot renew proxy; FTS uses directly (hence C) | C |
| 3 | BdiiService | BDII | Grid information system | Lawrence Field | 4 farm nodes, dns alias | depends on query rate, add commodity boxes | no automatic failover to external BDIIs if CERN site down. Some sites have their own BDIIs. State kept (4MB) in memory and on disk | C |
| 4 | SiteBdii | | | Lawrence Field | 1 | | Need at least one additional machine | H |
| 5 | ComputeElement | CE | | | | | | C |
| 6 | RgmaService | R-GMA | Grid monitoring | Lawrence Field | see below | | | M |
| 7 | MonboxService | | see above | Lawrence Field | 1 farm node, 2GB memory | | Properly configured clients ok - see below | M |
| 8 | ArchiverService | | see above | Lawrence Field | 4 as above. Local MySQL DB | | Permanently lose monitoring info after client timeout | M |
| 9 | GridView | | | | | | | M |
| 10 | SftService | SFT | Regular tests of components per site | Piotr Nyczyk, Judit Novak | 2 farm nodes, MySQL | Depends on need for historical data / number of tests | Detailed site status unavailable | M |
| 11 | GridPeek | | For storage of log files of running jobs (to provide visibility prior to job end) | Patricia Mendez | 1 DPM instance | add additional servers / storage as required | Log files of current jobs not visible | M |
| 12 | VomsService | VOMS | Manages users / roles / VOs | Maria Dimou | Pilot - farm node running application server + DB | Separate DB from app server | Current jobs ok, new jobs cannot be submitted | H |
| 13 | LcgFileCatalog | LFC | Site-local file catalog for ALICE, ATLAS, CMS; global catalog for LHCb | hep-service-lfc@cernNOSPAMPLEASE.ch | 5 farm nodes (LFC servers) + Oracle DB | | | C |
| 14 | FileTransferService | FTS | Reliable file transfer service - CMS currently using PhEDEx | fts-support@cernNOSPAMPLEASE.ch | 2 disk servers (lxshare021d and 026d) + pilot | | Key service offered by Tier0 for T0<->T1 production data transfers | C |
| 15 | CastorGrid | CASTORGRID | The low-level service which runs the actual SRM and gridFTP to perform data transfers in and out of CASTOR | Wan-Data.Operations@cernNOSPAMPLEASE.ch | 8 load-balanced worker nodes connected via 2 x 1Gb link | Can grow as needed provided there is enough network capacity | This model will probably be replaced by CASTOR WAN pools setup as used for SC3 | C |
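The BDII entry in the table notes that there is no automatic failover to external BDIIs if the CERN site is down. A client-side workaround could be sketched as below; the fallback hostname is a hypothetical example, not an endorsed endpoint:

```python
import socket

# Client-side sketch of BDII failover: try an ordered list of
# information-system endpoints and use the first one that accepts a
# TCP connection. The hostnames below are illustrative only.
BDII_ENDPOINTS = [
    ("lcg-bdii.cern.ch", 2170),       # primary (DNS alias over farm nodes)
    ("site-bdii.example.org", 2170),  # hypothetical fallback at another site
]

def first_reachable(endpoints, timeout=2.0):
    """Return the first (host, port) that accepts a connection, else None."""
    for host, port in endpoints:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return (host, port)
        except OSError:
            continue
    return None
```

A client configured this way keeps working when the primary alias is unreachable, at the cost of a connection timeout per dead endpoint.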
Services that use Oracle
| LFC | shared backend across all VOs |
| FTS | ditto |
| CASTOR | |
| Gridview | |
| VOMS | porting from MySQL in progress - target for SC4 |
Services that use MySQL
Tier1 Services
Tier2 Services
--
TimBell - 05 Sep 2005