Grid Deployment Board (GDB)
Web - Wiki - Agendas - Minutes
6 February 2008 - GDB Meeting - Agenda
The GDB in January covered the progress of HEP CPU Benchmarking; a test bed with
different processors is being provided to the Experiments to benchmark
their applications. The GSSD
Final Report summarized the work still needing to be
completed and some interesting new issues were raised about space tokens.
The Worker Node
Working Group was launched; it should investigate the
future WN characteristics in terms of memory and hard disk space, and the
subsequent matching of such resources. The output of the WG should describe
the deployment process, with details on how to advertise and describe
heterogeneous WNs at the sites. A set of Security Documents, on VO Operations
and Pilot Jobs policies, was presented and should be approved in the near
future. The GDB also launched the Review
of LHC Multi-User Pilot Job Frameworks.
5 February 2008 - Pre-GDB Meeting - Agenda
CCRC'08
F2F Meeting - The all-day meeting focused on the baseline
versions of the gLite
Middleware and of the SRM implementations. Presentations on
progress, plans and issues covered gLite,
CASTOR, dCache, DPM and StoRM. Readiness of the Sites and
Experiments was also discussed in detail, as was tracking of the challenge, including
monitoring, logging and reporting.
9 January 2008 - GDB Meeting - Agenda
The main topics presented and discussed at the GDB in
January 2008 were the Handling
of Persistent Storage Classes (T1D0), the status of the SRM2 Deployment
and the preparation for the CCRC08
Challenges.
There were also reports about CPU Benchmarking in HEPiX,
Security Policies,
Pilot Jobs
and the Worker Node
Strategies.
10 January 2008 - Post-GDB
Meeting - Agenda
CCRC
2008 F2F Meeting - The all-day F2F meeting focused on the preparation of the CCRC challenges
for February and May 2008. The Experiments presented their
requirements in terms of storage at the different sites (Tier-0 and
Tier-1). The SRM 2.2
Deployment at every WLCG Site was discussed and Tape Usage
Statistics at CERN were presented. Proposals for basic operations, bug
fixing, upgrades, metrics, monitoring and reporting were also discussed.
25 February 2008 - Agenda, Minutes
Moving
to gLite 3.1 - One month of proven good performance of the
gLite 3.1 service is the criterion for being able to withdraw the previous
version (gLite 3.0). Withdrawal of support means no more bug fixes or
functional updates (other than for security issues).
Pre-Release
of SAM Services - Pre-release version of the SAM web
services: As announced on Friday Feb. 22 through the sam-announce mailing
list, there is a new pre-release version of the SAM web services
(lcg-sam-server-ws-0.11.0) installed on the SAM Validation instance. Link
People are encouraged to review these changes, adapt their code (if
necessary) and test the new interfaces as soon as possible.
GLUE
2.0 Draft Available - The initial draft of Glue 2.0 is
available at the following Link. Send feedback to
L.Field.
dCache
Updated for CCRC08 - Where to get dCache updates during
CCRC '08. It was clarified that during CCRC '08 WLCG sites should take
dCache updates from the official dCache repositories. Link
18 February 2008 - Agenda, Minutes
64-bit
SL4 WN - The 64-bit SL4 WN will be available relatively soon
and there was a discussion on how sites should publish whether they have
32-bit, 64-bit or a mixture. Proposals will be discussed at the next
meeting.
Running
MW Services on a Single Node - Sites and VOs were asked to
give any feedback they have on problems seen with combining several
middleware services on one physical node.
Down-time
Scheduling Procedures - LHCb raised the issue that some
sites seem not to be adhering to the agreed WLCG rules for announcing
service downtime. This will be monitored more closely.
11 February 2008 - Agenda, Minutes
GOCDB
Outage - There was an analysis of the down-time of RAL (due
to a power cut) and the effect it had on grid operations due to the outage
of the GOC DB. Lessons learned will be forthcoming.
Problems
with Jobs Submission - There was an analysis of the VOMS
issue which prevented many users in the UK from submitting jobs for ~24
hours. Procedures will be changed to ensure that a similar problem doesn't
occur again.
Retirement
of the Classic SE - It was announced that the classic SE
will be retired at the end of May. Any VOs or sites that have an issue with
this should contact N.Thackray or Maria Barroso Lopez. An announcement
containing this information was broadcast.
All
ATLAS WN Move to SLC4 - ATLAS requested to move all their
WNs from SLC3 to SLC4. The deadline will be 15 March 2008.
Site
Share Installation Area - ATLAS also requested that all
sites increase their shared software installation area from 10 to
100 GB.
4 February 2008 - Agenda, Minutes
Sites
Suspended - Some sites had to be suspended due to many
problems encountered when operating them.
28 January 2008 - Agenda, Minutes
GridView
Availability Algorithm - Decision made to amend the
GridView site availability algorithm so that "transparent"
interventions are handled correctly (i.e. they will not be counted as
down-time).
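The effect of the amended rule can be illustrated with a minimal sketch. This is not the GridView implementation: the interval representation, the status labels and the function name are assumptions made purely for illustration. The only point shown is that time spent in a "transparent" intervention counts as available time, while ordinary down periods do not.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    hours: float
    status: str  # hypothetical labels: "up", "down" or "transparent"


def availability(intervals):
    """Fraction of the reporting period counted as available.

    Transparent interventions are treated like "up" time, so they
    do not reduce the availability figure.
    """
    total = sum(i.hours for i in intervals)
    if total == 0:
        raise ValueError("empty reporting period")
    available = sum(
        i.hours for i in intervals if i.status in ("up", "transparent")
    )
    return available / total


# Example day: 20 h up, 2 h transparent intervention, 2 h real outage.
day = [Interval(20, "up"), Interval(2, "transparent"), Interval(2, "down")]
print(f"{availability(day):.3f}")
```

Under these assumptions, only the 2 hours of real outage are counted as down-time for the example day.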
Timeouts
of SE and SRM Tests - SE and SRM tests were failing at FNAL
(with timeouts after 600s). After discussion with the site and the
monitoring team, the site will try to address this through the use of
better hardware.
21 January 2008 - Agenda, Minutes
Middleware
for CCRC08 - The base versions for CCRC '08 of the various
middleware services were advertised to all sites and discussed. gLite 3.1
VOBox was released into production.
14 January 2008 - Agenda, Minutes
Space
Tokens Needed - ATLAS requested the following for CCRC '08:
Each site should publish in the Information System updated information in
some GLUE fields for all storage areas, including their space token
association.
Issues
at some Sites - LHCb reported RFIO problems at both CNAF
and RAL that were seriously hindering production.
Architects Forum (AF)
Minutes - Web
21 February 2008 - Minutes
ROOT
Will Support Qt4 only - The ROOT interface to version 3 of the Qt GUI
toolkit (Qt3) will no longer be supported starting from the next
production release. Only Qt4 will be supported.
New
Projects in PH-SFT - Agreed that the Architects Forum will also
monitor/drive the developments for the two new R&D projects hosted in
the PH-SFT group. Both project leaders will be invited to the meetings and
short status reports will be given regularly.
LCG_54a
Configuration Ready - It includes some changes with respect
to LCG_54. A new patch configuration will be prepared in order to include the
new version of COOL.
Experiments
Releases - Experiments are picking up configuration
LCG_54(x) for building their production releases during the next weeks.
7 February 2008 - Minutes
Experiments
Agree to Abandon Support for Qt3 - The experiments were asked for
their support needs with respect to the version of the Qt GUI toolkit and
its interoperability with ROOT. It was decided that if no reason is identified
from ATLAS or CMS to keep Qt3 support, then ROOT will move to Qt4
exclusively.
CORAL
Server Decision - Discussion about CORAL server
development. Decided to go ahead with the proposal and develop the
prototype. This prototype needs to be delivered quickly because of the
urgent needs from ATLAS online.
LCG_54a
in Preparation - New configuration in preparation with
latest bug fixes. ATLAS would like to defer the release of LCG_54a for a week.
This was agreed and the target date
for the bug-fix release was set for February 18th.
24 January 2008 - Minutes
Geant4
Funding - J.Apostolakis summarized his analysis of
the effects of the DoE budget cut at SLAC on Geant4. He has also written
a memo with more details, which has been sent to ATLAS and CMS management.
Changes
to the Nightly Builds - There will be some changes in the
nightly builds, with new slots for patches of LCG_54 and a new version of the gcc
compiler.
CORAL
Server Proposal - D.Duellmann presented the CORAL server
proposal in the last AA
meeting. The experiments should comment on the proposal. The decision
eventually could be taken at the next AF.
10 January 2008 - Minutes
ROOT
Release in Preparation - The production release of ROOT
is scheduled for Wednesday 16th, assuming all pending problems are solved and
no new ones are detected by the ongoing experiment validations. A
few days later we should get the complete LCG_54 release.
Python
2.5 in Candidate Release - Python 2.5 has been validated
and is going to be added to the candidate configuration for release
("dev" slot in the nightly builds).
MAC
OSX 10.5 new Supported Platform - Agreed to introduce
MacOSX 10.5 (Leopard) as a new platform after the release of LCG_54.
GCC
4.2 Future Support - Agreed to start working on gcc 4.2
after the release of LCG_54. Eventually we will drop gcc 4.1 after the complete
software stack has been validated with gcc 4.2.
Adding
Geant4 and Genser in the Nightly Builds - Suggested to have
Geant4 and Genser MC generators as part of the nightly builds to facilitate
the validation by ATLAS.
Management Board (MB)
Web - Wiki - Members - Agendas - Minutes
26 February 2008 - Agenda, Minutes
CCRC08
Will Continue Beyond February - J.Shiers presented the
progress of the CCRC08 challenge. Data exports from CERN are now running at
1-2 GB/s. The peak by the Experiments is ~1 GB/s greater than the average
achieved during SC4 (1.3 GB/s). CMS alone has managed more than 1 GB/s and
ATLAS is starting to run at similar rates. The number of problems
reported is only a few per VO per day. The CCRC operations will not stop
after February but will continue in March and beyond. Site participation in
the daily operations calls needs to be improved.
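To put these rates in perspective, the short sketch below converts a sustained export rate into a daily data volume and derives the approximate peak implied by the figures quoted above. The helper name is hypothetical and decimal units (1 TB = 1000 GB) are an assumption; the source does not say which convention was used.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_volume_tb(rate_gb_per_s):
    """Volume moved in one day at a sustained rate, in decimal TB."""
    return rate_gb_per_s * SECONDS_PER_DAY / 1000


# Figures quoted in the summary: SC4 average and "~1 GB/s greater" peak.
sc4_average = 1.3
ccrc08_peak = sc4_average + 1.0

for rate in (1.0, 2.0):
    print(f"{rate} GB/s sustained -> {daily_volume_tb(rate):.1f} TB/day")
print(f"approximate CCRC08 peak: {ccrc08_peak:.1f} GB/s")
```

With these assumptions, the quoted 1-2 GB/s range corresponds to roughly 86 to 173 TB exported per day if sustained.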
CMS
QR Report - M.Kasemann presented the CMS grid activities
since October 2007, covering the CSA07 activities and the CCRC08-February
tests. The CMS Computing infrastructure is fully utilized by ongoing
production. The CSA07 production (and much more) was finished and a detailed
analysis of the performance was carried out. The newly formed Processing and Data Access
(PADA) taskforce addresses deployment, integration, commissioning and scale
testing. It will bring the elements of the CMS Computing Program into
stable and scalable operations. The CCRC08 functional tests in February
have actually complemented CSA07 and tested important additional
functionality.
Tape
Efficiency Metrics - Site Roundtable. The LHCC Referees
asked that all Tier-1 sites also collect tape efficiency metrics in the
near future. During CCRC08 in May these metrics should be available in
order to analyze how efficiently tapes are used by the typical read/write
patterns.
Sites
Availability and Reliability Algorithm - A.Aimar described
how the Availability and Reliability calculations are performed in SAM and
GridView; the results are then used for the reports on site availability
and reliability to the Overview Board.
19 February 2008 - Agenda, Minutes
CCRC08
and WLCG Services - J.Shiers presented an update on the
CCRC08 challenge. With its current scope and timeline, CCRC08 will not achieve
sustained exports from ATLAS+CMS (+others) at nominal 2008 rates for 2 weeks
by the end of February 2008. These goals can be achieved soon after February.
Therefore the proposal is to continue CCRC08 through March, April and beyond.
The WLCG Computing Service is in full production mode, and running
permanently is its actual purpose. One needs to move from the mind-set of
“challenge then relax” to “full production all the time”. Experiments
should remember that there are no GGUS TPMs on weekends, holidays or nights.
A problem submitted to GGUS on a Friday evening will be answered only the
next Monday.
LHCC
Referees Feedback - I.Bird summarized the meeting with the
LHCC Referees that took place the same morning. The questions and concerns
of the referees were about the status of the Sites, the SRM installations and why
the Experiments were not all running at the same time - the Referees would
like this to be demonstrated. They also asked that tape efficiency and overall
metrics for site performance should be defined and measured.
Availability
and Reliability in January 2008 - Tier-1 Reliability and
Availability. Below are the reliability metrics since April 2007. One
should note that the target has moved to 93%; otherwise 10 sites would have
been above the old 91% target. The averages for the best 8 in the last 6
months were: Aug 94%, Sept 93%, Oct 93%, Nov 95%, Dec 96%, Jan 95%.
I.Bird proposed that, from next month, the MB also start to
verify the status of reliability and availability of Tier-2 sites.
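As a quick check of the monthly figures quoted above against the 93% target, the sketch below computes their six-month mean and lists any month that falls below the target. It assumes the quoted percentages are the reliability averages of the best 8 sites; the variable names are illustrative only.

```python
# Monthly averages for the best 8 sites, as quoted in the summary (percent).
best8_reliability = {
    "Aug": 94, "Sep": 93, "Oct": 93, "Nov": 95, "Dec": 96, "Jan": 95,
}
TARGET = 93  # current target, in percent

mean = sum(best8_reliability.values()) / len(best8_reliability)
below_target = [m for m, v in best8_reliability.items() if v < TARGET]

print(f"six-month mean: {mean:.1f}%")
print("months below target:", below_target or "none")
```

On these numbers, no month in the period falls below the 93% target.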
12 February 2008 - Agenda, Minutes
CCRC08
Started - J.Shiers presented a summary of the initial
days of CCRC08. The February CCRC08 run started on Monday 4 February. The preparations for this
challenge have proceeded (largely) smoothly. The execution is still manpower intensive and
schedules remain extremely tight.
ATLAS
QR Report - D.Barberis presented the ATLAS QR report, an
update on the information presented in November. ATLAS considers the FDR-1
exercise very useful and has learned important information on data concentration
at CERN and event mixing (jobs with many input files).
The data quality loop was tried and basically works. The calibration
procedures were also attempted for the first time and still need testing.
The ATLAS Tier-0 internals are not a worry except for operations manpower;
shifts are being organized.
5 February 2008 - Agenda, Minutes
CCRC08
Preparation - J.Shiers reported on the status and progress
of the CCRC08. With respect to previous challenges it is better prepared
and executed. The information flow has been considerably improved; a lot of
work was focused on ensuring that middleware and storage-ware are ready.
Accurate and timely reporting of problems and their resolution still needs
to be standardized.
CASTOR
Metrics - T.Cass proposed the set of metrics that could be
collected to measure and monitor the performance of the MSS storage at the
Tier-0 site. There are many different CASTOR related performance
measurements that can be collected, but these are not easily available from
a single location. Metrics will now be tracked consistently in order to
show performance issues, and hopefully improvements, over time.
Post
SLC4 options at CERN - T.Cass presented the possible options
for the future platform to support. The choice discussed is between
progressively delivering SLC5 services (test clusters, build services, etc.),
or skipping the RHES5 based platform and introducing SLC6 services from the
release of RHES6, expected Q4 08.
ALICE
Quarterly Report - L.Betev presented the QR for ALICE, from
Nov 07 to Jan 08. The
ALICE MC production continued with very good site and services availability.
The Conditions data collection was in operation from day one (Shuttle
system to Offline CondDB). All data source components are ready and
integrated, including the DAQ/DCS/HLT databases and fileservers. The focus
is now on having a full complement of conditions data - and the
corresponding online software - for all detectors.
Applications
Area before CCRC - P.Mato presented the status of the
Applications Area software. The AA software is very weakly coupled with
grid services. Very few points of contact (e.g. access to event and
conditions data). Typically each experiment will use the version they have
managed to fully integrate and validate with their applications. The CCRC
February run was based on last year’s releases while the CCRC May run will
be based on the new configuration.
29 January 2008 - Agenda, Minutes
OSG
Site Availability and Storage Services - In the last couple
of weeks OSG started uploading RSV records to SAM publishing information on
the US Tier-2 production resources listed in the WLCG MOUs for ATLAS and
CMS. The logs, on the SAM side, have revealed that uploads are steady and
accessible. The SAM web interface still does not present the expected
results. Joint debugging is actively taking place.
Interim
User Accounting Policy - J.Gordon presented the status of
user level accounting and the current plans. The proposal is to have an
interim solution approved by the MB. ATLAS would like to be granted access to
their user data. ATLAS also uses VOMS Roles/Groups (a related issue). The
interim policy (see Policy Document below) is that selected ATLAS people
should agree before being given access to ATLAS data. Sites are informed that
if they publish UserDN data, then the ATLAS VRM will have access to the
ATLAS data.
22 January 2008 - Agenda, Minutes
CCRC08
Preparation - The services are not only going to be tested
for scaling but also new updates are still being tested or certified (LFC,
Gfal, lcg-utils). One should agree on a target set of versions for all
components of middleware and storage software no later than the April
Face-to-Face Meetings (April 1/2).
Pilot
Jobs Framework Review - J.Gordon presented the updated mandate
resulting from the last GDB discussion. The proposed mandate is to
review the Multi-User Pilot Job Frameworks of the LHC Experiments (Draft 1)
and to produce a report to the Management Board about the safety and effect
of the framework.
GDB
Summary - J.Gordon presented a summary of the January GDB.
15 January 2008 - Agenda, Minutes
Metrics
for Tape Efficiency - During the previous meeting
T.Bell presented the metrics used at CERN and proposed that some of those
metrics should be collected by the Tier-1 sites. The Tier-1 sites should
comment on whether they can collect the same metrics.
WLCG
Service Interventions / Interruptions - J.Shiers summarized
the issues of (1) “expected response time from the Experiments” and (2)
what could be achieved by defining clear procedures and common standards. Some Experiments (CMS and LHCb)
have requested a maximum downtime of 30 minutes for the most critical services.
As has been stated on several occasions, including at the WLCG Service
Reliability workshop and at the OB, a maximum downtime of 30 minutes is impossible
to guarantee at an affordable cost. Even an intervention within 30 minutes cannot be taken for granted.
Update
on CPU Benchmarking - H.Meinhard reported the progress at
the HEP CPU Benchmarking working group: Several different systems
(currently based on seven different processors) are available at CERN. The
standard SPEC benchmarks were run at CERN on those seven hosts.
NDGF
SAM Tests - O.Smirnova presented an update about the
specific SAM tests that have been developed in order to calculate the
standard Site Availability at NDGF.
8 January 2008 - Agenda, Minutes
OSG
Resource and Service Validation (RSV) - OSG is developing a service testing infrastructure
and probes that will calculate the availability and reliability of the OSG
sites. The results collected will also be passed to the standard WLCG SAM
availability monitoring system and published via GridView.
CCRC08
F2F Meeting - J.Shiers presented the status of the CCRC
planning and the preparation of the F2F Meeting scheduled for 10
January 2008 at CERN. Link
Improving
Tape Efficiency at Tier-0 Site - T.Bell summarized the
issues concerning storage tape handling and explained which are the major
areas for improvement, both in the configuration of the Tier-0 storage
system and in the usage patterns of the storage in the Experiments’
applications. The presentation was followed by a discussion and by feedback
from the Experiments on how their applications could improve tape usage. A
similar kind of analysis of tape efficiency will be performed at the Tier-1
Sites in the coming weeks.