Requirements for ACE from the LHC experiments
This page summarizes the outcome of meetings and discussions, together with pieces of existing documentation, related to this subject:
Are the experiments happy with a single algorithm for VO-specific availability calculation that is consistent with the GridView algorithm?
In general, people tend to agree that we should keep a single algorithm (based on the discussion at the meeting on 23.06.2010).
Most of the limitations of the current system relate to the topology description rather than to the availability calculation algorithm.
The GridView Service Availability Computation algorithm is described in this document.
What should be done differently compared to the current algorithm is the following:
1).
If a site has two services of the same type and one is down while the other is in maintenance, GridView currently
considers the service type to be in maintenance.
It should be the other way around: the service type should be considered down rather than in maintenance. Otherwise a site could register one fake service,
keep it permanently in maintenance, and never be reported as down. An open question raised while writing this page: if there are 10 services of the same type,
9 of them in maintenance and 1 down, should the overall state of the service type be considered down? Or should one instead take
the state of the majority of services of the given type?
2).
When there are several services of the same flavour, a logical 'OR' is applied in the availability calculation.
VOs should be able to override this default behaviour if they need to, replacing 'OR' with 'AND'. This possibility should be
foreseen in the UI where VOs define a profile.
3).
A profile should be linked to a certain group. For example, certain tests or service types may be critical ONLY for T1s, but not for T2s.
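The three points above can be illustrated with a minimal Python sketch. It is not the GridView implementation: the `Status` enum, the `combine_same_flavour` and `critical_flavour_statuses` functions, the `PROFILE` layout and the flavour names `CE` and `SRM` are all hypothetical, chosen only for illustration. The sketch assumes that 'down' always takes precedence over 'maintenance' (the majority question from point 1 is left open) and that flavours a site does not run are simply skipped.

```python
from enum import Enum


class Status(Enum):
    UP = "up"
    DOWN = "down"
    MAINTENANCE = "maintenance"


def combine_same_flavour(statuses, operator="OR"):
    """Combine the statuses of all instances of one service flavour.

    operator="OR"  (default): the flavour is up if any instance is up.
    operator="AND" (VO override): the flavour is up only if all instances are up.
    """
    if not statuses:
        raise ValueError("at least one service instance is required")
    if operator not in ("OR", "AND"):
        raise ValueError("operator must be 'OR' or 'AND'")

    ups = [s is Status.UP for s in statuses]
    is_up = any(ups) if operator == "OR" else all(ups)
    if is_up:
        return Status.UP
    # Point 1: 'down' wins over 'maintenance', so that a fake service kept
    # permanently in maintenance cannot hide a real outage.
    if Status.DOWN in statuses:
        return Status.DOWN
    return Status.MAINTENANCE


# Hypothetical profile layout (point 3): for each group, the critical
# flavours and the operator the VO chose for each of them (point 2).
PROFILE = {
    "T1": {"CE": "OR", "SRM": "AND"},
    "T2": {"CE": "OR"},
}


def critical_flavour_statuses(profile, group, site_services):
    """Return the combined status of each flavour critical for the group."""
    result = {}
    for flavour, operator in profile[group].items():
        if flavour in site_services:  # assumption: absent flavours are skipped
            result[flavour] = combine_same_flavour(site_services[flavour], operator)
    return result
```

With this sketch, a flavour with one instance down and one in maintenance is reported as down, and the same set of services can yield different critical flavours depending on whether the site is evaluated as a T1 or as a T2.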
How to handle VO-specific service types, such as the CRAB server
David said that there was an agreement, recorded in the document approved by the MB, that all services to be tested by SAM should be registered
in GOCDB. Alessandro mentioned that registration in GOCDB is not a straightforward process and pointed to a corresponding Savannah bug:
https://savannah.cern.ch/support/?113592.
Andrea expressed some doubts that experiments would be happy to register experiment-specific services in GOCDB.
It was suggested to re-discuss this question inside the experiments in order to understand whether they agree with the statement that all experiment-specific services which need to be tested by SAM should be registered in GOCDB.
Some remarks:
The validity period of a test result should be defined at the test level. The default is 24 hours. Where and how should it be defined?
If no critical tests are defined for a site, the site will always be green.
SAM won't use BDII for availability calculation, since the information that can be taken from BDII (which services are used by a VO) should come
with the VO topology description. However, if no topology description is provided, BDII can be used for this purpose.
Whether the whole site or only a particular service type is considered to be in downtime is defined by the site admin. VOs are free to define different profiles in order to decouple the various functionalities of a site.
GridView considers only scheduled downtime as maintenance; during unscheduled downtime the site is regarded as down.
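Some of the remarks above can be made concrete with a short Python sketch. This is only an illustration under stated assumptions: the function names, the string status values and the representation of test results as booleans are all hypothetical and are not taken from SAM or GridView.

```python
from datetime import datetime, timedelta

# Default validity of a test result, as stated in the remarks above.
DEFAULT_VALIDITY = timedelta(hours=24)


def is_result_valid(result_time, now, validity=DEFAULT_VALIDITY):
    """A test result counts only while it is no older than its validity."""
    return now - result_time <= validity


def site_is_green(critical_results):
    """critical_results holds one boolean per critical test (True = passed).

    all() of an empty list is True, which reproduces the remark that a site
    with no critical tests defined is always green.
    """
    return all(critical_results)


def downtime_status(scheduled):
    """Only scheduled downtime counts as maintenance; unscheduled is down."""
    return "maintenance" if scheduled else "down"
```

Passing a per-test `validity` argument is one possible answer to the open question of where the validity should be defined; the sketch does not settle it.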
Meetings
Meetings are recorded on this twiki only from 23.06.2010 onwards; there were many discussions before this date.
Meeting 23.06.2010
Attended by:
David Collados, Phool Chand, Andrea Sciaba, Roberto Santinelli,
Alessandro Di Girolamo, Akshat Kakkar, Pablo Saiz, Julia Andreeva
The goal of the meeting was to understand the requirements of the experiments for VO-specific availability calculations.
The discussion started from the old document written by William Olliver two years ago.
The issues discussed are described above.
We did not get as far as discussing the timeline for having a prototype of the VO-specific availability
calculation in place. It would be good if Andrea, Alessandro and Roberto could look through William's document again and think about
the issues discussed today, so that by the next meeting we have an updated version of the requirements for
GridView and can then
estimate how much time is required for the implementation and whether we can have some intermediate milestones.