SC4 BDII Home Page
This page defines the installation, configuration and the procedures related to the CERN SC4
BDII service.
This page documents the current situation. It does not cover requirements or issues. These are covered in the
BdiiNotes.
Overview
The CERN BDII service is defined as a critical service in the services catalog.
There are three implementations:
- The Global BDII, which provides a worldwide view of the BDII data on the grid
- The site GIIS, which provides a consolidated view of the various GRIS servers on the CE and SE
- A VO-specific BDII, which is a view of the Global BDII that includes the VO white and black listing of sites
Installation and Configuration
The CDB template is pro_type_gridbdii_lcg_slc3.tpl
Base SLC3 installation:
- file partition as in here
- SLC3 Software plus LCG EGEE.BDII: see here
LCG components:
include pro_software_components_lcg_yaim_2_6;
"/software/components/yaim/active"=true;
"/software/components/yaim/nodetype/LCG-BDII" = true;
# for yaim, local user edguser is needed:
"/software/components/access_control/roles/edguser/0"="edguser";
"/software/components/access_control/privileges/acl_interactive/role/edguser/0/targets"=list("+cluster::gridbdii");
The bdii processes run under the edguser account and group. The reserved accounts and uid/gid values for grid server processes are listed here.
Work is under way to add the edguser and edginfo users and groups to the SINDES passw template and to the group.header file (see more info here).
Configure Yaim:
ncm-ncd --configure yaim
[INFO] running component: yaim
---------------------------------------------------------
[INFO] updated /etc/lcg-quattor-site-info.def
[INFO] running command: /opt/lcg/yaim/scripts//configure_node /etc/lcg-quattor-site-info.def BDII
[INFO] '/opt/lcg/yaim/scripts//configure_node /etc/lcg-quattor-site-info.def BDII' STDOUT output produced:
Configuring config_upgrade ...
Configuring config_edgusers ...
Configuring config_bdii ...
Stopping BDII [FAILED]
Starting BDII [ OK ]
Configuring config_fmon_client ...
Stopping edg-fmon-agent: [ OK ]
Starting edg-fmon-agent: [ OK ]
Configuration Complete
[INFO] '/opt/lcg/yaim/scripts//configure_node /etc/lcg-quattor-site-info.def BDII' run succesfully
[INFO] configure on component yaim executed, 0 errors, 0 warnings
=========================================================
[OK] 0 errors, 0 warnings executing configure
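When the yaim component is run unattended, the output above can be checked by a small script rather than by eye. The following is a minimal sketch only: the function name summarize is hypothetical, and the patterns are derived from the sample output on this page, so the wording of a failing summary line is an assumption.

```python
import re

# Patterns taken from the sample output above; a non-OK summary line is
# assumed to have the same layout with a different tag in the brackets.
SUMMARY = re.compile(r"^\[(\w+)\]\s+(\d+) errors, (\d+) warnings executing configure")
ACTION = re.compile(r"^(Starting|Stopping) (.+?):?\s*\[\s*(OK|FAILED)\s*\]")

SAMPLE = """\
Configuring config_bdii ...
Stopping BDII [FAILED]
Starting BDII [ OK ]
Stopping edg-fmon-agent: [ OK ]
Starting edg-fmon-agent: [ OK ]
[OK] 0 errors, 0 warnings executing configure
"""


def summarize(output):
    """Return ((status, errors, warnings), [failed actions]) for ncm-ncd output."""
    summary = None
    failed = []
    for line in output.splitlines():
        line = line.strip()
        match = SUMMARY.match(line)
        if match:
            summary = (match.group(1), int(match.group(2)), int(match.group(3)))
            continue
        match = ACTION.match(line)
        if match and match.group(3) == "FAILED":
            failed.append(f"{match.group(1)} {match.group(2)}")
    return summary, failed


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

In the sample, the failed stop of BDII is followed by a successful start, so a failed stop on its own does not necessarily indicate a problem.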
Management Procedures
SMS
What does "maintenance" mean for a bdii host?
Currently /usr/libexec/SetToDesiredState.gridbdii is used to put the nodes into production or maintenance. In maintenance there is NO /etc/nologin file, otherwise the bdii daemon cannot be started.
The "gridbdii" CDB cluster has to be configured accordingly to get the expected behaviour.
Operations
Use /etc/init.d/bdii start|stop|status to start, stop or check the service.
The cluster is configured so that bdii is automatically started at boot:
"/software/components/chkconfig/service/bdii/add" = true;
Monitoring
Lemon Sensors
The Lemon sensors for BDII provide the following functions:
- Status performs a lookup of the CERN-PROD latitude, which is defined in both the global and the GRIS BDIIs. The time taken by this operation is returned as an approximate performance metric.
- Load checks the number of concurrent queries. This can be used as an average load factor.
These are provided by the lemon-sensor-grid-bdii rpm.
To configure this for a machine, there are two CDB profiles:
The pro_monitoring_cos_gridbdii profile defines the templates for the monitors.
Within the profile pro_system_gridbdii, the pro_monitoring_cos_gridbdii template is included and the metrics are set to active.
The metric definitions added to the CDB are:
############################################################
#
# template pro_monitoring_cos_gridbdii
#
############################################################
template pro_monitoring_cos_gridbdii;
"/system/monitoring/metric/_801" = nlist(
"name", "BdiiStatus",
"descr", "returns <0 if Bdii down, >=0 if Bdii up",
"class", "GridBdii::BdiiStatus",
"period", 120,
"active", false,
"latestonly",false,
);
"/system/monitoring/metric/_802" = nlist(
"name", "BdiiLoad",
"descr", "Returns a value for the current load on the BDII",
"class", "GridBdii::BdiiLoad",
"period", 120,
"active", false,
"latestonly",false,
);
The Lemon sensor can be tested as follows:
$ /usr/bin/perl -I/opt/edg/lib/perl -MGridBdii /usr/libexec/sensors/edg-fmon-sensor.pl
INI 1 GridBdii::BdiiStatus
GET 1
PUT 01 1 1130580132 0.085848
INI 2 GridBdii::BdiiLoad
GET 2
PUT 01 2 1130580157 0
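The PUT replies in the test above can be picked apart in a script when the test is repeated on several hosts. This is a minimal sketch: parse_put and is_up are hypothetical names, and the field layout (PUT, a two-digit code, the id given in the INI line, a Unix timestamp, the value) is inferred from the two sample lines only. The meaning of the second field is not documented here, so it is kept as raw text.

```python
def parse_put(line):
    """Split one 'PUT' reply from the sensor into its fields.

    Layout assumed from the sample session:
    PUT <code> <id from INI> <unix timestamp> <value>
    """
    fields = line.split()
    if len(fields) != 5 or fields[0] != "PUT":
        raise ValueError(f"not a PUT reply: {line!r}")
    _, code, ini_id, timestamp, value = fields
    return {
        "code": code,                # meaning not documented on this page
        "id": int(ini_id),           # matches the number used in INI/GET
        "timestamp": int(timestamp),
        "value": float(value),
    }


def is_up(status_value):
    """BdiiStatus returns >=0 if the BDII is up and <0 if it is down."""
    return status_value >= 0


if __name__ == "__main__":
    reply = parse_put("PUT 01 1 1130580132 0.085848")
    print(reply, is_up(reply["value"]))
```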
Once Lemon is polling for the result, the latest value can be retrieved with the lemon-cli command. The message below shows that the metric is not yet being polled.
$ /usr/sbin/lemon-cli -m 801
Retrieving samples from spool directory '/var/spool/edg-fmon-agent'
Nodes: lxb5588
Metrics: 801
Start: (null) - (-1)
End: (null) - (-1)
Failed to MRs_getSamples() : #11 : (src/MR_FlatFileDB.c,437): Cannot open database file, check metric and node
When it is working, it should reply
$ lemon-cli -m 802
Retrieving samples from spool directory '/var/spool/edg-fmon-agent'
Nodes: bdii002
Metrics: 802
Start: (null) - (-1)
End: (null) - (-1)
local: bdii002 802 Mon Oct 31 08:42:20 2005 0
Total: 1 result
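To tell the two lemon-cli replies apart in a script, the "local:" line can be parsed. The sketch below is an illustration under assumptions: the function names are hypothetical, the line layout is taken from the single sample above, and only one value per sample is expected.

```python
from datetime import datetime

WORKING = """\
Retrieving samples from spool directory '/var/spool/edg-fmon-agent'
Nodes: bdii002
Metrics: 802
local: bdii002 802 Mon Oct 31 08:42:20 2005 0
Total: 1 result
"""

NOT_POLLING = """\
Nodes: lxb5588
Metrics: 801
Failed to MRs_getSamples() : #11 : (src/MR_FlatFileDB.c,437): Cannot open database file, check metric and node
"""


def parse_sample(line):
    """Return (node, metric, time, value) for a 'local:' line, else None."""
    fields = line.split()
    if len(fields) != 9 or fields[0] != "local:":
        return None
    try:
        when = datetime.strptime(" ".join(fields[3:8]), "%a %b %d %H:%M:%S %Y")
        return fields[1], int(fields[2]), when, float(fields[8])
    except ValueError:
        return None


def samples(output):
    """Collect every sample found in the lemon-cli output."""
    found = (parse_sample(line) for line in output.splitlines())
    return [sample for sample in found if sample is not None]


if __name__ == "__main__":
    print(samples(WORKING))
    print(samples(NOT_POLLING))  # empty list: the metric is not polled yet
```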
An additional Lemon template (pro_monitoring_sensor_gridbdii.tpl) needs to be defined. It is included by the software template of the cluster (pro_software_gridbdii_slc3):
template pro_monitoring_sensor_gridbdii;
"/system/monitoring/sensor/gridbdii/cmdline"="/usr/bin/perl -I/opt/edg/lib/perl -MGridBdii /usr/libexec/sensors/edg-fmon-sensor.pl";
"/system/monitoring/sensor/gridbdii/class"=list(
nlist("name", "GridBdii:BdiiStatus",
"descr","BDII application up or down",
"field",list(nlist ("name", "status",
"format", "float",
))
),
nlist("name", "GridBdii:BdiiLoad",
"descr","BDII application load (active searches)",
"field",list(nlist ("name", "load",
"format", "float",
))
),
);
Then run the fmonagent NCM component to load this into the Lemon configuration
# ncm-ncd --config fmonagent
After configuration, check the file /var/log/edg-fmon-agent.log for the lines related to the loading of the sensor (in this case for metrics 801 and 802). If you see the messages below, there is a problem with the CDB definitions.
2005-10-30 17:11:23 [ERROR] Metric '801' : no sensor found for metric class 'GridBdii::BdiiStatus'
2005-10-30 17:11:23 [ERROR] Metric '802' : no sensor found for metric class 'GridBdii::BdiiLoad'
The message below shows that everything was loaded correctly. A process running GridBdii should be visible.
2005-10-30 19:33:51 [INFO] Metric '802' : using sensor 'gridbdii'
$ ps -ef | grep Bdii
root 20005 19990 1 19:42 ? 00:00:00 /usr/bin/perl -I/opt/edg/lib/perl -MGridBdii /usr/libexec/sensors/edg-fmon-sensor.pl
If you get the message below in edg-fmon-agent.log, use the INI/GET test to check for code problems.
2005-10-30 19:45:51 [ERROR] Sensor gridbdii terminated unexpectedly, exit code 255
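The three log messages discussed above map onto three different follow-up actions. A minimal sketch of that mapping is shown here; diagnose is a hypothetical helper and the patterns cover only the message texts quoted on this page.

```python
import re

# (pattern, advice) pairs built from the three messages quoted on this page.
RULES = [
    (re.compile(r"Metric '(\d+)' : no sensor found for metric class"),
     "check the CDB definitions"),
    (re.compile(r"Metric '(\d+)' : using sensor '\w+'"),
     "sensor loaded"),
    (re.compile(r"Sensor (\w+) terminated unexpectedly, exit code \d+"),
     "run the INI/GET test"),
]


def diagnose(line):
    """Return (metric id or sensor name, advice) for a known log line, else None."""
    for pattern, advice in RULES:
        match = pattern.search(line)
        if match:
            return match.group(1), advice
    return None


if __name__ == "__main__":
    with open("/var/log/edg-fmon-agent.log") as log:
        for entry in log:
            result = diagnose(entry)
            if result:
                print(result)
```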
Once these unit tests are completed, inform the Lemon support staff so that they can carry out the necessary steps on the server. The following mail, sent to project-elfms-lemon@cernNOSPAMPLEASE.ch, defines the metric information required by the ELFms team.
I plan to add two sensors for the global and site BDIIs. As requested on http://lemon.web.cern.ch/lemon/doc/sensor_metric_registration.html, I send the information below
These define two metrics
Name BdiiStatus
Instance GridBdii::BdiiStatus
Description Return >=0 if application is up, <0 if application is down
Sampling 120 seconds
Historical Yes
Name BdiiLoad
Instance GridBdii::BdiiLoad
Description Return the number of active LDAP searches
Sampling 120 seconds
Historical Yes
There will be up to 7 hosts in production reporting on this.
A template has been added to the CDB pro_monitoring_cos_gridbdii.tpl with this information in. I need to update it with the new metric ids.
Please advise me on the Metric IDs and install the necessary collection structures on Lemon at CERN.
Currently, no alarms are required.
We would like an RRD display of the two values though. Is it possible to add two RRD graphs for all of the machines in the gridbdii cluster?
Any problems, drop me a note.
Once completed, the data will be stored in the Lemon database and visible through the Lemon interface. Examples are status and load.
Alarm
The alarm for the BDII is defined by sending the following mail to project-elfms-lemon@cernNOSPAMPLEASE.ch.
Following the procedure defined at http://lemon.web.cern.ch/lemon/doc/sensor_metric_registration.html,
Name of Alarm: GRID_BDII_WRONG
Procedure for operations: http://service-cc-opm.web.cern.ch/service-cc-opm/cgi-bin/opm_overview.cgi?context=list_keywords&OPM=op-proc-gridbdii.html
Contact: grid-cern-prod-is@cern.ch
People: Laurence Field, Yvan Calas
Metric: 801
Reference Value: Alarm if <0
Frequency: Every 2 minutes
Cluster: gridbdii (bdii001, bdii002)
Script: None
Please inform me if there is additional information required.
Tim
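The alarm rule requested in the mail above (metric 801, alarm if the value is below 0) can be written down as a small check, for example to replay stored samples. This is only an illustration of the rule as stated; the function names are hypothetical and it says nothing about how Lemon evaluates alarms internally.

```python
ALARM_TEXT = "GRID_BDII_WRONG"


def alarm_for(status_value):
    """Return the alarm text if metric 801 is negative, otherwise None."""
    return ALARM_TEXT if status_value < 0 else None


def first_alarm(samples):
    """Return the timestamp of the first (timestamp, value) pair that alarms."""
    for timestamp, value in samples:
        if alarm_for(value) is not None:
            return timestamp
    return None


if __name__ == "__main__":
    # Samples 120 seconds apart, matching the requested sampling period.
    history = [(1130580132, 0.085848), (1130580252, -1.0)]
    print(first_alarm(history))
```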
Feedback from Miro was that we could simply add the following to the pro_monitoring_cos_gridbdii.tpl file:
# Added GRID_BDII_WRONG if 801 < 0
"/system/monitoring/metric/_801/alarmtext" = "GRID_BDII_WRONG";
"/system/monitoring/metric/_801/reference" = "Positive value";
Load Balancing
The Load Balancing can be configured as follows
- Define the metrics using the lbclient component in CDB
- Update configuration
- Test
It uses the functionality of the latest lbclient (1.4), which allows the Lemon metrics to be used for the checks and load calculations.
See here for more information on the configuration of a load balancing alias.
See here for technical details about how to move machines from the prod-bdii-exp alias to the lcg-bdii alias (Veronique Lefebure, 26 October 2006).
Load balancing works if the /etc/iss.nologin file is present when the node is expected to be removed from the alias.
SetToDesiredState.gridbdii takes care of that for the "maintenance" state only. The "standby" state is not yet foreseen and should not be used.
The metric definition is done in pro_system_gridbdii.tpl. It defines the following rules:
- Only consider the host if the application is up (_801>=0)
- If more than one host is up, choose the one with the least load (-_802+100)
include pro_declaration_component_lbclient;
"/software/components/lbclient/active" = true;
"/software/components/lbclient/metric" = "/usr/local/sbin/lbclient";
"/software/components/lbclient/check/nologin" = true;
"/software/components/lbclient/check/tmpfull" = true;
"/software/components/lbclient/check/sshdaemon" = true;
"/software/components/lbclient/check/lemon" = list("_801>=0");
"/software/components/lbclient/load/lemon" = list("-_802+100");
Run on the BDII servers:
# ncm-ncd --co lbclient
On completion, the /usr/local/etc/lbclient.conf file should contain the following lines (or something similar):
check lemon _801>=0
load lemon -_802+100
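The two rules can be worked through with some numbers. The sketch below is an illustration of the rules as described on this page, not of lbclient internals: lb_value and least_loaded are hypothetical names, the metric values are invented, and bdii003 is a made-up host.

```python
def lb_value(m801, m802):
    """Evaluate the two configured expressions for one host.

    check lemon _801>=0  : the host is only considered if this holds
    load lemon -_802+100 : the value computed for a host that passes
    """
    if not m801 >= 0:
        return None
    return -m802 + 100


def least_loaded(metrics):
    """Pick the host that is up and has the fewest active searches (_802)."""
    eligible = {host: m for host, m in metrics.items() if m[0] >= 0}
    if not eligible:
        return None
    return min(eligible, key=lambda host: eligible[host][1])


# host -> (_801 status, _802 load); values invented for the example
EXAMPLE = {
    "bdii001": (0.08, 12),
    "bdii002": (0.10, 3),
    "bdii003": (-1, 0),   # application down, so never considered
}

if __name__ == "__main__":
    for host, (status, load) in EXAMPLE.items():
        print(host, lb_value(status, load))
    print("least loaded:", least_loaded(EXAMPLE))
```

How the load balancer ranks the computed values is not covered here; the selection in least_loaded follows the wording of the rule above.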
The first step of testing is to use the lbclient interactively and verbosely. Check the metrics are considered and are retrieved correctly.
# /usr/local/sbin/lbclient -d
The function can be tested using the following from lxplus
$ snmpget -v 1 bdii001 -c public .1.3.6.1.4.1.96.255.1
SNMPv2-SMI::enterprises.96.255.1 = INTEGER: -1
If it comes back with the text below, run
/usr/local/sbin/lbclient -d
as root to check that the command is working. This error message is produced because the format of the output is not as expected.
Error in packet
Reason: (noSuchName) There is no such variable name in this MIB.
Failed object: SNMPv2-SMI::enterprises.96.255.1
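The snmpget check is easy to script across the cluster. A minimal sketch follows, assuming the reply has the form shown above; parse_snmpget is a hypothetical helper and it only extracts the integer, without interpreting it.

```python
import re

# Matches a reply such as: SNMPv2-SMI::enterprises.96.255.1 = INTEGER: -1
SNMP_INT = re.compile(r"^(\S+) = INTEGER: (-?\d+)\s*$")

GOOD_REPLY = "SNMPv2-SMI::enterprises.96.255.1 = INTEGER: -1"
ERROR_REPLY = """\
Error in packet
Reason: (noSuchName) There is no such variable name in this MIB.
Failed object: SNMPv2-SMI::enterprises.96.255.1
"""


def parse_snmpget(output):
    """Return the integer reported by snmpget, or None if there is none."""
    for line in output.splitlines():
        match = SNMP_INT.match(line.strip())
        if match:
            return int(match.group(2))
    return None


if __name__ == "__main__":
    print(parse_snmpget(GOOD_REPLY))
    print(parse_snmpget(ERROR_REPLY))  # None: run lbclient -d as root
```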
Once the DNS load balancing alias is defined, you can check with host.
$ host prod-bdii
Administration
Problem Determination
There should be 3 bdii daemons running under the edguser account:
edguser 6544 1 0 15:23 ? 00:00:00 /usr/sbin/slapd
-f /opt/bdii//var/2172/bdii-slapd.conf -h ldap://localhost:2172 -u edguser
edguser 7113 1 0 15:24 ? 00:00:00 /usr/sbin/slapd
-f /opt/bdii//var/2173/bdii-slapd.conf -h ldap://localhost:2173 -u edguser
edguser 7687 1 0 15:26 ? 00:00:00 /usr/sbin/slapd
-f /opt/bdii//var/2171/bdii-slapd.conf -h ldap://localhost:2171 -u edguser
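A quick way to confirm that all three daemons are present is to look for the three internal ports in the ps output. This is a sketch under assumptions: the helper names are hypothetical, and each process is expected on a single line of "ps -ef" output (the listing above is wrapped for display).

```python
import re

PORT = re.compile(r"-h ldap://localhost:(\d+)")
EXPECTED_PORTS = {2171, 2172, 2173}

# Two of the three daemons, each on one line as ps prints them.
PS_SAMPLE = """\
edguser 6544 1 0 15:23 ? 00:00:00 /usr/sbin/slapd -f /opt/bdii//var/2172/bdii-slapd.conf -h ldap://localhost:2172 -u edguser
edguser 7113 1 0 15:24 ? 00:00:00 /usr/sbin/slapd -f /opt/bdii//var/2173/bdii-slapd.conf -h ldap://localhost:2173 -u edguser
root 20005 19990 1 19:42 ? 00:00:00 /usr/bin/perl -I/opt/edg/lib/perl -MGridBdii /usr/libexec/sensors/edg-fmon-sensor.pl
"""


def slapd_ports(ps_output, user="edguser"):
    """Return the set of ports served by slapd processes owned by user."""
    ports = set()
    for line in ps_output.splitlines():
        fields = line.split()
        if not fields or fields[0] != user or "/usr/sbin/slapd" not in fields:
            continue
        match = PORT.search(line)
        if match:
            ports.add(int(match.group(1)))
    return ports


def missing_ports(ps_output):
    """Return the expected internal ports with no slapd process."""
    return EXPECTED_PORTS - slapd_ports(ps_output)


if __name__ == "__main__":
    print(sorted(missing_ports(PS_SAMPLE)))
```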
The status check should return OK:
/etc/init.d/bdii status
bdii OK
The daemons should start after boot (chkconfig mechanism). The rolling log is in /opt/bdii/var/bdii.log and the configuration file is /opt/bdii/etc/bdii.conf.
If the daemons are not running after a reboot, look at the log. You can try to start them with
/etc/init.d/bdii start
The servers contact the registered worldwide grid LDAP servers every two minutes, in a cycle that lasts about 30 seconds, to republish their information. It is normal for some sites to fail to respond sometimes; these failures appear in the log. Each cycle is time-stamped.
The logs are rotated and /opt has normally over 100GB free. The cpu load is normally less than 1%. They read data externally on port 2170 and use ports 2171, 2172 and 2173 internally.
The lemon sensor generating the GRID_BDII_WRONG alarm is stored in
/opt/edg/lib/perl/GridBdii.pm
and it issues the command
ldapsearch -h $h -x -b GlueSiteUniqueID=CERN-CIC,mds-vo-name=cernlcg2,mds-vo-name=local,o=grid -p 2170
where $h is the local hostname. The same command can be used to check whether the cluster services are running, by using the hostname lcg-bdii for the global BDII and the hostname prod-bdii for the CERN site BDII. The output is usually:
version: 2
#
# filter: (objectclass=*)
# requesting: GlueSiteLatitude
#
# CERN-PROD, CERN-PROD, local, grid
dn: GlueSiteUniqueID=CERN-PROD,mds-vo-name=CERN-PROD,mds-vo-name=local,o=grid
GlueSiteLatitude: 46.12
# search result
search: 2
result: 0 Success
# numResponses: 2
# numEntries: 1
If it replies as below, the grid services support should be contacted to find out why the information being returned is incomplete.
version: 2
#
# filter: (objectclass=*)
# requesting: ALL
#
# search result
search: 2
result: 32 No such object
matchedDN: mds-vo-name=local,o=grid
# numResponses: 1
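The two ldapsearch replies above differ in the result code and in whether GlueSiteLatitude is returned. A minimal sketch of a check on those two points is given below; the function names are hypothetical and the parsing relies only on the lines shown in the two sample outputs.

```python
import re

RESULT = re.compile(r"^result: (\d+)")

GOOD = """\
dn: GlueSiteUniqueID=CERN-PROD,mds-vo-name=CERN-PROD,mds-vo-name=local,o=grid
GlueSiteLatitude: 46.12
# search result
search: 2
result: 0 Success
# numResponses: 2
# numEntries: 1
"""

INCOMPLETE = """\
# search result
search: 2
result: 32 No such object
matchedDN: mds-vo-name=local,o=grid
# numResponses: 1
"""


def parse_ldapsearch(output):
    """Return (result code, latitude) found in the output; either may be None."""
    result = None
    latitude = None
    for line in output.splitlines():
        line = line.strip()
        match = RESULT.match(line)
        if match:
            result = int(match.group(1))
        elif line.startswith("GlueSiteLatitude:"):
            latitude = float(line.split(":", 1)[1])
    return result, latitude


def is_complete(output):
    """True if the search succeeded and the latitude attribute came back."""
    result, latitude = parse_ldapsearch(output)
    return result == 0 and latitude is not None


if __name__ == "__main__":
    print(parse_ldapsearch(GOOD), is_complete(GOOD))
    print(parse_ldapsearch(INCOMPLETE), is_complete(INCOMPLETE))
```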
Troubleshooting
- BDII does not start up after an unclean shutdown (e.g. power cut)
  - mv /etc/lcg-quattor-site-info.def /etc/lcg-quattor-site-info.def_save
  - ncm_wrapper.sh --configure yaim
  - /opt/lcg/yaim/scripts//configure_node /etc/lcg-quattor-site-info.def BDII
  - /etc/init.d/bdii start
  - check with an ldap search if things are back up
--
TimBell - 22 Sep 2005