
SC4 BDII Home Page

This page describes the installation, configuration and operational procedures of the CERN SC4 BDII service.

This page documents the current situation. It does not cover requirements or issues. These are covered in the BdiiNotes.

Overview

The CERN BDII service is defined as a critical service in the services catalog.

There are three implementations:

The Global BDII, which provides a worldwide view of the BDII data on the grid.
The site GIIS, which provides a consolidated view of the various GRIS servers on the CE and SE.
A VO-specific BDII, which is a view on the Global BDII with the inclusion of the VO white and black listing of sites.

Figure: BDII-Production.gif

Installation and Configuration

CDB template is pro_type_gridbdii_lcg_slc3.tpl

Base SLC3 installation:

File partitioning: as in here
SLC3 software plus LCG BDII: see here

LCG components:

   include pro_software_components_lcg_yaim_2_6;
  "/software/components/yaim/active"=true;
  "/software/components/yaim/nodetype/LCG-BDII" = true;
#  for yaim, local user edguser is needed:
  "/software/components/access_control/roles/edguser/0"="edguser";
  "/software/components/access_control/privileges/acl_interactive/role/edguser/0/targets"=list("+cluster::gridbdii");

The bdii processes run under the edguser account and group. The reserved accounts and uid/gid values for grid server processes are here.

Work is underway to add the edguser and edginfo users and groups to the SINDES passw template and to the group.header file (see more info here).

Configure Yaim:

ncm-ncd --configure yaim

[INFO] running component: yaim
---------------------------------------------------------
[INFO] updated /etc/lcg-quattor-site-info.def
[INFO] running command: /opt/lcg/yaim/scripts//configure_node /etc/lcg-quattor-site-info.def BDII
[INFO] '/opt/lcg/yaim/scripts//configure_node /etc/lcg-quattor-site-info.def BDII' STDOUT output produced:
Configuring config_upgrade ...
Configuring config_edgusers ...
Configuring config_bdii ...
Stopping BDII                                              [FAILED]
Starting BDII                                              [  OK  ]
Configuring config_fmon_client ...
Stopping edg-fmon-agent:                                   [  OK  ]
Starting edg-fmon-agent:                                   [  OK  ]
Configuration Complete

[INFO] '/opt/lcg/yaim/scripts//configure_node /etc/lcg-quattor-site-info.def BDII' run succesfully
[INFO] configure on component yaim executed, 0 errors, 0 warnings

=========================================================

[OK]   0 errors, 0 warnings executing configure

Management Procedures

SMS

What does "maintenance" mean for a BDII host? Currently /usr/libexec/SetToDesiredState.gridbdii is used to put the nodes into production or maintenance. In maintenance there is NO /etc/nologin file, otherwise the bdii daemon could not be started.

The "gridbdii" CDB cluster has to be configured to have the expected behaviour.

Operations

Use /etc/init.d/bdii start|stop|status to start, stop, or check the status of the service.

The cluster is configured so that bdii is automatically started at boot:

 "/software/components/chkconfig/service/bdii/add" = true;

Monitoring

Lemon Sensors

The Lemon sensors for BDII provide the following functions:

Status performs a lookup for the CERN-PROD latitude. This is defined in both the global and the GRIS BDIIs. The time taken to perform this operation is returned as an approximate performance metric.
Load checks the number of concurrent queries. This can be used as an average load factor.

This is covered by the lemon-sensor-grid-bdii rpm.
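
As an illustration of the Status metric convention (a negative value when the BDII is down, otherwise the time taken by the lookup), here is a minimal Python sketch. It is not the code of the GridBdii Perl sensor: the name bdii_status and the lookup callable are hypothetical, and the actual LDAP query is left to the caller.

```python
import time
from typing import Callable


def bdii_status(lookup: Callable[[], bool],
                clock: Callable[[], float] = time.monotonic) -> float:
    """Time one lookup; return the seconds taken, or -1.0 if it failed.

    `lookup` is a hypothetical callable that performs the LDAP query and
    returns True when the expected entry came back.
    """
    start = clock()
    try:
        ok = lookup()
    except Exception:
        # Any error (timeout, refused connection, ...) counts as "down".
        ok = False
    elapsed = clock() - start
    return elapsed if ok else -1.0
```

Passing the clock as a parameter keeps the sketch testable without a live BDII; a real sensor would simply use the default.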

To configure this for a machine, there are two CDB profiles:

The pro_monitoring_cos_gridbdii profile defines the templates for the monitors.

Within the profile pro_system_gridbdii, the pro_monitoring_cos_gridbdii template is included and the metrics are set to active.

The Lemon metric template added to the CDB is:

############################################################
#
# template pro_monitoring_cos_gridbdii
#
############################################################

template pro_monitoring_cos_gridbdii;

"/system/monitoring/metric/_801" = nlist(
        "name",      "BdiiStatus",
        "descr",     "returns <0 if Bdii down, >=0 if Bdii up",
        "class",     "GridBdii::BdiiStatus",
        "period",    120,
        "active",    false,
        "latestonly",false,
);


"/system/monitoring/metric/_802" = nlist(
        "name",      "BdiiLoad",
        "descr",     "Returns a value for the current load on the BDII",
        "class",     "GridBdii::BdiiLoad",
        "period",    120,
        "active",    false,
        "latestonly",false,
);

The Lemon sensor can be tested as follows:

$ /usr/bin/perl -I/opt/edg/lib/perl -MGridBdii /usr/libexec/sensors/edg-fmon-sensor.pl
INI 1 GridBdii::BdiiStatus
GET 1
PUT 01 1 1130580132 0.085848
INI 2 GridBdii::BdiiLoad
GET 2
PUT 01 2 1130580157 0
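
When scripting this test, the PUT replies can be parsed. The sketch below assumes, from the sample above only, that a PUT line has the form "PUT <flag> <request id> <epoch seconds> <value>"; the meaning of the second field is not documented on this page, so it is ignored. The names Sample and parse_put are hypothetical.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Sample(NamedTuple):
    request_id: int      # matches the id used in the INI/GET lines
    timestamp: datetime  # sample time, converted to UTC
    value: float


def parse_put(line: str) -> Sample:
    """Parse one 'PUT' reply line from the sensor test dialogue."""
    parts = line.split()
    if len(parts) != 5 or parts[0] != "PUT":
        raise ValueError(f"not a PUT line: {line!r}")
    _, _flag, request_id, epoch, value = parts
    return Sample(
        int(request_id),
        datetime.fromtimestamp(int(epoch), tz=timezone.utc),
        float(value),
    )
```

A negative value for request id 1 would indicate, per the metric description, that the BDII is considered down.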

Once Lemon is polling for the result, the latest value can be retrieved with the lemon-cli command. The output below shows that the metric is not yet being polled.

$ /usr/sbin/lemon-cli -m 801
Retrieving samples from spool directory '/var/spool/edg-fmon-agent'
        Nodes:    lxb5588
        Metrics:  801
        Start:    (null) - (-1)
        End:      (null) - (-1)

Failed to MRs_getSamples() : #11 : (src/MR_FlatFileDB.c,437): Cannot open database file, check metric and node

When it is working, it should reply:

$ lemon-cli -m 802
Retrieving samples from spool directory '/var/spool/edg-fmon-agent'
        Nodes:    bdii002
        Metrics:  802
        Start:    (null) - (-1)
        End:      (null) - (-1)

        local:  bdii002 802     Mon Oct 31 08:42:20 2005        0

        Total: 1 result

An additional Lemon template needs to be defined (pro_monitoring_sensor_gridbdii.tpl), which is included by the software template of the cluster: (pro_software_gridbdii_slc3.

template pro_monitoring_sensor_gridbdii;

"/system/monitoring/sensor/gridbdii/cmdline"="/usr/bin/perl -I/opt/edg/lib/perl -MGridBdii /usr/libexec/sensors/edg-fmon-sensor.pl";

"/system/monitoring/sensor/gridbdii/class"=list(
        nlist("name", "GridBdii::BdiiStatus",
              "descr","BDII application up or down",
              "field",list(nlist ("name",   "status",
                                  "format", "float",
                              ))
         ),
        nlist("name", "GridBdii::BdiiLoad",
              "descr","BDII application load (active searches)",
              "field",list(nlist ("name",   "load",
                                  "format", "float",
                              ))
         ),
);

Then run the fmonagent NCM component to load this into the Lemon configuration:

# ncm-ncd --configure fmonagent

After configuration, check the file /var/log/edg-fmon-agent.log for the lines related to the loading of the sensor (in this case for metrics 801 and 802). If you see the messages below, there is a problem with the CDB definitions.

2005-10-30 17:11:23 [ERROR] Metric '801' : no sensor found for metric class 'GridBdii::BdiiStatus'
2005-10-30 17:11:23 [ERROR] Metric '802' : no sensor found for metric class 'GridBdii::BdiiLoad'

The message below shows that everything was loaded correctly. A process running GridBdii should be visible.

2005-10-30 19:33:51 [INFO]  Metric '802' : using sensor 'gridbdii'

$ ps -ef | grep Bdii
root     20005 19990  1 19:42 ?        00:00:00 /usr/bin/perl -I/opt/edg/lib/perl -MGridBdii /usr/libexec/sensors/edg-fmon-sensor.pl

If you get the message below in edg-fmon-agent.log, use the INI/GET test to check for code problems.

2005-10-30 19:45:51 [ERROR] Sensor gridbdii terminated unexpectedly, exit code 255
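
The three kinds of log message shown in this section can be told apart automatically. The following Python sketch relies only on the message formats quoted on this page; the category names and the classify function are hypothetical, and other message formats written by edg-fmon-agent are not covered.

```python
import re
from typing import Optional, Tuple

# One pattern per message quoted above; the labels are made up for this sketch.
PATTERNS = [
    ("missing_sensor",
     re.compile(r"\[ERROR\]\s+Metric '(\d+)' : no sensor found "
                r"for metric class '([^']+)'")),
    ("loaded",
     re.compile(r"\[INFO\]\s+Metric '(\d+)' : using sensor '([^']+)'")),
    ("sensor_crash",
     re.compile(r"\[ERROR\]\s+Sensor (\S+) terminated unexpectedly, "
                r"exit code (\d+)")),
]


def classify(line: str) -> Optional[Tuple[str, Tuple[str, ...]]]:
    """Return (category, captured fields) for a known line, else None."""
    for category, pattern in PATTERNS:
        match = pattern.search(line)
        if match:
            return category, match.groups()
    return None
```

A "missing_sensor" hit points at the CDB definitions, while a "sensor_crash" hit points at the sensor code, matching the two checks described above.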

Once these unit tests are completed, inform the Lemon support staff so that they can do the necessary steps on the server. The mail below, sent to project-elfms-lemon@cern.ch, gives the metric information required by the ELFms team.


I plan to add two sensors for the global and site BDIIs.  As requested on http://lemon.web.cern.ch/lemon/doc/sensor_metric_registration.html, I send the information below

These define two metrics

Name       BdiiStatus
Instance   GridBdii::BdiiStatus
Description   Return >=0 if application is up, <0 if application is down
Sampling   120 seconds
Historical   Yes

Name    BdiiLoad
Instance   GridBdii::BdiiLoad
Description   Return the number of active LDAP searches
Sampling   120 seconds
Historical    Yes

There will be up to 7 hosts in production reporting on this.

A template has been added to the CDB pro_monitoring_cos_gridbdii.tpl with this information in.  I need to update it with the new metric ids.

Please advise me on the Metric IDs and install the necessary collection structures on Lemon at CERN.

Currently, no alarms are required.

We would like an RRD display of the two values though.  Is it possible to add two RRD graphs for all of the machines in the gridbdii cluster?

Any problems, drop me a note.

Once completed, the data will be stored in the Lemon database and visible through the lemon interface. An example is status and load.

Alarm

The alarm for the BDII is defined by sending the following mail to project-elfms-lemon@cern.ch.

Following the procedure defined at http://lemon.web.cern.ch/lemon/doc/sensor_metric_registration.html,

Name of Alarm: GRID_BDII_WRONG
Procedure for operations: http://service-cc-opm.web.cern.ch/service-cc-opm/cgi-bin/opm_overview.cgi?context=list_keywords&OPM=op-proc-gridbdii.html
Contact: grid-cern-prod-is@cern.ch
People: Laurence Field, Yvan Calas
Metric: 801
Reference Value: Alarm if <0
Frequency: Every 2 minutes
Cluster: gridbdii (bdii001, bdii002)
Script: None

Please inform me if there is additional information required.

Tim

Feedback from Miro indicated that we could simply add the following to the pro_monitoring_cos_gridbdii.tpl file:

# Added GRID_BDII_WRONG if 801 < 0
"/system/monitoring/metric/_801/alarmtext" = "GRID_BDII_WRONG";
"/system/monitoring/metric/_801/reference" = "Positive value";

Load Balancing

Load balancing can be configured as follows:

Define the metrics using the lbclient component in CDB
Update configuration
Test

It uses the functionality of the latest lbclient (1.4) which allows the Lemon metrics to be used for the checking and load calculations.

See here for more information on the configuration of a load balancing alias.

See here for technical details about how to move machines from prod-bdii-exp to lcg-bdii alias (Veronique Lefebure, 26 October 2006)

Load balancing works provided the /etc/iss.nologin file is present whenever the node is expected to be removed from the alias. SetToDesiredState.gridbdii takes care of that for the "maintenance" state only. The "standby" state is not yet foreseen and should not be used.

The metric definition is done in pro_system_gridbdii.tpl. It defines the following rules:

Only consider the host if the application is up (_801>=0).
If more than one host is up, choose the one with the least load (-_802+100).

include pro_declaration_component_lbclient;
"/software/components/lbclient/active" = true;
"/software/components/lbclient/metric" = "/usr/local/sbin/lbclient";
"/software/components/lbclient/check/nologin" = true;
"/software/components/lbclient/check/tmpfull" = true;
"/software/components/lbclient/check/sshdaemon" = true;
"/software/components/lbclient/check/lemon" = list("_801>=0");
"/software/components/lbclient/load/lemon" = list("-_802+100");

Run on the BDII servers:

# ncm-ncd --configure lbclient

On completion, the /usr/local/etc/lbclient.conf file should contain the following lines (or something similar)

check lemon _801>=0
load lemon -_802+100
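
To make the two rules concrete, the Python sketch below evaluates the check (_801>=0) and the load expression (-_802+100) for a set of hosts and picks the least loaded one that is up, as stated in the rules above. It is an illustration only: it does not reproduce lbclient or the DNS arbiter, and how they rank the reported value is not described on this page. The function names are hypothetical.

```python
from typing import Dict, Optional, Tuple


def lb_value(status: float, load: float) -> Optional[float]:
    """Apply the check `_801>=0`, then the load expression `-_802+100`."""
    if status < 0:
        return None  # application down: host is not a candidate
    return -load + 100


def least_loaded(metrics: Dict[str, Tuple[float, float]]) -> Optional[str]:
    """Pick the host that is up and has the smallest metric 802 value.

    `metrics` maps host name -> (metric 801 value, metric 802 value).
    """
    up = {host: load for host, (status, load) in metrics.items()
          if status >= 0}
    if not up:
        return None
    return min(up, key=lambda host: up[host])
```

For example, with bdii001 reporting (0.08, 12) and bdii002 reporting (0.09, 3), least_loaded returns bdii002; the metric values here are made up.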

The first step of testing is to use the lbclient interactively and verbosely. Check the metrics are considered and are retrieved correctly.

# /usr/local/sbin/lbclient -d

The function can be tested from lxplus using the following:

$ snmpget -v 1 bdii001 -c public .1.3.6.1.4.1.96.255.1
SNMPv2-SMI::enterprises.96.255.1 = INTEGER: -1

If it comes back with the text below, run /usr/local/sbin/lbclient -d as root to check that the command is working. This error message is produced because the format of the output is not as expected.

Error in packet
Reason: (noSuchName) There is no such variable name in this MIB.
Failed object: SNMPv2-SMI::enterprises.96.255.1
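
If this check is scripted, the two replies shown above can be distinguished as in the following Python sketch. It only recognises the two output forms quoted on this page and does not interpret the returned integer; parse_snmpget is a hypothetical name.

```python
import re
from typing import Optional

_INTEGER_RE = re.compile(r"=\s*INTEGER:\s*(-?\d+)")


def parse_snmpget(output: str) -> Optional[int]:
    """Return the INTEGER value of an snmpget reply, or None on noSuchName."""
    match = _INTEGER_RE.search(output)
    if match:
        return int(match.group(1))
    if "noSuchName" in output:
        # The agent did not publish the variable: check lbclient -d as root.
        return None
    raise ValueError("unrecognised snmpget output")
```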

Once the DNS load balancing alias is defined, you can check with host.

$ host prod-bdii

Administration

Problem Determination

There should be 3 bdii daemons running under the edguser account:

edguser   6544     1  0 15:23 ?        00:00:00 /usr/sbin/slapd -f /opt/bdii//var/2172/bdii-slapd.conf -h ldap://localhost:2172 -u edguser
edguser   7113     1  0 15:24 ?        00:00:00 /usr/sbin/slapd -f /opt/bdii//var/2173/bdii-slapd.conf -h ldap://localhost:2173 -u edguser
edguser   7687     1  0 15:26 ?        00:00:00 /usr/sbin/slapd -f /opt/bdii//var/2171/bdii-slapd.conf -h ldap://localhost:2171 -u edguser
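
This check can be automated by scanning the process listing for the three slapd instances. The Python sketch below assumes the command-line layout shown above (including the "-u edguser" argument) and works on the text of the listing, however it was obtained; the names EXPECTED_PORTS and missing_slapd_ports are hypothetical.

```python
import re
from typing import Set

# Internal ports used by the three slapd instances, as listed on this page.
EXPECTED_PORTS = {2171, 2172, 2173}

# \s+ also matches a line break, so wrapped listings are handled too.
_SLAPD_RE = re.compile(
    r"/usr/sbin/slapd\s+-f\s+\S+\s+-h\s+ldap://localhost:(\d+)\s+-u\s+edguser"
)


def missing_slapd_ports(ps_output: str) -> Set[int]:
    """Return the expected ports for which no slapd process was found."""
    found = {int(port) for port in _SLAPD_RE.findall(ps_output)}
    return EXPECTED_PORTS - found
```

An empty set means all three daemons are present; any port returned identifies the instance to look for in the log.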

The status check should return OK:

/etc/init.d/bdii status 
bdii OK

The daemons should start after boot (chkconfig mechanism). The rolling log is in

/opt/bdii/var/bdii.log

and the configuration file is

/opt/bdii/etc/bdii.conf

If the daemons are not running after a reboot, look at the log. You can try to start them with

/etc/init.d/bdii start

The servers try to contact the registered worldwide grid LDAP servers every two minutes, in a cycle that lasts about 30 seconds, to republish their information. It is normal for some sites to fail to respond occasionally; these failures appear in the log. Each cycle is time-stamped. The logs are rotated and /opt normally has over 100GB free. The CPU load is normally less than 1%. External clients read the data on port 2170; ports 2171, 2172 and 2173 are used internally.

The lemon sensor generating the GRID_BDII_WRONG alarm is stored in

/opt/edg/lib/perl/GridBdii.pm

and it issues the command

ldapsearch -h $h -x -b GlueSiteUniqueID=CERN-CIC,mds-vo-name=cernlcg2,mds-vo-name=local,o=grid -p 2170

where $h is the local hostname. This command can also be used to check whether the cluster services are running, by using the hostname lcg-bdii for the global BDII and the hostname prod-bdii for the CERN site BDII. The output is usually:

version: 2

#
# filter: (objectclass=*)
# requesting: GlueSiteLatitude
#

# CERN-PROD, CERN-PROD, local, grid
dn: GlueSiteUniqueID=CERN-PROD,mds-vo-name=CERN-PROD,mds-vo-name=local,o=grid
GlueSiteLatitude: 46.12

# search result
search: 2
result: 0 Success

# numResponses: 2
# numEntries: 1

If it replies as below, the grid services support should be contacted to find out why the information being returned is incomplete.

version: 2

#
# filter: (objectclass=*)
# requesting: ALL
#

# search result
search: 2
result: 32 No such object
matchedDN: mds-vo-name=local,o=grid

# numResponses: 1
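
The difference between the two replies is the LDAP result code (0 for success, 32 for "no such object") and the number of entries returned. A minimal Python sketch of that check follows; it parses only the "result:" and "# numEntries:" lines shown above, and the names LdapResult, parse_ldapsearch and lookup_succeeded are hypothetical.

```python
import re
from typing import NamedTuple


class LdapResult(NamedTuple):
    code: int     # LDAP result code, e.g. 0 or 32
    text: str     # text after the code, e.g. "Success"
    entries: int  # value of "# numEntries:", 0 if the line is absent


def parse_ldapsearch(output: str) -> LdapResult:
    """Extract the result code and entry count from ldapsearch output."""
    result = re.search(r"^result:\s*(\d+)\s*(.*)$", output, re.MULTILINE)
    if result is None:
        raise ValueError("no 'result:' line found")
    entries = re.search(r"^# numEntries:\s*(\d+)", output, re.MULTILINE)
    return LdapResult(
        int(result.group(1)),
        result.group(2).strip(),
        int(entries.group(1)) if entries else 0,
    )


def lookup_succeeded(output: str) -> bool:
    """True when the search succeeded and returned at least one entry."""
    parsed = parse_ldapsearch(output)
    return parsed.code == 0 and parsed.entries >= 1
```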

Troubleshooting

BDII does not start up after an unclean shutdown (e.g. a power cut):
- mv /etc/lcg-quattor-site-info.def /etc/lcg-quattor-site-info.def_save
- ncm_wrapper.sh --configure yaim
- /opt/lcg/yaim/scripts//configure_node /etc/lcg-quattor-site-info.def BDII
- /etc/init.d/bdii start
- check with an LDAP search that things are back up

-- TimBell - 22 Sep 2005

Attachments

Attachment	Revisions	Size	Date	Who	Comment
BDII-Production.gif	r1	31.6 K	2005-11-02 13:51	TimBell
BDII-Production.vsd	r1	154.0 K	2005-11-02 13:50	TimBell
CernBdiiProd.draw	r4 r3 r2 r1	6.3 K	2005-10-31 11:48	TimBell	TWiki Draw draw file
CernBdiiProd.gif	r4 r3 r2 r1	4.3 K	2005-10-31 11:49	TimBell	TWiki Draw GIF file
Topic revision: r38 - 2006-11-28 - LaurenceField
