FTS Smoke-Test
The following describes the procedure for checking the FTS nodes. It should be followed mostly in sequence.
The code examples below are designed, as much as possible, to be cut and pasted into the terminal window. Please report any errors you see.
1) Ping the node.
Log into any lxplus batch node. Attempt to ping the "problematic" FTS server. If you cannot ping the host there is likely a hardware problem, a network problem or the machine is in the middle of being rebooted. To cover the last case, wait a few minutes and try again. Call the operator if the problem persists.
2) Connect to the node.
ssh into the FTS node as root (make sure you have run klog.krb first to obtain your Kerberos AFS token). If you cannot ssh into the node, contact the operator.
3) Check whether the agent daemon is running.
Run the command:
ps aux | grep glite
You should see a single daemon called glite-transfer-fts-something running as user sc3.
Possible problem: there is more than one glite-transfer-fts-something process:
- There should not be more than one. kill -9 all the daemon processes identified using ps aux. Then run the FtsMeanAndClean15 procedure to restart the service from a clean config.
Possible problem: there are none, or there are none after a restart:
- Attempt to restart the service using service gLite stop followed by service gLite start. If the daemon is still not there, look in the logfile /var/log/glite/glite-transfer-fts-something.log (it should be the latest logfile in that directory). The service should have dumped its configuration into the logfile upon startup. Look for any ERROR class messages. Typical startup failures include configuration file parsing errors (if the config file has been corrupted or changed) or database connection errors (in which case the problem is likely on the database node).
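The step-3 check can be sketched as a single test. This is a sketch only: the process-name pattern is an assumption, so adjust it to whatever ps actually shows on the node.

```shell
# Count the agent daemons. The [g] in the pattern stops grep from
# matching its own command line; the daemon-name pattern is an
# assumption, adjust it to what ps shows on the node.
n=$(ps aux | grep -c '[g]lite-transfer-fts' || true)
if [ "$n" -eq 0 ]; then
    echo "no agent daemon running, try a restart"
elif [ "$n" -gt 1 ]; then
    echo "$n agent daemons running, kill them and run FtsMeanAndClean15"
else
    echo "agent daemon OK"
fi
```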
4) Check whether the web service container daemon is running.
Run the command:
ps aux | grep tomcat4
You should see a single java daemon running; ps aux will show the name of the daemon as the full Java command line.
Possible problem: there is more than one java process owned by tomcat4 (this is unusual):
- There should not be more than one. Run the same FtsMeanAndClean15 procedure as for step 3).
Possible problem: there are no java processes, or none after a restart:
- Attempt to restart the service using service gLite stop followed by service gLite start. If the daemon is still not there, look in the logfile /usr/share/tomcat5/logs/org.glite.data (it should be among the latest logfiles in that directory). The service should have dumped some of its configuration into the logfile upon startup. Look for any ERROR or FATAL class messages. Typical startup failures include configuration file parsing errors (if the config file has been corrupted or changed), grid-mapfile parsing errors (if any of the service mapfiles have been corrupted or changed) or database connection errors (in which case the problem is likely on the database node).
- Run, as root, the commands glite-transfer-channel-list and glite-transfer-list. If either of them displays a message 'tcp connect error', 'InternalException' or 'Service is misconfigured', look in the org.glite.data logfile for ERROR or FATAL class messages that may indicate what the problem is.
5) Check the number of gLite instances configured to run inside the tomcat container.
There should be one and only one. Issue the following commands and check the output:
ls -trl /etc/tomcat5/Catalina/localhost/*glite*
there should be only one file.
ls -trl /usr/share/tomcat5/webapps/
ignore the ROOT, balancer and webdav directories if they exist; other than these, there should be only one directory.
ls -trld /usr/share/tomcat5/work/Catalina/localhost/*glite*
there should be only one directory.
If any of these checks fail, run the same FtsMeanAndClean15 procedure as mentioned in step 3).
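The three checks above amount to "each glob matches exactly one entry". A minimal sketch of that idea, demonstrated on a temporary directory (the gLite webapp directory name here is illustrative, not the real one):

```shell
# Count how many entries a glob matches; anything other than 1 is trouble.
# The temp dir stands in for /usr/share/tomcat5/webapps; the gLite
# directory name is made up for the demonstration.
WEBAPPS=$(mktemp -d)
mkdir "$WEBAPPS/glite-data-transfer-fts" "$WEBAPPS/ROOT" "$WEBAPPS/balancer"
n=0
for d in "$WEBAPPS"/*glite*; do
    [ -e "$d" ] && n=$((n+1))   # the -e test guards the unmatched-glob case
done
echo "gLite instances: $n"
```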
6) If the database appears to be down
If any of the logfiles suggest a problem connecting to the database, try to confirm this manually.
Look in the config file /opt/glite/etc/config/glite-file-transfer-service-oracle.cfg.xml. In the first 50 lines of the file you'll find the following parameters (it's an XML file): DBUSER, DBPASSWORD, DBSERVICENAME, DBHOST. Take the values and run the following sqlplus command:
sqlplus DBUSER/DBPASSWORD@DBHOST/DBSERVICENAME
for example, for the production throughput FTS on lxshare026d, run:
sqlplus fts_sc3prod/xxxxxxxx@fts.cern.ch/fts_prod.cern.ch
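The parameter values can also be pulled out of the config file with sed. The XML layout in this sketch is purely hypothetical (a guess at a name="..." attribute style); check the real glite-file-transfer-service-oracle.cfg.xml and adapt the pattern to match.

```shell
# Extract a DB parameter from a config XML. The file layout below is a
# hypothetical stand-in, NOT the real FTS config format; adapt the sed
# pattern to the actual file on the node.
CFG=$(mktemp)
cat > "$CFG" <<'EOF'
<config>
  <param name="DBUSER">fts_sc3prod</param>
  <param name="DBHOST">fts.cern.ch</param>
</config>
EOF
DBUSER=$(sed -n 's:.*name="DBUSER">\([^<]*\)<.*:\1:p' "$CFG")
echo "DBUSER=$DBUSER"
```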
Possible problem: invalid username/password; logon denied
As it says, the username/password pair is wrong; the service is misconfigured. Edit the config file, changing to the correct username/password [ where? ]. If you're sure they are correct and it still says 'logon denied', contact the DB service support.
Possible problem: account locked
After a few tries with the wrong password, the account is locked for around 15 minutes. This means someone (possibly you) has attempted to connect with the wrong password. If the service logfiles indicate that the service cannot connect to the DB, stop the service (service gLite stop), wait 15 minutes, then attempt to restart (service gLite start). If this does not help, call DB support (there may be an errant script somewhere attempting to connect to the DB with a bad password).
Possible problem: it hangs, gets connection refused, a "TNS:could not resolve" message, or a "TNS listener does not currently know of the service requested" message
Check that the hostname and servicename are correct for this FTS node [ where? ]. If they appear to be correct, try again in a minute or two; if you still have problems, contact DB support.
Once at the prompt (assuming a successful logon), run the command:
select table_name from user_tables;
you should see 6 tables (T_FILE, T_CHANNEL, T_JOB, T_SCHEMA_VERS, T_LOGICAL).
If you do not see all 6 tables, contact FTS 3rd level support.
7) Filesystems full
The operator should have told you if this was the case (since this is likely the alarm that triggered the call). The FTS produces a lot of logs (both the webservice and the agent daemon).
Check the partition where the web-service logs live:
df -H /usr/share/tomcat5/logs/
du -H /usr/share/tomcat5/logs/
If the partition is becoming full and it looks like the log directory is to blame, you should remove the logs, ideally from the oldest file backwards. It would be preferable to copy the logs somewhere first (and then report where you put them at the daily meeting). If you need to delete any of the active logs (the ones without a datestamp), restart the service afterwards (service gLite stop followed by service gLite start). Most of the production FTS boxes live on disk servers, so there should be large, fairly empty disks accessible in /shift/.
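To work from the oldest file backwards, sort the listing by modification time. A sketch on a temporary directory (the filenames here are made up; on the node, run ls -tr against /usr/share/tomcat5/logs):

```shell
# List logfiles oldest-first so the oldest can be copied away and removed
# first. Temp dir and filenames are illustrative only.
LOGS=$(mktemp -d)
touch -d '2 days ago' "$LOGS/catalina.2005-07-05.log"
touch -d '1 day ago'  "$LOGS/catalina.2005-07-06.log"
touch "$LOGS/catalina.out"     # active log, no datestamp
ls -tr "$LOGS"                 # oldest first
```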
Check the partition where the agent daemon logs live:
df -H /var/log/glite/
du -H /var/log/glite/
The procedure here is the same if it looks like the daemon logfiles are filling up the disk.
Check the partition where the transfer logs live (this is the most likely one to fill up since thousands of small logfiles will be here):
df -H /var/tmp/glite-url-copy-sc3/
df -H /var/tmp/glite-url-copy/sc3/
If it looks like this log directory is responsible for filling the partition, run the cleanup script:
glite-url-copy-cleanlog sc3 "" /shift/lxshare026d/data01/ 0 12
glite-url-copy-cleanlog sc3 "" /shift/lxshare026d/data01/ 1 12
which should clean up all logs over 12 hours old for all channels (both completed and failed), writing the archive as a gzipped tar into a suitable directory (/shift/lxshare026d/data01/ in this example). Note that this script runs tar and rm over very many small files, so it may take some time to complete.
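What the cleanup amounts to can be sketched with find, tar and rm. This is an illustrative sketch, not the real glite-url-copy-cleanlog; temp dirs stand in for the transfer-log directory and the /shift/ archive area, and the filenames are made up.

```shell
# Archive logs older than 12 hours, then remove them.
# Illustrative only: temp dirs stand in for the real log and archive paths.
SRC=$(mktemp -d); DEST=$(mktemp -d)
touch -d '13 hours ago' "$SRC/old-transfer.log"   # older than 12 hours
touch "$SRC/fresh-transfer.log"                   # recent, must survive
OLD=$(mktemp)
find "$SRC" -type f -mmin +720 > "$OLD"           # 720 minutes = 12 hours
tar czf "$DEST/transfer-logs.tar.gz" -T "$OLD"
xargs rm -f < "$OLD"
ls "$SRC"                                         # only the fresh log remains
```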
8) Random DB or SQL errors in the logs
If any of the logfiles indicate random SQL errors (you may see some in the tail of /var/log/glite/glite-transfer-agent-fts-something.log or in a Java stacktrace in /usr/share/tomcat5/logs/org.glite.data):
- Attempt to stop the service (service gLite stop).
- Run the procedure to check the database from 6) above.
- If that checks out, restart the service (service gLite start).
- Run the commands glite-transfer-list and glite-transfer-channel-list.
- Check the /usr/share/tomcat5/logs/org.glite.data logfile for any new SQL errors (in a Java stacktrace); if there are any, contact FTS 3rd level support.
- Wait a couple of minutes and check /var/log/glite/glite-transfer-agent-fts-something.log for any new SQL errors; if there are any, contact FTS 3rd level support.
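The log scan in the last two steps can be done with one grep over the ERROR/FATAL classes and Oracle (ORA-nnnnn) codes. The temp file and its contents below are made up for demonstration; on the node, point the grep at the real logfiles named above.

```shell
# Scan a logfile for SQL/DB error lines. The sample log contents are
# invented; substitute the real org.glite.data or agent logfile path.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2005-07-07 10:00:01 INFO  agent started
2005-07-07 10:02:13 ERROR ORA-01017: invalid username/password; logon denied
EOF
grep -E 'ERROR|FATAL|ORA-[0-9]+' "$LOG"
```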
9) CRL lists have expired
The hostcert should not expire until March 2006, but the CRL lists, if present, expire more often.
Run the command:
openssl verify -crl_check -CApath /etc/grid-security/certificates/ /etc/grid-security/hostcert.pem
this should print OK for the cert. If not, the CRL list has probably expired.
The current remedy is to delete the file /etc/grid-security/certificates/fa3af1d7.r0 (since we have no outbound connectivity on the nodes, the normal CRL fetch script will hang). Restart the service after this (service gLite stop followed by service gLite start).
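A related check is the expiry of the hostcert itself. Sketched here on a throwaway self-signed certificate; on the node, point the second command at /etc/grid-security/hostcert.pem instead.

```shell
# Print the notAfter date of a certificate. The self-signed cert below is
# a throwaway stand-in for /etc/grid-security/hostcert.pem.
KEY=$(mktemp); CERT=$(mktemp)
openssl req -x509 -newkey rsa:2048 -nodes -keyout "$KEY" -out "$CERT" \
        -days 30 -subj /CN=smoketest 2>/dev/null
openssl x509 -in "$CERT" -noout -enddate
```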
--
GavinMcCance - 07 Jul 2005