FTS Smoke-Test
The following describes the procedure for checking the FTS nodes. It should be followed mostly in sequence.
The code examples below are designed, as much as possible, to be cut and pasted into the terminal window. Please report any errors you see.
1) Ping the node.
Log into any lxplus batch node. Attempt to ping the "problematic" FTS server. If you cannot ping the host there is likely a hardware problem, a network problem or the machine is in the middle of being rebooted. To cover the last case, wait a few minutes and try again. Call the operator if the problem persists.
2) Connect to the node.
ssh into the FTS node as root (make sure you have run klog.krb first to obtain your Kerberos AFS token). If you cannot ssh into the node, contact the operator.
3) Check whether the agent daemon is running.
Run the command:
ps aux | grep glite
You should see a single daemon called glite-transfer-fts-something running as user sc3.
Possible problem: there is more than one glite-transfer-fts-something process:
- There should not be more than one. kill -9 all the daemon processes identified using ps aux. Then run the FtsMeanAndClean15 procedure to restart the service from a clean config.
Possible problem: there are none, or there are none after a restart:
- Attempt to restart the service using service gLite stop followed by service gLite start. If the daemon is still not there, look in the logfile /var/log/glite/glite-transfer-fts-something.log (it should be the latest logfile in that directory). The service should have dumped its configuration into the logfile upon startup. Look for any ERROR class messages. Typical startup failures include configuration file parsing errors (if the config file has been corrupted or changed) or database connection errors (in which case the problem is likely on the database node).
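The step-3 check can be sketched as a single test. This is a sketch only: the process-name pattern is an assumption, so adjust it to whatever ps actually shows on the node.

```shell
# Count the agent daemons. The [g] in the pattern stops grep from
# matching its own command line; the daemon-name pattern is an
# assumption, adjust it to what ps shows on the node.
n=$(ps aux | grep -c '[g]lite-transfer-fts' || true)
if [ "$n" -eq 0 ]; then
    echo "no agent daemon running, try a restart"
elif [ "$n" -gt 1 ]; then
    echo "$n agent daemons running, kill them and run FtsMeanAndClean15"
else
    echo "agent daemon OK"
fi
```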
4) Check whether the web service container daemon is running.
Run the command:
ps aux | grep tomcat4
You should see a single java daemon running; ps aux will show the name of the daemon as the full Java command line.
Possible problem: there is more than one java process owned by tomcat4 (this is unusual):
- There should not be more than one. Run the same FtsMeanAndClean15 procedure as for step 3).
Possible problem: there are no java processes, or none after a restart:
- Attempt to restart the service using service gLite stop followed by service gLite start. If the daemon is still not there, look in the logfile /usr/share/tomcat5/logs/org.glite.data (it should be among the latest logfiles in that directory). The service should have dumped some of its configuration into the logfile upon startup. Look for any ERROR or FATAL class messages. Typical startup failures include configuration file parsing errors (if the config file has been corrupted or changed), grid-mapfile parsing errors (if any of the service mapfiles have been corrupted or changed) or database connection errors (in which case the problem is likely on the database node).
- Run, as root, the commands glite-transfer-channel-list and glite-transfer-list. If either of them displays a message 'tcp connect error', 'InternalException' or 'Service is misconfigured', look in the org.glite.data logfile for ERROR or FATAL class messages that may indicate what the problem is.
5) Check the number of gLite instances configured to run inside the tomcat container.
There should be one and only one. Issue the following commands and check the output:
ls -trl /etc/tomcat5/Catalina/localhost/*glite*
there should be only one file.
ls -trl /usr/share/tomcat5/webapps/
ignore the ROOT, balancer and webdav directories if they exist; other than these, there should be only one directory.
ls -trld /usr/share/tomcat5/work/Catalina/localhost/*glite*
there should be only one directory.
If any of these checks fail, run the same FtsMeanAndClean15 procedure as mentioned in step 3).
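The three checks above amount to "each glob matches exactly one entry". A minimal sketch of that idea, demonstrated on a temporary directory (the gLite webapp directory name here is illustrative, not the real one):

```shell
# Count how many entries a glob matches; anything other than 1 is trouble.
# The temp dir stands in for /usr/share/tomcat5/webapps; the gLite
# directory name is made up for the demonstration.
WEBAPPS=$(mktemp -d)
mkdir "$WEBAPPS/glite-data-transfer-fts" "$WEBAPPS/ROOT" "$WEBAPPS/balancer"
n=0
for d in "$WEBAPPS"/*glite*; do
    [ -e "$d" ] && n=$((n+1))   # the -e test guards the unmatched-glob case
done
echo "gLite instances: $n"
```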
6) If the database appears to be down
If any of the logfiles suggest a problem connecting to the database, try to confirm this manually.
Look in the config file /opt/glite/etc/config/glite-file-transfer-service-oracle.cfg.xml. In the first 50 lines of the file you'll find the following parameters (it's an XML file): DBUSER, DBPASSWORD, DBSERVICENAME, DBHOST. Take the values and run the following sqlplus command:
sqlplus DBUSER/DBPASSWORD@DBHOST/DBSERVICENAME
for example, for the production throughput FTS on lxshare026d, run:
sqlplus fts_sc3prod/xxxxxxxx@fts.cern.ch/fts_prod.cern.ch
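The parameter values can also be pulled out of the config file with sed. The XML layout in this sketch is purely hypothetical (a guess at a name="..." attribute style); check the real glite-file-transfer-service-oracle.cfg.xml and adapt the pattern to match.

```shell
# Extract a DB parameter from a config XML. The file layout below is a
# hypothetical stand-in, NOT the real FTS config format; adapt the sed
# pattern to the actual file on the node.
CFG=$(mktemp)
cat > "$CFG" <<'EOF'
<config>
  <param name="DBUSER">fts_sc3prod</param>
  <param name="DBHOST">fts.cern.ch</param>
</config>
EOF
DBUSER=$(sed -n 's:.*name="DBUSER">\([^<]*\)<.*:\1:p' "$CFG")
echo "DBUSER=$DBUSER"
```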
Possible problem: invalid username/password; logon denied
As it says, the username/password pair is wrong; the service is misconfigured. Edit the config file, changing to the correct username/password [ where? ]. If you're sure they are correct and it still says 'logon denied', contact the DB service support.
Possible problem: account locked
After a few tries with the wrong password, the account is locked for around 15 minutes. This means someone (possibly you) has attempted to connect with the wrong password. If the service logfiles indicate that the service cannot connect to the DB, stop the service (service gLite stop), wait 15 minutes, then attempt to restart (service gLite start). If this does not help, call DB support (there may be an errant script somewhere attempting to connect to the DB with a bad password).
Possible problem: it hangs, gets connection refused, a "TNS:could not resolve" message, or a "TNS listener does not currently know of the service requested" message
Check that the hostname and servicename are correct for this FTS node [ where? ]. If they appear to be correct, try again in a minute or two; if you still have problems, contact DB support.
Once at the prompt (assuming a successful logon), run the command:
select table_name from user_tables;
you should see 6 tables (T_FILE, T_CHANNEL, T_JOB, T_SCHEMA_VERS, T_LOGICAL).
If you do not see all 6 tables, contact FTS 3rd level support.
7) Filesystems full
The operator should have told you if this was the case (since this is likely the alarm that triggered the call). The FTS produces a lot of logs (both the webservice and the agent daemon).
Check the partition where the web-service logs live:
df -H /usr/share/tomcat5/logs/
du -H /usr/share/tomcat5/logs/
If the partition is becoming full and it looks like the log directory is to blame, you should remove the logs, ideally from the oldest file backwards. It would be preferable to copy the logs somewhere first (and then report where you put them at the daily meeting). If you need to delete any of the active logs (the ones without a datestamp), restart the service afterwards (service gLite stop followed by service gLite start). Most of the production FTS boxes live on disk servers, so there should be large, fairly empty disks accessible in /shift/.
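To work from the oldest file backwards, sort the listing by modification time. A sketch on a temporary directory (the filenames here are made up; on the node, run ls -tr against /usr/share/tomcat5/logs):

```shell
# List logfiles oldest-first so the oldest can be copied away and removed
# first. Temp dir and filenames are illustrative only.
LOGS=$(mktemp -d)
touch -d '2 days ago' "$LOGS/catalina.2005-07-05.log"
touch -d '1 day ago'  "$LOGS/catalina.2005-07-06.log"
touch "$LOGS/catalina.out"     # active log, no datestamp
ls -tr "$LOGS"                 # oldest first
```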
Check the partition where the agent daemon logs live:
df -H /var/log/glite/
du -H /var/log/glite/
The procedure here is the same if it looks like the daemon logfiles are filling up the disk.
Check the partition where the transfer logs live (this is the most likely one to fill up since thousands of small logfiles will be here):
df -H /var/tmp/glite-url-copy-sc3/
df -H /var/tmp/glite-url-copy/sc3/
If it looks like this log directory is responsible for filling the partition, run the cleanup script:
glite-url-copy-cleanlog sc3 "" /shift/lxshare026d/data01/ 0 12
glite-url-copy-cleanlog sc3 "" /shift/lxshare026d/data01/ 1 12
which should clean up all logs over 12 hours old for all channels (both completed and failed), writing the archive as a gzipped tar into a suitable directory (/shift/lxshare026d/data01/ in this example). Note that this script runs tar and rm over very many small files, so it may take some time to complete.
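What the cleanup amounts to can be sketched with find, tar and rm. This is an illustrative sketch, not the real glite-url-copy-cleanlog; temp dirs stand in for the transfer-log directory and the /shift/ archive area, and the filenames are made up.

```shell
# Archive logs older than 12 hours, then remove them.
# Illustrative only: temp dirs stand in for the real log and archive paths.
SRC=$(mktemp -d); DEST=$(mktemp -d)
touch -d '13 hours ago' "$SRC/old-transfer.log"   # older than 12 hours
touch "$SRC/fresh-transfer.log"                   # recent, must survive
OLD=$(mktemp)
find "$SRC" -type f -mmin +720 > "$OLD"           # 720 minutes = 12 hours
tar czf "$DEST/transfer-logs.tar.gz" -T "$OLD"
xargs rm -f < "$OLD"
ls "$SRC"                                         # only the fresh log remains
```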
8) Random DB or SQL errors in the logs
If any of the logfiles indicate random SQL errors (you may see some in the tail of /var/log/glite/glite-transfer-agent-fts-something.log or in a Java stacktrace in /usr/share/tomcat5/logs/org.glite.data):
- Attempt to stop the service (service gLite stop).
- Run the procedure to check the database from 6) above.
- If that checks out, restart the service (service gLite start).
- Run the commands glite-transfer-list and glite-transfer-channel-list.
- Check the /usr/share/tomcat5/logs/org.glite.data logfile for any new SQL errors (in a Java stacktrace); if there are any, contact FTS 3rd level support.
- Wait a couple of minutes and check /var/log/glite/glite-transfer-agent-fts-something.log for any new SQL errors; if there are any, contact FTS 3rd level support.
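The log scan in the last two steps can be done with one grep over the ERROR/FATAL classes and Oracle (ORA-nnnnn) codes. The temp file and its contents below are made up for demonstration; on the node, point the grep at the real logfiles named above.

```shell
# Scan a logfile for SQL/DB error lines. The sample log contents are
# invented; substitute the real org.glite.data or agent logfile path.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2005-07-07 10:00:01 INFO  agent started
2005-07-07 10:02:13 ERROR ORA-01017: invalid username/password; logon denied
EOF
grep -E 'ERROR|FATAL|ORA-[0-9]+' "$LOG"
```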
9) CRL lists have expired
The hostcert should not expire until March 2006, but the CRL lists, if present, expire more often.
Run the command:
openssl verify -crl_check -CApath /etc/grid-security/certificates/ /etc/grid-security/hostcert.pem
this should print OK for the cert. If not, the CRL list has probably expired.
The current remedy is to delete the file /etc/grid-security/certificates/fa3af1d7.r0 (since we have no outbound connectivity on the nodes, the normal CRL fetch script will hang). Restart the service after this (service gLite stop followed by service gLite start).
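A related check is the expiry of the hostcert itself. Sketched here on a throwaway self-signed certificate; on the node, point the second command at /etc/grid-security/hostcert.pem instead.

```shell
# Print the notAfter date of a certificate. The self-signed cert below is
# a throwaway stand-in for /etc/grid-security/hostcert.pem.
KEY=$(mktemp); CERT=$(mktemp)
openssl req -x509 -newkey rsa:2048 -nodes -keyout "$KEY" -out "$CERT" \
        -days 30 -subj /CN=smoketest 2>/dev/null
openssl x509 -in "$CERT" -noout -enddate
```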
--
GavinMcCance - 07 Jul 2005