GLiteWMSTroubleshooting < LCG

LCG Web>LCGGridDeployment>LCGProductionServices>GLiteWMSTroubleshooting (2011-06-21, AndresAeschlimann)

Important places and files
What is wrong with my gLite WMS
Performance and stability improvement on the LB and WMS nodes

This wiki page is based on the slides made by Di for the WLCG Collaboration Workshop (January 2007). Thanks to Di :-)

Important places and files

${INSTALL_ROOT}/glite/etc/config/glite-*.cfg.xml: check these files to verify that the conversion done by the YAIM2gLiteConverter tool succeeded. This tool transforms YAIM configuration values into gLite XML configuration files:
- /opt/glite/etc/config/glite-wms.cfg.xml: configuration file for the WMS.
- /opt/glite/etc/config/glite-lb.cfg.xml: configuration file for the LB.
Some other important configuration files:
- /opt/glite/etc/glite_wms.conf: general configuration file.
- /opt/glite/etc/glite_wms_wmproxy_httpd.conf and /opt/glite/etc/glite_wms_wmproxy.gacl: configuration files for the WMproxy.
- /opt/condor-c/etc/condor_config: global configuration file for Condor.
grid-mapfile related files:
- To map or to blacklist a user: /opt/edg/etc/grid-mapfile-local
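
As a quick sanity check, the files listed above can be tested for existence on a node. The following is only a minimal sketch: the paths are copied from the list above, while the helper name missing_files is made up for illustration.

```python
import os

# Paths copied from the list above; adjust them if your installation differs.
IMPORTANT_FILES = [
    "/opt/glite/etc/config/glite-wms.cfg.xml",
    "/opt/glite/etc/config/glite-lb.cfg.xml",
    "/opt/glite/etc/glite_wms.conf",
    "/opt/glite/etc/glite_wms_wmproxy_httpd.conf",
    "/opt/glite/etc/glite_wms_wmproxy.gacl",
    "/opt/condor-c/etc/condor_config",
    "/opt/edg/etc/grid-mapfile-local",
]


def missing_files(paths, exists=os.path.exists):
    """Return the subset of paths that do not exist (hypothetical helper)."""
    return [p for p in paths if not exists(p)]


# Report the files that are absent on this node.
for path in missing_files(IMPORTANT_FILES):
    print("missing:", path)
```

On a node that hosts only the WMS or only the LB, some of these files are expected to be absent.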

What is wrong with my gLite WMS

Error while calling the "NSClient::multi" native api AuthentificationException: Failed to establish security context...

Check if your DN is in the /etc/grid-security/grid-mapfile file on the WMS.
Restart the network server (possibly).
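
The DN check can be scripted. This is a sketch that assumes the usual grid-mapfile layout, one quoted DN followed by an account name per line; the function name dn_in_gridmap and the sample DNs are hypothetical.

```python
import shlex


def dn_in_gridmap(dn, lines):
    """Return True if dn appears as the first field of a grid-mapfile line."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        try:
            fields = shlex.split(line)  # handles the quoted DN
        except ValueError:
            continue  # skip malformed lines (e.g. unbalanced quotes)
        if fields and fields[0] == dn:
            return True
    return False


# Example use on the WMS (the DN below is a made-up placeholder):
# with open("/etc/grid-security/grid-mapfile") as fh:
#     print(dn_in_gridmap("/DC=ch/DC=cern/CN=Jane Doe", fh))
```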

Many jobs stay in running or ready status forever

The log monitor daemon has probably died and could not be restarted:
- Increase the LogLevel value in the LogMonitor section of the /opt/glite/etc/glite_wms.conf file.
- Restart the log monitor daemon to find out which log file makes it crash.
- Remove the corrupted log file from the /var/glite/logmonitor/CondorG.log directory.
There may be a problem with the script /opt/lcg/sbin/grid_monitor.sh.
The interlogd daemon may be stuck, in which case it should be restarted.
- A workaround patch is available.

glite-job-logging-info shows error message "Cannot take token!"

Check that the edg-gridftp-clients or glite-gridftp-clients package is installed on the WNs.
Check that files can be copied with globus-url-copy to and from the WMS on the WNs.
The proxy expired before the job started executing and could not be renewed.
The job was submitted to the batch system but BLAH did not receive the submission information, so it submitted the job again, and the token was removed by the first instance to run.
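
The globus-url-copy check can be run as a round trip from a WN: copy a file to the WMS, then copy it back. The sketch below only builds the two command lines. The host name, the remote path and the helper name round_trip_commands are placeholders, and a valid proxy is assumed to exist on the WN.

```python
def round_trip_commands(wms_host, local_file, remote_path, back_file):
    """Build the upload and download command lines for a gridftp round trip.

    local_file, remote_path and back_file must be absolute paths.
    """
    remote = "gsiftp://%s%s" % (wms_host, remote_path)
    upload = ["globus-url-copy", "file://" + local_file, remote]
    download = ["globus-url-copy", remote, "file://" + back_file]
    return upload, download


# Example (hypothetical host and paths), to be run on a WN:
# import subprocess
# up, down = round_trip_commands("wms.example.org", "/tmp/probe.txt",
#                                "/tmp/probe.txt", "/tmp/probe.back")
# ok = subprocess.call(up) == 0 and subprocess.call(down) == 0
```

If either copy fails, the error printed by globus-url-copy usually shows whether the cause is authentication, the network or a missing client package.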

"Got a job held event, reason: Spooling input data files"

It may fail with "Globus error 7: authentication with the remote server failed".
This is a race condition between the gridmanager on machine A querying the status of the job on machine B and the schedd on machine B releasing the job after file stage-in. It is fixed in a later version of Condor (which version?).

glite-lb-bkserverd: "Database call failed (the table long_fields is full)" in /var/log/messages

The LB database reached the 4 GB table size limit.
This may cause incomplete log events.
Raise the table limits by executing the following commands in MySQL:

                alter table short_fields max_rows=1000000000;
                alter table long_fields max_rows=55000000;
                alter table states max_rows=9500000;
                alter table events max_rows=175000000;
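
To see how close each table is to its limit before or after running the commands above, the output of the MySQL statement SHOW TABLE STATUS can be inspected: its Data_length and Max_data_length columns give the current and maximum data size per table. The sketch below assumes the rows have already been fetched as dictionaries by whatever MySQL client you use; the function name fill_ratios is hypothetical.

```python
def fill_ratios(rows):
    """Map table name to Data_length / Max_data_length.

    rows: iterable of dicts as returned by SHOW TABLE STATUS.
    Tables that report no maximum (0 or NULL) are skipped.
    """
    ratios = {}
    for row in rows:
        max_len = row.get("Max_data_length") or 0
        if max_len > 0:
            ratios[row["Name"]] = row["Data_length"] / max_len
    return ratios


# Example with made-up numbers: a long_fields table at 3 GB out of 4 GB.
example = [{"Name": "long_fields",
            "Data_length": 3 * 2**30,
            "Max_data_length": 4 * 2**30}]
print(fill_ratios(example))
```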

glite-job-logging-info shows: "Cannot read JobWrapper output, both from Condor and from Maradona"

This is similar to the same error in the LCG workload management system. More information can be found here.

glite-job-logging-info shows: "Got a job held event reason: The PeriodicHold expression 'Matched = TRUE && CurrentTime > QDate + 900' evaluated to TRUE"

Condor could not submit the job to the CE within 900 seconds.
Condor-C on the CE probably could not be launched:
- because the authentication failed.
- because previous launcher jobs failed but are still in the Condor queue. Remove them with condor_rm.
- because the IP address in /etc/hosts is incorrect.
A firewall may also be the cause.
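
As a reading aid, the hold expression quoted in the error message can be restated as plain arithmetic: the job is held once it has been matched and more than 900 seconds have passed since QDate, the time at which it entered the Condor queue. The sketch below only mirrors that arithmetic; it is not how Condor evaluates ClassAd expressions, and the function name is hypothetical.

```python
def periodic_hold(matched, qdate, current_time, limit=900):
    """Mirror of: Matched = TRUE && CurrentTime > QDate + 900.

    qdate and current_time are Unix timestamps in seconds.
    """
    return bool(matched) and current_time > qdate + limit


# A job queued at t=1000 is held at t=1901 but not yet at t=1900.
print(periodic_hold(True, 1000, 1901), periodic_hold(True, 1000, 1900))
```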

glite-job-logging-info shows "Got a job held event, reason: Error connecting to schedd..."

Condor timed out when connecting to the schedd on the gLite CE.
Possible causes are an unstable network or a full disk somewhere.

glite-job-logging-info shows "Got a job held event, reason: Attempts to submit failed"

It means that the job could not be handed over to the batch system by the non-privileged user resulting from the GRAM/LCMAPS mapping.
For example, the BLParser is not running on the batch system head node.

Failed to load GSI credential: edg_wl_gss_acquire_cred_gsi() failed

The locallogger writes log events under /tmp (/var/glite/tmp in version 3.1), but that directory is full.
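
A directory can be "full" either because no blocks are left or because no inodes are left, so both are worth checking. This is a minimal sketch using os.statvfs; the thresholds and function names are arbitrary choices for illustration.

```python
import os


def free_space_and_inodes(path):
    """Return (free bytes, free inodes) available to unprivileged users."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize, st.f_favail


def is_exhausted(free_bytes, free_inodes,
                 min_bytes=10 * 1024 * 1024, min_inodes=1000):
    """True if either free space or free inodes is below its threshold.

    The default thresholds are assumptions, not values from gLite.
    """
    return free_bytes < min_bytes or free_inodes < min_inodes


# Example use on the WMS or LB node:
# print(is_exhausted(*free_space_and_inodes("/tmp")))
```

Some filesystems do not report meaningful inode counts, so a zero inode value is not always a real alarm.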

Unable to delegate the credential to the endpoint: https://:7443/glite_wms_wmproxy_server

The VOMS extension of your proxy is not listed in /opt/glite/etc/glite_wms_wmproxy.gacl.
The VOMS extension is missing from your proxy and the WMproxy is configured to support only VOMS proxies.
WMproxy.log and lcmaps.log can give some clues.

Unable to register the job to the service: https://:7443/glite_wms_wmproxy_server

The LB or LBProxy is too busy: increase the timeout.
The LB is in a bad state: restart it.

Lots of dglogd files in the /var/glite/log directory

Restart the glite-lb-locallogger service.
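
To watch whether the backlog shrinks after the restart, the files can be counted periodically. The sketch below assumes the files carry a dglogd prefix; the pattern and the function name are assumptions, so adjust them to what you actually see in the directory.

```python
import glob
import os


def count_dglogd_files(log_dir="/var/glite/log", pattern="dglogd*"):
    """Count the files in log_dir whose names match pattern."""
    return len(glob.glob(os.path.join(log_dir, pattern)))


# Example use on the node, before and after the restart:
# print(count_dglogd_files())
```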

Performance and stability improvement on the LB and WMS nodes

See the work log related to the gLite WMS 3.1 here.

Split /var/glite into several hard disk partitions

Use the script WMS-post-install found in /root on the WMS node.
This script migrates the middleware log files, the sandbox and the MySQL database to the /data01, /data02 and /data03 partitions.

Deploy standalone LB

One standalone LB (like rb201) can serve several WMS.
Create a file /opt/glite/etc/LB-super-users containing the DNs of the allowed WMS. You can also add the DN of your own certificate there in order to debug and get logging information about other users' jobs.
Add the option "--super-users-file /opt/glite/etc/LB-super-users" to the startup line of the bkserverd daemon, i.e. in the file /opt/glite/etc/init.d/glite-lb-bkserverd (see the variable "super" in the script).
Add the following line to the WorkloadManagerProxy section of the file /opt/glite/etc/glite_wms.conf:

                LBServer = "rb201.cern.ch:9000";

Restart the glite-wms-wmproxy service.
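
Before pointing a WMS at a standalone LB, it can help to confirm that the LB port is reachable from the WMS node. The sketch below parses a "host:port" value such as the LBServer setting above and attempts a plain TCP connection. It does not perform any GSI handshake, and the function names are hypothetical.

```python
import socket


def parse_lb_server(value):
    """Split a 'host:port' string into (host, port)."""
    host, _, port = value.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError("expected host:port, got %r" % value)
    return host, int(port)


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example use on the WMS node:
# print(can_connect(*parse_lb_server("rb201.cern.ch:9000")))
```

A successful connection only shows that the port is open; authorization problems still show up in the LB and WMproxy logs.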

Reduce the number of planners per DAG job

This helps when there are too many DAG jobs.
For example, add "DagmanMaxPre=2" to the JobController section of the file /opt/glite/etc/glite_wms.conf to reduce it to 2.

Increase the timeout

Set the GLITE_WMS_QUERY_TIMEOUT and GLITE_PR_TIMEOUT variables in /etc/glite/profile.d/glite-setenv.* and add the line "PassEnv GLITE_PR_TIMEOUT" to /opt/glite/etc/glite_wms_wmproxy_httpd.conf.

Remove the pending export of job registrations into JP

If this service is not required by the experiments, you can disable it by removing the string $maildir from the /opt/glite/etc/init.d/glite-lb-bkserverd startup script. The service must then be restarted. This prevents the "no more inodes available in /tmp" error message.

-- YvanCalas - 09 Mar 2007

Topic revision: r6 - 2011-06-21 - AndresAeschlimann
