This wiki page is based on the slides made by Di for the WLCG Collaboration Workshop (January 2007). Thanks to Di :-)

Important places and files

  • ${INSTALL_ROOT}/glite/etc/config/glite-*.cfg.xml: check these files to verify that the conversion done by the YAIM2gLiteConverter tool succeeded. This tool transforms YAIM configuration values into gLite XML configuration files:
    • /opt/glite/etc/config/glite-wms.cfg.xml: configuration file for the WMS.
    • /opt/glite/etc/config/glite-lb.cfg.xml: configuration file for the LB.
  • Some other important configuration files:
    • /opt/glite/etc/glite_wms.conf: general configuration file.
    • /opt/glite/etc/glite_wms_wmproxy_httpd.conf and /opt/glite/etc/glite_wms_wmproxy.gacl: configuration files for the WMproxy.
    • /opt/condor-c/etc/condor_config: global configuration file for Condor.
  • grid-mapfile related files:
    • To map or to blacklist a user: /opt/edg/etc/grid-mapfile-local

What is wrong with my gLite WMS

Error while calling the "NSClient::multi" native api AuthentificationException: Failed to establish security context...

  • Check if your DN is in the /etc/grid-security/grid-mapfile file on the WMS.
  • Restart the network server if necessary.
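
A minimal sketch of the first check, assuming the usual grid-mapfile layout of one quoted DN followed by a local account or pool name per line. The function name dn_in_gridmap and the sample DNs are illustrative only and are not part of gLite:

```python
import shlex


def dn_in_gridmap(lines, dn):
    """Return the accounts that `dn` is mapped to (empty list if absent)."""
    accounts = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        try:
            fields = shlex.split(line)  # copes with the quoted DN
        except ValueError:
            continue  # skip malformed lines (e.g. unbalanced quotes)
        if len(fields) >= 2 and fields[0] == dn:
            accounts.append(fields[1])
    return accounts


# Hypothetical sample content; on a real WMS, pass the lines of
# /etc/grid-security/grid-mapfile instead.
sample = [
    "# example entries",
    '"/C=CH/O=CERN/CN=Jane Doe" .dteam',
    '"/C=CH/O=CERN/CN=John Roe" dteam001',
]
print(dn_in_gridmap(sample, "/C=CH/O=CERN/CN=Jane Doe"))
```

An empty result means the DN is not mapped, which would be consistent with the authentication error above.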

Many jobs stay in running or ready status forever

  • Probably the log monitor daemon is dead and could not be restarted:
    • Increase the LogLevel value in the LogMonitor section of the /opt/glite/etc/glite_wms.conf file.
    • Restart the log monitor daemon to check which log file causes the log monitor to crash.
    • Remove the corrupted log file from the /var/glite/logmonitor/CondorG.log directory.
  • There may be a problem with the script /opt/lcg/sbin/grid_monitor.sh.
  • The interlogd daemon is stuck and should be restarted.
    • A workaround patch is available.

glite-job-logging-info shows error message "Cannot take token!"

  • Check that the edg-gridftp-clients or glite-gridftp-clients package is installed on the WNs.
  • Check that globus-url-copy, run on the WNs, can copy files to and from the WMS.
  • The proxy expired before the job was executed and could not be renewed.
  • The job was submitted to the batch system but BLAH did not receive the submission information, so it submitted the job again, and the token was removed by the first copy that started running.

"Got a job held event, reason: Spooling input data files"

  • The job may fail with "Globus error 7: authentication with the remote server failed".
  • The cause is a race condition between the gridmanager on machine A querying the status of the job on machine B and the schedd on machine B releasing the job after file stage-in. It is fixed in a later version of Condor (which version?).

glite-lb-bkserverd: "Database call failed (the table long_fields is full)" in /var/log/messages

  • The LB database tables reached the 4 GB limit.
  • This may cause incomplete log events.
  • Increase the table limits by executing the following commands in MySQL:
                alter table short_fields max_rows=1000000000;
                alter table long_fields max_rows=55000000;
                alter table states max_rows=9500000;
                alter table events max_rows=175000000;

glite-job-logging-info shows: "Cannot read JobWrapper output, both from Condor and from Maradona"

  • This is similar to the LCG workload management system. More information can be found here.

glite-job-logging-info shows: "Got a job held event reason: The PeriodicHold expression 'Matched = TRUE && CurrentTime > QDate + 900' evaluated to TRUE"

  • Condor could not submit the job to the CE within 900 seconds.
  • Probably Condor-C could not be launched on the CE:
    • The authentication failed.
    • Previous launcher jobs failed but are still in the Condor queue. Remove them with condor_rm.
    • The IP address is incorrect in /etc/hosts.
  • Possibly a firewall is blocking the connection.
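
To make the hold condition concrete, here is a small sketch of how the quoted PeriodicHold expression evaluates. It assumes that CurrentTime and QDate are Unix timestamps in seconds (QDate being the time the job entered the Condor queue). The Python function, its argument names and the sample timestamp are illustrative only:

```python
def periodic_hold(matched, current_time, qdate, limit=900):
    """Mirror of: Matched = TRUE && CurrentTime > QDate + 900."""
    return matched and current_time > qdate + limit


qdate = 1_170_000_000  # hypothetical queue time (epoch seconds)

print(periodic_hold(True, qdate + 900, qdate))    # exactly 900 s: not held yet
print(periodic_hold(True, qdate + 901, qdate))    # more than 900 s: held
print(periodic_hold(False, qdate + 5000, qdate))  # not matched: never held
```

In other words, a matched job that is still waiting in the queue 15 minutes after submission is put on hold, whatever the underlying reason.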

glite-job-logging-info shows "Got a job held event, reason: Error connecting to schedd..."

  • Condor timed out when connecting to the schedd on the gLite CE.
  • Possibly caused by an unstable network or by a disk filling up somewhere.

glite-job-logging-info shows "Got a job held event, reason: Attempts to submit failed"

  • It means that the job could not be successfully handed over to the batch system by the non-privileged user that resulted from the GRAM/LCMAPS mapping.
  • For example, the BLParser is not running on the batch system head node.

Failed to load GSI credential: edg_wl_gss_acquire_cred_gsi() failed

  • The locallogger writes log events under /tmp (/var/glite/tmp in version 3.1), but that directory is full.
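
A quick way to confirm this is to look at both the free space and the free inodes of the directory. The sketch below uses Python's os.statvfs (Unix only); the helper names and the 5% threshold are arbitrary choices made for illustration:

```python
import os


def fs_usage(path):
    """Return free/total bytes and inodes for the filesystem holding `path`."""
    st = os.statvfs(path)
    return {
        "free_bytes": st.f_bavail * st.f_frsize,
        "total_bytes": st.f_blocks * st.f_frsize,
        "free_inodes": st.f_favail,
        "total_inodes": st.f_files,
    }


def is_nearly_full(usage, min_free_fraction=0.05):
    """True if free space or free inodes fall below the given fraction."""
    low_space = usage["free_bytes"] < min_free_fraction * usage["total_bytes"]
    # Some filesystems report 0 total inodes; the threshold is then 0 and
    # the inode check never triggers.
    low_inodes = usage["free_inodes"] < min_free_fraction * usage["total_inodes"]
    return low_space or low_inodes


# Only inspect the directories that exist on this machine.
for path in ("/tmp", "/var/glite/tmp"):
    if os.path.isdir(path):
        print(path, "nearly full:", is_nearly_full(fs_usage(path)))
```

Checking inodes as well as bytes matters here because many small event files can exhaust the inodes long before the disk space runs out.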

Unable to delegate the credential to the endpoint: https://:7443/glite_wms_wmproxy_server

  • The VOMS extension of your proxy is not listed in /opt/glite/etc/glite_wms_wmproxy.gacl.
  • The VOMS extension is missing from your proxy and the WMproxy is configured to support VOMS proxies only.
  • WMproxy.log and lcmaps.log can give some clues.

Unable to register the job to the service: https://:7443/glite_wms_wmproxy_server

  • The LB or LBProxy is too busy: increase the timeout.
  • The LB is in a bad state: restart it.

Lots of dglogd files in the /var/glite/log directory

  • Restart the glite-lb-locallogger service.
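
Before and after the restart, it can help to count how many such files are present. A minimal sketch follows; the prefix "dglogd" is an assumption about how the locallogger names these files, and count_files_with_prefix is a hypothetical helper:

```python
import os


def count_files_with_prefix(directory, prefix="dglogd"):
    """Count regular files in `directory` whose name starts with `prefix`."""
    return sum(
        1
        for entry in os.scandir(directory)
        if entry.is_file() and entry.name.startswith(prefix)
    )


log_dir = "/var/glite/log"
if os.path.isdir(log_dir):
    print(log_dir, count_files_with_prefix(log_dir))
```

If the count keeps growing after the restart, the events are probably still not being delivered to the LB server.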

Performance and stability improvement on the LB and WMS nodes

See the work log related to the gLite WMS 3.1 here.

Split /var/glite into several hard disk partitions

  • Use the script WMS-post-install found in /root on the WMS node.
  • This script migrates the middleware log files, sandbox and MySQL database to the /data01, /data02 and /data03 partitions.

Deploy standalone LB

  • One standalone LB (like rb201) can serve several WMS.
  • Create a file /opt/glite/etc/LB-super-users containing the DNs of the allowed WMS. You can add the DN of your own certificate there in order to debug and get log information about other users' jobs.
  • Add the option "--super-users-file /opt/glite/etc/LB-super-users" to the startup line of the bkserverd daemon, i.e. in the file /opt/glite/etc/init.d/glite-lb-bkserverd (see the variable "super" in the script).
  • Add the following line to the WorkloadManagerProxy section of the file /opt/glite/etc/glite_wms.conf:
                LBServer = "rb201.cern.ch:9000";
  • Restart the glite-wms-wmproxy service.

Reduce the planner number per DAG job

  • This is useful when there are too many DAG jobs.
  • For example, add "DagmanMaxPre=2" in the JobController section of the file /opt/glite/etc/glite_wms.conf to reduce the number to 2.

Increase the timeout

  • Set GLITE_WMS_QUERY_TIMEOUT and GLITE_PR_TIMEOUT variables in /etc/glite/profile.d/glite-setenv.* and add the line "PassEnv GLITE_PR_TIMEOUT" in /opt/glite/etc/glite_wms_wmproxy_httpd.conf.

Remove the pending export of job registrations into JP

  • If this service is not required by the experiments, you can disable it by removing the string $maildir from the /opt/glite/etc/init.d/glite-lb-bkserverd startup script. The service must then be restarted. This prevents the "no more inodes available in /tmp" error message.

-- YvanCalas - 09 Mar 2007

Topic revision: r6 - 2011-06-21 - AndresAeschlimann