Summary of ROC Nagioses Evaluation

The following points should be taken into account before proceeding to the evaluation details.

Differences Nagios vs SAM

WMSes

  • ROC Nagios nodes and SAM UIs submit jobs through different WMSes: two for Nagios (wmssamtest0{1,2}.cern.ch) and four for SAM (wms{201,206,208,209}.cern.ch). Both WMS sets point to the same tBDII, sam-bdii.cern.ch.

Timeouts related to job submission

In SAM, two UIs submit jobs on alternating (odd/even) hours, so submitted jobs have two hours to complete. The timeouts are as follows:

  • WMS: MatchRetryPeriod = 58 min, ExpiryPeriod = 2 hours
  • SAM: job is discarded after 12 hours
The common problem of "no compatible resources" and eventual jobs abortion is handled by

In Nagios, job submission against each CE is scheduled hourly (T_N_SCHED). The timeouts used on the WMS and in the Nagios job monitoring metric are shown below.

                T_N_SCHED(1h)
          T_J_GLOB |        T_J_DISCARD
      T_J_WAIT  |  |               |
|-----------|---|--|---------------|-> t
                  |                |
                T_WMS_MatchRetr  T_WMS_Exp

T_J_WAIT - 45 min (--timeout-job-waiting)
T_J_GLOB - 55 min (--timeout-job-global)
T_WMS_MatchRetr - 58 min (MatchRetryPeriod)
T_N_SCHED - 1 hour (Nagios metric scheduling)
T_J_DISCARD - 2 hours (--timeout-job-discard)
T_WMS_Exp - 2 hours (ExpiryPeriod)
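
The ordering of these timeouts can be checked mechanically. The following is a minimal sketch in Python, assuming the values listed above; the names `TIMEOUTS`, `ORDER` and `check_ordering` are illustrative and not part of the Nagios probes or the WMS configuration.

```python
from datetime import timedelta

# Timeout values copied from the list above; the dictionary itself is illustrative.
TIMEOUTS = {
    "T_J_WAIT": timedelta(minutes=45),         # --timeout-job-waiting
    "T_J_GLOB": timedelta(minutes=55),         # --timeout-job-global
    "T_WMS_MatchRetr": timedelta(minutes=58),  # MatchRetryPeriod
    "T_N_SCHED": timedelta(hours=1),           # Nagios metric scheduling
    "T_J_DISCARD": timedelta(hours=2),         # --timeout-job-discard
    "T_WMS_Exp": timedelta(hours=2),           # ExpiryPeriod
}

# Timeouts sorted by the moment at which they fire after submission.
ORDER = [
    "T_J_WAIT",
    "T_J_GLOB",
    "T_WMS_MatchRetr",
    "T_N_SCHED",
    "T_J_DISCARD",
    "T_WMS_Exp",
]


def check_ordering(timeouts, order=ORDER):
    """Return the adjacent pairs whose values are not in non-decreasing order."""
    violations = []
    for earlier, later in zip(order, order[1:]):
        if timeouts[earlier] > timeouts[later]:
            violations.append((earlier, later))
    return violations
```

With the values above, `check_ordering(TIMEOUTS)` returns an empty list. Raising T_J_GLOB above MatchRetryPeriod, for example, would be reported as a violation.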
   

For more details, see: jobs submission Nagios CE testing (metrics used) and Jobs Submission and Monitoring (testing logic used).

Since 8 Feb we have been using a VOMS proxy with the primary attribute /ops/Role=lcgadmin.

Transport

  • To deliver test results from WNs, Nagios uses a network of Message Brokers (MBs), whereas SAM sends test results to a predefined endpoint. The MBs are published in the tBDII, and the test-launching framework on the WNs can discover the brokers dynamically. NB! On the ROC Nagioses the MB to use is currently passed to the WN metrics (stomp://gridmsg002.cern.ch:6163) and is tested for connectivity from the WN. If that test fails, MBs in the 'PROD' network are discovered from the tBDII defined on the WN. If discovery also fails, we raise WARNING or UNKNOWN, depending on the failure.
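
The broker fallback described above can be sketched as follows. This is a minimal illustration in Python, not the code that runs on the WNs: `discover` is a hypothetical callable standing in for the tBDII query, and the mapping of failures to WARNING or UNKNOWN is an assumption, since the text only says that it depends on the failure.

```python
import socket
from urllib.parse import urlparse

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def can_connect(broker_url, timeout=5.0):
    """Return True if a TCP connection to the broker's host and port succeeds."""
    parsed = urlparse(broker_url)
    try:
        with socket.create_connection((parsed.hostname, parsed.port), timeout=timeout):
            return True
    except OSError:
        return False


def select_broker(default_url, discover, probe=can_connect):
    """Try the configured broker first, then brokers discovered in the tBDII.

    `discover` is a hypothetical callable returning a list of broker URLs.
    Returns a tuple (status, broker_url_or_None).
    """
    if probe(default_url):
        return OK, default_url
    try:
        candidates = discover()
    except Exception:
        # Assumption: a failed tBDII query is reported as UNKNOWN.
        return UNKNOWN, None
    for url in candidates:
        if probe(url):
            return OK, url
    # Assumption: discovery worked but no broker was reachable, reported as WARNING.
    return WARNING, None
```

Passing `probe` as a parameter keeps the selection logic testable without network access; in real use the default `can_connect` would be called with a URL such as stomp://gridmsg002.cern.ch:6163.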

Open discrepancies that require further investigation

During the evaluation over the last week we found a few Nagios-related issues, which have been fixed.

The remaining ones, which we are currently working on, are listed here.

  • Missing WN test results. See Issue 5. "Problem connecting to MB" affects 7 of 441 CEs.
  • CEs go CRITICAL on "Status Getting job output: Failed. Connecting to the service https://wmssamtest02.cern.ch:7443/glite_wms_wmproxy_server Error - Operation failed HTTP Error Internal Server Error"; the status should be UNKNOWN. We saw it once in the last three days. It happens very rarely, but at random, so it can affect any CE. This is a bug and will be fixed this week.
  • Logging info for jobs is currently incomplete because the Nagios version in use allows only 8KB of details data. A new release will increase the limit to 16KB, so more debugging info will be visible in the metrics output.
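
The intended behaviour for the second item (a WMS-side failure should yield UNKNOWN rather than CRITICAL) can be illustrated with a short Python sketch. The function name and the marker strings are illustrative assumptions based only on the error message quoted above; this is not the actual fix.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Illustrative markers taken from the error message quoted above.
# All of them must be present for the failure to count as WMS-side.
WMS_SIDE_MARKERS = (
    "Connecting to the service",
    "Internal Server Error",
)


def classify_output_failure(message):
    """Map a 'getting job output' failure message to a Nagios status.

    A failure of the WMS says nothing about the CE under test, so it is
    reported as UNKNOWN; any other failure stays CRITICAL.
    """
    if all(marker in message for marker in WMS_SIDE_MARKERS):
        return UNKNOWN
    return CRITICAL
```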

-- KonstantinSkaburskas - 08-Feb-2010

Topic revision: r2 - 2010-02-08 - unknown
 