WMS3.2 Pilot Home Page


  • Start Date: Mon 03 Aug 2009
  • End Date: 20 Aug 2009
  • Description: WMS 3.2 @ CERN
  • Coordinators: Maarten Litmaath, Antonio Retico
  • Contact e-mail: wms-operations@cern.ch (WMS Operations at CERN)
  • Status : Closed
  • Related meetings

Description

A WMS is installed at CERN, starting from the version currently in PPS and certification.

wms219.cern.ch runs these WMS patches:

https://savannah.cern.ch/patch/index.php?2597

https://savannah.cern.ch/patch/index.php?2896

https://savannah.cern.ch/patch/index.php?3044

https://savannah.cern.ch/patch/index.php?3156 <-- in certification

It also has this LB patch, except for its YAIM component :

https://savannah.cern.ch/patch/index.php?2848

For some reason the node still has glite-yaim-lb-4.0.2-1 instead of glite-yaim-lb-4.1.0-1.

Use cases

The WMS will be left operating with standard load from the 4 experiments.

Objective and metrics

Technical documentation

Installation Documentation

Patches installed from PPS + Patch repository in certification

Configuration Instructions

standard YAIM configuration

Pilot Layout

wms219.cern.ch is the only machine running WMS 3.2 at CERN. The node supports the four experiment VOs plus ops and dteam.

Tasks and actions:

Actions for SA1 are tracked via the TASK:XXXX available from the PPS task tracker

Tasks for other participants are tracked here

Assigned to Due date Description State Closed Notify  
Main.CERN_PPS 2007-03-05 Example Action Item 2008-04-16 AntonioRetico

Results

Feedback from the experiments

---------- Forwarded message ----------
Date: Wed, 5 Aug 2009 02:09:39 +0200 (CEST)
From: Maarten.Litmaath@cern.ch
To: Andrea Sciaba <Andrea.Sciaba@cern.ch>
Cc: Alessandro Di Girolamo <Alessandro.Di.Girolamo@cern.ch>,
     Simone Campana <Simone.Campana@cern.ch>,
     Roberto Santinelli <Roberto.Santinelli@cern.ch>,
     Patricia Mendez Lorenzo <Patricia.Mendez@cern.ch>,
     Nicolo Magini <Nicolo.Magini@cern.ch>, Daniel.Colin.Vanderster@cern.ch,
     "wms-operations (WMS Operations at CERN)" <wms-operations@cern.ch>,
     Johannes Elmsheuser <johannes.elmsheuser@physik.uni-muenchen.de>,
     Antonio Retico <Antonio.Retico@cern.ch>
Subject: WMS 3.2 pilot node wms219 looks good

Hi all,
CMS and LHCb have confirmed that wms219.cern.ch works fine for them
and I did not receive complaints from ATLAS or ALICE either,
so I think we can consider the current set of rpms and adjustments
to the default configuration satisfactory.
We now can proceed with the formal release procedure.
I will supply details to the certification and release teams.
Thanks,
   Maarten

Comments and issues from operations

Maarten: [to get in synch with PATCH:2848]

  • wms219 reconfigured with glite-yaim-lb-4.1.0-1
  • bunch of test jobs submitted : all looks normal.
  • 2k more jobs submitted: looking for unexpected increases in disk usage. No new processes in "top".
  • after a day with 8561 Condor-G jobs, including 5k (sic) "ops" jobs spread all over the grid, there is no sign of real trouble. The only remarkable fact seems to be a new memory consumption record for the Workload Manager:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 9387 glite     25   0 3184m 3.0g 5892 S  0.0 19.3 477:11.92 glite-wms-workl
For the gLite 3.1 code I have seen up to 2.7 GB (higher values not excluded).
  • I think we can go ahead with the release after the formal certification of patch #3156 and with the release notes I detailed earlier.
Cheers, Maarten

Recommendation for Deployment in production

When we go ahead with the release to production, the following should be part of the release notes:


- In /opt/glite/etc/glite_wms.conf the "--ftpconn" values typically need to be increased, e.g. from 30 to 300, to avoid the limiter refusing jobs too frequently.

Bug: https://savannah.cern.ch/bugs/?53297

- In /opt/globus/etc/gridftp.conf "connections_max" typically needs to be increased, e.g. from 50 to 500, to avoid GridFTP connections being refused too quickly. To make YAIM apply this, site-info.def should be adjusted:

GRIDFTP_CONNECTIONS_MAX=500

- The WMProxy logging is fairly useless at its default level, so the admin may want to increase it from 5 to 6.

Bug: https://savannah.cern.ch/bugs/?53294

Until the issues have been addressed by future YAIM versions, the WMS admin can create a file /opt/glite/yaim/functions/post/config_glite_wms with the following function to let YAIM adjust the parameters in its post-configuration step:

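config_glite_wms_post()
{
    # Raise every "--ftpconn" limit to 300 and set the WMProxy LogLevel to 6.
    perl -i -pe '
        BEGIN {
            $flag = 0;
        }
        s/(--ftpconn) \d+/$1 300/;
        # Change only the first LogLevel found after the WorkloadManagerProxy section starts.
        /^\s*WorkloadManagerProxy/ && ($flag = 1);
        $flag && s/(LogLevel *=) *\d+/$1  6/ && ($flag = 0);
    ' /opt/glite/etc/glite_wms.conf

    # Restart WMProxy so that the new settings take effect.
    /opt/glite/etc/init.d/glite-wms-wmproxy restart
}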

- The Workload Manager is observed to take even more memory than seen with the WMS 3.1 code and therefore may need to be restarted regularly.

Bug: https://savannah.cern.ch/bugs/?54144

Example cron job:

# cat /etc/cron.d/restart-wm
16 2 * * * root (date; /opt/glite/etc/init.d/glite-wms-wm restart) >> /var/log/restart-wm.log 2>&1
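
The restart could also be made conditional on the memory actually used by the Workload Manager. The following is only a sketch under stated assumptions and was not tested in the pilot: the 3 GB threshold is an arbitrary value close to the one observed on wms219, the process name glite-wms-workl is taken from the truncated "top" output shown earlier on this page, the helper names are hypothetical, and "ps -C" assumes the procps version of ps found on Linux.

```shell
#!/bin/sh
# Sketch: report whether the Workload Manager exceeds a resident memory limit.
# ASSUMPTIONS: threshold, process name and helper names are illustrative only.

RSS_LIMIT_KB=3145728   # 3 GB expressed in kB (hypothetical threshold)

# Succeeds (exit status 0) when the given RSS in kB is above the limit.
rss_exceeds_limit()
{
    [ "$1" -gt "$RSS_LIMIT_KB" ]
}

# Prints the largest RSS (kB) among processes whose command name is $1,
# or 0 when no such process is found.
max_rss_kb()
{
    ps -C "$1" -o rss= 2>/dev/null |
        awk 'BEGIN { m = 0 } $1 + 0 > m { m = $1 + 0 } END { print m }'
}

rss=$(max_rss_kb glite-wms-workl)
if rss_exceeds_limit "$rss"; then
    echo "RSS ${rss} kB above limit: restart needed"
    # The restart itself is left commented out in this sketch:
    # /opt/glite/etc/init.d/glite-wms-wm restart
else
    echo "RSS ${rss} kB within limit"
fi
```

Run from cron in place of the unconditional restart, such a check would restart the service only on days when the memory growth is actually observed.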

List of issues found

| Issue | Reported by | Bug(s) | Status | Open/Closed |
| WMS 3.2 job wrapper template fails when 3.1 version works | operations | BUG:53078 | fix certified in PATCH:3156 | closed |
| WMS 3.2 generates unusable BrokerInfo file | operations | BUG:53448 | fix certified in PATCH:3156 | closed |
| [ yaim-wms ] glite_wms.conf hardcoded parameters | operations | BUG:53297 | issue for release notes described at https://savannah.cern.ch/bugs/?48479#comment8 | open |
| glite-brokerinfo does not evaluate attribute references | developer | BUG:53686 | Integration candidate | open |
| Some information is missing in the BrokerInfo file | developer | BUG:53706 | fix certified in PATCH:3156 | closed |
| WMS 3.2 Workload Manager memory leak? | operations | BUG:54144 | None | open |
There are currently no open critical issues.

History

12-Jul-2009 : first installation at CERN

22-Jul-2009 : EMT received the list of critical bugs to be fixed before release to production

29-Jul-2009 : PATCH:3156 with the fixes released to integration and installed on wms219

03-Aug-2009 : Pilot Home page created

05-Aug-2009 : CMS and LHCb confirmed that wms219 is running fine. No bad news from ALICE or ATLAS

07-Aug-2009 : further test after LB re-configuration showed significantly increased memory consumption of the WMS

28-Aug-2009 : WMS 3.2 in production with gLite 3.1 Update 53


Topic revision: r6 - 2009-09-02 - unknown
 