---------- Forwarded message ----------
Date: Wed, 5 Aug 2009 02:09:39 +0200 (CEST)
From: Maarten.Litmaath@cern.ch
To: Andrea Sciaba <Andrea.Sciaba@cern.ch>
Cc: Alessandro Di Girolamo <Alessandro.Di.Girolamo@cern.ch>,
Simone Campana <Simone.Campana@cern.ch>,
Roberto Santinelli <Roberto.Santinelli@cern.ch>,
Patricia Mendez Lorenzo <Patricia.Mendez@cern.ch>,
Nicolo Magini <Nicolo.Magini@cern.ch>, Daniel.Colin.Vanderster@cern.ch,
"wms-operations (WMS Operations at CERN)" <wms-operations@cern.ch>,
Johannes Elmsheuser <johannes.elmsheuser@physik.uni-muenchen.de>,
Antonio Retico <Antonio.Retico@cern.ch>
Subject: WMS 3.2 pilot node wms219 looks good
Hi all,
CMS and LHCb have confirmed that wms219.cern.ch works fine for them
and I did not receive complaints from ATLAS or ALICE either,
so I think we can consider the current set of rpms and adjustments
to the default configuration satisfactory.
We now can proceed with the formal release procedure.
I will supply details to the certification and release teams.
Thanks,
Maarten
Comments and issues from operations
Maarten: [to get in sync with PATCH:2848]
- wms219 reconfigured with glite-yaim-lb-4.1.0-1
- A bunch of test jobs submitted: all looks normal.
- 2k more jobs submitted, watching for unexpected increases in disk usage: no new processes in "top".
- After a day with 8561 Condor-G jobs, including 5k (sic) "ops" jobs spread all over the grid, there is no sign of real trouble. The only remarkable fact seems to be a new memory consumption record for the Workload Manager:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9387 glite 25 0 3184m 3.0g 5892 S 0.0 19.3 477:11.92 glite-wms-workl
For the gLite 3.1 code I have seen up to 2.7 GB (higher values not excluded).
I think we can go ahead with the release after the formal certification of patch #3156 and with the release notes I detailed earlier.
Cheers,
Maarten
Recommendation for Deployment in production
When we go ahead with the release to production, the following should be part of the release notes:
- In /opt/glite/etc/glite_wms.conf the "--ftpconn" values typically
need to be increased, e.g. from 30 to 300, to avoid the limiter
refusing jobs too frequently.
Bug: https://savannah.cern.ch/bugs/?53297
- In /opt/globus/etc/gridftp.conf "connections_max" typically needs
to be increased e.g. from 50 to 500, to avoid GridFTP connections
being refused too quickly. The YAIM site-info.def should be adjusted accordingly:
GRIDFTP_CONNECTIONS_MAX=500
- The WMProxy logging is fairly useless at its default level,
so the admin may want to increase it from 5 to 6.
Bug: https://savannah.cern.ch/bugs/?53294
Until the issues have been addressed by future YAIM versions, the WMS admin can create a file /opt/glite/yaim/functions/post/config_glite_wms
with the following function to let YAIM adjust the parameters in its post-configuration step:
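The function body itself is missing from this copy of the page. A minimal sketch of what such a post-configuration hook might look like, assuming sed-based in-place edits (the sed patterns and the WMProxy "LogLevel" parameter name are assumptions to be checked against the real configuration files):

```shell
# Hypothetical contents of /opt/glite/yaim/functions/post/config_glite_wms.
# The sed patterns and the "LogLevel" parameter name are assumptions;
# verify them against the actual files before use.
function config_glite_wms () {
    # Paths overridable via environment for testing; defaults are the
    # production locations mentioned in the release notes above.
    local wms_conf=${WMS_CONF:-/opt/glite/etc/glite_wms.conf}
    local gftp_conf=${GFTP_CONF:-/opt/globus/etc/gridftp.conf}

    # Raise the "--ftpconn" limiter threshold from 30 to 300 (bug #53297).
    sed -i 's/--ftpconn[[:space:]]\{1,\}30/--ftpconn 300/' "$wms_conf"

    # Raise the WMProxy log level from 5 to 6 (bug #53294).
    sed -i 's/LogLevel[[:space:]]*=[[:space:]]*5/LogLevel = 6/' "$wms_conf"

    # Raise the GridFTP connection limit to match GRIDFTP_CONNECTIONS_MAX.
    sed -i 's/^connections_max.*/connections_max 500/' "$gftp_conf"
}
```

YAIM picks up functions placed under functions/post/ and runs them after the corresponding main configuration function, so these adjustments should survive a reconfiguration.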
- The Workload Manager is observed to take even more memory than seen
with the WMS 3.1 code and therefore may need to be restarted regularly.
Bug: https://savannah.cern.ch/bugs/?54144
Example cron job:
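The cron job itself did not survive in this copy of the page. A hypothetical /etc/cron.d entry could look as follows; the init-script name glite-wms-wm is an assumption, so check the actual service name on the node:

```shell
# Hypothetical /etc/cron.d/wms-wm-restart: restart the Workload Manager
# nightly at 04:00 to work around the memory growth (bug #54144).
# The init-script path and name are assumptions.
0 4 * * * root /opt/glite/etc/init.d/glite-wms-wm restart >/dev/null 2>&1
```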
Issue                                                   | Assigned to | Bug       | Patch                       | Status
--------------------------------------------------------|-------------|-----------|-----------------------------|-------
glite-brokerinfo does not evaluate attribute references | developer   | BUG:53686 | Integration candidate       | open
Some information is missing in the BrokerInfo file      | developer   | BUG:53706 | fix certified in PATCH:3156 | closed
WMS 3.2 Workload Manager memory leak?                   | operations  | BUG:54144 | None                        | open
There are currently no open critical issues
History
12-Jul-2009 : first installation at CERN
22-Jul-2009 : EMT received the list of critical bugs to be fixed before release to production
29-Jul-2009 : PATCH:3156 with the fixes released to integration and installed on wms219
03-Aug-2009 : Pilot Home page created
05-Aug-2009 : CMS and LHCb confirmed that wms219 is running fine. No bad news from ALICE or ATLAS
07-Aug-2009 : further test after LB re-configuration showed significantly increased memory consumption of the WMS
28-Aug-2009 : WMS 3.2 in production with gLite 3.1 Update 53