---------- Forwarded message ----------
Date: Wed, 5 Aug 2009 02:09:39 +0200 (CEST)
From: Maarten.Litmaath@cern.ch
To: Andrea Sciaba <Andrea.Sciaba@cern.ch>
Cc: Alessandro Di Girolamo <Alessandro.Di.Girolamo@cern.ch>,
Simone Campana <Simone.Campana@cern.ch>,
Roberto Santinelli <Roberto.Santinelli@cern.ch>,
Patricia Mendez Lorenzo <Patricia.Mendez@cern.ch>,
Nicolo Magini <Nicolo.Magini@cern.ch>, Daniel.Colin.Vanderster@cern.ch,
"wms-operations (WMS Operations at CERN)" <wms-operations@cern.ch>,
Johannes Elmsheuser <johannes.elmsheuser@physik.uni-muenchen.de>,
Antonio Retico <Antonio.Retico@cern.ch>
Subject: WMS 3.2 pilot node wms219 looks good
Hi all,
CMS and LHCb have confirmed that wms219.cern.ch works fine for them
and I did not receive complaints from ATLAS or ALICE either,
so I think we can consider the current set of rpms and adjustments
to the default configuration satisfactory.
We now can proceed with the formal release procedure.
I will supply details to the certification and release teams.
Thanks,
Maarten
Comments and issues from operations
Maarten: [to get in sync with PATCH:2848]
- wms219 reconfigured with glite-yaim-lb-4.1.0-1
- A bunch of test jobs submitted: all looks normal.
- 2k more jobs submitted, watching for unexpected increases in disk usage: no new processes in "top".
- After a day with 8561 Condor-G jobs, including 5k (sic) "ops" jobs spread all over the grid, there is no sign of real trouble. The only remarkable fact seems to be a new memory consumption record for the Workload Manager:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9387 glite 25 0 3184m 3.0g 5892 S 0.0 19.3 477:11.92 glite-wms-workl
For the gLite 3.1 code I have seen up to 2.7 GB (higher values not excluded).
I think we can go ahead with the release after the formal certification of patch #3156 and with the release notes I detailed earlier.
Cheers,
Maarten
Recommendation for Deployment in production
When we go ahead with the release to production, the following should be part of the release notes:
- In /opt/glite/etc/glite_wms.conf the "--ftpconn" values typically
need to be increased, e.g. from 30 to 300, to avoid the limiter
refusing jobs too frequently.
Bug: https://savannah.cern.ch/bugs/?53297
- In /opt/globus/etc/gridftp.conf "connections_max" typically needs
to be increased e.g. from 50 to 500, to avoid GridFTP connections
being refused too quickly. The YAIM site-info.def should be adjusted accordingly:
GRIDFTP_CONNECTIONS_MAX=500
- The WMProxy logging is fairly useless at its default level,
so the admin may want to increase it from 5 to 6.
Bug: https://savannah.cern.ch/bugs/?53294
Until the issues have been addressed by future YAIM versions, the WMS admin can create a file /opt/glite/yaim/functions/post/config_glite_wms
with the following function to let YAIM adjust the parameters in its post-configuration step:
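The function body itself is missing from this copy of the page. A minimal sketch of what such a post-configuration hook might look like, assuming sed-based in-place edits (the sed patterns and the WMProxy "LogLevel" parameter name are assumptions to be checked against the real configuration files):

```shell
# Hypothetical contents of /opt/glite/yaim/functions/post/config_glite_wms.
# The sed patterns and the "LogLevel" parameter name are assumptions;
# verify them against the actual files before use.
function config_glite_wms () {
    # Paths overridable via environment for testing; defaults are the
    # production locations mentioned in the release notes above.
    local wms_conf=${WMS_CONF:-/opt/glite/etc/glite_wms.conf}
    local gftp_conf=${GFTP_CONF:-/opt/globus/etc/gridftp.conf}

    # Raise the "--ftpconn" limiter threshold from 30 to 300 (bug #53297).
    sed -i 's/--ftpconn[[:space:]]\{1,\}30/--ftpconn 300/' "$wms_conf"

    # Raise the WMProxy log level from 5 to 6 (bug #53294).
    sed -i 's/LogLevel[[:space:]]*=[[:space:]]*5/LogLevel = 6/' "$wms_conf"

    # Raise the GridFTP connection limit to match GRIDFTP_CONNECTIONS_MAX.
    sed -i 's/^connections_max.*/connections_max 500/' "$gftp_conf"
}
```

YAIM picks up functions placed under functions/post/ and runs them after the corresponding main configuration function, so these adjustments should survive a reconfiguration.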
- The Workload Manager is observed to take even more memory than seen
with the WMS 3.1 code and therefore may need to be restarted regularly.
Bug: https://savannah.cern.ch/bugs/?54144
Example cron job:
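The cron job itself did not survive in this copy of the page. A hypothetical /etc/cron.d entry could look as follows; the init-script name glite-wms-wm is an assumption, so check the actual service name on the node:

```shell
# Hypothetical /etc/cron.d/wms-wm-restart: restart the Workload Manager
# nightly at 04:00 to work around the memory growth (bug #54144).
# The init-script path and name are assumptions.
0 4 * * * root /opt/glite/etc/init.d/glite-wms-wm restart >/dev/null 2>&1
```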
Issue                                                   | Assigned to | Bug       | Patch                       | Status
--------------------------------------------------------|-------------|-----------|-----------------------------|-------
glite-brokerinfo does not evaluate attribute references | developer   | BUG:53686 | Integration candidate       | open
Some information is missing in the BrokerInfo file      | developer   | BUG:53706 | fix certified in PATCH:3156 | closed
WMS 3.2 Workload Manager memory leak?                   | operations  | BUG:54144 | None                        | open
There are currently no open critical issues
History
12-Jul-2009 : first installation at CERN
22-Jul-2009 : EMT received the list of critical bugs to be fixed before release to production
29-Jul-2009 : PATCH:3156 with the fixes released to integration and installed on wms219
03-Aug-2009 : Pilot Home page created
05-Aug-2009 : CMS and LHCb confirmed that wms219 is running fine. No bad news from ALICE or ATLAS
07-Aug-2009 : further test after LB re-configuration showed significantly increased memory consumption of the WMS
28-Aug-2009 : WMS 3.2 in production with gLite 3.1 Update 53