LCG Web>LCGGridDeployment>LCGProductionServices>WorkLog>WorkLog2005 (2007-02-19, AlbertoAimar)

LCG Production Services - LCG Grid Deployment

Current WorkLog (2007)

Production Services Work Log (2005)

2005-12-14: Problem with NFS on lxb2003. Service restarted.

2005-12-14: Changed bdii for gdrb10 from lcg-bdii to atlas-bdii.cern.ch.

2005-12-13: Added DLI to gdrb04, gdrb07, gdrb09 & gdrb10.

2005-12-13: David Smith: Added DLI use to gdrb04, gdrb07 & gdrb10 again.

2005-12-12: all the RB services on lxn1177 have been stopped.

2005-12-12: Move R-GMA to Authenticated Connectors for all nodes in production.

2005-12-12: An unexpected power cut crashed all production RBs gdrbxx.

2005-12-09: problem with the disk server lxfsrk524 attached to lxb2004, lxb2003 and lxn1183.

2005-11-30: 5 AFS users added on gdrb09 and gdrb10 (see mail from Simone). Fixed in the afternoon.

2005-11-30: David Smith: A couple of bkserver processes were constantly running on gdrb08, so I restarted the LB and LL services. The problem appears to have set in yesterday after a mysql client timeout was logged. I will investigate the LB server code.

2005-11-29: David Smith: The LM on gdrb01 was approaching its FD limit of 1024. I removed 690 jobs from the condor queue dating from September or October, waited for the LM to process the aborts and then sent it a SIGTERM. Many condorg logfiles were tidied away into recycle/. Restarted LM again, FD use at 79.
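
A minimal sketch of how descriptor use like the "FD use at 79" figure above could be read on a Linux host, assuming the count is taken from /proc/<pid>/fd. The function names, the 0.9 warning threshold and the proc_root parameter are illustrative assumptions and not part of the RB tooling.

```python
import os


def fd_usage(pid="self", limit=1024, proc_root="/proc"):
    """Return (open_fds, fraction_of_limit) for a process on Linux.

    Each entry in <proc_root>/<pid>/fd is one open file descriptor.
    The default limit of 1024 matches the figure quoted in the log.
    """
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    open_fds = len(os.listdir(fd_dir))
    return open_fds, open_fds / limit


def near_limit(open_fds, limit=1024, threshold=0.9):
    """True when descriptor use is at or above threshold * limit.

    The threshold is a hypothetical warning level, not a documented one.
    """
    return open_fds >= threshold * limit
```

With these assumptions, near_limit(79) is false and near_limit(1000) is true, so a periodic check of this kind could flag the LM before it runs out of descriptors.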

2005-11-29: Installed R-GMA version 5.0.8-1 on all gdrb machines and pointed them to lxn1178.

2005-11-28: patch for the R-GMA vulnerability installed on mon.cern.ch and lxn1178 (cf. http://goc.grid.sinica.edu.tw/gocwiki/R-GMA_server_upgrade_-_Patch_%23593).

2005-11-28: lcg-bdii (bdii001) highly overloaded this afternoon. Need to add another node behind the alias.

2005-11-28: R-GMA switched off on the monbox mon.cern.ch and lxn1178 (spare monbox) for security reasons (see mail from Ian Neilson). File /etc/cron.d/edg-rgma-restart disabled.

2005-11-18: David Smith: gdrb01 back to normal. Heavy submission load stopped at about 19hrs yesterday. I restarted the LL/LB later that night - there was 1 LB process using about 300Mbytes memory and in existence for 6 hours and the whole logging system appeared to be running slowly.

2005-11-17: David Smith: Sent mail to user submitting most of the jobs to gdrb01 asking them to also submit to gdrb03, to help with load balancing. (Copied also to Roberto Santinelli)

2005-11-17: David Smith: Noticed heavy submission load on gdrb01, cpu 100%. (Instantaneous submission rate ~30,000 per day). Tried adjusting relative scheduling priorities of service daemons to avoid broker backlog - however at 13:25 there is a matchmaking backlog of 397 jobs (about 25 minutes).
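
The figures in this entry can be compared with a short calculation. This is a sketch under one assumption: that "397 jobs (about 25 minutes)" means the backlog would take about 25 minutes to clear at the current matchmaking rate. The function names are illustrative.

```python
MINUTES_PER_DAY = 24 * 60


def per_minute(jobs_per_day):
    """Convert a daily submission rate to jobs per minute."""
    return jobs_per_day / MINUTES_PER_DAY


def implied_rate(backlog_jobs, backlog_minutes):
    """Matchmaking throughput implied by a backlog and its drain time."""
    return backlog_jobs / backlog_minutes


submission = per_minute(30_000)      # about 20.8 jobs per minute
matchmaking = implied_rate(397, 25)  # about 15.9 jobs per minute
```

Under that reading, jobs were arriving faster than they were being matched, which is consistent with the backlog growing while the heavy submission continued.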

2005-11-16: lcg-bdii points now to bdii001.cern.ch.

2005-11-15: Added user 'lhcbprod' to gdrb04, gdrb07.

2005-11-10: lxn1178 becomes a spare monbox. Stress test started.

2005-11-10: lxb0725 removed from production. Stress test started for this machine.

2005-11-08: update of the new set of CA rpms (v1.00) on all production machines.

2005-11-07: Motherboard + memory changed on lxn1178.

2005-11-04: David Smith: Restarted lcg-mon-job-status on gdrb11, it had reached its fd limit.

2005-11-04: David Smith: Upgraded gdrb06 to lcg2.1.69-13

2005-11-02: David Smith: Added dteam, atlas as DLI VOs on gdrb10. This change was originally made on 10 Oct, but lost on 19 Oct.

2005-11-02: David Smith: Restarted lcg-mon-job-status on gdrb04, gdrb07, they had reached their fd limit

2005-11-01: David Smith: Restarted LM on gdrb04, has been down since Oct 20.

2005-10-26: David Smith: Updated gdrb01, gdrb03 to lcg2_1_69_13

2005-10-25: Restarted lcg-mon-job-status on all gdrb nodes.

2005-10-24: no ssh on lxb0725. Need to reboot it. Fixed.

2005-10-24: no ping for lxn1192 (archiver). ITCM generated (motherboard dead).

2005-10-24: gdrb01 blocked during the week-end (reason unknown). Need to reboot it. Fixed

2005-10-21: 50 more cms pool accounts added on gdrb01, gdrb03 and gdrb08 with UIDs in the range 50051-50100 for the time being.

2005-10-20: David Smith: Restarted the bkserverd (i.e. the LB server) on gdrb07. It seems the service had been restarted about an hour before with the script init.d/edg-wl-lbserver.ORIG rather than init.d/edg-wl-lbserver. The difference is that the latter script starts the LB server with different options (non-standard, but requested by EIS).

2005-10-20: David Smith: Updated gdrb02 to lcg2_1_69_13.

2005-10-19: David Smith: Added name server '137.138.16.5' as first on the search list in /etc/resolv.conf on all the gdrb machines (previously only 137.138.17.5 was listed). gdrb01, gdrb08 and gdrb10 were unable to get replies from 137.138.17.5 for several hours this morning, although the other machines were. (Possibly a bad network flow??). gdrb01 and gdrb08 can now reach 137.138.17.5, although gdrb10 still cannot.
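
The resolv.conf change described above can be sketched as a small text transformation, assuming it amounts to putting a new nameserver line ahead of the existing one. The function name is hypothetical, and the sketch works on the file contents as a string and does not edit /etc/resolv.conf in place.

```python
def prepend_nameserver(resolv_text, server):
    """Return resolv.conf text with `server` as the first nameserver line.

    Other lines keep their order; an existing entry for `server`
    is moved to the front of the nameserver lines, not duplicated.
    """
    lines = resolv_text.splitlines()
    target = f"nameserver {server}"
    kept = [ln for ln in lines if ln.split() != ["nameserver", server]]
    # Insert before the first remaining nameserver line, or append
    # at the end when the file has no nameserver line at all.
    for i, ln in enumerate(kept):
        if ln.split()[:1] == ["nameserver"]:
            kept.insert(i, target)
            break
    else:
        kept.append(target)
    return "\n".join(kept) + "\n"
```

For example, given a file holding only "nameserver 137.138.17.5", calling prepend_nameserver(text, "137.138.16.5") yields the 16.5 server first and the 17.5 server second. Running it again leaves the result unchanged.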

2005-10-19: David Smith: gdrb07 (lhcb) broker had NS and WM daemons running that were not configured to use DLI (e.g. they were restarted when the edg_wl.conf did not contain an appropriate DLICatalog list). The configuration file was changed subsequent to their launch, and the daemons do not reread this part of the configuration. I restarted NS, WM.

2005-10-19: some CRLs had expired on gdrb02 (bug #12182 on Savannah). Fixed.

2005-10-18: lxn1182 (RB for LHCB) has been retired from production.

2005-10-18: no ssh on lxb0725. Need to reboot this machine. Fixed.

2005-10-17: VO geant4 has been added in the list of VOs of gdrb01 and gdrb03.

2005-10-17: three new (host) DNs have been added on myproxy: VOBOX for gliocl.itep.ru (ITEP site), vobox01.pic.es (PIC site) and lxgate03.cern.ch.

2005-10-17: lxn1178 reinstalled from scratch.

2005-10-17: new alias mon.cern.ch points to the new MonBox lxb2009.cern.ch. GOC DB updated. All the gdrbxx machines now point to mon.cern.ch (lcg-mon-job-status daemon restarted).

2005-10-14: lxn1178 blocked twice this morning (reason unknown). Need to reinstall this machine from scratch.

2005-10-13: lxn1178 blocked (reason unknown). Need to reboot it. Fixed.

2005-10-13: end of kernel upgrade on all production nodes.

2005-10-13: David Smith: Added lhcb, dteam to use DLI by default on gdrb04.

2005-10-13: a new (host) DN has been added on myproxy (VOBOX for Alice at RAL).

2005-10-13: gdrb04 has been assigned to LHCB experiments.

2005-10-12: a new (host) DN has been added on myproxy (VOBOX in Taiwan).

2005-10-12: beginning of kernel upgrade on all production nodes.

2005-10-11: a new (host) DN has been added on myproxy (VOBOX in Bari).

2005-10-10: David Smith: Updated RB software on gdrb01 & gdrb03 to lcg2_1_69_12.

2005-10-10: David Smith: Added atlas (& dteam) to default to using DLI on gdrb10, at the request of atlas. (ie. updated /opt/edg/etc/edg_wl.conf)

2005-10-07: David Smith: Updated RB software on gdrb02 to lcg2_1_69_12.

2005-10-06: David Smith: Updated RB software on gdrb02 to lcg2_1_69_11.

2005-10-03: David Smith: Updated RB software on gdrb02 to lcg2_1_69_9. This includes some fixes and new features.

2005-09-20: DHS: Added mysql access for lcg2 monitoring user for tables states & events to gdrb 1,3,4,6,8,9,10,11. gdrb 2,7 already had access granted.

2005-08-19: During the last few days, some errors occurred on gdrb08 (and other RBs too). The problem was probably due to the sandbox disk on gdrb08 filling up completely with huge output files for some users. The sandbox area has been cleaned up and a cleanup job will be installed; the next RB version will prevent the upload of sandboxes exceeding a limit set by the admin.

2005-08-19: RGMA configuration has been updated on gdrb02 to gdrb11. Server is now lxn1178, and registry and schema now point to lcgic01.gridpp.rl.ac.uk. Daemon lcg-mon-job-status restarted.

2005-08-19: partition /mnt/raid full again on gdrb08 (GGUS ticket #4505). Huge files have been truncated. Fixed.

2005-09-15: impossible to submit jobs on gdrb08 (GGUS ticket #4480). Need to remove irepository.dat file and contents of the /mnt/raid/rb-state/opt/edg/var/spool/edg-wl-renewd directory. Fixed.

2005-09-13: DHS: Raid on gdrb01 is 85% full. Truncating sandbox files ending in stderr, stdout larger than 50Mb to 10Mb.

2005-09-13: DHS: Checked condition of gdrb08: Fixed corrupted renewd data file, restarted renewd. Removed ~600 held jobs from CondorG queue. The LM irepository.dat was corrupt, stopping the LM from restarting. Stopped JC, removed LM data file and restarted LM, forcing it to reprocess the last 7 CondorG logs to repopulate the repository data file. Restarted JC.

2005-08-13: gdrb08 RAID partition was full. All output sandbox files with size > 100MB have been deleted. Fixed by Maarten.

2005-08-13: gdrb06 configured for CMS and dteam users only.

2005-09-07: DHS: gdrb03 RAID partition was 99% full. Truncated sandbox files with names ending in 'stdout' or 'stderr' that were larger than 100Mbytes to 10Mbytes. RAID now 73% full.
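
The truncation described above can be sketched roughly as follows. This assumes the first 10 Mbytes of each oversized file are kept; the log does not say which part was retained. The function name and the directory walk are illustrative, not the command actually used.

```python
import os

MB = 1024 * 1024


def truncate_large_logs(root, max_size=100 * MB, keep=10 * MB,
                        suffixes=("stdout", "stderr")):
    """Truncate oversized files under `root` whose names end in `suffixes`.

    Keeps the first `keep` bytes of each file larger than `max_size`.
    Returns the list of paths that were truncated.
    """
    truncated = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffixes):
                continue
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # do not follow symlinks out of the sandbox area
            if os.path.getsize(path) > max_size:
                os.truncate(path, keep)
                truncated.append(path)
    return truncated
```

Passing max_size=50 * MB would match the threshold used on gdrb01 on 2005-09-13. Files that do not end in stdout or stderr are left alone, so user data in the sandbox is not touched.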

2005-09-07: (auto)update of the new set of CA rpms (v0.32) on all production machines.

2005-09-07: Partition /var full on lxb2003. A gridftp~ file was in /etc/logrotate.d directory. Need to remove it. Fixed by Maarten.

2005-09-06: AFS account for user santinel (Roberto Santinelli) added on gdrb07.

2005-09-06: RAID disk failure (hdg + raid controller) on gdrb05. Status changed to broken.

2005-09-05: gdrb05 repaired (needed to reboot it several times; possibly a problem with the raid controller) and reinstalled again. Stress test executed every five hours in order to test the raid disks + raid controller. This node moves to the production cluster again.

2005-09-02: new BDII alias "exp-bdii". It is a replacement alias for the atlas-bdii. This BDII service is linked to the freedom of choice page. Make sure that all experiments use this alias or "lcg-bdii" if they do not want to use the freedom of choice page.

2005-09-02: a new (host) DN has been added on myproxy (VOBOX in Milano).

2005-09-01: problem in the /etc/hosts file on gdrb08. This file listed gdrb08 gdrb08.cern.ch, rather than the fqdn first followed by the alias. Fixed by David.
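
A small sketch of the ordering fix described above, assuming the rule is simply that the first dotted name on an /etc/hosts line should come directly after the address. The function name and the 192.0.2.8 address used in the example are placeholders, not values from the gdrb machines.

```python
def fqdn_first(line):
    """Reorder one /etc/hosts entry so a dotted name comes first.

    Returns the line unchanged if it is blank, a comment, has fewer
    than two hostnames, or already lists a dotted name first.
    """
    body, sep, comment = line.partition("#")
    fields = body.split()
    if len(fields) < 3:
        return line
    addr, names = fields[0], fields[1:]
    dotted = [n for n in names if "." in n]
    if not dotted or names[0] == dotted[0]:
        return line
    names.remove(dotted[0])
    fixed = " ".join([addr, dotted[0]] + names)
    # Re-attach a trailing comment if the original line had one.
    return fixed + (" " + sep + comment if sep else "")
```

Applied to "192.0.2.8 gdrb08 gdrb08.cern.ch", this returns "192.0.2.8 gdrb08.cern.ch gdrb08", the fqdn followed by the alias.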

2005-09-01: problem in the /etc/hosts file on gdrb07, gdrb09, gdrb10 and gdrb11: the entries were wrong (they listed lxbXXXX.cern.ch, the former name of these nodes). Fixed by David.

2005-09-01: RAID disk failure (hdg?) on gdrb05. Status changed to broken.

2005-09-01: Apel problem on the CERN-PROD site. Tomcat daemon on the MON box lxb2009.cern.ch had died around 14:00 on 29-08-2005. Daemon restarted. Fixed.

2005-09-01: No ping on lxb2005 (atlas-bdii). Need to reboot this machine. Fixed.

2005-08-30: new update of lcg-mon-job-status package on RBs gdrb02 to gdrb11.

2005-08-29: update of lcg-mon-job-status package on RBs gdrb02 to gdrb11 (see instructions).

2005-08-29: SL 3.0.5 + Raid software + RB + fabric monitoring installed on gdrb05. This node moves to the production cluster again.

2005-08-26: gdrb05 repaired (hdg + motherboard replaced).

2005-08-25: on all gdrbxx (except gdrb05 - Done on 2005-08-29): user pchand added + /etc/shift.localhosts file created with the following content below + rfiod daemon restarted:

2005-08-25: Installation of a new MONBOX on lxn1190. GOC DB updated.

2005-08-24: according to the Yumit results web page, all nodes have been updated (kernel, SL3 release and packages).

2005-08-24: mysql database updated on gdrb02 in order to allow the monitoring of this machine by the Real Time Monitor tool (see the GOC Wiki Administration FAQ section "How to allow monitoring of your RB by the Real Time Monitor").

2005-08-24: file /opt/bdii/etc/bdii-update.conf updated on site-bdii (entries for gdrbxx nodes added). The gdrb05 entry remained to be done (Done on 2005-08-29).

2005-08-23: GOC DB for CERN-CIC site updated: new gdrbxx nodes added and old RBs removed.

2005-08-23: gdrb09 and gdrb10 have been assigned for atlas and dteam VOs only.

2005-08-23: lxn1175, lxb0728 and lxb0729 become free for production use only (These machines have been re-installed from scratch).

2005-08-23: gdrb03 becomes available for general production RB (however not available for VO unosat). See mail written by David on the lcg-rollout mailing list.

2005-08-23: a new (host) DN has been added on myproxy.

2005-08-22: apel problem for SFT solved on lxn1184 (CERN-CIC). The problem with apel was due to the fact that we were using the same MON node to publish records from both the CERN-PROD and CERN-CIC sites. The site name is used to publish the tuples and it had got set to LCG-CERN. Laurence has set up two cronjobs with different config files to publish information about both sites:
- CERN-CIC uses /opt/glite/etc/glite-apel-publisher/publisher-config-yaim.xml
- CERN-PROD uses /opt/glite/etc/glite-apel-publisher/publisher-config-prod.xml

2005-08-22: disk full on gdrb08 (CMS RB) during the submission of a job (GGUS ticket #4170). Two users (from CMS) have generated huge files (> 2Gb). Need to remove all these files. Solved.

2005-08-18: warning on GStat concerning the site name attribute (this variable must be set to CERN-CIC). Fixed.

2005-08-18: replica management test failed on lxn1184 (LRC and RMC endpoints disappeared). An error was made when we tried to do an intervention to add some information for a VO. Solved by Laurence.

2005-08-17: error while retrieving files on gdrb08 (GGUS ticket #4418). The cause of the problem is that a user had been remapped on the RB and was subsequently trying to retrieve output that was submitted with the old mapping. Need to restart crond daemon on all the gdrbxx nodes. Solved by David.

2005-08-16: apel problem for SFT test on lxn1184 (CERN-CIC). This problem comes from the cron jobs. The two environment variables RGMA_HOME and APEL_HOME have to be set to /opt/glite in the /etc/crontab/edg-rgma-apel, /opt/edg/etc/profile.d/edg-rgma-env.sh and /opt/edg/etc/profile.d/edg-rgma-env.csh files. Will be fixed in the new yaim release.

Topic revision: r7 - 2007-02-19 - AlbertoAimar


Copyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.