EMI Web>EmiProjectStructure>SA2>EticsStatusLogbook (2011-11-01, AndresAbadRodriguez)

ETICS Status and Logbook

Current Status

Service	Status
Build Service	OK
Repository Access	OK
Nightly builds	OK
EMI Builds	OK
Saket Builds	OK

Logbook of Services Interruptions

Please record all service interruptions here. I remember the release and the problems that followed, the servers not responding, and the recent issue with certificates.

Planned (Hours)	Start Time	End Time	Services Interrupted	Impact	Reasons and Solutions
N (2.0)	1/11/2011 13:00	1/11/2011 15:00	Job submission and job registration	Jobs ran but never finished because the queue in the repository was never processed	Removed a broken job from the queue and restarted the pool
N (28.0)	23/10/2011 07:50	24/10/2011 12:00	etics & etics-repository	Etics: configuration and repository tabs had timeout issues. Etics-repository: jobs were waiting in the queue to be registered	Tomcat was hung on both servers. A clean reboot fixed the problem
N (26.0)	06/09/2011 10:30	07/09/2011 12:30	SL6 worker nodes	It was not possible to build on SL6	Condor crashed and it was not possible to submit SL6 builds.
N (4.0)	06/09/2011 10:30	07/09/2011 14:30	All worker nodes	It was not possible to build because the jobs were not being matched	Condor crashed and it was not possible to match jobs.
N (7.0)	26/08/2011 04:00	26/08/2011 11:00	etics-repository	Builds were not registered in the etics-repository	The server root partition was full because of its small size and the temporary files of nightly dumps saved in /tmp. Temporary files removed.
N (26.0)	21/08/2011 08:00	22/08/2011 10:00	etics-server	Portal, web applications and web services on the etics server were not accessible	The server root partition was full because of the logs in /var/log/httpd. Deleting some logs fixed the problem
N (1.0)	16/08/2011 12:00	16/08/2011 13:00	etics-server	Portal, web applications and web services on the etics server were not accessible	The server was down and even a Tomcat restart did not fix the problem. A complete reboot of the server fixed the issue
N (4.5)	15/08/2011 9:30	15/08/2011 14:00	etics-repository	etics-repository access to files	As the cron job for cleaning the AFS datastore location is still not working, we needed to trigger a garbage collection because AFS quota was at 98%. The garbage collection operations are heavy on the repository (usually done at 4AM) and they blocked the download of files.
N (6.0)	13/08/2011 12:00	15/08/2011 18:00	etics-aux cron builds	EMI Nightly builds	The client creates a copy of the certificates in .eticskyStore. This copy was corrupted because one certificate had not been copied. Removing this folder fixed the problem (if it is not present, the client recreates it)
N (6.0)	10/08/2011 04:00	10/08/2011 10:00	etics server	Portal, web applications and web services on the etics server were not accessible	Kernel problem. No commands could be executed via ssh. A hardware reboot was required, which updated the kernel and seems to have solved the problem.
N (20.0)	08/08/2011 14:00	09/08/2011 10:00	ALL	Portal, web applications and web services on the etics server were not accessible	Problems installing the new release (firewall, permissions, etc.) and a very long import/export of the database
Y (2.0)	08/08/2011 12:00	08/08/2011 14:00	ALL	Portal, web applications and web services on the etics server were not accessible	New release installation. ETICS 3.5
N (17.0)	07/08/2011 23:00	08/08/2011 16:00	etics-repository	etics-repository registrations and access to files	The cron job to clean the AFS datastore location was not running. The AFS quota was exceeded and no files could be stored. The garbage collector was run manually.
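
The hours in the first column are meant to match the time between Start Time and End Time. The following is a minimal sketch, not part of ETICS, of how a row could be checked before it is saved. It assumes rows are tab-separated as in the table above and that timestamps are day/month/year; the function names and the 0.5 hour tolerance are hypothetical choices made for illustration.

```python
from datetime import datetime
from typing import Tuple

# Timestamp layout of the Start Time / End Time columns (day/month/year).
TIME_FORMAT = "%d/%m/%Y %H:%M"


def elapsed_hours(start: str, end: str) -> float:
    """Return the number of hours between two logbook timestamps."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return delta.total_seconds() / 3600.0


def check_row(row: str, tolerance: float = 0.5) -> Tuple[float, float, bool]:
    """Parse one tab-separated logbook row.

    Returns (recorded_hours, computed_hours, consistent), where consistent
    is True when both values agree within the given tolerance (in hours).
    """
    fields = row.split("\t")
    # The first field looks like "N (7.0)" or "Y (2.0)".
    recorded = float(fields[0].split("(")[1].rstrip(")"))
    computed = elapsed_hours(fields[1], fields[2])
    return recorded, computed, abs(recorded - computed) <= tolerance


if __name__ == "__main__":
    rows = [
        "N (7.0)\t26/08/2011 04:00\t26/08/2011 11:00\tetics-repository",
        "N (4.5)\t15/08/2011 9:30\t15/08/2011 14:00\tetics-repository",
    ]
    for row in rows:
        recorded, computed, ok = check_row(row)
        status = "ok" if ok else "MISMATCH"
        print(f"{recorded:5.1f} h recorded, {computed:5.1f} h computed: {status}")
```

A row reported as MISMATCH usually points to a typo in one of the dates or in the hours value, so it is worth rechecking both before trusting either.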

Topic revision: r12 - 2011-11-01 - AndresAbadRodriguez
