Please record all service interruptions here. I remember the release and the problems that followed, the servers not responding, and the recent issue with certificates.
Planned Y/N (Duration in Hours) |
Start Time |
End Time |
Services Interrupted |
Impact |
Reasons and Solutions |
N (2.0) |
1/11/2011 13:00 |
1/11/2011 15:00 |
Job submission and job registration |
Jobs ran but never finished because the queue in the repository was never processed |
Removed a broken job from the queue and restarted the pool |
N (28.0) |
23/10/2011 07:50 |
24/10/2011 12:00 |
etics & etics-repository |
Etics: the configuration and repository tabs had time-out issues. Etics-repository: jobs were waiting in the queue to be registered |
Tomcat had hung on both servers. A proper reboot fixed the problem |
N (26.0) |
06/09/2011 10:30 |
07/09/2011 12:30 |
SL6 worker nodes |
It was not possible to build on SL6 |
Condor crashed, so it was not possible to submit SL6 builds. |
N (4.0) |
06/09/2011 10:30 |
07/09/2011 14:30 |
All worker nodes |
It was not possible to build because the jobs were not matching |
Condor crashed, so it was not possible to match jobs. |
N (7.0) |
26/08/2011 04:00 |
26/08/2011 11:00 |
etics-repository |
Builds were not registered in the etics-repository |
The server root partition was full because of its small size and the temporary files from nightly dumps saved in /tmp. The temporary files were removed. |
N (26.0) |
21/08/2011 08:00 |
22/08/2011 10:00 |
etics-server |
Portal, web applications and web services on the etics server were not accessible |
The server root partition was full because of the logs in /var/log/httpd. Deleting some logs fixed the problem |
N (1.0) |
16/08/2011 12:00 |
16/08/2011 13:00 |
etics-server |
Portal, web applications and web services on the etics server were not accessible |
The server was down and even a Tomcat restart did not fix the problem. A complete reboot of the server fixed the issue |
N (4.5) |
15/08/2011 9:30 |
15/08/2011 14:00 |
etics-repository |
etics-repository access to files |
As the cron job for cleaning the AFS datastore location is still not working, we needed to trigger a garbage collection because AFS quota was at 98%. The garbage collection operations are heavy on the repository (usually done at 4AM) and they blocked the download of files. |
N (6.0) |
13/08/2011 12:00 |
15/08/2011 18:00 |
etics-aux cron builds |
EMI Nightly builds |
The client creates a copy of the certificates in .eticskyStore. This copy was corrupted because one certificate had not been copied. Removing the folder fixed the problem (if it is missing, the client recreates it) |
N (6.0) |
10/08/2011 04:00 |
10/08/2011 10:00 |
etics server |
Portal, web applications and web services on the etics server were not accessible |
Kernel problem. No commands could be executed via ssh. A hardware reboot was required; it updated the kernel, which seems to have solved the problem. |
N (20.0) |
08/08/2011 14:00 |
09/08/2011 10:00 |
ALL |
Portal, web applications and web services on the etics server were not accessible |
Problems installing the new release (firewall, permissions, etc.) and a very long import/export of the database |
Y (2.0) |
08/08/2011 12:00 |
08/08/2011 14:00 |
ALL |
Portal, web applications and web services on the etics server were not accessible |
Installation of the new release, ETICS 3.5 |
N (17.0) |
07/08/2011 23:00 |
08/08/2011 16:00 |
etics-repository |
etics-repository registrations and access to files |
The cron job that cleans the AFS datastore location was not running. The AFS quota was exceeded, so no files could be stored. The garbage collector was run manually. |
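
When adding a new entry, the hours value can be cross-checked against the start and end times. The following is only a minimal sketch, assuming the hours column is meant to be the elapsed time between the two timestamps and that dates are written day/month/year as in the table. The names `outage_hours` and `check_row` and the 15-minute tolerance are illustrative assumptions, not part of any ETICS tool.

```python
from datetime import datetime

# Assumed timestamp layout: day/month/year hour:minute, as in the table.
FMT = "%d/%m/%Y %H:%M"


def outage_hours(start: str, end: str) -> float:
    """Return the elapsed hours between two table timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 3600


def check_row(start: str, end: str, logged_hours: float, tolerance: float = 0.25):
    """Return (matches, computed_hours) for one table row.

    The tolerance (in hours) is an arbitrary choice that allows for
    times rounded to the nearest quarter of an hour.
    """
    computed = outage_hours(start, end)
    return abs(computed - logged_hours) <= tolerance, computed


if __name__ == "__main__":
    # A few rows copied from the table: (start, end, logged hours).
    rows = [
        ("1/11/2011 13:00", "1/11/2011 15:00", 2.0),
        ("15/08/2011 9:30", "15/08/2011 14:00", 4.5),
        ("07/08/2011 23:00", "08/08/2011 16:00", 17.0),
    ]
    for start, end, logged in rows:
        ok, computed = check_row(start, end, logged)
        status = "ok" if ok else "CHECK"
        print(f"{start} -> {end}: logged {logged}h, computed {computed}h, {status}")
```

A row flagged `CHECK` does not necessarily contain an error: the logged value may cover only part of the window, for example working hours only.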