Week of 100802

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here.

The SCOD rota for the next few weeks is at ScodRota.

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Hurng, Harry(chair), Maarten, Jean-Philippe, Doug, Edward, Simone, Dirk, Jan, Jamie, Carlos, Ian, Ricardo);remote(Mark(IN2P3), Michael(BNL), Onno(NL-T1), Xavier(KIT), Gonzalo(PIC), Gang(ASGC), Vladimir(LHCb), Gareth(RAL), Rob(OSG), Stefano(INFN), Christian(NDGF), Catalin(FNAL)).

Experiments round table:

ATLAS:

Tier-0: Beam Friday night through Saturday morning and things went well; the systems kept up with the data. Mid-morning Saturday one merged RAW file from the Tier-0 system was moved onto the CASTOR pool machines and then was not seen again. This was noticed a couple of hours later when the file could not be exported to a Tier-1 site. A couple of tickets were created and the Tier-0 export experts were called in. After discussion it was concluded that this really was a CASTOR problem and an alarm ticket was raised against CERN-PROD to contact the experts. A couple of hours later there was no reply; I called the central computing on-call and also e-mailed Jan Iven directly. The central computing on-call did not know about the alarm ticket (nor really seem to know what an alarm ticket is), but Jan updated the tickets; he also said he had not seen the alarm ticket, so no alarms were raised. Jan looked at the machine and said the RAID controller had reset before the file was written to disk, so it was really lost. (Jan gave an update that this was a known problem with this controller that should not happen again; in addition the matching sensor will be improved. Doug added that the T0 experts are also looking to improve their CASTOR handshake monitoring.) Over the next few hours there were attempts to call the online systems experts and more discussions with the Tier-0 experts; by midnight it was determined which un-merged data was still on disk, and the file could be replicated. Discussions are now ongoing about how to get a new file into the frozen dataset and correctly mark the data as replaced. We also need to discuss procedures about who to call when, and to track down what really happens to alarm tickets sent to CERN-PROD. GGUS:60720 GGUS:60722 GGUS:60723

Also a GGUS ticket about the alarm problem: GGUS:60762. Reply (marked SOLVED): we fixed the problem this morning; the permissions on a specific directory had been removed. We are currently investigating how this happened; we assume it happened with the last GGUS release, probably when copying directories from the test instance to the production instance. As a consequence, in future we will not only test the alarm process via an automated script but also submit at least one alarm ticket via the web portal.

Also at the Tier-0: an AFS overload issue. Acron jobs from ddmusr03 accessing a certain AFS volume are failing with timeouts; details in GGUS:60765.

Tier-1: Disk failure at RAL; the site says the data is not lost but it will take about 3 days to rebuild the disks. During this time the data is still registered as existing at RAL, so jobs show up and fail repeatedly. On Sunday afternoon the problems increased until all data transfers into and out of the site were failing. After some discussion it was decided to blacklist RAL and to set the available space in the data distribution tools to 0. Late at night the site reported that the failures were due to disk outages and database corruption and that these should be fixed. The site was turned on again Monday morning and transfers are mostly working, but 16% of transfers to the site are still failing. GGUS:60734. Gareth confirmed one disk server is still out of production (it suffered fsprobe failures and a disk failure and is being checked out now). A second disk server had gone down on Thursday and came back this morning, but the main problem over the weekend was Oracle database corruption behind the ATLAS CASTOR stager. This stopped the castoratlas instance working for several hours until yesterday evening.

A disk controller failure at FZK is causing random failures all over the DE cloud. It was not clear whether this could be fixed over the weekend, and the status at this time is not known. GGUS:60721. Xavier reported that the affected pool should be coming back online as we speak.

Tier-2: SRM contact issues off and on repeatedly at WEIZMANN-LCG2 for days now. The site finally declared a downtime in GOCDB and was automatically blacklisted by the ATLAS tools. GGUS:60710

RHUL has been failing SAM tests for days now and is still out. GGUS:60631

Milano passed all SAM tests for a full day; they were asked whether they wish to be turned back on, and the site seems to be working at this time. GGUS:60483

RRC-KI is in maintenance mode and has been out for days.

LIP-Coimbra has been producing SRM contact problems for days without any feedback from the site; it was blacklisted until the site replies. GGUS:60715

SRM contact issues with IL-TAU-HEP; the site is adding more disks to its Lustre pool. GGUS:60727

TR-10-ULAKBIM: bdii.ulakbim.gov.tr returns "No entries for host"; the information published for the site needs to be looked at (see the query sketch below). GGUS:60728
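For reference, what a site BDII actually publishes can be checked with a simple LDAP query against the standard BDII port 2170. A minimal sketch follows, assuming the third-party python-ldap package is available; the GLUE object class and attributes queried here are only an illustrative choice, not necessarily what was checked for this ticket.

    # Minimal sketch: list what a site BDII publishes (illustrative query).
    # Requires the third-party python-ldap package; port 2170 is the standard
    # BDII port, the filter and attributes below are just examples.
    import ldap

    BDII_URL = "ldap://bdii.ulakbim.gov.tr:2170"

    conn = ldap.initialize(BDII_URL)
    conn.simple_bind_s()  # anonymous bind
    entries = conn.search_s(
        "o=grid",                        # top of the GLUE information tree
        ldap.SCOPE_SUBTREE,
        "(objectClass=GlueSE)",          # e.g. published storage elements
        ["GlueSEUniqueID", "GlueSEStatus"],
    )
    for dn, attrs in entries:
        print(dn, attrs)
    conn.unbind_s()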

CMS:

Tier-0: Normal operation; a smooth weekend.

Tier-1s: 2 new transfer issues. https://savannah.cern.ch/support/index.php?116025: T1_ES_PIC_Buffer and T1_IT_CNAF_Buffer -> T2_US_Purdue transfer errors during the last hour, and https://savannah.cern.ch/support/index.php?116015: transfer failures from T1_IT_CNAF to T2_US_Caltech. At least CNAF to Caltech seems to be related to the June 24 change at CNAF that appears to have stopped jumbo frames from working (see the MTU probe sketch below). Stefano reported from CNAF that the problems are related to their GPN connection not yet being enabled for jumbo frames (while the OPN is) and that this is being worked on.
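Jumbo-frame problems of this kind usually show up as a path-MTU mismatch, which can be probed by sending non-fragmentable pings of jumbo and standard size. Below is a minimal sketch wrapping the standard Linux iputils ping (-M do prohibits fragmentation, -s sets the ICMP payload size); the host name is a placeholder, not one of the endpoints above.

    # Minimal sketch: check whether jumbo-sized frames cross a path without
    # fragmentation. 8972 B of ICMP payload + 28 B of headers = 9000 B frame,
    # 1472 + 28 = 1500 B. The host name below is a placeholder.
    import subprocess

    def path_carries_payload(host: str, payload: int) -> bool:
        result = subprocess.run(
            ["ping", "-M", "do", "-c", "3", "-s", str(payload), host],
            capture_output=True, text=True,
        )
        return result.returncode == 0  # 0 only if the probes got replies

    host = "se.example-t2.org"
    for payload in (1472, 8972):
        ok = path_carries_payload(host, payload)
        print(f"{payload + 28}-byte frames to {host}: {'ok' if ok else 'blocked'}")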

ALICE:

Weekend report: the number of running jobs increased during the weekend due to the startup of a new MC cycle (currently 15% of the expected number of events has already been generated). 27 TB of raw data were recorded in CASTOR@T0 and in addition the raw data transfer activity continued during the weekend, in particular with FZK, CNAF and NDGF (maximum peak of 200 MB/s achieved).

T0 site: GGUS:60747 submitted this morning: ce201 is showing SSL authentication problems at submission time. Current status: SOLVED.

A Remedy ticket was submitted through VOBOX support (CT702603). It required access to the CDB template prod/pro_params_alice_fileserver_xrootd, responsible for the voalicefs0* nodes (xrootd). Current status: SOLVED.

T1 sites: All T1 sites up and running during the weekend

T2 sites: Several operational issues found at some T2 sites that are being directly handled with the Alice contact persons at the sites

LHCb:

Experiment activities: Received a lot of data during the weekend. Reconstruction, Monte Carlo and user analysis jobs. GGUS (or RT) tickets:

T1 site issues: 1) Still problems at SARA with NIKHEF_M-DST. 2) Software installation problem at IN2P3 (SharedArea).

Sites / Services round table:

  • BNL: Had a small problem with their shared area being filled by ATLAS production jobs more rapidly than the cleaning mechanism could cope with. The area has been doubled in size.

  • NL-T1: The ATLAS stager problem where requests expire very quickly (GGUS:60587) is still under investigation with the dcache developers. The LHCb timeouts in returning turls (GGUS:60603) are probably also related to this. Simone reported that he had checked that DDM was not passing bad parameters but that some parts of a multi-file request were succeeding. Onno thought this was normal for those files which were already disk resident.

  • KIT: On Friday the SRM door gridka-dcache.fzk.de was down for 2 hours. The reason was that the PostgreSQL database was 100% busy and had to be rebooted. Today a fileserver is down with disk failures affecting 10 LHCb pools. The on-call service is not being triggered by GGUS tickets; this will be fixed on Wednesday and backup mechanisms are being used until then.

  • ASGC: There will be a scheduled bdii upgrade tomorrow morning but production should not be affected.

  • NDGF: A power failure at the U. Geneva Sunday caused file transfer failures. Should be fixed tomorrow.

  • FNAL: Starting to test new SRM scalable/distributed load-balancing system.

  • CERN CASTOR: The gridftp checksum saga continues: checksum calculation had to be turned off for ATLAS on Friday afternoon. However a new CASTOR snapshot is now available and will be deployed as soon as possible.

AOB: 1) Ian Fisk queried the current slowness of the CERN username LDAP server (used by SLC5 login and the phonebook). Ricardo is investigating: the server is overloaded and in some cases the client name-service caching daemons got stuck or crashed. More information tomorrow. 2) Jamie reported that the repeated GGUS ticket failures issue has been escalated to EGI, who will be analysing these problems.

Tuesday:

Attendance: local(Hurng, Harry(chair), Lola, Edward, Patricia, Roberto, Jaimie, Jean-Philippe, Maarten, Jacek, Ricardo, Jan, Miguel, Flavia, Alessandro);remote(Stefano(INFN), Xavier(KIT), Michael(BNL), Gang(ASGC), Ronald(NL-T1), Jeremy(GridPP), Vladimir(LHCb), Tiju(RAL), Rob(OSG), Ian+Marie-Christine(CMS), Christian(NDGF), Catalin(FNAL)).

Experiments round table:

ATLAS:

Tier-0:

2 ATLAS runs yesterday for data taking (Sun. evening - Mon. noon, Mon. night - Tue. morning). Today is more in calibration mode.

CASTOR 2.1.9-8 was installed on the C2ATLAS disk servers to address the gridftp checksum issue. Checksum calculation has resumed; we will keep monitoring the subsequent data transfers (a checksum comparison of the kind involved is sketched below).
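For reference, the validation at stake here is an adler32 comparison between the source and destination copies of each file. A minimal sketch of such a comparison follows, assuming both replicas are locally readable and using placeholder paths; in production the remote checksum is provided by the storage system rather than recomputed like this.

    # Minimal sketch: adler32 comparison of two local file copies
    # (placeholder paths; adler32 is the checksum type used for gridftp/CASTOR
    # transfer validation).
    import zlib

    def adler32_of(path: str, chunk_size: int = 4 * 1024 * 1024) -> str:
        value = 1  # adler32 starting value
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                value = zlib.adler32(chunk, value)
        return f"{value & 0xffffffff:08x}"

    src = adler32_of("/data/source/file.raw")
    dst = adler32_of("/data/replica/file.raw")
    print("checksum match" if src == dst else "CHECKSUM MISMATCH")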

A certain file transfer from T0 to 2 German calibration sites (LRZ-LMU and MPPMU) encountered an FTS transfer error ("connection prematurely closed"). The same file transferred OK to other sites in other clouds, but as it failed at different sites in the German cloud we suspect it could be something to do with the data source (perhaps the transfer agent of the particular FTS channel?). The issue has been escalated to CERN with GGUS:60813.

AFS overload issue from yesterday: the acron jobs from ddmusr03 are now running fine. They were affected by the high load on the xldap server yesterday (GGUS:60765). No more error messages today.

Tier-1s:

One more disk server failure at RAL caused T0 export transfer failures last night. The site admins took the disk server offline this morning for investigation (GGUS:60780). In total there are 2 disk servers offline at RAL. The files on the disk server behind MCDISK are also causing MC production job failures (GGUS:60711). Some of the production jobs are really urgent to finish: is there any schedule for having the disk server (gdss417) back online, and can the site also provide a list of the files on the server? Tiju gave an update: they are still working on the server behind MCDISK but there is a possibility of some data loss. A list of files has already been sent to ATLAS; it was agreed it will be resent. The server behind DATADISK should be back soon.

FZK disk controller failure recovered. All disk servers online. (GGUS:60721)

We keep generating pre-staging tests for SARA to investigate the issue with data recall from tape (GGUS:60587).

Tier-2s:

NET2/Boston University had transfer failures around noon; solved around 2:00 pm by restarting BeStMan.

WEIZMANN finished its downtime and came back online yesterday, but some transfers still need a few attempts to finish; we suspect a problem in some of the backend disk servers (GGUS:60807).

SRM errors at IN2P3-CPPM last night - some service went down. Fixed this morning (GGUS:60781)

INFN-MILANO-ATLASC is now back online for data transfers. The PanDA production queue is under test for the moment.

UNIGE had a power cut and requires a few days to recover. Downtime until 9th of August.

CMS:

Tier-0:

Normal operation from Computing. CMS had to ramp the magnet down for filter cleaning. CMS will take advantage of the period without beam to do some testing. The test that impacts Computing is a Heavy Ion test with non-zero suppressed raw data. This is sent to FNAL, so if the test runs for long we will see activity on the CERN to FNAL link. In fact the intervention was short so this virgin raw data test has been postponed.

Tier-1s:

2 transfer issues: https://savannah.cern.ch/support/index.php?116025: T1_ES_PIC_Buffer and T1_IT_CNAF_Buffer -> T2_US_Purdue transfer errors during the last hour, and https://savannah.cern.ch/support/index.php?116015: transfer failures from T1_IT_CNAF to T2_US_Caltech. At least CNAF to Caltech seems to be related to the June 24 change at CNAF that appears to have stopped jumbo frames from working. Otherwise no new tickets to Tier-1s. Stefano reported for CNAF that the reconfiguration of the GPN to support jumbo frames was finished a few minutes ago and the testing is looking OK.

Note: Tomorrow Marie-Christine Sawley will be CRC and reporting at this meeting.

ALICE:

Status of the production:

Pass1 reconstruction activities at the T0: ongoing with no remarkable issues to report

MC production: 2 cycles in production

Raw data registration at T0 and transfers to T1 sites: ongoing with transfer rates achieving 200MB/s

T0 site: Over 4000 concurrent jobs using both backends, LCG-CE and CREAM-CE. The Remedy ticket opened yesterday concerning the access of aliprod to the voalicefs0* nodes: CLOSED.

T1 sites: CNAF: raw data transfers to CNAF failed yesterday. Apparently nothing had changed on the xrd3cp node at CERN, therefore experts were wondering whether these failures could be caused by a local problem at the site. The Alice contact person confirmed there was an old version of xrootd on the server and that the Quattor profile had been modified in order to avoid the use of this old version - situation looking much better.

T2 sites: no remarkable issues to report besides the usual operation activities

LHCb:

Experiment activities: Reconstruction of the large amount of data received. Many user jobs. Some MC production going on.

T0 site issues: nothing

T1 site issues:

GRIDKA: Users reporting problems accessing some data. Also some SAM tests are failing - Roberto has opened a GGUS ticket.

Managed to run software installation jobs at IN2P3 (SharedArea).

Sites / Services round table:

  • KIT: The LHCb and ATLAS pools that went down yesterday are now back online. The same vendor problems as last Friday happened again (GPFS?) but were understood this time and fixed. CREAM-CE 2 is still in downtime and under investigation; CREAM-CE 1 was in unscheduled downtime but is now under testing to resume for CMS, while CREAM-CE 3 is working normally.

  • NL-T1: The two dcache GGUS tickets from yesterday are still open so will be raised at a dcache phone call today. Meanwhile large log files are being inspected.

  • FNAL: Seeing some problems with the CERN lcg-bdii in the monitoring for CMS. Maarten will take a look.

AOB: Maarten queried if there had been an ATLAS test alarm ticket. In fact SARA and CERN were tested successfully so not reported. It was agreed to test BNL and any other T1 that requested to be tested.

Wednesday

Attendance: local(Hurng, Harry(chair), Marie-Christine, Eduardo, Gavin, Ricardo, Luca, Lola, Edward, Miguel, Maarten, Alessandro, Jamie, Nilo, Simone, Dirk);remote(Michael(BNL), Xavier(KIT), Stefano(INFN), Gang(ASGC), Vladimir(LHCb), Tiju(RAL), Rob(OSG), Gonzalo(PIC), Onno(NL-T1), Catalin(FNAL)).

Experiments round table:

ATLAS:

Tier-0:

One short ATLAS collision run early this morning (4:00 - 6:00 am). Calibration continues in the morning, aiming for stable physics runs by noon. From midnight yesterday we started seeing "connection refused" errors to lxfsrcxxxx.cern.ch in the T0 export. The errors lasted ~20 minutes and went away; this happened 3 times up to noon today. ATLAS will keep an eye on this. Miguel queried whether the errors are hard; apparently not, as retries are made and work. ATLAS will put in tickets if this recurs.

T0 export issue to the 2 German calibration sites: problem solved and the dataset replicated successfully to the destination sites.

Tier-1s:

RAL disk server issue: a last-minute update says there seems to be a chance to recover the 47k files on the disk server behind MCDISK, although in ATLAS we had considered those files lost after the discussions yesterday. It turns out the urgent (IBL) production tasks are not seriously affected, so we decided to wait. We are also waiting for the recovery of the disk server behind DATADISK (the last update says the rebuild has been done and the disk will be online soon). Tiju reported the DATADISK server is back in production and work on the MCDISK server continues, but there is no time estimate yet.

Transfer errors from FZK MCDISK to NDGF SCRATCHDISK (GGUS:60437). A long standing issue since 24 July. Any update from FZK or NDGF-T1? Report from KIT was no news yet.

SRM errors (SRM_ABORTED) in transfers from T0 to TW-LCG2_SCRATCHDISK. Ticket GGUS:60740 reopened. This was first seen on Sunday when a similar ticket was closed as being due to a transient error.

test alarm ticket to BNL last evening worked as expected. Thanks!

Tier-2s:

MC production job failure at IN2P3-LAPP (GGUS:60744)

UNIGE power cut: it seems the downtime was shorter than originally planned and has finished. Checking with the site (GGUS:60739).

transfer failure from CYFRONET-LCG2 to FZK (source error): gridftp connection timeout (GGUS:60841)

transfer failure from UTA_SWT2_PRODDISK to BNL (source error): gridftp user mapping error (GGUS:60837)

transfer failure to IL-TAU-HEP, site is also failing the SAM test (GGUS:60854)

Eduardo reported on the FZK - NDGF transfer problems where FZK had asked him to verify the CERN links. These seem correct so now they will run iperf tests between the two sites.

CMS:

Tier-0: Normal operation from Computing. Yesterday CMS had to ramp the magnet down for filter cleaning. Ramping up was very fast and none of the scheduled plans for zero Tesla happened.

Tier-1s:

One transfer issue with FNAL, ticket opened in Savannah (sr #116082): all transfers to FNAL were failing. Catalin reported that this issue has already been successfully closed.

Severe problems with the CMS SAM tests at KIT; a GGUS ticket was opened last week. Fixing and testing ongoing.

Some problems accessing cmsweb since the front-end upgrade, maybe due to not-well-understood security interactions between Firefox and the PhEDEx web pages.

Tier-2: Transfer problems between Beijing and IN2P3 - a ticket has been opened. Probably seen before from Beijing to a Tier-1.

ALICE:

General information: common reconstruction and MC production activities ongoing. In terms of raw data transfers, after the issues reported yesterday with CNAF, the site is receiving raw data again.

T0 site: Almost reaching 6000 concurrent jobs at this site. In addition, some operations were required this morning to change a disk in one of the ALICE CAF nodes.

T1 sites: All T1 sites in production

LHCb:

Experiment activities: Reconstruction/reprocessing is proceeding slowly due to a high failure rate caused by the application (high memory consumption, anomalous CPU time required; waiting for a new version). A new production will be launched today to merge the output, and a new reconstruction campaign will be run as soon as a fix is provided. Many user jobs. No MC production running.

T0 site issues: Slowness observed over the last few days on the lxplus SLC5 nodes. Related to: http://it-support-servicestatus.web.cern.ch/it-support-servicestatus/20100802Lxplusresponseproblems.htm

T1 site issues:

GRIDKA: Problem with users and SAM tests on their Storage has been fixed. GGUS:60821

RAL: SAM tests were failing last night on the LHCb_USER space token. It was a faulty disk server that was taken out of production (GGUS:60840, related to GGUS:60847). The list of files has been requested from RAL; the space token has been temporarily banned.

IN2P3: SAM tests again reporting failures and/or degradation of the shared area (GGUS:59880).

Sites / Services round table:

  • PIC: Finished the scheduled CE intervention this morning, done to reconfigure part of the networking of the worker nodes; production resumed an hour ago. The problems of US Tier-2s, e.g. Purdue, downloading from PIC have been resolved after contacting ESnet. The details are not clear, but something in ESnet was acting as an MTU black hole.

  • NL-T1: They now have enough information from the ATLAS staging tests, so the tests should be stopped. The dcache team are busy trying to pinpoint the problem. An option for ATLAS would be to downgrade the SRM dcache server from dcache 1.9.5.21 to 1.9.5.19, but this SRM is rather busy, so if ATLAS would like to request this they should let NL-T1 know. Hurng said this was not urgent and ATLAS would prefer to wait for the dcache team. The final report from NL-T1 was that a CREAM-CE hung last night and was restarted this morning.

  • FNAL: Had an overnight srm failure.

  • OSG: Yesterday successfully retested alarm procedures that were giving trouble last week.

AOB:

Thursday

Attendance: local(Hurng, Harry(chair), Ricardo, Luca, Patricia, Simone, Jamie, Roberto, Gerrardo, Alessandro, Ignacio, Miguel, Edward, Marie-Christine, Maarten);remote(Michael(BNL), Xavier(KIT), Ronald(NL-T1), Gang(ASGC), Jeremy(GridPP), Catalin(FNAL), Tiju(RAL), Stefano(CNAF), Kyle(OSG), Christian(NDGF)).

Experiments round table:

ATLAS:

Tier-0:

Stable run from last evening (17:30) until this morning. Calibration continues while the LHC is doing a ramp test with 2 bunches in the afternoon; stable running will resume as soon as the tests are finished.

The ATLAS central catalogue availability went down last night, causing problems in all ADC activities (including T0 export). It is understood to have been affected by the upgrade of the ATLAS data deletion agent yesterday; the deletion agent is stopped until the problem is fixed. The central catalogue should be working fine now, but the stopped deletion may cause disk-full problems at sites.

T0 processing:

There was an increasing number of "aborted" AOD merge jobs; this was because the AOD file size exceeded a hardcoded limit in the ATLAS code. The experts fixed it on the fly (a sketch of this kind of check is given below).
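As a small illustration of the failure mode only (hypothetical names and limit, not the actual ATLAS T0 code): a hardcoded output-size guard like the one below starts aborting merge jobs as soon as the merged AOD files grow past the built-in ceiling.

    # Illustration only: a hardcoded output-size guard of the kind described,
    # with hypothetical names and an example limit (not the real ATLAS code).
    import os

    MAX_AOD_SIZE_BYTES = 10 * 1024**3  # example hardcoded 10 GB ceiling

    def check_merged_output(path: str, max_size: int = MAX_AOD_SIZE_BYTES) -> None:
        size = os.path.getsize(path)
        if size > max_size:
            # This is the branch that aborts the merge once AODs outgrow the limit.
            raise RuntimeError(f"{path}: {size} bytes exceeds limit {max_size}")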

By noon the T0 expert reported that some files written to CASTOR this morning in the t0atlas pool were not available for T0 data processing; some servers were down in CASTOR. This also caused "locality unavailable" errors in the transfers to Tier-1s. GGUS:60892. Ignacio reported this was due to a network switch failure, in front of 6 disk servers, at 12:45 today. Understanding was confused by the date being reported by lemon as yesterday, apparently a known problem but not easy to fix. Miguel queried ATLAS's ability to recover from such a failure; Alessandro reported that this depends on the situation, and a meeting will be held with the T0 experts to look at all such use cases.

The network usage of the t0atlas pool went high this morning (several alarms received). The T0 export rate was reduced by a factor of 3 (via the FTS "red button") by noon; the usage should now decrease.

Miguel reported that PES have no evidence of the T0 "connection refused" errors seen last weekend and requested a ticket giving time stamps - Hurng will do so and the errors are no longer happening.

Tier-1: Mostly ATLAS rather than site issues.

IT cloud feedback: pilot jobs sent via a certain pilot factory were causing jobs to fail. This was because the ATLAS VO membership of the pilot factory runner had expired; it was extended and should be fixed by now.

A transfer from INFN-NAPOLI to TRIUMF failed with "filename too long". This is a known limitation in dCache. As the final destination (AUSTRIA T2) is using DPM, the transfer was re-routed to avoid going through the CA T1.

Testing the new staging service with PIC. At the same time, we observed that the SARA prestaging issue (GGUS:60587) could be related to the "SRM request lifetime" issued by the DDM prestaging service. Still debugging it with SARA. Simone gave an update: there are various timeouts in DDM and one of them was set in milliseconds rather than seconds, meaning the total lifetime of an SRM request was only a few seconds. This had been OK until now, but SARA must have hit this boundary recently. The timeout has been increased and is being tested at SARA (see the sketch below for an illustration of this kind of unit mismatch).
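A small illustration of the unit mismatch described, with hypothetical names and values (not the actual DDM configuration): if a lifetime intended in seconds is treated as milliseconds, a two-hour request lifetime collapses to a few seconds, which is enough for disk-resident files but not for a tape recall.

    # Illustration only: the effect of interpreting a configured lifetime as
    # milliseconds when the SRM request expects seconds (hypothetical values).
    CONFIGURED_LIFETIME = 7200  # intended: 7200 seconds (2 hours)

    def request_lifetime_seconds(value: int, value_is_milliseconds: bool) -> float:
        return value / 1000.0 if value_is_milliseconds else float(value)

    print(request_lifetime_seconds(CONFIGURED_LIFETIME, True))   # 7.2 s  -> tape recalls time out
    print(request_lifetime_seconds(CONFIGURED_LIFETIME, False))  # 7200.0 s -> intended behaviour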

Functional tests against the new CASTOR instance at RAL (RAL-LCG2_PPSDATADISK) are ongoing.

Tier-2/3: VICTORIA-LCG2 is being turned into an ATLAS T3 site and its T2 functions are being taken over by a new site, CA-VICTORIA-WESTGRID-T2 (an ongoing action in ATLAS).

CMS:

Tier-0: There was a problem with the CMS logbook, thought to be related to an upgrade last week. A ticket was raised with IT, but meanwhile the application crashed and was restarted; the ticket has been closed.

Tier-1s: dcache busy issue with FNAL, ticket opened in Savannah (sr #116108).

ALICE:

Status of the production: common reconstruction and MC production activities ongoing.

T0 site: Pass1 reconstruction activities ongoing with no remarkable issues

T1 sites: CNAF: A new ALICE VOBOX was put into production yesterday afternoon for glexec testing and development operations (similar setup at FZK).

File transfers to all Tier-1s had slowed down, which pointed to a problem with the central ALICE services; in fact an attempt was being made to retransfer all previously failed transfers and, of course, many of them needed to be restaged from tape.

LHCb:

Experiment activities: Reconstruction/reprocessing of the huge amount of data received (100 inverse nbarn, 1/3 of the ICHEP statistics, in one night); a large fraction of jobs are failing because of internal application issues.

T0 site issues: Many files reported to be unavailable. GGUS:60886. Ignacio reported this was due to a single disk server being down. These data are in a T1D1 class so are unavailable till they are restaged from tape.

T1 site issues:

CNAF: problems with both the SAM tests and real data transfers to StoRM. Problem fixed (certificates were in a wrong location). GGUS:60875

GRIDKA: Users reporting timeouts when accessing data. It looks like a load issue with the dcache pools. GGUS:60887

RAL: We received the list of files on the faulty disk server of the LHCb_USER space token (GGUS:60847): 22K user files. We would like to know when the admins think the server will be back. Tiju reported that this will be later today.

IN2P3: SAM tests again reporting failures and/or degradation of the shared area (GGUS:59880).

PIC: An internal DIRAC accounting migration is being done so accounting is currently not available.

Sites / Services round table:

  • BNL: The planned replacement of Cisco switches by a Force10 infrastructure is proceeding one switch at a time; no outages are expected.

  • KIT: A workaround has been put in place for their on-call emergency workflow during the transition of their helpdesk from ROC to NGI where they were not yet prepared for alarm tickets. This will be fixed properly next week.

  • FNAL: Had problems with a dcache component that had been running in a Java domain with a low memory limit of 2 GB; this has now been increased to 16 GB.

  • CNAF: Will have a scheduled FTS downtime on Monday to change configuration files on the Oracle backend.

  • NDGF: Planning a dcache upgrade tomorrow on some pools in Bergen. Might be some problems with ATLAS and ALICE data there.

  • OSG: Update on ticket 60858 - still working with GGUS to understand why M.Ernst is not receiving SMS alarm notifications.

  • CERN CE: On Monday they will reconfigure the CEs supporting small VOs at CERN; this should be transparent.

AOB:

Friday

Attendance: local(Hurng, Harry(chair), Ignacio, Marie-Christine, Lola, Edward, Alessandro, Roberto, Nilo, Luca, Ricardo );remote(Barbara(INFN), Gonzalo(PIC), Onno(NL-T1), Michael(BNL), Xavier(KIT), Gang(ASGC), Catalin(FNAL), Tiju(RAL), Rob(OSG), Christian(NDGF)).

Experiments round table:

ATLAS:

Tier-0:

Stable run since the beam injection at 4:00 am this morning.

T0 export rate restored in the afternoon yesterday so back to the default rate and working fine.

GGUS ticket concerning the transient "gridftp connection refused" issue created as requested - GGUS:60899

Tier-1:

MC jobs at FZK failed with dCache error: "Failed to create a control line ... Failed open file in the dCache" ... GGUS:60914

MC jobs at TW-LCG2 failed with "no accessible BDII". According to the update on GGUS:60912, there was a high load on the shared file system.

Update from RAL concerning the disk server behind the MCDISK token: it was said to be finished earlier, then withdrawn? Tiju gave an update: the machine was put online but then had to be taken offline again. The data, however, is safe and is currently being copied to another machine.

Transfer errors in the test of the new Castor instance at RAL. Problem with disk array under the database - quickly fixed by RAL.

TW-LCG2 is now validated as a fully functional Tier-1. Quote from the ATLAS computing coordinator: "ASGC should now be used as a fully functional Tier-1 again and we should not make duplicate copies of their RAW data on other Tier-1's any more. We had made re-processing the critical test and because they passed we should consider them fully validated again."

CMS:

Tier-0: Normal operation. Smooth operations as of today with 25x25 bunches. Tail of skims ongoing.

Tier-1s: [OPEN] Savannah #116138: T1_UK_RAL_Buffer to T1_IT_CNAF_Buffer: no transfers but queued data (https://savannah.cern.ch/support/index.php?116138); no transfers for 12 hours. Barbara will follow up.

Tier-2s:

[CLOSED] Savannah #116132: Some PhEDEx components down at T2_IT_Pisa: power cut in the late morning, should have recovered.

[OPEN] GGUS information ticket 60855 about the transfers between the Beijing T2 and CCIN2P3: what is the status?

Notes: Will need to run around 10K MC events with pile-up today or over the weekend.

ALICE:

Production: reconstruction and MC activities ongoing.

T0 site: Pass1 reconstruction activities ongoing

LHCb:

Experiment activities: Reconstruction and merging. No MC. Due to the intervention on the DIRAC accounting system at PIC (migrating to CERN), the SSB dashboard is empty.

T0 site issues: Many files reported to be unavailable. GGUS:60886. Disk server problem solved. Files finally migrated to tape

T1 site issues:

CNAF: an issue yesterday with Streams replication (both LFC and condDB) timing out to CNAF; one of the two nodes serving the CNAF DB was unreachable. Replication was restored later in the evening. Barbara reported that this machine is having trouble with its public network interface getting stuck (3rd time now); it has to be configured down and then back up. If the problem persists the service will be run on only one machine while they investigate.

GRIDKA: Users reporting timeouts when accessing data. It looks like a load issue with the dcache pools (GGUS:60887). This also seems to affect reconstruction activity at the site; any news? Xavier reported that they see degradation only with a large queue to write to tape, which is steadily decreasing. Roberto will add more recent details to the ticket.

RAL: The faulty disk server has been recovered; the LHCb_USER space token at RAL has been re-enabled in DIRAC.

Sites / Services round table:

  • PIC: Would like CMS to respond to Savannah ticket 115898, which requests the green light to delete some non-custodial tape data. Marie-Christine will pass this on.

  • NL-T1: The ATLAS srm requests are no longer timing out since the timeout change in ATLAS DDM but they are investigating why this degradation happened.

  • KIT: Earlier this morning the ATLAS dcap doors died (out of memory) and were restarted, and the ATLAS disk pools suffered I/O errors around midday; these will explain some of the problems ATLAS reported today.

  • OSG: Have re-tested the routing of GGUS alarm tickets to SMS messages: out of 5 alarms Michael only got one, while Kyle and Rob himself got all of them. Ticket 60858 has been closed by GGUS as not their issue. Harry requested Rob to send additional technical information to wlcg-scod@cern.ch.

  • CERN CASTOR: 100TB of space has been added to ATLASSCRATCHDISK as requested.

  • CERN Database: Also following the LHCb streams replication failures to CNAF.

  • NDGF: The upgrade of the Bergen dcache pools today was successful. There will be a short dcache outage on Monday for a kernel upgrade and reboot of the dcache nodes. Scheduled for 2-3 pm but could be shorter.

AOB:

-- JamieShiers - 29-Jul-2010
