

-- HarryRenshall - 04 Apr 2008

Week of 080407

Open Actions from last week:

Daily CCRC'08 Call details

To join the call, at 15.00 CET Tuesday to Friday inclusive (usually in CERN building 28-R-006), do one of the following:

Dial +41227676000 (Main) and enter access code 0119168, or
To have the system call you, click here

Monday:

See the weekly joint operations meeting minutes

Additional Material: There will be LCG OPN tests Wednesday 9 April from 15.00 to 19.00 CET. The plan of the test is here: https://twiki.cern.ch/twiki/bin/view/LHCOPN/BackupTest

RAL will be unreachable for 15-20 minutes between 16:45 and 17:15.
PIC will be unreachable for 15-20 minutes between 17:15 and 17:45.

The goal of the maintenance is to verify that all the backup solutions work as expected. The T1s with a backup link should be up all the time, but at the moment we cannot guarantee that it will be the case and there may be outages at any time for any Tier1.

Tuesday:

elog review: Nothing new

Experiments round table:

ATLAS (from K.Bos): we started our Functional Test (FT) around 3:30 pm today. We had decided to run at a rate as if we were taking data with the detector at 200 Hz for 10 hours per day. In that case we run at roughly 40% of the maximum required throughput, because we have 24 hours to export the data we produce in 10. This will also be the mode we run in during May, unless we are taking cosmic ray data, in which case the rate is determined by the TDAQ. We intend to stop the production on Friday morning and remove the left-over FT subscriptions on Sunday night, so we leave the system 2.5 days to transfer all the data. On Monday morning we will analyze the results.
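The "roughly 40%" figure follows from spreading 10 hours of data taking over a 24 hour export window. A minimal sketch of that arithmetic is given below; the function name and its default argument are illustrative only and are not part of any ATLAS tool.

```python
def required_export_fraction(data_taking_hours: float,
                             export_window_hours: float = 24.0) -> float:
    """Fraction of the peak export throughput needed when the data produced
    in data_taking_hours may be exported over export_window_hours."""
    if not 0 < data_taking_hours <= export_window_hours:
        raise ValueError("data taking time must be positive and fit in the export window")
    return data_taking_hours / export_window_hours


# 10 hours of data taking exported over a full day
fraction = required_export_fraction(10)
print(f"{fraction:.1%}")  # 41.7%
```

The exact value, 10/24, is a little under 42%, consistent with the rounded figure quoted in the report.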

CMS: the recent SAM SRM test failures, caused by information disappearing from the BDII, have been fixed.

LHCb: Preparing for infrastructure testing to start on 18 April. The slow throughput recently seen on the CERN to CNAF traffic has been corrected (the cause is not known). The LHCb LFC is now replicated to NL-T1 and PIC is scheduled to start next week.

Sites round table: RAL would like information on Tier 1 resource requirements for the May CCRC from CMS and ALICE as soon as possible. NL-T1 confirmed they expect new (2007 pledges) hardware resources to start being published during the last 2 weeks of May.

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

Wednesday

elog review: Nothing new.

Experiments round table:

ATLAS (SC): This is a quick summary of the present status for ATLAS activities:

1) Calo data distribution has been very successful. In less than 24h the whole amount of data was distributed according to the ATLAS computing model. More than 90% of the data was distributed in less than 6 hours. The overall data distribution counts 412 dataset subscriptions, of which only 8 have not yet completed (all at NDGF, due to a problem with space token configuration).

2) Functional tests ran until ~8PM yesterday evening. At that time an AFS problem (a hardware failure on a file server) prevented the machinery from continuing to run. Manual intervention was needed. The machinery was restarted today at 9AM and the activity is currently ongoing.

3) A problem with space tokens at NDGF was fixed but they now have a problem with their OPN network switch.

In addition, as a reminder, a big intervention is needed on the LFC at CERN to migrate SRMv1 entries to SRMv2 entries. The data volume is > 8M entries. This could be done dynamically, which would take more than 20 hours; the service would not be stopped but some degradation would be observed. Another possibility is a short 1 hour update, but this would need a service stoppage. This is being tested now and, if successful, could be scheduled for next Monday.

LHCb (RS): for the May CCRC they are requesting a new space token (LHCBUSER) of type 'custodial online' to be deployed at CERN and the T1s. They would require 3.3 TB of such space at CERN, about 3 TB at NL-T1 and less at the other T1s. When asked, JT said that NL-T1 should be able to provide this space provided it is within the MoU envelope of LHCb.

Sites round table:

NL-T1 (JT) reported that ATLAS jobs disappeared rapidly at about 21.00 last night. K.Bos will follow up on what happened.

GRIF (MJ) plan to install the new version of DPM, which will correct ACL support for ATLAS. S.Jezequel is going to provide a script to correct the ACLs of files already in the GRIF ATLAS DPM.

Core services (CERN) report: Experiments have been invited to join a production scale test of the SL4 version of the WMS.

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB: The OPN backup link tests announced on Monday should now be starting.

Thursday

elog review: Nothing new

Experiments round table:

ATLAS (SC): Here are the two main points for today:

1) Functional tests are ongoing as scheduled. Overnight an increase in the tape migration queue in CASTOR was observed. This was due to too many small files being produced at the T0. This was changed this morning (so datasets now contain fewer files, by a factor of 10, but of larger size). The test will stop tomorrow.

Observations:

- RAL is in scheduled downtime. No RAW data has been assigned to RAL since yesterday (but AOD and ESD are, so this will also be a test of recovery after downtime).

- SARA is having dCache problems. Below is the explanation from Ron Trompert:

* Yesterday we upgraded dCache to 1.8.0-14 and java 1.6. As a result we now see a number of problems: hanging processes and high loads on our dCache head node. We have already moved the dCache head node back to java 1.5, which seems to have reduced the load.

We are now in unscheduled maintenance to see if this is enough to solve the matter. If not, we will move back to java 1.5 altogether. *

- CNAF has problems at the tape endpoint. Since the end of the intervention on Tuesday, they have been showing efficiency between 30% and 60% for tape. CNAF site managers believe this is "left over" from the intervention and are looking into it.

2) The intervention on the LFC for the srm.cern.ch->srm-atlas.cern.ch migration has been postponed to next week. The procedure used last week for 1.5 M entries was too slow for the 8.5 M entries to be migrated (it did not finish in 24h). Therefore a new (faster) procedure is being put in place, but this will imply some downtime (estimated 2h). We are trying to pack everything into the hardware migration intervention for the ATLAS and WLCG databases next week. Maria Girone is collecting the relevant information and will present a proposal.

Sites round table:

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

Friday

elog review: At around mid-day castoratlas stopped responding. This was traced to a locked thread, which was cleared after about an hour.

Experiments round table:

ATLAS (K.Bos): this morning we stopped the data generator for the functional test. We will leave the system in peace and give it time to transfer all the files until Sunday night. At the bottom of this message you see the situation as of this morning. It shows the Site(submitted/succeeded) for data sets. You can see that the situation is already quite good. Almost all RAW data is on tape already. SARA was down yesterday for some time but they are catching up right now. BNL didn't transfer last night so there is some backlog and NDGF is slow because they have a limited number of tape drives for this test. For the data that is supposed to go to disk the situation is very similar which is understandable because these transfers share the same FTS channel.

On Tuesday we did a similar transfer for the Calorimeter data from last week and we saw very similar results: within 24 hours 99% of all the files made it. There are just a few that did not transfer within the first day and the interesting thing is that they are still pending. We will have to look into those in detail.

For these transfers we used the newly written "plugin" that uses a configuration file describing the transfer policy. This allows us to react swiftly to sites having problems. For example, RAL went into a scheduled downtime yesterday and we took them out in the configuration file. That is why you see they have no backlog for the RAW: the plugin automatically distributed those data sets to other sites. They do get a backlog on AODs, because these are supposed to go to all T1s; they were subscribed and will go there when RAL comes back on-line.
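The behaviour described above can be illustrated with a small sketch. Everything in it is hypothetical: the policy keys, the site lists and the plain round-robin assignment are assumptions made for illustration and are not the format or logic of the actual ATLAS plugin (which, for instance, may weight sites by their shares).

```python
from itertools import cycle

# Hypothetical policy, standing in for the plugin's configuration file.
# "share": each data set goes to exactly one site.
# "all":   each data set goes to every listed site.
POLICY = {
    "RAW": {"mode": "share", "sites": ["BNL", "SARA", "RAL", "CNAF"]},
    "AOD": {"mode": "all", "sites": ["BNL", "SARA", "RAL", "CNAF"]},
}


def plan_subscriptions(datasets, policy, excluded=()):
    """Return a list of (dataset_name, site) subscriptions.

    Sites in `excluded` are skipped for "share" data, so their portion is
    spread over the remaining sites. "all" data is still subscribed to
    excluded sites and simply waits there as a backlog.
    """
    plan = []
    rotations = {}
    for name, dtype in datasets:
        rule = policy[dtype]
        if rule["mode"] == "all":
            plan.extend((name, site) for site in rule["sites"])
            continue
        if dtype not in rotations:
            active = [s for s in rule["sites"] if s not in excluded]
            if not active:
                raise ValueError(f"no active site left for {dtype}")
            rotations[dtype] = cycle(active)
        plan.append((name, next(rotations[dtype])))
    return plan


datasets = [("raw_001", "RAW"), ("raw_002", "RAW"), ("raw_003", "RAW"),
            ("raw_004", "RAW"), ("aod_001", "AOD")]
plan = plan_subscriptions(datasets, POLICY, excluded={"RAL"})
for entry in plan:
    print(entry)
```

With RAL excluded, no RAW data set is assigned to RAL, while the AOD data set is still subscribed there and would transfer once the site returns.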

This week was Muon week and we may still expect some data to be transferred this weekend. However we have not been told of any data coming soon. Next week is also Muon week and again we may transfer data when presented to us. All those data using cosmic rays will be distributed like the Calorimeter data and the test data this week. It will be like this almost all weeks from now on until the beam comes. Next week we will remove the test data from the disks again but the detector data has to stay until we have permission from the detector people to delete.

Next week we have time to test more T1 - T1 channels. CNAF indicated that they are ready to test now but maybe PIC and RAL would like to try again. Please let us know if you could do it. The procedures are well described in Stephane's wiki pages but it has to be done from the T1 site. We have 2 more weeks to also do ASGC, BNL, TRIUMF and FZK and maybe NDGF also wants to retry. So far the only completed tests were SARA and Lyon.

Next week will be also a week of many interventions: the ATLAS Oracle RAC switches to new hardware and will be down on Tuesday afternoon. This affects the T1-T1 tests. On Thursday a similar change is foreseen for the WLCG RAC. This will affect the CERN FTS and LFC. We plan to have an ATLAS LFC intervention at that time also. On Thursday we also shut down the last classic SE at CERN and by the end of the week we should have no SRMv1 left at CERN. You will be informed on all these interventions separately.

LHCb (RS): Main activity next week will be movement of data in preparation for CCRC'08 phase 2. There is currently a problem with the RAL LFC.

Sites round table:

RAL: have upgraded all their LHC CASTOR instances to version 1.6.

NL-T1 (JT): SARA is having problems completing their dCache upgrade and this has caused batch job failures at both SARA and NIKHEF.

Core services (CERN) report:

DB services (CERN) report: Last week the test migration to new quad-core HW was successfully performed on all the LHC offline Oracle databases concerned. The final migrations of the LHC RAC production databases to the new HW will go ahead as planned next week as follows:

CMSR & LHCBR: Tuesday 15.04
ATLR: Wednesday 16.04
LCGR: Thursday 17.04

(These major migrations, which include a move from 32 to 64 bit, will be performed with a downtime of only two hours, thanks to the use of Oracle Data Guard.)

The downstream databases for ATLAS and LHCb streams setup will be migrated next week, Tuesday 15th April and Wednesday 16th April, to the new hardware.

The LFC deployment team has proposed a clean-up of the ATLAS LFC local catalog deployed on the LCG RAC, which will follow after the hardware upgrade. The ATLAS local LFC will not be available for about two hours.

A blocking issue with the monitoring of the storage arrays on RAC5 and RAC6 has been solved with the help of IT-FIO-TSI. Many thanks.

New Infortrend storage has shown a few more problems with the controllers. So far we have found 5 controllers out of 60 new arrays (3 last week and 2 this week) with issues that required vendor intervention.

CNAF finished the computer center move last Monday, 7th April, after 2 weeks of downtime. LFC and LHCb replication was synchronized in less than 1 hour. ATLAS replication is still pending because CNAF is now migrating the production servers for ATLAS. CNAF ATLAS database is scheduled to be ready next Monday, 14th April.

BNL successfully completed an intervention to upgrade the production servers for ATLAS from 32 to 64 bit. Even though the intervention was extended by 1 day due to some complications, there was no impact on the ATLAS replication environment.

Monitoring / dashboard report:

Release update:

AOB: There will be a meeting at CERN on Tuesday looking at having a pilot SL4 WMS service in parallel to the production service.

Topic revision: r6 - 2008-04-11 - HarryRenshall
