Week of 061106

Open Actions from last week: Costin and Miguel to try to understand the cyclic nature of the ALICE FTS transfer rates.

Chair: Jamie

Gmod:

Smod: Ignacio

Monday:

Log:

The sudden drop in the ALICE transfers to/from CERN yesterday evening at around 8pm was due to a problem with the ALICE CASTOR2 instance: a new, unrecoverable error from the LSF API caused all requests to fail. The only possible remedy was a restart of the LSF master, which I did at ~09:30 this morning. We are looking into how to protect against this problem in the future. Olof

CMS report: Over the course of the last few days we focused on improving Tier-1 to Tier-2 transfers and achieved some remarkable and very satisfactory results. We have seen rates from some Tier-1s (e.g. FZK to DESY) of up to 200 MB/s sustained over multiple hours, and were able to replicate datasets simultaneously from a single Tier-1 (PIC) to 18 different Tier-2s almost error-free.

These transfer tests, carried out systematically (datasets were replicated from each Tier-1 to all Tier-2s participating in CSA06), have shown that the CMS data model is viable with respect to the strategy of allowing any CMS Tier-2 to request data from any CMS Tier-1.

Regarding job execution, we succeeded in getting more sites involved in the CSA06 analysis processing activities, and the total job volume is now approaching 30k jobs/day. The efficiency (grid job submission + application) is above the goal of 90%. Michael

GSSDATLAS: many source SRM transfer failures - to be followed up.

New Actions:

Discussion:

Tuesday:

Log: The same CASTOR/LSF problem occurred in the evening, killing ALICE transfers again; being followed up by IT-FIO. FTS/DB problems to be followed up (GD/PSS). GSSDATLAS transfers out of CERN failing - follow up with Miguel Branco/Zhongliang Ren.

New Actions: See above.

Discussion:

Wednesday:

Log:

Massive failures of GSSDATLAS transfers mostly resolved as user error (source file does not exist). GSSDATLAS has been informed.

Non-scheduled intervention noted for GridView (it is marked on the page). This should be scheduled according to the agreed WLCG procedure.

ALICE: Good data rates overnight to all sites except CNAF. Big increase in data rate at 01:00 UTC; not yet understood why.

CMS: Transfers to CNAF basically down. Will be followed up with Luca.

GSSDATLAS: follow up on massive failures (still source SRM errors) to BNL and GridKA.

Hi all, on November 7th we have planned a downtime for power maintenance at the INFN-T1 site.

It will start at 09:00 (CET) and finish at 15:00 (CET).

For some services (FTS, LFC for GSSDATLAS and ALICE, and the storage resources) the downtime will be shorter (a few minutes).

Regards, Andrea on behalf of INFN-T1 staff

Hello.

CNAF has deployed a more reliable LFC server (redundant, DNS load balanced). The new server name is lfc.cr.cnaf.infn.it. This should be changed in the ToA as soon as possible to benefit from the new service. At least my cache still contains the old name. CNAF people are in cc for questions or comments.

Simone
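
As context for the note above, here is a minimal sketch of how a client could check the new, DNS load-balanced alias and point the LFC client tools at it. The multi-address resolution behaviour and the use of the LFC_HOST environment variable are assumptions about the setup, not part of Simone's message; check the ToA / CNAF documentation before relying on this.

```python
# Minimal sketch (assumptions noted above): verify the load-balanced alias
# and point LFC client tools at it via LFC_HOST.
import os
import socket

NEW_LFC = "lfc.cr.cnaf.infn.it"

# A DNS load-balanced alias typically resolves to more than one address.
name, aliases, addresses = socket.gethostbyname_ex(NEW_LFC)
print("canonical name:", name)
print("resolved addresses:", addresses)

# Point LFC client tools (run from this process or its children) at the alias.
os.environ["LFC_HOST"] = NEW_LFC
```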

New Actions:

Discussion:

Thursday:

Log:

The big increase in ALICE data rates seen yesterday correlates with a big increase at one site for this VO (IN2P3).

FTS service: the DB lock-up problem is still being followed up. We are trying solution 2 for now; see the PSS wiki page for details.

New Actions:

Discussion:

Friday:

Log: High load problems on LCG RAC:

The problem on the Oracle LCG RAC is solved; the cause is not yet understood, but it had its root on lcgr3, used mainly by lcg_fts_w and lcg_lfc, and also by lcg_same, lcg_voms and lcg_shiva. Several logs were taken and we will open a service request with Oracle to understand the problem better.

If you believe your application changed this morning (e.g. a patch deployment or manual access with heavy queries), please report it to us so it will be easier to identify the problem with Oracle.

At ~13:17 the c2alice scheduler started to report problems again.

Our freshly configured metric was sampled at 13:31, and found 521 error messages. One minute later the actuator restarted LSF, and the scheduler problems stopped. So the full recovery chain works.
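
For context, a minimal sketch of the kind of metric/actuator recovery chain described above: a metric samples the scheduler log once a minute and an actuator restarts LSF when the error count exceeds a threshold. The log path, error pattern, threshold and restart command are illustrative assumptions, not the actual c2alice configuration.

```python
# Minimal sketch of a metric/actuator recovery chain (all names and values are
# assumptions, not the actual c2alice configuration).
import re
import subprocess
import time

LOG_FILE = "/var/log/castor/scheduler.log"       # assumed log location
ERROR_PATTERN = re.compile(r"LSF.*error", re.I)  # assumed error signature
THRESHOLD = 500                                  # restart when new errors exceed this
SAMPLE_INTERVAL = 60                             # sample once per minute

def sample_metric(path, offset):
    """The 'metric': count matching error messages written since the last sample."""
    errors = 0
    try:
        with open(path) as f:
            f.seek(offset)
            for line in f:
                if ERROR_PATTERN.search(line):
                    errors += 1
            offset = f.tell()
    except IOError:
        pass
    return errors, offset

def actuator():
    """The 'actuator': restart the LSF master (command is an assumption)."""
    subprocess.call(["service", "lsf", "restart"])

if __name__ == "__main__":
    offset = 0
    while True:
        errors, offset = sample_metric(LOG_FILE, offset)
        if errors > THRESHOLD:
            actuator()
        time.sleep(SAMPLE_INTERVAL)
```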

Good transfer rates for ALICE to IN2P3 (peaks over 100 MB/s reported by GridView) and SARA (closer to 150 MB/s). Lost FZK from 15:00 UTC until 01:00 this morning. CNAF still unavailable. Transfers effectively stopped at 15:00 UTC, due to an ALICE machine that crashed and had to be power cycled. The drop in the overall rate when FZK was added back manually at 01:00 is not understood - it also affected DTEAM. To be followed up.

100% failure rates for CMS to ASGC, FZK and CNAF. 100% GSSDATLAS failure rates (still source SRM errors) to BNL, 73% to FZK.

CMS CSA06 is ramping down. The 50k jobs/day target was met. Concerns about job output retrieval - 10 hours for 200 jobs. To be followed up.

New Actions:

Discussion:
