LCG Web>WLCGCommonComputingReadinessChallenges>WLCGDailyMeetingsWeek080616 (2008-06-24, JamieShiers)


Week of 080616

Open Actions from last week:

Daily WLCG Operations Call details

To join the call, at 15.00 CET Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

Dial +41227676000 (Main) and enter access code 0119168, or
To have the system call you, click here

General Information

Web archive: https://mmm.cern.ch/public/archive-list/w/wlcg-operations/
CERN IT status board: http://it-support-servicestatus.web.cern.ch/it-support-servicestatus/
elogs: https://prod-grid-logger.cern.ch/elog/CCRC'08+Logbook/

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Roberto, Maria, Jamie, Harry, Gavin, Jan, Andrea, Alessandro);remote(Derek, Gonzalo, Michel, IN2P3).

elog review:

Experiments round table:

LHCb (Roberto) - restart on Thursday CCRC-like activities with intermediate version of DIRAC. Several new features, e.g. multiple work-flows, plus new job finalisation module. New logging & failover. SRM DB upgrade - tomorrow ok! RAL & IN2P3. IN2P3 to support dcap instead of gsidcap.

CMS (Andrea) - nothing to report

ATLAS (Ale) - functional tests also during w/e. CCRC'08 II with suffix of week number used to identify files. Start deletion after 8 days. (1st time this naming is used). FDR data - delivered reasonably well during w/e. Low efficiency with castor @ CERN. elog. Waiting for reply. During scheduled down time 08-10 UTC 'blind' - gridmap empty.
- Naming convention for the ATLAS Functional Test datasets
- ccrc08_run2.0XXYYY.physics_blabla
- where XX is the week number, this week (started Monday at 0:00 am) is 25, and YYY is sequential.
- This is just to give the information that we are going, starting tomorrow, to delete the ccrc08_run2.024yyy.blabla data. For the sites this operation is transparent.
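The naming convention above can be sketched in code. This is a minimal illustration, not an ATLAS tool: the regular expression, the function names and the sample dataset names are assumptions made for the example, and the use of ISO week numbering is also an assumption (it happens to give 25 for the week starting Monday 16 June 2008, which matches the text).

```python
import re
from datetime import date

# Assumed layout: "ccrc08_run2.0XXYYY.<anything>", XX = week number, YYY = sequence.
_NAME = re.compile(r"^ccrc08_run2\.0(?P<week>\d{2})(?P<seq>\d{3})\..+$")


def parse_dataset(name):
    """Return (week, sequence) for a functional-test dataset name, or None."""
    match = _NAME.match(name)
    if match is None:
        return None
    return int(match.group("week")), int(match.group("seq"))


def datasets_to_delete(names, current_week):
    """Select datasets from weeks before current_week (year wrap-around is ignored)."""
    selected = []
    for name in names:
        parsed = parse_dataset(name)
        if parsed is not None and parsed[0] < current_week:
            selected.append(name)
    return selected


# Hypothetical dataset names, for illustration only.
names = [
    "ccrc08_run2.024001.physics_blabla",
    "ccrc08_run2.024017.physics_blabla",
    "ccrc08_run2.025003.physics_blabla",
]
this_week = date(2008, 6, 16).isocalendar()[1]  # ISO week number (assumption)
old = datasets_to_delete(names, this_week)
```

With these inputs, `old` holds the two week-24 names, which corresponds to the deletion of the ccrc08_run2.024yyy data announced above.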

Sites round table:

Core services (CERN) report:

VOMS - Steve called Gav to report a problem. Daemons were restarted - this had in the meantime also been done by Remi. It seems the s/w did not reconnect to the DB correctly. Known bug (for a long time) - some procedures are missing for FIO to intervene in case of problems recovering.

# threads for LHCb LFC increased to 60 taking advantage of DB intervention below.

The shared SRM for CMS and ALICE will move to separate RACs, possibly on Thursday (TBC).

DB services (CERN) report:

Upgrade to 10.2.0.4 for WLCG cluster & PDB cluster (ALICE + HARP + ...) Interventions < 3.5 hours. Some problems with GridView - some connections fail. Experts investigating... RAC6 - Ethernet switches on critical power (CMSR & LHCBR of a few minutes) - as scheduled.

Service managers should also have announced the interventions on associated services, e.g. via EGEE broadcast.

Monitoring / dashboard report:

Release update:

AOB:

N.B. we are transitioning from CCRC'08 to WLCG operations. The mailing list (see above) and meeting template have changed. Revamped wiki will come soon!

Tuesday:

Attendance: local(Julia, Gavin, Harry, Jamie, Maria, Roberto);remote(Derek, Gonzalo, Michel).

elog review:

Experiments round table:

LHCb (Roberto) - quiet week, s/w week, confirm restart on Thursday.

CMS (Daniele by email) - ggus ticket and elog entry regarding transfer problems to FZK

Sites round table:

Core services (CERN) report:

LHCb SRM DB intervention went ok today, ALICE tomorrow at 10:30, CMS to be scheduled. Thursday?

LFC - VOBOXes of LHCb - to be recognized as host certificate by the LFC, a VOBOX DN must finish with CN=myhostname or CN=host/myhostname. (So, the host certificate - just renewed - for VOBOX for LHCb at NIKHEF is wrong.)
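The DN rule above can be expressed as a simple suffix check. This is only a sketch of the rule as stated in these minutes, not the LFC's actual implementation; it assumes slash-separated DNs, and the function name and example DNs are hypothetical.

```python
def is_host_dn(dn, hostname):
    """True if the DN ends with CN=<hostname> or CN=host/<hostname>.

    Assumes an OpenSSL-style, slash-separated DN string.
    """
    suffixes = ("/CN=" + hostname, "/CN=host/" + hostname)
    return dn.strip().endswith(suffixes)


# Hypothetical DNs, for illustration only.
host = "vobox.example.org"
ok_plain = is_host_dn("/O=example/OU=hosts/CN=vobox.example.org", host)
ok_prefixed = is_host_dn("/O=example/OU=hosts/CN=host/vobox.example.org", host)
bad_trailing = is_host_dn("/O=example/CN=vobox.example.org/CN=extra", host)
bad_user = is_host_dn("/O=example/OU=users/CN=Some User", host)
```

The first two DNs pass the check; the last two do not, because the final component is not one of the two accepted host forms.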

VOMS follow-up: Yes, FIO don't have a procedure, since it is basically impossible to measure when the service is not available: there is no monitoring agent that is a member of the VO, and it seems to be hard to work out externally, without being in the VO, whether it is working. (Steve)

DB services (CERN) report:

ATLAS offline RAC tomorrow, CMS online & offline RACs next Tuesday. GridView problem of yesterday - some instability of service wrt Oracle cluster-ware. DBAs recreated service definition - situation looks better - users should not see any instability? Opened service request with Oracle. Incredibly long transactions? Miguel -> developers -> schema review.

Monitoring / dashboard report:

Release update:

AOB:

Wednesday:

Attendance: local(Alessandro,Andrea,Sophie,Harry,James,Jan,Gavin);remote(Derek,Gonzalo,Michel,Jeremy).

elog review: NDGF failures to all T1 - in fact a scheduled migration to LFC. See ATLAS report

Experiments round table:

ATLAS (AdG): NDGF is migrating the ATLAS catalog from RLS to LFC. ATLAS site services have been upgraded to match and they expect to restart at about 16.00 CEST.

CMS (AS): CRUZET-2 ended last Sunday. Over 30M cosmic evts taken. First live application of T0 repacker (tested and deployed in May CCRC - see CMS report as from the post-mortem ws). 100% job success at the T0 for this (!). Some issues found in the RECO step, resulting in very large RECOed files (related to unsuppressed zeroes in ECAL?). About 50 runs for ~32M evts (RAW) in global DBS, approx same nb evts as RECO in both scenarios (processing via splitting by files, or by evts). First analysis datasets created for CRUZET-2 (about 30). Collecting feedback from analysts.

All iCSA08 production is finished/closed/migrated, for all workflows. Proceeding with deletion of unnecessary Spring07, CSA07, CSA08 datasets (over 2500 samples!). Heavy Ion production runs at T1s. All CMSSW_1_6_12 FastSim (70M evts produced last week) sent to trash, due to wrong config file (too bad) - not a computing fault. Transfer of CRUZET-2 data to T1 sites continues; CNAF is the custodial one. Problems in migrating to tape over the weekend are now fixed, and the backlog is being digested nicely.

Some misbehaviour on SRM (srm-cms) at CERN was reported well before the scheduled downtime window (10h30-11h30). Gavin/Jan informed and in the loop. All errors apparently disappeared after the intervention ended; Jan can provide details if needed. CMS is aware and informed by FacOps. All ok now (thanks).

Sites round table:

Core services (CERN) report:

Intervention on tape libraries tomorrow 08.00 to 14.00 CEST so there will be delayed tape mounting but no data loss.

In order to isolate the Castor SRM service from potential problems on the Castor stager backend (in the case where the stager backend has too many requests and becomes slow to respond), we would like to make a configuration change to all SRM services tomorrow at 10.30 (just after the dteam/ops SRM intervention). The change will be to lower the stager timeout – this should prevent exhaustion of the SRM thread pool that we observed during CCRC.
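The reasoning behind the timeout change can be illustrated with a rough estimate based on Little's law: the number of SRM threads held at any moment is roughly the request arrival rate times the time each request holds a thread, and that holding time is capped by the stager timeout. The sketch below is only an illustration of this argument; the function names and all numbers (request rate, backend latency, timeouts, pool size) are hypothetical and not taken from the CERN configuration.

```python
def busy_threads(arrival_rate_per_s, backend_latency_s, timeout_s):
    """Little's law estimate of threads held while waiting on the stager.

    Each request holds a thread until the backend answers or the timeout fires.
    """
    hold_time_s = min(backend_latency_s, timeout_s)
    return arrival_rate_per_s * hold_time_s


def pool_exhausted(pool_size, arrival_rate_per_s, backend_latency_s, timeout_s):
    """True if the estimated number of busy threads reaches the pool size."""
    return busy_threads(arrival_rate_per_s, backend_latency_s, timeout_s) >= pool_size


# Hypothetical figures: 2 requests/s, stager answering in 120 s, pool of 50 threads.
long_timeout = busy_threads(2.0, 120.0, 300.0)   # timeout never fires
short_timeout = busy_threads(2.0, 120.0, 20.0)   # timeout caps the holding time
```

Under these assumed figures the long timeout leaves about 240 requests waiting, far more than a 50-thread pool can hold, while the short timeout keeps the estimate at 40. The trade-off is that requests cut off by the timeout fail and have to be retried by the client.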

DB services (CERN) report:

Monitoring / dashboard report: Two issues after the GridView upgrade: (1) the TNS was slightly out of date, so the services had to be restarted; (2) a query has stopped working and will be fixed tomorrow, so there will be a few hours when the summary data for availability will not be updated.

Release update:

AOB: Gonzalo asked if the CERN Oracle server upgrades this week only applied to the 3D databases. The answer was that all servers are being converted to version 10.2.0.4 so all databases will be affected. Clients, however, are not yet being upgraded.

Thursday:

Attendance: local(Roberto, Jamie, Harry);remote(IN2P3, Derek, Jeremy).

elog review:

Experiments round table:

LHCb (Roberto): "CCRC'08" not restarted yet - probably "interim DIRAC" still being worked on. Restarted DC06. Problems with info published by RAL.

CMS (elog) - nothing special

ATLAS (elog) - nothing special

Sites round table:

RAL (Derek): information gathering plug-ins timing out? -> 0 published as number of running jobs, hence not attractive - under investigation

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

Friday:

Attendance: local(Julia, Andrea, Harry, Roberto);remote(Michel).

elog review: Nothing new.

Experiments round table:

CMS (AS): CMS SAM SE tests have been failing into CASTOR at CERN since yesterday with write permission denied. Ticket submitted.

LHCb (RS): Small scale tests for the LHCb CCRC are continuing and will ramp up on Monday. Added an observation entry in the elog (LHCb observation) with some plans and wishes for the next weeks.

ALICE (PM): ALICE do not have enough disk space at all their Tier1 to fully stress sending custodial raw data during the commissioning exercise so will only use FTS for occasional testing. They will produce first pass processing ESD at CERN for transmission to Tier 1 and eventual second pass processing.

Sites round table: Michel (GRIF) asked about a report he had heard that ATLAS will be asking for additional space tokens at their Tier 2 (to be followed up).

Core services (CERN) report:

DB services (CERN) report: There is currently a not-understood dashboard problem of delays in updating the job information for CMS.

Monitoring / dashboard report:

Release update:

AOB: LHCb would appreciate information from RAL on their not publishing batch queue lengths due to queries timing out (reported yesterday).

-- JamieShiers - 10 Jun 2008

Topic revision: r14 - 2008-06-24 - JamieShiers
