Week of 080929

Open Actions from last week:

Daily WLCG Operations Call details

To join the call at 15.00 CE(S)T, Monday to Friday inclusive (in CERN 513 R-068), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Alessandro, Jamie, Markus, Ricardo, Harry, Miguel, Patricia, Jan, Jean-Philippe); remote(Gareth, Michael).

elog review:

Experiments round table:

  • ATLAS - the weekend went pretty smoothly. Expect cosmic data taking with all subdetectors tonight - see the 16:00 ATLAS meeting for details. Replication is automatic - i.e. expect cosmics to flow as announced. GGUS - providing a list of 5 Tier2s (the calibration T2s) - 4 are in EGEE, 1 is Michigan - use the contact e-mail from the ATLAS Tier1s. OSG tickets only work for BNL - at WLCG ops will ask GGUS to implement direct ticket access for at least these 5 Tier2 sites (the ticket by-passes the ROC). Michael - would very much welcome this being implemented - it would ease life! Q: for the cosmic data, is there a LAr sample? A: will follow up.

  • ALICE - the issue reported with the WMS at RAL is now solved. 2 WMSs at RAL are up and running and were put into production for ALICE this morning (some authorisation issues). New ALICE queue at CERN - the time limit is increased to 20 hours (2 normalised days - the previous queues had 1 normalised day) (the LSF ticket can be closed); the queues are to be provided as of tomorrow. Will test the new queues tomorrow morning.

Sites round table:

Core services (CERN) report:

DB services (CERN) report:

  • Friday network problems (router) affected CMS online - streaming was not working. Corrected, but a DNS intervention later broke it again - fixed this morning. On Friday night the LFC stream to SARA aborted. Fixing some rows at the destination - data was changed at the destination but should be read-only!

  • SAM: intervention tomorrow (broadcast sent) - 1 hour of tests will not be published.

Monitoring / dashboard report:

Release update:

AOB:

Tuesday:

Attendance: local(Andrea, Patricia, James, Harry, Jamie, Gavin, Jean-Philippe, Ricardo, Alessandro, Miguel);remote(Gonzalo, Derek, Michael).

elog review:

Experiments round table:

  • CMS - nothing to report today. In contact with the CASTOR team to run transfers using the new SRM endpoint, which will probably be set up next week.

  • ALICE - the two queues mentioned yesterday have been put into production.

  • ATLAS - (answering Michael's question from yesterday) 5 samples are written for each event; each event is around 3 MB. Running cosmic data taking with all subdetectors. All sites show good efficiency except Lyon, which is in scheduled downtime.

  • LHCb: an issue with the myproxy server is preventing proxies from being renewed and hence jobs from running smoothly. As far as is known (Maarten reported this during EGEE08), this is a known issue with the reliability of the myproxy server software; a solution should be on the way, but it is not clear exactly where. Other VOs have experienced this in the past as well, and now LHCb has too (a client-side watchdog sketch is given after this list). Gavin - some hiccoughs with the myproxy service in the last few days but nothing chronic.
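
On the LHCb myproxy point: one client-side mitigation is a watchdog that polls the remaining proxy lifetime and retries renewal with backoff, so that a transient myproxy outage does not kill running jobs. A minimal sketch in Python follows; voms-proxy-info --timeleft is a standard grid CLI call, but the renewal command, server name and thresholds below are placeholders, not the actual LHCb setup.

    # Minimal sketch: renew the proxy before it expires, retrying with
    # backoff to ride out transient myproxy-server outages.
    # RENEW_CMD below is a PLACEHOLDER, not the real LHCb mechanism.
    import subprocess
    import time

    RENEW_BELOW = 4 * 3600    # renew when under 4 hours remain (assumed threshold)
    CHECK_EVERY = 10 * 60     # poll every 10 minutes (assumed)
    RENEW_CMD = ["myproxy-logon", "-s", "myproxy.example.org", "-l", "someuser"]

    def proxy_timeleft():
        """Remaining proxy lifetime in seconds, or 0 if it cannot be read."""
        try:
            out = subprocess.check_output(["voms-proxy-info", "--timeleft"])
            return int(out.decode().strip())
        except (subprocess.CalledProcessError, ValueError, OSError):
            return 0

    def renew_with_backoff(max_tries=5):
        """Attempt renewal, doubling the wait after each failure."""
        wait = 30
        for _ in range(max_tries):
            if subprocess.call(RENEW_CMD) == 0:
                return True
            time.sleep(wait)
            wait *= 2
        return False

    while True:
        if proxy_timeleft() < RENEW_BELOW and not renew_with_backoff():
            print("proxy renewal failed repeatedly - alert the operator")
        time.sleep(CHECK_EVERY)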

Sites round table:

  • PIC (Gonzalo) - an issue this morning around 11:00: the SRM was over-loaded by LHCb requests. This also happened a couple of weeks ago - was in contact with LHCb to schedule a controlled test. It was arranged for yesterday but seems to have been launched today(!). The new parameters didn't help - srmGet requests time out; puts work, however. Now rebooting the SRM, asking LHCb to stop the tests, and will schedule a new test to try to fix this. Michael - what is the file transfer rate per hour? A: get requests pile up in an internal SRM queue; each generates a pin request and the internal queues get overloaded - the transfer rate itself is not so high. Michael - correlated with high pnfs load? A: no; we see an increase in SRM load but not 'too high'. Only seen now - 13K srmGet requests in one shot; presumably this would also happen with other VOs, but it has not so far. Q for Michael - do you have tests showing how many requests you can stand? A: yes, launched about 10K requests at a time. It took some time [ to handle? ] but never got stuck. Gonzalo - eventually the get requests succeed but it takes some 4-5 hours; SAM get requests, for example, time out during this period. Investigation on-going with the dCache people. (A client-side throttling sketch is given after this list.)

  • RAL (Derek) - no update, still investigating. No decision yet as to whether separating out into different DBs would help.
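
On the PIC point above: bursts like 13K srmGet requests in one shot flood the SRM's internal pin-request queues, so the usual client-side mitigation is to cap the number of requests in flight. A minimal sketch, assuming hypothetical srm_get() and srm_status() client calls rather than any specific SRM API:

    # Minimal sketch: throttle bulk SRM get requests instead of submitting
    # them in one shot. srm_get()/srm_status() are HYPOTHETICAL stand-ins.
    import collections
    import time

    MAX_IN_FLIGHT = 200   # assumed safe concurrency, far below a 13K burst

    def fetch_all(surls, srm_get, srm_status):
        pending = collections.deque(surls)
        in_flight = {}                      # surl -> request token
        done, failed = [], []
        while pending or in_flight:
            # Top up to the concurrency cap.
            while pending and len(in_flight) < MAX_IN_FLIGHT:
                surl = pending.popleft()
                in_flight[surl] = srm_get(surl)
            # Poll outstanding requests and harvest finished ones.
            for surl, token in list(in_flight.items()):
                state = srm_status(token)   # 'ready', 'failed' or 'queued'
                if state in ("ready", "failed"):
                    (done if state == "ready" else failed).append(surl)
                    del in_flight[surl]
            time.sleep(5)                   # don't hammer the SRM with polls
        return done, failed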

Core services (CERN) report:

  • FTS 2.1 (the SLC4 version) has been deployed in production in parallel - experiments are encouraged to test it and, when confident, to move to it. The proposal would be to make 'an official' release - to be discussed and agreed at some board.

DB services (CERN) report:

  • No news from our side. Complementing what was said yesterday: we could not find the cause of the LFC stream to SARA failure; their DBA says no data was written on their side.
    The scheduled intervention on SAM this morning was postponed as the tests could not finish in the time available. It will be re-scheduled soon.

Monitoring / dashboard report:

  • Set up a new elog for Kors for tracking the outcome of the regular management meetings. Nothing new on the CMS elog...

Release update:

AOB:

  • N.B. Friday's meeting will (exceptionally) be in 28 R-006 due to a conflict with the LCG GridFest that day.

Wednesday:

  • Rehearsal of LHC GridFest involving many people

Attendance: local();remote().

elog review:

Experiments round table:

  • ATLAS (Simone):
    1. ATLAS cosmic data (RAW and ESD) are currently being distributed to T1s, both on disk and tape. This is a lot of traffic, also for the CERN-CERN channel (pumping between 150 MB/s and 300 MB/s over the last 48 hours, i.e. roughly 25-50 TB).
    2. I got notification from Gavin about the new FTS on SLC4 being in production at CERN. I agreed with ATLAS that the new service will start being used for functional test transfers from next Monday.
    3. CASTOR operations asked whether ATLAS could test a new version of CASTOR (still 1.1.7) in "PPS mode". This is possible; I can include it in the next round of functional tests (Monday) if the CASTOR endpoint is ready, otherwise it can easily be introduced at a later time.
    4. Issues in the last 24h:
      • problem with postgres database in dCache at PIC, fixed this morning;
      • low incoming throughput into SARA, affecting functional tests as well as Monte Carlo production. Possibly an FTS issue; being followed up.

Sites round table:

  • PIC: incident this morning affected the SRM service again, from 04:00 to 11:30. Due to an auto-update on the SRM service; recovered from and fixed on the node.

Core services (CERN) report:

  • FTS - a problem was discovered in the new FTS service. This will hold up the release of FTS for the Tier1s.

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

Thursday:

Attendance: local(Ricardo, Gavin, Andrea, Jamie, Harry, Jean-Philippe, Luca, Roberto);remote(Derek, Michael, Simone, Michel).

elog review:

Experiments round table:

  • ATLAS - data export of cosmics on-going, no major problems. Other activity: re-staging tests - a bit problematic. The full-size FDR2 raw sample (7.3 TB) was fully successful at RAL, TRIUMF and CERN (an issue with a few files is being followed up). For the other Tier1s there were some problems; Graeme sent a detailed report to the sites. First a massive bringonline, then a more "Panda-like" bringonline. Michael - BNL has not participated in these tests; it uses "Panda mover", which does implicit pre-staging whenever a job requests files. Kaushik & Sasha are in contact on this... Sasha: recent results from the 3D scalability tests probe the limits at the 3D Tier1 sites; looking forward to further steps to build a model that takes these limits into account. ATLAS data reprocessing jobs read slow-control DCS data. Harry - can we have a report please? Luca - only one session of tuning so far; will put in a report. Roberto - did you run bringonline recently at GridKA? A: yes - the results were not encouraging. R: severe problems with requests timing out. S: yes, same.

  • CMS - reconstruction jobs for cosmics are going on smoothly; the reconstructed data are being exported to T1s. IN2P3 was marked as custodial by mistake - corrected - it should have been RAL.

  • LHCb - activity snapshot: "fake MC simulation" and distributed analysis. 3 points: 1) the above-mentioned timeouts with lcg_gt operations at GridKA - concurrent activity with ATLAS? (shouldn't happen...) 2) PIC - the LHCb controlled test failed due to load; SRM requests pile up and, once timeouts start, we get into a vicious-circle condition (a retry/backoff sketch is given after this list). 3) LFC migration - Sophie has proposed a date & intervention plan: 7 Oct, intervention at CERN 10-11, at the Tier1s 11-12. Sent to the Tier1 admins; RAL has agreed so far... This would allow retirement of the SRM v1 endpoints!
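
On the vicious circle in point 2: it typically arises when clients react to a timeout with an immediate retry, which adds yet more load to an already saturated SRM. A minimal sketch of retry with exponential backoff and jitter around the lcg-gt call (lcg-gt SURL protocol is the standard lcg_util CLI usage; the timing parameters here are illustrative assumptions):

    # Minimal sketch: retry lcg-gt with exponential backoff plus jitter so
    # a timed-out request does not immediately re-enter the SRM queue and
    # deepen the overload. Backoff parameters are illustrative assumptions.
    import random
    import subprocess
    import time

    def get_turl(surl, protocol="gsiftp", max_tries=5):
        wait = 60.0
        for _ in range(max_tries):
            proc = subprocess.Popen(["lcg-gt", surl, protocol],
                                    stdout=subprocess.PIPE)
            out, _unused = proc.communicate()
            if proc.returncode == 0:
                return out.decode().splitlines()[0]  # first line is the TURL
            # Jitter spreads out retries from many synchronized clients.
            time.sleep(wait * (0.5 + random.random()))
            wait *= 2
        raise RuntimeError("lcg-gt kept failing for %s" % surl)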

Sites round table:

  • SARA - Wed 17:30 UTC+2: a hardware error on one of the pool nodes at SARA. Because of it, one pool node is unavailable and some disk-only data will be temporarily unavailable. This will be fixed asap.

Core services (CERN) report:

  • 3 things to follow up: 1) 2 of the 30-odd transfer agents seg-faulted. The logging library recommended by gLite has problems with long (>1024 character) strings - gLite will patch and re-release. This potentially affects other components that use this library (a defensive-truncation sketch is given after this list). 2) PIC had reported problems with delegation - "clock skew". Some strange kernel-processor problem - now fixed (use the standard RHEL kernel). No other sites had reported this. 3) Transfers from CMS to IN2P3 show a high failure rate - investigating.
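
Until the patched library ships, callers can defend themselves by bounding message length before it reaches the logging call. The affected library is a C component, so the Python sketch below only illustrates the pattern; the 1024-character limit comes from the report above, everything else is generic.

    # Minimal sketch: clamp log messages to a safe length before handing
    # them to a backend known to misbehave on long (>1024 char) strings.
    MAX_LOG_LEN = 1024

    def safe_log(logger, message):
        if len(message) >= MAX_LOG_LEN:
            # Keep the head of the message and flag the truncation.
            message = message[:MAX_LOG_LEN - 16] + "...[truncated]"
        logger(message)

    # Usage (Python 3): safe_log(print, long_transfer_error_string)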

DB services (CERN) report:

  • Oracle has released the patch to fix the following Streams bug: no instantiation SCN when dropping a table (with 2 Streams setups between the same source and destination databases). This patch has already been applied and tested successfully on the Streams test 3D environment (a sketch of the related manual workaround is given below).
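
For context, when a table is left without an instantiation SCN, the usual manual workaround is to set one explicitly so the apply process knows where to start. A minimal sketch with cx_Oracle: the connection strings, schema and database names are placeholders, the two Oracle packages shown are standard, and a production Streams setup should of course only be touched by the DBA.

    # Minimal sketch: set a missing instantiation SCN for one table in an
    # Oracle Streams replication. All names/DSNs are PLACEHOLDERS.
    import cx_Oracle

    src = cx_Oracle.connect("strmadmin/secret@SOURCE_DB")   # placeholder DSN
    dst = cx_Oracle.connect("strmadmin/secret@DEST_DB")     # placeholder DSN

    # Take the current SCN at the source as the instantiation point.
    scn = src.cursor().callfunc(
        "DBMS_FLASHBACK.GET_SYSTEM_CHANGE_NUMBER", cx_Oracle.NUMBER)

    # Tell the destination apply process to apply changes to this table
    # only from that SCN onwards.
    dst.cursor().callproc(
        "DBMS_APPLY_ADM.SET_TABLE_INSTANTIATION_SCN",
        ["LFC_SCHEMA.SOME_TABLE", "SOURCE_DB.WORLD", int(scn)])
    dst.commit()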

Monitoring / dashboard report:

Release update:

AOB:

Friday:

Attendance: local();remote().

elog review:

Experiments round table:

Sites round table:

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:
