Week of 100315

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Availability: ALICE, ATLAS, CMS, LHCb
SIRs & Broadcasts: WLCG Service Incident Reports, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Patricia, Majik, Miguel, Dirk, Harry, Jamie, Maria, MariaDZ, Jean-Philippe, Cedric, Nicolo, Timur, Steve, Andrea, Ueda, Alessandro, Simone, Gavin, Lola, Ewan, Eva, Nilo);remote(Michael/BNL, Jon/FNAL, Xavier/KIT, Gang/ASGC, Rolf/IN2P3, Tore/NDGF, Ron/NL-T1, Rob/OSG, Joel/LHCb, John/RAL).

Experiments round table:

  • ATLAS reports - Not so quiet weekend regarding T1s
    • CERN
      1. ntr
    • T1 (in chronological order)
      1. FZK-LCG2_SCRATCHDISK went full (Fri). (ATLAS fault)
      2. FZK atlas dCache head node's certificate expired (Sat). Quickly fixed.
      3. TAIWAN : Unscheduled downtime due to power cut (Sun). Finally back around 12:00
      4. NDGF LFC's certificate expired (Sun). Problem reported by cloud before being noted by shifters. Cloud set offline in Panda + blacklisted in DDM.
      5. RAL srm went down (Sun).
      6. Last but not least: SARA FTS was still down until this morning. Informed around 13:30 that they had downgraded to FTS 2.1. Put back in DDM.
    • T2
      1. SE problems at GRIF-LAL. Fixed.
      2. Problem at DESY-HH: 800 jobs from the NAF crashed the dcap door and blocked the HOTDISK pool, thus blocking production. Now fixed.
    • Andrea - any more details about the problem with FTS? JPB - problem with libstdc++ and an incompatibility with the Oracle client. Documented for FTS 2.1 but not for FTS 2.2.3. Ron verified that this solution works for the new FTS but there was no further update. Ron - exactly the issue. Have FTS 2.1 in production and will leave it running. Installed a test instance of FTS 2.2.3 which seems OK. Will switch at some point in the future.

  • CMS reports -
    • T0 Highlights:
    • T1 Highlights:
      • MC processing tails completing at T1s
        1. Last job resubmitted to PIC
      • Backfill running
    • T2 highlights
      • MC ongoing in all T2 regions.
        1. File loss at T2_IT_Bari (INFN-BARI) confirmed due to StoRM 1.4 file deletion bug, site in unscheduled downtime to deploy fix - site now back in production.
        2. Many job aborts at T2_FR_CCIN2P3 (IN2P3-CC-T2) for unknown reason.
      • MC preproduction ongoing.
        1. Failures caused by use of inconsistent conditions tags. (Problem on CMS side)
      • SAM tests running infrequently on CEs at T2_FI_HIP - probably issue with cron that registers ArcCEs as CEs in SAMDB, Savannah #113117
      • SAM test failures at T2_RU_ITEP (ITEP), disk issues; T2_BE_IIHE (BEgrid-ULB-VUB) - missing library libsigc-2.0.so.0
      • T2_AT_Vienna (Hephy-Vienna) transfer failures - problems in submission to FTS servers, probably caused by the change of the Austrian CA root certificate, GGUS #56401
    • Other
      • Distributing new SAM and JobRobot datasets to all CMS T1s and T2s + interested T3s.
    • FTS does not reload CAs automatically - try a restart of the FTS web service (a hedged check sketch follows below).
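    • A hedged illustration of the kind of check an admin might run before restarting (the CA directory path, the .pem file extension and the use of the openssl CLI are assumptions, not FTS documentation): list the CA certificates installed under /etc/grid-security/certificates that have already expired, i.e. those that still need to be refreshed before a restart will help.

        #!/usr/bin/env python
        # Hypothetical helper, not an official FTS procedure: flag expired CA
        # certificates so the admin knows the on-disk CA bundle is stale and a
        # restart of the FTS web service alone will not be enough.
        import glob
        import subprocess

        CERT_DIR = "/etc/grid-security/certificates"  # assumed default CA directory

        for pem in sorted(glob.glob(CERT_DIR + "/*.pem")):  # extension may vary per CA distribution
            # 'openssl x509 -checkend 0' exits non-zero if the certificate has
            # already expired (it also prints a short status line per file).
            rc = subprocess.call(["openssl", "x509", "-checkend", "0",
                                  "-noout", "-in", pem])
            if rc != 0:
                print("expired CA certificate: %s" % pem)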

  • ALICE reports - GENERAL INFORMATION: During the weekend the production was concentrated on Pass1 reconstruction activities at the T0 and chaotic analysis jobs. A maximum peak of 1900 concurrent jobs was reached overnight.
    • T0 site
      • Testing of the latest AliEn v2.18 version is ongoing at the T0. This is the only site running this version for the moment. More news about this version during the ALICE Offline week: http://indico.cern.ch/conferenceDisplay.py?confId=66646
      • Pass1 reconstruction activities successfully running during the weekend
    • T1 sites
      • No particular problems to report
    • T2 sites
      • Subatech T2: the latest AliEn v2.18 will be installed at this site this week for testing purposes, before beginning the massive distribution to all sites (foreseen for the end of March). Sites should not try to update to the latest AliEn version themselves; this will be done from CERN for all sites.
      • Athens T2: the site announced yesterday the update of their local resources to SL5. The site (blacklisted at the end of 2009) will be tested and validated before entering production. The local CREAM-CE still requires an update as well.
      • Polish sites: on Friday evening the CREAM-CE systems at both Poznan and Cyfronet were tested. The system at Cyfronet was performing properly and the site has been configured to use CREAM-CE exclusively. Several issues at Poznan still prevent the full migration to the CREAM-CE submission mode. Following up on the issues with the site admin.

  • LHCb reports - Stripping of bbar and ccbar over previous MC productions running at a very low profile. Some jobs (10%) killed by the watchdog declaring them stalled. Something related to the application.
    • T2 sites issue: ITEP shared area issue; g++ not found in BIFI

Sites / Services round table:

  • BNL - ntr
  • FNAL - ntr
  • KIT - ntr
  • ASGC: Scheduled power maintenance at ASGC on March 14. We did not announce this in GOCDB beforehand because in principle it should not affect ASGC services, but unfortunately the power generators failed to run as stably as expected due to insufficient battery capacity for the two generators to synchronize quickly. An upgrade of the battery and fine-tuning of the generator synchronization configuration were implemented on 14 March. In addition, the tape library robot arm broke down due to the abrupt power cut; it will be covered by UPS protection very soon.
  • IN2P3 - early announcement of downtime on March 18 of FTS - will try to upgrade to 2.2.3
  • RAL - LHCb draining issue mentioned - finished this morning and OK. Various problems with the ATLAS SRM (failed for a while) and jobs killing the s/w server. Investigating. FTS upgrade on Wednesday and 2 scheduled "at risks" tomorrow.
  • NDGF - as ATLAS mentioned, the LFC cert expired yesterday and was fixed around lunchtime. Migrated the service to a new host. Some problems with the new O/S; the LFC is crashing rather randomly. People are looking into it. Rather unstable at the moment.
  • NL-T1 - covered FTS thing under ATLAS; for SARA ntr; for NIKHEF message from Jeff: ATLAS SAM tests failing for SRM v2 - lcg_cr returns invalid argument. Q to ATLAS: no GGUS ticket - is this a real problem or a "ghost" message? JPB - please send some more info. Alessandro - following up.
  • OSG - ntr

  • CERN FTS - plan to upgrade Tue/Wed this week. Need to drain the production servers. Plan to update and drain the T0export and T1import services. Would like to force a drain on Wednesday and set channels as draining. Simone - 2 activities sending FTS jobs to the old endpoints: the ATLAS integration testbed (stopped) and T1import. Can switch to the new FTS 2.2.3 if the channel config is confirmed. More news at https://twiki.cern.ch/twiki/bin/view/FIOgroup/FtsInterventions23upgrade

  • CERN - glitch which caused a ~10' slowdown in the scheduling of all CASTOR instances. Must be a shared component - still investigating. None of the experiments should have seen a real issue. Nico - saw something in monitoring but not in production. Miguel - some retries maybe? Nico - will check timestamps to correlate.

  • CERN DB - PVSS replication for CMS: users creating views that reference tables which do not exist on the destination replicas, breaking replication.

AOB:

  • North America: Daylight Saving Time (Summer Time) began on 14 March 2010 in the U.S.A. (except Hawaii and Arizona) and Canada (except Saskatchewan)
  • European Summer Time begins (clocks go forward) at 01:00 GMT on 28 March 2010
  • Hint: "spring forward, fall back"

Tuesday:

Attendance: local(Ueda, Dirk, Simone, Pepe, Jamie, Maria, Lola, Cedric, Majik, Jean-Philippe, Ignacio, Harry, Nilo, Nicolo, Roberto, Alessandro, Miguel);remote(Jon/FNAL, Gonzalo/PIC, Michael/BNL, Xavier/KIT, Rolf/IN2P3, Jeff/NL-T1, Gang/ASGC, Tiju/RAL, Rob/OSG).

Experiments round table:

  • ATLAS reports -
    • T1
      1. SARA: some transfer errors observed between NIKHEF and some T2s. Probably an FTS problem ([TRANSFER_MARKERS_TIME OUT] No transfer markers received for more than 120 seconds). [ No errors in the last hour - fixed now? ]
      2. NDGF: put a new certificate on the server yesterday around lunchtime, but occasional crashes of the lfcdaemon. Could first be put online in Panda around 14:30, but the NDGF squad asked to wait before putting it back in DDM. Finally done during the night. No problems observed since.
      3. RAL: From the GGUS ticket submitted on Sunday: "All the problems were caused by requests from a single user. This request systematically knocked out each srm node. We are in contact with that user now to get more details, hopefully to replicate the problem under testing and look for solutions." [ Tiju - no information at the moment - will update tomorrow ]

  • CMS reports -
    • T0 Highlights:
      • Replay test jobs running
    • T1 Highlights:
      • CNAF transfers were bad for hours following the StoRM upgrade. Authentication issues on SRM and gridftp after upgrade.
      • Data rereconstruction tails completing at T1s
        1. Last job running at PIC [ Gonzalo - why do some jobs take so long to run at PIC? A: discussed in the dataops meeting yesterday - some jobs have datasets with many more events than average. Q: resubmit with smaller input files? A: resubmitted. Don't know if any other action - will ask CMS operations ]
      • Backfill test jobs running
        1. Few memory failures at IN2P3
      • KIT
        1. SAM test failures: connection errors to SRM (apparently affecting only CERN DNs?)
      • ASGC
        1. CMS SLC5 software area disappeared? [ Gang - not disappeared - misconfiguration of the CMS SW directory - will correct asap ]
      • PIC
        1. Maradona errors in JobRobot disappeared with no intervention on PIC or CMS side - ticket closed.
    • T2 highlights
      • MC ongoing in all T2 regions.
      • SAM test failures at T2_RU_ITEP (ITEP), disk issues on software area; T2_FR_GRIF_IRFU (GRIF) - Maradona error.
      • SAM tests running infrequently on CEs at T2_FI_HIP - probably issue with cron that registers ArcCEs as CEs in SAMDB, Savannah #113117
      • T2_AT_Vienna (Hephy-Vienna) transfer failures - problems in submission to FTS servers, probably caused by the change of the Austrian CA root certificate, GGUS #56401, GGUS #56465, GGUS #56467
    • [Data Ops]
      • Tier-0: Get ready for 900 GeV collisions.
      • Tier-1: Running backfill while waiting for real requests.
      • Tier-2: Finish 354 pre-production and old requests, start new production requests.
    • [Facilities Ops]
      • CMS preparing for first 900 GeV collisions.
      • Follow-up on the Tape Families setup at ASGC to bring the site into Operations before the run. Some 2009 cosmic data will be transferred in custodial/non-custodial mode to validate the new Tape Families. ASGC will report to CMS at next Monday's CMS Operations Meeting.
      • New SAM and Job Robot datasets, created with later CMSSW versions, were subscribed to sites yesterday and approved centrally for Tier-1s and Tier-2s. Tier-3s willing to receive SAM/JR need to approve the subscriptions themselves. Once the datasets land at the sites, we will migrate SAM and JobRobot to use the latest CMSSW releases.
      • This week is CMS week. CERN Indico agenda here. (See detailed report; access is restricted anyway.)

  • ALICE reports - GENERAL INFORMATION: Pass1 reconstruction jobs still running on the T0 resources. A small fraction of user analysis jobs (chaotic analysis) is also running at some T2 sites; Migration of SAM to Nagios: the packaging of the ALICE test suite for the VOBOX was provided this morning; together with the monitoring experts, beginning the setup of the test suite; SAM tests: the CREAM-CE sensor has been included in the list of ALICE sensors. Part of the ops test suite is currently running with ALICE credentials.
    • T0 site
      • The alarm system of the ALICE voboxes has been reactivated this morning.
      • Testing of the AliEnv2.18 still going on at the T0
    • T1 sites and T2 sites
      • There are no particular issues to report
      • GRIF_IPNO: the CREAM-CE cluster is being published as a subcluster of the LCG-CE. This is preventing the submission of agents to this site. The ALICE model assumes the independence of the CEs which point to the same cluster.

  • LHCb reports -
    • The 2 stripping productions launched on Friday (to fix the problem of large input files) are almost completed. Another small stripping is to be launched and a few million MC events have been requested to be produced. Not much activity otherwise. Mainly waiting for the first data in the afternoon and checking that all pieces of code are working fine for processing them.
    • Testing SQUID at PIC for offloading the central web servers (application and DIRAC software repositories).
    • T0 site issues: a glitch with the AFS shared area prevented writing the lock files spawned by SAM (also used for SW installation). Immediately fixed.

Sites / Services round table:

  • FNAL - ntr
  • BNL - will carry out transparent intervention on Oracle - security and other patches. Also O/S update (rolling)
  • KIT - failing SAM tests (CMS report) - found out that the certificate revocation lists were not updated on all our machines. Not updated since last Thursday - fixing.
  • PIC - ntr
  • IN2P3 - ntr
  • ASGC - ntr
  • RAL - ntr
  • NL-T1 - yesterday something about ATLAS SAM tests failing at NIKHEF. Chased down - one of the disk servers behind the SE needed a rebuild and we didn't realize. Set to R/O - points to the lcg_cr command having a bad error message (invalid argument). GGUS ticket about transfer timeouts from NIKHEF to T2s - looked at it. Bandwidth issue offsite to "not well connected T2s". Timeouts increased to 1 year(!)
  • OSG - new ticket just opened for Jon - urgent ticket about file transfers FNAL - Vienna. Kyle will grab the attachment and add it to the ticket. Nicolo - similar to the ticket submitted to CERN FTS yesterday: Vienna admins not authorized to submit to the FNAL FTS. Expiration at the Austrian CA last week - requires a restart of the web service on the FTS server. The error message looks pretty much the same... Jon - if this is the cause, why would the ticket be directed at specific sites? The new CA cert has to be distributed to all?? Rob - will check.

  • CERN - ntr

AOB:

Wednesday

Attendance: local(Jan, Jamie, Maria, Cedric, Eva, Nilo, Patricia, Manuel, Jean-Philippe, Steve, Simone, Edoardo, Harry, Ignacio, Xavier, Lola, Miguel, Gavin, Alessandro);remote(Jon/FNAL, Gonzalo/PIC, Xavier/KIT, Michael/BNL, Gang/ASGC, AlessandroCavalli/INFN-CNAF, Rob/OSG, Tore/NDGF, Onno/NL-T1, Rolf/IN2P3, Tiju/RAL).

Experiments round table:

  • ATLAS reports -
    1. DB alerts received at ~18:30-19:00 linked to problems with pandamon. Panda experts asked to have a look. Blocking sessions were killed, but the problem will have to be investigated.
    2. Some SRM problems observed at SARA quickly fixed.
    3. RU-Protvino-IHEP : Still some FTS errors as reported yesterday : [TRANSFER error during TRANSFER phase: [TRANSFER_MARKERS_TIME OUT] No transfer markers received for more than 120 seconds]
    4. GOEGRID : dCache problem, solved now.
    5. Functional tests for transfers not running (noticed after lunch). Problem was with the SLC4/5/Python environment - fixed and functional tests are now working.
    • Nota bene - ATLAS has noted SRM 2.9 as production version

  • CMS reports -
    • T0 Highlights:
      • Replay test jobs running
      • Following up with PhEDEx configuration changes to migrate T0-->T1 transfers to use FTS2.2.3 fts22-t0-export.cern.ch
    • T1 Highlights:
      • Data rereconstruction tails completing at T1s
        1. Last job running at PIC
      • Backfill test jobs running
        1. Jobs on CREAMCEs at KIT stuck in 'Running' state in WMS - probably need manual cleanup by WMS admins.
      • ASGC
        1. CMS SLC5 software area issue - caused by wrong VO_CMS_SW_DIR setting on some WNs, now fixed, software tags will be republished.
    • T2 highlights
      • MC ongoing in all T2 regions.
      • SAM test failures at T2_RU_ITEP (ITEP), disk issues on software area - SW redeployment in progress; T2_HU_Budapest (BUDAPEST) - downtime extended.
      • SAM tests running infrequently on CEs at T2_FI_HIP - probably issue with cron that registers ArcCEs as CEs in SAMDB, Savannah #113117. Update 17/3: SAMDB fix applied, SAM tests now running more frequently.

  • ALICE reports -
    • T0 site
      • Alarm system at the ALICE VOBOXES activated.
      • Small number of jobs coming from the Pass1 reconstruction running at the T0 resources
    • T1 sites
      • Lyon-T1: Yesterday afternoon we found a problem with the local CREAM-CE. (GGUS ticket: 56475). Authentication error messages appearing at submission time. The problem was also reported to the ALICE contact person at the site. System is back in production since this morning
    • T2 sites
      • Subatech T2: instabilities expected today and tomorrow. Subatech is being used as the pilot site for the testing of the new AliEn v2.18 version.
      • Publication of cluster/subcluster information at IRES and GRIF_IPNO: at both of these sites the CREAM-CE cluster is published in the IS as a subcluster of the LCG-CE. The query to the CREAM-CE resource BDII (the information source used by AliEn), in terms of GlueHostMainMemoryRAMSize and GlueHostMainMemoryVirtualSize, does not work since apparently this query should be done at the level of the cluster node. The issue was reported by the two site admins. However it does not prevent the submission of new agents: the agent submission decision is based on the running and waiting job counts, and this information is provided by the subcluster resource BDII.

  • LHCb reports -
    • Re-ran the EXPRESS and FULL streams to test the 2010 workflows (using 2009 real data). The EXPRESS stream was OK, but for the FULL stream a known issue at the application level was encountered.
    • Launching now a campaign to clean up old DC06 data across all storages.
    • T0: SVN reported to be extremely slow today. This is a problem already reported via RT: https://cern.ch/helpdesk/problem/CT662489&email=marco.clemencic@cern.ch
    • T2 sites issue: Disk quota exceeded at INFN-PISA and problem importing some module in UKI-SCOTGRID-ECDF

Sites / Services round table:

  • PIC - ntr
  • KIT - ntr
  • FNAL - 3 things: 1) updated the Vienna CA certs yesterday - done; 2) planning to switch the production version of PhEDEx to the latest revision tomorrow morning - it has been running on the debug instance for some time; 3) at 5am Chicago time detected dCache problems: stack overflow exception - had to restart all admin services - total downtime ~30-60' but intermittent during this period. Paged the on-call service for this.
  • BNL - ntr
  • CNAF - today we have a failure on one CE which is for ALICE & CMS. Some authentication problem - being investigated. Announcement of next week's intervention on DB for LHCb - which is on Tuesday - concludes intervention of last week which was not completed due to a problem. Maria - appreciate attendance and would like this to be more regular if possible!
  • NDGF - ntr; Jean-Philippe - LFC crashing at NDGF - looked at the core dumps: all crashes are in the VOMS library. The version is so recent that it has not even been sent by the developer to certification! Huge risks in doing this! Strongly advise going back to e.g. 1.8.12. Already running on non-supported platforms - e.g. Ubuntu - but sites should run certified versions of the middleware! (Running VOMS 1.9.14) - have to follow up.
  • ASGC - ntr
  • NL-T1 - a few things: SARA SRM stuck yesterday and restarted - OK since. Some problems with LFC ATLAS and LFC LHCb: the problems started this morning, caused by network failures - should be fixed by now. NIKHEF: problem with a diskserver - some ATLAS files not accessible. No ETA for a fix. Recovery of files so far unsuccessful. GGUS ticket 56485 on transfers NIKHEF - Protvino. So far not able to pinpoint the source of the problem. 30' ago SARA FTS (where the channel is located) increased some timeouts. Better?? Don't know yet... JPB - please check the TCP buffer size at Protvino - last time we had a problem it was set to a very low value (a hedged sketch of such a check follows this round table).
  • IN2P3 - reminder of tomorrow's outage of FTS for the upgrade to 2.2.3. Additional outage of the CE for ALICE and ATLAS to enlarge the log area, due to the high number of ATLAS jobs in the past.
  • RAL - yesterday ATLAS asked for info on SRM ATLAS. Updated GGUS ticket 56439 with latest info. Upgraded to FTS 2.2.3
  • OSG - yesterdays CA problems resolved.

  • CERN DB - large transactions on PVSS replication caused a delay of several hours. The schema was not needed for replication so it was removed.

  • CERN Storage - instability on monitoring for SRM - nothing to do with experiment SRMs

  • CERN network - next Wednesday 07:00 - 12:00 we will be testing a new firewall. Tests so far positive so should be transparent. Please report any problems to IT-CS. Can switch back in a few seconds to normal firewall.

  • CERN Grid services - problem affecting the execution of some grid jobs - those submitted from a new ATLAS pool account. First discovered Monday evening, solved just now. The ownership of this account had changed name - problem now solved.

  • CERN FTS - T0export draining. Should be able to force a drain in the next couple of days. T2 server? Also going to drain. No, just a downtime.
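  • A hedged sketch related to the Protvino TCP buffer question raised by NL-T1 above (the sysctl paths are standard Linux locations; the values a site actually needs for good WAN throughput are not specified here and depend on the site): print the kernel TCP buffer settings so an admin can spot suspiciously low values.

      #!/usr/bin/env python
      # Hypothetical check, not a documented WLCG procedure: dump the kernel
      # settings that commonly cap TCP throughput on long round-trip transfers.
      SETTINGS = [
          "/proc/sys/net/core/rmem_max",   # max socket receive buffer (bytes)
          "/proc/sys/net/core/wmem_max",   # max socket send buffer (bytes)
          "/proc/sys/net/ipv4/tcp_rmem",   # min/default/max TCP receive buffer
          "/proc/sys/net/ipv4/tcp_wmem",   # min/default/max TCP send buffer
      ]
      for path in SETTINGS:
          print("%s = %s" % (path, open(path).read().strip()))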

AOB: (MariaDZ) USAG today is cancelled due to illness. Please comment in savannah to advance the agenda items linked from http://indico.cern.ch/categoryDisplay.py?categId=355 Next and last meeting date April 14th.

  • LHC - close machine and caverns by 16:00. Magnets on by evening. Ramp to 3.5 TeV followed on Friday by tests of high intensity bunches. Then beams of 450 GeV and collisions (at injection energy)

Thursday

Attendance: local(Timur, Malik, Ewan, Harry, Jean-Philippe, Zbigniew, Nicolo, Alessandro, Ignacio, Roberto, Miguel);remote(Xavier/KIT, Jon/FNAL, Tore/NDGF, Gonzalo/PIC, Rolf/IN2P3, Jeff/NLT1, Tiju/RAL, Rob/OSG, Luca/INFN).

Experiments round table:

  • CMS reports -
    • T0 Highlights: data taking and replay test jobs running. Following up with PhEDEx configuration changes to migrate T0-->T1 transfers to use FTS 2.2.3 fts22-t0-export.cern.ch
      1. 5/7 T1s using FTS 2.2.3 exclusively.
    • T1 Highlights: MC processing running at RAL, PIC. Backfill test jobs running.
    • T2 highlights: MC ongoing in all T2 regions. SAM test failures at T2_FI_HIP (CSC) SOLVED - SAM dataset not available; T2_CH_CSCS (CSCS-LCG2) - Maradona errors

  • ALICE reports - GENERAL INFORMATION: Massive user analysis production yesterday afternoon and evening, reaching a maximum peak of over 6000 concurrent jobs
    • T0 site and T1 sites - Low production at these sites, no particular issues to report. Regarding the issue reported for the CREAM-CE at IN2P3-CC (authentication problems observed at submission time), the ticket was closed yesterday and the system is performing well
    • T2 sites
      • The site admin at RRC-KI has announced today the setup of a CREAM-CE for ALICE. The system will be tested and validated this afternoon. If all the tests are correct the system will enter in production this evening.
      • Hiroshima T2: the xrootd setup for the local SE was successfully completed this morning. However, bandwidth issues at the site prevent putting the local SE into production. Following up on this issue with the ALICE experts.

  • LHCb reports - The clean-up of old DC06 data is ongoing.
    • T0 site issues: SVN, reported yesterday to be slow, has been fixed.
    • T2 sites: some problems with availability of g++ and libraries due to an error in a script setting the environment (fixed).

Sites / Services round table:

  • KIT: ntr
  • FNAL: reminder that they will install a new version of PhEDEx in one hour.
  • NDGF: ntr
  • PIC: ntr
  • IN2P3: FTS being upgraded to 2.2.3. Increasing the log partition on one CE server as announced yesterday. The log partition on the other CEs will be extended next week.
  • NLT1: problem at SARA solved.
  • RAL: ntr
  • OSG: ntr
  • INFN: ntr
  • ASGC: ntr

  • DBs at CERN: problem with the ATLAS offline DB last night around 05:00. The third listener node got stuck and prevented several new connections. The service was restored within one hour, but the problem is not understood yet (no high load).

AOB:

Friday

Attendance: local(Maria, Jamie, Harry, Dirk, Eva, Nicolo,Alessandro, Jean-Philippe, Andrea, Majik, Patricia, Roberto, Stephane, Steve, Nilo);remote(Gang/ASGC, Jon/FNAL, Rob/OSG, Xavier/KIT, Xavier Espinal/PIC, Rolf/IN2P3, Gareth/RAL, Gonzalo/PIC, Matthias/NDGF, Onno/NL-T1).

Experiments round table:

  • ATLAS reports - NIKHEF-ELPROD transfer problems with SARA - GGUS ticket open, will be added. A few T2 problems - people investigating. Cloud news: general migration underway. As soon as T1s have installed the new FTS, will do a "checksum check". (Done for UK except for QMUL and RAL.)

  • CMS reports -
    • T0 Highlights:
      • Following up with PhEDEx configuration changes to migrate T0-->T1 transfers to use FTS 2.2.3 fts22-t0-export.cern.ch
        1. 6/7 T1s using FTS 2.2.3 exclusively (RAL to be switched). Quality degradation in transfers to RAL and CNAF. Appears to be a destination SRM issue. Lasted for a few hours - seems to have recovered.
    • T1 Highlights:
      • Last data reprocessing job completed at PIC
      • MC processing running at RAL, PIC
      • Backfill test jobs running
      • PIC
        1. Authentication issues on FTS for CMS VO members after upgrade to FTS2.2.3 - fixed around 7pm.
    • T2 highlights
      • MC ongoing in all T2 regions.
        1. T2_BE_IIHE (BEgrid-ULB-VUB) - many jobs crashing because CMS SW area unavailable (also seen in SAM tests), site excluded.

  • ALICE reports - GENERAL OPERATION ACTION APPLICABLE TO ALL SITES: the latest gLite 3.2 VOBOX provides by default a gridftp server required to operate with jobs submitted to the CREAM-CE system. All sites already providing a CREAM-CE for ALICE: please check the status of the service /etc/rc.d/init.d/globus-gridftp. This service must be up and running. Already announced through several ALICE TF meetings, yet some VOBOXes have still not activated this service (a minimal check is sketched after this report).
    • T0 site and T1 sites
      • Small production running at the T0: Pass 1 reconstruction jobs
      • The T1s are performing two parallel reconstruction passes (pass2 and pass6) for the testing of new AliRoot improvements. There are no particular issues to report at these sites
    • T2 sites
      • RRC-KI (Russia): Yesterday the site admin announced the setup of a CREAM-CE. The system has been tested and validated this morning and the site has been configured in CREAM-CE submission mode (exclusively)
      • GRIF_IRFU (France): the local CREAM-CE info provider was stuck yesterday afternoon. The site has been put back in WMS/LCG-CE submission mode. The system should be back in production by the beginning of next week
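      • A minimal sketch of the gridftp check requested in the ALICE report above (the init-script path is the one quoted in the report; driving it directly with "status"/"start" is an assumption, not an official ALICE procedure):

          #!/usr/bin/env python
          # Hypothetical check, not an official ALICE procedure: verify that the
          # globus-gridftp service on a CREAM-CE VOBOX is running, start it if not.
          import subprocess

          INIT = "/etc/rc.d/init.d/globus-gridftp"  # path quoted in the report above

          if subprocess.call([INIT, "status"]) != 0:  # non-zero exit: not running
              print("globus-gridftp not running - starting it")
              subprocess.check_call([INIT, "start"])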

  • LHCb reports - The clean-up of old DC06 data is ongoing. Re-running the FULL stream reconstruction with the new DaVinci application (fixing a known bug); this is the workflow to process 2010 real data. Running at a low profile four different small MC productions and the corresponding merging (less than 2K jobs in total, including users).
    • T0 sites issues:
      • none
    • T1 sites issues:
      • CNAF: testing a first prototype in DIRAC for direct submission to CREAM and found mapping problems with the remote CE for the user and pilot FQANs
      • PIC: still investigating with local contact people the issue of the SL5-64bit application failing to be installed
      • Data upload from MC at NIKHEF failing - GGUS ticket about to come. LcgFileCatalogClient:__getRoleFromGID: Failed to get role from GID 0 Invalid argument
    • T2 sites issue:
      • Shared area issue at INFN-TORINO and at IPSL-IPGP-LCG2
    • PIC (Gonzalo) - issue with the DaVinci installation failing. Who will follow up? Experiment? Site? A: the s/w installation responsible will follow up. DIRAC-side.

Sites / Services round table:

  • FNAL - OSG informed us of a security problem - performed a rolling upgrade on all CEs to resolve it. No user downtime.
  • OSG - release of security update mentioned - ntr
  • KIT - ntr
  • IN2P3 - ntr
  • PIC - ntr
  • RAL - ntr
  • NL-T1 - at NIKHEF a 2nd diskserver with problems. GGUS tickets 56491 and 56577. The ATLAS report about NIKHEF-SARA transfer problems is probably related to this. NIKHEF in contact with the vendor for a solution. SARA FTS will be moved to new h/w and upgraded to 2.2.3 on Monday morning. Attempted the previous week; the blocking problem is now solved - the new machine is already running, only have to pull the switch.
  • NDGF - ntr
  • ASGC - ntr

  • CERN - ntr

AOB:

  • LHC - news as of 08:30: overnight priority 1 = establishing ramp commissioning to 3.5 TeV; priority 2 = measurements at 3.5 TeV needed for stable_beam requisites with collisions; priority 3 = collisions at 450 GeV before going to collisions at 3.5 TeV.