WLCG Tier1 Service Coordination Minutes - 29 April 2010

Attendance

| Site | Name(s) |
| CERN | Tim, Roberto, Patricia, Harry, Andrea, Luca, Maite, MariaDZ, Maarten, Manuel, Maria, Jamie, Gavin, Zbyszek, Jean-Philippe, Simone, Eva, Alessandro, Flavia, Tony |
| ASGC | Felix |
| BNL | Carlos, John |
| CNAF | Alessandro Cavalli |
| FNAL | Jon |
| KIT | Angela |
| IN2P3 | |
| NDGF | Vera, Jon |
| NL-T1 | |
| PIC | Gonzalo |
| RAL | Andrew |
| TRIUMF | Andrew |
| OSG | Kyle |
| GridPP | Jeremy |

| Experiment | Name(s) |
| ALICE | |
| ATLAS | Kors, Dario, John |
| CMS | Daniele, Dave D. |
| LHCb | |

Discussion on LHC Schedule

  • Basic message from Roger Bailey: the technical stops will be held as per schedule. Even if there is a problem with the LHC, these stops are likely to be needed anyway. An issue this time is the full maintenance of the LHC elevators, which is done by an external company and therefore very hard to shift!

  • Simone - what about the winter shutdown? The first estimate was 11 weeks, but the second iteration is likely to be shorter: switch off around 10th December, come back end of January / beginning of February. To be confirmed.

  • Tim - in the event of any changes, e.g. if one stop is moved forward by one week, what sort of notice would be given? A: it will be very difficult not to have these stops, even if we have to stop for other reasons. If there is an unscheduled stop of ~one week just before a technical stop we might change it, but as this affects the whole accelerator schedule it is very hard to do.

  • Andrew - very useful to have this information!

  • Jamie - service interventions should be scheduled through this meeting, trying to limit them to a) a maximum of 3 at a time and b) 30% of total capacity.

  • Kors - technical stops are relevant for the export of data but not otherwise for Tier1 / Tier2 sites. Site downtimes should be coordinated at any time.

  • Daniele - would not like to see a lot of sites down with short notification. If you plan downtimes, please go through this meeting with the maximum notice possible, so that we can foresee the impact and, if necessary, ask for the downtime not to be taken.

  • MariaDZ - can ask for this to be programmed into GOCDB (see the sketch after this list).

  • Tim - and for Tier0?

  • Conclusion - schedule T1/T2 interventions independently of technical stops
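
To make such coordination easier, downtimes declared in GOCDB can also be picked up programmatically. Below is a minimal sketch, assuming the GOCDB programmatic interface endpoint shown and the XML element names used here (both are assumptions and should be checked against the current GOCDB PI documentation); the site name is just an example.

```python
# Minimal sketch: list downtimes declared for a site via the GOCDB
# programmatic interface. The endpoint URL and XML element names below
# are assumptions to be verified against the GOCDB PI documentation.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

GOCDB_PI = "https://goc.egi.eu/gocdbpi/public/"   # assumed endpoint
SITE = "RAL-LCG2"                                  # example site name

def get_downtimes(site):
    query = urllib.parse.urlencode({"method": "get_downtime", "topentity": site})
    with urllib.request.urlopen(f"{GOCDB_PI}?{query}") as response:
        root = ET.fromstring(response.read())
    downtimes = []
    for node in root.findall("DOWNTIME"):          # assumed element name
        downtimes.append({
            "severity": node.findtext("SEVERITY"),
            "description": node.findtext("DESCRIPTION"),
            "start": node.findtext("FORMATED_START_DATE"),
            "end": node.findtext("FORMATED_END_DATE"),
        })
    return downtimes

if __name__ == "__main__":
    for dt in get_downtimes(SITE):
        print(dt)
```

A query like this could feed the intervention overview presented at this meeting.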

Review of WLCG alarm chain

  • Jamie - we need to be confident in the end-to-end alarm chain. We need to understand why things break and discuss with the relevant service providers why things have gone wrong.

  • MariaDZ - restart regular pre-GDB alarm chain tests?

  • Tim - a loop-back test, so that it can be automated? (See the sketch after this list.)

  • Action on MariaDZ to organize a meeting with the relevant people to review recent problems.
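
As a concrete illustration of the loop-back idea, the sketch below sends a test alarm e-mail and polls a mailbox for the corresponding acknowledgement within a deadline. The host names, addresses, credentials and subject convention are all hypothetical placeholders; a real test would go through the GGUS alarm workflow rather than plain mail.

```python
# Hedged sketch of an automated alarm-chain loop-back test: send a test alarm
# and verify that an acknowledgement comes back within a time limit.
# All hosts, addresses and the subject tag below are hypothetical placeholders.
import imaplib
import smtplib
import time
import uuid
from email.mime.text import MIMEText

SMTP_HOST = "smtp.example.org"          # placeholder
IMAP_HOST = "imap.example.org"          # placeholder
FROM_ADDR = "alarm-test@example.org"    # placeholder
TO_ADDR = "t1-alarm@example.org"        # placeholder test alarm address
TIMEOUT_S = 1800                        # fail the test after 30 minutes

def send_test_alarm(tag):
    msg = MIMEText("Automated WLCG alarm chain loop-back test.")
    msg["Subject"] = f"[ALARM-TEST {tag}] loop-back"
    msg["From"] = FROM_ADDR
    msg["To"] = TO_ADDR
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)

def acknowledged(tag, user, password):
    """Return True if a reply containing the test tag is found in the mailbox."""
    with imaplib.IMAP4_SSL(IMAP_HOST) as imap:
        imap.login(user, password)
        imap.select("INBOX")
        status, data = imap.search(None, "SUBJECT", f'"{tag}"')
        return status == "OK" and bool(data[0].split())

if __name__ == "__main__":
    tag = uuid.uuid4().hex[:8]
    send_test_alarm(tag)
    deadline = time.time() + TIMEOUT_S
    while time.time() < deadline:
        if acknowledged(tag, "alarm-test", "secret"):   # placeholder credentials
            print("Alarm chain loop-back OK")
            break
        time.sleep(60)
    else:
        print("Alarm chain loop-back FAILED: no acknowledgement in time")
```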

glexec deployment status

glexec has been tested at CERN, NIKHEF, KIT and PIC. Other sites either did not have glexec on the WNs or, as in the case of RAL, it did not work during this test.

RAL - just in the process of configuring it now; work will continue over the next couple of weeks.

FNAL - CMS is already using it, so it is known to work for them (Jon).

OSG sites - there was some trouble submitting to them; to be followed up.
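
For sites still configuring glexec, a basic worker-node check is to run a trivial command under the mapped identity. The sketch below assumes the commonly documented glexec environment variables (GLEXEC_CLIENT_CERT, GLEXEC_SOURCE_PROXY, GLEXEC_TARGET_PROXY) and an installation under /usr/sbin; the proxy paths are placeholders and the exact setup is site- and version-dependent.

```python
# Hedged sketch of a glexec smoke test on a worker node: run `id` under the
# identity mapped from a payload proxy. The paths below are placeholders and
# the glexec location / environment variable usage should be checked against
# the site's glexec configuration.
import os
import subprocess

GLEXEC = "/usr/sbin/glexec"                 # assumed installation path
PAYLOAD_PROXY = "/tmp/payload_proxy.pem"    # placeholder: proxy of the payload user
TARGET_PROXY = "/tmp/target_proxy.pem"      # placeholder: where glexec copies the proxy

env = dict(os.environ)
env["GLEXEC_CLIENT_CERT"] = PAYLOAD_PROXY   # credential used for the mapping
env["GLEXEC_SOURCE_PROXY"] = PAYLOAD_PROXY  # proxy handed over to the payload
env["GLEXEC_TARGET_PROXY"] = TARGET_PROXY

result = subprocess.run([GLEXEC, "/usr/bin/id"], env=env,
                        capture_output=True, text=True)
print("exit code:", result.returncode)      # 0 and a different uid indicate success
print(result.stdout or result.stderr)
```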

Data Management & Other Tier1 Service Issues

Storage systems: status, recent and planned changes

| Site | Status | Recent changes | Planned changes |
| CERN | CASTOR 2.1.9-5 (all); SRM 2.9-3 (all) | Upgraded all instances to CASTOR 2.1.9-5 and SRM 2.9-3 | |
| ASGC | CASTOR 2.1.7-19 (stager, name server); CASTOR 2.1.8-14 (tape servers); SRM 2.8-2 | none | none |
| BNL | dCache 1.9.4-3 | none | none |
| CNAF | CASTOR 2.1.7-27 (ALICE); SRM 2.8-5 (ALICE); StoRM 1.5.1-3 (ATLAS, CMS, LHCb, ALICE) | 26/4: 1-day transparent intervention on the SAN; reboot of the tape library. 28/4: TSM server upgraded to 6.3 to solve a potential, non-blocking issue; GPFS upgrade of all StoRM back-ends | StoRM upgrade to the latest version (foreseen for 17/5), date to be agreed (next LHC stop?) |
| FNAL | dCache 1.9.5-10 (admin nodes); dCache 1.9.5-12 (pool nodes) | none | none |
| IN2P3 | dCache 1.9.5-11 with Chimera | ? | ? |
| KIT | dCache 1.9.5-15 (admin nodes); dCache 1.9.5-5 to 1.9.5-15 (pool nodes) | Tape library re-alignment this afternoon (already reported) | Adding disk space for LHCb, gradually over the coming days; may result in temporary bottlenecks because empty disks have higher preference and activity will skew towards those nodes |
| NDGF | dCache 1.9.7 (head nodes); dCache 1.9.5, 1.9.6 (pool nodes) | none | none |
| NL-T1 | dCache 1.9.5-16 with Chimera (SARA); DPM 1.7.3 (NIKHEF) | | On May 10th the dCache head-node services will be moved to new hardware |
| PIC | dCache 1.9.5-17 | On 27/04/2010 upgraded from v15 to v17 and temporarily disabled tape protection to allow CMS to access files on tape with the dcap protocol; waiting both for the dCache patch that allows tape protection to be set per VO and for the CMSSW debugging of gsidcap access | none |
| RAL | CASTOR 2.1.7-27 (stagers); CASTOR 2.1.8-3 (name server central node); CASTOR 2.1.8-17 (name server local node on SRM machines); CASTOR 2.1.8-8, 2.1.8-14 and 2.1.9-1 (tape servers); SRM 2.8-2 | none | none |
| TRIUMF | dCache 1.9.5-17 with Chimera namespace | dCache upgrade | |

Other Tier-0/1 issues

  • CNAF: 28/4: moved the ATLAS LFC database to new hardware.
  • CNAF: ATLAS conditions database to be moved to new hardware (date to be agreed with ATLAS).
  • CNAF: FTS database to be moved to new hardware (probably next LHC stop).

CASTOR news

CASTOR SRM 2.9-3 has been released (and deployed) and is available in the Savannah release area. Release notes and upgrade instructions are available.

dCache news

Two major issues have been fixed in dCache 1.9.5-18, to be released this week:
  • GridFTP(2) and small files (e.g. at BNL): a fix has been tested at BNL and seems to have solved the issue.
  • For quite some time sites have reported that in a running system, all of a sudden, members of a particular CA were no longer accepted by dCache, and only a restart could reactivate the CA.

StoRM news

LFC news

LFC 1.7.4 is in certification; it fixes a bug seen by BNL and implements some bulk lookup methods.

FTS

The problem seen from time to time at TRIUMF is being investigated: source files at TRIUMF are deleted as a result of an attempted transfer to a remote site. It is not clear whether the problem is at the FTS or at the SRM level.

CERN would like to decommission myproxy-fts as soon as possible and is waiting for the green light from the experiments. ATLAS and CMS have already agreed.

Experiment issues

  • ATLAS: understanding the problem mentioned above at TRIUMF is the biggest concern.
  • LHCb: a problem is seen at some dCache sites (PIC and IN2P3), whereby the ID of an SRM BringOnline request is lost. To be investigated.

WLCG Baseline Versions

Conditions data access and related services

Database services

  • ATLAS: Florbela asked when the ATLAS archive database will be moved to new hardware. (Eva) Next week the test databases will be moved; in 2-3 weeks the integration and archive databases. The delay is caused by instabilities found on the new hardware. The ATLAS archive database will be first in the list.
  • CMS: New hardware for P5 to be prepared. Thinking about splitting the online database in two. Request to upgrade one of them to 11g and profit from Active Data Guard for conditions replication.
    • (Eva) IT-DB has no plans for a major upgrade (to 11g) until the LHC stop, to avoid any instabilities that might be caused by the new Oracle version in the databases. CMS can decide to upgrade to 11g at their own risk; we do not have enough experience to run the service. Even if one of the development or integration databases is upgraded to validate the applications, problems might arise on the production system.
  • Sites reports:
    • RAL (Carmine): April security patches are being discussed for 3D and CASTOR. No dates yet.
      • New hardware for the TAGs database is being prepared.
      • Licenses: a request has been sent to the person responsible for Oracle licenses at CERN and we are waiting for the answer.
      • Tier1 monitoring status: development finished; to agree when to deploy at the rest of the sites.
    • BNL (Carlos): Request for new licenses. Also, for licenses requested in 2006, the support contract is about to finish; how to renew it?
      • Eva will ask for information and send an email. Are other sites affected by the support contract ending?
    • Triumf (Andrew): ntr
    • SARA (mail): downtime on April 20th due to network maintenance that lasted from 8:00 until 15:00 local time. Since Oracle RAC was down anyway, we took the opportunity to upgrade the kernels of the cluster nodes.
    • CNAF (Alessandro): The LFC database was migrated to new hardware. The conditions database is still on the old hardware; problems were found with the migration procedure and help has been requested from CERN for the future migration. Alessandro will collect all the information and send it by email to set up a phone call.
    • NDGF (Jon): ntr
    • ASGC: archived log retention changed to 15 days. Problems are still observed: the archived log area fills up. Eva suggested reviewing the RMAN scripts.
    • PIC (Elena): ntr
    • GRIDKA: LHCb requested a Squid installation.

  • Tier0 - Streams:
    • 9th April: a bug affecting ATLAS Streams replication between the online and offline databases was found; the whole replication was stuck. The workaround was to reboot one of the ATLAS online database nodes (where the Streams processes were running). The problem took longer to fix because ATLAS could not find the person to give the green light to reboot the node.
      • There is a patch available for this bug; it will be applied together with the security patches.
      • The patch will be applied on all databases.
    • High memory consumption from one of the Streams queue monitor processes has been observed on the LHCb downstream capture. It does not affect replication. The problem is being investigated by Oracle.
    • The LHCb downstream capture for conditions has been stuck since midday; the problem is being looked at (see the sketch below).
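
A quick way to spot a stuck capture such as the LHCb one is to query the Streams dictionary views on the downstream database. The sketch below uses cx_Oracle; the connect string is a placeholder, and the views and columns queried (DBA_CAPTURE, V$STREAMS_CAPTURE) should be confirmed against the Oracle version in use.

```python
# Hedged sketch: check the state of Oracle Streams capture processes on a
# downstream database. The connect string is a placeholder; the view and
# column names are the standard Streams dictionary views and should be
# verified against the Oracle version in use.
import cx_Oracle  # assumes the cx_Oracle client library is installed

# Placeholder credentials / TNS alias for the downstream capture database.
connection = cx_Oracle.connect("strmadmin/secret@LHCB_DOWNSTREAM")
cursor = connection.cursor()

# Static view: overall status of each capture process (ENABLED/DISABLED/ABORTED).
cursor.execute("SELECT capture_name, status, error_message FROM dba_capture")
for name, status, error in cursor:
    print(f"{name}: status={status} error={error}")

# Dynamic view: current state of a running capture process.
cursor.execute("SELECT capture_name, state FROM v$streams_capture")
for name, state in cursor:
    print(f"{name}: state={state}")

cursor.close()
connection.close()
```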

AOB

-- JamieShiers - 27-Apr-2010
