Week of 100628

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE ATLAS CMS LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Stephane, Elena, Jamie, Markus, Alessandro, Jacek, Jean-Philippe, Patricia, Maarten, Andrea, Ricardo, Miguel, Maria, MariaDZ);remote(Jon, Gang, Roberto, Angela, Rolf, John, Jeremy, Thomas, Brian, Rob, Ron, Davide(?)).

Experiments round table:

  • ATLAS reports -
    • ATLAS data taking: "physics" run (data10_7TeV) when beam is on
    • CERN-CASTOR: ALARM ticket GGUS:59441 (Monday morning): problems exporting data from T0; all attempts to write to and read from T0 were timing out.
      • The problem was identified as very high levels of logging. The logging daemon was reset; the CASTOR team is investigating a permanent solution.
    • SARA-MATRIX: ALARM tickets GGUS:59433, GGUS:59435 (DATATAPE disk buffer in front of tape full - Sunday): problem with the tape system at SARA-MATRIX. It affected raw data export. Fixed on Sunday evening. [ There was also a problem submitting the alarm: it was submitted at 15:00 UTC, no acknowledgement came, so a duplicate was submitted after 3 hours. An e-mail was sent to GGUS support; the reply was that the alarm ticket was sent to grid-support@sara, who have confirmation of receipt at 15:47. Ale: GGUS was used to send the alarm but it is not clear whether the alarm was correctly sent. ]
    • Taiwan-LCG2: an SRM problem affected data export from T0 on Sunday, TEAM ticket GGUS:59431. Fixed within 1 hour.
    • LYON LFC: ongoing problem accessing the IN2P3 LFC, GGUS:59344. IN2P3 claims the problem is fixed (there was a problem with the CRL update at CERN, GGUS:59349; as the CNRS CRL is the first in the CRL list, that affected the French sites for any operation from CERN. A workaround was implemented), but we still see client-side error messages in the logs of the ATLAS Site Services. [ Rolf: the corresponding people are still working on this, nothing else to add; waiting for the analysis of the problem. ] [ Miguel: what exactly is the CRL problem, and which services are involved? Ale: to be discussed offline; this was reported every day last week and Steve Traylen is aware. The certificate revocation data was not correctly updated on the AFS UI at CERN; it is not clear what the correlation is between CERN and Lyon. ]
      • How can we proceed?
    • Problem when submitting alarm tickets GGUS:59433, GGUS:59435: the GGUS developers and the SARA-MATRIX admins have been asked.

  • CMS reports -
    • CERN Services
    • T0 Highlights
      • More data expected this week, starting tonight.
      • Latest schedule update: the next two fills will have 3x3 bunches (2 colliding bunches in the experiment), instantaneous luminosity up to 5E29 (estimate).
      • might stress T0 a little...
    • T1 Highlights
      • tails of MC reprocessing
      • skimming workflows
    • T2 Highlights
      • MC production as usual
    • The French CRLs not being updated in AFS affects CMS heavily: French sites are having trouble with the SAM tests because of this and are all below threshold; this will have to be corrected as "this is not their fault". Ricardo + Maarten: still not understood. The system for updating the AFS CRLs has been "overhauled" but is not yet working properly. (A sketch of the kind of CRL staleness check involved is shown below.)
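
A minimal sketch, in Python, of the kind of check that tells a stale CRL apart from a healthy one. The path is hypothetical and a recent "cryptography" package is assumed; this is not the actual CERN AFS update machinery, just an illustration of why an out-of-date CRL makes otherwise valid credentials fail.

      # Minimal sketch (hypothetical path): check whether a locally installed
      # CRL has gone stale, which makes clients reject otherwise valid credentials.
      from datetime import datetime
      from cryptography import x509

      CRL_PATH = "/etc/grid-security/certificates/example-ca.r0"  # hypothetical

      with open(CRL_PATH, "rb") as f:
          raw = f.read()

      try:
          crl = x509.load_pem_x509_crl(raw)
      except ValueError:
          crl = x509.load_der_x509_crl(raw)   # some CAs ship DER-encoded CRLs

      if crl.next_update < datetime.utcnow():
          print("STALE: nextUpdate was", crl.next_update, "- refresh needed")
      else:
          print("OK: CRL valid until", crl.next_update)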

  • ALICE reports - GENERAL INFORMATION: Pass 1 reconstruction + five analysis trains + MC production. Raw data transfers ongoing during the whole weekend (peak around 40 MB/s).
    • T0 site
      • Friday afternoon: the CAF experts announced that the ALICE PROOF master node (lxbsq1409.cern.ch) was down/unreachable. A ticket was opened by the operator (but with low priority). A reminder was sent this morning by the CAF experts; the issue is still pending.
      • All services on the CREAM VOBOXes were restarted to resume production.
    • T1 sites
      • CNAF: ce07 (2nd CREAM service) not performing well: each submission takes too long and, in addition, jobs stay in status REGISTERED forever. The ALICE contact person at the site has been contacted.
    • T2 sites
      • Hiroshima: hardware operation on storage is foreseen for today. The SE might be unreachable during the operation which could cause some failures in the jobs running at the site.

  • LHCb reports -
    • LHCb Week in St. Petersburg.
    • Received some data last night before the cryogenics issue. Reconstruction of this new data + reprocessing of old data. Due to a misconfiguration in DIRAC, many jobs ended up in queues (mainly at GridKa and IN2P3) too short to hold them and failed because they were killed by the LRMS. Problem fixed.
    • Reconstruction jobs over the new data are taking rather long to process events whose number of tracks (and hence number of combinations) is very high. Some events are real killers (up to 30 seconds per event).
    • No MC productions running.
    • T0 site issues:
      • SAM critical tests have been failing for several days at CERN (GGUS:59452).
      • Opened a GGUS ticket (GGUS:59422) to CERN to investigate some transfers from UK sites (and not only) failing there with the error below (this could provide some hints for the investigation of the UK sites upload issue reported elsewhere). [ Maybe the problem stopped some weeks ago? ]
             globus_xio: System error in writev: Broken pipe
      • Request to be notified when a job runs out of CPUTime at CERN is on hold. RT690202
    • T1 site issues:
      • SARA: another disk server seems to have been down during the last weekend with some data not available (GGUS:59447). This is not the same disk server reported by Onno on Friday.
      • RAL: Observed some degradation on the SARA-RAL transfer channel. The replication takes very long and the channel is particularly slow compared with any other channel to RAL set up at RAL (or at CERN) (GGUS:59397). Any progress? [ Brian: some disk servers don't have standard WAN tuning and don't support jumbo frames, so it could be an issue of packet splitting. Do other Tier1s (other than SARA) support jumbo frames? Do other Tier1s see a problem? Roberto: only SARA -> RAL. ]
      • IN2P3: Some users report problems fetching data out of there (GGUS:59412). Issue related to the wrongly installed French CA on the AFS UI. [ Markus: this ticket has been closed ]
    • T2 sites issues:
      • none

Sites / Services round table:

  • FNAL - ntr
  • ASGC - ntr
  • KIT - recently saw some problems with CREAM CEs: Tomcat hanging, correlated with a rising number of connections in CLOSE_WAIT. Seen for the last 5-6 days; had to restart Tomcat. Affects mainly 2 CEs - do other sites see similar behaviour? Maarten: did you increase the JVM memory size? That was found to be necessary at CERN and CNAF.
  • IN2P3 - ntr
  • RAL - currently in at risk until Wednesday.
  • NDGF - ntr
  • NL-T1 - as Roberto said, 1 dCache pool node is down; it has difficulty mounting its filesystem and the vendor is looking at it. The list of temporarily missing SURLs was sent to LHCb. Mass storage problems this weekend: 1 out of the 4 nodes used to copy data to tape was down, which caused performance degradation - the logic used in the script that copies data to tape did not respond well to this one node being out. Fixed yesterday.
  • CNAF - having problem with CREAM CE as mentioned before by other sites - under investigation by developers
  • GridPP - ntr
  • OSG - handful of tickets over w/e.

  • CERN CASTOR - an incident, as reported by ATLAS and CMS, during the night. Issue with logging, which slowed down request processing and generated some timeouts - will produce a SIR.

  • CERN DB - incident that affected ATLAS offline on Saturday: at 15:00 a spike of load on one node of ATLAS offline - the node was evicted and rebooted. 9 applications were affected. They could not reconnect as the Oracle services were not relocated properly by Oracle Clusterware - manual intervention, fixed around 16:00 on Saturday. A bug with similar symptoms has been reported to Oracle - checking if this also affected us.

AOB:

  • NDGF - until this morning the test alarm from last Wednesday had not been acknowledged by the site. A: there was a new operator who didn't understand the process; since acknowledged.

  • NL-T1 - ATLAS alarm of 27th on dCache. Some problems in alarm flow.

Tuesday:

Attendance: local(Ricardo, Jean-Philippe, Harry, Jaroslava, Patricia, Roberto, Maarten, Alessandro, Jacek, Markus, MariaD, Miguel);remote(Jon/FNAL, Angela/KIT, Vladimir/CNAF, Gang/ASGC, Gonzalo/PIC, Rolf/IN2P3, Ronald/NLT1, John/RAL, Jeremy/GridPP, Rob/OSG).

Experiments round table:

  • ATLAS reports -
    • Nothing to report

  • CMS reports -
    • Issues
    • T0 Highlights
      • processing incoming data
    • T1 Highlights
      • skimming workflows
    • T2 Highlights
      • MC production as usual in tails

  • ALICE reports -
    • GENERAL INFORMATION: Pass 2 reconstruction activities (7 TeV) restarted yesterday evening. Increasing the activity at the sites
    • T0 site
      • Pass1 reconstruction activities ongoing at CERN
    • T1 sites
      • CNAF: Yesterday we reported "lazy" performance of one of the two CREAM systems at the site (ce07): submission took too long per job and jobs stayed in status REGISTERED forever. The site experts reported in the evening a misbehaviour of the proxy-delegation procedure executed by AliEn on the mentioned CE (the user delegation procedure was executed once per second, while it should happen once per hour; see the sketch after this list). This is a very rare issue that has been observed at sites with 2 VOBOXes. The activity was stopped at CNAF for a few seconds and afterwards all services were restarted, carefully following the behaviour of the proxy delegation to the CREAM system. Further investigation is needed at the level of the AliEn code.
    • T2 site
      • Spain T2 sites: Both out of production today.
        • Trujillo: CREAM-CE not responding
        • Madrid: AliEn user credentials expired; services were restarted with new credentials this morning
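
Referring to the CNAF item above: a minimal sketch, in Python rather than AliEn code, of the intended once-per-hour throttling of proxy delegation. The `delegate` argument stands for a hypothetical delegation routine; the real fix sits inside AliEn.

      # Minimal sketch, not the AliEn fix: run the (expensive) proxy delegation
      # at most once per hour instead of on every submission cycle.
      import time

      DELEGATION_INTERVAL = 3600.0   # seconds: the intended once-per-hour cadence
      _last_delegation = 0.0

      def maybe_delegate(delegate):
          """Call delegate() (hypothetical delegation routine) at most hourly."""
          global _last_delegation
          now = time.time()
          if now - _last_delegation < DELEGATION_INTERVAL:
              return False             # delegation still fresh, skip
          delegate()                   # the actual delegation to the CREAM CE
          _last_delegation = now
          return True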

  • LHCb reports -
    • Experiment activities:
      • No MC productions running. Reconstruction and reprocessing jobs (and merging) are running to completion. Mainly waiting for new data.
    • Issues at the sites and services
      • T0 site issues:
        • SAM critical tests have been failing for several days at CERN (GGUS:59452). Any progress? Still with the CERN ROC.
        • No progress on the GGUS ticket (GGUS:59422) to CERN to investigate some transfers from UK sites (and not only) failing there with the error "globus_xio: System error in writev: Broken pipe" (this could provide some hints for the investigation of the UK sites upload issue reported elsewhere). Still being investigated.
        • Both gLite WMS instances for LHCb have been upgraded with the latest patch (https://savannah.cern.ch/patch/index.php?3621), which will be ready in a few days to be picked up by all sites supporting LHCb (and we warmly suggest they upgrade).
      • T1 site issues:
        • SARA: As of today, the two-disk-server issue reported last week is the main reason currently preventing LHCb from carrying the reco-stripping to completion (GGUS:59447). Is there any forecast for when we can have these disk servers back? Reply from Ronald: one disk server was put back in production last Saturday, the other was put back earlier this afternoon.
        • RAL/SARA: Investigations of the observed degradation of the SARA-RAL transfer channel are continuing. Many thanks to the people involved at both sites. (GGUS:59397)
      • T2 sites issues:
        • none

Sites / Services round table:

  • FNAL: ntr
  • KIT: ntr
  • CNAF: ntr
  • ASGC: ntr
  • PIC: ntr
  • IN2P3: ntr
  • NLT1: ntr
  • RAL: the following large-scale test will be carried out in preparation for the decommissioning of GOCDB3: between 12/07 15:00 UTC and 15/07 15:00 UTC, all GOCDBPI queries will be pointed to GOCDB4. The purpose of this test is to verify the complete interoperability of all tools in a production environment before the actual decommissioning of GOCDBPI_v3. The rough timeline for the complete decommissioning of GOCDB3 is available at http://goc.grid.sinica.edu.tw/gocwiki/GOCDB4_development.
  • GridPP: it seems that ATLAS and CMS have dependencies on GOCDB 3.0. Could these experiments check?
  • OSG: tests of the stability of the SAM BDII and the production BDII have continued. The failure rate for the SAM BDII is higher (3 failures out of 1000 tests), and this is consistent with January's statistics. In case of failure the BDII does not seem to contain any data. Investigation continues.

  • CERN central services
    • DBs: STREAMS replication to ASGC failing since last night after a DB intervention at ASGC.

AOB: (MariaDZ) The late or non-delivery of GGUS email notifications is being debugged in GGUS:59485. Ron acknowledged that KIT sent out the mails as soon as the ticket was created. The investigation is now between him and the NIKHEF mail server managers, who are checking their logs. The Savannah ticket will be closed as the mail notifications were sent.

Wednesday:

Attendance: local(Nilo, Flavia, Lola, Simone, Patricia, Miguel, Jamie, Maria, Jerka, Harry, Ale, Zbyszek, Ricardo, Edoardo, Roberto);remote(Michael, Jon, Gonzalo, Federico Stagni (LHCb), Gang, Rolf, Jeremy, Onno, Rob, CNAF).

Experiments round table:

  • ATLAS reports -
    • Yesterday evening all CASTORATLAS services were affected by an AFS issue: we could not retrieve files from nor copy files to the T0ATLAS and T0MERGE pools, and we could not export data from CERN. ALARM ticket GGUS:59546 was created. The SLS monitors of all ServicesForATLAS were either grey or red; AFS, LXBUILD and CASTOR were down and grey; the ATLAS Online and Offline DBs, Storage space and Central Services were grey on SLS monitoring but were running. The AFS issue was resolved at around 23:45 CEST and the problems with data export went away afterwards. Thank you very much! Can CERN please comment on this AFS issue in the "round the table" part of the meeting?
    • Yesterday afternoon the VOMS proxy of one of the certificates used for FTS transfers expired. The issue was understood and fixed within 15 minutes of the first observation.
    • LFC at IN2P3-CC issue: We have uploaded a plot with number of registrations and number of errors in time to GGUS:59344. We have done 2 observations based on the failure occurrence:
      1. Failures definitely seem to be decoupled from CRL updates.
      2. We also do not see any correlation with the possible load caused by our DDM site services registration. Can IN2P3-CC LFC team please have a further look? Thanks!
    • LFC - IN2P3 - local experts in contact with Jean-Philippe Baud on this problem

  • CMS reports - Apologies - changeover day

  • ALICE reports - GENERAL INFORMATION: High Grid activity with more than 23K concurrent jobs (~80% coming from MC production).
    • T0 site
      • Peaks around 5000 concurrent jobs, good behavior of the Alice resources
    • T1 sites
      • RAL: Reinstallation of the AliEn software at the site (running an obsolete version). The site is back in production
      • CNAF: The ALICE contact person at the site has announced a very large number of jobs submitted to the site (8000). Both information sources normally used by ALICE were failing this morning: the CREAM DB was not responding and, in addition, the VOView information provided by the ALICE queue was showing no running/waiting jobs. As a preventive measure we have stopped the CE service at the site until the problem is solved and the queue is drained. [ CNAF - continuing investigation of this problem ]
    • T2 site
      • Madrid: missing packages on the WNs (gfortran). Queue blocked from the central services and the problem announced to the ALICE contact person at the site.
      • Trujillo: site out of production, CREAM is still refusing any connection.
      • Hiroshima: the blparser service was not alive this morning. Announced to the site admin.
      • PNPI: The site has been out of production for two days. When the services were restarted, CREAM was still failing at the site. Reported to the ALICE contact person.

  • LHCb reports -
    • Data taking since last night. Reconstruction of the new data taken over the weekend seems to have a problem with increased memory consumption by these jobs (and then the LRMSs killing them). This has to be investigated further on the LHCb application side.
    • GOCDB v4: some instability in querying the PI (GGUS:59560)
    • T0 site issues:
      • Many pilots aborted on the ce201.cern.ch CREAM CE. GGUS:59559
      • Affected by yesterday's AFS outage like all other VOs, both in transferring data and in our shared areas. The volumes hosting the LHCb shared area have now been recovered and SAM jobs have verified that the software is properly available to grid jobs.
      • SAM critical tests have been failing for several days at CERN (GGUS:59452). Any progress? [ Identified, an update should come soon; now fixed ]
      • No progress on the GGUS ticket (GGUS:59422) to CERN to investigate some transfers from UK sites (and not only) failing there with the error below (this could provide some hints for the investigation of the UK sites upload issue reported elsewhere). [ Not an easy problem - still being investigated ]
             globus_xio: System error in writev: Broken pipe
      • Both gLite WMS instances for LHCb have been upgraded with the latest patch (https://savannah.cern.ch/patch/index.php?3621), which will be ready in a few days to be picked up by all sites supporting LHCb (and we warmly suggest they upgrade).
    • T1 site issues:
      • IN2P3: issue with almost all files in a run whose locality was not being properly reported, preventing reco jobs from being submitted. GGUS:59558
      • CNAF: jobs do not manage to grab computing resources, preventing a couple of productions from running to completion. It looks like a share issue, with very few jobs running and 90 jobs queued locally in the LRMS. GGUS:59562
      • CNAF: similarly to the CERN ce201 issue, many jobs are aborting against ce07 at CNAF with a proxy expiration issue. GGUS:59568
      • SARA: confirmation that the two disk servers are back to life
    • T2 sites issues:
      • none

Sites / Services round table:

  • BNL - observed high load on the namespace of the storage system. Found user analysis jobs with very long file names, which cannot be handled by lcg_cp; attempts to create such files fail and are retried, causing the high load. This goes through a wrapper to the storage system, which now implements a check and rejects such requests. The developer of the Panda pilot has been asked to take care of those requests so that these jobs do not start at all. Problem understood and fix in place. Simone - discussed at ATLAS ops on Monday: the conclusion is that there is a problem with those files and protection is needed so such files are not created in the future. The dCache limit is 199 characters; other implementations have limits too. A limit will be put in DDM, both on the basename and on the directory tree (see the sketch after this list); it will have to be done in Panda and Ganga too. Michael - SAM tests: we see tests failing for most sites, Tier2s in particular. It is a grid information system or service problem, not a problem with the site services; sites are marked unknown because the SAM tests fail. The same problem is seen in other clouds, e.g. DE. Ale - a central(?) problem is being observed and is under investigation; tests were not launched in the past 24h.
  • KIT - ntr
  • ASGC - ntr
  • FNAL - ntr
  • PIC - ntr
  • IN2P3 - ntr
  • NL-T1 - a few things: 1) found an ATLAS file that was corrupt (GGUS:59496); we don't have a correct copy - can ATLAS re-copy it? 2) this morning a file server had I/O errors, fixed after a reboot; reported to the vendor. 3) SARA-RAL performance (GGUS:59397): started an iperf server on a file server; RAL is doing some tests towards this service. Reached 800 Mbps, which seems OK for a 1 Gbps line but not for 10 Gbps; the SARA end is 10 Gbps - is there a limit at RAL towards the file servers? Brian: our disk servers are only 1 Gbit connected. Tests from a heavily tuned disk server get 700 Mbps for a single stream; production servers for LHCb give 20-40 Mbps for small files and 100-150 Mbps for larger ones - TCP tuning for the single-stream rate? Onno: then we will stop the iperf service and update GGUS if anything more is needed.
  • RAL - nta
  • CNAF - CREAM CE problem connected with serious bug found in CREAM - developers working to produce a fix.
  • GridPP - for the LHCb transfer problems the culprit may have been found in the TCP settings (duplicate selective acknowledgement packets). Can LHCb increase the number of jobs reaching the site to test? A: OK, there is no MC going on at the moment but we will see.
  • OSG - ntr
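
Referring to the BNL/DDM file-name length discussion above: a minimal sketch of the kind of pre-registration length check being considered. The basename limit is the dCache figure of 199 characters quoted above; the overall path limit is a hypothetical value for illustration only.

      # Minimal sketch: refuse logical file names that exceed the storage limits
      # before they are registered, so jobs never hit the namespace with them.
      import posixpath

      MAX_BASENAME = 199    # limit quoted for dCache above
      MAX_PATH = 1024       # hypothetical overall limit, for illustration only

      def is_registrable(lfn: str) -> bool:
          """Return True if the logical file name respects the length limits."""
          if len(lfn) > MAX_PATH:
              return False
          return len(posixpath.basename(lfn)) <= MAX_BASENAME

      # A pathologically long basename is rejected up front:
      assert not is_registrable("/atlas/user/someuser/" + "x" * 250 + ".root")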

  • CERN - yesterday's AFS issue: a complete disk array failure impacted several volumes containing user directories and software areas (LHCb). All recovered: 3/4 of the arrays recovered directly and 1/4 from backup. Many dependent services, including CASTOR and the CCC, were affected. A SIR is underway. Michael: when the AFS problem happened, LHC page 1 said 'could not inject beam because of AFS problems' - can you comment? Nilo: the problem was caused by a critical machine that copies data to an AFS volume every minute from cron; this created many jobs and blocked the machine. The CCC should not have any dependency on AFS; the cron job will be moved to a less critical machine.

  • CERN DB - yesterday there was a problem with ATLAS replication to ASGC. It looks like a problem after a scheduled intervention: the DB did not come back properly and was blocked, so the apply process could not start; after a reboot at ~17:00 it was OK. The latency of 14h was recovered in less than 3h.

AOB: (MariaDZ) GGUS:59485 has been re-opened; we are trying to find KIT mail log information on the mail to the CERN e-group atlas-adc-expert@cern.ch. If the KIT gateway completed the transmission, we should check the CERN e-groups' logs. [ e-mail problems during alarm tests ] [ Onno: we have asked the mail admins to have a look into this but don't know when we will get an answer. ]

Thursday:

Attendance: local(Lola, Marie-Christine, Simone, Jerka, Maria, Jamie, Patricia, Jean-Philippe, Alessandro, Andrea, Maarten, MariaDZ, Ricardo);remote(Michael, Gonzalo, Roberto, Rolf, Gareth, Gang, Ronald, Thomas, Rob, Angela).

Experiments round table:

  • ALICE reports -
    • T0 site
      • Based on the number of resources currently used by ALICE, PES has offered some extra slots for the experiment.
    • T1 sites
      • CNAF: The problem reported yesterday concerning the wrong information provided by the resource BDII of the local CREAM CEs is SOLVED. The system is reporting the correct information and the site has been reopened for production.
    • T2 sites: Due to electrical maintenance, all computers at IN2P3-LPSC (Grenoble) will be stopped on Wednesday, July 7th from 7:00 am till 6:00 pm. The queues for the ALICE experiment will be disabled on Tuesday, July 6th at 11:00 am; submission services will be stopped in advance.

  • LHCb reports -
    • Reconstruction and stripping of recent (and forthcoming) data will be halted as the Brunel and DaVinci applications are taking up to 60 times longer than foreseen to process each event.
    • Essentially MC production running (20k jobs finished in the last 24 hours with a failure rate of ~7%).
    • T0 site issues:
      • Issue with the CREAM CE requires intervention by the developers (all proxies really are expired inside CREAM). GGUS:59559
    • T1 site issues:
      • (cf. CNAF-T2) many production jobs are also aborting at the T1
    • T2 sites issues:
      • UK sites issue uploading to CERN and the T1s: people from Glasgow seem to have found a solution (see UK_sites_issue.docx) that boils down to network tuning (a sketch of the kind of tuning involved is shown after this list).
      • CNAF-T2: misconfiguration of ce02; many jobs are failing because the pilots abort. GGUS:59621
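
The Glasgow finding above boils down to network tuning; the sketch below shows, with an illustrative buffer size rather than the actual values from UK_sites_issue.docx, the kind of per-socket TCP buffer enlargement typically involved in single-stream WAN tuning.

      # Minimal sketch: enlarge the TCP send/receive buffers for one socket so a
      # single stream can fill a high-latency WAN path. Values are illustrative.
      import socket

      WAN_BUFFER = 4 * 1024 * 1024   # 4 MB, illustrative only

      sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, WAN_BUFFER)
      sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, WAN_BUFFER)

      # The kernel may cap these at net.core.rmem_max / wmem_max; show what was granted.
      print("send buffer granted:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
      print("recv buffer granted:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
      sock.close()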

Sites / Services round table:

  • IN2P3 - ntr
  • RAL - we had some loss of CMS data yesterday from a disk server. It was a tape-backed pool, but some files had not yet been migrated to tape. A hardware problem meant the server was taken out of use, and the actions taken meant data was lost. Post-mortem in progress. CMS aware.
  • FNAL -ntr
  • NL-T1 - a file 2308 bytes long was being read with 8k simultaneous connections. The xrootd traffic cannot be traced back to a job, so we can't find out who is responsible. Maintenance on the SARA SRM on July 26.
  • BNL - ntr
  • PIC - around 12:00 we observed an ATLAS user who had submitted 1300 jobs that all started compiling Athena source from the software area, which collapsed. This had an impact on others too, so the jobs were killed. Should we expect this? To be followed up with someone from ATLAS offline. Simone: please provide the DN of the user and the batch job IDs.
  • ASGC - SIR for DB incident this Tuesday uploaded.
  • NDGF - ATLAS analysis jobs through Panda are not getting through. Problem with the ARC control tower - the person in charge is on vacation but back next week(?)
  • KIT - ntr
  • CNAF - issue with the batch system: since yesterday afternoon LSF has been refusing to schedule jobs and leaves them pending. Troubleshooting in progress; the LSF parameters have to be adapted to the cluster size, which is now > 4000 nodes, including some that are not active. Things are going better - still troubleshooting with support. Roberto: is this behind the problem reported by LHCb today? Alessandro: we have to check. Roberto: we noticed many pilots aborting at CNAF.
  • OSG - OSG SE reporting via BDII will go into testing Tuesday next week. If ATLAS and CMS are OK with it, it will be put into production about a week later. Kyle will attend this meeting for the next period (WLCG workshop etc.). Monday is a holiday in the US too.

AOB:

Friday:

Attendance: local(Flavia, Lola, Simone, Marie-Christine, Andrea, Jamie, Maria, Jerka, Jean-Philippe, Ewan, Miguel, Zbyszek, Harry, Maarten);remote(Jon, Gonzalo, Gang, Michael, Kyle, Onno, Vera, John, Rolf, CNAF=Alessandro, Gareth).

Experiments round table:

  • CMS reports -
    • Issues
      • vocms105 (production frontend, critical services) was off the network around 11:00 this morning. Presently running with vocms65 only.
    • T0 Highlights
      • Data taken last night; processing the incoming data.
    • T1 Highlights
      • skimming workflows
    • T2 Highlights
      • MC production

  • ALICE reports - General Information: High Grid activity with more than 25K concurrent jobs running. Pass 2 and pass 1 reconstruction activities ongoing.
    • T0 Site - CERN: Top priority ticket GGUS:59637. ce201 and ce202 overloaded since last night. All jobs submitted were aborted. Problem under investigation.

Sites / Services round table:

  • FNAL - ntr
  • BNL - high load on the namespace component in the early morning (US time), due to lots of activity on the ATLAS user disk space: 40M hits/minute (or 14?). Trying to optimise access and layout so that such rates can be accommodated, and also trying to understand what caused such a high rate. [ Simone: read or write requests? Michael: read ]
  • ASGC - ntr
  • NL-T1 - 2 issues: 1) at NIKHEF the SRM was overloaded yesterday by ls operations from a biomed user; other users got timeouts (GGUS:59638). Fixed by asking the user to stop! The current DPM has no real defence against this yet. 2) On the SARA SRM, ALICE could not write files to a pool because the pool had filled up, as files were not being written to tape (GGUS:59639). Two errors: a) many files of 0 bytes - we use a checksum for all files written to tape, and the Java library used has an error when calculating the checksum of empty files: it should report 1 but reports 0. dCache calculates it correctly, but the script uses the Sun Java library and hence the checksums didn't match (see the sketch after this list). b) ALICE uses the xrootd protocol to write files and this is unauthenticated; dCache assigns the files to root and the script didn't like that. The script's mapping was changed from root to ALICE and all files are now written to tape. [ JPB: will provide a way to ban users via a local interface and/or the ARGUS system. Maarten: as discussed in the DPM UF there is a way to ban users for certain operations such as srmls - remove the user from one of the gridmap files; these are generated automatically by a cron job, so a post-processing step is needed. JPB: but the standard gridmap file is not used any more... ]
  • NDGF - ntr
  • RAL - the CMS post-mortem is almost complete. One other issue: a disk array that is part of the Oracle DB behind CASTOR has a problem at the moment, in the area of the redo logs. We will likely announce an outage Monday morning to resolve it, depending on how the checks this afternoon go.
  • IN2P3 - ntr
  • CNAF - the batch system problem reported yesterday should be recovered now. CMS today is reporting problems on the CREAM CE - a request has gone to the developers. Storage side: half of the GridFTP servers were down last night (50% failures for ATLAS), and LHCb storage was also out due to network (switch) failures. Fixed early this morning.
  • PIC - ntr
  • OSG - ntr
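
Regarding the NL-T1 zero-byte checksum mismatch above: assuming the checksum in question is adler-32 (the algorithm whose correct value for empty input is 1), the minimal sketch below shows why a library returning 0 for empty files can never agree with dCache.

      # Minimal sketch: adler-32 of empty input is defined to be 1, so a library
      # that returns 0 for zero-byte files will never match dCache's value.
      import zlib

      def adler32_hex(data: bytes) -> str:
          return format(zlib.adler32(data), "08x")

      assert adler32_hex(b"") == "00000001"   # correct value for an empty file
      print("adler32 of an empty file:", adler32_hex(b""))
      # A buggy implementation reporting "00000000" here makes every zero-byte
      # file fail checksum verification when it is copied to tape.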

AOB:

  • Plan for next week: meeting Monday; looking into Tuesday - depends on flights. Other days most people at WLCG Collaboration workshop. Will announce if calls will be held.

-- JamieShiers - 24-Jun-2010
