Week of 110926

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
  • Change assessments: CASTOR Change Assessments

General Information

  • General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Jhen-Wei, Rob, Scott, Dan, Evgeny/RU-T1, Ivan, MariaG, Ale, Ulrich, Maarten, Eva, Nicolo, Steve, Dirk); remote(Gonzalo/PIC, Ulf/NDGF, Onno/NL-T1, Daniele/CNAF, Lisa/FNAL, Tiju/RAL, Vladimir/LHCb, Dimitri/KIT).

Experiments round table:

  • ATLAS reports - Dan (for Torre)
    • T0/Central Services
      • Large failure counts pulling files from the CERN EOS SRM; the SRM was stuck and was OK after a restart. GGUS:74621
      • CERN Castor file urgently needed for HLT reprocessing was stuck, quickly unstuck (Thank you!), underlying issue still to be understood. GGUS:74626
    • T1 sites
      • TRIUMF job failures, cannot connect to geometry DB. Correlated with CE using CVMFS? CE turned off for production until it's resolved. GGUS:74634
      • SARA-MATRIX dCache node certificate expired, T0 exports failing. Problematic node promptly taken offline. GGUS:74642
      • IN2P3-CC high error rates (~50%) mostly due to put errors and lost heartbeat (non-response from pilots; WN issue). GGUS:74636 (9/25), related to GGUS:74593 (9/22)
      • FZK staging error appears solved. GGUS:74433
      • RAL SRM error solution verified. GGUS:74562 (Correlated with corrupted file problem being investigated by RAL/ATLAS)
      • PIC gridftp error solution verified. GGUS:74578
    • T2 sites
      • Progress on ongoing issue of slow transfers from CERN to AGLT2 (muon calibration data). Problem with network path between CERN and AGLT2 seen. Study continues. GGUS:73463

  • CMS reports -
    • LHC / CMS detector
      • Physics running
    • CERN / central services
      • Sat 25: SLS showing unavailable for several services, due to issue on web server: http://itssb.web.cern.ch/service-incident/sls-shows-grey-services/26-09-2011
      • Sat 25: GGUS:74629 - network issue in CC, causing some critical CMS VOboxes to be unreachable for a few hours. The IT operators correctly acknowledged the related lemon alarms for these machines, but the CMS admins never received the usual automatic notification e-mails - were they also missed due to the network issues?
      • Sun 26: CMSR running with degraded performance, affecting all CMS Offline applications (in particular, PhEDEx and T0 Monitoring). Caused by misbehaving disk, fixed by IT-DB: INC:067344
    • T0:
      • PromptReco - currently 14k jobs acquired by T0.
      • Results of test CMSSW_4_4_0 workflow: one job stuck in infinite loop, software experts are investigating, deployment delayed.
    • CAF:
      • Since all production data is also replicated on EOS, stopping new transfers to CASTOR CMSCAF. Monitoring instructions for CMS computing shifters updated accordingly.
      • Known issue with CNAF-->EOS transfers INC:065462 - transfers failing with gridftp or marker timeout, happening during checksumming of large files.
    • T1 sites:
      • MC production and backfill. Also PromptSkimming at FNAL.
        • Redeploying a needed CMSSW version that had been removed, which was causing workflows to fail.
      • ASGC: SAV:123651 for failure to open input files in MC production - site is restaging the files, and fixing config issues on WNs which have trouble opening the files - still open.
      • IN2P3: SAV:123694 for failure in SRM SAM tests, closed. Note that the issue on the SRM disappeared after one hour, but the tests failed for 6 hours due to a problem in the PFN caching of the VO-specific SRM SAM test.
      • KIT: SAV:123709 - PromptSkim output file missing from storage.
    • T2 sites:
      • MC workflows (also with WMAgent) and analysis
      • T2_PL_Warsaw failing SAM tests, SAV:123647 open since 22/9
      • T2_DE_DESY failing SAM CE tests, SAV:123703 closed.
      • T2_EE_Estonia failing SAM CE tests, SAV:123705 closed.
      • T2_CN_Beijing failing SAM SRM tests, SAV:123704 closed.

  • ALICE reports -
    • T0 site
      • The worker nodes at CERN were downloading the whole AliEn installation; this was a workaround because some of the modules were missing from the 'workernode' package. That has been fixed and it is working well now.
    • T1 sites
      • Nothing to report
    • T2 sites
      • Usual operations

  • LHCb reports -
    • Experiment activities
      • Reconstruction and stripping
      • Finishing on validation of reprocessing applications
    • T0
      • Since the weekend, an increased number of pilots have been aborting (GGUS:74657)
    • T1 sites:
      • SARA: Problem with pilot submission to CREAM-CEs (GGUS:74639)
      • SARA: Problems with file access via protocol (GGUS:74416); nodes were upgraded, and once the above problem is fixed this can be validated.

Sites / Services round table:

  • Gonzalo/PIC - reminder: Wednesday's intervention on the tape systems should be largely transparent; the site will cache writes and has asked experiments to prepare by pre-staging required files. Current staging activity is low.
  • Ulf/NDGF - a disk pool for ATLAS in Sweden was offline over the weekend; it is back online now.
  • Onno/NL-T1 - mass storage is in maintenance today (apologies for the unexpected timing). Also, the Frontier/squid server stopped again after a partition ran full: log rotation was adapted from daily to hourly (a configuration sketch is given at the end of this round table) and the service will soon move to a machine with a larger disk.
  • Daniele/CNAF - ntr
  • Lisa/FNAL - ntr
  • Tiju/RAL - ntr
  • Dimitri/KIT - ntr
  • Jhen-Wei - ntr
  • Rob/OSG - tomorrow morning (afternoon CERN time): OS upgrades and Apache security patches; important services will be upgraded in a rolling fashion.
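
The hourly log rotation mentioned by NL-T1 above can be illustrated with a minimal sketch: since logrotate is normally driven by a daily cron job, one common approach is to give the log its own configuration file and invoke it from an hourly cron entry. The file paths, rotation count and log names below are illustrative assumptions, not NL-T1's actual configuration.

    # /etc/logrotate-frontier-hourly.conf -- hypothetical standalone config,
    # kept outside /etc/logrotate.d/ so the daily run does not also process it
    /var/log/squid/access.log /var/log/squid/cache.log {
        rotate 24          # keep roughly one day of hourly logs
        compress
        missingok
        notifempty
        copytruncate       # avoid having to signal squid to reopen its logs
    }

    # /etc/cron.hourly/frontier-logrotate -- force a rotation every hour
    #!/bin/sh
    /usr/sbin/logrotate -f -s /var/lib/logrotate-frontier.status /etc/logrotate-frontier-hourly.conf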

AOB:

  • MariaDZ: GGUS release this Wednesday, hence there will be test ALARMs according to the standard procedures.
  • MariaG: The service coordination meeting will take place this Thursday.

Tuesday:

Attendance: local(Massimo, Scott, Rob, Dan, Maarten, Alessandro, Dirk, John, Ulrich, Peter, Ivan, John, Eva, MariaD);remote(Ulf-NDGF, Paco-NLT1, JhenWei-ASGC, Burt-FNAL, Tiju-RAL, Xavier-KIT, Lorenzo-CNAF, Andrea, Vladimir-LHCb).

Experiments round table:

  • ATLAS reports - Dan
    • T0/Central Services
    • T1 Sites
      • Lyon added more SCRATCHDISK (GGUS:74636). ADC is still tweaking DDM lifetime/deletion so that the SCRATCHDISK is cleaned more aggressively.
        • The issue with production job put errors is still ongoing (though only 8% errors in Panda now). The dCache developers provided a kernel config workaround and it will be tested next week.
      • T0 exports to RAL failing. ALARM this morning (GGUS:74686). Castor database issue, currently trying to recover.
    • T2 Sites
      • Regarding slow CERN-AGLT2 link (GGUS:73463). The subscriptions were remade to transfer from BNL to AGLT2.

  • CMS reports -
    • LHC / CMS detector
      • Acquired 19pb-1 last night
      • Preparing for a potential PH run this afternoon
    • CERN / central services
      • The CMSR database was not reachable (~2:20 PM), see GGUS:74701: being investigated
      • Follow-up on the network issue in the CC (Sat 25: GGUS:74629): the fact that the CMS admins never received the usual automatic notification e-mails upon VObox/service failures and reboots was still being investigated this morning
      • Successful CMSWEB cluster upgrade this morning
    • T0:
    • T1 sites:
    • T2 sites:
      • MC workflows (also with WMAgent) and analysis
      • T2_PL_Warsaw in downtime (see https://goc.egi.eu/portal/index.php?Page_Type=View_Object&object_id=77456&grid_id=0 ) , yet not reflected in CMS SSB. This is an issue with the downtime collector, being followed up by experts, see SAV:123720
      • T2_BR_UERJ is also in downtime (due to loss of the external connection caused by a twisted fiber), however the downtime is also not reported in the SSB. What occurred is that UERJ is not visible in the BDII, as they are not publishing anything, so the buffered information expired and the site then turned white. This is a feature of the SSB that needs to be changed and CMS will submit a ticket to the Dashboard team (see SAV:123745)

  • ALICE reports -
    • T0 site
      • Usual operations
    • T1 sites
      • RAL: GGUS:74098 reopened. Nagios CREAM tests are failing on one of the CEs. Submission is not working to two of the CEs at the site due to a failure in the proxy delegation.
    • T2 sites
      • Usual operations

  • LHCb reports -
    • Experiment activities
      • Reconstruction and stripping
      • Finishing on validation of reprocessing applications
    • T0
      • Since the weekend, an increased number of pilots have been aborting (GGUS:74657)
    • T1 sites:
      • SARA: Problem with pilot submission to CREAM-CEs (GGUS:74639), ticket closed and verified
      • SARA: Problems with file access via protocol (GGUS:74416), ticket closed and verified

Sites / Services round table:

  • ASGC: NTR
  • BNL : NTR
  • CNAF: Scheduled downtime tomorrow morning (HW). Data on disk will be available also during the intervention.
  • FNAL: Early warning: scheduled downtime Saturday October 15th (power cut - tentatively 8h downtime - more news later this week)
  • IN2P3: NTR
  • KIT: NTR
  • NDGF: NTR
  • NLT1: ATLAS jobs seem stuck on new worker nodes added to an existing CE. Investigating the cause of this behaviour.
  • PIC: NTR
  • RAL: DB issue under control. Rebuild almost finished
  • OSG: Maintenance due to OS upgrades going on on different services. Transparent (should be finished in ~8h)

  • CASTOR/EOS: In-flight checksumming turned out to be difficult for EOS (suspected to generate the CNAF-CERN timeouts). Dirk will follow up on this issue.
  • Dashboards: ntr
  • Databases: The CMS problem is in hand, but solving it required bringing the DB down. It should be finished soon (CMS sent the ticket to keep track of the problem, knowing that experts were already working on it)
  • Grid services: CMS asked for news on the network incident (Uli is still waiting for an answer); it seems the operators were overwhelmed and did not send the notification. The SCAS service blocked early this morning (around 6:00), affecting pilots (via glexec): OK after restart.

AOB:

  • (John Hefferman): Kerberos config: CERN migrated (Spring 2011) to a new Kerberos infrastructure (based on Active Directory) and all managed machines (notably lxplus) were migrated at that time. The old infrastructure (based on Heimdal) has been kept for compatibility and to allow a smooth transition for other clients (notably laptops). We notice that some of the default configurations distributed by other laboratories (for example FNAL) still refer to the old (Heimdal) Kerberos realm. We urge major sites to change their default configuration since Heimdal will be discontinued in the Winter shutdown (no firm dates yet). Instructions for CERN machines: http://linux.web.cern.ch/linux/docs/kerberos-migrate.shtml . More background information can be found under: https://twiki.cern.ch/twiki/bin/view/AFSService/MigrationFAQ (a minimal krb5.conf sketch is given at the end of this AOB section).
  • (MariaDZ)
    • GGUS Release tomorrow.
      • Comments on the test ALARMs, if any, should be recorded in Savannah:122175. Authorised ALARMers, please do not put the test ALARM tickets in status 'verified' before there is a comment in the ticket diary by the operators that the right service was contacted and a comment by the service managers that the ticket reached them correctly. 'verified' is a terminal status that doesn't allow further ticket updates.
      • Highlight of this release is the introduction of the "Type of Problem" field in TEAM and ALARM tickets. Details in Savannah:117206
      • The interfacing of 3rd-level middleware GGUS Support Units (SUs) hosted in CERN/IT/GT to SNOW will start tomorrow with the SU ETICS.
    • GGUS issues (tickets neglected by supporters) for this Thursday's T1SCM should be sent to MariaDZ by tomorrow 5pm.
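
To make the Kerberos point above more concrete, here is a minimal krb5.conf sketch for a client that should use the new AD-based CERN realm. It deliberately relies on DNS lookup rather than hard-coded KDC host names, which is an assumption about the client setup; the authoritative settings are in the CERN instructions linked above.

    # Minimal /etc/krb5.conf sketch for the AD-based CERN realm (illustration only;
    # take the authoritative values from the CERN migration instructions).
    [libdefaults]
        default_realm = CERN.CH
        dns_lookup_kdc = true        # locate the AD KDCs via DNS SRV records
        dns_lookup_realm = false
        forwardable = true

    [domain_realm]
        .cern.ch = CERN.CH
        cern.ch = CERN.CH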

Wednesday

Attendance: local(Edoardo, Stefan, Massimo, Peter, Ulrich, Luca, MariaDZ, MariaG, Lola); remote(Ulf-NDGF, Gonzalo-PIC, Lisa-FNAL, Jhen-Wei-ASGC, Tiju-RAL, Rolf-IN2P3, Pavel-KIT, Lorenzo-CNAF, Kyle-OSG, Ron-NL-T1).

Experiments round table:

  • ATLAS reports - Dan
    • T0/Central Services
      • Transfers into CERN-PROD_LOCALGROUPDISK failing, /var full on FTS server (GGUS:74710). Fixed in AM.
      • Dashboard slowdown (SNOW:INC067829) improved since dashboard devs removed parallelism from queries. Not fully understood, since parallel execution was enabled 4 months ago.
    • T1 sites
      • RAL Castor outage due to oracle db errors. Recovered around 4pm CERN time (GGUS:74686). Was automatically re-enabled in DDM and we see no problems.
    • T2 sites
      • One Panda user had bad code that left thousands of open connections to dCache at LRZ. He was banned from Panda following the security procedures. He responded to the notification and is fixing his code.
    • Other
      • GGUS won't let us solve GGUS:74640. We get error "value too large for column "GUSPROD"."T213"."C803800006" (actual: 56, maximum: 50)". Sent message to GGUS "Contact Us".

  • CMS reports -
    • LHC / CMS detector
      • Acquired 52pb-1 last night (recorded 47pb-1)
    • CERN / central services
      • Tue Sep 27 (16:51) the Tier-0 system was restarted (after the CMSR issue GGUS:74701) and immediately failed to open a file on the Oracle DB (DBS) which was a BLOCKER. After some internal investigation CMS opened TEAM GGUS:74709 at 19:45. The ticket was escalated to ALARM at ~21:00 and the issue was fixed at ~21:30 and the Tier-0 export system was successfully restarted.
    • T0 and CAF:
      • Given that the CMS Tier-0 can keep up (no major backlog) with its current application (CMSSW_4_2_X) and given that the CMS PH management would prefer running as long as possible under current conditions, it was decided to postpone the switch to CMSSW_4_4_X, unless there are dramatic changes in beam conditions. The drawback is the memory consumption (2.5GB/core memory reservation currently on cmst0), hence the reduced CPU efficiency
      • CMS is now in negotiation with the CERN/LSF team to deploy the cmst0 spill-over mechanism onto the CMS public (lxbatch) share
      • EOSCMS migration : (i) user write permission to CASTORCMS/CMSCAFUSER blocked, now need to bulk-migrate all user data to EOS; (ii) read permission from CASTORCMS/CMSCAF now blocked, hence this pool can soon be flushed completely; (iii) CMS is running out of space on EOS... currently capacity is 2PB, while we are expecting a total of 3PB on EOS, to be delivered "by October"
    • T1 sites:
      • Preparing Fall 11 MC production.
      • Other CMS workflows : backfill.
    • T2 sites:
      • MC workflows (also with WMAgent) and analysis
      • 3 open tickets to Dashboard team regarding Site Downtime monitoring : SAV:123746 (T2_US_Purdue), SAV:123720 (T2_PL_Warsaw) and SAV:123745 (T2_BR_UERJ). A fix for the first 2 was made on the SSB DEV instance. For the third case, still discussing an optimal solution.

  • ALICE reports -
    • T0 site
      • ClusterMonitor stopped working on voalice12 and it was not possible to restart the service. The problem was too many semaphores opened and forgotten by the service, so they all had to be removed before the service could be restarted (a cleanup sketch is given at the end of this ALICE report)
      • Added during the meeting: a ticket will be opened to pin down some Nagios problems
    • T1 sites
      • Usual operations
    • T2 sites
      • Usual operations
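
As referenced in the T0 item above, a minimal shell sketch of removing leftover System V semaphores before restarting a service; the service account name and the restart command are placeholders, not the actual ALICE setup.

    #!/bin/sh
    # Remove System V semaphores left behind by a crashed service, then restart it.
    # The account name and restart command below are placeholders.
    SVC_USER=alicesgm

    ipcs -s                                     # inspect current semaphore arrays
    ipcs -s | awk -v u="$SVC_USER" '$3 == u {print $2}' \
        | xargs -r -n1 ipcrm -s                 # delete each leftover semaphore id

    # service clustermonitor restart            # placeholder restart command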

  • LHCb reports -
    • Experiment activities
      • Reconstruction and stripping
      • Finishing on validation of reprocessing applications
    • T0
      • Since the weekend, an increased number of pilots had been aborting (GGUS:74657); the issue is fixed and the ticket verified
    • T1 sites:
      • IN2P3: Increased number of stalled jobs (GGUS:74733). Pilots are submitted correctly but finish after a few minutes; subsequently the payload, which continues to run, is found to have no pilot attached and is set to failed.
      • PIC: problems with staging of files; lcg-bringonline returns an error message while staging files, but the files are actually staged (a client-side check is sketched at the end of this LHCb report)
      • GridKa: Observing problems with jobs with many input files; mostly user jobs are affected by this.
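
A minimal sketch of the client-side cross-check referenced in the PIC item above: issue the bring-online request, record its exit code, and then query the file's SRM locality independently. The SURL is a placeholder and the exact options may differ between lcg_util versions.

    #!/bin/sh
    # Cross-check the staging symptom: request the file online, then verify its
    # locality regardless of what the bring-online call reported.
    SURL="srm://srm.example.pic.es/path/to/file"    # placeholder SURL

    lcg-bringonline "$SURL"
    echo "lcg-bringonline exit code: $?"

    # An SRM listing with locality information shows whether the file is
    # actually ONLINE, independently of the exit code above.
    lcg-ls -l "$SURL"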

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: Early warning: a long outage is foreseen for October 4th.
  • KIT: ntr
  • NDGF: Network upgrade in Bergen (Friday at 11:00). This could briefly affect data access for ALICE and ATLAS.
  • NLT1: Still investigating the worker node problem mentioned yesterday
  • PIC: Acknowledged the LHCb problem. There is an intervention (almost finished) on the Enstore part.
  • RAL: ntr
  • OSG: ntr

  • CASTOR/EOS:
    • EOS disk being added to CMS (500 TB today, 500 TB probably early next week). Intervention scheduled for tomorrow (transparent intervention - discussed with ATLAS and EOS).
    • T0EXPORT (CMS) being serviced: available space could oscillate
    • Can we shut down completely ATLPROD (ATLAS)?
  • Dashboards: The overload is not understood. Luca mentioned that a few days ago the load went up by a factor of 10. The situation is now OK; the root cause is under investigation
  • Databases: SIR will be prepared for the recent problems affecting CMS
  • Grid services: During the recent network hiccup CMS did not get a mail because the operator did not open full tickets (log only mode)

AOB: (MariaDZ) The "Did you know?..." text on the GGUS homepage, today and until the next release, contains the news about the new field "Type of Problem" in TEAM and ALARM tickets. Please see the GGUS homepage to read it.

Thursday

Attendance: local(Ivan, Massimo, Peter, Ulrich, Dirk, Elisa, Dirk, Dan, Andrea, Stefan, I Ueda, Lola, MariaG, MariaDZ, Edoardo, Alexei); remote(Ulf-NDGF, Lisa-FNAL, Daniele-CNAF, Rolf-IN2P3, Paco-NLT1, John-RAL, JhenWei-ASGC, Kyle-OSG).

Experiments round table:

  • ATLAS reports - Dan
    • T0/Central Services
      • Panda/EOS jobs failed with Error:/bin/mkdir: cannot create directory '/eos/atlas': Permission denied today around 9:30 UTC. Part of transparent intervention?
    • T1 sites
      • RAL failed to get file for panda job. (GGUS:74748). Related to offline disk server that suffered a double disk failure.
      • PIC had panda jobs failing to put (GGUS:74757). It was related to load in LFC, but recovered so the ticket was closed. (see next item)
      • The Taiwan LFC stopped responding, blocking Panda jobs and DDM transfers. Taiwan was ALARMed (GGUS:74758). They reported a high number of LFC accesses from a few remote hosts all accessing the same entry (/grid/atlas/user/p....). We identified the user: he had 30,000 jobs in Panda, all failing, and up to 9-10,000 running in parallel at that moment. We banned him from Panda and killed all his jobs. Afterwards we checked his code: he has registered his app and data in ~10 SEs worldwide and the 10 LFCs; his job loops over all SE/LFC combinations and does an lcg-cp to the WN. With 10,000 concurrent jobs this DOS'd the LFCs (and SEs?). The user will remain banned until we educate him about acceptable grid usage, and we are discussing security checks in the pilot to prevent similar jobs in the future.
    • T2 sites
      • NTR

  • CMS reports -
    • LHC / CMS detector
      • Acquired 41pb-1 last night (recorded 35pb-1)
    • CERN / central services
      • nothing outstanding to report
    • T0 and CAF:
      • On-going discussions between CMS and IT on how to best set up the Tier-0 overspill to public queues (first manual, then automatic?)
      • EOSCMS: upgrade planned for today at 2 PM to fix the recent problem in transfers with xrd3cp
    • T1 sites:
      • Preparing Fall 11 MC production + backfill activities.
    • T2 sites:
      • MC workflows (also with WMAgent) and analysis
      • T2_FR_CCIN2P3: CMS OPS reported that the (CREAM) site is not accepting the CMS proxy (jobs sent via glideinWMS), see GGUS:74784

  • ALICE reports -
    • T0 site
      • Usual operations
    • T1 sites
      • RAL: A large number of ALICE jobs are waiting at the site and do not move to the running state. Under investigation.
    • T2 sites
      • Usual operations

  • LHCb reports -
    • Experiment activities
      • Reconstruction and stripping
      • Finishing on validation of reprocessing applications
    • T0
      • Castor: Problems with access to disk pools via xrootd protocol (GGUS:74751)
      • Nagios: No probes being sent to prod-lfc-lhcb.ro for the last 48 hours (GGUS:74775)
    • T1 sites:
      • IN2P3: Increased number of stalled jobs (GGUS:74733). Pilots are submitted correctly but finish after a few minutes; subsequently the payload, which continues to run, is found to have no pilot attached and is set to failed.
      • PIC: problems with staging of files: lcg-bringonline returns a non-zero exit code together with the string "DONE", but the files are actually staged
      • GridKa: Observing problems with jobs with many input files when accessing storage via protocol; mostly user jobs are affected by this.

Sites / Services round table:

  • ASGC: ntr
  • BNL : ntr
  • CNAF: ntr
  • FNAL: Two glitches were seen during the GGUS test alarms (mail not being visible, and a wrong assignment). The second one is still not completely understood
  • IN2P3: Patch applied for LHCb problem: please verify
  • KIT: ntr
  • NDGF: The announced downtime will also be used to restart the dCache head nodes
  • NLT1: ntr
  • PIC: Acknowledged LHCb problem (under investigation)
  • RAL:
    • The ALICE issue (pilots failing) seems to be connected to short-lived proxies (expired by the time the job enters execution). At least recent jobs have 48-hour proxies. RAL will delete long-pending jobs and communicate with ALICE to see if the problem is solved for the most recent jobs.
    • For ATLAS: the problematic disk server is now in standby (files available again)
    • Larger-capacity tapes entering production
  • OSG: ntr

  • CASTOR/EOS: CASTOR 2.1.11-6 next Monday for CMS. EOS upgrade for ATLAS and CMS going on now
  • Dashboards: NTR
  • Databases:
    • Problems affecting CMS: a SIR has been prepared. Some clarification on the timing of the incident is needed.
    • PDBR production database partially unavailable: a disk array problem; the database became partially unavailable (one tablespace dropped). Quickly back in production.
  • Grid services: Two CEs had problems last night. One could be restored immediately; the second one could only be recovered this morning.
  • Network: The CERN-AGLT2 problem is being debugged. Drilled down to a problem on the CERN-Chicago relay

AOB: (MariaDZ) 20 test ALARMs were launched and handled smoothly. There is one exchange pending with NL-T1 in GGUS:74723 and a recommendation to testers and operators to clearly 'simulate' a specific service name. Lisa's (FNAL) point is being followed up in https://savannah.cern.ch/support/?122175

Friday

Attendance: local(Massimo, Dan, Elisa, Peter, Ignacio, Alessandro, Ivan, Eva, Maarten); remote(Michael-BNL, Burt-FNAL, Onno-NLT1, JhenWei-ASGC, Kyle-OSG, Gareth-RAL, Xavier-KIT, Rolf-IN2P3).

Experiments round table:

  • ATLAS reports - Dan
    • T0/Central Services
      • CC_voatlas61 dropped out of the DNS load balancer (CC writer). (SNOW:INC068591). Was a load balancer outage. Fixed in the AM.
    • T1 sites
      • SARA gsidcap went down after the CA update to egi-trust-anchor 1.41-1 (GGUS:74808). Rolled back to 1.40-1 and it recovered. A ticket is open with EGI; v1.42 does not fix the problem. The suggestion from SARA is that dCache sites running gsidcap services should stick to 1.40-1 for the time being (a yum pinning sketch is given at the end of this ATLAS report).
      • RAL disk server went down again, reopened (GGUS:74748)
      • FR cloud has a conditions database access problem due to misconfiguration (CondDB env set to use CVMFS where not available) (BUG:87295)
    • T2 sites
      • NTR
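
A minimal sketch of how a site could follow the suggestion above and hold the trust anchors at 1.40-1, assuming the EGI bundle is installed as yum packages (the ca-policy-egi-core meta-package plus the individual ca_* RPMs); the package names and method are assumptions, to be adapted to the site's actual installation.

    # Keep a dCache/gsidcap node on the 1.40-1 EGI trust anchors for now (sketch).
    # If the node was already upgraded, first roll back by reinstalling from the
    # 1.40-1 repository (site-specific, not shown here).

    # Exclude the CA packages from a routine update run:
    yum --exclude='ca_*' --exclude='ca-policy-egi-core*' update

    # ...or pin them persistently by adding an exclude line to /etc/yum.conf:
    #   exclude=ca_* ca-policy-egi-core*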

  • CMS reports -
    • LHC / CMS detector
      • 10h fill yesterday afternoon/night: acquired 75pb-1 last night (recorded 78pb-1)
    • CERN / central services
      • nothing outstanding to report
    • T0 and CAF:
      • cmst0 : very busy processing data !
    • T1 sites:
      • High failure rates for data transfers from CCIN2P3 to many sites, see GGUS:74834
      • Fall 11 MC reprocessing: done with > 1 billion events (re-DIGI, i.e. adding the pile-up). Will start re-RECO this weekend.
    • T2 sites:
      • All transfers to/from T2_AT_Vienna failing with "connection error", see Savannah:123804 (no response from site admins yet)
      • MC workflows (also with WMAgent) and analysis

  • ALICE reports -
    • T0 site
      • Yesterday afternoon there were leftover PROOF processes from users that were consuming memory. It was not possible to get rid of them by restarting PROOF (which usually works) or by killing them. The machines were rebooted and came back in bad shape: haldaemon was not working and a stale PID file had been left over. After removing those files and restarting again it was still not working: haldaemon must be started after messagebus, which for a reason not yet understood was not running, probably because smartd was failing. After restarting smartd, messagebus and haldaemon, everything is in shape again (the restart sequence is sketched at the end of this ALICE report).
    • T1 sites
      • RAL: still problems with proxy delegation to one of the CREAM CEs, lcgce03.gridpp.rl.ac.uk. Any news about that?
    • T2 sites
      • Usual operations
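
A minimal sketch of the recovery sequence described in the T0 item above, using SL5-style init scripts; the PID file path and service names follow common defaults and are assumptions about the actual nodes.

    #!/bin/sh
    # Recovery order sketched from the report: smartd first, then messagebus
    # (D-Bus), then haldaemon, after clearing the stale haldaemon PID file.
    rm -f /var/run/haldaemon.pid       # remove the leftover PID file (assumed path)

    service smartd restart
    service messagebus restart         # D-Bus must be up before haldaemon
    service haldaemon start

    service haldaemon status           # verify the daemon is running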

  • LHCb reports -
    • Experiment activities
      • Reconstruction and stripping
      • Starting reprocessing activities
    • T0
      • Castor: Problems with access to disk pools via xrootd protocol (GGUS:74751), waiting for user reply
      • Nagios: No probes being sent to prod-lfc-lhcb.ro (GGUS:74775)
    • T1 sites:
      • IN2P3: Increased number of stalled jobs (GGUS:74733), upgrade of CREAM-CE has fixed the problem.
      • PIC: problems with staging of files: lcg-bringonline returns a non-zero exit code together with the string "DONE", but the files are actually staged
      • GridKa: Observing problems with jobs with many input files when accessing storage via protocol; mostly user jobs are affected by this.
      • GridKa: missing results from Nagios tests for the CEs; fixed yesterday around 8 PM

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: A power failure may be connected to some incidents seen by CMS and LHCb. Investigating whether this is their root cause.
  • KIT: ntr
  • NDGF: ntr
  • NLT1: ntr
  • PIC: ntr
  • RAL: The box being drained has some problems serving certain files (see Thursday's report). Investigating.
  • OSG: ntr

  • CASTOR/EOS: During the meeting an ALARM ticket was received (ATLAS/T0MERGE). Investigating.
  • Dashboards: ntr
  • Databases: A node (CMS instance) had to be rebooted just before the meeting. Investigating to disentangle HW problems from application problems/high load. Some new logging is in place and this should help in pinning down these problems.

AOB:

-- JamieShiers - 25-Aug-2011
