Week of 080721

Open Actions from last week:

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Miguel, Simone, Jean-Philippe, Andrea, Harry, Roberto, Diana);remote(Derek, Michael, Michel).

elog review:

Experiments round table:

CMS: General Status: CRUZET-3 is over; it was a good experience, and the cosmic runs are becoming more and more mature exercises for CMS computing in the Tier-0 data handling sector. CERN services during CRUZET-3 were mostly OK. Plans: extend and finalize tests, and prepare for the next cosmic exercise, foreseen for the 2nd half of August. From now on CMS will perform cosmic runs each Wednesday and Thursday.

Tier-0 Status: Work on the P5->CERN transfer system has been finalized; a repacker replay has been running since July 17th, redoing the repack of the CRUZET-3 data. Plans: Next Monday CMS will start more replays with some real Tier-0 prompt-reco testing.

Tier-1 sites Status: The CRUZET-3 AlcaReco exercises at T1 sites were the first time CMS ran on data outside CERN/FNAL. Plans: CMS foresees transfers of the order of a few TB per T1 site from CERN, starting imminently. IN2P3 is custodial for CRUZET-3 (as CNAF was custodial for CRUZET-2).

Tier-2 sites Plans: CMS expects a centrally-triggered large transfer load of many CSA07 MC datasets to CMS T2 sites, a needed step to complete the migration of user analysis to T2 sites. Each T2 should expect to be asked to host a fraction of ~30 TB of these datasets. The good news is that many of the needed datasets are already hosted by at least one Tier-2, so the load may be less than what could in principle be expected. Work is being done to maximize the availability of datasets at T2 sites while triggering the minimum amount of WAN traffic. Subscriptions to T2 sites and the transfers themselves will start soon, quite possibly already this week, so T2 sites should be prepared. Once this is finalized, the use of T1 sites for analysis will start to be banned (i.e. once it is verified that a dataset can be accessed at one or more T2 sites, the dataset will be masked in DBS at the T1 sites, so CRAB jobs will no longer be able to reach T1 sites).

More: Data consistency and monitoring campaigns have been running for some weeks. More T2/T3 sites are coming in and joining CMS, and soon the PhEDEx topology. CMS has a CASTOR directory of 2.3 million files of 160 KB each, which are webcam dumps and have gone to tape. They are looking at deleting them and stopping the creation of new ones (a cleanup sketch follows below).
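
As an illustration only, a minimal sketch of how such a bulk cleanup might be scripted, assuming the CASTOR name-server command-line tools nsls and nsrm are available; the directory path is hypothetical and the real operation would need batching and coordination with the stager/tape side:

    # Sketch only: assumes the CASTOR nsls/nsrm CLI tools; the path is hypothetical
    # and stager/tape-side cleanup is not handled here.
    import subprocess

    WEBCAM_DIR = '/castor/cern.ch/cms/EXAMPLE/webcam'  # hypothetical path

    p = subprocess.Popen(['nsls', WEBCAM_DIR], stdout=subprocess.PIPE)
    listing, _ = p.communicate()

    for name in listing.splitlines():
        name = name.strip()
        if name:
            # Remove the name-server entry for each small webcam dump file.
            subprocess.call(['nsrm', WEBCAM_DIR + '/' + name])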

ATLAS: Several problems are under investigation. The dCache upgrade at FZK has extended their downtime. The CNAF LFC daemon keeps crashing at short intervals; it is restarted by cron but this is very inefficient. J-PB said this is a known problem and advised that they downgrade from 1.6.10 to 1.6.8. ATLAS ran cosmics last weekend and processes in the latest version of their site services got stuck. They were killed and restarted (alarms are clearly needed) but left a backlog of transfers. All T0-to-T1 transfers were affected and the cosmics finished this morning. ATLAS find that some T1 sites tend to be down longer than expected and there is a long backlog of FDR1 reprocessing to be done (while that for FDR2 is being set up). They are thinking of doing some reprocessing at CERN as a short-term action (clearly the T1 sites need to be able to deliver their workloads), so they will be testing stage-in mechanisms.

ALICE: Working on integration of CREAM-CE with ALIEN.

LHCb: DC06 simulation is running smoothly under DIRAC3, but reconstruction and stripping tests are still ongoing, so there is no official date yet for the start of DC06. Three recent reports have been made in the elog: 1) a complicated problem submitting to the gLite WMS with different VOMS roles; 2) the long-standing problem of gfal NEARLINE status is still reported at SARA after the upgrades (SC thought this was fixed in a new gfal); 3) querying metadata information using gfal 1.10.11 and lcg_util 1.6.11 from the StoRM endpoint at CNAF persistently fails, and, secondly, StoRM reports that a file has been removed successfully while it has not. This is dangerous because files are subsequently removed from the catalogues, generating inconsistencies and preventing space from being cleaned up.

Sites round table:

GRIF: has just upgraded lcg-utils to the version running in the PPS to get new functionality.

RAL: HR asked if the GGUS service alarm mechanism was now working - the response was yes.

CERN: DB asked the same question; the answer is probably not, else more mails would have been seen. There is also the question of who centralises the responses. HR and DB will follow this up.

BNL: SC asked about the ATLAS files trapped in the SRM T1 export last week. ME confirmed the problem (files that fail the first time end up pinned by dCache and hence fail in subsequent SRM transfer requests) in dCache patch 8, and that a bug report has been made. He agreed that other sites upgrading to P8 (SC was worried about IN2P3) are likely to see this problem.

FZK (by email from Andreas Heiss): GridKa is experiencing network problems. Last Saturday evening at about 19:20 a major network router failed. Almost all services were affected. Some services were up again on Sunday but some are still degraded or unavailable (as of 13:00 today). In particular, some dCache pool nodes are not yet available. We are working on it. A post-mortem analysis will follow.

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

Tuesday:

Attendance: local(Jean-Philippe,James,Roberto,Simone,Harry,Miguel);remote(Derek,Michael,Michel,Jeremy,Daniele).

elog review:

Two of the three entries from LHCb inserted yesterday (405/404) have been explained by a buggy version of gfal in production. An upgrade to 1.10.15 will fix both of these problems. However, the problem on the StoRM side of returning wrong information remains; the StoRM developers should look at it carefully, since the same request also fails when using the dCache client. The GGUS ticket is still open and people at CNAF are looking at it.

The elog entry 406 (about the proxy mix-up in the 3.0 WMS) will be further investigated by running the same test against a 3.1 instance of the WMS at CERN. Sophie will let LHCb know when this will be possible.

Experiments round table:

LHCb (Roberto): reported on yesterday's elog entries and the status of the problems (see the elog review above).

ATLAS (Simone): Commenting on elog entry 407, he reported that the exact same problem reported last week at BNL (following an upgrade to the latest version of dCache) has appeared at Lyon. Some files in some pools were not accessible. This results in poor data export from Lyon, whereas data import there is very efficient. Michael said that BNL adopted a local procedure but had cleaned up these files before the dCache developers had time to have a look. A bug has been filed and the dCache developers will look closely at the problematic data files still available at IN2P3.
The job priorities mechanism (based on the VOView deny tag from INFN) has been deployed at some of the Italian T2 sites (Milano) and tests will start soon. Last night pre-staging was successfully tested at CERN. ATLAS are now looking at adapting the machinery to run reprocessing jobs at CERN as well.

CMS (Daniele):
The current CRUZET-3 repack is around 13 TB, and it is being inserted into the global DBS and injected into PhEDEx as we speak. CMS is starting a series of global runs tomorrow (Wednesday, 9 AM CERN time, through Thursday afternoon CERN time), continuing on a weekly basis until the CRUZET-4 global run. CMS is informing the T1 sites of the conventions for these datasets so that T1 people can prepare. Reprocessing at T1 sites will also happen.


Sites round table:

Nothing to Report

Core services (CERN) report: Nothing

DB services (CERN) report:

Nothing

Monitoring / dashboard report:

The elog service was upgraded this morning; it went smoothly.

Release update:

A meeting about CREAM is ongoing. glexec is available on the PPS.

AOB:

Wednesday

Attendance: local(Simone, Andrea, Harry, Jean-Philippe, Roberto, Patricia, Sophie);remote(Gonzalo, Derek, Daniele, Jeremy, Michael, Michel).

elog review: 5 new items - see under LHCb

Experiments round table:

LHCb (RS): 1) When transferring files to PIC with FTS the proxy delegation sometimes fails, and apparently this is not due to the known delegation problem. It is only seen at PIC. 2) All download attempts using gfal from FZK fail. More details have been requested. 3) While checking the configuration of their FTSes, LHCb noticed something quite strange with the channels into and out of CERN: the channels out of CERN are configured in the 'RAL style' with file limits per VO, but the channels into CERN are configured in the old style. They are quite asymmetric and LHCb wondered what the reason for this is. 4) LHCb are experiencing quite a lot of instability submitting jobs through their two 3.0 WMS instances at CERN. This comes from the huge activity LHCb has started to put through the WMSes (20000 jobs on each of the two CERN WMS). A third WMS instance is needed in order to drain one of the two, reintegrate it, drain the second one, and then give back the temporary one. Sophie will provide LHCb with one to start this procedure (ATLAS volunteered one of theirs). In the longer term LHCb recognise that they need to revisit their needs (at least for 2008) for central gLite WMS instances to use in production, and see if they could migrate to the more efficient gLite 3.1 WMS. 5) Following yesterday's gfal-StoRM problem, the StoRM team report that in their current version, when a wrong Storage Area path is specified in the SURL, the returned structure is not well formed and the information at file level is missing (a null value). This is an easy bug to fix and the fix will be included in the next update (update 5 of version 1.3.20).

ATLAS (SC): File transfers between PIC and BNL failed yesterday due to a network problem to Madrid but failover to the backup link, via the CERN OPN, did not work. ATLAS are seeing frequent network problems at the moment. M.Ernst reported that they (BNL) detect the traffic is stuck somewhere between BNL and ESNET and believe that it is a routing issue.

CMS (DB): The first midweek (1.5 days) cosmics run started early this morning but the data flow chain was delayed by a bad startup script. Custodial data will be sent to FZK and CNAF but the weekly totals will be less than 10 TB.

ALICE (PM): Reported that their job submission mechanism is suffering from the fact that jobs which initially fail matchmaking to a CE are retried many times later but do not appear in the information system (because they have not been scheduled to a CE). Andrea said the WMS retry is a configuration option which could be set to reject unmatched jobs on submission, and it was agreed (Sophie) to reconfigure the ALICE WMS this way. ALICE are now testing the CREAM CE using about 100 jobs at FZK.

Sites round table:

PIC (GM) reported that LHCb pilot job tests (short jobs testing the framework) were putting a high load on their worker nodes. RS will follow this up.

Core services (CERN) report: SL reported that the LHCb VOboxes are making many LFC calls, especially to the read-only instance, that are failing authorisation by presenting the wrong credentials. RS will follow this up.

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

Thursday

Attendance: local(Simone, Julia, Harry, Maria, Andrea, Jean-Philippe, Miguel, Roberto);remote(Gonzalo, Derek, Michael).

elog review:

  • 416: LHCb: at SARA, it was understood that gfal_prestage() used without a protocol is silently ignored by dCache. A bug has been opened against dCache, but LHCb will specify a protocol to avoid the problem (see the sketch after this list).
  • 417: LHCb: clarified a question about a different configuration for outgoing and incoming FTS channels. Just following the experiment's requests.
  • 418, 419: LHCb: FTS proxy delegation not happening at PIC. Developers are investigating.
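
For reference, a minimal sketch of the workaround in entry 416, issuing the pre-stage with an explicit protocol list; this assumes the GFAL 1.x Python bindings (gfal_init / gfal_prestage / gfal_get_results) shipped with lcg_util at the time, and the SURL and dictionary keys shown are illustrative assumptions, not the actual LHCb code:

    # Sketch only: GFAL 1.x Python bindings are assumed; exact keys and return
    # tuples may differ from the deployed version. The SURL is hypothetical.
    import gfal

    surls = ['srm://srm.grid.sara.nl/pnfs/grid.sara.nl/data/lhcb/EXAMPLE/file.dst']

    request = {
        'surls': surls,
        'setype': 'srmv2',
        # Listing the transfer protocols explicitly is the workaround from entry
        # 416: without it the pre-stage request was silently ignored by dCache.
        'protocols': ['gsiftp'],
    }

    rc, handle, err = gfal.gfal_init(request)
    if rc == 0:
        rc, handle, err = gfal.gfal_prestage(handle)

    # Check the per-file statuses rather than relying only on the overall return code.
    nfiles, handle, results = gfal.gfal_get_results(handle)
    for r in results:
        print('%s -> status %s' % (r.get('surl'), r.get('status')))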

Experiments round table:

  • ALICE
    • no report
  • ATLAS (Simone)
    • Yesterday, due to problems with proxies on a couple of VO boxes, dataset subscriptions to Tier-1 sites stopped for 2-3 hours. After the problem was fixed, there was a sharp rise in transfers due to the backlog.
    • Data traffic from CERN to CNAF managed by a pre-production FTS server stopped. Gavin has been notified.
    • FDR2 rerun going on, but it is a pure Tier-0 exercise.

  • CMS (Andrea)
    • The mid-week cosmics run was abruptly interrupted at 22:30 last night due to a major breakdown of the electrical feed to the experiment. Repairs will take at least 1-2 days, which means that the cosmics run will not resume this week.
    • The PhEDEx monitoring stopped responding this morning at about 06:30. The cause was a misconfiguration of the CMSDOC web server, which resulted in the file robots.txt becoming inaccessible. As a consequence, the server was hit by lots of web crawlers and brought to its knees, as the PhEDEx pages were forced to produce huge amounts of data in response.

  • LHCb (Roberto)
    • Ricardo is running scalability tests of DIRAC3 to exercise the WMS and the CEs. The goal is to prove the ability to run 25-30K jobs/day. Due to a lack of publicity these jobs were not expected, and Gonzalo asked about them.

Sites round table:

  • no report

Core services (CERN) report (Miguel):

  • Noticed a 3% failure rate with outgoing transfers from CERN, reported by ATLAS and CMS. Apparently they are due to connections being dropped by GridFTP servers at remote sites (mostly dCache). More news when available.
  • Simone mentioned that a meeting has to be organized early next week to discuss priorities for tapes. He will propose a time.

DB services (CERN) report (Maria):

  • An intervention was carried out at GridKa on the LFC server; another one is upcoming at RAL. The latest Oracle security upgrade is being tested; if OK, it will be deployed in the next two weeks.

Monitoring / dashboard report:

  • no report

Release update:

  • no report

AOB:

  • none

Friday

Attendance: local(Roberto, Alessandro, Harry, Patricia, Andrea, Miguel, Julia, Jean-Philippe);remote(Jeremy, Derek, Michael).

elog review: 5 new items but all for LHCb so minuted below.

Experiments round table:

ATLAS (AdG): 1) ASGC had an ACL problem writing into their CASTOR server. It has been solved but why it happened is still under investigation. 2) ATLAS run a cron job on the FTS servers at their Tier-1 sites to refresh delegated proxies as a workaround for the proxy corruption bug, but these jobs tend to fail at PIC (in the create-proxy step) more than at other sites and failed again today (a sketch of such a refresh is given below). GM said they would look into this; they had recently noticed a suspicious 1-second clock time difference on their FTS servers. 3) ATLAS are transferring FDR08 data but it is not clear what this weekend's workload (e.g. cosmics) will be.
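
For illustration, a minimal sketch of such a cron-driven refresh, checking the remaining proxy lifetime and re-delegating to the FTS server when it gets low; the FTS endpoint is hypothetical and the glite-delegation-init invocation is an assumption based on the FTS 2.0 client tools, not the actual ATLAS script:

    # Sketch only: assumes voms-proxy-init/voms-proxy-info and glite-delegation-init
    # are on the PATH; the endpoint below is hypothetical and the delegation command
    # and its options should be checked against the installed FTS client.
    import subprocess

    FTS_ENDPOINT = 'https://fts.example-t1.org:8443/glite-data-transfer-fts/services/FileTransfer'
    MIN_LIFETIME = 6 * 3600  # refresh when less than six hours remain

    def proxy_timeleft():
        p = subprocess.Popen(['voms-proxy-info', '--timeleft'],
                             stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        out, _ = p.communicate()
        try:
            return int(out.strip())
        except ValueError:
            return 0

    if proxy_timeleft() < MIN_LIFETIME:
        # Create a fresh VOMS proxy (a real cron job would use a long-lived
        # credential, e.g. via MyProxy, instead of an interactive init).
        subprocess.call(['voms-proxy-init', '--voms', 'atlas'])
        # Re-delegate the new proxy to the FTS server so queued transfers keep working.
        subprocess.call(['glite-delegation-init', '-s', FTS_ENDPOINT])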

ALICE (PM): 1) They are still trying to find the best workaround to avoid their WMS repeatedly resubmitting jobs that fail matchmaking. AS suggested they copy the CMS configuration, where the total time to retry is set to less than the time interval between retries, so this will be requested for the ALICE WMS. 2) CREAM CE tests continue with about 300 jobs running. In order to retrieve job logs a GridFTP server is needed, so ALICE will probably request all Tier-1 sites to provide this service (having previously had them removed).

CMS (AS): 1) CMS need to improve the support of their experiment web server. Other services depend on it and it does not have 24 by 7 support. 2) A badly configured network alias at CERN was affecting their site database service (of mostly static information).

LHCb (RS): Five new elog entries: 1) gfal_deletesurls seems to receive from StoRM a return code indicating a successful operation while some of the files have not, in fact, been removed (a sketch of a per-file status check is given below). 2) The gfal structure returned by StoRM at CNAF differs from the one returned by CASTOR. 3) The problem whereby gfal_ls against StoRM sometimes returns a null pointer is not, as previously reported, due to specifying a wrong storage path. 4) A data replication exercise from LHCb_DST at Lyon to the other T1s fails entirely, with errors like [REQUEST_TIMEOUT] failed to prepare source file in 180 seconds. 5) When transferring files into the SARA SRM, almost all files complete successfully the first time, but when trying to transfer those files out to other Tier-1s many transfers fail with errors like [REQUEST_TIMEOUT] failed to prepare source file in 180 seconds. Other LHCb issues: 1) LHCb have finally decided to roll out their new Pilot role FQAN to all sites and would like this request to be followed up by WLCG operations (COD). The VO ID card has been modified accordingly. 2) On the WMS load issue at CERN, the two WMS instances have been drained after a new instance was kindly loaned by ATLAS. LHCb plan to install one WMS with gLite 3.1 at CERN and keep the other two at 3.0. After successful testing they will trigger an upgrade to 3.1 of all WMSes at CERN and will also convert one of the LCG RBs into a WMS, so that they can rely on four WMSes under gLite 3.1. The timeline depends on their tests of the proxy mix-up that currently keeps them needing an RB.
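
Related to elog item 1 above, a minimal sketch of checking the per-file statuses after a delete instead of trusting only the overall return code; it assumes the GFAL 1.x Python bindings (gfal_init / gfal_deletesurls / gfal_get_results), with an illustrative SURL and dictionary keys that may differ from the deployed version:

    # Sketch only: GFAL 1.x Python bindings are assumed; the SURL is hypothetical.
    # The point is to remove catalogue entries only for files whose per-file status
    # confirms the physical removal, avoiding the inconsistencies described above.
    import gfal

    surls = ['srm://storm-fe.cr.cnaf.infn.it/lhcb/EXAMPLE/file.dst']

    rc, handle, err = gfal.gfal_init({'surls': surls, 'setype': 'srmv2'})
    if rc == 0:
        rc, handle, err = gfal.gfal_deletesurls(handle)

    nfiles, handle, results = gfal.gfal_get_results(handle)
    really_removed = [r['surl'] for r in results if r.get('status') == 0]
    # Only the SURLs in really_removed should be removed from the LFC / bookkeeping.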

Sites round table:

PIC (GM): Announced a whole day downtime for Tuesday 5 August.

RAL (DR): There will be an 8-hour LHCb SE downtime at RAL for reconfiguration to a RAC setup on 29 July. RS asked if the LHCb batch jobs and queues could be paused. This would not be straightforward as the CEs support multiple VOs, but a simple solution would be to stop the LHCb queues well before the intervention.

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report: The CMS dashboard was double counting the number of analysis jobs submitted via Condor-G. The CRAB team have found a solution. The dashboard team is testing notification of job status changes.

Release update:

AOB:

-- JamieShiers - 18 Jul 2008
