LCGSCM Information and Monitoring Systems Status

December 13th 2006

  • SAM team reorganization
    • our team got new people and was split into three areas: production service, support, and development
    • planning for the next 6 months is ready and will be presented at the next ROC managers meeting
  • SAM production service:
    • we already have 2 UIs and 2 RBs - installed and configured, with job submission migrated from the old hardware
    • 2 BDIIs and 2 gLite-WMSes will be installed
  • SAM support: new mailing list sam-support@cern.ch
  • SFT phase-out is scheduled for the end of this week
  • GOCDB3 is about to enter its testing phase. I asked the GridView team to contact the GOCDB developers to integrate GridView's GOCDB replication.

November 29th 2006

Work in progress

  • Inconsistencies in SAM availability numbers vs. GridView - we have identified most of the causes; now we have to clarify a few details in the algorithm and integrate the tools
  • SAM production service:
    • we got 4 new machines from FIO
    • 2 machines for SAM UIs installed and configured - we should switch in a few days
    • 2 machines for SAM server to be configured

  • Improvements to the SAM Portal to make it a better equivalent of the SFT Report for operators - ongoing process
  • Standardization of application-level interfaces for SAM results data
  • Integration of SAM with OSG sites:
    • passed to ROC CERN

October 11th 2006

SAM status

  • New version of the FTS transfer test is ready and in place for DTeam, OPS and Atlas. Results seem to be better for DTeam; however, there are problems with SRMs for OPS and Atlas (we couldn't manually replicate the source file using lcg-utils; see the sketch after this list). The SRM at BNL is not accessible even for DTeam.
  • SAM tests are now submitted for the Atlas VO from the central SAM UI (CE, gCE, SRM, FTS, LFC). Some details still have to be decided (timeout for FTS transfers, critical tests for some of the services)
  • Rule-based "horizontal" alarm masking implemented. COD Dashboard development continues (IN2P3)
  • Several important fixes to SAM Portal and FCR
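
The manual replication check mentioned in the first item above was done with the lcg-utils command line tools. Below is a minimal sketch of such a check driven from Python; the VO, SE hostnames, logical file name and local file are placeholder assumptions, not the actual values used.

<verbatim>
#!/usr/bin/env python
# Hedged sketch: manually exercising replication with lcg-utils (lcg-cr, lcg-rep, lcg-del).
# The VO, SE hostnames, LFN and local file are illustrative placeholders only.
import subprocess

VO = "dteam"
SOURCE_SE = "srm.example-site-a.org"      # placeholder source SRM endpoint
DEST_SE = "srm.example-site-b.org"        # placeholder destination SRM endpoint
LFN = "lfn:/grid/dteam/sam-fts-check"     # placeholder logical file name
LOCAL_FILE = "file:///tmp/sam-fts-check"  # small local test file

def run(cmd):
    """Run a command, echo it, and return its exit code."""
    print(">>>", " ".join(cmd))
    return subprocess.call(cmd)

# 1. Copy and register the local file on the source SE.
run(["lcg-cr", "--vo", VO, "-d", SOURCE_SE, "-l", LFN, LOCAL_FILE])
# 2. Try to replicate it to the destination SE (the step that failed for OPS/Atlas).
rc = run(["lcg-rep", "--vo", VO, "-d", DEST_SE, LFN])
print("replication", "succeeded" if rc == 0 else "FAILED (rc=%d)" % rc)
# 3. Clean up all replicas and the catalogue entry.
run(["lcg-del", "--vo", VO, "-a", LFN])
</verbatim>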

Work in progress

  • Improvements to the SAM Portal to make it a better equivalent of the SFT Report for operators
  • Standardization of application-level interfaces for SAM results data
  • Integration of SAM with OSG sites

October 4th 2006

SAM status

  • SAM for PPS installed - shared server, separate UI for submission
  • New version of the FTS transfer test is ready and in place. Results seem to be better. There are still problems with several SRMs for DTeam. It will be migrated to OPS soon.
  • Transition to new COD dashboard (SAM based monitoring) scheduled for the middle of October.
  • The old version of FCR was phased out. There were problems with several VOs not being registered in the DB (GOCDB).

Work in progress

  • Rule-based alarm masking is in development. It became urgent, as the number of redundant failures (mostly SE and SRM) would overload the COD (a sketch of the masking idea follows this list).
  • Improvements to the SAM Portal to make it a better equivalent of the SFT Report for operators
  • Standardization of application-level interfaces for SAM results data
  • Integration of SAM with OSG sites
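
The rule-based masking mentioned in the first item can be illustrated with a small sketch: alarms are filtered so that redundant failures (for example many SE alarms caused by a single SRM problem at a site) do not all reach the COD dashboard. The rule format and alarm fields below are illustrative assumptions, not the actual SAM implementation.

<verbatim>
# Hedged sketch of rule-based alarm masking: suppress redundant alarms per site
# so that SE/SRM failure storms do not flood the COD. Rule structure and alarm
# fields are illustrative assumptions.

MASKING_RULES = [
    # If an SRM test fails at a site, mask the SE alarms of the same site.
    {"if_service": "SRM", "mask_services": {"SE"}},
]

def apply_masking(alarms):
    """alarms: list of dicts with 'site' and 'service' keys; returns unmasked alarms."""
    masked = set()  # (site, service) pairs to suppress
    for alarm in alarms:
        for rule in MASKING_RULES:
            if alarm["service"] == rule["if_service"]:
                for svc in rule["mask_services"]:
                    masked.add((alarm["site"], svc))
    return [a for a in alarms if (a["site"], a["service"]) not in masked]

if __name__ == "__main__":
    alarms = [
        {"site": "SITE-A", "service": "SRM", "test": "put"},
        {"site": "SITE-A", "service": "SE", "test": "lcg-cr"},   # masked: redundant
        {"site": "SITE-B", "service": "CE", "test": "job-submit"},
    ]
    for a in apply_masking(alarms):
        print(a)
</verbatim>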

September 21st 2006

SAM status

  • All tools were switched to OPS as the default VO for operations
  • Jobwrapper tests were run on the Certification Testbed. Installation on the PPS should start soon
  • Improvements for SLS interface - possibility to drill down to services (not yet in production)

Work in progress

  • New version of the FTS-transfer test in development, should be finished by the end of the week
  • Improvements to the SAM Portal to make it a better equivalent of the SFT Report for operators
  • Standardization of application-level interfaces for SAM results data
  • Integration of SAM with OSG sites
  • Installation of SAM for PPS in progress

September 13th 2006

SAM status

  • SAM was integrated with ROC report generator on CIC Portal (with help of David)
  • Additional metrics added to data exports (html, excel) (David)
  • Jobwrapper tests were submitted to Certification Testbed (Piotr)
  • A lot of time was spent on supporting SAM (GGUS tickets, bug fixes, help to other SAM users)

Work in progress

  • New version of FTS-transfer test according to suggestions from Gavin
  • Integration of SAM with OSG sites
  • Installation of SAM for PPS in progress

September 6th 2006

SAM status

  • Jobwrapper tests were successfully tested on one lcg-CE in Croatia. Still needed: packaging, glite-CE scripts and testing on the PPS.
  • History query optimisation for SAM Portal applied, history view is now much faster
  • New version of FCR was put to production, tested and announced

Work in progress

  • New version of FTS-transfer test according to suggestions from Gavin
  • Additional (more detailed) metric data exports (html, excel)
  • Integration of SAM with OSG sites
  • Installation of SAM for PPS in progress

August 29th 2006

SAM status

  • The SAM Client on our UI (lxn1182) is now installed from RPM. OPS job submission to gLite CEs was restored
  • Support of the OPS VO at CERN: all lcg-CEs are passing the tests; there is still a problem with the glite-CE ce103.cern.ch

Work in progress

  • New version of FTS-transfer test according to suggestions from Gavin
  • Installation of SAM for PPS in progress (no news from Antonio Retico).
  • History query optimisations for the SAM Portal suggested by Miguel Anjo are being tested.

August 23rd 2006

SAM

Activities completed since last meeting

  • Support of OPS VO at CERN: all CEs are passing the tests
  • The application interface to SAM monitoring data (current status of services, failing tests) is ready and is being tested by Atlas. It was initially requested by Atlas (Benjamin Gaidioz) but will be useful for others as well (see the consumer sketch after this list).
  • SAM Client was packaged in RPM together with brief documentation. It is now being tested
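
As a rough illustration of how a VO tool might consume the application interface mentioned above, the sketch below queries a SAM-style status endpoint and lists failing services. The URL, parameters and XML layout are hypothetical placeholders; the real interface and output format are defined by the SAM query webservice.

<verbatim>
# Hedged sketch: fetching current service status for a VO from a SAM-style
# query interface. The endpoint, parameters and XML layout are hypothetical.
import urllib.request
import xml.etree.ElementTree as ET

BASE_URL = "https://sam-example.cern.ch/query/servicestatus"  # placeholder endpoint

def fetch_failing_services(vo):
    url = "%s?vo=%s&status=failed" % (BASE_URL, vo)
    with urllib.request.urlopen(url) as resp:
        tree = ET.parse(resp)
    # Assumed XML layout: <service node="..." type="..." test="..." status="..."/>
    return [(s.get("node"), s.get("type"), s.get("test"))
            for s in tree.getroot().iter("service")]

if __name__ == "__main__":
    for node, svc_type, test in fetch_failing_services("atlas"):
        print("FAILING:", svc_type, "on", node, "test:", test)
</verbatim>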

Work in progress

  • Installation of SAM for PPS in progress (no news from Antonio Retico).
  • SAM Portal and FCR are being continuously improved. We still have a number of requirements to meet.

GridView

Data Transfer (Gridftp):

  • Developed reports for VO-wise distribution of overall data transfers, VO-wise distribution of data transfers per site and site-wise distribution of data transfers per VO.
  • Implemented Graphs for data transfers from 'All Sites' to a particular site (in Current Summary and Hourly Report options)
  • Enhanced the gridftp log comparison script to get the host configuration for VOs from CDB inventory database.

Service Availability:

Developed graphs and reports for presentation of the following service availability information, computed on an hourly, daily, weekly and monthly basis:
  • Aggregate site availability (aggregate of all tier 1/0 sites)
  • Site-wise availability for individual tier-1 sites.
  • Site-wise service availability of tier-2 sites (grouped by associated VOs)
  • Detailed availability of various services (CE, SE, SRM) running at a particular site

Job Monitoring:

  • Developed graphs and reports for presentation of an overall summary of jobs, indicating sites with high/low job execution rates, sites with high/low job success rates, VOs running more/fewer jobs, etc.
  • Developed a summary report to indicate jobs lost from monitoring (due to records missing from R-GMA or other problems)
  • Modified the procedure for computation of the job success rate to be based only on completed jobs (without considering running jobs); see the sketch after this list
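
A minimal sketch of the modified success-rate calculation, under the assumption that only jobs in a terminal state count as completed, is given below.

<verbatim>
# Hedged sketch of the job success rate restricted to completed jobs.
# Assumes job records carry a final state; running/scheduled jobs are excluded.

def job_success_rate(jobs):
    """jobs: iterable of state strings, e.g. 'Done', 'Aborted', 'Running'."""
    completed = [s for s in jobs if s in ("Done", "Aborted", "Cancelled")]
    if not completed:
        return None  # no completed jobs yet, rate undefined
    successful = sum(1 for s in completed if s == "Done")
    return 100.0 * successful / len(completed)

print(job_success_rate(["Done", "Done", "Aborted", "Running", "Scheduled"]))  # ~66.7
</verbatim>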

August 16th 2006

Activities completed since last meeting

  • A prototype of the job-wrapper tests is ready, with publishing to R-GMA, and is now being tested (a sketch of the wrapper structure is given below). However, there are still many open questions: which wrapper should be used (CE or RB), whether we can assume VOMS proxies in all jobs, and problems with obtaining the CE name and VO name in the CE wrapper, while the RB wrapper does not cover all jobs.
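
A minimal sketch of the intended wrapper structure: a very cheap worker-node check runs just before and just after the user payload, and the result is published for monitoring. The check names and the publish_record placeholder are assumptions; the actual prototype publishes to R-GMA.

<verbatim>
# Hedged sketch of a job-wrapper test: a low-overhead WN sanity check before and
# after the user payload, with the result published for monitoring.
# publish_record() is a placeholder; the real prototype publishes to R-GMA.
import os
import socket
import time

def basic_wn_check():
    """Low-overhead worker node checks; returns a dict of check results."""
    return {
        "hostname": socket.gethostname(),
        "tmp_writable": os.access("/tmp", os.W_OK),
        "home_set": "HOME" in os.environ,
        "timestamp": int(time.time()),
    }

def publish_record(phase, record):
    # Placeholder: the real wrapper would publish this tuple to R-GMA.
    print("PUBLISH", phase, record)

def run_wrapped(payload):
    publish_record("pre", basic_wn_check())
    rc = payload()                       # the actual user job
    publish_record("post", dict(basic_wn_check(), payload_rc=rc))
    return rc

if __name__ == "__main__":
    run_wrapped(lambda: 0)  # trivial payload for illustration
</verbatim>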

Work in progress

  • We are working on the application interface to SAM monitoring data (current status of services, failing tests). It was initially requested by Atlas (Benjamin Gaidioz) but will be useful for others as well.
  • Installation of SAM for PPS started (led by Antonio Retico).
  • SAM Portal and FCR are being continuously improved. We still have a number of requirements to meet.

Issues

  • Support of the OPS VO at CERN: tests are successful only on ce106 (srm.cern.ch as the default SE); however, the default SE on other CEs (for example ce101) is still castorgrid.cern.ch, which is failing as before. (By the way, how is it possible that CEs sharing a batch farm have different default SEs for the OPS VO?)

August 2nd 2006

Activities completed since last meeting

  • Packaging of SAM is mostly done: a YUM/APT repository for SLC4 was created. Documentation is being written.

Work in progress

  • Implementation of the job-wrapper tests is underway. The tests will do a very basic check of the WN (low overhead) and will be installed in the software area of the DTeam/OPS VO.
  • SAM Portal and FCR are being continuously improved. We still have a number of requirements to meet.

Issues

  • Support of the OPS VO at CERN: the default SE is castorgrid.cern.ch, which does not support OPS and is still causing the Replica Management tests to fail.

July 26th 2006

Activities completed since last meeting

  • Development of the interface between SAM and the CIC-dashboard according to the specification is done on the SAM side. Development still has to be done on the CIC side (+ integration). According to our planning, the new tool for CIC-on-duty should be ready by September.
  • Most of the inconsistencies between the SFT report and the CE report on the SAM Portal were identified and fixes implemented. The remaining ones are due to a slight difference in how the overall CE status is calculated in the two systems, and because SAM distinguishes between the LCG and gLite flavours of CE.
  • Simone has tested SAM job submission for Atlas VO.
  • The FTS sensor in SAM seems to be quite reliable. The only strange behaviour observed (related to the SRM test rather than to FTS) is a failure for Sinica. They have two SRMs: a Castor one and a DPM one. The Castor one seems to be working fine, but after using lcg-cr the physical file location is "sfn:" instead of "srm:", which misleads the FTS sensor (a small check of the SURL scheme is sketched after this list).
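
The symptom above can be spotted by listing the replicas of a freshly registered file and checking the SURL scheme. The sketch below does this with lcg-lr; the VO and logical file name are placeholders.

<verbatim>
# Hedged sketch: after lcg-cr, list the replicas with lcg-lr and check whether
# the returned SURL uses the srm: scheme (expected) or sfn: (the Sinica symptom).
# The VO and LFN are illustrative placeholders.
import subprocess

VO = "ops"
LFN = "lfn:/grid/ops/sam-srm-check"  # placeholder logical file name

output = subprocess.check_output(["lcg-lr", "--vo", VO, LFN]).decode()
for surl in output.split():
    scheme = surl.split(":", 1)[0]
    flag = "OK" if scheme == "srm" else "UNEXPECTED scheme (misleads the FTS sensor)"
    print(surl, "->", flag)
</verbatim>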

Work in progress

  • We are starting implementation of job-wrapper level tests executed just before and just after EACH job submitted to the grid. The tests will do a very basic check of the WN (low overhead) and will be installed in the software area of the DTeam/OPS VO. The requirements and specification were agreed. The tests were requested by the experiments, which complained about a higher failure rate than could be deduced from SAM/SFT results.
  • Packaging and documentation of SAM is under way with help from other grid projects that are interested in using SAM (Baltic, EELA, EU China, EU Med, HealthyChild...). We are preparing a standalone installation (with no dependency on the GOC DB), but this requires some additional development.
  • SAM Portal and FCR are being continuously improved. We still have a number of requirements to meet.

Issues

  • Support of the OPS VO at CERN: the default SE is castorgrid.cern.ch, which does not support OPS, causing the Replica Management tests to fail.

July 12th 2006

Activities completed since last meeting

  • Core parts of the SAM alarm system were implemented on the validation installation. The interface to the CIC-Dashboard was agreed and a specification was produced.
  • SFT (CE sensor) was integrated with the new SAM framework. Now most of the jobs are submitted to CEs from the integrated SAM Submission Framework
  • New version of FCR was installed for validation. It was given to VOs (EIS people) for testing.

Work in progress

  • Development of the interface between SAM and the CIC-dashboard according to the specification is in progress. According to our planning, the new tool for CIC-on-duty should be ready by September.
  • We are now checking the inconsistencies between the SFT report and the CE report on the SAM Portal. Some of the problems (related to GOCDB or BDII inconsistencies) were found and workarounds were implemented. However, this work is not yet finished.
  • Simone is testing SAM job submission for the Atlas VO. He wants to add an Atlas-specific test to the CE sensor.

Issues

  • CERN is still not supporting the OPS VO. This morning ce101 and ce102 were failing JS with listmatch errors. The submission mechanism (JDL) was slightly changed recently, but as it works with other sites it should not affect CERN-PROD. We need close collaboration between FIO (Thorsten) and someone from our group (Maarten?) to solve this urgent problem.

June 22nd 2006

Activities completed since last meeting

  • Integration of SAME Availability metrics with SLS (FIO) finished. The view displays site availability for CERN and all Tier1 sites.
  • The technical details of how to proceed with LCG-OSG cross-monitoring were discussed and decided.

Work in progress

  • Simple alarm system in SAM is being developed. The system will provide interface to CIC-Dashboard for operators
  • SFT (CE sensor) is being migrated to new SAM framework
  • Extended version of FCR with central services flagging is being developed/migrated to Oracle DB (additional tables in our schema)

Issues

  • CERN still not supporting OPS VO. Yesterday ce101 was passing JS but failing RM. ce102 is failing JS with listmatch (no queue?). This is now becoming very urgent as we want to start chasing sites that don't support OPS VO.

June 8th 2006

Activities completed since last meeting

  • Export of metric data to Excel spreadsheet is ready
  • Documentation of metric calculation algorithm is available on wiki
  • Extended FTS sensor with full n-n channels - first version ready
  • First version of Service Availability Display for Tier1 sites is available on development GridView portal - current availability only (sliding window)

Work in progress

  • Integration of SAME Availability metrics with SLS (FIO) almost ready, 1-2 days to finish with current availability of Tier0 and Tier1 sites
  • Extended version of FCR with central services flagging is being developed/migrated to Oracle DB (additional tables in our schema)

Issues

  • We need to clarify what the FTS sensor should react to (the level of independence from the SRM sensor)

June 1st 2006

Activities completed since last meeting

  • The SAME DB was successfully validated and moved to the production Oracle cluster; data from the sensors was properly migrated and merged, and the sensors are now publishing to the production DB (>40 days of results by now)
  • all essential SAME components were moved to production:
    • publishing webservice
    • query webservice
    • SAME portal
    • BDII synchronization tool
  • Basic SAME Submission framework finished - first two sensors integrated:
    • FTS - simple version, no n-n channels tests yet
    • SE - lcg-cr, cp, del operations
  • Full summarization module was implemented and all historical data was reprocessed to generate availability metrics for sites and services

Work in progress

  • Export of metric data to Excel spreadsheet is being developed, first version should be ready today
  • Extended FTS sensor with full n-n channels test is under development
  • Documentation of metric calculation algorithm is underway
  • Extended version of FCR with central services flagging is being developed/migrated to Oracle DB (additional tables in our schema)
  • Integration of SAME Availability metrics with SLS (FIO)
  • GridView team is working on high-level availability display/dashboard

Issues

  • Starting from May 25th the RB (Dave Kant) and BDII (GStat) sensors stopped publishing results to the SAME DB. The sensor maintainers were contacted; the BDII sensor should be back now, but there is no information about the RB sensor yet.

May 17th 2006

  • (Piotr is at the COD meeting; Judit is reporting)

Activities completed since last meeting

  • Because the SAME Submission Framework developer was overloaded, Piotr took over the task and finished the first version of the code.
  • For the same reason, he and Maarten developed a basic FTS sensor

Work in progress

  • Validation of the SAME DB is in progress (DB schema OK, we will start to publish sensor results now), should finish by the end of May
  • The SAME portal was migrated to the validation DB for testing -- it still has to be tested once sensor data is there

May 10th 2006

Issues

  • The SAME Submission Framework developer is overloaded (he is changing jobs soon), so we had to switch to "plan B" for the FTS/LFC sensors.

Activities completed since last meeting

  • Sensors integrated with production DB: CE (SFT,gstat), SE (gstat), RB (active), site-EGEE.BDII (gstat), toplevel-EGEE.BDII (gstat)
  • Because of the lack of a developer for the SAME Submission Framework, we decided to set up a standalone FTS sensor (and the same for LFC) - details agreed with Maarten

Work in progress

  • Standalone FTS sensor is going to be developed by our team (with help of Maarten and Gavin)
  • Validation of SAME DB in progress, should finish by the end of May
  • Extended version of FCR with central services flagging is being developed/migrated to Oracle DB (additional tables in our schema)
  • Integration of SAME Availability metrics with SLS (FIO)
  • GridView team is working on high-level availability display/dashboard

May 3rd 2006

Activities completed since last meeting

  • Script to merge information from the BDII with GOC DB into new extended Service Instance schema is ready
  • SAME Portal - the detailed view of results for sites and services is ready (currently only on development DB)
  • Publishing webservice moved to production DB (currently only writing), all sensors are now publishing results to production DB
  • Sensors integrated with production DB: CE (SFT,gstat), SE (gstat), RB (active), site-EGEE.BDII (gstat), toplevel-EGEE.BDII (gstat)
  • Documentation of SAME Framework specification (for LFC, FTS, MyProxy, R-GMA sensors) is ready

Work in progress

  • GridView team is working on high-level availability display/dashboard
  • SAME Submission Framework development still in progress
  • Validation of SAME DB is needed before moving all components to production, will start soon, should finish by the end of May
  • Integration of SAME Availability metrics with SLS (FIO)

March 29th 2006

Activities completed since last meeting

  • Dedicated VO for operations (OPS) was tested on small testbed, YAIM parameters identified, mail to be sent to ROC managers/sites to deploy the VO
  • The summarisation and metric calculation module for simple availability of sites and central services is ready. Central service availability is also aggregated at the service type level and at the VO level (a sketch of the basic calculation is given below).
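
A minimal sketch of the kind of simple availability calculation meant here, under the assumption that a site or service is up in a time slot only if all its critical tests pass, and that availability over a period is the fraction of slots in which it was up; the exact algorithm is documented separately.

<verbatim>
# Hedged sketch of simple availability aggregation. Assumption: "up" in a time
# slot means every critical test passed; availability over a period is the
# fraction of slots in which the target was up.

def slot_status(test_results, critical_tests):
    """test_results: dict test -> bool (passed); up only if every critical test passed."""
    return all(test_results.get(t, False) for t in critical_tests)

def availability(slot_statuses):
    """Fraction of time slots in which the target was up."""
    slot_statuses = list(slot_statuses)
    return sum(slot_statuses) / float(len(slot_statuses)) if slot_statuses else 0.0

# Example: daily site availability from 24 hourly slots (CE, SE, SRM critical).
critical = ["CE", "SE", "SRM"]
hourly = [slot_status({"CE": True, "SE": True, "SRM": h != 3}, critical) for h in range(24)]
print("daily availability: %.1f%%" % (100 * availability(hourly)))  # 95.8%
</verbatim>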

Work in progress

  • Development of detailed view with results for sites and services in SAME is progressing well - becoming more needed now (requested by sites on joint operation meeting)
  • GridView team is working on high-level availability display/dashboard
  • Development of the script to merge information from the BDII with GOC DB into new extended Service Instance schema
  • Integration of new dedicated VO for operations in progress
  • Documentation of SAME Framework specification (for LFC, FTS, MyProxy, R-GMA sensors) in progress. All details decided on COD-7 Meeting in Lyon (28th March 2006)

March 22nd 2006

Activities completed since last meeting

  • RB sensor was integrated with SAME - results are now stored in Oracle
  • A new area of the data schema was designed as an extension of the GOC DB, adding the missing information such as Service Instances and the relations between them, nodes and VOs
  • Interface to CIC-dashboard was implemented on SAME side (XML with list of critical failures for operators)
  • Dedicated VO for operations (OPS) is ready for integration

Work in progress

  • Development of detailed view with results for sites and services in SAME is progressing well - becoming more needed now (requested by sites on joint operation meeting)
  • GridView team is working on high-level availability display/dashboard
  • Development of the script to merge information from the BDII with GOC DB into new extended Service Instance schema
  • Summarisation and metric modules still in progress
  • Integration of new dedicated VO for operations in progress

Issues

  • The GOC DB, used in SFT as the main source of information on the sites, nodes and services to monitor, lacks important information such as the relations between nodes and service instances, and between service instances and supported VOs. The existing relation between sites and VOs is not sufficient. The GOC DB team was contacted, but the necessary development would take a lot of time, so we decided to create an extension to the schema on the SAME side (Oracle DB); a sketch of such an extension is given below.
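
A minimal sketch of what such a schema extension might look like. The table and column names are illustrative assumptions, not the actual SAME schema; the optional cx_Oracle part only runs if a connection string is supplied.

<verbatim>
# Hedged sketch of the GOC DB extension on the SAME side (Oracle). Table and
# column names are illustrative assumptions, not the actual SAME schema.
import sys

DDL = [
    # A service instance running on a node, of a given type (CE, SE, SRM, ...).
    """CREATE TABLE service_instance (
           id           NUMBER PRIMARY KEY,
           node_name    VARCHAR2(255) NOT NULL,
           service_type VARCHAR2(64) NOT NULL
       )""",
    # Which VOs each service instance supports (missing from the GOC DB).
    """CREATE TABLE service_vo (
           service_id  NUMBER REFERENCES service_instance(id),
           vo_name     VARCHAR2(64) NOT NULL
       )""",
]

if __name__ == "__main__":
    if len(sys.argv) > 1:                      # e.g. user/password@tns
        import cx_Oracle                       # assumes the Oracle client is available
        conn = cx_Oracle.connect(sys.argv[1])
        cur = conn.cursor()
        for stmt in DDL:
            cur.execute(stmt)
        conn.commit()
    else:
        print("\n\n".join(DDL))                # just show the sketch
</verbatim>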

March 14th 2006

  • SFT was integrated with new Oracle based environment (SAME), so CE availability results are already there
  • As for the other sensors, Dave Kant is now working on publishing results from the RB and SRM sensors
  • First simple summaries and metrics for CE availability were implemented
  • A detailed display for operators and integration with the CIC-dashboard are in progress now.

February 22nd 2006

Activities completed since last meeting

  • A new VO for high-level monitoring and interoperation (OPS) is under way. The AUP document was written and approved by the ROC managers, Oracle accounts are ready, and we are waiting for the VOMS setup (by Maria Dimou, see issues)
  • Communication protocol between service sensors and SAME publishing scripts defined, documentation under way, integration meeting scheduled for 1st March.

Work in progress

  • Scheduled downtimes in SFT history view.
  • Integration of RB sensor (Dave Kant)
  • Development of SAME/SFT framework
  • Development of Oracle based archiver/publisher web service for SAME

Issues

  • The situation with the VOMS service is not clear. It is needed for the OPS VO, but we are waiting for a decision on which version of VOMS will be used for the new VO.

February 1st 2006

Activities completed since last meeting

Site Functional Tests:
  • Data schema for SAME/SFT defined
  • Secure R-GMA connector test for SFT.
  • Centrally executed Apel test.
  • Release of SFT 2.1.2 with LCG-2.7.0
  • Internal data representation for Oracle implemented
  • RB sensor (Dave Kant) ready for integration
  • Hardware requirements for SAME/SFT defined

Work in progress

  • Scheduled downtimes in SFT history view.
  • Integration of RB sensor (Dave Kant)
  • Development of SAME/SFT framework
  • Development of Oracle based archiver/publisher web service for SAME

Issues

  • R-GMA in 2.7.0 incompatible with SFT server - SFT server will be running on 2.6.0 until we fully move to Oracle.

November 30th 2005

Activities completed since last meeting

Site Functional Tests:
  • SFT Client (job submission part) was moved from lxplus to dedicated LCG-2_6_0 UI on lxn1182.cern.ch which is AFS free.
  • Data schema for service availability monitoring was established (meeting in Lyon on November 17th)

BDII:

  • there are now two machines behind the lcg-bdii alias: bdii01 and bdii02

Mon boxes/Archivers:

  • Upgraded to newer version of R-GMA (from gLite 1.4) - bug fixes in servlets

GridView:

  • Graphical reports for Jobs' status in GridView (Current Summary and Hourly cases).
  • VO-wise graphical report for GridFTP X-fer in GridView (Current Summary and Hourly cases).

Work in progress

Site Functional Tests:
  • Planning for service availability sensors and metrics (extension of data schema, SFT framework, implementation of sensors, integration, ...)
  • Scheduled downtimes in SFT history view.
  • Secure R-GMA connector test for SFT.
  • Centrally executed Apel test.

GridView:

  • Graphical reports for Jobs' status in GridView (Daily and Weekly cases).
  • VO-wise graphical report for GridFTP X-fer in GridView (Daily and Weekly cases).

Issues

SFT:
  • Replication failures are still there - the BDII is not the reason - waiting for a more verbose GFAL library from James Casey to find the real cause.

November 16th 2005

Activities completed since last week

A link was added from the Job Listmatch failure on the SFT page to the GStat page with details about the site BDII.

Work in progress

Setting up new machines for top-level BDII behind lcg-bdii.cern.ch alias.

Planning for service availability sensors and metrics (extension of data schema, SFT framework, implementation of sensors, integration, ...)

Scheduled downtimes in SFT history view.

Secure R-GMA connector test for SFT.

Centrally executed Apel test.

Graphs for hourly and daily reports of Jobs' status in GridView.

VO-wise graphical report of GridFTP log in GridView.

Issues

Currently there is only one machine running behind the alias lcg-bdii.cern.ch as the top-level BDII. We received reports which suggest that the machine is sometimes overloaded, causing Replica Management tests to fail (timeouts when querying the information system). Laurence requested additional machines from FIO to install more BDII servers behind a round-robin alias (a simple check of how many hosts the alias resolves to is sketched below). For this simple load balancing, all the machines have to be comparable in terms of performance.
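
A quick client-side way to see how many machines actually sit behind the round-robin alias is to resolve it and count the returned addresses, as in this sketch.

<verbatim>
# Hedged sketch: check how many hosts a round-robin DNS alias (such as the
# top-level BDII alias) resolves to. Purely a client-side DNS check.
import socket

ALIAS = "lcg-bdii.cern.ch"

try:
    name, aliases, addresses = socket.gethostbyname_ex(ALIAS)
    print("%s -> canonical name %s" % (ALIAS, name))
    print("%d address(es) behind the alias:" % len(addresses))
    for ip in addresses:
        print("  ", ip)
except socket.gaierror as err:
    print("could not resolve %s: %s" % (ALIAS, err))
</verbatim>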

November 1st 2005

Activities completed since last week

Software version display for sites was added on SFT report page.

Graphs for the Current Summary of Jobs' status on GridView's development page (GVDEV).

Work in progress

Scheduled downtimes in SFT history view.

Secure R-GMA connector test for SFT.

Centrally executed Apel test.

Graphs for hourly and daily reports of Jobs' status in GridView.

VO-wise graphical report of GridFTP log in GridView.

Issues


-- Main.pnyczyk - 16 Nov 2005
