WLCG Monitoring Task Force

Motivation

The WLCG Monitoring Task Force was set up following the outcome of the Data Challenge activity.

The highest-priority task for the Task Force is to address the current issues with transfer and site monitoring, following the needs expressed by the Data Challenge activity.

Mandate

XRootD monitoring

  • Redesign the current implementation, which is based on XRootD server reports sent over the UDP protocol. This work implies collaboration between the MONIT and OSG developers
  • Coordinate with the dCache developers to enable the monitoring flow for the dCache+XRootD port use case
  • Make sure that XRootD monitoring data is properly integrated into the WLCG transfer monitor. This also includes the ALICE XRootD monitoring flow

Components

  • Shoveler:
    • New component that ships XRootD monitoring streams to a message queue (source code)
    • Intended to be deployed as close as possible to the XRootD server (to avoid UDP fragmentation)
  • Collector:
    • Component similar to the previous GLED collectors (source code)
    • Reads XRootD streams from a message queue and aggregates them into transfer documents
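
Putting the two components together, the intended data flow (using only the protocols and pieces described in this and the deployment section below) is:

```
XRootD server --UDP--> Shoveler --STOMP/AMQP--> Message queue --> Collector --> transfer documents
```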

Deployment of the XRootD monitoring components

These instructions cover the deployment of the components for non-US sites (i.e. sites that need to report directly to CERN).

Collector

The new collector is meant to be deployed centrally per area, so ideally there should be no more than one collector for US sites (running at ?) and one for non-US sites (running at CERN). Since it reads from and writes to a message queue, sites may also decide to run their own in case they want to benefit from some local monitoring.

Please refer to the project link for more information about how to configure it.

Shoveler

The Shoveler, on the other hand, is meant to be deployed at each individual site, as close as possible to the XRootD servers. This allows XRootD to communicate with the shoveler via UDP; the shoveler then submits the messages in a reliable way to a message queue, using either the AMQP or STOMP protocol.

  • Deploy the shoveler close to the XRootD servers in your site
    • Choose between a Docker or RPM installation (see the docs)
      • For convenience, the RPM will be added to the WLCG repositories (will be updated here)
  • Contact WLCGMON-TF to get credentials to configure it (if you have received a GGUS ticket from us, it can be used to request that the certificate be allowed). You can choose between:
    • Basic authentication over HTTP
    • Credentials over TLS (a valid grid certificate will be required), preferably a host certificate with an alias, or a robot one.
      • Provide the subject in RFC 2253 format (see the example after this list): openssl x509 -noout -in cert.pem -subject -nameopt RFC2253
  • The XRootD server will need to be configured to send the streams to a second endpoint (the shoveler):
    • ```xrootd.monitor all flush 30s mbuff 1472 window 5s fstat 60 lfn ops xfr 5 [current destination] dest fstat files info user pfc tcpmon ccm [shoveler address]```

Running

  • The RPM installation comes with a predefined service, "xrootd-monitoring-shoveler" (see the usage sketch after this list)
    • There is an open issue upstream to make sure the service depends on other bits, like the network being available.
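
As a usage sketch, the service can be managed with the standard systemd commands (only the service name, "xrootd-monitoring-shoveler", comes from the package; the rest is generic systemd usage):

```
# Enable the shoveler at boot and start it immediately
systemctl enable --now xrootd-monitoring-shoveler

# Check the service state and follow its logs
systemctl status xrootd-monitoring-shoveler
journalctl -u xrootd-monitoring-shoveler -f
```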

Configuration

Place this configuration inside /etc/xrootd-monitoring-shoveler/config.yaml if running from the RPM. Please note the following specific cases when configuring the topic:

  • If your XRootD server is multi-VO and/or configured with VOMS, use topic: /topic/xrootd.shoveler (without the VO)
  • If you are running XRootD 5.6+ and use tokens for authentication, use topic: /topic/xrootd.shoveler (without the VO)
  • Otherwise, please use topic: /topic/xrootd.shoveler.[vo]

Make sure /var/spool/xrootd-monitoring-collector is available on the host (this will be ensured by the package in future versions).

  • Basic auth:

```
mq: stomp

stomp:
  user:
  password:
  url: dashb-lb-mb.cern.ch:61113
  # Make sure to replace [vo] with the correct VO (alice, atlas, cms, lhcb)
  topic: /topic/xrootd.shoveler.[vo]

listen:
  port: 9993
  ip: 0.0.0.0

# Whether to verify that the header of the packet matches XRootD's monitoring
# packet format
verify: true

# Export Prometheus metrics (optional)
metrics:
  enable: true
  port: 8000

# Directory to store the overflow of the queue onto disk.
# The queue keeps 100 messages in memory. If the shoveler is disconnected from the
# message bus, it will store messages over the 100 in memory onto disk into this
# directory. Once the connection has been re-established the queue will be emptied.
# The queue on disk is persistent between restarts.
queue_directory: /var/spool/shoveler-queue
```
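
With the optional metrics block enabled as above, a quick sanity check is to query the shoveler's Prometheus endpoint on the configured port (assuming the conventional /metrics path):

```
curl http://localhost:8000/metrics
```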
  • Certificates: make sure to use shoveler version 1.2.2+
    • Make sure the SSL_CERT_DIR environment variable is set (e.g. Environment="SSL_CERT_DIR=/etc/grid-security/certificates"); see the drop-in sketch after the configuration below
```
mq: stomp

stomp:
  cert: /path/to/cert.crt
  certkey: /path/to/cert.key
  url: dashb-lb-mb.cern.ch:61123
  # Make sure to replace [vo] with the correct VO (alice, atlas, cms, lhcb)
  topic: /topic/xrootd.shoveler.[vo]

listen:
  port: 9993
  ip: 0.0.0.0

# Whether to verify that the header of the packet matches XRootD's monitoring
# packet format
verify: true

# Export Prometheus metrics (optional)
metrics:
  enable: true
  port: 8000

# Directory to store the overflow of the queue onto disk.
# The queue keeps 100 messages in memory. If the shoveler is disconnected from the
# message bus, it will store messages over the 100 in memory onto disk into this
# directory. Once the connection has been re-established the queue will be emptied.
# The queue on disk is persistent between restarts.
queue_directory: /var/spool/shoveler-queue
```
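
For the TLS case, one way to set SSL_CERT_DIR for the RPM service is a standard systemd drop-in (a sketch; the directory is the one from the note above):

```
# Create an override for the shoveler unit...
systemctl edit xrootd-monitoring-shoveler

# ...and add the following to the drop-in file:
#   [Service]
#   Environment="SSL_CERT_DIR=/etc/grid-security/certificates"

# Then reload and restart:
systemctl daemon-reload
systemctl restart xrootd-monitoring-shoveler
```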

Current situation

Other producers

  • ALICE MonALISA: will converge on the new flow by pointing a parallel flow from the ALICE XRootD servers to the new shovelers
  • xCache: OSG is already using the same flow for monitoring their instances
  • dCache: testing the newly agreed data flow

Onboarded sites

Here is the list of sites that have already been in contact with us to start sending some data through the new flow, in parallel to the current GLED one.

  • CERN (EOS Alice)
  • RAL-LCG2
  • AUVERGRID (Waiting for new shoveler release)
  • UKI-NORTHGRID-MAN-HEP
  • US-MIT
  • UFlorida-HPC
  • UCSDT2
  • SPRACE
  • CIT_CMS_T2
  • US-GLOW
  • T3_US_OSG

Known issues

Several issues have been spotted while deploying test flows for different sites; in this section we try to gather them all together.

  • Lack of VO information
    • The current collector extracts it from the auth stream, which is not available for all XRootD servers due to differences in configuration (it requires VOMS, or XRootD 5.6 + tokens).
  • Wrong operation time (set to 0)
    • Initially thought to be due to fast transfers under one-second resolution (since XRootD reports 0 in this case): open issue
    • Observed to have appeared also for big transfers; the investigation is ongoing with XRootD

WLCG transfers

  • Consolidate the WLCG transfer dashboard following the lessons learned during the October DCs
  • Define the required minimum schema for the transfer documents to be used in the dashboard generation (a purely hypothetical sketch follows below)
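
As a purely hypothetical illustration of what such a minimum schema could contain (none of these field names are agreed; they are assumptions for the sake of the example):

```
# Hypothetical minimal transfer document (all field names are illustrative only)
vo: cms                            # owning VO (alice, atlas, cms, lhcb)
src_hostname: source.example.org   # transfer source
dst_hostname: dest.example.org     # transfer destination
protocol: xrootd
bytes_transferred: 1073741824
start_time: 2024-04-11T12:00:00Z
end_time: 2024-04-11T12:05:00Z
status: ok
```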

Site network monitoring

Task Force meetings

WLCG Monitoring Task Force meetings take place bi-weekly on Tuesdays, starting at 15:00 Geneva time.

Membership

Members

  • Alessandra Forti
  • Borja Garrido
  • Derek Weitzel
  • Julia Andreeva
  • Rizart Dona
  • Shawn McKee

Contact

  • wlcgmon-tf (at cern.ch)