Frontier Web>WLCGSquidOps (2024-02-25, MichalSvatos)

Introduction

The purpose of the group is to make sure that site squids operate smoothly and to help squid administrators fix problems with their squids. This includes dealing with failovers to ATLAS, CMS, and CVMFS servers.

Members

Barry Jay Blumenfeld (CMS)
Edita Kizinevic (CMS)
Michal Svatos (ATLAS)

Contact

frontier-talk@cern.ch
- general discussions about Frontier
atlas-adc-frontier@cern.ch
- ATLAS-specific issues
wlcg-squidmon-support@cern.ch
- rapid response to squid configuration questions
atlas-frontier-support@cern.ch
- email for alerts
wlcg-squid-ops@cern.ch
- Squid ops group
experts:
- Dave Dykstra
- Alessandro De Salvo

Failover monitoring

N.B.:
- squids are not excluded on backup proxies
  - squids should never contact backup proxies
  - this is for cases like cloud where squid and worker nodes can share IPs
- DNS cache is cleared once a day
  - if a squid changes IP address, it will appear as a failover for up to a day (then a new record is created and the monitor will know again that it is a squid)
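
The effect of the daily cache clearing can be illustrated with a minimal sketch. It assumes a simple in-memory dictionary; the class name DailyDnsCache, its parameters, and the resolve helper are hypothetical and are not taken from the monitor's code. Only socket.getaddrinfo and the "NoIpFound" marker appear elsewhere on this page.

```python
import socket
import time


def resolve(host):
    """Resolve a host name to one IP, or "NoIpFound" if the lookup fails."""
    try:
        return socket.getaddrinfo(host, None)[0][4][0]
    except socket.gaierror:
        return "NoIpFound"


class DailyDnsCache:
    """Hypothetical cache that forgets every entry once max_age seconds pass."""

    def __init__(self, resolver=resolve, max_age=24 * 3600, clock=time.time):
        self._resolver = resolver
        self._max_age = max_age
        self._clock = clock
        self._entries = {}
        self._cleared_at = clock()

    def lookup(self, host):
        now = self._clock()
        if now - self._cleared_at >= self._max_age:
            # The whole cache is dropped at once, not entry by entry.
            self._entries.clear()
            self._cleared_at = now
        if host not in self._entries:
            self._entries[host] = self._resolver(host)
        return self._entries[host]
```

Until the cache is cleared, a squid that has moved to a new IP is still mapped to its old address, which is why it can show up as a failover for up to a day.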

Script

The following text describes (as of 10.6.2021) how the various input files are processed to obtain failovers and how the monitoring web page displays them.

Summary of input files

config file
node mapping file
list of squids and their properties (grid-squids.json)
name translation file
records of failover activity in the last 72 hours (failover-record.tsv)
awstats files
records of awstats activity for each machine group (situation from the last time the script was running)
records of emails sent in the last 72 hours (email-record.tsv)

Summary of output files

records of emails sent in the last 72 hours (with addition of currently sent emails)

records of failover activity in the last 72 hours (with addition of current failovers)

records of failover activity in the last 72 hours (with addition of current failovers) with squids removed

summary of daily activity per site

high bandwidth logs:

list of machines which used more than 1 GB of bandwidth in the current day
list of machines which used more than 10 GB of bandwidth in the current day
there is also a page displaying the content of those files:
- http://wlcg-squid-monitor.cern.ch/failover/displayHighBandwidth.html
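
A minimal sketch of how the two high bandwidth lists could be built, using the ordering the procedure describes for these files (non-squids first, then highest bandwidth first). The HostUsage type, the high_bandwidth function, the example host names, and the choice of decimal gigabytes are assumptions made for illustration, not the script's actual code.

```python
from dataclasses import dataclass

GB = 10**9  # assumption: decimal gigabytes; the script may use 2**30 instead


@dataclass
class HostUsage:
    host: str
    is_squid: bool
    bandwidth: int  # bytes used so far in the current day


def high_bandwidth(hosts, threshold_bytes):
    """Return hosts above the threshold, non-squids first, highest bandwidth first."""
    selected = [h for h in hosts if h.bandwidth > threshold_bytes]
    # False sorts before True, so non-squids come first within the result.
    return sorted(selected, key=lambda h: (h.is_squid, -h.bandwidth))


hosts = [
    HostUsage("wn1.example.org", False, 12 * GB),
    HostUsage("squid1.example.org", True, 30 * GB),
    HostUsage("wn2.example.org", False, 3 * GB),
    HostUsage("wn3.example.org", False, 500_000_000),
]
over_1gb = high_bandwidth(hosts, 1 * GB)
over_10gb = high_bandwidth(hosts, 10 * GB)
```

With this input, over_1gb lists wn1, wn2 and then squid1, while over_10gb lists wn1 and then squid1.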

The procedure

From the list of squids, the script extracts host name, site name, IPs and IP ranges.
It loads the last 72 hours of failover history.
For each Group (group of machines as defined by awstats, e.g. cmscernbp, atlastriumf, etc.):
1. It reads the following from the 72-hour history: Timestamp, Group, Sites, Host, IP, IsSquid, Bandwidth, BandwidthRate, Hits, HitsRate (the meaning of these values is explained below).
2. It reads the list of machines in the Group (from the config file) and checks the awstats files for each machine, extracting host name, number of hits, bandwidth, and IP (obtained from the host name by socket.getaddrinfo).
  - if corrupted awstats records (with many l characters added to the beginning of the host name) are found, they are cleaned (if the host name has more than 2 l characters in it, they are counted and that number of characters is removed from the beginning of the host name)
3. It reads the awstats activity history file (data from the last time the script ran) and writes the current situation into that file.
4. From the awstats activity file, hosts using a lot of bandwidth in the current day are extracted. The output files are ordered as follows:
  - by IsSquid - non-squids are first, squids after them
  - by bandwidth - in each of the groups (non-squid or squid), the hosts are sorted by bandwidth starting from the highest
5. Data from the current execution (i.e. data from the awstats folders) and the previous execution (i.e. data from the awstats history file which the script reads and writes) are compared. A failover activity record is created when the host name and IP match in current and previous and hits and bandwidth have increased with respect to previous, and also for host names in current which are not in previous (new activity). Hosts on the remove list in the config file are not processed. Each record contains:
  - Group - group of machines as defined by awstats, e.g. cmscernbp, atlastriumf, etc.
  - Sites
    - if the IP is not found, it becomes "Unknown"
    - if the IP is found then the site name comes from maxminddb.open_database().get() which processes it
    - if maxminddb.open_database().get() cannot find a site name for the IP, it becomes "Unknown"
  - Host
    - if host name is any of '127.0.0.1', 'localhost', 'localhost6', '::1' then it is set to "localhost" otherwise it is set to host name from current data
  - IP
  - IsSquid
    - if the IP is found in the list of squids then it is set to True
  - Bandwidth
    - bandwidth from previous is subtracted from current
  - BandwidthRate
    - bandwidth from previous is subtracted from current and then it is divided by difference in timestamps
  - Hits
    - hits from previous are subtracted from current
  - HitsRate
    - hits from previous are subtracted from current and then it is divided by difference in timestamps
6. Those are active failovers. For them, the following is done:
  1. Any host name in Unknown site which ends with cern.ch is moved to CERN-EuropeanOrganizationforNuclearResearch
  2. Site names are translated from geoip names to experiment-specific names. This is done by another script outside of the main script; the main script just reads its output. Its functionality is:
    1. The list of squids is read and, for each squid, the pair of site name and geoip site name is kept.
    2. For CMS, site names are matched with lcg names from cms-lcg-sites.json. If a match is found, the pair of cms name and geoip name is kept.
    3. For ATLAS, site names are matched with "name" names from agis_site_info.json. If a match is found, the pair of "rc_site" name and geoip name is kept.
  3. If the site name consists of several sites, those site names are searched for in the list of squids. For those which are found, IP ranges are extracted. Then it is checked whether the IP in the failover activity record belongs to any of those IP ranges. If a match is found, the site name is set to the matched site.
    - range 0.0.0.0/0 is ignored as it seems every IP belongs there
  4. Site exceptions defined in the config file (strings separated by semicolons) are applied.
  5. For each site, the total hit rate coming from it is calculated as the sum of hit rates from all host names belonging to it. If the hit rate exceeds a threshold (defined in the config file), the site is considered an offending site and is added to the records of failover activity with the current timestamp.
7. If a site has failover activity records in the two last consecutive periods, it is considered a persistent failover. Two last consecutive records means:
  - There is one record whose timestamp is less than 400 s before now.
  - There is one record whose timestamp is more than 3000 s but less than 4000 s before now.
8. For persistent failovers, a summary per site is made. For each site, it provides the sum of hits per machine group in the last hour and in the last 24 hours.
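
The arithmetic in steps 5, 6.3, 6.5 and 7 can be sketched as follows. This is an illustration under stated assumptions, not the script itself: all function names and dictionary keys other than the record fields named above are hypothetical, and the sketch assumes a record is created only when both hits and bandwidth increased. Hosts whose IP could not be resolved would have to be filtered out before calling match_site.

```python
import ipaddress
from collections import defaultdict


def activity(prev, cur):
    """Step 5: differences and rates between the previous and current run.

    prev and cur are dicts with cumulative "timestamp" (s), "hits" and
    "bandwidth". Assumption: a record is created only if both counters grew.
    """
    d_hits = cur["hits"] - prev["hits"]
    d_bw = cur["bandwidth"] - prev["bandwidth"]
    dt = cur["timestamp"] - prev["timestamp"]
    if d_hits <= 0 or d_bw <= 0 or dt <= 0:
        return None
    return {"Hits": d_hits, "HitsRate": d_hits / dt,
            "Bandwidth": d_bw, "BandwidthRate": d_bw / dt}


def match_site(ip, site_ranges):
    """Step 6.3: return the candidate site whose IP ranges contain ip."""
    addr = ipaddress.ip_address(ip)
    for site, ranges in site_ranges.items():
        for cidr in ranges:
            if cidr == "0.0.0.0/0":  # ignored: every IPv4 address is inside it
                continue
            if addr in ipaddress.ip_network(cidr, strict=False):
                return site
    return None


def offending_sites(records, threshold):
    """Step 6.5: sites whose summed hit rate exceeds the threshold."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["Sites"]] += rec["HitsRate"]
    return {site for site, rate in totals.items() if rate > threshold}


def is_persistent(record_times, now):
    """Step 7: one record under 400 s old and one between 3000 s and 4000 s old."""
    ages = [now - t for t in record_times]
    return any(a < 400 for a in ages) and any(3000 < a < 4000 for a in ages)
```

The two age windows in is_persistent would correspond to the current run and the run roughly one hour earlier, assuming the script is executed hourly.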

The record of emails sent in the last 72 hours is read, and emails sent in the last 24 hours are extracted. For persistent failover sites which are not on the list of emails from the last 24 hours, emails (based on a prepared template) are sent to the addresses from the config file.
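
The 24-hour suppression described above can be sketched as a set difference. The function name sites_to_notify and the (site, timestamp) record layout are assumptions for illustration; the actual layout of email-record.tsv is not described on this page.

```python
DAY = 24 * 3600  # seconds


def sites_to_notify(persistent_sites, email_records, now):
    """Return persistent failover sites that were not emailed in the last 24 hours.

    email_records is an iterable of (site, sent_timestamp) pairs covering the
    last 72 hours; timestamps and now are in seconds.
    """
    recently_mailed = {site for site, sent in email_records if now - sent < DAY}
    return sorted(set(persistent_sites) - recently_mailed)


now = 1_000_000
records = [("SiteA", now - 3600), ("SiteB", now - 30 * 3600)]
to_mail = sites_to_notify(["SiteA", "SiteB", "SiteC"], records, now)
```

Here SiteA is skipped because it was emailed an hour ago, while SiteB (last emailed 30 hours ago) and SiteC (never emailed) are selected.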

Example of sent email


This is an automated message created at 2019-06-20 08:18:58 (UTC).  Many database queries from your site have connected
directly to the following Frontier Server Groups during the last hour, with a high
rate of queries not going through your local squid(s):

Group           GCode           RateThreshold [*]
FNAL Stratum 1  cvmfs.fnal.gov        5.56


The most common sources of this problem are:
    1. Squids are not running
    2. Squids are not listed in site-local-config.xml
    3. Not all local IP addresses are in squid's access control lists

When you have found the cause of the problem or if you have any questions, please
contact wlcg-squid-ops@cern.ch by replying to this message.

The record of Frontier activity from your site during the last period (60 minutes)
is displayed below. The full access history during the past 72 hours is available
at http://wlcg-squid-monitor.cern.ch/failover/failoverCvmfs/failover.html

[*] The rate is the effective number of queries per second over each period.

===== Record of Frontier server accesses in the last hour =====
Site: Unknown
Host aggregation:
IsSquid  GCode           Hits   Bandwidth
False    cvmfs.fnal.gov  35453  8.78 GiB


Detailed table:
IsSquid  GCode           Host                                                  Ip              Hits   Bandwidth
False    cvmfs.fnal.gov  localhost.localdomain                                 127.0.0.1       30208   7.80 GiB
False    cvmfs.fnal.gov  bb118-200-69-89.singnet.com.sg                        NoIpFound        112   45.72 MiB
True     cvmfs.fnal.gov  grinr06.inr.troitsk.ru                                185.207.88.136    2     1.73 kiB
False    cvmfs.fnal.gov  no-mans-land.m247.com                                 NoIpFound        12    10.86 kiB
False    cvmfs.fnal.gov  66.37.33.202.lightower.net                            NoIpFound         5     4.37 kiB
False    cvmfs.fnal.gov  174-30-26-155.wrbg.centurylink.net                    NoIpFound       1305   214.66 MiB
False    cvmfs.fnal.gov  customer15-179.lawireless.it                          NoIpFound         3     2.15 kiB
False    cvmfs.fnal.gov  fsquid.swt2.atlas-swt2.org.uta.edu                    NoIpFound        30    25.78 kiB
False    cvmfs.fnal.gov  dynamic-216-186-138-162.knology.net                   NoIpFound        49    148.32 kiB
False    cvmfs.fnal.gov  dynamic-216-186-150-154.knology.net                   NoIpFound         5     4.53 kiB
False    cvmfs.fnal.gov  storagelabinfo1.lncc.br                               NoIpFound         2     0.63 kiB
False    cvmfs.fnal.gov  96-78-73-89-static.hfc.comcastbusiness.net            NoIpFound         5     4.37 kiB
False    cvmfs.fnal.gov  86-126-27-130.rdsnet.ro                               NoIpFound        149   39.28 MiB
False    cvmfs.fnal.gov  ads-pool.nat.uw.edu                                   NoIpFound        15    13.65 kiB
False    cvmfs.fnal.gov  209-6-203-217.c3-0.smr-ubr1.sbo-smr.ma.cable.rcn.com  NoIpFound         5     4.37 kiB
False    cvmfs.fnal.gov  107-145-170-127.res.bhn.net                           NoIpFound         4     3.49 kiB
False    cvmfs.fnal.gov  74.82.240.18.ifibertv.com                             NoIpFound         8     7.27 kiB
False    cvmfs.fnal.gov  46-140-17-230.static.cablecom.ch                      NoIpFound        44    19.47 MiB
False    cvmfs.fnal.gov  130.20.68.60.pnnl.gov                                 NoIpFound        30    26.67 kiB
False    cvmfs.fnal.gov  epldt085.ph.bham.ac.uk                                NoIpFound         1     325.00 B
False    cvmfs.fnal.gov  cpe-121-208-32-42.cfui-cr-004.woo.qld.bigpond.net.au  NoIpFound       2716   461.39 MiB
False    cvmfs.fnal.gov  host184537317.direcway.com                            NoIpFound        46     2.64 MiB
False    cvmfs.fnal.gov  wlg-nat.solnetsolutions.co.nz                         NoIpFound         3     3.35 kiB
False    cvmfs.fnal.gov  adsl-ull-50-53.49-151.wind.it                         NoIpFound         2     1.82 kiB
False    cvmfs.fnal.gov  ip52-181.comteam.at                                   NoIpFound         1     325.00 B
False    cvmfs.fnal.gov  home-78-83-130-166.optinet.bg                         NoIpFound         2     1.82 kiB
False    cvmfs.fnal.gov  cust-210.241.102.5.018.net.il                         NoIpFound        633   178.11 MiB
False    cvmfs.fnal.gov  pool-6-193.106.50.119.o.kg                            NoIpFound        11     2.25 MiB
False    cvmfs.fnal.gov  188-26-142-130.rdsnet.ro                              NoIpFound         2     0.63 kiB
False    cvmfs.fnal.gov  pop-139.126.escom.bg                                  NoIpFound        11    19.25 MiB
False    cvmfs.fnal.gov  cable-37-120-81-23.cust.telecolumbus.net              NoIpFound        32    22.57 MiB

Output files are written.

Web page

The webpage combines HTML with heavy usage of JavaScript. It contains:

some introductory info
information about time range currently being displayed
Machine Groups pie chart with legend
- Clicking on a machine group, either in the pie chart or in the legend, filters the "Hits by site per hour" plot and the "Access History Detail" table.
- The blue word "Reset" next to the "Machine Groups" text resets the filtering.
"Hits by site per hour" plot with legend, displaying the number of hits from failover sites in one-hour bins
- hovering over a site name in the legend hides hits from other sites
- hovering over a bar on the plot displays the site name and the number of hits in the bar
"Access History Detail" table, displaying for each site all its WNs with timestamp, number of hits and bandwidth
- hovering over a host name displays its IP
- hovering over the number of hits displays the hits rate
- hovering over the bandwidth displays the bandwidth rate
"Record of Email Alarms" table, showing sites which raised an email alarm in the last 72 hours. It shows the site name, who got the email and when.

Monitoring pages

Code

Email alerts

The failover monitor sends an email alert when the hit rate from a site exceeds the threshold defined in the config twice in a row. The recipients of these emails are:

ATLAS: Michal Svatos
CMS: cms-frontier-alarms
CVMFS: wlcg-squid-ops
CernVM: Dave Dykstra

ATLAS Elastic Search

The Elastic Search instance hosted in Chicago stores details of job logs, which can provide further information during an investigation.

How to search details in ATLAS Elastic Search

Open the Elastic Search page (needs user account)
Select "frontier_sql" index
Click on "Add a filter"
Choose "clientmachine" and "is"
Put IP address of the machine as the value.
Save

Filtering

to filter hits on CERN backup proxies, use Add filter button and then
- set field to clientsquid
- set operator to is one of
- set Values to 128.142.168.202 and 128.142.33.31
to filter hits on FNAL backup proxies, use Add filter button and then
- set field to clientsquid
- set operator to is one of
- set Values to 131.225.188.245 and 131.225.188.246
to use wildcards, use Add filter button and then
- in the top right corner of the window that opened, click on Edit as Query DSL
- replace match_phrase with wildcard
to see connection from local users
- set field to pandaid
- set operator to is
- set Values to 0
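
The wildcard step can be illustrated by rewriting the filter body programmatically. This is a sketch only: the exact JSON shown by "Edit as Query DSL" depends on the Kibana version, the helper name to_wildcard is hypothetical, the address pattern is a made-up example, and whether a wildcard matches depends on how the clientmachine field is mapped in the index.

```python
import json


def to_wildcard(filter_dsl):
    """Return a copy of a filter body with match_phrase swapped for wildcard."""
    query = dict(filter_dsl["query"])  # shallow copy so the input is untouched
    query["wildcard"] = query.pop("match_phrase")
    return {**filter_dsl, "query": query}


# Assumed shape of the filter as shown by "Edit as Query DSL".
original = {"query": {"match_phrase": {"clientmachine": "192.0.2.*"}}}
print(json.dumps(to_wildcard(original), indent=2))
```

The printed JSON is what would be pasted back into the Query DSL editor in place of the original filter body.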

WLCG-WPAD dashboard

the hits come when something on a WLCG site (or non-WLCG sites running WLCG jobs, e.g. LHC@Home jobs) tries to use proxy autodiscovery to get information about nearest squid or
services monitored (each has one server in FNAL and one in CERN):
- wlcg-wpad - replies positively only at grid sites, and includes backup proxies at those sites after squids that are found. I think the only production use is old-config LHC@Home jobs.
- lhchomeproxy - like wlcg-wpad except at non-grid sites it replies DIRECT for openhtc.io destinations, so will use cloudflare. Used by current LHC@Home jobs.
- cernvm-wpad - like lhchomeproxy except at non-grid sites it watches for too many requests in too short of a time (more below) and if so directs them to cernvm backup proxies on port 3125. Used as default for CernVM, cvmfsexec, and soon to be the default configuration for cvmfs if people do not set CVMFS_HTTP_PROXY and are using the cvmfs-config-default configuration rpm.
- cmsopsquid - like cernvm-wpad except too many requests from non-grid sites in too short of a time get sent to the cms backup proxies on port 3128. Used by CMS opportunistic jobs in the U.S.
dashboard content
- type of info
  - no squid found - wpad returned no squid
  - no org found - nothing found in the geoip database
  - default squid - wpad returned site's squid
  - disabled - if site's squid is disabled (recorded in worker-proxies.json and/or shoal-squids.json)
  - overload - if there are too many requests from one geoip org in too short of a time
- service names
  - hits per service
  - type of info for each of the service names

CVMFS notes

not all stratum 1s have all the repositories of the domains that they host
they are run in primary/backup mode rather than round-robin
the stratum 1s are normally ordered by geo from the client
failover rules: https://cvmfs.readthedocs.io/en/stable/cpt-configure.html#failover-rules
details of CVMFS settings: cvmfs_config showconfig atlas.cern.ch
- important parameters:
  - CVMFS_HTTP_PROXY - normally should be a squid (there should not be DIRECT)
  - CVMFS_FALLBACK_PROXY
- if there is DIRECT in CVMFS_HTTP_PROXY, it is ignored if something is defined in CVMFS_FALLBACK_PROXY
details of CVMFS proxy: cvmfs_talk proxy info
version of the CVMFS: rpm -qi cvmfs|grep Version
checking if a machine can get repo details from the squid, e.g. : http_proxy=squid_name curl -vs http://cvmfs-stratum-one.cern.ch/cvmfs/atlas-nightlies.cern.ch/.cvmfspublished|wc -c
stratum ones operation twiki: https://twiki.cern.ch/twiki/bin/view/CvmFS/StratumOnes
supported repositories for each stratum one:
availability of those stratum ones: https://monit-grafana.cern.ch/d/000000855/overview-service-availability?orgId=1&var-cs_name=All&var-fe_name=All&var-isfe=All&var-service=All&var-regexp=cvmfs_stratum1mon*&var-status=All&var-groupby=service_id&var-ssb_otgs=All&var-ssb_type=All
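
The note on DIRECT in CVMFS_HTTP_PROXY can be checked with a small sketch. It assumes the usual CVMFS proxy syntax, where "|" separates load-balanced proxies within a group and ";" separates failover groups; the function names and the example.org proxies are hypothetical, and the rule for ignoring DIRECT is taken from the note above rather than from the CVMFS source.

```python
def parse_proxy_groups(value):
    """Split a proxy string into failover groups of load-balanced proxies."""
    groups = []
    for group in value.split(";"):
        members = [p.strip() for p in group.split("|") if p.strip()]
        if members:
            groups.append(members)
    return groups


def direct_is_effective(http_proxy, fallback_proxy=""):
    """True if DIRECT is listed in http_proxy and no fallback proxy overrides it."""
    has_direct = any("DIRECT" in group for group in parse_proxy_groups(http_proxy))
    return has_direct and not parse_proxy_groups(fallback_proxy)


http_proxy = "http://squid1.example.org:3128|http://squid2.example.org:3128;DIRECT"
print(direct_is_effective(http_proxy))
print(direct_is_effective(http_proxy, "http://backup.example.org:3128"))
```

The values to pass in can be read from the output of cvmfs_config showconfig for the repository in question.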

-- MichalSvatos - 2019-05-09

Topic revision: r29 - 2024-02-25 - MichalSvatos
