-- NikolinaIlic - 2023-11-09

Introduction

This TWiki page describes how to start a partition, ensuring that configuration and pinning files are correctly specified and properly applied. Below is a step-by-step guide to everything that should be checked before starting a partition. Generating server performance reports is also described. All the relevant configuration files can be found in this git repo: https://gitlab.cern.ch/dune-daq/online/np04daq-configs/-/tree/master?ref_type=heads.

Step by Step guide

Step 1: Before doing anything, open a tmux session and run the command below, which will tell you which network devices are available in the system:

tmux new -s np0X_coldbox
dpdk-devbind.py -s

This is relevant for the type of configuration files you need to specify.
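For reference, the sketch below shows how to pick out the devices already bound to a DPDK-compatible driver from the status output; the device names and PCI addresses in the sample are illustrative, not taken from a real NP04 server:

```shell
# Illustrative excerpt of `dpdk-devbind.py -s` output (not from a real server):
sample_output=$(cat <<'EOF'
Network devices using DPDK-compatible driver
============================================
0000:98:00.0 'Ethernet Controller E810-C for QSFP 1592' drv=vfio-pci unused=ice

Network devices using kernel driver
===================================
0000:31:00.0 'MT27800 Family [ConnectX-5] 1017' if=eno1 drv=mlx5_core unused=vfio-pci *Active*
EOF
)

# Print only the PCI addresses in the "DPDK-compatible driver" section:
dpdk_devices=$(printf '%s\n' "$sample_output" | awk '
    /DPDK-compatible driver/ {f=1; next}   # start of DPDK section
    /kernel driver/          {f=0}         # end of DPDK section
    f && /^[0-9a-f]/         {print $1}    # device lines start with the PCI address
')
echo "$dpdk_devices"
```

On a real server, pipe the actual `dpdk-devbind.py -s` output through the same awk filter.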

Step 2: Next, log into the server you are working on (eg. np02-srv-003) and check which NUMA node the NIC is attached to (NUMA 0 versus NUMA 1) by running:

lspci -s 98:00.0 -vvvvvvvvv 
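As a cross-check, the NUMA node of a PCI device is also exposed in sysfs. The sketch below wraps this in a small helper; the 0000:98:00.0 address is just the example from above, and the optional second argument exists only to make the helper testable outside a real server:

```shell
# Print the NUMA node of a PCI device from sysfs.
#   $1 = full PCI address (e.g. 0000:98:00.0)
#   $2 = sysfs root, defaults to /sys (overridable for testing)
pci_numa_node() {
    sys="${2:-/sys}"
    cat "$sys/bus/pci/devices/$1/numa_node"
}

# Usage on a real server (value -1 means the kernel has no NUMA info):
#   pci_numa_node 0000:98:00.0
```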

Step 3: Check the resources available by running:

lscpu
numactl -H

Step 4: In order to know how to fill out the configuration and pinning files, it is necessary to look up how the cores on a server are arranged. For a mapping of the cores on different server types, run the command below:

lstopo

Always use the physical P# numbers (the ones in parentheses), not the logical PU L# numbers, to make sure you are correctly selecting the cores.
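lstopo's text output lists each hardware thread as, e.g., "PU L#0 (P#0)". The snippet below is a small sketch, run on illustrative output rather than a real server's topology, of pulling out just the P# values that go into the pinning files:

```shell
# Illustrative fragment of lstopo text output (not a real server):
sample_lstopo=$(cat <<'EOF'
  NUMANode L#0 (P#0 128GB)
    Core L#0
      PU L#0 (P#0)
      PU L#1 (P#56)
    Core L#1
      PU L#2 (P#1)
      PU L#3 (P#57)
EOF
)

# Extract the physical (P#) index of every hardware thread:
p_numbers=$(printf '%s\n' "$sample_lstopo" | sed -n 's/.*PU L#[0-9]* (P#\([0-9]*\)).*/\1/p')
echo "$p_numbers"
```

On a real server, pipe `lstopo-no-graphics` (the text-mode variant) through the same sed filter.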

Step 5: Open the configuration file used for generating configurations, namely the global JSON file that contains the global DAQ parameters:

emacs np04daq-configs/readout_configs.json

In this file, find the entries for the server you are working on (eg. np02-srv-003). You will see many configurable options:

  "common": {
   "generate_periodic_adc_pattern": false,
   "enable_tpg": false,
   "enable_raw_recording": false,
   "raw_recording_output_dir": ".",
   "fragment_send_timeout_ms": 10000,
   "latency_buffer_size": 139008,
   "dpdk_eal_args": "-l 0-1 -n 3 -- -m [0:1].0 -j",
   "dpdk_lcores_config": {
       "default_lcore_id_set": [1,2,3,4],
       "exceptions": [
                  {
          "host": "np02-srv-003",
          "iface": 0,
          "lcore_id_set": [1,57,3,59,5,61]
      }
           ]

Ensure that default_lcore_id_set uses cores on the NUMA node you want. Also make sure in this file that numa_id is set correctly (0 if your NIC is on NUMA 0, 1 if your NIC is on NUMA 1).
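If jq is available on the server, the exception list can be queried directly instead of eyeballing the file. The snippet below runs on an inline copy of the fragment above; on the server, point jq at np04daq-configs/readout_configs.json instead:

```shell
# Inline copy of the relevant fragment, for illustration only:
cat > /tmp/readout_snippet.json <<'EOF'
{
  "common": {
    "dpdk_lcores_config": {
      "default_lcore_id_set": [1,2,3,4],
      "exceptions": [
        { "host": "np02-srv-003", "iface": 0, "lcore_id_set": [1,57,3,59,5,61] }
      ]
    }
  }
}
EOF

# Show the lcore set that applies to a given host:
jq -c '.common.dpdk_lcores_config.exceptions[]
       | select(.host == "np02-srv-003")
       | .lcore_id_set' /tmp/readout_snippet.json
```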

Ensure that the path to the thread_pinning_file is correct:


   "thread_pinning_file": "${PWD}/cpupin_files/cpupin-all.json",
   "numa_config": {
            "default_id": 1,
            "default_latency_numa_aware": true,

The latency buffer is NUMA-aware by default and will be allocated on the default_id node, but this can be overridden by setting latency_buffer_numa_aware to false.

Other options that can be specified include turning on Trigger Primitive Generation (enable_tpg) and enabling SNB recording (enable_raw_recording).

Step 6: Now open the pinning file, linked from the readout_configs.json file above:

emacs cpupin_files/cpupin-all.json

Find the instances of the server you are working on (eg. np02-srv-003). Then specify the correct parent, rte-workers, and consumers (making sure the consumers are on the NUMA node you need). The threads have to match what is in readout_configs.json.
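A quick way to locate the relevant entries is grep with line numbers and context. The file layout below is a hypothetical stand-in (the real cpupin-all.json schema may differ); on the server, run the same grep against cpupin_files/cpupin-all.json:

```shell
# Hypothetical stand-in for a pinning file entry (keys are illustrative):
cat > /tmp/cpupin_snippet.json <<'EOF'
{
  "np02-srv-003": {
    "parent": "1,3,5",
    "rte-workers": "57,59,61",
    "consumers": "7,9,11"
  }
}
EOF

# List every entry for the server, with line numbers and 3 lines of context,
# so the cores can be compared side by side with readout_configs.json:
grep -n -A 3 'np02-srv-003' /tmp/cpupin_snippet.json
```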

For information on how to define various parameters of a pinning file consult here (link coming).

Step 7: Make sure the paths are correctly defined in the file:

cp-2_CRP4_WIB.json

Step 8: Check both of the files below to make sure all paths are correct and consistent:

recreate_np02_daq_configuration.sh 
np02_daq.json 

Step 9: Check the start run file to make sure all paths are consistent:

setup_for_run.sh

Step 9b: The temporary procedure for when to boot/configure/start versus when to manually apply the pinning file will be described here once it is settled. More information on how to start a partition is linked here.

IMPORTANT: Always cross-check with the numactl command that the cores were correctly selected, and verify that the free memory reported per node is decreasing on the NUMA nodes you are using.
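numactl -H prints, per node, lines like "node 1 free: 95000 MB". The sketch below (run on illustrative output, with made-up sizes) shows the lines to watch; while the partition is running, the free value on the nodes you configured should drop as the latency buffers are allocated:

```shell
# Illustrative fragment of `numactl -H` output (sizes are made up):
sample_numactl=$(cat <<'EOF'
node 0 size: 128000 MB
node 0 free: 120000 MB
node 1 size: 128000 MB
node 1 free: 95000 MB
EOF
)

# Keep only the per-node free-memory lines:
node_free=$(printf '%s\n' "$sample_numactl" | awk '/free:/ {print "node", $2, "free:", $4, "MB"}')
echo "$node_free"
```

On a real server, something like `watch -n 1 'numactl -H | grep free'` lets you see the values change live.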

Step 10: Monitor in Grafana. Look at the PCM dashboard in Grafana to make sure the configuration matches the metrics (insert and explain some metrics, for example TP rates).

add PCM link in Grafana

For example, if you are running a configuration with one APA/CRP per NUMA node, you should see at most 1 GB of inter-CPU communication.

Step 11: Write an Elog entry, following this template (link to come).

Running Performance tests
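As a starting point, the sketch below collects the CPU, NUMA, and topology information used in the steps above into a single timestamped report file. The exact contents of a performance report are an assumption here; extend the command list as needed:

```shell
# Collect basic per-server layout information into one report file.
# Each tool is guarded so a missing command does not abort the report.
report="/tmp/perf_report_$(hostname)_$(date +%Y%m%d_%H%M%S).txt"
{
    echo "== lscpu =="
    lscpu 2>/dev/null || echo "lscpu not available"
    echo "== numactl -H =="
    numactl -H 2>/dev/null || echo "numactl not available"
    echo "== lstopo =="
    lstopo-no-graphics 2>/dev/null || echo "lstopo not available"
} > "$report"
echo "wrote $report"
```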

Topic revision: r6 - 2023-11-16 - NikolinaIlic