-- AidanLewisReynolds - 2018-10-01

Relevant NP04 Machines

  • np04-srv-00[1,2,3,4] - Art processes running on the raw data
  • np04om - Merging processes and Monet

Relevant Directories

These locations are accessible from both np04om and np04-srv-00[1,2,3,4]. They are local to np04om and mounted under /nfs on np04-srv-00[1,2,3,4], so /OMoutput/1 on np04om appears as /nfs/OMoutput/1 on the np04-srv machines.
  • /OMoutput/[1,2,3,4] - Initial output from art processes
  • /OMoutput/OMoutput - Merged output and Monet input

Troubleshooting Steps

These are the typical steps I take when troubleshooting the OM. Going in this order lets me quickly find the point in the chain that is causing issues.

1. Check for Merged Output

On server np04om, look for recent files in /OMoutput/OMoutput. If there are recent files here but they do not appear in Monet, there is likely an issue with Monet itself. If there are no recent files, the problem is further down the chain, so continue with the next step.

[np04daq@np04-srv-023 fcl]$ ltr /OMoutput/OMoutput/np04_hist* | tail -20
-rw-r--r-- 1 np04daq np-comp 271M Oct  1 14:31 /OMoutput/OMoutput/np04_hist_run004849_0011_5_0_0.root.nomerge
-rw-r--r-- 1 np04daq np-comp 271M Oct  1 14:34 /OMoutput/OMoutput/np04_hist_run004849_0012_3_0_0.root
-rw-r--r-- 1 np04daq np-comp 271M Oct  1 14:35 /OMoutput/OMoutput/np04_hist_run004849_0011_4_0_0.root.nomerge
-rw-r--r-- 1 np04daq np-comp 300M Oct  1 14:36 /OMoutput/OMoutput/np04_hist_run004849_0011_3_0_0.root.nomerge
-rw-r--r-- 1 np04daq np-comp 270M Oct  1 14:39 /OMoutput/OMoutput/np04_hist_run004849_0009_6_0_0.root
-rw-r--r-- 1 np04daq np-comp 270M Oct  1 14:40 /OMoutput/OMoutput/np04_hist_run004849_0011_1_0_0.root.nomerge
-rw-r--r-- 1 np04daq np-comp 271M Oct  1 14:43 /OMoutput/OMoutput/np04_hist_run004849_0009_5_0_0.root
-rw-r--r-- 1 np04daq np-comp 271M Oct  1 14:44 /OMoutput/OMoutput/np04_hist_run004849_0008_6_0_0.root.nomerge
-rw-r--r-- 1 np04daq np-comp 271M Oct  1 14:45 /OMoutput/OMoutput/np04_hist_run004849_0008_5_0_0.root.nomerge
-rw-r--r-- 1 np04daq np-comp 312M Oct  1 14:48 /OMoutput/OMoutput/np04_hist_run004849_0009_2_0_0.root
-rw-r--r-- 1 np04daq np-comp 271M Oct  1 14:50 /OMoutput/OMoutput/np04_hist_run004849_0009_1_0_0.root
-rw-r--r-- 1 np04daq np-comp 271M Oct  1 14:53 /OMoutput/OMoutput/np04_hist_run004849_0006_6_0_0.root
-rw-r--r-- 1 np04daq np-comp 364M Oct  1 14:53 /OMoutput/OMoutput/np04_hist_run004849_0000_0_0_0.root
-rw-r--r-- 1 np04daq np-comp 332M Oct  1 14:56 /OMoutput/OMoutput/np04_hist_run004851_0001_1_0_0.root
-rw-r--r-- 1 np04daq np-comp 369M Oct  1 14:59 /OMoutput/OMoutput/np04_hist_run004851_0001_2_0_0.root
-rw-r--r-- 1 np04daq np-comp 332M Oct  1 15:02 /OMoutput/OMoutput/np04_hist_run004851_0001_4_0_0.root
-rw-r--r-- 1 np04daq np-comp 342M Oct  1 15:06 /OMoutput/OMoutput/np04_hist_run004851_0005_1_0_0.root
-rw-r--r-- 1 np04daq np-comp 354M Oct  1 15:09 /OMoutput/OMoutput/np04_hist_run004851_0005_3_0_0.root
-rw-r--r-- 1 np04daq np-comp 349M Oct  1 15:12 /OMoutput/OMoutput/np04_hist_run004851_0005_4_0_0.root
-rw-r--r-- 1 np04daq np-comp 449M Oct  1 15:13 /OMoutput/OMoutput/np04_hist_run004851_0000_0_0_0.root
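
The check above can also be scripted. The following is a minimal sketch, assuming that "recent" means modified within the last 30 minutes; the helper name recent_om_files and the 30 minute default are illustrative assumptions, not part of the OM software. The ltr command in the listing appears to be a local alias (presumably for ls -ltr), so the sketch uses find instead.

```shell
# Sketch: list OM histogram files modified within the last N minutes.
# recent_om_files is a hypothetical helper name, not part of the OM software.
recent_om_files() (
    dir=$1              # directory to inspect, e.g. /OMoutput/OMoutput
    minutes=${2:-30}    # age window in minutes (assumed default)
    # -mmin -N selects files modified less than N minutes ago
    find "$dir" -maxdepth 1 -type f -name 'np04_hist*' -mmin "-$minutes"
)

# Example: count merged files written in the last 30 minutes
# recent_om_files /OMoutput/OMoutput 30 | wc -l
```

A count of zero while a run is in progress suggests moving on to step 2.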

Fixing This

If recent merged files exist but do not show up in Monet, Monet is probably down, so restart its processes.
  1. Log into np04om
  2. Look for the monet processes, which are managed by supervisord.
[np04daq@np04-srv-023 fcl]$ ps aux | grep presenter
np04daq  114765  1.6 14.6 30799540 28805912 ?   Sl   Sep26 122:43 python -m presenter.app
np04daq  114830  0.3  3.1 8153440 6163740 ?     Sl   Sep26  27:18 python -m presenter.app
np04daq  368403  0.0  0.0 112712   976 pts/4    S+   16:16   0:00 grep --color=auto presenter
  3. Kill the processes labelled as presenter; they will be restarted by supervisord.
[np04daq@np04-srv-023 fcl]$ kill 114765 
[np04daq@np04-srv-023 fcl]$ kill 114830
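
To avoid copying PIDs by hand, the PID column can be pulled out of the ps output. This is only a sketch: presenter_pids is a hypothetical helper name, and it assumes the Monet processes always show up as "python -m presenter.app", as in the listing above.

```shell
# Sketch: extract the PIDs of the Monet presenter processes from `ps aux` output.
# presenter_pids is a hypothetical helper; it reads `ps aux` text on stdin.
presenter_pids() {
    # Column 2 of `ps aux` is the PID; skip the line for the grep command itself.
    awk '/presenter\.app/ && !/grep/ { print $2 }'
}

# Example: review the PIDs first, then kill them so supervisord restarts Monet.
# ps aux | presenter_pids
# ps aux | presenter_pids | xargs -r kill    # -r is a GNU xargs option
```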

2. Check for Unmerged Output

On server np04om, look for recent files in the unmerged directories, /OMoutput/[1234]. If there are recent files here but there were none in the previous step, the issue is with the merging script. If there are no recent files, the problem is further down the chain.

[np04daq@np04-srv-023 fcl]$ ltr /OMoutput/[1-4]/np04_hist* | tail -20                                                    
-rw-r--r-- 1 np04daq np-comp 281M Oct  1 15:06 /OMoutput/3/np04_hist_run004851_0006_dl1_1_0_0.root
-rw-r--r-- 1 np04daq np-comp 302M Oct  1 15:06 /OMoutput/3/np04_hist_run004851_0005_dl7_3_0_0.root
-rw-r--r-- 1 np04daq np-comp 280M Oct  1 15:07 /OMoutput/3/np04_hist_run004851_0005_dl9_4_0_0.root
-rw-r--r-- 1 np04daq np-comp 301M Oct  1 15:08 /OMoutput/3/np04_hist_run004851_0006_dl1_2_0_0.root
-rw-r--r-- 1 np04daq np-comp 297M Oct  1 15:08 /OMoutput/3/np04_hist_run004851_0005_dl7_4_0_0.root
-rw-r--r-- 1 np04daq np-comp 280M Oct  1 15:09 /OMoutput/3/np04_hist_run004851_0005_dl9_5_0_0.root
-rw-r--r-- 1 np04daq np-comp 279M Oct  1 15:10 /OMoutput/3/np04_hist_run004851_0005_dl7_5_0_0.root
-rw-r--r-- 1 np04daq np-comp 294M Oct  1 15:10 /OMoutput/3/np04_hist_run004851_0006_dl1_3_0_0.root
-rw-r--r-- 1 np04daq np-comp 291M Oct  1 15:11 /OMoutput/3/np04_hist_run004851_0009_dl9_1_0_0.root
-rw-r--r-- 1 np04daq np-comp 280M Oct  1 15:12 /OMoutput/3/np04_hist_run004851_0009_dl8_1_0_0.root
-rw-r--r-- 1 np04daq np-comp 276M Oct  1 15:12 /OMoutput/3/np04_hist_run004851_0006_dl1_4_0_0.root
-rw-r--r-- 1 np04daq np-comp 331M Oct  1 15:13 /OMoutput/3/np04_hist_run004851_0009_dl9_2_0_0.root
-rw-r--r-- 1 np04daq np-comp 324M Oct  1 15:14 /OMoutput/3/np04_hist_run004851_0009_dl8_2_0_0.root
-rw-r--r-- 1 np04daq np-comp 280M Oct  1 15:15 /OMoutput/3/np04_hist_run004851_0006_dl1_5_0_0.root
-rw-r--r-- 1 np04daq np-comp 297M Oct  1 15:15 /OMoutput/3/np04_hist_run004851_0009_dl9_3_0_0.root
-rw-r--r-- 1 np04daq np-comp 302M Oct  1 15:16 /OMoutput/3/np04_hist_run004851_0009_dl8_3_0_0.root
-rw-r--r-- 1 np04daq np-comp 281M Oct  1 15:17 /OMoutput/3/np04_hist_run004851_0009_dl9_4_0_0.root
-rw-r--r-- 1 np04daq np-comp 281M Oct  1 15:17 /OMoutput/3/np04_hist_run004851_0011_dl3_1_0_0.root
-rw-r--r-- 1 np04daq np-comp 282M Oct  1 15:18 /OMoutput/3/np04_hist_run004851_0009_dl8_4_0_0.root
-rw-r--r-- 1 np04daq np-comp 282M Oct  1 15:19 /OMoutput/3/np04_hist_run004851_0009_dl9_5_0_0.root
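
Note that the last 20 entries in the listing above all come from /OMoutput/3, so a single combined listing can hide a directory that has stopped receiving files. A minimal sketch that reports the newest file per directory is shown below; newest_per_dir is a hypothetical helper name.

```shell
# Sketch: print the most recently modified histogram file in each unmerged
# directory, so a stalled server stands out. newest_per_dir is a hypothetical name.
newest_per_dir() (
    base=$1    # e.g. /OMoutput on np04om
    for n in 1 2 3 4; do
        # ls -t sorts newest first; errors (no matching files) are silenced
        newest=$(ls -t "$base/$n"/np04_hist* 2>/dev/null | head -n 1)
        echo "$n: ${newest:-no files}"
    done
)

# Example:
# newest_per_dir /OMoutput
```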

Fixing This

If there are files here but not in the previous step, the merging process is probably not working. To check this:
  1. Log into np04om
  2. Look for the merging processes, which are managed by supervisord
[np04daq@np04-srv-023 fcl]$ ps aux | grep mergeOM
np04daq  196341  0.5  0.0 113448  1776 ?        S    Sep17 106:28 /bin/bash ./mergeOMFiles.sh /OMoutput
np04daq  300664  0.6  0.0 113456  1816 ?        S    Sep17 124:40 /bin/bash ./mergeOMFiles.sh /OMoutput _dev
np04daq  414775  0.0  0.0 112712   976 pts/4    S+   16:27   0:00 grep --color=auto mergeOM
  3. If you can't see the mergeOMFiles processes, then you can start them manually with
[np04daq@np04-srv-023 ~]$ cd /nfs/sw/om/fcl/
[np04daq@np04-srv-023 fcl]$ nohup ./mergeOMFiles.sh /OMoutput &
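
The ps listing above shows two mergers, one started with "/OMoutput" and one with "/OMoutput _dev". The sketch below reports which of the two is present. It assumes that a trailing _dev argument is what distinguishes the development merger, which is inferred from that listing only; merge_status is a hypothetical helper name.

```shell
# Sketch: report whether the production and dev mergers appear in `ps aux` output.
# merge_status is a hypothetical helper; it reads `ps aux` text on stdin.
merge_status() {
    awk '
        /mergeOMFiles\.sh/ && !/grep/ {
            # Assumption: a trailing "_dev" argument marks the development merger
            if ($NF == "_dev") dev = 1; else prod = 1
        }
        END {
            print "prod: " (prod ? "running" : "missing")
            print "dev: "  (dev  ? "running" : "missing")
        }'
}

# Example:
# ps aux | merge_status
```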

3. Check the Art Logs

On servers np04-srv-00[1,2,3,4] you can find the logs from the online monitoring at /log/om. To find the logs for the current run I usually do

[np04daq@np04-srv-001 ~]$ grep -R run004600 /log/om                                                                      
/log/om/part0-OnlineMonitor_0-20180922163953.log:art -c /nfs/sw/om/fcl/RunOnlineMonitor1_0_0.fcl -s /data1/np04_raw_run004600_0001_dl5.root >/nfs/OMoutput/1/logfile_0_0_20180922_165225
/log/om/part0-OnlineMonitor_0-20180922163953.log:art -c /nfs/sw/om/fcl/RunOnlineMonitor1_dev_0_0.fcl -s /data1/np04_raw_run004600_0001_dl5.root >/nfs/OMoutput/1_dev/logfile_dev_0_0_20180922_165225
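
If only the log file names are needed, grep can print each matching file once instead of every matching line. A minimal sketch follows; logs_for_run is a hypothetical helper name, and the run number is passed exactly as it appears in the file names (for example 004600).

```shell
# Sketch: list the log files that mention a given run.
# logs_for_run is a hypothetical helper name.
logs_for_run() (
    run=$1                  # run number as it appears in file names, e.g. 004600
    logdir=${2:-/log/om}    # log directory named in the text above
    # -R searches recursively, -l prints each matching file name once
    grep -Rl "run$run" "$logdir"
)

# Example:
# logs_for_run 004600
```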

The file name printed by grep is the log file of the StartOM script, which is executed by the run control. It should look something like the example below; longer runs will have more art -c lines at the bottom. A quick way to tell whether the art processes are failing is to tail -f this file. If new lines appear more often than about once every couple of minutes, the art processes are probably failing.

[np04daq@np04-srv-001 ~]$ cat /log/om/part0-OnlineMonitor_0-20180922163953.log

MRB_PROJECT=dunetpc
MRB_PROJECT_VERSION=v06_73_00
MRB_QUALS=e15:prof
MRB_TOP=/nfs/sw/om/ProtoDUNE_OM_artdaq_321a_dev
MRB_SOURCE=/nfs/sw/om/ProtoDUNE_OM_artdaq_321a_dev/srcs
MRB_BUILDDIR=/nfs/sw/om/ProtoDUNE_OM_artdaq_321a_dev/build_slf7.x86_64
MRB_INSTALL=/nfs/sw/om/ProtoDUNE_OM_artdaq_321a_dev/localProducts_dunetpc_v06_73_00_e15_prof

PRODUCTS=/nfs/sw/om/ProtoDUNE_OM_artdaq_321a_dev/localProducts_dunetpc_v06_73_00_e15_prof:/nfs/sw/om/ProtoDUNE_OM_artdaq_321a_dev/products:/nfs/sw/artdaq/products


MRB_PROJECT=larsoft
MRB_PROJECT_VERSION=v06_73_00
MRB_QUALS=e15:prof
MRB_TOP=/nfs/sw/om/ProtoDUNE_OM_artdaq_321a_prod
MRB_SOURCE=/nfs/sw/om/ProtoDUNE_OM_artdaq_321a_prod/srcs
MRB_BUILDDIR=/nfs/sw/om/ProtoDUNE_OM_artdaq_321a_prod/build_slf7.x86_64
MRB_INSTALL=/nfs/sw/om/ProtoDUNE_OM_artdaq_321a_prod/localProducts_larsoft_v06_73_00_e15_prof

PRODUCTS=/nfs/sw/om/ProtoDUNE_OM_artdaq_321a_prod/localProducts_larsoft_v06_73_00_e15_prof:/nfs/sw/om/ProtoDUNE_OM_artdaq_321a_prod/products:/nfs/sw/artdaq/products

The working build directory is /nfs/sw/om/ProtoDUNE_OM_artdaq_321a_dev/build_slf7.x86_64
The source code directory is /nfs/sw/om/ProtoDUNE_OM_artdaq_321a_dev/srcs
----------- check this block for errors -----------------------
The working build directory is /nfs/sw/om/ProtoDUNE_OM_artdaq_321a_prod/build_slf7.x86_64
The source code directory is /nfs/sw/om/ProtoDUNE_OM_artdaq_321a_prod/srcs
----------- check this block for errors -----------------------
INFO: no optional setup of artdaq_ganglia_plugin v1_02_11 -q +e15:+s65:+prof
INFO: no optional setup of artdaq_ganglia_plugin v1_02_11 -q +e15:+s65:+prof
INFO: no optional setup of artdaq_epics_plugin v1_02_05 -q +e15:+s65:+prof
INFO: no optional setup of artdaq_epics_plugin v1_02_05 -q +e15:+s65:+prof
----------------------------------------------------------------
INFO: copying $MRB_SOURCE/dunetpc/releaseDB/base_dependency_database
Number of processes: 1 on partition 0
//////////////////// Start process 0 ///////////////////////
0
----------------------------------------------------------------
Number of processes: 1 on partition 0
//////////////////// Start process 0 ///////////////////////
0
art -c /nfs/sw/om/fcl/RunOnlineMonitor1_0_0.fcl -s /data1/np04_raw_run004600_0001_dl5.root >/nfs/OMoutput/1/logfile_0_0_20180922_165225
art -c /nfs/sw/om/fcl/RunOnlineMonitor1_dev_0_0.fcl -s /data1/np04_raw_run004600_0001_dl5.root >/nfs/OMoutput/1_dev/logfile_dev_0_0_20180922_165225

The last part of each art -c line (after the >) points to the output log of that art process. When troubleshooting the production version of the OM, make sure you look at the non-dev versions: look for dev in the line and ignore those lines. The art log file is very long, but for basic troubleshooting only the last few lines matter, since we just need to check the exit status.
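
Picking out the production log paths can be sketched as follows. The helper name prod_art_logs is hypothetical, and the filter assumes that every dev line contains the string _dev, as in the example log above.

```shell
# Sketch: print the art log paths of the production (non-dev) processes
# from a StartOM log file. prod_art_logs is a hypothetical helper name.
prod_art_logs() (
    startom_log=$1
    # keep the "art -c" lines, drop the dev ones, print the text after ">"
    grep '^art -c' "$startom_log" | grep -v '_dev' | sed 's/.*>//'
)

# Example:
# prod_art_logs /log/om/part0-OnlineMonitor_0-20180922163953.log
```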

[np04daq@np04-srv-001 ~]$ tail -2 /nfs/OMoutput/3/logfile_0_0_20181001_150946                                           

Art has completed and will exit with status 0.

There are two valid values here, 0 and 143; anything else indicates an issue with the decoding. 0 means everything went smoothly and the full input file was processed. 143 means the art process was terminated by run control before processing finished (143 is 128 + 15, the conventional exit code for a process ended by SIGTERM). In that case the decoding worked fine but could not complete, so there might not be plots.
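
The exit status check can be sketched as a small script. It assumes the final status line always has the form "Art has completed and will exit with status N." as in the example above; art_exit_status and classify_art_log are hypothetical helper names.

```shell
# Sketch: read the exit status from the end of an art log and classify it.
# Both function names are hypothetical helpers, not part of the OM software.
art_exit_status() (
    # print the status number from the "Art has completed ..." line, if present
    tail -n 5 "$1" |
        sed -n 's/^Art has completed and will exit with status \([0-9][0-9]*\)\.[[:space:]]*$/\1/p' |
        tail -n 1
)

classify_art_log() (
    status=$(art_exit_status "$1")
    case $status in
        0)   echo "ok: input file fully processed" ;;
        143) echo "terminated by run control before finishing" ;;
        "")  echo "no exit status found" ;;
        *)   echo "decoding problem (status $status), notify the OM experts" ;;
    esac
)

# Example:
# classify_art_log /nfs/OMoutput/3/logfile_0_0_20181001_150946
```

A result of "no exit status found" may simply mean the art process is still running.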

Fixing This

If there are issues with the decoding, then it is software related and you should notify the OM experts.

4. No Logs?

If you can't find any logs for your run in /log/om in the previous step, the OM probably wasn't started by the run control. To check this, look for StartOM_RunNumber in the processes on np04-srv-00[1,2,3,4]. If you can't find it and a run is still being taken, the OM was not started properly by the run control.
[np04daq@np04-srv-001 ~]$ ps aux | grep StartOM
np04daq   45718  0.0  0.0 115296  1496 ?        S    17:01   0:00 /bin/bash /nfs/sw/om/fcl/StartOM_RunNumber.sh 0 /nfs/sw/om/fcl 004855 1
np04daq   45719  3.9  0.0 115860  2176 ?        S    17:01   0:06 /bin/bash /nfs/sw/om/fcl/StartOMproddev.sh prod 0 /nfs/sw/om/fcl 004855 1
np04daq   45720  4.0  0.0 115760  2180 ?        S    17:01   0:07 /bin/bash /nfs/sw/om/fcl/StartOMproddev.sh dev 0 /nfs/sw/om/fcl 004855 1
np04daq   77633  0.0  0.0 115760  1096 ?        S    17:04   0:00 /bin/bash /nfs/sw/om/fcl/StartOMproddev.sh dev 0 /nfs/sw/om/fcl 004855 1
np04daq   77634  0.0  0.0 115860  1104 ?        S    17:04   0:00 /bin/bash /nfs/sw/om/fcl/StartOMproddev.sh prod 0 /nfs/sw/om/fcl 004855 1
np04daq   77644  0.0  0.0 112712   972 pts/2    S+   17:04   0:00 grep --color=auto StartOM

Fixing This

This means that the OM was not started by the run control. You can start it manually with the steps below.
  1. Log into np04-srv-00[1,2,3,4]
  2. Change directory to /nfs/sw/om/fcl/
  3. Run StartOM_RunNumber.sh with fcl path /nfs/sw/om/fcl
./StartOM_RunNumber.sh <PARTITION> <FHiCL path> <Run Number> [# of processes]
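
As a worked example of the usage line above, the sketch below assembles the command for a given partition and run. It assumes the run number is passed zero-padded to six digits and that one process is the default, both inferred only from the ps listing in this step (for example "0 /nfs/sw/om/fcl 004855 1"). startom_command is a hypothetical helper that prints the command rather than running it.

```shell
# Sketch: build the manual start command for a partition and run number.
# startom_command is a hypothetical helper; it only prints the command.
startom_command() (
    partition=$1
    run=$2
    nproc=${3:-1}              # assumed default of one process
    fcl_dir=/nfs/sw/om/fcl     # FHiCL path given in the steps above
    # strip leading zeros so printf does not read the number as octal
    run=$(printf '%s' "$run" | sed 's/^0*//')
    printf './StartOM_RunNumber.sh %s %s %06d %s\n' \
        "$partition" "$fcl_dir" "${run:-0}" "$nproc"
)

# Example: prints ./StartOM_RunNumber.sh 0 /nfs/sw/om/fcl 004855 1
# startom_command 0 4855 1
```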
Topic revision: r4 - 2018-10-01 - AidanLewisReynolds
 