--
AidanLewisReynolds - 2018-10-01
Relevant NP04 Machines
- np04-srv-00[1,2,3,4] - Art processes running on the raw data
- np04om - Merging processes and Monet
Relevant Directories
These locations are accessible from both np04om and np04-srv-00[1,2,3,4]. They are local to np04om and mounted at /nfs on np04-srv-00[1,2,3,4], so /OMoutput/1 on np04om appears as /nfs/OMoutput/1 on the servers.
- /OMoutput/[1,2,3,4] - Initial output from art processes
- /OMoutput/OMoutput - Merged output and Monet input
Troubleshooting Steps
These are the typical steps I take when troubleshooting the OM. Going in this order lets me quickly find the point in the chain that is causing issues.
1. Check for Merged Output
On server np04om, look for recent files in /OMoutput/OMoutput. If there are recent files here but they do not appear in Monet, then there is likely an issue with Monet. If there are no recent files, the problem is further down the chain.
[np04daq@np04-srv-023 fcl]$ ltr /OMoutput/OMoutput/np04_hist* | tail -20
-rw-r--r-- 1 np04daq np-comp 271M Oct 1 14:31 /OMoutput/OMoutput/np04_hist_run004849_0011_5_0_0.root.nomerge
-rw-r--r-- 1 np04daq np-comp 271M Oct 1 14:34 /OMoutput/OMoutput/np04_hist_run004849_0012_3_0_0.root
-rw-r--r-- 1 np04daq np-comp 271M Oct 1 14:35 /OMoutput/OMoutput/np04_hist_run004849_0011_4_0_0.root.nomerge
-rw-r--r-- 1 np04daq np-comp 300M Oct 1 14:36 /OMoutput/OMoutput/np04_hist_run004849_0011_3_0_0.root.nomerge
-rw-r--r-- 1 np04daq np-comp 270M Oct 1 14:39 /OMoutput/OMoutput/np04_hist_run004849_0009_6_0_0.root
-rw-r--r-- 1 np04daq np-comp 270M Oct 1 14:40 /OMoutput/OMoutput/np04_hist_run004849_0011_1_0_0.root.nomerge
-rw-r--r-- 1 np04daq np-comp 271M Oct 1 14:43 /OMoutput/OMoutput/np04_hist_run004849_0009_5_0_0.root
-rw-r--r-- 1 np04daq np-comp 271M Oct 1 14:44 /OMoutput/OMoutput/np04_hist_run004849_0008_6_0_0.root.nomerge
-rw-r--r-- 1 np04daq np-comp 271M Oct 1 14:45 /OMoutput/OMoutput/np04_hist_run004849_0008_5_0_0.root.nomerge
-rw-r--r-- 1 np04daq np-comp 312M Oct 1 14:48 /OMoutput/OMoutput/np04_hist_run004849_0009_2_0_0.root
-rw-r--r-- 1 np04daq np-comp 271M Oct 1 14:50 /OMoutput/OMoutput/np04_hist_run004849_0009_1_0_0.root
-rw-r--r-- 1 np04daq np-comp 271M Oct 1 14:53 /OMoutput/OMoutput/np04_hist_run004849_0006_6_0_0.root
-rw-r--r-- 1 np04daq np-comp 364M Oct 1 14:53 /OMoutput/OMoutput/np04_hist_run004849_0000_0_0_0.root
-rw-r--r-- 1 np04daq np-comp 332M Oct 1 14:56 /OMoutput/OMoutput/np04_hist_run004851_0001_1_0_0.root
-rw-r--r-- 1 np04daq np-comp 369M Oct 1 14:59 /OMoutput/OMoutput/np04_hist_run004851_0001_2_0_0.root
-rw-r--r-- 1 np04daq np-comp 332M Oct 1 15:02 /OMoutput/OMoutput/np04_hist_run004851_0001_4_0_0.root
-rw-r--r-- 1 np04daq np-comp 342M Oct 1 15:06 /OMoutput/OMoutput/np04_hist_run004851_0005_1_0_0.root
-rw-r--r-- 1 np04daq np-comp 354M Oct 1 15:09 /OMoutput/OMoutput/np04_hist_run004851_0005_3_0_0.root
-rw-r--r-- 1 np04daq np-comp 349M Oct 1 15:12 /OMoutput/OMoutput/np04_hist_run004851_0005_4_0_0.root
-rw-r--r-- 1 np04daq np-comp 449M Oct 1 15:13 /OMoutput/OMoutput/np04_hist_run004851_0000_0_0_0.root
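As an alternative to scanning the listing by eye, the check can be sketched with find. The function name recent_files and the 30 minute default window are assumptions made for illustration, not part of the OM tooling.

```shell
# List np04_hist* files in a directory that were modified in the last N minutes.
# Usage: recent_files <dir> [minutes]   (the default of 30 is an arbitrary choice)
recent_files() {
    find "$1" -maxdepth 1 -type f -name 'np04_hist*' -mmin "-${2:-30}"
}
```

For example, `recent_files /OMoutput/OMoutput 30` prints nothing if no merged file has been written in the last half hour.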
Fixing This
If recent merged files exist but do not appear in Monet, Monet is probably down, so restart that process.
- Log into np04om
- Look for the Monet processes, which are managed by supervisord.
[np04daq@np04-srv-023 fcl]$ ps aux | grep presenter
np04daq 114765 1.6 14.6 30799540 28805912 ? Sl Sep26 122:43 python -m presenter.app
np04daq 114830 0.3 3.1 8153440 6163740 ? Sl Sep26 27:18 python -m presenter.app
np04daq 368403 0.0 0.0 112712 976 pts/4 S+ 16:16 0:00 grep --color=auto presenter
- Kill the processes labelled as presenter. They will be restarted by supervisord.
[np04daq@np04-srv-023 fcl]$ kill 114765
[np04daq@np04-srv-023 fcl]$ kill 114830
2. Check for Unmerged Output
On server np04om, look for recent files in the unmerged directories, /OMoutput/[1234]. If there are recent files here but there were none in the previous step, the issue is with the merging script. If there are no recent files, the problem is further down the chain.
[np04daq@np04-srv-023 fcl]$ ltr /OMoutput/[1-4]/np04_hist* | tail -20
-rw-r--r-- 1 np04daq np-comp 281M Oct 1 15:06 /OMoutput/3/np04_hist_run004851_0006_dl1_1_0_0.root
-rw-r--r-- 1 np04daq np-comp 302M Oct 1 15:06 /OMoutput/3/np04_hist_run004851_0005_dl7_3_0_0.root
-rw-r--r-- 1 np04daq np-comp 280M Oct 1 15:07 /OMoutput/3/np04_hist_run004851_0005_dl9_4_0_0.root
-rw-r--r-- 1 np04daq np-comp 301M Oct 1 15:08 /OMoutput/3/np04_hist_run004851_0006_dl1_2_0_0.root
-rw-r--r-- 1 np04daq np-comp 297M Oct 1 15:08 /OMoutput/3/np04_hist_run004851_0005_dl7_4_0_0.root
-rw-r--r-- 1 np04daq np-comp 280M Oct 1 15:09 /OMoutput/3/np04_hist_run004851_0005_dl9_5_0_0.root
-rw-r--r-- 1 np04daq np-comp 279M Oct 1 15:10 /OMoutput/3/np04_hist_run004851_0005_dl7_5_0_0.root
-rw-r--r-- 1 np04daq np-comp 294M Oct 1 15:10 /OMoutput/3/np04_hist_run004851_0006_dl1_3_0_0.root
-rw-r--r-- 1 np04daq np-comp 291M Oct 1 15:11 /OMoutput/3/np04_hist_run004851_0009_dl9_1_0_0.root
-rw-r--r-- 1 np04daq np-comp 280M Oct 1 15:12 /OMoutput/3/np04_hist_run004851_0009_dl8_1_0_0.root
-rw-r--r-- 1 np04daq np-comp 276M Oct 1 15:12 /OMoutput/3/np04_hist_run004851_0006_dl1_4_0_0.root
-rw-r--r-- 1 np04daq np-comp 331M Oct 1 15:13 /OMoutput/3/np04_hist_run004851_0009_dl9_2_0_0.root
-rw-r--r-- 1 np04daq np-comp 324M Oct 1 15:14 /OMoutput/3/np04_hist_run004851_0009_dl8_2_0_0.root
-rw-r--r-- 1 np04daq np-comp 280M Oct 1 15:15 /OMoutput/3/np04_hist_run004851_0006_dl1_5_0_0.root
-rw-r--r-- 1 np04daq np-comp 297M Oct 1 15:15 /OMoutput/3/np04_hist_run004851_0009_dl9_3_0_0.root
-rw-r--r-- 1 np04daq np-comp 302M Oct 1 15:16 /OMoutput/3/np04_hist_run004851_0009_dl8_3_0_0.root
-rw-r--r-- 1 np04daq np-comp 281M Oct 1 15:17 /OMoutput/3/np04_hist_run004851_0009_dl9_4_0_0.root
-rw-r--r-- 1 np04daq np-comp 281M Oct 1 15:17 /OMoutput/3/np04_hist_run004851_0011_dl3_1_0_0.root
-rw-r--r-- 1 np04daq np-comp 282M Oct 1 15:18 /OMoutput/3/np04_hist_run004851_0009_dl8_4_0_0.root
-rw-r--r-- 1 np04daq np-comp 282M Oct 1 15:19 /OMoutput/3/np04_hist_run004851_0009_dl9_5_0_0.root
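The decision made in steps 1 and 2 can be sketched as one small script. The function names count_recent and om_stage, and the idea of a fixed time window, are assumptions made for illustration; the directories are the ones listed at the top of this page.

```shell
# Count np04_hist* files in the given directories modified in the last N minutes.
count_recent() {
    minutes="$1"; shift
    find "$@" -maxdepth 1 -type f -name 'np04_hist*' -mmin "-$minutes" 2>/dev/null | wc -l
}

# Print which part of the chain to look at first.
# Usage: om_stage <minutes> <merged_dir> <unmerged_dir>...
om_stage() {
    minutes="$1"; merged="$2"; shift 2
    if [ "$(count_recent "$minutes" "$merged")" -gt 0 ]; then
        echo "merged output is recent: check Monet"
    elif [ "$(count_recent "$minutes" "$@")" -gt 0 ]; then
        echo "unmerged output only: check the merging script"
    else
        echo "no recent output: check the art logs"
    fi
}
```

On np04om this could be called as `om_stage 30 /OMoutput/OMoutput /OMoutput/1 /OMoutput/2 /OMoutput/3 /OMoutput/4`, where the 30 minute window is a guess that should be adjusted to the length of the files being taken.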
Fixing This
If there are files here but not in the previous step, the merging process is probably not working. To check this:
- Log into np04om
- Look for the merging processes, which are managed by supervisord
[np04daq@np04-srv-023 fcl]$ ps aux | grep mergeOM
np04daq 196341 0.5 0.0 113448 1776 ? S Sep17 106:28 /bin/bash ./mergeOMFiles.sh /OMoutput
np04daq 300664 0.6 0.0 113456 1816 ? S Sep17 124:40 /bin/bash ./mergeOMFiles.sh /OMoutput _dev
np04daq 414775 0.0 0.0 112712 976 pts/4 S+ 16:27 0:00 grep --color=auto mergeOM
- If you can't see the mergeOMFiles processes, you can start them manually with
[np04daq@np04-srv-023 ~]$ cd /nfs/sw/om/fcl/
[np04daq@np04-srv-023 fcl]$ nohup ./mergeOMFiles.sh /OMoutput &
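The ps and grep check above can also be written with pgrep, which never lists itself and so avoids the extra grep line. This is a minimal sketch; the helper name is_running is an assumption, and the pattern is anchored with $ so that it matches the production merger and not the one started with the _dev argument.

```shell
# Succeed if any process has a full command line matching the pattern.
is_running() {
    pgrep -f "$1" >/dev/null
}

# Look for the production merger only (command line ends in "/OMoutput").
if is_running 'mergeOMFiles\.sh /OMoutput$'; then
    echo "mergeOMFiles.sh is running"
else
    echo "mergeOMFiles.sh not found: start it manually"
fi
```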
3. Check the Art Logs
On servers np04-srv-00[1,2,3,4] you can find the logs from the online monitoring at /log/om. To find the logs for the current run I usually do
[np04daq@np04-srv-001 ~]$ grep -R run004600 /log/om
/log/om/part0-OnlineMonitor_0-20180922163953.log:art -c /nfs/sw/om/fcl/RunOnlineMonitor1_0_0.fcl -s /data1/np04_raw_run004600_0001_dl5.root >/nfs/OMoutput/1/logfile_0_0_20180922_165225
/log/om/part0-OnlineMonitor_0-20180922163953.log:art -c /nfs/sw/om/fcl/RunOnlineMonitor1_dev_0_0.fcl -s /data1/np04_raw_run004600_0001_dl5.root >/nfs/OMoutput/1_dev/logfile_dev_0_0_20180922_165225
The file name output from grep is the logfile from the StartOM script, which is executed by the run control. It should look something like the example below; longer runs will have more of the art -c lines at the bottom. A quick way to tell if the art processes are failing is to tail -f this file. If it is being written to more often than about one line every couple of minutes, the art processes are probably failing.
[np04daq@np04-srv-001 ~]$ cat /log/om/part0-OnlineMonitor_0-20180922163953.log
MRB_PROJECT=dunetpc
MRB_PROJECT_VERSION=v06_73_00
MRB_QUALS=e15:prof
MRB_TOP=/nfs/sw/om/ProtoDUNE_OM_artdaq_321a_dev
MRB_SOURCE=/nfs/sw/om/ProtoDUNE_OM_artdaq_321a_dev/srcs
MRB_BUILDDIR=/nfs/sw/om/ProtoDUNE_OM_artdaq_321a_dev/build_slf7.x86_64
MRB_INSTALL=/nfs/sw/om/ProtoDUNE_OM_artdaq_321a_dev/localProducts_dunetpc_v06_73_00_e15_prof
PRODUCTS=/nfs/sw/om/ProtoDUNE_OM_artdaq_321a_dev/localProducts_dunetpc_v06_73_00_e15_prof:/nfs/sw/om/ProtoDUNE_OM_artdaq_321a_dev/products:/nfs/sw/artdaq/products
MRB_PROJECT=larsoft
MRB_PROJECT_VERSION=v06_73_00
MRB_QUALS=e15:prof
MRB_TOP=/nfs/sw/om/ProtoDUNE_OM_artdaq_321a_prod
MRB_SOURCE=/nfs/sw/om/ProtoDUNE_OM_artdaq_321a_prod/srcs
MRB_BUILDDIR=/nfs/sw/om/ProtoDUNE_OM_artdaq_321a_prod/build_slf7.x86_64
MRB_INSTALL=/nfs/sw/om/ProtoDUNE_OM_artdaq_321a_prod/localProducts_larsoft_v06_73_00_e15_prof
PRODUCTS=/nfs/sw/om/ProtoDUNE_OM_artdaq_321a_prod/localProducts_larsoft_v06_73_00_e15_prof:/nfs/sw/om/ProtoDUNE_OM_artdaq_321a_prod/products:/nfs/sw/artdaq/products
The working build directory is /nfs/sw/om/ProtoDUNE_OM_artdaq_321a_dev/build_slf7.x86_64
The source code directory is /nfs/sw/om/ProtoDUNE_OM_artdaq_321a_dev/srcs
----------- check this block for errors -----------------------
The working build directory is /nfs/sw/om/ProtoDUNE_OM_artdaq_321a_prod/build_slf7.x86_64
The source code directory is /nfs/sw/om/ProtoDUNE_OM_artdaq_321a_prod/srcs
----------- check this block for errors -----------------------
INFO: no optional setup of artdaq_ganglia_plugin v1_02_11 -q +e15:+s65:+prof
INFO: no optional setup of artdaq_ganglia_plugin v1_02_11 -q +e15:+s65:+prof
INFO: no optional setup of artdaq_epics_plugin v1_02_05 -q +e15:+s65:+prof
INFO: no optional setup of artdaq_epics_plugin v1_02_05 -q +e15:+s65:+prof
----------------------------------------------------------------
INFO: copying $MRB_SOURCE/dunetpc/releaseDB/base_dependency_database
Number of processes: 1 on partition 0
//////////////////// Start process 0 ///////////////////////
0
----------------------------------------------------------------
Number of processes: 1 on partition 0
//////////////////// Start process 0 ///////////////////////
0
art -c /nfs/sw/om/fcl/RunOnlineMonitor1_0_0.fcl -s /data1/np04_raw_run004600_0001_dl5.root >/nfs/OMoutput/1/logfile_0_0_20180922_165225
art -c /nfs/sw/om/fcl/RunOnlineMonitor1_dev_0_0.fcl -s /data1/np04_raw_run004600_0001_dl5.root >/nfs/OMoutput/1_dev/logfile_dev_0_0_20180922_165225
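The tail -f check described above can be approximated by counting how many lines are appended to the StartOM log over a fixed window. This is a rough sketch; the function name lines_added is an assumption, and the threshold to compare against is left to the reader.

```shell
# Print how many lines were appended to a file during a time window.
# Usage: lines_added <file> <seconds>
lines_added() {
    before=$(wc -l < "$1")
    sleep "$2"
    after=$(wc -l < "$1")
    echo $((after - before))
}
```

For example, `lines_added /log/om/part0-OnlineMonitor_0-20180922163953.log 600` watches the log for ten minutes. Going by the rule of thumb of one line every couple of minutes, a count well above five would suggest the art processes are being restarted too often.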
The last part of the art -c lines here points you at the output from the art processes. Make sure you look at the non-dev versions when troubleshooting the production version of the OM: look for dev in the line and ignore those lines. The art logfile will be very long, but for basic troubleshooting only the last few lines matter, since we just need to check the exit status.
[np04daq@np04-srv-001 ~]$ tail -2 /nfs/OMoutput/3/logfile_0_0_20181001_150946
Art has completed and will exit with status 0.
There are two valid values here, 0 and 143; anything else indicates an issue with the decoding. 0 means everything went smoothly and the full input file was processed. 143 means the art process was terminated by run control before processing finished (143 is 128 + 15, the usual status for a process ended by SIGTERM). In that case the decoding worked but could not finish, so there might not be plots.
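Step 3 can be sketched as three small helpers: one that pulls the production art logfile paths out of a StartOM log, one that reads the exit status from the end of an art logfile, and one that interprets it. The function names are assumptions made for illustration, and the parsing relies only on the line formats shown in the examples on this page.

```shell
# Print the art logfile paths named in a StartOM log, skipping dev lines.
prod_art_logs() {
    grep '^art -c' "$1" | grep -v dev | sed 's/.*>//'
}

# Print the exit status recorded in the last few lines of an art logfile.
art_exit_status() {
    tail -n 5 "$1" | sed -n 's/.*will exit with status \([0-9][0-9]*\)\..*/\1/p' | tail -n 1
}

# Interpret the status: 0 and 143 are fine, anything else is a failure.
check_art_log() {
    status=$(art_exit_status "$1")
    case "$status" in
        0)   echo "ok: input file fully processed" ;;
        143) echo "terminated by run control: decoding fine, plots may be missing" ;;
        "")  echo "no exit status found in the last lines" ;;
        *)   echo "failed with status $status: notify the OM experts" ;;
    esac
}
```

A possible use on one of the np04-srv machines is `for f in $(prod_art_logs /log/om/part0-OnlineMonitor_0-20180922163953.log); do check_art_log "$f"; done`.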
Fixing This
If there are issues with the decoding, then it is software related and you should notify the OM experts.
4. No Logs?
If you can't find any logs for your run in /log/om in the previous step, then the OM probably wasn't started by the run control. To check this, look for StartOM_RunNumber in the processes on np04-srv-00[1,2,3,4]. If you can't find it and a run is still in progress, then it was not started properly by the run control.
[np04daq@np04-srv-001 ~]$ ps aux | grep StartOM
np04daq 45718 0.0 0.0 115296 1496 ? S 17:01 0:00 /bin/bash /nfs/sw/om/fcl/StartOM_RunNumber.sh 0 /nfs/sw/om/fcl 004855 1
np04daq 45719 3.9 0.0 115860 2176 ? S 17:01 0:06 /bin/bash /nfs/sw/om/fcl/StartOMproddev.sh prod 0 /nfs/sw/om/fcl 004855 1
np04daq 45720 4.0 0.0 115760 2180 ? S 17:01 0:07 /bin/bash /nfs/sw/om/fcl/StartOMproddev.sh dev 0 /nfs/sw/om/fcl 004855 1
np04daq 77633 0.0 0.0 115760 1096 ? S 17:04 0:00 /bin/bash /nfs/sw/om/fcl/StartOMproddev.sh dev 0 /nfs/sw/om/fcl 004855 1
np04daq 77634 0.0 0.0 115860 1104 ? S 17:04 0:00 /bin/bash /nfs/sw/om/fcl/StartOMproddev.sh prod 0 /nfs/sw/om/fcl 004855 1
np04daq 77644 0.0 0.0 112712 972 pts/2 S+ 17:04 0:00 grep --color=auto StartOM
Fixing This
This means that the OM was not started by the run control. You can start it manually with the steps below.
- Log into np04-srv-00[1,2,3,4]
- Change directory to /nfs/sw/om/fcl/
- Run StartOM_RunNumber.sh with fcl path /nfs/sw/om/fcl
./StartOM_RunNumber.sh <PARTITION> <FHiCL path> <Run Number> [# of processes]
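As a worked example of filling in the arguments, the sketch below prints the full command for review instead of running it. The helper name start_om_cmd is an assumption; the example values (partition 0, run 004855, 1 process) are the ones visible in the ps listing above.

```shell
# Print the manual start command for review; nothing is executed.
# Usage: start_om_cmd <partition> <fcl_path> <run_number> [n_processes]
start_om_cmd() {
    echo "cd $2 && ./StartOM_RunNumber.sh $1 $2 $3${4:+ $4}"
}

start_om_cmd 0 /nfs/sw/om/fcl 004855 1
```

Copy the printed line into the shell once the partition and run number have been checked against the run control.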