DIANE Network Testing for SLAs
- The job consists of 3K short tasks of various durations: 0.1, 1.0 and 2.0 seconds
- Each task returns a data message to the master over the network; we test two message sizes: <1Kb and 100Kb of random string
- We use 4 workers
- Local setup : master and workers are in the local network in Taipei
- Remote setup : master is at CERN, workers are in Taipei
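For orientation, the setup above implies some idealized bounds. The following is a minimal Python sketch, assuming perfect parallelism and zero framework or network overhead; the function names are illustrative and not part of DIANE.

```python
# Test matrix taken from the setup above; the formulas are idealized assumptions.
N_TASKS = 3000
N_WORKERS = 4
TASK_DURATIONS_S = [0.1, 1.0, 2.0]


def ideal_job_duration(n_tasks, n_workers, task_s):
    """Lower bound on job duration: perfect parallelism, no overhead."""
    return n_tasks * task_s / n_workers


def ideal_task_rate(n_workers, task_s):
    """Upper bound on task completions per second arriving at the master."""
    return n_workers / task_s


for task_s in TASK_DURATIONS_S:
    duration = ideal_job_duration(N_TASKS, N_WORKERS, task_s)
    rate = ideal_task_rate(N_WORKERS, task_s)
    print(f"{task_s:>4} s/task: job >= {duration:7.1f} s, rate <= {rate:5.1f} tasks/s")
```

Measured job durations above these bounds can then be read as the combined framework, queuing and network overhead.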
The experimental results are shown in the table below: the total duration of the job (3000 tasks) and the message frequency on the master (note that the 100Kb message was sent one way only).
Clicking on a result takes you to the detailed web page report. The task duration distribution (histogram) is measured from the moment the task is sent to the worker until the result is received, so it also includes the network latency.
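Because the histogram includes the network latency, the measured task time can be modelled roughly as execution time plus round-trip time. The sketch below assumes exactly one round trip per task and ignores framework overhead (both are assumptions, not measured facts); the round-trip value of about 313 ms is read off the last hop of the traceroute in the Appendix.

```python
RTT_S = 0.313  # CERN <-> Taipei round trip, from the traceroute in the Appendix


def measured_task_time(exec_s, rtt_s=RTT_S, round_trips=1):
    """Task time as seen by the master: execution plus network round trips.

    round_trips=1 is an assumption; framework overhead is ignored.
    """
    return exec_s + round_trips * rtt_s


def latency_share(exec_s, rtt_s=RTT_S, round_trips=1):
    """Fraction of the measured task time that is network latency."""
    total = measured_task_time(exec_s, rtt_s, round_trips)
    return (total - exec_s) / total


for exec_s in (0.1, 1.0, 2.0):
    print(f"{exec_s} s task: measured ~{measured_task_time(exec_s):.3f} s, "
          f"latency share ~{latency_share(exec_s):.0%}")
```

Under these assumptions the latency dominates the 0.1 second remote case and shrinks to a small fraction for the 2.0 second tasks.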
- the 0.1 second local case has an overhead from the framework and is also biased, because not all workers started at the same time (queuing time in the grid)
- the 0.1 second remote case also has a little queuing overhead (not as much as the local one)
- 0.1 second remote has a strange pattern in the task duration histogram which we do not understand
- we see the network effect between the local and remote setups; the network overhead gets smaller as the task duration increases, because the frequency on the master is lower
- the question is: given the same frequency, will the network effect still be seen if the task duration is longer (e.g. 20 s/task and 40 workers)?
- some applications fall in the 1-2 second range, others are on the order of minutes
- the current scalability limit of the system is 400 workers; to maintain the same frequency (and so observe the network effect as in the 2 s case above) the task duration would have to be about 200 seconds (roughly 3 minutes). What is the system overhead when the number of open TCP/IP connections and serving threads is 400?
- we would need more measurements to establish the point where the network effect stops playing a role
- we do not seem to be suffering from the bandwidth but rather from the latency: in the most intensive case we have 40 Hz * 50 Kb (100 Kb sent one way only, so 50 Kb per message on average) = 16 Mb/s, reading Kb as kilobytes and Mb/s as megabits per second
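The scaling and bandwidth figures quoted in the list above can be checked with a few lines of Python. This is a minimal sketch, assuming the frequency on the master is proportional to workers divided by task duration and that 1 kilobyte is 1000 bytes; the function names are illustrative.

```python
def task_duration_for_same_rate(ref_workers, ref_task_s, new_workers):
    """Task duration that keeps workers / duration (the master's task rate) constant."""
    return ref_task_s * new_workers / ref_workers


def bandwidth_mbit_per_s(msg_rate_hz, avg_msg_kbyte):
    """Average bandwidth in Mbit/s, taking 1 kilobyte = 1000 bytes."""
    return msg_rate_hz * avg_msg_kbyte * 8 / 1000


# Reference point: 4 workers with 2 s tasks, as in the tests above.
print(task_duration_for_same_rate(4, 2.0, 40))   # 20.0 (seconds per task)
print(task_duration_for_same_rate(4, 2.0, 400))  # 200.0 (seconds per task)

# Most intensive case: 40 Hz, 50 kilobytes per message on average.
print(bandwidth_mbit_per_s(40, 50))              # 16.0 (Mbit/s)
```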
Appendix
The traceroute from CERN to Taipei:
[lxplus006] ~/network_test/results/test1 > traceroute lcg00122.grid.sinica.edu.tw
traceroute to lcg00122.grid.sinica.edu.tw (140.109.98.152), 30 hops max, 38 byte packets
1 b513-v-rca86-1-ip59 (137.138.5.1) 0.429 ms 0.347 ms 0.262 ms
2 b513-b-rfte6-2-rb13 (194.12.132.53) 0.809 ms 0.308 ms 0.405 ms
3 b513-e-rca80-1-pg2 (194.12.133.250) 1.096 ms 0.833 ms 0.439 ms
4 cernh2-gate3 (192.65.184.34) 1.006 ms 0.703 ms 0.667 ms
5 e513-e-rci76-2-vlan2 (192.65.192.7) 1.120 ms 0.660 ms 0.666 ms
6 202.169.174.146 (202.169.174.146) 124.311 ms 123.912 ms 123.771 ms
7 202.169.174.62 (202.169.174.62) 313.035 ms 312.669 ms 312.710 ms
8 202.169.175.101 (202.169.175.101) 316.662 ms 313.307 ms 313.118 ms
9 lcg00122.grid.sinica.edu.tw (140.109.98.152) 313.083 ms 312.648 ms 312.834 ms
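To see where the latency accumulates on the CERN to Taipei route above, the hop lines can be parsed and the increase in round-trip time between consecutive hops compared. This is a minimal Python sketch; the helper names `hop_min_rtts` and `largest_jump` are illustrative and not part of traceroute or DIANE.

```python
import re

# Hop lines copied from the CERN to Taipei traceroute above.
TRACEROUTE = """\
1 b513-v-rca86-1-ip59 (137.138.5.1) 0.429 ms 0.347 ms 0.262 ms
2 b513-b-rfte6-2-rb13 (194.12.132.53) 0.809 ms 0.308 ms 0.405 ms
3 b513-e-rca80-1-pg2 (194.12.133.250) 1.096 ms 0.833 ms 0.439 ms
4 cernh2-gate3 (192.65.184.34) 1.006 ms 0.703 ms 0.667 ms
5 e513-e-rci76-2-vlan2 (192.65.192.7) 1.120 ms 0.660 ms 0.666 ms
6 202.169.174.146 (202.169.174.146) 124.311 ms 123.912 ms 123.771 ms
7 202.169.174.62 (202.169.174.62) 313.035 ms 312.669 ms 312.710 ms
8 202.169.175.101 (202.169.175.101) 316.662 ms 313.307 ms 313.118 ms
9 lcg00122.grid.sinica.edu.tw (140.109.98.152) 313.083 ms 312.648 ms 312.834 ms
"""


def hop_min_rtts(text):
    """Return [(hop_number, min_rtt_ms)] for hops that reported at least one RTT."""
    hops = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue  # skip headers, prompts and blank lines
        rtts = [float(v) for v in re.findall(r"(\d+\.\d+) ms", line)]
        if rtts:  # hops that only printed '*' have no RTT and are skipped
            hops.append((int(fields[0]), min(rtts)))
    return hops


def largest_jump(hops):
    """Return (hop_number, increase_ms) for the largest RTT increase between hops."""
    jumps = [(cur[0], cur[1] - prev[1]) for prev, cur in zip(hops, hops[1:])]
    return max(jumps, key=lambda jump: jump[1])


hops = hop_min_rtts(TRACEROUTE)
hop, increase_ms = largest_jump(hops)
print(f"largest increase: {increase_ms:.1f} ms at hop {hop}")
```

Using the minimum of the three probes per hop is a design choice that reduces the effect of transient queuing on a single probe.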
The traceroute from Taipei to CERN:
traceroute: Warning: lxplus.cern.ch has multiple addresses; using 137.138.4.164
traceroute to lxslc3.cern.ch (137.138.4.164), 30 hops max, 38 byte packets
1 140.109.98.1 (140.109.98.1) 0.664 ms 0.621 ms 0.604 ms
2 202.169.175.102 (202.169.175.102) 1.209 ms 0.358 ms 0.221 ms
3 202.169.174.61 (202.169.174.61) 189.395 ms 189.408 ms 189.410 ms
4 202.169.174.145 (202.169.174.145) 312.381 ms 312.280 ms 312.301 ms
5 cernh2-vlan2.cern.ch (192.65.192.2) 312.281 ms 312.301 ms 312.306 ms
6 lxplus062.cern.ch (137.138.4.164) 312.391 ms 312.348 ms 312.301 ms
7 lxplus062.cern.ch (137.138.4.164) 319.193 ms 324.146 ms 313.613 ms
8 lxplus062.cern.ch (137.138.4.164) 315.864 ms 324.272 ms 325.011 ms
9 * * *
10 * * *
11 * * *
12 * * *
13 * * *
14 * * *
15 * *
--
JakubMoscicki - 04 Oct 2006