DIANE Network Testing for SLAs

  • The job consists of 3K short tasks of various durations: 0.1, 1.0 and 2.0 seconds
  • Each task returns a data message to the master over the network; we test two message sizes: <1Kb and 100Kb of random string
  • We use 4 workers
    • Local setup : master and workers are in the local network in Taipei
    • Remote setup : master is at CERN, workers are in Taipei

The experimental results are shown in the table below: the total duration of the job (3000 tasks) and the message frequency on the master (note that the 100Kb message was sent in one direction only).

Clicking on a result takes you to the detailed web page report. The task duration distribution (histogram) is measured from the moment the task is sent to the worker until the result is received, so it also includes the network latency.

  time\data   local                       remote
              < 1Kb         100Kb         < 1Kb         100Kb
  0.1 s       437s/36Hz     1438s/4Hz     1338s/5.5Hz   11772s/0.47Hz
  1.0 s       787s/7.6Hz    2100s/2.8Hz   1318s/4.5Hz   12498s/0.43Hz
  2.0 s       1539s/3.87Hz  2868s/2.0Hz   1910s/3.0Hz   13379s/0.40Hz
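
As a rough cross-check of the table, the measured durations can be compared with an idealised lower bound in which all 4 workers are busy for the whole job and neither the framework nor the network costs anything: 3000 tasks × task duration / 4 workers. The sketch below is only an illustration. The function names are hypothetical (they are not part of DIANE) and the measured values are copied from the table.

```python
# Idealised job duration: all workers busy from start to finish, zero overhead.
# N_TASKS and N_WORKERS come from the test setup; the function names are
# illustrative only and are not part of DIANE.
N_TASKS = 3000
N_WORKERS = 4


def ideal_duration(task_s, n_tasks=N_TASKS, n_workers=N_WORKERS):
    """Lower bound on the total job duration, in seconds."""
    return n_tasks * task_s / n_workers


def overhead(measured_s, task_s):
    """Seconds the measured job took beyond the idealised lower bound."""
    return measured_s - ideal_duration(task_s)


# Measured total durations in seconds, keyed by task duration in seconds.
measured = {
    0.1: {"local <1Kb": 437, "local 100Kb": 1438,
          "remote <1Kb": 1338, "remote 100Kb": 11772},
    1.0: {"local <1Kb": 787, "local 100Kb": 2100,
          "remote <1Kb": 1318, "remote 100Kb": 12498},
    2.0: {"local <1Kb": 1539, "local 100Kb": 2868,
          "remote <1Kb": 1910, "remote 100Kb": 13379},
}

for task_s, row in measured.items():
    for case, total_s in row.items():
        print(f"{task_s:>4} s  {case:<13} "
              f"ideal {ideal_duration(task_s):6.0f} s  "
              f"overhead {overhead(total_s, task_s):7.0f} s")
```

For the local < 1Kb case with 1.0 s and 2.0 s tasks the bound is 750 s and 1500 s, against 787 s and 1539 s measured, so the overhead stays below 40 s. The bound is optimistic whenever workers start late, which is the bias described for the 0.1 second local case.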

  1. The 0.1 second local case includes overhead from the framework and is also biased, because not all workers started at the same time (queuing time in the grid).
  2. The 0.1 second remote case also has some queuing overhead (less than the local one).
  3. The 0.1 second remote case shows a strange pattern in the task duration histogram which we do not understand.
  4. We see the network effect between the local and remote setups. The network overhead gets smaller as the task duration increases, because the message frequency on the master is lower.
  5. The question is: given the same frequency, will the network effect still be seen if the task duration is longer (e.g. 20 s/task and 40 workers)?
  6. Some applications fall in the 1 to 2 second range, others are in the order of minutes.
  7. The current scalability of the system is 400 workers. To maintain the same frequency (and observe the network effect as in the 2 s case above) the task duration should be about 200 seconds (roughly 3 minutes). What is the system overhead when there are 400 open TCP/IP connections and serving threads?
  8. We would need more measurements to establish the point where the network effect stops playing a role.
  9. We do not seem to be limited by bandwidth but rather by latency: in the most intensive case we have 40Hz * 50 Kb (one way 100Kb) = 16Mb/s.
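
The scaling argument in points 5 and 7 and the bandwidth estimate in point 9 are simple arithmetic, sketched below. The sketch assumes that the task rate seen by the master is roughly workers / task duration (overheads ignored), that Kb means kilobytes and Mb megabits (the reading under which the figure in point 9 works out), and that 1 MB = 1000 KB. The function names are illustrative only.

```python
# Back-of-the-envelope helpers for the scaling and bandwidth estimates.
# Assumptions: task rate on the master ~ n_workers / task duration,
# "Kb" read as kilobytes, "Mb" as megabits, decimal prefixes (1 MB = 1000 KB).


def task_rate_hz(n_workers, task_s):
    """Approximate number of tasks completed per second across all workers."""
    return n_workers / task_s


def task_duration_for_same_rate(n_workers, ref_workers=4, ref_task_s=2.0):
    """Task duration that keeps the task rate of the reference setup."""
    return n_workers * ref_task_s / ref_workers


def bandwidth_mbit_s(rate_hz, avg_kbyte_per_msg):
    """Average bandwidth in megabits per second."""
    return rate_hz * avg_kbyte_per_msg * 8 / 1000


print(task_rate_hz(4, 2.0))              # reference rate: 2.0 tasks/s
print(task_duration_for_same_rate(40))   # 20.0 s, as in point 5
print(task_duration_for_same_rate(400))  # 200.0 s, as in point 7
print(bandwidth_mbit_s(40, 50))          # 16.0 Mb/s, as in point 9
```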

Appendix

The traceroute from CERN to Taipei:

[lxplus006] ~/network_test/results/test1 > traceroute lcg00122.grid.sinica.edu.tw
traceroute to lcg00122.grid.sinica.edu.tw (140.109.98.152), 30 hops max, 38 byte packets
 1  b513-v-rca86-1-ip59 (137.138.5.1)  0.429 ms  0.347 ms  0.262 ms
 2  b513-b-rfte6-2-rb13 (194.12.132.53)  0.809 ms  0.308 ms  0.405 ms
 3  b513-e-rca80-1-pg2 (194.12.133.250)  1.096 ms  0.833 ms  0.439 ms
 4  cernh2-gate3 (192.65.184.34)  1.006 ms  0.703 ms  0.667 ms
 5  e513-e-rci76-2-vlan2 (192.65.192.7)  1.120 ms  0.660 ms  0.666 ms
 6  202.169.174.146 (202.169.174.146)  124.311 ms  123.912 ms  123.771 ms
 7  202.169.174.62 (202.169.174.62)  313.035 ms  312.669 ms  312.710 ms
 8  202.169.175.101 (202.169.175.101)  316.662 ms  313.307 ms  313.118 ms
 9  lcg00122.grid.sinica.edu.tw (140.109.98.152)  313.083 ms  312.648 ms  312.834 ms

The traceroute from Taipei to CERN:

traceroute: Warning: lxplus.cern.ch has multiple addresses; using 137.138.4.164
traceroute to lxslc3.cern.ch (137.138.4.164), 30 hops max, 38 byte packets
 1  140.109.98.1 (140.109.98.1)  0.664 ms  0.621 ms  0.604 ms
 2  202.169.175.102 (202.169.175.102)  1.209 ms  0.358 ms  0.221 ms
 3  202.169.174.61 (202.169.174.61)  189.395 ms  189.408 ms  189.410 ms
 4  202.169.174.145 (202.169.174.145)  312.381 ms  312.280 ms  312.301 ms
 5  cernh2-vlan2.cern.ch (192.65.192.2)  312.281 ms  312.301 ms  312.306 ms
 6  lxplus062.cern.ch (137.138.4.164)  312.391 ms  312.348 ms  312.301 ms
 7  lxplus062.cern.ch (137.138.4.164)  319.193 ms  324.146 ms  313.613 ms
 8  lxplus062.cern.ch (137.138.4.164)  315.864 ms  324.272 ms  325.011 ms
 9  * * *
10  * * *
11  * * *
12  * * *
13  * * *
14  * * *
15  * *
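
Both traceroutes show a round-trip time of about 312 ms between CERN and Taipei. To illustrate why latency rather than bandwidth may dominate, the sketch below estimates how much time is added to the job if every task costs a fixed number of network round trips. The number of round trips per task is purely an assumption: it was not measured here and is not taken from the DIANE protocol. The names are illustrative.

```python
# Rough latency cost of the remote setup, assuming a fixed number of
# round trips per task (an assumption, not a measured DIANE property).
RTT_S = 0.312     # approximate CERN <-> Taipei round-trip time from traceroute
N_TASKS = 3000    # tasks in the job
N_WORKERS = 4     # workers processing tasks in parallel


def latency_cost_s(round_trips_per_task, rtt_s=RTT_S,
                   n_tasks=N_TASKS, n_workers=N_WORKERS):
    """Seconds added to the job if each worker waits for the network per task."""
    tasks_per_worker = n_tasks / n_workers
    return tasks_per_worker * round_trips_per_task * rtt_s


for k in (1, 2):
    print(f"{k} round trip(s) per task: "
          f"about {latency_cost_s(k):.0f} s added to the job")
```

For comparison, in the < 1Kb columns of the results table the remote runs are 531 s (1.0 s tasks) and 371 s (2.0 s tasks) longer than the local ones, which is the same order of magnitude as one or two round trips per task.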

-- JakubMoscicki - 04 Oct 2006

Topic revision: r3 - 2006-10-13 - JakubMoscicki
 