Benchmarking AVX on Intel SandyBridge
01/02/2011
On January 5th Intel announced the release of the new SandyBridge CPUs for notebooks and workstations.
On January 21st we acquired a workstation featuring an i7-2600K 3.40GHz CPU with an integrated graphics unit.
The main interest of SandyBridge is the 256-bit wide AVX floating-point processor with the associated new vector instructions.
With the help of OpenLab and the support of Intel we got access to documentation.
At the moment the compilers' documentation is neither very detailed nor instructive as far as AVX intrinsics are concerned, even though they are fully supported by icc 12 and gcc 4.4 (I strongly advise using gcc 4.6).
Software installation
We installed SLC6 as provided by the CERN network installation, including afs. It comes with gcc 4.4.4, which provides full support for AVX.
We preferred to install the latest snapshot of gcc 4.6, built with native AVX support by default.
gcc 4.6 also has better support for inlining, math optimization and c++0x, and in general gives better performance than 4.4.
Although gcc 4.5.1 supports AVX, it does not properly insert VZEROUPPER instructions, in particular before external function calls (read: a penalty of 150 cycles each time).
This is what was done (not everything is strictly needed):
yum -y install screen
yum -y install autoconf
yum install -y automake
yum install -y libtool
yum install -y cmake
yum install -y gcc-c++.x86_64
yum install -y flex
yum install -y yacc
yum install -y bison
yum install -y glibc*
yum -y install texinfo*
wget ftp://ftp.uvsq.fr/pub/gcc/snapshots/LATEST-4.6/gcc-4.6-20110325.bz2 (actually use just the latest available)
wget ftp://gcc.gnu.org/pub/gcc/infrastructure/mpc-0.8.1.tar.gz
wget ftp://gcc.gnu.org/pub/gcc/infrastructure/mpfr-2.4.2.tar.bz2
wget ftp://gcc.gnu.org/pub/gcc/infrastructure/gmp-4.3.2.tar.bz2
gcc was configured with
--enable-gold=yes --enable-lto --with-fpmath=avx
(remember to run make distclean each time the installation fails!)
If you wish to be independent of afs, you may also install root and boost from the distribution:
yum -y install boost*; yum -y install root*
Micro-kernel benchmarks
We use the usual scimark2 as micro-kernel benchmark. We modified the original version to exercise memory access more heavily in "large" mode, and added the ability to run the kernels in a different order to avoid instruction synchronization in hyper-thread mode.
Source code and executables are in
/afs/cern.ch/user/i/innocent/w1/scimark2
We compare 11 versions:
- gcc 4.5.1 compiled on NHL (sse2) SLC5 with -O2 and -O3 -ffast-math
- gcc 4.4.4 SLC6 native (sse2) with -O2 and -O3 -ffast-math
- gcc 4.4.4 SLC6 -mavx with -O2 and -O3 -ffast-math
- gcc 4.6 SLC6 avx-native with -O2 and -Ofast
- icc12 novec (compiled on SLC5 due to incompatibility with gcc 4.6), which is equivalent to gcc -O2
- icc12 sse (compiled on SLC5 due to incompatibility with gcc 4.6), which is equivalent to gcc -O3 -ffast-math
- icc12 avx (compiled on SLC5 due to incompatibility with gcc 4.6)
In addition we have also tested link time optimization (lto) and profile guided optimization (pgo) for gcc 4.6.
Below, tpgo stands for training, i.e. running the executable compiled with
g++ -Ofast *.c -o scimark2_46_tpgo -flto -fuse-linker-plugin -fprofile-generate=$PWD/pgo
and upgo stands for using pgo, i.e. running the executable compiled with the pgo info collected in the previous run:
g++ -Ofast *.c -o scimark2_46_upgo -flto -fuse-linker-plugin -fprofile-use=$PWD/pgo
We added a new benchmark using GCC 4.7, where opt stands for
c++ -std=gnu++11 -DNDEBUG -Wall -Ofast -march=corei7-avx -mavx -fvisibility-inlines-hidden -ftree-vectorizer-verbose=2 \
--param vect-max-version-for-alias-checks=30 -funsafe-loop-optimizations -ftree-loop-distribution -ftree-loop-if-convert-stores \
-fipa-pta -Wunsafe-loop-optimizations -fgcse-sm -fgcse-las --param max-completely-peel-times=1 *.c -flto -o scimark2_47
and
graphite for the above plus
-fgraphite -fgraphite-identity -floop-block -floop-flatten -floop-interchange -floop-strip-mine -ftree-loop-linear -floop-parallelize-all
We report the numbers in MFLOPS, as scimark does, running each kernel for at least 5 seconds.
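The "at least 5 seconds" rule can be illustrated with a short sketch. It mimics the approach we understand scimark2 to take: the kernel is repeated in batches whose size doubles until one batch lasts at least the minimum time, and the rate is computed from that last batch only. The function name, its parameters and the toy kernel are illustrative assumptions, not scimark2 code.

```python
import time


def measure_mflops(kernel, flops_per_call, min_time=5.0, clock=time.perf_counter):
    """Time `kernel` in doubling batches until one batch lasts >= min_time seconds.

    Returns the rate in MFLOPS, computed from the last (long enough) batch.
    `flops_per_call` is the number of floating-point operations in one call.
    """
    cycles = 1
    while True:
        start = clock()
        for _ in range(cycles):
            kernel()
        elapsed = clock() - start
        if elapsed >= min_time:
            break
        cycles *= 2  # batch too short to be reliable: double it and retry
    return flops_per_call * cycles / elapsed * 1.0e-6


if __name__ == "__main__":
    # Toy kernel (hypothetical): 1000 multiply-adds, i.e. 2000 flops per call.
    def toy_kernel():
        acc = 0.0
        for i in range(1000):
            acc += 1.0001 * i
        return acc

    # A short minimum time keeps the demo quick; the tables here use 5 seconds.
    print("toy kernel: %.2f MFLOPS" % measure_mflops(toy_kernel, 2000.0, min_time=0.2))
```

Discarding the short batches keeps timer resolution and start-up effects out of the reported number, at the price of a run time that can be up to twice the requested minimum.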
For reference, this first table was obtained running on the NHL i7-950 3.07GHz.
| i7-950 3.07GHz | small FFT | small SOR | small MC | small SMM | small LU | large FFT | large SOR | large MC | large SMM | large LU | very large FFT | very large SOR | very large MC | very large SMM | very large LU |
| 4.5.1 -O2 | 1141.89 | 942.28 | 400.41 | 1217.86 | 1770.29 | 518.16 | 877.21 | 400.41 | 1059.77 | 1760.18 | 223.99 | 867.82 | 400.41 | 895.10 | 1433.88 |
| 4.5.1 -O3 -ffast-math | 1149.42 | 941.13 | 409.70 | 1217.86 | 2015.25 | 518.16 | 877.21 | 410.20 | 1063.90 | 1901.37 | 222.24 | 867.82 | 409.70 | 896.67 | 1482.40 |
| icpc -no-vec | 940.29 | 1125.23 | 426.36 | 1154.82 | 1838.83 |  |  |  |  |  |  |  |  |  |  |
| icpc -O3 | 914.41 | 1125.23 | 425.82 | 1163.79 | 2866.84 |  |  |  |  |  |  |  |  |  |  |
The first four rows below correspond to running exactly the same user binaries as above.
| i7-2600K 3.4GHz | small FFT | small SOR | small MC | small SMM | small LU | large FFT | large SOR | large MC | large SMM | large LU | very large FFT | very large SOR | very large MC | very large SMM | very large LU |
| 4.5.1 -O2 | 1624.44 | 1237.21 | 531.77 | 1882.54 | 2193.31 | 620.39 | 1118.93 | 536.01 | 1701.35 | 2236.20 | 218.55 | 1118.66 | 542.95 | 1570.55 | 1847.08 |
| 4.5.1 -O3 -ffast-math | 1591.10 | 1237.21 | 536.87 | 1923.99 | 2395.32 | 617.61 | 1120.97 | 536.01 | 1710.23 | 2515.72 | 219.39 | 1105.77 | 536.00 | 1575.38 | 1915.49 |
| icpc -no-vec | 1431.17 | 1448.84 | 530.92 | 899.29 | 2561.00 | 617.61 | 1315.82 | 530.92 | 1808.38 | 2573.38 |  |  |  |  |  |
| icpc -O3 | 1481.31 | 1468.16 | 538.59 | 1688.53 | 4377.82 | 621.09 | 1314.41 | 530.92 | 1806.39 | 3907.20 | 219.68 | 1293.54 | 530.92 | 1659.64 | 2072.07 |
| icpc -O3 -xAVX | 1573.16 | 1451.57 | 530.09 | 1744.72 | 3832.51 | 624.62 | 1315.82 | 530.09 | 1656.62 | 4191.22 | 219.68 | 1293.54 | 530.09 | 1542.17 | 2086.99 |
| 4.4 -O2 | 1520.04 | 1237.21 | 318.35 | 1923.99 | 2158.63 |  |  |  |  |  |  |  |  |  |  |
| 4.4 -O3 -ffast-math | 1515.08 | 1255.35 | 332.88 | 1949.03 | 2204.37 |  |  |  |  |  |  |  |  |  |  |
| 4.4 -O2 -mavx | 1540.17 | 1255.35 | 322.64 | 1949.03 | 2193.31 |  |  |  |  |  |  |  |  |  |  |
| 4.4 -O3 -ffast-math -mavx | 1620.66 | 1235.23 | 327.68 | 1923.99 | 2878.17 |  |  |  |  |  |  |  |  |  |  |
| 4.6 -O2 (-mavx) | 1639.71 | 1239.20 | 540.33 | 1916.96 | 2582.19 |  |  |  |  |  |  |  |  |  |  |
| 4.6 -Ofast (-mavx) | 1712.14 | 1237.21 | 535.16 | 1906.50 | 3345.27 | 626.04 | 1118.93 | 534.31 | 1794.52 | 2987.86 | 219.68 | 1105.77 | 530.09 | 1617.69 | 2051.22 |
| 4.6 -Ofast (-mavx) -lto | 1712.14 | 1235.23 | 705.35 | 1989.71 | 3381.63 |  |  |  |  |  |  |  |  |  |  |
| 4.6 -Ofast (-mavx) tpgo | 1707.95 | 1257.39 | 199.49 | 1920.47 | 3397.41 |  |  |  |  |  | 222.53 | 1122.58 | 199.49 | 1565.75 | 2072.58 |
| 4.6 -Ofast (-mavx) upgo | 1755.21 | 1259.45 | 710.73 | 2168.72 | 3473.03 |  |  |  |  |  | 219.39 | 1107.04 | 710.22 | 1709.52 | 2053.56 |
| 4.7 -Ofast (-mavx) -lto | 1726.97 | 1257.39 | 709.70 | 2012.62 | 4266.67 |  |  |  |  |  |  |  |  |  |  |
| 4.7 -Ofast (-mavx) -lto opt | 1729.11 | 1257.39 | 710.90 | 2008.77 | 4250.06 |  |  |  |  |  | 229.09 | 1122.58 | 709.70 | 1665.04 | 2151.65 |
| 4.7 -Ofast (-mavx) graphite | 1726.97 | 1259.45 | 709.70 | 2012.62 | 4250.06 |  |  |  |  |  |  |  |  |  |  |
| 4.7 -Ofast (-mavx) upgo | 1777.57 | 1261.51 | 624.15 | 1989.71 | 4551.11 | 639.13 | 1139.72 | 623.22 | 1925.26 | 3843.84 | 230.63 | 1123.90 | 624.15 | 1750.43 | 2167.46 |
The worse score for MC with gcc 4.4 is due to a known defect in the compiler (see the bug report).
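Since the two machines run at different nominal clocks (3.07GHz and 3.40GHz), it can help to separate the frequency gain from the architectural gain. The sketch below does this for the small-size "4.5.1 -O2" rows, i.e. the same binary on both machines. It assumes both CPUs run at their nominal frequency (Turbo Boost is ignored, which is a simplification), and the names are illustrative.

```python
NEHALEM_GHZ = 3.07  # i7-950, nominal clock
SANDY_GHZ = 3.40    # i7-2600K, nominal clock

# Small-size MFLOPS of the same "4.5.1 -O2" binary, copied from the two tables.
nehalem = {"FFT": 1141.89, "SOR": 942.28, "MC": 400.41, "SMM": 1217.86, "LU": 1770.29}
sandy = {"FFT": 1624.44, "SOR": 1237.21, "MC": 531.77, "SMM": 1882.54, "LU": 2193.31}


def per_clock_speedup(new_mflops, old_mflops, new_ghz, old_ghz):
    """Ratio of work done per clock cycle: values above 1.0 mean a gain beyond frequency."""
    return (new_mflops / new_ghz) / (old_mflops / old_ghz)


if __name__ == "__main__":
    for kernel in nehalem:
        raw = sandy[kernel] / nehalem[kernel]
        per_clock = per_clock_speedup(sandy[kernel], nehalem[kernel], SANDY_GHZ, NEHALEM_GHZ)
        print("%-4s raw x%.2f   per clock x%.2f" % (kernel, raw, per_clock))
```

Because the binary is identical (sse2, no AVX), any per-clock gain in these rows comes from the micro-architecture and memory subsystem rather than from the new vector instructions.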
Math function benchmarks
We used the test program from sse_mathfun by Julien Pommier to benchmark simple math functions.
We have not yet recoded the sse functions there in avx, though.
We have added a few more functions with respect to the original test program: in particular double precision versions and a 16-bit log from icsiLog.
Here we compare the code compiled with gcc 4.6 -Ofast on Nehalem and on SandyBridge. In the latter case we have also recompiled glibc 2.13 in native mode.
Intel(R) Core(TM) i7 CPU 950 @ 3.07GHz, slc5 glibc 2.5.58 gcc 4.6.0 SSE2
benching sinf .. -> 14.0 millions of vector evaluations/second -> 36 cycles/value on a 2000MHz computer
benching cosf .. -> 12.4 millions of vector evaluations/second -> 40 cycles/value on a 2000MHz computer
benching sincos (x87) .. -> 6.6 millions of vector evaluations/second -> 76 cycles/value on a 2000MHz computer
benching expf .. -> 1.6 millions of vector evaluations/second -> 297 cycles/value on a 2000MHz computer
benching logf .. -> 10.1 millions of vector evaluations/second -> 50 cycles/value on a 2000MHz computer
benching log16 .. -> 40.4 millions of vector evaluations/second -> 12 cycles/value on a 2000MHz computer
benching atan2f .. -> 10.4 millions of vector evaluations/second -> 48 cycles/value on a 2000MHz computer
benching atan2 .. -> 5.4 millions of vector evaluations/second -> 93 cycles/value on a 2000MHz computer
benching sinl .. -> 8.0 millions of vector evaluations/second -> 62 cycles/value on a 2000MHz computer
benching cosl .. -> 8.0 millions of vector evaluations/second -> 62 cycles/value on a 2000MHz computer
benching expl .. -> 8.4 millions of vector evaluations/second -> 60 cycles/value on a 2000MHz computer
benching logl .. -> 5.5 millions of vector evaluations/second -> 91 cycles/value on a 2000MHz computer
benching cephes_sinf .. -> 25.5 millions of vector evaluations/second -> 20 cycles/value on a 2000MHz computer
benching cephes_cosf .. -> 20.1 millions of vector evaluations/second -> 25 cycles/value on a 2000MHz computer
benching cephes_expf .. -> 7.3 millions of vector evaluations/second -> 68 cycles/value on a 2000MHz computer
benching cephes_logf .. -> 13.1 millions of vector evaluations/second -> 38 cycles/value on a 2000MHz computer
benching sin_ps .. -> 40.1 millions of vector evaluations/second -> 12 cycles/value on a 2000MHz computer
benching cos_ps .. -> 39.9 millions of vector evaluations/second -> 13 cycles/value on a 2000MHz computer
benching sincos_ps .. -> 36.4 millions of vector evaluations/second -> 14 cycles/value on a 2000MHz computer
benching exp_ps .. -> 33.3 millions of vector evaluations/second -> 15 cycles/value on a 2000MHz computer
benching log_ps .. -> 30.6 millions of vector evaluations/second -> 16 cycles/value on a 2000MHz computer
Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz slc6 glibc 2.12-1.7 gcc 4.6.0 SSE2 (same executable as above)
benching sinf .. -> 22.8 millions of vector evaluations/second -> 22 cycles/value on a 2000MHz computer
benching cosf .. -> 20.7 millions of vector evaluations/second -> 24 cycles/value on a 2000MHz computer
benching sincos (x87) .. -> 7.6 millions of vector evaluations/second -> 66 cycles/value on a 2000MHz computer
benching expf .. -> 1.3 millions of vector evaluations/second -> 385 cycles/value on a 2000MHz computer
benching logf .. -> 16.6 millions of vector evaluations/second -> 30 cycles/value on a 2000MHz computer
benching log16 .. -> 49.2 millions of vector evaluations/second -> 10 cycles/value on a 2000MHz computer
benching atan2f .. -> 17.4 millions of vector evaluations/second -> 29 cycles/value on a 2000MHz computer
benching atan2 .. -> 7.8 millions of vector evaluations/second -> 64 cycles/value on a 2000MHz computer
benching sinl .. -> 10.6 millions of vector evaluations/second -> 47 cycles/value on a 2000MHz computer
benching cosl .. -> 10.9 millions of vector evaluations/second -> 46 cycles/value on a 2000MHz computer
benching expl .. -> 11.4 millions of vector evaluations/second -> 44 cycles/value on a 2000MHz computer
benching logl .. -> 7.8 millions of vector evaluations/second -> 64 cycles/value on a 2000MHz computer
benching cephes_sinf .. -> 29.3 millions of vector evaluations/second -> 17 cycles/value on a 2000MHz computer
benching cephes_cosf .. -> 26.7 millions of vector evaluations/second -> 19 cycles/value on a 2000MHz computer
benching cephes_expf .. -> 9.9 millions of vector evaluations/second -> 51 cycles/value on a 2000MHz computer
benching cephes_logf .. -> 18.3 millions of vector evaluations/second -> 27 cycles/value on a 2000MHz computer
benching sin_ps .. -> 46.9 millions of vector evaluations/second -> 11 cycles/value on a 2000MHz computer
benching cos_ps .. -> 46.8 millions of vector evaluations/second -> 11 cycles/value on a 2000MHz computer
benching sincos_ps .. -> 42.2 millions of vector evaluations/second -> 12 cycles/value on a 2000MHz computer
benching exp_ps .. -> 36.4 millions of vector evaluations/second -> 14 cycles/value on a 2000MHz computer
benching log_ps .. -> 34.4 millions of vector evaluations/second -> 15 cycles/value on a 2000MHz computer
Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz slc6 glibc 2.12-1.7 gcc 4.6.1 AVX (only executable)
benching sinf .. -> 19.9 millions of vector evaluations/second -> 25 cycles/value on a 2000MHz computer
benching cosf .. -> 20.8 millions of vector evaluations/second -> 24 cycles/value on a 2000MHz computer
benching sincos (x87) .. -> 7.7 millions of vector evaluations/second -> 65 cycles/value on a 2000MHz computer
benching expf .. -> 1.2 millions of vector evaluations/second -> 388 cycles/value on a 2000MHz computer
benching logf .. -> 16.6 millions of vector evaluations/second -> 30 cycles/value on a 2000MHz computer
benching log16 .. -> 50.0 millions of vector evaluations/second -> 10 cycles/value on a 2000MHz computer
benching atan2f .. -> 17.3 millions of vector evaluations/second -> 29 cycles/value on a 2000MHz computer
benching atan2 .. -> 7.8 millions of vector evaluations/second -> 64 cycles/value on a 2000MHz computer
benching sinl .. -> 10.5 millions of vector evaluations/second -> 48 cycles/value on a 2000MHz computer
benching cosl .. -> 11.0 millions of vector evaluations/second -> 45 cycles/value on a 2000MHz computer
benching expl .. -> 10.3 millions of vector evaluations/second -> 49 cycles/value on a 2000MHz computer
benching logl .. -> 7.7 millions of vector evaluations/second -> 65 cycles/value on a 2000MHz computer
benching cephes_sinf .. -> 29.9 millions of vector evaluations/second -> 17 cycles/value on a 2000MHz computer
benching cephes_cosf .. -> 27.2 millions of vector evaluations/second -> 18 cycles/value on a 2000MHz computer
benching cephes_expf .. -> 11.9 millions of vector evaluations/second -> 42 cycles/value on a 2000MHz computer
benching cephes_logf .. -> 20.1 millions of vector evaluations/second -> 25 cycles/value on a 2000MHz computer
benching sin_ps .. -> 46.7 millions of vector evaluations/second -> 11 cycles/value on a 2000MHz computer
benching cos_ps .. -> 47.4 millions of vector evaluations/second -> 11 cycles/value on a 2000MHz computer
benching sincos_ps .. -> 43.1 millions of vector evaluations/second -> 12 cycles/value on a 2000MHz computer
benching exp_ps .. -> 36.7 millions of vector evaluations/second -> 14 cycles/value on a 2000MHz computer
benching log_ps .. -> 35.0 millions of vector evaluations/second -> 14 cycles/value on a 2000MHz computer
Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz slc6 glibc 2.13 gcc 460 SSE2 (glibc compiled with 461 -mavx)
benching sinf .. -> 22.7 millions of vector evaluations/second -> 22 cycles/value on a 2000MHz computer
benching cosf .. -> 21.3 millions of vector evaluations/second -> 23 cycles/value on a 2000MHz computer
benching sincos (x87) .. -> 7.6 millions of vector evaluations/second -> 66 cycles/value on a 2000MHz computer
benching expf .. -> 1.2 millions of vector evaluations/second -> 385 cycles/value on a 2000MHz computer
benching logf .. -> 18.6 millions of vector evaluations/second -> 27 cycles/value on a 2000MHz computer
benching log16 .. -> 49.5 millions of vector evaluations/second -> 10 cycles/value on a 2000MHz computer
benching atan2f .. -> 18.5 millions of vector evaluations/second -> 27 cycles/value on a 2000MHz computer
benching atan2 .. -> 8.9 millions of vector evaluations/second -> 56 cycles/value on a 2000MHz computer
benching sinl .. -> 11.9 millions of vector evaluations/second -> 42 cycles/value on a 2000MHz computer
benching cosl .. -> 11.9 millions of vector evaluations/second -> 42 cycles/value on a 2000MHz computer
benching expl .. -> 13.0 millions of vector evaluations/second -> 38 cycles/value on a 2000MHz computer
benching logl .. -> 9.6 millions of vector evaluations/second -> 52 cycles/value on a 2000MHz computer
benching cephes_sinf .. -> 29.4 millions of vector evaluations/second -> 17 cycles/value on a 2000MHz computer
benching cephes_cosf .. -> 26.7 millions of vector evaluations/second -> 19 cycles/value on a 2000MHz computer
benching cephes_expf .. -> 9.9 millions of vector evaluations/second -> 51 cycles/value on a 2000MHz computer
benching cephes_logf .. -> 17.6 millions of vector evaluations/second -> 28 cycles/value on a 2000MHz computer
benching sin_ps .. -> 46.8 millions of vector evaluations/second -> 11 cycles/value on a 2000MHz computer
benching cos_ps .. -> 46.8 millions of vector evaluations/second -> 11 cycles/value on a 2000MHz computer
benching sincos_ps .. -> 42.1 millions of vector evaluations/second -> 12 cycles/value on a 2000MHz computer
benching exp_ps .. -> 36.4 millions of vector evaluations/second -> 14 cycles/value on a 2000MHz computer
benching log_ps .. -> 34.0 millions of vector evaluations/second -> 15 cycles/value on a 2000MHz computer
Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz slc6 glibc 2.13 gcc 461 AVX (glibc and executable compiled with 461 -mavx)
benching sinf .. -> 20.0 millions of vector evaluations/second -> 25 cycles/value on a 2000MHz computer
benching cosf .. -> 21.2 millions of vector evaluations/second -> 23 cycles/value on a 2000MHz computer
benching sincos (x87) .. -> 7.7 millions of vector evaluations/second -> 65 cycles/value on a 2000MHz computer
benching expf .. -> 1.2 millions of vector evaluations/second -> 388 cycles/value on a 2000MHz computer
benching logf .. -> 18.4 millions of vector evaluations/second -> 27 cycles/value on a 2000MHz computer
benching log16 .. -> 50.1 millions of vector evaluations/second -> 10 cycles/value on a 2000MHz computer
benching atan2f .. -> 18.4 millions of vector evaluations/second -> 27 cycles/value on a 2000MHz computer
benching atan2 .. -> 8.9 millions of vector evaluations/second -> 56 cycles/value on a 2000MHz computer
benching sinl .. -> 11.9 millions of vector evaluations/second -> 42 cycles/value on a 2000MHz computer
benching cosl .. -> 11.8 millions of vector evaluations/second -> 42 cycles/value on a 2000MHz computer
benching expl .. -> 11.7 millions of vector evaluations/second -> 43 cycles/value on a 2000MHz computer
benching logl .. -> 9.5 millions of vector evaluations/second -> 53 cycles/value on a 2000MHz computer
benching cephes_sinf .. -> 30.0 millions of vector evaluations/second -> 17 cycles/value on a 2000MHz computer
benching cephes_cosf .. -> 27.1 millions of vector evaluations/second -> 18 cycles/value on a 2000MHz computer
benching cephes_expf .. -> 11.9 millions of vector evaluations/second -> 42 cycles/value on a 2000MHz computer
benching cephes_logf .. -> 20.3 millions of vector evaluations/second -> 25 cycles/value on a 2000MHz computer
benching sin_ps .. -> 46.6 millions of vector evaluations/second -> 11 cycles/value on a 2000MHz computer
benching cos_ps .. -> 47.4 millions of vector evaluations/second -> 11 cycles/value on a 2000MHz computer
benching sincos_ps .. -> 43.2 millions of vector evaluations/second -> 12 cycles/value on a 2000MHz computer
benching exp_ps .. -> 36.8 millions of vector evaluations/second -> 14 cycles/value on a 2000MHz computer
benching log_ps .. -> 35.0 millions of vector evaluations/second -> 14 cycles/value on a 2000MHz computer
-Ofast affects the loop in the benchmark. For comparison, this is with just -O2 (no auto-vectorization, less inlining):
Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz slc6 glibc 2.13 gcc 461 -O2 AVX (glibc and executable compiled with 461 -mavx)
benching sinf .. -> 16.5 millions of vector evaluations/second -> 30 cycles/value on a 2000MHz computer
benching cosf .. -> 15.9 millions of vector evaluations/second -> 31 cycles/value on a 2000MHz computer
benching sincos (x87) .. -> 6.4 millions of vector evaluations/second -> 78 cycles/value on a 2000MHz computer
benching expf .. -> 1.2 millions of vector evaluations/second -> 388 cycles/value on a 2000MHz computer
benching logf .. -> 8.8 millions of vector evaluations/second -> 57 cycles/value on a 2000MHz computer
benching log16 .. -> 13.0 millions of vector evaluations/second -> 38 cycles/value on a 2000MHz computer
benching atan2f .. -> 8.7 millions of vector evaluations/second -> 57 cycles/value on a 2000MHz computer
benching atan2 .. -> 5.8 millions of vector evaluations/second -> 85 cycles/value on a 2000MHz computer
benching sinl .. -> 11.7 millions of vector evaluations/second -> 43 cycles/value on a 2000MHz computer
benching cosl .. -> 11.5 millions of vector evaluations/second -> 43 cycles/value on a 2000MHz computer
benching expl .. -> 8.7 millions of vector evaluations/second -> 57 cycles/value on a 2000MHz computer
benching logl .. -> 7.1 millions of vector evaluations/second -> 70 cycles/value on a 2000MHz computer
benching cephes_sinf .. -> 11.8 millions of vector evaluations/second -> 42 cycles/value on a 2000MHz computer
benching cephes_cosf .. -> 11.0 millions of vector evaluations/second -> 45 cycles/value on a 2000MHz computer
benching cephes_expf .. -> 7.8 millions of vector evaluations/second -> 64 cycles/value on a 2000MHz computer
benching cephes_logf .. -> 8.0 millions of vector evaluations/second -> 62 cycles/value on a 2000MHz computer
benching sin_ps .. -> 46.4 millions of vector evaluations/second -> 11 cycles/value on a 2000MHz computer
benching cos_ps .. -> 47.5 millions of vector evaluations/second -> 11 cycles/value on a 2000MHz computer
benching sincos_ps .. -> 43.1 millions of vector evaluations/second -> 12 cycles/value on a 2000MHz computer
benching exp_ps .. -> 36.6 millions of vector evaluations/second -> 14 cycles/value on a 2000MHz computer
benching log_ps .. -> 35.1 millions of vector evaluations/second -> 14 cycles/value on a 2000MHz computer
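A note on reading the "cycles/value" column: the figures in all listings are quoted for "a 2000MHz computer", not for the actual clock of the machine. The printed numbers are consistent with cycles/value = 2000 / (4 x millions of vector evaluations per second), i.e. four values per vector evaluation. The sketch below reproduces that arithmetic and rescales it to the nominal clocks; the formula is inferred from the listings rather than taken from the test program's source, and the function name is illustrative.

```python
def cycles_per_value(mvec_per_s, clock_mhz=2000.0, lanes=4):
    """Cycles spent per scalar value.

    mvec_per_s: millions of vector evaluations per second, as printed.
    clock_mhz:  clock assumed for the conversion (the listings use 2000).
    lanes:      values per vector evaluation (4 is inferred from the output).
    """
    return clock_mhz / (mvec_per_s * lanes)


if __name__ == "__main__":
    # (function, Mvec/s) pairs copied from the listings above.
    samples = [
        ("sinf   i7-950  ", 14.0, 3070.0),
        ("sin_ps i7-950  ", 40.1, 3070.0),
        ("sinf   i7-2600K", 22.8, 3400.0),
        ("sin_ps i7-2600K", 46.9, 3400.0),
    ]
    for name, rate, nominal_mhz in samples:
        print("%s  as printed: %5.1f   at nominal clock: %5.1f"
              % (name, cycles_per_value(rate), cycles_per_value(rate, nominal_mhz)))
```

The evaluations/second column is therefore the safer one for comparing the two machines; the cycles/value column only ranks functions within a single listing.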
--
VincenzoInnocente - 23-Mar-2011