Compression performance
more tests on slc6 corei7-avx (04/03/2011)
[vinavx0] /tmp/innocent $ time ./gzip -1 glibc.tar
3.521u 0.099s 0:03.62 99.7% 0+0k 0+199336io 0pf+0w
[vinavx0] /tmp/innocent $ gunzip glibc.tar.gz
[vinavx0] /tmp/innocent $ time gzip -1 glibc.tar
3.399u 0.118s 0:03.51 99.7% 0+0k 0+199336io 0pf+0w
[vinavx0] /tmp/innocent $ gunzip glibc.tar.gz
[vinavx0] /tmp/innocent $ time ~/w1/zlib-1.2.5/minigzip64_ori -1 glibc.tar
3.083u 0.126s 0:03.21 99.6% 0+0k 0+198952io 0pf+0w
[vinavx0] /tmp/innocent $ gunzip glibc.tar.gz
[vinavx0] /tmp/innocent $ time ~/w1/zlib-1.2.5/minigzip64_fast -1 glibc.tar
3.042u 0.112s 0:03.15 100.0% 0+0k 0+198952io 16pf+0w
[vinavx0] /tmp/innocent $ gunzip glibc.tar.gz
[vinavx0] /tmp/innocent $ time ~/w1/zlib-1.2.5/minigzip64_fastest glibc.tar
2.092u 0.126s 0:02.22 99.5% 0+0k 0+203880io 16pf+0w
[vinavx0] /tmp/innocent $ time ~/w1/zlib-1.2.5/minigzip64_fast -7 glibc.tar
5.798u 0.099s 0:05.90 99.6% 0+0k 0+189776io 16pf+0w
[vinavx0] /tmp/innocent $ cp glibc.tar.gz glibc.tar.gz_7
[vinavx0] /tmp/innocent $ gunzip glibc.tar.gz
[vinavx0] /tmp/innocent $ time ~/w1/zlib-1.2.5/minigzip64_fast -9 glibc.tar
12.831u 0.116s 0:13.09 98.8% 0+0k 0+189512io 0pf+0w
[vinavx0] /tmp/innocent $ cp glibc.tar.gz glibc.tar.gz_9
[vinavx0] /tmp/innocent $ gunzip glibc.tar.gz
[vinavx0] /tmp/innocent $ time ~/w1/zlib-1.2.5/minigzip64_fastest glibc.tar
2.134u 0.111s 0:02.25 99.5% 0+0k 0+203880io 16pf+0w
-rw-r--r--. 1 innocent zh 189542400 Mar 4 09:55 glibc.tar
-rw-r--r--. 1 innocent zh 104385597 Mar 4 09:55 glibc.tar.gz_1
-rw-r--r--. 1 innocent zh 97163693 Mar 4 09:54 glibc.tar.gz_7
-rw-r--r--. 1 innocent zh 97029520 Mar 4 09:55 glibc.tar.gz_9
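The sizes in this listing can be turned into compression ratios with a couple of lines of C. This is only a sketch of the arithmetic; the helper name `compressed_fraction` is made up for this note and is not part of gzip or zlib.

```c
#include <assert.h>

/* Fraction of the original size that is left after compression
 * (hypothetical helper, for illustration only). */
double compressed_fraction(long long original_bytes, long long compressed_bytes)
{
    assert(original_bytes > 0);
    return (double)compressed_bytes / (double)original_bytes;
}
```

With the sizes listed here this gives about 0.551 for -1, 0.513 for -7 and 0.512 for -9: going from -7 to -9 hardly changes the size, while the user time more than doubles (5.8 s vs 12.8 s).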
recompiling (again) libz (20/02/2011)
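if one is interested only in fast compression (-1), recompiling zlib with -DFASTEST gives an additional speedup of about 30% in user time (from 3.11 s to 2.11 s), at the cost of a slightly larger output (203880 vs 198952 output blocks, about 2.5% more)
(at least when compressing glibc)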
time ~/w1/zlib-1.2.5/minigzip64_fastest -1 glibc.tar
2.112u 0.103s 0:02.21 100.0% 0+0k 0+203880io 12pf+0w
time ~/w1/zlib-1.2.5/minigzip64_ori -1 glibc.tar
3.113u 0.119s 0:03.23 99.6% 0+0k 0+198952io 0pf+0w
in principle FASTEST could be implemented as a runtime switch instead of a compile-time option
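One way to picture such a runtime switch is a function pointer that is selected once, when the stream is set up. The sketch below is not zlib code: `match_fn`, `pick_matcher`, `match_first_only` and `match_all` are hypothetical names, and the two routines only mimic the assumption that a fast mode looks at fewer match candidates than the normal mode.

```c
#include <assert.h>
#include <stddef.h>

/* Length of the common prefix of a and b, at most max_len bytes. */
unsigned common_prefix(const unsigned char *a, const unsigned char *b,
                       unsigned max_len)
{
    unsigned n = 0;
    while (n < max_len && a[n] == b[n]) n++;
    return n;
}

/* A matcher receives the text to match and a list of earlier candidates. */
typedef unsigned (*match_fn)(const unsigned char *scan,
                             const unsigned char *const *cands, size_t ncands,
                             unsigned max_len);

/* "fastest" flavour (assumption): only the first candidate is tried. */
unsigned match_first_only(const unsigned char *scan,
                          const unsigned char *const *cands, size_t ncands,
                          unsigned max_len)
{
    return ncands ? common_prefix(scan, cands[0], max_len) : 0;
}

/* normal flavour: every candidate is tried and the longest match wins. */
unsigned match_all(const unsigned char *scan,
                   const unsigned char *const *cands, size_t ncands,
                   unsigned max_len)
{
    unsigned best = 0;
    for (size_t i = 0; i < ncands; i++) {
        unsigned n = common_prefix(scan, cands[i], max_len);
        if (n > best) best = n;
    }
    return best;
}

/* The runtime switch: decided once, e.g. from a command-line option. */
match_fn pick_matcher(int fastest)
{
    return fastest ? match_first_only : match_all;
}
```

The indirect call costs something in the hot loop, which is presumably why zlib makes this choice at compile time; whether that cost matters would have to be measured.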
recompiling (again) libz (20/02/2011)
summary:
we can make zlib compression (and even decompression) about 20% faster just by recompiling zlib with proper optimization flags, such as
CFLAGS=-O3 ${LOC} -DUNALIGNED_OK -D_LARGEFILE64_SOURCE=1 -v -ftree-vectorizer-verbose=1 -msse3
as in
https://innocent.web.cern.ch/innocent/perfResults/igprof-navigator/repack_opt_CMSSW_4_2_0_pre5_slc5_amd64_gcc451/self
vs
https://innocent.web.cern.ch/innocent/perfResults/igprof-navigator/repack_ori_CMSSW_4_2_0_pre5_slc5_amd64_gcc451/self
apparently there is no need for assembler, specialized sse code, etc.
as the assembler shows, all the time goes into a couple of assignments and comparisons:
/* do {
* match = s->window + cur_match;
* if (*(ushf*)(match+best_len-1) != scan_end ||
* *(ushf*)match != scan_start) continue;
* [...]
* } while ((cur_match = prev[cur_match & wmask]) > limit
* && --chain_length != 0);
*
* Here is the inner loop of the function. The function will spend the
* majority of its time in this loop, and majority of that time will
* be spent in the first ten instructions.
*/
the rest is minor.
(see
https://innocent.web.cern.ch/innocent/perfResults/igprof-navigator/repack_asm_CMSSW_4_2_X_2011-02-11-0200_slc5_amd64_gcc451/self)
(in the end, having igprof show more details for the assembly code is not bad!)
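To make the structure of that inner loop concrete, here is a much simplified, self-contained version in C. It is a sketch and not zlib's longest_match: there is no sliding window, no wmask and no limit, the chain is ended by a made-up sentinel `CHAIN_END`, and the quick rejection tests single bytes instead of 16-bit words. All names are hypothetical.

```c
#include <assert.h>
#include <limits.h>

#define CHAIN_END UINT_MAX   /* hypothetical end-of-chain marker */

/* Walk the chain of earlier positions starting at cur_match and return the
 * longest match with the text at scan_pos (at most max_len bytes).
 * prev[p] is the previous position on the same chain, or CHAIN_END.
 * Every candidate must lie before scan_pos, and scan_pos + max_len must
 * not run past the end of the window. */
unsigned longest_match_sketch(const unsigned char *window, const unsigned *prev,
                              unsigned scan_pos, unsigned cur_match,
                              unsigned max_len, unsigned chain_length)
{
    const unsigned char *scan = window + scan_pos;
    unsigned best_len = 0;

    while (cur_match != CHAIN_END && chain_length-- != 0 && best_len < max_len) {
        const unsigned char *match = window + cur_match;
        assert(cur_match < scan_pos);

        /* Quick rejection: a candidate can only beat best_len if the byte
         * at offset best_len matches, and it must start with the same byte.
         * Most candidates fail here, so this test dominates the run time. */
        if (match[best_len] == scan[best_len] && match[0] == scan[0]) {
            unsigned len = 0;
            while (len < max_len && match[len] == scan[len]) len++;
            if (len > best_len) best_len = len;
        }
        cur_match = prev[cur_match];
    }
    return best_len;
}
```

For the window "abcxabcdyabcdz", scanning at position 9 with the chain 4, then 0, the candidate at 4 gives a match of length 4 and the candidate at 0 is thrown away by the quick rejection without a full comparison.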
As an example of specialized sse code,
I tried to replace the comparison loop of "longest_match" with the specialized string-comparison instruction available in sse4.2
(see
http://www.strchr.com/strcmp_and_strlen_using_sse_4.2 for a description)
the code is these few lines (the plain C code compares byte by byte or, with UNALIGNED_OK as shown here, two bytes at a time)
#ifdef USE_SSE4_2
  /* compare 16 bytes per iteration with the sse4.2 string instruction (needs <nmmintrin.h>) */
  const int ssebits = 16;  /* despite the name: bytes per 128-bit register */
  const int mode1 = _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH |
                    _SIDD_LEAST_SIGNIFICANT | _SIDD_NEGATIVE_POLARITY;
  int res = 0;
  scan += 3, match += 3;
  /* res is the index of the first differing byte, or 16 if all 16 bytes are equal */
  while (ssebits == (res = _mm_cmpestri(_mm_loadu_si128((__m128i const *)(scan)), ssebits,
                                        _mm_loadu_si128((__m128i const *)(match)), ssebits, mode1)) &&
         (scan += ssebits) < strend) {
    match += ssebits;
  }
  if (scan < strend) scan += res;
  else scan--;
#else
  /* old C version: four 16-bit comparisons per iteration */
  scan++, match++;
  do {
  } while (*(ushf*)(scan+=2) == *(ushf*)(match+=2) &&
           *(ushf*)(scan+=2) == *(ushf*)(match+=2) &&
           *(ushf*)(scan+=2) == *(ushf*)(match+=2) &&
           *(ushf*)(scan+=2) == *(ushf*)(match+=2) &&
           scan < strend);
  if (*scan == *match) scan++;
#endif
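Both variants follow the same pattern: compare a wide chunk at a time and look for the exact mismatch position only once a chunk differs. The same idea can be written in portable C with 8-byte words instead of 16-byte registers. This is an illustration only, not the code that was timed; `match_length_wide` is a made-up name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Number of leading bytes that are equal in a and b, at most max_len.
 * Compares 8 bytes per step, then finishes byte by byte. */
unsigned match_length_wide(const unsigned char *a, const unsigned char *b,
                           unsigned max_len)
{
    unsigned n = 0;

    while (n + 8 <= max_len) {
        uint64_t wa, wb;
        memcpy(&wa, a + n, 8);   /* memcpy keeps unaligned loads well defined */
        memcpy(&wb, b + n, 8);
        if (wa != wb) break;     /* the mismatch is inside these 8 bytes */
        n += 8;
    }
    while (n < max_len && a[n] == b[n]) n++;
    return n;
}
```

A function like this is also handy as a reference when checking that a hand-written sse or assembler variant returns the same match lengths.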
timing results
https://innocent.web.cern.ch/innocent/perfResults/igprof-navigator/repack_nhl_CMSSW_4_2_0_pre5_slc5_amd64_gcc451/self
are essentially equivalent to the C code (once compiled with optimization) or the sse2 asm code
https://innocent.web.cern.ch/innocent/perfResults/igprof-navigator/repack_asm_CMSSW_4_2_X_2011-02-11-0200_slc5_amd64_gcc451/self
a small hint to the compiler that the comparison usually fails seems to help (to be confirmed)
if (likely( *(ushf*)(match+best_len-1) != scan_end ||
*(ushf*)match != scan_start) ) continue;
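`likely` is not standard C. With gcc and clang it is usually defined on top of `__builtin_expect`, as in the self-contained sketch below. The function `count_survivors` and its arguments are purely illustrative and only mimic the quick rejection test.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#  define likely(x)   __builtin_expect(!!(x), 1)
#  define unlikely(x) __builtin_expect(!!(x), 0)
#else
#  define likely(x)   (x)   /* no hint available: plain condition */
#  define unlikely(x) (x)
#endif

/* Count the candidates that pass a quick check against scan: same first
 * byte and same byte at offset probe. */
size_t count_survivors(const unsigned char *scan,
                       const unsigned char *const *cands, size_t ncands,
                       size_t probe)
{
    size_t survivors = 0;
    for (size_t i = 0; i < ncands; i++) {
        /* the check usually fails, so "continue" is marked as the hot path */
        if (likely(cands[i][probe] != scan[probe] || cands[i][0] != scan[0]))
            continue;
        survivors++;
    }
    return survivors;
}
```

The hint changes only how the compiler lays out the branches, never the result.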
libz recompiled in cmssw (11/02/2011)
20% faster
new zlib (including asm version of "match" and "inflate_fast")
11-Feb-2011 13:19:03 CET Closed file file:/tmp/innocent/output_streamer.dat
53.792u 1.043s 0:55.51 98.7% 0+0k 0+0io 138pf+0w
https://innocent.web.cern.ch/innocent/perfResults/igprof-navigator/repack_CMSSW_4_2_X_2011-02-11-0200_slc5_amd64_gcc451/self
original
11-Feb-2011 13:23:52 CET Closed file file:/tmp/innocent/output_streamer.dat
65.433u 0.964s 1:06.53 99.7% 0+0k 0+0io 0pf+0w
https://innocent.web.cern.ch/innocent/perfResults/igprof-navigator/repack_ori_CMSSW_4_2_X_2011-02-11-0200_slc5_amd64_gcc451/self
a comparison at symbol level is difficult because the assembler version has more symbols (those with a capital letter):
igprof shows all internal symbols, such as LookupLoop, because their names do not start with .L
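The background is that the GNU assembler treats names starting with `.L` as local labels and, on ELF targets, does not write them to the symbol table, so a profiler never sees them. Labels with ordinary names end up in the symbol table and show up as separate entries. A trivial check in C (the function name is made up):

```c
#include <assert.h>
#include <string.h>

/* 1 if the name follows the ".L" local-label convention, 0 otherwise. */
int is_assembler_local_label(const char *name)
{
    return strncmp(name, ".L", 2) == 0;
}
```

By this rule LookupLoop is an ordinary symbol, while something like .L42 would be local.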
btw, the two output files have exactly the same size
-rw-r--r-- 1 innocent zh 177543189 Feb 11 13:33 output_raw.root
-rw-r--r-- 1 innocent zh 177543189 Feb 11 13:25 output_raw_ori.root
recompiling libz (11/02/2011)
I took zlib 1.2.5
on slc6 (should not make a difference)
modified Makefile.in adding CFLAGS=-O3 -Wall
./configure; make; make test
mv minigzip64 minigzip64_ori
so far so good
then
cp contrib/amd64/amd64-match.S ./match.S
fix the Makefile
CFLAGS=-O3 ${LOC} -D_LARGEFILE64_SOURCE=1 -DNO_UNDERLINE
#CFLAGS=-O -DMAX_WBITS=14 -DMAX_MEM_LEVEL=7
#CFLAGS=-g -DDEBUG
#CFLAGS=-O3 -Wall -Wwrite-strings -Wpointer-arith -Wconversion \
# -Wstrict-prototypes -Wmissing-prototypes
SFLAGS=-O3 -fPIC -D_LARGEFILE64_SOURCE=1 -DNO_UNDERLINE ${LOC}
LDFLAGS= -L. libz.a
TEST_LDFLAGS=-L. libz.a
LDSHARED=gcc -shared -Wl,-soname,libz.so.1,--version-script,zlib.map
CPP=gcc -E -DNO_UNDERLINE
make clean
make LOC=-DASMV OBJA=match.o
mv minigzip64 minigzip64_sse
ls -l minigzip64_*
-rwxr-xr-x. 1 innocent zh 98012 Feb 11 10:45 minigzip64_ori
-rwxr-xr-x. 1 innocent zh 97914 Feb 11 11:00 minigzip64_sse
and THEN
[vinavx0] /tmp/innocent $ time ~/w1/zlib-1.2.5/minigzip64_ori -1 glibc.tar
3.112u 0.108s 0:03.22 99.6% 0+0k 0+198952io 16pf+0w
[vinavx0] /tmp/innocent $ time ~/w1/zlib-1.2.5/minigzip64_ori -d glibc.tar.gz
0.699u 0.135s 0:00.83 98.7% 0+0k 0+370200io 3pf+0w
[vinavx0] /tmp/innocent $ time ~/w1/zlib-1.2.5/minigzip64_sse -1 glibc.tar
3.033u 0.123s 0:03.15 100.0% 0+0k 0+198952io 15pf+0w
[vinavx0] /tmp/innocent $ time ~/w1/zlib-1.2.5/minigzip64_ori -d glibc.tar.gz
0.694u 0.144s 0:00.84 98.8% 0+0k 0+370200io 0pf+0w
[vinavx0] /tmp/innocent $ time ~/w1/zlib-1.2.5/minigzip64_ori -6 glibc.tar
5.483u 0.131s 0:05.61 100.0% 0+0k 0+190088io 0pf+0w
[vinavx0] /tmp/innocent $ time ~/w1/zlib-1.2.5/minigzip64_ori -d glibc.tar.gz
0.671u 0.118s 0:00.79 98.7% 0+0k 0+370200io 0pf+0w
[vinavx0] /tmp/innocent $ time ~/w1/zlib-1.2.5/minigzip64_sse -6 glibc.tar
5.036u 0.127s 0:05.16 99.8% 0+0k 0+190088io 0pf+0w
[vinavx0] /tmp/innocent $ time ~/w1/zlib-1.2.5/minigzip64_ori -d glibc.tar.gz
0.656u 0.145s 0:00.80 98.7% 0+0k 0+370200io 0pf+0w
[vinavx0] /tmp/innocent $ time ~/w1/zlib-1.2.5/minigzip64_ori -9 glibc.tar
15.621u 0.120s 0:15.74 100.0% 0+0k 0+189512io 0pf+0w
[vinavx0] /tmp/innocent $ time ~/w1/zlib-1.2.5/minigzip64_ori -d glibc.tar.gz
0.652u 0.138s 0:00.79 98.7% 0+0k 0+370200io 0pf+0w
[vinavx0] /tmp/innocent $ time ~/w1/zlib-1.2.5/minigzip64_sse -9 glibc.tar
12.966u 0.126s 0:13.09 99.9% 0+0k 0+189512io 0pf+0w
[vinavx0] /tmp/innocent $ time ~/w1/zlib-1.2.5/minigzip64_ori -d glibc.tar.gz
0.663u 0.124s 0:00.78 100.0% 0+0k 0+370200io 0pf+0w
btw
diff glibc.tar.gz glibc.tar.gz_sse
(no difference)
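conclusion:
the code in
contrib/amd64/amd64-match.S
seems to work and produces identical results to the C version
it is about 20% faster in compression with -9 (15.6 s vs 13.0 s user time) and a bit faster with -6 (5.5 s vs 5.0 s)
irrelevant for fast compression (-1) and for decompression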
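The gain at each compression level follows directly from the user times in the transcript. A small sketch of the arithmetic in C (the helper names are made up):

```c
#include <assert.h>

/* Percentage of the reference time saved by the optimized binary. */
double percent_saved(double t_ref, double t_opt)
{
    assert(t_ref > 0.0);
    return 100.0 * (t_ref - t_opt) / t_ref;
}

/* The same comparison expressed as a speed-up factor. */
double speedup_factor(double t_ref, double t_opt)
{
    assert(t_opt > 0.0);
    return t_ref / t_opt;
}
```

With the user times of minigzip64_ori and minigzip64_sse this gives about 2.5% saved at -1 (3.112 s vs 3.033 s), 8.2% at -6 (5.483 s vs 5.036 s) and 17.0% at -9 (15.621 s vs 12.966 s). As a factor the -9 case is about 1.20, which is the 20% quoted in the conclusion.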
a second look at the Intel IPP library (4/7/2009)
The previous results were somewhat biased by the mixture of vectorization, parallelization using threads, and hyper-threading on the i7.
I therefore repeated the comparison using CPU affinity to limit the number of cores/hyper-threads used.
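These notes do not record how the affinity was set. On Linux it is commonly done with the `taskset` command or, from inside a program, with `sched_setaffinity`. Below is a minimal, Linux-only sketch of the latter; the helper names are made up, and which logical CPU numbers share a physical core depends on the machine.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Fill set with the given logical CPU numbers, e.g. {0, 2, 4, 6}. */
void build_cpu_set(const int *cpus, size_t ncpus, cpu_set_t *set)
{
    CPU_ZERO(set);
    for (size_t i = 0; i < ncpus; i++) {
        assert(cpus[i] >= 0 && cpus[i] < CPU_SETSIZE);
        CPU_SET(cpus[i], set);
    }
}

/* Restrict the calling process (pid 0 means "self") to those CPUs.
 * Returns 0 on success, -1 on error with errno set. */
int pin_self_to_cpus(const int *cpus, size_t ncpus)
{
    cpu_set_t set;
    build_cpu_set(cpus, ncpus, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}
```

The affinity mask is inherited by child processes and kept across exec, so a small launcher can call pin_self_to_cpus and then exec the program to be timed.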
gzip
| Machine | CPU time | real time |
| Xeon L5420 @ 2.50GHz | 56.561u 0.573s | 0:57.16 |
| i7 940 @ 2.93GHz | 43.506u 0.292s | 0:43.90 |
ipp-gzip
| Machine | affinity | CPU time | real time |
| Xeon L5420 @ 2.50GHz | 1 core, 1 cpu | 35.473u 0.422s | 0:35.90 |
| i7 940 @ 2.93GHz | 1 core, 1 cpu | 24.017u 0.192s | 0:24.20 |
| Xeon L5420 @ 2.50GHz | 8 core, 8 cpu | 36.611u 0.832s | 0:07.67 |
| i7 940 @ 2.93GHz | 4 core 8 cpu | 36.698u 0.564s | 0:06.83 |
| Xeon L5420 @ 2.50GHz | 4 core 4 cpu 0,2,4,6 | 36.585u 0.852s | 0:13.73 |
| i7 940 @ 2.93GHz | 2 core 4 cpu 0,2,4,6 | 36.186u 0.512s | 0:11.94 |
| Xeon L5420 @ 2.50GHz | 4 core 4 cpu 0,1,2,3 | 36.381u 0.782s | 0:12.25 |
| i7 940 @ 2.93GHz | 4 core 4 cpu 0,1,2,3 | 24.949u 0.448s | 0:08.89 |
So, from a fairer comparison on 1 or 4 "cpus", there is indeed a speed-up in CPU time on the i7 compared with the L5420 for ipp-gzip (about x1.5: 35.5 s vs 24.0 s on one cpu). This speed-up is similar to the one seen with vanilla gzip, though (x1.3: 56.6 s vs 43.5 s).
Hyper-threading then adds a bit more (only 30% in real time in this case: 0:08.89 on 4 cpus vs 0:06.83 on 8).
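All these numbers come from the tcsh `time` output (user time, system time, elapsed time). A small parser makes it easy to recompute ratios such as the average number of CPUs kept busy. This is a sketch: the names are made up, and only the m:ss.cc form of the elapsed time is handled.

```c
#include <assert.h>
#include <stdio.h>

struct tcsh_time {
    double user_s;   /* user CPU time in seconds */
    double sys_s;    /* system CPU time in seconds */
    double real_s;   /* elapsed time in seconds */
};

/* Parse the start of a tcsh time line such as
 * "34.262u 0.872s 0:08.29 423.7% 0+0k 0+0io 3276pf+0w".
 * Returns 1 on success and 0 if the line does not have that shape. */
int parse_tcsh_time(const char *line, struct tcsh_time *out)
{
    double user, sys, sec;
    int min;

    if (sscanf(line, "%lfu %lfs %d:%lf", &user, &sys, &min, &sec) != 4)
        return 0;
    out->user_s = user;
    out->sys_s = sys;
    out->real_s = 60.0 * min + sec;
    return 1;
}

/* Average number of CPUs kept busy: (user + system) / elapsed. */
double busy_cpus(const struct tcsh_time *t)
{
    assert(t->real_s > 0.0);
    return (t->user_s + t->sys_s) / t->real_s;
}
```

For the rows with eight cpus this gives about 4.9 busy CPUs on the Xeon (37.4 s of CPU in 7.67 s) and about 5.5 on the i7 (37.3 s in 6.83 s), so ipp-gzip does not keep all eight hardware threads busy.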
a first look at the Intel IPP library
Intel has developed an extensive library, the Intel® Integrated Performance Primitives (IPP) library,
to best exploit their CPU architecture.
Documentation can be found at http://software.intel.com/en-us/articles/intel-integrated-performance-primitives-documentation/.
Andrej has installed a recent version in /afs/cern.ch/sw/IntelSoftware/linux/x86_64/ipp/6.0.2.076
Intel also provides code samples:
http://software.intel.com/en-us/articles/intel-integrated-performance-primitives-code-samples/
I was interested in the features of the new sse4.2, in particular the new intrinsics for string manipulation,
so I decided to have a look at gzip.
compiling IPP code samples
All the speedup is supposed to come from the ipp library so we can use gcc to compile our code (as we will do anyhow for HEP applications)
This is what seems to be needed to compile the code samples
setenv IPPROOT /afs/cern.ch/sw/IntelSoftware/linux/x86_64/ipp/6.0.2.076/em64t
setenv GCC_HOME /afs/cern.ch/sw/lcg/contrib/gcc/4.3.2/slc4_amd64_gcc43
setenv LD_LIBRARY_PATH ${LD_LIBRARY_PATH}:/afs/cern.ch/sw/lcg/contrib/mpfr/2.4.1/slc4_amd64_gcc43/lib/:/afs/cern.ch/sw/IntelSoftware/linux/x86_64/ipp/6.0.2.076/em64t/sharedlib
testing ipp_gzip
I downloaded the samples in /afs/cern.ch/user/i/innocent/w1/intelIPP/ipp-samples
and compiled ipp_gzip
As test file I used the tar file of the source tree of CMSSW
| Machine | gzip real time | gzip CPU time | ipp-gzip real time | ipp-gzip CPU time |
| Xeon E5430 @ 2.66GHz | 0:50.96 | 50.475u 0.456s | 0:08.29 | 34.262u 0.872s |
| i7 940 @ 2.93GHz | 0:43.90 | 43.506u 0.292s | 0:08.45 | 33.882u 0.584s |
| Xeon 5160 @ 3.00GHz | 0:45.53 | 44.380u 0.707s | 0:11.65 | 29.946u 0.980s |
Although the speedup exists in both CPU time and real time (the latter thanks to OpenMP),
I do not see any major difference between Clovertown, Harpertown and i7, even though the sse architecture is very different among the three.
These are the transcripts of the results on the three machines, always compared with the "native" gzip.
on a "Intel(R) Xeon(R) CPU E5430 @ 2.66GHz"
I get
time gzip -9 cmssw.tar; ls -l cmssw.tar*
50.475u 0.456s 0:50.96 99.9% 0+0k 0+0io 0pf+0w
-rw-r--r-- 1 innocent zh 113397712 Jun 17 13:24 cmssw.tar.gz
[pcphsft50] /tmp/innocent > time gzip -d cmssw.tar.gz ; ls -l cmssw.tar*
3.300u 0.792s 0:04.08 100.2% 0+0k 0+0io 0pf+0w
-rw-r--r-- 1 innocent zh 515102720 Jun 17 13:24 cmssw.tar
[pcphsft50] /tmp/innocent > time ~/w1/intelIPP/ipp-samples/data-compression/ipp_gzip/bin/linuxem64t_gcc4/ipp_gzip -s -9 cmssw.tar ; ls -l cmssw.tar*
cmssw.tar: deflating -- RL/WM MT -- 35.8 clocks per input symbol
34.262u 0.872s 0:08.29 423.7% 0+0k 0+0io 3276pf+0w
-rw------- 1 innocent zh 114294037 Jun 17 13:27 cmssw.tar.gz
[pcphsft50] /tmp/innocent > time ~/w1/intelIPP/ipp-samples/data-compression/ipp_gzip/bin/linuxem64t_gcc4/ipp_gzip -s -d cmssw.tar.gz ; ls -l cmssw.tar*
Bus error
0.024u 0.012s 0:00.06 50.0% 0+0k 0+0io 11pf+0w
-rw------- 1 innocent zh 515102720 Jun 17 13:27 cmssw.tar
-rw------- 1 innocent zh 114294037 Jun 17 13:27 cmssw.tar.gz
and on
Intel(R) Core(TM) i7 CPU 940 @ 2.93GHz
ls -l
total 503532
-rw-r--r-- 1 innocent zh 515102720 Jun 17 13:24 cmssw.tar
[lxcmsi1] /tmp/innocent $ source ~/scripts/ippset43
[lxcmsi1] /tmp/innocent $ time gzip -9 cmssw.tar; ls -l cmssw.tar*
43.506u 0.292s 0:43.90 99.7% 0+0k 0+0io 0pf+0w
-rw-r--r-- 1 innocent zh 113401244 Jun 17 13:24 cmssw.tar.gz
[lxcmsi1] /tmp/innocent $ time gzip -d cmssw.tar.gz ; ls -l cmssw.tar*
2.704u 0.520s 0:03.22 100.0% 0+0k 0+0io 0pf+0w
-rw-r--r-- 1 innocent zh 515102720 Jun 17 13:24 cmssw.tar
[lxcmsi1] /tmp/innocent $ time ~/w1/intelIPP/ipp-samples/data-compression/ipp_gzip/bin/linuxem64t_gcc4/ipp_gzip -s -9 cmssw.tar ; ls -l cmssw.tar*
cmssw.tar: deflating -- RL/WM MT -- 42.4 clocks per input symbol
33.882u 0.584s 0:08.45 407.8% 0+0k 0+0io 3272pf+0w
-rw------- 1 innocent zh 114297967 Jun 17 13:27 cmssw.tar.gz
[lxcmsi1] /tmp/innocent $ time ~/w1/intelIPP/ipp-samples/data-compression/ipp_gzip/bin/linuxem64t_gcc4/ipp_gzip -s -d cmssw.tar.gz ; ls -l cmssw.tar*
Bus error
0.032u 0.012s 0:00.07 57.1% 0+0k 0+0io 6pf+0w
-rw------- 1 innocent zh 515102720 Jun 17 13:27 cmssw.tar
-rw------- 1 innocent zh 114297967 Jun 17 13:27 cmssw.tar.gz
both give "Bus error" when inflating...
I also tested on
Intel(R) Xeon(R) CPU 5160 @ 3.00GHz
time gzip -9 cmssw.tar ; ls -l cmssw.tar*
44.380u 0.707s 0:45.53 99.0% 0+0k 144+221688io 1pf+0w
-rw-r--r-- 1 innocent zh 113401244 Jun 17 14:17 cmssw.tar.gz
[lxbuild066] /tmp/innocent >
bin/ cmssw.tar.gz common/ slc4_ia32_gcc345/ var/
[lxbuild066] /tmp/innocent > time gzip -d cmssw.tar.gz ; ls -l cmssw.tar*
2.930u 0.929s 0:20.03 19.2% 0+0k 24+1006072io 0pf+0w
-rw-r--r-- 1 innocent zh 515102720 Jun 17 14:17 cmssw.tar
[lxbuild066] /tmp/innocent > time ~/w1/intelIPP/ipp-samples/data-compression/ipp_gzip/bin/linuxem64t_gcc4/ipp_gzip -s -9 cmssw.tar ; ls -l cmssw.tar*
cmssw.tar: deflating -- RL/WM MT -- 60.1 clocks per input symbol
29.946u 0.980s 0:11.65 265.4% 0+0k 88+610872io 3114pf+0w
-rw------- 1 innocent zh 114307153 Jun 17 14:18 cmssw.tar.gz
[lxbuild066] /tmp/innocent > time ~/w1/intelIPP/ipp-samples/data-compression/ipp_gzip/bin/linuxem64t_gcc4/ipp_gzip -s -d cmssw.tar.gz ; ls -l cmssw.tar*
ipp_gzip: cmssw.tar.gz: invalid compressed data--crc error, chunk 0
and it hangs once the file has been produced:
-rw------- 1 innocent zh 515102720 Jun 17 14:29 cmssw.tar
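The "crc error" presumably refers to the CRC-32 check of the gzip format: a gzip member ends with the CRC-32 of the uncompressed data (RFC 1952), and the decompressor recomputes it over what it has inflated. For reference, the checksum itself fits in a few lines of C. This is a slow bit-by-bit sketch with a made-up function name; zlib's crc32() is table-driven but gives the same values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* CRC-32 as used by gzip: reflected polynomial 0xEDB88320, initial value
 * and final XOR both 0xFFFFFFFF. */
uint32_t crc32_bitwise(const unsigned char *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            /* shift out one bit; apply the polynomial if that bit was set */
            uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return crc ^ 0xFFFFFFFFu;
}
```

The standard check value for the nine ASCII bytes "123456789" is 0xCBF43926. Since the file produced here has the same size as the original cmssw.tar, comparing the two byte by byte would tell whether the data or only the check is at fault.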
-- VincenzoInnocente - 11-Mar-2011