VIZip < LCG < TWiki


Compression performance

more tests on slc6 corei7-avx (04/03/2011)

[vinavx0] /tmp/innocent $ time ./gzip -1 glibc.tar
3.521u 0.099s 0:03.62 99.7%   0+0k 0+199336io 0pf+0w
[vinavx0] /tmp/innocent $ gunzip glibc.tar.gz
[vinavx0] /tmp/innocent $ time gzip -1 glibc.tar
3.399u 0.118s 0:03.51 99.7%   0+0k 0+199336io 0pf+0w
[vinavx0] /tmp/innocent $ gunzip glibc.tar.gz
[vinavx0] /tmp/innocent $ time ~/w1/zlib-1.2.5/minigzip64_ori -1 glibc.tar
3.083u 0.126s 0:03.21 99.6%   0+0k 0+198952io 0pf+0w
[vinavx0] /tmp/innocent $ gunzip glibc.tar.gz
[vinavx0] /tmp/innocent $ time ~/w1/zlib-1.2.5/minigzip64_fast -1 glibc.tar
3.042u 0.112s 0:03.15 100.0%   0+0k 0+198952io 16pf+0w
[vinavx0] /tmp/innocent $ gunzip glibc.tar.gz
[vinavx0] /tmp/innocent $ time ~/w1/zlib-1.2.5/minigzip64_fastest glibc.tar
2.092u 0.126s 0:02.22 99.5%   0+0k 0+203880io 16pf+0w


[vinavx0] /tmp/innocent $ time ~/w1/zlib-1.2.5/minigzip64_fast -7 glibc.tar
5.798u 0.099s 0:05.90 99.6%   0+0k 0+189776io 16pf+0w
[vinavx0] /tmp/innocent $ cp glibc.tar.gz glibc.tar.gz_7
[vinavx0] /tmp/innocent $ gunzip glibc.tar.gz
[vinavx0] /tmp/innocent $ time ~/w1/zlib-1.2.5/minigzip64_fast -9 glibc.tar
12.831u 0.116s 0:13.09 98.8%   0+0k 0+189512io 0pf+0w
[vinavx0] /tmp/innocent $ cp glibc.tar.gz glibc.tar.gz_9
[vinavx0] /tmp/innocent $ gunzip glibc.tar.gz
[vinavx0] /tmp/innocent $ time ~/w1/zlib-1.2.5/minigzip64_fastest glibc.tar
2.134u 0.111s 0:02.25 99.5%   0+0k 0+203880io 16pf+0w


-rw-r--r--.  1 innocent zh 189542400 Mar  4 09:55 glibc.tar
-rw-r--r--.  1 innocent zh 104385597 Mar  4 09:55 glibc.tar.gz_1
-rw-r--r--.  1 innocent zh  97163693 Mar  4 09:54 glibc.tar.gz_7
-rw-r--r--.  1 innocent zh  97029520 Mar  4 09:55 glibc.tar.gz_9

recompiling (again) libz (20/02/2011)

if one is interested only in fast compression (-1), recompiling zlib with -DFASTEST gives an additional speedup of about 30% (at least when compressing glibc)

time ~/w1/zlib-1.2.5/minigzip64_fastest -1 glibc.tar
2.112u 0.103s 0:02.21 100.0%   0+0k 0+203880io 12pf+0w
time ~/w1/zlib-1.2.5/minigzip64_ori -1 glibc.tar
3.113u 0.119s 0:03.23 99.6%   0+0k 0+198952io 0pf+0w

in principle it could be implemented as a runtime switch :-)
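A minimal sketch of what such a runtime switch might look like, assuming the matcher is selected through a function pointer when the stream is set up, and assuming the fast flavour looks only at the most recent candidate. All names here (`toy_deflate_state`, `toy_set_fastest`, `match_first_only`, `match_full_chain`) are hypothetical and are not part of zlib, which selects the FASTEST code path at compile time.

```c
#include <stddef.h>

/* Length of the common prefix of a and b, at most max bytes. */
static size_t common_prefix(const unsigned char *a, const unsigned char *b,
                            size_t max)
{
    size_t n = 0;
    while (n < max && a[n] == b[n]) n++;
    return n;
}

/* Hypothetical matcher signature: buf[scan..end) is the string to match,
 * cand[] lists earlier candidate positions, most recent first. */
typedef size_t (*toy_match_fn)(const unsigned char *buf, size_t scan,
                               size_t end, const size_t *cand, size_t ncand);

/* "Fastest" flavour: look only at the most recent candidate. */
static size_t match_first_only(const unsigned char *buf, size_t scan,
                               size_t end, const size_t *cand, size_t ncand)
{
    if (ncand == 0) return 0;
    return common_prefix(buf + scan, buf + cand[0], end - scan);
}

/* "Full" flavour: try every candidate and keep the longest match. */
static size_t match_full_chain(const unsigned char *buf, size_t scan,
                               size_t end, const size_t *cand, size_t ncand)
{
    size_t best = 0;
    for (size_t i = 0; i < ncand; i++) {
        size_t len = common_prefix(buf + scan, buf + cand[i], end - scan);
        if (len > best) best = len;
    }
    return best;
}

/* Hypothetical per-stream state holding the selected matcher. */
typedef struct {
    toy_match_fn match;
} toy_deflate_state;

/* The runtime switch: pick the matcher once, when the stream is set up. */
static void toy_set_fastest(toy_deflate_state *s, int fastest)
{
    s->match = fastest ? match_first_only : match_full_chain;
}
```

The fast flavour may return a shorter match than the full one, which is the usual price of such a switch: less time, slightly larger output.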

recompiling (again) libz (20/02/2011)

summary: zlib compression (and even decompression) gets about 20% faster just by recompiling zlib with proper optimization flags, such as
CFLAGS=-O3 ${LOC} -DUNALIGNED_OK -D_LARGEFILE64_SOURCE=1 -v -ftree-vectorizer-verbose=1 -msse3
Compare https://innocent.web.cern.ch/innocent/perfResults/igprof-navigator/repack_opt_CMSSW_4_2_0_pre5_slc5_amd64_gcc451/self vs https://innocent.web.cern.ch/innocent/perfResults/igprof-navigator/repack_ori_CMSSW_4_2_0_pre5_slc5_amd64_gcc451/self

apparently there is no need for assembler, specialized sse code, etc., as the assembler shows that all the time goes into a couple of assignments and comparisons

/* do {
*     match = s->window + cur_match;
*     if (*(ushf*)(match+best_len-1) != scan_end ||
*         *(ushf*)match != scan_start) continue;
*     [...]
* } while ((cur_match = prev[cur_match & wmask]) > limit
*          && --chain_length != 0);
*
* Here is the inner loop of the function. The function will spend the
* majority of its time in this loop, and majority of that time will
* be spent in the first ten instructions.
*/

the rest is minor (see https://innocent.web.cern.ch/innocent/perfResults/igprof-navigator/repack_asm_CMSSW_4_2_X_2011-02-11-0200_slc5_amd64_gcc451/self). In the end, having igprof show more details for the assembly code is not bad!
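To make the structure of that inner loop concrete, here is a toy, self-contained version of a hash-chain search with the same kind of quick reject. It is a sketch under simplifying assumptions, not zlib's code: it checks single bytes instead of 16-bit pairs, uses an explicit end-of-chain marker, and the names `toy_longest_match` and `TOY_NIL` are hypothetical.

```c
#include <stddef.h>

#define TOY_NIL ((size_t)-1)   /* hypothetical end-of-chain marker */

/* Find the longest earlier match for the string starting at buf[scan].
 * head is the most recent earlier position on the hash chain, prev[p] is
 * the position before p on the same chain (or TOY_NIL).  At most
 * chain_length candidates are tried.  Returns the best length found and
 * stores the position of that match in *match_pos. */
static size_t toy_longest_match(const unsigned char *buf, size_t end,
                                size_t scan, size_t head, const size_t *prev,
                                unsigned chain_length, size_t *match_pos)
{
    const unsigned char *s = buf + scan;
    size_t max_len = end - scan;
    size_t best_len = 0;

    for (size_t cur = head;
         cur != TOY_NIL && chain_length-- != 0;
         cur = prev[cur]) {
        const unsigned char *match = buf + cur;

        if (best_len >= max_len) break;   /* cannot do any better */

        /* Quick reject: a candidate can only beat best_len if it agrees
         * at index best_len and at the first byte.  Most candidates fail
         * here, so this test dominates the run time. */
        if (match[best_len] != s[best_len] || match[0] != s[0]) continue;

        size_t len = 0;
        while (len < max_len && match[len] == s[len]) len++;
        if (len > best_len) {
            best_len = len;
            *match_pos = cur;
        }
    }
    return best_len;
}
```

With a chain length of 1 only the most recent candidate is tried, which is cheaper but can miss a longer match further down the chain.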

for instance I tried to replace the comparison loop of "longest_match" with the specialized string-comparison instruction available in sse4.2 (see http://www.strchr.com/strcmp_and_strlen_using_sse_4.2 for a description)

the code is these few lines (actually the default code compares byte by byte)

#ifdef USE_SSE4_2
      const int ssebits=16;
      const int mode1 = _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH | _SIDD_LEAST_SIGNIFICANT | _SIDD_NEGATIVE_POLARITY;
      int res=0;
      scan+=3, match+=3;
      while (ssebits== (res = _mm_cmpestri( _mm_loadu_si128((__m128i const *)(scan)) , ssebits,  
                                        _mm_loadu_si128((__m128i const *)(match)), ssebits, mode1)) &&
             (scan+=ssebits)<strend) {match+=ssebits;}
      if (scan<strend) scan +=res;
      else scan--; 
#else                        
      /* old c version */
      scan++, match++;
      do {
      } while (*(ushf*)(scan+=2) == *(ushf*)(match+=2) &&
               *(ushf*)(scan+=2) == *(ushf*)(match+=2) &&
               *(ushf*)(scan+=2) == *(ushf*)(match+=2) &&
               *(ushf*)(scan+=2) == *(ushf*)(match+=2) &&
               scan < strend);
       if (*scan == *match) scan++;
#endif
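The same "compare many bytes per step" idea can be tried without any sse4.2 intrinsics. The sketch below is a swapped-in portable technique (8 bytes per iteration through 64-bit loads), not the code measured above; the function names `match_len_bytes` and `match_len_words` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reference version: compare one byte at a time. */
static size_t match_len_bytes(const unsigned char *a, const unsigned char *b,
                              size_t max)
{
    size_t n = 0;
    while (n < max && a[n] == b[n]) n++;
    return n;
}

/* Compare 8 bytes per iteration.  memcpy keeps the unaligned loads well
 * defined; compilers normally turn it into a single load instruction. */
static size_t match_len_words(const unsigned char *a, const unsigned char *b,
                              size_t max)
{
    size_t n = 0;
    while (n + sizeof(uint64_t) <= max) {
        uint64_t wa, wb;
        memcpy(&wa, a + n, sizeof wa);
        memcpy(&wb, b + n, sizeof wb);
        if (wa != wb) break;          /* mismatch is inside these 8 bytes */
        n += sizeof(uint64_t);
    }
    /* Finish the tail, or locate the mismatch, byte by byte.  This avoids
     * any assumption about endianness. */
    while (n < max && a[n] == b[n]) n++;
    return n;
}
```

Both functions return the same length; whether the word version is actually faster depends on how long the typical match is, so it needs to be measured on real data.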

the timing results (https://innocent.web.cern.ch/innocent/perfResults/igprof-navigator/repack_nhl_CMSSW_4_2_0_pre5_slc5_amd64_gcc451/self) are essentially equivalent to those of the C code (once compiled with optimization) or of the sse2 asm code (https://innocent.web.cern.ch/innocent/perfResults/igprof-navigator/repack_asm_CMSSW_4_2_X_2011-02-11-0200_slc5_amd64_gcc451/self)

a small hint to the compiler that the comparison usually fails seems to help (to be confirmed)

 
       if (likely( *(ushf*)(match+best_len-1) != scan_end ||
                    *(ushf*)match != scan_start) ) continue;
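The snippet above does not show where `likely` comes from. A minimal sketch, assuming gcc or clang, where such a macro is commonly built on `__builtin_expect`; the helper `count_candidates` is hypothetical and only illustrates the pattern of marking the "reject" branch as the expected one.

```c
#include <stddef.h>

/* Assumption: gcc or clang, which provide __builtin_expect.
 * Other compilers get a plain pass-through. */
#if defined(__GNUC__)
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define likely(x)   (x)
#define unlikely(x) (x)
#endif

/* Count the positions in buf[0..n) where the two given bytes appear
 * back to back. */
static size_t count_candidates(const unsigned char *buf, size_t n,
                               unsigned char first, unsigned char second)
{
    size_t hits = 0;
    for (size_t i = 0; i + 1 < n; i++) {
        /* Most positions fail this test, so "continue" is the hot path. */
        if (likely(buf[i] != first || buf[i + 1] != second)) continue;
        hits++;
    }
    return hits;
}
```

The hint changes only code layout and branch prediction defaults, never the result, so any gain has to be confirmed by timing.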

libz recompiled in cmssw (11/02/2011)

20% faster

new zlib (including asm version of "match" and "inflate_fast")


11-Feb-2011 13:19:03 CET  Closed file file:/tmp/innocent/output_streamer.dat
53.792u 1.043s 0:55.51 98.7%   0+0k 0+0io 138pf+0w
https://innocent.web.cern.ch/innocent/perfResults/igprof-navigator/repack_CMSSW_4_2_X_2011-02-11-0200_slc5_amd64_gcc451/self

original

11-Feb-2011 13:23:52 CET  Closed file file:/tmp/innocent/output_streamer.dat
65.433u 0.964s 1:06.53 99.7%   0+0k 0+0io 0pf+0w
https://innocent.web.cern.ch/innocent/perfResults/igprof-navigator/repack_ori_CMSSW_4_2_X_2011-02-11-0200_slc5_amd64_gcc451/self

comparison at the symbol level is difficult because the assembler version has more symbols (those with capital letters): igprof shows all internal symbols, such as LookupLoop, because they do not start with .L etc.

btw the two output files have the same size:
-rw-r--r-- 1 innocent zh 177543189 Feb 11 13:33 output_raw.root
-rw-r--r-- 1 innocent zh 177543189 Feb 11 13:25 output_raw_ori.root

recompiling libz (11/02/2011)

I took zlib 1.2.5 on slc6 (should not make a difference)

modified Makefile.in adding CFLAGS=-O3 -Wall
./configure; make; make test
mv minigzip64 minigzip64_ori

so far so good

then

cp contrib/amd64/amd64-match.S ./match.S
fix Makefile
CFLAGS=-O3 ${LOC} -D_LARGEFILE64_SOURCE=1 -DNO_UNDERLINE
#CFLAGS=-O -DMAX_WBITS=14 -DMAX_MEM_LEVEL=7
#CFLAGS=-g -DDEBUG
#CFLAGS=-O3 -Wall -Wwrite-strings -Wpointer-arith -Wconversion \
#           -Wstrict-prototypes -Wmissing-prototypes

SFLAGS=-O3 -fPIC -D_LARGEFILE64_SOURCE=1 -DNO_UNDERLINE ${LOC}
LDFLAGS= -L. libz.a
TEST_LDFLAGS=-L. libz.a
LDSHARED=gcc -shared -Wl,-soname,libz.so.1,--version-script,zlib.map
CPP=gcc -E -DNO_UNDERLINE

make clean
make LOC=-DASMV OBJA=match.o

mv minigzip64 minigzip64_sse
ls -l minigzip64_*
-rwxr-xr-x. 1 innocent zh 98012 Feb 11 10:45 minigzip64_ori
-rwxr-xr-x. 1 innocent zh 97914 Feb 11 11:00 minigzip64_sse

and THEN
[vinavx0] /tmp/innocent $ time ~/w1/zlib-1.2.5/minigzip64_ori -1 glibc.tar
3.112u 0.108s 0:03.22 99.6%   0+0k 0+198952io 16pf+0w
[vinavx0] /tmp/innocent $ time ~/w1/zlib-1.2.5/minigzip64_ori -d glibc.tar.gz
0.699u 0.135s 0:00.83 98.7%   0+0k 0+370200io 3pf+0w
[vinavx0] /tmp/innocent $ time ~/w1/zlib-1.2.5/minigzip64_sse -1 glibc.tar
3.033u 0.123s 0:03.15 100.0%   0+0k 0+198952io 15pf+0w
[vinavx0] /tmp/innocent $ time ~/w1/zlib-1.2.5/minigzip64_ori -d glibc.tar.gz
0.694u 0.144s 0:00.84 98.8%   0+0k 0+370200io 0pf+0w
[vinavx0] /tmp/innocent $ time ~/w1/zlib-1.2.5/minigzip64_ori -6 glibc.tar
5.483u 0.131s 0:05.61 100.0%   0+0k 0+190088io 0pf+0w
[vinavx0] /tmp/innocent $ time ~/w1/zlib-1.2.5/minigzip64_ori -d glibc.tar.gz
0.671u 0.118s 0:00.79 98.7%   0+0k 0+370200io 0pf+0w
[vinavx0] /tmp/innocent $ time ~/w1/zlib-1.2.5/minigzip64_sse -6 glibc.tar
5.036u 0.127s 0:05.16 99.8%   0+0k 0+190088io 0pf+0w
[vinavx0] /tmp/innocent $ time ~/w1/zlib-1.2.5/minigzip64_ori -d glibc.tar.gz
0.656u 0.145s 0:00.80 98.7%   0+0k 0+370200io 0pf+0w
[vinavx0] /tmp/innocent $ time ~/w1/zlib-1.2.5/minigzip64_ori -9 glibc.tar
15.621u 0.120s 0:15.74 100.0%   0+0k 0+189512io 0pf+0w
[vinavx0] /tmp/innocent $ time ~/w1/zlib-1.2.5/minigzip64_ori -d glibc.tar.gz
0.652u 0.138s 0:00.79 98.7%   0+0k 0+370200io 0pf+0w
[vinavx0] /tmp/innocent $ time ~/w1/zlib-1.2.5/minigzip64_sse -9 glibc.tar
12.966u 0.126s 0:13.09 99.9%   0+0k 0+189512io 0pf+0w
[vinavx0] /tmp/innocent $ time ~/w1/zlib-1.2.5/minigzip64_ori -d glibc.tar.gz
0.663u 0.124s 0:00.78 100.0%   0+0k 0+370200io 0pf+0w

btw: diff glibc.tar.gz glibc.tar.gz_sse shows no difference

conclusion: the code in contrib/amd64/amd64-match.S seems to work and produces results identical to the C version

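it is about 20% faster in compression with -9 (15.6 s vs 13.0 s user time); it is irrelevant for fast compression (-1: 3.11 s vs 3.03 s) and for decompression, which does not use the match code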
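The percentages can be checked directly from the user times in the transcript above. A small sketch, with hypothetical helper names `speedup` and `time_saved`, to keep the two usual definitions apart:

```c
/* Ratio of old to new time: 1.20 means the new build is 20% faster. */
static double speedup(double t_old, double t_new)
{
    return t_old / t_new;
}

/* Fraction of the old time that was saved: 0.17 means 17% less time. */
static double time_saved(double t_old, double t_new)
{
    return (t_old - t_new) / t_old;
}
```

With the user times of minigzip64_ori and minigzip64_sse, `speedup` gives about 1.03 at -1 (3.112 vs 3.033), about 1.09 at -6 (5.483 vs 5.036) and about 1.20 at -9 (15.621 vs 12.966). The same -9 numbers correspond to roughly 17% less time, so it matters which of the two definitions is being quoted.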

a second look at the Intel IPP library (4/7/2009)

The previous results were biased a bit by the mixture of vectorization, parallelization using threads, and hyper-threading on the i7. I therefore repeated the comparison using affinity to limit the number of cores/hyper-threads used.

gzip

Machine	CPU time	real time
Xeon L5420 @ 2.50GHz	56.561u 0.573s	0:57.16
i7 940 @ 2.93GHz	43.506u 0.292s	0:43.90

ipp-gzip

Machine	affinity	CPU time	real time
Xeon L5420 @ 2.50GHz	1 core, 1 cpu	35.473u 0.422s	0:35.90
i7 940 @ 2.93GHz	1 core, 1 cpu	24.017u 0.192s	0:24.20
Xeon L5420 @ 2.50GHz	8 core, 8 cpu	36.611u 0.832s	0:07.67
i7 940 @ 2.93GHz	4 core 8 cpu	36.698u 0.564s	0:06.83
Xeon L5420 @ 2.50GHz	4 core 4 cpu 0,2,4,6	36.585u 0.852s	0:13.73
i7 940 @ 2.93GHz	2 core 4 cpu 0,2,4,6	36.186u 0.512s	0:11.94
Xeon L5420 @ 2.50GHz	4 core 4 cpu 0,1,2,3	36.381u 0.782s	0:12.25
i7 940 @ 2.93GHz	4 core 4 cpu 0,1,2,3	24.949u 0.448s	0:08.89

so from a fairer comparison on 1 or 4 "cpus" there is indeed a speed-up in cpu-time on the i7 compared with the L5420 for ipp-gzip (x1.5, e.g. 35.5 s vs 24.0 s on one cpu): this speed-up is similar to the one in vanilla gzip though (x1.3, 56.6 s vs 43.5 s). Hyper-threading then adds another bit (only 30% in real time in this case: 8.89 s on 4 cpus vs 6.83 s on 8)


a first look at the Intel IPP library

Intel has developed an extensive library, the Intel® Integrated Performance Primitives (IPP) library, to best exploit their CPU architecture. Documentation can be found at http://software.intel.com/en-us/articles/intel-integrated-performance-primitives-documentation/.

Andrej has installed a recent version in /afs/cern.ch/sw/IntelSoftware/linux/x86_64/ipp/6.0.2.076

Intel also provides code samples http://software.intel.com/en-us/articles/intel-integrated-performance-primitives-code-samples/ .

I was interested in the features of the new sse4.2, in particular the new intrinsics for string manipulation, so I decided to have a look at gzip.

compiling IPP code samples

All the speedup is supposed to come from the ipp library, so we can use gcc to compile our code (as we will do anyhow for HEP applications). This is what seems to be needed to compile the code samples:

setenv IPPROOT /afs/cern.ch/sw/IntelSoftware/linux/x86_64/ipp/6.0.2.076/em64t
setenv GCC_HOME /afs/cern.ch/sw/lcg/contrib/gcc/4.3.2/slc4_amd64_gcc43
setenv LD_LIBRARY_PATH ${LD_LIBRARY_PATH}:/afs/cern.ch/sw/lcg/contrib/mpfr/2.4.1/slc4_amd64_gcc43/lib/:/afs/cern.ch/sw/IntelSoftware/linux/x86_64/ipp/6.0.2.076/em64t/sharedlib

testing ipp_gzip

I downloaded the samples in /afs/cern.ch/user/i/innocent/w1/intelIPP/ipp-samples and compiled ipp_gzip. As a test file I used the tar file of the CMSSW source tree.

Machine	gzip real time	gzip CPU time	ipp-gzip real time	ipp-gzip CPU time
Xeon E5430 @ 2.66GHz	0:50.96	50.475u 0.456s	0:08.29	34.262u 0.872s
i7 940 @ 2.93GHz	0:43.90	43.506u 0.292s	0:08.45	33.882u 0.584s
Xeon 5160 @ 3.00GHz	0:45.53	44.380u 0.707s	0:11.65	29.946u 0.980s

Although the speedup exists in both cputime and realtime (the latter due to OpenMP), I do not see any major difference between Clovertown, Harpertown and i7, even if the sse architecture is very different among the three.
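To separate the two effects, the csh `time` lines can be turned into numbers: (user + sys) / real is the average number of busy cpus, which is what the percentage field reports. A minimal sketch, assuming the elapsed time is always printed as m:ss.cc (no hours field); `csh_time`, `parse_csh_time` and `busy_cpus` are hypothetical names.

```c
#include <stdio.h>

/* Times in seconds, as reported by the csh "time" builtin. */
typedef struct {
    double user;
    double sys;
    double real;
} csh_time;

/* Parse a line such as "34.262u 0.872s 0:08.29 423.7% ...".
 * Assumes the elapsed field has the form m:ss.cc.
 * Returns 1 on success, 0 if the line does not match. */
static int parse_csh_time(const char *line, csh_time *t)
{
    int minutes;
    double seconds;
    if (sscanf(line, "%lfu %lfs %d:%lf",
               &t->user, &t->sys, &minutes, &seconds) != 4)
        return 0;
    t->real = 60.0 * minutes + seconds;
    return 1;
}

/* Average number of busy cpus: 1.0 means one fully used cpu. */
static double busy_cpus(const csh_time *t)
{
    return (t->user + t->sys) / t->real;
}
```

For the ipp-gzip run on the E5430 (34.262u 0.872s 0:08.29) this gives about 4.2 busy cpus, so most of the real-time gain comes from the threads, while the cpu-time gain over plain gzip (50.5 s vs 34.3 s user) is the part due to the library code itself.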

these are the transcripts of the results on three different machines, always compared with "native" gzip

on a "Intel(R) Xeon(R) CPU           E5430  @ 2.66GHz"
I get
time gzip -9 cmssw.tar; ls -l cmssw.tar*
50.475u 0.456s 0:50.96 99.9%   0+0k 0+0io 0pf+0w
-rw-r--r--  1 innocent zh 113397712 Jun 17 13:24 cmssw.tar.gz
[pcphsft50] /tmp/innocent > time gzip -d cmssw.tar.gz  ; ls -l cmssw.tar*
3.300u 0.792s 0:04.08 100.2%   0+0k 0+0io 0pf+0w
-rw-r--r--  1 innocent zh 515102720 Jun 17 13:24 cmssw.tar
[pcphsft50] /tmp/innocent > time ~/w1/intelIPP/ipp-samples/data-compression/ipp_gzip/bin/linuxem64t_gcc4/ipp_gzip -s  -9 cmssw.tar ; ls -l cmssw.tar*
cmssw.tar:    deflating -- RL/WM MT --   35.8 clocks per input symbol
34.262u 0.872s 0:08.29 423.7%   0+0k 0+0io 3276pf+0w
-rw-------  1 innocent zh 114294037 Jun 17 13:27 cmssw.tar.gz
[pcphsft50] /tmp/innocent > time ~/w1/intelIPP/ipp-samples/data-compression/ipp_gzip/bin/linuxem64t_gcc4/ipp_gzip -s -d cmssw.tar.gz  ; ls -l cmssw.tar*
Bus error
0.024u 0.012s 0:00.06 50.0%   0+0k 0+0io 11pf+0w
-rw-------  1 innocent zh 515102720 Jun 17 13:27 cmssw.tar
-rw-------  1 innocent zh 114294037 Jun 17 13:27 cmssw.tar.gz

and on
Intel(R) Core(TM) i7 CPU         940  @ 2.93GHz
ls -l
total 503532
-rw-r--r-- 1 innocent zh 515102720 Jun 17 13:24 cmssw.tar
[lxcmsi1] /tmp/innocent $ source ~/scripts/ippset43
[lxcmsi1] /tmp/innocent $ time gzip -9 cmssw.tar; ls -l cmssw.tar*
43.506u 0.292s 0:43.90 99.7%   0+0k 0+0io 0pf+0w
-rw-r--r-- 1 innocent zh 113401244 Jun 17 13:24 cmssw.tar.gz
[lxcmsi1] /tmp/innocent $ time gzip -d cmssw.tar.gz  ; ls -l cmssw.tar*
2.704u 0.520s 0:03.22 100.0%   0+0k 0+0io 0pf+0w
-rw-r--r-- 1 innocent zh 515102720 Jun 17 13:24 cmssw.tar
[lxcmsi1] /tmp/innocent $ time ~/w1/intelIPP/ipp-samples/data-compression/ipp_gzip/bin/linuxem64t_gcc4/ipp_gzip -s  -9 cmssw.tar ; ls -l cmssw.tar*
cmssw.tar:    deflating -- RL/WM MT --   42.4 clocks per input symbol
33.882u 0.584s 0:08.45 407.8%   0+0k 0+0io 3272pf+0w
-rw------- 1 innocent zh 114297967 Jun 17 13:27 cmssw.tar.gz
[lxcmsi1] /tmp/innocent $ time ~/w1/intelIPP/ipp-samples/data-compression/ipp_gzip/bin/linuxem64t_gcc4/ipp_gzip -s -d cmssw.tar.gz  ; ls -l cmssw.tar*
Bus error
0.032u 0.012s 0:00.07 57.1%   0+0k 0+0io 6pf+0w
-rw------- 1 innocent zh 515102720 Jun 17 13:27 cmssw.tar
-rw------- 1 innocent zh 114297967 Jun 17 13:27 cmssw.tar.gz

both give "Bus error" when inflating...


I also tested on
Intel(R) Xeon(R) CPU            5160  @ 3.00GHz
time gzip -9 cmssw.tar ; ls -l cmssw.tar*
44.380u 0.707s 0:45.53 99.0%   0+0k 144+221688io 1pf+0w
-rw-r--r--  1 innocent zh 113401244 Jun 17 14:17 cmssw.tar.gz
[lxbuild066] /tmp/innocent >
bin/              cmssw.tar.gz      common/           slc4_ia32_gcc345/ var/
[lxbuild066] /tmp/innocent > time gzip -d cmssw.tar.gz ; ls -l cmssw.tar*
2.930u 0.929s 0:20.03 19.2%   0+0k 24+1006072io 0pf+0w
-rw-r--r--  1 innocent zh 515102720 Jun 17 14:17 cmssw.tar
[lxbuild066] /tmp/innocent > time ~/w1/intelIPP/ipp-samples/data-compression/ipp_gzip/bin/linuxem64t_gcc4/ipp_gzip -s -9 cmssw.tar ; ls -l cmssw.tar*
cmssw.tar:    deflating -- RL/WM MT --   60.1 clocks per input symbol
29.946u 0.980s 0:11.65 265.4%   0+0k 88+610872io 3114pf+0w
-rw-------  1 innocent zh 114307153 Jun 17 14:18 cmssw.tar.gz
[lxbuild066] /tmp/innocent > time ~/w1/intelIPP/ipp-samples/data-compression/ipp_gzip/bin/linuxem64t_gcc4/ipp_gzip -s -d cmssw.tar.gz  ; ls -l cmssw.tar*

ipp_gzip: cmssw.tar.gz: invalid compressed data--crc error, chunk 0

and it hangs once the file has been produced:
-rw-------  1 innocent zh 515102720 Jun 17 14:29 cmssw.tar

-- VincenzoInnocente - 11-Mar-2011

Topic revision: r2 - 2011-03-11 - VincenzoInnocente
