Optimized Software packaging: exploiting link time optimization
Introduction
While benchmarking AVX and gcc 4.6 I tested the "lto" option once again. As the last row of the scimark2 benchmark in a previous note shows, it improved the timing of the "MonteCarlo" benchmark by a factor of 2, so I decided to take a closer look.
A first example: scimark2 Montecarlo
To study "link time optimization" in more detail I extracted the scimark2 MonteCarlo benchmark as a stand-alone program. The program computes Pi by evaluating the area of the quarter circle contained in the unit square.
The program is very simple:
a function to sample the circle (in MonteCarlo.c)
double MonteCarlo_integrate(int Num_samples) {
Random R = new_Random_seed(SEED);
int under_curve = 0;
int count;
for (count=0; count<Num_samples; count++) {
double x= Random_nextDouble(R);
double y= Random_nextDouble(R);
if ( x*x + y*y <= 1.0)
under_curve ++;
}
Random_delete(R);
return ((double) under_curve / Num_samples) * 4.0;
}
and a couple of functions to generate random numbers (declared in Random.h, implemented in Random.c)
#ifdef HIDDEN
#pragma GCC visibility push(internal)
#else
#pragma GCC visibility push(default)
#endif
typedef struct
{
....
}
Random_struct, *Random;
Random new_Random_seed(int seed);
double Random_nextDouble(Random R);
void Random_delete(Random R);
double *RandomVector(int N, Random R);
double **RandomMatrix(int M, int N, Random R);
#pragma GCC visibility pop
Finally, the main program just runs the integrator a fixed number of times (ten here)
#include "MonteCarlo.h"
int main() {
double result = 0.0;
int cycles=100000000;
for (int i=0; i!=10; ++i)
result += MonteCarlo_integrate(cycles);
return int(result);
}
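For readers without the scimark2 sources at hand, the same estimate can be reproduced with a minimal self-contained sketch. It assumes std::mt19937_64 and std::uniform_real_distribution in place of scimark2's Random, so timings will not match those quoted in this note. The function name estimate_pi and the seed value are hypothetical.

```cpp
#include <random>

// Estimate Pi by sampling points in the unit square and counting those
// that fall inside the quarter circle x*x + y*y <= 1 (area Pi/4).
// NOTE: std::mt19937_64 replaces scimark2's Random generator here; this is
// an assumption made only to keep the example self-contained.
double estimate_pi(int num_samples, unsigned seed = 113) {
  std::mt19937_64 gen(seed);
  std::uniform_real_distribution<double> uniform(0.0, 1.0);
  int under_curve = 0;
  for (int count = 0; count < num_samples; ++count) {
    double x = uniform(gen);
    double y = uniform(gen);
    if (x * x + y * y <= 1.0) ++under_curve;
  }
  return (static_cast<double>(under_curve) / num_samples) * 4.0;
}
```

Calling estimate_pi with a million samples should give a value close to 3.14; the statistical error shrinks only with the square root of the number of samples.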
Standard compilation of the whole program is done by
g++ -O3 Random.c MonteCarlo.c onlyMC.cc -o onlyMC_w
On my machine onlyMC_w takes 7.6 seconds.
To enable "whole program optimization at link time" in gcc 4.6 one just has to add -flto:
g++ -O3 Random.c MonteCarlo.c onlyMC.cc -o onlyMC_wo -flto
onlyMC_wo takes 5.5 seconds.
We leave it to the reader to find out where the improvement comes from, using nm, igprof, pto, or even simply gdb.
Let's now see how this performance improvement can be retained in the presence of shared libraries.
Link time optimization works here as well: the linker (ld) collaborates with the compiler through a plugin. This mechanism is switched on with the compiler option -flto.
Clearly, fine-grained packaging (one shared library per source file) cannot be improved this way:
g++ -Wl,-rpath ./ -O3 Random.c -fPIC -pthread -shared -o libRandom.so -flto
g++ -Wl,-rpath ./ -O3 MonteCarlo.c -fPIC -pthread -L./ -lRandom -shared -o libMonteCarlo.so -flto
g++ -Wl,-rpath ./ -O3 -fPIC -pthread -L./ -lMonteCarlo onlyMC.cc -o onlyMC -flto
onlyMC runs in 7.75 seconds on my machine.
Let's then try to trigger the optimization between MonteCarlo.c and Random.c by packaging them in the same library
g++ -Wl,-rpath ./ -O3 Random.c MonteCarlo.c -fPIC -pthread -shared -o libMonteCarlo_w.so -flto
g++ -Wl,-rpath ./ -O3 -fPIC -pthread -L./ -lMonteCarlo_w onlyMC.cc -o onlyMC_1 -flto
onlyMC_1 runs in the usual 7.75 seconds.
This is due to the ELF ABI, which requires that any exported symbol can be overridden by a definition in either the main program or any preloaded shared library; calls to such symbols therefore cannot be inlined inside the library.
To test this, just build a library containing only Random.c with a few printouts activated, and preload it
g++ -Wl,-rpath ./ -O3 Random.c -fPIC -pthread -shared -o libRandomHi.so -flto -DHI
setenv LD_PRELOAD ./libRandomHi.so
The printouts will appear when running onlyMC_1; they will not appear when running onlyMC_wo.
To allow gcc to optimize the library we must therefore declare the symbols as "private to the dynamic shared object" in question
g++ -Wl,-rpath ./ -O3 Random.c MonteCarlo.c -fPIC -pthread -shared -o libMonteCarlo_w.so -flto -DHIDDEN
g++ -Wl,-rpath ./ -O3 -fPIC -pthread -L./ -lMonteCarlo_w onlyMC.cc -o onlyMC_1 -flto -DHIDDEN
This time onlyMC_1 runs in 5.5 seconds, and setting
setenv LD_PRELOAD ./libRandomHi.so
has no effect.
Using nm one can easily check that libMonteCarlo_w.so no longer exports the symbols in Random.
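The HIDDEN switch in Random.h relies on a visibility pragma; a similar effect can be obtained per declaration with the GCC visibility attribute. The following is a minimal sketch with hypothetical names (next_value, sum_of_values) of an exported function calling a hidden helper. Note that it uses "hidden" visibility, whereas Random.h pushes the stricter "internal" one; that substitution is an assumption of this example.

```cpp
// Hidden helper: not exported from the shared object, so it cannot be
// replaced through LD_PRELOAD and calls to it may be inlined.
// (Assumption: "hidden" is used here; Random.h uses "internal".)
__attribute__((visibility("hidden")))
double next_value(double previous) {
  return previous * 0.5 + 1.0;
}

// Exported entry point: the only symbol meant to be seen by clients.
__attribute__((visibility("default")))
double sum_of_values(int n) {
  double value = 0.0;
  double sum = 0.0;
  for (int i = 0; i < n; ++i) {
    value = next_value(value);
    sum += value;
  }
  return sum;
}
```

When this is built with -fPIC -shared, nm -D -C on the resulting library should list sum_of_values but not next_value.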
This simple example shows the basic constructs to get optimized packages in shared libraries.
A comment that can easily be made is that Random is a typical generic utility that can be used in many algorithms and should therefore be packaged in a more basic library than the algorithms themselves. This is of course very true. In this case the only rule of thumb to follow to avoid the performance penalty of a function call is to inline any algorithm that uses fewer cycles than the function call itself.
We leave to the reader the exercise of transforming Random into a C++ class and inlining the critical method, to show how this recovers the performance.
It should be noted that inlining should not be interpreted restrictively as "substitute in place", as in the case of a C macro. It is rather a way to give the compiler, as with link time optimization, access to more information about the function itself, which allows it to better follow dependencies, propagate constants, etc.
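One possible shape of this exercise is sketched below. The generator is a simple 64-bit linear congruential one chosen for brevity, not the algorithm implemented in scimark2's Random.c, and the names InlineRandom and integrate are hypothetical. The only point is that nextDouble is defined in the class body (hence implicitly inline), so its code is visible to the compiler at every call site even if the class lives in the header of a more basic library.

```cpp
#include <cstdint>

class InlineRandom {
public:
  explicit InlineRandom(std::uint64_t seed) : state_(seed) {}

  // Defined in-class, therefore implicitly inline.
  double nextDouble() {
    // Knuth's MMIX LCG constants (assumption: any fast generator would do).
    state_ = state_ * 6364136223846793005ULL + 1442695040888963407ULL;
    // Use the top 53 bits to build a double in [0, 1).
    return (state_ >> 11) * (1.0 / 9007199254740992.0);
  }

private:
  std::uint64_t state_;
};

// Same integration loop as MonteCarlo_integrate, using the inline class.
double integrate(int num_samples) {
  InlineRandom r(113);
  int under_curve = 0;
  for (int count = 0; count < num_samples; ++count) {
    double x = r.nextDouble();
    double y = r.nextDouble();
    if (x * x + y * y <= 1.0) ++under_curve;
  }
  return (static_cast<double>(under_curve) / num_samples) * 4.0;
}
```

With such a header-only class the algorithm library no longer needs link time optimization across library boundaries to avoid the call overhead.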
All the behaviors discussed above are summarized by the following script (csh syntax)
g++ -Wl,-rpath ./ -O3 Random.c -fPIC -pthread -shared -o libRandom.so -flto
g++ -Wl,-rpath ./ -O3 Random.c -fPIC -pthread -shared -o libRandomHi.so -flto -DHI
g++ -Wl,-rpath ./ -O3 MonteCarlo.c -fPIC -pthread -L./ -lRandom -shared -o libMonteCarlo.so -flto
g++ -Wl,-rpath ./ -O3 -fPIC -pthread -L./ -lMonteCarlo onlyMC.cc -o onlyMC -flto
g++ -Wl,-rpath ./ -O3 Random.c MonteCarlo.c onlyMC.cc -o onlyMC_w
g++ -Wl,-rpath ./ -O3 Random.c MonteCarlo.c onlyMC.cc -o onlyMC_wo -flto
g++ -Wl,-rpath ./ -O3 -fPIC -pthread Random.c MonteCarlo.c onlyMC.cc -o onlyMC_wor -flto
g++ -Wl,-rpath ./ -O3 -fPIC -pthread Random.c MonteCarlo.c onlyMC.cc -o onlyMC_wop -flto
g++ -Wl,-rpath ./ -O3 Random.c MonteCarlo.c -fPIC -pthread -shared -o libMonteCarlo_w.so -flto
g++ -Wl,-rpath ./ -O3 -fPIC -pthread -L./ -lMonteCarlo_w onlyMC.cc -o onlyMC_1 -flto
time ./onlyMC
time ./onlyMC_w
time ./onlyMC_wo
time ./onlyMC_wor
time ./onlyMC_wop
time ./onlyMC_1
setenv LD_PRELOAD ./libRandomHi.so
time ./onlyMC
time ./onlyMC_w
time ./onlyMC_wo
time ./onlyMC_wor
time ./onlyMC_wop
time ./onlyMC_1
unsetenv LD_PRELOAD
echo "with hidden visibility"
g++ -Wl,-rpath ./ -O3 Random.c MonteCarlo.c -fPIC -pthread -shared -o libMonteCarlo.so -DHIDDEN
g++ -Wl,-rpath ./ -O3 -fPIC -pthread -L./ -lMonteCarlo onlyMC.cc -o onlyMC -DHIDDEN
g++ -Wl,-rpath ./ -O3 Random.c MonteCarlo.c onlyMC.cc -o onlyMC_w -DHIDDEN
g++ -Wl,-rpath ./ -O3 Random.c MonteCarlo.c onlyMC.cc -o onlyMC_wo -flto -DHIDDEN
g++ -Wl,-rpath ./ -O3 -fPIC -pthread Random.c MonteCarlo.c onlyMC.cc -o onlyMC_wor -flto -DHIDDEN
g++ -Wl,-rpath ./ -O3 -fPIC -pthread Random.c MonteCarlo.c onlyMC.cc -o onlyMC_wop -flto -DHIDDEN
g++ -Wl,-rpath ./ -O3 Random.c MonteCarlo.c -fPIC -pthread -shared -o libMonteCarlo_w.so -flto -DHIDDEN
g++ -Wl,-rpath ./ -O3 -fPIC -pthread -L./ -lMonteCarlo_w onlyMC.cc -o onlyMC_1 -flto -DHIDDEN
time ./onlyMC
time ./onlyMC_w
time ./onlyMC_wo
time ./onlyMC_wor
time ./onlyMC_wop
time ./onlyMC_1
setenv LD_PRELOAD ./libRandomHi.so
time ./onlyMC
time ./onlyMC_w
time ./onlyMC_wo
time ./onlyMC_wor
time ./onlyMC_wop
time ./onlyMC_1
unsetenv LD_PRELOAD
The libRandomHi.so used as preload differs from libRandom.so only by a couple of print statements.
The perfect plugin
A clean plugin architecture requires at most a single exported symbol for each plugin. Some implementations rely on static constructors to avoid even that.
Let's see how a "perfect plugin" architecture naturally exploits symbol invisibility and link time optimization.
The starting point is an Abstract Base Interface Class and a Base Factory such as
class Base {
public:
Base();
virtual ~Base();
virtual int i1() const=0;
virtual void hi() const=0;
};
template<typename B>
class Factory {
public:
typedef std::shared_ptr<B> pointer;
virtual pointer operator()()=0;
};
The simplest concrete plugin can be realized within a single file with the whole implementation hidden inside an anonymous namespace and a single exported entry point.
namespace {
class A : public Base {
public:
A(int ia) : a(ia) {}
int i1() const { return a;}
void hi() const {
std::cout << "A" << std::endl;
std::cout << "A " << typeid(*this).name() << std::endl;
std::cout << "A " << &typeid(*this) << std::endl;
}
int a;
};
struct FactoryA : public Factory<Base> {
pointer operator()() {
return pointer(new A(4));
}
};
}
extern "C" Factory<Base>* factoryA() dso_export;
extern "C" Factory<Base> * factoryA() {
static FactoryA local;
return &local;
}
Of course more complex plugins will require the collaboration of several classes (with their declarations in header files and implementations in .cc files), still all packaged in a single dynamic shared object.
(Note: anonymous namespaces should not be used in header files!)
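The plugin listing uses dso_export without showing its definition. A plausible definition, given here as an assumption rather than as the macro actually used for this note, maps it to GCC's default-visibility attribute, so that the entry point stays exported even when everything else is compiled with hidden visibility. The entry point factoryDemo and the helper hidden_answer are hypothetical.

```cpp
// Assumed definition of dso_export (not shown in the original listing).
#if defined(__GNUC__)
#define dso_export __attribute__((visibility("default")))
#else
#define dso_export
#endif

namespace {
// Internal linkage: never exported, whatever the visibility flags are.
int hidden_answer() { return 4; }
}

// The single exported symbol; extern "C" keeps the name unmangled so that
// it can be looked up by name with dlsym.
extern "C" int factoryDemo() dso_export;
extern "C" int factoryDemo() { return hidden_answer(); }
```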
A minimal main program will look like this
#include "Base.h"
#include <dlfcn.h>
#include <iostream>
#include <string>
#include <typeinfo>
// type of the single entry point exported by each plugin
extern "C" typedef Factory<Base> * factoryP();
// open plug<c>.so and look up its factory<c> entry point (no error handling)
factoryP * load(std::string const & c) {
std::string shlib("plug"+c +".so");
std::string fname("factory"+c);
void * dl = dlopen(shlib.c_str(),RTLD_LAZY);
return reinterpret_cast<factoryP*>(dlsym(dl,fname.c_str()));
}
// call the entry point, then the factory it returns, to build the object
Factory<Base>::pointer get(std::string const & c) {
std::cout << "Get " << c << std::endl;
auto factory = load(c);
return (*(*factory)())();
}
int main() {
auto a = get("A");
auto d = get("D");
(*a).hi();
std::cout << (*a).i1() << std::endl;
std::cout << typeid(*a).name() << " " << &typeid(*a) << std::endl;
(*d).hi();
std::cout << (*d).i1() << std::endl;
std::cout << typeid(*d).name() << " " << &typeid(*d) << std::endl;
return 0;
}
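The expression (*(*factory)())() packs three steps into one line: call the entry point to obtain a Factory<Base>*, dereference it, and invoke its operator() to obtain the object. The sketch below spells these steps out in a single translation unit, without dlopen, so that it can be tried in isolation. The names Demo, FactoryDemo, factoryDemo and make are hypothetical, and Base is reduced to a single method for brevity.

```cpp
#include <memory>

struct Base {
  virtual ~Base() {}
  virtual int i1() const = 0;
};

template <typename B>
class Factory {
public:
  typedef std::shared_ptr<B> pointer;
  virtual ~Factory() {}
  virtual pointer operator()() = 0;
};

namespace {
struct Demo : public Base {
  int i1() const { return 4; }
};
struct FactoryDemo : public Factory<Base> {
  pointer operator()() { return pointer(new Demo()); }
};
}

// Stand-in for the entry point a plugin would export.
extern "C" Factory<Base>* factoryDemo() {
  static FactoryDemo local;
  return &local;
}

// Same function type as in the main program.
extern "C" typedef Factory<Base>* factoryP();

Factory<Base>::pointer make(factoryP* entry) {
  Factory<Base>* factory = (*entry)();  // step 1: call the entry point
  return (*factory)();                  // steps 2 and 3: dereference, invoke
}
```

Here make(&factoryDemo) plays the role of get("A") in the main program, with the dlopen/dlsym lookup replaced by a direct function pointer.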
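Following the previous example, one way to build this small application is the following (VFLAGS is set with export in sh/bash or with setenv in csh/tcsh; only one of the two lines is needed)
export VFLAGS="-fvisibility-inlines-hidden -flto -g"
setenv VFLAGS "-fvisibility-inlines-hidden -flto -g"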
g++ -Ofast -std=gnu++0x Base.cc -fPIC -pthread -shared -o libBase.so $VFLAGS -Wl,-rpath ./
g++ -Ofast -std=gnu++0x plugA.cpp -fPIC -pthread -shared -o plugA.so $VFLAGS -L./ -lBase -Wl,-rpath ./
g++ -Ofast -std=gnu++0x Derived.cc helloD.cc plugD.cpp -fPIC -pthread -shared -o plugD.so $VFLAGS -L./ -lBase -Wl,-rpath ./
g++ -Ofast -std=gnu++0x exFactory.cpp -fPIC -pthread -o exFactory $VFLAGS -L./ -lBase -ldl -Wl,-rpath ./
In this way nm will confirm that plugA.so exports only factoryA, while plugD.so also exports all the symbols defined in Derived.cc and helloD.cc.
One solution is to declare the visibility of those symbols and types as hidden using the usual __attribute__ annotation.
A simpler solution is to declare the factory external (i.e. with default visibility) and hide everything else using a command line option, as in
export VFLAGS="-fvisibility-inlines-hidden -flto"
setenv VFLAGS "-fvisibility-inlines-hidden -flto"
export PFLAGS="-fvisibility=hidden -fvisibility-inlines-hidden -flto"
setenv PFLAGS "-fvisibility=hidden -fvisibility-inlines-hidden -flto"
g++ -Ofast -std=gnu++0x Base.cc -fPIC -pthread -shared -o libBase.so $VFLAGS -Wl,-rpath ./
g++ -Ofast -std=gnu++0x plugA.cpp -fPIC -pthread -shared -o plugA.so $PFLAGS -L./ -lBase -Wl,-rpath ./
g++ -Ofast -std=gnu++0x Derived.cc helloD.cc plugD.cpp -fPIC -pthread -shared -o plugD.so $PFLAGS -L./ -lBase -Wl,-rpath ./
g++ -Ofast -std=gnu++0x exFactory.cpp -fPIC -pthread -o exFactory $VFLAGS -L./ -lBase -ldl -Wl,-rpath ./
From this example we can conclude that in a plugin architecture the simplest and most effective way to hide symbols and get access to the full power of link time optimization is to collect all the code belonging to each plugin in a single dynamic shared object (the plugin!) and compile it with the -fvisibility=hidden option.
Detailed annotation of individual types and functions will still be needed in a more classical library design where interfaces and implementation coexist.
In these cases C++ features such as anonymous namespaces and inlining can be used to hide large parts of the implementation without using any gcc-specific extension.
--
VincenzoInnocente - 29-Apr-2011