
Optimized Software packaging: exploiting link time optimization

Introduction

While benchmarking AVX and gcc 4.6 I tested the "lto" option once again. As you can see in the last row of the scimark2 benchmark in a previous note, it improved the timing of the "MonteCarlo" benchmark by a factor of 2. I then decided to take a closer look.

References

an introduction to "Symbol Visibility" in gcc
description of Link Time Optimization in gcc (a bit out-of-date)

A first example: scimark2 Montecarlo

To better study "link time optimization" I extracted the scimark2 MonteCarlo benchmark as a stand-alone program. The program computes Pi by evaluating the area of a quarter circle inscribed in the unit square. The program is very simple:

a function to sample the circle (in MonteCarlo.c)

double MonteCarlo_integrate(int Num_samples) {
  
  
  Random R = new_Random_seed(SEED);
  
  int under_curve = 0;
  int count;
  
  for (count=0; count<Num_samples; count++) {
    double x= Random_nextDouble(R);
    double y= Random_nextDouble(R);
    
    if ( x*x + y*y <= 1.0)
      under_curve ++;
  }
  
  Random_delete(R);
  
  return ((double) under_curve / Num_samples) * 4.0;
}

and a couple of functions to generate random numbers (defined in Random.h, implemented in Random.c)

#ifdef HIDDEN
#pragma GCC visibility push(internal)
#else
#pragma GCC visibility push(default)
#endif

typedef struct
{
 ....
}
Random_struct, *Random;

Random new_Random_seed(int seed);
double Random_nextDouble(Random R);
void Random_delete(Random R);
double *RandomVector(int N, Random R);
double **RandomMatrix(int M, int N, Random R);

#pragma GCC visibility pop

the main program just runs the integrator a certain number of times

#include "MonteCarlo.h"

int main() {
  double result = 0.0;
  
  int cycles=100000000;
  for (int i=0; i!=10; ++i)
    result += MonteCarlo_integrate(cycles);
  
  return int(result);
}

Standard compilation of the whole program is done with

g++ -O3 Random.c MonteCarlo.c onlyMC.cc -o onlyMC_w

On my machine onlyMC_w takes 7.6 seconds. To enable "whole program optimization at link time" in gcc 4.6 one just has to add -flto:

g++ -O3 Random.c MonteCarlo.c onlyMC.cc -o onlyMC_wo -flto

onlyMC_wo takes 5.5 seconds. We leave it to the reader to show where the improvement comes from, using nm, igprof, pto, or even simply gdb.

Let's now see how we can still obtain this performance improvement in the presence of shared libraries. This can be achieved by exploiting link time optimization, in which the linker (ld) collaborates with the compiler through a plugin. This mechanism is switched on with the compiler option -flto.

It is rather obvious that high-granularity packaging, with one shared library per source file, cannot be improved:

g++ -Wl,-rpath ./ -O3 Random.c -fPIC -pthread -shared -o libRandom.so -flto
g++ -Wl,-rpath ./ -O3 MonteCarlo.c -fPIC -pthread  -L./ -lRandom -shared -o libMonteCarlo.so -flto
g++ -Wl,-rpath ./ -O3 -fPIC -pthread  -L./ -lMonteCarlo onlyMC.cc -o onlyMC -flto

onlyMC runs in 7.75 seconds on my machine.

Let's then try to trigger the optimization between MonteCarlo.c and Random.c by packaging them in the same library:

g++ -Wl,-rpath ./ -O3 Random.c MonteCarlo.c -fPIC -pthread -shared -o libMonteCarlo_w.so -flto
g++ -Wl,-rpath ./ -O3 -fPIC -pthread  -L./ -lMonteCarlo_w onlyMC.cc -o onlyMC_1 -flto

onlyMC_1 runs in the usual 7.75 seconds. This is due to the ELF ABI, which requires that any exported symbol can be overridden by a definition in either the main program or any preloaded shared library. To test this, just create a library containing only Random.c, with a few printouts activated, and preload it:

g++ -Wl,-rpath ./ -O3 Random.c -fPIC -pthread -shared -o libRandomHi.so -flto -DHI
setenv LD_PRELOAD ./libRandomHi.so

The printouts will appear when running onlyMC_1; they will not appear when running onlyMC_wo.

To allow gcc to optimize the library we must therefore declare the symbols as "private to the dynamic shared object" in question:

g++ -Wl,-rpath ./ -O3 Random.c MonteCarlo.c -fPIC -pthread -shared -o libMonteCarlo_w.so -flto  -DHIDDEN
g++ -Wl,-rpath ./ -O3 -fPIC -pthread  -L./ -lMonteCarlo_w onlyMC.cc -o onlyMC_1 -flto  -DHIDDEN

This time onlyMC_1 runs in 5.5 seconds, and setenv LD_PRELOAD ./libRandomHi.so has no effect. Using nm one can easily check that libMonteCarlo_w.so no longer exports the symbols in Random.

This simple example shows the basic constructs needed to get optimized packages in shared libraries. An obvious objection is that Random is a typical generic utility that can be used in many algorithms and should therefore be packaged in a more basic library than the algorithms themselves. This is of course very true. In this case the only rule of thumb for avoiding the performance penalty of a function call is to inline any algorithm that uses fewer cycles than the function call itself. We leave to the reader the exercise of transforming Random into a C++ class and inlining the critical method to show how this recovers the performance. It should be noted that inlining should not be interpreted restrictively as in-place substitution, as with a C macro. It is rather a way to give the compiler, as with link time optimization, access to more information about the function itself, which allows it to better follow dependencies, propagate constants, etc.
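
As a starting point for the exercise above, a minimal sketch of Random rewritten as a C++ class, with the critical method defined in the class body (and therefore implicitly inline), might look like the following. The generator is a plain 64-bit linear congruential one chosen only for illustration: it is not the scimark2 algorithm, and the names RandomGen and integrate, as well as the default seed, are hypothetical.

```cpp
#include <cstdint>

// Hypothetical stand-in for scimark2's Random: a 64-bit linear congruential
// generator (Knuth's MMIX constants). This is NOT the scimark2 algorithm.
class RandomGen {
public:
  explicit RandomGen(std::uint64_t seed) : state_(seed) {}

  // Defined in the class body, hence implicitly inline: the compiler sees
  // the body at every call site, with or without -flto.
  double nextDouble() {
    state_ = state_ * 6364136223846793005ULL + 1442695040888963407ULL;
    // Keep the top 53 bits and scale them to the interval [0,1).
    return (state_ >> 11) * (1.0 / 9007199254740992.0);
  }

private:
  std::uint64_t state_;
};

// Same structure as MonteCarlo_integrate, using the inlined generator.
// The default seed is arbitrary.
inline double integrate(int numSamples, std::uint64_t seed = 12345) {
  RandomGen r(seed);
  int underCurve = 0;
  for (int count = 0; count < numSamples; ++count) {
    double x = r.nextDouble();
    double y = r.nextDouble();
    if (x * x + y * y <= 1.0) ++underCurve;
  }
  return (static_cast<double>(underCurve) / numSamples) * 4.0;
}
```

Because the header carries the body of nextDouble, the class can live in a basic library and still be optimized together with each algorithm that uses it, at the price of recompiling the clients whenever the generator changes.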

All the previous possible behaviors are summarized by the following script

g++ -Wl,-rpath ./ -O3 Random.c -fPIC -pthread -shared -o libRandom.so -flto 
g++ -Wl,-rpath ./ -O3 Random.c -fPIC -pthread -shared -o libRandomHi.so -flto  -DHI
g++ -Wl,-rpath ./ -O3 MonteCarlo.c -fPIC -pthread  -L./ -lRandom -shared -o libMonteCarlo.so -flto 
g++ -Wl,-rpath ./ -O3 -fPIC -pthread  -L./ -lMonteCarlo onlyMC.cc -o onlyMC -flto 
g++ -Wl,-rpath ./ -O3 Random.c MonteCarlo.c  onlyMC.cc -o onlyMC_w
g++ -Wl,-rpath ./ -O3 Random.c MonteCarlo.c  onlyMC.cc -o onlyMC_wo -flto
g++ -Wl,-rpath ./ -O3 -fPIC -pthread Random.c MonteCarlo.c  onlyMC.cc -o onlyMC_wor -flto
g++ -Wl,-rpath ./ -O3 -fPIC -pthread Random.c MonteCarlo.c  onlyMC.cc -o onlyMC_wop -flto  
g++ -Wl,-rpath ./ -O3 Random.c MonteCarlo.c -fPIC -pthread -shared -o libMonteCarlo_w.so -flto
g++ -Wl,-rpath ./ -O3 -fPIC -pthread  -L./ -lMonteCarlo_w onlyMC.cc -o onlyMC_1 -flto 


time ./onlyMC
time ./onlyMC_w
time ./onlyMC_wo
time ./onlyMC_wor
time ./onlyMC_wop
time ./onlyMC_1
setenv LD_PRELOAD ./libRandomHi.so
time ./onlyMC
time ./onlyMC_w
time ./onlyMC_wo
time ./onlyMC_wor
time ./onlyMC_wop
time ./onlyMC_1
unsetenv LD_PRELOAD

echo "with hidden visibility"

g++ -Wl,-rpath ./ -O3 Random.c MonteCarlo.c -fPIC -pthread  -shared -o libMonteCarlo.so -DHIDDEN
g++ -Wl,-rpath ./ -O3 -fPIC -pthread  -L./ -lMonteCarlo onlyMC.cc -o onlyMC -DHIDDEN
g++ -Wl,-rpath ./ -O3 Random.c MonteCarlo.c  onlyMC.cc -o onlyMC_w -DHIDDEN
g++ -Wl,-rpath ./ -O3 Random.c MonteCarlo.c  onlyMC.cc -o onlyMC_wo -flto -DHIDDEN
g++ -Wl,-rpath ./ -O3 -fPIC -pthread Random.c MonteCarlo.c  onlyMC.cc -o onlyMC_wor -flto -DHIDDEN
g++ -Wl,-rpath ./ -O3 -fPIC -pthread Random.c MonteCarlo.c  onlyMC.cc -o onlyMC_wop -flto   -DHIDDEN
g++ -Wl,-rpath ./ -O3 Random.c MonteCarlo.c -fPIC -pthread -shared -o libMonteCarlo_w.so -flto  -DHIDDEN
g++ -Wl,-rpath ./ -O3 -fPIC -pthread  -L./ -lMonteCarlo_w onlyMC.cc -o onlyMC_1 -flto -DHIDDEN


time ./onlyMC
time ./onlyMC_w
time ./onlyMC_wo
time ./onlyMC_wor
time ./onlyMC_wop
time ./onlyMC_1
setenv LD_PRELOAD ./libRandomHi.so
time ./onlyMC
time ./onlyMC_w
time ./onlyMC_wo
time ./onlyMC_wor
time ./onlyMC_wop
time ./onlyMC_1
unsetenv LD_PRELOAD

The libRandomHi.so used as preload differs from libRandom.so only by a couple of print statements.

The perfect plugin

A clean plugin architecture requires at most a single exported symbol for each plugin. Some implementations rely on static constructors to avoid even that. Let's see how a "perfect plugin" architecture naturally exploits hidden symbol visibility and link time optimization.

The starting point is an Abstract Base Interface Class and a Base Factory such as

#include <memory>  // std::shared_ptr, used by Factory below

class Base {
public:
  Base();

  virtual ~Base();

  virtual int i1() const=0;

  virtual void hi() const=0;
};

template<typename B>
class Factory {
public:
  typedef std::shared_ptr<B> pointer;

  virtual pointer operator()()=0;
};

The simplest concrete plugin can be realized within a single file with the whole implementation hidden inside an anonymous namespace and a single exported entry point.

namespace {

  class A : public Base {
  public:
    A(int ia) : a(ia)  {}
    
    int i1() const { return a;}

   
    void hi() const {
      std::cout << "A" << std::endl;
      std::cout << "A " << typeid(*this).name() << std::endl;
      std::cout << "A " << &typeid(*this) << std::endl;
    }
 
    int a;

  };

  struct FactoryA : public Factory<Base> {
    pointer operator()() {
      return pointer(new A(4));
    } 
    
  };

}

extern "C" Factory<Base>* factoryA() dso_export;


extern "C" Factory<Base> * factoryA()  {
  static FactoryA local;
  return &local;
}

Of course, more complex plugins will require the collaboration of several classes (with their declarations in header files and their implementations in .cc files), still all packaged in a single dynamic shared object. (Note: anonymous namespaces should not be used in header files!)
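
The dso_export macro used in the declaration of factoryA is not defined in the listing above. A plausible definition, assuming it simply wraps gcc's visibility attribute, is sketched below. The macro body is an assumption and is not taken from the original sources; answer and entryPoint are hypothetical names used only to keep the example short.

```cpp
// Assumed definition (not shown in the original): mark a symbol as exported
// from the shared object even when compiling with -fvisibility=hidden.
#if defined(__GNUC__)
#define dso_export __attribute__((visibility("default")))
#else
#define dso_export
#endif

namespace {
  // Internal linkage: never visible outside this translation unit.
  int answer() { return 4; }
}

// Single exported entry point, declared in the same style as factoryA.
extern "C" int entryPoint() dso_export;

extern "C" int entryPoint() { return answer(); }
```

With such a definition the attribute is attached to the declaration only, which is why the listing declares factoryA once with dso_export and then defines it without.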

A minimal main program will look like this

#include "Base.h"

extern "C" typedef Factory<Base> * factoryP();

#include <dlfcn.h>
#include <iostream>  // std::cout is used in get() below
#include <string>

factoryP * load(std::string const & c) {
  std::string shlib("plug"+c +".so");
  std::string fname("factory"+c);
  void * dl = dlopen(shlib.c_str(),RTLD_LAZY);
  return reinterpret_cast<factoryP*>(dlsym(dl,fname.c_str()));
}

Factory<Base>::pointer get(std::string const & c) {

  std::cout << "Get " << c << std::endl;
  auto factory = load(c);

  return (*(*factory)())();
}

#include<iostream>
#include<typeinfo>
int main() {
  auto a =  get("A");
  auto d = get("D");

  (*a).hi();
  std::cout << (*a).i1() << std::endl;
  std::cout << typeid(*a).name() << " " << &typeid(*a) << std::endl;

 (*d).hi();
 std::cout << (*d).i1() << std::endl;
 std::cout << typeid(*d).name() << " " << &typeid(*d) << std::endl;

 return 0;

}
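
The load function above does not check the results of dlopen and dlsym, so a missing plugin shows up only as a crash when the factory is called. A more defensive variant could be sketched as follows; loadSymbol is a hypothetical helper, not part of the original program, and throwing std::runtime_error is just one possible way to report the failure.

```cpp
#include <dlfcn.h>
#include <stdexcept>
#include <string>

// Hypothetical helper: like load() above, but reports failures instead of
// returning a pointer that crashes when it is called.
inline void* loadSymbol(std::string const& library, std::string const& symbol) {
  void* handle = dlopen(library.c_str(), RTLD_LAZY);
  if (!handle) {
    const char* msg = dlerror();
    throw std::runtime_error("dlopen failed: " +
                             std::string(msg ? msg : "unknown error"));
  }
  dlerror();  // clear any stale error before calling dlsym
  void* address = dlsym(handle, symbol.c_str());
  if (const char* msg = dlerror()) {
    std::string what("dlsym failed: ");
    what += msg;
    dlclose(handle);
    throw std::runtime_error(what);
  }
  return address;
}
```

In the main program it would be used as reinterpret_cast<factoryP*>(loadSymbol("plug"+c+".so", "factory"+c)), keeping the cast to the factory type at the call site.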

Following the previous example, one way to build this small application is

export VFLAGS="-fvisibility-inlines-hidden -flto -g"
setenv VFLAGS "-fvisibility-inlines-hidden -flto  -g"
g++ -Ofast -std=gnu++0x  Base.cc -fPIC -pthread -shared -o libBase.so  $VFLAGS  -Wl,-rpath ./
g++ -Ofast -std=gnu++0x  plugA.cpp -fPIC -pthread -shared -o plugA.so  $VFLAGS -L./ -lBase -Wl,-rpath ./
g++ -Ofast -std=gnu++0x  Derived.cc helloD.cc plugD.cpp -fPIC -pthread -shared -o plugD.so  $VFLAGS -L./ -lBase -Wl,-rpath ./   
g++ -Ofast -std=gnu++0x  exFactory.cpp -fPIC -pthread -o exFactory  $VFLAGS -L./ -lBase -ldl -Wl,-rpath ./

Built in this way, nm will confirm that plugA.so exports only factoryA, while plugD.so also exports all the symbols defined in Derived.cc and helloD.cc. One solution is to declare the visibility of those symbols and types as hidden using the usual attribute annotation. A simpler solution is to declare the factory external (i.e. with default visibility) and hide everything else using a command line option, as in

export VFLAGS="-fvisibility-inlines-hidden  -flto"
setenv VFLAGS "-fvisibility-inlines-hidden -flto"
export PFLAGS="-fvisibility=hidden -fvisibility-inlines-hidden -flto"
setenv PFLAGS "-fvisibility=hidden -fvisibility-inlines-hidden -flto"
g++ -Ofast -std=gnu++0x  Base.cc -fPIC -pthread -shared -o libBase.so  $VFLAGS  -Wl,-rpath ./
g++ -Ofast -std=gnu++0x  plugA.cpp -fPIC -pthread -shared -o plugA.so  $PFLAGS -L./ -lBase -Wl,-rpath ./
g++ -Ofast -std=gnu++0x  Derived.cc helloD.cc plugD.cpp -fPIC -pthread -shared -o plugD.so  $PFLAGS -L./ -lBase -Wl,-rpath ./
g++ -Ofast -std=gnu++0x  exFactory.cpp -fPIC -pthread -o exFactory  $VFLAGS -L./ -lBase -ldl -Wl,-rpath ./
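
For comparison, the per-symbol alternative, annotating individual types and functions while compiling without -fvisibility=hidden, might look like the following sketch. DSO_HIDDEN, Helper, helperValue and pluginValue are hypothetical names; only the gcc visibility attribute itself is real.

```cpp
// Hypothetical macro wrapping gcc's visibility attribute.
#if defined(__GNUC__)
#define DSO_HIDDEN __attribute__((visibility("hidden")))
#else
#define DSO_HIDDEN
#endif

// Hidden class: its member functions, vtable and typeinfo are not exported.
class DSO_HIDDEN Helper {
public:
  explicit Helper(int v) : value_(v) {}
  int twice() const;
private:
  int value_;
};

int Helper::twice() const { return 2 * value_; }

// Hidden free function: callable from any file linked into the same shared
// object, but absent from its dynamic symbol table.
DSO_HIDDEN int helperValue(int v) { return Helper(v).twice(); }

// No annotation and no -fvisibility=hidden: default visibility, exported.
extern "C" int pluginValue() { return helperValue(21); }
```

This approach requires touching every declaration that should stay private, which is why the command line option is the simpler choice when a whole plugin lives in one shared object.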

From this example we can conclude that in a plugin architecture the simplest and most effective way to hide symbols and get access to the full power of link time optimization is to collect all the code belonging to each plugin in a single dynamic shared object (the plugin!) and compile it with the -fvisibility=hidden option. Detailed annotation of individual types and functions will still be needed in more classical library designs where interfaces and implementation coexist. In these cases C++ features such as anonymous namespaces and inlining can be used to hide large parts of the implementation without using any specific gcc extension.

-- VincenzoInnocente - 29-Apr-2011

Topic revision: r9 - 2011-06-23 - VincenzoInnocente
