My aim is to create a wrapper around Boost's uniform real distribution with the Mersenne Twister engine in order to make it available in a library. So I created a basic class like this:
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/uniform_real_distribution.hpp>

class mt19937
{
protected:
    boost::random::mt19937 gen_;
    boost::random::uniform_real_distribution<double> real_;
public:
    mt19937(unsigned long s = 5489UL) : gen_(s), real_(0., 1.) {}
    double get() { return real_(gen_); }
};
However, when running a performance test I found that my class is much slower than calling the Boost objects directly. The following code, which draws 10 billion samples, takes 30 s on my machine:
constexpr unsigned long seed = 5489UL;
constexpr size_t iter = 100000;
double x = 0.;
boost::random::mt19937 gen(seed);
boost::random::uniform_real_distribution<double> real(0., 1.);
for (size_t i = 0; i < iter; ++i)
    for (size_t j = 0; j < iter; ++j)
        x = real(gen);
The class mt19937 described above, with the following code, takes around 70 s:
mt19937 stduniform(seed);
for (size_t i = 0; i < iter; ++i)
    for (size_t j = 0; j < iter; ++j)
        x = stduniform.get();
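For anyone who wants to reproduce the comparison without Boost installed, here is a minimal sketch that swaps in the standard library's std::mt19937 and std::uniform_real_distribution. That swap is an assumption on my part, and timings with the std:: classes need not match the Boost numbers above. The names StdUniformWrapper, sum_direct, sum_wrapped and seconds are hypothetical. The samples are summed so that the loop has an observable result.

```cpp
#include <chrono>
#include <cstddef>
#include <random>

// Hypothetical stand-in for the wrapper class, built on <random> instead of Boost.
class StdUniformWrapper
{
public:
    explicit StdUniformWrapper(std::mt19937::result_type s = 5489u)
        : gen_(s), real_(0., 1.) {}
    double get() { return real_(gen_); }
private:
    std::mt19937 gen_;
    std::uniform_real_distribution<double> real_;
};

// Draw n samples by calling the distribution on the engine directly.
double sum_direct(std::size_t n, std::mt19937::result_type seed = 5489u)
{
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> real(0., 1.);
    double sum = 0.;
    for (std::size_t i = 0; i < n; ++i)
        sum += real(gen);
    return sum;
}

// Draw n samples through the wrapper's get() member.
double sum_wrapped(std::size_t n, std::mt19937::result_type seed = 5489u)
{
    StdUniformWrapper stduniform(seed);
    double sum = 0.;
    for (std::size_t i = 0; i < n; ++i)
        sum += stduniform.get();
    return sum;
}

// Wall-clock time of a callable, in seconds, measured with steady_clock.
template <class F>
double seconds(F&& f)
{
    const auto t0 = std::chrono::steady_clock::now();
    f();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

// Possible usage (store and print the sum so the work is not discarded):
//   double s = 0.;
//   double t = seconds([&] { s = sum_wrapped(100000000); });
```

With the same seed both functions should walk through the same sequence, so comparing the two sums is a quick check that only the call path differs between the two timings.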
Looking at the disassembly on Windows, in the first case the code executed for x = real(gen) is the following, which seems to me to be just the call to boost::random::detail::generate_uniform_real and the assignment to x:
00007FF6D14639F0 movzx r9d,byte ptr [r15]
00007FF6D14639F4 lea rcx,[gen]
00007FF6D14639F9 movaps xmm2,xmm7
00007FF6D14639FC movaps xmm1,xmm8
00007FF6D1463A00 call boost::random::detail::generate_uniform_real<boost::random::mersenne_twister_engine<unsigned int,32,624,397,31,2567483615,11,4294967295,7,2636928640,15,4022730752,18,1812433253>,double> (07FF6D146141Ah)
With the function get() I see the following instructions. It seems to perform some operations on registers that I cannot explain, plus a jump:
00007FF6D1463B61 movsd xmm3,mmword ptr [rbp+900h]
00007FF6D1463B69 lea rcx,[stduniform]
00007FF6D1463B6E movsd xmm4,mmword ptr [rbp+8F8h]
00007FF6D1463B76 movaps xmm2,xmm3
00007FF6D1463B79 mulsd xmm2,xmm6
00007FF6D1463B7D movaps xmm1,xmm4
00007FF6D1463B80 mulsd xmm1,xmm6
00007FF6D1463B84 movaps xmm0,xmm2
00007FF6D1463B87 subsd xmm0,xmm1
00007FF6D1463B8B comisd xmm0,xmm7
00007FF6D1463B8F jbe main+2F8h (07FF6D1463B98h)
00007FF6D1463B91 call boost::random::detail::generate_uniform_real<boost::random::mersenne_twister_engine<unsigned int,32,624,397,31,2567483615,11,4294967295,7,2636928640,15,4022730752,18,1812433253>,double> (07FF6D14615D7h)
00007FF6D1463B96 jmp main+307h (07FF6D1463BA7h)
00007FF6D1463B98 movzx r9d,byte ptr [rbx]
00007FF6D1463B9C movaps xmm2,xmm3
00007FF6D1463B9F movaps xmm1,xmm4
00007FF6D1463BA2 call boost::random::detail::generate_uniform_real<boost::random::mersenne_twister_engine<unsigned int,32,624,397,31,2567483615,11,4294967295,7,2636928640,15,4022730752,18,1812433253>,double> (07FF6D146141Ah)
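To make the extra instructions easier to discuss, here is my literal reading of their control flow written as C++. It is only a sketch: the names CallSite and select_call_site are hypothetical, and the values held in xmm6 and xmm7 are not visible in the listing, so they are left as parameters.

```cpp
// Which of the two call sites in the second listing is reached.
enum class CallSite
{
    ScaledArgs,   // call at ...3B91, arguments multiplied by xmm6
    OriginalArgs  // call at ...3BA2, arguments passed unchanged
};

// a     : value loaded from [rbp+900h] into xmm3
// b     : value loaded from [rbp+8F8h] into xmm4
// scale : whatever xmm6 holds (not shown in the listing)
// limit : whatever xmm7 holds (not shown in the listing)
CallSite select_call_site(double a, double b, double scale, double limit)
{
    // mulsd xmm2,xmm6 / mulsd xmm1,xmm6 / subsd xmm0,xmm1
    const double diff = a * scale - b * scale;

    // comisd xmm0,xmm7 followed by jbe: the jump is taken when
    // diff <= limit (or the comparison is unordered), which skips the
    // first call and lands on the second one.
    return (diff > limit) ? CallSite::ScaledArgs : CallSite::OriginalArgs;
}
```

Read this way, the two loads, two multiplications, the subtraction and the comparison are repeated on every iteration before either call is made, whereas the first listing goes straight to the call.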
Is it possible that a call to a function (which should be inlined), performed 10 billion times, adds this much overhead? Do you have any suggestions for improving the performance of this code?
I am working on Windows with the VC14 compiler of Visual Studio 2015 and Boost 1.7.1. I observed similar behaviour with GCC 4.9 on a Linux machine, where the direct call to Boost takes 30 s and the new class takes 45 s.
Thanks a lot for your time.