Benchmark cpp application --- google benchmark, cpp high resolution clock

 

Introduction

To keep your application improving, you will need a way to measure it. In the previous blog posts, I have introduced tools for profiling your application which gets you the ability to visualize the underlying of your software. In this blog post, I will share a benchmark method I used in my projects.

The methodology

According to [1], you need to have a base case to make a relative comparison to make your measuring result meaningful. 


Example --- remake digit count 10

I build up the measurement tools by following the example given by [1]. You can find the source code [2]. Here is a high level result.

2023-09-25T10:02:38+00:00
Running ./user_benchmark/cpp_showcases_Benchmark
Run on (20 X 5100 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x10)
  L1 Instruction 32 KiB (x10)
  L2 Unified 2048 KiB (x10)
  L3 Unified 24576 KiB (x1)
Load Average: 0.17, 0.43, 0.44

 

The above diagram shows the relative improvement. The improvement is around 2 to 3. There is drop when digit is 5, 9 for relative_4. Clearly, that is related to the cache size. Surprisingly, relative_6 and 7 is almost always outperform of relative_4 which is not similar to the author suggested. He found that relative_4 hits the best result.

One may found the benchmark source code [3] for more details.

Code --- implementation for compare

class Digit_Count_10
{
public:
    Digit_Count_10() {}
    ~Digit_Count_10() {}

    u_int32_t Digits10_Basecase(u_int64_t v);
    u_int32_t Digits10_Step_4(u_int64_t v);
    u_int32_t Digits10_Step_5(u_int64_t v);
    u_int32_t Digits10_Step_6(u_int64_t v);
    u_int32_t Digits10_Step_7(u_int64_t v);
};

u_int32_t Digit_Count_10::Digits10_Basecase(u_int64_t v)
{
    u_int32_t result = 0;
    do
    {
        ++result;
        v /= 10;
    } while (v);
    return result;
}

u_int32_t Digit_Count_10::Digits10_Step_4(u_int64_t v)
{
    u_int32_t result = 1;
    for (;;)
    {
        if (v < 10) return  result;
        if (v < 100) return result + 1;
        if (v < 1000) return result + 2;
        if (v < 10000) return result + 3;
        v /= 10000U;
        result += 4;
    }
}

u_int32_t Digit_Count_10::Digits10_Step_5(u_int64_t v)
{
    u_int32_t result = 1;
    for (;;)
    {
        if (v < 10) return  result;
        if (v < 100) return result + 1;
        if (v < 1000) return result + 2;
        if (v < 10000) return result + 3;
        if (v < 100000) return result + 4;
        v /= 100000U;
        result += 5;
    }
}

u_int32_t Digit_Count_10::Digits10_Step_6(u_int64_t v)
{
    u_int32_t result = 1;
    for (;;)
    {
        if (v < 10) return  result;
        if (v < 100) return result + 1;
        if (v < 1000) return result + 2;
        if (v < 10000) return result + 3;
        if (v < 100000) return result + 4;
        if (v < 1000000) return result + 5;
        v /= 1000000U;
        result += 6;
    }
}

u_int32_t Digit_Count_10::Digits10_Step_7(u_int64_t v)
{
    u_int32_t result = 1;
    for (;;)
    {
        if (v < 10) return  result;
        if (v < 100) return result + 1;
        if (v < 1000) return result + 2;
        if (v < 10000) return result + 3;
        if (v < 100000) return result + 4;
        if (v < 1000000) return result + 5;
        if (v < 10000000) return result + 6;
        v /= 10000000U;
        result += 7;
    }
}

Code --- benchmark source code

#include <benchmark/benchmark.h>
#include <chrono>

template <typename T>
uint64_t Run_Batch_Min_Benchmark(int b
                                 , int n
                                 , T fun)
{
    std::chrono::time_point<std::chrono::system_clock> start_time_point, end_time_point;
    std::chrono::duration<double> fun_elapsed_seconds, loop_elapsed_seconds;
    uint64_t min_time = std::numeric_limits<uint64_t>::max();
    for (int i = 0; i < b; i++)
    {
        start_time_point = std::chrono::system_clock::now();
        for (int j = 0; j < n; j++)
        {
            fun();
        }
        fun_elapsed_seconds = std::chrono::system_clock::now() - start_time_point;

        // time the for loop to remove the overhead noise
        start_time_point = std::chrono::system_clock::now();
        for (int j = 0; j < n; j++) { benchmark::DoNotOptimize(j); } // { asm volatile(""); } // volatile the variable so that cpp O3 is not going to optimiz it
        loop_elapsed_seconds = std::chrono::system_clock::now() - start_time_point;

        uint64_t cur_run_time = static_cast<uint64_t>(fun_elapsed_seconds.count() * 1000 * 1000 + 0.5); // get nano second
        cur_run_time -= static_cast<uint64_t>(loop_elapsed_seconds.count() * 1000 * 1000 + 0.5); // remove loop overhead
        if (min_time > cur_run_time)
            min_time = cur_run_time;
    }
    return min_time;
}

void Digit_Count_Benchmark_Main()
{
    std::cout << "batch" << "\t"
              << "iteration" << "\t"
              << "num_digit" << "\t"
              << "base_case" << "\t"
              << "case_4" << "\t"
              << "case_5" << "\t"
              << "case_6" << "\t"
              << "case_7" << std::endl;
    std::vector<u_int64_t> vv {1, 12, 123, 1234, 12345, 123456, 1234567, 12345678
                               , 123456789, 1234567891, 12345678912, 123456789123, 1234567891234, 12345678912345
                               , 123456789123456, 1234567891234567, 12345678912345678};
    int batch = 50;
    int iteration = 10000;
    // batch minimum benchmark
    for (size_t i = 0; i < vv.size(); i++)
    {
        Digit_Count_10 dc;
        uint64_t run_time = Run_Batch_Min_Benchmark(batch, iteration, [&] () -> void
        {
            dc.Digits10_Basecase(vv[i]);
        });
        std::cout << batch << "\t" << iteration << "\t" << i+1 << "\t";
        std::cout << run_time << "\t";

        run_time = Run_Batch_Min_Benchmark(batch, iteration, [&] () -> void
        {
            dc.Digits10_Step_4(vv[i]);
        });
        std::cout << run_time << "\t";

        run_time = Run_Batch_Min_Benchmark(batch, iteration, [&] () -> void
        {
            dc.Digits10_Step_5(vv[i]);
        });
        std::cout << run_time << "\t";

        run_time = Run_Batch_Min_Benchmark(batch, iteration, [&] () -> void
        {
            dc.Digits10_Step_6(vv[i]);
        });
        std::cout << run_time << "\t";

        run_time = Run_Batch_Min_Benchmark(batch, iteration, [&] () -> void
        {
            dc.Digits10_Step_7(vv[i]);
        });
        std::cout << run_time << std::endl;
    }
}

// Define benchmark
static void BM_Digit_Count(benchmark::State& state) {
    for (auto _ : state)
    {
        Digit_Count_Benchmark_Main();
    }
}
BENCHMARK(BM_Digit_Count);

BENCHMARK_MAIN();

Note

You may hit a cpu scaling warning while you run the google benchmark. Please refer to [4] for the solution.

$ sudo cpupower frequency-set --governor performance
$ ./mybench
$ sudo cpupower frequency-set --governor powersave

 

Reference

[1] N. T. C. W. (2016, January 13). code::dive conference 2015 - Andrei Alexandrescu - Writing Fast Code I. YouTube. https://www.youtube.com/watch?v=vrfYLlR8X8k

[2] project/app/Algo_Data_Structure/Digit_Count_10.hpp · main · SulfredLee / cpp_showcases · GitLab. (2023, September 25). GitLab. https://gitlab.com/SulfredLee/cpp_showcases/-/blob/main/project/app/Algo_Data_Structure/Digit_Count_10.hpp?ref_type=heads

[3] project/user_benchmark/Digit_Count_Benchmark.hpp · main · SulfredLee / cpp_showcases · GitLab. (2023, September 25). GitLab. https://gitlab.com/SulfredLee/cpp_showcases/-/blob/main/project/user_benchmark/Digit_Count_Benchmark.hpp?ref_type=heads

[4] google microbenchmarking cpu scaling warning. (n.d.). Stack Overflow. https://stackoverflow.com/a/48110959/2358836

 

Comments