Introduction
To keep your application improving, you will need a way to measure it. In the previous blog posts, I have introduced tools for profiling your application which gets you the ability to visualize the underlying of your software. In this blog post, I will share a benchmark method I used in my projects.
The methodology
According to [1], you need to have a base case to make a relative comparison to make your measuring result meaningful.
Example --- remake digit count 10
I build up the measurement tools by following the example given by [1]. You can find the source code [2]. Here is a high level result.
2023-09-25T10:02:38+00:00 Running ./user_benchmark/cpp_showcases_Benchmark Run on (20 X 5100 MHz CPU s) CPU Caches: L1 Data 48 KiB (x10) L1 Instruction 32 KiB (x10) L2 Unified 2048 KiB (x10) L3 Unified 24576 KiB (x1) Load Average: 0.17, 0.43, 0.44
The above diagram shows the relative improvement. The improvement is around 2 to 3. There is drop when digit is 5, 9 for relative_4. Clearly, that is related to the cache size. Surprisingly, relative_6 and 7 is almost always outperform of relative_4 which is not similar to the author suggested. He found that relative_4 hits the best result.
One may found the benchmark source code [3] for more details.
Code --- implementation for compare
class Digit_Count_10 { public: Digit_Count_10() {} ~Digit_Count_10() {} u_int32_t Digits10_Basecase(u_int64_t v); u_int32_t Digits10_Step_4(u_int64_t v); u_int32_t Digits10_Step_5(u_int64_t v); u_int32_t Digits10_Step_6(u_int64_t v); u_int32_t Digits10_Step_7(u_int64_t v); }; u_int32_t Digit_Count_10::Digits10_Basecase(u_int64_t v) { u_int32_t result = 0; do { ++result; v /= 10; } while (v); return result; } u_int32_t Digit_Count_10::Digits10_Step_4(u_int64_t v) { u_int32_t result = 1; for (;;) { if (v < 10) return result; if (v < 100) return result + 1; if (v < 1000) return result + 2; if (v < 10000) return result + 3; v /= 10000U; result += 4; } } u_int32_t Digit_Count_10::Digits10_Step_5(u_int64_t v) { u_int32_t result = 1; for (;;) { if (v < 10) return result; if (v < 100) return result + 1; if (v < 1000) return result + 2; if (v < 10000) return result + 3; if (v < 100000) return result + 4; v /= 100000U; result += 5; } } u_int32_t Digit_Count_10::Digits10_Step_6(u_int64_t v) { u_int32_t result = 1; for (;;) { if (v < 10) return result; if (v < 100) return result + 1; if (v < 1000) return result + 2; if (v < 10000) return result + 3; if (v < 100000) return result + 4; if (v < 1000000) return result + 5; v /= 1000000U; result += 6; } } u_int32_t Digit_Count_10::Digits10_Step_7(u_int64_t v) { u_int32_t result = 1; for (;;) { if (v < 10) return result; if (v < 100) return result + 1; if (v < 1000) return result + 2; if (v < 10000) return result + 3; if (v < 100000) return result + 4; if (v < 1000000) return result + 5; if (v < 10000000) return result + 6; v /= 10000000U; result += 7; } }
Code --- benchmark source code
#include <benchmark/benchmark.h> #include <chrono> template <typename T> uint64_t Run_Batch_Min_Benchmark(int b , int n , T fun) { std::chrono::time_point<std::chrono::system_clock> start_time_point, end_time_point; std::chrono::duration<double> fun_elapsed_seconds, loop_elapsed_seconds; uint64_t min_time = std::numeric_limits<uint64_t>::max(); for (int i = 0; i < b; i++) { start_time_point = std::chrono::system_clock::now(); for (int j = 0; j < n; j++) { fun(); } fun_elapsed_seconds = std::chrono::system_clock::now() - start_time_point; // time the for loop to remove the overhead noise start_time_point = std::chrono::system_clock::now(); for (int j = 0; j < n; j++) { benchmark::DoNotOptimize(j); } // { asm volatile(""); } // volatile the variable so that cpp O3 is not going to optimiz it loop_elapsed_seconds = std::chrono::system_clock::now() - start_time_point; uint64_t cur_run_time = static_cast<uint64_t>(fun_elapsed_seconds.count() * 1000 * 1000 + 0.5); // get nano second cur_run_time -= static_cast<uint64_t>(loop_elapsed_seconds.count() * 1000 * 1000 + 0.5); // remove loop overhead if (min_time > cur_run_time) min_time = cur_run_time; } return min_time; } void Digit_Count_Benchmark_Main() { std::cout << "batch" << "\t" << "iteration" << "\t" << "num_digit" << "\t" << "base_case" << "\t" << "case_4" << "\t" << "case_5" << "\t" << "case_6" << "\t" << "case_7" << std::endl; std::vector<u_int64_t> vv {1, 12, 123, 1234, 12345, 123456, 1234567, 12345678 , 123456789, 1234567891, 12345678912, 123456789123, 1234567891234, 12345678912345 , 123456789123456, 1234567891234567, 12345678912345678}; int batch = 50; int iteration = 10000; // batch minimum benchmark for (size_t i = 0; i < vv.size(); i++) { Digit_Count_10 dc; uint64_t run_time = Run_Batch_Min_Benchmark(batch, iteration, [&] () -> void { dc.Digits10_Basecase(vv[i]); }); std::cout << batch << "\t" << iteration << "\t" << i+1 << "\t"; std::cout << run_time << "\t"; run_time = Run_Batch_Min_Benchmark(batch, iteration, [&] () -> void { dc.Digits10_Step_4(vv[i]); }); std::cout << run_time << "\t"; run_time = Run_Batch_Min_Benchmark(batch, iteration, [&] () -> void { dc.Digits10_Step_5(vv[i]); }); std::cout << run_time << "\t"; run_time = Run_Batch_Min_Benchmark(batch, iteration, [&] () -> void { dc.Digits10_Step_6(vv[i]); }); std::cout << run_time << "\t"; run_time = Run_Batch_Min_Benchmark(batch, iteration, [&] () -> void { dc.Digits10_Step_7(vv[i]); }); std::cout << run_time << std::endl; } } // Define benchmark static void BM_Digit_Count(benchmark::State& state) { for (auto _ : state) { Digit_Count_Benchmark_Main(); } } BENCHMARK(BM_Digit_Count); BENCHMARK_MAIN();
Note
You may hit a cpu scaling warning while you run the google benchmark. Please refer to [4] for the solution.
$ sudo cpupower frequency-set --governor performance $ ./mybench $ sudo cpupower frequency-set --governor powersave
Reference
[1] N. T. C. W. (2016, January 13). code::dive conference 2015 - Andrei Alexandrescu - Writing Fast Code I. YouTube. https://www.youtube.com/watch?v=vrfYLlR8X8k
[2] project/app/Algo_Data_Structure/Digit_Count_10.hpp · main · SulfredLee / cpp_showcases · GitLab. (2023, September 25). GitLab. https://gitlab.com/SulfredLee/cpp_showcases/-/blob/main/project/app/Algo_Data_Structure/Digit_Count_10.hpp?ref_type=heads
[3] project/user_benchmark/Digit_Count_Benchmark.hpp · main · SulfredLee / cpp_showcases · GitLab. (2023, September 25). GitLab. https://gitlab.com/SulfredLee/cpp_showcases/-/blob/main/project/user_benchmark/Digit_Count_Benchmark.hpp?ref_type=heads
[4] google microbenchmarking cpu scaling warning. (n.d.). Stack Overflow. https://stackoverflow.com/a/48110959/2358836



Comments
Post a Comment