EggCounter

I revamped some code based off of work I did a few years ago to make relatively fast perf counters. We are going to start using this in Builder so we can have fairly accurate statistics about our subsystems.

On modern Intel x86_64, these take about 5-8 instructions based on your optimization level. The nice thing is that they don't require any atomics. Those in the know are getting that warm feeling right now about why that is important. Atomics cause cacheline flushes, and cacheline flushes are the enemy of fast code. I once had a manager that called atomic instructions "cacheline nukes". Appropriate.

This revamp isn't much faster than the previous implementation (which is part of the MongoDB C driver). But it does employ a few tricks that allow us to not need to use XMACROS like we did there. If you've never used XMACROS, they are just an inversion of macros. You just call an undefined macro from a file. Then, in your source files, you define that macro and include the file. Do this again and again until you have everything you need. Convenienent, but messy.

EggCounter instead lets us define and mutate counters similar to how other things are done in GLib. First, you define the counter somewhere in your module.

EGG_DEFINE_COUNTER (my_counter, "Category", "Name, "Description")

Note this requires __attribute__((constructor)) which any GCC or Clang from this decade supports in its default mode.

Then you mutate the counters in the appropriate place.

EGG_COUNTER_INC (my_counter);
EGG_COUNTER_DEC (my_counter);
EGG_COUNTER_ADD (my_counter, 1234);
EGG_COUNTER_SUB (my_counter, 1234);

The sizes of a cacheline on modern Intel is 64-bytes. Counters are 64-bit values. Therefore we can pack 8 counters in a single cacheline.

A simple multi-threaded counter might just use atomic_inc(&counter). But since N cores could be accessing the same cacheline, you can end up getting cacheline thrashing (bouncing, ping-pong, or whatever other name you like).

So what EggCounter does is use a run of N_CPU cachelines. The counter is split up between those cachelines. Each thread will write to the cacheline associated with the executing CPU.

This makes writes fast, and reads (the less common case), slightly more expensive. To read the counter value, you need to do a volatile read across N_CPU cachelines.

Astute readers will be saying Aha! right now. And you are right. These counters DO NOT provide full correctness.

Given the following sequence of instructions (which roughly map to the non-optimized version), there is the possibility of thread pre-emption between the rdtscp and add instructions.

rdtscp                       <== Places Core ID in %ecx
mov    %ecx,-0x8c(%rbp)
mov    %ecx,%ecx
mov    0x10(%r13),%rsi
shl    $0x6,%rcx             <== Moves us into cacheline for Core ID
mov    0x8(%r13),%rdi
mov    %rcx,%rax
add    0x2e569e(%rip),%rax   <== Adds the value 

If our thread is descheduled between the rdtscp and the add, we could be rescheduled on a different core than the one represented by our already fetched Core Id. If we had some help from the scheduler, we could resolve this. Such the case, I believe some people were working on this a while back for tcmalloc.

Anyway, that assembly looks a lot more complicated than it is. Don't fear.

But there is more!

What is the use of fast counters if you have to synchronize with the process to actually read them? You end up introducing artifacts caused by your observations. To help mitigate this, the counters are stored in an mmap() region using a file-descriptor created with shm_open().

That means you can access the counters externally, and read them, without synchronizing with the target process.

To make this convenient, egg_counter_arena_new_for_pid() is provided. To list the values of all the counters simply use egg_counter_arena_foreach() and egg_counter_get().

As you might imagine, Builder comes with a tool to test this.

[christian@starlight gnome-builder]$ ./tools/ide-list-counters 16788 
Makecache  : Flag Cache Misses    :   0 : Number of flag cache misses
Makecache  : Flag Cache Hits      :   0 : Number of flag cache hits
Makecache  : Target Cache Misses  :   2 : Number of target cache misses
Makecache  : Target Cache Hits    :   2 : Number of target cache hits
ThreadPool : Queued Tasks         :   0 : Current number of pending tasks.
ThreadPool : Total Tasks          :   8 : Total number of tasks processed.
Clang      : Total Parse Attempts :   4 : Total number of attempts to create a translation unit.

Another nice feature of this, is that if the program crashes, the SHM region is still available for further inspection.

If you happen to have an older Intel, or ARM, or something without rdtscp, we simply fallback to atomics.

If you know your target platform has rdtscp, you can just -DHAVE_RDTSCP. Otherwise, I suggest an autoconf check similar to the following.

AC_MSG_CHECKING([for fast counters with rdtscp])
AC_RUN_IFELSE(
  [AC_LANG_SOURCE([[
   #include 
   int main (int argc, char *argv[]) { int cpu; __builtin_ia32_rdtscp (&cpu); return 0; }]])],
  [have_rdtscp=yes],
  [have_rdtscp=no])
AC_MSG_RESULT([$have_rdtscp])
AS_IF([test "$have_rdtscp" = "yes"],
      [CFLAGS="$CFLAGS -DHAVE_RDTSCP"])

Happy Hacking!

-- Christian Hergert 2015-05-08

Back to Index