Now that the modified lockless variant does not need a DWCAS anymore,
there's no reason to keep the much slower locked version, so let's
just get rid of it.
In GH issue #1275, Fabiano Nunes Parente provided a nicely detailed
report showing reproducible crashes under musl. Musl is one of the libs
coming with a simple allocator for which we prefer to keep the shared
cache. On x86 we have a DWCAS so the lockless implementation is enabled
for such libraries.
And this implementation has had a small race since day one: the allocator
will need to read the first object's <next> pointer to place it into the
free list's head. If another thread picks the same element and immediately
releases it, while both the local and the shared pools are too crowded, it
will be freed to the OS. If the libc's allocator immediately releases it,
the memory area is unmapped and we can have a crash while trying to read
that pointer. However there is no problem as long as the item remains
mapped in memory because whatever value found there will not be placed
into the head since the counter will have changed.
The probability for this to happen is extremely low, but as analyzed by
Fabiano, it increases with the buffer size. On 16 threads it's relatively
easy to reproduce with 2MB buffers above 200k req/s, where it should
happen within the first 20 seconds of traffic usually.
This is a structural issue for which there are two non-trivial solutions:
- place a read lock in the alloc call and a barrier made of lock/unlock
in the free() call to force to serialize operations; this will have
a big performance impact since free() is already one of the contention
points;
- change the allocator to use a self-locked head, similar to what is
done in the MT_LISTS. This requires two memory writes to the head
instead of a single one, thus the overhead is exactly one memory
write during alloc and one during free;
This patch implements the second option. A new POOL_DUMMY pointer was
defined for the locked pointer value, allowing to both read and lock it
with a single xchg call. The code was carefully optimized so that the
locked period remains the shortest possible and that bus writes are
avoided as much as possible whenever the lock is held.
Tests show that while a bit slower than the original lockless
implementation on large buffers (2MB), it's 2.6 times faster than both
the no-cache and the locked implementation on such large buffers, and
remains as fast or faster than the all implementations when buffers are
48k or higher. Tests were also run on arm64 with similar results.
Note that this code is not used on modern libcs featuring a fast allocator.
A nice benefit of this change is that since it removes a dependency on
the DWCAS, it will be possible to remove the locked implementation and
replace it with this one, that is then usable on all systems, thus
significantly increasing their performance with large buffers.
Given that lockless pools were introduced in 1.9 (not supported anymore),
this patch will have to be backported as far as 2.0. The code changed
several times in this area and is subject to many ifdefs which will
complicate the backport. What is important is to remove all the DWCAS
code from the shared cache alloc/free lockless code and replace it with
this one. The pool_flush() code is basically the same code as the
allocator, retrieving the whole list at once. If in doubt regarding what
barriers to use in older versions, it's safe to use the generic ones.
This patch depends on the following previous commits:
- MINOR: pools: do not maintain the lock during pool_flush()
- MINOR: pools: call malloc_trim() under thread isolation
- MEDIUM: pools: use a single pool_gc() function for locked and lockless
The last one also removes one occurrence of an unneeded DWCAS in the
code that was incompatible with this fix. The removal of the now unused
seq field will happen in a future patch.
Many thanks to Fabiano for his detailed report, and to Olivier for
his help on this issue.
The purpose of this debugging option was to prevent certain pools from
masking other ones when they were shared. For example, task, http_txn,
h2s, h1s, h1c, session, fcgi_strm, and connection are all 192 bytes and
would normally be mergedi, but not with this option. The problem is that
certain pools are declared multiple times with various parameters, which
are often very close, and due to the way the option works, they're not
shared either. Good examples of this are captures and stick tables. Some
configurations have large numbers of stick-tables of pretty similar types
and it's very common to end up with the following when the option is
enabled:
$ socat - /tmp/sock1 <<< "show pools" | grep stick
- Pool sticktables (160 bytes) : 0 allocated (0 bytes), 0 used, needed_avg 0, 0 failures, 1 users, @0x753800=56
- Pool sticktables (160 bytes) : 0 allocated (0 bytes), 0 used, needed_avg 0, 0 failures, 1 users, @0x753880=57
- Pool sticktables (160 bytes) : 0 allocated (0 bytes), 0 used, needed_avg 0, 0 failures, 1 users, @0x753900=58
- Pool sticktables (160 bytes) : 0 allocated (0 bytes), 0 used, needed_avg 0, 0 failures, 1 users, @0x753980=59
- Pool sticktables (160 bytes) : 0 allocated (0 bytes), 0 used, needed_avg 0, 0 failures, 1 users, @0x753a00=60
- Pool sticktables (160 bytes) : 0 allocated (0 bytes), 0 used, needed_avg 0, 0 failures, 1 users, @0x753a80=61
- Pool sticktables (160 bytes) : 0 allocated (0 bytes), 0 used, needed_avg 0, 0 failures, 1 users, @0x753b00=62
- Pool sticktables (224 bytes) : 0 allocated (0 bytes), 0 used, needed_avg 0, 0 failures, 1 users, @0x753780=55
In addition to not being convenient, it can have important effects on the
memory usage because these pools will not share their entries, so one stick
table cannot allocate from another one's pool.
This patch solves this by going back to the initial goal which was not to
have different pools in the same list. Instead of masking the MAP_F_SHARED
flag, it simply adds a test on the pool's name, and disables pool sharing
if the names differ. This way pools are not shared unless they're of the
same name and size, which doesn't hinder debugging. The same test above
now returns this:
$ socat - /tmp/sock1 <<< "show pools" | grep stick
- Pool sticktables (160 bytes) : 0 allocated (0 bytes), 0 used, needed_avg 0, 0 failures, 7 users, @0x3fadb30 [SHARED]
- Pool sticktables (224 bytes) : 0 allocated (0 bytes), 0 used, needed_avg 0, 0 failures, 1 users, @0x3facaa0 [SHARED]
This is much better. This should probably be backported, in order to limit
the side effects of DEBUG_DONT_SHARE_POOLS being enabled in production.
Continuing the unification of local and shared pools, now the usage of
pools is governed by CONFIG_HAP_POOLS without which allocations and
releases are performed directly from the OS using pool_alloc_nocache()
and pool_free_nocache().
Till now it was limited to objects allocated from the OS which means
it had little use as soon as pools were enabled. Let's move it upper
in the layers so that any code can benefit from fault injection. In
addition this allows to pass a new flag POOL_F_NO_FAIL to disable it
if some callers prefer a no-failure approach.
The mem_should_fail() call enabled by DEBUG_FAIL_ALLOC used to be placed
only in the no-cache version of the allocator. Now we can generalize it
to all modes and remove the exclusive test on CONFIG_HAP_NO_GLOBAL_POOLS.
We're going to make the local pool always present unless pools are
completely disabled. This means that pools are always enabled by
default, regardless of the use of threads. Let's drop this notion
of "local" pools and make it just "pool". The equivalent debug
option becomes DEBUG_NO_POOLS instead of DEBUG_NO_LOCAL_POOLS.
For now this changes nothing except the option and dropping the
dependency on USE_THREAD.
Initially per-thread pool caches were stored into a fixed-size array.
But this was a bit ugly because the last allocated pools were not able
to benefit from the cache at all. As a work around to preserve
performance, a size of 64 cacheable pools was set by default (there
are 51 pools at the moment, excluding any addon and debugging code),
so all in-tree pools were covered, at the expense of higher memory
usage.
In addition an index had to be calculated for each pool, and was used
to acces the pool cache head into that array. The pool index was not
even stored into the pools so it was required to determine it to access
the cache when the pool was already known.
This patch changes this by moving the pool cache head into the pool
head itself. This way it is certain that each pool will have its own
cache. This removes the need for index calculation.
The pool cache head is 32 bytes long so it was aligned to 64B to avoid
false sharing between threads. The extra cost is not huge (~2kB more
per pool than before), and we'll make better use of that space soon.
The pool cache head contains the size, which should probably be removed
since it's already in the pool's head.
The pool_alloc_dirty() function was renamed to __pool_alloc() and now
takes a set of flags indicating whether poisonning is permitted or not
and whether zeroing the area is needed or not. The pool_alloc() function
is now just a wrapper calling __pool_alloc(pool, 0).
We've reached a point where the global pools represent a significant
bottleneck with threads. On a 64-core machine, the performance was
divided by 8 between 32 and 64 H2 connections only because there were
not enough entries in the local caches to avoid picking from the global
pools, and the contention on the list there was very high. It becomes
obvious that we need to have an array of lists, but that will require
more changes.
In parallel, standard memory allocators have improved, with tcmalloc
and jemalloc finding their ways through mainstream systems, and glibc
having upgraded to a thread-aware ptmalloc variant, keeping this level
of contention here isn't justified anymore when we have both the local
per-thread pool caches and a fast process-wide allocator.
For these reasons, this patch introduces a new compile time setting
CONFIG_HAP_NO_GLOBAL_POOLS which is set by default when threads are
enabled with thread local pool caches, and we know we have a fast
thread-aware memory allocator (currently set for glibc>=2.26). In this
case we entirely bypass the global pool and directly use the standard
memory allocator when missing objects from the local pools. It is also
possible to force it at compile time when a good allocator is used with
another setup.
It is still possible to re-enable the global pools using
CONFIG_HAP_GLOBAL_POOLS, if a corner case is discovered regarding the
operating system's default allocator, or when building with a recent
libc but a different allocator which provides other benefits but does
not scale well with threads.
When not sharing pools (i.e. when building with -DDEBUG_DONT_SHARE_POOLS)
we have about 47 pools right now, while MAX_BASE_POOLS is only 32, meaning
that only the first 32 ones will benefit from a per-thread cache entry.
This totally kills performance when pools are not shared (roughly -20%).
Let's double the limit to gain some margin, and make it possible to set
it as a build option.
It might be useful to backport this to stable versions as they're likely
to be affected as well.
pool-t.h was mistakenly including the full-blown includes for threads,
lists and api instead of the types, and as such, CONFIG_HAP_LOCAL_POOLS
and CONFIG_HAP_LOCKLESS_POOLS were not visible everywhere.
Till now the local pool caches were implemented only when lockless pools
were in use. This was mainly due to the difficulties to disentangle the
code parts. However the locked pools would further benefit from the local
cache, and having this would reduce the variants in the code.
This patch does just this. It adds a new debug macro DEBUG_NO_LOCAL_POOLS
to forcefully disable local pool caches, and makes sure that the high
level functions are now strictly the same between locked and lockless
(pool_alloc(), pool_alloc_dirty(), pool_free(), pool_get_first()). The
pool index calculation was moved inside the CONFIG_HAP_LOCAL_POOLS guards.
This allowed to move them out of the giant #ifdef and to significantly
reduce the code duplication.
A quick perf test shows that with locked pools the performance increases
by roughly 10% on 8 threads and gets closer to the lockless one.
This is the beginning of the move and cleanup of memory.h. This first
step only extracts type definitions and basic macros that are needed
by the files which reference a pool. They're moved to pool-t.h (since
"pool" is more obvious than "memory" when looking for pool-related
stuff). 3 files which didn't need to include the whole memory.h were
updated.