On an AMD EPYC 3rd gen, 20% of the CPU is spent calculating the amount
of pools needed when using QUIC, because pool allocations/releases are
quite frequent and the inter-CCX communication is super slow. Still,
there's a way to save between 0.5 and 1% CPU by using fetch-add and
sub-fetch that are converted to XADD so that the result is directly
fed into the swrate_add argument without having to re-read the memory
area. That's what this patch does.
When the integrity check fails, it's useful to get a dump of the area
around the first faulty byte. That's what this patch does. For example
it now shows this before reporting info about the tag itself:
Contents around first corrupted address relative to pool item:.
Contents around address 0xe4febc0792c0+40=0xe4febc0792e8:
0xe4febc0792c8 [80 75 56 d8 fe e4 00 00] [.uV.....]
0xe4febc0792d0 [a0 f7 23 a4 fe e4 00 00] [..#.....]
0xe4febc0792d8 [90 75 56 d8 fe e4 00 00] [.uV.....]
0xe4febc0792e0 [d9 93 fb ff fd ff ff ff] [........]
0xe4febc0792e8 [d9 93 fb ff ff ff ff ff] [........]
0xe4febc0792f0 [d9 93 fb ff ff ff ff ff] [........]
0xe4febc0792f8 [d9 93 fb ff ff ff ff ff] [........]
0xe4febc079300 [d9 93 fb ff ff ff ff ff] [........]
This may be backported to 2.9 and maybe even 2.8 as it does help spot
the cause of the memory corruption.
This function is particularly useful to dump unknown areas watching
for opportunistic symbols, so let's move it to tools.c so that we can
reuse it a little bit more.
When a corruption was detected in an object, it's often said that the
tag doesn't match the pool, but it should also check if it matches the
location of an earlier pool_free() call, which happens when -dMcaller
is used. That's what we're doing now.
This option has been set by default for a very long time and also
complicates the manipulation of the DEBUG variable. Let's make it
the official default and permit to unset it by setting it to zero.
The other pool-related DEBUG options were adjusted to also explicitly
check for the zero value for consistency.
As reported by github user @JB0925 in issue #2427, there is a possible
crash in pool_flush(). The problem is that if the free_list is not empty
in the first test, and is empty at the moment the xchg() is performed,
for example because another thread called it in parallel, we place a
POOL_BUSY there that is never removed later, causing the next thread to
wait forever.
This was introduced in 2.5 with commit 2a4523f6f ("BUG/MAJOR: pools: fix
possible race with free() in the lockless variant"). It has probably
very rarely been detected, because:
- pool_flush() is only called when stopping is set
- the function does nothing if global pools are disabled, which is
the case on most modern systems with a fast memory allocator.
It's possible to reproduce it by modifying __task_free() to call
pool_flush() on 1% of the calls instead of only when stopping.
The fix is quite simple, it consists in moving the zeroing of the
entry in the break path after verifying that the entry was not already
busy.
This must be backported wherever commit 2a4523f6f is.
In order to limit inter-thread contention on the global pool, in 2.9-dev3
with commit 7bf829ace ("MAJOR: pools: move the shared pool's free_list
over multiple buckets"), it was decided that if the selected bucket had
an empty free list, we would simply give up and fall back to the OS
allocator.
But this causes allocations to be made from the OS for certain threads,
to be released to overloaded pools that are sent back to the OS. One
visible effect is that sending a lot of traffic using h2load with 100
parallel streams over 100 connections causes 5-10k buffers to be
allocated, then reducing the load to only 10 connections doesn't make
these allocations go down, just because some buckets are no longer
visited.
Tests show that giving a second chance to pick another bucket in this
case is sufficient to visit all other buckets and recycle their pending
objects. Now "show pools" that starts at 10k buffers at 100 connections
goes down to about 150 with 1 connection and 100 streams in a fraction
of a second.
No backport is needed, as the issue is only in 2.9.
Since 2.9-dev3 with commit 7bf829ace ("MAJOR: pools: move the shared
pool's free_list over multiple buckets"), the global pool supports
multiple heads to reduce inter-thread contention. However, when
grabbing a freelist head fails because another thread is already
picking from it, we just skip to the next one and try again.
Unfortunately, it still maintains a bit of contention between thread
pairs when for some reasons only a few threads are used. This may
happen for example when running on a 4- or 8- thread system and
the two most active ones end up on adjacent buckets.
A better and much simpler solution consists in visiting a random bucket
instead of the current one. Tests show that the CPU usage spent in
pool_refill_local_from_shared() reduces at low number of connections
(hence threads).
No backport is needed, as the issue is only in 2.9.
Now when calling ha_panic() with a thread still under malloc_trim(),
we'll set a new tainted flag to easily report it, and the output
trace will report that this condition happened and will suggest to
use no-memory-trimming to avoid it in the future.
When looking at "show pools", it's often difficult to know which alloc()
corresponds to which free() since it's not often 1:1. But sometimes we
have all elements available to maintain a link between alloc and free.
Indeed, when the caller is recorded in the allocated area, we can store
the pointer to the just created bin instead of the caller address itself,
since the caller address is already in the memprof bin. By doing so, we
permit the pool_free() call to locate the allocator bin and update its
free count when caller tracing is enabled. This for example allows to
produce outputs like this on "show profiling" and a process started with
-dMcaller:
1391967 1391968 22805987328 22806003712| 0x59f72f process_stream+0x19f/0x3a7a p_alloc(0) [delta=-16384] [pool=buffer]
1391936 1391937 22805479424 22805495808| 0x6e1476 task_run_applet+0x426/0xea2 p_alloc(0) [delta=-16384] [pool=buffer]
1391925 1391925 22805299200 22805299200| 0x58435a main+0xdf07a p_alloc(0) [delta=0] [pool=buffer]
0 2087930 0 34208645120| 0x59b519 stream_release_buffers+0xf9/0x110 p_free(-16384) [pool=buffer]
695993 695992 11403149312 11403132928| 0x66018f main+0x1baeaf p_alloc(0) [delta=16384] [pool=buffer]
0 1391957 0 22805823488| 0x59b47c stream_release_buffers+0x5c/0x110 p_free(-16384) [pool=buffer]
695968 695970 11402739712 11402772480| 0x587b85 h1_io_cb+0x9a5/0xe7c p_alloc(0) [delta=-32768] [pool=buffer]
0 1391923 0 22805266432| 0x57f388 main+0xda0a8 p_free(-16384) [pool=buffer]
695959 695960 11402592256 11402608640| 0x586add main+0xe17fd p_alloc(0) [delta=-16384] [pool=buffer]
0 695978 0 11402903552| 0x59cc58 stream_free+0x178/0x9ea p_free(-16384) [pool=buffer]
(...)
Here it's quickly visible that all of them got properly released.
In pool_gc(), GCC 13.2.1 reports an error about a potential null potential
dereference:
src/pool.c: In function ‘pool_gc’:
src/pool.c:807:64: error: potential null pointer dereference [-Werror=null-dereference]
807 | entry->buckets[bucket].free_list = temp->next;
| ~~~~^~~~~~
There is no issue here because "bucket" variable cannot be greater than
CONFIG_HAP_POOL_BUCKETS. But to make GCC happy, we now break the loop if it
is greater or equal to CONFIG_HAP_POOL_BUCKETS.
When dumping pool information, we make a special case of the condition
where the pool couldn't be identified and we consider that it was the
correct one. In the code arrangements brought by commit efc46dede ("DEBUG:
pools: inspect pools on fatal error and dump information found"), a
ternary expression for testing this depends on the "if" block condition
so this can be simplified and will make Coverity happy. This was reported
in GH #2290.
The pools sizes were rounded up a little bit too much with commit
30f931ead ("BUG/MEDIUM: pools: fix the minimum allocation size"). The
goal was in fact to make sure they were always at least large enough to
store 2 list heads, and stuffing this into the alignment calculation
resulted in the size being always rounded up to this size. This is
problematic because it means that the appended tag at the end doesn't
always catch potential overflows since more bytes than needed are
allocated. Moreover, this test was later reinforced by commit b5ba09ed5
("BUG/MEDIUM: pools: ensure items are always large enough for the
pool_cache_item"), proving that the first test was not always sufficient.
This needs to be reworked to proceed correctly:
- the two lists are needed when the object is in the cache, hence
when we don't care about the tag, which means that the tag's size,
if any, can easily cover for the missing bytes to reach that size.
This is actually what was already being checked for.
- the rounding should not be performed (beyond the size of a word to
preserve pointer alignment) when pool tagging is enabled, otherwise
we don't detect small overflows. It means that there will be less
merging when proceeding like this. Tests show that we merge 93 pools
into 36 without tags and 43 with tags enabled.
- the rounding should not consider the extra size, since it's already
done when calculating the allocated size later (i.e. don't round up
twice). The difference is subtle but it's what makes sure the tag
immediately follows the area instead of starting from the end.
Thanks to this, now when writing one byte too many at the end of a struct
stream, the error is instantly caught.
When no tag matches a known pool, we can inspect around to help figure
what could have possibly overwritten memory. The contents are printed
one machine word per line in hex, then using printable characters, and
when they can be resolved to a pointer, either the pool's pointer name
or a resolvable symbol with offset. The goal here is to help recognize
what is easily identifiable in memory.
For example applying the following patch to stream_free():
- pool_free(pool_head_stream, s);
+ pool_free(pool_head_stream, (void*)s+1);
Causes the following dump to be emitted:
FATAL: pool inconsistency detected in thread 1: tag mismatch on free().
caller: 0x59e968 (stream_free+0x6d8/0xa0a)
item: 0x13df5c1
pool: 0x12782c0 ('stream', size 888, real 904, users 1)
Tag does not match (0x4f00000000012782). Tag does not match any other pool.
Contents around address 0x13df5c1+888=0x13df939:
0x13df918 [00 00 00 00 00 00 00 00] [........]
0x13df920 [00 00 00 00 00 00 00 00] [........]
0x13df928 [00 00 00 00 00 00 00 00] [........]
0x13df930 [00 00 00 00 00 00 00 00] [........]
0x13df938 [c0 82 27 01 00 00 00 00] [..'.....] [pool:stream]
0x13df940 [4f c0 59 00 00 00 00 00] [O.Y.....] [stream_new+0x4f/0xbec]
0x13df948 [49 46 49 43 41 54 45 2d] [IFICATE-]
0x13df950 [81 02 00 00 00 00 00 00] [........]
0x13df958 [df 13 00 00 00 00 00 00] [........]
Other possible callers:
(...)
We notice that the tag references pool_head_stream with the allocation
point in stream_new. Another benefit is that a caller may be figured
from the tag even if the "caller" feature is not enabled, because upon
a free() we always put the caller's location into the tag. This should
be sufficient to debug most cases that normally require gdb.
It's a bit frustrating sometimes to see pool checks catch a bug but not
provide exploitable information without a core.
Here we're adding a function "pool_inspect_item()" which is called just
before aborting in pool_check_pattern() and POOL_DEBUG_CHECK_MARK() and
which will display the error type, the pool's pointer and name, and will
try to check if the item's tag matches the pool, and if not, will iterate
over all pools to see if one would be a better candidate, then will try
to figure the last known caller and possibly other likely candidates if
the pool's tag is not sufficiently trusted. This typically helps better
diagnose corruption in use-after-free scenarios, or freeing to a pool
that differs from the one the object was allocated from, and will also
indicate calling points that may help figure where an object was last
released or allocated. The info is printed on stderr just before the
backtrace.
For example, the recent off-by-one test in the PPv2 changes would have
produced the following output in vtest logs:
*** h1 debug|FATAL: pool inconsistency detected in thread 1: tag mismatch on free().
*** h1 debug| caller: 0x62bb87 (conn_free+0x147/0x3c5)
*** h1 debug| pool: 0x2211ec0 ('pp_tlv_256', size 304, real 320, users 1)
*** h1 debug|Tag does not match. Possible origin pool(s):
*** h1 debug| tag: @0x2565530 = 0x2216740 (pp_tlv_128, size 176, real 192, users 1)
*** h1 debug|Recorded caller if pool 'pp_tlv_128':
*** h1 debug| @0x2565538 (+0184) = 0x62c76d (conn_recv_proxy+0x4cd/0xa24)
A mismatch in the allocated/released pool is already visible, and the
callers confirm it once resolved, where the allocator indeed allocates
from pp_tlv_128 and conn_free() releases to pp_tlv_256:
$ addr2line -spafe ./haproxy <<< $'0x62bb87\n0x62c76d'
0x000000000062bb87: conn_free at connection.c:568
0x000000000062c76d: conn_recv_proxy at connection.c:1177
In preparation for more detailed pool error reports, let's pass the
caller pointers to the check functions. This will be useful to produce
messages indicating where the issue happened.
When recording the caller of a pool_alloc(), we currently store it only
when the object comes from the cache and never when it comes from the
heap. There's no valid reason for this except that the caller's pointer
was not passed to pool_alloc_nocache(), so it used to set NULL there.
Let's just pass it down the chain.
In 2.9-dev4, commit 544c2f2d9 ("MINOR: pools: use EBO to wait for
unlock during pool_flush()") broke the thread-less build by calling
pl_wait_new_long() without explicitly including plock.h which is
normally included by thread.h when threads are enabled.
pool_flush() could become a source of contention on the pool's free list
if there are many competing thread using that pool. Let's make sure we
use EBO and not just a simple CPU relaxation there, to avoid disturbing
them.
clang is more picky than gcc regarding duplicate "inline". The functions
declared with "forceinline" don't need to have "inline" since it's already
in the macro.
This aims at further reducing the contention on the free_list when using
global pools. The free_list pointer now appears for each bucket, and both
the alloc and the release code skip to a next bucket when ending on a
contended entry. The default entry used for allocations and releases
depend on the thread ID so that locality is preserved as much as possible
under low contention.
It would be nice to improve the situation to make sure that releases to
the shared pools doesn't consider the first entry's pointer but only an
argument that would be passed and that would correspond to the bucket in
the thread's cache. This would reduce computations and make sure that the
shared cache only contains items whose pointers match the same bucket.
This was not yet done. One possibility could be to keep the same splitting
in the local cache.
With this change, an h2load test with 5 * 160 conns & 40 streams on 80
threads that was limited to 368k RPS with the shared cache jumped to
3.5M RPS for 8 buckets, 4M RPS for 16 buckets, 4.7M RPS for 32 buckets
and 5.5M RPS for 64 buckets.
The failed allocation counter cannot depend on a pointer, but since it's
a perpetually increasing counter and not a gauge, we don't care where
it's incremented. Thus instead we're hashing on the TID. There's no
contention there anyway, but it's better not to waste the room in
the pool's heads and to move that with the other counters.
That's the same principle as for ->allocated and ->used. Here we return
the summ of the raw values, so the result still needs to be fed to
swrate_avg(). It also means that we now use the local ->used instead
of the global one for the calculations and do not need to call pool_used()
anymore on fast paths. The number of samples should likely be divided by
the number of buckets, but that's not done yet (better observe first).
A function pool_needed_avg() was added to report aggregated values for
the "show pools" command.
With this change, an h2load made of 5 * 160 conn * 40 streams on 80
threads raised from 1.5M RPS to 6.7M RPS.
That's the same principle as for ->allocated. The small difference here
is that it's no longer possible to decrement ->used in batches when
releasing clusters from the cache to the shared cache, so the counter
has to be decremented for each of them. But as it provides less
contention and it's done only during forced eviction, it shouldn't be
a problem.
A function "pool_used()" was added to return the sum of the entries.
It's used by pool_alloc_nocache() and pool_free_nocache() which need
to count the number of used entries. It's not a problem since such
operations are done when picking/releasing objects to/from the OS,
but it is a reminder that the number of buckets should remain small.
With this change, an h2load test made of 5 * 160 conn * 40 streams on
80 threads raised from 812k RPS to 1.5M RPS.
The ->used counter is one of the most stressed, and it heavily
depends on the ->allocated one, so let's first move ->allocated
to a few buckets.
A function "pool_allocated()" was added to return the sum of the entries.
It's important not to abuse it as it does iterate, so everywhere it's
possible to avoid it by keeping a local counter, it's better. Currently
it's used for limited pools which need to make sure they do not allocate
too many objects. That's an acceptable tradeoff to save CPU on large
machines at the expense of spending a little bit more on small ones which
normally are not under load.
On many threads and without the shared cache, there can be extreme
contention on the ->allocated counter, the ->free_list pointer, and
the ->used counter. It's possible to limit this contention by spreading
the counters a little bit over multiple entries, that are summed up when
a consultation is needed. The criterion used to spread the values cannot
be related to the thread ID due to migrations, since we need to keep
consistent stats (allocated vs used).
Instead we'll just hash the pointer, it provides an index that does the
job and that is consistent for the object. When having just a few entries
(16 here as it showed almost identical performance between global and
non-global pools) even iterations should be short enough during
measurements to not be a problem.
A pair of functions designed to ease pointer hash bucket calculation were
added, with one of them doing it for thread IDs because allocation failures
will be associated with a thread and not a pointer.
For now this patch only brings in the relevant parts of the infrastructure,
the CONFIG_HAP_POOL_BUCKETS_BITS macro that defaults to 6 bits when 512
threads or more are supported, 5 bits when 128 or more are supported, 4
bits when 16 or more are supported, otherwise 3 bits for small setups.
The array in the pool_head and the two utility functions are already
added. It should have no measurable impact beyond inflating the pool_head
structure.
The pool's allocation counter doesn't strictly require to be updated
from these functions, it may more efficiently be done in the caller
(even out of a loop for pool_flush() and pool_gc()), and doing so will
also help us spread the counters over an array later. The functions
were renamed _noinc and _nodec to make sure we catch any possible
user in an external patch. If needed, the original functions may easily
be reimplemented in an inline function.
Running a stick-table stress with -dMglobal under 56 threads shows
extreme contention on the pool's free_list because it has to be
processed in two phases and only used to implement a cpu_relax() on
the retry path.
Let's at least implement exponential back-off here to limit the neighbor's
noise and reduce the time needed to successfully acquire the pointer. Just
doing so shows there's still contention but almost doubled the performance,
from 1.1 to 2.1M req/s.
Instead of reporting the inaccurate "malloc_trim() support" on -vv, let's
report the case where the memory allocator was actively replaced from the
one used at build time, as this is the corner case we want to be cautious
about. We also put a tainted bit when this happens so that it's possible
to detect it at run time (e.g. the user might have inherited it from an
environment variable during a reload operation).
The now unused is_trim_enabled() function was finally dropped.
The runtime detection of the default memory allocator was broken by
Commit d8a97d8f6 ("BUG/MINOR: illegal use of the malloc_trim() function
if jemalloc is used") due to a misunderstanding of its role. The purpose
is not to detect whether we're on non-jemalloc but whether or not the
allocator was changed from the one we booted with, in which case we must
be extra cautious and absolutely refrain from calling malloc_trim() and
its friends.
This was done only to drop the message saying that malloc_trim() is
supported, which will be totally removed in another commit, and could
possibly be removed even in older versions if this patch would get
backported since in the end it provides limited value.
We already have some generic code in trim_all_pools() to implement the
equivalent of malloc_trim() on jemalloc and macos. Instead of keeping the
logic there, let's just move it to our own malloc_trim() implementation
so that we can unify the mechanism and the logic. Now any low-level code
calling malloc_trim() will either be disabled by haproxy's config if the
user decides to, or will be mapped to the equivalent mechanism if malloc()
was intercepted by a preloaded jemalloc.
Trim_all_pools() preserves the benefit of serializing threads (which we
must not impose to other libs which could come with their own threads).
It means that our own code should mostly use trim_all_pools() instead of
calling malloc_trim() directly.
As reported by Miroslav in commit d8a97d8f6 ("BUG/MINOR: illegal use of
the malloc_trim() function if jemalloc is used") there are still occasional
cases where it's discovered that malloc_trim() is being used without its
suitability being checked first. This is a problem when using another
incompatible allocator. But there's a class of use cases we'll never be
able to cover, it's dynamic libraries loaded from Lua. In order to address
this more reliably, we now define our own malloc_trim() that calls the
previous one after checking that the feature is supported and that the
allocator is the expected one. This way child libraries that would call
it will also be safe.
The function is intentionally left defined all the time so that it will
be possible to clean up some code that uses it by removing ifdefs.
The global option 'no-memory-trimming' was added in 2.6 with commit
c4e56dc58 ("MINOR: pools: add a new global option "no-memory-trimming"")
but there were some cases left where it was not considered. Let's make
is_trim_enabled() also consider it.
In the event that HAProxy is linked with the jemalloc library, it is still
shown that malloc_trim() is enabled when executing "haproxy -vv":
..
Support for malloc_trim() is enabled.
..
It's not so much a problem as it is that malloc_trim() is called in the
pat_ref_purge_range() function without any checking.
This was solved by setting the using_default_allocator variable to the
correct value in the detect_allocator() function and before calling
malloc_trim() it is checked whether the function should be called.
Using -dMfail alone does nothing unless tune.fail-alloc is set, which
renders it pretty useless as-is, and is not intuitive. Let's change
this so that the filure rate is preset to 1% when the option is set on
the command line. This allows to inject failures without having to edit
the configuration.
The same change was already performed for the cli. The stats applet and the
prometheus exporter are also concerned. Both use the stats API and rely on
pool functions to get total pool usage in bytes. pool_total_allocated() and
pool_total_used() must return 64 bits unsigned integer to avoid any wrapping
around 4G.
This may be backported to all versions.
We don't need to know very accurately how much RAM is needed in a pool,
however we must not spend time competing with other threads trying to be
the one with the most accurate value. Let's use the "_opportunistic"
variants of swrate_add() which will simply cause some updates to be
dropped in case of thread contention. This should significantly improve
the situation when dealing with many threads and small per-thread caches.
Performance gains of up to 1-2% were observed on 48-thread systems thanks
to this alone.
Till now it was only possible to change the thread local hot cache size
at build time using CONFIG_HAP_POOL_CACHE_SIZE. But along benchmarks it
was sometimes noticed a huge contention in the lower level memory
allocators indicating that larger caches could be beneficial, especially
on machines with large L2 CPUs.
Given that the checks against this value was no longer on a hot path
anymore, there was no reason for continuing to force it to be tuned at
build time. So this patch allows to set it by tune.memory-hot-size.
It's worth noting that during the boot phase the value remains zero so
that it's possible to know if the value was set or not, which opens the
possibility that we try to automatically adjust it based on the per-cpu
L2 cache size or the use of certain protocols (none of this is done yet).
Since the massive pools cleanup that happened in 2.6, the pools
architecture was made quite more hierarchical and many alternate code
blocks could be moved to runtime flags set by -dM. One of them had not
been converted by then, DEBUG_UAF. It's not much more difficult actually,
since it only acts on a pair of functions indirection on the slow path
(OS-level allocator) and a default setting for the cache activation.
This patch adds the "uaf" setting to the options permitted in -dM so
that it now becomes possible to set or unset UAF at boot time without
recompiling. This is particularly convenient, because every 3 months on
average, developers ask a user to recompile haproxy with DEBUG_UAF to
understand a bug. Now it will not be needed anymore, instead the user
will only have to disable pools and enable uaf using -dMuaf. Note that
-dMuaf only disables previously enabled pools, but it remains possible
to re-enable caching by specifying the cache after, like -dMuaf,cache.
A few tests with this mode show that it can be an interesting combination
which catches significantly less UAF but will do so with much less
overhead, so it might be compatible with some high-traffic deployments.
The change is very small and isolated. It could be helpful to backport
this at least to 2.7 once confirmed not to cause build issues on exotic
systems, and even to 2.6 a bit later as this has proven to be useful
over time, and could be even more if it did not require a rebuild. If
a backport is desired, the following patches are needed as well:
CLEANUP: pools: move the write before free to the uaf-only function
CLEANUP: pool: only include pool-os from pool.c not pool.h
REORG: pool: move all the OS specific code to pool-os.h
CLEANUP: pools: get rid of CONFIG_HAP_POOLS
DEBUG: pool: show a few examples in -dMhelp
It's not always easy to remember what certain options do together nor
which ones are only relevant when combined with others, so let's add a
few examples with the "help" command on -dM.
This one was set in defaults.h only when neither DEBUG_NO_POOLS nor
DEBUG_UAF were set. This was not the most convenient location to look
for it, and it was only used in pool.c to decide on the initial value
of POOL_DBG_NO_CACHE.
Let's just use DEBUG_NO_POOLS || DEBUG_UAF directly on this flag and
get rid of the intermediary condition. This also has the benefit of
removing a double inversion, which is always nice for understanding.
Till now pool-os used to contain a mapping from pool_{alloc,free}_area()
to pool_{alloc,free}_area_uaf() in case of DEBUG_UAF, or the regular
malloc-based function. And the *_uaf() functions were in pool.c. But
since 2.4 with the first cleanup of the pools, there has been no more
calls to pool_{alloc,free}_area() from anywhere but pool.c, from exactly
one place each. As such, there's no more need to keep *_uaf() apart in
pool.c, we can inline it into pool-os.h and leave all the OS stuff there,
with pool.c calling either based on DEBUG_UAF. This is cleaner with less
round trips between both files and easier to find.
There's no need for the low-level pool functions to be known from all
callers anymore, they're only used by pool.c. Let's reduce the amount
of header files processed.
In UAF mode, pool_put_to_os() performs a write to the about-to-be-freed
memory area so as to make sure the page is properly mapped and catch a
possible double-free. However there's no point keeping that in an ifdef
in the generic function, because we now have a pool_free_area_uaf()
that is the UAF-specific version of pool_free_area() and the one that
is called immediately after this write. Let's move the code there, it
will be cleaner.