Commit Graph

173 Commits

Willy Tarreau
87d269707b OPTIM: pool: improve needed_avg cache line access pattern
On a 3rd-gen AMD EPYC, 20% of the CPU is spent calculating the amount
of pools needed when using QUIC, because pool allocations/releases are
quite frequent and the inter-CCX communication is super slow. Still,
there's a way to save between 0.5 and 1% CPU by using fetch-add and
sub-fetch, which are converted to XADD, so that the result is directly
fed into the swrate_add() argument without having to re-read the memory
area. That's what this patch does.
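
For illustration, the resulting pattern presumably looks like this
(a simplified sketch, not the literal pool.c code; the bucket field
names and POOL_AVG_SAMPLES are assumptions):

  /* sub-fetch returns the updated value in a single XADD on x86, so it
   * can feed swrate_add() directly without a second load of the hot
   * cache line */
  static inline void pool_release_one(struct pool_head *pool, uint bucket)
  {
          uint used = __atomic_sub_fetch(&pool->buckets[bucket].used, 1,
                                         __ATOMIC_RELAXED);

          swrate_add(&pool->buckets[bucket].needed_avg,
                     POOL_AVG_SAMPLES, used);
  }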
2024-07-09 16:46:38 +02:00
Willy Tarreau
c0ee2d78d7 DEBUG: pools: report the data around the offending area in case of mismatch
When the integrity check fails, it's useful to get a dump of the area
around the first faulty byte. That's what this patch does. For example
it now shows this before reporting info about the tag itself:

  Contents around first corrupted address relative to pool item:.
  Contents around address 0xe4febc0792c0+40=0xe4febc0792e8:
    0xe4febc0792c8 [80 75 56 d8 fe e4 00 00] [.uV.....]
    0xe4febc0792d0 [a0 f7 23 a4 fe e4 00 00] [..#.....]
    0xe4febc0792d8 [90 75 56 d8 fe e4 00 00] [.uV.....]
    0xe4febc0792e0 [d9 93 fb ff fd ff ff ff] [........]
    0xe4febc0792e8 [d9 93 fb ff ff ff ff ff] [........]
    0xe4febc0792f0 [d9 93 fb ff ff ff ff ff] [........]
    0xe4febc0792f8 [d9 93 fb ff ff ff ff ff] [........]
    0xe4febc079300 [d9 93 fb ff ff ff ff ff] [........]

This may be backported to 2.9 and maybe even 2.8 as it does help spot
the cause of the memory corruption.
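
For reference, the shape of such a dump loop is roughly the following
(a standalone sketch, not the actual code; alignment handling omitted):

  #include <ctype.h>
  #include <stdio.h>

  /* dump 8 lines of 8 bytes surrounding <addr> (4 lines before it) */
  static void dump_around(const unsigned char *addr)
  {
          for (int l = 0; l < 8; l++) {
                  const unsigned char *p = addr - 32 + l * 8;

                  printf("  %p [", (const void *)p);
                  for (int i = 0; i < 8; i++)
                          printf("%02x%s", p[i], i < 7 ? " " : "");
                  printf("] [");
                  for (int i = 0; i < 8; i++)
                          putchar(isprint(p[i]) ? p[i] : '.');
                  puts("]");
          }
  }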
2024-04-12 18:01:55 +02:00
Willy Tarreau
16e3655fbd REORG: pool: move the area dump with symbol resolution to tools.c
This function is particularly useful to dump unknown areas while
opportunistically watching for symbols, so let's move it to tools.c so
that we can reuse it a little bit more.
2024-04-12 18:01:20 +02:00
Willy Tarreau
b21aaef4e5 DEBUG: pool: improve decoding of corrupted pools
When a corruption is detected in an object, it's often said that the
tag doesn't match the pool, but we should also check whether it matches the
location of an earlier pool_free() call, which happens when -dMcaller
is used. That's what we're doing now.
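
Conceptually the extra check looks like this (an illustrative sketch;
the tag layout is simplified, and resolve_caller() and
report_earlier_free() are hypothetical stand-ins):

  /* the tag past the payload holds the pool pointer after an alloc,
   * but the freeing call place after a pool_free() with -dMcaller */
  const void *tag = *(const void **)(item + pool->size);

  if (tag != pool && resolve_caller(tag))
          /* looks like a code address: probably an earlier free */
          report_earlier_free(tag);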
2024-04-12 18:01:05 +02:00
Willy Tarreau
772f9a5874 BUILD: pools: make DEBUG_MEMORY_POOLS=1 the default option
This option has been set by default for a very long time and also
complicates the manipulation of the DEBUG variable. Let's make it
the official default and permit unsetting it by setting it to zero.
The other pool-related DEBUG options were adjusted to also explicitly
check for the zero value for consistency.
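
The assumed resulting pattern in defaults.h (sketch, not the actual
diff):

  /* on by default; build with -DDEBUG_MEMORY_POOLS=0 to disable */
  #ifndef DEBUG_MEMORY_POOLS
  # define DEBUG_MEMORY_POOLS 1
  #endif

  #if DEBUG_MEMORY_POOLS > 0
  /* mark/tag instrumentation compiled in */
  #endif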
2024-04-11 17:25:45 +02:00
Willy Tarreau
b746af9990 BUG/MEDIUM: pool: fix rare risk of deadlock in pool_flush()
As reported by github user @JB0925 in issue #2427, there is a possible
crash in pool_flush(). The problem is that if the free_list is not empty
in the first test, and is empty at the moment the xchg() is performed,
for example because another thread called it in parallel, we place a
POOL_BUSY there that is never removed later, causing the next thread to
wait forever.

This was introduced in 2.5 with commit 2a4523f6f ("BUG/MAJOR: pools: fix
possible race with free() in the lockless variant"). It has probably
very rarely been detected, because:
  - pool_flush() is only called when stopping is set
  - the function does nothing if global pools are disabled, which is
    the case on most modern systems with a fast memory allocator.

It's possible to reproduce it by modifying __task_free() to call
pool_flush() on 1% of the calls instead of only when stopping.

The fix is quite simple: it consists in moving the zeroing of the
entry to the break path, after verifying that the entry was not already
busy.

This must be backported wherever commit 2a4523f6f is.
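
In sketch form, the fixed sequence looks like this (condensed, not the
literal pool.c code; __ha_cpu_relax() comes from haproxy/atomic.h and
POOL_BUSY is the list-head poison value):

  static void *pool_flush_grab(void **free_list)
  {
          void *next;

          while (1) {
                  next = __atomic_load_n(free_list, __ATOMIC_RELAXED);
                  if (next == POOL_BUSY) {          /* another thread owns it */
                          __ha_cpu_relax();
                          continue;
                  }
                  next = __atomic_exchange_n(free_list, POOL_BUSY,
                                             __ATOMIC_ACQUIRE);
                  if (next != POOL_BUSY)
                          break;                    /* we own the list now */
          }
          /* the fix: clear the head only here, in the break path, where
           * the POOL_BUSY we leave behind is known to be our own */
          __atomic_store_n(free_list, NULL, __ATOMIC_RELEASE);
          return next;                              /* detached list */
  }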
2024-02-10 12:38:40 +01:00
Ilya Shipitsin
80813cdd2a CLEANUP: assorted typo fixes in the code and comments
This is the 37th iteration of typo fixes.
2023-11-23 16:23:14 +01:00
Willy Tarreau
a57f2a5cfe BUG/MEDIUM: pool: try once to allocate from another bucket if empty
In order to limit inter-thread contention on the global pool, in 2.9-dev3
with commit 7bf829ace ("MAJOR: pools: move the shared pool's free_list
over multiple buckets"), it was decided that if the selected bucket had
an empty free list, we would simply give up and fall back to the OS
allocator.

But this causes certain threads to allocate from the OS while the
objects they release pile up in other, overloaded buckets. One
visible effect is that sending a lot of traffic using h2load with 100
parallel streams over 100 connections causes 5-10k buffers to be
allocated; then reducing the load to only 10 connections doesn't make
these allocations go down, just because some buckets are no longer
visited.

Tests show that giving a second chance to pick another bucket in this
case is sufficient to visit all other buckets and recycle their pending
objects. Now "show pools", which reported 10k buffers at 100
connections, goes down to about 150 with 1 connection and 100 streams,
in a fraction of a second.

No backport is needed, as the issue is only in 2.9.
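
In pseudo-C, the allocation path now behaves like this (sketch
fragment under assumed names; pool_get_from_bucket() is a hypothetical
helper, and statistical_prng() stands for whatever cheap PRNG is
actually used):

  unsigned int bucket = pool_tbucket();        /* TID-based default */
  void *item = pool_get_from_bucket(pool, bucket);

  if (!item) {
          /* second chance: exactly one other randomly picked bucket */
          bucket = statistical_prng() % CONFIG_HAP_POOL_BUCKETS;
          item = pool_get_from_bucket(pool, bucket);
  }
  if (!item)
          item = pool_alloc_nocache(pool, caller); /* OS fallback */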
2023-11-08 17:14:03 +01:00
Willy Tarreau
a9ae094b27 BUG/MINOR: pool: check one other random bucket on alloc conflict
Since 2.9-dev3 with commit 7bf829ace ("MAJOR: pools: move the shared
pool's free_list over multiple buckets"), the global pool supports
multiple heads to reduce inter-thread contention. However, when
grabbing a freelist head fails because another thread is already
picking from it, we just skip to the next one and try again.

Unfortunately, it still maintains a bit of contention between thread
pairs when for some reason only a few threads are used. This may
happen for example when running on a 4- or 8-thread system and
the two most active ones end up on adjacent buckets.

A better and much simpler solution consists in visiting a random bucket
instead of the current one. Tests show that the CPU usage spent in
pool_refill_local_from_shared() drops at low connection counts
(hence few threads).

No backport is needed, as the issue is only in 2.9.
2023-11-08 17:12:49 +01:00
Willy Tarreau
96bb99a87d DEBUG: pools: detect that malloc_trim() is in progress
Now when calling ha_panic() with a thread still under malloc_trim(),
we'll set a new tainted flag to easily report it, and the output
trace will report that this condition happened and will suggest using
no-memory-trimming to avoid it in the future.
2023-10-25 15:48:02 +02:00
Willy Tarreau
5714aff4a6 DEBUG: pool: store the memprof bin on alloc() and update it on free()
When looking at "show pools", it's often difficult to know which alloc()
corresponds to which free() since it's not often 1:1. But sometimes we
have all elements available to maintain a link between alloc and free.
Indeed, when the caller is recorded in the allocated area, we can store
the pointer to the just created bin instead of the caller address itself,
since the caller address is already in the memprof bin. By doing so, we
permit the pool_free() call to locate the allocator bin and update its
free count when caller tracing is enabled. This for example makes it
possible to produce output like this on "show profiling" for a process
started with -dMcaller:

  1391967  1391968  22805987328  22806003712|  0x59f72f process_stream+0x19f/0x3a7a p_alloc(0) [delta=-16384] [pool=buffer]
  1391936  1391937  22805479424  22805495808|  0x6e1476 task_run_applet+0x426/0xea2 p_alloc(0) [delta=-16384] [pool=buffer]
  1391925  1391925  22805299200  22805299200|  0x58435a main+0xdf07a p_alloc(0) [delta=0] [pool=buffer]
        0  2087930            0  34208645120|  0x59b519 stream_release_buffers+0xf9/0x110 p_free(-16384) [pool=buffer]
   695993   695992  11403149312  11403132928|  0x66018f main+0x1baeaf p_alloc(0) [delta=16384] [pool=buffer]
        0  1391957            0  22805823488|  0x59b47c stream_release_buffers+0x5c/0x110 p_free(-16384) [pool=buffer]
   695968   695970  11402739712  11402772480|  0x587b85 h1_io_cb+0x9a5/0xe7c p_alloc(0) [delta=-32768] [pool=buffer]
        0  1391923            0  22805266432|  0x57f388 main+0xda0a8 p_free(-16384) [pool=buffer]
   695959   695960  11402592256  11402608640|  0x586add main+0xe17fd p_alloc(0) [delta=-16384] [pool=buffer]
        0   695978            0  11402903552|  0x59cc58 stream_free+0x178/0x9ea p_free(-16384) [pool=buffer]
(...)

Here it's quickly visible that all of them got properly released.
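
The linking mechanism, reduced to a sketch (memprof_get_bin() is the
assumed bin lookup, and pool_item_caller() a hypothetical accessor for
the caller slot stored in the allocated area):

  /* on alloc with -dMcaller: store the bin, which already holds the
   * caller's address, instead of the raw return address */
  struct memprof_stats *bin = memprof_get_bin(caller, MEMPROF_METH_P_ALLOC);
  *pool_item_caller(item) = bin;

  /* on free: the allocator's bin is right there, credit it */
  bin = *pool_item_caller(item);
  HA_ATOMIC_INC(&bin->free_calls);
  HA_ATOMIC_ADD(&bin->free_tot, pool->size);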
2023-10-17 17:13:56 +02:00
Christopher Faulet
b62d5689d2 BUILD: pool: Fix GCC error about potential null pointer dereference
In pool_gc(), GCC 13.2.1 reports an error about a potential null pointer
dereference:

src/pool.c: In function ‘pool_gc’:
src/pool.c:807:64: error: potential null pointer dereference [-Werror=null-dereference]
  807 |                         entry->buckets[bucket].free_list = temp->next;
      |                                                            ~~~~^~~~~~

There is no issue here because the "bucket" variable cannot be greater
than CONFIG_HAP_POOL_BUCKETS. But to make GCC happy, we now break out of
the loop if it is greater than or equal to CONFIG_HAP_POOL_BUCKETS.
2023-10-04 08:03:02 +02:00
Willy Tarreau
5abbae2d3d CLEANUP: pools: simplify the pool expression when no pool was matched in dump
When dumping pool information, we make a special case of the condition
where the pool couldn't be identified and we consider that it was the
correct one. In the code arrangements brought by commit efc46dede ("DEBUG:
pools: inspect pools on fatal error and dump information found"), a
ternary expression for testing this depends on the "if" block condition
so this can be simplified and will make Coverity happy. This was reported
in GH #2290.
2023-09-13 13:31:41 +02:00
Willy Tarreau
93c2ea0ec3 MEDIUM: pools: refine pool size rounding
The pools sizes were rounded up a little bit too much with commit
30f931ead ("BUG/MEDIUM: pools: fix the minimum allocation size"). The
goal was in fact to make sure they were always at least large enough to
store 2 list heads, and stuffing this into the alignment calculation
resulted in the size always being rounded up to that minimum. This is
problematic because it means that the appended tag at the end doesn't
always catch potential overflows since more bytes than needed are
allocated. Moreover, this test was later reinforced by commit b5ba09ed5
("BUG/MEDIUM: pools: ensure items are always large enough for the
pool_cache_item"), proving that the first test was not always sufficient.

This needs to be reworked to proceed correctly:
  - the two lists are needed when the object is in the cache, hence
    when we don't care about the tag, which means that the tag's size,
    if any, can easily cover for the missing bytes to reach that size.
    This is actually what was already being checked for.

  - the rounding should not be performed (beyond the size of a word to
    preserve pointer alignment) when pool tagging is enabled, otherwise
    we don't detect small overflows. It means that there will be less
    merging when proceeding like this. Tests show that we merge 93 pools
    into 36 without tags and 43 with tags enabled.

  - the rounding should not consider the extra size, since it's already
    done when calculating the allocated size later (i.e. don't round up
    twice). The difference is subtle but it's what makes sure the tag
    immediately follows the area instead of starting from the end.

Thanks to this, now when writing one byte too many at the end of a struct
stream, the error is instantly caught.
2023-09-12 18:14:05 +02:00
Willy Tarreau
61575769ac DEBUG: pools: print the contents surrounding the expected tag location
When no tag matches a known pool, we can inspect around to help figure
out what could have possibly overwritten memory. The contents are printed
one machine word per line in hex, then using printable characters, and
when they can be resolved to a pointer, either the pool's pointer name
or a resolvable symbol with offset. The goal here is to help recognize
what is easily identifiable in memory.

For example applying the following patch to stream_free():

  -	pool_free(pool_head_stream, s);
  +	pool_free(pool_head_stream, (void*)s+1);

Causes the following dump to be emitted:

  FATAL: pool inconsistency detected in thread 1: tag mismatch on free().
    caller: 0x59e968 (stream_free+0x6d8/0xa0a)
    item: 0x13df5c1
    pool: 0x12782c0 ('stream', size 888, real 904, users 1)
  Tag does not match (0x4f00000000012782). Tag does not match any other pool.
  Contents around address 0x13df5c1+888=0x13df939:
         0x13df918 [00 00 00 00 00 00 00 00] [........]
         0x13df920 [00 00 00 00 00 00 00 00] [........]
         0x13df928 [00 00 00 00 00 00 00 00] [........]
         0x13df930 [00 00 00 00 00 00 00 00] [........]
         0x13df938 [c0 82 27 01 00 00 00 00] [..'.....] [pool:stream]
         0x13df940 [4f c0 59 00 00 00 00 00] [O.Y.....] [stream_new+0x4f/0xbec]
         0x13df948 [49 46 49 43 41 54 45 2d] [IFICATE-]
         0x13df950 [81 02 00 00 00 00 00 00] [........]
         0x13df958 [df 13 00 00 00 00 00 00] [........]
  Other possible callers:
  (...)

We notice that the tag references pool_head_stream with the allocation
point in stream_new. Another benefit is that a caller may be figured out
from the tag even if the "caller" feature is not enabled, because upon
a free() we always put the caller's location into the tag. This should
be sufficient to debug most cases that normally require gdb.
2023-09-12 18:14:05 +02:00
Willy Tarreau
0f9a10c7f1 DEBUG: pools: also print the value of the tag when it doesn't match
Sometimes the tag's value may reveal a recognizable pattern, so let's
print it when it doesn't match a known pool.
2023-09-12 18:14:05 +02:00
Willy Tarreau
96c1a24224 DEBUG: pools: also print the item's pointer when crashing
To inspect a core or recognize some values, it's important to have the
item's pointer, but until now it was not provided.
2023-09-12 18:14:05 +02:00
Willy Tarreau
efc46dede9 DEBUG: pools: inspect pools on fatal error and dump information found
It's a bit frustrating sometimes to see pool checks catch a bug but not
provide exploitable information without a core.

Here we're adding a function "pool_inspect_item()" which is called just
before aborting in pool_check_pattern() and POOL_DEBUG_CHECK_MARK() and
which will display the error type, the pool's pointer and name, and will
try to check if the item's tag matches the pool, and if not, will iterate
over all pools to see if one would be a better candidate, then will try
to figure the last known caller and possibly other likely candidates if
the pool's tag is not sufficiently trusted. This typically helps better
diagnose corruption in use-after-free scenarios, or freeing to a pool
that differs from the one the object was allocated from, and will also
indicate calling points that may help figure out where an object was last
released or allocated. The info is printed on stderr just before the
backtrace.

For example, the recent off-by-one test in the PPv2 changes would have
produced the following output in vtest logs:

  ***  h1    debug|FATAL: pool inconsistency detected in thread 1: tag mismatch on free().
  ***  h1    debug|  caller: 0x62bb87 (conn_free+0x147/0x3c5)
  ***  h1    debug|  pool: 0x2211ec0 ('pp_tlv_256', size 304, real 320, users 1)
  ***  h1    debug|Tag does not match. Possible origin pool(s):
  ***  h1    debug|  tag: @0x2565530 = 0x2216740 (pp_tlv_128, size 176, real 192, users 1)
  ***  h1    debug|Recorded caller if pool 'pp_tlv_128':
  ***  h1    debug|  @0x2565538 (+0184) = 0x62c76d (conn_recv_proxy+0x4cd/0xa24)

A mismatch in the allocated/released pool is already visible, and the
callers confirm it once resolved, where the allocator indeed allocates
from pp_tlv_128 and conn_free() releases to pp_tlv_256:

  $ addr2line -spafe ./haproxy <<< $'0x62bb87\n0x62c76d'
  0x000000000062bb87: conn_free at connection.c:568
  0x000000000062c76d: conn_recv_proxy at connection.c:1177
2023-09-11 15:46:14 +02:00
Willy Tarreau
f6bee5a50b DEBUG: pools: make pool_check_pattern() take a pointer to the pool
This will be useful to report detailed bug traces.
2023-09-11 15:19:49 +02:00
Willy Tarreau
e92e96b00f DEBUG: pools: pass the caller pointer to the check functions and macros
In preparation for more detailed pool error reports, let's pass the
caller pointers to the check functions. This will be useful to produce
messages indicating where the issue happened.
2023-09-11 15:19:49 +02:00
Willy Tarreau
baf2070421 DEBUG: pools: always record the caller for uncached allocs as well
When recording the caller of a pool_alloc(), we currently store it only
when the object comes from the cache and never when it comes from the
heap. There's no valid reason for this except that the caller's pointer
was not passed to pool_alloc_nocache(), so it used to set NULL there.
Let's just pass it down the chain.
2023-09-11 15:19:49 +02:00
Willy Tarreau
0074c36dd2 BUILD: pools: import plock.h to build even without thread support
In 2.9-dev4, commit 544c2f2d9 ("MINOR: pools: use EBO to wait for
unlock during pool_flush()") broke the thread-less build by calling
pl_wait_new_long() without explicitly including plock.h which is
normally included by thread.h when threads are enabled.
2023-08-26 17:28:08 +02:00
Willy Tarreau
544c2f2d9e MINOR: pools: use EBO to wait for unlock during pool_flush()
pool_flush() could become a source of contention on the pool's free list
if there are many competing threads using that pool. Let's make sure we
use EBO and not just a simple CPU relaxation there, to avoid disturbing
them.
2023-08-17 09:09:20 +02:00
Willy Tarreau
2d18717fb8 BUILD: pools: fix build error on clang with inline vs forceinline
clang is more picky than gcc regarding duplicate "inline". The functions
declared with "forceinline" don't need to have "inline" since it's already
in the macro.
2023-08-12 19:58:17 +02:00
Willy Tarreau
29eed99b50 MINOR: pools: make pool_evict_last_items() use pool_put_to_os_no_dec()
The bucket is already known, no need to calculate it again. Let's just
call the lower level functions directly.
2023-08-12 19:04:34 +02:00
Willy Tarreau
7bf829ace1 MAJOR: pools: move the shared pool's free_list over multiple buckets
This aims at further reducing the contention on the free_list when using
global pools. The free_list pointer now appears for each bucket, and both
the alloc and the release code skip to a next bucket when ending on a
contended entry. The default entry used for allocations and releases
depend on the thread ID so that locality is preserved as much as possible
under low contention.

It would be nice to improve the situation to make sure that releases to
the shared pools don't consider the first entry's pointer but only an
argument that would be passed and that would correspond to the bucket in
the thread's cache. This would reduce computations and make sure that the
shared cache only contains items whose pointers match the same bucket.
This was not yet done. One possibility could be to keep the same splitting
in the local cache.

With this change, an h2load test with 5 * 160 conns & 40 streams on 80
threads that was limited to 368k RPS with the shared cache jumped to
3.5M RPS for 8 buckets, 4M RPS for 16 buckets, 4.7M RPS for 32 buckets
and 5.5M RPS for 64 buckets.
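
Schematically, the new arrangement looks like this (sketch; the field
and helper names are assumptions):

  struct pool_head {
          /* ... */
          struct {
                  void *free_list;   /* one shared head per bucket */
                  /* per-bucket counters live here as well */
          } buckets[CONFIG_HAP_POOL_BUCKETS];
  };

  /* default bucket derived from the thread ID to preserve locality;
   * when an entry is found POOL_BUSY, skip to the next bucket */
  static inline unsigned int pool_tbucket(void)
  {
          return tid % CONFIG_HAP_POOL_BUCKETS;
  }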
2023-08-12 19:04:34 +02:00
Willy Tarreau
8a0b5f783b MINOR: pools: move the failed allocation counter over a few buckets
The failed allocation counter cannot depend on a pointer, but since it's
a perpetually increasing counter and not a gauge, we don't care where
it's incremented. Thus instead we're hashing on the TID. There's no
contention there anyway, but it's better not to waste the room in
the pool's heads and to move that with the other counters.
2023-08-12 19:04:34 +02:00
Willy Tarreau
da6999f839 MEDIUM: pools: move the needed_avg counter over a few buckets
That's the same principle as for ->allocated and ->used. Here we return
the sum of the raw values, so the result still needs to be fed to
swrate_avg(). It also means that we now use the local ->used instead
of the global one for the calculations and do not need to call pool_used()
anymore on fast paths. The number of samples should likely be divided by
the number of buckets, but that's not done yet (better observe first).

A function pool_needed_avg() was added to report aggregated values for
the "show pools" command.

With this change, an h2load made of 5 * 160 conn * 40 streams on 80
threads raised from 1.5M RPS to 6.7M RPS.
2023-08-12 19:04:34 +02:00
Willy Tarreau
9e5eb586b1 MEDIUM: pools: move the used counter over a few buckets
That's the same principle as for ->allocated. The small difference here
is that it's no longer possible to decrement ->used in batches when
releasing clusters from the cache to the shared cache, so the counter
has to be decremented for each of them. But as it provides less
contention and it's done only during forced eviction, it shouldn't be
a problem.

A function "pool_used()" was added to return the sum of the entries.
It's used by pool_alloc_nocache() and pool_free_nocache() which need
to count the number of used entries. It's not a problem since such
operations are done when picking/releasing objects to/from the OS,
but it is a reminder that the number of buckets should remain small.

With this change, an h2load test made of 5 * 160 conn * 40 streams on
80 threads raised from 812k RPS to 1.5M RPS.
2023-08-12 19:04:34 +02:00
Willy Tarreau
cdb711e42b MEDIUM: pools: spread the allocated counter over a few buckets
The ->used counter is one of the most stressed, and it heavily
depends on the ->allocated one, so let's first move ->allocated
to a few buckets.

A function "pool_allocated()" was added to return the sum of the entries.
It's important not to abuse it as it does iterate, so wherever it's
possible to avoid it by keeping a local counter, that's better. Currently
it's used for limited pools which need to make sure they do not allocate
too many objects. That's an acceptable tradeoff to save CPU on large
machines at the expense of spending a little bit more on small ones which
normally are not under load.
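
The summing helper is presumably along these lines (sketch; the
per-bucket field name is assumed):

  static inline unsigned int pool_allocated(const struct pool_head *pool)
  {
          unsigned int bucket, n = 0;

          /* iterates over all buckets: acceptable on slow paths, but
           * hot paths should keep a local counter instead */
          for (bucket = 0; bucket < CONFIG_HAP_POOL_BUCKETS; bucket++)
                  n += HA_ATOMIC_LOAD(&pool->buckets[bucket].allocated);
          return n;
  }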
2023-08-12 19:04:34 +02:00
Willy Tarreau
06885aaea7 MINOR: pools: introduce the use of multiple buckets
On many threads and without the shared cache, there can be extreme
contention on the ->allocated counter, the ->free_list pointer, and
the ->used counter. It's possible to limit this contention by spreading
the counters a little bit over multiple entries, which are summed up when
a consultation is needed. The criterion used to spread the values cannot
be related to the thread ID due to migrations, since we need to keep
consistent stats (allocated vs used).

Instead we'll just hash the pointer, it provides an index that does the
job and that is consistent for the object. With just a few entries
(16 here, as it showed almost identical performance between global and
non-global pools), even iterations should be short enough during
measurements not to be a problem.

A pair of functions designed to ease pointer hash bucket calculation were
added, with one of them doing it for thread IDs because allocation failures
will be associated with a thread and not a pointer.

For now this patch only brings in the relevant parts of the infrastructure,
the CONFIG_HAP_POOL_BUCKETS_BITS macro that defaults to 6 bits when 512
threads or more are supported, 5 bits when 128 or more are supported, 4
bits when 16 or more are supported, otherwise 3 bits for small setups.
The array in the pool_head and the two utility functions are already
added. It should have no measurable impact beyond inflating the pool_head
structure.
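
An illustrative version of the pointer hash (the real helpers may
differ; CONFIG_HAP_POOL_BUCKETS is assumed to be
1 << CONFIG_HAP_POOL_BUCKETS_BITS):

  #include <stdint.h>

  static inline unsigned int ptr_to_bucket(const void *ptr)
  {
          uintptr_t v = (uintptr_t)ptr;

          /* fold the upper bits down so that all address bits count,
           * then keep only the bucket bits */
          v ^= v >> 16;
          v ^= v >> 8;
          v ^= v >> 4;
          return v & (CONFIG_HAP_POOL_BUCKETS - 1);
  }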
2023-08-12 19:04:34 +02:00
Willy Tarreau
29ad61fb00 OPTIM: pools: make pool_get_from_os() / pool_put_to_os() not update ->allocated
The pool's allocation counter doesn't strictly need to be updated
from these functions, it may more efficiently be done in the caller
(even out of a loop for pool_flush() and pool_gc()), and doing so will
also help us spread the counters over an array later. The functions
were renamed _noinc and _nodec to make sure we catch any possible
user in an external patch. If needed, the original functions may easily
be reimplemented in an inline function.
2023-08-12 19:04:34 +02:00
Willy Tarreau
feeda4132b OPTIM: pools: use exponential back-off on shared pool allocation/release
Running a stick-table stress with -dMglobal under 56 threads shows
extreme contention on the pool's free_list, because it has to be
processed in two phases while, until now, the retry path only
performed a cpu_relax().

Let's at least implement exponential back-off here to limit the neighbor's
noise and reduce the time needed to successfully acquire the pointer. Just
doing so still shows some contention but almost doubles the performance,
from 1.1 to 2.1M req/s.
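
The back-off loop, reduced to a sketch (the real code relies on
plock's primitives; __ha_cpu_relax() is HAProxy's pause macro):

  unsigned int wait = 1;

  /* pause exponentially longer instead of hammering the line */
  while (__atomic_load_n(&pool->free_list, __ATOMIC_RELAXED) == POOL_BUSY) {
          for (unsigned int i = wait; i; i--)
                  __ha_cpu_relax();
          if (wait < 1024)
                  wait <<= 1;        /* double the next pause */
  }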
2023-08-12 19:04:34 +02:00
Willy Tarreau
1751db140a MINOR: pools: report a replaced memory allocator instead of just malloc_trim()
Instead of reporting the inaccurate "malloc_trim() support" on -vv, let's
report the case where the memory allocator was actively replaced from the
one used at build time, as this is the corner case we want to be cautious
about. We also put a tainted bit when this happens so that it's possible
to detect it at run time (e.g. the user might have inherited it from an
environment variable during a reload operation).

The now unused is_trim_enabled() function was finally dropped.
2023-03-22 18:05:02 +01:00
Willy Tarreau
0c27ec5df7 BUG/MINOR: pools: restore detection of built-in allocator
The runtime detection of the default memory allocator was broken by
commit d8a97d8f6 ("BUG/MINOR: illegal use of the malloc_trim() function
if jemalloc is used") due to a misunderstanding of its role. The purpose
is not to detect whether we're on non-jemalloc but whether or not the
allocator was changed from the one we booted with, in which case we must
be extra cautious and absolutely refrain from calling malloc_trim() and
its friends.

This was done only to drop the message saying that malloc_trim() is
supported, which will be totally removed in another commit, and could
possibly be removed even in older versions if this patch were to be
backported, since in the end it provides limited value.
2023-03-22 17:57:13 +01:00
Willy Tarreau
7aee683541 MINOR: pools: export trim_all_pools()
This way it will be usable from outside instead of malloc_trim().
2023-03-22 17:30:28 +01:00
Willy Tarreau
4138f15182 MEDIUM: pools: move the compat code from trim_all_pools() to malloc_trim()
We already have some generic code in trim_all_pools() to implement the
equivalent of malloc_trim() on jemalloc and macos. Instead of keeping the
logic there, let's just move it to our own malloc_trim() implementation
so that we can unify the mechanism and the logic. Now any low-level code
calling malloc_trim() will either be disabled by haproxy's config if the
user decides to, or will be mapped to the equivalent mechanism if malloc()
was intercepted by a preloaded jemalloc.

Trim_all_pools() preserves the benefit of serializing threads (which we
must not impose on other libs which could come with their own threads).
It means that our own code should mostly use trim_all_pools() instead of
calling malloc_trim() directly.
2023-03-22 17:30:28 +01:00
Willy Tarreau
eaba76b02d MINOR: pools: intercept malloc_trim() instead of trying to plug holes
As reported by Miroslav in commit d8a97d8f6 ("BUG/MINOR: illegal use of
the malloc_trim() function if jemalloc is used") there are still occasional
cases where it's discovered that malloc_trim() is being used without its
suitability being checked first. This is a problem when using another
incompatible allocator. But there's a class of use cases we'll never be
able to cover: dynamic libraries loaded from Lua. In order to address
this more reliably, we now define our own malloc_trim() that calls the
previous one after checking that the feature is supported and that the
allocator is the expected one. This way child libraries that would call
it will also be safe.

The function is intentionally left defined all the time so that it will
be possible to clean up some code that uses it by removing ifdefs.
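
In outline, the interception can be pictured as follows (simplified
sketch, not the actual implementation; RTLD_NEXT requires _GNU_SOURCE):

  #include <dlfcn.h>
  #include <stddef.h>

  int malloc_trim(size_t pad)
  {
          static int (*next_trim)(size_t);

          /* feature supported and allocator unchanged since boot? */
          if (!is_trim_enabled())
                  return 0;

          if (!next_trim)
                  next_trim = (int (*)(size_t))dlsym(RTLD_NEXT, "malloc_trim");

          return next_trim ? next_trim(pad) : 0;
  }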
2023-03-22 17:30:28 +01:00
Willy Tarreau
4db0b0430d MINOR: pools: make sure 'no-memory-trimming' is always used
The global option 'no-memory-trimming' was added in 2.6 with commit
c4e56dc58 ("MINOR: pools: add a new global option "no-memory-trimming"")
but there were some cases left where it was not considered. Let's make
is_trim_enabled() also consider it.
2023-03-22 17:29:23 +01:00
Miroslav Zagorac
d8a97d8f60 BUG/MINOR: illegal use of the malloc_trim() function if jemalloc is used
In the event that HAProxy is linked with the jemalloc library, it is still
shown that malloc_trim() is enabled when executing "haproxy -vv":
  ..
  Support for malloc_trim() is enabled.
  ..

The message itself is not so much the problem; the real issue is that
malloc_trim() is called in the pat_ref_purge_range() function without
any check.

This was solved by setting the using_default_allocator variable to the
correct value in the detect_allocator() function, and by checking
whether malloc_trim() should be called before calling it.
2023-03-22 14:14:50 +01:00
Willy Tarreau
0c4348c982 MINOR: pools: preset the allocation failure rate to 1% with -dMfail
Using -dMfail alone does nothing unless tune.fail-alloc is set, which
renders it pretty useless as-is, and is not intuitive. Let's change
this so that the failure rate is preset to 1% when the option is set on
the command line. This makes it possible to inject failures without
having to edit the configuration.
2023-03-21 09:26:55 +01:00
Christopher Faulet
c960a3b60f BUG/MINOR: pool/stats: Use ullong to report total pool usage in bytes in stats
The same change was already performed for the cli. The stats applet and the
prometheus exporter are also concerned. Both use the stats API and rely on
pool functions to get total pool usage in bytes. pool_total_allocated() and
pool_total_used() must return 64-bit unsigned integers to avoid any wrapping
around 4G.

This may be backported to all versions.
2022-12-22 13:46:21 +01:00
Willy Tarreau
2aa14ce5a1 MINOR: pool: only use opportunistic versions of the swrate_add() functions
We don't need to know very accurately how much RAM is needed in a pool,
however we must not spend time competing with other threads trying to be
the one with the most accurate value. Let's use the "_opportunistic"
variants of swrate_add() which will simply cause some updates to be
dropped in case of thread contention. This should significantly improve
the situation when dealing with many threads and small per-thread caches.

Performance gains of up to 1-2% were observed on 48-thread systems thanks
to this alone.
2022-12-20 14:51:12 +01:00
Willy Tarreau
284cfc67b8 MINOR: pool: make the thread-local hot cache size configurable
Till now it was only possible to change the thread-local hot cache size
at build time using CONFIG_HAP_POOL_CACHE_SIZE. But during benchmarks,
huge contention was sometimes noticed in the lower level memory
allocators, indicating that larger caches could be beneficial, especially
on machines whose CPUs have large L2 caches.

Given that the checks against this value were no longer on a hot path,
there was no reason to continue forcing it to be tuned at build time.
So this patch allows it to be set via tune.memory-hot-size.

It's worth noting that during the boot phase the value remains zero so
that it's possible to know if the value was set or not, which opens the
possibility that we try to automatically adjust it based on the per-cpu
L2 cache size or the use of certain protocols (none of this is done yet).
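
For example (the value below is purely illustrative and assumes a size
expressed in bytes):

  global
      # enlarge each thread's hot cache beyond the former build-time default
      tune.memory-hot-size 2097152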
2022-12-20 14:51:12 +01:00
Willy Tarreau
9192d20f02 MINOR: pools: make DEBUG_UAF a runtime setting
Since the massive pools cleanup that happened in 2.6, the pools
architecture was made quite more hierarchical and many alternate code
blocks could be moved to runtime flags set by -dM. One of them had not
been converted by then, DEBUG_UAF. It's not much more difficult actually,
since it only acts on a pair of function indirections on the slow path
(OS-level allocator) and a default setting for the cache activation.

This patch adds the "uaf" setting to the options permitted in -dM so
that it now becomes possible to set or unset UAF at boot time without
recompiling. This is particularly convenient, because every 3 months on
average, developers ask a user to recompile haproxy with DEBUG_UAF to
understand a bug. Now it will not be needed anymore, instead the user
will only have to disable pools and enable uaf using -dMuaf. Note that
-dMuaf only disables previously enabled pools, but it remains possible
to re-enable caching by specifying the cache after, like -dMuaf,cache.
A few tests with this mode show that it can be an interesting combination
which catches significantly fewer UAFs but will do so with much less
overhead, so it might be compatible with some high-traffic deployments.

The change is very small and isolated. It could be helpful to backport
this at least to 2.7 once confirmed not to cause build issues on exotic
systems, and even to 2.6 a bit later as this has proven to be useful
over time, and could be even more if it did not require a rebuild. If
a backport is desired, the following patches are needed as well:

  CLEANUP: pools: move the write before free to the uaf-only function
  CLEANUP: pool: only include pool-os from pool.c not pool.h
  REORG: pool: move all the OS specific code to pool-os.h
  CLEANUP: pools: get rid of CONFIG_HAP_POOLS
  DEBUG: pool: show a few examples in -dMhelp
2022-12-08 18:54:59 +01:00
Willy Tarreau
b634987fed DEBUG: pool: show a few examples in -dMhelp
It's not always easy to remember what certain options do together nor
which ones are only relevant when combined with others, so let's add a
few examples with the "help" command on -dM.
2022-12-08 18:45:41 +01:00
Willy Tarreau
4da51bd190 CLEANUP: pools: get rid of CONFIG_HAP_POOLS
This one was set in defaults.h only when neither DEBUG_NO_POOLS nor
DEBUG_UAF were set. This was not the most convenient location to look
for it, and it was only used in pool.c to decide on the initial value
of POOL_DBG_NO_CACHE.

Let's just use DEBUG_NO_POOLS || DEBUG_UAF directly on this flag and
get rid of the intermediary condition. This also has the benefit of
removing a double inversion, which is always nice for understanding.
2022-12-08 17:45:08 +01:00
Willy Tarreau
a95636682d REORG: pool: move all the OS specific code to pool-os.h
Till now pool-os used to contain a mapping from pool_{alloc,free}_area()
to pool_{alloc,free}_area_uaf() in case of DEBUG_UAF, or the regular
malloc-based function. And the *_uaf() functions were in pool.c. But
since 2.4 with the first cleanup of the pools, there has been no more
calls to pool_{alloc,free}_area() from anywhere but pool.c, from exactly
one place each. As such, there's no more need to keep *_uaf() apart in
pool.c, we can inline it into pool-os.h and leave all the OS stuff there,
with pool.c calling either one based on DEBUG_UAF. This is cleaner, with
fewer round trips between both files, and easier to find.
2022-12-08 17:32:57 +01:00
Willy Tarreau
76a97a98ca CLEANUP: pool: only include pool-os from pool.c not pool.h
There's no need for the low-level pool functions to be known from all
callers anymore, they're only used by pool.c. Let's reduce the amount
of header files processed.
2022-12-08 17:32:40 +01:00
Willy Tarreau
67f89c527f CLEANUP: pools: move the write before free to the uaf-only function
In UAF mode, pool_put_to_os() performs a write to the about-to-be-freed
memory area so as to make sure the page is properly mapped and catch a
possible double-free. However there's no point keeping that in an ifdef
in the generic function, because we now have a pool_free_area_uaf()
that is the UAF-specific version of pool_free_area() and the one that
is called immediately after this write. Let's move the code there, it
will be cleaner.
2022-12-08 16:08:28 +01:00