haproxy

mirror of http://git.haproxy.org/git/haproxy.git synced 2026-02-16 06:02:00 +02:00

Author	SHA1	Message	Date
Willy Tarreau	7bf829ace1	MAJOR: pools: move the shared pool's free_list over multiple buckets This aims at further reducing the contention on the free_list when using global pools. The free_list pointer now appears for each bucket, and both the alloc and the release code skip to a next bucket when ending on a contended entry. The default entry used for allocations and releases depends on the thread ID so that locality is preserved as much as possible under low contention. It would be nice to improve the situation to make sure that releases to the shared pools don't consider the first entry's pointer but only an argument that would be passed and that would correspond to the bucket in the thread's cache. This would reduce computations and make sure that the shared cache only contains items whose pointers match the same bucket. This was not yet done. One possibility could be to keep the same splitting in the local cache. With this change, an h2load test with 5 * 160 conns & 40 streams on 80 threads that was limited to 368k RPS with the shared cache jumped to 3.5M RPS for 8 buckets, 4M RPS for 16 buckets, 4.7M RPS for 32 buckets and 5.5M RPS for 64 buckets.	2023-08-12 19:04:34 +02:00
Willy Tarreau	8a0b5f783b	MINOR: pools: move the failed allocation counter over a few buckets The failed allocation counter cannot depend on a pointer, but since it's a perpetually increasing counter and not a gauge, we don't care where it's incremented. Thus instead we're hashing on the TID. There's no contention there anyway, but it's better not to waste the room in the pool's heads and to move that with the other counters.	2023-08-12 19:04:34 +02:00
Willy Tarreau	da6999f839	MEDIUM: pools: move the needed_avg counter over a few buckets That's the same principle as for ->allocated and ->used. Here we return the sum of the raw values, so the result still needs to be fed to swrate_avg(). It also means that we now use the local ->used instead of the global one for the calculations and do not need to call pool_used() anymore on fast paths. The number of samples should likely be divided by the number of buckets, but that's not done yet (better observe first). A function pool_needed_avg() was added to report aggregated values for the "show pools" command. With this change, an h2load made of 5 * 160 conn * 40 streams on 80 threads rose from 1.5M RPS to 6.7M RPS.	2023-08-12 19:04:34 +02:00
Willy Tarreau	9e5eb586b1	MEDIUM: pools: move the used counter over a few buckets That's the same principle as for ->allocated. The small difference here is that it's no longer possible to decrement ->used in batches when releasing clusters from the cache to the shared cache, so the counter has to be decremented for each of them. But as it provides less contention and it's done only during forced eviction, it shouldn't be a problem. A function "pool_used()" was added to return the sum of the entries. It's used by pool_alloc_nocache() and pool_free_nocache() which need to count the number of used entries. It's not a problem since such operations are done when picking/releasing objects to/from the OS, but it is a reminder that the number of buckets should remain small. With this change, an h2load test made of 5 * 160 conn * 40 streams on 80 threads raised from 812k RPS to 1.5M RPS.	2023-08-12 19:04:34 +02:00
Willy Tarreau	cdb711e42b	MEDIUM: pools: spread the allocated counter over a few buckets The ->used counter is one of the most stressed, and it heavily depends on the ->allocated one, so let's first move ->allocated to a few buckets. A function "pool_allocated()" was added to return the sum of the entries. It's important not to abuse it as it does iterate, so everywhere it's possible to avoid it by keeping a local counter, it's better. Currently it's used for limited pools which need to make sure they do not allocate too many objects. That's an acceptable tradeoff to save CPU on large machines at the expense of spending a little bit more on small ones which normally are not under load.	2023-08-12 19:04:34 +02:00
Willy Tarreau	06885aaea7	MINOR: pools: introduce the use of multiple buckets On many threads and without the shared cache, there can be extreme contention on the ->allocated counter, the ->free_list pointer, and the ->used counter. It's possible to limit this contention by spreading the counters a little bit over multiple entries, that are summed up when a consultation is needed. The criterion used to spread the values cannot be related to the thread ID due to migrations, since we need to keep consistent stats (allocated vs used). Instead we'll just hash the pointer, it provides an index that does the job and that is consistent for the object. When having just a few entries (16 here as it showed almost identical performance between global and non-global pools) even iterations should be short enough during measurements to not be a problem. A pair of functions designed to ease pointer hash bucket calculation were added, with one of them doing it for thread IDs because allocation failures will be associated with a thread and not a pointer. For now this patch only brings in the relevant parts of the infrastructure, the CONFIG_HAP_POOL_BUCKETS_BITS macro that defaults to 6 bits when 512 threads or more are supported, 5 bits when 128 or more are supported, 4 bits when 16 or more are supported, otherwise 3 bits for small setups. The array in the pool_head and the two utility functions are already added. It should have no measurable impact beyond inflating the pool_head structure.	2023-08-12 19:04:34 +02:00
Willy Tarreau	4dd33d9c32	OPTIM: pool: split the read_mostly from read_write parts in pool_head Performance profiling on a 48-thread machine showed a lot of time spent in pool_free(), precisely at the point where pool->limit was retrieved. And the reason is simple. Some parts of the pool_head are heavily updated only when facing a cache miss ("allocated", "used", "needed_avg"), while others are always accessed (limit, flags, size). The fact that both entries were stored into the same cache line makes it very difficult for each thread to access these precious info even when working with its own cache. By just splitting the fields apart, a test on QUIC (which stresses pools a lot) more than doubled performance from 42 Gbps to 96 Gbps! Given that the patch only reorders fields and addresses such a significant contention, it should be backported to 2.7 and 2.6.	2022-12-20 14:51:12 +01:00
Willy Tarreau	9192d20f02	MINOR: pools: make DEBUG_UAF a runtime setting Since the massive pools cleanup that happened in 2.6, the pools architecture was made quite more hierarchical and many alternate code blocks could be moved to runtime flags set by -dM. One of them had not been converted by then, DEBUG_UAF. It's not much more difficult actually, since it only acts on a pair of functions indirection on the slow path (OS-level allocator) and a default setting for the cache activation. This patch adds the "uaf" setting to the options permitted in -dM so that it now becomes possible to set or unset UAF at boot time without recompiling. This is particularly convenient, because every 3 months on average, developers ask a user to recompile haproxy with DEBUG_UAF to understand a bug. Now it will not be needed anymore, instead the user will only have to disable pools and enable uaf using -dMuaf. Note that -dMuaf only disables previously enabled pools, but it remains possible to re-enable caching by specifying the cache after, like -dMuaf,cache. A few tests with this mode show that it can be an interesting combination which catches significantly less UAF but will do so with much less overhead, so it might be compatible with some high-traffic deployments. The change is very small and isolated. It could be helpful to backport this at least to 2.7 once confirmed not to cause build issues on exotic systems, and even to 2.6 a bit later as this has proven to be useful over time, and could be even more if it did not require a rebuild. If a backport is desired, the following patches are needed as well: CLEANUP: pools: move the write before free to the uaf-only function CLEANUP: pool: only include pool-os from pool.c not pool.h REORG: pool: move all the OS specific code to pool-os.h CLEANUP: pools: get rid of CONFIG_HAP_POOLS DEBUG: pool: show a few examples in -dMhelp	2022-12-08 18:54:59 +01:00
Willy Tarreau	e81248c0c8	BUG/MINOR: pool: always align pool_heads to 64 bytes This is the pool equivalent of commit `97ea9c49f` ("BUG/MEDIUM: fd: always align fdtab[] to 64 bytes"). After a careful code review, it happens that the pool heads are the other structures allocated with malloc/calloc that claim to be aligned to a size larger than what the allocator can offer. While no issue was reported on them, no memset() is performed and no type is large, this is a problem waiting to happen, so better fix it. In addition, it's relatively easy to do by storing the allocation address inside the pool_head itself and use it at free() time. Finally, threads might benefit from the fact that the caches will really be aligned and that there will be no false sharing. This should be backported to all versions where it applies easily.	2022-03-02 18:22:08 +01:00
Willy Tarreau	ef301b7556	MINOR: pools: add a debugging flag for memory poisoning option Now -dM will set POOL_DBG_POISON for consistency with the rest of the pool debugging options. As such now we only check for the new flag, which allows the default value to be preset.	2022-02-23 17:11:33 +01:00
Willy Tarreau	13d7775b06	MINOR: pools: replace DEBUG_MEMORY_POOLS with runtime POOL_DBG_TAG This option used to allow to store a marker at the end of the area, which was used as a canary and detection against wrong freeing while the object is used, and as a pointer to the last pool_free() caller when back in cache. Now that we can compute the offsets at runtime, let's check it at run time and continue the code simplification.	2022-02-23 17:11:33 +01:00
Willy Tarreau	0271822f17	MINOR: pools: replace DEBUG_POOL_TRACING with runtime POOL_DBG_CALLER This option used to allow to store a pointer to the caller of the last pool_alloc() or pool_free() at the end of the area. Now that we can compute the offsets at runtime, let's check it at run time and continue the code simplification. In __pool_alloc() we now always calculate the return address (which is quite cheap), and the POOL_DEBUG_TRACE_CALLER() calls are conditioned on the status of debugging option.	2022-02-23 17:11:33 +01:00
Willy Tarreau	96d5bc7379	MINOR: pools: store the allocated size for each pool The allocated size is the visible size plus the extra storage. Since for now we can store up to two extra elements (mark and tracer), it's convenient because now we know that the mark is always stored at ->size, and the tracer is always before ->alloc_sz.	2022-02-23 17:11:33 +01:00
Willy Tarreau	e981631d27	MEDIUM: pools: replace CONFIG_HAP_POOLS with a runtime "NO_CACHE" flag. Like previous patches, this replaces the build-time code paths that were conditioned by CONFIG_HAP_POOLS with runtime paths conditioned by !POOL_DBG_NO_CACHE. One trivial test had to be added in the hot path in __pool_alloc() to refrain from calling pool_get_from_cache(), and another one in __pool_free() to avoid calling pool_put_to_cache(). All cache-specific functions were instrumented with a BUG_ON() to make sure we never call them with cache disabled. Additionally the cache[] array was not initialized (remains NULL) so that we can later drop it if not needed. It's particularly huge and should be turned to dynamic with a pointer to a per-thread area where all the objects are located. This will solve the memory usage issue and will improve locality, or even help better deal with NUMA machines once each thread uses its own arena.	2022-02-23 17:11:33 +01:00
Willy Tarreau	dff3b0627d	MINOR: pools: make the global pools a runtime option. There were very few functions left that were specific to global pools, and even the checks they used to participate to are not directly on the most critical path so they can suffer an extra "if". What's done now is that pool_releasable() always returns 0 when global pools are disabled (like the one before) so that pool_evict_last_items() never tries to place evicted objects there. As such there will never be any object in the free list. However pool_refill_local_from_shared() is bypassed when global pools are disabled so that we even avoid the atomic loads from this function. The default global setting is still adjusted based on the original CONFIG_NO_GLOBAL_POOLS that is set depending on threads and the allocator. The global executable only grew by 1.1kB by keeping this code enabled, and the code is simplified and will later support runtime options.	2022-02-23 17:11:33 +01:00
Willy Tarreau	6f3c7f6e6a	MINOR: pools: add a new debugging flag POOL_DBG_INTEGRITY The test to decide whether or not to enforce integrity checks on cached objects is now enabled at runtime and conditioned by this new debugging flag. While previously it was not a concern to inflate the code size by keeping the two functions static, they were moved to pool.c to limit the impact. In pool_get_from_cache(), the fast code path remains fast by having both flags tested at once to open a slower branch when either POOL_DBG_COLD_FIRST or POOL_DBG_INTEGRITY are set.	2022-02-23 17:11:33 +01:00
Willy Tarreau	d3470e1ce8	MINOR: pools: add a new debugging flag POOL_DBG_COLD_FIRST When enabling pools integrity checks, we usually prefer to allocate cold objects first in order to maximize the time the objects spend in the cache. In order to make this configurable at runtime, let's introduce a new debugging flag to control this allocation order. It is currently preset by the DEBUG_POOL_INTEGRITY build-time setting.	2022-02-23 17:11:33 +01:00
Willy Tarreau	fd8b737e2c	MINOR: pools: switch DEBUG_DONT_SHARE_POOLS to runtime This test used to appear at a single location in create_pool() to enable a check on the pool name or unconditionally merge similarly sized pools. This patch introduces POOL_DBG_DONT_MERGE and conditions the test on this new runtime flag, that is preset according to the aforementioned debugging option.	2022-02-23 17:11:33 +01:00
Willy Tarreau	8d0273ed88	MINOR: pools: switch the fail-alloc test to runtime only The fail-alloc test used to be enabled/disabled at build time using the DEBUG_FAIL_ALLOC macro, but it happens that the cost of the test is quite cheap and that it can be enabled as one of the pool_debugging options. This patch thus introduces the first POOL_DBG_FAIL_ALLOC option, whose default value depends on DEBUG_FAIL_ALLOC. The mem_should_fail() function is now always built, but it was made static since it's never used outside.	2022-02-23 17:11:33 +01:00
Willy Tarreau	49bb5d4268	DEBUG: pools: let's add reverse mapping from cache heads to thread and pool During global eviction we're visiting nodes from the LRU tail and we determine their pool cache head and their pool. In order to make sure we never mess up, let's add some backwards pointer to the thread number and pool from the pool_cache_head. It's 64-byte aligned anyway so we're not wasting space and it helps for debugging and will prevent memory corruption the earliest possible.	2022-02-14 20:10:43 +01:00
Willy Tarreau	0575d8fd76	DEBUG: pools: add new build option DEBUG_POOL_INTEGRITY When enabled, objects picked from the cache are checked for corruption by comparing their contents against a pattern that was placed when they were inserted into the cache. Objects are also allocated in the reverse order, from the oldest one to the most recent, so as to maximize the ability to detect such a corruption. The goal is to detect writes after free (or possibly hardware memory corruptions). Contrary to DEBUG_UAF this cannot detect reads after free, but may possibly detect later corruptions and will not consume extra memory. The CPU usage will increase a bit due to the cost of filling/checking the area and for the preference for cold cache instead of hot cache, though not as much as with DEBUG_UAF. This option is meant to be usable in production.	2022-01-21 19:07:48 +01:00
Willy Tarreau	148160b027	MINOR: pools: prepare pool_item to support chained clusters In order to support batched allocations and releases, we'll need to prepare chains of items linked together and that can be atomically attached and detached at once. For this we implement a "down" pointer in each pool_item that points to the other items belonging to the same group. For now it's always NULL though freeing functions already check them when trying to release everything.	2022-01-02 19:35:26 +01:00
Willy Tarreau	c16ed3b090	MINOR: pool: introduce pool_item to represent shared pool items In order to support batch allocation from/to shared pools, we'll have to support a specific representation for pool objects. The new pool_item structure will be used for this. For now it only contains a "next" pointer that matches exactly the current storage model. The few functions that deal with the shared pool entries were adapted to use the new type. There is no functionality difference at this point.	2022-01-02 19:35:26 +01:00
Willy Tarreau	8c4927098e	CLEANUP: pools: get rid of the POOL_LINK macro The POOL_LINK macro is now only used for debugging, and it still requires ifdefs around, which needlessly complicates the code. Let's replace it and the calling code with a new pair of macros: POOL_DEBUG_SET_MARK() and POOL_DEBUG_CHECK_MARK(), that respectively store and check the pool pointer in the extra location at the end of the pool. This removes 4 pairs of ifdefs in the middle of the code.	2022-01-02 12:44:19 +01:00
Willy Tarreau	799f6143ca	CLEANUP: pools: do not use the extra pointer to link shared elements This practice relying on POOL_LINK() dates from the era where there were no pool caches, but given that the structures are a bit more complex now and that pool caches do not make use of this feature, it is totally useless since released elements have already been overwritten, and yet it complicates the architecture and prevents from making simplifications and optimizations. Let's just get rid of this feature. The pointer to the origin pool is preserved though, as it helps detect incorrect frees and serves as a canary for overflows.	2022-01-02 12:44:19 +01:00
Willy Tarreau	4859984a5b	DOC: pool: document the purpose of various structures in the code The pools have become complex with the shared pools and the thread-local caches, and the purpose of certain structures is never easy to grasp. Let's add a bit of documentation there to save some long and painful analysis to those touching that area.	2022-01-02 12:44:19 +01:00
Willy Tarreau	690fa145ef	CLEANUP: pools: pools-t.h doesn't need to include thread-t.h This is probably a leftover from an older version to access MAX_THREADS.	2021-10-07 01:36:51 +02:00
Willy Tarreau	8de6dc9926	REORG: pools: move default settings to defaults.h There's no reason CONFIG_HAP_POOLS and its opposite are located into pools-t.h, it forces those that depend on them to include the file. Other similar options are normally dealt with in defaults.h, which is part of the default API, so let's do that.	2021-09-28 19:31:16 +02:00
Willy Tarreau	8715dec6f9	MEDIUM: pools: remove the locked pools implementation Now that the modified lockless variant does not need a DWCAS anymore, there's no reason to keep the much slower locked version, so let's just get rid of it.	2021-06-10 17:46:50 +02:00
Willy Tarreau	1526ffe815	CLEANUP: pools: remove now unused seq and pool_free_list These ones were only used by the lockless implementation and are not needed anymore.	2021-06-10 17:46:50 +02:00
Willy Tarreau	2a4523f6f4	BUG/MAJOR: pools: fix possible race with free() in the lockless variant In GH issue #1275, Fabiano Nunes Parente provided a nicely detailed report showing reproducible crashes under musl. Musl is one of the libs coming with a simple allocator for which we prefer to keep the shared cache. On x86 we have a DWCAS so the lockless implementation is enabled for such libraries. And this implementation has had a small race since day one: the allocator will need to read the first object's <next> pointer to place it into the free list's head. If another thread picks the same element and immediately releases it, while both the local and the shared pools are too crowded, it will be freed to the OS. If the libc's allocator immediately releases it, the memory area is unmapped and we can have a crash while trying to read that pointer. However there is no problem as long as the item remains mapped in memory because whatever value found there will not be placed into the head since the counter will have changed. The probability for this to happen is extremely low, but as analyzed by Fabiano, it increases with the buffer size. On 16 threads it's relatively easy to reproduce with 2MB buffers above 200k req/s, where it should happen within the first 20 seconds of traffic usually. This is a structural issue for which there are two non-trivial solutions: - place a read lock in the alloc call and a barrier made of lock/unlock in the free() call to force to serialize operations; this will have a big performance impact since free() is already one of the contention points; - change the allocator to use a self-locked head, similar to what is done in the MT_LISTS. This requires two memory writes to the head instead of a single one, thus the overhead is exactly one memory write during alloc and one during free; This patch implements the second option. 
A new POOL_DUMMY pointer was defined for the locked pointer value, allowing to both read and lock it with a single xchg call. The code was carefully optimized so that the locked period remains the shortest possible and that bus writes are avoided as much as possible whenever the lock is held. Tests show that while a bit slower than the original lockless implementation on large buffers (2MB), it's 2.6 times faster than both the no-cache and the locked implementation on such large buffers, and remains as fast or faster than all other implementations when buffers are 48k or higher. Tests were also run on arm64 with similar results. Note that this code is not used on modern libcs featuring a fast allocator. A nice benefit of this change is that since it removes a dependency on the DWCAS, it will be possible to remove the locked implementation and replace it with this one, that is then usable on all systems, thus significantly increasing their performance with large buffers. Given that lockless pools were introduced in 1.9 (not supported anymore), this patch will have to be backported as far as 2.0. The code changed several times in this area and is subject to many ifdefs which will complicate the backport. What is important is to remove all the DWCAS code from the shared cache alloc/free lockless code and replace it with this one. The pool_flush() code is basically the same code as the allocator, retrieving the whole list at once. If in doubt regarding what barriers to use in older versions, it's safe to use the generic ones. This patch depends on the following previous commits: - MINOR: pools: do not maintain the lock during pool_flush() - MINOR: pools: call malloc_trim() under thread isolation - MEDIUM: pools: use a single pool_gc() function for locked and lockless The last one also removes one occurrence of an unneeded DWCAS in the code that was incompatible with this fix. The removal of the now unused seq field will happen in a future patch.
Many thanks to Fabiano for his detailed report, and to Olivier for his help on this issue.	2021-06-10 17:46:50 +02:00
Willy Tarreau	1ab6c0bfd2	MINOR: pools/debug: slightly relax DEBUG_DONT_SHARE_POOLS The purpose of this debugging option was to prevent certain pools from masking other ones when they were shared. For example, task, http_txn, h2s, h1s, h1c, session, fcgi_strm, and connection are all 192 bytes and would normally be merged, but not with this option. The problem is that certain pools are declared multiple times with various parameters, which are often very close, and due to the way the option works, they're not shared either. Good examples of this are captures and stick tables. Some configurations have large numbers of stick-tables of pretty similar types and it's very common to end up with the following when the option is enabled: $ socat - /tmp/sock1 <<< "show pools" | grep stick - Pool sticktables (160 bytes) : 0 allocated (0 bytes), 0 used, needed_avg 0, 0 failures, 1 users, @0x753800=56 - Pool sticktables (160 bytes) : 0 allocated (0 bytes), 0 used, needed_avg 0, 0 failures, 1 users, @0x753880=57 - Pool sticktables (160 bytes) : 0 allocated (0 bytes), 0 used, needed_avg 0, 0 failures, 1 users, @0x753900=58 - Pool sticktables (160 bytes) : 0 allocated (0 bytes), 0 used, needed_avg 0, 0 failures, 1 users, @0x753980=59 - Pool sticktables (160 bytes) : 0 allocated (0 bytes), 0 used, needed_avg 0, 0 failures, 1 users, @0x753a00=60 - Pool sticktables (160 bytes) : 0 allocated (0 bytes), 0 used, needed_avg 0, 0 failures, 1 users, @0x753a80=61 - Pool sticktables (160 bytes) : 0 allocated (0 bytes), 0 used, needed_avg 0, 0 failures, 1 users, @0x753b00=62 - Pool sticktables (224 bytes) : 0 allocated (0 bytes), 0 used, needed_avg 0, 0 failures, 1 users, @0x753780=55 In addition to not being convenient, it can have important effects on the memory usage because these pools will not share their entries, so one stick table cannot allocate from another one's pool. This patch solves this by going back to the initial goal which was not to have different pools in the same list.
Instead of masking the MAP_F_SHARED flag, it simply adds a test on the pool's name, and disables pool sharing if the names differ. This way pools are not shared unless they're of the same name and size, which doesn't hinder debugging. The same test above now returns this: $ socat - /tmp/sock1 <<< "show pools" | grep stick - Pool sticktables (160 bytes) : 0 allocated (0 bytes), 0 used, needed_avg 0, 0 failures, 7 users, @0x3fadb30 [SHARED] - Pool sticktables (224 bytes) : 0 allocated (0 bytes), 0 used, needed_avg 0, 0 failures, 1 users, @0x3facaa0 [SHARED] This is much better. This should probably be backported, in order to limit the side effects of DEBUG_DONT_SHARE_POOLS being enabled in production.	2021-05-05 07:47:29 +02:00
Willy Tarreau	b8498e961a	MEDIUM: pools: make CONFIG_HAP_POOLS control both local and shared pools Continuing the unification of local and shared pools, now the usage of pools is governed by CONFIG_HAP_POOLS without which allocations and releases are performed directly from the OS using pool_alloc_nocache() and pool_free_nocache().	2021-04-19 15:24:33 +02:00
Willy Tarreau	207c095098	MINOR: pools: move the fault injector to __pool_alloc() Till now it was limited to objects allocated from the OS which means it had little use as soon as pools were enabled. Let's move it upper in the layers so that any code can benefit from fault injection. In addition this allows to pass a new flag POOL_F_NO_FAIL to disable it if some callers prefer a no-failure approach.	2021-04-19 15:24:33 +02:00
Willy Tarreau	53a7fe49aa	MINOR: pools: enable the fault injector in all allocation modes The mem_should_fail() call enabled by DEBUG_FAIL_ALLOC used to be placed only in the no-cache version of the allocator. Now we can generalize it to all modes and remove the exclusive test on CONFIG_HAP_NO_GLOBAL_POOLS.	2021-04-19 15:24:33 +02:00
Willy Tarreau	2d6f628d34	MINOR: pools: rename CONFIG_HAP_LOCAL_POOLS to CONFIG_HAP_POOLS We're going to make the local pool always present unless pools are completely disabled. This means that pools are always enabled by default, regardless of the use of threads. Let's drop this notion of "local" pools and make it just "pool". The equivalent debug option becomes DEBUG_NO_POOLS instead of DEBUG_NO_LOCAL_POOLS. For now this changes nothing except the option and dropping the dependency on USE_THREAD.	2021-04-19 15:24:33 +02:00
Willy Tarreau	d5140e7c6f	MINOR: pool: remove the size field from pool_cache_head Everywhere we have access to the pool so we don't need to cache a copy of the pool's size into the pool_cache_head. Let's remove it.	2021-04-19 15:24:33 +02:00
Willy Tarreau	9f3129e583	MEDIUM: pools: move the cache into the pool header Initially per-thread pool caches were stored into a fixed-size array. But this was a bit ugly because the last allocated pools were not able to benefit from the cache at all. As a work around to preserve performance, a size of 64 cacheable pools was set by default (there are 51 pools at the moment, excluding any addon and debugging code), so all in-tree pools were covered, at the expense of higher memory usage. In addition an index had to be calculated for each pool, and was used to access the pool cache head into that array. The pool index was not even stored into the pools so it was required to determine it to access the cache when the pool was already known. This patch changes this by moving the pool cache head into the pool head itself. This way it is certain that each pool will have its own cache. This removes the need for index calculation. The pool cache head is 32 bytes long so it was aligned to 64B to avoid false sharing between threads. The extra cost is not huge (~2kB more per pool than before), and we'll make better use of that space soon. The pool cache head contains the size, which should probably be removed since it's already in the pool's head.	2021-04-19 15:24:33 +02:00
Willy Tarreau	de749a9333	MINOR: pools: make the pool allocator support a few flags The pool_alloc_dirty() function was renamed to __pool_alloc() and now takes a set of flags indicating whether poisoning is permitted or not and whether zeroing the area is needed or not. The pool_alloc() function is now just a wrapper calling __pool_alloc(pool, 0).	2021-03-22 20:54:15 +01:00
Willy Tarreau	0bae075928	MEDIUM: pools: add CONFIG_HAP_NO_GLOBAL_POOLS and CONFIG_HAP_GLOBAL_POOLS We've reached a point where the global pools represent a significant bottleneck with threads. On a 64-core machine, the performance was divided by 8 between 32 and 64 H2 connections only because there were not enough entries in the local caches to avoid picking from the global pools, and the contention on the list there was very high. It becomes obvious that we need to have an array of lists, but that will require more changes. In parallel, standard memory allocators have improved, with tcmalloc and jemalloc finding their ways through mainstream systems, and glibc having upgraded to a thread-aware ptmalloc variant, keeping this level of contention here isn't justified anymore when we have both the local per-thread pool caches and a fast process-wide allocator. For these reasons, this patch introduces a new compile time setting CONFIG_HAP_NO_GLOBAL_POOLS which is set by default when threads are enabled with thread local pool caches, and we know we have a fast thread-aware memory allocator (currently set for glibc>=2.26). In this case we entirely bypass the global pool and directly use the standard memory allocator when missing objects from the local pools. It is also possible to force it at compile time when a good allocator is used with another setup. It is still possible to re-enable the global pools using CONFIG_HAP_GLOBAL_POOLS, if a corner case is discovered regarding the operating system's default allocator, or when building with a recent libc but a different allocator which provides other benefits but does not scale well with threads.	2021-03-05 08:30:08 +01:00
Willy Tarreau	daf8aa62a8	MINOR: pools: increase MAX_BASE_POOLS to 64 When not sharing pools (i.e. when building with -DDEBUG_DONT_SHARE_POOLS) we have about 47 pools right now, while MAX_BASE_POOLS is only 32, meaning that only the first 32 ones will benefit from a per-thread cache entry. This totally kills performance when pools are not shared (roughly -20%). Let's double the limit to gain some margin, and make it possible to set it as a build option. It might be useful to backport this to stable versions as they're likely to be affected as well.	2020-06-30 14:29:02 +02:00
Willy Tarreau	c03d7632a5	CLEANUP: pool: only include the type files from types pool-t.h was mistakenly including the full-blown includes for threads, lists and api instead of the types, and as such, CONFIG_HAP_LOCAL_POOLS and CONFIG_HAP_LOCKLESS_POOLS were not visible everywhere.	2020-06-29 10:11:24 +02:00
Willy Tarreau	ed891fda52	MEDIUM: memory: make local pools independent on lockless pools Till now the local pool caches were implemented only when lockless pools were in use. This was mainly due to the difficulties to disentangle the code parts. However the locked pools would further benefit from the local cache, and having this would reduce the variants in the code. This patch does just this. It adds a new debug macro DEBUG_NO_LOCAL_POOLS to forcefully disable local pool caches, and makes sure that the high level functions are now strictly the same between locked and lockless (pool_alloc(), pool_alloc_dirty(), pool_free(), pool_get_first()). The pool index calculation was moved inside the CONFIG_HAP_LOCAL_POOLS guards. This allowed to move them out of the giant #ifdef and to significantly reduce the code duplication. A quick perf test shows that with locked pools the performance increases by roughly 10% on 8 threads and gets closer to the lockless one.	2020-06-11 10:18:57 +02:00
Willy Tarreau	3646777a77	REORG: memory: move the pool type definitions to haproxy/pool-t.h This is the beginning of the move and cleanup of memory.h. This first step only extracts type definitions and basic macros that are needed by the files which reference a pool. They're moved to pool-t.h (since "pool" is more obvious than "memory" when looking for pool-related stuff). 3 files which didn't need to include the whole memory.h were updated.	2020-06-11 10:18:56 +02:00

44 Commits
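
The bucket commits in this log (06885aaea7, cdb711e42b, 9e5eb586b1) describe spreading a contended counter over a small array indexed by a hash of the object's pointer, and summing the entries only when a total is needed. The following is a minimal sketch of that idea and not HAProxy's actual code: the names demo_pool, demo_bucket_of_ptr(), demo_account_alloc(), demo_account_free() and demo_pool_used() are hypothetical, as are the 16-bucket size and the multiplicative hash, which are chosen only for illustration.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sizing for this sketch: 4 bits, i.e. 16 buckets. */
#define DEMO_BUCKET_BITS 4
#define DEMO_NB_BUCKETS  (1u << DEMO_BUCKET_BITS)

/* One counter per bucket, each on its own 64-byte cache line so that
 * threads updating different buckets do not share a line.
 */
struct demo_pool {
    struct {
        _Alignas(64) atomic_uint used;
    } buckets[DEMO_NB_BUCKETS];
};

/* Map a pointer to a bucket index. Hashing the pointer (and not the
 * thread ID) means the alloc and the free of a given object always hit
 * the same bucket, even if the object migrated between threads.
 * The multiplicative hash is an assumption made for this sketch.
 */
static inline unsigned int demo_bucket_of_ptr(const void *ptr)
{
    uint64_t x = (uint64_t)(uintptr_t)ptr;

    x *= 0x9E3779B97F4A7C15ULL;
    return (unsigned int)(x >> (64 - DEMO_BUCKET_BITS));
}

/* Account for one object being handed out. */
static inline void demo_account_alloc(struct demo_pool *p, const void *obj)
{
    unsigned int b = demo_bucket_of_ptr(obj);

    atomic_fetch_add_explicit(&p->buckets[b].used, 1, memory_order_relaxed);
}

/* Account for one object being released. */
static inline void demo_account_free(struct demo_pool *p, const void *obj)
{
    unsigned int b = demo_bucket_of_ptr(obj);

    atomic_fetch_sub_explicit(&p->buckets[b].used, 1, memory_order_relaxed);
}

/* Sum all buckets. This iterates, so it belongs on slow paths only. */
static inline unsigned int demo_pool_used(struct demo_pool *p)
{
    unsigned int total = 0;

    for (size_t i = 0; i < DEMO_NB_BUCKETS; i++)
        total += atomic_load_explicit(&p->buckets[i].used, memory_order_relaxed);
    return total;
}
```

The trade-off matches the one stated in the commit messages: updates touch a single bucket and rarely collide, while reading the total costs one pass over all buckets, which is why the number of buckets should stay small.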