[libc] [search] improve hsearch robustness (#73896)

Following up the discussion at
https://github.com/llvm/llvm-project/pull/73469#discussion_r1409593911
by @nickdesaulniers.

According to FreeBSD implementation
(https://android.googlesource.com/platform/bionic/+/refs/heads/main/libc/upstream-freebsd/lib/libc/stdlib/hcreate.c),
`hsearch` is able to handle the cases where the global table is not
properly initialized. To do this, FreeBSD actually allows hash table to
be dynamically resized. If the global table is uninitialized at the
first call, the table will be initialized with a minimal size; hence
subsequent insertion will be reasonable as the table grows
automatically.

This patch mimic such behaviors. More precisely, this patch introduces:

1. a full table iterator that scans each element in the table,
2. a resize routine that is automatically triggered whenever the load
factor is reached where it iterates the old table and insert the entries
into a new one,
3. more tests that cover the growth of the table.
This commit is contained in:
Schrodinger ZHU Yifan
2023-12-05 12:06:48 -05:00
committed by GitHub
parent 67f9b5ae7d
commit 86e99e11e5
14 changed files with 322 additions and 106 deletions

View File

@@ -62,3 +62,10 @@ Often the standard will imply an intended behavior through what it states is und
Ignoring Bug-For-Bug Compatibility
----------------------------------
Any long running implementations will have bugs and deviations from the standard. Hyrum's Law states that “all observable behaviors of your system will be depended on by somebody” which includes these bugs. An example of a long-standing bug is glibc's scanf float parsing behavior. The behavior is specifically defined in the standard, but it isn't adhered to by all libc implementations. There is a longstanding bug in glibc where it incorrectly parses the string 100er and this caused the C standard to add that specific example to the definition for scanf. The intended behavior is for scanf, when parsing a float, to parse the longest possibly valid prefix and then accept it if and only if that complete parsed value is a float. In the case of 100er the longest possibly valid prefix is 100e but the float parsed from that string is only 100. Since there is no number after the e it shouldn't be included in the float, so scanf should return a parsing error. For LLVM's libc it was decided to follow the standard, even though glibc's version is slightly simpler to implement and this edge case is rare. Following the standard must be the first priority, since that's the goal of the library.
Design Decisions
================
Resizable Tables for hsearch
----------------------------
The POSIX.1 standard does not delineate the behavior consequent to invoking hsearch or hdestroy without prior initialization of the hash table via hcreate. Furthermore, the standard does not specify the outcomes of successive invocations of hsearch absent intervening hdestroy calls. Libraries such as MUSL and Glibc do not apply checks to these scenarios, potentially leading to memory corruption or leakage. Conversely, FreeBSD's libc and Bionic automatically initialize the hash table to a minimal size if it is found uninitialized, and proceeding to destroy the table only if initialization has occurred. This approach also avoids redundant table allocation if an initialized hash table is already present. Given that the hash table starts with a minimal size, resizing becomes necessary to accommodate additional user insertions. LLVM's libc mirrors the approach of FreeBSD's libc and Bionic, owing to its enhanced robustness and user-friendliness. Notably, such resizing behavior itself aligns with POSIX.1 standards, which explicitly permit implementations to modify the capacity of the hash table.