[libc++][format] Improves escaping performance. (#88533)

The previous patch implemented
- P2713R1 Escaping improvements in std::format
- LWG3965 Incorrect example in [format.string.escaped] p3 for formatting of combining characters

These changes were correct, but had a size and performance penalty. This patch improves the size and performance of the previous patch. The performance is still worse than before, since a lookup may now require two property lookups instead of the single lookup needed before the paper was implemented. The changes give a tighter coupling between the Unicode data and the algorithm. Additional tests are added to notify about changes in future Unicode updates.

Before
```
-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
BM_ascii_escaped<char>           110704 ns       110696 ns         6206
BM_unicode_escaped<char>         101371 ns       101374 ns         6862
BM_cyrillic_escaped<char>         63329 ns        63327 ns        11013
BM_japanese_escaped<char>         41223 ns        41225 ns        16938
BM_emoji_escaped<char>           111022 ns       111021 ns         6304
BM_ascii_escaped<wchar_t>        112441 ns       112443 ns         6231
BM_unicode_escaped<wchar_t>      102776 ns       102779 ns         6813
BM_cyrillic_escaped<wchar_t>      58977 ns        58975 ns        11868
BM_japanese_escaped<wchar_t>      36885 ns        36886 ns        18975
BM_emoji_escaped<wchar_t>        115885 ns       115881 ns         6051
```

The first change is to manually encode the entire last area and make a manual exception for the 240 excluded entries. This reduced the table from 1077 to 729 entries and gave the following benchmark results.
```
-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
BM_ascii_escaped<char>           104777 ns       104776 ns         6550
BM_unicode_escaped<char>          96980 ns        96982 ns         7238
BM_cyrillic_escaped<char>         60254 ns        60251 ns        11670
BM_japanese_escaped<char>         44452 ns        44452 ns        15734
BM_emoji_escaped<char>           104557 ns       104551 ns         6685
BM_ascii_escaped<wchar_t>        107456 ns       107454 ns         6505
BM_unicode_escaped<wchar_t>       96219 ns        96216 ns         7301
BM_cyrillic_escaped<wchar_t>      56921 ns        56904 ns        12288
BM_japanese_escaped<wchar_t>      39530 ns        39529 ns        17492
BM_emoji_escaped<wchar_t>        108494 ns       108496 ns         6408
```

An entry in the table can only contain 2048 code points. Larger ranges are split into multiple entries, each covering a chunk of at most 2048 code points. To encode the entire Unicode code point range 21 bits are required. The manual part starts at 0x323B0; this means all entries in the table fit in 18 bits. This allows allocating 3 additional bits for the size of the range, so an entry can cover 16384 elements. With this size no range needs to be split into multiple chunks. This reduces the number of table elements from 729 to 711 and gives the following benchmark results.
```
-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
BM_ascii_escaped<char>           104289 ns       104289 ns         6619
BM_unicode_escaped<char>          96682 ns        96681 ns         7215
BM_cyrillic_escaped<char>         59673 ns        59673 ns        11732
BM_japanese_escaped<char>         41983 ns        41982 ns        16646
BM_emoji_escaped<char>           104119 ns       104120 ns         6683
BM_ascii_escaped<wchar_t>        104503 ns       104505 ns         6693
BM_unicode_escaped<wchar_t>       93426 ns        93423 ns         7489
BM_cyrillic_escaped<wchar_t>      54858 ns        54859 ns        12742
BM_japanese_escaped<wchar_t>      36385 ns        36384 ns        19259
BM_emoji_escaped<wchar_t>        105608 ns       105610 ns         6592
```
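
The packed entry layout and lookup described in the commit message can be sketched in Python. This is only an illustration under stated assumptions, not the libc++ implementation: `pack`, `needs_escape`, and the three-entry `TOY_TABLE` are hypothetical names and data, and `bisect_right` stands in for `std::ranges::upper_bound`. The constants 0x323B0 and E0100..E01EF are the values quoted in this commit for Unicode 15.1.0.

```python
from bisect import bisect_right

SIZE_BITS = 14                    # bits [0, 13]: size of the range
SIZE_MASK = (1 << SIZE_BITS) - 1  # 0x3fff, so up to 16384 elements per entry


def pack(lower: int, upper: int) -> int:
    """Pack an inclusive range into one 32-bit entry (hypothetical helper)."""
    offset = upper - lower
    assert 0 <= offset <= SIZE_MASK  # the range fits in the 14 size bits
    assert upper <= 0x3FFFF          # the code points fit in 18 bits
    return (lower << SIZE_BITS) | offset


# Toy table for illustration only; the real table has 711 entries.
TOY_TABLE = sorted([pack(0x00, 0x1F), pack(0x7F, 0xA0), pack(0xD800, 0xF8FF)])


def needs_escape(code_point: int, entries=TOY_TABLE) -> bool:
    # The small gap of variation selectors (Mn) inside the final area.
    if 0xE0100 <= code_point <= 0xE01EF:
        return False
    # Everything else from the start of the final area onwards is escaped.
    if code_point >= 0x323B0:
        return True
    # Find the last entry whose lower bound is <= code_point.
    i = bisect_right(entries, (code_point << SIZE_BITS) | SIZE_MASK)
    if i == 0:
        return False
    entry = entries[i - 1]
    upper = (entry >> SIZE_BITS) + (entry & SIZE_MASK)
    return code_point <= upper
```

Setting all size bits in the search key means an entry whose lower bound equals the code point always sorts at or before the key, so the element just before the insertion point is the only candidate range.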
2026-01-17 06:40:01 +08:00 · 2024-04-28 12:15:25 +02:00
parent 0a0f1f9f1d
commit e3dea5e341
4 changed files with 889 additions and 1124 deletions
--- a/libcxx/include/__format/escaped_output_table.h
+++ b/libcxx/include/__format/escaped_output_table.h
--- a/libcxx/include/format
+++ b/libcxx/include/format
@@ -193,6 +193,8 @@ namespace std {
 #include <__format/concepts.h>
 #include <__format/container_adaptor.h>
 #include <__format/enable_insertable.h>
+#include <__format/escaped_output_table.h>
+#include <__format/extended_grapheme_cluster_table.h>
 #include <__format/format_arg.h>
 #include <__format/format_arg_store.h>
 #include <__format/format_args.h>
--- a/libcxx/test/libcxx/utilities/format/format.string/format.string.std/escaped_output.pass.cpp
+++ b/libcxx/test/libcxx/utilities/format/format.string/format.string.std/escaped_output.pass.cpp
@@ -0,0 +1,102 @@
+//===----------------------------------------------------------------------===//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+// UNSUPPORTED: c++03, c++11, c++14, c++17, c++20
+// UNSUPPORTED: GCC-ALWAYS_INLINE-FIXME
+
+// <format>
+
+// Tests the properties of the Unicode escaped output table.
+// The libc++ algorithm has size and speed optimizations based on the properties
+// of Unicode. This means updating the Unicode tables is likely to break this
+// test. That is intentional: a failure requires validating whether the
+// assumptions of the size and speed optimizations are still valid.
+
+#include <algorithm>
+#include <numeric>
+#include <format>
+#include <cassert>
+
+// Contains the entries for [format.string.escaped]/2.2.1.2.1
+//   CE is a Unicode encoding and C corresponds to a UCS scalar value whose
+//   Unicode property General_Category has a value in the groups Separator (Z)
+//   or Other (C), as described by table 12 of UAX #44
+//
+// Separator (Z) consists of General_Category
+// - Zs Space_Separator,
+// - Zl Line_Separator,
+// - Zp Paragraph_Separator.
+//
+// Other (C) consists of General_Category
+// - Cc Control,
+// - Cf Format,
+// - Cs Surrogate,
+// - Co Private_Use,
+// - Cn Unassigned.
+inline constexpr int Zs = 17;
+inline constexpr int Zl = 1;
+inline constexpr int Zp = 1;
+inline constexpr int Z  = Zs + Zl + Zp;
+
+inline constexpr int Cc = 65;
+inline constexpr int Cf = 170;
+inline constexpr int Cs = 2'048;
+inline constexpr int Co = 137'468;
+inline constexpr int Cn = 824'718;
+inline constexpr int C  = Cc + Cf + Cs + Co + Cn;
+
+// This is the final part of the Unicode properties table:
+//
+// 31350..323AF  ; Lo # [4192] CJK UNIFIED IDEOGRAPH-31350..CJK UNIFIED IDEOGRAPH-323AF
+// 323B0..E0000  ; Cn # [711761] <reserved-323B0>..<reserved-E0000>
+// E0001         ; Cf #       LANGUAGE TAG
+// E0002..E001F  ; Cn #  [30] <reserved-E0002>..<reserved-E001F>
+// E0020..E007F  ; Cf #  [96] TAG SPACE..CANCEL TAG
+// E0080..E00FF  ; Cn # [128] <reserved-E0080>..<reserved-E00FF>
+// E0100..E01EF  ; Mn # [240] VARIATION SELECTOR-17..VARIATION SELECTOR-256
+// E01F0..EFFFF  ; Cn # [65040] <reserved-E01F0>..<noncharacter-EFFFF>
+// F0000..FFFFD  ; Co # [65534] <private-use-F0000>..<private-use-FFFFD>
+// FFFFE..FFFFF  ; Cn #   [2] <noncharacter-FFFFE>..<noncharacter-FFFFF>
+// 100000..10FFFD; Co # [65534] <private-use-100000>..<private-use-10FFFD>
+// 10FFFE..10FFFF; Cn #   [2] <noncharacter-10FFFE>..<noncharacter-10FFFF>
+//
+// It can be observed all entries in the range 323B0..10FFFF are in the
+// categories Cf, Co, Cn, except a small range with the property Mn.
+// In order to reduce the size of the table only the entries in the range
+// [0000, 323B0) are stored in the table. The entries in the range
+// [323B0, 10FFFF] use a hand-crafted algorithm.
+//
+// This means a number of entries are omitted
+inline constexpr int excluded = ((0x10FFFF - 0x323B0) + 1) - 240;
+
+inline constexpr int entries = Z + C - excluded;
+
+static constexpr int count_entries() {
+  return std::transform_reduce(
+      std::begin(std::__escaped_output_table::__entries),
+      std::end(std::__escaped_output_table::__entries),
+      0,
+      std::plus{},
+      [](auto entry) { return 1 + static_cast<int>(entry & 0x3fffu); });
+}
+static_assert(count_entries() == entries);
+
+int main(int, char**) {
+  for (char32_t c = 0x31350; c <= 0x323AF; ++c) // 31350..323AF  ; Lo # [4192]
+    assert(std::__escaped_output_table::__needs_escape(c) == false);
+
+  for (char32_t c = 0x323B0; c <= 0xE00FF; ++c) // 323B0..E00FF ; C
+    assert(std::__escaped_output_table::__needs_escape(c) == true);
+
+  for (char32_t c = 0xE0100; c <= 0xE01EF; ++c) // E0100..E01EF  ; Mn # [240]
+    assert(std::__escaped_output_table::__needs_escape(c) == false);
+
+  for (char32_t c = 0xE01F0; c <= 0x10FFFF; ++c) // E01F0..10FFFF; C
+    assert(std::__escaped_output_table::__needs_escape(c) == true);
+
+  return 0;
+}
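
The arithmetic behind the `static_assert` in the test above can be checked by hand. The following is a small sketch, assuming the General_Category counts and the range sizes quoted in the test's comments (Unicode 15.1.0); the variable names `tail`, `tail_size`, and so on are introduced here for illustration only.

```python
# General_Category counts copied from the constants in the test.
Z = 17 + 1 + 1                            # Zs + Zl + Zp
C = 65 + 170 + 2_048 + 137_468 + 824_718  # Cc + Cf + Cs + Co + Cn

# Sizes of the ranges listed in the test's comment for 323B0..10FFFF.
tail = [711_761, 1, 30, 96, 128, 240, 65_040, 65_534, 2, 65_534, 2]
tail_size = 0x10FFFF - 0x323B0 + 1
assert sum(tail) == tail_size             # the listed ranges cover the whole tail

# The 240 variation selectors (Mn) are not escaped, so they are not "excluded".
excluded = tail_size - 240
entries = Z + C - excluded

print(tail_size, excluded, entries)       # 908368 908128 56360
```

So the code points covered by the generated table, as summed by `count_entries()`, are expected to total 56360 for this Unicode version.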
--- a/libcxx/utils/generate_escaped_output_table.py
+++ b/libcxx/utils/generate_escaped_output_table.py
@@ -113,34 +113,41 @@ DATA_ARRAY_TEMPLATE = """
 /// table lacks a property, thus having more bits available for the size.
 ///
 /// The data has 2 values:
-/// - bits [0, 10] The size of the range, allowing 2048 elements.
-/// - bits [11, 31] The lower bound code point of the range. The upper bound of
-///   the range is lower bound + size.
+/// - bits [0, 13] The size of the range, allowing 16384 elements.
+/// - bits [14, 31] The lower bound code point of the range. The upper bound of
+///   the range is lower bound + size. Note the code expects code points that fit
+///   into 18 bits, instead of the 21 bits needed for the full Unicode range.
 _LIBCPP_HIDE_FROM_ABI inline constexpr uint32_t __entries[{size}] = {{
 {entries}}};

+/// Returns whether the code unit needs to be escaped.
+///
 /// At the end of the valid Unicode code points space a lot of code points are
 /// either reserved or a noncharacter. Adding all these entries to the
-/// lookup table would add 446 entries to the table (in Unicode 14).
-/// Instead the only the start of the region is stored, every code point in
-/// this region needs to be escaped.
-_LIBCPP_HIDE_FROM_ABI inline constexpr uint32_t __unallocated_region_lower_bound = 0x{unallocated:08x};
+/// lookup table would greatly increase the size of the table. Instead these
+/// entries are manually processed. In this large area of reserved code points,
+/// there is a small area of extended graphemes that should not be escaped
+/// unconditionally. This is also manually coded. See the generation script for
+/// more details.

-/// Returns whether the code unit needs to be escaped.
 ///
 /// \pre The code point is a valid Unicode code point.
 [[nodiscard]] _LIBCPP_HIDE_FROM_ABI constexpr bool __needs_escape(const char32_t __code_point) noexcept {{
-  // Since __unallocated_region_lower_bound contains the unshifted range do the
-  // comparison without shifting.
-  if (__code_point >= __unallocated_region_lower_bound)
+
+  // The entries in the gap at the end.
+  if(__code_point >= 0x{gap_lower:08x} && __code_point <= 0x{gap_upper:08x})
+     return false;
+
+  // The entries at the end.
+  if (__code_point >= 0x{unallocated:08x})
    return true;

-  ptrdiff_t __i = std::ranges::upper_bound(__entries, (__code_point << 11) | 0x7ffu) - __entries;
+  ptrdiff_t __i = std::ranges::upper_bound(__entries, (__code_point << 14) | 0x3fffu) - __entries;
  if (__i == 0)
    return false;

  --__i;
-  uint32_t __upper_bound = (__entries[__i] >> 11) + (__entries[__i] & 0x7ffu);
+  uint32_t __upper_bound = (__entries[__i] >> 14) + (__entries[__i] & 0x3fffu);
  return __code_point <= __upper_bound;
 }}
 """
@@ -245,28 +252,33 @@ def property_ranges_to_table(ranges: list[PropertyRange]) -> list[Entry]:

        while True:
            e = Entry(range.lower, range.upper - range.lower)
-            if e.offset <= 2047:
+            if e.offset <= 16383:
                result.append(e)
                break
-            e.offset = 2047
+            e.offset = 16383
            result.append(e)
-            range.lower += 2048
+            range.lower += 16384
    return result


 cpp_entrytemplate = "    0x{:08x} /* {:08x} - {:08x} [{:>5}] */"


-def generate_cpp_data(ranges: list[PropertyRange], unallocated: int) -> str:
+def generate_cpp_data(
+    ranges: list[PropertyRange], unallocated: int, gap_lower: int, gap_upper: int
+) -> str:
    result = StringIO()
    table = property_ranges_to_table(ranges)
+    # Validates all entries fit in 18 bits.
+    for x in table:
+        assert x.lower + x.offset < 0x3FFFF
    result.write(
        DATA_ARRAY_TEMPLATE.format(
            size=len(table),
            entries=",\n".join(
                [
                    cpp_entrytemplate.format(
-                        x.lower << 11 | x.offset,
+                        x.lower << 14 | x.offset,
                        x.lower,
                        x.lower + x.offset,
                        x.offset + 1,
@@ -275,6 +287,8 @@ def generate_cpp_data(ranges: list[PropertyRange], unallocated: int) -> str:
                ]
            ),
            unallocated=unallocated,
+            gap_lower=gap_lower,
+            gap_upper=gap_upper,
        )
    )

@@ -305,22 +319,28 @@ def generate_data_tables() -> str:

    data = compactPropertyRanges(sorted(properties, key=lambda x: x.lower))

-    # The last entry is large. In Unicode 14 it contains the entries
-    # 3134B..0FFFF 912564 elements
-    # This are 446 entries of 1325 entries in the table.
-    # Based on the nature of these entries it is expected they remain for the
-    # forseeable future. Therefore we only store the lower bound of this section.
-    #
-    # When this region becomes substantially smaller we need to investigate
-    # this design.
-    #
-    # Due to P2713R1 Escaping improvements in std::format the range
+    # The output table has two large entries at the end, with a small "gap"
    #   E0100..E01EF  ; Grapheme_Extend # Mn [240] VARIATION SELECTOR-17..VARIATION SELECTOR-256
-    # is no longer part of these entries. This causes an increase in the size
-    # of the table.
-    assert data[-1].upper == 0x10FFFF
+    # Based on Unicode 15.1.0:
+    # - Encoding all these entries in the table requires 1173 entries.
+    # - Manually handling these last two blocks reduces the size to 729 entries.
+    # This not only reduces the binary size, but also improves the performance
+    # by having fewer elements to search.
+    # The exact entries may differ between Unicode versions. When these numbers
+    # change the test needs to be updated too.
+    #   libcxx/test/libcxx/utilities/format/format.string/format.string.std/escaped_output.pass.cpp
+    assert (data[-2].lower) == 0x323B0
+    assert (data[-2].upper) == 0xE00FF
+    assert (data[-1].lower) == 0xE01F0
+    assert (data[-1].upper) == 0x10FFFF

-    return "\n".join([generate_cpp_data(data[:-1], data[-1].lower)])
+    return "\n".join(
+        [
+            generate_cpp_data(
+                data[:-2], data[-2].lower, data[-2].upper + 1, data[-1].lower - 1
+            )
+        ]
+    )


 if __name__ == "__main__":
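
The effect of raising the chunk limit in `property_ranges_to_table` from 2047 to 16383 can be illustrated in isolation. This is a simplified sketch, not the generator's code: `split_range` is a hypothetical helper that returns `(lower, offset)` tuples instead of `Entry` objects, and the BMP private use area E000..F8FF is used only as an example of a range larger than 2048 code points.

```python
def split_range(lower: int, upper: int, max_offset: int) -> list[tuple[int, int]]:
    """Split the inclusive range [lower, upper] into (lower, offset) chunks
    whose offset does not exceed max_offset."""
    chunks = []
    while True:
        offset = upper - lower
        if offset <= max_offset:
            chunks.append((lower, offset))
            return chunks
        # The range is too large for one entry: emit a full chunk and continue.
        chunks.append((lower, max_offset))
        lower += max_offset + 1


# 6400 code points: four entries with 11 size bits, one entry with 14 size bits.
old = split_range(0xE000, 0xF8FF, 2047)
new = split_range(0xE000, 0xF8FF, 16383)
print(len(old), len(new))  # 4 1
```

Both splits cover the same 6400 code points; only the number of table entries differs, which is where the reduction from 729 to 711 entries in the commit message comes from.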