(, what's wrong with you?)
In a 1991 retrospective on the history of C++, its creator Bjarne Stroustrup called the lack of a standard string type (and some other standard types) in C++ 1.0 the worst mistake he made in its development: "Their absence led to everyone reinventing the wheel and to an unnecessary diversity in the most fundamental classes."
In the introductory part, I want to briefly describe the current state of strings in C++, how we got to it, and why it is so. I will also describe the shortcomings of current implementations so that the solutions I use in my string library are clear.
Initially, C++ had no standard string type. Strings were handled the C way: a string is a pointer to an array of bytes terminated by a zero byte. Such strings have several drawbacks. The zero byte cannot appear inside the string, so they are unsuitable for binary data. The ownership and resource-management strategy is unclear. And the main drawback is that the length of the string has to be recomputed each time by iterating over all of its characters.
The origin of this solution is quite clear – from the time of the dinosaurs: just as dinosaurs were large, with small brains and short arms, so computers were large, memory was small, and strings were short. Saving memory on storing the length of a string was more important than losing time on repeatedly calculating the length.
Strings were first standardized as a class only in C++98: std::string appeared in the standard library alongside the STL and, like much of the STL, it got a very mixed reception from programmers.
And the first thing that comes to mind when improving C-strings is that you need to store the length of the string:
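The original listing is not reproduced here. A minimal sketch of such a type, with hypothetical names (`str_ref`, `same_text`) that are not taken from any library, could look like this:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical non-owning string: a pointer plus an explicitly stored length.
struct str_ref {
    const char* data;
    std::size_t length;
};

// Equality can bail out on the stored lengths before touching any characters.
inline bool same_text(str_ref a, str_ref b) {
    if (a.length != b.length) return false;
    return a.length == 0 || std::memcmp(a.data, b.data, a.length) == 0;
}
```

Because the length travels with the pointer, no function that only reads the string ever needs to scan for a terminating zero.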
With such a string, many algorithms can be optimized significantly. For example, when comparing two strings for equality, we need not even start comparing characters if the lengths differ. Moreover, this data is fully sufficient for all methods that do not modify the string. Also note that on modern 64-bit architectures such an object is cheap to pass to functions by value: both of its fields fit into registers (except on Windows, whose x64 calling convention passes a 16-byte struct by reference), which makes the optimizer's job easier.
Meanwhile, this solution only made it into the standard in C++17, in the form of std::string_view. Apparently only then could the committee be convinced that strings differ, and that using a single universal string object can, at the very least, hurt performance and also violates the principle of "don't pay for what you don't use". Why strings differ, and why one string type is not enough, is discussed just below.
The next question that arises with strings is resource ownership. Almost every major framework solved this problem on its own, reinventing the wheel. We have std::string, Qt has QString, MFC has CString, ATL has CAtlString, Folly has its own strings; in general there are "thousands of them", and any game engine starts with writing its own strings.
For resource management, many of these implementations used COW ("Copy On Write") to improve performance. The string object referred to a buffer, shared between several objects, that held the characters of the string together with a reference counter. This made creating a copy of a string fast: the characters were actually copied only when a string was modified.
But they all agreed on one thing: the string was always assumed to be mutable, that is, the characters in the string buffer could be modified.
Because of this, the COW approach died by C++11: every operation that could modify the characters of the string had to check whether the string referred to a shared buffer and, if so, copy the characters to another buffer. In a multithreaded environment you also have to check whether the old buffer now needs to be freed, and all of this has to be guarded with locks or atomics, which are not free either. Therefore, starting with C++11, std::string does not use COW, and each copy of a string object also copies all the characters of the string to another buffer.
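The check-then-copy step described above can be sketched as follows. This is a single-threaded illustration with a hypothetical `cow_text` class built on std::shared_ptr, not the code of any real string implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical copy-on-write wrapper: copies of the object share one buffer.
class cow_text {
    std::shared_ptr<std::string> buf_;

public:
    explicit cow_text(std::string s)
        : buf_(std::make_shared<std::string>(std::move(s))) {}

    // The implicit copy constructor only bumps the reference counter.

    const std::string& read() const { return *buf_; }

    bool shares_buffer_with(const cow_text& other) const {
        return buf_ == other.buf_;
    }

    // Every mutating operation must first make sure the buffer is not shared.
    void set_char(std::size_t index, char c) {
        if (buf_.use_count() > 1) {
            buf_ = std::make_shared<std::string>(*buf_);  // detach: copy the characters
        }
        buf_->at(index) = c;
    }
};
```

Even in this tiny form, each write pays for the sharing check; a thread-safe version would additionally have to make the check and the detach step safe against concurrent copies.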
Naturally, each new buffer requires a memory allocation, which implementations try to mitigate with SSO ("Small String Optimization"): the string object contains a small buffer of its own, and the characters of short strings are stored directly in it. The details depend on the implementation: some libraries fit up to 15 bytes in the string object, some up to 23. This optimization is also a double-edged sword: in some implementations it complicates moving a string, because an object that stores a pointer to its own internal buffer has to adjust that pointer.
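Whether a particular std::string currently keeps its characters inside the object can be probed with a small helper. The function name `stored_inside_object` is hypothetical, and what it reports for short strings depends on the standard library in use:

```cpp
#include <cassert>
#include <functional>
#include <string>

// Reports whether the characters live inside the std::string object itself.
inline bool stored_inside_object(const std::string& s) {
    const char* object_begin = reinterpret_cast<const char*>(&s);
    const char* object_end = object_begin + sizeof(std::string);
    // std::less and friends give a total order even for unrelated pointers.
    return std::less_equal<const char*>{}(object_begin, s.data()) &&
           std::less<const char*>{}(s.data(), object_end);
}
```

A string of 100 characters can never fit inside the object, so the helper returns false for it; for a string such as "hi" the common implementations are expected to return true, but that is not guaranteed by the standard.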
Without COW, the mutability of strings means that any initialization of a string object copies bytes. Let's look at this code:
(You can verify the truth of the comments at https://godbolt.org/z/51oKGWT5T )
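The listing itself is only available through the link above. As a rough, hypothetical illustration of the same effect (the function names are made up for this sketch), both initializing a std::string from a literal and passing a literal to a `const std::string&` parameter copy the characters:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

inline std::size_t use_text(const std::string& text) { return text.size(); }

inline bool literal_is_copied() {
    static const char literal[] =
        "a literal long enough to defeat any small-string buffer";
    std::string owned = literal;     // allocates a buffer and copies every character
    return owned.data() != literal;  // the string never points at the literal itself
}

inline std::size_t call_with_literal() {
    // A temporary std::string is constructed for the call and destroyed after it.
    return use_text("temporary");
}
```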
But if we do not need to modify the string anywhere further in the code, we are wasting resources on the allocation, on copying the characters, and on the string destructor. That is, I would like to have at least two kinds of strings, mutable and immutable, to tell the compiler explicitly that we are not going to modify the string. Or take a trivial example: we are parsing an incoming data buffer and need to check whether a certain piece of the buffer is equal to the string "hello" in "pure C++", i.e. without any memcmp or strcmp. Before the advent of string_view, it had to be done something like this:
Here the characters from the data buffer are first copied into the buffer of a temporary string, possibly with a memory allocation, and only then is the temporary string compared with "hello"; on top of that come the destructor and the stack-unwinding code in case of an exception.
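The original listing is not reproduced here. A sketch of the kind of check meant, with a hypothetical function name and parameters, would be:

```cpp
#include <cassert>
#include <string>

// Pre-C++17 style: build a temporary std::string just to compare it.
inline bool is_hello_copying(const char* start, const char* end) {
    return std::string(start, end) == "hello";
}
```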
With std::string_view instead of std::string, the C++ code barely changes:
However, the generated machine code changes significantly and reaches the level of hand-written C: it simply checks that end - start == 5 and then compares a piece of the original buffer with the string "hello" via memcmp (with -O2, via the constants 1819043176 ('hell') and 111 ('o')). No temporary object is created, no bytes are copied, there is no destructor and no stack unwinding for exceptions. You can verify this at https://godbolt.org/z/9fo188e7c
It would seem that, since string_view appeared in C++17, you can simply use it in the parameters of your functions instead of const std::string& and be happy. But there is a nuance: everything works fine only as long as we do not need to pass the string to a third-party C API. string_view does not guarantee null-termination, so its data() cannot be passed to such an API, and you still have to copy it into a std::string first. And since a std::string is needed anyway, it is more efficient to make the function parameter const std::string&, and then all parameters further down the chain become const std::string& again.
After initialization, the most frequent mutating operation on strings is most likely concatenation, either adding strings together or appending a string to a string. It is concatenation that can easily cause both poor performance when used carelessly and memory overhead even when used competently.
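The original listing is not reproduced here. A sketch of the string_view variant, again with a hypothetical function name, would be:

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// C++17 style: describe the piece of the buffer in place, then compare.
inline bool is_hello_view(const char* start, const char* end) {
    return std::string_view(start, static_cast<std::size_t>(end - start)) == "hello";
}
```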
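The null-termination problem can be seen in a small sketch. Here `c_api_length` is a hypothetical stand-in for a third-party C function, and `call_c_api` shows the copy that string_view forces on the caller:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>

// Stand-in for a third-party C API that expects a null-terminated string.
inline std::size_t c_api_length(const char* zero_terminated) {
    return std::strlen(zero_terminated);
}

// A view may end in the middle of a larger buffer, so copy it first
// to obtain a buffer that is guaranteed to end with a zero byte.
inline std::size_t call_c_api(std::string_view text) {
    return c_api_length(std::string(text).c_str());
}
```

For a view of the first five characters of "hello world", passing data() directly makes the C function read all 11 characters up to the literal's terminator, while the copying wrapper sees the intended 5.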
Consider this simple code ( https://godbolt.org/z/odx7W1Pv7 )
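The linked listing is not reproduced here. A hypothetical function of the same flavour, chaining several operator+ calls, might be:

```cpp
#include <cassert>
#include <string>

// Each operator+ yields a temporary std::string that the next one extends.
inline std::string greet_naive(const std::string& first, const std::string& last) {
    return "Hello, " + first + " " + last + "!";
}
```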
As we can see, both clang and GCC create several temporary objects into which the characters of the strings are copied in turn. As a result we get several extra allocations for intermediate buffers, and the characters are copied several extra times from one intermediate buffer to the next. Ideally, for better performance, this code needs to be rewritten like this:
Unfortunately, so far no compiler optimizes the first, simple code to the level of the second, and writing such code by hand every time is quite inconvenient. That is, again you pay for what you don't use. Memory overhead is also likely here: appending typically grows the string buffer geometrically (commonly doubling it), on the assumption that something may soon be appended again. Therefore, if the string will no longer be modified but its lifetime is not yet over (for example, it is a field of some class), do not forget to call shrink_to_fit on it.
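The original listing is not reproduced here. The usual hand-written form, sketched with a hypothetical function name, computes the total length first, reserves once, and then appends:

```cpp
#include <cassert>
#include <string>

inline std::string greet_reserved(const std::string& first, const std::string& last) {
    std::string result;
    // 7 characters for "Hello, ", 1 for the space, 1 for the exclamation mark.
    result.reserve(7 + first.size() + 1 + last.size() + 1);
    result += "Hello, ";
    result += first;
    result += ' ';
    result += last;
    result += '!';
    return result;
}
```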
Meanwhile, often the main scenario for using strings is just some preparation of the string by several modifications and concatenations, and then it is stored somewhere, no longer changing. In this case, the programmer usually knows approximately what size of strings is expected in this place, and could allocate a buffer for these intermediate modifications directly on the stack, resorting to dynamic allocation only when exceeding the size of this buffer. However, with the current implementation of strings, this is quite problematic, or inconvenient.
Let's summarize what we have at the moment:
- std::string.
- std::string_view, but it does not solve the issues of string ownership, so in fact it is only suitable as a type for passing parameters to functions that do not change strings, with the caveat that it cannot be used in functions that call a C API, since it does not guarantee null-termination.
- std::string, even though it is a class for strings, offers extremely meager functionality for working with them compared to what people are used to in other languages. For example, there is no replacement of substrings by a pattern (in other languages this is usually replace, but in C++ this function does something completely different), and no trim, split, join, upper, lower, etc. These functions have to be written by hand every time, and not everyone will manage to do it optimally.
I hope that after this small introduction you will better understand what problems I solved with my string library and how.
Strictly speaking, you cannot say that I "reinvented my own string class". As I showed earlier, it is difficult, or even impossible, to write one single string class that suits all usage scenarios well. That is why I have not a string class but a string library, which contains several different string types, from simpler to more complex. Each has its own strengths and weaknesses, and the user needs to choose carefully which of these classes to use in which case.
I started developing the library little by little back in 2011-2012, when we already had move semantics but did not yet have std::string_view. Today, however, the minimum standard version required by the library is C++20: concepts and <format> are used.
First, I will talk about the library classes for the strings themselves, and then about how the string concatenation problem is optimally solved in it.
Several general points:
std::towupper, std::towlower for the Unicode locale, only faster and can work with any type of characters. If you need strict work with Unicode, use other tools, such as ICU.
simple_str :)
The class simply represents a pointer to the beginning of a constant string and its length, in essence the same as std::string_view. It is intended for working with immutable strings without owning them, so you must make sure that the real string represented through simple_str stays alive while it is in use.
Implements all string methods that do not modify the string.
Aliases:
- ssa for simple_str<char>
- ssu for simple_str<char16_t>
- ssw for simple_str<wchar_t>
- ssuu for simple_str<char32_t>
It is used mainly for passing strings as a parameter to functions that do not modify the passed string, instead of const std::string&, as well as for local variables when working with parts of strings.
simple_str_nt
In terms of structure and purpose, it coincides with simple_str, but guarantees null-termination of the string. That is, if the function needs to pass the passed parameter further as a C-string to some API without changes, it should use the simple_str_nt type for the parameter. All classes of owning strings (simstr::sstring, simstr::lstring) can be converted to simple_str_nt, since they store strings with a trailing zero. This allows you to write functions with a single parameter type that accepts any type of owning string objects as input.
Aliases:
- stra for simple_str_nt<char>
- stru for simple_str_nt<char16_t>
- strw for simple_str_nt<wchar_t>
- struu for simple_str_nt<char32_t>
Can be initialized with string literals:
In this case, the length is calculated immediately at compile time. Similarly, simple_str_nt is created using operator""_ss:
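The library's own listings are not reproduced here. The general technique of taking the length from a literal at compile time can be sketched with a stand-in type; `lit_view` and the `_lv` suffix are hypothetical names for this illustration and are not the library's simple_str_nt or `_ss`:

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for a null-terminated view constructed from a string literal.
struct lit_view {
    const char* data;
    std::size_t length;

    // The array size N is part of the literal's type, so no strlen is needed.
    template <std::size_t N>
    constexpr lit_view(const char (&literal)[N]) : data(literal), length(N - 1) {}

    constexpr lit_view(const char* p, std::size_t n) : data(p), length(n) {}
};

// User-defined literal: the compiler supplies the length directly.
constexpr lit_view operator""_lv(const char* p, std::size_t n) {
    return lit_view(p, n);
}

static_assert(lit_view("hello").length == 5, "length is known at compile time");
static_assert(("hello"_lv).length == 5, "same for the literal suffix");
```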
sstring is a class that stores an immutable string: you can only assign a whole string to it, and you cannot modify the characters of the stored string.
Owns the string, manages the memory for the characters of the string. Stores a trailing zero with the strings, and can be a source for simple_str_nt, for passing to C-API. Like simple_str, it implements all methods that do not modify the string.
Aliases:
- stringa for sstring<char>
- stringu for sstring<char16_t>
- stringw for sstring<wchar_t>
- stringuu for sstring<char32_t>
The fact that the stored string is immutable allows you to apply a number of optimizations:
The class also uses SSO – Small String Optimization. Short strings are placed inside the object itself in an internal buffer.
Sizes:
For 64 bits:
- stringa: class 24 bytes, SSO up to 23 characters.
- stringu: class 32 bytes, SSO up to 15 characters.
- stringuu: class 32 bytes, SSO up to 7 characters.
For 32 bits:
- stringa: class 16 bytes, SSO up to 15 characters.
- stringu: class 24 bytes, SSO up to 11 characters.
- stringuu: class 24 bytes, SSO up to 5 characters.
lstring is a class that stores a string and allows it to be modified. It owns the string and manages the memory for its characters. It stores a trailing zero with the string and can be a source for simple_str_nt, for passing to a C API. Like all other classes, it implements all methods that do not modify the string.
The size of the internal buffer for storing characters is specified as N in the template parameter. Strings up to N characters long are stored inside the object, and when this number is exceeded, a dynamic buffer is allocated, in which the characters are saved. When copying an object, all characters are also always copied.
If forShare == true and the characters do not fit into the local buffer, then a dynamic buffer is created with additional space, so that it matches the structure of the sstring buffer. Then, when moving lstring to sstring – only the pointer will move to the buffer, without unnecessary copying of characters.
This class is convenient for working with strings as a local variable on the stack. Usually we know the approximate size of the strings we will be working with, so we can create a local string with a buffer on the stack and work with it. There is no need to fear a buffer overflow, because in that case the string switches to a dynamic buffer.
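The idea of a local buffer with a fallback to dynamic memory can be sketched as follows. `local_text` is a hypothetical, append-only stand-in written for this illustration; it is not the library's lstring and omits copying, moving and the forShare layout:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <string_view>
#include <utility>

template <std::size_t N>
class local_text {
    char local_[N + 1] = {};         // in-object buffer, plus room for the zero
    std::unique_ptr<char[]> heap_;   // used only after the local buffer overflows
    std::size_t size_ = 0;
    std::size_t capacity_ = N;

    char* buffer() { return heap_ ? heap_.get() : local_; }

public:
    local_text() = default;
    local_text(const local_text&) = delete;             // kept out of the sketch
    local_text& operator=(const local_text&) = delete;

    void append(std::string_view s) {
        if (s.empty()) return;
        const std::size_t needed = size_ + s.size();
        if (needed > capacity_) {
            // Switch to (or grow) a dynamic buffer instead of overflowing.
            const std::size_t grown = std::max(needed, capacity_ * 2);
            auto bigger = std::make_unique<char[]>(grown + 1);
            std::memcpy(bigger.get(), buffer(), size_);
            heap_ = std::move(bigger);
            capacity_ = grown;
        }
        std::memcpy(buffer() + size_, s.data(), s.size());
        size_ = needed;
        buffer()[size_] = '\0';
    }

    const char* c_str() const { return heap_ ? heap_.get() : local_; }
    std::size_t size() const { return size_; }
    bool on_heap() const { return heap_ != nullptr; }
};
```

With `local_text<8>`, appending "abc" stays in the local buffer, and appending nine more characters moves the text to a dynamic buffer without losing what was already accumulated.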
Aliases:
- lstringa<N=16> for lstring<char, N, false>
- lstringu<N=16> for lstring<char16_t, N, false>
- lstringw<N=16> for lstring<wchar_t, N, false>
- lstringuu<N=16> for lstring<char32_t, N, false>
- lstringsa<N=16> for lstring<char, N, true>
- lstringsu<N=16> for lstring<char16_t, N, true>
- lstringsw<N=16> for lstring<wchar_t, N, true>
- lstringsuu<N=16> for lstring<char32_t, N, true>
A small example of use with explanations:
In this example, you probably noticed how strings are concatenated and wondered – how was the length of the entire result calculated with two additions in order to allocate the necessary space at once, without intermediate buffers?
The answer to this question:
The fact is that there is no addition of string objects as such in the library. Addition is performed for "string expressions".
A string expression is any object of arbitrary type that has length and place functions. The length function returns the length of the string, and the place function places the characters of the string into the buffer passed to it.
Any owning string (simstr::sstring, simstr::lstring) can be initialized with a string expression — it requests its length, allocates space for storing characters, and passes this space to the string expression, calling its place function.
A template addition function is defined for string expressions:
strexprjoin is a template type that is itself a string expression. It stores references to the two string expressions passed to it. When the length is requested, it returns the sum of the lengths of the two string expressions, and when placing characters, it first places the first expression in the passed buffer, then the second.
Thus, the addition operation of string expressions creates an object that is also a string expression, to which the next addition operation can also be applied, and which recursively stores references to the component parts, each of which knows its size and knows how to place itself in the result buffer. And so on, to each resulting string expression, you can reapply operator +, forming a chain of several string expressions, and eventually "materialize" the last resulting object, which first calculates the size of the entire total memory for the final result, and then places the nested subexpressions into one buffer.
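The mechanism described above can be sketched in a few dozen lines. Everything in the `sketch` namespace is hypothetical and written for this illustration: `join_expr` plays the role the text gives to strexprjoin, and the exact signature of `place` (returning the position just past the written characters) is an assumption, not the library's documented interface:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>
#include <utility>

namespace sketch {

// Leaf expression: refers to existing characters without copying them.
struct text {
    std::string_view chars;
    std::size_t length() const { return chars.size(); }
    char* place(char* out) const {
        if (!chars.empty()) std::memcpy(out, chars.data(), chars.size());
        return out + chars.size();
    }
};

// Sum of two expressions; stores only references to its operands.
template <typename A, typename B>
struct join_expr {
    const A& left;
    const B& right;
    std::size_t length() const { return left.length() + right.length(); }
    char* place(char* out) const { return right.place(left.place(out)); }
};

// operator+ participates only for operands that provide place(char*).
template <typename A, typename B,
          typename = decltype(std::declval<const A&>().place(static_cast<char*>(nullptr))),
          typename = decltype(std::declval<const B&>().place(static_cast<char*>(nullptr)))>
join_expr<A, B> operator+(const A& a, const B& b) {
    return {a, b};
}

// "Materialize": ask for the total length once, allocate once, place once.
template <typename Expr>
std::string materialize(const Expr& e) {
    std::string result(e.length(), '\0');
    e.place(result.data());
    return result;
}

}  // namespace sketch
```

A call such as `sketch::materialize(sketch::text{"ab"} + sketch::text{"cd"} + sketch::text{"e"})` builds a nested `join_expr`, asks it for the total length of 5, and fills a single buffer. The whole chain must be consumed within one full expression, because the nodes only hold references to temporaries.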
All string types in the library are themselves string expressions, that is, they can serve as terms in concatenations of string expressions.
Also, operator+ is defined for string expressions and string literals, string expressions and numbers (numbers are converted to decimal representation), and you can add the desired types yourself.
Example:
There are several types of string expressions "out of the box" for performing various operations on strings:
Returns a string of length NumberOfCharacters, filled with the specified character. The number of characters and the symbol are compile-time constants. For some cases, there is a shorthand notation:
e_spca(NumberOfCharacters) - string of char spaces
e_spcw(NumberOfCharacters) - string of wchar_t spaces
Returns a string of length NumberOfCharacters, filled with the specified character. The number of characters and the symbol can be specified at runtime. Shorthand notation:
e_c(NumberOfCharacters, Symbol)
If Condition == true, the result will be StrExpr1, otherwise StrExpr2.
If Condition == true, the result will be StrExpr1, otherwise an empty string.
Converts a number to decimal representation. Rarely used, since the "+" operator is overloaded for string expressions and numbers, and the number can simply be written as text + number.
Converts a number to decimal representation. Rarely used, since the "+" operator is overloaded for string expressions and numbers, and the number can simply be written as text + number.
Concatenates all strings in the container, using a separator. If AfterLast == true, then the separator is added after the last element of the container as well, otherwise only between elements. If OnlyNotEmpty == true, then empty strings are skipped without adding a separator.
Replaces occurrences of "Search" in the original string with "Replace". Search and replace patterns are compile-time string literals.
Replaces occurrences of Search in the original string with Replace. Search and replace patterns can be any string objects at runtime.
Returns an empty string. Abbreviated notation — eea, eeu, eew, eeuu. Used if the string formation starts with a number and a string literal:
since the addition operator is only defined for adding a string expression and a number. I also note that there is operator""_ss, which turns a string literal into a simple_str_nt object, which is already a string expression:
You can create your own string expression types to form strings optimally for your specific purposes and algorithms. To do this, simply create a type with length and place methods and a symb_type member type. Examples of creation and use from real projects:
Usage:
Another example
Usage:
And more
Usage
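The original listings are not reproduced here. As a stand-in, here is a hypothetical expression type, `repeat_expr`, that has the three members named above. The signature of `place` and the `build` helper, which mimics what an owning string does when initialized from an expression, are assumptions made for this sketch and are not taken from the library:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>

// Hypothetical expression: `piece` repeated `count` times.
struct repeat_expr {
    using symb_type = char;
    std::string_view piece;
    std::size_t count;

    std::size_t length() const { return piece.size() * count; }

    // Assumed contract: write at `out`, return the position just past the output.
    symb_type* place(symb_type* out) const {
        for (std::size_t i = 0; i < count; ++i) {
            if (!piece.empty()) std::memcpy(out, piece.data(), piece.size());
            out += piece.size();
        }
        return out;
    }
};

// Stand-in for an owning string: query the length, allocate once, then place.
template <typename Expr>
std::basic_string<typename Expr::symb_type> build(const Expr& e) {
    std::basic_string<typename Expr::symb_type> result(
        e.length(), typename Expr::symb_type{});
    e.place(result.data());
    return result;
}
```

The expression itself never allocates; the only allocation happens in `build`, once the final length is known.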
ATTENTION: usually the fields in string expression objects are references to the source data. And these references almost always lead to local or temporary objects. Therefore, it is extremely risky to return string expressions from functions — you need to check a hundred times that they do not contain references to local or temporary variables. Make it a rule — you can easily pass string expressions to functions, and it is dangerous to return them from functions. It is better to materialize a string expression into a string object containing the final string when returning. If desired, the type of the returned string can be specified by a template parameter.
chunked_string_builder is designed for concatenating multiple strings. When you need to form a long text sequentially from many small pieces (for example, when building an HTML response), appending everything to one string object is extremely suboptimal: there will be many reallocations and repeated copying of the characters already accumulated. In this case it is convenient to use chunked_string_builder. All it can do is append a string to the accumulated characters. However, it does this not in a single contiguous memory buffer but in separate buffers, each no smaller than the specified alignment. When the current buffer is full, it simply creates another buffer and continues to add data to it.
For example, suppose you set the alignment to 1024, added several strings that filled 100 characters of the buffer, and now add a string 3000 characters long. In this case, 924 characters will be copied to the first buffer, filling it to the end. For the remaining 2076, a buffer of 3072 characters will be created, and they will be copied into it, leaving space for 996 characters. Thus, each buffer is filled to the end in turn and has a size that is a multiple of the specified alignment. This avoids reallocations and copying of characters that have already been processed.
After the final filling, you can work with the accumulated data: merge all the buffers into one contiguous string (whose buffer size you now know), iterate over them separately, for example sending these buffers to the network, or copy the data sequentially into a buffer of a given size.
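The arithmetic of this example can be reproduced with a small sketch. `chunked_builder_sketch` is a hypothetical stand-in written for this illustration, following only the behavior described in the text; it is not the library's chunked_string_builder:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

struct chunk {
    std::string text;       // characters stored in this chunk so far
    std::size_t capacity;   // chunk size, a multiple of the alignment
};

class chunked_builder_sketch {
    std::size_t align_;
    std::vector<chunk> chunks_;

public:
    explicit chunked_builder_sketch(std::size_t align) : align_(align) {
        assert(align > 0);
    }

    void append(std::string_view s) {
        // First fill the free tail of the last chunk.
        if (!chunks_.empty()) {
            chunk& last = chunks_.back();
            const std::size_t room = last.capacity - last.text.size();
            const std::size_t n = std::min(room, s.size());
            last.text.append(s.substr(0, n));
            s.remove_prefix(n);
        }
        if (s.empty()) return;
        // Put the rest into a new chunk, rounded up to a multiple of the alignment.
        chunk next;
        next.capacity = (s.size() + align_ - 1) / align_ * align_;
        next.text.reserve(next.capacity);
        next.text.append(s);
        chunks_.push_back(std::move(next));
    }

    const std::vector<chunk>& chunks() const { return chunks_; }

    // Merge everything into one string; the total size is known in advance.
    std::string merge() const {
        std::size_t total = 0;
        for (const chunk& c : chunks_) total += c.text.size();
        std::string result;
        result.reserve(total);
        for (const chunk& c : chunks_) result += c.text;
        return result;
    }
};
```

With an alignment of 1024, appending 100 characters and then 3000 characters leaves the first chunk full at 1024 characters and the second chunk holding 2076 of its 3072 characters, matching the numbers above.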