The D language includes the concept of the slice. Matthew shows how the "View" concept can help to model this in C++.
January 01, 2006
URL:http://www.drdobbs.com/a-view-to-a-string-part-i/184402064
Matthew Wilson is a software-development consultant for Synesis Software, creator of the STLSoft libraries, and author of Imperfect C++ (Addison-Wesley, 2004) and Extended STL (Addison-Wesley, 2006). He can be contacted at http://imperfectcplusplus.com/.
[This month's column is, in part, an extract from Matthew's forthcoming book on STL extension, called Extended STL, which will be published by Addison-Wesley in 2006.]
This month I'm continuing the theme of how other languages may influence the idioms of C++. Where the last installment examined how Ruby's flexible subscripting might be emulated in C++, in this installment and the next I'm going to be considering how D's [1] notion of a slice can be applied in C++, in particular to strings. I'll be discussing, briefly, the "view" concept and its refined conceptthe slice view. I'll then introduce the STLSoft basic_string_view class template, and discuss its implementation and contracts. Finally, I'll discuss the contexts in which the use of string views is appropriate, along with those where it's not.
The concept of a view is best known from the field of relational databases, where a view is an alternative way of looking at data from one or more tables, with the following characteristics:
The three primary purposes of such views are:
There has been much interest in adapting the view concept to C++, including the work of Jon Seymour [2], the View Template Library (VTL) by Gary Powell and Martin Weiser [3], and the work of Maciej Sobczak [4]. My own work on views inhabits a lower level; its primary purpose is the raison d'etre of C++: efficiency. The additional purposes of views in C++ are legion, including:
Views can have a bewildering variety of forms and functions, including aggregating, filtering, linearizing, polymorphic, sorting, reversing, restructuring, and transforming, to name but a few. For these two installments, I'm going to focus on only the slice view.
In D, a slice provides a view onto a section of an array. Consider the following code:
import std.stdio; void main() { char[] s = "This is a null-terminated string".dup; char[] sl = s [0 .. 4]; writef("[%s]\n", sl); // Prints "[This]" s[2 .. 4] = "at"; writef("[%s]\n", sl); // Prints "[That]" }
A slice is declared as type[], and is internally represented as a length plus a pointer. Hence, the creation of the slice sl does not involve a copy of the source string s, it merely "remembers" the pointer to the start of the literal string (&s[0]) and the length 4. When the contents of s are changed in the assignment of the string "at", the changes are reflected in any subsequent use of the slice sl. (Note that s is a duplicate of the literal string, which, as in C++, is read-only. An attempt to write to a [slice of a] literal will result in an access violation.) The C++ equivalent of a D slice is:
template <typename T> struct slice { size_t len; T *ptr; };
(A slice could also be represented as a pair of pointers. Although I've dallied with such a representationin the recls librariesI consider it to be an inferior representation to a length+pointer. This is because one tends to want to know the size more often than the "end" position. The adopted way also affords direct compatibility with D, and makes writing extensions for Ruby and other languages a little bit easier.)
Obviously, the previous template isn't exactly bristling with functionality, nor does it facilitate nonmutating (that is, const) access; something that the D language avoids by not supporting const as we know it.
To make such a type useful for C++ would require adding methods, such as constructors, assignment operators, and accessors; and providing support for iteration. Such a type exists in the STLSoft libraries, in the form of the array_view class template, which presents a (nonmutating) std::vector-like interface. There is also a more complex class template, basic_string_view, which provides more functionality to support a (nonmutating) std::basic_string-like interface, and which is the subject of this month's and next month's columns.
Slice views can be very simple. They consist of constructors, begin() and end(), rbegin() and rend(), a subscript operator, size(), and, importantly, base(), which returns the slice view's base pointer.
The two main advantages of slice views are logical coherency and physical coherency. Logical coherency means that views always reflect the up-to-date contents of the underlying collection.
The latter advantage, physical coherency, translates into greater efficiency. It has two aspects: First, there is a reduction in memory usage because there are fewer, if any, copies made. Second, because slices don't do deep copies, the cost of copying slice instances is very low. We'll see demonstrations of these efficiency advantages in several different contexts in the second installment.
Further to these main advantages, the view-class templates, by virtue of their emulation of standard interfaces, also provide collection type adaptation. In other words, using basic_string_view allows slices of character arrays to be used in place of std::basic_string.
The slice view (also known as array view) is the only view concept that guarantees (and relies on) contiguous storage. This is analogous to the situation with std::vector, which is the only standard container that guarantees contiguous storage. As a consequence, slice views (and vectors) are compatible with C APIs.
The fact that slice views are always up to date with respect to their underlying storage is, as we will see in the next installment, a great boon, but it also carries risks, and imposes limitations on their use. Simply put, a slice view, unlike the standard library containers (including basic_string), is not a value type. In Imperfect C++ [5] I define a value type as including these aspects:
Slice views fulfill the first of these criteria, but not the second, because if one view is made as a copy of another, and the underlying storage changes, then the logical states (that is, the contents) of both views are affected: They are not independent.
Thus, one must be careful with views to not use them when the link to their underlying storage is, or could be, broken. For example, if one takes a string view onto the contents of a mutable container, such as a std::string, the view can be invalidated by mutating members in just the same way that an iterator can be:
std::string s("Short string"); stlsoft::string_view sv(s.data(), s.size()); std::string::iterator it = s.begin(); s = "Longer string"; // sv and it both invalidated
Obviously, holding a view onto a string after that string is deleted is pretty fatal. Therefore, string views are best used in contexts where their view nature is irrelevant (temporaries/single expressions) or in contexts where their nature is overt (split functions). We'll investigate these contexts in the next installment. Not being a value type also has implications for the string-view semantics, so let's now look at the definition of the class.
The STLSoft basic_string_view class template provides an interface that corresponds to the standard basic_string to a significant degree, as shown in Listing 1. The notable features of basic_string_view are:
As well as the length (m_length) and pointer (m_base) member variables that constitute the slice, there is a third member, m_cstr. This is for the implementation of the c_str() method, which will be discussed shortly. Given the simplicity of the slice view concept, the bulk of the implementation of the class is very straightforward, and I won't belabor you with the details: The constructors, copy assignment operator, swap(), size(), max_size(), length(), compare(), front(), back(), begin(), end(), rbegin(), rend(), and copy() methods all have exactly the same semantics as they do in basic_string. The seven constructors are all eminently straightforward: The default constructor sets the slice members m_length to 0 and m_base to NULL; the other constructors initialize the members as appropriate to their arguments. The m_cstr member is set to NULL in all cases. The destructor need do nothing with the slice members, but releases any memory buffer pointed to by m_cstr.
Because slice views, by definition, present information that they do not themselves manage, they sit somewhat uncomfortably on the border of the value-type domain. Hence, the other methods have interesting, and sometimes surprising, semantics: The implementations of capacity(), clear(), refresh(), equal(), c_str(), base(), data(), is_valid(), and empty_string_() will be examined in the remainder of the article, including an examination of the contract programming aspects of the class.
capacity() returns the same values as size() and length(). This means that code that might test capacity() against length() will (always) receive a meaningful result of 0 of the spare capacity in the string instance.
The clear() and refresh() methods are closely related. Several standard containers provide the clear() method, which means "empty out all contained elements." For the string view, this means setting the pointer to NULL and the length to 0. But it also means deallocating any memory used. The refresh() method leaves the logical state of the string view untouched, and performs only the latter action: resetting the pointer. We'll see why this is so shortly.
equal() is provided as an optimization because it can immediately return true if the internal pointers and lengths are identical, or immediately return false if the lengths differ.
base() is a method from the slice view conceptthat returns the pointer part of the slice. Unless modified by clear(), swap(), or operator =(), whatever was passed to the constructor will be returned by this method.
Because the string view may look onto a slice that does not end in a null terminator, it's obvious that c_str() cannot simply return m_base. Further, even if, at the time of construction, the element of the source string at m_base + m_length is a null terminatorwhich we can't test for anyway, because it might be in an uncommitted page of memory!the underlying storage can change at any time, without notification of the string view(s) mapping to it.
So, if we are to provide c_str(), we must allocate a buffer of length 1 + m_length (into m_cstr), and copy in the contents and add a null terminator. Naturally, once we've done this, it's possible that the underlying storage will change and render the allocated copy out of date. There are six options to handle this:
Unlike std::basic_string, there are no assign() methods. Only two methods exist for modifying a view after its construction: copy assignment via the copy-assignment operator, and swapping via swap(). It would be easy to implement a nearly full-complement of efficient assignment functions in terms of the constructors, but I deliberately did not do this; as it stands, the interface limits the ease with which the user can forget that string views are not value types.
Although it might not seem obvious, data() is one of the most important methods of the string view. The Standard prescribes, for basic_string, that data() return a nonNULL pointer to the first character in an array representing the contents of the string. Note that the pointer returned by data() does not have to be null terminated; this is, unlike c_str(), which must return a nonNULL pointer to a null terminated array.
The obvious implementation for data() would be simply to return m_base, but there are two reasons why we can't do this. First, it's common for functions that take length and pointer to ignore the pointer if the length is 0. Similarly, the behavior of the string-view constructor is well defined if the pointer is NULL. One appealing implementation option would be to initialize m_base to be the empty_string_() in the case where the size specified to a constructor is 0. However, slice views need to be able to provide, via base() and size(), the same description of the slice passed to the constructor, supporting cases (a) and (b) in Figure 1. By assigning m_base to empty_string_() when size is 0case (c) in Figure 1the required slice semantics would not be supported. Thus, the implementation of basic_string_view::data() is as shown in Listing 5.
This represents a slight performance cost for each access, but cannot be avoided, given the semantic constraints already outlined. If you don't care about NULL or are confident you won't receive it for a given string view instance, call base() instead.
Due to their somewhat schizophrenic nature, the specification of contracts for view classes must be subject to very careful consideration, to account for the fact that their logical contents are subject to external change. Interestingly, with a string view, this factor leads to very simple contracts. The invariant for the classrealized, as is my wont [5] in the is_valid() method (Listing 6)stipulates two obvious relationships: that the m_base pointer must be valid if length is non-0; and that the m_cstr member (used to implement c_str()) must be NULL if the length is 0.
What may be surprising, however, is that there is no association between the length and the contents of the underlying array. There are two reasons. First, just as with std::basic_string, we wish to allow strings with embedded NULL ('\0') characters. Second, because it's a view, it's eminently possible (and sometimes expressly desired) that the contents of the underlying array will change. It would be both dissuasively inefficient and logically flawed to police the view contents; hence, the invariant is so simple.
As mentioned earlier, a slice view instance is not valid outside the lifetime of its underlying array. String views are no exception, but because they are small, and look (and for the most part smell) like strings that are value types, one must be careful when using them to avoid the dead reference problem. Despite this, there are several excellent uses for them.
One such use is with the Open-RJ library, which uses stlsoft::basic_string_view<char> as its string type. Because an Open-RJ database is a single, immutable block of memory, within which reside the names and values of all fields in the database records, it would be wasteful indeed to take copies of these strings (for example, in the form of std::string) when slices will safely suffice. All such slices will be valid as long as the database is loaded, irrespective of the lifetime of the record and/or field objects from which they are elicited.
Another scenario in which the dead-reference problem is moot, and in which string views are a flat-out winner, is when manipulating collections of strings with algorithms, as in the definition of has_token_string() (whose performance will be featured in next month's column):
bool has_token_string(char const *str, char delim, char const *token) { typedef stlsoft::string_view string_t; typedef stlsoft::string_tokeniser<string_t, char> tokeniser_t; tokeniser_t tokens(str, delim); return tokens.end() != std::find( tokens.begin(), tokens.end(), token); }
When string_t is std::string (or stlsoft::simple_string), there is one visit to the heap for the copy of str, and one for each tokenized element in str; for example, if str is "cpp;c;java;;pl;;" (and delim is the semicolon) there would be five allocations. However, when string_t is stlsoft::string_view (a typedef for stlsoft::basic_string_view<char>), there are 0 allocations, regardless of the contents of str! Tests show that this function works between 2 and 10 times faster than with either std::string or stlsoft::simple_string.
Another example, which will be performance tested next time, is in combination with memory mapping, to present a text file as a string object:
platformstl::memory_mapped_file mmf("mmf.cpp"); stlsoft::string_view contents (static_cast<char*>(mmf.contents()), mmf.size()); . . . code that uses whole file as string 'contents'
Next time, I'll look at some other safe, but occasionally exotic, uses of string views, including in the implementation of Basic-like slice functions. I'll also discuss a number of performance tests that demonstrate the significant gains to be had by the appropriate use of string views.
Thanks to Bjorn Karlsson, Christopher Diggins, Garth Lancaster, Nevin Liber, Pablo Aguilar, and Walter Bright for their insights into the article.
CUJ
template template< typename C // Character type , typename T = std::char_traits<C> // Traits , typename A = std::allocator<C> // Allocator > class basic_string_view : private A { public: // Types typedef C value_type; typedef basic_string_view<C, T, A> class_type; . . . typedef value_type const &const_reference; public: // Construction basic_string_view(); basic_string_view(class_type const &rhs); basic_string_view(class_type const &s, size_type pos); basic_string_view(class_type const &s, size_type pos, size_type cch); basic_string_view(char_type const *s); basic_string_view(char_type const *s, size_type cch); basic_string_view(char_type const *first, char_type const *last); ~basic_string_view() throw(); class_type &operator =(class_type const &rhs); public: // Operations void swap(class_type &other) throw(); void clear() throw(); void refresh() throw(); public: // Attributes size_type size() const throw(); size_type length() const throw(); static size_type max_size() throw(); allocator_type get_allocator() const; . . . public: // Comparison bool equal(class_type const &rhs) const throw(); bool equal(value_type const *rhs, size_type cchRhs) const throw(); int compare(size_type pos, size_type cch , value_type const *s, size_type cchRhs) const throw(); int compare(size_type pos, size_type cch , value_type const *s) const throw(); int compare(value_type const *s) const throw(); int compare(size_type pos, size_type cch, class_type const &rhs , size_type posRhs, size_type cchRhs) const throw(); int compare(size_type pos, size_type cch , class_type const &rhs) const throw(); int compare(class_type const &rhs) const throw(); public: // Accessors const_reference operator [](size_type index) const; value_type const *c_str() const; value_type const *data() const throw(); value_type const *base() const throw(); const_reference front() const; const_reference back() const; size_type copy( value_type *dest, size_type cch , size_type pos = 0) const throw(); public: // Iteration const_iterator begin() const; const_iterator end() const; const_reverse_iterator rbegin() const; const_reverse_iterator rend() const; private: // Invariant bool is_valid() const; private: // Implementation static char_type const *empty_string_() throw(); static int compare_(char_type const *lhs, size_type lhs_len , char_type const *rhs, size_type rhs_len); private: // Members size_type m_length; char_type const *m_base; mutable char_type *m_cstr; }; // and comparison operators ==, !=, <, <=, >, >= template< typename C , typename T , typename A > bool operator ==(basic_string_view<C, T, A> const &lhs , basic_string_view<C, T, A> const &rhs); // ... and overloads for C const *.
char const *p1 = sv.c_str(); puts(p1); // All ok here char const *p2 = sv.c_str(); // p1 now invalid puts(p1); // Bad things happen here
char const *ptr_to_shared_memory = . . . size_t sh_mem_size = . . . stlsoft::string_view sv(ptr_to_shared_memory, sh_mem_size); assert(strcmp(sv.c_str(), "abc") == strcmp(sv.c_str(), "abc"));
template< typename C , typename T , typename A > C const *basic_string_view<C, T, A>::c_str() const { stlsoft_assert(is_valid()); if(NULL != m_cstr) { return m_cstr; } else if(0 == m_length) { return empty_string_(); } else { // Must allocate the m_cstr member char_type *s = allocate(1 + length(), NULL); traits_type::copy(s, m_base, m_length); s[m_length] = '\0'; m_cstr = s; stlsoft_assert(is_valid()); return m_cstr; } } template<typename C, typename T, typename A> class_type &basic_string_view<C, T, A>::operator =(class_type const &rhs) { stlsoft_assert(is_valid()); refresh(); m_length = rhs.m_length; m_base = rhs.m_base; return *this; } template< typename C , typename T , typename A > void basic_string_view<C, T, A>::refresh() { stlsoft_assert(is_valid()); if(NULL != m_cstr) { deallocate(m_cstr); m_cstr = NULL; } stlsoft_assert(is_valid()); } template< typename C , typename T , typename A > /* static */ C const *basic_string_view<C, T, A>::empty_string_() { // This array initialized to 0, which conveniently happens to // be the empty string, by the module/application load, so is // guaranteed to be valid, and there are no race conditions. static char_type s_empty[1]; stlsoft_assert(s_empty[0] == '\0'); // Paranoid check return s_empty; }
template< typename C , typename T , typename A > C const *basic_string_view<C, T, A>::data() const { stlsoft_assert(is_valid()); return (0 == m_length) ? empty_string_() : m_base; }
bool basic_string_view::is_valid() const { if( 0 != m_length && NULL == m_base) { return false; // If slice is nonempty, m_base should be !NULL } if( 0 == m_length && NULL != m_cstr) { return false; // If slice is empty, should be no m_cstr } return true; };
Terms of Service | Privacy Statement | Copyright © 2024 UBM Tech, All rights reserved.