Generalized String Manipulation: Access Shims and Type Tunneling


Introduction

Strings are a fundamental concept within software engineering, but strings are not a built-in type in C or C++. They have been conventionally expressed in C as null-terminated contiguous sequences of characters. In C++, this C-style string is still a common form, but there have also been many attempts to objectify strings into string classes. Most C++ developers have written at least one string class during their career, and a number of widely accepted forms exist. The standardization of C++ [1] has seen the advent of the standard template class std::basic_string, and its common char and wchar_t instantiations.

There is a mismatch between string objects and the C library, operating system, and other C APIs when it comes to string handling. Specifically, most C++ string classes are dealt with as atomic entities (usually passed by value or reference), whereas C APIs deal with pointers to null-terminated sequences. The standard C++ string class allows for the internal representation to not be null-terminated [2], and there are other string types, e.g., the Win32 LSA_UNICODE_STRING type, whose contents are not null-terminated either. Therefore there can be quite different ways of accessing the string contents, determining the string length, or even determining whether the string is empty.

Except in rare circumstances -- where there are no literals and no interaction with the C API string functions, or where one uses C-style strings only -- it is virtually impossible to avoid having multiple string types. Usually this is limited to C-style strings and one string class, although one must often deal with multiple string classes within one code base. Indeed, we may develop using a standard string class for expediency, and seek to replace some/all of the strings with a custom class type that may not share a common interface with the standard one.

That there are usually two or more string forms within a single code base can lead to a number of problems:

  • brittleness -- changing the main/only string type can require many changes within the client code;
  • complexity -- code contains different syntactic elements doing the same semantic job;
  • specificity -- generalization of the manipulation of strings in (usually template) algorithms is made difficult or impossible.

In this article I will discuss problems with some strategies for achieving generalization in the manipulation of strings (and, by extension, value types in general), including the use of a common interface, implicit conversion operators, traits, or the pre-processor. Then I will cover the technique of generalization via explicit conversion, which addresses the requirements of flexibility, genericity and simplicity, whilst avoiding the shortfalls of the other approaches.

This technique uses the concept of Access Shims [3] -- lightweight components (including free-functions and conversion classes) that explicitly convert from a specific type to a common type when accessed. The method I will describe combines the concepts of Attribute Shims and Conversion Shims (see the "Shims" sidebar). By use of this technique, compatible client code can be written with minimal or no involvement of the pre-processor, and algorithms can be written and applied to strings of arbitrary type. Furthermore, the use of the shims involves no additional runtime costs, so the generalization is for free. Finally, I will describe how Access Shims facilitate the more fundamental technique of "type tunnelling", illustrated with a practical example.

Generalization via Common Interface

Let's say that we wish to display the contents of the PATH environment variable in a list control. We could implement this as shown in Listing 1.

This works fine. The environment variable definition is tokenized (using the stlsoft::string_tokenizer [4], parameterized to use string_t, which is std::string in this case). The token sequence is then iterated, and each item is passed to the list control via the API call (actually a macro) ListBox_AddString(), using the c_str() method to access the null-terminated string contents.
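As a rough, portable illustration of the shape of code that Listing 1 describes (not the listing itself), the sketch below tokenizes a PATH-style value and passes each element to a list via c_str(). The names add_string(), split() and add_path_elements() are hypothetical stand-ins for ListBox_AddString(), stlsoft::string_tokenizer and the article's loop, introduced only so the example can run without Win32.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

typedef std::string string_t;   // ideally the only line that ever changes

// Hypothetical stand-in for the list-control API: like ListBox_AddString(),
// it accepts only a null-terminated C-string.
inline void add_string(std::vector<std::string> &list, char const *s)
{
    assert(s != 0);
    list.push_back(s);
}

// Hypothetical helper standing in for a tokenizer: splits 'value' on 'delim',
// skipping empty tokens.
inline std::vector<string_t> split(string_t const &value, char delim)
{
    std::vector<string_t> tokens;
    std::size_t start = 0;

    while (start <= value.size())
    {
        std::size_t end = value.find(delim, start);
        if (end == string_t::npos)
        {
            end = value.size();
        }
        if (end > start)
        {
            tokens.push_back(value.substr(start, end - start));
        }
        start = end + 1;
    }
    return tokens;
}

// Adds each element of a PATH-style value to the list. Note the direct
// dependence on string_t having a c_str() method.
inline void add_path_elements(std::vector<std::string> &list, string_t const &path)
{
    std::vector<string_t> const tokens = split(path, ';');
    std::vector<string_t>::const_iterator begin = tokens.begin();
    std::vector<string_t>::const_iterator end = tokens.end();

    for (; begin != end; ++begin)
    {
        string_t const &s(*begin);

        add_string(list, s.c_str());
    }
}
```

The call to s.c_str() is the point of coupling: any replacement for string_t must offer that exact method for the loop to compile unchanged.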

If we now wish to change the code to use another string class, say stlsoft::basic_frame_string<char, ...> [4], then the code change is limited to one line, as in

  typedef stlsoft::basic_frame_string<char, _MAX_PATH>  string_t;

Because stlsoft::basic_frame_string has a method c_str(), the rest of the code will compile without issue. In circumstances where the classes exhibit identical, or sufficiently similar public interfaces, such code changes are trivial. However, this is seldom the case.

Consider that we want to change the string type again, this time to MFC's CString. This class does not have a c_str() method; instead it has an implicit conversion operator (operator TCHAR const *() const). The code changes required are

  typedef CString     string_t;

  ...

  ListBox_AddString(hwndList, static_cast<char const *>(s));
//  ListBox_AddString(hwndList, s); // Also works in this case

Now we must change code within the loop, in addition to the typedef. The change is in the call to ListBox_AddString(), which takes the char const* form of the string, via the implicit conversion operator.

If we wish that only a typedef change be required, then we must ensure that CString shares a common interface, in as far as it is exercised in our code, with std::string. We have the option of defining derived types in order to homogenize and unify the functionality. To do this we might use an adaptive veneer [5], as in Listing 2, which would mean the only code change from the original required to support CString would be the following:

  typedef c_str_veneer<CString>   string_t;

In general, adaptive veneers must either support a union of all methods provided by all types covered (which is brittle and highly dubious) or must ensure that each adapted type exhibits an agreed set of operations. In this case we are adapting CString's interface to that of std::string, but even this cannot always be achieved. Consider the case of Win32's LSA_UNICODE_STRING, which does not contain null-terminated strings. Though we can borrow from the techniques employed in the Conversion Shims (described below) to synthesize accessible, temporary, null-terminated strings, we are still left with the fact that the conversions we will be employing are implicit. Implicit conversions that carry out significant operations, including resource (de-)allocation, are not desirable [6, 7].
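To make the veneer idea concrete, here is a minimal sketch of what an adaptive veneer of this kind might look like. It is not the article's Listing 2: legacy_string is a hypothetical class standing in for one that, like CString, has an implicit conversion operator but no c_str(), and the veneer's constructors assume the adapted type can be built from a char const *.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical string class: converts implicitly to char const *, but has no
// c_str() method.
class legacy_string
{
public:
    legacy_string(char const *s)
        : m_value(s)
    {
        assert(s != 0);
    }

    operator char const *() const
    {
        return m_value.c_str();
    }

private:
    std::string m_value;
};

// Sketch of an adaptive veneer: derives from the adapted type and adds only
// the missing c_str() accessor, with no data members of its own.
template <typename S>
class c_str_veneer
    : public S
{
public:
    c_str_veneer(char const *s)     // assumes S is constructible from a C-string
        : S(s)
    {}
    c_str_veneer(S const &s)
        : S(s)
    {}

    char const *c_str() const
    {
        S const &base = *this;

        return base;                // uses S's implicit conversion operator
    }
};

typedef c_str_veneer<legacy_string> string_t;

// Client code written against the std::string-style interface
inline std::size_t length_via_c_str(string_t const &s)
{
    return std::strlen(s.c_str());
}
```

Note that the conversion inside c_str() is still the adapted type's implicit one; the veneer only renames it, which is why it cannot help with types that have no directly accessible null-terminated form.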

Generalization via Implicit Conversions

Another answer is to provide implicit conversion operators to the degenerate type which, for strings, will be C-strings, i.e., char/wchar_t const*. Listing 3 shows a veneer [5] template that provides such functionality for string types with a c_str() accessor method.

Authors have written much on this subject [6, 7], and I think it's fair to say that use of implicit conversion operators coupled with implicit constructors is particularly pernicious. One can get by with relatively few problems with one or the other, but it is far better to eschew both. It may lead to more typing, but in such cases the extra work is not onerous and often leads to a better understanding of the code.

On these grounds alone, I would rule this technique out. However, providing implicit conversions doesn't even represent a full solution to the problem. It only provides generalization between types that can provide null-terminated contiguous arrays of characters. The use of smart-pointer intermediary temporaries is not allowed since the compiler is already carrying out one implicit conversion (of the string type itself) so cannot legally perform another for the temporary [6].

Furthermore, this technique only provides generalization for access to string contents, not to other attributes, such as length (see the sidebar titled "String Length"). Overall, then, implicit conversion is a thoroughly unsuitable solution (not that I haven't seen it used on more than one occasion!).

Generalization via the Preprocessor

The changes in the loop code with respect to CString don't seem a huge effort, and in application code they aren't, so we could live with the small effort. When dealing with library code, however, it becomes a great deal more important. When expressing the code as templates is not feasible or desirable, the conventional choices amount to prescribing a single string type that must be used, or relying on a common interface, or resorting to ugly pre-processing hacks. Such hacks could be something like the gruesome, and difficult to maintain, variant of the code shown in Listing 4. Lamentably I have seen (and written!) such code.

Clearly, this is no way to write generalized code in C++.

Generalization via Explicit Conversion: Access Shims

Before I discuss the details of the proposed solution, I'll just take a look at its application in our example.

  for(; begin != end; ++begin)
  {
    string_t const  &s(*begin);

    ListBox_AddString(hwndList, c_str_ptr(s));
  }

That's all there is to it! (Naturally, we don't even need the intermediate reference s; it's just for clarity in these examples.) It uses the c_str_ptr access shim. The concept of Access Shims is actually a combination of the concepts of Attribute Shims and Conversion Shims (see the "Shims" sidebar). Access Shims are a collection of components (including free functions and classes) that provide access to instances of a number of types expressed in another common type. For example, Listing 5 shows some of the access shims that are defined within the stlsoft namespace.

As you can see, these shims provide simple conversion to the ANSI/Unicode C-string equivalents of the given types, and no temporaries are created. Hence, they are pure Attribute Shims. Table 1 shows a matrix of convertee and conversion types for some of the shims described here. Access Shim names usually do not follow the naming convention for either Attribute Shims or Conversion Shims, since they need to share token names in order to function. Hence, the shim we will examine in detail in this article is called c_str_ptr (and its character-encoding specific derivatives c_str_ptr_a and c_str_ptr_w).
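A minimal sketch of access shims in their pure attribute-shim form follows. The overload set mirrors the idea behind c_str_ptr, but the namespace shims_sketch and the helper shim_length() are hypothetical names for this example and are not part of STLSoft.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

namespace shims_sketch   // hypothetical namespace, not stlsoft
{
    // A C-style string is already in the common form
    inline char const *c_str_ptr(char const *s)
    {
        return s;
    }

    // std::string exposes the common form through c_str(); no temporary is
    // created, so the pointer lives as long as the string is unmodified
    inline char const *c_str_ptr(std::string const &s)
    {
        return s.c_str();
    }
}

// Works for any type for which a c_str_ptr overload is visible
template <typename S>
inline std::size_t shim_length(S const &s)
{
    using shims_sketch::c_str_ptr;  // introduce the shims, then call unqualified

    char const *p = c_str_ptr(s);

    assert(p != 0);
    return std::strlen(p);
}
```

Both shim_length("hello") and shim_length(std::string("hello")) compile against the same function body; the compiler selects the shim from the argument type.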

The intent is that the shim provides a conversion to a C-string pointer, and the name obviously borrows from std::basic_string's c_str() method. The original version, written a few years ago, was named c_str, but I thought I was asking for name-clash trouble, so I changed it. Since shims are necessarily "imported" from disparate namespaces rather than explicitly qualified -- and thereby disambiguated -- it is important to have a reasonably good level of name independence.

When the conversion type is not actually directly accessible from the convertee type, it can be necessary to use a conversion shim to access an equivalent C-string form. Listing 6 shows some access shims defined within the winstl [8] namespace. These are significantly different from the shims in the STLSoft namespace. They do not return char const * (or wchar_t const *), but instead return instances of proxy classes. The declaration of one of these, c_str_ptr_HWND_proxy, is also shown in Listing 6.

The class is a template, which is parameterized on character type C (e.g., char, wchar_t). It has an explicit conversion constructor from HWND, and an implicit conversion operator to C const *. This essentially provides the conversion mechanism, and allows the access shims that use it to work. Put together it works simply, as in:

  // print the window text to stdout
  puts(c_str_ptr_a(hwnd));

The c_str_ptr_a(HWND) function is called, passing hwnd as the parameter, and returns an instance of c_str_ptr_HWND_proxy<char>. This instance is then passed to puts(). Because puts() requires an argument of type char const *, the compiler looks for a conversion, which is provided by c_str_ptr_HWND_proxy<C>'s operator C const *() const (with C == char), so this method is called and the result is passed to puts().

The second constructor, the move-constructor [9], is provided so that when compilers cannot implement the return-value-optimization [6], the buffer is transferred into the "copied" instance, ensuring that it is present in the instance on which the conversion operator will be invoked. Since the first instance will not, in such a case, have the conversion operator invoked, it is more efficient to move the buffer from one to the other rather than taking a copy.
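The proxy mechanism can be sketched without Win32 as follows. Everything here is an assumption made for illustration: widget stands in for a handle type such as HWND, get_widget_text() for a copy-out API, and c_str_ptr_widget_proxy for the proxy class. Where the article's proxy hand-writes a move constructor, this sketch lets std::vector supply the copy (and, from C++11, the implicitly generated move).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for a handle type: its text can only be obtained by
// copying it out through an API call.
struct widget
{
    std::string caption;
};

// Hypothetical copy-out API: writes at most capacity - 1 characters plus a
// terminating null, and returns the number of characters written.
inline std::size_t get_widget_text(widget const &w, char *buffer, std::size_t capacity)
{
    assert(buffer != 0 && capacity > 0);

    std::size_t const n = (w.caption.size() < capacity - 1) ? w.caption.size() : capacity - 1;

    std::memcpy(buffer, w.caption.data(), n);
    buffer[n] = '\0';
    return n;
}

// Proxy returned by the shim: owns a null-terminated copy of the text and
// converts implicitly to char const *.
class c_str_ptr_widget_proxy
{
public:
    explicit c_str_ptr_widget_proxy(widget const &w)
        : m_buffer(w.caption.size() + 1)
    {
        get_widget_text(w, &m_buffer[0], m_buffer.size());
    }

    operator char const *() const
    {
        return &m_buffer[0];
    }

private:
    std::vector<char> m_buffer;     // always holds at least the terminator
};

// The conversion shim: returns a temporary proxy, not a raw pointer
inline c_str_ptr_widget_proxy c_str_ptr_a(widget const &w)
{
    return c_str_ptr_widget_proxy(w);
}

// The proxy lives until the end of the full expression, so using it as a
// function argument is safe.
inline std::size_t text_length(widget const &w)
{
    return std::strlen(c_str_ptr_a(w));
}
```

The pointer obtained from the proxy is valid only while the temporary exists, that is, until the end of the full expression in which the shim is called.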

Where Koenig lookup does not apply (see sidebar titled "Koenig Lookup, Includes and Namespaces"), in order to use the shims from a certain namespace, naturally one must introduce them, preferably via the using declaration (e.g., using stlsoft::c_str_ptr). By this simple mechanism one can have control over which shims are visible and which are not -- albeit that this is granular only at the namespace level -- and it is a trivial matter to introduce newly compatible types, by the introduction of compatible shims, into an extant algorithm or client code function. (One will see more of the implications of this in the Type Tunnelling section.)

There is an important point to be gleaned from the implementation of the winstl::c_str_ptr shims. Since they return temporaries it is not legitimate to write code such as

  char const *s = c_str_ptr_a(hwnd);

  puts(s);

As soon as s has been initialized to point to the window text buffer in the c_str_ptr_HWND_proxy<char> instance, the instance is destroyed, and the buffer can then contain who knows what.

All conversion shims have this constraint; attribute shims do not. However, since access shims (of the same name) can be composed of either or both of these more basic shim concepts, they also have this constraint.

In common with many user-defined types, one should also be cautious when applying shims directly to printf() and other functions that have variable parameter lists. Although most of the conversion shims currently implemented actually store the pointer at zero offset within them and so function correctly in such conditions, this is still an undefined and inherently dangerous thing to do with any class type. (One popular compiler, Digital Mars C++ [12], addresses this problem and helpfully balks at un-cast class type instances when passed to variable parameter lists.)

Generalization via Templates and Traits

You may be contrasting the shims technique with the widely known technique of templates and traits [10, 11] and wondering why I have not chosen the latter as the solution. Certainly it is possible to define string access traits (as shown in Listing 7). Using these traits could look something like the following:

  template <typename S>
  inline void dump_string(S const &s)
  {
    puts(stlsoft::string_access_traits<S>::c_str(s));
  }

However, when one wants to define components that live in another namespace, the picture is much more complicated. Because of the way specialization works, the specialized form must still be defined within the original namespace (as shown in Listing 8).

This requires the involvement of three namespaces: the one containing the component (windows), the one containing the traits (stlsoft), and the one in which we will "use" types from the other two. Furthermore, it requires that the original non-specialized template definition, within namespace stlsoft, is available before the definition of the specialization string_access_traits<windows::Window>, which increases coupling, or that a forward declaration of it is made, which involves coupling as well as being fragile in having a type declared in more than one place.

This extra level of complexity is bought for no gain over the shims approach. But the most compelling reason to prefer shims is that if we use a class (traits) rather than functions (shims), we cannot take advantage of the compiler's built-in type resolution and will have to explicitly specify the type. Traits are therefore useful only within template functions, which is only part of the solution domain we are aiming for, or when the programmer is prepared to specify the explicit type, which defeats our whole aim of achieving generalized manipulation in both template and non-template contexts.
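The difference in call syntax can be seen in a small side-by-side sketch. The traits template below borrows the name string_access_traits from the text but is a simplified local definition for this example, not the STLSoft one, and the two length_via_* functions are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Traits form: a primary template, specialized once per string type
template <typename S>
struct string_access_traits;

template <>
struct string_access_traits<std::string>
{
    static char const *c_str(std::string const &s) { return s.c_str(); }
};

template <>
struct string_access_traits<char const *>
{
    static char const *c_str(char const *s) { return s; }
};

// Shim form: plain overloaded functions
inline char const *c_str_ptr(std::string const &s) { return s.c_str(); }
inline char const *c_str_ptr(char const *s) { return s; }

inline std::size_t length_via_traits(std::string const &s)
{
    // Non-template code has to name the type to select the traits class
    return std::strlen(string_access_traits<std::string>::c_str(s));
}

inline std::size_t length_via_shim(std::string const &s)
{
    // Overload resolution selects the shim from the argument's type
    return std::strlen(c_str_ptr(s));
}

// Both routes reach the same C-string
inline void check_equivalence(std::string const &s)
{
    assert(length_via_traits(s) == length_via_shim(s));
}
```

If the parameter type of length_via_traits() changes, its body must change too; the body of length_via_shim() stays as it is.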

The two disadvantages of the shims approach are that a using declaration must often be made (see "Koenig Lookup, Includes and Namespaces" sidebar) to introduce the shims from a particular namespace, and the consequent slight local namespace pollution. Table 2 summarizes the relative merits of the two approaches.

Type Tunneling

Now that we have seen how to increase the flexibility of our types and algorithms, we can ask whether the generalization effects can be taken further. In general, the expanded spectrum of convertible types facilitates the more powerful technique of Type Tunnelling, which allows types to be used with APIs where each is entirely independent of the other. The application of this technique to our code results in the example's hand-written loop being replaced with the single line

  std::for_each(begin, end, listbox_add_inserter(hwndList));

The listbox_add_inserter is one of a suite of control inserter function objects in the WinSTL [8] libraries (defined in Listing 9).

The template function call operator method employs a c_str_ptr shim, which is selected by the compiler only at the point when the method is instantiated (inside the instantiation of std::for_each). Thus the definition of string_t can be changed to any type for which a c_str_ptr shim is defined and accessible in the instantiating scope, without any other changes being necessary to the code.
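A portable sketch of such an inserter is shown below. It is written in the spirit of listbox_add_inserter but is not the WinSTL definition: list_control and list_add_inserter are hypothetical names, with list_control standing in for a list box that accepts only C-strings.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Access shims visible to the function object below
inline char const *c_str_ptr(char const *s) { return s; }
inline char const *c_str_ptr(std::string const &s) { return s.c_str(); }

// Hypothetical stand-in for a list control that accepts only C-strings
struct list_control
{
    std::vector<std::string> items;

    void add_string(char const *s)
    {
        assert(s != 0);
        items.push_back(s);
    }
};

// Function object that inserts each element it is given into the list
class list_add_inserter
{
public:
    explicit list_add_inserter(list_control &list)
        : m_list(&list)
    {}

    // The shim is resolved only when this operator is instantiated, so the
    // element type is free to vary.
    template <typename S>
    void operator ()(S const &s) const
    {
        m_list->add_string(c_str_ptr(s));
    }

private:
    list_control *m_list;
};
```

The same inserter can then be handed to std::for_each() over a sequence of std::string or over an array of char const *, with no change to the inserter itself.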

All this seems highly flexible, but it is not clear why this is type tunnelling. A different example may help. Consider a debugging subsystem wherein one can pass descriptive strings to assertions. Using the c_str_ptr access shims, we can layer inline C++ functions over a C-API (GFAssert2A/W()) (Listing 10) such that instances of arbitrary type (e.g. std::string, VARIANT, HWND, etc.) can be passed as the descriptive string. The base API manipulates a degenerate type -- in this case C-style strings -- and does not need to change to accommodate new types, as long as suitable shims are available. Hence new types can be "tunnelled" through the C++ functions into the API via shims.
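The tunnelling arrangement can be sketched as a thin function template layered over a C-style function. This is not the article's Listing 10: report_failure_a() is a hypothetical stand-in for an API such as GFAssert2A() that records its message instead of reporting it, and account is a hypothetical non-string type given its own shim.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Storage for the last message seen by the C-style API (sketch only)
inline char *last_message()
{
    static char buffer[128] = "";

    return buffer;
}

// Hypothetical C-style API: understands only null-terminated strings
inline void report_failure_a(char const *message)
{
    assert(message != 0);
    std::strncpy(last_message(), message, 127);
    last_message()[127] = '\0';
}

// Shims for the types we want to tunnel through to the API
inline char const *c_str_ptr(char const *s) { return s; }
inline char const *c_str_ptr(std::string const &s) { return s.c_str(); }

// Hypothetical type that is not a string, but has a natural string form
struct account
{
    std::string name;
};

inline char const *c_str_ptr(account const &a) { return a.name.c_str(); }

// The C++ layer: accepts any type with a visible c_str_ptr shim and passes
// the degenerate C-string form down to the unchanged API
template <typename S>
inline void report_failure(S const &message)
{
    report_failure_a(c_str_ptr(message));
}
```

Supporting a further type means writing one more c_str_ptr overload; neither report_failure() nor report_failure_a() needs to change.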

Summary

In this article I have discussed the desirability of generalized manipulation of strings and illustrated some of the classic problems encountered when trying to write generic code. I have also demonstrated why none of the common techniques cater to the full gamut of generalization requirements and have proposed one -- explicit conversion using access shims -- that does.

By using explicit conversions, we gain the benefits of significant increases in flexibility, simplicity and genericity. Access shims also facilitate the powerful, flexible technique of type tunnelling. Furthermore, they can be used as a convenient standalone conversion mechanism (e.g., for converting VARIANT to C-string).

The focus of the article has been on string generalization, which is natural since strings are one of the most common inter-conversion types, and also probably the easiest for which one can see the relevance. But access shims and type tunnelling can be (and are) applied to any set of syntactically disparate but logically related types.

Acknowledgements

I'd like to thank Scott Patterson for providing his usual insight and criticism, once again helping to curb my tendency to wander at great lengths in making small points. Also, many thanks to Joe and Darrah at CUJ for giving a lot of flexibility in the preparation of the draft(s).

Notes and References

[1] ISO/IEC C++ standard, 1998 (ISO/IEC 14882:1998(E))

[2] The C++ Programming Language, Bjarne Stroustrup, Third Edition, Addison-Wesley, 1997

[3] http://synesis.com.au/resources/articles/cpp/shims.pdf

[4] http://stlsoft.org/

[5] http://synesis.com.au/resources/articles/cpp/veneers.pdf

[6] Effective C++, 2nd Edition, Scott Meyers, Addison-Wesley, 1998

[7] More Effective C++, Scott Meyers, Addison-Wesley, 1996

[8] http://winstl.org/

[9] "Move Constructors", Matthew Wilson, Synesis Technical Report, available at http://synesis.com.au/resources/articles/cpp/movectors.pdf. Also "A Proposal to Add Move Semantics Support to the C++ Language," Abrahams, Dimov and Hinnant, at http://anubis.dkuug.dk/jtc1sc22/wg21/docs/papers/2002/n1377.htm.

[10] Nathan Myers, "A New and Useful Template Technique: Traits," C++ Report, June 1995

[11] Generic Programming and the STL: Using and Extending the C++ Standard Template Library, Matthew Austern, Addison-Wesley, 1998

[12] http://digitalmars.com/

[13] The New Shorter Oxford English Dictionary, Lesley Brown (ed.), Oxford University Press, 1993.

[14] The STLSoft libraries are specifically written to support ANSI and Unicode character-encodings (see sidebar titled "ANSI vs Unicode Encoding"). They are not intended to support other encoding schemes, such as Microsoft's MBCS, since the effort involved in manipulating variable length character-encoding sequences is deemed too great. The purpose of the libraries is that they are very lightweight, prescribe as few constraints on their use as possible, and are exclusively header-only. If you want to write for non-English language encodings and want to use the STLSoft libraries, then you should use Unicode (which is a good option in any case).

About the Author

Matthew Wilson holds a degree in Information Technology and a PhD in Electrical Engineering. He is currently a software development consultant for Synesis Software. Matthew's work interests are in writing bullet-proof real-time, GUI and software-analysis software in C, C++, C# and Java. He has been working with C++ for over 10 years, and is currently bringing STLSoft.org and its offshoots into the public domain. Matthew can be contacted via matthew@synesis.com.au or at http://stlsoft.org.

