
The Standard Librarian: Stringstreams and Their Friends


April 2001 C++ Experts Forum/The Standard Librarian


How do you convert a number to a string, or a string to a number? Or, to phrase the question a little more precisely: how do you obtain a number's textual representation, and how do you parse a sequence of characters to get a numeric value?

In C, the usual answer would be to use sprintf and sscanf. These functions have the same syntax as printf and scanf, except that each of them takes an additional argument, a pointer to an array of characters. Just as printf formats its arguments and writes to standard output, so sprintf formats its arguments and writes to an array of characters. Just as scanf obtains characters from standard input and parses them according to its format string, so sscanf reads from and parses the array of characters that it is passed.
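
For comparison, here is a minimal sketch of the C approach as it might be called from C++. The helper names format_int and parse_int are hypothetical, and the sketch uses snprintf (C99/C++11) in place of sprintf so that the write is bounded by the size of the array.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical helper: format n into a caller-supplied character array.
// Returns true only if the digits and the null terminator both fit.
bool format_int(int n, char* buf, std::size_t size)
{
  int written = std::snprintf(buf, size, "%d", n);
  return written >= 0 && static_cast<std::size_t>(written) < size;
}

// Hypothetical helper: parse an int from a null-terminated array.
// Returns true if sscanf converted exactly one field.
bool parse_int(const char* text, int& out)
{
  return std::sscanf(text, "%d", &out) == 1;
}
```

Note that the caller has to pick the array size in advance and check whether it was large enough, which is exactly the bookkeeping that the string-based C++ mechanism takes over.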

C++ also has a mechanism that reuses the familiar syntax of I/O. It differs from sprintf/sscanf in two ways. First, where the C mechanism is a variation of printf and scanf, the C++ mechanism is a variation of iostreams. Second, where the C mechanism uses null-terminated arrays of characters, the C++ mechanism uses std::string.

Stringstreams

C++'s mechanism uses classes called std::istringstream and std::ostringstream. They're defined in the standard header <sstream>, and they work just the same way as ordinary istream and ostream. For formatting you create an ostringstream, write to it using operator<< (you have access to the full set of overloaded operators and manipulators, of course), and then retrieve a copy of the string object using the str member function. You don't have to worry about buffer overruns, since ostringstream manages buffers automatically and expands them as needed. For example:

double pi = 3.14159265;
std::ostringstream os;
os << "pi = "
   << std::setprecision(3)
   << pi;
                     
std::string line = os.str();
assert(line == "pi = 3.14");

Similarly, to parse a string that already exists, you create an istringstream object that uses that string, and then you read from it using operator>> the same way that you would read from any ordinary istream:

std::string s("17 42");
std::istringstream in(s);

int i, j;
in >> i >> j;

assert(i == 17);
assert(j == 42);

These classes are useful precisely because they're so simple. You don't have to learn a new syntax; everything you already know about istream and ostream is still true of istringstream and ostringstream. There are, however, a few advanced stringstream features that you may sometimes find useful.

First, like the rest of the C++ I/O library, string streams are templatized: std::ostringstream, for example, is an alias for std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char>>. If you want to write to a string with a different kind of character type, you can still use this same mechanism. To represent a number as a utf16 string, for example (if you have defined your own utf16 character type, that is; there isn't any class by that name in the Standard C++ library):

std::basic_ostringstream<utf16> os;
os << 42;
std::basic_string<utf16> s = os.str();
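
The same idea can be tried out with a character type that the library does provide. The sketch below uses wchar_t and the standard alias std::wostringstream; the function name as_wide_string is hypothetical.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Format an int as a wide-character string. std::wostringstream is the
// standard alias for std::basic_ostringstream<wchar_t>.
std::wstring as_wide_string(int n)
{
  std::wostringstream os;
  os << n;
  return os.str();
}
```

The library guarantees formatting facets only for char and wchar_t, so a user-defined character type such as utf16 would presumably also need its own char_traits specialization and suitable locale facets before os << 42 could work.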

Second, just as you can use a constructor to initialize the string that istringstream reads from, you can also use a constructor to initialize the string that ostringstream writes to; you don't have to start with an empty string. For example:

std::ostringstream os("x = ");
assert(os.str() == "x = ");

os.seekp(0, std::ios::end);

os << 42;
assert(os.str() == "x = 42");

Yes, that ugly looking seekp is necessary: you have to tell ostringstream that you want to move the write position so that it points to the end of the string. If you leave it out, ostringstream will start writing at the beginning of the string and will overwrite the characters you've already got; you'll end up with "42= ", which is probably not what you want. Overwriting and appending are both legal; ostringstream will append if you seek to the end, overwrite in the middle if you seek to somewhere in the middle, and overwrite at the beginning if you seek to the beginning. (The C++ Standard says that the write position points to the beginning by default, but, since some implementations get the default write position wrong, it's safer not to rely too heavily on that requirement. If you need the write position to be set to the beginning, you can say so explicitly with os.seekp(0, std::ios::beg).)
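
The three cases can be compared side by side. This is a sketch that assumes a conforming implementation; the function names are hypothetical, and the std::ios::ate variant is an alternative to the explicit seekp rather than something the text above relies on.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Write 42 at the default write position, which is the beginning.
std::string overwrite_at_start()
{
  std::ostringstream os("x = ");
  os << 42;                       // overwrites 'x' and the space after it
  return os.str();
}

// Open with std::ios::ate, which asks for the write position to
// start at the end of the initial string.
std::string append_with_ate()
{
  std::ostringstream os("x = ", std::ios::out | std::ios::ate);
  os << 42;
  return os.str();
}

// Seek to offset 2 from the beginning, then overwrite in the middle.
std::string overwrite_in_middle()
{
  std::ostringstream os("abcdef");
  os.seekp(2, std::ios::beg);
  os << "XY";
  return os.str();
}
```

On a conforming implementation the first function yields "42= ", the second "x = 42", and the third "abXYef".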

Third, in addition to istringstream and ostringstream, there's also a stringstream class that can be used for both reading and writing. Again, if you're using a read/write class like stringstream, you have to be careful to set the read and write positions so that you're referring to the part of the string that you expect to.
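
As a small illustration, the sketch below writes two numbers into a stringstream and reads them back. The function name is hypothetical, and the seekg call is there only to make the read position explicit.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Write two numbers into a stringstream, then read them back and
// return their sum. The read and write positions are tracked separately.
int write_then_read_sum(int a, int b)
{
  std::stringstream ss;
  ss << a << ' ' << b;          // advances only the write position
  ss.seekg(0, std::ios::beg);   // state the read position explicitly

  int x = 0, y = 0;
  ss >> x >> y;
  return x + y;
}
```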

Finally, if you've been using C++ for a long time, you may have also heard of something called strstream. Strstream is similar to stringstream, but it works with character arrays instead of with std::string. Don't use it; this isn't an advanced feature, just an obsolete one. It's still in the C++ Standard, but it has been deprecated — and for good reason. Since strstream works with character arrays, it requires complicated memory allocation and deallocation protocols. Strstream is hard to use, and even harder to use correctly.

String Buffers

We've now seen more than enough to answer the initial question of how to represent a number as a string — or, indeed, how to represent almost anything as a string. It's easy to write an as_string function

template <class X>
std::string as_string(const X& x)
{
  std::ostringstream os;
  os << x;
  return os.str();
}

that operates on int, double, std::complex<>, and any other type for which the output operator is defined. If you can write it to an output stream, you can represent it as a string.
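
The other half of the initial question, parsing, can be handled the same way. The following from_string is a hypothetical counterpart, not something from the Standard library; it is a sketch that treats trailing non-whitespace characters as an error, which is one policy among several.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical counterpart to as_string: parse s as a T.
// Returns false if extraction fails or if anything other than
// whitespace follows the value; result is left untouched on failure.
template <class T>
bool from_string(const std::string& s, T& result)
{
  std::istringstream in(s);
  T value = T();
  if (!(in >> value))
    return false;

  char extra;
  if (in >> extra)              // found a trailing non-whitespace character
    return false;

  result = value;
  return true;
}
```

It works for any default-constructible type for which the input operator is defined.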

All of this goes back to the central organizing principle of the C++ I/O library. As I have written in earlier columns [1, 2], the fundamental idea is simple: formatting decisions are made by locale facets, individual characters are fetched or stored by stream buffer classes, and the high-level stream classes you interact with directly, like std::ostream and std::ostringstream, are relatively simple wrappers that manage locales and stream buffers.

A class like std::ostringstream, which inherits from std::ostream, is literally just a dozen or two lines of code. It has a member variable of type std::stringbuf, it has a constructor that sets up the stream to use that stringbuf as the stream's buffer, and it has a one-line wrapper function, str, that does nothing but invoke stringbuf::str. Everything that's interesting about string-based streams, everything that makes them different from, say, file-based streams, is contained in stringbuf. If ostringstream didn't exist, but stringbuf did, it wouldn't matter very much. It would just mean that you'd need to do a tiny bit more typing, because creating a string-based stream object would take two lines instead of one:

std::stringbuf buf;
std::ostream os(&buf);

You can do the same thing with any stream buffer class that you write.
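
To make the point concrete, here is a sketch of what such a wrapper could look like. The class simple_ostringstream is hypothetical and is not how any particular library implements std::ostringstream; it only shows the three ingredients described above: a stringbuf member, a constructor that installs it, and a one-line str.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for std::ostringstream, to show how little the
// stream class itself does. All the real work happens in std::stringbuf.
class simple_ostringstream : public std::ostream
{
public:
  simple_ostringstream()
    : std::ostream(nullptr), buf_(std::ios::out)
  {
    rdbuf(&buf_);   // install the buffer now that it has been constructed
  }

  std::string str() const { return buf_.str(); }

private:
  std::stringbuf buf_;
};
```

The base class is first given a null buffer because it is constructed before the member buf_; the call to rdbuf then attaches the real buffer and clears the error state.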

The reason I belabor this point is that the Standard C++ library was designed to be user-extensible, and, to use it effectively, you need to know the ways in which it was intended to be extended. For the C++ I/O library, there's often a simple answer: if you want to change the behavior of the system in some way, the first thing you should think about is whether you can do it by writing a new kind of stream buffer.

Stream buffers provide two different interfaces. One interface is seen by the outside world, by classes like istream and ostream: it consists of public nonvirtual member functions like sbumpc, for reading a character, and sputc, for writing a character. The other interface is used by classes that inherit from streambuf. This interface exposes an array of characters for reading and one for writing. It consists of protected nonvirtual member functions that manipulate these arrays, and protected virtual member functions that use them. Those protected virtual functions include underflow, which is called when the read buffer is empty and needs to be filled with new characters, and overflow, which is called when the write buffer is full and the characters in it must be written to the output. If you're using an existing stream buffer class, you need only know about the public interface, and if you're writing a new stream buffer class, you need only know about the protected interface; writing a custom stream buffer is just a matter of inheriting from std::streambuf and overriding a handful of virtual member functions.
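
A very small example may help separate the two interfaces. The class countbuf below is hypothetical: it discards its output and merely counts characters. It never sets up a write array, so every character written through the public interface ends up in the protected virtual function overflow.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <streambuf>

// Hypothetical example: a stream buffer that throws its output away
// and counts the characters. With no write array, each sputc from the
// public interface is forwarded to the protected virtual overflow.
class countbuf : public std::streambuf
{
public:
  countbuf() : count_(0) {}
  std::size_t count() const { return count_; }

protected:
  virtual int_type overflow(int_type c)
  {
    if (!traits_type::eq_int_type(c, traits_type::eof()))
      ++count_;
    return traits_type::not_eof(c);   // any value other than eof means success
  }

private:
  std::size_t count_;
};

// Uses only the public side: an ordinary ostream layered on the buffer.
std::size_t count_formatted(int n)
{
  countbuf buf;
  std::ostream os(&buf);
  os << n;
  return buf.count();
}
```

The author of countbuf touches only the protected interface, and count_formatted touches only the public one.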

Stringbuf is a stream buffer where the read and write arrays are connected to an in-memory data structure, a string. That's an important idea: the I/O mechanism doesn't just have to be used for external devices. It's also an idea that can be generalized.

Generalized String Buffers

Strings get stored in all sorts of ways, not just as std::string objects! You may find yourself using a segmented data structure, SGI's nonstandard rope class where strings are represented as trees, arrays of char, std::vector<char>, a string table, or a third-party string class that's a binary interface to some other library. You can use stringbuf (either directly or via istringstream and ostringstream) to connect the C++ I/O mechanism to one kind of in-memory data structure, std::string. If you want to connect C++ I/O to any other kind of in-memory data structure, you need a different kind of stream buffer class.

No single class suffices for all of these different data structures, of course, but, using the STL vocabulary of iterators and containers, we can write a stream buffer class that's general enough for many uses.

Actually, I'm going to show two stream buffer classes instead of one. The stream buffer classes in the C++ Standard (stringbuf and filebuf) use a single class for both reading and writing, but there's no reason for you to write your own stream buffers that way; you can write a stream buffer that is only used for reading, or that is only used for writing. In my opinion, you usually should. It'll simplify your code (interleaving reads and writes on the same stream requires a lot of work to keep things consistent), and it'll give you more interface options.

If we're reading characters from a data structure that has anything like an STL interface, the one thing we're guaranteed is that the characters will be provided as a range of iterators; this covers string, rope, vector<char>, C arrays, and probably most other useful data structures. Let's look at a stream buffer class that reads characters from a range of forward iterators. (Even that may be more general than we really need: restricting ourselves to random access iterators would probably be good enough. However, generalizing to forward iterators is so easy that there's no reason not to do it.)

Public member functions like sgetc and sbumpc, defined in the base class std::streambuf, operate on some internal array of characters. This array is characterized by three pointers: eback, gptr, and egptr (the beginning of the array, our current position within it, and the end). The virtual member functions that we override in the derived class only get called when we need special processing. At a minimum, we need to override underflow, which is called when we have reached the end of the array (gptr is equal to egptr) and the user has requested another character.

In principle we have three choices:

  • Keep the internal array empty. This implies that underflow will be called every time the user asks for another character: it can simply be a matter of iterator dereference and increment.
  • Allocate an internal array that's large enough to hold all of the characters in the range. This implies that underflow won't get called until we get to the end of the range; at that point it should always return EOF, because there won't ever be any more characters for it to fetch.
  • Maintain a small fixed-size internal buffer. When underflow gets called, we move to a new position in the sequence and fill the buffer again.

The third choice is the most complicated, but it's also the only one that's fully satisfactory. The first choice would require a virtual function call for every character, with dreadful implications for speed, and the second would often require allocating and copying a large chunk of memory when all we want to do is read a few characters.

Fortunately, even the third choice isn't very complicated. We just have to maintain a single class invariant: the characters in the range [gptr(), egptr()) are the same as the characters in the range [current, current + N), where current is our position within the original range of iterators and where N is the smaller of the buffer size and the number of characters remaining in that original range. When it's time to replenish our buffer, we bump up current by N, copy characters from [current, current + N) into the buffer, and use the protected member function setg to tell the base class the new values of the eback, gptr, and egptr pointers.
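
The following is a sketch of that buffering scheme only; it is not Listing 1. The names simple_rangebuf and second_word and the buffer size of 8 are assumptions made for illustration. Because this sketch supports neither seeking nor putback, it tracks the first character not yet copied (next_) rather than the start of the current frame.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <list>
#include <streambuf>
#include <string>

// A sketch only: a read-only stream buffer over a range of forward
// iterators whose value type converts to char.
template <class ForwardIter>
class simple_rangebuf : public std::streambuf
{
public:
  simple_rangebuf(ForwardIter first, ForwardIter last)
    : next_(first), last_(last)
  {
    setg(buffer_, buffer_, buffer_);   // start with an empty read array
  }

protected:
  virtual int_type underflow()
  {
    if (gptr() < egptr())              // characters are still buffered
      return traits_type::to_int_type(*gptr());

    // Copy the next N characters into the buffer, where N is the smaller
    // of the buffer size and the number of characters remaining.
    std::size_t n = 0;
    while (n < buffer_size && next_ != last_)
      buffer_[n++] = *next_++;

    if (n == 0)
      return traits_type::eof();       // the range is exhausted

    setg(buffer_, buffer_, buffer_ + n);
    return traits_type::to_int_type(*gptr());
  }

private:
  static const std::size_t buffer_size = 8;   // arbitrary, for illustration
  char buffer_[buffer_size];
  ForwardIter next_;   // first character not yet copied into the buffer
  ForwardIter last_;
};

// Example use: read the second whitespace-separated word from a list.
std::string second_word(const std::list<char>& chars)
{
  simple_rangebuf<std::list<char>::const_iterator>
    buf(chars.begin(), chars.end());
  std::istream in(&buf);
  std::string a, b;
  in >> a >> b;
  return b;
}
```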

That's enough so that we can read characters; we get two more features — repositioning and putback — by overriding three more virtual functions: pbackfail, seekoff, and seekpos. (For our purposes, the difference between seekoff and seekpos is mainly a minor syntactic one. They're separate functions to allow for the possibility of file I/O; on some operating systems, it's impossible to represent a file position as a numeric offset.) Again, the details aren't very complicated: we just shift the frame by manipulating current appropriately and filling up the buffer. The complete definition of rangebuf is shown in Listing 1.

There's one last refinement we can make. I rejected the choice of stuffing the entire input range into the rangebuf's internal buffer, but there's one special case where it makes more sense than any other alternative: when the input range we're dealing with consists of pointers to characters. In that case, we just set eback to the beginning of the range and egptr to the end, and we'll never have to worry about copying characters or shifting frames. The natural way to handle this special case is by template specialization, and that specialization is shown in Listing 2.
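
A sketch of the pointer case follows. It is not Listing 2, and for brevity it is written as a standalone class with the hypothetical name pointer_rangebuf rather than as a template specialization.

```cpp
#include <cassert>
#include <istream>
#include <streambuf>

// A sketch only: a read-only stream buffer whose read array is the
// caller's character array itself, so nothing is ever copied.
class pointer_rangebuf : public std::streambuf
{
public:
  pointer_rangebuf(const char* first, const char* last)
  {
    // setg takes non-const pointers. The const_cast is acceptable here
    // only because this class never writes through them.
    char* b = const_cast<char*>(first);
    char* e = const_cast<char*>(last);
    setg(b, b, e);
  }
  // No underflow override: the inherited version returns eof, which is
  // the right answer once gptr() reaches egptr().
};

// Example use: parse the first integer in [first, last).
int parse_first_int(const char* first, const char* last)
{
  pointer_rangebuf buf(first, last);
  std::istream in(&buf);
  int n = 0;
  in >> n;
  return n;
}
```

The array doesn't have to be null-terminated, since the end of the range is given explicitly.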

(As an aside, you may have noticed an interesting fact about this specialization of rangebuf: you can use it to blur the distinction between file I/O and in-memory I/O. Most operating systems — including Unix and Windows — allow you to "memory map" a file so that you can get pointers to it the same way that you get pointers to ordinary memory. Combine memory mapping and rangebuf, and you've got a buffer for file input. It has fewer features than the standard C++ library filebuf class that came with your compiler, but as a result it's far smaller and simpler and probably faster. If you don't need filebuf's advanced features, and if you care about performance, something like rangebuf may be a good choice.)

Compared to input, output is both simpler and more idiosyncratic. We could write a fairly general input class, rangebuf, because it's almost always possible to provide access to data as a range of read-only iterators. Data modification takes far more forms, and I'm only going to present one of them: appending to a container that conforms to the STL's Sequence interface. This is enough to cover string, vector<char>, and rope, but there are also a great many things it doesn't cover.

There's no such thing as "putback" for output, and, since we're only appending instead of writing to arbitrary places in the middle, we don't have to worry about overriding the virtual functions for seeking. At this point, the class sequencebuf, shown in Listing 3, shouldn't seem very unfamiliar. We maintain a small fixed-size buffer. When it gets full, overflow gets called automatically, and overflow appends the contents of that internal buffer to the sequence we're managing. The user can look at that sequence by calling get_seq.

In sequencebuf, we use pbase, pptr, and epptr instead of eback, gptr, and egptr; we use setp instead of setg, but those differences are fairly unimportant. There's another difference — a fundamental difference between input and output — that's more important. Output buffers can be flushed. In this case, flushing means appending our internal buffer to the sequence we're writing to, whether that internal buffer is full or not. Users can flush buffers explicitly, and sometimes (as, for the sequencebuf class, when a user asks to see the sequence) we have to make sure that the buffers get flushed automatically.
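
The description above can be sketched as follows. This is not Listing 3; the names simple_sequencebuf and format_via_vector and the buffer size of 8 are assumptions, and the sketch requires only that the container support insert(end(), first, last).

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <streambuf>
#include <string>
#include <vector>

// A sketch only: an append-only stream buffer for a Sequence of char.
template <class Sequence>
class simple_sequencebuf : public std::streambuf
{
public:
  simple_sequencebuf() { setp(buffer_, buffer_ + buffer_size); }

  // Flush first, so that the caller sees everything written so far.
  const Sequence& get_seq()
  {
    flush_buffer();
    return seq_;
  }

protected:
  // Called when the write array is full (pptr() == epptr()).
  virtual int_type overflow(int_type c)
  {
    flush_buffer();
    if (!traits_type::eq_int_type(c, traits_type::eof()))
    {
      *pptr() = traits_type::to_char_type(c);
      pbump(1);
    }
    return traits_type::not_eof(c);
  }

  // Called for an explicit flush, such as os.flush() or std::endl.
  virtual int sync()
  {
    flush_buffer();
    return 0;
  }

private:
  // Append [pbase(), pptr()) to the sequence and empty the write array.
  void flush_buffer()
  {
    seq_.insert(seq_.end(), pbase(), pptr());
    setp(buffer_, buffer_ + buffer_size);
  }

  static const std::size_t buffer_size = 8;   // arbitrary, for illustration
  char buffer_[buffer_size];
  Sequence seq_;
};

// Example use: format into a vector<char>, then copy out as a string.
std::string format_via_vector(int n)
{
  simple_sequencebuf<std::vector<char> > buf;
  std::ostream os(&buf);
  os << "value = " << n;   // more than 8 characters, so overflow is exercised
  const std::vector<char>& seq = buf.get_seq();
  return std::string(seq.begin(), seq.end());
}
```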

Conclusions

On Usenet, people often ask questions like "How can I write a stream class that will do...?" Usually there can be no answer, because it's the wrong question. The right question is how to write a stream buffer class that does something special. And usually the answer is that, if you're willing to focus on your specific needs instead of trying for complete generality, writing that stream buffer isn't very complicated. For input you need to override underflow, which makes a read position available when there was none, and for output you need to override overflow, which makes a write position available when there was none. For more details it's better to look at code, like Listings 1 and 3 from this column, than to try to read the words in the C++ Standard. What the Standard has to say about streambuf's virtual functions is confusing and makes them seem more complicated than they really are.

The Standard C++ library provides a stream buffer class, std::stringbuf, that uses an std::string for input and output. That's an important idea: it's a convenient interface for changing data representation, and it allows you to format data items for display in some way that doesn't fit into conventional file I/O. By using your own in-memory stream buffer classes, you can generalize this important idea to other kinds of data structures.

Notes

[1] Matt Austern. "The Standard Librarian: IOStreams and stdio," C/C++ Users Journal, November 2000, http://www.cuj.com/experts/1811/austern.html.

[2] Matt Austern. "The Standard Librarian: Streambufs and Streambuf Iterators," C/C++ Users Journal, March 2001, http://www.cuj.com/experts/1903/austern.html.

Matt Austern is the author of Generic Programming and the STL and the chair of the C++ standardization committee’s library working group. He works at AT&T Labs — Research and can be contacted at austern@research.att.com.

