brooksmoses | A basic conceptual C++ question.

(This in some ways ties into

allbery's recently-posted essay on online communities. Most of my dearest friends, and much of my social life, link back in some way to the online communities that I found shortly after I started graduate school, which were places that I could ask this sort of question, and full of people who'd be interested in talking about the answer. In the last few years, they've nearly all drifted apart -- so I'm trying the experiment of asking here, to see what comes of it.)

So, I'm learning C++, from what I suspect is somewhat of an interesting perspective these days; I've been programming for two decades, been doing object-oriented programming as part of my paying work for two or three years, and maintain a Windows port of a large open-source C++ program. This means that I've got sort of weird gaps in what I know -- such as the fact that I wrote my first abstract class definition ... yesterday.

In any case, the problem is this: I'm writing a parser. It breaks a stream of characters up into tokens, so I've got an abstract class for "token" and a bunch of derived classes for the various kinds of tokens (names, parentheses, whitespace, etc.). And so I went off to write a function that takes in a reference to the character stream, bites off a bit of it, and returns a token object, figuring that would be pretty trivial.

Except that, as best I can tell, I can't do that in C++.

Because, you see, I want to return an arbitrary token object, which might be an instance of any of the various derived token classes, and so the return type from the function ought to be declared as token, and my C++ manual tells me that "an abstract class may not be used as ... a function return type." I've got to use a pointer to the abstract class, or a reference to it, instead. But that's no good, because then I have to worry about where the object that's pointed to comes from, and in particular I have to worry about who frees the memory when it goes away. Which is an almighty pain, because this is a function result, and might very easily just be passed as an argument to another function, and then who knows who's supposed to free the memory?

I'd also like to be able to do a nice, simple, "A = B" assignment, where B is a token object, and get a copy of the object in A. Except that the same problem arises; I can't declare A as an abstract token; I have to declare it as a pointer or reference to one, and I get the same sort of problems; I want a copy of the token, not a reference to the same piece of memory.

As best I can figure from what I know, the only way to make this work is to write a wrapper class that contains a pointer to the abstract class (and the derived classes), and has a destructor that clears the pointed-at memory. And then, ugh, it also has to explicitly handle all the assignment-is-copying stuff to construct a new token-object for the result, and I'm not sure what happens if one of these gets passed by value, but I know if I'm not careful I'll get a copy of the pointer and a destructor that clears it when the function ends, leaving the original pointer pointing to garbage.

There's got to be an easier way, right? Or have I just been spoiled by using a dynamically-typed language to learn class inheritance?

Current Mood: perplexed

Threaded | Top-Level Comments Only

From:

elisteran.livejournal.com

First, this is not purely a dynamically-typed vs statically-typed language distinction; it's got at least as much to do with managed-memory vs unmanaged-memory.

The wrapper class approach you're considering is a C++ pattern called "Pimpl" (pointer to Implementation). If you use something like boost's shared_ptr, you can use the default copy-constructor/assignment operators. See this article (http://www.boost.org/libs/smart_ptr/sp_techniques.html) for a quick link.

Perhaps more apropos is the bit where it talks about the abstract class/factory pattern, which is almost what you're doing.

I haven't programmed in C++ in a long time. But as far as I can tell, the answer to half of C++'s wrinkles nowadays seem to be "boost". And a good chunk of the boost libraries will be in the next C++ standard, so it's a fairly safe thing to grab some of its source.

From:

elisteran.livejournal.com

To expand on my thing above, Java makes this simple, despite being mostly statically typed. And, if you really wanted to avoid the shared_ptr, auto_ptr, which is standard, may be what you're looking for (it depends on your ownership semantics; auto_ptr is notorious for being misused because people lose track of who's supposed to be owning the pointer. See http://www.gotw.ca/publications/using_auto_ptr_effectively.htm ).

From:

surelars.livejournal.com

You're not spoiled by dynamic typing; you're spoiled by garbage collection. You can do GC for a statically typed language just fine.

Not having GC makes life more complicated, requiring the use of references, pointers to implementations, and other use of indirection. It grants you more control (i.e., when is a copy a copy), more predictability in terms of memory use and run-time behavior, but the cost of that control is that you're required to think about these details even when you don't want to.

With the speed of modern-day CPUs and the price of RAM, GC will do the job just fine for most software. That, however, was not the case back in the 80's. Also, GC is not at all what it was 20 year ago. Oh, and with the sloppy programming we see in most places, GC would often be better than all the darn memory leaks we have now.

Then there's the software that will not tolerate unpredictable behaviour. You don't want the control software of your passenger airliner, nuclear power plant, etc. to pause for GC. Again, even for 90% of the power plant or airlines sw it probably wouldn't matter, but there's that small crucial part where it mustn't happen. Despite the research that's gone into real-time GC.

C and C++ comes from the branch of programming that favour "the programmer must be in control" and the "the programmer (wants to) know what he's doing" type of thinking. It makes really high quality code with superior run-time behaviour possible, but with a price for the programmer, and often with sub-par results for average programmers.

Given recent (10+ years) advances in both hw and GC tech, I favor the "let the programmer worry 'bout the algorithm" type of thinking. In an ideal world I'd like to be able to combine GC and non-GC in the same language, but that's not easy at all.

From:

elisteran.livejournal.com

Of course, you can get 90% of the benefits of true garbage collection (figure pulled out of the air, of course) by simply using reference counting, which is nice and deterministic (and thus suitable for real-time constraints). Without a compacting garbage collector, you may run of usable memory due to heap fragmentation -- but that's a worry with explicit memory allocation anyway.

From:

dragon3.livejournal.com

Maybe I just don't get the need for abstraction in this case. Why can't this all be done with strings? Are you planning to parse something other than a character stream?

From:

elisteran.livejournal.com

At a guess, Brooks is using tokens to simplify the input being read. Suppose, for example, that he's lexing a C expression:
int i = 2;
This is valid C code, as it matches the grammar clauses

Type ::= Int 
Expression ::= Type Identifier '=' Expression ';'
Expression ::= Integer

(the grammar is fictitious; I'm not sure that's exactly how the standard grammar is broken down).

Using simple strings, you'd get the resulting sequence: "int","i","=","2",";" or so. That didn't gain you very much; you still will have to do a bunch of work in the parser. If, instead, you got the tokens (with field data in parentheses)
Int,Identifier("i"), Assign,Integer(2), Semicolon
it'd be easier to parse, since you can simply switch on each token. Rather than having to continually see whether this string is a valid identifier, or could be interpreted as a keyword, you just have to do it once.

In some sense, you could lex it into just strings and do the same amount of work in the parser; but separating tokenizing from parsing makes for easier-to-understand code.

(Incidentally, this sort of tokenizing, by the way, is why most modern computer languages have reserved keywords. Whereas some older languages let you do things like IF IF = 2 THEN THEN := 3, a language like C just says you can't use if as an identifier).

(and sorry, I don't mean to swamp the thread, I'm just taking a break from grading assignments, and this is a handy distraction :).

From:

brooksmoses

Yeah -- I realized, after I went home, that I've actually been spoiled by (modern) Fortran. Fortran doesn't do garbage collection either, but what it does (for dynamic arrays and strings, and the concept could be nearly-trivially extended to this case) has the same result that there's no need to worry about memory management. Specifically, it defines an "allocatable" type, which is rather like a pointer to dynamically-allocated memory, but the pointed-to memory is uniquely tied to the declared variable -- so that, when the declared variable goes away, the memory is freed. Allocatables can also be function results or expressions (or parts of user-defined types in whatever way one of those is used), and are appropriately handled in those cases as well.

It occurs to me that for single objects, I can do most of that with a wrapper class with a destructor, though it's annoying to write. But it certainly gives an entertaining new meaning to the old saying, "An [adjective] programmer can write Fortran in any language"!

From:

brooksmoses

It's actually a luddite sort of reason, in a way -- it's a TeX-like macro expansion language, ignoring all of the past 25 years of developments in how to write language syntax. This means that (among other things) it's parsing lists of tokens out of the input character stream, sticking them into storage, and then interpreting them some time later when all of the words and character types in the langauge might have been redefined. So, if the token is a single character, what's actually stored is the character and a code indicating its type.

In addition, since it's a macro expansion language, there are often times where a token comes from an operation on other tokens, rather than directly from the parsing. I'm adding numeric tokens to the language, and I'd rather store those as numbers rather than as strings, so as not to lose precision.

This is also why I'm writing my own parser rather than using a canned one. From the standpoint of most parsers, this language is a vile evil thing that does not obey any of the usual rules.

From:

brooksmoses

Yup, pretty much. Also, this one does end up allowing evil things conceptually somewhat like your "IF IF =2 THEN THEN := 3", and what needs to get stored is something equivalent to (name "IF", type If-keyword) (name "IF", type Identifier) ... and so on. It's not quite at the same level, though -- as I mentioned to Ryk, it's a TeX-like language, where the concept of "start of group" and "the character '{'" are not inherently linked, and one could -- in the middle of a program -- define "[" to be a grouping character instead and use "{" for something else.

(Don't worry about swamping the thread. More conversation is a good thing!)

From:

brooksmoses

A good point on the dynamic- vs. static-typing distinction; on reflection, the real distinction is that I'm used to using languages where it's trivial to put things with compile-time-unknown data representations (essentially anything, in Matlab; dynamic arrays in Fortran) into managed memory.

Thanks for the reference to the shared_ptr article; that looks rather useful (as does the auto_ptr one) in figuring out how to do the wrapper function and what exactly I want it to do.

Also useful to know that there's a name for this pattern (which implies that people who know more than I do also use it!).

From:

suzanne.livejournal.com

*shakes head* Love you. You're incomprehensible, but I love you. *snuggles*

From:

pure-agnostic.livejournal.com

Hey Brooks,

We just met while painting Jen's new room today. I liked our conversation enough to want to read your LJ. I read part of your LJ and came across this post.

As a longtime C++ programmer, and having written parsers, I can offer a few suggestions. Of course, you may have figured out a solution already, and in that case, feel free to ignore my suggestions.

As you know, a function can't return an abstract type, but can return a pointer to one. So you know out you want to return a copy of the token, but then the question is: "How do I copy it so that it is of the most derived type, and not just the abstract type?"

Consider adding a Clone function to your abstract class:
class Token
{
public:
virtual Token * Clone( void ) const = 0;
// ... other functions in Token class.
protected:
/// Inline protected copy constructor.
Token( const Token & that ) {}
};

You can then define a similar Clone function in every derived class, and that allows you to copy anything derived from Token and get a copy in the most derived type, but usable as Token. I'd suggest reading Item 54 in the book "C++ Coding Standards" by Herb Sutter and Andrei Alexandrescu for how to implement a virtual Clone function. It's a great book, and I'm a project administrator for an open source project made by one of the authors of that book.

The open source project: http://sourceforge.net/projects/loki-lib/
The author's home page: http://www.erdani.org/
Website for C++ Coding Standards book: http://safari.awprofessional.com/0321113586

If you want to build your own parser, I strongly recommend using Boost.Spirit. I wrote several language parsers in Spirit, and recommend it over any other parsing engine. I also wrote several parsing utilities which work on top of Boost.Spirit. These utilities do a lot of the low-level semantic actions for a Spirit-style parser, and I intend to make an open source project out of them. Once the parser is set up, all a host program has to do is provide interface classes which the parser can call (something akin to the Visitor design pattern).

Boost's documentation for Spirit: http://www.boost.org/libs/spirit/index.html

Hope that helps,

Cheers,

Rich

From:

brooksmoses

Hello, Rich! Apologies for taking so long to reply to this. I enjoyed meeting you as well. And even though I do have a bit of a solution for this particular problem, the real point of this project isn't to get a parser to but to learn some more programming skills, so the advice is definitely helpful nonetheless!

Anyhow, yes, having a Clone function does look like a useful approach to things. (Actually, now that I go back and look at what I did after writing this, it looks like it's pretty nearly the solution that I ended up using -- which makes it definitely useful information, since I can now compare the details and see what subtleties I missed. Thanks!)

And also thank you for the advice with regard to Boost.Spirit. I'll keep that in mind and look into it when I get around to wanting to write something that uses a real parser rather than a toy learning-exercise one. :) (And do let me know if you do end up creating an open-source wrapper utility for it.)

Threaded | Top-Level Comments Only

Profile

brooksmoses

My Website

January 2026

S	M	T	W	T	F	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31

Page Summary

Active Entries

Style Credit

Base style: Crossroads by branchandroot
Theme: Cloud Sounds by krja

Expand Cut Tags

No cut tags

Page generated Mar. 23rd, 2026 04:59 am

Brooks Moses

A basic conceptual C++ question.

A basic conceptual C++ question.

no subject

no subject

no subject

no subject

Luddite?

Re: Luddite?

no subject

Re: Luddite?

Re: Luddite?

no subject

sighs

no subject

no subject

Profile

January 2026

Most Popular Tags

Page Summary

Active Entries

Style Credit

Expand Cut Tags