A basic conceptual C++ question.
Oct. 4th, 2006 10:31 pm(This in some ways ties into
allbery's recently-posted essay on online communities. Most of my dearest friends, and much of my social life, link back in some way to the online communities that I found shortly after I started graduate school, which were places that I could ask this sort of question, and full of people who'd be interested in talking about the answer. In the last few years, they've nearly all drifted apart -- so I'm trying the experiment of asking here, to see what comes of it.)
So, I'm learning C++, from what I suspect is somewhat of an interesting perspective these days; I've been programming for two decades, been doing object-oriented programming as part of my paying work for two or three years, and maintain a Windows port of a large open-source C++ program. This means that I've got sort of weird gaps in what I know -- such as the fact that I wrote my first abstract class definition ... yesterday.
In any case, the problem is this: I'm writing a parser. It breaks a stream of characters up into tokens, so I've got an abstract class for "token" and a bunch of derived classes for the various kinds of tokens (names, parentheses, whitespace, etc.). And so I went off to write a function that takes in a reference to the character stream, bites off a bit of it, and returns a token object, figuring that would be pretty trivial.
Except that, as best I can tell, I can't do that in C++.
Because, you see, I want to return an arbitrary token object, which might be an instance of any of the various derived token classes, and so the return type from the function ought to be declared as token, and my C++ manual tells me that "an abstract class may not be used as ... a function return type." I've got to use a pointer to the abstract class, or a reference to it, instead. But that's no good, because then I have to worry about where the object that's pointed to comes from, and in particular I have to worry about who frees the memory when it goes away. Which is an almighty pain, because this is a function result, and might very easily just be passed as an argument to another function, and then who knows who's supposed to free the memory?
I'd also like to be able to do a nice, simple, "A = B" assignment, where B is a token object, and get a copy of the object in A. Except that the same problem arises; I can't declare A as an abstract token; I have to declare it as a pointer or reference to one, and I get the same sort of problems; I want a copy of the token, not a reference to the same piece of memory.
As best I can figure from what I know, the only way to make this work is to write a wrapper class that contains a pointer to the abstract class (and the derived classes), and has a destructor that clears the pointed-at memory. And then, ugh, it also has to explicitly handle all the assignment-is-copying stuff to construct a new token-object for the result, and I'm not sure what happens if one of these gets passed by value, but I know if I'm not careful I'll get a copy of the pointer and a destructor that clears it when the function ends, leaving the original pointer pointing to garbage.
There's got to be an easier way, right? Or have I just been spoiled by using a dynamically-typed language to learn class inheritance?
So, I'm learning C++, from what I suspect is somewhat of an interesting perspective these days; I've been programming for two decades, been doing object-oriented programming as part of my paying work for two or three years, and maintain a Windows port of a large open-source C++ program. This means that I've got sort of weird gaps in what I know -- such as the fact that I wrote my first abstract class definition ... yesterday.
In any case, the problem is this: I'm writing a parser. It breaks a stream of characters up into tokens, so I've got an abstract class for "token" and a bunch of derived classes for the various kinds of tokens (names, parentheses, whitespace, etc.). And so I went off to write a function that takes in a reference to the character stream, bites off a bit of it, and returns a token object, figuring that would be pretty trivial.
Except that, as best I can tell, I can't do that in C++.
Because, you see, I want to return an arbitrary token object, which might be an instance of any of the various derived token classes, and so the return type from the function ought to be declared as token, and my C++ manual tells me that "an abstract class may not be used as ... a function return type." I've got to use a pointer to the abstract class, or a reference to it, instead. But that's no good, because then I have to worry about where the object that's pointed to comes from, and in particular I have to worry about who frees the memory when it goes away. Which is an almighty pain, because this is a function result, and might very easily just be passed as an argument to another function, and then who knows who's supposed to free the memory?
I'd also like to be able to do a nice, simple, "A = B" assignment, where B is a token object, and get a copy of the object in A. Except that the same problem arises; I can't declare A as an abstract token; I have to declare it as a pointer or reference to one, and I get the same sort of problems; I want a copy of the token, not a reference to the same piece of memory.
As best I can figure from what I know, the only way to make this work is to write a wrapper class that contains a pointer to the abstract class (and the derived classes), and has a destructor that clears the pointed-at memory. And then, ugh, it also has to explicitly handle all the assignment-is-copying stuff to construct a new token-object for the result, and I'm not sure what happens if one of these gets passed by value, but I know if I'm not careful I'll get a copy of the pointer and a destructor that clears it when the function ends, leaving the original pointer pointing to garbage.
There's got to be an easier way, right? Or have I just been spoiled by using a dynamically-typed language to learn class inheritance?
no subject
Date: 2006-10-05 12:24 pm (UTC)The wrapper class approach you're considering is a C++ pattern called "Pimpl" (pointer to Implementation). If you use something like boost's shared_ptr, you can use the default copy-constructor/assignment operators. See this article (http://www.boost.org/libs/smart_ptr/sp_techniques.html) for a quick link.
Perhaps more apropos is the bit where it talks about the abstract class/factory pattern, which is almost what you're doing.
I haven't programmed in C++ in a long time. But as far as I can tell, the answer to half of C++'s wrinkles nowadays seem to be "boost". And a good chunk of the boost libraries will be in the next C++ standard, so it's a fairly safe thing to grab some of its source.
no subject
Date: 2006-10-05 12:29 pm (UTC)no subject
Date: 2006-10-05 01:19 pm (UTC)Not having GC makes life more complicated, requiring the use of references, pointers to implementations, and other use of indirection. It grants you more control (i.e., when is a copy a copy), more predictability in terms of memory use and run-time behavior, but the cost of that control is that you're required to think about these details even when you don't want to.
With the speed of modern-day CPUs and the price of RAM, GC will do the job just fine for most software. That, however, was not the case back in the 80's. Also, GC is not at all what it was 20 year ago. Oh, and with the sloppy programming we see in most places, GC would often be better than all the darn memory leaks we have now.
Then there's the software that will not tolerate unpredictable behaviour. You don't want the control software of your passenger airliner, nuclear power plant, etc. to pause for GC. Again, even for 90% of the power plant or airlines sw it probably wouldn't matter, but there's that small crucial part where it mustn't happen. Despite the research that's gone into real-time GC.
C and C++ comes from the branch of programming that favour "the programmer must be in control" and the "the programmer (wants to) know what he's doing" type of thinking. It makes really high quality code with superior run-time behaviour possible, but with a price for the programmer, and often with sub-par results for average programmers.
Given recent (10+ years) advances in both hw and GC tech, I favor the "let the programmer worry 'bout the algorithm" type of thinking. In an ideal world I'd like to be able to combine GC and non-GC in the same language, but that's not easy at all.
no subject
Date: 2006-10-05 02:49 pm (UTC)Luddite?
Date: 2006-10-05 02:59 pm (UTC)Re: Luddite?
Date: 2006-10-05 05:29 pm (UTC)At a guess, Brooks is using tokens to simplify the input being read. Suppose, for example, that he's lexing a C expression:
int i = 2;
This is valid C code, as it matches the grammar clauses
(the grammar is fictitious; I'm not sure that's exactly how the standard grammar is broken down).
Using simple strings, you'd get the resulting sequence: "int","i","=","2",";" or so. That didn't gain you very much; you still will have to do a bunch of work in the parser. If, instead, you got the tokens (with field data in parentheses)
Int,Identifier("i"), Assign,Integer(2), Semicolon
it'd be easier to parse, since you can simply switch on each token. Rather than having to continually see whether this string is a valid identifier, or could be interpreted as a keyword, you just have to do it once.
In some sense, you could lex it into just strings and do the same amount of work in the parser; but separating tokenizing from parsing makes for easier-to-understand code.
(Incidentally, this sort of tokenizing, by the way, is why most modern computer languages have reserved keywords. Whereas some older languages let you do things like
IF IF = 2 THEN THEN := 3, a language like C just says you can't useifas an identifier).(and sorry, I don't mean to swamp the thread, I'm just taking a break from grading assignments, and this is a handy distraction :).
no subject
Date: 2006-10-05 06:35 pm (UTC)It occurs to me that for single objects, I can do most of that with a wrapper class with a destructor, though it's annoying to write. But it certainly gives an entertaining new meaning to the old saying, "An [adjective] programmer can write Fortran in any language"!
Re: Luddite?
Date: 2006-10-05 06:55 pm (UTC)In addition, since it's a macro expansion language, there are often times where a token comes from an operation on other tokens, rather than directly from the parsing. I'm adding numeric tokens to the language, and I'd rather store those as numbers rather than as strings, so as not to lose precision.
This is also why I'm writing my own parser rather than using a canned one. From the standpoint of most parsers, this language is a vile evil thing that does not obey any of the usual rules.
Re: Luddite?
Date: 2006-10-05 07:06 pm (UTC)(Don't worry about swamping the thread. More conversation is a good thing!)
no subject
Date: 2006-10-05 07:26 pm (UTC)Thanks for the reference to the shared_ptr article; that looks rather useful (as does the auto_ptr one) in figuring out how to do the wrapper function and what exactly I want it to do.
Also useful to know that there's a name for this pattern (which implies that people who know more than I do also use it!).
*sighs*
Date: 2006-10-06 03:04 am (UTC)no subject
Date: 2006-12-17 01:43 am (UTC)We just met while painting Jen's new room today. I liked our conversation enough to want to read your LJ. I read part of your LJ and came across this post.
As a longtime C++ programmer, and having written parsers, I can offer a few suggestions. Of course, you may have figured out a solution already, and in that case, feel free to ignore my suggestions.
As you know, a function can't return an abstract type, but can return a pointer to one. So you know out you want to return a copy of the token, but then the question is: "How do I copy it so that it is of the most derived type, and not just the abstract type?"
Consider adding a Clone function to your abstract class:
class Token
{
public:
virtual Token * Clone( void ) const = 0;
// ... other functions in Token class.
protected:
/// Inline protected copy constructor.
Token( const Token & that ) {}
};
You can then define a similar Clone function in every derived class, and that allows you to copy anything derived from Token and get a copy in the most derived type, but usable as Token. I'd suggest reading Item 54 in the book "C++ Coding Standards" by Herb Sutter and Andrei Alexandrescu for how to implement a virtual Clone function. It's a great book, and I'm a project administrator for an open source project made by one of the authors of that book.
The open source project: http://sourceforge.net/projects/loki-lib/
The author's home page: http://www.erdani.org/
Website for C++ Coding Standards book: http://safari.awprofessional.com/0321113586
If you want to build your own parser, I strongly recommend using Boost.Spirit. I wrote several language parsers in Spirit, and recommend it over any other parsing engine. I also wrote several parsing utilities which work on top of Boost.Spirit. These utilities do a lot of the low-level semantic actions for a Spirit-style parser, and I intend to make an open source project out of them. Once the parser is set up, all a host program has to do is provide interface classes which the parser can call (something akin to the Visitor design pattern).
Boost's documentation for Spirit: http://www.boost.org/libs/spirit/index.html
Hope that helps,
Cheers,
Rich
no subject
Date: 2007-01-16 06:50 am (UTC)Anyhow, yes, having a Clone function does look like a useful approach to things. (Actually, now that I go back and look at what I did after writing this, it looks like it's pretty nearly the solution that I ended up using -- which makes it definitely useful information, since I can now compare the details and see what subtleties I missed. Thanks!)
And also thank you for the advice with regard to Boost.Spirit. I'll keep that in mind and look into it when I get around to wanting to write something that uses a real parser rather than a toy learning-exercise one. :) (And do let me know if you do end up creating an open-source wrapper utility for it.)