brooksmoses: (anime-me)
[personal profile] brooksmoses
(This in some ways ties into [livejournal.com profile] allbery's recently-posted essay on online communities. Most of my dearest friends, and much of my social life, link back in some way to the online communities that I found shortly after I started graduate school, which were places that I could ask this sort of question, and full of people who'd be interested in talking about the answer. In the last few years, they've nearly all drifted apart -- so I'm trying the experiment of asking here, to see what comes of it.)

So, I'm learning C++, from what I suspect is somewhat of an interesting perspective these days; I've been programming for two decades, been doing object-oriented programming as part of my paying work for two or three years, and maintain a Windows port of a large open-source C++ program. This means that I've got sort of weird gaps in what I know -- such as the fact that I wrote my first abstract class definition ... yesterday.

In any case, the problem is this: I'm writing a parser. It breaks a stream of characters up into tokens, so I've got an abstract class for "token" and a bunch of derived classes for the various kinds of tokens (names, parentheses, whitespace, etc.). And so I went off to write a function that takes in a reference to the character stream, bites off a bit of it, and returns a token object, figuring that would be pretty trivial.

Except that, as best I can tell, I can't do that in C++.

Because, you see, I want to return an arbitrary token object, which might be an instance of any of the various derived token classes, and so the return type from the function ought to be declared as token, and my C++ manual tells me that "an abstract class may not be used as ... a function return type." I've got to use a pointer to the abstract class, or a reference to it, instead. But that's no good, because then I have to worry about where the object that's pointed to comes from, and in particular I have to worry about who frees the memory when it goes away. Which is an almighty pain, because this is a function result, and might very easily just be passed as an argument to another function, and then who knows who's supposed to free the memory?

I'd also like to be able to do a nice, simple, "A = B" assignment, where B is a token object, and get a copy of the object in A. Except that the same problem arises; I can't declare A as an abstract token; I have to declare it as a pointer or reference to one, and I get the same sort of problems; I want a copy of the token, not a reference to the same piece of memory.

As best I can figure from what I know, the only way to make this work is to write a wrapper class that contains a pointer to the abstract class (and the derived classes), and has a destructor that clears the pointed-at memory. And then, ugh, it also has to explicitly handle all the assignment-is-copying stuff to construct a new token-object for the result, and I'm not sure what happens if one of these gets passed by value, but I know if I'm not careful I'll get a copy of the pointer and a destructor that clears it when the function ends, leaving the original pointer pointing to garbage.

There's got to be an easier way, right? Or have I just been spoiled by using a dynamically-typed language to learn class inheritance?

Date: 2006-10-05 12:24 pm (UTC)
From: [identity profile] elisteran.livejournal.com
First, this is not purely a dynamically-typed vs statically-typed language distinction; it's got at least as much to do with managed-memory vs unmanaged-memory.


The wrapper class approach you're considering is a C++ pattern called "Pimpl" (pointer to Implementation). If you use something like boost's shared_ptr, you can use the default copy-constructor/assignment operators. See this article (http://www.boost.org/libs/smart_ptr/sp_techniques.html) for a quick link.

Perhaps more apropos is the bit where it talks about the abstract class/factory pattern, which is almost what you're doing.

I haven't programmed in C++ in a long time. But as far as I can tell, the answer to half of C++'s wrinkles nowadays seem to be "boost". And a good chunk of the boost libraries will be in the next C++ standard, so it's a fairly safe thing to grab some of its source.

Date: 2006-10-05 12:29 pm (UTC)
From: [identity profile] elisteran.livejournal.com
To expand on my thing above, Java makes this simple, despite being mostly statically typed. And, if you really wanted to avoid the shared_ptr, auto_ptr, which is standard, may be what you're looking for (it depends on your ownership semantics; auto_ptr is notorious for being misused because people lose track of who's supposed to be owning the pointer. See http://www.gotw.ca/publications/using_auto_ptr_effectively.htm ).

Date: 2006-10-05 01:19 pm (UTC)
From: [identity profile] surelars.livejournal.com
You're not spoiled by dynamic typing; you're spoiled by garbage collection. You can do GC for a statically typed language just fine.

Not having GC makes life more complicated, requiring the use of references, pointers to implementations, and other use of indirection. It grants you more control (i.e., when is a copy a copy), more predictability in terms of memory use and run-time behavior, but the cost of that control is that you're required to think about these details even when you don't want to.

With the speed of modern-day CPUs and the price of RAM, GC will do the job just fine for most software. That, however, was not the case back in the 80's. Also, GC is not at all what it was 20 year ago. Oh, and with the sloppy programming we see in most places, GC would often be better than all the darn memory leaks we have now.

Then there's the software that will not tolerate unpredictable behaviour. You don't want the control software of your passenger airliner, nuclear power plant, etc. to pause for GC. Again, even for 90% of the power plant or airlines sw it probably wouldn't matter, but there's that small crucial part where it mustn't happen. Despite the research that's gone into real-time GC.

C and C++ comes from the branch of programming that favour "the programmer must be in control" and the "the programmer (wants to) know what he's doing" type of thinking. It makes really high quality code with superior run-time behaviour possible, but with a price for the programmer, and often with sub-par results for average programmers.

Given recent (10+ years) advances in both hw and GC tech, I favor the "let the programmer worry 'bout the algorithm" type of thinking. In an ideal world I'd like to be able to combine GC and non-GC in the same language, but that's not easy at all.

Date: 2006-10-05 02:49 pm (UTC)
From: [identity profile] elisteran.livejournal.com
Of course, you can get 90% of the benefits of true garbage collection (figure pulled out of the air, of course) by simply using reference counting, which is nice and deterministic (and thus suitable for real-time constraints). Without a compacting garbage collector, you may run of usable memory due to heap fragmentation -- but that's a worry with explicit memory allocation anyway.

Luddite?

Date: 2006-10-05 02:59 pm (UTC)
From: [identity profile] dragon3.livejournal.com
Maybe I just don't get the need for abstraction in this case. Why can't this all be done with strings? Are you planning to parse something other than a character stream?

Re: Luddite?

Date: 2006-10-05 05:29 pm (UTC)
From: [identity profile] elisteran.livejournal.com

At a guess, Brooks is using tokens to simplify the input being read. Suppose, for example, that he's lexing a C expression:

int i = 2;

This is valid C code, as it matches the grammar clauses

Type ::= Int 
Expression ::= Type Identifier '=' Expression ';'
Expression ::= Integer


(the grammar is fictitious; I'm not sure that's exactly how the standard grammar is broken down).




Using simple strings, you'd get the resulting sequence: "int","i","=","2",";" or so. That didn't gain you very much; you still will have to do a bunch of work in the parser. If, instead, you got the tokens (with field data in parentheses)
Int,Identifier("i"), Assign,Integer(2), Semicolon
it'd be easier to parse, since you can simply switch on each token. Rather than having to continually see whether this string is a valid identifier, or could be interpreted as a keyword, you just have to do it once.



In some sense, you could lex it into just strings and do the same amount of work in the parser; but separating tokenizing from parsing makes for easier-to-understand code.



(Incidentally, this sort of tokenizing, by the way, is why most modern computer languages have reserved keywords. Whereas some older languages let you do things like IF IF = 2 THEN THEN := 3, a language like C just says you can't use if as an identifier).



(and sorry, I don't mean to swamp the thread, I'm just taking a break from grading assignments, and this is a handy distraction :).

*sighs*

Date: 2006-10-06 03:04 am (UTC)
From: [identity profile] suzanne.livejournal.com
*shakes head* Love you. You're incomprehensible, but I love you. *snuggles*

Date: 2006-12-17 01:43 am (UTC)
From: [identity profile] pure-agnostic.livejournal.com
Hey Brooks,

We just met while painting Jen's new room today. I liked our conversation enough to want to read your LJ. I read part of your LJ and came across this post.

As a longtime C++ programmer, and having written parsers, I can offer a few suggestions. Of course, you may have figured out a solution already, and in that case, feel free to ignore my suggestions.

As you know, a function can't return an abstract type, but can return a pointer to one. So you know out you want to return a copy of the token, but then the question is: "How do I copy it so that it is of the most derived type, and not just the abstract type?"

Consider adding a Clone function to your abstract class:
class Token
{
public:
virtual Token * Clone( void ) const = 0;
// ... other functions in Token class.
protected:
/// Inline protected copy constructor.
Token( const Token & that ) {}
};

You can then define a similar Clone function in every derived class, and that allows you to copy anything derived from Token and get a copy in the most derived type, but usable as Token. I'd suggest reading Item 54 in the book "C++ Coding Standards" by Herb Sutter and Andrei Alexandrescu for how to implement a virtual Clone function. It's a great book, and I'm a project administrator for an open source project made by one of the authors of that book.

The open source project: http://sourceforge.net/projects/loki-lib/
The author's home page: http://www.erdani.org/
Website for C++ Coding Standards book: http://safari.awprofessional.com/0321113586

If you want to build your own parser, I strongly recommend using Boost.Spirit. I wrote several language parsers in Spirit, and recommend it over any other parsing engine. I also wrote several parsing utilities which work on top of Boost.Spirit. These utilities do a lot of the low-level semantic actions for a Spirit-style parser, and I intend to make an open source project out of them. Once the parser is set up, all a host program has to do is provide interface classes which the parser can call (something akin to the Visitor design pattern).

Boost's documentation for Spirit: http://www.boost.org/libs/spirit/index.html


Hope that helps,

Cheers,

Rich
Page generated Mar. 23rd, 2026 04:59 am
Powered by Dreamwidth Studios