A program that writes code: should it use 'string'?

Ramon F Herrera

I am writing a program that generates source code. See a snippet
below. My question is about the use of that growing 'code' variable.
Is it efficient? Is it recommended for this case?

The code generated can grow a lot. Perhaps I should allocate a large
max size in advance?

TIA,

-RFH
-------------

void SynthesizeTextField(CompleteField fullTextField)
{
    string code;
    string baseFieldname = "text";
    stringstream ss;
    static int subindex = 1;

    code = "Field ";
    code += baseFieldname;
    ss << subindex;
    code += ss.str();
    code += " ";
    code += "doc.FieldCreate(\"";
    code += baseFieldname;
    code += ss.str();
    code += "\", Field::e_text, \"\", \"\");";

    subindex++;
}

Jun 27 '08 #1

Kai-Uwe Bux

Ramon F Herrera wrote:

> I am writing a program that generates source code. See a snippet
> below. My question is about the use of that growing 'code' variable.
> Is it efficient? Is it recommended for this case?

Recommended is to measure before you optimize. Write the program so that it
is easy to understand. When (and only when) you have a performance problem,
don't guess what the cause might be; instead, use a profiler to identify
the bottleneck and then do something about it.

> The code generated can grow a lot. Perhaps I should allocate a large
> max size in advance?

[snip]

If profiling shows that appending to the string is too costly, reserving a
certain capacity would be the first thing to try. It's the least intrusive
measure.

Best

Kai-Uwe Bux

Jun 27 '08 #2

Daniel T.

Ramon F Herrera <ra***@conexus.net> wrote:

> I am writing a program that generates source code. See a snippet
> below. My question is about the use of that growing 'code' variable.
> Is it efficient? Is it recommended for this case?
>
> The code generated can grow a lot. Perhaps I should allocate a large
> max size in advance?

A std::string is just a vector<char> with some extra functions that you
probably don't need for this particular variable.

I think your best bet would be to make your own class to represent
"code", and implement that class with a string if that makes sense to you.
The nice thing is you can always change the implementation of the class
later as profiling requires, without affecting any other code.
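
A minimal sketch of that idea, under assumptions: the class name CodeBuffer and its member names are hypothetical and do not come from the original program. Callers only see append() and str(), so the std::string inside could be replaced later without touching them.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical wrapper (the name CodeBuffer is an assumption, not from
// the original post). Callers never touch the std::string directly, so
// the storage can be swapped later if profiling calls for it.
class CodeBuffer
{
public:
    // Optionally reserve capacity up front; 0 means "let the string grow".
    explicit CodeBuffer(std::size_t expectedSize = 0)
    {
        text_.reserve(expectedSize);
    }

    // Append a piece of generated text; returns *this to allow chaining.
    CodeBuffer& append(const std::string& piece)
    {
        text_ += piece;
        return *this;
    }

    // Append an integer formatted in decimal.
    CodeBuffer& append(int value)
    {
        std::ostringstream ss;
        ss << value;
        text_ += ss.str();
        return *this;
    }

    // Read-only access to everything appended so far.
    const std::string& str() const { return text_; }

private:
    std::string text_;  // current implementation detail
};
```

With this, the first lines of the snippet could be written as code.append("Field ").append("text").append(1), and the representation stays private to the class.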

Jun 27 '08 #3

James Kanze

On Jun 1, 11:58 pm, Ramon F Herrera <ra...@conexus.net> wrote:

> I am writing a program that generates source code. See a snippet
> below. My question is about the use of that growing 'code' variable.
> Is it efficient? Is it recommended for this case?
>
> The code generated can grow a lot. Perhaps I should allocate a large
> max size in advance?
>
> -------------
>
> void SynthesizeTextField(CompleteField fullTextField)
> {
>     string code;
>     string baseFieldname = "text";
>     stringstream ss;
>     static int subindex = 1;
>
>     code = "Field ";
>     code += baseFieldname;
>     ss << subindex;
>     code += ss.str();
>     code += " ";
>     code += "doc.FieldCreate(\"";
>     code += baseFieldname;
>     code += ss.str();
>     code += "\", Field::e_text, \"\", \"\");";
>
>     subindex++;
> }

For starters, I'd generate (or support generation) directly into
the output stream. Something like:

std::ostream&
SynthesizeTextField(
    std::ostream& dest,
    ... )
{
    // ...
    return dest ;
}

You're formatting here (some of the data is numeric,
apparently), so you might as well treat the entire thing as a
stream. And you'll certainly be outputting it in the end;
there's not much you can do with C++ source code within the
program, so you might as well generate directly into the output
stream, and never build the string at all.

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Jun 27 '08 #4

Juha Nieminen

Ramon F Herrera wrote:

> I am writing a program that generates source code. See a snippet
> below. My question is about the use of that growing 'code' variable.
> Is it efficient? Is it recommended for this case?

I think it's ok. You could also say something like
"code.reserve(100*1024);", which reserves at least 100 kB (or any
other amount you feel is about right) of memory up front so that the
string never has to reallocate (unless you exceed that limit, of
course), which might make it slightly more efficient.
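
A small sketch of that, assuming a hypothetical helper named buildLines; only the reserve() call itself comes from the suggestion above.

```cpp
#include <cstddef>
#include <string>

// Sketch: build a large string after reserving capacity up front.
// buildLines and lineCount are illustrative names, not from the thread.
std::string buildLines(std::size_t lineCount)
{
    std::string code;
    code.reserve(100 * 1024);  // request capacity for at least 100 KiB

    // Appends below reuse the reserved storage until it is exhausted;
    // past that point the string simply grows as usual.
    for (std::size_t i = 0; i < lineCount; ++i) {
        code += "// generated line\n";
    }
    return code;
}
```

Note that reserve() changes only the capacity, not the size, so the string still starts out empty.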

Jun 27 '08 #5

Pascal J. Bourguignon

Ramon F Herrera <ra***@conexus.net> writes:

> I am writing a program that generates source code. See a snippet
> below. My question is about the use of that growing 'code' variable.
> Is it efficient? Is it recommended for this case?
>
> The code generated can grow a lot. Perhaps I should allocate a large
> max size in advance?
> [...]
> code += "doc.FieldCreate(\"";
> [...]

No, you should not use strings to generate code. Code is a syntax
tree. You should have a tree of objects:

Lhs* lhs=new Variable("pi_squared");
Rhs* rhs=new Variable("pi");
Statement* code=new Assignment(lhs,new Multiply(rhs,rhs));
cout<<code->generate();

would produce:

pi_squared=pi*pi;

--
__Pascal Bourguignon__

Jun 27 '08 #6

Yannick Tremblay

In article <7c************@pbourguignon.anevia.com>,
Pascal J. Bourguignon <pj*@informatimago.com> wrote:

> No, you should not use strings to generate code. Code is a syntax
> tree. You should have a tree of objects:
>
> Lhs* lhs=new Variable("pi_squared");
> Rhs* rhs=new Variable("pi");
> Statement* code=new Assignment(lhs,new Multiply(rhs,rhs));
> cout<<code->generate();

This is C++, not Java, lose the "new" abuse:

// class Variable;
// class Statement;
// class Assignment: public Statement;

Variable lhs("pi_squared");
Variable rhs("pi");
Assignment code(lhs, Multiply(rhs, rhs));

cout << code.generate();

// or
cout << Assignment(Variable("pi_squared"),
                   Multiply(Variable("pi"), Variable("pi"))).generate();

> would produce:
>
> pi_squared=pi*pi;

Your tree-of-objects approach is probably superior as complexity
increases. For simple problems, direct construction in a
string/ostream is likely to be sufficient, but if you have a lot of
complex code generation to do, the cost of creating the code object
hierarchy is likely to be worthwhile.

Yannick

Jun 27 '08 #7

James Kanze

On Jun 2, 2:38 pm, p...@informatimago.com (Pascal J. Bourguignon)
wrote:

> Ramon F Herrera <ra...@conexus.net> writes:
>
>> I am writing a program that generates source code. See a snippet
>> below. My question is about the use of that growing 'code' variable.
>> Is it efficient? Is it recommended for this case?
>>
>> The code generated can grow a lot. Perhaps I should allocate a large
>> max size in advance?
>> [...]
>> code += "doc.FieldCreate(\"";
>> [...]
>
> No, you should not use strings to generate code. Code is a
> syntax tree.

That depends a lot on the code. The compiler may treat it as a
syntax tree, but most of the time I'm generating code, it's
fairly flat (tables and that sort of stuff). And of course, in
the end, you need text, to feed to the compiler.

> You should have a tree of objects:
>
> Lhs* lhs=new Variable("pi_squared");
> Rhs* rhs=new Variable("pi");
> Statement* code=new Assignment(lhs,new Multiply(rhs,rhs));
> cout<<code->generate();
>
> would produce:
>
> pi_squared=pi*pi;

I think you've missed the question. The original poster may
actually be already doing that, for all we know. The question
concerned the generation of the code, not the source from which
it was generated. And the code itself must be text (at least as
the question was posed).

Of course, I agree that you don't have to generate that text
entirely in one std::string object. Regardless of the source,
you should (usually) output it directly to an ostream (which
could be an ostringstream *if* you need the text in the process,
but usually, it will be an ofstream, I think).

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Jun 27 '08 #8

James Kanze

On Jun 2, 3:41 pm, ytrem...@nyx.nyx.net (Yannick Tremblay) wrote:

> In article <7clk1oc8r6....@pbourguignon.anevia.com>,
> Pascal J. Bourguignon <p...@informatimago.com> wrote:
>> No, you should not use strings to generate code. Code is a syntax
>> tree. You should have a tree of objects:
>>
>> Lhs* lhs=new Variable("pi_squared");
>> Rhs* rhs=new Variable("pi");
>> Statement* code=new Assignment(lhs,new Multiply(rhs,rhs));
>> cout<<code->generate();
>
> This is C++, not Java, lose the "new" abuse:

He's building a tree. That pretty much requires dynamic
allocation.

> // class Variable;
> // class Statement;
> // class Assignment: public Statement;
>
> Variable lhs("pi_squared");
> Variable rhs("pi");
> Assignment code(lhs, Multiply(rhs, rhs));

Unless you've got dynamic allocation of the nodes somewhere
hidden in the constructors, this is not going to work. And of
course, it doesn't work if the expression is the result of
parsing some external data either.

> cout << code.generate();
>
> // or
> cout << Assignment(Variable("pi_squared"),
>                    Multiply(Variable("pi"), Variable("pi"))).generate();
>
>> would produce:
>>
>> pi_squared=pi*pi;
>
> Your tree-of-objects approach is probably superior as
> complexity increases. For simple problems, direct construction
> in a string/ostream is likely to be sufficient, but if you have
> a lot of complex code generation to do, the cost of creating
> the code object hierarchy is likely to be worthwhile.

Tree or not, you'll have to either build a string or generate
text directly into an ostream sooner or later. If I understand
the original poster correctly, his question concerned the
efficiency of using a string when the size of the code became
large; he's already solved the problem of the source of the code
(tree or otherwise).

I'll admit that I generate a lot of code automatically, and I've
never used a syntax tree to do so. But most of the code is just
tables, or a function with a single switch statement (which is
also a table of sorts). Or the code is generated from a
template (general sense of the word, not a C++ template).

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Jun 27 '08 #9

Juha Nieminen

Pascal J. Bourguignon wrote:

> No, you should not use strings to generate code. Code is a syntax
> tree. You should have a tree of objects:

Why make things more complicated than necessary? You converted his
easy-to-read code into a mess of pointers and dynamically allocated
objects. What for?

Jun 27 '08 #10

Yannick Tremblay

In article <d4**********************************@m73g2000hsh.googlegroups.com>,
James Kanze <ja*********@gmail.com> wrote:

> On Jun 2, 3:41 pm, ytrem...@nyx.nyx.net (Yannick Tremblay) wrote:
>> In article <7clk1oc8r6....@pbourguignon.anevia.com>,
>> Pascal J. Bourguignon <p...@informatimago.com> wrote:
>>> No, you should not use strings to generate code. Code is a syntax
>>> tree. You should have a tree of objects:
>>>
>>> Lhs* lhs=new Variable("pi_squared");
>>> Rhs* rhs=new Variable("pi");
>>> Statement* code=new Assignment(lhs,new Multiply(rhs,rhs));
>>> cout<<code->generate();
>>
>> This is C++, not Java, lose the "new" abuse:
>
> He's building a tree. That pretty much requires dynamic
> allocation.

Looking at the proposed syntax above, I don't think that was the
reason for the "new" overflow syntax so I maintain my opinion.

This could be true for:

Lhs* lhs=new Variable("pi_squared");
Rhs* rhs=new Variable("pi");
Rhs* rhs2=new Variable("pi");
Assignment code(lhs, new Multiply(rhs,rhs2));

But in the code as presented:

1- Multiply can't get double ownership of rhs unless its constructor
is convoluted. If it gets basic ownership of the dynamically
allocated object it is given, Multiply(rhs, rhs) is probably a bug.

2-
Statement* code=new Assignment(/*...*/);
std::cout << code->generate();

is very hard to justify. To me that's clear dynamic allocation
abuse. Of course, "code" could later be added to a statement
collection but that was not in the presented code so dynamic
allocation there was unjustified.

3- The code as presented will leak if any of the 2nd, 3rd or 4th
"new" throws.

So maybe the following would be acceptable:

shared_ptr<Lhs> lhs(new Variable("pi_squared"));
shared_ptr<Rhs> rhs(new Variable("pi"));
Assignment code(lhs, new Multiply(rhs,rhs));

>> // class Variable;
>> // class Statement;
>> // class Assignment: public Statement;
>>
>> Variable lhs("pi_squared");
>> Variable rhs("pi");
>> Assignment code(lhs, Multiply(rhs, rhs));
>
> Unless you've got dynamic allocation of the nodes somewhere
> hidden in the constructors, this is not going to work. And of

Copy constructors would do the job fine. It seems to work for
the STL. The Assignment implementation would also not be forced to
have a particular internal structure but could be implemented in
whatever way is best.

> course, it doesn't work if the expression is the result of
> parsing some external data either.

Not sure I get your point here.

That said, I would certainly agree that the dynamic allocation in the
client code interface is a serious candidate for consideration but
there's nothing wrong with:

Assignment code( Variable("pi_squared"),
                 Multiply( Variable("pi"), Variable("pi")));

The explicit dynamic allocation by the client code might be more
efficient but IMO it is also more error prone. So a judgement call is
needed on performance vs safety. If I am in control of both sides of
the interface and need performance, I'll probably go for the dynamic
allocation in client code solution. However, if I am writing a
library for general use and have no idea who will be using it, I'll
write copy constructors for Variable and use them internally rather
than expose my internals to client code.
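
A rough sketch of the copying approach, under assumptions: the Node base class and the clone() member are hypothetical additions, and std::unique_ptr (C++11) is used for the hidden allocation. Only the class names Variable, Multiply and Assignment and the generate() call come from the code above.

```cpp
#include <memory>
#include <string>

// Hypothetical node interface; generate() mirrors the thread's usage,
// clone() is an addition that lets parents copy their operands.
struct Node
{
    virtual ~Node() {}
    virtual std::string generate() const = 0;
    virtual std::unique_ptr<Node> clone() const = 0;
};

class Variable : public Node
{
public:
    explicit Variable(const std::string& name) : name_(name) {}
    std::string generate() const override { return name_; }
    std::unique_ptr<Node> clone() const override
    {
        return std::unique_ptr<Node>(new Variable(*this));
    }
private:
    std::string name_;
};

class Multiply : public Node
{
public:
    // Operands are copied, so callers may pass temporaries or locals.
    Multiply(const Node& lhs, const Node& rhs)
        : lhs_(lhs.clone()), rhs_(rhs.clone()) {}
    // Deep copy: each Multiply owns its own operand copies.
    Multiply(const Multiply& other)
        : Node(other), lhs_(other.lhs_->clone()), rhs_(other.rhs_->clone()) {}
    std::string generate() const override
    {
        return lhs_->generate() + "*" + rhs_->generate();
    }
    std::unique_ptr<Node> clone() const override
    {
        return std::unique_ptr<Node>(new Multiply(*this));
    }
private:
    std::unique_ptr<Node> lhs_;
    std::unique_ptr<Node> rhs_;
};

class Assignment
{
public:
    Assignment(const Variable& target, const Node& value)
        : target_(target), value_(value.clone()) {}
    std::string generate() const
    {
        return target_.generate() + "=" + value_->generate() + ";";
    }
private:
    Variable target_;
    std::unique_ptr<Node> value_;
};
```

The dynamic allocation is still there, but it sits behind clone(), so client code can write Assignment code(Variable("pi_squared"), Multiply(Variable("pi"), Variable("pi"))); without any "new".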

Yannick

Jun 27 '08 #11

Puppet_Sock

On Jun 1, 5:58 pm, Ramon F Herrera <ra...@conexus.net> wrote:

> I am writing a program that generates source code. See a snippet
> below. My question is about the use of that growing 'code' variable.
> Is it efficient? Is it recommended for this case?
>
> The code generated can grow a lot. Perhaps I should allocate a large
> max size in advance?

Here's your code snippet.

void SynthesizeTextField(CompleteField fullTextField)
{
    string code;
    string baseFieldname = "text";
    stringstream ss;
    static int subindex = 1;
    code = "Field ";
    code += baseFieldname;
    ss << subindex;
    code += ss.str();
    code += " ";
    code += "doc.FieldCreate(\"";
    code += baseFieldname;
    code += ss.str();
    code += "\", Field::e_text, \"\", \"\");";
    subindex++;

}

I don't see that fullTextField is used.

I don't see that code is used after it is filled.
Seems to be no way for it to get out of the function.

Not really possible to answer your question without a
lot of detailed consideration of your problem specs.

For example: The snippet shows a lot of appending,
and not much else. Not much help there deciding on
what to do about growing data set size.

You need to think about things like:
- Will the growing be only at the end or the middle or front?
- Will you need to stick data into the middle of the
target data? For example, will you need to insert
words into the middle of the data you are building?
- Will you want to be doing edit-in-place type actions?
For example, sorting on keywords, user defined edits, etc.
- Will you need to do searching in the data? Sorting on
keywords, analysis on trends, or anything like that.
- Will you want to do any syntax analysis? Things like
search for well formed lines of code, and so on.
- Any other complications of increased scope you can
pry out of the folks setting the project.

If you can figure out which, if any, of these is likely,
then you can pick a data structure that will accommodate
them more easily. That way you can get ahead of your client
asking for new features.
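
As one illustration of such a choice, here is a sketch that assumes insertion in the middle turns out to matter. The class name LineBuffer and its members are made up for this example; it keeps the generated code as separate lines instead of one long string.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical line-based buffer: keeps generated code as separate lines
// so that text can be inserted in the middle without rebuilding a string.
class LineBuffer
{
public:
    void append(const std::string& line) { lines_.push_back(line); }

    // Insert a line before position index (index == size() appends).
    // The caller must keep index <= size().
    void insertAt(std::size_t index, const std::string& line)
    {
        lines_.insert(lines_.begin() + static_cast<std::ptrdiff_t>(index),
                      line);
    }

    std::size_t size() const { return lines_.size(); }

    // Join all lines into the final text, one '\n' after each line.
    std::string str() const
    {
        std::string out;
        for (std::size_t i = 0; i < lines_.size(); ++i) {
            out += lines_[i];
            out += '\n';
        }
        return out;
    }

private:
    std::vector<std::string> lines_;
};
```

If only appending at the end is ever needed, a plain std::string or an output stream remains the simpler option.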

On the other hand, if you are confident that none of that
sort of thing is ever going to happen, then pick the
simplest way of doing things that you can. That will be
the easiest to update if it does start to degrade.

Socks

Jun 27 '08 #12

Pascal J. Bourguignon

Juha Nieminen <no****@thanks.invalid> writes:

> Pascal J. Bourguignon wrote:
>> No, you should not use strings to generate code. Code is a syntax
>> tree. You should have a tree of objects:
>
> Why make things more complicated than necessary? You converted his
> easy-to-read code into a mess of pointers and dynamically allocated
> objects. What for?

As a first step toward implementing Greenspun's Tenth Law, of course...

--
__Pascal Bourguignon__

Jun 27 '08 #13

James Kanze

On Jun 3, 5:16 pm, ytrem...@nyx.nyx.net (Yannick Tremblay) wrote:

> In article
> <d442bccc-43ec-4060-a323-f5943c4f3...@m73g2000hsh.googlegroups.com>,
> James Kanze <james.ka...@gmail.com> wrote:
>
>> On Jun 2, 3:41 pm, ytrem...@nyx.nyx.net (Yannick Tremblay) wrote:
>>> In article <7clk1oc8r6....@pbourguignon.anevia.com>,
>>> Pascal J. Bourguignon <p...@informatimago.com> wrote:
>>>> No, you should not use strings to generate code. Code is a syntax
>>>> tree. You should have a tree of objects:
>>>>
>>>> Lhs* lhs=new Variable("pi_squared");
>>>> Rhs* rhs=new Variable("pi");
>>>> Statement* code=new Assignment(lhs,new Multiply(rhs,rhs));
>>>> cout<<code->generate();
>>>
>>> This is C++, not Java, lose the "new" abuse:
>>
>> He's building a tree. That pretty much requires dynamic
>> allocation.
>
> Looking at the proposed syntax above, I don't think that was
> the reason for the "new" overflow syntax so I maintain my
> opinion.

I'm not sure what you mean by "overflow" syntax, but Pascal
explicitly said that you should have a tree, so I think we have
to assume that he was building a tree.

> This could be true for:
>
> Lhs* lhs=new Variable("pi_squared");
> Rhs* rhs=new Variable("pi");
> Rhs* rhs2=new Variable("pi");
> Assignment code(lhs, new Multiply(rhs,rhs2));

OK, so his code builds a directed acyclic graph, instead of a
tree. What does that change?

> But in the code as presented:
>
> 1- Multiply can't get double ownership of rhs unless its
> constructor is convoluted. If it gets basic ownership of the
> dynamically allocated object it is given, Multiply(rhs, rhs)
> is probably a bug.

First, I suspect that the posted code was just a hint, and not
meant to be polished, finished, fully working code. Second, I
don't quite follow your points about "ownership". If you're
building a directed acyclic graph, then ownership is not really
a relevant issue; if there is ownership, it is shared by all
parents, but typically, you'll implement some sort of garbage
collection, and not worry about it. If you're not using the
Boehm collector, you'll allocate all of the nodes from a pool,
with a pool for each expression, and you'll drop the entire pool
when you're done with the expression. Or, since the graph is
acyclic, you can even use boost::shared_ptr if performance isn't
an issue (and the performance impact of boost::shared_ptr is
probably small enough to make it not an issue).
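
A sketch of the shared-ownership variant, with assumptions stated up front: std::shared_ptr (C++11) stands in for boost::shared_ptr here, and the Expr base class and the ExprPtr typedef are hypothetical names. Both operands of the multiplication point at the same node, which gives the directed acyclic graph mentioned above.

```cpp
#include <memory>
#include <string>

// Hypothetical expression nodes; only generate() comes from the thread.
struct Expr
{
    virtual ~Expr() {}
    virtual std::string generate() const = 0;
};

// Shared, read-only handle to a node (an assumption of this sketch).
typedef std::shared_ptr<const Expr> ExprPtr;

struct Variable : Expr
{
    explicit Variable(const std::string& n) : name(n) {}
    std::string generate() const override { return name; }
    std::string name;
};

struct Multiply : Expr
{
    Multiply(ExprPtr l, ExprPtr r) : lhs(l), rhs(r) {}
    std::string generate() const override
    {
        return lhs->generate() + "*" + rhs->generate();
    }
    ExprPtr lhs;  // both operands may point at the same node
    ExprPtr rhs;
};

struct Assignment : Expr
{
    Assignment(ExprPtr t, ExprPtr v) : target(t), value(v) {}
    std::string generate() const override
    {
        return target->generate() + "=" + value->generate() + ";";
    }
    ExprPtr target;
    ExprPtr value;
};
```

A node is released when its last parent goes away, which works here only because the graph has no cycles.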

> 2-
> Statement* code=new Assignment(/*...*/);
> std::cout << code->generate();
>
> is very hard to justify. To me that's clear dynamic
> allocation abuse. Of course, "code" could later be added to a
> statement collection but that was not in the presented code so
> dynamic allocation there was unjustified.

Except that in a larger context, it's likely that you can't
allocate Statement (or any syntax element) on the stack.
(Unless you have full garbage collection, of course.)

> 3- The code as presented will leak if any of the 2nd, 3rd
> or 4th "new" throws.

Without seeing the actual classes involved, I can't say that.
Probably, he's using the Boehm collector; this is typically the
sort of thing where garbage collection shines. Or he's defined
an operator new/operator delete in the base class
which allocates from a pool, and he just tells the pool to drop
everything when he's through with the expression, at a higher
level. (That's the way I usually handle syntax trees when I
can't use the Boehm collector.) Or maybe he's made the
constructors nothrow, and replaced the new_handler to abort, so
that the entire code is guaranteed no throw.
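
A simplified sketch of the "drop everything at once" idea. It deliberately swaps the class-level operator new/operator delete described above for an owning container, which is easier to show in a few lines; NodePool, make() and the Node hierarchy are hypothetical names.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Node
{
    virtual ~Node() {}
    virtual std::string generate() const = 0;
};

struct Variable : Node
{
    explicit Variable(const std::string& n) : name(n) {}
    std::string generate() const override { return name; }
    std::string name;
};

struct Multiply : Node
{
    Multiply(const Node* l, const Node* r) : lhs(l), rhs(r) {}
    std::string generate() const override
    {
        return lhs->generate() + "*" + rhs->generate();
    }
    const Node* lhs;  // non-owning; the pool owns every node
    const Node* rhs;
};

// Hypothetical pool: owns every node created through it and frees them
// all together when the pool itself is destroyed.
class NodePool
{
public:
    template <typename T, typename... Args>
    T* make(Args&&... args)
    {
        std::unique_ptr<T> node(new T(std::forward<Args>(args)...));
        T* raw = node.get();
        nodes_.push_back(std::move(node));  // ownership moves to the pool
        return raw;
    }

    std::size_t size() const { return nodes_.size(); }

private:
    std::vector<std::unique_ptr<Node> > nodes_;
};
```

With one pool per expression, sharing a node between two parents needs no ownership bookkeeping: nothing is freed until the pool goes out of scope.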

> So maybe the following would be acceptable:
>
> shared_ptr<Lhs> lhs(new Variable("pi_squared"));
> shared_ptr<Rhs> rhs(new Variable("pi"));
> Assignment code(lhs, new Multiply(rhs,rhs));

Maybe, but there are better solutions.

[...]

> Copy constructors would do the job fine. It seems to work
> for the STL.

In case you hadn't noticed, the STL does dynamic allocation in
its containers. Here, he's building a tree outside of any
container, so that doesn't work; he'd have to hide it in the
individual elements.

> The Assignment implementation would also not be forced to
> have a particular internal structure but could be implemented
> in whatever way is best.
>
>> course, it doesn't work if the expression is the result of
>> parsing some external data either.
>
> Not sure I get your point here.

If you don't know what variables you're going to need up front,
the only way to get the objects you need is by dynamic
allocation.

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Jun 27 '08 #14
