"Streams"

bonj

hello
I hope somebody can help me get my head around this area of 'stream'
programming... I know that streams are very fashionable nowadays so
hopefully there'll be lots of replies. ;-)

Basically I have an operation which the input and output for are streams -
a function which receives a certain 'chunk' of data each time it runs, and
it runs many times until it has received all the data, in a similar way to
a socket.
What the processing involves is treating the data like a string, and
picking out certain 'words' and putting 'tags' round them. For instance,
say if the word 'dog' came along, I want to replace it with
<animal>dog</animal> or something like that.
So that if the input stream was "table kennel dog coffee beer" the output
stream would be "table kennel <animal>dog</animal> coffee beer" or
something like that.
So say, if in a stream-like fashion, the stream function was called 5
times, each receiving 100 bytes of a transmission which is 500 bytes
in total.

What if the word 'dog' occurred from characters 99 to 101?
The first invocation would *end* in the characters "...do" and the second
would start with the characters "g ..."
I can't think how I can solve this, it is eluding me unfortunately. It
sounds like it should be simple, but isn't. I need to do it in a way that
doesn't involve storing the whole string in memory at once (for
performance reasons - the whole data might be as large as 64K), and hope
somebody can shed any further light on it!

Please don't say "You need to examine the bytes at the end of one
invocation and compare them with the ones at the start of the next"...
because I know this is *what* I've got to do, the question is *how*?

Jul 23 '05 #1

Subscribe Post Reply

2699

Torsten Mueller

bonj <a@b.com> schrieb:

So that if the input stream was "table kennel dog coffee beer" the
output stream would be "table kennel <animal>dog</animal> coffee
beer" or something like that.
You have text data.
What if the word 'dog' occurred from characters 99 to 101?

Why do you want to handle text data like binary data in a fixed size
buffer. You have lines - process lines, line by line.

If you have structures on multiple lines you will get a more complex
problem. Then you should use a parser.

T.M.

Jul 23 '05 #2

Walter Roberson

In article <pa****************************@b.com>, bonj <a@b.com> wrote:
:What the processing involves is treating the data like a string, and
:picking out certain 'words' and putting 'tags' round them. For instance,
:say if the word 'dog' came along, I want to replace it with
:<animal>dog</animal> or something like that.

:What if the word 'dog' occurred from characters 99 to 101?
:The first invocation would *end* in the characters "...do" and the second
:would start with the characters "g ..."
:I can't think how I can solve this, it is eluding me unfortunately. It
:sounds like it should be simple, but isn't. I need to do it in a way that
:doesn't involve storing the whole string in memory at once (for
:performance reasons - the whole data might be as large as 64K), and hope
:somebody can shed any further light on it!

You don't need to store the whole string: you only need to store
as much of the trailing context as might be a match for one of the
special strings.

Furthermore, as there are a limited number of strings
that you wish to put the tag around, then unless you need the match to
be case insensitive, you do not need to store the string itself,
just the string's index number and length of the pending match.
Initialize: pending = -1;

End of Buffer:
If this buffer ends in whitespace (including newline) then
process it fully and set pending = -1.

Otherwise, match it against the special strings, looking for a
match as long as the string in the buffer; if you do not find such
a match, then emit that string unchanged and set pending = -2.
If you find a match, set pending to the index of the first special
string partly matched against and record the length of the match in
pending_matchlen .

Beginning of Buffer:
If pending == -2 then read characters from the beginning of the
buffer and emit them without checking for matches until you
find whitespace; then set pending = -1 and continue processing the buffer.

If pending >= 0 then pull strings[pending] out of the match table,
take the first pending_matchlen characters of it, and push that string
at the front of the buffer. Leave pending the way it is and continue
onward.

If pending == -1 then continue onward

Continuing Onward:
if pending == -1 then consume and emit all leading whitespace in
the buffer and then loop back around to 'Continuing Onward'.
When you are finally positioned to a non-whitespace, set pending = 0
and keep going as per below.

Find the end of the current word in the buffer. If there is
whitespace afterwards, compare the current word to the match table
starting from pending. If you find an exact match, emit the
appropriate tag-surrounded string, set pending = -1 and loop around
to "continuing onward'. If there was no exact match, emit the
word itself, set pending = -1, and loop around to "Continuing Onward".
If the buffer did not end in whitespace, you are in the End of Buffer
state -- I showed that first so you would be able to see what the
pending flag was about before beging hit with the beginning of buffer
logic where the value of pending is important.
This algorithm needs only two words of state, one signed value large
enough to hold the maximum number of special strings, and the other
signed or unsigned long enough to hold the maximum length of one of
the special string.
--
Live it up, rip it up, why so lazy?
Give it out, dish it out, let's go crazy, yeah!
-- Supertramp (The USENET Song)

Jul 23 '05 #3

CBFalconer

bonj wrote:

I hope somebody can help me get my head around this area of 'stream'
programming... I know that streams are very fashionable nowadays so
hopefully there'll be lots of replies. ;-)

You should never cross-post without setting follow-ups to the group
you inhabit. Also cross-posting between c.l.c and c.l.c++ is very
likely to be damaging. Newsgroups are separated by subject to
avoid extraneous clutter, so keep them clean. At any rate, I am
setting follow-ups to c.l.c.

Streams are nothing new.

In C all files are streams of chars. That means they are
essentially serially read and written, with some provisions for
altering the sequence (which are not guaranteed to work). That
also means that all you can do is read or write one char (byte) at
a time. Of course you can save the result, or advance to another
for write.

So the fundamental functions you have available are putc, getc,
(and ungetc, a kluge that allows one char. lookahead) or their
non-macro counterparts fputc and fgetc. Everything else (that is
guaranteed) can be built from these, such as fgets, fscanf,
fprintf, fread, fwrite.

The run time systems is usually built to make those fundamental
accesses efficient, by providing hidden buffers. To take maximal
advantage of those buffers getc and putc are allowed to be macros
that do internal things peculiar to the implementation.

Notice that it is perfectly possible to write i/o code that never
needs user buffering, yet it adequately converts between text and
such things as ints, floats, chars, strings, etc.

Notice also that a magnetic or paper tape, that can only advance,
is a perfectly usable medium on which to implement a stream. So is
a keyboard, for input, or a printer, for output. You can also pipe
the output of one program (a stream) to the input of another
program (also a stream) operating simulataneously with as little as
a single char. of buffering.

Pascal does a very good job of formally defining abstract file
streams, and how they can be used and/or manipulated. You might
want to look up ISO1785 or ISO10206 to read the appropriate
sections. Unlike C, Pascal does not limit files to streams of
bytes.

--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson

Jul 23 '05 #4

Walter Roberson

In article <uy***********@fastmail.fm>,
Torsten Mueller <Ur*****@gmx.net> wrote:
:Why do you want to handle text data like binary data in a fixed size
:buffer. You have lines - process lines, line by line.

The OP wrote,

:>Basically I have an operation which the input and output for are streams -
:>a function which receives a certain 'chunk' of data each time it runs, and
:>it runs many times until it has received all the data, in a similar way to
:>a socket.

I read that as meaning he was using STREAMS, the fundamentally
packetized I/O facility found mostly in some versions of Unix;
related to TLI and the newer OpenStreams; e.g., recvmsg().

On the other hand, STREAMS aren't exactly the latest hot thing like
the OP was describing, so perhaps the OP was misposting a C++ question
into the comp.lang.c newsgroup...
--
"No one has the right to destroy another person's belief by
demanding empirical evidence." -- Ann Landers

Jul 23 '05 #5

Dylan Nicholson

bonj <a@b.com> wrote in message news:<pa****************************@b.com>...

hello
I hope somebody can help me get my head around this area of 'stream'
programming... I know that streams are very fashionable nowadays so
hopefully there'll be lots of replies. ;-)

Basically I have an operation which the input and output for are streams -
a function which receives a certain 'chunk' of data each time it runs, and
it runs many times until it has received all the data, in a similar way to
a socket.
While that might be the underlying mechanism for retrieving data from
the source, C++ streams always allow character-by-character
processing. It is up the underlying streambuf to grab more data as
required (usually in underflow() or uflow()), in whatever size chunks
make the most sense.
So say, if in a stream-like fashion, the stream function was called 5
times, each receiving 100 bytes of a transmission which is 500 bytes
in total.

What if the word 'dog' occurred from characters 99 to 101?
The first invocation would *end* in the characters "...do" and the second
would start with the characters "g ..." I can't think how I can solve this, it is eluding me unfortunately. It
sounds like it should be simple, but isn't. I need to do it in a way that
doesn't involve storing the whole string in memory at once (for
performance reasons - the whole data might be as large as 64K), and hope
somebody can shed any further light on it!

If your underlying streambuf class has a fixed buffer that is the size
of the longest possible word you need to handle (say 1024 bytes), then
items of data straddling data packets is never an issue.

If words are separated by spaces, then the code to add tags around
words is as simple as:

void add_tags(istream& in_stream, ostream& out_stream)
{
while (in_stream >> word)
out_stream << "<tag>" << word << "</tag>"
}

or if you prefer more STL-ish approach:

string add_tag(const string& word)
{
return "<tag>" + word + "</tag>"
}

void add_tags(istream& in_stream, ostream& out_stream)
{
transform(istream_iterator<string>(in_stream),
istream_iterator<string>(), ostream_iterator<string>(out_stream),
add_tag);
}

The underlying streambuf can still grab data packets of a fixed size
(100 bytes in your example), but it only does it once the previous
packet has been fully exhausted, and the characters in it have already
been extracted into your application's string object. This usually
occurs when the underflow() or uflow() virtual functions are called.

Typically though you only need to implement your own streambuf when
using a very specific form of input data not supported by the standard
library.
Sockets might be one example, although you may find you can use
filebuf for this, depending on your OS and particular filebuf
implementation (you may need one that has an extension to supply your
own FILE* pointer, for example).

Hope this helps.

Jul 23 '05 #6

bonj

On Sun, 27 Feb 2005 17:27:42 +0100, Torsten Mueller spooled the following
warez:

bonj <a@b.com> schrieb:
So that if the input stream was "table kennel dog coffee beer" the
output stream would be "table kennel <animal>dog</animal> coffee
beer" or something like that.

You have text data.
What if the word 'dog' occurred from characters 99 to 101?

Why do you want to handle text data like binary data in a fixed size
buffer. You have lines - process lines, line by line.

If you have structures on multiple lines you will get a more complex
problem. Then you should use a parser.

T.M.

erm.... yes, you're right I do have 'lines'.
So I could in effect have a 'line' buffers, and then the callback function
that receives data feeds it into the line buffer, and then another
function processes the line buffer but only once it's got a complete line.

There's only one problem with that. There's no actual upper limit on the
length of a line. Even if the data comes from a text control, it could
wrap - meaning that there could be zillions of characters
coming through, before an actual linefeed is encountered.

The data *is* text, you're right - but I want to prepare for the
situation where it is 'prose-like'.

Jul 23 '05 #7

bonj

On Sun, 27 Feb 2005 17:10:52 +0000, Walter Roberson spooled the following
warez:

In article <uy***********@fastmail.fm>,
Torsten Mueller <Ur*****@gmx.net> wrote:
:Why do you want to handle text data like binary data in a fixed size
:buffer. You have lines - process lines, line by line.

The OP wrote,

:>Basically I have an operation which the input and output for are streams -
:>a function which receives a certain 'chunk' of data each time it runs, and
:>it runs many times until it has received all the data, in a similar way to
:>a socket.

I read that as meaning he was using STREAMS, the fundamentally
packetized I/O facility found mostly in some versions of Unix;
related to TLI and the newer OpenStreams; e.g., recvmsg().

On the other hand, STREAMS aren't exactly the latest hot thing like
the OP was describing, so perhaps the OP was misposting a C++ question
into the comp.lang.c newsgroup...

No, it's not an *actual* unix stream, but simply a library that a friend
has written that reads the (textual) data out of a user-input control. It
needs to do this fast, and it does do its job correctly and it does
operate fast. The only drawback with the way it operates is that to do its
job fast, the only way it can get the data out of the control so fast is
to call a given 'callback' function a certain number of times. The
parameters it passes to this callback function are a pointer to the
buffer, and the number of bytes it has retrieved into the buffer - *this
number could vary*. I might get 100 bytes one time, and 30 the next, if it
suddenly went slow, say.

I like the idea of having a line buffer, but I have no idea how many
characters could be in a line. What I might do is have buffer that
operates on any whitespace break, as while I can't put an upper limit on
the length of a line, I can do so on a word. I might look into doing it
like this.

Jul 23 '05 #8

jjf

Walter Roberson wrote:

In article <uy***********@fastmail.fm>,
Torsten Mueller <Ur*****@gmx.net> wrote:
:Why do you want to handle text data like binary data in a fixed size
:buffer. You have lines - process lines, line by line.

The OP wrote,

:>Basically I have an operation which the input and output for are
:>streams - a function which receives a certain 'chunk' of data each
:>time it runs, and it runs many times until it has received all the
:>data, in a similar way to a socket.

I read that as meaning he was using STREAMS, the fundamentally
packetized I/O facility found mostly in some versions of Unix;
related to TLI and the newer OpenStreams; e.g., recvmsg().

On the other hand, STREAMS aren't exactly the latest hot thing like
the OP was describing, so perhaps the OP was misposting a C++ question into the comp.lang.c newsgroup...

Since the messsage didn't go to any group where STREAMS would be
on-topic, it seems far more likely that he was talking about C
streams (unless he's actually asking a C++ question). Why he should
think they're particularly fashionable at the moment, I've no idea
- they've been heavily used for decades.

Jul 23 '05 #9

Similar topics