
"Streams"

hello
I hope somebody can help me get my head around this area of 'stream'
programming... I know that streams are very fashionable nowadays so
hopefully there'll be lots of replies. ;-)

Basically I have an operation whose input and output are streams -
a function which receives a certain 'chunk' of data each time it runs, and
runs many times until it has received all the data, in a similar way to
a socket.
What the processing involves is treating the data like a string, and
picking out certain 'words' and putting 'tags' round them. For instance,
say if the word 'dog' came along, I want to replace it with
<animal>dog</animal> or something like that.
So that if the input stream was "table kennel dog coffee beer" the output
stream would be "table kennel <animal>dog</animal> coffee beer" or
something like that.
So say that, in a stream-like fashion, the stream function was called 5
times, each call receiving 100 bytes of a transmission which is 500 bytes
in total.

What if the word 'dog' occurred from characters 99 to 101?
The first invocation would *end* in the characters "...do" and the second
would start with the characters "g ..."
I can't think how I can solve this; it is eluding me unfortunately. It
sounds like it should be simple, but isn't. I need to do it in a way that
doesn't involve storing the whole string in memory at once (for
performance reasons - the whole data might be as large as 64K), and I hope
somebody can shed some further light on it!

Please don't say "You need to examine the bytes at the end of one
invocation and compare them with the ones at the start of the next"...
because I know this is *what* I've got to do, the question is *how*?


Jul 23 '05 #1
bonj <a@b.com> wrote:
So that if the input stream was "table kennel dog coffee beer" the
output stream would be "table kennel <animal>dog</animal> coffee
beer" or something like that.
You have text data.
What if the word 'dog' occurred from characters 99 to 101?


Why do you want to handle text data like binary data in a fixed-size
buffer? You have lines - process them line by line.

If you have structures spanning multiple lines you will get a more complex
problem. Then you should use a parser.
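
For illustration, a minimal line-by-line sketch of that approach in C++ (the
tagging of 'dog' is just an example, and note it normalizes whitespace within
a line):

#include <iostream>
#include <sstream>
#include <string>

// Read one line at a time, tag recognised words, and write the line back out.
void process_lines(std::istream& in, std::ostream& out)
{
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream words(line);
        std::string word;
        std::string sep;
        while (words >> word) {
            out << sep;
            if (word == "dog")                      // illustrative keyword check
                out << "<animal>" << word << "</animal>";
            else
                out << word;
            sep = " ";
        }
        out << '\n';
    }
}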

T.M.
Jul 23 '05 #2
In article <pa****************************@b.com>, bonj <a@b.com> wrote:
:What the processing involves is treating the data like a string, and
:picking out certain 'words' and putting 'tags' round them. For instance,
:say if the word 'dog' came along, I want to replace it with
:<animal>dog</animal> or something like that.

:What if the word 'dog' occurred from characters 99 to 101?
:The first invocation would *end* in the characters "...do" and the second
:would start with the characters "g ..."
:I can't think how I can solve this, it is eluding me unfortunately. It
:sounds like it should be simple, but isn't. I need to do it in a way that
:doesn't involve storing the whole string in memory at once (for
:performance reasons - the whole data might be as large as 64K), and hope
:somebody can shed any further light on it!

You don't need to store the whole string: you only need to store
as much of the trailing context as might be a match for one of the
special strings.

Furthermore, since there are a limited number of strings
that you wish to put the tag around, then unless you need the match to
be case-insensitive, you do not need to store the string itself,
just the string's index number and the length of the pending match.
Initialize: pending = -1;

End of Buffer:
If this buffer ends in whitespace (including newline) then
process it fully and set pending = -1.

Otherwise, match it against the special strings, looking for a
match as long as the string in the buffer; if you do not find such
a match, then emit that string unchanged and set pending = -2.
If you find a match, set pending to the index of the first special
string partly matched against and record the length of the match in
pending_matchlen.

Beginning of Buffer:
If pending == -2 then read characters from the beginning of the
buffer and emit them without checking for matches until you
find whitespace; then set pending = -1 and continue processing the buffer.

If pending >= 0 then pull strings[pending] out of the match table,
take the first pending_matchlen characters of it, and push that string
at the front of the buffer. Leave pending the way it is and continue
onward.

If pending == -1 then continue onward

Continuing Onward:
if pending == -1 then consume and emit all leading whitespace in
the buffer and then loop back around to 'Continuing Onward'.
When you are finally positioned to a non-whitespace, set pending = 0
and keep going as per below.

Find the end of the current word in the buffer. If there is
whitespace afterwards, compare the current word to the match table
starting from pending. If you find an exact match, emit the
appropriate tag-surrounded string, set pending = -1 and loop around
to "continuing onward'. If there was no exact match, emit the
word itself, set pending = -1, and loop around to "Continuing Onward".
If the buffer did not end in whitespace, you are in the End of Buffer
state -- I showed that first so you would be able to see what the
pending flag was about before being hit with the beginning-of-buffer
logic where the value of pending is important.
This algorithm needs only two words of state, one signed value large
enough to hold the maximum number of special strings, and the other
signed or unsigned, long enough to hold the maximum length of one of
the special strings.
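
For illustration, a simplified C++ sketch of the same carry-over idea: instead
of tracking an index and pending_matchlen, it simply carries the unfinished
trailing word between chunks (so its state is bounded by the longest word
rather than the longest special string). The names feed_chunk, finish and the
is_keyword test are assumptions, not anything from the original post.

#include <cctype>
#include <cstddef>
#include <iostream>
#include <string>

// Illustrative keyword test; a real version would consult the match table.
static bool is_keyword(const std::string& w)
{
    return w == "dog";
}

static std::string carry;   // unfinished trailing word from the previous chunk

// Emit one complete word, tagged if it is a keyword.
static void emit_word(const std::string& w, std::ostream& out)
{
    if (is_keyword(w))
        out << "<animal>" << w << "</animal>";
    else
        out << w;
}

// Called once per chunk, like the callback the original poster describes.
void feed_chunk(const char* data, std::size_t len, std::ostream& out)
{
    for (std::size_t i = 0; i < len; ++i) {
        unsigned char c = static_cast<unsigned char>(data[i]);
        if (std::isspace(c)) {
            if (!carry.empty()) { emit_word(carry, out); carry.clear(); }
            out.put(static_cast<char>(c));   // whitespace passes straight through
        } else {
            carry += static_cast<char>(c);   // the word may continue in the next chunk
        }
    }
}

// Call once after the last chunk to flush a word that ends the stream.
void finish(std::ostream& out)
{
    if (!carry.empty()) { emit_word(carry, out); carry.clear(); }
}

int main()
{
    feed_chunk("table kennel do", 15, std::cout);   // 'dog' straddles the boundary
    feed_chunk("g coffee beer", 13, std::cout);
    finish(std::cout);
    std::cout << '\n';   // prints: table kennel <animal>dog</animal> coffee beer
}
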
--
Live it up, rip it up, why so lazy?
Give it out, dish it out, let's go crazy, yeah!
-- Supertramp (The USENET Song)
Jul 23 '05 #3
bonj wrote:

I hope somebody can help me get my head around this area of 'stream'
programming... I know that streams are very fashionable nowadays so
hopefully there'll be lots of replies. ;-)


You should never cross-post without setting follow-ups to the group
you inhabit. Also cross-posting between c.l.c and c.l.c++ is very
likely to be damaging. Newsgroups are separated by subject to
avoid extraneous clutter, so keep them clean. At any rate, I am
setting follow-ups to c.l.c.

Streams are nothing new.

In C all files are streams of chars. That means they are
essentially serially read and written, with some provisions for
altering the sequence (which are not guaranteed to work). That
also means that all you can do is read or write one char (byte) at
a time. Of course you can save the result, or advance to another
for write.

So the fundamental functions you have available are putc, getc,
(and ungetc, a kluge that allows one char. lookahead) or their
non-macro counterparts fputc and fgetc. Everything else (that is
guaranteed) can be built from these, such as fgets, fscanf,
fprintf, fread, fwrite.

The run-time system is usually built to make those fundamental
accesses efficient, by providing hidden buffers. To take maximal
advantage of those buffers, getc and putc are allowed to be macros
that do internal things peculiar to the implementation.

Notice that it is perfectly possible to write i/o code that never
needs user buffering, yet it adequately converts between text and
such things as ints, floats, chars, strings, etc.
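
As a small illustration of that point, here is a char-at-a-time filter built
only from getc, putc and ungetc; it collapses runs of blanks and needs no
user-level buffer. (The task is just an example, not from the post; the code
is the common C subset, so it compiles as either C or C++.)

#include <stdio.h>

int main(void)
{
    int c;
    while ((c = getc(stdin)) != EOF) {
        if (c == ' ') {
            int next;
            while ((next = getc(stdin)) == ' ')   /* skip the rest of the run */
                ;
            if (next != EOF)
                ungetc(next, stdin);              /* one-char lookahead, pushed back */
            putc(' ', stdout);
        } else {
            putc(c, stdout);
        }
    }
    return 0;
}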

Notice also that a magnetic or paper tape, that can only advance,
is a perfectly usable medium on which to implement a stream. So is
a keyboard, for input, or a printer, for output. You can also pipe
the output of one program (a stream) to the input of another
program (also a stream) operating simultaneously with as little as
a single char. of buffering.

Pascal does a very good job of formally defining abstract file
streams, and how they can be used and/or manipulated. You might
want to look up ISO 7185 or ISO 10206 to read the appropriate
sections. Unlike C, Pascal does not limit files to streams of
bytes.

--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson

Jul 23 '05 #4
In article <uy***********@fastmail.fm>,
Torsten Mueller <Ur*****@gmx.net> wrote:
:Why do you want to handle text data like binary data in a fixed size
:buffer. You have lines - process lines, line by line.

The OP wrote,

:>Basically I have an operation which the input and output for are streams -
:>a function which receives a certain 'chunk' of data each time it runs, and
:>it runs many times until it has received all the data, in a similar way to
:>a socket.

I read that as meaning he was using STREAMS, the fundamentally
packetized I/O facility found mostly in some versions of Unix;
related to TLI and the newer OpenStreams; e.g., recvmsg().

On the other hand, STREAMS aren't exactly the latest hot thing like
the OP was describing, so perhaps the OP was misposting a C++ question
into the comp.lang.c newsgroup...
--
"No one has the right to destroy another person's belief by
demanding empirical evidence." -- Ann Landers
Jul 23 '05 #5
bonj <a@b.com> wrote in message news:<pa****************************@b.com>...
hello
I hope somebody can help me get my head around this area of 'stream'
programming... I know that streams are very fashionable nowadays so
hopefully there'll be lots of replies. ;-)

Basically I have an operation which the input and output for are streams -
a function which receives a certain 'chunk' of data each time it runs, and
it runs many times until it has received all the data, in a similar way to
a socket.
While that might be the underlying mechanism for retrieving data from
the source, C++ streams always allow character-by-character
processing. It is up to the underlying streambuf to grab more data as
required (usually in underflow() or uflow()), in whatever size chunks
make the most sense.
So say, if in a stream-like fashion, the stream function was called 5
times, each receiving 100 bytes of a transmission which is 500 bytes
in total.

What if the word 'dog' occurred from characters 99 to 101?
The first invocation would *end* in the characters "...do" and the second
would start with the characters "g ..."
I can't think how I can solve this, it is eluding me unfortunately. It
sounds like it should be simple, but isn't. I need to do it in a way that
doesn't involve storing the whole string in memory at once (for
performance reasons - the whole data might be as large as 64K), and hope
somebody can shed any further light on it!


If your underlying streambuf class has a fixed buffer that is the size
of the longest possible word you need to handle (say 1024 bytes), then
items of data straddling data packets are never an issue.

If words are separated by spaces, then the code to add tags around
words is as simple as:

void add_tags(istream& in_stream, ostream& out_stream)
{
    string word;
    while (in_stream >> word)
        out_stream << "<tag>" << word << "</tag>";
}

or if you prefer a more STL-ish approach:

string add_tag(const string& word)
{
    return "<tag>" + word + "</tag>";
}

void add_tags(istream& in_stream, ostream& out_stream)
{
    transform(istream_iterator<string>(in_stream),
              istream_iterator<string>(),
              ostream_iterator<string>(out_stream),
              add_tag);
}

The underlying streambuf can still grab data packets of a fixed size
(100 bytes in your example), but it only does it once the previous
packet has been fully exhausted, and the characters in it have already
been extracted into your application's string object. This usually
occurs when the underflow() or uflow() virtual functions are called.

Typically though you only need to implement your own streambuf when
using a very specific form of input data not supported by the standard
library.
Sockets might be one example, although you may find you can use
filebuf for this, depending on your OS and particular filebuf
implementation (you may need one that has an extension to supply your
own FILE* pointer, for example).
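
To make the streambuf side concrete, here is a minimal sketch of a read-only
streambuf that refills a fixed-size buffer in underflow(). The chunk source
(a read_some-style function pointer) and the fake_source driving it are
assumptions for illustration, not part of any real library.

#include <cstring>
#include <iostream>
#include <streambuf>
#include <string>

class chunk_streambuf : public std::streambuf {
public:
    // src(dest, max) fills dest with up to max bytes and returns how many;
    // it returns 0 at end of data (a stand-in for a socket or callback source).
    typedef std::size_t (*source_fn)(char* dest, std::size_t max);

    explicit chunk_streambuf(source_fn src) : src_(src)
    {
        setg(buf_, buf_, buf_);              // empty get area: first read calls underflow()
    }

protected:
    int_type underflow()                     // called when the get area is exhausted
    {
        std::size_t n = src_(buf_, sizeof buf_);
        if (n == 0)
            return traits_type::eof();
        setg(buf_, buf_, buf_ + n);          // the new get area is the fresh chunk
        return traits_type::to_int_type(*gptr());
    }

private:
    source_fn src_;
    char buf_[100];                          // fixed chunk size, as in the example above
};

// Hypothetical chunk source: doles out a fixed string in pieces.
static std::size_t fake_source(char* dest, std::size_t max)
{
    static const char text[] = "table kennel dog coffee beer";
    static std::size_t pos = 0;
    std::size_t left = sizeof text - 1 - pos;
    std::size_t n = left < max ? left : max;
    std::memcpy(dest, text + pos, n);
    pos += n;
    return n;
}

int main()
{
    chunk_streambuf sb(&fake_source);
    std::istream in(&sb);
    std::string word;
    while (in >> word)                       // same word-by-word extraction as add_tags above
        std::cout << "<tag>" << word << "</tag>";
    std::cout << '\n';
}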

Hope this helps.
Jul 23 '05 #6
On Sun, 27 Feb 2005 17:27:42 +0100, Torsten Mueller spooled the following
warez:
bonj <a@b.com> wrote:
So that if the input stream was "table kennel dog coffee beer" the
output stream would be "table kennel <animal>dog</animal> coffee
beer" or something like that.


You have text data.
What if the word 'dog' occurred from characters 99 to 101?


Why do you want to handle text data like binary data in a fixed size
buffer. You have lines - process lines, line by line.

If you have structures on multiple lines you will get a more complex
problem. Then you should use a parser.

T.M.

erm.... yes, you're right, I do have 'lines'.
So I could in effect have a 'line' buffer: the callback function
that receives data feeds it into the line buffer, and then another
function processes the line buffer, but only once it's got a complete line.

There's only one problem with that. There's no actual upper limit on the
length of a line. Even if the data comes from a text control, it could
wrap - meaning that there could be zillions of characters
coming through, before an actual linefeed is encountered.

The data *is* text, you're right - but I want to prepare for the
situation where it is 'prose-like'.
Jul 23 '05 #7
On Sun, 27 Feb 2005 17:10:52 +0000, Walter Roberson spooled the following
warez:
In article <uy***********@fastmail.fm>,
Torsten Mueller <Ur*****@gmx.net> wrote:
:Why do you want to handle text data like binary data in a fixed size
:buffer. You have lines - process lines, line by line.

The OP wrote,

:>Basically I have an operation which the input and output for are streams -
:>a function which receives a certain 'chunk' of data each time it runs, and
:>it runs many times until it has received all the data, in a similar way to
:>a socket.

I read that as meaning he was using STREAMS, the fundamentally
packetized I/O facility found mostly in some versions of Unix;
related to TLI and the newer OpenStreams; e.g., recvmsg().

On the other hand, STREAMS aren't exactly the latest hot thing like
the OP was describing, so perhaps the OP was misposting a C++ question
into the comp.lang.c newsgroup...

No, it's not an *actual* unix stream, but simply a library that a friend
has written that reads the (textual) data out of a user-input control. It
needs to do this fast, and it does its job correctly and quickly. The only
drawback is that, to get the data out of the control that fast, it has to
call a given 'callback' function a certain number of times. The
parameters it passes to this callback function are a pointer to the
buffer, and the number of bytes it has retrieved into the buffer - *this
number could vary*. I might get 100 bytes one time, and 30 the next, if it
suddenly went slow, say.

I like the idea of having a line buffer, but I have no idea how many
characters could be in a line. What I might do is have a buffer that
operates on any whitespace break, as while I can't put an upper limit on
the length of a line, I can do so on a word. I might look into doing it
like this.
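
For what it's worth, a short sketch of that whitespace-break idea: each
callback appends its bytes, everything up to the last whitespace is handed
off, and only the trailing partial word is kept. The names on_chunk and
process_complete_words are hypothetical, and the processing shown here just
echoes its input.

#include <cstddef>
#include <iostream>
#include <string>

static std::string pending;   // trailing partial word carried between callbacks

// Stand-in for the real tagging pass; here it just echoes its input.
static void process_complete_words(const std::string& text)
{
    std::cout << text;
}

// The callback the library would call with a pointer and a byte count that
// may vary from call to call.
void on_chunk(const char* data, std::size_t len)
{
    pending.append(data, len);
    std::size_t cut = pending.find_last_of(" \t\r\n");
    if (cut == std::string::npos)
        return;                                           // no break yet; keep accumulating
    process_complete_words(pending.substr(0, cut + 1));   // complete words plus the break
    pending.erase(0, cut + 1);                            // keep only the unfinished word
}

A final call to process_complete_words(pending) after the last callback flushes
the last word of the transmission.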

Jul 23 '05 #8
jjf

Walter Roberson wrote:
In article <uy***********@fastmail.fm>,
Torsten Mueller <Ur*****@gmx.net> wrote:
:Why do you want to handle text data like binary data in a fixed size
:buffer. You have lines - process lines, line by line.

The OP wrote,

:>Basically I have an operation which the input and output for are
:>streams - a function which receives a certain 'chunk' of data each
:>time it runs, and it runs many times until it has received all the
:>data, in a similar way to a socket.

I read that as meaning he was using STREAMS, the fundamentally
packetized I/O facility found mostly in some versions of Unix;
related to TLI and the newer OpenStreams; e.g., recvmsg().

On the other hand, STREAMS aren't exactly the latest hot thing like
the OP was describing, so perhaps the OP was misposting a C++ question into the comp.lang.c newsgroup...


Since the message didn't go to any group where STREAMS would be
on-topic, it seems far more likely that he was talking about C
streams (unless he's actually asking a C++ question). Why he should
think they're particularly fashionable at the moment, I've no idea
- they've been heavily used for decades.

Jul 23 '05 #9

This thread has been closed and replies have been disabled. Please start a new discussion.
