
mmap parsing...

hi,

I have a file mapped into memory with mmap() and I'd like to parse it
line by line.
Several threads also read this buffer, so I think
strtok(p, "\n") wouldn't be a good choice. I'd like to hear from you
guys what would be a good implementation in this case.

thanks in advance,

lz.

--
Lucas Zimmerman
ne******@gmail.com

Nov 15 '05 #1
On 7 Jul 2005 10:11:48 -0700, ne******@gmail.com wrote in comp.lang.c:
hi,

I have a file mapped into memory with mmap() and I'd like to parse it
line by line.
Several threads also read this buffer, so I think
strtok(p, "\n") wouldn't be a good choice. I'd like to hear from you
guys what would be a good implementation in this case.

thanks in advance,


There is no mmap() and there are no threads in C, so this is off-topic
here. I suggest you take this to a group that supports your
compiler/OS combination.

--
Jack Klein
Home: http://JK-Technology.Com
FAQs for
comp.lang.c http://www.eskimo.com/~scs/C-faq/top.html
comp.lang.c++ http://www.parashift.com/c++-faq-lite/
alt.comp.lang.learn.c-c++
http://www.contrib.andrew.cmu.edu/~a...FAQ-acllc.html
Nov 15 '05 #2
Please read my post again. My question is about strtok(), not
mmap() or threads. I want an ANSI solution to this problem; that's why I
came to CLC.

lz.

--
Lucas Zimmerman
ne******@gmail.com

Nov 15 '05 #3
ne******@gmail.com wrote:
Please read my post again. My question is about strtok(), not
mmap() or threads. I want an ANSI solution to this problem; that's why I
came to CLC.

lz.

1) Please quote an appropriate amount of context when replying. It's the
Usenet way.

2) Since your (well-placed) concern about using strtok() has to do with
its relationship to threads, there *is* no ANSI solution (as ANSI C
knows not of threads). Your platform probably supplies a solution.
Posting to an appropriate platform-specific forum will likely help you
find the information you need.

HTH,
--ag

--
Artie Gold -- Austin, Texas
http://it-matters.blogspot.com (new post 12/5)
http://www.cafepress.com/goldsays
Nov 15 '05 #4
ne******@gmail.com wrote:
I have a file mapped into memory with mmap() and I'd like to parse it
line by line.
Several threads also read this buffer, so I think
strtok(p, "\n") wouldn't be a good choice. I'd like to hear from you
guys what would be a good implementation in this case.


Indeed, strtok() is utter crap for situations like this. If you are
using gcc, you can use strtok_r(), which is reentrant and thread-safe.

You said in another post that you wanted portability, though I don't know
how that fits with your mmap() usage. I'll assume you mean "mmap() or
equivalent" or that you intend to make it more general in the future.
Anyhow, for portable string manipulation, you can use "The Better
String Library": http://bstring.sf.net/ . Its string parsing
facilities are equal to or better than any of C's built-in library
functions, it is a totally thread-safe and reentrant library, and it's
portable.

If, by portable, you mean portable to any system with mmap(), another
possibility is James Antill's Vstr:
http://www.and.org/vstr/ which claims higher I/O performance by using
mmap. However, it is not thread-safe (it claims to be fork()-safe, which
is not the same thing, but may be sufficient for you).

--
Paul Hsieh
http://bstring.sf.net/
http://www.pobox.com/~qed/

Nov 15 '05 #5
ne******@gmail.com writes:
I have a file mapped into memory with mmap() and I'd like to parse it
line by line.
Several threads also read this buffer, so I think
strtok(p, "\n") wouldn't be a good choice. I'd like to hear from you
guys what would be a good implementation in this case.


strtok() is rarely a good choice for anything.
strtok() has at least these problems:

* It merges adjacent delimiters. If you use a comma as
your delimiter, then "a,,b,c" is three tokens, not
four. This is often the wrong thing to do. In fact,
it is only the right thing to do, in my experience,
when the delimiter set is limited to white space.

* The identity of the delimiter is lost, because it is
changed to a null terminator.

* It modifies the string that it tokenizes. This is bad
because it forces you to make a copy of the string if
you want to use it later. It also means that you can't
tokenize a string literal with it; this is not
necessarily something you'd want to do all the time but
it is surprising.

* It can only be used once at a time. If a sequence of
strtok() calls is ongoing and another one is started,
the state of the first one is lost. This isn't a
problem for small programs but it is easy to lose track
of such things in hierarchies of nested functions in
large programs. In other words, strtok() breaks
encapsulation.
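
The first and third points are easy to check. A minimal sketch, assuming a hypothetical helper count_tokens() that is not part of the original post:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts the tokens strtok() finds in S.
   S is modified in place, as with any use of strtok(). */
size_t
count_tokens (char *s, const char *delimiters)
{
  size_t n = 0;
  char *token;

  assert (s != NULL && delimiters != NULL);
  for (token = strtok (s, delimiters); token != NULL;
       token = strtok (NULL, delimiters))
    n++;
  return n;
}
```

Called on a modifiable array holding "a,,b,c" with "," as the delimiter, count_tokens() returns 3 rather than 4, and the array afterwards reads as just "a" because the first comma has been overwritten with a null byte.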

Instead, use some substitute, e.g. strtok_r(). Here is an
implementation of strtok_r(). It may be SUSv3 compliant, but I
do not know for sure. If you use it, you should probably rename
it, because (most) names beginning with `str' are reserved:

#include <assert.h>
#include <string.h>

/* Breaks a string into tokens separated by DELIMITERS.  The
   first time this function is called, S should be the string to
   tokenize, and in subsequent calls it must be a null pointer.
   SAVE_PTR is the address of a `char *' variable used to keep
   track of the tokenizer's position.  The return value each time
   is the next token in the string, or a null pointer if no
   tokens remain.

   This function treats multiple adjacent delimiters as a single
   delimiter.  The returned tokens will never be length 0.
   DELIMITERS may change from one call to the next within a
   single string.

   strtok_r() modifies the string S, changing delimiters to null
   bytes.  Thus, S must be a modifiable string.  String literals,
   in particular, are *not* modifiable in C, even though for
   backward compatibility they are not `const'.

   Example usage:

   char s[] = " String to tokenize. ";
   char *token, *save_ptr;

   for (token = strtok_r (s, " ", &save_ptr); token != NULL;
        token = strtok_r (NULL, " ", &save_ptr))
     printf ("'%s'\n", token);

   outputs:

   'String'
   'to'
   'tokenize.'
*/
char *
strtok_r (char *s, const char *delimiters, char **save_ptr)
{
  char *token;

  /* Standard assert() replaces the nonstandard ASSERT macro. */
  assert (delimiters != NULL);
  assert (save_ptr != NULL);

  /* If S is nonnull, start from it.
     If S is null, start from saved position. */
  if (s == NULL)
    s = *save_ptr;
  assert (s != NULL);

  /* Skip any DELIMITERS at our current position. */
  while (strchr (delimiters, *s) != NULL)
    {
      /* strchr() will always return nonnull if we're searching
         for a null byte, because every string contains a null
         byte (at the end). */
      if (*s == '\0')
        {
          *save_ptr = s;
          return NULL;
        }

      s++;
    }

  /* Skip any non-DELIMITERS up to the end of the string. */
  token = s;
  while (strchr (delimiters, *s) == NULL)
    s++;
  if (*s != '\0')
    {
      *s = '\0';
      *save_ptr = s + 1;
    }
  else
    *save_ptr = s;
  return token;
}

--
int main(void){char p[]="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz.\
\n",*q="kl BIcNBFr.NKEzjwCIxNJC";int i=sizeof p/2;char *strchr();int putchar(\
);while(*q){i+=strchr(p,*q++)-p;if(i>=(int)sizeof p)i-=sizeof p-1;putchar(p[i]\
);}return 0;}
Nov 15 '05 #6
Ben Pfaff wrote:
ne******@gmail.com writes:
I have a file mapped into memory with mmap() and I'd like to parse it
line by line.
Several threads also read this buffer, so I think
strtok(p, "\n") wouldn't be a good choice. I'd like to hear from you
guys what would be a good implementation in this case.


strtok() is rarely a good choice for anything. ...
Instead, use some substitute, e.g. strtok_r(). Here is an
implementation of strtok_r().

<snip>

Example usage:

char s[] = " String to tokenize. ";
char *token, *save_ptr;

for (token = strtok_r (s, " ", &save_ptr); token != NULL;
     token = strtok_r (NULL, " ", &save_ptr))
  printf ("'%s'\n", token);

<snip>


I prefer...

char *alt_strtok(char **s, const char *del)
{
  char *t;
  if (!*s) return 0;
  *s += strspn(*s, del);
  if (!**s) return *s = 0;
  *s += strcspn(t = *s, del);
  if (**s) *(*s)++ = 0; else *s = 0;
  return t;
}

Usage:

char s[] = " String to tokenize. ";
char *tok, *sp;

for (sp = s; (tok = alt_strtok(&sp, " ")) != NULL; )
  printf("'%s'\n", tok);
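
If empty fields matter, the same pointer-to-pointer interface can be kept while dropping the strspn() step. A sketch, with sep_tok() as a hypothetical name (it behaves much like BSD strsep(), which is not ISO C either):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical variant of alt_strtok() that keeps empty fields.
   *S is the current position, or null once the string is exhausted. */
char *
sep_tok (char **s, const char *del)
{
  char *t;

  assert (s != NULL && del != NULL);
  if (*s == NULL)
    return NULL;
  t = *s;
  *s += strcspn (t, del);   /* advance to the next delimiter or the end */
  if (**s != '\0')
    *(*s)++ = '\0';         /* end this field, step past the delimiter */
  else
    *s = NULL;              /* that was the last field */
  return t;
}
```

On "a,,b,c" with "," as the delimiter this yields four fields, "a", "", "b" and "c", before returning a null pointer.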

--
Peter

Nov 15 '05 #7
ne******@gmail.com wrote:

Please read my post again. My question is about strtok(), not
mmap() or threads. I want an ANSI solution to this problem; that's
why I came to CLC.


Well, you didn't bother to quote things properly, so it is
impossible to write a sane reply (see my sig below). In general,
don't mess with things and use the fundamental file system to
access your data. Whether the underlying system uses mmap or
threads is its business. There certainly is no reason to mention
them in this group.

--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
Nov 15 '05 #8
On 7 Jul 2005 19:55:27 -0700, we******@gmail.com wrote:
ne******@gmail.com wrote:
I have a file mapped into memory with mmap() and I'd like to parse it
line by line.
Several threads also read this buffer, so I think
strtok(p, "\n") wouldn't be a good choice. I'd like to hear from you
guys what would be a good implementation in this case.


Indeed, strtok() is utter crap for situations like this. If you are
using gcc, you can use strtok_r(), which is reentrant and thread-safe.

gcc as such is not relevant, unless this is one of the functions
chosen for inlining, which I haven't seen on any platform I've used.
You have strtok_r if you use _glibc_, which is sometimes but not
always used in conjunction with gcc, or some other systems. Or, of
course, if you provide it in user code, as in Ben's post in this thread.

And (the usual though not officially standard) strtok_r() is safe for
multiple threads concurrently, or different parts (e.g. loop levels)
in an interleaved fashion, parsing _different_ strings; it is of no help for
multiple threads accessing the same string, which appears to me to be
what the OP is asking. And, as noted by Ben, like strtok() it collapses
adjacent delimiters, so here it skips empty lines, which may or may not be
a problem for the OP.
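
Both points can be seen in a small sketch. It assumes a POSIX system that provides strtok_r() (the feature-test macro below is one common way to request its declaration); count_fields() is a hypothetical name used only for illustration:

```c
#define _POSIX_C_SOURCE 200809L   /* assumption: POSIX strtok_r() is available */
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: splits TEXT into lines and each line into
   space-separated fields.  Stores the number of non-empty lines in
   *LINES and returns the total number of fields.  TEXT is modified. */
size_t
count_fields (char *text, size_t *lines)
{
  char *line, *field;
  char *line_save, *field_save;   /* one saved position per loop level */
  size_t fields = 0;

  assert (text != NULL && lines != NULL);
  *lines = 0;
  for (line = strtok_r (text, "\n", &line_save); line != NULL;
       line = strtok_r (NULL, "\n", &line_save))
    {
      ++*lines;
      for (field = strtok_r (line, " ", &field_save); field != NULL;
           field = strtok_r (NULL, " ", &field_save))
        fields++;
    }
  return fields;
}
```

For the input "a b\n\nc d e\n" this reports 2 lines and 5 fields: the two loop levels do not disturb each other, but the empty line is silently skipped. None of this makes it safe for two threads to tokenize the same buffer, since every call writes null bytes into it.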

Given the lines in the file are delimited by a single known character
like '\n', which is a good bet on most if not all systems that support
mmap _under that name_, and is also needed for strtok() or _r(), then
strchr() does much of the job -- or memchr() if the file contents
aren't (necessarily) terminated or followed by a null character, which
they might not be depending on file size and page size.
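
A minimal sketch of that memchr() approach, under the assumption that the mapping is only read and that each thread keeps its own cursor. The struct line_cursor type and both function names are hypothetical; only memchr() comes from the standard library:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical cursor type: each thread keeps its own copy. */
struct line_cursor
{
  const char *pos;   /* next unread byte */
  const char *end;   /* one past the last byte of the buffer */
};

/* Starts C at the beginning of the LEN-byte buffer BUF. */
void
line_cursor_init (struct line_cursor *c, const char *buf, size_t len)
{
  assert (c != NULL && buf != NULL);
  c->pos = buf;
  c->end = buf + len;
}

/* Stores the start of the next line in *LINE and its length,
   excluding the '\n', in *LEN.  Returns 1 if a line was produced,
   0 at the end of the buffer.  Never writes to the buffer and
   never reads past its end. */
int
line_cursor_next (struct line_cursor *c, const char **line, size_t *len)
{
  const char *nl;

  assert (c != NULL && line != NULL && len != NULL);
  if (c->pos == c->end)
    return 0;

  nl = memchr (c->pos, '\n', (size_t) (c->end - c->pos));
  *line = c->pos;
  if (nl != NULL)
    {
      *len = (size_t) (nl - c->pos);
      c->pos = nl + 1;
    }
  else
    {
      /* Final line without a trailing newline. */
      *len = (size_t) (c->end - c->pos);
      c->pos = c->end;
    }
  return 1;
}
```

Since nothing is written to the buffer and all state lives in the caller's cursor, several threads can walk the same mapping at once, provided nobody modifies it concurrently. Lines come back as a pointer plus a length, not as null-terminated strings, and empty lines are preserved.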

- David.Thompson1 at worldnet.att.net
Nov 15 '05 #9
