tokenising a string using another string

I've got a really messy text file that I need to work on, and the only
thing separating the records is either "\r\n\r\n" or "Total:".

I figure I won't be able to use strtok because it will split the string
when it matches any rather than all of the chars in the delimiter.
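
A small C sketch of that strtok() behaviour, for anyone who wants to see it. The helper name count_strtok_tokens and the 256-byte buffer are made up for illustration; only strtok() itself is standard. The second argument is treated as a set of delimiter characters, so "\r\n\r\n" means "split on \r or \n", not "split on a blank line".

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: counts the tokens strtok() produces for `text`.
   strtok() writes into its input, so the text is copied first. */
static int count_strtok_tokens(const char *text, const char *delims)
{
    char buf[256];          /* assumed large enough for the demo input */
    int count = 0;
    char *tok;

    assert(strlen(text) < sizeof buf);
    strcpy(buf, text);

    /* Every character in `delims` acts as a separator on its own. */
    for (tok = strtok(buf, delims); tok != NULL; tok = strtok(NULL, delims))
        count++;
    return count;
}
```

With the input "line one\r\nline two\r\n\r\nrec two" and the delimiter "\r\n\r\n", this counts three tokens even though the text holds only two records.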

Is there an easy way to split it based on a string (read char*) rather
than a char?

TIA
Mark
Nov 15 '05 #1

Mark wrote:
> I've got a really messy text file that I need to work on and the only
> things separating each record is either "\r\n\r\n" or "Total:".

Can we have some more information here? It sure is messy,
but my premise is that it contains some information, otherwise you
wouldn't be splitting hairs over this. And if it contains
some specific information, then there will be some structure
to it. Maybe then you can read a char at a time, build some
tokens out of them, take the ones you need and do whatever
needs to be done.

Or am I mistaken, and you have tried all of this and failed?

> I figure I won't be able to use strtok because it will split the string
> when it matches any rather than all of the chars in the delimiter.

This can probably wait until we have identified which tokens
we have to find; then we can proceed accordingly.

> Is there an easy way to split it based on a string (read char*) rather
> than a char?

Read them via fgets() and use sscanf() or your own hand-spun lexer.

> TIA
> Mark


Nov 15 '05 #2
Suman wrote:
> Can we have some more information, here?

[snip]

It's supposed to be a CSV export from MYOB, but a few memo fields
contain carriage returns etc., so I can't simply read until \r\n
and assume that that is one record.

It might go something like this...

Customer name, date, first memo
field, another memo field that has no CR's, and another
memo field that
will be split across a
number of
lines and may well have any unquoted comma thrown
in just for fun

Total: <-- this is always at the end

I can't change the data coming out and I can't really change the data
going in because it's coming out of an accounting system.

What I thought I could probably do was either read up until the first
\r\n\r\n and completely ignore Total: (it's never used) or read up until
Total: and discard it later.
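
The first option could look roughly like this in C. This is only a sketch under assumptions: read_record is a made-up name, a line is assumed to fit in 512 bytes, a blank line is taken to end a record, and any line starting with "Total:" is simply skipped.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads one record from `in` into `out` (capacity
   `cap`). A record ends at a blank line or at end of file; lines that
   start with "Total:" are skipped. The record's lines are joined with
   '\n'. Assumes each input line is shorter than 512 bytes.
   Returns 1 if a record was read, 0 at end of input, -1 if the record
   does not fit in `out`. */
static int read_record(FILE *in, char *out, size_t cap)
{
    char line[512];
    size_t used = 0;

    assert(in != NULL && out != NULL && cap > 0);
    out[0] = '\0';

    while (fgets(line, sizeof line, in) != NULL) {
        size_t len = strcspn(line, "\r\n");  /* length without line ending */
        line[len] = '\0';

        if (len == 0) {              /* blank line */
            if (used > 0)
                return 1;            /* closes the current record */
            continue;                /* padding between records */
        }
        if (strncmp(line, "Total:", 6) == 0)
            continue;                /* never used, so drop it */

        if (used + len + 2 > cap)    /* text + '\n' + '\0' must fit */
            return -1;
        memcpy(out + used, line, len);
        used += len;
        out[used++] = '\n';
        out[used] = '\0';
    }
    return used > 0;                 /* last record may end at EOF */
}
```

Calling it in a loop until it returns 0 hands back one record at a time, with the embedded line breaks of the memo fields still in place for whatever parses the fields afterwards.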

What I was hoping is that someone has already done a generic split
string on string kinda thing so that when someone eventually takes a
look at my spaghetti code they won't decide to fire me on the spot ;-)

Mark
Nov 15 '05 #3

Mark wrote:
> Suman wrote:
>> Can we have some more information, here? [snip]
>
> It's supposed to be a CSV export from MYOB but there are a few memo

CSV = comma-separated values? What is MYOB?

> field that have carriage returns etc so I can't easily read until \r\n
> and assume that that is one record.
>
> It might go something like this...
>
> Customer name, date, first memo
> field, another memo field that has no CR's, and another
> memo field that
> will be split across a
> number of
> lines and may well have any unquoted comma thrown
> in just for fun
> Total: <-- this is always at the end

This is what I was talking about :)
So maybe you can actually write your own crude grammar, viz.

Record_set -> Record Record_set
            | 'Total:'

Record -> Cust_name ',' Date ',' Memo_fields

Memo_fields -> Memo_field ',' Memo_fields
             | Memo_field

Memo_field -> ...
Cust_name -> ...

... and then find what the *tokens* are. And then write your own
lexer -- that will scan the input for The Chosen Ones!

> I can't change the data coming out and I can't really change the data
> going in because it's coming out of an accounting system.
>
> What I thought I could probably do was either read up until the first
> \r\n\r\n and completely ignore Total: (it's never used) or read up until
> Total: and discard it later.

Are you sure you are not missing the forest for the trees?
I mean, I do not understand your preoccupation with "\r\n".
Not to demean you or anything, I just can't fathom why it
is so important.

> What I was hoping is that someone has already done a generic split
> string on string kinda thing so that when someone eventually takes a
> look at my spaghetti code they won't decide to fire me on the spot ;-)

I don't have any :/

> Mark


Nov 15 '05 #4
Mark <us**@site.com> wrote:
> I've got a really messy text file that I need to work on and the only
> things separating each record is either "\r\n\r\n" or "Total:".
>
> I figure I won't be able to use strtok because it will split the string
> when it matches any rather than all of the chars in the delimiter.
>
> Is there an easy way to split it based on a string (read char*) rather
> than a char?

Not pre-made. You'll have to search for the strings yourself, using
strstr().
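
A minimal sketch of such a search, assuming the whole file has already been read into one writable, NUL-terminated buffer. The name split_on_strings and its interface are invented for this example; only strstr() and strlen() are standard.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: like strtok(), but every delimiter is a whole
   string. Returns the next piece of *cursor (terminated in place) and
   advances *cursor past the delimiter that ended it. Returns NULL once
   the input is exhausted. `delims` holds `ndelims` non-empty strings. */
static char *split_on_strings(char **cursor, const char *const *delims,
                              size_t ndelims)
{
    char *start;
    char *best = NULL;       /* earliest delimiter match found so far */
    size_t best_len = 0;
    size_t i;

    assert(cursor != NULL && delims != NULL);
    start = *cursor;
    if (start == NULL)
        return NULL;

    for (i = 0; i < ndelims; i++) {
        char *hit = strstr(start, delims[i]);
        if (hit != NULL && (best == NULL || hit < best)) {
            best = hit;
            best_len = strlen(delims[i]);
        }
    }

    if (best == NULL) {
        *cursor = NULL;              /* last piece: no delimiter left */
    } else {
        *best = '\0';                /* cut the piece at the delimiter */
        *cursor = best + best_len;   /* resume after the delimiter */
    }
    return start;
}
```

Called repeatedly with the delimiters "\r\n\r\n" and "Total:", it returns one record per call. Two separators in a row produce an empty piece, which the caller would probably want to skip.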

Richard
Nov 15 '05 #5
Mark wrote:
> Suman wrote:
>> Can we have some more information, here?
>
> It's supposed to be a CSV export from MYOB but there are a few memo
> field that have carriage returns etc so I can't easily read until \r\n
> and assume that that is one record.
>
> It might go something like this...

"might" is not a word I like to see in interface specifications...

> Customer name, date, first memo
> field, another memo field that has no CR's, and another
> memo field that
> will be split across a
> number of
> lines and may well have any unquoted comma thrown
> in just for fun
>
> Total: <-- this is always at the end

so how do you know when one "memo field" ends and the next one begins?

> I can't change the data coming out and I can't really change the data
> going in because it's coming out of an accounting system.
>
> What I thought I could probably do was either read up until the first
> \r\n\r\n and completely ignore Total: (it's never used) or read up until
> Total: and discard it later.
>
> What I was hoping is that someone has already done a generic split
> string on string kinda thing so that when someone eventually takes a
> look at my spaghetti code they won't decide to fire me on the spot ;-)


Stop writing code (of whatever pasta variety). You have *got* to work
out the format of the data. The reason it has turned to spaghetti is that
you don't know what it's supposed to do. How can you write a program to do
something you can't do yourself?
--
Nick Keighley

Nov 15 '05 #6
> Customer name, date, first memo
> field, another memo field that has no CR's, and another
> memo field that
> will be split across a
> number of
> lines and may well have any unquoted comma thrown
> in just for fun
>
> Total: <-- this is always at the end

The plan goes like this:

1. Use a state variable to keep track of what you're reading now.
2. Use a switch to handle similar states.
3. Inside the switch, read on until you reach the terminating condition
for this state.

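OK, I'd write some rough code based on this as:

/* infile is assumed to be a FILE * already opened for reading. */
enum LEXERSTATES { CNAME, LDATE, MEMO1, MEMO2, MEMO3, LDONE } cstate
    = CNAME;
int c;

/* Test the result of fgetc() rather than feof(), which only becomes
   true after a read has already failed. */
while (cstate != LDONE && (c = fgetc(infile)) != EOF) {
    switch (cstate) {
    case CNAME:
    case LDATE:
        /* Read on until a ',' is reached and increment your state. */
        if (c == ',')
            cstate = (cstate == CNAME) ? LDATE : MEMO1;
        break;
    case MEMO1:
        /* Code to read memo 1 */
        break;

    /* Write the rest of the code yourself :-) */
    default:
        break;
    }
}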
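
For what it's worth, here is one way the plan could be filled in for a record that is already in memory. It is a sketch resting on a big assumption that the thread never confirms: the first two commas end the name and the date, and everything after them is lumped together as memo text. The names field_state, record, put_char and parse_record, and the field sizes, are all invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum field_state { ST_NAME, ST_DATE, ST_MEMO };

struct record {
    char name[64];
    char date[32];
    char memo[256];
};

/* Append c to a fixed-size field, silently dropping overflow. */
static void put_char(char *field, size_t cap, char c)
{
    size_t len = strlen(field);
    if (len + 1 < cap) {
        field[len] = c;
        field[len + 1] = '\0';
    }
}

/* Hypothetical parser: one state per field, one switch to dispatch.
   Assumes only the first two commas are field separators. */
static void parse_record(const char *text, struct record *rec)
{
    enum field_state state = ST_NAME;
    const char *p;

    assert(text != NULL && rec != NULL);
    memset(rec, 0, sizeof *rec);     /* all fields start as "" */

    for (p = text; *p != '\0'; p++) {
        switch (state) {
        case ST_NAME:
            if (*p == ',')
                state = ST_DATE;
            else
                put_char(rec->name, sizeof rec->name, *p);
            break;
        case ST_DATE:
            if (*p == ',')
                state = ST_MEMO;
            else if (*p != ' ' || rec->date[0] != '\0')  /* skip leading spaces */
                put_char(rec->date, sizeof rec->date, *p);
            break;
        case ST_MEMO:
            if (*p == '\r')
                break;               /* drop CR, handle LF below */
            /* Line breaks inside the memo become single spaces. */
            put_char(rec->memo, sizeof rec->memo, *p == '\n' ? ' ' : *p);
            break;
        }
    }
}
```

Splitting the memo text into separate memo fields would need a rule for telling a separating comma from a comma inside a memo, which is exactly the question raised in #6.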

Nov 15 '05 #7
