string extraction

I have a file containing some commands in free format. Each
command is terminated with ";". A ";" can also appear within a
command, but only enclosed within string delimiters (' or "). Example:

INSERT INTO nation (code, name) VALUES(700448768,
"za; sdfhsd''"sdfa");

INSERT INTO nation (code, name)

VALUES(701464576, 'msd; vasdvas ""hjh"" u');
My question is: what is the best way to extract these commands, one at
a time?
The result should be (2 commands):
INSERT INTO nation (code, name) VALUES(700448768, "za; sdfhsd'"sdfa");
INSERT INTO nation (code, name) VALUES(701464576, 'msd; vasdvas ""hjh"" u');
I was thinking about regex, but it may be tricky to find the right one.
Any ideas?

-tom

May 19 '06 #1
You should take a look at the parsing algorithms from compiler
theory.

There's a common function found in many programming languages, usually
called Split or Tokenize, that lets you specify a delimiting character
and returns some sort of collection of strings. Something like:
Array arrayCommands = Split(stringCommands, ";")

And then you would do something like:

foreach(String command in arrayCommands)
{
    if(command ends with a ", then there was a quoted ';')
    {
        // so we add the quoted ; back in and combine again with the next
        // command, which is really part of this command and shouldn't
        // have been split up
        command = command + ';' + (next command in array)
        delete next command in array
    }
}

This is just pseudocode, of course. (next command in array) could be
found by getting the index of the current command, adding one, and
indexing into the array. You'll need to find out what VB.NET's tokenize
or split function is and how it works. I'm sure there is something like
that.
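One way to make the split-and-rejoin idea concrete is sketched below, in Python rather than VB.NET purely for brevity. It departs from the pseudocode in one respect, which is an assumption of this sketch: instead of testing whether a piece ends with a quote, it tracks whether a string literal is still open at the end of each piece. The names `open_quote_after` and `split_commands` are made up for illustration.

```python
def open_quote_after(piece, quote=None):
    """Scan piece and return the delimiter that is still open, or None."""
    for ch in piece:
        if quote is None:
            if ch in "'\"":
                quote = ch      # a string literal starts here
        elif ch == quote:
            quote = None        # closes it; a doubled delimiter reopens it at once
    return quote


def split_commands(text):
    """Split on ';' and glue back the pieces that were cut inside a string."""
    commands = []
    current = ""
    quote = None
    pieces = text.split(";")
    # The last piece is whatever follows the final ';', so it is skipped.
    for piece in pieces[:-1]:
        current += piece
        quote = open_quote_after(piece, quote)
        if quote is None:
            commands.append(current.lstrip() + ";")
            current = ""
        else:
            current += ";"      # this ';' was inside a string: put it back
    return commands
```

Because a doubled delimiter simply closes and reopens the string, no special case is needed for it when the only goal is to find command boundaries.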

May 20 '06 #2
It might be worth trying Regex, but your delimiters don't seem to
be balanced.

In this line, for instance:
INSERT INTO nation (code, name) VALUES(700448768, "za; sdfhsd''"sdfa");

there are 3 double quotes, not 4 as one would expect. You seem to be
opening with a double quote and closing with a single quote. So, I
couldn't get far with constructing a Regex.

May 20 '06 #3
Hi snozz,

Thank you very much for your advice. I will look for these
functions.

The logic you kindly suggest is not clear to me,
and I have a question.

When I wrote:

<< The ";" can also be found within the
command but, only enclosed within delimiters (' or "") >>

I meant something like for instance

1. " ;; some string containing; semicolon; within "

not necessarily something like:

2. " ";"';' some string "";"" containing semicolon ... "

I have the impression that you are assuming situation 2
and not 1. Is that so, or am I missing something?

Another point is that the file can be several gigabytes, so I need
some kind of "buffered" logic. But I guess I could read a bunch of lines
at a time.
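A buffered version might look roughly like the sketch below (Python, for brevity). It reads fixed-size blocks instead of lines and carries both the unfinished command and the "inside a string" state from one block to the next. The function name `iter_commands` and the 64 KB default are assumptions made for illustration, not a recommendation.

```python
import io


def iter_commands(stream, chunk_size=64 * 1024):
    """Yield one ';'-terminated command at a time from a text stream."""
    parts = []      # text of the command collected so far
    quote = None    # the string delimiter currently open, or None
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        start = 0
        for i, ch in enumerate(chunk):
            if quote is None:
                if ch in "'\"":
                    quote = ch
                elif ch == ";":
                    parts.append(chunk[start:i + 1])
                    yield "".join(parts).strip()
                    parts = []
                    start = i + 1
            elif ch == quote:
                quote = None
        parts.append(chunk[start:])   # carry the unfinished tail over
    # Text after the last ';' is not a complete command and is dropped.


# Small demonstration with an in-memory stream and a tiny block size.
sample = io.StringIO("A 'x;y';\nB \"p\"\"q;r\";")
for command in iter_commands(sample, chunk_size=4):
    print(command)
```

With a real file the stream would come from something like `open(path, encoding="utf-8")`, where `path` and the encoding are placeholders. Only one command plus one block is held in memory at any time.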

-tom

May 20 '06 #4
Hi Cerebrus,

What I mean is that strings follow exactly the same rules as in VB.NET
or SQL.
The string

"za; sdfhsd''"sdfa"

in the command you refer to is ok because the string content:
<za; sdfhsd''"sdfa>

is meant to be rendered as: <za; sdfhsd''sdfa>
that is, a doubled delimiter "" within the string is rendered
as a single quote character, just the same as in VB.NET.

You are right about example 2, however;
it should have been:

2. " "";""';' some string "";"" containing semicolon ... "

Yes, I have often tried to use regex, but it is complicated to
deal with even simple cases of quotes enclosed within quotes.
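For what it is worth, the doubled-delimiter rule makes a regex simpler than it first looks: a literal such as "a ""b"" c" can be read as three adjacent quoted runs, so the pattern never has to know about doubling. The sketch below is in Python, and `COMMAND` and `find_commands` are illustrative names. It assumes well-formed input; an unterminated string would make it skip or mis-split text, and matching one character at a time is not fast on very large inputs.

```python
import re

COMMAND = re.compile(
    r"""
    (?: [^;'"]      # an ordinary character outside any string
      | '[^']*'     # a single-quoted run
      | "[^"]*"     # a double-quoted run
    )*
    ;               # the terminating semicolon
    """,
    re.VERBOSE,
)


def find_commands(text):
    """Return every ';'-terminated command found in text."""
    return [m.group().strip() for m in COMMAND.finditer(text)]
```

The three alternatives start with different characters, so the pattern has only one way to match any given input and does not backtrack heavily.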

------------------------

Put simply, my question is: how do I extract commands of the form

myCommand ;

each command ends where a ; (not enclosed in a string) is found.

The commands are placed freely within a very large file. myCommand can
contain strings that themselves contain the semicolon char. A string
can be delimited by either " or ' and can contain its own delimiter
char. In that case the delimiter is doubled (as in VB.NET, SQL, ...)
and is rendered as a single char.
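The doubling rule itself can be illustrated with a small Python sketch that turns a quoted literal into the text it stands for. `render_literal` is a made-up helper name, and the sketch only checks the outer delimiters; it does not verify that every delimiter inside the body is properly doubled.

```python
def render_literal(literal):
    """Return the content of a quoted literal, undoing doubled delimiters."""
    if len(literal) < 2 or literal[0] not in "'\"" or literal[-1] != literal[0]:
        raise ValueError("not a quoted string literal: %r" % literal)
    quote = literal[0]
    body = literal[1:-1]
    # Only the literal's own delimiter is doubled; the other kind is plain text.
    return body.replace(quote * 2, quote)
```

Note that the other delimiter is left untouched: inside a '...' literal, "" stays as two double-quote characters.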

-tom

May 20 '06 #5
Your best option is probably to use the .IndexOf method on the whole
string, looking for " and ;. A flag can tell you when to skip a ;
enclosed in quotes.
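A rough sketch of that search-based approach follows, in Python, with `str.find` standing in for .IndexOf. Instead of a flag, it jumps straight from an opening delimiter to the next matching one, which is an assumption of this sketch. It also handles ' as a delimiter, as the original question requires. `extract_commands` is an illustrative name.

```python
def extract_commands(text):
    """Find commands by searching for the next ';' or string delimiter."""
    commands = []
    start = 0    # where the current command begins
    pos = 0      # where the next search begins
    while True:
        # Position of the nearest ; ' or " at or after pos.
        hits = [i for i in (text.find(c, pos) for c in ";'\"") if i != -1]
        if not hits:
            break
        i = min(hits)
        ch = text[i]
        if ch == ";":
            commands.append(text[start:i + 1].strip())
            start = pos = i + 1
        else:
            close = text.find(ch, i + 1)   # the matching closing delimiter
            if close == -1:
                break                      # unterminated string: give up
            pos = close + 1
    return commands
```

A doubled delimiter is treated as one string ending and another beginning, so any ; between them is never examined. Each step runs three searches, which can get slow when one of the characters is rare in a large input.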
--
Dennis in Houston
May 20 '06 #6
Thanks Dennis,

Actually, I am not completely persuaded that it can be done that way
in general, as you could
have something like:

.... (" my preferred keywords ""work; work ; work"" ") ;

Hmm, I am afraid that every char must be parsed, so that flags can
distinguish a ; that occurs within string delimiters from one that
is a command separator ...
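The flag idea can be tried out on exactly that kind of input with a few lines of Python. The sketch walks every char, keeps one piece of state (which delimiter is open, if any), and reports for each ; whether it is a command separator. `classify_semicolons` is a hypothetical name.

```python
def classify_semicolons(text):
    """Return (index, is_separator) for every ';' in text."""
    result = []
    quote = None    # the open string delimiter, or None outside strings
    for i, ch in enumerate(text):
        if ch == ";":
            result.append((i, quote is None))
        elif quote is None:
            if ch in "'\"":
                quote = ch
        elif ch == quote:
            quote = None   # a doubled delimiter closes and reopens the string
    return result
```

For a line like the example above, the two ; inside the string come back flagged as non-separators and only the final one as a separator.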

-tom

May 21 '06 #7
Since compilers already deal with this quite efficiently, you
will find solid, practical algorithms if you look at some of the
theory that addresses programming languages, syntax, and parsing.

You might want to search for "recursive descent parser".

I think your type of parsing falls under "lexical analysis", although
it might be "syntax analysis".
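Seen as lexical analysis, the job needs only three token kinds: string literals, semicolons, and everything else. The Python sketch below is a minimal lexer along those lines; the token names, `tokenize`, and `commands` are made up for illustration, and the error handling is deliberately crude.

```python
import re

TOKEN = re.compile(
    r"""
      (?P<STRING>    '(?:[^']|'')*' | "(?:[^"]|"")*" )
    | (?P<SEMICOLON> ; )
    | (?P<TEXT>      [^;'"]+ )
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Yield (kind, value) pairs covering the whole input."""
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            # Only a delimiter that opens no complete string can get here.
            raise ValueError("unterminated string at offset %d" % pos)
        yield m.lastgroup, m.group()
        pos = m.end()


def commands(text):
    """Group tokens into commands, ending each one at a SEMICOLON token."""
    current = []
    for kind, value in tokenize(text):
        current.append(value)
        if kind == "SEMICOLON":
            yield "".join(current).strip()
            current = []
```

A semicolon inside a literal is swallowed by the STRING token, so it never shows up as a SEMICOLON. No recursive descent parser is needed for this, since the commands are not nested.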

May 26 '06 #8
