How to split a string using a fixed-length, variable-text delimiter

I'm working on a data file and can't find any common delimiters in the
file to indicate the end of one row of data and the start of the next.
Rows are not on individual lines but run across multiple lines.

It would appear, though, that every distinct set of data starts with a
'code' that is always 25 characters long. The text of the code varies,
however.

Assuming I've read the contents of the file into the string myfile, how
do I split it into an array, using this variable-text, fixed-length
(25-character) delimiter?

Thank you!

Gary-

Dec 6 '06 #1
24 Replies


Hello,

> Assuming I've read the contents of the file into the string myfile, how
> do I split my file into an array, using this variable text, fixed 25
> character long, delimiter?

You should probably be able to use Regex.Split(...), with a good regular
expression of course. I can give you help on writing that regular
expression, but I'll have to know a lot more about the delimiter string.
Oliver Sturm
--
http://www.sturmnet.org/blog
Dec 6 '06 #2

How do *you* know it's a delimiter and not data?

In other words, if *I* were to look at the file, knowing nothing about it,
how could I tell what was a delimiter and what was data? How would you
explain to me what to look for?

When you can answer that, you can start thinking about how to pass that
information to a machine.

HTH
Peter

Dec 6 '06 #3

Thank you for your replies. OK, I have had another look and I think the
task has just got harder. The length isn't always 25 characters. But I
have found a pattern; hopefully this will help.

I am using this 'code' as a delimiter because it always precedes the
name of an item, and this file is essentially a database of items.
Following the name of an item, a number of characteristics specific
to that item are listed. Once the item's characteristics are
completely listed, the next 'code' is encountered, which precedes the
next item in the database.

There do seem to be some identifiable traits of this code:

- It appears to be always at least 20 characters long.
- The code is continuous; there are no spaces present.
- It is always composed of letters A-Z or digits 0-9.
- The first two characters of the code are always letters A-Z.
- These two letters are repeated at least two other times within the
  code.

e.g.

DODE86DODE86SZDO010144

So I guess what I am trying to do now is split the string every time a
string is encountered that is at least 20 characters long, is
alphanumeric, and has its first two letters repeated within itself at
least two other times.

I think this is going to be tough?

Any ideas?

Thank you-
Dec 6 '06 #4

In case it wasn't obvious I would also like to add that the code has at
least one space at the start of it and the end of it.
Dec 6 '06 #5

Who created this file? Is there no documentation that describes its
format? Can you post a sample of the data that shows at least 2
complete "records" or items? Is there anything in the file, perhaps a
header of some sort, that can shed any light on the format?

Chris

Dec 6 '06 #6

Hello,

> DODE86DODE86SZDO010144
>
> So I guess what I am trying to do now is split the string, every time a
> string is encountered that is at least 20 characters long, is
> alphanumeric, and has the first two letters repeated in itself at least
> two other times.

Yes, well... you could try using a regular expression such as this:

[ ]([A-Z][A-Z])([A-Z0-9]+)(\1)([A-Z0-9]+)(\1)([A-Z0-9]+)[ ]

(This could also be simplified a bit.)
This does check the double repetition of the initial two characters,
but it doesn't check the minimum length of the string at the same time. If
you were just searching the text in question for occurrences of the
expression, you could easily write an additional check in code to find
out whether any given match is long enough. But if you're
using this expression in a Split() call, you can't do that...

Personally I would probably still use an expression such as this to
search for the delimiters, and do the splitting myself. If you can't do the
splitting fully automatically, you'll have to do it yourself in any case -
and using a regular expression to do the delimiter searching seems a
better option to me than coding up the search in C#.
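
The "search with the regex, then check the length in code" idea could look roughly like the sketch below. It is only an illustration: the class and method names (DelimiterFinder, FindCodes) are made up for this example, and it assumes each code is surrounded by single spaces as described in the thread.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DelimiterFinder
{
    // The pattern as posted above; group 1 captures the leading two letters
    // so that \1 can require them to appear two more times.
    private static readonly Regex CodePattern = new Regex(
        @"[ ]([A-Z][A-Z])([A-Z0-9]+)(\1)([A-Z0-9]+)(\1)([A-Z0-9]+)[ ]");

    // Hypothetical helper: returns every matched code (without its
    // surrounding spaces) that is at least minLength characters long.
    public static List<string> FindCodes(string text, int minLength)
    {
        List<string> codes = new List<string>();
        foreach (Match m in CodePattern.Matches(text))
        {
            string code = m.Value.Trim(' ');
            if (code.Length >= minLength)
            {
                codes.Add(code);
            }
        }
        return codes;
    }
}
```

Calling FindCodes(fileContents, 20) would then keep only candidates that satisfy the "at least 20 characters" rule. One caveat of this sketch: each match consumes its trailing space, so two codes separated by a single space would not both be found.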

Oliver Sturm
--
http://www.sturmnet.org/blog
Dec 6 '06 #7

I agree, that's wonky-sounding data :) How about something like this?
It's not quite what you want, but might give you a start. Currently it
just finds the codes as you've described them and returns them, but
I've got to do some real work so...
I'm not great with regular expressions, so I've only used one to check
that the first two characters occur three times. Oh, and it's very scrappy.

    private void fooTest2()
    {
        foreach (string s in foo2(",,,12tt12ttt12ttttttttt,,ab111ab11111111111ab,"))
        {
            Console.WriteLine(s);
        }
    }

    // Examines alphanumeric runs 20 characters at a time and keeps each
    // 20-character chunk whose first two characters occur three times in it.
    private System.Collections.ArrayList foo2(string pFoo)
    {
        int i;
        int j;
        int o = 0;           // start index of the current alphanumeric run
        int p = 0;           // length of the current run so far
        System.Text.RegularExpressions.Regex r;
        bool running = true; // true while inside an alphanumeric run
        char[] c;
        String s;
        System.Collections.ArrayList a = new System.Collections.ArrayList();
        c = pFoo.ToCharArray();

        for (i = 0, j = 0; i < c.Length; i = j)
        {
            for (; j < c.Length; j++)
            {
                if (IsAN(c[j]))
                {
                    if (running)
                    {
                        p++;
                    }
                    else
                    {
                        running = true;
                        p = 1;
                        o = j;
                    }
                }
                else
                {
                    running = false;
                }
                if (20 == p)
                {
                    // Count how often the run's first two characters occur in it.
                    r = new System.Text.RegularExpressions.Regex(pFoo.Substring(o, 2));
                    s = pFoo.Substring(o, j - o + 1);
                    if (3 == r.Matches(s).Count)
                    {
                        a.Add(s);
                    }
                    // Reset either way so the same run is not tested again.
                    p = 0;
                    running = false;
                }
            }
        }
        return a;
    }

    // True for the letters A-Z (either case) and the digits 0-9.
    private bool IsAN(char pC)
    {
        char c = char.ToUpper(pC);
        if ('A' <= c && 'Z' >= c)
        {
            return true;
        }
        if ('0' <= pC && '9' >= pC)
        {
            return true;
        }
        return false;
    }

Dec 6 '06 #8

Hi, it's used in a custom-written, DOS-based programme where I work.
The developers have long since disappeared.

I'd really like to know how to achieve this in code if possible,

Thanks,

Gary-
Dec 6 '06 #9

This group is amazing. Thank you both very much; I'm going to explore
both suggestions now.

Dec 6 '06 #10

I am trying to use the regular expression that Oliver kindly provided
as a starting point.
filecontents is a string that contains my file contents, but I can't get
this to work. I added the @ because I was getting an error that the
escape sequence wasn't recognised, but it still isn't working. How can I
fix this? Thank you.

I'm getting an error at Regex.Split(...):

    Regex r = new Regex(@"[ ]([A-Z][A-Z])([A-Z0-9]+)(\1)([A-Z0-9]+)(\1)([A-Z0-9]+)[ ]");
    Regex.Split(filecontents, r);

    MessageBox.Show(filecontents.Length.ToString());

Thank you

Dec 6 '06 #11

Try

    string[] matches = r.Split(filecontents);

assuming filecontents is the text we're searching.
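
For what it's worth, the error at Regex.Split(filecontents, r) is probably because the static Regex.Split overload expects the pattern as a string, not a Regex object. A small sketch of the two forms follows; the class name and the simple pattern are made up purely for illustration.

```csharp
using System.Text.RegularExpressions;

public static class SplitForms
{
    // Hypothetical, deliberately simple delimiter pattern for the demo:
    // a space, two capital letters, some digits, a space.
    private const string Pattern = @"[ ][A-Z]{2}[0-9]+[ ]";

    // Instance form: build the Regex once, then call Split on it.
    public static string[] SplitWithInstance(string input)
    {
        Regex r = new Regex(Pattern);
        return r.Split(input);
    }

    // Static form: pass the input and the pattern string.
    public static string[] SplitWithStatic(string input)
    {
        return Regex.Split(input, Pattern);
    }
}
```

Both forms return the same pieces. In either case the return value has to be stored; calling Split and discarding the result leaves the original string unchanged.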

Dec 6 '06 #12

Thank you, DeveloperX, but I'm not getting the desired result. Then I
realised I shouldn't be using @, as that will just negate the escape
characters.

The regex doesn't like \1; any suggestions what this should be changed
to?

Thanks,

Gary-

Dec 6 '06 #13

Bizarre, I pasted the regex into my little test app (with the @, of
course) and it fires:

    foo3(",,,12tt12ttt12ttttttttt,, ABCCCABCCCCCCCCCABCC ,");

    private void foo3(string pFoo)
    {
        System.Text.RegularExpressions.Regex r = new System.Text.RegularExpressions.Regex(
            @"[ ]([A-Z][A-Z])([A-Z0-9]+)(\1)([A-Z0-9]+)(\1)([A-Z0-9]+)[ ]");
        Console.WriteLine(r.Matches(pFoo).Count.ToString());
        string[] s = r.Split(pFoo);
    }

The \1 refers to the first group, ([A-Z][A-Z]), so what this regex is
saying is: match a space, then XX, then any combination of X or N, then the
XX found earlier, then more X or N, then our original XX again, followed
by more X or N, then a space (iirc). X = A-Z, N = 0-9.

You might also wish to look at the
System.Text.RegularExpressions.RegexOptions enum, which sets things like
case sensitivity, multiline support and so forth. As you can see above,
I didn't set anything and just took the defaults.

What is the actual error you got?
Dec 6 '06 #14

Hello,

> Thankyou developer x, i'm not getting the desired result. Then i
> realised i shouldn't be using @ as that will just negate the escape
> characters.

No, using the @ should be just fine, I usually do that myself.

> The regex doesn't like \1 any suggestions what this should be changed
> to?

That's if you don't use the @, right?

I'm not really sure what the problem might be - of course my expression is
working with a lot of assumptions that you and I have been making in this
discussion, so accordingly there may be a lot of reasons why you're not
"getting the desired results" :-)

I checked that my expression worked with the delimiter string you
previously posted, but nothing else of course. If you can post further
examples of the delimiter string, maybe that would help... otherwise, feel
free to send me a sample program or a sample data file by email (I think
attachments can't be posted to this group?) or something and I'll have a
look.
Oliver Sturm
--
http://www.sturmnet.org/blog
Dec 6 '06 #15

Thank you, DeveloperX. I don't get an error with the @, only when I remove
the @.

I removed the @ because when I include it the result isn't what I
expected.

If I run this with the @ and then check the length of the arraylist, it's
3055.

The sample file I'm running it on has three of these 'codes' and
three rows of data,

e.g.

HUa82ab8HU272ajHUeje <lots of other text here running over multiple
lines> UNa8723oansjaUNasUNa <more text here running over many lines>
IN8aatjresINiys9aINsa <more text here>

I thought I would get all the text between <...> into individual
arraylist elements by running this, but I'm not... what am I doing
wrong?

Thank you
Gary-

I was expecting each element of my arraylist,
e.g. [0], [1], ...
to contain everything between a pair of codes.

Dec 6 '06 #16

Here's foo3 again with some extra code. The interesting bit is the for
loop at the end. If you print out what it's matching and the position
in the source data, we can see what's going on. Is it feasible to post
the test data?

On the @ thing: C# uses escape characters in strings, so \ followed by a
character has a special meaning; \t is tab (iirc). What the @ does
before the string is tell the compiler that everything in the quotes is
a literal string, and that it shouldn't get fancy and try to replace \1
with what it thinks \1 should mean (or refuse to compile when it doesn't
know what it is :)).

    private void foo3(string pFoo)
    {
        System.Text.RegularExpressions.Regex r = new System.Text.RegularExpressions.Regex(
            @"[ ]([A-Z][A-Z])([A-Z0-9]+)(\1)([A-Z0-9]+)(\1)([A-Z0-9]+)[ ]");

        Console.WriteLine(r.Matches(pFoo).Count.ToString());

        string[] s = r.Split(pFoo);
        //Console.WriteLine(s[0]);
        //Console.WriteLine(s[1]);

        System.Text.RegularExpressions.MatchCollection c = r.Matches(pFoo);
        foreach (System.Text.RegularExpressions.Match m in c)
        {
            //Console.WriteLine(m.Index.ToString());
        }
        System.Text.RegularExpressions.Match a;

        // Print the position and text of every match, one after another.
        for (a = r.Match(pFoo); a.Success; a = a.NextMatch())
        {
            Console.WriteLine(a.Index.ToString() + " " + a.Value);
        }
    }
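
To make the @ point concrete, here is a tiny sketch comparing a regular string literal with a verbatim one. The class name is invented for the example, and the pattern is a shortened stand-in for the real one, not the thread's full expression.

```csharp
using System.Text.RegularExpressions;

public static class VerbatimDemo
{
    // Regular literal: every backslash meant for the regex engine must be
    // doubled, otherwise the C# compiler tries to interpret it itself
    // ("\1" on its own is rejected as an unrecognized escape sequence).
    public const string Escaped = "[ ]([A-Z][A-Z])[A-Z0-9]+\\1[ ]";

    // Verbatim literal: the backslash is passed through unchanged.
    public const string Verbatim = @"[ ]([A-Z][A-Z])[A-Z0-9]+\1[ ]";

    // Both literals produce exactly the same pattern text.
    public static bool SamePattern()
    {
        return Escaped == Verbatim;
    }

    public static bool Matches(string text)
    {
        return Regex.IsMatch(text, Verbatim);
    }
}
```

So the @ does not switch off the regex backreference; it only stops the C# compiler from touching the backslash before the regex engine sees it.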

Dec 6 '06 #17

Hello,

> HUa82ab8HU272ajHUeje <lots of other text here running over multiple
> lines> UNa8723oansjaUNasUNa <more text here running over many lines>
> IN8aatjresINiys9aINsa <more text here>
>
> Now i thought i would get all the text between <...> into individual
> arraylist elements by running this but i'm not... what am i doing
> wrong?

An obvious thing could be to use RegexOptions.IgnoreCase in your call to
Split() - your original delimiter didn't have any lower case characters,
but those you're posting now do.
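
A quick way to see the effect of that option is sketched below, using the first sample code from the post being answered. The class and method names are invented for the example.

```csharp
using System.Text.RegularExpressions;

public static class CaseDemo
{
    // The pattern as posted earlier in the thread.
    private const string Pattern =
        @"[ ]([A-Z][A-Z])([A-Z0-9]+)(\1)([A-Z0-9]+)(\1)([A-Z0-9]+)[ ]";

    // Default options: [A-Z] only accepts upper-case letters.
    public static bool MatchesCaseSensitive(string text)
    {
        return Regex.IsMatch(text, Pattern);
    }

    // With IgnoreCase, [A-Z] also accepts a-z.
    public static bool MatchesIgnoringCase(string text)
    {
        return Regex.IsMatch(text, Pattern, RegexOptions.IgnoreCase);
    }
}
```

With the input " HUa82ab8HU272ajHUeje " the case-sensitive call finds nothing, while the IgnoreCase call matches. Bear in mind that ignoring case also widens what counts as a code, so ordinary lower-case text could in principle be picked up as a delimiter.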

Apart from that - either describe in much more detail how your code works
now and what result you're actually getting, or post or mail something
that lets us reproduce the problem ourselves.
Oliver Sturm
--
http://www.sturmnet.org/blog
Dec 6 '06 #18

Emailed a sample, thanks very much.

Dec 6 '06 #19

ga********@myway.com wrote:
> Hi it's used in a custom written programme where I work which is dos
> based.
> The developers have long since dissapeared.
>
> I'd really like to know how to achieve this in code if possible,

That's why I suggested posting a few "records" from this file so we
could see it and maybe help determine its format. Does the file start
with this data immediately or is there any header data in the beginning
of the file?

Chris
Dec 6 '06 #20

Hello,

> Emailed a sample, thanks very much.

Replied by email. Just a quick summary of what was wrong with your
previous code.

Looking at the regex I had posted previously:

[ ]([A-Z][A-Z])([A-Z0-9]+)(\1)([A-Z0-9]+)(\1)([A-Z0-9]+)[ ]

This regex has a number of groups that I had added for test purposes. It
can be stripped down to this without changing what it matches:

[ ]([A-Z][A-Z])[A-Z0-9]+\1[A-Z0-9]+\1[A-Z0-9]+[ ]

It still has one capture group that is absolutely necessary to make the
back reference work. The Regex.Split method has the peculiar behaviour of
adding the result of the capture group itself to the string array it
returns, and there doesn't seem to be a way around that. So in the sample
program I sent you, I used the matching functionality of the Regex class
instead and picked out the pieces from the string "manually".
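
The sample program itself was sent by email and isn't reproduced in the thread, so the following is only a rough sketch of the idea under my own assumptions: the names RecordSplitter, SplitWithRegex and SplitManually are invented, and any text before the first code is simply dropped.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RecordSplitter
{
    // The stripped-down pattern from the post above.
    private static readonly Regex CodePattern = new Regex(
        @"[ ]([A-Z][A-Z])[A-Z0-9]+\1[A-Z0-9]+\1[A-Z0-9]+[ ]");

    // Regex.Split mixes the text captured by group 1 in with the pieces.
    public static string[] SplitWithRegex(string text)
    {
        return CodePattern.Split(text);
    }

    // Manual alternative: each record is the text between the end of one
    // match and the start of the next (or the end of the input).
    public static List<string> SplitManually(string text)
    {
        List<string> records = new List<string>();
        MatchCollection matches = CodePattern.Matches(text);
        for (int i = 0; i < matches.Count; i++)
        {
            int start = matches[i].Index + matches[i].Length;
            int end = (i == matches.Count - 1) ? text.Length : matches[i + 1].Index;
            records.Add(text.Substring(start, end - start));
        }
        return records;
    }
}
```

For an input such as " AB1AB2AB3 first item CD1CD2CD3 second item", the Split version returns the captured "AB" and "CD" as extra elements between the records, whereas the manual version returns just the two records.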

All this is probably not the most efficient algorithm in the world -
including the idea of reading the whole 14MB file into a string - but I
wouldn't expect any big performance problems on a modern system... if
performance is important, there are certainly lots of optimizations that
can be done.
Oliver Sturm
--
http://www.sturmnet.org/blog
Dec 6 '06 #21

You are a gentleman and a scholar sir, I'm going to spend a good couple
of days reading over the code in your email when it arrives - until I
become confident with these techniques.

Regex is very new to me, I would have been completely lost without your
help.

Many, many, thanks again,

Gary-

Dec 6 '06 #22

Well thank you - mail should be there, it was sent even before my previous
post.
Oliver Sturm
--
http://www.sturmnet.org/blog
Dec 6 '06 #23

Thanks again Oliver, I'm just working through that code today. I
understand (at least at a very basic level) what most of the code is
doing, with the exception of the following line:

    string content = i == matches.Count - 1 ?

Could you explain that line for me, please?

Thank you,

Gary-

Dec 7 '06 #24

Hello,

> Thanks again Oliver, i'm just working through that code today. I
> understand (at least at a very basic level) what most of the code is
> doing. With the exception of the following line: -
>
> string content = i == matches.Count - 1 ?
>
> could you explain that line for me please,

It actually continues to say

    string content = i == matches.Count - 1 ?
        text.Substring(match.Index + match.Length) :
        text.Substring(match.Index + match.Length, matches[i + 1].Index - match.Index - match.Length);

Sorry I used this - it's not the most widely understood or liked
construct. It uses the conditional (ternary) operator, and it's a
slightly shorter way of saying

    string content;
    if (i == matches.Count - 1)
        content = text.Substring(match.Index + match.Length);
    else
        content = text.Substring(match.Index + match.Length, matches[i + 1].Index - match.Index - match.Length);

Oliver Sturm
--
http://www.sturmnet.org/blog
Dec 7 '06 #25
