Regex help with large strings

Hi,

I've seen some postings on this but not exactly relating to this
posting. I'm reading in a large mail message as a string. In the
string is an xml attachment that I need to parse out and remove from
the message once processed. I have to do this as a string and not
using any CDO libraries. My problem is that there's normally a large
pdf in the file so when I read the file in it's massive and I don't
know if the XML is at the start/middle or end of the string. My regex
is as follows:

Regex rXMLPart = new Regex(
@"(?<Start>.*)(?<Middle>Content-Type:[^.*?]text\/xml.*?finaldistributeinformation.*?\<\/distributionList\>)(?<End>.*)",
RegexOptions.IgnoreCase |
RegexOptions.Singleline |
RegexOptions.IgnorePatternWhitespace);

and a sample of the string is:
-----------------------
Message-ID: <00****************************@csfb.csgroup.com >
From: "Test" <te**@test.com>
To: <>
Subject: This is a test subject
Date: Thu, 2 Sep 2004 16:58:12 +0100
MIME-Version: 1.0
Content-Type: multipart/mixed;
boundary="----=_NextPart_000_0005_01C4910E.083D9600"
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2800.1409
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1409

This is a multi-part message in MIME format.

------=_NextPart_000_0005_01C4910E.083D9600
Content-Type: multipart/alternative;
boundary="----=_NextPart_001_0006_01C4910E.083D9600"
------=_NextPart_001_0006_01C4910E.083D9600
Content-Type: text/plain;
charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

This is some body text.
-mark.
------=_NextPart_001_0006_01C4910E.083D9600
Content-Type: text/html;
charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=3DContent-Type content=3D"text/html; =
charset=3Diso-8859-1">
<META content=3D"MSHTML 6.00.2800.1458" name=3DGENERATOR>
<STYLE></STYLE>
</HEAD>
<BODY bgColor=3D#ffffff>
<DIV><FONT face=3DArial size=3D2>This is some body text.</FONT></DIV>
<DIV><FONT face=3DArial size=3D2></FONT>&nbsp;</DIV>
<DIV><FONT face=3DArial size=3D2>***</FONT></DIV></BODY></HTML>

------=_NextPart_000_0005_01C4910E.083D9600
Content-Type: text/xml;
name="DO_NOT_DELETE_EMAIL_ATTACHMENT.XML"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
filename="DO_NOT_DELETE_EMAIL_ATTACHMENT.XML"

<?xml version="1.0" encoding="UTF-8"?>
<distributionList>
</distributionList>
------=_NextPart_000_0005_01C4910E.083D9600
Content-Type: application/pdf;
name="Reader.pdf"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
filename="Reader.pdf"

JVBERi0xLjUNJeLjz9MNCjkxOTUgMCBvYmo8PC9IWzQzMzk2ID M5MzJdL0xpbmVhcml6ZWQgMS9F
IDEyMjMzNi9MIDE1NTUzMDcvTiAxNzkvTyA5MTk5L1QgMTM3MT M2Mz4+DWVuZG9iag0gICAgICAg
IA14cmVmDTkxOTUgMzYNMDAwMDAwMDAxNiAwMDAwMCBuDQowMD AwMDQ3Njc5IDAwMDAwIG4NCjAw
MDAwNDMzOTYgMDAwMDAgbg0KMDAwMDA0NzkzNSAwMDAwMCBuDQ owMDAwMDQ3OTk5IDAwMDAwIG4N
CjAwMDAwNDgyNzYgMDAwMDAgbg0KMDAwMDA0ODMyNyAwMDAwMC BuDQowMDAwMDQ4NjMwIDAwMDAw
IG4NCjAwMDAwNTM0ODAgMDAwMDAgbg0KMDAwMDA1MzUxNiAwMD AwMCBuDQowMDAwMDUzOTUyIDAw
.......
------------------------

I've cut the string short but that is the gist of it. If I were to run
against this attached string it all works fine but when really large
(with the rest of the pdf in) the match hangs:

Match mXMLPersonalisation = rXMLPart.Match(data);

Could anyone suggest a better way to do this? I need to get
the first part and the last part and join them, thus removing the XML part.
I also need to work on the XML to create the new messages.

i.e.

string sStartPartOfEmailMessage =
mXMLPersonalisation.Groups["Start"].ToString();
string sXMLPartOfMessage =
mXMLPersonalisation.Groups["Middle"].ToString();
string sEndPartOfEmailMessage =
mXMLPersonalisation.Groups["End"].ToString();

SendXMLEmail(sStartPartOfEmailMessage, sXMLPartOfMessage,
sEndPartOfEmailMessage);

Any help would be much appreciated.

-mark.
Nov 16 '05 #1
1 Reply


"Mark" <ma*********@csfb.com> wrote in
news:9a**************************@posting.google.com...
Hi,

I've seen some postings on this but not exactly relating to this
posting. I'm reading in a large mail message as a string. In the
string is an xml attachment that I need to parse out and remove from
the message once processed. I have to do this as a string and not
using any CDO libraries. My problem is that there's normally a large
pdf in the file so when I read the file in it's massive and I don't
know if the XML is at the start/middle or end of the string. My regex
is as follows:

Regex rXMLPart = new Regex(
@"(?<Start>.*)(?<Middle>Content-Type:[^.*?]text\/xml.*?finaldistributeinformation.*?\<\/distributionList\>)(?<End>.*)",
RegexOptions.IgnoreCase |
RegexOptions.Singleline |
RegexOptions.IgnorePatternWhitespace);


I haven't done any performance tests with that regex, but I'm quite sure it
will take a very long time if it can *not* find a match on a long string,
because the unanchored leading .* is retried from every starting position.
Here are a few suggestions:

- Add start/end anchors like these:
@"^(?<Start>.*)(?<Middle>Content-Type:[^.*?]text\/xml.*?finaldistributeinformation.*?\<\/distributionList\>)(?<End>.*)$"
So the .* expression in the beginning doesn't have to try every starting
point in the string.
- Couldn't you use Regex.Replace on a pattern like this:
@"(?<Middle>Content-Type:[^.*?]text\/xml.*?finaldistributeinformation.*?\<\/distributionList\>)"
The way regexes work, this should be a lot faster. If you need complex
processing on the string that can't be done with capturing parentheses, you
could use a MatchEvaluator.
- Finally:
@"Content-Type:[^.*?]text\/xml"
Are you sure about this character class? "[^.*?]" matches exactly one
character that is not '.', '*' or '?'. I'd have expected something like
"\s*" instead.
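
To make the second and third suggestions concrete, here is a rough sketch, not a tested solution for real mail data. It matches only the XML part and derives the start and end pieces from the match position instead of capturing them with .* groups. The names XmlPartSplitter, TrySplit and RemoveXmlPart are made up for illustration, and using "\s*" in place of "[^.*?]" is an assumption about what the pattern was meant to do.

```csharp
using System;
using System.Text.RegularExpressions;

public static class XmlPartSplitter
{
    // Matches only the XML part of the message. "\s*" replaces the original
    // "[^.*?]" (an assumption); the "finaldistributeinformation" marker is
    // kept from the original pattern.
    private static readonly Regex XmlPart = new Regex(
        @"Content-Type:\s*text/xml.*?finaldistributeinformation.*?</distributionList>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Splits data into the text before the XML part, the XML part itself,
    // and the text after it. Returns false if no XML part was found.
    public static bool TrySplit(string data, out string start, out string middle, out string end)
    {
        Match m = XmlPart.Match(data);
        if (!m.Success)
        {
            start = data;
            middle = string.Empty;
            end = string.Empty;
            return false;
        }

        // Match.Index and Match.Length locate the XML part, so the
        // surrounding text never has to be captured by the regex.
        start = data.Substring(0, m.Index);
        middle = m.Value;
        end = data.Substring(m.Index + m.Length);
        return true;
    }

    // Alternative: remove the first XML part with Regex.Replace and hand it
    // to a callback through a MatchEvaluator.
    public static string RemoveXmlPart(string data, Action<string> onXmlPart)
    {
        MatchEvaluator evaluator = m =>
        {
            onXmlPart(m.Value);
            return string.Empty;
        };
        return XmlPart.Replace(data, evaluator, 1);
    }
}
```

With something like this, the three strings from TrySplit could be passed straight to the SendXMLEmail call from the original post. Whether it is fast enough on a message with a large PDF attached would still need to be measured.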

Niki
Nov 16 '05 #2
