Regex help with large strings

Mark

Hi,

I've seen some postings on this, but none that match my situation
exactly. I'm reading a large mail message into a string. The string
contains an XML attachment that I need to parse out and remove from
the message once it has been processed. I have to do this on the
string itself, without using any CDO libraries. My problem is that
there's normally a large PDF in the file, so the string is massive
once read in, and I don't know whether the XML is at the start,
middle or end of it. My regex is as follows:

Regex rXMLPart = new Regex(
@"(?<Start>.*)(?<Middle>Content-Type:[^.*?]text\/xml.*?finaldistributeinformation.*?\<\/distributionList\>)(?<End>.*)",
RegexOptions.IgnoreCase |
RegexOptions.Singleline |
RegexOptions.IgnorePatternWhitespace);

and a sample of the string is:
-----------------------
Message-ID: <00****************************@csfb.csgroup.com >
From: "Test" <te**@test.com>
To: <>
Subject: This is a test subject
Date: Thu, 2 Sep 2004 16:58:12 +0100
MIME-Version: 1.0
Content-Type: multipart/mixed;
boundary="----=_NextPart_000_0005_01C4910E.083D9600"
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2800.1409
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1409

This is a multi-part message in MIME format.

------=_NextPart_000_0005_01C4910E.083D9600
Content-Type: multipart/alternative;
boundary="----=_NextPart_001_0006_01C4910E.083D9600"
------=_NextPart_001_0006_01C4910E.083D9600
Content-Type: text/plain;
charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

This is some body text.
-mark.
------=_NextPart_001_0006_01C4910E.083D9600
Content-Type: text/html;
charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=3DContent-Type content=3D"text/html; =
charset=3Diso-8859-1">
<META content=3D"MSHTML 6.00.2800.1458" name=3DGENERATOR>
<STYLE></STYLE>
</HEAD>
<BODY bgColor=3D#ffffff>
<DIV><FONT face=3DArial size=3D2>This is some body text.</FONT></DIV>
<DIV><FONT face=3DArial size=3D2></FONT> </DIV>
<DIV><FONT face=3DArial size=3D2>***</FONT></DIV></BODY></HTML>

------=_NextPart_000_0005_01C4910E.083D9600
Content-Type: text/xml;
name="DO_NOT_DELETE_EMAIL_ATTACHMENT.XML"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
filename="DO_NOT_DELETE_EMAIL_ATTACHMENT.XML"

<?xml version="1.0" encoding="UTF-8"?>
<distributionList>
</distributionList>
------=_NextPart_000_0005_01C4910E.083D9600
Content-Type: application/pdf;
name="Reader.pdf"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
filename="Reader.pdf"

JVBERi0xLjUNJeLjz9MNCjkxOTUgMCBvYmo8PC9IWzQzMzk2IDM5MzJdL0xpbmVhcml6ZWQgMS9F
IDEyMjMzNi9MIDE1NTUzMDcvTiAxNzkvTyA5MTk5L1QgMTM3MTM2Mz4+DWVuZG9iag0gICAgICAg
IA14cmVmDTkxOTUgMzYNMDAwMDAwMDAxNiAwMDAwMCBuDQowMDAwMDQ3Njc5IDAwMDAwIG4NCjAw
MDAwNDMzOTYgMDAwMDAgbg0KMDAwMDA0NzkzNSAwMDAwMCBuDQowMDAwMDQ3OTk5IDAwMDAwIG4N
CjAwMDAwNDgyNzYgMDAwMDAgbg0KMDAwMDA0ODMyNyAwMDAwMCBuDQowMDAwMDQ4NjMwIDAwMDAw
IG4NCjAwMDAwNTM0ODAgMDAwMDAgbg0KMDAwMDA1MzUxNiAwMDAwMCBuDQowMDAwMDUzOTUyIDAw
.......
------------------------

I've cut the string short, but that is the gist of it. If I run the
regex against the sample above it all works fine, but when the string
is really large (with the rest of the PDF in it) the match hangs:

Match mXMLPersonalisation = rXMLPart.Match(data);

Could anyone suggest a better way to do this? I need to get the first
part and the last part and join them, thus removing the XML part.
I also need to work on the XML to create the new messages.

i.e.

string sStartPartOfEmailMessage =
mXMLPersonalisation.Groups["Start"].ToString();
string sXMLPartOfMessage =
mXMLPersonalisation.Groups["Middle"].ToString();
string sEndPartOfEmailMessage =
mXMLPersonalisation.Groups["End"].ToString();

SendXMLEmail(sStartPartOfEmailMessage, sXMLPartOfMessage,
sEndPartOfEmailMessage);

Any help would be much appreciated.

-mark.

Nov 16 '05 #1

Niki Estner

"Mark" <ma*********@csfb.com> wrote in
news:9a**************************@posting.google.com...

Regex rXMLPart = new Regex(
@"(?<Start>.*)(?<Middle>Content-Type:[^.*?]text\/xml.*?finaldistributeinformation.*?\<\/distributionList\>)(?<End>.*)",
RegexOptions.IgnoreCase |
RegexOptions.Singleline |
RegexOptions.IgnorePatternWhitespace);

I haven't done any performance tests with that regex, but I'm quite sure it
will take years if it can *not* find a match on a long string, because the
leading .* gets backtracked over again and again before the engine gives up.
Here are a few suggestions:

- Add start/end anchors like these:
@"^(?<Start>.*)(?<Middle>Content-Type:[^.*?]text\/xml.*?finaldistributeinformation.*?\<\/distributionList\>)(?<End>.*)$"
That way the .* expression at the beginning doesn't have to be tried from
every starting point in the string.
- Couldn't you use Regex.Replace on a pattern like this:
@"(?<Middle>Content-Type:[^.*?]text\/xml.*?finaldistributeinformation.*?\<\/distributionList\>)"
The way regexes work, this should be a lot faster. If you need complex
processing on the string that can't be done with capturing parentheses, you
could use a MatchEvaluator.
- Finally:
@"Content-Type:[^.*?]text\/xml"
Are you sure about this character class? As written, it matches exactly one
character that is not '.', '*' or '?'. I'd have expected something like
"\s*" instead of "[^.*?]".
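
A rough sketch of the second suggestion, assuming "\s*" is what the "[^.*?]" class was meant to be and that the message contains a single XML part: match only the XML part, then take the text before and after it from the match position instead of capturing it with .* groups. The names XmlPartSplitter, TrySplit and RemoveXmlPart are made up for illustration, and IgnorePatternWhitespace is left out because the pattern contains no layout whitespace.

```csharp
using System;
using System.Text.RegularExpressions;

public static class XmlPartSplitter
{
    // Matches only the XML part. The \s* is an assumption about what the
    // original [^.*?] character class was intended to do.
    private static readonly Regex XmlPart = new Regex(
        @"Content-Type:\s*text/xml.*?finaldistributeinformation.*?</distributionList>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Splits data into the text before the XML part, the XML part itself,
    // and the text after it. Returns false (start = data) if nothing matches.
    public static bool TrySplit(string data, out string start,
                                out string middle, out string end)
    {
        Match m = XmlPart.Match(data);
        if (!m.Success)
        {
            start = data;
            middle = string.Empty;
            end = string.Empty;
            return false;
        }

        start = data.Substring(0, m.Index);        // everything before the match
        middle = m.Value;                          // the XML part
        end = data.Substring(m.Index + m.Length);  // everything after the match
        return true;
    }

    // Removes the first XML part and passes it to a callback through a
    // MatchEvaluator, for processing that capture groups cannot express.
    public static string RemoveXmlPart(string data, Action<string> onXml)
    {
        return XmlPart.Replace(
            data,
            m => { onXml(m.Value); return string.Empty; },
            1);
    }
}
```

With this, sStartPartOfEmailMessage, sXMLPartOfMessage and sEndPartOfEmailMessage would come from the three out parameters of TrySplit. As with the original pattern, the MIME boundary line in front of the XML part stays in the start text.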

Niki

Nov 16 '05 #2
