Advanced RegEx (pattern clustering)

Hi,

I'm just wrapping my head around regex and am pretty sure it can do the
task at hand - but it's too complex for my brain to process -- so am
throwing it out there for you experts to comment on. I am posing two
questions. In the interests of space and focus, I'll post a separate
thread for the other use case (clustering).

Use Case 1:
Filenames contain a TrackNumber (or not).

Examples:
01 - Calexico - Sonic Wind (instrumental mix).mp3
Gustav_Mahler-Symphony#10-Slatkin-St_Louis-1-Adagio.mp3
Carl Orff - Carmina Burana - 08 - Uf dem anger- Chramer, gip .mp3
01-linkin_park_-_foreword-mp3.mp3
[03] (Wish I Could Fly Like) Superman.mp3

Other examples might be: (XX), XX-, -XX-, - XX - , - XX,-XX
Where XX is a one or two digit number.

Specific examples of things that should not be captured:
Jethro Tull - 1999 - Live At House Of Blues - 13 - Hunting Girl.mp3

The 1999 is not a track number, but the 13 is. A rule that the number
should be 1 or 2 digits should catch this one.

Prince - Northrop - 06-13-2000 - 33 - Kiss.mp3
The date should not be captured, but the 33 should.

UB40 - 08 - Sing Our Own Song.mp3
The 40 shouldn't be captured, but the 08 should.

Blink 182 - Take Off Your Pants And Jacket - 06 - The Rock Show.mp3
The 182 should not be captured, but the 06 should.

One more case:
08_Smokie_Living Near The Edge.mp3

Phew...sorry for the length of the post --- can one put together a
regex to tackle this problem?

If so --- I will be both amazed and grateful for your suggestions.

Thanks.

P.S. Part 2 of this will deal with clustering...

May 10 '06 #1
Well, you're starting out by making the most common mistake that people make
who use regular expressions. Instead of giving us a set of rules, you give
us a bunch of examples. The problem with this is that the examples only
*hint* at the underlying rules, and do not spell them out. One could derive
several different sets of rules from the examples you've given.

In case you don't understand, I'll give you an example. See if you can tweak
the example into the exact rules for your regular expression:

1. The string or set of strings will (will not?) consist entirely of file
names.
2. A "Track Number" (not?) always consists of exactly 2 digits.
3. These 2 digits may appear anywhere in the file name, except for the
extension.
4. These 2 digits will (not?) always be delimited by punctuation marks.
5. If at the beginning or end of the file name, only 1 (possibly more than
1?) mark is used.
6. The set of possible punctuation marks consists of: [], -, _, ()
7. The punctuation marks will always immediately (no spaces)
precede and/or follow the "Track Number" with one exception.
8. Hyphens will always have a single (or more?) space between the hyphen and
the "Track Number" and between the hyphen and the rest of the file name.
9. There will never be any other substrings in the strings that follow
these rules.

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Numbskull

Hard work is a medication for which
there is no placebo.

May 10 '06 #2
Good point. In fact, writing the rules helps really clarify the
problem. Here goes:

1. The set of strings will consist entirely of filenames.
2. A <Track Number> will consist of 1 *OR* 2 digits in the range of 1
to 36.
3. A <Track Number> with a value of less than 10 may be preceded by a
zero.
4. <Track Number> cannot be guaranteed to be the only digits in the
string.
5. <Track Number> will be preceded by one of the following: <SPACE>, {,
[, <, (, _, -
6. The exception to #5 is if <Track Number> is at the start of the
string.
7. If <Track Number> is preceded by an opening punctuation character: (
< { [, then <Track Number> will be followed by the corresponding
closing punctuation character.
8. If <Track Number> is not preceded by an opening punctuation character,
it will be followed by either: <SPACE>, _, - or an opening punctuation
character (for the next field in the string).
9. There may be additional spaces before and after the delimiters
specified in 8, before but not after the opening punctuation delimiters,
and after but not before the closing punctuation characters.
10. There will never be any other substrings in the strings that
follow these rules.

Wow - that seems to really specify the problem. I'm feeling terrific
about it. Except for one tiny, teeny, thing.
I'm still STUCK!!!! h-e-l-p. Eternal thanks to someone who can
translate 1-10 into regex or otherwise.

Thanks.

s.

May 10 '06 #3
I'm glad I was able to help you with your analysis. Problem-solving is a
really important skill to have as a programmer, and the ability to spell out
business rules is the most important key to writing good code. As you can
see, this involves a process of breaking down the requirements into smaller
and smaller bites, until you have atomic business rules.

I was able to construct a Regular Expression based upon your business rules.
However, there is a problem, and I'm not sure it can be solved. First,
here's the regular expression:

(?m)(?<=[\{\(\[\<_]|\-\s|^)\d{1,2}(?=[\}\)\_\>\]]|\s\-|$)

In English, this means:

1. Caret and dollar match at the start and end of each line (multiline mode).
2. A match is 1 or 2 digits.
3. The digits must be preceded by one of the following:
a. One of the following characters: { [ ( _ <
b. A hyphen followed by a space.
c. Be at the beginning of the line.
4. The digits must be followed by one of the following:
a. One of the following characters: } ] ) _ >
b. A space followed by a hyphen
c. Be at the end of the line.
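
For anyone who wants to try these rules in Python, here is a minimal sketch. It is an adaptation, not the expression above verbatim: Python's re module only accepts fixed-width look-behinds, so the "preceded by" alternatives are written as separate assertions. The names TRACK_STRICT and find_tracks_strict are made up for illustration.

```python
import re

# Same rules as the list above, adapted for Python's fixed-width look-behinds.
TRACK_STRICT = re.compile(
    r"(?:(?<=[{(\[<_])|(?<=-\s)|^)"  # opening char, hyphen + space, or line start
    r"\d{1,2}"                        # one or two digits
    r"(?=[})_>\]]|\s-|$)",            # closing char, space + hyphen, or line end
    re.MULTILINE,
)

def find_tracks_strict(name):
    """Return every candidate track number found in one file name."""
    return TRACK_STRICT.findall(name)

print(find_tracks_strict("UB40 - 08 - Sing Our Own Song.mp3"))   # ['08']
print(find_tracks_strict("01-linkin_park_-_foreword-mp3.mp3"))   # []
```

The second call returns nothing because the leading 01 is followed by a hyphen with no space in between.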

Here's the problem with it. Consider these 3 examples you included:

Prince - Northrop - 06-13-2000 - 33 - Kiss.mp3
01-linkin_park_-_foreword-mp3.mp3
Gustav_Mahler-Symphony#10-Slatkin-St_Louis-1-Adagio.mp3

The problem is, in case you can't see it, what to do about digits that are
preceded or followed by a hyphen *without* a space? If you allow it, you
pick up "-13-" in the date. If you disallow it, you don't pick up the "01-"
in the second example, or the "-1-" in the third example.
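
The trade-off can be reproduced with a short Python sketch in which the space next to a hyphen is optional. This is an adaptation under the assumption that Python's re module is used (its look-behinds must be fixed-width, so each alternative is a separate assertion); LOOSE is a hypothetical name.

```python
import re

# Variant that accepts a hyphen with or without a space next to the digits.
LOOSE = re.compile(
    r"(?:(?<=[{(\[<_])|(?<=-\s)|(?<=-)|^)"  # opening char, hyphen (+ space), or line start
    r"\d{1,2}"                               # one or two digits
    r"(?=[})_>\]]|\s?-|$)",                  # closing char, (space +) hyphen, or line end
    re.MULTILINE,
)

print(LOOSE.findall("01-linkin_park_-_foreword-mp3.mp3"))
# ['01']
print(LOOSE.findall("Gustav_Mahler-Symphony#10-Slatkin-St_Louis-1-Adagio.mp3"))
# ['1']
print(LOOSE.findall("Prince - Northrop - 06-13-2000 - 33 - Kiss.mp3"))
# ['06', '13', '33']
```

The unspaced track numbers are now found, but the month and day of the date come along with the 33.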

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Numbskull

May 10 '06 #4
Since I can't think of any common strings that would have this effect
OTHER than date formats, the simple approach would be to eliminate date
formats before a final scan using the regex above.

In the simple case, a date format in this context will, I think, always
have the following rule:

1. one or two digits (where the first digit may be 0) preceded by a '-'
and then at least 1 or more digits
AND/OR
2. one or two digits (where the first digit may be 0) preceded by a '-'
and then at least 1 or more digits

This should capture the middle digits and then expand to ignore the
string of digits appropriately.

So:
a) Do you think this would work?
b) Is it a preliminary regex or can it be pre-pended to the string
above?
c) What does it look like?
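
One possible shape for such a preliminary pass, as a rough Python sketch. The date layout is an assumption (1 or 2 digits, hyphen, 1 or 2 digits, hyphen, then a 2 or 4 digit year), and DATE and strip_dates are hypothetical names.

```python
import re

# Assumed date shape: d-d-yy or dd-dd-yyyy, not glued to other digits.
DATE = re.compile(r"(?<!\d)\d{1,2}-\d{1,2}-(?:\d{4}|\d{2})(?!\d)")

def strip_dates(name):
    """Delete date-like runs so a later track-number scan cannot see them."""
    return DATE.sub("", name)

cleaned = strip_dates("Prince - Northrop - 06-13-2000 - 33 - Kiss.mp3")
print(cleaned)  # the date is gone, the 33 is still there
```

Names without a date-like run, such as 01-linkin_park_-_foreword-mp3.mp3, pass through unchanged.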

BTW - Your thought process is pretty good for a "Professional
Numbskull" :).

s.

May 10 '06 #5
Here's a mod that may work for you:

(?m)(?!<\d+\s?\-\s?)(?<=[\{\(\[\<_]|\-\s?|^)\d{1,2}(?=[\}\)\_\>\]]|\s?\-|$)(?!\s?\-\s?\d+)

This is identical to the first, with a couple of changes and additions.
First, the spaces with the hyphens are now optional (\s? means 0 or 1
space). Second, I added a negative look-behind to the beginning, and a
negative look-ahead at the end. The negative look-behind states that the
match cannot be preceded by 1 or more digits followed by 0 or 1 space and a
hyphen followed by 0 or 1 space. The negative look-ahead states that the
match cannot be followed by 0 or 1 space followed by a hyphen followed by 0
or 1 space followed by 1 or more digits.
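
A Python sketch of this version, with two caveats. First, Python's re module needs fixed-width look-behinds, so the "preceded by" alternatives are split into separate assertions. Second, the leading group is left out: as typed, (?!< parses as a negative look-ahead for a literal "<" (a look-behind would be spelled (?<!), and a real look-behind of that shape would also reject the 08 in "UB40 - 08", since it follows "40 - ". TRACK and EXAMPLES are hypothetical names.

```python
import re

TRACK = re.compile(
    r"(?:(?<=[{(\[<_])|(?<=-\s)|(?<=-)|^)"  # opening char, hyphen (+ space), or line start
    r"\d{1,2}"                               # one or two digits
    r"(?=[})_>\]]|\s?-|$)"                   # closing char, (space +) hyphen, or line end
    r"(?!\s?-\s?\d+)",                       # reject when a hyphen and more digits follow
    re.MULTILINE,
)

EXAMPLES = [
    "01-linkin_park_-_foreword-mp3.mp3",
    "Gustav_Mahler-Symphony#10-Slatkin-St_Louis-1-Adagio.mp3",
    "Prince - Northrop - 06-13-2000 - 33 - Kiss.mp3",
    "UB40 - 08 - Sing Our Own Song.mp3",
]

for name in EXAMPLES:
    print(TRACK.findall(name), name)
```

Under these assumptions the four names yield ['01'], ['1'], ['33'] and ['08'] respectively.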

Of course, you realize that there are no hard-and-fast rules for this sort
of thing. Anyone can give any name to an mp3 file. But it works for all the
examples you gave.
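
Because a file can be named anything, it may help to apply rule 2 from earlier in the thread (track numbers run from 1 to 36) as a filter on whatever the expression returns. A small Python sketch, where pick_track is a hypothetical helper and the candidates are assumed to be the digit strings matched by the expression:

```python
def pick_track(candidates, low=1, high=36):
    """Return the first candidate whose value lies in [low, high], else None."""
    for text in candidates:
        value = int(text)  # "08" becomes 8, so a leading zero is harmless
        if low <= value <= high:
            return value
    return None

print(pick_track(["00", "42", "08"]))  # 8
print(pick_track([]))                  # None
```

Taking the first in-range candidate is a design choice, not something the rules require; a name with two plausible numbers would need a tie-break of its own.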

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Numbskull

May 10 '06 #6
