Advanced RegEx (pattern clustering)

Hi,

I'm just wrapping my head around regex and am pretty sure it can do the
task at hand - but it's too complex for my brain to process -- so am
throwing it out there for you experts to comment on. I am posing two
questions. In the interests of space and focus, I'll post a separate
thread for the other use case (clustering).

Use Case 1:
Filenames contain a TrackNumber (or not).

Examples:
01 - Calexico - Sonic Wind (instrumental mix).mp3
Gustav_Mahler-Symphony#10-Slatkin-St_Louis-1-Adagio.mp3
Carl Orff - Carmina Burana - 08 - Uf dem anger- Chramer, gip .mp3
01-linkin_park_-_foreword-mp3.mp3
[03] (Wish I Could Fly Like) Superman.mp3

Other examples might be: (XX), XX-, -XX-, - XX - , - XX,-XX
Where XX is a one or two digit number.

Specific examples of things that should not be captured:
Jethro Tull - 1999 - Live At House Of Blues - 13 - Hunting Girl.mp3

The 1999 is not a track number, but the 13 is. A rule that the number
should be 2 digits should catch this one.

Prince - Northrop - 06-13-2000 - 33 - Kiss.mp3
The date should not be captured, but the 33 should.

UB40 - 08 - Sing Our Own Song.mp3
The 40 shouldn't be captured, but the 08 should.

Blink 182 - Take Off Your Pants And Jacket - 06 - The Rock Show.mp3
The 182 should not be captured, but the 06 should.

One more case:
08_Smokie_Living Near The Edge.mp3

Phew...sorry for the length of the post --- can one put together a
regex to tackle this problem?

If so --- I will be both amazed and grateful for your suggestions.

Thanks.

P.S. Part 2 of this will deal with clustering...

May 10 '06 #1
Well, you're starting out by making the most common mistake that people make
who use regular expressions. Instead of giving us a set of rules, you give
us a bunch of examples. The problem with this is that the examples only
*hint* at the underlying rules, and do not spell them out. One could derive
several different sets of rules from the examples you've given.

In case you don't understand, I'll give you an example. See if you can tweak
the example into the exact rules for your regular expression:

1. The string or set of strings will (will not?) consist entirely of file
names.
2. A "Track Number" (not?) always consists of exactly 2 digits.
3. These 2 digits may appear anywhere in the file name, except for the
extension.
4. These 2 digits will (not?) always be delimited by punctuation marks.
5. If at the beginning or end of the file name, only 1 (possibly more than
1?) mark is used.
6. The set of possible punctuation marks consists of: [], -, _, ()
7. The punctuation marks will always immediately (no spaces)
precede and/or follow the "Track Number" with one exception.
8. Hyphens will always have a single (or more?) space between the hyphen and
the "Track Number" and between the hyphen and the rest of the file name.
9. There will never be any other substrings in the strings that follow
these rules.

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Numbskull

Hard work is a medication for which
there is no placebo.

May 10 '06 #2
Good point. In fact, writing the rules really helps clarify the
problem. Here goes:

1. The set of strings will consist entirely of filenames.
2. A <Track Number> will consist of 1 *OR* 2 digits in the range of 1
to 36.
3. A <Track Number> with a value of less than 10 may be preceded by a
zero.
4. <Track Number> cannot be guaranteed to be the only digits in the
string.
5. <Track Number> will be preceded by one of the following: <SPACE>, {,
[, <, (, _, -
6. The exception to #5 is if <Track Number> is at the start of the
string.
7. If <Track Number> is preceded by an opening punctuation character: (
< { [, then <Track Number> will be followed by the corresponding
closing punctuation character.
8. If <Track Number> is not preceded by an opening punctuation character,
it will be followed by either: <SPACE>, _, - or an opening punctuation
character (for the next field in the string).
9. There may be additional spaces before and after the delimiters
specified in 8, before but not after the opening punctuation delimiters,
and after but not before the closing punctuation characters.
10. There will never be any other substrings in the strings that
follow these rules.
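
As a rough illustration of rules 2 and 3 only, the numeric part can be sketched in Python as below. This assumes the candidate digits have already been isolated from the filename; the names `TRACK_VALUE` and `is_track_number` are hypothetical, and the delimiter rules 5 to 9 are not covered.

```python
import re

# Rules 2 and 3 only: a value from 1 to 36, optionally zero-padded below 10.
# Hypothetical helper; delimiter rules 5-9 are deliberately not handled here.
TRACK_VALUE = re.compile(r"0?[1-9]|[12][0-9]|3[0-6]")

def is_track_number(digits: str) -> bool:
    """Return True if `digits` as a whole is a track number per rules 2 and 3."""
    return TRACK_VALUE.fullmatch(digits) is not None

if __name__ == "__main__":
    for candidate in ["08", "1", "36", "40", "182", "1999"]:
        print(candidate, is_track_number(candidate))
```

A check like this could be applied to whatever a delimiter-based regex returns, so that values such as 40 or 00 are rejected even when the punctuation around them looks right.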

Wow - that seems to really specify the problem. I'm feeling terrific
about it. Except for one tiny, teeny, thing.
I'm still STUCK!!!! h-e-l-p. Eternal thanks to someone who can
translate 1-10 into regex or otherwise.

Thanks.

s.

May 10 '06 #3
I'm glad I was able to help you with your analysis. Problem-solving is a
really important skill to have as a programmer, and the ability to spell out
business rules is the most important key to writing good code. As you can
see, this involves a process of breaking down the requirements into smaller
and smaller bites, until you have atomic business rules.

I was able to construct a Regular Expression based upon your business rules.
However, there is a problem, and I'm not sure it can be solved. First,
here's the regular expression:

(?m)(?<=[\{\(\[\<_]|\-\s|^)\d{1,2}(?=[\}\)\_\>\]]|\s\-|$)

In English, this means:

1. Caret and dollar match at the start and end of each line (multiline mode).
2. A match is 1 or 2 digits.
3. The digits must be preceded by one of the following:
a. One of the following characters: { [ ( _ <
b. A hyphen followed by a space.
c. The beginning of the line.
4. The digits must be followed by one of the following:
a. One of the following characters: } ] ) _ >
b. A space followed by a hyphen.
c. The end of the line.

Here's the problem with it. Consider these 3 examples you included:

Prince - Northrop - 06-13-2000 - 33 - Kiss.mp3
01-linkin_park_-_foreword-mp3.mp3
Gustav_Mahler-Symphony#10-Slatkin-St_Louis-1-Adagio.mp3

The problem is, in case you can't see it, what to do about digits that are
preceded or followed by a hyphen *without* a space? If you allow it, you
pick up "-13-" in the date. If you disallow it, you don't pick up the "01-"
in the second example, or the "-1-" in the third example.
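
To see the trade-off concretely, here is a minimal Python sketch that runs the expression above against those three file names. One assumption to flag: Python's `re` module only accepts fixed-width look-behinds, so the single look-behind is split into separate alternatives that are intended to be equivalent. `FIRST_REGEX` and `find_tracks` are illustrative names, not part of the original expression.

```python
import re

# Adaptation of the posted expression for Python's `re`, which rejects
# look-behinds whose alternatives differ in width. The look-behind is split
# into two fixed-width look-behinds plus the ^ anchor (an assumption made
# for this sketch, intended to be equivalent to the original).
FIRST_REGEX = re.compile(
    r"(?m)(?:(?<=[{(\[<_])|(?<=-\s)|^)"  # preceded by { ( [ < _, hyphen+space, or line start
    r"\d{1,2}"                           # 1 or 2 digits
    r"(?=[})_>\]]|\s-|$)"                # followed by } ) _ > ], space+hyphen, or line end
)

def find_tracks(name: str) -> list[str]:
    """Return every candidate track number found in `name`."""
    return FIRST_REGEX.findall(name)

if __name__ == "__main__":
    for name in [
        "Prince - Northrop - 06-13-2000 - 33 - Kiss.mp3",
        "01-linkin_park_-_foreword-mp3.mp3",
        "Gustav_Mahler-Symphony#10-Slatkin-St_Louis-1-Adagio.mp3",
    ]:
        print(find_tracks(name), name)
```

With the space required next to each hyphen, the date is left alone and 33 is found, but nothing is found in the other two names.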

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Numbskull

Hard work is a medication for which
there is no placebo.

May 10 '06 #4
Since I can't think of any common strings that would have this effect
OTHER than date formats, the simple approach would be to eliminate date
formats before a final scan using the regex above.

In the simple case, a date format in this context will, I think, always
follow one or both of these rules:

1. one or two digits (where the first digit may be 0) preceded by a '-'
and, before that, 1 or more digits
AND/OR
2. one or two digits (where the first digit may be 0) followed by a '-'
and then 1 or more digits

This should capture the middle digits and then expand to ignore the
string of digits appropriately.

So:
a) Do you think this would work?
b) Is it a preliminary regex or can it be pre-pended to the string
above?
c) What does it look like?
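
One way the preliminary pass could look, as a rough Python sketch rather than a tested answer: blank out anything shaped like a date first, then scan what is left. `DATE_LIKE` and `strip_dates` are hypothetical names, and treating a date as digit groups joined directly by hyphens is only a guess at what counts as a date.

```python
import re

# Hypothetical pre-pass: treat runs such as 06-13-2000 as dates.
# Assumes a date is two or more digit groups joined directly by hyphens,
# with no spaces around the hyphens.
DATE_LIKE = re.compile(r"\d{1,4}(?:-\d{1,4})+")

def strip_dates(name: str) -> str:
    """Replace date-like digit runs with a space so a later scan ignores them."""
    return DATE_LIKE.sub(" ", name)

if __name__ == "__main__":
    print(strip_dates("Prince - Northrop - 06-13-2000 - 33 - Kiss.mp3"))
    print(strip_dates("01-linkin_park_-_foreword-mp3.mp3"))
```

A name that packs two numbers together with a bare hyphen would also be blanked by this guess, so it only helps if that never happens for real track numbers.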

BTW - Your thought process is pretty good for a "Professional
Numbskull" :).

s.

May 10 '06 #5
Here's a mod that may work for you:

(?m)(?!<\d+\s?\-\s?)(?<=[\{\(\[\<_]|\-\s?|^)\d{1,2}(?=[\}\)\_\>\]]|\s?\-|$)(?!\s?\-\s?\d+)

This is identical to the first, with a couple of changes and additions.
First, the spaces with the hyphens are now optional (\s? means 0 or 1
space). Second, I added a negative look-behind to the beginning, and a
negative look-ahead at the end. The negative look-behind states that the
match cannot be preceded by 1 or more digits followed by 0 or 1 space and a
hyphen followed by 0 or 1 space. The negative look-ahead states that the
match cannot be followed by 0 or 1 space followed by a hyphen followed by 0
or 1 space followed by 1 or more digits.

Of course, you realize that there are no hard and fast rules for this sort
of thing. Anyone can give any name to an mp3 file. But it works for all the
examples you gave.
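
For anyone who wants to try it, here is a minimal Python sketch of the modified expression run against the examples from the first post. Two assumptions to flag. Python's `re` needs fixed-width look-behinds, so `\-\s?` inside the look-behind is split into two separate look-behinds. And the leading `(?!<...)` group is kept exactly as posted, which a regex engine reads as a negative look-ahead for a literal `<`, so in this sketch the date in the Prince example is rejected by the trailing look-ahead. `SECOND_REGEX` and `find_track_numbers` are illustrative names.

```python
import re

# Python adaptation of the modified expression (a sketch, not the original).
SECOND_REGEX = re.compile(
    r"(?m)(?!<\d+\s?-\s?)"                  # kept as posted; parses as a look-ahead
    r"(?:(?<=[{(\[<_])|(?<=-)|(?<=-\s)|^)"  # after { ( [ < _, a hyphen, hyphen+space, or line start
    r"\d{1,2}"                              # 1 or 2 digits
    r"(?=[})_>\]]|\s?-|$)"                  # before } ) _ > ], optional space + hyphen, or line end
    r"(?!\s?-\s?\d+)"                       # not followed by a hyphen and more digits
)

EXAMPLES = [
    "01 - Calexico - Sonic Wind (instrumental mix).mp3",
    "Gustav_Mahler-Symphony#10-Slatkin-St_Louis-1-Adagio.mp3",
    "Carl Orff - Carmina Burana - 08 - Uf dem anger- Chramer, gip .mp3",
    "01-linkin_park_-_foreword-mp3.mp3",
    "[03] (Wish I Could Fly Like) Superman.mp3",
    "Jethro Tull - 1999 - Live At House Of Blues - 13 - Hunting Girl.mp3",
    "Prince - Northrop - 06-13-2000 - 33 - Kiss.mp3",
    "UB40 - 08 - Sing Our Own Song.mp3",
    "Blink 182 - Take Off Your Pants And Jacket - 06 - The Rock Show.mp3",
    "08_Smokie_Living Near The Edge.mp3",
]

def find_track_numbers(name: str) -> list[str]:
    """Return every candidate track number found in `name`."""
    return SECOND_REGEX.findall(name)

if __name__ == "__main__":
    for name in EXAMPLES:
        print(find_track_numbers(name), name)
```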

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Numbskull

Hard work is a medication for which
there is no placebo.

May 10 '06 #6