C# and regex issue

Hi all,

I am trying to use regular expressions to parse MP3 file names into
three different groups (artist, title and remix). I currently have
three ways of naming an MP3 file:

Artist - Title [Remix]
Artist - Title (Remix)
Artist - Title

I have approached the problem the following way.

First I check whether the regex
(?<artist>.*?) - (?<title>.*?) \[(?<remix>.*?)\] matches. If not, I move
on to see if (?<artist>.*?) - (?<title>.*?) \((?<remix>.*?)\) matches. If
not, I move on to see if (?<artist>.*?) - (?<title>.*?) matches;
however, I run into two problems.

1. The last regex does not work.
2. I have to execute these regular expressions in the above order for
the result to be correct. If I executed a working version of the last
regex first, it would match every time.

So my two questions are:

1. Is there a better way to do this? Do I have to execute the regular
expressions in order for this to work? It could be problematic if I
introduce more naming conventions.
2. How do I get the last regular expression to work?

Any help is appreciated.

Thanks

May 21 '07 #1
7 Replies


* Nightcrawler wrote, On 21-5-2007 5:56:
> 1. The last regex does not work.

The last regex has nothing to force it to go beyond the first captured
letter of the title, and because of the *, it doesn't even have to match
that. The reluctant (lazy) modifier on the * tells the engine to stop as
soon as possible. Either make the title greedy:

(?<artist>.*?) - (?<title>.*)

or, better yet, anchor the pattern so it has to capture the whole name:

^(?<artist>.*?) - (?<title>.*?)$

Either change fixes your problem.
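
To see the difference between the three variants in practice, here is a minimal C# sketch. The class and method names (LazyTitleDemo, TitleOf) and the sample file name are only illustrative; the patterns are the ones discussed above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyTitleDemo
{
    // Returns whatever the "title" group captured ("" when the group or the match is empty).
    public static string TitleOf(string pattern, string input)
    {
        return Regex.Match(input, pattern).Groups["title"].Value;
    }

    public static void Main()
    {
        const string name = "Cool Touch - Gravity";

        // Unanchored and lazy: the title group is satisfied with zero characters.
        Console.WriteLine("[" + TitleOf(@"(?<artist>.*?) - (?<title>.*?)", name) + "]");   // []

        // Greedy title: runs to the end of the input.
        Console.WriteLine("[" + TitleOf(@"(?<artist>.*?) - (?<title>.*)", name) + "]");    // [Gravity]

        // Anchored: the lazy title has to keep expanding until $ can match.
        Console.WriteLine("[" + TitleOf(@"^(?<artist>.*?) - (?<title>.*?)$", name) + "]"); // [Gravity]
    }
}
```

The anchored form is probably the safer default, since it also rejects input with unexpected text before or after the name.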

> 2. I have to execute these regular expressions in the above order for
> it to be correct. If I would execute a working version of the last
> regex it would match every time.
>
> So my two questions are:
>
> 1. Is there a better way to do this? Do I have to execute the regular
> expressions in order for this to work? It could be problematic if I
> introduce more naming conventions.

My guess is that you'll have to use a predetermined order in which to
execute your search. Otherwise there is no way for any engine to know
which of the matching variants to use. Alternatively, you could be more
precise as to which characters each captured group can contain. So
instead of .*? you could write [a-z0-9'.,]*? which would make it easier
to write patterns that don't actually overlap.
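
A rough sketch of the "predetermined order" idea in C#, assuming the three naming conventions from the question and anchored versions of the original patterns. The class name OrderedPatterns and the Parse helper are made up for this example; adding a new convention would mean adding one entry to the array, in the right position.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OrderedPatterns
{
    // Most specific conventions first; the catch-all "Artist - Title" form goes last.
    private static readonly Regex[] Patterns =
    {
        new Regex(@"^(?<artist>.*?) - (?<title>.*?) \[(?<remix>.*?)\]$"),
        new Regex(@"^(?<artist>.*?) - (?<title>.*?) \((?<remix>.*?)\)$"),
        new Regex(@"^(?<artist>.*?) - (?<title>.*?)$")
    };

    // Returns the first successful match, or Match.Empty when no pattern fits.
    public static Match Parse(string name)
    {
        foreach (Regex pattern in Patterns)
        {
            Match m = pattern.Match(name);
            if (m.Success) return m;
        }
        return Match.Empty;
    }

    public static void Main()
    {
        Match m = Parse("Air - Cherry Blossom Girl (Because You Blossom) [DJ AM Mix]");
        Console.WriteLine(m.Groups["artist"].Value); // Air
        Console.WriteLine(m.Groups["title"].Value);  // Cherry Blossom Girl (Because You Blossom)
        Console.WriteLine(m.Groups["remix"].Value);  // DJ AM Mix
    }
}
```

When the third pattern is the one that matches, Groups["remix"] simply comes back empty, so the caller does not need to know which convention was used.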

Jesse
May 21 '07 #2

(?<artist>\w+)\s+-\s+(?<title>\w+)(?:\s+[\(\[](?<remix>\w+)[)\]])?

Explanation:
There are 4 distinct parts to this:

(?<artist>\w+) Find a string of word characters. Captures to group "artist"

\s+-\s+ Followed by 1 or more spaces, followed by a hyphen, followed by 1
or more spaces

(?<title>\w+) Find a string of word characters. Captures to group "title"

(?:\s+[\(\[](?<remix>\w+)[)\]])?

Non-capturing group, of which there may be 0 or 1. Begins with 1 or more
spaces, followed by 1 of the characters '(' or '['. This is followed by a
named capturing group called "remix" which is defined as 1 or more word
characters. This is followed by 1 of the characters ')' or ']'.

This assumes that there will always be an artist and a title, but that remix
may be omitted.
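
For anyone who wants to try the pattern, here is a small C# harness. The class name FirstPatternDemo is only illustrative. One thing to keep in mind: \w+ matches a single run of word characters, so names containing spaces or punctuation are only partially captured, as the output comments show.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FirstPatternDemo
{
    public static readonly Regex Pattern = new Regex(
        @"(?<artist>\w+)\s+-\s+(?<title>\w+)(?:\s+[\(\[](?<remix>\w+)[)\]])?");

    public static void Main()
    {
        Match m = Pattern.Match("JP Conley - Karma Moods [Soul Mix]");

        // Only the single words on either side of " - " are captured.
        Console.WriteLine(m.Groups["artist"].Value);  // Conley
        Console.WriteLine(m.Groups["title"].Value);   // Karma
        Console.WriteLine(m.Groups["remix"].Success); // False
    }
}
```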

--
HTH,

Kevin Spencer
Microsoft MVP

Printing Components, Email Components,
FTP Client Classes, Enhanced Data Controls, much more.
DSI PrintManager, Miradyne Component Libraries:
http://www.miradyne.net

May 21 '07 #3

Thank you. I tried your regex on a sample of 10 titles and it didn't
really work. Here are my ten samples that I used:

From P-60 - Sinking With The Fall
JP Conley - Karma Moods [Soul Mix]
Soul Beats - Wherever You Go... [Love Mix]
Thievery Corporation - Doors Of Perception
Thievery Corporation - Holographic Universe
Ananda Project - Universal Love [Jay-J's Shifted Up Mix]
Collective Sound Members - Switch
Cool Touch - Gravity
Dennis Ferrer - Church Lady Part 2 [Bryan Cox Remix]
Air - Cherry Blossom Girl (Because You Blossom) [DJ AM Mix]

After I ran the regular expression on the titles above, here is what
the groups caught:

Artist
-----------------------------
60
Conley
Beats
Corporation
Corporation
Project
Members
Touch
Ferrer
Air

Title
-----------------------------
Sinking
Karma
Wherever
Doors
Holographic
Universal
Switch
Gravity
Church
Cherry

Remix
-----------------------------
Nothing was captured here

Please let me know what is wrong.

Thanks

May 21 '07 #4

* Nightcrawler wrote, On 21-5-2007 18:07:
> Thank you. I tried your regex on a sample of 10 titles and it didn't
> really work. Here are my ten samples that I used:
> <snip>

The regex only looks for titles and artists that are made up of \w+,
which means a single word. As you can see, there are multiple words here. A
better solution could be to substitute '\w+' with '\w+( \w+)*', which
matches one word followed by any number of other words.

The same goes for the title. There are also some characters in your titles,
like the ' and the ., which aren't covered by the \w shortcut. A better
solution might be \S here, which matches any non-whitespace character.

(?<artist>\S+(\s+\S+)*)\s+-\s+(?<title>\w+)(?:\s+[\(\[](?<remix>\w+)[)\]])?

This does not actually do the trick yet, as it captures too much in the
artist field. After playing around with reluctant modifiers and a few
other small modifications I came up with this:

^(?<artist>\S+(\s+\S+)*?)\s+?-\s+?(?<title>\S+(\s+\S+)*?)(?:\s+[\(\[](?<remix>\S+(\s+[^)\]]+)*?)[)\]])?\r?$
MultiLine ON
ExplicitCapture ON

which works on all the examples you've provided which didn't work
before. But after expanding on the testcases I found a few other things
that didn't work.

This is what I finally came up with:
^(?<artist>\S+?([ \t]+\S+)*?)[ \t]+-[ \t]+(?<title>\S+([ \t]+\S+)*?)([ \t]+?(\((?<remix>[^\)]+(\s+[^\)]+)*?)\)|\[(?<remix>[^\]]+(\s+[^\]]+)*?)\]))?\r?$
MultiLine ON
ExplicitCapture ON

Even though it works, I would recommend not to use it as such. Please
try to come up with a better way, unless this is a one time thing. The
regex above is hardly readable and almost unmaintainable.
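
On the readability point: one option in .NET is RegexOptions.IgnorePatternWhitespace, which lets the pattern be spread over several lines with # comments. The sketch below is a deliberately simpler pattern than the one above, not an equivalent of it (for instance, it assumes one name per input string and does not handle a trailing \r). The names ReadablePattern and TrackName are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReadablePattern
{
    public static readonly Regex TrackName = new Regex(@"
        ^
        (?<artist> .+? )                  # artist: shortest text before the separator
        \s+ - \s+                         # hyphen with whitespace on both sides
        (?<title> .+? )                   # title: shortest text that lets the rest match
        (?:                               # optional remix suffix
            \s+
            (?: \( (?<remix> [^)]+ ) \)   # round brackets
              | \[ (?<remix> [^\]]+ ) \]  # square brackets
            )
        )?
        $",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture);

    public static void Main()
    {
        Match m = TrackName.Match("Air - Cherry Blossom Girl (Because You Blossom) [DJ AM Mix]");
        Console.WriteLine(m.Groups["artist"].Value); // Air
        Console.WriteLine(m.Groups["title"].Value);  // Cherry Blossom Girl (Because You Blossom)
        Console.WriteLine(m.Groups["remix"].Value);  // DJ AM Mix
    }
}
```

Because both alternatives use the same group name, the caller reads Groups["remix"] regardless of which bracket style matched.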

Jesse
May 21 '07 #5

My apologies, Nightcrawler.

Revised Standard Version:

(?<artist>.+)(?=(\s+-\s+))\1(?:(?<title>.+)??(?<remix>(?:\([^\)]+\)|\[[^\]]+\]))|(?<title>.+))

Part of the problem with my first attempt was that it didn't account for
spaces in the Artist or Title. Another was that I was not aware of the
rules, which include the possibility that there might be hyphens (or other
characters) in the Artist, Title, or Remix, and finally, that Title might
contain parenthesized groups of characters, just like Remix. Your examples
were very helpful!

A short explanation of the above:

(?<artist>.+)(?=(\s+-\s+))\1

This indicates that "artist" should be any characters that MUST be followed
by 1 or more spaces, a hyphen, and 1 or more spaces. This means that the
test will fail if the Artist contains a hyphen which has 1 or more spaces on
both sides, but that a hyphen which does NOT have a space on either the left
or right side is okay. The assertion is that the hyphen between "artist" and
"title" will have spaces on BOTH sides.

I put the "space-hyphen-space" sequence into an unnamed capturing group
because the lookahead assertion only checks for it and does NOT consume it;
the sequence still has to be consumed afterwards so that the rest of the
line can match. Thus, the first part ends with the backreference "\1", which
matches (and consumes) that sequence.

(?:(?<title>.+)??(?<remix>(?:\([^\)]+\)|\[[^\]]+\]))|(?<title>.+))

This was the tricky part, since the "title" may have parenthesized character
groups in it, which look just like the "remix," further complicated by the
fact that "remix" may be absent. Note that this is not perfect, and I will
explain why in a bit.

It puts 2 possible combinations into an OR-ing non-capturing group. The
first possible combination is:

(?<title>.+)??(?<remix>(?:\([^\)]+\)|\[[^\]]+\]))

This uses a double-question-mark quantifier, which makes the first ("title")
part optional and matches it lazily. It is a rare construct, but necessary in
this case: we assume that the title WILL be there, but the lazy part
leaves room for the last part if there are any parenthesized groups of
characters in the "title." This is followed by the "remix" group, which is
defined as either a '(' followed by 1 or more non-')' characters, followed
by a ')', or a '[' followed by 1 or more non-']' characters, followed by a
']'. This ensures that if the remix is present, it will be captured.
However, if the remix is NOT present, we need an alternative:

(?<title>.+)

Captures the rest of the string, if the first alternative fails.

Now, as to why these rules are not perfect, let's have a look at one of the
items in your list:

Air - Cherry Blossom Girl (Because You Blossom) [DJ AM Mix]

Obviously, the [DJ AM Mix] is the Remix. Why obviously? Well, it is the last
parenthesized expression in the string. But what if you left the Remix off?

Air - Cherry Blossom Girl (Because You Blossom)

NOW, "Cherry Blossom Girl" becomes the title, and "(Because You Blossom)"
becomes the Remix. Why? Because it is the last parenthesized expression in
the string. Now, even a human being could not tell the difference, because
you are using a rule that states that the last parenthesized expression
in the string is the Remix. In other words, your rules for "remix" overlap
your rules for "title." The only solution to this would be to further
qualify the rules. That is, you would have to either restrict the rules for
"title" to a certain pair of brackets, or restrict the rules for "remix" to
a certain pair of brackets.
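
The overlap is easy to reproduce with a few lines of C#. This is only a test harness around the pattern given at the top of this post; the class name RevisedPatternDemo is illustrative. Two details worth noting: the "remix" group as written includes the surrounding brackets, and the "title" group keeps the space in front of the remix, so the harness trims it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RevisedPatternDemo
{
    public static readonly Regex Revised = new Regex(
        @"(?<artist>.+)(?=(\s+-\s+))\1(?:(?<title>.+)??(?<remix>(?:\([^\)]+\)|\[[^\]]+\]))|(?<title>.+))");

    private static void Show(string name)
    {
        Match m = Revised.Match(name);
        Console.WriteLine("{0} | {1} | {2}",
            m.Groups["artist"].Value,
            m.Groups["title"].Value.Trim(), // drop the space left in front of the remix
            m.Groups["remix"].Value);
    }

    public static void Main()
    {
        // Air | Cherry Blossom Girl (Because You Blossom) | [DJ AM Mix]
        Show("Air - Cherry Blossom Girl (Because You Blossom) [DJ AM Mix]");

        // Air | Cherry Blossom Girl | (Because You Blossom)
        Show("Air - Cherry Blossom Girl (Because You Blossom)");
    }
}
```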

Thanks for the challenge!

--
HTH,

Kevin Spencer
Microsoft MVP


May 22 '07 #6

Kevin,

Wow! Thanks!

Do you think it would make more sense to ask the user to define how
they name their files in simple terms and then build the regex in code
instead? I can see how it can get complicated trying to guess.

Say someone told me they use %artist% - %title% [%remix%], how would
you build a regex for that case for instance?

On another note, do you accept freelance work?

Please let me know.

Thanks

May 23 '07 #7

> Do you think it would make more sense to ask the user to define how
> they name their files in simple terms and then build the regex in code
> instead? I can see how it can get complicated trying to guess.

Well, regular expressions are "simply" reflections of rules that define
patterns. In order to create an effective regular expression, you need to
define the rules that identify the patterns. In your case, the rules were
fairly simple, but a little too loose:

Artist pattern:
Any string that ends with a " - " character sequence, followed by a Title
pattern, and optionally followed by a Remix pattern.

Title pattern:
Any string that is preceded by an Artist pattern, optionally followed by a
Remix pattern.

Remix pattern:
Any string preceded by an Artist and a Title pattern, enclosed in round or
square brackets.

You'll note that ALL of the pattern rules have to be met for a match. This
is because each of these patterns is a *part* of a match. A match is not a
match unless all parts match. That is why the Artist pattern includes the
assertion that it is followed by a Title pattern and an optional Remix
pattern, and so on.

The "looseness" problem occurs because of the number of "Any string" rules
in the rules. This allows, for example, a Title to end with a string
enclosed in round or square brackets, which, combined with the absence of a
Remix pattern (allowed), causes the parenthesized end of the Title to be
identified as the absent Remix.

Because you're working with media lists, supplied by end users, you don't
want the rules to be so complex that the users have trouble following them.
This invites error by the users, and you'll probably have plenty of that
anyway! So, what you need is to make the rules as loose as possible, while
still enabling them to be parsed by a regular expression.

As I said, one idea would be to require a difference in the brackets used in
the Title and the Remix. Another would be to require a separator sequence,
such as the " - " character sequence, between the Title and Remix. This
would allow the user to continue to use any character sequence in all of
them.

There are other possibilities as well. If you keep the goal in mind (rules
as loose as possible while retaining clarity for parsing), you can come up
with your own if you like. For example, thinking about it a bit more, the
rule that the Remix may be omitted, combined with the "Any string" rule for
the Title could also be overcome by a rule that the Remix is required, and
if the user doesn't have Remix information, he/she could simply add a pair
of empty brackets to the end:

Air - Cherry Blossom Girl (Because You Blossom) [ ]
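
With the Remix required, a user-supplied template such as the %artist% - %title% [%remix%] one asked about earlier in the thread can be turned into a regular expression mechanically. The following C# sketch is one possible way to do it, under the assumption that placeholders are written as %name% and that each name is a valid group name; TemplatePattern and Build are invented names, not an existing API.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TemplatePattern
{
    // Turns a template like "%artist% - %title% [%remix%]" into an anchored Regex.
    public static Regex Build(string template)
    {
        // Escape first, so literal brackets, dots and spaces lose their regex meaning.
        string escaped = Regex.Escape(template);

        // Regex.Escape leaves '%' alone, so the placeholders survive the escaping
        // and can be replaced with lazy named groups.
        string body = Regex.Replace(escaped, @"%(\w+)%",
            m => "(?<" + m.Groups[1].Value + ">.+?)");

        return new Regex("^" + body + "$");
    }

    public static void Main()
    {
        Regex regex = Build("%artist% - %title% [%remix%]");

        Match m = regex.Match("Air - Cherry Blossom Girl (Because You Blossom) [ ]");
        Console.WriteLine(m.Groups["artist"].Value);       // Air
        Console.WriteLine(m.Groups["title"].Value);        // Cherry Blossom Girl (Because You Blossom)
        Console.WriteLine(m.Groups["remix"].Value.Trim()); // (empty line: no remix given)
    }
}
```

Since every placeholder becomes .+?, a template built this way still relies on the literal text between placeholders being unambiguous, which is the same trade-off described above.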

> On another note, do you accept freelance work?

I don't, but my company might hire me out for a contract job. Let me know,
and I can send you some contact information.

--
HTH,

Kevin Spencer
Microsoft MVP


May 23 '07 #8
