Hi,
I'm just wrapping my head around regex and am pretty sure it can do the
task at hand - but it's too complex for my brain to process -- so am
throwing it out there for you experts to comment on. I am posing two
questions. In the interests of space and focus, I'll post a separate
thread for the other use case (clustering).
Use Case 1:
Filenames contain a TrackNumber (or not).
Examples:
01 - Calexico - Sonic Wind (instrumental mix).mp3
Gustav_Mahler-Symphony#10-Slatkin-St_Louis-1-Adagio.mp3
Carl Orff - Carmina Burana - 08 - Uf dem anger- Chramer, gip .mp3
01-linkin_park_-_foreword-mp3.mp3
[03] (Wish I Could Fly Like) Superman.mp3
Other examples might be: (XX), XX-, -XX-, - XX - , - XX,-XX
Where XX is a one or two digit number.
Specific examples of things that should not be captured:
Jethro Tull - 1999 - Live At House Of Blues - 13 - Hunting Girl.mp3
The 1999 is not a track number, but the 13 is. A rule that the number
should be 2 digits should catch this one.
Prince - Northrop - 06-13-2000 - 33 - Kiss.mp3
The date should not be captured, but the 33 should.
UB40 - 08 - Sing Our Own Song.mp3
The 40 shouldn't be captured, but the 08 should.
Blink 182 - Take Off Your Pants And Jacket - 06 - The Rock Show.mp3
The 182 should not be captured, but the 06 should.
One more case:
08_Smokie_Living Near The Edge.mp3
Phew...sorry for the length of the post --- can one put together a
regex to tackle this problem?
If so --- I will be both amazed and grateful for your suggestions.
Thanks.
P.S. Part 2 of this will deal with clustering...
Well, you're starting out by making the most common mistake that people make
who use regular expressions. Instead of giving us a set of rules, you give
us a bunch of examples. The problem with this is that the examples only
*hint* at the underlying rules, and do not spell them out. One could derive
several different sets of rules from the examples you've given.
In case you don't understand, I'll give you an example. See if you can tweak
the example into the exact rules for your regular expression:
1. The string or set of strings will (will not?) consist entirely of file
names.
2. A "Track Number" (not?) always consists of exactly 2 digits.
3. These 2 digits may appear anywhere in the file name, except for the
extension.
4. These 2 digits will (not?) always be delimited by punctuation marks.
5. If at the beginning or end of the file name, only 1 (possibly more than
1?) mark is used.
6. The set of possible punctuation marks consists of: [], -, _, ()
7. The punctuation marks will always immediately (no spaces)
precede and/or follow the "Track Number" with one exception.
8. Hyphens will always have a single (or more?) space between the hyphen and
the "Track Number" and between the hyphen and the rest of the file name.
9. There will never be any other substrings in the strings that follow
these rules.
--
HTH,
Kevin Spencer
Microsoft MVP
Professional Numbskull
Hard work is a medication for which
there is no placebo.
Good point. In fact, writing the rules helps really clarify the
problem. Here goes:
1. The set of strings will consist entirely of filenames.
2. A <Track Number> will consist of 1 *OR* 2 digits in the range of 1
to 36.
3. A <Track Number> with a value of less than 10 may be preceded by a
zero.
4. <Track Number> cannot be guaranteed to be the only digits in the
string.
5. <Track Number> will be preceded by one of the following: <SPACE>, {,
[, <, (, _, -
6. The exception to #5 is if <Track Number> is at the start of the
string.
7. If <Track Number> is preceded by an opening punctuation character: (
< { [, then <Track Number> will be followed by the corresponding
closing punctuation character.
8. If <Track Number> is not preceded by an opening punctuation character,
it will be followed by either: <SPACE>, _, - or an opening punctuation
character (for the next field in the string).
9. There may be additional spaces before and after the delimiters
specified in 8 and before but not after the opening punctuation delimiters
and after but not before the closing punctuation characters.
10. There will never be any other substrings in the strings that
follow these rules.
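Rules 2 and 3 can be checked on their own, before worrying about delimiters. Here is a minimal Python sketch, assuming the candidate digits have already been pulled out of the filename; the names `TRACK_VALUE` and `is_track_value` are made up for illustration:

```python
import re

# Rules 2 and 3: one or two digits, value 1..36, optional leading zero below 10.
TRACK_VALUE = re.compile(r"0?[1-9]|[12][0-9]|3[0-6]")


def is_track_value(token: str) -> bool:
    """Return True if token, taken by itself, could be a track number."""
    return TRACK_VALUE.fullmatch(token) is not None


for token in ["08", "8", "36", "37", "00", "182"]:
    print(token, is_track_value(token))
```

This only covers the value range; the delimiter rules (5 to 9) still have to be handled separately.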
Wow - that seems to really specify the problem. I'm feeling terrific
about it. Except for one tiny, teeny, thing.
I'm still STUCK!!!! h-e-l-p. Eternal thanks to someone who can
translate 1-10 into regex or otherwise.
Thanks.
s.
I'm glad I was able to help you with your analysis. Problem-solving is a
really important skill to have as a programmer, and the ability to spell out
business rules is the most important key to writing good code. As you can
see, this involves a process of breaking down the requirements into smaller
and smaller bites, until you have atomic business rules.
I was able to construct a Regular Expression based upon your business rules.
However, there is a problem, and I'm not sure it can be solved. First,
here's the regular expression:
(?m)(?<=[\{\(\[\<_]|\-\s|^)\d{1,2}(?=[\}\)\_\>\]]|\s\-|$)
In English, this means:
1. Caret and dollar match at the start and end of each line (multiline mode).
2. A match is 1 or 2 digits.
3. The digits must be preceded by one of the following:
a. One of the following characters: { [ ( _ <
b. A hyphen followed by a space.
c. Be at the beginning of the line.
4. The digits must be followed by one of the following:
a. One of the following characters: } ] ) _ >
b. A space followed by a hyphen
c. Be at the end of the line.
Here's the problem with it. Consider these 3 examples you included:
Prince - Northrop - 06-13-2000 - 33 - Kiss.mp3
01-linkin_park_-_foreword-mp3.mp3
Gustav_Mahler-Symphony#10-Slatkin-St_Louis-1-Adagio.mp3
The problem is, in case you can't see it, what to do about digits that are
preceded or followed by a hyphen *without* a space? If you allow it, you
pick up "-13-" in the date. If you disallow it, you don't pick up the "01-"
in the second example, or the "-1-" in the third example.
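The trade-off can be reproduced with a short Python sketch. Two things here are assumptions rather than part of the post: Python's `re` module only accepts fixed-width look-behinds, so the delimiter look-behind is split into separate alternatives, and `LOOSE` is a hypothetical variant that also accepts a hyphen with no space next to the digits.

```python
import re

# The pattern from the post, adapted for Python: the look-behind
# (?<=[...]|\-\s|^) is split because re requires fixed-width look-behinds.
STRICT = re.compile(
    r"(?:(?<=[{(\[<_])|(?<=-\s)|^)\d{1,2}(?=[})_>\]]|\s-|$)",
    re.MULTILINE,
)

# Hypothetical variant: a bare hyphen (no space) also counts as a delimiter.
LOOSE = re.compile(
    r"(?:(?<=[{(\[<_\-])|(?<=-\s)|^)\d{1,2}(?=[})_>\]\-]|\s-|$)",
    re.MULTILINE,
)

NAMES = [
    "Prince - Northrop - 06-13-2000 - 33 - Kiss.mp3",
    "01-linkin_park_-_foreword-mp3.mp3",
    "Gustav_Mahler-Symphony#10-Slatkin-St_Louis-1-Adagio.mp3",
]


def track_candidates(pattern: re.Pattern, name: str) -> list[str]:
    """Return every digit run in name that the pattern accepts."""
    return pattern.findall(name)


for name in NAMES:
    print(name)
    print("  strict:", track_candidates(STRICT, name))
    print("  loose: ", track_candidates(LOOSE, name))
```

With the strict pattern only the 33 is found and the other two names yield nothing; with the loose one the 01 and the 1 are found, but so are the 06 and 13 from the date.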
--
HTH,
Kevin Spencer
Microsoft MVP
Professional Numbskull
Hard work is a medication for which
there is no placebo.
Since I can't think of any common strings that would have this effect
OTHER than date formats, the
simple approach would be to eliminate date formats before a final scan
using the regex above.
In the simple case, a date format in this context will, I think, always
have the following rule:
1. one or two digits (where the first digit may be 0) preceded by one or
more digits and a '-'
AND/OR
2. one or two digits (where the first digit may be 0) followed by a '-'
and then one or more digits
This should capture the middle digits and then expand to ignore the
string of digits appropriately.
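As a preliminary pass, this could look roughly like the Python sketch below. The shape of a "date-like" run (groups of digits joined by bare hyphens) is my reading of the rule, and `DATE_LIKE` and `strip_date_like` are illustrative names only:

```python
import re

# Assumed shape of a date-like run: digit groups joined by bare hyphens,
# e.g. 06-13-2000. Hyphens with spaces around them are left alone.
DATE_LIKE = re.compile(r"\d+(?:-\d+)+")


def strip_date_like(name: str) -> str:
    """Blank out date-like digit runs so a later scan cannot match inside them."""
    return DATE_LIKE.sub("", name)


print(strip_date_like("Prince - Northrop - 06-13-2000 - 33 - Kiss.mp3"))
print(strip_date_like("01-linkin_park_-_foreword-mp3.mp3"))
```

One caveat: a name that writes disc and track as something like 1-05 would lose its track number in this pass.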
So:
a) Do you think this would work?
b) Is it a preliminary regex or can it be pre-pended to the string
above?
c) What does it look like?
BTW - Your thought process is pretty good for a "Professional
Numbskull" :).
s.
Here's a mod that may work for you:
(?m)(?!<\d+\s?\-\s?)(?<=[\{\(\[\<_]|\-\s?|^)\d{1,2}(?=[\}\)\_\>\]]|\s?\-|$)(?!\s?\-\s?\d+)
This is identical to the first, with a couple of changes and additions.
First, the spaces with the hyphens are now optional (\s? means 0 or 1
space). Second, I added a negative look-behind to the beginning, and a
negative look-ahead at the end. The negative look-behind states that the
match cannot be preceded by 1 or more digits followed by 0 or 1 space and a
hyphen followed by 0 or 1 space. The negative look-ahead states that the
match cannot be followed by 0 or 1 space followed by a hyphen followed by 0
or 1 space followed by 1 or more numbers.
Of course, you realize that there are no hard and fast rules for this sort
of thing. Anyone can give any name to an mp3 file. But it works for all the
examples you gave.
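For anyone who wants to try it outside .NET, here is a rough Python port run against the filenames from the first post. Some assumptions are baked in: the delimiter look-behind is split into fixed-width alternatives because Python's `re` requires that, and the leading `(?!<...)` group is kept exactly as typed. As typed, that group is a negative look-ahead for a literal `<`, so it never blocks a match that starts on a digit. A true `(?<!...)` look-behind cannot be written in `re` with `\d+` inside it, and as far as I can tell it would also reject the 33 in the Prince example, because `2000 - ` sits directly in front of it.

```python
import re

# Port of the pattern as typed in the post (see the assumptions above).
TRACK = re.compile(
    r"(?!<\d+\s?-\s?)"                       # kept verbatim; effectively a no-op
    r"(?:(?<=[{(\[<_])|(?<=-\s)|(?<=-)|^)"   # opening delimiter, hyphen, or line start
    r"\d{1,2}"                               # the candidate track number
    r"(?=[})_>\]]|\s?-|$)"                   # closing delimiter, hyphen, or line end
    r"(?!\s?-\s?\d+)",                       # not followed by a hyphen and more digits
    re.MULTILINE,
)

NAMES = [
    "01 - Calexico - Sonic Wind (instrumental mix).mp3",
    "Gustav_Mahler-Symphony#10-Slatkin-St_Louis-1-Adagio.mp3",
    "Carl Orff - Carmina Burana - 08 - Uf dem anger- Chramer, gip .mp3",
    "01-linkin_park_-_foreword-mp3.mp3",
    "[03] (Wish I Could Fly Like) Superman.mp3",
    "Jethro Tull - 1999 - Live At House Of Blues - 13 - Hunting Girl.mp3",
    "Prince - Northrop - 06-13-2000 - 33 - Kiss.mp3",
    "UB40 - 08 - Sing Our Own Song.mp3",
    "Blink 182 - Take Off Your Pants And Jacket - 06 - The Rock Show.mp3",
    "08_Smokie_Living Near The Edge.mp3",
]


def find_track(name: str) -> list[str]:
    """Return every digit run in name that the pattern accepts."""
    return TRACK.findall(name)


for name in NAMES:
    print(find_track(name), name)
```

Each of these names should yield exactly one candidate, but that says nothing about filenames that follow other conventions.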
--
HTH,
Kevin Spencer
Microsoft MVP
Professional Numbskull
Hard work is a medication for which
there is no placebo.