473,395 Members | 1,885 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,395 software developers and data experts.

Advanced RegEx (the cluster problem).

Use Case:
We have music files that describe, in their filename, attributes of the
music.
We do not know a general pattern that applies to all filenames -- but
we do know that filenames that are clustered together (by for example
directory) will, most likely, have the same filename pattern.

Here is an example:
10,000 Maniacs - MTV Unplugged - 01 - These are Days.mp3

on its own, we can deduce little, except perhaps that the 01 is a
tracknumber (see my previous post). But if we know that all filenames
in the same directory, belong to the same album, but will have
different titles, then maybe, just maybe we can xform this filename
into its component parts.

So, if we have in our possession say, 2 more filenames:
10,000 Maniacs - MTV Unplugged - 02 - Eat for Two.mp3
10,000 Maniacs - MTV Unplugged - 03 - Candy Everybody Wants.mp3

Then we should be able to parse all the filenames into:
<var1>,<var2>,<tracknum>,<title>

Even if we don't know whether 10,000 Maniacs or MTV Unplugged is the
name of the Album or the Artists (this bit can be deduced by other
heuristics).

Now, it may be that a cluster of filenames has fewer properties:
4 Non Blondes - Train.mp3
4 Non Blondes - No Place Like Home.mp3

In this case the 4 is part of the Artist name which can be deduced,
since two tracks wouldn't have the same tracknumber.

As more example clusters:
01 What's the Matter Here.mp3
02 Hey Jack Kerouac.mp3

50 Cent - Back Down.mp3
50 Cent - Gotta Get.mp3

10 Just Like Anyone Aimee Mann Bachelor No. 2 Rock.mp3
09 Driving Sideways Aimee Mann Bachelor No. 2 Rock.mp3

This is doing my head in. I just can't figure out how to attack the
problem - even though my instincts say Regex should be the
tool....H-E-L-P!! Any ideas anyone?

May 10 '06 #1
4 2076
"skavan" <su**********@tfn.com> wrote:
This is doing my head in. I just can't figure out how to attack the
problem - even though my instincts say Regex should be the
tool....H-E-L-P!! Any ideas anyone?


My instinct is strongly not to use regex. Instead use more generic
clustering techniques. That way you'll also be able to add extra
data-sources (e.g. MP3 tags like genre, volume level, statistical
analysis of music) to get even more useful clustering. Also you'll be
more robust against other naming schemes the users have come up with
but you haven't thought of in advance.

(I'm guessing that you're doing this in order to present a better UI
to users for them to navigate their music collection).

PS. I've never coded any kind of clustering analysis. All I know about
it is seeing talks given by other researchers. So take what I say with
a pinch of salt.

--
Lucian
May 10 '06 #2
What a great problem!

I'm going to work from the assumption that we're working only within a directory - and if this doesn't naturally scale up to include directory names you'll say so and we can take another look...

Another assumption I'll use is that you can represent your strings in 8 bit characters. Unicode and MBCS present problems about the # of bytes needed to represent any character that, in some cases, cause characters to be variable length. That makes things far more difficult...

Here's what I see is the problem:
(Within a directory...) your examples demonstrate that there are one or more substrings within each filename that match (for any given cluster) - and your need is to be able to extract these substrings.

Here is a collection of techniques to detect similarities / differences between file names:

1. If you represent the strings as arrays of bytes you can then "subtract" one string from another:

String A: 10,000 Maniacs - MTV Unplugged - 02 - Eat for Two.mp3
String B: 10,000 Maniacs - MTV Unplugged - 03 - Candy Everybody Wants.mp3
Diff : 0000000000000000000000000000000000.000.0.......... .............

where a dot represents the numerical value of the difference of the ASCII characters

From this you can easily detect that this pair start with a long string of characters that match - up to a common position; after that there's a short match string and then it's all unmatched.
Note that since track numbers may not have leading zeros they can be variable length. To handle this you'll have to do this comparison three times (for every pair of file names compared): aligned, left shift filename 2, right shift filename 2 (each shift is by one character position). Of the three comparisons - one will show up with a long batch of zeros and the other two will not.

Your last example presents a further shift problem - where the variable stuff (which is likely to be variable length) is at the beginning. For these, you'll need to align the strings on the end, not the beginning; you'll then detect a bunch of consecutive zeros at the end, not the beginning.

2. So far, so good - but if you've 500 files in one directory, you could be doing 3*500*500 = 750,000 comparisons. Not good. So now you need to group similar files. And you need to do this before you've found the common parts of file names. For this, I'd create four hash tables - designed so that you are likely to get multiple hits on a given hash value (so you'll store a list of the hits under any given hash). The three tables will hash distinct parts of each file name:
A. the first four alpha characters of the file name (i.e. skip leading digits, punctuation and spaces).
B. the last four alpha characters of the file name
C. the string of alpha characters between the first pair of "-" characters, if such a string exists (don't bother adding names to the hash table if they don't hit).
D. like C, but between the last pair of "-" characters.

If there are other commonly used characters that punctuate parts of the file names, allow them as valid alternatives to "-" in C & D; you might create distinct hash tables for each distinct punctuation character.

By eliminating digits we eliminate track numbers from the hash tables - which are variable (and variable length) and can sometimes mimic band or album names - and so eliminate the need to shift before generating the hash.
By hashing on first & last & between delimiters we're generating indexes that are likely to hit on some recurring substring: all file names that share the same substring will land in the same hash bucket. Now your string differencing only needs to happen on the groups of file names within any given bucket - and you can safely skip making comparisons between files in different buckets from each other.

Note that some groups of file names may show up in more than one hash table - where a file name is like your 10,000 Maniacs example, both the band name and the album name will form a cluster in a hash table, but the band name will cluster in hash table A and the album name will cluster in table C.

Also note that you may need to look at limit cases - where a band name, or album name, contains fewer than four characters. In this case you would likely find that the four selected characters land in different hash buckets because of a differing first or last character. If you know you have cases where this might happen, you could consider modifying the 4 character rule to be '4 characters, unless there are fewer before I hit a digit or punctuation mark'.

Hope some or all of this helps. I'll be looking to see whether it works for you...

Griffin


"skavan" <su**********@tfn.com> wrote in message news:11**********************@y43g2000cwc.googlegr oups.com...
Use Case:
We have music files that describe, in their filename, attributes of the
music.
We do not know a general pattern that applies to all filenames -- but
we do know that filenames that are clustered together (by for example
directory) will, most likely, have the same filename pattern.

Here is an example:
10,000 Maniacs - MTV Unplugged - 01 - These are Days.mp3

on its own, we can deduce little, except perhaps that the 01 is a
tracknumber (see my previous post). But if we know that all filenames
in the same directory, belong to the same album, but will have
different titles, then maybe, just maybe we can xform this filename
into its component parts.

So, if we have in our possession say, 2 more filenames:
10,000 Maniacs - MTV Unplugged - 02 - Eat for Two.mp3
10,000 Maniacs - MTV Unplugged - 03 - Candy Everybody Wants.mp3

Then we should be able to parse all the filenames into:
<var1>,<var2>,<tracknum>,<title>

Even if we don't know whether 10,000 Maniacs or MTV Unplugged is the
name of the Album or the Artists (this bit can be deduced by other
heuristics).

Now, it may be that a cluster of filenames has fewer properties:
4 Non Blondes - Train.mp3
4 Non Blondes - No Place Like Home.mp3

In this case the 4 is part of the Artist name which can be deduced,
since two tracks wouldn't have the same tracknumber.

As more example clusters:
01 What's the Matter Here.mp3
02 Hey Jack Kerouac.mp3

50 Cent - Back Down.mp3
50 Cent - Gotta Get.mp3

10 Just Like Anyone Aimee Mann Bachelor No. 2 Rock.mp3
09 Driving Sideways Aimee Mann Bachelor No. 2 Rock.mp3

This is doing my head in. I just can't figure out how to attack the
problem - even though my instincts say Regex should be the
tool....H-E-L-P!! Any ideas anyone?

May 10 '06 #3
Very intresting approach.
You say "If you represent the strings as arrays of bytes you can then
"subtract" one string from another"...
Forgive the dumb question - but how do you subtract one byte array from
another in vb.net (or C#)?
Standard operators don't work!

Seems to me, I should take your byte array concept and do the
following:
1. Assign each filename to a string: s1, s2
2. Determine which string is longer and by how many chars: N (assume s1
is longer in this example).
3. Add N chars to end of s2 and place in s3
4. Add N chars to start of s2 and place in s4
5. Convert s1,s3 and s4 to bytearrays: b1, b3, b4
6. Subtract b3 from b1. If it starts with 0's it's CASE A), if not it's
(CASE B).
7. Subtract b4 from b1. If it ends with 0's (CASE C), if not it's (CASE
D).
8. CASE A + CASE C = fixed fields at eather end, variable in the middle
9. CASE A + CASE D = fixed fields at the beginning, variable at the
end.
10. CASE B + CASE C = variable field at the beginning, thereafter
fixed.
11. CASE B + CASE D = ERROR!!! no fixed matches...or variable at both
ends!

12. at this point, one can begin pulling the fixed fields and parsing
them.

Make sense?

Back to the question...short of iterating through the byte array, one
item at a time, how does one subtract two byte arrays? If I have to
iterate, why bother converting from string to byte array?

May 10 '06 #4
....oh and to handle the 11 vs 1 issue (2 chars vs 1), maybe I convert
all numbers to XX (regex) before I start the sequence above?

May 10 '06 #5

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

198
by: Sridhar R | last post by:
>From technical point of view, I could not understand the the reasoning behind using Java in major companies. Sure that Python, is used in some, but still Java is considered as a sure-job...
2
by: uli | last post by:
Is there any open source Python software (preferably biopython) written which runs on a cluster. Alternatively are there interfaces written in Python to existing cluster software.
0
by: rbs | last post by:
Is it worth looking into putting Oracle 8.1.7 on a Win2k3 cluster. Only Oracle 9.2.0 is certified on a W2k3 cluster, but applications (support contracts) prevent us from upgrading to Oracle 9. ...
0
by: Leonid | last post by:
I have Win 2003 cluster environment + SQL 2000 SP3 with active-passive infrastructure. Node1 active Node2 passive When I stop one node and When I am stopping a Node2 - everything is Ok - Node...
0
by: Leonid | last post by:
From: leonid_n@infogateonline.com (Leonid) Newsgroups: comp.databases.ms-sqlserver Subject: SQL Agent + Cluster NNTP-Posting-Host: 80.74.102.10 Message-ID:...
9
by: Christ | last post by:
Hi there, i'm trying to make a regex, but it ain't working. In just one regex expression I want to check a password that must meet following requirements: - at least 6 characters long - at...
7
by: Dennis Gearon | last post by:
If I've read everything right, in order to get: multiple languages on a site with the functionality of ALL of: REGEX LIKE Correctly sorted text
1
by: abhinav | last post by:
Hi guys.I have read that one cannot perform true multithreading in python due to global interpreter lock mechanism.Suppose i have to implement a crawler on a say cluster system like clusterknoppix...
5
by: skavan | last post by:
Hi, I'm just wrapping my head around regex and am pretty sure it can do the task at hand - but it's too complex for my brain to process -- so am throwing it out there for you experts to comment...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: ryjfgjl | last post by:
In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...
0
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...
0
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.