473,785 Members | 2,434 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Advanced RegEx (the cluster problem).

Use Case:
We have music files that describe, in their filename, attributes of the
music.
We do not know a general pattern that applies to all filenames -- but
we do know that filenames that are clustered together (by for example
directory) will, most likely, have the same filename pattern.

Here is an example:
10,000 Maniacs - MTV Unplugged - 01 - These are Days.mp3

on its own, we can deduce little, except perhaps that the 01 is a
tracknumber (see my previous post). But if we know that all filenames
in the same directory, belong to the same album, but will have
different titles, then maybe, just maybe we can xform this filename
into its component parts.

So, if we have in our possession say, 2 more filenames:
10,000 Maniacs - MTV Unplugged - 02 - Eat for Two.mp3
10,000 Maniacs - MTV Unplugged - 03 - Candy Everybody Wants.mp3

Then we should be able to parse all the filenames into:
<var1>,<var2>,< tracknum>,<titl e>

Even if we don't know whether 10,000 Maniacs or MTV Unplugged is the
name of the Album or the Artists (this bit can be deduced by other
heuristics).

Now, it may be that a cluster of filenames has fewer properties:
4 Non Blondes - Train.mp3
4 Non Blondes - No Place Like Home.mp3

In this case the 4 is part of the Artist name which can be deduced,
since two tracks wouldn't have the same tracknumber.

As more example clusters:
01 What's the Matter Here.mp3
02 Hey Jack Kerouac.mp3

50 Cent - Back Down.mp3
50 Cent - Gotta Get.mp3

10 Just Like Anyone Aimee Mann Bachelor No. 2 Rock.mp3
09 Driving Sideways Aimee Mann Bachelor No. 2 Rock.mp3

This is doing my head in. I just can't figure out how to attack the
problem - even though my instincts say Regex should be the
tool....H-E-L-P!! Any ideas anyone?

May 10 '06 #1
4 2094
"skavan" <su**********@t fn.com> wrote:
This is doing my head in. I just can't figure out how to attack the
problem - even though my instincts say Regex should be the
tool....H-E-L-P!! Any ideas anyone?


My instinct is strongly not to use regex. Instead use more generic
clustering techniques. That way you'll also be able to add extra
data-sources (e.g. MP3 tags like genre, volume level, statistical
analysis of music) to get even more useful clustering. Also you'll be
more robust against other naming schemes the users have come up with
but you haven't thought of in advance.

(I'm guessing that you're doing this in order to present a better UI
to users for them to navigate their music collection).

PS. I've never coded any kind of clustering analysis. All I know about
it is seeing talks given by other researchers. So take what I say with
a pinch of salt.

--
Lucian
May 10 '06 #2
What a great problem!

I'm going to work from the assumption that we're working only within a directory - and if this doesn't naturally scale up to include directory names you'll say so and we can take another look...

Another assumption I'll use is that you can represent your strings in 8 bit characters. Unicode and MBCS present problems about the # of bytes needed to represent any character that, in some cases, cause characters to be variable length. That makes things far more difficult...

Here's what I see is the problem:
(Within a directory...) your examples demonstrate that there are one or more substrings within each filename that match (for any given cluster) - and your need is to be able to extract these substrings.

Here is a collection of techniques to detect similarities / differences between file names:

1. If you represent the strings as arrays of bytes you can then "subtract" one string from another:

String A: 10,000 Maniacs - MTV Unplugged - 02 - Eat for Two.mp3
String B: 10,000 Maniacs - MTV Unplugged - 03 - Candy Everybody Wants.mp3
Diff : 000000000000000 000000000000000 0000.000.0..... ............... ...

where a dot represents the numerical value of the difference of the ASCII characters

From this you can easily detect that this pair start with a long string of characters that match - up to a common position; after that there's a short match string and then it's all unmatched.
Note that since track numbers may not have leading zeros they can be variable length. To handle this you'll have to do this comparison three times (for every pair of file names compared): aligned, left shift filename 2, right shift filename 2 (each shift is by one character position). Of the three comparisons - one will show up with a long batch of zeros and the other two will not.

Your last example presents a further shift problem - where the variable stuff (which is likely to be variable length) is at the beginning. For these, you'll need to align the strings on the end, not the beginning; you'll then detect a bunch of consecutive zeros at the end, not the beginning.

2. So far, so good - but if you've 500 files in one directory, you could be doing 3*500*500 = 750,000 comparisons. Not good. So now you need to group similar files. And you need to do this before you've found the common parts of file names. For this, I'd create four hash tables - designed so that you are likely to get multiple hits on a given hash value (so you'll store a list of the hits under any given hash). The three tables will hash distinct parts of each file name:
A. the first four alpha characters of the file name (i.e. skip leading digits, punctuation and spaces).
B. the last four alpha characters of the file name
C. the string of alpha characters between the first pair of "-" characters, if such a string exists (don't bother adding names to the hash table if they don't hit).
D. like C, but between the last pair of "-" characters.

If there are other commonly used characters that punctuate parts of the file names, allow them as valid alternatives to "-" in C & D; you might create distinct hash tables for each distinct punctuation character.

By eliminating digits we eliminate track numbers from the hash tables - which are variable (and variable length) and can sometimes mimic band or album names - and so eliminate the need to shift before generating the hash.
By hashing on first & last & between delimiters we're generating indexes that are likely to hit on some recurring substring: all file names that share the same substring will land in the same hash bucket. Now your string differencing only needs to happen on the groups of file names within any given bucket - and you can safely skip making comparisons between files in different buckets from each other.

Note that some groups of file names may show up in more than one hash table - where a file name is like your 10,000 Maniacs example, both the band name and the album name will form a cluster in a hash table, but the band name will cluster in hash table A and the album name will cluster in table C.

Also note that you may need to look at limit cases - where a band name, or album name, contains fewer than four characters. In this case you would likely find that the four selected characters land in different hash buckets because of a differing first or last character. If you know you have cases where this might happen, you could consider modifying the 4 character rule to be '4 characters, unless there are fewer before I hit a digit or punctuation mark'.

Hope some or all of this helps. I'll be looking to see whether it works for you...

Griffin


"skavan" <su**********@t fn.com> wrote in message news:11******** **************@ y43g2000cwc.goo glegroups.com.. .
Use Case:
We have music files that describe, in their filename, attributes of the
music.
We do not know a general pattern that applies to all filenames -- but
we do know that filenames that are clustered together (by for example
directory) will, most likely, have the same filename pattern.

Here is an example:
10,000 Maniacs - MTV Unplugged - 01 - These are Days.mp3

on its own, we can deduce little, except perhaps that the 01 is a
tracknumber (see my previous post). But if we know that all filenames
in the same directory, belong to the same album, but will have
different titles, then maybe, just maybe we can xform this filename
into its component parts.

So, if we have in our possession say, 2 more filenames:
10,000 Maniacs - MTV Unplugged - 02 - Eat for Two.mp3
10,000 Maniacs - MTV Unplugged - 03 - Candy Everybody Wants.mp3

Then we should be able to parse all the filenames into:
<var1>,<var2>,< tracknum>,<titl e>

Even if we don't know whether 10,000 Maniacs or MTV Unplugged is the
name of the Album or the Artists (this bit can be deduced by other
heuristics).

Now, it may be that a cluster of filenames has fewer properties:
4 Non Blondes - Train.mp3
4 Non Blondes - No Place Like Home.mp3

In this case the 4 is part of the Artist name which can be deduced,
since two tracks wouldn't have the same tracknumber.

As more example clusters:
01 What's the Matter Here.mp3
02 Hey Jack Kerouac.mp3

50 Cent - Back Down.mp3
50 Cent - Gotta Get.mp3

10 Just Like Anyone Aimee Mann Bachelor No. 2 Rock.mp3
09 Driving Sideways Aimee Mann Bachelor No. 2 Rock.mp3

This is doing my head in. I just can't figure out how to attack the
problem - even though my instincts say Regex should be the
tool....H-E-L-P!! Any ideas anyone?

May 10 '06 #3
Very intresting approach.
You say "If you represent the strings as arrays of bytes you can then
"subtract" one string from another"...
Forgive the dumb question - but how do you subtract one byte array from
another in vb.net (or C#)?
Standard operators don't work!

Seems to me, I should take your byte array concept and do the
following:
1. Assign each filename to a string: s1, s2
2. Determine which string is longer and by how many chars: N (assume s1
is longer in this example).
3. Add N chars to end of s2 and place in s3
4. Add N chars to start of s2 and place in s4
5. Convert s1,s3 and s4 to bytearrays: b1, b3, b4
6. Subtract b3 from b1. If it starts with 0's it's CASE A), if not it's
(CASE B).
7. Subtract b4 from b1. If it ends with 0's (CASE C), if not it's (CASE
D).
8. CASE A + CASE C = fixed fields at eather end, variable in the middle
9. CASE A + CASE D = fixed fields at the beginning, variable at the
end.
10. CASE B + CASE C = variable field at the beginning, thereafter
fixed.
11. CASE B + CASE D = ERROR!!! no fixed matches...or variable at both
ends!

12. at this point, one can begin pulling the fixed fields and parsing
them.

Make sense?

Back to the question...shor t of iterating through the byte array, one
item at a time, how does one subtract two byte arrays? If I have to
iterate, why bother converting from string to byte array?

May 10 '06 #4
....oh and to handle the 11 vs 1 issue (2 chars vs 1), maybe I convert
all numbers to XX (regex) before I start the sequence above?

May 10 '06 #5

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

198
7726
by: Sridhar R | last post by:
>From technical point of view, I could not understand the the reasoning behind using Java in major companies. Sure that Python, is used in some, but still Java is considered as a sure-job language. After being a python programmer for long time, I consider it painful to learn/use Java now (well, like many I will be forced to do that in my job). What makes such companies to choose Java over dynamic, productive languages like Python? ...
2
1852
by: uli | last post by:
Is there any open source Python software (preferably biopython) written which runs on a cluster. Alternatively are there interfaces written in Python to existing cluster software.
0
2430
by: rbs | last post by:
Is it worth looking into putting Oracle 8.1.7 on a Win2k3 cluster. Only Oracle 9.2.0 is certified on a W2k3 cluster, but applications (support contracts) prevent us from upgrading to Oracle 9. Experience anyone?
0
1167
by: Leonid | last post by:
I have Win 2003 cluster environment + SQL 2000 SP3 with active-passive infrastructure. Node1 active Node2 passive When I stop one node and When I am stopping a Node2 - everything is Ok - Node 1 can work in autonomic mode and I can restart a SQL Agent without any problems When I am stopping a Node1 - everything is Ok - Node 1 can work in
0
1210
by: Leonid | last post by:
From: leonid_n@infogateonline.com (Leonid) Newsgroups: comp.databases.ms-sqlserver Subject: SQL Agent + Cluster NNTP-Posting-Host: 80.74.102.10 Message-ID: <9dbf743d.0410240200.691d5172@posting.google.com> I have Win 2003 cluster environment + SQL 2000 SP3 with active-passive infrastructure. Node1 active Node2 passive
9
3072
by: Christ | last post by:
Hi there, i'm trying to make a regex, but it ain't working. In just one regex expression I want to check a password that must meet following requirements: - at least 6 characters long - at least one letter (doesn't matter where in te string) - at least one number (doesn't matter where in te string) (those are no prob) - excluding 0, o, O, i, I, l, L, 1 (to prevent confusion about
7
2052
by: Dennis Gearon | last post by:
If I've read everything right, in order to get: multiple languages on a site with the functionality of ALL of: REGEX LIKE Correctly sorted text
1
2347
by: abhinav | last post by:
Hi guys.I have read that one cannot perform true multithreading in python due to global interpreter lock mechanism.Suppose i have to implement a crawler on a say cluster system like clusterknoppix so that i can use parallel virtual machine (PVM)for programming in multiprocessor environment or say open MPI.Can i integrate python with PVM or MPI.Can i embed python into C for programming in multiprocessor environment.Is there any way of...
5
1722
by: skavan | last post by:
Hi, I'm just wrapping my head around regex and am pretty sure it can do the task at hand - but it's too complex for my brain to process -- so am throwing it out there for you experts to comment on. I am posing two questions. In the interests of space and focus, I'll post a separate thread for the other use case (clustering). Use Case 1: Filenames contain a TrackNumber (or not).
0
9645
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...
0
10325
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...
1
10091
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
9950
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
0
8972
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
0
6739
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...
0
5381
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
2
3645
muto222
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.
3
2879
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.