Use Case:
We have music files that describe, in their filename, attributes of the
music.
We do not know a general pattern that applies to all filenames -- but
we do know that filenames that are clustered together (by for example
directory) will, most likely, have the same filename pattern.
Here is an example:
10,000 Maniacs - MTV Unplugged - 01 - These are Days.mp3
on its own, we can deduce little, except perhaps that the 01 is a
tracknumber (see my previous post). But if we know that all filenames
in the same directory, belong to the same album, but will have
different titles, then maybe, just maybe we can xform this filename
into its component parts.
So, if we have in our possession say, 2 more filenames:
10,000 Maniacs - MTV Unplugged - 02 - Eat for Two.mp3
10,000 Maniacs - MTV Unplugged - 03 - Candy Everybody Wants.mp3
Then we should be able to parse all the filenames into:
<var1>,<var2>,< tracknum>,<titl e>
Even if we don't know whether 10,000 Maniacs or MTV Unplugged is the
name of the Album or the Artists (this bit can be deduced by other
heuristics).
Now, it may be that a cluster of filenames has fewer properties:
4 Non Blondes - Train.mp3
4 Non Blondes - No Place Like Home.mp3
In this case the 4 is part of the Artist name which can be deduced,
since two tracks wouldn't have the same tracknumber.
As more example clusters:
01 What's the Matter Here.mp3
02 Hey Jack Kerouac.mp3
50 Cent - Back Down.mp3
50 Cent - Gotta Get.mp3
10 Just Like Anyone Aimee Mann Bachelor No. 2 Rock.mp3
09 Driving Sideways Aimee Mann Bachelor No. 2 Rock.mp3
This is doing my head in. I just can't figure out how to attack the
problem - even though my instincts say Regex should be the
tool....H-E-L-P!! Any ideas anyone? 4 2094
"skavan" <su**********@t fn.com> wrote: This is doing my head in. I just can't figure out how to attack the problem - even though my instincts say Regex should be the tool....H-E-L-P!! Any ideas anyone?
My instinct is strongly not to use regex. Instead use more generic
clustering techniques. That way you'll also be able to add extra
data-sources (e.g. MP3 tags like genre, volume level, statistical
analysis of music) to get even more useful clustering. Also you'll be
more robust against other naming schemes the users have come up with
but you haven't thought of in advance.
(I'm guessing that you're doing this in order to present a better UI
to users for them to navigate their music collection).
PS. I've never coded any kind of clustering analysis. All I know about
it is seeing talks given by other researchers. So take what I say with
a pinch of salt.
--
Lucian
What a great problem!
I'm going to work from the assumption that we're working only within a directory - and if this doesn't naturally scale up to include directory names you'll say so and we can take another look...
Another assumption I'll use is that you can represent your strings in 8 bit characters. Unicode and MBCS present problems about the # of bytes needed to represent any character that, in some cases, cause characters to be variable length. That makes things far more difficult...
Here's what I see is the problem:
(Within a directory...) your examples demonstrate that there are one or more substrings within each filename that match (for any given cluster) - and your need is to be able to extract these substrings.
Here is a collection of techniques to detect similarities / differences between file names:
1. If you represent the strings as arrays of bytes you can then "subtract" one string from another:
String A: 10,000 Maniacs - MTV Unplugged - 02 - Eat for Two.mp3
String B: 10,000 Maniacs - MTV Unplugged - 03 - Candy Everybody Wants.mp3
Diff : 000000000000000 000000000000000 0000.000.0..... ............... ...
where a dot represents the numerical value of the difference of the ASCII characters
From this you can easily detect that this pair start with a long string of characters that match - up to a common position; after that there's a short match string and then it's all unmatched.
Note that since track numbers may not have leading zeros they can be variable length. To handle this you'll have to do this comparison three times (for every pair of file names compared): aligned, left shift filename 2, right shift filename 2 (each shift is by one character position). Of the three comparisons - one will show up with a long batch of zeros and the other two will not.
Your last example presents a further shift problem - where the variable stuff (which is likely to be variable length) is at the beginning. For these, you'll need to align the strings on the end, not the beginning; you'll then detect a bunch of consecutive zeros at the end, not the beginning.
2. So far, so good - but if you've 500 files in one directory, you could be doing 3*500*500 = 750,000 comparisons. Not good. So now you need to group similar files. And you need to do this before you've found the common parts of file names. For this, I'd create four hash tables - designed so that you are likely to get multiple hits on a given hash value (so you'll store a list of the hits under any given hash). The three tables will hash distinct parts of each file name:
A. the first four alpha characters of the file name (i.e. skip leading digits, punctuation and spaces).
B. the last four alpha characters of the file name
C. the string of alpha characters between the first pair of "-" characters, if such a string exists (don't bother adding names to the hash table if they don't hit).
D. like C, but between the last pair of "-" characters.
If there are other commonly used characters that punctuate parts of the file names, allow them as valid alternatives to "-" in C & D; you might create distinct hash tables for each distinct punctuation character.
By eliminating digits we eliminate track numbers from the hash tables - which are variable (and variable length) and can sometimes mimic band or album names - and so eliminate the need to shift before generating the hash.
By hashing on first & last & between delimiters we're generating indexes that are likely to hit on some recurring substring: all file names that share the same substring will land in the same hash bucket. Now your string differencing only needs to happen on the groups of file names within any given bucket - and you can safely skip making comparisons between files in different buckets from each other.
Note that some groups of file names may show up in more than one hash table - where a file name is like your 10,000 Maniacs example, both the band name and the album name will form a cluster in a hash table, but the band name will cluster in hash table A and the album name will cluster in table C.
Also note that you may need to look at limit cases - where a band name, or album name, contains fewer than four characters. In this case you would likely find that the four selected characters land in different hash buckets because of a differing first or last character. If you know you have cases where this might happen, you could consider modifying the 4 character rule to be '4 characters, unless there are fewer before I hit a digit or punctuation mark'.
Hope some or all of this helps. I'll be looking to see whether it works for you...
Griffin
"skavan" <su**********@t fn.com> wrote in message news:11******** **************@ y43g2000cwc.goo glegroups.com.. .
Use Case:
We have music files that describe, in their filename, attributes of the
music.
We do not know a general pattern that applies to all filenames -- but
we do know that filenames that are clustered together (by for example
directory) will, most likely, have the same filename pattern.
Here is an example:
10,000 Maniacs - MTV Unplugged - 01 - These are Days.mp3
on its own, we can deduce little, except perhaps that the 01 is a
tracknumber (see my previous post). But if we know that all filenames
in the same directory, belong to the same album, but will have
different titles, then maybe, just maybe we can xform this filename
into its component parts.
So, if we have in our possession say, 2 more filenames:
10,000 Maniacs - MTV Unplugged - 02 - Eat for Two.mp3
10,000 Maniacs - MTV Unplugged - 03 - Candy Everybody Wants.mp3
Then we should be able to parse all the filenames into:
<var1>,<var2>,< tracknum>,<titl e>
Even if we don't know whether 10,000 Maniacs or MTV Unplugged is the
name of the Album or the Artists (this bit can be deduced by other
heuristics).
Now, it may be that a cluster of filenames has fewer properties:
4 Non Blondes - Train.mp3
4 Non Blondes - No Place Like Home.mp3
In this case the 4 is part of the Artist name which can be deduced,
since two tracks wouldn't have the same tracknumber.
As more example clusters:
01 What's the Matter Here.mp3
02 Hey Jack Kerouac.mp3
50 Cent - Back Down.mp3
50 Cent - Gotta Get.mp3
10 Just Like Anyone Aimee Mann Bachelor No. 2 Rock.mp3
09 Driving Sideways Aimee Mann Bachelor No. 2 Rock.mp3
This is doing my head in. I just can't figure out how to attack the
problem - even though my instincts say Regex should be the
tool....H-E-L-P!! Any ideas anyone?
Very intresting approach.
You say "If you represent the strings as arrays of bytes you can then
"subtract" one string from another"...
Forgive the dumb question - but how do you subtract one byte array from
another in vb.net (or C#)?
Standard operators don't work!
Seems to me, I should take your byte array concept and do the
following:
1. Assign each filename to a string: s1, s2
2. Determine which string is longer and by how many chars: N (assume s1
is longer in this example).
3. Add N chars to end of s2 and place in s3
4. Add N chars to start of s2 and place in s4
5. Convert s1,s3 and s4 to bytearrays: b1, b3, b4
6. Subtract b3 from b1. If it starts with 0's it's CASE A), if not it's
(CASE B).
7. Subtract b4 from b1. If it ends with 0's (CASE C), if not it's (CASE
D).
8. CASE A + CASE C = fixed fields at eather end, variable in the middle
9. CASE A + CASE D = fixed fields at the beginning, variable at the
end.
10. CASE B + CASE C = variable field at the beginning, thereafter
fixed.
11. CASE B + CASE D = ERROR!!! no fixed matches...or variable at both
ends!
12. at this point, one can begin pulling the fixed fields and parsing
them.
Make sense?
Back to the question...shor t of iterating through the byte array, one
item at a time, how does one subtract two byte arrays? If I have to
iterate, why bother converting from string to byte array?
....oh and to handle the 11 vs 1 issue (2 chars vs 1), maybe I convert
all numbers to XX (regex) before I start the sequence above? This thread has been closed and replies have been disabled. Please start a new discussion. Similar topics |
by: Sridhar R |
last post by:
>From technical point of view, I could not understand the the reasoning
behind using Java in major companies. Sure that Python, is used in
some, but still Java is considered as a sure-job language.
After being a python programmer for long time, I consider it painful to
learn/use Java now (well, like many I will be forced to do that in my
job).
What makes such companies to choose Java over dynamic, productive
languages like Python? ...
|
by: uli |
last post by:
Is there any open source Python software (preferably biopython) written
which runs on a cluster. Alternatively are there interfaces written in
Python to existing cluster software.
|
by: rbs |
last post by:
Is it worth looking into putting Oracle 8.1.7 on a Win2k3
cluster.
Only Oracle 9.2.0 is certified on a W2k3 cluster, but applications
(support contracts) prevent us from upgrading to Oracle 9.
Experience anyone?
|
by: Leonid |
last post by:
I have Win 2003 cluster environment + SQL 2000 SP3 with active-passive
infrastructure.
Node1 active
Node2 passive
When I stop one node and
When I am stopping a Node2 - everything is Ok - Node 1 can work in
autonomic mode and I can restart a SQL Agent without any problems
When I am stopping a Node1 - everything is Ok - Node 1 can work in
|
by: Leonid |
last post by:
From: leonid_n@infogateonline.com (Leonid)
Newsgroups: comp.databases.ms-sqlserver
Subject: SQL Agent + Cluster
NNTP-Posting-Host: 80.74.102.10
Message-ID: <9dbf743d.0410240200.691d5172@posting.google.com>
I have Win 2003 cluster environment + SQL 2000 SP3 with active-passive
infrastructure.
Node1 active
Node2 passive
| |
by: Christ |
last post by:
Hi there,
i'm trying to make a regex, but it ain't working.
In just one regex expression I want to check a password that must meet
following requirements:
- at least 6 characters long
- at least one letter (doesn't matter where in te string)
- at least one number (doesn't matter where in te string)
(those are no prob)
- excluding 0, o, O, i, I, l, L, 1 (to prevent confusion about
|
by: Dennis Gearon |
last post by:
If I've read everything right, in order to get:
multiple languages on a site
with the functionality of ALL of:
REGEX
LIKE
Correctly sorted text
|
by: abhinav |
last post by:
Hi guys.I have read that one cannot perform true multithreading in
python due to global interpreter lock mechanism.Suppose i have to
implement a crawler on a say cluster system like clusterknoppix so that
i can use parallel virtual machine (PVM)for programming in
multiprocessor environment or say open MPI.Can i integrate python with
PVM or MPI.Can i embed python into C for programming in multiprocessor
environment.Is there any way of...
|
by: skavan |
last post by:
Hi,
I'm just wrapping my head around regex and am pretty sure it can do the
task at hand - but it's too complex for my brain to process -- so am
throwing it out there for you experts to comment on. I am posing two
questions. In the interests of space and focus, I'll post a separate
thread for the other use case (clustering).
Use Case 1:
Filenames contain a TrackNumber (or not).
|
by: marktang |
last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look !
Part I. Meaning of...
|
by: Oralloy |
last post by:
Hello folks,
I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>".
The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed.
This is as boiled down as I can make it.
Here is my compilation command:
g++-12 -std=c++20 -Wnarrowing bit_field.cpp
Here is the code in...
| |
by: Hystou |
last post by:
Overview:
Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
|
by: tracyyun |
last post by:
Dear forum friends,
With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
|
by: agi2029 |
last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own....
Now, this would greatly impact the work of software developers. The idea...
|
by: conductexam |
last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one.
At the time of converting from word file to html my equations which are in the word document file was convert into image.
Globals.ThisAddIn.Application.ActiveDocument.Select();...
|
by: TSSRALBI |
last post by:
Hello
I'm a network technician in training and I need your help.
I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs.
The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols.
I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
|
by: muto222 |
last post by:
How can i add a mobile payment intergratation into php mysql website.
| |
by: bsmnconsultancy |
last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...
| |