Pattern Matching Given # of Characters and no String Input; use RegularExpressions?

Synonymous

Hello,

Can regular expressions compare file names to one another? It seems an RE
can only compare against input I give it, while I want the names compared
amongst themselves, giving me matches if the first x characters are
similar.

For example:

cccat
cccap
cccan
dddfa
dddfg
dddfz

This would result in the 'ddd' names being grouped together and the 'ccc'
names being grouped together if I specified a match on the first 3
characters.

What I am trying to do is build a script that will automatically
create directories based on shared prefixes like this, starting with say 10
characters and going down to 1. This way "Vacation1.jpg,
Vacation2.jpg" would be sent to their own directory (if I specify that the
first 8 characters must be similar) and "Cat1.jpg, Cat2.jpg" would be
too (with 3).

Thanks for your help and interest!

S M

Jul 19 '05 #1


tiissa

Synonymous wrote:

Can regular expressions compare file names to one another? It seems an RE
can only compare against input I give it, while I want the names compared
amongst themselves, giving me matches if the first x characters are
similar.

Do you have to use regular expressions?

If you know the number of characters to match can't you just compare slices?

In [1]: f1,f2='cccat','cccap'

In [2]: f1[:3]
Out[2]: 'ccc'

In [3]: f1[:3]==f2[:3]
Out[3]: True

It seems to me you just have to compare each file to the next one (after
having sorted your list).
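
As a rough sketch of that idea (the function name `adjacent_matches` and the sample list are made up for illustration): sorting puts names with a common prefix next to each other, so a single pass over neighbouring pairs is enough.

```python
def adjacent_matches(names, n):
    """Return neighbouring pairs (after sorting) whose first n characters match."""
    ordered = sorted(names)
    pairs = []
    # zip(ordered, ordered[1:]) walks over each name and the one right after it.
    for current, following in zip(ordered, ordered[1:]):
        if current[:n] == following[:n]:
            pairs.append((current, following))
    return pairs


names = ['dddfz', 'cccat', 'dddfa', 'cccap', 'cccan', 'dddfg']
print(adjacent_matches(names, 3))
```

With the sample names, 'cccat' and 'dddfa' end up next to each other after sorting but are not reported, because their first 3 characters differ.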

Jul 19 '05 #2

tiissa

tiissa wrote:

If you know the number of characters to match can't you just compare
slices?

If you don't, you can still do it by hand:

In [7]: def cmp(s1,s2):
   ....:     diff_map=[chr(s1[i]!=s2[i]) for i in range(min(len(s1), len(s2)))]
   ....:     diff_index=''.join(diff_map).find(chr(True))
   ....:     if -1==diff_index:
   ....:         return min(len(s1), len(s2))
   ....:     else:
   ....:         return diff_index
   ....:

In [8]: cmp('cccat','cccap')
Out[8]: 4

In [9]: cmp('ccc','cccap')
Out[9]: 3

In [10]: cmp('cccat','dddfa')
Out[10]: 0
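
For comparison, the standard library's `os.path.commonprefix` returns the longest leading string shared by a list of names, working character by character, so the same numbers can probably be obtained without a hand-written loop. A minimal sketch (the wrapper name `common_prefix_len` is made up here, not part of any library):

```python
import os.path


def common_prefix_len(s1, s2):
    """Length of the longest prefix shared by s1 and s2."""
    # commonprefix compares character by character, not path component by component.
    return len(os.path.commonprefix([s1, s2]))


print(common_prefix_len('cccat', 'cccap'))  # 4
print(common_prefix_len('ccc', 'cccap'))    # 3
print(common_prefix_len('cccat', 'dddfa'))  # 0
```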

Jul 19 '05 #3

Kent Johnson

tiissa wrote:

Synonymous wrote:
Can regular expressions compare file names to one another? It seems an RE
can only compare against input I give it, while I want the names compared
amongst themselves, giving me matches if the first x characters are
similar.

Do you have to use regular expressions?

If you know the number of characters to match can't you just compare
slices?

It seems to me you just have to compare each file to the next one (after
having sorted your list).

itertools.groupby() can do the comparing and grouping:

>>> import itertools
>>> def groupbyPrefix(lst, n):
...     lst.sort()
...     def key(item):
...         return item[:n]
...     return [ list(items) for k, items in itertools.groupby(lst, key=key) ]
...
>>> names = ['cccat', 'cccap', 'cccan', 'cccbt', 'ccddd', 'dddfa', 'dddfg', 'dddfz']
>>> groupbyPrefix(names, 3)
[['cccan', 'cccap', 'cccat', 'cccbt'], ['ccddd'], ['dddfa', 'dddfg', 'dddfz']]
>>> groupbyPrefix(names, 2)
[['cccan', 'cccap', 'cccat', 'cccbt', 'ccddd'], ['dddfa', 'dddfg', 'dddfz']]
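
For the directory use case it may be handier to keep the prefix alongside each group and to drop prefixes that only one name uses. The following is only a sketch along those lines; the name `shared_prefix_groups` is invented for the example, and it uses `sorted()` so the caller's list is left untouched.

```python
import itertools


def shared_prefix_groups(names, n):
    """Map each n-character prefix to its names, skipping prefixes used by one name only."""
    groups = {}
    # groupby only merges adjacent items, so the input has to be sorted first.
    for prefix, items in itertools.groupby(sorted(names), key=lambda name: name[:n]):
        members = list(items)
        if len(members) > 1:
            groups[prefix] = members
    return groups


names = ['cccat', 'cccap', 'cccan', 'cccbt', 'ccddd', 'dddfa', 'dddfg', 'dddfz']
print(shared_prefix_groups(names, 3))
```

Each key of the returned dictionary could then serve as a directory name, with the listed files as its contents.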

Kent

Jul 19 '05 #4

Synonymous

tiissa <ti****@nonfree.fr> wrote in message news:<42***********************@news.free.fr>...

tiissa wrote:
If you know the number of characters to match can't you just compare
slices?

If you don't, you can still do it by hand:

In [7]: def cmp(s1,s2):
   ....:     diff_map=[chr(s1[i]!=s2[i]) for i in range(min(len(s1), len(s2)))]
   ....:     diff_index=''.join(diff_map).find(chr(True))
   ....:     if -1==diff_index:
   ....:         return min(len(s1), len(s2))
   ....:     else:
   ....:         return diff_index
   ....:

In [8]: cmp('cccat','cccap')
Out[8]: 4

In [9]: cmp('ccc','cccap')
Out[9]: 3

In [10]: cmp('cccat','dddfa')
Out[10]: 0

I will look at that, although if I have 300 images I don't want to type
all the comparisons (In [9]: cmp('ccc','cccap')) by hand; it would
just be easier to sort them myself then :).

I got it somewhat close to working in Visual Basic:

If Left$(Cells(iRow, 1).Value, Count) = Left$(Cells(iRow - 1, 1).Value, Count) Then

When walking down a list, this takes the leftmost 'Count' characters of
the current cell and compares them with the leftmost 'Count' characters
of the cell in the row above, then does the task (i.e. makes a directory,
moves the files) if they are equal.

I will look for a Left$(str) function for Python that looks at the first X
characters :)).

Thank you for your help!

Synonymous

Jul 19 '05 #5

John Machin

On 17 Apr 2005 18:12:19 -0700, sm***********@gmail.com (Synonymous)
wrote:

I will look for a Left$(str) function that looks at the first X
characters for python :)).

Wild goose chase alert! AFAIK there isn't one. Python uses slice
notation instead of left/mid/right/substr/whatever functions. I do
suggest that instead of looking for such a beastie, you read this
section of the Python Tutorial: 3.1.2 Strings.

Then, if you think that that was a good use of your time, you might
like to read the *whole* tutorial :))

HTH,

John

Jul 19 '05 #6

Dennis Lee Bieber

On 17 Apr 2005 18:12:19 -0700, sm***********@gmail.com (Synonymous)
declaimed the following in comp.lang.python:

I will look for a Left$(str) function that looks at the first X
characters for python :)).

BASIC's
Left$(str, x)

is essentially Python's
str[:x]

and a comparison of two would be
somestring[:X] == anotherstring[:X]
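
If the BASIC names feel more familiar, thin wrappers around slices are easy to write. This is just an illustrative sketch: `left`, `right` and `mid` are made-up helper names, not built-ins, and `mid` assumes a 1-based start position of at least 1, as in BASIC.

```python
def left(s, x):
    """Like BASIC's Left$(s, x): the first x characters."""
    return s[:x]


def right(s, x):
    """Like BASIC's Right$(s, x): the last x characters."""
    # s[-0:] would return the whole string, so x == 0 is handled explicitly.
    return s[-x:] if x > 0 else ''


def mid(s, start, length):
    """Like BASIC's Mid$(s, start, length); BASIC counts positions from 1."""
    return s[start - 1:start - 1 + length]


print(left('Vacation1.jpg', 8) == left('Vacation2.jpg', 8))
```

Slices never raise an error when the string is shorter than requested, so `left('Cat', 10)` simply returns `'Cat'`.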
--
Wulfraed  Dennis Lee Bieber  KD6MOG
wl*****@ix.netcom.com  wu******@dm.net
Bestiaria Support Staff
Home Page: <http://www.dm.net/~wulfraed/>
Overflow Page: <http://wlfraed.home.netcom.com/>

Jul 19 '05 #7

tiissa

Synonymous wrote:

tiissa <ti****@nonfree.fr> wrote in message news:<42***********************@news.free.fr>...
tiissa wrote:
If you know the number of characters to match can't you just compare
slices?

If you don't, you can still do it by hand:

In [7]: def cmp(s1,s2):
   ....:     diff_map=[chr(s1[i]!=s2[i]) for i in range(min(len(s1), len(s2)))]
   ....:     diff_index=''.join(diff_map).find(chr(True))
   ....:     if -1==diff_index:
   ....:         return min(len(s1), len(s2))
   ....:     else:
   ....:         return diff_index
   ....:

I will look at that, although if I have 300 images I don't want to type
all the comparisons (In [9]: cmp('ccc','cccap')) by hand; it would
just be easier to sort them myself then :).

I didn't mean you had to type it by hand. I was thinking of writing a
small script (as opposed to using something from the standard tools). It
might look like:

In [22]: def make_group(L):
   ....:     root,res='',[]
   ....:     for i in range(1,len(L)):
   ....:         if ''==root:
   ....:             root=L[i][:cmp(L[i-1],L[i])]
   ....:             if ''==root:
   ....:                 res.append((L[i-1],[L[i-1]]))
   ....:             else:
   ....:                 res.append((root,[L[i-1],L[i]]))
   ....:         elif len(root)==cmp(root,L[i]):
   ....:             res[-1][1].append(L[i])
   ....:         else:
   ....:             root=''
   ....:     if ''==root:
   ....:         res.append((L[-1],[L[-1]]))
   ....:     return res
   ....:

In [23]: L=['cccat','cccap','cccan','dddfa','dddfg','dddfz']

In [24]: L.sort()

In [25]: make_group(L)
Out[25]: [('ccca', ['cccan', 'cccap', 'cccat']), ('dddf', ['dddfa', 'dddfg', 'dddfz'])]

However, I guarantee no optimality in the number of classes (but, hey,
that's what happens when you don't specify the size of the prefix).
(Actually, I guarantee nothing at all ;p)
In particular, a file can end up singled out in a group of its own:

In [26]: make_group(['cccan','cccap','cccat','cccb'])
Out[26]: [('ccca', ['cccan', 'cccap', 'cccat']), ('cccb', ['cccb'])]

It is a matter of choice: either you want to specify the size of the
prefix by hand, in which case look at itertools as pointed out by Kent, or
you don't, and a variation on the above code might do the job.

Jul 19 '05 #8

Synonymous

Hello!

I was trying to create a program to search for the largest common
substring among filenames in a directory, then move the files into a
directory named after that substring. I have succeeded, with help, in
doing so and here is the code.

Thanks for your help!

--- Code ---

# This program was created with feedback from smeghead, sirup and aum of I2P,
# and also tiissa and John Machin of comp.lang.python.
# Thank you very much.
# I still get the odd error in this, but that was 1 error in 2500 files
# sorted. Make sure you have a directory under c:/test/ called 'aa' and
# have your files in c:/test/
# I release this code into the public domain :o), send feedback to
# sm***********@gmail.com
import os
import shutil

os.chdir('/test')
aa = 'aa'   # destination directory; it must already exist
y = 20      # prefix length to compare, counted down to 3
while y != 2:
    print(y)
    # Re-read the directory on every pass: earlier passes moved files away.
    List = os.listdir('/test/')
    # Sentinel entries so that the first and last names have a neighbour.
    List.append("A111111111111")
    List.sort()
    List.append("Z111111111111")
    ListLength = len(List) - 1
    x = 0
    while x < ListLength:
        b = List[x]
        f = b[:y]                    # prefix of the current name
        g = List[x + 1][:y]          # prefix of the next name
        backward3 = List[x - 1][:y]  # prefix of the previous name
        # Move the file if it shares its prefix with either neighbour.
        if f == g or f == backward3:
            if not os.path.isdir(aa + "/" + f):
                os.mkdir(aa + "/" + f)
            shutil.move(b, aa + "/" + f)
        x = x + 1
    y = y - 1

--- End Code ---
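
For readers on a current Python, the same idea can be sketched with `itertools.groupby` and `pathlib`. This is not the script above, just one possible variation under stated assumptions: the function name `sort_into_dirs` and its parameters are invented for the example, only regular files are considered, and a prefix gets a directory only when at least two files share it.

```python
import itertools
import shutil
from pathlib import Path


def sort_into_dirs(source, dest, longest=20, shortest=3):
    """Move files from source into dest/<prefix>/ for every prefix shared by 2+ files."""
    source, dest = Path(source), Path(dest)
    # Try the longest prefix first, then shorter and shorter ones.
    for n in range(longest, shortest - 1, -1):
        # Re-list on every pass: files moved for a longer prefix are already gone.
        names = sorted(p.name for p in source.iterdir() if p.is_file())
        for prefix, items in itertools.groupby(names, key=lambda name: name[:n]):
            members = list(items)
            if len(members) < 2:
                continue  # nothing shares this prefix at this length
            target = dest / prefix
            target.mkdir(parents=True, exist_ok=True)
            for name in members:
                shutil.move(str(source / name), str(target / name))
```

Called as `sort_into_dirs('/test', '/test/aa')`, it would mirror the layout used above. Because directories are skipped when listing, no sentinel entries are needed. A prefix that ends in a dot or a space may not be a usable directory name on every platform, so real file names might need some cleaning first.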


Jul 19 '05 #9
