Regular expression help needed

Karin Jensen

Hi

I am writing in PHP and trying to work with regular expressions
on records in a multilanguage database. I understand regexp basics,
but have bitten off more than I can chew here and really need help.

The problem is to do with generating all strings that match a
pattern defined in terms of brackets and slashes. Here, brackets
mean that something is optional and slashes are used to show
alternatives.

In the examples below, all the matches I need to generate are
shown below the original string.

Here are some simple brackets:

aardappel(en)
    aardappel
    aardappelen

homard (à la) bordelaise
    homard bordelaise
    homard à la bordelaise

And brackets used in combination with slashes:

croquant(e)/(s)
    croquant
    croquante
    croquants
    croquantes

agrio/a/(s)
    agrio
    agria
    agrios
    agrias

Words can also be bracketed:

céréale(s) (froide(s))
    céréale
    céréales
    céréale froide
    céréales froides
    céréale froides
    céréales froide
(even though these last two aren't proper French)

And slashes used for whole word alternatives:

jablečný závin/štrúdl
    jablečný závin
    jablečný štrúdl

cuerpo, con/de
    cuerpo, con
    cuerpo, de

douillon(s)/douillen(s)
    douillon
    douillons
    douillen
    douillens

Maybe the expression would have to choose a whole-word substitution
rather than a single-letter one when the "slashed" term is more than a
single letter? For example, "cuerpo, con/de" isn't "cuerpo, cde".

I think I can see how to generate a set of rules based on the
above, but I have no idea how to implement the logic of them using
preg functions.
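
For what it's worth, the rules implied by the examples above can be sketched with plain string handling rather than one big preg pattern. This is only an illustration: the function names combine(), expandRaw() and expandPattern() are made up, the input is assumed to be valid UTF-8 with balanced brackets, and the slash rules are inferred from the examples (a slash before a bracket is just a separator, a one-letter term replaces the preceding letter, a longer term replaces the whole preceding word).

```php
<?php
// Illustrative sketch only: all function names are hypothetical, and the
// slash rules are inferred from the examples in this post.

/** Build every concatenation that takes one alternative from each item. */
function combine(array $items): array
{
    $out = [''];
    foreach ($items as $alternatives) {
        $next = [];
        foreach ($out as $prefix) {
            foreach ($alternatives as $alt) {
                $next[] = $prefix . $alt;
            }
        }
        $out = $next;
    }
    return $out;
}

/** Expand one pattern into its raw variants (spacing not yet tidied). */
function expandRaw(string $pattern): array
{
    // Split into UTF-8 characters so accented letters stay intact.
    $chars = preg_split('//u', $pattern, -1, PREG_SPLIT_NO_EMPTY);
    $n = count($chars);
    $items = [];     // each item is a list of alternative strings
    $wordStart = 0;  // index in $items where the current word begins
    for ($i = 0; $i < $n; $i++) {
        $c = $chars[$i];
        if ($c === '(') {
            // Brackets: collect up to the matching ")" and make it optional.
            $depth = 1;
            $inner = '';
            while (++$i < $n) {
                if ($chars[$i] === '(') {
                    $depth++;
                } elseif ($chars[$i] === ')' && --$depth === 0) {
                    break;
                }
                $inner .= $chars[$i];
            }
            $items[] = array_merge([''], expandRaw($inner));
        } elseif ($c === '/') {
            if ($i + 1 < $n && $chars[$i + 1] === '(') {
                continue; // "/(s)": the slash only separates optional parts
            }
            // Read the alternative up to the next top-level space or slash.
            $term = '';
            $length = 0;
            $depth = 0;
            while ($i + 1 < $n) {
                $d = $chars[$i + 1];
                if ($depth === 0 && ($d === ' ' || $d === '/')) {
                    break;
                }
                if ($d === '(') {
                    $depth++;
                } elseif ($d === ')') {
                    $depth--;
                }
                $term .= $d;
                $length++;
                $i++;
            }
            if ($length === 0) {
                continue; // stray slash with nothing after it
            }
            if ($length === 1 && count($items) > $wordStart) {
                // One letter: an alternative to the preceding letter only.
                $items[count($items) - 1][] = $term;
            } else {
                // Longer term: an alternative to the whole preceding word.
                $word = combine(array_splice($items, $wordStart));
                $items[] = array_merge($word, expandRaw($term));
            }
        } else {
            $items[] = [$c];
            if ($c === ' ') {
                $wordStart = count($items);
            }
        }
    }
    return combine($items);
}

/** Entry point: expand, tidy the spacing, drop duplicates. */
function expandPattern(string $pattern): array
{
    $tidy = function (string $s): string {
        return trim(preg_replace('/\s+/u', ' ', $s));
    };
    return array_values(array_unique(array_map($tidy, expandRaw($pattern))));
}
```

Calling expandPattern('agrio/a/(s)') gives the four forms agrio, agrios, agria and agrias, and expandPattern('cuerpo, con/de') gives "cuerpo, con" and "cuerpo, de". The one-letter rule is a heuristic, so an entry whose alternative really is a one-letter word would need special handling.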

Help!
And thanks in advance...

Karin

Aug 18 '05 #1

Will Woodhull

I'd like more context about how you will be using these grammars. Can't
say a whole lot without knowing more.

If the human-readable form is not carved in stone, I can offer these
suggestions:

Using square brackets rather than parens for optional content would be
consistent with what people see in spreadsheet textbooks and other
computer training aids:

    aardappel(en) becomes aardappel[en]

Using the vertical pipe character for alternatives would also be more
consistent:

    agrio/a becomes agrio|a

Since parens are no longer used for optional content, they can be used
for grouping, which improves readability:

    agrio|a can also be written agri(o|a)

It also makes it easy to disambiguate things:

    cuerpo, con/de becomes cuerpo, (con|de)

If it is possible to do without a human-readable form, then all these
rules could be expressed directly in regex strings, suitable for
plugging into preg_match() and preg_replace():

    aardappel(en) becomes '/aardappel(en)?/'
    cuerpo, con/de becomes '/cuerpo, (con|de)/'
    douillon(s)/douillen(s) becomes '/douill[eo]ns?/'
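
As a rough sketch of how such strings might be used with preg_match(): the version below adds ^ and $ anchors and the u modifier, which are assumptions on my part and not part of the regexes above. Without the anchors, '/aardappel(en)?/' would also match inside a longer word. The names $patterns and matchesPattern() are hypothetical.

```php
<?php
// Hypothetical look-up table: database notation => anchored regex.
$patterns = [
    'aardappel(en)'           => '/^aardappel(en)?$/u',
    'cuerpo, con/de'          => '/^cuerpo, (con|de)$/u',
    'douillon(s)/douillen(s)' => '/^douill[eo]ns?$/u',
];

/** True when the candidate string is one of the variants the regex allows. */
function matchesPattern(string $regex, string $candidate): bool
{
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    return preg_match($regex, $candidate) === 1;
}
```

For example, matchesPattern($patterns['cuerpo, con/de'], 'cuerpo, de') is true, while 'cuerpo, cde' is rejected.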

PS: I think 'agrio/a/(s)' would also match 'agri' and 'agris' according
to the rules? I think what was wanted was 'agrio/a(s)'.

Aug 18 '05 #2

Karin Jensen

Hi Will

Very many thanks for taking the time to reply. I should have
explained more about the context of things, sorry. A friend has
created the database and I am implementing its searchable, web-based
version.

The search patterns wouldn't be seen by the users. I could try to
talk my friend into changing them to regular-expression-friendly
versions, but I am not sure how readable she would find them for her
own use.

Thanks for your help - and good point about agrio/a(s)!

Best wishes and thanks again,

Karin

Aug 18 '05 #3

Will Woodhull

Karin Jensen wrote:

The search patterns wouldn't be seen by the users. I could try to
talk my friend into changing them to regular-expression-friendly
versions, but I am not sure how readable she would find them for her
own use.

She wouldn't find regexes very useful in desk work.

The end result has to be a function that for each of your friend's
'search patterns' would convert the many variants to a single token.
Both "aardappel" and "aardappelen" are converted to some arbitrary and
unique value like "{AARDAPPEL}". This tokenizer function is applied to
both the search string given by the visitor and to each target as it is
pulled from the database, then the tokenized versions of these are
compared.

The tokenizer has to work with preg_replace() regex, but the database
provides the search patterns in a different syntax-- and sometimes
there are ambiguities in that syntax. For this and for a number of
other reasons it makes sense to build the logic into a look-up table,
such as a predefined array

    # look-up from database search string to regex
    $rule['aardappel(en)'] = '/aardappel(en)?/';
    $rule['agrio/a(s)'] = '/agri[oa]s?/';
    # \s matches the space after the comma
    $rule['cuerpo, con/de'] = '/cuerpo,\s(con|de)/';

A second predefined look-up table using the same keys can hold the
tokens

    $token['aardappel(en)'] = '{AARDAPPEL}';
    $token['agrio/a(s)'] = '{AGRIO}';
    $token['cuerpo, con/de'] = '{CUERPO}';

Then the PHP logic would be something like

    function tokenize($givenstring) {
        # the look-up tables live at file scope, so import them here
        global $rule, $token;
        $tokenized = $givenstring;
        foreach ($rule as $key => $regex) {
            # swap every variant this rule matches for its token
            $tokenized = preg_replace($regex, $token[$key], $tokenized);
        }
        return $tokenized;
    }

This approach would be easy to debug, maintain, and extend. It would
also be possible to move the look-up tables out of PHP and into the
database-- which might be a good or bad thing to do.
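
The comparison step described above (tokenize both the visitor's search string and the database target, then compare) might be sketched like this. It is a self-contained variation, not the exact tables from this post: $rules, tokenizeWith() and sameEntry() are hypothetical names, and the \b word boundaries and u modifier are my own additions to keep a rule from firing inside a longer word.

```php
<?php
// Hypothetical table: each regex for the variants maps to its token.
$rules = [
    '/\baardappel(en)?\b/u'    => '{AARDAPPEL}',
    '/\bagri[oa]s?\b/u'        => '{AGRIO}',
    '/\bcuerpo,\s(con|de)\b/u' => '{CUERPO}',
];

/** Replace every variant found in $text with its token. */
function tokenizeWith(array $rules, string $text): string
{
    // preg_replace() accepts parallel arrays of patterns and replacements.
    return preg_replace(array_keys($rules), array_values($rules), $text);
}

/** True when search string and target reduce to the same tokenized form. */
function sameEntry(array $rules, string $search, string $target): bool
{
    return tokenizeWith($rules, $search) === tokenizeWith($rules, $target);
}
```

With these rules, sameEntry($rules, 'aardappelen', 'aardappel') is true because both sides become '{AARDAPPEL}', while 'aardappel' and 'agrio' stay different.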

HTH

Aug 19 '05 #4

Joachim Weiß

Hi Karin,
Hi Will

I think the problem can't really be solved with regular expressions,
perhaps because the creators of human language didn't know PHP ;-)

Just to put in my 2 cents:

1. Generate an index with the soundex() of the field you want to have
searched. (Because most of the differences between the words are at
the end, I suggest using only the first few characters of each word.)

Soundex has been adapted for several languages.

2. You will find many matches in your database. Now use levenshtein()
to sort the results.

Surely it won't handle every query. But as a way to find different
declensions of words it might be sufficient, and it is much easier
than implementing a rule set for a natural language in PHP.
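
The two steps could be sketched roughly as follows, using PHP's built-in soundex() and levenshtein(). The function name fuzzyRank() is hypothetical, and the candidate list stands in for whatever the database index would return. Note that soundex() already reduces each word to a four-character key, that it was designed around English spelling, and that levenshtein() compares bytes, so accented letters count for more than one edit.

```php
<?php
/**
 * Keep the candidates whose Soundex key equals that of the search term,
 * then order them by Levenshtein distance to the search term.
 */
function fuzzyRank(string $search, array $candidates): array
{
    $key = soundex($search);
    $hits = [];
    foreach ($candidates as $word) {
        if (soundex($word) === $key) {
            $hits[$word] = levenshtein($search, $word);
        }
    }
    asort($hits); // smallest edit distance first
    return array_keys($hits);
}
```

For instance, fuzzyRank('aardappel', ['aardappelen', 'aardappel', 'croquant']) keeps the two aardappel forms, with the exact match first.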
HIH

Jo

Aug 19 '05 #5

Will Woodhull

Hi Jo,
Joachim Weiß wrote:

I think the problem can't really be solved with regular expressions,
<snip>
Just to put in my 2 cents:

1. Generate an index with the soundex() of the field you want to have

I haven't worked with soundex, but I don't think it is the solution
here. Soundex would return very different values for some of the
expressions that Karin will be working with. At other times a soundex
approach will run into a new set of difficulties with homophones--
words that have the same sound but very different meanings-- that I
think would cause severe problems.
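
Both concerns are easy to check with PHP's built-in soundex(). A small sketch, where sameSoundKey() is a hypothetical helper and the English word pair is just an illustration, not data from Karin's database:

```php
<?php
/** True when PHP's built-in soundex() gives both words the same key. */
function sameSoundKey(string $a, string $b): bool
{
    return soundex($a) === soundex($b);
}

// The two alternatives of one entry ("cuerpo, con/de") get different keys,
// because a Soundex key always starts with the word's first letter.
var_dump(sameSoundKey('con', 'de'));      // bool(false)

// Two words with different meanings can still share a key.
var_dump(sameSoundKey('their', 'there')); // bool(true)
```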

Historically the approach to this kind of variant recognition problem
is to reduce all the variants in the target and in each possible match
to the same tokens and perform the test on these tokenized
representations. In this particular case, the problem is compounded
because the variants are expressed in a type of human-recognizable
grammar that will need to be translated, somehow, into expressions the
computer understands. These are all pattern recognition problems in
text strings-- which is exactly the kind of problem that regular
expressions were designed to handle.
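
One low-tech way to do that translation, assuming the variants of an entry have already been listed out (by hand or by some expansion routine), is to build an alternation from them. This is only a sketch: variantsToRegex() is a hypothetical name, and the lookarounds that stop a match inside a longer word are my own choice.

```php
<?php
/** Turn a list of literal variants into one regex that matches any of them. */
function variantsToRegex(array $variants): string
{
    // Longest first, so "aardappelen" is tried before "aardappel".
    usort($variants, function (string $a, string $b): int {
        return strlen($b) <=> strlen($a);
    });
    // Escape regex metacharacters (and the "/" delimiter) in each variant.
    $quoted = array_map(function (string $v): string {
        return preg_quote($v, '/');
    }, $variants);
    // The lookarounds reject matches that sit inside a longer word.
    return '/(?<![\p{L}\p{N}])(?:' . implode('|', $quoted) . ')(?![\p{L}\p{N}])/u';
}
```

The result can be handed to preg_replace() together with a token, for example preg_replace(variantsToRegex(['aardappel', 'aardappelen']), '{AARDAPPEL}', $text).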

Aug 22 '05 #6
