
question about nasty regex

I'm wondering if someone can tell me whether the following set of
regex substitutions is possible. I want to convert parallel legal
citations into single citations. What I mean is, I want to change, e.g.:

"Doremus v. Board of Education of Hawthorne, 342 U.S. 429, 434, 72
S. Ct. 394, 397, 96 L.Ed. 475 (1952)."

into:

"Doremus v. Board of Education of Hawthorne, 342 U.S. 429, 434 (1952)."

Generally, the beginning pattern would consist of:

1. Two names, consisting of one or more words, always separated by a
"v."

2. One, two, or three citations, each of which always has a volume
number ("342") followed by a name, consisting of one or two word
units always ending with "." ("U.S."), followed by a page number ("429")

3. Each citation may contain a comma and a second page number (", 434")

4. Optionally, a parenthesized year ("(1952)")

5. A final "."

I am thinking this is impossible, but I thought that if it were
possible to translate this into Python code, someone here could put
me on the right track.

Thanks.
Apr 3 '06 #1
> What I mean is, I want to change, e.g.:
>
> "Doremus v. Board of Education of Hawthorne, 342 U.S. 429, 434, 72
> S. Ct. 394, 397, 96 L.Ed. 475 (1952)."
>
> into:
>
> "Doremus v. Board of Education of Hawthorne, 342 U.S. 429, 434 (1952)."
>
> Generally, the beginning pattern would consist of:
>
> 1. Two names, consisting of one or more words, always separated by a
> "v."
>
> 2. One, two, or three citations, each of which always has a volume
> number ("342") followed by a name, consisting of one or two word
> units always ending with "." ("U.S."), followed by a page number ("429")
>
> 3. Each citation may contain a comma and a second page number (", 434")
>
> 4. Optionally, a parenthesized year ("(1952)")
>
> 5. A final "."

import re

tests = [
    'Doremus v. Board of Education of Hawthorne, 342 U.S. 429, 434, '
    '72 S. Ct. 394, 397, 96 L.Ed. 475 (1952).',
    'Joe v. Volcano, Fork, 123 Internet, et. al, 314 U.S. 123, 43, '
    '88 S. Ct. 394, 397, 97 L.Ed. 459 (2005).',
    'Grandma v. RIAA, 314 U.S. 123, 43, 88 S. Ct. 394, 397, 97 L.Ed. 459.',
]
# Groups: party 1, party 2, U.S. volume, U.S. page(s), remaining citations, year
r = re.compile(r'(.*?)\s+v\.\s+(.*?)\s+(\d+)\s+U\.S\.\s+((?:\d+,\s*)+)\s*(.*?)(\(\d{4}\))?\.$')
results = [r.match(x) for x in tests]
# Note: the "Court" label below actually holds the volume number.
labels = ["Party1", "Party2", "Court", "Pages", "Extra", "Year"]
for i, m in enumerate(results):
    print("Test %i" % i)
    print("=" * 20)
    print("\n".join("%s: %s" % (a, m.group(b)) for b, a in enumerate(labels, 1)))

Test 0
====================
Party1: Doremus
Party2: Board of Education of Hawthorne,
Court: 342
Pages: 429, 434,
Extra: 72 S. Ct. 394, 397, 96 L.Ed. 475
Year: (1952)
Test 1
====================
Party1: Joe
Party2: Volcano, Fork, 123 Internet, et. al,
Court: 314
Pages: 123, 43,
Extra: 88 S. Ct. 394, 397, 97 L.Ed. 459
Year: (2005)
Test 2
====================
Party1: Grandma
Party2: RIAA,
Court: 314
Pages: 123, 43,
Extra: 88 S. Ct. 394, 397, 97 L.Ed. 459
Year: None
Things get a little messy if one of the parties has digits
followed by whitespace, followed by "U.S." in their name,
such as a fictitious "99 U.S. Luftballoons". Caveat
regextor. There are also some places where trailing commas
end up in items (Party2 and Pages above). You may want
to strip them off before reassembling them.

Reassemble the pieces as needed. Season to taste. Bake at
350 for 20-25 minutes until golden brown.
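
For what it's worth, the reassembly step might look something like the sketch below. It reuses the pattern from the post above unchanged; the helper name `to_single_citation` is made up for illustration, and stripping trailing commas and spaces from the second party and the page list is an assumption about what "season to taste" should mean here.

```python
import re

# Pattern copied from the post above. Group order:
# party 1, party 2, U.S. volume, U.S. page(s), remaining citations, year.
CITATION = re.compile(
    r'(.*?)\s+v\.\s+(.*?)\s+(\d+)\s+U\.S\.\s+((?:\d+,\s*)+)\s*(.*?)(\(\d{4}\))?\.$'
)


def to_single_citation(text):
    """Keep only the U.S. citation (hypothetical helper, not from the thread)."""
    m = CITATION.match(text)
    if m is None:
        return text  # leave anything the pattern does not recognise untouched
    party1, party2, volume, pages, _extra, year = m.groups()
    # Strip the trailing commas and spaces the pattern leaves in these groups.
    party2 = party2.rstrip(", ")
    pages = pages.rstrip(", ")
    result = "%s v. %s, %s U.S. %s" % (party1, party2, volume, pages)
    if year:
        result += " " + year
    return result + "."


print(to_single_citation(
    'Doremus v. Board of Education of Hawthorne, 342 U.S. 429, 434, '
    '72 S. Ct. 394, 397, 96 L.Ed. 475 (1952).'
))
```

One thing to watch: the page group only captures numbers that are followed by a comma, so a citation that is already single, such as "342 U.S. 429, 434 (1952).", would come back without its second page number. Check for that case before running this over real data.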

HTH, or at least gets you on the path to regexp mangling.

-tkc


Apr 3 '06 #2
Tim Chase wrote:
>> What I mean is, I want to change, e.g.:
[snip regular expressions lesson]


Whoa. That is super-duper extra cool. Thank you *very* much.
Apr 3 '06 #3
Peter <fa****************@hotmail.com> writes:
>> What I mean is, I want to change, e.g.:
> [snip regular expressions lesson]
>
> Whoa. That is super-duper extra cool. Thank you *very* much.


"Some people, when confronted with a problem, think ``I know, I'll use
regular expressions.'' Now they have two problems." --JWZ

Apr 3 '06 #4
In article <7x************@ruckus.brouhaha.com>,
Paul Rubin <http://ph****@NOSPAM.invalid> wrote:
> "Some people, when confronted with a problem, think ``I know, I'll use
> regular expressions.'' Now they have two problems." --JWZ


Regexes are good if you need a solution quickly, and you're not
processing large amounts of data on a regular basis. (How large is
large? When you're chewing through appreciable amounts of CPU time doing
it.)
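
A rough way to find out whether the matching cost is "appreciable" is simply to time it. The sketch below uses the standard `timeit` module with the pattern posted earlier in the thread; the helper name `seconds_per_match` and the choice of sample line are only for illustration, and `timeit` reports wall-clock time by default rather than pure CPU time.

```python
import re
import timeit

# Pattern and sample line taken from earlier in the thread.
CITATION = re.compile(
    r'(.*?)\s+v\.\s+(.*?)\s+(\d+)\s+U\.S\.\s+((?:\d+,\s*)+)\s*(.*?)(\(\d{4}\))?\.$'
)
SAMPLE = ('Doremus v. Board of Education of Hawthorne, 342 U.S. 429, 434, '
          '72 S. Ct. 394, 397, 96 L.Ed. 475 (1952).')


def seconds_per_match(repeats=1000):
    """Average wall-clock seconds for one match of SAMPLE (hypothetical helper)."""
    total = timeit.timeit(lambda: CITATION.match(SAMPLE), number=repeats)
    return total / repeats


print("%.2e seconds per match" % seconds_per_match())
```

Multiplying that figure by the number of citations to be processed gives a first estimate of whether the regex is worth replacing at all.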

Once you get to that point, it would be more efficient to hand-code your
own state machine to do the parsing. Of course, doing it in an (even
partially) interpreted language like Python or Perl would defeat the
point...
Apr 4 '06 #5
Lawrence D'Oliveiro wrote:
> In article <7x************@ruckus.brouhaha.com>,
> Paul Rubin <http://ph****@NOSPAM.invalid> wrote:
>> "Some people, when confronted with a problem, think ``I know, I'll use
>> regular expressions.'' Now they have two problems." --JWZ
> Regexes are good if you need a solution quickly, and you're not
> processing large amounts of data on a regular basis. (How large is
> large? When you're chewing through appreciable amounts of CPU time doing
> it.)


But "need a solution quickly" in this group is usually interpreted as
saving programmer time, not CPU time. I wouldn't have been able to come
up with that monstrosity nearly as quickly as Tim did, and I wouldn't
even be able to understand it without significant study, and I
definitely would have trouble maintaining it a few months later when I
found a test case which it didn't handle properly. I also wouldn't even
have confidence that it worked perfectly without throwing a dozen test
cases at it...

On the other hand, I could code a hybrid or entirely non-regex solution
in five or ten minutes (with tests!), and it would be quite readable.
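
To give an idea of what a hybrid might look like (a sketch only, not code from anyone in this thread): split the text on commas with plain string methods and use two small patterns, one to peel off the year and one to recognise a single "volume reporter page" item. The names `YEAR`, `CITE` and `first_citation_only` are made up for illustration, and it assumes the citation to keep is always the first one listed.

```python
import re

# Optional parenthesized year at the very end, e.g. " (1952)".
YEAR = re.compile(r'\s*(\(\d{4}\))$')
# Volume, one or two reporter words each ending in ".", then a page number.
CITE = re.compile(r'\d+\s+(?:\S+\.\s+){1,2}\d+')


def first_citation_only(text):
    """Keep the case name, the first citation, its optional second page, and the year."""
    body = text.strip()
    if body.endswith('.'):
        body = body[:-1]
    year = ''
    m = YEAR.search(body)
    if m:
        year = ' ' + m.group(1)
        body = body[:m.start()]
    parts = [p.strip() for p in body.split(',')]
    for i, part in enumerate(parts):
        if CITE.fullmatch(part):
            kept = parts[:i + 1]
            # An all-digit item right after the citation is its second page number.
            if i + 1 < len(parts) and parts[i + 1].isdigit():
                kept.append(parts[i + 1])
            return ', '.join(kept) + year + '.'
    return text  # no citation recognised: return the input unchanged


# The tests are short enough to sit next to the code.
assert first_citation_only(
    'Doremus v. Board of Education of Hawthorne, 342 U.S. 429, 434, '
    '72 S. Ct. 394, 397, 96 L.Ed. 475 (1952).'
) == 'Doremus v. Board of Education of Hawthorne, 342 U.S. 429, 434 (1952).'
assert first_citation_only(
    'A v. B, 342 U.S. 429, 434 (1952).'
) == 'A v. B, 342 U.S. 429, 434 (1952).'
```

Each pattern here does one small job, so a failing test case points at a specific step instead of at one long expression. Spacing after commas is normalised to a single space as a side effect of the split and join.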
> Once you get to that point, it would be more efficient to hand-code your
> own state machine to do the parsing. Of course, doing it in an (even
> partially) interpreted language like Python or Perl would defeat the
> point...


The number of problems for which Python and Perl aren't fast enough is
far smaller than most people think, as is the number of problems for
which regular expressions are really a suitable solution. :-)

-Peter

Apr 4 '06 #6
