
Help sought Perl with a bit of REGEX

I am working on a script to process a large number of old electoral records.
There are about 100,000 records in all, but here is a representative sample.

BTW, hd = household duties
ALLISON, Winifred hd
BRACKENREG, Helen & James hd & lands officer
MARSHALL, Margaret, Charles & Herbert hd, ganger & tractor driver

Note that the first names are in the same sequence as the occupations. An
occupation may consist of one or two words, e.g. 'hd' or 'tractor driver'. The
last of these sample records has 3 'Marshalls' (Margaret, Charles and Herbert),
though other records include up to six family members. In all cases there is
a pattern:

1 person . . . occupation is immediately followed by a line return
(naturally)
2 people . . . first occupation is followed by an '&', last occupation by a
line return
3 or more people . . . the first and up to the second last occupation are
followed by commas, and the remainder of the line follows the aforementioned
patterns

My initial thoughts:
Use a global REGEX that would step through and match the next occupation, but
it has not proved that easy. I need a way to move the 'matching point' forward
to an ampersand, comma or line return depending on context. Could anyone
provide some insights into whether RE can provide this level of control, or
point me to a more appropriate solution?

Here is the relevant code snippet:
# Preceding code deals with last name, addresses etc. This part works well.

@matches = (m/\s([A-Z][a-z]+\s)/g); # holds all first names for a record

foreach $FirstName (@matches) {

    (m/(\s[a-z]+(.*?)(&|,|$))/g); # fails to match anything but the first occupation

    $Occupation = $1; # should store the next matching occupation with each successive loop

    print("\"$FirstName\",\"$Occupation\"\n");

}

Jul 22 '06 #1
On 07/22/2006 02:56 AM, Chris Newman wrote:
[quoted question snipped]

The newsgroup comp.lang.perl is defunct; comp.lang.perl.misc
is where the action is.

I like to break problems into pieces and eat away at them
piece-by-piece. For this problem, I'd use the s/// operator to
match and remove parts of the string that I'm looking for.
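
As a small illustration of that idea (a sketch only; the variable names are made up for the demonstration, and the record is one of the samples from the question): s/// returns true when it substitutes, so a single statement can test for a piece, capture it, and remove it from the front of the string.

```perl
use strict;
use warnings;

# Sample record from the question; $record and $family are illustrative names.
my $record = "ALLISON, Winifred hd";

my $family;
# s/// returns true on success; $1 holds the captured family name,
# and the matched text is removed from the front of $record.
if ($record =~ s/^([A-Z]+),\s*//) {
    $family = $1;
}

print "family: $family\n";   # family: ALLISON
print "rest:   $record\n";   # rest:   Winifred hd
```

Whatever is left in the string after each step is the input for the next step.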

Your strings are organized like so: <family-name> <first-names>
<occupations>. So I'd suggest stripping off (while matching)
the family name first, followed by the first names, followed
by the occupations. And since '&' seems to serve the same
function as the comma, I'd convert all &'s to commas before
doing the real work, e.g.

use strict;
use warnings;
use Data::Dumper;

my $data = q{
ALLISON, Winifred hd
BRACKENREG, Helen & James hd & lands officer
MARSHALL, Margaret, Charles & Herbert hd, ganger & tractor driver
};

# Read the sample records from an in-memory file.
open(my $fh, "<", \$data) or die("Couldn't open in-memory file.\n");

while (my $line = <$fh>) {
    $_ = $line;
    s/^\s+//;
    s/\s+$//;
    next if m/^$/;

    my ($fam, @names, @occup);
    s/\s*&\s*/, /g;                         # treat '&' as just another comma
    if (s/^([A-Z]+),\s*//) { $fam = $1 }    # strip off the family name
    # Strip off capitalised first names, then lower-case occupations,
    # dropping the separator and surrounding spaces each time.
    while (s/^([A-Z][a-z]+)\s*,?\s*//) { push @names, $1 }
    while (s/^([a-z]+(?: [a-z]+)*)\s*,?\s*//) { push @occup, $1 }

    print Data::Dumper->Dump([$fam, \@names, \@occup],
        [qw(family names occupations)]);
}

close $fh;
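
On the original question of moving the 'matching point' forward: a regex can do that too. In scalar context, m//g resumes where the previous match on the same string left off, the \G anchor pins the next match to that position, and the /c modifier keeps the position when a match fails. What follows is only a rough sketch, assuming the record layout described in the question; parse_record is a made-up helper name, and real data will probably need extra cases (hyphenated names, initials and so on).

```perl
use strict;
use warnings;

# parse_record is a hypothetical helper, not part of any module.
sub parse_record {
    my ($rec) = @_;
    my (@names, @occup);

    # Family name: capitals followed by a comma.
    $rec =~ /\G\s*([A-Z]+),\s*/gc or return;
    my $fam = $1;

    # First names: capitalised words separated by ',' or '&'.
    while ($rec =~ /\G([A-Z][a-z]+)\s*[,&]?\s*/gc) { push @names, $1 }

    # Occupations: one or more lower-case words, same separators.
    while ($rec =~ /\G([a-z]+(?: [a-z]+)*)\s*[,&]?\s*/gc) { push @occup, $1 }

    return ($fam, \@names, \@occup);
}

my ($fam, $names, $occup) =
    parse_record("MARSHALL, Margaret, Charles & Herbert hd, ganger & tractor driver");

# Names and occupations are in the same order, so pair them by index.
for my $i (0 .. $#$names) {
    my $job = defined $occup->[$i] ? $occup->[$i] : '';
    print qq{"$fam","$names->[$i]","$job"\n};
}
```

For the MARSHALL record this prints one quoted, comma-separated line per person: Margaret with hd, Charles with ganger, and Herbert with tractor driver. Unlike the s/// version, the string itself is left untouched; only its match position moves.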
Jul 22 '06 #2
