Regexp issue . . .

MichaelC

Hi all. I am having a particularly difficult time with a Perl script that I
am writing. The problem area is a place where I need to strip some newlines
out of a file.

My source data is text which is in paragraph form, but has line breaks
within the paragraphs. I need to do as much processing as possible in order
to minimise the amount of manual changes that I have to make.

Sample text is as follows:

"This document is intended to give you an
overview of DG as well as highlight some of
the features. This is a brought to your handheld using DG."
With DG you can view and edit word processing and spreadsheet files on
your handheld. Simple push-button synchronization of
the handheld with the desktop will maintain the most up-to-date
version of a file on both the desktop and handheld.

I want these to be parsed as follows:

"This document is intended to give you an overview of DG as well as
highlight some of the features. This is a brought to your handheld using
DG." With DG you can view and edit word processing and spreadsheet files on
your handheld. Simple push-button synchronization of the handheld with the
desktop will maintain the most up-to-date version of a file on both the
desktop and handheld.

--

One way that I thought might work is to catch all lines that begin with an
upper-case letter, prepend a line break to them, and strip the trailing
break, then trap all lines that start with a lower-case letter and dump them
as-is. Repeat this until the lower-case test makes no more matches, then
clean up all those extra line breaks.

I came up with this . . . but all it seems to do is strip all newlines out.

while( <infl> ) {

my $x = $_;
if ( $x =~ ?^[^a-z]? ) { $x =~ s!(.*)\n!\n\1 ! }
else { $x =~ s!(.*)\n!\1 ! }
print outfl $x;
}
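
A possible explanation for the symptom, offered as a sketch rather than a confirmed diagnosis of the script above: in Perl, a pattern delimited by question marks is the match-once operator, which succeeds at most one time until reset() is called. If that is what happens here, only the first line takes the "if" branch and every later line has its newline replaced by a space. The demo below uses made-up input lines and the modern spelling m?...? (a bare ?...? is a syntax error from Perl 5.22 on).

```perl
use strict;
use warnings;

# Hypothetical input lines, not taken from the original file.
my @lines = ("First line\n", "Second line\n", "third line\n");

my (@once, @every);
for my $line (@lines) {
    # m?...? is the match-once operator: after its first success it
    # fails until reset() is called, whatever the input looks like.
    my $matched_once = 0;
    if ($line =~ m?^[^a-z]?) { $matched_once = 1 }
    push @once, $matched_once;

    # An ordinary match is evaluated afresh on every line.
    push @every, ($line =~ m/^[^a-z]/ ? 1 : 0);
}

print "match-once: @once\n";    # prints "match-once: 1 0 0"
print "ordinary:   @every\n";   # prints "ordinary:   1 1 0"
```

Swapping the question-mark delimiters for m!...! or /.../ gives the ordinary behaviour.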

Any help would be greatly appreciated.

Michael

Jul 19 '05 #1


Eric J. Roode


"MichaelC" <mi****@NOshaSPAMw.ca> wrote in
news:d9Dwb.492453$9l5.241927@pd7tw2no:

> Hi all. I am having a particularly difficult time with a perl script
> that I am writing. The problem area is a place where I need to strip
> some newlines out of a file.
>
> My source data is text which is in paragraph form, but has line breaks
> within the paragraphs. I need to do as much processing as possible in
> order to minimise the amount of manual changes that I have to make.

You don't say what you mean by "paragraph form". If you're using that
term in the usual sense, then you mean that the paragraphs have double
newlines between them. Is that so? If so, Perl can read
paragraph-at-a-time for you:

$/ = '';
$paragraph = <>;
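
To illustrate what paragraph mode does, here is a minimal self-contained sketch. It assumes the input really does separate paragraphs with blank lines, and it reads from a made-up in-memory string instead of a file so that it can be run as-is.

```perl
use strict;
use warnings;

# Made-up sample: two paragraphs separated by a blank line.
my $text = "first line\nsecond line\n\nthird line\nfourth line\n";

# Open an in-memory filehandle on the string so the sketch needs no file.
open(my $fh, '<', \$text) or die "Cannot open in-memory handle: $!";

my @paragraphs;
{
    # The empty string switches readline to paragraph mode: each read
    # returns everything up to the next run of blank lines.
    local $/ = '';
    while (my $paragraph = <$fh>) {
        $paragraph =~ s/\n+\z//;    # drop the trailing newlines
        $paragraph =~ s/\n/ /g;     # join the wrapped lines
        push @paragraphs, $paragraph;
    }
}
close($fh);

print "$_\n" for @paragraphs;
```

Using "local" keeps the change to $/ confined to the block, so later reads in the same program go back to line-at-a-time.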

--
Eric
$_ = reverse sort $ /. r , qw p ekca lre uJ reh
ts p , map $ _. $ " , qw e p h tona e and print


Jul 19 '05 #2

MichaelC


Sorry, I thought that I had defined my problem in
enough detail. My problem is that the text that I am
processing does NOT have double line breaks
between paragraphs, and the text has been presented
wrapped to 72 character width. I do not have access
to the original, as it was lost. That is the reason for
my current problem.
That said, statistically, in the text that I am processing,
the vast majority of lines that start with the set [A-Z"]
will start a new paragraph. The converse is also true,
in that lines that start [a-z,.!?] are definitely part of a
logical paragraph. In that sense, I am not using the
term "paragraph" in the way that you normally assume.

As a concrete example, the explanation above is a reasonable simulation of
the problem that I am facing. Logically, the manually broken text is two
paragraphs with no extra line breaks between them. I neither require nor
desire double line breaks between paragraphs. What I do need, though, is
each paragraph on a single line with a single line break at the end, and
ONLY there.

For example, I need to strip all but two line breaks out of the example that
I have provided, so that the text is contiguous from "Sorry, I" to "current
problem." and from "That said, " to "normally assume." After some thought,
I found a solution:

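#!/usr/bin/perl
use strict;
use warnings;

# Lexical filehandles, three-argument open, and error checks.
open(my $infl, '<', 'in.txt') or die "Cannot open in.txt: $!";
open(my $outfl, '>', 'out.txt') or die "Cannot open out.txt: $!";

while ( my $x = <$infl> ) {

# A line starting with a capital letter or a double quote begins a new paragraph.
if ( $x =~ m!^[A-Z"]! ) { print $outfl "\n"; }
# Replace the line's trailing newline with a space ($1, not \1, in the replacement).
$x =~ s!(^.+)\n!$1 !;

print $outfl $x;
}

close($infl);
close($outfl);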
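
One side effect of the loop above is that it prints a newline before the first paragraph and leaves a trailing space at the end of each one. The same heuristic can be sketched so that each paragraph comes out as one clean line. This is only a sketch under the [A-Z"] assumption stated earlier: join_paragraphs is a hypothetical helper, and the sample lines are shortened from the wrapped text in this post.

```perl
use strict;
use warnings;

# Hypothetical helper: merge hard-wrapped lines into one line per paragraph.
# A line starting with A-Z or a double quote is taken to open a new paragraph.
sub join_paragraphs {
    my @lines = @_;
    my @paragraphs;
    for my $line (@lines) {
        chomp(my $text = $line);
        next if $text eq '';                      # ignore empty lines
        if (!@paragraphs or $text =~ /^[A-Z"]/) {
            push @paragraphs, $text;              # start a new paragraph
        }
        else {
            $paragraphs[-1] .= " $text";          # continue the current one
        }
    }
    return @paragraphs;
}

my @wrapped = (
    "Sorry, I thought that I had defined my problem in\n",
    "enough detail. My problem is that the text that I am\n",
    "That said, statistically, in the text that I am processing,\n",
    "the vast majority of lines that start with the set [A-Z\"]\n",
);

# Each paragraph is printed on a single line with one newline at the end.
print "$_\n" for join_paragraphs(@wrapped);
```

Collecting the paragraphs in an array before printing also makes it easy to review the doubtful joins by hand.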

Thanks,

Michael

Jul 19 '05 #3

Eric J. Roode


"MichaelC" <mi****@NOshaSPAMw.ca> wrote in
news:s5Vwb.496786$pl3.155625@pd7tw3no:

> Sorry, I thought that I had defined my problem in
> enough detail.

I would say not. :-)

> My problem is that the text that I am
> processing does NOT have double line breaks
> between paragraphs, and the text has been presented
> wrapped to 72 character width. I do not have access
> to the original, as it was lost. That is the reason for
> my current problem.
> That said, statistically, in the text that I am processing,
> the vast majority of lines that start with the set [A-Z"]
> will start a new paragraph. The converse is also true,
> in that lines that start [a-z,.!?] are definitely part of a
> logical paragraph. In that sense, I am not using the
> term "paragraph" in the way that you normally assume.

It sounds like you want to remove all newlines, except where the newline
is followed by an uppercase character. Is that correct?

If so, I'd suggest reading the entire file into memory, and doing a
simple substitution on it:

$/ = undef;
$content = <FILE>;
$content =~ s/\n(?![[:upper:]])//g;
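
One thing to watch for with the substitution above: it deletes each newline outright, so the last word of one line runs into the first word of the next, and a line that opens with a double quote is treated as a continuation. A hedged variant follows, assuming the [A-Z"] rule from the earlier post and using a made-up string in place of the slurped file contents.

```perl
use strict;
use warnings;

# Made-up sample standing in for the slurped file contents.
my $content = "\"This document is intended to give you an\n"
            . "overview of DG.\"\n"
            . "With DG you can view and edit files on\n"
            . "your handheld.\n";

# Replace a newline with a space unless the next line starts with an
# upper-case letter or a double quote, or the newline ends the text.
$content =~ s/\n(?![[:upper:]"]|\z)/ /g;

print $content;
```

The \z alternative in the lookahead keeps the final newline, so the last paragraph still ends with a line break.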

--
Eric
$_ = reverse sort $ /. r , qw p ekca lre uJ reh
ts p , map $ _. $ " , qw e p h tona e and print


Jul 19 '05 #4
