Regexp issue . . .

MichaelC

Hi all. I am having a particularly difficult time with a perl script that I
am writing. The problem area is a place where I need to strip some newlines
out of a file.

My source data is text which is in paragraph form, but has line breaks
within the paragraphs. I need to do as much processing as possible in order
to minimise the amount of manual changes that I have to make.

Sample text is as follows:

"This document is intended to give you an
overview of DG as well as highlight some of
the features. This is a brought to your handheld using DG."
With DG you can view and edit word processing and spreadsheet files on
your handheld. Simple push-button synchronization of
the handheld with the desktop will maintain the most up-to-date
version of a file on both the desktop and handheld.

I want these to be parsed as follows:

"This document is intended to give you an overview of DG as well as
highlight some of the features. This is a brought to your handheld using
DG." With DG you can view and edit word processing and spreadsheet files on
your handheld. Simple push-button synchronization of the handheld with the
desktop will maintain the most up-to-date version of a file on both the
desktop and handheld.

--

One way that I thought might work is to catch all lines that begin upper
case, prepend them with a line break, strip the trailing break, then trap
all lines that start lower case and dump them as-is. Repeat this until no
matches are made on the lower case test, then clean up all those extra line
breaks.

I came up with this . . . but all it seems to do is strip all newlines out.

while( <infl> ) {

my $x = $_;
if ( $x =~ ?^[^a-z]? ) { $x =~ s!(.*)\n!\n\1 ! }
else { $x =~ s!(.*)\n!\1 ! }
print outfl $x;
}
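
A likely cause, offered as a sketch rather than a confirmed diagnosis: in Perl, `?PATTERN?` is the match-once operator. It succeeds only one time until `reset` is called, so after the first capitalised line every later line falls through to the `else` branch and simply loses its newline. The sample lines below are made up for illustration, and the `m` prefix is used because Perl 5.22 and later reject the bare `?...?` form.

```perl
use strict;
use warnings;

# Hypothetical sample lines; two of them start with an uppercase letter.
my @lines = ("First line\n", "continues here\n", "Second para\n");

# Counts how many lines each kind of match reports as starting a paragraph.
sub count_matches {
    my ($once, $normal) = (0, 0);
    for my $line (@lines) {
        $once++   if $line =~ m?^[^a-z]?;   # match-once: succeeds a single time
        $normal++ if $line =~ m!^[^a-z]!;   # ordinary match: tested on every line
    }
    return ($once, $normal);
}

my ($once, $normal) = count_matches();
print "match-once: $once, ordinary: $normal\n";
```

Switching the delimiters to `m!...!` or `/.../` should make the test run on every line.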

Any help would be greatly appreciated.

Michael

Jul 19 '05 #1

Eric J. Roode

"MichaelC" <mi****@NOshaSP AMw.ca> wrote in
news:d9Dwb.4924 53$9l5.241927@p d7tw2no:

> Hi all. I am having a particularly difficult time with a perl script
> that I am writing. The problem area is a place where I need to strip
> some newlines out of a file.
>
> My source data is text which is in paragraph form, but has line breaks
> within the paragraphs. I need to do as much processing as possible in
> order to minimise the amount of manual changes that I have to make.

You don't say what you mean by "paragraph form". If you're using that
term in the usual sense, then you mean that the paragraphs have double
newlines between them. Is that so? If so, Perl can read paragraph-at-a-
time for you:

$/ = '';
$paragraph = <>;
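
For anyone who wants to try paragraph mode, here is a minimal sketch. It assumes the input really does have blank lines between paragraphs; the sample text, the in-memory filehandle, and the `read_paragraphs` name are all made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample: two paragraphs separated by a blank line.
my $text = "one\ntwo\n\nthree\nfour\n";

# Reads the text one paragraph at a time and unwraps each paragraph.
sub read_paragraphs {
    my ($input) = @_;
    open(my $fh, '<', \$input) or die "Cannot open in-memory handle: $!";
    local $/ = '';                 # paragraph mode: blank lines separate records
    my @paragraphs;
    while (my $para = <$fh>) {
        $para =~ s/\n+\z//;        # drop the newlines that end the record
        $para =~ s/\n/ /g;         # join the wrapped lines with spaces
        push @paragraphs, $para;
    }
    close($fh);
    return @paragraphs;
}

print "$_\n" for read_paragraphs($text);
```

Using `local` keeps the change to `$/` confined to the subroutine, so other reads in the script still work line by line.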

- --
Eric
$_ = reverse sort $ /. r , qw p ekca lre uJ reh
ts p , map $ _. $ " , qw e p h tona e and print

Jul 19 '05 #2

MichaelC

"Eric J. Roode" <RE***********@ comcast.net> wrote in message
news:Xn******** *************** **@216.196.97.1 36...

> "MichaelC" <mi****@NOshaSP AMw.ca> wrote in
> news:d9Dwb.4924 53$9l5.241927@p d7tw2no:
>
>> Hi all. I am having a particularly difficult time with a perl script
>> that I am writing. The problem area is a place where I need to strip
>> some newlines out of a file.
>>
>> My source data is text which is in paragraph form, but has line breaks
>> within the paragraphs. I need to do as much processing as possible in
>> order to minimise the amount of manual changes that I have to make.
>
> You don't say what you mean by "paragraph form". If you're using that
> term in the usual sense, then you mean that the paragraphs have double
> newlines between them. Is that so? If so, Perl can read paragraph-at-a-
> time for you:
>
> $/ = '';
> $paragraph = <>;

Sorry, I thought that I had defined my problem in
enough detail. My problem is that the text that I am
processing does NOT have double line breaks
between paragraphs, and the text has been presented
wrapped to 72 character width. I do not have access
to the original, as it was lost. That is the reason for
my current problem.
That said, statistically, in the text that I am processing,
the vast majority of lines that start with the set [A-Z"]
will start a new paragraph. The converse is also true,
in that lines that start [a-z,.!?] are definitely part of a
logical paragraph. In that sense, I am not using the
term "paragraph" in the way that you normally assume.

As an object example, the explanation above is a reasonable simulation of
the problem that I am facing. Logically, the manually broken text is two
paragraphs with no extra line breaks between them. I neither require nor
desire double line breaks between paragraphs; what I do need, though, is
each paragraph on a single line with a single line break at the end, and
ONLY there.

For example, I need to strip all but two line breaks out of the example that
I have provided, so that the text is contiguous from "Sorry, I" to "current
problem." and from "That said, " to "normally assume." After some thought,
I found a solution:

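#!/usr/bin/perl
use strict;
use warnings;

open(my $in,  '<', 'in.txt')  or die "Cannot open in.txt: $!";
open(my $out, '>', 'out.txt') or die "Cannot open out.txt: $!";

while ( my $x = <$in> ) {
    # A line starting with an uppercase letter or a double quote begins
    # a new paragraph, so end the previous one first.
    if ( $x =~ m!^[A-Z"]! ) { print $out "\n"; }
    # Swap the trailing newline for a space so the next line joins on.
    $x =~ s!(^.+)\n!$1 !m;

    print $out $x;
}

close($in);
close($out);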
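
The loop above writes a newline before the first paragraph and leaves a trailing space at the end of each one. A possible variation, sketched on the same assumption that a line starting with `[A-Z"]` begins a new paragraph, collects the paragraphs first and ends each with exactly one newline. The `join_paragraphs` name and the sample lines are made up for illustration.

```perl
use strict;
use warnings;

# Joins wrapped lines into one line per paragraph. A line that starts with
# an uppercase letter or a double quote is assumed to begin a new paragraph.
sub join_paragraphs {
    my @lines = @_;
    my @paragraphs;
    for my $line (@lines) {
        chomp(my $text = $line);
        if (!@paragraphs || $text =~ /^[A-Z"]/) {
            push @paragraphs, $text;          # start a new paragraph
        } else {
            $paragraphs[-1] .= " $text";      # continue the current one
        }
    }
    return join('', map { "$_\n" } @paragraphs);
}

print join_paragraphs(
    "Sorry, I thought\n", "enough detail.\n",
    "That said,\n",       "the vast majority\n",
);
```

To run it on a file, the lines read from the input handle can be passed in place of the literal list.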

Thanks,

Michael

Jul 19 '05 #3

Eric J. Roode

"MichaelC" <mi****@NOshaSP AMw.ca> wrote in
news:s5Vwb.4967 86$pl3.155625@p d7tw3no:

> Sorry, I thought that I had defined my problem in
> enough detail.

I would say not. :-)

> My problem is that the text that I am
> processing does NOT have double line breaks
> between paragraphs, and the text has been presented
> wrapped to 72 character width. I do not have access
> to the original, as it was lost. That is the reason for
> my current problem.
> That said, statistically, in the text that I am processing,
> the vast majority of lines that start with the set [A-Z"]
> will start a new paragraph. The converse is also true,
> in that lines that start [a-z,.!?] are definitely part of a
> logical paragraph. In that sense, I am not using the
> term "paragraph" in the way that you normally assume.

It sounds like you want to remove all newlines, except where the newline
is followed by an uppercase character. Is that correct?

If so, I'd suggest reading the entire file into memory, and doing a
simple substitution on it:

$/ = undef;
$content = <FILE>;
$content =~ s/\n(?![[:upper:]])//g;
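
Note that the substitution above deletes each newline outright, so the words on either side run together, and it also removes the final newline of the file. A hedged variant is sketched below: it puts a space in place of the newline, treats a leading double quote as a paragraph start (as described in the earlier post), and keeps the last newline. The `unwrap` name and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# Unwraps text held in a single string.
sub unwrap {
    my ($content) = @_;
    # Replace a newline with a space unless an uppercase letter, a double
    # quote, or the end of the text follows it.
    $content =~ s/\n(?![[:upper:]"]|\z)/ /g;
    return $content;
}

print unwrap("Sorry, I thought\nenough detail.\nThat said,\nthe vast majority\n");
```

The lookahead only inspects the next character, so the paragraph-starting letter itself is left untouched.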

- --
Eric
$_ = reverse sort $ /. r , qw p ekca lre uJ reh
ts p , map $ _. $ " , qw e p h tona e and print

Jul 19 '05 #4
