Suggestions for how to approach this problem?

John Salerno

I figured I might give myself a little project to make my life at work
easier, so here's what I want to do:

I have a large list of publication citations that are numbered. The
numbers are simply typed in with the rest of the text. What I want to do
is remove the numbers and then put bullets instead. Now, this alone
would be easy enough, with a little Python and a little work by hand,
but the real issue is that because of the way these citations were
typed, there are often line breaks at the end of each line -- in other
words, the person didn't just let the line flow to the next line, they
manually pressed Enter. So inserting bullets at this point would put a
bullet at each line break.

So I need to remove the line breaks too, but of course not *all* of them,
because the references still need a line break between them. So I'm
hoping I could get an idea or two for approaching this. I figure regular
expressions will be needed, and maybe it would be good to remove the
line breaks first and *not* remove a line break that comes before the
numbers (because that would be the proper place for one), and then
finally remove the numbers.

Thanks.

May 8 '07 #1

John Salerno

John Salerno wrote:

typed, there are often line breaks at the end of each line

Also, there are sometimes tabs used to indent the subsequent lines of
citation, but I assume with that I can just replace the tab with a space.

May 8 '07 #2

Marc 'BlackJack' Rintsch

In <46**********************@news.astraweb.com>, John Salerno wrote:

I have a large list of publication citations that are numbered. The
numbers are simply typed in with the rest of the text. What I want to do
is remove the numbers and then put bullets instead. Now, this alone
would be easy enough, with a little Python and a little work by hand,
but the real issue is that because of the way these citations were
typed, there are often line breaks at the end of each line -- in other
words, the person didn't just let the line flow to the next line, they
manually pressed Enter. So inserting bullets at this point would put a
bullet at each line break.

So I need to remove the line breaks too, but of course not *all* of them
because each reference still needs a line break between it. So I'm
hoping I could get an idea or two for approaching this. I figure regular
expressions will be needed, and maybe it would be good to remove the
line breaks first and *not* remove a line break that comes before the
numbers (because that would be the proper place for one), and then
finally remove the numbers.

I think I have a vague idea of what the input looks like, but it would be
helpful if you showed some example input and the wanted output.

Ciao,
Marc 'BlackJack' Rintsch

May 8 '07 #3

John Salerno

Marc 'BlackJack' Rintsch wrote:

I think I have vague idea how the input looks like, but it would be
helpful if you show some example input and wanted output.

Good idea. Here's what it looks like now:

1. Levy, S.B. (1964) Isologous interference with ultraviolet and X-ray
irradiated
bacteriophage T2. J. Bacteriol. 87:1330-1338.
2. Levy, S.B. and T. Watanabe (1966) Mepacrine and transfer of R
factor. Lancet 2:1138.
3. Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966) Episomic
resistance factors in
Enterobacteriaceae. 34. The specific effects of the inhibitors of DNA
synthesis on the
transfer of R factor and F factor. Med. Biol. (Tokyo) 73:79-83.
4. Levy, S.B. (1967) Blood safari into Kenya. The New Physician 16:50-54.
5. Levy, S.B., W.T. Fitts and J.B. Leach (1967) Surgical treatment of
diverticular disease of the
colon: Evaluation of an eleven-year period. Annals Surg. 166:947-955.

As you can see, any single citation is broken over several lines as a
result of a line break. I want it to look like this:

1. Levy, S.B. (1964) Isologous interference with ultraviolet and X-ray
irradiated bacteriophage T2. J. Bacteriol. 87:1330-1338.
2. Levy, S.B. and T. Watanabe (1966) Mepacrine and transfer of R
factor. Lancet 2:1138.
3. Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966) Episomic
resistance factors in Enterobacteriaceae. 34. The specific effects
of the inhibitors of DNA synthesis on the
transfer of R factor and F factor. Med. Biol. (Tokyo) 73:79-83.
4. Levy, S.B. (1967) Blood safari into Kenya. The New Physician
16:50-54.
5. Levy, S.B., W.T. Fitts and J.B. Leach (1967) Surgical treatment of
diverticular disease of the colon: Evaluation of an eleven-year
period. Annals Surg. 166:947-955.

Now, since this is pasted, it might not even look good to you. But in
the second example, the numbers are meant to be bullets and so the
indentation would happen automatically (in Word). But for now they are
just typed.

May 8 '07 #4

Necmettin Begiter

On Tuesday 08 May 2007 22:23:31 John Salerno wrote:

John Salerno wrote:
typed, there are often line breaks at the end of each line

Also, there are sometimes tabs used to indent the subsequent lines of
citation, but I assume with that I can just replace the tab with a space.

Is this what the text looks like:

123
some information

124 some other information

126(tab here)something else

If this is the case (the numbers are at the beginning, and after the numbers
there is either a newline or a tab), the logic might be this simple:

Get the number at the beginning of the line. Check for \n or \t after the
number; if either exists, remove it or replace it with a space or
whatever you prefer, and there you have it. Also, how are the records
separated? By empty lines? If so, \n\n is an empty line in a string, like
this:
"""
some text here\n
\n
some other text here\n
"""

May 8 '07 #5

Dave Hansen

On May 8, 3:00 pm, John Salerno <johnj...@NOSPAMgmail.com> wrote:

Marc 'BlackJack' Rintsch wrote:
I think I have vague idea how the input looks like, but it would be
helpful if you show some example input and wanted output.

Good idea. Here's what it looks like now:

1. Levy, S.B. (1964) Isologous interference with ultraviolet and X-ray
irradiated
bacteriophage T2. J. Bacteriol. 87:1330-1338.
2. Levy, S.B. and T. Watanabe (1966) Mepacrine and transfer of R
factor. Lancet 2:1138.
3. Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966) Episomic
resistance factors in
Enterobacteriaceae. 34. The specific effects of the inhibitors of DNA
synthesis on the
transfer of R factor and F factor. Med. Biol. (Tokyo) 73:79-83.

Questions:

1) Do the citation numbers always begin in column 1?

2) Are the citation numbers always followed by a period and then at
least one whitespace character?

If so, I'd probably use a regular expression like ^[0-9]+\.[ \t] to
find the beginning of each cite. Then I would output each cite
through a state machine that would reduce consecutive whitespace
characters (space, tab, newline) into a single character, separating
each cite with a newline.

Final formatting can be done with paragraph styles in Word.
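
The idea above could be sketched roughly as follows. This is only an illustration under assumptions: the function name join_citations is made up, and instead of a hand-written state machine it leans on str.split() and str.join() to collapse runs of spaces, tabs, and newlines, which is a simpler stand-in for the same effect.

```python
import re

# Pattern from the post: digits, a period, then a space or tab at line start.
CITE_START = re.compile(r'^[0-9]+\.[ \t]')


def join_citations(text):
    """Return one string per citation, with whitespace runs collapsed."""
    cites = []
    current = []  # lines belonging to the citation being collected

    def flush():
        # split() with no arguments breaks on any whitespace run,
        # so joining with ' ' leaves exactly one space between words.
        cites.append(' '.join(' '.join(current).split()))

    for line in text.splitlines():
        if CITE_START.match(line) and current:
            flush()
            current = []
        if line.strip():  # ignore blank lines
            current.append(line)
    if current:
        flush()
    return cites
```

Writing '\n'.join(join_citations(text)) to a new file would give one citation per line. Note that this sketch treats any line starting with a number and a period as a new citation, so a wrapped line beginning with something like "34. " would be split off incorrectly.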

HTH,
-=Dave

May 8 '07 #6

James Stroud

John Salerno wrote:

Marc 'BlackJack' Rintsch wrote:
Here's what it looks like now:

1. Levy, S.B. (1964) Isologous interference with ultraviolet and X-ray
irradiated
bacteriophage T2. J. Bacteriol. 87:1330-1338.
2. Levy, S.B. and T. Watanabe (1966) Mepacrine and transfer of R
factor. Lancet 2:1138.
3. Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966) Episomic
resistance factors in
Enterobacteriaceae. 34. The specific effects of the inhibitors of DNA
synthesis on the
transfer of R factor and F factor. Med. Biol. (Tokyo) 73:79-83.
4. Levy, S.B. (1967) Blood safari into Kenya. The New Physician
16:50-54.
5. Levy, S.B., W.T. Fitts and J.B. Leach (1967) Surgical treatment of
diverticular disease of the
colon: Evaluation of an eleven-year period. Annals Surg. 166:947-955.

As you can see, any single citation is broken over several lines as a
result of a line break. I want it to look like this:

1. Levy, S.B. (1964) Isologous interference with ultraviolet and X-ray
irradiated bacteriophage T2. J. Bacteriol. 87:1330-1338.
2. Levy, S.B. and T. Watanabe (1966) Mepacrine and transfer of R
factor. Lancet 2:1138.
3. Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966) Episomic
resistance factors in Enterobacteriaceae. 34. The specific effects
of the inhibitors of DNA synthesis on the
transfer of R factor and F factor. Med. Biol. (Tokyo) 73:79-83.
4. Levy, S.B. (1967) Blood safari into Kenya. The New Physician
16:50-54.
5. Levy, S.B., W.T. Fitts and J.B. Leach (1967) Surgical treatment of
diverticular disease of the colon: Evaluation of an eleven-year
period. Annals Surg. 166:947-955.

Now, since this is pasted, it might not even look good to you. But in
the second example, the numbers are meant to be bullets and so the
indentation would happen automatically (in Word). But for now they are
just typed.

If you can count on the person not skipping any numbers in the
citations, you can take an "AI" approach to weed out the rare
circumstance that a number followed by a period starts a line in the
middle of a citation. This is not failsafe, say if you were on
citation 33 and it was in chapter 34 and that 34 happened to start a new
line. But, then again, even a human would take a little time to figure
that one out, and probably wouldn't be 100% accurate either. I'm sure
there is an AI word for the type of parser that could parse something
like this unambiguously and I'm sure that it has been proven to be
impossible to create:

import re
records = []
record = None
counter = 1
regex = re.compile(r'^(\d+)\. (.*)')
for aline in lines:
    m = regex.search(aline)
    if m is not None:
        # Keep aline intact so an in-text number such as "34. " is not
        # lost when it is not the expected citation number.
        recnum, rest = m.groups()
        if int(recnum) == counter:
            if record is not None:
                records.append(record)
            record = [rest.strip()]
            counter += 1
            continue
    record.append(aline.strip())

if record is not None:
    records.append(record)

records = [" ".join(r) for r in records]
py> records

['Levy, S.B. (1964) Isologous interference with ultraviolet and X-ray
irradiated bacteriophage T2. J. Bacteriol. 87:1330-1338.',
'Levy, S.B. and T. Watanabe (1966) Mepacrine and transfer of R
factor. Lancet 2:1138.',
'Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966) Episomic
resistance factors in Enterobacteriaceae. 34. The specific effects of
the inhibitors of DNA synthesis on the transfer of R factor and F
factor. Med. Biol. (Tokyo) 73:79-83.',
'Levy, S.B. (1967) Blood safari into Kenya. The New Physician
16:50-54.',
'Levy, S.B., W.T. Fitts and J.B. Leach (1967) Surgical treatment of
diverticular disease of the colon: Evaluation of an eleven-year period.
Annals Surg. 166:947-955.']
James

May 8 '07 #7

John Salerno

Necmettin Begiter wrote:

Is this how the text looks like:

123
some information

124 some other information

126(tab here)something else

If this is the case (the numbers are at the beginning, and after the numbers
there is either a newline or a tab, the logic might be this simple:

They all seem to be a little different. One consistency is that each
number is followed by two spaces. There is nothing separating each
reference except a single newline, which I want to preserve. But within
each reference there might be a combination of spaces, tabs, or newlines.

May 9 '07 #8

John Salerno

Dave Hansen wrote:

Questions:

1) Do the citation numbers always begin in column 1?

Yes, that's one consistency at least. :)

2) Are the citation numbers always followed by a period and then at
least one whitespace character?

Yes, it seems to be either one or two whitespace characters.

find the beginning of each cite. then I would output each cite
through a state machine that would reduce consecutive whitespace
characters (space, tab, newline) into a single character, separating
each cite with a newline.

Interesting idea! I'm not sure what a "state machine" is, but it sounds
like you are suggesting that I more or less separate each reference,
process it, and then rewrite it to a new file in the cleaner format?
That might work pretty well.

May 9 '07 #9

John Salerno

James Stroud wrote:

If you can count on the person not skipping any numbers in the
citations, you can take an "AI" approach to hopefully weed out the rare
circumstance that a number followed by a period starts a line in the
middle of the citation.

I don't think any numbers are skipped, but there are some cases where a
number is followed by a period within a citation. But this might not
matter since each reference number begins at the start of the line, so I
could use the RE to start at the beginning.

May 9 '07 #10

John Salerno

John Salerno wrote:

So I need to remove the line breaks too, but of course not *all* of them
because each reference still needs a line break between it.

After doing a bit of search and replace for tabs with my text editor, I
think I've narrowed down the problem to just this:

I need to remove all newline characters that are not at the end of a
citation (and replace them with a single space). That is, those that are
not followed by the start of a new numbered citation. This seems to
involve a look-ahead RE, but I'm not sure how to write those. This is
what I came up with:
\n(?=(\d)+)

(I can never remember if I need parentheses around '\d' or if the +
should be inside it or not!)
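
For reference, (?=...) is a positive look-ahead, so the pattern above matches the newlines that *are* followed by a digit, which are the ones to keep. A minimal sketch of the opposite test, a negative look-ahead (?!...), is shown here. It assumes the tabs have already been replaced and that every citation starts with digits, a period, and a space; the names INNER_NEWLINE and unwrap are made up for illustration. No parentheses are needed around \d, since the + applies to it directly.

```python
import re

# A newline that is NOT followed by "<digits>. " sits inside a citation.
INNER_NEWLINE = re.compile(r'\n(?!\d+\. )')


def unwrap(text):
    """Replace line breaks inside citations with a single space."""
    # strip() first so the final newline of the file is not turned
    # into a trailing space.
    return INNER_NEWLINE.sub(' ', text.strip())
```

One caveat: a wrapped line that happens to begin with something like "34. " looks like the start of a citation to this pattern, so its line break would be kept.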

May 9 '07 #11

James Stroud

John Salerno wrote:

John Salerno wrote:

>So I need to remove the line breaks too, but of course not *all* of
them because each reference still needs a line break between it.

After doing a bit of search and replace for tabs with my text editor, I
think I've narrowed down the problem to just this:

I need to remove all newline characters that are not at the end of a
citation (and replace them with a single space). That is, those that are
not followed by the start of a new numbered citation. This seems to
involve a look-ahead RE, but I'm not sure how to write those. This is
what I came up with:
\n(?=(\d)+)

(I can never remember if I need parentheses around '\d' or if the +
should be inside it or not!

I included code in my previous post that will parse the entire bib,
making use of the numbering and eliminating the most probable, but still
fairly rare, potential ambiguity. You might want to check out that code,
as my testing it showed that it worked with your example.

James

May 9 '07 #12

John Salerno

James Stroud wrote:

I included code in my previous post that will parse the entire bib,
making use of the numbering and eliminating the most probable, but still
fairly rare, potential ambiguity. You might want to check out that code,
as my testing it showed that it worked with your example.

Thanks. It looked a little involved so I hadn't started to work through
it yet, but I'll do that now before I actually try to write something
from scratch. :)

May 10 '07 #13

John Salerno

James Stroud wrote:

import re
records = []
record = None
counter = 1
regex = re.compile(r'^(\d+)\. (.*)')
for aline in lines:
m = regex.search(aline)
if m is not None:
recnum, aline = m.groups()
if int(recnum) == counter:
if record is not None:
records.append(record)
record = [aline.strip()]
counter += 1
continue
record.append(aline.strip())

if record is not None:
records.append(record)

records = [" ".join(r) for r in records]

What do I need to do to get this to run against the text that I have? Is
'lines' meant to be a list of the lines from the original citation file?

May 10 '07 #14
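
The thread ends on that question. One plausible reading is that 'lines' is any iterable of the file's lines, and an open file object is one. The sketch below wraps the counter idea from the quoted code in a function under that assumption. The names parse_citations and clean_file are hypothetical, and the pattern uses \s+ after the period to allow for one or two spaces.

```python
import re

# Digits, a period, whitespace, then the rest of the line.
CITE = re.compile(r'^(\d+)\.\s+(.*)')


def parse_citations(lines):
    """Group lines into citations using the expected running number."""
    records = []
    record = None
    counter = 1
    for aline in lines:
        m = CITE.search(aline)
        if m is not None and int(m.group(1)) == counter:
            # The expected number starts a new citation.
            if record is not None:
                records.append(record)
            record = [m.group(2).strip()]
            counter += 1
        elif record is not None and aline.strip():
            # Any other non-blank line continues the current citation.
            record.append(aline.strip())
    if record is not None:
        records.append(record)
    return [" ".join(r) for r in records]


def clean_file(path):
    """Read a citation file and return one string per citation."""
    with open(path) as f:
        # Iterating over a file yields its lines one at a time.
        return parse_citations(f)
```

With a file named, say, citations.txt (a made-up name), clean_file("citations.txt") would return the joined citations, ready to be written out one per line.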
