
Suggestions for how to approach this problem?

P: n/a
I figured I might give myself a little project to make my life at work
easier, so here's what I want to do:

I have a large list of publication citations that are numbered. The
numbers are simply typed in with the rest of the text. What I want to do
is remove the numbers and then put bullets instead. Now, this alone
would be easy enough, with a little Python and a little work by hand,
but the real issue is that because of the way these citations were
typed, there are often line breaks at the end of each line -- in other
words, the person didn't just let the line flow to the next line, they
manually pressed Enter. So inserting bullets at this point would put a
bullet at each line break.

So I need to remove the line breaks too, but of course not *all* of them
because each reference still needs a line break between it. So I'm
hoping I could get an idea or two for approaching this. I figure regular
expressions will be needed, and maybe it would be good to remove the
line breaks first and *not* remove a line break that comes before the
numbers (because that would be the proper place for one), and then
finally remove the numbers.

Thanks.
May 8 '07 #1
13 Replies


P: n/a
John Salerno wrote:

typed, there are often line breaks at the end of each line
Also, there are sometimes tabs used to indent the subsequent lines of
citation, but I assume with that I can just replace the tab with a space.
May 8 '07 #2

P: n/a
In <46**********************@news.astraweb.com>, John Salerno wrote:
I have a large list of publication citations that are numbered. The
numbers are simply typed in with the rest of the text. What I want to do
is remove the numbers and then put bullets instead. Now, this alone
would be easy enough, with a little Python and a little work by hand,
but the real issue is that because of the way these citations were
typed, there are often line breaks at the end of each line -- in other
words, the person didn't just let the line flow to the next line, they
manually pressed Enter. So inserting bullets at this point would put a
bullet at each line break.

So I need to remove the line breaks too, but of course not *all* of them
because each reference still needs a line break between it. So I'm
hoping I could get an idea or two for approaching this. I figure regular
expressions will be needed, and maybe it would be good to remove the
line breaks first and *not* remove a line break that comes before the
numbers (because that would be the proper place for one), and then
finally remove the numbers.
I think I have a vague idea of what the input looks like, but it would be
helpful if you showed some example input and the wanted output.

Ciao,
Marc 'BlackJack' Rintsch
May 8 '07 #3

P: n/a
Marc 'BlackJack' Rintsch wrote:
I think I have a vague idea of what the input looks like, but it would be
helpful if you showed some example input and the wanted output.
Good idea. Here's what it looks like now:

1. Levy, S.B. (1964) Isologous interference with ultraviolet and X-ray
irradiated
bacteriophage T2. J. Bacteriol. 87:1330-1338.
2. Levy, S.B. and T. Watanabe (1966) Mepacrine and transfer of R
factor. Lancet 2:1138.
3. Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966) Episomic
resistance factors in
Enterobacteriaceae. 34. The specific effects of the inhibitors of DNA
synthesis on the
transfer of R factor and F factor. Med. Biol. (Tokyo) 73:79-83.
4. Levy, S.B. (1967) Blood safari into Kenya. The New Physician 16:50-54.
5. Levy, S.B., W.T. Fitts and J.B. Leach (1967) Surgical treatment of
diverticular disease of the
colon: Evaluation of an eleven-year period. Annals Surg. 166:947-955.

As you can see, any single citation is broken over several lines as a
result of a line break. I want it to look like this:

1. Levy, S.B. (1964) Isologous interference with ultraviolet and X-ray
irradiated bacteriophage T2. J. Bacteriol. 87:1330-1338.
2. Levy, S.B. and T. Watanabe (1966) Mepacrine and transfer of R
factor. Lancet 2:1138.
3. Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966) Episomic
resistance factors in Enterobacteriaceae. 34. The specific effects
of the inhibitors of DNA synthesis on the
transfer of R factor and F factor. Med. Biol. (Tokyo) 73:79-83.
4. Levy, S.B. (1967) Blood safari into Kenya. The New Physician
16:50-54.
5. Levy, S.B., W.T. Fitts and J.B. Leach (1967) Surgical treatment of
diverticular disease of the colon: Evaluation of an eleven-year
period. Annals Surg. 166:947-955.

Now, since this is pasted, it might not even look good to you. But in
the second example, the numbers are meant to be bullets and so the
indentation would happen automatically (in Word). But for now they are
just typed.
May 8 '07 #4

P: n/a
On Tuesday 08 May 2007 22:23:31 John Salerno wrote:
John Salerno wrote:
typed, there are often line breaks at the end of each line

Also, there are sometimes tabs used to indent the subsequent lines of
citation, but I assume with that I can just replace the tab with a space.
Is this what the text looks like:

123
some information

124 some other information

126(tab here)something else

If this is the case (the numbers are at the beginning, and after the numbers
there is either a newline or a tab), the logic might be this simple:

Get the numbers at the beginning of the line. Check for \n and \t after the
number; if either exists, remove it or replace it with a space or
whatever you prefer, and there you have it. Also, how are the records
separated? By empty lines? If so, \n\n is an empty line in a string, like
this:
"""
some text here\n
\n
some other text here\n
"""
May 8 '07 #5

P: n/a
On May 8, 3:00 pm, John Salerno <johnj...@NOSPAMgmail.com> wrote:
Marc 'BlackJack' Rintsch wrote:
I think I have a vague idea of what the input looks like, but it would be
helpful if you showed some example input and the wanted output.

Good idea. Here's what it looks like now:

1. Levy, S.B. (1964) Isologous interference with ultraviolet and X-ray
irradiated
bacteriophage T2. J. Bacteriol. 87:1330-1338.
2. Levy, S.B. and T. Watanabe (1966) Mepacrine and transfer of R
factor. Lancet 2:1138.
3. Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966) Episomic
resistance factors in
Enterobacteriaceae. 34. The specific effects of the inhibitors of DNA
synthesis on the
transfer of R factor and F factor. Med. Biol. (Tokyo) 73:79-83.
Questions:

1) Do the citation numbers always begin in column 1?

2) Are the citation numbers always followed by a period and then at
least one whitespace character?

If so, I'd probably use a regular expression like ^[0-9]+\.[ \t] to
find the beginning of each cite. Then I would output each cite
through a state machine that would reduce consecutive whitespace
characters (space, tab, newline) into a single space, separating
each cite with a newline.

Final formatting can be done with paragraph styles in Word.

HTH,
-=Dave
May 8 '07 #6
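
Dave's suggestion above can be sketched roughly as follows. This is only one possible reading of it: the "state machine" is reduced to two states (before and inside a cite), and the names `CITE_START` and `split_cites` are made up for illustration. It assumes every citation number starts in column 1 and is followed by a period and a space or tab.

```python
import re

# Matches a citation number at the start of a line: digits, a period,
# then one space or tab (the pattern suggested in the post above).
CITE_START = re.compile(r"^[0-9]+\.[ \t]")

def split_cites(text):
    """Group lines into citations and collapse whitespace runs in each."""
    cites = []
    current = []  # lines of the citation currently being collected
    for line in text.splitlines():
        if CITE_START.match(line) and current:
            # A new citation begins, so flush the previous one.
            # str.split() with no arguments splits on any whitespace run.
            cites.append(" ".join(" ".join(current).split()))
            current = []
        current.append(line)
    if current:
        cites.append(" ".join(" ".join(current).split()))
    return cites

sample = (
    "1.  Levy, S.B. (1964) Isologous interference with ultraviolet and X-ray\n"
    "irradiated\n"
    "\tbacteriophage T2. J. Bacteriol. 87:1330-1338.\n"
    "2.  Levy, S.B. and T. Watanabe (1966) Mepacrine and transfer of R\n"
    "factor. Lancet 2:1138.\n"
)
for cite in split_cites(sample):
    print(cite)
```

Each cite comes out on one line with the number still attached, so the numbers can be stripped or turned into bullets in a later step.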

P: n/a
John Salerno wrote:
Marc 'BlackJack' Rintsch wrote:
Here's what it looks like now:

1. Levy, S.B. (1964) Isologous interference with ultraviolet and X-ray
irradiated
bacteriophage T2. J. Bacteriol. 87:1330-1338.
2. Levy, S.B. and T. Watanabe (1966) Mepacrine and transfer of R
factor. Lancet 2:1138.
3. Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966) Episomic
resistance factors in
Enterobacteriaceae. 34. The specific effects of the inhibitors of DNA
synthesis on the
transfer of R factor and F factor. Med. Biol. (Tokyo) 73:79-83.
4. Levy, S.B. (1967) Blood safari into Kenya. The New Physician
16:50-54.
5. Levy, S.B., W.T. Fitts and J.B. Leach (1967) Surgical treatment of
diverticular disease of the
colon: Evaluation of an eleven-year period. Annals Surg. 166:947-955.

As you can see, any single citation is broken over several lines as a
result of a line break. I want it to look like this:

1. Levy, S.B. (1964) Isologous interference with ultraviolet and X-ray
irradiated bacteriophage T2. J. Bacteriol. 87:1330-1338.
2. Levy, S.B. and T. Watanabe (1966) Mepacrine and transfer of R
factor. Lancet 2:1138.
3. Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966) Episomic
resistance factors in Enterobacteriaceae. 34. The specific effects
of the inhibitors of DNA synthesis on the
transfer of R factor and F factor. Med. Biol. (Tokyo) 73:79-83.
4. Levy, S.B. (1967) Blood safari into Kenya. The New Physician
16:50-54.
5. Levy, S.B., W.T. Fitts and J.B. Leach (1967) Surgical treatment of
diverticular disease of the colon: Evaluation of an eleven-year
period. Annals Surg. 166:947-955.

Now, since this is pasted, it might not even look good to you. But in
the second example, the numbers are meant to be bullets and so the
indentation would happen automatically (in Word). But for now they are
just typed.
If you can count on the person not skipping any numbers in the
citations, you can take an "AI" approach to hopefully weed out the rare
circumstance that a number followed by a period starts a line in the
middle of the citation. This is not failsafe, say if you were on
citation 33 and it was in chapter 34 and that 34 happened to start a new
line. But, then again, even a human would take a little time to figure
that one out--and probably wouldn't be 100% accurate either. I'm sure
there is an AI word for the type of parser that could parse something
like this unambiguously and I'm sure that it has been proven to be
impossible to create:

import re
records = []
record = None
counter = 1
regex = re.compile(r'^(\d+)\. (.*)')
for aline in lines:
    m = regex.search(aline)
    if m is not None:
        recnum, aline = m.groups()
        if int(recnum) == counter:
            if record is not None:
                records.append(record)
            record = [aline.strip()]
            counter += 1
            continue
    record.append(aline.strip())

if record is not None:
    records.append(record)

records = [" ".join(r) for r in records]

py> import re
py> records = []
py> record = None
py> counter = 1
py> regex = re.compile(r'^(\d+)\. (.*)')
py> for aline in lines:
...     m = regex.search(aline)
...     if m is not None:
...         recnum, aline = m.groups()
...         if int(recnum) == counter:
...             if record is not None:
...                 records.append(record)
...             record = [aline.strip()]
...             counter += 1
...             continue
...     record.append(aline.strip())
...
py> if record is not None:
...     records.append(record)
...
py> records = [" ".join(r) for r in records]
py> records

['Levy, S.B. (1964) Isologous interference with ultraviolet and X-ray
irradiated bacteriophage T2. J. Bacteriol. 87:1330-1338.',
'Levy, S.B. and T. Watanabe (1966) Mepacrine and transfer of R
factor. Lancet 2:1138.',
'Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966) Episomic
resistance factors in Enterobacteriaceae. 34. The specific effects of
the inhibitors of DNA synthesis on the transfer of R factor and F
factor. Med. Biol. (Tokyo) 73:79-83.',
'Levy, S.B. (1967) Blood safari into Kenya. The New Physician
16:50-54.',
'Levy, S.B., W.T. Fitts and J.B. Leach (1967) Surgical treatment of
diverticular disease of the colon: Evaluation of an eleven-year period.
Annals Surg. 166:947-955.']
James
May 8 '07 #7

P: n/a
Necmettin Begiter wrote:
Is this what the text looks like:

123
some information

124 some other information

126(tab here)something else

If this is the case (the numbers are at the beginning, and after the numbers
there is either a newline or a tab), the logic might be this simple:
They all seem to be a little different. One consistency is that each
number is followed by two spaces. There is nothing separating each
reference except a single newline, which I want to preserve. But within
each reference there might be a combination of spaces, tabs, or newlines.
May 9 '07 #8

P: n/a
Dave Hansen wrote:
Questions:

1) Do the citation numbers always begin in column 1?
Yes, that's one consistency at least. :)
2) Are the citation numbers always followed by a period and then at
least one whitespace character?
Yes, it seems to be either one or two whitespaces.
find the beginning of each cite. Then I would output each cite
through a state machine that would reduce consecutive whitespace
characters (space, tab, newline) into a single character, separating
each cite with a newline.
Interesting idea! I'm not sure what a "state machine" is, but it sounds
like you are suggesting that I more or less separate each reference,
process it, and then rewrite it to a new file in the cleaner format?
That might work pretty well.
May 9 '07 #9

P: n/a
James Stroud wrote:
If you can count on the person not skipping any numbers in the
citations, you can take an "AI" approach to hopefully weed out the rare
circumstance that a number followed by a period starts a line in the
middle of the citation.
I don't think any numbers are skipped, but there are some cases where a
number is followed by a period within a citation. But this might not
matter since each reference number begins at the start of the line, so I
could use the RE to start at the beginning.
May 9 '07 #10
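
For what it's worth, anchoring at the start of the line could look something like the sketch below. It assumes Python's `re.MULTILINE` flag, which makes `^` match at the start of every line; the name `NUMBER_AT_LINE_START` is just for illustration. Note that it only protects against numbers in the middle of a line: if a wrapped line happened to begin with something like "34. ", it would still match.

```python
import re

# With re.MULTILINE, ^ matches at the start of every line, not only at
# the start of the whole string.
NUMBER_AT_LINE_START = re.compile(r"^(\d+)\.\s", re.MULTILINE)

text = (
    "3.  Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966) Episomic\n"
    "resistance factors in\n"
    "Enterobacteriaceae. 34. The specific effects of the inhibitors of DNA\n"
    "synthesis on the\n"
    "transfer of R factor and F factor. Med. Biol. (Tokyo) 73:79-83.\n"
    "4.  Levy, S.B. (1967) Blood safari into Kenya. The New Physician 16:50-54.\n"
)

# The "34." sits in the middle of a line, so it is not reported.
found = NUMBER_AT_LINE_START.findall(text)
print(found)  # ['3', '4']
```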

P: n/a
John Salerno wrote:
So I need to remove the line breaks too, but of course not *all* of them
because each reference still needs a line break between it.
After doing a bit of search and replace for tabs with my text editor, I
think I've narrowed down the problem to just this:

I need to remove all newline characters that are not at the end of a
citation (and replace them with a single space). That is, those that are
not followed by the start of a new numbered citation. This seems to
involve a look-ahead RE, but I'm not sure how to write those. This is
what I came up with:
\n(?=(\d)+)

(I can never remember if I need parentheses around '\d' or if the +
should be inside it or not!)
May 9 '07 #11
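
A note on the pattern in the post above: `(?=...)` is a positive look-ahead, so `\n(?=(\d)+)` matches the newlines that are followed by a digit, which are the ones to keep. A negative look-ahead, `(?!...)`, selects the others. The sketch below is one way this might be written, not a tested solution for the real file; the function names and the "* " bullet marker are assumptions. It also assumes no wrapped line begins with a number and a period, which is the ambiguity James's counter-based code is meant to handle.

```python
import re

# A newline NOT followed by "digits, period, whitespace" (negative look-ahead).
# No parentheses are needed around \d; the + applies to it directly.
INNER_NEWLINE = re.compile(r"\n(?!\d+\.\s)")
# A citation number at the start of a line, plus the spaces or tabs after it.
LEADING_NUMBER = re.compile(r"^\d+\.[ \t]+", re.MULTILINE)

def join_wrapped_lines(text):
    """Replace line breaks inside a citation with a single space."""
    # Drop trailing newlines first so the last one is not turned into a space.
    return INNER_NEWLINE.sub(" ", text.rstrip("\n"))

def numbers_to_bullets(text, bullet="* "):
    """Swap each leading citation number for a bullet marker."""
    return LEADING_NUMBER.sub(bullet, text)

sample = (
    "1.  Levy, S.B. (1964) Isologous interference with ultraviolet and X-ray\n"
    "irradiated\n"
    "bacteriophage T2. J. Bacteriol. 87:1330-1338.\n"
    "2.  Levy, S.B. and T. Watanabe (1966) Mepacrine and transfer of R\n"
    "factor. Lancet 2:1138.\n"
)
joined = join_wrapped_lines(sample)
print(numbers_to_bullets(joined))
```

Doing the join first and the number removal second matters here, because the look-ahead relies on the numbers still being present to tell where a citation starts.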

P: n/a
John Salerno wrote:
John Salerno wrote:
So I need to remove the line breaks too, but of course not *all* of
them because each reference still needs a line break between it.


After doing a bit of search and replace for tabs with my text editor, I
think I've narrowed down the problem to just this:

I need to remove all newline characters that are not at the end of a
citation (and replace them with a single space). That is, those that are
not followed by the start of a new numbered citation. This seems to
involve a look-ahead RE, but I'm not sure how to write those. This is
what I came up with:
\n(?=(\d)+)

(I can never remember if I need parentheses around '\d' or if the +
should be inside it or not!)
I included code in my previous post that will parse the entire bib,
making use of the numbering and eliminating the most probable, but still
fairly rare, potential ambiguity. You might want to check out that code,
as my testing it showed that it worked with your example.

James
May 9 '07 #12

P: n/a
James Stroud wrote:
I included code in my previous post that will parse the entire bib,
making use of the numbering and eliminating the most probable, but still
fairly rare, potential ambiguity. You might want to check out that code,
as my testing it showed that it worked with your example.
Thanks. It looked a little involved so I hadn't started to work through
it yet, but I'll do that now before I actually try to write something
from scratch. :)
May 10 '07 #13

P: n/a
James Stroud wrote:
import re
records = []
record = None
counter = 1
regex = re.compile(r'^(\d+)\. (.*)')
for aline in lines:
    m = regex.search(aline)
    if m is not None:
        recnum, aline = m.groups()
        if int(recnum) == counter:
            if record is not None:
                records.append(record)
            record = [aline.strip()]
            counter += 1
            continue
    record.append(aline.strip())

if record is not None:
    records.append(record)

records = [" ".join(r) for r in records]
What do I need to do to get this to run against the text that I have? Is
'lines' meant to be a list of the lines from the original citation file?
May 10 '07 #14
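
The thread ends without an answer to this, but the loop only iterates over `lines`, so a list of lines (or an open file object, which yields lines when iterated) should work. Below is a hedged sketch that wraps the quoted loop in a function. The function name and the file name "citations.txt" are hypothetical, and it varies the quoted code in three ways: it accepts any whitespace after the period, it keeps the whole line when a leading number does not match the counter (the quoted version drops that number), and it ignores lines that come before the first citation.

```python
import re

def parse_citations(lines):
    """Join wrapped lines into one string per citation, using the numbering."""
    regex = re.compile(r"^(\d+)\.\s+(.*)")
    records = []
    record = None   # parts of the citation being collected
    counter = 1     # the citation number expected next
    for aline in lines:
        m = regex.search(aline)
        if m is not None and int(m.group(1)) == counter:
            # The expected number starts a new citation.
            if record is not None:
                records.append(record)
            record = [m.group(2).strip()]
            counter += 1
        elif record is not None:
            # A continuation line; kept whole even if it starts with a number.
            record.append(aline.strip())
    if record is not None:
        records.append(record)
    # Skip empty parts so blank lines do not produce double spaces.
    return [" ".join(part for part in r if part) for r in records]

# With a real file (hypothetical name), something like:
#     with open("citations.txt") as f:
#         records = parse_citations(f)
sample_lines = [
    "1.  Levy, S.B. (1964) Isologous interference with ultraviolet and X-ray\n",
    "irradiated\n",
    "\tbacteriophage T2. J. Bacteriol. 87:1330-1338.\n",
    "2.  Levy, S.B. and T. Watanabe (1966) Mepacrine and transfer of R\n",
    "factor. Lancet 2:1138.\n",
]
records = parse_citations(sample_lines)
for r in records:
    print(r)
```

Writing the result back out is then a matter of joining `records` with newlines, with or without a bullet marker in front of each.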
