table (ascii text) layout recognition

Hello,

I am looking for Python code to process
tables that are in ASCII text. The code must
determine where the columns (fields) are.
The tables in my application vary,
but their columns are not very hard
for a human to locate, because even
when ignoring the semantics of the words,
our eyes see the vertical alignments.

Here is a sample table (must be viewed
with fixed-width font to see alignments):
=================================

44544      ipod          apple     black         102
GFGFHHF-12 unknown thing bizar     brick mortar  tbc
45fjk      do not know   + is less               biac
           disk          seagate   250GB         130
5G_gff                   tbd       tbd
gjgh88hgg  media record  a and b                 12
hjj        foo           bar       hop           zip
hg uy oi   hj uuu ii a   qqq ccc v ZZZ Ughj
qdsd       zert                    nope          nope

=================================

I want Python code that builds a representation
of this table (for example a list of lists, where each list
represents a table line and each element of the list
is a field value).

Any hints?
thanks

Sep 13 '06 #1
I have to catch a bus, but quickly: the algorithm is to code non-space
as one and space as zero, then 'or' down each column. Zeros then
indicate a high probability of a gap between columns. Code tomorrow if
no one else posts.
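
A minimal Python 3 sketch of that idea, on a tiny made-up table (the `toy` data and the name `column_mask` are illustrative assumptions, not from the thread). Each character is coded as 1 for non-space and 0 for space, and the lines are then OR-ed together column by column:

```python
def column_mask(lines):
    """Return one 0/1 flag per character column: 1 if any line has a
    non-space character there, 0 if the column is blank in every line."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]   # equalise line lengths
    bits = [[0 if ch == " " else 1 for ch in line] for line in padded]
    return [int(any(col)) for col in zip(*bits)]     # 'or' down each column


# Hypothetical three-line table used only for illustration.
toy = [
    "ab  cd",
    "a   c",
    "    cde",
]
mask = column_mask(toy)
print("".join(str(b) for b in mask))   # prints 1100111
```

The run of zeros in the middle is the candidate gap between two fields.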

Must run...
--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
Sep 13 '06 #2
As promised. I call this the "cast a shadow" algorithm for table
discovery. This is about as obfuscated as I could make it. It will be up
to you to explain it to your teacher ;-)

Assuming the lines are all equal width (padded right with space) e.g.:

def rpadd(lines):
    """
    Pass in the lines as a list of lines.
    """
    lines = [line.rstrip() for line in lines]
    maxlen = max([len(line) for line in lines])
    return [line + ' ' * (maxlen - len(line)) for line in lines]
In which case, you can:
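# 1 marks a non-space character, 2 marks a space
binary = [[((s==' ' and 2) or 1) for s in line] for line in lines]
# True for every character column that holds text in at least one line
shadow = [1 in c for c in zip(*binary)]

# record each index where the shadow flips between blank and text
isit = False
indices = []
for i,v in enumerate(shadow):
    if v is not isit:
        indices.append(i)
        isit = not isit

indices.append(i+1)

# pair the flip points up as (start, stop) column spans
indices = [t for t in zip(indices[::2],indices[1::2])]

columns = [[line[t[0]:t[1]].strip() for line in lines] for t in indices]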
In case you want rows:

rows = zip(*columns)
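
For readers on Python 3, here is a hedged sketch of the same "cast a shadow" idea written as two small functions. The names `shadow_spans` and `split_table` are made up for this example, `itertools.groupby` is swapped in for the manual flip-tracking loop, and `SAMPLE` is the table from the first post with its column alignment reconstructed, so the exact spacing is an assumption:

```python
from itertools import groupby

# Sample table from the first post; alignment reconstructed (assumption).
SAMPLE = """\
44544      ipod          apple     black         102
GFGFHHF-12 unknown thing bizar     brick mortar  tbc
45fjk      do not know   + is less               biac
           disk          seagate   250GB         130
5G_gff                   tbd       tbd
gjgh88hgg  media record  a and b                 12
hjj        foo           bar       hop           zip
hg uy oi   hj uuu ii a   qqq ccc v ZZZ Ughj
qdsd       zert                    nope          nope
"""


def shadow_spans(lines):
    """Return (start, stop) spans of adjacent character columns that
    contain a non-space character in at least one line."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    shadow = [any(ch != " " for ch in col) for col in zip(*padded)]
    spans, pos = [], 0
    for filled, run in groupby(shadow):      # runs of True / False
        length = len(list(run))
        if filled:
            spans.append((pos, pos + length))
        pos += length
    return spans


def split_table(text):
    """Split an aligned text table into a list of rows of stripped fields."""
    lines = [line.rstrip() for line in text.splitlines() if line.strip()]
    spans = shadow_spans(lines)
    return [[line[a:b].strip() for a, b in spans] for line in lines]


rows = split_table(SAMPLE)
for row in rows:
    print(row)
```

A field that is blank on a given line comes back as an empty string, so every row has the same number of elements.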
James
Sep 13 '06 #3
James Stroud wrote:
> indices = [t for t in zip(indices[::2],indices[1::2])]
(Artefact of cut-and-paste.)

Make that:

indices = zip(indices[::2],indices[1::2])

James
Sep 13 '06 #4
My version, not much tested. It probably doesn't work well for tables
with few rows. It finds the most frequent word beginnings, and then
splits the data according to them.

data = """\
44544      ipod          apple     black         102
GFGFHHF-12 unknown thing bizar     brick mortar  tbc
45fjk      do not know   + is less               biac
           disk          seagate   250GB         130
5G_gff                   tbd       tbd
gjgh88hgg  media record  a and b                 12
hjj        foo           bar       hop           zip
hg uy oi   hj uuu ii a   qqq ccc v ZZZ Ughj
qdsd       zert                    nope          nope
"""

import re, pprint
# import collections # For Python 2.5

# RE to find the beginning of words
tpatt = re.compile(r"\b[^ ]")

# Remove empty lines
lines = filter(None, data.splitlines())

# Find the positions of all word beginnings
# This finds: treshs = [0, 11, 25, 35, 49, ...
# 44544      ipod          apple     black         102
# ^          ^             ^         ^             ^
treshs = [ob.start() for li in lines for ob in tpatt.finditer(li)]

# Find treshs frequencies
freqs = {}
for el in treshs:
    freqs[el] = freqs.get(el, 0) + 1

# Find treshs frequencies, alternative for Python V.2.5
# freqs = collections.defaultdict(int)
# for el in treshs:
#     freqs[el] += 1

# Find a big enough frequency
bigf = max(freqs.itervalues()) * 0.6

# Find the most common column beginnings
cols = sorted(k for k,v in freqs.iteritems() if v>bigf)

def xpairs(alist):
    "xpairs(xrange(n)) ==(0,1), (1,2), (2,3), ..., (n-2, n-1)"
    for i in xrange(len(alist)-1):
        yield alist[i:i+2]

result = [[li[x:y].strip() for x,y in xpairs(cols+[None])]
          for li in lines]

print data
pprint.pprint(result)
"""
Output:

44544      ipod          apple     black         102
GFGFHHF-12 unknown thing bizar     brick mortar  tbc
45fjk      do not know   + is less               biac
           disk          seagate   250GB         130
5G_gff                   tbd       tbd
gjgh88hgg  media record  a and b                 12
hjj        foo           bar       hop           zip
hg uy oi   hj uuu ii a   qqq ccc v ZZZ Ughj
qdsd       zert                    nope          nope

[['44544', 'ipod', 'apple', 'black', '102'],
 ['GFGFHHF-12', 'unknown thing', 'bizar', 'brick mortar', 'tbc'],
 ['45fjk', 'do not know', '+ is less', '', 'biac'],
 ['', 'disk', 'seagate', '250GB', '130'],
 ['5G_gff', '', 'tbd', 'tbd', ''],
 ['gjgh88hgg', 'media record', 'a and b', '', '12'],
 ['hjj', 'foo', 'bar', 'hop', 'zip'],
 ['hg uy oi', 'hj uuu ii a', 'qqq ccc v', 'ZZZ Ughj', ''],
 ['qdsd', 'zert', '', 'nope', 'nope']]
"""
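
The code above targets Python 2 (`itervalues`, `xrange`, the `print` statement). As a rough Python 3 sketch of the same frequency idea, the following uses `collections.Counter` for the tally. The function names and the small `toy` table are assumptions made for illustration; the regular expression and the 0.6 ratio are taken from the post:

```python
import re
from collections import Counter

WORD_START = re.compile(r"\b[^ ]")   # same pattern as in the post above


def frequent_starts(lines, ratio=0.6):
    """Return the sorted offsets at which words start 'often enough'."""
    counts = Counter(m.start() for line in lines
                     for m in WORD_START.finditer(line))
    cutoff = max(counts.values()) * ratio
    return sorted(pos for pos, n in counts.items() if n > cutoff)


def split_by_starts(lines, starts):
    """Cut every line at the given offsets and strip each piece."""
    bounds = list(zip(starts, starts[1:] + [None]))
    return [[line[a:b].strip() for a, b in bounds] for line in lines]


# Made-up toy table, not the one from the thread.
toy = [
    "id     name        qty",
    "a1     red apple   10",
    "b22    pear        3",
    "c      green kiwi  120",
]
starts = frequent_starts(toy)
table = split_by_starts(toy, starts)
print(starts)          # prints [0, 7, 19]
for row in table:
    print(row)
```

Word starts inside a multi-word field ("apple", "kiwi") occur only once each, so they fall below the cutoff and do not create extra columns.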

Bye,
bearophile

Sep 13 '06 #5
"James Stroud" <js*****@mbi.ucla.edu> wrote in message
news:8R***************@newssvr21.news.prodigy.com...
> As promised. I call this the "cast a shadow" algorithm for table
> discovery. This is about as obfuscated as I could make it. It will be up
> to you to explain it to your teacher ;-)

James -

I used your same algorithm, but I guess I used more brute force (and didn't
use pyparsing, either!).

-- Paul
data = """\
44544      ipod          apple     black         102
GFGFHHF-12 unknown thing bizar     brick mortar  tbc
45fjk      do not know   + is less               biac
           disk          seagate   250GB         130
5G_gff                   tbd       tbd
gjgh88hgg  media record  a and b                 12
hjj        foo           bar       hop           zip
hg uy oi   hj uuu ii a   qqq ccc v ZZZ Ughj
qdsd       zert                    nope          nope""".split('\n')

# find rightmost space characters delimiting text columns
spaceCols = set(range(max(map(len, data)))) - \
            set( [col for line in data
                      for col,c in enumerate(line.expandtabs())
                      if not c.isspace() ] )
spaceCols -= set( [c for c in spaceCols if c+1 in spaceCols ] )

# convert to sorted list of leading col characters
spaceCols = map(lambda x:x+1, sorted(list(spaceCols)))

# get and pretty-print data fields
dataFields = \
    [ [line.expandtabs()[start:stop] for (start,stop) in
        zip([0]+spaceCols,spaceCols+[None])] for line in data ]
import pprint
pprint.pprint( dataFields )

Gives:

[['44544      ', 'ipod          ', 'apple     ', 'black         ', '102'],
 ['GFGFHHF-12 ', 'unknown thing ', 'bizar     ', 'brick mortar  ', 'tbc'],
 ['45fjk      ', 'do not know   ', '+ is less ', '              ', 'biac'],
 ['           ', 'disk          ', 'seagate   ', '250GB         ', '130'],
 ['5G_gff     ', '              ', 'tbd       ', 'tbd', ''],
 ['gjgh88hgg  ', 'media record  ', 'a and b   ', '              ', '12'],
 ['hjj        ', 'foo           ', 'bar       ', 'hop           ', 'zip'],
 ['hg uy oi   ', 'hj uuu ii a   ', 'qqq ccc v ', 'ZZZ Ughj', ''],
 ['qdsd       ', 'zert          ', '          ', 'nope          ', 'nope']]
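
The code above calls `expandtabs()` before looking at character positions. A small illustrative sketch (Python 3; the helper name `nonspace_columns` and the sample string are assumptions) of why that matters: a tab is a single character in the string but occupies several display columns, so offsets computed on the raw line do not match what the eye sees.

```python
def nonspace_columns(line):
    """Return the offsets of all non-whitespace characters in a line."""
    return [i for i, ch in enumerate(line) if not ch.isspace()]


line = "ab\tcd\tef"                  # hypothetical tab-separated line
raw = nonspace_columns(line)
# str.expandtabs() uses a default tab size of 8
expanded = nonspace_columns(line.expandtabs())

print(raw)        # prints [0, 1, 3, 4, 6, 7]
print(expanded)   # prints [0, 1, 8, 9, 16, 17]
```

Only the expanded offsets line up with the columns as displayed in a fixed-width font, which is what any alignment-based splitter relies on.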
Sep 13 '06 #6
Here you can find an improved version:

http://aspn.activestate.com/ASPN/Coo.../Recipe/498093

Sep 14 '06 #7
