table (ASCII text) layout recognition

vbfoobar

Hello,

I am looking for Python code to process
tables that are in ASCII text. The code must
determine where the columns (fields) are.
The tables in my application vary,
but their columns are not very complicated
for a human to locate, because even
when ignoring the meaning of the words,
our eyes see the vertical alignments.

Here is a sample table (must be viewed
with a fixed-width font to see the alignments):
=================================

44544      ipod          apple     black         102
GFGFHHF-12 unknown thing bizar     brick mortar  tbc
45fjk      do not know   + is less               biac
           disk          seagate   250GB         130
5G_gff                   tbd       tbd
gjgh88hgg  media record  a and b                 12
hjj        foo           bar       hop           zip
hg uy oi   hj uuu ii a   qqq ccc v ZZZ Ughj
qdsd       zert                    nope          nope

=================================

I want Python code that builds a representation
of this table (for example, a list of lists, where each list
represents a table line and each element of the list
is a field value).

Any hints?
thanks

Sep 13 '06 #1

James Stroud

I have to catch a bus, but quickly: the algorithm is to code non-space
as one and space as zero, then 'or' down each character column. Zeros in
the result indicate a high probability of a gap between columns. Code
tomorrow if no one else posts.

Must run...
--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
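
The idea described in this post can be sketched in a few lines. This is only an illustration, not James's code: the function name `column_mask` and the three-line `sample` are made up for the example, and short lines are assumed to be padded on the right with spaces.

```python
def column_mask(lines):
    """OR the 'non-space' flags down each character column.

    Hypothetical helper: returns one 0/1 flag per character position.
    """
    width = max(len(line) for line in lines)
    # Assumption: short lines are padded on the right with spaces.
    padded = [line.ljust(width) for line in lines]
    # 1 = at least one line has a non-space character at this position,
    # 0 = every line is blank here, so it is probably a gap between columns.
    return [int(any(ch != " " for ch in col)) for col in zip(*padded)]


sample = [
    "ab  cd",
    "a   c",
    "abc  d",
]
mask = column_mask(sample)
print("".join(str(bit) for bit in mask))  # prints 111011
```

The single zero at position 3 is the only position that is blank in every line, so it is the likely column separator in this small sample.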

Sep 13 '06 #2

James Stroud

As promised. I call this the "cast a shadow" algorithm for table
discovery. This is about as obfuscated as I could make it. It will be up
to you to explain it to your teacher ;-)

Assuming the lines are all equal width (padded right with space) e.g.:

def rpadd(lines):
    """
    Pass in the lines as a list of lines.
    """
    lines = [line.rstrip() for line in lines]
    maxlen = max([len(line) for line in lines])
    return [line + ' ' * (maxlen - len(line)) for line in lines]

In which case, you can:

# code each character: 2 for a space, 1 for anything else
binary = [[((s==' ' and 2) or 1) for s in line] for line in lines]
# the "shadow": True wherever any line has a non-space at that position
shadow = [1 in c for c in zip(*binary)]

# record each position where the shadow switches on or off
isit = False
indices = []
for i,v in enumerate(shadow):
    if v is not isit:
        indices.append(i)
        isit = not isit

# close the last column
indices.append(i+1)

# pair the switch positions up as (start, stop) slices, one per column
indices = [t for t in zip(indices[::2],indices[1::2])]

columns = [[line[t[0]:t[1]].strip() for line in lines] for t in indices]

In case you want rows:

rows = zip(*columns)
James
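
Since the original question asked for a list of lists, here is a small sketch of the final transposition step. `zip(*columns)` yields tuples (and in Python 3 it returns a lazy iterator rather than a list), so each row is converted explicitly. The `columns` value below is a made-up three-column excerpt used only for illustration, not the output of the code above.

```python
# Hypothetical excerpt: each inner list holds one column, top to bottom.
columns = [
    ["44544", ""],     # first column (empty in the second row)
    ["ipod", "disk"],  # second column
    ["102", "130"],    # last column
]

# Transpose columns into rows and turn each tuple into a list.
rows = [list(row) for row in zip(*columns)]
print(rows)  # [['44544', 'ipod', '102'], ['', 'disk', '130']]
```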

Sep 13 '06 #3

James Stroud

James Stroud wrote:

indices = [t for t in zip(indices[::2],indices[1::2])]

(Artefact of cut-and-paste.)

Make that:

indices = zip(indices[::2],indices[1::2])

James

Sep 13 '06 #4

bearophileHUGS

My version, not much tested. It probably doesn't work well for tables
with few rows. It finds the most frequent word beginnings, and then
splits the data according to them.

data = """\
44544      ipod          apple     black         102
GFGFHHF-12 unknown thing bizar     brick mortar  tbc
45fjk      do not know   + is less               biac
           disk          seagate   250GB         130
5G_gff                   tbd       tbd
gjgh88hgg  media record  a and b                 12
hjj        foo           bar       hop           zip
hg uy oi   hj uuu ii a   qqq ccc v ZZZ Ughj
qdsd       zert                    nope          nope
"""

import re, pprint
# import collections # For Python 2.5+

# RE to find the beginning of words
tpatt = re.compile(r"\b[^ ]")

# Remove empty lines
lines = [li for li in data.splitlines() if li]

# Find the positions of all word beginnings
# This finds: treshs = [0, 11, 25, 35, 49, ...
# 44544      ipod          apple     black         102
# ^          ^             ^         ^             ^
treshs = [ob.start() for li in lines for ob in tpatt.finditer(li)]

# Find treshs frequencies
freqs = {}
for el in treshs:
    freqs[el] = freqs.get(el, 0) + 1

# Find treshs frequencies, alternative for Python 2.5+
# freqs = collections.defaultdict(int)
# for el in treshs:
#     freqs[el] += 1

# Find a big enough frequency
bigf = max(freqs.values()) * 0.6

# Find the most common column beginnings
cols = sorted(k for k, v in freqs.items() if v > bigf)

def xpairs(alist):
    "xpairs(range(n)) ==> (0,1), (1,2), (2,3), ..., (n-2, n-1)"
    for i in range(len(alist) - 1):
        yield alist[i:i+2]

# Slice every line at the column beginnings; the last field runs to the end
result = [[li[x:y].strip() for x, y in xpairs(cols + [None])] for li in lines]

print(data)
pprint.pprint(result)
"""
Output:

44544      ipod          apple     black         102
GFGFHHF-12 unknown thing bizar     brick mortar  tbc
45fjk      do not know   + is less               biac
           disk          seagate   250GB         130
5G_gff                   tbd       tbd
gjgh88hgg  media record  a and b                 12
hjj        foo           bar       hop           zip
hg uy oi   hj uuu ii a   qqq ccc v ZZZ Ughj
qdsd       zert                    nope          nope

[['44544', 'ipod', 'apple', 'black', '102'],
['GFGFHHF-12', 'unknown thing', 'bizar', 'brick mortar', 'tbc'],
['45fjk', 'do not know', '+ is less', '', 'biac'],
['', 'disk', 'seagate', '250GB', '130'],
['5G_gff', '', 'tbd', 'tbd', ''],
['gjgh88hgg', 'media record', 'a and b', '', '12'],
['hjj', 'foo', 'bar', 'hop', 'zip'],
['hg uy oi', 'hj uuu ii a', 'qqq ccc v', 'ZZZ Ughj', ''],
['qdsd', 'zert', '', 'nope', 'nope']]
"""

Bye,
bearophile

Sep 13 '06 #5

Paul McGuire

James -

I used your same algorithm, but I guess I used more brute force (and didn't
use pyparsing, either!).

-- Paul
data = """\
44544      ipod          apple     black         102
GFGFHHF-12 unknown thing bizar     brick mortar  tbc
45fjk      do not know   + is less               biac
           disk          seagate   250GB         130
5G_gff                   tbd       tbd
gjgh88hgg  media record  a and b                 12
hjj        foo           bar       hop           zip
hg uy oi   hj uuu ii a   qqq ccc v ZZZ Ughj
qdsd       zert                    nope          nope""".split('\n')

# find rightmost space characters delimiting text columns
spaceCols = set(range(max(map(len, data)))) - \
    set([col for line in data
             for col, c in enumerate(line.expandtabs())
             if not c.isspace()])
spaceCols -= set([c for c in spaceCols if c+1 in spaceCols])

# convert to sorted list of leading col characters
spaceCols = [c + 1 for c in sorted(spaceCols)]

# get and pretty-print data fields
dataFields = \
    [[line.expandtabs()[start:stop] for (start, stop) in
      zip([0] + spaceCols, spaceCols + [None])] for line in data]
import pprint
pprint.pprint(dataFields)

Gives:

[['44544      ', 'ipod          ', 'apple     ', 'black         ', '102'],
 ['GFGFHHF-12 ', 'unknown thing ', 'bizar     ', 'brick mortar  ', 'tbc'],
 ['45fjk      ', 'do not know   ', '+ is less ', '              ', 'biac'],
 ['           ', 'disk          ', 'seagate   ', '250GB         ', '130'],
 ['5G_gff     ', '              ', 'tbd       ', 'tbd', ''],
 ['gjgh88hgg  ', 'media record  ', 'a and b   ', '              ', '12'],
 ['hjj        ', 'foo           ', 'bar       ', 'hop           ', 'zip'],
 ['hg uy oi   ', 'hj uuu ii a   ', 'qqq ccc v ', 'ZZZ Ughj', ''],
 ['qdsd       ', 'zert          ', '          ', 'nope          ', 'nope']]

Sep 13 '06 #6

bearophileHUGS

Here you can find an improved version:

http://aspn.activestate.com/ASPN/Coo.../Recipe/498093

Sep 14 '06 #7
