Sniffing Text Files

Hi. I have files that I will be importing in at least four different
plain text formats: one is tab-delimited, a couple are token-based
formats that use pipes (but are not pipe-delimited), and another is
XML. There will likely be others as well, but the data needs to be
extracted and rewritten to a single format. The files can be fairly
large (several MB), so I do not want to read the whole file into memory.
What approach would be recommended for sniffing the files for the
different text formats? I realize the csv module has a Sniffer, but it
is more or less limited to delimited files. I have a
couple of ideas on what I could do, but I am interested in hearing from
others on how they might handle something like this so I can determine
the best approach to take. Many thanks.

Regards,
David
Sep 23 '05 #1
David Pratt <fa*******@eastlink.ca> writes:

> Hi. I have files that I will be importing in at least four different
> plain text formats, one of them being tab delimited format, a couple
> being token based uses pipes (but not delimited with pipes), another
> being xml. There will likely be others as well but the data needs to
> be extracted and rewritten to a single format. The files can be fairly
> large (several MB) so I do not want to read the whole file into
> memory. What approach would be recommended for sniffing the files for
> the different text formats. I realize CSV module has a sniffer but it
> is something that is limited more or less to delimited files. I have
> a couple of ideas on what I could do but I am interested in hearing
> from others on how they might handle something like this so I can
> determine the best approach to take. Many thanks.


With GB memory machines being common, I wouldn't think twice about
slurping a couple of meg into RAM to examine. But if that's too much,
how about simply reading in the first <chunk> bytes, and checking that
for the characters you want? <chunk> should be large enough to reveal
what you need, but small enough that you're comfortable reading it
in. I'm not sure that there aren't funny interactions between read and
readline, so do be careful with that.
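
The chunk idea above might look roughly like the sketch below. The function
name sniff_head, the 4096-byte default, and the returned labels are
illustrative assumptions, not anything specified in this thread. It opens the
file in binary mode and calls read() exactly once, which sidesteps any
interaction between read and readline.

```python
def sniff_head(path, chunk_size=4096):
    """Guess a format from the first chunk_size bytes of a file.

    Returns "xml", "tab", "pipe" or "unknown". These labels and the
    default chunk size are assumptions made for illustration.
    """
    # One read() in binary mode: no mixing of read() and readline().
    with open(path, "rb") as fp:
        head = fp.read(chunk_size)

    # Skip a UTF-8 byte order mark and leading whitespace, then look for "<".
    if head.lstrip(b"\xef\xbb\xbf \t\r\n").startswith(b"<"):
        return "xml"

    lines = head.splitlines()
    # If the chunk was filled completely, the last line is probably cut off.
    if len(head) == chunk_size and len(lines) > 1:
        lines = lines[:-1]

    tabs = sum(line.count(b"\t") for line in lines)
    pipes = sum(line.count(b"|") for line in lines)
    if tabs > pipes:
        return "tab"
    if pipes > tabs:
        return "pipe"
    return "unknown"
```

Counting characters this way is crude: a tab-delimited file whose fields
happen to contain many pipes would be misclassified, so treat the result as
a hint rather than a verdict.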

Another approach to consider is libmagic. Google turns up a number of
links to Python wrappers for it.

<mike
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
Sep 23 '05 #2
On Fri, 23 Sep 2005 01:20:49 -0300, David Pratt wrote:

> Hi. I have files that I will be importing in at least four different
> plain text formats, one of them being tab delimited format, a couple
> being token based uses pipes (but not delimited with pipes), another
> being xml. There will likely be others as well but the data needs to be
> extracted and rewritten to a single format. The files can be fairly
> large (several MB) so I do not want to read the whole file into memory.

Why ever not? On modern machines, "several MB" counts as small files. Let
your operating system worry about memory, at least until you get to really
big (several hundred megabytes) files.

> What approach would be recommended for sniffing the files for the
> different text formats.

In no particular order:

(1) Push the problem onto the user: they specify what sort of file they
think it is. If they tell your program the file is XML when it is in fact
a CSV file, your XML importer will report back that the input file is
a broken XML file.

(2) Look at the file extension (.xml, .csv, .txt, etc) and assume that it
is correct. If the user gives you an XML file called "data.csv", you can
hardly be blamed for treating it wrong. This behaviour is more accepted
under Windows than Linux or Macintosh.

(3) Use the Linux command "file" to determine the contents of the file.
There may be equivalents on other OSes.

(4) Write your own simple scanner that tries to determine if the file is
xml, csv, tab-delimited text, etc. A basic example:

(Will need error checking and hardening)

    # Placeholder marker for the token-based formats; adjust to your data.
    SOMETOKEN = "|"

    def sniff(filename):
        """Return one of "xml", "csv", "txt" or "tkn", or "???"
        if it can't decide the file type.
        """
        scores = {"xml": 0, "csv": 0, "txt": 0, "tkn": 0}
        # Iterate over the file object rather than calling readlines(),
        # so the whole file is never held in memory at once.
        with open(filename, "r") as fp:
            for line in fp:
                if not line.strip():
                    continue  # skip blank lines
                if line[0] == "<":
                    scores["xml"] += 1
                if "\t" in line:
                    scores["txt"] += 1
                if "," in line:
                    scores["csv"] += 1
                if SOMETOKEN in line:
                    scores["tkn"] += 1
                # Pick the best guess so far, sorted from highest score down:
                L = sorted(((score, name) for (name, score) in scores.items()),
                           reverse=True)
                best_guess = L[0]
                second_best_guess = L[1]
                # Return early once one format clearly dominates. With few
                # lines scored, this can fire after the very first line.
                if best_guess[0] > 10 * second_best_guess[0]:
                    return best_guess[1]
        return "???"

Note that the above code really isn't good enough for production work, but
it should give you an idea how to proceed.
Hope that helps.
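
Option (1) leans on the importer itself to reject input in the wrong format.
For the XML case, a minimal sketch of such a check with the standard library
could look like this. The function name looks_like_xml is hypothetical, and
the approach (parse incrementally, discard elements as they complete) is one
assumption about how to keep memory use modest on larger files.

```python
import xml.etree.ElementTree as ET

def looks_like_xml(path):
    """Return True if the whole file parses as well-formed XML."""
    try:
        # iterparse reads the file incrementally; by default it yields
        # an ("end", element) pair each time an element is closed.
        for _event, elem in ET.iterparse(path):
            elem.clear()  # drop the element's contents once it has been seen
    except ET.ParseError:
        return False
    return True
```

This reads the entire file, so it is slower than peeking at the first few
lines, but it gives a definite answer about well-formedness.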

--
Steven.

Sep 23 '05 #3
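
Options (2) and (3) from the reply above can be sketched together. The
extension table, the function names, and the fallback behaviour are all
assumptions for illustration. The second function assumes a file(1)
command that accepts the --brief and --mime-type flags, and returns None
when the command is missing or fails.

```python
import os
import shutil
import subprocess

# Hypothetical mapping from file extension to a format label.
EXTENSION_FORMATS = {".xml": "xml", ".csv": "csv", ".tsv": "txt", ".txt": "txt"}

def guess_from_extension(path):
    """Return a format label based on the extension, or None if unknown."""
    ext = os.path.splitext(path)[1].lower()
    return EXTENSION_FORMATS.get(ext)

def guess_with_file_command(path):
    """Ask the external "file" command for a MIME type.

    Returns a string such as "text/plain", or None if the command is
    not installed or exits with an error.
    """
    if shutil.which("file") is None:
        return None
    result = subprocess.run(
        ["file", "--brief", "--mime-type", path],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None
```

The MIME type reported for plain text data is usually too coarse to tell
tab-delimited from token-based files, so this is best used to separate XML
and binary files from everything else before a finer-grained scan.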
