Clean the files using Python

Sez
Hi,

I'm not a programmer. I have just started working as a text miner, and
as a first task I have been given 1000 dirty files that need to be
cleaned before classification. I have been told Python is the best tool
for this job.

Each file is structured as below:

Comments: This is article 1965 obtained from the website
Title: Banana Report #65, September 2003
Author: dylab
Date: 1st September 2003
Section: pulse

In the past month:
A mass hit North America, cutting electricity to 50 million people
across the North east

I'm expected to run a Python script so that the file ends up looking like
this:

pulse, In, the, past, month, A, mass, hit, North, America, cutting,
electricity, to, 50, million, people, across, the, North east, dylab

Could you please point me in the right direction here, or provide some
example code? In the meantime I'll keep searching myself. I know you
guys hate novice people like me, but I would appreciate it if you could
provide a little help here.

Thanks & regards,
Sez

Jul 19 '05 #1


Sez sez:
> Each file is structured as below:
>
> Comments: This is article 1965 obtained from the website
> Title: Banana Report #65, September 2003
> Author: dylab
> Date: 1st September 2003
> Section: pulse
>
> In the past month:
> A mass hit North America, cutting electricity to 50 million people
> across the North east
>
> I'm expected to run a Python script so that the file ends up looking like
> this:
>
> pulse, In, the, past, month, A, mass, hit, North, America, cutting,
> electricity, to, 50, million, people, across, the, North east, dylab

You'll need either more examples or a more detailed description. The
above could be interpreted as something like "put the pulse section
first, then exactly 19 words from the following text, removing
punctuation and line breaks, and taking the last two words together as
one, then add the 'author' field, and write them all out together with a
field separator of ', ' (comma plus space)".
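
To make that first interpretation concrete, here is a minimal sketch in Python. It is only one reading of the request, and several things in it are assumptions: that a blank line separates the header from the body, that a "word" is a run of ASCII letters and digits, and the names clean_article and SAMPLE, which are made up for illustration. It does not try to join the last two words ("North east") into one item, since no rule for that has been given.

```python
import re

# The example file from the question, used here as test input.
SAMPLE = """
Comments: This is article 1965 obtained from the website
Title: Banana Report #65, September 2003
Author: dylab
Date: 1st September 2003
Section: pulse

In the past month:
A mass hit North America, cutting electricity to 50 million people
across the North east
"""


def clean_article(text):
    """Return 'section, word, word, ..., author' for one raw article."""
    # Assumption: the header block ends at the first blank line.
    header, _, body = text.strip().partition("\n\n")

    # Header lines look like "Name: value"; split on the first colon only.
    fields = {}
    for line in header.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            fields[name.strip().lower()] = value.strip()

    # Keeping only runs of letters and digits drops punctuation and
    # line breaks in one step.
    words = re.findall(r"[A-Za-z0-9]+", body)

    return ", ".join([fields.get("section", "")] + words + [fields.get("author", "")])


if __name__ == "__main__":
    print(clean_article(SAMPLE))
```

For the sample file this prints the section, the 19 body words and the author on one line, with "North" and "east" as separate items. The title, date and comments are parsed into the dictionary but never written out, which mirrors the requested output.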

On the other hand, it could be interpreted a large number of other ways,
and since none of us have any idea what you are trying to do with the
results, we can't use our own intuition or experience to help.

I also personally find it hard to respond to questions like this with
real code when there are things about the task which I find very
surprising. For example, you're throwing away the date information
entirely, along with the comments and title. Is that really intended?

And are the author and section fields always exactly one word, with no
punctuation? (What would happen if an author's name was "Hansen,
Peter"? How would you format that in the output without getting the
first name confused with the next field?)
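
On the embedded-comma problem: if the output is meant to be read back by a program, one common way around it is to let Python's standard csv module quote any field that contains the separator. The sketch below assumes a plain "," delimiter (the csv module only accepts a single-character delimiter, so ", " is not an option) and uses a hypothetical helper name, format_row.

```python
import csv
import io


def format_row(values):
    """Return the values as one CSV line, quoting fields that contain commas."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(values)
    return buffer.getvalue().rstrip("\n")


line = format_row(["pulse", "In", "the", "past", "month", "Hansen, Peter"])
print(line)  # pulse,In,the,past,month,"Hansen, Peter"

# Reading the line back recovers the author as a single field.
assert next(csv.reader([line]))[-1] == "Hansen, Peter"
```

Whether quoting like this is acceptable depends on what the classification tool expects as input, which is still an open question.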

> Could you please point me in the right direction here, or provide some
> example code? In the meantime I'll keep searching myself. I know you
> guys hate novice people like me, but I would appreciate it if you could
> provide a little help here.


We don't "hate" novice people by any means... I suspect you are either
trying to be self-deprecating or maybe you just haven't read this
newsgroup for long. comp.lang.python actually *loves* novices; it just
doesn't care for questions that aren't very clear. Keep trying (and improving!)
and you'll definitely get the help you need.

And your comment about Python being the best language for this is pretty
close to the mark... but there are certainly a variety of ways to go
about the task and the best might depend on a lot of unanswered questions.

-Peter
Jul 19 '05 #2

On 8 May 2005 21:55:04 -0700, Sez <se*******@yahoo.com.au> wrote:
> Could you please point me in the right direction here, or provide some
> example code? In the meantime I'll keep searching myself. I know you
> guys hate novice people like me, but I would appreciate it if you could
> provide a little help here.


Oh, we don't hate novices here, not at all. On the other hand, we
aren't going to write your script for you. ;-) Why not take a look at
the Python beginners guide (at
<http://www.python.org/moin/BeginnersGuide>), and come back to us when
you have a specific problem.

--
Cheers,
Simon B,
si***@brunningonline.net,
http://www.brunningonline.net/simon/blog/
Jul 19 '05 #3
