Case tagging and python

Fred Mangusta

Hi,

I'm relatively new to programming in general, and totally new to python,
and I've been told that this language is particularly good for what I
need to do. Let me explain.
I have a large corpus of English text, in the form of several files.

First of all I would like to scan each file. Then, for each word I find,
I'd like to examine its case status, and write the (lower case) word back
to another text file - with, appended, a tag stating the case it had in
the original file.

An example. Suppose we have three possible "case conditions"
-all lowercase
-all uppercase
-initial uppercase only

Three corresponding tags for each of these might be, respectively:
-nocap
-allcaps
-cap

Therefore, given the string

"The Chairman of BP was asleep"

I would like to produce

"the/cap chairman/cap of/nocap bp/allcaps was/nocap asleep/nocap"

and write this into a file.
I have the following algorithm in mind:

-open input file
-open output file
-get line of text
-split line into words
-for each word
    -tag = checkCase(word)
    -newword = lowercase(word) + append(tag)
-rejoin words into line
-write line into output file

Now, I managed to write the following initial code

lines = 0
for s in file:
    lines += 1
    if lines % 1000 == 0:
        print('%d lines' % lines)  # print the total lines read so far
    sent = s.split()  # split string by spaces
    #...
But then I don't quite know what would be the fastest/best way to do
this. Could I use the join function to reform the string? And, regarding
the casetest() function, what do you suggest to do? Should I test each
character of each word, or are there faster methods?

Thanks very much,

F.

Jul 31 '08 #1

bearophileHUGS

Fred Mangusta:

> Could I use the join function to reform the string?

You can write a function to split the words that also takes the
punctuation into account, for example.

> And, regarding the casetest() function, what do you suggest to do?

Python strings have isupper(), islower() and istitle() methods; they
may be enough for your purposes.
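
As a quick illustration (written for Python 3; the sample words are
made up for the purpose), this is roughly how the three methods behave.
They only look at cased characters, so an attached comma does not
change the result, while mixed-case words and numbers fail all three
tests:

```python
# Sample words chosen only for illustration.
samples = ["the", "Chairman", "BP", "asleep,", "McDonald", "1234"]

# Map each word to (islower, isupper, istitle).
results = {w: (w.islower(), w.isupper(), w.istitle()) for w in samples}

for word, flags in results.items():
    print(word, flags)
```

One thing to watch: a single capital letter such as "A" or "I"
satisfies both isupper() and istitle(), so the order in which you test
decides which tag it gets.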

> -open input file
> -open output file
> -get line of text
> -split line into words
> -for each word
>     -tag = checkCase(word)
>     -newword = lowercase(word) + append(tag)
> -rejoin words into line
> -write line into output file

It seems good. To join the words of a line there's str.join. Now you
can write a function that splits lines, and another to check the case,
then you can show them to us.
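
For instance, str.join glues a list of strings together, using the
string it is called on as the separator. A tiny sketch (Python 3; the
tagged words are written by hand here, not produced by a real tagger):

```python
# Hand-written tagged words, standing in for the output of a tagger.
tagged_words = ["the/cap", "chairman/cap", "of/nocap", "bp/allcaps"]

# Join them back into one line, separated by single spaces.
line = " ".join(tagged_words)
print(line)
```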

Still, I don't see how much use your output file can have :-)

Bye,
bearophile

Jul 31 '08 #2

Fred Mangusta

Hi, I came up with the following procedure:

import re

ALLCAPS = "|ALLCAPS"
NOCAPS = "|NOCAPS"
MIDCAPS = "|MIDCAPS"
CAPS = "|CAPS"
DIGIT = "|DIGIT"

def test_case(w):

    w_out = ''

    if w.isalpha():  # only if there is no comma in it
        if w.isupper():
            w_out = w.lower() + ALLCAPS
            return w_out
        elif w.islower():
            w_out = w + NOCAPS
            return w_out
        else:
            m = re.match("^[A-Z]", w)
            if m:
                w_out = w.lower() + CAPS  # not sure about this..
                return w_out
            else:
                w_out = w.lower() + MIDCAPS
                return w_out
    elif w.isdigit():
        w_out = w + DIGIT
        return w_out
    # anything else (e.g. "word,") falls through and returns None

Called from here:
#=========================
lines = 0
for s in file:
    lines += 1
    if lines % 1000 == 0:
        print('%d lines' % lines)
    #s = s.replace(",", "")
    sent = s.split()  # split string by spaces
    for w in sent:
        wout = test_case(w)
#==========================

But I don't know if I'm doing something sensible. Moreover:

- test_case has problems, because whenever it finds a punctuation
character attached to a word, it doesn't tag that word. I was thinking
of cleaning the punctuation out of the line before using split on it
(see the commented-out row), but I don't know whether I have to call
replace() once for every punctuation character.
- Is there a way to write the tagged text to a file including the punctuation?
- Is my test_case a good start? Would you use regular expressions?

Thanks very much!
F.

Jul 31 '08 #3
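
One possible way to approach the punctuation questions above, offered
only as a sketch: instead of calling replace() once per punctuation
character, let a regular expression split each line into word tokens
and punctuation tokens, tag the words, and pass the punctuation through
untouched. The names TOKEN_RE, tag_token and tag_line are made up for
this example, the pattern assumes plain ASCII English text, and the
output separates every token with a single space rather than
reproducing the original spacing.

```python
import re

# A token is either a run of letters/digits/apostrophes (a "word")
# or a single character that is neither whitespace nor part of a word.
TOKEN_RE = re.compile(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']")

def tag_token(token):
    """Return the token lower-cased plus a case tag; punctuation is unchanged."""
    if token.isdigit():
        return token + "|DIGIT"
    if not any(ch.isalpha() for ch in token):
        return token  # punctuation: keep as it is, no tag
    if token.isupper():
        return token.lower() + "|ALLCAPS"
    if token.islower():
        return token + "|NOCAPS"
    if token[0].isupper() and token[1:].islower():
        return token.lower() + "|CAPS"
    return token.lower() + "|MIDCAPS"

def tag_line(line):
    """Tag every token of a line and join the results with single spaces."""
    return " ".join(tag_token(t) for t in TOKEN_RE.findall(line))

print(tag_line("The Chairman of BP, aged 52, was asleep."))
```

With this ordering a mixed-case word such as "iPod" or "McDonald" ends
up as MIDCAPS, which differs from the "^[A-Z]" test in the post above;
whether that is the better choice depends on what the tags are for.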

chrispoliquin

I second the idea of just using the islower(), isupper(), and
istitle() methods.
So, you could have a function - let's call it checkCase() - that
returns a string with the tag you want...

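def checkCase(word):

    if word.islower():
        tag = 'nocap'
    elif word.isupper():
        tag = 'allcaps'
    elif word.istitle():
        tag = 'cap'
    else:
        # Fallback so that tag is always defined (e.g. "McDonald", "1234").
        # The name 'other' is only a placeholder; pick whatever tag suits you.
        tag = 'other'

    return tag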

Then let's take an input file and pass every word through the
function...

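f = open('path:to:input_file', 'r')  # placeholder path
corpus_text = f.read()
f.close()

tagged_corpus = ''
all_words = corpus_text.split()

for w in all_words:
    tagtext = checkCase(w)
    # lower-case the word, as the original question asks
    tagged_corpus = tagged_corpus + ' ' + w.lower() + '/' + tagtext

output_file = open('path:to:output_file', 'w')  # placeholder path, not the input file
output_file.write(tagged_corpus)
output_file.close()
print('All Done!')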
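
If the corpus is large, it may be worth processing it one line at a
time instead of reading the whole file and growing one big string. The
following is a sketch of that variant for Python 3, following the
algorithm from the first post. The names check_case, tag_line and
tag_file, the 'other' fallback tag and the UTF-8 encoding are
assumptions made for the example.

```python
def check_case(word):
    """Return a case tag for a single whitespace-delimited word."""
    if word.islower():
        return "nocap"
    if word.isupper():
        return "allcaps"
    if word.istitle():
        return "cap"
    return "other"  # hypothetical fallback for mixed case, digits, etc.

def tag_line(line):
    """Lower-case each word, append its tag, and rejoin with spaces."""
    return " ".join(w.lower() + "/" + check_case(w) for w in line.split())

def tag_file(in_path, out_path):
    """Read in_path line by line and write the tagged lines to out_path."""
    # UTF-8 is assumed here; adjust the encoding to match your corpus.
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(tag_line(line) + "\n")

print(tag_line("The Chairman of BP was asleep"))
```

Calling tag_file once per input file, with your own input and output
paths, would cover a corpus spread over several files. The with
statement closes both files even if something goes wrong halfway.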

Also, if you're doing natural language processing in Python, you
should get NLTK.

Jul 31 '08 #4
