splitting a long string into a list

I have a single long string - I'd like to split it into a list of
unique keywords. Sadly, the database wasn't designed to do this, so I
must do this in Python - I'm having some trouble using the .split()
function, it doesn't seem to do what I want it to - any ideas?

thanks very much for your help.

r-sr-
longstring = 'Agricultural subsidies; Foreign aidAgriculture;
Sustainable Agriculture - Support; Organic Agriculture; Pesticides, US,
Childhood Development, Birth Defects; Toxic ChemicalsAntibiotics,
AnimalsAgricultural Subsidies, Global TradeAgricultural
SubsidiesBiodiversityCitizen ActivismCommunity
GardensCooperativesDietingAgriculture, CottonAgriculture, Global
TradePesticides, MonsantoAgriculture, SeedCoffee, HungerPollution,
Water, FeedlotsFood PricesAgriculture, WorkersAnimal Feed, Corn,
PesticidesAquacultureChemical
WarfareCompostDebtConsumerismFearPesticides, US, Childhood Development,
Birth DefectsCorporate Reform, Personhood (Dem. Book)Corporate Reform,
Personhood, Farming (Dem. Book)Crime Rates, Legislation,
EducationDebt, Credit CardsDemocracyPopulation, WorldIncomeDemocracy,
Corporate Personhood, Porter Township (Dem. Book)Disaster
ReliefDwellings, SlumsEconomics, MexicoEconomy, LocalEducation,
ProtestsEndangered Habitat, RainforestEndangered SpeciesEndangered
Species, Extinctionantibiotics, livestockAgricultural subsidies;
Foreign aid;Agriculture; Sustainable Agriculture - Support; Organic
Agriculture; Pesticides, US, Childhood Development, Birth Defects;
Toxic Chemicals;Antibiotics, Animals;Agricultural Subsidies, Global
Trade;Agricultural Subsidies;Biodiversity;Citizen Activism;Community
Gardens;Cooperatives;Dieting;Agriculture, Cotton;Agriculture, Global
Trade;Pesticides, Monsanto;Agriculture, Seed;Coffee, Hunger;Pollution,
Water, Feedlots;Food Prices;Agriculture, Workers;Animal Feed, Corn,
Pesticides;Aquaculture;Chemical
Warfare;Compost;Debt;Consumerism;Fear;Pesticides, US, Childhood
Development, Birth Defects;Corporate Reform, Personhood (Dem.
Book);Corporate Reform, Personhood, Farming (Dem. Book);Crime Rates,
Legislation, Education;Debt, Credit Cards;Democracy;Population,
World;Income;Democracy, Corporate Personhood, Porter Township (Dem.
Book);Disaster Relief;Dwellings, Slums;Economics, Mexico;Economy,
Local;Education, Protests;Endangered Habitat, Rainforest;Endangered
Species;Endangered Species, Extinction;antibiotics,
livestock;Pesticides, Water;Environment, Environmentalist;Food, Hunger,
Agriculture, Aid, World, Development;Agriculture, Cotton
Trade;Agriculture, Cotton, Africa;Environment, Energy;Fair Trade (Dem.
Book);Farmland, Sprawl;Fast Food, Globalization, Mapping;depression,
mental illness, mood disorders;Economic Democracy, Corporate
Personhood;Brazil, citizen activism, hope, inspiration, labor
issues;citizen activism, advice, hope;Pharmaceuticals, Medicine,
Drugs;Community Investing;Environment, Consumer Waste Reduction,
Consumer Behavior and Taxes;Hunger, US, Poverty;FERTILITY,
Women;Agricultural subsidies; Foreign aid;Agriculture; Sustainable
Agriculture - Support; Organic Agriculture; Pesticides, US, Childhood
Development, Birth Defects; Toxic Chemicals;Antibiotics,
Animals;Agricultural Subsidies, Global Trade;Agricultural
Subsidies;Biodiversity;Citizen Activism;Community
Gardens;Cooperatives;Dieting;Agricultural subsidies; Foreign
aid;Agriculture; Sustainable Agriculture - Support; Organic
Agriculture; Pesticides, US, Childhood Development, Birth Defects;
Toxic Chemicals;Antibiotics, Animals;Agricultural Subsidies, Global
Trade;Agricultural Subsidies;Biodiversity;Citizen Activism;Community
Gardens;Cooperatives;Dieting;Agriculture, Cotton;Agriculture, Global
Trade;Pesticides, Monsanto;Agriculture, Seed;Coffee, Hunger;Pollution,
Water, Feedlots;Food Prices;Agriculture, Workers;Animal Feed, Corn,
Pesticides;Aquaculture;Chemical
Warfare;Compost;Debt;Consumerism;Fear;Pesticides, US, Childhood
Development, Birth Defects;Corporate Reform, Personhood (Dem.
Book);Corporate Reform, Personhood, Farming (Dem. Book);Crime Rates,
Legislation, Education;Debt, Credit Cards;'

Nov 28 '06 #1
What exactly seems to be the problem?

"ronrsr" <ro****@gmail.com> wrote in message
news:11*********************@14g2000cws.googlegroups.com...
> I have a single long string - I'd like to split it into a list of
> unique keywords. Sadly, the database wasn't designed to do this, so I
> must do this in Python - I'm having some trouble using the .split()
> function, it doesn't seem to do what I want it to - any ideas?
>
> thanks very much for your help.
>
> r-sr-

Nov 28 '06 #2
ronrsr wrote:
> I have a single long string - I'd like to split it into a list of
> unique keywords. Sadly, the database wasn't designed to do this, so I
> must do this in Python - I'm having some trouble using the .split()
> function, it doesn't seem to do what I want it to - any ideas?

Did you follow the recommendations given to you the last time you asked this
question? What did you try? What results do you want to get?

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Nov 28 '06 #3
still having a heckuva time with this.

here's where it stands - the split function doesn't seem to work the way
I expect it to.
longkw1,type(longkw): Agricultural subsidies; Foreign
aid;Agriculture; Sustainable Agriculture - Support; Organic
Agriculture; Pesticides, US, Childhood Development, Birth Defects;
<type 'list'1

longkw.replace(',',';')

Agricultural subsidies; Foreign aid;Agriculture; Sustainable
Agriculture - Support; Organic Agriculture; Pesticides, US, Childhood
Development
kw = longkw.split("; ,") #kw is now a list of len 1

kw,typekw= ['Agricultural subsidies; Foreign aid;Agriculture;
Sustainable Agriculture - Support; Organic Agriculture; Pesticides, US,
Childhood Development, Birth Defects; Toxic Chemicals;Antibiotics,
Animals;Agricultural Subsidies
what I would like is to break the string into a list of the delimited
words, but have had no luck doing that - I thought split would do that,
but it doesn't.

bests,

-rsr-
Robert Kern wrote:
> ronrsr wrote:
>> I have a single long string - I'd like to split it into a list of
>> unique keywords. Sadly, the database wasn't designed to do this, so I
>> must do this in Python - I'm having some trouble using the .split()
>> function, it doesn't seem to do what I want it to - any ideas?
>
> Did you follow the recommendations given to you the last time you asked this
> question? What did you try? What results do you want to get?
>
> --
> Robert Kern
>
> "I have come to believe that the whole world is an enigma, a harmless enigma
> that is made terrible by our own mad attempt to interpret it as though it had
> an underlying truth."
> -- Umberto Eco
Nov 28 '06 #4
"ronrsr" <ro****@gmail.com> wrote:
> I have a single long string - I'd like to split it into a list of
> unique keywords. Sadly, the database wasn't designed to do this, so I
> must do this in Python - I'm having some trouble using the .split()
> function, it doesn't seem to do what I want it to - any ideas?
>
> thanks very much for your help.
>
> r-sr-
> longstring = 'Agricultural subsidies; Foreign aidAgriculture;
> Sustainable Agriculture - Support; Organic Agriculture; Pesticides, US,
> Childhood Development, Birth Defects; Toxic ChemicalsAntibiotics,
> AnimalsAgricultural Subsidies, Global TradeAgricultural
> SubsidiesBiodiversityCitizen ActivismCommunity...

What do you want out of this? It looks like there are several levels
crammed together here. At first blush, it looks like topics separated by
"; ", so this should get you started:

topics = longstring.split("; ")
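
A quick sketch of what that first split does, using a short made-up sample in the style of the posted string (the name `sample` and its contents are assumptions, not the real data). Note that `split("; ")` only matches a semicolon followed by a space, so a bare `;` stays inside its piece:

```python
# Hypothetical short sample in the style of the original string.
sample = "Agricultural subsidies; Foreign aid;Agriculture; Organic Agriculture"

# split() treats the whole argument "; " as one delimiter.
topics = sample.split("; ")

for topic in topics:
    print(repr(topic))
```

The middle piece still contains `;` because that semicolon has no space after it, which is probably one reason the result looked wrong.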
--
Tim Roberts, ti**@probo.com
Providenza & Boekelheide, Inc.
Nov 28 '06 #5

ronrsr wrote:
> I have a single long string - I'd like to split it into a list of
> unique keywords. Sadly, the database wasn't designed to do this, so I
> must do this in Python - I'm having some trouble using the .split()
> function, it doesn't seem to do what I want it to - any ideas?
>
> thanks very much for your help.
>
> r-sr-
> longstring = 'Agricultural subsidies; Foreign aidAgriculture;
> Sustainable Agriculture - Support; Organic Agriculture; Pesticides, US,
> [snip most of VERY long string]
> Book);Corporate Reform, Personhood, Farming (Dem. Book);Crime Rates,
> Legislation, Education;Debt, Credit Cards;'

Hi ronster,

As far as I recall, without digging in the archives:

We would probably agree (if shown the schema) that the database wasn't
designed. However it seems to have changed. Last time you asked, it
was at least queryable and producing rows, each containing one column
(a string of structure unknown to us and not divulged by you). You were
given extensive advice: how to use split(), plus some questions to
answer about the data e.g. the significance (if any) of semicolon
versus comma. You were also asked about the SQL that was used. You were
asked to explain what you meant by "keywords". All of those questions
were asked so that we could understand your problem, and help you.
Since then, nothing.

Now you have what appears to be something like your previous results
stripped of newlines and smashed together (are the newlines of no
significance at all?), and you appear to be presenting it as a new
problem.

What's going on?

Regards,
John

Nov 28 '06 #6
ronrsr wrote:
> still having a heckuva time with this.

You don't seem to get it.

> here's where it stands - the split function doesn't seem to work the way
> I expect it to.
> longkw1,type(longkw): Agricultural subsidies; Foreign
> aid;Agriculture; Sustainable Agriculture - Support; Organic
> Agriculture; Pesticides, US, Childhood Development, Birth Defects;
> <type 'list'1
>
> longkw.replace(',',';')

>>> sample = "eat, drink; man, woman"
>>> sample.replace(";", ",")
'eat, drink, man, woman'
>>> sample
'eat, drink; man, woman'

Aha, Python doesn't replace in place, it creates a new string instead.

> Agricultural subsidies; Foreign aid;Agriculture; Sustainable
> Agriculture - Support; Organic Agriculture; Pesticides, US, Childhood
> Development
> kw = longkw.split("; ,") #kw is now a list of len 1

>>> sample = "eat+-drink+man-woman"
>>> sample.split("+-")
['eat', 'drink+man-woman']
>>> sample.split("+")
['eat', '-drink', 'man-woman']

Aha, Python interprets the complete split() argument as the delimiter, not
each of its characters.

Do you think you can combine these two findings to make your code work? You
will have to replace() first and then split().
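
For anyone reading along later, here is one way the two findings might be combined. It is only a sketch: the sample string is made up, and stripping whitespace and dropping empty pieces are my own additions rather than anything the original post asked for.

```python
# Hypothetical sample; the real longkw comes from the database.
longkw = "Pesticides, US, Birth Defects; Toxic Chemicals;Antibiotics, Animals;"

# replace() returns a new string, so the result has to be kept.
unified = longkw.replace(",", ";")

# Split on the single remaining delimiter, trim whitespace,
# and drop the empty piece left by the trailing semicolon.
kw = [piece.strip() for piece in unified.split(";") if piece.strip()]

print(kw)
```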

Peter
Nov 28 '06 #7
ronrsr wrote:
> still having a heckuva time with this.
>
> here's where it stands - the split function doesn't seem to work the way
> I expect it to.
> longkw1,type(longkw): Agricultural subsidies; Foreign
> aid;Agriculture; Sustainable Agriculture - Support; Organic
> Agriculture; Pesticides, US, Childhood Development, Birth Defects;
> <type 'list'1
>
> longkw.replace(',',';')
>
> Agricultural subsidies; Foreign aid;Agriculture; Sustainable
> Agriculture - Support; Organic Agriculture; Pesticides, US, Childhood
> Development

Here you have discovered that string.replace() returns a string and does
NOT modify the original string. Try this for clarification:

>>> a = "DAWWIJFWA,,,,,,dwadw;djwkajdw"
>>> a
'DAWWIJFWA,,,,,,dwadw;djwkajdw'
>>> a.replace(",",";")
'DAWWIJFWA;;;;;;dwadw;djwkajdw'
>>> a
'DAWWIJFWA,,,,,,dwadw;djwkajdw'
>>> b = a.replace(',',';')
>>> b
'DAWWIJFWA;;;;;;dwadw;djwkajdw'

> kw = longkw.split("; ,") #kw is now a list of len 1

Yes, because it is trying to split longkw wherever it finds the whole
string "; ," and NOT wherever it finds ";" or " " or ",". This has been
stated before by NickV, Duncan Booth, Fredrik Lundh and Paul McGuire
amongst others. You will need to do either:

a.)

# First split on every semicolon
a = longkw.split(";")
b = []
# Then split those results on whitespace
# (the default action for string.split())
for item in a:
    # extend() adds each word to b; append() would nest a list inside b
    b.extend(item.split())
# Then split on commas
kw = []
for item in b:
    kw.extend(item.split(","))

or b.)

# First replace commas with spaces
longkw = longkw.replace(",", " ")
# Then replace semicolons with spaces
longkw = longkw.replace(";", " ")
# Then split on white space, (default args)
kw = longkw.split()

Note that we did:
    longkw = longkw.replace(",", " ")
and not just:
    longkw.replace(",", " ")
You will find that method A may give empty strings as some elements of
kw. If so, use method b.
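
One caveat, offered as a hedged sketch rather than a correction: splitting on whitespace turns a multi-word keyword such as "Organic Agriculture" into two entries. If whole phrases are wanted, and duplicates should be dropped as the original question suggests, something like the following might be closer. The sample string is made up, and the choice to keep the first occurrence of each phrase is an assumption.

```python
# Hypothetical sample with a repeated keyword.
longkw = "Organic Agriculture; Pesticides, US;Organic Agriculture;Food Prices, US"

# Method b from above: every word becomes its own entry.
words = longkw.replace(",", " ").replace(";", " ").split()

# Variant: turn commas into semicolons, split on ";" only, then trim.
phrases = [p.strip() for p in longkw.replace(",", ";").split(";") if p.strip()]

# dict.fromkeys keeps the first occurrence of each phrase, in order
# (insertion order is guaranteed from Python 3.7 on).
unique_phrases = list(dict.fromkeys(phrases))

print(words)
print(unique_phrases)
```

Whether "Agricultural subsidies" and "Agricultural Subsidies" should count as the same keyword is a separate question; this sketch treats them as different.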
Finally, if you have further problems, please please do the following:

1.) Provide your input data clearly, exactly as you have it.
2.) Show exactly what you want the output to be, including any special
cases.
3.) If something doesn't work the way you expect it to, tell us how you
expect it to work so we know what you mean by "doesn't work how I expect
it to"
4.) Read all the replies carefully and if you don't understand the
reply, ask for clarification.
5.) Read the help functions carefully - what the input parameters have
to be and what the return value will be, and whether or not it changes
the parameters or original object. Strings in Python are NOT mutable, so
any functions that operate on strings return the result as a new
string and leave the original string intact.

I really hope this helps,

Cameron.
Nov 28 '06 #8
ronrsr wrote:
> still having a heckuva time with this.
>
> here's where it stands - the split function doesn't seem to work the way
> I expect it to.
> longkw1,type(longkw): Agricultural subsidies; Foreign
> aid;Agriculture; Sustainable Agriculture - Support; Organic
> Agriculture; Pesticides, US, Childhood Development, Birth Defects;
> <type 'list'1
>
> longkw.replace(',',';')
>
> Agricultural subsidies; Foreign aid;Agriculture; Sustainable
> Agriculture - Support; Organic Agriculture; Pesticides, US, Childhood
> Development
> kw = longkw.split("; ,") #kw is now a list of len 1
>
> kw,typekw= ['Agricultural subsidies; Foreign aid;Agriculture;
> Sustainable Agriculture - Support; Organic Agriculture; Pesticides, US,
> Childhood Development, Birth Defects; Toxic Chemicals;Antibiotics,
> Animals;Agricultural Subsidies
> what I would like is to break the string into a list of the delimited
> words, but have had no luck doing that - I thought split would do that,
> but it doesn't.
>
> bests,
>
> -rsr-

>>> import SE  # http://cheeseshop.python.org/pypi/SE/2.3
>>> Split_Marker = SE.SE (' ,=| ;=| ')  # Translates both ',' and ';' into an arbitrary split mark ('|')
>>> for item in Split_Marker (longstring).split ('|'): print item
Agricultural subsidies
Foreign aidAgriculture
Sustainable Agriculture - Support
Organic Agriculture

.... etc.

To get rid of the leading space on some lines simply add
corresponding replacements. SE does any number of substitutions in one
pass. Defining them is a simple matter of writing them up in one single
string from which the translator object is made:

>>> Split_Marker = SE.SE (' ,=| ;=| ", =|" "; =|" ')
>>> for item in Split_Marker (longstring).split ('|'): print item
Agricultural subsidies
Foreign aidAgriculture
Sustainable Agriculture - Support
Organic Agriculture
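
If installing SE is not an option, a similar one-pass split can probably be done with the standard library's `re.split`. This is a swapped-in technique, not how SE works internally, and the sample string and the name `items` are made up for illustration:

```python
import re

# Hypothetical sample in the style of the original string.
longstring = "Agricultural subsidies; Foreign aid;Agriculture, Cotton"

# One pattern covers both delimiters plus any whitespace around them.
# Empty pieces (for example from a trailing delimiter) are dropped.
items = [item for item in re.split(r"\s*[;,]\s*", longstring) if item]

for item in items:
    print(item)
```
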
Regards

Frederic
Nov 29 '06 #9
