Regexp problem with `('

I have the following text

<title>Goods Item 146 (174459989) - OurWebSite</title>

from which I need to extract
`Goods Item 146 '

Can anyone help with regexp?
Thank you for help
L.

Mar 22 '07 #1
Johny wrote:
I have the following text

<title>Goods Item 146 (174459989) - OurWebSite</title>

from which I need to extract
`Goods Item 146 '

Can anyone help with regexp?
Sure: the documentation is here:
http://docs.python.org/lib/module-re.html

And there's a nice tutorial here:
http://www.amk.ca/python/howto/regex/

Read all this, try to solve your problem, and come back with what you've
done so far if you need more help.
Thank you for help
You're welcome.
Mar 22 '07 #2
On Thu, Mar 22, 2007 at 01:26:22AM -0700, Johny wrote:
I have the following text

<title>Goods Item 146 (174459989) - OurWebSite</title>

from which I need to extract
`Goods Item 146 '

Can anyone help with regexp?
Thank you for help
L.
(Goods\s+Item\s+146\s+)

--
Zeng Nan

MY BLOG: http://zengnan.blogspot.com
Public Key: http://pgp.mit.edu/ | www.keyserver.net


Mar 22 '07 #3
Zeng Nan wrote:
On Thu, Mar 22, 2007 at 01:26:22AM -0700, Johny wrote:
>I have the following text

<title>Goods Item 146 (174459989) - OurWebSite</title>

from which I need to extract
`Goods Item 146 '

Can anyone help with regexp?
Thank you for help
L.

(Goods\s+Item\s+146\s+)

[snigger]

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
Recent Ramblings http://holdenweb.blogspot.com

Mar 22 '07 #4
On Mar 22, 3:26 am, "Johny" <pyt...@hope.cz> wrote:
I have the following text

<title>Goods Item 146 (174459989) - OurWebSite</title>

from which I need to extract
`Goods Item 146 '

Can anyone help with regexp?
Thank you for help
L.
Here's the immediate answer to your question.

import re
src = "<title>Goods Item 146 (174459989) - OurWebSite</title>"
pattern = r"<title>(.*)\("
re.search(pattern, src).groups()[0]

I post it this way so that you can relate the re to your specific
question, and then maybe apply this to whatever else you are scraping
from this web page.
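A minimal sketch of the same idea wrapped in a helper, for cases where the title might not contain a parenthesis at all. The names `TITLE_PREFIX` and `extract_prefix` are hypothetical, and the non-greedy `.*?` is a small variation on the pattern above so that the match stops at the first "(" rather than the last one.

```python
import re

# Capture everything between "<title>" and the first "(" (non-greedy).
TITLE_PREFIX = re.compile(r"<title>(.*?)\(")


def extract_prefix(src):
    """Return the title text before the first '(', or None if there is no match."""
    match = TITLE_PREFIX.search(src)
    return match.group(1) if match else None


src = "<title>Goods Item 146 (174459989) - OurWebSite</title>"
print(repr(extract_prefix(src)))                    # 'Goods Item 146 '
print(extract_prefix("<title>No parens</title>"))   # None
```

Checking the match object before calling `group()` avoids the `AttributeError` that a bare `re.search(...).groups()` raises when nothing matches.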

Please don't follow up with a post asking how to extract "45","Rubber
chicken" from "<tr><td>45</td><td>Rubber chicken</td></tr>". At this
point, you should try a little experimentation on your own.

-- Paul

Mar 22 '07 #5
Johny wrote:
I have the following text

<title>Goods Item 146 (174459989) - OurWebSite</title>

from which I need to extract
`Goods Item 146 '

Can anyone help with regexp?
Thank you for help
L.
In general, parsing HTML with regular expressions is a bad idea.
Usually, you use something like BeautifulSoup to parse the HTML,
extract the desired field, like the contents of "<title>", then
work on that.

If you try to do this line by line with regular expressions,
it will fail when the line breaks aren't where you expect. If
you try to do a whole document with regular expressions, other
material such as content in comments can be misrecognized.
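Both failure modes can be reproduced with a short sketch. The two input strings here are made-up variations on the title from the question, not real pages.

```python
import re

pattern = re.compile(r"<title>(.*)\(")

# A title broken across two lines: "." does not match a newline by default,
# so the pattern never reaches the "(" on the second line.
broken = "<title>Goods Item 146\n(174459989) - OurWebSite</title>"
print(pattern.search(broken))  # None

# A commented-out title ahead of the real one is found first.
commented = (
    "<!-- <title>Old Item 7 (1)</title> -->\n"
    "<title>Goods Item 146 (174459989) - OurWebSite</title>"
)
print(repr(pattern.search(commented).group(1)))  # 'Old Item 7 '
```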

Try something like this:

import re
import BeautifulSoup  # BeautifulSoup 3 module layout, which the call below assumes

# Regular expression to extract group before "(NNNNN)"
kreextractitem = re.compile(r'^(.*)\(\d+\)')
pagetree = BeautifulSoup.BeautifulSoup(stringcontaininghtml)
titleitem = pagetree.find({'title':True, 'TITLE':True})
if titleitem :
    # Join all text nodes found under the title tag
    titletext = " ".join(titleitem.findAll(text=True, recursive=True))
    # Text of TITLE item is now in "titletext" as a string.
    groups = kreextractitem.search(titletext)
    if groups :
        goodsitem = groups.group(1).strip()
        # "goodsitem" now contains everything before "(NNNN)"

This approach will work no matter where the line breaks are in the original
HTML.
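For readers without BeautifulSoup installed, the same parse-first, regex-second approach can be sketched with the standard library's `html.parser`. This swaps in a different parser from the one used above; `TitleCollector`, `ITEM_RE`, `goods_item`, and the sample `page` string are hypothetical names and data chosen for illustration.

```python
import re
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TITLE> arrives as "title".
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


# Everything before a parenthesised run of digits, as in the post above.
ITEM_RE = re.compile(r"^(.*)\(\d+\)")


def goods_item(html):
    """Return the title text before "(NNNN)", or None if it is absent."""
    collector = TitleCollector()
    collector.feed(html)
    collector.close()
    # Collapse line breaks and repeated spaces before applying the regex.
    title_text = " ".join("".join(collector.parts).split())
    match = ITEM_RE.search(title_text)
    return match.group(1).strip() if match else None


page = (
    "<html><head><!-- <title>Old (1)</title> -->\n"
    "<TITLE>Goods Item 146\n(174459989) - OurWebSite</TITLE></head></html>"
)
print(goods_item(page))  # Goods Item 146
```

The commented-out title is handed to the parser's comment handler rather than treated as a tag, and the line break inside the real title is normalised away before the regular expression runs.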

John Nagle
Mar 22 '07 #6
