
Scrape info from web: the .text problem

P: 18
Hi everyone, I want to scrape something from
http://search.dangdang.com/search_pub.php?key=python
My code is:
import urllib
import lxml.html
down = 'http://search.dangdang.com/search_pub.php?key=python'
file = urllib.urlopen(down).read()
root = lxml.html.fromstring(file)
tnodes = root.xpath("//div[@class='listitem detail']//li[@class='maintitle']//a")
for i, x in enumerate(tnodes):
    print i, "  ", x.get('name'), x.get('href'), x.get('onclick'), x.text, "\n"

The output is:
0 p_name http://product.dangdang.com/product.aspx?product_id=20872365&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','',' ','20872365_1_22591_p','','',''); None

1 p_name http://product.dangdang.com/product.aspx?product_id=20255354&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','',' ','20255354_2_12605_p','','',''); None

2 p_name http://product.dangdang.com/product.aspx?product_id=20836565&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','',' ','20836565_3_2361_p','','',''); None

3 p_name http://product.dangdang.com/product.aspx?product_id=21004615&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','',' ','21004615_4_3387_p','','',''); None

4 p_name http://product.dangdang.com/product.aspx?product_id=21063086&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','',' ','21063086_5_18815_p','','',''); None

5 pr_name http://product.dangdang.com/product.aspx?product_id=20678461&ref=search-1-pub s('click','python','01.54.04.03,01.54.06.18','','8 6_1_25','','','20678461_6_3967_p','','','RECO'); None

6 pr_name http://product.dangdang.com/product.aspx?product_id=20650363&ref=search-1-pub s('click','python','01.54.19.00','','86_1_25','',' ','20650363_7_62_p','','','RECO'); 黑客之道:漏洞发掘的艺术(原书第二版)(赠1CD)(电子制品CD-ROM)(

7 pr_name http://product.dangdang.com/product.aspx?product_id=20767932&ref=search-1-pub s('click','python','01.54.19.00','','86_1_25','',' ','20767932_8_4475_p','','','RECO'); Binary Hacks――黑客秘笈100选

8 p_name http://product.dangdang.com/product.aspx?product_id=20596189&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','',' ','20596189_9_639_p','','',''); None

9 p_name http://product.dangdang.com/product.aspx?product_id=20947680&ref=search-1-pub s('click','python','01.54.24.00,01.54.06.18','','8 6_1_25','','','20947680_10_7295_p','','',''); None

10 p_name http://product.dangdang.com/product.aspx?product_id=21050368&ref=search-1-pub s('click','python','01.54.19.00','','86_1_25','',' ','21050368_11_7039_p','','',''); None

11 p_name http://product.dangdang.com/product.aspx?product_id=20667966&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','',' ','20667966_12_383_p','','',''); None

12 p_name http://product.dangdang.com/product.aspx?product_id=21022493&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','',' ','21022493_13_5183_p','','',''); None

13 pr_name http://product.dangdang.com/product.aspx?product_id=479654&ref=search-1-pub s('click','python','01.54.06.08,01.54.06.18','','8 6_1_25','','','479654_14_2095_p','','','RECO'); Perl语言编程(第三版)

14 pr_name http://product.dangdang.com/product.aspx?product_id=20999855&ref=search-1-pub s('click','python','01.54.10.00','','86_1_25','',' ','20999855_15_6715_p','','','RECO'); 程序员的思维修炼:开发认知潜能的九堂课

15 pr_name http://product.dangdang.com/product.aspx?product_id=20696203&ref=search-1-pub s('click','python','01.54.06.08','','86_1_25','',' ','20696203_16_31615_p','','','RECO'); Perl语言入门(第五版)(原书名:Learning Perl,5/e)

16 p_name http://product.dangdang.com/product.aspx?product_id=20670643&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','',' ','20670643_17_24_p','','',''); 可爱的

17 p_name http://product.dangdang.com/product.aspx?product_id=20362210&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','',' ','20362210_18_32_p','','',''); 学习

18 p_name http://product.dangdang.com/product.aspx?product_id=9053236&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','',' ','9053236_19_4_p','','',''); 学习

19 p_name http://product.dangdang.com/product.aspx?product_id=20850780&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','',' ','20850780_20_1055_p','','',''); None

20 pr_name http://product.dangdang.com/product.aspx?product_id=20449068&ref=search-1-pub s('click','python','01.54.06.08','','86_1_25','',' ','20449068_21_38_p','','','RECO'); 精通Perl

21 p_name http://product.dangdang.com/product.aspx?product_id=21127816&ref=search-1-pub s('click','python','01.54.24.00,01.54.06.18','','8 6_1_25','','','21127816_22_12545_p','','',''); None

22 p_name http://product.dangdang.com/product.aspx?product_id=21107633&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','',' ','21107633_23_19245_p','','',''); Hadoop权威指南(第2版)修订升级版

23 None http://bang.dangdang.com/product_redirect.php?product_id=9317290 None None

24 p_name http://product.dangdang.com/product.aspx?product_id=9317290&ref=search-1-pub s('click','python','01.54.06.06,01.49.01.11,01.54. 26.00','','86_1_25','','','9317290_24_81727_p','', '',''); Java编程思想(第4版)

25 p_name http://product.dangdang.com/product.aspx?product_id=20773186&ref=search-1-pub s('click','python','01.54.06.17','','86_1_25','',' ','20773186_25_80479_p','','',''); Android应用开发揭秘

The problem is x.text. For example:

1.
<a name="p_name" target="_blank" href="http://product.dangdang.com/product.aspx?product_id=20872365&ref=search-1-pub" onclick="s('click','python','01.54.06.18','','86_1_25','','','20872365_1_22591_p','','','');">
<font class="skcolor_ljg">Python</font>
基础教程(第2版)
</a>
What I want to get is "Python 基础教程(第2版)", but the output is None.

2.
<a name="p_name" target="_blank" href="http://product.dangdang.com/product.aspx?product_id=20670643&ref=search-1-pub" onclick="s('click','python','01.54.06.18','','86_1_25','','','20670643_17_24_p','','','');">
可爱的
<font class="skcolor_ljg">Python</font>
</a>
What I want to get is "可爱的python", but the output is 可爱的.

Could you tell me how to revise my code?
Aug 6 '11 #1

✓ answered by dwblas

2 Replies


P: 10
I know nothing about the libraries or techniques involved here, but I'd suggest that maybe those nodes don't have any text to show. I looked at the source of the site in your code, and the HTML looks a little strange to me. The anchor tag <a> that has class="maintitle" doesn't have a closing tag. Instead, this appears: <div class="clear"/>, and there was no div around it that I could match it to. I think your code is good, but the website is using anchor tags in a very strange (probably bad) way.
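
For what it's worth, the text does seem to be there; it just isn't all in .text. In the ElementTree model that lxml follows, .text holds only the text before the first child element, and text that comes after a child is stored in that child's .tail. Here is a minimal sketch using the standard library's xml.etree.ElementTree in Python 3. The two snippets are simplified versions of the anchors from the question (attributes dropped, and written without line breaks, on the assumption that the real page has none there since the output was None), and full_text is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Simplified versions of the two <a> elements from the question
# (attributes dropped so the snippets are well-formed XML).
first = ET.fromstring(
    '<a><font class="skcolor_ljg">Python</font>基础教程(第2版)</a>'
)
second = ET.fromstring(
    '<a>可爱的<font class="skcolor_ljg">Python</font></a>'
)


def full_text(element):
    """Join every text fragment under element, collapsing whitespace."""
    return " ".join("".join(element.itertext()).split())


# .text only covers the text before the first child element.
print(repr(first.text))      # nothing precedes <font> in the first anchor
print(repr(second.text))     # only the part before <font>
print(repr(first[0].tail))   # text after </font> lives in the child's .tail

# itertext() walks .text and .tail of every descendant in document order.
print(full_text(first))
print(full_text(second))
```

The same idea should carry over to lxml: its elements also have .tail and itertext(), and lxml.html elements have text_content(), so using x.text_content() in place of x.text in the original loop is probably the smallest change.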
Aug 18 '11 #2

Expert 100+
P: 624
It appears that you could split on the ")" and parse the next-to-last element if you want to do it by hand. Otherwise, check out BeautifulSoup.
test_it = """<a name="p_name" target="_blank" href="http://product.dangdang.com/product.aspx?product_id=20872365&ref=search-1-pub" onclick="s('click','python','01.54.06.18','','86_1_25','','','20872365_1_22591_p','','','');">
<font class="skcolor_ljg">Python</font>
replaced for us latin-1 users)
</a>"""

## join the lines into a single string and split on the ")"
test_list = "".join(test_it.split("\n")).split(")")
print(test_list[-2])

## select everything between ">" and "<" or the end of the string
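
One way that last step might be finished, as a rough sketch rather than the code that was originally posted: take the chunk that test_list[-2] gives and use re.findall to collect every run of text that sits between a ">" and the next "<" or the end of the string. The chunk below is a stand-in for that value, and text_between_tags is a made-up helper name.

```python
import re

# Stand-in for test_list[-2] from the snippet above.
chunk = ';"><font class="skcolor_ljg">Python</font>replaced for us latin-1 users'


def text_between_tags(fragment):
    """Return the non-empty text runs that follow a '>' up to the next '<' or the end."""
    runs = re.findall(r">([^<>]+)", fragment)
    return [run.strip() for run in runs if run.strip()]


pieces = text_between_tags(chunk)
print(pieces)
print(" ".join(pieces))
```

Keep in mind that splitting on ")" gets fragile with the real titles, since many of them contain parentheses themselves, so a proper parser is likely the safer route.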
Aug 18 '11 #3
