parse URL (href) from xhtml, xhtml -> text, for data

hawat.thufir

Given an xhtml file, how can I "export" the data to plain-text? That is,
I want:

google www.google.com

Whereas, if I copy and paste what the browser shows, I lose the URL and
end up with:

google

The idea is that I want to import the data to MySQL using the mysqlimport
command, but mysqlimport requires plain-text. The xhtml file in question:

[thufir@localhost Desktop]$ cat raw.xhtml -n
1 <?xml version="1.0" encoding="UTF-8"?><!DOCTYPE html PUBLIC
"-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
2 <html xmlns="http://www.w3.org/1999/xhtml"><head><meta
http-equiv="content-type" content="text/html; charset=utf-8" /><title
/><meta name="generator" content="StarOffice/OpenOffice.org XSLT
(http://xml.openoffice.org/sx2ml)" /><meta name="created"
content="2006-02-07T15:19:17" /><meta name="changed"
content="2006-02-07T15:36:55" /><base href="." /><style type="text/css">
3 @page { }
4 table { border-collapse:collapse; border-spacing:0;
empty-cells:show }
5 td, th { vertical-align:top; }
6 h1, h2, h3, h4, h5, h6 { clear:both }
7 ol, ul { padding:0; }
8 * { margin:0; }
9 *.ta1 { }
10 *.ce1 { font-family:Courier; color:#000000;
font-size:10pt; font-style:normal; text-shadow:none; font-weight:normal; }
11 *.ce2 { font-family:Courier; color:#000000; }
12 *.Default { font-family:'Bitstream Vera Sans'; }
13 *.Heading { font-family:'Bitstream Vera Sans';
text-align:center ! important; font-size:16pt; font-style:italic;
font-weight:bold; }
14 *.Heading1 { font-family:'Bitstream Vera Sans';
text-align:center ! important; font-size:16pt; font-style:italic;
font-weight:bold; }
15 *.Result { font-family:'Bitstream Vera Sans';
font-style:italic; font-weight:bold; text-decoration:underline; }
16 *.Result2 { font-family:'Bitstream Vera Sans';
font-style:italic; font-weight:bold; text-decoration:underline; }
17 *.co1 { width:0.8925in; }
18 *.ro1 { height:0.1756in; }
19 *.ro2 { height:0.1681in; }
20 </style></head><body dir="ltr"><table border="0"
cellspacing="0" cellpadding="0" class="ta1"><colgroup><col width="99"
/></colgroup><tr class="ro1"><td style="text-align:left;width:0.8925in; "
class="ce1"><p> <a href="http://www.google.com/">google
</a>  </p></td></tr><tr class="ro2"><td
style="text-align:left;width:0.8925in; " class="ce2" /></tr><tr
class="ro2"><td style="text-align:left;width:0.8925in; " class="ce2"
/></tr></table></body></html>[thufir@localhost Desktop]$ date
Tue Feb 7 15:52:34 EST 2006
[thufir@localhost Desktop]$

thanks,
Thufir

Feb 7 '06 #1

Joe Kesselman

First, you need to define what portions of the document are "data". It
sounds like what you want is just the links; is that correct?

If so, you need to search for <a> elements that have an href attribute,
pull out their content (which may be arbitrarily complex markup, please
remember -- rich text, images, etc. -- you need to define how much of
that you want to return and how you want it presented!), pull the value
of the href attribute, and report that pair of values.

Assuming that description of your problem is correct, you can do this by
writing a program that uses an XML parser and the SAX or DOM APIs, or
you can write an XSLT stylesheet such as the following. (WARNING: UNTESTED.)

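<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:x="http://www.w3.org/1999/xhtml">

<!-- Emit plain text rather than XML. -->
<xsl:output method="text" version="1.0" encoding="UTF-8" />

<!-- The input declares the XHTML namespace as its default, so its
     elements are matched through the x: prefix bound above; an
     unprefixed "a" would select nothing in XSLT 1.0. -->
<xsl:template match="/">
<xsl:apply-templates select="//x:a[@href]"/>
</xsl:template>

<!-- One line per link: link text, a space, then the URL. -->
<xsl:template match="x:a[@href]">
<xsl:value-of select="normalize-space(.)"/>
<xsl:text> </xsl:text>
<xsl:value-of select="@href"/>
<xsl:text>
</xsl:text>
</xsl:template>

</xsl:stylesheet>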
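
For comparison, here is a minimal sketch of the parser-based route mentioned above, written in Python with the standard library's xml.etree.ElementTree (a DOM-like tree API). The names extract_links and SAMPLE are assumptions made for illustration and do not come from the thread; SAMPLE is a cut-down stand-in for raw.xhtml. Like the stylesheet, it is a sketch rather than a finished tool, and it reports only the flattened text of each link.

```python
import xml.etree.ElementTree as ET

# XHTML elements live in this namespace; ElementTree spells a qualified
# name as "{namespace}localname".
XHTML_NS = "http://www.w3.org/1999/xhtml"


def extract_links(xhtml_bytes):
    """Return (link text, href) pairs for every XHTML <a> that has an href.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xhtml_bytes)
    pairs = []
    for a in root.iter("{%s}a" % XHTML_NS):
        href = a.get("href")
        if href is None:
            continue  # an anchor without href is not a link
        # Flatten any nested markup to text and collapse runs of whitespace.
        text = " ".join("".join(a.itertext()).split())
        pairs.append((text, href))
    return pairs


# Cut-down stand-in for raw.xhtml (an assumption, not the real file).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"><body>
<p><a href="http://www.google.com/">google
</a></p>
<p><a name="top">no href here</a></p>
</body></html>"""

if __name__ == "__main__":
    for text, href in extract_links(SAMPLE):
        print(text, href)
```

For a real file, the bytes could be read with open("raw.xhtml", "rb").read() and passed to the same function.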

Feb 8 '06 #2

hawat.thufir

Joe Kesselman wrote:

> First, you need to define what portions of the document are "data". It
> sounds like what you want is just the links; is that correct?

I like your approach: asking what is "data" in this case. Yes, I'm
after "just" the links for the example given. However, if I could get
the pair of values, "google" and "http://www.google.com" that'd be
stage two. For now, yes, I'd be happy with just the links.

> If so, you need to search for <a> elements that have an href attribute,
> pull out their content (which may be arbitrarily complex markup, please
> remember -- rich text, images, etc. -- you need to define how much of
> that you want to return and how you want it presented!), pull the value
> of the href attribute, and report that pair of values.

Not sure I follow you there, I'm not after the actual google page,
simply the URL.

> Assuming that description of your problem is correct, you can do this by
> writing a program that uses an XML parser and the SAX or DOM APIs, or
> you can write an XSLT stylesheet such as the following. (WARNING: UNTESTED.)
>
> ...

Right, thanks for writing a transform, I get the gist. That's actually
a big deal, I've read a tad about XSLT but it seemed arcane until just
now.

How do I get plain-text from the result, though? The result will be
XML, but not XHTML, which is a step in the right direction. Would PHP or
something similar be needed to parse the resulting XML and produce a
plain-text file with the link?

-Thufir

Feb 8 '06 #3

Martin Honnen

ha**********@gmail.com wrote:

> Right, thanks for writing a transform, I get the gist. That's actually
> a big deal, I've read a tad about XSLT but it seemed arcane until just
> now.
>
> How do I get plain-text from the result, though?

Joe's XSLT stylesheet has
<xsl:output method="text" version="1.0" encoding="UTF-8" />
so the output method is text and not XML.
XSLT can produce XML or HTML or text depending on the output method.
--

Martin Honnen
http://JavaScript.FAQTs.com/

Feb 8 '06 #4

Joe Kesselman

> How do I get plain-text from the result, though?

Note that the <xsl:output> statement in my example says to produce text
output. That says the output should be a free-form text stream rather
than XML. (Exactly how that differs from XML or HTML output modes is
described in the XSLT spec, if you want the details.)

I've said the text should be encoded as UTF-8; if you want the output in
a different encoding, that too can be specified via xsl:output. (Not all
processors support all encodings, admittedly.)

As I said, this is just one possible approach. You could hand-code a
solution almost as trivially, but I think I'm going to leave that as a
homework assignment for now. <smile/>
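
One way such a hand-coded version might look, sketched in Python with the standard library's xml.sax module. The names LinkHandler, extract_links, to_tsv and SAMPLE are hypothetical, and SAMPLE is only a stand-in for the real file. The fields are separated by a tab on the assumption that mysqlimport will be run with its default field delimiter; a space could be substituted if that is not the case.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

XHTML_NS = "http://www.w3.org/1999/xhtml"


class LinkHandler(ContentHandler):
    """Collect (link text, href) pairs as the parser streams events."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # character chunks seen inside that <a>

    def startElementNS(self, name, qname, attrs):
        # name is a (namespace URI, local name) pair; an attribute that
        # has no namespace is keyed as (None, local name).
        if name == (XHTML_NS, "a") and (None, "href") in attrs:
            self._href = attrs[(None, "href")]
            self._text = []

    def characters(self, content):
        if self._href is not None:
            self._text.append(content)

    def endElementNS(self, name, qname):
        if name == (XHTML_NS, "a") and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append((text, self._href))
            self._href = None


def extract_links(xhtml_bytes):
    """Parse the bytes with a namespace-aware SAX parser; return the pairs."""
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    handler = LinkHandler()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xhtml_bytes))
    return handler.links


def to_tsv(links):
    """Render one 'text<TAB>href' line per link."""
    return "".join("%s\t%s\n" % (text, href) for text, href in links)


# Stand-in input (an assumption, not the real raw.xhtml).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"><body>
<p><a href="http://www.google.com/">google
</a></p>
</body></html>"""

if __name__ == "__main__":
    print(to_tsv(extract_links(SAMPLE)), end="")
```

Compared with the stylesheet, this costs more lines but needs nothing beyond a Python interpreter, and the streaming parser never holds the whole document in memory.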

Feb 8 '06 #5

hawat.thufir

Joe Kesselman wrote:

>> How do I get plain-text from the result, though?
>
> Note that the <xsl:output> statement in my example says to produce text
> output.

Pardon, I didn't notice that until you mentioned it--thanks!

> ... I've said the text should be encoded as UTF-8; if you want the output in
> a different encoding, that too can be specified via xsl:output. (Not all
> processors support all encodings, admittedly.)
>
> As I said, this is just one possible approach. You could hand-code a
> solution almost as trivially, but I think I'm going to leave that as a
> homework assignment for now. <smile/>

Doh!

Thanks for the help :)

I have to install a JRE and Saxon (?), then I'll give it a go.

-Thufir

Feb 8 '06 #6
