Extract HTML + Reg Ex

Ori
Hi,

I have some HTML text which I need to parse in order to extract data
from it.

My HTML contains a table with a few rows and two columns. I want to
extract the data from the 2nd column in the most efficient way (using
Reg Ex) rather than using the "indexOf" function of String.

Thanks,

Ori.

Here is the HTML table:

<table BORDER="1" CELLSPACING="0" CELLPADDING="1" >
<tr>
<td>Licensee Name</td>
<td BGCOLOR="#ffffcc">JOHN Doo</td>
</tr>
<tr>
<td><a HREF=>Primary Status</a></td>
<td BGCOLOR="#ffffcc">Data_To_Be_Extracted</td>
</tr>
<tr>
<td>License Number</td>
<td BGCOLOR="#ffffcc">Data_To_Be_Extracted</td>
</tr>
<tr>
<td><a >License Type</a></td>
<td BGCOLOR="#ffffcc">Data_To_Be_Extracted</td>
</tr>
<tr>
<td>Header</td>
<td BGCOLOR="#ffffcc">Data_To_Be_Extracted</td>
</tr>
<tr>
<td>Address</td>
<td BGCOLOR="#ffffcc">Data_To_Be_Extracted</td>
</tr>
<tr>
<td>City State State Zip </td>
<td BGCOLOR="#ffffcc">Data_To_Be_Extracted</td>
</tr>
</table>
Nov 15 '05 #1
Hi!

Try this:

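// Requires: using System.Text.RegularExpressions;
// First split the HTML into table rows
string[] arrLines = Regex.Split(strContent, @"<tr.*?>",
    RegexOptions.IgnoreCase);

// Go through each row
foreach (string strLine in arrLines)
{
    // Split the row into cells; strCol[0] is the text before the first <td>,
    // strCol[1] is the first column and strCol[2] is the second column
    string[] strCol = Regex.Split(strLine, @"<td.*?>",
        RegexOptions.IgnoreCase);
    // Skip fragments without two cells (e.g. the text before the first <tr>)
    if (strCol.Length < 3) continue;
    // Remove the remaining HTML tags and surrounding whitespace
    string strValue = Regex.Replace(strCol[2], @"<[^>]*>", "").Trim();
    // Second column
    MessageBox.Show(strValue);
}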
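
As an alternative to splitting, here is a minimal sketch that matches the second cell of each row directly with one pattern. It assumes every row has exactly two <td> cells and that there are no nested tables; the class name SecondColumnExtractor and the method Extract are hypothetical names chosen for illustration, not something from the original question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
static class SecondColumnExtractor
{
    // Returns the text of the second <td> of every <tr> in the given HTML.
    // Assumption: each row has exactly two cells and tables are not nested.
    public static List<string> Extract(string html)
    {
        var values = new List<string>();

        // <tr>, then the first cell (skipped), then the second cell (captured).
        const string pattern =
            @"<tr\b[^>]*>\s*<td\b[^>]*>.*?</td>\s*<td\b[^>]*>(.*?)</td>";

        // Singleline lets '.' match newlines, so a cell may span several lines.
        const RegexOptions options =
            RegexOptions.IgnoreCase | RegexOptions.Singleline;

        foreach (Match m in Regex.Matches(html, pattern, options))
        {
            // Strip any tags left inside the captured cell, then trim.
            string cell = Regex.Replace(m.Groups[1].Value, @"<[^>]*>", "");
            values.Add(cell.Trim());
        }
        return values;
    }
}
```

Calling SecondColumnExtractor.Extract(strContent) gives one string per table row, which you can then loop over instead of showing a message box for each. Regular expressions like this are fragile against malformed or nested markup, so for anything beyond a fixed, known table layout a dedicated HTML parser is likely the safer choice.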
Hope that's what you want!

Greetings

Matthias

Nov 15 '05 #2
