regular expression to extract text

Hi

I have an html file with headings followed by one or more paragraphs
like this

<h2>blah blah 1</h2>
<p>more blah blah blah</p>

<h2>blah blah 2</h2>
<p>more blah blah blah</p>
<p>even more blah blah blah</p>

I'd like to extract the text of the headings and the related
paragraphs and insert them into a database. So far I've managed to
get the heading text but can't figure out how to get the associated
paragraphs. I've been using regular expressions; here is the
expression I have so far: <h2[.]*>(.+?)</h2>(.+?). This gets the text
of the headings but not the paragraphs, and now I'm basically stumped.

Any help would be appreciated.
Nov 25 '07 #1
You could do this another way, although regular expressions are a great tool.
Have you thought about using XML to do this?
Since you are obviously starting with something that is basically
XML, why not just load the string as XML (topping and tailing it if
needed) and then extract what you need using XPath?
Nov 25 '07 #2
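
A minimal sketch of the XPath idea, not a definitive implementation. It assumes PHP 7 or later with the DOM extension enabled, and it uses DOMDocument::loadHTML() rather than a strict XML loader because loadHTML() tolerates markup that is not well-formed. The function name extractSectionsWithDom and the sample $html string are made up for illustration.

```php
<?php
// Hypothetical sample input mirroring the structure in the question.
$html = <<<HTML
<h2>blah blah 1</h2>
<p>more blah blah blah</p>

<h2>blah blah 2</h2>
<p>more blah blah blah</p>
<p>even more blah blah blah</p>
HTML;

// Returns a list of ['heading' => string, 'paragraphs' => string[]].
function extractSectionsWithDom(string $html): array
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of printing them,
    // since the input is assumed to be sloppy markup.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $sections = [];

    foreach ($xpath->query('//h2') as $heading) {
        $paragraphs = [];

        // Walk the siblings that follow this heading and stop at the next <h2>.
        for ($node = $heading->nextSibling; $node !== null; $node = $node->nextSibling) {
            if ($node->nodeType !== XML_ELEMENT_NODE) {
                continue; // skip whitespace text nodes and comments
            }
            $name = strtolower($node->nodeName);
            if ($name === 'h2') {
                break;
            }
            if ($name === 'p') {
                $paragraphs[] = trim($node->textContent);
            }
        }

        $sections[] = [
            'heading'    => trim($heading->textContent),
            'paragraphs' => $paragraphs,
        ];
    }

    return $sections;
}

print_r(extractSectionsWithDom($html));
```

This only collects <p> elements that are direct siblings of the heading. If the real file wraps paragraphs in <div> or other containers, the sibling walk would need adjusting.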
Slightly unorthodox, but this works.

<?php

// $html is assumed to already hold the page markup.
preg_match_all("/((<h2>(.+?)<\/h2>(.+?)<p>(.+?)<\/p>))/is", $html, $matches);
print_r($matches);

// $matches[3] holds the headings and $matches[5] holds the first
// paragraph after each heading.
?>
Nov 26 '07 #3
The problem with using XML is that the HTML is coming from Word, so it
contains a lot of unnecessary crap and isn't valid XML. And since I
don't have much experience parsing XML in PHP, I thought it would be
easier to use regular expressions to extract the sections I want.

I'm almost there now. The expression Kailash wrote almost works,
but it only gives the first paragraph after the heading. I just need
to work out how to extract the rest of the paragraphs.
Nov 26 '07 #4
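
One way to get every paragraph with regular expressions is to match in two passes: first capture each heading together with everything up to the next heading, then pull all the <p> elements out of that captured block. The sketch below assumes PHP 5.3 or later. The function name extractSectionsWithRegex and the sample $html string are made up for illustration. It uses [^>]* to allow attributes on the tags, because the [.]* in the original expression matches only literal dots.

```php
<?php
// Hypothetical sample input mirroring the structure in the question.
$html = <<<HTML
<h2>blah blah 1</h2>
<p>more blah blah blah</p>

<h2>blah blah 2</h2>
<p>more blah blah blah</p>
<p>even more blah blah blah</p>
HTML;

// Returns a list of array('heading' => string, 'paragraphs' => string[]).
function extractSectionsWithRegex($html)
{
    $sections = array();

    // Pass 1: group 1 is the heading text, group 2 is everything after
    // </h2> up to (but not including) the next <h2> or the end of input.
    preg_match_all(
        '/<h2\b[^>]*>(.+?)<\/h2>(.*?)(?=<h2\b|\z)/is',
        $html,
        $blocks,
        PREG_SET_ORDER
    );

    foreach ($blocks as $block) {
        // Pass 2: collect every paragraph inside this heading's block.
        preg_match_all('/<p\b[^>]*>(.*?)<\/p>/is', $block[2], $paras);

        $paragraphs = array();
        foreach ($paras[1] as $p) {
            // strip_tags() drops inline markup such as <span> left by Word.
            $paragraphs[] = trim(strip_tags($p));
        }

        $sections[] = array(
            'heading'    => trim(strip_tags($block[1])),
            'paragraphs' => $paragraphs,
        );
    }

    return $sections;
}

print_r(extractSectionsWithRegex($html));
```

Each entry can then be inserted into the database as one heading with its list of paragraphs. Like any regex over HTML, this will misbehave on nested or unclosed <p> tags, so it is worth checking against a real Word export.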
