Parse text from HTML website, dump into DB

I am working on a script to extract statistics (which are updated daily) from
a website and insert them into a MySQL database. I want to take this
page:
http://www.usatoday.com/sports/baske...layers0304.htm
strip out all the HTML tags and other markup so that it looks like
http://www.enlhoops.com/ratings/parsed.txt
and then insert each player's stat line into the database.

I have begun writing the script: it gets the file and strips the HTML tags,
but that doesn't work very well. If anyone can help me get started
or suggest a function or anything else, that would be helpful. Thanks.

IceOnFire
Jul 17 '05 #1
Use Perl. It's better suited to this sort of thing and can run
independently from the command line.

CPAN modules let you extend Perl to access sites as if you were a
browser, including accepting cookies.

--
DeeDee, don't press that button! DeeDee! NO! Dee...

Jul 17 '05 #2
Here is some example code I wrote to do a very similar thing for the
BBC's Fantasy Football system (so I can view my team on my Nokia 3650
phone). It's not perfect (in fact it's quite dirty), but it does the
trick and it may help get you started:


<?php
print '<?xml version="1.0"?>';
?>
<!DOCTYPE html PUBLIC "-//WAPFORUM//DTD XHTML Mobile 1.0//EN"
"http://www.wapforum.org/DTD/xhtml-mobile10.dtd" >

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>BBC Fantasy Football</title>
<style type="text/css">
p, body, td, th { font-family: Arial, Helvetica,
Sans-Serif; font-size: medium; }
th { background-color: #efefce; }
td { background-color: #ffffde; }
h1 { font-size: large; }
</style>
</head>

<body>
<p align='center'><img src="bbcsport_logo.gif" alt="BBC Sport"
/></p>
<h1>Team for <?= htmlspecialchars($name) ?></h1>
<div align='center'>

<?php

// $id (the team PIN) and $name are expected to be set before this point.
$page = file_get_contents("http://bbcfootball.fantasyleague.co.uk/team/teamscreen.asp?pin=$id");

// Remove line breaks so that "." in the patterns below can match across the whole table.
$page = str_replace("\n", "", $page);
if (preg_match("/CURRENT FIRST 11(.*?)<\/table>/m", $page, $matches)) {
    print "<table><tr><th>Player</th><th width='20'>P</th><th width='30'>C</th><th width='20'>W</th><th width='20'>M</th></tr>";
    $table = $matches[1];

    // Split the table into rows, then pick the cells out of each row.
    preg_match_all("/(<tr>.*?<\/tr>)/", $table, $matches);
    for ($n = 0; $n < count($matches[1]); $n++) {
        if (preg_match("/^.*?<td.*?\/td><td.*?>(\d+).*<\/td><td.*>(\S+)<\/td><td.*>(\S+)<\/td>.*?squad_(\S)\.gif.*?<td.*>(\S+)<\/td><td.*>(\S+)<\/td><td.*>(\S+)<\/td><td.*>(\S+)<\/td>/", $matches[1][$n], $player)) {

            // Map the letter in the squad_?.gif image name to a position code.
            $pos = '';
            switch ($player[4]) {
                case "g": $pos = 'GK'; break;
                case "f": $pos = 'FB'; break;
                case "c": $pos = 'CB'; break;
                case "m": $pos = 'MF'; break;
                case "s": $pos = 'SK'; break;
            }

            $club = str_replace("&nbsp;", "", $player[5]);

            print "<tr><td align='left'>$player[2]$player[3]</td><td align='center'>$pos</td><td align='center'>$club</td><td align='center'>$player[7]</td><td align='center'>$player[8]</td></tr>";
        }
    }
    print "</table>";
}
else {
    print "<p><b>Currently updating...</b></p>";
}

?>

</div>

</body>
</html>
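
A long regular expression like the one above tends to break whenever the page markup changes slightly. As a rough alternative sketch, PHP's bundled DOM extension can parse the page and hand back the text of each table cell, which also takes care of stripping the tags. The helper name extractTableRows is made up for illustration, and the code assumes the statistics sit in ordinary <tr>/<td> rows, which has not been checked against the USA Today page.

```php
<?php
// Sketch only: extractTableRows() is a hypothetical helper, not a built-in.
// It returns one array of cell strings per <tr> found in the HTML.

function extractTableRows(string $html): array
{
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();

    // Real pages are rarely valid HTML, so collect libxml's parse
    // warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $rows = [];
    foreach ($doc->getElementsByTagName('tr') as $tr) {
        $cells = [];
        foreach ($tr->childNodes as $node) {
            // Keep only the <td> and <th> children of the row.
            if ($node instanceof DOMElement
                && in_array($node->nodeName, ['td', 'th'], true)) {
                // textContent is the cell's text with all tags removed.
                // Collapse whitespace (including non-breaking spaces) and trim.
                $text = preg_replace('/[\s\x{00A0}]+/u', ' ', $node->textContent);
                $cells[] = trim($text);
            }
        }
        if ($cells !== []) {
            $rows[] = $cells;
        }
    }
    return $rows;
}
```

Something like extractTableRows(file_get_contents($url)) would then give one array per table row, which can be written out as plain text or passed on to the database step.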


Best of luck,
Andy
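
For the last step in the original question, inserting each stat line into the database, a minimal sketch using PDO prepared statements might look like the following. The function name insertStatRows, the table name player_stats and its three columns are all hypothetical, since the real column layout is not given in this thread. The PDO handle is assumed to be in exception mode.

```php
<?php
// Sketch only: insertStatRows(), the player_stats table and its columns
// are hypothetical; adjust them to the real stat line layout.
// $db is assumed to use PDO::ERRMODE_EXCEPTION.

function insertStatRows(PDO $db, array $rows): int
{
    // A prepared statement keeps the scraped text out of the SQL itself.
    $stmt = $db->prepare(
        'INSERT INTO player_stats (player, team, points) VALUES (?, ?, ?)'
    );

    $inserted = 0;
    $db->beginTransaction();
    try {
        foreach ($rows as $row) {
            // Skip rows that do not have exactly three cells.
            if (count($row) !== 3) {
                continue;
            }
            $stmt->execute(array_values($row));
            $inserted++;
        }
        $db->commit();
    } catch (Throwable $e) {
        // Undo the partial import if any row fails.
        $db->rollBack();
        throw $e;
    }
    return $inserted;
}
```

With MySQL the handle would come from a DSN along the lines of mysql:host=localhost;dbname=stats (host and database name are placeholders), followed by a call to setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION).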
Jul 17 '05 #3
