HTML scraping

Neekos
111 100+
I'm using VBA (in Access) to scrape an internal web page at work to gather email addresses. The emails are encoded, and I need to use this page to decode them. I can get it to return the entire HTML of the page, and within that I need to narrow down and extract the email address.

Right now I have this as my code:

    Private Sub Command0_Click()
        Dim webBrowser As webBrowser
        Dim recordset As ADODB.recordset
        Dim ScrapeHTML As String
        Dim test As String

        Set webBrowser = CreateObject("InternetExplorer.Application")

        Set recordset = New ADODB.recordset
        recordset.Open "scrambled", CurrentProject.Connection

        With webBrowser
            .Navigate "URL removed for security purposes"
            .Visible = True
            Do While webBrowser.Busy
                DoEvents
            Loop
        End With
        Do While webBrowser.Busy
            DoEvents
        Loop

        Do While Not recordset.EOF
            Me.[pnr] = recordset![pnr]
            Do While webBrowser.Busy
                DoEvents
            Loop
            Debug.Print test
            With webBrowser
                .Document.Forms("EncodeDecodeEmailForm").all("decode").focus
                .Document.Forms("EncodeDecodeEmailForm").all("decode").Value = recordset![Scrambled email]
                SendKeys "{TAB}"
                Pause 1
                SendKeys "{ENTER}"
                Pause 2

                Do While webBrowser.Busy
                    DoEvents
                Loop

                ScrapeHTML = webBrowser.Document.documentElement.innerHTML
                ScrapeHTML = Mid(ScrapeHTML, InStr(1, ScrapeHTML, "this.form.encode") + 112, 60)
                Debug.Print ScrapeHTML
                ScrapeHTML = Left(ScrapeHTML, InStr(1, ScrapeHTML, "<") - 1)
                Debug.Print InStr(1, ScrapeHTML, ">")
                Me.[decoded email] = Right(ScrapeHTML, InStr(1, ScrapeHTML, ">") + 30)
                Debug.Print Me.[decoded email]

                recordset.MoveNext
                DoCmd.RunCommand acCmdRecordsGoToNew

            End With
            Do While webBrowser.Busy
                DoEvents
            Loop
            With webBrowser
                .Document.Forms("EncodeDecodeEmailForm").all("decode").focus
                .Document.Forms("EncodeDecodeEmailForm").all("decode").Value = ""

            End With
            Do While webBrowser.Busy
                DoEvents
            Loop
            Pause 1

            Do While webBrowser.Busy
                DoEvents
            Loop
        Loop
    End Sub

And my output looks like this:

    iddle>BOB@COMCAST.NET

Obviously I'm not doing something right with my Left() and Right() calls, but I can't seem to figure it out. In fact, I think the location of the email address moves around in the HTML by a few characters every time.

Does anyone have any ideas on how to accomplish this?
Apr 4 '08 #1
Dököll
2,364 Expert 2GB
Im using VBA (in Access) to scrape an internal webpage at work to gather email addresses...
Sorry for your troubles, Neekos!

I am sending this over to the Access forum for further support. It could surely be handled here, but our Access forum is extremely busy, so someone there is more likely to see it.

I'm not sure how to handle this either, Neekos. I hope you find what you're looking for :-)
Apr 5 '08 #2
ADezii
8,834 Expert 8TB
The following code will work for the constant pattern of text>E-Mail Address. Should this pattern vary, I need to know some of the parameters, namely:
  1. Does the E-Mail Address always follow the '>'?
  2. Can the E-Mail Address appear anywhere within the string?
  3. Is the '.' delimiter the only period that will appear in the String?
  4. Anything else you can think of...

    Dim strCompleteString As String
    Dim strEMailAddr As String

    strCompleteString = "iddle>BOB@COMCAST.NET"
    ' Keep only the characters to the right of the first ">"
    strEMailAddr = Right$(strCompleteString, Len(strCompleteString) - InStr(strCompleteString, ">"))

    Debug.Print strEMailAddr

OUTPUT:

    BOB@COMCAST.NET

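To see why the original expression handed back the untrimmed string, it may help to run both formulas on the sample output. This is only a sketch (the procedure name and the variable `s` are illustrative, not from the original code): `Right(s, InStr(1, s, ">") + 30)` asks for 36 characters from a 21-character string, so the whole string comes back, while `Len(s) - InStr(s, ">")` asks for exactly the 15 characters after the '>'.

```vba
Sub CompareRightCalls()
    Dim s As String
    s = "iddle>BOB@COMCAST.NET"      ' Len(s) = 21, InStr(s, ">") = 6

    ' Original attempt: 6 + 30 = 36 characters requested, more than the
    ' string holds, so Right returns the string unchanged.
    Debug.Print Right(s, InStr(1, s, ">") + 30)    ' iddle>BOB@COMCAST.NET

    ' Corrected count: 21 - 6 = 15 characters, i.e. everything after ">".
    Debug.Print Right$(s, Len(s) - InStr(s, ">"))  ' BOB@COMCAST.NET
End Sub
```
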
Apr 5 '08 #3
Neekos
111 100+
The email address always follows the '>' (it's part of an HTML tag).

Here is the actual HTML that it's scraping:

    <tr><td align=center>Enter Encoded E-Mail Address to DECODE</td></tr>
    <tr><td align=center><input type="text" name="decode" maxlength="60" value="$16$79/IRSWBERQB.DUJCYSXQUB01" onclick="clearField(this.form.encode)"></td></tr>
    <tr><td align=center id=hilite>BOB@COMCAST.NET</td></tr>
    </table>

    <table border="0" width="400">
    <tr><td align=center><input type="button" name="send" value="Submit" onclick="javascript:return validateForm();">
        &nbsp;&nbsp;<input type="reset" value="Reset"></td>

For the most part, the email address is at the same character position every time, but occasionally I'll get some that are off by about 4-6 characters (so the output is getting trimmed).
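
Since the cell holding the decoded address carries `id=hilite` in the HTML above, one possible alternative to counting characters is to read that cell through the browser's DOM. The sketch below is an assumption-laden illustration, not something tried in this thread: the function name `ReadDecodedEmail` is hypothetical, it assumes `webBrowser` is the InternetExplorer.Application object from the original code, that the page has finished loading, and that the first element with that id on the page is the result cell.

```vba
' Sketch: read the decoded address from the cell marked id=hilite.
' Assumes webBrowser is the InternetExplorer.Application object from the
' original code and that the decode request has finished loading.
Function ReadDecodedEmail(ByVal webBrowser As Object) As String
    Dim cell As Object

    Set cell = webBrowser.Document.getElementById("hilite")
    If cell Is Nothing Then
        ReadDecodedEmail = ""          ' no such cell on the page
    Else
        ReadDecodedEmail = Trim$(cell.innerText)
    End If
End Function
```

It would be called as `Me.[decoded email] = ReadDecodedEmail(webBrowser)` in place of the Mid/Left/Right lines, which sidesteps the shifting character offsets altogether.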


Apr 7 '08 #4
mshmyob
904 Expert 512MB
May I jump in here? I have noticed in your HTML scrape that the unique identifier is the '@' symbol. I would therefore use that to determine the email address.

Logic would be like so:

1. Find the @ symbol
2. Look backwards to find the '>' symbol
3. Look forward to find the '<' symbol
4. What is in between is the email address (this eliminates variable lengths)

Here is the code:

    Dim vSearchStr As String
    Dim vAmpPos As Long
    Dim vStartPos As Long
    Dim vEndPos As Long
    Dim vEmail As String

    ' the full scraped HTML, as read in the original code
    vSearchStr = webBrowser.Document.documentElement.innerHTML
    ' get the position of the @ symbol (0 means none was found)
    vAmpPos = InStr(1, vSearchStr, "@", 1)
    ' InStr raises an error for a start position of 0, so only continue on a hit
    If vAmpPos > 0 Then
        ' get the end position: the first "<" after the @
        vEndPos = InStr(vAmpPos, vSearchStr, "<", 1)
        ' get the start position: the last ">" before the @
        vStartPos = InStrRev(vSearchStr, ">", vAmpPos, 1)
        ' this is your email address
        vEmail = Mid(vSearchStr, vStartPos + 1, (vEndPos - vStartPos) - 1)
    End If

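The four steps above could also be wrapped in a function so the loop in the original code only needs a single call. This is a minimal sketch, not a tested drop-in: the name `ExtractEmail` is made up for illustration, and it assumes the first "@" in the page belongs to the decoded address (true of the sample HTML, but an encoded value containing "@" would break it).

```vba
' Returns the text between the nearest ">" before and "<" after the first
' "@" in html, or "" when any of the three markers is missing.
Function ExtractEmail(ByVal html As String) As String
    Dim atPos As Long, startPos As Long, endPos As Long

    atPos = InStr(1, html, "@")
    If atPos = 0 Then Exit Function          ' no "@": nothing to extract

    startPos = InStrRev(html, ">", atPos)    ' last ">" before the "@"
    endPos = InStr(atPos, html, "<")         ' first "<" after the "@"
    If startPos = 0 Or endPos = 0 Then Exit Function

    ExtractEmail = Mid$(html, startPos + 1, endPos - startPos - 1)
End Function
```

In the loop it would replace the Mid/Left/Right lines with `Me.[decoded email] = ExtractEmail(webBrowser.Document.documentElement.innerHTML)`. In the Immediate window, `? ExtractEmail("<td align=center id=hilite>BOB@COMCAST.NET</td>")` prints BOB@COMCAST.NET.
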
Hope this is of some help

cheers,

Apr 9 '08 #5
Neekos
111 100+
HA! How the heck did I not think of using that in the first place?? Here I was counting hundreds of characters away to find something unique in the HTML. Thank you!

Apr 9 '08 #6
mshmyob
904 Expert 512MB
You're welcome. We all have been in the same boat. The more eyes the better.

cheers,

Apr 9 '08 #7
