Bytes IT Community

HTML scraping

Neekos
100+
P: 111
I'm using VBA (in Access) to scrape an internal webpage at work to gather email addresses. The emails are encoded, and I need to use this page to decode them. I can get it to return the entire HTML of a page, and within this I need to narrow down and extract the email address.

Right now I have this as my code:

    Private Sub Command0_Click()
        Dim webBrowser As webBrowser
        Dim recordset As ADODB.recordset
        Dim ScrapeHTML As String
        Dim test As String

        Set webBrowser = CreateObject("InternetExplorer.Application")

        Set recordset = New ADODB.recordset
        recordset.Open "scrambled", CurrentProject.Connection

        With webBrowser
            .Navigate "URL removed for security purposes"
            .Visible = True
            Do While webBrowser.Busy
                DoEvents
            Loop
        End With
        Do While webBrowser.Busy
            DoEvents
        Loop

        Do While Not recordset.EOF
            Me.[pnr] = recordset![pnr]
            Do While webBrowser.Busy
                DoEvents
            Loop
            Debug.Print test
            With webBrowser
                .Document.Forms("EncodeDecodeEmailForm").all("decode").focus
                .Document.Forms("EncodeDecodeEmailForm").all("decode").Value = recordset![Scrambled email]
                SendKeys "{TAB}"
                Pause 1
                SendKeys "{ENTER}"
                Pause 2

                Do While webBrowser.Busy
                    DoEvents
                Loop

                ScrapeHTML = webBrowser.Document.documentElement.innerHTML
                ScrapeHTML = Mid(ScrapeHTML, InStr(1, ScrapeHTML, "this.form.encode") + 112, 60)
                Debug.Print ScrapeHTML
                ScrapeHTML = Left(ScrapeHTML, InStr(1, ScrapeHTML, "<") - 1)
                Debug.Print InStr(1, ScrapeHTML, ">")
                Me.[decoded email] = Right(ScrapeHTML, InStr(1, ScrapeHTML, ">") + 30)
                Debug.Print Me.[decoded email]

                recordset.MoveNext
                DoCmd.RunCommand acCmdRecordsGoToNew

            End With
            Do While webBrowser.Busy
                DoEvents
            Loop
            With webBrowser
                .Document.Forms("EncodeDecodeEmailForm").all("decode").focus
                .Document.Forms("EncodeDecodeEmailForm").all("decode").Value = ""

            End With
            Do While webBrowser.Busy
                DoEvents
            Loop
            Pause 1

            Do While webBrowser.Busy
                DoEvents
            Loop
        Loop
    End Sub
And my output looks like this:

    iddle>BOB@COMCAST.NET

Obviously I'm not doing something right with my Left() and Right() functions, but I can't seem to figure it out. In fact, I think the location of the email address is moving around in the HTML by a few characters every time.

Does anyone have any ideas on how to accomplish this?
Apr 4 '08 #1
6 Replies


Dököll
Expert 100+
P: 2,364
Sorry for your troubles, Neekos!

I am sending this over to the Access forum for further support. It could surely be handled here, but our Access forum is extremely busy, so there is a better chance someone sees it there.

Not sure how to handle this either, Neekos. Hope you find what you're looking for :-)
Apr 5 '08 #2

ADezii
Expert 5K+
P: 8,595
The following code will work for the constant pattern of text>E-Mail Address. Should this pattern vary, I need to know some of the parameters, namely:
  1. Does the E-Mail Address always follow the '>'?
  2. Can the E-Mail Address appear anywhere within the string?
  3. Is the '.' delimiter the only period that will appear in the String?
  4. Anything else you can think of...
    Dim strCompleteString As String
    Dim strEMailAddr As String

    strCompleteString = "iddle>BOB@COMCAST.NET"
    ' Keep everything to the right of the first ">" in the string
    strEMailAddr = Right$(strCompleteString, Len(strCompleteString) - InStr(strCompleteString, ">"))

    Debug.Print strEMailAddr
OUTPUT:
    BOB@COMCAST.NET
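
For what it's worth, the same split can also be written with Mid$, which avoids the length arithmetic. This is only a sketch: the function name TextAfterLastGt is made up for illustration, and unlike the code above it assumes the address is whatever follows the last '>' in the string rather than the first.

```vba
' Hypothetical helper (the name is illustrative, not from the thread).
' Returns the text after the last ">" in s.
' If s contains no ">", InStrRev returns 0 and the whole string comes back unchanged.
Public Function TextAfterLastGt(ByVal s As String) As String
    TextAfterLastGt = Mid$(s, InStrRev(s, ">") + 1)
End Function
```

Called as TextAfterLastGt("iddle>BOB@COMCAST.NET"), this should give BOB@COMCAST.NET.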
Apr 5 '08 #3

Neekos
100+
P: 111
The email address always follows the '>' (it's part of an HTML tag).

Here is the actual HTML that it's scraping:

    <tr><td align=center>Enter Encoded E-Mail Address to DECODE</td></tr>
    <tr><td align=center><input type="text" name="decode" maxlength="60" value="$16$79/IRSWBERQB.DUJCYSXQUB01" onclick="clearField(this.form.encode)"></td></tr>
    <tr><td align=center id=hilite>BOB@COMCAST.NET</td></tr>
    </table>

    <table border="0" width="400">
    <tr><td align=center><input type="button" name="send" value="Submit" onclick="javascript:return validateForm();">
        &nbsp;&nbsp;<input type="reset" value="Reset"></td>
For the most part, the email address is in the same character position every time, but occasionally I'll get some that are off by about 4-6 characters (so the output is getting trimmed).
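
Since the cell holding the decoded address carries id=hilite in the HTML above, one option that does not depend on character positions is to read that element from the document directly. A minimal sketch, assuming the id stays hilite and is unique on the page; the function name ReadDecodedEmail is hypothetical, and ie stands for the InternetExplorer.Application object created in the original code.

```vba
' Hypothetical helper (the name is illustrative, not from the thread).
' Reads the decoded address from the table cell with id="hilite".
' Assumption: the result cell keeps that id, as in the HTML posted above.
Public Function ReadDecodedEmail(ByVal ie As Object) As String
    Dim cell As Object

    Set cell = ie.Document.getElementById("hilite")
    If Not cell Is Nothing Then
        ' innerText is the visible text of the cell, without any tags
        ReadDecodedEmail = Trim$(cell.innerText)
    End If
    ' If the cell is missing, the function returns an empty string
End Function
```

Inside the record loop this could be used as Me.[decoded email] = ReadDecodedEmail(webBrowser), once the page has finished loading.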


Apr 7 '08 #4

mshmyob
Expert 100+
P: 903
May I jump in here? I have noticed in your HTML scrape that the unique identifier is the '@' symbol. I would therefore use that to determine the email address.

Logic would be like so:

1. Find the @ symbol
2. Look backwards to find the '>' symbol
3. Look forward to find the '<' symbol
4. What is in between is the email address (this eliminates variable lengths)

Here is the code:

    Dim vSearchStr As String
    Dim vAmpPos As Long
    Dim vStartPos As Long
    Dim vEndPos As Long
    Dim vEmail As String

    ' placeholder: assign your scraped HTML string here
    vSearchStr = [Your scraped HTML]
    ' get the position of the @ symbol
    vAmpPos = InStr(1, vSearchStr, "@", 1)
    ' get the end position (first "<" after the @)
    vEndPos = InStr(vAmpPos, vSearchStr, "<", 1)
    ' get the start position (last ">" before the @)
    vStartPos = InStrRev(vSearchStr, ">", vAmpPos, 1)
    ' this is your email address
    vEmail = Mid(vSearchStr, vStartPos + 1, (vEndPos - vStartPos) - 1)

Hope this is of some help

cheers,
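
The four steps above can be wrapped in a small function that also copes with a page containing no '@' at all (in that case InStr would be called with a start position of 0, which raises a run-time error). A sketch only: the function name ExtractEmailFromHtml is made up for illustration, and it assumes the first '@' in the HTML belongs to the decoded address.

```vba
' Hypothetical wrapper around the logic above (the name is illustrative).
' Returns the text between the last ">" before the first "@" and the next "<".
' Returns an empty string when the HTML contains no "@".
Public Function ExtractEmailFromHtml(ByVal html As String) As String
    Dim atPos As Long
    Dim startPos As Long
    Dim endPos As Long

    atPos = InStr(1, html, "@", vbBinaryCompare)
    If atPos = 0 Then Exit Function                 ' no address found

    ' Last ">" at or before the "@"; 0 means the text starts at position 1
    startPos = InStrRev(html, ">", atPos, vbBinaryCompare)

    ' First "<" after the "@"; if there is none, read to the end of the string
    endPos = InStr(atPos, html, "<", vbBinaryCompare)
    If endPos = 0 Then endPos = Len(html) + 1

    ExtractEmailFromHtml = Trim$(Mid$(html, startPos + 1, endPos - startPos - 1))
End Function
```

In the Immediate window, ? ExtractEmailFromHtml("<td align=center id=hilite>BOB@COMCAST.NET</td>") should print BOB@COMCAST.NET.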

Apr 9 '08 #5

Neekos
100+
P: 111
HA! How the heck did I not think of using that in the first place?? Here I was counting hundreds of characters away to find something unique in the HTML. Thank you!

Apr 9 '08 #6

mshmyob
Expert 100+
P: 903
You're welcome. We all have been in the same boat. The more eyes the better.

cheers,

Apr 9 '08 #7
