Extract URL from String

markmcgookin
648 Expert 512MB
Hi Folks,

I am writing a Java program to analyse an HTML page: I connect to a website and then want to extract ALL the links from it. I think the best way to do this is to use the <a href="...">...</a> tags as a guide.

I have the code....

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

String data;
BufferedReader webadd = null;

// BufferedReader replaces DataInputStream, whose readLine() is deprecated
webadd = new BufferedReader( new InputStreamReader(
        (new URL("http://www.anyrandomurl.com/")).openStream() ) );

data = webadd.readLine();

while ( data != null )
  {
     // *** HELP NEEDED HERE *** (process the current line before reading the next)
     data = webadd.readLine();
  }

This obviously reads through every line of the HTML document at the URL and puts it into data. I was thinking of storing all the URLs from the site in an array later on, but it is the way of extracting the links I am unsure of... possibly some kind of string tokenizer? I really need something that will scroll through a string, char by char, until it hits <a href=" and will then record the data until it hits </a>, giving me the URL.

Which I can then just add to something like URLs [] and loop through that later.

Think it will only be one line of code or so, any ideas?

Cheers!
Mar 11 '07 #1
Ganon11
3,652 Expert 2GB
The String class has a .charAt() function that will return the character at the position specified. You can use this to search through the String char by char.

Alternatively, there's an .indexOf() method in the String class that you can use to search for "<a href=", and you can extract the substring (the URL) with the .substring() method. Check out the official documentation here.
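
For anyone following along, here is a minimal sketch of the search-and-extract idea in Java, where the relevant String methods are indexOf() and substring(). The class name FirstLinkExample and the method firstHref are made up for illustration, and the sketch assumes the href value is wrapped in double quotes, written in lower case, and sits entirely on one line.

```java
class FirstLinkExample {

    /**
     * Returns the value of the first href="..." attribute in the line,
     * or null if the line has none. (Hypothetical helper, not from the thread.)
     */
    static String firstHref(String line) {
        String marker = "href=\"";
        int start = line.indexOf(marker);      // -1 when the marker is absent
        if (start < 0) {
            return null;
        }
        start += marker.length();              // skip past href="
        int end = line.indexOf('"', start);    // closing quote of the attribute
        if (end < 0) {
            return null;                       // unterminated attribute
        }
        return line.substring(start, end);
    }

    public static void main(String[] args) {
        String line = "<p><a href=\"http://www.anyrandomurl.com/page.html\">A page</a></p>";
        System.out.println(firstHref(line));   // prints http://www.anyrandomurl.com/page.html
    }
}
```

Stopping at the closing quote rather than at </a> keeps the link text out of the result.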
Mar 11 '07 #2
markmcgookin
648 Expert 512MB
Ah excellent, I've used that before, and I knew something like it existed, but I totally forgot the syntax! Cheers!

Just reading through the java.net 1.5.0 stuff here now to see if there are any useful methods in those classes.

Cheers pal!
Mar 11 '07 #3
markmcgookin
648 Expert 512MB
Would an idea be:

Read a line of HTML as a String ( strLine )

posStart = strLine.indexOf("<a href=")
posFinish = strLine.indexOf("</a>")

linkURL = strLine.substring( posStart, posFinish )

That's obviously mostly pseudocode, but do you think that would return a link (I can't test it until tomorrow)? Also, what do you think I should do to deal with lines that have more than one link? As it stands, that will only return the first link on the line.

Now obviously, if I was using the charAt() method to go through every char in the line, the loop would be

for ( i = 0; i < strLine.length(); i++ )

But I'm not too sure how I could get that to work for running through the line.

Maybe

if ( posFinish != strLine.length() )

... continue

or something? lol
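
One way the char-by-char loop could be made to work is sketched below. This is only an illustration under assumptions: the names CharScanExample and scanLinks are invented here, and the marker is taken to be exactly <a href=" (lower case, double quote, no extra attributes before href). The index walks the line one character at a time, and whenever the marker starts at the current position, charAt() copies characters until the closing quote.

```java
import java.util.ArrayList;
import java.util.List;

class CharScanExample {

    /** Walks the line char by char and collects every double-quoted href value. */
    static List<String> scanLinks(String line) {
        String marker = "<a href=\"";
        List<String> links = new ArrayList<>();
        int i = 0;
        while (i < line.length()) {
            if (line.startsWith(marker, i)) {
                i += marker.length();                   // jump past <a href="
                StringBuilder url = new StringBuilder();
                while (i < line.length() && line.charAt(i) != '"') {
                    url.append(line.charAt(i));         // record the URL char by char
                    i++;
                }
                if (i < line.length()) {                // only keep it if the closing quote was found
                    links.add(url.toString());
                }
            } else {
                i++;                                    // not a link start, move on one char
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String line = "<a href=\"a.html\">A</a> and <a href=\"b.html\">B</a>";
        System.out.println(scanLinks(line));            // prints [a.html, b.html]
    }
}
```

Because the loop keeps going after each match, several links on one line are all picked up, and a List avoids having to size an array in advance.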
Mar 11 '07 #4
Ganon11
3,652 Expert 2GB
Well, I know HTML links are written this way:

<a href="http://www.mysitehere.com/thisiscool/index.html">MY TEST HERE</a>

So you'd have to search for "<a href" and the next ">" and take the substring between them.

As for finding multiple links in one line... once you find a link, you can cut off everything up to the end of that link and begin the process again on what's left, until you can't find "<a href" in the String anymore.
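
The repeat-until-nothing-is-left idea can be sketched roughly as follows. Instead of actually chopping characters off the front of the String, this version passes a start position to indexOf(), which has the same effect without creating new Strings; that swap is a choice made for this sketch, not something suggested in the thread. The names AllLinksExample and allHrefs are invented for illustration, and double-quoted, lower-case href attributes are assumed.

```java
import java.util.ArrayList;
import java.util.List;

class AllLinksExample {

    /** Collects every href value on the line by searching again after each match. */
    static List<String> allHrefs(String line) {
        String marker = "<a href=\"";
        List<String> links = new ArrayList<>();
        int from = 0;                                   // where the next search begins
        while (true) {
            int start = line.indexOf(marker, from);
            if (start < 0) {
                break;                                  // no more links on this line
            }
            start += marker.length();
            int end = line.indexOf('"', start);         // closing quote of the attribute
            if (end < 0) {
                break;                                  // unterminated attribute, give up
            }
            links.add(line.substring(start, end));
            from = end + 1;                             // continue after this link's URL
        }
        return links;
    }

    public static void main(String[] args) {
        String line = "<a href=\"http://www.mysitehere.com/one.html\">One</a>"
                    + " | <a href=\"http://www.mysitehere.com/two.html\">Two</a>";
        for (String url : allHrefs(line)) {
            System.out.println(url);
        }
    }
}
```

Calling allHrefs on each line read from the page and adding the results to one shared list would give the full set of URLs to loop through later.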
Mar 11 '07 #5
markmcgookin
648 Expert 512MB
Cool man, cheers! I'll try something out tomorrow and maybe post back here during the week! (Been VB.Net programming all day... must switch brain to Java overnight!)

Thanks very much for taking the time to reply!
Mar 11 '07 #6
