parsing internet page using C

I've been trying to parse data on a web page in C, but after several hours of searching the internet for help with writing this code I'm still lost.

Can anyone give me any tips, code, or any other sort of help with this? Nothing I try seems to work and I'm just extremely frustrated and would appreciate any help.

What I'm trying to do is go to a web site, look for various keywords, and based on what it finds, pull that data out and either put it in variables in the code or put it in a text document.
Mar 28 '07 #1
RedSon
Try searching for web robots or web spiders.
Mar 28 '07 #2
Hi,
You need to open a TCP connection to the web server using sockets, send a request such as "GET /web/index.html HTTP/1.0\r\nHost: www.sapik.cz\r\n\r\n", read the HTML that comes back in the reply, and parse it. This fetches whatever HTML the server sends, so it also works for PHP-generated pages, but it will not run any client-side scripts. I can write an example if this is what you need.
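A minimal sketch of the request-building step, assuming a C99 compiler for snprintf. The helper name build_get_request is hypothetical and not part of any library. HTTP header lines end in \r\n, and an empty line ends the request.

```c
#include <stdio.h>
#include <string.h>

/* Build an HTTP/1.0 GET request for `path` on `host` into `out`.
   Returns the request length in bytes, or -1 if it does not fit in `cap`.
   build_get_request is a hypothetical helper name used for illustration. */
int build_get_request(char *out, size_t cap, const char *host, const char *path)
{
    int n = snprintf(out, cap, "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n", path, host);
    if (n < 0 || (size_t)n >= cap)
        return -1;          /* formatting error or request truncated */
    return n;
}
```

The returned length is the byte count to pass to send(), so the terminating '\0' is not transmitted.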

-jan-
Mar 28 '07 #3
manontheedge
That is what I need, and they are HTML pages. If you can post some code, that would be a great help; I've gotten pretty much nowhere with this. I was about to settle for opening the page, copying it to Excel, and sorting through it there out of frustration, so yes, I would appreciate the help very much.
Mar 28 '07 #4
Hi, this is the code:
#include <iostream>
#include <string>
#include <algorithm>
#include <iterator>
#include <fstream>
#include <cstring>
#include <winsock2.h>   // must come before windows.h
#include <windows.h>

#pragma comment(lib, "ws2_32.lib")

using namespace std;

int GetWeb()
{
    const int BUFSIZE = 4096;               // a 1,000,000-byte array can overflow the default 1 MB stack
    const char *hostName = "www.thescripts.com";
    WORD wVersionRequested = MAKEWORD(1,1);
    WSADATA data;                           // library
    // HTTP header lines end in \r\n; an empty line ends the request
    string text = string("GET /forum/thread624279.html HTTP/1.0\r\nHost: ")
                  + hostName + "\r\n\r\n";
    hostent *host;                          // remote machine
    sockaddr_in serverSock;                 // remote socket
    SOCKET mySocket;                        // socket
    char buf[BUFSIZE];                      // input buffer
    int size, totalSize = 0;                // number of received and sent bytes
    ofstream output(".\\download.html", ios::binary);    // write html data
    // get sockets ready
    if (WSAStartup(wVersionRequested, &data) != 0)
    {
        cerr << "error in initialization of sockets" << endl;
        return -1;
    }
    // get info about remote machine (same host as in the Host: header)
    if ((host = gethostbyname(hostName)) == NULL)
    {
        cerr << "Wrong address" << endl;
        WSACleanup();
        return -1;
    }
    // create the socket
    if ((mySocket = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP)) == INVALID_SOCKET)
    {
        cerr << "Can't create socket" << endl;
        WSACleanup();
        return -1;
    }
    // fill in sockaddr_in
    memset(&serverSock, 0, sizeof(serverSock));
    // 1) protocol family
    serverSock.sin_family = AF_INET;
    // 2) port to connect (http)
    serverSock.sin_port = htons(80);
    // 3) IP address to connect to
    memcpy(&(serverSock.sin_addr), host->h_addr, host->h_length);
    // connect the socket
    if (connect(mySocket, (sockaddr *)&serverSock, sizeof(serverSock)) == SOCKET_ERROR)
    {
        cerr << "can't connect" << endl;
        closesocket(mySocket);
        WSACleanup();
        return -1;
    }
    // send the request (without the terminating '\0')
    if ((size = send(mySocket, text.c_str(), (int)text.size(), 0)) == SOCKET_ERROR)
    {
        cerr << "Can't send data" << endl;
        closesocket(mySocket);
        WSACleanup();
        return -1;
    }
    cout << "sent " << size << endl;
    // receive data until the server closes the connection
    text = "";
    while ((size = recv(mySocket, buf, BUFSIZE, 0)) > 0)
    {
        totalSize += size;
        text.append(buf, size);             // append exactly the bytes received
    }
    if (size == SOCKET_ERROR)
    {
        cerr << "Can't receive data" << endl;
    }
    // close connection
    closesocket(mySocket);
    WSACleanup();
    cout << "Accepted: " << totalSize << " bytes" << endl << "HTTP Header:" << endl << endl;
    // split the reply into the HTTP header and the body
    string::size_type offset = text.find("\r\n\r\n");
    if (offset == string::npos)
    {
        cerr << "No HTTP header found" << endl;
        return -1;
    }
    copy(text.begin(), text.begin() + offset, ostream_iterator<char>(cout,""));
    copy(text.begin() + offset + 4, text.end(), ostream_iterator<char>(output,""));
    return 0;
}

This was written for VC++ 6.
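The last step of the code above, separating the HTTP header from the body at the first empty line, can be sketched in plain C as follows. The helper name find_http_body is hypothetical, and the sketch assumes the whole response is held in one NUL-terminated buffer with no embedded '\0' bytes.

```c
#include <stddef.h>
#include <string.h>

/* Return a pointer to the first byte of the body in `response`, or NULL
   if the "\r\n\r\n" separator is missing. If header_len is not NULL it
   receives the header length, not counting the separator.
   find_http_body is a hypothetical helper name used for illustration. */
const char *find_http_body(const char *response, size_t *header_len)
{
    const char *sep = strstr(response, "\r\n\r\n");
    if (sep == NULL)
        return NULL;
    if (header_len != NULL)
        *header_len = (size_t)(sep - response);
    return sep + 4;         /* skip the four separator bytes */
}
```

Checking for a missing separator matters because an empty or truncated reply would otherwise be split at an invalid position.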
Mar 29 '07 #5
nmadct
First, you're way better off doing this in Perl than in C, as there's a huge amount of ready-made code that will do most of the work for you. I think Python also has good libraries for this.

If you want a robust way to access web pages, you might try libwww, although I've found it's not the easiest thing to use.

The easiest way to grab web pages from a C program is probably to invoke the wget or curl program to fetch the page as a file, then open that file.
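A rough sketch of the curl route, assuming curl is installed and on the PATH. The helper names build_curl_command and fetch_to_file are hypothetical. The -s (silent) and -o (output file) options are standard curl flags. The quoting here is deliberately simple, so only pass URLs and file names you trust.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Build a command line that makes curl save `url` to `outfile`.
   Returns the command length, or -1 if an argument contains a character
   this simple quoting cannot handle or the command does not fit.
   build_curl_command is a hypothetical helper name. */
int build_curl_command(char *out, size_t cap, const char *url, const char *outfile)
{
    int n;
    if (strpbrk(url, "\"$`") != NULL || strpbrk(outfile, "\"$`") != NULL)
        return -1;
    n = snprintf(out, cap, "curl -s -o \"%s\" \"%s\"", outfile, url);
    if (n < 0 || (size_t)n >= cap)
        return -1;
    return n;
}

/* Run curl to fetch `url` into `outfile`; returns 0 on success.
   fetch_to_file is a hypothetical helper name. */
int fetch_to_file(const char *url, const char *outfile)
{
    char cmd[2048];
    if (build_curl_command(cmd, sizeof cmd, url, outfile) < 0)
        return -1;
    return system(cmd) == 0 ? 0 : -1;
}
```

After fetch_to_file returns 0, the page can be opened with fopen() and read like any other file.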

As for parsing the file, that's an entirely different question. There is plenty of literature on writing parsers; you can search Google for it. If you're not interested in fully parsing HTML but just want to get some info out of the page, your job might be considerably easier.
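For the simpler case of pulling a value out of the page without a full HTML parser, a keyword search with strstr is often enough. The sketch below assumes the value of interest is the text between a known keyword and the next '<'. The helper name extract_after and the sample page are made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Copy the text that follows `keyword` in `page`, up to the next '<' or
   the end of the page, into `out` (truncated to fit `cap` bytes).
   Returns 1 if the keyword was found, 0 otherwise.
   extract_after is a hypothetical helper name used for illustration. */
int extract_after(const char *page, const char *keyword, char *out, size_t cap)
{
    const char *start, *end;
    size_t len;

    if (cap == 0)
        return 0;
    start = strstr(page, keyword);
    if (start == NULL)
        return 0;
    start += strlen(keyword);               /* value begins after the keyword */
    end = strchr(start, '<');               /* value ends at the next tag */
    len = end ? (size_t)(end - start) : strlen(start);
    if (len >= cap)
        len = cap - 1;
    memcpy(out, start, len);
    out[len] = '\0';
    return 1;
}
```

With a page containing "Stock: 7</body>", calling extract_after(page, "Stock: ", value, sizeof value) leaves "7" in value, which can then be stored in a variable or written to a text file.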
Mar 29 '07 #6
manontheedge
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <winsock2.h>

#pragma comment(lib, "ws2_32.lib")
#define STRING_MAX 65536
#define MAX 8388608

/* Fetch `file` from `targetip` over HTTP and return the raw reply
   (header and body) as a malloc'd string; the caller must free() it. */
char *get_http(char *targetip, int port, char *file)
{
    WSADATA wsaData;
    WORD wVersionRequested;
    struct hostent *target_ptr;
    struct sockaddr_in sock;
    SOCKET MySock;
    char sendfile[STRING_MAX];
    char *output;
    int nret = 0, total = 0;

    wVersionRequested = MAKEWORD(2, 2);
    /* WSAStartup returns 0 on success and an error code otherwise */
    if (WSAStartup(wVersionRequested, &wsaData) != 0)
    {
        printf("################# ERROR! ###################\n");
        printf("Your ws2_32.dll is too old to use this application.    \n");
        printf("Go to microsofts web site to download the most recent \n");
        printf("version of ws2_32.dll.\n");
        exit(1);
    }
    MySock = socket(AF_INET, SOCK_STREAM, 0);
    if (MySock == INVALID_SOCKET)
    {
        printf("Socket error!\r\n");
        WSACleanup();
        exit(1);
    }
    if ((target_ptr = gethostbyname(targetip)) == NULL)
    {
        printf("Resolve of %s failed, please try again.\n", targetip);
        closesocket(MySock);
        WSACleanup();
        exit(1);
    }
    memset(&sock, 0, sizeof(sock));
    memcpy(&sock.sin_addr.s_addr, target_ptr->h_addr, target_ptr->h_length);
    sock.sin_family = AF_INET;
    sock.sin_port = htons((USHORT)port);

    if (connect(MySock, (struct sockaddr *)&sock, sizeof(sock)) == SOCKET_ERROR)
    {
        printf("Couldn't connect to host.\n");
        closesocket(MySock);
        WSACleanup();
        exit(1);
    }
    /* make sure the request fits before building it */
    if (strlen(file) + strlen(targetip) + 64 > STRING_MAX)
    {
        printf("Request too long.\n");
        closesocket(MySock);
        WSACleanup();
        exit(1);
    }
    /* the Host header must name the server being contacted;
       "Connection: close" makes the server end the reply by closing */
    sprintf(sendfile, "GET %s HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n",
            file, targetip);
    /* send only the request text, once, not the whole buffer */
    if (send(MySock, sendfile, (int)strlen(sendfile), 0) == SOCKET_ERROR)
    {
        printf("Error sending Packet\r\n");
        closesocket(MySock);
        WSACleanup();
        exit(1);
    }

    output = (char *)malloc(MAX + 1);
    if (output == NULL)
    {
        printf("Out of memory.\n");
        closesocket(MySock);
        WSACleanup();
        exit(1);
    }
    /* one recv() rarely returns the whole page, so read until the
       server closes the connection or the buffer is full */
    while (total < MAX &&
           (nret = recv(MySock, output + total, MAX - total, 0)) > 0)
    {
        total += nret;
    }
    if (nret == SOCKET_ERROR)
    {
        printf("Attempt to receive data FAILED. \n");
    }
    output[total] = '\0';

    closesocket(MySock);
    WSACleanup();
    return output;
}

int main(int argc, char *argv[])
{
    int port = 80;
    char *targetip;
    char *output;

    if (argc < 2)
    {
        printf("WebGrab usage:\r\n");
        printf("%s <TargetIP> [port] [file]\r\n", argv[0]);
        return 0;
    }

    targetip = argv[1];

    if (argc >= 3)
    {
        port = atoi(argv[2]);
    }

    if (argc >= 4)
    {
        output = get_http(targetip, port, argv[3]);
    }
    else
    {
        output = get_http(targetip, port, "/");
    }

    printf("%s", output);
    free(output);

    return 0;
}

I looked into everything you guys recommended, and I started researching sockets. In theory it makes sense. I have some code here that is supposed to get a web page and print the data to the console. But it keeps stopping at the "Couldn't connect to host" part, and it's got me confused.

I'm running the program from the command prompt with up to three arguments, one of them mandatory: the IP address or a fully qualified domain name. Each time I run the program, about 5-10 seconds later I get that error. I'd like to know why this is happening.

I'm actually trying to learn this stuff and how it works so I can use it more in the future, so any help or guidance with this would help. Thanks for the help so far as well.
Mar 30 '07 #7
nmadct
I noticed that your "memcpy" line doesn't match the one in Jan's code. I wonder if that's causing a problem. After connect() fails, you can call WSAGetLastError() to get the error code, which might be helpful. It would also help to print out the values of the arguments to connect() just before making the call, to verify what it's trying to do.
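One way to print what connect() is about to use is to format the sockaddr_in as "address:port". This is a sketch, assuming IPv4 and a C99 compiler for snprintf. The helper name describe_endpoint is hypothetical, while inet_ntoa and ntohs are available in both Winsock and the BSD socket headers.

```c
#include <stdio.h>
#include <string.h>
#ifdef _WIN32
#include <winsock2.h>
#else
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#endif

/* Write "a.b.c.d:port" for an IPv4 socket address into `out`.
   Returns the text length, or -1 if it does not fit in `cap` bytes.
   describe_endpoint is a hypothetical helper name used for illustration. */
int describe_endpoint(const struct sockaddr_in *addr, char *out, size_t cap)
{
    int n = snprintf(out, cap, "%s:%u", inet_ntoa(addr->sin_addr),
                     (unsigned)ntohs(addr->sin_port));
    if (n < 0 || (size_t)n >= cap)
        return -1;
    return n;
}
```

Printing this just before connect() shows whether the resolved address and the port are what you expect. If connect() then fails, printing the integer returned by WSAGetLastError() with %d gives the code to look up.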

I don't know what kind of reference you're using, but this page is helpful: http://www.sockets.com/winsock.htm
(It's old, but I don't think the API has changed much since then. Things have been added but the original functions should work the same.)
Apr 2 '07 #8
Another option, if a C++/MFC app works for you:

#include <afxinet.h>   // CInternetSession, CInternetFile, CInternetException

// Return the whole web page as a single string for parsing.
CString GetSourceHtml(CString theUrl)
{
    CString szOutput;
    CInternetSession session;
    CInternetFile* file = NULL;
    try
    {
        // try to connect to the URL
        file = (CInternetFile*) session.OpenURL(theUrl);
    }
    catch (CInternetException* pException)
    {
        // set file to NULL if there's an error
        file = NULL;
        pException->Delete();
    }

    if (file)
    {
        CString line;

        // keep fetching text until there is no more
        while (file->ReadString(line))
        {
            szOutput += line;
        }

        file->Close();
        delete file;
    }

    return szOutput.Trim();
}

http://www.edevmachine.com
Apr 25 '07 #9
