
(lib)CURL is used to send and receive information over the internet, for example to crawl a website or submit data to forms.

We will use it to fetch the content of a single page. The code is shown below.

More information about CURL and its options/settings is listed at http://curl.haxx.se/libcurl/
#include <string>
#include <iostream>
#include <curl/curl.h>

// callback: libcurl calls this once for every chunk of received data
size_t write_r_data( void* ptr, size_t size, size_t nmemb, void * data )
{
   std::string * result = static_cast<std::string*>(data);
   result->append( static_cast<char*>(ptr), size*nmemb );
   return size*nmemb;   // number of bytes we handled
}

int main()
{
   std::string url_full="www.phys.ik.cx/index.php";
   std::string useragent = "www.phys.ik.cx/robot/faq.php";   // user agent string

   CURL * ch_ = curl_easy_init();   // create a CURL handle
   if ( !ch_ )
   {
      std::cerr << "curl_easy_init() failed" << std::endl;
      return 1;
   }
   char error_buffer[CURL_ERROR_SIZE] = {0};

   // SET OPTIONS
   curl_easy_setopt( ch_, CURLOPT_ERRORBUFFER, error_buffer );   // error messages are stored here
   curl_easy_setopt( ch_, CURLOPT_WRITEFUNCTION, &write_r_data);   // callback for the received data

   std::string result;
   curl_easy_setopt( ch_, CURLOPT_WRITEDATA, &result );   // passed as 'data' to the callback

   curl_easy_setopt( ch_, CURLOPT_VERBOSE, 1L );   // 1 ... a lot of verbose information
   curl_easy_setopt( ch_, CURLOPT_URL, url_full.c_str() );
   curl_easy_setopt( ch_, CURLOPT_USERAGENT, useragent.c_str() );   // set user agent string
   curl_easy_setopt( ch_, CURLOPT_CONNECTTIMEOUT, 10L);   // maximum time (seconds) for the connection phase
   curl_easy_setopt( ch_, CURLOPT_TIMEOUT, 30L);   // maximum time (seconds) the whole transfer may take
   // SET OPTIONS

   CURLcode res = curl_easy_perform(ch_);   // start transfer with the options set above (multiple calls of this for the same handle is possible)
   if ( res != CURLE_OK )
      std::cerr << error_buffer << std::endl;   // display the error message

   curl_easy_cleanup(ch_);   // purges the handle (when crawling is done)

   std::cout << result << std::endl;
}

To compile the code (Unix), just use g++ crawl.cpp -lcurl; the option -lcurl links against libcurl. The installation of libcurl on the system is assumed, of course.

We use a function write_r_data which helps us process the received data.

std::string url_full contains the URL we want to fetch.

std::string useragent is the user agent identifier. This string is sent to the server and written to the server logs. We put a link here to a site where admins can obtain more information about the crawler and its owner.

Then a CURL handle is created with CURL * ch_ = curl_easy_init(), together with a buffer for error messages.

After that the options are set for the created handle.

curl_easy_perform(ch_) starts the transfer. Several calls can now be made on the same handle (the options stay set).
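The reuse idea can be sketched like this: only the URL changes between transfers, all other options stay set on the handle, and cleanup happens once at the end. This is an illustrative sketch only (it assumes network access and an installed libcurl; the second URL is made up):

```cpp
#include <string>
#include <iostream>
#include <curl/curl.h>

size_t write_r_data( void* ptr, size_t size, size_t nmemb, void * data )
{
   static_cast<std::string*>(data)->append( static_cast<char*>(ptr), size*nmemb );
   return size*nmemb;
}

int main()
{
   CURL * ch_ = curl_easy_init();
   if ( !ch_ ) return 1;

   std::string result;
   curl_easy_setopt( ch_, CURLOPT_WRITEFUNCTION, &write_r_data );
   curl_easy_setopt( ch_, CURLOPT_WRITEDATA, &result );

   // options above stay set; only the URL is changed per transfer
   const char* urls[] = { "www.phys.ik.cx/index.php", "www.phys.ik.cx/faq.php" };
   for ( const char* url : urls )
   {
      result.clear();   // start fresh for each page
      curl_easy_setopt( ch_, CURLOPT_URL, url );
      curl_easy_perform( ch_ );   // reuses the connection where possible
      std::cout << url << ": " << result.size() << " bytes" << std::endl;
   }

   curl_easy_cleanup( ch_ );   // one cleanup at the very end
}
```

Reusing the handle also lets libcurl keep the connection to the server alive between transfers, which is noticeably faster when crawling many pages from one host.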

curl_easy_cleanup(ch_) cleans (purges) the handle - don't forget this :)

And we get something like this:

[claus@mau SPIDER]$ g++ crawl.cpp -lcurl
[claus@mau SPIDER]$ ./a.out
* About to connect() to www.phys.ik.cx port 80 (#0)
* Trying 217.70.142.105...
* connected
* Connected to www.phys.ik.cx (217.70.142.105) port 80 (#0)
> GET /index.php HTTP/1.1
User-Agent: www.phys.ik.cx/robot/faq.php
Host: www.phys.ik.cx
Accept: */*
< HTTP/1.1 200 OK
< Date: Sun, 08 Dec 2013 00:06:04 GMT
< Server: Apache/2.2.16 (Debian)
< Expires: Thu, 19 Nov 1981 08:52:00 GMT
< Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
< Pragma: no-cache
< Set-Cookie: PHPSESSID=8bd8ac219abbea49b9a38003577d0ed8; path=/
< Vary: Accept-Encoding
< Transfer-Encoding: chunked
< Content-Type: text/html; charset=iso-8859-1
<
* Connection #0 to host www.phys.ik.cx left intact
* Closing connection #0
.
.
.


Then the source code of the site follows (not shown here).

These are the basics of obtaining the source code of a single site from the net.

The full source code: crawl.cpp