(lib)CURL is used to send and receive data over the internet, for example to crawl a website or to submit data to forms.
We will use it to fetch the content of a single page. The code is shown below.
More information about libcurl and its options/settings is listed at
http://curl.haxx.se/libcurl/
#include <string>
#include <iostream>
#include <curl/curl.h>

// libcurl calls this for every chunk of received data;
// we append the chunk to the std::string passed via CURLOPT_WRITEDATA
size_t write_r_data( void* ptr, size_t size, size_t nmemb, void* data )
{
    std::string* result = static_cast<std::string*>(data);
    result->append( static_cast<char*>(ptr), size*nmemb );
    return size*nmemb; // tell libcurl we consumed everything
}

int main()
{
    std::string url_full = "www.phys.ik.cx/index.php";
    std::string useragent = "www.phys.ik.cx/robot/faq.php"; // user agent string

    CURL* ch_ = curl_easy_init(); // create a CURL handle
    if ( !ch_ )
        return 1;

    char error_buffer[CURL_ERROR_SIZE] = ""; // filled by libcurl on failure

    // SET OPTIONS
    curl_easy_setopt( ch_, CURLOPT_ERRORBUFFER, error_buffer ); // where error messages go
    curl_easy_setopt( ch_, CURLOPT_WRITEFUNCTION, &write_r_data ); // callback for the received data
    std::string result;
    curl_easy_setopt( ch_, CURLOPT_WRITEDATA, &result ); // write the data into this variable
    curl_easy_setopt( ch_, CURLOPT_VERBOSE, 1L ); // 1 ... a lot of verbose information
    curl_easy_setopt( ch_, CURLOPT_URL, url_full.c_str() );
    curl_easy_setopt( ch_, CURLOPT_USERAGENT, useragent.c_str() ); // set user agent string
    curl_easy_setopt( ch_, CURLOPT_CONNECTTIMEOUT, 10L ); // max time (seconds) for connecting to the server
    curl_easy_setopt( ch_, CURLOPT_TIMEOUT, 30L ); // max time (seconds) the whole transfer may take

    // start the transfer with the options set above
    // (multiple calls of this for the same handle are possible)
    CURLcode res = curl_easy_perform( ch_ );
    if ( res != CURLE_OK )
        std::cerr << error_buffer << std::endl; // display the error log on failure

    curl_easy_cleanup( ch_ ); // purges the handle (when crawling is done)
    std::cout << result << std::endl;
}
To compile the code (Unix), just use
g++ crawl.cpp -lcurl
where the option -lcurl links against libcurl. The installation of libcurl on the system is assumed, of course.
We will use a function
write_r_data which will help us with the processing of the received data.
std::string url_full will contain the URL we want to obtain.
std::string useragent is the user agent identifier. This string will be sent to the server and written into the server logs. Here we give a link to a site where admins can obtain more information about the crawler and its owner.
Then a CURL handle is created with
CURL * ch_ = curl_easy_init(); as well as the error management.
After that the options are set for the created handle.
curl_easy_perform(ch_) starts the transfer. Several calls can be made now (the options will stay set).
curl_easy_cleanup(ch_) cleans (purges) the handle - don't forget this :)
And we get something like this:
[claus@mau SPIDER]$ g++ crawl.cpp -lcurl
[claus@mau SPIDER]$ ./a.out
* About to connect() to www.phys.ik.cx port 80 (#0)
* Trying 217.70.142.105...
* connected
* Connected to www.phys.ik.cx (217.70.142.105) port 80 (#0)
> GET /index.php HTTP/1.1
User-Agent: www.phys.ik.cx/robot/faq.php
Host: www.phys.ik.cx
Accept: */*
< HTTP/1.1 200 OK
< Date: Sun, 08 Dec 2013 00:06:04 GMT
< Server: Apache/2.2.16 (Debian)
< Expires: Thu, 19 Nov 1981 08:52:00 GMT
< Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
< Pragma: no-cache
< Set-Cookie: PHPSESSID=8bd8ac219abbea49b9a38003577d0ed8; path=/
< Vary: Accept-Encoding
< Transfer-Encoding: chunked
< Content-Type: text/html; charset=iso-8859-1
<
* Connection #0 to host www.phys.ik.cx left intact
* Closing connection #0
.
.
.
then the source code of the site follows (not shown here).
These are the basics of obtaining the source code of a single site from the net.
The full source code as an extra:
crawl.cpp