(lib)CURL is used to send and receive data over the internet, for example to crawl a website or to submit data to forms.
We will use it to fetch the content of a single page. The code is shown below.
More information about libcurl and its options/settings is listed at
http://curl.haxx.se/libcurl/
#include <string>
#include <iostream>
#include <curl/curl.h>

// libcurl calls this for every chunk of received data;
// we append the chunk to the std::string passed via CURLOPT_WRITEDATA
size_t write_r_data( void* ptr, size_t size, size_t nmemb, void* data )
{
    std::string* result = static_cast<std::string*>(data);
    result->append( static_cast<char*>(ptr), size*nmemb );
    return size*nmemb; // tell libcurl we consumed everything
}

int main()
{
    std::string url_full = "www.phys.ik.cx/index.php";
    std::string useragent = "www.phys.ik.cx/robot/faq.php"; // user agent string

    CURL* ch_ = curl_easy_init(); // create a CURL handle
    if ( !ch_ )
        return 1;

    char error_buffer[CURL_ERROR_SIZE] = ""; // filled by libcurl on failure

    // SET OPTIONS
    curl_easy_setopt( ch_, CURLOPT_ERRORBUFFER, error_buffer ); // where error messages go
    curl_easy_setopt( ch_, CURLOPT_WRITEFUNCTION, &write_r_data ); // callback for the received data
    std::string result;
    curl_easy_setopt( ch_, CURLOPT_WRITEDATA, &result ); // write the data into this variable
    curl_easy_setopt( ch_, CURLOPT_VERBOSE, 1L ); // 1 ... a lot of verbose information
    curl_easy_setopt( ch_, CURLOPT_URL, url_full.c_str() );
    curl_easy_setopt( ch_, CURLOPT_USERAGENT, useragent.c_str() ); // set user agent string
    curl_easy_setopt( ch_, CURLOPT_CONNECTTIMEOUT, 10L ); // max time (seconds) for connecting to the server
    curl_easy_setopt( ch_, CURLOPT_TIMEOUT, 30L ); // max time (seconds) the whole transfer may take

    // start the transfer with the options set above
    // (multiple calls of this for the same handle are possible)
    CURLcode res = curl_easy_perform( ch_ );
    if ( res != CURLE_OK )
        std::cerr << error_buffer << std::endl; // display the error log on failure

    curl_easy_cleanup( ch_ ); // purges the handle (when crawling is done)
    std::cout << result << std::endl;
}
To compile the code (Unix), just use
g++ crawl.cpp -lcurl
where the option -lcurl links against libcurl. The installation of libcurl on the system is assumed, of course.
We will use a function
write_r_data which will help us with the processing of the received data.
std::string url_full will contain the URL we want to obtain.
std::string useragent is the user agent identifier. This string will be sent to the server and written into the server logs. Here we give a link to a site where admins can obtain more information about the crawler and its owner.
Then a CURL handle is created with
CURL * ch_ = curl_easy_init(); as well as the error management.
After that the options are set for the created handle.
curl_easy_perform(ch_) starts the transfer. Several calls can be made now (the options will stay set).
curl_easy_cleanup(ch_) cleans (purges) the handle - don't forget this :)
And we get something like this:
[claus@mau SPIDER]$ g++ crawl.cpp -lcurl
[claus@mau SPIDER]$ ./a.out
* About to connect() to www.phys.ik.cx port 80 (#0)
* Trying 217.70.142.105...
* connected
* Connected to www.phys.ik.cx (217.70.142.105) port 80 (#0)
> GET /index.php HTTP/1.1
User-Agent: www.phys.ik.cx/robot/faq.php
Host: www.phys.ik.cx
Accept: */*
< HTTP/1.1 200 OK
< Date: Sun, 08 Dec 2013 00:06:04 GMT
< Server: Apache/2.2.16 (Debian)
< Expires: Thu, 19 Nov 1981 08:52:00 GMT
< Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
< Pragma: no-cache
< Set-Cookie: PHPSESSID=8bd8ac219abbea49b9a38003577d0ed8; path=/
< Vary: Accept-Encoding
< Transfer-Encoding: chunked
< Content-Type: text/html; charset=iso-8859-1
<
* Connection #0 to host www.phys.ik.cx left intact
* Closing connection #0
.
.
.
then the source code of the site follows (not shown here).
These are the basics of obtaining the source code of a single site from the net.
The full source code as an extra:
crawl.cpp