pavuk - HTTP, HTTP over SSL, FTP, FTP over SSL and Gopher recursive document
retrieval program
pavuk [-X] [-x] [-with_gui] [-runX] [-[no]bg] [-[no]prefs] [-h] [-help] [-v]
[-version]
pavuk [ -mode {normal | resumeregets | singlepage | singlereget | sync |
dontstore | ftpdir | mirror} ] [-X] [-x] [-with_gui] [-runX] [-[no]bg] [-[no]prefs]
[-[no]progress] [-[no]stime] [ -xmaxlog $nr ] [ -logfile $file ] [
-slogfile $file ] [ -auth_file $file ] [ -msgcat $dir ] [
-language $str ] [ -gui_font $font ] [-quiet/-verbose] [-[no]read_css]
[-[no]read_msie_cc] [-[no]read_cdata] [-[no]read_comments] [ -cdir $dir ] [
-scndir $dir ] [ -scenario $str ] [ -dumpscn $filename ] [
-dumpdir $dir ] [ -dumpcmd $filename ] [ -l $nr ] [ -lmax
$nr ] [ -dmax $nr ] [ -leave_level $nr ] [ -maxsize $nr ] [
-minsize $nr ] [ -asite $list ] [ -dsite $list ] [ -adomain
$list ] [ -ddomain $list ] [ -asfx $list ] [ -dsfx $list ]
[ -aprefix $list ] [ -dprefix $list ] [ -amimet $list ] [ -dmimet
$list ] [ -pattern $pattern ] [ -url_pattern $pattern ] [
-rpattern $regexp ] [ -url_rpattern $regexp ] [ -skip_pattern
$pattern ] [ -skip_url_pattern $pattern ] [ -skip_rpattern $regexp
] [ -skip_url_rpattern $regexp ] [ -newer_than $time ] [ -older_than
$time ] [ -schedule $time ] [ -reschedule $nr ]
[-[dont_]leave_site] [-[dont_]leave_dir] [ -http_proxy $site[:$port] ] [
-ftp_proxy $site[:$port] ] [ -ssl_proxy $site[:$port] ] [ -gopher_proxy
$site[:$port] ] [-[no]ftp_httpgw] [-[no]ftp_dirtyproxy] [-[no]gopher_httpgw]
[-[no]FTP] [-[no]HTTP] [-[no]SSL] [-[no]Gopher] [-[no]FTPdir] [-[no]CGI] [-[no]FTPlist]
[-[no]FTPhtml] [-[no]Relocate] [-[no]force_reget] [-[no]cache] [-[no]check_size]
[-[no]Robots] [-[no]Enc] [ -auth_name $user ] [ -auth_passwd $pass ] [
-auth_scheme {1/2/3/4/user/Basic/Digest/NTLM} ] [-[no_]auth_reuse_nonce] [
-http_proxy_user $user ] [ -http_proxy_pass $pass ] [ -http_proxy_auth
{1/2/3/4/user/Basic/Digest/NTLM} ] [-[no_]auth_reuse_proxy_nonce] [
-ssl_key_file $file ] [ -ssl_cert_file $file ] [ -ssl_cert_passwd
$pass ] [ -from $email ] [-[no]send_from] [ -identity $str ]
[-[no]auto_referer] [-[no]referer] [-[no]persistent] [ -alang $list ] [
-acharset $list ] [ -retry $nr ] [ -nregets $nr ] [ -nredirs
$nr ] [ -rollback $nr ] [ -sleep $nr ] [ -[no]rsleep ] [ -timeout
$nr ] [ -rtimeout $nr ] [ -wtimeout $nr ] [-[no]preserve_time]
[-[no]preserve_perm] [-[no]preserve_slinks] [ -bufsize $nr ] [ -maxrate
$nr ] [ -minrate $nr ] [ -user_condition $str ] [ -cookie_file
$file ] [-[no]cookie_send] [-[no]cookie_recv] [-[no]cookie_update] [
-cookies_max $nr ] [ -disabled_cookie_domains $list ] [ -disable_html_tag
$TAG,[$ATTRIB][;...] ] [ -enable_html_tag $TAG,[$ATTRIB][;...] ] [
-tr_del_chr $str ] [ -tr_str_str $str1 $str2 ] [ -tr_chr_chr
$chrset1 $chrset2 ] [ -index_name $str ] [-[no]store_index] [
-store_name $str ] [-[no]debug] [ -debug_level $level ] [ -browser
$str ] [ -urls_file $file ] [ -file_quota $nr ] [ -trans_quota
$nr ] [ -fs_quota $nr ] [-enable_js/-disable_js] [ -fnrules $t
$m $r ] [ -mime_type_file $file ] [-[no]store_info]
[-[no]all_to_local] [-[no]sel_to_local] [-[no]all_to_remote] [ -url_strategy
$strategy ] [-[no]remove_adv] [ -adv_re $RE ] [-[no]check_bg]
[-[no]send_if_range] [ -sched_cmd $str ] [-[no]unique_log] [ -post_cmd
$str ] [ -ssl_version $v ] [-[no]unique_sslid] [ -aip_pattern $re
] [ -dip_pattern $re ] [-[no]use_http11] [ -local_ip $addr ] [ -request
$req ] [ -formdata $req ] [ -httpad $str ] [ -nthreads $nr
] [-[no]immesg] [ -dumpfd {$nr | @[@]$filepath } ] [ -dump_urlfd
{$nr | @[@]$filepath } ] [-[no]unique_name]
[-[dont_]leave_site_enter_dir] [ -max_time $nr ] [-[no]del_after]
[-[no]singlepage] [-[no]dump_after] [-[no]dump_response] [-[no]dump_request] [
-auth_ntlm_domain $str ] [ -auth_proxy_ntlm_domain $str ] [ -js_pattern
$re ] [ -follow_cmd $str ] [-[no]retrieve_symlink] [ -js_transform
$p $t $h $a ] [ -js_transform2 $p $t
$h $a ] [ -ftp_proxy_user $str ] [ -ftp_proxy_pass $str ]
[-[dont_]limit_inlines] [ -ftp_list_options $str ] [-[no]fix_wuftpd_list]
[-[no]post_update] [ -info_dir $dir ] [ -mozcache_dir $dir ] [ -aport
$list ] [ -dport $list ] [-[no]hack_add_index] [ -default_prefix
$str ] [ -ftp_login_handshake $host $handshake ] [ -js_script_file
$file ] [ -dont_touch_url_pattern $pat ] [ -dont_touch_url_rpattern
$pat ] [ -dont_touch_tag_rpattern $pat ] [ -tag_pattern $tag
$attrib $url ] [ -tag_rpattern $tag $attrib $url ] [
-nss_cert_dir $dir ] [-[no]nss_accept_unknown_cert]
[-nss_domestic_policy/-nss_export_policy] [-[no]verify] [ -tlogfile $file ] [
-trelative {object | program} ] [ -tp FQDN[:port] ] [ -transparent_proxy
FQDN[:port] ] [ -tsp FQDN[:port] ] [ -transparent_ssl_proxy
FQDN[:port] ] [-[not]sdemo] [-noencode] [ -[no]ignore_chunk_bug ] [ -hammer_mode
$nr ] [ -hammer_threads $nr ] [ -hammer_flags $nr ] [ -hammer_ease
$nr ] [ -hammer_rtimeout $nr ] [ -hammer_repeat $nr ] [
-[no]log_hammering ] [ -hammer_recdump {$nr | @[@]$filepath } ] [
URLs ]
pavuk [-mode {normal | singlepage | singlereget}] [ -base_level $nr
]
pavuk [-mode sync] [ -ddays $nr ] [ -subdir $dir ]
[-[no]remove_old]
pavuk [-mode resumeregets] [ -subdir $dir ]
pavuk [-mode linkupdate] [ -cdir $dir ] [ -subdir $dir ] [
-scndir $dir ] [ -scenario $str ]
pavuk [-mode reminder] [ -remind_cmd $str ]
pavuk [-mode mirror] [ -subdir $dir ] [-[no]remove_old]
[-[no]remove_before_store] [-[no]always_mdtm]
This manual page describes how to use pavuk.
Pavuk can be used to mirror the contents of Internet/intranet servers and to maintain
copies in a local document tree. Pavuk stores retrieved documents in locally mapped
disk space. The structure of the local tree mirrors that of the remote server. Each
supported service (protocol) has its own sub-directory in the local tree. Each
referenced server has its own sub-directory under the respective protocol
sub-directory, named after the server followed by the port number on which the service
resides, separated by a delimiter character that can be changed. With the option
-fnrules you can change the default layout of the local document tree without losing
link consistency.
With pavuk it is possible to have up-to-date copies of remote documents in
the local disk space.
As of version 0.3pl2, pavuk can automatically restart broken connections, and reget
partial content from an FTP server (which must support the REST command), from a
properly configured HTTP/1.1 server, or from an HTTP/1.0 server which supports
Ranges.
As of version 0.6 it is possible to handle configurations via so-called scenarios.
The best way to create such a configuration file is to use the X Window interface and
simply save the created configuration. The other way is to use the -dumpscn switch.
As of version 0.7pl1 it is possible to store authentication information into an
authinfo file, which pavuk can then parse and use.
As of version 0.8pl4 pavuk can fetch documents for use in a local proxy/cache server
without storing them in the local document tree.
As of version 0.9pl4 pavuk supports SOCKS (4/5) proxies if you have the
required libraries.
As of version 0.9pl12 pavuk can preserve permissions of remote files and symbolic
links, so it can be used for powerful FTP mirroring.
Pavuk releases starting with 0.9.36 support dumping commands to a specific file
(see the -dumpdir and -dumpcmd arguments).
Pavuk supports SSL connections to FTP servers if you specify an ftps:// URL instead of
ftp://.
Pavuk can automatically handle file names containing characters that are unsafe for
the file system. This is currently implemented only on the Win32 platform, and the
behavior is hard-coded.
Pavuk can use the HTTP/1.1 protocol for communication with HTTP servers. It can use
persistent connections, so a single TCP connection can transfer several documents
without being closed. This feature saves network bandwidth and also speeds up network
communication.
Pavuk can make configurable POST requests to HTTP servers and also supports file
uploading via HTTP POST requests.
Pavuk can automatically fill in HTML forms it finds, if the user supplies data for
the form fields beforehand with the -formdata option.
Pavuk can run a configurable number of concurrent downloading threads when compiled
with multithreading support.
Pavuk 0.9pl128 introduced JavaScript bindings for performing complicated tasks
(e.g. decision making, filename transformation) which require more flexibility than a
regular, non-scriptable program can provide.
pavuk 0.9.36 introduced the optional multiplier suffixes K, M or G for numeric
parameter values of command line options. These multipliers represent the ISO
multipliers Kilo(1000), Mega(1000000) and Giga(1.0E9), unless otherwise specified
(some command line options relate to memory or disk sizes in either bytes or kBytes,
where these multipliers are instead processed as the nearest power of two: K(1024),
M(1048576) or G(1073741824)).
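The two suffix interpretations above can be sketched as follows. This is an
illustrative, standalone Python sketch (the function name parse_size is hypothetical,
not part of pavuk) that simply demonstrates the ISO versus power-of-two multiplier
tables:

```python
def parse_size(value, binary=False):
    """Parse a numeric option value with an optional K/M/G suffix.

    ISO multipliers (1000-based) apply by default; options that measure
    memory or disk sizes use the nearest powers of two instead.
    """
    iso = {"K": 10**3, "M": 10**6, "G": 10**9}
    pow2 = {"K": 2**10, "M": 2**20, "G": 2**30}
    table = pow2 if binary else iso
    suffix = value[-1].upper()
    if suffix in table:
        return int(value[:-1]) * table[suffix]
    return int(value)

print(parse_size("5K"))               # 5000
print(parse_size("5K", binary=True))  # 5120
```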
- HTTP
-
http://[[user][:password]@]host[:port][/document]
[[user][:password]@]host[:port][/document]
- HTTPS
-
https://[[user][:password]@]host[:port][/document]
ssl[.domain][:port][/document]
- FTP
-
ftp://[[user][:password]@]host[:port][/relative_path][;type=x]
ftp://[[user][:password]@]host[:port][//absolute_path][;type=x]
ftp[.domain][:port][/document][;type=x]
- FTPS
-
ftps://[[user][:password]@]host[:port][/relative_path][;type=x]
ftps://[[user][:password]@]host[:port][//absolute_path][;type=x]
ftps[.domain][:port][/document][;type=x]
- Gopher
-
gopher://host[:port][/type[document]]
gopher[.domain][:port][/type[document]]
- HTTP
-
http://[[user][:password]@]host[:port][/document][?query]
to
http/host_port/[document][?query]
- HTTPS
-
https://[[user][:password]@]host[:port][/document][?query]
to
https/host_port/[document][?query]
- FTP
-
ftp://[[user][:password]@]host[:port][/path]
to
ftp/host_port/[path]
- FTPS
-
ftps://[[user][:password]@]host[:port][/path]
to
ftps/host_port/[path]
- Gopher
-
gopher://host[:port][/type[document]]
to
gopher/host_port/[type[document]]
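The URL-to-path mappings above can be approximated in a few lines. This illustrative
Python sketch is not pavuk's actual code; in particular, the default-port values and
the exact host/port delimiter handling are assumptions:

```python
from urllib.parse import urlsplit

def local_path(url):
    """Map a URL onto pavuk's default local tree layout:
    scheme/host_port/document (illustrative approximation)."""
    parts = urlsplit(url)
    # Assumed default ports when the URL does not name one explicitly.
    port = parts.port or {"http": 80, "https": 443, "ftp": 21,
                          "ftps": 990, "gopher": 70}[parts.scheme]
    doc = parts.path.lstrip("/")
    if parts.query:
        doc += "?" + parts.query
    return "%s/%s_%d/%s" % (parts.scheme, parts.hostname, port, doc)

print(local_path("http://example.com/docs/index.html"))
```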
Note
Pavuk uses the string with which it queries the target server as the name of the
resulting file. This file name may, in some cases, contain punctuation characters such
as $, ?, =, & etc. Such punctuation can cause problems when you browse the downloaded
files with your browser, process them with shell scripts, or view them with file
management utilities that reference the file name. If you believe this may be causing
problems for you, you can remove all punctuation from the result file name with the
option -tr_del_chr [:punct:] or with the other file name adjustment options
(-tr_str_str and -tr_chr_chr ).
The order in which these URL to file name conversions are applied is as follows:
-tr_str_str is applied first, followed by -tr_del_chr , while
-tr_chr_chr comes last.
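The documented ordering of the three conversions can be illustrated with a small
sketch. The function and parameter names here are hypothetical, and the [:punct:]
class is shown as an explicit character set for simplicity:

```python
def convert_name(name, str_pairs=(), del_chars="", chr_from="", chr_to=""):
    """Apply pavuk's three name conversions in their documented order:
    -tr_str_str first, then -tr_del_chr, then -tr_chr_chr."""
    for old, new in str_pairs:                                 # -tr_str_str
        name = name.replace(old, new)
    name = name.translate({ord(c): None for c in del_chars})   # -tr_del_chr
    name = name.translate(str.maketrans(chr_from, chr_to))     # -tr_chr_chr
    return name

# Remove shell-unfriendly punctuation, then map spaces to underscores.
print(convert_name("page?id=1&x=2 copy", del_chars="?=&",
                   chr_from=" ", chr_to="_"))
```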
All options are case insensitive.
-
Mode
-
Help
-
Indicate/Logging/Interface options
-
Netli options
-
Special start
-
Scenario/Task options
-
Directory options
-
Preserve options
-
Proxy options
-
Proxy authentication
-
Protocol/Download Option
-
Authentication
-
Site/Domain/Port Limitation Options
-
Limitation Document properties
-
Limitation Document name
-
Limitation Protocol Option
-
Other Limitation Options
-
JavaScript support
-
Cookie
-
HTML rewriting engine tuning options
-
File name / URL Conversion Option
-
Hammer mode options: load testing web sites
-
Other Options
- -mode {normal, linkupdate, sync, singlepage, singlereget, resumeregets,
dontstore, ftpdir, mirror, reminder}
-
Set operation mode.
- normal
-
retrieve documents recursively
- linkupdate
-
update remote URLs in local HTML documents to local URLs if these URLs exist
in the local tree
- sync
-
synchronize remote documents with local tree (if a local copy of a document
is older than remote, the document is retrieved again, otherwise nothing
happens)
- singlepage
-
the URL is retrieved as a single page with all inline objects (pictures, sounds ...);
this mode is now obsolete, use the -singlepage option instead.
- resumeregets
-
pavuk scans the local tree for files that were not retrieved fully and
retrieves them again (uses partial get if possible)
- singlereget
-
get URL until it is retrieved in full
- dontstore
-
transfer page from server, but don’t store it to the local tree. This
mode is suitable for fetching pages that are held in a local proxy/cache
server.
- reminder
-
used to inform the user about changed documents
- mirror
-
similar to the ’sync’ mode, but will automatically remove local
documents which do not exist anymore on the remote site. This mode will make an
exact copy of the remote site, including keeping the file names intact as much
as possible.
- ftpdir
-
used to list the contents of FTP directories
The default operation mode is normal.
- -h, -help
-
Print long verbose help message
- -v, -version
-
Show version information and feature set configuration at compilation time.
Feature : Developer Debug Build
Short description : Identifies this pavuk binary as compiled with debug
features enabled (-DDEBUG), such as extra run-time checks.
Affects : all
Feature : Debug features
Short description : This pavuk binary can show very detailed debug /
diagnostic information about the grabbing process, including message dumps,
etc.
Affects : -debug/-nodebug , -debug_level $level
Feature : GNU gettext internationalization of messages
Short description : Important messages can be shown in the local
language.
Affects : -language , -msgcat
Feature : flock() / fcntl() document locking
Short description : When you do not have this built in, you should refrain
from running multiple pavuk binaries and/or multithreaded sessions. Depending on
the built-in locking type (’flock()’, ’Win32 flock()’ or
’fcntl()’) you may or may not be able to safely use network shared
storage for the results of your session: fcntl() locking is assumed to be capable of
locking files on NFS shares, while flock() very probably won’t be able to do
that.
Affects : file I/O
Feature : Gtk GUI interface
Short description : You can use the built-in GUI.
Affects : -X , -with_gui , -runX , -prefs ,
-noprefs , -xmaxlog , -gui_font
Feature : GUI with URL tree preview
Short description : You can use the built-in GUI URL tree views.
Affects : -browser
Feature : HTTP and FTP over SSL; SSL layer implemented with OpenSSL /
SSLeay / NSS library
Short description : You can access SSL secured URLs / sites and proxies.
pavuk may have been built with either OpenSSL, SSLeay or Netscape SSL support. Some
features are only available with one implementation, others only with
another.
Affects : -noSSL , -SSL , -verify , -noverify ,
-noFTPS , -FTPS , -ssl_cert_passwd , -ssl_cert_file ,
-ssl_key_file , -ssl_cipher_list , -ssl_proxy ,
-ssl_version , -unique_sslid , -nounique_sslid ,
-nss_cert_dir , -nss_accept_unknown_cert ,
-nonss_accept_unknown_cert , -nss_domestic_policy ,
-nss_export_policy
Feature : Socks proxy support
Short description : You can use SOCKS4 and/or SOCKS5 proxies.
Affects :
Feature : file-system free space checking
Short description : You can use quotas to prevent your local storage from
filling up / overflowing.
Affects : -file_quota
Feature : optional regex patterns in -fnrules and -*rpattern
options
Short description : You can use regular expressions to help pavuk select and
filter content. pavuk also mentions which regex engine has been built in: POSIX,
Bell V8, BSD, GNU, PCRE or TRE
Affects : -rpattern , -skip_rpattern , -url_rpattern ,
-skip_url_rpattern , -remove_adv , -noremove_adv ,
-adv_re , -aip_pattern , -dip_pattern , -js_pattern ,
-js_transform , -js_transform2 , -dont_touch_url_rpattern ,
-dont_touch_tag_rpattern , -tag_rpattern
Feature : support for loading files from Netscape browser cache
Short description : You can access the private browser cache of Netscape
browsers.
Affects : -nscache_dir
Feature : support for loading files from Microsoft Internet Explorer
browser cache
Short description : You can access the private browser cache of Microsoft
Internet Explorer browsers.
Affects : -ie_cache
Feature : support for detecting whether pavuk is running as background
job
Short description : Progress reports, etc. will be disabled when pavuk is
running as a background task.
Affects : -check_bg , -nocheck_bg , -progress_mode ,
-verbose , -noverbose , -noquiet , -debug_level ,
-nodebug , -debug , ...
Feature : multithreading support
Short description : Allows pavuk to perform multiple tasks
simultaneously.
Affects : -hammer_threads , -nthreads , -immesg ,
-noimmesg
Feature : NTLM authorization support
Short description : You can access web servers which use NTLM-based access
security.
Affects : -auth_ntlm_domain , -auth_proxy_ntlm_domain
Feature : JavaScript bindings
Short description : You can use JavaScript-based filters and patterns.
Affects : -js_script_file
Feature : IPv6 support
Short description : Pavuk incorporates basic IPv6 support.
Affects :
Feature : HTTP compressed data transfer (gzip/compress/deflate
Content-Encoding)
Short description : pavuk supports compressed transmission formats (HTTP
Accept-Encoding) to reduce network traffic load.
Affects : -noEnc , -Enc
Feature : DoS support (a.k.a. ’chunky’ a.k.a. ’hammer
modes’)
Short description : this pavuk binary can be used to test
(’hammer’) your sites
Affects : -hammer_recdump , -log_hammering ,
-nolog_hammering , -hammer_threads , -hammer_mode ,
-hammer_flags , -hammer_ease , -hammer_rtimeout ,
-hammer_repeat
- -quiet
-
Don’t show any messages on the screen.
- -verbose
-
Force output messages to be shown on the screen (default)
- -progress/-noprogress
-
Show retrieval progress while running in the terminal (default is progress
off). When turned on, progress is shown in the format specified by the
-progress_mode setting.
Note
This option only has effect when pavuk is run in a console window.
- -progress_mode $nr
-
Specify how progress (see -progress ) will be shown to the user. Several
modes $nr are supported:
- 0
-
Report every run (-hammer_mode ) and URL fetched on a separate line.
Also show the download progress (bytes and percentage downloaded) while
fetching a document from the remote site. This is the most verbose progress
display. (default)
Example output:
URL[ 1]: 35(0) of 56 http://hobbelt.com/CAT-tuts/panther-l2-50pct.jpg
S: 10138 / 10138 B [100.0%] [R: 187.8 kB/s] [ET: 0:00:00] [RT: 0:00:00]
URL[ 1]: 38(0) of 56 http://hobbelt.com/CAT-tuts/get-started-cat-50pct.jpg
S: 5868 / 5868 B [100.0%] [R: 114.8 kB/s] [ET: 0:00:00] [RT: 0:00:00]
URL[ 2]: 34(0) of 56 http://hobbelt.com/CAT-tuts/CAT_Panther_CM2.avi
S: 8311 / 8311 kB [100.0%] [R: 4.7 MB/s] [ET: 0:00:01] [RT: 0:00:00]
URL[ 2]: 40(0) of 56 http://hobbelt.com/icons/knowspam-teeny-9.gif
S: 817 / 817 B [100.0%] [R: 20.3 kB/s] [ET: 0:00:00] [RT: 0:00:00]
- 1
-
Report every run (-hammer_mode ) in a concise format
(’=RUN=’) and display each URL fetched as a separate dot
’.’.
Example output:
............................................[URL] download: ERROR: HTTP document not found
- 2, 3, 4, 5, 6
-
These are identical to mode 1 , except in hammer mode while
hammering a site. Increase the number to see less progress info during a hammer
operation.
- -stime/-nostime
-
Show start and end time of transfer. (By default this information is not
shown.)
- -xmaxlog $nr
-
Maximum number of log lines in the Log widget. 0 means unlimited. This option is
available only when compiled with the GTK+ GUI. (default value is 0)
$nr specifies the size in bytes, unless postfixed with one of the
characters K, M or G, which imply the multipliers K(1024), M(1048576) or
G(1073741824).
- -logfile $file
-
File where all produced messages are stored.
- -unique_log/-nounique_log
-
When the log file specified with the -logfile option is already in use by
another process, try to generate a new unique name for the log file. (By default this
option is turned off.)
- -slogfile $file
-
File to store short logs in. This file contains one line of information per
processed document. This is meant to be used in connection with any sort of script
to produce some statistics, for validating links on your website, or for generating
simple site maps. Multiple pavuk processes can use this file concurrently, without
overwriting each other's entries. Record structure:
- PID
-
process id of pavuk process
- TIME
-
current time
- COUNTER
-
in the format current/total number of URLs
- STATUS
-
contains the type of the error: FATAL, ERR, WARN or OK
- ERRCODE
-
is the number code of the error (see errcode.h in pavuk sources)
- URL
-
of the document
- PARENTURL
-
first parent document of this URL ([none] when it doesn’t have a
parent)
- FILENAME
-
is the name of the local file the document is saved under
- SIZE
-
size of requested document if known
- DOWNLOAD_TIME
-
time taken to download this document, in the format
seconds.milli_seconds
- HTTPRESP
-
contains the first line of the HTTP server response
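Assuming the fields appear whitespace-separated in the order listed above, with
HTTPRESP as the trailing remainder (an assumption made for illustration; verify
against the actual output of your pavuk version), a short-log line can be parsed like
this:

```python
from collections import namedtuple

# Field order as documented above; the whitespace-separated layout and
# the sample values below are assumptions for illustration only.
Record = namedtuple("Record", "pid time counter status errcode url "
                              "parenturl filename size download_time httpresp")

def parse_slog_line(line):
    # HTTPRESP may contain spaces, so it absorbs the remainder.
    return Record(*line.split(maxsplit=10))

r = parse_slog_line("12345 1061890123 3/56 OK 0 http://example.com/a.html "
                    "[none] example.com_80/a.html 1024 0.250 HTTP/1.1 200 OK")
print(r.status, r.url)
```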
- -language $str
-
Native language that pavuk should use for communication with its user (works
only when there is a message catalog for that language). GNU gettext support
(for message internationalization) must also be compiled in. The default language is
taken from your NLS environment variables.
- -gui_font $font
-
Font used in the GUI interface. To list available X fonts use the
xlsfonts command. This option is available only when compiled with GTK+ GUI
support.
- -read_css/-noread_css
-
Enable or disable fetching objects mentioned in inline and external CSS style
sheets.
- -read_msie_cc/-noread_msie_cc
-
Enable or disable fetching objects mentioned in Microsoft Internet Explorer
Conditional Comments (a.k.a. MSIE CC’s).
- -read_cdata/-noread_cdata
-
Enable or disable fetching objects mentioned in <![CDATA[...]]>
sections.
- -read_comments/-noread_comments
-
Enable or disable fetching objects mentioned in HTML <!-- ... --> Comment
sections.
- -verify/-noverify
-
Enable or disable verifying server CERTS in SSL mode.
- -tlogfile $file
-
Turn on Netli logging with output to specified file.
- -trelative {object | program}
-
Make Netli timings relative to the start of the first object or the program.
- -tp FQDN[:port] , -transparent_proxy FQDN[:port]
-
When processing a URL, send the original request, but send it to the IP address at
FQDN
- -tsp FQDN[:port] , -transparent_ssl_proxy FQDN[:port]
-
When processing an HTTPS URL, send the original request, but send it to the IP address
at FQDN
- -sdemo/-notsdemo
-
Output in sdemo compatible format. This is only used by sdemo . (For now
it simply means output ’-1’ rather than ’*’ when
measurements are invalid.)
- -encode/-noencode
-
Do / do not escape characters that are "unsafe" in URLs. Default behavior
is to escape unsafe characters.
- -X, -x, -with_gui
-
Start program with X Window interface (if compiled with support for GTK+). By
default pavuk starts without GUI and behaves like a regular command-line tool.
- -runX
-
When used together with the -X option, pavuk starts processing URLs
immediately after the GUI window is launched. Without -X , this option has
no effect. Only available when compiled with GTK+
support.
- -bg/-nobg
-
This option allows pavuk to detach from its terminal and run in the background.
Pavuk will then not output any messages to the terminal. If you want to see
messages, you have to use the -logfile option to specify a file where
messages will be written. By default pavuk executes in the foreground.
- -check_bg/-nocheck_bg
-
Normally, programs sent into the background after being run in the foreground
continue to output messages to the terminal. If this option is activated, pavuk
checks whether it is running as a background job and will not write any messages to
the terminal in that case. After it becomes a foreground job again, it will resume
writing messages to the terminal in the normal way. This option is available only when
your system supports retrieving terminal info via the tc*() functions.
- -prefs/-noprefs
-
When you turn this option on, pavuk will preserve all settings when exiting, and
when you run pavuk with the GUI interface again, all settings will be restored. The
settings are stored in the ~/.pavuk_prefs file. By default pavuk
does not restore its options when started. This option is available only when compiled
with GTK+.
- -schedule $time
-
Execute pavuk at the time specified as parameter. The format of the $time
parameter is YYYY.MM.DD.hh.mm . You need properly configured scheduling
with the at command on your system to use this option. If the default
configuration (at -f %f %t %d.%m.%Y ) of the scheduling command doesn’t work
on your system, try adjusting it with the -sched_cmd option.
$time must be specified as local (a.k.a. ’wall clock’)
time.
- -reschedule $nr
-
Execute pavuk periodically, every $nr hours. You need properly configured
scheduling with the at command on your system to use this option.
- -sched_cmd $str
-
Command to use for scheduling. Pavuk explicitly supports scheduling with
at . $str should contain regular characters and macros, escaped by the
% character. Supported macros are:
- %f
-
for script filename
- %t
-
for time (in format HH:MM)
- ...
-
all macros as supported by the strftime (3)
function
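The macro expansion described above can be sketched as follows. This illustrative
Python version (the function name is hypothetical) expands %f and %t itself and leaves
the remaining macros to strftime:

```python
import time

def expand_sched_cmd(template, script, when):
    """Expand pavuk-style scheduling macros: %f -> script filename,
    %t -> HH:MM, and all remaining %-macros via strftime."""
    cmd = template.replace("%f", script)
    cmd = cmd.replace("%t", time.strftime("%H:%M", when))
    return time.strftime(cmd, when)

# A -schedule time of 2024.03.01.14.30 with the default command template:
when = time.strptime("2024.03.01.14.30", "%Y.%m.%d.%H.%M")
print(expand_sched_cmd("at -f %f %t %d.%m.%Y", "/tmp/pavuk_sched.sh", when))
# at -f /tmp/pavuk_sched.sh 14:30 01.03.2024
```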
- -urls_file $file
-
If you use this option, pavuk will read URLs from $file before it starts
processing. In this file, each URL must be on a separate line. After the last
URL, a single dot . followed by a LF (line-feed) character denotes the end.
Pavuk starts processing right after all URLs have been read. If
$file is given as the - character, standard input is
read.
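A -urls_file can be generated like this (a minimal sketch of the format described
above; the function name is illustrative):

```python
def write_urls_file(path, urls):
    """Write a pavuk -urls_file: one URL per line, terminated by a
    single dot followed by a line feed."""
    with open(path, "w") as f:
        for url in urls:
            f.write(url + "\n")
        f.write(".\n")

write_urls_file("urls.txt",
                ["http://example.com/", "http://example.com/a.html"])
print(open("urls.txt").read())
```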
- -store_info/-nostore_info
-
This option causes pavuk to store information about each document in a
separate file in the .pavuk_info directory. This file stores the
original URL from which the document was downloaded. For files
downloaded via the HTTP or HTTPS protocols, the whole HTTP response header is stored
there as well. I recommend using this option when you use options that change the
default layout of the local document tree, because this info file helps pavuk
map the local filename to the URL. This option is also very useful when different
URLs map to the same filename in the local tree. When this occurs, pavuk detects it
using the info files and prefixes the local name with numbers. By default, storing
of this extra information is disabled.
- -info_dir $dir
-
With this option you can set the location of a separate directory for storing the
info files created when the -store_info option is used. This is useful when
you don’t want to mix info files with regular document files in the
destination directory. The structure of the info files is preserved; they are just
stored in a different directory.
- -request $req
-
With this option you can specify extended information for starting URLs, such as
query data for POST or GET requests. The current
syntax of this option is:
URL:["]$url["] [METHOD:["]{GET|POST}["]] [ENCODING:["]{u|m}["]]
[FIELD:["]variable=value["]]
[COOKIE:["][variable=value;[...]]variable=value[;]["]]
[FILE:["]variable=filename["]]
[LNAME:["]local_filename["]]
- URL
-
specifies request URL
- METHOD
-
specifies request method for URL and is one of GET or POST.
- ENCODING
-
specifies encoding for request body data.
- m
-
is for multipart/form-data encoding
- u
-
is for application/x-www-form-urlencoded encoding
- FIELD
-
specifies field of request data in format variable=value. For encoding of
special characters in variable and value you can use same encoding as is used
in application/x-www-form-urlencoded encoding.
- COOKIE
-
specifies one or more cookies that are related to the specified URL. These
cookies will be used/transmitted by pavuk when this URL is accessed, thus
enabling pavuk to access URLs which require the use of specific cookies for a
proper response.
Note
The settings of command-line option -disabled_cookie_domains does
apply.
See the Cookie chapter for more info.
- FILE
-
specifies special field of query, which is used to specify file for POST
based file upload.
- LNAME
-
specifies localname for this request
When you need to use special characters inside the FIELD: and FILE: parts of a
request specification, you should use the application/x-www-form-urlencoded
encoding of characters. This means all non-ASCII characters, the quote character ("),
the space character ( ), the ampersand character (&), the percent character (%) and
the equals character (=) should be encoded in the form %xx ,
where xx is the hexadecimal representation of the character's ASCII value. So for
example the % character should be encoded as %25 .
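The encoding rule above can be sketched like this. This is an illustrative helper,
not pavuk code; encoding non-ASCII input per UTF-8 byte is an assumption made for the
sketch:

```python
def encode_field_value(value):
    """Encode a FIELD:/FILE: value per the rule above: non-ASCII bytes
    plus " space & % = become %xx (hexadecimal byte value)."""
    special = set('" &%=')
    out = []
    for byte in value.encode("utf-8"):
        ch = chr(byte)
        if byte > 127 or ch in special:
            out.append("%%%02X" % byte)
        else:
            out.append(ch)
    return "".join(out)

print(encode_field_value("50% off & more"))  # 50%25%20off%20%26%20more
```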
- -formdata $req
-
This option gives you the chance to specify contents for HTML forms found while
traversing the document tree. The current syntax of this option is the same as for the
-request option, but ENCODING: and METHOD: are meaningless
here. In URL: you have to specify the HTML form action URL,
which will be matched against action URLs found in processed HTML documents. If
pavuk finds an action URL which matches the one supplied in the -formdata option,
pavuk will construct a GET or POST request from the data supplied in this
option and from the default form field values supplied in the HTML document. Values
supplied on the command line take precedence over those supplied in the HTML file.
- -nthreads $nr
-
With this option you can specify how many concurrent threads will
download documents. By default pavuk runs 3 concurrent downloading threads.
This option is available only when pavuk is compiled to support
multithreading.
- -immesg/-noimmesg
-
Pavuk's default behavior when running multiple downloading threads is to buffer
all output messages in memory and flush the buffered data only when a thread
finishes processing a document. With this option you can change this behavior
to see each message immediately when it is produced. It is only useful when you
want to debug multithreading specifics.
This option is available only when pavuk is compiled to support
multithreading.
- -dumpfd $nr / -dumpfd @[@]$file
-
For scripting it is sometimes useful to download a document directly to a
pipe or variable instead of storing it in a regular file. In such cases you can use
this option to dump the data, for example, to stdout ( $nr = 1
).
Note
pavuk 0.9.36 and later releases also support the @$file
argument, where you can specify a file to dump the data to. The file path must be
prefixed by an ’@’ character. If you prefix the file path with a
second ’@’, pavuk will assume you wish to append to an already
existing file. Otherwise the file will be created/erased when pavuk starts.
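The {$nr | @[@]$filepath} argument convention can be sketched as follows (an
illustrative Python helper, not pavuk code):

```python
import os

def open_dump_target(arg):
    """Interpret a -dumpfd style argument: a plain number selects an
    already open file descriptor, @file creates/erases a file, and
    @@file appends to an existing file."""
    if arg.startswith("@@"):
        return open(arg[2:], "ab")      # append to an existing file
    if arg.startswith("@"):
        return open(arg[1:], "wb")      # created/erased at start
    return os.fdopen(int(arg), "wb")    # e.g. "1" -> stdout

f = open_dump_target("@dump.bin")
f.write(b"data")
f.close()
```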
- -dump_after/-nodump_after
-
When using the -dumpfd option with a multithreaded pavuk, each document must be
dumped in a single operation, because documents downloaded by multiple threads could
otherwise be interleaved. This option is also useful when you want to dump documents
after pavuk adjusts the links inside HTML documents.
- -dump_request/-nodump_request
-
This option has effect only when used with the -dumpfd option. It is used
to dump HTTP requests.
- -dump_response/-nodump_response
-
This option has effect only when used with the -dumpfd option. It is used
to dump HTTP response headers.
- -dump_urlfd $nr / -dump_urlfd @[@]$file
-
When you use this option, pavuk will output all URLs found in HTML
documents to file descriptor $nr . You can use this option, for example, to
extract all URLs, convert them to absolute URLs, and write them to stdout.
Note
pavuk 0.9.36 and later releases also support the @$file
argument, where you can specify a file to dump the data to. The file path must be
prefixed by an ’@’ character. If you prefix the file path with a
second ’@’, pavuk will assume you wish to append to an already
existing file. Otherwise the file will be created/erased when pavuk starts.
- -scenario $str
-
Name of the scenario to load and/or run. Scenarios are files with a structure
similar to the .pavukrc file and contain saved configurations. You
can use them for periodic mirroring. Parameters from scenarios specified on the
command line can be overridden by command-line parameters. To use this
option, you need to specify the scenario base directory with the option -scndir
.
- -dumpscn $filename
-
Store the actual configuration into a scenario file named $filename
. This is useful for quickly creating pre-configured scenarios for manual editing.
- -dumpcmd $str
-
File name where the command will be ’dumped’. To be able to use this
option, you need to specify the dump base directory with option -dumpdir
.
- -msgcat $dir
-
Directory which contains the message catalog for pavuk. If you do not have
permission to store a pavuk message catalog in the system directory, simply
recreate in your home directory the same directory structure that exists on
your system.
For example:
Your native language is German, and your home directory is /home/jano
.
You should first create the directory
/home/jano/locales/de/LC_MESSAGES/ , then put the German pavuk.mo there and
set -msgcat to /home/jano/locales/ . If the locale environment
variables are set properly, you will see pavuk speaking German. This option is
available only when pavuk was compiled with support for GNU gettext message
internationalization.
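The German-catalog example above can be sketched as follows. A temporary directory stands in for /home/jano, and the pavuk.mo file is assumed to exist already:

```shell
# Recreate the locale tree from the example under a scratch directory.
base="$(mktemp -d)"                     # stands in for /home/jano
mkdir -p "$base/locales/de/LC_MESSAGES"
# cp pavuk.mo "$base/locales/de/LC_MESSAGES/"   # install the German catalog here
echo "now run: pavuk -msgcat $base/locales/ ..."
```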
- -cdir $dir
-
Directory where all retrieved documents are stored. If not specified, the
current directory is used. If the specified directory doesn’t exist, it will
be created.
- -scndir $dir
-
Directory in which your scenarios are stored. You must use this option when you
are loading or storing scenario files.
- -dumpdir $dir
-
Directory in which your command dumps are stored. You must use this option when
you are storing command dump files using the -dumpcmd command.
- -preserve_time/-nopreserve_time
-
Store downloaded documents with the same modification time as on the remote site.
The modification time will be set only when such information is available (some FTP
servers do not support the MDTM command, and some documents on HTTP servers
are generated on the fly, so pavuk can’t retrieve their modification time).
By default, modification times are not preserved.
- -preserve_perm/-nopreserve_perm
-
Store downloaded documents with the same permissions as on the remote site. This
option has effect only when downloading files through the FTP protocol, and
assumes that the -FTPlist option is used. By default, permissions are not
preserved.
- -preserve_slinks/-nopreserve_slinks
-
Set symbolic links to point to exactly the same location as on the remote server;
don’t do any relocation. This option has effect only when downloading files
through the FTP protocol, and assumes that the -FTPlist option is used. By
default, symbolic links are not preserved and are retrieved as regular documents
with the full contents of the linked file.
For example, assume that on the FTP server ftp.xx.org there is a symbolic link
/pub/pavuk/pavuk-current.tgz , which points to
/tmp/pub/pavuk-0.9pl11.tgz . Pavuk will create the symbolic link
ftp/ftp.xx.org_21/pub/pavuk/pavuk-current.tgz .
If the option -preserve_slinks is used, this symbolic link will point to
/tmp/pub/pavuk-0.9pl11.tgz .
If the option -nopreserve_slinks is used, this symbolic link will point
to ../../tmp/pub/pavuk-0.9pl11.tgz .
- -retrieve_symlink/-noretrieve_symlink
-
Retrieve files behind symbolic links instead of replicating symlinks in local
tree.
- -http_proxy $site[:$port]
-
If this parameter is used, then all HTTP requests are routed through this proxy
server. This is useful if your site resides behind a firewall, or if you want to
use an HTTP proxy cache server. The default port number is 8080. Pavuk allows you
to specify multiple HTTP proxies (using multiple -http_proxy options) and
will rotate them in round-robin fashion, disabling proxies that return errors.
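A hypothetical invocation sketch (proxy host names and URL are placeholders): specify two proxies so pavuk can rotate between them and route around failures.

```shell
# All HTTP traffic goes through one of the two proxies, round-robin;
# a proxy that returns errors is disabled automatically.
pavuk -http_proxy proxy1.example.com:8080 \
      -http_proxy proxy2.example.com:3128 \
      http://www.example.com/
```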
- -nocache/-cache
-
Use this option whenever you want to get documents directly from the site and
not from your HTTP proxy cache server. By default, pavuk allows transfer of
document copies from the cache.
- -ftp_proxy $site[:$port]
-
If this parameter is used, then all FTP requests are routed through this proxy
server. This is useful when your site resides behind a firewall, or if you want to
use an FTP proxy cache server. The default port number is 22. Pavuk supports three
different types of proxies for FTP; see the options -ftp_httpgw and
-ftp_dirtyproxy . If neither of these options is used, pavuk assumes a
regular FTP proxy which connects to the remote FTP server with USER user@host
.
- -ftp_httpgw/-noftp_httpgw
-
The specified FTP proxy is an HTTP gateway for the FTP protocol. The default FTP
proxy type is a regular FTP proxy.
- -ftp_dirtyproxy/-noftp_dirtyproxy
-
The specified FTP proxy is an HTTP proxy which supports the CONNECT request
(pavuk then uses the full FTP protocol, except for active data connections). The
default FTP proxy type is a regular FTP proxy. If both -ftp_dirtyproxy and
-ftp_httpgw are specified, -ftp_dirtyproxy is preferred.
- -gopher_proxy $site[:$port]
-
Gopher gateway or proxy/cache server.
- -gopher_httpgw/-nogopher_httpgw
-
The specified Gopher proxy server is an HTTP gateway for the Gopher protocol. When
-gopher_proxy is set and this -gopher_httpgw option isn’t used,
pavuk uses the proxy as an HTTP tunnel, opening connections to Gopher servers
with the CONNECT request.
- -ssl_proxy $site[:$port]
-
SSL proxy (tunneling) server [such as CERN httpd with patch, or Squid] with the
CONNECT request enabled (at least on port 443). This option is available
only when compiled with SSL support (you need the SSLeay or OpenSSL libraries with
development headers).
- -http_proxy_user $user
-
User name for HTTP proxy authentication.
- -http_proxy_pass $pass
-
Password for HTTP proxy authentication.
- -http_proxy_auth {1/2/3/4/user/Basic/Digest/NTLM}
-
Authentication scheme for proxy access. This has a similar meaning to the
-auth_scheme option (see that option for more details). Default is
2 (Basic scheme).
- -auth_proxy_ntlm_domain $str
-
NT or LM domain used for authorization against an HTTP proxy server when the NTLM
authentication scheme is required. This option is available only when compiled
with the OpenSSL or libdes libraries.
- -auth_reuse_proxy_nonce/-noauth_reuse_proxy_nonce
-
When using the HTTP proxy Digest access authentication scheme, reuse the first
received nonce value in multiple subsequent requests.
- -ftp_proxy_user $user
-
User name for FTP proxy authentication.
- -ftp_proxy_pass $pass
-
Password for FTP proxy authentication.
- -ftp_passive
-
Use passive FTP when downloading via FTP.
- -ftp_active
-
Use active FTP when downloading via FTP.
- -active_ftp_port_range $min:$max
-
This option permits specifying the ports used for active FTP, which allows
easier firewall configuration since the range of ports can be restricted.
Pavuk will randomly choose a number from within the specified range until an
open port is found. Should no open port be found within the given range, pavuk
will fall back to a normal kernel-assigned port and output a message (debug level
net ).
The port range selected must be in the non-privileged range (e.g. greater than
or equal to 1024); it is STRONGLY RECOMMENDED that the chosen range be large
enough to handle many simultaneous active connections (for example, 49152-65534,
the IANA-registered ephemeral port range).
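A hypothetical invocation sketch (host and URL are placeholders; the $min:$max form follows the option syntax above): force active FTP and restrict the local data ports to the IANA ephemeral range.

```shell
# Active FTP with local data ports confined to 49152-65534, easing firewall rules.
pavuk -ftp_active -active_ftp_port_range 49152:65534 ftp://ftp.example.com/pub/
```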
- -always_mdtm/-noalways_mdtm
-
Force pavuk to always use the "MDTM" command to determine the file modification
time and never use cached times determined when listing the remote files.
- -remove_before_store/-noremove_before_store
-
Force unlinking of files before new content is stored to a file. This is
helpful when the local files are hardlinked into some other directory and the
hardlinks are checked after mirroring: every "broken" hardlink then indicates a
file update.
- -retry $nr
-
Set the number of attempts to transfer a processed document. The default is 1,
which means pavuk will retry once to get documents which failed on the first
attempt.
- -nregets $nr
-
Set the number of allowed regets on a single document, after a broken transfer.
Default value for this option is 2.
This option is ignored when running pavuk in singlereget mode, as pavuk
will then keep trying to reget the URL until it succeeds or a fatal error occurs.
If the server is found not to support regetting content and
-force_reget has not been specified, this is regarded as a fatal
error.
- -nredirs $nr
-
Set the number of allowed HTTP redirects (use this to prevent redirect loops).
The default value for this option is 5, which conforms to the HTTP specification.
- -rollback $nr
-
Set the number of bytes to discard from the already locally available content
(counted from the end of the file) if regetting. Default value for this option is
0.
- -force_reget/-noforce_reget
-
Force regetting of the whole document after a broken transfer when the
server doesn’t support retrieving partial content. Pavuk’s default
behavior is to stop getting documents which don’t allow restarting the
transfer from a specified position.
When forced reget’ing is turned on, pavuk will still start fetching each
URL by requesting a partial content download when (part of) the URL content is
already available locally. However, when such an attempt fails, pavuk will discard
the notion of requesting a partial content download (i.e. HTTP Range specification)
entirely for this URL only and attempt to download the content as a whole
instead.
Hence, in order for ’-force_reget’ to work as expected, each URL
must be spidered at least twice, i.e. the -nregets command-line option
should have a value of at least 1 (the default is 2 if that option is not
specified explicitly).
- -timeout $nr
-
Timeout for stalled connection attempts in milliseconds. The default timeout is
0, which means timeout checking is disabled.
$nr specifies the timeout in milliseconds, unless postfixed with one of
the characters S, M, H or D (in either upper or lower case), which select the
alternative time units S = seconds, M = minutes, H = hours or D = days.
- -rtimeout $nr
-
Timeout for data read operations in milliseconds: the connection is closed with
an error when no further data is received within this time limit. The default
timeout is 0, which means timeout checking is disabled.
$nr specifies the timeout in milliseconds, unless postfixed with one of
the characters S, M, H or D (in either upper or lower case), which select the
alternative time units S = seconds, M = minutes, H = hours or D = days.
- -wtimeout $nr
-
Timeout for data write operations in milliseconds: the connection is closed with
an error when no further data could be transmitted within this time limit. The
default timeout is 0, which means timeout checking is disabled.
$nr specifies the timeout in milliseconds, unless postfixed with one of
the characters S, M, H or D (in either upper or lower case), which select the
alternative time units S = seconds, M = minutes, H = hours or D = days.
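A hypothetical invocation sketch (placeholder URL) combining the three timeout options with the unit postfixes described above:

```shell
# 30-second connect timeout, 2-minute read timeout, 2-minute write timeout.
pavuk -timeout 30s -rtimeout 2m -wtimeout 2m http://www.example.com/
```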
- -noRobots/-Robots
-
This switch suppresses the use of the robots.txt standard, which is used
to restrict the access of Web robots to some locations on a web server. By
default, checking of robots.txt files on HTTP servers is enabled. Leave checking
enabled whenever you are downloading huge sets of pages with unpredictable
layout; this prevents you from upsetting server administrators :-).
- -noEnc/-Enc
-
This switch suppresses / enables the use of the gzip , compress or
deflate encoding in transfers.
Some servers are broken in that they report files with the MIME type
application/gzip or application/compress as gzip or compress encoded, when the
encoding should have been reported as ’untouched’, which is defined by
the keyword ’identity’ according to the HTTP standards. See for example
the HTTP/1.1 standard RFC2616, section 14.3, Accept-Encoding, and its counterpart,
section 14.11, Content-Encoding.
Turn this option off (-noEnc) when you don’t want to allow the server to
compress content for transmission: in that case, the server will transmit all
content as is, which, in the case of faulty servers mentioned above, means you will
receive the compressed file types exactly as they are stored on the server and no
undesirable decompression attempts will be made by pavuk.
By default, the option ’-Enc’ is enabled, as this allows for often
significant data transfer savings, resulting in fewer transmission costs and faster
web responses. Note: when you have a pavuk binary without libz support compiled in,
pavuk will never request content compression, as it won’t be able to
decompress those results. In that case, ’-Enc’ is identical to
’-noEnc’.
For improved functionality, make sure your pavuk binary comes with libz support.
Check your pavuk --version output for a mention of this feature
(’Content-Encoding’).
- -check_size/-nocheck_size
-
The option -nocheck_size should be used if you are trying to download pages from
an HTTP server which sends a wrong Content-Length: field in the MIME header
of the response. The default pavuk behavior is to check this field and complain
when something is wrong.
- -maxrate $nr
-
If you don’t want to give all your transfer bandwidth to pavuk, use this
option to set pavuk’s maximum transfer rate. This option accepts a floating
point number specifying the transfer rate in kB/s. If you want to find optimal
settings, you also have to experiment with the size of the read buffer (option
-bufsize ), because pavuk does flow control only at the application level.
By default, pavuk uses the full bandwidth.
- -minrate $nr
-
If you hate slow transfer rates, this option allows you to break transfers with
slow speed. You can set a minimum transfer rate, and if the connection gets
slower than the given rate, the transfer will be stopped. The minimum transfer
rate is given in kB/s. By default, pavuk doesn’t check this limit.
- -bufsize $nr
-
This option is used to specify the size of the read buffer (default size: 32kB).
If you have a very fast connection, you may increase the size of the buffer to get
a better read performance. If you need to decrease the transfer rate, you may need
to decrease the size of the buffer and set the maximum transfer rate with the
-maxrate option. This option accepts the size of the buffer in kB.
$nr specifies the size in kiloBytes, unless postfixed with one of the
characters K or M, which imply the corresponding (power-of-2) multipliers. That
means that the $nr value ’1K’ is 1 MegaByte, ’1M’ is
a whopping 1 GigaByte.
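The power-of-two multipliers can be checked with plain shell arithmetic. The base unit here is kB, as -bufsize uses:

```shell
# '1K' means 1024 kB; converting to bytes multiplies by 1024 again.
kb=1024
bytes=$((kb * 1024))
echo "$bytes"    # 1048576 bytes = 1 MegaByte
```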
- -fs_quota $nr
-
If you are running pavuk on a multiuser system, you may need to avoid filling up
your file system. This option lets you specify how much space must remain free.
If pavuk detects that the free space has dropped below this limit, it will stop
downloading files. Specify this quota in kB. The default value is 0, which means
this quota is not checked.
$nr specifies the size in kiloBytes, unless postfixed with one of the
characters K or M, which imply the corresponding (power-of-2) multipliers. That
means that the $nr value ’1K’ is 1 MegaByte, ’1M’ is
a whopping 1 GigaByte.
- -file_quota $nr
-
This option is useful when you want to limit the downloading of big files, but
still want to download at least $nr kilobytes of each big file. A big file
will be transferred, and when the transfer reaches the specified size, it will be
broken off. Such a document will be processed as if properly downloaded, so be
careful when using this option. By default, pavuk transfers documents in full.
$nr specifies the size in kiloBytes, unless postfixed with one of the
characters K or M, which imply the corresponding (power-of-2) multipliers. That
means that the $nr value ’1K’ is 1 MegaByte, ’1M’ is
a whopping 1 GigaByte.
- -trans_quota $nr
-
If you expect your selection to address a big amount of data, you can use this
option to limit the total amount of transferred data. The default is a transfer
unlimited by size.
$nr specifies the size in kiloBytes, unless postfixed with one of the
characters K or M, which imply the corresponding (power-of-2) multipliers. That
means that the $nr value ’1K’ is 1 MegaByte, ’1M’ is
a whopping 1 GigaByte.
- -max_time $nr
-
Set the maximum running time of the program. After this time is exceeded, pavuk
will stop downloading. The time is specified in minutes. The default value is 0,
which means the downloading time is not limited.
- -url_strategy $strategy
-
This option allows you to specify the downloading order for URLs in the document
tree. It accepts the following strings as parameters:
- level
-
orders URLs as they are loaded from HTML files (default)
- leveli
-
as the previous, but inline object URLs come first
- pre
-
inserts URLs from the current HTML document at the start, before others
- prei
-
as the previous, but inline object URLs come first
- -send_if_range/-nosend_if_range
-
Send the If-Range: header in HTTP requests. Some HTTP servers
(greetings, MS :-)) send different ETag: fields in different
responses for the same, unchanged document. This causes problems when pavuk
attempts to reget a document from such a server: pavuk remembers the old ETag
value and uses it in subsequent requests for this document. If the server compares
it with the new ETag value and they differ, it will refuse to send only part of
the document and start the download from scratch.
- -ssl_version $v
-
Set the required SSL protocol version for SSL communication. The default is
ssl23. This option is available only when compiled with SSL support.
- -unique_sslid/-nounique_sslid
-
This option can be used if you want to use a unique SSL session ID for all SSL
sessions. The default pavuk behavior is to negotiate a new session ID for each
connection. This option is available only when compiled with SSL support.
- -use_http11/-nouse_http11
-
This option switches between the HTTP/1.0 and HTTP/1.1 protocols used with
HTTP servers. Using HTTP/1.1 is recommended, because it is faster than HTTP/1.0
and uses less network bandwidth for initiating connections. Pavuk uses HTTP/1.1
by default.
- -local_ip $addr
-
Use this option when you want to use a specific network interface for
communication with other hosts. It is suitable for multihomed hosts with
several network interfaces. The address should be entered as a regular IP address
or as a host name.
- -identity $str
-
This option allows you to specify the content of the User-Agent: field of the
HTTP request. This is useful when scripts on a remote server return a different
document for the same URL depending on the browser, or when an HTTP server refuses
to serve documents to Web robots like pavuk. By default pavuk sends the string
pavuk/$VERSION in the User-Agent: field.
- -auto_referer/-noauto_referer
-
This option forces pavuk to send the HTTP Referer: header field with starting
URLs. The content of this field will be the URL itself. Using this option is
required when the remote server checks the Referer: field. By default pavuk
won’t send the Referer: field with starting URLs.
- -referer/-noreferer
-
This option enables or disables the transmission of the HTTP
Referer: header field. By default pavuk sends the Referer: field.
- -persistent/-nopersistent
-
This option enables or disables the use of persistent HTTP connections.
The default is to use persistent HTTP connections. Some servers have problems
with this type of connection, and this option allows getting data from such
servers as well.
- -httpad $str
-
In some cases you may want to add user defined fields to HTTP/HTTPS requests.
This option is exactly for that purpose. In $str you can directly
specify the content of an additional header. If you specify only the raw header,
it will be used only for starting requests. When you want this header sent with
every request while crawling, prefix the header with the + character.
To add multiple additional headers, you can repeatedly specify this command-line
option, once for each additional header.
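A hypothetical invocation sketch (placeholder URL and header values): one header for the starting request only, one (prefixed with +) for every request while crawling.

```shell
# The first header is sent only with starting requests; the '+' prefix on the
# second makes it accompany every request during the crawl.
pavuk -httpad "X-Started-By: cron" \
      -httpad "+Accept-Charset: utf-8" \
      http://www.example.com/
```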
- -page_sfx $list
-
Specify a collection of filename / web page extensions which are to be treated
as HTML pages, which is useful when scanning / hammering web sites which present
unusual mime types with their pages (see also: -hammer_mode ).
$list must contain a comma separated list of web page endings. The
default set is .html, .htm, .asp, .aspx, .php, .php3, .php4, .pl, .shtml
Note
When pavuk includes the chunky/hammer feature (see -hammer_mode ), any
web page which matches the endings specified in $list will be
registered in the hammering recording buffer and marked as a page starter
(’[STARTER]’): hammer time measurements are collected and reported on
a ’total page’ base (see -tlogfile ). This means that pavuk
assumes a user or web browser, which loads a page, will also load any style
sheets, scripts and images to properly display that page. All those items are
part of a ’total page’, but each page has only a single
’starting point’: the page itself.
To approximate ’total page’ timings instead of ’per
item’ timings, pavuk will mark the URLs which act as web page
’starting points’ as [STARTER]. Here pavuk assumes that each web page
is simple (i.e. does not use iframes, etc.), hence it is assumed that recognizing
the web page URL ending is sufficient.
Please note also that the ’endings’ in $list do not
have to be ’filename extensions’ per se: the ’endings’
are simply matched against the URL (with any ’?xxx=yyy’ query
elements removed) using a simple, case-insensitive comparison. Hence you may also
specify:
-page_sfx "index.html,index.htm"
when you only want any URLs which end with ’index.html’ or
’index.htm’ to be treated as ’page starters’ for timing
purposes.
- -del_after/-nodel_after
-
This option allows you to delete FILES from the REMOTE server when the download
has properly finished. By default this option is off.
- -FTPlist/-noFTPlist
-
When the option -FTPlist is used, pavuk will retrieve the content of FTP
directories with the FTP command LIST instead of NLST , so the same
listing will be retrieved as with the UNIX command "ls -l ".
This option is required if you need to preserve the permissions of remote files
or need to preserve symbolic links. Pavuk supports wide listings on FTP servers
with regular BSD or SYSV style "ls -l" directory listings, on FTP
servers with the EPLF listing format, VMS style listings,
DOS/Windows style listings and the Novell listing format. The default pavuk
behavior is to use NLST for FTP directory listings.
- -ftp_list_options $str
-
Some FTP servers require extra options to the LIST or NLST FTP commands to
show all files and directories properly. Be sure not to use any options
which could reformat the output of the listing. Especially useful is the -a
option, which forces the FTP server to also show dot files and directories; with
broken WuFTPd servers it also helps to produce full directory listings, not just
files.
- -fix_wuftpd / -nofix_wuftpd
-
This option is the result of several attempts to get the -remove_old
option working properly with the WuFTPd server when the -ftplist option is
used. The problem is that the FTP command LIST on WuFTPd doesn’t mind being
asked to list a non-existing directory, and indicates success in the FTP response
code. When you activate this option, pavuk uses an extra FTP command
( STAT -d dir ) to check whether the directory really exists. Don’t use
this option unless you are sure that you really need it!
- -ignore_chunk_bug / -noignore_chunk_bug
-
Ignore the IIS 5/6 RFC2616 chunked transfer mode server bug, which would
otherwise cause pavuk to fail and report downloads as ’possibly
truncated’. When pavuk reports this, you should specify this option and
retry the operation.
- -auth_file $file
-
File where you have stored authentication information for access to some
service. For the file structure, see the FILES section below.
- -auth_name $user
-
If you use this parameter, pavuk will transmit your authentication details
with each HTTP access when grabbing a document. For security reasons, use this
option only if you know that only one HTTP server will be accessed, or use the
-asite option to specify the sites for which you want to use authentication.
Otherwise your auth parameters will be sent to every accessed HTTP server.
- -auth_passwd $passwd
-
The value of this parameter is used as the password for authentication.
- -auth_scheme {1/2/3/4/user/Basic/Digest/NTLM}
-
This parameter specifies the authentication scheme to be used.
- 1, user
-
means user authentication scheme is used as defined in HTTP/1.0 or
HTTP/1.1. Password and user name are sent in plaintext format
(unencrypted).
- 2, Basic
-
means Basic authentication scheme is used as defined in HTTP/1.0.
Password and user name are sent BASE64 encoded.
This is the default setting.
- 3, Digest
-
means Digest access authentication scheme based on MD5 checksums as
defined in RFC2069.
- 4, NTLM
-
means NTLM proprietary access authentication scheme used by Microsoft
IIS or Proxy servers. When you use this scheme, you must also specify NT or LM
domain with option -auth_ntlm_domain .
This scheme is supported only when compiled with OpenSSL or libdes
libraries.
- -auth_ntlm_domain $str
-
NT or LM domain used for authorization against an HTTP server when the NTLM
authentication scheme is required.
This option is available only when compiled with OpenSSL or libdes
libraries.
- -auth_reuse_nonce/-noauth_reuse_nonce
-
When using the HTTP Digest access authentication scheme, reuse the first received
nonce value in subsequent requests. By default pavuk negotiates a nonce for each
request.
- -ssl_key_file $file
-
File with public key for SSL certificate (learn more from SSLeay or OpenSSL
documentation).
This option is available only when compiled with SSL support (you need
SSLeay or OpenSSL libraries and development headers).
- -ssl_cert_file $file
-
Certificate file in PEM format (learn more from SSLeay or OpenSSL
documentation).
This option is available only when compiled with SSL support (you need
SSLeay or OpenSSL libraries and development headers).
- -ssl_cert_passwd $str
-
Password used to generate certificate (learn more from SSLeay or OpenSSL
documentation).
This option is available only when compiled with SSL support (you need
SSLeay or OpenSSL libraries and development headers).
- -nss_cert_dir $dir
-
Configuration directory for NSS (Netscape SSL implementation) certificates.
Usually ~/.netscape (created by Netscape Communicator/Navigator) or a
profile directory below ~/.mozilla (created by the Mozilla browser). The
directory should contain the cert7.db and key3.db files.
If you use neither Mozilla nor Netscape, you must create these files with the
utilities distributed with the NSS libraries. Pavuk opens the certificate
database read-only.
This option is available only when pavuk is compiled with SSL support
provided by Netscape NSS SSL implementation.
- -nss_accept_unknown_cert/-nonss_accept_unknown_cert
-
By default pavuk rejects connections to an SSL server whose certificate is
not stored in the local certificate database (set by the -nss_cert_dir
option). You must explicitly force pavuk to allow connections to servers with
unknown certificates.
This option is available only when pavuk is compiled with SSL support
provided by Netscape NSS SSL implementation.
- -nss_domestic_policy/-nss_export_policy
-
Selects the set of ciphers allowed/disabled by USA export rules.
This option is available only when pavuk is compiled with SSL support
provided by Netscape NSS SSL implementation.
- -from $email
-
This parameter is used as the password when accessing an anonymous FTP server,
and is optionally inserted into the From: field of HTTP requests. If not
specified, pavuk derives it from the USER environment variable and the site
hostname.
- -send_from/-nosend_from
-
This option enables or disables sending the user identification entered with
the -from option as the FTP anonymous user password and in the From:
field of HTTP requests. By default this option is off.
- -ftp_login_handshake $host $handshake
-
When you need to use a nonstandard login procedure for some FTP servers,
you can use this option to change the default pavuk login procedure. To allow
more flexibility, you can assign the login procedure to a particular server or
to all servers. When $host is specified as an empty string ("" ),
the attached login procedure is assigned to all FTP servers except those having
their own login procedures assigned. In the $handshake parameter you
specify the exact login procedure as FTP commands followed by the expected FTP
response codes, delimited with backslash (\ ) characters.
For example, this is the default login procedure when logging in to a regular
FTP server without going through a proxy server:
USER %u\331\PASS %p\230
There are two commands followed by two response codes. After the USER command
pavuk expects FTP response code 331, and after the PASS command pavuk expects
response code 230 from the server. In the FTP commands you can use the following
macros, which will be replaced by their respective values:
- %u
-
user name used to access FTP server
- %p
-
password used to access FTP server
- %U
-
user name used to access FTP proxy server
- %P
-
password used to access FTP proxy server
- %h
-
hostname of FTP server
- %s
-
port number on which FTP server listens
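On the command line the handshake string must be quoted so the backslash delimiters reach pavuk intact. A hypothetical invocation (host and URL are placeholders; the handshake shown is the default one from above):

```shell
# Single quotes keep the '\' delimiters and '%' macros away from the shell.
pavuk -ftp_login_handshake ftp.example.com 'USER %u\331\PASS %p\230' \
      ftp://ftp.example.com/pub/
```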
- -asite $list
-
Specify comma separated list of allowed sites on which referenced documents are
stored. When this option is specified, pavuk will only follow links which point to
servers in this list.
The -dsite parameter is the opposite of this one. If both are used the
last occurrence of them is used and all previous occurrences are discarded.
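A hypothetical invocation sketch (server names and URL are placeholders): mirror a site while following links only to the two listed servers.

```shell
# Links pointing anywhere other than the two allowed sites are skipped.
pavuk -mode mirror -asite www.example.com,ftp.example.com http://www.example.com/
```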
- -dsite $list
-
Specify a comma separated list of disallowed sites.
The -asite parameter is the opposite of this one. If both are used the
last occurrence of them is used and all previous occurrences are discarded.
- -adomain $list
-
Specify a comma separated list of allowed domains on which referenced documents
are stored. When this option is specified, pavuk will only follow links which point
to domains in this list.
The -ddomain parameter is the opposite of this one. If both are used the
last occurrence of them is used and all previous occurrences are discarded.
- -ddomain $list
-
Specify a comma separated list of disallowed domains.
The -adomain parameter is the opposite of this one. If both are used the
last occurrence of them is used and all previous occurrences are discarded.
- -aport $list
-
In $list , specify a comma separated list of ports from which you allow
documents to be downloaded.
The -dport parameter is the opposite of this one. If both are used the
last occurrence of them is used and all previous occurrences are discarded.
- -dport $list
-
This option is used to specify denied ports. When this option is specified,
pavuk will not follow links which point to servers on these ports.
The -aport parameter is the opposite of this one. If both are used the
last occurrence of them is used and all previous occurrences are discarded.
- -amimet $list
-
List of comma separated allowed MIME types. You can also use wildcard patterns
with this option.
The -dmimet parameter is the opposite of this one. If both are used the
last occurrence of them is used and all previous occurrences are discarded.
- -dmimet $list
-
List of comma separated disallowed MIME types. You can also use wildcard
patterns with this option.
The -amimet parameter is the opposite of this one. If both are used the
last occurrence of them is used and all previous occurrences are discarded.
- -maxsize $nr
-
Maximum allowed size of document. This option is applied only when pavuk is able
to detect the document before starting the transfer. Default value is 0, and it
means this limit isn’t applied.
$nr specifies the size in bytes, unless postfixed with one of the
characters K, M or G, which imply the multipliers K(1024), M(1048576) or
G(1073741824).
- -minsize $nr
-
Minimum allowed size of document. This option is applied only when pavuk is able
to detect the document before starting the transfer. Default value is 0, and it
means this limit isn’t applied.
$nr specifies the size in bytes, unless postfixed with one of the
characters K, M or G, which imply the multipliers K(1024), M(1048576) or
G(1073741824).
- -newer_than $time
-
Allow only the transfer of documents with a modification time newer than that
specified in the parameter $time . The format of $time is:
YYYY.MM.DD.hh:mm . To apply this option, pavuk must be able to detect the
modification time of the document.
$time must be specified as local (a.k.a. ’wall clock’)
time.
- -older_than $time
-
Allow only the transfer of documents with a modification time older than that
specified in the parameter $time . The format of $time is:
YYYY.MM.DD.hh:mm . To apply this option, pavuk must be able to detect the
modification time of the document.
$time must be specified as local (a.k.a. ’wall clock’)
time.
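For instance (hypothetical URL), a run can be restricted to documents modified during 2004 using the time format above:

```shell
# Both bounds use the YYYY.MM.DD.hh:mm local wall-clock format.
pavuk -newer_than 2004.01.01.00:00 -older_than 2004.12.31.23:59 \
      http://www.example.com/archive/
```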
- -noCGI/-CGI
-
This switch prevents the transfer of dynamically generated parametric documents
served through a CGI interface. Such documents are detected by the occurrence of
a ? character inside the URL. The default pavuk behavior is to allow transfer of
URLs with query strings.
- -alang $list
-
This allows you to specify an ordered, comma separated list of preferred
natural languages. This option works only with the HTTP and HTTPS protocols,
using the Accept-Language: MIME field.
- -acharset $list
-
This option allows you to specify a comma separated list of preferred character
encodings for transferred documents. This works only with HTTP and HTTPS URLs,
and only if such document encodings are available on the destination server.
An example:
-acharset iso-8859-2,windows-1250,utf8
- -asfx $list
-
This parameter allows you to specify a set of comma separated suffixes used to
restrict the selection of documents which will be processed.
The -dsfx parameter is the opposite of this one. If both are used the
last occurrence of them is used and all previous occurrences are discarded.
- -dsfx $list
-
A set of comma separated suffixes that are used to specify which documents will
not be processed.
The -asfx parameter is the opposite of this one. If both are used the
last occurrence of them is used and all previous occurrences are discarded.
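A brief sketch of the two suffix filters in separate runs (URLs hypothetical); remember that when both appear, only the last occurrence takes effect:

```shell
# Run 1: process only HTML pages and common image types.
pavuk -asfx .html,.htm,.jpg,.png http://www.example.com/docs/

# Run 2: process everything except large archive formats.
pavuk -dsfx .zip,.tar.gz,.iso http://www.example.com/docs/
```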
- -aprefix $list / -dprefix $list
-
These two options allow you to specify a set of allowed or disallowed prefixes
of documents. They are mutually exclusive: when these options occur multiple
times in your configuration file and/or command line, the last occurrence will
be used and all previous ones discarded.
- -pattern $pattern
-
This option allows you to specify a wildcard pattern for documents. All
documents are tested against this pattern.
- -rpattern $reg_exp
-
This option is equivalent to the previous one, but uses regular expressions.
Available only on platforms which have a supported RE implementation.
- -skip_pattern $pattern
-
This option allows you to specify a wildcard pattern for documents that should
be skipped. All documents are tested against this pattern.
- -skip_rpattern $reg_exp
-
This option is equivalent to the previous one, but uses regular expressions.
Available only on platforms which have a supported RE implementation.
- -url_pattern $pattern
-
This option allows you to specify a wildcard pattern for URLs. All URLs are
tested against this pattern.
Example:
-url_pattern http://\*.idata.sk:\*/~ondrej/\*
this option enables all HTTP URLs in the idata.sk domain, on any port, which
are located under /~ondrej/ .
- -url_rpattern $reg_exp
-
This option is equivalent to the previous one, but uses regular expressions.
Available only on platforms which have a supported RE implementation.
- -skip_url_pattern $pattern
-
This option allows you to specify a wildcard pattern for URLs that should be
skipped. All URLs are tested against this pattern.
Example:
-skip_url_pattern ’*home*’
this option will force pavuk to skip all HTTP URLs which have ’home’
anywhere in their URL. This of course includes the query string part of the
URL, hence -skip_url_pattern ’*&action=edit*’ will direct pavuk to
skip any HTTP URLs whose query section has ’action=edit ’ as any
but the first query element (the first element would match
’*?action=edit* ’ instead).
- -skip_url_rpattern $reg_exp
-
This option is equivalent to the previous one, but uses regular expressions.
Available only on platforms which have a supported RE implementation.
- -aip_pattern $re
-
This option allows you to limit the set of transferred documents by server IP
address. IP addresses can be specified as regular expressions, so it is possible
to specify a set of IP addresses with one expression. Available only on
platforms which have a supported RE implementation.
- -dip_pattern $re
-
This option is similar to the previous one, but is used to specify a set of
disallowed IP addresses. Available only on platforms which have a supported RE
implementation.
- -tag_pattern $tag $attrib $url
-
A more powerful version of the -url_pattern option for more precise matching of
allowed URLs, based on an HTML tag name pattern, an HTML tag attribute name
pattern and a URL pattern. You can use wildcard patterns in all three parameters
of this option, thus something like -tag_pattern ’*’ ’*’ url_pattern
is equal to -url_pattern url_pattern . The $tag and $attrib
parameters are always matched against uppercase strings. For example, if you
want pavuk to follow only regular links, ignoring any style sheets, images,
etc., use the option -tag_pattern A HREF ’*’ .
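For instance, combining the example above into a full command line (start URL hypothetical):

```shell
# Follow only regular <A HREF=...> links; style sheets, images and
# other embedded objects are ignored. $tag/$attrib match uppercase.
pavuk -tag_pattern A HREF '*' http://www.example.com/
```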
- -tag_rpattern $tag $attrib $url
-
This is a variation on -tag_pattern . It uses regular expression patterns in
its parameters instead of the wildcard patterns used by the -tag_pattern
option.
- -noHTTP/-HTTP
-
This switch suppresses all transfers through the HTTP protocol. By default,
transfer through HTTP is enabled.
- -noSSL/-SSL
-
This switch suppresses all transfers through the HTTPS protocol (HTTP over
SSL). By default, transfer through HTTPS is enabled.
This option is available only when compiled with SSL support (you need the
SSLeay or OpenSSL libraries and development headers).
- -noGopher/-Gopher
-
Suppress all transfers through the Gopher Internet protocol. By default,
transfer through Gopher is enabled.
- -noFTP/-FTP
-
This switch prevents processing of documents located on FTP servers. By
default, transfer through FTP is enabled.
- -noFTPS/-FTPS
-
This switch prevents processing of documents located on FTP servers accessed
through SSL. By default, transfer through FTPS is enabled.
This option is available only when compiled with SSL support (you need the
SSLeay or OpenSSL libraries and development headers).
- -FTPhtml/-noFTPhtml
-
With the option -FTPhtml you can force pavuk to process HTML files downloaded
via the FTP protocol. By default pavuk won’t parse HTML files from FTP
servers.
- -FTPdir/-noFTPdir
-
Force recursive processing of FTP directories too. The default setting is to
deny recursive downloading from FTP servers, i.e. FTP directory trees will not be
traversed.
- -disable_html_tag $TAG,[$ATTRIB][;...] / -enable_html_tag
$TAG,[$ATTRIB][;...]
-
Enable or disable processing of particular HTML tags or attributes. By default
all supported HTML tags are enabled.
For example, if you don’t want to process any images, you should use the option
-disable_html_tag ’IMG,SRC;INPUT,SRC;BODY,BACKGROUND’ .
OTHER LIMITATION OPTIONS
- -subdir $dir
-
Sub-directory of the local tree directory, used to limit some of the modes
{sync, resumeregets, linkupdate} in their tree scan.
- -dont_leave_site/-leave_site
-
(Don’t) leave the starting site. By default pavuk may span hosts when recursing
through the WWW tree.
- -dont_leave_dir/-leave_dir
-
(Don’t) leave the starting directory. If the -dont_leave_dir option is used,
pavuk will stay only in the starting directory (including its own
sub-directories). By default pavuk may leave the starting directories.
- -leave_site_enter_dir/-dont_leave_site_enter_dir
-
If you are downloading a WWW tree which spans multiple hosts with huge trees,
you may want to allow downloading only of documents which are in the directory
hierarchy below the directory first visited on each site. To obtain this, use
the option -dont_leave_site_enter_dir . By default pavuk will also go to higher
directory levels on each site.
- -l $nr , -lmax $nr
-
Set the maximum allowed level of tree traversal. The default is 0, which means
that pavuk can traverse ad infinitum. As of version 0.8pl1, inline objects of
HTML pages are placed at the same level as the parent HTML page.
- -leave_level $nr
-
Maximum level of documents outside the site of the starting URL. The default is
0, which means this check is not applied.
- -site_level $nr
-
Maximum level of sites outside the site of the starting URL. The default is 0,
which means this check is not applied.
- -dmax $nr
-
Set the maximum allowed number of documents to process. The default value is 0,
which means no restriction on the number of processed documents.
- -singlepage/-nosinglepage
-
Using the option -singlepage allows you to transfer just an HTML page together
with all its inline objects (pictures, sounds, frame documents, ...). By default
single page transfer is disabled.
Note
This option renders the -mode singlepage option obsolete.
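A minimal sketch of single page mode (URL hypothetical):

```shell
# Fetch one page plus its inline pictures, frames, style sheets, ...
pavuk -singlepage http://www.example.com/index.html
```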
- -limit_inlines/-dont_limit_inlines
-
With this option you can control whether limiting options also apply to inline
objects (pictures, sounds, ...). This is useful when you want to download a
specified set of HTML pages with all their inline objects, without any
restrictions.
- -user_condition $str
-
Script or program name for the user’s own conditions. You can write any script
which decides, via its exit value, whether to download a URL or not. The script
receives from pavuk any number of options, with the following meanings:
- -url $url
-
processed URL
- -parent $url
-
any number of parent URLs
- -level $nr
-
level of this URL from starting URL
- -size $nr
-
size of requested URL
- -date $datenr
-
modification time of requested URL in format YYYYMMDDhhmmss
An exit status of 0 from the script or program means that the current URL
should be rejected; a nonzero exit status means that the URL should be
accepted.
Warning
Use user conditions only if required, because forking the script for each
checked URL causes a big slowdown.
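The calling convention above can be sketched as a small POSIX shell script; the level/size limits used here are illustrative assumptions, and a real script would be saved to a file whose path is passed via -user_condition:

```shell
#!/bin/sh
# Hypothetical -user_condition helper. pavuk invokes it with options such
# as -url, -parent, -level, -size and -date; exit status 0 rejects the
# URL, any non-zero status accepts it.
decide() {
    level=0 size=0
    while [ $# -gt 0 ]; do
        case "$1" in
            -level) level=$2; shift 2 ;;
            -size)  size=$2;  shift 2 ;;
            *)      shift ;;            # ignore -url, -parent, -date, ...
        esac
    done
    # Accept only URLs at most 3 levels deep and smaller than 1 MiB.
    if [ "$level" -le 3 ] && [ "$size" -lt 1048576 ]; then
        return 1                        # non-zero: accept
    fi
    return 0                            # zero: reject
}
# A real script would end with:  decide "$@"; exit $?
```

As the warning notes, pavuk forks this script once per checked URL, so keep it cheap.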
- -follow_cmd $str
-
This option allows you to specify a script or program which decides, via its
exit status, whether to follow URLs from the current HTML document. This script
will be called after the download of each HTML document. The script will get the
following options as its parameters:
- -url $url
-
URL of current HTML document
- -infile $file
-
local file where the HTML document is stored
An exit status of 0 from the script or program means that URLs from the current
document will be disallowed; any other exit status means that pavuk may follow
links from the current HTML document.
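Such a hook can be sketched in POSIX shell; the ’Index of’ heuristic is purely an illustrative assumption, and the script path would be passed via -follow_cmd:

```shell
#!/bin/sh
# Hypothetical -follow_cmd helper, called as: script -url $url -infile $file
# Exit status 0 forbids following links from this page; non-zero allows it.
should_follow() {
    infile=
    while [ $# -gt 0 ]; do
        case "$1" in
            -infile) infile=$2; shift 2 ;;
            *)       shift ;;           # ignore -url and its value
        esac
    done
    # Follow links only from pages that look like directory listings.
    if grep -qi 'index of' "$infile" 2>/dev/null; then
        return 1                        # non-zero: follow links
    fi
    return 0                            # zero: don't follow
}
# A real script would end with:  should_follow "$@"; exit $?
```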
Support for scripting languages like JavaScript or VBScript in pavuk is done in
a somewhat hacky way. There is no interpreter for these languages, so not
everything will work. All the support pavuk has for these scripting languages is
based on regular expression patterns specified by the user. Pavuk searches for
these patterns in DOM event attributes of HTML tags, in javascript:... URLs, in
inline scripts in HTML documents enclosed between <script></script> tags and in
separate JavaScript files. Support for scripting languages is available only
when pavuk is compiled with a proper regular expression library
(POSIX/GNU/PCRE/TRE).
- -enable_js/-disable_js
-
These options are used to enable or disable processing of the JavaScript parts
of HTML documents. You must enable this option to be able to use processing of
JavaScript patterns.
- -js_pattern $re
-
With this option you specify which patterns match the interesting parts of
JavaScript for extracting URLs. The parameter must be an RE pattern with exactly
one subpattern which matches the URL part precisely. For example, to match the
URL in the following type of JavaScript expression:
document.b1.src=’pics/button1_pre.jpg’
you can use this pattern
^document.[a-zA-Z0-9_]*.src[ \t]*=[ \t]*’(.*)’$
- -js_transform $p $t $h $a
-
This option is similar to the previous one, but you can use custom transform
rules for the URL parts of patterns and also specify the exact HTML tag and
attribute where to look for this pattern. $p is the pattern to match the
relevant part of the script. $t is a transform rule for the URL; in this
parameter the $x parts will be replaced by the x -th subpattern of the
$p pattern. The $h parameter is either the exact HTML tag, or "*" when the
rule applies to the JavaScript body of an HTML document, separate JavaScript
files, javascript: URLs or DOM event attributes, or "" (empty string) when it
applies to the JavaScript body of an HTML document or a separate JavaScript
file. The $a parameter is either the exact HTML attribute of the tag, or ""
(empty string) when the rule applies to the JavaScript body.
- -js_transform2 $p $t $h $a
-
This option is very similar to the previous one. The meaning of all parameters
is the same, except that the pattern $p may have only one substring, which
will be used in the transform rule $t . This is required to allow rewriting of
the URL parts of tags and scripts. This option can also be used to force pavuk
to recognize HTML tag/attribute pairs which pavuk does not support.
Use this option instead of -js_transform when you want to make sure pavuk
’rewrites’ the transformed URL in the content grabbed from a site and
stored on your local disc.
In other words: -js_transform is good enough when you only want to direct
pavuk to grab a specific URL which is not literally available in the content
already downloaded, while -js_transform2 does just that little bit more: it
also makes sure this newly created URL ends up in the content saved to disc, by
replacing the text matched by the first sub-expression.
Note
Make sure that the first sub-expression always matches some content,
because otherwise pavuk will display a warning and not rewrite the content, as it
could not detect where you wanted the replacement URL to go.
Note
Additional caveat: when your pavuk binary was built using an RE library which
does not support sub-expressions, pavuk will report an error and abort when any
of the -js_pattern , -js_transform or -js_transform2 command-line options is
specified.
- -cookie_file $file
-
File where cookie info is stored. This file must be in the Netscape cookie file
format (as generated by Netscape Navigator or Communicator ...).
- -cookie_send/-nocookie_send
-
Use collected cookies in HTTP/HTTPS requests. By default pavuk will not send
cookies.
- -cookie_recv/-nocookie_recv
-
Store cookies received in HTTP/HTTPS responses into the in-memory cookie cache.
By default pavuk will not remember received cookies.
- -cookie_update/-nocookie_update
-
Update the cookie file on disk and synchronize it with changes made by any
concurrent processes. By default pavuk will not update the cookie file on disk.
- -cookies_max $nr
-
Maximum number of cookies in the in-memory cookie cache. The default value is
0, which means no restriction on the number of cookies.
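Putting the cookie options together (the file path is a hypothetical example):

```shell
# Load a Netscape-format cookie file, send cookies with requests, cache
# up to 100 received cookies, and write changes back to the file.
pavuk -cookie_file ~/.pavuk_cookies -cookie_send -cookie_recv \
      -cookie_update -cookies_max 100 http://www.example.com/
```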
- -disabled_cookie_domains $list
-
Comma-separated list of cookie domains to which pavuk is not permitted to send
cookies stored in the cookie cache.
- -cookie_check/-nocookie_check
-
When receiving a cookie, check whether the cookie domain is equal to the domain
of the server which sends the cookie. By default pavuk checks whether the server
is setting cookies for its own domain; if it tries to set a cookie for a foreign
domain, pavuk will complain and reject such a cookie.
- -noRelocate/-Relocate
-
This switch prevents the program from rewriting relative URLs to absolute URLs
after the HTML document has been transferred. The default pavuk behavior is to
maintain link consistency of HTML documents: whenever an HTML document is
downloaded, pavuk rewrites all its URLs to point to the local document when that
is available, and to the remote document otherwise. After a document has been
properly downloaded, pavuk will update all the links in any HTML documents which
point at it.
- -all_to_local/-noall_to_local
-
This option forces pavuk to change all URLs inside an HTML document to local
URLs immediately after the document is downloaded. By default this option is
disabled.
- -sel_to_local/-nosel_to_local
-
This option forces pavuk to change all URLs inside an HTML document which
fulfill the conditions for download to local URLs immediately after the document
is downloaded. I recommend using this option when you are sure that the transfer
will proceed without any problems. This option can save a lot of processor time.
By default this option is disabled.
- -all_to_remote/-noall_to_remote
-
This option forces pavuk to change all URLs inside an HTML document to remote
URLs immediately after the document is downloaded. By default this option is
disabled.
- -post_update/-nopost_update
-
This option is especially designed to allow -fnrules rules based on the MIME
type of a document. It forces pavuk to generate local names for documents only
once pavuk knows the MIME type of the document. This has a big impact on the
engine that rewrites links inside HTML documents, and it breaks other options
for controlling the link rewriting engine. Use this option only when you know
what you are doing :-)
- -dont_touch_url_pattern $pat
-
This option serves to deny rewriting and processing of particular URLs in HTML
documents by the pavuk HTML rewriting engine. It accepts wildcard patterns to
specify such URLs. Matching is done against the untouched URL, so when the URL
is relative you must use a pattern which matches the relative URL, and when it
is absolute you must use an absolute pattern.
- -dont_touch_url_rpattern $pat
-
This option is a variation on the previous one. It uses regular expression
patterns for matching URLs instead of the wildcard patterns used by the
-dont_touch_url_pattern option. It is available only when pavuk is compiled
with support for regular expression patterns.
- -dont_touch_tag_rpattern $pat
-
This option is a variation on the previous one, but matching is done on the
full HTML tag, including the enclosing <>. It accepts regular expression
patterns and is available only when pavuk is compiled with support for regular
expression patterns.
- -tr_del_chr $str
-
All characters found in $str will be deleted from the local name of a
document. $str may contain escape sequences similar to those of the UNIX
tr (1) command:
- \n
-
newline (ASCII LF: 10(dec))
- \r
-
carriage return (ASCII CR: 13(dec))
- \t
-
horizontal tab space (ASCII TAB: 9(dec))
- \0xXX
-
hexadecimal ASCII value (1-byte range, but you can never specify ASCII NUL
(0(dec)), i.e. XX can be in the range ’01’ to
’FF’)
- [:upper:]
-
all uppercase letters (ASCII ’A’..’Z’)
- [:lower:]
-
all lowercase letters (ASCII ’a’..’z’)
- [:alpha:]
-
all letters (ASCII ’A’..’Z’ +
’a’..’z’)
- [:alnum:]
-
all letters and digits (ASCII ’A’..’Z’ +
’a’..’z’ + ’0’..’9’)
- [:digit:]
-
all digits (ASCII ’0’..’9’)
- [:xdigit:]
-
all hexadecimal digits (ASCII ’0’..’9’ +
’A’..’F’ + ’a’..’f’)
- [:space:]
-
all horizontal and vertical white-space (ASCII SPACE(’ ’,
32(dec)), TAB(9(dec)), LF(10(dec)), VT(11(dec)), FF(12(dec)), CR(13(dec)))
- [:blank:]
-
all horizontal white-space (ASCII SPACE(’ ’, 32(dec)),
TAB(9(dec)))
- [:cntrl:]
-
all control characters (ASCII 1(dec)..31(dec) + 127(dec))
- [:print:]
-
all printable characters including space (ASCII 32(dec)..126(dec))
- [:nprint:]
-
all non printable characters (ASCII 1(dec)..31(dec) +
127(dec)..255(dec))
- [:punct:]
-
all punctuation characters (ASCII 33(dec)..47(dec) + 58(dec)..64(dec) +
91(dec)..96(dec) + 123(dec)..126(dec)), in other words these characters:
! (Exclamation mark),
" (Quotation mark; " in HTML),
$ (Dollar sign),
% (Percent sign),
& (Ampersand),
’ (Closing single quote a.k.a. apostrophe),
( (Opening parentheses),
) (Closing parentheses),
* (Asterisk a.k.a. star, multiply),
+ (Plus),
, (Comma),
- (Hyphen, dash, minus),
. (Period),
/ (Slant a.k.a. forward slash, divide),
: (Colon),
; (Semicolon),
< (Less than sign; &lt; in HTML),
= (Equals sign),
> (Greater than sign; &gt; in HTML),
? (Question mark),
@ (At-sign),
[ (Opening square bracket),
\ (Reverse slant a.k.a. Backslash),
] (Closing square bracket),
^ (Caret a.k.a. Circumflex),
_ (Underscore),
‘ (Opening single quote),
{ (Opening curly brace),
| (Vertical line),
} (Closing curly brace),
~ (Tilde a.k.a. approximate))
- [:graph:]
-
all printable characters excluding space (ASCII 33(dec)..126(dec))
- -X
-
a range: expands to a character series starting with the last expanded
character (or ASCII(1(dec)) when the ’-’ minus character is
positioned at the start of this string/specification) and ending with the
character specified by X , where X may also be a
’\’-escaped character, e.g. ’\n’ or ’\x7E’.
Hence you can specify ranges like ’\x20-\x39’ and get what
you’d expect.
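Since these classes and ranges mirror the UNIX tr (1) command, plain tr can be used to preview what a deletion set will do to a name; for example, the effect of -tr_del_chr ’[:blank:]’ on a local name:

```shell
# Delete all horizontal white-space, as -tr_del_chr '[:blank:]' would
# do to the generated local document name.
echo 'my page name.html' | tr -d '[:blank:]'
# prints: mypagename.html
```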
- -tr_str_str $str1 $str2
-
The string $str1 in the local name of a document will be replaced with
$str2 .
- -tr_chr_chr $chrset1 $chrset2
-
Characters from $chrset1 in the local name of a document will be replaced
with the corresponding character from $chrset2 . $chrset1 and $chrset2
have the same syntax as $str in the -tr_del_chr option: both will be
expanded to a character set using the rules described above. The characters in
the expanded sets $chrset1 and $chrset2 have a 1:1 relationship, e.g. the
second character in $chrset1 will be replaced by the second character in
$chrset2 .
Caution
If the set $chrset2 is smaller than the set $chrset1 , any characters in
the set $chrset1 at positions at or beyond the size of the set $chrset2
will be replaced by the last character in the set $chrset2 .
For example, -tr_chr_chr ’abcd’ ’AB’ applied to the name
’abcde’ will produce the result ’ABBBe’ , as
’c’ and ’d’ in $chrset1 are beyond the range of
$chrset2 , hence these are replaced by the last character in
$chrset2 : ’B’. With the above example this may seem rather
obvious, but be reminded that elements like ’[:punct:]’ are
deterministic (as they do not depend on your ’locale’), but they
can still be hard to use, as you must determine which and how many characters
they will produce upon expansion. See the description for -tr_del_chr above
for additional info to help you with this.
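The padding behavior in the caution matches the UNIX tr (1) command, so it can be previewed directly with tr using the same sets:

```shell
# tr also pads the shorter target set with its last character, mirroring
# -tr_chr_chr 'abcd' 'AB' applied to the name 'abcde'.
echo 'abcde' | tr 'abcd' 'AB'
# prints: ABBBe
```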
- -store_name $str
-
Define the local filename to use for the very first file downloaded. This
option is most useful when running pavuk in ’singlepage’ mode, but
it works for any mode.
- -index_name $str
-
With this option you can change the directory index name. By default the
filename _._.html is used, which is assumed to be a filename not usually
occurring on web/ftp/... sites.
- -store_index/-nostore_index
-
With the option -nostore_index you deny storing of directory indexes into
HTML files (which are named according to the -index_name setting). The
default is to store all directory URLs as HTML index files (i.e.
-store_index ).
- -fnrules $t $m $r
-
This is a very powerful option! This option is used to flexibly change the
layout of the local document tree. It accepts three parameters.
-
The first parameter $t is used to say what type the following
pattern is:
- F
-
is used for a wildcard pattern (uses fnmatch (3) ), while
- R
-
is used for a regular expression pattern (using any supported RE
implementation).
-
The second parameter is the matching pattern used to select URLs for
this rule. If a URL matches this pattern, then the local name for this URL
is computed using the rule specified in the third parameter.
-
And the third parameter is the local name building rule. Pavuk now
supports two kinds of local name building rules. One is based only on
simple rule macros; the other is a more complicated, extended rule,
which also enables you to perform several functions in a LISP-like
micro language.
Pavuk differentiates between these two kinds of rules by looking at the
first character of the rule. When the first character is a
’(’ open bracket character, the rule is assumed to be of
the extended sort, while in all other cases it is assumed to be a simple
rule.
A simple rule should contain a mix of literals and escaped macros.
Macros are escaped by the % character or the $ character.
Note
if you want to place a literal % or $ character in the
generated string, you can escape that character with a \ backslash
character prefix, so pavuk will not recognize it as a macro escape character
here.
Note
-fnrules always performs additional cleanup for file paths produced
by both matching simple and extended rules: multiple consecutive occurrences
of / slashes in the path are replaced by a single / slash,
while any directory and/or file names which end with a . dot have that
dot removed.
Note
-fnrules are processed in the order they occurred on the command
line. If a rule matches the current URL, this rule will be applied. Any
subsequent rules will be skipped. This allows you to specify multiple
-fnrules on the command line. By ordering them from specific to
generic, you can apply different rules to subsets of the URL collection (e.g.
you’re putting the -fnrules F ’*’
’%some%macros%’ statement last).
Note
When an -fnrules statement matches the current URL, any specified
-base_level path processing will not be applied to the -fnrules
generated path.
Here is list of recognized macros:
- $x
-
where x is any positive number. This macro is replaced with
x -th substring matched by the RE pattern, which was specified in
the second -fnrules argument $m . (If you use this you need
to understand RE sub-matches!)
- %i
-
is replaced with protocol id string:
(http,https,ftp,ftps,file,gopher)
- %p
-
is replaced with password. (use this only where applicable)
- %u
-
is replaced with user name. (use this only where applicable)
- %h
-
is replaced with the fully qualified host name.
- %m
-
is replaced with the fully qualified domain name.
- %r
-
is replaced with port number.
- %d
-
is replaced with path to document.
- %n
-
is replaced with document name (including the extension).
- %b
-
is replaced with base name of document (without the extension).
- %e
-
is replaced with the URL filename extension.
- %s
-
is replaced with the URL searchstring.
- %M
-
is replaced with the full MIME type of document as transmitted in the
MIME header. For example:
text/html; charset=utf-8
As of v0.9.36, you do not need to specify the -post_update option
to make this option work.
- %B
-
is replaced with basic MIME type of the document, i.e. the MIME type
without any attributes. For example:
text/html
- %A
-
is replaced with MIME type attributes of the document, i.e. all the
stuff following the initial ’;’ semicolon as specified in the
MIME type header which was sent to us by the server. For example:
charset=utf-8
- %E
-
is replaced with default extension assigned to the MIME type of the
document.
As of v0.9.36, you do not need to specify the -post_update option
to make this option work.
You may want to specify the additional command line option
-mime_type_file $file to override the rather limited set of
built-in MIME types and default file extensions.
- %X
-
is replaced with the default extension assigned to the MIME type of the
document, if one exists. Otherwise, the existing file extension is used
instead.
You may want to specify the additional command line option
-mime_type_file $file to override the rather limited set of
built-in MIME types and default file extensions.
- %Y
-
is replaced with file extension if one is available. Otherwise, the
default extension assigned to the MIME type of the document is used
instead.
You may want to specify the additional command line option
-mime_type_file $file to override the rather limited set of
built-in MIME types and default file extensions.
- %x
-
where x is a positive decimal number. This macro is replaced with
the x -th directory from the path of the document, starting with 1
for the initial sub-directory.
- %-x
-
where x is a positive decimal number. This macro is replaced with
the x -th directory from the path of the document, counting down
from end. The value 1 indicates the last sub-directory in the path.
- %o
-
default localname for URL
Here is an example. If you want to place documents into one directory per
extension, you should use the following -fnrules option:
-fnrules F ’*’ ’/%e/%n’
Extended rules always begin with a ’(’ character.
These rules use a syntax much like LISP syntax.
Here are the basic rules for writing extended rules:
-
the complete rule statement must return the local filename as a string
return value
-
each function/operation is enclosed inside round braces ()
-
the first token right after the opening brace is the function
name/operator
-
each function has a nonzero fixed number of parameters
-
each function returns a numeric or string value
-
function parameters are separated by one or more space characters
-
any parameter of a function should be a string, number, macro or another
function
-
a literal string parameter must always be quoted using " double
quotes. When you need to include a " double quote as part of the
literal string itself, escape it by prefixing it with a \ backslash
character.
-
a literal numeric parameter can be presented in any encoding supported
by the strtol (3) function (octal, decimal,
hexadecimal, ...)
-
there is no implicit conversion from number to string
-
each macro is prefixed by % character and is one character
long
-
each macro is replaced by its string representation from current URL
-
function parameters are typed strictly
-
top level function must return string value
Extended rules support the full set of % escaped macros supported by
simple rules, plus one additional macro:
- %U
-
URL string
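As a small illustrative sketch of an extended rule (the URL is hypothetical, and the rule uses only the ts function and %o macro described on this page), mapping every URL to its default local name with spaces replaced by underscores:

```shell
# The leading '(' makes pavuk treat the rule as an extended rule;
# ts replaces one string with another inside its first argument.
pavuk -fnrules F '*' '(ts %o " " "_")' http://www.example.com/
```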
Here is a description of all supported functions/operators:
- sc
-
- ss
-
-
substring from string
-
accepts three parameters.
-
first is string from which we want to cut a sub-part
-
second is number which represents starting position in string
-
third is number which represents ending position in string
-
returns string value
- hsh
-
-
compute modulo hash value from string with specified base
-
accepts two parameters
-
first is string for which we are computing the hash value
-
second is numeric value for base of modulo hash
-
returns numeric value
- md5
-
- lo
-
- up
-
- ue
-
-
encode unsafe characters in string with same encoding which is used
for encoding unsafe characters inside URL ( %xx ). By
default all non-ASCII values are encoded when this function is
used.
-
accepts two string values
-
first is string which we want to encode
-
second is string which contains unsafe characters
-
return string value
- ud
-
- dc
-
-
delete unwanted characters from string (has similar functionality as
-tr_del_chr option)
-
accepts two string values
-
first is string from which we want delete
-
second is string which contains characters we want to delete.
-
returns string value
- tc
-
-
replace character with other character in string (has similar
functionality as -tr_chr_chr option)
-
accepts three string values
-
first is string inside which we want to replace characters
-
second is set of characters which we want to replace
-
third is set of characters with which we want to replace those
with
-
returns string value
- ts
-
-
replace some string inside string with any other string (has similar
functionality as -tr_str_str option)
-
accepts three string values
-
first is string inside which we want to replace string
-
second is the from string
-
third is to string
-
returns string value
- spn
-
-
calculate initial length of string which contains only specified set
of characters. (has same functionality as strspn (3) libc function)
-
accepts two string values
-
first is input string
-
second is set of acceptable characters
-
returns numeric value
- cspn
-
-
calculate initial length of string which doesn’t contain
specified set of characters. (has same functionality as strcspn (3) libc function)
-
accepts two string values
-
first is input string
-
second is set of unacceptable characters
-
returns numeric value
- sl
-
- ns
-
-
convert number to string by format
-
accepts two parameters
-
first parameter is format string same as for printf (3) function
-
second is number which we want to convert
-
returns string value
- sn
-
-
convert string to number by radix
-
accepts two parameters
-
first parameter is string which we want to convert using the
strtol (3) function
-
second is radix number to use for conversion; specify radix
’0’ zero if the strtol (3)
function should auto-discover the radix used
-
returns numeric value
- lc
-
-
return position of last occurrence of specified character inside
string
-
accepts two string parameters
-
first string which we are searching in
-
second string contains character for which we are looking (only the
first character of the string is used)
-
returns numeric value; 0 if character could not be found
- +
-
- -
-
- %
-
-
calculate modulo remainder
-
accepts two numeric values
-
returns numeric value; returns 0 if the divisor is 0
- *
-
- /
-
-
divide two numeric values
-
accepts two numeric values
-
returns numeric value; 0 if division by zero
- rmpar
-
-
remove parameter from query string
-
accepts two strings
-
first parameter is the string which we are adjusting
-
second parameter is the name of parameter which should be
removed
-
returns adjusted string
- getval
-
-
get query string parameter value
-
accepts two strings
-
first parameter is query string from which to get the parameter
value (usually %s )
-
second string is name of parameter for which we want to get the
value
-
returns the value of the parameter, or an empty string when the parameter
doesn’t exist
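What getval computes for a query string can be sketched with Python's urllib (the URL and parameter names here are hypothetical; pavuk's own parsing may differ in edge cases):

```python
# Illustrative only: extracting a query string parameter value, as the
# getval fnrules function does. Missing parameters yield an empty string.
from urllib.parse import urlparse, parse_qs

url = "http://www.example.com/find?q=pavuk&page=2"  # hypothetical URL
params = parse_qs(urlparse(url).query)
print(params.get("q", [""])[0])        # pavuk
print(params.get("missing", [""])[0])  # empty string when absent
```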
- sif
-
-
logical decision
-
accepts three parameters
-
first is numeric and when its value is nonzero, the result of this
decision is the result of the second parameter, otherwise it is the
result of the third parameter
-
second parameter is string (returned when condition is
nonzero/true)
-
third parameter is string (returned when condition is
zero/false)
-
returns string result of decision
- !
-
- &
-
- |
-
- getext
-
- seq
-
- fnseq
-
-
compare a wildcard pattern and a string (has the same functionality
as the fnmatch (3) libc function)
-
accepts two strings for comparison
-
first string is a wildcard pattern
-
second string is the data which should match the pattern
-
returns
- numeric value 0
-
if different
- numeric value 1
-
if equal
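The fnseq comparison can be reproduced with Python's fnmatch module, which implements the same wildcard matching as fnmatch(3) (the filenames are made up):

```python
# Illustrative only: fnseq-style wildcard matching via fnmatch(3) semantics;
# 1 means the string matches the pattern, 0 means it does not.
from fnmatch import fnmatch

print(1 if fnmatch("picture.jpg", "*.jpg") else 0)  # 1 (equal)
print(1 if fnmatch("picture.png", "*.jpg") else 0)  # 0 (different)
```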
- sp
-
-
return URL sub-part from the matching -fnrules
’R’ regex
-
accepts one number, which references the corresponding
sub-expression in the -fnrules ’R’ regex
-
returns the URL substring which matched the specified
sub-expression
-
This function is available only when pavuk is compiled with regex
support, including sub-expressions (POSIX/PCRE/TRE/...).
- jsf
-
-
Execute JavaScript function
-
Accepts one string parameter which holds the name of a JavaScript
function defined in the script loaded with the -js_script_file option.
-
Returns a string value equal to the return value of the JavaScript
function. See the -js_script_file command line option for further
details.
-
This function is available only when pavuk is compiled with support
for JavaScript bindings.
For example, if you are mirroring a very large number of Internet sites
into the same local directory, too many entries in one directory will cause
performance problems. You can use, for example, the hsh or md5
functions to generate one additional level of hash directories based on the
hostname with one of the following options:
-fnrules F ’*’ ’(sc (nc "%02d/" (hsh %h 100)) %o)’
-fnrules F ’*’ ’(sc (ss (md5 %h) 0 2) %o)’
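The second variant can be sketched in Python: the first two hex digits of the MD5 digest of the hostname become an extra hash directory level. (The hsh variant uses pavuk's internal hash function, which is not reproduced here; the hostname is just an example.)

```python
# Illustrative only: what (ss (md5 %h) 0 2) computes for a hostname.
import hashlib

host = "www.idata.sk"  # stands in for the %h macro
hash_dir = hashlib.md5(host.encode()).hexdigest()[:2]
print(hash_dir + "/" + host + "/...")
```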
- -base_level $nr
-
Number of directory levels to omit in the local tree.
For example, when downloading the URL
ftp://ftp.idata.sk/pub/unix/www/pavuk-0.7pl1.tgz with -base_level 4 on
the command line, the local tree will contain
www/pavuk-0.7pl1.tgz instead of the usual
ftp/ftp.idata.sk_21/pub/unix/www/pavuk-0.7pl1.tgz.
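The effect of -base_level on the stored path amounts to dropping the leading directory components, as a small Python sketch shows (using the example above):

```python
# Illustrative only: -base_level N omits the first N directory levels
# from the local path pavuk would otherwise create.
def apply_base_level(local_path, base_level):
    return "/".join(local_path.split("/")[base_level:])

print(apply_base_level(
    "ftp/ftp.idata.sk_21/pub/unix/www/pavuk-0.7pl1.tgz", 4))
# www/pavuk-0.7pl1.tgz
```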
- -default_prefix $str
-
Default prefix of the mirrored directory. This option is used only when you
are trying to synchronize the content of a remote directory which was
downloaded using the -base_level option. You must also use the directory
based synchronization method, not the URL based synchronization method. This
is especially useful when used in conjunction with the -remove_old option.
- -remove_adv/-noremove_adv
-
This option turns on/off the removal of HTML tags which contain
advertisement banners. The banners are not removed from the HTML file, but
are commented out. Such URLs will also not be downloaded. This option has an
effect only when used with the -adv_re option. Default is turned off. This
option is available only when your system has support for one of the
supported regular expression implementations.
- -adv_re $RE
-
This option is used to specify regular expressions for matching URLs of
advertisement banners. For example:
-adv_re http://ad.doubleclick.net/.*
matches all files from the server ad.doubleclick.net. This option is
available only when your system has any supported regular expression
implementation.
- -unique_name/-nounique_name
-
Pavuk by default always attempts to assign a unique local filename to each
unique URL. If this behavior is not wanted, you can use the
-nounique_name option to disable it.
- -hammer_mode $nr
-
define the hammer mode:
-
0 = old fashioned: keep on running until all URLs have been accessed
-hammer_repeat times.
-
1 = record activity on first run; burst transmit recorded activity
-hammer_repeat times. This is an extremely fast mode suitable for
load testing medium and large servers (assuming you are running pavuk on
similar hardware).
- -hammer_threads $nr
-
define the number of threads to use for the replay hammer attack (hammer
mode 1)
- -hammer_flags $nr
-
define hammer mode flags: see the man page for more info
- -hammer_ease $nr
-
delay for network communications (msec). 0 == no delay, default = 0.
$nr specifies the delay in milliseconds, unless postfixed with
one of the characters S, M, H or D (either in upper or lower case), which
imply the alternative time units S = seconds, M = minutes, H = hours or D =
days.
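Interpreting such an argument (plain milliseconds, or an S/M/H/D postfix in either case) can be sketched in Python (the parser below is an illustration of the described format, not pavuk's actual code):

```python
# Illustrative only: parse a delay/timeout argument as described above.
UNIT_MS = {"s": 1000, "m": 60 * 1000, "h": 3600 * 1000, "d": 86400 * 1000}

def parse_interval_ms(arg):
    # Trailing S/M/H/D (any case) selects a larger unit; otherwise msec.
    if arg and arg[-1].lower() in UNIT_MS:
        return int(arg[:-1]) * UNIT_MS[arg[-1].lower()]
    return int(arg)

print(parse_interval_ms("1500"))  # 1500 msec
print(parse_interval_ms("2S"))    # 2000 msec
print(parse_interval_ms("3m"))    # 180000 msec
```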
- -hammer_rtimeout $nr
-
timeout for network communications (msec). 0 == no timeout, default =
0.
$nr specifies the timeout in milliseconds, unless postfixed with
one of the characters S, M, H or D (either in upper or lower case), which
imply the alternative time units S = seconds, M = minutes, H = hours or D =
days.
- -hammer_repeat $nr
-
number of times the requests should be executed again (load test by
hammering the same stuff over and over).
- -log_hammering / -nolog_hammering
-
log all activity during a ’hammer’ run.
Note
Note: only applies to hammer_modes >= 1, as hammer_mode == 0 is
simply a re-execution of all the requests, using the regular code and
processing by pavuk and as such the regular pavuk logging is used for
that mode.
- -hammer_recdump {$nr | @[@]$filepath }
-
number of the file descriptor to which recorded activity is written.
Note
pavuk 0.9.36 and later releases also support the @$file
argument, where you can specify a file to dump the data to. The file path
must be prefixed by an ’@’ character. If you prefix the file
path with a second ’@’, pavuk will assume you wish to append
to an already existing file. Otherwise the file will be created/erased
when pavuk starts.
- -sleep $nr
-
This option allows you to specify the number of seconds during which the
program will be suspended between two transfers. Useful to prevent server
overload. The default value for this option is 0.
- -rsleep/-norsleep
-
When this option is active, pavuk randomizes the sleep time between
transfers in the interval between zero and the value specified with the
-sleep option. This option is inactive by default.
- -ddays $nr
-
If a document has a modification time later than $nr days before
today, then in sync mode pavuk attempts to retrieve a newer copy of the
document from the remote server. Default value is 0.
- -remove_old/-noremove_old
-
Remove improper documents (those which don’t exist on the remote
site). This option has an effect only when used in directory based
sync mode. When used with URL based sync mode, pavuk will not remove
any old files which were excluded from the document tree and are not
referenced in any HTML document. You must also use the -subdir option
to let pavuk find the files which belong to the current mirror. By default
pavuk won’t remove any old files.
- -browser $str
-
is used to set your browser command (in the URL tree dialog you can
right-click to raise a menu from which you can start the browser on the
currently selected URL). This option is available only when compiled with
the GTK GUI and with support for URL tree preview.
- -debug/-nodebug
-
turns on displaying of debug messages. This option is available only
when compiled with -DDEBUG, i.e. when having executed ./configure
--enable-debug to set up the pavuk source code. If the -debug
option is used, pavuk will output verbose information about documents,
whole protocol level information, file locking information and much more
(the amount and types of information depends on the -debug_level
command-line arguments). This option is used as a trigger to enable output
of debug messages selected by the -debug_level option. Default is
debug mode turned off. To check if your pavuk binary supports -debug
, you can run pavuk with the -version option.
- -debug_level $level
-
Set the level of required debug information. $level can be a
numeric value which represents a binary mask of the requested debug levels,
or a comma separated list of supported debug level identifiers.
The debug level identifiers (as listed below) can be prefixed with an
exclamation mark (!) to turn them off. For example, this
$level specification:
all,!html,!limits
will turn ’all’ debug levels ON, except
’html’ and ’limits’.
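Resolving such an identifier list can be sketched in Python (LEVELS here is an abbreviated stand-in for pavuk's full identifier set, which is listed below):

```python
# Illustrative only: resolve a $level list like "all,!html,!limits"
# into the effective set of enabled debug levels.
LEVELS = {"html", "limits", "cookie", "net", "ssl"}  # abbreviated stand-in

def resolve_debug_levels(spec):
    enabled = set()
    for ident in spec.split(","):
        if ident == "all":
            enabled |= LEVELS          # turn everything on
        elif ident.startswith("!"):
            enabled.discard(ident[1:])  # "!name" turns a level off
        else:
            enabled.add(ident)
    return enabled

print(sorted(resolve_debug_levels("all,!html,!limits")))
# ['cookie', 'net', 'ssl']
```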
Currently pavuk supports the following debug level identifiers:
- all
-
request all currently supported debug levels
- bufio
-
for watching the pavuk I/O buffering layer at work - this layer is
positioned on top of all file I/O and network traffic for improved
performance.
- cookie
-
for monitoring HTTP ’cookies’ processing.
- dev
-
for additional ’developer’ debug info. This generally
produces more debug info across the board.
- hammer
-
for watching events while running in -hammer_mode >= 1
replay mode
- html
-
for HTML parser debugging
- htmlform
-
for monitoring HTML web form processing, such as recognizing and
(automatically) filling in web form fields.
- protos
-
to see server side protocol messages
- protoc
-
to see client side protocol messages
- procs
-
to see some special procedure calls
- locks
-
for debugging of documents locking
- net
-
for debugging some low level network stuff
- misc
-
for miscellaneous unsorted debug messages
- user
-
for verbose user level messages
- mtlock
-
locking of resources in multithreading environment
- mtthr
-
launching/waking/sleeping/stopping of threads in multithreaded
environment
- protod
-
for DEBUGGING of POST requests
- limits
-
for debugging limiting options, you will see the reason why
particular URLs are rejected by pavuk and which option caused this.
- rules
-
for debugging -fnrules and JavaScript-based filters.
- ssl
-
to enable verbose reporting about SSL related things.
- trace
-
to enable verbose reporting of development related things.
- js
-
for debugging the -js_pattern , -js_transform and
-js_transform2 filter processing.
- -remind_cmd $str
-
This option has an effect only when running pavuk in reminder mode.
Pavuk sends the result of the reminder-mode run to the command specified
with this option. The result lists the URLs which have changed and the URLs
which produced any errors. The default remind command is "mailx user@server
-s \"pavuk reminder result\"" .
- -nscache_dir $dir
-
Path to the Netscape browser cache directory. If you specify this path,
pavuk attempts to find out whether a URL is in this cache. If the URL is
there, it will be fetched from the cache, else pavuk will download it from
the net. The cache directory index file must be named index.db and must be
located in the cache directory. To support this feature, pavuk has to be
linked with BerkeleyDB 1.8x .
- -mozcache_dir $dir
-
Path to the Mozilla browser cache directory. Same functionality as the
previous option, just for a different browser with different cache formats.
Pavuk supports both formats of the Mozilla browser disk cache (the old one
for versions <0.9 and the new one used in 0.9=<). The old format cache
directory must contain a cache directory index database named
cache.db . The new format cache directory must contain the map file
_CACHE_MAP_ and three block files _CACHE_001_
, _CACHE_002_ and _CACHE_003_ . To support the old
Mozilla cache format, pavuk has to be linked with BerkeleyDB 1.8x. The new
Mozilla cache format doesn’t require any external library.
- -post_cmd $str
-
Post-processing command, which will be executed after a successful
download of a document. This command may process the document in some way.
While this command is running, pavuk keeps the document locked, so there is
no chance that some other pavuk process will modify the document. This
post-processing command will get three additional parameters from
pavuk.
- name
-
local name of document
- 1 / 0
-
- URL
-
original URL of this document
- -hack_add_index/-nohack_add_index
-
This is a bit of a hacky option. It forces pavuk to also add the
directory indexes of all queued documents to the URL queue. This allows
pavuk to download more documents from a site than it is able to achieve by
normal traversal of HTML documents. A bit dirty, but useful in some cases.
- -js_script_file $file
-
Pavuk optionally has a built-in JavaScript interpreter to allow high
level customization of some internal procedures. Currently you can
customize two things with your own JavaScript functions: you can set
precise limiting options, or you can write your own functions for use
inside the rules of the -fnrules option. With this option you can load a
JavaScript script with such functions into pavuk’s internal JavaScript
interpreter. This option is available only when you have compiled pavuk
with support for JavaScript bindings.
- -mime_type_file $file
-
Specify an alternative MIME type and file extensions definition file
$file to override the rather limited set of built-in MIME types and
default file extensions. The file must be of a UNIX mime.types(5)
compatible format.
If you do not specify this command line option, these MIME types and
extensions are known to pavuk by default:
MIME types and default file extensions
MIME type                  | Default File Extension
text/html*                 | html
text/js                    | js
text/plain                 | txt
image/jpeg                 | jpg
image/pjpeg                | jpg
image/gif                  | gif
image/png                  | png
image/tiff                 | tiff
application/pdf            | pdf
application/msword         | doc
application/postscript     | ps
application/rtf            | rtf
application/wordperfect5.1 | wps
application/zip            | zip
video/mpeg                 | mpg
Note that the source distribution of pavuk already includes a
full-fledged mime.types file for your convenience. You may point
-mime_type_file at this file to make pavuk aware of (almost) all
MIME types available out there!
You may want to use the JavaScript bindings built into pavuk for performing
tasks which need more complexity than can be achieved with a regular,
non-scriptable program.
You can load one JavaScript file into pavuk using the command line option
-js_script_file . Currently there are two hooks in pavuk where the user
can insert their own JavaScript functions.
One is inside the routine which decides whether a particular URL should
be downloaded or not. If you want to insert your own JavaScript decision
function, you must name it pavuk_url_cond_check . The prototype of this
function looks as follows:
function pavuk_url_cond_check (url, level)
{
...
}
where the function return value is used by pavuk. Any return value which
evaluates to a boolean ’false’ or integer ’0’ (zero)
will be considered a ’NO’ answer, i.e. skip the given URL. Any
other boolean or integer return value constitutes a ’YES’ answer.
(Note that return values are cast to an integer value before evaluation.)
- level
-
is an integer number and indicates from which of five different places
in the pavuk code the pavuk_url_cond_check function is currently called:
- level == 0
-
condition checking is called from HTML parsing routine. At this
point you can use all conditions besides -dmax ,
-newer_than , -older_than , -max_size ,
-min_size , -amimet , -dmimet and
-user_condition when calling the pavuk url.check_cond(name,
....) URL class method from this JavaScript function script code.
Calling url.check_cond(name, ....) with any of the conditions
listed above will be processed as a no-op, i.e. it will return the
boolean value ’TRUE’.
- level == 1
-
condition checking is called from the routine which performs
queueing of URLs into the download queue. These URLs have been collected
from another HTML page before. At this point you can only use the
conditions -dmax and -user_condition .
- level == 2
-
condition checking is called when a URL is taken from the download
queue; the URL will be transferred if this check succeeds. At this
point you can use the same set of conditions as at level == 0
except -tag_pattern and -tag_rpattern . Additionally, you can
use the condition -dmax here.
- level == 3
-
condition checking is called after pavuk has sent the download request
and detected the document size, modification time and MIME type. At this
level you can only use the conditions -newer_than ,
-older_than , -max_size , -min_size ,
-amimet , -dmimet and -user_condition . As with
the other levels, using any other conditions is identical to a no-op
check.
- url
-
is an object instance of the PavukUrl class. It contains all information
about a particular URL and is a wrapper for the parsed URL structure of
type url defined inside pavuk.
It has the following attributes:
-
read-write attributes
- status
-
(int32, always defined) holds bitfields with different info
(look in url.h for more)
-
And following methods:
- get_parent(n)
-
get URL of n-th parent document
- check_cond(name, ...)
-
check the condition whose option name is "name". When you do not
provide additional parameters, pavuk will use the parameters from the
command line or scenario file for condition checking. Otherwise it will
use the listed parameters.
The following condition names are recognized (note that the use
of other names is considered an error here):
Next to that, pavuk also offers a global print(...) function which will
print each of the parameters passed to it, separating them by a single space. The
text is terminated by a newline. Note that each of the print(...) parameters
is cast to a string before being printed.
Here is an example of what a pavuk_url_cond_check function can look like:
function pavuk_url_cond_check (url, level)
{
if (level == 0)
{
if (url.level > 3 && url.check_cond("-asite", "www.host.com"))
return false;
if (url.check_cond("-url_rpattern"
, "http://www.idata.sk/~ondrej/"
, "http://www.idata.sk/~robo/")
&& url.check_cond("-dsfx", ".jar", ".tgz", ".png"))
return false;
}
if (level == 2)
{
par = url.get_parent();
if (par && par.get_moved())
return false;
}
return true;
}
This example is rather useless, but shows you how to use this feature.
The second possible use of JavaScript with pavuk is in the -fnrules option
for generating local names. In this case it is done by a special function of
the extended -fnrules option syntax called "jsf ", which has one parameter:
the name of the JavaScript function which will be called. The function must
return a string and its prototype is something like the following:
function some_jsf_func(fnrule)
{
...
}
The fnrule parameter is an object instance of the PavukFnrules class.
It has three read-only attributes:
-
url - which is of PavukUrl type described above
-
pattern - which is the -fnrules provided pattern string
-
pattern_type - which is the -fnrules provided pattern type ID (an
integer number): when called by a -fnrules ... ’F’ option,
pattern_type == 2, when called by a -fnrules ... ’R’ (regex)
option, pattern_type == 1, otherwise pattern_type == 0 (unknown).
and also has two methods:
-
get_macro(macro) - returns the value of the ’%’ macros used in the
-fnrules option, where the (string type) macro argument may be any of
’%i’, ’%p’, ’%u’, ’%h’, ’%m’, ’%r’, ’%d’, ’%n’, ’%b’, ’%e’, ’%s’,
’%q’, ’%U’, ’%o’, ’%M’, ’%B’, ’%A’, ’%E’, ’%Y’ or ’%X’. Any other
macro argument value will not be processed and is passed as is, i.e.
will be returned by get_macro(macro) untouched.
-
get_sub(nr) - returns the substring of ’urlstr’ as matched by the
regex sub-expression ’nr’ when the -fnrules R statement was
matched.
You can do something like:
-fnrules F "*" ’(jsf "some_fnrules_func")’
As of version 0.9pl29, pavuk has changed how status is indicated by exit
codes. In earlier versions, exit status 0 meant no error and a nonzero exit
status was something like the count of failed documents. In all versions
after 0.9pl29 the following exit codes are defined:
-
no error, everything is OK
-
error in configuration of pavuk options or error in config files
-
some error occurred while downloading documents
-
a signal was caught while downloading documents; transfer was aborted
-
an internal check failed while downloading documents; transfer was
aborted
- USER
-
this variable is used to construct an email address from the user and hostname
- LC_*, LANG
-
used to set internationalized environment
- PAVUKRC_FILE
-
with this variable you can specify alternative location for your
.pavukrc configuration file.
- at
-
is used for scheduling.
- gunzip
-
is used to decode gzip or compress encoded documents. Note that since pavuk
release 0.9.36 gunzip is only used when pavuk has been built without
built-in zlib support. You can check if your pavuk binary comes with built-in
zlib support by running pavuk -v which should report
’gzip/compress/deflate Content-Encoding’ as one of the optional
features available.
If you find any, please let me know.
- /usr/local/etc/pavukrc
-
---
- ~/.pavukrc
-
---
- ~/.pavuk_prefs
-
These files are used as default configuration files. You may specify there
some constant values like your proxy server or your preferred WWW browser.
Configuration options reflect command line options. Not all parameters are
suitable for use in a default configuration file. You should select only
those which you really need.
The file ~/.pavuk_prefs is a special file which contains
automatically stored configuration. This file is used only when running the
GUI interface of pavuk and the option -prefs is active.
- -auth_file $file
-
File $file should contain as many authentication records as you need.
Records are separated by any number of empty lines. Parameter name is case
insensitive.
Structure of record:
Field       : Proto: <proto ID>
Description : identification of the protocol (ftp/http/https/..)
Reqd        : required

Field       : Host: <host:[port]>
Description : host name
Reqd        : required

Field       : User: <user>
Description : name of the user
Reqd        : optional

Field       : Pass: <password>
Description : password for the user
Reqd        : optional

Field       : Base: <path>
Description : base prefix of the document path
Reqd        : optional

Field       : Realm: <name>
Description : realm for HTTP authorization
Reqd        : optional

Field       : NTLMDomain: <domain>
Description : NTLM domain for NTLM authorization
Reqd        : optional

Field       : Type: <type>
Description : HTTP authentication scheme. Accepted values:
              {1/2/3/4/user/Basic/Digest/NTLM}. Similar meaning as the
              -auth_scheme option (see the help for this option for more
              details). Default is 2 (Basic scheme).
Reqd        : optional
See pavuk_authinfo.sample file for an example.
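A minimal record following the structure above might look like this (the host and credentials are of course hypothetical placeholders):

```
Proto: http
Host: www.example.com:80
User: jrandom
Pass: secret
Realm: members
Type: Basic
```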
- ~/.pavuk_keys
-
this file stores information about configurable menu option
shortcuts. It is available only when compiled with GTK+ 1.2 and higher.
- ~/.pavuk_remind_db
-
this file contains information about URLs for running in reminder
mode. The structure of this file is very simple. Each line contains
information about one URL. The first entry on a line is the last known
modification time of the URL (stored in time_t format - the number of
seconds since 1.1.1970 GMT), and the second entry is the URL itself.
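Reading one such line can be sketched in Python (the timestamp and URL below are made-up sample data in the described format):

```python
# Illustrative only: parse one ~/.pavuk_remind_db line
# ("<time_t> <URL>", separated by whitespace).
import time

line = "1017575612 http://www.idata.sk/~ondrej/"  # hypothetical entry
stamp, url = line.split(None, 1)
mtime = int(stamp)  # seconds since 1.1.1970 GMT
print(time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(mtime)), url)
```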
The first file parsed (if present) is /usr/local/etc/pavukrc , then
~/.pavukrc (if present), then ~/.pavuk_prefs (if present). Last, the
command line is parsed.
The precedence of configuration settings is as follows (ordered from highest to
lowest precedence):
Here is table of config file - command line options pairs:
Config file options vs. command line option equivalents
Config file option | command line option
ActiveFTPData: | -ftp_active / -ftp_passive
ActiveFTPPortRange: | -active_ftp_port_range
AddHTTPHeader: | -httpad
AdvBannerRE: | -adv_re
AllLinksToLocal: | -all_to_local / -noall_to_local
AllLinksToRemote: | -all_to_remote / -noall_to_remote
AllowCGI: | -CGI / -noCGI
AllowedDomains: | -adomain
AllowedIPAdrressPattern: | -aip_pattern
AllowedMIMETypes: | -amimet
AllowedPorts: | -aport
AllowedPrefixes: | -aprefix
AllowedSites: | -asite
AllowedSuffixes: | -asfx
AllowFTP: | -FTP / -noFTP
AllowFTPRecursion: | -FTPdir
AllowFTPS: | -FTPS / -noFTPS
AllowGopher: | -Gopher / -noGopher
AllowGZEncoding: | -Enc / -noEnc
AllowHTTP: | -HTTP / -noHTTP
AllowRelocation: | -Relocate / -noRelocate
AllowSSL: | -SSL / -noSSL
AlwaysMDTM: | -always_mdtm / -noalways_mdtm
AuthFile: | -auth_file
AuthReuseDigestNonce: | -auth_reuse_nonce
AuthReuseProxyDigestNonce: | -auth_reuse_proxy_nonce
AutoReferer: | -auto_referer / -noauto_referer
BaseLevel: | -base_level
BgMode: | -bg / -nobg
Browser: | -browser
CheckIfRunnigAtBackground: | -check_bg / -nocheck_bg
CheckSize: | -check_size / -nocheck_size
CommTimeout: | -timeout
CookieCheckDomain: | -cookie_check / -nocookie_check
CookieFile: | -cookie_file
CookieRecv: | -cookie_recv / -nocookie_recv
CookieSend: | -cookie_send / -nocookie_send
CookiesMax: | -cookies_max
CookieUpdate: | -cookie_update / -nocookie_update
Debug: | -debug / -nodebug
DebugLevel: | -debug_level
DefaultMode: | -mode
DeleteAfterTransfer: | -del_after / -nodel_after
DisabledCookieDomains: | -disabled_cookie_domains
DisableHTMLTag: | -disable_html_tag
DisallowedDomains: | -ddomain
DisallowedIPAdrressPattern: | -dip_pattern
DisallowedMIMETypes: | -dmimet
DisallowedPorts: | -dport
DisallowedPrefixes: | -dprefix
DisallowedSites: | -dsite
DisallowedSuffixes: | -dsfx
DocExpiration: | -ddays
DontLeaveDir: | -leave_dir / -dont_leave_dir
DontLeaveSite: | -leave_site / -dont_leave_site
DontTouchTagREPattern: | -dont_touch_tag_rpattern
DontTouchUrlPattern: | -dont_touch_url_pattern
DontTouchUrlREPattern: | -dont_touch_url_rpattern
DumpFD: | -dumpfd
DumpUrlFD: | -dump_urlfd
EmailAddress: | -from
EnableHTMLTag: | -enable_html_tag
EnableJS: | -enable_js / -disable_js
FileSizeQuota: | -file_quota
FixWuFTPDBrokenLISTcmd: | -fix_wuftpd_list / -nofix_wuftpd_list
FnameRules: | -fnrules
FollowCommand: | -follow_cmd
ForceReget: | -force_reget
FSQuota: | -fs_quota
FTPDirtyProxy: | -ftp_dirtyproxy
FTPhtml: | -FTPhtml / -noFTPhtml
FTPListCMD: | -FTPlist / -noFTPlist
FTPListOptions: | -ftp_list_options
FtpLoginHandshake: | -ftp_login_handshake
FTPProxy: | -ftp_proxy
FTPProxyPassword: | -ftp_proxy_pass
FTPProxyUser: | -ftp_proxy_user
FTPViaHTTPProxy: | -ftp_httpgw
GopherProxy: | -gopher_proxy
GopherViaHTTPProxy: | -gopher_httpgw
GUIFont: | -gui_font
HackAddIndex: | -hack_add_index / -nohack_add_index
HammerEaseOffDelay: | -hammer_ease
HammerFlags: | -hammer_flags
HammerMode: | -hammer_mode
HammerReadTimeout: | -hammer_rtimeout
HammerRecorderDumpFD: | -hammer_recdump
HammerRepeatCount: | -hammer_repeat
HammerThreadCount: | -hammer_threads
HashSize: | -hash_size
HTMLFormData: | -formdata
HTMLTagPattern: | -tag_pattern
HTMLTagREPattern: | -tag_rpattern
HTTPAuthorizationName: | -auth_name
HTTPAuthorizationPassword: | -auth_passwd
HTTPAuthorizationScheme: | -auth_scheme
HTTPProxy: | -http_proxy
HTTPProxyAuth: | -http_proxy_auth
HTTPProxyPass: | -http_proxy_pass
HTTPProxyUser: | -http_proxy_user
Identity: | -identity
IgnoreChunkServerBug: | -ignore_chunk_bug / -noignore_chunk_bug
ImmediateMessages: | -immesg / -noimmsg
IndexName: | -index_name
JavaScriptFile: | -js_script_file
JavascriptPattern: | -js_pattern
JSTransform2: | -js_transform2
JSTransform: | -js_transform
Language: | -language
LeaveLevel: | -leave_level
LeaveSiteEnterDirectory: | -leave_site_enter_dir / -dont_leave_site_enter_dir
LimitInlineObjects: | -limit_inlines / -dont_limit_inlines
LocalIP: | -local_ip
LogFile: | -logfile
LogHammerAction: | -log_hammering / -nolog_hammering
MatchPattern: | -pattern
MaxDocs: | -dmax
MaxLevel: | -lmax / -l
MaxRate: | -maxrate
MaxRedirections: | -nredirs
MaxRegets: | -nregets
MaxRetry: | -retry
MaxRunTime: | -max_time
MaxSize: | -maxsize
MinRate: | -minrate
MinSize: | -minsize
MozillaCacheDir: | -mozcache_dir
NetscapeCacheDir: | -nscache_dir
NewerThan: | -newer_than
NLSMessageCatalogDir: | -msgcat
NSSAcceptUnknownCert: | -nss_accept_unknown_cert / -nonss_accept_unknown_cert
NSSCertDir: | -nss_cert_dir
NSSDomesticPolicy: | -nss_domestic_policy / -nss_export_policy
NTLMAuthorizationDomain: | -auth_ntlm_domain
NTLMProxyAuthorizationDomain: | -auth_proxy_ntlm_domain
NumberOfThreads: | -nthreads
OlderThan: | -older_than
PageSuffixes: | -page_sfx
PostCommand: | -post_cmd
PostUpdate: | -post_update / -nopost_update
PreferredCharset: | -acharset
PreferredLanguages: | -alang
PreserveAbsoluteSymlinks: | -preserve_slinks / -nopreserve_slinks
PreservePermisions: | -preserve_perm / -nopreserve_perm
PreserveTime: | -preserve_time / -nopreserve_time
Quiet: | -quiet / -verbose
RandomizeSleepPeriod: | -rsleep / -norsleep
ReadBufferSize: | -bufsize
ReadCSS: | -read_css / -noread_css
ReadHtmlComment: | -noread_comments / -read_comments
Read_MSIE_ConditionalComments: | -noread_msie_cc / -read_msie_cc
Read_XML_CDATA_Content: | -noread_cdata / -read_cdata
RegetRollbackAmount: | -rollback
REMatchPattern: | -rpattern
ReminderCMD: | -remind_cmd
RemoveAdvertisement: | -remove_adv / -noremove_adv
RemoveBeforeStore: | -remove_before_store / -noremove_before_store
RemoveOldDocuments: | -remove_old
RequestInfo: | -request
Reschedule: | -reschedule
RetrieveSymlinks: | -retrieve_symlink / -noretrieve_symlink
RunX: | -runX
ScenarioDir: | -scndir
SchedulingCommand: | -sched_cmd
SelectedLinksToLocal: | -sel_to_local / -nosel_to_local
SendFromHeader: | -send_from / -nosend_from
SendIfRange: | -send_if_range / -nosend_if_range
SeparateInfoDir: | -info_dir
ShowDownloadTime: | -stime
ShowProgress: | -progress
SinglePage: | -singlepage / -nosinglepage
SiteLevel: | -site_level
SkipMatchPattern: | -skip_pattern
SkipREMatchPattern: | -skip_rpattern
SkipURLMatchPattern: | -skip_url_pattern
SkipURLREMatchPattern: | -skip_url_rpattern
SleepBetween: | -sleep
SLogFile: | -slogfile
SSLCertFile: | -ssl_cert_file
SSLCertPassword: | -ssl_cert_passwd
SSLKeyFile: | -ssl_key_file
SSLProxy: | -ssl_proxy
SSLVersion: | -ssl_version
StatisticsFile: | -statfile
StoreDirIndexFile: | -store_index / -nostore_index
StoreDocInfoFiles: | -store_info / -nostore_info
StoreName: | -store_name
TransferQuota: | -trans_quota
TrChrToChr: | -tr_chr_chr
TrDeleteChar: | -tr_del_chr
TrStrToStr: | -tr_str_str
UniqueDocName: | -unique_name / -nounique_name
UniqueLogName: | -unique_log / -nounique_log
UniqueSSLID: | -unique_sslid / -nounique_sslid
URLMatchPattern: | -url_pattern
URLREMatchPattern: | -url_rpattern
UrlSchedulingStrategy: | -url_strategy
URLsFile: | -urls_file
UseCache: | -cache / -nocache
UseHTTP11: | -use_http11
UsePreferences: | -prefs / -noprefs
UserCondition: | -user_condition
UseRobots: | -Robots / -noRobots
Verify CERT: | -verify / -noverify
WaitOnExit: | -ewait
WorkingDir: | -cdir
WorkingSubDir: | -subdir
XMaxLogSize: | -xmaxlog
URL: | one URL (more lines with URL: ... means more URLs)
Some config file entries are not available as command-line options:
Extra config file options for the GTK GUI
Config file option | Description
BtnConfigureIcon: | accepts a path argument
BtnConfigureIcon_s: | accepts a path argument
BtnLimitsIcon: | accepts a path argument
BtnLimitsIcon_s: | accepts a path argument
BtnGoBgIcon: | accepts a path argument
BtnGoBgIcon_s: | accepts a path argument
BtnRestartIcon: | accepts a path argument
BtnRestartIcon_s: | accepts a path argument
BtnContinueIcon: | accepts a path argument
BtnContinueIcon_s: | accepts a path argument
BtnStopIcon: | accepts a path argument
BtnStopIcon_s: | accepts a path argument
BtnBreakIcon: | accepts a path argument
BtnBreakIcon_s: | accepts a path argument
BtnExitIcon: | accepts a path argument
BtnExitIcon_s: | accepts a path argument
BtnMinimizeIcon: | accepts a path argument
BtnMaximizeIcon: | accepts a path argument
A line beginning with ’#’ is a comment.
TrStrToStr: and TrChrToChr: must contain two quoted strings. All
parameter names are case insensitive. If an option is missing here, look
inside the config.c source file.
See the pavukrc.sample file for an example.
The simplest invocation:
pavuk http://<my_host>/doc/
Mirroring a site to a specific local directory tree, rejecting big files (>
16MB), plus lots of extra options covering, among other things, active FTP
sessions and passive FTP (for when you’re behind a firewall). As such, this is
a rather mix & mash example:
pavuk -mode mirror -nobg -store_info -info_dir /mirror/info
-nthreads 1 -cdir /mirror/incoming -subdir /mirror/incoming
-preserve_time -nopreserve_perm -nopreserve_slinks -noretrieve_symlink
-force_reget -noRobots -trans_quota 16384 -maxsize 16777216
-max_time 28 -nodel_after -remove_before_store -ftpdir -ftplist
-ftp_list_options -a -dont_leave_site -dont_leave_dir -all_to_local
-remove_old -nostore_index -active_ftp_port_range 57344:65535
-always_mdtm -ftp_passive -base_level 2
http://<my_host>/doc/
Note
This is a writeup for a bit of extra pavuk documentation. Comments are
welcomed; I hope this is useful for those who are looking for some prime examples
of pavuk use (intermediate complexity).
Author: Ger Hobbelt
< ger@hobbelt.com >
Anyone for whom
’pavuk http://www.da-url-to-spider.com/’
doesn’t entirely suit their needs.
Anyone who feels an itch coming up when their current spider software croaks
again, merely because they were only interested in spidering part of
the pages.
This example text assumes you’ve had your first few trial runs using pavuk
already. We take off at the point where you knew you should really read the manual
but didn’t dare do so. Yet. ... Or you did and got that look upon your face,
where your relatives start to laugh and your kids yell: “Mom! Dad is doing
that look again!”
We’re going to cover a hard case for any spider: a
Mediawiki-driven documentation website.
The goal: Get some easily readable pages in your local (off-line) storage.
I wished to have the documentation for a tool I purchased
available off net, since I’m not always connected when I’m somewhere
where I find time to work with that particular tool. And the company that sells the
product doesn’t include a paper manual.
Their documentation is stored in a Mediawiki web site, i.e. a website driven by
the same software which was written for the well known Wikipedia.
There are several issues with such sites, at least from an ’off net
copy’ and ’spider’ perspective:
-
The web pages don’t come with proper file extensions, e.g.
’.HTML’. Sometimes there is no filename extension at all, as is
the case with Mediawiki sites. For a web site this is not an issue, as the web
server and your browser work in perfect tandem as long as the server
sends along the correct MIME type with the content, and Mediawiki does a
splendid job there.
-
As each page has quite a few links to edit forms, page histories, older
revisions and the like, your spider will really love to dig in and go there.
Unfortunately this is the Road To Hell (tm) as:
-
any site of sufficient age, i.e. a large enough number of edits to its
pages, will have your spider go... and go... and go... and then some
more.
-
To put it mildly, you may not be particularly interested in those
historic edits / revisions / etc. -- I know I wasn’t, I just wanted
to have the latest documentation along when I open up my laptop next where
there’d be no Net. And I didn’t like my disc flooded with - to
me - garbage.
-
If you are really lucky with these highly dynamic sites, they’ll
provide reporting and other facilities on a day to day basis: when the
spider hits those calendars and the site is set up to, for example, show
the state of the union, pardon, website for any given day back till the
dawn of civilization, you’re in for a real treat as the spider
will request those dynamic pages for every day in that lovely calendar.
ETA on this process? Somewhere around this Saturday next year. If
you’re lucky and your IP doesn’t get banned before that day for
abuse.
So the key to this type of spider activity is to be able to restrict the
spider to the ’main pages’, i.e. that part of the content you are
interested in.
-
Which leaves only one ’minor’ issue: local files don’t
come with a ’MIME type’, so you’re in real need of
some fitting filename extensions to help your HTML browser/viewer decide
how to show a particular bit of content. After all, both a .HTML and a
.JPG file are just a bunch of bytes, but, heck, does a JPG look wicked when
you try to view it as if it were an HTML page. And vice versa.
pavuk is perfectly able to help you out with this challenge as it comes with
quite a few features to selectively grab and discard pages during the spider
process.
And it has something extra, which is not to be sneezed at when you are
trying to convert dynamically generated content into some sort of static HTML
pages for off net use: FILENAME REWRITING. This allows you to tell pavuk
exactly how you want those pages filed and under what filenames, including,
very important to get your web browser to cooperate when you feed it these
pages from your local disc, the appropriate filename extensions.
Let’s have a look at the pavuk commandline which does all of that -
and then some:
Note
(this is pavuk tests/ example script no. 2a, by the way) The pavuk
commandline has been broken across multiple lines to improve its
readability.
We are going to grab the documentation for a 3D animation plugin called CAT,
available at http://cat.wiki.avid.com/
Special notes for this spider run:
-
We are also interested in the ’RecentChanges’
report/overview, as I edit my local copy of this documentation and like
to know which pages have changed since the last time I visited the
site.
-
Remove the single spaces before each of those ’&’ in
those URLs if you want the real URL; these were inserted only to
simplify this document’s formatting.
-
For the same reason, remove the single spaces following each
’,’ comma in several of the commandline option arguments down
there.
../src/pavuk
-verbose
-dumpdir pavuk_data/
-noRobots
-cdir pavuk_cache/
-cookie_send
-cookie_recv
-cookie_check
-cookie_update
-cookie_file pavuk_data/chunky-cookies3.txt
-read_css
-auto_referer
-enable_js
-info_dir pavuk_info/
-mode mirror
-index_name chunky-index.html
-request ’URL:http://cat.wiki.avid.com/index.php? title=Special:Recentchanges&
hideminor=0 &hideliu=0 &hidebots=0 &hidepatrolled=0 &limit=500 &days=30 &limit=500
METHOD:GET’
-request ’URL:http://cat.wiki.avid.com/index.php/Special:Lonelypages METHOD:GET’
-request ’URL:http://cat.wiki.avid.com/index.php/Special:Unusedimages METHOD:GET’
-request ’URL:http://cat.wiki.avid.com/index.php/Special:Allpages METHOD:GET’
-request ’URL:http://cat.wiki.avid.com/ METHOD:GET’
-scndir pavuk_scenarios/
-dumpscn TestScenario.txt
-nthreads 1
-progress_mode 6
-referer
-nodump_after
-rtimeout 10s
-wtimeout 10s
-timeout 60s
-dumpcmd test_cmd_dumped.txt
-debug
-debug_level ’all, !locks, !mtlock, !cookie, !trace, !dev, !net, !html, !htmlform,
!procs, !mtthr, !user, !limits, !hammer, !protos, !protoc, !protod, !bufio,
!rules, !js’
-store_info
-report_url_on_err
-tlogfile pavuk_log_timing.txt
-dump_urlfd @pavuk_urlfd_dump.txt
-dumpfd @pavuk_fd_dump.txt
-dump_request
-dump_response
-logfile pavuk_log_all.txt
-slogfile pavuk_log_short.txt
-test_id T002
-adomain cat.wiki.avid.com
-use_http11
-skip_url_pattern ’*oldid=*, *action=edit*, *action=history*, *diff=*, *limit=*,
*[/=]User:*, *[/=]User_talk:*, *[^p]/Special:*, *=Special:[^R]*, *.php/Special:[^LUA][^onl][^nul]*,
*MediaWiki:*, *Search:*, *Help:*’
-tr_str_str ’Image:’ ’’
-tr_chr_chr ’:\\!&=?’ ’_’
-mime_types_file ../../../mime.types
-fnrules F ’*/index.php/*:*’ ’%h:%r/%d/%n%s.%X’
-fnrules F ’*/index.php[/?]*’ ’%h:%r/%d/%b%s.%X’
-fnrules F ’*’ ’%h:%r/%d/%b%s.%Y’
Whew, that’s some commandline you’ve got there! Well, I always
start out with the same set of options, which are not really relevant here
(we’re not all that concerned with tracking cookies on this one, for
one), but it has grown into a habit which is hard to get rid of.
A bit of a toned down version looks like this:
Note
removed are:
-
logging features (the -dump_whathaveyou commandline options /
-store_info/-[ts]logfile)
-
cookie tracking and handling options
-
storage directory configuration (-dumpdir/-cdir/-info_dir/-scndir)
-
multithreading configuration (-nthreads)
-
verbosity and progress info aids
(-verbose/-progress_mode/-report_url_on_err)
-
diagnostics features: there is a whole slew of flags that are
really helpful when you are setting up this sort of thing for the first time:
without those it can be really hard to find the proper incantations for
some of the remaining options (-debug/-debug_level)
-
miscellaneous for administrative purposes (-test_id)
leaving us:
../src/pavuk
-noRobots
-read_css
-auto_referer
-enable_js
-mode mirror
-index_name chunky-index.html
-request ’URL:http://cat.wiki.avid.com/index.php? title=Special:Recentchanges&
hideminor=0 &hideliu=0 &hidebots=0 &hidepatrolled=0 &limit=500 &days=30 &limit=500
METHOD:GET’
-request ’URL:http://cat.wiki.avid.com/index.php/Special:Lonelypages METHOD:GET’
-request ’URL:http://cat.wiki.avid.com/index.php/Special:Unusedimages METHOD:GET’
-request ’URL:http://cat.wiki.avid.com/index.php/Special:Allpages METHOD:GET’
-request ’URL:http://cat.wiki.avid.com/ METHOD:GET’
-referer
-adomain cat.wiki.avid.com
-use_http11
-skip_url_pattern ’*oldid=*, *action=edit*, *action=history*, *diff=*, *limit=*,
*[/=]User:*, *[/=]User_talk:*, *[^p]/Special:*, *=Special:[^R]*, *.php/Special:[^LUA][^onl][^nul]*,
*MediaWiki:*, *Search:*, *Help:*’
-tr_str_str ’Image:’ ’’
-tr_chr_chr ’:\\!&=?’ ’_’
-mime_types_file ../../../mime.types
-fnrules F ’*/index.php/*:*’ ’%h:%r/%d/%n%s.%X’
-fnrules F ’*/index.php[/?]*’ ’%h:%r/%d/%b%s.%X’
-fnrules F ’*’ ’%h:%r/%d/%b%s.%Y’
which tells pavuk to:
-
skip the ’robots.txt’, if available from this web site
(-noRobots)
-
load and interpret any CSS files, i.e. see if there are additional URLs
available in there (-read_css)
-
play nice with the web server and tell the box which path it is
traveling, just like a regular web browser would do when a human would
click on the links shown on screen (-auto_referer/-referer)
-
look at any JavaScript code for extra URLs (-enable_js). Yes,
we’re that desperate for URLs to spider. Well, this option is in my
’standard set’ to use with pavuk, and if it (he? she?)
doesn’t find any, it doesn’t hurt to have it here with us
anyway.
-
operate in ’mirror’ mode. Pavuk has several modes of
operation available for you, but I find I use ’mirror’ most,
probably because I’ve become really used to it. In a moment of
weakness, I might concede that it’s more probable that I have found
that often almost any problem can be turned into a nail if you find
yourself holding a large and powerful hammer. And the ’mirror’
mode might just be my hammer there.
-
store directory index content in the
’chunky-index.html’ file for each such directory. Simply put:
this is the content sent by the server when we request URLs that end with a
’/’. This is not the whole truth, but it’ll do for
now.
-
spider starting at several URLs (-request ...). Now this is interesting,
in that, at least theoretically, I could have done with specifying a single
start URL there:
-request ’URL:http://cat.wiki.avid.com/ METHOD:GET’
as the other URLs shown above can be reached from that page.
In practice though, I often find it a better approach to specify each of
the major sections of a site which you want to be sure your pavuk run needs
to cover. Besides, practice shows that some of those extra URLs can only be
reached by spidering and interpreting otherwise uninteresting revision/edit
Mediawiki system pages. And since we’re doing our darnedest best to
make sure pavuk does NOT grab or process any of _those_ pages, we would
miss a few bits, e.g. these ones:
-request ’URL:http://cat.wiki.avid.com/index.php/Special:Lonelypages METHOD:GET’
-request ’URL:http://cat.wiki.avid.com/index.php/Special:Unusedimages METHOD:GET’
would be completely missed had I not specified them explicitly
here, while keeping all the restrictions (-skip_url_pattern et al.) as
strict and restrictive as they are now.
-
restrict any spidering to the specified domain and any of its subdomains
(-adomain). In this particular case, there’s only one domain to
spider, but you can spider several locations in a single run, by specifying
multiple ’acceptable domains’ using -adomain.
-
to use the HTTP 1.1 protocol when talking to the web server. This is
another one of those ’standard options’ which I tend to
copy&paste in every pavuk command set. This one comes in handy when
your web site is hosted on a ’virtual host’, i.e. when several
domains share the same server and IP address (such as is the case with my
own web sites, ’www.hebbut.net’ and
’www.hobbelt.com’). Though this option’s use dates back to
older pavuk releases I still tend to include it, despite the fact that the
latest pavuk versions default to HTTP 1.1 instead of the older HTTP
1.0.
And now some of the real meat of this animal:
-skip_url_pattern comes with a huge set of comma-separated wildcard
expressions. When part of a URL matches any one of these expressions, pavuk
will ignore that URL and hence skip grabbing that particular page.
- ’*oldid=*’
-
is kind of trivial: if we somehow end up attempting to spider a
’historic’ (older) copy of a given web page, we are NOT
interested. This forces pavuk to skip any older versions of any Mediawiki
pages.
- ’*action=edit*’
-
is another trivial one: we are not going to log in and edit the page as
we are interested only in grabbing the current content. No editing pages
with web forms for us then.
- ’*action=history*’
-
is a variant to the ’oldid’ expression above with the same
intent. Note that all this is - of course - web site and Mediawiki
specific, so web sites serviced by different brands of CMS/Wiki software,
require their own set of skip patterns.
Nevertheless, the set above should work out nicely for most if not all
Mediawiki sites.
Also note that the complete URL is matched against these
patterns, i.e. including the ’?xxx&xxx&xxx’ URL
query part of that URL. (Bookmarks, encoded as a hash-delimited last
part of a client-side URL like this: ’...#jump_here’, are NOT
included in the match. The server should never get to see those anyway, as
hash bookmarks are a pure client-side thing.)
- ’*diff=*’
-
we don’t want to know what the changes to page X are compared to,
say, the previous version of said page.
- ’*limit=*’
-
there are several report/system pages in any Mediawiki site where lists
of items are split into chunks to reduce page size and user strain. This is a
quick & dirty way to get rid of any of those.
And then there are the pages we do like to see (UnusedImages +
LonelyPages), for which we are not interested in paging through to the end
of the list if it is that large for this site.
- ’*[/=]User:*’ , ’*[/=]User_talk:*’
-
two more which are irrelevant from our perspective: we’re going
offline with this material, so there’s no way to discuss matters with
the editors.
- ’*[^p]/Special:*’
-
this one rejects any ’Special:’ pages at first glance, but
is a little more wicked than that, as we do want those
’LonelyPages’, ’UnusedImages’ and
’AllPages’, thank you very much. See, this pattern is limited
to ’Special:’ pages which are not located in a
(virtual) directory ending with a ’p’. Due to the way the
Mediawiki software operates and presents its web pages, this basically
means the pattern will ONLY match (and thus skip) ’Special:’
pages which do not directly follow the ’index.php’ processing page,
which, in Mediawiki’s case, presents itself as if it were a directory,
as in this URL:
http://cat.wiki.avid.com/index.php/Special:Lonelypages
Unfortunately, the above pattern is not restrictive enough, as we’ll
now be treated to a whole slew of main-page ’Special:’s. And that
wasn’t what we wanted, was it?
Additional patterns to the rescue!
Remember that we are only interested in three of them:
-
’LonelyPages’,
-
’UnusedImages’ and
-
’AllPages’
So the next pattern:
’*=Special:[^R]*’
may seem kind of weird right now. Let’s file that one away for later, and
first have a look at the next one after that:
’*.php/Special:[^LUA][^onl][^nul]*’: Now this baby looks just
like the supplement we were looking for: skip any ’Special:’s
which do not start their name with one of the characters ’L’,
’U’ or ’A’. Compare that to the three
’Special:’s we actually _do_ want to download, listed above, and the
method should quickly become apparent: the second letter is declared
ba-a-a-a-a-d and evil when it’s not one of these: ’o’,
’n’ or ’l’, and just to top it off, the third letter in
the name is checked too: if it’s not ’n’, ’u’ or
’l’, the page at hand is _out_.
So this should do it regarding those ’Special:’s, right?
Not Entirely, No. Because there’s still that fourth one we’d
love to see:
-request ’URL:http://cat.wiki.avid.com/index.php? title=Special:Recentchanges&
hideminor=0 &hideliu=0 &hidebots=0 &hidepatrolled=0 &limit=500 &days=30 &limit=500
METHOD:GET’
which has a bit of a different form around the ’Special’
text:
index.php?title=Special:Recentchanges
Note the ’=’ in there. So that’s why we had that
other pattern, the one we filed for later discussion:
’*=Special:[^R]*’
i.e. discard any page containing the string ’=Special:’ which
is not immediately followed by the character ’R’ of
’RecentChanges’.
So far, so good.
Mediawiki comes with another heap of system pages, which are categorically
rejected using this set of three patterns:
’*MediaWiki:*, *Search:*, *Help:*’
NOW we’re done. At least as far as filtering/restricting the spider is
concerned.
Note
A last note before we continue with the next section: each of
the ’-skip_url_pattern’ patterns is handled as if it were a UNIX
filesystem/shell wildcard: MSDOS/Windows people will recognize
’?’ (any single character) and ’*’ (zero or more
characters), but UNIX wildcard patterns also accept ’sets’, such
as ’[a-z]’ (any one of the letters of our alphabet, but only the
lowercase ones) or ’[^0-9]’ (any one character, but NOT a
digit!). pavuk calls these ’fnmatch()’ patterns and if you google
the Net, you’ll be sure to find some very thorough descriptions of
those. They live next to the ’regex’ (a.k.a. ’regular
expressions’) which are commonly used in Perl and other languages.
pavuk - of course - comes with those too: if you like to use regexes, you
should specify your restrictive patterns using the
’-skip_url_rpattern’ commandline option instead. Note that subtle
extra ’r’ in the commandline option there.
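To make the pattern mechanics tangible, here is a small sketch using a shell case statement, whose glob patterns are very close to the fnmatch() patterns pavuk uses (one spelling difference: portable shell negates a set as ’[!R]’ where fnmatch also accepts ’[^R]’). The helper name is invented for illustration:

```shell
#!/bin/sh
# Hypothetical helper mimicking a few of the -skip_url_pattern rules.
# Shell 'case' globbing is very close to fnmatch(3); note the [!...]
# spelling for negated sets (pavuk's patterns use [^...]).
should_skip() {
  case "$1" in
    *oldid=*|*action=edit*|*action=history*|*diff=*|*limit=*) echo skip ;;
    *=Special:[!R]*)                    echo skip ;;
    *.php/Special:[!LUA][!onl][!nul]*)  echo skip ;;
    *)                                  echo keep ;;
  esac
}

should_skip 'http://cat.wiki.avid.com/index.php?title=Manual&oldid=123'  # skip
should_skip 'http://cat.wiki.avid.com/index.php/Special:Lonelypages'     # keep
should_skip 'http://cat.wiki.avid.com/index.php/Special:Watchlist'       # skip
```

Note how ’Special:Lonelypages’ survives the negated-set patterns (it starts with an ’L’), while ’Special:Watchlist’ does not.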
Still, if you grab a Mediawiki site’s content just like that,
you’ll end up with a horrible mess of files with all sorts of funny
characters in their filenames.
This might not be too bothersome on a UNIX box (apart from the glaring
difficulty of properly viewing each filetype, as the filename extensions are the
browser/viewer’s only help once these files end up on your local
storage), but I wished to view the downloaded content on a laptop with Windows
XP installed.
So there’s a bit more work to do here: knead the filenames into a form
that is palatable to both me and my Windows web page viewing tools.
This is where some of the serious power of pavuk shows. It might not be the
simplest tool around, but if you were looking for that Turbo Piledriver to
devastate those 9 inch nail-shaped challenges, here you are.
We’ll start off easy: Images.
They should at least have decent filenames and more importantly: suitable
filename extensions.
So we add these commandline options as filename ’transformation’
instructions:
- -tr_str_str ’Image:’ ’’
-
will simply discard any ’Image:’ string in the URL while
converting said URL to a matching filename.
- -tr_chr_chr ’:\\!&=?’ ’_’
-
Windows does NOT like ’:’ colons (and a few other
characters), so we’ll have those replaced by a
’programmers’ space’, a.k.a. the ’_’
underscore.
This ’-tr_chr_chr’ will convert those long URLs which
include ’?xxx&yyy&etc’ URL query sections into
something without any of those darned characters: ’:’,
’\’ (note the UNIX shell escape there, hence ’\\’),
’!’, ’&’, ’=’ and
’?’.
Of course, if you find other characters in your grabbed URLs offend you,
you can add them to this list.
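As a rough sketch of what these two options do to a URL, here is the same transformation approximated with standard sed(1) and tr(1); pavuk applies its version while deriving the local filename, so this is an illustration only:

```shell
#!/bin/sh
# Approximate -tr_str_str 'Image:' '' and -tr_chr_chr ':\\!&=?' '_'
# with standard tools; illustration only, not what pavuk runs internally.
url='http://cat.wiki.avid.com/index.php?title=Image:CatRig.jpg'

step1=$(printf '%s' "$url" | sed 's/Image://g')   # drop the 'Image:' string
step2=$(printf '%s' "$step1" | tr ':\\!&=?' '_')  # map : \ ! & = ? to '_'

echo "$step2"   # http_//cat.wiki.avid.com/index.php_title_CatRig.jpg
```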
Then we’re on to the last and most interesting part of the filename
transformation act. But for that, we’ll need to help pavuk convert those
MIME types to filename extensions.
That we do by providing a nicely formatted mime.types(3) file (see the online
UNIX man pages for a format description):
-mime_types_file ../../../mime.types
Of course, I manipulated this file a bit so pavuk would choose
’.html’ over ’.htm’, etc. as several MIME types come
with a set of possible filename extensions: MIME types and filename extensions
come from quite disparate worlds and are not 1-on-1 exchangeable. But we
try.
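To illustrate, a minimal sketch of what such a mime.types file looks like (one MIME type per line, followed by the filename extensions associated with it; reordering the extensions is presumably how you steer pavuk towards ’.html’ rather than ’.htm’, as described above):

```
text/html                  html htm
text/css                   css
image/jpeg                 jpeg jpg jpe
image/png                  png
application/x-javascript   js
```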
-fnrules F ’*/index.php/*:*’ ’%h:%r/%d/%n%s.%X’
will take any URL which contains the string ’/index.php/’ and
comes with a ’:’ a little further down the road, and convert it to
a filename using the ’%h:%r/%d/%n%s.%X’ template.
The ’F’ tells pavuk that what follows is a ’fnmatch()’ type pattern:
like the ’-skip_url_pattern’ patterns above, these are very similar
to UNIX filesystem wildcards. If you wish to use real perl(5)-like regexes
instead, specify ’R’ here.
The template ’%h:%r/%d/%n%s.%X’ instructs pavuk to construct the
filename for the given URL like this:
-
’%h’ is replaced with the fully qualified host name, i.e.
’cat.wiki.avid.com’.
-
’%r’ is replaced with the port number, i.e. ’80’
for your average vanilla web site/server.
-
’%d’ is replaced with the path to the document.
-
’%n’ is replaced with the document name (including the
extension).
-
’%s’ is replaced with the URL searchstring, i.e. the
’...?xxx&yyy&whatever’ section of the URL.
-
and ’la piece de resistance’:
’%X’ is replaced with the default extension assigned to the
MIME type of the document, if one exists. Otherwise, the existing file
extension is used instead.
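Spelled out for one URL, the expansion might come together like this (all values invented for illustration; pavuk performs this expansion internally):

```shell
#!/bin/sh
# Hypothetical expansion of the '%h:%r/%d/%n%s.%X' template for the URL
# http://cat.wiki.avid.com/index.php/Manual (values invented to illustrate):
host='cat.wiki.avid.com'   # %h: fully qualified host name
port='80'                  # %r: port number
dir='index.php'            # %d: path to the document
name='Manual'              # %n: document name
search=''                  # %s: URL search string (empty here)
ext='html'                 # %X: extension taken from the MIME type

echo "${host}:${port}/${dir}/${name}${search}.${ext}"
# cat.wiki.avid.com:80/index.php/Manual.html
```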
Note
And the manual also says this: “You may want to specify the
additional command line option ’-mime_type_file’ to override
the rather limited set of built-in MIME types and default file
extensions.” Good! We did that already!
But what is that about “Otherwise, the existing file extension
is used instead”? Well, if the webserver somehow feeds you a MIME
type with document X and your list/file does not show a filename
extension for said MIME type, pavuk will try to deduce a filename
extension from the URL itself. Basically this comes down to pavuk
looking for the bit of the non-query part of the URL following the last
’.’ dot pavuk can find in there. In our case, that would
imply the extension would end up to be ’.php’ if we
aren’t careful, so it is imperative to have your
’-mime_type_file’ mime.types file properly filled with all
the filename extensions for each of the MIME types you are to encounter
on the website under scrutiny.
Since you’ve come this far, you might like to know that a large part
of the pavuk manual has been devoted to the ’-fnrules’ option
alone. And let me tell you: these ’-fnrules’ shown here barely
scratch the surface of the capabilities of the ’-fnrules’
commandline option: we did not use any of the ’Extended Functions’
in the transformation templates here...
As we have covered the first ’-fnrules’ of the set shown in the
example:
-fnrules F ’*/index.php/*:*’ ’%h:%r/%d/%n%s.%X’
-fnrules F ’*/index.php[/?]*’ ’%h:%r/%d/%b%s.%X’
-fnrules F ’*’ ’%h:%r/%d/%b%s.%Y’
you may wonder what the others are for and about.
The second one
-fnrules F ’*/index.php[/?]*’ ’%h:%r/%d/%b%s.%X’
makes immediate sense as it is the equivalent of the first, but now only for
those URLs which have a ’/’ slash or a ’?’ question
mark following the string ’/index.php’ immediately.
But wait! Wouldn’t its transform template execute on the same URLs as the
first ’-fnrules’ statement? In other words: what’s the use of
the first ’-fnrules’ if we have the second one too?
Well, there’s a little detail you need to know regarding
’-fnrules’: every URL only gets to use ONE. That is to say,
once a URL matches one of the ’-fnrules’, that template will be
applied and no further ’-fnrules’ processing will be applied to
that URL. This gives us the option to process several URLs in different ways,
though we must take care about the order in which we specify these
’-fnrules’: starting from strictest matching pattern to most
generic matching pattern. That is why the ’-fnrule’ with matching
pattern ’*’ (= simply anything will do) comes last.
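This first-match-wins behavior has the same semantics as a shell case statement, which also stops at the first matching pattern; a sketch (function and rule labels invented):

```shell
#!/bin/sh
# First-match semantics, sketched with 'case' (which also stops at the
# first matching pattern, just like pavuk's -fnrules chain):
pick_rule() {
  case "$1" in
    */index.php/*:*)   echo 'rule 1: %h:%r/%d/%n%s.%X' ;;
    */index.php[/?]*)  echo 'rule 2: %h:%r/%d/%b%s.%X' ;;
    *)                 echo 'rule 3: %h:%r/%d/%b%s.%Y' ;;
  esac
}

pick_rule 'http://cat.wiki.avid.com/index.php/Image:Rig.jpg'   # rule 1
pick_rule 'http://cat.wiki.avid.com/index.php/Manual'          # rule 2
pick_rule 'http://cat.wiki.avid.com/style.css'                 # rule 3
```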
The second ’-fnrules’ line has only a few changes to its
template, compared to the first:
(1st) -fnrules F ’*/index.php/*:*’ ’%h:%r/%d/%n%s.%X’
(2nd) -fnrules F ’*/index.php[/?]*’ ’%h:%r/%d/%b%s.%X’
where %b is replaced with the basename of the document (without the
extension) so that the URL query section of the URLs matching the 2nd
’-fnrules’ will be discarded for the filename, while the 1st
’-fnrules’ will include that (-tr_chr_chr/-tr_str_str transformed)
part instead.
The third ’-fnrules’ option:
-fnrules F ’*’ ’%h:%r/%d/%b%s.%Y’
is also interesting, because its template includes ’%Y’ instead
of ’%X’, where the manual tells us this about ’%Y’:
“ %Y is replaced with the file extension if one is available. Otherwise,
the default extension assigned to the MIME type of the document is used
instead.” Which means ’%Y’ is the opposite of
’%X’ in terms of precedence between the ’URL-derived
filename extension’ and the MIME type derived filename extension:
’%X’ will have a MIME type related filename extension
’win’ over the extension ripped from the URL string, while
’%Y’ will act just the other way around. Hence, ’%Y’
will only use the MIME type filename extension if there’s no
’.’ dot in the filename section of the URL:
site.com/index.php
would keep its ’.php’, while
site.com/dir-with-extension.ext/no-extension-here
would cause pavuk to look up the related MIME type filename extension
instead (notice that the filename section of the URL does not come with
a ’.’ dot!).
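The difference boils down to which extension gets first pick, which can be sketched with shell default-value expansions (variable names invented; an empty value means ’not available’):

```shell
#!/bin/sh
# Sketch of the %X vs. %Y precedence (variable names invented):
#   %X: MIME-type extension first, URL extension as fallback.
#   %Y: URL extension first, MIME-type extension as fallback.
url_ext='php'     # extension ripped from the URL ('' if none)
mime_ext='html'   # extension looked up via the MIME type ('' if none)

echo "%X picks: ${mime_ext:-$url_ext}"   # %X picks: html
echo "%Y picks: ${url_ext:-$mime_ext}"   # %Y picks: php
```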
... you’ve travelled far, but now we have covered all the commandline
options which were relevant to the case at hand: spider a Mediawiki-based
website for off-line perusal.
Along the way, you’ve had a whiff of the power of pavuk, while I hope
you’ve found several bits that may be handy in your own usage of pavuk. I
suggest you check out the other sections of the manual, forgive it its few
grammatical errors as it was originally written by a non-native speaker, and
enjoy pavuk for its worth: a darn powerful web spider and test machine. (Yes, I
have used it to perform performance and coverage analysis on web sites with
this tool. Check out the Gatling gun of web access: the -hammer mode. But
that’s a whole different story.)
I did intentionally not cover the very important diagnostics
commandline options in this example, as that would have stretched your
endurance as a reader beyond the limit. Perusing the ’-debug /
-debug_level’ log output is subject matter to fill a book. Maybe another
time.
Take care and enjoy the darndest best web spider out there. And it’s
Open Source, so do as I did: grab the source if the tool doesn’t
completely fit your needs already, and improve it yet further!
Look into ChangeLog file for more information about new
features in particular versions of pavuk.
Main development: Ondrejicka Stefan
Look into CREDITS file of sources for additional information.
pavuk is available from http://pavuk.sourceforge.net/