pavuk - HTTP, HTTP over SSL, FTP, FTP over SSL and Gopher recursive document
retrieval program
pavuk [-X] [-x] [-with_gui] [-runX] [-[no]bg] [-[no]prefs] [-h] [-help] [-v]
[-version]
pavuk [ -mode {normal | resumeregets | singlepage | singlereget | sync |
dontstore | ftpdir | mirror} ] [-X] [-x] [-with_gui] [-runX] [-[no]bg] [-[no]prefs]
[-[no]progress] [-[no]stime] [ -xmaxlog $nr ] [ -logfile $file ] [
-slogfile $file ] [ -auth_file $file ] [ -msgcat $dir ] [
-language $str ] [ -gui_font $font ] [-quiet/-verbose] [-[no]read_css]
[-[no]read_msie_cc] [-[no]read_cdata] [-[no]read_comments] [ -cdir $dir ] [
-scndir $dir ] [ -scenario $str ] [ -dumpscn $filename ] [
-dumpdir $dir ] [ -dumpcmd $filename ] [ -l $nr ] [ -lmax
$nr ] [ -dmax $nr ] [ -leave_level $nr ] [ -maxsize $nr ] [
-minsize $nr ] [ -asite $list ] [ -dsite $list ] [ -adomain
$list ] [ -ddomain $list ] [ -asfx $list ] [ -dsfx $list ]
[ -aprefix $list ] [ -dprefix $list ] [ -amimet $list ] [ -dmimet
$list ] [ -pattern $pattern ] [ -url_pattern $pattern ] [
-rpattern $regexp ] [ -url_rpattern $regexp ] [ -skip_pattern
$pattern ] [ -skip_url_pattern $pattern ] [ -skip_rpattern $regexp
] [ -skip_url_rpattern $regexp ] [ -newer_than $time ] [ -older_than
$time ] [ -schedule $time ] [ -reschedule $nr ]
[-[dont_]leave_site] [-[dont_]leave_dir] [ -http_proxy $site[:$port] ] [
-ftp_proxy $site[:$port] ] [ -ssl_proxy $site[:$port] ] [ -gopher_proxy
$site[:$port] ] [-[no]ftp_httpgw] [-[no]ftp_dirtyproxy] [-[no]gopher_httpgw]
[-[no]FTP] [-[no]HTTP] [-[no]SSL] [-[no]Gopher] [-[no]FTPdir] [-[no]CGI] [-[no]FTPlist]
[-[no]FTPhtml] [-[no]Relocate] [-[no]force_reget] [-[no]cache] [-[no]check_size]
[-[no]Robots] [-[no]Enc] [ -auth_name $user ] [ -auth_passwd $pass ] [
-auth_scheme {1/2/3/4/user/Basic/Digest/NTLM} ] [-[no_]auth_reuse_nonce] [
-http_proxy_user $user ] [ -http_proxy_pass $pass ] [ -http_proxy_auth
{1/2/3/4/user/Basic/Digest/NTLM} ] [-[no_]auth_reuse_proxy_nonce] [
-ssl_key_file $file ] [ -ssl_cert_file $file ] [ -ssl_cert_passwd
$pass ] [ -from $email ] [-[no]send_from] [ -identity $str ]
[-[no]auto_referer] [-[no]referer] [-[no]persistent] [ -alang $list ] [
-acharset $list ] [ -retry $nr ] [ -nregets $nr ] [ -nredirs
$nr ] [ -rollback $nr ] [ -sleep $nr ] [ -[no]rsleep ] [ -timeout
$nr ] [ -rtimeout $nr ] [ -wtimeout $nr ] [-[no]preserve_time]
[-[no]preserve_perm] [-[no]preserve_slinks] [ -bufsize $nr ] [ -maxrate
$nr ] [ -minrate $nr ] [ -user_condition $str ] [ -cookie_file
$file ] [-[no]cookie_send] [-[no]cookie_recv] [-[no]cookie_update] [
-cookies_max $nr ] [ -disabled_cookie_domains $list ] [ -disable_html_tag
$TAG,[$ATTRIB][;...] ] [ -enable_html_tag $TAG,[$ATTRIB][;...] ] [
-tr_del_chr $str ] [ -tr_str_str $str1 $str2 ] [ -tr_chr_chr
$chrset1 $chrset2 ] [ -index_name $str ] [-[no]store_index] [
-store_name $str ] [-[no]debug] [ -debug_level $level ] [ -browser
$str ] [ -urls_file $file ] [ -file_quota $nr ] [ -trans_quota
$nr ] [ -fs_quota $nr ] [-enable_js/-disable_js] [ -fnrules $t
$m $r ] [ -mime_type_file $file ] [-[no]store_info]
[-[no]all_to_local] [-[no]sel_to_local] [-[no]all_to_remote] [ -url_strategy
$strategy ] [-[no]remove_adv] [ -adv_re $RE ] [-[no]check_bg]
[-[no]send_if_range] [ -sched_cmd $str ] [-[no]unique_log] [ -post_cmd
$str ] [ -ssl_version $v ] [-[no]unique_sslid] [ -aip_pattern $re
] [ -dip_pattern $re ] [-[no]use_http11] [ -local_ip $addr ] [ -request
$req ] [ -formdata $req ] [ -httpad $str ] [ -nthreads $nr
] [-[no]immesg] [ -dumpfd {$nr | @[@]$filepath } ] [ -dump_urlfd
{$nr | @[@]$filepath } ] [-[no]unique_name]
[-[dont_]leave_site_enter_dir] [ -max_time $nr ] [-[no]del_after]
[-[no]singlepage] [-[no]dump_after] [-[no]dump_response] [-[no]dump_request] [
-auth_ntlm_domain $str ] [ -auth_proxy_ntlm_domain $str ] [ -js_pattern
$re ] [ -follow_cmd $str ] [-[no]retrieve_symlink] [ -js_transform
$p $t $h $a ] [ -js_transform2 $p $t
$h $a ] [ -ftp_proxy_user $str ] [ -ftp_proxy_pass $str ]
[-[dont_]limit_inlines] [ -ftp_list_options $str ] [-[no]fix_wuftpd_list]
[-[no]post_update] [ -info_dir $dir ] [ -mozcache_dir $dir ] [ -aport
$list ] [ -dport $list ] [-[no]hack_add_index] [ -default_prefix
$str ] [ -ftp_login_handshake $host $handshake ] [ -js_script_file
$file ] [ -dont_touch_url_pattern $pat ] [ -dont_touch_url_rpattern
$pat ] [ -dont_touch_tag_rpattern $pat ] [ -tag_pattern $tag
$attrib $url ] [ -tag_rpattern $tag $attrib $url ] [
-nss_cert_dir $dir ] [-[no]nss_accept_unknown_cert]
[-nss_domestic_policy/-nss_export_policy] [-[no]verify] [ -tlogfile $file ] [
-trelative {object | program} ] [ -tp FQDN[:port] ] [ -transparent_proxy
FQDN[:port] ] [ -tsp FQDN[:port] ] [ -transparent_ssl_proxy
FQDN[:port] ] [-[not]sdemo] [-noencode] [ -[no]ignore_chunk_bug ] [ -hammer_mode
$nr ] [ -hammer_threads $nr ] [ -hammer_flags $nr ] [ -hammer_ease
$nr ] [ -hammer_rtimeout $nr ] [ -hammer_repeat $nr ] [
-[no]log_hammering ] [ -hammer_recdump {$nr | @[@]$filepath } ] [
URLs ]
pavuk [-mode {normal | singlepage | singlereget}] [ -base_level $nr
]
pavuk [-mode sync] [ -ddays $nr ] [ -subdir $dir ]
[-[no]remove_old]
pavuk [-mode resumeregets] [ -subdir $dir ]
pavuk [-mode linkupdate] [ -cdir $dir ] [ -subdir $dir ] [
-scndir $dir ] [ -scenario $str ]
pavuk [-mode reminder] [ -remind_cmd $str ]
pavuk [-mode mirror] [ -subdir $dir ] [-[no]remove_old]
[-[no]remove_before_store] [-[no]always_mdtm]
This manual page describes how to use pavuk.
Pavuk can be used to mirror the contents of Internet/intranet servers and to maintain
copies in a local document tree. Pavuk stores retrieved documents in locally mapped
disk space. The structure of the local tree mirrors that of the remote server. Each
supported service (protocol) has its own sub-directory in the local tree. Each
referenced server has its own sub-directory under the respective protocol
sub-directory, named after the server followed by the port number on which the service
resides, separated by a delimiter character that can be changed. With the option
-fnrules you can change the default layout of the local document tree without losing
link consistency.
With pavuk it is possible to have up-to-date copies of remote documents in
the local disk space.
As of version 0.3pl2, pavuk can automatically restart broken connections, and reget
partial content from an FTP server (which must support the REST command), from a
properly configured HTTP/1.1 server, or from an HTTP/1.0 server which supports
Ranges.
As of version 0.6 it is possible to handle configurations via so-called scenarios.
The best way to create such a configuration file is to use the X Window interface and
simply save the created configuration. The other way is to use the -dumpscn switch.
As of version 0.7pl1 it is possible to store authentication information into an
authinfo file, which pavuk can then parse and use.
As of version 0.8pl4 pavuk can fetch documents for use in a local proxy/cache server
without storing them in the local document tree.
As of version 0.9pl4 pavuk supports SOCKS (4/5) proxies if you have the
required libraries.
As of version 0.9pl12 pavuk can preserve permissions of remote files and symbolic
links, so it can be used for powerful FTP mirroring.
Pavuk releases starting with 0.9.36 support dumping commands to a specific file
(see the -dumpdir and -dumpcmd arguments).
Pavuk supports SSL connections to FTP servers if you specify an ftps:// URL instead of
ftp://.
Pavuk can automatically handle file names containing characters that are unsafe for
the file system. This is currently implemented only on the Win32 platform, and the
behavior is hard-coded.
Pavuk can use the HTTP/1.1 protocol for communication with HTTP servers. It can use
persistent connections, so a single TCP connection can transfer several documents
without being closed. This feature saves network bandwidth and also speeds up network
communication.
Pavuk can make configurable POST requests to HTTP servers and also supports file
uploading via HTTP POST requests.
Pavuk can automatically fill in HTML forms it finds, if the user supplies data for
the form fields beforehand with the -formdata option.
Pavuk can run a configurable number of concurrent downloading threads when compiled
with multithreading support.
Pavuk 0.9pl128 introduced JavaScript bindings for performing complicated tasks
(e.g. decision making, filename transformation) which require more flexibility than a
regular, non-scriptable program can provide.
pavuk 0.9.36 introduced the optional multiplier suffixes K, M or G for numeric
parameter values of command line options. These multipliers represent the ISO
multipliers Kilo(1000), Mega(1000000) and Giga(1.0E9), unless otherwise specified
(some command line options relate to memory or disk sizes in either bytes or kBytes,
where these multipliers are instead processed as the nearest power of two: K(1024),
M(1048576) or G(1073741824)).
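The two suffix interpretations above can be sketched as follows. This is an
illustrative, standalone Python sketch (the function name parse_size is hypothetical,
not part of pavuk) that simply demonstrates the ISO versus power-of-two multiplier
tables:

```python
def parse_size(value, binary=False):
    """Parse a numeric option value with an optional K/M/G suffix.

    ISO multipliers (1000-based) apply by default; options that measure
    memory or disk sizes use the nearest powers of two instead.
    """
    iso = {"K": 10**3, "M": 10**6, "G": 10**9}
    pow2 = {"K": 2**10, "M": 2**20, "G": 2**30}
    table = pow2 if binary else iso
    suffix = value[-1].upper()
    if suffix in table:
        return int(value[:-1]) * table[suffix]
    return int(value)

print(parse_size("5K"))               # 5000
print(parse_size("5K", binary=True))  # 5120
```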
- HTTP
-
http://[[user][:password]@]host[:port][/document]
[[user][:password]@]host[:port][/document]
- HTTPS
-
https://[[user][:password]@]host[:port][/document]
ssl[.domain][:port][/document]
- FTP
-
ftp://[[user][:password]@]host[:port][/relative_path][;type=x]
ftp://[[user][:password]@]host[:port][//absolute_path][;type=x]
ftp[.domain][:port][/document][;type=x]
- FTPS
-
ftps://[[user][:password]@]host[:port][/relative_path][;type=x]
ftps://[[user][:password]@]host[:port][//absolute_path][;type=x]
ftps[.domain][:port][/document][;type=x]
- Gopher
-
gopher://host[:port][/type[document]]
gopher[.domain][:port][/type[document]]
- HTTP
-
http://[[user][:password]@]host[:port][/document][?query]
to
http/host_port/[document][?query]
- HTTPS
-
https://[[user][:password]@]host[:port][/document][?query]
to
https/host_port/[document][?query]
- FTP
-
ftp://[[user][:password]@]host[:port][/path]
to
ftp/host_port/[path]
- FTPS
-
ftps://[[user][:password]@]host[:port][/path]
to
ftps/host_port/[path]
- Gopher
-
gopher://host[:port][/type[document]]
to
gopher/host_port/[type[document]]
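The URL-to-path mappings above can be approximated in a few lines. This illustrative
Python sketch is not pavuk's actual code; in particular, the default-port values and
the exact host/port delimiter handling are assumptions:

```python
from urllib.parse import urlsplit

def local_path(url):
    """Map a URL onto pavuk's default local tree layout:
    scheme/host_port/document (illustrative approximation)."""
    parts = urlsplit(url)
    # Assumed default ports when the URL does not name one explicitly.
    port = parts.port or {"http": 80, "https": 443, "ftp": 21,
                          "ftps": 990, "gopher": 70}[parts.scheme]
    doc = parts.path.lstrip("/")
    if parts.query:
        doc += "?" + parts.query
    return "%s/%s_%d/%s" % (parts.scheme, parts.hostname, port, doc)

print(local_path("http://example.com/docs/index.html"))
```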
Note
Pavuk uses the string with which it queries the target server as the name of the
resulting file. This file name may, in some cases, contain punctuation characters such
as $, ?, =, & etc. Such punctuation can cause problems when you browse the downloaded
files with your browser, process them with shell scripts, or view them with file
management utilities that reference the file name. If you believe this may be causing
problems for you, you can remove all punctuation from the result file name with the
option -tr_del_chr [:punct:] or with the other file name adjustment options
(-tr_str_str and -tr_chr_chr ).
The order in which these URL to file name conversions are applied is as follows:
-tr_str_str is applied first, followed by -tr_del_chr , while
-tr_chr_chr comes last.
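The documented ordering of the three conversions can be illustrated with a small
sketch. The function and parameter names here are hypothetical, and the [:punct:]
class is shown as an explicit character set for simplicity:

```python
def convert_name(name, str_pairs=(), del_chars="", chr_from="", chr_to=""):
    """Apply pavuk's three name conversions in their documented order:
    -tr_str_str first, then -tr_del_chr, then -tr_chr_chr."""
    for old, new in str_pairs:                                 # -tr_str_str
        name = name.replace(old, new)
    name = name.translate({ord(c): None for c in del_chars})   # -tr_del_chr
    name = name.translate(str.maketrans(chr_from, chr_to))     # -tr_chr_chr
    return name

# Remove shell-unfriendly punctuation, then map spaces to underscores.
print(convert_name("page?id=1&x=2 copy", del_chars="?=&",
                   chr_from=" ", chr_to="_"))
```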
All options are case insensitive.
-
Mode
-
Help
-
Indicate/Logging/Interface options
-
Netli options
-
Special start
-
Scenario/Task options
-
Directory options
-
Preserve options
-
Proxy options
-
Proxy authentication
-
Protocol/Download Option
-
Authentication
-
Site/Domain/Port Limitation Options
-
Limitation Document properties
-
Limitation Document name
-
Limitation Protocol Option
-
Other Limitation Options
-
JavaScript support
-
Cookie
-
HTML rewriting engine tuning options
-
File name / URL Conversion Option
-
Hammer mode options: load testing web sites
-
Other Options
- -mode {normal, linkupdate, sync, singlepage, singlereget, resumeregets,
dontstore, ftpdir, mirror, reminder}
-
Set operation mode.
- normal
-
retrieve documents recursively
- linkupdate
-
update remote URLs in local HTML documents to local URLs if these URLs exist
in the local tree
- sync
-
synchronize remote documents with local tree (if a local copy of a document
is older than remote, the document is retrieved again, otherwise nothing
happens)
- singlepage
-
the URL is retrieved as a single page with all inline objects (pictures, sounds ...);
this mode is now obsolete, use the -singlepage option instead.
- resumeregets
-
pavuk scans the local tree for files that were not retrieved fully and
retrieves them again (uses partial get if possible)
- singlereget
-
get URL until it is retrieved in full
- dontstore
-
transfer page from server, but don’t store it to the local tree. This
mode is suitable for fetching pages that are held in a local proxy/cache
server.
- reminder
-
used to inform the user about changed documents
- mirror
-
similar to the ’sync’ mode, but will automatically remove local
documents which do not exist anymore on the remote site. This mode will make an
exact copy of the remote site, including keeping the file names intact as much
as possible.
- ftpdir
-
used to list the contents of FTP directories
The default operation mode is normal.
- -h, -help
-
Print long verbose help message
- -v, -version
-
Show version information and feature set configuration at compilation time.
Feature : Developer Debug Build
Short description : Identifies this pavuk binary as compiled with debug
features enabled (-DDEBUG), such as extra run-time checks.
Affects : all
Feature : Debug features
Short description : This pavuk binary can show very detailed debug /
diagnostic information about the grabbing process, including message dumps,
etc.
Affects : -debug/-nodebug , -debug_level $level
Feature : GNU gettext internationalization of messages
Short description : Important messages can be shown in the local
language.
Affects : -language , -msgcat
Feature : flock() / fcntl() document locking
Short description : When you do not have this built in, you should refrain
from running multiple pavuk binaries and/or multithreaded sessions. Depending on
the built-in locking type (’flock()’, ’Win32 flock()’ or
’fcntl()’) you may or may not be able to safely use network shared
storage for the results of your session: fcntl() locking is assumed to be capable of
locking files on NFS shares, while flock() very probably won’t be able to do
that.
Affects : file I/O
Feature : Gtk GUI interface
Short description : You can use the built-in GUI.
Affects : -X , -with_gui , -runX , -prefs ,
-noprefs , -xmaxlog , -gui_font
Feature : GUI with URL tree preview
Short description : You can use the built-in GUI URL tree views.
Affects : -browser
Feature : HTTP and FTP over SSL; SSL layer implemented with OpenSSL /
SSLeay / NSS library
Short description : You can access SSL secured URLs / sites and proxies.
pavuk may have been built with either OpenSSL, SSLeay or Netscape SSL support. Some
features are only available with one implementation, others only with
another.
Affects : -noSSL , -SSL , -verify , -noverify ,
-noFTPS , -FTPS , -ssl_cert_passwd , -ssl_cert_file ,
-ssl_key_file , -ssl_cipher_list , -ssl_proxy ,
-ssl_version , -unique_sslid , -nounique_sslid ,
-nss_cert_dir , -nss_accept_unknown_cert ,
-nonss_accept_unknown_cert , -nss_domestic_policy ,
-nss_export_policy
Feature : Socks proxy support
Short description : You can use SOCKS4 and/or SOCKS5 proxies.
Affects :
Feature : file-system free space checking
Short description : You can use quotas to prevent your local storage from
filling up / overflowing.
Affects : -file_quota
Feature : optional regex patterns in -fnrules and -*rpattern
options
Short description : You can use regular expressions to help pavuk select and
filter content. pavuk also mentions which regex engine has been built in: POSIX,
Bell V8, BSD, GNU, PCRE or TRE
Affects : -rpattern , -skip_rpattern , -url_rpattern ,
-skip_url_rpattern , -remove_adv , -noremove_adv ,
-adv_re , -aip_pattern , -dip_pattern , -js_pattern ,
-js_transform , -js_transform2 , -dont_touch_url_rpattern ,
-dont_touch_tag_rpattern , -tag_rpattern
Feature : support for loading files from Netscape browser cache
Short description : You can access the private browser cache of Netscape
browsers.
Affects : -nscache_dir
Feature : support for loading files from Microsoft Internet Explorer
browser cache
Short description : You can access the private browser cache of Microsoft
Internet Explorer browsers.
Affects : -ie_cache
Feature : support for detecting whether pavuk is running as background
job
Short description : Progress reports, etc. will be disabled when pavuk is
running as a background task.
Affects : -check_bg , -nocheck_bg , -progress_mode ,
-verbose , -noverbose , -noquiet , -debug_level ,
-nodebug , -debug , ...
Feature : multithreading support
Short description : Allows pavuk to perform multiple tasks
simultaneously.
Affects : -hammer_threads , -nthreads , -immesg ,
-noimmesg
Feature : NTLM authorization support
Short description : You can access web servers which use NTLM-based access
security.
Affects : -auth_ntlm_domain , -auth_proxy_ntlm_domain
Feature : JavaScript bindings
Short description : You can use JavaScript-based filters and patterns.
Affects : -js_script_file
Feature : IPv6 support
Short description : Pavuk incorporates basic IPv6 support.
Affects :
Feature : HTTP compressed data transfer (gzip/compress/deflate
Content-Encoding)
Short description : pavuk supports compressed transmission formats (HTTP
Accept-Encoding) to reduce network traffic load.
Affects : -noEnc , -Enc
Feature : DoS support (a.k.a. ’chunky’ a.k.a. ’hammer
modes’)
Short description : this pavuk binary can be used to test
(’hammer’) your sites
Affects : -hammer_recdump , -log_hammering ,
-nolog_hammering , -hammer_threads , -hammer_mode ,
-hammer_flags , -hammer_ease , -hammer_rtimeout ,
-hammer_repeat
- -quiet
-
Don’t show any messages on the screen.
- -verbose
-
Force output messages to be shown on the screen (default)
- -progress/-noprogress
-
Show retrieval progress while running in the terminal (default is progress
off). When turned on, progress is shown in the format specified by the
-progress_mode setting.
Note
This option only has effect when pavuk is run in a console window.
- -progress_mode $nr
-
Specify how progress (see -progress ) will be shown to the user. Several
modes $nr are supported:
- 0
-
Report every run (-hammer_mode ) and URL fetched on a separate line.
Also show the download progress (bytes and percentage downloaded) while
fetching a document from the remote site. This is the most verbose progress
display. (default)
Example output:
URL[ 1]: 35(0) of 56 http://hobbelt.com/CAT-tuts/panther-l2-50pct.jpg
S: 10138 / 10138 B [100.0%] [R: 187.8 kB/s] [ET: 0:00:00] [RT: 0:00:00]
URL[ 1]: 38(0) of 56 http://hobbelt.com/CAT-tuts/get-started-cat-50pct.jpg
S: 5868 / 5868 B [100.0%] [R: 114.8 kB/s] [ET: 0:00:00] [RT: 0:00:00]
URL[ 2]: 34(0) of 56 http://hobbelt.com/CAT-tuts/CAT_Panther_CM2.avi
S: 8311 / 8311 kB [100.0%] [R: 4.7 MB/s] [ET: 0:00:01] [RT: 0:00:00]
URL[ 2]: 40(0) of 56 http://hobbelt.com/icons/knowspam-teeny-9.gif
S: 817 / 817 B [100.0%] [R: 20.3 kB/s] [ET: 0:00:00] [RT: 0:00:00]
- 1
-
Report every run (-hammer_mode ) in a concise format
(’=RUN=’) and display each URL fetched as a separate dot
’.’.
Example output:
............................................[URL] download: ERROR: HTTP document not found
- 2, 3, 4, 5, 6
-
These are identical to mode 1 , except in hammer mode while
hammering a site. Increase the number to see less progress info during a hammer
operation.
- -stime/-nostime
-
Show start and end time of transfer. (By default this information is not
shown.)
- -xmaxlog $nr
-
Maximum number of log lines in the Log widget. 0 means unlimited. This option is
available only when compiled with the GTK+ GUI. (default value is 0)
$nr specifies the size in bytes, unless postfixed with one of the
characters K, M or G, which imply the multipliers K(1024), M(1048576) or
G(1073741824).
- -logfile $file
-
File where all produced messages are stored.
- -unique_log/-nounique_log
-
When the log file specified with the -logfile option is already in use by
another process, try to generate a new unique name for the log file. (By default this
option is turned off.)
- -slogfile $file
-
File to store short logs in. This file contains one line of information per
processed document. This is meant to be used in connection with any sort of script
to produce some statistics, for validating links on your website, or for generating
simple site maps. Multiple pavuk processes can use this file concurrently, without
overwriting each other's entries. Record structure:
- PID
-
process id of pavuk process
- TIME
-
current time
- COUNTER
-
in the format current/total number of URLs
- STATUS
-
contains the type of the error: FATAL, ERR, WARN or OK
- ERRCODE
-
is the number code of the error (see errcode.h in pavuk sources)
- URL
-
of the document
- PARENTURL
-
first parent document of this URL ([none] when it doesn’t have a
parent)
- FILENAME
-
is the name of the local file the document is saved under
- SIZE
-
size of requested document if known
- DOWNLOAD_TIME
-
time taken to download this document, in the format
seconds.milli_seconds
- HTTPRESP
-
contains the first line of the HTTP server response
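Assuming the fields appear whitespace-separated in the order listed above, with
HTTPRESP as the trailing remainder (an assumption made for illustration; verify
against the actual output of your pavuk version), a short-log line can be parsed like
this:

```python
from collections import namedtuple

# Field order as documented above; the whitespace-separated layout and
# the sample values below are assumptions for illustration only.
Record = namedtuple("Record", "pid time counter status errcode url "
                              "parenturl filename size download_time httpresp")

def parse_slog_line(line):
    # HTTPRESP may contain spaces, so it absorbs the remainder.
    return Record(*line.split(maxsplit=10))

r = parse_slog_line("12345 1061890123 3/56 OK 0 http://example.com/a.html "
                    "[none] example.com_80/a.html 1024 0.250 HTTP/1.1 200 OK")
print(r.status, r.url)
```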
- -language $str
-
Native language that pavuk should use for communication with its user (works
only when there is a message catalog for that language). GNU gettext support
(for message internationalization) must also be compiled in. The default language is
taken from your NLS environment variables.
- -gui_font $font
-
Font used in the GUI interface. To list available X fonts use the
xlsfonts command. This option is available only when compiled with GTK+ GUI
support.
- -read_css/-noread_css
-
Enable or disable fetching objects mentioned in inline and external CSS style
sheets.
- -read_msie_cc/-noread_msie_cc
-
Enable or disable fetching objects mentioned in Microsoft Internet Explorer
Conditional Comments (a.k.a. MSIE CC’s).
- -read_cdata/-noread_cdata
-
Enable or disable fetching objects mentioned in <![CDATA[...]]>
sections.
- -read_comments/-noread_comments
-
Enable or disable fetching objects mentioned in HTML <!-- ... --> Comment
sections.
- -verify/-noverify
-
Enable or disable verifying server CERTS in SSL mode.
- -tlogfile $file
-
Turn on Netli logging with output to specified file.
- -trelative {object | program}
-
Make Netli timings relative to the start of the first object or the program.
- -tp FQDN[:port] , -transparent_proxy FQDN[:port]
-
When processing a URL, send the original request, but send it to the IP address at
FQDN
- -tsp FQDN[:port] , -transparent_ssl_proxy FQDN[:port]
-
When processing an HTTPS URL, send the original request, but send it to the IP address
at FQDN
- -sdemo/-notsdemo
-
Output in sdemo compatible format. This is only used by sdemo . (For now
it simply means output ’-1’ rather than ’*’ when
measurements are invalid.)
- -encode/-noencode
-
Do / do not escape characters that are "unsafe" in URLs. Default behavior
is to escape unsafe characters.
- -X, -x, -with_gui
-
Start program with X Window interface (if compiled with support for GTK+). By
default pavuk starts without GUI and behaves like a regular command-line tool.
- -runX
-
When used together with the -X option, pavuk starts processing URLs
immediately after the GUI window is launched. Without -X , this option has
no effect. Only available when compiled with GTK+
support.
- -bg/-nobg
-
This option allows pavuk to detach from its terminal and run in the background.
Pavuk will then not output any messages to the terminal. If you want to see
messages, you have to use the -logfile option to specify a file where
messages will be written. By default pavuk executes in the foreground.
- -check_bg/-nocheck_bg
-
Normally, programs sent into the background after being run in the foreground
continue to output messages to the terminal. If this option is activated, pavuk
checks whether it is running as a background job and will not write any messages to
the terminal in that case. After it becomes a foreground job again, it will resume
writing messages to the terminal in the normal way. This option is available only when
your system supports retrieving terminal info via the tc*() functions.
- -prefs/-noprefs
-
When you turn this option on, pavuk will preserve all settings when exiting, and
when you run pavuk with the GUI interface again, all settings will be restored. The
settings are stored in the ~/.pavuk_prefs file. By default pavuk
does not restore its options when started. This option is available only when compiled
with GTK+.
- -schedule $time
-
Execute pavuk at the time specified as parameter. The format of the $time
parameter is YYYY.MM.DD.hh.mm . You need properly configured scheduling
with the at command on your system to use this option. If the default
configuration (at -f %f %t %d.%m.%Y ) of the scheduling command doesn’t work
on your system, try adjusting it with the -sched_cmd option.
$time must be specified as local (a.k.a. ’wall clock’)
time.
- -reschedule $nr
-
Execute pavuk periodically, every $nr hours. You need properly configured
scheduling with the at command on your system to use this option.
- -sched_cmd $str
-
Command to use for scheduling. Pavuk explicitly supports scheduling with
at . $str should contain regular characters and macros, escaped by the
% character. Supported macros are:
- %f
-
for script filename
- %t
-
for time (in format HH:MM)
- ...
-
all macros as supported by the strftime (3)
function
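The macro expansion described above can be sketched as follows. This illustrative
Python version (the function name is hypothetical) expands %f and %t itself and leaves
the remaining macros to strftime:

```python
import time

def expand_sched_cmd(template, script, when):
    """Expand pavuk-style scheduling macros: %f -> script filename,
    %t -> HH:MM, and all remaining %-macros via strftime."""
    cmd = template.replace("%f", script)
    cmd = cmd.replace("%t", time.strftime("%H:%M", when))
    return time.strftime(cmd, when)

# A -schedule time of 2024.03.01.14.30 with the default command template:
when = time.strptime("2024.03.01.14.30", "%Y.%m.%d.%H.%M")
print(expand_sched_cmd("at -f %f %t %d.%m.%Y", "/tmp/pavuk_sched.sh", when))
# at -f /tmp/pavuk_sched.sh 14:30 01.03.2024
```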
- -urls_file $file
-
If you use this option, pavuk will read URLs from $file before it starts
processing. In this file, each URL must be on a separate line. After the last
URL, a single dot . followed by a LF (line-feed) character denotes the end.
Pavuk starts processing right after all URLs have been read. If
$file is given as the - character, standard input is
read.
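A -urls_file can be generated like this (a minimal sketch of the format described
above; the function name is illustrative):

```python
def write_urls_file(path, urls):
    """Write a pavuk -urls_file: one URL per line, terminated by a
    single dot followed by a line feed."""
    with open(path, "w") as f:
        for url in urls:
            f.write(url + "\n")
        f.write(".\n")

write_urls_file("urls.txt",
                ["http://example.com/", "http://example.com/a.html"])
print(open("urls.txt").read())
```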
- -store_info/-nostore_info
-
This option causes pavuk to store information about each document in a
separate file in the .pavuk_info directory. This file stores the
original URL from which the document was downloaded. For files
downloaded via the HTTP or HTTPS protocols, the whole HTTP response header is stored
there as well. I recommend using this option when you use options that change the
default layout of the local document tree, because this info file helps pavuk
map the local filename to the URL. This option is also very useful when different
URLs map to the same filename in the local tree. When this occurs, pavuk detects it
using the info files and prefixes the local name with numbers. By default, storing
of this extra information is disabled.
- -info_dir $dir
-
With this option you can set the location of a separate directory for storing the
info files created when the -store_info option is used. This is useful when
you don’t want to mix info files with regular document files in the
destination directory. The structure of the info files is preserved; they are just
stored in a different directory.
- -request $req
-
With this option you can specify extended information for starting URLs, such as
query data for POST or GET requests. The current
syntax of this option is:
URL:["]$url["] [METHOD:["]{GET|POST}["]] [ENCODING:["]{u|m}["]]
[FIELD:["]variable=value["]]
[COOKIE:["][variable=value;[...]]variable=value[;]["]]
[FILE:["]variable=filename["]]
[LNAME:["]local_filename["]]
- URL
-
specifies request URL
- METHOD
-
specifies request method for URL and is one of GET or POST.
- ENCODING
-
specifies encoding for request body data.
- m
-
is for multipart/form-data encoding
- u
-
is for application/x-www-form-urlencoded encoding
- FIELD
-
specifies field of request data in format variable=value. For encoding of
special characters in variable and value you can use same encoding as is used
in application/x-www-form-urlencoded encoding.
- COOKIE
-
specifies one or more cookies that are related to the specified URL. These
cookies will be used/transmitted by pavuk when this URL is accessed, thus
enabling pavuk to access URLs which require the use of specific cookies for a
proper response.
Note
The settings of command-line option -disabled_cookie_domains does
apply.
See the Cookie chapter for more info.
- FILE
-
specifies special field of query, which is used to specify file for POST
based file upload.
- LNAME
-
specifies localname for this request
When you need to use special characters inside the FIELD: and FILE: parts of a
request specification, you should use the application/x-www-form-urlencoded
encoding of characters. This means all non-ASCII characters, the quote character ("),
the space character ( ), the ampersand character (&), the percent character (%) and
the equals character (=) should be encoded in the form %xx ,
where xx is the hexadecimal representation of the character's ASCII value. So for
example the % character should be encoded as %25 .
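The encoding rule above can be sketched like this. This is an illustrative helper,
not pavuk code; encoding non-ASCII input per UTF-8 byte is an assumption made for the
sketch:

```python
def encode_field_value(value):
    """Encode a FIELD:/FILE: value per the rule above: non-ASCII bytes
    plus " space & % = become %xx (hexadecimal byte value)."""
    special = set('" &%=')
    out = []
    for byte in value.encode("utf-8"):
        ch = chr(byte)
        if byte > 127 or ch in special:
            out.append("%%%02X" % byte)
        else:
            out.append(ch)
    return "".join(out)

print(encode_field_value("50% off & more"))  # 50%25%20off%20%26%20more
```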
- -formdata $req
-
This option gives you the chance to specify contents for HTML forms found while
traversing the document tree. The current syntax of this option is the same as for the
-request option, but ENCODING: and METHOD: are meaningless
here. In URL: you have to specify the HTML form action URL,
which will be matched against action URLs found in processed HTML documents. If
pavuk finds an action URL which matches the one supplied in the -formdata option,
pavuk will construct a GET or POST request from the data supplied in this
option and from the default form field values supplied in the HTML document. Values
supplied on the command line take precedence over those supplied in the HTML file.
- -nthreads $nr
-
With this option you can specify how many concurrent threads will
download documents. By default pavuk runs 3 concurrent downloading threads.
This option is available only when pavuk is compiled to support
multithreading.
- -immesg/-noimmesg
-
Pavuk's default behavior when running multiple downloading threads is to buffer
all output messages in memory and flush the buffered data only when a thread
finishes processing a document. With this option you can change this behavior
to see each message immediately when it is produced. It is only useful when you
want to debug multithreading specifics.
This option is available only when pavuk is compiled to support
multithreading.
- -dumpfd $nr / -dumpfd @[@]$file
-
For scripting it is sometimes useful to download a document directly to a
pipe or variable instead of storing it in a regular file. In such cases you can use
this option to dump the data, for example, to stdout ( $nr = 1
).
Note
pavuk 0.9.36 and later releases also support the @$file
argument, where you can specify a file to dump the data to. The file path must be
prefixed by an ’@’ character. If you prefix the file path with a
second ’@’, pavuk will assume you wish to append to an already
existing file. Otherwise the file will be created/erased when pavuk starts.
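The {$nr | @[@]$filepath} argument convention can be sketched as follows (an
illustrative Python helper, not pavuk code):

```python
import os

def open_dump_target(arg):
    """Interpret a -dumpfd style argument: a plain number selects an
    already open file descriptor, @file creates/erases a file, and
    @@file appends to an existing file."""
    if arg.startswith("@@"):
        return open(arg[2:], "ab")      # append to an existing file
    if arg.startswith("@"):
        return open(arg[1:], "wb")      # created/erased at start
    return os.fdopen(int(arg), "wb")    # e.g. "1" -> stdout

f = open_dump_target("@dump.bin")
f.write(b"data")
f.close()
```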
- -dump_after/-nodump_after
-
When using the -dumpfd option with a multithreaded pavuk, each document must be
dumped in a single operation, because documents downloaded by multiple threads could
otherwise be interleaved. This option is also useful when you want to dump documents
after pavuk adjusts the links inside HTML documents.
- -dump_request/-nodump_request
-
This option has effect only when used with the -dumpfd option. It is used
to dump HTTP requests.
- -dump_response/-nodump_response
-
This option has effect only when used with the -dumpfd option. It is used
to dump HTTP response headers.
- -dump_urlfd $nr / -dump_urlfd @[@]$file
-
When you use this option, pavuk will output all URLs found in HTML
documents to file descriptor $nr . You can use this option, for example, to
extract all URLs, convert them to absolute URLs, and write them to stdout.
Note
pavuk 0.9.36 and later releases also support the @$file
argument, where you can specify a file to dump the data to. The file path must be
prefixed by an ’@’ character. If you prefix the file path with a
second ’@’, pavuk will assume you wish to append to an already
existing file. Otherwise the file will be created/erased when pavuk starts.
- -scenario $str
-
Name of the scenario to load and/or run. Scenarios are files with a structure
similar to the .pavukrc file and contain saved configurations. You
can use them for periodic mirroring. Parameters from scenarios specified on the
command line can be overridden by command-line parameters. To use this
option, you need to specify the scenario base directory with the option -scndir
.
- -dumpscn $filename
-
Store the actual configuration into a scenario file named $filename
. This is useful for quickly creating pre-configured scenarios for manual editing.
- -dumpcmd $str
-
File name where the command will be ’dumped’. To be able to use this
option, you need to specify the dump base directory with option -dumpdir
.
- -msgcat $dir
-
Directory which contains the message catalog for pavuk. If you do not have
permission to store a pavuk message catalog in the system directory, simply
recreate in your home directory the same directory structure that exists on
your system.
For example:
Your native language is German, and your home directory is /home/jano
.
You should first create the directory
/home/jano/locales/de/LC_MESSAGES/ , then put the German pavuk.mo there and
set -msgcat to /home/jano/locales/ . If the locale environment
variables are set properly, you will see pavuk speaking German. This option is
available only when pavuk was compiled with support for GNU gettext message
internationalization.
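The German-catalog example above can be sketched as follows. A temporary directory stands in for /home/jano, and the pavuk.mo file is assumed to exist already:

```shell
# Recreate the locale tree from the example under a scratch directory.
base="$(mktemp -d)"                     # stands in for /home/jano
mkdir -p "$base/locales/de/LC_MESSAGES"
# cp pavuk.mo "$base/locales/de/LC_MESSAGES/"   # install the German catalog here
echo "now run: pavuk -msgcat $base/locales/ ..."
```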
- -cdir $dir
-
Directory where all retrieved documents are stored. If not specified, the
current directory is used. If the specified directory doesn’t exist, it will
be created.
- -scndir $dir
-
Directory in which your scenarios are stored. You must use this option when you
are loading or storing scenario files.
- -dumpdir $dir
-
Directory in which your command dumps are stored. You must use this option when
you are storing command dump files using the -dumpcmd command.
- -preserve_time/-nopreserve_time
-
Store downloaded documents with the same modification time as on the remote site.
The modification time will be set only when such information is available (some FTP
servers do not support the MDTM command, and some documents on HTTP servers
are generated on the fly, so pavuk can’t retrieve their modification time).
By default, modification times are not preserved.
- -preserve_perm/-nopreserve_perm
-
Store downloaded documents with the same permissions as on the remote site. This
option has effect only when downloading files through the FTP protocol, and
assumes that the -FTPlist option is used. By default, permissions are not
preserved.
- -preserve_slinks/-nopreserve_slinks
-
Set symbolic links to point to exactly the same location as on the remote server;
don’t do any relocation. This option has effect only when downloading files
through the FTP protocol, and assumes that the -FTPlist option is used. By
default, symbolic links are not preserved and are retrieved as regular documents
with the full contents of the linked file.
For example, assume that on the FTP server ftp.xx.org there is a symbolic link
/pub/pavuk/pavuk-current.tgz , which points to
/tmp/pub/pavuk-0.9pl11.tgz . Pavuk will create the symbolic link
ftp/ftp.xx.org_21/pub/pavuk/pavuk-current.tgz .
If the option -preserve_slinks is used, this symbolic link will point to
/tmp/pub/pavuk-0.9pl11.tgz .
If the option -nopreserve_slinks is used, this symbolic link will point
to ../../tmp/pub/pavuk-0.9pl11.tgz .
- -retrieve_symlink/-noretrieve_symlink
-
Retrieve files behind symbolic links instead of replicating symlinks in local
tree.
- -http_proxy $site[:$port]
-
If this parameter is used, then all HTTP requests are routed through this proxy
server. This is useful if your site resides behind a firewall, or if you want to
use an HTTP proxy cache server. The default port number is 8080. Pavuk allows you
to specify multiple HTTP proxies (using multiple -http_proxy options) and
will rotate them in round-robin fashion, disabling proxies that return errors.
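A hypothetical invocation sketch (proxy host names and URL are placeholders): specify two proxies so pavuk can rotate between them and route around failures.

```shell
# All HTTP traffic goes through one of the two proxies, round-robin;
# a proxy that returns errors is disabled automatically.
pavuk -http_proxy proxy1.example.com:8080 \
      -http_proxy proxy2.example.com:3128 \
      http://www.example.com/
```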
- -nocache/-cache
-
Use this option whenever you want to get documents directly from the site and
not from your HTTP proxy cache server. By default, pavuk allows transfer of
document copies from the cache.
- -ftp_proxy $site[:$port]
-
If this parameter is used, then all FTP requests are routed through this proxy
server. This is useful when your site resides behind a firewall, or if you want to
use an FTP proxy cache server. The default port number is 22. Pavuk supports three
different types of proxies for FTP; see the options -ftp_httpgw and
-ftp_dirtyproxy . If neither of these options is used, pavuk assumes a
regular FTP proxy which connects to the remote FTP server with USER user@host
.
- -ftp_httpgw/-noftp_httpgw
-
The specified FTP proxy is an HTTP gateway for the FTP protocol. The default FTP
proxy type is a regular FTP proxy.
- -ftp_dirtyproxy/-noftp_dirtyproxy
-
The specified FTP proxy is an HTTP proxy which supports the CONNECT request
(pavuk then uses the full FTP protocol, except for active data connections). The
default FTP proxy type is a regular FTP proxy. If both -ftp_dirtyproxy and
-ftp_httpgw are specified, -ftp_dirtyproxy is preferred.
- -gopher_proxy $site[:$port]
-
Gopher gateway or proxy/cache server.
- -gopher_httpgw/-nogopher_httpgw
-
The specified Gopher proxy server is an HTTP gateway for the Gopher protocol. When
-gopher_proxy is set and this -gopher_httpgw option isn’t used,
pavuk uses the proxy as an HTTP tunnel, opening connections to Gopher servers
with the CONNECT request.
- -ssl_proxy $site[:$port]
-
SSL proxy (tunneling) server [such as CERN httpd with patch, or Squid] with the
CONNECT request enabled (at least on port 443). This option is available
only when compiled with SSL support (you need the SSLeay or OpenSSL libraries with
development headers).
- -http_proxy_user $user
-
User name for HTTP proxy authentication.
- -http_proxy_pass $pass
-
Password for HTTP proxy authentication.
- -http_proxy_auth {1/2/3/4/user/Basic/Digest/NTLM}
-
Authentication scheme for proxy access. This has a similar meaning to the
-auth_scheme option (see that option for more details). Default is
2 (Basic scheme).
- -auth_proxy_ntlm_domain $str
-
NT or LM domain used for authorization against an HTTP proxy server when the NTLM
authentication scheme is required. This option is available only when compiled
with the OpenSSL or libdes libraries.
- -auth_reuse_proxy_nonce/-noauth_reuse_proxy_nonce
-
When using the HTTP proxy Digest access authentication scheme, reuse the first
received nonce value in multiple subsequent requests.
- -ftp_proxy_user $user
-
User name for FTP proxy authentication.
- -ftp_proxy_pass $pass
-
Password for FTP proxy authentication.
- -ftp_passive
-
Use passive FTP when downloading via FTP.
- -ftp_active
-
Use active FTP when downloading via FTP.
- -active_ftp_port_range $min:$max
-
This option permits specifying the ports used for active FTP, which allows
easier firewall configuration since the range of ports can be restricted.
Pavuk will randomly choose a number from within the specified range until an
open port is found. Should no open port be found within the given range, pavuk
will fall back to a normal kernel-assigned port and output a message (debug level
net ).
The port range selected must be in the non-privileged range (e.g. greater than
or equal to 1024); it is STRONGLY RECOMMENDED that the chosen range be large
enough to handle many simultaneous active connections (for example, 49152-65534,
the IANA-registered ephemeral port range).
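A hypothetical invocation sketch (host and URL are placeholders; the $min:$max form follows the option syntax above): force active FTP and restrict the local data ports to the IANA ephemeral range.

```shell
# Active FTP with local data ports confined to 49152-65534, easing firewall rules.
pavuk -ftp_active -active_ftp_port_range 49152:65534 ftp://ftp.example.com/pub/
```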
- -always_mdtm/-noalways_mdtm
-
Force pavuk to always use the "MDTM" command to determine the file modification
time and never use cached times determined when listing the remote files.
- -remove_before_store/-noremove_before_store
-
Force unlinking of files before new content is stored to a file. This is
helpful when the local files are hardlinked into some other directory and the
hardlinks are checked after mirroring: every "broken" hardlink then indicates a
file update.
- -retry $nr
-
Set the number of attempts to transfer a processed document. The default is 1,
which means pavuk will retry once to get documents which failed on the first
attempt.
- -nregets $nr
-
Set the number of allowed regets on a single document, after a broken transfer.
Default value for this option is 2.
This option is ignored when running pavuk in singlereget mode, as pavuk
will then keep trying to reget the URL until it succeeds or a fatal error occurs.
If the server is found not to support regetting content and
-force_reget has not been specified, this is regarded as a fatal
error.
- -nredirs $nr
-
Set the number of allowed HTTP redirects (use this to prevent redirect loops).
The default value for this option is 5, which conforms to the HTTP specification.
- -rollback $nr
-
Set the number of bytes to discard from the already locally available content
(counted from the end of the file) if regetting. Default value for this option is
0.
- -force_reget/-noforce_reget
-
Force regetting of the whole document after a broken transfer when the
server doesn’t support retrieving partial content. Pavuk’s default
behavior is to stop getting documents which don’t allow restarting the
transfer from a specified position.
When forced reget’ing is turned on, pavuk will still start fetching each
URL by requesting a partial content download when (part of) the URL content is
already available locally. However, when such an attempt fails, pavuk will discard
the notion of requesting a partial content download (i.e. HTTP Range specification)
entirely for this URL only and attempt to download the content as a whole
instead.
Hence, in order for ’-force_reget’ to work as expected, each URL
must be spidered at least twice, i.e. the -nregets command-line option
should have a value of at least 1 (the default is 2 if that option is not
specified explicitly).
- -timeout $nr
-
Timeout for stalled connection attempts in milliseconds. The default timeout is
0, which means timeout checking is disabled.
$nr specifies the timeout in milliseconds, unless postfixed with one of
the characters S, M, H or D (in either upper or lower case), which select the
alternative time units S = seconds, M = minutes, H = hours or D = days.
- -rtimeout $nr
-
Timeout for data read operations in milliseconds: the connection is closed with
an error when no further data is received within this time limit. The default
timeout is 0, which means timeout checking is disabled.
$nr specifies the timeout in milliseconds, unless postfixed with one of
the characters S, M, H or D (in either upper or lower case), which select the
alternative time units S = seconds, M = minutes, H = hours or D = days.
- -wtimeout $nr
-
Timeout for data write operations in milliseconds: the connection is closed with
an error when no further data could be transmitted within this time limit. The
default timeout is 0, which means timeout checking is disabled.
$nr specifies the timeout in milliseconds, unless postfixed with one of
the characters S, M, H or D (in either upper or lower case), which select the
alternative time units S = seconds, M = minutes, H = hours or D = days.
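A hypothetical invocation sketch (placeholder URL) combining the three timeout options with the unit postfixes described above:

```shell
# 30-second connect timeout, 2-minute read timeout, 2-minute write timeout.
pavuk -timeout 30s -rtimeout 2m -wtimeout 2m http://www.example.com/
```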
- -noRobots/-Robots
-
This switch suppresses the use of the robots.txt standard, which is used
to restrict the access of Web robots to some locations on a web server. By
default, checking of robots.txt files on HTTP servers is enabled. Leave checking
enabled whenever you are downloading huge sets of pages with unpredictable
layout; this prevents you from upsetting server administrators :-).
- -noEnc/-Enc
-
This switch suppresses / enables the use of the gzip , compress or
deflate encoding in transfers.
Some servers are broken in that they report files with the MIME type
application/gzip or application/compress as gzip or compress encoded, when the
encoding should have been reported as ’untouched’, which is defined by
the keyword ’identity’ according to the HTTP standards. See for example
the HTTP/1.1 standard RFC2616, section 14.3, Accept-Encoding, and its counterpart,
section 14.11, Content-Encoding.
Turn this option off (-noEnc) when you don’t want to allow the server to
compress content for transmission: in that case, the server will transmit all
content as is, which, in the case of faulty servers mentioned above, means you will
receive the compressed file types exactly as they are stored on the server and no
undesirable decompression attempts will be made by pavuk.
By default, the option ’-Enc’ is enabled, as this allows for often
significant data transfer savings, resulting in fewer transmission costs and faster
web responses. Note: when you have a pavuk binary without libz support compiled in,
pavuk will never request content compression, as it won’t be able to
decompress those results. In that case, ’-Enc’ is identical to
’-noEnc’.
For improved functionality, make sure your pavuk binary comes with libz support.
Check your pavuk --version output for a mention of this feature
(’Content-Encoding’).
- -check_size/-nocheck_size
-
The option -nocheck_size should be used if you are trying to download pages from
an HTTP server which sends a wrong Content-Length: field in the MIME header
of the response. The default pavuk behavior is to check this field and complain
when something is wrong.
- -maxrate $nr
-
If you don’t want to give all your transfer bandwidth to pavuk, use this
option to set pavuk’s maximum transfer rate. This option accepts a floating
point number specifying the transfer rate in kB/s. If you want to find optimal
settings, you also have to experiment with the size of the read buffer (option
-bufsize ), because pavuk does flow control only at the application level.
By default, pavuk uses the full bandwidth.
- -minrate $nr
-
If you hate slow transfer rates, this option allows you to break transfers with
slow speed. You can set a minimum transfer rate, and if the connection gets
slower than the given rate, the transfer will be stopped. The minimum transfer
rate is given in kB/s. By default, pavuk doesn’t check this limit.
- -bufsize $nr
-
This option is used to specify the size of the read buffer (default size: 32kB).
If you have a very fast connection, you may increase the size of the buffer to get
a better read performance. If you need to decrease the transfer rate, you may need
to decrease the size of the buffer and set the maximum transfer rate with the
-maxrate option. This option accepts the size of the buffer in kB.
$nr specifies the size in kiloBytes, unless postfixed with one of the
characters K or M, which imply the corresponding (power-of-2) multipliers. That
means that the $nr value ’1K’ is 1 MegaByte, ’1M’ is
a whopping 1 GigaByte.
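The power-of-two multipliers can be checked with plain shell arithmetic. The base unit here is kB, as -bufsize uses:

```shell
# '1K' means 1024 kB; converting to bytes multiplies by 1024 again.
kb=1024
bytes=$((kb * 1024))
echo "$bytes"    # 1048576 bytes = 1 MegaByte
```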
- -fs_quota $nr
-
If you are running pavuk on a multiuser system, you may need to avoid filling up
your file system. This option lets you specify how much space must remain free.
If pavuk detects that the free space has dropped below this limit, it will stop
downloading files. Specify this quota in kB. The default value is 0, which means
this quota is not checked.
$nr specifies the size in kiloBytes, unless postfixed with one of the
characters K or M, which imply the corresponding (power-of-2) multipliers. That
means that the $nr value ’1K’ is 1 MegaByte, ’1M’ is
a whopping 1 GigaByte.
- -file_quota $nr
-
This option is useful when you want to limit the downloading of big files, but
still want to download at least $nr kilobytes of each big file. A big file
will be transferred, and when the transfer reaches the specified size, it will be
broken off. Such a document will be processed as if properly downloaded, so be
careful when using this option. By default, pavuk transfers documents in full.
$nr specifies the size in kiloBytes, unless postfixed with one of the
characters K or M, which imply the corresponding (power-of-2) multipliers. That
means that the $nr value ’1K’ is 1 MegaByte, ’1M’ is
a whopping 1 GigaByte.
- -trans_quota $nr
-
If you expect your selection to address a big amount of data, you can use this
option to limit the total amount of transferred data. The default is a transfer
unlimited by size.
$nr specifies the size in kiloBytes, unless postfixed with one of the
characters K or M, which imply the corresponding (power-of-2) multipliers. That
means that the $nr value ’1K’ is 1 MegaByte, ’1M’ is
a whopping 1 GigaByte.
- -max_time $nr
-
Set the maximum running time of the program. After this time is exceeded, pavuk
will stop downloading. The time is specified in minutes. The default value is 0,
which means the downloading time is not limited.
- -url_strategy $strategy
-
This option allows you to specify the downloading order for URLs in the document
tree. It accepts the following strings as parameters:
- level
-
orders URLs as they are loaded from HTML files (default)
- leveli
-
as the previous, but inline object URLs come first
- pre
-
inserts URLs from the current HTML document at the start, before others
- prei
-
as the previous, but inline object URLs come first
- -send_if_range/-nosend_if_range
-
Send the If-Range: header in HTTP requests. Some HTTP servers
(greetings, MS :-)) send different ETag: fields in different
responses for the same, unchanged document. This causes problems when pavuk
attempts to reget a document from such a server: pavuk remembers the old ETag
value and uses it in subsequent requests for this document. If the server compares
it with the new ETag value and they differ, it will refuse to send only part of
the document and start the download from scratch.
- -ssl_version $v
-
Set the required SSL protocol version for SSL communication. The default is
ssl23. This option is available only when compiled with SSL support.
- -unique_sslid/-nounique_sslid
-
This option can be used if you want to use a unique SSL session ID for all SSL
sessions. The default pavuk behavior is to negotiate a new session ID for each
connection. This option is available only when compiled with SSL support.
- -use_http11/-nouse_http11
-
This option switches between the HTTP/1.0 and HTTP/1.1 protocols used with
HTTP servers. Using HTTP/1.1 is recommended, because it is faster than HTTP/1.0
and uses less network bandwidth for initiating connections. Pavuk uses HTTP/1.1
by default.
- -local_ip $addr
-
Use this option when you want to use a specific network interface for
communication with other hosts. It is suitable for multihomed hosts with
several network interfaces. The address should be entered as a regular IP address
or as a host name.
- -identity $str
-
This option allows you to specify the content of the User-Agent: field of the
HTTP request. This is useful when scripts on a remote server return a different
document for the same URL depending on the browser, or when an HTTP server refuses
to serve documents to Web robots like pavuk. By default pavuk sends the string
pavuk/$VERSION in the User-Agent: field.
- -auto_referer/-noauto_referer
-
This option forces pavuk to send the HTTP Referer: header field with starting
URLs. The content of this field will be the URL itself. Using this option is
required when the remote server checks the Referer: field. By default pavuk
won’t send the Referer: field with starting URLs.
- -referer/-noreferer
-
This option enables or disables the transmission of the HTTP
Referer: header field. By default pavuk sends the Referer: field.
- -persistent/-nopersistent
-
This option enables or disables the use of persistent HTTP connections.
The default is to use persistent HTTP connections. Some servers have problems
with this type of connection, and this option allows getting data from such
servers as well.
- -httpad $str
-
In some cases you may want to add user defined fields to HTTP/HTTPS requests.
This option is exactly for that purpose. In $str you can directly
specify the content of an additional header. If you specify only the raw header,
it will be used only for starting requests. When you want this header sent with
every request while crawling, prefix the header with the + character.
To add multiple additional headers, you can repeatedly specify this command-line
option, once for each additional header.
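A hypothetical invocation sketch (placeholder URL and header values): one header for the starting request only, one (prefixed with +) for every request while crawling.

```shell
# The first header is sent only with starting requests; the '+' prefix on the
# second makes it accompany every request during the crawl.
pavuk -httpad "X-Started-By: cron" \
      -httpad "+Accept-Charset: utf-8" \
      http://www.example.com/
```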
- -page_sfx $list
-
Specify a collection of filename / web page extensions which are to be treated
as HTML pages, which is useful when scanning / hammering web sites which present
unusual mime types with their pages (see also: -hammer_mode ).
$list must contain a comma separated list of web page endings. The
default set is .html, .htm, .asp, .aspx, .php, .php3, .php4, .pl, .shtml
Note
When pavuk includes the chunky/hammer feature (see -hammer_mode ), any
web page which matches the endings specified in $list will be
registered in the hammering recording buffer and marked as a page starter
(’[STARTER]’): hammer time measurements are collected and reported on
a ’total page’ base (see -tlogfile ). This means that pavuk
assumes a user or web browser, which loads a page, will also load any style
sheets, scripts and images to properly display that page. All those items are
part of a ’total page’, but each page has only a single
’starting point’: the page itself.
To approximate ’total page’ timings instead of ’per
item’ timings, pavuk will mark the URLs which act as web page
’starting points’ as [STARTER]. Here pavuk assumes that each web page
is simple (i.e. does not use iframes, etc.), hence it is assumed that recognizing
the web page URL ending is sufficient.
Please note also that the ’endings’ in $list do not
have to be ’filename extensions’ per se: the ’endings’
are simply matched against the URL (with any ’?xxx=yyy’ query
elements removed) using a simple, case-insensitive comparison. Hence you may also
specify:
-page_sfx "index.html,index.htm"
when you only want any URLs which end with ’index.html’ or
’index.htm’ to be treated as ’page starters’ for timing
purposes.
- -del_after/-nodel_after
-
This option allows you to delete FILES from the REMOTE server when the download
has properly finished. By default this option is off.
- -FTPlist/-noFTPlist
-
When the option -FTPlist is used, pavuk will retrieve the content of FTP
directories with the FTP command LIST instead of NLST , so the same
listing will be retrieved as with the UNIX command "ls -l ".
This option is required if you need to preserve the permissions of remote files
or need to preserve symbolic links. Pavuk supports wide listings on FTP servers
with regular BSD or SYSV style "ls -l" directory listings, on FTP
servers with the EPLF listing format, VMS style listings,
DOS/Windows style listings and the Novell listing format. The default pavuk
behavior is to use NLST for FTP directory listings.
- -ftp_list_options $str
-
Some FTP servers require extra options to the LIST or NLST FTP commands to
show all files and directories properly. Be sure not to use any options
which could reformat the output of the listing. Especially useful is the -a
option, which forces the FTP server to also show dot files and directories; with
broken WuFTPd servers it also helps to produce full directory listings, not just
files.
- -fix_wuftpd / -nofix_wuftpd
-
This option is the result of several attempts to get the -remove_old
option working properly with the WuFTPd server when the -ftplist option is
used. The problem is that the FTP command LIST on WuFTPd doesn’t mind being
asked to list a non-existing directory, and indicates success in the FTP response
code. When you activate this option, pavuk uses an extra FTP command
( STAT -d dir ) to check whether the directory really exists. Don’t use
this option unless you are sure that you really need it!
- -ignore_chunk_bug / -noignore_chunk_bug
-
Ignore the IIS 5/6 RFC2616 chunked transfer mode server bug, which would
otherwise cause pavuk to fail and report downloads as ’possibly
truncated’. When pavuk reports this, you should specify this option and
retry the operation.
- -auth_file $file
-
File where you have stored authentication information for access to some
service. For the file structure, see the FILES section below.
- -auth_name $user
-
If you use this parameter, pavuk will transmit your authentication details
with each HTTP access when grabbing a document. For security reasons, use this
option only if you know that only one HTTP server will be accessed, or use the
-asite option to specify the sites for which you want to use authentication.
Otherwise your auth parameters will be sent to every accessed HTTP server.
- -auth_passwd $passwd
-
The value of this parameter is used as the password for authentication.
- -auth_scheme {1/2/3/4/user/Basic/Digest/NTLM}
-
This parameter specifies the authentication scheme to be used.
- 1, user
-
means user authentication scheme is used as defined in HTTP/1.0 or
HTTP/1.1. Password and user name are sent in plaintext format
(unencrypted).
- 2, Basic
-
means Basic authentication scheme is used as defined in HTTP/1.0.
Password and user name are sent BASE64 encoded.
This is the default setting.
- 3, Digest
-
means Digest access authentication scheme based on MD5 checksums as
defined in RFC2069.
- 4, NTLM
-
means NTLM proprietary access authentication scheme used by Microsoft
IIS or Proxy servers. When you use this scheme, you must also specify NT or LM
domain with option -auth_ntlm_domain .
This scheme is supported only when compiled with OpenSSL or libdes
libraries.
- -auth_ntlm_domain $str
-
NT or LM domain used for authorization against an HTTP server when the NTLM
authentication scheme is required.
This option is available only when compiled with OpenSSL or libdes
libraries.
- -auth_reuse_nonce/-noauth_reuse_nonce
-
When using the HTTP Digest access authentication scheme, reuse the first received
nonce value in subsequent requests. By default pavuk negotiates a nonce for each
request.
- -ssl_key_file $file
-
File with public key for SSL certificate (learn more from SSLeay or OpenSSL
documentation).
This option is available only when compiled with SSL support (you need
SSLeay or OpenSSL libraries and development headers).
- -ssl_cert_file $file
-
Certificate file in PEM format (learn more from SSLeay or OpenSSL
documentation).
This option is available only when compiled with SSL support (you need
SSLeay or OpenSSL libraries and development headers).
- -ssl_cert_passwd $str
-
Password used to generate certificate (learn more from SSLeay or OpenSSL
documentation).
This option is available only when compiled with SSL support (you need
SSLeay or OpenSSL libraries and development headers).
- -nss_cert_dir $dir
-
Configuration directory for NSS (Netscape SSL implementation) certificates.
Usually ~/.netscape (created by Netscape Communicator/Navigator) or a
profile directory below ~/.mozilla (created by the Mozilla browser). The
directory should contain the cert7.db and key3.db files.
If you use neither Mozilla nor Netscape, you must create these files with the
utilities distributed with the NSS libraries. Pavuk opens the certificate
database read-only.
This option is available only when pavuk is compiled with SSL support
provided by Netscape NSS SSL implementation.
- -nss_accept_unknown_cert/-nonss_accept_unknown_cert
-
By default pavuk rejects connections to an SSL server whose certificate is
not stored in the local certificate database (set by the -nss_cert_dir
option). You must explicitly force pavuk to allow connections to servers with
unknown certificates.
This option is available only when pavuk is compiled with SSL support
provided by Netscape NSS SSL implementation.
- -nss_domestic_policy/-nss_export_policy
-
Selects the set of ciphers allowed/disabled by USA export rules.
This option is available only when pavuk is compiled with SSL support
provided by Netscape NSS SSL implementation.
- -from $email
-
This parameter is used as the password when accessing an anonymous FTP server,
and is optionally inserted into the From: field of HTTP requests. If not
specified, pavuk derives it from the USER environment variable and the site
hostname.
- -send_from/-nosend_from
-
This option enables or disables sending the user identification entered with
the -from option as the FTP anonymous user password and in the From:
field of HTTP requests. By default this option is off.
- -ftp_login_handshake $host $handshake
-
When you need to use a nonstandard login procedure for some FTP servers,
you can use this option to change the default pavuk login procedure. To allow
more flexibility, you can assign the login procedure to a particular server or
to all servers. When $host is specified as an empty string ("" ),
the attached login procedure is assigned to all FTP servers except those having
their own login procedures assigned. In the $handshake parameter you
specify the exact login procedure as FTP commands followed by the expected FTP
response codes, delimited with backslash (\ ) characters.
For example, this is the default login procedure when logging in to a regular
FTP server without going through a proxy server:
USER %u\331\PASS %p\230
There are two commands followed by two response codes. After the USER command
pavuk expects FTP response code 331, and after the PASS command pavuk expects
response code 230 from the server. In the FTP commands you can use the following
macros, which will be replaced by their respective values:
- %u
-
user name used to access FTP server
- %p
-
password used to access FTP server
- %U
-
user name used to access FTP proxy server
- %P
-
password used to access FTP proxy server
- %h
-
hostname of FTP server
- %s
-
port number on which FTP server listens
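On the command line the handshake string must be quoted so the backslash delimiters reach pavuk intact. A hypothetical invocation (host and URL are placeholders; the handshake shown is the default one from above):

```shell
# Single quotes keep the '\' delimiters and '%' macros away from the shell.
pavuk -ftp_login_handshake ftp.example.com 'USER %u\331\PASS %p\230' \
      ftp://ftp.example.com/pub/
```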
- -asite $list
-
Specify comma separated list of allowed sites on which referenced documents are
stored. When this option is specified, pavuk will only follow links which point to
servers in this list.
The -dsite parameter is the opposite of this one. If both are used the
last occurrence of them is used and all previous occurrences are discarded.
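A hypothetical invocation sketch (server names and URL are placeholders): mirror a site while following links only to the two listed servers.

```shell
# Links pointing anywhere other than the two allowed sites are skipped.
pavuk -mode mirror -asite www.example.com,ftp.example.com http://www.example.com/
```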
- -dsite $list
-
Specify a comma separated list of disallowed sites.
The -asite parameter is the opposite of this one. If both are used the
last occurrence of them is used and all previous occurrences are discarded.
- -adomain $list
-
Specify a comma separated list of allowed domains on which referenced documents
are stored. When this option is specified, pavuk will only follow links which point
to domains in this list.
The -ddomain parameter is the opposite of this one. If both are used the
last occurrence of them is used and all previous occurrences are discarded.
- -ddomain $list
-
Specify a comma separated list of disallowed domains.
The -adomain parameter is the opposite of this one. If both are used the
last occurrence of them is used and all previous occurrences are discarded.
- -aport $list
-
In $list , specify a comma separated list of ports from which you allow
documents to be downloaded.
The -dport parameter is the opposite of this one. If both are used the
last occurrence of them is used and all previous occurrences are discarded.
- -dport $list
-
This option is used to specify denied ports. When this option is specified,
pavuk will not follow links which point to servers on these ports.
The -aport parameter is the opposite of this one. If both are used the
last occurrence of them is used and all previous occurrences are discarded.
- -amimet $list
-
List of comma separated allowed MIME types. You can also use wildcard patterns
with this option.
The -dmimet parameter is the opposite of this one. If both are used the
last occurrence of them is used and all previous occurrences are discarded.
- -dmimet $list
-
List of comma separated disallowed MIME types. You can also use wildcard
patterns with this option.
The -amimet parameter is the opposite of this one. If both are used the
last occurrence of them is used and all previous occurrences are discarded.
- -maxsize $nr
-
Maximum allowed size of document. This option is applied only when pavuk is able
to detect the document before starting the transfer. Default value is 0, and it
means this limit isn’t applied.
$nr specifies the size in bytes, unless postfixed with one of the
characters K, M or G, which imply the multipliers K(1024), M(1048576) or
G(1073741824).
- -minsize $nr
-
Minimum allowed size of document. This option is applied only when pavuk is able
to detect the document before starting the transfer. Default value is 0, and it
means this limit isn’t applied.
$nr specifies the size in bytes, unless postfixed with one of the
characters K, M or G, which imply the multipliers K(1024), M(1048576) or
G(1073741824).
- -newer_than $time
-
Allow only the transfer of documents with a modification time newer than that
specified in the parameter $time . The format of $time is:
YYYY.MM.DD.hh:mm . To apply this option, pavuk must be able to detect the
modification time of the document.
$time must be specified as local (a.k.a. ’wall clock’)
time.
- -older_than $time
-
Allow only the transfer of documents with a modification time older than that
specified in the parameter $time . The format of $time is:
YYYY.MM.DD.hh:mm . To apply this option, pavuk must be able to detect the
modification time of the document.
$time must be specified as local (a.k.a. ’wall clock’)
time.
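For instance (hypothetical URL), a run can be restricted to documents modified during 2004 using the time format above:

```shell
# Both bounds use the YYYY.MM.DD.hh:mm local wall-clock format.
pavuk -newer_than 2004.01.01.00:00 -older_than 2004.12.31.23:59 \
      http://www.example.com/archive/
```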
- -noCGI/-CGI
-
This switch prevents the transfer of dynamically generated parametric documents
served through a CGI interface. Such documents are detected by the occurrence of
a ? character inside the URL. The default pavuk behavior is to allow transfer of
URLs with query strings.
- -alang $list
-
This allows you to specify an ordered, comma separated list of preferred
natural languages. This option works only with the HTTP and HTTPS protocols,
using the Accept-Language: MIME field.
- -acharset $list
-
This option allows you to specify a comma separated list of preferred character
encodings for transferred documents. This works only with HTTP and HTTPS URLs,
and only if such document encodings are available on the destination server.
An example:
-acharset iso-8859-2,windows-1250,utf8
- -asfx $list
-
This parameter allows you to specify a set of comma separated suffixes used to
restrict the selection of documents which will be processed.
The -dsfx parameter is the opposite of this one. If both are used the
last occurrence of them is used and all previous occurrences are discarded.
- -dsfx $list
-
A set of comma separated suffixes that are used to specify which documents will
not be processed.
The -asfx parameter is the opposite of this one. If both are used the
last occurrence of them is used and all previous occurrences are discarded.
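A brief sketch of the two suffix filters in separate runs (URLs hypothetical); remember that when both appear, only the last occurrence takes effect:

```shell
# Run 1: process only HTML pages and common image types.
pavuk -asfx .html,.htm,.jpg,.png http://www.example.com/docs/

# Run 2: process everything except large archive formats.
pavuk -dsfx .zip,.tar.gz,.iso http://www.example.com/docs/
```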
- -aprefix $list / -dprefix $list
-
These two options allow you to specify a set of allowed or disallowed prefixes
of documents. They are mutually exclusive: when these options occur multiple
times in your configuration file and/or command line, the last occurrence will
be used and all previous ones discarded.
- -pattern $pattern
-
This option allows you to specify a wildcard pattern for documents. All
documents are tested against this pattern.
- -rpattern $reg_exp
-
This option is equivalent to the previous one, but uses regular expressions.
Available only on platforms which have a supported RE implementation.
- -skip_pattern $pattern
-
This option allows you to specify a wildcard pattern for documents that should
be skipped. All documents are tested against this pattern.
- -skip_rpattern $reg_exp
-
This option is equivalent to the previous one, but uses regular expressions.
Available only on platforms which have a supported RE implementation.
- -url_pattern $pattern
-
This option allows you to specify a wildcard pattern for URLs. All URLs are
tested against this pattern.
Example:
-url_pattern http://\*.idata.sk:\*/~ondrej/\*
this option enables all HTTP URLs in the idata.sk domain, on any port, which
are located under /~ondrej/ .
- -url_rpattern $reg_exp
-
This option is equivalent to the previous one, but uses regular expressions.
Available only on platforms which have a supported RE implementation.
- -skip_url_pattern $pattern
-
This option allows you to specify a wildcard pattern for URLs that should be
skipped. All URLs are tested against this pattern.
Example:
-skip_url_pattern ’*home*’
this option will force pavuk to skip all HTTP URLs which have ’home’
anywhere in their URL. This of course includes the query string part of the
URL, hence -skip_url_pattern ’*&action=edit*’ will direct pavuk to
skip any HTTP URLs whose query section has ’action=edit ’ as any
but the first query element (the first element would match
’*?action=edit* ’ instead).
- -skip_url_rpattern $reg_exp
-
This option is equivalent to the previous one, but uses regular expressions.
Available only on platforms which have a supported RE implementation.
- -aip_pattern $re
-
This option allows you to limit the set of transferred documents by server IP
address. IP addresses can be specified as regular expressions, so it is possible
to specify a set of IP addresses with one expression. Available only on
platforms which have a supported RE implementation.
- -dip_pattern $re
-
This option is similar to the previous one, but is used to specify a set of
disallowed IP addresses. Available only on platforms which have a supported RE
implementation.
- -tag_pattern $tag $attrib $url
-
A more powerful version of the -url_pattern option for more precise matching of
allowed URLs, based on an HTML tag name pattern, an HTML tag attribute name
pattern and a URL pattern. You can use wildcard patterns in all three parameters
of this option, thus something like -tag_pattern ’*’ ’*’ url_pattern
is equal to -url_pattern url_pattern . The $tag and $attrib
parameters are always matched against uppercase strings. For example, if you
want pavuk to follow only regular links, ignoring any style sheets, images,
etc., use the option -tag_pattern A HREF ’*’ .
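For instance, combining the example above into a full command line (start URL hypothetical):

```shell
# Follow only regular <A HREF=...> links; style sheets, images and
# other embedded objects are ignored. $tag/$attrib match uppercase.
pavuk -tag_pattern A HREF '*' http://www.example.com/
```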
- -tag_rpattern $tag $attrib $url
-
This is a variation on -tag_pattern . It uses regular expression patterns in
its parameters instead of the wildcard patterns used by the -tag_pattern
option.
- -noHTTP/-HTTP
-
This switch suppresses all transfers through the HTTP protocol. By default,
transfer through HTTP is enabled.
- -noSSL/-SSL
-
This switch suppresses all transfers through the HTTPS protocol (HTTP over
SSL). By default, transfer through HTTPS is enabled.
This option is available only when compiled with SSL support (you need the
SSLeay or OpenSSL libraries and development headers).
- -noGopher/-Gopher
-
Suppress all transfers through the Gopher Internet protocol. By default,
transfer through Gopher is enabled.
- -noFTP/-FTP
-
This switch prevents processing of documents located on FTP servers. By
default, transfer through FTP is enabled.
- -noFTPS/-FTPS
-
This switch prevents processing of documents located on FTP servers accessed
through SSL. By default, transfer through FTPS is enabled.
This option is available only when compiled with SSL support (you need the
SSLeay or OpenSSL libraries and development headers).
- -FTPhtml/-noFTPhtml
-
With the option -FTPhtml you can force pavuk to process HTML files downloaded
via the FTP protocol. By default pavuk won’t parse HTML files from FTP
servers.
- -FTPdir/-noFTPdir
-
Force recursive processing of FTP directories too. The default setting is to
deny recursive downloading from FTP servers, i.e. FTP directory trees will not be
traversed.
- -disable_html_tag $TAG,[$ATTRIB][;...] / -enable_html_tag
$TAG,[$ATTRIB][;...]
-
Enable or disable processing of particular HTML tags or attributes. By default
all supported HTML tags are enabled.
For example, if you don’t want to process any images, you should use the option
-disable_html_tag ’IMG,SRC;INPUT,SRC;BODY,BACKGROUND’ .
OTHER LIMITATION OPTIONS
- -subdir $dir
-
Sub-directory of the local tree directory, used to limit some of the modes
{sync, resumeregets, linkupdate} in their tree scan.
- -dont_leave_site/-leave_site
-
(Don’t) leave the starting site. By default pavuk may span hosts when recursing
through the WWW tree.
- -dont_leave_dir/-leave_dir
-
(Don’t) leave the starting directory. If the -dont_leave_dir option is used,
pavuk will stay only in the starting directory (including its own
sub-directories). By default pavuk may leave the starting directories.
- -leave_site_enter_dir/-dont_leave_site_enter_dir
-
If you are downloading a WWW tree which spans multiple hosts with huge trees,
you may want to allow downloading only of documents which are in the directory
hierarchy below the directory first visited on each site. To obtain this, use
the option -dont_leave_site_enter_dir . By default pavuk will also go to higher
directory levels on each site.
- -l $nr , -lmax $nr
-
Set the maximum allowed level of tree traversal. The default is 0, which means
that pavuk can traverse ad infinitum. As of version 0.8pl1, inline objects of
HTML pages are placed at the same level as the parent HTML page.
- -leave_level $nr
-
Maximum level of documents outside the site of the starting URL. The default is
0, which means this check is not applied.
- -site_level $nr
-
Maximum level of sites outside the site of the starting URL. The default is 0,
which means this check is not applied.
- -dmax $nr
-
Set the maximum allowed number of documents to process. The default value is 0,
which means no restriction on the number of processed documents.
- -singlepage/-nosinglepage
-
Using the option -singlepage allows you to transfer just an HTML page together
with all its inline objects (pictures, sounds, frame documents, ...). By default
single page transfer is disabled.
Note
This option renders the -mode singlepage option obsolete.
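A minimal sketch of single page mode (URL hypothetical):

```shell
# Fetch one page plus its inline pictures, frames, style sheets, ...
pavuk -singlepage http://www.example.com/index.html
```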
- -limit_inlines/-dont_limit_inlines
-
With this option you can control whether limiting options also apply to inline
objects (pictures, sounds, ...). This is useful when you want to download a
specified set of HTML pages with all their inline objects, without any
restrictions.
- -user_condition $str
-
Script or program name for the user’s own conditions. You can write any script
which decides, via its exit value, whether to download a URL or not. The script
receives from pavuk any number of options, with the following meanings:
- -url $url
-
processed URL
- -parent $url
-
any number of parent URLs
- -level $nr
-
level of this URL from starting URL
- -size $nr
-
size of requested URL
- -date $datenr
-
modification time of requested URL in format YYYYMMDDhhmmss
An exit status of 0 from the script or program means that the current URL
should be rejected; a nonzero exit status means that the URL should be
accepted.
Warning
Use user conditions only if required, because forking the script for each
checked URL causes a big slowdown.
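The calling convention above can be sketched as a small POSIX shell script; the level/size limits used here are illustrative assumptions, and a real script would be saved to a file whose path is passed via -user_condition:

```shell
#!/bin/sh
# Hypothetical -user_condition helper. pavuk invokes it with options such
# as -url, -parent, -level, -size and -date; exit status 0 rejects the
# URL, any non-zero status accepts it.
decide() {
    level=0 size=0
    while [ $# -gt 0 ]; do
        case "$1" in
            -level) level=$2; shift 2 ;;
            -size)  size=$2;  shift 2 ;;
            *)      shift ;;            # ignore -url, -parent, -date, ...
        esac
    done
    # Accept only URLs at most 3 levels deep and smaller than 1 MiB.
    if [ "$level" -le 3 ] && [ "$size" -lt 1048576 ]; then
        return 1                        # non-zero: accept
    fi
    return 0                            # zero: reject
}
# A real script would end with:  decide "$@"; exit $?
```

As the warning notes, pavuk forks this script once per checked URL, so keep it cheap.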
- -follow_cmd $str
-
This option allows you to specify a script or program which decides, via its
exit status, whether to follow URLs from the current HTML document. This script
will be called after the download of each HTML document. The script will get the
following options as its parameters:
- -url $url
-
URL of current HTML document
- -infile $file
-
local file where the HTML document is stored
An exit status of 0 from the script or program means that URLs from the current
document will be disallowed; any other exit status means that pavuk may follow
links from the current HTML document.
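Such a hook can be sketched in POSIX shell; the ’Index of’ heuristic is purely an illustrative assumption, and the script path would be passed via -follow_cmd:

```shell
#!/bin/sh
# Hypothetical -follow_cmd helper, called as: script -url $url -infile $file
# Exit status 0 forbids following links from this page; non-zero allows it.
should_follow() {
    infile=
    while [ $# -gt 0 ]; do
        case "$1" in
            -infile) infile=$2; shift 2 ;;
            *)       shift ;;           # ignore -url and its value
        esac
    done
    # Follow links only from pages that look like directory listings.
    if grep -qi 'index of' "$infile" 2>/dev/null; then
        return 1                        # non-zero: follow links
    fi
    return 0                            # zero: don't follow
}
# A real script would end with:  should_follow "$@"; exit $?
```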
Support for scripting languages like JavaScript or VBScript in pavuk is done in
a somewhat hacky way. There is no interpreter for these languages, so not
everything will work. All the support pavuk has for these scripting languages is
based on regular expression patterns specified by the user. Pavuk searches for
these patterns in DOM event attributes of HTML tags, in javascript:... URLs, in
inline scripts in HTML documents enclosed between <script></script> tags and in
separate JavaScript files. Support for scripting languages is available only
when pavuk is compiled with a proper regular expression library
(POSIX/GNU/PCRE/TRE).
- -enable_js/-disable_js
-
These options are used to enable or disable processing of the JavaScript parts
of HTML documents. You must enable this option to be able to use processing of
JavaScript patterns.
- -js_pattern $re
-
With this option you specify which patterns match the interesting parts of
JavaScript for extracting URLs. The parameter must be an RE pattern with exactly
one subpattern which matches the URL part precisely. For example, to match the
URL in the following type of JavaScript expression:
document.b1.src=’pics/button1_pre.jpg’
you can use this pattern
^document.[a-zA-Z0-9_]*.src[ \t]*=[ \t]*’(.*)’$
- -js_transform $p $t $h $a
-
This option is similar to the previous one, but you can use custom transform
rules for the URL parts of patterns and also specify the exact HTML tag and
attribute where to look for this pattern. $p is the pattern to match the
relevant part of the script. $t is a transform rule for the URL; in this
parameter the $x parts will be replaced by the x -th subpattern of the
$p pattern. The $h parameter is either the exact HTML tag, or "*" when the
rule applies to the JavaScript body of an HTML document, separate JavaScript
files, javascript: URLs or DOM event attributes, or "" (empty string) when it
applies to the JavaScript body of an HTML document or a separate JavaScript
file. The $a parameter is either the exact HTML attribute of the tag, or ""
(empty string) when the rule applies to the JavaScript body.
- -js_transform2 $p $t $h $a
-
This option is very similar to the previous one. The meaning of all parameters
is the same, except that the pattern $p may have only one substring, which
will be used in the transform rule $t . This is required to allow rewriting of
the URL parts of tags and scripts. This option can also be used to force pavuk
to recognize HTML tag/attribute pairs which pavuk does not support.
Use this option instead of -js_transform when you want to make sure pavuk
’rewrites’ the transformed URL in the content grabbed from a site and
stored on your local disc.
In other words: -js_transform is good enough when you only want to direct
pavuk to grab a specific URL which is not literally available in the content
already downloaded, while -js_transform2 does just that little bit more: it
also makes sure this newly created URL ends up in the content saved to disc, by
replacing the text matched by the first sub-expression.
Note
Make sure that the first sub-expression always matches some content,
because otherwise pavuk will display a warning and not rewrite the content, as it
could not detect where you wanted the replacement URL to go.
Note
Additional caveat: when your pavuk binary was built using an RE library which
does not support sub-expressions, pavuk will report an error and abort when any
of the -js_pattern , -js_transform or -js_transform2 command-line options is
specified.
- -cookie_file $file
-
File where cookie info is stored. This file must be in the Netscape cookie file
format (as generated by Netscape Navigator or Communicator ...).
- -cookie_send/-nocookie_send
-
Use collected cookies in HTTP/HTTPS requests. By default pavuk will not send
cookies.
- -cookie_recv/-nocookie_recv
-
Store cookies received in HTTP/HTTPS responses into the in-memory cookie cache.
By default pavuk will not remember received cookies.
- -cookie_update/-nocookie_update
-
Update the cookie file on disk and synchronize it with changes made by any
concurrent processes. By default pavuk will not update the cookie file on disk.
- -cookies_max $nr
-
Maximum number of cookies in the in-memory cookie cache. The default value is
0, which means no restriction on the number of cookies.
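Putting the cookie options together (the file path is a hypothetical example):

```shell
# Load a Netscape-format cookie file, send cookies with requests, cache
# up to 100 received cookies, and write changes back to the file.
pavuk -cookie_file ~/.pavuk_cookies -cookie_send -cookie_recv \
      -cookie_update -cookies_max 100 http://www.example.com/
```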
- -disabled_cookie_domains $list
-
Comma-separated list of cookie domains to which pavuk is not permitted to send
cookies stored in the cookie cache.
- -cookie_check/-nocookie_check
-
When receiving a cookie, check whether the cookie domain is equal to the domain
of the server which sends the cookie. By default pavuk checks whether the server
is setting cookies for its own domain; if it tries to set a cookie for a foreign
domain, pavuk will complain and reject such a cookie.
- -noRelocate/-Relocate
-
This switch prevents the program from rewriting relative URLs to absolute URLs
after the HTML document has been transferred. The default pavuk behavior is to
maintain link consistency of HTML documents: whenever an HTML document is
downloaded, pavuk rewrites all its URLs to point to the local document when that
is available, and to the remote document otherwise. After a document has been
properly downloaded, pavuk will update all the links in any HTML documents which
point at it.
- -all_to_local/-noall_to_local
-
This option forces pavuk to change all URLs inside an HTML document to local
URLs immediately after the document is downloaded. By default this option is
disabled.
- -sel_to_local/-nosel_to_local
-
This option forces pavuk to change all URLs inside an HTML document which
fulfill the conditions for download to local URLs immediately after the document
is downloaded. I recommend using this option when you are sure that the transfer
will proceed without any problems. This option can save a lot of processor time.
By default this option is disabled.
- -all_to_remote/-noall_to_remote
-
This option forces pavuk to change all URLs inside an HTML document to remote
URLs immediately after the document is downloaded. By default this option is
disabled.
- -post_update/-nopost_update
-
This option is especially designed to allow -fnrules rules based on the MIME
type of a document. It forces pavuk to generate local names for documents only
once pavuk knows the MIME type of the document. This has a big impact on the
engine that rewrites links inside HTML documents, and it breaks other options
for controlling the link rewriting engine. Use this option only when you know
what you are doing :-)
- -dont_touch_url_pattern $pat
-
This option serves to deny rewriting and processing of particular URLs in HTML
documents by the pavuk HTML rewriting engine. It accepts wildcard patterns to
specify such URLs. Matching is done against the untouched URL, so when the URL
is relative you must use a pattern which matches the relative URL, and when it
is absolute you must use an absolute pattern.
- -dont_touch_url_rpattern $pat
-
This option is a variation on the previous one. It uses regular expression
patterns for matching URLs instead of the wildcard patterns used by the
-dont_touch_url_pattern option. It is available only when pavuk is compiled
with support for regular expression patterns.
- -dont_touch_tag_rpattern $pat
-
This option is a variation on the previous one, but matching is done on the
full HTML tag, including the enclosing <>. It accepts regular expression
patterns and is available only when pavuk is compiled with support for regular
expression patterns.
- -tr_del_chr $str
-
All characters found in $str will be deleted from the local name of a
document. $str may contain escape sequences similar to those of the UNIX
tr (1) command:
- \n
-
newline (ASCII LF: 10(dec))
- \r
-
carriage return (ASCII CR: 13(dec))
- \t
-
horizontal tab space (ASCII TAB: 9(dec))
- \0xXX
-
hexadecimal ASCII value (1-byte range, but you can never specify ASCII NUL
(0(dec)), i.e. XX can be in the range ’01’ to
’FF’)
- [:upper:]
-
all uppercase letters (ASCII ’A’..’Z’)
- [:lower:]
-
all lowercase letters (ASCII ’a’..’z’)
- [:alpha:]
-
all letters (ASCII ’A’..’Z’ +
’a’..’z’)
- [:alnum:]
-
all letters and digits (ASCII ’A’..’Z’ +
’a’..’z’ + ’0’..’9’)
- [:digit:]
-
all digits (ASCII ’0’..’9’)
- [:xdigit:]
-
all hexadecimal digits (ASCII ’0’..’9’ +
’A’..’F’ + ’a’..’f’)
- [:space:]
-
all horizontal and vertical white-space (ASCII SPACE(’ ’,
32(dec)), TAB(9(dec)), LF(10(dec)), VT(11(dec)), FF(12(dec)), CR(13(dec)))
- [:blank:]
-
all horizontal white-space (ASCII SPACE(’ ’, 32(dec)),
TAB(9(dec)))
- [:cntrl:]
-
all control characters (ASCII 1(dec)..31(dec) + 127(dec))
- [:print:]
-
all printable characters including space (ASCII 32(dec)..126(dec))
- [:nprint:]
-
all non printable characters (ASCII 1(dec)..31(dec) +
127(dec)..255(dec))
- [:punct:]
-
all punctuation characters (ASCII 33(dec)..47(dec) + 58(dec)..64(dec) +
91(dec)..96(dec) + 123(dec)..126(dec)), in other words these characters:
! (Exclamation mark),
" (Quotation mark; " in HTML),
$ (Dollar sign),
% (Percent sign),
& (Ampersand),
’ (Closing single quote a.k.a. apostrophe),
( (Opening parentheses),
) (Closing parentheses),
* (Asterisk a.k.a. star, multiply),
+ (Plus),
, (Comma),
- (Hyphen, dash, minus),
. (Period),
/ (Slant a.k.a. forward slash, divide),
: (Colon),
; (Semicolon),
< (Less than sign; &lt; in HTML),
= (Equals sign),
> (Greater than sign; &gt; in HTML),
? (Question mark),
@ (At-sign),
[ (Opening square bracket),
\ (Reverse slant a.k.a. Backslash),
] (Closing square bracket),
^ (Caret a.k.a. Circumflex),
_ (Underscore),
‘ (Opening single quote),
{ (Opening curly brace),
| (Vertical line),
} (Closing curly brace),
~ (Tilde a.k.a. approximate))
- [:graph:]
-
all printable characters excluding space (ASCII 33(dec)..126(dec))
- -X
-
a range: expands to a character series starting with the last expanded
character (or ASCII(1(dec)) when the ’-’ minus character is
positioned at the start of this string/specification) and ending with the
character specified by X , where X may also be a
’\’-escaped character, e.g. ’\n’ or ’\x7E’.
Hence you can specify ranges like ’\x20-\x39’ and get what
you’d expect.
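Since these classes and ranges mirror the UNIX tr (1) command, plain tr can be used to preview what a deletion set will do to a name; for example, the effect of -tr_del_chr ’[:blank:]’ on a local name:

```shell
# Delete all horizontal white-space, as -tr_del_chr '[:blank:]' would
# do to the generated local document name.
echo 'my page name.html' | tr -d '[:blank:]'
# prints: mypagename.html
```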
- -tr_str_str $str1 $str2
-
The string $str1 in the local name of a document will be replaced with
$str2 .
- -tr_chr_chr $chrset1 $chrset2
-
Characters from $chrset1 in the local name of a document will be replaced
with the corresponding character from $chrset2 . $chrset1 and $chrset2
have the same syntax as $str in the -tr_del_chr option: both will be
expanded to a character set using the rules described above. The characters in
the expanded sets $chrset1 and $chrset2 have a 1:1 relationship, e.g. the
second character in $chrset1 will be replaced by the second character in
$chrset2 .
Caution
If the set $chrset2 is smaller than the set $chrset1 , any characters in
the set $chrset1 at positions at or beyond the size of the set $chrset2
will be replaced by the last character in the set $chrset2 .
For example, -tr_chr_chr ’abcd’ ’AB’ applied to the name
’abcde’ will produce the result ’ABBBe’ , as
’c’ and ’d’ in $chrset1 are beyond the range of
$chrset2 , hence these are replaced by the last character in
$chrset2 : ’B’. With the above example this may seem rather
obvious, but be reminded that elements like ’[:punct:]’ are
deterministic (as they do not depend on your ’locale’), but they
can still be hard to use, as you must determine which and how many characters
they will produce upon expansion. See the description for -tr_del_chr above
for additional info to help you with this.
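The padding behavior in the caution matches the UNIX tr (1) command, so it can be previewed directly with tr using the same sets:

```shell
# tr also pads the shorter target set with its last character, mirroring
# -tr_chr_chr 'abcd' 'AB' applied to the name 'abcde'.
echo 'abcde' | tr 'abcd' 'AB'
# prints: ABBBe
```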
- -store_name $str
-
Define the local filename to use for the very first file downloaded. This
option is most useful when running pavuk in ’singlepage’ mode, but
it works for any mode.
- -index_name $str
-
With this option you can change the directory index name. By default the
filename _._.html is used, which is assumed to be a filename not usually
occurring on web/ftp/... sites.
- -store_index/-nostore_index
-
With the option -nostore_index you deny storing of directory indexes into
HTML files (which are named according to the -index_name setting). The
default is to store all directory URLs as HTML index files (i.e.
-store_index ).
- -fnrules $t $m $r
-
This is a very powerful option! This option is used to flexibly change the
layout of the local document tree. It accepts three parameters.
-
The first parameter $t is used to say what type the following
pattern is:
- F
-
is used for a wildcard pattern (uses fnmatch (3) ), while
- R
-
is used for a regular expression pattern (using any supported RE
implementation).
-
The second parameter is the matching pattern used to select URLs for
this rule. If a URL matches this pattern, then the local name for this URL
is computed using the rule specified in the third parameter.
-
And the third parameter is the local name building rule. Pavuk now
supports two kinds of local name building rules. One is based only on
simple rule macros; the other is a more complicated, extended rule,
which also enables you to perform several functions in a LISP-like
micro language.
Pavuk differentiates between these two kinds of rules by looking at the
first character of the rule. When the first character is a
’(’ open bracket character, the rule is assumed to be of
the extended sort, while in all other cases it is assumed to be a simple
rule.
A simple rule should contain a mix of literals and escaped macros.
Macros are escaped by the % character or the $ character.
Note
if you want to place a literal % or $ character in the
generated string, you can escape that character with a \ backslash
character prefix, so pavuk will not recognize it as a macro escape character
here.
Note
-fnrules always performs additional cleanup for file paths produced
by both matching simple and extended rules: multiple consecutive occurrences
of / slashes in the path are replaced by a single / slash,
while any directory and/or file names which end with a . dot have that
dot removed.
Note
-fnrules are processed in the order they occurred on the command
line. If a rule matches the current URL, this rule will be applied. Any
subsequent rules will be skipped. This allows you to specify multiple
-fnrules on the command line. By ordering them from specific to
generic, you can apply different rules to subsets of the URL collection (e.g.
you’re putting the -fnrules F ’*’
’%some%macros%’ statement last).
Note
When an -fnrules statement matches the current URL, any specified
-base_level path processing will not be applied to the -fnrules
generated path.
Here is list of recognized macros:
- $x
-
where x is any positive number. This macro is replaced with
x -th substring matched by the RE pattern, which was specified in
the second -fnrules argument $m . (If you use this you need
to understand RE sub-matches!)
- %i
-
is replaced with protocol id string:
(http,https,ftp,ftps,file,gopher)
- %p
-
is replaced with password. (use this only where applicable)
- %u
-
is replaced with user name. (use this only where applicable)
- %h
-
is replaced with the fully qualified host name.
- %m
-
is replaced with the fully qualified domain name.
- %r
-
is replaced with port number.
- %d
-
is replaced with path to document.
- %n
-
is replaced with document name (including the extension).
- %b
-
is replaced with base name of document (without the extension).
- %e
-
is replaced with the URL filename extension.
- %s
-
is replaced with the URL searchstring.
- %M
-
is replaced with the full MIME type of document as transmitted in the
MIME header. For example:
text/html; charset=utf-8
As of v0.9.36, you do not need to specify the -post_update option
to make this option work.
- %B
-
is replaced with basic MIME type of the document, i.e. the MIME type
without any attributes. For example:
text/html
- %A
-
is replaced with MIME type attributes of the document, i.e. all the
stuff following the initial ’;’ semicolon as specified in the
MIME type header which was sent to us by the server. For example:
charset=utf-8
- %E
-
is replaced with default extension assigned to the MIME type of the
document.
As of v0.9.36, you do not need to specify the -post_update option
to make this option work.
You may want to specify the additional command line option
-mime_type_file $file to override the rather limited set of
built-in MIME types and default file extensions.
- %X
-
is replaced with the default extension assigned to the MIME type of the
document, if one exists. Otherwise, the existing file extension is used
instead.
You may want to specify the additional command line option
-mime_type_file $file to override the rather limited set of
built-in MIME types and default file extensions.
- %Y
-
is replaced with file extension if one is available. Otherwise, the
default extension assigned to the MIME type of the document is used
instead.
You may want to specify the additional command line option
-mime_type_file $file to override the rather limited set of
built-in MIME types and default file extensions.
- %x
-
where x is a positive decimal number. This macro is replaced with
the x -th directory from the path of the document, starting with 1
for the initial sub-directory.
- %-x
-
where x is a positive decimal number. This macro is replaced with
the x -th directory from the path of the document, counting down
from end. The value 1 indicates the last sub-directory in the path.
- %o
-
default localname for URL
Here is an example. If you want to place documents into one directory per
extension, you should use the following -fnrules option:
-fnrules F ’*’ ’/%e/%n’
Extended rules always begin with a ’(’ character.
These rules use a syntax much like LISP syntax.
Here are the basic rules for writing extended rules:
-
the complete rule statement must return the local filename as a string
return value
-
each function/operation is enclosed inside round braces ()
-
the first token right after the opening brace is the function
name/operator
-
each function has a nonzero fixed number of parameters
-
each function returns a numeric or string value
-
function parameters are separated by one or more space characters
-
any parameter of a function should be a string, number, macro or another
function
-
a literal string parameter must always be quoted using " double
quotes. When you need to include a " double quote as part of the
literal string itself, escape it by prefixing it with a \ backslash
character.
-
a literal numeric parameter can be presented in any encoding supported
by the strtol (3) function (octal, decimal,
hexadecimal, ...)
-
there is no implicit conversion from number to string
-
each macro is prefixed by % character and is one character
long
-
each macro is replaced by its string representation from current URL
-
function parameters are typed strictly
-
top level function must return string value
Extended rules support the full set of % escaped macros supported by
simple rules, plus one additional macro:
- %U
-
URL string
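As a small illustrative sketch of an extended rule (the URL is hypothetical, and the rule uses only the ts function and %o macro described on this page), mapping every URL to its default local name with spaces replaced by underscores:

```shell
# The leading '(' makes pavuk treat the rule as an extended rule;
# ts replaces one string with another inside its first argument.
pavuk -fnrules F '*' '(ts %o " " "_")' http://www.example.com/
```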
Here is a description of all supported functions/operators:
- sc
-
- ss
-
-
substring from string
-
accepts three parameters.
-
first is string from which we want to cut a sub-part
-
second is number which represents starting position in string
-
third is number which represents ending position in string
-
returns string value
- hsh
-
-
compute modulo hash value from string with specified base
-
accepts two parameters
-
first is string for which we are computing the hash value
-
second is numeric value for base of modulo hash
-
returns numeric value
- md5
-
- lo
-
- up
-
- ue
-
-
encode unsafe characters in string with same encoding which is used
for encoding unsafe characters inside URL ( %xx ). By
default all non-ASCII values are encoded when this function is
used.
-
accepts two string values
-
first is string which we want to encode
-
second is string which contains unsafe characters
-
return string value
- ud
-
- dc
-
-
delete unwanted characters from string (has similar functionality as
-tr_del_chr option)
-
accepts two string values
-
first is string from which we want delete
-
second is string which contains characters we want to delete.
-
returns string value
- tc
-
-
replace character with other character in string (has similar
functionality as -tr_chr_chr option)
-
accepts three string values
-
first is string inside which we want to replace characters
-
second is set of characters which we want to replace
-
third is set of characters with which we want to replace those
with
-
returns string value
- ts
-
-
replace some string inside string with any other string (has similar
functionality as -tr_str_str option)
-
accepts three string values
-
first is string inside which we want to replace string
-
second is the from string
-
third is to string
-
returns string value
- spn
-
-
calculate initial length of string which contains only specified set
of characters. (has same functionality as strspn (3) libc function)
-
accepts two string values
-
first is input string
-
second is set of acceptable characters
-
returns numeric value
- cspn
-
-
calculate initial length of string which doesn’t contain
specified set of characters. (has same functionality as strcspn (3) libc function)
-
accepts two string values
-
first is input string
-
second is set of unacceptable characters
-
returns numeric value
- sl
-
- ns
-
-
convert number to string by format
-
accepts two parameters
-
first parameter is format string same as for printf (3) function
-
second is number which we want to convert
-
returns string value
- sn
-
-
convert string to number by radix
-
accepts two parameters
-
first parameter is string which we want to convert using the
strtol (3) function
-
second is radix number to use for conversion; specify radix
’0’ zero if the strtol (3)
function should auto-discover the radix used
-
returns numeric value
- lc
-
-
return position of last occurrence of specified character inside
string
-
accepts two string parameters
-
first string which we are searching in
-
second string contains character for which we are looking (only the
first character of the string is used)
-
returns numeric value; 0 if character could not be found
- +
-
- -
-
- %
-
-
calculate modulo remainder
-
accepts two numeric values
-
returns numeric value; returns 0 if the divisor is 0
- *
-
- /
-
-
divide two numeric values
-
accepts two numeric values
-
returns numeric value; 0 if division by zero
- rmpar
-
-
remove parameter from query string
-
accepts two strings
-
first parameter is the string which we are adjusting
-
second parameter is the name of parameter which should be
removed
-
returns adjusted string
- getval
-
-
get query string parameter value
-
accepts two strings
-
first parameter is query string from which to get the parameter
value (usually %s )
-
second string is name of parameter for which we want to get the
value
-
returns the value of the parameter, or an empty string when the parameter
doesn’t exist
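What getval computes for a query string can be sketched with Python's urllib (the URL and parameter names here are hypothetical; pavuk's own parsing may differ in edge cases):

```python
# Illustrative only: extracting a query string parameter value, as the
# getval fnrules function does. Missing parameters yield an empty string.
from urllib.parse import urlparse, parse_qs

url = "http://www.example.com/find?q=pavuk&page=2"  # hypothetical URL
params = parse_qs(urlparse(url).query)
print(params.get("q", [""])[0])        # pavuk
print(params.get("missing", [""])[0])  # empty string when absent
```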
- sif
-
-
logical decision
-
accepts three parameters
-
first is numeric and when its value is nonzero, the result of this
decision is the result of the second parameter, otherwise it is the
result of the third parameter
-
second parameter is string (returned when condition is
nonzero/true)
-
third parameter is string (returned when condition is
zero/false)
-
returns string result of decision
- !
-
- &
-
- |
-
- getext
-
- seq
-
- fnseq
-
-
compare a wildcard pattern and a string (has the same functionality
as the fnmatch (3) libc function)
-
accepts two strings for comparison
-
first string is a wildcard pattern
-
second string is the data which should match the pattern
-
returns
- numeric value 0
-
if different
- numeric value 1
-
if equal
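The fnseq comparison can be reproduced with Python's fnmatch module, which implements the same wildcard matching as fnmatch(3) (the filenames are made up):

```python
# Illustrative only: fnseq-style wildcard matching via fnmatch(3) semantics;
# 1 means the string matches the pattern, 0 means it does not.
from fnmatch import fnmatch

print(1 if fnmatch("picture.jpg", "*.jpg") else 0)  # 1 (equal)
print(1 if fnmatch("picture.png", "*.jpg") else 0)  # 0 (different)
```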
- sp
-
-
return URL sub-part from the matching -fnrules
’R’ regex
-
accepts one number, which references the corresponding
sub-expression in the -fnrules ’R’ regex
-
returns the URL substring which matched the specified
sub-expression
-
This function is available only when pavuk is compiled with regex
support, including sub-expressions (POSIX/PCRE/TRE/...).
- jsf
-
-
Execute JavaScript function
-
Accepts one string parameter which holds the name of a JavaScript
function defined in the script loaded with the -js_script_file option.
-
Returns a string value equal to the return value of the JavaScript
function. See the -js_script_file command line option for further
details.
-
This function is available only when pavuk is compiled with support
for JavaScript bindings.
For example, if you are mirroring a very large number of Internet sites
into the same local directory, too many entries in one directory will cause
performance problems. You can use, for example, the hsh or md5
functions to generate one additional level of hash directories based on the
hostname with one of the following options:
-fnrules F ’*’ ’(sc (nc "%02d/" (hsh %h 100)) %o)’
-fnrules F ’*’ ’(sc (ss (md5 %h) 0 2) %o)’
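The second variant can be sketched in Python: the first two hex digits of the MD5 digest of the hostname become an extra hash directory level. (The hsh variant uses pavuk's internal hash function, which is not reproduced here; the hostname is just an example.)

```python
# Illustrative only: what (ss (md5 %h) 0 2) computes for a hostname.
import hashlib

host = "www.idata.sk"  # stands in for the %h macro
hash_dir = hashlib.md5(host.encode()).hexdigest()[:2]
print(hash_dir + "/" + host + "/...")
```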
- -base_level $nr
-
Number of directory levels to omit in the local tree.
For example, when downloading the URL
ftp://ftp.idata.sk/pub/unix/www/pavuk-0.7pl1.tgz with -base_level 4 on
the command line, the local tree will contain
www/pavuk-0.7pl1.tgz instead of the usual
ftp/ftp.idata.sk_21/pub/unix/www/pavuk-0.7pl1.tgz.
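The effect of -base_level on the stored path amounts to dropping the leading directory components, as a small Python sketch shows (using the example above):

```python
# Illustrative only: -base_level N omits the first N directory levels
# from the local path pavuk would otherwise create.
def apply_base_level(local_path, base_level):
    return "/".join(local_path.split("/")[base_level:])

print(apply_base_level(
    "ftp/ftp.idata.sk_21/pub/unix/www/pavuk-0.7pl1.tgz", 4))
# www/pavuk-0.7pl1.tgz
```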
- -default_prefix $str
-
Default prefix of the mirrored directory. This option is used only when you
are trying to synchronize the content of a remote directory which was
downloaded using the -base_level option. You must also use the directory
based synchronization method, not the URL based synchronization method. This
is especially useful when used in conjunction with the -remove_old option.
- -remove_adv/-noremove_adv
-
This option turns on/off the removal of HTML tags which contain
advertisement banners. The banners are not removed from the HTML file, but
are commented out. Such URLs will also not be downloaded. This option has an
effect only when used with the -adv_re option. Default is turned off. This
option is available only when your system has support for one of the
supported regular expression implementations.
- -adv_re $RE
-
This option is used to specify regular expressions for matching URLs of
advertisement banners. For example:
-adv_re http://ad.doubleclick.net/.*
matches all files from the server ad.doubleclick.net. This option is
available only when your system has any supported regular expression
implementation.
- -unique_name/-nounique_name
-
Pavuk by default always attempts to assign a unique local filename to each
unique URL. If this behavior is not wanted, you can use the
-nounique_name option to disable it.
- -hammer_mode $nr
-
define the hammer mode:
-
0 = old fashioned: keep on running until all URLs have been accessed
-hammer_repeat times.
-
1 = record activity on first run; burst transmit recorded activity
-hammer_repeat times. This is an extremely fast mode suitable for
load testing medium and large servers (assuming you are running pavuk on
similar hardware).
- -hammer_threads $nr
-
define the number of threads to use for the replay hammer attack (hammer
mode 1)
- -hammer_flags $nr
-
define hammer mode flags: see the man page for more info
- -hammer_ease $nr
-
delay for network communications (msec). 0 == no delay, default = 0.
$nr specifies the delay in milliseconds, unless postfixed with
one of the characters S, M, H or D (either in upper or lower case), which
imply the alternative time units S = seconds, M = minutes, H = hours or D =
days.
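Interpreting such an argument (plain milliseconds, or an S/M/H/D postfix in either case) can be sketched in Python (the parser below is an illustration of the described format, not pavuk's actual code):

```python
# Illustrative only: parse a delay/timeout argument as described above.
UNIT_MS = {"s": 1000, "m": 60 * 1000, "h": 3600 * 1000, "d": 86400 * 1000}

def parse_interval_ms(arg):
    # Trailing S/M/H/D (any case) selects a larger unit; otherwise msec.
    if arg and arg[-1].lower() in UNIT_MS:
        return int(arg[:-1]) * UNIT_MS[arg[-1].lower()]
    return int(arg)

print(parse_interval_ms("1500"))  # 1500 msec
print(parse_interval_ms("2S"))    # 2000 msec
print(parse_interval_ms("3m"))    # 180000 msec
```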
- -hammer_rtimeout $nr
-
timeout for network communications (msec). 0 == no timeout, default =
0.
$nr specifies the timeout in milliseconds, unless postfixed with
one of the characters S, M, H or D (either in upper or lower case), which
imply the alternative time units S = seconds, M = minutes, H = hours or D =
days.
- -hammer_repeat $nr
-
number of times the requests should be executed again (load test by
hammering the same stuff over and over).
- -log_hammering / -nolog_hammering
-
log all activity during a ’hammer’ run.
Note
Note: only applies to hammer_modes >= 1, as hammer_mode == 0 is
simply a re-execution of all the requests, using the regular code and
processing by pavuk and as such the regular pavuk logging is used for
that mode.
- -hammer_recdump {$nr | @[@]$filepath }
-
number of the file descriptor to which recorded activity is written.
Note
pavuk 0.9.36 and later releases also support the @$file
argument, where you can specify a file to dump the data to. The file path
must be prefixed by an ’@’ character. If you prefix the file
path with a second ’@’, pavuk will assume you wish to append
to an already existing file. Otherwise the file will be created/erased
when pavuk starts.
- -sleep $nr
-
This option allows you to specify the number of seconds during which the
program will be suspended between two transfers. Useful to prevent server
overload. The default value for this option is 0.
- -rsleep/-norsleep
-
When this option is active, pavuk randomizes the sleep time between
transfers in the interval between zero and the value specified with the
-sleep option. This option is inactive by default.
- -ddays $nr
-
If a document has a modification time later than $nr days before
today, then in sync mode pavuk attempts to retrieve a newer copy of the
document from the remote server. Default value is 0.
- -remove_old/-noremove_old
-
Remove improper documents (those which don’t exist on the remote
site). This option has an effect only when used in directory based
sync mode. When used with URL based sync mode, pavuk will not remove
any old files which were excluded from the document tree and are not
referenced in any HTML document. You must also use the -subdir option
to let pavuk find the files which belong to the current mirror. By default
pavuk won’t remove any old files.
- -browser $str
-
is used to set your browser command (in the URL tree dialog you can
right-click to raise a menu from which you can start the browser on the
currently selected URL). This option is available only when compiled with
the GTK GUI and with support for URL tree preview.
- -debug/-nodebug
-
turns on displaying of debug messages. This option is available only
when compiled with -DDEBUG, i.e. when having executed ./configure
--enable-debug to set up the pavuk source code. If the -debug
option is used, pavuk will output verbose information about documents,
whole protocol level information, file locking information and much more
(the amount and types of information depends on the -debug_level
command-line arguments). This option is used as a trigger to enable output
of debug messages selected by the -debug_level option. Default is
debug mode turned off. To check if your pavuk binary supports -debug
, you can run pavuk with the -version option.
- -debug_level $level
-
Set the level of required debug information. $level can be a
numeric value which represents a binary mask of the requested debug levels,
or a comma separated list of supported debug level identifiers.
The debug level identifiers (as listed below) can be prefixed with an
exclamation mark (!) to turn them off. For example, this
$level specification:
all,!html,!limits
will turn ’all’ debug levels ON, except
’html’ and ’limits’.
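Resolving such an identifier list can be sketched in Python (LEVELS here is an abbreviated stand-in for pavuk's full identifier set, which is listed below):

```python
# Illustrative only: resolve a $level list like "all,!html,!limits"
# into the effective set of enabled debug levels.
LEVELS = {"html", "limits", "cookie", "net", "ssl"}  # abbreviated stand-in

def resolve_debug_levels(spec):
    enabled = set()
    for ident in spec.split(","):
        if ident == "all":
            enabled |= LEVELS          # turn everything on
        elif ident.startswith("!"):
            enabled.discard(ident[1:])  # "!name" turns a level off
        else:
            enabled.add(ident)
    return enabled

print(sorted(resolve_debug_levels("all,!html,!limits")))
# ['cookie', 'net', 'ssl']
```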
Currently pavuk supports the following debug level identifiers:
- all
-
request all currently supported debug levels
- bufio
-
for watching the pavuk I/O buffering layer at work - this layer is
positioned on top of all file I/O and network traffic for improved
performance.
- cookie
-
for monitoring HTTP ’cookies’ processing.
- dev
-
for additional ’developer’ debug info. This generally
produces more debug info across the board.
- hammer
-
for watching events while running in -hammer_mode >= 1
replay mode
- html
-
for HTML parser debugging
- htmlform
-
for monitoring HTML web form processing, such as recognizing and
(automatically) filling in web form fields.
- protos
-
to see server side protocol messages
- protoc
-
to see client side protocol messages
- procs
-
to see some special procedure calls
- locks
-
for debugging of documents locking
- net
-
for debugging some low level network stuff
- misc
-
for miscellaneous unsorted debug messages
- user
-
for verbose user level messages
- mtlock
-
locking of resources in multithreading environment
- mtthr
-
launching/waking/sleeping/stopping of threads in multithreaded
environment
- protod
-
for DEBUGGING of POST requests
- limits
-
for debugging limiting options, you will see the reason why
particular URLs are rejected by pavuk and which option caused this.
- rules
-
for debugging -fnrules and JavaScript-based filters.
- ssl
-
to enable verbose reporting about SSL related things.
- trace
-
to enable verbose reporting of development related things.
- js
-
for debugging the -js_pattern , -js_transform and
-js_transform2 filter processing.
- -remind_cmd $str
-
This option has an effect only when running pavuk in reminder mode.
Pavuk sends the result of the reminder-mode run to the command specified
with this option. The result lists the URLs which have changed and the URLs
which produced any errors. The default remind command is "mailx user@server
-s \"pavuk reminder result\"" .
- -nscache_dir $dir
-
Path to the Netscape browser cache directory. If you specify this path,
pavuk attempts to find out whether a URL is in this cache. If the URL is
there, it will be fetched from the cache, else pavuk will download it from
the net. The cache directory index file must be named index.db and must be
located in the cache directory. To support this feature, pavuk has to be
linked with BerkeleyDB 1.8x .
- -mozcache_dir $dir
-
Path to the Mozilla browser cache directory. Same functionality as the
previous option, just for a different browser with different cache formats.
Pavuk supports both formats of the Mozilla browser disk cache (the old one
for versions <0.9 and the new one used in 0.9=<). The old format cache
directory must contain a cache directory index database named
cache.db . The new format cache directory must contain the map file
_CACHE_MAP_ and three block files _CACHE_001_
, _CACHE_002_ and _CACHE_003_ . To support the old
Mozilla cache format, pavuk has to be linked with BerkeleyDB 1.8x. The new
Mozilla cache format doesn’t require any external library.
- -post_cmd $str
-
Post-processing command, which will be executed after a successful
download of a document. This command may process the document in some way.
While this command is running, pavuk keeps the document locked, so there is
no chance that some other pavuk process will modify the document. This
post-processing command will get three additional parameters from
pavuk.
- name
-
local name of document
- 1 / 0
-
- URL
-
original URL of this document
- -hack_add_index/-nohack_add_index
-
This is a bit of a hacky option. It forces pavuk to also add the
directory indexes of all queued documents to the URL queue. This allows
pavuk to download more documents from a site than it is able to achieve by
normal traversal of HTML documents. A bit dirty, but useful in some cases.
- -js_script_file $file
-
Pavuk optionally has a built-in JavaScript interpreter to allow high
level customization of some internal procedures. Currently you can
customize two things with your own JavaScript functions: you can set
precise limiting options, or you can write your own functions for use
inside the rules of the -fnrules option. With this option you can load a
JavaScript script with such functions into pavuk’s internal JavaScript
interpreter. This option is available only when you have compiled pavuk
with support for JavaScript bindings.
- -mime_type_file $file
-
Specify an alternative MIME type and file extensions definition file
$file to override the rather limited set of built-in MIME types and
default file extensions. The file must be of a UNIX mime.types(5)
compatible format.
If you do not specify this command line option, these MIME types and
extensions are known to pavuk by default:
MIME types and default file extensions
MIME type                  | Default File Extension
text/html*                 | html
text/js                    | js
text/plain                 | txt
image/jpeg                 | jpg
image/pjpeg                | jpg
image/gif                  | gif
image/png                  | png
image/tiff                 | tiff
application/pdf            | pdf
application/msword         | doc
application/postscript     | ps
application/rtf            | rtf
application/wordperfect5.1 | wps
application/zip            | zip
video/mpeg                 | mpg
Note that the source distribution of pavuk already includes a
full-fledged mime.types file for your convenience. You may point
-mime_type_file at this file to make pavuk aware of (almost) all
MIME types available out there!
You may want to use the JavaScript bindings built into pavuk for performing
tasks which need more complexity than can be achieved with a regular,
non-scriptable program.
You can load one JavaScript file into pavuk using the command line option
-js_script_file . Currently there are two hooks in pavuk where the user
can insert their own JavaScript functions.
One is inside the routine which decides whether a particular URL should
be downloaded or not. If you want to insert your own JavaScript decision
function, you must name it pavuk_url_cond_check . The prototype of this
function looks as follows:
function pavuk_url_cond_check (url, level)
{
...
}
where the function return value is used by pavuk. Any return value which
evaluates to a boolean ’false’ or integer ’0’ (zero)
will be considered a ’NO’ answer, i.e. skip the given URL. Any
other boolean or integer return value constitutes a ’YES’ answer.
(Note that return values are cast to an integer value before evaluation.)
- level
-
is an integer number and indicates from which of five different places
in the pavuk code the pavuk_url_cond_check function is currently called:
- level == 0
-
condition checking is called from HTML parsing routine. At this
point you can use all conditions besides -dmax ,
-newer_than , -older_than , -max_size ,
-min_size , -amimet , -dmimet and
-user_condition when calling the pavuk url.check_cond(name,
....) URL class method from this JavaScript function script code.
Calling url.check_cond(name, ....) with any of the conditions
listed above will be processed as a no-op, i.e. it will return the
boolean value ’TRUE’.
- level == 1
-
condition checking is called from the routine which performs
queueing of URLs into the download queue. These URLs have been collected
from another HTML page before. At this point you can only use the
conditions -dmax and -user_condition .
- level == 2
-
condition checking is called when a URL is taken from the download
queue; the URL will be transferred if this check succeeds. At this
point you can use the same set of conditions as at level == 0
except -tag_pattern and -tag_rpattern . Additionally, you can
use the condition -dmax here.
- level == 3
-
condition checking is called after pavuk has sent the download request
and detected the document size, modification time and MIME type. At this
level you can only use the conditions -newer_than ,
-older_than , -max_size , -min_size ,
-amimet , -dmimet and -user_condition . As with
the other levels, using any other conditions is identical to a no-op
check.
- url
-
is an object instance of the PavukUrl class. It contains all information
about a particular URL and is a wrapper for the parsed URL structure of
type url defined inside pavuk.
It has the following attributes:
-
read-write attributes
- status
-
(int32, always defined) holds bitfields with different info
(look in url.h for more)
-
And following methods:
- get_parent(n)
-
get URL of n-th parent document
- check_cond(name, ...)
-
check the condition whose option name is "name". When you do not
provide additional parameters, pavuk will use the parameters from the
command line or scenario file for condition checking. Otherwise it will
use the listed parameters.
The following condition names are recognized (note that the use
of other names is considered an error here):
Next to that, pavuk also offers a global print(...) function which will
print each of the parameters passed to it, separating them by a single space. The
text is terminated by a newline. Note that each of the print(...) parameters
is cast to a string before being printed.
Here is an example of what a pavuk_url_cond_check function can look like:
function pavuk_url_cond_check (url, level)
{
if (level == 0)
{
if (url.level > 3 && url.check_cond("-asite", "www.host.com"))
return false;
if (url.check_cond("-url_rpattern"
, "http://www.idata.sk/~ondrej/"
, "http://www.idata.sk/~robo/")
&& url.check_cond("-dsfx", ".jar", ".tgz", ".png"))
return false;
}
if (level == 2)
{
par = url.get_parent();
if (par && par.get_moved())
return false;
}
return true;
}
This example is rather useless, but shows you how to use this feature.
The second possible use of JavaScript with pavuk is in the -fnrules option
for generating local names. In this case it is done by a special function of
the extended -fnrules option syntax called "jsf ", which has one parameter:
the name of the JavaScript function which will be called. The function must
return a string and its prototype is something like the following:
function some_jsf_func(fnrule)
{
...
}
The fnrule parameter is an object instance of the PavukFnrules class.
It has three read-only attributes:
-
url - which is of PavukUrl type described above
-
pattern - which is the -fnrules provided pattern string
-
pattern_type - which is the -fnrules provided pattern type ID (an
integer number): when called by a -fnrules ... ’F’ option,
pattern_type == 2, when called by a -fnrules ... ’R’ (regex)
option, pattern_type == 1, otherwise pattern_type == 0 (unknown).
and also has two methods:
-
get_macro(macro) - returns the value of the ’%’ macros used in the
-fnrules option, where the (string type) macro argument may be any of
’%i’, ’%p’, ’%u’, ’%h’, ’%m’, ’%r’, ’%d’, ’%n’, ’%b’, ’%e’, ’%s’,
’%q’, ’%U’, ’%o’, ’%M’, ’%B’, ’%A’, ’%E’, ’%Y’ or ’%X’. Any other
macro argument value will not be processed and is passed as is, i.e.
will be returned by get_macro(macro) untouched.
-
get_sub(nr) - returns the substring of ’urlstr’ as matched by the
regex sub-expression ’nr’ when the -fnrules R statement was
matched.
You can do something like:
-fnrules F "*" ’(jsf "some_fnrules_func")’
As of version 0.9pl29, pavuk has changed how status is indicated by exit
codes. In earlier versions, exit status 0 meant no error and a nonzero exit
status was something like the count of failed documents. In all versions
after 0.9pl29 the following exit codes are defined:
-
no error, everything is OK
-
error in configuration of pavuk options or error in config files
-
some error occurred while downloading documents
-
a signal was caught while downloading documents; transfer was aborted
-
an internal check failed while downloading documents; transfer was
aborted
- USER
-
this variable is used to construct an email address from the user and hostname
- LC_*, LANG
-
used to set internationalized environment
- PAVUKRC_FILE
-
with this variable you can specify alternative location for your
.pavukrc configuration file.
- at
-
is used for scheduling.
- gunzip
-
is used to decode gzip or compress encoded documents. Note that since pavuk
release 0.9.36 gunzip is only used when pavuk has been built without
built-in zlib support. You can check if your pavuk binary comes with built-in
zlib support by running pavuk -v which should report
’gzip/compress/deflate Content-Encoding’ as one of the optional
features available.
If you find any, please let me know.
- /usr/local/etc/pavukrc
-
---
- ~/.pavukrc
-
---
- ~/.pavuk_prefs
-
These files are used as default configuration files. You may specify there
some constant values like your proxy server or your preferred WWW browser.
Configuration options reflect command line options. Not all parameters are
suitable for use in a default configuration file. You should select only
those which you really need.
The file ~/.pavuk_prefs is a special file which contains
automatically stored configuration. This file is used only when running the
GUI interface of pavuk and the option -prefs is active.
- -auth_file $file
-
File $file should contain as many authentication records as you need.
Records are separated by any number of empty lines. Parameter name is case
insensitive.
Structure of record:
Field       : Proto: <proto ID>
Description : identification of the protocol (ftp/http/https/..)
Reqd        : required

Field       : Host: <host:[port]>
Description : host name
Reqd        : required

Field       : User: <user>
Description : name of the user
Reqd        : optional

Field       : Pass: <password>
Description : password for the user
Reqd        : optional

Field       : Base: <path>
Description : base prefix of the document path
Reqd        : optional

Field       : Realm: <name>
Description : realm for HTTP authorization
Reqd        : optional

Field       : NTLMDomain: <domain>
Description : NTLM domain for NTLM authorization
Reqd        : optional

Field       : Type: <type>
Description : HTTP authentication scheme. Accepted values:
              {1/2/3/4/user/Basic/Digest/NTLM}. Similar meaning as the
              -auth_scheme option (see the help for this option for more
              details). Default is 2 (Basic scheme).
Reqd        : optional
See pavuk_authinfo.sample file for an example.
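A minimal record following the structure above might look like this (the host and credentials are of course hypothetical placeholders):

```
Proto: http
Host: www.example.com:80
User: jrandom
Pass: secret
Realm: members
Type: Basic
```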
- ~/.pavuk_keys
-
this file stores information about configurable menu option
shortcuts. It is available only when compiled with GTK+ 1.2 and higher.
- ~/.pavuk_remind_db
-
this file contains information about URLs for running in reminder
mode. The structure of this file is very simple. Each line contains
information about one URL. The first entry on a line is the last known
modification time of the URL (stored in time_t format - the number of
seconds since 1.1.1970 GMT), and the second entry is the URL itself.
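Reading one such line can be sketched in Python (the timestamp and URL below are made-up sample data in the described format):

```python
# Illustrative only: parse one ~/.pavuk_remind_db line
# ("<time_t> <URL>", separated by whitespace).
import time

line = "1017575612 http://www.idata.sk/~ondrej/"  # hypothetical entry
stamp, url = line.split(None, 1)
mtime = int(stamp)  # seconds since 1.1.1970 GMT
print(time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(mtime)), url)
```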
The first file parsed (if present) is /usr/local/etc/pavukrc , then
~/.pavukrc (if present), then ~/.pavuk_prefs (if present). Last, the
command line is parsed.
The precedence of configuration settings is as follows (ordered from highest to
lowest precedence):
Here is table of config file - command line options pairs:
Config file options vs. command line option equivalents
Config file option | command line option
ActiveFTPData: | -ftp_active / -ftp_passive
ActiveFTPPortRange: | -active_ftp_port_range
AddHTTPHeader: | -httpad
AdvBannerRE: | -adv_re
AllLinksToLocal: | -all_to_local / -noall_to_local
AllLinksToRemote: | -all_to_remote / -noall_to_remote
AllowCGI: | -CGI / -noCGI
AllowedDomains: | -adomain
AllowedIPAdrressPattern: | -aip_pattern
AllowedMIMETypes: | -amimet
AllowedPorts: | -aport
AllowedPrefixes: | -aprefix
AllowedSites: | -asite
AllowedSuffixes: | -asfx
AllowFTP: | -FTP / -noFTP
AllowFTPRecursion: | -FTPdir
AllowFTPS: | -FTPS / -noFTPS
AllowGopher: | -Gopher / -noGopher
AllowGZEncoding: | -Enc / -noEnc
AllowHTTP: | -HTTP / -noHTTP
AllowRelocation: | -Relocate / -noRelocate
AllowSSL: | -SSL / -noSSL
AlwaysMDTM: | -always_mdtm / -noalways_mdtm
AuthFile: | -auth_file
AuthReuseDigestNonce: | -auth_reuse_nonce
AuthReuseProxyDigestNonce: | -auth_reuse_proxy_nonce
AutoReferer: | -auto_referer / -noauto_referer
BaseLevel: | -base_level
BgMode: | -bg / -nobg
Browser: | -browser
CheckIfRunnigAtBackground: | -check_bg / -nocheck_bg
CheckSize: | -check_size / -nocheck_size
CommTimeout: | -timeout
CookieCheckDomain: | -cookie_check / -nocookie_check
CookieFile: | -cookie_file
CookieRecv: | -cookie_recv / -nocookie_recv
CookieSend: | -cookie_send / -nocookie_send
CookiesMax: | -cookies_max
CookieUpdate: | -cookie_update / -nocookie_update
Debug: | -debug / -nodebug
DebugLevel: | -debug_level
DefaultMode: | -mode
DeleteAfterTransfer: | -del_after / -nodel_after
DisabledCookieDomains: | -disabled_cookie_domains
DisableHTMLTag: | -disable_html_tag
DisallowedDomains: | -ddomain
DisallowedIPAdrressPattern: | -dip_pattern
DisallowedMIMETypes: | -dmimet
DisallowedPorts: | -dport
DisallowedPrefixes: | -dprefix
DisallowedSites: | -dsite
DisallowedSuffixes: | -dsfx
DocExpiration: | -ddays
DontLeaveDir: | -leave_dir / -dont_leave_dir
DontLeaveSite: | -leave_site / -dont_leave_site
DontTouchTagREPattern: | -dont_touch_tag_rpattern
DontTouchUrlPattern: | -dont_touch_url_pattern
DontTouchUrlREPattern: | -dont_touch_url_rpattern
DumpFD: | -dumpfd
DumpUrlFD: | -dump_urlfd
EmailAddress: | -from
EnableHTMLTag: | -enable_html_tag
EnableJS: | -enable_js / -disable_js
FileSizeQuota: | -file_quota
FixWuFTPDBrokenLISTcmd: | -fix_wuftpd_list / -nofix_wuftpd_list
FnameRules: | -fnrules
FollowCommand: | -follow_cmd
ForceReget: | -force_reget
FSQuota: | -fs_quota
FTPDirtyProxy: | -ftp_dirtyproxy
FTPhtml: | -FTPhtml / -noFTPhtml
FTPListCMD: | -FTPlist / -noFTPlist
FTPListOptions: | -ftp_list_options
FtpLoginHandshake: | -ftp_login_handshake
FTPProxy: | -ftp_proxy
FTPProxyPassword: | -ftp_proxy_pass
FTPProxyUser: | -ftp_proxy_user
FTPViaHTTPProxy: | -ftp_httpgw
GopherProxy: | -gopher_proxy
GopherViaHTTPProxy: | -gopher_httpgw
GUIFont: | -gui_font
HackAddIndex: | -hack_add_index / -nohack_add_index
HammerEaseOffDelay: | -hammer_ease
HammerFlags: | -hammer_flags
HammerMode: | -hammer_mode
HammerReadTimeout: | -hammer_rtimeout
HammerRecorderDumpFD: | -hammer_recdump
HammerRepeatCount: | -hammer_repeat
HammerThreadCount: | -hammer_threads
HashSize: | -hash_size
HTMLFormData: | -formdata
HTMLTagPattern: | -tag_pattern
HTMLTagREPattern: | -tag_rpattern
HTTPAuthorizationName: | -auth_name
HTTPAuthorizationPassword: | -auth_passwd
HTTPAuthorizationScheme: | -auth_scheme
HTTPProxy: | -http_proxy
HTTPProxyAuth: | -http_proxy_auth
HTTPProxyPass: | -http_proxy_pass
HTTPProxyUser: | -http_proxy_user
Identity: | -identity
IgnoreChunkServerBug: | -ignore_chunk_bug / -noignore_chunk_bug
ImmediateMessages: | -immesg / -noimmsg
IndexName: | -index_name
JavaScriptFile: | -js_script_file
JavascriptPattern: | -js_pattern
JSTransform2: | -js_transform2
JSTransform: | -js_transform
Language: | -language
LeaveLevel: | -leave_level
LeaveSiteEnterDirectory: | -leave_site_enter_dir / -dont_leave_site_enter_dir
LimitInlineObjects: | -limit_inlines / -dont_limit_inlines
LocalIP: | -local_ip
LogFile: | -logfile
LogHammerAction: | -log_hammering / -nolog_hammering
MatchPattern: | -pattern
MaxDocs: | -dmax
MaxLevel: | -lmax / -l
MaxRate: | -maxrate
MaxRedirections: | -nredirs
MaxRegets: | -nregets
MaxRetry: | -retry
MaxRunTime: | -max_time
MaxSize: | -maxsize
MinRate: | -minrate
MinSize: | -minsize
MozillaCacheDir: | -mozcache_dir
NetscapeCacheDir: | -nscache_dir
NewerThan: | -newer_than
NLSMessageCatalogDir: | -msgcat
NSSAcceptUnknownCert: | -nss_accept_unknown_cert / -nonss_accept_unknown_cert
NSSCertDir: | -nss_cert_dir
NSSDomesticPolicy: | -nss_domestic_policy / -nss_export_policy
NTLMAuthorizationDomain: | -auth_ntlm_domain
NTLMProxyAuthorizationDomain: | -auth_proxy_ntlm_domain
NumberOfThreads: | -nthreads
OlderThan: | -older_than
PageSuffixes: | -page_sfx
PostCommand: | -post_cmd
PostUpdate: | -post_update / -nopost_update
PreferredCharset: | -acharset
PreferredLanguages: | -alang
PreserveAbsoluteSymlinks: | -preserve_slinks / -nopreserve_slinks
PreservePermisions: | -preserve_perm / -nopreserve_perm
PreserveTime: | -preserve_time / -nopreserve_time
Quiet: | -quiet / -verbose
RandomizeSleepPeriod: | -rsleep / -norsleep
ReadBufferSize: | -bufsize
ReadCSS: | -read_css / -noread_css
ReadHtmlComment: | -noread_comments / -read_comments
Read_MSIE_ConditionalComments: | -noread_msie_cc / -read_msie_cc
Read_XML_CDATA_Content: | -noread_cdata / -read_cdata
RegetRollbackAmount: | -rollback
REMatchPattern: | -rpattern
ReminderCMD: | -remind_cmd
RemoveAdvertisement: | -remove_adv / -noremove_adv
RemoveBeforeStore: | -remove_before_store / -noremove_before_store
RemoveOldDocuments: | -remove_old
RequestInfo: | -request
Reschedule: | -reschedule
RetrieveSymlinks: | -retrieve_symlink / -noretrieve_symlink
RunX: | -runX
ScenarioDir: | -scndir
SchedulingCommand: | -sched_cmd
SelectedLinksToLocal: | -sel_to_local / -nosel_to_local
SendFromHeader: | -send_from / -nosend_from
SendIfRange: | -send_if_range / -nosend_if_range
SeparateInfoDir: | -info_dir
ShowDownloadTime: | -stime
ShowProgress: | -progress
SinglePage: | -singlepage / -nosinglepage
SiteLevel: | -site_level
SkipMatchPattern: | -skip_pattern
SkipREMatchPattern: | -skip_rpattern
SkipURLMatchPattern: | -skip_url_pattern
SkipURLREMatchPattern: | -skip_url_rpattern
SleepBetween: | -sleep
SLogFile: | -slogfile
SSLCertFile: | -ssl_cert_file
SSLCertPassword: | -ssl_cert_passwd
SSLKeyFile: | -ssl_key_file
SSLProxy: | -ssl_proxy
SSLVersion: | -ssl_version
StatisticsFile: | -statfile
StoreDirIndexFile: | -store_index / -nostore_index
StoreDocInfoFiles: | -store_info / -nostore_info
StoreName: | -store_name
TransferQuota: | -trans_quota
TrChrToChr: | -tr_chr_chr
TrDeleteChar: | -tr_del_chr
TrStrToStr: | -tr_str_str
UniqueDocName: | -unique_name / -nounique_name
UniqueLogName: | -unique_log / -nounique_log
UniqueSSLID: | -unique_sslid / -nounique_sslid
URLMatchPattern: | -url_pattern
URLREMatchPattern: | -url_rpattern
UrlSchedulingStrategy: | -url_strategy
URLsFile: | -urls_file
UseCache: | -cache / -nocache
UseHTTP11: | -use_http11
UsePreferences: | -prefs / -noprefs
UserCondition: | -user_condition
UseRobots: | -Robots / -noRobots
Verify CERT: | -verify / -noverify
WaitOnExit: | -ewait
WorkingDir: | -cdir
WorkingSubDir: | -subdir
XMaxLogSize: | -xmaxlog
URL: | one URL (more lines with URL: ... means more URLs)
Some config file entries are not available as command-line options:
Extra config file options for the GTK GUI
Config file option | Description
BtnConfigureIcon: | accepts a path argument
BtnConfigureIcon_s: | accepts a path argument
BtnLimitsIcon: | accepts a path argument
BtnLimitsIcon_s: | accepts a path argument
BtnGoBgIcon: | accepts a path argument
BtnGoBgIcon_s: | accepts a path argument
BtnRestartIcon: | accepts a path argument
BtnRestartIcon_s: | accepts a path argument
BtnContinueIcon: | accepts a path argument
BtnContinueIcon_s: | accepts a path argument
BtnStopIcon: | accepts a path argument
BtnStopIcon_s: | accepts a path argument
BtnBreakIcon: | accepts a path argument
BtnBreakIcon_s: | accepts a path argument
BtnExitIcon: | accepts a path argument
BtnExitIcon_s: | accepts a path argument
BtnMinimizeIcon: | accepts a path argument
BtnMaximizeIcon: | accepts a path argument
A line beginning with ’#’ is a comment.
TrStrToStr: and TrChrToChr: must contain two quoted strings. All
parameter names are case insensitive. If an option is missing here, look
inside the config.c source file.
See the pavukrc.sample file for an example.
The simplest invocation:
pavuk http://<my_host>/doc/
Mirroring a site to a specific local directory tree, rejecting big files (>
16MB), plus lots of extra options covering, among other things, active FTP
sessions and passive FTP (for when you’re behind a firewall). As such, this is
a rather mix & mash example:
pavuk -mode mirror -nobg -store_info -info_dir /mirror/info
-nthreads 1 -cdir /mirror/incoming -subdir /mirror/incoming
-preserve_time -nopreserve_perm -nopreserve_slinks -noretrieve_symlink
-force_reget -noRobots -trans_quota 16384 -maxsize 16777216
-max_time 28 -nodel_after -remove_before_store -ftpdir -ftplist
-ftp_list_options -a -dont_leave_site -dont_leave_dir -all_to_local
-remove_old -nostore_index -active_ftp_port_range 57344:65535
-always_mdtm -ftp_passive -base_level 2
http://<my_host>/doc/
Note
This is a writeup for a bit of extra pavuk documentation. Comments are
welcomed; I hope this is useful for those who are looking for some prime examples
of pavuk use (intermediate complexity).
Author: Ger Hobbelt
< ger@hobbelt.com >
Anyone for whom
’pavuk http://www.da-url-to-spider.com/’
doesn’t entirely suit their needs.
Anyone who feels an itch coming up when their current spider software croaks
again, merely because they were only interested in spidering part of
the pages.
This example text assumes you’ve had your first few trial runs using pavuk
already. We take off at the point where you knew you should really read the manual
but didn’t dare do so. Yet. ... Or you did and got that look upon your face,
where your relatives start to laugh and your kids yell: “Mom! Dad is doing
that look again!”
We’re going to cover a hard case for any spider: a
Mediawiki-driven documentation website.
The goal: Get some easily readable pages in your local (off-line) storage.
I wished to have the documentation for a tool I purchased
available off net, since I’m not always connected when I’m somewhere
where I find time to work with that particular tool. And the company that sells the
product doesn’t include a paper manual.
Their documentation is stored in a Mediawiki web site, i.e. a website driven by
the same software which was written for the well known Wikipedia.
There are several issues with such sites, at least from an ’off net
copy’ and ’spider’ perspective:
-
The web pages don’t come with proper file extensions, e.g.
’.HTML’. Sometimes there is no filename extension at all, as is
the case with Mediawiki sites. For a web site this is not an issue, as the web
server and your browser work in perfect tandem as long as the server
sends along the correct MIME type with the content, and Mediawiki does a
splendid job there.
-
As each page has quite a few links to edit forms, page histories, older
revisions and the like, your spider will really love to dig in and go there.
Unfortunately this is the Road To Hell (tm) as:
-
any site of sufficient age, i.e. a large enough number of edits to its
pages, will have your spider go... and go... and go... and then some
more.
-
To put it mildly, you may not be particularly interested in those
historic edits / revisions / etc. -- I know I wasn’t, I just wanted
to have the latest documentation along when I open up my laptop next where
there’d be no Net. And I didn’t like my disc flooded with - to
me - garbage.
-
If you are really lucky with these highly dynamic sites, they’ll
provide reporting and other facilities on a day to day basis: when the
spider hits those calendars and the site is set up to, for example, show
the state of the union, pardon, website for any given day back till the
dawn of civilization, you’re in for a real treat as the spider
will request those dynamic pages for every day in that lovely calendar.
ETA on this process? Somewhere around this Saturday next year. If
you’re lucky and your IP doesn’t get banned before that day for
abuse.
So the key to this type of spider activity is to be able to restrict the
spider to the ’main pages’, i.e. that part of the content you are
interested in.
-
Which leaves only one ’minor’ issue: local files don’t
come with a ’MIME type’, so you’re in real need of
some fitting filename extensions to help your HTML browser/viewer decide
how to show a particular bit of content. After all, both a .HTML and a
.JPG file are just a bunch of bytes, but, heck, does a JPG look wicked when
you try to view it as if it were an HTML page. And vice versa.
pavuk is perfectly able to help you out with this challenge as it comes with
quite a few features to selectively grab and discard pages during the spider
process.
And it has something extra, which is not to be sneezed at when you are
trying to convert dynamically generated content into some sort of static HTML
pages for off net use: FILENAME REWRITING. This allows you to tell pavuk
exactly how you want those pages filed and under what filenames, including,
very important to get your web browser to cooperate when you feed it these
pages from your local disc, the appropriate filename extensions.
Let’s have a look at the pavuk commandline which does all of that -
and then some:
Note
(this is pavuk tests/ example script no. 2a, by the way) The pavuk
commandline has been broken across multiple lines to improve its
readability.
We are going to grab the documentation for a 3D animation plugin called CAT,
available at http://cat.wiki.avid.com/
Special notes for this spider run:
-
We are also interested in the ’RecentChanges’
report/overview, as I edit my local copy of this documentation and like
to know which pages have changed since the last time I visited the
site.
-
Remove the single spaces before each of those ’&’ in
those URLs if you want the real URL; these were inserted only to
simplify this document’s formatting.
-
For the same reason, remove the single spaces following each
’,’ comma in several of the commandline option arguments down
there.
../src/pavuk
-verbose
-dumpdir pavuk_data/
-noRobots
-cdir pavuk_cache/
-cookie_send
-cookie_recv
-cookie_check
-cookie_update
-cookie_file pavuk_data/chunky-cookies3.txt
-read_css
-auto_referer
-enable_js
-info_dir pavuk_info/
-mode mirror
-index_name chunky-index.html
-request ’URL:http://cat.wiki.avid.com/index.php? title=Special:Recentchanges&
hideminor=0 &hideliu=0 &hidebots=0 &hidepatrolled=0 &limit=500 &days=30 &limit=500
METHOD:GET’
-request ’URL:http://cat.wiki.avid.com/index.php/Special:Lonelypages METHOD:GET’
-request ’URL:http://cat.wiki.avid.com/index.php/Special:Unusedimages METHOD:GET’
-request ’URL:http://cat.wiki.avid.com/index.php/Special:Allpages METHOD:GET’
-request ’URL:http://cat.wiki.avid.com/ METHOD:GET’
-scndir pavuk_scenarios/
-dumpscn TestScenario.txt
-nthreads 1
-progress_mode 6
-referer
-nodump_after
-rtimeout 10s
-wtimeout 10s
-timeout 60s
-dumpcmd test_cmd_dumped.txt
-debug
-debug_level ’all, !locks, !mtlock, !cookie, !trace, !dev, !net, !html, !htmlform,
!procs, !mtthr, !user, !limits, !hammer, !protos, !protoc, !protod, !bufio,
!rules, !js’
-store_info
-report_url_on_err
-tlogfile pavuk_log_timing.txt
-dump_urlfd @pavuk_urlfd_dump.txt
-dumpfd @pavuk_fd_dump.txt
-dump_request
-dump_response
-logfile pavuk_log_all.txt
-slogfile pavuk_log_short.txt
-test_id T002
-adomain cat.wiki.avid.com
-use_http11
-skip_url_pattern ’*oldid=*, *action=edit*, *action=history*, *diff=*, *limit=*,
*[/=]User:*, *[/=]User_talk:*, *[^p]/Special:*, *=Special:[^R]*, *.php/Special:[^LUA][^onl][^nul]*,
*MediaWiki:*, *Search:*, *Help:*’
-tr_str_str ’Image:’ ’’
-tr_chr_chr ’:\\!&=?’ ’_’
-mime_types_file ../../../mime.types
-fnrules F ’*/index.php/*:*’ ’%h:%r/%d/%n%s.%X’
-fnrules F ’*/index.php[/?]*’ ’%h:%r/%d/%b%s.%X’
-fnrules F ’*’ ’%h:%r/%d/%b%s.%Y’
Whew, that’s some commandline you’ve got there! Well, I always
start out with the same set of options, which are not really relevant here
(we’re not all that concerned with tracking cookies on this one, for
one), but it has grown into a habit which is hard to get rid of.
A bit of a toned down version looks like this:
Note
removed are:
-
logging features (the -dump_whathaveyou commandline options /
-store_info/-[ts]logfile)
-
cookie tracking and handling options
-
storage directory configuration (-dumpdir/-cdir/-info_dir/-scndir)
-
multithreading configuration (-nthreads)
-
verbosity and progress info aids
(-verbose/-progress_mode/-report_url_on_err)
-
diagnostics features: there is a whole slew of flags that are
really helpful when you are setting up this sort of thing for the first time:
without those it can be really hard to find the proper incantations for
some of the remaining options (-debug/-debug_level)
-
miscellaneous for administrative purposes (-test_id)
leaving us:
../src/pavuk
-noRobots
-read_css
-auto_referer
-enable_js
-mode mirror
-index_name chunky-index.html
-request ’URL:http://cat.wiki.avid.com/index.php? title=Special:Recentchanges&
hideminor=0 &hideliu=0 &hidebots=0 &hidepatrolled=0 &limit=500 &days=30 &limit=500
METHOD:GET’
-request ’URL:http://cat.wiki.avid.com/index.php/Special:Lonelypages METHOD:GET’
-request ’URL:http://cat.wiki.avid.com/index.php/Special:Unusedimages METHOD:GET’
-request ’URL:http://cat.wiki.avid.com/index.php/Special:Allpages METHOD:GET’
-request ’URL:http://cat.wiki.avid.com/ METHOD:GET’
-referer
-adomain cat.wiki.avid.com
-use_http11
-skip_url_pattern ’*oldid=*, *action=edit*, *action=history*, *diff=*, *limit=*,
*[/=]User:*, *[/=]User_talk:*, *[^p]/Special:*, *=Special:[^R]*, *.php/Special:[^LUA][^onl][^nul]*,
*MediaWiki:*, *Search:*, *Help:*’
-tr_str_str ’Image:’ ’’
-tr_chr_chr ’:\\!&=?’ ’_’
-mime_types_file ../../../mime.types
-fnrules F ’*/index.php/*:*’ ’%h:%r/%d/%n%s.%X’
-fnrules F ’*/index.php[/?]*’ ’%h:%r/%d/%b%s.%X’
-fnrules F ’*’ ’%h:%r/%d/%b%s.%Y’
which tells pavuk to:
-
skip the ’robots.txt’, if available from this web site
(-noRobots)
-
load and interpret any CSS files, i.e. see if there are additional URLs
available in there (-read_css)
-
play nice with the web server and tell the box which path it is
traveling, just like a regular web browser would do when a human would
click on the links shown on screen (-auto_referer/-referer)
-
look at any JavaScript code for extra URLs (-enable_js). Yes,
we’re that desperate for URLs to spider. Well, this option is in my
’standard set’ to use with pavuk, and if it (he? she?)
doesn’t find any, it doesn’t hurt to have it here with us
anyway.
-
operate in ’mirror’ mode. Pavuk has several modes of
operation available for you, but I find I use ’mirror’ most,
probably because I’ve become really used to it. In a moment of
weakness, I might concede that it’s more probable that I have found
that often almost any problem can be turned into a nail if you find
yourself holding a large and powerful hammer. And the ’mirror’
mode might just be my hammer there.
-
store directory index content in the
’chunky-index.html’ file for each such directory. Simply put:
this is the content sent by the server when we request URLs that end with a
’/’. This is not the whole truth, but it’ll do for
now.
-
spider starting at several URLs (-request ...). Now this is interesting,
in that, at least theoretically, I could have done with specifying a single
start URL there:
-request ’URL:http://cat.wiki.avid.com/ METHOD:GET’
as the other URLs shown above can be reached from that page.
In practice though, I often find it a better approach to specify each of
the major sections of a site which you want to be sure your pavuk run needs
to cover. Besides, practice shows that some of those extra URLs can only be
reached by spidering and interpreting otherwise uninteresting revision/edit
Mediawiki system pages. And since we’re doing our darnedest best to
make sure pavuk does NOT grab or process any of _those_ pages, we would
miss a few bits, e.g. these ones:
-request ’URL:http://cat.wiki.avid.com/index.php/Special:Lonelypages METHOD:GET’
-request ’URL:http://cat.wiki.avid.com/index.php/Special:Unusedimages METHOD:GET’
would be completely missed had I not specified them explicitly
here, while keeping all the restrictions (-skip_url_pattern et al.) as
strict and restrictive as they are now.
-
restrict any spidering to the specified domain and any of its subdomains
(-adomain). In this particular case, there’s only one domain to
spider, but you can spider several locations in a single run, by specifying
multiple ’acceptable domains’ using -adomain.
-
to use the HTTP 1.1 protocol when talking to the web server. This is
another one of those ’standard options’ which I tend to
copy&paste in every pavuk command set. This one comes in handy when
your web site is hosted on a ’virtual host’, i.e. when several
domains share the same server and IP address (such as is the case with my
own web sites, ’www.hebbut.net’ and
’www.hobbelt.com’). Though this option’s use dates back to
older pavuk releases I still tend to include it, despite the fact that the
latest pavuk versions default to HTTP 1.1 instead of the older HTTP
1.0.
And now some of the real meat of this animal:
-skip_url_pattern comes with a huge set of comma-separated wildcard
expressions. When part of a URL matches any one of these expressions, pavuk
will ignore that URL and hence skip grabbing that particular page.
- ’*oldid=*’
-
is kind of trivial: if we somehow end up attempting to spider a
’historic’ (older) copy of a given web page, we are NOT
interested. This forces pavuk to skip any older versions of any Mediawiki
pages.
- ’*action=edit*’
-
is another trivial one: we are not going to log in and edit the page as
we are interested only in grabbing the current content. No editing pages
with web forms for us then.
- ’*action=history*’
-
is a variant to the ’oldid’ expression above with the same
intent. Note that all this is - of course - web site and Mediawiki
specific, so web sites serviced by different brands of CMS/Wiki software,
require their own set of skip patterns.
Nevertheless, the set above should work out nicely for most if not all
Mediawiki sites.
Also note that the complete URL is matched against these
patterns, i.e. including the ’?xxx&xxx&xxx’ URL
query part of that URL. (Bookmarks, encoded as a hash-delimited last
part of a client-side URL like this: ’...#jump_here’, are NOT
included in the match. The server should never get to see those anyway, as
hash bookmarks are a pure client-side thing.)
- ’*diff=*’
-
we don’t want to know what the changes to page X are compared to,
say, the previous version of said page.
- ’*limit=*’
-
there are several report/system pages in any Mediawiki site where lists
of items are split into chunks to reduce page size and user strain. This is a
quick & dirty way to get rid of any of those.
And then there are the pages we do like to see (UnusedImages +
LonelyPages), for which we are not interested in paging through to the end
of the list if it is that large for this site.
- ’*[/=]User:*’ , ’*[/=]User_talk:*’
-
two more which are irrelevant from our perspective: we’re going
offline with this material, so there’s no way to discuss matters with
the editors.
- ’*[^p]/Special:*’
-
this one rejects any ’Special:’ pages at first glance, but
is a little more wicked than that, as we do want those
’LonelyPages’, ’UnusedImages’ and
’AllPages’, thank you very much. See, this pattern is limited
to ’Special:’ pages which are not located in a
(virtual) directory ending with a ’p’. Due to the way the
Mediawiki software operates and presents its web pages, this basically
means the pattern will ONLY match (and thus skip) ’Special:’
pages which do not directly follow the ’index.php’ processing page,
which, in Mediawiki’s case, presents itself as if it were a directory,
as in this URL:
http://cat.wiki.avid.com/index.php/Special:Lonelypages
Unfortunately, the above pattern is not restrictive enough, as we’ll
now be treated to a whole slew of main-page ’Special:’s. And that
wasn’t what we wanted, was it?
Additional patterns to the rescue!
Remember that we are only interested in three of them:
-
’LonelyPages’,
-
’UnusedImages’ and
-
’AllPages’
So the next pattern:
’*=Special:[^R]*’
may seem kind of weird right now. Let’s file that one away for later, and
first have a look at the next one after that:
’*.php/Special:[^LUA][^onl][^nul]*’: Now this baby looks just
like the supplement we were looking for: skip any ’Special:’s
which do not start their name with one of the characters ’L’,
’U’ or ’A’. Compare that to the three
’Special:’s we actually _do_ want to download, listed above, and the
method should quickly become apparent: the second letter is declared
ba-a-a-a-a-d and evil when it’s not one of these: ’o’,
’n’ or ’l’, and just to top it off, the third letter in
the name is checked too: if it’s not ’n’, ’u’ or
’l’, the page at hand is _out_.
So this should do it regarding those ’Special:’s, right?
Not Entirely, No. Because there’s still that fourth one we’d
love to see:
-request ’URL:http://cat.wiki.avid.com/index.php? title=Special:Recentchanges&
hideminor=0 &hideliu=0 &hidebots=0 &hidepatrolled=0 &limit=500 &days=30 &limit=500
METHOD:GET’
which has a bit of a different form around the ’Special’
text:
index.php?title=Special:Recentchanges
Note the ’=’ in there. So that’s why we had that
other pattern, the one we filed for later discussion:
’*=Special:[^R]*’
i.e. discard any page containing the string ’=Special:’ which
is not immediately followed by the character ’R’ of
’RecentChanges’.
So far, so good.
Mediawiki comes with another heap of system pages, which are categorically
rejected using this set of three patterns:
’*MediaWiki:*, *Search:*, *Help:*’
NOW we’re done. At least as far as filtering/restricting the spider is
concerned.
Note
A last note before we continue with the next section: each of
the ’-skip_url_pattern’ patterns is handled as if it were a UNIX
filesystem/shell wildcard: MSDOS/Windows people will recognize
’?’ (any single character) and ’*’ (zero or more
characters), but UNIX wildcard patterns also accept ’sets’, such
as ’[a-z]’ (any one of the letters of our alphabet, but only the
lowercase ones) or ’[^0-9]’ (any one character, but NOT a
digit!). pavuk calls these ’fnmatch()’ patterns and if you google
the Net, you’ll be sure to find some very thorough descriptions of
those. They live next to the ’regex’ (a.k.a. ’regular
expressions’) which are commonly used in Perl and other languages.
pavuk - of course - comes with those too: if you like to use regexes, you
should specify your restrictive patterns using the
’-skip_url_rpattern’ commandline option instead. Note that subtle
extra ’r’ in the commandline option there.
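To make the pattern mechanics tangible, here is a small sketch using a shell case statement, whose glob patterns are very close to the fnmatch() patterns pavuk uses (one spelling difference: portable shell negates a set as ’[!R]’ where fnmatch also accepts ’[^R]’). The helper name is invented for illustration:

```shell
#!/bin/sh
# Hypothetical helper mimicking a few of the -skip_url_pattern rules.
# Shell 'case' globbing is very close to fnmatch(3); note the [!...]
# spelling for negated sets (pavuk's patterns use [^...]).
should_skip() {
  case "$1" in
    *oldid=*|*action=edit*|*action=history*|*diff=*|*limit=*) echo skip ;;
    *=Special:[!R]*)                    echo skip ;;
    *.php/Special:[!LUA][!onl][!nul]*)  echo skip ;;
    *)                                  echo keep ;;
  esac
}

should_skip 'http://cat.wiki.avid.com/index.php?title=Manual&oldid=123'  # skip
should_skip 'http://cat.wiki.avid.com/index.php/Special:Lonelypages'     # keep
should_skip 'http://cat.wiki.avid.com/index.php/Special:Watchlist'       # skip
```

Note how ’Special:Lonelypages’ survives the negated-set patterns (it starts with an ’L’), while ’Special:Watchlist’ does not.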
Still, if you grab a Mediawiki site’s content just like that,
you’ll end up with a horrible mess of files with all sorts of funny
characters in their filenames.
This might not be too bothersome on a UNIX box (apart from the glaring
difficulty of properly viewing each filetype, as the filename extensions are the
browser/viewer’s only help once these files end up on your local
storage), but I wished to view the downloaded content on a laptop with Windows
XP installed.
So there’s a bit more work to do here: knead the filenames into a form
that is palatable to both me and my Windows web page viewing tools.
This is where some of the serious power of pavuk shows. It might not be the
simplest tool around, but if you were looking for that Turbo Piledriver to
devastate those 9 inch nail-shaped challenges, here you are.
We’ll start off easy: Images.
They should at least have decent filenames and more importantly: suitable
filename extensions.
So we add these commandline options as filename ’transformation’
instructions:
- -tr_str_str ’Image:’ ’’
-
will simply discard any ’Image:’ string in the URL while
converting said URL to a matching filename.
- -tr_chr_chr ’:\\!&=?’ ’_’
-
Windows does NOT like ’:’ colons (and a few other
characters), so we’ll have those replaced by a
’programmers’ space’, a.k.a. the ’_’
underscore.
This ’-tr_chr_chr’ will convert those long URLs which
include ’?xxx&yyy&etc’ URL query sections into
something without any of those darned characters: ’:’,
’\’ (note the UNIX shell escape there, hence ’\\’),
’!’, ’&’, ’=’ and
’?’.
Of course, if you find other characters in your grabbed URLs offend you,
you can add them to this list.
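As a rough sketch of what these two options do to a URL, here is the same transformation approximated with standard sed(1) and tr(1); pavuk applies its version while deriving the local filename, so this is an illustration only:

```shell
#!/bin/sh
# Approximate -tr_str_str 'Image:' '' and -tr_chr_chr ':\\!&=?' '_'
# with standard tools; illustration only, not what pavuk runs internally.
url='http://cat.wiki.avid.com/index.php?title=Image:CatRig.jpg'

step1=$(printf '%s' "$url" | sed 's/Image://g')   # drop the 'Image:' string
step2=$(printf '%s' "$step1" | tr ':\\!&=?' '_')  # map : \ ! & = ? to '_'

echo "$step2"   # http_//cat.wiki.avid.com/index.php_title_CatRig.jpg
```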
Then we’re on to the last and most interesting part of the filename
transformation act. But for that, we’ll need to help pavuk convert those
MIME types to filename extensions.
That we do by providing a nicely formatted mime.types(3) file (see the online
UNIX man pages for a format description):
-mime_types_file ../../../mime.types
Of course, I manipulated this file a bit so pavuk would choose
’.html’ over ’.htm’, etc. as several MIME types come
with a set of possible filename extensions: MIME types and filename extensions
come from quite disparate worlds and are not 1-on-1 exchangeable. But we
try.
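To illustrate, a minimal sketch of what such a mime.types file looks like (one MIME type per line, followed by the filename extensions associated with it; reordering the extensions is presumably how you steer pavuk towards ’.html’ rather than ’.htm’, as described above):

```
text/html                  html htm
text/css                   css
image/jpeg                 jpeg jpg jpe
image/png                  png
application/x-javascript   js
```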
-fnrules F ’*/index.php/*:*’ ’%h:%r/%d/%n%s.%X’
will take any URL which contains the string ’/index.php/’ and
comes with a ’:’ a little further down the road, and convert it to
a filename using the ’%h:%r/%d/%n%s.%X’ template.
The ’F’ tells pavuk that what follows is a ’fnmatch()’ type pattern:
like the ’-skip_url_pattern’ patterns above, these are very similar
to UNIX filesystem wildcards. If you wish to use real perl(5)-like regexes
instead, specify ’R’ here.
The template ’%h:%r/%d/%n%s.%X’ instructs pavuk to construct the
filename for the given URL like this:
-
’%h’ is replaced with the fully qualified host name, i.e.
’cat.wiki.avid.com’.
-
’%r’ is replaced with the port number, i.e. ’80’
for your average vanilla web site/server.
-
’%d’ is replaced with the path to the document.
-
’%n’ is replaced with the document name (including the
extension).
-
’%s’ is replaced with the URL searchstring, i.e. the
’...?xxx&yyy&whatever’ section of the URL.
-
and ’la piece de resistance’:
’%X’ is replaced with the default extension assigned to the
MIME type of the document, if one exists. Otherwise, the existing file
extension is used instead.
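Spelled out for one URL, the expansion might come together like this (all values invented for illustration; pavuk performs this expansion internally):

```shell
#!/bin/sh
# Hypothetical expansion of the '%h:%r/%d/%n%s.%X' template for the URL
# http://cat.wiki.avid.com/index.php/Manual (values invented to illustrate):
host='cat.wiki.avid.com'   # %h: fully qualified host name
port='80'                  # %r: port number
dir='index.php'            # %d: path to the document
name='Manual'              # %n: document name
search=''                  # %s: URL search string (empty here)
ext='html'                 # %X: extension taken from the MIME type

echo "${host}:${port}/${dir}/${name}${search}.${ext}"
# cat.wiki.avid.com:80/index.php/Manual.html
```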
Note
And the manual also says this: “You may want to specify the
additional command line option ’-mime_type_file’ to override
the rather limited set of built-in MIME types and default file
extensions.” Good! We did that already!
But what is that about “Otherwise, the existing file extension
is used instead”? Well, if the webserver somehow feeds you a MIME
type with document X and your list/file does not show a filename
extension for said MIME type, pavuk will try to deduce a filename
extension from the URL itself. Basically this comes down to pavuk
looking for the bit of the non-query part of the URL following the last
’.’ dot pavuk can find in there. In our case, that would
imply the extension would end up to be ’.php’ if we
aren’t careful, so it is imperative to have your
’-mime_type_file’ mime.types file properly filled with all
the filename extensions for each of the MIME types you are to encounter
on the website under scrutiny.
Since you’ve come this far, you might like to know that a large part
of the pavuk manual has been devoted to the ’-fnrules’ option
alone. And let me tell you: these ’-fnrules’ shown here barely
scratch the surface of the capabilities of the ’-fnrules’
commandline option: we did not use any of the ’Extended Functions’
in the transformation templates here...
As we have covered the first ’-fnrules’ of the set shown in the
example:
-fnrules F ’*/index.php/*:*’ ’%h:%r/%d/%n%s.%X’
-fnrules F ’*/index.php[/?]*’ ’%h:%r/%d/%b%s.%X’
-fnrules F ’*’ ’%h:%r/%d/%b%s.%Y’
you may wonder what the others are for and about.
The second one
-fnrules F ’*/index.php[/?]*’ ’%h:%r/%d/%b%s.%X’
makes immediate sense as it is the equivalent of the first, but now only for
those URLs which have a ’/’ slash or a ’?’ question
mark following the string ’/index.php’ immediately.
But wait! Wouldn’t its transform template execute on the same URLs as the
first ’-fnrules’ statement? In other words: what’s the use of
the first ’-fnrules’ if we have the second one too?
Well, there’s a little detail you need to know regarding
’-fnrules’: every URL only gets to use ONE. That is to say,
once a URL matches one of the ’-fnrules’, that template will be
applied and no further ’-fnrules’ processing will be applied to
that URL. This gives us the option to process several URLs in different ways,
though we must take care about the order in which we specify these
’-fnrules’: starting from strictest matching pattern to most
generic matching pattern. That is why the ’-fnrule’ with matching
pattern ’*’ (= simply anything will do) comes last.
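This first-match-wins behavior has the same semantics as a shell case statement, which also stops at the first matching pattern; a sketch (function and rule labels invented):

```shell
#!/bin/sh
# First-match semantics, sketched with 'case' (which also stops at the
# first matching pattern, just like pavuk's -fnrules chain):
pick_rule() {
  case "$1" in
    */index.php/*:*)   echo 'rule 1: %h:%r/%d/%n%s.%X' ;;
    */index.php[/?]*)  echo 'rule 2: %h:%r/%d/%b%s.%X' ;;
    *)                 echo 'rule 3: %h:%r/%d/%b%s.%Y' ;;
  esac
}

pick_rule 'http://cat.wiki.avid.com/index.php/Image:Rig.jpg'   # rule 1
pick_rule 'http://cat.wiki.avid.com/index.php/Manual'          # rule 2
pick_rule 'http://cat.wiki.avid.com/style.css'                 # rule 3
```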
The second ’-fnrules’ line has only a few changes to its
template, compared to the first:
(1st) -fnrules F ’*/index.php/*:*’ ’%h:%r/%d/%n%s.%X’
(2nd) -fnrules F ’*/index.php[/?]*’ ’%h:%r/%d/%b%s.%X’
where %b is replaced with the basename of the document (without the
extension) so that the URL query section of the URLs matching the 2nd
’-fnrules’ will be discarded for the filename, while the 1st
’-fnrules’ will include that (-tr_chr_chr/-tr_str_str transformed)
part instead.
The third ’-fnrules’ option:
-fnrules F ’*’ ’%h:%r/%d/%b%s.%Y’
is also interesting, because its template includes ’%Y’ instead
of ’%X’, where the manual tells us this about ’%Y’:
“ %Y is replaced with the file extension if one is available. Otherwise,
the default extension assigned to the MIME type of the document is used
instead.” Which means ’%Y’ is the opposite of
’%X’ in terms of precedence between the ’URL-derived
filename extension’ and the MIME type derived filename extension:
’%X’ will have a MIME type related filename extension
’win’ over the extension ripped from the URL string, while
’%Y’ will act just the other way around. Hence, ’%Y’
will only use the MIME type filename extension if there’s no
’.’ dot in the filename section of the URL:
site.com/index.php
would keep its ’.php’, while
site.com/dir-with-extension.ext/no-extension-here
would cause pavuk to look up the related MIME type filename extension
instead (notice that the filename section of the URL does not come with
a ’.’ dot!).
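The difference boils down to which extension gets first pick, which can be sketched with shell default-value expansions (variable names invented; an empty value means ’not available’):

```shell
#!/bin/sh
# Sketch of the %X vs. %Y precedence (variable names invented):
#   %X: MIME-type extension first, URL extension as fallback.
#   %Y: URL extension first, MIME-type extension as fallback.
url_ext='php'     # extension ripped from the URL ('' if none)
mime_ext='html'   # extension looked up via the MIME type ('' if none)

echo "%X picks: ${mime_ext:-$url_ext}"   # %X picks: html
echo "%Y picks: ${url_ext:-$mime_ext}"   # %Y picks: php
```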
... you’ve travelled far, but now we have covered all the commandline
options which were relevant to the case at hand: spider a Mediawiki-based
website for off-line perusal.
Along the way, you’ve had a whiff of the power of pavuk, while I hope
you’ve found several bits that may be handy in your own usage of pavuk. I
suggest you check out the other sections of the manual, forgive it its few
grammatical errors as it was originally written by a non-native speaker, and
enjoy pavuk for its worth: a darn powerful web spider and test machine. (Yes, I
have used it to perform performance and coverage analysis on web sites with
this tool. Check out the Gatling gun of web access: the -hammer mode. But
that’s a whole different story.)
I did intentionally not cover the very important diagnostics
commandline options in this example, as that would have stretched your
endurance as a reader beyond the limit. Perusing the ’-debug /
-debug_level’ log output is subject matter to fill a book. Maybe another
time.
Take care and enjoy the darndest best web spider out there. And it’s
Open Source, so do as I did: grab the source if the tool doesn’t
completely fit your needs already, and improve it yet further!
Look into ChangeLog file for more information about new
features in particular versions of pavuk.
Main development: Ondrejicka Stefan
Look into CREDITS file of sources for additional information.
pavuk is available from http://pavuk.sourceforge.net/