Pavuk - Last Changes

Last Update: April 19 2005

Last Changes :

* ---------- released version 0.9.33 (2005-09-27)
* fixed 64bit problems (BUG #1226863)
* updated German locale, fixes done by Debian developers (Hey, please inform
us about errors. Scanning the net and all distributions for possible fixes
is not very helpful.)
* ---------- released version 0.9.34 (2006-01-09)
* security fixes
* some minor bug fixes
* reworked build system a lot, fixed RPM spec file
* now builds fine using most of the possibilities pavuk provides
* RPM builds on openSUSE build service for SUSE since version 9.3, Fedora
since version 4 and Mandriva since version 2006
* RPM packages can be found here:
http://software.opensuse.org/download/home:/dstoecker/
* ---------- released version 0.9.35 (2007-02-21)
* added -persistent/-nopersistent option

2007-april-30 [notes taken from old work back in 2005/2006 merged into pavuk mainstream source tree]

* bufio has seen a MAJOR overhaul. It is now capable of pushing text &
binary data to the file system at unprecedented rates. This is done by
adding a variable sized (and possibly large) memory cache, resulting in
large size I/O operations. These perform very much faster than the regular
RTL I/O calls. (tested on quad CPU UNIX Dell servers)

the new bufio was required as I needed to log/track a huge amount of data
in the shortest possible time / lowest possible CPU load.
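
A minimal sketch of the caching idea described above, assuming a
caller-supplied memory block and a callback that performs the real I/O. The
names wcache, sink_fn and stats_sink are hypothetical and are not taken from
the pavuk bufio sources:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical write cache; names and layout are illustrative only. */
typedef size_t (*sink_fn)(const void *data, size_t len, void *ctx);

typedef struct {
    char   *buf;    /* caller-supplied cache memory */
    size_t  cap;    /* cache capacity in bytes */
    size_t  used;   /* bytes currently cached */
    sink_fn sink;   /* performs the real (large) I/O operation */
    void   *ctx;    /* opaque pointer handed to sink */
} wcache;

/* Push all cached bytes to the sink; 0 on success, -1 on a short write. */
static int wcache_flush(wcache *c)
{
    if (c->used > 0) {
        if (c->sink(c->buf, c->used, c->ctx) != c->used)
            return -1;
        c->used = 0;
    }
    return 0;
}

/* Cache small writes; flush first when the new data does not fit. */
static int wcache_write(wcache *c, const void *data, size_t len)
{
    if (len > c->cap - c->used && wcache_flush(c) != 0)
        return -1;
    if (len >= c->cap)  /* too big to cache: hand it over directly */
        return c->sink(data, len, c->ctx) == len ? 0 : -1;
    memcpy(c->buf + c->used, data, len);
    c->used += len;
    return 0;
}

/* Demo sink: records call count and byte total instead of touching a file. */
typedef struct { size_t calls, bytes; } sink_stats;

static size_t stats_sink(const void *data, size_t len, void *ctx)
{
    sink_stats *s = ctx;
    (void)data;
    s->calls++;
    s->bytes += len;
    return len;
}
```

With a 64 byte cache, one hundred 10-byte writes reach the sink in 17 calls
instead of 100; a larger cache reduces the call count further.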

* cookie handling has been fixed/augmented. pavuk can now have the initial
cookie values that go with a certain web request preconfigured on the
commandline. Also, several bugs in handling the cookies have been fixed.
(tested on a wicked ASP.NET intranet site which 'assumed' the use of a
special web client (a TV set top box) which would transmit its serial #
as a client-side created(!) cookie to the web server. This site/client
combo thus actually transmitted cookies which would first show up in a web
_request_ instead of the usual: a server-side _response_.)

* several portability items have been changed (h_errno, ...) to make the
code compile and work on the odd-flavored UNIX box. A native Win32 port is
under way: it now works, including zlib and OpenSSL, though the latter has
not been tested recently.

Note that the changes may have broken GTK support, as I was not able to
build the code with GTK on my UNIX boxes.

* socket I/O (IP traffic) has been fixed to properly cope with user breaks
(a user hitting Ctrl+C). Several locations in the software where the
unexpected signal would cause an infinite loop have been identified and
fixed.

* added several lines of DEBUG_xxx to aid both developer and user in
tracking down hard to diagnose issues inside pavuk while scanning a site.

* Accept-Encoding (more specifically: the handling of x-gzip/gzip/
x-compress/compress encoding) has been changed to allow for better
portability: data is expanded in-memory, without the need for an external
'gzip' tool and/or OS-specific forks & pipes.

(Win32 wouldn't know a fork if ever it saw one.)

* ALL stdio is now handled through the new bufio system. This not only
improves performance when you've got -debug and -debuglevel dialed all the
way up, but also corrected several spots where, depending on your C RTL,
stdout/stderr traffic would arrive at different moments on your console
(some of it was written through the FILE I/O, some through direct I/O,
causing blurbs of output to pass one another along the way to the actual
console).

* buffer overrun protection has been improved. Note also that every
snprintf() and derivative thereof is now 'augmented' by an additional line
of code which ensures that the last character in the buffer is guaranteed
to be a NUL sentinel, thus ensuring that the buffer will always present
data in correct C string format (NUL-terminated). (This is an old habit of
mine as some C RTLs have proven to be kinda flaky on the subject of NUL
sentinels when snprintf() et al are writing data up to the edge of their
output buffers: some C RTLs 'forget' to put a NUL there under particular
circumstances; some commercial Watcom compiler releases come to mind.)
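
The habit described above can be sketched as a small wrapper. The name
safe_snprintf is hypothetical (it is not claimed to exist in the pavuk
sources); the only point is the extra assignment after the formatting call:

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical wrapper: format into buf, then force a NUL sentinel into the
 * last byte regardless of what the C RTL did. */
static int safe_snprintf(char *buf, size_t size, const char *fmt, ...)
{
    va_list ap;
    int n;

    if (buf == NULL || size == 0)
        return -1;
    va_start(ap, fmt);
    n = vsnprintf(buf, size, fmt, ap);
    va_end(ap);
    buf[size - 1] = '\0';   /* the extra line: guaranteed terminator */
    return n;               /* as reported by the RTL (C99: untruncated length) */
}
```

On a conforming C99 library the extra assignment is redundant, but it costs
next to nothing and protects against the flaky RTLs mentioned above.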

* multithreading pavuk has been tested on a high perf MP UNIX box and it
was like the documentation/notes state somewhere: unstable. The thread
interlocking has now been fixed; one of the hardest to fix proved to be
the lockup at the end of a pavuk run. The fix also includes the use of
semaphores and some additional code changes to make the code thread safe;
critical sections are now handled as such. This includes placing several
non-threadsafe C RTL calls (e.g. ctime()) inside critical sections!

* auto-form-filling (the feature which led me to select pavuk over wget et
al when I started the hammer/chunky project) has been fixed for those
special pages where you have an empty form to submit: the site I had to
test included such a form, which was submitted using javascript, but did
not contain _any_ input fields (but cookies were expected to come with
that request, thank you). Before, pavuk crashed on such a page. This has
now been fixed.

* added a 'reindent' target to the makefile, using GNU indent to reformat
the code. (When you're working several weeks on end in crunch time, you
want to see some proper and consistent looking source code, even when you
just made it a mess yourself...)

Also extended the cleanup makefile target to help me in cleaning up any
backup and/or temporary files created by vi and some log diagnostic
scripts.

[edit may/2007: wasn't this already in the makefiles before - see
ChangeLog entry in 2003?]

* added several commandline parameter types, which allow you to instruct
pavuk to use OS file handles or file names for logging activity, while you
can now also specify whether a log file should be overwritten (default) or
appended to (new feature) by adding another '@' prefix to the file path.

TODO: document this properly.

* added hammer/crunchy modes: several ways to scan a web site and then
rescan it. The higher (later) hammer mode has been specifically written to
use pavuk as a 'replay attack' based DoS tool for testing high performance
web servers. (bufio was overhauled to allow us to log all I/O data +
diagnostics to disc while hammering the server while the pavuk system
_must_ perform better (= faster) than the web server when running both on
equivalent hardware.)

* The native Win32 port has been overhauled (previous code was never
released to the public) to make sure I did not have to look for
OS-specific path elements _everywhere_ in the code (it was becoming a
code-wise maintenance nightmare while fixing up/down all those 'absolute
path' and 'path expansion' code sections to handle Win32 drive letters:
root is '[A-Z]:[\\/]' instead of simply '/').

This has been fixed by using the cygwin 'path hack' for the native Win32
port too: root is '/cygdrive/[a-z]/' so it looks exactly like a UNIX path.

Any places in the code which need to address the OS while passing an
OS-specific path are now handled almost invisibly: all relevant C RTL calls
(fopen/open/stat/lstat/symlink/link/unlink/rename/mkdir/rmdir/opendir) are
now encapsulated in tl_[sysname] wrapper functions where these
/cygdrive/[x]/ paths are converted back to native Win32 paths before the
actual C RTL function is called. Also any debug/print statement, which is
used to report a file path, is fixed to convert file paths to the native
representation with a minimum of fuss: see the new tl_native() call for a
description how this was done. This code has not been tested in a UNIX/MP
environment, but the design is such that this should not cause any trouble
(pthread port for Win32 is in progress ATM).
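
To illustrate the conversion step, here is a hedged sketch of turning a
'/cygdrive/[x]/' path back into a native Win32 path. The function name
cygdrive_to_native and its signature are assumptions made for this example;
the real tl_native() and tl_[sysname] wrappers may work differently:

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: "/cygdrive/c/dir/file" becomes "C:\dir\file".
 * Paths without the prefix only get their '/' separators replaced.
 * Returns 0 on success, -1 if dst is too small. */
static int cygdrive_to_native(const char *src, char *dst, size_t dstsize)
{
    static const char prefix[] = "/cygdrive/";
    const size_t plen = sizeof prefix - 1;
    size_t i = 0, o = 0;

    if (dstsize == 0)
        return -1;
    if (strncmp(src, prefix, plen) == 0
        && isalpha((unsigned char)src[plen])
        && (src[plen + 1] == '/' || src[plen + 1] == '\0')) {
        if (dstsize < 4)
            return -1;
        dst[o++] = (char)toupper((unsigned char)src[plen]);
        dst[o++] = ':';
        dst[o++] = '\\';
        i = plen + 1;
        if (src[i] == '/')
            i++;
    }
    for (; src[i] != '\0'; i++) {
        if (o + 1 >= dstsize)   /* keep room for the terminator */
            return -1;
        dst[o++] = (src[i] == '/') ? '\\' : src[i];
    }
    dst[o] = '\0';
    return 0;
}
```

A wrapper around fopen() or stat() would call such a helper on its path
argument right before invoking the C RTL function.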

* added -debug_level modes: all/trace/dev/bufio/cookie/htmlform. Also added
a feature where you can now specify a set of debug levels and have some of
those levels _removed_, e.g. 'all,!dev' will show anything _except_ 'dev'
level debug output: note the new '!' prefix.
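
A sketch of how such a level list with '!' removal could be parsed into a
bit mask. The level names come from the entry above; the bit values, the
table and the function name parse_debug_levels are assumptions for
illustration only:

```c
#include <string.h>

/* Assumed bit assignments; pavuk's real values may differ. */
enum { DL_TRACE = 1, DL_DEV = 2, DL_BUFIO = 4, DL_COOKIE = 8,
       DL_HTMLFORM = 16, DL_ALL = 31 };

static const struct { const char *name; unsigned bits; } dl_names[] = {
    { "all", DL_ALL },     { "trace", DL_TRACE },   { "dev", DL_DEV },
    { "bufio", DL_BUFIO }, { "cookie", DL_COOKIE }, { "htmlform", DL_HTMLFORM },
};

/* Parse e.g. "all,!dev" left to right: plain names add bits, names with a
 * '!' prefix remove them. Returns 0 on success, -1 on an unknown name. */
static int parse_debug_levels(const char *spec, unsigned *mask)
{
    const size_t n = sizeof dl_names / sizeof dl_names[0];
    unsigned m = 0;

    while (*spec != '\0') {
        size_t len = strcspn(spec, ",");
        size_t i;
        int negate = 0;

        if (len > 0 && spec[0] == '!') {
            negate = 1;
            spec++;
            len--;
        }
        for (i = 0; i < n; i++)
            if (strlen(dl_names[i].name) == len
                && strncmp(spec, dl_names[i].name, len) == 0)
                break;
        if (i == n)
            return -1;
        if (negate)
            m &= ~dl_names[i].bits;
        else
            m |= dl_names[i].bits;
        spec += len;
        if (*spec == ',')
            spec++;
    }
    *mask = m;
    return 0;
}
```

Because the list is processed left to right, 'all,!dev' yields every level
except 'dev', while '!dev,all' would switch 'dev' back on.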

* -debug_level output is now prefixed with its level in caps and square
brackets, e.g. '[PROCE]' to aid in filtering the debug output (for
instance by piping it through sed/grep).

* unified debug output handling in the code: -debug_levels are now only
active when you specify -debug too.

* inflate_decode() and gzip_decode() have been fixed to suit a multithreaded
environment. gzip_decode() now has an in-memory implementation, using the
zlib library, for those systems which do not support UNIX pipes/forks.

* Fixed deflate/compress handling: the MJF Accept-Encoding deflate hack has
been removed and the request header extended. (tested on a Wikipedia
HTTP/1.1 compliant server)

You may wish to permanently disable the code within

in decode.c if you do not wish to depend on the external gzip tool any
more.

* _all_ system header file #include's have been removed from the sources and
integrated into config.h to allow for better portable source code.

config.h.in and autoconf.am have been extended to include several more OS-
dependent system call and header file checks.

A separate native Win32 version of the header file is also provided (used
by the MSVC2005 native Win32 build).

* several hardcoded buffer sizes in the software have been made configurable
(but remain hardcoded). See for instance dinfo.c: 12 -->
PAVUK_INFO_DIRNAME and 1024-and-other-fixed-buf-sizes -->
BUFIO_ADVISED_READLN_BUFSIZE

* fixed several cases where dangling (i.e. free()d but not NULL-ed) pointers
caused havoc. Code has been quickly reviewed to locate and fix additional
spots that did not yet cause pavuk to go 'crazy Ivan' (Hunt for the Red
October, anyone? ;-) )
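
One common way to avoid this class of bug is to reset the pointer in the same
statement that releases it. The macro name FREE_AND_NULL is hypothetical and
is only meant to show the pattern:

```c
#include <stdlib.h>

/* Hypothetical macro: release the block and reset the pointer in one step,
 * so a later free() or NULL check on the same variable is harmless. */
#define FREE_AND_NULL(p) do { free(p); (p) = NULL; } while (0)
```

Since free(NULL) is a no-op in standard C, applying the macro twice to the
same variable is safe.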

* hardcoded lock filenames have been converted to #define's to allow these
to be changed in a single spot (config.h), improving portability. e.g.:

'._lock' --> PAVUK_LOCK_FILENAME

* UNIX-specific octal privs have been changed to their proper #define's to
allow for maximum portability (Win32 doesn't know '0644' but can cope with

S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH

though maybe in an odd way).

* fixed quite a few spots where an unidentified form encoding method would
lead to _very_ unstable behaviour, including crashes/core dumps. Look for

fi->method = FORM_M_UNKNOWN

assignments and additional FORM_M_UNKNOWN checks.

* added -no_dns support for those who have to work in an environment with
flaky or no DNS support (I had to as I was working on a box in a specially
configured, partially walled-off DMZ zone while developing and testing
pavuk against a web server.)

* fixed typos in the text as I came across them.

* the bufio overhaul also led to an overhaul of the -dumpxxx code,
removing/fixing several spots in the code which caused incorrect/unstable
behaviour. (e.g. code in doc.c)

* Fixed handling of compressed data for any text-based server response;
pavuk now correctly handles any gzipped/deflated text, including, for
instance, any 'text/javascript' content sent over the wire in compressed
form (tested on a Wikipedia-based HTTP/1.1 compliant server).

* added -progress_mode: several choices in progress verbosity.

* added -no_disc_io: test a grab/scan without writing anything to disc.
Mostly useful in combination with the earlier -hammer modes.

* fixed/updated HTTP error response handling in accordance with RFC2616 so I
can better see what a HTTP/1.1 compliant target is reporting back to
pavuk. (errcode.c et al)

* unified timing units to fix a few timing oddities: instead of minutes,
etc. the code uses seconds everywhere (apart, of course, from the few
locations where we use milliseconds ;-) )

-timeout is now in milliseconds!

* Added -rtimeout and -wtimeout command line parameters.
(unit: milliseconds)

* added -allow_persistent / -noallow_persistent commandline arguments to
allow/disallow the use of HTTP/1.1 persistent connections.

* added -dumpcmd and -dumpdir commandline arguments.

* added -bad_content commandline argument for use with the hammer/chunky
modes.

* added -report_url_on_err commandline argument: report the URL which was
processed while the error occurred.

* added -test_id commandline argument: this is included in the timing report
so reports can be better automatically processed / combined.

* added -page_sfx commandline argument to help pavuk identify what suffixes
are to be considered web pages (useful for scanning ASP and ASP.NET sites
which present unusual mime types with their pages).

* added -tlogfile4sum commandline argument: specify a log file where timing
info is stored. Handy when pavuk is not only used to grab the info off a
site but also scan & report site performance.

* added -encode commandline parameter as the counterpart of -noencode.

* added -nohtDig, -noquiet and -noverbose commandline parameters as
counterparts of -htDig, -quiet and -verbose respectively.

* added filepath support to -dumpfd and -dump_urlfd: by specifying the
option prefixed with a '@' character, pavuk will treat the option value as
a filepath specification instead of an OS file handle and subsequently open
the specific file internally. Note that adding yet another '@' character
as a prefix signals pavuk to _append_ to the specified file, instead of
_overwriting_ it.

This is useful when you wish to have those dumps but are working in an
environment where you cannot pass valid file handles through the
commandline.
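
The '@' / '@@' convention can be sketched as a small classifier. The type and
function names below (dump_target, parse_dump_target) are hypothetical and do
not come from the pavuk option parser:

```c
#include <stdlib.h>

typedef enum { DUMP_FD, DUMP_FILE_OVERWRITE, DUMP_FILE_APPEND } dump_kind;

typedef struct {
    dump_kind   kind;
    int         fd;     /* valid when kind == DUMP_FD */
    const char *path;   /* points into the argument for the file kinds */
} dump_target;

/* Classify an option value: "3" is a file handle, "@name" a file to
 * overwrite, "@@name" a file to append to. Returns 0 on success. */
static int parse_dump_target(const char *arg, dump_target *out)
{
    char *end;
    long v;

    out->fd = -1;
    out->path = NULL;
    if (arg[0] == '@' && arg[1] == '@') {
        out->kind = DUMP_FILE_APPEND;
        out->path = arg + 2;
        return out->path[0] != '\0' ? 0 : -1;
    }
    if (arg[0] == '@') {
        out->kind = DUMP_FILE_OVERWRITE;
        out->path = arg + 1;
        return out->path[0] != '\0' ? 0 : -1;
    }
    v = strtol(arg, &end, 10);
    if (end == arg || *end != '\0' || v < 0)
        return -1;
    out->kind = DUMP_FD;
    out->fd = (int)v;
    return 0;
}
```

The caller would then open the path with "w" or "a" semantics, or use the
inherited file handle as before.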

* added -dump_request and -nodump_request commandline arguments for use
with -dumpfd: when -dump_request is specified, the log file will include a
complete dump of each request sent to the server by pavuk. Thus you can
produce a complete audit trail of the exchange.

* replaced the DUMP_URLLIST macros in stats.c by two functions. Code is a
bit cleaner that way.

* fixed times.c which barfed on timestamps beyond 2037 (signed int wrap
around for time_t).

* added assert() checks at several locations in the code to help track down
unexpected behaviour which could lead to crashes (like it did till now).

* unified the proliferation of HEX2ASC-alike macros with and without off-by-
one offsets inside. Now there's one macro for each of 'em in tools.h.

* changed the configure.in option to --disable-threads to keep the pattern
consistent (--disable-xxx series of options in configure), but the default
behaviour remains the same.

* configure.in: as --disable-debug removes any debug-_related_ features from
the pavuk build, these options have been added: --disable-debugging will
create a default build with all debugging removed from the compiled
binaries. --disable-prof and --disable-gprof have been added to remove any
profile info from the default compiled binaries.

* added checks in configure.in for socklen_t, pid_t and a bunch of system
calls and header files that do not live in each environment.

2007-may-6

* included pthreads-Win32 based multithreading support in the native Win32
build.

* included EXPERIMENTAL tre (regex) support in the native Win32 build.

* fixed several lurking bugs (buffer overruns, etc.) which only showed in a
multithreaded environment.

* fixed locking bugs in the new bufio implementation.

* added Win32 memory leak + heap checking for the DEBUG build: many memory
leaks have been tracked and fixed. (MSVC <crtdbg.h> based)

* fixed memory leak due to wrong scope in report_error() code.

* added DBGxxx macros to aid heap tracking for the debug build. See
DBGdecl/DBGpass/DBGvars usage.

* removed a very nasty memleak in html_parser_get_url() which would leak at
least 3 blocks for each rejected local anchor URL - and those come quite a
few! Took me a day to track it down. :-(

* added filtering so gzipped/compressed files on the server are not
decompressed unintentionally while the server supports Accept-
Encoding:gzip or compress.

( doc_download_helper() in doc.c )

2007-may-11

* renamed function should_leave_persistent() to the more appropriately named
should_keep_persistent()

* Updated 'chunky' source to the state of the latest pavuk CVS contents (as
of today) as this code has not yet been merged into CVS itself.

* fixed bugs in -scenario handling, when scenario files produced by pavuk are
re-used in the Win32 environment

* fixed bugs in path & file type commandline arguments for the native Win32
port.

* fixed bug in retrying/resuming download for RFC2616 (HTTP/1.1) 'chunked'
content download handling.

* merged -allow_persistent / -noallow_persistent commandline arguments with
the equivalent -persistent/-nopersistent feature from the official pavuk
CVS sources.

Also improved the code a bit: added the 'Connection: close' header for
requests over -nopersistent connections, so the server will close the
connection for us.

* added the -ignore_chunk_bug commandline argument to allow pavuk to handle
RFC2616 'chunked' downloads from buggy (IIS) web servers.

( See also:
http://www.subbu.org/weblogs/main/2004/11/persistent_conn.html
http://skrb.org/ietf/http_errata.html#chunk-size
http://www.apps.ietf.org/rfc/rfc2616.html#sec-3.6.1
http://www.jmarshall.com/easy/http/
)

2007-may/june

* recompiled in 64-bit Linux (SuSe 10.2) and fixed a few items in the
Makefile.am, configure.in and ac-config.h.in files. Also added the tests\
and www\ directories to the distro.

* fixed a few 64-bit compile warnings; at least the test cases in tests\
perform OK now on a 64-bit Linux system.

* updated the man page a bit; still a lot more to do. Where is that 'nroff
for dummies' cheatsheet when you need it? ;-(

* listed -use_http11 as 'on' by default now.

* moved MODE_MIRROR unescape code section up in url.c to line 1682 in
url_get_local_name_real() as this code would otherwise have no effect at
all in any environment where the '%' percent character is included in the
FS_UNSAFE_CHARACTERS charset (for example: Win32).

* PARAM_DOUBLE default values are now fixed point values in 'long' integer
format; the current values in the program (all 0.0) are clearly within
range _and_ it 'saves' on compiler warnings quite a bit. (We've still some
way to go before we get anywhere near an '[almost-]zero-warning' cross
platform portable build: a few int to pointer and vice versa casts remain.)

* fixed bug in cfg_get_num_params() which would access uninitialized memory
out there in NirvanaLand when a PARAM_UNSUPPORTED option was passed to
pavuk.

* Fixed configure.in to include 'debug' build handling for KDevelop (which
would pass '--enable-debug=full' to ./configure).

* updated the configure.in script to increase portability (opendir/closedir:
dirent.h et al)

* included a few autoconf macros in the m4 directory for easier/proper
portability support using autoconf et al.

* bugs fixed from BUGS list: multithreaded mode is not as stable as single
threaded (fixed at least for the CLI version of pavuk; the GTK GUI version
is in a rather bad shape)

* bugs fixed from BUGS list: signal handling / timeout does not really work
(at least not in multi threaded downloads; after a SIGINT pavuk just
hangs). This has also been fixed for the CLI version of pavuk at least.

* Win32 port now includes JavaScript support (using the statically linked
Mozilla js library).

* fixed short option definitions in options.h: -tp / -tsp et al

* 'fixed' GUI for Javascript enabled builds (GTK2) - WARNING: it compiles
now, but has NOT been tested, so expect bugs here!

* merged the 'chunky' code with the pavuk main source tree. Now 'chunky' is
equivalent to building pavuk with './configure --enable-hammer'.

* set default from -leave_site to -dont_leave_site to prevent 'blown up' web
crawls when this filter parameter has not been specified.

This change includes a fix for the cfg/command line handling of pavuk for
the conditions section (see condition.h + config.c) as pavuk assumed
sizeof(long)==sizeof(int) in these code sections.

* Now the proper GPL license (GPL, not LGPL) is included in the file
./COPYING.

2007-sep

* fixed processing of zero byte length files (robot.txt at figleaf.com,
etc.): no more crash/assertion failure due to NULLed docu->contents.

* fixed a few memleaks.

* added extra error checking for file rename operations as some issues were
found with the Win32 build when using a SAMBA-shared filesystem for
storing the spidered data/files. (It turned out that the same issues
existed when using native (NTFS, FAT32) filesystems.)

* dialed down the number of default threads from 3 to 1 (see BUGS) to
prevent a hail of (legitimate) rename error reports.

* added flock() implementation for Win32: when built with multithreading
support, having no valid flock() implementation is very dangerous!

* changed configure.in to detect both flock() and fcntl() file locking
mechanisms so pavuk will be able to support writing spidered content to
network shares on both Win32 and UNIX systems: flock() does not support
network share locks; fcntl() does, at least on the latest Linux kernels,
see man flock(2)

* added error reporting/checking for undesirable use of invalid flock()
implementation. (Useful when porting pavuk to other non-Unix platforms.)

* Fixed content/file size treatment code for items which are already
available locally (i.e. pavuk finds that the item at the remote has not
changed since the last time it fetched the item into the local cache).

* Fixed the conditions for when to display certain informational messages:
less screen clutter when not running in '-verbose' mode OR when running in
'-progress' modes.

* Fixed several error/info messages in the code section for decompressing
gzip/compress transmitted HTTP content.

* Fixed handling of gzip/compress transmitted content when retrieved from
local store instead (when pavuk discovers that the file at the remote site
has not changed since the last time it was fetched and stored on your
local disc).

* Fixed a few memleaks.

* Changed the DBGvars/DBGpass/DBGargs macros used for tracing memory
allocations in debug mode to make these macros look more like regular 'C'
functions to 'demented' code formatters and analysis tools. The drawback
is that these still look 'weird' in function prototypes, but that causes
quite a few less errors/warnings than the old style.

* Fixed bugs in get_abs_file_path() directory detection and Win32 abs path
processing.

Also fixed code which produced double slashes in file paths on occasion,
causing trouble on Win32 platforms. (Fix applied generally.)

* Fixed mk_native() allocated string management pool to support printf() et
al where up to 3 mk_native() calls are made in the argument list. This is
important to prevent spurious crashes in multithreaded mode when the worst
case scenario for mk_native() applies: all threads are executing a
printf()-style statement which has multiple calls to mk_native() in the
argument list.

Currently overdimensioned a bit as the actual code only has two
simultaneous calls while the pool now is dimensioned to tolerate 3
simultaneous calls per thread.
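
The underlying problem is that a function returning a pointer to a single
static buffer cannot be called more than once in one argument list. A
single-threaded sketch of the rotating pool idea follows; the names
pooled_copy, NATIVE_SLOTS and NATIVE_SLOT_SIZE are hypothetical, and the real
mk_native() pool additionally has to account for multiple threads:

```c
#include <stdio.h>

#define NATIVE_SLOTS     3    /* results that stay valid at the same time */
#define NATIVE_SLOT_SIZE 260  /* assumed maximum path length */

/* Hypothetical stand-in for a pooled string helper: every call hands out
 * the next slot, so up to three results can coexist in one printf(). */
static const char *pooled_copy(const char *s)
{
    static char pool[NATIVE_SLOTS][NATIVE_SLOT_SIZE];
    static unsigned next;
    char *slot = pool[next];

    next = (next + 1) % NATIVE_SLOTS;
    snprintf(slot, NATIVE_SLOT_SIZE, "%s", s);
    return slot;
}
```

A fourth call reuses the first slot, which is why the pool has to be
dimensioned for the worst case number of calls in a single statement.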

* No more _strfindnchr() and strfindnchr(): strfindnchr() - and its use -
has now been fixed to match the (proper working) _strfindnchr().
[fnmatch.c/tools.c et al]

* Fixed const-correctness of several functions.

* Added '-mime_type_file' commandline option to help pavuk support an up-to-
date list of mime types and their filename extensions, using, for example,
the UNIX mime.types(5) config file as a source of MIME type information.

If the user does not specify the '-mime_type_file' option, the original
built-in defaults will be used instead.

This feature has been added to provide better support for the pavuk
-fnrules %M macro: this macro now will use this configuration to produce a
suitable filename extension for each MIME type: the first extension listed
in the '-mime_type_file' config file for the given MIME type will be used
as extension for the %M macro.

* Changed the GTK GUI macros to become functions for ease of debugging. The
added (tiny) call overhead won't be a performance hit anyway.

* Fixed -fnrules handling: the generated path is cleaned up before it is
returned to pavuk for use.

Cleanup actions:
- duplicate '/' slashes are removed
- filenames and directory names which end in a '.' dot, get the dot
removed
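
The two cleanup actions can be sketched as one in-place pass over the path.
The function name clean_local_path is hypothetical, and leaving '.' and '..'
components untouched is an assumption of this sketch rather than documented
pavuk behaviour:

```c
#include <string.h>

/* In-place cleanup: collapse runs of '/' and strip one trailing '.' from
 * each path component. "." and ".." components are left alone. */
static void clean_local_path(char *path)
{
    char *r = path, *w = path;
    char *comp = path;      /* start of the component being written */

    for (;; r++) {
        if (*r == '/' || *r == '\0') {
            size_t clen = (size_t)(w - comp);

            /* drop a trailing dot unless the component is "." or ".." */
            if (clen > 1 && w[-1] == '.' && !(clen == 2 && comp[0] == '.'))
                w--;
            if (*r == '\0')
                break;
            if (w == path || w[-1] != '/')
                *w++ = '/';     /* emit a single separator */
            comp = w;
        } else {
            *w++ = *r;
        }
    }
    *w = '\0';
}
```

For example, 'a//b./c.txt.' is rewritten to 'a/b/c.txt'.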

* Added '%X' to the -fnrules formatted processing to allow reformatting of
filenames using an optional mimetype-derived extension. This is useful
when grabbing Wiki (MediaWiki et al) sites when you'd like to store the
grabbed content using default mimetype-related filename extensions, so
instead of storing a file like

wiki/page/AboutThisSite

that would transform into

wiki/page/AboutThisSite.html

while pages like

wiki/static_page/contact.htm

would remain as is.

(Note: this might be considered shorthand for a -fnrules (...) expression
which compares both %e and %E. The intent of %X, however, is to only
allow %e extensions to pass which are 'valid' for the given MIME type and
force the %E mimetype based extension for all other cases.)

CAVEAT: %e/%E/%X/%Y will print the extension WITHOUT the leading '.' dot in
both simple mode and extended LISP mode.

* Added '%Y', '%A' and '%B' to the -fnrules macros: '%Y' uses the MIME type
preferred filename extension if the URL/filename doesn't have an extension
yet (while the rather similar '%X' will OVERRIDE the existing extension if
it is not listed with the specified MIME type).

'%B' prints the 'basic MIME type', i.e. the MIME type without the ';'
semicolon separated MIME attributes such as language, etc., while '%A' will
print these attributes (if they were passed to us by the server).

CAVEAT: %e/%E/%X/%Y will print the extension WITHOUT the leading '.' dot in
both simple mode and extended LISP mode.

All this allows for pavuk -fnrules commandline arguments like this:

-fnrules F '*' '%h:%r/%d/%b%s.%Y'
-mime_types_file ./mime.types
-tr_chr_chr ':\\!&=?' '_'

so we'll be able to grab a [Media]Wiki site while storing those pages as
regular 'abc_php_xyz.html', instead of 'abc.php?xyz' page/filenames.
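
The difference between '%Y' and '%X' can be summarised in a few lines of C.
This is a sketch of the selection rule as described above, not the pavuk
implementation; the function names and the case-sensitive comparison are
assumptions:

```c
#include <stddef.h>
#include <string.h>

/* mime_exts: NULL-terminated list of extensions for the MIME type, with the
 * preferred extension first (as read from a mime.types style file). */
static int ext_listed(const char *ext, const char *const *mime_exts)
{
    for (; *mime_exts != NULL; mime_exts++)
        if (strcmp(ext, *mime_exts) == 0)
            return 1;
    return 0;
}

/* %Y: keep any existing URL extension, else use the preferred MIME one. */
static const char *ext_for_Y(const char *url_ext, const char *const *mime_exts)
{
    if (url_ext != NULL && url_ext[0] != '\0')
        return url_ext;
    return mime_exts[0] != NULL ? mime_exts[0] : "";
}

/* %X: keep the URL extension only if it is listed for the MIME type. */
static const char *ext_for_X(const char *url_ext, const char *const *mime_exts)
{
    if (url_ext != NULL && url_ext[0] != '\0' && ext_listed(url_ext, mime_exts))
        return url_ext;
    return mime_exts[0] != NULL ? mime_exts[0] : "";
}
```

With the list { "html", "htm" } for text/html, a URL ending in '.php' keeps
'php' under %Y but is forced to 'html' under %X, while '.htm' survives both.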

* Added -fnrules 'fnseq' operator to the extended rules: compares a
wildcard pattern and a string a la fnmatch(3).

* Checked and updated manpage for the -fnrules operators (added 'ud' and
'sp' operators to the manpage).

* Added -fnrules 'sn' operator to the extended rules as counterpart of 'ns'.
'sn' uses strtol() to convert a string to a number, while 'ns' uses
printf() to format a number to a string. (See the man page.)

* Updated the man page a bit regarding '-fnrules'.

* sanitized escape_str(); a quick code review led us to a lurking bug in
uconfig.c@309, which has been fixed implicitly.

* Added/updated source code documentation: tools.c/tr.c source code comments.

* Added some sanity checks in the code (tools.c/tr.c/lfname.c)

* Added debug_level 'rules' to allow debugging of both simple and 'extended'
-fnrules expressions and '-fnrules' URL F/R matching.

* Different boxes exhibit different mktime() behaviour, especially when
handling out of range tm value sets. Besides, mktime() works in 'local
time' while some parts of the code require a robust UTC mkgmtime() (not
available on many boxes) --> ripped & introduced as tl_mkgmtime(). A local
time-aware equivalent with excellent out-of-range handling is available as
tl_mktime().

* Added additional error handling around calls which try to parse time
stamps using tl_mkgmtime() and tl_mktime() (times.c).

Basically, now both HTTP and FTP benefit from the new code which should
now process timestamps like the UTC timestamps they are, while 'out of UNIX
time_t bounds' timestamps (beyond the range 1970..2038 A.D.) are handled
in a more sane manner:

- out of bounds timestamps are reported by pavuk

- out of bounds timestamps are then 'sanitized', i.e. restricted to the
1/1/1970..31/12/2037 date range, i.e. a timestamp beyond the horizon,
like '1/4/2051' will be 'sanitized' (= restricted) to the upper bound:
31/12/2037. The same goes for timestamps from antiquity like '11/3/1969' (the
birthday of a certain person), which will be 'sanitized' towards
1/1/1970.
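
A sketch of a timezone-independent UTC conversion with the clamping described
above. It is not the tl_mkgmtime() code; the function names are hypothetical,
the day count uses the well-known 'days from civil' arithmetic, and tm_mon is
assumed to be within 0..11:

```c
#include <time.h>

/* Days since 1970-01-01 for a proleptic Gregorian date (month 1..12). */
static long long days_from_civil(long long y, int m, int d)
{
    long long era, yoe, doy, doe;

    y -= (m <= 2);
    era = (y >= 0 ? y : y - 399) / 400;
    yoe = y - era * 400;
    doy = (153 * (m + (m > 2 ? -3 : 9)) + 2) / 5 + d - 1;
    doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    return era * 146097 + doe - 719468;
}

#define TS_MIN 0LL           /* 1970-01-01 00:00:00 UTC */
#define TS_MAX 2145916799LL  /* 2037-12-31 23:59:59 UTC */

/* Convert broken-down UTC time to seconds since the epoch, restricted to
 * [TS_MIN, TS_MAX]. *clamped is set to 1 when the input was out of bounds,
 * so the caller can report it. */
static long long utc_seconds_clamped(const struct tm *tm, int *clamped)
{
    long long days = days_from_civil(tm->tm_year + 1900LL,
                                     tm->tm_mon + 1, tm->tm_mday);
    long long t = days * 86400 + tm->tm_hour * 3600LL
                + tm->tm_min * 60LL + tm->tm_sec;

    *clamped = (t < TS_MIN || t > TS_MAX);
    if (t < TS_MIN)
        return TS_MIN;
    if (t > TS_MAX)
        return TS_MAX;
    return t;
}
```

Working in long long avoids the signed overflow that a 32-bit time_t would
hit beyond January 2038.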

* Split up DEBUG into developer related stuff, such as memory/heap checking,
ASSERT/VERIFY, etc. and user related stuff (the -debug and -debug_level
command line arguments): ./configure is now fitted with an extra
parameter:

--enable/disable-debug-features

which will turn on/off -debug/-debug_level user level debugging support in
pavuk, while the existing

--enable/disable-debug

adds/removes additional developer checks, such as heap allocated checks
and ASSERT and VERIFY macros.

In the code, -debug/-debug_level related code is located within the
'HAVE_DEBUG_FEATURES' sections, while the developer debug/release builds
are still related to the standard 'DEBUG' #define.

This now results in three ./configure options that determine the (debug)
feature set of your binary:

--enable/disable-debugging
    compile a binary with source level debug info included and all
    optimizations DISabled for improved debugging (by using gdb or another
    debugger of your choice)

--enable/disable-debug
    include/exclude additional run time checks in your binary. Most
    important are the ASSERT and VERIFY pre/post-condition validation
    methods located throughout the code. The use of these is advised,
    though these may cause a performance hit.

--enable/disable-debug-features
    include/exclude user level -debug/-debug_level command line features,
    which help you as a pavuk user to 'debug' pavuk during the run. Using
    -debug, pavuk will be EXTREMELY verbose, which can be toned down by
    applying a -debug_level restriction filter. For example:

        -debug -debug_level all,!devel

    will be VERY verbose, but will NOT log any DEVEL level debug info,
    while:

        -debug -debug_level !all,rules

    will ONLY produce additional output for the RULES level, i.e. when
    pavuk processes -fnrules and/or JavaScript macros.

* Fixed crash when non-RFC compliant website was grabbed: see testcase 7a.

* Added targeted help: when options cannot be parsed correctly,
short_usage() will try to help the user by printing the full help for the
offending commandline option only. (Of course, I screwed up while using
debug_level flag sets _again_ :-( [Ger])

* Some improvements for network connectivity error handling and reporting.
(xvherror() added.) This is the result of some FTP tests with pavuk (tests
8b).

* Don't yak about 'Checking "robots.txt"' anymore when doing an FTP grab when
robots.txt is NOT applicable anyway.

* FTP: added crude 'autodetect/retry' mechanism for FTP servers which do not
like NLST (==> response code 550) but report correct directory content for
LIST (or vice versa). (ftp.c)

* FTP/HTTP: at debug level 'protoD' pavuk will now dump RAW data/content
received from the server before preprocessing (i.e. converting to HTML or
decompressing).

* Added command line option integer sizing support: byte sizes can now be
specified in K, M or G. Other integer values can also be postfixed with K,
M or G, but then these will be treated like the ISO values 1000, 1E6 and
1E9.
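
A sketch of such suffix handling. The entry above only states the decimal
factors explicitly; that byte sizes scale by powers of 1024 is an assumption
of this example, as are the function name parse_scaled and the restriction to
upper-case suffixes:

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Parse "<digits>[K|M|G]". Assumption: byte sizes scale by powers of 1024,
 * all other integer values by powers of 1000. Returns 0 on success. */
static int parse_scaled(const char *s, int is_byte_size, long long *out)
{
    char *end;
    long long v, mult = 1;
    const long long unit = is_byte_size ? 1024 : 1000;

    errno = 0;
    v = strtoll(s, &end, 10);
    if (end == s || errno == ERANGE || v < 0)
        return -1;
    switch (*end) {
    case 'G': mult *= unit; /* fall through */
    case 'M': mult *= unit; /* fall through */
    case 'K': mult *= unit; end++; break;
    default:  break;
    }
    if (*end != '\0' || v > LLONG_MAX / mult)   /* trailing junk or overflow */
        return -1;
    *out = v * mult;
    return 0;
}
```

Under these assumptions '64K' means 65536 for a byte size and 64000 for any
other integer option.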

* Additional memory leak fixes in case pavuk is fed an invalid commandline.

* NTLM support code: fixed a few glaring bugs.

* Added O_SHORT_LIVED to lock file open() flags for better Win32 behaviour.

* Fixed code to load the pavuk configuration settings from, in order of
appearance:

env:PAVUKRC_FILE
~/.pavukrc
SYSCONFDIR/pavukrc

which matches the description in the manual.
(see also man page)

2008-jan

* Added 'js' flag to '-debug_level', which is used to dump a lot of detail
about the pattern matching and transformation applied to JavaScript code
using the '-js_pattern' and '-js_transform / -js_transform2' commandline
options.

* Added sanity check for '-js_pattern' and '-js_transform[2]' regexes, which
MUST contain a subexpression for them to 'work' as expected.

* removed re_pmatch_sub() and changed the code where it was used to work
with the available re_pmatch_subs() call, which allows for more elaborate
validation anyway. See htmlparser.c.

* Removed a regex handling bug in the -js_transform[2] code, which would
crash pavuk when using regexes where the first subexpression might be
empty.

The crash is due to the fact that the regex parser would return indexes of
'-1' for these empty subexpression(s), resulting in out-of-bounds memory
writes in the rewrite code. This in turn would nuke the heap, so after
that it was only a matter of time for pavuk to fail dramatically.
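
As an illustration of the failure mode (not the actual pavuk fix): POSIX regexec() reports a subexpression that took no part in the match with offsets of -1, so a rewrite routine has to test for that before using the offsets as indexes. The helper name copy_submatch() is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/*
 * Hypothetical helper: copy subexpression 'idx' of a successful match on
 * 'subject' into 'dst' (capacity 'dstlen'). A subexpression that did not
 * take part in the match has rm_so == -1 and yields an empty string
 * instead of being used as an array index.
 * Returns the number of bytes copied, or -1 if 'dst' is too small.
 */
static int copy_submatch(const char *subject, const regmatch_t *pm,
                         size_t idx, char *dst, size_t dstlen)
{
    size_t len;

    assert(subject != NULL && pm != NULL && dst != NULL && dstlen > 0);

    if (pm[idx].rm_so < 0 || pm[idx].rm_eo < pm[idx].rm_so) {
        dst[0] = '\0';          /* unmatched subexpression: nothing to copy */
        return 0;
    }
    len = (size_t)(pm[idx].rm_eo - pm[idx].rm_so);
    if (len >= dstlen)
        return -1;
    memcpy(dst, subject + pm[idx].rm_so, len);
    dst[len] = '\0';
    return (int)len;
}
```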

2008 feb 04

* Added DEBUG_MISC() lines to solve sourceforge.net issue: [ 1852885 ] to
improve manipulation by locally stored files

* Included provisional fix (I don't have a working sample run to reproduce
the issue (yet)) for sourceforge.net issue: [ 1852884 ] infinite loop on
unexpected responses

* Cleaned up the mess that was -progress_mode.

* Cleaned up several DEBUG_xxx macro mistakes

* Added a little description to the 'hidden' -htDig commandline option,
which can be used to dump the server-transmitted MIME headers for each
URL, similar to the htdig tool.

* Added a bit of documentation for the -rollback option (which was
undocumented)

2008 mar 20

* GNU gettext tools don't like '\r' in i18n strings --> fixed by changing
the related printf() statements in src/doc.c

* started update of configure scripts to the latest autoconf/automake.

Also reordered the NEWS file so it will work with the new, stricter

./bootstrap && ./configure && make distcheck

distro test cycle.

2008 jul 10

* fixed ';' semicolon bug in http.c near line 2074 which caused incorrect
decoding of the HTTP/1.x response code header.

* fixed gzip/compress/... content compression support (HTTP/1.1 Accept-
Encoding); the previous code was a valiant attempt to 'fix' the client
side (pavuk) to cope with buggy web servers which send the wrong encoding
type for already compressed files, but this would screw up particular
responses by *well-behaving* web servers. Of course this would only happen
in rare circumstances so it was kinda hard to track down.

Documentation for -Enc/-noEnc has been updated to reflect this situation
and the code now (hopefully properly) finally supports compressed data
transmission for RFC2616-compliant web servers.

If you find that your 'downloaded' compressed files are already
/incorrectly/ DEcompressed by pavuk, this is NOT the fault of the client
(pavuk) but evidence that your server is behaving inappropriately and the
proper remedy for this is the use of the option '-noEnc' which turns this
feature off so the server is not allowed to screw up in this way any more.

Also made sure one can check if pavuk has been built with compression
support by calling 'pavuk --version' and looking at the feature list.

* autoconf/configure script: using the highly undocumented v_cflags or other
x_* variables as environment variables to hack the configure script (you
could do that, especially with v_cflags) has been obsoleted while the
configure and m4/* scripts have been upgraded to support autoconf
2.62/automake 1.10 and use ONLY *documented* AC.*/etc. macros from now on.

Note: thanks to the JavaScript library issues on SuSe10.2/AMD64 (older JS
lib version and seemingly partial header install), I may have failed
to eradicate all undocumented macros.

* Extra note about configure.in: bash, at least on SuSe10.2/64-bit, handles
'if eval test ...' just ever so slightly different than 'if test ...',
especially where it comes to 'test -n'. As these styles were mixed rather
arbitrarily before, the 'if eval test ...' style has been completely
removed from the configure script, as this would sometimes render quite
unexpected (and incorrect!) results.

* fix_crlf.sh has been updated to ensure important Microsoft Visual Studio
files are not damaged by having their CRLF sequences converted to UNIX LF
line endings: this kind of thing will make MSVC spit you in the face and
reject everything you try until you give it back those CRLF line endings
in there. So much for XML as project file format and MSVC...

* extra fixes to ensure 'make distcheck' does not barf up a hairball. This
includes enforcing the permanent inclusion of the 'po' subdirectory in the
Makefile set for multilingual support.

* configure/Makefile(s): if you don't have one or more of the
archiving/compression tools compress/lzma/gzip/tar/7z(7zip) installed on
your system, we don't go belly up at config ~ nor at 'make dist' time
anymore. This, of course, includes correct behaviour at 'make distcheck'
time: only use/test those 'GNU standard' formats, which can be created on
your box.

* Added the 'bootstrap' shell script, next to 'autogen.sh'. I know they
serve the (almost) same purpose, but 'bootstrap' is far more sophisticated
than autogen.sh and I didn't wish to overwrite 'autogen.sh'. Besides, IDEs
on UNIX boxen expect either the one or the other (there's no single
'standard' for this), so we might as well provide both.

At a later time, we might probably point autogen.sh to bootstrap.

* Updated the mime.types MIME 'hint' file: currently, it's a mix of

1) all properly registered MIME types ( http://www.iana.org/assignments/media-types/ )

2) the mime.types file provided with the latest Apache/XAMPP

3) my (Ger Hobbelt) additional file extension hints as used on my own
servers. This is mostly about professional graphics ~ and modern
'scene' audio/video container formats, such as Matroska. This only adds
extensions for otherwise already existing MIME types.

* Updated the DocBook-based documentation for several options (-Enc/-noEnc, ...)

* 'pavuk --version' now also reports if ZLIB support is included in the
binary. This is important for '-Enc'.

* Fixed the '-Enc' compressed transmission and HTTP header processing code
to act properly with fully RFC2616-compliant web servers, discarding the
old 'hack/fix' attempt to solve a non-compliant server issue at the
client, as this would break things for fully compliant servers in the rare
(but extremely annoying) use case:

- pavuk with '-Enc' option

- webserver is fully RFC2616 compliant

- pavuk issues request for file in a .tar.Z or other gzip/compress
compressed format, where the file on the server is only slightly
compressed (fastest compression).

- webserver will transmit file to pavuk, but due to pavuk reporting it is
able to handle compressed transmission AND the server discovering that
the content can be compressed quite some more than it already was, the
file will be transmitted after a server-side just-in-time compression
round.

- pavuk receives the data. The old hacked code would NOT decompress the
data. However it SHOULD because the server PROPERLY reported 'Content-
Encoding: gzip' to pavuk. End result: grabbed data which you cannot
process nor trust to be in the same format as stored on the server as it
all 'depends' on arbitrary conditions which you cannot control: is the
web server able to compress the data before transmission? Is the web
server configured to allow compression? Etc.

This use case has now been fixed.

The effect of BADLY behaving web servers (which send 'Content-Encoding:
gzip' for any .Z, .z or .gz files (IIS x.x and other servers which are not
configured to /properly/ handle files and MIME types)) is described in the
DocBook manual page now, including the fix for this (specify the '-noEnc'
commandline option with pavuk).

* active FTP: timeout and stop/break handling slightly improved: now pavuk
should terminate under all circumstances once a break or stop has been
signalled.

* Changed the default for '-url_strategy' from 'level' to 'leveli' to make
pavuk behave more like your regular web browser (with a user clicking
through web pages).

* Initial fix for NTLM support for 64-bit Windows. (Only lightly tested.)

This includes converting that bit of code to support the C99 intNN_t types
(where NN is one of 8, 16, 32), while the configure script takes care of
providing the proper types for not-fully-C99-compliant environments.

* The TRE regex package would barf up a hairball due to the incorrect header
file being loaded. ./configure now recognizes TRE specifics a bit better
and the code now loads the proper header file (<tre/regex.h> instead of
<regex.h>). This is important on systems which have multiple, ever so
slightly incompatible regex processing libraries installed.

* Improved diagnostics a little bit by adding reporting support for
URL_PARENT_REWRITING, i.e. the situation where a parent page of a grabbed
page is loaded for the sake of adjusting (rewriting) the URLs in its
content.

* Fixed code so it would compile in full (-DDEBUG) debug mode on UNIX.

* autoconf/configure: ran into some weird issues due to inconsistent M4 []
quoting: quite a few lines did without it. Turns out that this is a BIG
No!No! as adding the AX_ADD_OPTION() macro turned this lurking mess into a
true disaster.

Fixed by applying [] quoting throughout. The only place where I didn't do
it, is in the first and second args of AC_DEFINE() -- which should be used
instead of AC_DEFINE_UNQUOTED when you don't need the latter's extra
functionality anyway -- and the first arg of AC_DEFINE_UNQUOTED(). Any
other spot where [] quotes are missing in the M4 macros and/or
configure.in? Consider that a bug and please report so I can fix it.

* Finally got the configure system to recognize my JavaScript libraries and
all. Tugged and tweaked a few items in the bindings to allow maximum
flexibility for the JS code when it is used to filter URLs (e.g.
JavaScript pavuk_url_cond_check() function).

* Updated jsbind.c to use latest SpiderMonkey 1.8.x (tested on Win32)

* Changed man/Makefile to ensure HTML is not recreated every 'make' run, but
only when manpage changes. This should really copy the results from
./doc/, but that's for later...

* DocBook documentation: tweaked man page generation to mimic original
manpage title exactly.

* DocBook documentation: updated '-version' info (important to see at
run-time what abilities you've got with /your/ pavuk).

* Win32/MSVC: all project files have been updated to produce next to
Win32/x86: Win64/AMD64 and Win64/Itanium binaries. These project files
assume the existence of all optional libraries: OpenSSL, SpiderMonkey
(JavaScript), zlib.

Where to get those, preferred directory layout, etc. to be published, so
others can build from source on Win32/64 too and get the same results.

2008 jul 20

* tweaked configure+makefiles so that a 'make dist' from CVS becomes
possible: there were quite a few references to yet unpublishable files in
my makefiles (Ger Hobbelt).

* config section: improved adherence to C standards: no more potentially
dangerous mixed use of function and data pointers by typecasting function
pointers into data pointers and vice versa.

This has been resolved by an added layer of indirection, which makes it
all very legal C again. It goes somewhat like this:

function_pointer_type ptr = &function;
data_pointer_type d = &ptr;

then use (d[0])(...) to call the function.

This contrasts the old code:

data_pointer_type d = (data_pointer_type)&function;

and function invocation using:

((function_pointer_type)d)(...)
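
Spelled out as compilable C, the indirection might look like the sketch below. The type and function names (handler_fn, twice, handler_as_data, call_via_data) are made up for the example and do not come from the pavuk sources.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*handler_fn)(int);          /* function pointer type */

static int twice(int x) { return 2 * x; }

/* The function pointer lives in an object, so its address is a data pointer. */
static handler_fn handler_slot = twice;

/* A generic table entry may legally carry a 'void *' to that object. */
static void *handler_as_data(void)
{
    return &handler_slot;
}

/* Convert back to 'pointer to function pointer' and call through it. */
static int call_via_data(void *d, int arg)
{
    handler_fn *slot = d;

    assert(slot != NULL && *slot != NULL);
    return (*slot)(arg);                 /* same as (slot[0])(arg) */
}
```

The function pointer itself is never cast to or from a data pointer; only the address of the variable holding it travels as 'void *'.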

* Added support for parsing 'hidden' CSS and JavaScript in HTML. The support
is also extended to generally parse inside HTML comments PLUS Microsoft IE
CC's (Conditional Comments):

-read_css
-read_cdata
-read_msie_cc
-read_comments

These are all enabled by default; documentation has been updated for these
as well.

* Fixed CSS and [Java]Script handling in the HTML tokenizer/parser, which
was feeding the filters and URL extractors (htmlparser.c).

Now the code can cope better with incorrectly formatted pages / files.

* Reordered the HTML tags in htmltags.c in a preparatory move to check the
list for missing attributes (onXXX JavaScript items for one! several are
missing) and HTML 3/4 tags. (htmltags.c)

2008 aug 13

* updated the -debug_level related code; DEBUG_DEVEL() and a few others now
'automagically' report the sourcefile+lineno without the need to specify
these explicitly + some DEVEL_*() calls have been shifted to other
'-debug_devel' levels (net, mtthr, htmlform, ...)

* completed the -debug_level tracing for multithreaded runs: now all
semaphore accesses can be traced using the -debug_devel mtthr

* Major fix for bufio+socket code: no more lockup for pavuk due to delayed
reception of response data (tl_selectr() would incorrectly lock
indefinitely -- which proved to be a generic coding mistake in both
tl_selectr() and tl_selectw()) PLUS better error condition handling in
an attempt to improve handling of all sorts of 'spurious error conditions'
which may occur when your network suffers from packet loss or other
undesirable effects.

* -mode remind code fix for multithreaded use to make it match recurse and
other modes better; not severely tested so YMMV! (The old code wouldn't
work anyway, so it's an improvement anyhow).

* few code cleanups (#if 0 ... #endif)

* DocBook manual updated: now all return codes from pavuk are documented.

* minor code fixes for SSL/SFTP.

* updated configure and code to assist in compiling with both latest
SpiderMonkey and older Mozilla JavaScript libraries (Win32/64 and UNIX
respectively).

* Some unused error checks replaced by ASSERT() and some ASSERT()s replaced
by error reports as those errors /can/ happen in actual use (though
seldom).

* Fix for parsing malformed URLs (with multiple '#' and/or '?'): bookmarks
and query string parts would not be stripped/detached correctly as the
last '#'/'?' instead of the FIRST occurrence of '#'/'?' would be picked as
a separation point.
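
The difference between the old and new behaviour comes down to searching from the front of the string rather than from the back. A minimal sketch, using a hypothetical helper name (path_length()), of finding where the path part ends:

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: return the length of the part of 'url' that
 * precedes the FIRST '?' or '#', whichever comes first. strcspn() scans
 * from the front; a strrchr()-style search from the back would pick the
 * last occurrence and leave earlier separators in the path.
 */
static size_t path_length(const char *url)
{
    assert(url != NULL);
    return strcspn(url, "?#");
}
```

For a URL without any '?' or '#' the returned length is simply strlen(url).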

* Ran the gettext files through pot/pox/po again. Lots of 'fuzzies'... These
need to be fixed.

* EXPERIMENTAL: added preliminary code for extended JavaScript support:
hooks to process HTML and CSS just like you can process embedded <SCRIPT>s
now. The new hooks are still 'nulls', i.e. do not have any effect.

This is a work in progress; it compiles & runs (tested on UNIX and Win32
in multithreaded mode) but the new hooks still need to be implemented.

The goal here is that all grabbed (parsable) content should be processable
by custom JavaScript script functions AND when more than one URL is found,
the JavaScript code should be allowed to add those extra URLs to the pavuk
queue (using the new url.queue() JavaScript PavukUrl object method --
currently a 'nil' member function as it still must be fully implemented).

* isatty() fixes which check for error conditions and do /not/ provide
special 'console oriented' features when isatty(0) produces an error (may
happen on Win32/UNIX).

* Checked and updated all header files (after I ran into a cyclic dependency
when changing a bit of code): no .h files will #include "config.h"; all .c
files /do/ #include "config.h" as the first header.

System-dependent stuff (TRUE/FALSE definitions and a few other bits) has
been moved to config.h (where it belongs IMO) and removed from tools.h

This is a change required for the gzip fix [SF bug #2050527].

* Preliminary fix for CSS url grabbing and rewriting bug [SF bug #2050537].

The new code will now try to keep these four styles of <url> formatting
in CSS intact -- this is done so as to keep particular CSS browser hacks
intact as much as possible:

@import "<url>"
@import url(<url>)
@import url('<url>')
@import url("<url>")

and of course the use of 'url()' elsewhere in any CSS is treated like the
url() examples above, i.e. NONE of these should be changed regarding <url>
delimiters (quotes or braces) when rewritten by pavuk.

The ONLY situation where pavuk will CHANGE the quotes is when a <url> is
found to contain the delimiter quote itself: in that case the quotes are
changed from ' to " and vice versa.
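
The quote-switching rule could be sketched like this. pick_css_quote() is a hypothetical name, and the case where the URL contains both kinds of quote is not covered by the description above, so the sketch simply keeps the original delimiter there.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: choose the delimiter to emit around a rewritten
 * CSS URL. 'orig' is the delimiter found in the source ('\'' or '"').
 * The delimiter is kept unless the URL itself contains it, in which
 * case the other quote is used. If the URL contains both quotes the
 * original is kept (assumption: this case is not specified above).
 */
static char pick_css_quote(const char *url, char orig)
{
    char other;

    assert(url != NULL && (orig == '\'' || orig == '"'));

    other = (orig == '\'') ? '"' : '\'';
    if (strchr(url, orig) != NULL && strchr(url, other) == NULL)
        return other;
    return orig;
}
```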

2008 aug 18

* minor fixes to the included mime.types file

* configure: added support/auto-detection for the GNU GDB extended debug
output (-ggdb -g3) for when building a debug build.

* NTLM: fixed code for Win64 and other 64-bit platforms which do or do not
support structure packing.

* documentation update: -[no]chunk_bug commandline argument finally
documented (was in there already for a longer time; is a special fix for
badly behaving IIS web servers which transmit data in 'chunked mode').

Also upgraded the documentation for the -tr_str_str/tr_chr_chr options so
one can finally read how to use [:print:] and other definitions in there
for -tr_chr_chr and be able to determine up front what the bugger will do
for you.

For example:

Why does -tr_chr_chr '[:hexnum:]' '0123456789abcdef' *not* do what you
expect when the filename has any of the a..f characters? (Answer: they all
become 'f' as [:hexnum:] actually expands to

'0123456789ABCDEFabcdef'

itself, so it is longer than the destination set and by definition any
'overflow' will be replaced by the last character in the target set.)
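
The overflow rule can be reproduced with a few lines of C. This is only a sketch of the mapping rule as described above, with a hypothetical function name (tr_chr()); it is not the pavuk tr() implementation and it takes already-expanded sets.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: translate character 'c' using the expanded sets
 * 'from' and 'to'. A character at position i in 'from' maps to to[i];
 * positions past the end of 'to' map to the last character of 'to'.
 * Characters not in 'from' are returned unchanged.
 */
static char tr_chr(char c, const char *from, const char *to)
{
    const char *hit;
    size_t idx, tolen;

    assert(from != NULL && to != NULL && to[0] != '\0');

    if (c == '\0')
        return c;               /* strchr() would match the terminator */
    hit = strchr(from, c);
    if (hit == NULL)
        return c;
    idx = (size_t)(hit - from);
    tolen = strlen(to);
    return (idx < tolen) ? to[idx] : to[tolen - 1];
}
```

With the 22-character source set and the 16-character target set from the example, 'A'..'F' land on 'a'..'f' while 'a'..'f' sit at positions 16..21 and therefore all become 'f'.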

* HTML/CSS/JavaScript parent rewriting was sometimes flaky; this has been
fixed by fixing several bits of antiquated code in pavuk: now all code
sections are equally aware of URL_ISHTML, URL_ISSTYLE and/or URL_ISSCRIPT.

Several functions have been adapted to mirror the new awareness:

ext_is_html() has been enhanced and has been renamed to actually show its
intended function: ext_is_parsable() -- which can be an HTML, CSS *or*
JavaScript file! (Not only HTML can be parent of other URLs and need
updating ('URL parent rewriting').)

[ SF bug #2050537 ] CSS @import bad / HTML corrupted --> fixed

* On SuSe10.2/AMD64 glibc6 dumped core when running pavuk in full-out
'-debug -debug_level all' (the latter is implicit when you use '-debug')
mode. This was caused by glibc's printf() functions *sensibly* executing
a strlen() operation on the data fed to one of several '%.*s' printf()
formatting parameters, while those data series had NOT been NUL
terminated.

This would happen when debugging pavuk while fetching data from a gzip-
enabled web server: the gzip/inflate code would NOT append a new NUL
sentinel.

* Several other '%.*s' and '%s' related core dump spots in the DEBUG_XYZ()
code which would dump downloaded content have been fixed by feeding the
data through an enhanced asciidump function -- which will switch to HEX
dumping when the content to be shown for scrutiny contains a large amount
of non-ASCII data (> 10% is the current heuristic to switch over).
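
A rough sketch of such a heuristic follows. The function name and the exact definition of 'non-ASCII' (anything outside printable ASCII plus tab, CR and LF) are assumptions made for the example; only the 10% threshold comes from the description above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: decide whether a buffer of 'len' bytes should be
 * shown as a HEX dump rather than as text. Works on an explicit length,
 * so the data does not have to be NUL terminated.
 * Returns 1 when more than 10% of the bytes are non-ASCII.
 */
static int wants_hexdump(const unsigned char *buf, size_t len)
{
    size_t i, odd = 0;

    assert(buf != NULL || len == 0);

    for (i = 0; i < len; i++) {
        unsigned char c = buf[i];
        int printable = (c >= 0x20 && c < 0x7f)
                        || c == '\t' || c == '\r' || c == '\n';
        if (!printable)
            odd++;
    }
    return odd * 10 > len;      /* i.e. odd / len > 10% */
}
```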

* glibc6 on SuSe10.2/AMD64 would also dump core when being fed a 110K string
to a printf '%s' statement. This has been fixed by always limiting the
amount of content to be displayed when debug-printing downloaded data
(various '-debug_level's)

* gzip/inflate would fail to perform on 'non-parsable' content, i.e. plain
text files downloaded from a gzip-enabled web server. This has been fixed.

CAVEAT: The current gzip/inflate code does not deliver when it is fed very
large files. Hence, when downloading VMware images and/or multi-GB
ISO files, a workaround is to specify -noEnc. This will be fixed
at a later date.

[SF bug #2050527] nonparsed files saved in (wrong) compressed when using
HTTP --> fixed

* Parent rewriting would try to treat all parents as HTML, which is VERY
wrong when the actual parent is a CSS stylesheet or a JavaScript script
file. Fixed.

* unified variable names for 'struct doc' variables: it is *QUITE*
irritating to lose your display of 'docu' contents just because this call
uses 'docp' for the same (or 'html_doc') while trying to track down
lurking parent rewriting and file URL parsing bugs.

Updated all sourcefiles to the use of varname 'docu' for the current
document. 'docp' and 'html_doc' have been renamed.

* two bugfixes for the tr() code: (1) when using X-Y character ranges, the
size estimator would allocate far too little space. This has been fixed. (2)
the documentation says it well: you cannot include a NUL in a tr()
character set. In one case (a range at the start of the spec like this:
'-z') the code would actually attempt to insert such a NUL anyhow, causing
a subtle bug. Fixed. And a minor code cleanup.

* fixed argument quoting for external app invocation, which is particularly
important for Windows machines: they treat '-quoting quite differently from
"-quoting. Fixed by using "-quotes instead of the original '-quotes.

* -enable_js is now turned ON by default - just like the documentation
already said.

KNOWN ISSUE: empty lines in JavaScript code and files get stripped by
pavuk on rewriting; this will be fixed at a later date.

* fix in mime.types file for CVS file extension + added mime types for
Microsoft Office 2007

* fixed heap corruption in ainterface.c when calling append_starting_url()
when url has been specified in the extended '-request' format, including
a predefined local filename. (Would dump core on some systems.)

* moved the url2diag and info2diag functions from recurse.c to where they should
have been: url.c -- to resolve a cyclic dependency.

* fixed up the '-request' format url parser/decoder url_parse() call: several
types of input specification error would be silently rejected (now pavuk
prints a suitable error message to tell the user what [s]he did wrong and what
was expected) + a few tugs & tweaks to fix behavior for parsing extended
URL specifications (including cookies, predefined local filenames, etc.) and
an extra '-debug' (level: URL) line to help you diagnose how the '-request's
have been parsed/decoded.

* now you can use the extended '-request' URL format anywhere on the
commandline and/or your pavuk configuration files -- as long as you keep
it within quotes on the commandline of course, e.g.

pavuk "URL:http://example.com/ LFNAME:example.html"

* fix: config files generated by pavuk now properly select the 'short format'
(URL:....) instead of the 'long url spec format' (Request:....): previously
pavuk would lose information about web forms, cookies, local filenames, etc.
for some types of requested url.

* quickfix for issue reported on the mailing list regarding JavaScript
interface functions causing the build to fail - which happened when no
JavaScript library could be found.

NOTE: on Linux, the JS libraries and headerfiles seem to get installed in
various places. The current ./configure script looks for the
jsapi.h
header file in the directory
/usr/include/js
unless you specify the '--with-js-includes=<dir>' option when running
./configure.

The same goes for the js library itself: the current configure script
looks for either libjs or libmozjs in any of these directories:
/usr/lib64/thunderbird
/usr/lib64/firefox
/usr/lib64
/usr/lib/thunderbird
/usr/lib/firefox
/usr/lib
unless you specify the ./configure --with-js-libraries=<dir> option
to point to your specific libjs.a / libmozjs.a

* added an advanced example of use to the pavuk DocBook documentation
which will end up in the manpage (where it's a bit too much, but then
at least the users have an extended example of actual use) -- example
shows how to grab the up-to-date content from a MediaWiki-based web
site.

* added S/M/H/D unit support for the time argument decoder function

* Updated the manual regarding:

- all missing 'hammer mode' options

- the missing -rtimeout and -wtimeout options

- checked first few options in options.h and made sure those were all
documented. (This is a work in progress...)

* All timeouts are now in milliseconds, except the -max_time one, which is
in minutes.

All timeout arguments (except -max_time) now recognize the alternative
units for specifying time: s/m/h/d/S/M/H/D: second, minute, hour, day.

When no unit has been specified, the unit 'milliseconds' is assumed.
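
A minimal sketch of such a decoder, assuming a hypothetical function name (parse_timeout_ms()) and whole-number input; the unit letters and the millisecond default are taken from the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: convert "<digits>[s|m|h|d]" (either case) to
 * milliseconds. Without a unit the value is already in milliseconds.
 * Returns 0 on success, -1 on malformed input. Overflow of the
 * multiplication is not checked in this sketch.
 */
static int parse_timeout_ms(const char *arg, long long *ms)
{
    char *end;
    long long value;
    long long mult = 1;

    assert(arg != NULL && ms != NULL);

    errno = 0;
    value = strtoll(arg, &end, 10);
    if (end == arg || errno == ERANGE || value < 0)
        return -1;

    switch (*end) {
    case 's': case 'S': mult = 1000LL; end++; break;
    case 'm': case 'M': mult = 60LL * 1000; end++; break;
    case 'h': case 'H': mult = 60LL * 60 * 1000; end++; break;
    case 'd': case 'D': mult = 24LL * 60 * 60 * 1000; end++; break;
    default: break;
    }
    if (*end != '\0')
        return -1;

    *ms = value * mult;
    return 0;
}
```

Note that 'M' means minutes here, whereas for size arguments it is a magnitude suffix.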

* Fix for bug report #2158794: now all DEBUG_*() functions are called
using the proper number of arguments.

The code has been further enhanced for all printf()-like functions
(such as the DEBUG_*() and x*printf() functions) to enable GCC and MSVC
to check the format specification strings and parameter count and
type (GCC).

This led to the discovery of a multitude of errors, which have been
fixed (wrong integer sizes, etc.).
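
With GCC, this kind of checking is normally enabled through the 'format' function attribute. The sketch below shows the idea on a stand-in logging function; log_msg() and the PRINTF_LIKE macro are hypothetical names, not pavuk's own, and the MSVC side (which uses a different annotation mechanism) is left out.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Expands to GCC's format attribute where available, to nothing elsewhere. */
#if defined(__GNUC__)
#define PRINTF_LIKE(fmt_idx, first_arg) \
    __attribute__((format(printf, fmt_idx, first_arg)))
#else
#define PRINTF_LIKE(fmt_idx, first_arg)
#endif

/* Hypothetical printf-like function: formats into 'buf' (size 'len'). */
static int log_msg(char *buf, size_t len, const char *fmt, ...)
    PRINTF_LIKE(3, 4);

static int log_msg(char *buf, size_t len, const char *fmt, ...)
{
    va_list ap;
    int n;

    assert(buf != NULL && fmt != NULL);

    va_start(ap, fmt);
    n = vsnprintf(buf, len, fmt, ap);
    va_end(ap);
    return n;
}

/*
 * With the attribute in place, a call such as
 *     log_msg(buf, sizeof buf, "%s: %d", 42);
 * is flagged at compile time (-Wformat): wrong type for %s, missing
 * argument for %d.
 */
```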

* Preliminary code move to allow downloading extremely large entities
(larger than 2GB) such as DVD ISO images: this has been done by more
judicious use of the size_t and ssize_t types instead of simply 'int'.

On 64-bit platforms, size_t/ssize_t can handle 64-bit sizes, while
'int' cannot (as GCC still uses 32-bit ints on most common hardware
64-bit architectures (Intel, ...)). Further effort will need to be
spent to adapt the system (and OpenSSL) calls to enable the complete
datapath for >2GB entity sizes (at least when compiled on 64-bit).

* Small documentation fix: regex overview of characterset changed in DocBook
source so it appears as a simple list, instead of just one long paragraph
full of concatenated items --> improved readability.

* const-ified the source code and fixed a few comment typos and a
lurking bug in FTP (found thanks to constification): filename
for directory index urls could be damaged in particular circumstances.

* fixed makefiles for environments without any DocBook tools. Also fixed
configure script to help detect the absence of mandatory DocBook template
files. Plus added the DocBook output to the distro as we cannot expect everyone
to have the DocBook tools; nevertheless, everybody /should/ receive a full
set of documentation.

* Bugfix in GET_NUMLIST(): now original numlist is properly removed (would only
be noticeable before when specifying multiple port numbers).

* memleak fix for _free_httphdr(): now also the httphdr struct itself gets
free()d.

* Fixed lockups in debug logging code when running in '-x' GUI mode; overhauled the
'recursive invocation' detection code within, which is mandatory to prevent
recursive calls to debug/log functions to blow up the stack and dump core while
running in ultra verbose debug/diag mode (-debug -debug_level all). This is the
second part of the fix for bug #2184196.

* Bugfix for #2023089: new code is introduced for '-lmax' depth level checks:
the 'depth' (a.k.a. 'level') will always be taken from the non-inline parent URL
which has the lowest level.

This should fix situations where 'inline' URLs have 'inline' *parent* URLs, such
as style sheets, which are referenced by non-inline URLs (HTML files).

Seeking out the lowest level non-inline parent should also take care of situations
where multiple HTML files at different levels themselves, all (directly!) reference the same
stylesheet/inline URL.

* Attempt at fixing a GUI semaphore lockup, caused by LOCK_CFG_URLSTACK being used
for different purposes (was a quick hack once to create a 'critical section' there)
in recurse.c @ 1129. Same hack, but now we use LOCK_GHBN which should cause much less trouble
there.

* Bit of code cleanup.

* Code review checks to see if URLT_FTPS and URLT_GOPHER are used consistently where
you'd expect them. As you would URLT_HTTPS, next to URLT_HTTP.

* Code review checks and fixes to prevent spurious damage to url->parent structures:
now the access to this element is critical-sectioned /everywhere/ using LOCK_URL(u); existed
in 95% of the places already, now all code has been checked.

* Several fixes for multithreaded GTK GUI use. Most important thing which
was missing: a call to gtk_threads_init().

* JavaScript: updated HTML tag/attribute tables to recognize all
onXYZ=... JavaScript event attributes in HTML + added the full
set of attributes to the url pattern class/object which is
available in pavuk's own JavaScript extension.
