Pavuk

SourceForge

Last Update: April 19 2005

 
 

Last Changes :

* ---------- released version 0.9.33 (2005-09-27)
* fixed 64bit problems (BUG #1226863)
* updated German locale, fixes done by Debian developers (Hey, please inform
  us about errors. Scanning the net and all distributions for possible fixes
  is not very helpful.)
* ---------- released version 0.9.34 (2006-01-09)
* security fixes
* some minor bug fixes
* reworked build system a lot, fixed RPM spec file
* now builds fine using most of the possibilities pavuk provides
* RPM builds on openSUSE build service for SUSE since version 9.3, Fedora
  since version 4 and Mandriva since version 2006
* RPM packages can be found here:
  http://software.opensuse.org/download/home:/dstoecker/
* ---------- released version 0.9.35 (2007-02-21)
* added -persistent/-nopersistent option

2007-april-30 [notes taken from old work back in 2005/2006 merged into pavuk mainstream source tree]

* bufio has seen a MAJOR overhaul. It is now capable of pushing text &
  binary data to the file system at unprecedented rates. This is done by
  adding a variable sized (and possibly large) memory cache, resulting in
  large size I/O operations. These perform very much faster than the regular
  RTL I/O calls. (tested on quad CPU UNIX Dell servers)

  the new bufio was required as I needed to log/track a huge amount of data
  in the shortest possible time / lowest possible CPU load.

* cookie handling has been fixed/augmented. pavuk can now have the initial
  cookie values that go with a certain web request preconfigured on the
  commandline. Also, several bugs in handling the cookies have been fixed.
  (tested on a wicked ASP.NET intranet site which 'assumed' the use of a
  special web client (a TV set top box) which would transmit its serial #
  as a client-side created(!) cookie to the web server. This site/client
  combo thus actually transmitted cookies which would first show up in a web
  _request_ instead of the usual: a server-side _response_.)

* several portability items have been changed (h_errno, ...) to make the
  code compile and work on the odd-flavored UNIX box. A native Win32 port is
  under way: it now works, including zlib and OpenSSL, though the latter has
  not been tested recently.

  Note that the changes may have broken GTK support, as I was not able to
  build the code with GTK on my UNIX boxes.

* socket I/O (IP traffic) has been fixed to properly cope with user breaks
  (a user hitting Ctrl+C). Several locations in the software where the
  unexpected signal would cause an infinite loop have been identified and
  fixed.

* added several lines of DEBUG_xxx to aid both developer and user in
  tracking down hard to diagnose issues inside pavuk while scanning a site.

* Accept-Encoding handling (more specifically: the handling of the
  x-gzip/gzip/x-compress/compress encodings) has been changed to allow for
  better portability: data is expanded in-memory, without the need for an
  external 'gzip' tool and/or OS-specific forks & pipes.

  (Win32 wouldn't know a fork if ever it saw one.)

* ALL stdio is now handled through the new bufio system. This not only
  improves performance when you've got -debug and -debuglevel dialed all the
  way up, but also corrected several spots where, depending on your C RTL,
  stdio/stderr traffic would arrive at different moments on your console
  (some of it was written through the FILE I/O, some through direct I/O,
  causing blurbs of output to pass one another along the way to the actual
  console).

* buffer overrun protection has been improved. Note also that every
  snprintf() and derivative thereof is now 'augmented' by an additional line
  of code which ensures that the last character in the buffer is guaranteed
  to be a NUL sentinel, thus ensuring that the buffer will always present
  data in correct C string format (NUL-terminated). (This is an old habit of
  mine, as some C RTLs have proven kinda flaky on the subject of NUL
  sentinels when snprintf() et al are writing data up to the edge of their
  output buffers: some C RTLs 'forget' to put a NUL there under particular
  circumstances; some commercial Watcom compiler releases come to mind.)
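
  A minimal sketch of the 'extra line after snprintf()' habit described
  above. The helper name guarded_snprintf() is hypothetical and not taken
  from the pavuk sources; it only illustrates forcing the NUL sentinel into
  the last byte of the buffer, whatever the C RTL did:

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical helper (not pavuk's actual code): format into buf and
 * force a NUL sentinel into the last byte of the buffer afterwards. */
static int guarded_snprintf(char *buf, size_t size, const char *fmt, ...)
{
    va_list ap;
    int n;

    assert(buf != NULL && size > 0);

    va_start(ap, fmt);
    n = vsnprintf(buf, size, fmt, ap);
    va_end(ap);

    buf[size - 1] = '\0';   /* the extra line: always NUL-terminated */
    return n;               /* C99: length the full output would have had */
}
```

  On a conforming C99 RTL the extra assignment is redundant; it only matters
  on RTLs that skip the terminator when the output is truncated.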

* multithreading pavuk has been tested on a high perf MP UNIX box and it
  was like the documentation/notes state somewhere: unstable. The thread
  interlocking has now been fixed; one of the hardest to fix proved to be
  the lockup at the end of a pavuk run. The fix also includes the use of
  semaphores and some additional code changes to make the code thread safe;
  critical sections are now handled as such. This includes placing several
  non-threadsafe C RTL calls (e.g. ctime()) inside critical sections!

* auto-form-filling (the feature which led me to select pavuk over wget et
  al when I started the hammer/chunky project) has been fixed for those
  special pages where you have an empty form to submit: the site I had to
  test included such a form, which was submitted using javascript, but did
  not contain _any_ input fields (but cookies were expected to come with
  that request, thank you). Before, pavuk crashed on such a page. This has
  now been fixed.

* added a 'reindent' target to the makefile, using GNU indent to reformat
  the code. (When you're working several weeks on end in crunch time, you
  want to see some proper and consistent looking source code, even when you
  just made it a mess yourself...)

  Also extended the cleanup makefile target to help me in cleaning up any
  backup and/or temporary files created by vi and some log diagnostic
  scripts.

  [edit may/2007: wasn't this already in the makefiles before - see
  ChangeLog entry in 2003?]

* added several commandline parameter types, which allow you to instruct
  pavuk to use OS file handles or file names for logging activity, while you
  can now also specify whether a log file should be overwritten (default) or
  appended to (new feature) by adding another '@' prefix to the file path.

  TODO: document this properly.

* added hammer/crunchy modes: several ways to scan a web site and then
  rescan it. The higher (later) hammer mode has been specifically written to
  use pavuk as a 'replay attack' based DoS tool for testing high performance
  web servers. (bufio was overhauled to allow us to log all I/O data +
  diagnostics to disc while hammering the server while the pavuk system
  _must_ perform better (= faster) than the web server when running both on
  equivalent hardware.)

* The native Win32 port has been overhauled (previous code was never
  released to the public) to make sure I did not have to look for
  OS-specific path elements _everywhere_ in the code. It was becoming a
  maintenance nightmare to fix up all those 'absolute path' and 'path
  expansion' code sections to handle Win32 drive letters (root is
  '[A-Z]:[\\/]' instead of simply '/').

  This has been fixed by using the cygwin 'path hack' for the native Win32
  port too: root is '/cygdrive/[a-z]/' so it looks exactly like a UNIX path.

  Any places in the code which need to address the OS while passing an
  OS-specific path are now handled almost invisibly: all relevant C RTL calls
  (fopen/open/stat/lstat/symlink/link/unlink/rename/mkdir/rmdir/opendir) are
  now encapsulated in tl_[sysname] wrapper functions where these
  /cygdrive/[x]/ paths are converted back to native Win32 paths before the
  actual C RTL function is called. Also, any debug/print statement which is
  used to report a file path is fixed to convert file paths to the native
  representation with a minimum of fuss: see the new tl_native() call for a
  description of how this was done. This code has not been tested in a
  UNIX/MP environment, but the design is such that this should not cause any
  trouble (pthread port for Win32 is in progress ATM).
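
  To illustrate the '/cygdrive/[x]/' idea, here is a rough sketch of the
  conversion back to a native Win32 path. The function to_native_win32() is
  a hypothetical stand-in and does not reproduce pavuk's tl_native() or the
  tl_[sysname] wrappers; it also assumes that paths without the prefix are
  passed through untouched:

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the idea behind tl_native(); not pavuk's code.
 * Converts "/cygdrive/c/dir/file" to "c:\dir\file". Paths without the
 * prefix are copied unchanged. Returns 0 on success, -1 if dst is too
 * small. */
static int to_native_win32(const char *src, char *dst, size_t dstsize)
{
    static const char prefix[] = "/cygdrive/";
    const size_t plen = sizeof(prefix) - 1;
    size_t i, n;

    assert(src != NULL && dst != NULL && dstsize > 0);

    if (strncmp(src, prefix, plen) != 0 ||
        !isalpha((unsigned char)src[plen]) ||
        (src[plen + 1] != '/' && src[plen + 1] != '\0')) {
        n = (size_t)snprintf(dst, dstsize, "%s", src);
        return n >= dstsize ? -1 : 0;
    }

    /* drive letter, colon, then the rest of the path (or the bare root) */
    n = (size_t)snprintf(dst, dstsize, "%c:%s", src[plen],
                         src[plen + 1] ? src + plen + 1 : "/");
    if (n >= dstsize)
        return -1;

    for (i = 0; dst[i] != '\0'; i++)    /* flip to Win32 separators */
        if (dst[i] == '/')
            dst[i] = '\\';
    return 0;
}
```

  A wrapper in the spirit of the tl_[sysname] functions would call such a
  conversion first and then hand the result to the real C RTL function.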

* added -debug_level modes: all/trace/dev/bufio/cookie/htmlform. Also added
  a feature where you can now specify a set of debug levels and have some of
  those levels _removed_, e.g. 'all,!dev' will show anything _except_ 'dev'
  level debug output: note the new '!' prefix.
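
  The add/remove semantics of a level list such as 'all,!dev' can be
  sketched as a bit mask. The flag names and values below are illustrative
  assumptions, as is the choice to silently ignore unknown names; this is
  not pavuk's own option parser:

```c
#include <assert.h>
#include <string.h>

/* Illustrative bit flags; the names mirror the levels listed above,
 * the values are assumptions. */
enum {
    DL_TRACE = 1 << 0, DL_DEV = 1 << 1, DL_BUFIO = 1 << 2,
    DL_COOKIE = 1 << 3, DL_HTMLFORM = 1 << 4,
    DL_ALL = (1 << 5) - 1
};

/* Map one level name (not NUL-terminated, len bytes long) to its bits. */
static unsigned level_bit(const char *name, size_t len)
{
    static const struct { const char *name; unsigned bit; } tab[] = {
        { "all", DL_ALL }, { "trace", DL_TRACE }, { "dev", DL_DEV },
        { "bufio", DL_BUFIO }, { "cookie", DL_COOKIE },
        { "htmlform", DL_HTMLFORM }
    };
    size_t i;

    for (i = 0; i < sizeof(tab) / sizeof(tab[0]); i++)
        if (strlen(tab[i].name) == len &&
            strncmp(tab[i].name, name, len) == 0)
            return tab[i].bit;
    return 0;   /* unknown names are silently ignored in this sketch */
}

/* Parse "all,!dev" style specs left to right: plain names add bits,
 * names with a '!' prefix clear them again. */
static unsigned parse_debug_levels(const char *spec)
{
    unsigned mask = 0;

    assert(spec != NULL);
    while (*spec != '\0') {
        size_t len = strcspn(spec, ",");
        size_t negate = (len > 0 && spec[0] == '!') ? 1 : 0;
        unsigned bit = level_bit(spec + negate, len - negate);

        if (negate)
            mask &= ~bit;
        else
            mask |= bit;
        spec += len;
        if (*spec == ',')
            spec++;
    }
    return mask;
}
```

  Because the list is processed left to right, '!all,cookie' first clears
  everything and then enables only the cookie level.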

* -debug_level output is now prefixed with its level in caps and square
  brackets, e.g. '[PROCE]' to aid in filtering the debug output (for
  instance by piping it through sed/grep).

* unified debug output handling in the code: -debug_levels are now only
  active when you specify -debug too.

* inflate_decode() and gzip_decode() have been fixed to suit a multithreaded
  environment. gzip_decode() now has an in-memory implementation, using the
  zlib library, for those systems which do not support UNIX pipes/forks.

* Fixed deflate/compress handling: the MJF Accept-Encoding deflate hack has
  been removed and the request header extended. (tested on a Wikipedia
  HTTP/1.1 compliant server)

  You may wish to permanently disable the code within


  in decode.c if you do not wish to depend on the external gzip tool any
  more.

* _all_ system header file #include's have been removed from the sources and
  integrated into config.h to allow for better portable source code.

  config.h.in and autoconf.am have been extended to include several more OS-
  dependent system call and header file checks.

  A separate native Win32 version of the header file is also provided (used
  by the MSVC2005 native Win32 build).

* several hardcoded buffer sizes in the software have been made configurable
  (but remain hardcoded). See for instance dinfo.c: 12 -->
  PAVUK_INFO_DIRNAME and 1024-and-other-fixed-buf-sizes -->
  BUFIO_ADVISED_READLN_BUFSIZE

* fixed several cases where dangling (i.e. free()d but not NULL-ed) pointers
  caused havoc. Code has been quickly reviewed to locate and fix additional
  spots that did not yet cause pavuk to go 'crazy Ivan' (Hunt for the Red
  October, anyone? ;-) )
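
  The usual cure for dangling pointers is to reset the pointer in the same
  step that frees it. A minimal sketch; the macro name FREE_AND_NULL and the
  struct are made up for illustration and are not taken from the pavuk
  sources:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper macro (name is an assumption, not from pavuk):
 * release a heap block and reset the pointer in one step, so a later
 * free() or NULL check on the same variable is harmless. */
#define FREE_AND_NULL(p) do { free(p); (p) = NULL; } while (0)

/* Illustrative structure owning one heap block. */
struct doc_sketch {
    char *contents;
};

static void doc_sketch_release(struct doc_sketch *d)
{
    assert(d != NULL);
    /* safe to call twice: free(NULL) is defined to do nothing */
    FREE_AND_NULL(d->contents);
}
```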

* hardcoded lock filenames have been converted to #define's to allow these
  to be changed in a single spot (config.h), improving portability. e.g.:

    '._lock' --> PAVUK_LOCK_FILENAME

* UNIX-specific octal privs have been changed to their proper #define's to
  allow for maximum portability (Win32 doesn't know '0644' but can cope with

    S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH

  though maybe in an odd way).
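
  For reference, a small sketch of the symbolic spelling on a POSIX system,
  where the four bits above add up to exactly 0644. The macro and function
  names are illustrative assumptions, not pavuk identifiers, and the sketch
  says nothing about how a given Win32 C RTL maps these bits:

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Illustrative name, not from pavuk: rw-r--r-- spelled with the symbolic
 * bits from <sys/stat.h> instead of the literal 0644. */
#define SKETCH_FILE_MODE (S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH)

/* Create (or truncate) a file using the symbolic mode, then close it.
 * Returns 0 on success, -1 on failure. */
static int touch_file(const char *path)
{
    int fd;

    assert(path != NULL);
    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, SKETCH_FILE_MODE);
    if (fd < 0)
        return -1;
    return close(fd);
}
```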

* fixed quite a few spots where an unidentified form encoding method would
  lead to _very_ unstable behaviour, including crashes/core dumps. Look for

    fi->method = FORM_M_UNKNOWN

  assignments and additional FORM_M_UNKNOWN checks.

* added -no_dns support for those who have to work in an environment with
  flaky or no DNS support (I had to as I was working on a box in a specially
  configured, partially walled-off DMZ zone while developing and testing
  pavuk against a web server.)

* fixed typos in the text as I came across them.

* the bufio overhaul also led to an overhaul of the -dumpxxx code,
  removing/fixing several spots in the code which caused incorrect/unstable
  behaviour. (e.g. code in doc.c)

* Fixed handling of compressed data for any text-based server response;
  pavuk now correctly handles any gzipped/deflated text, including, for
  instance, any 'text/javascript' content sent over the wire in compressed
  form (tested on a Wikipedia-based HTTP/1.1 compliant server).

* added -progress_mode: several choices in progress verbosity.

* added -no_disc_io: test a grab/scan without writing anything to disc.
  Mostly useful in combination with the earlier -hammer modes.

* fixed/updated HTTP error response handling in accordance with RFC2616 so I
  can better see what a HTTP/1.1 compliant target is reporting back to
  pavuk. (errcode.c et al)

* unified timing units to fix a few timing oddities: instead of minutes,
  etc. the code uses seconds everywhere (apart, of course, from the few
  locations where we use milliseconds ;-) )

  -timeout is now in milliseconds!

* Added -rtimeout and -wtimeout command line parameters.
  (unit: milliseconds)

* added -allow_persistent / -noallow_persistent commandline arguments to
  allow/disallow the use of HTTP/1.1 persistent connections.

* added -dumpcmd and -dumpdir commandline arguments.

* added -bad_content commandline argument for use with the hammer/chunky
  modes.

* added -report_url_on_err commandline argument: report the URL which was
  processed while the error occurred.

* added -test_id commandline argument: this is included in the timing report
  so reports can be better automatically processed / combined.

* added -page_sfx commandline argument to help pavuk identify what suffixes
  are to be considered web pages (useful for scanning ASP and ASP.NET sites
  which present unusual mime types with their pages).

* added -tlogfile4sum commandline argument: specify a log file where timing
  info is stored. Handy when pavuk is not only used to grab the info off a
  site but also scan & report site performance.

* added -encode commandline parameter as the counterpart of -noencode.

* added -nohtDig, -noquiet and -noverbose commandline parameters as
  counterparts of -htDig, -quiet and -verbose respectively.

* added filepath support to -dumpfd and -dump_urlfd: by specifying the
  option prefixed with a '@' character, pavuk will treat the option value as
  filepath specification instead of an OS file handle and subsequently open
  the specific file internally. Note that adding yet another '@' character
  as a prefix signals pavuk to _append_ to the specified file, instead of
  _overwriting_ it.

  This is useful when you wish to have those dumps but are working in an
  environment where you cannot pass valid file handles through the
  commandline.
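
  The '@' / '@@' convention can be sketched as a small classifier for the
  option value. Everything below (type names, function name, the rejection
  of negative handles) is a hypothetical illustration and not the parser
  pavuk actually uses:

```c
#include <assert.h>
#include <stdlib.h>

enum dump_target_kind { DUMP_FD, DUMP_FILE_OVERWRITE, DUMP_FILE_APPEND };

struct dump_target {
    enum dump_target_kind kind;
    int fd;             /* valid when kind == DUMP_FD */
    const char *path;   /* points into the option value otherwise */
};

/* Hypothetical parser for the convention described above:
 * "5" -> OS file handle 5, "@log.txt" -> overwrite, "@@log.txt" -> append.
 * Returns 0 on success, -1 on a malformed value. */
static int parse_dump_target(const char *value, struct dump_target *out)
{
    char *end;
    long n;

    assert(value != NULL && out != NULL);
    out->fd = -1;
    out->path = NULL;

    if (value[0] == '@') {
        int append = (value[1] == '@');

        out->kind = append ? DUMP_FILE_APPEND : DUMP_FILE_OVERWRITE;
        out->path = value + 1 + append;
        return out->path[0] != '\0' ? 0 : -1;   /* empty path is an error */
    }

    n = strtol(value, &end, 10);
    if (end == value || *end != '\0' || n < 0)
        return -1;
    out->kind = DUMP_FD;
    out->fd = (int)n;
    return 0;
}
```

  The caller would then open the path with a truncating or an appending
  mode, depending on the kind that was detected.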

* added -dump_request and -nodump_request commandline arguments for use
  with -dumpfd: when -dump_request is specified, the log file will include a
  complete dump of each request sent to the server by pavuk. Thus you can
  produce a complete audit trail of the exchange.

* replaced the DUMP_URLLIST macros in stats.c by two functions. Code is a
  bit cleaner that way.

* fixed times.c which barfed on timestamps beyond 2037 (signed int wrap
  around for time_t).

* added assert() checks at several locations in the code to help track down
  unexpected behaviour which could lead to crashes (like it did till now).

* unified the proliferation of HEX2ASC-alike macros with and without off-by-
  one offsets inside. Now there's one macro for each of 'em in tools.h.

* changed the configure.in option to --disable-threads to keep the pattern
  consistent (--disable-xxx series of options in configure), but the default
  behaviour remains the same.

* configure.in: as --disable-debug removes any debug-_related_ features from
  the pavuk build, these options have been added: --disable-debugging will
  create a default build with all debugging removed from the compiled
  binaries. --disable-prof and --disable-gprof have been added to remove any
  profile info from the default compiled binaries.

* added checks in configure.in for socklen_t, pid_t and a bunch of system
  calls and header files that do not live in each environment.



2007-may-6

* included pthreads-Win32 based multithreading support in the native Win32
  build.

* included EXPERIMENTAL tre (regex) support in the native Win32 build.

* fixed several lurking bugs (buffer overruns, etc.) which only showed in a
  multithreaded environment.

* fixed locking bugs in the new bufio implementation.

* added Win32 memory leak + heap checking for the DEBUG build: many memory
  leaks have been tracked and fixed. (MSVC <crtdbg.h> based)

* fixed memory leak due to wrong scope in report_error() code.

* added DBGxxx macros to aid heap tracking for the debug build. See
  DBGdecl/DBGpass/DBGvars usage.

* removed a very nasty memleak in html_parser_get_url() which would leak at
  least 3 blocks for each rejected local anchor URL - and those come quite a
  few! Took me a day to track it down. :-(

* added filtering so gzipped/compressed files on the server are not
  decompressed unintentionally while the server supports
  Accept-Encoding: gzip or compress.

  ( doc_download_helper() in doc.c )



2007-may-11

* renamed function should_leave_persistent() to the more appropriately named
  should_keep_persistent()

* Updated 'chunky' source to the state of the latest pavuk CVS contents (as
  of today) as this code has not yet been merged into CVS itself.

* fixed bugs in -scenario handling, when scenario files produced by pavuk are
  re-used in the Win32 environment

* fixed bugs in path & file type commandline arguments for the native Win32
  port.

* fixed bug in retrying/resuming download for RFC2616 (HTTP/1.1) 'chunked'
  content download handling.

* merged -allow_persistent / -noallow_persistent commandline arguments with
  the equivalent -persistent/-nopersistent feature from the official pavuk
  CVS sources.

  Also improved the code a bit: added the 'Connection: close' header for
  requests over -nopersistent connections, so the server will close the
  connection for us.

* added the -ignore_chunk_bug commandline argument to allow pavuk to handle
  RFC2616 'chunked' downloads from buggy (IIS) web servers.

  ( See also:
  http://www.subbu.org/weblogs/main/2004/11/persistent_conn.html
  http://skrb.org/ietf/http_errata.html#chunk-size
  http://www.apps.ietf.org/rfc/rfc2616.html#sec-3.6.1
  http://www.jmarshall.com/easy/http/
  )



2007-may/june

* recompiled in 64-bit Linux (SuSe 10.2) and fixed a few items in the
  Makefile.am, configure.in and ac-config.h.in files. Also added the tests\
  and www\ directories to the distro.

* fixed a few 64-bit compile warnings; at least the test cases in tests\
  perform OK now on a 64-bit Linux system.

* updated the man page a bit; still a lot more to do. Where is that 'nroff
  for dummies' cheatsheet when you need it?  ;-(

* listed -use_http11 as 'on' by default now.

* moved MODE_MIRROR unescape code section up in url.c to line 1682 in
  url_get_local_name_real() as this code would otherwise have no effect at
  all in any environment where the '%' percent character is included in the
  FS_UNSAFE_CHARACTERS charset (for example: Win32).

* PARAM_DOUBLE default values are now fixed point values in 'long' integer
  format; the current values in the program (all 0.0) are clearly within
  range _and_ it 'saves' on compiler warnings quite a bit. (We've still some
  way to go before we get anywhere near an '[almost-]zero-warning' cross
  platform portable build: a few int to pointer and vice versa casts remain.)

* fixed bug in cfg_get_num_params() which would access uninitialized memory
  out there in NirvanaLand when a PARAM_UNSUPPORTED option was passed to
  pavuk.

* Fixed configure.in to include 'debug' build handling for KDevelop (which
  would pass '--enable-debug=full' to ./configure).

* updated the configure.in script to increase portability (opendir/closedir:
  dirent.h et al)

* included a few autoconf macros in the m4 directory for easier/proper
  portability support using autoconf et al.

* bugs fixed from BUGS list: multithreaded mode is not as stable as single
  threaded (fixed at least for the CLI version of pavuk; the GTK GUI version
  is in a rather bad shape)

* bugs fixed from BUGS list: signal handling / timeout does not really work
  (at least not in multi threaded downloads; after a SIGINT pavuk just
  hangs). This has also been fixed for the CLI version of pavuk at least.

* Win32 port now includes JavaScript support (using the statically linked
  Mozilla js library).

* fixed short option definitions in options.h: -tp / -tsp et al

* 'fixed' GUI for Javascript enabled builds (GTK2) - WARNING: it compiles
  now, but has NOT been tested, so expect bugs here!

* merged the 'chunky' code with the pavuk main source tree. Now 'chunky' is
  equivalent to building pavuk with './configure --enable-hammer'.

* set default from -leave_site to -dont_leave_site to prevent 'blown up' web
  crawls when this filter parameter has not been specified.

  This change includes a fix for the cfg/command line handling of pavuk for
  the conditions section (see condition.h + config.c) as pavuk assumed
  sizeof(long)==sizeof(int) in these code sections.

* Now the proper GPL license (GPL, not LGPL) is included in the file
  ./COPYING.



2007-sep

* fixed processing of zero byte length files (robots.txt at figleaf.com,
  etc.): no more crash/assertion failure due to NULLed docu->contents.

* fixed a few memleaks.

* added extra error checking for file rename operations as some issues were
  found with the Win32 build when using a SAMBA-shared filesystem for
  storing the spidered data/files. (It turned out that the same issues
  existed when using native (NTFS, FAT32) filesystems.)

* dialed down the number of default threads from 3 to 1 (see BUGS) to
  prevent a hail of (legitimate) rename error reports.

* added flock() implementation for Win32: when built with multithreading
  support, having no valid flock() implementation is very dangerous!

* changed configure.in to detect both flock() and fcntl() file locking
  mechanisms so pavuk will be able to support writing spidered content to
  network shares on both Win32 and UNIX systems: flock() does not support
  network share locks, fcntl() does, at least on the latest Linux kernels,
  see man flock(2)
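
  For comparison with flock(), a minimal sketch of a whole-file advisory
  lock taken through fcntl() on a POSIX system. The function name is a
  hypothetical choice and this is not the locking code used in pavuk:

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Sketch, assuming a POSIX system: take or release a whole-file advisory
 * lock with fcntl(). type is F_WRLCK, F_RDLCK or F_UNLCK.
 * Returns 0 on success, -1 on failure (errno is set by fcntl). */
static int whole_file_lock(int fd, short type)
{
    struct flock fl;

    assert(fd >= 0);
    memset(&fl, 0, sizeof(fl));
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;       /* 0 means "up to end of file", i.e. everything */

    /* F_SETLK does not block: a conflicting lock makes the call fail */
    return fcntl(fd, F_SETLK, &fl) == -1 ? -1 : 0;
}
```

  Note that fcntl() locks belong to the process, not to the file
  descriptor, so they do not serialize threads within one process.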

* added error reporting/checking for undesirable use of invalid flock()
  implementation. (Useful when porting pavuk to other non-Unix platforms.)

* Fixed content/file size treatment code for items which are already
  available locally (i.e. pavuk finds the item at the remote has not changed
  from when the last time it fetched the item into local cache).

* Fixed the conditions for when to display certain informational messages:
  less screen clutter when not running in '-verbose' mode OR when running in
  '-progress' modes.

* Fixed several error/info messages in the code section for decompressing
  gzip/compress transmitted HTTP content.

* Fixed handling of gzip/compress transmitted content when retrieved from
  local store instead (when pavuk discovers that the file at the remote site
  has not changed since the last time it was fetched and stored on your
  local disc).

* Fixed a few memleaks.

* Changed the DBGvars/DBGpass/DBGargs macros used for tracing memory
  allocations in debug mode to make these macros look more like regular 'C'
  functions to 'demented' code formatters and analysis tools. The drawback
  is that these still look 'weird' in function prototypes, but that causes
  quite a few less errors/warnings than the old style.

* Fixed bugs in get_abs_file_path() directory detection and Win32 abs path
  processing.

  Also fixed code which produced double slashes in file paths on occasion,
  causing trouble on Win32 platforms. (Fix applied generally.)

* Fixed mk_native() allocated string management pool to support printf() et
  al where up to 3 mk_native() calls are made in the argument list. This is
  important to prevent spurious crashes in multithreaded mode when the worst
  case scenario for mk_native() applies: all threads are executing a
  printf()-style statement which has multiple calls to mk_native() in the
  argument list.

  Currently overdimensioned a bit as the actual code only has two
  simultaneous calls while the pool now is dimensioned to tolerate 3
  simultaneous calls per thread.

* No more _strfindnchr() and strfindnchr(): strfindnchr() - and its use -
  has now been fixed to match the (proper working) _strfindnchr().
  [fnmatch.c/tools.c et al]

* Fixed const-correctness of several functions.

* Added '-mime_type_file' commandline option to help pavuk support an up-to-
  date list of mime types and their filename extensions, using, for example, 
  the UNIX mime.types(5) config file as a source of MIME type information.

  If the user does not specify the '-mime_type_file' option, the original 
  built-in defaults will be used instead.

  This feature has been added to provide better support for the pavuk
  -fnrules %M macro: this macro now will use this configuration to produce a
  suitable filename extension for each MIME type: the first extension listed 
  in the '-mime_type_file' config file for the given MIME type will be used 
  as extension for the %M macro.

* Changed the GTK GUI macros to become functions for ease of debugging. The 
  added (tiny) call overhead won't be a performance hit anyway.

* Fixed -fnrules handling: the generated path is cleaned up before it is 
  returned to pavuk for use.

  Cleanup actions:
  - duplicate '/' slashes are removed
  - filenames and directory names which end in a '.' dot, get the dot 
    removed
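
  The two cleanup actions can be sketched as a single in-place pass over the
  path. This is a hypothetical helper, not pavuk's implementation; it
  assumes that all trailing dots of a name are stripped and that components
  consisting only of dots ('.' and '..') are left alone:

```c
#include <assert.h>
#include <string.h>

/* Sketch of the two cleanup actions listed above, done in place.
 * Assumption: a component made up of dots only ("." or "..") is kept. */
static void cleanup_generated_path(char *path)
{
    char *src = path, *dst = path;

    assert(path != NULL);
    while (*src != '\0') {
        size_t len = strcspn(src, "/");     /* length of this component */
        size_t keep = len;

        /* strip trailing dots, but keep all-dot components intact */
        while (keep > 0 && src[keep - 1] == '.')
            keep--;
        if (keep == 0)
            keep = len;

        memmove(dst, src, keep);
        dst += keep;
        src += len;

        if (*src == '/') {
            *dst++ = '/';
            while (*src == '/')             /* collapse runs of slashes */
                src++;
        }
    }
    *dst = '\0';
}
```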

* Added '%X' to the -fnrules formatted processing to allow reformatting of 
  filenames using an optional mimetype-derived extension. This is useful 
  when grabbing Wiki (MediaWiki et al) sites when you'd like to store the 
  grabbed content using default mimetype-related filename extensions, so 
  instead of storing a file like

    wiki/page/AboutThisSite

  that would transform into

    wiki/page/AboutThisSite.html

  while pages like

    wiki/static_page/contact.htm

  would remain as is.

  (Note: this might be considered shorthand for a -fnrules (...) expression 
   which compares both %e and %E. The intent of %X, however, is to only 
   allow %e extensions to pass which are 'valid' for the given MIME type and 
   force the %E mimetype based extension for all other cases.)

  CAVEAT: %e/%E/%X/%Y will print the extension WITHOUT the leading '.' dot in 
          both simple mode and extended LISP mode.

* Added '%Y', '%A' and '%B' to the -fnrules macros: '%Y' uses the MIME type 
  preferred filename extension if the URL/filename doesn't have an extension 
  yet (while the rather similar '%X' will OVERRIDE the existing extension if 
  it is not listed with the specified MIME type).

  '%B' prints the 'basic MIME type', i.e. the MIME type without the ';' 
  semicolon separated MIME attributes such as language, etc., while '%A' will 
  print these attributes (if they were passed to us by the server).

  CAVEAT: %e/%E/%X/%Y will print the extension WITHOUT the leading '.' dot in 
          both simple mode and extended LISP mode.

  All this allows for pavuk -fnrules commandline arguments like this:

    -fnrules F '*' '%h:%r/%d/%b%s.%Y'
    -mime_types_file ./mime.types
    -tr_chr_chr ':\\!&=?' '_'

  so we'll be able to grab a [Media]Wiki site while storing those pages as 
  regular 'abc_php_xyz.html', instead of 'abc.php?xyz' page/filenames.

* Added -fnrules 'fnseq' operator to the extended rules: compares a 
  wildcard pattern and a string a la fnmatch(3).

* Checked and updated manpage for the -fnrules operators (added 'ud' and 
  'sp' operators to the manpage).

* Added -fnrules 'sn' operator to the extended rules as counterpart of 'ns'. 
  'sn' uses strtol() to convert a string to a number, while 'ns' uses 
  printf() to format a number to a string. (See the man page.)

* Updated the man page a bit regarding '-fnrules'.

* sanitized escape_str(); a quick code review led us to a lurking bug in 
  uconfig.c@309, which has been fixed implicitly.
  
* Added/updated source code documentation: tools.c/tr.c source code comments.

* Added some sanity checks in the code (tools.c/tr.c/lfname.c)

* Added debug_level 'rules' to allow debugging of both simple and 'extended' 
  -fnrules expressions and '-fnrules' URL F/R matching.
  
* Different boxes exhibit different mktime() behaviour, especially when 
  handling out of range tm value sets. Besides, mktime() works in 'local 
  time' while some parts of the code require a robust UTC mkgmtime() (not 
  available on many boxes) --> ripped & introduced as tl_mkgmtime(). A local 
  time-aware equivalent with excellent out-of-range handling is available as 
  tl_mktime().
  
* Added additional error handling around calls which try to parse time 
  stamps using tl_mkgmtime() and tl_mktime() (times.c).

  Basically, now both HTTP and FTP benefit from the new code which should 
  now process timestamps like the UTC timestamps they are, while 'out of UNIX 
  time_t bounds' timestamps (beyond the range 1970..2038 A.D.) are handled 
  in a more sane manner:

  - out of bounds timestamps are reported by pavuk

  - out of bounds timestamps are then 'sanitized', i.e. restricted to the 
    1/1/1970..31/12/2037 date range, i.e. a timestamp beyond the horizon, 
    like '1/4/2051' will be 'sanitized' (= restricted) to the upper bound: 
    31/12/2037. The same goes for timestamps from antiquity like '11/3/1969' (the 
    birthday of a certain person), which will be 'sanitized' towards 
    1/1/1970.
  
* Split up DEBUG into developer related stuff, such as memory/heap checking, 
  ASSERT/VERIFY, etc. and user related stuff (the -debug and -debug_level 
  command line arguments): ./configure is now fitted with an extra 
  parameter:
  
  --enable/disable-debug-features
  
  which will turn on/off -debug/-debug_level user level debugging support in 
  pavuk, while the existing
  
  --enable/disable-debug
  
  adds/removes additional developer checks, such as heap allocated checks 
  and ASSERT and VERIFY macros.
  
  In the code, -debug/-debug_level related code is located within the 
  'HAVE_DEBUG_FEATURES' sections, while the developer debug/release builds 
  are still related to the standard 'DEBUG' #define.

  This now results in three ./configure options that determine the (debug) 
  feature set of your binary:

  --enable/disable-debugging --> compile a binary with source level debug 
                                 info included and all optimizations 
                                 DISabled for improved debugging (by using 
                                 gdb or another debugger of your choice)

  --enable/disable-debug     --> include/exclude additional run time checks 
                                 in your binary. Most important are the 
                                 ASSERT and VERIFY pre/post-condition 
                                 validation methods located throughout the 
                                 code. The use of these is advised, though 
                                 these may cause a performance hit.

  --enable/disable-debug-features 
                             --> include/exclude user level -debug/
                                 -debug_level command line features, which 
                                 help you as a pavuk user to 'debug' pavuk 
                                 during the run. Using -debug, pavuk will be 
                                 EXTREMELY verbose, which can be toned down 
                                 by applying a -debug_level restriction 
                                 filter. For example:

                                   -debug -debug_level all,!devel

                                 will be VERY verbose, but will NOT log any 
                                 DEVEL level debug info, while:

                                   -debug -debug_level !all,rules

                                 will ONLY produce additional output for the 
                                 RULES level, i.e. when pavuk processes
                                 -fnrules and/or JavaScript macros.

* Fixed crash when non-RFC compliant website was grabbed: see testcase 7a.

* Added targeted help: when options cannot be parsed correctly, 
  short_usage() will try to help the user by printing the full help for the 
  offending commandline option only. (Of course, I screwed up while using 
  debug_level flag sets _again_ :-( [Ger])

* Some improvements for network connectivity error handling and reporting. 
  (xvherror() added.) This is the result of some FTP tests with pavuk (tests 
  8b).
  
* Don't yak about 'Checking "robots.txt"' anymore when doing a FTP grab when 
  robots.txt is NOT applicable anyway.
  
* FTP: added crude 'autodetect/retry' mechanism for FTP servers which do not 
  like NLST (==> response code 550) but report correct directory content for 
  LIST (or vice versa). (ftp.c)
  
* FTP/HTTP: at debug level 'protoD' pavuk will now dump RAW data/content 
  received from the server before preprocessing (i.e. converting to HTML or 
  decompressing).

* Added command line option integer sizing support: byte sizes can now be 
  specified in K, M or G. Other integer values can also be postfixed with K, 
  M or G, but then these will be treated like the ISO values 1000, 1E6 and 
  1E9.
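
  A minimal sketch of such suffix handling follows. parse_scaled() is a
  hypothetical helper, not pavuk's option parser, and the use of 1024-based
  multipliers for byte sizes is an assumption (the text above only states
  that the non-byte values use 1000, 1E6 and 1E9).

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Parse "<digits>[K|M|G]". When is_byte_size is non-zero the suffixes are
 * taken as powers of 1024 (assumption), otherwise as powers of 1000.
 * Returns 0 on success, -1 on malformed input or overflow. */
int parse_scaled(const char *s, int is_byte_size, long long *out)
{
  char *end;
  long long v, mult = 1;
  const long long base = is_byte_size ? 1024 : 1000;

  assert(s != NULL && out != NULL);
  errno = 0;
  v = strtoll(s, &end, 10);
  if (end == s || errno == ERANGE || v < 0)
    return -1;                       /* no digits, out of range, or negative */

  switch (*end) {
  case 'K': mult = base;               end++; break;
  case 'M': mult = base * base;        end++; break;
  case 'G': mult = base * base * base; end++; break;
  default: break;
  }
  if (*end != '\0')
    return -1;                       /* trailing garbage after the suffix */
  if (v > LLONG_MAX / mult)
    return -1;                       /* the product would overflow */
  *out = v * mult;
  return 0;
}
```

  With these assumptions "10K" gives 10240 as a byte size and 10000 as a
  plain integer.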

* Additional memory leak fixes in case pavuk is fed an invalid commandline.

* NTLM support code: fixed a few glaring bugs.

* Added O_SHORT_LIVED to lock file open() flags for better Win32 behaviour.

* Fixed code to load the pavuk configuration settings from, in order of 
  appearance:

   env:PAVUKRC_FILE
   ~/.pavukrc
   SYSCONFDIR/pavukrc
  
  which matches the description in the manual.
  (see also man page)
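
  The lookup order can be sketched as below. pavukrc_candidates() and
  RC_PATH_MAX are hypothetical names, resolving '~' through a HOME value is
  an assumption, and the sketch only builds the ordered candidate list; it
  says nothing about whether pavuk stops at the first file found.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define RC_PATH_MAX 512   /* hypothetical buffer size for this sketch */

/* Fill paths[] with candidate config file names in lookup order:
 * env:PAVUKRC_FILE, ~/.pavukrc, SYSCONFDIR/pavukrc.
 * Returns the number of candidates written (1 to 3). */
int pavukrc_candidates(const char *env_file, const char *home,
                       const char *sysconfdir, char paths[3][RC_PATH_MAX])
{
  int n = 0;

  assert(sysconfdir != NULL && paths != NULL);
  if (env_file != NULL && strlen(env_file) > 0)
    snprintf(paths[n++], RC_PATH_MAX, "%s", env_file);
  if (home != NULL && strlen(home) > 0)
    snprintf(paths[n++], RC_PATH_MAX, "%s/.pavukrc", home);
  snprintf(paths[n++], RC_PATH_MAX, "%s/pavukrc", sysconfdir);
  return n;
}
```

  A caller would typically pass getenv("PAVUKRC_FILE"), getenv("HOME") and
  the SYSCONFDIR value chosen at configure time.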



2008-jan

* Added 'js' flag to '-debug_level', which is used to dump a lot of detail 
  about the pattern matching and transformation applied to JavaScript code 
  using the '-js_pattern' and '-js_transform / -js_transform2' commandline 
  options.

* Added sanity check for '-js_pattern' and '-js_transform[2]' regexes, which 
  MUST contain a subexpression for them to 'work' as expected.

* removed re_pmatch_sub() and changed the code where it was used to work 
  with the available re_pmatch_subs() call, which allows for more elaborate 
  validation anyway. See htmlparser.c.

* Removed a regex handling bug in the -js_transform[2] code, which would 
  crash pavuk when using regexes where the first subexpression might be 
  empty.

  The crash is due to the fact that the regex parser would return indexes 
  '-1' for these empty subexpression(s), resulting in out-of-bounds memory 
  writes in the rewrite code. This in turn would nuke the heap, so after 
  that it was only a matter of time for pavuk to fail dramatically.
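
  For illustration: POSIX regexec() reports rm_so == rm_eo == -1 for a
  subexpression that did not take part in the match, so those offsets must
  be checked before they are used as indexes. The sketch below shows such a
  guard with the plain POSIX API; copy_subexpr() is a hypothetical helper
  and not the code in htmlparser.c, and pavuk may be built against a
  different regex library.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

#define MAX_GROUPS 10

/* Copy subexpression `idx` of the first match of `pattern` in `text` into
 * dst (always NUL-terminated, truncated if needed).
 * Returns the copied length, 0 if the group did not participate in the
 * match, or -1 if the pattern is invalid or does not match. */
int copy_subexpr(const char *pattern, const char *text, size_t idx,
                 char *dst, size_t dstsize)
{
  regex_t re;
  regmatch_t m[MAX_GROUPS];
  int rc = -1;

  assert(pattern && text && dst && dstsize > 0 && idx < MAX_GROUPS);
  dst[0] = '\0';
  if (regcomp(&re, pattern, REG_EXTENDED) != 0)
    return -1;
  if (regexec(&re, text, MAX_GROUPS, m, 0) == 0) {
    if (m[idx].rm_so == -1) {
      rc = 0;   /* unmatched group: offsets are -1, never index with them */
    } else {
      size_t len = (size_t)(m[idx].rm_eo - m[idx].rm_so);
      if (len >= dstsize)
        len = dstsize - 1;
      memcpy(dst, text + m[idx].rm_so, len);
      dst[len] = '\0';
      rc = (int)len;
    }
  }
  regfree(&re);
  return rc;
}
```

  With the pattern "a(b)?c", the text "ac" matches while group 1 stays
  unmatched, which is exactly the case that needs the -1 check.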


2008 feb 04

* Added DEBUG_MISC() lines to solve sourceforge.net issue: [ 1852885 ] to 
  improve manipulation by locally stored files

* Included provisional fix (I don't have a working sample run to reproduce 
  the issue (yet)) for sourceforge.net issue: [ 1852884 ] infinite loop on 
  unexpected responses

* Cleaned up the mess that was -progress_mode.

* Cleaned up several DEBUG_xxx macro mistakes

* Added a little description to the 'hidden' -htDig commandline option, 
  which can be used to dump the server-transmitted MIME headers for each 
  URL, similar to the htdig tool.

* Added a bit of documentation for the -rollback option (which was 
  undocumented)


2008 mar 20

* GNU gettext tools don't like '\r' in i18n strings --> fixed by changing 
  the related printf() statements in src/doc.c

* started update of configure scripts to the latest autoconf/automake.

  Also reordered the NEWS file so it will work with the new, stricter

    ./bootstrap && ./configure && make distcheck

  distro test cycle.


2008 jul 10

* fixed ';' semicolon bug in http.c near line 2074 which caused incorrect 
  decoding of the HTTP/1.x response code header.

* fixed gzip/compress/... content compression support (HTTP/1.1 
  Accept-Encoding); the previous code was a valiant attempt to 'fix' the 
  client side (pavuk) to cope with buggy web servers which send the wrong 
  encoding type for already compressed files, but this would screw up 
  particular responses by *well-behaving* web servers. Of course this would 
  only happen in rare circumstances so it was kinda hard to track down.

  Documentation for -Enc/-noEnc has been updated to reflect this situation 
  and the code now (hopefully properly) finally supports compressed data 
  transmission for RFC2616-compliant web servers.

  If you find that your 'downloaded' compressed files are already 
  /incorrectly/ DEcompressed by pavuk, this is NOT the fault of the client 
  (pavuk) but evidence that your server is behaving inappropriately and the 
  proper remedy for this is the use of the option '-noEnc' which turns this 
  feature off so the server is not allowed to screw up in this way any more.

  Also made sure one can check if pavuk has been built with compression 
  support by calling 'pavuk --version' and looking at the feature list.

* autoconf/configure script: using the highly undocumented v_cflags or other 
  x_* variables as environment variables to hack the configure script (you 
  could do that, especially with v_cflags) has been obsoleted while the 
  configure and m4/* scripts have been upgraded to support autoconf 
  2.62/automake 1.10 and use ONLY *documented* AC.*/etc. macros from now on.

  Note: thanks to the JavaScript library issues on SuSe10.2/AMD64 (older JS 
        lib version and seemingly partial header install), I may have failed 
        to eradicate all undocumented macros.

* Extra note about configure.in: bash, at least on SuSe10.2/64-bit, handles 
  'if eval test ...' just ever so slightly differently than 'if test ...', 
  especially where it comes to 'test -n'. As these styles were mixed rather 
  arbitrarily before, the 'if eval test ...' style has been completely 
  removed from the configure script, as this would sometimes render quite 
  unexpected (and incorrect!) results.

* fix_crlf.sh has been updated to ensure important Microsoft Visual Studio 
  files are not damaged by having their CRLF sequences converted to UNIX LF 
  line endings: this kind of thing will make MSVC spit you in the face and 
  reject everything you try until you give it back those CRLF line endings 
  in there. So much for XML as project file format and MSVC...

* extra fixes to ensure 'make distcheck' does not barf up a hairball. This 
  includes enforcing the permanent inclusion of the 'po' subdirectory in the 
  Makefile set for multilingual support.

* configure/Makefile(s): if you don't have one or more of the 
  archiving/compression tools compress/lzma/gzip/tar/7z(7zip) installed on 
  your system, we don't go belly up at config ~ nor at 'make dist' time 
  anymore. This, of course, includes correct behaviour at 'make distcheck' 
  time: only use/test those 'GNU standard' formats, which can be created on 
  your box.

* Added the 'bootstrap' shell script, next to 'autogen.sh'. I know they 
  serve the (almost) same purpose, but 'bootstrap' is far more sophisticated 
  than autogen.sh and I didn't wish to overwrite 'autogen.sh'. Besides, IDEs 
  on UNIX boxen expect either the one or the other (there's no single 
  'standard' for this), so we might as well provide both.

  At a later time, we will probably point autogen.sh to bootstrap.

* Updated the mime.types MIME 'hint' file: currently, it's a mix of 

  1) all properly registered MIME types ( http://www.iana.org/assignments/media-types/ ) 

  2) the mime.types file provided with the latest Apache/XAMPP

  3) my (Ger Hobbelt) additional file extension hints as used on my own 
     servers. This is mostly about professional graphics ~ and modern 
     'scene' audio/video container formats, such as Matroska. This only adds 
     extensions for otherwise already existing MIME types.

* Updated the DocBook-based documentation for several options (-Enc/-noEnc, ...)

* 'pavuk --version' now also reports if ZLIB support is included in the 
  binary. This is important for '-Enc'.

* Fixed the '-Enc' compressed transmission and HTTP header processing code 
  to act properly with fully RFC2616-compliant web servers, discarding the 
  old 'hack/fix' attempt to solve a non-compliant server issue at the 
  client, as this would break things for fully compliant servers in the rare 
  (but extremely annoying) use case:

  - pavuk with '-Enc' option

  - webserver is fully RFC2616 compliant

  - pavuk issues request for file in a .tar.Z or other gzip/compress 
    compressed format, where the file on the server is only slightly 
    compressed (fastest compression).

  - webserver will transmit file to pavuk, but due to pavuk reporting it is 
    able to handle compressed transmission AND the server discovering that 
    the content can be compressed quite some more than it already was, the 
    file will be transmitted after a server-side just-in-time compression 
    round.

  - pavuk receives the data. The old hacked code would NOT decompress the 
    data. However it SHOULD because the server PROPERLY reported 
    'Content-Encoding: gzip' to pavuk. End result: grabbed data which you 
    cannot process nor trust to be in the same format as stored on the 
    server as it all 'depends' on arbitrary conditions which you cannot 
    control: is the web server able to compress the data before 
    transmission? Is the web server configured to allow compression? Etc.

  This use case has now been fixed.

  The effect of BADLY behaving web servers (which send 'Content-Encoding: 
  gzip' for any .Z, .z or .gz files: IIS x.x and other servers which are not 
  configured to /properly/ handle files and MIME types) is described in the 
  DocBook manual page now, including the fix for this (specify the '-noEnc' 
  commandline option with pavuk).
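
  The rule behind the fix can be sketched as follows: whether a response
  body gets decoded depends only on the Content-Encoding header sent by the
  server, never on the file extension in the URL. must_decode_body() is a
  hypothetical helper for illustration and not pavuk's http.c code; the
  coding names are those registered by RFC2616 plus their 'x-' aliases.

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h>

/* Return 1 if a body with this Content-Encoding header value must be
 * decoded before it is stored, 0 otherwise. Pass "" when the server sent
 * no Content-Encoding header at all. */
int must_decode_body(const char *content_encoding)
{
  static const char *const codings[] = {
    "gzip", "x-gzip", "compress", "x-compress", "deflate"
  };
  size_t i;

  assert(content_encoding != NULL);
  for (i = 0; i < sizeof codings / sizeof codings[0]; i++) {
    /* content-coding values are case-insensitive */
    if (strcasecmp(content_encoding, codings[i]) == 0)
      return 1;
  }
  return 0;   /* "", "identity" or an unknown coding: store as received */
}
```

  A compliant server that recompresses foo.tar.Z on the fly announces
  'Content-Encoding: gzip', so the body is decoded once and the stored file
  matches the one on the server.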

* active FTP: timeout and stop/break handling slightly improved: now pavuk 
  should always terminate under all circumstances while a break or stop has 
  been signalled.

* Changed the default for '-url_strategy' from 'level' to 'leveli' to make 
  pavuk behave more like your regular web browser (with a user clicking 
  through web pages).

* Initial fix for NTLM support for 64-bit Windows. (Only lightly tested.)

  This includes converting that bit of code to support the C99 intNN_t types 
  (where NN in {8,16,32}), while the configure script takes care about 
  providing the proper types for not-fully-C99-compliant environments.

* The TRE regex package would barf up a hairball due to the incorrect header 
  file being loaded. ./configure now recognizes TRE specifics a bit better 
  and the code now loads the proper header file (<tre/regex.h> instead of 
  <regex.h>). This is important on systems which have multiple, ever so 
  slightly incompatible regex processing libraries installed.

* Improved diagnostics a little bit by adding reporting support for 
  URL_PARENT_REWRITING, i.e. the situation where a parent page of a grabbed 
  page is loaded for the sake of adjusting (rewriting) the URLs in its 
  content.

* Fixed code so it would compile in full (-DDEBUG) debug mode on UNIX.

* autoconf/configure: ran into some weird issues due to inconsistent M4 [] 
  quoting: quite a few lines did without it. Turns out that this is a BIG 
  No!No! as adding the AX_ADD_OPTION() macro turned this lurking mess into a 
  true disaster.

  Fixed by applying [] quoting throughout. The only place where I didn't do 
  it, is in the first and second args of AC_DEFINE() -- which should be used 
  instead of AC_DEFINE_UNQUOTED when you don't need the latter's extra 
  functionality anyway -- and the first arg of AC_DEFINE_UNQUOTED(). Any 
  other spot where [] quotes are missing in the M4 macros and/or 
  configure.in? Consider that a bug and please report so I can fix it.

* Finally got the configure system to recognize my JavaScript libraries and 
  all. Tugged and tweaked a few items in the bindings to allow maximum 
  flexibility for the JS code when it is used to filter URLs (e.g. 
  JavaScript pavuk_url_cond_check() function).

* Updated jsbind.c to use latest SpiderMonkey 1.8.x (tested on Win32)

* Changed man/Makefile to ensure HTML is not recreated every 'make' run, but 
  only when manpage changes. This should really copy the results from 
  ./doc/, but that's for later...

* DocBook documentation: tweaked man page generation to mimic original 
  manpage title exactly.

* DocBook documentation: updated '-version' info (important to see at 
  run-time what abilities you've got with /your/ pavuk).

* Win32/MSVC: all project files have been updated to produce next to 
  Win32/x86: Win64/AMD64 and Win64/Itanium binaries. These project files 
  assume the existence of all optional libraries: OpenSSL, SpiderMonkey 
  (JavaScript), zlib.

  Where to get those, preferred directory layout, etc. to be published, so 
  others can build from source on Win32/64 too and get the same results.



2008 jul 20

 * tweaked configure+makefiles so that a 'make dist' from CVS becomes
   possible: there were quite a few references to yet unpublishable
   files in my makefiles (Ger Hobbelt).
   
 * config section: improved adherence to C standards: no more 
   potentially dangerous mixed use of function and data pointers by
   typecasting function pointers into data pointers and vice versa.
   
   This has been resolved by an added layer of indirection, which makes
   it all very legal C again. It goes somewhat like this:
   
     function_pointer_type ptr = &function;
     data_pointer_type d = &ptr;
   
   then use (d[0])(...) to call the function. 
   
   This contrasts with the old code:
   
     data_pointer_type d = (data_pointer_type)&function;
   
   and function invocation using:
   
     ((function_pointer_type)d)(...)
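
   A compilable sketch of the same idea is given below. All names in it
   (handler_fn, option_entry, call_entry, twice) are hypothetical and only
   illustrate the pattern; they are not taken from pavuk's config code.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*handler_fn)(int);

static int twice(int x) { return 2 * x; }

/* The function pointer lives in an ordinary object ... */
static handler_fn twice_ptr = twice;

struct option_entry {
  const char *name;
  void *data;      /* object pointer: here it points at a handler_fn */
};

/* ... and the table stores the address of that object, so no function
 * pointer is ever converted to a data pointer. */
static struct option_entry entry = { "twice", &twice_ptr };

int call_entry(const struct option_entry *e, int arg)
{
  /* object pointer to object pointer conversion: well defined in C */
  handler_fn *slot = (handler_fn *)e->data;

  assert(slot != NULL && *slot != NULL);
  return (*slot)(arg);   /* same as (slot[0])(arg) */
}
```

   Calling call_entry(&entry, 21) goes through the extra indirection and
   ends up in twice().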
   
 * Added support for parsing 'hidden' CSS and JavaScript in HTML.
   The support is also extended to generally parse inside HTML comments
   PLUS Microsoft IE CC's (Conditional Comments): <!--[if...]><![endif]-->
 
     -read_css
     -read_cdata
     -read_msie_cc
     -read_comments
   
   These are all enabled by default; documentation has been updated for
   these as well.
   
 * Fixed CSS and [Java]Script handling in the HTML tokenizer/parser,
   which was feeding the filters and URL extractors (htmlparser.c).
   
   Now the code can cope better with incorrectly formatted pages / files.
   
 * Reordered the HTML tags in htmltags.c in a preparatory move to 
   check the list for missing attributes (onXXX JavaScript items for one!
   several are missing) and HTML 3/4 tags. (htmltags.c)
   
 
