Changes :
version 0.3 (Jul 14 1997)
-----------
* enhanced X Window user interface - now supports keyboard focus traversal
between widgets (does not work perfectly yet)
* most of the widgets were modified
* new feature added - rewriting remote URL references in HTML documents in the
local tree to point to local copies
* it is now possible to enter several starting URLs
* many bug fixes
version 0.3pl1 (Aug 6 1997)
--------------
* avoid changing the modification time of files (I want to implement document
tree synchronisation soon)
* removed bug which resulted in a hang when trying to transfer a moved
robots.txt file
* moved URLs are now correctly rewritten in HTML documents (broken in 0.3)
* more verbose reporting about moved documents
version 0.5 (Sep 25 1997)
--------------
* now every host name is converted to lower case to prevent redundancy
* some changes in widget library
* implemented transparent "reget" with FTP or HTTP protocol. Not every HTTP
server supports reget (Apache 1.2, Netscape, MS IIS and every HTTP/1.1
compliant server do).
* all files are now first stored under a temporary name (so that reget can be
used in another run of the program). When the download is finished the file
gets its true filename.
* new mode "resume regets" is implemented
* code restructuring
* functions to convert date string to internal format (synchronisation ...)
* new mode "singlepage" added - download only one HTML document with all inline
objects (pictures, ...)
* server-side maps are now handled correctly
* repaired bug where anchor names were not written to local URLs when rewriting
(broken in 0.3 and 0.3pl1, previous versions were fine)
* changes in file naming rules (each directory index is now stored in _._.html
file not in index.html or ftp_dir_index.html) == better reverse transformation
from filename to URL.
* implemented HTTP and FTP synchronization
* added new mode to SButton widget and its successors to emulate on/off button
* Toggle implemented transparently (mixed use of SButton, CheckButton,
CheckME)
* asynchronous connect when running in X Window mode
* !!!!!!!!!!!! changed name of subdirectory where www documents are stored from
!!!!!!!!!!!! "www" to "http" (this made one of my colleagues very sick :-))
* timeouts are now handled via "select()"
* each URL is now also added to a hash table for better performance in the
was_before() function - this means a little more work for each URL, but when
working on a big set of URLs it saves a lot of CPU time.
* simple SSL support using SSLeay
* removed some bugs
* added FTP proxy support
* updated X Window interface and scheduler to reflect all changes
* updated documentation
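
The hash table entry above describes a lookup that answers "was this URL seen
before?" without scanning the whole URL list. A minimal sketch of that idea
follows. It is not pavuk's actual was_before() code; the names url_set,
url_hash and url_seen_before, and the table size, are assumptions made for
illustration only.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define URL_BUCKETS 1024   /* hypothetical table size */

/* One chained entry per stored URL */
struct url_node {
    char *url;
    struct url_node *next;
};

/* A zero-initialised url_set is an empty set */
struct url_set {
    struct url_node *buckets[URL_BUCKETS];
};

/* djb2 string hash, reduced to a bucket index */
static unsigned long url_hash(const char *s)
{
    unsigned long h = 5381;

    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % URL_BUCKETS;
}

/*
 * Returns 1 if url was already recorded, 0 if it was new (and records it),
 * -1 on allocation failure. Only one bucket chain is walked per call.
 */
static int url_seen_before(struct url_set *set, const char *url)
{
    unsigned long idx;
    struct url_node *n;
    size_t len;

    assert(set != NULL && url != NULL);
    idx = url_hash(url);

    for (n = set->buckets[idx]; n != NULL; n = n->next)
        if (strcmp(n->url, url) == 0)
            return 1;

    n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    len = strlen(url) + 1;
    n->url = malloc(len);
    if (n->url == NULL) {
        free(n);
        return -1;
    }
    memcpy(n->url, url, len);
    n->next = set->buckets[idx];
    set->buckets[idx] = n;
    return 0;
}
```

The extra work per URL is one hash computation and one insertion; the saving
is that a membership test no longer compares against every known URL.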
version 0.5pl1 (Sep 30 1997)
--------------
* removed bug which prevented use of the X Window interface when compiled
without SSL support
* started rewriting some of the widgets
* all modes which scan the local document tree now scan only the desired
directories
* removed bug where pavuk sometimes hung for a long period if you try to schedule
version 0.6 (Nov 11 1997)
-----------
* all command line parameters are handled transparently via param table
* each parameter can now also be set in the "pavukrc" file
* !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
* WOW WOW WOW I finally solved that problem with that dirty TreeWidget !!!!
* !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
* keyboard control for TreeWidget (ScrollTreeWidget)
* removed one big memory leak in get_abs_file_path()
* Combo widget
* Configuration management via so called scenarios
* many bug fixes in X window interface
* more command line switches (opposites for booleans)
* removed bug in file_is_parsable() while checking if file successfully opened
* removed bug in close_socket() -> "if (sock < 0) close(sock)"
^^^^.. I love you strace.
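
The close_socket() entry above quotes a guard with the comparison the wrong
way round: "if (sock < 0) close(sock)" only ever calls close() on invalid
descriptors, so valid sockets are never closed. A minimal sketch of the
corrected guard, assuming a function of this shape (the name
close_socket_sketch is hypothetical and the real function body is not shown in
this file):

```c
#include <assert.h>
#include <unistd.h>

/*
 * Closes sock if it is a valid descriptor.
 * Returns the result of close(), or 0 when there was nothing to close.
 */
static int close_socket_sketch(int sock)
{
    /* The quoted buggy test was (sock < 0), which skipped every valid fd. */
    if (sock >= 0)
        return close(sock);
    return 0;
}
```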
version 0.6pl1 (Nov 13 1997)
--------------
* removed mistake with list parameters ( -asite , -dsite , -ddomain ...)
* removed bugs in -v -h parameters checking
version 0.6pl2 (Nov 16 1997)
--------------
* repaired some bugs - scenario loading, Domain Allow/disallow switch ...
* extended scenario loader/saver to allow scenario dir selection
* repaired html parser - \n or \r inside parsed tag results in buggy result
* command-line scenario saver
version 0.6pl3 (Dec 2 1997)
--------------
* limitation for size of transferred document added (-maxsize)
* limitation for MIME type of transferred document via HTTP/HTTPS
(-amimet/-dmimet)
* authorization for HTTP proxy added
* repaired bug - standard Xtoolkit parameters were not recognized
* repaired bug - when the parent document was not successfully processed, it
stayed locked
* repaired bug - when using HTTP proxy && connecting to SSL server
* added SSL proxy support
* added Gopher proxy support
* added gatewaying FTP and Gopher via HTTP proxy
* better FTP data connection handling
* progress meter on terminal (-progres)
* Log widget implemented
version 0.7 (Dec 30 1997)
-----------
* rewritten message reporting system for X Window - now based on Log widget
* added NLS support via GNU gettext
* created Slovak message catalog by ondrej@idata.sk (without diacritics for now)
* implemented removing of improper files and directories (in sync mode)
* bug in FTP synchronization removed - buggy reply code check
* some needless FTP commands are no longer sent while retrieving directory
list - (MDTM . RETR)
* ftp data connection is established before REST while restarting FTP transfer -
sometimes FTP server starts transfer from beginning instead of from given
position (I don't know why)
* checking of file size when synchronizing (FTP only)
* better FTP control connection handling
* some bug fixes
* logging messages to file
* solved problems with FTP synchronization
version 0.7pl1 (Jan 13 1998)
--------------
* added support for HTTP/HTTPS URLs with authentication information :
http://user:password@host:port/....
* sync mode now uses standard UTC time instead of local time - gmtime()
* ftp command MDTM sent only when required
* handling of HTML tag <META HTTP-EQUIV="Refresh" Content="..; URL=...">
* added authentication information stored in a file (read manual for authinfo
file format)
* added more entries into mime type selection dialog
(from apache mime.types file)
* now pavuk sets return code of program to number of failed transfers
* now you can optionally omit some directory levels from local doc tree
(try set -base_level $nr at command line and you will see what this means)
* checking for write() failure
* progress is now reported correctly when restarting transfer
* changed some of the widgets to have translatable strings
* repaired bug in ScrollWin widget code, where TreeList or Log widget sometimes
jumped up
* asynchronous DNS name resolving via external process
(breakable in X11 interface)
* dirty fix for error in Col and Row widgets when a resizable widget gets zero
size
* German message catalog by Joergen Grieb
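
The sync mode entry above switches timestamp handling from local time to UTC
via gmtime(). HTTP dates are expressed in GMT, so a timestamp broken down with
localtime() would be off by the local zone offset. A minimal sketch of
formatting a time_t as an HTTP-style date in UTC; the function name
format_http_date is hypothetical and this is not pavuk's own code:

```c
#include <assert.h>
#include <string.h>
#include <time.h>

/*
 * Formats t as an RFC 1123 style date in UTC.
 * Returns 0 on success, -1 on failure (conversion error or buffer too small).
 * Day and month names assume the default "C" locale.
 */
static int format_http_date(time_t t, char *buf, size_t buflen)
{
    struct tm *utc;

    assert(buf != NULL);

    utc = gmtime(&t);   /* UTC breakdown; localtime() would apply the zone */
    if (utc == NULL)
        return -1;
    if (strftime(buf, buflen, "%a, %d %b %Y %H:%M:%S GMT", utc) == 0)
        return -1;
    return 0;
}
```

A caller could use such a string when comparing a local file's modification
time against a server-reported date, since both sides are then in UTC.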
version 0.7pl2 (Jan 15 1998)
--------------
* repaired compile bug in update_links.c (when compiling without X Window
interface support)
* implemented buffered DNS requests in dns_gethostbyname()
* repaired bug when downloading FTP directory via HTTP gateway and gateway
returns HTML document with local nor remote URLs
* implemented so called dirty ftp proxy (-ftp_dirtyproxy) using CONNECT
request to HTTP proxy.
* repaired bug in filename_to_url() - http.password and http.user were not
initialised to NULL
* synchronisation with FTP<->HTTP gateway is now possible
* window geometry added to the translatable message catalog
version 0.7pl3 (Jan 26 1998)
--------------
* sync mode now correctly reports that a document is up to date
* implemented active FTP data connection
* new Slovak message catalog in ISO-8859-2 encoding by me
* you can now specify the directory from which the message catalog will be
loaded (-msgcat or NLSMessageCatalogDir:)
* rewritten passing of X-attributes to be more easily translatable
* each command line switch can now have its own help text ==> easier management
of message catalogs && self documenting switches
* rewritten all interface dependent stuff to make GTK support easier
* some initial GTK things done
version 0.8 (Feb 27 1998)
-----------
* automake/autoconf compilation-configuration scripts == very easy
installation
* GTK interface
* gnu-win32 portability
* rewritten HTML parsing code + HTML4.0 support
* fcntl locking on systems where flock is not supported
* some bugs in X-interface solved
* GTK Calendar widget
* minor bug fixes
* restriction on document creation time implemented
* rewritten parts of X-toolkit interface to look similar to the GTK interface
* Czech message catalog by Petr Vyhnalek
version 0.8pl1 (Mar 25 1998)
--------------
* some memory leaks removed
* URL based synchronisation
* command line scheduling (-schedule)
* repaired configure script : don't fail configuring GTK interface when Xpm or
Xext libraries not successfully checked, gettext in glibc2
* cyclic rescheduling (-reschedule)
* limit set of documents only on starting site (-dont_leave_site/-leave_site)
* limit set of documents only on starting directory on starting site
(-dont_leave_dir/-leave_dir)
* updated GTK interface for GTK+-0.99.4 and newer
* inline objects are on the same level of the tree as their parent when checking
the depth limit
* new option (-leave_level) to limit number of levels outside from starting
site
* you can now disable compiling of URL tree preview (big memory save)
run configure script with --disable-tree
* solved bug in xinterface.c , which causes segfault in sprintf with some
versions of libc.
* man page is installable via make install
* solved problems in widgets, which refuse to run Xt interface in some
configurations
version 0.8pl2 (Mar 30 1998)
--------------
* repaired bug in url_to_absolute_url() , when relative URL start with / ,
was oddly rewritten.
* localedir in configure script now points to the right place
* added pavuk.spec to distribution (for building RPMS)
* repaired configure script to detect right Xext,Xt library in some
configurations
* extended set of unsafe characters in URL for encoding
version 0.8pl3 (Jun 9 1998)
--------------
* repaired bug where pavuk segfaulted when redirected to an unsupported protocol
* repaired bug where pavuk missed the part of a tag between attribute name and
attribute value while rewriting links inside an HTML document
* repaired bug in GTK interface - reading of uninitialised values
version 0.8pl4 (Jul 19 1998)
--------------
* added function CardBoxSwitchTo() to allow switching of Tabs in CardBox widget
* added "Open URL" dialog to File menu
* new mode "dontstore" implemented, for fetching files to proxy-cache
servers
* added logo to About dialog
version 0.9 (Aug 5 1998)
-----------
* repaired bug in HTTP proxy code
* totally rewritten internal handling of URL tree !!!!!!
(thanks to Marc David Rovner's base idea and my long hard work :-) )
* icons now work in tree preview with GTK interface as in Xt interface
* updated Czech message catalog
* window delete event is now handled right in GTK interface
version 0.9pl1 (Aug 9 1998)
--------------
* solved problems while compiling v0.9 without GUI
* repaired bugs excellently reported by Dmitry Semenov
- HTTP reget doesn't work in sync mode
- -preserve_time doesn't work with FTP and only in sync mode
* I got the Tree preview menu working in the GTK interface :-) as in the Xt
interface
* it is now possible to disable processing of some URLs by using the Tree
preview
version 0.9pl2 (Sep 6 1998)
--------------
* minor bug fixes reported by some users
* repaired bug where -cdir ending with '/' combined with the -base_level switch
resulted in broken filenames
* implemented interactive downloading using URL tree preview dialog
* solved problem in GTK URL tree preview with more starting URLs
* URL tree preview dialog in Xt interface is now not modal
* basic support for sending and receiving HTTP cookies (writing to cookie file
not supported yet, GUI can't handle cookie parameters - only via cmd-line)
version 0.9pl3 (Sep 20 1998)
--------------
* intelligent updating of cookie file implemented (the same file may be updated
by several processes concurrently without losing cookies)
* GUI interface for cookies setup
* HTML files on FTP servers are now processed correctly
* repaired rewriting of redirected URLs with fragment name specification
* you can now manually download files which were broken or rejected from the URL
tree preview
version 0.9pl4 (Jan 6 1999)
--------------
* cookie file may contain any comments started by '#'
(not saved back after update)
* host name translation errors are now reported correctly
* buffered IO implemented
* some minor bug fixes
* repaired some segfaults
* new & more icons for URL tree preview
* HTML tag & attribute restrictions for selection of URL's from HTML docs
* checking whether the source domain of a cookie equals the domain attribute of
the Set-Cookie MIME entry
* cookie file is now correctly ordered (not reversed each time :-)
* new Czech message catalog in ISO8859-2 encoding by Petr Vyhnalek
* added new switch -gui_font , which allows you to set font used in
GUI interface
* added new switch -language used to set the language of messages when
compiled with GNU gettext support
* added very simple SOCKS(4/5) support (not tested yet)
* -pattern accepts comma-separated list of documentname matching patterns
* new option -url_pattern to enter comma-separated list of url matching
patterns
* -user_condition option added to let the user specify, via an external script
or program, whether a URL should be processed or not
* repaired bug where extra space characters in scenario file were not removed
* repaired segfault while doing HTTP reget (thanks to Orestes Sanchez Benavente)
* added -disabled_cookie_domains option
version 0.9pl5 (Jan 28 1999)
--------------
* you can now immediately change communication language from GTK GUI
* added gtk-config script to configure script for GTK configuration checking
* added client certification stuff for HTTPS (SSL) (not tested yet)
* some segfaults repaired in GUI code
* repaired time handling bugs
* added realm info to authinfo file
* HTTP authorization schemes are now handled properly
* HTTP digest access authorization implemented (it works with my Apache server)
version 0.9pl6 (Feb 28 1999)
--------------
* when compiling with the SSLeay lib, the MD5 routines from libcrypto.a are used
instead of Apache's md5c.c
* reuse of HTTP digest access nonce in more following requests is now
implemented
* digest authorization with proxy server
* added QueryGeometry to all Nws widgets for windows autosizing
(finally - I am so lazy :-))
* filename conversion routines for changing local filename
(delete set of characters , change string to string , tr like char to char)
* language change now also works if some files were already processed
(Tree preview not destroyed)
* while changing language all visible windows stay visible
* menu entry labels are GNOME compliant
* beautified xinterface.c
* rewritten Xt interface to support language change from GUI
* each file selection entry now has a browse button
* send QUIT signal while running in text mode and pavuk will exit safely
* added sample of Xt resources file for Pavuk
* thanks to Håvard Skinnemoen added some features from gtk+-1.1.*
- new style of adding children to scrolled windows
- parsing of ~/.pavuk-gtkrc
* solved win32/cygwin32/unix file path madness
version 0.9pl7 (Mar 30 1999)
--------------
* changes to support GTK+-1.2.0
* removed sk and cs ASCII message catalogs from distribution
* repaired command line time parameter scanning routine
* all labels in GTK interface are now left justified
* scheduling now works well
* solved problems when compiling without GNU gettext support and with GUI
support
* a lot of GTK improvements
* better processing of some stupid HTML constructions
* HTML comments and inline scripts are not parsed && processed
* default location of system pavukrc changed from $(prefix)/lib/pavukrc
to $(prefix)/etc/pavukrc
* added a lot of new HTML tags for processing
version 0.9pl8 (Apr 12 1999)
--------------
* now compile with gettext support on systems without LC_MESSAGES defined
* checking of robots.txt now works again (thanks to Stefan Stidl)
- checking disabled in many previous versions because of oddly written
condition :-(
* better detection of cyclic HTTP redirections
* repaired SEG fault while in GUI and HTTP redirection to already processed
document occurs
* new icons for buttons added from Andreas Kraska . If you want old buttons,
execute configure script with --disable-new_buttons option.
* accelerated menubar with GTK+-1.2 and newer
* using putenv on systems where setenv & unsetenv are not found
* a lot of minor bug fixes
version 0.9pl9 (Apr 18 1999)
--------------
* repaired bug, when all documents downloaded over HTTP/HTTPS were processed as
HTML documents (a lot of rewriting operations on binary files :-()
* repaired implementation of setenv/unsetenv on systems where they are not
implemented (thanks to Orestes Sanchez Benavente)
* timeout on connect() call
* pavuk now works on filesystems where the link() call doesn't work (FAT)
* better detection of already downloaded directories
* unbuffered read while reading document data from net
* new Action menu
* enhanced use of features of GTK+-1.2 and newer (GTK 1.0.x compatibility
preserved)
version 0.9pl10 (Apr 25 1999)
---------------
* repaired bugs in net_connect() function
* repaired bug while using active ftp connection
* you can now minimize the main pavuk window (GTK+ only)
* !!!!! -progres option corrected to -progress
* new option -runX (you can immediately start downloading files after GUI
interface is started)
* simple support for CSS
* a lot of bugs fixed
version 0.9pl11 (May 2 1999)
---------------
* new -index_name option used to change default name of directory index
* new -store_name option used to set filename for document downloaded with
-mode singlepage
* changed version of used autoconf (1.3) and automake (1.4)
* support for processing standalone CSS files
* doesn't get SIGPIPE when decoding encoded file (not fork-ing in GUI)
* using CTree widget instead of Tree with GTK+-1.2
version 0.9pl12 (May 5 1999)
---------------
* new option -ftplist to use wide listing of FTP directories (using LIST
ftp cmd instead of NLST) (only unix style of list supported)
* new option -preserve_perm to preserve permissions of ftp files
(assumes -ftplist option)
* pavuk now saves ftp symbolic links as symbolic links, not normal files
* new option -preserve_slinks to leave symbolic links pointing to the same
location as on the remote server.
* Go Bg button now works properly with GTK+ (thanks to Jan Kratochvil)
* new option -FTPhtml/-noFTPhtml to enable/disable processing of files
downloaded over FTP protocol
* anchor names for FTP urls are now parsed correctly
version 0.9pl13 (May 16 1999)
---------------
* pavuk now removes empty directories in local document tree
* directories are now processed right
* new option -min_size to eliminate transfer of small documents
* new options -skip_url_pattern and -skip_pattern
* repaired bug in document time preservation (thanks to Tomas Dobrovolny)
* while updating links in a parent document which is locked, pavuk will wait
until the lock is released
* locked document is always rescheduled
version 0.9pl14 (May 23 1999)
---------------
* thanks to Steffen Kern added dropping of URLs to url list and pavuk main window
(for example from netscape)
* thanks to Tomas Dobrovolny fixed some minor bugs in configure.in script
* new HTML tags for table backgrounds added (thanks to Szabolcs Szakacsits)
* new -htDig option for cooperation with htDig web indexing program
* new option -check_size/-nocheck_size for enabling/disabling checking of
document size (some HTTP servers report bad Content-length: header)
* minor bug fixes
version 0.9pl15 (Jun 21 1999)
---------------
* many fixes and changes in HTML parser code
* better support for Cascading Style Sheets
* a lot of patches from Szabolcs Szakacsits and Steffen Kern added
* fetching of URLs from clipboard implemented for GTK and Xt GUI
* repaired encoding of URLs (thanks to Marc Haber and Szabolcs Szakacsits)
* new option -urls_file (for reading URLs from file or stdin)
* got SSL stuff working again (was broken because of non-blocking IO)
* updated Czech message catalog (by Petr Cech)
* new icons in icons/ directory
* a lot of changes / bug fixes
version 0.9pl16 (Jun 29 1999)
---------------
* checking for zero size of file
* fixed bug with using -store_name option (thanks to Marc Haber)
* new type of log file added (option -slogfile)
* -mode resumeregets now recurse through links
* removed many memory leaks inside new HTML and CSS parser code
* removed some random crashes with Xt GUI
version 0.9pl17 (Jul 06 1999)
---------------
* bigger read buffer -> better read performance on fast connections
* new option -identity for specifying User-Agent: HTTP request field
* new option -nosend_from to suppress sending of From: field with HTTP request
* new option -nostore_index used to tell pavuk not to store documents
referenced with directory URLs
* new option -acharset used to specify set of preferred document encodings
for HTTP protocol
* changed selection retrieving with GTK+ GUI
* better native language switching in internationalized environment
* bug fixes
version 0.9pl18 (Jul 26 1999)
---------------
* support for EPLF format listing of FTP directories
* support for Novell format listing of FTP directories
* repaired one typo which broke compilation without GUI
* automatic preferences saving/loading to file ~/.pavuk_prefs
* loading & saving of menu accelerator keys to prefs file
* fixed type casting bug in html/css parser code (thanks to Robert Gasch)
* support for newer openssl versions (0.9.3 and newer)
* better & nicer progress meter
* limitation of transfer speed (max/min)
* my CERN HTTP/proxy server is somehow odd - synchronization of WWW pages
won't work if you specify port number in URL (curious), so the port number
is now removed from the URL if it is the default one.
* sync mode now works well when spanning to another server
* sync mode works again with servers which do not respond with a correct 304
code (mea culpa)
* added Apply button to configuration dialogs
* fixed lot of bugs in net_connect function
* installation of pavuk icons to $(prefix)/share/icons/
* new quota options (quota for file size, transfer amount and free space on
filesystem)
* solved bug where Gtk+ URL list did not show its contents
* solved bug where pavuk crashed on redirection to unsupported URL
* corrected fetching of URI: header content for redirected URLs
* several bug fixes and improvements
version 0.9pl19 (Sep 06 1999)
---------------
* changed URL equivalence checking from filename based to URL based
* internal URL representation now contains its local filename , this means
lower memory footprint, but bigger memory consumption
* several minor memory leaks removed
* implemented universal & flexible mapping mechanism URL -> local filename
based on RE or wildcard patterns and simple rules (see manual,
option -fnrules) (thanks to James Feeney for the base idea)
* implemented optional saving of info files for each document (each info file
contains the source URL of the document, and for documents downloaded via
HTTP/HTTPS also the whole HTTP header)
* repaired parsing of standalone CSS files
* if storing of info files is enabled and you change the default local tree
layout (with -fnrules or -base_level or -tr_* options), URLs will now never
overlap
* new option -all_to_local used to force rewriting all URLs in HTML document,
to point to expected location
* new reminder mode for checking if any URL was modified in given period
* code cleanups
* new option -sel_to_local used to force rewriting all URLs in HTML document,
which satisfy the limits, to point to expected location
* many corrections in messages (thanks to Colin Marquardt)
* repaired bug in removing BASE tag from HTML code; it is now not removed, but
commented out (thanks for bug report and idea to Jan Tomasek)
* added icons to OK && Cancel buttons in Gtk interface (GTK+ only)
* changed all GtkList widgets to GtkCList
* added Clear & Modify buttons to each editlist dialog (GTK+ only)
* you can now optionally change pixmaps for buttons from pavukrc file
(see all Btn*Icon*: statements)
* fixed bug in ftp directory translation to HTML when using passwords with
FTP URL
* finally I fixed that bug which randomly put trash into pattern options in the
GUI interface. strtok() is a really bad function :-(
* fstatfs emulation on SYSV systems using fstatvfs
* better detection of header files where fstatfs is declared
* repaired Seg Fault when using cookies (thanks to Andrew Hall)
* added more icons to GTK+ dialogs (thanks to Frederic Toussaint)
* each dialog window can be closed with ESC key (GTK+1.2 only)
* each menu entry can now have an assigned shortcut (GTK+1.2 only)
* make uninstall now works well (thanks to Colin Marquardt)
* option -lmax now works properly with inline objects
(thanks to Bernd Lutkenhoner)
* removed old_buttons
* updated German message catalog (thanks to Colin Marquardt); if you speak
German, please check it and report possible errors to Colin
* new option -check_cookie for enabling checking whether a cookie is set for
the domain from which it comes
* fixed bug in cookie handling code
* collections of button icons for pavuk in button_icons/
* URL redirection code for nonabsolute URLs fixed a bit
* fixed detection of base URL of document for documents with URL with search
string
* new French message catalog (many thanks to Frederic Toussaint); if you speak
French, please check it and report possible corrections to the author
* updated Czech message catalog (thanks to Petr Cech)
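
The strtok() complaint in this release has a well-known cause: strtok() writes
NUL bytes into the string it is splitting and keeps hidden static state
between calls, so two interleaved users (or a caller that still needs the
original string) corrupt each other. A minimal sketch of a splitter for
comma-separated pattern lists that avoids both problems follows. The name
next_pattern is hypothetical; this is not the code pavuk ended up using.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Copies the next comma-separated item from *cursor into a freshly
 * allocated buffer and advances *cursor past it. Returns NULL when the
 * input is exhausted or on allocation failure. The input string is never
 * modified, and all state lives in the caller's cursor.
 * Unlike strtok(), empty items between adjacent commas are returned as "".
 */
static char *next_pattern(const char **cursor)
{
    const char *start;
    size_t len;
    char *item;

    assert(cursor != NULL);
    if (*cursor == NULL)
        return NULL;

    start = *cursor;
    len = strcspn(start, ",");      /* length up to the next comma or end */

    item = malloc(len + 1);
    if (item == NULL)
        return NULL;
    memcpy(item, start, len);
    item[len] = '\0';

    /* Step over the comma, or mark the end of input */
    *cursor = (start[len] == ',') ? start + len + 1 : NULL;
    return item;
}
```

The caller owns each returned buffer and releases it with free().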
version 0.9pl20 (Sep 29 1999)
---------------
* new option -all_to_remote used to leave all links inside HTML document to
remote location (proposed by Diego Antona Archilla)
* fixed incompatibility with GTK+-1.0
* for starting HTTP URLs pavuk can now optionally send the URL itself in the
Referer: field, see option -auto_referer (proposed by Sergey Taranenko)
* fixed segfault in cookie modification code
* numbering of documents with overlapping local names for different URLs
* new better HTML tag handling routines
* removed a lot of memory leaks
* URL downloading order strategies implemented (idea by Sergey Taranenko)
* replaced GtkText widget with GtkCList widget in log window
* limiting the length of the log now works in the GTK+ interface
* fetching files from Netscape browser cache directory
(great idea by Sergey Taranenko)
* new Spanish message catalog by Javier Comeron
version 0.9pl21 (Oct 13 1999)
---------------
* support for removing advertisement banners from HTML pages
(base idea by Mika Joukainen)
* timestamps are written to regular log file when starting and ending log
(proposed by Jan Tomasek)
* support for Bell V8 implementation of regular expressions (as used in cygwin)
* fixed SegFault which occurred while loading scenarios while a download was in
progress (thanks to Sergey Taranenko)
* authorization info editor (only for GTK+ GUI)
* new option -check_bg/-nocheck_bg used to detect if we run as background job,
if so don't write any messages to screen
* fixed some errors in Xt interface
* fixed bug where stdout wasn't flushed before _exit()
(thanks to Szabolcs Szakacsits)
* new option -send_if_range/-nosend_if_range. This option should be used when
HTTP server supports reget, but sometimes generates a different Etag field
for an unchanged document (if the Etag and If-Range fields differ, reget will
start from the beginning of the file)
* locking of log file
* optional numbering of log file when log file locked (option -unique_log)
(proposed by Sergey Taranenko)
* several message fixes (thanks to Colin Marquardt)
* running of post processing command after successful download of document
see option -post_cmd (proposed by Sergey Taranenko)
* counting of fatal errors
* fixed core dump in lfname structure cleanup when using fnmatch patterns
(thanks to Kevin Gamiel's report)
* fixed bug which caused some broken links
* fixed bug which caused errors when compiling Xt version of interface with
support for loading files from Netscape browser cache
(thanks to Niraj Sachdeva)
* portability to HPUX solved (thanks to Niraj Sachdeva)
* fixed bugs and oddities in sync mode code (thanks to Szabolcs Szakacsits)
* fixed typo which caused problems using mode linkupdate from command line
(thanks to Szabolcs Szakacsits)
* fixed bug where, when using -store_info, pavuk left some lock files open,
which caused the "Too many open files" error (thanks to Dawit Yimam)
* significant speedup of sync mode
* some internationalization fixes (thanks to Javier Comeron)
* several bug fixes in local name assigning code (when using -fnrules option)
* fixed possible problems with timeout detection in GTK+ interface
* it is now possible to specify a template for the scheduling command
(look for -sched_cmd option)
* fixed bad behavior with "" urls inside HTML documents
* fixed bug in URL parsing when the URL contains both anchor and searchstr
version 0.9pl22 (Nov ?? 1999)
------------
* fixed portability to systems which don't declare h_errno
* got rid of all dirty strtok()s (I hope without mistakes)
* removed all configuration environment values !!!!!!!!
* fixed problems with loading files from NS cache on big endian machines
* more properties for URL displayed in URL tree preview (GTK only)
* added UI configuration for -stime option
* fixed some bugs in handling of document base URL in HTML parser (thanks to
Laurent Salles' report)
* fixed functionality of -min_size option (thanks to Frank Baumgart)
* fixed segfault when running user condition script (thanks to Frank Baumgart)
* added support for BSD regular expressions
* added support for GNU regular expressions
* started debug levels implementation
* selection of SSL client methods version implemented, option -ssl_version
(thanks to Ian's idea)
* handling of &amp; and & inside URLs (thanks to Matt's note)
* fixed typo in configure script which caused misconfiguration in some cases
* fixed handling of URLs with \n \r \t characters
* repaired handling of nonblocking IOs (thanks to Szabolcs Szakacsits' solution)
* fixed buggy behaviour of get_abs_file_path() function
* optional unique SSL ID with all SSL sessions (thanks to Jeff Roberson's howto)
* added handling of starting urls in form server:[port]/...
* added new Append URL dialog for appending URLs while a download is in progress
(GTK only)
* added proxy authorization with CONNECT request
* fixed handling of \ and " characters inside quoted strings
* added new option -httpad to be able to add some user defined HTTP headers
in HTTP requests
* implemented statistical reports for downloading progress (can be saved to
file - -statfile option, or previewed inside GTK UI window)
* fixed limits checking (prefix,postfix,patterns) for HTTP URLs with search
string part
* changed debug mode controlling with -debug_level option
* new WIN32 specific option -ewait, to let the user control whether the console
will disappear after pavuk has finished (proposed by Jan Tomasek)
* started writing NEWS document, to let users briefly learn about new pavuk
features in particular pavuk versions without reading the huge ChangeLog file
* new possibility to save URL tree structure from URL tree preview dialog
window (GTK+-1.2 only)
* .pavuk_info directories are now omitted, when scanning local document tree
in linkupdate,resumeregets and local tree based sync mode
* fixed pavuk's behavior of option -check_bg on systems where getpgrp() needs
PID parameter
version 0.9pl23 (Dec 20 1999)
---------------
* huge internal rewrite, changed handling of some globals - big step to
MT version, cleanup of internal algorithms
* implemented new mode (ftpdir) for listing contents of FTP directories
(proposed by Niraj Sachdeva)
* added new macro %m (domain name) to -fnrules option
* changed handling of encoded documents - now only HTML and plain text
documents are decoded, all others are stored encoded
* fixed corruption of cookies.txt file after user break
* completely changed handling of refresh META tag - broken in several
previous releases
* fixed portability to FreeBSD (thanks to Holdrich Kristian)
* new options -aip_pattern & -dip_pattern for specifying allowed IP
addresses with regular patterns (proposed by Samuel Laker)
* fixed bug in option -debug_level setting to "all" (thanks to Andreas Mohr)
* fixed logging in to nonanonymous FTP servers through HTTP gateway proxy
(thanks to Andreas Mohr)
* new option -site_level for limiting how many site levels to leave from
starting site
* TOS settings for FTP data and control connection
* introduced new protocol FTPS for making SSL connection to FTP servers
with SSL support
* if you set environment variable PAVUKRC_FILE, pavuk will read this
file as user pavukrc file instead of ~/.pavukrc file (proposed by
Andreas Mohr)
* fixed SSL reading function, which could in some cases cause loss of data
at end of file or hang in select()
* fixed problems with makealldirs() on WIN32 platform
* added additional information (size,processing time) to structured log
file (proposed by Dave Becket)
* fixed problems with restarting in GUI interfaces
* fixed problem with URLs with slashes at end of query string (thanks to
Dave Becket's report)
* fixed problem with naming of local copies of FTP directories when
downloading through HTTP gateway
* added new HTML tag for URL processing CSOBJ/HT
* added new URL schemes for processing (tel,fax,modem,sms - from IETF drafts)
* automatic handling of unsafe characters inside filenames (currently handled
only for Windows - \:*?"<>|) (proposed by Jan Tomasek)
* configure script now detects if msgfmt supports --statistics option
(proposed by Dave Becket)
* fixed hangup after blocking locking inside document read loop
* implemented much cleaner blocking locking
* fixed several odd behaviours when generating localname of document
* implemented simple adjusting of too long filenames
* partially implemented HTTP/1.1 protocol with persistent connections !!!
* new options -use_http11/-nouse_http11 for enabling or disabling HTTP/1.1
protocol support
* many many bug fixes
* extended URL based sync mode. Now you can specify subdirectory which
contains mirrored documents (with option -subdir) and that directory is
scanned before for documents, and after URL based synchronization is finished
pavuk starts checking URLs from local tree, which were not checked in URL
based synchronization.
* get rid of most of unsafe static buffers
* support for deflate encoding method via zlib
* handling of 1xx HTTP response codes
* bit changed behaviour with -site_level & -leave_level when processing
moved URLs
* more automatic scan for OpenSSL || SSLeay libraries location
* fixed bug, which causes segfault, if BASE URL is unknown or unsupported
(thanks to Jeff Roberson's report)
* applied patch from Jeff Roberson, which enables use of a specified local
network interface for communication (useful for multihomed hosts);
uses new option -local_ip
* thanks to Colin Marquardt improved quality of manual
* fixed linkupdate to work properly again (thanks to Jaydeep Desai's report)
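
The 0.9pl23 entry on unsafe filename characters can be illustrated with a
minimal sketch. It is not pavuk's actual code: the helper name
sanitize_component() and the choice of '_' as the replacement character are
assumptions made only for this example. The character set is the one listed
in the entry above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Characters the changelog entry lists as unsafe on Windows. */
static const char unsafe_chars[] = "\\:*?\"<>|";

/*
 * Hypothetical helper (not pavuk's real function): replace every unsafe
 * character in one path component with '_', in place.
 * Returns the number of characters that were replaced.
 */
static size_t sanitize_component(char *name)
{
    size_t replaced = 0;
    char *p;

    assert(name != NULL);
    for (p = name; *p != '\0'; p++) {
        if (strchr(unsafe_chars, *p) != NULL) {
            *p = '_';   /* replacement character is an assumption */
            replaced++;
        }
    }
    return replaced;
}
```

The helper works on a single path component, so directory separators have to
be handled by the caller before the name is passed in.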
version 0.9pl24 (Feb 09 2000)
---------------
* implemented parsing of VMS style FTP directory listings
* solved problems with FTP control connections, when pavuk breaks data
transfer before finished
* rewritten from scratch URL parser - now is cleaner, easier extensible,
faster and with lower memory footprint, and I hope conformable with
RFC 2396
* new routine for comparing URLs based on url structure instead of URL
string - means faster and with lower memory footprint
* bit better internal handling of query strings
* fixed segfault with decoding nonHTML documents
* fixed handling of FTP list processing on FTP servers which don't include
"total xxx" line on top of directory listing
* added support for parsing old style BSD directory listings
* removed some random memory leaks introduced in previous release
* fixed closing of several unhandled HTTP/1.1 persistent connections with
remaining unrequired data
* fixed again handling of moved URLs with -leave_level option
* fixed ftpdir mode behaviour with some of HTTP gateways for FTP (for example
Squid) (thanks to Niraj Sachdeva)
* implemented HTTP POST requests (see option -request)
* implemented parsing of DOS/Windows style FTP directory listings
* fixed handling of oddly detected persistent connections when using HTTP/1.0
and talking to HTTP/1.1 server which doesn't respond with Connection: close
header
* fixed "Zero size" possible error reporting only for cases when we don't know
exact size or size is non zero
* implemented dialog for editing HTML forms (GTK+ only)
* new option -hash_size for performance tuning when mirroring large amount
of URLs
* now supports FTP URLs as defined in RFC (ftp://serv.dom/path for relative
path to login directory and ftp://serv.dom//path for absolute path from FTP
server root directory)
* changed behavior when doing FTP directory listings (CWD path + NLST/LIST
changed to NLST/LIST /path)
* rejection of UNIX special files (sockets, devices, fifos) in FTP directory
listings
* fixed segfault on empty FTP directory listings
* fixed segfault in document info storing code
* rewritten document locking routine, because of possible race conditions and
errors in previous implementation
* enhancement for -fnrules option, which allows much higher flexibility in
local name assignment to document (undocumented and not well tested yet)
* fixed nonfunctional -store_name option
* fixed h_errno test in configure script, to work on SYSV systems (thanks to
Marc Chantome)
* implemented dropping of URLs to URL Append dialog
* implemented option to be able to follow downloading process inside
URL tree preview window (GTK+-1.2 only) (proposed by Francois RicharC)
* fixed odd behavior of FTP URL parser on WIN32 platform with FTP URLs in
form ftp://ftp.server.dom//absolute/path/...
* fixed bug in new FTP directory processing routines when listing directories
on MS FTP servers (thanks to LE FAUCHEUR Frederic)
* fixed bug in routine which is computing difference between GMT and local
time (on some platforms localtime() and gmtime() return the same statically
allocated buffer for returning result)
* updated Properties view in URL Tree preview to show POST request infos
* support for inserting POST request inside URL tree from Form editor
dialog
* repaired URL parser to support URLs in form http://www.server.dom?xxxx
http://www.server.dom#xxx
* fixed possible segfault in FTP code, which may occur, when pavuk is not
able to establish data connection
* fixed bugs in scenario saving code (thanks to Peter Erbak, Bill Miller)
* fixed cookies handling with moved documents
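
The two FTP URL forms described in the 0.9pl24 entries (ftp://serv.dom/path
relative to the login directory, ftp://serv.dom//path absolute from the
server root) can be told apart with a few lines of C. This is only a sketch
under stated assumptions: ftp_url_path() is a hypothetical name, not a pavuk
function, and the sketch ignores user:password@ parts and URL-encoded
characters.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not pavuk's URL parser): return the path part of an
 * FTP URL and set *absolute to 1 when the path starts at the server root,
 * or to 0 when it is relative to the login directory.
 * Returns NULL when the URL is not an ftp:// URL or has no path at all.
 */
static const char *ftp_url_path(const char *url, int *absolute)
{
    static const char scheme[] = "ftp://";
    const char *slash;

    assert(url != NULL && absolute != NULL);
    if (strncmp(url, scheme, sizeof scheme - 1) != 0)
        return NULL;

    /* The first '/' after host[:port] separates the host from the path. */
    slash = strchr(url + (sizeof scheme - 1), '/');
    if (slash == NULL)
        return NULL;

    /* A second '/' directly after the separator marks an absolute path. */
    *absolute = (slash[1] == '/');
    return slash + 1;   /* "path" (relative) or "/path" (absolute) */
}
```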
version 0.9pl25 (Mar ?? 1999)
---------------
* get rid of all Xt GUI code
* fixed bug in code which handles filesystem unsafe characters in Win32
* fixed bug in sync mode which stops crawling when starting document is
up to date (thanks to Dave Becket)
* fixed minor bug in handling of ; character inside URL
* implemented support for multiple HTTP proxy servers with intelligent round
robin scheduling
* fixed segfault when using ftp/gopher HTTP gateway and cookies are enabled
for sending
* fixed bug in url_compare() function which gave wrong results when comparing
URLs with different scheme (thanks to Niraj Sachdeva)
* fixed uninitialized HOME environment variable checking (thanks to Andreas
Mohr)
* added check for db_185.h to configure script when looking for Berkeley DB1
header files (thanks to Roar Bergheim)
* fixed checking of start/end time limits in sync mode (thanks to Peter Thalman)
* fixed segfault with moved robots.txt files (thanks to Bill Miller)
* fixed bug in function filename_to_url() which causes odd behavior mostly
in sync mode (thanks to Peter Thalman)
* fixed HTTP proxy Digest authorization code
* added possibility to use authinfo file to store proxy authorization
information
* implemented optional multithreading support (currently only the console
version works, the GTK version needs some further changes and testing)
* changed URL encoding/decoding handling, now user must enter regularly
encoded URLs
* several simplification changes in Makefile.am files (thanks to aldomel)
* fixes to configure.in script and Makefile.in files to get 'make distcheck'
working (thanks to aldomel)
* simplified recomputation of GMT time from local time on systems with
tm_gmtoff inside struct tm (thanks to Robert Brennecke)
* corrected pavuk behaviour when -request contains some unpredictable request
specifications (thanks to aldomel)
* fixed compilation with --disable-tree
* fixed SSL read/write errors handling (thanks to Jeff Roberson)
* split gui code to more modules
* fixed segfault when trying to preview document properties in URL tree
preview dialog
* fixed scheduling from UI
* bit changed statusbar in UI
* zillion miscellaneous changes to get working GUI with multithreading
* workaround HP-UX NAME_MAX/PATH_MAX settings to disable automatic adjusting
of long filenames to 14/255 limits (thanks to Niraj Sachdeva)
* got -store_name option working again (thanks to Orestes Sanchez Benavente
and Jan Tomasek)
* fixed possible problems with reading and writing via SSL on nonblocking
sockets.
* fixed functionality of -local_ip option when you change it in GUI
* fixed rewriting of URLs in HTML form action tags
* optimized header files dependencies - faster compilation
* removed minor memory leaks in HTML forms processing code
* corrected parsing of FTP response to PASV command to be able to cooperate
with publicfile FTP server (thanks to Felix von Leitner)
* fixed implementation of html_tag_co_elem() function
* implemented chance to fill noninteractively HTML forms when matching form
is found (many thanks to Jeff Roberson's idea and first implementation)
* implemented dumping of documents to any supplied file descriptor (thanks to
Honza Tomasek)
* corrected pavuk process exit value computation (redirected documents are
not counted as failed yet) (thanks to Thomas Coppock)
* fixed bug in function url_to_absolute_url() which causes bad behaviour with
URLs ending with -index_name. (thanks to Antoine Martin)
* --------- released testing version 0.9pl25c
* implemented code for saving session data to ~/.pavuk_keys in GTK interface
* corrected handling of multiline lists in HTML form filling dialog
* corrected several bugs in HTML forms parsing code
* fixed hangup on exit when using language switching from GUI menu
* fixed possible segfault when HTTP server responds with improper response
* --------- released testing version 0.9pl25d
* added several sample identity strings to combobox in GUI
* added files for integration to Gnome menu
* fixed bug with -fnrules F ... caused by FNM_PATHNAME flag passed to
fnmatch() with some libc implementations (thanks to Nicolay Mausz)
* corrected bad behaviour of function get_abs_file_path_oss() which expanded
relative paths to absolute paths the wrong way
* changed behaviour of 'Load scenario' which now resets configuration before
loading scenario and added new function 'Add scenario' which behaves same
as 'Load scenario' before
* fixed bug introduced in 0.9pl25a which damages url structure and causes
cycling of download and hangups or segfaults on exit
* adjusted NS cache directory access routines to be safe when accessing from
multiple threads
* ---------- released testing version 0.9pl25e
* fixed segfault caused by wrong call to tl_str_concat() in doc_download()
* fixed GUI compilation without NLS support (thanks to Gabor Z. Papp)
* fixed Toggle toolbar functionality
* minor corrections in Makefiles (thanks to Petr Cech)
* fixed pavuk.spec file to properly build RPMs
* updated Slovak, Czech, Spanish message catalogs (thanks to all authors)
version 0.9pl26 (Aug 31 2000)
---------------
* added new Italian message catalog by Antonio Fragola
* updated German message catalog (thanks to Colin Marquardt)
* fixed sending of HTTP Content-type: request header with POST requests
* implemented optional deleting of remote FTP documents after successful
transfer (idea by Gabor Z. Papp)
* you can now optionally disable the numbering of overlaying documents to
achieve unique name using option -nounique_name (idea by Nicolay Mausz)
* added patch from Nicolay Mausz which implements new rmpar function in
-fnrules option syntax
* fixed bug in SSL reading code which raises error when session was regularly
closed on other side (thanks to Martijn van Oosterhout's patch)
* fixed cooperation with SSL FTP servers which indicate successful switch to
SSL mode with 234 response code (thanks to Martijn van Oosterhout's patch)
* fixed opening of FTP data connections. Old code could cause deadlocks in
communication with some proxy servers. (thanks to Martijn van Oosterhout)
* fixed typo in config.h which refuses compilation on HP-UX (thanks to Niraj
Sachdeva)
* ---------- released testing version 0.9pl26a
* better checking for pthreads support in configure script
* added option --with-gtk-config to configure script, to allow easier
configuration on system with such weird renaming of libs/scripts as
on FreeBSD
* added handling of HTTP server response fields Content-Location:,
Content-Base:, Base: for setting base URL of document (thanks to Robo
Dobozy)
* warning Zero length ... will now not appear with HTTP documents which
don't contain Content-Length: response field
* fixed total document size computation of partially transferred documents
if server doesn't provide Content-Length: header but only Content-Range:
* fixed broken robots.txt parser
* support for extended robots.txt standard with new Allow: statement
* -request option was extended to allow specifying in request also destination
filename of document in local filesystem
* -debug_level user now also shows filename where document is stored
* fixed bug in robots.c when host name field in robots structure was
deallocated without discarding data when restarting
* added MT locking of robots data; without locking could cause unpredictable
segfaults
* now it is possible to enter empty values for form data in POST request
specification dialog
* form editor dialog now properly extracts also hidden fields
* corrected handling of HTTP response code 303 with POST requests, now pavuk
correctly redirects to GET request as it should
* ---------- released testing version 0.9pl26b
* added support for PCRE regular expression in -*rpattern options and in
-fnrules option
* -amime -dmime options now accept also wildcard patterns
* added TLSv1 support for HTTPS/FTPS communication
* added new option in configure script --with-regex, which allows selecting
preferred regular expression type (one of none/auto/posix/gnu/v8/bsd/pcre)
* fixed compilation error in lfname.c when none of supported regular
expression types was configured
* enabled substring substitution in -lfname option when using Bell V8 regular
expressions and regsub() function is available (cygwin b20 doesn't export it)
* added new option -dump_urlsfd to enable outputting URLs from downloaded HTML
documents to selected file descriptor - usable for scripting
* adjusted filenames handling in WIN32 version to support new style of mapping
win32 paths to POSIX paths in newer cygwin-1.x.y versions
* corrected comparing of URLs in -formdata option (thanks to Jeff Roberson)
* ---------- released testing version 0.9pl26c
* fixed seg-fault on parsing supported URLs with missing scheme dependent
part of URL string (thanks to Marc Tooley).
* fixed problem with sleep() implementations which use SIGALRM for wake up
in multithreaded version (thanks to Antoine Martin)
* new option -dont_leave_site_enter_dir/-leave_site_enter_dir which allows to
limit leaving of directory which we entered first on the site
* enabled option -store_name to work also in other modes than just singlepage
* wrote small document wget-pavuk.HOWTO for wget users who are starting to
use pavuk
* updated manual page
* -h option works now properly when -bg option is also used (thanks to
Artem Frolov)
* attempt for workaround signal handling inconsistency in multithreading
environment (thanks to Antoine Martin)
* define DB_LIBRARY_COMPATIBILITY_API in nscache.c before including db_185.h
to force reading 1.8x Berkeley DB format with 3.xx library
* updated Slovak message catalog
* ---------- released testing version 0.9pl26d
* fixed problems with frozen threads on Solaris when starting download (thanks
to Antoine Martin)
* added call to FreeConsole when running pavuk with -bg option on Win32
systems (thanks to Andreas Mohr)
* added some gdk_flush() calls to status list modification code to force
better updates
* added new option -singlepage/-nosinglepage to overcome limits of -mode
singlepage (thanks to Jo? Savignon)
* now in sync mode is also checked size of documents downloaded over HTTP
(thanks to Raun Nohavitza)
* added check for ssize_t type, without it won't compile on Ultrix
* ---------- released testing version 0.9pl26e
* added support for using network paths on WIN32 with cygwin-1.1 and newer
* fixed broken -dont_leave_site_dir option
* added commandline passwords hiding feature (thanks to Steven Haryanto)
* fixed behaviour of -dont_leave_site_dir with moved site enter URLs
* updated German and Spanish translations (thanks to Javier and Colin)
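
The 0.9pl26 fix for servers that send only Content-Range: (and no
Content-Length:) depends on reading the total size from the part after the
slash in a value such as "bytes 500-999/1234". A minimal sketch of that
parsing step follows. The function name content_range_total() and the use of
-1 for "unknown" are assumptions for illustration, not pavuk's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper (not pavuk's code): extract the total document size
 * from a Content-Range header value of the form "bytes first-last/total".
 * Returns -1 when the total is unknown ("bytes first-last/<star>") or when
 * the value is malformed or inconsistent.
 */
static long content_range_total(const char *value)
{
    long first, last, total;

    assert(value != NULL);
    if (sscanf(value, "bytes %ld-%ld/%ld", &first, &last, &total) != 3)
        return -1;
    /* The range must lie inside the document it claims to belong to. */
    if (first < 0 || last < first || total <= last)
        return -1;
    return total;
}
```

A long may be too narrow for very large documents on some platforms; a real
implementation would probably pick a wider type.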
version 0.9pl27 (Dec 13 2000)
---------------
* fixed infinite loop bug when both -store_name && -request options are used
(thanks to Matthew)
* add new menu to GUI for selecting starting URLs from opened documents inside
Netscape
* fixed bug which caused reloading of nearly all HTML documents in sync mode
because of size comparison
* fixed bug in parsing FnameRules: scenario field (thanks to Le Faucheur
Frederic)
* fixed freeze on scenario loading from GUI in multithreaded version (thanks
to Le Faucheur Frederic)
* query strings from HTTP/HTTPS URLs are now not decoded when generating
local names
* new naming convention for local documents downloaded via POST request
name#query (thanks to mda)
* fixed bug which causes hangs or segfaults when using -formdata option,
because of double freeing of a memory chunk (thanks to Matthew)
* added two new patterns (<script , <style) to routine for guessing HTML files
* fixed dumping of wrong ENCODING: fields in -formdata, -request infos to
scenario file (thanks to Matthew)
* ---------- released testing version 0.9pl27a
* now works -disable_html_tag all or -enable_html_tag all to disable/enable
all HTML tags
* fixed fast spawning loop in multithreaded version caused by bad use of
pthread_cond_timedwait() (thanks to Bjorn R. Bjornsson)
* fixed progress display bug showing size in bytes instead of kilobytes
(thanks to Andreas Mohr)
* fixed bug in FTP code when pavuk opens twice data connection for directory
listings (thanks to Raun Nohavitza)
* fixed stupid bug when pavuk uses short int type instead of unsigned short
for storing port numbers (thanks to Raun Nohavitza)
* fixed checking of HTML document types with added encoding after MIME type
(thanks to Brunie-Taton Alain)
* repaired broken site level computing on sites with moved starting documents
in -site_level option
* implemented functions for launching commands on WIN32 with system()-like
function when cygwin not installed (thanks to Thierry R?nier)
* added support for loading files from MSIE cache on Win32, and added options
-ie_cache/-noie_cache to enable/disable this feature
* backported improvements to gaccel code from chbg. Now it is much more
reliable.
* added new macro %q to -fnrules option, which will be replaced with urlencoded
query string from POST/GET request specification
* fixed big memory leak in old style fnrules evaluation function caused by bad
block nesting
* added four new functions (sif, !, &, |) to -fnrules option. ! is logical NOT
for numeric values. & is logical AND for num. values, | is logical OR for
numeric values. sif is decision between two strings by condition.
(sif (cond) (val_if_cond_true) (val_if_cond_false)) is equivalent to the C
expression (cond) ? (val_if_cond_true) : (val_if_cond_false)
* added checks to reject compilation of NS cache reading code with BerkeleyDB
2.0 and above because of incompatible database format. NScache uses 1.8x hash.
* corrected support for reading NS cache on big endian platforms based on patch
for my NScache program from ...
* made HTTP/1.1 default (still possible to switch to HTTP/1.0 with option
-nouse_http11)
* changed handling of parent urls in URL structure. Now is used linked list
instead of nul terminated array. It is much safer for handling in MT.
* fixed segfault on redirection of robots.txt when HTTP/1.1 enabled caused by
bad handling of persistent connections
* fixed bug in robots.txt file parsing code which causes infinite loops with
some robots.txt files
* fixed memory leaks on robots.txt redirections
* fixed segfault when using -mode dontstore in multithreaded mode, caused by
allocating shorter buffer for storing temporary unique name :-(
* fix to be able to compile with gtk-1.3 (aka gtk-2.0)
* added support for HTTP redirection on 307 response code
* added description messages for all HTTP/1.1 response codes which may occur
and cause unknown errors just with numeric description
* fixed bug in processing of HTTP/1.1 chunked transfer encoding types
after moved URLs because of oddly initialized trailer reading flags :-(
* it is now possible to enter on commandline options unsupported in current
compile time configuration, pavuk now only displays warning instead of
raising error and exiting (thanks to Bjorn R. Bjornsson)
* fixed compilation when threads are enabled and support for regular
expressions is disabled or not present
* added locking of robots.txt info structure to prevent downloading it
concurrently with multiple threads when compiled with MT support
* ---------- released testing version 0.9pl27b
* fixed compilation bug when compiling without SSL support (thanks to
Le Faucheur Frederic)
* fixed bug made in previous testing release which causes segfault always
when opening Limits config dialog because of use of uninitialized pointer
* added support for long/short commandline options with GNU getopt like syntax
and compatibility with old format of pavuk options (no short options defined
yet)
* changed handling of scenarios from commandline. Scenario is now loaded at the
time when the --scenario option is processed by commandline parser instead of
prior to commandline parsing as before.
* now it is not mandatory to specify --scndir option before loading scenario.
* ---------- released testing version 0.9pl27c
* more reliable implementation of asynchronous DNS client/server for GUI
version. Now guarantees atomicity of reads/writes, so protocol inconsistency
is no longer possible after user break in middle of communication.
* internal restructuring of code (hope not, but may lead to problems)
* fixed bug in preserving of persistent connections on robots.txt redirects
* fixed unnecessary closures of persistent connections in sync mode after
304 response code
* added new options -dump_after/-nodump_after for use with -dumpfd option.
These options control when the document is dumped to output (immediately or
after download&processing)
* added new options -dump_response/-nodump_response for dumping also HTTP
responses to -dumpfd
* fixed bug in parsing CSS inside HTML tags
* removed support for extracting destination URL from HTML after HTTP
redirects. It must be broken server which doesn't send Location: header
after redirect ... not worth to add workarounds for this problem
* rewrote from scratch the HTML parser (this means I've got rid of the
oldest, worst written code in pavuk). It seems it should be a bit faster
and is much more extensible and maintainable.
* removed few small memory leaks
* added simple support for javascript patterns in DOM event attributes of tags,
based on regular expressions
* ---------- released testing version 0.9pl27d
* fixed several memory leaks
* fixed bug in base64 encoding routine which was failing with non ASCII
characters above 127
* changed the way Digest authorization is handled
* implemented NTLM authorization
* implemented NTLM proxy authorization
* now -auth_scheme & -http_proxy_auth options accept also textual parameters
"user" "Basic" "Digest" "NTLM" besides numeric 1 2 3 4
* total restructuring and cleanup of HTTP handling code. I was careful,
but it may lead to problems.
* NTLM and Digest authorization now also work with CONNECT requests
* minor changes in common settings dialog
* fixed bug in processing js patterns caused by bad tag attributes
* added new option -js_patterns to allow parsing of custom javascript patterns
inside HTML documents
* added support for parsing also script body and look for patterns line by line
(works also for files referenced by <SCRIPT SRC=...>)
* implemented handling of proxy redirects (305 HTTP response)
* fixed compilation bug caused by undeclared _mt_dumpfd_lock_ mutex (thanks
to Le Faucheur Frederic)
* fixed bug in handling locales in national environment (thanks to Milan
Kerslager)
* added Czech translation to Gnome desktop entry for pavuk (thanks to Milan
Kerslager)
* ---------- released testing version 0.9pl27e
* implemented detection of broken HTTP/1.0 proxies which don't handle properly
downgrading to HTTP/1.0 when communicating with server which uses newer HTTP
protocol version (this causes bug when trying to use persistent connections)
* more paranoia checking of reading/writing sockets in HTTP code
* automatic request repeat after premature closure of persistent HTTP
connection
* added support for robots excluding with <META NAME="robots" content="...">
(thanks to Markus Mayer)
* fixed compilation bug with OpenSSL-0.9.6 because of new MD4 implementation
in this OpenSSL version (thanks to Le Faucheur Frederic)
* fixed bug in new html parsing engine which fails to parse properly rest of
document after <script>...</script>
* added support for HTTP/1.0 Keep-Alive proxy connections
* ---------- released testing version 0.9pl27f
* added install script for NSIS win32 installer
* fixed compilation bugs when building without GUI
* portability fixes to QNX RtP
* updated auth info edit dialog for NTLM support
* fixed possible MT race condition in gopher directory parsing routine
* fixed confusion of ftp code with -remove_old & -ftplist when in sync mode
files that disappeared from server were processed like directories, which
failed (thanks to galanga)
* ported to BeOS 5 PE (works fine except file locking)
* added support for javascript parsing in javascript:... URLs inside any
supported HTML attribute
* fixed ftp directory listing when using active ftp data connections
* added option -follow_cmd which allows you to execute some script which
can decide if pavuk should follow links from current document (thanks to
Georg Rehm and hashao)
* adjusted establishment of active ftp data connections to be able to handle
properly states, when server is unable or doesn't want to connect before
sending response
* leading/trailing spaces are removed from attributes before processing it
as URL to support broken sites ...
* ---------- released testing version 0.9pl27g
* fixed segfault when Location: contains relative URL after redirect
* fixed broken timestamping of HTML files in sync mode (thanks to Le Faucheur
Frederic)
* fixed segfault on broken HTML tags with leading spaces and unclosed quotes
* if -store_info is active also rejected URLs contain stored MIME header
(thanks to Georg Rehm)
* don't apply limiting conditions (minsize/maxsize/mimet) on robots.txt
documents
* fixed segfault when -norelocate option is activated (thanks to Markus Mayer)
* added O_BINARY to several open calls to prevent possible problems on Win32
* added new options -retrieve_symlink/-noretrieve_symlink to enable
downloading of symbolic links from FTP server as regular files (thanks to
Petr Cech & Andras Korn)
* fixed segfault in robots info cleanup code
* implemented new -js_transform option to allow bit more powerful support
for js patterns. No rewriting supported now (thanks to Mark D. Anderson)
* fixed problems when compiling with PCRE support
* ---------- released testing version 0.9pl27h
* fixed segfault on broken meta refresh tag (thanks to Georg Rehm)
* fixed bug in removing of trailing spaces from URLs (thanks to Le Faucheur
Frederic)
* added support for access authorization to FTP proxy server (thanks to Beno
Kardel)
* added GUI config for -js_transform option
* fixed bug in processing javascript bodies enclosed between <script></script>,
which causes breaking of ending </script> tag
* -js_pattern patterns without substrings are now omitted
* fixed broken behaviour when pavuk receives an empty response while regetting
a file: it processed it as a proper HTTP/0.9 response and stopped regetting
the file (thanks to Christian Axbrink)
* simplified those horrible dialogs for adding preferred languages, charsets
and mime types
* added new debug level "limits" for debugging limiting conditions
* updated manual page
* fixed deadlock on closing log file
* ---------- released testing version 0.9pl27i
* updated Czech message catalog (thanks to Petr Cech)
* added initialization of GTK locales
* added possibility to generate message catalogs in UTF-8 encoding for
use with future versions of GTK+
* fixed problems with switching language multiple times in GUI window
* updated documentation
* updated German message catalog (thanks to Colin Marquardt)
* fixed retrieving of URLs from selection and via DND to omit illegal CRLF
characters (thanks to Aleksander Adamowski)
* adjusted win32 installer script to support installing message catalogs
* added support for setting message catalog path on WIN32 to install directory
* better handling of WIN32 paths in GUI
* added window icon to WIN32 version
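
The 0.9pl27 base64 fix concerned input bytes above 127. A common cause of
that kind of failure is reading the input through plain char, which may be
signed, so such bytes become negative before the shifts. Whether that was the
exact cause in pavuk is an assumption. The sketch below is a hypothetical
encoder (b64_encode() is not pavuk's routine) that avoids the problem by
reading the input as unsigned char.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static const char b64_alphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/*
 * Hypothetical encoder (not pavuk's routine): encode len bytes from src
 * into dst as a NUL-terminated base64 string.
 * dst must have room for 4 * ((len + 2) / 3) + 1 bytes.
 */
static void b64_encode(const void *src, size_t len, char *dst)
{
    /* unsigned char keeps bytes above 127 positive for the shifts below */
    const unsigned char *in = src;
    size_t i;

    assert(src != NULL && dst != NULL);

    /* full 3-byte groups become 4 output characters each */
    for (i = 0; i + 2 < len; i += 3) {
        *dst++ = b64_alphabet[in[i] >> 2];
        *dst++ = b64_alphabet[((in[i] & 0x03) << 4) | (in[i + 1] >> 4)];
        *dst++ = b64_alphabet[((in[i + 1] & 0x0f) << 2) | (in[i + 2] >> 6)];
        *dst++ = b64_alphabet[in[i + 2] & 0x3f];
    }

    /* one or two trailing bytes are padded with '=' */
    if (i < len) {
        int have_second = (i + 1 < len);
        unsigned b1 = have_second ? in[i + 1] : 0;

        *dst++ = b64_alphabet[in[i] >> 2];
        *dst++ = b64_alphabet[((in[i] & 0x03) << 4) | (b1 >> 4)];
        *dst++ = have_second ? b64_alphabet[(b1 & 0x0f) << 2] : '=';
        *dst++ = '=';
    }
    *dst = '\0';
}
```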
version 0.9pl28 (Aug ?? 2001)
---------------
* added new option (-limit_inlines/-dont_limit_inlines) to disable checking
of limiting options for inline objects (thanks to Olivier Sirol)
* fixed bug with special characters in filenames on FTP servers (thanks to
Jo? GRONDIN), same for Gopher directories
* FTP directory listings are now transferred in ASCII mode (thanks to Jo?
GRONDIN)
* removed MT race condition in calling inet_ntoa()
* added new option -ftp_list_options to allow passing options to FTP LIST/NLST
commands
* support for multiple WWW-Authenticate: and Proxy-Authenticate: in HTTP
response (thanks to Monika Nowotnik)
* ported to AtheOS
* fixed improperly handled rewriting of links in HTML documents pointing to
itself (thanks to Nicolay Mausz)
* added new function (getval) to -fnrules option extended syntax rule for
getting values of query parameters of URL (thanks to Nicolay Mausz)
* added initialization of OpenSSL PRNG randomizer to prevent message
"PRNG not seeded" on some platforms (thanks to Albert Chin)
* ---------- released testing version 0.9pl28a
* compilation fixes for nongcc compilers and bigendian architectures (thanks
to Albert Chin)
* fixed segfault which always occurred when an unknown long option was used
* added forgotten gdk options to option table
* fixed compilation without NTLM support enabled (thanks to Georg Rehm)
* added option --disable-ntlm to configure script to be able to compile
pavuk without NTLM authorization support (thanks to Albert Chin)
* fixed segfault which occurs when closing Common config dialog (thanks to
Georg Rehm)
* fixed all non-working options using regular patterns when pavuk is compiled
as multithreaded program (thanks to Mirko)
* fixed NTLM implementation to be able to work properly on bigendian machines,
with non GCC compilers and on 64bit platforms
* fixed leaking of file descriptors after "File redirect" when a persistent
connection was opened before
* improved URL queue handling and downloading threads management
* changed internally handling of filename assignments (not well tested yet,
can cause instability or deadlocks in MT)
* fixed segfault when no URL is specified in -request or -formdata options
(thanks to Andrew Price)
* fixed segfault when using -formdata option caused by freeing already freed
memory chunk (thanks to Andrew Price)
* removed several minor memory leaks
* added checking of BerkeleyDB implementation in libc in configure script
* updated French message catalog (thanks to Le Faucheur Frederic and
Pascal Adoux)
* added new option -fix_wuftpd, to fix broken wuftpd behaviour, when it
doesn't raise error when listing not existing directory (thanks to
Jo? GRONDIN)
* ---------- released testing version 0.9pl28b
* added new option -post_update/-nopost_update to force pavuk's URL updating
engine to update in parent documents only the URL currently downloaded
* %o macro is supported now also in simple -fnrules macros
* added two new macros to -fnrules option - %M == mime type of document,
%E == standard extension of document MIME type. These two new macros work
properly only when used with -post_update options. (thanks to Majkel
Kretschmar)
* in sync mode, links from the directory scan (if -subdir was specified) are
now processed first, and only then the other links.
* added two new functions to -fnrules option rules (getext - gets extension
from path , seq - string equal)
* fixed scheduling, broken by changes to support long options
* fixed commandline parser, so it again supports --long-opt=val style of
options
* using mkstemp instead of tmpnam when available (thanks to Frédéric
L. W. Meunier)
* type icons in tree view were replaced with smaller icons
* new option -info_dir which allows you to store pavuk_info files outside
of document tree
* fixed bug where, after reget of a document, unnecessary documents were also
loaded to memory; this could cause out of memory situations with big
documents (thanks to Jinghua Liu)
* added new option -js_transform2 which has a similar function to -js_transform
but also allows rewriting of matched URLs. This is also very suitable for
adding tags/attributes which are not supported by pavuk by default.
* added forgotten handling of GUI configuration of -js_transform option
* new faster growing hash function to allow bigger size hashes when downloading
huge amount of documents
* ---------- released testing version 0.9pl28c
* fixed resources leaking after reopening of netscape cache index
* better handling of netscape cache index file after modifying with some
other program
* added support for loading files from mozilla browser cache directory
* fixed broken saving of document infos for rejected files (thanks to Georg
Rehm)
* changed a bit logic of lists when cleaning lists and deleting fields (thanks
to Marco Strack)
* implemented new options -aport/-dport to allow/deny downloading of documents
from servers at specified ports (thanks to Georg Rehm)
* fixed bug in handling patterns in GUI (thanks to Georg Rehm)
* added to configure script checking of POSIX regex in libregex (as on recent
cygwin versions)
* fixed compilation of MT version (thanks to Jeremy P. Campbell)
* ---------- released testing version 0.9pl28d
* fixed problems with -preserve_time on win2000 (thanks to Andreas Schiling)
* added new option -hack_add_index/-nohack_add_index useful for more extensive
site mirroring: for each URL taken from HTML documents, the directory
of the document is also added to the queue (thanks to stvictor)
* better handling of unsafe characters in HTTP requests
* updated manual page
* after unexpected error while regetting, the .in_ file now will be always
preserved
* ftp directories are not inserted into queue twice when doing directory
based synchronization (thanks to Jo? GRONDIN)
* no more problems with duplicating FTP directory indexes in sync mode
(thanks to Jo? GRONDIN)
* on error in scenario file pavuk now exits with error instead of continuing
(thanks to Jo? GRONDIN)
* when processing symlink from FTP server which points to directory, pavuk
will make link to directory not to directory index file (thanks to Jo?
GRONDIN)
* if HTTP server sends Content-Length: in response and option -check_size is
active, then pavuk now reads exactly this size without waiting on
connection close even when not using persistent connections (thanks to
Glen Stewart)
* ---------- released testing version 0.9pl28e
* fixed SSL library detection on SYSV systems with libsock (thanks to Eun-Mok)
* added new option -default_prefix to simplify mirroring when -base_level
option is used
* -max_time option now allows specifying subminute times
* in GUI it is now possible to enter subminute communication timeout
* added right button menu to log widget
* ---------- released testing version 0.9pl28f
* new function "ud" for -fnrules option used for decoding URL encoded
strings (thanks to Tony Gale)
* applied patch from Albert Chin
- new -egd_socket <path> command-line option
- new --egd-socket=<path> autoconf option to provide a hard-coded
compile-time path for the EGD socket
- use RAND_file_name to get the pathname of the EGD socket if RANDFILE
env variable is set instead of RAND_EGD_SOCKET_PATH env variable
- new --with-zlib-includes=DIR and --with-zlib-libraries=DIR autoconf
options to specify location of zlib library
(many thanks to Albert Chin)
* fixed bug in URL rewriting engine (thanks to Nicolay Mausz)
* fixed broken -mode reminder (thanks to Andrea Tasso)
* fixed bug in parsing ftp URLs with transfer type specified (thanks to
Richard Ems)
* replaced old config.sub, config.guess files with new versions from
automake-2.50 and adapted for atheos (thanks to Petr Cech)
* in -formdata and -request options it is now possible to specify requests
without any field entered (thanks to Dima Nemchenko)
* fixed broken behaviour of -limit_inlines/-dont_limit_inlines option
* fixed sync mode with mirrors with changed layout of local tree
* rewritten limiting conditions checking engine
* ---------- released testing version 0.9pl28g
* fixed msgfmt detection in configure script (thanks to Richard Ems)
* fixed compilation without SSL support (thanks to Richard Ems)
* updated Spanish Message catalog for 0.9pl27 (thanks to Francisco Javier
Comer? Gayoso)
* rewritten limiting conditions checking engine again
* implemented JavaScript bindings to enable users to use more flexible
conditions for excluding URLs from download (new option -js_script_file)
* implemented new function "jsf" for -fnrules option which allows execution
of JavaScript functions by name
* ---------- released testing version 0.9pl28h
* implemented JavaScript console dialog
* fixed segfault which always occurred after unexpected HTTP response when
regetting files (thanks to ha shao)
* implemented workaround for ftp servers which understand REST command but
always restart from scratch (greeting MS :-)) (thanks to Raun Nohavitza)
* exported new attribute of url in Javascript bindings (html_tag) which holds
source HTML tag of particular URL when level == 0
* new method "get_sub" of PavukFnrules class in JS bindings for getting
subpatterns from -fnrules patterns
* more enhancements for JS bindings classes
* fixed hangup in http_throw_message_body()
* fixed possible race condition when using url_set_path()
* added new option -ftp_login_handshake to enable customizing of FTP server
login procedure (thanks to Marko Daris)
* added new option -rsleep for randomizing sleep time between transfers in
interval 0 -> -sleep (thanks to Christian Canella)
* added new Japanese message catalog by SATO Satoru (thanks)
* ---------- released testing version 0.9pl28i
* rewrote detection of BerkeleyDB 1.8x in configure script
* updated French message catalog (thanks to Frederic Le Faucher)
* fixed compilation with Gtk+-1.0
* applied IRIX portability patch from Albert Chin (thanks)
* fixed compilation on newest version of cygwin (thanks to Pablo Blasco)
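
Note on the "ud" function for -fnrules listed above: URL decoding replaces
each %XX escape with the byte it encodes. The following is a minimal sketch
of that general technique in C. url_decode() and hex_value() are hypothetical
names written for illustration and are not taken from the pavuk sources;
leaving '+' and malformed escapes untouched is an assumption.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not pavuk code): value of one hex digit, or -1. */
static int hex_value(unsigned char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    c = (unsigned char)tolower(c);
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    return -1;
}

/* Decode %XX escapes from src into dst. At most dstsize bytes are written
   and dst is always NUL-terminated. Malformed escapes are copied unchanged.
   Returns the length of the decoded string. */
size_t url_decode(const char *src, char *dst, size_t dstsize)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL && dstsize > 0);
    while (*src != '\0' && n + 1 < dstsize) {
        int hi, lo;

        if (src[0] == '%' &&
            (hi = hex_value((unsigned char)src[1])) >= 0 &&
            (lo = hex_value((unsigned char)src[2])) >= 0) {
            dst[n++] = (char)(hi * 16 + lo);
            src += 3;
        } else {
            dst[n++] = *src++;
        }
    }
    dst[n] = '\0';
    return n;
}
```

With this sketch, decoding "a%20b%2Fc" into a large enough buffer gives the
five-character string "a b/c".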
version 0.9pl29 (??? ?? 2001)
---------------
* redesigned SSL implementation
* FTPS now works perfectly over proxy
* added support for Netscape NSS as replacement for OpenSSL SSL layer
* fixed detection of BerkeleyDB 1.8x header files
* ---------- released testing version 0.9pl29a
* FTP active mode now uses address from getsockname() instead of gethostname()
* applied changes made between 0.9pl28i and 0.9pl28
* added IPv6 networking support including FTP support by RFC1639 and RFC2428
* added new options -dont_touch_url_pattern -dont_touch_url_rpattern to deny
download and rewrite of particular URLs in HTML tags
* added clause into COPYING file about linking pavuk with OpenSSL
* applied part of patch from Albert Chin (thanks!)
- fixed prototype declarations in htmlparser code
- added include arpa/inet.h in http_proxy.c
- fixed declarations in html_proxy.c, ftp.c to make some compilers happy
- rewrite of configure script to use config file instead of horrible looking
infinite compilation commandlines
* updated win32 installer config file - pavuk.nsi
* added README.win32 file to source distribution
* added new Polish message catalog by Przemyslaw Sulek (thanks!)
* ---------- released testing version 0.9pl29b
* updated Czech message catalog by Petr Cech (thanks!)
* updated Polish message catalog by Przemyslaw Sulek (thanks!)
* added new Ukrainian message catalog by Dmytro O. Redchuk (thanks!)
* fixed typing of variables in ntl_auth.c (thanks to Petr Cech)
* added new options -dont_touch_tag_rpattern to deny download and rewrite
of URLs in particular HTML tags
* dropped GTK+-1.0.x GUI support (sorry)
* fixed switching of languages in GUI
* much better handling of GoBg function ... GUI cleanup is now done mostly
immediately without waiting for the transfer to end
* added two new properties to PavukUrl class in JS bindings to allow writing
of content based limiting options (both are defined when level == 0)
- .html_doc - full content of parent document of URL
- .html_doc_offset - offset of current HTML tag in parent document of URL
* fixed compilation without IPv6 support
* po/Makefile now uses generated list of catalogs instead of hand written
* LINGUAS support in configure script
* fixed initialization NSS library after config changes
* -unique_sslid option is now supported when using NSS as SSL library
* new options -nss_domestic_policy/-nss_export_policy to allow selection
of SSL ciphers suites in NSS for U.S. Domestic or for Export ciphers
* support for libmcrypt/libgcrypt DES in ntlm code to allow not to use
non GPL compatible libcrypto from OpenSSL
* removed default javascript patterns
* applied patch from Harald Forster (thanks!)
- fixed *printf formats for shorts and chars
- fixed handling of va_list in xvaprintf
- fixed bugs in dllist* routines
* using safe vsnprintf instead of vsprintf when available
* fixed IPv6 support to work with FreeBSD
* replaced libc ctype functions with own functions to be not dependent on
proper locale
* fixed get_1qstr() to not fail when last char is \ (back slash)
* added GUI config for -dont_touch* options
* added support for no EOL closed strings in gui_xprint()
* hopefully fixed the reason for crashing randomly in multithreaded mode
caused by trashing URL structure temporarily linked inside hash tables
(thanks to all who reported the MT crashes)
* when loading preferences (-prefs) always reset config to prevent adding
cumulative options loaded from rc files (thanks to Harald Forster)
* loading scenario and resetting config from GUI no longer leak memory
* updated Japanese message catalog (thanks to Sato Satoru)
* fixed bug with -max_time option causing total pavuk confusion (thanks to
Gema Pizana)
* config changes entered but not yet applied in the common and limits dialogs
no longer disappear when trying to pop up already visible dialogs (thanks to
Harald Forster)
* ---------- released testing version 0.9pl29c
* added new spec file for building multiple RPMS of pavuk with different
configurations (thanks to Rami El-Charif)
* added support for new format of Mozilla cache
* implemented new options -tag_pattern & -tag_rpattern to allow precise
matching of URLs inside HTML tags based on matching of HTML tag, HTML
attribute and URL patterns (thanks to Huaxin Wang)
* updated man page
* updated Slovak message catalog
* switched to use newer autoconf(2.50) & automake(1.4-p4) versions
* processing of HTML files downloaded over gopher is now supported
* retry for document transfer is now performed always when it is
clever to do so. Increased default number of retries to 2.
* fixed storing of local name for URL into scenario (thanks to Stephen Sweigart)
* when you specify the LNAME: field in -formdata specification, it
will be used as the local name for the request
* !!! changed exit values of pavuk process. Now 0 means everything was OK,
1 means configuration error and 2 means that there were some problems
with some documents
* MacOSX portability fixes
* fixed routine for adding starting URLs to allow entering file: URLs
* fixed segfaulting in cookie expiration code (thanks to Mark D. Anderson)
* fixed compilation with disabled regexp support
* fixed segfaulting when using -asite/-dsite/-adomain/-ddomain options and
file: URL appears in the html documents (thanks to farquat)
* ---------- released testing version 0.9pl29d
* several random fixes
* fixed NTLM nonce decoding
* fixed getval and rmpar functions of fnrules option (thanks to Alexey Morozov)
* fixed ssl_write functions which sometimes hang printing error message forever
(thanks to Robert Dobozy)
* fixed optional sending of WC/ASCII type of NTLM T3 messages
* ---------- released testing version 0.9pl29e
* Netli measurement code added
* SSL rewrite
* First Sourceforge release
* ---------- released testing version 0.9pl29f (Jun 3 2003)
* bug fixes
* ---------- released testing version 0.9pl30a (Jul 12 2003)
* fixed build system for translation files
* added -referer/-noreferer option
* updated German translations
* updated autoconf stuff and rebuilt all Makefiles using autoconf 2.57
* included lots of updates and newer files floating around in the net
* ---------- released testing version 0.9pl30b (2004-07-05)
* fixed buffer overflow (BUG #984898)
* added AREA tag onClick event in htmltags.c to make javascript work.
* added a number of mimetype extensions to mimetype.h
* fixed OPTION element default value for certain common case.
* made POST the default form method in get_data_socket()
* fixed buffer overflows in digest authentication code
* fixed crash for META-Refresh URLs
* introduced new source-code design to get rid of tabs:
indent --no-space-after-function-call-names
--no-space-after-parentheses
--dont-break-procedure-type
--no-space-after-for
--no-space-after-if
--no-space-after-while
--no-tabs
--brace-indent0
--dont-line-up-parentheses
or
indent -npcs -nprs -npsl -nsaf -nsai -nsaw -nut -bli0 -nlp
Sources will be modified step by step. Care is necessary, as indent fails
on the MT-macros! A new target reindent in source reindents the whole sources
directory.
* migrated large parts of code to ANSI-C, fixed lots of warning messages
* added --disabled-gtk2 option to autoconfig and GTK_FACE define now holds the
GTK version number, some fixes are necessary for GTK2
* security fixes preventing possible buffer overflows
* cleanup of build system
* fixed wrong name building (BUG #1012746)
* ---------- released version 0.9.31 (2004-11-08)
* security fixes preventing possible buffer overflows
* cleanup build (language installation works again)
* added more const statements all over the source
* fixed HTML entity decoding error (thanks Michal Toma for the report)
* compiles with GTK2, but still brings run-time warnings (BUG #1068224)
* fixed handling of local anchors (<A HREF="#link">)
* fixed handling of path separators in search strings (BUG #1064453)
* read support for KDE2 cookie file (~/.kde/share/apps/kcookiejar/cookies)
* Added --enable-utf-8 option to configure, which produces all locale files
in UTF-8 encoding.
* ---------- released version 0.9.32 (2005-03-17)
* slovak locale updated
* dont_leave_site condition no longer differentiates between protocols (HTTP,
HTTPS, ...)
* fixed bug in case there are quoting characters inside a quoted string
* fixed strange URLs in the form <a href="?...."> to use the parent document
instead of no document
* security patches
* fixed .pavukrc error (BUG #1247202)
* ---------- released version 0.9.33 (2005-09-27)
* fixed 64bit problems (BUG #1226863)
* updated German locale, fixes done by Debian developers (Hey, please inform
us about errors. Scanning the net and all distributions for possible fixes
is not very helpful.)
* ---------- released version 0.9.34 (2006-01-09)
* security fixes
* some minor bug fixes
* reworked build system a lot, fixed RPM spec file
* now builds fine using most of the possibilities pavuk provides
* RPM builds on openSUSE build service for SUSE since version 9.3, Fedora
since version 4 and Mandriva since version 2006
* RPM packages can be found here:
http://software.opensuse.org/download/home:/dstoecker/
* ---------- released version 0.9.35 (2007-02-21)
* added -persistent/-nopersistent option
2007-april-30 [notes taken from old work back in 2005/2006 merged into pavuk mainstream source tree]
* bufio has seen a MAJOR overhaul. It is now capable of pushing text &
binary data to the file system at unprecedented rates. This is done by
adding a variable sized (and possibly large) memory cache, resulting in
large size I/O operations. These perform very much faster than the regular
RTL I/O calls. (tested on quad CPU UNIX Dell servers)
the new bufio was required as I needed to log/track a huge amount of data
in the shortest possible time / lowest possible CPU load.
* cookie handling has been fixed/augmented. pavuk can now have the initial
cookie values that go with a certain web request preconfigured on the
commandline. Also, several bugs in handling the cookies have been fixed.
(tested on a wicked ASP.NET intranet site which 'assumed' the use of a
special web client (a TV set top box) which would transmit its serial #
as a client-side created(!) cookie to the web server. This site/client
combo thus actually transmitted cookies which would first show up in a web
_request_ instead of the usual: a server-side _response_.)
* several portability items have been changed (h_errno, ...) to make the
code compile and work on the odd-flavored UNIX box. A native Win32 port is
under way: it now works, including zlib and OpenSSL, though the latter has
not been tested recently.
Note that the changes may have broken GTK support, as I was not able to
build the code with GTK on my UNIX boxes.
* socket I/O (IP traffic) has been fixed to properly cope with user breaks
(a user hitting Ctrl+C). Several locations in the software where the
unexpected signal would cause an infinite loop have been identified and
fixed.
* added several lines of DEBUG_xxx to aid both developer and user in
tracking down hard to diagnose issues inside pavuk while scanning a site.
* Accept-Encoding (more specifically: the handling of
x-gzip/gzip/x-compress/compress encoding) has been changed to allow for
better portability: data is expanded in-memory, without the need for an
external 'gzip' tool and/or OS-specific forks & pipes.
(Win32 wouldn't know a fork if ever it saw one.)
* ALL stdio is now handled through the new bufio system. This not only
improves performance when you've got -debug and -debug_level dialed all the
way up, but also corrects several spots where, depending on your C RTL,
stdout/stderr traffic would arrive at different moments on your console
(some of it was written through the FILE I/O, some through direct I/O,
causing blurbs of output to pass one another along the way to the actual
console).
* buffer overrun protection has been improved. Note also that every
snprintf() and derivative thereof is now 'augmented' by an additional line
of code which ensures that the last character in the buffer is guaranteed
to be a NUL sentinel, thus ensuring that the buffer will always present
data in correct C string format (NUL-terminated). (This is an old habit of
mine as some C RTLs have shown to be kinda flaky on the subject of NUL
sentinels when snprintf() et al are writing data up to the edge of their
output buffers: some C RTLs 'forget' to put a NUL there under particular
circumstances (some commercial Watcom compiler releases come to mind).)
* multithreading pavuk has been tested on a high perf MP UNIX box and it
was like the documentation/notes state somewhere: unstable. The thread
interlocking has now been fixed; one of the hardest to fix proved to be
the lockup at the end of a pavuk run. The fix also includes the use of
semaphores and some additional code changes to make the code thread safe;
critical sections are now handled as such. This includes placing several
non-threadsafe C RTL calls (e.g. ctime()) inside critical sections!
* auto-form-filling (the feature which led me to select pavuk over wget et
al when I started the hammer/chunky project) has been fixed for those
special pages where you have an empty form to submit: the site I had to
test included such a form, which was submitted using javascript, but did
not contain _any_ input fields (but cookies were expected to come with
that request, thank you). Before, pavuk crashed on such a page. This has
now been fixed.
* added a 'reindent' target to the makefile, using GNU indent to reformat
the code. (When you're working several weeks on end in crunch time, you
want to see some proper and consistent looking source code, even when you
just made it a mess yourself...)
Also extended the cleanup makefile target to help me in cleaning up any
backup and/or temporary files created by vi and some log diagnostic
scripts.
[edit may/2007: wasn't this already in the makefiles before - see
ChangeLog entry in 2003?]
* added several commandline parameter types, which allow you to instruct
pavuk to use OS file handles or file names for logging activity, while you
can now also specify whether a log file should be overwritten (default) or
appended to (new feature) by adding another '@' prefix to the file path.
TODO: document this properly.
* added hammer/chunky modes: several ways to scan a web site and then
rescan it. The higher (later) hammer mode has been specifically written to
use pavuk as a 'replay attack' based DoS tool for testing high performance
web servers. (bufio was overhauled to allow us to log all I/O data +
diagnostics to disc while hammering the server while the pavuk system
_must_ perform better (= faster) than the web server when running both on
equivalent hardware.)
* The native Win32 port has been overhauled (previous code was never
released to the public) to make sure I did not have to look for OS-specific
path elements _everywhere_ in the code (it was becoming a code-wise
maintenance nightmare while fixing up/down all those 'absolute path' and
'path expansion' code sections to handle Win32 drive letters (root is
'[A-Z]:[\\/]' instead of simply '/')).
This has been fixed by using the cygwin 'path hack' for the native Win32
port too: root is '/cygdrive/[a-z]/' so it looks exactly like a UNIX path.
Any places in the code which need to address the OS while passing an
OS-specific path are now handled almost invisibly: all relevant C RTL calls
(fopen/open/stat/lstat/symlink/link/unlink/rename/mkdir/rmdir/opendir) are
now encapsulated in tl_[sysname] wrapper functions where these
/cygdrive/[x]/ paths are converted back to native Win32 paths before the
actual C RTL function is called. Also any debug/print statement, which is
used to report a file path, is fixed to convert file paths to the native
representation with a minimum of fuss: see the new tl_native() call for a
description how this was done. This code has not been tested in a UNIX/MP
environment, but the design is such that this should not cause any trouble
(pthread port for Win32 is in progress ATM).
* added -debug_level modes: all/trace/dev/bufio/cookie/htmlform. Also added
a feature where you can now specify a set of debug levels and have some of
those levels _removed_, e.g. 'all,!dev' will show anything _except_ 'dev'
level debug output: note the new '!' prefix.
* -debug_level output is now prefixed with its level in caps and square
brackets, e.g. '[PROCE]' to aid in filtering the debug output (for
instance by piping it through sed/grep).
* unified debug output handling in the code: -debug_levels are now only
active when you specify -debug too.
* inflate_decode() and gzip_decode() have been fixed to suit a multithreaded
environment. gzip_decode() now has an in-memory implementation, using the
zlib library, for those systems which do not support UNIX pipes/forks.
* Fixed deflate/compress handling: the MJF Accept-Encoding deflate hack has
been removed and the request header extended. (tested on a Wikipedia
HTTP/1.1 compliant server)
You may wish to permanently disable the code within
in decode.c if you do not wish to depend on the external gzip tool any
more.
* _all_ system header file #include's have been removed from the sources and
integrated into config.h to allow for better portable source code.
config.h.in and autoconf.am have been extended to include several more OS-
dependent system call and header file checks.
A separate native Win32 version of the header file is also provided (used
by the MSVC2005 native Win32 build).
* several hardcoded buffer sizes in the software have been made configurable
(but remain hardcoded). See for instance dinfo.c: 12 -->
PAVUK_INFO_DIRNAME and 1024-and-other-fixed-buf-sizes -->
BUFIO_ADVISED_READLN_BUFSIZE
* fixed several cases where dangling (i.e. free()d but not NULL-ed) pointers
caused havoc. Code has been quickly reviewed to locate and fix additional
spots that did not yet cause pavuk to go 'crazy Ivan' (Hunt for the Red
October, anyone? ;-) )
* hardcoded lock filenames have been converted to #define's to allow these
to be changed in a single spot (config.h), improving portability. e.g.:
'._lock' --> PAVUK_LOCK_FILENAME
* UNIX-specific octal privs have been changed to their proper #define's to
allow for maximum portability (Win32 doesn't know '0644' but can cope with
S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH
though maybe in an odd way).
* fixed quite a few spots where an unidentified form encoding method would
lead to _very_ unstable behaviour, including crashes/core dumps. Look for
fi->method = FORM_M_UNKNOWN
assignments and additional FORM_M_UNKNOWN checks.
* added -no_dns support for those who have to work in an environment with
flaky or no DNS support (I had to as I was working on a box in a specially
configured, partially walled-off DMZ zone while developing and testing
pavuk against a web server.)
* fixed typos in the text as I came across them.
* the bufio overhaul also led to an overhaul of the -dumpxxx code,
removing/fixing several spots in the code which caused incorrect/unstable
behaviour. (e.g. code in doc.c)
* Fixed handling of compressed data for any text-based server response;
pavuk now correctly handles any gzipped/deflated text, including, for
instance, any 'text/javascript' content sent over the wire in compressed
form (tested on a Wikipedia-based HTTP/1.1 compliant server).
* added -progress_mode: several choices in progress verbosity.
* added -no_disc_io: test a grab/scan without writing anything to disc.
Mostly useful in combination with the earlier -hammer modes.
* fixed/updated HTTP error response handling in accordance with RFC2616 so I
can better see what a HTTP/1.1 compliant target is reporting back to
pavuk. (errcode.c et al)
* unified timing units to fix a few timing oddities: instead of minutes,
etc. the code uses seconds everywhere (apart, of course, from the few
locations where we use milliseconds ;-) )
-timeout is now in milliseconds!
* Added -rtimeout and -wtimeout command line parameters.
(unit: milliseconds)
* added -allow_persistent / -noallow_persistent commandline arguments to
allow/disallow the use of HTTP/1.1 persistent connections.
* added -dumpcmd and -dumpdir commandline arguments.
* added -bad_content commandline argument for use with the hammer/chunky
modes.
* added -report_url_on_err commandline argument: report the URL which was
processed while the error occurred.
* added -test_id commandline argument: this is included in the timing report
so reports can be better automatically processed / combined.
* added -page_sfx commandline argument to help pavuk identify what suffixes
are to be considered web pages (useful for scanning ASP and ASP.NET sites
which present unusual mime types with their pages).
* added -tlogfile4sum commandline argument: specify a log file where timing
info is stored. Handy when pavuk is not only used to grab the info off a
site but also scan & report site performance.
* added -encode commandline parameter as the counterpart of -noencode.
* added -nohtDig, -noquiet and -noverbose commandline parameters as
counterparts of -htDig, -quiet and -verbose respectively.
* added filepath support to -dumpfd and -dump_urlfd: by specifying the
option prefixed with a '@' character, pavuk will treat the option value as
filepath specification instead of an OS file handle and subsequently open
the specific file internally. Note that adding yet another '@' character
as a prefix signals pavuk to _append_ to the specified file, instead of
_overwriting_ it.
This is useful when you wish to have those dumps but are working in an
environment where you cannot pass valid file handles through the
commandline.
* added -dump_request and -nodump_request commandline arguments for use
with -dumpfd: when -dump_request is specified, the log file will include
a complete dump of each request sent to the server by pavuk. Thus you can
produce a complete audit trail of the exchange.
* replaced the DUMP_URLLIST macros in stats.c by two functions. Code is a
bit cleaner that way.
* fixed times.c which barfed on timestamps beyond 2037 (signed int wrap
around for time_t).
* added assert() checks at several locations in the code to help track down
unexpected behaviour which could lead to crashes (like it did till now).
* unified the proliferation of HEX2ASC-alike macros with and without off-by-
one offsets inside. Now there's one macro for each of 'em in tools.h.
* changed the configure.in option to --disable-threads to keep the pattern
consistent (--disable-xxx series of options in configure), but the default
behaviour remains the same.
* configure.in: as --disable-debug removes any debug-_related_ features from
the pavuk build, these options have been added: --disable-debugging will
create a default build with all debugging removed from the compiled
binaries. --disable-prof and --disable-gprof have been added to remove any
profile info from the default compiled binaries.
* added checks in configure.in for socklen_t, pid_t and a bunch of system
calls and header files that do not live in each environment.
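
Note on the buffer overrun protection entry above: the "extra line after
every snprintf()" habit can be sketched as a small wrapper. This is only an
illustration of the pattern described there, assuming a C99-style
vsnprintf(); the name safe_snprintf() is hypothetical and does not come from
the pavuk sources.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical wrapper (not pavuk code): format into buf and guarantee
   that the result is a NUL-terminated C string, whatever the C RTL did. */
int safe_snprintf(char *buf, size_t size, const char *fmt, ...)
{
    va_list ap;
    int n;

    assert(buf != NULL && size > 0 && fmt != NULL);
    va_start(ap, fmt);
    n = vsnprintf(buf, size, fmt, ap);
    va_end(ap);
    /* The additional line: force the NUL sentinel into the last byte, in
       case the RTL wrote right up to the edge of the buffer without one. */
    buf[size - 1] = '\0';
    return n;
}
```

The return value is passed through unchanged, so callers that care about
truncation still have to compare it against the buffer size themselves.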
2007-may-6
* included pthreads-Win32 based multithreading support in the native Win32
build.
* included EXPERIMENTAL tre (regex) support in the native Win32 build.
* fixed several lurking bugs (buffer overruns, etc.) which only showed in a
multithreaded environment.
* fixed locking bugs in the new bufio implementation.
* added Win32 memory leak + heap checking for the DEBUG build: many memory
leaks have been tracked and fixed. (MSVC <crtdbg.h> based)
* fixed memory leak due to wrong scope in report_error() code.
* added DBGxxx macros to aid heap tracking for the debug build. See
DBGdecl/DBGpass/DBGvars usage.
* removed a very nasty memleak in html_parser_get_url() which would leak at
least 3 blocks for each rejected local anchor URL - and there are quite a
few of those! Took me a day to track it down. :-(
* added filtering so gzipped/compressed files on the server are not
decompressed unintentionally while the server supports Accept-
Encoding:gzip or compress.
( doc_download_helper() in doc.c )
2007-may-11
* renamed function should_leave_persistent() to the more appropriately named
should_keep_persistent()
* Updated 'chunky' source to the state of the latest pavuk CVS contents (as
of today) as this code has not yet been merged into CVS itself.
* fixed bugs in -scenario handling, when scenario files produced by pavuk are
re-used in the Win32 environment
* fixed bugs in path & file type commandline arguments for the native Win32
port.
* fixed bug in retrying/resuming download for RFC2616 (HTTP/1.1) 'chunked'
content download handling.
* merged -allow_persistent / -noallow_persistent commandline arguments with
the equivalent -persistent/-nopersistent feature from the official pavuk
CVS sources.
Also improved the code a bit: added the 'Connection: close' header for
requests over -nopersistent connections, so the server will close the
connection for us.
* added the -ignore_chunk_bug commandline argument to allow pavuk to handle
RFC2616 'chunked' downloads from buggy (IIS) web servers.
( See also:
http://www.subbu.org/weblogs/main/2004/11/persistent_conn.html
http://skrb.org/ietf/http_errata.html#chunk-size
http://www.apps.ietf.org/rfc/rfc2616.html#sec-3.6.1
http://www.jmarshall.com/easy/http/
)
2007-may/june
* recompiled in 64-bit Linux (SuSe 10.2) and fixed a few items in the
Makefile.am, configure.in and ac-config.h.in files. Also added the tests\
and www\ directories to the distro.
* fixed a few 64-bit compile warnings; at least the test cases in tests\
perform OK now on a 64-bit Linux system.
* updated the man page a bit; still a lot more to do. Where is that 'nroff
for dummies' cheatsheet when you need it? ;-(
* listed -use_http11 as 'on' by default now.
* moved MODE_MIRROR unescape code section up in url.c to line 1682 in
url_get_local_name_real() as this code would otherwise have no effect at
all in any environment where the '%' percent character is included in the
FS_UNSAFE_CHARACTERS charset (for example: Win32).
* PARAM_DOUBLE default values are now fixed point values in 'long' integer
format; the current values in the program (all 0.0) are clearly within
range _and_ it 'saves' on compiler warnings quite a bit. (We've still some
way to go before we get anywhere near an '[almost-]zero-warning' cross
platform portable build: a few int to pointer and vice versa casts remain.)
* fixed bug in cfg_get_num_params() which would access uninitialized memory
out there in NirvanaLand when a PARAM_UNSUPPORTED option was passed to
pavuk.
* Fixed configure.in to include 'debug' build handling for KDevelop (which
would pass '--enable-debug=full' to ./configure).
* updated the configure.in script to increase portability (opendir/closedir:
dirent.h et al)
* included a few autoconf macros in the m4 directory for easier/proper
portability support using autoconf et al.
* bugs fixed from BUGS list: multithreaded mode is not as stable as single
threaded (fixed at least for the CLI version of pavuk; the GTK GUI version
is in a rather bad shape)
* bugs fixed from BUGS list: signal handling / timeout does not really work
(at least not in multi threaded downloads; after a SIGINT pavuk just
hangs). This has also been fixed for the CLI version of pavuk at least.
* Win32 port now includes JavaScript support (using the statically linked
Mozilla js library).
* fixed short option definitions in options.h: -tp / -tsp et al
* 'fixed' GUI for Javascript enabled builds (GTK2) - WARNING: it compiles
now, but has NOT been tested, so expect bugs here!
* merged the 'chunky' code with the pavuk main source tree. Now 'chunky' is
equivalent to building pavuk with './configure --enable-hammer'.
* set default from -leave_site to -dont_leave_site to prevent 'blown up' web
crawls when this filter parameter has not been specified.
This change includes a fix for the cfg/command line handling of pavuk for
the conditions section (see condition.h + config.c) as pavuk assumed
sizeof(long)==sizeof(int) in these code sections.
* Now the proper GPL license (GPL, not LGPL) is included in the file
./COPYING.
2007-sep
* fixed processing of zero byte length files (robot.txt at figleaf.com,
etc.): no more crash/assertion failure due to NULLed docu->contents.
* fixed a few memleaks.
* added extra error checking for file rename operations as some issues were
found with the Win32 build when using a SAMBA-shared filesystem for
storing the spidered data/files. (It turned out that the same issues
existed when using native (NTFS, FAT32) filesystems.)
* dialed down the number of default threads from 3 to 1 (see BUGS) to
prevent a hail of (legitimate) rename error reports.
* added flock() implementation for Win32: when built with multithreading
support, having no valid flock() implementation is very dangerous!
* changed configure.in to detect both flock() and fcntl() file locking
mechanisms so pavuk will be able to support writing spidered content to
network shares on both Win32 and UNIX systems: flock() does not support
network shares locks, fcntl() does, at least on the latest Linux kernels,
see man flock(2)
* added error reporting/checking for undesirable use of invalid flock()
implementation. (Useful when porting pavuk to other non-Unix platforms.)
* Fixed content/file size treatment code for items which are already
available locally (i.e. pavuk finds the item at the remote has not changed
from when the last time it fetched the item into local cache).
* Fixed the conditions for when to display certain informational messages:
less screen clutter when not running in '-verbose' mode OR when running in
'-progress' modes.
* Fixed several error/info messages in the code section for decompressing
gzip/compress transmitted HTTP content.
* Fixed handling of gzip/compress transmitted content when retrieved from
local store instead (when pavuk discovers that the file at the remote site
has not changed since the last time it was fetched and stored on your
local disc).
* Fixed a few memleaks.
* Changed the DBGvars/DBGpass/DBGargs macros used for tracing memory
allocations in debug mode to make these macros look more like regular 'C'
functions to 'demented' code formatters and analysis tools. The drawback
is that these still look 'weird' in function prototypes, but that causes
quite a few less errors/warnings than the old style.
* Fixed bugs in get_abs_file_path() directory detection and Win32 abs path
processing.
Also fixed code which produced double slashes in file paths on occasion,
causing trouble on Win32 platforms. (Fix applied generally.)
* Fixed mk_native() allocated string management pool to support printf() et
al where up to 3 mk_native() calls are made in the argument list. This is
important to prevent spurious crashes in multithreaded mode when the worst
case scenario for mk_native() applies: all threads are executing a
printf()-style statement which has multiple calls to mk_native() in the argument
list.
Currently overdimensioned a bit as the actual code only has two
simultaneous calls while the pool now is dimensioned to tolerate 3
simultaneous calls per thread.
* No more _strfindnchr() and strfindnchr(): strfindnchr() - and its use -
has now been fixed to match the (proper working) _strfindnchr().
[fnmatch.c/tools.c et al]
* Fixed const-correctness of several functions.
* Added '-mime_type_file' commandline option to help pavuk support an
up-to-date list of mime types and their filename extensions, using, for example,
the UNIX mime.types(5) config file as a source of MIME type information.
If the user does not specify the '-mime_type_file' option, the original
built-in defaults will be used instead.
This feature has been added to provide better support for the pavuk
-fnrules %M macro: this macro now will use this configuration to produce a
suitable filename extension for each MIME type: the first extension listed
in the '-mime_type_file' config file for the given MIME type will be used
as extension for the %M macro.
* Changed the GTK GUI macros to become functions for ease of debugging. The
added (tiny) call overhead won't be a performance hit anyway.
* Fixed -fnrules handling: the generated path is cleaned up before it is
returned to pavuk for use.
Cleanup actions:
- duplicate '/' slashes are removed
- filenames and directory names which end in a '.' dot, get the dot
removed
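The two cleanup actions above can be sketched in C roughly as follows. This is
an illustration only, not the code in pavuk: sketch_clean_path() is a
hypothetical name, and leaving the special '.' and '..' components untouched is
an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not pavuk's actual code): cleans a path in place by
   collapsing runs of '/' and stripping one trailing '.' from each name. */
static void sketch_clean_path(char *path)
{
    char *out = path;        /* write position */
    char *comp = path;       /* start of the component being written */
    const char *in = path;   /* read position, never behind 'out' */

    assert(path != NULL);
    for (;;) {
        char c = *in++;
        if (c == '/' || c == '\0') {
            size_t len = (size_t)(out - comp);
            int is_dots = (len == 1 && comp[0] == '.') ||
                          (len == 2 && comp[0] == '.' && comp[1] == '.');
            /* assumption: "." and ".." are kept, ordinary names lose the dot */
            if (len > 1 && !is_dots && out[-1] == '.')
                out--;
            if (c == '\0')
                break;
            /* emit a single '/' for any run of slashes */
            if (out == path || out[-1] != '/')
                *out++ = '/';
            comp = out;
        } else {
            *out++ = c;
        }
    }
    *out = '\0';
}
```

With this sketch, 'wiki//page./About.' is rewritten to 'wiki/page/About'.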
* Added '%X' to the -fnrules formatted processing to allow reformatting of
filenames using an optional mimetype-derived extension. This is useful
when grabbing Wiki (MediaWiki et al) sites when you'd like to store the
grabbed content using default mimetype-related filename extensions, so
instead of storing a file like
wiki/page/AboutThisSite
that would transform into
wiki/page/AboutThisSite.html
while pages like
wiki/static_page/contact.htm
would remain as is.
(Note: this might be considered shorthand for a -fnrules (...) expression
which compares both %e and %E. The intent of %X, however, is to only
allow %e extensions to pass which are 'valid' for the given MIME type and
force the %E mimetype based extension for all other cases.)
CAVEAT: %e/%E/%X/%Y will print the extension WITHOUT the leading '.' dot in
both simple mode and extended LISP mode.
* Added '%Y', '%A' and '%B' to the -fnrules macros: '%Y' uses the MIME type
preferred filename extension if the URL/filename doesn't have an extension
yet (while the rather similar '%X' will OVERRIDE the existing extension if
it is not listed with the specified MIME type).
'%B' prints the 'basic MIME type', i.e. the MIME type without the ';'
semicolon separated MIME attributes such as language, etc., while '%A' will
print these extensions (if they were passed to us by the server).
CAVEAT: %e/%E/%X/%Y will print the extension WITHOUT the leading '.' dot in
both simple mode and extended LISP mode.
All this allows for pavuk -fnrules commandline arguments like this:
-fnrules F '*' '%h:%r/%d/%b%s.%Y'
-mime_types_file ./mime.types
-tr_chr_chr ':\\!&=?' '_'
so we'll be able to grab a [Media]Wiki site while storing those pages as
regular 'abc_php_xyz.html', instead of 'abc.php?xyz' page/filenames.
* Added -fnrules 'fnseq' operator to the extended rules: compares a
wildcard pattern and a string a la fnmatch(3).
* Checked and updated manpage for the -fnrules operators (added 'ud' and
'sp' operators to the manpage).
* Added -fnrules 'sn' operator to the extended rules as counterpart of 'ns'.
'sn' uses strtol() to convert a string to a number, while 'ns' uses
printf() to format a number to a string. (See the man page.)
* Updated the man page a bit regarding '-fnrules'.
* sanitized escape_str(); a quick code review led us to a lurking bug in
uconfig.c@309, which has been fixed implicitly.
* Added/updated source code documentation: tools.c/tr.c source code comments.
* Added some sanity checks in the code (tools.c/tr.c/lfname.c)
* Added debug_level 'rules' to allow debugging of both simple and 'extended'
-fnrules expressions and '-fnrules' URL F/R matching.
* Different boxes exhibit different mktime() behaviour, especially when
handling out of range tm value sets. Besides, mktime() works in 'local
time' while some parts of the code require a robust UTC mkgmtime() (not
available on many boxes) --> ripped & introduced as tl_mkgmtime(). A local
time-aware equivalent with excellent out-of-range handling is available as
tl_mktime().
* Added additional error handling around calls which try to parse time
stamps using tl_mkgmtime() and tl_mktime() (times.c).
Basically, now both HTTP and FTP benefit from the new code which should
now process timestamps like the UTC timestamps they are, while 'out of UNIX
time_t bounds' timestamps (beyond the range 1970..2038 A.D.) are handled
in a more sane manner:
- out of bounds timestamps are reported by pavuk
- out of bounds timestamps are then 'sanitized', i.e. restricted to the
1/1/1970..31/12/2037 date range, i.e. a timestamp beyond the horizon,
like '1/4/2051' will be 'sanitized' (= restricted) to the upper bound:
31/12/2037. The same goes for timestamps from antiquity like '11/3/1969' (the
birthday of a certain person), which will be 'sanitized' towards
1/1/1970.
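The UTC conversion plus 'sanitizing' described above can be sketched as
follows. This is NOT the tl_mkgmtime() code itself: the sketch_* names are
hypothetical, the day count uses the well-known 'days from civil' arithmetic,
and the sketch assumes the struct tm fields are already within their normal
ranges (it only restricts the result to the 1/1/1970..31/12/2037 window).

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Days since 1970-01-01 for a proleptic Gregorian date (m: 1..12). */
static int64_t sketch_days_from_civil(int64_t y, int m, int d)
{
    y -= m <= 2;                       /* treat Jan/Feb as end of previous year */
    int64_t era = (y >= 0 ? y : y - 399) / 400;
    int64_t yoe = y - era * 400;                                  /* 0..399 */
    int64_t doy = (153 * (m + (m > 2 ? -3 : 9)) + 2) / 5 + d - 1; /* 0..365 */
    int64_t doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;          /* 0..146096 */
    return era * 146097 + doe - 719468;
}

/* UTC broken-down time -> seconds since the epoch, restricted to 1970..2037.
   *clamped is set to 1 when the input was out of bounds, so the caller can
   report it. */
static int64_t sketch_mkgmtime_clamped(const struct tm *tm, int *clamped)
{
    const int64_t lo = 0;                                   /* 1/1/1970 00:00:00 */
    const int64_t hi =
        sketch_days_from_civil(2038, 1, 1) * 86400 - 1;     /* 31/12/2037 23:59:59 */
    int64_t t;

    assert(tm != NULL && clamped != NULL);
    t = sketch_days_from_civil((int64_t)tm->tm_year + 1900,
                               tm->tm_mon + 1, tm->tm_mday) * 86400
        + tm->tm_hour * 3600 + tm->tm_min * 60 + tm->tm_sec;
    *clamped = (t < lo || t > hi);
    if (t < lo)
        return lo;
    if (t > hi)
        return hi;
    return t;
}
```

No local time zone is consulted anywhere, which is the point of a mkgmtime()
style function.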
* Split up DEBUG into developer related stuff, such as memory/heap checking,
ASSERT/VERIFY, etc. and user related stuff (the -debug and -debug_level
command line arguments): ./configure is now fitted with an extra
parameter:
--enable/disable-debug-features
which will turn on/off -debug/-debug_level user level debugging support in
pavuk, while the existing
--enable/disable-debug
adds/removes additional developer checks, such as heap allocated checks
and ASSERT and VERIFY macros.
In the code, -debug/-debug_level related code is located within the
'HAVE_DEBUG_FEATURES' sections, while the developer debug/release builds
are still related to the standard 'DEBUG' #define.
This now results in three ./configure options that determine the (debug)
feature set of your binary:
--enable/disable-debugging --> compile a binary with source level debug
info included and all optimizations
DISabled for improved debugging (by using
gdb or another debugger of your choice)
--enable/disable-debug --> include/exclude additional run time checks
in your binary. Most important are the
ASSERT and VERIFY pre/post-condition
validation methods located throughout the
code. The use of these is advised, though
these may cause a performance hit.
--enable/disable-debug-features
--> include/exclude user level -debug/-debug_level
command line features, which
help you as a pavuk user to 'debug' pavuk
during the run. Using -debug, pavuk will be
EXTREMELY verbose, which can be toned down
by applying a -debug_level restriction
filter. For example:
-debug -debug_level all,!devel
will be VERY verbose, but will NOT log any
DEVEL level debug info, while:
-debug -debug_level !all,rules
will ONLY produce additional output for the
RULES level, i.e. when pavuk processes
-fnrules and/or JavaScript macros.
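How a -debug_level filter such as 'all,!devel' or '!all,rules' can be applied
left to right is sketched below. The level names are taken from this file, but
the bit values, the SK_* constants and sketch_parse_debug_level() are
hypothetical and do not mirror pavuk's real tables.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical bit assignments; only a few of the levels are listed. */
enum { SK_DEVEL = 1, SK_RULES = 2, SK_NET = 4, SK_ALL = 7 };

static const struct { const char *name; unsigned bits; } sk_levels[] = {
    { "all", SK_ALL }, { "devel", SK_DEVEL },
    { "rules", SK_RULES }, { "net", SK_NET },
};

/* Applies a comma separated spec to *mask: "name" sets the bits of a level,
   "!name" clears them. Returns 0 on success, -1 on an unknown level name. */
static int sketch_parse_debug_level(const char *spec, unsigned *mask)
{
    assert(spec != NULL && mask != NULL);
    while (*spec != '\0') {
        size_t len = strcspn(spec, ",");          /* length of this item */
        int negate = (len > 0 && spec[0] == '!');
        const char *name = spec + negate;
        size_t nlen = len - (size_t)negate;
        size_t i, n = sizeof sk_levels / sizeof sk_levels[0];

        for (i = 0; i < n; i++)
            if (strlen(sk_levels[i].name) == nlen &&
                memcmp(sk_levels[i].name, name, nlen) == 0)
                break;
        if (i == n)
            return -1;
        if (negate)
            *mask &= ~sk_levels[i].bits;
        else
            *mask |= sk_levels[i].bits;
        spec += len;
        if (*spec == ',')
            spec++;
    }
    return 0;
}
```

Because items are applied in order, '!all,rules' first clears everything and
then switches on the RULES level only.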
* Fixed crash when non-RFC compliant website was grabbed: see testcase 7a.
* Added targeted help: when options cannot be parsed correctly,
short_usage() will try to help the user by printing the full help for the
abusing commandline option only. (Of course, I screwed up while using
debug_level flag sets _again_ :-( [Ger])
* Some improvements for network connectivity error handling and reporting.
(xvherror() added.) This is the result of some FTP tests with pavuk (tests
8b).
* Don't yak about 'Checking "robots.txt"' anymore when doing a FTP grab when
robots.txt is NOT applicable anyway.
* FTP: added crude 'autodetect/retry' mechanism for FTP servers which do not
like NLST (==> response code 550) but report correct directory content for
LIST (or vice versa). (ftp.c)
* FTP/HTTP: at debug level 'protoD' pavuk will now dump RAW data/content
received from the server before preprocessing (i.e. converting to HTML or
decompressing).
* Added command line option integer sizing support: byte sizes can now be
specified in K, M or G. Other integer values can also be postfixed with K,
M or G, but then these will be treated like the ISO values 1000, 1E6 and
1E9.
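A minimal sketch of such postfix handling, for illustration only.
sketch_parse_sized() is a hypothetical name; the sketch assumes byte sizes use
the binary multiples (1024 per step), accepts upper case postfixes only and
does no overflow checking.

```c
#include <assert.h>
#include <stdlib.h>

/* Parses "<digits>[K|M|G]" into *out.
   Returns 0 on success, -1 on a malformed value. */
static int sketch_parse_sized(const char *s, int is_byte_size, long long *out)
{
    char *end;
    long long v, unit;

    assert(s != NULL && out != NULL);
    unit = is_byte_size ? 1024 : 1000;   /* assumption: bytes are binary */
    v = strtoll(s, &end, 10);
    if (end == s)
        return -1;                       /* no digits at all */
    switch (*end) {
    case 'G': v *= unit; /* fall through */
    case 'M': v *= unit; /* fall through */
    case 'K': v *= unit; end++; break;
    case '\0': break;
    default: return -1;                  /* unknown postfix */
    }
    if (*end != '\0')
        return -1;                       /* trailing garbage after postfix */
    *out = v;
    return 0;
}
```

Under these assumptions '4K' yields 4096 for a byte size and 4000 for any other
integer option.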
* Additional memory leak fixes in case pavuk is fed an invalid commandline.
* NTLM support code: fixed a few glaring bugs.
* Added O_SHORT_LIVED to lock file open() flags for better Win32 behaviour.
* Fixed code to load the pavuk configuration settings from, in order of
appearance:
env:PAVUKRC_FILE
~/.pavukrc
SYSCONFDIR/pavukrc
which matches the description in the manual.
(see also man page)
2008-jan
* Added 'js' flag to '-debug_level', which is used to dump a lot of detail
about the pattern matching and transformation applied to JavaScript code
using the '-js_pattern' and '-js_transform / -js_transform2' commandline
options.
* Added sanity check for '-js_pattern' and '-js_transform[2]' regexes, which
MUST contain a subexpression for them to 'work' as expected.
* removed re_pmatch_sub() and changed the code where it was used to work
with the available re_pmatch_subs() call, which allows for more elaborate
validation anyway. See htmlparser.c.
* Removed a regex handling bug in the -js_transform[2] code, which would
crash pavuk when using regexes where the first subexpression might be
empty.
The crash is due to the fact that the regex parser would return indexes
'-1' for these empty subexpression(s), resulting in out-of-bounds memory
writes in the rewrite code. This in turn would nuke the heap, so after
that it was only a matter of time for pavuk to fail dramatically.
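The kind of guard needed here can be sketched with the POSIX <regex.h> API,
which reports rm_so == -1 for a subexpression that did not take part in the
match. This is an illustration, not the fix in htmlparser.c:
sketch_copy_submatch() is a hypothetical helper.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Copies subexpression 'idx' of a finished regexec() match into dst.
   Returns the number of bytes copied; 0 for an unmatched or empty group. */
static size_t sketch_copy_submatch(const char *subject, const regmatch_t *pm,
                                   size_t idx, char *dst, size_t dstsize)
{
    size_t len;

    assert(subject != NULL && pm != NULL && dst != NULL && dstsize > 0);
    dst[0] = '\0';
    /* never use a negative offset as an index into the subject string */
    if (pm[idx].rm_so < 0 || pm[idx].rm_eo < pm[idx].rm_so)
        return 0;
    len = (size_t)(pm[idx].rm_eo - pm[idx].rm_so);
    if (len >= dstsize)
        len = dstsize - 1;               /* truncate instead of overflowing */
    memcpy(dst, subject + pm[idx].rm_so, len);
    dst[len] = '\0';
    return len;
}
```

The caller is assumed to pass the regmatch_t array filled in by a successful
regexec() call, with 'idx' below the number of slots requested.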
2008 feb 04
* Added DEBUG_MISC() lines to solve sourceforge.net issue: [ 1852885 ] to
improve manipulation by locally stored files
* Included provisional fix (I don't have a working sample run to reproduce
the issue (yet)) for sourceforge.net issue: [ 1852884 ] infinite loop on
unexpected responses
* Cleaned up the mess that was -progress_mode.
* Cleaned up several DEBUG_xxx macro mistakes
* Added a little description to the 'hidden' -htDig commandline option,
which can be used to dump the server-transmitted MIME headers for each
URL, similar to the htdig tool.
* Added a bit of documentation for the -rollback option (which was
undocumented)
2008 mar 20
* GNU gettext tools don't like '\r' in i18n strings --> fixed by changing
the related printf() statements in src/doc.c
* started update of configure scripts to the latest autoconf/automake.
Also reordered the NEWS file so it will work with the new, stricter
./bootstrap && ./configure && make distcheck
distro test cycle.
2008 jul 10
* fixed ';' semicolon bug in http.c near line 2074 which caused incorrect
decoding of the HTTP/1.x response code header.
* fixed gzip/compress/... content compression support (HTTP/1.1
Accept-Encoding); the previous code was a valiant attempt to 'fix' the client
side (pavuk) to cope with buggy web servers which send the wrong encoding
type for already compressed files, but this would screw up particular
responses by *well-behaving* web servers. Of course this would only happen
in rare circumstances so it was kinda hard to track down.
Documentation for -Enc/-noEnc has been updated to reflect this situation
and the code now (hopefully properly) finally supports compressed data
transmission for RFC2616-compliant web servers.
If you find that your 'downloaded' compressed files are already
/incorrectly/ DEcompressed by pavuk, this is NOT the fault of the client
(pavuk) but evidence that your server is behaving inappropriately and the
proper remedy for this is the use of the option '-noEnc' which turns this
feature off so the server is not allowed to screw up in this way any more.
Also made sure one can check if pavuk has been built with compression
support by calling 'pavuk --version' and looking at the feature list.
* autoconf/configure script: using the highly undocumented v_cflags or other
x_* variables as environment variables to hack the configure script (you
could do that, especially with v_cflags) has been obsoleted while the
configure and m4/* scripts have been upgraded to support autoconf
2.62/automake 1.10 and use ONLY *documented* AC.*/etc. macros from now on.
Note: thanks to the JavaScript library issues on SuSe10.2/AMD64 (older JS
lib version and seemingly partial header install), I may have failed
to eradicate all undocumented macros.
* Extra note about configure.in: bash, at least on SuSe10.2/64-bit, handles
'if eval test ...' just ever so slightly different than 'if test ...',
especially where it comes to 'test -n'. As these styles were mixed rather
arbitrarily before, the 'if eval test ...' style has been completely
removed from the configure script, as this would sometimes render quite
unexpected (and incorrect!) results.
* fix_crlf.sh has been updated to ensure important Microsoft Visual Studio
files are not damaged by having their CRLF sequences converted to UNIX LF
line endings: this kind of thing will make MSVC spit you in the face and
reject everything you try until you give it back those CRLF line endings
in there. So much for XML as project file format and MSVC...
* extra fixes to ensure 'make distcheck' does not barf up a hairball. This
includes enforcing the permanent inclusion of the 'po' subdirectory in the
Makefile set for multilingual support.
* configure/Makefile(s): if you don't have one or more of the
archiving/compression tools compress/lzma/gzip/tar/7z(7zip) installed on
your system, we don't go belly up at config ~ nor at 'make dist' time
anymore. This, of course, includes correct behaviour at 'make distcheck'
time: only use/test those 'GNU standard' formats, which can be created on
your box.
* Added the 'bootstrap' shell script, next to 'autogen.sh'. I know they
serve the (almost) same purpose, but 'bootstrap' is far more sophisticated
than autogen.sh and I didn't wish to overwrite 'autogen.sh'. Besides, IDEs
on UNIX boxen expect either the one or the other (there's no single
'standard' for this), so we might as well provide both.
At a later time, we might probably point autogen.sh to bootstrap.
* Updated the mime.types MIME 'hint' file: currently, it's a mix of
1) all properly registered MIME types ( http://www.iana.org/assignments/media-types/ )
2) the mime.types file provided with the latest Apache/XAMPP
3) my (Ger Hobbelt) additional file extension hints as used on my own
servers. This is mostly about professional graphics ~ and modern
'scene' audio/video container formats, such as Matroska. This only adds
extensions for otherwise already existing MIME types.
* Updated the DocBook-based documentation for several options (-Enc/-noEnc, ...)
* 'pavuk --version' now also reports if ZLIB support is included in the
binary. This is important for '-Enc'.
* Fixed the '-Enc' compressed transmission and HTTP header processing code
to act properly with fully RFC2616-compliant web servers, discarding the
old 'hack/fix' attempt to solve a non-compliant server issue at the
client, as this would break things for fully compliant servers in the rare
(but extremely annoying) use case:
- pavuk with '-Enc' option
- webserver is fully RFC2616 compliant
- pavuk issues request for file in a .tar.Z or other gzip/compress
compressed format, where the file on the server is only slightly
compressed (fastest compression).
- webserver will transmit file to pavuk, but due to pavuk reporting it is
able to handle compressed transmission AND the server discovering that
the content can be compressed quite some more than it already was, the
file will be transmitted after a server-side just-in-time compression
round.
- pavuk receives the data. The old hacked code would NOT decompress the
data. However it SHOULD because the server PROPERLY reported
'Content-Encoding: gzip' to pavuk. End result: grabbed data which you cannot
process nor trust to be in the same format as stored on the server as it
all 'depends' on arbitrary conditions which you cannot control: is the
web server able to compress the data before transmission? Is the web
server configured to allow compression? Etc.
This use case has now been fixed.
The effect of BADLY behaving web servers (which send 'Content-Encoding:
gzip' for any .Z, .z or .gz files: IIS x.x and other servers which are not
configured to /properly/ handle files and MIME types) is described in the
DocBook manual page now, including the fix for this (specify the '-noEnc'
commandline option with pavuk).
* active FTP: timeout and stop/break handling slightly improved: now pavuk
should always terminate under all circumstances while a break or stop has
been signalled.
* Changed the default for '-url_strategy' from 'level' to 'leveli' to make
pavuk behave more like your regular web browser (with a user clicking
through web pages).
* Initial fix for NTLM support for 64-bit Windows. (Only lightly tested.)
This includes converting that bit of code to support the C99 intNN_t types
(where NN is one of {8,16,32}), while the configure script takes care of
providing the proper types for not-fully-C99-compliant environments.
* The TRE regex package would barf up a hairball due to the incorrect header
file being loaded. ./configure now recognizes TRE specifics a bit better
and the code now loads the proper header file (<tre/regex.h> instead of
<regex.h>). This is important on systems which have multiple, ever so
slightly incompatible regex processing libraries installed.
* Improved diagnostics a little bit by adding reporting support for
URL_PARENT_REWRITING, i.e. the situation where a parent page of a grabbed
page is loaded for the sake of adjusting (rewriting) the URLs in its
content.
* Fixed code so it would compile in full (-DDEBUG) debug mode on UNIX.
* autoconf/configure: ran into some weird issues due to inconsistent M4 []
quoting: quite a few lines did without it. Turns out that this is a BIG
No!No! as adding the AX_ADD_OPTION() macro turned this lurking mess into a
true disaster.
Fixed by applying [] quoting throughout. The only place where I didn't do
it, is in the first and second args of AC_DEFINE() -- which should be used
instead of AC_DEFINE_UNQUOTED when you don't need the latter's extra
functionality anyway -- and the first arg of AC_DEFINE_UNQUOTED(). Any
other spot where [] quotes are missing in the M4 macros and/or
configure.in? Consider that a bug and please report so I can fix it.
* Finally got the configure system to recognize my JavaScript libraries and
all. Tugged and tweaked a few items in the bindings to allow maximum
flexibility for the JS code when it is used to filter URLs (e.g.
JavaScript pavuk_url_cond_check() function).
* Updated jsbind.c to use latest SpiderMonkey 1.8.x (tested on Win32)
* Changed man/Makefile to ensure HTML is not recreated every 'make' run, but
only when manpage changes. This should really copy the results from
./doc/, but that's for later...
* DocBook documentation: tweaked man page generation to mimic original
manpage title exactly.
* DocBook documentation: updated '-version' info (important to see at
run-time what abilities you've got with /your/ pavuk).
* Win32/MSVC: all project files have been updated to produce next to
Win32/x86: Win64/AMD64 and Win64/Itanium binaries. These project files
assume the existence of all optional libraries: OpenSSL, SpiderMonkey
(JavaScript), zlib.
Where to get those, preferred directory layout, etc. to be published, so
others can build from source on Win32/64 too and get the same results.
2008 jul 20
* tweaked configure+makefiles so that a 'make dist' from CVS becomes
possible: there were quite a few references to yet unpublishable files in
my makefiles (Ger Hobbelt).
* config section: improved adherence to C standards: no more potentially
dangerous mixed use of function and data pointers by typecasting function
pointers into data pointers and vice versa.
This has been resolved by an added layer of indirection, which makes it
all very legal C again. It goes somewhat like this:
function_pointer_type ptr = &function;
data_pointer_type d = &ptr;
then use (d[0])(...) to call the function.
This contrasts the old code:
data_pointer_type d = (data_pointer_type)&function;
and function invocation using:
((function_pointer_type)d)(...)
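A compilable rendition of the indirection trick, as a sketch: the sketch_*
names and the int(int) signature are made up for the example and are not the
types used in pavuk's config section.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*sketch_fn_t)(int);      /* the function pointer type */
typedef sketch_fn_t *sketch_data_t;   /* data pointer TO a function pointer */

static int sketch_twice(int x)
{
    return 2 * x;
}

/* The function pointer is stored in an object, so taking the address of
   that object gives an ordinary data pointer. */
static sketch_fn_t sketch_slot = &sketch_twice;

/* Receives the data pointer through a generic void *, as a config table
   entry might, and calls the function through the extra indirection. */
static int sketch_call(void *opaque, int arg)
{
    sketch_data_t d = (sketch_data_t)opaque;

    assert(d != NULL && d[0] != NULL);
    return (d[0])(arg);
}
```

Converting between void * and a pointer to an object is well defined in ISO C,
whereas converting a function pointer directly to a data pointer is not
guaranteed by the standard; the price is that the slot object must outlive
every use of the data pointer.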
* Added support for parsing 'hidden' CSS and JavaScript in HTML. The support
is also extended to generally parse inside HTML comments PLUS Microsoft IE
CC's (Conditional Comments): <!--[if...]><![endif]-->
-read_css
-read_cdata
-read_msie_cc
-read_comments
These are all enabled by default; documentation has been updated for these
as well.
* Fixed CSS and [Java]Script handling in the HTML tokenizer/parser, which
was feeding the filters and URL extractors (htmlparser.c).
Now the code can cope better with incorrectly formatted pages / files.
* Reordered the HTML tags in htmltags.c in a preparatory move to check the
list for missing attributes (onXXX JavaScript items for one! several are
missing) and HTML 3/4 tags. (htmltags.c)
2008 aug 13
* updated the -debug_level related code; DEBUG_DEVEL() and a few others now
'automagically' report the sourcefile+lineno without the need to specify
these explicitly + some DEVEL_*() calls have been shifted to other
'-debug_level' levels (net, mtthr, htmlform, ...)
* completed the -debug_level tracing for multithreaded runs: now all
semaphore accesses can be traced using the -debug_level mtthr
* Major fix for bufio+socket code: no more lockup for pavuk due to delayed
reception of response data (tl_selectr() would incorrectly lock
indefinitely -- which proved to be a generic coding mistake in both
tl_selectr() and tl_selectw()) -- PLUS better error condition handling in
an attempt to improve handling of all sorts of 'spurious error conditions'
which may occur when your network suffers from packet loss or other
undesirable effects.
* -mode remind code fix for multithreaded use to make it match recurse and
other modes better; not severely tested so YMMV! (The old code wouldn't
work anyway, so it's an improvement anyhow).
* few code cleanups (#if 0 ... #endif)
* DocBook manual updated: now all return codes from pavuk are documented.
* minor code fixes for SSL/SFTP.
* updated configure and code to assist in compiling with both latest
SpiderMonkey and older Mozilla JavaScript libraries (Win32/64 and UNIX
respectively).
* Some unused error checks replaced by ASSERT() and some ASSERT()s replaced
by error reports as those errors /can/ happen in actual use (though
seldom).
* Fix for parsing malformed URLs (with multiple '#' and/or '?'): bookmarks
and query string parts would not be stripped/detached correctly as the
last '#'/'?' instead of the FIRST occurrence of '#'/'?' would be picked as
a separation point.
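The difference between 'first' and 'last' occurrence comes down to which
standard C search function is used, as this sketch shows. The sketch_* helpers
are hypothetical and are not the functions in pavuk's URL code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the part of 'url' before the FIRST '?' or '#'. */
static size_t sketch_url_base_len(const char *url)
{
    assert(url != NULL);
    return strcspn(url, "?#");
}

/* Pointer to the first '#' (start of the bookmark), or NULL if none.
   strchr() finds the first match; strrchr() would find the last one, which
   is the kind of mistake described above. */
static const char *sketch_url_fragment(const char *url)
{
    assert(url != NULL);
    return strchr(url, '#');
}
```

For 'p.html#one#two' the fragment returned is '#one#two', so the page name is
detached cleanly even though the URL is malformed.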
* Ran the gettext files through pot/pox/po again. Lots of 'fuzzies'... These
need to be fixed.
* EXPERIMENTAL: added preliminary code for extended JavaScript support:
hooks to process HTML and CSS just like you can process embedded <SCRIPT>s
now. The new hooks are still 'nulls', i.e. do not have any effect.
This is a work in progress; it compiles & runs (tested on UNIX and Win32
in multithreaded mode) but the new hooks still need to be implemented.
The goal here is that all grabbed (parsable) content should be processable
by custom JavaScript script functions AND when more than one URL is found,
the JavaScript code should be allowed to add those extra URLs to the pavuk
queue (using the new url.queue() JavaScript PavukUrl object method --
currently a 'nil' member function as it still must be fully implemented).
* isatty() fixes which check for error conditions and do /not/ provide
special 'console oriented' features when isatty(0) produces an error (may
happen on Win32/UNIX).
* Checked and updated all header files (after I ran into a cyclic dependency
when changing a bit of code): no .h files will #include "config.h"; all .c
files /do/ #include "config.h" as the first header.
System-dependent stuff (TRUE/FALSE definitions and a few other bits) have
been moved to config.h (where they belong IMO) and removed from tools.h.
This is a change required for the gzip fix [SF bug #2050527].
* Preliminary fix for CSS url grabbing and rewriting bug [SF bug #2050537].
The new code will now try to keep these styles of <url> formatting
in CSS intact -- this is done so as to keep particular CSS browser hacks
intact as much as possible:
@import "<url>"
@import url(<url>)
@import url('<url>')
@import url("<url>")
and of course the use of 'url()' elsewhere in any CSS is treated like the
examples above, i.e. NONE of these should be changed regarding <url>
delimiters (quotes or braces) when rewritten by pavuk.
The ONLY situation where pavuk will CHANGE the quotes is when a <url> is
found to contain the delimiter quote itself: in that case the quotes are
changed from ' to " and vice versa.
2008 aug 18
* minor fixes to the included mime.types file
* configure: added support/auto-detection for the GNU GDB extended debug
output (-ggdb -g3) for when building a debug build.
* NTLM: fixed code for Win64 and other 64-bit platforms which do or do not
support structure packing.
* documentation update: -[no]chunk_bug commandline argument finally
documented (was in there already for a longer time; is a special fix for
badly behaving IIS web servers which transmit data in 'chunked mode').
Also upgraded the documentation for the -tr_str_str/tr_chr_chr options so
one can finally read how to use [:print:] and other definitions in there
for -tr_chr_chr and be able to determine up front what the bugger will do
for you.
For example:
Why does -tr_chr_chr '[:hexnum:]' '0123456789abcdef' *not* do what you
expect when the filename has any of the a..f characters? (Answer: they all
become 'f' as [:hexnum:] actually expands to
'0123456789ABCDEFabcdef'
itself, so it is longer than the destination set and by definition any
'overflow' will be replaced by the last character in the target set.)
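The 'overflow' rule is easy to see in a small sketch of a tr-style translation.
sketch_tr_chr_chr() is a hypothetical stand-in that takes the already expanded
sets; it is not the code in tr.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Translates 's' in place: a character found at index i of 'from' becomes
   to[i]; an index past the end of 'to' maps to the last character of 'to'. */
static void sketch_tr_chr_chr(char *s, const char *from, const char *to)
{
    size_t tolen;

    assert(s != NULL && from != NULL && to != NULL);
    tolen = strlen(to);
    assert(tolen > 0);
    for (; *s != '\0'; s++) {
        const char *hit = strchr(from, *s);
        if (hit != NULL) {
            size_t i = (size_t)(hit - from);
            *s = to[i < tolen ? i : tolen - 1];   /* overflow: last char */
        }
    }
}
```

With the source set '0123456789ABCDEFabcdef' and the target set
'0123456789abcdef', the upper case A..F land on a..f as intended, but the lower
case a..f sit at indexes 16..21 and therefore all become 'f': 'c0ffee-AF' turns
into 'f0ffff-af'.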
* HTML/CSS/JavaScript parent rewriting was sometimes flaky; this has been
fixed by fixing several bits of antiquated code in pavuk: now all code
sections are equally aware of URL_ISHTML, URL_ISSTYLE and/or URL_ISSCRIPT.
Several functions have been adapted to mirror the new awareness:
ext_is_html() has been enhanced and has been renamed to actually show its
intended function: ext_is_parsable() -- which can be a HTML, CSS *or*
JavaScript file! (Not only HTML can be the parent of other URLs and need
updating, i.e. 'URL parent rewriting'.)
[ SF bug #2050537 ] CSS @import bad / HTML corrupted --> fixed
* On SuSe10.2/AMD64 glibc6 dumped core when running pavuk in full-out
'-debug -debug_level all' (the latter is implicit when you use '-debug')
mode. This was caused by glibc's printf() functions *sensibly* executing
a strlen() operation on the data fed to one of several '%.*s' printf()
formatting parameters, while those data series had NOT been NUL
terminated.
This would happen when debugging pavuk while fetching data from a gzip-
enabled web server: the gzip/inflate code would NOT append a new NUL
sentinel.
* Several other '%.*s' and '%s' related core dump spots in the DEBUG_XYZ()
code which would dump downloaded content have been fixed by feeding the
data through an enhanced asciidump function -- which will switch to HEX
dumping when the content to be shown for scrutiny contains a large amount
of non-ASCII data (> 10% is the current heuristic to switch over).
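Illustration only (not the actual asciidump code): the 10% switch-over
heuristic might look like the sketch below. The function name and the
exact definition of 'non-ASCII' (anything outside printable ASCII plus
tab, CR and LF) are assumptions. The buffer length is passed explicitly,
so the data does not need to be NUL terminated.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical check: return 1 when strictly more than 10% of the bytes
 * in buf[0..len-1] are not printable ASCII, 0 otherwise.
 */
static int wants_hex_dump(const unsigned char *buf, size_t len)
{
    size_t i;
    size_t odd = 0;

    assert(buf != NULL || len == 0);

    for (i = 0; i < len; i++) {
        unsigned char c = buf[i];
        int printable = (c >= 0x20 && c < 0x7f)
                        || c == '\n' || c == '\r' || c == '\t';

        if (!printable)
            odd++;
    }
    return odd * 10 > len;          /* odd / len > 10%, without division */
}
```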
* glibc6 on SuSe10.2/AMD64 would also dump core when being fed a 110K string
to a printf '%s' statement. This has been fixed by always limiting the
amount of content to be displayed when debug-printing downloaded data
(various '-debug_level's)
* gzip/inflate would fail to perform on 'non-parsable' content, i.e. plain
text files downloaded from a gzip-enabled web server. This has been fixed.
CAVEAT: The current gzip/inflate code does not deliver when it is fed very
large files. Hence, when downloading VMware images and/or multi-GB
ISO files, a workaround is to specify -noEnc. This will be fixed
at a later date.
[SF bug #2050527] nonparsed files saved in (wrong) compressed when using
HTTP --> fixed
* Parent rewriting would try to treat all parents as HTML, which is VERY
wrong when the actual parent is a CSS stylesheet or a JavaScript script
file. Fixed.
* unified variable names for 'struct doc' variables: it is *QUITE*
irritating to lose your display of 'docu' contents just because this call
uses 'docp' for the same (or 'html_doc') while trying to track down
lurking parent rewriting and file URL parsing bugs.
Updated all sourcefiles to the use of varname 'docu' for the current
document. 'docp' and 'html_doc' have been renamed.
* two bugfixes for the tr() code: (1) when using X-Y character ranges, the
size estimator would allocate far too little space. This has been fixed. (2)
the documentation says it well: you cannot include a NUL in a tr()
character set. In one case (a range at the start of the spec, like this:
'-z') the code would actually attempt to insert such a NUL anyhow, causing
a subtle bug. Fixed. And a minor code cleanup.
* fixed argument quoting for external app invocation, which is particularly
important for Windows machines: they treat '-quoting quite differently from
"-quoting. Fixed by using "-quotes instead of the original '-quotes.
* -enable_js is now turned ON by default - just like the documentation
already said.
KNOWN ISSUE: empty lines in JavaScript code and files get stripped by
pavuk on rewriting; this will be fixed at a later date.
* fix in mime.types file for CVS file extension + added mime types for
Microsoft Office 2007
* fixed heap corruption in ainterface.c when calling append_starting_url()
when url has been specified in the extended '-request' format, including
a predefined local filename. (Would dump core on some systems.)
* moved the url2diag and info2diag functions from recurse.c to where they should
have been: url.c -- to resolve a cyclic dependency.
* fixed up the '-request' format url parser/decoder url_parse() call: several
types of input specification error would be silently rejected (now pavuk
prints a suitable error message to tell the user what [s]he did wrong and what
was expected) + a few tugs & tweaks to fix behavior for parsing extended
URL specifications (including cookies, predefined local filenames, etc.) and
an extra '-debug' (level: URL) line to help you diagnose how the '-request's
have been parsed/decoded.
* now you can use the extended '-request' URL format anywhere on the
commandline and/or your pavuk configuration files -- as long as you keep
it within quotes on the commandline of course, e.g.
pavuk "URL:http://example.com/ LFNAME:example.html"
* fix: config files generated by pavuk now properly select the 'short format'
(URL:....) instead of the 'long url spec format' (Request:....): previously
pavuk would lose information about web forms, cookies, local filenames, etc.
for some types of requested url.
* quickfix for issue reported on the mailing list regarding JavaScript
interface functions causing the build to fail - which happened when no
JavaScript library could be found.
NOTE: on Linux, the JS libraries and headerfiles seem to get installed in
various places. The current ./configure script looks for the
jsapi.h
header file in the directory
/usr/include/js
unless you specify the '--with-js-includes=<dir>' option when running
./configure.
The same goes for the js library itself: the current configure script
looks for either libjs or libmozjs in any of these directories:
/usr/lib64/thunderbird
/usr/lib64/firefox
/usr/lib64
/usr/lib/thunderbird
/usr/lib/firefox
/usr/lib
unless you specify the ./configure --with-js-libraries=<dir> option
to point to your specific libjs.a / libmozjs.a
* added an advanced example of use to the pavuk DocBook documentation
which will end up in the manpage (where it's a bit too much, but then
at least the users have an extended example of actual use) -- example
shows how to grab the up-to-date content from a MediaWiki-based web
site.
* added S/M/H/D unit support for the time argument decoder function
* Updated the manual regarding:
- all missing 'hammer mode' options
- the missing -rtimeout and -wtimeout options
- checked first few options in options.h and made sure those were all
documented. (This is a work in progress...)
* All timeouts are now in milliseconds, except the -max_time one, which is
in minutes.
All timeout arguments (except -max_time) now recognize the alternative
units for specifying time: s/m/h/d/S/M/H/D: second, minute, hour, day.
When no unit has been specified, the unit 'milliseconds' is assumed.
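Illustration only (not the real time argument decoder): a minimal sketch
of the unit rule described above. The function name parse_timeout_ms() and
its error convention (-1 on malformed input) are assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical decoder: convert "<number>[s|m|h|d]" (unit in either
 * case) to milliseconds. No unit means the number already is in
 * milliseconds. Returns -1 for malformed, negative or overflowing input.
 */
static long parse_timeout_ms(const char *arg)
{
    char *end;
    long value;
    long factor;

    assert(arg != NULL);

    errno = 0;
    value = strtol(arg, &end, 10);
    if (end == arg || errno != 0 || value < 0)
        return -1;

    switch (*end) {
    case '\0':          return value;                   /* milliseconds */
    case 's': case 'S': factor = 1000L; break;
    case 'm': case 'M': factor = 60L * 1000L; break;
    case 'h': case 'H': factor = 60L * 60L * 1000L; break;
    case 'd': case 'D': factor = 24L * 60L * 60L * 1000L; break;
    default:            return -1;                      /* unknown unit */
    }
    if (end[1] != '\0')
        return -1;                  /* trailing garbage after the unit */
    if (value > LONG_MAX / factor)
        return -1;                  /* result would not fit in a long */
    return value * factor;
}
```

In this sketch '1500' yields 1500, '2s' yields 2000 and '1h' yields
3600000.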
* Fix for bug report #2158794: now all DEBUG_*() functions are called
using the proper number of arguments.
The code has been further enhanced for all printf()-like functions
(such as the DEBUG_*() and x*printf() functions) to enable GCC and MSVC
to check the format specification strings and parameter count and
type (GCC).
This led to the discovery of a multitude of errors, which have been
fixed (wrong integer sizes, etc.).
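Illustration only (not pavuk's actual macros): the GCC side of such
checking is commonly done with the 'format' function attribute, as in the
sketch below. The names PRINTF_LIKE and dbg_format() are hypothetical; the
MSVC mechanism is not shown.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Expands to the GCC format attribute, and to nothing elsewhere. */
#if defined(__GNUC__)
#define PRINTF_LIKE(fmt_idx, first_arg) \
    __attribute__((format(printf, fmt_idx, first_arg)))
#else
#define PRINTF_LIKE(fmt_idx, first_arg)
#endif

/* Argument 3 is the format string, the variable arguments start at 4. */
static int dbg_format(char *out, size_t outsz, const char *fmt, ...)
    PRINTF_LIKE(3, 4);

/* Hypothetical debug formatter writing into a caller-supplied buffer. */
static int dbg_format(char *out, size_t outsz, const char *fmt, ...)
{
    va_list ap;
    int n;

    assert(out != NULL && outsz > 0 && fmt != NULL);

    va_start(ap, fmt);
    n = vsnprintf(out, outsz, fmt, ap);
    va_end(ap);
    return n;
}
```

With the attribute in place, GCC (with -Wformat, which is part of -Wall)
warns about a call such as dbg_format(buf, sizeof buf, "%s", 42), which is
the class of error mentioned above.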
* Preliminary code move to allow downloading extremely large entities
(larger than 2GB) such as DVD ISO images: this has been done by more
judicious use of the size_t and ssize_t types instead of simply 'int'.
On 64-bit platforms, size_t/ssize_t can handle 64-bit sizes, while
'int' cannot (as GCC still uses 32-bit ints on most common hardware
64-bit architectures (Intel, ...)). Further effort will need to be
spent to adapt the system (and OpenSSL) calls to enable the complete
datapath for >2GB entity sizes (at least when compiled on 64-bit).
* Small documentation fix: regex overview of characterset changed in DocBook
source so it appears as a simple list, instead of just one long paragraph
full of concatenated items --> improved readability.
* const-ified the source code and fixed a few comment typos and a
lurking bug in FTP (found thanks to constification): filename
for directory index urls could be damaged in particular circumstances.
* fixed makefiles for environments without any DocBook tools. Also fixed
configure script to help detect the absence of mandatory DocBook template
files. Plus added the generated DocBook output to the distro as we cannot expect everyone
to have the DocBook tools; nevertheless, everybody /should/ receive a full
set of documentation.
* Bugfix in GET_NUMLIST(): now original numlist is properly removed (would only
be noticeable before when specifying multiple port numbers).
* memleak fix for _free_httphdr(): now also the httphdr struct itself gets
free()d.
* Fixed lockups in debug logging code when running in '-x' GUI mode; overhauled the
'recursive invocation' detection code within, which is mandatory to prevent
recursive calls to debug/log functions to blow up the stack and dump core while
running in ultra verbose debug/diag mode (-debug -debug_level all). This is the
second part of the fix for bug #2184196.
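Illustration only (not the code in pavuk): the general idea of a
'recursive invocation' guard is sketched below for the single-threaded
case; a multithreaded program would need per-thread state instead of the
plain static counter. All names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

static int log_depth = 0;       /* > 0 while a log call is in progress */
static int log_dropped = 0;     /* nested calls rejected by the guard */
static int lines_written = 0;   /* lines that reached the sink */

static void log_line(const char *msg, void (*sink)(const char *));

/* Example sink that misbehaves: it tries to log again while writing. */
static void noisy_sink(const char *msg)
{
    (void)msg;
    lines_written++;
    log_line("sink called recursively", noisy_sink);    /* gets dropped */
}

/* Hand msg to the sink unless we are already inside a log call. */
static void log_line(const char *msg, void (*sink)(const char *))
{
    assert(msg != NULL && sink != NULL);

    if (log_depth > 0) {
        log_dropped++;          /* refuse to recurse, keep the stack flat */
        return;
    }
    log_depth++;
    sink(msg);
    log_depth--;
}
```

Calling log_line("hello", noisy_sink) writes one line and drops the
nested call instead of recursing without bound.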
* Bugfix for #2023089: new code is introduced for '-lmax' depth level checks:
the 'depth' (a.k.a. 'level') will always be taken from the non-inline parent URL
which has the lowest level.
This should fix situations where 'inline' URLs have 'inline' *parent* URLs, such
as style sheets, which are referenced by non-inline URLs (HTML files).
Seeking out the lowest level non-inline parent should also take care of situations
where multiple HTML files at different levels themselves, all (directly!) reference the same
stylesheet/inline URL.
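Illustration only (pavuk's url structures look different): the 'lowest
level non-inline parent' rule can be sketched with a simplified node type.
The struct layout and the function name effective_level() are assumptions,
and the sketch assumes the parent graph has no cycles.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a URL record. */
struct node {
    int is_inline;              /* 1 for stylesheets, images, ... */
    int level;                  /* depth; used for non-inline nodes only */
    struct node **parents;      /* documents referencing this node */
    size_t nparents;
};

/*
 * Return the lowest level found among the non-inline ancestors of u,
 * looking through inline parents, or -1 when there is none.
 */
static int effective_level(const struct node *u)
{
    size_t i;
    int best = -1;

    assert(u != NULL);

    for (i = 0; i < u->nparents; i++) {
        const struct node *p = u->parents[i];
        int lvl = p->is_inline ? effective_level(p) : p->level;

        if (lvl >= 0 && (best < 0 || lvl < best))
            best = lvl;
    }
    return best;
}
```

For a stylesheet referenced directly by HTML files at levels 3 and 1 this
gives 1, and an image referenced only from that stylesheet also gets 1.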
* Attempt at fixing a GUI semaphore lockup, caused by LOCK_CFG_URLSTACK being used
for different purposes (was a quick hack once to create a 'critical section' there)
in recurse.c @ 1129. Same hack, but now we use LOCK_GHBN which should cause much less trouble
there.
* Bit of code cleanup.
* Code review checks to see if URLT_FTPS and URLT_GOPHER are used consistently where
you'd expect them, just as URLT_HTTPS is used next to URLT_HTTP.
* Code review checks and fixes to prevent spurious damage to url->parent structures:
now the access to this element is critical-sectioned /everywhere/ using LOCK_URL(u); existed
in 95% of the places already, now all code has been checked.
* Several fixes for multithreaded GTK GUI use. Most important thing which
was missing: a call to gtk_threads_init().
* JavaScript: updated HTML tag/attribute tables to recognize all
onXYZ=... JavaScript event attributes in HTML + added the full
set of attributes to the url pattern class/object which is
available in pavuk's own JavaScript extension.
|
|