LinkChecker Help (version 1.2)

This document describes LinkChecker version 1.2, developed by Z. Wagner -- Ice Bear Soft.

Usage

LinkChecker is a CGI script which verifies the validity of links in WWW pages. It is invoked through a form; the meaning of the form elements is described below.

Remember that the specification of a proxy is treated as advice only. If the proxy does not respond, the script will start communicating directly without any warning. On the other hand, if the server requires use of a specified proxy, LinkChecker will try to use it. The proxy may be configured not to allow downloading pages from certain servers. If the proxy returns the 403 error, LinkChecker will try to get the object directly without using any proxy.
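
As an illustration of this fallback, here is a minimal Perl sketch. It uses the standard LWP::UserAgent module rather than the IceBearSoft::Zwebfun package which LinkChecker actually uses, and the URL and proxy address are placeholders; it shows the behaviour, not the actual implementation.

    use strict;
    use warnings;
    use LWP::UserAgent;

    # Illustration only: LinkChecker itself uses IceBearSoft::Zwebfun,
    # not LWP. The proxy address and URL below are placeholders.
    sub fetch_with_proxy_fallback {
        my ($url, $proxy) = @_;
        my $ua = LWP::UserAgent->new(agent => 'LinkChecker/1.2', timeout => 30);
        if (defined $proxy) {
            $ua->proxy('http', $proxy);
            my $res = $ua->get($url);
            # On 403 retry directly without the proxy (this simple sketch
            # cannot tell a proxy 403 from a server 403).
            return $res unless $res->code == 403;
            $ua = LWP::UserAgent->new(agent => 'LinkChecker/1.2', timeout => 30);
        }
        return $ua->get($url);    # direct connection
    }

    my $res = fetch_with_proxy_fallback('http://example.com/',
                                        'http://proxy.example.com:3128/');
    print $res->status_line, "\n";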

Use of a proxy may bring you some advantages. If your browser uses the same proxy as LinkChecker, you will quickly see pages which were just verified by LinkChecker, and vice versa, LinkChecker will quickly load pages which you have recently viewed in your browser. This is useful mainly if you have a direct connection (on the same LAN) to the server where LinkChecker runs; otherwise you may not be allowed to use the proxy of LinkChecker's server and LinkChecker may not be allowed to use your proxy.

On the other hand, use of a proxy may be a source of some problems. Suppose that LinkChecker reports a wrong link. You correct it, put the new page on the server, run LinkChecker again, and it reports the same error -- because it reads the old page from the proxy cache. Since LinkChecker does not cache anything itself between runs, it may sometimes be necessary to disable the use of the proxy.

What does the LinkChecker verify

The script fetches the page from the server and parses the HTML code, looking for the elements which refer to other objects.
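
As a hedged illustration of this parsing step, the Perl sketch below extracts the link attributes mentioned elsewhere in this help (A HREF, IMG SRC, BODY BACKGROUND) using the standard HTML::Parser module. LinkChecker's real parser is the author's own IceBearSoft::ZWsgml package, so this is a sketch of the idea, not the actual code.

    use strict;
    use warnings;
    use HTML::Parser;

    # Illustration only: LinkChecker really uses IceBearSoft::ZWsgml.
    # The element/attribute list is an assumption based on the elements
    # this help mentions (A HREF, IMG SRC, BODY BACKGROUND).
    my %watched = (a => 'href', img => 'src', body => 'background');

    my $html = '<body background="bg.png"><a href="page.html">x</a></body>';
    my @links;
    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [sub {
            my ($tag, $attr) = @_;
            my $name = $watched{$tag} or return;
            push @links, $attr->{$name} if defined $attr->{$name};
        }, 'tagname, attr'],
    );
    $p->parse($html);
    $p->eof;
    print "$_\n" for @links;    # bg.png, page.html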

What is not checked

LinkChecker is based upon the HTTP protocol. It cannot check objects accessible via other protocols such as FTP, GOPHER, TELNET, NEWS, etc. It does not check e-mail addresses either. If you know how to program this in Perl5 and supply the code, I will add it to the next version.

I would also like to check HTTPS pages. Unfortunately, this is not implemented in the OS/2 version of curl, and I did not manage to force OpenSSL to download any page (probably I am doing something wrong, but I do not understand its documentation).

LinkChecker does not understand Java and JavaScript. Therefore links in applets and scripts are not checked.

Hypertext links are checked only in documents with Content-Type: text/html. Theoretically it might be extended to Gopher but I do not know its language.

LinkChecker never tries to verify authenticated documents. It merely displays an error message "401 Unauthorized" or similar.

What will not work

WWW pages written in Czech often supply re-encoding via CGI scripts using the GET method and PATH_INFO. These scripts may not respond correctly to the HEAD method and may even confuse persistent connections.
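
A quick way to see whether a script behaves like this is to compare its responses to HEAD and GET. A minimal LWP sketch (the URL is a made-up placeholder, and this is an illustration rather than anything LinkChecker does itself):

    use strict;
    use warnings;
    use LWP::UserAgent;

    # Compare HEAD and GET responses; a script which mishandles HEAD
    # will typically return different status codes. Placeholder URL.
    my $ua  = LWP::UserAgent->new(timeout => 30);
    my $url = 'http://www.example.cz/cgi-bin/recode.pl/cs/index.html';

    my $head = $ua->head($url);
    my $get  = $ua->get($url);
    printf "HEAD: %s\nGET:  %s\n", $head->status_line, $get->status_line;
    print "The script probably mishandles HEAD\n"
        if $head->code != $get->code;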

How to read the results

The result window starts with a chapter title and a short text containing a link to this help. The help will open in its own window, specified by "TARGET=HELP". The same specification is used in the form, so that you can easily switch between your working browser window and this help. The link is also given at the end of almost every section.

Each page has its own section. The sections are numbered and the title is formed as a hypertext link. The page will open in its own window, specified by "TARGET=TEST". Since all such links are directed to a window with an identical name, you will never get more than three browser windows and your system should not be overloaded. The section then contains the result of loading the object, i.e. its Status, URL, Content-Type, Last-Modified date&time and possibly the Location of a moved document. The status of a successfully loaded object is "200 OK". If you requested display of full response headers, you will get much more detailed information, but most of it is not interesting for normal use.

Each link is presented as a subsection numbered within the section. The subsection title is again formatted as a hypertext link. It is written exactly in the same form as in the checked document, so that you can easily find it. The subsection title is followed by similar information as above.

Why are all the correct links displayed?

LinkChecker must at least test the existence of the objects, and HTML pages must be loaded and parsed in order to find the <META> elements. This takes some time. If your page contains a great many correct links, LinkChecker would otherwise not send you anything for a long time. Your browser might incorrectly treat this as a lost connection, and when LinkChecker finally had something to send, the connection would no longer exist. Therefore LinkChecker sends you the results of all tests (and will always do so) in order to inform your browser that it is still alive.
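
In a Perl CGI script this streaming behaviour boils down to disabling output buffering; a minimal sketch of the idea (not LinkChecker's actual code):

    use strict;
    use warnings;

    $| = 1;    # autoflush STDOUT, so every print reaches the browser at once

    print "Content-Type: text/html\r\n\r\n";
    my @urls = ('http://example.com/', 'http://example.org/');
    for my $url (@urls) {
        # ... the actual verification of $url would happen here ...
        print "<p>Checked $url</p>\n";    # sent immediately; the browser
                                          # sees the connection is alive
    }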

Anyway, all error and diagnostic messages are, at least partly, displayed in red colour. You can thus scan the results very quickly, searching for red text only.

Understanding redirections

A server may send redirection commands in several situations. Some of them are natural and no correction is needed. Sometimes a redirection informs the author of the page that the links should be changed. The reasons for redirections are briefly described below.

Missing slash after directory name

Imagine that you link to http://hroch486.icpf.cas.cz/webtools within your page. When a user selects this link, the server will recognize that webtools is a directory name and will send a default document. This document may contain relative links to other pages. If the above-mentioned URL is used, the relative links will be resolved incorrectly. In order to make everything correct, the server must force the browser to redirect to http://hroch486.icpf.cas.cz/webtools/, which is achieved by sending status 301 with a new location. If you see such a redirection which only adds a trailing slash, you know it is this case. All browsers understand it and no correction is necessary.
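
The effect on relative links can be demonstrated with the standard URI module (the relative link pics/logo.png is a made-up example):

    use strict;
    use warnings;
    use URI;

    # Why the trailing slash matters: the same relative link resolves
    # to different URLs against the two bases.
    my $rel = 'pics/logo.png';    # hypothetical relative link

    print URI->new_abs($rel, 'http://hroch486.icpf.cas.cz/webtools'), "\n";
    # -> http://hroch486.icpf.cas.cz/pics/logo.png          (wrong)

    print URI->new_abs($rel, 'http://hroch486.icpf.cas.cz/webtools/'), "\n";
    # -> http://hroch486.icpf.cas.cz/webtools/pics/logo.png (correct)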

Generic name resolution

The server may recognize a generic name which is easy to remember and redirect you to the correct object. The objects may also be dynamic and change automatically with time while the generic name remains the same. Some script may redirect you to another computer which is currently "less busy". There may be other reasons, connected mainly with Uniform Resource Name resolution. These redirections will be signalled by status 302 or 303, but 301 may also be used. There is no automatic way to recognize this type of redirection. It helps to view the linked object or to contact the administrator (but remember that administrators are usually very busy; do not disturb them without good reason).

Object moved

The object may be moved to another location with an identical or changed name. The web administrator may configure the server to send status 301 with the new location, or the page authors (mainly if they lease space on someone else's servers) add META REFRESH. If you see this type of redirection, you should change the link to the new location. If you select the link to a page with META REFRESH, you will usually see some explanatory message.

Progress indication

Since version 1.2 LinkChecker provides a kind of progress indication implemented in JavaScript. If you disable JavaScript in your browser, only the progress indication will be unavailable; LinkChecker as such will continue to work. The progress indicator displays something like:

Testing #5; remaining 12 (7/26)

It means that LinkChecker is currently verifying the fifth page, it knows about 12 other pages yet to be verified, the current page contains 26 links and the seventh link is just being verified. The number of pages to be verified is determined during verification, thus the number of remaining pages increases during the run. If the verified link corresponds to a redirection, either by <meta refresh> or by a 3xx status code, the number of links on the current page is increased and the URL resulting from the redirection is verified immediately. The progress indication is thus a very approximate measure of the remaining time.
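
As a hedged sketch of how such an indicator can be produced from a CGI script (an illustration, not LinkChecker's actual source), the Perl fragment below interleaves small <script> elements with the output, each rewriting document.title:

    use strict;
    use warnings;

    $| = 1;    # stream output immediately
    print "Content-Type: text/html\r\n\r\n";
    print "<html><head><title>LinkChecker</title></head><body>\n";

    # Example values matching the sample line above.
    my ($page, $remaining, $link, $total) = (5, 12, 7, 26);
    printf "<script>document.title='Testing #%d; remaining %d (%d/%d)';</script>\n",
           $page, $remaining, $link, $total;
    # ... verification output for this link would follow here ...
    print "</body></html>\n";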

In addition, the script outputs its start time and end time. The times are output via the toLocaleString method of JavaScript's Date object. If you do not like its format, change the settings of your locale.

It might seem natural to display the progress in the status bar. Unfortunately the status bar is often rewritten by the browser's logic during the page transfer; it would thus reliably report only that the processing has finished, which is of no use. The progress indication has therefore been implemented in the title bar. If you are using Mozilla with the tabbed view, you will see it, although sometimes with some delay, also in the corresponding tab. It should also be displayed in your window list. You can thus check the progress without scrolling the browser's window.

Using the summary table

The summary table contains all types of diagnostic messages encountered, both errors and warnings. Each message is accompanied by the count of occurrences and a hyperlink to the last occurrence. Each error message displayed at the verified object is provided with a hyperlink to the previous occurrence of an error or warning of the same type, with the exception of the first occurrence, which contains a link to the summary table. You can thus easily traverse the output without having to read the whole protocol.

Remember that an object may contain two different error or warning messages. Be sure that you follow the correct chain.

Error messages

There are many things which may go wrong. The errors are subdivided into several classes.

Errors in SGML parser

My own parser (package IceBearSoft::ZWsgml) is very simple; it does not require a DTD and can even handle some errors in the HTML code. However, it may sometimes fail. I am not sure whether it will handle all situations, e.g. if only part of the page is loaded due to a network error. The parser may even die without any message.

Errors in IceBearSoft::Zwebfun package

Errors may be encountered within the package which communicates with the WWW servers. Such an error is displayed in the "Error-Message" field. Most often it will be "Host not found". However, it may happen that the Error-Message field is empty. I would then appreciate it if you sent me a bug report, and I will try to fix the problem.

HTTP Errors

These are various network errors as specified in RFC 2616 (an update of RFC 2068). The status messages consist of a three-digit code and a reason phrase. Some servers may add a numeric subcode, separated from the status code by a period. The reason phrases specified in RFC 2068 are only suggestions; the server may return any text which seems reasonable.
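
A status line with such a subcode can be picked apart with a simple pattern; a hedged Perl sketch (the sample lines are made up for illustration):

    use strict;
    use warnings;

    # Parse "HTTP/1.1 404.3 Not Found" style status lines; the subcode
    # (".3") is optional.
    for my $line ('HTTP/1.1 200 OK', 'HTTP/1.1 404.3 Not Found') {
        if ($line =~ m{^HTTP/\d+\.\d+\s+(\d{3})(?:\.(\d+))?\s+(.*)$}) {
            my ($code, $subcode, $reason) = ($1, $2, $3);
            printf "code=%s subcode=%s reason=%s\n",
                   $code, defined $subcode ? $subcode : '-', $reason;
        }
    }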

Status 1**

You should not see these messages; I do not know of any server which actually sends them. I am not sure that IceBearSoft::Zwebfun treats them correctly. Using a proxy should solve the problem, but I would appreciate information about such a server so that I can fix my package.

Status 2**

"200 OK" means success; other messages of this class are reserved for purposes which should not occur in link checking. Therefore they will be treated as errors.

Status 3**

These messages denote redirection and the LinkChecker tries to handle them.

"301 Moved permanently" is most often caused by missing trailing slash in the directory name. No correction is needed since all browsers will handle it. However, if the new location differs considerably from the specified URL, it is most probably a server generated response and the document is really moved. You should consider changing the link in your page.

302 and 303 specify alternative locations and the LinkChecker automatically tries to verify the redirected documents. These are temporary redirections and you should not change your page.

305 informs that you have to use the specified proxy, and LinkChecker will do so.

Further information is available in Understanding redirections.

Status 4** and 5**

These messages are just displayed without any further action.

LinkChecker Errors

LinkChecker may encounter additional error conditions. They are usually displayed in red colour.

LinkChecker is not able to verify this URL

The document uses a protocol other than HTTP or is not of type text/html.

Only HTML files can be checked

You tried, either directly or indirectly, to verify a document which is not of type text/html.

Label #lbl does not exist

LinkChecker found the object but did not find the label.

I cannot check the following forms

Forms cannot be checked. Actions and request methods are displayed.

The proxy has been used but the server complains, contact the administrator!

LinkChecker tried to use the proxy specified by the 305 message, but either the proxy did not respond or it did not help.

Sorry, there was no response

Package IceBearSoft::Zwebfun did not send any response, not even an error message. I hope this will never happen.

Proxy Error

LinkChecker connects to the Internet via a proxy, but the proxy found that it is not possible to connect to the server. The proxy did not return any specific error message.

Cannot connect to server

LinkChecker was unable to download robots.txt from the server, either due to a 5** error, or because the server did not respond at all, or the host was not found. In order to make checking faster, LinkChecker will no longer try to connect to this server and will display this message.
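
The idea can be sketched in a few lines of Perl (the helper names are hypothetical; this is not LinkChecker's actual code):

    use strict;
    use warnings;

    # Remember hosts whose robots.txt could not be fetched and skip
    # all further links pointing to them. Hypothetical helper names.
    my %dead_host;

    sub host_is_usable {
        my ($host, $fetch_robots) = @_;   # $fetch_robots: code ref, returns
                                          # false on 5xx or no response
        return 0 if $dead_host{$host};
        my $ok = $fetch_robots->($host);
        $dead_host{$host} = 1 unless $ok;
        return $ok ? 1 : 0;
    }

    # Usage with a stub fetcher which always succeeds:
    print host_is_usable('example.com', sub { 1 }) ? "usable\n" : "dead\n";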

The resource body was not loaded

The resource headers were successfully received via the GET method but the parsed body cannot be found. One of the following happened:

  1. Loading is forbidden by robots.txt (this will be written in the Error-Message field).
  2. An unexpected network error occurred and LinkChecker was not able to retrieve the resource body.
  3. An internal error occurred in package IceBearSoft::Zwebfun.
  4. The format of the resource body is strange and IceBearSoft::Zwebfun was not able to handle it. Use of a proxy may help. Display of full headers may also reveal the source of the problem.
  5. The parser (package IceBearSoft::ZWsgml) was not able to parse a corrupted body.
  6. An internal error occurred in package IceBearSoft::ZWsgml.
Try to check the corresponding URL alone, and if the problem persists, send me a bug report.

The <META> element does not allow robots to follow the links within this page

The page contains <META NAME="ROBOTS" CONTENT="... NOFOLLOW ..."> (the tokens are not case sensitive). This command belongs to the Robots Exclusion Protocol and does not allow robots to follow the links. If you are the author of the page, you may temporarily remove the command (preferably by changing its angle brackets to the comment marks <!-- and -->). The next version may recognize a META element which will instruct LinkChecker to ignore META NOFOLLOW (see the To Do List later in this document).
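
A hedged sketch of the NOFOLLOW test using the standard HTML::Parser module (LinkChecker itself uses its own IceBearSoft::ZWsgml parser, so this only illustrates the rule described above):

    use strict;
    use warnings;
    use HTML::Parser;

    # Detect <META NAME="ROBOTS" CONTENT="... NOFOLLOW ...">; both the
    # attribute values and the token are matched case-insensitively.
    my $html = '<meta name="Robots" content="index, NOFOLLOW">';
    my $nofollow = 0;

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [sub {
            my ($tag, $attr) = @_;
            return unless $tag eq 'meta';
            return unless lc($attr->{name} || '') eq 'robots';
            $nofollow = 1 if ($attr->{content} || '') =~ /\bnofollow\b/i;
        }, 'tagname, attr'],
    );
    $p->parse($html);
    $p->eof;
    print $nofollow ? "must not follow links\n" : "may follow links\n";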

META redirection

This message informs that the page contains META REFRESH. The LinkChecker will verify both the original page and the redirected resource. This message is also explained in Object moved.

Too many redirections!

The link is redirected to another object, which is redirected to another object, which is redirected to another object... This may be an infinite loop. Examine the chain of links. If there really is a loop, there is no remedy. If not, try to select the last link of the chain just above this message. If your browser shows you something reasonable, the chain really is that long and you should correct it in your document. You can also note the chain and verify the last object with LinkChecker; this will give you the next several steps. Remember that according to RFC 2068 only 5 redirections are allowed. If you leave the redirection chain too long, browsers may complain about an infinite loop in redirections.
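
As a hedged Perl sketch of such a limit (LWP::UserAgent is used for illustration, not LinkChecker's actual code; the starting URL is a placeholder):

    use strict;
    use warnings;
    use LWP::UserAgent;
    use URI;

    # Follow redirects by hand so the chain can be reported; give up
    # after the 5 hops which RFC 2068 allows.
    my $ua    = LWP::UserAgent->new(max_redirect => 0, timeout => 30);
    my $url   = 'http://example.com/old-location';    # placeholder
    my $limit = 5;
    my @chain = ($url);

    for my $hop (1 .. $limit + 1) {
        my $res = $ua->get($url);
        last unless $res->is_redirect;
        if ($hop > $limit) {
            print "Too many redirections!\n";
            last;
        }
        my $loc = $res->header('Location') or last;
        $url = URI->new_abs($loc, $url)->as_string;
        push @chain, $url;
    }
    print "  $_\n" for @chain;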

Wrong type of an image object

The object referenced in the <IMG> element exists but is not an image (a sketch of the check follows the list below). Possible reasons are:

  1. You link to an incorrect object.
  2. The server is not properly installed and does not recognize the object as an image.
  3. Your image does not have a known format or its name is not recognized as an image file.
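
A hedged sketch of the check itself (LWP for illustration, placeholder URL): the Content-Type of the referenced object must start with image/.

    use strict;
    use warnings;
    use LWP::UserAgent;

    # The object behind <IMG SRC="..."> must have an image/* type.
    my $ua  = LWP::UserAgent->new(timeout => 30);
    my $res = $ua->head('http://example.com/pics/logo.png');  # placeholder

    if ($res->is_success) {
        my $type = $res->header('Content-Type') || '';
        print "Wrong type of an image object: $type\n"
            unless $type =~ m{^image/}i;
    }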

Browser died

This message only appears in the server's error log. It means that the connection to the browser had been broken (either by a network error or by the user) before LinkChecker sent its output. There is no way to send this message to the user.

How to report bugs

If something goes wrong, you should first save the result into an HTML file. Afterwards return to the form and note your input. Try the same again in order to see whether the condition is persistent. You can also try to use a proxy or HTTP/1.0. If the error is persistent, mail me the saved result as an attachment and include the input you gave in the form. Do not expect a fast response; I am very busy with other projects, but I will certainly read your message and try to fix the bug.

Go and check it!

Now you can go to the Link Check Form. This help may be opened again from both the form and the results.

Technical details for web administrators

These technical details tell web administrators how to prevent LinkChecker from accessing parts of their web servers.

LinkChecker identifies itself as User-Agent: LinkChecker/x.y, where x.y is its version. The current version is 1.2.

LinkChecker reads robots.txt and obeys the commands for User-Agent: LinkChecker. The match is not case sensitive, neither in the current version nor in any previous version.
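
For example, a robots.txt along the following lines (the paths are made-up placeholders) keeps LinkChecker out of the listed directories while leaving other robots unaffected:

    User-Agent: LinkChecker
    Disallow: /private/
    Disallow: /cgi-bin/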

LinkChecker writes "Browser died" (with some irrelevant information) to the server's error log if either the user stopped his or her browser or the connection was broken before all output was sent.

Installation

First you need a WWW server. LinkChecker is tested with Apache, but it will hopefully work with other servers. Send me a note if another server requires modifications and I will enhance the code in the next version.

LinkChecker is written in Perl5. It is object oriented and requires sockets. Be sure that you have the correct version of Perl properly installed.

LinkChecker relies upon several modules which are part of the IceBearSoft Perl Package. Be sure that you have this package properly installed. Try connecting to your server with http.pl; if that does not work, LinkChecker will not work either.

The distributions of the IceBearSoft Perl Package as well as LinkChecker are available from my page.

LinkChecker is distributed as a part of the IceBearSoft Perl Package which has its own installation script. You should follow the manual of the whole package (see manual.html in the doc subdirectory). You can then customize LinkChecker.

If you create a new directory for LinkChecker's HTML files, you should create index.html or rename linkcheck-help.html to index.html. Then open linkcheck.html and customize it according to your conditions:

  1. Look for <form ... action="/cgi-bin/linkcheck.pl/webtools/linkcheck-help.html">. Change cgi-bin to your directory of CGI scripts and change /webtools/linkcheck-help.html to your path to the help file. If you renamed the file to index.html or something else, you must also modify the link above the first comment.
  2. The name or IP address of the default proxy must be specified in the DEFAULT_PROXY environment variable. Be sure that Apache passes this variable to the CGI scripts; alternatively it may be set directly in the Apache configuration file (see the example below).
  3. Remove the checked attribute if you do not wish to have that option on by default.
  4. If you wish to change the read timeout for HTTP connections, do it in linkcheck.pl.
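
For instance, with Apache's mod_env the variable can be set or passed along these lines (the proxy name is a placeholder; the exact value format depends on your site):

    # In the Apache configuration: set the default proxy directly ...
    SetEnv DEFAULT_PROXY proxy.example.com
    # ... or pass it from the environment in which the server was started:
    # PassEnv DEFAULT_PROXY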

Obtaining LinkChecker

You can download the whole IceBearSoft Perl Package, which contains LinkChecker, from the software section of my page.

Changes

Version 1.0a introduces the following changes:

  1. Several fixes and enhancements in IceBearSoft::Zwebfun package:
    1. Shorter timeout for "Host not found"
    2. Do not use HTTP/1.1 if server responded with HTTP/1.0 only
    3. Bug in Transfer-Encoding: chunked fixed
  2. Faster diagnosis of died client (more polite to my server)
  3. Implements Robots Exclusion Protocol (more polite to other servers):
    1. Honors robots.txt and obeys both an asterisk (meaning all robots) and its own name, i.e. LinkChecker.
    2. Obeys <META NAME="ROBOTS" CONTENT="... nofollow ...">. The token "nofollow" may appear in combination with other tokens.
  4. Buffering removed (response starts faster)
  5. Tests for infinite loops in redirections
  6. Checking redirections specified in META elements.
  7. Warn if <IMG SRC="..."> is not of type image/...
  8. Option "Use persistent connection" is checked as default in the form
  9. Links to the help window presented at the end of almost all sections of output
  10. URL of the help window specified as PATH_INFO in the input form (see comment in linkcheck.html). This will make transfer to other servers easier.

Version 1.0b contains a few changes needed due to reimplementation of Zwebfun.

Version 1.0c adds the following changes:

  1. Bug fix: some local pages were incorrectly pushed for checking several times.
  2. Fixed bug in using the SGML parser.
  3. LinkChecker does not try to check a link if it failed to load robots.txt due to a 5** error, or because the server did not respond at all, or the host was not found. This makes checking considerably faster.
  4. Incremental checking added
  5. Added checking of <BODY BACKGROUND=>. It is needed if the whole tree is generated automatically, e.g. when preparing an off-line version of WWW pages for distribution on CD.
  6. Almost all functionality moved from linkcheck.pl to Linkchecker.pm. Therefore linkcheck.pl will most probably remain unchanged in future versions and the administrator will not lose his or her customization.
  7. Enhanced comments in Linkchecker.pm and linkcheck.html

Changes from 1.0c to 1.2:

  1. If the proxy server replies with error 403, LinkChecker tries to fetch the page directly without proxy.
  2. Line numbers of links in the source file reported.
  3. Optional headers added.
  4. Summary table implemented.
  5. JavaScript enhancements.
  6. A few bugs in redirections fixed.
  7. Name or IP of the default proxy taken from the DEFAULT_PROXY environment variable.
  8. Including LinkChecker into the IceBearSoft Perl Package.
  9. Documentation updated.

To Do List

  1. Finding the best values for different kinds of timeout
  2. Inventing META elements for finer control of the LinkChecker
  3. Handling status "300 Multiple choices" (?) -- it seems that this condition is already handled automatically. It can also be influenced by the user by means of the optional headers

Last modified: 12 Aug 2005