LinkChecker Help (version 1.2)

This document describes the LinkChecker version 1.2 developed by Z. Wagner -- Ice bear Soft.

Usage
What does the LinkChecker verify
What is not checked
What will not work
How to read the results
Why all the correct links are displayed?
Understanding redirections
Progress indication
Using the summary table
Error messages
How to report bugs
Go and check it!
Technical details for web administrators
Installation
Obtaining LinkChecker
Changes
To Do List

Usage

LinkChecker is a CGI script which verifies validity of links in WWW pages. It is invoked through a form. Its elements have the following meaning:

URL is the location of the page to be checked.
Follow can specify which pages should be checked.
- Incremental checking checks only the specified URL. At the end of the result page it displays the list of unchecked local URLs and emits a number of hidden fields. It is then possible to select another URL and check it. This is useful if you have many pages with many references: if you try to verify all of them by following local links, your browser may crash or the connection may be lost. This function stores intermediate results in the hidden fields. If you verify a great many pages, the number of hidden fields may grow so large that some browsers cannot send them to the server.
- Nothing informs that you wish to check only the specified page.
- Local links mean all referenced pages residing on the same host as the specified page.
- Links below the specified URL will request checking all referenced pages the names of which start with the specified string. This makes sense only if the starting URL is a directory.
Use persistent connection will try to download the pages through a single connection using HTTP/1.1. This is usually faster but may fail in some cases. You can always use it unless you encounter some problems.
Use HTTP/1.0 will force usage of the older protocol. Usually it is not necessary but you can use it if some server does not like HTTP/1.1 requests. Persistent connections will then not be allowed.
Display full response headers will inform the script to print all response headers together with some internal parsed variables. The headers are sorted alphabetically. This is provided for debugging. It is not useful for normal operation because it provides too much information.
Show internal variables should not be used unless you know what you are doing. It appends a structured listing of all internal variables of package IceBearSoft::Linkchecker. It may be interesting as a demonstration of how the script works or for hunting bugs, but beware: it is always longer than the display of normal results. Your browser may crash if you have too many verified links.
Load via default proxy asks the script to use the default proxy (usually the same where the LinkChecker resides) for downloading. Its IP address or name is specified as a value of this control. My package IceBearSoft::Zwebfun may be unable to communicate with some servers and this may help.
Proxy is a name or an IP address of another proxy which has to be used. If it is specified, the above option is ignored.
Optional headers influence how the pages are retrieved, mainly if content negotiation is used. The headers are sent while fetching each page. The most useful headers are Accept-Language and Accept-Charset, therefore special fields are provided for them. You can specify three more headers by giving their names and values. LinkChecker does not verify the contents of the headers and sends them as they are specified. If you enter them incorrectly, LinkChecker may fail. Use them only if you understand them, otherwise leave these fields empty.
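
The Follow choices described above can be sketched in a few lines of Python. This is only an illustration of the rules as stated in this help, not the Perl code of LinkChecker; the function name should_follow and the mode labels "nothing", "local" and "below" are hypothetical, and incremental checking is left out because it depends on the hidden form fields.

```python
from urllib.parse import urlsplit


def should_follow(link: str, start_url: str, mode: str) -> bool:
    """Decide whether a referenced page should itself be checked.

    mode is one of the hypothetical labels "nothing", "local" or "below".
    """
    if mode == "nothing":
        # Only the specified page is checked.
        return False
    if mode == "local":
        # Same host as the specified page.
        return urlsplit(link).netloc.lower() == urlsplit(start_url).netloc.lower()
    if mode == "below":
        # Plain prefix match on the URL string, which is why this mode
        # makes sense only if the starting URL is a directory.
        return link.startswith(start_url)
    raise ValueError(f"unknown mode: {mode}")
```

For example, with the starting URL http://example.org/a/ the page http://example.org/c.html would be followed in "local" mode but not in "below" mode.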

Remember that specification of a proxy is treated as advice only. If the proxy does not respond, the script will start communicating directly without any warning. On the other hand, if the server requires use of a specified proxy, the LinkChecker will try to use it. The proxy may be configured not to allow downloading pages from certain servers. If the proxy returns the 403 error, LinkChecker will try to get the object directly without using any proxy.
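
The fallback rule can be pictured roughly as follows. This Python sketch only mirrors the behaviour described in the paragraph above; the function fetch_with_proxy_advice and the injected fetch callable are assumptions made for the illustration and do not correspond to any function of IceBearSoft::Zwebfun.

```python
from typing import Callable, Optional, Tuple

# fetch(url, proxy) returns (status, body); it raises OSError if nothing answers.
Fetch = Callable[[str, Optional[str]], Tuple[int, bytes]]


def fetch_with_proxy_advice(url: str, proxy: Optional[str], fetch: Fetch) -> Tuple[int, bytes]:
    """Try the proxy first; go direct if it is silent or answers 403."""
    if proxy:
        try:
            status, body = fetch(url, proxy)
            if status != 403:
                return status, body
            # 403 from the proxy: fall through and try without it.
        except OSError:
            pass  # proxy did not respond: continue directly, no warning
    return fetch(url, None)
```

Passing the transport in as a parameter keeps the sketch independent of any particular HTTP library.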

Use of a proxy may bring you some advantages. If your browser uses the same proxy as LinkChecker, pages which were just verified by LinkChecker will load quickly in your browser, and vice versa the LinkChecker will quickly load pages which you have recently viewed in your browser. It is useful mainly if you have a direct connection (on the same LAN) to the server where the LinkChecker runs, otherwise you may not be allowed to use the proxy of the LinkChecker's server and LinkChecker may not be allowed to use your proxy.

On the other hand, use of a proxy may be a source of some problems. Suppose that LinkChecker reports a wrong link. You correct it, put the new page on the server, run LinkChecker again and it reports the same error -- because it reads the old page from the proxy cache. Since LinkChecker does not cache anything itself between runs, it may sometimes be necessary to disable use of the proxy.

What does the LinkChecker verify

The script fetches the page from the server and parses the HTML code. It looks for the following elements:

<form ...> is not checked at all. URLs of the actions together with the request methods are displayed. Submitting the form invokes an action on the server and this should not be done by checking of the links.
<img ...> is checked for the validity of the SRC attribute by the HEAD method. It means that the server sends the information about the object but the image is not downloaded. Special extensions such as DYNSRC are not checked. If you need this, I may add it in a future version.
<body ...> is checked for the validity of the BACKGROUND attribute if present. As above, the object is requested by the HEAD method and it is verified that it is of type "image/...". This feature is new since version 1.0c.
<a ...> has two different forms. Its attribute NAME specifies a label within the page and it is just stored in the parsed body. The HREF attribute points to the linked object. Its existence is verified by the HEAD method. If the link contains a fragment (label), the linked page must be loaded with the GET method in order to obtain the list of labels.
<frame ...> is treated similarly as above but it is always loaded and checked even if Follow=Nothing was specified.
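
The list above can be illustrated with Python's standard html.parser module. The sketch is not the IceBearSoft::ZWsgml parser; the class name LinkCollector and its attribute names are hypothetical, and it only collects what would have to be verified without sending any request.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the attributes that the list above says are examined."""

    def __init__(self):
        super().__init__()
        self.head_checks = []  # URLs to verify with HEAD (img src, body background, a href)
        self.frames = []       # frame sources, always loaded and checked
        self.labels = set()    # labels defined by <a name="..."> in this page
        self.forms = []        # (method, action) pairs: displayed, never submitted

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)  # HTMLParser lowercases tag and attribute names
        if tag == "img" and a.get("src"):
            self.head_checks.append(a["src"])
        elif tag == "body" and a.get("background"):
            self.head_checks.append(a["background"])
        elif tag == "a":
            if a.get("name"):
                self.labels.add(a["name"])
            if a.get("href"):
                self.head_checks.append(a["href"])
        elif tag == "frame" and a.get("src"):
            self.frames.append(a["src"])
        elif tag == "form":
            method = (a.get("method") or "get").upper()
            self.forms.append((method, a.get("action") or ""))
```

After calling feed() with the page source, the labels set is what a link with a fragment such as page.html#top would be compared against.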

What is not checked

LinkChecker is based upon the HTTP protocol. It cannot check objects accessible via other protocols such as FTP, GOPHER, TELNET, NEWS, etc. It does not check e-mail addresses either. If you know how to program this in Perl5 and supply the code, I will add it to the next version.

I would also like to check HTTPS pages. Unfortunately, it is not implemented in the OS/2 version of curl and I did not manage to force OpenSSL to download any page (probably I am doing something wrong but I do not understand its documentation).

LinkChecker does not understand Java and JavaScript. Therefore links in applets and scripts are not checked.

Hypertext links are checked only in documents with Content-Type: text/html. Theoretically it might be extended to Gopher but I do not know its language.

LinkChecker does not ever try to verify authenticated documents. It merely displays an error message "401 Unauthorized" or similar.

What will not work

WWW pages written in Czech often supply re-encoding via CGI scripts using GET method and PATH_INFO. These scripts may not correctly respond to the HEAD method and may even confuse persistent connections.

How to read the results

The result window starts with a chapter title and a short text containing a link to this help. The help will open in its own window specified by "TARGET=HELP". The same specification is used in the form, so that you can easily switch between your working browser window and this help. The link is also given at the end of almost every section.

Each page has its own section. The sections are numbered and the title is formed as a hypertext link. The page will open in its own window specified by "TARGET=TEST". Since all such links are directed to a window with an identical name, you will never get more than three browser windows and your system should not be overloaded. The section then contains the result of loading the object, i.e. its Status, URL, Content-Type, Last-Modified date and time, and possibly the Location of a moved document. The status of a successfully loaded object is "200 OK". If you requested display of full response headers, you will get much more detailed information but most of it is not interesting for normal use.

Each link is presented as a subsection numbered within the section. The subsection title is again formatted as a hypertext link. It is written in exactly the same form as in the checked document so that you can easily find it. The subsection title is followed by the same kind of information as above.

Why all the correct links are displayed?

LinkChecker must at least test the existence of the objects, and HTML pages must be loaded and parsed in order to find the <META> elements. This takes some time. If your page contains a great many correct links and only errors were reported, the LinkChecker would not send you anything for a long time. Your browser might incorrectly treat this as a lost connection, and when the LinkChecker finally had something to send you, the connection would no longer exist. Therefore the LinkChecker sends you the results of all tests (and will always do so) in order to inform your browser that it is still alive.

Anyway, all error and diagnostic messages are, at least partly, displayed in red colour. You can thus view the results very fast and search for red texts only.

Understanding redirections

A server may send redirections in several situations. Some of them are natural and no correction is needed. Others tell the author of the page that the links should be changed. The reasons for redirections are briefly described below.

Missing slash after directory name
Generic name resolution
Object moved

Missing slash after directory name

Imagine that you link to http://hroch486.icpf.cas.cz/webtools within your page. When a user selects this link, the server will recognize that webtools is a directory name and will send a default document. This document may contain relative links to other pages. If the above mentioned URL is used, the relative links will be merged incorrectly. In order to make everything correct, the server must force the browser to redirect to http://hroch486.icpf.cas.cz/webtools/ which is achieved by sending status 301 with a new location. If you see such redirection which only adds a trailing slash, you know it is this case. All browsers understand it and correction is not necessary.
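
The effect of the missing slash on relative links can be reproduced with Python's urljoin. This is only an illustration; page.html is a hypothetical file name and only_adds_trailing_slash is a helper invented for this example, not a part of LinkChecker.

```python
from urllib.parse import urljoin


def only_adds_trailing_slash(requested: str, location: str) -> bool:
    """True if a redirect merely appends "/" to the requested URL."""
    return not requested.endswith("/") and location == requested + "/"


base = "http://hroch486.icpf.cas.cz/webtools"

# Without the slash, "webtools" is treated as a file name and dropped.
print(urljoin(base, "page.html"))        # -> http://hroch486.icpf.cas.cz/page.html

# With the slash, the relative link is resolved inside the directory.
print(urljoin(base + "/", "page.html"))  # -> http://hroch486.icpf.cas.cz/webtools/page.html

print(only_adds_trailing_slash(base, base + "/"))  # -> True
```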

Generic name resolution

The server may recognize a generic name which is easy to remember and redirect you to the correct object. The objects may also be dynamic and change automatically with time while the generic name remains the same. Some script may redirect you to another computer which is currently "less busy". There may be other reasons connected mainly with Uniform Resource Name resolution. These redirections will be signalled by status 302 or 303 but 301 may also be used. There is no automatic way to recognize this type of redirection. It helps to view the linked object or to contact the administrator (but remember that administrators are usually very busy, so do not disturb them without good reason).

Object moved

The object may be moved to another location with an identical or changed name. The web administrator may configure the server to send status 301 with the new location, or the authors of the pages (mainly if they lease space on someone else's server) may add META REFRESH. If you see this type of redirection, you should change the link to the new location. If you select the link to the page with META REFRESH, you will usually see some explanatory message.

Progress indication

Since version 1.2 LinkChecker provides a kind of progress indication implemented in JavaScript. It does not matter if you disable JavaScript in your browser. In such a case only the progress indication will be unavailable but the LinkChecker as such will continue to work. The progress indicator displays something like:

Testing #5; remaining 12 (7/26)

It means that LinkChecker is currently verifying the fifth page, it knows about 12 other pages to be verified, the current page contains 26 links, and the seventh link is just being verified. The number of pages to be verified is determined during verification, thus the number of remaining pages may increase during the run. If the verified link corresponds to a redirection, either by <meta refresh> or by a 3xx status code, the number of links on the current page is increased and the URL resulting from the redirection is verified immediately. The progress indication is thus only a very approximate measure of the remaining time.
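
The meaning of the four numbers can be summarized in a small Python sketch. The real indicator is written in JavaScript; the class Progress and its field names are hypothetical and serve only to show how the text is composed and why a redirection makes the total grow.

```python
from dataclasses import dataclass


@dataclass
class Progress:
    page_no: int          # number of the page being verified
    remaining_pages: int  # pages known so far that still wait for verification
    link_no: int          # number of the link being verified on this page
    link_total: int       # links known so far on this page

    def on_redirect(self) -> None:
        # The redirection target counts as one more link on the current page.
        self.link_total += 1

    def title(self) -> str:
        return (f"Testing #{self.page_no}; remaining {self.remaining_pages} "
                f"({self.link_no}/{self.link_total})")


p = Progress(page_no=5, remaining_pages=12, link_no=7, link_total=26)
print(p.title())  # -> Testing #5; remaining 12 (7/26)
```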

In addition, the script outputs its start time and end time. The times are output via the toLocaleString of the JavaScript's Date object. If you do not like its format, change the settings of your locale.

It might seem natural to display progress in the status bar. Unfortunately the status bar is often rewritten by the browser's logic during the page transfer. LinkChecker will thus safely report that the processing has finished which is of no use. The progress indication has therefore been implemented in the title bar. If you are using Mozilla with the tabbed view, you will see it, although sometimes with some delay, also in the corresponding tab. It should also be displayed in your window list. You can thus check the progress without scrolling the browser's window.

Using the summary table

The summary table contains all types of diagnostic messages encountered, both errors and warnings. Each message is accompanied by the count of occurrences and a hyperlink to the last occurrence. Each error message displayed at the verified object carries a hyperlink to the previous occurrence of an error or warning of the same type, with the exception of the first occurrence, which links to the summary table. You can thus easily traverse the output without having to read the whole protocol.

Remember that an object may contain two different error or warning messages. Be sure that you follow the correct chain.
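
The chain of hyperlinks can be thought of as one backward-linked list per message type. The following Python sketch shows the bookkeeping this implies; the class Summary and the anchor names are hypothetical and are not taken from the LinkChecker source.

```python
class Summary:
    """Keep one backward chain of occurrences per diagnostic message."""

    def __init__(self):
        self.count = {}  # message -> number of occurrences
        self.last = {}   # message -> anchor of the most recent occurrence

    def record(self, message: str, anchor: str) -> str:
        """Register one occurrence and return the anchor it links back to.

        The first occurrence of a message links to the summary table,
        every later one links to the previous occurrence of the same type.
        """
        back = self.last.get(message, "summary")
        self.count[message] = self.count.get(message, 0) + 1
        self.last[message] = anchor
        return back
```

At the end of the run the summary table lists, for each message, count[message] and a link to last[message], which is the entry point of the chain.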

Error messages

There are many things which may go wrong. The errors are subdivided into several classes.

Errors in SGML parser
Errors in IceBearSoft::Zwebfun package
HTTP errors
LinkChecker errors

Errors in SGML parser

My own parser (package IceBearSoft::ZWsgml) is very simple, does not require a DTD and can even handle some errors in the HTML code. However, it may sometimes fail. I am not sure whether it will handle all situations, e.g. if only part of the page is loaded due to a network error. In such cases the parser may die without any message.

Errors in IceBearSoft::Zwebfun package

Errors may be encountered within the package which communicates with the WWW servers. They will be displayed inside "Error-Message". Most often it will be "Host not found". However, it may happen that the Error-Message field is empty. In that case I would appreciate it if you sent me a bug report and I will try to fix it.

HTTP Errors

These are various network errors as specified in RFC 2616 (update of RFC 2068). The status messages consist of a three-digit code and a reason phrase. Some servers may add a numeric subcode which is separated from the status code by a period. The reason phrases specified in RFC 2068 are only suggestions; the server may return any text which seems reasonable.
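
A status message of this shape can be split as shown in the Python sketch below. It is only an illustration of the format described above; parse_status is a hypothetical helper and the subcode in the usage example is invented.

```python
import re

# three-digit code, optional ".subcode", optional reason phrase
STATUS_RE = re.compile(r"^(\d{3})(?:\.(\d+))?(?:\s+(.*))?$")


def parse_status(text: str):
    """Split a status message into (code, subcode, reason)."""
    m = STATUS_RE.match(text.strip())
    if m is None:
        raise ValueError(f"not a status message: {text!r}")
    code = int(m.group(1))
    subcode = int(m.group(2)) if m.group(2) else None
    reason = m.group(3) or ""  # free text, never used for decisions
    return code, subcode, reason


code, subcode, reason = parse_status("404.1 Not Found")
status_class = code // 100  # 4, i.e. the class called "Status 4**" below
```

Only the numeric code is reliable; the reason phrase should be displayed but not interpreted.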

Status 1**

You should not see these statuses, and I do not know of any server which actually sends them. I am not sure that IceBearSoft::Zwebfun treats them correctly. Using a proxy should solve it, but I would appreciate information about such a server so that I can fix it in my package.

Status 2**

200 OK means success. Other messages of this class are reserved for purposes which should not occur in link checking, therefore they will be treated as errors.

Status 3**

These messages denote redirection and the LinkChecker tries to handle them.

"301 Moved permanently" is most often caused by missing trailing slash in the directory name. No correction is needed since all browsers will handle it. However, if the new location differs considerably from the specified URL, it is most probably a server generated response and the document is really moved. You should consider changing the link in your page.

302 and 303 specify alternative locations and the LinkChecker automatically tries to verify the redirected documents. These are temporary redirections and you should not change your page.

305 informs that you have to use the specified proxy and the LinkChecker will do it.

Further information is available in Understanding redirections.

Status 4** and 5**

These messages are just displayed without any further action.

LinkChecker Errors

LinkChecker may encounter additional error conditions. They are usually displayed in red color.

LinkChecker is not able to verify this URL

The document uses a protocol other than HTTP or is not of type text/html.

Only HTML files can be checked

You tried, either directly or indirectly, to verify a document which is not of type text/html.

Label #lbl does not exist

LinkChecker found the object but did not find the label.

I cannot check the following forms

Forms cannot be checked. Actions and request methods are displayed.

The proxy has been used but the server complains, contact the administrator!

LinkChecker tried to use the proxy specified by the 305 message but either the proxy does not respond or it did not help.

Sorry, there was no response

Package IceBearSoft::Zwebfun did not send any response, not even the error message. I hope this will never happen.

Proxy Error

LinkChecker connects to the Internet via a proxy but the proxy found that it is not possible to connect to the server. The proxy did not return any specific error message.

Cannot connect to server

LinkChecker was unable to download robots.txt from the server, either due to a 5** error or because the server did not respond at all or the host was not found. In order to make checking faster, LinkChecker will no longer try to connect to this server and will display this message instead.

The resource body was not loaded

The resource headers were successfully received via the GET method but the parsed body cannot be found. One of the following happened:

Loading is forbidden by robots.txt (this will be written in the Error-Message field).
An unexpected network error occurred and LinkChecker was not able to retrieve the resource body.
An internal error occurred in package IceBearSoft::Zwebfun.
The format of the resource body is strange and IceBearSoft::Zwebfun was not able to handle it. Use of a proxy may help. Display of full headers may also reveal the source of the problem.
Parser (package IceBearSoft::ZWsgml) was not able to parse a corrupted body.
An internal error occurred in package IceBearSoft::ZWsgml.

Try to check the corresponding URL only and if the problem persists, send me a bug report.

The <META> element does not allow robots to follow the links within this page

The page contains <META NAME="ROBOTS" CONTENT="... NOFOLLOW ..."> (the tokens are not case sensitive). This command is recognized by the Robots Exclusion Protocol and does not allow robots to follow the links. If you are the author of the page, you may temporarily remove the command (preferably by changing its angle brackets to the comment marks <!-- and -->). The next version may recognize a META element which will instruct LinkChecker to ignore META NOFOLLOW (see To Do List later in this document).
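
The test for the NOFOLLOW token can be sketched in Python as follows. It only illustrates the rule stated above (tokens are not case sensitive and may be combined with other tokens); the function names are hypothetical and the splitting on commas and white space is an assumption about how the CONTENT value is usually written.

```python
import re


def robots_tokens(content: str) -> set:
    """Split the CONTENT value of <META NAME="ROBOTS"> into lowercase tokens."""
    return {token.lower() for token in re.split(r"[,\s]+", content) if token}


def forbids_following(content: str) -> bool:
    """True if the token "nofollow" is present, in any letter case."""
    return "nofollow" in robots_tokens(content)
```

Comparing whole tokens rather than searching for a substring matters here, because "follow" is itself a valid token and is contained in "nofollow".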

META redirection

This message informs that the page contains META REFRESH. The LinkChecker will verify both the original page and the redirected resource. This message is also explained in Object moved.

Too many redirections!

The link is redirected to another object which is redirected to another object which is redirected to another object... This may be an infinite loop. Examine the chain of links. If there is really a loop, there is no help. If not, try to select the last link of the chain just above this message. If your browser shows you something reasonable, the chain is really so long and you should correct it in your document. You can also note the chain and verify the last object by the LinkChecker. This will give you several next steps. Remember that according to RFC 2068 only 5 redirections are allowed. If you leave the redirection chain too long, the browsers should complain about an infinite loop in redirections.
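
A guard against endless redirection can be sketched as below. This Python fragment is not the LinkChecker implementation; follow_chain and the next_location callable are hypothetical, and the limit of 5 is simply the number quoted above.

```python
MAX_REDIRECTS = 5  # limit quoted above from RFC 2068


def follow_chain(start, next_location, limit=MAX_REDIRECTS):
    """Follow redirections from start.

    next_location(url) returns the redirection target of url, or None if
    url is not redirected. Returns (chain, verdict), where verdict is
    "ok", "loop" or "too many redirections".
    """
    chain = [start]
    url = start
    while True:
        target = next_location(url)
        if target is None:
            return chain, "ok"
        chain.append(target)
        if target in chain[:-1]:
            return chain, "loop"  # we have been here before
        if len(chain) - 1 > limit:
            return chain, "too many redirections"
        url = target
```

Returning the whole chain makes it possible to display every step, so that the user can examine the last link as suggested above.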

Wrong type of an image object

The object referenced in the <IMG> element exists but is not an image. Possible reasons are:

You link to an incorrect object
The server is not properly installed and does not recognize the object as an image
Your image does not have known format or its name is not recognized as an image file

Browser died

This message only appears in the server's error log. It means that the connection to the browser was broken (either by a network error or by the user) before LinkChecker sent its output. There is no way to send this message to the user.

How to report bugs

If something goes wrong, you should first save the result into an HTML file. Afterwards return to the form and note your input. Try the same again in order to see whether the condition is persistent. You can also try to use a proxy or HTTP/1.0. If the error is persistent, mail me the saved result as an attachment and include what you entered into the form. Do not expect a fast response. I am very busy with other projects but I will certainly read your message and try to fix the bug.

Go and check it!

Now you can go to the Link Check Form. This help may be opened again from both the form and the results.

Technical details for web administrators

These technical details tell web administrators how to prevent the LinkChecker from accessing parts of their web servers.

The LinkChecker identifies itself as User-Agent: LinkChecker/x.y where x.y is its version. The current version is 1.2.

The LinkChecker reads robots.txt and obeys commands for User-Agent: LinkChecker. The test is not case sensitive in the current version as well as in all previous versions.
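
An administrator who wants to see how such a rule is evaluated can try it with Python's standard urllib.robotparser module. This is only a sketch of the Robots Exclusion Protocol in general, not of LinkChecker's own code; the function allowed, the directory /private/ and the host example.org are made up for the example.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, url: str, agent: str = "LinkChecker/1.2") -> bool:
    """Evaluate the text of a robots.txt file for the given user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


ROBOTS = """\
User-agent: linkchecker
Disallow: /private/

User-agent: *
Disallow:
"""

print(allowed(ROBOTS, "http://example.org/private/x.html"))  # -> False
print(allowed(ROBOTS, "http://example.org/public/x.html"))   # -> True
```

The record is written in lower case on purpose: this parser, like LinkChecker according to the paragraph above, compares the agent name without regard to letter case.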

The LinkChecker writes "Browser died" (with some irrelevant information) to the server's error log if either the user had stopped his/her browser or connection had been broken before all output was sent.

Installation

First you need a WWW server. LinkChecker is tested with Apache but it will hopefully work with other servers. Send me a note if other servers require modifications and I will enhance the code of the next version.

LinkChecker is written in Perl5. It is object oriented and requires sockets. Be sure that you have the correct version of Perl properly installed.

LinkChecker relies upon several modules which are part of the IceBearSoft Perl Package. Be sure that you have this package properly installed. Try to connect with http.pl to your server. If it does not work, LinkChecker will not work either.

The distributions of the IceBearSoft Perl Package as well as LinkChecker are available from my page.

LinkChecker is distributed as a part of the IceBearSoft Perl Package which has its own installation script. You should follow the manual of the whole package (see manual.html in the doc subdirectory). You can then customize LinkChecker.

If you create a new directory for LinkChecker's HTML files, you should create index.html or rename linkcheck-help.html to index.html. Then open linkcheck.html and customize it according to your conditions. Look for <form ... action="/cgi-bin/linkcheck.pl/webtools/linkcheck-help.html">. Change cgi-bin to your directory of CGI scripts. Change /webtools/linkcheck-help.html to your path to the help file. If you renamed the file to index.html or something else, you must also modify the link above the first comment. The name or IP address of the default proxy must be specified in the DEFAULT_PROXY environment variable. Be sure that Apache passes this variable to the CGI scripts. Alternatively it may be set directly in the Apache configuration file. Remove the attribute checked if you do not wish to have it as the default. If you wish to change the read timeout for HTTP connections, do it in linkcheck.pl.
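
From the point of view of a CGI script, both settings arrive through the environment: DEFAULT_PROXY as configured in the server, and the part of the form action after the script name (here /webtools/linkcheck-help.html) as PATH_INFO. The Python sketch below only shows this mechanism; read_cgi_settings is a hypothetical function and LinkChecker itself reads the values in Perl.

```python
import os


def read_cgi_settings(environ=os.environ) -> dict:
    """Pick up the settings a CGI script receives through its environment."""
    return {
        # Name or IP address of the default proxy; None if the server
        # does not pass the variable to CGI scripts.
        "default_proxy": environ.get("DEFAULT_PROXY") or None,
        # Extra path after the script name in the form action,
        # used here as the URL of the help file.
        "help_url": environ.get("PATH_INFO") or None,
    }
```

If default_proxy comes out as None when called from a CGI script, the variable is probably not being passed through by the server.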

Obtaining LinkChecker

You can download the whole IceBearSoft Perl Package, which contains LinkChecker, from the software section of my page.

Changes

Version 1.0a introduces the following changes:

Several fixes and enhancements in IceBearSoft::Zwebfun package:
1. Shorter timeout for "Host not found"
2. Do not use HTTP/1.1 if server responded with HTTP/1.0 only
3. Bug in Transfer-Encoding: chunked fixed
Faster diagnosis of died client (more polite to my server)
Implements Robots Exclusion Protocol (more polite to other servers):
1. Honors robots.txt and obeys both an asterisk (meaning all robots) and its own name, i.e. LinkChecker.
2. Obeys <META NAME="ROBOTS" CONTENT="... nofollow ...">. Token "nofollow" may appear in connection with other tokens.
Buffering removed (response starts faster)
Tests for infinite loops in redirections
Checking redirections specified in META elements.
Warn if <IMG SRC="..."> is not of type image/...
Option "Use persistent connection" is checked as default in the form
Links to the help window presented at the end of almost all sections of output
URL of the help window specified as PATH_INFO in the input form (see comment in linkcheck.html). This will make transfer to other servers easier.

Version 1.0b contains a few changes needed due to reimplementation of Zwebfun.

Version 1.0c adds the following changes:

Bug fix: some local pages were incorrectly pushed for checking several times.
Fixed bug in using the SGML parser.
The LinkChecker does not try to check a link if it failed to load robots.txt due to a 5** error or because the server did not respond at all or the host was not found. This makes checking considerably faster.
Incremental checking added
Added checking of <BODY BACKGROUND=>. It is needed if the whole tree is generated automatically, e.g. when preparing off-line version of WWW pages for distribution on CD.
Almost all functionality moved from linkcheck.pl to Linkchecker.pm. Therefore linkcheck.pl will most probably remain unchanged in the future versions and the administrator will not lose his or her customization.
Enhanced comments in Linkchecker.pm and linkcheck.html

Changes from 1.0c to 1.2:

If the proxy server replies with error 403, LinkChecker tries to fetch the page directly without proxy.
Line numbers of links in the source file reported.
Optional headers added.
Summary table implemented.
JavaScript enhancements.
A few bugs in redirections fixed.
Name or IP of the default proxy taken from the DEFAULT_PROXY environment variable.
Including LinkChecker into the IceBearSoft Perl Package.
Documentation updated.

To Do List

Finding the best values for different kinds of timeout
Inventing META elements for finer control of the LinkChecker
Handling status "300 Multiple choices"(?) -- it seems that this condition is already handled automatically. It can also be influenced by the user by means of the optional headers

Last modified: 12 Aug 2005