This document describes the LinkChecker version 1.2 developed by Z. Wagner -- Ice bear Soft.
LinkChecker is a CGI script which verifies validity of links in WWW pages. It is invoked through a form. Its elements have the following meaning:
Remember that the specification of a proxy is treated as advice only. If the proxy does not respond, the script will start communicating directly without any warning. On the other hand, if the server requires use of a specified proxy, LinkChecker will try to use it. The proxy may be configured not to allow downloading pages from certain servers. If the proxy returns the 403 error, LinkChecker will try to get the object directly without using any proxy.
Use of a proxy may bring you some advantages. If your browser uses the same proxy as LinkChecker, you will quickly see pages which were just verified by LinkChecker and, vice versa, LinkChecker will quickly load pages which you have recently viewed in your browser. This is useful mainly if you have a direct connection (on the same LAN) to the server where LinkChecker runs; otherwise you may not be allowed to use the proxy of the LinkChecker's server and LinkChecker may not be allowed to use your proxy.
On the other hand, use of a proxy may be a source of some problems. Suppose that LinkChecker reports a wrong link. You correct it, put the new page on the server, run LinkChecker again and it reports the same error -- because it reads the old page from the proxy cache. Since LinkChecker does not cache anything itself between runs, it may sometimes be necessary to disable use of the proxy.
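The "proxy as advice" rule described above can be illustrated with a short sketch. It is written in Python for brevity and is not the author's Perl code; the `fetch` callable, its signature and the use of `OSError` for an unresponsive proxy are assumptions made only for this example.

```python
from typing import Callable, Optional, Tuple

# Hypothetical fetcher: (url, proxy) -> HTTP status code.
# It is assumed to raise OSError when the proxy cannot be reached.
Fetcher = Callable[[str, Optional[str]], int]


def fetch_with_proxy_advice(url: str, proxy: Optional[str],
                            fetch: Fetcher) -> Tuple[int, bool]:
    """Return (status, used_proxy); the proxy is treated as advice only."""
    if proxy is not None:
        try:
            status = fetch(url, proxy)
        except OSError:
            status = None  # proxy did not respond: fall back without warning
        if status is not None and status != 403:
            return status, True
    # No proxy given, proxy unreachable, or proxy answered 403: go direct.
    return fetch(url, None), False
```

Passing the fetcher in as a parameter keeps the fallback logic separate from the network code, so it can be exercised without any real connection.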
The script fetches the page from the server and parses the HTML code. It looks for the following elements:
<form ...>
is not checked at all. URLs of the actions together with
the request methods are displayed. Submitting the form invokes an action on the server and
this should not be done by checking of the links.
<img ...>
is checked for the validity of the SRC attribute by the
HEAD method. It means that the server sends the information about the object but the image
is not downloaded. Special extensions such as DYNSRC are not checked. If you need them, I may add
them to a future version.
<body ...>
is searched for validity of the BACKGROUND attribute if
present. Similarly as above, the object is loaded by the HEAD method and verified that it
is of type "image/...". This feature is new since version 1.0c.
<a ...>
has two different forms. Its attribute NAME specifies a
label within the page and it is just stored in the parsed body. The HREF attribute points
to the linked object. Its existence is verified by the HEAD method. If the link contains a
fragment (label), the linked page must be loaded with the GET method in order to obtain the list of
labels.
<frame ...>
is treated similarly as above but it is always loaded
and checked even if Follow=Nothing was specified.
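As an illustration of the element list above, the following sketch collects the same attributes with Python's standard `html.parser`. It is not the IceBearSoft::ZWsgml parser; the class name `LinkCollector` and its field names are invented for this example, and defaulting a missing METHOD to GET is an assumption.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the attributes the text above says are inspected."""

    def __init__(self):
        super().__init__()
        self.forms = []        # (action, method): displayed, never submitted
        self.images = []       # <img src>: to be verified with HEAD
        self.backgrounds = []  # <body background>: must be image/...
        self.labels = []       # <a name>: labels within the page
        self.links = []        # <a href>: linked objects
        self.frames = []       # <frame src>: always loaded and checked

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)  # tag and attribute names arrive lowercased
        if tag == "form":
            method = (a.get("method") or "GET").upper()  # assumed default
            self.forms.append((a.get("action"), method))
        elif tag == "img" and a.get("src"):
            self.images.append(a["src"])
        elif tag == "body" and a.get("background"):
            self.backgrounds.append(a["background"])
        elif tag == "a":
            if a.get("name"):
                self.labels.append(a["name"])
            if a.get("href"):
                self.links.append(a["href"])
        elif tag == "frame" and a.get("src"):
            self.frames.append(a["src"])
```

A caller would feed the page text to `LinkCollector().feed(...)` and then verify each collected URL with the request method named in the list above.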
LinkChecker is based upon the HTTP protocol. It cannot check objects accessible via other protocols such as FTP, GOPHER, TELNET, NEWS, etc. It does not check e-mail addresses either. If you know how to program this in Perl5 and supply the code, I will add it to the next version.
I would also like to check HTTPS pages. Unfortunately, it is not implemented in the OS/2
version of curl
and I did not manage to force OpenSSL to download any page (probably I
am doing something wrong but I do not understand its documentation).
LinkChecker does not understand Java and JavaScript. Therefore links in applets and scripts are not checked.
Hypertext links are checked only in documents with Content-Type: text/html. Theoretically it might be extended to Gopher but I do not know its language.
LinkChecker does not ever try to verify authenticated documents. It merely displays an error message "401 Unauthorized" or similar.
WWW pages written in Czech often supply re-encoding via CGI scripts using GET method and PATH_INFO. These scripts may not correctly respond to the HEAD method and may even confuse persistent connections.
The result window starts with a chapter title and a short text containing a link to this help. The help will open in its own window specified by "TARGET=HELP". The same specification is used in the form, so that you can easily switch between your operating browser and this help. The link is also given at the end of almost every section.
Each page has its own section. The sections are numbered and the title is formed as a hypertext link. The page will open in its own window specified by "TARGET=TEST". Since all such links are directed to a window with an identical name, you will never get more than three browser windows and your system should not be overloaded. The section then contains the result of loading the object, i.e. its Status, URL, Content-Type, Last-Modified date and time and possibly Location of a moved document. The status of a successfully loaded object is "200 OK". If you requested display of full response headers, you will get much more detailed information but most of this is not interesting for normal use.
Each link is presented as a subsection numbered within the section. The subsection title is again formatted as a hypertext link. It is written in exactly the same form as in the checked document so that you can easily find it. The subsection title is followed by similar information as above.
LinkChecker must at least test the existence of the objects, and HTML pages must be loaded and
parsed in order to find the <META>
elements. It takes some time. If your page
contains a great many correct links, the LinkChecker will not send you anything for a long time.
Your browser may incorrectly treat it as a lost connection. When the LinkChecker finally has
something to send you, the connection will no longer exist. Therefore the LinkChecker sends you
results of all tests (and will always do so) in order to inform your browser that it still lives.
Anyway, all error and diagnostic messages are, at least partly, displayed in red colour. You can thus view the results very fast and search for red texts only.
A server may send redirection commands in several situations. Some of them are natural and no correction is needed. Sometimes a redirection informs the author of the page that the links should be changed. The reasons for redirections are briefly described below.
Imagine that you link to http://hroch486.icpf.cas.cz/webtools
within your page.
When a user selects this link, the server will recognize that webtools
is a directory
name and will send a default document. This document may contain relative links to other pages. If
the above mentioned URL is used, the relative links will be merged incorrectly. In order to make
everything correct, the server must force the browser to redirect to
http://hroch486.icpf.cas.cz/webtools/
which is achieved by sending status 301 with a
new location. If you see such redirection which only adds a trailing slash, you know it is this
case. All browsers understand it and correction is not necessary.
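The effect of the missing slash on relative links can be reproduced with Python's standard `urljoin`, which follows the usual relative-URL merging rules. This is only an illustration; the helper `only_adds_trailing_slash` is a hypothetical name, and the relative link `linkcheck-help.html` is borrowed from the installation notes elsewhere in this document.

```python
from urllib.parse import urljoin


def only_adds_trailing_slash(original: str, location: str) -> bool:
    """True if the redirect target is the original URL plus a final slash."""
    return location == original + "/"


base = "http://hroch486.icpf.cas.cz/webtools"

# Without the slash, "webtools" is treated as a file name and is replaced.
wrong = urljoin(base, "linkcheck-help.html")
# With the slash, the relative link is resolved inside the directory.
right = urljoin(base + "/", "linkcheck-help.html")

print(wrong)  # http://hroch486.icpf.cas.cz/linkcheck-help.html
print(right)  # http://hroch486.icpf.cas.cz/webtools/linkcheck-help.html
```

A 301 response for which `only_adds_trailing_slash` returns true is the harmless case described above.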
The server may recognize a generic name which is easy to remember and redirect you to the correct object. The objects may also be dynamic and change automatically with time but the generic name remains the same. Some script may redirect you to another computer which is currently "less busy". There may be other reasons connected mainly with Uniform Resource Name resolution. These redirections will be signalled by status 302 or 303 but 301 may also be used. There is no automatic way to recognize this type of redirection. It helps to view the linked object or contact the administrator (but remember that administrators are usually very busy; do not disturb them without good reason).
The object may be moved to another location with identical or changed name. The web administrator may configure the server to send status 301 with the new location, or, the authors of the pages (mainly if they lease space on someone else's servers) add META REFRESH. If you see this type of redirection, you should change the link to the new location. If you select the link to the page with META REFRESH, you will usually see some explanation message.
Since version 1.2 LinkChecker provides a kind of progress indication implemented in JavaScript. It does not matter if you disable JavaScript in your browser. In such a case only the progress indication will be unavailable but LinkChecker as such will continue to work. The progress indicator displays something like:
Testing #5; remaining 12 (7/26)
It means that LinkChecker currently verifies the fifth page, it knows about 12 other pages to be
verified, the current page contains 26 links and the seventh link is just being verified. The
number of pages to be verified is determined during verification, thus the number of remaining
pages increases during the run. If the verified link corresponds to a redirection, either by
<meta refresh>
or by 3xx status codes, the number of links on the current page
is increased and the URL resulting from the redirection is verified immediately. The progress
indication is thus only a very approximate measure of the remaining time.
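The format of the progress text can be made concrete with a small parser. This is a sketch in Python based only on the sample line shown above; the function name and the dictionary keys are invented for the example, and the real indicator is produced by JavaScript in the result page.

```python
import re

# Pattern derived from the sample "Testing #5; remaining 12 (7/26)".
PROGRESS = re.compile(r"Testing #(\d+); remaining (\d+) \((\d+)/(\d+)\)")


def parse_progress(title: str):
    """Split the progress text into its four counters, or return None."""
    match = PROGRESS.fullmatch(title.strip())
    if match is None:
        return None
    page, remaining, link, links = map(int, match.groups())
    return {
        "page": page,                  # page currently being verified
        "remaining_pages": remaining,  # pages known but not yet verified
        "link": link,                  # link currently being verified
        "links_on_page": links,        # links found on the current page
    }


print(parse_progress("Testing #5; remaining 12 (7/26)"))
```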
In addition, the script outputs its start time and end time. The times are output via the
toLocaleString
of the JavaScript's Date
object. If you do not like its
format, change the settings of your locale.
It might seem natural to display progress in the status bar. Unfortunately the status bar is often rewritten by the browser's logic during the page transfer. LinkChecker will thus safely report that the processing has finished which is of no use. The progress indication has therefore been implemented in the title bar. If you are using Mozilla with the tabbed view, you will see it, although sometimes with some delay, also in the corresponding tab. It should also be displayed in your window list. You can thus check the progress without scrolling the browser's window.
The summary table contains all types of diagnostic messages encountered, both errors and warnings. Each message is accompanied by the count of occurrences and a hyperlink to the last occurrence. Each error message displayed at the verified object is provided with a hyperlink to the previous occurrence of an error or warning of the same type, with the exception of the first occurrence which contains the link to the summary table. You can thus easily traverse the output without having to read the whole protocol.
Remember that an object may contain two different error or warning messages. Be sure that you follow the correct chain.
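The chaining of messages described above amounts to one backward-linked list per message type. The following Python sketch shows one way such links could be computed; it is not taken from LinkChecker, and the function name, the `"summary"` marker and the sample message texts are all assumptions.

```python
def chain_messages(messages):
    """Compute back links and the summary for messages in output order.

    back[i] is the index of the previous message of the same type, or the
    string "summary" for the first occurrence. summary maps each type to
    (count, index of the last occurrence).
    """
    last_seen = {}
    counts = {}
    back = []
    for index, kind in enumerate(messages):
        back.append(last_seen.get(kind, "summary"))
        last_seen[kind] = index
        counts[kind] = counts.get(kind, 0) + 1
    summary = {kind: (counts[kind], last_seen[kind]) for kind in counts}
    return back, summary


# Hypothetical message types, in the order they appear in the output.
back, summary = chain_messages(["404", "Label not found", "404", "404"])
print(back)     # ['summary', 'summary', 0, 2]
print(summary)  # {'404': (3, 3), 'Label not found': (1, 1)}
```

Starting from the summary entry and following the back links visits every occurrence of one message type from last to first, which is why two message types on the same object form two separate chains.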
There are many things which may go wrong. The errors are subdivided to several classes.
The parser of my own (package IceBearSoft::ZWsgml) is very simple, does not require DTD and can even handle some errors in the HTML code. However, it may sometimes fail. I am not sure whether it will handle all situations, e.g. if only part of the page is loaded due to network error. Probably the parser may die without any message.
Errors may be encountered within the package which communicates with the WWW servers. They will be displayed inside "Error-Message". Most often it will be "Host not found". However, it may happen that the Error-Message field is empty. I would then appreciate it if you sent me a bug report, and I will try to fix it.
These are various network errors as specified in RFC 2616 (update of RFC 2068). The status messages consist of a three-digit code and a reason phrase. Some servers may add a numeric subcode which is divided by a period from the status code. The reason phrases specified in RFC 2068 are only suggestions, the server may return any text which seems reasonable.
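A status message of the shape described above (three-digit code, optional subcode after a period, free-form reason phrase) could be split as in the following Python sketch. The function is hypothetical and is not part of IceBearSoft::Zwebfun; the sample value `403.6` is only an invented illustration of a subcode.

```python
import re

# code, optional ".subcode", optional reason phrase
STATUS = re.compile(r"(\d{3})(?:\.(\d+))?(?:\s+(.*))?")


def parse_status(text: str):
    """Return (code, subcode or None, reason phrase)."""
    match = STATUS.fullmatch(text.strip())
    if match is None:
        raise ValueError("not a status message: %r" % text)
    code, subcode, reason = match.groups()
    return int(code), (int(subcode) if subcode else None), reason or ""


print(parse_status("404 Not Found"))    # (404, None, 'Not Found')
print(parse_status("403.6 Forbidden"))  # (403, 6, 'Forbidden')
```

Because the reason phrase is only a suggestion, code that reacts to a response should compare the numeric code and never the text.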
You should not see these messages, and I do not know of any server which actually sends them. I am not sure that IceBearSoft::Zwebfun treats them correctly. Using a proxy should solve it, but I would appreciate information about such a server so that I can fix it in my package.
200 OK means success, other messages of this class are reserved for purposes which should not occur in link checking. Therefore they will be treated as errors.
These messages denote redirection and the LinkChecker tries to handle them.
"301 Moved permanently" is most often caused by missing trailing slash in the directory name. No correction is needed since all browsers will handle it. However, if the new location differs considerably from the specified URL, it is most probably a server generated response and the document is really moved. You should consider changing the link in your page.
302 and 303 specify alternative locations and the LinkChecker automatically tries to verify the redirected documents. These are temporary redirections and you should not change your page.
305 informs that you have to use the specified proxy and the LinkChecker will do it.
Further information is available in Understanding redirections.
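The handling of the status classes described in this chapter can be summarised as a small decision function. This Python sketch only restates the text; the function name and the returned labels are invented, and the treatment of 3xx codes not named in this document is an assumption.

```python
def planned_action(code: int) -> str:
    """Map an HTTP status code to the reaction described in the text."""
    if code == 200:
        return "ok"
    if 200 <= code < 300:
        return "error"               # other 2xx should not occur here
    if code == 301:
        return "follow-and-review"   # harmless if only a slash was added
    if code in (302, 303):
        return "follow-keep-link"    # temporary: do not change the page
    if code == 305:
        return "retry-via-proxy"     # use the proxy named by the server
    if 300 <= code < 400:
        return "other-redirection"   # assumption: not covered by the text
    return "display"                 # all remaining codes are only shown


for status in (200, 204, 301, 302, 305, 404):
    print(status, planned_action(status))
```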
These messages are just displayed without any further action.
LinkChecker may encounter additional error conditions. They are usually displayed in red color.
The document uses other protocol than HTTP or is not of type text/html.
You tried, either directly or indirectly, to verify a document which is not of type text/html.
LinkChecker found the object but did not find the label.
Forms cannot be checked. Actions and request methods are displayed.
LinkChecker tried to use the proxy specified by the 305 message but either the proxy does not respond or it did not help.
Package IceBearSoft::Zwebfun did not send any response, not even the error message. I hope this will never happen.
LinkChecker connects to the Internet via a proxy but the proxy found that it is not possible to connect to the server. The proxy did not return any specific error message.
LinkChecker was unable to download robots.txt
from the server either due
to 5** error or because the server did not respond at all or host was not found.
LinkChecker will no longer try to connect to this server in order to make checking faster
and will display this message.
The resource headers were successfully received via the GET method but the parsed body cannot be found. One of the following happened:
robots.txt
(this will be written in the Error-Message
field).
<META>
element does not allow robots to follow the links within this
page
The page contains <META NAME="ROBOTS" CONTENT="... NOFOLLOW ...">
(the tokens
are not case sensitive). This command is recognized by the Robots Exclusion Protocol and does not
allow robots to follow the links. If you are an author of the page, you may temporarily remove the
command (preferably by changing its angle brackets to comment marks <!--
and
-->
). The next version may recognize a META element which will instruct Link
Checker to ignore META NOFOLLOW (see To Do List later in this document).
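The NOFOLLOW test described above can be sketched with Python's standard `html.parser`. This is an illustration, not the author's parser; the class and function names are invented, and splitting the CONTENT value on commas and whitespace is an assumption about how the tokens are separated.

```python
import re
from html.parser import HTMLParser


class RobotsMeta(HTMLParser):
    """Set self.nofollow when a META ROBOTS element contains NOFOLLOW."""

    def __init__(self):
        super().__init__()
        self.nofollow = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        if (a.get("name") or "").lower() != "robots":
            return
        # Tokens are compared without regard to case.
        tokens = re.split(r"[,\s]+", (a.get("content") or "").lower())
        if "nofollow" in tokens:
            self.nofollow = True


def forbids_following(html: str) -> bool:
    parser = RobotsMeta()
    parser.feed(html)
    return parser.nofollow


print(forbids_following('<META NAME="ROBOTS" CONTENT="NOINDEX, NoFollow">'))
print(forbids_following('<!--META NAME="ROBOTS" CONTENT="NOFOLLOW"-->'))
```

The second call shows why changing the angle brackets to comment marks disables the command: the parser sees a comment, not a META element.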
This message informs that the page contains META REFRESH. The LinkChecker will verify both the original page and the redirected resource. This message is also explained in Object moved.
The link is redirected to another object which is redirected to another object which is redirected to another object... This may be an infinite loop. Examine the chain of links. If there is really a loop, there is no help. If not, try to select the last link of the chain just above this message. If your browser shows you something reasonable, the chain is really so long and you should correct it in your document. You can also note the chain and verify the last object by the LinkChecker. This will give you several next steps. Remember that according to RFC 2068 only 5 redirections are allowed. If you leave the redirection chain too long, the browsers should complain about an infinite loop in redirections.
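The difference between a real loop and a chain that is merely too long can be shown with a short sketch. It is written in Python and replaces real HTTP requests with a dictionary of redirections, which is purely an assumption for the example; the limit of 5 is the figure quoted above.

```python
MAX_REDIRECTS = 5  # limit quoted in the text above


def follow_chain(start, redirects):
    """Follow redirections; return (chain, verdict).

    redirects is a hypothetical stand-in for the network: it maps a URL
    to the location it redirects to. verdict is "ok", "loop" or "too long".
    """
    chain = [start]
    while chain[-1] in redirects:
        target = redirects[chain[-1]]
        if target in chain:
            return chain + [target], "loop"
        if len(chain) - 1 >= MAX_REDIRECTS:
            return chain, "too long"
        chain.append(target)
    return chain, "ok"


print(follow_chain("a", {"a": "b", "b": "c"}))  # (['a', 'b', 'c'], 'ok')
print(follow_chain("a", {"a": "b", "b": "a"}))  # (['a', 'b', 'a'], 'loop')
```

When the verdict is "too long", the last element of the chain is the URL to verify in a separate run, as suggested above.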
The object referenced in the <IMG> element exists but is not an image. Possible reasons are:
This message only appears in the server's error log. It means that the connection to the browser was broken (either by a network error or by the user) before LinkChecker sent its output. There is no way to send this message to the user.
If something goes wrong, you should first save the result into an HTML file. Afterwards return to the form and note your input. Try the same again in order to see whether the condition is persistent. You can also try to use a proxy or HTTP/1.0. If the error is persistent, mail me the saved result as an attachment and include the input you entered into the form. Do not expect a fast response. I am very busy with other projects but I will certainly read your message and try to fix the bug.
Now you can go to the Link Check Form. This help may be opened again from both the form and the results.
These technical details should inform web administrators how to prevent LinkChecker from accessing parts of their web servers.
The LinkChecker identifies itself as User-Agent: LinkChecker/x.y where x.y is its version. The current version is 1.2.
The LinkChecker reads robots.txt
and obeys commands for User-Agent:
LinkChecker. The test is not case sensitive in the current version as well as in all
previous versions.
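For administrators, the following sketch shows how a robots.txt record for LinkChecker behaves. It uses Python's standard `urllib.robotparser`, not LinkChecker's own code, so details of matching may differ; the host `www.example.org` and the paths are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the record name is written in lower case
# to show that matching is not case sensitive.
ROBOTS_TXT = """\
User-agent: linkchecker
Disallow: /private/

User-agent: *
Disallow: /cgi-bin/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

agent = "LinkChecker/1.2"
blocked = rules.can_fetch(agent, "http://www.example.org/private/page.html")
allowed = rules.can_fetch(agent, "http://www.example.org/index.html")
print(blocked, allowed)  # False True
```

In this standard-library parser a record that names the robot takes precedence over the asterisk record, so an administrator who wants both restrictions to apply to LinkChecker should repeat the Disallow lines in its own record.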
The LinkChecker writes "Browser died" (with some irrelevant information) to the server's error log if either the user had stopped his/her browser or connection had been broken before all output was sent.
First you need a WWW server. LinkChecker is tested with Apache but it will hopefully work with other servers. Send me a note if other servers require modifications and I will enhance the code of the next version.
LinkChecker is written in Perl5. It is object oriented and requires sockets. Be sure that you have the correct version of Perl properly installed.
LinkChecker relies upon several modules which are part of the IceBearSoft Perl Package.
Be sure that you have this package properly installed. Try to connect with
http.pl
to your server. If it does not work, LinkChecker will not work
either.
The distributions of the IceBearSoft Perl Package as well as LinkChecker are available from my page.
LinkChecker is distributed as a part of the IceBearSoft Perl Package which has its own
installation script. You should follow the manual of the whole package (see
manual.html
in the doc
subdirectory). You can then customize LinkChecker.
If you create a new directory for LinkChecker's HTML file, you
should create index.html
or rename linkcheck-help.html
to
index.html
. Then open linkcheck.html
and customize it
according to your conditions. Look for <form ...
action="/cgi-bin/linkcheck.pl/webtools/linkcheck-help.html">
. Change
cgi-bin
to your directory of CGI scripts. Change
/webtools/linkcheck-help.html
to your path to the help file. If you renamed
the file to index.html
or something else, you must also modify the link above
the first comment. The name or IP address of the default proxy must be specified in the
DEFAULT_PROXY
environment variable. Be sure that Apache passes this variable to the
CGI scripts. Alternatively it may be set directly in the Apache configuration file. Remove
attribute checked
if you do not wish to have it as default. If you wish to change the
read timeout for HTTP connections, do it in linkcheck.pl
.
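How a CGI script typically picks up such a variable is sketched below. The example is in Python, not the Perl of linkcheck.pl, and the function name is invented; only the variable name DEFAULT_PROXY comes from the text. In Apache the variable is usually handed to CGI scripts with the mod_env directives PassEnv or SetEnv.

```python
import os


def default_proxy(environ=os.environ):
    """Return the default proxy from the environment, or None if unset."""
    value = environ.get("DEFAULT_PROXY", "").strip()
    return value or None


# The mapping argument lets the function be tried without a web server.
print(default_proxy({"DEFAULT_PROXY": "proxy.example.org"}))
print(default_proxy({}))
```

If the function returns None inside the CGI script although the variable is set for the server, the variable is most likely not being passed through to the script's environment.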
You can download the whole IceBearSoft Perl Package, which contains LinkChecker, from the software section of my page.
Version 1.0a introduces the following changes:
robots.txt
and obeys both an asterisk (meaning all robots) and its own
name, i.e. LinkChecker.
<META NAME="ROBOTS" CONTENT="... nofollow ...">
. Token "nofollow" may
appear in connection with other tokens.
<IMG SRC="...">
is not of type image/...
linkcheck.html
). This will make transfer to other servers easier.
Version 1.0b contains a few changes needed due to reimplementation of Zwebfun.
Version 1.0c adds the following changes:
robots.txt
due to 5** error or because the server did not respond at all or
the host was not found. This makes checking considerably faster.
<BODY BACKGROUND=>
. It is needed if the whole
tree is generated automatically, e.g. when preparing off-line version of WWW pages for
distribution on CD.
linkcheck.pl
to
Linkchecker.pm
. Therefore linkcheck.pl
will most probably remain
unchanged in the future versions and the administrator will not lose his or her
customization.
Linkchecker.pm
and linkcheck.html
Changes from 1.0c to 1.2:
DEFAULT_PROXY
environment variable.