© J R Stockton, ≥ 2009-11-22

Check Local Links and Anchors.

This page requires include1.js
and wants styles-a.css.

Links Check

This form spiders a local copy of a Web site tree, starting at the page indicated, and working outwards, mainly to locate Anchor errors. Miscellanea are reported. Pages are read into an iframe and parsed by the browser. See also program CHEKLINX.EXE, via index.

Form controls:
  • Directory File
  • Starting File
  • Root
  • Readable Extensions
  • Page loaded by timeout, in ms
  • Shibboleth (followed by a field showing g)
  • Checkboxes: Self, GoUp, WkDy

Status

The table may change or be out-of-date.

              Effects
Browser       Misc.  Timeout!=0  Timeout==0  Read Self  Go Up    Usable
MS IE 8       [1]    OK          NO          YES        yes [5]  yes
Firefox 3.0          OK          OK          NO [2]     NO [3]   YES
Opera 9.6            OK          OK          NO [4]     YES      YES
Safari 4.0           OK          OK          YES        YES      YES
Chrome 3.0           OK          OK          YES        YES      YES

Notes

This page has been developed mainly in Chrome (which is fast) and Opera (where success came first), using Windows XP sp3.

The pages scanned should be free of HTML and onload script syntax errors.

It is assumed that all folder and file names (but not anchor names) within the site will be lower-case on the server, and that therefore they will be lower-case in the links arrays and can be made lower-case within this code.

Program CHEKLINX.EXE reads the files exactly as on disc, with simplified parsing. This LINXCHEK.HTM uses the page structure at the completion of loading. Scripts executed during loading can, but commonly do not, add and/or remove anchors and/or links.

Algorithm

Names can be stored as string elements of an Array. Alternatively, they can be used to name properties of an Object. That avoids name duplication, and one can determine without an explicit search whether a name is already present. Such Objects are used here to handle page names, anchor names, and names of pages linked to.
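As an illustration only, the following sketch shows names held as properties of an Object. The function noteName and the store PageNames are hypothetical names, not taken from the code of this page; the hasOwnProperty test is one way of ensuring that inherited names such as toString are not mistaken for stored ones.

```javascript
var hasOwn = Object.prototype.hasOwnProperty;

// Record a name as a property of store, counting occurrences.
// Returns true if the name was new, false if it was already present.
function noteName(store, name) {
  if (hasOwn.call(store, name)) {
    store[name]++;
    return false;
  }
  store[name] = 1;
  return true;
}

var PageNames = {};                  // hypothetical store
noteName(PageNames, "index.htm");    // new
noteName(PageNames, "astron-1.htm"); // new
noteName(PageNames, "index.htm");    // already present; its count becomes 2
```

The property lookup replaces a loop over an Array, and a repeated name can never be stored twice.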

The Directory File (if any) is read first, and its lines are stored as property names of such an Object.

A complete Entry
{Name: string,
 Shib: number,
 Ankas: object,
 Dupes: number,
 Cites: object,
 Next: object}

The named page is read next. Page data is held in a linked list, and a complete entry has the form boxed above as "A complete Entry". When a page is read, its anchors and links arrays are attached to Ankas and Cites. Each new name in Cites is added to the list if it is not a folder name, exists on the disc (as judged from the object FromDIR), and has an extension given as acceptable. An object PagesObj is given an entry named for each file added to the list, and is used to determine whether a name is new. While a page is being read, a similar object detects duplicate anchors.
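The queueing rule can be sketched as below. This is not the code of this page: the functions newEntry, isReadable and queueCites, the object Xtns, and the use of names relative to the Root with a trailing slash on folders are all assumptions made for the example. Only the entry fields, FromDIR and PagesObj come from the description above.

```javascript
var hasOwn = Object.prototype.hasOwnProperty;

// Create an entry with the fields of "A complete Entry".
function newEntry(name) {
  return { Name: name, Shib: 0, Ankas: {}, Dupes: 0, Cites: {}, Next: null };
}

// Hypothetical test: not a folder, present in FromDIR, acceptable extension.
function isReadable(name, FromDIR, Xtns) {
  if (/\/$/.test(name)) return false;            // folder name (assumed to end in /)
  if (!hasOwn.call(FromDIR, name)) return false; // not on disc, per the Directory File
  var m = /\.([^.\/]+)$/.exec(name);             // extension after the last dot
  return m !== null && hasOwn.call(Xtns, m[1]);
}

// Append to the list each page cited by entry that is readable and not yet listed.
// Returns the new tail of the list.
function queueCites(entry, tail, PagesObj, FromDIR, Xtns) {
  for (var name in entry.Cites) {
    if (!hasOwn.call(entry.Cites, name)) continue;
    if (hasOwn.call(PagesObj, name)) continue;   // already in the list
    if (!isReadable(name, FromDIR, Xtns)) continue;
    PagesObj[name] = true;
    tail.Next = newEntry(name);
    tail = tail.Next;
  }
  return tail;
}

// Usage, with made-up data:
var FromDIR = { "index.htm": true, "astron-1.htm": true,
                "programs/": true, "programs/someprog.pas": true };
var Xtns = { htm: true };
var PagesObj = { "index.htm": true };
var head = newEntry("index.htm");
head.Cites = { "astron-1.htm": true, "programs/": true,
               "programs/someprog.pas": true, "missing.htm": true, "index.htm": true };
var tail = queueCites(head, head, PagesObj, FromDIR, Xtns);
// Only astron-1.htm is queued; the others are a folder, an unlisted extension,
// a file absent from FromDIR, and a page already in the list.
```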

After all necessary pages have been read, the entries are scanned to see whether all anchors are present on the appropriate pages, and whether any are duplicated on a page, etc.
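A minimal sketch of such a final scan follows. It assumes, purely for illustration, that each name in Cites has the form "page#anchor" (or "#anchor" for the same page) and that Ankas has one property per anchor found on the page; the function findMissingAnchors is hypothetical and the real data layout may differ.

```javascript
var hasOwn = Object.prototype.hasOwnProperty;

// Walk the linked list and report cited anchors that are absent from a page that was read.
function findMissingAnchors(head) {
  var byName = {}, problems = [], e, cite, hash, page, anchor, target;
  for (e = head; e; e = e.Next) byName[e.Name] = e;   // index entries by page name
  for (e = head; e; e = e.Next) {
    for (cite in e.Cites) {
      if (!hasOwn.call(e.Cites, cite)) continue;
      hash = cite.indexOf("#");
      if (hash < 0) continue;                         // link to a page, not to an anchor
      page = cite.substring(0, hash) || e.Name;       // "#x" refers to the citing page
      anchor = cite.substring(hash + 1);
      target = hasOwn.call(byName, page) ? byName[page] : null;
      if (target && !hasOwn.call(target.Ankas, anchor)) {
        problems.push(e.Name + " cites missing " + page + "#" + anchor);
      }
    }
  }
  return problems;
}

// Usage, with two made-up entries:
var a = { Name: "a.htm", Shib: 0, Ankas: { top: 1 }, Dupes: 0,
          Cites: { "b.htm#intro": 1, "b.htm#gone": 1, "#top": 1 }, Next: null };
var b = { Name: "b.htm", Shib: 0, Ankas: { intro: 1 }, Dupes: 0,
          Cites: { "a.htm": 1 }, Next: null };
a.Next = b;
var problems = findMissingAnchors(a);
// problems holds a single report, for b.htm#gone
```

Pages that were never read are skipped rather than reported, since nothing is known about their anchors.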

Operation

Loading

The code uses simple browser-testing in order that, at least on my machine, the controls are initially set suitably.

If the page is invoked with something like linxchek.htm?GoAt=page.htm&Tout=0 (case-dependent query part) then it will immediately run beginning at page.htm, with elements of the Form optionally set by name from the query string. Form input controls currently are, in order, GoAt Xtns Tout Shib Smod Self GoUp WkDy . Use 0/1 / true/false for checkboxes.
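The handling of such a query string might look like the sketch below. The functions parseQuery and toChecked are hypothetical; in a browser the argument would normally be location.search. Names are compared exactly, matching the case-dependence noted above.

```javascript
// Split "?GoAt=page.htm&Tout=0" into an object mapping names to value strings.
function parseQuery(search) {
  var out = {}, parts, i, eq;
  if (search.charAt(0) === "?") search = search.substring(1);
  if (search === "") return out;
  parts = search.split("&");
  for (i = 0; i < parts.length; i++) {
    eq = parts[i].indexOf("=");
    if (eq < 0) continue;                       // ignore items without "="
    out[parts[i].substring(0, eq)] = decodeURIComponent(parts[i].substring(eq + 1));
  }
  return out;
}

// Interpret a checkbox value written as 0/1 or true/false.
function toChecked(text) {
  return text === "1" || text === "true";
}

var q = parseQuery("?GoAt=page.htm&Tout=0&Self=true");
// q.GoAt is "page.htm", q.Tout is "0", toChecked(q.Self) is true;
// q.goat is undefined, because names are case-dependent
```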

Otherwise, you must set the controls and press the long button in the usual way.

Directory File

A "Directory File" should be named. Otherwise, linked site files will be read on the presumption that they exist. If they do not, the consequences may be browser-dependent.

c:\current\astron-1.htm
c:\current\programs\
c:\current\programs\someprog.pas

Its contents must resemble what, at a Windows XP command prompt, is given by DIR /B /S > $DIR.TXT (see the sample listing above). The characters / and \ are equivalent; directories do not need a trailing slash; case does not matter. If you test on the system that you are using to read this, the c:\current\ part must match the corresponding part of what appears on yellow above after ' this : '.

The program will then read only page files that exist and have listed extensions.
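The rules for the Directory File can be sketched as a normalisation step followed by storage in an Object. The functions normPath and readDirText are hypothetical, and keeping the full path (including the Root part) as the property name is an assumption.

```javascript
// Normalise one path: / and \ are equivalent, case is ignored,
// and a trailing slash on a directory is optional.
function normPath(line) {
  return line.replace(/\\/g, "/").toLowerCase().replace(/\/+$/, "");
}

// Build a lookup Object from the text of a Directory File, one path per line.
function readDirText(text) {
  var FromDIR = {}, lines = text.split(/\r?\n/), i, key;
  for (i = 0; i < lines.length; i++) {
    key = normPath(lines[i].replace(/^\s+|\s+$/g, "")); // trim, then normalise
    if (key !== "") FromDIR[key] = true;
  }
  return FromDIR;
}

var listing = "c:\\current\\astron-1.htm\r\n" +
              "c:\\current\\programs\\\r\n" +
              "C:/Current/Programs/SomeProg.PAS\r\n";
var FromDIR = readDirText(listing);
// FromDIR has the property names c:/current/astron-1.htm, c:/current/programs
// and c:/current/programs/someprog.pas
```

A link target would be passed through the same normPath before being looked up, so that both sides of the comparison use one form.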

The decision as to whether a file is deemed Present or Missing depends entirely on the Directory File that the user provides. Peculiar cases may remain to be handled. For example, a link to a file name containing ^ was earlier considered to be a link to a file with %5E in that position.

The Directory File is read using the iframe. Its contents are read with textContent or innerText.

Dates

When the WkDy box is set, the pages are scanned for anything like an ISO 8601 date (yyyy-mm-dd) followed by whitespace then exactly three letters (XXX). Whenever that is found, the date part is checked for validity and the day-of-week of the date is compared with XXX. Any apparent error is reported in a Confirm box. Certain common XXXs, such as "the" and "and", are disregarded. The code to do this was taken from Day-of-Week Checking in JavaScript Miscellany 1, with subsequent changes.
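The check can be illustrated as follows. This is not the code from JavaScript Miscellany 1; it is a simplified sketch that handles four-digit years only, uses English three-letter day names, and ignores only the two words quoted above. The function checkDates and the tables DAY_NAMES and IGNORED are hypothetical.

```javascript
var DAY_NAMES = ["sun", "mon", "tue", "wed", "thu", "fri", "sat"];
var IGNORED = { the: true, and: true }; // assumed; the full list is not given here

// Return a list of reports for dates that are invalid or carry the wrong day name.
function checkDates(text) {
  // yyyy-mm-dd, whitespace, then exactly three letters (no fourth letter following)
  var re = /\b(\d{4})-(\d\d)-(\d\d)\s+([A-Za-z]{3})(?![A-Za-z])/g;
  var found = [], m, y, mo, d, dt, word, stamp;
  while ((m = re.exec(text)) !== null) {
    word = m[4].toLowerCase();
    if (IGNORED.hasOwnProperty(word)) continue;
    y = +m[1]; mo = +m[2]; d = +m[3];
    dt = new Date(0);
    dt.setUTCFullYear(y, mo - 1, d);       // avoids the two-digit-year rule of Date.UTC
    stamp = m[1] + "-" + m[2] + "-" + m[3];
    if (dt.getUTCFullYear() !== y || dt.getUTCMonth() !== mo - 1 || dt.getUTCDate() !== d) {
      found.push(stamp + " is not a valid date");   // the fields did not round-trip
    } else if (DAY_NAMES[dt.getUTCDay()] !== word) {
      found.push(stamp + " " + m[4] + " should be " + DAY_NAMES[dt.getUTCDay()]);
    }
  }
  return found;
}

var reports = checkDates(
  "Updated 2009-11-22 Sun and 2009-11-22 Mon; 2009-02-30 Mon; 2009-11-22 the end");
// Two reports: the Mon that should be sun, and the invalid 2009-02-30
```

Any valid date followed by an unrelated three-letter word not in IGNORED is reported, which is the source of the false positives discussed below.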

Years above 275350 are not checked. Negative years may be a problem?

The best ways of dealing with false positives include
  • If the date is intentionally invalid, use another separator;
  • Rephrase so that the date, as displayed, is not followed by a TLA.

This option slows the program considerably, in some browsers.

Starting

There may be no test for the initially-named page being missing; but that will rapidly become obvious.

Misc

A browser may not let a page load a copy of itself into its own iframe. If Self coerces to false, this page will not be queued for reading. To check that, try it as Start File.

A browser may not be able to handle pages in subdirectories here. If GoUp coerces to false, subdirectory pages will not be queued for reading.

If a file or directory name could be interpreted by JavaScript as a number, then there could possibly be a clash with another name that corresponded to the same number.

Some browsers MAY get confused about which directory a page is in.

With some browsers, e.g. Opera 10.01, it may be necessary that the pages scanned are free of "major" errors.

Running

The code may seem slow to start; maybe a browser needs to get resources.

The code may appear to run in fits and starts.

The browser's error-display system will indicate, for the pages scanned, any recognised errors in their HTML and in any of their scripts that run during or on page-load. Dismissing such an error should allow this page to proceed.

After Scanning

A run is completed and analysed when the status line of the form goes green, the iframe vanishes, and more buttons appear instead, within the blue Form. Press them in any sequence.

Notes to Me

More into Consolidation? Check drive matches $dir.txt. Remove Root from FromDIR indexes?? Test a single file??

Glossary

LocalOutLinks
Links to files on the current machine but not in the directory of the starting file or subdirectories thereof.
Page loading timeout
An entry of zero or blank means that the code will try to detect automatically when each page has finished loading into the iframe; that does not work in all browsers. Otherwise, each page is read after the given delay, in ms, from the call for loading. If the delay is too short, later links and anchors, etc., will not be found. Try 300 ms.
Readable Extensions
Extensions of files which can be safely loaded into the iframe. Files with other extensions are not read.
Root
It is currently assumed that your copy of this page, your Directory File, and the Start File are all in the same directory; and that all pages of the site are in that directory or its subdirectories. That directory is here called the Root.
Shibboleth
An argument for new RegExp(); its matches are counted in the body.textContent or body.innerText of each page read. Do not double backslashes. The results are given in Consolidation and EntitySummary.
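For illustration, counting a Shibboleth might be sketched as below; the function countShibboleth is hypothetical. Backslashes are not doubled in the form because the field supplies the pattern as plain text, whereas a JavaScript string literal, as in this example, needs "\\d" to hold the two characters \d.

```javascript
// Count the matches of a pattern, supplied as text, in the text of a page.
function countShibboleth(pattern, text) {
  var re = new RegExp(pattern, "g");   // "g" so that every match is found
  var hits = text.match(re);           // null when there is no match
  return hits ? hits.length : 0;
}

var count = countShibboleth("\\d+", "a1 b22 c333");
// count is 3; in the form field the same pattern would be typed as \d+
```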
© Dr J R Stockton, near London, UK.
All Rights Reserved.
These pages are tested mainly with Firefox 3.0 and W3's Tidy.
This site, http://www.merlyn.demon.co.uk/, is maintained by me.