"TPT User's Guide"

Chapter 14 html_look_integrate.pl

The html_look_integrate.pl Perl program and its companion html_look_integrate.pm (the companion file can have any name) are intended as an internal web spider tool. You give it a starting point (an HTML file) and it traces the hyperlinks. While tracing, it creates a hierarchical table of contents. The operation is complete when no unvisited HTML files remain among the hyperlink destinations in any of the files.

Overview

This tool came out of the need to provide a table of contents structure for HTML documentation inherited from an outside vendor, which had none. Moreover, the HTML documentation could be updated whenever the vendor released new software or whenever we ran their tools to generate HTML documentation for our enhancements to their software.

The html_look_integrate.pl Perl program is not perfect, but it can be useful. This is how it works.

1 You provide it with one or more top-level starting files.

2 It opens the starting file and looks for any hyperlinks to other files.

3 A list of child files is created for the owning document from the destination files of its hyperlinks.

» If the destination file is a starting file, it is added to the list but its contents are not traced.

» Likewise, if the destination file has already been processed through some other path, it is added to the list but its contents are not traced again.

4 Each child file that requires tracing is opened in turn, and step 3 is repeated with the child now being the owning document.
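The steps above can be sketched in a few lines. This is an illustration in Python, not the tool's Perl source. The names LinkCollector, find_links, trace, and read_file are hypothetical, and the breadth-first visiting order is an assumption, since the order in which the real program visits files is not stated here. The sketch only builds the child list for each owning document. It does not filter out hyperlinks that leave the local file set.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Drop any #fragment so links into one file compare equal.
                    self.links.append(value.split("#", 1)[0])


def find_links(html_text):
    """Return the distinct hyperlink destinations, first occurrence first."""
    parser = LinkCollector()
    parser.feed(html_text)
    return list(dict.fromkeys(link for link in parser.links if link))


def trace(top_files, read_file):
    """Return {owning document: [child files]} for every traced document.

    read_file is a hypothetical callable mapping a file name to HTML text.
    """
    children = {}
    visited = set(top_files)   # starting files are listed but never re-traced
    pending = list(top_files)  # step 1
    while pending:
        owner = pending.pop(0)
        children[owner] = find_links(read_file(owner))  # steps 2 and 3
        for child in children[owner]:
            if child not in visited:  # step 4: trace only unseen files
                visited.add(child)
                pending.append(child)
    return children


# Usage with three in-memory pages standing in for files on disk.
pages = {
    "index.html": '<a href="a.html">A</a> <a href="b.html#top">B</a>',
    "a.html": '<a href="b.html">B</a> <a href="index.html">Home</a>',
    "b.html": "<p>No links here.</p>",
}
toc = trace(["index.html"], pages.__getitem__)
```

In this example a.html lists index.html as a child, but index.html is not traced a second time, which is how the loop is avoided.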

The imperfections of the html_look_integrate.pl perl program that immediately come to light:

• If a document is already a type of table of contents with some hierarchical structure, this tracing flattens that structure out giving equal weight to all found hyperlinks.

• If a document makes an off-hand reference hyperlink to a related topic document, this tool may explore that hierarchy early in the process and place its structure out of context under the “owning” document.

• In terms of tracing, this gives equal weighting to child hyperlink references that are hierarchical and those that are mentioned more in passing, such as copyright hyperlinks or cross-reference hyperlinks.

• To avoid loops, this remembers all visited documents and only places them into the hierarchy once.

• This expects all hyperlinks and cross-referencing to be localized. Hyperlinks out into the greater WWW could result in much churning as it builds a tracing structure for the Web instead of for the local HTML documents.

Input

input_scope The file path and name of a Perl package used to limit the scope of the search and to help structure the output. The html_look_integrate.pm file can be used as an example.

html_tmpl The template for generated files. This is typically voyant_master_nav.html. Various fields are extracted from it (inside <head>, top of <body>, bottom of <body>) and inserted into the HTML files.

The input_scope file is the key to successful operation. This file will require tweaking as you experiment with the output.

Caution! This is a code file, so be careful about syntax.

The input_scope file is required to have:

@top_files, an array of files that determines the starting points for all subsequent traces. If the destination of a hyperlink is one of these array items, further tracing is stopped. The order of the items in this array is important, because it determines the order of items in the traced hierarchy.

@ex_as_child, an array of files that are excluded as children, and from further tracing, when found as the destination of a hyperlink reference. The intent is to exclude referenced HTML files that are themselves table of contents or hierarchy summary files.
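The effect of the two arrays on a single hyperlink destination can be summarized as a small decision function. This is a sketch in Python rather than the tool's Perl. The function name classify_link and its return values are hypothetical. Checking @ex_as_child before @top_files is also an assumption, since the precedence between the two arrays is not stated here.

```python
def classify_link(dest, top_files, ex_as_child, visited):
    """Decide what happens to one hyperlink destination.

    Returns "exclude", "list_only", or "trace".
    """
    if dest in ex_as_child:
        return "exclude"    # not added as a child and not traced
    if dest in top_files or dest in visited:
        return "list_only"  # added as a child, contents not traced
    return "trace"          # added as a child and traced in turn


# Usage with hypothetical file names.
top_files = ["intro.html", "reference.html"]
ex_as_child = ["vendor_toc.html"]
visited = {"intro.html", "setup.html"}

decisions = {
    name: classify_link(name, top_files, ex_as_child, visited)
    for name in ["vendor_toc.html", "reference.html", "setup.html", "new.html"]
}
```

Because the items in @top_files also set the order of the top level of the hierarchy, reordering that array is usually the first thing to try when tweaking the output.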

You may need to run the html_look_integrate.pl Perl program many times, tweaking the input_scope file between runs, to get the desired output.

Output

The html_look_integrate.pl Perl program generates a tree.script file that can be used by the Java TOC Applet.

In addition, the program uses the html_tmpl template for the generated files. This is typically voyant_master_nav.html. Various fields are extracted from the master (html_tmpl), namely the contents inside <head>, at the top of <body>, and at the bottom of <body>, and inserted into the HTML files. The purpose of this is to give a consistent look and feel to the HTML system.

In other words, all of the HTML files are overwritten. The general content doesn’t change. What changes is:

• any navigation for browsing at top of <body>.

• any generation date and copyright information at the bottom of <body>.

• any CSS, formats, or applet definitions within the <head>.
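The overwrite can be pictured as splicing three regions from the master into each page. The following is a sketch in Python, not the tool's Perl. How the real program delimits the top and bottom of <body> is not described here, so the two marker comments are hypothetical, as are the names split_body and apply_template. The sketch also replaces the whole <head>, including any <title>, which is a simplification. It assumes both the template and the page contain a <head> and a <body>.

```python
import re

# Hypothetical marker comments; the real delimiters are not documented here.
TOP_END = "<!-- top-end -->"
BOTTOM_BEGIN = "<!-- bottom-begin -->"

HEAD_RE = re.compile(r"(<head\b[^>]*>)(.*?)(</head>)", re.S | re.I)
BODY_RE = re.compile(r"(<body\b[^>]*>)(.*?)(</body>)", re.S | re.I)


def split_body(body):
    """Split body text into (top, content, bottom) around the two markers."""
    top, found_top, rest = body.partition(TOP_END)
    if not found_top:
        # No marker yet (first run): the whole body is content.
        top, rest = "", body
    content, _, bottom = rest.partition(BOTTOM_BEGIN)
    return top, content, bottom


def apply_template(template, page):
    """Return page with its head, body top, and body bottom from template."""
    t_head = HEAD_RE.search(template).group(2)
    t_top, _, t_bottom = split_body(BODY_RE.search(template).group(2))
    _, content, _ = split_body(BODY_RE.search(page).group(2))

    # Functions are used as replacements so backslashes are never reinterpreted.
    page = HEAD_RE.sub(lambda m: m.group(1) + t_head + m.group(3), page, count=1)
    new_body = t_top + TOP_END + content + BOTTOM_BEGIN + t_bottom
    return BODY_RE.sub(lambda m: m.group(1) + new_body + m.group(3), page, count=1)


# Usage with a tiny master and one page.
template = (
    "<html><head><style>p{}</style></head><body><p>NAV</p>"
    "<!-- top-end -->ignored<!-- bottom-begin --><p>FOOT</p></body></html>"
)
page = (
    "<html><head><title>x</title></head><body>old nav"
    "<!-- top-end --><h1>Doc</h1><!-- bottom-begin -->old foot</body></html>"
)
result = apply_template(template, page)
```

Keeping the markers in the output means a later run finds the same content region again, so running the program repeatedly does not stack up navigation or footers.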




Open-Source tools compliments of Voyant Technologies, Inc. and Glenn C. Maxey.
01/13/2003

TP Tools v2-00-0a

# tpt-hug-02