The voyant_indexer.pl Perl program takes the following parameters:
directory_name path location of the index files. The name should be terminated with a backslash (\). Although directory_name is the first command-line parameter, the true input is the set of index files contained within that directory. Index file names must begin with "index_".
master_tree_file [optional path and] filename for the HTML file to use as a template for all index files to be generated. Ideally, this file should contain navigation links for moving between the generated index files [a-z] and other parts of the system, such as the table of contents. The file must contain several specially flagged HTML comment sections.
ignore_terms_file [optional path and] filename for a text file that contains words to ignore in the word-chunking process.
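The file-naming rules for directory_name can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the voyant_indexer.pl code; the function name find_index_files is hypothetical. It assumes only the two rules stated in this document: index files begin with "index_" and cannot have .htm as part of the name.

```python
import os

def find_index_files(directory_name):
    """Return the sorted names of candidate index files in directory_name.

    Hypothetical helper: keeps regular files whose names begin with
    "index_" and do not contain ".htm" anywhere in the name.
    """
    names = []
    for name in os.listdir(directory_name):
        full_path = os.path.join(directory_name, name)
        if not os.path.isfile(full_path):
            continue  # skip subdirectories and other non-files
        if name.startswith("index_") and ".htm" not in name:
            names.append(name)
    return sorted(names)
```

Calling find_index_files on the directory passed as the first parameter would list the files the indexer is expected to read, which can help confirm the input before a run.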
The true input is the set of index_ files. Each file is an unsorted running list of index tokens that were extracted from the HTML files in a directory. Each token has two parts: the index entry and its URL. The separator between them is defined in the globe.pm file and is :,: ($globe::word_url_boundary). Additionally, the index entry can have two levels. In such cases, the levels are separated by :;: ($globe::word_c_boundary). Finally, a given index entry can represent multiple references or URLs. In such cases, the multiple entries are separated by :;;;: ($globe::division_mult_entry).
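As an illustration of the token format, the following minimal Python sketch splits one token into its entry levels and its URLs. It is not the Perl code from voyant_indexer.pl. The function name and the sample file names are hypothetical, and the sketch assumes that the :;;;: separator divides the references that follow the :,: boundary; the document does not spell out that layout.

```python
# Separator values as given in this document (defined in globe.pm).
WORD_URL_BOUNDARY = ":,:"      # $globe::word_url_boundary
WORD_C_BOUNDARY = ":;:"        # $globe::word_c_boundary
DIVISION_MULT_ENTRY = ":;;;:"  # $globe::division_mult_entry

def parse_index_token(line):
    """Split one index token into (levels, urls).

    levels is a list of one or two entry levels; urls is a list of one
    or more references. Assumes the references follow the first :,:.
    """
    text = line.rstrip("\r\n")
    entry, boundary, references = text.partition(WORD_URL_BOUNDARY)
    if not boundary:
        raise ValueError("token has no entry/URL boundary: %r" % text)
    levels = entry.split(WORD_C_BOUNDARY)
    urls = references.split(DIVISION_MULT_ENTRY)
    return levels, urls

# Hypothetical sample token with two levels and two references.
levels, urls = parse_index_token(
    "conference:;:scheduling:,:sched.htm#a:;;;:setup.htm#b\n"
)
print(levels)  # ['conference', 'scheduling']
print(urls)    # ['sched.htm#a', 'setup.htm#b']
```

Because none of the three separators is a substring of another, the order of the splits does not create ambiguity in this sketch.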
If the separators are changed in the generator program (voyant_nav.pl), they need to be changed here, too. The variable names in both programs are the same. The separators themselves were chosen because they were deemed never to occur in an index entry or URL and aren't Perl special characters.
More information about the master_tree_file: aside from serving as a template for all generated index files, this file divides its content into sections using specially tagged HTML comments, which simplifies locating where generated information is to be placed. In addition, some tags contain information critical to the proper operation of the indexer.
The indexer does not support index entries that are, or that contain, Perl special characters. Such entries are often eliminated early in the process.
The input index_ files cannot have .htm as part of the name. The indexer assumes that the input information is in the proper format: an index entry, $globe::word_url_boundary, and a URL. If any of the input index_ files deviates from this format, problems can result.
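Because the indexer assumes well-formed input, a pre-check of the index_ files before a run may be useful. The following Python sketch is a hypothetical helper, not part of the Voyant tools. It reports the line numbers of tokens that lack an index entry, the :,: boundary, or a URL, and it assumes that blank lines are harmless and can be skipped.

```python
def find_malformed_lines(lines, boundary=":,:"):
    """Return 1-based numbers of lines that are not entry + boundary + URL.

    boundary defaults to the $globe::word_url_boundary value given in
    this document. Blank lines are skipped (an assumption).
    """
    malformed = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # assumption: blank lines are ignored
        entry, found, url = text.partition(boundary)
        if not found or not entry or not url:
            malformed.append(number)
    return malformed
```

For example, passing the lines of one index_ file to find_malformed_lines returns an empty list when every token has all three parts.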
Open-Source tools compliments of Voyant Technologies, Inc. and Glenn C. Maxey.
01/13/2003
TP Tools v2-00-0a
# tpt-hug-02