for DC Perl Mongers, 4 February 2003

Documentation

ElementMap.status: Main documentation
Layered.status: simple docs for the Hash::Layered module
Driver.pod: Token docs for driver
Grove.pod: Token docs for driver
SGMLS.pod: Token docs for driver
XMLParser.pod: Token docs for driver
Project.status: Notes from past and future

Code

ElementMap.pm: The main module object
Layered.pm: The stack-hash data structure
test.pl: main test code
layered-test.pl: Hash::Layered specific tests
Driver.pm: the Driver base class (does little)
Grove.pm: the Grove driver (it’s the simplest of the drivers because it has the whole document in memory as a tree)
EventQueue.pm: common base class for the event drivers: SGMLS and XMLParser
SGMLS.pm: the SGMLSpm driver
XMLParser.pm: the XML::Parser::Expat driver (has to be Expat because it feeds the document through piecewise in order to get control back from the parser; it isn’t supposed to be pretty)

Samples

identity.pm: implementation of code to output a document just as it was input. This shows how to access all the parser data you can get
identity-run: command to actually run the identity.pm module
calc.pm: implementation of the simple calculator example
calc-run: command to actually run the calc.pm code
calc.dtd: The DTD for the simple calculator document
test-simple.calc: input document for the simple calculator example
test-simple.calc-out: output for running the simple calculator on the test-simple.calc document
trace-calc-1: output for running the simple calculator on the test-simple.calc document with calc’s debug tracing
trace-calc-2: output for running the simple calculator on the test-simple.calc document with debug tracing active for calc.pm and SGML::ElementMap
trace-calc-3: output for running the simple calculator on the test-simple.calc document with debug tracing for all of calc.pm, SGML::ElementMap and Hash::Layered
desc2dtd.pl: older file for generating a DTD from a DTD description document (like a simple XML Schema)
htmltrans: older but substantial example for formatting ASM International article documents into HTML
character-old.pm: an old formatting script showing a different style of interacting with the parser (which is still better than directly managing the parser events)
character-new.pm: the modern version of the above script

If you are looking at these on your own, you’ll need to understand that several of the sample files refer to older versions of this module, even all the way back to its original incarnation as a helper built on SGML::SPGrove (which is now XML::Grove). The main differences are in accessing the parsed data, but there are some cosmetic changes here and there (and of course the older samples cannot use newer capabilities).

You can get the SGML::ElementMap module and its respective submodules, examples and documentation from this directory. Don’t forget to get the highest version. That directory also contains these notes and files compressed into a single archive.

Mention:

the core value of SGML is in being able to make changes to the script for mass changes to documents
- The htmltrans script reliably converts the ASM Handbook Series (23 volumes, 2,000 articles, 22,000 pages, 37,000 images, 10,000 tables) to crosslinked HTML with tables of contents, etc.
- This separation is also useful on even very small scales, because it guarantees consistency and leads to increased reliability.
this module predates XSLT
doing/allowing things several different ways to give flexibility to use/develop a variety of idioms
- maybe some of that potential doesn’t pan out, can tighten up later

The processing model

hierarchical code, like the document structure
keep code together and intuitive, especially for pre-content, post-content

A code sample for OmniMark and ElementMap. This fragment takes elements from “<ext.xref pointer='ARTICLE_ID'>content</ext.xref>” to “<ext.xref vol.no='NUM' collection='NUM'>content</ext.xref>”


;; omnimark code
element ext.xref
  local counter junk
    and stream volnum
    and stream colnum
    and switch successful
  output "<%lq"
  repeat over specified attributes as spec-attr
    output " "
    output key of attribute spec-attr
    output "=%"%hv(spec-attr)%""
  again
  activate successful
  reset junk to system-call "%g(idcommand) --format='vol.no=%%v col.no=%%c' --save-output=%g(TempFile) %v(pointer)"
  do unless file "%g(TempFile)" exists
    deactivate successful
    put #error "Warning: auto-generated file %g(TempFile) not found%n"
    increment ErrorCount
  else
    do scan file "%g(TempFile)"
      match "vol.no=" (letter or digit)+ => vol white-space+ "col.no=" (letter or digit)+ => col
         set buffer volnum to "%x(vol)"
         set buffer colnum to "%x(col)"
      else
        deactivate successful
        put #error "Warning: auto-generated file %g(TempFile) is invalid: [%v(pointer)]%n"
        increment ErrorCount
    done
    ;reset junk to system-call "rm %g(TempFile)"
  done
  do when not active successful
    set buffer volnum to "unknown"
    set buffer colnum to "unknown"
  done
  output " vol.no=%"%g(volnum)%" collection=%"%g(colnum)%""
  output ">%c"
  output "</%lq>"

Omnimark

has good SGML parsing and support built in
has rough equivalent of Data::Locations
looks really awful (even worse than Perl!)
lacks major advances in programming language design, like functions
actually, has improved a lot
- less pointlessly verbose
- functions


# this is just thrown together from the omnimark code: there might be errors
$e->element('EXT.XREF', sub {
    my ($engine, $element) = @_;
    my ($attrs, $successful, $line, $volnum, $colnum);
    $output->print("<" . $element->{'Name'});
    $attrs = $element->{'Attributes'};
    # copy the existing attributes through unchanged
    foreach (keys %$attrs) {
        $output->print(" " . $_ . '="' . $attrs->{$_} . '"');
    }
    # list-form system() bypasses the shell, so the format needs no extra quotes
    system($idcommand, '--format=vol.no=%v col.no=%c',
           "--save-output=".$TempFile, $attrs->{'pointer'});
  OK: {
      $successful = 0;
      if (! -f $TempFile) {
          warn "Warning: auto-generated file ".$TempFile." not found\n";
          $ErrorCount += 1;
          last OK;
      }
      # $TempFile is a file name, so open it before reading the first line
      if (open(my $fh, '<', $TempFile)) {
          $line = <$fh>;
          close($fh);
      }
      if ($line && $line =~ m/vol\.no=(\w+)\s+col\.no=(\w+)/) {
          $volnum = $1;
          $colnum = $2;
      } else {
          warn "Warning: auto-generated file " . $TempFile .
              " is invalid: [" . $attrs->{'pointer'} . "]\n";
          $ErrorCount += 1;
          last OK;
      }
      $successful = 1;
  }
    if (!$successful) {
        $volnum = $colnum = 'unknown';
    }
    $output->print(' vol.no="'.$volnum.'" collection="'.$colnum.'">');
    $engine->process_content;
    $output->print("</" . $element->{'Name'} . ">");
});

Other processors

impossible with PerlSAX
- SAX calls you, so you MUST return from a start element handler before you will get any content events, and the content and end element handlers are called at the same level as the start handler
- you have to track ALL the document state and your own to-do
- Note this is only a problem because Perl lacks threads
  - Java encourages threads, so these were natural decisions for SAX
  - but bad decisions for PerlSAX
  - pull vs push
    - an event pull API can be translated into a push API
    - reverse requires threads to partition the parsing from the processing call hierarchy
XSLT
- if you want to be able to program in it, then you should use a programming language
  - pointless reinvention fragments progress
  - convolutes programmatic processing with static processing
  - discourages development of good style engines and models
  - discourages development of good hooks and API in above too
- came after my module anyway
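
The pull-versus-push point in the PerlSAX notes above can be illustrated with a minimal sketch. Everything in it is hypothetical: make_pull_source, push_events and the event tuples are made up for illustration and are not part of PerlSAX or SGML::ElementMap. It only shows that a plain loop around a pull source is enough to produce a push API, with no threads involved.

```perl
use strict;
use warnings;

# Hypothetical pull source: returns one event per call, undef at the end.
sub make_pull_source {
    my @events = @_;
    return sub { return shift @events };
}

# Push adapter: drives the pull source and calls a handler per event type.
sub push_events {
    my ($next_event, %handlers) = @_;
    while (defined(my $event = $next_event->())) {
        my ($type, @args) = @$event;
        my $handler = $handlers{$type};
        next unless $handler;
        $handler->(@args);
    }
    return;
}

my @log;
my $source = make_pull_source(
    [ start => 'PARA' ],
    [ cdata => 'hello' ],
    [ end   => 'PARA' ],
);
push_events($source,
    start => sub { push @log, "<$_[0]>" },
    cdata => sub { push @log, $_[0] },
    end   => sub { push @log, "</$_[0]>" },
);
print join('', @log), "\n";    # prints <PARA>hello</PARA>
```

Going the other way would mean suspending the parser in the middle of a callback, which is where the threads come in.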

Should read the SGML::ElementMap documentation and start looking at the ElementMap.pm code

Why use constants for object data reference?

obvious, less important: speed (hash lookups much slower than arrays)
less obvious, more important: correctness (no name typos, no subclass name conflicts, actual name space separation)
see code for subclass creation (eg Driver::SGMLS)
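
A minimal sketch of the idea, assuming array-based objects indexed by constants. The Toy::Counter classes and the NEXT_FIELD convention are invented for this illustration; the real Driver::SGMLS code may set up its fields differently.

```perl
use strict;
use warnings;

package Toy::Counter;

# Field indices as constants. A misspelled field name is a compile-time
# error under "use strict"; a misspelled hash key would just be a new key.
use constant NAME       => 0;
use constant COUNT      => 1;
use constant NEXT_FIELD => 2;    # first slot free for subclasses

sub new {
    my ($class, $name) = @_;
    my $self = bless [], $class;
    $self->[NAME]  = $name;
    $self->[COUNT] = 0;
    return $self;
}

sub bump { my $self = shift; return ++$self->[COUNT] }

package Toy::Counter::Limited;
our @ISA = ('Toy::Counter');

# The subclass starts numbering where the parent stopped, so its fields
# cannot collide with the parent's.
use constant LIMIT => Toy::Counter::NEXT_FIELD();

sub new {
    my ($class, $name, $limit) = @_;
    my $self = Toy::Counter::new($class, $name);
    $self->[LIMIT] = $limit;
    return $self;
}

sub bump {
    my $self  = shift;
    my $count = $self->[Toy::Counter::COUNT()];
    return $count if $count >= $self->[LIMIT];    # stop at the limit
    return $self->SUPER::bump();
}

package main;

my $c = Toy::Counter::Limited->new('pages', 2);
$c->bump for 1 .. 5;
print $c->bump, "\n";    # prints 2
```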

What do we do with handlers?

store them in bunches by handler type
need to keep ordering for handlers that might match
separate into two cases: default handlers and name handlers
- use hash and '' (the empty string) as name for default handlers
- only need to keep order under a name and among default handlers

What do the main objects look like? (Notice the colons. This is a kind of structure-describing pseudo-Perl. Nothing formal or correct.)


mode : {
  'handler_type' => handler_set : {
          'NAME' or '' => handler_pair : [ pattern, handler_ref ]

$mode = { '_ MODENAME ' => 'FOO',
          '_ FINALIZE ' => '',
          'Element' => {
              'PARA' => [ '.*/SECTION/.*', \&section_para ],
              '' => [ '', \&no_handler_warning ] },
          'CData' => {
              '' => [ '', \&data_accumulate ] },
      };
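
A rough sketch of how a lookup against a mode like the one above could go. It assumes each name holds an ordered list of [pattern, handler] pairs, that patterns are regular expressions matched against the whole context path, and that an empty pattern matches anything. find_handler and the three handler subs are placeholders for illustration, not the module's actual code.

```perl
use strict;
use warnings;

sub section_para       { return "section para" }
sub any_para           { return "plain para" }
sub no_handler_warning { return "no handler for $_[0]" }

# Assumption: pairs under one name are tried in the order they were added.
my $mode = {
    'Element' => {
        'PARA' => [ [ '.*/SECTION/.*', \&section_para ],
                    [ '',              \&any_para ] ],
        ''     => [ [ '', \&no_handler_warning ] ],
    },
};

# Try the handlers registered under the element name first, then the
# default handlers stored under the empty-string name.
sub find_handler {
    my ($mode, $type, $name, $path) = @_;
    my $set = $mode->{$type} or return;
    foreach my $key ($name, '') {
        foreach my $pair (@{ $set->{$key} || [] }) {
            my ($pattern, $handler) = @$pair;
            return $handler
                if $pattern eq '' || $path =~ /\A(?:$pattern)\z/;
        }
    }
    return;
}

my $h = find_handler($mode, 'Element', 'PARA', '/DOC/SECTION/PARA');
print $h->('PARA'), "\n";    # prints section para
```

Order only matters within one name and among the defaults, which is why a hash keyed by name is enough at the outer level.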

Mode

mode has its name saved in it
current modes is just a list of refs to modes, need to recover names
modes could easily be arrays
generally use hashes early on
- trivial to mix around data fields
- might want transient fields
- later if the fields are static, convert to arrays
finalize field has ref to lookup proc and data if mode is finalized: $finalized_data => [ \&lookup_func, $handler1, $handler2, … ]


$main = [
     $state_data,
     $all_modes,
     $global_vars,
     $stack_vars
 ];

$state_data = [
     driver : SGML::ElementMap::Driver
     node_path : ''
     handler_modes : [ $mode, $mode_2, $mode_3, ... ]
     handler_mode_stack : [ $mode_set_1, $mode_set_2, ... ]
     named_handlers : { 'NAME' => \&handler }
     last_gen_name : 'aaa'
 ];

$all_modes = { 'MODE_NAME_1' => $mode,
               'MODE_NAME_2' => $mode_2 };

$global_vars = { 'NAME' => $some_value };

$stack_vars = Hash::Layered;

Why global variable support?

might simplify cases where handlers cross packages and only the ElementMap object itself is reliably common to all modules. Imagine processing with many interchangeable modules of handlers
that’s not really necessary, probably remove soon

Why stack variable support?

need to collect and manage data along the doc hierarchy, just like with code
- otherwise handlers will have problems when they are their own children
technically local() is good enough, but have had many troubles with local()
the special modes for layers add some extra possibility
maybe should not have this hardwired into the processing: hooks instead
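
A toy illustration of the "handlers that are their own children" problem, using a plain Perl array as the stack rather than Hash::Layered. The nested-list document and the walk routine are made up for this sketch; the point is only that per-level state has to be pushed before the content is processed and popped after it.

```perl
use strict;
use warnings;

# Toy document: [ NAME, child, child, ... ]. A LIST sits inside a LIST.
my $doc = [ 'LIST', ['ITEM'], [ 'LIST', ['ITEM'], ['ITEM'] ], ['ITEM'] ];

my @counters;    # one item counter per open LIST, innermost last
my @labels;

sub walk {
    my ($node) = @_;
    my ($name, @children) = @$node;
    if ($name eq 'LIST') {
        push @counters, 0;         # pre-content: fresh state for this level
        walk($_) for @children;    # content
        pop @counters;             # post-content: back to the parent's state
    }
    elsif ($name eq 'ITEM') {
        $counters[-1]++;
        push @labels, join('.', @counters);
    }
    return;
}

walk($doc);
print "@labels\n";    # prints 1 1.1 1.2 2
```

With a single global counter the inner LIST would overwrite the outer one's count, and the last item would come out mislabelled.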

Drivers

Different processors need different interfaces to work with them

the core data is the same, as is the structure
in-memory vs serialized vs parsing quality/detail


# these can default to Driver methods
$d->input($type);  # 'file' 'literal' 'handle' etc.
$d->markup($type); # 'xml' or 'sgml'
$d->parser($parser_object);
$d->process_xml_file($elementmap, $file, @handler_args);
$d->process_sgml_file($elementmap, $file, @handler_args);
# these must be implemented in Driver sub-classes
$d->process(...);
$d->reparent_current_subtree($new_el_name, @attribute_pairs);
$d->reparent_subtree($new_el_name, @attribute_pairs);
$d->dispatch_subtrees($elementmap, $pattern, @handler_args);
$d->skip_subtrees();
$d->context_path();

Some of the drivers have a lot in common: the simple event based ones. So we have Driver::EventQueue

maintains a control stack and a parse event queue
the event queue is filled by calling the actual driver’s queue_more_events method
the control stack says what to do with events in the event queue
- need to notice when the current element closes
- reparenting requires insertion of synthesized events at the right places
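
A simplified sketch of the event queue idea. ToyQueueDriver is hypothetical, the control stack is reduced to a depth counter, and reparenting is left out; only the queue_more_events name comes from the notes above. It shows how a handler can process its own content and get control back when its element closes.

```perl
use strict;
use warnings;

package ToyQueueDriver;

sub new {
    my ($class, @pending) = @_;
    return bless { pending => [@pending], queue => [] }, $class;
}

# Stand-in for a real driver's queue_more_events: moves a small batch of
# (pretend) parser output into the event queue.
sub queue_more_events {
    my $self = shift;
    push @{ $self->{queue} }, splice(@{ $self->{pending} }, 0, 2);
    return scalar @{ $self->{queue} };
}

sub next_event {
    my $self = shift;
    $self->queue_more_events unless @{ $self->{queue} };
    return shift @{ $self->{queue} };
}

# Hand events to $on_event until the currently open element closes.
sub process_content {
    my ($self, $on_event) = @_;
    my $depth = 0;
    while (defined(my $event = $self->next_event)) {
        my ($type, $value) = @$event;
        if ($type eq 'start') {
            $depth++;
        }
        elsif ($type eq 'end') {
            return if $depth == 0;    # our own element just closed
            $depth--;
        }
        $on_event->($type, $value);
    }
    return;
}

package main;

my $driver = ToyQueueDriver->new(
    [ start => 'DOC' ],  [ start => 'PARA' ], [ cdata => 'hi' ],
    [ end   => 'PARA' ], [ end   => 'DOC' ],  [ start => 'NEXT' ],
);
$driver->next_event;    # pretend the DOC handler has just been entered
my @seen;
$driver->process_content(sub { push @seen, join(':', @_) });
print "@seen\n";        # prints start:PARA cdata:hi end:PARA
```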

Hash::Layered

Sample execution:


$h->set_default('cascade');
$h->{'a'} = 31;
$h->{'b'} = 32;
$h->{'c'} = 33;
$h->push;

cascade	a	b	c	d	e
default	31	32	33
default


assert($h->{'a'} == 31);
$h->set_layer('opaque');
assert(! defined $h->{'a'});
$h->{'c'} = 34;
$h->{'d'} = 35;

cascade	a	b	c	d	e
default	31	32	33
opaque			34	35


$h->set_layer('default');
assert($h->{'a'} == 31);
assert($h->{'b'} == 32);
assert($h->{'c'} == 34);
assert($h->{'d'} == 35);
$h->push;
$h->{'a'} = 36;
$h->set_layer('oneway');
$h->{'e'} = 37;
$h->{'a'} = 38;

cascade	a	b	c	d	e
default	36	32	33
default			34	35
oneway	38				37


assert($h->{'a'} == 38);
assert($h->{'b'} == 32);
assert($h->{'e'} == 37);
$h->pop;
assert(! defined $h->{'e'});
$h->pop;
assert($h->{'a'} == 36);
assert($h->{'b'} == 32);
assert($h->{'c'} == 33);
assert(! defined $h->{'d'});

Want to use the object as a hash reference, but still have access to object methods. I initially tried this with a single object; however, that did not work. I don’t have notes, unfortunately, but I think the issue was getting at the underlying data structure to work with. With a single object, it is harder for a method to tell, when it is called, whether it needs to call tied() first (note that the hash ref and the object ref will be blessed into the same class, so ref() won’t help). Using two objects makes this very easy.
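
A minimal sketch of the two-object arrangement, assuming one class for the tie implementation and another for the reference the user holds. The Toy::Layered names and the trivial push_layer method are invented for illustration and say nothing about how Hash::Layered stores its layers.

```perl
use strict;
use warnings;

package Toy::Layered::Impl;    # the tie class: holds the real data

sub TIEHASH  { my $class = shift; return bless { data => {}, depth => 0 }, $class }
sub FETCH    { return $_[0]{data}{ $_[1] } }
sub STORE    { return $_[0]{data}{ $_[1] } = $_[2] }
sub EXISTS   { return exists $_[0]{data}{ $_[1] } }
sub DELETE   { return delete $_[0]{data}{ $_[1] } }
sub FIRSTKEY { my $reset = keys %{ $_[0]{data} }; return each %{ $_[0]{data} } }
sub NEXTKEY  { return each %{ $_[0]{data} } }
sub push_layer { return ++$_[0]{depth} }

package Toy::Layered;          # the user-visible class: a blessed tied hash

sub new {
    my $class = shift;
    my %hash;
    tie %hash, 'Toy::Layered::Impl';
    return bless \%hash, $class;
}

# Methods in this class always go through tied(), so there is never any
# doubt about which kind of reference they were handed.
sub push { my $self = shift; return tied(%$self)->push_layer }

package main;

my $h = Toy::Layered->new;
$h->{'a'} = 31;                # hash access goes through STORE
print $h->{'a'}, "\n";         # prints 31
print $h->push, "\n";          # prints 1
```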

(Note: Haven’t converted to use sub constants for object fields.)

Have two places for behavior settings

individual layer
“default”
- used when the layer has “default” semantics
- lets you change behavior easily for most of the layers

Layers have IDs

combined from an id in the layer_data_list and the layer_data_count
lets us reuse layer_data_list arrays
we don’t actually use them for anything, they are more to catch errors

intervening_layer($target_index, $is_write)

key internal function
searches for an opaque layer between the current layer and the target_index
target_index can be ‘top’
$is_write is 0 for read, 1 for write, and 2 for a new key
should use a cache, but I haven’t put it in
- fancy behaviors mean you need separate caches for reads and writes
- haven’t done any consideration on the cost of that extra caching
note I don’t call _intervening_layer as an object method
- it’s already slow
- it’s called from everywhere

behaviors for the hash

opaque : completely blocks other layers
transparent : ignored unless it already has the key
hidden : ignored entirely
oneway : opaque to writes, transparent to reads
cascade : opaque to new keys, transparent to existing keys

OK, OK, how does it work?


$layered_hash = [
   $default_layer_state : 'cascade'
   $layer_data_count : -1
   $layer_data_list : [ $layer_data_1, $layer_data_2, ... ]
   $var_val_stack_hash : { }
   $iter_data : [ [ keys], key_index, intervening_layer_for_reads]
];

$layer_data = [
   $sub_id,
   $behavior,
   'VAR1',   # list of all variables that have values in this layer
   'VAR2',
   ...
];

$var_val_stack_hash = {
  'VAR1' => [ $layer_index_1, $val_1,  $layer_index_2, $val_2,  ... ]
  ...
};

Huh?

each layer has an id, a behavior, and a list of variable keys
each variable key (NOT LAYER) has
- a list of pairs: a layer index and a value

Lookup of a key

find the key in the var_val_stack_hash
get last two elements: layer index and value
see if there is an opaque layer between the current layer and the one with the value

Iteration

get a list of all keys
lookup intervening opaque layer for reads
skip any keys blocked by the opaque layer
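
The lookup and iteration steps above can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the Hash::Layered code: only the opaque behavior is modelled, the layer behaviors are kept in a plain list, and intervening_opaque, lookup and visible_keys are names made up for the sketch. The data mirrors the second table in the sample execution (a, b, c in layer 0; c and d in layer 1).

```perl
use strict;
use warnings;

# Index of the top-most opaque layer above $target_index, or -1 if none.
sub intervening_opaque {
    my ($behaviors, $target_index) = @_;
    for (my $i = $#$behaviors; $i > $target_index; $i--) {
        return $i if $behaviors->[$i] eq 'opaque';
    }
    return -1;
}

# Take the last (layer index, value) pair for the key and check that no
# opaque layer sits between the top of the stack and that layer.
sub lookup {
    my ($behaviors, $var_val_stack_hash, $key) = @_;
    my $pairs = $var_val_stack_hash->{$key};
    return undef unless $pairs && @$pairs;
    my ($layer_index, $value) = @{$pairs}[-2, -1];
    return undef if intervening_opaque($behaviors, $layer_index) >= 0;
    return $value;
}

# Find the blocking layer once, then skip every key whose newest value
# lives below it.
sub visible_keys {
    my ($behaviors, $var_val_stack_hash) = @_;
    my $opaque = intervening_opaque($behaviors, -1);
    my @keys = grep { $var_val_stack_hash->{$_}[-2] >= $opaque }
               keys %$var_val_stack_hash;
    @keys = sort @keys;
    return @keys;
}

my %vars = (
    a => [ 0, 31 ],
    b => [ 0, 32 ],
    c => [ 0, 33, 1, 34 ],
    d => [ 1, 35 ],
);
my @top_opaque  = ('cascade', 'opaque');
my @top_cascade = ('cascade', 'cascade');

print join(' ', visible_keys(\@top_opaque,  \%vars)), "\n";    # prints c d
print join(' ', visible_keys(\@top_cascade, \%vars)), "\n";    # prints a b c d
```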
