SGML::ElementMap Implementation

for DC Perl Mongers, 4 February 2003

Documentation Code Samples
  • ElementMap.pm: The main module object
  • Layered.pm: The stack-hash data structure
  • test.pl: main test code
  • layered-test.pl: Hash::Layered specific tests
  • Driver.pm: the Driver base class (does little)
  • Grove.pm: the Grove driver (it’s the simplest of the drivers because it has the whole document in memory as a tree)
  • EventQueue.pm: common base class for the event drivers: SGMLS and XMLParser
  • SGMLS.pm: the SGMLSpm driver
  • XMLParser.pm: the XML::Parser::Expat driver (it has to be Expat because it feeds the document through piecewise in order to get control back from the parser; it isn’t supposed to be pretty)
  • identity.pm: implementation of code to output a document just as it was input. This shows accessing of all the parser data you can get
  • identity-run: command to actually run the identity.pm module
  • calc.pm: implementation of the simple calculator example
  • calc-run: command to actually run the calc.pm code
  • calc.dtd: The DTD for the simple calculator document
  • test-simple.calc: input document for the simple calculator example
  • test-simple.calc-out: output for running the simple calculator on the test-simple.calc document
  • trace-calc-1: output for running the simple calculator on the test-simple.calc document with calc’s debug tracing
  • trace-calc-2: output for running the simple calculator on the test-simple.calc document with debug tracing active for calc.pm and SGML::ElementMap
  • trace-calc-3: output for running the simple calculator on the test-simple.calc document with debug tracing for all of calc.pm, SGML::ElementMap and Hash::Layered
  • desc2dtd.pl: older file for generating a DTD from a DTD description document (like a simple XMLSchema)
  • htmltrans: older but substantial example for formatting ASM International article documents into HTML
  • character-old.pm: an old formatting script showing a different style of interacting with the parser (which is still better than directly managing the parser events)
  • character-new.pm: the modern version of the above script
If you are looking at these on your own, you’ll need to understand that several of the sample files refer to older versions of this module, some all the way back to its original incarnation as a helper built on SGML::SPGrove (which is now XML::Grove). The main differences are in accessing the parsed data, but there are some cosmetic changes here and there (and, of course, the older files cannot use newer capabilities).

You can get the SGML::ElementMap module and its respective submodules, examples and documentation from this directory. Don’t forget to get the highest version. That directory also contains these notes and files compressed into a single archive.

Mention:

  • the core value of SGML is in being able to change the script to make mass changes to documents
    • The htmltrans script reliably converts the ASM Handbook Series (23 volumes, 2,000 articles, 22,000 pages, 37,000 images, 10,000 tables) to crosslinked HTML with tables of contents, etc.
    • This separation of content from processing script is also useful on even very small scales, because it guarantees consistency and leads to increased reliability.
  • this module predates XSLT
  • doing/allowing things several different ways to give flexibility to use/develop a variety of idioms
    • maybe some of that potential doesn’t pan out, can tighten up later

The processing model

  • hierarchical code, like the document structure
  • keep code together and intuitive, especially for pre-content, post-content

A code sample for OmniMark and ElementMap. This fragment transforms elements from “<ext.xref pointer='ARTICLE_ID'>content</ext.xref>” to “<ext.xref vol.no='NUM' collection='NUM'>content</ext.xref>”.


;; omnimark code
element ext.xref
  local counter junk
    and stream volnum
    and stream colnum
    and switch successful
  output "<%lq"
  repeat over specified attributes as spec-attr
    output " "
    output key of attribute spec-attr
    output "=%"%hv(spec-attr)%""
  again
  activate successful
  reset junk to system-call "%g(idcommand) --format='vol.no=%%v col.no=%%c' --save-output=%g(TempFile) %v(pointer)"
  do unless file "%g(TempFile)" exists
    deactivate successful
    put #error "Warning: auto-generated file %g(TempFile) not found%n"
    increment ErrorCount
  else
    do scan file "%g(TempFile)"
      match "vol.no=" (letter or digit)+ => vol white-space+ "col.no=" (letter or digit)+ => col
         set buffer volnum to "%x(vol)"
         set buffer colnum to "%x(col)"
      else
        deactivate successful
        put #error "Warning: auto-generated file %g(TempFile) is invalid: [%v(pointer)]%n"
        increment ErrorCount
    done
    ;reset junk to system-call "rm %g(TempFile)"
  done
  do when not active successful
    set buffer volnum to "unknown"
    set buffer colnum to "unknown"
  done
  output " vol.no=%"%g(volnum)%" collection=%"%g(colnum)%""
  output ">%c"
  output "</%lq>"

Omnimark

  • has good SGML parsing and support built in
  • has rough equivalent of Data::Locations
  • looks really awful (even worse than Perl!)
  • lacks major advances in programming language design, like functions
  • actually, has improved a lot
    • less pointlessly verbose
    • functions

# this is just thrown together from the omnimark code: there might be errors
$e->element('EXT.XREF', sub {
    my ($engine, $element) = @_;
    my ($attrs, $successful, $line, $volnum, $colnum);
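    $output->print("<" . $element->{'Name'});
    $attrs = $element->{'Attributes'};
    # the attributes are used as a hash below, so iterate over its keys
    foreach (keys %$attrs) {
        $output->print(" " . $_ . '="' . $attrs->{$_} . '"');
    }
    # list-form system() bypasses the shell, so the format needs no quoting
    system($idcommand, "--format=vol.no=%v col.no=%c",
           "--save-output=".$TempFile, $attrs->{'pointer'});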
  OK: {
      $successful = 0;
      if (! -f $TempFile) {
          warn "Warning: auto-generated file ".$TempFile." not found\n";
          $ErrorCount += 1;
          last OK;
      }
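      # $TempFile is a file name, so open it before reading the first line
      if (open(my $fh, '<', $TempFile)) {
          $line = <$fh>;
          close($fh);
      }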
      if ($line && $line =~ m/vol\.no=(\w+)\s+col\.no=(\w+)/) {
          $volnum = $1;
          $colnum = $2;
      } else {
          warn "Warning: auto-generated file " . $TempFile .
              " is invalid: [" . $attrs->{'pointer'} . "]\n";
          $ErrorCount += 1;
          last OK;
      }
      $successful = 1;
  }
    if (!$successful) {
        $volnum = $colnum = 'unknown';
    }
    $output->print(' vol.no="'.$volnum.'" collection="'.$colnum.'">');
    $engine->process_content;
    $output->print("</" . $element->{'Name'} . ">");
});

Other processors

  • impossible with PerlSAX
    • SAX calls you, so you MUST return from a start element handler before you will get any content events, and the content and end element handlers are called at the same level as the start handler
    • you have to track ALL the document state and your own to-do
    • Note this is only a problem because Perl lacks threads
      • Java encourages threads, so these were natural decisions for SAX
      • but bad decisions for PerlSAX
      • pull vs push
        • an event pull API can be translated into a push API
        • reverse requires threads to partition the parsing from the processing call hierarchy
  • XSLT
    • if you want to be able to program in it, then you should use a programming language
      • pointless reinvention fragments progress
      • convolutes programmatic processing with static processing
      • discourages development of good style engines and models
      • discourages development of good hooks and API in above too
    • came after my module anyway
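The "pull vs push" item above claims the pull-to-push direction is mechanical. A minimal sketch of that direction, assuming a hypothetical pull source (a closure that returns one event per call and undef at the end) and a hash of SAX-style callbacks. None of these names come from PerlSAX or SGML::ElementMap; they are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical pull source: hands back one event per call, undef when done.
sub make_pull_source {
    my @events = @_;
    return sub { return shift @events };
}

# Push adapter: a plain loop that pulls events and calls the matching
# callback. No threads are needed in this direction.
sub push_events {
    my ($next_event, $callbacks) = @_;
    while (defined(my $event = $next_event->())) {
        my ($type, @args) = @$event;
        my $cb = $callbacks->{$type} or next;   # ignore unhandled types
        $cb->(@args);
    }
    return;
}

my @log;
push_events(
    make_pull_source(
        [ start_element => 'PARA' ],
        [ characters    => 'hello' ],
        [ end_element   => 'PARA' ],
    ),
    {
        start_element => sub { push @log, "<$_[0]>" },
        characters    => sub { push @log, $_[0] },
        end_element   => sub { push @log, "</$_[0]>" },
    },
);
print join('', @log), "\n";   # prints "<PARA>hello</PARA>"
```

Going the other way is the hard part: the callbacks would have to return before the next event arrives, so a handler could not simply wait for its own content.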

You should read the SGML::ElementMap documentation and start looking at the ElementMap.pm code.

Why use constants for object data reference?

  • obvious, less important: speed (hash lookups much slower than arrays)
  • less obvious, more important: correctness (no name typos, no subclass name conflicts, actual name space separation)
  • see code for subclass creation (eg Driver::SGMLS)
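A rough sketch of the constants idea, not the actual Driver code: array-based objects whose slots are named by constant subs, with a subclass claiming its slots after the base class's. The package names, field names and the NEXT_FIELD convention are all assumptions made up for this example.

```perl
use strict;
use warnings;

package Sketch::Driver;
# Field indices as constant subs. A typo such as NODE_PTH is a
# compile-time error under "use strict", unlike a mistyped hash key.
use constant PARSER     => 0;
use constant NODE_PATH  => 1;
use constant NEXT_FIELD => 2;   # first index free for subclasses

sub new {
    my ($class) = @_;
    my $self = bless [], $class;
    $self->[PARSER]    = undef;
    $self->[NODE_PATH] = '';
    return $self;
}
sub node_path { return $_[0][NODE_PATH] }

package Sketch::Driver::SGMLS;
our @ISA = ('Sketch::Driver');
# Subclass fields start where the base class stops, so the two sets of
# slots cannot collide, and each package has its own constant names.
use constant EVENT_QUEUE => Sketch::Driver::NEXT_FIELD();
use constant NEXT_FIELD  => Sketch::Driver::NEXT_FIELD() + 1;

sub new {
    my ($class) = @_;
    my $self = $class->SUPER::new();
    $self->[EVENT_QUEUE] = [];
    return $self;
}
sub queue_length { return scalar @{ $_[0][EVENT_QUEUE] } }

package main;
my $d = Sketch::Driver::SGMLS->new;
push @{ $d->[Sketch::Driver::SGMLS::EVENT_QUEUE()] }, 'start PARA';
print $d->queue_length, " '", $d->node_path, "'\n";
```

The speed gain comes from array indexing with constants that Perl can inline; the correctness gain comes from the compiler rejecting any field name that was never declared.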

What do we do with handlers?

  • store them in bunches by handler type
  • need to keep ordering for handlers that might match
  • separate into two cases: default handlers and name handlers
    • use a hash, with '' (the empty string) as the name for default handlers
    • only need to keep order under a name and among default handlers

What do the main objects look like? (Notice the colons. This is a kind of structure-describing pseudo-Perl. Nothing formal or correct.)


mode : {
  'handler_type' => handler_set : {
          'NAME' or '' => handler_pair : [ pattern, handler_ref ]

$mode = { '_ MODENAME ' => 'FOO',
          '_ FINALIZE ' => '',
          'Element' => {
              'PARA' => [ '.*/SECTION/.*', \&section_para ],
              '' => [ '', \&no_handler_warning ] },
          'CData' => {
              '' => [ '', \&data_accumulate ] },
      };
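To make the name-then-default lookup concrete, here is a hedged sketch against a mode shaped like the one above. The find_handler function is hypothetical, and two details are assumptions rather than facts about ElementMap.pm: that each name holds a flat list of (pattern, handler) pairs in registration order, and that a pattern is an anchored regular expression matched against a slash-separated node path, with '' matching anything.

```perl
use strict;
use warnings;

sub section_para       { return 'section_para' }
sub no_handler_warning { return 'no_handler_warning' }

# Assumed layout: under each name, a flat list of (pattern, handler)
# pairs kept in registration order; '' holds the default handlers.
my $mode = {
    'Element' => {
        'PARA' => [ '.*/SECTION/.*', \&section_para ],
        ''     => [ '',              \&no_handler_warning ],
    },
};

# Try the handlers registered under the element name first, then the
# defaults, returning the first one whose pattern matches the path.
sub find_handler {
    my ($mode, $type, $name, $path) = @_;
    my $set = $mode->{$type} or return;
    foreach my $key ($name, '') {
        my $pairs = $set->{$key} or next;
        for (my $i = 0; $i < @$pairs; $i += 2) {
            my ($pattern, $handler) = @{$pairs}[$i, $i + 1];
            # An empty pattern is treated as "matches anything".
            return $handler if $pattern eq '' or $path =~ m/^(?:$pattern)$/;
        }
    }
    return;
}

my $in_section = find_handler($mode, 'Element', 'PARA', 'DOC/SECTION/PARA');
my $outside    = find_handler($mode, 'Element', 'PARA', 'DOC/PARA');
my @found = ($in_section->(), $outside->());
print "@found\n";   # prints "section_para no_handler_warning"
```

Only two orderings have to be preserved here: the pairs under one name, and the pairs under ''. Handlers registered under different names can never compete for the same element.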

Mode

  • mode has its name saved in it
  • current modes is just a list of refs to modes, need to recover names
  • modes could easily be arrays
  • generally use hashes early on
    • trivial to mix around data fields
    • might want transient fields
    • later if the fields are static, convert to arrays
  • finalize field has a ref to the lookup proc and data if the mode is finalized: $finalized_data => [ \&lookup_func, $handler1, $handler2, … ]

$main = [
     $state_data,
     $all_modes,
     $global_vars,
     $stack_vars
 ];

$state_data = [
     driver : SGML::ElementMap::Driver
     node_path : ''
     handler_modes : [ $mode, $mode_2, $mode_3, ... ]
     handler_mode_stack : [ $mode_set_1, $mode_set_2, ... ]
     named_handlers : { 'NAME' => \&handler }
     last_gen_name : 'aaa'
 ];

$all_modes = { 'MODE_NAME_1' => $mode,
               'MODE_NAME_2' => $mode_2 };

$global_vars = { 'NAME' => $some_value };

$stack_vars = Hash::Layered;

Why global variable support?

  • might simplify cases where handlers cross packages and only the ElementMap object itself is reliably common to all modules. Imagine processing with many interchangeable modules of handlers
  • that’s not really necessary, probably remove soon

Why stack variable support?

  • need to collect and manage data along the doc hierarchy, just like with code
    • otherwise handlers will have problems when they are their own children
  • technically local() is good enough, but have had many troubles with local()
  • the special modes for layers add some extra possibility
  • maybe should not have this hardwired into the processing: hooks instead

Drivers

Different processors need different interfaces to work with them

  • the core data is the same, as is the structure
  • in-memory vs serialized vs parsing quality/detail

# these can default to Driver methods
$d->input($type);  # 'file' 'literal' 'handle' etc.
$d->markup($type); # 'xml' or 'sgml'
$d->parser($parser_object);
$d->process_xml_file($elementmap, $file, @handler_args);
$d->process_sgml_file($elementmap, $file, @handler_args);
# these must be implemented in Driver sub-classes
$d->process(...);
$d->reparent_current_subtree($new_el_name, @attribute_pairs);
$d->reparent_subtree($new_el_name, @attribute_pairs);
$d->dispatch_subtrees($elementmap, $pattern, @handler_args);
$d->skip_subtrees();
$d->context_path();

Some of the drivers have a lot in common: the simple event based ones. So we have Driver::EventQueue

  • maintains a control stack and a parse event queue
  • the event queue is filled by calling the actual driver’s queue_more_events method
  • the control stack says what to do with events in the event queue
    • need to notice when the current element closes
    • reparenting requires insertion of synthesized events at the right places
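A stripped-down sketch of the event queue idea, not the real Driver::EventQueue. It borrows the queue_more_events name from the notes above; everything else (the package, the event shapes, the 'chunks' stand-in for a piecewise parser) is invented for illustration. It also assumes every handler calls process_content exactly once, so the Perl call stack can stand in for the control stack. The real module needs an explicit control stack to cope with skipped content and reparenting.

```perl
use strict;
use warnings;

package Sketch::EventQueue;

sub new {
    my ($class, @chunks) = @_;
    # 'chunks' stands in for a parser that yields events piecewise.
    return bless { queue => [], chunks => [@chunks], out => [] }, $class;
}

# Refill hook; a real driver would run its parser a bit further here.
sub queue_more_events {
    my ($self) = @_;
    my $chunk = shift @{ $self->{chunks} } or return 0;
    push @{ $self->{queue} }, @$chunk;
    return 1;
}

sub next_event {
    my ($self) = @_;
    until (@{ $self->{queue} }) {
        $self->queue_more_events or return;
    }
    return shift @{ $self->{queue} };
}

# Dispatch events until the element that was open on entry closes.
sub process_content {
    my ($self, $handlers) = @_;
    while (defined(my $event = $self->next_event)) {
        my ($type, $value) = @$event;
        return if $type eq 'end';
        if ($type eq 'start') {
            my $handler = $handlers->{$value} || $handlers->{''};
            $handler->($self, $value, $handlers);
        } else {
            push @{ $self->{out} }, $value;   # character data
        }
    }
    return;
}

package main;

my $q = Sketch::EventQueue->new(
    [ [ start => 'DOC' ], [ start => 'B' ] ],
    [ [ cdata => 'hi' ], [ end => 'B' ], [ end => 'DOC' ] ],
);
my %handlers = (
    'B' => sub {
        my ($eq, $name, $h) = @_;
        push @{ $eq->{out} }, '*';     # pre-content
        $eq->process_content($h);
        push @{ $eq->{out} }, '*';     # post-content
    },
    '' => sub { $_[0]->process_content($_[2]) },
);
$q->process_content(\%handlers);
print join('', @{ $q->{out} }), "\n";   # prints "*hi*"
```

Note that the B handler stays on the stack while its content is processed, which is exactly what a callback-only API does not allow.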

Hash::Layered

Sample execution:


$h->set_default('cascade');
$h->{'a'} = 31;
$h->{'b'} = 32;
$h->{'c'} = 33;
$h->push;
cascade   a   b   c   d   e
default  31  32  33
default

assert($h->{'a'} == 31);
$h->set_layer('opaque');
assert(! defined $h->{'a'});
$h->{'c'} = 34;
$h->{'d'} = 35;
cascade   a   b   c   d   e
default  31  32  33
opaque           34  35

$h->set_layer('default');
assert($h->{'a'} == 31);
assert($h->{'b'} == 32);
assert($h->{'c'} == 34);
assert($h->{'d'} == 35);
$h->push;
$h->{'a'} = 36;
$h->set_layer('oneway');
$h->{'e'} = 37;
$h->{'a'} = 38;
cascade   a   b   c   d   e
default  36  32  33
default          34  35
oneway   38              37

assert($h->{'a'} == 38);
assert($h->{'b'} == 32);
assert($h->{'e'} == 37);
$h->pop;
assert(! defined $h->{'e'});
$h->pop;
assert($h->{'a'} == 36);
assert($h->{'b'} == 32);
assert($h->{'c'} == 33);
assert(! defined $h->{'d'});

The goal is to use the object as a hash reference while still having access to object methods. I initially tried this with a single object, but that did not work. Unfortunately I don’t have notes, but I think the issue was getting at the underlying data structure. With a single object, it is harder for a method to tell whether it needs to call tied() (the hash ref and the object ref are blessed into the same class, so ref() won’t help). Using two objects makes this very easy.
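A minimal sketch of the two-object arrangement, using Perl's standard tie interface. This is not the Hash::Layered source: the package names, the depth field and the push_layer method are invented, and only FETCH and STORE are implemented, so keys() and exists() would not work on this toy.

```perl
use strict;
use warnings;

# Inner object: holds the real data and implements the tie interface.
package Sketch::Layered::Impl;
sub TIEHASH {
    my ($class) = @_;
    return bless { data => {}, depth => 0 }, $class;
}
sub FETCH { my ($self, $key) = @_; return $self->{data}{$key} }
sub STORE {
    my ($self, $key, $value) = @_;
    $self->{data}{$key} = $value;
    return;
}

# Outer object: a blessed reference to the tied hash. Users hold this one.
package Sketch::Layered;
sub new {
    my ($class) = @_;
    my %hash;
    tie %hash, 'Sketch::Layered::Impl';
    return bless \%hash, $class;
}
# Methods reach the inner object through tied(). They never touch
# $self->{...} directly, since that would go through FETCH and STORE.
sub push_layer {
    my ($self) = @_;
    my $impl = tied %$self;
    return ++$impl->{depth};
}
sub depth {
    my ($self) = @_;
    my $impl = tied %$self;
    return $impl->{depth};
}

package main;
my $h = Sketch::Layered->new;
$h->{'a'} = 31;      # goes through STORE on the inner object
$h->push_layer;      # ordinary method call on the same reference
print $h->{'a'}, ' ', $h->depth, "\n";   # prints "31 1"
```

Because the two objects are in different classes, a method always knows which one it has: outer methods call tied(), inner methods work on their own fields.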

(Note: Haven’t converted to use sub constants for object fields.)

Have two places for behavior settings

  • individual layer
  • “default”
    • used when the layer has “default” semantics
    • lets you change behavior easily for most of the layers

Layers have IDs

  • combined from an id in the layer_data_list and the layer_data_count
  • lets us reuse layer_data_list arrays
  • we don’t actually use them for anything, they are more to catch errors

intervening_layer($target_index, $is_write)

  • key internal function
  • searches for an opaque layer between the current layer and the target_index
  • target_index can be ‘top’
  • $is_write is 0 for read, 1 for write, and 2 for a new key
  • should use a cache, but I haven’t put it in
    • fancy behaviors mean you need separate caches for reads and writes
    • haven’t done any consideration on the cost of that extra caching
  • note I don’t call _intervening_layer as an object method
    • it’s already slow
    • it’s called from everywhere

behaviors for the hash

  • opaque : completely blocks other layers
  • transparent : ignored unless it already has the key
  • hidden : ignored entirely
  • oneway : opaque to writes, transparent to reads
  • cascade : opaque to new keys, transparent to existing keys

OK, OK, how does it work?


$layered_hash = [
   $default_layer_state : 'cascade'
   $layer_data_count : -1
   $layer_data_list : [ $layer_data_1, $layer_data_2, ... ]
   $var_val_stack_hash : { }
   $iter_data : [ [ keys], key_index, intervening_layer_for_reads]
];

$layer_data = [
   $sub_id,
   $behavior,
   'VAR1',   # list of all variables that have values in this layer
   'VAR2',
   ...
];

$var_val_stack_hash = {
  'VAR1' => [ $layer_index_1, $val_1,  $layer_index_2, $val_2,  ... ]
  ...
};

Huh?

  • each layer has an id, a behavior, and a list of variable keys
  • each variable key (NOT LAYER) has
    • a list of pairs: a layer index and a value

Lookup of a key

  • find the key in the var_val_stack_hash
  • get last two elements: layer index and value
  • see if there is an opaque layer between the current layer and the one with the value
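The three lookup steps can be sketched as follows. This is a simplification, not the Hash::Layered code: it handles reads only, treats 'opaque' as the only behavior that blocks a read, keeps state in two file-level variables instead of an object, and takes the top layer as the current one. The function names are hypothetical. The data mirrors the sample execution right after the second layer was made opaque.

```perl
use strict;
use warnings;

# Simplified state: one behavior per layer, plus the per-key stacks of
# (layer index, value) pairs.
my @layer_behavior = ('cascade', 'opaque');   # layer indices 0 and 1
my %var_val_stack = (
    'a' => [ 0, 31 ],
    'c' => [ 0, 33, 1, 34 ],   # set in layer 0, then again in layer 1
    'd' => [ 1, 35 ],
);

# Return the index of the topmost opaque layer strictly above
# $target_index, or undef if nothing blocks the read.
sub intervening_opaque_layer {
    my ($target_index) = @_;
    for (my $i = $#layer_behavior; $i > $target_index; $i--) {
        return $i if $layer_behavior[$i] eq 'opaque';
    }
    return;
}

sub lookup {
    my ($key) = @_;
    my $stack = $var_val_stack{$key} or return;
    # The last two elements are the most recent layer index and value.
    my ($layer_index, $value) = @{$stack}[-2, -1];
    return if defined(scalar intervening_opaque_layer($layer_index));
    return $value;
}

my $val_a = lookup('a');   # blocked by the opaque layer above layer 0
my $val_c = lookup('c');   # stored in the opaque layer itself: visible
print defined($val_a) ? "a=$val_a" : "a blocked", ", c=$val_c\n";
# prints "a blocked, c=34"
```

A value stored in the opaque layer itself stays visible, because the search only looks at layers above the one holding the value.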

Iteration

  • get a list of all keys
  • lookup intervening opaque layer for reads
  • skip any keys blocked by the opaque layer
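Iteration follows the same idea, but finds the blocking layer once instead of per key. A hedged sketch under the same simplifications as before (reads only, only 'opaque' blocks, made-up names, state in plain variables); the keys are sorted purely to make the output deterministic.

```perl
use strict;
use warnings;

my @layer_behavior = ('cascade', 'opaque', 'cascade');
my %var_val_stack = (
    'a' => [ 0, 31 ],
    'c' => [ 0, 33, 1, 34 ],
    'e' => [ 2, 37 ],
);

sub visible_keys {
    # Step 1: all keys that have a value in any layer.
    my @all_keys = sort keys %var_val_stack;
    # Step 2: find the topmost opaque layer once, for the whole iteration.
    my $opaque = -1;
    for my $i (reverse 0 .. $#layer_behavior) {
        if ($layer_behavior[$i] eq 'opaque') { $opaque = $i; last }
    }
    # Step 3: keep keys whose newest value sits at or above that layer.
    return grep { $var_val_stack{$_}[-2] >= $opaque } @all_keys;
}

my @keys = visible_keys();
print "@keys\n";   # prints "c e"
```

Key 'a' is dropped because its only value lives below the opaque layer; 'c' survives because its newest value is in the opaque layer itself.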