for DC Perl Mongers, 4 February 2003
If you are looking at these on your own, you’ll need to understand that several of the sample files refer to older versions of this module, some going all the way back to its original incarnation as a helper built on SGML::SPGrove (which is now XML::Grove). The main differences are in accessing the parsed data, but there are some cosmetic changes here and there (and, of course, the older samples cannot use newer capabilities).
You can get the SGML::ElementMap module and its respective submodules, examples and documentation from this directory. Don’t forget to get the highest version. That directory also contains these notes and files compressed into a single archive.
Mention:
- the core value of SGML is in being able to make changes to the script for mass changes to documents
- The htmltrans script reliably converts the ASM Handbook Series (23 volumes, 2,000 articles, 22,000 pages, 37,000 images, 10,000 tables) to crosslinked html with tables of contents, etc.
- This separation is also useful even on very small scales, because it guarantees consistency and leads to increased reliability.
- this module predates XSLT
- doing/allowing things several different ways to give flexibility to use/develop a variety of idioms
- maybe some of that potential doesn’t pan out, can tighten up later
The processing model
- hierarchical code, like the document structure
- keep code together and intuitive, especially for pre-content, post-content
A code sample for OmniMark and ElementMap. This fragment takes elements from “<ext.xref pointer='ARTICLE_ID'>content</ext.xref>” to “<ext.xref vol.no='NUM' collection='NUM'>content</ext.xref>”.
;; omnimark code
element ext.xref
   local counter junk
     and stream volnum
     and stream colnum
     and switch successful
   output "<%lq"
   repeat over specified attributes as spec-attr
      output " "
      output key of attribute spec-attr
      output "=%"%hv(spec-attr)%""
   again
   activate successful
   reset junk to system-call "%g(idcommand) --format='vol.no=%%v col.no=%%c' --save-output=%g(TempFile) %v(pointer)"
   do unless file "%g(TempFile)" exists
      deactivate successful
      put #error "Warning: auto-generated file %g(TempFile) not found%n"
      increment ErrorCount
   else
      do scan file "%g(TempFile)"
         match "vol.no=" (letter or digit)+ => vol white-space+ "col.no=" (letter or digit)+ => col
            set buffer volnum to "%x(vol)"
            set buffer colnum to "%x(col)"
         else
            deactivate successful
            put #error "Warning: auto-generated file %g(TempFile) is invalid: [%v(pointer)]%n"
            increment ErrorCount
      done
      ;reset junk to system-call "rm %g(TempFile)"
   done
   do when not active successful
      set buffer volnum to "unknown"
      set buffer colnum to "unknown"
   done
   output " vol.no=%"%g(volnum)%" collection=%"%g(colnum)%""
   output ">%c"
   output "</%lq>"
OmniMark
- has good SGML parsing and support built in
- has a rough equivalent of Data::Locations
- looks really awful (even worse than Perl!)
- lacks major advances in programming language design, like functions
- actually, has improved a lot
  - less pointlessly verbose
  - functions
# this is just thrown together from the omnimark code: there might be errors
# ($output, $idcommand, $TempFile and $ErrorCount are assumed to be set up elsewhere)
$e->element('EXT.XREF', sub {
    my ($engine, $element) = @_;
    my ($attrs, $successful, $line, $volnum, $colnum);
    $output->print("<" . $element->{'Name'});
    $attrs = $element->{'Attributes'};
    # copy the existing attributes through unchanged (treated as a hash here)
    foreach (keys %$attrs) {
        $output->print(" " . $_ . '="' . $attrs->{$_} . '"');
    }
    # list-form system() bypasses the shell, so the format needs no extra quoting
    system($idcommand, "--format=vol.no=%v col.no=%c",
           "--save-output=" . $TempFile, $attrs->{'pointer'});
    OK: {
        $successful = 0;
        if (! -f $TempFile) {
            warn "Warning: auto-generated file " . $TempFile . " not found\n";
            $ErrorCount += 1;
            last OK;
        }
        # $TempFile is a file name, so open it before reading the first line
        if (open(my $fh, '<', $TempFile)) {
            $line = <$fh>;
            close($fh);
        }
        if ($line && $line =~ m/vol\.no=(\w+)\s+col\.no=(\w+)/) {
            $volnum = $1;
            $colnum = $2;
        } else {
            warn "Warning: auto-generated file " . $TempFile .
                 " is invalid: [" . $attrs->{'pointer'} . "]\n";
            $ErrorCount += 1;
            last OK;
        }
        $successful = 1;
    }
    if (!$successful) {
        $volnum = $colnum = 'unknown';
    }
    $output->print(' vol.no="' . $volnum . '" collection="' . $colnum . '">');
    $engine->process_content;
    $output->print("</" . $element->{'Name'} . ">");
});
Other processors
- this style of handler (content processed inside the element handler) is impossible with PerlSAX
  - SAX calls you, so you MUST return from a start element handler before you will get any content events, and the content and end element handlers are called at the same level as the start handler
  - you have to track ALL the document state and your own to-do
  - Note this is only a problem because Perl lacks threads
    - Java encourages threads, so these were natural decisions for SAX
    - but bad decisions for PerlSAX
- pull vs push
  - an event pull API can be translated into a push API
  - the reverse requires threads to partition the parsing from the processing call hierarchy
- XSLT
  - if you want to be able to program in it, then you should use a programming language
  - pointless reinvention fragments progress
  - convolutes programmatic processing with static processing
  - discourages development of good style engines and models
  - discourages development of good hooks and API in above too
  - came after my module anyway
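The pull-to-push direction mentioned above is easy to sketch. The code below is only an illustration: make_pull_parser, push_events and the event shape ([ type, args ]) are hypothetical names invented here, not part of PerlSAX or SGML::ElementMap.

```perl
use strict;
use warnings;

# Hypothetical pull source: each call returns the next event, or undef at the end.
sub make_pull_parser {
    my @events = @_;
    return sub { return shift @events };
}

# Push adapter: one loop owns the pull source and calls the handlers.
sub push_events {
    my ($next_event, $handlers) = @_;
    while (defined(my $event = $next_event->())) {
        my ($type, @args) = @$event;
        my $handler = $handlers->{$type} or next;   # ignore unhandled event types
        $handler->(@args);
    }
    return;
}

my @log;
my $pull = make_pull_parser([ start => 'DOC' ], [ chars => 'hi' ], [ end => 'DOC' ]);
push_events($pull, {
    start => sub { push @log, "start:$_[0]" },
    chars => sub { push @log, "chars:$_[0]" },
    end   => sub { push @log, "end:$_[0]" },
});
```

Going the other way has no such loop to write: the push parser owns the call stack, which is why the notes say it needs threads.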
At this point you should read the SGML::ElementMap documentation and start looking at the ElementMap.pm code.
Why use constants for object data reference?
- obvious, less important: speed (hash lookups much slower than arrays)
- less obvious, more important: correctness (no name typos, no subclass name conflicts, actual name space separation)
- see code for subclass creation (eg Driver::SGMLS)
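A minimal sketch of the idea, assuming array-based objects indexed by constant subs. The package and field names (Sketch::Engine, FIELD_COUNT, DRIVER_STATE) are made up for illustration and are not the module's real ones.

```perl
use strict;
use warnings;

package Sketch::Engine;

# Field indices as constant subs. A misspelled name is a compile-time
# error under strict, unlike a misspelled hash key.
use constant STATE_DATA  => 0;
use constant ALL_MODES   => 1;
use constant GLOBAL_VARS => 2;
use constant FIELD_COUNT => 3;   # first slot free for subclasses

sub new {
    my ($class) = @_;
    my $self = [];
    $self->[STATE_DATA]  = [];
    $self->[ALL_MODES]   = {};
    $self->[GLOBAL_VARS] = {};
    return bless $self, $class;
}

sub global {
    my ($self, $name, @value) = @_;
    $self->[GLOBAL_VARS]{$name} = $value[0] if @value;
    return $self->[GLOBAL_VARS]{$name};
}

package Sketch::Engine::Sub;
our @ISA = ('Sketch::Engine');

# The subclass claims slots after the parent's, so its fields cannot
# collide with parent fields even if the names happen to match.
use constant DRIVER_STATE => Sketch::Engine::FIELD_COUNT();

sub new {
    my ($class) = @_;
    my $self = $class->SUPER::new();
    $self->[DRIVER_STATE] = 'idle';
    return $self;
}

sub driver_state { my ($self) = @_; return $self->[DRIVER_STATE] }

package main;

my $engine = Sketch::Engine::Sub->new;
$engine->global('title', 'ASM Handbook');
```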
What do we do with handlers?
- store them in bunches by handler type
- need to keep ordering for handlers that might match
- separate into two cases: default handlers and name handlers
- use a hash, with '' (the empty string) as the name for default handlers
- only need to keep order under a name and among default handlers
What do the main objects look like? (Notice the colons. This is a kind of structure-describing pseudo-Perl. Nothing formal or correct.)
mode : {
    'handler_type' => handler_set : {
        'NAME' or '' => handler_pair : [ pattern, handler_ref ]
    }
}

$mode = { '_ MODENAME ' => 'FOO',
          '_ FINALIZE ' => '',
          'Element' => {
              'PARA' => [ '.*/SECTION/.*', \&section_para ],
              ''     => [ '', \&no_handler_warning ] },
          'CData' => {
              ''     => [ '', \&data_accumulate ] },
        };
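One way to read this structure is as a lookup that tries the handlers stored under the element name first, in order, and then falls back to the default handlers stored under ''. The sketch below makes several assumptions that the notes do not state: each name holds an ordered list of pairs, patterns are anchored regular expressions matched against a slash-separated path, and an empty pattern matches anything. find_handler is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical handlers, named after the ones in the example structure.
sub section_para       { return 'section_para' }
sub no_handler_warning { return 'no_handler_warning' }

# Each name maps to an ordered list of [ pattern, handler_ref ] pairs.
my $mode = {
    'Element' => {
        'PARA' => [ [ '.*/SECTION/.*', \&section_para ] ],
        ''     => [ [ '',              \&no_handler_warning ] ],
    },
};

sub find_handler {
    my ($mode, $type, $name, $path) = @_;
    my $set = $mode->{$type} or return;
    # Name handlers come first, then default handlers; order within each list is kept.
    for my $key ($name, '') {
        for my $pair (@{ $set->{$key} || [] }) {
            my ($pattern, $handler) = @$pair;
            return $handler if $pattern eq '' || $path =~ /^(?:$pattern)$/;
        }
    }
    return;
}

my $in_section  = find_handler($mode, 'Element', 'PARA', 'DOC/SECTION/PARA');
my $out_section = find_handler($mode, 'Element', 'PARA', 'DOC/PARA');
```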
Mode
- mode has its name saved in it
- current modes is just a list of refs to modes, need to recover names
- modes could easily be arrays
- generally use hashes early on
- trivial to mix around data fields
- might want transient fields
- later if the fields are static, convert to arrays
- finalize field has a ref to the lookup proc and data if the mode is finalized: $finalized_data => [ \&lookup_func, $handler1, $handler2, ... ]
$main = [
    $state_data,
    $all_modes,
    $global_vars,
    $stack_vars
];
$state_data = [
    driver             : SGML::ElementMap::Driver
    node_path          : ''
    handler_modes      : [ $mode, $mode_2, $mode_3, ... ]
    handler_mode_stack : [ $mode_set_1, $mode_set_2, ... ]
    named_handlers     : { 'NAME' => \&handler }
    last_gen_name      : 'aaa'
];
$all_modes = { 'MODE_NAME_1' => $mode,
               'MODE_NAME_2' => $mode_2 };
$global_vars = { 'NAME' => $some_value };
$stack_vars = Hash::Layered;
Why global variable support?
- might simplify cases where handlers cross packages and only the ElementMap object itself is reliably common to all modules. Imagine processing with many interchangeable modules of handlers
- that’s not really necessary, probably remove soon
Why stack variable support?
- need to collect and manage data along the doc hierarchy, just like with code
- otherwise handlers will have problems when an element is nested inside another element of the same type, so that a handler runs beneath itself
- technically local() is good enough, but have had many troubles with local()
- the special modes for layers add some extra possibility
- maybe should not have this hardwired into the processing: hooks instead
Drivers
Different processors need different interfaces to work with them
- the core data is the same, as is the structure
- in-memory vs serialized vs parsing quality/detail
# these can default to Driver methods
$d->input($type); # 'file' 'literal' 'handle' etc.
$d->markup($type); # 'xml' or 'sgml'
$d->parser($parser_object);
$d->process_xml_file($elementmap, $file, @handler_args);
$d->process_sgml_file($elementmap, $file, @handler_args);
# these must be implemented in Driver sub-classes
$d->process(...);
$d->reparent_current_subtree($new_el_name, @attribute_pairs);
$d->reparent_subtree($new_el_name, @attribute_pairs);
$d->dispatch_subtrees($elementmap, $pattern, @handler_args);
$d->skip_subtrees();
$d->context_path();
The simple event-based drivers have a lot in common, so we have Driver::EventQueue:
- maintains a control stack and a parse event queue
- the event queue is filled by calling the actual driver’s queue_more_events method
- the control stack says what to do with events in the event queue
- need to notice when the current element closes
- reparenting requires insertion of synthesized events at the right places
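A rough sketch of the queue idea, not the real Driver::EventQueue. To stay short it uses Perl's own call stack in place of an explicit control stack, takes its events from a prepared list, and leaves out reparenting. Sketch::EventQueue, next_event and skip_subtree are names invented here, and queue_more_events is only a stand-in for the real driver method.

```perl
use strict;
use warnings;

package Sketch::EventQueue;

sub new {
    my ($class, @source) = @_;
    return bless { queue => [], source => [@source] }, $class;
}

# Stand-in for the real driver's queue_more_events: move one event over.
sub queue_more_events {
    my ($self) = @_;
    return 0 unless @{ $self->{source} };
    push @{ $self->{queue} }, shift @{ $self->{source} };
    return 1;
}

sub next_event {
    my ($self) = @_;
    $self->queue_more_events unless @{ $self->{queue} };
    return shift @{ $self->{queue} };
}

# Dispatch the children of the current element, returning when it closes.
sub process_content {
    my ($self, $handlers) = @_;
    while (defined(my $event = $self->next_event)) {
        my ($type, $name) = @$event;
        return if $type eq 'end';               # the current element has closed
        next unless $type eq 'start';
        my $handler = $handlers->{$name} || $handlers->{''};
        $handler->($self, $name);               # the handler must consume the element
    }
    return;
}

# Discard the rest of the current element, nested children included.
sub skip_subtree {
    my ($self) = @_;
    my $depth = 1;
    while ($depth > 0 && defined(my $event = $self->next_event)) {
        $depth++ if $event->[0] eq 'start';
        $depth-- if $event->[0] eq 'end';
    }
    return;
}

package main;

my @seen;
my %handlers;
%handlers = (
    'NOTE' => sub { my ($queue, $name) = @_; push @seen, "skip $name"; $queue->skip_subtree },
    ''     => sub {
        my ($queue, $name) = @_;
        push @seen, "open $name";
        $queue->process_content(\%handlers);
        push @seen, "close $name";
    },
);
my $events = Sketch::EventQueue->new(
    [ start => 'DOC' ], [ start => 'NOTE' ], [ start => 'PARA' ], [ end => 'PARA' ],
    [ end => 'NOTE' ],  [ start => 'PARA' ], [ end => 'PARA' ],   [ end => 'DOC' ],
);
$events->process_content(\%handlers);
```

Because each handler calls process_content (or skip_subtree) itself, the pre-content and post-content code for an element sit next to each other, which is the point of the processing model.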
Hash::Layered
Sample execution:
$h->set_default('cascade');
$h->{'a'} = 31;
$h->{'b'} = 32;
$h->{'c'} = 33;
$h->push;
cascade | a  | b  | c  | d  | e  |
default | 31 | 32 | 33 |    |    |
default |    |    |    |    |    |
assert($h->{'a'} == 31);
$h->set_layer('opaque');
assert(! defined $h->{'a'});
$h->{'c'} = 34;
$h->{'d'} = 35;
cascade | a  | b  | c  | d  | e  |
default | 31 | 32 | 33 |    |    |
opaque  |    |    | 34 | 35 |    |
$h->set_layer('default');
assert($h->{'a'} == 31);
assert($h->{'b'} == 32);
assert($h->{'c'} == 34);
assert($h->{'d'} == 35);
$h->push;
$h->{'a'} = 36;
$h->set_layer('oneway');
$h->{'e'} = 37;
$h->{'a'} = 38;
cascade | a  | b  | c  | d  | e  |
default | 36 | 32 | 33 |    |    |
default |    |    | 34 | 35 |    |
oneway  | 38 |    |    |    | 37 |
assert($h->{'a'} == 38);
assert($h->{'b'} == 32);
assert($h->{'e'} == 37);
$h->pop;
assert(! defined $h->{'e'});
$h->pop;
assert($h->{'a'} == 36);
assert($h->{'b'} == 32);
assert($h->{'c'} == 33);
assert(! defined $h->{'d'});
The goal is to use the object as a hash reference while still having access to object methods. I initially tried this with a single object, but that did not work. Unfortunately I don’t have notes, but I think the issue was getting at the underlying data structure to work with. With a single object, it is harder for a method to tell, when it is called, whether it needs to call tied() (note that the hash ref and the object ref would be blessed into the same class, so ref() won’t help). Using two objects makes this very easy.
(Note: Haven’t converted to use sub constants for object fields.)
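The two-object arrangement might look roughly like the sketch below. This is a guess at the general shape, not the actual Hash::Layered code: Sketch::Layered is the reference the user holds, Sketch::LayeredImpl is the tied object behind it, and both names, along with push_layer and depth, are hypothetical. There is no real layering here, only the plumbing.

```perl
use strict;
use warnings;

# Back-end object: holds the real data and implements the tie interface.
package Sketch::LayeredImpl;

sub TIEHASH  { my ($class) = @_; return bless { data => {}, depth => 0 }, $class }
sub FETCH    { my ($self, $key) = @_; return $self->{data}{$key} }
sub STORE    { my ($self, $key, $value) = @_; $self->{data}{$key} = $value; return }
sub EXISTS   { my ($self, $key) = @_; return exists $self->{data}{$key} }
sub DELETE   { my ($self, $key) = @_; return delete $self->{data}{$key} }
sub FIRSTKEY { my ($self) = @_; my $reset = keys %{ $self->{data} }; return scalar each %{ $self->{data} } }
sub NEXTKEY  { my ($self) = @_; return scalar each %{ $self->{data} } }

# Front-end object: a blessed reference to the tied hash.
package Sketch::Layered;

sub new {
    my ($class) = @_;
    tie my %hash, 'Sketch::LayeredImpl';
    return bless \%hash, $class;
}

# Methods always reach the back end through tied(), so there is never
# any doubt about which of the two objects they were handed.
sub push_layer {
    my ($self) = @_;
    my $impl = tied %$self;
    return ++$impl->{depth};
}

sub depth { my ($self) = @_; return tied(%$self)->{depth} }

package main;

my $h = Sketch::Layered->new;
$h->{'a'} = 31;      # hash access goes through STORE
$h->push_layer;      # method call on the same reference
```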
Have two places for behavior settings
- individual layer
- “default”
  - used when the layer has “default” semantics
  - lets you change behavior easily for most of the layers
Layers have IDs
- combined from an id in the layer_data_list and the layer_data_count
- lets us reuse layer_data_list arrays
- we don’t actually use them for anything, they are more to catch errors
intervening_layer($target_index, $is_write)
- key internal function
- searches for an opaque layer between the current layer and the target_index
- target_index can be ‘top’
- $is_write is 0 for read, 1 for write, and 2 for a new key
- should use a cache, but I haven’t put it in
  - fancy behaviors mean you need separate caches for reads and writes
  - haven’t considered the cost of that extra caching
- note I don’t call _intervening_layer as an object method
  - it’s already slow
  - it’s called from everywhere
behaviors for the hash
- opaque : completely blocks other layers
- transparent : ignored unless it already has the key
- hidden : ignored entirely
- oneway : opaque to writes, transparent to reads
- cascade : opaque to new keys, transparent to existing keys
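The behaviors can be read as a single decision: does a given layer capture an access, or let it pass through to the layers beneath? The function below is one interpretation of the list above, using the same 0/1/2 convention as $is_write. layer_captures is a hypothetical name, and the treatment of a layer that already holds the key is an assumption drawn from the sample execution rather than from the code.

```perl
use strict;
use warnings;

# $is_write: 0 = read, 1 = write to an existing key, 2 = write of a new key.
# Returns 1 if the layer handles the access itself, 0 if it passes through.
sub layer_captures {
    my ($behavior, $is_write, $layer_has_key) = @_;
    return 1 if $behavior eq 'opaque';        # blocks everything
    return 0 if $behavior eq 'hidden';        # ignored entirely
    return 1 if $layer_has_key;               # other behaviors see their own keys
    return 0 if $behavior eq 'transparent';
    return $is_write ? 1 : 0      if $behavior eq 'oneway';   # blocks writes only
    return $is_write == 2 ? 1 : 0 if $behavior eq 'cascade';  # blocks new keys only
    die "unknown behavior '$behavior'";
}
```

For instance, in the sample execution the write of 36 to 'a' from a cascade layer passes through to the bottom layer because 'a' already exists there, while the write of 37 to the new key 'e' stays in the oneway layer.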
OK, OK, how does it work?
$layered_hash = [
    $default_layer_state : 'cascade'
    $layer_data_count    : -1
    $layer_data_list     : [ $layer_data_1, $layer_data_2, ... ]
    $var_val_stack_hash  : { }
    $iter_data           : [ [ keys ], key_index, intervening_layer_for_reads ]
];
$layer_data = [
    $sub_id,
    $behavior,
    'VAR1',    # list of all variables that have values in this layer
    'VAR2',
    ...
];
$var_val_stack_hash = {
    'VAR1' => [ $layer_index_1, $val_1, $layer_index_2, $val_2, ... ]
    ...
};
Huh?
- each layer has an id, a behavior, and a list of variable keys
- each variable key (NOT LAYER) has
  - a list of pairs: a layer index and a value
Lookup of a key
- find the key in the var_val_stack_hash
- get last two elements: layer index and value
- see if there is an opaque layer between the current layer and the one with the value
Iteration
- get a list of all keys
- lookup intervening opaque layer for reads
- skip any keys blocked by the opaque layer
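The lookup and iteration steps can be sketched against the data layout shown earlier. This is a reduced model for illustration only: it handles reads, knows just the 'opaque' and 'transparent' behaviors, and treats the last layer in the list as the current one. intervening_opaque, lookup and visible_keys are hypothetical names.

```perl
use strict;
use warnings;

# Layers are [ id, behavior, keys... ]; the last one is the current layer.
my @layers = (
    [ 0, 'transparent', 'a', 'b' ],
    [ 1, 'opaque',      'c' ],
    [ 2, 'transparent' ],
);

# Per key: a flat list of (layer index, value) pairs, lowest layer first.
my %var_val_stack = (
    a => [ 0, 31 ],
    b => [ 0, 32 ],
    c => [ 1, 34 ],
);

# Index of the highest opaque layer above $target_index, or -1 if none.
sub intervening_opaque {
    my ($layers, $target_index) = @_;
    for (my $i = $#$layers; $i > $target_index; $i--) {
        return $i if $layers->[$i][1] eq 'opaque';
    }
    return -1;
}

sub lookup {
    my ($layers, $stack, $key) = @_;
    my $pairs = $stack->{$key} or return;
    my ($layer_index, $value) = @$pairs[-2, -1];    # the last pair wins
    return if intervening_opaque($layers, $layer_index) >= 0;
    return $value;
}

sub visible_keys {
    my ($layers, $stack) = @_;
    my $barrier = intervening_opaque($layers, -1);  # computed once per iteration
    my @visible = grep { $stack->{$_}[-2] >= $barrier } keys %$stack;
    return sort @visible;
}

my $blocked = lookup(\@layers, \%var_val_stack, 'a');   # hidden behind layer 1
my $seen    = lookup(\@layers, \%var_val_stack, 'c');   # stored in the opaque layer itself
my @keys    = visible_keys(\@layers, \%var_val_stack);
```

A value stored in the opaque layer itself stays visible; only values in layers beneath it are blocked.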