Hierarchical HMMs

As stated previously, probablistic models, including hidden Markov models, seem to offer a robust, fault-tolerant means for retrieving data from machine-generated (CGI) HTML pages.

Earlier work with HMMs has focused on parsing pages as a sequence of tokens (tags and text), but has ignored the hierarchical nature of HTML. By modifying the state structure of a hidden Markov model to more closely model the tree-structure of HTML, the location of relevant data within the tree may be more closely modeled. This structure overcomes some limitations of HMMs which use sequentially parsed HTML in dealing with hierarchical, recursive data structures.

Existing Work

Kai Shih has developed a probablistic model which takes into account the tree structure of hierarchies. When labling existing data or classifying new data, the tendencies of nearby nodes in the tree to have a given attribute is considered. This is done using a "tendency to mutate" between parent and child nodes in the tree.

The RoadRunner system, by Crescenzi, et. al., implements fully automated wrapper generation on CGI-generated pages without using pre-labeled data. It compares pages from a query set and attempts to find a regular expression which matches all pages. This is done by iterating through the tokens comprising two sample pages, looking for mismatches. These mismatches are used to find optional parts of the page, iteratively generated records, and recursive data structures.

Muslea, et. al. have augmented the standard HLRT (head-left-right-tail) wrapper induction algorithm with a hierarchical overlay. The user specifies a hierarchical structure, or "embedded catalog" for the page, and a wrapper is learned from examples with this structure as a guide.

More existing work is summarized here.

Approach

Standard HMMs deal with linear structure in both states and observations. That is, the sequence of observations is linear, and while in the sequence of states that produces them individual states may have many successor states, only one is chosen at any given time by the probablistic mode.
In a standard HMM, a single state i, from a set of states N, has a set of state transition probabilities aij, such that

Σj aij = 1 1 ≤ i, j ≤ N, aij ≥ 0 (1)

Each state also has a set of emission probabilities bj(k), for observatoins k out of the set M, such that

Σk bj(k) = 1 1 ≤ j ≤ |N|, 1 ≤ k ≤ |M|, bj(k) ≥ 0 (2)
To accomodate the tree structure of HTML, we modify the HMM structure to allow state sequences to branch, or have more than one successor node at a time.
We introduce ε to represent the sequence of child branches from a given state. That is, ε = 1 represents the first child of a state, ε = 2 represents the second child, etc. Εi represents the set of children of state i.
The transition probabilities between states are now dependent on which child ε we are discussing. Thus, (1) becomes

Σj aij(ε) = 1 1 ≤ i, j ≤ |N|, aij ≥ 0, 1 ≤ ε ≤ |Ε| (3)

Unresolved Issues

Need to work out implications of ε on the Viterbi and Baum-Welch algorithms. Probably attack using depth-first search through the tree.

Extensions

Once a hierarchical structure has been established for HTML pages, we can easily incorporate transitions between pages (such as forms and links) by rooting the target pages' trees within the originating page's tree at the point of the transition.

Other Approaches

See notes on using Node ID Probabilities to retrieve data from an HTML tree.

References

Crescenzi, V., Mecca, G., and Merialdo, P. Roadrunner: Towards automatic data extraction from large web sites. Technical Report n. RT-DIA-64-2001, D.I.A., Universit a di Roma Tre, 2001. http://citeseer.nj.nec.com/crescenzi01roadrunner.html

Miller, R., Myers, B. Lightweight Structured Text Processing. In Proceedings of USENIX 1999 Annual Technical Conference, June 1999, Monterey, CA. http://www-2.cs.cmu.edu/~rcm/papers/usenix99/

Muslea, I., Minton, S., and Knoblock, C. 1999. A hierarchical approach to wrapper induction. In Proceedings of the Third International Conference on Autonomous Agents (Agents'99), Seattle, WA. http://citeseer.nj.nec.com/muslea99hierarchical.html

Seymore, K., McCallum, A., Rosenfeld, R. Learning Hidden Markov Model Structure for Information Extraction. In AAAI Workshop on Machine Learning for Information Extraction, 1999. http://citeseer.nj.nec.com/seymore99learning.html.

Shih, L., Karger, D. Learning Classes Correlated to a Hierarchy. MIT AI Lab Technical Note.