As stated previously, probablistic models, including hidden Markov models, seem to offer a robust, fault-tolerant means for retrieving data from machine-generated (CGI) HTML pages.
Earlier work with HMMs has focused on parsing pages as a sequence of tokens (tags and text), but has ignored the hierarchical nature of HTML. By modifying the state structure of a hidden Markov model to more closely model the tree-structure of HTML, the location of relevant data within the tree may be more closely modeled. This structure overcomes some limitations of HMMs which use sequentially parsed HTML in dealing with hierarchical, recursive data structures.
Kai Shih has developed a probablistic model which takes into account the tree structure of hierarchies. When labling existing data or classifying new data, the tendencies of nearby nodes in the tree to have a given attribute is considered. This is done using a "tendency to mutate" between parent and child nodes in the tree.
The RoadRunner system, by Crescenzi, et. al., implements fully automated wrapper generation on CGI-generated pages without using pre-labeled data. It compares pages from a query set and attempts to find a regular expression which matches all pages. This is done by iterating through the tokens comprising two sample pages, looking for mismatches. These mismatches are used to find optional parts of the page, iteratively generated records, and recursive data structures.
Muslea, et. al. have augmented the standard HLRT (head-left-right-tail) wrapper induction algorithm with a hierarchical overlay. The user specifies a hierarchical structure, or "embedded catalog" for the page, and a wrapper is learned from examples with this structure as a guide.
More existing work is summarized here.
|Σj aij = 1||1 ≤ i, j ≤ N,||aij ≥ 0||(1)|
Each state also has a set of emission probabilities bj(k), for observatoins k out of the set M, such that
|Σk bj(k) = 1||1 ≤ j ≤ |N|,||1 ≤ k ≤ |M|,||bj(k) ≥ 0||(2)|
The transition probabilities between states are now dependent on which child ε we are discussing. Thus, (1) becomes
|Σj aij(ε) = 1||1 ≤ i, j ≤ |N|,||aij ≥ 0,||1 ≤ ε ≤ |Ε|||(3)|
Crescenzi, V., Mecca, G., and Merialdo, P. Roadrunner: Towards automatic data extraction from large web sites. Technical Report n. RT-DIA-64-2001, D.I.A., Universit a di Roma Tre, 2001. http://citeseer.nj.nec.com/crescenzi01roadrunner.html
Miller, R., Myers, B. Lightweight Structured Text Processing. In Proceedings of USENIX 1999 Annual Technical Conference, June 1999, Monterey, CA. http://www-2.cs.cmu.edu/~rcm/papers/usenix99/
Muslea, I., Minton, S., and Knoblock, C. 1999. A hierarchical approach to wrapper induction. In Proceedings of the Third International Conference on Autonomous Agents (Agents'99), Seattle, WA. http://citeseer.nj.nec.com/muslea99hierarchical.html
Seymore, K., McCallum, A., Rosenfeld, R. Learning Hidden Markov Model Structure for Information Extraction. In AAAI Workshop on Machine Learning for Information Extraction, 1999. http://citeseer.nj.nec.com/seymore99learning.html.
Shih, L., Karger, D. Learning Classes Correlated to a Hierarchy. MIT AI Lab Technical Note.