Hierarchical, Stochastic Wrapper Induction

Statistical methods seem well suited for HTML wrapper induction. They have proven to be quickly learned from limited examples as well as robust with respect to variations in both the structure and the content of new pages.

However, traditional stochastic models, such as hidden Markov models, can have problems parsing hierarchical structures.

I propose a probablistic structure designed specifically for modelling trees. The model is based on a set of probabilities at each node representing how many times the node appears as a child of its parent.

Existing Work

Kai Shih has developed a probablistic model which takes into account the tree structure of hierarchies. When labling existing data or classifying new data, the tendencies of nearby nodes in the tree to have a given attribute is considered. This is done using a "tendency to mutate" between parent and child nodes in the tree.

The RoadRunner system, by Crescenzi, et. al., implements fully automated wrapper generation on CGI-generated pages without using pre-labeled data. It compares pages from a query set and attempts to find a regular expression which matches all pages. This is done by iterating through the tokens comprising two sample pages, looking for mismatches. These mismatches are used to find optional parts of the page, iteratively generated records, and recursive data structures.

Muslea, et. al. have augmented the standard HLRT (head-left-right-tail) wrapper induction algorithm with a hierarchical overlay. The user specifies a hierarchical structure, or "embedded catalog" for the page, and a wrapper is learned from examples with this structure as a guide.

Approach

Begin with a single example HTML page of the site to be marked up.
Parse the HTML into a tree structure (for example, following the algorithm of the HTML::TreeBuilder perl module)
Assign probabilities εij(n) to each node.
- εij(n) is defined as the probability that node j exists n times as a child of node i.
- Compare the child subtrees the given node, looking for subtrees which are identical.
- Assign εij(n) with n equal to the count for each set of identical subtrees found.
Once we have the full tree model from the first page, overlay a second marked-up example page.
Adjust the εij(n) at each node based on the new page.
Continue until all examples have been applied.
To retrieve markup for a non-example page, begin at top level of tree and descend, following most probable route to data.

Unresolved Issues

Should text (non-markup elements) be ignored when comparing subtrees? Presumably, any important text will have been marked up by the user...
When determining the εij(n), how do we deal with the ordering of children at each node? There may be more than one type of child at each node, and they may be intertwined. Ordering definitely matters in HTML, but should it matter in a probablistic model which is just looking for the location of certain elements?

Extensions

Once a hierarchical structure has been established for HTML pages, we can easily incorporate links between pages (such as forms and links) by rooting the target pages' trees within the originating page's tree at the point of the link.

Other Approaches

TreeBuilder algorithms assign a unique code to each node of the tree, based on which child node was traversed at each level to get there (e.g. "0.1.0" represents the first child of the second child of the root). Using the code of each relevant piece of data in the tree, over several examples, could provide another way to probablistically map the location of data in the tree.

References

Crescenzi, V., Mecca, G., and Merialdo, P. Roadrunner: Towards automatic data extraction from large web sites. Technical Report n. RT-DIA-64-2001, D.I.A., Universit a di Roma Tre, 2001. http://citeseer.nj.nec.com/crescenzi01roadrunner.html

Miller, R., Myers, B. Lightweight Structured Text Processing. In Proceedings of USENIX 1999 Annual Technical Conference, June 1999, Monterey, CA. http://www-2.cs.cmu.edu/~rcm/papers/usenix99/

Muslea, I., Minton, S., and Knoblock, C. 1999. A hierarchical approach to wrapper induction. In Proceedings of the Third International Conference on Autonomous Agents (Agents'99), Seattle, WA. http://citeseer.nj.nec.com/muslea99hierarchical.html

Seymore, K., McCallum, A., Rosenfeld, R. Learning Hidden Markov Model Structure for Information Extraction. In AAAI Workshop on Machine Learning for Information Extraction, 1999. http://citeseer.nj.nec.com/seymore99learning.html.

Shih, L., Karger, D. Learning Classes Correlated to a Hierarchy. MIT AI Lab Technical Note.