Hierarchical, Stochastic Wrapper Induction


Statistical methods seem well suited for HTML wrapper induction. They have proven to be quickly learned from limited examples as well as robust with respect to variations in both the structure and the content of new pages.

However, traditional stochastic models, such as hidden Markov models, can have problems parsing hierarchical structures.

I propose a probablistic structure designed specifically for modelling trees. The model is based on a set of probabilities at each node representing how many times the node appears as a child of its parent.

Existing Work

Kai Shih has developed a probablistic model which takes into account the tree structure of hierarchies. When labling existing data or classifying new data, the tendencies of nearby nodes in the tree to have a given attribute is considered. This is done using a "tendency to mutate" between parent and child nodes in the tree.

The RoadRunner system, by Crescenzi, et. al., implements fully automated wrapper generation on CGI-generated pages without using pre-labeled data. It compares pages from a query set and attempts to find a regular expression which matches all pages. This is done by iterating through the tokens comprising two sample pages, looking for mismatches. These mismatches are used to find optional parts of the page, iteratively generated records, and recursive data structures.

Muslea, et. al. have augmented the standard HLRT (head-left-right-tail) wrapper induction algorithm with a hierarchical overlay. The user specifies a hierarchical structure, or "embedded catalog" for the page, and a wrapper is learned from examples with this structure as a guide.

Approach

Unresolved Issues

Extensions

Other Approaches


References

Crescenzi, V., Mecca, G., and Merialdo, P. Roadrunner: Towards automatic data extraction from large web sites. Technical Report n. RT-DIA-64-2001, D.I.A., Universit a di Roma Tre, 2001. http://citeseer.nj.nec.com/crescenzi01roadrunner.html

Miller, R., Myers, B. Lightweight Structured Text Processing. In Proceedings of USENIX 1999 Annual Technical Conference, June 1999, Monterey, CA. http://www-2.cs.cmu.edu/~rcm/papers/usenix99/

Muslea, I., Minton, S., and Knoblock, C. 1999. A hierarchical approach to wrapper induction. In Proceedings of the Third International Conference on Autonomous Agents (Agents'99), Seattle, WA. http://citeseer.nj.nec.com/muslea99hierarchical.html

Seymore, K., McCallum, A., Rosenfeld, R. Learning Hidden Markov Model Structure for Information Extraction. In AAAI Workshop on Machine Learning for Information Extraction, 1999. http://citeseer.nj.nec.com/seymore99learning.html.

Shih, L., Karger, D. Learning Classes Correlated to a Hierarchy. MIT AI Lab Technical Note.