secondthought.org
think again...

Tree Pattern Inference and Matching for
Wrapper Induction on the World Wide Web

Andrew Hogue

Master of Engineering in
Electrical Engineering and Computer Science

Massachusetts Institute of Technology

June 2004

Abstract: We develop a method for learning patterns from a set of positive examples in order to retrieve semantic content from tree-structured data. Specifically, we focus on HTML documents on the World Wide Web, which contain a wealth of semantic information and have a useful underlying tree structure. Through a simple interface in a web browser, a user provides examples of the data they wish to extract from a web site. To construct a pattern, we use the edit distance between the subtrees these examples represent to distill them into a more general form. This pattern may then be used to retrieve other instances of the selected data from the same page or from similar pages. By linking patterns and their components with semantic labels using RDF, we can create semantic "overlays" for Web information, which are useful in projects such as the Semantic Web and the Haystack information management environment.
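
As a rough illustration of the idea in the abstract, the sketch below generalizes two example subtrees into one pattern and then matches that pattern against other subtrees. It is not the algorithm from the thesis: it assumes a simplified top-down edit distance (relabeling a node costs 1, inserting or deleting a child costs the size of its whole subtree), and the names Node, WILDCARD, distance, generalize, and matches are hypothetical, chosen only for this example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

WILDCARD = "*"  # hypothetical marker: a pattern node that matches any tag


@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)


def size(n: Node) -> int:
    """Number of nodes in the subtree rooted at n."""
    return 1 + sum(size(c) for c in n.children)


def distance(a: Node, b: Node) -> int:
    """Simplified top-down edit distance between two ordered trees."""
    relabel = 0 if a.tag == b.tag else 1
    return relabel + _align(a.children, b.children)[0]


def _align(xs: List[Node], ys: List[Node]) -> Tuple[int, List[Tuple[Node, Node]]]:
    """Edit distance over two child lists, plus the pairs that were aligned.

    Recomputes sub-distances without caching, so it suits small examples only.
    """
    m, n = len(xs), len(ys)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(xs[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(ys[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(xs[i - 1]),                    # delete subtree
                d[i][j - 1] + size(ys[j - 1]),                    # insert subtree
                d[i - 1][j - 1] + distance(xs[i - 1], ys[j - 1]),  # align pair
            )
    # Walk back through the table to recover which children were aligned.
    pairs = []
    i, j = m, n
    while i > 0 and j > 0:
        if d[i][j] == d[i - 1][j - 1] + distance(xs[i - 1], ys[j - 1]):
            pairs.append((xs[i - 1], ys[j - 1]))
            i, j = i - 1, j - 1
        elif d[i][j] == d[i - 1][j] + size(xs[i - 1]):
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return d[m][n], pairs


def generalize(a: Node, b: Node) -> Node:
    """Keep the structure both examples share; differing tags become wildcards."""
    tag = a.tag if a.tag == b.tag else WILDCARD
    _, pairs = _align(a.children, b.children)
    return Node(tag, [generalize(x, y) for x, y in pairs])


def matches(pattern: Node, node: Node) -> bool:
    """True if the pattern's children appear, in order, among the node's children."""
    if pattern.tag != WILDCARD and pattern.tag != node.tag:
        return False
    remaining = iter(node.children)  # shared iterator enforces left-to-right order
    return all(any(matches(pc, c) for c in remaining) for pc in pattern.children)


# Two user-selected examples: table rows that differ in one tag and one extra cell.
ex1 = Node("tr", [Node("td", [Node("b")]), Node("td", [Node("a")])])
ex2 = Node("tr", [Node("td", [Node("i")]), Node("td", [Node("a")]), Node("td")])
pattern = generalize(ex1, ex2)  # tr[ td[*], td[a] ]
```

Here the pattern keeps the two cells the examples share, replaces the differing b and i tags with a wildcard, and drops the extra trailing cell. Calling matches(pattern, row) on another row with the same shape would then select it, while a row with only one cell would be rejected.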

    Postscript (3.1 MB)
    Adobe Acrobat (PDF) (1.6 MB)

ahogue at secondthought dot org