RDF-aided Wrapper Induction

Wrapper induction is a method for automatically constructing "wrappers": scripts that automate the retrieval of information from a lightly structured information resource. In several forms, wrapper induction has been shown to be quite successful at consistently and accurately retrieving relational data from HTML-encoded web pages.

The Semantic Web is "an extension to the current web" which uses the RDF standard to mark up relational data in a form easily consumed by machines. RDF consists of subject-predicate-object triples, which allow users to superimpose relational structure on top of a normal web page.
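
To make the triple structure concrete, here is a minimal sketch of how a handful of RDF statements about one page might be held in memory. The `Triple` type, the `objects` helper, and the example.org URIs are all assumptions made for illustration; they are not part of any RDF library.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical URIs, used only for illustration.
PAGE = "http://example.org/movies/42"
TITLE = "http://example.org/schema#title"
YEAR = "http://example.org/schema#year"

triples = [
    Triple(PAGE, TITLE, "Casablanca"),
    Triple(PAGE, YEAR, "1942"),
]


def objects(triples, subject, predicate):
    """Return every object asserted for the given subject and predicate."""
    return [t.obj for t in triples
            if t.subject == subject and t.predicate == predicate]


print(objects(triples, PAGE, TITLE))
```

Each triple attaches one relational fact to the page, which is the kind of labeled example a wrapper learner could train on.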

I believe there is a remarkable compatibility between the goals of the Semantic Web and the abilities of wrapper induction. I also believe that using RDF markup to augment the wrapper induction process will greatly improve both the ease with which users can mark up documents in RDF and the accuracy of wrapper induction.

Existing Work

Wrapper induction was initially developed by Kushmerick and Freitag, et al. It takes several forms, depending on the structure of the document, but all rely on finding delimiters that mark the left and right boundaries of the information to be extracted. In HTML, these delimiters are usually markup tags, such as <p> and </p>. Unfortunately, wrapper induction's biggest weakness is its inflexibility: small changes in the source can incapacitate a wrapper, and wrappers are ill-suited to complicated pages where important data is intermixed with advertisements, irrelevant text, and extraneous markup. Initial implementations were found to successfully wrap only 48% of sampled internet resources.
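
The delimiter idea can be sketched as follows. This shows only how a wrapper is applied once its left and right delimiters are known; the induction step, which learns the delimiters from labeled examples, is not shown. The function name `lr_extract` and the sample pages are assumptions made for illustration.

```python
def lr_extract(page, delimiters):
    """Apply a left/right-delimiter wrapper to `page`.

    `delimiters` holds one (left, right) pair per attribute. The page is
    scanned left to right, and one tuple is emitted each time every
    attribute has been found in order.
    """
    rows, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return rows          # no further records
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return rows          # unterminated field: stop
            row.append(page[start:end])
            pos = end + len(right)
        rows.append(tuple(row))


wrapper = [("<b>", "</b>"), ("<i>", "</i>")]
clean = "<html><b>Congo</b> <i>242</i><br><b>Egypt</b> <i>20</i><br></html>"
print(lr_extract(clean, wrapper))

# A single extra bold element (say, an advertisement) shifts the first record.
noisy = "<b>Sale!</b><br><b>Congo</b> <i>242</i><br><b>Egypt</b> <i>20</i><br>"
print(lr_extract(noisy, wrapper))
```

On the second page the first tuple pairs the advertisement text with the first country's code, which is the kind of brittleness described above.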

Freitag developed SRV, an information extraction algorithm based on relational learning. SRV induces rules based on examples by mapping them to a set of tokens. Enhancing this token set with HTML markup tends to increase the effectiveness of SRV. SRV is more flexible than Wrapper Induction, and works well with documents in which data may not be in the same location from page to page. Coupling this algorithm with RDF might prove to be an effective user-centric means for information extraction.

The RoadRunner system, by Crescenzi et al., implements fully automated wrapper generation on CGI-generated pages without using pre-labeled data. It compares pages from a query set and attempts to find a regular expression which matches all pages. From this regular expression, it infers the underlying relational structure of the database that generated the page. The algorithm is extremely fast on most data sets, and is able to handle recursion and iteratively generated data. While not immediately applicable to users tagging HTML data with RDF, this work may be useful in discovering patterns and augmenting other approaches that utilize user-generated markup.
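
The following is a heavily simplified sketch of the general idea of comparing pages to obtain a shared regular expression. It is not the RoadRunner algorithm: it aligns only two pages, uses Python's difflib for the alignment, and turns every mismatch into a wildcard field, with no handling of optional or repeated structure. The names `tokenize` and `generalize` are my own.

```python
import re
from difflib import SequenceMatcher


def tokenize(html):
    """Split a page into HTML tags and whitespace-separated words."""
    return re.findall(r"<[^>]+>|[^<\s]+", html)


def generalize(page_a, page_b):
    """Build a regex that keeps tokens shared by both pages and replaces
    every mismatching region with a capturing wildcard."""
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    pattern = []
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op == "equal":
            pattern.extend(re.escape(tok) for tok in a[i1:i2])
        else:
            pattern.append("(.+?)")   # mismatch: assume a data field
    return r"\s*".join(pattern)


page_a = "<html><b>Title:</b> Casablanca</html>"
page_b = "<html><b>Title:</b> Vertigo</html>"
pattern = generalize(page_a, page_b)

# The inferred pattern also applies to a page it has not seen.
unseen = "<html><b>Title:</b> Rear Window</html>"
print(re.fullmatch(pattern, unseen).group(1))
```

The captured groups play the role of the inferred data fields; recovering nested or repeated records would need considerably more machinery than this.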

Hidden Markov models have been implemented in several ways to model the flow of text and markup in a document. Their stochastic nature is very tolerant of small variations in the source. Some of the implementations include:

Leek specifies a rigid initial structure, and then learns words associated with each state to extract gene names and locations from biological papers.
Freitag and McCallum learn both the structure and the content of HMM states by starting with a generic design (background text, prefixes, target states, and suffixes) and then splitting and combining states when appropriate. Each HMM is restricted to extract only one class of data.
Seymore et al. begin by specifying a class for each word in the training data, with connections from each word to the next. Structure is then learned by several merging techniques, combining nearby redundant states with the same class.
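
As a rough illustration of how an HMM of this kind can be used for extraction, the sketch below decodes a token sequence with the Viterbi algorithm. The four states follow the generic design mentioned above (background, prefix, target, suffix). The probabilities are hand-set, hypothetical values rather than learned ones, and the observation symbols are simplified token classes of my own choosing.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for `obs` (log-space Viterbi)."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s] = (log-probability of the best path ending in s, previous state)
    best = [{s: (log(start_p.get(s, 0)) + log(emit_p[s].get(obs[0], 0)), None)
             for s in states}]
    for o in obs[1:]:
        column = {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p][0] + log(trans_p[p].get(s, 0))) for p in states),
                key=lambda item: item[1],
            )
            column[s] = (score + log(emit_p[s].get(o, 0)), prev)
        best.append(column)

    # Backtrack from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Hypothetical, hand-set model: the target is the text inside <b>...</b>.
states = ["background", "prefix", "target", "suffix"]
start_p = {"background": 1.0}
trans_p = {
    "background": {"background": 0.7, "prefix": 0.3},
    "prefix": {"target": 1.0},
    "target": {"target": 0.5, "suffix": 0.5},
    "suffix": {"background": 1.0},
}
emit_p = {
    "background": {"word": 0.8, "<b>": 0.1, "</b>": 0.1},
    "prefix": {"<b>": 1.0},
    "target": {"word": 1.0},
    "suffix": {"</b>": 1.0},
}

obs = ["word", "<b>", "word", "word", "</b>", "word"]
path = viterbi(obs, states, start_p, trans_p, emit_p)
extracted = [i for i, s in enumerate(path) if s == "target"]
print(path)
print(extracted)
```

Tokens decoded as "target" are the ones extracted. In a learned model the same decoding applies, but the probabilities would come from training data.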

Approach

HMMs are stochastic and fault-tolerant, and they model a document as transitions between classed states, which is reminiscent of the way RDF models relational data. My intuition is therefore that HMM-based wrapper induction would work best with RDF markup.

Ideas:

use HMMs to model the flow of structured information through the page
user assigns RDF tags to relevant text
following Seymore's approach, each token in the document is assigned a class:
- RDF-tagged tokens are classified by their RDF tag.
- HTML tags are each assigned a class based on their type (question: do attributes affect classification?)
- Other text tokens are treated as "background text", having a single class.
each word has a connection in from the previous one and out to the next.
neighboring states with identical classes are merged to reduce size of model
once structure is established, use the Baum-Welch algorithm to learn from examples
once trained, RDF tags may be retrieved from a given page using the Viterbi algorithm
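
The class-assignment and merging steps above could be prototyped roughly as follows. This is a sketch under several assumptions: user markup is represented as a plain dictionary from token text to an RDF tag (a stand-in for real RDF annotations), the "ex:" tags are invented for the example, and HTML attributes are ignored when classifying tags, which leaves the attribute question open.

```python
import re
from itertools import groupby

TOKEN_RE = re.compile(r"<[^>]+>|[^<\s]+")
TAG_NAME_RE = re.compile(r"<(/?\w+)")


def classify(html, rdf_labels):
    """Assign one class to every token in `html`.

    `rdf_labels` maps token text to the RDF tag the user attached to it
    (hypothetical representation of user markup).
    """
    classes = []
    for token in TOKEN_RE.findall(html):
        if token.startswith("<"):
            # HTML tags are classed by tag name only; attributes are dropped.
            match = TAG_NAME_RE.match(token)
            classes.append("html:" + (match.group(1).lower() if match else "other"))
        elif token in rdf_labels:
            classes.append(rdf_labels[token])
        else:
            classes.append("background")
    return classes


def merge_neighbors(classes):
    """Collapse runs of identical neighboring classes into one state each,
    recording how many tokens each merged state covers."""
    return [(cls, len(list(run))) for cls, run in groupby(classes)]


page = "<html><b>Title:</b> Rear Window <i>1954</i></html>"
labels = {"Rear": "ex:title", "Window": "ex:title", "1954": "ex:year"}

classes = classify(page, labels)
model_states = merge_neighbors(classes)
print(model_states)
```

A merged state that covers more than one token would presumably need a self-transition so that it can still emit runs of varying length; the transition and emission probabilities themselves are left to the training step.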

Unresolved Issues

many web sources for relational data span several pages (e.g. IMDB). how can we model search forms and moving from page to page? could be resolved with more state classes (?)
should attributes be modeled? could be complicated, but might add important information for locating data
how will the user mark up the data? some form of context menu with Haystack continuations seems best. how to build this into the rendering engines (Mozilla/IE) that Haystack uses?

Possible Extensions

coupled with natural language queries, could make a powerful tool for querying CGI-generated web sites.

Next Steps

build python (java? perl?) toy model and test (on what web site?)

References

Berners-Lee, T., Hendler, J., Lassila, O. The Semantic Web. Scientific American, May 2001. http://www.scientificamerican.com/article.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21&catID=2

Crescenzi, V., Mecca, G., and Merialdo, P. Roadrunner: Towards automatic data extraction from large web sites. Technical Report n. RT-DIA-64-2001, D.I.A., Università di Roma Tre, 2001. http://citeseer.nj.nec.com/crescenzi01roadrunner.html

Freitag, D. Information Extraction from HTML: Application of a General Learning Approach. In Proceedings of the Fifteenth Conference on Artificial Intelligence AAAI-98 (1998), 517-523. http://citeseer.nj.nec.com/freitag98information.html

Freitag, D., and Kushmerick, N. Boosted Wrapper Induction. In Proceedings of the 17th National Conference on Artificial Intelligence, Pages 577-583, 2000. http://citeseer.nj.nec.com/freitag00boosted.html.

Freitag, D., and McCallum, A. Information Extraction with HMM structures learned by stochastic optimization. In Proceedings of the 18th Conference on Artificial Intelligence, 2000. http://citeseer.nj.nec.com/article/freitag00information.html.

Kushmerick, N., Thomas, B. Adaptive information extraction: Core technologies for information agents. http://citeseer.nj.nec.com/kushmerick02adaptive.html.

Kushmerick, N., Weld D., and Doorenbos, R. Wrapper induction for information extraction, IJCAI-97, 1997. http://citeseer.nj.nec.com/kushmerick97wrapper.html

Leek, T. Information extraction using hidden Markov models. Master's Thesis, University of California, San Diego, 1997. http://citeseer.nj.nec.com/leek97information.html.

Muslea, I., Minton, S., and Knoblock, C. 1999. A hierarchical approach to wrapper induction. In Proceedings of the Third International Conference on Autonomous Agents (Agents'99), Seattle, WA. http://citeseer.nj.nec.com/muslea99hierarchical.html

Seymore, K., McCallum, A., Rosenfeld, R. Learning Hidden Markov Model Structure for Information Extraction. In AAAI Workshop on Machine Learning for Information Extraction, 1999. http://citeseer.nj.nec.com/seymore99learning.html.