RDF-aided Wrapper Induction


Wrapper induction is a method for automatically constructing "wrappers", or scripts which automate the process of retrieving information from a lightly-structured information resource. In several forms, wrapper induction has been shown to be quite successful at consistently and accurately retrieving relational data from HTML-encoded web pages.

The Semantic Web is "an extension to the current web" which uses the RDF standard to mark up relational data in a form easily consumed by machines. RDF consists of subject-predicate-object triples which allow users to superimpose relational structure on top of a normal web page.
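To make the triple structure concrete, here is a minimal sketch that represents triples as plain Python tuples rather than through any RDF library. The page URI and the `objects` helper are hypothetical names chosen for illustration; the predicate URIs use the Dublin Core element namespace.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Dublin Core element namespace; the page URI below is a made-up example.
DC = "http://purl.org/dc/elements/1.1/"
PAGE = "http://example.org/books/1"

triples = [
    Triple(PAGE, DC + "title", "Moby-Dick"),
    Triple(PAGE, DC + "creator", "Herman Melville"),
]


def objects(store: List[Triple], subject: str, predicate: str) -> List[str]:
    """Return every object attached to (subject, predicate) in the store."""
    return [t.obj for t in store if t.subject == subject and t.predicate == predicate]


print(objects(triples, PAGE, DC + "title"))
```

Each triple attaches one labelled value to the page, which is the kind of labelled example a wrapper induction algorithm needs as training input.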

I believe there is a remarkable compatibility between the goals of the Semantic Web and the abilities of wrapper induction. By using RDF markup to augment the wrapper induction process, I believe that both the ease with which users can mark up documents using RDF and the accuracy of wrapper induction will be greatly improved.

Existing Work

Wrapper induction was initially developed by Kushmerick, Freitag, et al. It takes several forms, depending on the structure of the document, but all rely on finding delimiters that mark the left and right boundaries of the information to be extracted. In HTML, these delimiters are usually markup tags, such as <p> and </p>. Unfortunately, wrapper induction's biggest weakness is its inflexibility: small changes in the source can incapacitate a wrapper, and wrappers are ill-suited to complicated pages where important data is intermixed with advertisements, irrelevant text, and extraneous markup. Initial implementations were found to successfully wrap only 48% of sampled internet resources.
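The delimiter idea can be sketched for a single attribute as follows. This is a simplified illustration, not the published algorithm: it takes the left delimiter to be the longest common suffix of the text before each labelled example and the right delimiter to be the longest common prefix of the text after it, and it skips the validity checks a real learner would need. The function names and the sample pages are assumptions made for the example.

```python
import os
from typing import List, Tuple


def common_prefix(strings: List[str]) -> str:
    # os.path.commonprefix compares character by character, so it works on any strings.
    return os.path.commonprefix(strings)


def common_suffix(strings: List[str]) -> str:
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]


def induce_lr(page: str, examples: List[str]) -> Tuple[str, str]:
    """Learn (left, right) delimiters from labelled values listed in page order."""
    befores, afters, pos = [], [], 0
    for value in examples:
        start = page.index(value, pos)  # raises ValueError if the example is absent
        end = start + len(value)
        befores.append(page[:start])
        afters.append(page[end:])
        pos = end
    left, right = common_suffix(befores), common_prefix(afters)
    if not left or not right:
        raise ValueError("no common delimiters found for these examples")
    return left, right


def extract_lr(page: str, left: str, right: str) -> List[str]:
    """Apply the wrapper: collect every span between a left and a right delimiter."""
    values, pos = [], 0
    while True:
        start = page.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            break
        values.append(page[start:end])
        pos = end + len(right)
    return values


training = "<html><b>Congo</b> <i>242</i><br><b>Spain</b> <i>34</i><br></html>"
left, right = induce_lr(training, ["Congo", "Spain"])
unseen = "<html><b>Chile</b> <i>56</i><br><b>Japan</b> <i>81</i><br></html>"
print(extract_lr(unseen, left, right))
```

The sketch also shows the brittleness described above: if the site replaces <b> with another tag, the learned left delimiter no longer occurs and nothing is extracted.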

Freitag developed SRV, an information extraction algorithm based on relational learning. SRV induces extraction rules from labelled examples by breaking the text into tokens and testing features of those tokens. Enhancing this feature set with HTML markup tends to increase the effectiveness of SRV. SRV is more flexible than wrapper induction, and works well with documents in which data may not be in the same location from page to page. Coupling this algorithm with RDF might prove to be an effective user-centric means for information extraction.

The RoadRunner system, by Crescenzi et al., implements fully automated wrapper generation on CGI-generated pages without using pre-labeled data. It compares pages from a query set and attempts to find a regular expression which matches all pages. From this regular expression, it infers the underlying relational structure of the database that generated the page. The algorithm is extremely fast on most data sets, and is able to handle recursion and iteratively generated data. While not immediately applicable to users tagging HTML data with RDF, this work may be useful in discovering patterns and augmenting other approaches that utilize user-generated markup.
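The page-comparison idea can be illustrated with a much simpler stand-in than RoadRunner's own matching algorithm. The sketch below aligns the token sequences of two pages with Python's difflib and replaces every mismatching region with a "#FIELD" placeholder. It does not handle optional or repeated structures, and the tokenizer, function names, and sample pages are assumptions made for the example.

```python
import re
from difflib import SequenceMatcher
from typing import List

# A token is either one markup tag or one run of text between tags.
TOKEN = re.compile(r"<[^>]+>|[^<]+")


def tokenize(html: str) -> List[str]:
    return [t.strip() for t in TOKEN.findall(html) if t.strip()]


def infer_template(page_a: str, page_b: str) -> List[str]:
    """Keep tokens shared by both pages; mark each differing region as a field."""
    a, b = tokenize(page_a), tokenize(page_b)
    template = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op == "equal":
            template.extend(a[i1:i2])
        else:
            template.append("#FIELD")
    return template


page_a = "<html><b>John Smith</b><i>Databases</i></html>"
page_b = "<html><b>Paul Jones</b><i>XML at Work</i></html>"
print(infer_template(page_a, page_b))
```

Each "#FIELD" slot corresponds to a value that varies between pages, which is a rough analogue of a column in the database that generated them.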

Hidden Markov models have been implemented in several ways to model the flow of text and markup in a document. Because they are stochastic, they are very tolerant of small variations in the source. Some of the implementations include:

Approach

My intuition is that HMM-based wrapper induction would work best with RDF markup, for two reasons. First, HMMs are stochastic and therefore tolerant of small variations in the source. Second, they model a document as a set of classes connected by transitions, which is reminiscent of the way RDF models relational data.
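As a rough illustration of how an HMM could label tokens, the sketch below decodes the most likely state sequence with the Viterbi algorithm. The four states (background, prefix, target, suffix), the coarse token symbols, and all probabilities are hand-set assumptions for the example; in practice they would be estimated from labelled pages, for instance pages whose target values have been marked with RDF.

```python
import math
from typing import Dict, List


def log(p: float) -> float:
    return math.log(p) if p > 0 else float("-inf")


def symbol(token: str) -> str:
    """Map a raw token to one of the coarse symbols the model emits."""
    if token in ("<b>", "</b>"):
        return token
    return "other_tag" if token.startswith("<") else "word"


def viterbi(obs: List[str], states: List[str], start: Dict[str, float],
            trans: Dict[str, Dict[str, float]],
            emit: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the most probable state sequence for the observed symbols."""
    # Each column maps state -> (best log-probability, predecessor state).
    best = [{s: (log(start.get(s, 0)) + log(emit[s].get(obs[0], 0)), None)
             for s in states}]
    for o in obs[1:]:
        column = {}
        for s in states:
            score, prev = max((best[-1][p][0] + log(trans[p].get(s, 0)), p)
                              for p in states)
            column[s] = (score + log(emit[s].get(o, 0)), prev)
        best.append(column)
    # Backtrack from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


STATES = ["background", "prefix", "target", "suffix"]
START = {"background": 0.9, "prefix": 0.1}
TRANS = {
    "background": {"background": 0.7, "prefix": 0.3},
    "prefix": {"target": 1.0},
    "target": {"target": 0.5, "suffix": 0.5},
    "suffix": {"background": 1.0},
}
EMIT = {
    "background": {"word": 0.5, "other_tag": 0.4, "<b>": 0.05, "</b>": 0.05},
    "prefix": {"<b>": 1.0},
    "target": {"word": 1.0},
    "suffix": {"</b>": 1.0},
}

tokens = ["<p>", "Author", "<b>", "Jane", "Doe", "</b>", "</p>"]
path = viterbi([symbol(t) for t in tokens], STATES, START, TRANS, EMIT)
extracted = [t for t, s in zip(tokens, path) if s == "target"]
print(path)
print(extracted)
```

Because the background state can still emit <b> with low probability, a page with slightly different markup yields a lower-scoring path rather than an outright failure, which is the tolerance that motivates this approach.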

Ideas:

Unresolved Issues

Possible Extensions

Next Steps


References

Berners-Lee, T., Hendler, J., Lassila, O. The Semantic Web. Scientific American, May 2001. http://www.scientificamerican.com/article.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21&catID=2

Crescenzi, V., Mecca, G., and Merialdo, P. RoadRunner: Towards automatic data extraction from large web sites. Technical Report n. RT-DIA-64-2001, D.I.A., Università di Roma Tre, 2001. http://citeseer.nj.nec.com/crescenzi01roadrunner.html

Freitag, D. Information Extraction from HTML: Application of a General Learning Approach. In Proceedings of the Fifteenth Conference on Artificial Intelligence (AAAI-98), pp. 517-523, 1998. http://citeseer.nj.nec.com/freitag98information.html

Freitag, D., and Kushmerick, N. Boosted Wrapper Induction. In Proceedings of the 17th National Conference on Artificial Intelligence, pp. 577-583, 2000. http://citeseer.nj.nec.com/freitag00boosted.html

Freitag, D., and McCallum, A. Information Extraction with HMM structures learned by stochastic optimization. In Proceedings of the 17th National Conference on Artificial Intelligence, 2000. http://citeseer.nj.nec.com/article/freitag00information.html

Kushmerick, N., and Thomas, B. Adaptive information extraction: Core technologies for information agents. http://citeseer.nj.nec.com/kushmerick02adaptive.html

Kushmerick, N., Weld, D., and Doorenbos, R. Wrapper induction for information extraction. IJCAI-97, 1997. http://citeseer.nj.nec.com/kushmerick97wrapper.html

Leek, T. Information extraction using hidden Markov models. Master's Thesis, University of California, San Diego, 1997. http://citeseer.nj.nec.com/leek97information.html

Muslea, I., Minton, S., and Knoblock, C. A hierarchical approach to wrapper induction. In Proceedings of the Third International Conference on Autonomous Agents (Agents'99), Seattle, WA, 1999. http://citeseer.nj.nec.com/muslea99hierarchical.html

Seymore, K., McCallum, A., and Rosenfeld, R. Learning Hidden Markov Model Structure for Information Extraction. In AAAI Workshop on Machine Learning for Information Extraction, 1999. http://citeseer.nj.nec.com/seymore99learning.html