Syntactic and Semantic Tree Structure in HTML
All browsers parse HTML into a Document Object Model, or DOM. This represents the syntactic tree of an HTML document.
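As an illustration of what "syntactic tree" means here, the following is a minimal sketch that builds a simplified element tree with Python's standard `html.parser` module. The `Node` and `TreeBuilder` classes and the `build_tree` and `outline` helpers are hypothetical names introduced for this example, and the sketch assumes well-formed, properly nested markup with no void tags such as `<br>`. A real browser DOM is considerably richer.

```python
from html.parser import HTMLParser


class Node:
    """One node of the simplified tree: an element tag or a piece of text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a tree of Node objects; assumes every start tag has an end tag."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        # html.parser reports tags in lowercase; upper-case them to match
        # the notation used in these notes.
        node = Node(tag.upper(), self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Step back up to the parent (assumes proper nesting).
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only text between tags
            self.current.children.append(Node(repr(text), self.current))


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Return the tree as a list of indented lines, one per node."""
    lines = ["  " * depth + node.tag]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


if __name__ == "__main__":
    page = "<html><body><div><p><a>link</a></p></div></body></html>"
    print("\n".join(outline(build_tree(page))))
```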
The RDF (Resource Description Framework) standard can be used to express semantic meaning as classes and properties, which can be arranged in hierarchical form.
There tends to be some overlap between the syntactic DOM tree of a web site and the semantic tree of the information it presents. For instance, the DOM of Google's search results page is (with some nodes omitted for clarity):
HTML
\- BODY
   \- DIV
      |- P
      |  |- A
      |  |  \- [ResultURL]
      |  \- FONT
      |     |- [ResultSummary]
      |     |- SPAN
      |     |  \- A
      |     |     \- [Category]
      |     |- [ResultURL]
      |     |- A
      |     |  \- [CacheURL]
      |     \- A
      |        \- [SimilarPagesURL]
      |- P
      |  |- A
      |  |  \- [ResultURL]
      |  \- FONT
      |     |- [ResultSummary]
      |     |- SPAN
      |     |  \- A
      |     |     \- [Category]
      |     |- [ResultURL]
      |     |- A
      |     |  \- [CacheURL]
      .     \- A
      .        \- [SimilarPagesURL]
      .
A semantic ontology for describing Google's results might look like:
[SearchResult]
|- [ResultURL]
|- [Summary]
|- [CacheURL]
|- [Category]
\- [SimilarPagesURL]
There are some similarities between the syntactic and semantic trees. For instance, one might say that the entire subtree rooted at a <p> tag represents the [SearchResult] class in the ontology. Each time a similar syntactic tree, rooted at <p>, is found, it is an instance of the [SearchResult] class.
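One way to make "a similar syntactic tree, rooted at <p>" concrete is to compare the tag structure of each <p> subtree against a template subtree. The sketch below is only an illustration under stated assumptions: it uses `xml.etree.ElementTree`, so it assumes the page is well-formed XML, it treats "similar" as an exact match of nested tags, and the `shape` and `find_instances` helpers and the sample page are hypothetical.

```python
import xml.etree.ElementTree as ET


def shape(elem):
    """Nested tuple of tags describing a subtree; text and attributes are ignored."""
    return (elem.tag, tuple(shape(child) for child in elem))


def find_instances(root, template):
    """Return every <p> element whose subtree has the same shape as the template."""
    wanted = shape(template)
    return [p for p in root.iter("p") if shape(p) == wanted]


# Hypothetical, heavily simplified stand-in for a results page.
PAGE = """
<html><body><div>
  <p><a>u1</a><font>s1<span><a>c1</a></span><a>cache1</a><a>sim1</a></font></p>
  <p>An unrelated paragraph.</p>
  <p><a>u2</a><font>s2<span><a>c2</a></span><a>cache2</a><a>sim2</a></font></p>
</div></body></html>
""".strip()

root = ET.fromstring(PAGE)
template = root.find("./body/div/p")  # the first result serves as the template
instances = find_instances(root, template)

if __name__ == "__main__":
    for p in instances:
        print(p.find("a").text)
```

Exact shape matching is deliberately strict; a subtree that omits an optional node would not match and would need a looser comparison.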
Existing Work
For a summary of non-hierarchical wrapper induction approaches, see my other notes here and here.
Stalker, by Muslea et al., also takes a hierarchical approach to wrapper induction by developing a semantic tree structure for each page. However, it appears to ignore the hierarchical structure of the HTML itself and instead extracts information in a traditional, linear fashion.
Approach
The [SearchResult] class may appear as a child of the nodes HTML-BODY-DIV or HTML-BODY-DIV-BLOCKQUOTE.
The [Category] class may or may not appear, depending on the result.
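The two observations above can be sketched as a small extractor that accepts a <p> subtree only under one of the two parent paths and treats [Category] as optional. This is a rough illustration, not the method developed in these notes: it assumes well-formed XML input, it assumes the bracketed leaves of the DOM tree are plain text inside their parent tags, and the `SearchResult` dataclass, `ALLOWED_PARENT_PATHS`, `parent_path`, and `extract_results` are hypothetical names.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional

# Parent paths under which a [SearchResult] subtree is accepted.
ALLOWED_PARENT_PATHS = {"HTML-BODY-DIV", "HTML-BODY-DIV-BLOCKQUOTE"}


@dataclass
class SearchResult:
    result_url: str
    summary: str
    cache_url: str
    similar_pages_url: str
    category: Optional[str] = None  # [Category] may be absent


def parent_path(elem, parents):
    """Return the ancestor tags of elem, root first, joined as HTML-BODY-..."""
    tags = []
    node = parents.get(elem)
    while node is not None:
        tags.append(node.tag.upper())
        node = parents.get(node)
    return "-".join(reversed(tags))


def extract_results(root):
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    results = []
    for p in root.iter("p"):
        if parent_path(p, parents) not in ALLOWED_PARENT_PATHS:
            continue
        link, font = p.find("a"), p.find("font")
        if link is None or font is None:
            continue
        anchors = font.findall("a")  # direct children: cache, similar pages
        if len(anchors) != 2:
            continue
        category = font.find("span/a")  # optional node
        results.append(SearchResult(
            result_url=(link.text or "").strip(),
            summary=(font.text or "").strip(),
            cache_url=(anchors[0].text or "").strip(),
            similar_pages_url=(anchors[1].text or "").strip(),
            category=(category.text or "").strip() if category is not None else None,
        ))
    return results


# Hypothetical sample: one result under DIV, one under BLOCKQUOTE without a
# category, and one directly under BODY that should be rejected.
PAGE = """
<html><body><div>
<p><a>u1</a><font>s1<span><a>c1</a></span><a>cache1</a><a>sim1</a></font></p>
<blockquote><p><a>u2</a><font>s2<a>cache2</a><a>sim2</a></font></p></blockquote>
</div>
<p><a>u3</a><font>s3<a>cache3</a><a>sim3</a></font></p>
</body></html>
""".strip()

results = extract_results(ET.fromstring(PAGE))

if __name__ == "__main__":
    for r in results:
        print(r)
```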
Interface
Unresolved Issues
Current Work / Next Steps
Berners-Lee, T., Hendler, J., Lassila, O. The Semantic Web. Scientific American, May 2001. http://www.scientificamerican.com/article.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21&catID=2
Miller, R., Myers, B. Lightweight Structured Text Processing. In Proceedings of USENIX 1999 Annual Technical Conference, June 1999, Monterey, CA. http://www-2.cs.cmu.edu/~rcm/papers/usenix99/
Muslea, I., Minton, S., Knoblock, C. A Hierarchical Approach to Wrapper Induction. In Proceedings of the Third International Conference on Autonomous Agents (Agents'99), 1999, Seattle, WA. http://citeseer.nj.nec.com/muslea99hierarchical.html
Shih, L., Karger, D. Learning Classes Correlated to a Hierarchy. MIT AI Lab Technical Note.