Syntactic and Semantic Tree Structure in HTML


Browsers parse HTML into a tree of nodes that is exposed through the Document Object Model, or DOM. This tree represents the syntactic structure of an HTML document.

The RDF standard, in contrast, describes semantic meaning, and RDF descriptions can be grouped into a hierarchy of classes and properties.

There tends to be some overlap between the syntactic DOM tree of a web site and the semantic tree of the information it presents. For instance, the DOM of Google's search results page looks like this (with some nodes omitted for clarity):

HTML
 \- BODY
     \- DIV
         |- P
         |  |- A
         |  |   \- [ResultURL]
         |  \- FONT
         |      |- [ResultSummary]
         |      |- SPAN
         |      |   \- A
         |      |      \- [Category]
         |      |- [ResultURL]
         |      |- A
         |      |  \- [CacheURL]
         |      \- A
         |         \- [SimilarPagesURL]
         |- P
         |  |- A
         |  |   \- [ResultURL]
         |  \- FONT
         |      |- [ResultSummary]
         |      |- SPAN
         |      |   \- A
         |      |      \- [Category]
         |      |- [ResultURL]
         |      |- A
         |      |  \- [CacheURL]
         .      \- A
         .         \- [SimilarPagesURL]
         .
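
As a rough illustration of how such a syntactic tree can be obtained, the sketch below builds a small tag tree with Python's standard html.parser module and prints it as an indented outline. The Node and TreeBuilder classes, the outline and parse helpers, and the SAMPLE markup are all made up for this example; the markup only imitates the shape of one result and is not Google's actual HTML. The builder assumes well-formed input in which every start tag has a matching end tag, which a real browser parser does not require.

```python
from html.parser import HTMLParser


class Node:
    """A minimal tree node: a tag name (or "#text") plus ordered children."""

    def __init__(self, name, text=None):
        self.name = name
        self.text = text
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a Node tree from well-formed markup (no error recovery)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag.upper())
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Assumption: every start tag has a matching end tag.
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].children.append(Node("#text", data.strip()))


def parse(markup):
    """Parse a markup string and return the root Node."""
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Return the tree as indented lines, one node per line."""
    label = f"[{node.text}]" if node.name == "#text" else node.name
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


# Hypothetical markup, shaped like part of one result in the tree above.
SAMPLE = (
    "<html><body><div><p>"
    "<a>ResultURL</a>"
    "<font>ResultSummary<span><a>Category</a></span></font>"
    "</p></div></body></html>"
)

print("\n".join(outline(parse(SAMPLE))))
```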

A semantic ontology for describing Google's results might look like:

[SearchResult]
  |- [ResultURL]
  |- [ResultSummary]
  |- [CacheURL]
  |- [Category]
  \- [SimilarPagesURL]
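
One way to make the ontology above concrete is as a record type with one field per property. This is only a sketch: the Python class and its snake_case field names are my own rendering of the bracketed names, and the values in the example instance are placeholders rather than data from a real results page.

```python
from dataclasses import dataclass, fields


@dataclass
class SearchResult:
    """One instance of the search result class from the ontology above."""

    result_url: str
    summary: str
    cache_url: str
    category: str
    similar_pages_url: str


# Placeholder values only; none of these come from a real results page.
example = SearchResult(
    result_url="http://example.com/",
    summary="An example summary.",
    cache_url="http://example.com/cache",
    category="Example > Category",
    similar_pages_url="http://example.com/similar",
)

# List the properties that every instance must carry.
print([f.name for f in fields(SearchResult)])
```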

There are some similarities between the syntactic and semantic trees. For instance, one might say that each <p> element represents the [SearchResult] class in the ontology: each time a similar syntactic subtree, rooted at <p>, is found, it is an instance of [SearchResult].
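
The idea of spotting "similar" subtrees rooted at <p> can be sketched by reducing each subtree to a string describing its shape and counting how often each shape occurs. Everything here is an assumption made for illustration: the tuple representation of nodes, the signature and subtrees helpers, and the hand-built page. Exact shape matching is also a simplification of my own, since real results may differ slightly from one another (a missing category, for example), which would call for approximate matching.

```python
from collections import Counter

# Assumed representation: a node is (tag, children); text is ("#text", []).
TEXT = ("#text", [])
LINK = ("A", [TEXT])


def signature(node):
    """Serialize the shape of a subtree, ignoring text content."""
    tag, children = node
    return tag + "(" + ",".join(signature(c) for c in children) + ")"


def subtrees(node):
    """Yield the node and every descendant, depth first."""
    yield node
    for child in node[1]:
        yield from subtrees(child)


def result_paragraph():
    """Hand-built <p> subtree shaped like one result in the DOM tree."""
    font = ("FONT", [TEXT, ("SPAN", [LINK]), TEXT, LINK, LINK])
    return ("P", [LINK, font])


# Two result paragraphs plus one unrelated paragraph (hypothetical page).
page = ("HTML", [("BODY", [("DIV", [
    result_paragraph(),
    result_paragraph(),
    ("P", [TEXT]),
])])])

# Count how many <p> subtrees share each shape.
counts = Counter(signature(n) for n in subtrees(page) if n[0] == "P")
for shape, count in counts.most_common():
    print(count, shape)
```

A shape that repeats many times is a candidate for a class in the ontology, while shapes that occur once are more likely to be incidental markup.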

Existing Work

For a summary of non-hierarchical wrapper induction approaches, see my other notes here and here.

STALKER, by Muslea et al., also takes a hierarchical approach to wrapper induction by developing a semantic tree structure for each page. However, they seem to ignore the hierarchical structure of the HTML itself, instead extracting information in a traditional, linear fashion.

Approach

Interface

Unresolved Issues

Current Work / Next Steps


References

Berners-Lee, T., Hendler, J., Lassila, O. The Semantic Web. Scientific American, May 2001. http://www.scientificamerican.com/article.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21&catID=2

Miller, R., Myers, B. Lightweight Structured Text Processing. In Proceedings of USENIX 1999 Annual Technical Conference, June 1999, Monterey, CA. http://www-2.cs.cmu.edu/~rcm/papers/usenix99/

Muslea, I., Minton, S., Knoblock, C. A Hierarchical Approach to Wrapper Induction. In Proceedings of the Third International Conference on Autonomous Agents (Agents'99), 1999, Seattle, WA. http://citeseer.nj.nec.com/muslea99hierarchical.html

Shih, L., Karger, D. Learning Classes Correlated to a Hierarchy. MIT AI Lab Technical Note.