Extracting Data from HTML Using TreeBuilder Node IDs

HTML documents have an inherently hierarchical structure. To help locate RDF-tagged data in HTML documents, I propose assigning each node in an HTML parse tree an ID based on the path taken from the root to reach that node. Using the node IDs from example pages, together with the types of tags passed along the route from root to data, a probabilistic model may be built to locate the same types of data in similarly structured pages.

Existing Work

Existing work is summarized here and here.

Approach

Create a parse tree of the HTML page. Assign each node in the tree an ID that records the branch number followed at each depth to reach that node. Branch numbers start at 0 and the root itself is 0, so ID 0.3.0 indicates the first child of the fourth child of the root.
Mark up text nodes in the page with RDF tags as appropriate (e.g. <dc:title>).
For each marked-up node, remember two items:
1. The node ID, which represents the choice of child nodes along the path taken to get from root to data. E.g. "0.1.3"
2. The types of tags passed along that route. E.g. {<html>, <body>, <p>}
Over several example pages, most of these two sequences should remain the same, assuming that the pages are structured similarly.
Differences between the sequences from different example pages represent probabilistic "options" along the path from root to data.
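The ID assignment described in the steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Java port mentioned under Next Steps: it uses the standard library's html.parser instead of HTML::TreeBuilder, assumes well-formed HTML with explicit closing tags, and counts only element children when numbering branches (text nodes are ignored). The names NodeIdParser, assign_node_ids, and VOID_TAGS are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: a small set of tags that never have closing tags or children.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class NodeIdParser(HTMLParser):
    """Collects (node_id, tag_path) for every element in a well-formed page."""

    def __init__(self):
        super().__init__()
        self.stack = []  # open elements as [tag, node_id, children_seen]
        self.nodes = []  # (node_id, tag_path) in document order

    def handle_starttag(self, tag, attrs):
        if self.stack:
            parent = self.stack[-1]
            # Branch numbers are 0-based: the first child gets ".0".
            node_id = f"{parent[1]}.{parent[2]}"
            parent[2] += 1
        else:
            node_id = "0"  # the root element
        tag_path = [entry[0] for entry in self.stack] + [tag]
        self.nodes.append((node_id, tag_path))
        if tag not in VOID_TAGS:
            self.stack.append([tag, node_id, 0])

    def handle_endtag(self, tag):
        # Only pop when the closing tag matches the innermost open element.
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()


def assign_node_ids(html):
    """Return a list of (node_id, tag_path) pairs for the given HTML string."""
    parser = NodeIdParser()
    parser.feed(html)
    parser.close()
    return parser.nodes
```

For a page whose body holds a heading followed by three paragraphs, the last paragraph would receive ID 0.1.3 with tag path html, body, p, which is the kind of pair remembered for each marked-up node.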
A difference between example pages may appear in the node ID, for example:

Example Page 1 <dc:title> 0.1.3 {<html>, <body>, <p>}

Example Page 2 <dc:title> 0.1.4 {<html>, <body>, <p>}

When finding the <dc:title> data in subsequent pages, the algorithm would descend to the second child of the root, and then choose either the fourth or fifth child of that node (branch 3 or 4), depending on which was a <p> tag.
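One way to picture the "options" in this example is to tabulate, at each depth, which branch numbers the example pages used. The sketch below uses plain relative frequencies as a stand-in for the probabilities; that is an assumption made for illustration, since how to build the formal model is still listed as an open question. The function names branch_options and choose_branch are hypothetical.

```python
from collections import Counter


def branch_options(node_ids):
    """For each depth, map every observed branch number to its relative frequency."""
    paths = [[int(part) for part in node_id.split(".")] for node_id in node_ids]
    depth = max(len(path) for path in paths)
    options = []
    for d in range(depth):
        counts = Counter(path[d] for path in paths if len(path) > d)
        total = sum(counts.values())
        options.append({branch: n / total for branch, n in counts.items()})
    return options


def choose_branch(option, child_tags, expected_tag):
    """Pick the most frequent observed branch whose child has the expected tag.

    child_tags lists the tags of the current node's children in order.
    Returns None when no observed branch matches.
    """
    candidates = [
        branch
        for branch in option
        if branch < len(child_tags) and child_tags[branch] == expected_tag
    ]
    return max(candidates, key=option.get) if candidates else None
```

With the two example IDs, branch_options(["0.1.3", "0.1.4"]) leaves a single choice at the first two depths and an even split between branches 3 and 4 at the third; choose_branch then resolves that split on a new page by checking which of those children is a p element.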

Another example, with the "option" of an extra tag:

Example Page 1 <dc:title> 0.1.3 {<html>, <body>, <p>}

Example Page 2 <dc:title> 0.1.3.0 {<html>, <body>, <p>, }

In this case, when finding <dc:title>, there is the "option" of the data being within an additional tag nested inside the <p> tag, rather than directly below the <p> tag.
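Differences of this kind can be detected mechanically by aligning the two tag sequences. The sketch below uses Python's difflib.SequenceMatcher for the alignment, which is my choice for illustration rather than part of the proposal. It labels an insertion or deletion as an "extra node" and a substitution as a "different tag"; a "different child" shows up in the node IDs rather than the tags and is not handled here. The function name classify_variations is hypothetical.

```python
from difflib import SequenceMatcher


def classify_variations(tags_a, tags_b):
    """Align two tag paths and describe where they differ.

    Returns a list of (kind, tags_only_in_a, tags_only_in_b) tuples.
    """
    matcher = SequenceMatcher(None, tags_a, tags_b, autojunk=False)
    variations = []
    for op, a1, a2, b1, b2 in matcher.get_opcodes():
        if op == "equal":
            continue
        # "replace" swaps one tag for another; "insert"/"delete" add or drop a node.
        kind = "different tag" if op == "replace" else "extra node"
        variations.append((kind, tags_a[a1:a2], tags_b[b1:b2]))
    return variations
```

Taking span as a purely hypothetical stand-in for the extra tag, comparing ["html", "body", "p"] with ["html", "body", "p", "span"] reports one extra node at the end of the second path.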

Unresolved Issues

What is the full range of variations that might be encountered? (E.g. "different child," "extra node," "different tag," etc.)
How do we incorporate these variations into a formal probabilistic model?
Will any of them cause the model to "blow up" in certain situations?

Extensions

Transitions between pages may be incorporated by appending the sequences from child pages to those from the parent page, joined at tags that carry links (such as <form> and <a>).
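A minimal sketch of that appending step, assuming each path is stored as a list of (branch number, tag) steps and that the parent path must end at a link-bearing tag. The names to_steps, append_transition, and LINK_TAGS are hypothetical, as is the representation itself.

```python
# Assumption: only these tags are treated as carrying a transition to another page.
LINK_TAGS = {"a", "form"}


def to_steps(node_id, tag_path):
    """Pair each branch number in a node ID with the tag reached at that depth."""
    branches = [int(part) for part in node_id.split(".")]
    if len(branches) != len(tag_path):
        raise ValueError("node ID and tag path must have the same depth")
    return list(zip(branches, tag_path))


def append_transition(parent_steps, child_steps):
    """Join a parent-page path ending at a link to a path within the child page."""
    if not parent_steps or parent_steps[-1][1] not in LINK_TAGS:
        raise ValueError("parent path must end at a link-bearing tag")
    return parent_steps + child_steps
```

For instance, a parent path ending at an a element could be joined to the child page's path to its title, giving one longer sequence that the same matching machinery can follow across the page boundary.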

Next Steps

Finish porting the tree-building algorithm to Java (using Perl's HTML::TreeBuilder module as an example). [Still need to unit test it against live pages to ensure it parses the same on all pages.]
Tag a few pages (from what website?) and test possible methods for assigning probabilities to individual pages, and for integrating probabilities from multiple examples.
Finish building the Haystack HTML source viewer UI and work on the user interface for marking up pages.

Other Approaches

See notes on Hierarchical HMMs.

References

Crescenzi, V., Mecca, G., and Merialdo, P. Roadrunner: Towards automatic data extraction from large web sites. Technical Report n. RT-DIA-64-2001, D.I.A., Università di Roma Tre, 2001. http://citeseer.nj.nec.com/crescenzi01roadrunner.html

Miller, R., Myers, B. Lightweight Structured Text Processing. In Proceedings of USENIX 1999 Annual Technical Conference, June 1999, Monterey, CA. http://www-2.cs.cmu.edu/~rcm/papers/usenix99/

Muslea, I., Minton, S., and Knoblock, C. 1999. A hierarchical approach to wrapper induction. In Proceedings of the Third International Conference on Autonomous Agents (Agents'99), Seattle, WA. http://citeseer.nj.nec.com/muslea99hierarchical.html

Seymore, K., McCallum, A., Rosenfeld, R. Learning Hidden Markov Model Structure for Information Extraction. In AAAI Workshop on Machine Learning for Information Extraction, 1999. http://citeseer.nj.nec.com/seymore99learning.html

Shih, L., Karger, D. Learning Classes Correlated to a Hierarchy. MIT AI Lab Technical Note.
