Extracting Data from HTML Using TreeBuilder Node IDs
HTML documents have an inherent hierarchical structure. To aid in locating RDF-tagged data in HTML documents, I propose assigning each node in an HTML parse tree an ID based on the path taken from the root to reach that node. Using node IDs from example pages, along with the types of tags passed on the route from root to data, a probabilistic model may be built to locate the same data types in similarly structured pages.
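As a rough illustration of path-based node IDs, the sketch below walks a parse tree and records, for every element, a dotted ID and the list of tags passed on the way from the root. It rests on several assumptions that are not part of the proposal: Python's xml.etree.ElementTree stands in for a real HTML tree builder (so the sample markup must be well formed), only element children are counted, indices start at zero, and the name assign_node_ids is hypothetical.

```python
import xml.etree.ElementTree as ET


def assign_node_ids(root):
    """Return a dict mapping each dotted node ID to the tags on its path.

    Assumption: the root is "0" and each child appends its zero-based
    position among its parent's element children.
    """
    result = {}

    def walk(elem, node_id, tags):
        tags = tags + [elem.tag]          # tags passed from root to here
        result[node_id] = tags
        for i, child in enumerate(elem):  # element children only
            walk(child, f"{node_id}.{i}", tags)

    walk(root, "0", [])
    return result


# Hypothetical sample page; well-formed so ElementTree can parse it.
page = "<html><head/><body><h1>a</h1><div/><hr/><p>Title</p></body></html>"
ids = assign_node_ids(ET.fromstring(page))
print(ids["0.1.3"])
```

With this sample page, the <p> element that holds the title sits at node ID 0.1.3, reached through <html>, <body>, and <p>.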
Existing Work
Existing work is summarized here and here.
Approach
<dc:title>
For example, a difference in the node ID:
Page | Data type | Node ID | Tags from root to data
Example Page 1 | <dc:title> | 0.1.3 | {<html>, <body>, <p>}
Example Page 2 | <dc:title> | 0.1.4 | {<html>, <body>, <p>}
When finding the <dc:title> data in subsequent pages, the algorithm would descend to the second child of the root (index 1), and then choose either the fourth or fifth child of that node (index 3 or 4, counting from zero), depending on which was a <p> tag.
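The choice between the candidate node IDs 0.1.3 and 0.1.4 can be sketched as follows. This is a minimal illustration under assumptions, not the proposed implementation: xml.etree.ElementTree stands in for an HTML tree builder, only element children are indexed (from zero), and the helper names resolve and locate are hypothetical.

```python
import xml.etree.ElementTree as ET


def resolve(root, node_id):
    """Follow a dotted node ID from the root.

    Returns (element, tags_on_path), or None if the path does not exist.
    """
    indices = [int(part) for part in node_id.split(".")]
    if indices[0] != 0:                   # every ID starts at the root, "0"
        return None
    elem, tags = root, [root.tag]
    for i in indices[1:]:
        children = list(elem)
        if i >= len(children):
            return None
        elem = children[i]
        tags.append(elem.tag)
    return elem, tags


def locate(root, candidate_ids, tag_path):
    """Return the first candidate ID whose tags match the expected path."""
    for node_id in candidate_ids:
        hit = resolve(root, node_id)
        if hit is not None and hit[1] == tag_path:
            return node_id, hit[0]
    return None


# Hypothetical new page: index 3 under <body> is a <ul>, index 4 is the <p>.
page = "<html><head/><body><h1/><div/><hr/><ul/><p>A Title</p></body></html>"
result = locate(ET.fromstring(page), ["0.1.3", "0.1.4"], ["html", "body", "p"])
```

Here the candidate 0.1.3 is rejected because the element at that position is not a <p>, so the lookup falls through to 0.1.4.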
Another example, with the "option" of an extra tag:
Page | Data type | Node ID | Tags from root to data
Example Page 1 | <dc:title> | 0.1.3 | {<html>, <body>, <p>}
Example Page 2 | <dc:title> | 0.1.3.0 | {<html>, <body>, <p>, <b>}
In this case, when finding <dc:title>, there is the "option" of the data being within a <b> tag, rather than directly below the <p> tag.
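One simple way to turn such examples into a model is to count how often each (node ID, tag path) pair was observed and rank the pairs that also exist in a new page. The sketch below does this with relative frequencies. It is only an assumed stand-in for the probabilistic model, which the proposal does not specify. The tie-break that prefers the deeper node (so the optional <b> wins when present) is likewise an assumption, and build_model, path_at, and rank_candidates are hypothetical names.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def build_model(examples):
    """examples: (node_id, tag_path) pairs observed on example pages."""
    counts = Counter((node_id, tuple(tags)) for node_id, tags in examples)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


def path_at(root, node_id):
    """Return (element, tag tuple) for a dotted ID, or None if absent."""
    elem, tags = root, [root.tag]
    for part in node_id.split(".")[1:]:   # skip the leading "0" (the root)
        children = list(elem)
        if int(part) >= len(children):
            return None
        elem = children[int(part)]
        tags.append(elem.tag)
    return elem, tuple(tags)


def rank_candidates(root, model):
    """List matching (node_id, element, probability), best first."""
    hits = []
    for (node_id, tags), prob in model.items():
        found = path_at(root, node_id)
        if found is not None and found[1] == tags:
            hits.append((prob, node_id.count("."), node_id, found[0]))
    # Assumed tie-break: equal probability favors the deeper node.
    hits.sort(key=lambda h: (h[0], h[1]), reverse=True)
    return [(node_id, elem, prob) for prob, _, node_id, elem in hits]


model = build_model([
    ("0.1.3", ["html", "body", "p"]),
    ("0.1.3.0", ["html", "body", "p", "b"]),
])
with_b = "<html><head/><body><h1/><div/><hr/><p><b>Title</b></p></body></html>"
without_b = "<html><head/><body><h1/><div/><hr/><p>Title</p></body></html>"
ranked_with_b = rank_candidates(ET.fromstring(with_b), model)
ranked_without_b = rank_candidates(ET.fromstring(without_b), model)
```

On the page containing the <b> tag, both candidates match and the deeper one is ranked first. On the page without it, only 0.1.3 matches.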
Unresolved Issues
Extensions
<form> and <a>
Next Steps
Other Approaches
Crescenzi, V., Mecca, G., and Merialdo, P. Roadrunner: Towards automatic data extraction from large web sites. Technical Report n. RT-DIA-64-2001, D.I.A., Università di Roma Tre, 2001. http://citeseer.nj.nec.com/crescenzi01roadrunner.html
Miller, R., Myers, B. Lightweight Structured Text Processing. In Proceedings of USENIX 1999 Annual Technical Conference, June 1999, Monterey, CA. http://www-2.cs.cmu.edu/~rcm/papers/usenix99/
Muslea, I., Minton, S., and Knoblock, C. 1999. A hierarchical approach to wrapper induction. In Proceedings of the Third International Conference on Autonomous Agents (Agents'99), Seattle, WA. http://citeseer.nj.nec.com/muslea99hierarchical.html
Seymore, K., McCallum, A., Rosenfeld, R. Learning Hidden Markov Model Structure for Information Extraction. In AAAI Workshop on Machine Learning for Information Extraction, 1999. http://citeseer.nj.nec.com/seymore99learning.html
Shih, L., Karger, D. Learning Classes Correlated to a Hierarchy. MIT AI Lab Technical Note.