Extracting Data from HTML Using TreeBuilder Node IDs


HTML documents have an inherent hierarchical structure. To aid in locating RDF-tagged data in HTML documents, I propose assigning each node in an HTML parse tree an ID based on the path taken from the root to reach that node. Using the node IDs of labeled data in example pages, along with the types of tags passed on the route from root to data, a probabilistic model may be built to locate the same data types in similarly structured pages.
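A minimal sketch of the node-ID idea follows. It is illustrative only and makes several assumptions not stated above: IDs are dot-joined child indices starting from "0" at a synthetic document root, only element nodes receive IDs (text is attached to its enclosing element), and Python's standard html.parser stands in for TreeBuilder. The names Node, PathIdBuilder, assign_node_ids, and tag_path are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have children or a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "param", "source", "track", "wbr"}


class Node:
    """One element in the parse tree (hypothetical helper type)."""

    def __init__(self, tag, node_id, parent=None):
        self.tag = tag
        self.node_id = node_id      # path from the root, e.g. "0.0.1"
        self.parent = parent
        self.children = []
        self.text = ""


class PathIdBuilder(HTMLParser):
    """Builds a simple element tree and assigns path-based IDs as it parses."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document", "0")   # synthetic root, assumed ID "0"
        self.current = self.root
        self.by_id = {"0": self.root}

    def handle_starttag(self, tag, attrs):
        # The new node's ID is its parent's ID plus its index among siblings.
        node_id = f"{self.current.node_id}.{len(self.current.children)}"
        node = Node(tag, node_id, self.current)
        self.current.children.append(node)
        self.by_id[node_id] = node
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def assign_node_ids(html):
    """Parse html and return a dict mapping node ID to Node."""
    builder = PathIdBuilder()
    builder.feed(html)
    builder.close()
    return builder.by_id


def tag_path(node):
    """Return the tags passed on the route from the root to this node."""
    tags = []
    while node.parent is not None:
        tags.append(node.tag)
        node = node.parent
    return tags[::-1]


if __name__ == "__main__":
    page = "<html><body><h1>Title</h1><ul><li>a</li><li>b</li></ul></body></html>"
    for node_id, node in assign_node_ids(page).items():
        print(node_id, "/".join(tag_path(node)), repr(node.text))
```

Under these assumptions, each labeled datum in an example page yields a pair (node ID, tag path), which is the kind of evidence a model over similarly structured pages could be trained on.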

Existing Work

Existing work is summarized here and here.

Approach

Unresolved Issues

Extensions

Next Steps

Other Approaches


References

Crescenzi, V., Mecca, G., and Merialdo, P. Roadrunner: Towards automatic data extraction from large web sites. Technical Report n. RT-DIA-64-2001, D.I.A., Università di Roma Tre, 2001. http://citeseer.nj.nec.com/crescenzi01roadrunner.html

Miller, R., Myers, B. Lightweight Structured Text Processing. In Proceedings of USENIX 1999 Annual Technical Conference, June 1999, Monterey, CA. http://www-2.cs.cmu.edu/~rcm/papers/usenix99/

Muslea, I., Minton, S., and Knoblock, C. 1999. A hierarchical approach to wrapper induction. In Proceedings of the Third International Conference on Autonomous Agents (Agents'99), Seattle, WA. http://citeseer.nj.nec.com/muslea99hierarchical.html

Seymore, K., McCallum, A., Rosenfeld, R. Learning Hidden Markov Model Structure for Information Extraction. In AAAI Workshop on Machine Learning for Information Extraction, 1999. http://citeseer.nj.nec.com/seymore99learning.html

Shih, L., Karger, D. Learning Classes Correlated to a Hierarchy. MIT AI Lab Technical Note.