Extracting Data from HTML Using TreeBuilder Node IDs

HTML documents have an inherent hierarchical structure. To help locate RDF-tagged data in HTML documents, I propose assigning each node in an HTML parse tree an ID derived from the path taken from the root to reach that node. Using the node IDs of known data in example pages, together with the tag names encountered along the path from root to data, a probabilistic model can be built to locate the same types of data in similarly structured pages.
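
To make the ID scheme concrete, here is a minimal sketch in Python using only the standard library's html.parser. It is an illustration under stated assumptions, not the proposed implementation: the names PathIDParser and assign_node_ids are hypothetical, the ID is assumed to be the dot-joined child indexes from the root (so the root is "0" and its second child is "0.1"), and only element nodes are counted, not text nodes.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be left "open".
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PathIDParser(HTMLParser):
    """Hypothetical helper: records a path-based ID for every element."""

    def __init__(self):
        super().__init__()
        self.child_counts = [0]  # children seen so far at each open depth
        self.path = []           # child indexes from root to the open element
        self.tags = []           # tag names from root to the open element
        self.nodes = []          # (node_id, tag_path) for every element seen

    def handle_starttag(self, tag, attrs):
        index = self.child_counts[-1]
        self.child_counts[-1] += 1
        node_path = self.path + [index]
        tag_path = self.tags + [tag]
        self.nodes.append((".".join(map(str, node_path)), "/".join(tag_path)))
        if tag not in VOID_TAGS:
            # Descend: later elements become children of this one.
            self.path = node_path
            self.tags = tag_path
            self.child_counts.append(0)

    def handle_endtag(self, tag):
        if tag not in self.tags:
            return  # stray closing tag; ignore it
        # Pop back to the matching open tag, closing any unclosed children.
        while self.tags:
            closed = self.tags.pop()
            self.path.pop()
            self.child_counts.pop()
            if closed == tag:
                break


def assign_node_ids(html):
    """Return a list of (node_id, tag_path) pairs in document order."""
    parser = PathIDParser()
    parser.feed(html)
    parser.close()
    return parser.nodes


if __name__ == "__main__":
    sample = ("<html><head><title>T</title></head>"
              "<body><p>a</p><p>b<br>c</p></body></html>")
    for node_id, tag_path in assign_node_ids(sample):
        print(node_id, tag_path)
```

In the sample page above, the second paragraph receives the ID 0.1.1 with the tag path html/body/p. Pairs like this, collected from example pages where the location of the data is already known, are the kind of input the probabilistic model would be built from.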

Existing Work

Unresolved Issues


Next Steps

Other Approaches


