Project Summaries for Andrew W Hogue

Introduction

The following are summaries for several projects in which I have participated. For more information, please contact me at ahogue at secondthought dot org.


Research Project and Thesis

Tree Pattern Inference and Matching for Wrapper Induction on the World Wide Web: The goal of this thesis is to develop a system for inferring reusable patterns from positive examples given by a user. Utilizing the tree structure of HTML documents, these positive examples are used to create a recursive finite automaton. These automata are useful for later finding instances of similar structure in other documents. We also allow users to apply semantic labels drawn from existing ontologies to these patterns. By giving the patterns semantic meaning, flat data on the web may be integrated into rich Semantic Web environments such as the Haystack information management client. Haystack uses the RDF standard to maintain and manage a system of semantic descriptions of a user's data. When our patterns are matched against a page in Haystack's browser, the user is provided with context-sensitive menus which allow them to interact with the Web in a much richer way.

Once users are able to define these patterns simply by pointing out positive examples on a given web page, any number of personalizable functions may be easily implemented. For instance, a user defining a pattern for the headlines on The New York Times web site could have those headlines emailed to them on a daily basis. Web sites important to the user could be watched for specific changes based on a pattern. Sites with the same semantic content could even be interchangeably reformatted - a user could read the headlines from The New York Times in the style of Slashdot. Autonomous agents could gather news headlines and aggregate them into a single newsreader application. The system could automatically monitor bank balances or stocks and notify the user upon certain events.

The full text of this thesis may be found here. In addition, a paper presented at the Interaction Design and the Semantic Web workshop at the WWW-2004 conference is located here, along with the accompanying presentation here.


Advent, Inc.

DAX - Data Acquisition and Transformation: While I was at Advent, we needed to replace old, consultant-designed software that managed the transformation of raw financial data from our clients into several internal formats, including XML and storage in a relational database. The old system relied on a one-by-one approach - for each new source of data, a script was designed solely to transform that source into a single internal format. This resulted in a huge number of specialized scripts, little code reuse, and extremely difficult maintenance.

To replace this system, my team realized that we could capitalize on the fact that, despite their appearance, these disparate sources and destinations actually represented very few types of data. No matter where the data came from, it consisted of certain standard business concepts, such as transactions and accounts. If we could develop a standard description of these business objects, we would dramatically increase the reuse of our code and greatly simplify the addition of new sources and destinations for data.

To this end, we created a standard XML description for these objects. To process data, we developed a Java-based, multithreaded agent model which allowed us to reuse threads for reading and writing data, as long as they all spoke the same "language" - the language of our new XML business objects.

Now, any data source could easily be written to any data destination by simply configuring an Agent with the correct reader and writer threads. Adding a new source or destination was reduced to writing a single thread and instructing it on how to process the requisite business objects.
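The reader/writer arrangement described above can be sketched roughly as follows. This is a minimal illustration in Python rather than the original Java, and every name in it (Transaction, csv_reader, list_writer, run_agent) is hypothetical; the real business-object schema and Agent configuration are not given here. It assumes one reader thread and one writer thread that share nothing except a queue of business objects.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass
class Transaction:
    """Hypothetical business object; the actual XML schema is not shown in the text."""
    account: str
    amount: float


_DONE = object()  # sentinel telling the writer that the reader has finished


def csv_reader(lines, out_queue):
    """Reader: parse 'account,amount' lines into Transaction objects."""
    for line in lines:
        account, amount = line.split(",")
        out_queue.put(Transaction(account.strip(), float(amount)))
    out_queue.put(_DONE)


def list_writer(in_queue, sink):
    """Writer: drain business objects into a destination (a plain list here)."""
    while True:
        item = in_queue.get()
        if item is _DONE:
            break
        sink.append(item)


def run_agent(reader, source, writer, sink):
    """Wire one reader thread to one writer thread through a shared queue."""
    q = queue.Queue()
    threads = [
        threading.Thread(target=reader, args=(source, q)),
        threading.Thread(target=writer, args=(q, sink)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sink


results = run_agent(csv_reader, ["A-1, 100.50", "B-2, -20"], list_writer, [])
```

Because the reader and writer only agree on the shape of the objects on the queue, supporting a new source or destination in this sketch means writing one new function and passing it to run_agent.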


Storefront Media, Inc.

Natural Fit Software: Storefront Media was a startup built around the idea of providing a better experience for people shopping for clothing online. Considerations like style, fit, and color make clothing one of the most personalized items for which consumers shop. The difficulty in matching customers to the correct size and style of clothing gives online and catalog clothing companies one of the highest return rates of any type of goods.

To help alleviate these issues, Storefront Media designed a suite of tools to aid the consumer in making the right choices while shopping. These included a drag-and-drop web interface to ease the shopping experience, and a collaborative-filtering-based recommendation engine that took into account user preferences, past purchases, and anonymous statistics gleaned from other users.

I designed and implemented the third component, the "Natural Fit" system. This system translated both user measurements and manufacturer measurements into a common format, which allowed us to match the customer's body to the best size of a given garment. Natural Fit also took into account properties of the fabric, such as stretch, to provide a more accurate recommendation. To provide good analysis even when the user could only provide a few personal measurements, a large database of full measurements was compared using nearest-neighbor search to fill out the missing measurements.
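As a rough illustration of the nearest-neighbor step, the sketch below fills in a customer's missing measurements from the closest record in a reference set. It is written in Python under stated assumptions: the measurement names, the three reference records, the use of plain Euclidean distance, and the choice of a single nearest neighbor are all invented for the example and are not taken from the Natural Fit system itself.

```python
import math

# Hypothetical reference database of full measurements, in centimetres.
REFERENCE = [
    {"chest": 96.0, "waist": 81.0, "hip": 97.0, "inseam": 79.0},
    {"chest": 104.0, "waist": 91.0, "hip": 105.0, "inseam": 81.0},
    {"chest": 88.0, "waist": 73.0, "hip": 90.0, "inseam": 76.0},
]


def fill_missing(partial, reference=REFERENCE):
    """Return a copy of `partial` with missing keys taken from the nearest record."""
    if not partial:
        raise ValueError("need at least one known measurement")

    def distance(record):
        # Compare only on the measurements the customer actually supplied.
        return math.sqrt(sum((record[k] - v) ** 2 for k, v in partial.items()))

    nearest = min(reference, key=distance)
    completed = dict(nearest)   # start from the neighbor's full record
    completed.update(partial)   # the customer's own values always win
    return completed


profile = fill_missing({"chest": 103.0, "waist": 89.0})
```

Here the second reference record is closest on chest and waist, so its hip and inseam values are borrowed while the customer's own chest and waist are kept.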

While the company as a whole had funding issues, and eventually ceased to exist, the lessons learned from the design of its software were invaluable.


MIT Media Lab

OpenMind/CommonSense Knowledgebase: OpenMind/CommonSense is a web-based project run by Marvin Minsky and Push Singh through the MIT Media Lab. Its goal is to "make computers smarter by [giving them] the millions of pieces of ordinary knowledge that constitute 'common-sense'." To this end, the site consists of a set of about 25 activities which allow the user to add to a knowledgebase of interlinked, descriptive, natural-language text. Sample activities include "Where Things Are" (describe where things are found), "Describe a Picture", and "Explain why" (tell why a fact is true).

Upon joining the project, I found that one large gap in the system's "knowledge" came in relating different snippets to each other. Many pieces of knowledge existed, but how were they connected?

To try to resolve this, I worked on designing a set of activities that incorporated more than one knowledge snippet into a larger framework. The first of these allowed users to tell a "story" - a sequence of events which are related temporally. Users could also add to an existing story, even branching the story at a specific event to give it multiple endings.
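One way to picture a branching story is as a tree of events, where each path from the first event to a leaf is one complete storyline. The Python sketch below is only an illustration of that idea; the Event class, the then and endings helpers, and the sample sentences are hypothetical and do not describe how the OpenMind site actually stored stories.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """One step in a story, with zero or more events that may follow it."""
    text: str
    next_events: list = field(default_factory=list)

    def then(self, text):
        """Append a following event and return it, so a story can be extended."""
        child = Event(text)
        self.next_events.append(child)
        return child


def endings(event, path=()):
    """Yield every storyline from `event` to an ending, as a tuple of event texts."""
    path = path + (event.text,)
    if not event.next_events:
        yield path
    for child in event.next_events:
        yield from endings(child, path)


start = Event("Bob goes to a restaurant")
ordered = start.then("Bob orders soup")
ordered.then("Bob pays and leaves")            # original ending
ordered.then("Bob finds a fly and complains")  # branch added at the same event

storylines = list(endings(start))
```

Branching at "Bob orders soup" gives two storylines that share their first two events, which matches the idea of one story with multiple endings.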

I designed other activities to relate different pieces of knowledge as well. "Explain a relation" allowed the user to relate a pair of words that the system noticed always appeared near each other. "Cause and Effect" asked the user to provide an effect of a given action. "What Changed?" provided the knowledgebase with a sense of the differences before and after a given action.

Taken together, these changes provided a very disparate, granular knowledgebase with a sense of how its bits of knowledge interacted with one another, a component which is essential to what we call "common sense".