Consider an application that maintains a database of bibliographic entries. Each entry refers to a published paper and includes the title and abstract, together with details relating to the provenance of the paper, such as its technical scope and author biographies. A simple interface agent observes the entries examined by the user and, based on these observations, presents new entries that the user might also find interesting.
To achieve this, the agent uses a machine learning algorithm to induce and use a user profile for each user. This profile represents the user's interests, and is induced from the observations made by the application. The observations capture the entries examined, and any related actions, e.g. the query used to find the entry, or the name of the file in which the entry was subsequently stored. New bibliographic entries are compared to the user profile, and any that match are presented to the user.
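As a rough illustration of the compare-and-present step, the sketch below stands in for the learning algorithm with a much simpler device: the profile is a bag of term counts summed over the examined entries, and a new entry "matches" when its cosine similarity to the profile reaches a threshold. The function names, the whitespace tokenizer and the threshold value are all assumptions made for this example, not part of the system described here.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for real pre-processing."""
    return text.lower().split()


def build_profile(examined_entries):
    """Sum term counts over the entries the user examined (assumed profile form)."""
    profile = Counter()
    for entry in examined_entries:
        profile.update(tokenize(entry))
    return profile


def cosine(a, b):
    """Cosine similarity between two term-count mappings."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(profile, new_entries, threshold=0.2):
    """Return the new entries whose similarity to the profile meets the threshold."""
    return [
        entry
        for entry in new_entries
        if cosine(profile, Counter(tokenize(entry))) >= threshold
    ]
```

A learned classifier would replace `cosine` and the fixed threshold; the surrounding flow of observing entries, building a profile and filtering new entries would stay the same.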
Fig. 1: A pre-processed bibliographic entry.
However, mapping these observations into a representation suitable for presentation to the learning algorithm can be problematic. Some of the fields in the bibliographic entries (Fig. 1) can be easily mapped to attributes within training instances (e.g. Publication Type). Mapping fields that contain free text (e.g. Abstract Terms) is harder, as the domain of each such field may contain on the order of 20,000 to 100,000 terms.
Fig. 2: Performance weights.
Most agent systems solve this problem by using only a subset of the available fields (normally those fields with a small domain, such as the Publication Type field), or by selecting a subset of terms (e.g. the most frequent 100 terms) found in the free text fields. This selection process is similar to that used for selecting attributes when using a machine learning algorithm. A number of studies have shown that the performance of a learning algorithm can be improved if it is integrated into the attribute selection process, as opposed to using separate pre-processing selection methods.
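The separate pre-processing approach described above can be sketched as follows: count terms across the free-text fields, keep the most frequent ones, and map each entry to a fixed-length binary attribute vector. The function names and the whitespace tokenizer are assumptions for illustration only.

```python
from collections import Counter


def select_frequent_terms(documents, k=100):
    """Return the k most frequent terms across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return [term for term, _ in counts.most_common(k)]


def to_instance(document, selected_terms):
    """Map a document to binary attributes, one per selected term."""
    present = set(document.lower().split())
    return [1 if term in present else 0 for term in selected_terms]
```

Note that the selection here never consults the learning algorithm, which is exactly the limitation that the integrated approaches aim to address.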
We are currently examining a variety of attribute selection and dimensionality reduction techniques to investigate how they can be applied to the task of identifying and selecting the fields and terms that are relevant to a classification task.
These selection techniques include simple statistical filters, e.g. rating terms using term frequency-inverse document frequency (tf-idf) measures; performance weights, which are adjusted depending on the performance of the learning algorithm on a classification task (Fig. 2); and geometric techniques that map observations, represented as vectors in a hyperspace whose dimensionality is determined by the number of terms in each field, into a much smaller subspace (a similar approach is used by Latent Semantic Indexing).
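The first two techniques can be sketched in a few lines. The tf-idf function below rates each term by its highest tf-idf value over a set of documents, using the plain `tf * log(N / df)` form. The performance-weight update is only a guess at one possible scheme: it multiplies the weight of every term in an instance by a factor above 1 when the instance was classified correctly and below 1 otherwise. The function names, the default weight of 1.0 and the `rate` parameter are assumptions and are not taken from the work described here.

```python
import math
from collections import Counter


def tfidf_scores(documents):
    """Rate each term by its highest tf-idf value across the documents."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term occurs.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    scores = {}
    for tokens in tokenized:
        for term, tf in Counter(tokens).items():
            idf = math.log(n_docs / doc_freq[term])
            scores[term] = max(scores.get(term, 0.0), tf * idf)
    return scores


def update_performance_weights(weights, terms, correct, rate=0.1):
    """Scale term weights up after a correct classification, down otherwise.

    This multiplicative rule is an assumed illustration of a performance weight.
    """
    factor = 1.0 + rate if correct else 1.0 - rate
    for term in terms:
        weights[term] = weights.get(term, 1.0) * factor
    return weights
```

Terms that occur in every document receive a tf-idf score of zero and are natural candidates for removal, while the performance weights give the learning algorithm's own results a say in which terms are kept.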
We believe that by combining various methods to identify and select those fields and terms found to be relevant to a classification task, we can improve the accuracy of machine learning techniques when used within interface agents. Also, by learning which fields and terms are important to an agent-based task, these generic techniques can be readily applied to most text-based interface agents.
A proposal is available that describes these techniques in greater detail.
U.K. Engineering and Physical Sciences Research Council (EPSRC) Studentship