Tuesday, May 30, 2006
Who Are You
(ACL Classification Paper Online)
So we have finally prepared the final version of our CoLing/ACL paper on the classification of text by personality of the author. I've previously discussed some of our reasons for attempting this: knowing more about an individual can help better model aspects of their language, such as their expression of sentiment. Here I just want to give you a brief introduction to what we were trying to do with this toe-dipping paper. The methodology is quite simple in terms of automated classification, and our approach is relatively straightforward, but looking at the areas we did made sense for our first move in this direction. Essentially, we experimented with two significant factors: the feature set and the division of data points.
I'll not sit here explaining the entire paper - although it seems like I have just done so. I just want to give you a brief idea of the direction we took in our work, because the methodology does appear quite basic. Obviously this is just the first work of this ilk of many that we intend to produce, so watch this space for further developments. If anyone wants to talk about the work, ask questions or give general feedback, I'd love to hear from you.
- Feature set: we begin with a pseudo bag-of-words approach using word bi- and tri-grams (sequences of two and three words). We investigated reducing the feature set statistically, in order to see how efficient a classifier we could produce. Many simple classification systems throw the frequencies of, say, hundreds or thousands of words at an algorithm, but we wanted to keep it minimal.
To do this we used the approaches described in my thesis to establish a relationship between n-grams and each personality trait. This consisted of determining which n-grams were used significantly more by the high-scoring or the low-scoring group than by anyone else. This reduces the set considerably, and produces better results than just throwing everything in the pot.
- Division of dataset: When dealing with personality in the thesis, we used a statistical approach to divide the corpus along each trait into high, medium and low subgroups. It occurred to us, however, that there were different ways to do this, all of which are similarly statistically valid. Different approaches also enabled better comparison to similar work.
The most obvious task is binary classification between high and low groups, and so we split the corpus three different ways. From there we attempted multi-class classification, by combining those variants with the different medium groups. We even attempted five different classes. Results vary, but as you would expect, higher accuracy is achieved on the easier tasks.
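For anyone who likes to see these things in code, here is a rough Python sketch of the two factors above. It is not the procedure from the paper or the thesis. The split rule (k standard deviations either side of the mean), the frequency-ratio cut-off standing in for a proper significance test, and all of the names (word_ngrams, split_by_trait, select_ngrams, ratio, min_count) are my own assumptions, made purely for illustration.

```python
from collections import Counter
from statistics import mean, pstdev


def word_ngrams(text, n):
    """Return the word n-grams of a text as tuples of n lowercased tokens."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def split_by_trait(scores, k=1.0):
    """Label each author high, medium or low on one trait.

    Assumed rule: more than k standard deviations above the mean is
    "high", more than k below is "low", everything else is "medium".
    """
    mu = mean(scores.values())
    sd = pstdev(scores.values())
    labels = {}
    for author, score in scores.items():
        if score > mu + k * sd:
            labels[author] = "high"
        elif score < mu - k * sd:
            labels[author] = "low"
        else:
            labels[author] = "medium"
    return labels


def select_ngrams(texts, labels, group, ratio=2.0, min_count=2):
    """Keep bi- and tri-grams that `group` uses far more than everyone else.

    A crude stand-in for a significance test: an n-gram is kept when its
    relative frequency inside the group is at least `ratio` times its
    (add-one smoothed) relative frequency in the rest of the corpus.
    """
    inside, outside = Counter(), Counter()
    for author, text in texts.items():
        target = inside if labels[author] == group else outside
        for n in (2, 3):
            target.update(word_ngrams(text, n))

    n_in = sum(inside.values()) or 1
    n_out = sum(outside.values())
    selected = []
    for gram, count in inside.items():
        if count < min_count:
            continue
        rate_in = count / n_in
        rate_out = (outside[gram] + 1) / (n_out + 1)
        if rate_in >= ratio * rate_out:
            selected.append(gram)
    return sorted(selected)
```

For the binary task you would simply drop the authors labelled "medium" before training, and a different value of k gives a different, equally defensible split. The selected n-grams for the high and low groups together then form the reduced feature set.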