Tuesday, May 30, 2006
A Whole New World
(Don't Look Down, We've A New Blog)
As I mentioned the other day, I had the idea that Amanda and I should blog our experiences as we prepare to start our new life in Australia. This is of course NOT just pure blogosphere vanity; it will be a good way to keep friends here (and across the world) up to date on our goings on. I've recommended it to a number of people going on long trips...why write multiple emails saying roughly the same thing (or even one mass email) when you could blog it?

So Amanda has set this up and got the ball rolling. Go check out our [Exit Music].

Who Are You
(ACL Classification Paper Online)
So we have finally prepared the final version of our CoLing/ACL paper on the classification of text by personality of author. I've previously discussed some of our reasons for attempting this: knowing more about an individual can help better model aspects of their language, such as their expression of sentiment. Here I just want to give you a brief introduction to what we were trying to do with this toe-dipping paper. The methodology is quite simple in terms of automated classification, and our approach is relatively straightforward, but looking at the areas we did made sense for our first move in this direction. Essentially, we experimented with two significant factors: the feature set, and the division of data points.
  1. Feature set: we begin with a pseudo bag-of-words approach using word bi- and tri-grams (sequences of two and three words). We investigated reducing the feature set statistically, in order to see how efficient a classifier we could produce. Many simple classification systems throw the frequency of say 100s or 1000s of words at an algorithm, but we wanted to keep it minimal.

    To do this we used the approaches described in my thesis to establish a relationship between n-grams and each personality trait. This consisted of determining which n-grams were used significantly more by the high- or low-scoring groups than by anyone else. This reduces the set considerably, and produces better results than just throwing everything in the pot.

  2. Division of dataset: When dealing with personality in the thesis, we used a statistical approach to divide the corpus along each trait into high, medium and low subgroups. It occurred to us, however, that there were different ways to do this, all of which are similarly statistically valid. Different approaches also enabled better comparison to similar work.

    The most obvious task is binary classification between high and low groups, and so we split the corpus three different ways. From there we attempted multi-class classification, by combining those variants with the different medium groups. We even attempted five different classes. Results vary but, as you would expect, high accuracy is achieved on easier tasks.
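To make the two factors above a little more concrete, here is a minimal Python sketch. It is not the code from the paper. The function names, the mean plus or minus one standard deviation cut points, and the simple frequency-ratio filter (standing in for a proper significance test) are all assumptions made purely for illustration.

```python
from collections import Counter
from statistics import mean, stdev


def word_ngrams(text, n):
    """Return the word n-grams (as tuples) found in a text."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_counts(text, sizes=(2, 3)):
    """Count word bi- and tri-grams in one text."""
    counts = Counter()
    for n in sizes:
        counts.update(word_ngrams(text, n))
    return counts


def split_by_trait(scores, width=1.0):
    """Label each author high, medium or low on one trait.

    Assumption: the cut points sit at mean +/- width * standard deviation.
    Other, equally valid divisions are possible.
    """
    centre = mean(scores.values())
    spread = stdev(scores.values())
    labels = {}
    for author, score in scores.items():
        if score > centre + width * spread:
            labels[author] = "high"
        elif score < centre - width * spread:
            labels[author] = "low"
        else:
            labels[author] = "medium"
    return labels


def group_specific_ngrams(texts, labels, group, min_ratio=2.0):
    """Keep n-grams that `group` uses relatively more than everyone else.

    Assumption: a crude relative-frequency ratio with add-one smoothing
    replaces a real statistical test.
    """
    inside, outside = Counter(), Counter()
    for author, text in texts.items():
        target = inside if labels[author] == group else outside
        target.update(ngram_counts(text))
    in_total = sum(inside.values()) or 1
    out_total = sum(outside.values())
    selected = set()
    for gram, count in inside.items():
        in_rate = count / in_total
        out_rate = (outside[gram] + 1) / (out_total + 1)
        if in_rate / out_rate >= min_ratio:
            selected.add(gram)
    return selected
```

Calling split_by_trait once per trait, and then group_specific_ngrams for the "high" and "low" labels, gives a reduced feature set that could be handed to any standard classifier.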
I'll not sit here explaining the entire paper - although it seems like I have just done so. I just want to give you a brief idea of the direction we took in our work, because the methodology does appear quite basic. Obviously this is just the first work of this ilk of many that we intend to produce, so watch this space for further developments. If anyone wants to talk about the work, ask questions or give general feedback, I'd love to hear from you.

Sunday, May 28, 2006
Men at Work
A post title with even more hidden meanings and available interpretations than normal. So I got official confirmation this past week, that I have been offered a job that I applied for back at the beginning of March. I did the interview at the end of March, while I was in Palo Alto in fact. I didn't like to say anything because... well... I just didn't. Needless to say though, I can now confirm that I got the job.

What's more, it's exactly where we want to be: Sydney, Australia. I cannot tell you how excited Amanda and I are at the prospect of the big move down under. We have been looking forward to leaving behind the grey and starting a whole new life for so long now. It's just going to be (expressive word of the day) awesome!!

I had a thought this morning...we should blog it. You know, what we do to get there, our thoughts once we arrive, document our experiences in the hope that we may one day aid and encourage other fledgling migrants. There are zabillions of bloggers out there doing the same thing, with so few readers available, but I'm not sure that really matters. Of course, it's hard enough to keep one blog updated so...

Wednesday, May 24, 2006
The Wonderful World of Weblogs
(At the WWW Conference)
Attended the Workshop on the Weblogging Ecosystem (WWE) at the World Wide Web (WWW) conference yesterday. Didn't have to travel far as it was being hosted here in Edinburgh. What tipped me off that this was a big deal of a conference, beyond the camera crews filming stuff, and the fact that parts made the news, was that Jack McConnell, Scotland's First Minister, was among the opening speakers. Of course, I was only there for the workshop, so I had a bit of a lie in.

I suppose given the 5000 miles, it's not surprising that there weren't so many faces familiar from the CAAW symposium I was at in March. It did offer an opportunity to catch up with Matt and Natalie of BlogPulse, along with meeting Thomas Lento from Cornell, who I didn't get a chance to talk to in Palo Alto.

The day was pretty interesting. Mostly different work from CAAW, with just a smidgeon of overlap. Disappointingly there was little work on the language of blogs from a sociobiographic perspective, but there really aren't that many of us.

One interesting difference is that BlogPulse released a blog corpus for people to work with. The aim was to report the results of what you did with it at the conference. This is a standard idea in technical conferences, but the first time I've seen it done. The obvious follow up was the discussion section of the day: the topic was around the general idea of creating a really good corpus for all to share. There were lots of good ideas, and everybody is very keen. The tricky part is finding people to do the work: the cleaning, removing splogs etc. Of course no corpus would be of use to me without some serious demographic labeling, and that sort of data doesn't just grow on trees.

One interesting observation concerning the day's blogademics...very few of them blog. I got that impression at CAAW, but here we were asked to write our name, affiliation, email and blog URL if we had them. I was near the back as the paper was passed around and a quick scan revealed very few URLs. I know I've made this point before, and I'm not saying people have to blog to study blogs; it just wouldn't hurt.

All in all it was a pretty good day and I got a neat bag!

Friday, May 19, 2006
Updating Reality
(So What's Been Happening)
Well, somehow I got into a pattern of posting once a week on a Wednesday. This wasn't initially deliberate, but you know, it became a bit of a thing. And then I missed last week because I was just too busy. So I started thinking I should post on another day, so Weblog Wednesday doesn't become a fully ingrained thing. And then it reached this most recent Wednesday and I figured why not! Once more, however, I was busy. So here I am determined to post, because so much has been happening, and I'm going to break it down into bullet points because they can be easier to read.
  • I have been working quite a bit the last weeks making the final corrections to my thesis. Amanda read the entire thesis for me and identified an impressive number of errors (I have an odd habit of pluralising things in an unnecessarily Ali G ways, and singularising things when there are lot of them [sic]). I've extended my literature review appropriately, which includes significantly rearranging the middle section. I've replaced linear regression with logistic regression for gender, and I added in a paragraph that LaTeX had decided to remove.

    The appropriate documentation is with the exam board, the thesis is at the library for fancy dan hard back binding, and the electronic copy has been updated for your reinvigorated reading pleasure.

  • I have volunteered to help run some perception experiments. The experiments themselves are pretty straightforward, requiring little more than starting groups off and being around to deal with any issues. However, lining up subjects takes a little more effort. This is my first experience with this since my data gathering was done on-line, so I didn't have to worry about such things. It'll be good experience though.

  • Still not had a chance to look at the corrections required for the ACL paper I talked about last time. They are due pretty soon though, so I'll be getting right onto them, not least so that I can finally post the finished paper here and in my web space. We are also thinking about doing another poster for the local event, but the deadline for that is also soon so we'll see.

  • I am also supposed to be going to the Weblogging Ecosystem workshop next week. It is part of the World Wide Web Conference which is actually being held here in Edinburgh. This makes it very easy to get to, but we haven't sorted out registration yet. Leaving it late, I know; online registration has certainly closed. It is being co-chaired by Matt Hurst (amongst others), who I may have mentioned previously. A bit of a homecoming for Matt, getting to see all his old haunts.
There are a few other things which I could talk about, but I feel I've taken enough of people's time right here, so I'll save these for another day. I'm not even going to mention the issues I've been having concerning broken glasses, new trial mis-aligned contact lenses, and withdrawn cleaning solution!!

Wednesday, May 03, 2006
Get In There
(Conference Acceptance)

So I finally heard the news I had been expecting for the past couple of months: our paper was accepted to CoLing/ACL. For anyone who doesn't know, this conference brings together two of the largest (if not the two largest) computational linguistics conferences around. A super conference if you will. This year it is being hosted in Sydney. Which will be very nice. It will technically be winter over there, but no doubt it'll still be nicer than much of the summer over here.

I shall say more about the contents of the paper when I get a chance to post a link to it (I'm blogging from home right now, and I'd like to see the reviewers' comments first). The title kind of lets you know what it's about though:

Whose thumb is it anyway? Classifying author personality from weblog text.
The title refers to the fact that many of the early papers in sentiment analysis made some allusion to their movie review data (for example) as being good or bad - thumbs up or down. Sentiment analysis is very popular at the moment. There were a good many talks/posters at the recent symposium which dealt with detecting and classifying sentiment from weblogs. Our main argument though, and this has been suggested by others, is that one person's four-star review is another person's two (Pang & Lee, 2005).

Think about it.

If women are more comfortable expressing feelings and emotions than men, surely the language they use to review something will be different. The same can be said for people with different personality types. Consider these totally made up examples - entirely believable if you've read a lot of blogs:

  1. Wow! That was incredible! The acting was fantastic, the dialogue was sharp and polished and it held me from the off. Brilliant. Superb! I enjoyed that so very very much. Film of the year. Amazing.
  2. Went to the cinema tonight. Film was really good. Had a good night.
Imagine these were written by the same person. It's obvious they liked film 1 more than film 2. Imagine, however, that the first is written by an expressive, highly Open Extravert, and the second by a less Open Introvert. It is possible that these two people saw the same film, that they liked it very much, and if asked to give it a score, would both give it full marks. In any direct comparison between the two, the second is far less positive. If you were to take into account the natural style of the two authors however, it might be possible to rank the second as high as the first, giving a much more accurate picture of how much a product is liked (or of course disliked).
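One rough way to picture "taking the natural style of the author into account" is to rescale each raw positivity score against that author's own baseline. The sketch below is only an illustration under stated assumptions: the raw scores are made-up numbers (imagine a count of positive words per review), the function name is hypothetical, and an author's own history stands in for a personality profile. It is not the method used in the paper.

```python
from statistics import mean, pstdev


def normalise_by_author(raw_scores):
    """Turn raw positivity scores into per-author z-scores.

    raw_scores maps an author to a list of raw scores, one per review.
    Each score is rescaled by that author's own mean and spread, so an
    understated writer's best review can rank alongside an effusive
    writer's best review.
    """
    normalised = {}
    for author, scores in raw_scores.items():
        baseline = mean(scores)
        spread = pstdev(scores)
        if spread == 0:
            # No variation to scale by: every review sits at the baseline.
            normalised[author] = [0.0 for _ in scores]
        else:
            normalised[author] = [(s - baseline) / spread for s in scores]
    return normalised


# Made-up example: the last entry for each author is the same film.
raw = {
    "extravert": [6, 8, 10],
    "introvert": [1, 2, 3],
}
adjusted = normalise_by_author(raw)
```

On the raw numbers the extravert's review of the shared film looks far more positive (10 against 3). After rescaling, both reviews sit the same distance above their author's usual level.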

Well anyway, that is just a brief bit about why we are classifying text by author personality. I'll hopefully get the paper up soon, and I shall definitely say more about what we did. I also hope to tell you more about a little project that's recently been confirmed.
