Tuesday, November 25, 2003

Slacker? Moi?

How dare you. So yes, it's been a number of days since I last posted, but there's been little progress to report. But I reckon I'll take a break from the heady world of file manipulation and software fiddling, and fill you in.

So I've been finalising my methodology for getting the data ready and so far it's come to this:

  1. Get a single HTML file containing each subject's May archive.

  2. Use a package called Tidy to neaten up the HTML, and output the file as XHTML, which is HTML written as well-formed XML.

  3. Mark up the text in full XML using the tags I have created:
    • Using simple text processing techniques, such as Perl scripts, it should be possible to automate some of this process. For example, it should be really easy to automatically find the HTML tags < EM > and < B >, so the tagging of surface details could be done this way.
    • For more content-based tagging, I am looking to use a GUI-based mark-up tool. This may be needed for seemingly simple things like DAYS and POSTS, because many weblog tools have different ways of tagging these, so it may not be possible to do them all automatically.

  4. Spell checking. I still haven't quite decided how to do this, but I have a Java toolset that I'm going to start playing with, and we are considering the option of using XML to tag incorrectly spelt words with the correction, so that we have both versions available for analysis.
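
As a rough illustration of the surface-detail part of step 3, here is a minimal sketch. It is in Python rather than Perl purely for illustration, it assumes Tidy has already produced well-formed XHTML, and the "emph" tag with its "type" attribute is a made-up stand-in, not the real tag set.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical mapping from HTML surface tags to an invented mark-up scheme.
SURFACE_TAGS = {"em": "italic", "i": "italic", "b": "bold", "strong": "bold"}


def tag_surface_details(xhtml_text):
    """Rename emphasis/bold elements to a custom <emph type="..."> element."""
    root = ET.fromstring(xhtml_text)
    for element in root.iter():
        if not isinstance(element.tag, str):
            continue  # skip anything that is not a normal element
        local_name = element.tag.replace(XHTML_NS, "").lower()
        if local_name in SURFACE_TAGS:
            element.tag = "emph"
            element.set("type", SURFACE_TAGS[local_name])
    return ET.tostring(root, encoding="unicode")


sample = "<p>I was <em>so</em> tired, but <b>happy</b>.</p>"
print(tag_surface_details(sample))
```

Because the input is parsed as XML, this only works once the file really is well-formed. Real pages containing named entities such as &nbsp; would need extra handling before this parser accepts them.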

Problems encountered so far:

  1. Some blog providers archive by month, some by week, some by day and some by post. That is potentially a lot of files to collect, and one big concatenation.

  2. Some HTML files, which display perfectly well in a browser, have errors that need to be corrected before Tidy can correct the errors it corrects, but I can't find out how to make it tell me what those errors are.

  3. XML...hmm...
    • I'm used to writing Perl scripts that search files line by line. But what if matching tags run over a line break? That starts to get tricky. I'm thinking I might need to move this stage before the Tidy stage, so that I remove all line breaks, search through one big long line, and then use Tidy to stick them back in, which it seems to like doing. We'll see.
    • What XML tagger to use? The current plan is to use one that was developed in part here at the University, because my secondary supervisor worked on it, making access to the development team for much-needed help a whole lot easier. Of course, I need to track them down first.

  4. I'm not quite at that stage yet, but let's just hope my minimal Java skills are sufficient to cope.
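
On the line-break worry in problem 3, one alternative to stripping the line breaks is to search the whole file as a single string and let the pattern cross line boundaries. The sketch below is Python rather than Perl (Perl's /s modifier gives the same behaviour), and the function name is invented for the example.

```python
import re

# re.DOTALL lets "." match newlines, so a tag pair split over lines still matches.
# The non-greedy ".*?" stops at the first closing tag rather than the last one.
EM_PATTERN = re.compile(r"<em>(.*?)</em>", re.IGNORECASE | re.DOTALL)


def find_emphasised(html_text):
    """Return the text inside every <em>...</em> pair, with whitespace collapsed."""
    return [" ".join(match.split()) for match in EM_PATTERN.findall(html_text)]


sample = "It was <em>really,\nreally</em> late and <EM>nobody</EM> cared."
print(find_emphasised(sample))  # ['really, really', 'nobody']
```

This deliberately ignores tags with attributes and nested tags of the same kind, which a regular expression handles badly. For those, a proper XML parser on the Tidy output is the safer route.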

So, that's the thinking at the moment. I'm currently experimenting by taking a relatively simple file through the procedure. I'll let you know how it goes.

Friday, November 14, 2003

Even MORE to consider

As you might have come to expect of a Thursday, I had a meeting yesterday. I reported my results, told them of my YESs and NOs and we discussed the MAYBEs. I showed them my thoughts so far on the tags I am planning on using for mark-up. One thing I am going to have to try is to keep as much surface detail as I can, so any bolding, italicising or CAPITALISATION needs to be tagged.

Actually, CAPITALISATION may not need to be tagged, since it is a property of the text itself, and not the HTML. But why not?

So I need to look at mark-up tools, because it turns out you don't have to do it by hand, tag by tag. There are people in the department who are familiar with XML and its tools and may be able to help me.

Another thing I need to investigate, which I didn't really count on, is spell checkers. There are various arguments to be made for data cleaning...that is, correcting the spelling. Simply, clean data is better than unclean data. Imagine the word...any word..."rhetoric". It may be that this will show something important. But what if half of all uses are spelt "retoric"? The two data items would be treated separately. And what if I am only considering factors that happen at least 10 times, but there are 9 "rhetoric"s and 1 "retoric"? "Rhetoric" will miss the cut and not be considered.

Now this is fair, because many spelling mistakes are merely unchecked typos. But what about text speak (or txt spk)? Here are instances of words spelt incorrectly, but deliberately so, so you don't really wish to correct them, because they may turn up interesting results. But the intended word is the same, so maybe you do. I'm thinking what you need is a database of spelling variations and common mistakes, so that you can look up the variations and see which base they relate to.

e.g. "text" and "txt" are both "text"; "the" and "teh" are both "the"; and so on.
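
A minimal sketch of that look-up idea, in Python for illustration only. The VARIANTS table is hand-typed and hypothetical; a real one would need to be far larger and built from the data.

```python
from collections import Counter

# Hypothetical, hand-built table: each variant maps to its base form.
VARIANTS = {"txt": "text", "teh": "the", "retoric": "rhetoric"}


def count_base_forms(words):
    """Count words after mapping known variants onto their base form."""
    counts = Counter()
    for word in words:
        lowered = word.lower()
        counts[VARIANTS.get(lowered, lowered)] += 1
    return counts


# The "rhetoric" example: 9 correct spellings and 1 misspelling.
words = ["rhetoric"] * 9 + ["retoric"]
raw = Counter(words)
merged = count_base_forms(words)
print(raw["rhetoric"], merged["rhetoric"])  # 9 10
```

With a cut-off of 10 occurrences, the raw count misses the cut while the merged count makes it. Keeping both counters around means the uncorrected view of the data is never lost.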

But what about errors like "there" vs "their"? How easy is it to correct this? Likewise words that, when spelt incorrectly, produce other correctly spelt words, like "spell" and "spill". On the their/there issue, maybe that is a factor worth looking at.

Ultimately there are many arguments to be made for using both the corrected and uncorrected versions of my texts. Problem is, the spell checking could take a really long time. Which is why I need to look into spell checkers, to see what they do. Since spelling might also be an interesting factor, I'll need to keep a log of corrections made.
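
One way to keep both versions and a log at the same time is to wrap each corrected word in a tag that carries the correction. The sketch below is an assumption-laden illustration: the "sp" tag, its "corr" attribute and the CORRECTIONS table are all invented, and it splits on whitespace only, so a word with punctuation attached would slip through.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical correction table; a real spell checker would supply this.
CORRECTIONS = {"teh": "the", "retoric": "rhetoric"}


def tag_misspellings(text):
    """Wrap known misspellings in <sp corr="...">, returning the text and a log."""
    tagged = []
    log = []
    for token in text.split():
        correction = CORRECTIONS.get(token.lower())
        if correction is None:
            tagged.append(escape(token))
        else:
            # The original spelling stays as the element text; the fix is an attribute.
            tagged.append(f"<sp corr={quoteattr(correction)}>{escape(token)}</sp>")
            log.append((token, correction))
    return " ".join(tagged), log


tagged_text, correction_log = tag_misspellings("I love teh retoric")
print(tagged_text)
print(correction_log)
```

Analysis of the uncorrected text reads the element content, analysis of the corrected text reads the attribute, and the log doubles as data on how often each person misspells.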

Another issue...outliers. To anyone not familiar with statistics, outliers are those rare few data elements that lie way beyond the normal range of the data. For example:

I need to know the average age of my 10 (imaginary) subjects: 15, 16, 17, 16, 20, 21, 65, 17, 18, 22.

The subject aged 65 clearly stands out; it is obviously an outlier (in cases where it may not be so obvious, statistics can show us where the boundary is).

The average age of my subjects is 227/10 = 22.7 years. That average doesn't say much about the true data, because 90% of my subjects are actually younger than that, it's just that one is SO much older.

If we exclude outliers, the new average is 162/9 = 18. Makes sense, see?
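
The arithmetic above can be checked with a few lines of Python. For the boundary, this sketch uses the 1.5 x interquartile-range rule, which is one common convention and only an assumption here, not necessarily the test that will be used in the study.

```python
import statistics

ages = [15, 16, 17, 16, 20, 21, 65, 17, 18, 22]


def tukey_fences(values, k=1.5):
    """Return (low, high) bounds: anything outside them counts as an outlier."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


low, high = tukey_fences(ages)
kept = [age for age in ages if low <= age <= high]
outliers = [age for age in ages if not low <= age <= high]

print(statistics.mean(ages))  # 22.7
print(outliers)               # [65]
print(statistics.mean(kept))  # 18
```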

I have encountered this in at least one instance, when it comes to word count. I can't give you exact figures yet, but I know I have much of my data coming in at around the 10,000 word mark. Probably more below that, at around 5,000 words. But I have one text that weighs in at around 50,000 words. There is so much text from one individual that it may bias any results, so that features that appear general are actually only relevant to this one person. We will decide what to do nearer the time of analysis, but one suggestion is to cap the number of words we take from people for language analysis.
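
A sketch of what capping could look like, assuming the simplest possible rule: keep the first N words from each person. The cap of 10,000 and the two authors are made up for the example, and sampling whole posts would be an equally plausible variation.

```python
def cap_words(text, max_words):
    """Keep at most max_words whitespace-separated words from the start of text."""
    words = text.split()
    return " ".join(words[:max_words])


# Made-up corpus: one typical author and one very prolific one.
corpus = {"author_a": "word " * 5000, "author_b": "word " * 50000}
CAP = 10000  # hypothetical cap, not a decided figure

capped = {author: cap_words(text, CAP) for author, text in corpus.items()}
for author, text in capped.items():
    print(author, len(text.split()))
```

Authors under the cap are left untouched, while the 50,000-word author contributes no more than anyone else could.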

It occurs to me that my most recent posts have been quite long, and possibly boring, but honestly, the point of this blog is to tell you about my work. To give you an insight, not just into my research, but into how that work happens. These are just some of the issues that come up all the time while researching things. I hope you don't mind me discussing these with you.

Tuesday, November 11, 2003

On diaries

While Googling, I came across a quote from Sir Walter Scott:

What is a diary as a rule? A document useful to the person who keeps it. Dull to the contemporary who reads it and invaluable to the student, centuries afterwards, who treasures it.

I think he's wrong about the dull part, and I should hope I don't have to wait centuries for my PhD.

On an interesting note, I've also seen this quote attributed to an actress in 1908, Dame Ellen Terry. By the BBC no less.

Classification Stage 1...complete

At last. I've finally looked through ALL the blogs of my subjects, and classified them as YES, NO or MAYBE. But disappointingly, the final few were all NOs. There was a stark increase in NOs. Mostly people who didn't even blog in May, the month I was asking for text for. Silly people. Getting my hopes up, holding the data in front of my nose, but just whipping it away at the last minute.

So, now I need to deal with all those MAYBEs. The plan is to mark a couple up for our meeting on Wednesday and we can discuss it from there. I have fewer YESs than I had anticipated, so I'm hoping to include as many MAYBEs as possible. Not that I'm going to lower my standards, or massage the requirements just so I can have more data.

The problem is I'm no longer sure what a diary is.

Most people define things by their understanding of them. To me, my diary was where I wrote down what I did every day. So that was what I was looking for when I asked for diary-style blogs: a weblog detailing people's everyday lives, no matter how mundane. I never really wrote anything personal in my diaries, but people do. People put their thoughts, their feelings, they put themselves into their diaries.

A while ago I wrote about a similar issue. In summary:

If you want to vent, write about your day, or put an event on paper, you are keeping a diary.

If you are writing with a goal in mind, or your writing has a specific purpose, then you are journaling.

One way I have started to see it is that a journal is a log of what you did, but a diary is what you thought and felt. The terms are technically interchangeable: one dictionary definition I came across defines a journal as a diary.

So my problem is that I also see a lot of comment weblogs. The purpose of these blogs is to discuss. There is nothing of the life of the author, beyond their opinions on certain matters important to them. These are not the kind of blog that I am interested in.

But what of the diarist who first writes briefly of just one event that happened in the day, and that leads them to discuss some issue arising from that? It's like a diary, but they are making a comment on something, in a diary-esque way. It's tricky.

I think I could say more about this, but for now my brain is fuzzing over, so I'll come back to it. Please feel free to add any thoughts you have on this.

Thursday, November 06, 2003

Tuesday's Work Update

I never did repost my brilliant words from Tuesday, and since I've just had a meeting, I might as well do it now.

So, progress has been slow, because I was ill at the weekend, but basically I am still working my way through my data classifying it as YES, NO and MAYBE. Things have taken a turn for the worse of late, as the percentage of NOs and MAYBEs is on the increase. I am currently sitting on about 60% YESs, which is quite low, but I have a feeling it will end closer to 70%, which is better, and hopefully, if half the MAYBEs become YESs, that'll be a decent amount of data.

YESs and NOs are easy to deal with. Blogs are either good enough or they aren't. The problem comes, and the time is taken, over the MAYBEs. It all comes down to my definition of the diary/journal style I'm looking for, along with the ratio of diary/journal posts to posts that purely link to and discuss the outside world.

We have a couple of options for dealing with these maybes:

1 - So far, my judgements are purely based on eyeballing the data, but the final decision on the MAYBEs will have to be more scientific. One option is to determine an exact point at which data becomes a YES or a NO. This could be word count or post count, and it would require marking the text within each blog as either personal or comment/link. We could say that a blog needed to be at least 50% personal for it to get in.

2 - We could also continue to rely on human judgements, but not just mine. We get a number (at least 2) of raters to read the MAYBE blogs (along with a couple of YES and NO blogs for guidance) and get them to decide independently if each is a YES or NO. Then you compare the ratings and get an answer for each blog.
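
Option 1 boils down to a ratio and a cut-off. Here is a minimal sketch, assuming each post has already been hand-labelled as "personal" or "comment" and that the ratio is measured by word count. The labels, function names and the 0.5 default are illustrative only.

```python
def personal_ratio(posts):
    """posts is a list of (label, text) pairs; label is 'personal' or 'comment'."""
    total = sum(len(text.split()) for _, text in posts)
    if total == 0:
        return 0.0
    personal = sum(len(text.split()) for label, text in posts if label == "personal")
    return personal / total


def classify(posts, threshold=0.5):
    """Turn a MAYBE into a YES or NO using the share of personal words."""
    return "YES" if personal_ratio(posts) >= threshold else "NO"


maybe_blog = [
    ("personal", "went to the shops today and bought bread"),
    ("comment", "interesting article about politics"),
]
print(personal_ratio(maybe_blog))
print(classify(maybe_blog))  # YES
```

Swapping word count for post count only changes what gets summed, so the same mark-up supports either choice of threshold.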

After the meeting we've decided to lean towards a variation of method 1. For the moment, all MAYBEs where this is an issue will be included. I will begin to mark up all the data, including marking posts as either personal or comment/link. Once this is complete we can get a look at the statistics: word count, post count etc. This should allow us to decide a suitable threshold for decision making.

Also, it is best to keep as much data as you can, because once you throw it away, you lose it forever; you cannot later slip it back into the study. But if you keep it in, and at some point you realise you can't keep it, it is easy to remove.

Once the data is marked-up, it may even be possible to include all the MAYBEs but not include the comment/link text in the analysis.

A lot to think about for next week's meeting.

Who broke Blogger?

No, really? Come on...own up. I reloaded Blogademia. First of all I got an error saying something like:

could not find file naysmith/orange.html

What is that all about? I hit back, and there was Blogademia. I tried again...WHAT THE ...!?

There was MY URL, and someone else's blog. I clicked on their archives and, although the address was clearly my base URL plus their archive, it worked!!!

What's up with that?

PS - I've learned my lesson, I'm writing this entry in a text editor in case Blogger messes up like it did...yay...

Tuesday, November 04, 2003


at least it'll give me something to do tomorrow :D


no matter how many times i hit reload, it's still not appearing.


clearly just keeping blogger open is wrong. I wrote a really long post about my work. when i hit post it wanted me to log in. i did. my post never appeared. it's gone. i'm annoyed. stoopid blogger. WHERE IS IT?!?!?!?!!!!?!??!?!
