Blogademia

Friday, November 14, 2003

Even MORE to consider

As you might have come to expect of a Thursday, I had a meeting yesterday. I reported my results, told them of my YESs and NOs, and we discussed the MAYBEs. I showed them my thoughts so far on the tags I am planning on using for mark-up. One thing I am going to have to try is to keep as much surface detail as I can, so any bolding, italicising or CAPITALISATION needs to be tagged.

Actually, CAPITALISATION may not need to be tagged, since it is a property of the text itself, and not of the HTML. But why not?

So I need to look at mark-up tools, because it turns out you don't have to do it by hand, tag by tag. There are people in the department who are familiar with XML and tools and may be able to help me.

Another thing I need to investigate, which I didn't really count on, is spell checkers. There are various arguments to be made for data cleaning...that is, correcting the spelling. Simply, clean data is better than unclean data. Imagine the word...any word..."rhetoric". It may be that this will show something important. But what if half of all uses are spelt "retoric"? The two data items would be treated separately. And what if I am only considering factors that happen at least 10 times, but there are 9 "rhetoric"s and 1 "retoric"? "Rhetoric" will miss the cut and not be considered.

Now this is fair, because many spelling mistakes are merely unchecked typos. But what about text speak (or txt spk)? Here are instances of words spelt incorrectly, but deliberately so, so you don't really wish to correct them, because they may turn up interesting results. But the intended word is the same...so maybe you do. I'm thinking what you need is a database of spelling variations and common mistakes, so that you can look up the variations and see which base they relate to.

E.g. "text" and "txt" are both "text". "the" and "teh" are both "the". And so on.
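To make the look-up idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `VARIANTS` table, the `base_form` and `count_words` names, and the choice to lower-case words before looking them up (which throws away CAPITALISATION, so it would only be used for counting, not for the stored text).

```python
from collections import Counter

# Hypothetical variant table: each known variant maps to its base form.
VARIANTS = {"txt": "text", "teh": "the", "retoric": "rhetoric"}

def base_form(word):
    """Return the base form of a known variant, else the lower-cased word."""
    lowered = word.lower()
    return VARIANTS.get(lowered, lowered)

def count_words(words, normalise=True):
    """Count words, optionally folding variants into their base form."""
    return Counter(base_form(w) if normalise else w.lower() for w in words)

# The "rhetoric" example: 9 correct spellings and 1 misspelling.
words = ["rhetoric"] * 9 + ["retoric"]
raw = count_words(words, normalise=False)
merged = count_words(words)

MIN_COUNT = 10  # only consider words that occur at least 10 times
print(raw["rhetoric"] >= MIN_COUNT)     # False: misses the cut
print(merged["rhetoric"] >= MIN_COUNT)  # True: makes the cut once merged
```

Keeping both counts around means the corrected and uncorrected views of the data can be compared later.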

But what about errors like "there" vs "their"? How easy is it to correct these? Likewise words that, when spelt incorrectly, produce other correctly spelt words, like "spell" and "spill". On the their/there issue, maybe that is a factor worth looking at.

Ultimately there are many arguments to be made for using both the corrected and uncorrected versions of my texts. Problem is, the spell checking could take a really long time. Which is why I need to look into spell checkers, to see what they do. Since spelling might also be an interesting factor, I'll need to keep a log of corrections made.
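A log of corrections could be as simple as a list of (position, original, replacement) entries built while correcting. This is only a sketch of the idea, not what any real spell checker does: the `correct_with_log` function and the small corrections table are made up for illustration.

```python
def correct_with_log(tokens, corrections):
    """Apply corrections to a list of tokens and record each change.

    Returns the corrected tokens plus a log of
    (position, original, replacement) tuples.
    """
    corrected = []
    log = []
    for position, token in enumerate(tokens):
        replacement = corrections.get(token.lower())
        if replacement is None:
            corrected.append(token)  # unknown to the table: leave as-is
        else:
            corrected.append(replacement)
            log.append((position, token, replacement))
    return corrected, log

# Hypothetical corrections table; "txt" and "spk" are deliberately left alone.
corrections = {"teh": "the", "retoric": "rhetoric"}
tokens = "teh retoric of txt spk".split()
corrected, log = correct_with_log(tokens, corrections)

print(" ".join(corrected))
for position, original, replacement in log:
    print(position, original, "->", replacement)
```

Because the log keeps the original spelling and where it occurred, the uncorrected text can always be rebuilt, and the mistakes themselves can be counted as a factor.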

Another issue...outliers. To anyone not familiar with statistics, outliers are those rare few data elements that lie way beyond the normal range of the data. For example:

I need to know the average age of my 10 (imaginary) subjects: 15, 16, 17, 16, 20, 21, 65, 17, 18, 22.

The subject aged 65 clearly stands out; it is obviously an outlier (in cases where it may not be so obvious, statistics can show us where the boundary is).

The average age of my subjects is 227/10 = 22.7 years. That average doesn't say much about the true data, because 90% of my subjects are actually younger than that, it's just that one is SO much older.

If we exclude the outlier, the new average is 162/9 = 18. Makes sense, see?
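For the curious, here is the same sum done in Python, along with one common way of letting statistics draw the boundary. The boundary rule used here (anything more than 1.5 times the interquartile range beyond the quartiles) is my assumption for the sketch; it is one convention among several, and not necessarily the one I will end up using.

```python
import statistics

ages = [15, 16, 17, 16, 20, 21, 65, 17, 18, 22]

# Quartiles of the data (statistics.quantiles needs Python 3.8 or later).
q1, _median, q3 = statistics.quantiles(ages, n=4)
iqr = q3 - q1

# Assumed rule: values beyond 1.5 * IQR from the quartiles are outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

kept = [age for age in ages if lower <= age <= upper]
outliers = [age for age in ages if not lower <= age <= upper]

print("average with everyone:", statistics.mean(ages))
print("outliers:", outliers)
print("average without outliers:", statistics.mean(kept))
```

With these ten ages the only value outside the boundary is the 65, so the two averages match the hand calculation.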

I encounter this in at least one instance, when it comes to word count. I can't give you exact figures yet, but I know I have much of my data coming in at around the 10,000 word mark. Probably more below that, at 5,000 words. But I have one text that weighs in at around 50,000 words. There is so much text from one individual that it may bias any results, so that features that appear general are actually only relevant to this one person. We will decide what to do nearer the time of analysis, but one suggestion is to cap the number of words we take from people for language analysis.
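A cap like that is easy to sketch. The version below simply keeps the first N words from each person; that choice, the `cap_words` name, the author labels and the 10,000 figure are all assumptions for illustration (taking a random sample of each text would be another option).

```python
def cap_words(texts_by_author, max_words):
    """Keep at most max_words words from each author's text."""
    capped = {}
    for author, text in texts_by_author.items():
        words = text.split()          # naive whitespace tokenisation
        capped[author] = words[:max_words]
    return capped

# Made-up corpus with the rough sizes mentioned above.
corpus = {
    "author_a": "word " * 5000,
    "author_b": "word " * 10000,
    "author_c": "word " * 50000,
}
capped = cap_words(corpus, max_words=10000)

for author, words in capped.items():
    print(author, len(words))
```

After capping, the 50,000 word text contributes no more than anyone else's, while the shorter texts are left untouched.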

It occurs to me that my most recent posts have been quite long, and possibly boring, but honestly, the point of this blog is to tell you about my work. To give you an insight, not just into my research, but into how that work happens. These are just some of the issues that come up all the time while researching things. I hope you don't mind me discussing these with you.
- posted by scott @ 2:45 pm