Blogademia

Tuesday, November 25, 2003

Slacker? Moi?

How dare you. So yes, it's been a number of days since I last posted, but there's been little progress to report on. But, I reckon I'll take a break from the heady world of file manipulation and software fiddling, and fill you in.

So I've been finalising my methodology for getting the data ready and so far it's come to this:

Get a single HTML file containing each subject's May archive.
Use a package called Tidy to neaten up the HTML, and output the file as XHTML, which is HTML written as well-formed XML.
Mark up the text in full XML using the tags I have created:
- Using simple text processing techniques, such as Perl scripts, it should be possible to automate some of this process. For example, it should be really easy to automatically find the HTML tags <EM> and <B>, so the tagging of surface details can be automated.
- For more content-based tagging, I am looking to use a GUI-based mark-up tool. This may be needed for seemingly simple things like DAYS and POSTS, because many weblog tools have different ways of tagging these, so it may not be possible to do them all automatically.
Spell checking. I still haven't quite decided how to do this, but I have a Java toolset that I'm going to start playing with, and we are considering the option of using XML to tag incorrectly spelt words with the correction, so that we have both versions available for analysis.
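
To make the spell-checking idea a bit more concrete, here is a minimal sketch of tagging a misspelt word with its correction so that both versions survive. It is written in Python rather than the Java toolset mentioned above, and everything specific in it is an assumption for illustration: the `<sp>` element, its `corr` attribute, and the tiny `CORRECTIONS` table standing in for a real spell checker.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical correction table; a real spell checker would supply these pairs.
CORRECTIONS = {"teh": "the", "recieve": "receive"}


def tag_misspellings(text, corrections=CORRECTIONS):
    """Wrap each known misspelling in an <sp> element that carries the
    correction as an attribute, leaving the original spelling as the content.

    The <sp> tag and corr attribute are made-up names for this sketch.
    """
    pieces = []
    # The capturing group makes re.split keep the words as well as the
    # text between them, so nothing from the input is lost.
    for token in re.split(r"([A-Za-z']+)", text):
        fixed = corrections.get(token.lower())
        if fixed is None:
            # Not a known misspelling: just make it safe for XML.
            pieces.append(escape(token))
        else:
            pieces.append("<sp corr=%s>%s</sp>" % (quoteattr(fixed), escape(token)))
    return "".join(pieces)


if __name__ == "__main__":
    print(tag_misspellings("I recieve teh post & reply"))
```

Keeping the original spelling as the element content and the correction as an attribute means an analysis can read either version from the same file.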

Problems encountered so far:

Some blog providers archive by month, some by week, some by day and some by post. That is potentially a lot of files to collect, and one big concatenation.
Some HTML files, which display perfectly well in a browser, have errors that need to be corrected before Tidy can correct the errors it corrects, but I can't find out how to make it tell me what those errors are.
XML...hmm...
- I'm used to writing Perl scripts that search files line by line. But what if matching tags run over a line...it starts to get tricky. I'm thinking I might need to move this stage before the Tidy stage, so that I remove all line breaks, search through one big long line, and then use Tidy to stick them back in, which it seems to like doing. We'll see.
- What XML tagger to use? The current plan is to use one that was developed in part here at the University, because my secondary supervisor worked on it, making access to the development team for much-needed help a whole lot easier. Of course, I need to track them down first.
I'm not quite at that stage yet, but let's just hope my minimal Java skills are sufficient to cope.
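
On the tags-running-over-a-line problem, one alternative to stripping the line breaks is to read the whole file into a single string and let the pattern match across newlines. The sketch below does this in Python rather than Perl; the target names in `TAG_MAP` are hypothetical stand-ins for my own XML tags, and it assumes the HTML tags carry no attributes.

```python
import re

# Hypothetical mapping from HTML surface tags to project-specific XML tags.
TAG_MAP = {"em": "emphasis", "b": "bold"}

# re.DOTALL lets "." match newlines, so a pair of tags can span several lines.
# re.IGNORECASE covers <EM> as well as <em>; \1 requires the same tag to close.
PATTERN = re.compile(r"<(em|b)>(.*?)</\1>", re.DOTALL | re.IGNORECASE)


def retag(html):
    """Replace <em>...</em> and <b>...</b> pairs with the mapped XML tags.

    Works on the whole document as one string, not line by line.
    """
    def replace(match):
        name = TAG_MAP[match.group(1).lower()]
        # Recurse so that a different tag nested inside is converted too.
        inner = retag(match.group(2))
        return "<%s>%s</%s>" % (name, inner, name)

    return PATTERN.sub(replace, html)


if __name__ == "__main__":
    sample = "a <EM>phrase that spans\ntwo lines</EM> and <b>bold</b>"
    print(retag(sample))
```

Because the line breaks are left alone, the output keeps the layout of the input. A regular expression like this will still trip over a tag nested inside another of the same kind, so it is only suitable for simple surface tags.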

So, that's the thinking at the moment. I'm currently experimenting by taking a relatively simple file through the procedure. I'll let you know how it goes.
- posted by scott @ 5:54 pm