Blogademia

Friday, December 05, 2003

Lots Done, Lots Still To Do

So obviously it's been a busy week, yeah. I've done quite a bit, learned an amount and come closer to getting procedures finalised. Biggest lesson of the week: things ALWAYS take longer than you think they will.

So last week's meeting was a technical meeting. I met with my second supervisor to discuss the software I had been looking at, which she had been involved in:

< longstory >
The software was of no use.
< /cutshort >

It just wasn't really appropriate for what I wanted it for. However, coincidentally (and there will be a few of those in this post) a friend of my office mate's had been in visiting the day before and had downloaded an XML toolkit, < oXygen/ >. So we thought we'd take a look at that. Lo and behold, they do a Linux version. Champion.

Download. Install. Load.

It really was that simple. Initial impressions are that it certainly looks nice. And it seemed to have this option whereby if I highlighted a block of text, I could choose to surround it in a tag, selected from a list. This is exactly what I want. Some tagging will be automatic (the more the merrier) but I will need to do much of it by hand. This click 'n' pick technique is just what I'm looking for: I thought "by hand" meant literally that, traversing each text and hand coding every tag.

So, the next problem: how to extend the list of available tags to include those I wish to use. I perused the User's Manual, but it was just that: it told you how to use the tool, the menus, the options, but not what you actually did to make things work. It was assumed you knew that part. Rats.

Then, by coincidence, a new office mate showed up. There wasn't really anywhere for him to be, but we had a chat anyway (I did my Masters with him). Turns out, he's done a bit of XML. So he knows all about what is involved. I need two things:

In order to do the automatic tagging, I need to perform an XSL Transform (XSLT). This is a stylesheet of rules which will allow me to find all the HTML tags I need, and turn them into my form of tagging. XSLTs work on XML files, which is handy because it's what I need out, but it's also what I can put in: < oXygen/ > can import HTML files, resulting in XHTML, and tidy can output in that format (XHTML effectively being HTML files in XML form).
I need a Document Type Definition, or DTD. This defines the building blocks of an XML file: the grammar of your tag structure. It turns out that once I create this file, I can create an XML file in < oXygen/ > using said file, and there are my tags ready to be used. Wicked.
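
For a rough idea of what the automatic tagging step amounts to, here is a minimal sketch. It is written in Python with the standard library's ElementTree rather than as an actual XSLT stylesheet, so it only illustrates the idea of turning HTML tags into custom ones. The tag names in TAG_MAP and the sample input are made up for illustration, and the sample leaves out the XHTML namespace that a real file would carry.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from XHTML tags to invented corpus tags.
TAG_MAP = {"h2": "posttitle", "p": "para", "a": "link"}

# Tiny stand-in for an XHTML file (no namespace, for simplicity).
XHTML = (
    "<div>"
    "<h2>Lots Done</h2>"
    "<p>A busy <a href='http://example.org/'>week</a>.</p>"
    "</div>"
)


def retag(root):
    """Rename, in place, every element whose tag appears in TAG_MAP."""
    for elem in root.iter():
        if elem.tag in TAG_MAP:
            elem.tag = TAG_MAP[elem.tag]
    return root


root = retag(ET.fromstring(XHTML))
result = ET.tostring(root, encoding="unicode")
print(result)
```

Elements that are not in the mapping (the div here) are left alone, so anything the rules miss is still there to be tagged by hand afterwards.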

So off I set, using the links I've made above, to teach myself the fundamentals of XSLT and DTD writing. Obviously, with the transform coming first in the process, I got on with writing my DTD.

Armed with my new knowledge, I went for my meeting yesterday. It was disappointing that the number 1 supervisor didn't show up, but that's obviously the prerogative of busy senior academics. Number 2 supervisor and I, for some reason, given that we each had our own offices and the coffee room was real close, opted to sit outside number 1's office. That's right, we had a meeting in the corridor.

But what a good meeting it was. We looked over my DTD so far, and discussed what more should go in it, and what the XSLT should cover. We settled on the fact that I should keep absolutely as much surface data as I can, and then remove all remaining HTML, because as much as I wanted to keep it, we couldn't actually come up with a reason why. The only argument to be made was that it would help with the tagging, but if I have the HTML file open in a browser while tagging the stripped-down XML file, I can use that to help.

I've still not entirely settled on one aspect of my DTD, but all that affects is the later extraction (another XSLT) of the raw plain text from the file for the linguistic analysis. And of course, there is a chance I will discover some feature in a blog that I've not yet thought of, and I will need to evolve my DTD, but that shouldn't be too much of a problem.
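
The plain text extraction could look something like the sketch below. Again this is Python's ElementTree standing in for the second XSLT, and the post, posttitle, para and link tags are invented placeholders, not the real DTD. It assumes each child of the root element is one block of text.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagged file; the tag names are placeholders.
SAMPLE = (
    "<post>"
    "<posttitle>Lots Done</posttitle>"
    "<para>A busy <link>week</link>, yeah.</para>"
    "</post>"
)


def plain_text(xml_string):
    """Return one line of text per child of the root, with all tags dropped.

    Text sitting directly inside the root (outside any child) is ignored.
    """
    root = ET.fromstring(xml_string)
    lines = []
    for child in root:
        # itertext() yields the text of the element and all its descendants;
        # split/join collapses any runs of whitespace into single spaces.
        text = " ".join("".join(child.itertext()).split())
        if text:
            lines.append(text)
    return "\n".join(lines)


print(plain_text(SAMPLE))
```

Because the tags are simply dropped, it does not matter much how the DTD ends up nesting things, as long as the words themselves sit inside the elements.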

Well, that's pretty much the work I've been doing this past week, as far as I can remember. There are a couple more things I'd like to say, but I'm going to put them in separate posts, for reasons that shall become clear.

- posted by scott @ 4:07 pm