This was an interesting two-day meeting hosted by ATILF and organized by Susanne Alt and Veronika Lux-Podalla, which brought together a number of key TEI users and interested parties in the francophone world. It followed up a preliminary meeting also held in Nancy, at INIST, during the spring, at the time that Nancy established itself as a TEI host.
The programme combined themed sessions of presentations with parallel discussion groups. Following some words of welcome from the director of ATILF and from the European TEI editor, it began with a session on dictionaries and terminology. Christiane Fritze (Academy of Sciences of Berlin) reported on the results of a demonstration project using TEI as the exchange format to permit interaction between a group of historical dictionaries (previously encoded in various formats) and a corpus of digital texts, aided and abetted by the usual panoply of XSLT stylesheets and an eXist database. I was taken by the way in which the research group works: each month they choose a new project in the general area of promoting scholarly awareness in the humanities, and work on it. Despite its small scale, this was a persuasive demonstration of the benefits of using TEI markup to integrate the outputs of previously independent projects.
This was followed by a presentation on similar themes from the "Kompetenzzentrum" (I think we would say "centre of excellence") at the University of Treves, given in Germlish by Hannes Greil and Niels Blohnert. Their Kompetenz seems to lie mostly in the retro-digitization of major historical dictionaries and their subsequent integration with corpora of historical materials in German. Most of the dictionaries they discussed were originally encoded in Tustep, and even their conversion to XML was carried out using Tustep. The scale of their operation was impressive, however.
Laurent Romary reminded us of the difference between a dictionary, which maps words to senses, and the onomasiological orientation of a terminological database, which maps concepts in a defined domain to words. He sketched out his ideas for a reformulation of the late lamented TEI terminology chapter, and made a persuasive case for the general usefulness of such a chapter in P5 -- if only someone would draft it.
A lady whose name I forgot to write down reported on the deliberations of a joint ATILF/INIST working group concerning the applicability of the TEI Header and its place in the metadata universe. The group aims to define a minimal header for use in the cataloguing of material "born digital", in particular electronic theses: where have I heard that before? They like the fact that the header permits such richness of metadata (arborescence is the French word), but think it needs to be constrained. She also presented some comparisons (and mappings) with OLAC and Dublin Core, but did not, curiously, mention METS. Apparently the structure of the TEI Header is derived not from AACR2 but from ISO 2709.
Christophe Arnoult, a consultant from a company called Archimed, described an interface developed to search TEI-conformant digital theses, using metadata apparently derived from other online resources. The work had been done using a very small sample, and the only interesting thing I noticed about it was their application of statistical procedures, derived from tag usage, to determine which header elements should be indexed and how. Otherwise he made the familiar point that, in the absence of clear guidance on how to apply it, the TEI Header DTD provides more than enough rope for most conventional retrieval systems to throttle themselves on, and incidentally bemoaned the fact that people don't always package together all the resources needed to handle a document (e.g. where is the DTD? where are the system entities it references?).
Denise Malrieu (Paris X) asked a good question and proposed an impressive range of possible answers. The question was: what kind of metadata might be useful for literary scholars working with narrative texts; and the answers ranged far beyond purely literary-historical-bibliographic perspectives to include internal structural features and audience expectation, as well as summary statistics and their deviance or otherwise from a norm. She also talked about the viability (or otherwise) of automatic tagging of various discourse features -- not only what the computational linguists call "named entity recognition" but also narrative level and status (dialogue, quotation etc). She also mentioned but did not describe ongoing work in establishing appropriate parameters for narrative text classification.
After lunch (up the hill: chicken and chips), we divided into three groups for more focussed discussions on (a) dictionaries, animated by Susanne Alt; (b) metadata, animated by Laurent; and (c) electronic theses, animated by Sylvie Gressillaud. I attended the last of these, and watched as Gautier Poupeau walked through the most relevant features of TEI Lite (using Oxygen), only occasionally muttering about how different things would be in P5. These parallel groups then reported back: the dictionary group had looked at Susanne's (excellent) presentation; the metadata group had been introduced to Roma and derived a simpler DTD for the header.
Last session of the day discussed questions of documentation and translation. Pierre-Yves Duchemin (Enssib, Lyon) briefly summarised the activities of the G5 group, which had started with the ambition of translating the whole of the TEI Guidelines into French. They have now produced a version of the P5 Header chapter and are working on two others. However, they seem to have taken note of the comment I made when visiting Lyon earlier this year about the need to base their translation on the ODD sources rather than on their output. The working relationship between this group and the TEI's own I18n effort, on which Veronika reported next, remains somewhat murky to me. In presenting what is planned for the latter, Veronika provoked a surprisingly heated debate about the wisdom (or otherwise -- most people thought otherwise) of translating the element names in TEI documents.
The second day began with a very impressive presentation from Gautier Poupeau (Ecole des Chartes) about the principles and practice underlying their digital publication programme. This was high-quality publishing, delivered dynamically on the web and conforming to the needs and expectations of scholarly editors.
Also impressive in its way was the presentation from Jouve, a long established French electronic publishing house which is apparently now piloting use of TEI as an interchange and delivery format for its digitization activities. Denis Delvalle explained a workflow they have now introduced in which OCR output, expressed in a detailed proprietary format, is transformed into TEI, and then converted to the customer's specific requirements. Interestingly, Omnimark is still their weapon of choice in this struggle.
And finally, I gave that talk about Xaira again, this time translated into halting French, and augmented with live demonstrations of assorted French corpora running on a borrowed laptop.
After lunch (up the hill again, steak and chips this time), Laurent and Veronika chaired a discussion about next steps and further missionary activity. It was agreed to hold a focussed training session on the digital thesis and header customisation front just before Christmas, and to try to provide more input and discussion material on the website. The idea of a TEI summer school was floated. And there was some discussion about what exactly Veronika should report to the members meeting next week.
Overall, this was a reassuring, even encouraging, workshop. There were over 20 participants, mostly but by no means exclusively local, and from different backgrounds. The problems raised and issues discussed would resonate with any TEI user anywhere in the world; unlike some other such gatherings, however, there was an evident commitment to making the TEI work, by participating in its development and promotion.
The Extreme Markup Languages conference proudly differentiates itself from other XML conferences by (amongst other things) its willingness to foster dissent from XML orthodoxies, its low tolerance of marketing speak, and generally by having higher-than-average geek-appeal. This year's event, held at a curiously decorated downtown hotel in Montréal, certainly lived up to the stated goals. The following biased, ill-informed, and unreliable report should be read in conjunction with the complete online proceedings and indeed the official photo page.
Tommie Usdin (Mulberry), chair of the conference, opened proceedings with what sounded like a tried and tested rehearsal of the conference objectives and guidelines. This year, for the first time, nearly all of the submitted proceedings had not only used the right DTD but validated against it. Some, however, had been hoist with the petard of over-ingenuity — in Tommie's memorable phrase, ‘MathML may be ready for prime time but it's not ready for amateur use’.
James Mason (no, not the movie star, the SGML star) reported some experiments he'd been trying at something called the Y-12 Manufacturing Facility, run for the US National Nuclear Security Administration. This is a very long established military factory complex whose products range from the enriched uranium originally produced for the Manhattan Project, to special widgets for banjos (mercifully, he didn't tell us what the military does with its banjos) and complete off-the-shelf field hospitals. The complexity of its operations, and consequently of its data resources, seemed to Jim to make it a good candidate for explication via a topic map, and why not. His prototype web application demonstrated how the topic map worked as an effective way of navigating complex interdependencies amongst all bicycle-related resources. In answer to that FAQ, ‘what's special about topic maps?’, Jim opined that it was the ability to index and merge different kinds of data. This was fun but not, forgive me, rocket science.
Second plenary of the conference was a double act involving Duane and Renee, two people from a health-industry consultancy called Pensare. Their big idea was that because the terminology used in any specialist field changes over time, the development of ontologies and topic maps derived from them needs to build in a significant usage-monitoring component. (At least, I think that's what it was.) Duane and Renee advocated something alarmingly called ‘stealth knowledge management’ techniques to help address this problem, which on interrogation seemed to mean paying attention to the informal ontologies people actually start using after the expensive formal ontology-creating consultants have left — presumably by retaining said consultants on a permanent basis. It's hard to disagree with Renee's pitch ‘usability isn't an end point, it's an ongoing process’; harder to see what you do about it.
Over lunch, I chatted informally with Terrence Brady from LexisNexis and learned that they use a topic map to navigate the horrors of the system documentation associated with their thousands of different databases, which was reassuring. After lunch, we split into parallel sessions, one (in the Incognita room) being mostly devoted to über-geeky reports on cool hacks, and the other (in the Mont Blanc room) less so. In the former category, I tried, not very successfully, to follow Bryan Thompson explaining how existing HTTP GET (etc.) commands can be used to access XML fragments from large-scale web resources using XPointer and something I'd never heard of called the REST (‘Representational State Transfer’) architecture. This was followed by the first of several reports from German computer science departments: Sebastian Schaffert (Munich) on a new declarative query language called Xcerpt, the key feature of which is an explicit separation of the process of querying resources from that of constructing the results. The claim is that XPath in particular confuses these two in a very unteutonic way, and that separating them again better facilitates rule-chaining and other reasoning processes needed for the semantic web; he's probably right, but I don't think I can explain why.
After tea I returned to the Mont Blanc room to listen to Steve Pepper (Ontopia) announce how topic maps had taken over if not the world then at least the whole of Norway. Work on a Government-funded e-learning project has put Ontopia into the enviable position of being able to define a ‘Semantic Portal’, that is, a group of subject-specific portals each of which exposes its contents by means of a topic map, and which can therefore be accessed as a group, using a single ‘identity mechanism’ to identify when topics can be mapped to one another. Allegedly, the philosophers' stone in question is achieved by the use of Published Subjects (Steve referred to this as the semantic superhighway), and TMRAP, which is a protocol for remote access to both topic maps and — critically — other resources as if they were topic maps; the final piece is a topic map-specific query language now being discussed within ISO. In an access of enthusiasm, Steve said that these constituted the building blocks of ‘seamless knowledge’ and would allow us to achieve all the semantic web promised and more. I don't think I was alone in feeling a little sceptical about this.
Last paper of the day, back in the Incognita suite, was Eric Miller (formerly at OCLC, now at W3C), whose title, abstract, and presentational style all promised rather more than was actually delivered. The subject was mapping between one specific XML schema and an RDF representation using XSLT; the use case was Michael Sperberg-McQueen's sui generis calendar data and Eric's own; a harsh critic might say that since the main purpose of the application discussed was to find a way of scheduling time for the two authors to plan the content of their paper, the lack of content in the paper demonstrated rather well the viability of this approach. However, Eric did do a very good job of re-problematising the issues of semantic mapping which Steve's presentation had somewhat obfuscated with marketing hype.
Day two of the conference was largely devoted to papers about Overlap, a major theme of the conference and also the pretext for some amusing lapel pins being handed out by Patrick Durusau. Andreas Witt (Bielefeld) gave a rather good summary of the current state of knowledge about how to handle concurrent structures, endearing himself to me greatly by demonstrating how little the state of human knowledge on this has advanced since the TEI originally discussed it in the late eighties. The issue seems to be not so much how to choose between the different possible ways of representing overlapping structures in markup (TEI milestones, standoff, MECS, LMNL...) as how on earth to process them effectively. Andreas suggested conversion to a set of Prolog facts (using Python) and gave a good overview of the sorts of meta-relations recoverable from such a re-encoding of multiple layers of annotation.
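(For the curious, here is roughly the kind of re-encoding I understood Andreas to be describing, though the fact shapes, element names and sample text below are my own invention rather than his actual scheme: each annotation layer is reduced to facts over character offsets in the shared text, after which the meta-relations between layers become ordinary Prolog queries.)

```python
# A sketch only: re-encode two annotation layers over the same text as
# Prolog facts keyed by character offsets, so that relations between the
# layers (identity, inclusion, overlap) can be queried afterwards.
import xml.etree.ElementTree as ET

def layer_facts(layer, xml_string):
    """Return Prolog facts element(Layer, Tag, Start, End) for one layer."""
    out = []

    def walk(elem, pos):
        start = pos
        pos += len(elem.text or "")
        for child in elem:
            pos = walk(child, pos)
            pos += len(child.tail or "")
        out.append(f"element({layer}, {elem.tag!r}, {start}, {pos}).")
        return pos

    walk(ET.fromstring(xml_string), 0)
    return out

# Two hypothetical layers annotating the same underlying text.
physical = "<page><line>Of Mans First Disobedience, </line><line>and the Fruit</line></page>"
logical  = "<s><clause>Of Mans First Disobedience, and the Fruit</clause></s>"

for fact in layer_facts("physical", physical) + layer_facts("logical", logical):
    print(fact)
```

With the text reduced to facts like these, checking whether, say, a logical clause crosses a physical line break is just an arithmetic comparison of offsets on the Prolog side.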
Patrick Durusau, wearing a remarkably silly Gandalf hat, covered basically similar ground, but used a simple relational database rather than a Prolog fact base as the engine. He also reported the availability of a sample data set — the first few books of Milton's Paradise Lost marked up with page/line hierarchies from different editions, and also with sentence and clause analyses (but not, regrettably for Miltonists, speaker divisions) — which sounds like a good test for such systems.
In the absence of Steve De Rose, Tommie Usdin briefly summarized his very thorough presentation of — guess what — the various ways available of representing overlapping hierarchies. Steve's paper featured a number of varyingly memorable metaphors, most notably the concept of ‘Trojan milestones’, involving start- and end-pointers on empty versions of otherwise ordinary-looking structural elements. He proposed a formalism called CLIC (for ‘canonical LMNL in XML’).
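(Again for the curious: the trick, as I understood it, looks something like the following sketch, in which an element that would cross the dominant hierarchy is split into empty start and end markers sharing an identifier. The sID/eID attribute names, the sample text and the tiny pairing script are my own illustration, not necessarily Steve's exact syntax.)

```python
# Illustrative only: 'Trojan milestone' style markup, in which an element
# that would overlap the dominant hierarchy is split into empty start and
# end markers sharing an identifier (sID/eID are assumed attribute names).
import re

sample = """<text>
  <p>Tomorrow, and tomorrow, <speech sID="s1"/>and tomorrow,</p>
  <p>Creeps in this petty pace<speech eID="s1"/> from day to day.</p>
</text>"""

# Pair each start marker with its end marker by their shared identifier;
# offsets here are simply character positions within the marked-up string.
starts = {m.group(1): m.start() for m in re.finditer(r'<speech sID="([^"]+)"/>', sample)}
ends   = {m.group(1): m.start() for m in re.finditer(r'<speech eID="([^"]+)"/>', sample)}
for ident, start in starts.items():
    print(f"speech {ident}: characters {start} to {ends[ident]}")
```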
Wendell Piez reminded us that overlapping and multiple concurrent hierarchic structures are not exactly the same problem. He gave an update on the purpose and nature of LMNL (Layered Markup and Annotation Language), which he and Jeni Tennison had presented in 2002. Since then, Alex Czmiel had produced an implementation of this non-XML-based data model, but, as Wendell agreed, the problem of how to process it remained. He also reported the availability of a nice dataset (heavily annotated extracts from Lord of the Rings) and demonstrated how it might be processed by conversion to what is effectively milestone-only (or COCOA-style) markup; several people suggested that LMNL could usefully be simplified by not treating annotations on annotations differently from any other kind of annotation.
After lunch, I fear I paid less attention than I should have to two more German computer scientists: Christian Siefkes (Berlin) presented an algorithm for automagically tidying up markup that is not well formed, which didn't appear to convince anyone. Felix Sasaki, from Bielefeld, discussed rather more fundamental issues about ways of representing markup semantics independently of their instantiation, and thus of mapping between different schemas (I think), the erudition of which appeared to stun everyone into silence.
After tea, there was a rather comical panel featuring several of the available topic-map heads (Patrick Durusau, Lars Marius Garshol, Steve Newcomb, Ann Wrightson). It seems that the ISO work group charged with defining a topic map reference model as a complement to ISO 13250 (which defines the terminology and ISO syntax for topic maps) had met immediately prior to the Markup Conference, and discovered they all disagreed about what that reference model might be. Garshol had the enviable task of reporting this contretemps, which he did in a rather disarming way; the other panellists then proceeded vigorously to disagree with each other to everyone's satisfaction and we all went to dinner.
Day three of the conference opened with Liam Quin, also wearing a very silly hat, who now works for W3C, and thus has time to worry about old chestnuts like the feasibility of standardizing a binary format for XML. A W3C activity chaired by Robin Berjon has been formed, which will collect use cases and report after 12 months on whether or not there is a case for doing the technical work needed to define such a thing. As Liam pointed out, it's XML dogma that all processors should be able to understand all XML documents, which seems to suggest that proposing a standard for ‘islands of binary goop’ (his phrase) would not stop them remaining insular. On the other hand, it's clear that plenty of user communities do need binary formats (Liam talked about delivering PDF fragments and extracts from multi-gigabyte map data to memory-challenged handheld devices), so reducing the number of such competing formats might be advantageous.
Matthew Fuchs' plenary, which followed this, was a rather more technical piece about ways of adding object-oriented-style features to XML processing, in particular the use of the element() function in XSLT 2, which seemed to offer the only way of taking advantage of the inheritance and compositional properties definable for elements in W3C Schema. The XSD-based UBL (Universal Business Language) (which Matthew claimed was on the verge of world domination) uses these as a framework for extensible schemas, but the tools are lacking. He reported, in more detail than I could follow, his success in adapting Ken Holman's stylesheets for UBL to use polymorphic classes. Norm remarked that XSLT 2 (still a moving target) now does more of what is needed for this purpose.
After the coffee break, I paid only cursory attention to a presentation about DITA, which seems to be a kind of topic map application for integration of technical documentation systems, and was mildly nonplussed by a presentation shared by David Birnbaum and David Dubin. The former explained at some length the principles of Russian versification, while the latter gave us a demo of a prolog program which inferred the presence of textual features such as rhyme from markup of phonological features. Allen Renear explained to me over lunch that the point of all this was to demonstrate that the information recoverable from a marked up text was not necessarily exhausted by the markup itself. I found this insight distinctly underwhelming, no doubt because I was worrying about my own presentation.
I presented the current state of the ODD system developed for production of TEI P5, emphasizing its features for modularity, modification, and internationalisation, and was politely received, but (rightly) taken to task for overstating our desire to ditch DTD users. Syd and Julia continued the ODD theme with a discussion of some of the implications of extensive user modification, also outlining some limitations of what can currently be customised in P5, notably the documentation. (We need to find a way of translating the TEI GIs that are referenced in the text.) The day finished with a very nice presentation from Norm Walsh about the way in which DocBook is going down the same righteous path as TEI in its RelaxNG-based modularization.
Simon St Laurent kicked off the last morning of the conference by tackling another of those things we thought XML had disposed of: namely, general entities. He reminded us of all the nice things you can do with them and how XInclude really doesn't hack it. The DTD declaration may be dead in an XML world, but it isn't lying down as long as we need to use general entities. Simon's suggestion is to use a pre-processor to do entity management; he reported progress on (and some details about the innards of) his ents parser (http://simonstl.com/projects/ents/). Since, however, the XML committee has reportedly opined that ‘existing methods suffice’, there doesn't seem to be much future in this.
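(I have not looked inside ents itself, so the following is only a generic illustration of the pre-processing idea rather than Simon's actual syntax: general entities live in a simple table outside the document and are expanded, recursively, before an XML parser ever sees the text. The entity names are invented.)

```python
# A generic sketch of entity pre-processing (not the actual ents syntax):
# general entities are kept in a table outside the document and expanded,
# recursively, before the result is handed to an XML parser.
import re

ENTITIES = {                                       # hypothetical examples
    "press": "Oxford University Press",
    "imprint": "&press;, Great Clarendon Street",  # entities may nest
}

def expand(text, entities, depth=0):
    if depth > 10:                                 # guard against circular references
        raise ValueError("entity references nested too deeply")

    def repl(match):
        name = match.group(1)
        if name not in entities:                   # leave &amp; and friends alone
            return match.group(0)
        return expand(entities[name], entities, depth + 1)

    return re.sub(r"&(\w+);", repl, text)

print(expand("<publisher>&imprint;</publisher>", ENTITIES))
# -> <publisher>Oxford University Press, Great Clarendon Street</publisher>
```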
Another grand old man of the SGML world, Sam Wilmott, original developer of the Omnimark language, gave an entertaining and discursive talk about pattern-matching languages like Icon and SNOBOL; in particular he presented an implementation of pattern-matching primitives in Python. His main point was to remind us that pattern matching is a useful technique not adequately supported by most currently available XML processors.
Nicest surprise of the conference was Eric van der Vlist's talk. He reported a successful project which has provided for DocBook more or less the same customizing functionality as we hope to provide for TEI with the new ODD system, using nothing up his sleeves and no software fancier than an Open Office spreadsheet and some XSLT scripts.
Extreme appears to have a tradition of allowing Michael Sperberg-McQueen to deliver a closing sermon: this year's text was the word model, as in ‘XML has no formal model’, an accusation which Michael triumphantly rebutted, with his customary wit, erudition, and length, not to mention appeals to formal logic theory and corpus evidence. I did feel a little worried, though, that he felt it necessary to ask us whether the solar system modelled the theory that the earth goes round the sun.
This year's mid-term committee meeting of the Association for Literary and Linguistic Computing was branded as a ‘roadmap’ meeting (apparently a current vogue word for meetings devoted to navel gazing) which might hopefully produce some content for a New Directions in Humanities Computing panel session to be held at the close of this year's ALLC-ACH conference. In addition to ALLC committee members, the meeting, organized and chaired by David Robey and Harold Short, was attended by a small number of hand-picked invited experts from the field; I was given the task of presenting a summary of the issues raised during the course of the two day discussion, so had to pay attention. The meeting was held in Pisa, so the dinners were good.
Each of the sessions was supposed to address the same questions (‘Where are we now?’, ‘Where are we headed?’, ‘What should our agenda be?’) from the perspective of the following application areas: ‘linguistics’ (Laslo Hunyadi and Elisabeth Burr); ‘literary studies’ (Paul Fortier and Lisa-Lena Opas-Hanninen); ‘bibliography and textual criticism’ (John Dawson and Wilhelm Ott); ‘libraries and archives’ (Espen Ore, Marilyn Deegan, Susan Hockey); ‘multimedia and performance studies’ (Lorna Hughes and Jean Anderson); ‘methodologies and digital scholarship’ (Willard McCarty and Harold Short).
I took a lot of notes, from which I distilled a final summary that seemed to go down quite well. You can read these notes at 0204pisanotes.txt; here's a briefer and possibly less tactful summary of the chief conclusions I drew from the experience:
However, as I said, the dinners were very good.
At the end of April 99, I spent a week in Granada courtesy of the GCA, where I attended the XML Europe 99 conference. The price of this privilege was that I had to pay attention during all the sessions, and then give a closing address at the end of the conference, after a very sybaritic lunch (see below).
I duly attended, and took assiduous notes (and then lost them all, which is why I don't have a visit report to post). I gave my closing plenary from a series of bullet points on my laptop, which led several people to assume that I had actually prepared a serious talk, and, more seriously, to ask me for copies of the slides afterwards.
I have therefore attempted to confect a more or less plausible version of what I actually said, written out in my best joined-up XML, using the newly minted XML version of the TEI Lite DTD, so you should be able to read it with almost anything....
Maybe I should start with the acronyms...
Presiding genius and visionary of both ECAI and EBTI is Lewis Lancaster from the Department of East Asian Studies at UC Berkeley, but both groups have an extraordinary roster of distinguished and energetic scholars and institutions scattered around the Pacific rim, co-operating in a number of equally impressive digitization projects. This was the third of their international conferences at which I have had the honour of being invited to present a TEI Workshop; previous events were held at Haeinsa Monastery in Korea in 1994, and Otani University in Kyoto in 1996, and the next will be at Berkeley in California next year. The sheer scale of EBTI's interest, embracing canons of ancient texts in Pali, Sanskrit, Tibetan, Korean, Chinese, Japanese, and other less well-known languages, makes for a pretty rigorous test of the TEI's claimed ability to cope with all texts in all languages of all times. Put this together with a characteristically Buddhist atmosphere of mutual tolerance and altruistic scholarship (to say nothing of the lure of their exotic locations) and it is understandable why such invitations are hard to resist.
The 1999 event was hosted by Academia Sinica, a Taiwanese government-funded research institution, and by the Taiwanese Ministry of Education. Academia Sinica has a long history of interest in SGML and XML; one of its chief luminaries is Professor Ching-chun Hsieh, architect of (amongst other things) the modifications to Unicode needed for it to handle the full range of ancient Chinese characters. Indeed, should the TEI decide that it needs a host organization in this part of the world, I think that Academia Sinica would be a natural first choice for a site to approach. Also in Taipei, at the National Taiwan University but in close contact with Academia Sinica, the CBETA project, involving Christian Wittern, inventor of the Kanjibase, is creating a TEI-conformant corpus of the Chinese Tripitaka. Rick Jelliffe, a recent recruit to Academia Sinica, is converting several large structured-text databases to TEI. Both Christian and Rick are enthusiastic promoters of the TEI.
The conference proper was preceded by four tutorial workshops: on Metadata and the Dublin Core, by Helen Jarvis (University of New South Wales); TEI (me); GIS, by Lawrence Crissman (Griffith University); and Image Data, by Jan Glowski (Ohio State). I missed all of these except my own, owing to a pressing need to revise my overheads and drink as much of the Academia Sinica's excellent coffee as I could afford, but the handouts indicate that they all provided a useful technical overview of these four key topics. After the workshops, Rick Jelliffe kindly shepherded a group of us long-noses around the famous night-markets of downtown Taipei, where I bargained furiously for brass ornaments, and ate lots of tasty sausages and freshly cooked chestnuts. My appetite for exploration thus whetted, I must confess to taking a whole day off to wander around Taipei on my own, which was an exhausting but extraordinary experience.
The conference itself spread over four days, with up to four parallel sessions running throughout, each devoted to progress reports and position papers about text creation projects, networking and digital library projects, and GIS applications in Asian studies. Most of the presenters came from Taiwan, Japan, the USA, Australia or Korea, but India, Nepal, Thailand, Russia, Germany, France, Mexico, and the UK were also represented. We were fed at regular intervals with excellent Chinese buns and other goodies, quite apart from plentiful Chinese lunches, and there was much extra-session activity and discussion, involving the exchange of business cards. There was also a splendid Chinese banquet, held in the Taipei Business Club (a building that has to be experienced to be believed), complete with witty meditations on the approaching millennium from a very distinguished Taiwanese professor and a welcoming speech from the newly elected mayor of Taipei, whose election pledges apparently include Internet access for one in three inhabitants of the city, quite apart from finishing the rapid transit system that Taipei desperately needs. I'm not sure what we ate (apart from the shark's fin soup), but it was all very tasty, and plentiful.
An opening plenary by Roy Weber from AT&T gave a taste of some of AT&T's more futuristic products, based on close integration of internet and conventional telephony systems: these included WISL, allowing for management of distributed telephone sales staff (did you know that nearly half of all calls to AT&T are to 800 numbers?); a deluxe form of video-conferencing called "virtual presence"; and the endearingly-titled "cyber fridge" (named after the fridge on which all good US families post family-oriented information), which we will all use to keep in touch when our homes have permanently active internet connexions, and flat display panels are built into refrigerator doors.
One of the two other Brits at the conference, Susan Whitfield, reported progress on the International Dunhuang Project at the British Library, home of the "sponsor a sutra" digitization scheme. Not content with digitizing this extraordinary collection of 20,000 manuscripts and manuscript fragments, all now catalogued, the project is now developing an integrated catalogue of the Stein photo archive. This links extracts from Stein's diary of his turn-of-the-century travels along the Silk Road with the photographs he took; its use of a geographic or spatial metaphor as a way of accessing disparate collections of digital resources seemed genuinely innovative to me and was a recurrent theme of the conference.
Tom Duncan, from the Museum Informatics Project at Berkeley, described some of the technology underlying their project, based on the use of Sybase as a data manager for 50,000 images, delivered via JTIP compression, and also as a means of providing full-text retrieval from the Korean Buddhist text canon. They deploy a traditional thesaurus-style access to these resources via a Sybase implementation of a range of standard thesauri. Howie Lan, also from Berkeley, gave an overview of other Digital Library research activities at Berkeley, but ran out of time before getting down to much real detail of his advertised topic of "multivalent documents" (I think this is mostly about combining digital images, transcriptions, and other views of a document).
I attended a business meeting of ECAI, my curiosity about which had been sparked by an introduction to the power of GIS systems which Ian Johnson kindly gave me during a long conversation about the relative merits of XML and RDBMS for organizing data. ECAI has an active technical group apparently largely driven by said Johnson and other Australians, and an interesting programme of work. The plan appears to be to collect metadata describing any and all datasets relating to Asian cultural materials using a modified (surprise!) set of Dublin Core descriptors, and to provide access to the corresponding distributed datasets via a new desktop mapping system called Timemap. I muttered about OAIS and the Warwick framework, and perhaps more significantly infiltrated a late-night drinking session hosted by the imperturbable Helen Jarvis (who is, incidentally, co-author of the definitive study of Cambodian atrocities) and a frock-coated Matthew Ciolek (of WWW Virtual Library fame). I hope to persuade some of those involved in ECAI's technical group to present a session at this year's DRH conference, since I think we have much to learn from the scale and scope of ECAI, quite apart from its intrinsic interest.
A recurrent theme of the EBTI sessions on ancient Chinese and Japanese texts concerned the difficulty of encoding such texts in a standard way resulting from the Unicode Consortium's decision to "unify" CJK scripts. Shigeki Moro, for example, reported that in encoding the Taisho Tripitaka (which includes both ancient Japanese and Chinese material) they had so far needed 5840 distinct characters, of which 1264 were unavailable in JIS and 338 were missing from Unicode. These so-called Gaiji have to be represented by SGML-like entity references, using numbers assigned by inventories such as Wittern's Kanjibase; when their textbase is converted to XML, they propose to use empty elements for the purpose. In the same session I learned of the existence of the Mojikyo Font Center, an admirable organization which provides free TrueType fonts for over 80,000 Japanese and Chinese characters.
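(By way of illustration only — the entity pattern and the element and attribute names here are invented, not CBETA's or Moro's actual markup — the conversion described amounts to something like this: each Gaiji entity reference in the SGML source is rewritten as an empty element carrying the inventory number, so the XML version stays well formed without pretending the character exists in Unicode.)

```python
# Illustration only: rewrite gaiji entity references (here assumed to look
# like &CB01234;, i.e. a Kanjibase-style number) as empty XML elements, so
# the converted text stays well formed without the missing character.
import re

def gaiji_to_elements(text):
    return re.sub(r'&(CB\d{5});', r'<gaiji ref="\1"/>', text)

sample = 'The character &CB01234; is missing from Unicode.'
print(gaiji_to_elements(sample))
# -> The character <gaiji ref="CB01234"/> is missing from Unicode.
```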
Technical solutions to linguistic diversity were a recurrent theme in the EBTI sessions I attended. Amongst many others, I noted the following: Jost Gippert presented the TITUS (Thesaurus Indogermanischer Text- und Sprachmaterialien) project at the University of Frankfurt, demonstrating alignment of several texts from this immense collection using WordCruncher; the TITUS site also includes an ingenious set of pages for testing the ability of your system to deal with Unicode. Dhanajay Chavan from the Vipassana Research Institute in India demonstrated the new version of the Chattha Sangayana CD, which allows the entire Pali canon to be viewed in Devanagari, Roman, Myanmar, Thai, Sinhalese, Cambodian or Mongolian scripts. Marvin Moser from Lucent Technologies in Chicago presented an input system for Tibetan script, which (I learned) has similar "stacking" problems to classical Arabic. The Asian Classics Input Project, with which the OTA already has links, continues to create and distribute a vast library of Indo-Tibetan literature, recently expanded to include digital images of collections of such material in St Petersburg. Meanwhile, a team from Dongguk University in Korea has been quietly developing their own XML-aware Unicode editor for inputting the Korean Tipitaka, which may well turn out to have wider application, if it ever gets out of Korea; unfortunately I couldn't find out as much about this as I wanted to: the paper describing it is in Korean.
In another tribute to the lure of XML, Charles Muller (Toyo Gakuen University) recounted his experience of converting his Dictionary of East Asian Buddhist Terms into XML, using XSL to render it via Internet Explorer 5. The possibilities thus opened up of linking this dictionary with other resources, such as the Rangjung Yeshe Tibetan-English dictionary of Buddhist culture, attracted much interest. Christian Wittern gave an excellent account of how encoding projects like CBETA could make use of the TEI, which he described in a memorable phrase as being "a travel guide, not a catechism"; he uses UltraEdit, a programmable two-byte-aware editor, to input and tag the Chinese corpus according to a modified TEI Lite, and delivers the result for browsing via Microsoft's free HTML Help wizard (an ingenious idea I propose to steal).
My account necessarily omits whole strands of the conference devoted to broader non-textual collections of cultural artefacts, much discussion of the opportunities afforded by the fusion of networking and digital libraries, and some of the more specialist topics discussed, simply because one cannot be in two places at once. Nevertheless, I hope I have given some sense of the diversity and richness of this exceptional conference.
ICAME, the annual get together for corpus linguists, was held this year in a luxurious Victorian hotel in Newcastle, Co. Down, where the mountains of Mourne come down to the sea (and they do), organized with great panache by John Kirk from Queens University Belfast. The food was outstanding in both quantity and quality, there were magnificent sea views, the weather was splendid, and the conference programme was full of substance and variety. With about a hundred participants, the conference was not too big to allow for plenty of interaction and socializing, even with a fairly crowded programme of sixty formal papers and a dozen posters spread across three days. Also, I counted nearly twenty presentations of one kind or another explicitly derived from work using the BNC, which was (as they say) nice. In what follows, I summarize briefly only those sessions I attended, passing over in silence a few I could not attend either through an inability to be in two places at once, or through conference-fatigue.
For the third time, it had been decided to hold a separate one-day Historical Corpora Workshop as a curtain-raiser to the main event. This was opened by Matti Rissanen's annual survey of projects and resources for diachronic corpus studies now available or underway, in which he also said some nice things about the Oxford Text Archive. Irma Taavitsainen from Helsinki reported on a study of what she termed "metatextual" comments in a corpus of Early English medical texts dated 1375 to 1550: the claim was that both the form and the incidence of such comments as "It is to be seen that" or "I will now demonstrate" change over this period, reflecting other changes from personal to impersonal modes of narrative, or from an oral to a literary style. Terttu Nevalainen, also from Helsinki, presented CEEC, an interesting attempt to apply sociological criteria in the design of a Corpus of Early English Correspondence (1417-1681). Of necessity, the 6000 letters making it up had all been taken from published editions (thus introducing a rat's nest of copyright problems), and thus also to some extent reflect modern editorial practices in such matters as spelling or selection procedure (less than a sixth of the letters are from women, for example). Nevertheless, the corpus is clearly of immense interest as a way of quantifying and detailing hypotheses about such phenomena as the social distribution of you/ye forms, the change from -th to -s verbal forms, or the use of multiple negation.
In the first of two papers about the Lampeter corpus, I rattled through an account of its design and encoding, focussing mostly on the problems of getting from a semi-SGML form of markup to one which actually validated against a TEI DTD, before Claudia Claridge (Chemnitz) reported on some real work done using it as a source of information to substantiate theories about the development of scientific thought: for example, the lexical patterns typifying the empirical approach, and the gradual replacement of active-voice reporting forms by passive constructions over the century that this corpus samples.
After our first experience of a Slieve Donard lunch, Douglas Biber (Arizona), author of a recently published book on corpus linguistics, described an application of factor analysis techniques to 18th-century English. Unlike previous work, in which the parameters studied had been derived from 20th-century material, this work began by constructing a number of specifically 18th-century "dimensions" along which the rates of occurrence of a wide range of linguistic features were shown to cluster in different ways in texts taken from the ARCHER corpus. Different genres are then identifiable by their location along these dimensions.
Anne Curzan (Michigan) reported on a study of the shift from grammatical to natural gender agreement for anaphoric pronouns across the period 1150-1215 in the Helsinki corpus: her study shows that this was by no means a simple transition, involving factors relating both to lexis (some nouns remaining stubbornly gendered) and to syntax (e.g. distance between anaphor and antecedent). Christer Geisler (Uppsala) presented a mass of data relating to postmodified clauses in the tree-banked version of the Helsinki corpus, the purpose of which went straight over my head. Gerry Knowles (Lancaster) was also fairly recondite as to methodology, but the purpose was plain enough: to identify the origins of northern varieties of English by analysis of evidence from dialect maps, rather than on the assumption that they derived from some homogeneous Middle English dialect.
The Historical Workshop closed with a discussion as to whether or not it was A Good Thing, or whether it would be better to roll its contributions in with the rest of ICAME. Since ICAME now officially expands to International Computer Archive of Modern and Medieval English, I felt that it probably should be rolled in; others, perhaps more territorially minded, disagreed, and we all adjourned to the bar.
Next day, ICAME proper began with John Kirk explaining the structure of the event: there would be a series of themed sessions, focussing on major corpus development initiatives: today ICE, tomorrow the BNC, then Birmingham, and so on. He'd also planned a session on dialectology, but the dialectologists had not co-operated. John reminded us that the result of the Northern Ireland Referendum would be due around tea time on Saturday, and that we all had a copy of the consultation document in our packs, so that we could consider whether or not the verbs exercise and discharge were in fact used synonymously. A special excursion in the evening would take us to a secret location near Belfast where the joint Anglo-Irish secretariat would ply us with drink before the Referendum result hopefully removed it (the secretariat, not the drink) from existence.
The ICE session began with a presentation by Bas Aarts and Sean Wallis from UCL of the ICECUP annotation and search software. This is a classy piece of Windows software which allows you to search the completely parsed ICE-GB corpus of one million words in terms of its linguistic annotation. The interface looked a bit like the SARA query builder screen, or the Linguistic Database (LDB) searching software developed by Hans van Halteren for the TOSCA project many years ago -- but on steroids. The idea is that the user defines fuzzy tree fragments -- templates for nodes which are to be matched in the complete syntactic tree. Each node has three parts (function, POS category, feature) and the arcs can be directional or non-directional. The system performed well and looked good, but Sean spent rather a lot of his time explaining how model-based query front ends like this were an improvement on those based on logical expressions, which, whether true or not, was fairly uninteresting to non-computer-scientists. There were promises of an enhanced and extended ICE-GB corpus to be developed later, with the software bundled, presumably contingent on grants coming through. I asked how the system handled contextual queries (since metadata wasn't included within the nodes) but didn't understand the answer. Chuck Meyer, a real linguist, then reported his experience of using the new system, and in particular of comparing its usability with his analyses of the same corpus published in 1996: he focussed, however, on minutiae of the results rather than on usability issues, which meant that I rapidly fell asleep.
I awoke (briefly) to hear Atro Voutilainen and Pasi Tapanainen from Conexor, a Finnish language engineering company newly formed to exploit the runaway success of their English Constraint Grammar parser, probably the most widely used and amongst the most successful of current automatic tagging systems for English. A new version called Englite is now available on the web. Most of the technical detail of the Finnish team's impressive work is available in publications elsewhere, to which he gave several useful pointers. He also demonstrated some Unix tools for processing the parser output, for example to produce new groupings of idiomatic phrases. The tools look good, but you have to be a true believer in dependency grammar to get the best out of them.
Jim Cowie from New Mexico State repeated some fairly well-worn observations about the role of corpora in improving automatic translation, citing some nice examples (how to distinguish storms of ice cream from snowstorms in a Spanish text) and also showing some nice software. He said there was a need to enhance the lexica used by your average MT system with frequency and contextual information, which is as true now as it was when Mike Lesk said it in 1986. Ylva Berglund presented World Wide Web Access to Corpora, a JTAP project from Essex which aims to demonstrate how corpora can be used in language teaching. Her presentation was meticulous but the project remains seriously underwhelming from my point of view. Its future remains unclear and there were some politely critical remarks from the floor about the need for such a project to be a little less self-promoting, and maybe more extensive in its coverage. The afternoon was given over to software demos, which went without a hitch: I demonstrated some of the spiffier bits of SARA, in particular how to use it to examine differing usage patterns for the word pretty as an adverb by men and women; the Zurich team demonstrated their impressive web front end to SARA, which they promised we could distribute for them when it was ready; and I went for a walk along the beach with Tony McEnery. As to the secret reception, we went by bus for miles along twisty Northern Irish roads to a place surrounded with barbed wire which looked rather like a converted school hall, where everyone had a lot to drink, and I explained at least seven times to different people what the current state of the BNC was.
Next morning began with an hour of deeply statistical discussion by Professor Nakamura and colleagues from Tokushima University, concerned with various methods of automatically identifying text-type within the LOB corpus. The rest of the morning was largely devoted to papers reporting work done with the BNC. It began, however, with two serious papers from Douglas Biber and Geoffrey Leech, both derived from their Corpus Grammar work. Biber's was mostly on the ways lexical patterns explain different usage rates in different registers for apparently interchangeable constructions, in this case complement classes (verb+that vs verb+to). Leech's focused on conversation, and proposed some interpretations for the characteristic patterns of difference found amongst the four basic genres analysed in their grammar. Speech is characterized by shared knowledge, an avoidance of elaboration, a plethora of interactions, a need to express emotion and stance, frequent repetition of set phrases and (because of its time-based nature) frequent front- or back-loading of syntactic structures. All of these can be shown to underlie the significantly different syntactic patterns found in speech.
After coffee, Tony McEnery lowered the tone of the proceedings by reporting with relish his investigations into the naughty words used throughout the BNC. He produced a number of examples to demolish various intuitive claims about who swears about what to whom published by one G. Hughes, and also remarked in passing how odd it was that the Norwegians came in for so much invective in the spoken part of the BNC.
Hans Martin Lehmann and Gunnel Tottie from Zurich reported a technique and some results for the automatic retrieval of adverbial relatives (e.g. this is the place where/at which/that/0 he found it) and for investigating their different usage patterns. Automatically retrieving zero-marked relatives is particularly tricky, even for the ingenious Lehmann, involving running the untagged text through the Helsinki parser, to identify potentially appropriate patterns, which are then manually checked. Apparently, the manual search missed 20% of the cases found by the automated process -- but the program was entirely at the mercy of tagging errors in the parser.
Sebastian Hoffmann, also from Zurich, gave a thoughtful presentation about the collocational evidence available from the BNC. Most people believe that native speakers know many complex lexical items; but empirical evidence showed that for relatively infrequent words there were disparities between predicted collocates and those actually attested in the corpus (using log-likelihood measurement of collocational strength). The question of how speakers recognize such combinations involving rare words remains open.
At this point, and in a major departure from ICAME norms, the conference split into two parallel strands. I sat tight for two more presentations on the BNC. Roberta Facchinetti (Verona) presented a study of preferences for will vs going to in speech and writing, somewhat marred by the observation that the written part of the BNC did not reliably distinguish reported speech. Jurgen Gerner (Berlin) discussed the increasing use of they as what should grammatically be a singular pronoun, used anaphorically to refer to everyone or somebody (as in everybody should do what he/they can): there seemed to be a preference for the singular only with the some- form.
After the usual extravagant and irresistible lunch, we settled down for an afternoon of presentations from Birmingham. Ramesh Krishnamurthy briefly described TRACTOR, a research archive for the various corpora and resources being created by and for the TELRI project, which has recently obtained a further three years' funding for its pioneering work in corpus-ifying the languages of Eastern and Central Europe. The archive will include a number of interesting tools, integrated into a single framework, as well as the various corpora already produced or in production by TELRI member sites. Ramesh was at pains to distinguish the project from Multext East (less specific) and Parole (more specific), but TELRI still sounds more of a club than a project to me. Not that there's anything wrong with clubs (especially if they will admit me as a member, which I did my best to persuade them they should).
Geoff Barnbrook made deprecating remarks about the Bank of English and COBUILD, briefly touching on the political fallout from the recent massive "downsizing" of the latter, before giving a fairly anodyne description of a parsing system under development for the definition texts of the COBUILD dictionaries. He was followed by Oliver Mason, who again discussed the notion of lexical gravity in collocation together with some useful techniques for its automatic calculation: this remains no less impressive than when I saw it presented at PALC last year, but does not seem to have advanced much since then. Sue Blackwell discussed how the words look and well are used as markers of discourse function in a range of examples from the Bank of English; Willem Meijs gave a fairly thorough overview of national stereotypes as revealed by mutual information scores, but came to no firm conclusions that I noticed, perhaps because my attention was beginning to flag by this stage of the day.
Revived by tea, I plumped for the parallel session for annotation-dweebs (thus regrettably missing three papers on the slightly unlikely topic of dialect studies in the BNC) and managed to follow quite a lot of Eric Atwell and Clive Souter's discussion of the problems of mapping between the outputs of different parsing systems. As part of the Amalgam project, they had attempted mappings between nine different parsers and (despite the best efforts of the Expert Advisory Group on Language Engineering Standards) concluded that for syntactic analyses at least there simply is no obvious or even non-obvious interlingua. Even something as simple as labelled bracketing is controversial if you happen to be mapping an analysis based on a dependency rather than a phrase structure grammar. This dispiriting news was followed by an interesting paper from Yibin Ni (Singapore), who had been trying to make explicit by tagging some fairly recondite co-referential relations in discourse, but who did not seem to have hit on any notational scheme adequate to the purpose. Final paper in the session was from Geoff Sampson, presenting with characteristic clarity some of the problems in trying to define an annotation scheme that can guarantee consistency of application across corpora of transcribed speech: such common features of speech as repair and truncation wreak havoc with the best designed syntactic tagging schemes, to say nothing of the gaps in an analysis caused by <unclear> elements. The day concluded with a mammoth session in which each of the poster presenters got five minutes to announce themselves, and a reception at which John Kirk tried to explain some salient features of Northern Irish linguistic history. I think.
My notes on the next day begin with Vincent Ooi (Singapore), who promised to explore the different "reality" evidenced by collocational data from Singaporean and Malaysian English, but instead gave what I found to be a rather impressionistic account of some multi-word phrases in English as she is spoke in the Straits. I did however learn that Singaporean lifts are equipped with devices which sound an alarm should anyone be taken short whilst inside: these are rather unimaginatively known as urine detectors.
Martin Wynne (Lancaster) presented the results of an interesting comparison between two part-of-speech taggers: CLAWS, from Lancaster, and QTAG, from Birmingham. The comparison was effected by running both taggers on the same corpus (the written half of the BNC sampler), mapping the results into the EAGLES recommended annotation scheme, and comparing the results. Martin conceded that this was grossly unfair on several counts: the EAGLES scheme is much closer to CLAWS in the first place; CLAWS was trained on the BNC; and in cases of ambiguity CLAWS uses portmanteau tagging, whereas QTAG gives a prioritized list from which they always took the first. He also spent a lot of time saying that he didn't regard the results (in a corpus of over a million words, CLAWS disagreed with the reference scheme 2% of the time, and QTAG about 15%) as proving anything, in which case one couldn't help wondering why they were being presented.
Antoinette Renouf (Liverpool) reported on the next phase of her unit's continuing and (I think) unique work on neologisms. The AVIATOR project, which monitored the appearance of new words in newspaper texts over a four year period, demonstrated that such words tended to have very low frequencies, thus requiring the development of rather rarefied statistical techniques for their detection and classification. Such techniques having been developed in collaboration with Harald Baayen from Nijmegen, Antoinette announced the arrival of a new project called Analysis and Prediction of Innovation in the Lexicon (APRIL), the aim of which is to develop a system of automated classification, accounting qualitatively and quantitatively for the features found in huge amounts of text, and then extrapolate from this to predict the structure of the future lexicon. Allegedly.
Graeme Kennedy (Victoria University of Wellington), author of the other newly published book on corpus linguistics, presented an intriguing paper on Maori borrowings into contemporary New Zealand English, couched largely as a comparison between those reported in the new dictionary of NZ English and those attested in a recently compiled corpus of spoken New Zealand English. Apparently, 77% of the words in the dictionary are not attested in the corpus, while 86% of forms found in the corpus don't appear in the dictionary.
Under the title It's enough funny, man, Anna-Brita Stenström reported on some features of teenage talk familiar to those with teenage daughters (e.g. use of enough and well as adjective pre-modifiers) but maybe not to others. In traditional ICAME fashion, she presented a mass of useful and interesting evidence for these usages, and their typical contexts based on searches of the COLT and the BNC; she had also used the online OED as a source of comparative diachronic information, enabling her to reveal that enough as a premodifier appears about 800 times in OED citations, while well is well frequent in Old English, thus suggesting that teenagers have merely rediscovered an enough established feature of Early Modern English.
Due to other commitments, I had to make my excuses and leave at this point, thus missing amongst other things a report on the Lancaster multimedia corpus of children's writing as well as the closing celebrations. Despite this sacrifice, I still somehow managed to miss my flight home and had to spend an extra night at the Belfast airport hotel. Even this did not dampen my enthusiasm for the event: one of the best of a long series, and a hard act to follow.
Chemnitz is a large town in Saxony (you may have heard of it under the name of Karl-Marx-Stadt, when it was a large town in the DDR) notable for a well-established University, which has just had the good sense to elect as its pro-vice-rector (sounds less implausible in German) Prof. Dr. Josef Schmied, director of the REAL English language centre. Josef is also a corpus nut, and the originator of the Lampeter corpus project, which is why I was pleased to receive an invitation to visit his centre for a two day workshop on language teaching with computers, funded by the British Council in Germany.
I was one of four invited Brits, each of whom was allocated a 90-minute talk on the first day; the audience consisted of about 40 assorted teachers from other German Universities and Language Centres, some of whom were allowed to speak (but only for 20 minutes each) on the second day, before a round-up discussion session.
Before all this began, however, there was a little ceremony in which a Very Important Person from the British Council in Berlin and the Vice Rector himself signed a renewed cultural exchange agreement which would ensure that Chemnitz students might continue to enjoy partnership arrangements with a variety of UK universities in the interests of greater mutual understanding, the spread of the true British language, and so on. We also had a little talk from Frank Frankel, now retired from the Council but retained as a consultant in co-ordination of language centres throughout Germany, and a long-time enthusiast for CALL, as to how its wider use might change the roles of teacher and learner.
Josef Schmied gave a brief overview of current projects at TUC involving language-learning with computers (LLC), of which probably the most interesting is the Internet Grammar project: a web-based system for language learners, teachers, and linguists to collaborate in developing a kind of translator tool for the teaching of English grammar. Unlike the UCL project of the same name, which is monolingual, the Chemnitz project is aimed very specifically at translators' syntactic needs; it also aims to synthesize inductive and deductive procedures to establish rules.
Hilary Nesi, from Warwick, gave a well focussed and informative review of the pedagogic usefulness (or otherwise) of a range of English language dictionaries available on CD-ROM; she also waxed lyrical on the potential classroom usefulness of the humble hand-held dictionary-type device, if only their manufacturers could be induced to give them more linguistically-oriented features instead of overloading them with electronic organizer type facilities.
Geoffrey Leech, from Lancaster, gave a very interesting talk about the grammar of spoken English, derived from a major new work on which he is collaborating with Stig Johansson, Doug Biber, Ed Finegan and others. The corpus underlying this is a 20 million word corpus comprising the spoken part of the BNC and a parallel corpus of American spoken material, collected according to the same principles by Longman, but not (so far as I know) available to anyone outside the project. Besides presenting and analysing an impressive mass of comparative data, with far more detailed examples than I can present here, he made some very cogent suggestions about their potential relevance to language teaching. Well-based frequency information for various syntactic constructions could be used not just to determine actual usage, and thus to provide authentic examples of usage, but also perhaps to help determine which grammatical structures should be taught, and which were most appropriate for different varieties of English, or types of discourse -- both areas in which language learners have most difficulty -- thus (to use Geoff's phrase) "maximizing communicative payoff" both in production and reception.
I gave the standard talk about the BNC again, including a live demonstration of the BNC Sampler, and a little coda stolen from Guy Aston about how the BNC can be used by learners to challenge their teachers, which seemed to go down well.
Gary Motteram from Manchester's Centre for English Language Studies in Education discussed some of the pedagogic issues in using IT for language teaching, based on his extensive experience in running a Masters degree in TESOL, which is now done entirely online, and sounded to my relatively untutored ear like a model case study in IT-based distance learning. He demonstrated, and advocated, the use of Toolbook as a low-price, easy-access tool kit for the construction of powerful courseware.
At the end of this long day, the British Council took us all to a rather unusual performance at the Schauspielhaus. Das Ballhaus is a kind of musical-cum-ballet, presenting fifty years of German history through the device of a dance hall, in which a massive and highly skilled cast waltzed, tangoed, drank, and generally overacted through the terrible twenties, the even worse thirties, the unspeakable forties, and the unbelievably depressing fifties, before succumbing to the degenerate sixties, the cacophonous seventies, and the incomprehensible eighties. The piece lasted three and a half hours, with an interval, and the theatre was unbearably hot, but it was oddly compelling, being full of humour and incident, and quite the most extraordinary display of really bad tailoring I have ever seen. Although entirely non-verbal, and although almost all of the music was entirely familiar and accessible, it also demonstrated how far shared knowledge and experience is essential to understanding a culture -- the second half in particular being full of incidents which the East German contingent found riotously funny or touching but which left the rest of us entirely bewildered.
The next day was given over to a mixed bag of presentations from other happy recipients of the British Council's bounty (in the shape of exchange agreements with UK universities). Regrettably I missed the first of these -- a group from Rostock describing a proposed diachronic newspaper corpus consisting of samples of high, middle, and low brow British newspapers from three centuries. Speakers from Magdeburg's Otto von Guericke University and from various units of Potsdam University described their experiences in setting up IT-based learning resource centres, aka Self Access Centres or SACs, and in using the internet as a source for such things. There were also brief consumer reports on various workshops attended (thanks to British Council funding) at a number of UK institutions. Sound principles (teachers should act as guides not experts; teach how rather than what etc.) were articulated and many favourite web sites cited.
Dr Thomas Bellman, from Leipzig's Hochschule für Technik, Wirtschaft und Kultur, demonstrated some nice software for cataloguing ELT resources: it looked like your standard bibliographic database, but included hot links which would start up bits of courseware directly from the record, screens which enabled you to pre-define subsets of records according to ELT-relevant categories, and buttons for access to a web browser, a wordprocessor, and even a virus checker, all embedded within the same interface. I did not have the heart to murmur "OLE?"
Dr Jurgen Martini (also from the Otto von Guericke University in Magdeburg, but a different part of it) gave a defiantly non-IT-based presentation about the difficulties of teaching cultural studies which kept promising to open up a new area of discussion, but didn't quite do so. The courses he teaches sound very interesting though.
The Workshop's final speaker, Bernd Rueschoff (from Karlsruhe), has the unusual ability to make CALL-theory sound interesting, as well as knowing how to make a very impressive presentation. He covered a very wide range of pedagogic issues, ranging from the need for authenticity in language learning and cultural studies, through the use of technology as a means of enriching rather than replacing the traditional learning environment, to the drive towards active participation in learning rather than passive acquisition of information. He also had some practical advice to offer on how exactly to achieve these motherhoods, based on substantial experience.
I concluded from all this that, like the rest of us, German institutions have both Language Centres and Linguistics Departments, and are not quite sure whether IT has a role to play in either, neither, or both. They are also rather vague about whether corpus linguistics belongs in linguistics or cultural studies, or both, or somewhere else. All of these constituencies were represented, eyeing each other somewhat nervously, at the workshop; and if there wasn't a lot of rapprochement, there was a respectable amount of talk. Recurrent topics in the discussion included the impact of "self-access" as a way of learning rather than simply as a means of doing pre-set assignments, and doubts as to its general applicability for all students; anxiety was also expressed about the possible dangers of allowing students access to politically incorrect notions over the internet, and the need for reliable assessment of networked resources, at which I felt a plug for Humbul would not be out of place.
I stayed on for an extra day after the workshop ended in order to show the Chemnitz team what I had been doing with their Lampeter corpus, and to discuss plans for future collaboration. This took so long that I never got round to doing any respectable amount of sightseeing, which is why this report is regrettably deficient in my reactions to the new Germany, the architectural oddities of Chemnitz (yes, Karl Marx's head is still there, just across the road from McDonalds) and Dresden (wonderful palace -- shame about the traffic system), or the delights of Saxon cookery. But it's probably long enough as it is.
The Scuola Superiore di Lingue Moderne per Interpreti e Traduttori is one of the very small number of University-level institutions in Italy dedicated to the training of interpreters and translators. It has a high national reputation, only partly inherited from its illustrious parent, the University of Bologna, since it is in fact located at Forlì, a pleasant Emilian town some 30 kilometres from Bologna, on the edge of one of Italy's major wine growing areas, and close to Predappio, birthplace of Il Duce.
I was invited to teach a full TEI workshop to a mixed group of about fifteen 3rd and 4th year students, all of whom were fluent in English, and had already had some exposure to computing methods and results by virtue of using the BNC and SARA. The main objectives were to explain some basic markup principles, to give some hands-on experience of other SGML software, to demonstrate the extent to which the usability of a computer corpus is determined by its markup, and to get the students thinking about how they might prepare their own corpora. The workshop consisted of eight 90 minute lectures, three two hour practical sessions, and two discussion sessions, somehow squeezed into six days of fairly concentrated effort.
Before the workshop proper I gave, as a curtain-raiser, an open lecture on the British National Corpus, remarkably similar in content to the one I had given the week before in Lódz, though couched in somewhat different terms. The emphasis was, naturally enough, on how the BNC actually used TEI-like markup. It was followed by these sessions:
In a final round up session, over coffee and cakes, the group voiced some concerns about the relevance of all this to the practical problems they will face as translators: some of them wondered if a TEI corpus would help them preserve private corpora of translated works; others were more interested in the availability of large public corpora like the BNC.
I must add that these students were a real pleasure to teach. They worked very hard to grapple with ideas and methods initially quite unfamiliar to them, and (particularly in the practical sessions) did so with great enthusiasm. And they put up very politely with my hectoring style of teaching too. By the end of the week, they had definitely earned the certificates we handed out.
It's somewhat of a cliché to describe Lódz (pronounced, approximately, Wootch) as the Manchester of Poland: this doesn't mean so much that it is blessed by an excellent football team, as that it is cursed with a major industrial past. At the turn of the century, it was a rich city, built on cloth, and with one of the most prosperous middle class communities east of Berlin. The centre of the town, and the University quarter, still boasts a number of the fine houses they built, some of them now being carefully restored; others still hidden beneath the grey dust and neglect that seems to lie over the whole of this region of Europe. But then, of course, came the thirties, and the appropriation of Poland by invaders, first from the West, and then from the East. Lódz, I learned from my Guidebook, has the dubious honour of having been the first city in modern times to establish a ghetto and the ghost of that absence still haunts the place. Curiously, for there are now no Jews to be seen here, it's the star of David which is now daubed on walls, in contexts where elsewhere one might find the swastika daubed, in association with the swirling iconography of urban rage imitated from the inner cities further west.
The teaching of English is a major growth area in Polish universities; for every student learning Russian or German, there are ten wishing to learn the language of McDonalds and Marks and Spencer, whose emblems now dominate the centre of Warsaw as well as that of High Wycombe. The British Council appears to be playing a major role in satisfying this demand, as witness its funding of this conference, and a number of other educational programmes aimed at secondary and tertiary English language teaching. An unusual and interesting aspect of these programmes is the recognition that access to (and study of) language corpora are of particular importance if there is to be a systematic improvement in the quality of English spoken (and taught) by Poles. Hence the organization of this well-attended international conference.
The four day event was hosted jointly by the British Council and the University of Lódz Institute of English Studies, represented respectively by Susan Maingay and Barbara Lewandowska, with assistance from James Melia, and took place in the University's recently constructed and comfortable conference centre. Each day began with a brisk walk to the University's council chamber for a plenary session, held beneath the stern gaze of assorted dignitaries depicted in funny hats, followed by parallel sessions at the conference centre, combining project reports and research papers from a wide spread of corpus applications and interests. Evenings were given over to relaxation, discussion, and the opportunity to sample some excellent Polish hospitality.
Plenary sessions
The conference began on a high note, with an excellent lecture by Professor Michael Hoey (Liverpool) on the nature of the corpus linguistics enterprise, the questions it asked, and the answers it might provide. Asserting that corpora could be used to provide evidence of lexical patterns, of semantic prosodies, of syntactic patterns, of colligation, and even of text grammars, Hoey proceeded to discuss striking examples from each of these categories, of which I will summarize here only what he called "the drinking problem principle". If someone has difficulty in drinking, it will generally not be referred to as a drinking problem because the more common collocational sense is inappropriate. In the same way, corpus evidence, rather surprisingly, demonstrates that following a possessive adjective (my, our, his etc) the plural form reasons is always preferred to the singular reason when the intended sense is "cause" (our reasons for doing this...) rather than "rationality" (to lose one's reason).
The second plenary speaker was Tony McEnery (Lancaster Univ) introducing the notion of what he called multimedia corpora. He stressed the need to introduce visual information as a context for understanding verbal material, and showed us a few pages from the corpus of children's writing and drawings currently being created at Lancaster, which will be distributed freely over the web.
The final plenary speaker was Patrick Hanks (OUP) who gave a bravura demonstration of the problems that corpora give lexicographers. What exactly is it that lexicographers do when they go through the lines of a concordance assigning each one to some sense or another of a word? and how on earth do they do it? Hanks has published several papers on this, and worked with some of the best computational names in the business (Atkins, Fillmore..) but he's honest enough to say he still has no definitive answers. His presentation focussed on a few interesting examples: the word "baked" for example, which seems to require certain lexical classes (not just edible food stuff, but specific categories of them) and the word .
Parallel sessions
In the nature of things, I couldn't attend all of these. The exigencies of time and space meant that I had to choose to miss presentations from inter alia: Bengt Altenberg (Lund), Wieslaw Babik (Krakow), Michael Barlow (Rice), Simon Botley (Lancaster), Igor Burkhanov (Rzeszów), Doug Coleman (Toledo), Martha Jones (Nottingham), Dorothy Kenny (Dublin), Bernhard Kettermann (Graz), Przemyslaw Kaszubski (Poznan), Anne Lawson (Birmingham), Barbara Lewandowska-Tomaszczyk (Lódz) , Belinda Maia (Oporto), Michal Pawica (Kraków), Margherita Ulrych (Trieste), and Maciej Widawski (Gdansk), whose names I list to give at least some indication of the geographical spread of participants. What follows by way of review should therefore be regarded only as a sample of the concerns raised and materials discussed -- though not, I hope, too unrepresentative a one.
Stig Johansson (Oslo) reported on the progress of the English-Norwegian parallel corpus project, now expanding to include up to a hundred texts in other European languages, notably German, Dutch, and Portuguese. The well-thought-out design of this corpus allows comparison both between texts translated from language A into language B, and the reverse, although it is not always easy to find sufficient texts to do this (there are far more English-Norwegian translations, for example, than the reverse, simply because it is hard to find comparable texts). In a separate evening session, he described some of the software developed for the project, in particular the automatic alignment procedure developed at Bergen by Knut Hofland and a Windows retrieval package developed by Jarle Eberling at Oslo.
Michael Rundell (Longman) gave a pleasant presentation about corpus evidence for the British fondness for understatement, in particular the phrase not exactly and similar ironic uses. It's probably not too controversial to say that this was not exactly unfamiliar to those who had heard him speak at TALC last year, but none the worse for that.
Guy Aston (Forlì) contrasted the pedagogic usefulness of large corpora such as Cobuild or the BNC with that of small specialized corpora drawn from specific text types, such as a 14,000 word "hepatitis corpus" in use at Forlì. Small corpora are more easily managed by the language learner, and their lexis can be studied in extenso; learners can use them to practise their inductive powers, hypothesizing lexical, colligational or collocational patterns, which may or may not be confirmed by examination of large reference corpora. In this respect, it is possible to get the best of both worlds.
Akua Anokye (Toledo) described some interpretative problems in analyzing the transcriptions of Afro-American folk narratives recorded on aluminum disk by Hurston, Lomax et al in the late twenties and now stored in the Library of Congress. She had transcribed some of these recordings, using her own scheme, and presented a largely impressionistic account of the interplay between their phonological and contextual features.
Sylvia Shaw (Middlesex) described how corpora had influenced the production and format of the third edition of Longman's Dictionary of Contemporary English. This had included both the use of frequency information, derived from large corpora, and particular attention to typical language learner errors, derived from corpora of language learners' production. Thus the student can be advised, for example, of the range of things to which words such as beautiful are typically applied by native speakers, which is much smaller than that used by language learners, and given advice on how to choose between near synonyms such as error and mistake.
Raphael Salkie (Brighton) quoted a number of French writers' opinions about the differences between French and English, intended to help translators as rules of thumb (e.g. ). His paper reported some interesting work on the extent to which these perceptions were borne out by corpus evidence, and gave a brief overview of the Intersect project.
Chris Tribble (Lancaster, Reading, Warsaw), picking up Guy Aston's paper on the benefits of small corpora, suggested that for classroom use, small corpora were of more use than large ones, as well as being more accessible. His paper reported on some experiments using Microsoft's Encarta as a language resource (as well as a source of factual information), noting that the type of language it contains is very similar to that which language learners are typically required to produce: brief factual articles.
Oliver Mason (né Jakobs) from Birmingham presented what was in many ways an exemplary research report about some very interesting work he has been doing on identifying statistically the size of collocation spans, by calculating the type-token ratio of the words appearing in each position to the left and to the right of the keyword. The results are striking: different node words exhibit strikingly different patterns of influence on their neighbours, giving a visual hint of the extent to which they construct fixed phrases, for which he proposed the term lexical gravity.
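For those who like to see such things spelled out, the following Python fragment is my own minimal sketch of the measure as I understood it, not Mason's implementation: for each position within a span either side of a node word, it computes the type-token ratio of the words found there; low ratios suggest a slot the node word constrains tightly.

    # Positional type-token ratios around a node word (illustration only).
    from collections import defaultdict

    def positional_type_token_ratios(tokens, node, span=4):
        slots = defaultdict(list)               # offset -> words seen at that offset
        for i, tok in enumerate(tokens):
            if tok.lower() != node:
                continue
            for off in range(-span, span + 1):
                if off == 0:
                    continue
                j = i + off
                if 0 <= j < len(tokens):
                    slots[off].append(tokens[j].lower())
        return {
            off: len(set(words)) / len(words)
            for off, words in sorted(slots.items())
            if words
        }

    if __name__ == "__main__":
        text = "in spite of the rain we went out in spite of everything".split()
        # the slots immediately before and after "spite" score low: a fixed phrase
        print(positional_type_token_ratios(text, "spite", span=2))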
Sylvia Scheur (Poznan) discussed several aspects of her research into the pronunciation difficulties faced by Polish-speaking learners of English. She had recorded 17 Polish language-learners reading the same English texts at the start of their course and a year later, and was transcribing these phonetically (using the SAMPA writing system for the International Phonetic Alphabet, developed by John Wells). Students were also asked to assess their own performance, which produced some interesting comments about their perceptions of English prosody.
I gave that paper about the BNC and SARA again, spiced up somewhat for this audience by the addition of hints about forthcoming availability of the corpus outside the EU, and also with the first ever live demonstration of the sampler corpus.
Philip King (Birmingham) gave an overview of the Lingua multilingual parallel concordancing project, now in its second phase of existence, with a particular focus on some of the pedagogic software being developed at Birmingham for its exploitation, its use in generating course material for student use, and the ability to browse and search parallel corpora. It would be interesting to compare the methods and results of this project with those of the ENPC described by Stig Johansson, but no-one had the temerity to do so, in public at least.
Social events
Like other academic conferences this one was oiled by a couple of very pleasant evenings, drinking, dining, and discussing. Particularly memorable was an evening concert of baroque music by Telemann and Handel played on original instruments, followed by a splendid buffet dinner. This was held in one of the aforementioned bourgeois palaces and much appreciated by all. After a couple of glasses of very drinkable Hungarian wine, McEnery, Ketteman and I were able successfully to impress all the help we needed to make TALC 98 (hopefully, to be held in Oxford next July) a reality.
I also took the opportunity of a free Sunday at the end of the conference to visit Arkadia: this is an ornamental garden full of picturesque gothic ruins, classical statues, and the like, originally laid out by the local aristocracy in the 18th century, and now a pleasant enough place for a Sunday afternoon stroll. Getting there involved a detailed and educational study of Polish regional railways and bus services, to say nothing of the refreshment room at Lowicz railway station (which I won't).
I spent ten days on the other side of the Atlantic last week, much of it in the frozen midwest, quite a lot of it hanging around Detroit Metro airport, waiting for planes which were either not due for hours, or which were, but had been delayed for similar periods of time. The only place I found there capable of serving a decent cup of coffee had no spot to plug in my laptop, and the food doesn't bear thinking about. Fortunately, when not in airports I was well looked after, and well fed, by a number of friendly natives. I was rained on in snowy Indiana, and saw some nice tourist sights in foggy San Francisco. For more details, follow the links below...
For those who don't know it, the Linguist list is a well-established ListServ list, read by tens of thousands of linguists of all varieties world wide. The list is hosted by Eastern Michigan University, a small (by American standards) campus just up the road from the University of Michigan. Ever anxious to improve and extend it, Linguist's presiding deities Helen Dry and Antony Aristar last year applied for and received a grant from the American National Science Foundation to move the list to a new plateau of serviceability, including bibliographies and source texts, abstracts, and all manner of wonderful things. To their credit, they decided that SGML was just the ticket for this purpose, and solicited consultancy and advice from their readership. Somewhat to my surprise, my response to this request led to my being invited last autumn to come and discuss the project with them and their colleagues, which I duly did last week.
The consultation took place in the Aristar-Drys' rather splendid architect-designed home in snow-bound Ypsilanti in the shape of a two day informal workshop. The three moderators of Linguist (Aristar, Dry, and Dan Seely) were present, along with John Remmers, their technical editor, and several of the graduate `associate editors' who currently have the thankless task of editing and controlling the hundreds of messages which arrive for the list every day. Also participating were another SGML consultant, in the shape of Gary Simons of the Summer Institute of Linguistics, with whom I have had the honour of working previously on TEI matters, and Ron Reck, billed as Linguist technical editor, a former EMU linguistics graduate, now moving to a better paid job in Washington.
We began with an overview of the software they currently use. The whole operation is hosted on the University's unix system, but is about to move to its own Digital Unix box. The software used, developed by Remmers, is a home-grown but functional mixture of shell scripts and C programs, embedded within the Unix mailer. Considerable care is needed to ensure that files are correctly processed, keyworded, and proof read, to say nothing of vetting for obnoxious content, but the procedures for doing all this are well understood and documented. Remmers then presented a good overview of where they would like to go in developing a new system, and some suggestions about how they might get there. The chief problem areas highlighted were in character handling, the need for various format conversions (the list is currently distributed both as email and in HTML from its web site), and in making searching more sensitive, i.e. context aware. I gave, without benefit of overheads, a brief sermon on the advantages of SGML, focussing in particular on these issues, and on the desirability of document analysis before proceeding much further, and we then all adjourned for a rather tasty Japanese dinner in Ann Arbor.
The following morning, Remmers came to take Simons and me out to a decent breakfast, over which we were able to get a little more specific about his currently proposed SGML dtd, the chief drawback of which is its lack of any structure within the body. Simons then gave an illuminating overview of the key issues in making the world's character sets uniformly accessible by today's computing systems. He concluded, unsurprisingly, that the only workable long term answer is to use Unicode/ISO 10646, but not without having given at least as much time as they deserved to eight other possible approaches, and giving as much technical detail as anyone might reasonably require about the current state of such mysteries as the Basic Multilingual Plane, how to shoehorn Unicode characters into Windows code points, whether and where to use Unicode entity references, and how to ship UCS-2 documents as Mime attachments.
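By way of illustration only (these are my toy examples, not Simons's slides), the following Python fragment pushes one short multilingual string through some of the encodings under discussion; for characters in the Basic Multilingual Plane, UCS-2 behaves much like UTF-16 without surrogates.

    # The same string in several encodings: Unicode-based encodings cope,
    # a single-byte legacy character set does not.
    sample = "Łódź, Straße, façade"

    for name in ("utf-8", "utf-16-be", "iso-8859-1"):
        try:
            data = sample.encode(name)
            print(f"{name:12} {len(data):3} bytes")
        except UnicodeEncodeError as err:
            print(f"{name:12} cannot represent the string: {err.reason}")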
In a final discussion session, the group reviewed the state of affairs, and started trying to identify what kinds of document their DTD should handle, which services their new system would be expected to provide, and what software would be appropriate. I tried, not very successfully, to get the concepts of document analysis across, and rather more successfully to persuade them that individual messages should be their primary data objects. This would enable them to produce personalized issues and automatically retrieved summaries in addition to email and HTML versions of the current "Linguist Issue", particularly given the availability of more detailed and accurate indexing of the message contents than exists at present. There was some discussion of what should go into these indexes, and how it should be controlled. A consensus was established in favour of Open Text 5 (which they have been promised for a paltry $1500) as a suitable indexing tool.
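To make the "message as primary data object" idea concrete, here is a small Python sketch of my own devising: it splits a digest-style issue into individual messages and builds a crude keyword index over them. The issue format and field names are hypothetical and bear no necessary resemblance to the Linguist list's actual conventions.

    # Split a digest issue into messages and index them (illustration only).
    import re
    from collections import defaultdict

    RAW_ISSUE = """\
    From: alice@example.edu
    Subject: Query: corpus annotation

    Does anyone know of annotated spoken corpora?

    From: bob@example.org
    Subject: Conf: workshop on parsing

    Call for papers for a parsing workshop.
    """

    def split_messages(issue_text):
        """Treat each 'From:' header as the start of a new message."""
        parts = re.split(r"\n(?=\s*From: )", issue_text.strip())
        return [p.strip() for p in parts if p.strip()]

    def build_index(messages):
        index = defaultdict(set)                 # word -> set of message numbers
        for n, msg in enumerate(messages):
            for word in re.findall(r"[a-z]+", msg.lower()):
                index[word].add(n)
        return index

    messages = split_messages(RAW_ISSUE)
    index = build_index(messages)
    print(sorted(index["parsing"]))              # which messages mention parsing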
The project has only just begun, and its grant is small. However, this is an exciting time to be beginning such a project, with the announcement of XML and other relevant developments. It has a daunting task ahead of it, given the need to maintain the current level of service, but the team seems both highly motivated and technically competent. I will be watching it develop with interest.
The Humanities Text Initiative at the University of Michigan is an organization not entirely unlike our very own Humanities Computing Unit, but based within the University library and rather more solidly funded. It is run by John Price-Wilkin, with whom the Oxford Text Archive has long co-operated. I arrived unannounced in the evening and was pleasantly surprised to find a room in which a pair of industrious graduate students were tending the electronic production line that is the HTI American Verse project. At one end were real books, each with a little card marking its current status; at the other, proof-read and consistently TEI encoded versions of them: on the table was a well thumbed copy of TEI P3. The OCR software, called TypeReader, looked rather sophisticated to me: it is used to produce a markup-free text, which is proof-read carefully against the original. Markup is introduced at the next stage, using Author/Editor. Texts are loaded into Panorama, and tagged printout is proof-read again. I did not get any figures about the throughput of this part of the highly professional Digital Library Production Service, but it is clearly doing well enough to keep the HTI's web site busy, as well as servicing a number of other products. It is interesting to note the shift in emphasis away from collection and research support towards content creation and service provision.
Price-Wilkin had a busy schedule the next day, during which he somehow made time to give me a brief overview of the organization, and describe in full detail some work he has been doing with Dr Johnson's dictionary. He also allowed me all too brief access to the riches of their collection, and the use of his office for an unrelated TEI telephone conversation before taking me to lunch, where I was introduced to the dubious delights of the chilli dog.
A number of different approaches have been taken at Michigan to the perennial problem of providing good quality humanities computing support. Initially, they had set up a `collaboratory' -- a personalized computing facility which particular named scholars could apply to use for suitable projects over a fixed period --- but this had not been altogether successful. Take-up and productivity of the HTI, which included an open access facility-rich room, staffed by experts in a number of different fields, and engaged in resource creation for specific projects, were much greater. As well as superior OCR and tagging services, they offer a Kontron camera (a Progress 3012) for high quality image capture, free for internal use.
Other facilities available to library users include a `Knowledge Navigation Center' specifically developed to provide assistance on IT to humanities students, faculty and staff: this has a number of machines and support staff. The Office of Instructional Technology also provides a service aimed at developing and adapting teaching and learning software to faculty's needs: I formed the impression that this unit had a poor record of co-operation with other more resource-focussed centres.
The HTI forms part of the University's Digital Library Production Services, along with a number of complementary and well-funded projects: MESL (the Museum Educational Site Licensing Project), the venerable TULIP electronic journals project (currently co-operating with JSTOR) and a new service known as the SGML Server Program, the object of which is to sell SGML consultancy and expertise in setting up electronic text centres to other universities at a knock-down price. Income thus generated is intended to go into the content creation activities. The DLPS reports to a board on which the Library, the Information Technology Division (a large Unix based central computing service), and CAEN (another large Unix-based central computing service) are all represented. (I suspect that there is an interesting history to be written about how Michigan came to have two competing computing services). The HTI gets funding from the University's Press, the Library, the Office of the President (of the University) as well as from grant giving bodies like NEH and NSF, which it uses to carry out prestigious content creation projects like the Making of America (which will produce 1.5 million page images, combined with OCR'd text in TEI format). It currently has six full time staff and its activities are expanding to include not only images (unlike the Library of Congress, which delivers TIFF format page images, theirs are dynamically converted to GIF for web delivery) but also music and film (as a natural outgrowth of the MESL project); they are even contemplating numeric data (for the ICPSR, no less).
Michigan has an HFS system like ours, run by CAEN, which the DLPS is planning to use for large datasets, such as GIS data. Otherwise they rely on their own large-scale RAID system, which gives them six 72 gigabyte disks, mounted on a Sun server. They use a product called DLT for backup. They use their own software to interface web users with the underlying text search engine, which is Open Text release 5; this software is also supported by the SGML Server Program mentioned above.
I could have spent a lot longer in Ann Arbor, had my itinerary permitted, since I think we have a lot to learn from their successes. I also discussed with Price-Wilkin the idea of organizing a TEI-header users-only workshop, which he seemed to think a good idea; he also suggested that the TEI really should get into the business of selling consultancy services, which I did not quarrel with since he was buying the dinner (and it was a very good one). Afterwards, I was re-united with an old acquaintance, Professor Richard W. Bailey, whom I last saw in the late seventies and with whom I spent a very entertaining evening reminding each other of past follies and embarrassments, cut short only by my need to get up early the following day for a flight to Indianapolis.
I took a side trip to Indiana University's new Music Library, on 17-18 January. This is home to the Variations project, a state-of-the-art real-time music delivery system, which reportedly makes use of IBM's Digital Library products.
The project director, David Fenske, introduced me to the systems support person, Jon Dunn, and the librarian responsible for the digitization process, Constance Mayer, all of whom kindly gave up a lot of time to making sure I saw as much as possible during my brief visit. The goal of the project is to digitize substantial quantities of the library's holdings of recorded music, held on CD, LP, and cassette. Real-time delivery of digital sound currently requires non standard disk access and storage methods, and the system at Bloomington relies on an IBM proprietary method known as TigerShark (apparently because data is `striped' across the media, rather than being stored in discrete blocks) for storage, and on an internal ATM network for delivery.
Although the project is not currently using the IBM Digital Library (henceforth, DL) software, Fenske assured me that several components of the software were already in use, while others would soon be upgraded to a state where they would deliver what was needed. The principal gap was in support for the aforementioned TigerShark file system, which could not be accessed via the current ADSM software. Consequently, at present, Indiana are using ADSM for backup, and as a repository only. Audio files are manually moved between the ADSM and a separate `playback' server. This is a conventional `pull' type unix server, running under AIX, with its own filestore which is accessed via a product called Multimedia Server. It is planned to replace this with a new IBM product called Video Charger, due for release in September, which will interface directly with ADSM. Mention was also made of a similar product called Media Streamer, designed to handle real time audio broadcasting. For our purposes, the most relevant forthcoming component of the Digital Library system will be Visual Info. This is a database product designed for storage and retrieval of images and text which sits on top of the well established (not to say venerable) db2 engine. It will also have some extras called db2 `extenders' because they extend the searching capabilities of db2, apparently using black magic to do things like searching images by colour, shape etc.
For metadata and cataloguing purposes, Indiana plans to replace its current OPAC system (NOTIS) with a new one, currently being developed by a company called Ameritech Library Services. This OPAC is shared by nine distinct campuses, so the upgrade will be slow. The new product, called Horizon, will interface directly with Digital Library (according to an agreement between IBM and ALS). It is designed for inter-operability, and has Z39.50 support. Fenske pointed out that Z39.50 does not address the real-time networking issues critical to their needs, which made integration with the Digital Library product correspondingly more important. We discussed the relative wisdom of rolling your own solution versus waiting while manufacturers roll one for you: Fenske said that his concern was always to make sure his needs were addressed by vendors' announced and supported product plans. He had found IBM very responsive, and was confident in their ability to deliver the required functionality in the long run. He is currently working very closely with the company, and will be working part-time as a consultant at the Santa Teresa laboratory where DL development is based.
Bloomington is nationally and internationally famous for the quality of its music teaching: out of approximately 7000 applications, they admit about 350 a year; their current enrollment is about 1500 music ``majors'' and a hundred part-time students. The music library occupies four floors of a new purpose-built (and privately funded) Performing Arts Center, with several large reading rooms, and purpose-built stacks. Round the edges of the reading rooms I saw rows of carrels, some with outmoded (and under-used) analog playback systems, others with newer equipment (typically a MAC or PC workstation, with a Kurzweil keyboard, MIDI, and playback systems). In total there are 70 such workstations, of which 30 can currently access the Variations system. (This is partly because the current Variations software runs only on PCs: it's planned to switch over to NT4 as soon as ATM drivers for NT are available). The front end software is Netscape: a page specifies the lists of musical works allocated to particular courses, with direct links to the digitized audio itself, where this is available. Clicking on one of these activates the Variations player, which is configured as a Netscape helper application. The player allows you to select particular tracks from the work, randomly and with impressively smooth and rapid access. The sound quality is comparable with what you would expect from a good domestic CD-player over headphones. Fenske told me that their server logged about 17,000 audio file deliveries per month.
In addition to these carrels, the library has three seminar rooms, and a LARC-like room, equipped with about 50 Macs and PCs, which is run by the central University Computing Service. These facilities are all linked to the ATM network, and so can all access the Variations system provided that they are able to run its software. There are fifteen full time library staff and two full time technicians.
I then visited the office where the digitization and cataloguing is actually carried out (this also doubles as the control room for a small recording studio). Digitization is done largely by part time student labour, under Mayer's direction. The procedure is only partially automated, needing a certain amount of manual intervention. Up to twelve hours of music get processed each day: limiting factors are the time taken to compress the WAV files to MPEG (this is done in batch overnight) and the amount of disk space available. Operators have to check that space is available to hold the material they are creating, and also to create manually a ``tracks file'' which records title and composer information for each track digitized. This is taken directly from the CD or LP, rather than from the existing catalogue records, for a number of reasons, ranging from variability in the level of cataloguing details actually available (MARC cataloguing practice for published music varies greatly in what gets included, and where), to political and programming difficulties in getting direct access to the centrally-maintained catalogue records. Consequently, the operators' keyboarding instructions have to specify exactly how proper names of composers should be entered --- there is no other authority control --- and include the depressing note that all foreign accented characters should be ignored.
Five cataloguers are employed to enter the data into a simple line-mode shell script, taking up to 30 minutes per CD. The only automated part of the process appears to be the reading and detection of track duration times directly from the CD: there is no workflow program to check, for example, that the catalogue records are correctly updated. A filename is allocated to each piece of music, derived from its identifier in the NOTIS system. When the whole piece has been digitized and is ready for compression, it is backed up to ADSM and a catalogue record update is requested, apparently by hand. This will (eventually) insert an entry in the MARC 856 field, containing the URL at which the digitized track will be accessible, assuming that it is available from the server. For example, the piece of music with NOTIS identifier ABE7278 will gain an 856 field containing something like the following:
$1 /.../cgi-bin/var/access?ABE7278 $2 http
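For illustration only, here is a tiny Python sketch of how such an entry might be generated from a NOTIS identifier. The base path is my assumption (the full path is elided in the example above), and the function says nothing about Indiana's actual cataloguing scripts.

    # Hypothetical generation of an 856-style entry from a NOTIS identifier.
    BASE = "/cgi-bin/var/access"   # assumed path fragment; the real path is elided above

    def marc_856(notis_id, base=BASE):
        return f"$1 {base}?{notis_id} $2 http"

    print(marc_856("ABE7278"))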
I had some private conversation with Jon Dunn, who provided some more technical details of the present and future system. In future he expected that the OPAC would link to detailed metadata held on a library server, which would in turn point to digital objects held on an object server. As noted, their present system relied on AIX and the Multimedia server, using the ADSM only for backup and as an archival store (for both WAV and MPEG versions of the files). The glue holding this together was all developed in house: in particular, the Variations player was written in Visual C++ and the web interface material written in Perl. A half FTE post had just been established to port the Player to a 32 bit environment. There had been no particular planning exercise or formal acceptance procedure.
At present, Jon said, the system only has to handle 30 concurrent accesses over the ATM network, but it should be able to handle up to 100 such, if the number of workstations expands. The playback server is an IBM RS6000 series, specifically a 59H uniprocessor (apparently, Multimedia Server does not work on multiprocessor environments) with 512 Mb of main memory, running AIX 4.1.1 and ADSM 2.1. The server addresses a total of 120 GB of SCSI disk storage, and manages the ``striping'' (this precludes using RAID to increase the amount of available disk space). The disks are accessible by NFS and the in house ATM network used for delivery is connected to the campus wide area network, so in principle the system could be accessed from anywhere on campus. However, since the campus network is a conventional FDDI ring running at 1200 Mbits/second, and most buildings have an ethernet running at only 10 Mb/sec this is not regarded as a practical possibility. The campus network is run by the University Computing Service which is reluctant to risk degrading performance in this way.
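A back-of-envelope calculation (mine, not Dunn's) makes the point: assuming a per-listener bitrate of roughly CD-quality MPEG audio, even the current 30 concurrent streams come close to saturating a shared 10 Mbit/s ethernet, and 100 would comfortably exceed it.

    # Rough bandwidth check. The per-stream bitrate is my assumption, not a
    # figure quoted at Indiana; the 10 Mbit/s ethernet figure is from the text.
    STREAM_KBIT_S = 256          # assumed MPEG audio bitrate per listener
    ETHERNET_MBIT_S = 10         # shared building ethernet, as quoted

    for listeners in (30, 100):
        demand = listeners * STREAM_KBIT_S / 1000     # Mbit/s
        print(f"{listeners:3} streams ~ {demand:5.1f} Mbit/s "
              f"({demand / ETHERNET_MBIT_S:.0%} of a 10 Mbit/s ethernet)")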
As noted above, after about ten hours of music has been digitized, the holding area is full and all data has to be compressed. Up to 120 Gb of data can be held in the playback area, but they have digitized ``much more'' than that. With the new Video Charger software, transfer between ADSM and the playback area can be automated, but at present it has to be done manually, on the basis of course requirements. The new DL will also include a defined API for applications such as the Variations player and its associated web pages, to which Dunn expects to be writing code. He mentioned the existence of a detailed IBM technical paper describing the internals of the Video Charger software.
In a final discussion, Fenske gave me some other US contacts which might be worth following up. These are all members of something called the ``Renaissance Consortium'' --- a club of early DL users run by IBM, loosely under the aegis of Michel Bizy. (spelling?)
Unfortunately, bad weather meant that I had to leave Bloomington earlier than planned, but I don't think I would have learned a great deal more by staying. It seems clear to me that we should continue to be deeply skeptical about the claimed abilities of the IBM digital library software.
I did manage to visit the LETRS electronic library project while I was there: this is another TEI based project, firmly located in the University main library, run by Perry Willet. LETRS has adopted similar solutions to the provision of digital texts online as HTI, though on a smaller scale. It is a joint venture of the library and computing services, with five part-time graduate consultants and one full time technical consultant. LETRS provides access to a number of networked CDs, and the OT5 software used at Michigan. They have also created a sizeable amount of TEI-conformant text as part of an ongoing Victorian Women Writers Project.
I cut short my visit to Indiana (which was now getting seriously cold) in order to get back to Detroit before too many of the planes out of it had been cancelled by bad weather. This turned out to be a Good Idea, since the one I eventually caught to San Francisco left two hours late, and managed to lose my luggage in the process. But California is still a warmer place to be, even when you have no clean shirts and only a substitute toothbrush.
I had been invited to the Library at Berkeley by a campus-wide working group, jointly sponsored by the Townsend Center for the Humanities and the UCB Library, which has a campus-wide remit to promote interest and information on computer usage across the Humanities. Amongst other interested (and interesting) parties, this brings together the Library's own Electronic Text Unit, the Bancroft Library's Technical Services Division, the Library's Photographic Service, and several academic departments, notably those of English and Linguistics. Berkeley Library is, of course, the home of EAD: the Encoded Archival Description, now being developed at the Library of Congress, as well as many other good things. My talk, on the relationship between scholarship and the digital library, attracted a gratifyingly large audience, many of whom did not appear to know each other, and nearly all of whom managed to survive until the coffee break. I had been asked to stimulate enthusiasm for the TEI, and did my best: for at least an hour.
After a pleasant lunch with my hosts (Daniel Pitti, Tim Hoyer, and others), and a little light tourism (I have now been up to the top of Berkeley's Campanile and admired as much of the view as the fog would permit) we reconvened for a discussion session, which like many such ranged comfortably over a number of issues without settling any of them. I did learn however that the Electronic Text Unit had developed a dtd for managing digital images along with their transcriptions, called the EBind dtd, and that according to Prof Lewis Lancaster, double keying is a far more cost effective method of data capture than OCR. (Lancaster also brought me up to date on the activities of the Electronic Buddhist Text Initiative, which is still going strong.) In the evening, I was then taken out for an excellent Japanese dinner by the Linguistics Department, in the shape of Prof C. Fillmore, J. B. Lowe, Jane Edwards, and two graduate students, and we all got to discuss corpus linguistics late into the night. To round off a perfect day, on returning to my hotel, I found that my luggage had finally caught up with me.
Next day, rejoicing in a clean shirt, I set off to visit Uli Heid, currently a visiting fellow at the International Computer Science Institute in Berkeley. We spent an hour or two discussing corpus retrieval software and I at last saw the Corpus tools developed in Stuttgart by Oliver Christ. Alas, since this prestigious international institution did not have a single PC running Windows in it, I was unable to respond by demonstrating SARA, other than over the web (which worked).
Fillmore is about to start on a new project which involves annotation of a corpus with detailed lexical information: we talked a little about how that might be supported in the TEI scheme. After lunch, I visited the Berkeley Linguistics Department proper, where I saw some of the impressive work Lowe is doing in bringing together and (eventually) marking up components of dozens of African language dictionaries. This project, known as CBOLD (Comparative Bantu Online Dictionary) looked like an excellent TEI prospect.
I spent the rest of the day engaged in tourism in San Francisco, at last. I can now report that I have crossed the Golden Gate Bridge, visited the City Lights bookshop, eaten at Max's Diner, and seen the cars wiggling their way through Lombard Street (the wiggly block of 7th street). Oh, and also seen the remains of the 1910 World's Fair -- some utterly implausible red sandstone ruins put up by William Randolph Hearst. I should express my thanks to Jane Edwards for introducing me to those delights, to my Berkeley hosts for allowing this tour to end on such a high note, and indeed to all the people I visited for allowing me to disrupt their routine with so many impertinent questions.
This was a hard working weekend conference, rather on the TEI model, organized by Peter Robinson of de Montfort University and the Canterbury Tales Project, in collaboration with Dr Hope Mayo from the newly Mellon-funded project on Electronic Access to Manuscripts. It assembled representatives from several major European manuscript collections, specifically the British Library; the Arnamagnaean Institute at the University of Copenhagen; the Institut de Recherche et d'Histoire des Textes in Paris; the Vatican Library; the Czech national library in Prague; the Wellcome Institute; the National Library of the Netherlands at the Hague; the Marburg Institute and the Bodleian, together with experts on MARC, on the Berkeley Finding Aids project, the Text Encoding Initiative and the Dublin Core. The charmingly-named Merilee Proffitt was also present as a representative of the joint Berkeley-Columbia Digital Scriptorium project. The object of the two days was to formulate a basis for minimum level descriptive cataloguing, specifically of western medieval manuscripts, and agree a context in which this could be developed, with an eye to future European funding.
This was an unusually well-prepared and documented meeting: brief bibliographic and biographical details of all the participants were circulated beforehand, and I came away laden with paper and replete with technical detail. A full report was promised, so this should be taken only as a summary stop gap, composed from my personal and probably ursine standpoint.
The first day was taken up by presentations from each of the cataloguing experts present, recounting their own institutional policies and current practices, which proved a most instructive mixture. At the BL, for example, Michelle Brown spends one day a week "up a ladder" looking for illuminated mss, which when found are given minimal level descriptive cataloguing "at shelf" using a dtd developed by Richard Masters, while Rachel Stockdale and her colleagues are contemplating the horrors of retrospectively converting an estimated million printed textual descriptions to a structured format. Mathew Driscoll and Dominik Wujastyk gave good introductions to what was special about their own kinds of mss (Icelandic and Sanskrit respectively); Elizabeth Lalou described the latest incarnation of Medium, the medieval ms database developed at the IRHT in the 70s; we learned a little about the Vatican's 150,000 estimated mss, of which only 15% are catalogued, from Piazzoni, and also about the thinking behind the Czech National Library's Unesco-funded digitization project from Adolf Kroll. Thomas Brandt described the Marburg Institute's MIDAS system, which underlies a very impressive publishing programme, and also a very interesting collaborative cataloguing venture called Diskus, involving several major German museums and art historical institutions. Anne Korteweg described the cataloguing scheme used for manuscripts at the Dutch Royal Library, mentioning in particular its use of PICA and of Iconclass. Finally, Richard Gartner outlined the Bodleian's mss cataloguing projects, using EAD and a set of TEI extensions.
Day two began with some real computer demonstrations from David Cooper, showing the Bodleian's Celtic manuscript project, and from IBM's digital library projects, in the shape of Peter Elliott and Uschi Reber. None of this had a lot to do with the stated business of the meeting, but it was nice to see some really good digital images (or would have been had I not been busily getting my own presentation ready). Real business resumed with four brief technical presentations about MARC (from Larry Creider), EAD (from Daniel Pitti), the TEI Header (me), and the Dublin Core as a discovery vehicle (from Jennifer Trant). The rest of the day was devoted to discussion of what exactly should constitute a Dublin-style core set of descriptors for manuscript materials. This began with a rather unsuccessful attempt to get everyone to agree on a top ten list of fields that must be present, got bogged down somewhat, but eventually cohered around the notion of identifying a key set of descriptive categories, within which more specific fields might be mandated. The discussion was finally given some coherence by a rather neat taxonomy proposed by Jennifer Trant, which I reproduce below. Starred items indicate those which were agreed on as essential, according to my notes.
A fuller and clearer version of this list is to be produced in the official report of the meeting. With this consensus established, the real work of defining the nitty gritty of how (for example) this consensus might be mapped on to a MARC record structure or a TEI header was left to another day. The plan is to prepare a bid for a two year development project, funded by the EU libraries programme and co-ordinated from de Montfort, which will define such implementations, systems for producing such records in technologically-challenged libraries, and a central repository for them, as well as a detailed set of Guidelines. Watch this space.
In all honesty, I must add that some credit for the hard thinking and energetic discussion which characterized this meeting may be due also to the sybaritic nature of its surroundings. Studley Priory is a beautiful building, with an excellent chef, a well-stocked bar, and some rather fine gardens for stomping around in, even on a windy November weekend.
I was invited to deliver the opening address at A decade of power, the third annual conference of the Belgium-Luxembourgian SGML Users' Group, held at the Business Faculty of Brussels University, just down the road from the airport (rather inconveniently for us Eurostar-fanatics). The day long event was well attended, with about fifty delegates, mostly from the industrial/publishing community in Belgium and the Netherlands, though there were also some academics present. There was a small exhibition, notably featuring Omnimark, Balise, Adobe, and Fotek, and a very friendly atmosphere. The day was given over to presentations, which I summarise below.
Lou Burnard (Oxford University) SGML on the Web
You can download the latest version of my presentation from my web site, including the overheads. As far as I remember, I explained at some length why HTML was a Bad Thing for electronic publishers (this is what the Americans call ‘preaching to the choir’), and rather more briefly why SGML was a Bad Thing for the Websurfer in the Street (which is probably what the Americans call ‘making waves’). I also made a few incautious remarks about month, which provoked some interest.
Elizabeth Gower (Adobe) Practical issues in SGML Publishing
The practical issues concerned were all related to conversion of legacy data: how to do it, how to make sure you don't get ripped off by the company you sub-contract to do it, and why it's going to cost you more than you think. The presentation was slick, professional, thorough, and aimed fair and square at senior management: an invaluable checklist of practical things to do (make sure you've got enough MIPS and enough disk space! make sure your network can stand the load!) and how to benchmark the process, but not much on different software strategies you might adopt to do the conversion, or manage the results of doing it.
Norbert Mikula (Philips Semiconductors) Electronic databooks: proof of concept
Not content with writing an SGML parser (of sorts) in Java (imaginatively called Cappuccino), Mikula has also turned his hand to the production of Yet Another DSSSL Engine (or YADE), which uses Milowski's Kawa Scheme interpreter, also written in Java. The context for these tools is the Philips Semiconductors Electronic Databook, an application of PCIS, the dtd Philips have developed within the Pinnacles framework, which forms the basis of Mikula's research at the University of Klagenfurt in Austria. His presentation was impressive, and although only in prototype form, the work he outlined shows great potential.
Milena Dobreva (Inst of Maths and CompSci, Sofia, Bulgaria) Use of SGML by philologists: experiences gained during the Medieval Slavic Manuscripts Encoding Project
This paper gave an overview of the trials and tribulations experienced by a group of Bulgarian philologists in applying the TEI Guidelines to the problems of describing and encoding medieval Slavic manuscripts. Dobreva's presentation focussed more on the organizational and training aspects of the project than its technical content, though she did briefly present the set of extensions the group had found it necessary to make to the TEI dtds, which were of some interest. The facility with which previously SGML-naive users were able to make useful progress, even in a technologically challenged environment, was very heartening, while the survey of common problems encountered was reassuringly familiar.
Benoit LaSalle (Omnimark Technologies) Using microdocuments and hybrid distributed databases for building up hypertext rich content online services
Exoterica Corporation has changed its name to Omnimark Technologies, and this paper was presented by its European sales head, rather than by Eric Skinner or John McFadden. Otherwise, this was much the same story as presented at Munich earlier this year: the future lies in microdocuments (crystals, or document fragments) which can be embedded within a conventional table-like record structure. This is undoubtedly true, but would be more persuasive if there were evidence of implementations not simply hacked together from a mass of Omnimark scripts doing clever stuff behind the scenes.
Paul Hermans (ProText) Questor: publishing social law to different media
This was also a user-report, this time on an electronic publishing project concerned with course materials for social law. The material needed different organization for electronic access and for hardcopy: for the former, a browser based on Synex Viewport (the engine behind Panorama) was used; for the latter, Framemaker. The necessary data massage was carried out by Omnimark. They had begun with three separate dtds, for capture, hardcopy, and online, one of which also sported ICADD attributes, subsequently abandoned for performance reasons; these had then been ingeniously combined into a single dtd. Hermans spent some time outlining the ways in which HyTime linking strategies had enabled them to organize the material for online searching in an effective way, before commenting that TEI extended pointers had proved a much easier option when it came to actually implementing the system with current SGML tools, which was also heartening.
Jacques Desseyne (Sema Group Belgium) The SGML Tree Transformation Process (STTP): processing SGML documents in an absolutely standardized way
This was a good introduction to the STTP side of the DSSSL specification, giving a clear overview of how it works, and why it is a distinct component of the DSSSL architecture. Desseyne rather fudged the issue of implementation, and did not make the expected announcement of a new Sema product, or not in public at least, but rumours were abundant during the coffee break.
Raf Schietekat (Fotek NV) DSSSL: the promise FOSI did not fulfill
This was the last paper of a long day, and I found it rather disappointing. It covered some basic principles of DSSSL in a rather mechanical way, and did not add much to my understanding of the overall scope of its formatting language. In particular, I had hoped for some explanation of what FOSIs are, before they become extinct, but did not receive one.
This three day colloquium was hosted in Paris by Jean-Louis Lebrave of the Institut des textes et manuscrits modernes (ITEM), with funding from the CNRS under a programme called Le Patrimoine Ecrit, or Written Heritage, managed by Almuth Grésillon. It brought together an unusually wide-ranging group of European researchers and implementors, many of whom were previously unknown to me, and all of whom had something interesting to demonstrate or to say about using hypertextual methods in the process of critical editing or electronic publication. Most of the presentations involved web-hosted demonstrations, displayed on a huge screen in a darkened room, fitted with exceedingly comfortable cinema-style armchairs. I learned that the French for "web" is toile, which also means canvas, as well as being a slang term for the cinema screen. After a rather good lunch, it is testimony to the rhetorical skills of the presenters that no-one visibly fell asleep in such surroundings.
Unfortunately, local commitments made it impossible for me to attend the first day's sessions. According to David Robey (who turned out to be the only other Brit present), they were rather theoretical and lacking in focus, but I was sorry to have missed meeting Bernard Stiegler. The second day began with Etienne Brunet (UPRESA, University of Nice), a founder member of the ALLC and author of the Hyperbase concordance system, describing his Balzac CD-ROM project. This will make available 90% of La Comédie Humaine (Balzac's monster sequence of 19th century novels) --- once the copyright problems have been ironed out. More interestingly, the project is a collaborative one, in which 33 international Balzac scholars are scheduled to produce notes, commentary, and assorted links. Brunet suggested that the online text was really there as a way of indexing the collection of associated Balzacian commentary. The web site has some interesting Balzaciana, including photos of the great man's walking sticks as well as his manuscripts. It also has a KWIC concordance, carved up into HTML, with each line of context linked automatically to the passage in the complete text from which it comes, and some rather impressive statistical displays.
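For readers unfamiliar with the technique, here is a minimal sketch of the kind of linkage Brunet described, assuming a text already divided into numbered passages and rendered elsewhere with matching anchors; the file name, anchor scheme and two-passage example are purely illustrative, and this shows the general idea only, not the Hyperbase implementation.

    import html
    import re

    def kwic_html(passages, keyword, width=40):
        """Generate KWIC concordance lines as HTML, each hit linked back to
        the passage of the complete text it was drawn from. Passage i is
        assumed (for this sketch) to be rendered elsewhere with id="p{i}"."""
        pattern = re.compile(r'\b%s\b' % re.escape(keyword), re.IGNORECASE)
        lines = []
        for i, text in enumerate(passages):
            for m in pattern.finditer(text):
                left = html.escape(text[max(0, m.start() - width):m.start()])
                right = html.escape(text[m.end():m.end() + width])
                hit = html.escape(m.group(0))
                lines.append(f'<li>{left}<a href="text.html#p{i}">'
                             f'<b>{hit}</b></a>{right}</li>')
        return '<ul class="kwic">\n' + '\n'.join(lines) + '\n</ul>'

    print(kwic_html(["Le père Goriot aimait ses filles.",
                     "Ses filles ne venaient plus voir le père Goriot."],
                    "Goriot"))

A real concordance would of course work from a tokenized and indexed text rather than raw strings, but the principle of carrying a passage identifier from the full text into each concordance line is the same.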
Luca Toschi from the CRAIAT research group at the University of Florence prefixed his demonstrations with some suggestive thoughts on the ways in which digitization facilitates many different views of a text, based on his experience with the publication of an electronic Goldoni variorum. I found the software, when we finally saw it, rather impressive: it displays each state of the text separately, but with variant passages colour coded. Clicking on one of these then brings up, in a new window, the state of the text in which that variant appears. They use an IBM product called Linkway Live (?).
Paolo d'Iorio (Paris) described how the web had effectively created an enormous, multinational, multilingual Nietzsche database. His chief focus was on the way the web made co-operative authoring ventures possible, but he also gave some nice examples of ms images linked to transcriptions. Asked why someone didn't just publish everything on CD, he stressed the open, dynamic nature of the enterprise; in a similar vein, when asked whether he didn't fear that this kind of community would create some new kind of orthodoxy, he pointed out that the dissidents were perfectly free to set up webs linked to the same material, displaying it in their preferred way.
Daniel Ferrer (also from ITEM) gave a considered presentation of the advantages of Storyspace in presenting the evolution of the text of Joyce's Ulysses through the various states of manuscript, typescript, correction, page proofs, etc. The database includes Gabler's transcription as well as a set of page images; much play was made of the difficulty of linking non-rectangular zones of Joyce's scrawl in the latter with parts of the former. The Joycean theme was picked up by Marlena Corcoran, who demonstrated the importance of good design and a simple interface in a system (developed with mediatool) which allowed one to see the corrected original page proofs (placards) of Ulysses, as they appeared before they were cut up by some over-zealous conservator, and thus to infer something about the correction process itself.
Nicole Moulinoux from the University of Rennes described another scholarly collaboration centred on the works of William Faulkner, in particular a web based (but password-protected for copyright reasons) electronic variorum of Sanctuary. Another member of the team showed a comprehensive set of retrieval and statistical tools, aimed (I think) at content analysis, which had been applied to the text: the unusual feature of this was the provision of dynamically generated tables of hits retrieved according to an unexplained factor analysis. This seemed to offer a lot of possibilities for stylistic analysis.
Alain Giffard (IMEC) discussed a new project at the Collège de France to create a hypertextual edition of Roland Barthes' Comment vivre ensemble. His discussion was wide ranging and highly theoretical. If hypertext has a theory (and it must have, for the French to take it seriously) then Barthes anticipated most of it. It will be interesting to see what the project produces: the Barthes archive includes video and audio recordings of the lectures from which his published works derive, made at different times. Extending the geneticist school of textual criticism to cope with such materials poses interesting challenges.
Peter Szendi (IRCAM) presented what he called a maquette diaboliciel (or diabolical pilot program) -- a prototype for an exploration of the variations within and between versions of Beethoven's Diabelli Variations, in which the scores were linked to an audio track. For some reason his comment that it should be possible to find some auditory equivalent of the clickable button (to indicate an anchor within an audio track) provoked heated dissent from the audience.
I opened the final day of the colloquium by presenting the usual overview of the TEI architecture in my best A-level French, somewhat enhanced by illustrations of how to do hypertext-y things in TEI, including the Comenius example, and also, for good measure, two different ways of encoding the start of the Beowulf ms. I was also able to wave a copy of the newly published edition of Cahiers Gutenberg containing François Role's French translation of the TEI Lite tutorial. Which was, as they say, nice.
David Robey (Manchester) followed this up by describing his experience as a member of a TEI work group, and presenting his views on how to tag the rhyme and alliteration patterns in Dante's Inferno. This provoked an interesting argument about the extent to which algorithmically or procedurally determined patterns needed to be made explicit in an encoding.
David Pietrowski and Georges Vignaux (INALF/LIMSI) described a prototype system for use in tracking the reading of a dictionary, specifically the Grand Robert. This was presented as a way of creating a new perspective on language rather than as a neat way of capturing the reading process itself, which might have been more persuasive.
A team from Saarbrucken, led by one Wender, gave a brief overview of a rather fine Goethe project. This web site generates HTML on the fly from their internal markup, combining text, manuscripts, and critical apparatus in an impressive way. They are also working on Musil and Buechner. The methodological problem they foregrounded was that of making explicit the temporal stages of a series of variants.
Alessandro Pamini (Istituto Metacultura) tried (twice) to present a new hypertext system being developed in collaboration with the Cultural Anthropology department of the University of Rome. His talk was given in Italian, and all I gleaned from it was that he felt everyone else had misunderstood the fundamental issues. (But he did later hand out a French translation, which looks a lot more interesting.)
The colloquium concluded with a long and wide ranging round table discussion in which several distinguished luminaries picked up some themes of the colloquium and introduced some new ones. Hans-Walter Gabler (Munich), whose computer-assisted re-editing of Joyce provoked some controversy a few years back, was quietly optimistic about the possibilities offered by true computer-based editing: I learned later that he is currently planning a new graduate seminar on textual criticism with a strong computational component. Yannick Maignien of the Bibliothèque Nationale and Robert Martin of the Institut National de la Langue Française raised several key issues about the opportunities and difficulties of electronic text provision (the role of libraries, the need for metadata, publishing, copyright, etc.). Christine Coutoure, documentalist at the École Normale, spoke feelingly of the librarian's perspective on the untamed wilderness of the web. Eric Lochard from Montpellier, Alain Giffard, Jean-Louis Lebrave, and Daniel Ferrer all in their separate ways questioned some of the methodological implications of the hypertext method, and the seductive availability of apparently neutral digital resources. There was much debate, but I found most persuasive Toschi's eloquent reminder that whatever else scholars do, they should not abnegate their responsibility to spin a plausible tale.
DAVID is a newly funded project in structured programming and document manipulation, based at the University of Minho's computer science department and funded by JNICT (Junta Nacional de Investigação Científica e Tecnológica: i.e. the Portuguese national research council). Its goals are to build on the department's expertise in formal methods and grammars, exploring new ways of applying them to text processing and information handling. To start the project off, they organized a three day workshop on "Algebraic document processing" in Braga last week, inviting a small number of guest speakers (myself, James Clark, and Sebastian Rahtz) to share their ideas in a pleasantly relaxed environment. There were about a dozen participants, mostly from the department itself, but also from other Portuguese universities.
I was asked to give an overview of SGML and then to describe the TEI architecture and ways of using it. I took the opportunity of reorganizing the standard TEI workshop slightly, also introducing a little more technical content than usual. Following the now-traditional account of the problems of realizing the full potential of electronic texts, I presented a version of the Gentle Introduction to SGML expanded to include all (but only) the SGML features actually used in the TEI scheme. Sebastian Rahtz followed this with a nicely contrasting account of the practical difficulties of applying SGML in the production-line world of Elsevier's journals division. Elsevier has committed itself to using SGML for the archival storage of its several hundred scientific and technical journals, using a version of the ISO 12083 dtd with various varyingly satisfactory accommodations to cope with the enormous amounts of maths, graphics etc. needed. This policy is expected to pay off in terms of re-usability, with such ventures as the Science Online database due to start next year, which will deliver the full text of 300 scientific journals on the web, complete with links to relevant bibliographic databases and abstracting services, for those well-enough funded to access them, at least.
After lunch (pizza of course), I gave the traditional TEI architecture presentation, focussing rather more explicitly on the nitty gritty of its class system and how the modularity of the dtd is implemented. I then gave the document analysis presentation, followed by a group tagging exercise. This last was unexpectedly successful, given that the text chosen as vehicle was an obscure 17th century English political pamphlet, which several participants found linguistically rather challenging. The first day concluded with a brief account of the motivation for, and contents of, the TEI header, after which I was more than ready for a cool beer and the obligatory bacalhau.
Day two began with the audience's choice of TEI tagsets: offered a choice of TEI-ana, TEI-spoken, and TEI-lite, they plumped for the first two, which is what I duly gave them. This was a pleasant change, for me at least; in retrospect it's a great shame that I had not prepared more on the feature structure tagset, since such mention as I was able to make of it was clearly of considerable interest to this audience. I concluded my presentations with a brisk canter through available SGML tools and strategies for handling TEI texts.
James Clark then improved considerably on my superficial account of what a parser does by giving a detailed presentation of his SP parser, complete with some glimpses of its 60,000 lines of C++ code. The new version of SP implements everything defined in ISO 8879, even the silly bits. Unlike his earlier parser, sgmls, it supports multi-byte character sets such as Unicode, maintains non-ESIS information such as comments and use of white space, and also allows for modification of the concrete syntax. It is also reportedly at least twice as fast. SP is actually a general purpose C++ class library, with a well defined entity manager, and comes with a number of useful implementations; it is completely free of charge, and has already been incorporated into a number of leading SGML products (notably Technoteacher, Balise, and Microstar's SGML Author for Word).
After lunch, Sebastian Rahtz' second presentation described an ingenious approach to the rather unusual problem created by Elsevier's document management policies: the need to convert from LaTeX into SGML (rather than the more usual reverse). Authors' use of LaTeX varied so much, and LaTeX itself was so flexible, that in some cases it would be cheaper to throw away the author's source and retype it from scratch, or to throw away the TeX markup and retag it. Automatic conversions based on parser technology inevitably fell foul of the macro-processing nature of the TeX system sooner or later, and so the optimum solution seemed to be to use TeX itself to process the input, but tweak its macros so as to emit appropriate SGML tagging along with the formatted text. Probably you have to be a TeX hacker of Sebastian's expertise to even think of this solution, let alone to implement it, but it does apparently work more reliably than any of the alternatives. Sebastian also outlined various strategies for presenting maths on the web and tried very hard to persuade us that using PDF was a good idea.
Day three was mostly devoted to DSSSL, the new Document Style Semantics and Specification Language, ISO 10179. José Ramalho from Braga began with a brief overview of the contents of this important new standard, which complements the standard syntax offered by SGML with a standard way of defining its semantics. DSSSL has four components: a language and a processing model for transforming one SGML document into another; a query language called SDQL for identifying portions of an SGML document; a style language for applying formatting characteristics to an SGML document, expressed in terms of things called "flow objects"; and a powerful expression language derived from Scheme which ties the whole thing together.
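DSSSL itself is written in a Scheme-derived syntax; purely as an illustration of the style-language idea (construction rules that map source elements onto flow objects carrying formatting characteristics), here is a rough Python analogue. The element names and characteristics are invented, and nothing below is actual DSSSL syntax.

    # A toy analogue, in Python rather than DSSSL's Scheme-based notation,
    # of the style language's central idea: one construction rule per
    # element type, each producing a "flow object" whose formatting
    # characteristics a back-end formatter would then realize.
    from xml.etree import ElementTree as ET

    RULES = {
        "title": lambda el: {"flow": "paragraph", "font-size": "14pt",
                             "font-weight": "bold", "content": el.text},
        "para":  lambda el: {"flow": "paragraph", "font-size": "10pt",
                             "space-before": "6pt", "content": el.text},
    }

    def process(element):
        # Apply whichever rule matches this element; unmatched elements
        # simply pass their children through as a sequence.
        rule = RULES.get(element.tag,
                         lambda el: {"flow": "sequence", "content": el.text})
        flow = rule(element)
        flow["children"] = [process(child) for child in element]
        return flow

    doc = ET.fromstring(
        "<doc><title>DSSSL</title><para>Style via flow objects.</para></doc>")
    print(process(doc))

In DSSSL proper the same mapping is expressed as construction rules in the expression language, and the resulting flow objects are handed to a formatting or transformation back-end rather than printed.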
James Clark then presented his new piece of software: JADE (believed to be short for James's Awesome DSSSL Engine). He began with an impressive proof of concept which involved converting the 300 pages of the DSSSL standard itself directly into RTF before our very eyes, and showing the results, but then descended (or ascended) to showing us large amounts of the code in Jade. I regret to report that my comprehension of object-oriented techniques, already somewhat overstretched, did not allow me to follow as much of this as I should have: but the availability of a DSSSL style sheet for TEI texts (on which work is currently proceeding) will no doubt change this.
After lunch, Jocelyn Payne described a system he has developed called Web-o-matic, which appears to be a way of doing what most people do with CGI scripts but using object-oriented Rexx. The system is used by a do-it-yourself economic modelling web site at the Institute for Fiscal Studies. It wasn't clear to me what the point of this approach might be, but it's always nice to hear what Jocelyn is actually up to. I wonder whether he knows that at least three internet service providers use the term web-o-matic to describe their tacky home-page creating scripts?
Finally, Pedro Henriquez and José João Almeida gave an overview of their plans for the DAVID project. They have three years' funding from JNICT, the Portuguese government research funding agency, to explore new techniques and ideas in textual information processing. Their strong background in language processing, with a corresponding emphasis on representation of meaning as abstract syntax trees, together with their large resources of programming skills (they have already developed a specification language and prototyping environment called Camila, and applied it to the task of defining a `literate programming' application), clearly gives them a head start (over me at least) in understanding DSSSL and applying it to new and interesting areas.
This was an unusual workshop, with a high degree of technical content in a pleasantly relaxed atmosphere. For me, it was also a comparatively painless way of finding out what the excitement about DSSSL is all about; my thanks are due to the organizers and to my co-speakers for making this an occasion to remember.
SGML Europe 96 was held this year in a very large and expensive hotel in Munich, adjacent to what would have been a very pleasant park if the weather had been less damp. I travelled there from the Tyrol by train, thus missing the opening ceremonies (by all accounts a rather lachrymose affair, this being the first major conference since the death of Yuri Rubinsky, to whose memory the event was dedicated), but saving the TEI the cost of an extra night's stay in said hotel. I did arrive in time to hear Charles Goldfarb give his "Inventor's Keynote", from which I learned that Charles has been thinking about the World Wide Web, and perhaps regretting that he didn't take the chance to set Tim Berners-Lee right about SGML when he had it. When titans meet, Dr Goldfarb opined, one should find another field -- sound advice, with reference to web browser wars, but rather defeatist for those of us who think that the SGML community might have something to learn from the runaway success of the web. Key events of the year were the publication of DSSSL, and some reorganization and realignment of various competing areas of HyTime and DSSSL activity, notably the definition of the Standard Document Query Language and of the HyTime "general facilities" (aka the useful bits -- architectural forms, property sets, groves, formal system identifiers etc). Charles also proudly announced his "Purity Test" for so-called SGML-conformant applications, on which see http://www.SGMLsource.com/Goldfarb/purity.html, if you care about such things.
The real function of conferences like this one is not, however, to listen to presentations, however inspirational, but to hobnob with the vendors, who were there in force. There were three exhibition halls full of variously sized booths in which to do this, with almost all major players represented (conspicuously absent were Microsoft and Novell) and many minor ones; the exhibition ran throughout the proceedings, pausing only for the evening reception (on the 24th floor, commanding magnificent views of some very damp tree tops). I duly hobnobbed, to the point of exhaustion, as did most of the other 300 or so delegates, when not drifting in and out of sessions.
Here are a few of the software products that made some impression on me: some newish SGML authoring tools, notably InContext (the new version is now reduced in price to the point where we could actually afford it) and Stilo, which is nearly available; both successfully processed the TEI dtds under my suspicious gaze. Folio Views had put a lot of money into free mugs and pads proclaiming "Folio does SGML" (I am told that this is somewhat economical with the truth). Astoria, the new SGML object database from Xerox, had an immense and very busy stand. At the STEP booth, they were busily producing a daily news bulletin directly from an online news feed, converting it automagically to SGML and formatting and printing it in real time. Synex, makers of Panorama, are now actively marketing their Viewport engine at a price we certainly cannot afford. Jouve have a (comparatively) cheap and cheerful CD-ROM production system called GTI Publisher which could give DynaText a run for its money. AIS have a new version of Balise (with a new logo) and a new English manual on the stocks. At the cheaper end of the market, two new application development toolkits were in evidence: one called SGML-C, from Bruce Hunter, and the other, called NSL, originally developed for the Multext project by Henry Thompson. Both are worth hunting down on the net.
Even had I not been somewhat pre-occupied with getting my part of the closing plenary session ready, it would have been impossible to take in all this as well as the three parallel tracks, so my report is necessarily somewhat fragmentary. I listened to a session on document management systems, which included a good overview of issues in document database design from John Chelsom, and a characteristically pragmatic discussion of ways of building hybrid distributed document databases from John McFadden (see http://www.exoterica.com/pres/hddb). I dipped in and out of a major overview of SGML software tools organized by Steve Pepper (Falch) and Robin Tomlin (SGML Open), in the hope that this would stop defining criteria and start evaluating products (it didn't -- but the overview was very thorough). I attended a session about the role of SGML on the internet, which featured Eric van Herwijnen and Martin Bryan, neither of whom had anything new to say on the subject (though Eric had some nice pictures); Jon Bosak, in a different session, also addressed the web and had some rather sharper comments to make: (see http://www.sgmlopen.org/sgml/docs/library/dssslo.htm). I was rather disappointed to find that no-one apparently had anything to say about the importance of SGML as an archival format, or a metadata format, and I don't think I heard the phrase "digital library" mentioned once in four days, which seems odd. The conversations I heard in the halls were all about document production and document management, the intranet and improving your organization's information flow. I heard nothing about ICADD or the TEI, or how SGML might be of use to the rest of us, which was all rather depressing.
The final plenary began with Dale Waldt explaining at some length just why commercial publishers love (or should love) SGML, which somewhat reinforced my prejudices against them. I found myself speaking up for academia and reminding the conference that maybe SGML had a more significant role to play than just helping publishers and consultants get rich, and that its true Destiny was to preserve our cultural heritage. I also suggested that it really shouldn't be so difficult to get started with SGML, the basic notions of which can be jotted down on the back of an envelope, a theme which I was glad to see Tim Bray pick up in his masterly wrap up of the whole proceedings (see http://andor.opentext.com/~tbray/SGML-Europe/index.html). Tim stopped in the middle of his presentation (done like almost everyone else's with Powerpoint), exported it as an HTML file with one click of the mouse, and asked the question the SGML Industry really needs to answer: why is it so difficult for vendors to build true SGML support in with comparable user friendliness and simplicity?
Outside the conference, for the first time since I arrived in Germany, the sun came out. It was a public holiday, so the park was full of cheerful Bavarians drinking beer and eating sausages to the accompaniment of an oom-pa band. Time to go home...
At the invitation of Daniel Ridings, I visited the Sprakdata Unit at Goteborg University at the beginning of May. This unit, which is distinct from the Linguistics Dept, was set up originally by the Swedish Academician Sture Allen, and is the major lexicographical and NLP research centre in Sweden. I gave a two hour lecture on the subject of the British National Corpus, followed by an extended demonstration of the capabilities of SARA to an audience of about 15 lexicographers and others, which seemed to be much appreciated. On the second day of my visit, I gave a boiled-down version of the standard TEI workshop to students on the second year of the University's Computational Linguistics course, with an emphasis on the implications of TEI markup for lexical data, at the invitation of Ridings' colleague Pernilla Danielsson. Danielsson and Ridings are collaborating on a project called PEDANT, an aligned multilingual corpus using TEI markup and the toolkit developed by Henry Thompson as part of the Multext project, which we discussed at some length. I also met Steffan Larsson and colleagues from the Linguistics Dept at Goteborg, who have developed their own complex annotation scheme for the markup of a spoken Swedish corpus, and with whom I discussed the possibility of converting their markup scheme to TEI conformance. My BNC lecture was also attended by Gary Ward, a systems programmer from the University of Vaxjo, which is also attempting to set up a SARA server. With assistance from technicians at Goteborg, we were able to make considerable progress in getting the current version working, to the point where we were able to determine that there really was a bug in the server (since fixed by Scotty, I am glad to report).
The third day of my visit to Sweden, being a Sunday, I spent visiting the Goteborg museum (splendid collection of late 19th century soft porn) and getting rained on in various picturesque locations, before heading to Bergen to participate in an encoding seminar organized by the NCCH's Wittgenstein Archive. Part of the reason for this workshop, it was explained to me, was to prepare local staff and staff from other Norwegian centres for the forthcoming ALLC conference; I suspect it was also to field-trial the splendid new facilities being made available for said conference in Bergen's new Law Faculty building. Whatever the reason, I very much enjoyed the opportunity of re-running the boiled-down TEI workshop for the benefit of an intelligent and alert audience. In the time available, it was not possible to do a full hands-on session, but I was able to run through the exercises I had prepared (loading an untagged file into A/E, tagging it, building a header, exporting it, viewing it with Panorama, etc.), which seemed to be much appreciated. After the seminar, Claus Huitfeldt, Peter Cripps and I discussed, somewhat inconclusively, how the TEI/MECS Workshop planned for the start of the ALLC conference should be organized. I think we decided that we'd run them in parallel, sort of. I also met Lars Johnsen, the new head of NCCH, a former corpus linguist turned administrator, as well as old friends from Bergen. The second day of the workshop was given over to Allen Renear, from Brown, who gave an intriguing historical overview of the development of what he called the ontology of texts, from platonism (as characterized by the Brownian OHCO model) through liberalism (the view that there are many possible hierarchies) to anti-realism (the view that a text has no reality aside from that projected onto it). Renear characterized the last as being "fashionable", which I take to mean "wrong".
Commitments in Oxford meant that I was regrettably unable to stay for the remainder of this Workshop, which was to include a discussion of MECS and how it differed from TEI, so I took the customary large lunch at Hotel Norge before returning home on Tuesday evening.
After two days' frenetic but parochial activity in Oxford, I set off for Innsbruck, via Munich, in the company of Mike Popham. Our objective was partly to find out what exactly the NOLA project might be or might become, and partly to get a little quiet time to discuss the two other meetings to which I had committed Mike's time this month. Once arrived in Innsbruck, I developed a streaming cold, not helped by the rain which had followed me from Goteborg; this may have somewhat coloured my view of the ensuing meeting.
The meeting was organized by Allan Janik of the Brenner Archiv, and attended by Heinz Hauffe of the Innsbruck University Library, Donald Broady of the Royal Institute of Technology in Stockholm, and Lars Johnsen from Bergen, in addition to Mike and me. Its purpose was to review the NOLA proposal for funding under the EU's Libraries Programme, recasting it in the light of criticisms received so far. There was a general feeling that much work needed to be done, particularly in recasting the relationship of the project to the TEI, and in involving more prominent library activities. I had hoped that the work being done at the Bodleian in extending the TEI Header to handle manuscript materials might be of some relevance here, but this remains to be seen. Mike Popham's excellent minutes (soon to be available from the NOLA web page at http://???) summarize the chief areas discussed and the future plans of the group.
Mike and I then spent a very useful day preparing an initial position paper for the Oxford Text Archive to present at the first Service Providers Forum meeting of the AHDS, and in discussing Mike's presentation at the Princeton workshop on Software for the Humanities, before going our separate ways -- he back to Munich and thence to Princeton, I into the Tyrol and thence to Munich for the SGML96 conference. It continued to rain.
A TEI Workshop was organized by Harry Gaylord at the University of Groningen in the Netherlands 21-22 November 1995. The workshop was hosted by Harry's department at the University, with additional funding from SURFNET (the Dutch equivalent of JANET) and attracted an impressively large number of delegates, not all of them Dutch (there was a small group of Francophones, Swiss, Belgian and French; also some Spanish and Portuguese; but no visible Germans, no doubt owing to the similar event in Tuebingen the week before). I counted about 120 during the opening plenary, and even by the final session there were well over fifty. Although dogged by administrative and technical mishaps, ranging from the collapse of one of the presenters with influenza 24 hours beforehand to the dispatch of the printed course materials to some other conference in an adjacent province, I'm glad to report that the event appeared to have been very well received by all who attended, with much discussion and questioning, both enthusiastic and intelligent, during the sessions and afterwards.
Day 1 began with a series of plenary lectures: I gave the celebrated "Gentle introduction to SGML", followed by two other invited speakers: Bert Bos from the W3 Consortium spoke illuminatingly about the relationship between SGML and HTML, and the great hope afforded by the proposed stylesheet mechanism in HTML3; Gert van der Steen gave an exhaustive overview of SGML software, characterized by its functions. After a break we were joined (virtually) by Rich Giordano, who gave a very impressive lecture on the TEI Header and his thoughts about its relationship to projects in resource discovery, by video tape recorded in his office in Berkeley. We knew it was live because at one point the phone rang (it was Harry reminding him to make the video) and at another someone tried to walk into the office.
The TEI Workshop proper began with a presentation of the TEI architecture (regrettably using my fallback set of overheads, rather than the nice new version I'd prepared for use with Panorama), followed by a group document analysis, using as its vehicle some pages of the Sherlock Holmes story "The Red-Headed League". In the absence of printed handouts, at that time finding their way back to us from the aforesaid adjacent province, Alan Morrison and I had to perform this text as well as elicit discussion of its chief features -- which turned out to be quite an amusing experience for all concerned. Whether for this reason or not, this session was very lively, raising a number of fundamental encoding issues in a satisfactory manner. The day concluded with a presentation of the standard TEI document analysis slides, which I gave from paper notes in the absence of anything flashier.
Day 2 was to have begun with the first of three graded exercises on using Author/Editor and Panorama, but it became apparent fairly rapidly that the teaching room allocated for this purpose was simply not set up correctly. Delegates seemed philosophical about this, occupying the time in such characteristically Dutch habits as checking email and strolling down the corridor for a smoke, while we re-arranged the timetable to allow a further hour's installation time. I gave (again from backup slides) the overview of TEI Lite and discussed briefly ways of using it to implement the features we'd identified the previous day. After lunch we divided the group up, with those uninterested in Author/Editor being free to go and inspect the various demonstrations of student projects and other software (notably Near and Far and Explorer).
The rest repaired to the PC room and drove out a few hopeful students who thought it was theirs. Alan and I then walked through our Author/Editor exercises successfully. In the first, we imported a "plain ASCII" version of the Holmes text, tagging it as a single div, then as a paragraph, then splitting the paragraphs. This taught use of the surround, remove, split, and join options. Then we exported the text, looked at it in Write, and reimported it. In the second exercise, we started from a blank new document, and built the skeleton of a full TEI conformant text, using the "Insert markup" option to teach something of the TEI header at the same time. In the third exercise, we imported a version of the tale which Alan and I had tagged in some detail earlier, with all direct speech identified and allocated to speakers. We then showed how this markup could be displayed in Panorama, using previously prepared stylesheets and navigators. This programme of work was fairly exhausting for the 30 or so participants who stayed the course, and entirely so for the two of us trying to give it, but it seemed to go down well. I did finally manage to see some of the work which students at Groningen have been doing -- an illustrated version of A Christmas Carol; a morphologically analysed version of Hiawatha -- but not as much as I'd have liked.
I was greatly impressed by the energy and enthusiasm of the Workshop participants. They were very tolerant of the technical mishaps which prevented the event from being as great a success as it deserved to be. The local team, particularly the students, also deserve commendation for their great resourcefulness in the face of adversity. I hope that Surfnet will be willing to sponsor another such event in the near future.
The Memoria Project's latest meeting was in Pisa on Monday and Tuesday this week, but I decided to go early, partly to save the project 250 pounds in airfare, and partly in order to discuss with the indispensable A. Zampolli the chance of getting more funds for the TEI from Brussels under a call for proposals due to expire on 15 March (gak).
I took a fly (as Alitalia likes to call them) from Gatwick at 0945 (which meant getting the 0558 train from Oxford) on Saturday, found the leaning tower still unfallen, and basking in spring sunshine, and Italy just as it should be. In the evening, after some prevarication, Zampolli and I managed to dine and dine well at the Trattoria opposite my hotel, but I was concentrating so much on the need to sort out the politics of the TEI proposal that I don't remember anything except some truly unforgettable gnocchettini with cheese and basil sauce, and a piece of fish with rather more olives attached than I had bargained for.
Sunday I went to Florence, partly for touristic reasons, but mainly to persuade my old friend Dr A to write the BNC Handbook for me, which I duly did over an excellent lunch in an obscure, but very crowded fiaschetteria. This was the sort of Italian eating establishment I like -- hams hanging from the ceiling, elbow-to-elbow diners all shouting at the top of their voices, small children misbehaving with indulgent grandparents, plates of food and bottles of chianti flying everywhere. You eat what's on offer, and don't regret it. I had a thick vegetable soup, a plate of prosciutto, and what would have been a monstrously thick pork chop had it not been beef of a most succulent and delicious nature. To walk it off, Dr A and I staggered through the Boboli Gardens up to the Belvedere commanding magnificent views over the town, before calling at someone's house for tea and cake as the sun set, and we went our separate ways.
On Monday, the Memoria meeting proper took place: Catherine Toussaint from CAP, who has taken charge of the project, summarised what had happened at the Project Review meeting in Luxembourg last week (not a lot) and we discussed what exactly should go into Work Package 4 (quite a bit). Lunch was unremarkable, but for dinner we went to my other favourite sort of Italian restaurant -- the kind where the proprietor welcomes you like a long lost relative and proceeds to foist exotic and unlikely dishes on you, saying "but you must taste this" and "what do you think of this?". Such establishments rarely have more than four customers at a time, and are buried in obscure side streets, but are well worth the extra walk. I was seduced into trying the home-smoked tunnyfish, followed by just a little oyster-mushroom soup with crostini, followed by a main course composed mostly of small pieces of roast marinated goose, rounded off with a pudding of baked ricotta in lemon sauce, and an aromatic grappa. Exquisite is the word that springs to mind. Also, filling.
Day two of the meeting took place at the ILC in Pisa, and consisted largely of a demo of image processing software by Andrea Bozzi and friends. With the aid of a Quadra 7/50 and a box of transputers, they can figure out where the words in a line of digitized medieval manuscript begin and end, and tie them up with a transcription of it. There was also some discussion of what exactly should be presented to the first meeting of the Special Interest Group (to be constituted next week), and a very instructive discussion of what exactly Yannick Maignien meant in his description of the Memoria "maquette". Also some bleating from me about how exactly OUCS was supposed to carry out Work Package 3 in the absence of any input from the other partners. Lunch was a modest affair (pasta in cream sauce, followed by salad and chips) and very late, but none the worse for that. The rest of the afternoon I spent trying to connect to the BNC server, with modest success eventually; also trying to extract a copy of the TEI conformant 30 million word corpus of modern Italian -- I nearly got it, but not quite, before returning this evening.
This joint meeting of the UK TeX Users Group and the BCS Electronic Publishing Specialist Group attracted a large and mixed audience of academics, TeX hackers, publishers, and software developers, with representatives from most UK organizations active in the field of electronic publishing and document management. I was expecting rather more disagreement about the relative merits of the various approaches now available for the creation of portable documents; in the event, the path of SGML-based righteousness, with appropriate concessions to the practical merits of PostScript-based systems, was apparently endorsed by the consensus.
First of the seven speakers was David Brailsford from Nottingham University, who described Adobe's Acrobat as "a de facto industry standard". His presentation of exactly how the various components of this product worked together, and could be made to interact with both LaTeX and SGML, was very clear and refreshingly free of hype. The choice of PDF (which is effectively a searchable and structured form of Postscript, in which logical structure and hypertextual links are preserved along with the imaging information) as an archival format was a pragmatic one for journals such as EPodd where fidelity to every detail of presentation was crucial. The availability of a free Acrobat reader was also a plus point. He characterized the difficulties of mapping the logical links of a LaTeX or SGML document on to the physical links instantiated in a PDF document as a classic case of the importance of "late binding", and revealed the open secret that Adobe's free PDF reader would soon be upgraded to recognise and act on HTML-style anchors. A demonstration of the Acrobat-based electronic journal project CAJUN is already available online at http://quill.cs.nott.ac.uk.
David Barron, from Southampton, gave an excellent overview of what exactly is implied by the phrase "portable document". Documents are not files, but compound objects, combining text, images, and time-based media. There is a growing awareness that electronic resources should be regarded as virtual documents, repositories of information from which many different actual documents may be generated. These developments all make "portability" (defined as the ability to render documents -- with varying degrees of visual fidelity -- in different hardware or software environments) very difficult. Portability was of crucial importance, not only for publishers wishing to distribute in the electronic medium, and not only for specific user communities wishing to pool information, but also for all of us. Information available only in a non-portable electronic form was information at the mercy of technological change. He cited as portability success stories the widespread use of PostScript and LaTeX as a distribution medium by the research community, referring to the Physics preprint library at Los Alamos as a case where this had now become the normal method of publication. By contrast, the success of the World Wide Web seemed to be partly due to its use of a single markup language (HTML) which effectively takes rendering concerns entirely out of the hands of authors. From the archival point of view, however, none of the available standards seemed a natural winner: hypertext was still too immature a technology, and there were still many intractable problems in handling multiple fonts and character sets. Professor Barron concluded with a brief summary of the merits of SGML as providing a formal, verifiable and portable definition for a document's structure, mentioning in passing that Southampton are developing a TEI-based document archive with conversion tools going in both directions between SGML and RTF, and SGML and LaTeX. Looking to the future, he saw the IBM/Apple Opendoc architecture as offering the promise of genuinely portable dynamic documents, which could be archived in an SGML form once static.
The third speaker of the morning, Jonathan Fine, began by insisting that the spaces between words were almost as important as the words themselves. I felt that he wasted rather a lot of his time on this point, as he did later on explaining how to pronounce "TeX" (surely unnecessary for this audience), before finally describing a product he is developing called "Simsim" (Arabic for sesame, and also, we learned, a trademark of British Petroleum). This appears to be a set of TeX macros for formatting SGML documents directly, using components of the ESIS to drive the formatter, but I did not come away with any clear sense of how his approach differed from that already fairly widely used elsewhere.
Peter Flynn, from University College Cork, did his usual excellent job of introducing the Wondrous Web World, focussing inevitably on some of its shortcomings from the wider SGML perspective, while holding out the promise that there is a real awareness of the need to address them. What the Web does best, in addition to storage and display of portable documents, is to provide ways of hypertextually linking them. Its success raises important and difficult issues about the nature of publishing in the electronic age: who should control the content and appearance of documents -- the user, the browser vendor, or the originator? Publishing on the Web also raises a whole range of fundamental and so far unresolved problems in the area of intellectual property rights, despite the availability of effective authentication and charging mechanisms. He highlighted some well-known "attitude" problems -- not only are most existing HTML documents invalid, but no-one really cares -- and concluded that the availability of better browsers, capable of handling more sophisticated DTDs, needed to be combined with better training of the Web community for these to be resolved.
The three remaining presentations, we were told after a somewhat spartan lunch, would focus on the real world, which seemed a little harsh on the previous speakers. Geeti Granger from John Wiley described the effect on a hard-pressed production department of going over to the use of SGML in the creation of an eight volume Chemical Encyclopaedia. Her main conclusions appeared to be that it had necessitated more managerial involvement than anticipated, largely because of the increased complexity of the production process. She attributed this partly to the need for document analysis, proper data flow procedures, progress reports etc., though why these should be a consequence of using SGML I did not fully understand. More persuasively, she reported the difficulty the project had had in finding SGML-aware suppliers, in designing a DTD in advance of the material it described, in agreeing on an appropriate level of encoding and in getting good quality typeset output.
Martin Kay, from Elsevier, described in some detail the rationale and operation of the Computer Aided Production system used for Elsevier's extensive stable of academic journals. Authors are encouraged to submit material in a variety of electronic forms, including LaTeX, for which Elsevier provide a generic style sheet. Other formats are converted and edited using an inhouse SGML-aware system (apparently implemented in WordPerfect 5, though I may have misheard this). This uses their own dtd, based on Majour, with extensions for maths, which seemed to be a major source of difficulty. Documents will be archived in SGML or PDF in something called an electronic warehouse, of which no details were vouchsafed. Both PDF and SGML were seen as entirely appropriate formats for online journals, CD-ROM and other forms of electronic delivery. The advantages of SGML lay in its independence of the vagaries of technological development, and its greater potential. However, potential benefits always had to be weighed against current costs; like any other business, Elsevier was not interested in experimentation for its own sake.
The last speaker was Michael Popham, formerly of the SGML Project at Exeter, and now of the CTI Centre for Textual Studies at Oxford. His presentation did a fairly thorough demolition job on the popular notion that there is still not much SGML-aware software in the world, starting with a useful overview of the SGML context -- the ways in which SGML tools might fit into particular parts of an enterprise -- and then listing a number of key products organized by category. It was nice to hear the names of so many real SGML products (auto-taggers, authoring aids, page layout systems, transformation tools, document management systems, browsers and parsers) being aired, after a long day obsessed by Acrobat and LaTeX. He concluded with a useful list of places where up-to-date product information can be found, and a reminder that the field is rapidly expanding, with new tools appearing all the time.
The day concluded with an informal panel session, onto which I was press ganged, which effectively prevented me from taking notes, but also gave me the chance to promote the recently-published DynaText version of the TEI Guidelines, which I did shamelessly. I also remember Malcolm Clark asking, tongue firmly in cheek, why everyone couldn't just use Word, and being somewhat agreeably surprised by the number of people in the audience who were able to tell him the answer, and in no uncertain terms. Other topics addressed included auto-tagging, whether maths and formulae should be encoded descriptively or presentationally, whether Microsoft will still be around in the next century, and whether we would ever learn how to format documents for electronic presentation as well as we could on paper.
I was invited to give a number of talks at the University of Oslo as part of a small conference organized by the new inter-faculty Text Laboratory set up there, in collaboration with the Department of English and American Studies, but with visiting guests from other linguistics departments at Oslo, and from the Universities of Jyväskylä (Finland) and Lund (Sweden). The emphasis was on corpus linguistics and encoding; between twenty and thirty staff and research students attended over the three days of the conference.
Willard McCarty from the University of Toronto's Centre for Computing in the Humanities began the first day with a detailed presentation of his forthcoming electronic edition of Ovid's Metamorphoses, which continues to be a fascinating example of just how far the humanities scholar can go with an ad hoc encoding scheme. I then gave the usual rapid canter through the TEI Guidelines, their milieu and architecture, which gave rise to some quite useful discussion before we broke for a substantial lunch. In the afternoon, Willard and I spent some time in the Text Laboratory, trying to install the very first BNC starter set (in my case) and checking email (in his). The Lab has a large Unix fileserver (some kind of DEC machine, since it runs Ultrix), and a room full of Windows PCs and Macs connected to it via ethernet. We saw no-one else trying to use the equipment while we were there, but the Lab has only just begun operations.
On day two of the conference, Willard gave a talk which began promisingly by outlining the history of concordancing and concordances, from the middle ages onwards, but then became an overview of the features of TACT, which did little to improve my opinion of the design of that loose baggy monster of a concordance program. I then gave the usual rapid canter through the BNC, which aroused considerable interest. There were several intelligent questions about the design and construction of the corpus, and the accuracy of its linguistic tagging. I was also able to do my bit for the European Union by pointing out that a "no" vote in Norway might make it more difficult for us to distribute copies of the BNC there (the day before I arrived the Swedish referendum had confirmed Swedish membership; while we were there, rival campaigns on either side of the Norwegian referendum were in full swing).
During the afternoon, Willard and I were (independently) ensconced in offices to act as consultants for a couple of hours: I spent most of my time reassuring a lady from the German department that the TEI really could handle very simple encodings as well as complex ones, and rehearsing with her the TEI solutions to the usual corpus-encoding problems. Oslo is collaborating with Finnish and Swedish linguistic researchers in the development of a set of bilingual corpora (English-Finnish, English-Norwegian, and English-Swedish), so I also spent some time discussing and reviewing the project's proposed usage of the TEI Guidelines. Bergen and Oslo have developed a procedure for automatically aligning parallel texts in English and Norwegian, which appears to work reasonably well, perhaps because the languages are not so dissimilar. I rather doubt whether automatic alignment of English and Finnish will be as easy, but the Finns seemed quite cheerful about the prospect. In the evening we were taken out for a traditional Norwegian Christmas dinner, comprising rotten fish, old potatoes, and boiled smoked sheep's head, washed down with lots of akvavit: not as nasty as it sounds, but twice as filling.
The final day of the conference began with an excellent talk by Doug Biber, from Northern Arizona University, describing the use of factor analysis in the identification of register within a large corpus of materials in three languages (English, Korean and Somali). Biber's use of statistics is persuasive and undogmatic; the basic method was outlined in his book on speech and writing (1988) but its application to cross-linguistic (or diachronic) corpora is new and provoked considerable discussion.
This was followed by my swan song at the conference, a real seat-of-the-pants, nail-biting event, being my first ever attempt to describe and then demonstrate the BNC retrieval software running (on Willard's laptop) live and in real time. As a result of careful pre-selection and late night rehearsal, I'm relieved to say that the software did not crash once, though my ability to control Willard's laptop's track-ball in public was frankly pitiful. SARA herself attracted a favourable reaction, in particular because of the system design. Interest was expressed in the idea of extending her functionality to cope with the display and searching of parallel TEI-encoded corpora: not a task I think we will be undertaking ourselves in the near future.
This was a relaxed but far from vacuous three days, with ample opportunity for discussion and debate in pleasant surroundings. Sincere thanks are due to my host, Stig Johansson, and his department for arranging it and funding my participation.
As originally proposed at ACH-ALLC in Washington earlier this year, Peter Flynn of the Curia Project at the University of Cork organized a two day meeting with the general aim of creating a dialogue between the TEI and the developers of the World Wide Web, one of the most rapidly growing computer systems since the Internet itself. WWW is a distributed hypertext system running at some improbably large number of sites worldwide, which uses a very simple SGML tagset called HTML (it has been rather unkindly characterized as "Pidgin-SGML"). WWW itself consists of a markup language (HTML), a set of Internet protocols (FTP, HTTP etc) and a naming scheme for objects or resources (the "Universal Resource Locator" or URL). A number of browsers are now available which use these components. Mosaic, developed at NCSA, is probably the most impressive: running on Mac, X and Windows, it offers a fully graphical interface with just about everything current technology can support. Lynx, developed at the Computer Science dept at U of Kansas, is at the opposite extreme, assuming only a VT100 (there is also a WWW mode for EMACS!). I will not attempt here to describe WWW in operation. Web browsers are freely available by anonymous FTP all over the place: if you haven't tried it out already, and can't see what all the fuss is about, then you should stop reading now, get yourself a browser and do so forthwith.
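For anyone who has not yet seen HTML in the raw, a minimal document (my own sketch, not an extract from the spec) looks something like the following; the only URL in it is the Curia one which also appears in the meeting notes below, and everything else is purely illustrative:

    <title>A minimal HTML document</title>
    <h1>Hello, Web</h1>
    <p>This page contains one hypertext link, to the
    <a href="http://curia.ucc.ie/curia/doc/tei.html">TEI overview</a>
    on the Curia server; following it makes the browser fetch the
    target document, typically over HTTP.

The markup says only what the pieces are (a title, a heading, a paragraph, a link); everything about how they look is left to the browser, which is what makes the same document usable by Mosaic, Lynx and the EMACS mode alike.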
The two day meeting was attended by Chris Wilson (NCSA); Lou Montulli (Lynx, U Kansas); Bill Perry (EMACS, Indiana University); Dave Raggett (Hewlett-Packard; HTML+) and myself for TEI. Various representatives of the Curia project, notably Patricia Kelly from the Royal Irish Academy, were also present. I gave a short presentation about the TEI, focussing mostly on contextual issues but also including some detailed technical stuff about bases and toppings and X-pointer syntax, which seemed to be well received. Dave Raggett then talked us through the current HTML+ draft, which started off a very wide-ranging discussion. This continued during the second day of the meeting, but was at least partially nailed down in the shape of a brief report (see below) which should be somewhere in the Web by the time you read this.
To their credit, most WWW people seem painfully aware of the limitations of the current HTML specification, which was very much an experimental dtd hacked together in haste and in ignorance of the finer points of SGML (or indeed the blunter ones). HTML+, which Dave Raggett has been working on for the last year or so, attempts to improve on it without sacrificing too much of its flexibility. This draft will eventually progress to Internet RFC status; there is also talk of an IETF working group co-chaired by Raggett and Tim Berners-Lee (of CERN, onlie begetter of the Web) to steer this process through.
The Cork meeting was an interesting opportunity for the developers of three of the major Web browsers to meet face to face and argue over some of the design decisions implicit in the HTML+ spec. To some extent this did happen, though the discussion was rather anarchic and unstructured. It was also a good opportunity for the TEI to encourage development of HTML+ in a TEI convergent manner, and this I think was achieved. Several of the changes accepted, at least in principle, will make it much easier to transform TEI documents into HTML, if not vice versa. Some practical issues about how WWW should handle TEI conformant documents were also resolved.
Outside the meeting, this was also a good opportunity to find out more about the Curia project itself. My hasty assessment is that this project still has some way to go. There is a clear awareness of the many different ways in which it could develop, and a tremendous enthusiasm. I think the project would benefit from some detailed TEI consultancy before too much more P1-conformant material is created. It also offers interesting contrastive opportunities with other corpus-building activities, chiefly because of its enormous diachronic spread and its polyglot nature.
Lou Burnard, Cork, 21 Nov 93
========= Concluding statement of the WWW/TEI Meeting follows ==========
<!-- This uses the HTML dtd -->
<title>WWW/TEI Meeting</title>
<h1>Notes from WWW/TEI Meeting</h1>
<h3>Action Items/Recommendations</h3>
<list>
  <li>HTML 1.0 should be documented to define the behavior of existing browsers, and should be frozen as agreed upon at the WWW Developers' Conference.
    <list>
      <li>Features to be documented, implemented and specified include collapsing spaces, underline, alt attribute, BR, HR, ISMAP...
      <li>HTML IETF spec needs to be updated by CERN, as well as existing documentation
    </list>
  <li>HTML+ future browsers need not support HTML 1.0 features after a reasonable amount of time. As an aid in transition, the HTML+ spec/DTD will not include any deprecated features of HTML 1.0.
    <list>
      <li>HTML 1.0 deprecated features
        <list>
          <li>nextid
          <li>method, rel, rev, effect from <A> tag (but not from the <LINK> tag)
          <li>blockquote --> quote
          <li>There was a feeling that the <img> tag will be superseded by the <fig> tag, although its deprecation was not agreed upon.
          <li>menu list --> ul
          <li>dir list --> ul
        </list>
    </list>
  <li>The intention of HTML+ is to support generic SGML-compliant authoring tools, and authors are recommended to use this software with the HTML+ DTD for the creation/maintenance of documents.
  <li>Browsers may implement different levels of HTML+ conformance.
    <list>
      <li>Level 0 implementation
        <list>
          <li>HTML 1.0 spec referenced above
        </list>
      <li>Level 1 implementation
        <list>
          <li>Partial fill-out forms
          <li>New entity definitions (in section 5.1 of HTML+ draft)
        </list>
      <li>Level 2 implementation
        <list>
          <li>Additional presentation tags (sub, sup, strike) & logical emphasis
          <li>Full forms support (incl. type checking)
          <li>Generic emphasis tag
        </list>
      <li>Level 3 implementation
        <list>
          <li>Figures
          <li>NOTEs and admonishments
        </list>
      <li>Further levels to be specified
    </list>
  <li>Authoring tools are expected to conform to the HTML+ DTD and are <b>NOT</b> to support deprecated features.
  <li>We expect the HTML+ DTD to be developed incrementally. The HTML+ internet draft will make clear which features are now stable and which are still subject to change. The DTD will be structured to reflect this.
    <ol>
      <li>HTML+ will work with the SGML reference concrete syntax.
      <li>The entity sets will be user-specifiable (in the long run).
      <li>HTML+ will support nested divisions or containers.
      <li>There will be a number of new features
        <dl>
          <dt><b>Figures & Images</b>
          <dd><fig> may be able to subsume the role of <img>.
          <dt><b>Generic highlighting tag</b>
          <dd>The <em> tag will be used with a set of three or four defined attributes to present a guaranteed-distinct presentation of these attributes.
          <dt><b>Generic roles</b>
          <dt><b>Support for undefined elements</b> (user extensions) (render)
          <dd>
          <dt><b>Tables</b>
          <dd>This is now stable.
          <dt><b>Math</b>
          <dd>for research
        </dl>
    </ol>
  <li>HTML/TEI
    <list>
      <li>It was felt the correct way to convert between TEI and HTML was to do it on the server side using a conversion filter.
      <li>This server will also provide a hypertext link to download the raw TEI text.
      <li>We (WWW developers and TEI people) will strive together to converge functionality between HTML* and TEI, as well as to produce this server/filter system.
    </list>
  <li>Links to:
    <list>
      <li>HTML spec
      <li>HTML+ spec
      <li><a href="http://curia.ucc.ie/curia/doc/tei.html">TEI overview</a>
    </list>
</list>
One benefit accruing from my attendance at the EAGLES meeting in Paris was that I got some more information about what exactly is supposed to be going on in EAGLES. I also doubled the size of my collection of NERC reports by acquiring a copy of a very sensible TEI review produced by Dominique Vignaud and Pierre Lafon.
There are five EAGLES groups: on Resources, Language Corpora, Speech, the Lexicon, and Evaluation & Assessment (with which Henry Thompson is involved). The group on language corpora has five subgroups. One is chaired by John Sinclair on text typology, and I have no information on its progress other than verbal rumblings from Birmingham. One is on linguistic annotation and is chaired by Geoffrey Leech who has circulated a detailed work paper which rather overlaps the work presented in AI2 W2, but has some interesting differences. A third is supposed to address something called "text representation issues" and is chaired by Gregoire Clemencin of GSI/ERLI; the Paris meeting was called to discuss a work paper submitted to this group by Nancy Ide and Jean Veronis. There seemed to be some doubt about the responsibilities of the two remaining subgroups: one concerns documentation and distribution, for which Pisa is responsible; the other concerns "tools" for which Wolf Paprotte at Muenster is fingered.
Members of the TR subgroup are Veronis, Clemencin, Ramesh Krishnamurthy from Birmingham and Henry Thompson from Edinburgh. Nancy Ide was also present at the meeting, officially as a TEI spokesperson; Ramesh was unable to attend; Antonio was also unable to attend, through illness (he did, however, speak to us from his hospital bed by telephone during the afternoon; the word is that he should be back at work next week).
EAGLES groups have two years in which to produce their recommendations; a mid-term report (to be jointly edited by Nicoletta Calzolari and Jock McNaught) is due in April 1994, with final reports one year later. I'm not sure whether this report is supposed to cover all EAGLES activities, or just those of the Corpus group, but I believe the latter. The TR subgroup plans to complete its input to this report by February, and the document previously circulated by NI/JV was a first stab at producing a draft of it.
The task of this workgroup was to provide a set of recommendations for the encoding of corpora, based on a review of the TEI. Most of the meeting was spent reviewing some of the formal decisions already taken by the TEI with a view to generality of application; I will not repeat them here since I am sure they will also appear in the minutes of the meeting. The chief point, made frequently, was that it was up to this workgroup to pin down those generalities and recommend specific solutions. There was a clear understanding of the extension and modification mechanisms provided within the TEI framework, and an evident willingness to apply them to the task in hand. It was less clear whether EAGLES would decide to propose a single 'boiled-down' TEI dtd, or instead a package of customizations of the full TEI.2 dtd.
It was at least tentatively agreed that EAGLES conformance implied two things: first, adherence to a specific set of editorial principles (HT likes to call these "invariants") governing such things as punctuation, normalization etc., as documented in the TEI <editDecls> or some similar list of specifications; and secondly, use of an SGML dtd which could be derived in a TEI-conformant way from TEI.3. The perceived value of SGML was in the validation it offered, which goes some way to explain the concern expressed in the NI/JV document about "polysematic combinatorics" etc.
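To make the first of these a little more concrete, the kind of editorial declaration envisaged might look something like the following sketch. This is my own illustration rather than anything tabled at the meeting, and the element names (which have varied between TEI drafts) should be taken as illustrative only:

    <editDecls>
       <p>Punctuation is transcribed as it stands in the source;
          end-of-line hyphenation has been silently removed.
       <p>Spelling has not been normalized: apparent errors are
          transcribed as found.
    </editDecls>

An EAGLES-conformant corpus would then pair a declaration of this sort with a dtd derived from the TEI one by the documented modification mechanisms.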
Unfortunately I had to leave the meeting before it finished (someone was on strike at Roissy) so I do not yet know whether (or how) the work group plans to descend from these general preliminaries to the tough task of actually deciding whether to recommend <div> or <div1> or either, what sort of <w> tag to use and so on. With my BNC hat on, it seems to me that we could possibly offer some help in this respect; with my TEI hat on, it seems to me essential to monitor closely the work of this group, as it tests the viability of the TEI scheme in a variety of new application areas.
The Network Services Conference is an annual get-together organized by the European Association of Research Networks (EARN, aka the European end of Bitnet). I was invited to the first NSC, held last year in Pisa, but was unable to attend; I therefore felt morally bound to accept the invitation to this year's conference, held last week in Warsaw. Poland has recently (in the last six months) become a full member of EARN and has good connexions with the outside world, though not, as yet, much of an internal communications infrastructure. For the duration of this conference, however, happy networking nerds from the whole of Europe, including many from countries of the former Soviet Union, were to be found pumping email messages from the room full of terminals thoughtfully provided by a number of local computer firms, through Poland's new 60 megabit gateway and thence to homes in Georgia, Russia, Belarus, the Ukraine, Germany, Italy, France, the Netherlands, the UK and even the US.
The need for "quality" networked information was one recurrent motif of the conference; another (intimately linked with it by some) was the need for commercialization of network facilities. To quote the first speaker I heard: "we need to find a way for people to be paid for putting information on the internet". By contrast, there was mercifully little gung-ho technobabble, though the closing speaker did assure us that "every human being on the planet" would have access to the network by the end of this century, and that future conferences would require the use of virtual reality helmets. In its place, there was a surprisingly stubborn anglo-saxon stress on "bringing services to end-users".
The session on training began with a presentation on the UK's mailbase service, funded by JISC and hosted at Newcastle: its unsurprising conclusion was that electronic communities need a lot of hand-nurturing and training, a theme picked up by another speaker from Newcastle, who described an ITTI project which is producing generic training materials for network awareness. How not to do it was well exemplified by the third paper in this session, which described a prototype international EARN helpdesk: having relied on member states to provide publicity for its services, this was singularly underused.
Favourite software of the conference was undoubtedly Mosaic, which was demonstrated live in a bravura display of networking nerve by Robert Cailliau from CERN: pointing and clicking from Honolulu to Oslo to Israel before our very eyes. Mosaic is also probably the nearest thing we currently have to the "User network interface to everything (UNITE)" which was proposed by an equally charismatic speaker from Northern Ireland, while mention of it also elicited my favourite mixed metaphor of the conference from the speaker (George Brett) who said that although it was essential to keep pushing the envelope and going in at the deep end, it was necessary at the same time to keep the VT100 user on board (who is this user? I think we should be told).
Yet another Newcastle speaker (Jill Foster) drew a rather neat analogy between the current state of network publishing and the Sorcerer's Apprentice episode in Walt Disney's Fantasia. In this analogy, the techies currently trying to cope with the unmanageable explosion of networked information are Mickey Mouse contending with all those mops and buckets of water, while the librarians are the sorcerer deus ex machina who will finally restore order, by applying their classificatory, cataloguing, filtering, and archiving skills.
Other sessions I attended included a German speaker on something called UDINE which sounded like another universal interface, but with the added attraction that it would take advantage of ATM (nothing to do with holes in the wall, this stands for 'asynchronous transfer mode' and is apparently the very next thing in broadband ISDN). A Norwegian speaker described an interesting experiment in creating a distributed archive of photographs. An American speaker droned on about the problems of running a Listserv discussion group. In the same session, I missed (unfortunately) hearing the legendary Eric Thomas who was promoting a new user guide explaining why Listserv is the true and only path to righteous network communication. Rich Giordano flew in from Manchester just in time to give a thoughtful paper about the need for interactivity in information-generating communities, and how poorly current models of electronic libraries supported it, which had the unusual distinction of citing some real evidence.
I spoke third in a session on "publishing on the network", shared with Terry Morrow (who runs the BIDS service at Bath) and Ann Mumford (who chairs the UK SGML Project and is a CGM whizz). The SGML banner, in particular the TEI flavour, seemed to go down reasonably well -- everyone agreed that the best way of improving the "quality" of networked information was to improve the richness of its markup.
On the whole, however, I did not find the papers presented here particularly impressive. The real business of the conference was evidently being conducted in a number of official and not-so-official networking special interest groups and workgroups meeting there. Not being a member of RARE or IETF or ISUS or others, I had little opportunity to participate in these politickings; I was also disadvantaged by the fact that I had been placed in a monstrous hotel some 20 minutes' walk from the conference venue. I did however spend some time impressing on George Brett (from CNIDR) the importance of the TEI Header as a starting point for one Internet Task Force which is supposedly coming up with proposals to improve Gopher's usability by providing associated documentation files. If the TEI Header can't be used for that purpose, there's something seriously wrong with it -- and likewise, if the IETF Task Force so charged completely ignores the TEI proposals. This is something which perhaps others in the Steering Committee with the ear of CNIDR might like to pursue (this means you, Susan H!).
Secondly I spent some time bending the ear of Robert Cailliau (apparently a largeish cheese --or spider?-- in the World Wide Web) on the subject of what's wrong with HTML. Unlike most HTMLers I've met, this one at least was ready to listen, and something may just come of it.
Lou
The SALT Club is an informal group of researchers, from both academic and commercial sectors, with a common interest in Speech and Language Technology (nothing to do with Strategic Arms Limitation). It is organised and minimally funded by the UK Government's Department of Trade and Industry, as a side effect of the funding it provides for research work in this area together with the Science and Engineering Research Council (SERC), under something called the Joint Framework for Information Technology (JFIT). Aside from a useful bulletin distributed to JFIT-funded projects, SALT organizes an annual workshop and provides a nexus of useful contacts. It also constitutes an expert group with some influence on government priorities for research funding in this area.
This year's SALT Club Workshop was concerned with Multimedia. Under the rather grand title "Paradigm Shift in Speech and Language Technology: Integrating with other Media", this two-day event combined a series of carefully-chosen plenary presentations and a number of small working group sessions, collectively addressing the issue of what, if anything, there might be of interest to SALT hackers in multimedia and vice versa.
The workshop began with a presentation from Graham Howard about the US Art of Memory project, which (due to the hostility of London Transport) I missed. I did however arrive in time to enjoy Barry Arons' (MIT Media Lab) round-up of work concerned with hypertext-like ways of interacting with recorded speech, ranging from intelligent telephone answering machines to 'virtual conversations'. The technically interesting part of this concerned the design of an appropriate user interface for a database of sound bites accessible by voice only. The database consisted of 13 minutes of monologue from five speakers, gathered over the telephone, which Arons had analysed into a network of some 80 discrete nodes, linked by 750 links of various types (e.g. 'summary', 'detail', 'supporting argument' etc.). It could be queried using a simple 17 word command language. The technology was impressive and sophisticated but mechanically organized snippets of spoken language cannot really be said to be conversations, although Arons did make passing reference to Grice's work in discourse analysis as providing useful models for how 'reports' from such a database might be organized.
This was followed by a more low-key presentation from Paola Fabrizi of the RNIB about the Electronic Daily Newspaper. I found this easily the most impressive project described in terms of real-world applications of multimedia. Every night, while the compositors are busy setting up the printed text of the Guardian, an electronic version of the same text in a specially encrypted form is broadcast via the unused lines in teletext transmissions. Subscribers download the text into their PCs overnight, and can then `read' the paper at the same time as their sighted neighbours, using a voice synthesizer, large character display or transitory braille display. Paola demonstrated the system with a bottom-end-of-the-market speech synthesiser, which came as something of a shock after the realism of MIT's synthetic voice, but the simplicity of the interface had much to commend it. Users could scroll and search, and switch between paragraph, line, word and letter mode, the latter being useful to spell out unfamiliar words. Selected stories or portions of them can be pasted to a scrapbook etc.
Janet Cauldwell (OUP) presented the electronic OED of which any further praise would be otiose. I shall therefore remark only that this product is so wonderful its presentation and marketing could be safely left to idiots.
After lunch, we were organized into five work groups, on a variety of occasionally overlapping topics (organizational structures within language; database interfaces; use of nonlinguistic information in linguistics; educational applications; communications facilities). Each group had an assigned leader, a small number of thought-provoking presentations and a rapporteur charged with making proposals for ways in which new SALT-related activities could be realigned within a multimedia paradigm. I joined the first group, and enjoyed brief presentations from Adam Kilgarriff and Lynne Pemberton, both, in different ways, concerned with ways of structuring complex texts. The group invented an interesting application area: an intelligent agent capable of summarising or expanding the information content of individual components of a multimedia system as well as identifying and categorising the links between them. I waxed lyrical on the wonders of HyTime, TEI etc, and found myself appointed rapporteur for my pains.
Day two began with a brief hectoring from Gerry Gavigan of the DTI, the gist of which was that the SALT programme was likely to be axed if there was not a more enthusiastic response to its next call for proposals, provoking a certain amount of muttering about the difficulty of persuading hard-pressed commercial partners to join and the complexities of the application procedure. This was followed by a video presentation displaying results from an ESPRIT project called MMI-squared (for MultiModal Interfaces for Man-Machine Interaction) which I found slick but curiously uninventive. The previous day's workgroups then resumed.
Two further plenary presentations addressed topics of some relevance to TEI interests. One, from Arnold Smith (SRI), discussed the ubiquity of modelling techniques (in dbms, CAD, spreadsheets, robotics etc) and the need to achieve mappings between them, which, he opined, could be done by something called "abductive equivalential translation", of which I understood just about enough to see similarities with the use of SGML: translating models clearly requires the addressing of semantic issues rather than simple data format conversion. It also facilitates re-use of encoded knowledge: as an unexpected application area he mentioned integrated manufacturing systems, for which he claimed that language translation methods were directly applicable. The second, from John Lee (Edinburgh), talked at some length on the different communicative aspects of different modalities and appropriate bases for choosing amongst them, for which he proposed 'specivity' as an interesting metric. Like Smith (and others) he saw the significant contribution that SALT could make in terms of the application of a discourse model.
After lunch the rapporteurs were invited to present the findings of their groups. These are to be edited up into a report for the DTI, which I will circulate to anyone interested when it is completed. The information handling group had seemingly decided that linguistic description was a better way of querying large databases (specific instances included images and distributed dbs) than graphical methods. Re-expression of queries in different modes was a good way of checking that they were being correctly interpreted. Their new project was a voice input/recognition system for attaching annotation to large quantities of images or paperwork. The Education workgroup, on which Nick Ostler reported, had surveyed a number of interesting applications, but had chiefly highlighted a checklist of the problems of multimedia development (high cost, dangers of rigidity, difficulty in carryover, copyright problems etc). The workgroup on 'nonlinguistic issues' (Chris Mellish reported) had also identified a large number of problem areas and research opportunities, from which I recall as particularly interesting the following: combining automatic lip-reading with speech recognition (useful in noisy discos); research on the semantics of body language and gesture; and the application of linguistic analysis to existing graphical interfaces. The telecoms group, reported on by Martin Crossley (BT), began by positing the widespread availability of a number of emergent technologies (wideband networking into the home, videophones, teleconferencing, teleworking, teleshopping etc). In the short term they predicted speech-controlled telephones with more intelligent conversational interfaces; in the medium term, better interactive educational tools (necessary for any expansion of distance learning). In the longer term they foresaw proper multimodal access to distributed computer systems, linguistic interaction with intelligent search agents, and intelligent topical indexing of video data.
The workshop wound up, like many others, with discussion of ways in which funding might be obtained for further SALT/multimedia research, with five possible models being proposed. The DARPA model, in which there is a predefined goal for which consortia are invited to compete, with payment contingent on their achieving specified targets, was liked by some as much as it was disliked by others. Alternatives included the notion of a centrally funded institute and more precisely defined contractual arrangements (as in LRE). There was a familiar call for outreach to other related communities. Ostler closed with a review of the current status of the SALT Club itself, now no longer bankrolled by the DTI, it transpired, but by his own company.
I travelled to Paris at the invitation of Francois Chahuneau, MD of AIS, to see the prototype of the system his company has developed on behalf of the Bibliotheque de France, in connexion with their ambitious PLAO (Poste de Lecteur Assiste par Ordinateur) scheme, on which I reported last year. The prototype is for a scholarly workstation, which brings together a lot of interesting ideas about how scholars interact with electronic text, both in transcribed form and as digital images. Essentially it provides an integrated environment for the management of texts, including their annotation, closely modelled on traditional scholarly practice. The software runs on SPARC stations under X Windows and uses PAT as its main retrieval engine, which means that its performance is very impressive. Texts, both transcribed and in image form, can be organised into (possibly overlapping) typed logical 'zones', annotated and given hypertextual links. Text and text image can be synchronised, though only at a relatively coarse level. The texts themselves are read-only, while annotations and structuring information are dynamic, as is rendering. It uses SGML (of course) though with a very simple dtd based essentially on the use of typed milestones to mark zone boundaries, but can take advantage of whatever markup is present in a text. I had supplied Chahuneau with a TEI-style marked up text which he was able to import directly into the system, with impressive results. The query language used is particularly powerful, and takes full advantage of the structuring capabilities of SGML. The prototype will also be demonstrated at the Waterloo conference next month, where I expect it to arouse considerable interest: it combines much of the functionality of Dynatext with the power of PAT and the user-friendliness of Lector. Licensing and distribution arrangements are not yet clear, but it looks as if it will be considerably cheaper and more 'open' than any of those products. AIS are also planning to release a general-purpose toolkit for converting SGML texts into other formats, known as Balise 2, which will sit on top of the public domain SGML parser sgmls: this looks particularly interesting.
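As an illustration of the milestone approach (the element names here are invented for the purpose; the actual AIS dtd was not shown to me), zone boundaries can be marked with empty elements paired by identifier, which is what allows zones to overlap freely in a way that ordinary containing elements cannot:

    <!ELEMENT zone     - O EMPTY>
    <!ATTLIST zone     id    ID     #REQUIRED
                       type  CDATA  #IMPLIED>
    <!ELEMENT zoneEnd  - O EMPTY>
    <!ATTLIST zoneEnd  ref   IDREF  #REQUIRED>

    ... running text <zone id="z1" type="commentary">text inside the
    commentary zone <zone id="z2" type="quotation">text inside both
    zones <zoneEnd ref="z1"> text inside the quotation zone
    only <zoneEnd ref="z2"> and text outside both zones ...

Because the boundary markers are empty, the parser never has to reconcile the overlapping start and end points, and the zones themselves remain purely an interpretation layered over the read-only text.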
While in Paris I spent an afternoon with Dominique Vignaud who has been commissioned by Quemada and Tournier to assess TEI proposals for the encoding of corpora as the main French contribution to the NERC project. Vignaud is one of France's leading SGML experts (she was responsible for an attempt to create a French version of the AAP standard and also for my favourite expansion of 'SGML' -- Surement Genial Mais Laborieux) and so I was much heartened by her enthusiastic praise for the general design principles of the TEI. We discussed in some detail the difficulty of reconciling the incompatible goals of different research projects within an effective interchange framework. The 'base plus topping' method advocated by the TEI seemed the best theoretical solution, although for practical reasons it seemed likely that a 'lowest common denominator' approach would be followed, with Vignaud proposing as candidate for standardisation a simple subset of TEI recommendations, not dissimilar to that used by the BNC, on which I was also able to bring her up to date.
The University of Essex hosted an interesting one-day workshop with the title `Social History: the challenge of Technology' on June 1st. It adopted a format new to me, in which ten invited speakers were allowed a maximum of five minutes to highlight key issues in their previously-circulated papers, followed by a 15-minute commentary on each pair of papers by an invited `discussant' and a general discussion. The programme was carefully arranged to include five complementary pairs of papers, and the whole affair proved remarkably successful in generating a fruitful and stimulating exchange of views amongst the hand-picked participants, most of whom were computing historians, data archivists or similar.
The first session dealt with the creation of `public use data sets', with two speakers (Steven Ruggles from Minnesota and Liam Kennedy from Queen's, Belfast) presenting basically similar projects in the integration and harmonisation of large-scale existing data sets for re-use, in Ruggles' case derived from US census data over a large number of years, in Kennedy's a whole spectrum of 19th century Irish statistical data. Both called attention to the very practical difficulties of harmonising the differing analytic preconceptions of the original data collectors, while stressing the need to make the datasets (which Ruggles described as a `national treasure') more accessible and user friendly. Prof Michael Anderson (Edinburgh), in commenting, made several practical points from the ESRC viewpoint, which he characterised as sceptical about the usefulness of secondary analyses. He stressed the need to set realistic and achievable targets, the importance of making explicit the theoretical basis for combining datasets, the need for long-term institutional support and the difficulties of making such data sets user-accessible for casual enquiry. The discussion indicated wide support for these concerns, though Ruggles stoutly defended the general usefulness of public use datasets in social science, claiming that in the few cases where available they were the most widely used resources. Several of the historians present expressed anxiety about the difficulty of linking datasets using different `codebooks'. Kennedy noted the `spurious consistency' of terminology such as `general labour' in occupational classifications; Bob Morris (Edinburgh) pointed out that this variability reflected important historical differences; Dan Greenstein (Glasgow), noting that historical sources provided an `opaque window onto the past' as well as being objects of interest in their own right, stressed the need for explicit data modelling of the researcher's interpretations of them. It occurred to me that many of these problems were strikingly similar to those faced by linguists trying to unify different linguistic annotation schemes, and might therefore benefit from the kind of approaches currently being discussed within TEI AI1.
The second session dealt with the need for standardisation in data collection and interchange. My paper gave a brief summary of the TEI, stressing its attempts to avoid prescriptivism, and promising great things for SGML as a powerful notation system. Manfred Thaller (Gottingen) distinguished four levels of description appropriate to historical material: numerical data, factual data, running text and bit-mapped images. Interpretation differed at each level but all had a common core of problems. He shrewdly observed that standardisation was becoming more difficult as researchers tended to define themselves by a particular technology. A successful standard must be descriptive, based on a conceptual analysis rather than any technology, but could only succeed if it was backed up by well-designed and acceptable technology. Responding, Peter Denley (Westfield) stressed that standardisation was not a mere abstraction, and deplored the lack of recognition given in research culture to collaborative effort. While the TEI's proposals needed to be made more accessible to the non-professional, there was a danger that computing historians were about to propose a `third pope' if they continued to ignore the very real and highly relevant progress made by other research communities, while their datasets would be increasingly marginalised. This view seemed to have general acceptance, though some, notably Anderson, insisted that quick and dirty methods would always prevail in the long run.
The third session dealt specifically with the role of data archives in historical research, with papers from Paul de Guchteneire (Steinmetz Archive, Amsterdam) and Hans-Jorgen Marker (Danish Data Archive, Odense) and a particularly useful synthesis of them from Bridget Winstanley (ESRC Archive, Essex). De Guchteneire remarked on the highly skewed usage distribution of archived material (very small numbers of items being used very frequently), the archives' reluctance, through lack of resources, to provide ancillary support facilities, the need to preserve data sets currently being produced by government and other agencies, and the need to formalise citation of datasets. Marker, picking up Thaller's four-fold characterisation of historical datasets, agreed that most archives were really only capable of dealing with survey-type material, and that their methods might not be generalisable. Winstanley reiterated the need for standardisation in citing and cataloguing datasets; addressing the palpable discomfort of historians with the social science model of archiving, she stressed the need for proper bibliographic description and control. In the ensuing discussion, I drew attention to the TEI's recommendations for bibliographic description, and noted the interesting contradiction between an archive's dual role as repository (foregrounding idiosyncrasy) and as source of reusable resources (foregrounding integration). Anderson made the good point that traditional social science archives were about to be engulfed by the results of `qualitative' surveys, which could mean a narrowing of the distinction between textual and numeric data.
In the fourth session, dealing with the use and potential of online historical databases, Humphrey Southall (QMW) described the trials and tribulations of providing JANET-wide access to a large Ingres database of information about local labour markets via a Hypercard-based front-end, and Heiko Tjalsma (Leiden) described a project called Chronos which provides integrated access to a variety of SAS datasets and their documentation. Both talks tended to concentrate overmuch on the technical details of networking, which also dominated the discussion. The rapporteur, Don Spaeth (Glasgow), contrasted traditional mainframe-based online services with the workstation model, the provision of networked access to CD-ROMs etc., and the discussion tended to drift into rather ill-informed speculation about likely technological changes, though Eric Tannenbaum (Essex) (who was unfortunately present only for this session) did make the interesting observation that it would be the need of environmental and other researchers for terabytes of data within seconds which would determine the likely development of new networks.
The last session of the day dealt with the topic of IT in the teaching of history. Frank Colson (Southampton) gave an optimistic account, based on the impressive success record of his HIDES system, which is designed to complement rather than replace traditional library resources. Deian Hopkin (Aberystwyth) was less sanguine: his analysis concentrated on the impossibility of funding new IT-based teaching methods without proper institutional and financial support. As Rick Trainor (Glasgow) pointed out, the papers were nicely complementary, in that the teaching methods described by one were precisely those which present funding arrangements made difficult or impossible to provide. HIDES had been justly praised for the way it enhanced the traditional teaching role by providing students with the opportunity to carry out systematic analysis of major problems, on their own terms, while the National Curriculum appeared to wish to undermine or undervalue that very potential by using IT as a low-cost way of providing transferable skills. Computer-based teaching was an effective way of bringing about a closer rapprochement between teaching and research as well as of breaking artificial discipline-based distinctions. He felt that it was better to focus on the institutional problems preventing its wider acceptance than on the purely technological ones, and that the methodological differences between social science and history had been overstated.
In the following discussion, Morris probably spoke for several when expressing disquiet at the way the morning's discussions of the technical possibilities and the intellectual challenge that they posed seemed to have gone adrift. He feared that technophoria would distract from the fact that only some models of enquiry were well served by IT. For Hopkin, IT was of importance because it enforced a `confrontation with the data' and reminded historians of the need for a methodology to handle that confrontation. Greenstein agreed on the crucial importance of a formalised model of enquiry (for which he gracefully gave credit to the TEI), and noted that without one teaching history (as opposed to transferable skills which could be got anywhere) was very difficult indeed. Spaeth remarked that revolutionary fervour had been the downfall of the quantifying school of historians and agreed that the misuse of IT for its own sake led to poor teaching practice.
I came away very favourably impressed by the format of the day: tightly-focussed small group discussions can sometimes be a little incestuous, but the programme had been carefully arranged to provide ample scope for controversy, and some fruitful argument had ensued. From the TEI perspective, I was particularly encouraged by the evident willingness of the computing historians to confront standardisation problems at the heart of the TEI agenda, and to restate them in their own terms.
LB June 6th 91
The Dutch SGML Users Group hosted a two day international conference in Amsterdam 16-17 May under the general title `SGML Update: consultancy, tools, courses'. This attracted over a hundred delegates, by no means all from the Benelux area, though mostly from European publishing and software houses. There were two keynote speakers (Sperling Martin for the AAP, and myself for the TEI), about a dozen presentations from manufacturers or consultants and a well-arranged software exhibit in which all the major SGML software vendors were represented, with the conspicuous exception of Software Exoterica who had apparently had to withdraw at the last minute. There was ample opportunity for discussion and argument between presentations, over an excellent buffet lunch and in the evenings.
Sperling Martin, as one of the chief progenitors of the AAP standard, was happy to report that it was now in use by more than 25 major publishers, with a further forty planning to adopt it over the next twelve months. He gave brief overviews of three particularly successful applications on the fringes of conventional publishing. Firstly, the Association for Computing Machinery, which has just developed a five year strategic plan with the AAP standard at the centre of several dozen new print products, on-demand reprint facilities, optically stored databases, hypertext products etc. Perhaps more interestingly, the ACM plans to mandate the AAP standard as the interchange format of preference for its army of unpaid professional contributors, reviewers and referees in the future. Secondly, the Society of Automotive Engineers, which is adapting the AAP standard for use in something called a `Global Mobility Technology Information Center' or, in plainer English, a database of information about all sorts of transport systems. The interesting thing here was the convergence between SGML and object-oriented databases -- as well as manuals of technical information, SGML was being used as the vehicle for data to be transferred directly into CAD/CAM systems. Sperling's third AAP success story was a similarly hybrid development: a new legal database system developed for the Clark Boardman Company, providing integrated information services derived from legal journals, statutes and regulations, and a body of case law together with interpretation and annotation, usable by traditional print journals or electronic hypertexts. Of course, the AAP project had not been an unmitigated success: it had begun at a time when SGML was barely established, and some aspects, notably those concerned with maths, formulae and tables, have never been finished properly. Moreover, there are a few deliberate errors in the standard, introduced (said Sperling ingenuously) as `reader tests'. He also called attention to some image problems -- all too familiar to TEI ears -- such as the perceived conflict between TeX and SGML, or ODA and SGML, and the intimidating nature of SGML so long as its cause is left to the purists and the evangelists. Looking to the future, Martin predicted an increased awareness of SGML within the library community as a practical means of coping with the explosive growth of published materials, particularly in Science and Medicine. The AAP standard was to be assessed for suitability as a `non-proprietary information exchange vehicle' for electronically networked journals, by the 110-member Association of Research Libraries, under a scheme for which the National Science Foundation had recently provided $0.75m seed funding. His presentation concluded with some sound advice for those developing a strategic business plan in which SGML featured (concentrate on the business asset, don't expect technology to do everything, expect to spend at least $5 a page to get electronically tractable text...) and some predictions for future AAP work. A corrected version of the AAP standard would be re-submitted to ANSI and a summary of needed corrections to the published dtds would appear in EPSIG News at the end of this year.
Seamus McCague gave an impressively detailed description of two practical applications of SGML in work undertaken by his company, ICPC, a fifteen-year-old Dublin-based specialist typesetting company. One, for Elsevier, involved the production of about 100,000 pages of high quality camera-ready copy annually from SGML-encoded text; the other, for Delmar, the conversion of an existing reference book into an electronic resource. Details of the two projects provided interesting contrasts in production methods; they also showed how the SGML solution was equally applicable to operations of two very different scales. For Elsevier, the use of SGML greatly simplified both process and quality control, by facilitating the automatic extraction of data for the publisher's control database; for Delmar, it had made possible significant improvements to the product (a drug handbook) by automating the production of a variety of indexes.
Francois Chahuneau of AIS, the thinking man's Antoine de Caunes, gave a characteristically ebullient presentation about the relationship between SGML documents and database systems. He distinguished four characteristic modes of action: simple storage of documents in a database, where typically only a limited amount of header type information is visible to the database; database-driven document extraction, where documents are synthesized from information held in a database as a specialised form of report; tightly coupled systems in which highly volatile document and database systems share information; and the true document database in which all the information and structure of a document are represented by isomorphic database constructs, thus combining the well-understood strengths of database systems in such matters as concurrency control, security and resilience with the flexibility and multiple-indexing capabilities of document processing systems. As examples of this last mode, he then described in some detail two products: his own company's SGML-Search, which is based on PAT, and Electronic Book Technologies' Dynatext, and also demonstrated a beta-test version of the MS-Windows version of the latter. It uses an interesting scripting language based in part on DSSSL, which enables it to be configured to look more or less like anything, whereas SGML-Search is command-line driven, using a fairly rebarbative syntax.
The interface between SGML and database systems was also touched on by Jan Grootenhuis of CIRCE, the doyen of Dutch SGML consultancies. Speaking of his experience in teaching SGML, he remarked that people with a typographic background found SGML almost as difficult to understand as people with a computer science background found the requirements of typography, which struck a familiar chord. He then briefly described a recent project in which documents had been converted automatically into an Oracle database, using a database model defined by Han Schouten. The project had shown that database definitions could be automatically generated from a DTD; the complete suite of Oracle manuals, created as Ventura or WordPerfect documents, had been loaded into an Oracle-Freetext database, using SGML as an intermediary. He noted that the tendency of technical writers to use descriptive tagging to bring about formatting effects had made this task unnecessarily difficult, and argued for better enforcement of descriptive standards. He also outlined some experiences in using SGML for CD-ROM publication of journals at Samson, and of legal and other regulations published by the Dutch government, and the updating problems involved. His conclusion was that SGML was now past the point of no return. It was no longer being used in pilot projects only, but as an integral part of real work. Its use was no longer regarded as worthy of comment; moreover, because its evangelists were too busy doing real work to try to publicise it, the task was being taken on by professional teachers and educators.
The first day of the conference concluded with manufacturers' presentations. Tim Toussaint (MID) and Paul Grosso (Arbortext) gave a joint presentation. Toussaint revealed that MID, formerly Dutch and now German, is now 26% French. They used Arbortext as an SGML editor, and Exoterica's XTRAN to convert it for loading into an unspecified relational database. Applications included standard reference works such as the Brockhaus Duden and a database of standards documentation. Grosso gave a good sales pitch for Arbortext, which is a luxuriously appointed SGML editor intended for use primarily in an electronic publishing environment and described as non-intimidating and user-congenial. It includes a specialised WYSIWYG editor for tables and formulae from which AAP-conformant marked up text is generated, has good browsing and outlining facilities and its own script language.
Hugo Sleimer, European Sales Director for Verity (a spinoff from Advanced Decision Systems) gave a classy presentation of a product called TOPIC, the only relevance of which seemed to be that it supported a wide variety of document formats, including SGML. Much of his presentation dealt exhaustively with the problems of text retrieval by boolean logic, at a level which did not show much respect for his audience's intelligence. Tibor Tscheke, from Sturtz Electronic Publishing, was due to talk about his company's work in creating an electronic version of the Brockhaus Encyclopedia, but had unfortunately been forbidden to do so by Brockhaus. He was therefore reduced to some generalities about the role of information within an enterprise, the integration of SGML systems into mainstream information processing and so forth, which was a pity.
I opened the second day of the conference by summarising the current status of the TEI and discussing some of the technical problem areas we had so far identified, in particular those raised by historians and linguists for whom any tagging is an interpretation which must be defensible. This being the second time I had done it in two weeks, I managed to get through most of my material within a reasonable approximation to the time allocated me.
Yuri Rubinsky (SoftQuad Inc) gave an entertaining and wide-ranging talk, picking up some of the technical issues I had raised rather than simply presenting a product review, though he did mention in passing (and also demonstrated) that Author/Editor was now available under Windows and Motif as well as for the Mac. The theme of his talk was that SGML could be used to describe more than just documents, and that several of its capabilities were under-used. There was more to an SGML document than its element structure. Among specific examples he mentioned were customised publication, for example by extracting `technical data packages' geared to a specific maintenance task from CALS-compliant documentation in the Navair database; using attribute values to generate documentation at different user levels from a common source; an ingenious use of entity references within `boiler plate text fragments' in General Motors manuals; and the assembly of customised DTDs from sets of DTD fragments by a use of parameter entities strikingly similar to that proposed by the TEI, or by use of marked sections. For the GM application, this approach had reportedly recouped the cost of its implementation within six months.
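The parameter-entity trick is easier to show than to describe. A much simplified sketch (my own reconstruction, not Rubinsky's actual example; the entity and file names are invented) uses a marked section whose status is controlled by a parameter entity, so that a dtd fragment can be switched in or out when the dtd is compiled:

    <!ENTITY % safety.module "IGNORE">
    <![ %safety.module; [
      <!ENTITY % safetyTags SYSTEM "safety.elt">
      %safetyTags;
    ]]>

Because the first declaration of an entity is the one that counts in SGML, a document which declares safety.module as "INCLUDE" before this point pulls the extra element declarations in; all others simply ignore them. The TEI dtds use essentially the same device to bolt optional tagsets onto a base.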
Pamela Gennusa (Database Publishing Systems) also picked up the recurrent theme of this conference: that SGML was uniquely appropriate to database publishing. She gave a good description of the major issues in preparing text for publication in database format and the strengths of SGML as a means of making explicit the information content of texts in a neutral way, which was essential given that authors and consumers had different requirements of it, touching on the problems of security, high volume and time sensitivity which characterise database publishing as an industry. She also gave a good overview of the capabilities of the new version of Datalogics' set of SGML products, notably WriterStation, an impressive authoring tool with several new facilities, and DMA (Document Management Architecture), a complex set of object-oriented tools providing database management facilities for SGML material which also includes full text searching facilities like those described earlier by Chahuneau.
Ruud Loth (IBM Netherlands) gave a workmanlike presentation of IBM's SGML product range, which now includes a context-sensitive editor for OS/2 called TextWrite, a formatter for VM or MVS called BookMaster and a new range of products called BookManager to deal with `softcopy books' (IBMese for `electronic texts'). BookManager Build runs under VM and MVS and generates `softcopy' from GML or SGML documents; BookManager Read runs additionally under DOS or OS/2 and has impressive facilities for hypertext-style browsing, intelligent text retrieval, indexing and annotation. IBM documentation (47,000 titles, 9 milliard pages) would soon be available in this new form.
Bruce Wolman of Texcel AS then gave a detailed product description of the Avalanche `FastTag' automatic tagging system which, it is claimed, can handle almost any kind of text and automatically insert usable markup into it. The product has two components, a `visual recognition engine' which searches for visually distinct entities in a document, as defined by a set of rules encoded in a language confusingly called Inspec, and another language, called Louise, which defines the form in which these objects should be encoded. Things like tables, footnotes, horizontal lines, running headers or footers or special control sequences could all be automatically tagged as well as objects defined by regular expressions or specific keywords in the text. The product had just been launched in Europe and was available for MSDOS, VMS, Ultrix and Macintosh.
John Mackenzie Owen of the Dutch consultancy Pandata gave a brief description of the SGML handling capabilities of BasisPlus, stressing however its strengths as a document management system rather than its admittedly limited SGML features. Bev Nichols of Shaffstall described the Shaffstall-6000, an all-singing all-dancing document conversion system based on a package called CopyMaster which included SGML among its 800,000 claimed `document-to-document' pairings, but which (I had the impression) would really rather be operating on a proprietary format called the Shaffstall Document Standard. The last presentation of the day was from Ian Pirie of Yard Software Systems, who described the successful Protos project carried out by Sema Group and Pandata for the CEC. The project handled proposals for funding from DG 13, which had to be distributed to member states for comment, and the ensuing comments. MarkIt had been used to validate the format of the messages passed in either direction, its regular expression facilities being particularly useful in automatically encoding the content of telex messages, and its application language to encode the messages for storage in a Basis database. The whole operation had been carried out with minimal disruption of the message system.
Aside from the presentations, the conference provided an excellent opportunity to catch up on the expanding world of SGML-aware software. Among the products demonstrated were new versions of MarkIt and WriteIt from Sema Group, Author/Editor from SoftQuad, Arbortext, WriterStation from Datalogics, and an interesting new product, an SGML editor called EASE from a Dutch company called E2S. Delegates were also given a copy of the first fruits from the European Work group on SGML, a consortium of European publishers working on a set of AAP-inspired DTDs for scientific journals: a very well designed and produced booklet documenting a DTD for scientific article headers. I came away from the conference reassured that SGML was alive and well and living somewhere in Europe.
Lou Burnard
Text Encoding Initiative
A postscript to the above
My report on last month's Amsterdam SGML User Group conference, recently posted on TEI-L and on comp.text.sgml, was, like most such reports, written with timeliness and liveliness as primary objectives, rather than considered sober opinion. Consequently, it contains some phrases which I would certainly not wish to stand as matters of official published record, and also a few inaccuracies that I'd like to correct. I've recently received a letter from Sperling Martin drawing attention to some of these, most of which is quoted below. This is partly a way of expressing gratitude to Sperling for having taken the time to correct my misrepresentations so thoroughly and with such good humour. His reply also provides some fascinating background detail about those rugged pioneering days of the SGML revolution -- I for one would like to know what became of the Atari SGML parser!
Lou Burnard
There are three points about which I want to provide further explanation. The first concerns the "planted" errors in the early AAP DTDs; the second relates to ACM; and the third pertains to the Association of Research Libraries' activities.
As to the condition of the AAP DTDs, I hope you can recall that I said that the development of what ultimately became the AAP Standard was begun before SGML had even achieved formal ISO draft status. This was done with obvious risk -- what if ISO had not approved SGML? Just think where we would all be now!
SGML, of course, was in some form of ANSI evolution from the late 1970s. Fortunately, by the early 1980s the core of SGML had reached a fairly solid condition. Much work remained in refining and enhancing that core, and that was the focus of the ANSI/ISO committee efforts during the period 1982-1986. In addition, once the standard had reached a nearly complete form, the ANSI/ISO community moved rather quickly to get SGML through the draft and final approval cycles, saving us on the AAP Project significant embarrassment.
To give you a calendar metric, the AAP project was launched in late 1983. It produced its final report and initial set of DTDs in February 1986 -- about eight months before the officially approved ISO version of SGML. The earliest attempts to use the primitive SGML tools to describe the AAP document structures were useful in getting us headed in the right direction. It was a bit of a juggling act, however, to keep the AAP technical efforts completely synchronous with the evolving SGML. (You folks on the TEI project have it so much easier -- he says truly ingenuously!)
The point I was trying to make was that we on the AAP Project were working with a bit of a moving target. And, in the later AAP project phases, as SGML began to solidify and become more widely circulated as a draft ISO document, its complexity was a bit of an impediment to understanding its richness and utility. Still a problem today.
As we were obligated to share the draft AAP DTDs with a panel of publishing technology and "SGML" experts, we wanted to be certain that the material we were presenting for review was being thoroughly read. Our simple "test" was to plant a few obvious errors in the DTDs to see if our reviewers were paying attention. There were no SGML parsers in 1985 save a very limited "toy" that Charles Goldfarb had built to operate on his home ATARI! That meant that the only way to catch an error was to read the whole DTD character by character.
The result of the review drill was that most errors were caught. In fact, most of the errors were ones that we made by being unfamiliar with SGML applications development and not correctly interpreting SGML's metalinguistic rules. There were even a few instances where we discovered syntactic conflicts in the draft versions of SGML that were subsequently rectified -- contributing to the refinement of SGML. (In its final year, the AAP Project did serve a valuable role in "testing" some aspects of SGML as it too was taking final shape.)
Anyhow, whatever the "planted" errors, and I recall only three egregious instances, they were removed long before the AAP DTDs received any form of wide circulation. The more important issue today, that I apparently did not make clear, was the revision to the AAP DTDs that is now underway to correct errors and ambiguities that were unintentional. And, as you reported, revised, corrected versions of the DTDs are likely to be available later this year.
While on the AAP theme, let me add that the DTDs for math and tabular material are not quite as rife with problems as you may have thought I suggested. The AAP math material has been very useful in many commercial publishing applications that are alive and well. The TeX vs SGML debate continues apace independently of the AAP Standard. The tabular component of the AAP Standard has seen even wider use.
The SAE project that I described uses the AAP tabular approach as the basis of the engineering data tables that are part of their aerospace and ground vehicle standards publications. Many CALS-compliant applications have used the AAP tabular material approach. What I was trying to emphasize was that work remains in improving the AAP math and tabular components. EPSIG is now soliciting comments and suggestions about those components. I hope that those who were at the Amsterdam meeting and are interested in contributing ideas will respond to the solicitation.
The one point about my description of the ACM project that needs clarification concerns the use of the AAP standard for manuscript submission. I am certain that I didn't say that, for electronic submissions, ACM will mandate the use of an SGML application. The overhead that I used, and which was part of my handout, clearly shows that full SGML application tagging will occur at the receiving end -- that is, at the ACM headquarters. SGML application tagging can certainly be done by the authors and editors, but it will not be mandated. There are basic guidelines suggested for electronic submissions, but they can be followed without requiring authors to do comprehensive source document tagging. As the ACM project evolves and suitable tools become more prevalent, the ground rules for electronic submissions can be expected to change.
Finally, about the Association of Research Libraries' activities: the electronic journal effort that I described is a very recent initiative. No "seed" funding from the National Science Foundation or others has yet been established. The only thing that should be reported is that project funding will be addressed upon completion of the formal project plan -- and that is still to be completed. At this stage it is assumed that the technical basis for the collaborative information interchange will likely build on widely accepted standards, including the AAP Standard.
ICAME is the annual get-together of corpus linguists. This year's (the twelfth) was hosted by Leeds University at a rather nice decayed Victorian hotel on the edge of Ilkley Moor, and enjoyed excellent weather, the usual relaxed atmosphere and the usual extraordinary array of research reports, which can only be very briefly noticed in this report. As usual, there were about 50 invited delegates, most of whom knew each other well, and a few rather bemused-looking non-Europeans, notably Mitch Marcus (University of Pennsylvania) and Louise Guthrie (New Mexico SU). The social programme included an outing to historic Haworth by steam train which, alas, your correspondent had to forgo in order to attend to other TEI business, and large amounts of good Yorkshire cooking, which he did not.
For the first time, the organising committee had included a so-called open day, to which a number of interested parties, supposedly keen to find out what this corpus-linguistics racket was all about, had been invited. As curtain-raiser to this event, I was invited to present a TEI status report, which I did at breakneck speed, and Jeremy Clear (OUP) to describe the British National Corpus project, which he did at a more relaxed pace. The open day itself included brief presentations from Stig Johansson (Oslo), on the history of ICAME since its foundation in 1973, from Antoinette Renouf (Birmingham) on the basic design problems of corpus building, from Sid Greenbaum (London) on the design and implementation of the new co-operative International Corpus of English project, from Eric Atwell (Leeds) on the kinds of parsing systems which corpus linguistics made possible, from Jan van Aarts (Nijmegen) on the Nijmegen approach to computational linguistics, from John Sinclair (Birmingham) on the revolutionary effect of corpus linguistics on lexicography and on language teaching, from Gerry Knowles (Lancaster) on the particular problems of representing spoken language in a corpus, and from Knut Hofland (Bergen) on the technical services provided for ICAME at Bergen. While none of these speakers said anything particularly new, several of them (notably van Aarts, Renouf and Sinclair) managed to convey very well what is distinctive and important about the field. As far as I could tell, most of the ICAME community was a bit dubious about the usefulness of the open day. For outsiders wishing to get up to speed on why corpus linguistics is interesting and why it matters, however, I would judge it a notable success.
Corpus linguistics is, of course, all about analysing large corpora of real-world texts. To do this properly, you probably need a good lexicon, and you will certainly finish up with one if you do the job properly. Not surprisingly, therefore, the conference proper began with a series of papers about electronic lexica of various flavours, ranging from the CELEX database (Richard Piepenbrock, Nijmegen), in which a vast array of information about three languages (Dutch, English and German) is stored in a relational database, to the experimental word-sense lattices traced by Willem Meijs' Amsterdam research teams from the LDOCE definitions. Work based on LDOCE, surely by now the most analysed of all MRDs, was also described by Jacques Noel (Liege) and by Louise Guthrie (NMSU). The former had been comparing word-senses in Cobuild and LDOCE, while the latter had been trying to distinguish word senses by collocative evidence from the LDOCE definition texts: although well presented and argued, her conclusions were rather unsurprising (highly domain-specific texts are easier to disambiguate than the other sort), and basing any conclusions about language in general on the very artificial language of the LDOCE definition texts seems rather dubious.
The traditional ICAME researcher first quantifies some unsuspected pattern of variation in linguistic usage and then speculates as to its causes. Karin Aijmer (Lund), for example, reported on various kinds of `openers' in the 100 or so telephone conversations in the London-Lund Corpus, in an attempt to identify what she called routinisation patterns. In a rather more sophisticated analysis, Bent Altenberg (Lund) reported on a frequency analysis of recurrent word-class combinations in the same corpus, and Pieter de Haan (Nijmegen) on patterns of sentence-length occurrences within various kinds of written texts.
Although attendance at ICAME is by invitation only, an honourable tradition is to extend that invitation to anyone who is doing something at all related to corpus work, even a mere computer scientist like Jim Cowie (Stirling), who began his very interesting paper on automatic indexing with the heretical assertion that restricting the type of text analysed was essential if you wanted to do anything at all in NLP. The object of his research was to identify birds, plants etc. by means of descriptive fragments of text, and his method, which relied on identifying roles for parts of the text as objects, parts, properties and values, was both highly suggestive for other lines of research and eminently pragmatic. A similarly esoteric, but only potentially fruitful, line of enquiry was suggested by Eric Atwell's report on some attempts to apply neural networks to the task of linguistic parsing.
Another nice ICAME tradition is the encouragement of young turks and research assistants, who, when not acutely terrified, are often very good at presenting new approaches and techniques. This year's initiates included Simon Botley (Lancaster), who presented a rather dodgy formalism for the representation of anaphoric chains; Paul Gorman (Aberystwyth), who had translated CLAWS2 into Ada and almost persuaded me that this was a good idea; Christine Johansson (Uppsala), who had been comparing `of which' with `whose' -- almost certainly not a good idea; and Paul Rayson and Andrew Wilson (also Lancaster), who had souped up General Enquirer to do some rather more sophisticated content analysis of market research survey results, using CLAWS2 to parse them.
Two immaculately designed and presented papers concerned work at the boundary between speech as recorded by an acoustic trace and speech as recorded by transcription. Anne Wichmann (IBM) presented an analysis of `falls' in the London-Lund corpus, a notorious area of disagreement between transcribers. Her elicitation experiment tended to show that there was a perceived continuity between high and low falls which transcribers could not therefore categorise. Gerry Knowles (Lancaster) proposed a model for speech transcription in which perceived phonemic categories form an intermediate mapping between text and acoustic data: speech transcriptions require a compromise between patterns that can be computed from text and interpretations derived from acoustic data.
High spots of the conference for me were the presentations from O'Donoghue (Leeds) and Marcus. If there is anyone around who still doesn't believe in systemic functional grammar, Tim O'Donoghue's presentation should have converted him or her. He reported the results of comparing the statistical properties of a set of parse trees randomly generated from the systemic grammar developed by Fawcett and Tucker for the Polytechnic of Wales Corpus with those of the parse trees found in the same (hand-)parsed corpus itself. The high degree of semantic knowledge in the grammar was cited to explain some very close correlations, while some equally large disparities were attributed to the specialised nature of the texts in the corpus.
Mitch Marcus gave a whirlwind tour of the new burgeoning of corpus linguistics (they call it `stochastic methods') in the US, and made no bones about its opportunistic nature or its funding priorities, incidentally providing the conference with one of its best jokes when he remarked, of the ACL/DCI, the Linguistic Data Consortium and the like, that "people want to do this work extremely badly, and they need syntactic corpora to do it". He described the methods and design goals of the Penn Treebank project, stressing its engineering aspects and providing some very impressive statistics about its performance.
Several presentations and one evening discussion session concerned the new `International Corpus of English' (ICE) project. Laurie Bauer (Victoria University) described its New Zealand component in one presentation, while Chuck Meyer (UMass) described some software developed to tag it (using Interleaf) in another. The most interesting of these, however, was from And Rosta (London), who is largely responsible for ICE's original and, for my taste, rather baroque encoding scheme: it took the form of a detailed point-by-point comparison between this and the TEI scheme, with a view to assessing the possibility of converting between them. The verdict was largely positive, though he identified several points where TEI was lacking, some of which (notably the inability to tag uncertainty of tag assignment, and a whole raft of problems in tagging spoken material) should certainly be addressed, and all of which provided very useful and constructive criticism.
There was a general feeling that standardisation of linguistic annotation (which corpus linguists confusingly insist on calling `tagging') was long overdue. Marcus pointed out that the Brown corpus had used 87 different tags for part of speech, LOB had upped this to 135, the new UCREL set had 166 and the London-Lund Corpus 197. In Nijmegen, the TOSCA group has an entirely different tagset of around 200 items which has been adopted and, inevitably, increased by the ICE project. It seems to me that someone should at least try to see whether these various tagsets can in fact be harmonised using the TEI recommendations, or at least compared with the draft TEI starter set described in TEI AI1 W2; a sketch of what such a comparison might involve follows. I also think that someone should at least try to see how successful the feature-structure mechanisms are at dealing with systemic networks of the POW kind.
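Purely as an illustration of what such harmonisation might mean in practice, here is a toy Python sketch: tags from two schemes are decomposed into shared feature bundles, in the spirit of TEI feature structures, and checked for compatibility. The Brown- and CLAWS-style tag names are real enough, but the feature decompositions are invented for the example and are not a proposal.

    # Toy feature decompositions for a handful of tags (invented for the example).
    BROWN = {
        "NN":  {"cat": "noun", "number": "singular"},
        "NNS": {"cat": "noun", "number": "plural"},
        "VBD": {"cat": "verb", "tense": "past"},
    }
    CLAWS = {
        "NN1": {"cat": "noun", "number": "singular"},
        "NN2": {"cat": "noun", "number": "plural"},
        "VVD": {"cat": "verb", "tense": "past"},
    }

    def compatible(tag_a, scheme_a, tag_b, scheme_b):
        """Two tags 'harmonise' if their feature bundles do not conflict."""
        fa, fb = scheme_a[tag_a], scheme_b[tag_b]
        return all(fb.get(k, v) == v for k, v in fa.items())

    print(compatible("NN", BROWN, "NN1", CLAWS))   # True
    print(compatible("NN", BROWN, "VVD", CLAWS))   # False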
LB, 14 May 90
Between 80 and 100 people, from a large variety of European software houses and research centres were invited to this "Information Day", at the European Commission in Luxembourg, the goal of which was to stimulate interest in the CEC's Call for Tenders for the next phase (1992-4) of the Eurotra II Project. I attended in my role as consultant to Sema Group (Brussels), whose presentation was partly based on a report I wrote for them some time ago. Several background papers were provided and are listed at the end of this report; copies are available from me on request.
Background Presentations
After brief initial background presentations from F. De Bruine and Sergei Perschke, placing the Eurotra project in its political and historical contexts respectively, Nino Varile (CEC) gave a brief overview of the last three decades of what is now known as NLP. He stressed chiefly the way in which idiosyncratic procedural systems based on Transformational Grammar had given way to declarative systems based on lexical unification theories. He argued that such systems, being inherently more robust, would speed up progress in MT systems during the 90s, as would the notion of reusable modular lexical and processing resources. The object should be systems of high quality; the days of what Nino disparagingly referred to as ‘do-it-yourself’ NLP were over. Multi-functional resources, shareable between projects, would be the norm.
Roberto Cencione (CEC) then introduced the main business of the day: initial reports from four feasibility studies commissioned by the CEC, each of which had been briefed to investigate a distinct aspect of Eurotra II. Unlike Eurotra I, this would be a kind of ‘NLP workbench’, modular, formalism-independent and capable of evolving to meet new requirements, but currently solidly based in industry standards (POSIX, X/OPEN, WINDOWS, NFS, SQL, SGML...). Each of the four studies had been asked to assess the current prototype, consider existing relevant formalisms, and specify new tools, formalisms or strategies as appropriate. Each had involved collaboration between academic and industrial partners: final reports are not due until July, but initial versions of all were presented at the meeting. Cencione also highlighted Eurotra's drift into professionalism (my phrase): until 1987 all R&D had been in the hands of academics; during the last two years a central software team had taken this over. The next phase, until 1994, would be characterised by turnkey projects carried out under contract. After 1994, cost-sharing projects would become the norm.
Feasibility Studies
The System Architecture
Jörg Schütz (Inst for Applied Information Science, Saarbrücken) picked up the theme of open modular architecture. By contrast with Eurotra I, the new system takes an object-oriented approach. He presented the various layers of the architecture, from man-machine interface down to database storage, by way of user agent and object manager. The latter handles interactions with lexical resources and rules; the former oversees a number of co-operative software agents or ‘toolboxes’, for example for text-handling. A need was identified for a formalism-independent ‘lexical interface representation language’. A speaker from CAP-Gemini gave some further detail of the MMI agent: it would have a distinctive Eurotra ‘look and feel’, but would be built on X/Windows. As for system control -- if all else failed, there would always be Unix.
ET6-1: The Linguistic Formalism
Steve Pullman of SRI International began by noting that the linguistic formalism must be usable for general NLP, in a multilingual as well as a monolingual context, as well as for MT. Other desiderata were that it should be declarative, reversible (i.e. usable for generation as well as analysis) and theory-independent. It should have an easily implemented core, to which equally monotonic and declarative extensions could be hooked, and should use ‘typed feature logic’. There was some discussion of the interfaces between the Language Analysis agent (LA) and the user agent's virtual machine (VM) on the one hand, and the Text Handling agent (TH) on the other. LA/VM was well defined: primitive functions included Parse, Generate, Refine (i.e. further transform the output from Generate) and Transfer (i.e. translate), with, in each case, appropriate parameters such as language or grammar. LA/TH was rather more fuzzy, with some linguistic functions such as morphological analysis being done by TH rather than LA.
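To make the shape of the LA/VM interface concrete, here is a minimal Python sketch. Only the four primitive names come from the presentation; the class, the method signatures and the Analysis type are my own invention and should not be read as the actual ET6-1 design.

    from dataclasses import dataclass
    from typing import List

    # Hypothetical container for an LA analysis; the real formalism would use
    # typed feature structures rather than a plain dictionary.
    @dataclass
    class Analysis:
        language: str
        features: dict

    class LinguisticAnalysis:
        """Sketch of the LA/VM interface: only the four primitive names come
        from the ET6-1 presentation; the signatures are invented."""

        def parse(self, sentence: str, language: str, grammar: str) -> List[Analysis]:
            """Analyse one sentence into candidate feature structures."""
            raise NotImplementedError

        def generate(self, analysis: Analysis, language: str, grammar: str) -> str:
            """Produce a surface sentence from an analysis."""
            raise NotImplementedError

        def refine(self, analysis: Analysis, grammar: str) -> Analysis:
            """Further transform the output of generate (or parse)."""
            raise NotImplementedError

        def transfer(self, analysis: Analysis, target_language: str) -> Analysis:
            """Map an analysis from the source language to the target language."""
            raise NotImplementedError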
ET6-3: The Text Handling Component
Christian Devillers of SEMA reviewed the text-handling design study. This component interfaces the Linguistic Analysis system with real texts, during both input and generation. The study had involved a brief survey of existing office document handling systems, SGML systems, and systems used within the ‘literary and linguistic computing’ paradigm. A simple SGML DTD for texts passing across the TH/LA interface had been defined (EDIF, the Eurotra Document Interchange Format). Recognising that TH tools would probably be of wide interest outside Eurotra, EDIF has been designed with an eye on TEI conformance.
The TH component's main function is to translate between a formatted document and whatever linguistic structures are used for input to (or output from) the LA component. At present, LA requires input of single sentences, with no nested quotations etc., as well as some quite detailed morphological analysis, and that is therefore what TH must produce. Devillers stressed that the segmentation performed by TH was determined entirely by the LA: if this were enhanced for example to deal with paragraphs, then that would be passed across the interface. Some rendition features of the input text are passed through to LA; the majority however are filtered out and stored somewhere unspecified, so that they can be re-associated with the output text.
This presentation was followed by a few desultory questions about other related CEC-funded projects and about the range of material anticipated for translation. I spoke very briefly about the TEI in response to a prod from Cencione. It felt like a very long morning (no coffee break) by the time we all went, thankfully, to lunch in what is unquestionably one of the biggest and best office canteens in the world.
ET7: Reusability of lexical resources
Ulrich Heid (University of Stuttgart) presented some initial results from this rather different feasibility study. Like Varile, he stressed the economic argument in favour of re-using resources, which follows from the imperative need to ‘size up’ NLP projects: a toy system can demonstrate anything; you only demonstrate what is really feasible with a realistically sized system. Reusability might mean simply re-cycling a resource prepared for some other purpose, or it might mean designing resources with multiple applications in mind from the start.
Most of the presentation dealt with questions specific to re-usable machine-readable dictionaries (MRDs). Heid touched briefly on the existence of a number of related projects and initiatives (e.g. Acquilex, Genelex and Multilex, which are concerned with acquisition, formal description and integration of MRDs respectively). MRDs contain vastly larger amounts of information than electronic lexica, but they are not available for many languages and their underlying structures are not explicit. The study group's approach to unification of lexicographic information across different MRDs was to try to define some primitive level of description, expressed in a typed feature logic, corresponding with the linguistic phenomena which the dictionaries purported to describe. Note: this sounds, on reflection, very like the approach proposed within TEI Working Group AI1, so it must be right. Among the 20 or 30 assorted research groups, publishers and software vendors involved in the study, there had been no dispute about the use of an attribute-value representation scheme, though its practical viability had yet to be demonstrated. The use of SGML as an exchange mechanism and the need for interaction with TH and the TEI had been equally uncontroversial, though some unspecified concerns had been expressed about character set problems. At the end of the year, another feasibility study would be reporting on some pilot projects demonstrating the benefits and methods of standardisation.
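For readers unfamiliar with attribute-value representations, the following toy Python sketch shows the basic operation such a scheme relies on: unifying two partial descriptions of the same headword. The entries and the dict-based encoding are invented for the example; the real proposal uses a typed feature logic, not Python dictionaries.

    # Minimal sketch of unifying two attribute-value structures (invented data).
    FAIL = object()

    def unify(a, b):
        """Unify two feature structures represented as nested dicts;
        return FAIL on conflicting atomic values."""
        if isinstance(a, dict) and isinstance(b, dict):
            result = dict(a)
            for key, value in b.items():
                if key in result:
                    merged = unify(result[key], value)
                    if merged is FAIL:
                        return FAIL
                    result[key] = merged
                else:
                    result[key] = value
            return result
        return a if a == b else FAIL

    # Two partial descriptions of the same headword, as two dictionaries
    # might record them:
    entry_a = {"lemma": "bank", "syntax": {"cat": "noun"}}
    entry_b = {"lemma": "bank", "syntax": {"cat": "noun", "number": "singular"}}
    print(unify(entry_a, entry_b))
    # {'lemma': 'bank', 'syntax': {'cat': 'noun', 'number': 'singular'}}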
This talk provoked some disagreement from the floor, in the shape of Wolf Paperote (Munster), who asserted on behalf of corpus linguists everywhere (there were none present) that MRDs were a lot less useful than corpora as a source of linguistic information, and that since parsing corpora was marginally easier than parsing MRDs, and much cheaper, wouldn't the money be better spent on the former? To judge from the icy silence that greeted this remark, his was a minority view.
Contractual matters
Cencione rounded off the day by describing the procedure to be adopted for the call for tenders, as set out in ET9. Tenders were invited for two distinct projects: first, the implementation of a Eurotra II development environment, as described by the four study papers; second, the provision of maintenance and software support facilities for all Eurotra Project researchers (currently 17 sites located in 12 countries). The contracts would run to 1993, with the possibility of extensions under the new Language Research Engineering (LRE) programme for a further two years from 1994.
Nearly a hundred different companies and institutions had expressed interest in tendering. The formal invitation to tender would be published towards the end of May; the deadline for bids would be the end of July; contractor/s would be chosen by late September with a view to concluding contractual negotiations by the end of the year, and starting work early in Jan 1992. Contractors could organise the work as they deemed fit, bid for one or both projects, subcontract work etc, but the CEC would contract with only one member of a consortium, who should moreover be responsible for at least half of the work on the project. Software developed under ET9-1 would remain the property of the CEC, and must be shareable with any future research projects funded by the Commission. Their estimate of the costs for both projects over two years was around 30 man/years, with two thirds of the approximately 15 man/years for the first year being allocated to ET9-1.
TEI Relevance
At the risk of stating the obvious, I would like to stress the importance of the CEC's Language Research Engineering (LRE) programme to the future of the TEI, and not just because of the amounts of money involved (several million ECU over the next few years). Over the next few weeks I will be writing up an assessment of the TH study, as part of the OUCS/SEMA consultancy project. Any input or comments from the TEI perspective would be most useful. I see three chief areas of overlapping interest, briefly summarised below.
Reuse of lexical resources
There is a lot of effort and money going into initiatives to standardise lexical resources such as MRDs, with which TEI is already involved by virtue of overlapping membership (Nicoletta Calzolari is, of course, a significant contributor to ET-7); however, it was clear from talking to Heid that closer collaboration would be both possible and welcome. As a first step I have requested copies of ET-7's detailed working papers, several of which include surveys of existing encoding schemes and recommendations for standardisation which should be brought to the attention of the relevant WGs. I think the new WGs on lexical resources and on terminology in particular should be encouraged to build on this European work rather than go their own way.
Language formalisms
Someone competent to judge the issues within AI1 should be asked to assess the linguistic formalism of ET6-1 and to consider ways of representing it using TEI-style feature sets. In my report on ET6 I would like to suggest that LA should be able to output the results of its analysis in a TEI-conformant way: it would be nice to have some specific arguments and examples to support this, but I am not confident of my competence to produce them.
Text Handling tools
Despite the general lack of enthusiasm within the Eurotra project for corpus linguistics, it seems to me that some of the tools developed as part of TH may prove to be of particular interest to several TEI projects. TH will (for example) have to develop ways of automatically detecting and tagging sentences and morphological structure, in SGML, across the full range of European languages. If properly designed and implemented, such tools would be of great general applicability. The CEC's policy as regards making such tools freely available to the research community, at least within Europe, sounds distinctly encouraging, as does their declared intention of working within an open Unix environment.
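As a crude indication of what the simplest such tool does, here is a deliberately naive Python sketch that wraps each sentence of a paragraph in an SGML <s> element. The regular expression is my own placeholder and glosses over everything (abbreviations, quotations, ordinals) that makes real European-language segmentation hard; it is an illustration, not the TH design.

    import re

    def tag_sentences(paragraph: str) -> str:
        """Wrap each (naively detected) sentence in an SGML <s> element."""
        # Split after ., ! or ? when followed by whitespace and a capital letter.
        pieces = re.split(r'(?<=[.!?])\s+(?=[A-Z])', paragraph.strip())
        return "<p>" + "".join(f"<s>{s}</s>" for s in pieces) + "</p>"

    print(tag_sentences("The tender closes in July. Contracts start in 1992."))
    # <p><s>The tender closes in July.</s><s>Contracts start in 1992.</s></p>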
Document List
The following documents were made available at the workshop.
All academic communities define themselves partly by regular gatherings dedicated to self-examination; the community of ‘information scientists’, i.e. those skilled in the management and exploitation of library and analogous resources in research, is no exception. During the seventies there had been a regular series of such gatherings known as the Cranfield Conference. These having now fallen into desuetude, when Brian Perry, head of the British Library's Research and Development Department, welcomed us to this reborn version he naturally proposed that it should be called ‘Not the Cranfield Conference’. The four-day event, jointly sponsored by the British Library, the University of Pittsburgh's Department of Library Science, and the UK Computer Board, attracted a small but agreeably heterogeneous audience. Attendance at sessions averaged 60 from a total registration of just under a hundred, largely composed of information science professionals, computerate librarians, human-factors computing theoreticians, a sprinkling of civil servants and various other varieties of professional research support people, drawn fairly even-handedly from universities and polytechnics, with even a few token representatives of industrial concerns such as Shell. Although the British formed the majority, followed by the Americans and the French, several other countries were represented, including Sweden, Eire, Canada, the Netherlands, Turkey and Bophuthatswana. The conference bore every sign of having been carefully arranged to maximise opportunities for informal contact and discussion: there were no parallel sessions, and the timetable was not a tight one, with five keynote speakers, one panel session and a paltry 20 presentations spread over four and a half days. The venue, Cranfield Institute of Technology, notorious for its sybaritic charm as a conference centre, contributed something to this end. As befits experts in the research process, the organisers had gone out of their way to create a stimulating, agreeable, thought-provoking environment in which creativity and information flow would flourish. But what were we supposed to talk *about*?
In the initial session, Jack Meadows (Loughborough) surveyed several recurrent themes of the conference, notably the types of application found by researchers for IT, which he viewed historically as shifting from storage and retrieval to communication, and in the future to creativity itself. He asserted that take-up of any new technology lasted for about a decade between the first 10% of potential users and the last 10%, and pointed out that because acceptance of IT must be a communal decision it would necessarily be a slow one. He said that good human interfaces implied a loss of computational efficiency; that researchers required different levels of information; that IT facilitates informal communication better than formal; and various other varyingly "untenable generalisations" (his phrase), presumably in order to provoke discussion.
The panel discussants were Richard Rowe (FAXON), who had brought a four-pound $4000 NEC portable with built-in modem to show us (this was instantly eclipsed by Brian Shackel, who had brought his two-pound £200 Cambridge Z88); he also remarked on the importance of serendipitous browsing. John Clements (National Academy of Sciences, Washington) drew our attention to the importance of information processing expertise in the synthesis which characterised such major new endeavours as the human genome project, and also said we were within five years of making a completely detailed computer simulation of a living organism of the complexity of E. coli. Ted Herbert (Computer Board) saw JANET and its future development as crucial to scientific communication. He identified a trend towards simulation over actual experimentation in science, and a window of opportunity in which the unit cost of computer power was dropping faster than demand was rising; he also summarised rather amusingly some of the difficulties inherent in negotiating with Whitehall. Gérard Losfeld (Université de Lille) had had to leave shortly before the discussion but, in an impressive demonstration of IT in action, had faxed his comments on Meadows' paper from the airport. These cast doubt on the likelihood that better software necessarily meant better research, and made a good case for the fear that bibliographic databases encourage productivity at the expense of creativity. Finally, Nicholas Ostler (Department of Trade and Industry) drew a rather curious parallel between IT and money as a store of value, medium of exchange and unit of account, before making some rather vague generalisations about computing in linguistics and drawing an even more curious parallel between librarianship and espionage.
The discussion following this panel was largely focused on browsing methods. Creativity and innovation are based on hitting on the unexpected, but the narrowing focus of research means that less and less is unexpected. IT should open up ways of reducing the information overload, perhaps by automating the filtering process needed for intelligent browsing.
The first full day of the conference was concerned with that elusive feature of the best research: creativity. Proceedings began with a remarkable keynote speech by Yves Le Coadic of the Conservatoire National des Arts et Métiers, Paris, substituting for Jean Claude Gardin. He began by dismissing the common notion that, viewed diachronically, there is a close connexion between scientific development and prevailing political ideologies. Science, it turns out, has always had a central theoretical core, and few arts people ever penetrate to the inner workings of a scientific community and return to explain them. The manufacture of scientific ideas depends on networks of communication and Baconian deduction, in contrast with the notion of scientific inspiration popularised by Koestler. Turning to the Humanities, Le Coadic identified a shift from the use of IT for information storage to its use as a means of classifying information, and finally as an integrative force, a means of extracting the rules implicit in a given universe of discourse. Ideas, he concluded, are created or manufactured by precisely the kinds of social and technical networks which IT facilitates.
After this heady Gallic stuff, the remaining presentations seemed somewhat tame. Michael Brittain, of the NHS Training Authority, listed six influences of IT on research, ranging from time-saving to re-ordering the canonical pattern of the research process. He had noticed that most researchers were unaware of the process of research as such, and closed with some rhetorical questions about whether there were areas of research that could not be done without the benefit of IT (several members of the audience quickly identified some for him) and whether or not its application was always cost-effective. No one had any ideas on that, but it sparked off a very interesting discussion which concluded that social processes (such as co-operation) were often more important than technological aids.
Goodier (Department of the Environment) provided a management perspective on the research process, based on his own experience at the Agricultural and Food Research Council. Much of his paper was a plea for proper documentation of research activities in progress, which, he thought, would benefit from the sort of controlled keywording that typifies bibliographic databases, and for some sort of quality control mechanism more effective than simple peer review, or at least the appearance of one.
Chris Turner (Brighton Poly) had an uncomplicated answer to the problem of increasing creativity in his own IT faculty: the creation of a uniform IT environment, based on Macintosh hardware, with Sun workstations for intensive numerical work and a VAX which was increasingly regarded as a giant disk used for archival backup purposes. He restated two crucial elements in a creative IT strategy: high connectivity, and an awareness of HCI factors.
John Weiner (University of Southern California, Department of Medicine) gave a very impressive paper which argued an unfashionable view: that information processing can be formalised and that the creative process is definable. The methods he outlined, for 'ideas-analysis' in clinical trials concerned with paediatric oncology, were based on a knowledge representation intended to capture 'ideas' from the literature, which could then be manipulated by a rule-based deductive system to simulate creativity without (in his phrase) ‘any need for the wine or the hot tub’. Some of the success he reported may have derived from the nature of the 'ideas' involved, but the paper was very well presented and provocatively argued.
The next speaker (Johan van Halm, an independent consultant) thus had a somewhat punch-drunk audience with which to engage. His paper on IT in library science, effectively a summary of a report prepared for the Dutch Library Council last year and recently published in English, was uncontentious, and its conclusions (that widespread acceptance of IT depends on such factors as public acceptability and a satisfactory communications infrastructure) unsurprising.
The next session was on 'Collection and analysis of information'. The keynote speaker, Harold Borko (Graduate School of Library and Information Science, University of California at Los Angeles), gave a rather 'gee-whizz' style overview of the history of IT developments in the library community up to circa 1970, which so dispirited me that I played truant from the next two papers, given by Chris Batt (a real librarian, from Croydon) and Lisbeth Bjoerklund (University of Linkoping, Sweden), to check my e-mail and inspect the poster session in the adjoining room. Batt's paper, as given in the preprints, seems to consist of speculations about matters of library management, while Bjoerklund has hit on the notion of hyper-textualising an OPAC, but not apparently done a great deal about it. I returned in time to hear Andrew Dillon, the first of three speakers from the HUSAT Research Centre at Loughborough, present some results of an analysis of reading behaviour undertaken as part of Project Quartet. It demonstrated that researchers placed different kinds of texts at different points along three axes (how a text is to be read, why it is worth reading, and what it is likely to contain), with clear design implications for optimal reading and retrieval software.
After tea, Mike Lynch described some of the basic 'information studies' research carried out at Sheffield's prestigious Department of same, ranging from automatic indexing algorithms to heuristics for analysing three-dimensional chemical structures, and the increased complexity made possible by advances in computer hardware and software. Though interesting, this seemed only marginally relevant to the rest of the conference. It was followed by a paper by Patricia Wright (Medical Research Council) which almost caricatured the poor image that 'information studies' has in some quarters. Dr Wright had asked about 200 research workers (nearly all psycholinguists) whether they used computers to work on at home and, if so, what for. The mind-boggling results of her survey were that most people did, and mostly for word processing. Another revelation provided by this fundamental research was that far fewer people in the UK used email from home computers than did North Americans. Dr Wright suggested this might be because telephone charges were higher on this side of the pond, or then again, it might not. The questionnaire design was good; the paper was well presented and well argued, but almost entirely pointless. Most delegates promptly adjourned to Cranfield's exceptionally well-appointed bar (over 120 different single malts) for lengthy if inconclusive discussions about creativity and the research process.
The next day began with a good session on information exchange within the research community. The keynote speaker, Prof Brian Shackel, head of the HUSAT research team at Loughborough, resisted the temptation to speculate about the future, preferring to compare the various modes of electronic communication (mail, conferencing, publishing) with their conventional counterparts in functional and pragmatic terms. There was a good possibility that email would supplant the conventional variety entirely, but he was less sanguine about electronic conferencing or journal publishing.
The acceptability of new technology, as much else, hinged on human factors problems, for which he recommended some specific solutions: at least A4 and preferably A3 sized screens, hypertextual structures; ways of filtering junk mail; standardisation of formats and protocols; integrated international networks... Many of these had been the subject of the basic research carried out within Project Quartet, but there was no reason to assume that all its results could be transferred from research into reality. Maintaining the ‘invisible college’, for example, implied a need for local IT expertise; a novel way of funding this might be by a small with-holdable surplus on all research grants. Lapsing eventually into futurology, Shackel advised us to watch out for high-definition TV and ISDN, and keep an eye on the ‘electronic campus’ project at Aston University. 93% of Loughborough academics already have a PC on their desk, so the future may be nearer than we think.
Elaine Davis-Smith, IR specialist for a scrupulously un-named chemical company, then gave what was regrettably an almost inaudible paper about potential applications of IT within large (un-named) chemical companies concerned with hazardous chemicals, which provided an object lesson in unsuccessful communication.
Constance Gould (Research Libraries Group) then described how a survey of American scholarly users' needs had indicated two major areas where bibliographic information was conspicuously lacking: data about research in progress and data about machine-readable data files. In both cases the need was crucial in all disciplines, and particularly acute in inter-disciplinary fields. There was a widening professional gap between the 'haves' and the 'have-nots' as far as access to unpublished research in progress was concerned. The difficulties of getting reliable access to, and bibliographic control of, machine-readable datafiles were even worse: files are not catalogued or encoded in any consistent way, so researchers often don't even know they exist. The paper gave a clear presentation of the problem area; whether anyone is equally ready to clean these particular Augean stables is less certain.
Lindsay (Kingston Poly) then provided an intriguing view of the problems of information management from a third world perspective, based on a project undertaken for the Development Planning Unit of University College London. He described the political and organisational difficulties involved in bringing together access to the scattered 'gray literature' in this field. His conclusions were cautious: the new possibilities made available by IT in some ways aggravated existing social, financial and political problems more than they solved them.
John Richardson, the second HUSAT speaker, gave a well argued and very detailed survey of the available wisdom on the efficacy of electronically mediated conferencing, highlighting some problems with which readers (among others) are familiar. Although a high degree of communication was clearly a necessary condition for a productive research environment, the low bandwidth of most electronic communications often introduced as many complications as their greater speed of distribution removed. Electronic mail seemed to be less affected by these problems; successful electronic discussion groups, he concluded, need a skilled moderator, strong motivation and opportunities for face to face contact.
Reynolds (CODIL Language Systems) discussed some of the more lunatic implications of the Data Protection Act with regard to electronic communications. He postulated various unlikely scenarios in which the usual business of academic communication might well fall foul of the Act, much to the bemusement of all present.
The remainder of this third day of the conference was given over to relaxation, in the form of visits to various IT-based companies in the Milton Keynes area, of which British Telecom was reportedly the most popular, because it was air conditioned, and of course to the conference banquet, for which Cranfield's cooks excelled themselves. It was enlivened by an occasionally coherent dinner speech from Murray Laver, who said (as a good after-dinner speaker should) several things we would all like to say but dare not, notably that IT was silting up the research process by making it more difficult.
I had the dubious privilege of giving the keynote address to the final session of the conference, the morning after, which concerned publishing, presenting and archiving the results of research. My paper began by casting fashionable doubts on the notion of research as a process, and stressed the importance of decoupling data both from its containing media and from the processes applied to it, before expiring in a flurry of humanistic verbiage about multiple interpretations, hypertext etc. I also questioned whether the library community was in fact capable of responding to the challenges offered by the new technologies, an issue directly addressed by the paper from Michael Buckland (School of Library and Information Studies, University of California at Berkeley), which outlined the radically different constraints and possibilities inherent in the application of IT to library services. He argued persuasively that the dispensing role of collections, and the relationships between catalogues, bibliographies and library materials, alike needed rethinking.
David Ellis (Department of Information Studies, University of Sheffield) presented the results of a very detailed analysis of the information seeking habits of a sample of academic researchers in four social science research groups at Sheffield, with a view to deriving a general behavioural model which could be used to optimise the design of retrieval systems. Key features of the model included the ability to ‘chain’, that is, to follow links from one citation to another and to differentiate sources in terms of their relevance. No existing software seemed to offer the full range of desired features, but hypertext systems seemed to offer most promise.
Marcia Taylor (University of Essex) traced the development of the Essex Data Archive, from its origins 25 years ago as a survey databank to its current pre-eminence as an archive for, and source of, social science data, both deposited by individual researchers and provided by central government. She summarised some of the services it offers, and gave a brief overview of the research it undertakes, notably its initiative in formulating guidelines for the standardisation of descriptions of machine readable datasets.
The most unusual paper of the conference was given by Micheline Hancock-Beaulieu (City University) and concerned the creation of a database of information about The Athenaeum, a leading 19th century review whose value lies not only in the richness of its contents as a source for 19th century literary history, but in the existence of a 'marked file' in which each of the approximately 140,000 anonymous reviews it contains is tagged with a short form of the author's name. An interdisciplinary team of librarians, computer scientists and historians at City is now constructing a relational database to hold this invaluable source material in an integrated way, using a package called TINman.
Cliff McKnight (the last of the HUSAT speakers) gave the last formal presentation, which returned to the major concern of the conference: the reading of scholarly journals. As part of Project Quartet, they had converted an eight-year run of a journal called Behaviour and Information Technology into a hypertext using GUIDE, to provide good quality searching and browsing capabilities as well as the usual ability to fold away parts of texts, pop up figures and references etc. Although the formal structure of academic discourse typically mimics a linear process (introduction, methods, results, discussion), there was abundant evidence that skilled readers use this framework only as a point of departure, hopping from point to point in a way easily supported by hypertext, provided that the underlying structural metaphors (cross reference, citation, etc.) are clearly marked.
As a coda to the event, Brian Perry chaired a discussion at which participants were invited to comment on the success of the conference as a whole. Most of us, however, were feeling too limp from the heat to do more than agree that parallel sessions were not a good idea and that the organisation and content had indeed encouraged a satisfactory exchange of views. The idea was floated that pre-prints should have been made available early enough for participants to have read them before the event, so that formal presentations might be replaced by informal seminars, but it did not gain much support, though several felt that there had been insufficient time for discussion in the sessions. As a case study in how to organise academic conferences, I felt that 'Not the Cranfield Conference' could not easily be faulted. The programme, which at first glance looked rather dull, was unusually varied, containing many unexpectedly stimulating papers and only a few dodos. My only qualm is that too great a success may lead to yet more research into research, a depressingly incestuous and unproductive activity.
Lou Burnard
Oxford University Computing Service
Location and attendance
The meeting was hosted at Alphen aan den Rijn (Netherlands) by Samson-Sitjthoff BV, part of a major Dutch printing and publishing conglomerate, whose Information Services division is directed by Jan Maasdam, chairman of the Dutch SGML users' group. The SIG itself was set up very recently by its current chairman, Martin Bryan (author of the only readable book on SGML), who works for a division of the SEMA group called Yard Software Systems but is also closely associated with SOBEMAP and MARKIT, their SGML parser. The SIG has about half a dozen active members, drawn largely from major European software houses with an interest in the field. Its chief remit seems to be discussion of the interface between SGML and database design, but this was only the third meeting and the group has not yet felt the need to create a formal constitution or agenda.
Agenda for meeting
The agenda was as follows:
Presentations
Van der Steen's presentation was overlong and rather rambling for the occasion, but raised some interesting points about the benign influence of computational linguistics on the development of SGML (a DTD, he said confidently, is a formal grammar) and the appropriateness of hierarchic database systems to it. His company is developing an "Integrated Publishing Management System" entirely dependent on SGML as a transfer mechanism, which had necessitated a detailed specification for an ideal text retrieval system. He also described his own PARSPAT system, which uses recognition of syntactic patterns as a database search mechanism (he has recently published a book on the unification of pattern-matching, recognition, parsing and transduction), and gave examples of its use for analysing the Brown Corpus and a database of 18th century Delft estate inventories.
My presentation simply outlined the structure of the TEI, gave some examples of the horrors of unchecked scholarly markup and discussed the relationship between text and databases.
Chahuneau's company has the task of constructing a document database to support EEC legislative and other documents in nine languages in parallel. Its scale (15,000 pp in the printed annual form) and complexity set it apart from any other SGML application I have yet come across. Because the database is constantly changing, sophisticated version control and integrity checks are essential to maintain all nine views of it in parallel. This ruled out any of the traditional text retrieval database systems; hence the case study of available UNIX RDBMSs. Decisive factors in narrowing these down proved to be their degree of support for the 8-bit characters of ISO 8859 (essential for the nine languages), and the way in which the software implemented crucial database operations. Only three of the nine systems investigated allowed the manipulation of 8-bit characters as well as their storage. As to software performance, it seems that the ideal system would combine INGRES' query optimiser (which made an order of magnitude difference to the speed of join operations) with the SYBASE file-access engine (which had a similar effect on most other operations). An investigation of the various hardware platforms available showed, perhaps unsurprisingly, that although a RISC-architecture machine such as the new DEC Station gave enormous performance improvements, the low-end 386-based machine was a better price/performance option for development than any of the other available workstations.
Discussion of Schouten's paper was less focussed than it merited, largely due to the lateness of the hour. Schouten had advocated using a conceptual modelling technique known as NIAM, rather than the more usual Entity-Relationship model (which would have pleased Chahuneau and me) or a straight hierarchic datamodel derived from the DTD (which would have pleased van der Steen). He had not paid much attention to such specific problems as version control, which seemed to imply the need for a formalism superior to the DTD, the semantic adequacy of which was already in question.
Contacts made
Both Chahuneau and Bryan expressed interest in the work of the TEI, and a willingness to participate if invited. I took the opportunity of rehearsing some of the current Committee 4 arguments with them (both agreed that attributes were not formally necessary, but still extremely useful). If we do decide to involve either of them, Bryan might be a better SOBEMAP representative than Gaspart, while Chahuneau or his nominee would be a good substitute for Dendien on Committee 4. Either of these would be self-financing. It is important to stimulate SOBEMAP interest since they have EEC funding for their MARKIT product, which is the only structured editor I have so far come across that runs in the MS-DOS environment we all know and loathe.
I met Jens Erlandsen, whose company TEXTware A/S, based in Copenhagen, is developing Gestorlex, which seems to be yet another SGML-based structured editor for dictionaries and other reference books. They also market a small free-text browser of the Gofer type, and are involved in an ESPRIT project to develop a multi-media publisher's workbench. The novelty in the latter is that they plan to implement Salton's "vector-space model" for the full-text indexing requirement.
Kluwer, publishers of CHum, turn out to be part of the same publishing empire as our hosts. They were represented by Drs van Wijnen, who seemed quite taken with the notion that CHum contributors might be amenable to supplying their material complete with SGML tags. She agreed that this was at any rate worth suggesting to the CHum editorial board.
This was the second conference on the theme of Computers and Teaching in the Humanities to be organised by the Office for Humanities Communication and the University of Southampton, with the support of the ALLC (Association for Literary and Linguistic Computing), CTISS (Computers in Teaching Initiative Support Service) and the AHC (Association for History and Computing), and some sponsorship from IBM. It attracted a similar mixture of attendees to that of the preceding conference, more or less evenly divided between academic staff from both universities and polytechnics, with a third estate drawn from the growing body of arts computing support staff. It was unlike the previous conference, however (also held at Southampton a year ago), in two respects. The first conference had resembled a bazaar, with numerous parallel sessions organised as workshops introducing specific applications areas with the aid of a volume of essays (since published as IT in the Humanities, ed. Rahtz, Ellis Horwood, 1988). This conference built on the evident interest generated by the first, laying a greater stress on the practical problems of introducing computing tools to the undergraduate curriculum. It also had a more unified programme, exemplified by the conference subtitle 'Redefining the humanities'. To many delegates, it seemed, the chief effect of the introduction of computing had been to provoke a re-evaluation of the methods and priorities of teaching in the humanities, quite independent of any technological considerations.
The conference was opened by Peter Denley (Westfield College) who, as secretary general of the Association for History and Computing, is well placed to deliver a 'sociology of computing in the humanities'. He began by describing the rise of humanistic scholarship itself in renaissance Italy, stressing its emphasis on rhetoric and purity of language, together with its importance as a way of both defining and sustaining the growth of a secular educated elite. Movements as successful as renaissance scholarship inevitably distort their successors' perceptions of them; nevertheless, Denley argued, some of the problems currently faced by the Humanities can be related to the change of direction implicit in the reasoning of the first Humanists. It could be argued that emphasizing the purity of classical Latin above the Latin vernaculars of the middle ages had replaced practical linguistics by arid philology, while the emphasis on classical education as vocational - on rhetoric as a necessary political accomplishment - was clearly a two-edged sword. If the humanities were purely vocational, what was the function of humanistic research? A new agenda was needed, Denley argued, which recognised and reaffirmed the fundamental importance of the humanities, rather than regarding them as a useful collection of skills. As to computing in the humanities itself, he suggested, it was time to take stock: the role of IT in the arts course should be more than just to impart necessary skills, like word-processing. Discipline-specific training was important, whether or not it used the computer. Yet the structure of arts computing as it currently existed did not always encourage new ideas: there was no career structure for the strange hybrid characters who currently become arts computing advisors. For Denley, history and computing go hand in hand: history, as a way of handling information, demonstrates the complexity of knowledge, while a database system enforces rigour of analysis, by requiring that events fit into a structure. The Humanities, he concluded, needed urgently to reassert their importance and their relevance in the evolution of information processing.
A rather different perspective was offered by the next speaker, Tom Stonier, Professor of Science and Society at Bradford. His message was an evangelical one of unbridled prosperity for all, just around the corner. In the nineties, he said, education will absorb more of the GNP than anything else. Pursuing this rather odd economic metaphor, he pointed out that human resources were the only sort of capital which could be made to appreciate, by means of education. Like Denley, he felt that training for skills alone was short sighted; unlike him, he was confident that education and material progress would go hand in hand. Today's pupils would have life expectancies of a century and enjoy enormous material wealth, apparently caused by extensive use of robots and improved factory farming techniques: there would be guaranteed income for all. Doubters in his audience were exhorted to learn from the past: the purpose of history was (of course) to enable us to forecast the future. We should abandon the protestant work ethic, stop making a living and learn how to live. Computers are the greatest pedagogic devices since grandmothers. When the information operatives take over, their level of education will necessarily preclude totalitarianism. A materials based society evolves through competition, but an information based one develops only by means of cooperation. And so forth. Such millenarianism seemed to a number of the audience not only foolish, but also dangerously foolish, given the increasing marginalisation of the humanities to which Denley had already drawn our attention; however, as a morale booster, this was a most amusing and effective speech.
David Bantz (Dartmouth College) was considerably less charismatic but perhaps more reasonable. His presentation promised to address the extent to which educational problems were solved by computing methods and whether computing methods might not undermine traditional humanistic values, by being inimical to the 'great conversation of ideas' which Wayne Booth sees as characterising the humanities, by over-valuing reductionism and calculation at the expense of reasoning. He made several sound criticisms of the current state of the art in computer aided criticism and computer aided learning systems (for example, the way that most historical simulation systems permit of only one right answer, the 'drill and kill' style of CALL etc.), but had little positive or practical to propose other than to point out that computing hardware should be regarded as an expense item rather than a capital investment, since a four year old machine is useless. We learned that at Dartmouth, as at Bradford, all students are required to buy their own machines (Macintosh) and that 85% do, at a special price of $1500-2400, not so expensive when set against tuition fees of $20,000 p.a.
David Miall (College of St Paul & St Mary, Cheltenham), in one of the more thought-provoking papers of the conference, talked of a crisis of belief in the humanities. Like Marlowe's Faustus, the modern-day humanist feels that all the learning at his disposal has failed to give him power. The Humanities are not about the acquisition of knowledge, but about transferable skills, the purpose of which is to change people's feelings and raise their consciousness. Miall then considered a variety of ways in which the introduction of the computer in the classroom helps this by defamiliarising a text, by interacting with and challenging affective models, and by changing the nature of student/teacher relationships, of which he gave a detailed and impressive discussion. The teaching of literature in particular, he concluded, is concerned in the nineties with re-reading, rather than reading, for which tools such as hypertext are eminently well suited.
This was followed by a rather weak paper on the use of a standard text retrieval package called Personal Librarian, used by every student at the Stevens Institute of Technology to access the 7 Mb of set texts used in a course on the History of Science (Ed Friedman), and a rather stronger one on the implications of hypertext for poetry teaching by John Slatin, from the University of Texas at Austin. This was of interest more as a demonstration of what tools such as HyperCard look like from the perspective of an English literature specialist than for any concrete results presented. Alan Dyer (Lanchester Poly) picked up the same theme in his presentation, which concerned the way in which computing skills necessarily spanned the division between the traditionally linear 'readerly' skills and visual or spatial skills. He described, and later presented, an interactive hyperfiction produced by one of his students as an instance of what could be achieved when creative people were offered suitably powerful and easy to use tools.
Sebastian Rahtz (Southampton) gave a rapid but detailed description of the Southampton/York Archaeological Simulation System, in which a database of archaeological information is front-ended by something that looks suspiciously like a computer game, but which reportedly enables students to learn resource management. He also described the 'arch_help' system developed at York, in which a tailored form of the DEC mainframe Help system is used to provide students with organised information about courses, lectures, booklists and even accommodation details. Both systems represent a shift in stress away from 'teaching about the computer' to 'teaching with the computer'. Charles Henry (Columbia) initially treated us to a brief survey of the pedagogic importance of visualisation in cognition and memorising, from Pestalozzi to neural nets. His subsequent attempt to use the insights gained in analysing the structure of the Old English epic Beowulf was fascinating but too short to be convincing.
Arthur Stutt (OU) began by quoting Umberto Eco's definition of a novel as a machine for generating interpretations. The artist, he argued, has always been ready to apply technology: the special contribution of the computer should be to facilitate processes otherwise impossible. Pointing to the importance of argument in the humanities, he made a good case for extending the traditional single-explanation school of expert systems to cope with the traditional formal stages of argument. He did not draw a parallel between the renaissance view of rhetoric as an essential component of the humanities on the one hand and, on the other, the need to teach techniques of argument as, in Gardin's phrase, 'propositions which mutually support each other'.
R.A. Young (Dundee) also dealt with ways of formalising knowledge, but from the point of view of the professional philosopher. He identified a tension between the different attitudes to conceptual processes implicit in the construction of formal logics by philosophers in the Russell tradition, on the one hand, and the need to make expert systems that behave 'as if' intelligent, which characterises knowledge engineering, on the other. There was a need for synthesis, not least because of traditional philosophy's ability to deal with inconsistencies and ethical issues generally mishandled or ignored by the knowledge engineering paradigm.
Paul Davis (Ealing College) described a hybrid music system, and indeed performed on one, after dinner. He gave a brief survey of various approaches to the synthesis of music, stressing the importance of the performer in designing appropriate interfaces for digital music systems, and asserted that music science was an area rather than a discipline. Coming at the end of a long and intellectually demanding day, his presentation seemed a little under-powered.
Lynette Hunter (Leeds) began by attempting a structural analysis of contemporary computing mythology, in terms of the dominant myths of the Western post-renaissance man-made world. The machine offered an illusory promise of freedom from drudgery by its power over semiotics, mediated by the magical powers of the shaman (or computing advisor) and vicarious participation in the club culture of the technocracy. But (as David Miall had already remarked) it conferred only the appearance of power. She then described recent changes of emphasis in the computing component of the Leeds arts courses. Reductionism and the myth of exactitude were inimical to the humanistic skills of analogy and metaphor. The place of the computer was to help in marshalling facts and memory, and so it fitted better into courses dealing with textual editing or bibliography, where classification skills and principles of selection needed to be taught.
Alison Black (Reading) gave an interesting and well-presented paper on the differing reactions and achievements of students introduced to designing documents on paper and on screen. Her talk was effectively illustrated with examples of projects undertaken by the students and by statistics drawn from questionnaires aimed at assessing student reactions at different stages of their exposure to the different methods of document design. Her analysis of the way new technology affects working practices was clear and convincing as was her warning that whilst WYSIWYG desktop publishing has a lot to offer the design student, we should not be so dazzled by its superficial merits as to forget its limitations and to abandon more traditional methods of document design.
Cell biology was the somewhat surprising subject of the demonstration provided by Wendy Hall, the object of which was to present a hypertext system developed at Southampton with HyperCard. This linked images held on videodisc (some 54,000 images per side) with extracts from relevant textbooks, adding sound and animation where appropriate. Although this particular project was not humanities based, Wendy Hall was quick to point out the general applicability of the technology and the pedagogical methods behind it. In his paper 'Videodiscs and the Politics of Knowledge', Roger Martlew (Southampton) returned to a key theme of this conference: the relative roles in the classroom of the teacher, student and computer. Like David Miall and the DISH duo, he argued that traditional styles of Humanities teaching impose specific roles on both lecturer and student which computers had the potential either to fossilise or to radicalise. He clearly felt that recent pedagogical developments in secondary education were equally applicable at the tertiary level, and that the lecturer 'must cease to be a controller of knowledge, and must become a manager of learning'. The link between Martlew's archaeological videodisc and the politics of knowledge became slightly blurred in the talk, but recourse to the abstract of his paper sets us back on the right track with the reminder that 'the control of access to visual information in archaeology confirms the lecturer's power over the educational process'. The videodisc controlled by a lecturer could be used in the same didactic way as the traditional 'chalk and talk' methods of teaching; the videodisc controlled by the student was equally possible, if the lecturer has the courage to renounce power for the sake of pedagogy.
The last full session of the conference was concerned with three major teaching packages. In the first, Frank Colson described and demonstrated the HIDES CAL software package, used at Southampton as an important part of the special subject component of the history degree. The software runs on a network of PS/2s located in the University library, and presents students with a structured walk through documentary sources, supported by impressive graphics. It was claimed that students enjoyed using the system, and that it also led to their making greater use of original (non-computerised) sources. In the second, Susan Hockey and John Cooper described, and Jo Freedman demonstrated, the 'Oxford Text Searching System', developed at Oxford with CTISS funding to encourage arts undergraduates to use concordance and free-text searching software in their study of set texts. Finally, Nicholas Morgan and Richard Trainor described (but did not demonstrate) some of the principles underlying the development of the highly successful DISH project for teaching history at Glasgow. They reiterated the changes in the teacher-student relationship made possible by the use of computers: the transformation of the instructor from teacher into guide, and the resulting emphasis on exploration and on the diversity of insights resulting from a variety of routes through the material offered.
The conference was closed by Nigel Gardner (now with ESRC), whose valedictory address as head of CTISS indicated that the Initiative had been less successful as an exercise in institutional change than in causing reassessment of the requirements of specific disciplines. The next round of CTISS funding (announced at this conference by Gardner's successor, Jonathan Darby) was thus aimed specifically at setting up topic-oriented 'centres of excellence', which would need to address more precisely such matters as project management, staff training, resource control and evaluative procedures. Gardner also suggested that there was a shift in the role of computing centres which, if they were to survive at all in the world of the individual workstation, needed to re-emphasize their role in providing administrative computing facilities, and support for telecommunications and 'learning resource centres' (what we used to call libraries).
A somewhat sporadic general discussion followed this closing address, but did not really bring together the two major themes that had run through an unusually well-balanced and unified programme. The first is that, with or without the presence of a computer terminal in the classroom, teaching methods in tertiary education must move away from the traditional master/disciple roles which the availability of hypertext systems and videodiscs is beginning to expose and challenge. The second is that whilst new technology has a lot to offer the humanities, particularly teaching in the humanities, a great deal of caution and selection should be exercised in the manner and degree to which it is applied. In retrospect, though little was actually said about redefining the humanities as such (except by those who wanted to annex computer science), quite a lot of thought had evidently gone into redefining the teaching of the humanities. Several speakers referred in passing to the copyright problems implicit in using electronic materials for teaching purposes: this has been a recurrent area of concern, and it is to be hoped that a special session at some future conference will address it explicitly.
Lou Burnard and Judith Proud (Oxford Text Archive)
The SGML Users' Group is an informal pressure group with (as yet) a fairly small membership drawn largely from software vendors, universities and the publishing industry. It distributes a useful Newsletter containing SGML-related news, product announcements and so on, organises regular meetings, of which this was presumably typical, and is entitled to sell standards documents at a knock-down price to members only.
Nigel Bray, MD of the British end of a Dutch software house consultancy called MID Information Logistics Group Ltd, spoke first. MID distribute and support Datalogics products, which are targeted at large-scale publishing requirements, typically involving lots of pictures, publications often derived from an online database, and much revision. He described in some detail products called WriterStation and Pager and gave an overview of the production process using such systems. WriterStation is a conventional word processor running under MS-DOS except that it knows about document type definitions (DTDs), against which it validates input, inserting SGML tags as appropriate, and which are also used by its context-sensitive editor. The DTD for a given application is held in a sort of runtime module called the type definition file; a similar runtime file is used to provide a viewing format for the text. The latter uses only standard PC facilities and is thus some way short of WYSIWYG, but still provides helpful (and tag-free) visual feedback to the user. It does not support any character set other than the standard IBM set. Pager is a batch pagination system, which can integrate text produced by WriterStation with graphic images of all kinds via a GREP look-alike rather optimistically described as an "omniscient conversion system". This led to some discussion of the problems of converting from other typesetting markups to SGML (a process the chairman characterised as "fundamentally akin to alchemy") and much bickering from the floor about how (for example) such systems could possibly handle tables and diagrams. In conclusion, Bray remarked that a typical system should support input and maintenance of text via SGML workstations into a database management system (he mentioned particularly DM and Oracle), which could provide job tracking and page control facilities as well as the ability to restructure text into different presentations. Publication on paper or even on CD-ROM was no longer seen as the primary purpose of such systems.
This was followed by an intriguing (if rather pointless) discussion about the feasibility of using SGML to mark up mathematical text in a meaningful way given by Paul Ellison from Exeter University's Computing Centre. He is active on the BSI Technical Committee dealing with text and office systems, which is currently reviewing among other things Clause (i.e. chapter) 8 of the current ISO Technical Reference paper 9573 Techniques for using SGML. This aims to produce a DTD capable of dealing with mathematics. He began (as SGML presentations tend to) by attempting to answer the question "Why not use TeX?", immediately conceding that if the object of the exercise was a document in the house-style preferred by the American Math Society, there was really nothing to be gained by not using TeX. If however continuity of markup was desired, and suitability for a syntax-directed editor, and especially if the markup was to reflect something of the computability of the mathematics, then an accurate DTD would surely be preferable, despite its verbosity, of which he gave ample illustration. As a mathematical ignoramus I was reassured to find that written mathematics contains almost as many ambiguities as written English: for example x with a little n to the right and slightly above it could mean x <power> n or x <superscript> n; and dx above dy with a line between does not represent a fraction of some sort. It was suggested that coping with this was a particular instance of something more general, dubbed by Ellison "the secretary problem" - TeX had (after all) been designed for use by mathematicians. Mike Clarke (Imperial) remarked that Mathematica provided means of solving all of these problems and could be made to generate SGML, but did not elaborate.
Over lunch, I re-met a Dutch acquaintance called Gerd van der Steen, formerly attached to the University of Amsterdam, where he had been developing parsing systems for historical (and other) documents, and who now works for the Dutch end of MID. After lunch, Martin Bryan (SOBEMAP) stepped into the breach left by the defection of one of the advertised speakers (Neil Morley), who was to have spoken on the subject "What the Publishers Association is doing about SGML". Since this appears to be "not a lot", the task was probably not too taxing: a booklet introducing the concepts of SGML had just been distributed to all PA members; unlike the Association of American Publishers, the PA envisaged no DTD specification, but publishers were recommended to use a recently published book (by Bryan, as it happens), SGML: An Author's Guide, as a source for models. Something called a "management awareness campaign" was also underway.
The last speaker of the day, M. Moricon from a French software house called Advanced Information Systems, had expected to be invited to speak on hypertext and SGML but (due to another last-minute defection) had to prefix this with a brief rundown of events in France. The French Publishers Association had set up a working group chaired by Dominique Vignot, which had taken the AAP's DTD and translated and adapted it for the French market. Some major publishers (Lefevre, Hachette) and printers (Maury, Jouve) were known to be using SGML. The French Electricity Board was reportedly considering its use as a way of coping with the tons of documentation relating to the Nuclear Energy programme; the scientific community was also interested, but not at a sufficiently high level to enforce standards. Of more interest than this third-hand gossip was the remainder of the presentation, in which Moricon talked about the conversion of the CSTB's building regulations into an SGML-based hypertext. These regulations, which have statutory force in France, are being converted from printed form to an electronic database, from which they can be extracted for printing or excerpting as well as for electronic browsing in a hypertext published on CD. There is about 15 Mb of running text, into which tags will be automatically introduced by a combination of YACC and LEX (rapidly) and then corrected manually (slowly). A prototype hypertext had been developed from some sample entries using HyperCard, modified in some unexplained way to support buttons in scrolling fields. The experience had shown that it was quite difficult to get trained drafters of documents to understand the difference between implicit and explicit 'anchorage' of sections; that SGML tagsets making extensive use of attributes required a lot of complex programming; that a DTD created with hypertext in mind was a good way of identifying all potential linkages; that SGML was useful as a way of defining data entry conventions; and that because a DTD identified semantic components it was analogous to an information model.
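For readers unfamiliar with this style of retro-conversion, the sketch below (my own illustration in Python, not the CSTB pipeline, which used LEX and YACC; the tag names and patterns are entirely hypothetical) shows the general idea: mechanical rules insert the bulk of the markup quickly, leaving a residue to be corrected slowly by hand.

    import re

    # Hypothetical patterns: article headings ("Article 12.3") and cross-references
    # ("see article 7") in running regulatory text.
    ARTICLE_HEAD = re.compile(r"^Article (\d+(?:\.\d+)*)\s*$", re.MULTILINE)
    CROSS_REF = re.compile(r"\barticle (\d+(?:\.\d+)*)\b")

    def autotag(text):
        # First, purely mechanical, tagging pass; closing tags and any errors
        # are left for manual correction in a later pass.
        text = ARTICLE_HEAD.sub(r'<div n="\1"><head>Article \1</head>', text)
        text = CROSS_REF.sub(r'<ref target="art\1">article \1</ref>', text)
        return text

    sample = "Article 12.3\nEmergency exits must be kept clear at all times (see article 7)."
    print(autotag(sample))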
In a final roundup of European SGML events, Martin Bryan mentioned a recent EEC-funded training session held at Ghent. SGML training, he suggested, should aim to satisfy three distinct groups: management, end-users and document designers. The latter group required similar skills to those needed in database design or systems analysis. A special interest group concerned with the interface between SGML and databases proper was being set up, in which I expressed an interest.
The purpose of this meeting was to present the first published Review Draft of a Guide to the cataloguing of computer held files, which has been under discussion and development for the last two years by a working party organised jointly by the ESRC Data Archive at Essex and the Edinburgh University Data Library. Several other organisations had been active at various times in this or related working parties and most of them were also represented at this meeting. These included the British Library, the Institute of Information Scientists, the Archives Department of the University of Liverpool, the CCTA, the IUSC, the PRO, DTI, the Central Statistical Office, NISS and even the Oxford Text Archive.
Peter Burnhill (Edinburgh Data Library) began his brief overview of the Guide by acknowledging the encouragement and support of the ESRC, specifically in its recognition of the importance of bibliographic control of machine readable data as a necessary precursor of secondary analysis. He stressed that the Guide was intended to be a guide, rather than a standard. Its purpose was to provide information for those wishing to apply existing standards (AACR2 for the logical structure, MARC for the implementation) to the cataloguing of machine readable resources of all sorts. It was therefore primarily - but not exclusively - addressed to the librarian community. Amongst specific problem areas identified, Burnhill mentioned that of the "general material description": computer software typically came in many physical forms which might have little relevance from a bibliographic control standpoint; as an example he cited the "Aladdin" emulation package for the Atari, which was a box containing a chip set, a registration card, a manual, and two (incompatible) diskettes. To describe such heterogeneous objects adequately, the Report proposed four major subdivisions: bibliographic details, sufficient to identify the item described and its availability; a description of the item's subject and intellectual content; technical information describing the item's physical characteristics; and management and access information relative to local copies of the items. Most earlier proposals about how machine readable resources should be catalogued had tended to blur these distinctions. Burnhill also gave a brief history of how the Guide had come into being, inspired initially by a published discussion (by Sue Dodd) of AACR2 Chapter 9 and its interpretation. It was clear that new versions of the AACR2 standard would have to deal more thoroughly with the 'intellectual content' of electronic works.
Bridget Winstanley (ESRC Data Archive) circulated a short discussion paper on the 'next steps' for the working party. She described possible test vehicles for the proposals of the Guide, ranging from simple bibliographic lists to fully fledged union catalogues and (a new buzzword for me, though not for the librarians) something called "linked" catalogues, i.e. catalogues where the software is sufficiently intelligent to search many different catalogues at different sites. Conversion of the ESRC Archive's current catalogue (held in the Standard Study Description format) to MARC could be done automatically. The Bibliographic Control project at Essex, having looked at a variety of other communication formats, had reluctantly decided that MARC remained the only horse in the race, despite its somewhat jaded appearance. A simple PC-based interactive cataloguing system was under development, the output from which would be MARC records conforming to the guidelines.
Sue Dodd, as inspiration of the working party's effort, began by remarking that both the books she had written in the last five years were already outdated by technological change, but gave no specific examples. Traditionally, publishers had been the source of bibliographic information: this did not seem to be the case at present for the 'publishers' of machine readable data, whoever they might be, largely for social and organisational reasons. The advent of the Guide might make it easier for electronic publishers to provide the required information in a uniform manner. She spoke of a 'natural tension' between book and machine readable file standards. For a book, all the required information was present on the title page. Intellectual content, if required, was as easily accessible as a physical description (provided you knew how to read). This did not apply to a machine readable file. She drew a useful distinction between the logical (humanly determinable) characteristics of a file and the physical characteristics of its carrier, and (in my view) a less useful one between the latter and any system dependency. She concluded by mentioning three problem areas for bibliographic control: online services of the kind already provided by many libraries (how do you catalogue DIALOG?), works printed on demand in varying formats, and online works that are constantly updated.
Marcia Taylor (ESRC Data Archive) concluded the meeting by pointing to the existence of a Dissemination working party, of which all present were ipso facto made members, and there was some general discussion about the best ways of circulating the proposals in the Guide.
In my view, the Guide itself needs substantial rewriting before it can be let out of the Librarian community. However its structure and approach are eminently sound, and it is heartening to see professional cataloguers entering an arena too long left to the amateur.
This conference brought nearly 500 delegates, chiefly European, to the beautiful city of Koln in western Germany, currently celebrating the 600th anniversary of its university. Three international associations combined forces for the occasion: the International Conference on Databases in the Humanities and Social Sciences (ICDBHSS); the Association for History and Computing (AHC); and the International Federation of Data Organisations for the Social Sciences (IFDO). The preoccupations of these three organisations clearly having considerable overlap, a joint conference should not have been an altogether bad idea: there are many important respects in which the interests and skills of the social science data archivist or analyst and those of the historian are complementary. The organisation of the conference did not, however, encourage inter-disciplinary discussion, let alone cross-fertilisation. Indeed, the complexity and rigidity of the timetable gave very little scope for discussion of any sort - though of course, there were ample opportunities for private argument over large quantities of echte koelsch.
Somewhere between 150 and 200 papers were timetabled, with on occasion as many as six parallel sessions spread across the three days. It would be nice to report that this density reflected the richness and variety of the scholarship on display, but honesty does not permit me such politeness. The fact of the matter is that (judging only by the sessions I attended) a good third of the papers were either almost entirely innocent of intellectual content, or had nothing to say that had not been said a thousand times before, usually more concisely. There was also an unusually high number of scheduled papers which simply did not appear - perhaps mercifully. All of this had a dispiriting effect, which no amount of software demonstration, nor even the excellent buffet dinner provided by IBM, could dispel. The following biased and idiosyncratic account should of course be read only as an expression of my personal reactions, and makes no claim to impartiality or omniscience, or even accuracy.
Proceedings began with a plenary panel session, in which six speakers were due to expatiate on the subject of "databases and the future of electronic scholarship"; in the event there were only three. First off was Joe Raben who, as originator of the journal Computers & the Humanities and of the ICDBHSS conference series, as well as much else, has every right to rest on his laurels and refrain from stating more than the obvious: this he did, and at some length. He was followed by Nicoletta Calzolari, deputising for Antonio Zampolli, from the Istituto di Linguistica Computazionale in Pisa, whose brisk precis of trends in computational linguistics (the shift from grammar, to lexical studies, to analysis of corpora) and of the technological and social changes heralding the emergence of the polytheoretic linguists' workbench deserved better attention than it received from an audience already half asleep. As the third speaker, I tried (unsuccessfully) to provoke disagreement about the different paradigms within which databases are used, and to mediate an opposition between the hermeneutics of scholarship and the certainties of information technology by saying "look you, there is models in both".
After lunch, given a choice of sessions on Content Analysis, Computer Aided Instruction, Regional History and Data Protection (two other sessions were also timetabled, but did not apparently happen), I opted for the first, where I was first stupefied by an authorship study which had not progressed much beyond the smart idea of typing the text into a computer, and then amazed by a stylish stylistic analysis of crime fiction. Volker Neuhaus (Germanisches Inst., Koln) readily agreed that a highly formalised narrative such as the classic detective story "of the golden age" was that much easier to analyse using a small number of exhaustive taxonomies than other, perhaps less ritualised, material, but this by no means invalidated the methodological interest of his paper. Later in the same session, Peter Mohler gave an interesting presentation about the venerable General Inquirer program, now available from ZUMA at Mannheim in a PC version, and its use for classifying or codifying narratives for statistical thematic analysis. This session also included an impressive paper from Robert Oakman (USC) which demonstrated how frequency counts could be manipulated to cluster sections of Jane Austen's prose meaningfully, using an algorithm originally developed for clustering geological specimens according to the proportions of their component minerals. It is hard to see what this was doing in the same place as the other paper in this session, which supposedly concerned whether or not computing had anything of relevance to modern literary critical concerns, and proved to be of quite extraordinary crassness.
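For those unfamiliar with the technique, the following minimal Python sketch (my own illustration, not Oakman's procedure; the sample texts and the simple Euclidean distance are assumptions) shows the basic idea: each section of prose is reduced to the proportions of its words, and sections are then compared by the distance between those profiles, much as rock specimens can be compared by the proportions of their component minerals.

    from collections import Counter
    from itertools import combinations
    from math import sqrt

    # Toy corpus: three "sections" of prose to be compared.
    sections = {
        "s1": "the house was large and the garden was larger still",
        "s2": "the garden and the house were admired by every visitor",
        "s3": "no visitor admired the weather or the roads that autumn",
    }

    # Shared vocabulary: every word occurring anywhere in the corpus.
    vocab = sorted({w for text in sections.values() for w in text.split()})

    def profile(text):
        # Relative frequency of each vocabulary word in this section.
        counts = Counter(text.split())
        total = sum(counts.values())
        return [counts[w] / total for w in vocab]

    profiles = {name: profile(text) for name, text in sections.items()}

    def distance(p, q):
        # Euclidean distance between two frequency profiles.
        return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    # Pairwise distances; a fuller treatment would feed these into an
    # agglomerative clustering routine to build the actual clusters.
    for a, b in combinations(sections, 2):
        print(a, b, round(distance(profiles[a], profiles[b]), 3))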
I began the second day of the conference by chairing a tiny session on expert systems (someone had kindly volunteered me for this honour without first ascertaining whether I actually knew anything about the topic), made tinier by the fact that only two of my four speakers materialised, but larger by the fact that one of them had brought most of his research team with him. The team came from the University of Grenoble, and had developed an expert system for use by urban planners, to assist in decision making. Their paper had a strong theoretical content, but remained impressive. The other paper was more superficial, and consisted of some meditations about the applicability of expert systems to legal databases.
My obligatory presence in this session meant that I was unable to listen to the opening papers in the session devoted to Manfred Thaller's Historical Workstation project, notably his own presentation of the workstation concept, Gerhart Jaritz (Krems) on an iconographic project using Kleio, and Peter Becker (Gottingen) on family reconstitution. I did however arrive in time to hear Kropac (Graz) describe his Kleio-based prosopographical database, Muller (Salzburg) on 15th-16th century patterns of migration as deduced from the Salzburg "Burgerbuch" with Kleio's aid, and Bozzi (Pisa) on the Latin lemmatisation routines which are now incorporated into Kleio. I was impressed by all the databases presented in this section; what I felt lacking was any sense of quite where the research based on their use was heading. However, the collaborative and non-commercial ethos of the Historical Workstation project has much to recommend it, as does Kleio itself.
After lunch, I first listened to someone expounding how Pascal programs could be used to list all the names of people who might perhaps have been around on a given day in the Middle Ages at a particular court but, finding it hard to understand why this was either useful or methodologically valuable, subsequently decamped to a session on data archives. This proved to be unexpectedly interesting. I arrived too late to hear Marcia Taylor and Bridget Winstanley present the work of the Essex Data Archive's Seminar Series on the cataloguing of computer files, but in time to hear an excellent summary of the 'Trinity' proposals concerning the standardisation of historical datasets from Hans-Joergen Marker (Danish Data Archive). This was followed by two further papers concerning current standardisation efforts, one national, describing the framework being set up in the Netherlands for a Historical Data Archive (Van Hall and Doorn); the other international, on the work of the ALLC/ACH/ACL Text Encoding Initiative (Sperberg-McQueen). I found all three papers interesting and important; the session was also exceptional in that it provoked (and permitted) much useful discussion. For those who find the subject of data standards marginally more exciting than watching paint dry, let me add that the discussion centred on such matters as the social history of research (what datasets exist? what were they created for?) and was consequently far more concerned with interpretative issues (what does this codebook actually mean?), which are at the heart of the quantitative/qualitative divide in much current debate, than with whether to use square brackets or curly ones, ASCII or EBCDIC, etc.
The last full day of the conference offered sessions on a variety of topics: those I missed included art historical and archaeological applications, legal sources, incomplete data, and time series analyses. Instead I stayed with a session on more or less straight historical database applications: this included a French genealogical system using dBase, tweaked sufficiently to cope with the intricacies of the Bourbon dynasty (Selz-Laurier, LISH, Paris); a fascinating analysis of networks of influence in German state-sponsored research institutions using Oracle and multi-dimensional scaling (Krempel, Max Planck Inst. fur Gesellschaftsforschung, Cologne); and a rather less fascinating discussion of the difficulties of handling orthographical and semantic variance in a standardised historical dataset (Haertel, Graz). Dan Greenstein (Oxford) gave one of the more thoughtful and thought-provoking papers in this session, pointing out the conventional historian's uneasy relationship with "the bitch goddess quantification" and attempting to assess the extent to which (for example) the multiple encodings possible with true relational database management systems succeed in restoring the historian's intimate dialogue with his sources, with reference to his own work on the History of the University's enormous INGRES database. This was followed by an interesting re-telling of a paper originally written by Frances Candlin (Glasgow) as a programmer's-eye view of the historian's activities, but presented, with much embedded commentary, by Nicholas Morgan of the DISH project at Glasgow. These two papers alone offered ample opportunity for serious methodological debate, which was not, however, taken.
A large international conference of this kind is of course much more than a collection of research papers. This one also provided a shop window for an impressive panoply of software systems and books, as well as the obligatory gossip and politicking. The latter being inappropriate material for a report of this nature, I shall conclude with the former. Systems demonstrated included a full-colour art historical image retrieval system from laser disk developed for the Marburg Institute, the BBC Domesday Project and IBM's famous Winchester Cathedral graphics model, as well as a host of PC-based software nearly all of academic origin (TuSTEP, Kleio, TextPack, ProGamma, HIDES, DISH, CODIL and a Hungarian concordance package called MicroDIC stay in my mind).
I concluded my stay in Cologne by attending the annual general meeting of the Association for History and Computing. The Association now has around 500 members from 23 predominantly European countries, only 60% of these being in the UK. Three new "branches" (one of the hottest political issues of the conference concerned what exactly a "branch" was) had been set up, in Italy, Portugal and France. A fat volume, based on the first Westfield conference, had appeared and, despite a devastating review in Histoire et Mesure (a journal edited by the President of the Association, it should be noted), would be followed by a second volume later this year. The British, Portuguese, French, Austrian and (newly created) Nordic Branches reported on their activities. Manfred Thaller reported that his standardisation work was progressing, and that another workshop might be organised next spring in Goettingen. The Archive group was sending out a questionnaire in collaboration with the Essex Data Archive. Ambitious plans to expand the Association's journal were announced, as were plans for a series of other publications.
CATH 87 (as it will no doubt come to be known) was an unusual event in several respects. For one thing, as Nigel Gardner (CTISS) pointed out in his introductory remarks, it approximated to that perfection proposed by David Lodge, a conference with no formal papers. For another, instead of published proceedings at some vague time in the future, all delegates were presented at registration time with a collection of essays by various hands covering most of the topics addressed by the conference, now published by Ellis Horwood as "Information Technology in the Humanities", edited by S. Rahtz.
Another unusual aspect of the proceedings, at least from my cloistered viewpoint, was that just as many of the 100+ delegates came from Polytechnics and other institutions in the 'public sector' of higher education as came from Universities and similar bastions of privilege. This burgeoning of interest may have something to do with the coming into existence of a working party on IT in the Humanities (public sector only) sponsored by the CNAA. This working party is chaired by David Miall and is currently conducting a survey, planning a workshop on the theme this autumn, and aims to set up a clearing house of some description.
There were in fact two formal papers: one at the start, from the charismatic Richard Ennals, and one at the end, from the even more charismatic (because French) Jean-Claude Gardin. Ennals, who is now at Kingston CFE, was inspirational (at least in intent) on the importance of the humanities and their under-rated powers which, he insisted, could be made more effective still by the appropriate use of computers. AI, the 'technology of thought', might provide a way of bridging the gap between the "two cultures" (Ennals is clearly a child of the sixties); the absence of Theory from the humanities might be a strength; Piaget's beneficial influence on primary school teaching needed to be carried through into the secondary system; logical positivists were a lot more 'dehumanized' than computers; rules (as in expert systems) could be descriptive rather than delimiting; input from the Humanities was needed because of the complexity of the problems to be tackled. These and similar ideas served to illuminate, fitfully, Ennals' final proposition of "computational politics" - that software engineers could profitably learn from social engineers. This highly seductive notion relied on what (I suspect) is a purely metaphorical similarity between the transition from single CPU to parallel architectures on the one hand, and the transcending of solipsism in the modern democratic state on the other. It was a bravura performance.
In between the two formal papers, there were six parallel workshop sessions, each on specific topics, and also three introductory tutorial sessions. The organisers of the workshops had been briefed to stimulate discussion and argument rather than simply read out papers, which for the most part they did. The range of topics covered was impressive, as was the concentration of expertise. I attended workshops on Concordances (P. King from Birmingham), Programming (Wendy Hall from Southampton), Art History (Dave Guppy and Will Vaughan from UCL), Classics (Kevin O'Connell from Exeter), Linguistics (L. Davidson from Leeds) and Literature (Tom Corns from Bangor), thus missing inter alia S. Rahtz on Archaeology, R. Trainor on History, G. Davies on CALL, J. MacGregor on Theology, A. Pearce on Music and P. Salotti on Databases.
I found the Concordances Workshop rather disappointing, though it did stimulate much discussion. King was anxious to demonstrate his own concordance generator, which runs on an Amstrad word-processor, though he did bring out several useful applications for its output (fairly rudimentary KWIC lists) in teaching non-native speakers of English to identify patterns in contemporary usage. There was much argument about the normative effect of such exercises. Several people enquired about micro-OCP.
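For the uninitiated, a KWIC (keyword-in-context) list simply lines up every occurrence of a chosen word with a little of its left and right context; the minimal Python sketch below (my own illustration, not King's Amstrad program or OCP) shows the kind of output being discussed.

    def kwic(text, keyword, width=30):
        # Produce rudimentary keyword-in-context lines: each occurrence of the
        # keyword sits between fixed-width windows of surrounding text.
        words = text.split()
        lines = []
        for i, w in enumerate(words):
            if w.lower().strip('.,;:!?') == keyword.lower():
                left = " ".join(words[:i])[-width:]
                right = " ".join(words[i + 1:])[:width]
                lines.append(f"{left:>{width}}  {w}  {right}")
        return lines

    sample = ("He would make a point of it, and he made the point again "
              "whenever the point seemed in danger of being missed.")
    for line in kwic(sample, "point"):
        print(line)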
The Programming Workshop was equally ready to tackle fundamental issues. Wendy Hall quoted Dijkstra on the malignant effect of BASIC to great effect and also (clearly) took a quiet pleasure in the total absence of any evidence that teaching programming was a good way of training people to reason logically. Dave de Roure advocated LISP; Sebastian Rahtz, Icon. Several people pointed out that the programming environment probably mattered more in determining the ease with which a language was acquired than the language itself; there was some agreement that the difficulty of structured languages might in fact be no bad thing. A gentleman from IBM endeared himself greatly to me by asserting that (a) any programming skills acquired at universities were totally useless in a commercial context and (b) it would be a lot more use to teach people how to organise and structure their data properly.
After dinner (bearable) we were rewarded for our persistence in trekking half a mile through pouring rain by a postprandial entertainment from Jon Nicholl of Exeter's Education Department. This consisted of demonstrations of three applications ('authorizations'?) of the LINKS program, a simple expert systems shell for use in comprehensive schools. One recreated a detective story apparently familiar to every former B.Ed student; one (written by a ten year old) impersonated a mediaeval physician; one had to do with Devonian placenames. The second was the most fun; the subtext of the presentation was that teaching project work in this way could actually be a lot more fun as well as getting across some interesting principles of abstraction. Though I wondered whether hierarchic tree structures might not turn out to be just as mentally crippling as BASIC.
Dave Guppy opened the Art History Workshop with a sceptical survey of possible computer applications, including image processing, storage problems, indexing problems and so on. For him, Art History was about fundamentally difficult and affective aspects of experience. Will Vaughan tried to redress the balance by pointing to the possibilities of new storage media as means of manipulating images, but had to agree that there were still very few real applications outside museums. As a case study Guppy provided us with two very nice pictures of naked ladies, one by Giotto and the other by Titian, together with commentary by a distinguished art historian called Freedberg, and the workshop eventually cohered in a long discussion about how a computer could possibly have assisted in his analysis. (Not a lot, it transpires.)
The Classics Workshop was somewhat of a misnomer, and was also nearly floored completely by an uncooperative AT. Fortunately Kevin O'Connell was too much of a professional to let this seriously impair his presentation of how Nicholl's LINKS could also be used to represent the plot of Antigone, though it did slow down somewhat his description of an expert system (programmed in micro-Prolog) based on the "Roman World" flash cards which are apparently now widely used to teach classics (if 'widely' is the right word). The claim was that a model of the inter-relationships recorded on Latin inscriptions from Lugdunum could be adequately represented and easily manipulated using micro-Prolog; I remain unconvinced.
Of those I attended, the Linguistics Workshop probably adhered closest to the organisers' brief, perhaps because Leeds is one of the few places where computing is taught as an essential component of the Linguistics course. Davidson described in some detail the various parts of this teaching, plotted against two axes which he saw as mutually exclusive, viz. the type and amount of purely computational skill needed, and the direct relevance of the skill acquired to the academic subject. He raised a number of pedagogically important issues, notably that current research in linguistics seems to be depending more and more on computational models which owe little or nothing to formal linguistics (which did not use to be the case). One prime case is the 'simulated annealing' parsing project at Leeds, which uses a purely stochastic model; another is the need for socio-linguists to employ purely sociological data, such as census returns. Most of the discussion centred on what actually gets taught. Leeds' BA students apparently thrive on a three-day intensive course covering the rudiments of CMS and OCP together; there was little support (presumably as a result of bitter experience) for my view that general courses on operating systems were better left to computing centre staff.
Tom Corns began the Literature workshop by asserting simply that literature was very difficult for humans, let alone computers, because of the complexity and subtlety of readers' responses to it (which was one of the strengths of the case according to Ennals). Perhaps more significantly, (and certainly more memorably), he remarked that literary criticism remained "totally innocent of computer-aided achievements", despite the fact that the subject itself was still alive and well. Stylistics, which had once seemed to offer the computer an entree, had been effectively killed off by the likes of Fish on theoretical grounds, while the McCabe/Eagleton radico-deconstructionist-feminist axis had no time for the "toys for the boys" technological ethos. But as all good critics (and readers of Kipling) know, ignoring the technology of your day simply marginalises your discipline. The bulk of his presentation therefore concentrated on immediate strategies to raise the level of awareness of computational possibilities amongst the current crop of students. The discipline had always required high standards of presentation and well organised bodies of data; the word processor, the database, and even the concordance were all highly effective means to those ends, if they had no more theoretically seductive claims on students' time. In the future of course, there would be other possibilities; amongst these he adumbrated the possibilities of an Old English CALL system, and something called "advanced study aids", by which (I think) he (or rather Margarette Smith who shared the honours of this presentation) meant hypertext systems, incorporating a user-modelling component.
The proceedings were wound up by Prof Jean-Claude Gardin's formal paper, which (I regret to say) I did not fully understand, largely because of its use of mathematical formulae to express types of inferential methods and other habits of mind alien to my anglo-saxon soul, but which I suspect would have been congenial to Fish. Gardin has been eminent in the sphere of interpreting archaeological finds and other cultural manifestations for the last thirty years but (he said comfortingly) the only progress he could detect had been the recognition that there could be no independent standards to describe such objects: there are as many descriptions as there are research goals. Like Ennals, he saw interesting opportunities in AI systems, not just because they are well funded, though that should not be ignored, but because they paralleled his current research strategy. A given set of semiological components (representation systems) can be acted on by different processing components to reach different conclusions, according to different goals; in the same way, a given set of facts and rules may be acted on by an inference engine to construct a knowledge-based system. The recursiveness of deconstructive criticism was exemplified at some length: Jakobson and Levi-Strauss's study, supposedly saying all there was to be said of Baudelaire's "Les Chats", had stimulated 28 critical responses, which they had dutifully included in a revised edition, and so on. He also felt the need to preserve 'bilinguism', that is, to present results in ways appropriate to (one's expectations of) one's readers' likely expectations.
If Ennals began this conference by assuring us that the humanities had something to offer the world, then Gardin closed it by reminding us that whatever that might be it was not certainty, and that scientistic rigour was as out of place in the humanities as we had always hoped. In between, we had had ample opportunity to see what the technology could do and how it could be shaped to our ends, provided of course we could determine what those might be. I have already remarked on various unusual aspects of this conference; perhaps not the least significant of these was a general willingness to confront and discuss quite fundamental issues at a non-trivial level.
WITH G.P. Neate (Bodleian Library)
The purpose of this expedition was to tie up loose ends left over from Bodley's long standing investigation of the suitability of the Memex text-searching engine as a means of providing online access to the pre-1920 and similar catalogues. This investigation began with a British Library R&D Grant in 1983, at which time Memex was hosted by the Library's own PDP-11; for a variety of technical reasons this proved inadequate to the task, and the project was temporarily dropped in 1985. In 1986 Memex set up a marketing agreement with Gould which proved to be a distinct improvement both commercially, in that there are now several installed systems running on Gould minis, and technically in that they now have a demonstrable version of the Bodleian's catalogue. Hearing this, Geoff Neate arranged a two day trip to Memex's East Kilbride offices, and kindly asked me to accompany him. In the event, although Memex were still unable to demonstrate a true working version of the catalogue, the visit proved well worthwhile.
We were first given a detailed account of the company's current state and market prospects, which look much healthier as a result of the one-year agreement with Gould. The company now employs 18 staff at East Kilbride, and seven at the Edinburgh research lab. There are nearly twenty pilot systems now installed, and some of these were described in some detail. They included the usual unspecifiable Defence and Police applications, and some fairly boring ones like a database of all the telexes received at Peat Marwick Mitchell's New York office, but also some rather more imaginative systems such as 20 Mb of script summaries maintained by TV-am, which could be searched for visually interesting snippets such as President Reagan picking his nose on camera etc. In the commercial world MEMEX's speed both in search and in set-up time makes it a natural for companies wishing to scan the 'Commerce Business Daily' - an electronically published list of all US Government jobs currently up for tender - or even (I suppose) the body of case law maintained by Context Legal Systems. There are no other library applications however, which is largely attributed to librarians' lack of desire to step outside the approach favoured by the British Library.
Development of the product continues; the exclusivity of the Gould arrangement has now lapsed, which means that development is now concentrating on the DEC and DEC OEM marketplace. One (very interesting) current version of Memex is a single board that plugs into the Q BUS on a microVAX2 running VMS and costs about 5000 pounds; similar boards are available for bigger machines with prices up to 20,000 pounds. Because the device uses a standard VME BUS, it can be configured into a wide range of hardware; one other possibility clearly under consideration was the SUN workstation.
The current system operates in a way quite similar to conventional indexing systems. The text is regarded as a flat file of hierarchically organised structural units (document, chapter, paragraph, sentence for example) which are composed of tokens all of a single type. Conversion of text to "infobase" (sic) involves the creation of an index of non-numeric tokens (the "vocabulary") which maps the external form of each such token to a unique symbol. The text is stored in a compressed form by replacing each token by this symbol, which may be up to 3 bytes long. Capitalisation, whether or not the token is a word-separator, and whether or not it is a number are all indicated by flag bits. Tokens recognised as numbers are converted to fixed- or floating-point form and excluded from the vocabulary.
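To fix the idea, here is a minimal sketch of that encoding scheme as I understand it; the flag values, the tokenising regular expression and the function name are mine, not Memex's, and real symbols would be packed into at most three bytes rather than held as Python integers.

```python
import re

# Flag bits carried alongside each symbol (values are illustrative).
FLAG_CAPITAL   = 0x1   # token began with a capital letter
FLAG_SEPARATOR = 0x2   # token is a word-separator (space, punctuation)
FLAG_NUMBER    = 0x4   # token was recognised as a number

def build_infobase(text):
    """Compress text into (vocabulary, encoded stream) in the manner described above."""
    vocabulary = {}   # external form (lower-cased) -> symbol
    encoded = []      # list of (symbol or numeric value, flags)
    for token in re.findall(r"\w+|\W", text):
        if token.isdigit():
            # numbers bypass the vocabulary and are stored directly
            encoded.append((int(token), FLAG_NUMBER))
            continue
        flags = 0
        if token[0].isupper():
            flags |= FLAG_CAPITAL
        if not token.isalnum():
            flags |= FLAG_SEPARATOR
        symbol = vocabulary.setdefault(token.lower(), len(vocabulary))
        assert symbol < 2**24   # a symbol must fit in at most 3 bytes
        encoded.append((symbol, flags))
    return vocabulary, encoded
```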
No occurrence index is maintained. Searches are carried out by first scanning the vocabulary for the required terms (so zero-hit searches are very rapid indeed!) to extract the corresponding codes; then delegating the search for these codes to the Memex board (this has a throughput of around 0.4 Mb/sec, or - since it is operating on compressed text - effectively about 200,000 words/second). Hit records (i.e. addresses within the file) must then be decoded for display, or may be retained for further (set) operations. In the version of Memex available on Gould (though not that now implemented on VAX) inspection for proximity matching also has to be put off to this post-processing stage, as it does with CAFS.
Unlike CAFS, the MEMEX hardware does not support any sort of fuzzy matching: all search terms must be stated explicitly. The availability of the vocabulary file goes a long way to counteracting this inconvenience, and it is possible to add a "reversed" vocabulary file so that words ending with particular strings can easily be identified; obviously the full generality of the facilities available with CAFS fuzzy matching is still not catered for however. If the number of search terms exceeds the number of search channels available (8, cp. CAFS 16), the query optimiser will initiate more than one scan through the file transparently to the user, rather than rejecting the search as CAFS currently does.
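Continuing the sketch above, the search path might look roughly like this: resolve the query terms against the vocabulary (a reversed copy makes "ends-with" queries cheap), and only then scan the compressed stream for the resulting symbols, which is the part the hardware board does at speed. Again the function names and details are mine, not Memex's.

```python
def lookup_terms(vocabulary, term=None, ending=None):
    """Resolve an exact query term, or a suffix, to the set of matching symbols."""
    if term is not None:
        return {vocabulary[term]} if term in vocabulary else set()
    # "reversed" vocabulary: keys stored back-to-front, so a suffix becomes a prefix
    reversed_vocab = {word[::-1]: sym for word, sym in vocabulary.items()}
    target = ending[::-1]
    return {sym for rword, sym in reversed_vocab.items() if rword.startswith(target)}

def scan(encoded, wanted_symbols):
    """Software stand-in for the hardware scan: yield hit addresses (positions)."""
    for pos, (symbol, flags) in enumerate(encoded):
        if symbol in wanted_symbols:
            yield pos
```

A zero-hit search never reaches the scan at all, which is consistent with the observation above that such searches return almost instantly.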
For very large files, a signature file can also be maintained to optimise performance by allowing for focussed searching, in much the same way as the Advanced CAFS Option. With all these options in place, the amount of filestore space saved by the compression becomes rather less significant; detailed figures calculated for one of the Bodleian files only (DAFZ) show that although the original raw data file (16.6 Mb) was reduced to 12.6 Mb, the amount of space needed for ancillary indexes etc brought the total filestore requirement for this file up to 21.9 Mb; the CAFS searchable form of the same file was 23 Mb. Compression is still a very effective way of speeding up the search process, of course, simply by reducing the amount of data to be scanned.
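For readers unfamiliar with signature files, the idea can be caricatured as follows: each block of the text carries a precomputed bit signature built by hashing its tokens, and any block whose signature does not contain all the bits of the query terms can be skipped without scanning it. This is my own toy reconstruction of the general technique, not the Memex format.

```python
SIG_BITS = 1024   # width of each block signature (illustrative)

def signature(tokens):
    """Fold a collection of tokens into a fixed-width bit signature."""
    sig = 0
    for t in tokens:
        sig |= 1 << (hash(t) % SIG_BITS)
    return sig

def candidate_blocks(block_signatures, query_terms):
    """Yield indexes of blocks that might contain all the query terms."""
    q = signature(query_terms)
    for i, block_sig in enumerate(block_signatures):
        # superset test: may give false positives, never false negatives
        if block_sig & q == q:
            yield i
```

In practice the block signatures would be built once, at infobase-creation time, and stored alongside the compressed text.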
The other possible drawback of storing text in compressed form - updating problems - is obviated to a large extent by the provision of an online screen editor which operates on the "infobase" directly. We were not able to see this in action, but from its description in the documentation it seems more than adequate for most uses.
As currently packaged the system does not support multiple indexes nor any other way of categorising tokens within an index, except insofar as numbers are specially treated. The sort of precision made possible by CAFS SIF features is thus entirely lacking. To search for "London" in a title rather than "London" in an imprint, we had to resort to the rather counter-intuitive process of specifying that "London" must precede the word "Imprint" in the record; to search for books printed in Tunbridge Wells, one would similarly have to search for "Imprint" and "Tunbridge Wells" in that order and within 3 words. Aside from their reliance on the existence of the tokens "Imprint" (etc) within the record, neither procedure worked entirely satisfactorily in the Bodleian data, which contains multiple bibliographical entities within one record.
Post-processing facilities in the software demonstrated were quite impressive: the user can combine results of searches, mark particular hits as significant, narrow or broaden the search focus, re-run previous searches, interrogate a history file etc etc. The query language used is also reasonably comprehensive, though its syntax would present some problems to users not previously exposed to such notions as "exclusive or" or "proximity match" or "regular expression"; it would be quite easy to hide all of this, as a CALL-level interface to the search engine is also provided, directly accessible from C programs.
Documentation provided consists of a programming manual and a descriptive user guide, which is reasonably accessible. (Though it does include the following benumbing sentence: "The NOT operator is existential and cannot be interpreted as an 'outwith' operator in the case of proximity".)
The staff at MEMEX were very helpful, not just in their willingness to explicate sentences of this type, but also in their readiness to let us take over one of their Gould machines for a day's tinkering. Unfortunately, the transfer of the pre-1920 catalogue had not been done properly, several of the records being incomplete and the numeric fields being incorrectly translated, so it remained difficult to make an accurate assessment of the system's performance. However, so far as we could tell, one complete scan through all 12 'infobases' into which the pre-1920 catalogue is currently divided, assuming that the tokens to be searched for exist in every file, would take around 5 minutes. This compares favourably with the current CAFS guesstimate for the same operation, which is around ten minutes. We carried out rough timings for a range of searches against one of the files; these are detailed in Geoff Neate's report.
Testimony to the ease with which text can be converted to a Memex "infobase" was provided by the Cart Papers, a collection of 17th century documents which we brought with us on a magnetic tape, and were able to search (on the micro-VAX) within a few hours.
We also learned something of the company's future plans. Of most interest here was something called the "Vorlich machine" currently being designed at their Edinburgh research laboratory. This device will use the kind of pattern recognition algorithms built into the current generation of image and voice recognition systems to tokenise free text by hardware, thus doing away with the need for the current encode/decode software.
As yet, Memex do not have a system which we could consider as an off the shelf user text searching product. Neither have they actually demonstrated to us all of the claimed potential of their current product as a library searching system-builder. Nevertheless, the company now has a secure financial basis from which to engage in the sort of primary research needed to make one, together with a great deal of expertise. Their switching to DEC hardware with or without the UNIX environment to host the system also makes them very attractive in the academic context. If hardware assisted text searching engines do become commonplace in the next few years, as they show every sign of doing, Memex must have a bright future.
Westfield College campus begins increasingly to resemble the set for some grimy Channel 4 documentary on the state of British Education. The exteriors of its gracious 19th century buildings are suffering a rash of desperate fly-posting while their bare interiors remain un-redecorated and unloved. For this conference, the ruins have put on a pretence of being inhabited still, which somehow makes them all the more depressing. In an ambitious moment twenty years ago, Westfield erected a functionalist science block, derelict for the last few years since it lost its science department; for this occasion it has been unlocked and its ground floor heating switched on. Ghosts lurk in the corridors, however. Elsewhere, in what was once a library, there are still a few comfortable chairs and a non-stop coffee machine, but all the bookshelves are bare.
Maybe the atmosphere affected my judgement, or maybe it's just that it had a hard act to follow, but I found this second conference less exciting than the first one. There was the same extraordinarily broad-based constituency of delegates, from secondary school teachers to academic researchers, as well as a significant European presence (except for the French who were conspicuously absent): the attendance list includes nearly 500 people. There was also the same abundance of material: around 250 papers crammed into two days of parallel sessions. Considerable effort had been made to group papers on a common theme into the same session, which encouraged more detailed and informed discussion but discouraged the serendipity I had enjoyed at the previous year's event. The distributed nature of Westfield's surviving lecture rooms also made it very difficult for butterflies like myself, once stuck in a group of rather limp papers on the applications of Knowledge Based Systems in secondary education, to escape to the parallel session on "Recent advances in historical demography" which was clearly where the real action was going on.
There were two plenary sessions, of which I attended only the first, which was a "keynote address" style lecture by Roderick Floud. Prof Floud has been somewhat of a pioneer amongst computing historians, having published an article advocating the use of electronic card readers in 1973. His lecture was enthusiastic but decently sober about the micro revolution, stressing that new tools did not mean new methods. In the future, he was confident that data input methods would remain a central problem, however advanced the technology. He described what he called a "prompting data input program" that had been developed for use in capturing US Army pension records, demonstrated the ease with which data could be manipulated by a typical cheap micro dbms/spreadsheet package (REFLEX, no less) and concluded with a plea for historians to fight against the "mythology of computing".
As aforesaid, I made the mistake of choosing the wrong session from the four parallel workshops offered next, from which I gained nothing but a nice new acronym (MITSI - the Man In The Street Interface). The third paper in this group was the best: it was from a Portuguese scholar, Carvalho, who had developed an expert system for handling about 2000 depositions of "sinners" as recorded in 17th-18th century ecclesiastical court records. Unfortunately his English was not up to the task of explaining a great deal of its inner workings, though the principles seemed clear enough.
I had no choice in the next set of four: whatever the rival attractions of "Urban and regional studies" (quite a bit), "Higher Education Seminar" (rather less) or "Prosopographical studies" (rather more), I had to attend the workshop on "Relational database method", if only because I was giving the first paper in it. This (a rapid introduction to SQL and the relational model using D. Greenstein's Philadelphia data as an example) had to be boiled down from about 2 hours' worth of overheads to a very fast 30 minutes, but it seemed to go down reasonably well. Phil Hartland (RHBNC) then gave an unusually clear and jargon-free exposition of the virtues of SSADM in managing large projects: two intriguing examples he mentioned were a projected history of the music hall and also a database about music in the 18th century. Michael Gervers from Toronto (one of the few non-Europeans present) reported on his Pauline conversion to ORACLE in much the same terms as last year: he has now produced some quite interesting results about changes in the landholding status of Mediaeval textile workers.
Next day, I arrived in time for the last part of an informal workshop on data standardisation chaired by Manfred Thaller, which appeared to be making very little progress: someone was pleading for a set of 'ethical guidelines'. After coffee, I plumped for the session on "Problems of multiple record linkage", thus missing the intellectual ("Recent advances in historical psephology"), the exotic ("Schools Education Seminar") and the ineluctable ("Academic word processing" - a dizzying combination of TeX, LaTeX and Tustep). My chosen session began with Arno Kitts' (Southampton) solid exposition of the historical and methodological problems involved in accurately linking together Portuguese names as they appear in 19th and 20th century passport lists, electoral rolls, cemetery lists etc. The object of the exercise is to determine patterns of emigration: calculating for example the rates of return migration. The linkage procedure should be completely automatic (he asserted) to avoid subjectivity, but necessarily involved dictionary lookup for some more widely varying name forms. None of these problems seemed to worry the next speaker, our very own A. Rodriguez, whose record linkage problems were virtually non-existent: her data consisting of some 8000 records of births, marriages and deaths in all of which surname, forenames, and father's names are all obligingly present. Even SIR could cope with data as simple as this: all that was necessary was a massive sort on the names, followed by a forty-line piece of procedural gibberish to insert links between records with the same namestring present, written for her by the obliging D. Doulton of Southampton, centre of SIR fanaticism in the known universe. The last speaker, Ariane Mirabdobaghli (LSE), was using Ingres to link 18th century parish and tax records: it was not at all clear how, which is a pity.
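The simplest form of linkage described there - sort (or group) the records on the full name string and connect any that share it exactly - is easy to sketch; the field names below are illustrative, not those of the SIR database in question.

```python
from collections import defaultdict

def link_records(records):
    """records: iterable of dicts with 'id', 'surname', 'forenames' and 'father'.
    Returns groups of record ids that share exactly the same name string."""
    groups = defaultdict(list)
    for rec in records:
        key = (rec["surname"].lower(), rec["forenames"].lower(), rec["father"].lower())
        groups[key].append(rec["id"])
    return [ids for ids in groups.values() if len(ids) > 1]
```

Kitts' problem is precisely that this exact-match assumption fails for widely varying name forms, hence his recourse to dictionary lookup.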
The remainder of the conference consisted of five parallel sessions of five "research reports" each, spaced out so as to permit session hopping. I managed to catch Dunk (sic) and Rahtz (sic) on strategies for gravestone recordings (a flatteringly familiar exercise in conceptual modelling); Dolezalek (Frankfurt) on ways of reconstructing manuscript stemma (an intriguing, if apparently hopeless, task); Nault (Montreal) on an enormous historical demography project at Quebec (births and marriages of every individual between 1608 and 1729) - being stuck with a Cyber 70 they had to write their own dbms, but seem to be doing quite well with it; and finally, Lamm (Birkbeck) who has been let into the MOD's secret archive of first world war soldiers' records with an Epson portable. He is using this to extract a minute random sample of about 8000 records, about thirty variables (height, age, length of service etc) from the attestation papers, personal correspondence, war records, pension and medical books etc etc stored away on some 64,000 feet of shelfspace. I found it rather depressing that this numerically recoded set of SPSS data would probably be all that remained of this archive by the time it was made public in 1995, the rest - already damaged by fire - having long since crumbled to dust. But my friend from the Public Record Office seemed quite relieved at the prospect.
The AHC (as I suppose we shall have to call it) now has a formal constitution and its own magazine. The enthusiasm generated at last year's conference continues to thrive. But I hope that next year, when it is planned to organise a smaller national conference on more focussed topics, I shall be able to report more substantial fruits from it.
Membership of this new working party is a little more broadly based than that of the previous IUSC gesture in this direction; i.e. in addition to J. Duke (Leeds), D. Lindsey (Aberdeen) and me, it has Peter Willett (Sheffield), who is a real information scientist, Colin Edwards (Ag & Food Research Council) and Anne Ford (Leeds Poly). It is chaired by Richard Bland (Stirling), who is ESRC representative at IUCC and a recent convert from sociology to computer science.
The first meeting necessarily accomplished little besides delineating an area of interest and possible activity. The fact that at least one major directory of all software in this area already existed (unbeknownst to the previous working party) meant that most of the usual spadework had already been done; it was agreed that some background reading was needed by all members before manufacturers should be approached. There was some support for the proposal that a clearer idea of possible application areas for the software was needed; little for the suggestion that we would do better to wait on, or contribute to, the emergence of text facilities in SQL-2. It was also agreed that something like the forthcoming INGRES deal would be highly desirable, if feasible.
Other tidbits I picked up:- Leeds have been very happy with their version of Status, but would still rather have bought Basis. Sheffield obtained ASSASSIN (which has now been completely re-written and is supposed to be very good) virtually free. F.N.Tesky, now at Brighton Poly, has developed a low level software interface to Status. Strathclyde has an Alvey project looking into text searching hardware.
It was agreed that the next meeting would be in London, thank God.
Along with several thousand other punters I received an invitation to the first series of "Sequent Lectures" among the usual spring tide of junk mail; as did several hundred others, I decided it might be worth a visit, since it was free and featured most of the dbms I know and love. The event was held in the unspeakably awful Novotel, which is largely composed of multi-storey car park, and had three parts. In part one, an earnest American salesman explained very slowly why Sequent machines are so fast (and reliable and wonderful and cheap). This I found mildly informative, never having paid the company much attention before; the parallel architecture (lots of stripped down VAXes hanging off a superfast bus) sounds remarkably sensible, providing that you can take advantage of it. To do this properly, however, you clearly need some pretty smart programmers. My jargon detector popped its dial on the phrase "We have architected that [i.e. Oracle's use of background processes] on the Sequent" and failed to function for the rest of the morning.
Part two came after coffee and comprised three two-stage sessions held in parallel (another case of Sequent architecting). These were supposed to be for Sequent's favourite software vendors to endorse the message by explaining how they'd taken advantage of the wonderful box in their implementations: predictably they turned out to be fairly low key sales pitches with only token gestures in this direction. Products featured were Ingres, Oracle, Informix, Unify and The Office Smith. I missed all of Unify (it's a supposedly high performance TP-type system hosted under UNIX), Oracle (heard it all before) and also all of the technical seminar on parallel programming (from which however I did steal a programming guide). The Ingres speaker seemed proudest about Ingres' Visual Programming (Trademark) and its "state of the art" query optimiser. It exploits the Sequent architecture by running front and back end processes on different processors, as does Oracle, I assume. The Informix speaker was proudest about their 4GL; he did however announce the new Informix-TURBO which can be used to beef up multi-user large scale UNIX implementations (not however noticeably using parallel programming techniques) and also DATASHEET ADD-ON with which you can make your Informix-SQL database look just like Lotus 1-2-3. There's progress. Office Smith turns out to be a fairly drab hierarchic text indexing system for UNIX boxes only. The speaker clearly felt rather defensive about this ("relational is just a small subset of database technology") and rightly so. It uses B-trees and compresses index terms rather like those speed-writing adverts (KN Y RD THS?); one thing in its favour is that it was designed to be bilingual, emanating as it does from the Canadian Government.
The main event of the day was Dr Rob Wilmott's Vision of the Future, an inspirational little number, featuring lots of graphs showing sharply divergent lines with labels such as "Shipped MIPs", "Price/Performance Learning Curve" (and only 3 spelling mistakes) etc etc. Fortunately for the innumerate, the lessons Dr Wilmott wished us to take home were not too difficult: (1) lots of small machines are better value for money than one biggie (2) progress is being impeded by fuddy-duddy conservatism and the deadweight of current software investment (3) OSI standards are a Good Thing, and are Happening. Likewise, UNIX, C etc. These messages were all dressed up rather fetchingly with the usual sort of stuff about the imminent collapse of the British non-manufacturing industry and the appalling levels of ignorance in British management. To fill the latter gap, our Rob has - surprise - started a new management consultancy called OASIS which will help you "go multi-vendor" and transform your software productivity before the astonished eyes of the competition breathing down your neck. Question time provoked an unexpected smear on government collaborative ventures, and (with reference to whether IBM would ever get involved in parallel architectures) quite the best mixed metaphor of 1987 so far, viz "Once the benchmarks are on the table, you will see all Hell break loose".
A nominal lunch was provided, after which I trekked across London to visit the British Library's Research & Development Division, deep in the heart of Soho. It is possible that they would be willing to fund a one year research post here to assess the actual and potential uses of machine readable texts, which would also help keep the Text Archive on its feet. I spoke to Terry Cannon, who was encouraging.
Battelle Software (aka Information Dimensions Ltd) is a recently created offshoot of the Battelle Memorial Institute; the latter was set up in 1929 by one Gordon Battelle, a wealthy ex-miner, in whose will it was charged to undertake research "for the good of mankind". It now describes itself as the "world's largest private research company" and can lay claim to having benefited mankind by the invention of (inter alia) the Xerox machine, Snopake and bar-coding. BASIS, a market leader in text retrieval software, was originally (early 60s) an in-house product used to keep track of the Institute's voluminous research project reports; DM, modestly described as an improvement on all existing relational database management systems, is a more recent product, developed over the last six years from BASIS.
The seminar was in two parts, each consisting of a fairly technical run through followed by a demonstration. It was one of a series of small and (as it transpired) rather well run seminars which the company has been running this month to raise its UK visibility somewhat; there were about six attendees apart from ourselves.
DM runs on CDC and DEC machines and is due to be ported to IBM next year. At first glance it looks like a fairly traditional ANSI/SPARC three level architecture DBMS, with a structural data model, an "actual" (i.e. logical) data model and one or more user data models. It supports concurrent access - up to 511 per "kernel" (i.e. database virtual machine) against 250 databases, with up to eight kernels per mainframe; it comes with COBOL or Fortran pre-processors, has good screen handling facilities, a reasonable-looking report writer, and extensive security, recovery and journalising features etc. It has a built-in data dictionary which is used to enforce referential integrity. It has its own system language (FQM) which is similar to but not the same as SQL; in particular it does not contain facilities to create new views or relations. When pressed, Batelle insist that when the product is enhanced to support SQL (supposedly real soon now) it will support a full SQL including such facilities. On the other hand FQM supports text items far more effectively and usefully than SQL.
BASIS is a well established text retrieval system (major users include BT, BP, the Smithsonian Institute, Pergamon Infoline and Reuters, not to mention the Houses of Parliament or President Reagan's personal VAX). It is also used as a component of Wang's Office Systems and of a turnkey library system; in the shape of Micro-BASIS it is also being licensed to various CD-ROM publishers. Its indexing strategies (there are twenty different algorithms) appear to be sufficiently flexible to cater for almost any sort of text or user requirement thereof, including support for funny alphabets, embedded ignore characters etc.
Stop/Go words can be specified. Phrases can be indexed. Subfields within text can be indexed (e.g. if one field of a report contains a table, each row of the table can be indexed distinctly). More than one search unit can be specified for the same file, though not of course dynamically. The package has a reasonable screen interface, with a tolerable procedural language used to hook sets of commands together. It has a built-in thesaurus capability which can be user-modified, an online editor and a reasonably flexible and relaxed batch input system. Like most such systems, it uses an unindexed holding-file for recent data; access to the main text base is not possible while it is being updated.
On the whole I was favourably impressed by both systems. In particular, DM seems the only serious contender for the role of DBMS on CDC hardware, while BASIS is certainly worth consideration for the role of text searching software on DEC, Amdahl or CDC hardware. Both products are far from cheap: DM with all its bells and whistles for a Cyber 855 would cost over £40k; while the basic module of BASIS would cost over £6k even on a microVAX and might be twice as much on a mainframe.
CHArt - Computers in the History of Art - is a special interest group organised by Prof. Will Vaughan at UCL and Dr Antony Hamber at Birkbeck, with a burgeoning membership (about 150 attended this conference) drawn rather more from the major national museums than from academic departments. I attended its inaugural meeting nearly two years ago mostly out of idle curiosity; I was invited to this, its second annual conference, I suspect largely on the strength of my performance at Westfield (historians of art seeming to overlap a little with historians in general) on condition that I explain what databases were in words of one syllable, preferably employing lots of pictures.
The conference was a two day event, with mornings given over to formal presentations and afternoons to a number of parallel demonstration sessions. In between was a very pleasant reception featuring memorable dim sum. All around was the wealth of the National Gallery; definitely among my favourite conference venues to date. I opened the first day's formal sessions (which all concerned cataloguing/database applications), using as my main example a page from the Gallery's Catalogue written (I later learned) by the distinguished old buffer who had formally welcomed us into said gallery's hallowed portals not five minutes earlier. Fortunately he'd left by the time I started to get personal. Colum Hourihane from the Courtauld, where the only computer-assisted art historical cataloguing of any importance is actually going on, then gave a very impressive resume of every known method of iconographical classification. He'd found eight different methods used to categorise the subjects of images, of which the best appeared to be ICONCLASS, as used by, yes, the Witt Library at the Courtauld. His paper, when written up, should become a standard reference on the subject.
After coffee in an adjoining room of old masters, Jeanette Towey (described as 'a researcher' and evidently not a sales person) gave a work-person-like introduction to what word-processors are, how they differ from typewriters etc. etc. She advocated Nota Bene, having used that and WordStar, but had never tried WordPerfect nor heard of SGML, page description languages or - mirabile dictu - TeX. Gertrude Prescott from the Wellcome Institute and her 'data processing consultant' (whose name I forgot to write down) then described their current prototype cataloguing system for the Wellcome's immense collection of prints, using dBase III+. It was rather depressing to see that although they were starting from scratch - much of the collection never having been catalogued in any way - their data analysis was very rudimentary. It seemed to me to be over-reliant on dBase III's tendency to sweep anything difficult under the carpet into a "MEMO" field, of which they had about eight in one record. No doubt they will learn better from the example of their neighbours at the Witt Library.
After lunch, there were various demonstrations, of Nota Bene (which I avoided) and of STIPPLE, our old friend from the pigsty, which does not appear to have changed much and which I am now close to thinking I understand. ERROS Computing is still in business, but does not appear to have gained any new customers since the last report, some 18 months ago, nor indeed to have expanded its standard demo at all. Another demonstration, of somewhat dubious relevance to Art History, was being given by a Dr Alick Elitthorn from a private charity called PRIME (no relation to any manufacturer) which has something to do with the analysis of dreams. Its chief point of interest was that it used STATUS on a PC AT, which I had long heard of but never actually seen. The software costs £2000; by dint of sitting on my hands I prevented myself from taking a security copy of it immediately.

Day Two, which was supposed to be on visual rather than historical aspects of the subject, was opened by a Mr Duncan Davies (formerly with ICI, now retired) who gave what was reported to have been a magnificent overview of the rise of western civilisation. Owing to the caprices of British Rail, I missed much of this, arriving only in time for the Reformation, from which time, according to Dr Davies, may be dated the end of the period during which written communication had constituted the intellectual power base. With the rise of universal education came the stress on words and numbers as the only fit means of communication, the discouragement of the most able from visual forms of expression and our consequent inability to say anything intelligent about visual images. The second great invention of humanity will be the pictorial equivalent of the phonetic alphabet, and if anyone had any ideas on how it could be done, would they please telephone Dr Davies on 01-341-2421. The visual content of his talk, which my summary does not attempt to include, was, of course, the better part. Terry Genin had the difficult task of following this, but persevered, remarking that he would normally be on playground duty rather than addressing a gathering of this sort. He has developed some fairly straightforward courseware involving image and colour manipulation on RM380Zs as a means of teaching art history in secondary schools, but the bulk of his talk was a plea for the possibilities of interactive video to be more widely recognised in that context (which seems to me to be a political rather than an art historical question), rather than just as a means of selling Domesday Book, of which he had several (unspecified) criticisms.
After coffee, Andrew Walter (IBM Research) gave a rapid canter through the York Minster Computer Graphics project. This is somewhat of a tour de force in CAD; it consists of a model of the York Minster, sufficiently detailed for views to be plotted from every angle both inside and outside. A video of the resulting tour was on display throughout the conference; each frame took about three hours CPU time on an IBM 3430, so interaction was impossible. The presentation included samples of the high level graphics language in which the Minster views were specified (primitives such as cylinder, sphere, cube etc. are combined in a procedural way) which was interesting though how much sense it made to the majority of the audience I can only guess. Wire frame drawing with dotted in-fill was presented as a more promising way of getting interactive processor speeds; the problems of including perspective in the views were also touched on.
David Saunders (National Gallery) described an ingenious application of image processing techniques. The problems of colour changes in 16th century paint are fairly well known (Uccello didn't actually paint blue grass, it's just that the yellow wash he put over it has gone transparent); more modern pigments also change over time. Usually the only way of telling what has happened is when a part of the painting has been protected from light, e.g. behind the frame. By storing carefully controlled digitised images of the painting and then comparing them after a five-year gap, the NG hopes to identify more precisely what types of material are most at risk and what factors cause most damage. The equipment (which was also demonstrated in the afternoon) includes an HP 9836 frame store and a special digitising camera. Several art historical applications of image processing techniques were also given in what was, rather unexpectedly, the most stimulating paper of the conference.
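The comparison itself is presumably something like a registered, pixel-wise colour difference; a toy sketch of that idea follows, assuming the two digitised images are already aligned, equally sized and captured under the same controlled conditions (the threshold value is illustrative, not the Gallery's).

```python
import numpy as np

def colour_change_map(image_then, image_now, threshold=10.0):
    """Flag pixels whose colour has drifted by more than `threshold` units.
    Both images are (height, width, 3) arrays of the same shape."""
    diff = image_now.astype(float) - image_then.astype(float)
    magnitude = np.sqrt((diff ** 2).sum(axis=-1))   # per-pixel colour distance
    return magnitude > threshold                    # boolean map of changed areas
```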
Finally, two ladies from the Weeg Computing Centre at the University of Iowa described their videodisc retrieval project. A library of about 18,000 colour slides had been stored (in analogue form) on video disc, and a simple text DBMS (called Info-Text) used to index them. The system was designed for use by faculty members wishing to collect together appropriate illustrative material. In the classroom, images can be projected in the same way as conventional slides; the quality of the images (we were assured) was "better than might be expected"; it looked reasonable on the standard video monitors available at the National Gallery. Images are catalogued according to nineteen different categories (date, provenance, size etc.); no formal iconographic indexing was used. Apart from the obvious advantages of being tougher and cheaper to maintain, one great attraction of the system was seen to be its integration of indexing and displaying comparable and contrasting treatments of equivalent subjects.
The conference closed with a plenary discussion session. This focussed at first on the difference between the words "analogue" and "digital", rambled off into ill-informed speculation about the possibility of automatic subject-recognition and was brought to heel by a plea for more information about what sort of database system was worth buying, and whether or not art historians should be expected to get their brains dirty trying to design them. My views on all these topics being fairly predictable, I shall not summarise them here.
Hearing that I was hoping to attend the second University of Waterloo conference on the new OED, Ian Lancashire, driving force behind Toronto University's thriving Centre for Computing in the Humanities (CCH), kindly invited me to give a seminar there. This being too good an opportunity to miss, as there are several projects at Toronto of considerable interest, I arrived in Toronto (on a special - bilingual - cheap flight via Montreal) a few days before the OED Conference proper and visited...
The Dictionary of Old English, which flourishes in several rooms on the 14th floor of the magnificent Robarts Library, where I saw some very flashy Xerox workstations given to the project together with a small VAX configuration. The project has a single programmer who has to develop all the software for editing and page make-up of the dictionary entries, which have now reached the letter D. They were astonished to hear that we could offer online searching of substantial portions of the corpus, even if we could not display Old English letters. Their interface is pleasantly similar to the desk-tops on which the terminals sit (i.e. cluttered), and just to be on the safe side the project also maintains (in several hundred drinking chocolate cartons) numerous bits of paper derived from the corpus we know and love. Ashley Amos, sole surviving editor of the dictionary, managed to track down some obscure charters which a user here had been unable to find in our copy and was generally encouraging.
At University of Toronto Computing Services (UTCS), I inspected the bookshop (which is splendid) and the computer shop (likewise; hottest new property reportedly the Amiga 1040 which is selling like hot cakes, or muffins as the Canadians unaccountably call them). I was not shown their Cray, nor indeed their IBM 4361, but instead was introduced by John Bradley, apparently the resident humanities computing boffin, to TACT - a new interactive text-searching system he is developing to run on Toronto's ubiquitous (see below) IBM-XTs - and by Lidio Presutti (his sidekick) to MTAS, an OCP look-alike of which I was given a copy, also for use on IBM-XTs. Time did not permit me to discover much about the way the centre is organised, other than the fact that they have recently started charging their users real money (well, dollars and cents anyway) for computing resources consumed, with predictably dire consequences for anyone not funded by the Defence Dept or similar.
Nevertheless, Humanities Computing is set to thrive at Toronto, largely as a result of Ian Lancashire's "partnership" with IBM-Canada. This involves the setting up of four rooms full of XTs and staff to support them over three years, all paid for by Big Blue, which gets no more in return than lots of academic credibility and three years worth of humanities graduates convinced that all computers should run PC-DOS. Any software etc. developed will be placed in the public domain. One of the four centres was on the verge of opening its doors: it had 24 XTs on a token ring with an AT as file server and three printers. The XTs were set up in such a way that they could only be booted from a supplied disk, which could not be removed from drive A. They were also bolted to the floor, despite Canadians' proverbial honesty. Students will use these to prepare machine-readable texts, using EDLIN or NotaBene (WordPerfect is not regarded as highly as it is here), to be processed using MTAS and maybe TACT. Other software to be made available includes the Duke Toolkit and the usual clutch of concordance packages, Kermit, network mail etc. as well as some public domain text-jiggling utilities designed to whet if not satisfy the literary appetite. Students will apparently be expected to become reasonably proficient in not just PC-DOS but also VM-CMS and UNIX as well, which seems a bit steep. Conspicuously absent was any whisper of what databases are for. There is rumoured to be a Masscomp somewhere in the English Dept but I never saw it.
I gave my seminar in the Centre for Mediaeval Studies (where the second of the four IBM rooms was still under construction); I had been billed to talk about the KDEM but instead waxed lyrical on the necessity for the Text Archive, the problems of representing and processing text properly and the wonders of CAFS to a gratifyingly large (c. 36, including the Director of UTCS, I later learned) audience, most of which survived till the end.
The next day, being Saturday, I spent at Niagara Falls, of which the Canadian end is unquestionably the better, and truly spectacular. I was startled by a bright red London bus (used for touristic purposes) and resisted the temptation to have my photo taken going over in a barrel, though I did go into the tunnels behind the Falls which command a magnificent view of their derriere.
Back in Toronto, I lunched with Michael Gervers, who runs the Documents of Essex England Data Set (DEEDS) project, more or less on his own with some Government assistance in the form of temporary (YOP-type) staff. The project involves the indexing of a massive collection of mediaeval records from Essex (England) and is the only real database project I came across at the University. It started off using an awful DBMS package which sounds like a Canadian version of IMS, but is now going through the traumas of conversion to Oracle, at present on a huge AT (with a 40 Mb disc AND a Bernoulli box), though it will be moving to the UTCS IBM system shortly. The cost of Oracle for this system appears to have been met from the IBM 'partnership', although what other users it will have in the absence of any local knowledge of how to exploit or support it is less clear.
I travelled to Kitchener, the nearest large town to the University of Waterloo, by train in the company of Willard McCarty who works with Ian Lancashire in running the CCH, and Abigail Young, who works on the Records of Early English Drama (REED) project also at Toronto. She had been largely instrumental in depositing in the Text Archive that proportion of the published corpus of REED texts which was still available on floppy disk, so I was very pleased to meet her.
And so to Advances in Lexicology (not a word to be found in the OED - yet), which was the second annual conference held at Waterloo's Centre for the New Oxford English Dictionary and was generally felt to be a distinct improvement on its predecessor. Twelve papers were given over three days to about 150 delegates, roughly equally divided in their allegiances between lexicography, computer science and artificial intelligence. One reception, many coffee breaks and two fairly spartan lunches were laid on, during all of which there was much animated discussion. The best joke of the conference was probably Dana Scott's collection of howlers, of which I recall only "AI is when the farmer does it to the cow instead of the bull", which manages to combine innuendo with syntactic ambiguity.
The keynote address by Howard Webber (Houghton Mifflin) was the only one of the papers not (yet) available in printed form; like many keynote addresses it sounded rather as if he had made it up on the plane from several old after-dinner speeches. However, it got out of the way all that necessary stuff about the role of dictionaries as a sort of "Language Command Central" (his phrase), the unease with which lexicographers had regarded the machine, the difference between machine-readable dictionaries and lexical databases and the transition from the former to the latter, while also dropping a few hints about where the 'American Heritage' dictionary project was now going in its co-operation with Brown University (nowhere in particular, as far as I could tell, other than the preparation of a new 50 million word corpus).
Manfred Gehrke (Siemens AG) tackled head-on the computational difficulties of providing rapid access to a realistically large lexicon. The method described, using morphemes rather than 'words' as primary keys, has several attractive features (like the comparatively smaller number - and size - of such keys), though it is perhaps more appropriate to highly agglutinative languages such as German. The fact that morphemes have meanings which the compounds derived from them usually employ is also of particular importance in German. Even so, segmentation can cause problems: "Mädchenhandelsschule" read as "Mädchen + Handelsschule" is a girls' business college, but read as "Mädchenhandels + Schule" it is a white slavery school.
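A toy segmenter makes the ambiguity concrete: with a lexicon keyed on morphemes, the same compound can be covered in more than one way. The lexicon entries and the function below are mine, purely for illustration, and are not Gehrke's method.

```python
def segmentations(word, morphemes, prefix=()):
    """Yield every way of covering `word` with items from `morphemes`."""
    if not word:
        yield prefix
        return
    for m in morphemes:
        if word.startswith(m):
            yield from segmentations(word[len(m):], morphemes, prefix + (m,))

lexicon = {"mädchen", "mädchenhandels", "handelsschule", "schule"}
for seg in segmentations("mädchenhandelsschule", lexicon):
    print(" + ".join(seg))
# prints, in some order:
#   mädchen + handelsschule      (girls' business college)
#   mädchenhandels + schule      (white slavery school)
```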
Mark Aronoff (SUNY) and Roy Byrd (IBM) gave a rather dubious account of the role of etymology and word length in English word formation. A dictionary of high-frequency affixes was extracted from the top end of the Kucera-Francis word list, and another unspecified 3/4 million word list. This was then enhanced with some fairly simple etymological information from Webster's 7th (i.e. did the affix enter the language from a Germanic language or a Romance one). Any complications (such as words which were imported into English from French, but came into French from Germanic) were rigorously disregarded, as was the distinction between words which were formed within the English language and those which were borrowed - as it were - fully formed. Much statistical jiggery-pokery was then employed to determine how syllable-length and etymology accounted for the productivity of various affixes, and much wonder expressed at the apparent ease with which native speakers keep their neologisms racially pure. But the results, as Mike Lesk pointed out, would have been equally consistent with a simple phonemic explanation: (predominantly longer) Latinate suffixes naturally sound better on (generally Latinate) polysyllabic verbalisations, while (often short) German endings go best with (mostly Saxon) little words.
Walter and Sally Sedelow (Univ of Arkansas) have been in the field of computational linguistics almost since it began; their paper, which was mostly given by Walter, thus had a tendency to historical reminiscence not quite germane to the issue, while employing terminology and a style, the clauses of which were embedded well beyond the capacity of most intelligences not endowed with a 640 Mb hardware stack, not unlike some really nasty exercises in automatic parsing, and consequently seemed to go on for a great deal of time without ever getting very far. This was a pity, because its subject (the adequacy and usability of Roget's Thesaurus as a knowledge representation language) is important and interesting. A mathematical model of the thesaurus (derived from work by Robert Bryan at San Francisco State) was presented and various other research reports adduced in support of the claim that the associations represented in Roget are actually significant. The skeptical might say that this was rather unsurprising; however anyone who can (apparently in all seriousness) express the concept in the following way (quoted from the abstract) clearly cannot be ignored, if only on stylistic grounds:
"The paper ends...with a statement to the effect that any assertions that the Thesaurus is a poor representation of Emnglish semantic organization would be ill founded and, given depth of analysis, would have to be regarded as counterfactual"
Judy Kegl (Princeton), Beth Levin (MIT) and Sue Atkins (Collins) gave probably the meatiest of the papers at the conference - and coincidentally no doubt the only one co-written by a real lexicographer (Atkins). It includes much analysis of the different ways in which two language learners' dictionaries (LDOCE and OALDCE) attempt to convey the intimate relationship between the various senses of English verbs and their complement-taking properties (or case structure). Even such apparently obvious characteristics of the verb as transitivity are not always predictable by simple transformational rules such as "If the verb takes an object then it can be used passively" (e.g. "The shoes fit", "The shoes fit me" but not "I am fitted by the shoes"), but there is no self-evident place to put such facts about the verb "fit" in a dictionary. Consequently dictionaries differ: "cost" for example is intransitive according to OALDCE, and transitive according to LDOCE. The paper also contains much that is refreshingly sensible about the nature of specialised dictionaries (such as learners' dictionaries) and the distinction between them and the sort of immensely complex linguistic snapshot to which some computational linguists expect all lexicons to aspire. The sort of knowledge needed for the latter, though indispensable to the former, must be processed and combined in a way appropriate to particular users. Detailed assessment of the way certain notoriously difficult verbs are presented in OALDCE and LDOCE is used to expose inconsistencies inherent in the absence of any generally agreed meta-language for verbal descriptions, a point which recurred in other papers. The strength of this paper is the combination of the structuring capabilities offered by theoretical linguistics with the reductive classificatory skills of lexicography, which it both demonstrates and advocates.
Thomas Ahlswede (Illinois Inst Tech) reported on the continuing saga of the parsing of Webster's 7th, first initiated by John Olney et al in 1968. 'Parsing' here means the recognition and extraction of semantic information from the text of a dictionary definition which can then be stored and manipulated within a lexical database. It is analogous to (but even more ambitious than) attempts to extract similar semantic structures from free text. Dictionary definitions provide implicit information about relationships between words, not just taxonomic (an x is a sort of y) but also part/whole relationships. But a simple syntactic analysis of the text of a definition is rarely adequate to the task of understanding it; a detailed lexicon containing information about each word likely to be encountered by the parser is evidently necessary. For Webster's 7th, some of this information (but not all) can be extracted from the entries themselves, while some of it is already available in the existing parser's lexicon of about 10,000 entries. This process was later dubbed "dictionary hermeneutics" by Graeme Hirst. How much easier it might have been if the dictionary structure had been initially captured in a meaningful way (as was the OED) is an embarrassing question which no-one had the poor taste to ask.
Nicoletta Calzolari (Pisa) described an equally ambitious but rather more practical project now under way under Zampolli's charismatic aegis: the construction of a bilingual lexical database (LDB) system by linking together existing monolingual LDBs, the linkage being provided by machine-readable bilingual dictionaries. Combining monolingual and bilingual dictionaries, which typically differ in the degree of discrimination felt necessary for individual word senses, should lead to a much richer integrated system. The dictionaries to be used include Zingarelli, Garzanti, Collins Italian/English, LDOCE and OALDCE. No complex supra-linguistic model is envisaged, simply the ability to discriminate senses when going in either direction between two languages. Such old chestnuts as the fact that Eskimos have 99 different words for 'snow' and Arabs none at all were not on the agenda: the project is not really concerned with semantics, but aims rather to provide a useful tool for translators and others using existing dictionaries.
The final session of the second day comprised summaries of the current state of play of the NOED Project itself, as viewed by firstly Tim Benbow and John Simpson (OUP) and secondly Frank Tompa and Gaston Gonnet (Waterloo), all of whom were sporting Oxford Dictionary Ties to mark the occasion. Benbow reported that the dictionary's 21,000 pages had now been rendered machine readable, using an estimated 500 million keystrokes, with an error rate of around 4 in 10,000; this was being proofread and corrected by ICC to bring the residual error rate down to less than 1 in 250,000 characters. The data is held at Oxford and manipulated in a SQL database under VM/CMS. Rick Kazman's parser would be used to convert the ICC mark-up to SGML, and other software developed in-house mostly by IBM secondees (e.g. a text editor called LEXX) will be used by the lexicographers to complete the integration of the dictionary and the supplements. Some wholesale changes will be made (notably Murray's method of representing pronunciation will be replaced by IPA) at the same time as automatic integration is carried out; some (about 4000) new words/senses will also be added to bring the early parts of the supplement up to date (this is the responsibility of John Simpson's NEWS project). Publication of the new integrated dictionary (the Book) is planned for spring 1989. It will have 22 volumes and cost about £1500. Publication of a CD-ROM version of the OED alone (without supplements) is also planned, probably for late 1987, mainly as a means of testing the market for electronic forms of the dictionary, and providing information for the database design work going on at Waterloo. It is hoped to set up a unit of lexicographers in Washington which, together with the NEWS team, will ensure that the dictionary, or rather its eventual electronic form, will be kept up to date on both sides of the Atlantic.
At Waterloo several very interesting pieces of software have been developed, which were described by Gaston Gonnet and Frank Tompa. While waiting for the ICC data to reach them, they had been experimenting with a smaller dictionary of Trinidadian English which had successfully demonstrated the generality of their approach. The software used comprises (1) INR/lsim - a parser-generator and parser for context-free grammars (2) PAT - a fast string searcher and (3) GOEDEL the "Glamorous OED Enquiry Language". INR/lsim (no-one seems to know what this is short for) resembles in philosophy the parser-generator developed for Project TOSCA at Nijmegen, though I never got the opportunity to ask Tompa whether he'd heard of this. Maybe it's just the only way of solving the problem properly. It has been used by Kazman among others to convert the ICC mark-up to SGML, and to convert the OALDCE tape as first supplied to the Text Archive into a similar SGML markup. PAT (written by Gonnet, who has made quite a study of text searching algorithms, I discovered) stores indexes to a text in a Patricia tree, a form of condensed binary tree new to me, though apparently to be found in Knuth if you know where to look. PAT is very fast but, at present, very simple minded. GOEDEL is a more sophisticated system, still under development, the most crucial element of which is not so much its current Algol-like syntax as the fact that its basic datatype is a parse tree (again like the Dutch system). This solves all manner of nasty data management problems and bridges the gap between DBMS and Text Processing systems in a way at least as natty as CAFS and probably more so. The user specifies a parse tree for the text to be returned and can impose selectional restraints using SQL-like conditions.
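For those who, like me, had not met the idea before: PAT indexes the positions at which searchable strings begin and answers prefix queries over the text that follows each position. The real thing uses a Patricia tree; the same behaviour can be illustrated, far less efficiently, with a plain sorted array of positions and a binary search. The sketch below is mine, not Waterloo's.

```python
def build_pat_array(text):
    """Index every word-start position, sorted by the text that follows it."""
    positions = [i for i in range(len(text)) if i == 0 or text[i - 1].isspace()]
    return sorted(positions, key=lambda i: text[i:])

def pat_search(text, pat_array, prefix):
    """Binary-search for positions whose following text starts with `prefix`."""
    lo, hi = 0, len(pat_array)
    while lo < hi:                                   # find leftmost candidate
        mid = (lo + hi) // 2
        if text[pat_array[mid]:pat_array[mid] + len(prefix)] < prefix:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < len(pat_array) and text[pat_array[lo]:].startswith(prefix):
        hits.append(pat_array[lo])
        lo += 1
    return hits
```

So, for example, pat_search(text, build_pat_array(text), "lexico") returns every position at which a word beginning "lexico..." starts.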
Peter Davies (described as an 'Independent Lexicographer') read out his paper from the conference proceedings in a dry monotone well suited to its contents, which contained rather few conclusions derived from some fairly specious data. He had tagged the most frequent word forms in the American Heritage Corpus with the century of their introduction to English and the language from which they derived. (Like Aronoff and Byrd he was uninterested in the fact that this corpus neither distinguishes homographs nor associates inflections of the same lemma.) The results presented were raw percentages ("In the top 500 types, 75% are native") with no attempt to adjust for the known skewness of vocabulary distribution irrespective of origin.
Alexander Nakhimovsky (Colgate) is much obsessed with time, more specifically with how language reflects "common-sense reasoning" about the nature of time. He is one of the "Meaning-Text" school of Soviet theoretical linguists. To understand why answering "I have a 12.30 class" constitutes refusal of a lunch invitation requires not just knowledge of social convention, but also of the probable durations of lunches and classes. English verbs are not marked for perfective as opposed to imperfective aspect, so that "Alice read her book quickly" could have two quite different meanings (either as a process or as an event). Knowledge of the duration of events is a linguistic phenomenon because many words cannot be understood without knowing the duration typically associated with them - not just obvious words like 'lunch' or 'nap', but also 'holiday' (which cannot be measured in minutes but is usually not in years) or 'insomnia' (which cannot be measured in minutes nor go on in the afternoon). It is apparent that the units of time appropriate to common sense reasoning vary with the duration of the event, as does their precision. (Thus '5 minutes' means somewhere between 1 and 10, but '3 minutes' usually means just that). To make up for the absence of a perfective/imperfective aspectual difference, English relies on an opposition Nakhimovsky calls telic/atelic, which has to do with the narrative within which the verb appears, so far as I understand (or have understood) it.
Fabrizio Sebastiani (Pisa) presented a more conventional view of the role of the lexicon in describing QWERTY, a knowledge-based system designed to 'understand' technical writing. It operates by converting parse trees representing syntactic structures into semantic structures represented in the knowledge representation language KL-MAGMA. The fact that technical writing is mostly composed of 'paradigmatic' sentences, from which such nasty problems as mood, aspect and temporal position are conspicuously absent, was touched on but not insisted upon: unfortunately Sebastiani did not allow himself enough time to make clear exactly what was original in the system nor how effective the design was in practice.
Graeme Hirst (Toronto) closed the conference on a controversial note which also managed to bring together the two sides of lexicology, if only in debate. His paper stated with agreeably frank partisanship why dictionaries should list case structure, that is, specifications of the valid roles associated with any particular verb, together with any semantic restrictions on the values that may fill those roles and any preposition or other syntactic marker specific to them. Case structures may be found in one guise or another in almost every theory of language or cognition, from Chomsky to Charniak, and in a weak form are almost present already in the 'verb-patterns' specified in some language learning dictionaries. Hirst's most telling argument in favour of his proposal was that if lexicographers did not supply such information then computational linguists would, and would certainly do a worse job of it. The most telling counter argument was that, at present, no-one has a clear idea of exactly what cases (roles) exist nor is there any agreement on how to define them. A less telling counter-argument, which to my mind unnecessarily dominated the subsequent heated discussion, was the economics of adding this information to the already overcrowded pages of a dictionary; when pressed, Hirst said that he thought it would be more use to more dictionary users than etymology if something had to be sacrificed.
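To make the bone of contention concrete: a case structure is in effect a small data structure attached to each verb entry. The sketch below is purely hypothetical; the verb chosen, the role names and the selectional restrictions are my own illustration (and precisely the sort of inventory the counter-argument says nobody has yet agreed on), not Hirst's proposal or any dictionary's practice.

```python
# Hypothetical sketch of a dictionary entry carrying case structure:
# roles, selectional restrictions, and any preposition marking each role.
GIVE = {
    "lemma": "give",
    "cases": [
        {"role": "AGENT",     "restriction": "animate",  "marker": None},   # subject
        {"role": "THEME",     "restriction": "concrete", "marker": None},   # direct object
        {"role": "RECIPIENT", "restriction": "animate",  "marker": "to"},   # 'to'-phrase
    ],
}

def markers(entry):
    """List the prepositions (syntactic markers) an entry licenses."""
    return [c["marker"] for c in entry["cases"] if c["marker"]]

print(markers(GIVE))   # ['to']
```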
After the conference proper, I visited the NOED Centre itself, where I met Donna Lee Berg, the librarian on the project, and acquired offprints of some technical reports on the software being developed there by Gonnet. I also watched some confused lexicographers struggling with GOEDEL and, while being very impressed by the power of this system, was glad to notice that there are 18 months labelled "development of user interface" set aside in the planning of the project which have not yet begun.
Back in Toronto, I found Ian Lancashire very busy preparing a long-term plan for funding humanities computing beyond the end of his IBM partnership. This entails the preparation of a detailed report of all the activity currently going on at the six or seven universities in Ontario which is to be presented to the Ontario government with a request for funding very shortly. I managed to distract him sufficiently to discuss briefly his slightly different views of the functions of a text archive. He wishes to see texts distributed as freely as public domain software, the role of the Archive being simply one from which an original non-corrupted version can always be retrieved, and the only restriction on the user being simply not to redistribute materials for profit. To that end, all texts encoded at Toronto (and there will be many, since preparation of a specified text forms a part of students' course work) will be prepared to a common standard from non-copyright texts, such as facsimiles of early printed books. Whether this is practical in our case, where many of our texts are prepared from modern editions or texts otherwise still in copyright, is unclear. It is certainly something we should consider when capturing texts in the future however. I would also like to give some thought to the possibility of making some of our other texts more freely available (i.e. copyable).
No particular theme had been specified for this year's ALLC conference (one had last year, in Nice, but no-one took any notice of it). Vague attempts had been made to clump together related papers, the chief effect of which was that anyone interested in OCP-style software couldn't find out anything about database-style software, and anyone not interested in literary statistics had absolutely nothing to do for most of one day. There were three invited speakers, as well as three days of parallel sessions, and two major social events clearly calculated to impress foreign delegates. Much of what transpired was well up to expectation; in the 200+ delegates there were only a few new faces amongst the ALLC die-hards, and most of the issues discussed had a more than familiar ring to them. The accommodation at UEA was also no worse than usual, though the food was remarkably nasty.
Leaving mercifully aside the more tedious papers, I noted with interest the following:-
Tony Kenny (Balliol) summarised his work in statistical stylistics and was also chief lion at the subsequent round-table discussion on "whither computational stylistics?". The discussion turned out to be unusually interesting, if inconclusive, while his paper was exhaustive, if exhausting. It made eminently reasonable distinctions between what made sense in the field (distinguishing texts in terms of parameters that could be shown to be internally consistent - cf Delcourt) and what did not (postulations about undefinable entities such as 'all the works Aristotle might have written'). He compared statistical techniques to aerial photography, showing the wood rather than the trees, and concluded with a summary of his next book, which uses clustering techniques (Pearson correlation coefficients in particular) to discriminate the Pauline and non-Pauline bits of the Greek New Testament on the basis of their usage of different parts of speech.
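As a toy illustration of the technique (invented figures, and not Kenny's data or his actual procedure): one computes, for each pair of texts, the Pearson correlation between their part-of-speech frequency profiles, and treats consistently high correlations as grounds for clustering the texts together.

```python
# Comparing texts by part-of-speech usage with Pearson's r.
# The frequencies below are made up purely for illustration.
from statistics import correlation   # available from Python 3.10

profiles = {
    # hypothetical relative frequencies of noun, verb, conjunction, article
    "Romans":    [0.21, 0.13, 0.09, 0.30],
    "Galatians": [0.22, 0.12, 0.10, 0.29],
    "Hebrews":   [0.17, 0.16, 0.13, 0.25],
}

for name, freqs in profiles.items():
    if name != "Romans":
        r = correlation(profiles["Romans"], freqs)
        print(f"Romans vs {name}: r = {r:.3f}")
```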
As usual at these gatherings there was a certain amount of political manoeuvring in evidence. It transpired that Nancy Ide (Chairman of the Association for Computing in the Humanities) is planning an international workshop on standardisation of machine-readable texts. I put forward the proposal that the Text Archive deserved more funds to whatever sympathetic ear came within reach, and was told on several occasions to think BIG.
This conference aimed to bring together as many as possible of those currently using computers as primary research tools or as teaching aids in the historical disciplines, with a view to establishing a new learned society, the AHC, with a wide-ranging membership including universities, research bodies, polytechnics, and local government bodies responsible for secondary education. This catholicity may be one reason for the evident success of the event: there were nearly 300 delegates, including several from Germany, France and Italy, and ten sessions each of four or five short papers, of extraordinary richness and variety, with only a few duds. There was also a more or less permanent but changing exhibition room featuring assorted micro manufacturers and software publishers as well as lots of online demos, mostly on the ubiquitous BBC micro or via JANET at the exhibitor's home base. I was alas unable to show off the wonders of CAFS due to the continued absence of weekend working on the 2988.
There were too many papers to summarise individually (it is probable that some sort of publication will emerge eventually) so this report simply describes overall trends and memorabilia.
Predictably, there were large numbers of home grown quasi-DBMSs on display, ranging in sophistication all the way from data entry systems in BBC BASIC tailored to a particular type of historical document up to the all-singing all-dancing current market-leader, the German CLIO package. I had previously met this in Gottingen; it is still being re-written in C but now has satisfied customers in Austria and France as well, and is stimulating interest here. Others mentioned included -yes- Famulus?? which got blush-making praise from a Nottingham group and Oracle which was mentioned by several speakers as the obvious choice for dbms, despite the presence of a strong SIR-fancying contingent. In fact the conference revealed a very healthy eclecticism as far as software is concerned, both Prolog supporters and one renegade APL fanatic being given if not equal time at least equal respect.
Aside from the methodological manifestos, some real historical research was reported, largely by groups of economic historians. In this category, I found Turner (using multi-dimensional scaling to analyse 19th c House of Commons voting practice - which rings a bell) and Riehl (using predictive mathematical models to analyse the emergence of the Greek city state) particularly interesting. At quite the other end of the educational spectrum, there were sessions devoted to methods of introducing computational methods into the undergraduate syllabus, and to novel applications of computers in the classroom at secondary level and below. These were very interesting, and oddly complementary. One speaker compared the university teacher addressing the post-micro generation to a hunter-gatherer trying to teach neolithic man how to increase agricultural productivity - a simile which seemed to strike several chords. One novel CALL application is about to be marketed by Longmans: it is a role-playing game in which children are introduced to decision-making procedures and the role of chance in historical events, by simulating the Palestine agreement of 1947. It seemed a good way of teaching people to "think historically" - an activity which the charismatic Richard Ennals (who jetted in to chair one session and then jetted out again) assured us was worth big bucks in pushing back the frontiers of AI.
I noted two major trends:
... began, like quite a few Saturdays lately, with my escorting Sarah Daisy, violin perilously perched at the back of her bike, northwards through suburban back streets to Miss Chandler's expansive residence in Belbroughton Road, arriving around 1040 for a 1030 lesson, as usual. Then I bicycled furiously up Banbury Road to do some frugal Saturday morning shopping: muesli, brown rice, brown bread, brown beans, caffeine-free tea, fresh fruit and a Sunday indulgence, fillet of lamb, extracted from the bone before me by a prodigiously fat butcher with a suspicious expression, as if he could smell a closet vegetarian polluting his premises.
... continued, with my re-escorting Sarah homewards (and the shopping too) by 1120. And then grabbing my gown, for the second time this year and my script, and heading off frantically for the Computing Service. And then bicycling furiously back again having forgotten my bike lock (a heavy duty number weighing rather more than the bike itself). And so to the Lecture Room, where assorted luminaries of the Computing Service, Libraries Board and the Bodleian itself are already fretting about where to put the wineglasses, turning on the equipment, testing the view from the back etc.
For today is the Day of the Visitation, when the Curators of Bodley's library exercise their annual right of inspection, ostensibly to count the books, in practice to listen to a Latin Oration and then Lunch well. Except that this year, as a consequence of, and in spite of, and leaving to one side, all manner of complicated university politics and campaigns and proposals and working parties and parties prises, they are also going to Visit us, in order to see my natty program search the pre-1920 catalogue (S to SHERIDAN) at one million characters a second.
At noon, the few real users of the computer still around are all told to leave, and the machine turned over to us. By now, we have been waiting for Them to arrive for so long, that we've relaxed to the point of seeing whether or not the wine really is cold enough. Consequently it is not until the first begowned dignitary enters that we think of checking that what was working perfectly well at 1145 is still working at 1215. Needless to say, it isn't. Ten minutes of undiluted hell then ensue, during which (a) approximately thirty elderly academics rubberneck their way into the room (b) the monitor on which I am to do the demonstration continues to say COMMENT- A SHORT DELAY WILL OCCUR (c) my Director makes desperate jokes about ICL scoring own goals (d) me and the solitary operator on duty, who's getting paid overtime for the privilege, try to decide what's needed to get the brute moving again. Eight minutes into this, with the notional egg starting to congeal on my face, and even as Bodley's Acting Librarian starts to make his Introductory Remarks, I notice the unnatural quiet in the machine room, and suggest that there really ought to be a printer going, and how about an output scheduler as well? Operator grunts, types some special operator's gobbledegook, terminals spring to life, I sprint back to lecture room, elbow my way through assembled subfusc, phew, not a nanosecond too soon, we are go for demo.
As planned, I demonstrate that there are 4 titles in S-SHERIDAN which contain the word "Abysinnia", 5 containing words starting with "Abysinn..", 6 containing words starting "Abxsynn" (where x is a y or an i or even a q), and ever so many more when you search for variations such as Ethiopia, Aethiopica etc. all of which the wonderful CAFS machine duly does. This works a treat. The next one I get wrong, [q3 and q2 or q1 not being at all the same as q3 and (q2 or q1) which is what it says in the script]; fortunately, of the curators present, those who understand this error outnumber those who don't, so that the former derive great pleasure from explaining it to the latter, and a general feeling of bonhomie (spurred on by white wine) takes over from technophobia. The master of Balliol, a force to be reckoned with in this context as in many others, desires hands on access to the system, and starts hunting for entries about John Wycliffe (one was duly found, fortunately). I find the one entry in the catalogue for a book printed in the 17th, 18th, 19th and 20th centuries, as well as a breakdown by century. The Curators, not a body of men to stand and gawp, are by now in a highly satisfactory state of hubbub. The charabanc arrives to waft them away to Lunch at Wolfson, to which, somewhat to our surprise, both myself and my Bodleian conspirators are invited.
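The slip is worth spelling out, since it is one which most query languages invite: 'and' binds more tightly than 'or', so that q3 and q2 or q1 is read as (q3 and q2) or q1, which retrieves a different set of records from q3 and (q2 or q1). A two-line illustration in Python, which happens to share the same precedence rule (the q1/q2/q3 flags are of course hypothetical):

```python
# The precedence slip in miniature: 'and' binds tighter than 'or'.
hits = [{"q1": True, "q2": False, "q3": False}]   # one record, matching q1 only

as_typed    = [h for h in hits if h["q3"] and h["q2"] or h["q1"]]     # (q3 and q2) or q1
as_scripted = [h for h in hits if h["q3"] and (h["q2"] or h["q1"])]   # q3 and (q2 or q1)

print(len(as_typed), len(as_scripted))   # 1 0: two different answers
```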
The world looks better through the inverted end of a sherry glass with the knowledge that there's a lot more of that sort of stuff to come, even better when you haven't got to do more than sound like a human being (as opposed to one of those computer people) while you get through said stuff. I am seated in a place reserved for the Senior Proctor (he couldn't make it, I assume) next to the Public Orator himself, who hasn't been to my Oration, just as I haven't been to his, which makes conversation a lot easier. Chablis succeeds sherry, claret succeeds chablis, to the accompaniment of an unpretentious lunch (one cannot trust the Wolfson chef not to experiment, I am told) of avocado salad, roast chicken, fresh potatoes and beans, orange syllabub. The head Curator makes a short speech; the Acting Librarian makes a longer one; I admire the view over the river.
The charabanc deposits me back at the computing service, somewhat dazed, where I regain my trusty bicycle, and speed home through the sunshine, realising as I approach that it is already close to teatime in the real Saturday world which has of course been ticking on all this time as a background activity. For example, Mr and Mrs Triggs are here to take away the spare bed (as arranged yesterday): Belinda fortunately is at hand to tell them to wait till Thursday (as re-arranged today). As consolation, I ply them (and myself) with duty-free Calvados and enjoy the sensation of being reunited with the rest of the massed Burnards, knowing that all things which should have been done, have been, and not by me. And that it would not be entirely inappropriate for me to go and lie down for a few hours. Which I duly do.
The dark falls on Southmoor Road. The smell of grilling sausages wafts slowly up the stairs, indicative of imminent supper for lodgers and left-behinds. Did you, dear reader, think I had forgotten the date? Of course not! Tomorrow is the tenth of November, as surely as today is the ninth, which can only mean that Belinda has to be taken out tonight and treated like the Real Woman she increasingly resembles. This year, it has been decided, she and four of her peers are to be treated to an Italian dinner out, rather than a disco. As far as I can tell, the trouble with discos is that Boys don't always come, even when you invite them. More fools them, I can only think, as hair is brushed and re-brushed and the genuine Fiorucci hairgrip fixed and refixed in place. The four chosen peers (whose names are Gilly, Izzy, Maggie and Claire) arrive, promptly at 6.15 for pre-dinner sparklers, which occur in the garden with much whooping, after, as is only proper, the giving of gifts. Downstairs, Tata Pam, Sarah Daisy, Elizabeth Rose and the only lodger in residence stoically munch their way through sausage risotto, while we wait for the first of two taxis to waft us all away to the far North, at least as far as the Pizzeria in South Parade.
We have been placed at a large round and a small square table, labelled RESERVED, near the door, so that, although we can't easily see the man in the kitchen bouncing the dough off the ceiling, we can see passers-by studying the menu and pull faces at them. Only Gilly and Belinda dare tackle a whole pizza each; Claire and Lilette have different flavours of spaghetti; Maggie and Izzy share what turns out to be a somewhat miserly portion of lasagne; I content myself with a plate of penne. Everyone drinks coke, naturally, except for the irredeemably adult who stick to the house rosso with acqua minerale. This Pizzeria serves decent and unobjectionable primi; where it excels is in the dolci department. Its banana splits have to be seen to be believed, while la coppa mamma mia (anglice, knickerbocker glory) is altogether out of the run of the mill. It is therefore not surprising that a paltry five girls account between them for five coppas, two cassatas and a banana split. Another good thing about this Pizzeria is that (being an Italian establishment, albeit in alien territory) it is noisy and cheerful and girls who stick paper umbrellas in their hair and go to the lavatory en masse do not have to suffer the icy stares they might encounter at other establishments. That's all there is to tell about this somewhat unusual Saturday really. All five girls, somewhat to my surprise, were delighted to be offered the chance of walking home (we can sing drunk songs, said Gilly, and later managed quite a few lines of "Tipperary", from who knows what atavistic sense of the appropriate) rather than waiting for a taxi again. The walk home was rowdy and rapid, if cold. Then a few card games until the appointed hour for Other Daddies to turn up and retrieve their offspring. (Izzie's was disgracefully late, and got ticked off by his daughter). Then Bed for some, a nightcap for others, and (unless memory deceives me) a lengthy dissertation on the clinical manipulation of people's emotions from Tata Pam, before I gave up trying to maintain the vertical, and went to bed myself.
An unusual sort of a Saturday; and therefore one I decided, as you see, to preserve on floppy disk. Doing so has occupied me for about an hour and a half this Sunday: I haven't written anything but Visit Reports and Manuals and User Guides for so long I've had a lot of trouble with the style. Can you tell, dear reader/user?
(a) CINECA is an inter-university computer centre owned jointly by five major universities in North East Italy; it is situated in the industrial wasteland surrounding Bologna at a place called Casalecchio. It provides computing facilities for research at all of its parent universities on an IBM mainframe running CMS front-ending a Cyber 170 and a CRAY, with a solitary VAX running VMS. Nearly all its users are scientific, but (largely at the instance of the PIXI research group, which had invited me) CINECA has recently purchased OCP and assigned a member of its small consultancy staff the job of overseeing all arts users.
PIXI, which I was assured stands for "Pragmatics of Italian-English Cross Cultural Interaction", is a small research group funded by the NPI (a government agency responsible for inter-university co-operative research projects); its members are linguists teaching English at the Universities of Rome, Parma, Pisa, Bologna and Napoli. I gave them a short introductory talk about the problems of text preparation, outlined the main relevant features of OCP and then assisted their leader (a Balliol man, need I say) to demonstrate how OCP could be used to operate on a little bit of their corpus. This currently represents about seven hours of surreptitious tape-recordings of people asking for help in bookshops. Linguistic features such as turn-taking and stress are easily encoded for OCP, but 'overlap' (where one speaker interrupts another) may lead to some problems. However, the group, which is virtually non-computerate, seemed enthusiastic, and the CINECA consultant was impressed by the ease of installing OCP.
(b) The Instituto della Linguistica Computazionale is a specialist Institute directly funded by CNR (the Italian National Research Council) and headed by the charismatic Antonio Zampolli, President of the ALLC, Consultant to the Council of Europe, etc etc, who had invited me to Italy on his last visit to Oxford. More or less on arrival in Pisa, I gave a lecture to about a dozen members of the Institute, describing what OUCS is and does, particularly as regards Computing in the Arts, more particularly databases; of particular interest to the audience were the Greek Lexicon and the Shakespeare Corpus, and there was also sufficient technical awareness to appreciate the importance of CAFS.
Work done in the five sections of the Institute covers the range of linguistic computer applications, from AI to concordance generation. Zampolli stresses that all five sections are integrated; the groups working on automatic lemmatisation, thesaurus construction and on-line dictionary applications are all obviously inter-dependent, and have an important dependence on the group responsible for the large text archive at Pisa, but it is harder to see how the first group I met (headed by Capelli and Moretti) fitted in. Their work seemed to me pure AI in the Knowledge Representation paradigm, using a version of Brachman's KL-ONE language, extended to include both general conceptual structures and instances of objects related to them in the same structure. They were however insistent that their work was intimately related to the work of the remaining parts of the Institute.
At the other extreme I spent most of the afternoon talking to Rita Morelli, who is responsible for organising the Institute's two rooms of magnetic tapes into a coherent Archive. I described TOMES in some detail and was rather taken aback to learn that all their programming was done in IBM Assembler. Of the tapes which Zampolli had brought with him when he brought his Institute out of CNUCE (the national university computing centre) into the promised land at via Faggiola, over two-thirds have now been checked for usability and their contents catalogued. The tapes contain 2-3 thousand texts, mostly but not exclusively in Italian, varying in size much as ours do. The Italian texts were mostly prepared for the Accademia della Crusca for use in the Italian dictionary project; the other texts for many different scholars. One unusual feature is that all the texts were produced to a common standard format, including some quite recondite encoding features. I offered to include their catalogue in the TOMES database, which seemed to be an acceptable idea, and would be a major step towards implementing the Historic Gottingen Agreement. We also briefly discussed the notion of a "Text Description Language", that is, a high-level descriptive language to which varying encoding formats can be mapped. I mentioned SGML, which is sort of but not quite what is needed. Zampolli had independently proposed research into a TDL, so there might still be some mileage in it, even, who knows, some Euro-funding for the Text Archive.
Unfortunately, several of the people I would have liked to meet (notably Picchi, who is responsible for their dictionary database software, and Bozzi, who works on the thesaurus project) were unavailable through illness or other commitments. This visit was therefore less immediately useful than it might have been; nevertheless it was very interesting to see at first hand how a specialised centre of this type functions. My expenses for the whole trip were paid by Zampolli's Institute. Railway enthusiasts will be pleased to learn that, although all but one of the numerous Italian trains I took during the trip was late (by anything between five minutes and three hours), not one of them ever broke down.
INDEPOL (Intelligence Defence and Police) is the first ICL package developed exclusively as a front end for CAFS; its name reflects the marketplace at which it is targeted, and also its origins. It was developed by ICL Defence Systems with funding from MoD, originally as a way of making CAFS-800 (the hardware precursor of CAFS-ISP) usable outside Stevenage. It has had a lengthy gestation, reflected notably in its complete independence of ICL's current strategic product line. Thus, it accesses CAFS facilities via the RSI to RECMAN directly; it has its own internal Data Dictionary; it enforces its own privacy mechanisms independently of VME. The only point of contact in fact is TPMS, which it relies on to handle concurrency (see further below).
INDEPOL differs in several ways from the conventional static DP package, quite apart from the wonders of CAFS. Its data model is completely flexible, restructuring being unnecessary because of its reliance on SIF. The logical model contains only things called domains and relationships, the latter not explained in any detail; domains are more or less SIF data items. In the physical data model, as well as files and records (the latter containing mostly SIF items, though fixed-length fields are supported), it has things confusingly called record sets, which are actually dynamically created collections of record keys held in virtual store. They are also working on something like value-based sets between files. It has extensive access control mechanisms, implemented by additional predicates added invisibly to every query.
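What "predicates added invisibly to every query" amounts to in general can be sketched very simply; the following assumes nothing about INDEPOL's real internals, and the record fields, user classes and guard conditions are all invented for the purpose. The point is only that the system ANDs a per-user condition onto whatever the user actually asked for before the search is run.

```python
# A sketch of invisible access-control predicates (illustration only,
# not INDEPOL's actual mechanism): every query is silently ANDed with
# a guard condition determined by who is asking.

RECORDS = [
    {"name": "Smith", "classification": "open"},
    {"name": "Jones", "classification": "restricted"},
]

# hypothetical per-user guard predicates
GUARDS = {
    "clerk":   lambda r: r["classification"] == "open",
    "officer": lambda r: True,
}

def run_query(user, asked_for):
    guard = GUARDS[user]                    # the user never sees this
    return [r for r in RECORDS if guard(r) and asked_for(r)]

print(run_query("clerk",   lambda r: r["name"].startswith("J")))   # []
print(run_query("officer", lambda r: r["name"].startswith("J")))   # finds Jones
```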
What sets it furthest apart from the query products we know and love however is its update facilities. Records are updated in situ (hence the reliance on ISAM), any contention being handled by VME. A modified record is immediately available for searching. (I don't think this is true even of DCI-based systems, which cannot open files in read/write mode).
Its user interface is startlingly simple: you can use either dynamically created forms or a very simple command language with minimal clutter. Fuzzy matching and quorum conditions are described by a syntax arguably closer to natural language than DCI's. The same forms can be used for input and output. Macros can be defined both to abbreviate commands and conditions and also for record sets.
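For the uninitiated: a quorum condition asks that at least so many of a list of search terms be present in a record. The sketch below shows the idea only, in Python rather than anything resembling INDEPOL's actual syntax, which was not spelt out.

```python
# A quorum condition: "at least `at_least` of these terms must occur".
def quorum(text, terms, at_least):
    return sum(term in text for term in terms) >= at_least

record = "red transit van seen leaving the harbour at speed"
print(quorum(record, ["red", "van", "lorry"], 2))       # True  (2 of the 3 present)
print(quorum(record, ["blue", "lorry", "bicycle"], 2))  # False (none present)
```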
The launch was a very professional affair. We were treated to a dramatic slide show loosely derived from Raymond Chandler, in which various fragmentary clues were presented to a private eye. The online demonstration then used INDEPOL to solve the case by searching various files for leads indicated by said fragmentary clues. This was both entertaining and impressive. After lunch (passable pie, decent wine, unambitious salads, disgusting coffee), there were more demonstrations. The whole affair was very carefully stage-managed; each group of punters (entirely from the military and the police, so far as I could tell) having its appointed ICL minder to fetch its drinks, introduce it to the right boffins and keep it apart from the other punters. P. Leggate from the English faculty library and I had our new salesman all to ourselves; he revealed that he was hopeful of getting funds from DTI to get the package released to one University site, presumably either us or QMC. I may have blown it, by getting too embroiled in technical discussion with Jim May (main guru behind the software) to pay much attention to this heavy hint. I hope not: it looks like a very interesting opportunity. And, to round off a perfect day, neither of the trains I took broke down.
This disastrous day began with the realisation (around midnight) that the batteries in my alarm clock needed replacement. Foolishly disregarding this evil omen, I prevailed on British Telecom to wake me in time for the 0715 train to Manchester, which they duly did. British Rail however was less satisfactory. Said train, having staggered through the Midlands in a desultory manner, finally collapsed inside the tunnel leading into Birmingham New Street, where it was officially declared in a memorable piece of BR-ese "a failure awaiting replacement engine", just as the connecting service to Manchester departed. The next connecting service to Manchester was also twenty minutes late, for good measure, so that I missed another connection at Crewe. So it was that my Shakespeare demonstration was presented to royalty while I wandered disconsolately around Crewe station in search of something remotely interesting to read. I finally arrived at W.Gorton at 1245, just in time to hear D. Daice's last sentence on receiving the coveted gong (in fact a large slab of perspex) from the hands of a well-tailored George V look-alike.
There were about fifty ICL CAFS luminaries present, and about half a dozen users all from the CAFS CUA WP, while HRH Prince Michael of Kent (the aforesaid lookalike) fielded a modest team of four minders in very natty suits (though not as natty as his own). Over lunch (a visually dramatic if bland cold collation of salmon disguised as submarine, turkey disguised as frog, veal disguised as chicken and roast beef in a party hat) I learned that the Shakespeare demo, handled by Les Teare, had gone down well with HRH, who had also been presented with his very own OPD. This was incidentally the first time anyone had told me that my humble PQR program was to be given the regal once-over (and on an Estriel too). I had also missed one of H. Carmichael's rapid summaries of the wonders of CAFS which was variously described to me as too technical (by A. Hutt who also confirmed that PDS was now a dead duck and asked why we didn't write our own PDS to QM bridgeware in Prolog) and really comprehensive (by the appalling D. Lovegrove who wanted a certain compromising passage in the CAFS WP Report altered). I met various dignitaries from the history of CAFS, only previously known to me as names in bibliographies, such as G. Scarrott, J. M. Evans, L. Crockford and also V. Maller. From T. Wansborough I learned that the Fortran Interface was also a dead duck; in fact probably the only good news for OUCS to come out of the whole ridiculous affair was the promise of a pre-release of DCI-200 this year, endorsed by both K.Watts' boss and G.Haworth. After lunch and the royal departure, we were all presented with a nice commemorative medal now on display in my office (please form an orderly queue) and the option of a guided tour round the premises. I foolishly took this up, and was consequently rather bored by vast acreage of unnaturally tidy machine room, stainless steel boxes, 600 megabit cables, girls rewiring boards etc. Just before rushing to catch a train which didn't actually exist, I met L. Teare who had a few interesting things to say about running my software on series 39: guess what, it sometimes gets the counts wrong, just like our very own CAFS engine.
Back in the hands of BR, I was only mildly surprised when the train I did finally catch (the 1725 from Manchester) suffered an engine failure somewhere near Dudley, fairly resigned to the hour's delay at New St while a new engine was found, and almost amused when, some ten minutes later, on the way into Birmingham International, the wheels of the coach I was travelling in began to give off smoke due to faulty braking systems. An unfortunate BR executive on his hols was also on the train and did much energetic rushing up and down and apologising to passengers. Mirabile dictu, the train to which we were all transferred then ran without incident, arriving in Oxford a mere 15 minutes after its scheduled time, i.e. 2245. Cycling home I calculated that of seventeen hours awake so far, eleven had been spent on a train in various stages of collapse. There must be easier ways of getting a free lunch.
This one-and-a-half-day conferencette had been ingeniously timed to squeeze into a weekend separating the annual gatherings of two of the three major international bodies to do with theatre research, namely SIBMAS (Societe Internationale des bibliotheques et Musees des arts du Spectacle), which was last week, and FIRI (Federation Internationale de recherches Theatrals), which is next week. Consequently it attracted a rather more distinguished and cosmopolitan patronage than its subject matter might have led one to expect. There were about a hundred delegates, mostly British or American, though with a sizeable number of Germans, French, and Scandinavians as well. More exotic presences included a lady from the Australian Arts Council, a Pole and a genuine Russian from Moscow. Most of those present appeared to be fairly senior staff from theatre museums, libraries or archives, having only a smattering of knowledge about computing, with a few notable exceptions.
Proceedings were held at Gresham College, which seems to be the nom de guerre of City University's Dept. of Arts Policy; this institution (almost literally) hangs out on the 12th floor of Frobisher Crescent, in the Barbican. I have never seen this monument to the power of money before, and was therefore quite impressed (if perplexed) by it.
The first day began after lunch with three demonstrations. The first, live from Munich, was of TANDEM, the current market leader in theatre databases. This is an EEC-funded project, currently indexing all dramatic works performed in Germany since 1960, some 13,000 documents; as well as bibliographic data, it holds information about productions (the roles, performers, costume designs, instruments used, stage sets etc.). It uses STAIRS, and runs on an IBM 370 belonging to the German Ministry of Agriculture. The demonstration was moderately impressive, if you've never seen a demonstration of STAIRS before. The second demonstration was live from City University New York, where a very similar U.S. project is going on, using this time SPIRES. The demonstration was dauntingly tutorial in nature, and again, impressive only if you've never seen an information retrieval system, circa 1965, in action. The database was about 6 Mb, of which rather more than half is taken up by the index file. The third demonstration was live from the Barbican Booking Office and consisted of (a) a nice straightforward ticket-management package for computerised box offices (it's called BOCS and runs on a PDP-11) and (b) a rather natty booking system for use by both theatrical agents in search of gigs and theatre managers in search of acts, which runs on Prestel.
Suitably refreshed by this brush with reality, we then went back up in the lift to Gresham College for a swift Reception (wine, olives, peanuts). This proved to be remarkably convivial considering I hadn't met any of the people present before (except for Joe Donohue who claimed to remember me from Dartmouth nine years ago and stole one of my jokes for his speech the next day); perhaps theatre historians are friendlier than database specialists.
The second day consisted of three sessions, one on "International" (i.e. non-British) theatrical databases, one on the home-grown variety, and the third on whether the twain could ever meet.
Session one was inaugurated by a panel of distinguished Presidents and such like, making some fairly anodyne remarks about the virtues of international co-operation. Ian Herbert, organiser of the conference, also spoke, bringing to our attention the production of Cyrano de Bergerac to be screened by C4TV tomorrow (though quite why I know not), and explaining that the purpose of theatrical databases was to pass on to future generations the theatrical experience.
Speakers were limited to ten minutes each: a good idea. Joe Donohue gave yet another update on the state of the London Stage Databank at the University of Massachusetts. This, currently using INFOL2 on the CDC Cyber, is now being moved to a Corona (IBM PC clone), attached to something called an Omega Bernoulli Box (a sort of 20 Mb jukebox of floppies) and running Revelation. Donohue confidently averred that this would be compatible with everything else because it used ASCII. Susan Madrell from the Sydney-based Arts Documentation Service of the Australia Council described a large collection of Australian press cuttings (about 57 k clippings) covering anything vaguely artistic and Australian since 1966. This is indexed by proper name, and a subject index is to be added from next year. They too have a PC, and use Wordstar and dBase II. It is planned to "download" the data to something called Ozzynet (spelling?). Didier Augustin from Metz described through an interpreter the Mirabel System, an online database of available play scripts, accessible via Transpac, the French answer to Prestel. Any French-language play script (almost) can be archived at Metz, and its details (cast, plot, resources, category...) entered in the database. Of about 2500 plays currently held, 84% are unpublished. Augustin was very alarmist about becoming over-reliant on i.t.; perhaps the recent temporary collapse of Transpac due to overloading was at the back of his mind.
After coffee, the two major existing databases which had been demonstrated on the previous day got further exposure. Irving Brown (City University of New York) provided some background to the Theatre Research Data Center (TRDC) and its newly-published International Bibliography of the Theatre (IBT). TRDC had painstakingly established a taxonomy of theatrical terms and designed a Data Entry Sheet; data capture was by a team of about sixty voluntary unpaid field bibliographers or by co-operation with "other international projects" (i.e. TANDEM). IBT is computer typeset directly from the database (though not as nicely as the Greek Lexicon), will appear annually and contains what its promoters clearly regard as a mind-boggling 1300 entries this year. Heinrich Huesmann (Munich) gave some more information about the political structure of TANDEM: it is EEC-funded and is looking for national bodies with which to co-operate, co-operation consisting of the national body depositing its records with TANDEM and in return getting the right to use the database. This idea is clearly far too Teutonic and sensible ever to work.
Over lunch I was buttonholed by an egg-shaped bookseller who had been sold a Gould running Oracle and was wondering what to do with it; this went on so long that I barely had time to get my overheads in order before the afternoon session began. A succession of lady librarians described their varyingly interesting cataloguing systems. Sarah Woodcock (Theatre Museum, London) was using, or planning to use, TANDEM for their very miscellaneous collection; she also used the Museum Documentation Association's forms. The Betjemanesque Jane Hatford Dunn (British Theatre Association) was also in transition to TANDEM. They have vast numbers of sets of plays for loan indexed to a varyingly successful degree. I gave my standard soft sell for the Shakespeare database at top speed, which appeared to provoke stunned amazement. Judith Chapman (Resource Dance Centre at Surrey) uses INFO on the Surrey Prime to index bibliographies of dance material. Gillian Hartnoll (British Film Institute) outlined plans to integrate several existing databases (the British National Film and Video Catalogue, the index to Sight and Sound, the Monthly Film Bulletin etc.) using Oracle on a cluster of four micro-VAXes. This sounded v. impressive.
The final session on integration, standards, how and why was a general free-for-all which had the occasional high spot, notably a statement from the editor of a Russian theatre magazine requesting articles on computerisation and a few terse comments from Professor Carlos Tindemans, vice-president of the International Association of Theatre Critics who, introduced as a leading theatre semiologist, stated that he was in fact a semiotician and then sat down. Ian Herbert proposed a form of words giving himself a mandate to investigate the feasibility of setting up a British Theatre Database which was passed on the nod and the proceedings broke up in good humour.
The Report looms ever larger: an initial complete draft is due for the next meeting (in one month), typesetting should take up most of September and the publication date is October 4th. Quite a lot of it is already on 8801 floppies so we might even make the deadline.
Actions from the previous meeting of interest: QM250 won't be able to write to ISAM files. The response to my other paper (on IDMS/CAFS integration) was slightly more encouraging: it was agreed that IDMS could be enhanced to use CAFS for physical scans of a database, and it emerged that the historical reason why the CAFS view of IDMS records excludes pointers (a restriction now described by an ICL spokesperson as "daft") was simply that QM didn't want to see them.
VLSI developments: amongst the (internal ICL) register of about 20 candidates for inclusion in the next generation of CAFS engines are: key channel enhancements for context-sensitive searching and variable-length embedded don't-cares, more key channels, improved performance from the retrieval unit, more flexible record handling by the format unit and integration with the file correlation unit (though the latter is subject to a marketing decision). Which of these get in before the design is fixed in silicon remains to be seen. Target speed of the new device will be at least 10 Mb/sec, just as a result of VLSI. At least two years before the prototype appears, of course.
The new CAFS microcode, with bugs cleared and a 20x speed-up, is still only in use on series 39 machines. It ought to work on a 2966 too, but no-one seems to have tried yet.
Hugh Devonald floated some ideas about the future structure of the working party, its affiliation etc. There was a general feeling that its work would continue under the umbrella of LSUG, perhaps as one of the new technical committees to be set up. He also incautiously let slip the fact that the CAFS SIG has £1700 in the bank, even allowing for the estimated costs of the new publication, which led to much speculation about what wine would best accompany the launch of same.
The December CAFS lunch at Cafe Royal will feature one C. Cheetham as ICL speaker and D. Fairbairn (currently head of NCC, about to join James Martin Associates) as guru. Also (of course) OUCS, BCC, CEGB.
Before lunch we had a quick demo of the BCC library enquiry system, via a conveniently placed OPD. This was moderately impressive: the idea is nice but the use of Bulletin is fairly unimaginative, and the absence of any keyword searching makes the system less than useful. You can only retrieve books by specifying one or two words in the title; no other fields are CAFS searchable. It also takes a noticeable minute or two to get any response while it searches its paltry half million titles. Poor response time was attributed to the fact that it was school hols as well as lunch time.
Most of the day was spent going through more draft sections (about three).
This four-day workshop (the title of which was even more impressive in German) had of necessity a floating population, but over the whole period there were some notable absences (e.g. the French) as well as some unexpected presences (i.e. the Italians en masse). Attendance averaged thirty each day, predominantly German and Austrian, with a sprinkling of Swiss, four Italians (two each from Pisa and Rome), one American (namely Jarausch of North Carolina, president of the International Something for Quantitative Historical Research), one Belgian (Paul Tombeur from Cetedoc, Louvain), one Dane (Marker, from the Danish Data Archives) and three Britons (May Katzen from Leicester, Kevin Schurer from Cambridge and myself).
The stated purpose of the Workshop (or, as Zampolli persisted in calling it, Washup) was to investigate the feasibility of standardising the machine-readable sources increasingly used by historians and to promote their free exchange on a European basis. Its unstated purpose (according to Tombeur who, as sole representative of the Francophone world, was in a somewhat Machiavellian mood) was to consolidate its organiser's position as a newly-appointed mere Austrian in the pecking order of Germanic scholarship. Certainly there was much wheeling and dealing going on, mostly in German, and it was evident that quite significant gestures in the direction of European co-operation were being made. Either way, I found it an unexpectedly worthwhile and unusual gathering: worthwhile in that a formal agreement between the three Text Archives represented was actually committed to paper, and that I found out about several database projects, not at all unlike our own, previously unknown to me; unusual in that the informal structure and small scale of the occasion permitted quite detailed discussion.
The main achievement of the Workshop was probably the agreement between Zampolli, Tombeur and myself reached over lunch on the first day. This had four heads: to combine our Archive catalogues, to continue to control access to their contents in the same way (effectively) as we currently do at Oxford, to try to get legal guidance on the copyright problems involved and to investigate ways of standardising descriptions of text formats. A proposal will be put to a subcommittee of the Council of Europe chaired by Zampolli for MONEY to work in this area. This agreement was achieved largely in reaction to a proposal made by Manfred Thaller (the workshop's organiser) which we all agreed was unworkable for text, with all its attendant copyright problems, however desirable it might be for unpublished historical sources. Regrettably, the only person who might have been able to introduce some intellectual stiffening to the discussion of text formats (van der Steen, whose paper on text grammars presented at ICCH this year was also to be presented here) was unable to attend, as were representatives of the major French and German text archives.
During two days of rather circular argument, it became apparent that the notion of conceptual analysis as a necessary precursor of database design is still widely regarded with suspicion by historians, being seen as the preserve of informatics. For many of those present the phrase 'data description' was assumed to mean something more like the proposed 'Study Descriptions' which the Social Science Survey Archives (coincidentally meeting at the same time in Essex) have been trying for some time to standardise. In my innocence, I made the point (several times) that computer-held versions of original source materials need rather more abstract descriptions than are needed for derived sets of numbers and standardised encodings. The OUCS database design course, part 1, might have been of some assistance here, but I was not called upon to give it; instead I gave a condensed version of my Nice paper, stressing how the TOMES database resembled its abstract model. I also found myself chairing a most unsatisfactory discussion on how texts should be described, for my sins.
Most of the presentations were given in German, with rather haphazard summary translation. What follows should therefore be regarded not as an exhaustive account, but just as a crystallisation of the bits I could (a) understand (b) remember.
Much of one day was given over to a presentation by teams from Freiburg and Munster of a massive database of Mediaeval German names extracted from necrologies, abbey roll calls etc. This proved to be the Greek Lexicon writ large (they have about 400,000 name forms and the database occupies 230 Mb); one could unkindly say that its software is also pretty Mediaeval: they use Sperry's DMS-1100, which is a Codasyl system, but access to the database is provided only by a query language which looks very much like Data Display, circa 1975. The part of it of which they were proudest was the vastly complicated lemmatisation code which determines what the probable root form of a name is if the particular variant of it required is not yet in the database. They promised to send me a copy of their schema design in which they have somehow managed to find a need for about 40 different record types to support a subset of the facilities the Greek Lexicon supports with fewer than a dozen.
A team from Zurich described an interesting, if methodologically suspect, project in which vast amounts of data about the weather in Switzerland between 1525 and 1860 had been extracted from all sorts of written sources and then combined to produce all sorts of time series analyses about changes in agriculture, social structures etc. A gentleman from Salzburg described his attempts to analyse patterns of Mediaeval migration using Thaller's own CLIO system, unfortunately entirely in German. CLIO was also the subject of a presentation, though not a very clear one. It is a PL/1 package, currently being rewritten in C, a novel feature of which is its string pre-processor, which converts from more or less any input format likely to be encountered in "free text" versions of parish records, chronicles etc. down to its own internal structures, access to which is then provided by an interactive concordance generator. Various other software tools (e.g. to do nominal record linkage and lemmatisation) are also provided, but Thaller did not have time to do more than sketch the architecture of the system.
I was more impressed by a man called Mergenthaler from Ulm, who has wrapped up SIR/DBMS, TEXTPACK, COCOA, a word processor, the standard SIEMENS archival system and possibly some other bits and pieces into one consistent screen-driven package, for use by psychiatrists doing content analysis on transcripts of patient interviews. The raw text is put through a spelling checker as well as a dictionary which identifies key (psychological) concepts within it. His system is also being rewritten in C, which appears to be where it's at in Germany these days.
I was also impressed by two of the art historians present who did not actually give presentations but with whom I had some interesting discussions about the Ashmolean project; one was from the Marburg institute, whose work with Iconclass I already knew of; the other was Dr Albert Schug from Cologne, who appeared to be the Grand Old Man of museum applications in Germany.
Finally, I met two archaeologists with something intelligent to say (not a very common occurrence). One was selling a natty little micro-based system called ARCOS which records sherd images on videotape and then automatically analyses them to provide cataloguing data; the other was selling a detailed descriptive taxonomy for archaeological specimens. The former costs 60,000 DM and I have an English-language glossy about it; the latter is free, but all in German.
This was the first joint meeting of two of the larger ICL user groups (Data Management and Large Systems) which are to amalgamate in line with reorganisation within ICL later this year. It began with reports from three working parties of DMUG (LSUG doesn't seem to have any) which are to be published shortly. The DDS Subgroup's report sounded interesting: it makes recommendations about good practice and proposes some major enhancements, notably interfaces between DDS and graphics packages and text retrieval systems. The IDMSX subgroup's report, being largely about what to do when your corporate database is so huge and complex your subschemas are too big for the Cobol compiler, sounded less interesting: apparently 29 enhancements to IDMSX are detailed in it. The Applications Generation Subgroup is responsible for something called "4GL" which turns out to mean Application Master and Report Master (but not RCI); its spokesman didn't get much time to do more than say that although their report was being printed and did describe some of the shortcomings of AM, they did not feel their work was half begun yet.
These reports were followed by three presentations, two short and commercial, one long and less so. A man from Cincom, the company which gave you TOTAL (a dbms so antiquated it has 16 years' experience in the field), gave quite an impressive condensed sales pitch for MANTIS. This is a fast-selling 4th Generation system now available for VME machines as well as Honeywell, IBM, VAX. The next release (but not this one) will use ICL's IDMS and DDS underneath; it has its own dbms which supports a methodology not a million miles from SSADM; also its own procedural programming language described endearingly if obscurely as a "mix of Basic, PL/1, Pascal and APL". Equally ingenuously, the man said that it was a high-performance system because it was "compile-free", i.e. interpretive: a clanger which all subsequent speakers picked up.
Second sales pitch of the day was for MIMER, oxymoronically described by someone called Mel from Savant Enterprises as "the software machine". Mimer hails from Uppsala University (one of the world's leading software research centres according to Mel, who hails from Croydon) and is not just a 4th generation package; on the contrary, "Mimer is ready to move to 5th and 6th generation systems". Such technical details as did appear amongst the hype were that the system is based on B-trees and uses lots of virtual memory (this tactic was grounds for the assertion that MIMER is "usually faster than IDMS with CAFS"); also that there was no need to reorganise the B-tree indexes because they were "automatically self-balancing". It probably is quite a good package, but putting a nice screen front end on it doesn't make it a 4GL in my book.
Highlight of the day was undoubtedly a presentation (complete with live demo) by Chris Worsley from the Electricity Council of their very own home-grown ECLIPSE ("Electricity Council's Integrated Programming Systems Environment" - of course). The EC is responsible for collating and reporting to Government on the vast amounts of information produced by the different Electricity Boards, all of which (at least until recently) had completely independent data management policies, never mind the mixture of pcs, mainframes, IBM, ICL, etc. A common equipment policy has now been decided upon (which incidentally features CAFS heavily), but the evolution of a corporate database is still going on. The problem with ICL's DDS (which Worsley otherwise considered "superb") was that there was no automatic connexion between the top and bottom halves of the quadrant. ECLIPSE uses a code generator called DELTA (not sure where it comes from) to transform essentially documentary specifications into code (COBOL, SCL etc.) appropriate to all sorts of machines. The novel thing in all this is that the ECLIPSE user is specifically disallowed access to any interface beyond documentation! Code is described in terms of LUPOs and LAMs; the system generates Delta code from this, which is in turn transformed into Cobol, Fortran, PL/1 etc. Local coding standards are rigidly enforced: attempted violations of them (such as the use of GOTOs or introduction of DML) will not be compiled. The demonstration showed how simple it was to set up a typical application, using a menu-driven screen painter like ISDA but rather smarter. A nice touch was that the Report Generator looked very much like the Screen Painter. The package is to be marketed by CAP.
After a reasonable lunch (waiter service, hot chicken, cold profiteroles, but not enough wine), assorted ICL hacks took the stand. John Sale presented a paper (attached) about release dates and product versions: some kerfuffle was provoked by the withdrawal of support for VME 8.01 at the end of June, of which many present claimed to be unaware, likewise by the continued non-appearance of QM250. Peter Kirby gave a bland but unexceptionable chat about future trends within DDS, using the word "motherhood" (in the sense "platitude") several times, which I have not come across in an English speaker before. A recent NCC survey had shown that the proportion of ICL users using DDS was far higher than the proportion of non-ICL users using any comparable product: this shows either that DDS is a market leader, or that IBM users are a bunch of wallies, or (most likely) that it is quite difficult to avoid using DDS if you are an ICL shop. Amongst the DDS enhancements specifically mentioned as in the pipeline were support for multiple projects, improved performance, text retrieval facilities, national language capabilities and (most interesting) a tailorable set of screen interfaces to allow for distributed processing at graphics workstations, document preparation systems etc.
Guy Haworth here took to the podium and gave a bright and breezy slide show (featuring, I regret to say, an aerial view of our very own dreaming spires cheek by jowl with a picture of QMC's IRAS satellite) about how absolutely wonderful CAFS was. He described the chief products (remembering to mention that CSO is only COBOL-compatible) and showed pictures of the various industry gongs (now numbering three) which CAFS has collected so far.
The day closed with a presentation, again including a live demo, by Andrew Walker of the long-awaited QUICKBUILD, now mysteriously called Quickbuild Pathway, which is not (except perhaps metaphorically) a new sort of concrete, but the latest all-singing all-dancing ICL user interface to AM, QM, DDS, IDMS etc. etc. All the tedium of entering data into the dictionary, generating query views, applications, schemas etc. is lumped together into this product, together with a nice friendly screen-driven interface with built-in help. It also, if I heard aright, generates default database definitions directly from the business model definitions, thus at last crossing from top to bottom half of the dictionary. I think we should order a copy, so long as it's free.
This workshop, organised by Will Vaughan of the UCL Art History Dept naturally consisted almost entirely of art historians; notable exceptions being Dave Guppy (UCL Computer centre), Kevin Flude (from the V & A), the ubiquitous May Katzen and myself. Most of the art historians were from institutions in London, but a token Pole, a token Norwegian, two Americans and an Italian were also in evidence, presumably left over from the international conference of art historians which finished the week before.
The day began with Tim Benton (OU) describing how he had gone about indexing the collection of Le Corbusier's drawings in Paris. If he'd confined himself to the problems of data analysis in this (he wanted to do things like indexing individual bits of drawings, distinguishing components of identifiable buildings, drafts, doodles etc.), it would have been more interesting. Unfortunately he had been primed to present the case for using a mainframe, in his case at Cambridge, and therefore tried to persuade us that everything could be achieved using an editor and a sort package. He was followed by Mike Greenhalgh (Leicester) who, primed to say how wonderful micros were (in his case running dbase2), completely flummoxed most of the audience with talk of bit-mapped screens and unix interfaces. He also seemed to judge all mainframes by UMRCC, which seemed a bit unfair. Amongst products he plugged were a text editor called VEDIT, the Leicester VT100 simulation chip for the BBC micro, a mark-up language called ASPIC and -yes- the OUCS Lasercomp service. I tried to redress the balance by giving a much abbreviated version of the presentation I used for the Ashmolean last year on data analysis, which was quite well received. Will Vaughan (UCL) argued the case for doing your own programming by showing how he had managed to re-invent a reasonable data entry package and (rather more entertainingly) demonstrated a program running on a BBC micro which would draw Mondriaans for you.
Over lunch (which we had to buy for ourselves, shame), I did my best to persuade K.Flude that the V&A should consider CAFS a bit more carefully: he turns out to have been badly bitten by IDMS when an impressionable student at the Museum of London. After lunch, Cathy Gordon (Witt Library, Courtauld) described the impressive database system they are now using to catalogue their collection of reproductions. This had several interesting features I intend to pirate for the Ashmolean, notably full support for divergent interpretations of the various pictures and a very natty hierarchic system of encoding iconography. The software is a customised unix-based package running on something called a Plexis P25. It was particularly reassuring to see entities we have included in the Ashmolean model (such as object-events) reappearing in their design. Lindy Grant from the Conway library of the Courtauld gave a presentation which, although rich in illustration, had very little to say about computers, other than to manifest complete distrust in their capability to cope with the complexity of architectural description. I have never seen quite so many photos of Chartres Cathedral in one day though.
Kevin Flude (V&A) summarised the current state of play in the computerisation of national museums in an appropriately gloomy monotone: nearly everyone uses what the MDA offers, which is basically batch runs of GOS to produce your catalogue; no museum has the money or the expertise to develop new interactive systems and no package currently available seems suitable. Finally, Rob Dixon from Erros Computing gave a sales pitch for (and demo of) a package called STIPPLE (System for Tabulating and Indexing People, Possessions, Limnings and Ephemera). This package runs on an IBM System 38, and thus can afford to be lavish in its use of B-tree indexing, which appears to be its main novelty. Its interfaces, despite Mr Dixon's assurances to the contrary, seemed rather obscure to me, and at present you can only use it by leasing a line to Erros Computing's machine in Abingdon. The Tate Gallery has apparently fallen for it, but no one else, as yet.
The Workshop continued for a second day, devoted to small group discussions on various topics; although I was unable to participate in these, the feedback during the papers on this day was quite encouraging: the art historical world is ripe for computerisation.
This was the second meeting of a small (about 10) working party set up to produce a second report on the exploitation of CAFS-ISP. The first one was circulated in April this year, since when some new bits of software have been "productized" and several more eager guinea pigs have appeared. Membership of the group is predominantly from the public sector (N. Thames Gas, N.W. Gas, Southern Water, CEGB, Northampton Borough Council), except for a pair of odd balls (Burton's Management Information Systems and OUCS).
In time-honoured manner, membership gives access to early information about ICL thinking and an ICL elbow to twist (Guy Haworth) about our requirements, in return for a reasonable amount of flag-waving on behalf of the company.
CEGB reported on their recently completed trial of the QM/CAFS combination using huge ISAM files and a test IDMS database. The former application concerns both routine and unanticipated maintenance of each of half a dozen power stations, each of which involves upwards of 100 Mb of detailed information. Ad hoc and rapid access to the information by (non-computing) engineers is clearly essential; it is also used by the Radiological Protection Board. The IDMS database was a mere 40 Mb development system. Full details of the trial are available in a report; broadly speaking, it entirely endorses ICL's confidence in the QM/CAFS product as an end-user system, while repeating several of our own minor usability niggles. They do not use output record types, and they define macros for their end users; CAFS searchable text is not used because the file containing it is too large and volatile (about 40 Mb/site, updated daily), but they planned to use it on a different system, which they were confident would fit onto only one FDS 640 when completed. They did use QM within a TP service, where it slowed things down, but not much. (Southern Water here reported that use of TPMS and CAFS was causing them great mental anguish, largely due to the infamous software notice QM43, which should not be applied.)
After lunch, the Working Party's questionnaire and the format of its report were discussed in more detail. About 30 replies had so far been received, mostly indicating that actual usage of CAFS is still very low, although once converted users never look back. Several responses indicated dislike of QM and a need for better software interfaces before CAFS would even be considered; one (the Inland Revenue) opined that CAFS was "too slow". It was agreed that details extracted from respondents who had expressed a willingness to co-operate would form a significant part of the new report, together with detailed accounts of the WP members' applications. I therefore agreed to write up our own eccentric usage of the beast by the next meeting, which will be held in Northampton at the end of this month.
I obtained (but subsequently lost) a copy of the draft specification of QM250 and a useful tip about getting diagnostics out of QM. I also travelled back with the lady from Northampton BC, who confirmed all our trepidation about PDS110 and added more. They have vast numbers of huge boring old EtE-type files from old systems which they just put through ADRAM and into QM, while development of new systems seems to go on within user departments using PDS!
This three day meeting was one of the more productive DBAWG events; there has been somewhat of a lull in activity this year and a feeling that the group was not heading anywhere particularly interesting or productive. No doubt due to the fresh Highland air, these tendencies were reversed at this meeting. Attendees were as usual, with a few faces new to me, notably a vociferous gentleman from GCTA and a dour lady from Aberdeen.
Day 1 was taken up with admin, liaison reports etc. The ANSI draft proposed NDL (due for publication next month) and RDL standards had been exhaustively reviewed by WG15 of ISO TC97. Of the comments on the RDL passed back to ANSI for action, the vast majority had originated with DBAWG; commentary implying further work on NDL was clearly now a little late in the day. DBAWG's major concern here was to ensure that some standardisation body carried on its own work on DSDL; this had been accepted as a work item by WG15. During the discussion, several voiced the opinion that in this area, as in so many others, American public bodies have a tendency to ignore European comment, however authoritative or reasonable.
The group then turned to a lengthy and remarkably vapid paper from a different ISO group (WG5) on the database environment, on which detailed comment had been requested. The paper was the only evidence so far available of WG5's (perhaps over-ambitious) goal of defining a reference model for any dbms; demolishing it took up most of the rest of the first evening.
Day 2 (apart from a report on BNCOD3 which more or less tallied with the opinions in my own visit report on same) was given over to three working papers on "New Datatypes". The topic had been proposed at the January AGM as an important work item, but this was the first occasion on which the group as a whole had discussed exactly what it might mean. Two of the papers, by representatives of Philips and DCE, were very general, while the third (mine) was very specific. During the discussion, which was wide-ranging, not to say heated, it became apparent that none of them was really concerned with what the group as a whole thought "New Datatypes" meant. By midnight it was apparent that any future work in this area would concentrate on the use of abstract datatypes (things like Algol68 structures, Pascal sets etc.), despite valiant efforts on my part to tie the matter down to specifics such as text or graphics.
For all of Day 3 we divided into two subgroups (apart from an informal resumption of the main protagonists of the "new datatypes" controversy).
Following the acceptance of the current DSDL as an ISO work item, DBAWG had agreed to extend its current specification to support the RDL as well as the NDL draft proposed standards. One of the subgroups therefore concerned itself with this new relational DSDL, and reportedly made considerable progress.
The other subgroup, in which I participated, turned again to the vexed question of Access Control. This topic is now enjoying a new lease of life, breathed into it by two recent working papers in which some real syntax has begun to emerge. Last year's work in defining a reference model of access control, from the top down as it were, is now static; this year's will be on the definition of the functions necessary to support the reference model, from the bottom up. Several hours of reasonably productive discussion ensued, together with promises of some six working papers.
Day 3 closed with a plenary discussion of DBAWG's aims and objectives over the next year; these were much more clearly apparent by this stage than at the start of the meeting. On the journey back I had the opportunity of comparing entity models with the CACI representative, an interesting experience.
At this, the third annual British conference on databases, it appeared that theoretical issues were beginning to predominate, perhaps not surprisingly in that Computer Science Departments are so dominant in its organisation. This may be no bad thing; there is, so far as I know, no comparable forum in the UK for informal discussion of database research.
As is customary, the Proceedings were handed out on the first day, this time in the form of a real book typeset by CUP from authors' machine-readable mss. Lasercomp enthusiasts will be pleased to know that Britain's oldest university press had made a right dog's breakfast of the job; the production schedule did not apparently allow for authorial proof reading.
There were two invited papers, two panel sessions and five sessions of three papers each. The conference began ominously with the first of the invited papers, from the appalling Stephen Todd (IBM) who showed some very nice colour slides of molecules whilst burbling about the database system he had "lashed-up" some years back with PRTV. He remarked with an air of surprise that database was "not central to the computational chemist's perceptions" and completely forgot (till prompted by one P. Salotti) that database search and pattern matching techniques might be of some use in the business of residue identification. The other invited speaker was the legendary R. Kowalski (Imperial) whose account of Prolog as a database language was thorough and well-presented if not unfamiliar. Todd remarking from the back row that it could all be done in relational algebra, Kowalski riposted that there was more expressive power in Prolog's left earlobe than the whole of relational algebra.
This exchange was, alas, about the only sign of intellectual debate during the conference; despite much talk of Controversial Issues, there was little controversy, particularly during the truly awful Panel Sessions. These (one on user interfaces, one on teaching methods) could have been an opportunity for stimulating and worthwhile argument if they had been chaired with more enthusiasm or featured more aggressive or opinionated panellists; instead they were uniformly bland, self-satisfied and trite.
The computer scientists did not have it all to themselves: a few mere users did speak up, amongst whom Tony Wakefield (Bristol Poly)'s account of the complexities of his IMS training course was the best; this, together with G. Loizou (Birkbeck)'s account of the extra software tools needed to support Cullinet's IDMS, was potent advertising for ICL's IDMS (which has had all the goodies Loizou's lot have had painstakingly to write for themselves for the last two years).
Most attractive piece of software glimpsed over the horizon was Paul Feldman (Thames Poly)'s automatic diagrammer for conceptual models. Most currently fashionable theoretical issue was the usefulness (or not) of the triple and the binary model it supported as a universal formalism (Roger Johnson (Birkbeck) and G. Martin (ditto)). Most boring and futile piece of academia was not quite as hotly contended for as usual at these gatherings: the choice lay between Flynn (E. Anglia) comparing various unmemorable design methodologies, Thanisch (Greenwich Observatory) on automatic clustering of attributes by means of embedded dependencies, and Laender (E. Anglia again) on a design tool for the definition of user views of rare impenetrability.
Best Dressed Presentation was probably Ron Cuff (IBM) on a rather eccentric natural language query processor which allowed the casual user to fill in a menu with bits of vague waffle and then translated the resulting mess into QBE queries. Worst Dressed Presentation was undoubtedly Anne Zorner (Sheffield Poly) on the DBAWG's DSDL, which did not get round to its subject matter (the ingenuity with which the proposed DSDL supports dynamic reorganisation of a live database) until the last five minutes.
In the same session, Jon Kerridge (Sheffield Poly again) presented the DBAWG's architecture for a distributed database system and Elizabeth Oxborrow (Kent) presented an outline of progress on the Proteus project, which is a real live (sort of) functional prototype ddbms. Again, sparks should have flown, but didn't. Proteus certainly deserved more time, but uncharacteristic modesty on the part of the organisers (most of whom participate in it) appears to have inhibited this.
Work we have done for P. Gray (Aberdeen) on getting Astrid to work on VME may however give Oxford the entree to this project later this year. Informally, interest was expressed in our experience with CAFS in various quarters; QMC reported that they had one but hadn't worked out how to plug it in yet, and Lancaster that they were thinking of getting one. Two ICL people (no other manufacturers were present) apologised for the rotten support we had been getting and blamed R. Wilmot's axe for it.
The conference was housed in what looked like a converted barracks, but the wine at the conference dinner lasted until after midnight.
This was the fifth in a series of annual conferences on Computers in English Language Research run by a small group of like-minded linguists from the universities of Oslo and Bergen in Norway, Goteborg and Lund in Sweden, Nijmegen and Amsterdam in Holland and Lancaster and Birmingham in England. The like-mindedness consists chiefly in a concentration on the methodology of corpus linguistics, in particular the use of computers to analyse large (more than a million words) chunks of running text. The British Council, ICL, IBM and even a Royal were also sponsoring the event. This being the first time that Oxford had been represented at the conference in any form (while we do have lots of corpora we don't have any linguists), I spent quite a lot of time explaining that the Computing Service and the University Press were not the same thing at all, mainly to delegates excited by the news of NOED, the computerised OED announced the week before.
The conference was held in a hotel on the banks of Lake Windermere, because Lancaster had been unable to find room for it during term time. It was in fact quite a small conference (about 50) covering a broad range of varieties of delegate, as well as of English. (See attendance list).
Proceedings began on Monday with a talk by Henry Kucera (a Grand Old Man of Corpus Linguistics) from Brown University. He reported some fairly unsurprising discoveries in the now lemmatised and tagged Brown corpus (e.g. that nouns and verbs have more inflected forms than do other parts of speech). More interesting was his brief account of work done using a computerised form of Roget's Thesaurus as the basis of a synonym generator. One clever part of the design is that it operates on inflected rather than base forms: you say 'running' and it obligingly gives you 'rushing', 'flowing', 'zooming' etc. rather than 'rush', 'flow' etc. The other clever part is that definitions are also included so that when offered 'fountain' as a synonym of 'spring', the user can specify that he actually meant the sort found in mattresses and hence get a different set of synonyms. The obvious application is in really classy word processing systems, which is why this particular Roget database is not currently available.
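By way of illustration (entirely my own confection with invented data and helper names, not Kucera's code), the design point about inflected forms amounts to something like this:

```python
# Hypothetical sketch of a synonym generator that works on inflected
# forms, as described above; the data and helper names are invented.

THESAURUS = {
    # base form -> list of (definition, synonyms) sense entries
    "run": [("move quickly on foot", ["rush", "flow", "zoom"])],
    "spring": [("season after winter", ["springtime"]),
               ("coiled metal component", ["coil"]),
               ("source of water", ["fountain", "well"])],
}

LEMMAS = {"running": ("run", "ing"), "springs": ("spring", "s")}


def inflect(base: str, suffix: str) -> str:
    """Very crude re-inflection: drop a final 'e' before '-ing'."""
    if suffix == "ing" and base.endswith("e"):
        base = base[:-1]
    return base + suffix


def synonyms(word: str, sense_index: int = 0) -> list[str]:
    base, suffix = LEMMAS.get(word, (word, ""))
    definition, syns = THESAURUS[base][sense_index]  # the definition lets the user pick the sense
    return [inflect(s, suffix) for s in syns]


print(synonyms("running"))     # ['rushing', 'flowing', 'zooming']
print(synonyms("springs", 2))  # ['fountains', 'wells'], sense chosen via its definition
```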
Jan Svartvik, who heads the research team at Lund University which has been working for some time on that portion of the venerable 'Survey of Spoken English' which has been made machine-readable, reported on the pauses between speech chunks, which together account for nearly half of all utterance time, speculating as to their semantic function and the mapping (if any) of these chunks to tone units on the one hand, and syntactic units on the other. His colleague Anna-Britta Stenström developed further the importance of pauses as markers of semantic units in speech, a feature of language the importance of which some of the non-native speakers here did not seem to have assimilated, while Mats Eeg-Olofsson reported on the dp problems involved in the analysis of a tagged corpus of 5000 tone units (or whatever they are) using a 16-bit micro. He presented an intriguingly vague sort of conceptual analysis based on a methodology attributed to Winograd which looked like a confused version of entity analysis as practised in the real world. They are lumbered with dBASE2 to implement the design on, so I don't think they will get very far for a while.
Afternoons and evenings were given over to parallel sessions and I was therefore unable to hear Peter Littlechild (Camerino) explain how to squeeze one Mb of text onto a 5 inch floppy, nor yet Mahavir Jain (Delhi) on the construction of a corpus of Indian English (though I did interest the latter in the Text Archive and in OCP over dinner next day), nor even Mette-Cathrine Jahr (Oslo) on relative preferences for the 's and 'of' genitive formations in everyone's favourite corpora. Instead I plumped for the session supposedly on computer-aided lexicology, where both Jacques Noel (Liege) and Willem Meijs (Amsterdam) discussed the difficulty of automatically parsing dictionary text. The former are now using STAIRS to generate KWIC indexes to LDOCE and to CED, primarily (I later learned from Prof. Engels) to generate ELT exercises. Meijs's project was only announced in March; it will use the software developed as part of the TOSCA project to transform the OALDCE text (supplied by the Text Archive) into a database, and I have already asked for a copy.
The rest of this day was somewhat of a miscellany: Dirk Geens (Brussels) had nothing much new to say about his corpus of modern drama; Yang Hui-zhong (Shanghai) on a 1 million word corpus of scientific/technical English in the making was another candidate for OTA/OCP; Rodolpho Delmonte (Venice) on complex noun phrases had no discernible connexion with computing. While attending to these I missed Matti Rissanen and Ossi Ihalainen's account of work beginning at the University of Helsinki where a large corpus of English is to be assembled on faintly dubious historical principles (also with the assistance of the Oxford Text Archive), which I had already heard all about on the journey up.
At dinner we were addressed by the Chancellor of Lancaster University who read a pithy greeting from HRH Prince Philip and by Nelson Francis (the other Grand old Man of corpus linguistics) who told one very good joke, alas unrepeatable. I was sitting next to Henry Kucera and probably annoyed him intensely by suggesting it was time they started to use predicate logic to express their grammars.
Next day's plenary session included Stig Johansson (Oslo) on types of ambiguity detected during automatic parsing and Willem Meijs (again) with some rather unsurprising evidence that people recognise units larger than individual words in speech, but was mainly given over to reports on the progress of the Lancaster tagging system.
Roger Garside (a computer scientist) typified their grammar as one of 'constituent likelihood'; it works by generating every possible parse of a group of tagged words as a number of trees. A probabilistic weighting, derived from other trees already present in the corpus, is then assigned to each tree using some fairly fancy sums, and the winner is added to the corpus once it has been verified by a linguist. This process now has a claimed accuracy greater than 95%. Geoffrey Sampson (a linguist) gave more details of the tagging system used and likened their approach to that of case law. Finally, Eric Atwell described the use of their tagging programs as a spelling checker superior to (e.g.) the UNIX Writers Workshop in that it would fault such sentences as "Fill in the pink from" on syntactic grounds. To test this he needs a corpus of one million words guaranteed to contain typing errors, which he is attempting to get out of ICL (Portman Rd); I offered to see whether we could contribute any as well.
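For the record, the 'constituent likelihood' idea, stripped of all the fancy sums, comes down to something like the following sketch of my own (the tags and probabilities are invented, not Lancaster's grammar): score each candidate tag sequence by the product of tag-pair likelihoods estimated from an already-tagged corpus and keep the best; a typo like "from" for "form" then shows up as an improbably low score.

```python
from itertools import product

# Invented tag-bigram likelihoods, standing in for counts taken from an
# already-tagged corpus; this is only the idea, not the Lancaster system.
BIGRAM_P = {
    ("VERB", "PREP"): 0.20, ("PREP", "DET"): 0.30, ("DET", "ADJ"): 0.15,
    ("DET", "NOUN"): 0.40, ("ADJ", "NOUN"): 0.35, ("ADJ", "PREP"): 0.01,
    ("NOUN", "END"): 0.25, ("PREP", "END"): 0.02,
}
# Candidate tags for each (possibly ambiguous) word.
CANDIDATES = {"Fill": ["VERB"], "in": ["PREP"], "the": ["DET"],
              "pink": ["ADJ", "NOUN"], "from": ["PREP"], "form": ["NOUN"]}


def best_parse(words):
    """Return the most likely tag sequence for the words, with its score."""
    best_score, best_tags = 0.0, None
    for tags in product(*(CANDIDATES[w] for w in words)):
        score = 1.0
        for pair in zip(("START",) + tags, tags + ("END",)):
            score *= BIGRAM_P.get(pair, 0.05)   # small default for unseen pairs
        if score > best_score:
            best_score, best_tags = score, tags
    return best_score, best_tags


print(best_parse(["Fill", "in", "the", "pink", "form"]))
print(best_parse(["Fill", "in", "the", "pink", "from"]))  # scores far lower: flag it
```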
During the afternoon's parallel sessions, I managed by adroit switching of location to avoid two Belgian papers on CALL, a lady from Jerusalem on pronominal uses of the word "one" and a loony who claimed to be able to index abstracts automatically by paying attention solely to the punctuation. Instead I was able to contrast British (Fanny Leech) and Dutch (Gert van der Steen) approaches to writing automatic tagging programs; I think the Dutch win and hope that I have persuaded van der Steen to let us try out his software here. Remaining speakers that evening were real linguists: Kjellmer (Goteborg) on collocations was sound but dull; Tottie (Uppsala) on types of negation both sounder and duller; John Kirk (a fairly happy KDEM user from Belfast) rounded off the evening with a scattering of sample scurrilous Scotticisms, much to the delight of the native speakers present.
The conference ended with two plenary sessions, one covering research at Birmingham and the other at Nijmegen. John Sinclair (whom I have been trying to meet for approx 5 years) made a forthright statement of the pragmaticist's position on lexicographic evidence: dictionaries should reflect usage as it is found to exist in a representative corpus (they have 7 million words at Birmingham), in terms of both vocabulary and syntax, rather than historical evidence or linguistic introspection. If this position had been given more than lip service by the compilers of CED and LDOCE (both of whom were present looking gloomy and minder-like) then those dictionaries would look very different. Antoinette Renouf described work she had been doing on attempting to elicit usage patterns by tape-recording unscripted encounters between students: this sounded fun, but methodologically highly suspect. From Nijmegen, Jan Aarts and Theo van den Heuvel summarised, with justifiable pride, six years of work on developing high quality software tools for the use of corpus linguists (somebody had to) and the conference ended with an impressive demonstration of their culmination: a video (complete with glamorous Dutch lady demonstrator) of the LDP analysis program which allows happy linguists to page up and down and roundabout sentences in tree form on a classy graphics terminal.
In summary this was a most enjoyable and unusual gathering: Oxford is not renowned for its linguistics department; nevertheless the effort that we do put into facilities like the Text Archive and the OCP is much appreciated by scholars outside the University.
Because of other commitments (Advisory) I was able to attend only the first day of this two-day meeting in Sperry's palatial skyscraper just off the North Circular Road. The following were the chief points of interest to emerge.
Over 132 people attended this workshop on knowledge-based or expert systems, although no-one I spoke to appeared to have any very clear idea of quite what such systems were or might be used for. Their chief components appear to be firstly a knowledge-base, which is not quite the same as a database because the entities represented in it include both vague or imprecise quantities and equally vague rules about their significance; secondly an inference engine, which is a processor capable of both using and adding to the knowledge base; and finally a fairly sophisticated man-machine interface, usually in natural language albeit within a fairly restricted semantic domain. Such systems are VERY EXPENSIVE INDEED to construct, and thus are to be found only amongst the VERY RICH, e.g. the oil industry, the US Dept of Defense and any UK Computer Science Dept which has got its hands on some Alvey funding.
The first speaker, from RACAL, appeared to be rather unnerved by the size of his audience and was also clearly under instructions not to reveal too much about his subject, a system for making sense of the vast amounts of data obtained during exploratory drillings in the North Sea. Since exploitation of any resulting oil wells is usually a co-operative venture in which the cut each member of the consortium receives depends crucially on the size of reserves found at a particular place, it is rather important to get unbiased and accurate information: a 5% error could mean a difference of £350 millions. The system was computationally unfashionable, being model-based rather than rule-based; that is, it works by constantly revising and improving on rules of thumb derived from observations rather than a priori knowledge. There is no attempt to model causality in its knowledge-base, which means among other things that the system has no way of reconciling contradictory conclusions reached from different premises. It runs on a special purpose AI machine (the Symbolics LISP machine) which also supported numerical analysis in Fortran and a conventional database management system (unspecified).
This presentation (which should have been more impressive, given that it proved to be the only one describing a real, fully functional system) was followed by two sales pitches, one for SAGE, an expert systems shell marketed by SPL, and the other for REVEAL, a decision support system marketed by TYMSHARE. Both are also available from ICL and cost an arm and a leg, except that there is a version of SAGE for VAX VMS currently on offer with a massive educational discount.
The SAGE knowledge base is created by the user as a set of rules and objects which the inference engine then uses to establish a goal (e.g. "This user is liable to have a heart attack") by means of a dialogue and backward-chaining reasoning (i.e. people under stress are more likely to have heart attacks, so I must establish how likely this user is to be under stress; to establish which I need to establish how many Advisory Sessions he has done in the last month; to establish which... etc. It's called recursion!). Because objects can have a truth value (probably false, nearly true, etc.), rules involving combinations of such objects are said to exhibit fuzzy logic; thus, for SAGE, if p is 0.3 true and q is 0.8 true, then p AND q is 0.3 true, and p OR q is 0.8 true. Various smart alecks in the audience pointed out that this was a barely adequate fuzzy logic, with which the speaker had the good sense to agree. Nevertheless, I think SAGE would be a good way to learn about expert systems and might even be useful for something. REVEAL by contrast proved to be a souped-up financial modelling system with little to recommend it apart from the use of fuzzy logic in both its database system and its English language interface, so that the modeller can say things like "List all tall rich blondes with large bosoms" without having to specify what 'tall', 'rich' and 'large' mean exactly.
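For my own amusement, the min/max "fuzzy logic" and the backward chaining can be sketched in a dozen lines; this is my own toy with invented facts and rules, not SAGE itself.

```python
# A minimal sketch of the two ideas described above: truth values in [0, 1]
# combined with min/max "fuzzy logic", and a goal established by backward
# chaining through rules. All names and numbers here are invented.

FACTS = {"advisory_sessions_heavy": 0.8}        # known (fuzzy) facts
RULES = {                                       # goal: (combiner, subgoals)
    "under_stress": ("AND", ["advisory_sessions_heavy"]),
    "heart_attack_risk": ("AND", ["under_stress"]),
}


def truth(goal: str) -> float:
    """Backward-chain: a goal's truth is derived from its subgoals' truths."""
    if goal in FACTS:
        return FACTS[goal]
    combiner, subgoals = RULES[goal]
    values = [truth(g) for g in subgoals]
    return min(values) if combiner == "AND" else max(values)


p, q = 0.3, 0.8
print(min(p, q), max(p, q))        # 0.3 and 0.8, as in the SAGE example
print(truth("heart_attack_risk"))  # 0.8, chained back through under_stress
```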
The conference dinner was satisfactorily sybaritic and was followed by the traditional drunken gossip in the bar, during which several people lurched up to me and said "Famulus?" in a menacing sort of way.
The following morning began with a presentation on Salford's new Prolog system (which was also available for hands-on use during the morning). This is (like Poplog) a hybrid in which all those bits that are difficult or impossible to do in pure Prolog (like reading from files or doing assignments) are hived off to another language, in this case LISP. It also supports a better syntax for grammar rules, a Fortran interface and -yes- floating point arithmetic! But it is only available on Prime and is still under development.
The trouble with Prolog of course, the next speaker pointed out, is that it is really only practicable on machines that don't exist and are unlikely to for the next ten years. Something called "a given sector of my client base" was however attuned to it, and so his company (Cambridge Consultants Ltd) were investigating its usefulness in real-time (but not as yet real) applications. Their investigations had however thrown up nothing that could be communicated to us other than a list of the available versions of the language and some fairly superficial remarks about it.
John Baldwin from Bristol and Ron Knott from Surrey re-established the intellectual credibility of this workshop with the next two papers, which described programming languages capable of building knowledge-based systems. Baldwin described his Fuzzy Relational Inference Language, which is a logic programming language incorporating uncertainty in a far more thorough and mathematically respectable way than SAGE. Other buzzwords included the blackboard model, parallel architecture, a self-organising knowledge base and a dataflow machine. His paper was the only one that could properly be said to manifest state-of-the-art knowledge, although Knott gave an interesting survey of the available functional programming languages, typified by LISP, variations of which are very much still alive and kicking.
After lunch, delegates were restored to good humour by Tim O'Shea (OU) who gave a good survey of AI-supported computer-aided learning systems. Apparently the crucial question to ask someone trying to sell you a computer tutor is "What sort of task difficulty model do you have?" and, if this fails to floor him (or it), "Does it support dynamic student modelling?" The speaker was good enough to kindle enthusiasm for his subject, which is saying a good deal in this case. Apparently the process of producing CAL systems is generally known as "authoring". The people (or things) that do it are presumably known as "authorers".
Finally we were given an interesting account of the current structure of the Alvey directorate, and even some figures about how its huge funds were being split up. Whether or not these large sums will succeed in reversing the process of rot supposedly created by that convenient scapegoat the Lighthill Report, or whether they will simply prove a useful way of cutting down funding for other academic research, remains to be seen, though the fact that ALVEY expands to "All Large Ventures Except Yours" may be taken as some indication.
This was that rare occasion, a CUA-sponsored meeting at which no-one was after ICL's blood. With characteristic low cunning the company first put up an odious marketing person to draw such offensive remarks as the audience could muster (chiefly "What about TP?" and "When?") and then lob them over to Hamish Carmichael to be dealt with after lunch. Mr Gershon (for it was he) is much addicted to the use of ACL (Audible Capital Letters) and so we were repeatedly told that CAFS-ISP was now offering a real End User Capability at a Highly Competitive Price; he also kept referring to it as the "Product of the Decayed", which can't be right, even if Vic Maller has left the company. As regards price, however, it is depressing to learn that we could (now) buy a whole DSC complete with CAFS for the cost of the add-on DFC CAFS module, i.e. £35k.
The next speaker was Roger Hare, an equally paradigmatic Man from West Gorton, who acquitted himself well in the extremely boring task of detailing the history of the CAFS-ISP field trials, including such minutiae as the number of bugs and the dates each was cleared. Instructively, we were told that there had been over two hundred (sic) bugs in Querymaster (variously abbreviated throughout the day as QM, QRYM and QYRM) and about half a dozen each for VME 8.10, IDMS350 and DDS650; one (one) hardware failure had required a change in the CAFS microcode. Of the nine field trial sites, one (one) had gone live - the day before the meeting, by an odd coincidence. All had had ample evidence to justify the performance claims previously made for the system. Several technical queries were also handled by this speaker, mostly to do with the impact of CAFS searches on concurrent TP systems, which is predictably pretty dire.
Lastly before lunch, Peter Drake of Southern Water spoke briefly about the working party on high-level interfaces to CAFS, in which we are also involved. He did little more than outline the history of the group, express gratitude for ICL's helpfulness in setting it up and promise that the Report would be finished by January.
The lunch was high on protein but low on flavour, and I spent half of it in the pub with some policemen. I learned from S. George (QMC) that their CAFS would be used, if at all, as a means of supercharging existing IDMS databases, that they were very worried about concurrency problems and that installing VME release 8 was a pig. I also ingratiated myself with one Ian Turner, whose title is Marketing Innovation Manager within UKD Marketing and who is, significantly, a Kidsgrove Man. This means that he is willing not just to countenance but even actively to support the "Death to the Dictionary" heresy, which maintains that CAFS should be accessed via the low-level RSI rather than via the baroque orthodoxies of Bracknell Man. Consequently we are now on the list for field trials of PSAM, RCI, RSI, Textmaster and ANONP (another non-nameable product).
After lunch we were given a tantalisingly brief account of progress on the Inland Revenue's COP (Computerised PAYE) system. The purpose of this is not just to catch up with Paul Griffiths, but also to provide a national facility for each tax office to be able to translate names and addresses into National Insurance numbers. Eventually it will use 47 2966s with 18,000 DRS terminals accessing about 250 FDS 640s (laughter). The first of its 47 regional systems went live in October. The interesting aspect for us is that they use a purpose-written interface to the ISAM RAM, using the RSI, thus giving the lie to Bracknell Man's assertion that everything must be done via the Dictionary.
The next speaker, from Nottingham Borough Council, had enjoyed his lunch rather too much to make any very intelligible criticisms of Querymaster, particularly as nearly everything he complained of was a direct consequence of trying to use a 1960-style database of flat files as if it were an IDMS system, and of behaving as if CME were a real operating system instead of a transition aid. Another speaker from Southern Water (which is the one site to have gone live with CAFS) gave an impressive account of their field trial experiences. They had used QM (QRYM or QYRM) on a 105 Mb database and even in this cumbrous software environment could report vast performance improvements.
Finally, two ICL speakers, Andrew Hutt (on behalf of Bracknell Man) and Hamish Carmichael (on behalf of the human race), summarised the state of play. Hutt asserted that QM without CAFS could search 250 records a second, and 4000 with it. Carmichael, who is much better at PR, proved that a 2966 with CAFS could read at one mile an hour as follows: the upper limit for one 2966 is six CAFS units, giving you a total maximum database size of 121 gigabytes, i.e. roughly the equivalent of 5 miles of library shelves, which would, he claimed, take about half an hour of cpu to scan. Hutt said how happy he was that at last ICL had made CAFS an integral part of the company's development plans; Carmichael said how happy he was that further developments were already underway at Kidsgrove (i.e. not Bracknell) and even at non-ICL software houses. This oblique reference to ANONP clearly augurs well for our early involvement with the RSI, an impression confirmed by Turner's interest in our proposed packaging of an SPSS interface using PSAM.
DBAWG's annual outing to the Highlands proved somewhat more intellectually strenuous than usual, partly because poor weather kept us within doors for most of the time, but mainly because this was the last opportunity to get the group's proposals for a Data Storage Description Language (DSDL) into a presentable form before they are submitted for consideration by the ANSI "Experts' Meeting" in Paris next month. Consequently I spent most of Friday and Saturday in the DSDL subgroup going through the syntax paper (78 pages of BNF) correcting errors, inconsistencies and spelling mistakes and rewording it here and there to keep it in line with the vocabulary used by the latest ANSI specification for network database systems, which our document is intended to complement. The last part of Saturday's session (finishing around 11 pm) was spent revising and partly rewriting an eight page document intended to make the case for a DSDL as an introduction to the syntax paper.
Other entertainments included a plenary session in which the standing paper on Access Control was gently but firmly shot down in flames. Most of my bits of it remained intact, but it was agreed that substantial restructuring of the paper was necessary. This will be attempted (at the Open University) before the next meeting.
There was a proposal to publish the group's other major effort this year (a syntax for distributed database systems) possibly together with other papers in book form, for which I (naturally) proposed the use of the Lasercomp.
There was much informal discussion about so-called fourth generation packages and some useful (if impressionistic) comparisons made between various options we might consider for the VAX.
The eight talks at this workshop covered a range of topics in non-numeric computing. Of particular interest to me was Rod Johnson's (UMIST) lecture which consisted of a history and overview of machine translation from which I picked up a lot of useful material for my lectures. Talks by S. G. Pulman (East Anglia) on Programming Requirements for Computational Linguistics and C. S. Mellish (Sussex) on PROLOG also showed what would happen to database query systems in the future.
The more immediate future was covered by Lou's entertaining exposition on the new FAMULUS and to a lesser extent by Dave Lindsey's (Aberdeen) survey of text retrieval systems. It looks as though we should wait for SIFT from Norway, which, it is claimed, is still going to be free.
The talk by B. Lowndes (Liverpool) on RAPPORT told me as much as I wanted to know about the package, though rather slowly. Two of her examples were computer centre applications - it would have been nice to hear more about some real applications, including the Egyptian archaeology one, which she obviously regarded as a very unusual use of computers.
The remaining session, which I chaired, was on word processing/typesetting. Adrian Hock (Leeds) described his mainframe WP program called LEWPS which was geared very much to Leeds' documentation format. Its only virtue was that the output from one run can be used as the input to the next, rather than using separate input and output files as in ROFF etc.
Heather Brown's long-awaited talk on 'Typesetting - the systems and the output devices' turned out to be more of a sales pitch for laser printers, of which she showed many pictures. The resume of her paper, which had been set by TROFF, contained some very peculiar hyphenation.
She has TEX running on a PERQ at Kent; the PERQ and their laser printer (a Canon LBP-10) are both on the Cambridge Ring. There was very little mention of the Lasercomp, which I tried to rectify by handing out our brochures.
At this second meeting of the working party, I presented a paper describing the current state of Famulus and putting forward the various proposals for improving it that I have accumulated over the years. My paper, which summarises the features of 6 different implementations of the package in 4 pages, was tolerably well received, and it was agreed that it should form the basis of a circular to all Universities' and Polytechnics' Computing Services to elicit information about other versions of Famulus and other proposals, and to estimate the requirement for packages of this type. This circulation will be done before October (not by me); the Working Party hopes to report back, probably recommending that someone should be funded to tidy up the source and add the most popular extensions to it (not me), by the end of the year. It was generally felt that there was about 6 man-months of effort needed to bring the package up to scratch and to add the required enhancements. It was tentatively proposed that the work might be carried out at Manchester or York, and that Edinburgh PLU might become a central distribution site for the package, but since neither York nor PLU were present at the meeting, these proposals were deferred till after the feedback (if any) from the circulation.
This meeting, convened by Richard Field under the aegis and instructions of the IUSC, set out to review the status and future prospects of Famulus as a package widely used for bibliographical and quasi-bibliographical research. It was quite well attended, sites represented being ULCC, Aberdeen, York, Warwick, Atlas, Edinburgh (PLU), UMRCC, Southampton, UCL and Trent Poly. Versions of Famulus were now available on CDC 6600/7600, DEC 10 and 20, IBM 360/370, Honeywell, GEC 4020, Burroughs B6700 and ICL 2900 & 1900 at least, each version having slightly different parentage and features. The Ur-Famulus, which was developed for the US Department of Forestry in Los Angeles some years ago, had been updated somewhat since its invasion of Europe (and New Zealand, which is where the Burroughs version came from); Aberdeen had been in touch with its originators, but the changes appeared to be purely cosmetic and they did not seem willing to make them generally available.
Discussion centred initially on what role Famulus played in the spectrum of text processing packages. It was necessary to distinguish it clearly from full-text searching packages like Status, Quill, Sift, 3ip etc.; such packages were currently under consideration by SPC anyway. Famulus had an unusually clear user interface which was widely admired and liked; moreover the Ossify program meant that an equally clear 'system interface' format was available for transferring between versions. Nearly all sites reported that the package was used chiefly for maintaining bibliographies or similar catalogues; most sites reported about a dozen to twenty users. The largest applications mentioned were the Manchester Museum's catalogue and the Oxford Onomasticon project. It was suggested that the function of KWIC/INDEX was better performed by a full-text searching system or by OCP.
Amongst enhancements proposed were: full character-handling facilities analogous to those of OCP; the ability to define some fields as numeric and to do some simple arithmetic on them (sums, averages, distributions, comparisons etc.); the ability to define date fields; data validation, perhaps by a stand-alone input utility; and extending the formatting options of Galley, particularly to include hierarchically sorted data and tabular output. There was general agreement that the present internal format was wasteful of space and could profitably be reorganised, e.g. to use variable-length records, or even ISAM files where these were available.
The meeting discussed briefly the problems of getting funding for a complete rewrite of the system, or the possibility of evaluating all the currently available software against Famulus-requirements. It was concluded that a working party should be set up to draw up a detailed report of what facilities currently existed in all the known versions of the package, and what enhancements were generally needed. An interface to support these requirements would be defined, in terms of the existing Famulus command language, and those sites which wanted to implement any of the proposed goodies would do so. R.Field would report back to IUSC which would then set up a working party and co-opt members onto it.
I am sorry to have to report that this expensive and over-subscribed event turned out to be little more than an extended commercial for BP's super new all integrated management information system and for the (horrible IBM) systems analysis package they used to create it. Even the three non-BP speakers allowed onto the platform eventually provided little more than an object lesson on how to sell ancient data management concepts to the managerial class, though they did at least address the ostensible subject of the seminar. Other than that, it was all about identifying BUSINESS OBJECTIVES AND STRATEGY and drawing up matrixes showing which Business Procedure corresponded with which Business Objective, enlivened only by the odd bloomer ("Is the company properly organised to meet its informational requirements?"), three times over. The other three speakers were marginally more interesting. Roger Tagg (a fellow DBAWG person) discussed how disparate the elements of an integrated system would have to be, to include elements such as 'confravision', 'colour graphics' and what the French call 'bureautique' (i.e. office Automation). His type of systems analysis, he clearly opined, was more than equal to the task, if the software wasn't. The second interloper was Ed Tozer (a fellow DMUG person) who did a reasonable job of plotting the available software along such unfamiliar axes as 'type of user' (management/specialist/clerical) and 'type of query' (high/low volume, un/structured, ad hoc...). He also asserted that within a few years we would all have a mip on our desks, if not in our pockets. The third interloper, a fellow academic from Newcastle, was similarly apocalyptic about various new types of chip, expert systems, join machines etc. The day was not however a complete waste of time: over the (also awful) lunch I met the newly appointed head of ICL's IDMS design team and impressed on her our distaste for being compelled to use DDS, requirements for Fortran etc. etc.
This was a very useful meeting at which I learned a good deal about ICL's current planning, relating to both the role of CAFS and its software interface. I also obtained some (probably company confidential) dates. To summarise, CAFS is seen as an integral part of the hardware, not an optional bolt-on; there was also talk of the 'File Correlation Unit' or 'Join Machine'. All data management software (which is, after all, what sells ICL mainframes) will interface to CAFS by 1984, either directly (PDS, QRYM, IDMS), via the new 'Relational Programming Package' (or RPP, on which see below) or via 'Predefined Set Access Method' (PSAM, on which see below). RPP is used by Cobol programs (through a call interface, and a preprocessor similar to that currently used by IDMS) while PSAM sits between an existing object program (e.g. a package) and a conventional file on CAFS. In all three cases definitions of some sort need to be present in the Data Dictionary (even for PDS!), mapping the 'relational view' in terms of which queries are always expressed (RPP/PDS/QRYM are 'single sourced') on to the data item/s in the file/s being accessed. This mapping may be quite simple (e.g. a few fields of a record in a RECMAN file) or very complex (e.g. a specified navigation through an IDMS database); some automatic path choosing is present in RPP but should probably be treated with great caution. This integration of query languages has one great benefit (quite apart from its use of CAFS) in that it provides a simple (and relational) interface to IDMS; its major defect is that separate mappings must be generated for each combination of fields constituting a 'relational view' and stored in the Dictionary. This is then used to generate an OMF module (functionally similar to an IDMS object subschema) used at runtime together with the RPP inline code and the user program. Each query must be precompiled in this way (unless QRYM is in use), but may be parameterised to some extent. In those parts of the query where a literal would normally appear, a parameter identifier may be supplied, and then instantiated at runtime. The mechanism for passing parameters through the PSAM interface is not yet defined, but is thought to be via SCL: the PSAM interface is clearly of great interest to us, but equally clearly still the subject of internal discussion within ICL.
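To make the parameterisation point concrete, here is a toy of my own; the names and syntax are entirely invented and bear no relation to ICL's actual RPP/PSAM interfaces. The idea is simply that the 'relational view' is resolved against a dictionary once, at precompile time, and the query is left with a named hole to be filled at run time.

```python
# Purely illustrative sketch (invented names, not ICL syntax) of a query
# precompiled against a dictionary mapping and parameterised at run time.

DICTIONARY = {
    # relational view -> mapping of view fields onto the underlying file
    "STAFF_VIEW": {"name": ("PERSONNEL-FILE", "SURNAME"),
                   "grade": ("PERSONNEL-FILE", "GRADE-CODE")},
}


def precompile(view: str, field: str, parameter: str):
    """Resolve the view against the dictionary once; leave a runtime hole."""
    source_file, source_field = DICTIONARY[view][field]

    def run(**params):  # instantiate the parameter identifier at run time
        value = params[parameter]
        return f"search {source_file} where {source_field} = {value!r}"

    return run


query = precompile("STAFF_VIEW", "grade", parameter="WANTED_GRADE")
print(query(WANTED_GRADE="A7"))  # search PERSONNEL-FILE where GRADE-CODE = 'A7'
```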
This meeting was similar in style and content, if not in surroundings, to the one I attended in the Highlands two months ago, except that there were three or four faces new to me present. On day 1 I applied for and was granted official membership of the Group as Oxford's representative. Reports of liaison with other groups of similar interests followed, in particular a Pascal working party which has expressed interest in the continuing work of standardisation of what ANSI now wish us to call DBCS. Another interesting indicator of future trends was the DBAWG's agreement that its distributed DB architecture was not predicated on an exclusively Codasyl (or indeed ANSI) architecture. As before, the real work of the group continued in its two subgroups and I stayed with the 'Access Control' subgroup. Having at our last meeting failed to agree on a definition of 'action' within an information system, we spent most of this one failing to agree on what a 'rule' was anyway. On day 1, it had a rather hierarchic element (cf. Asimov's laws of robotics); but on day 2 this was discredited in favour of the view that it was just a boolean expression: 'action X is permitted to role Y if (arbitrarily_complex_expression)'. A discussion about 'rights' and the 'revoking' of rights led to the assertion that rules could also be the subjects of (meta) rules, and indeed to attempts to define meta-meta-rules (which define what happens when the meta-rules defining which rules currently apply are changed). It was also agreed that the subgroup needed a dictatorial chairperson, which role I was reluctantly coerced into.
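The 'boolean expression' view of a rule that won on day 2 can be sketched thus; this is my own invention for illustration, not DBAWG syntax, and the actions, roles and context names are made up.

```python
# A minimal sketch of rules as boolean expressions: each (action, role)
# pair maps to an arbitrarily complex predicate over some context.
from typing import Callable, Dict, Tuple

Rule = Callable[[dict], bool]   # context -> permitted?

RULES: Dict[Tuple[str, str], Rule] = {
    ("update_salary", "payroll_clerk"): lambda ctx: ctx["office_hours"],
    ("read_salary", "auditor"):         lambda ctx: True,
}


def permitted(action: str, role: str, ctx: dict) -> bool:
    """Action X is permitted to role Y if its rule evaluates true; default deny."""
    rule = RULES.get((action, role))
    return bool(rule and rule(ctx))


print(permitted("update_salary", "payroll_clerk", {"office_hours": False}))  # False
print(permitted("read_salary", "auditor", {}))                               # True
```

Meta-rules, on this view, are just rules whose subject matter happens to be entries in the rule table itself, which is where the fun (and the midnight arguments) start.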
DBAWG (Database Administrator Working Group) is a long-established working group of the BCS which is also affiliated to CODASYL (Conference on Data Systems Languages - that influential body to which we owe such delights as Cobol and IDMS). Its membership is eclectic and disputatious. Many innovations in the database world have originated in or been subtly moulded by its argumentative members (the IDMS Fortran Interface and the independent storage schema being two notable recent examples).
I was invited to attend this meeting as an observer. The meeting was hosted by Aberdeen University at the Burn, a charming country house on the banks of the North Esk (but unfortunately out of the salmon-fishing season). Other academic bodies represented included the Open University, Edinburgh and Sheffield Poly; manufacturers represented included Philips and Univac; software houses CACI and SCICON were also represented, as were users such as John Brown Engineering and the MOD.
It soon became evident, however, that the members of the group (25 in all) were present in purely personal capacities. DBAWG's current projects are the definition of an architecture to support distributed databases and the extension of CODASYL systems to include access control; its two current subgroups reflect this division.
I sat in on discussions of both subgroups and also on the major topic of the whole meeting, which was the proposed ANSI DBCS specification derived from Codasyl. Of the two subgroups, work on the distributed architecture is well advanced and most of the time was spent clarifying and optimising the expression of the group's proposals in a working paper. Work on access control is less well advanced however: an initial model, based on ISO 'information systems' concepts, has been floundering for the last two meetings on matters of definition. Quite what an 'action' (to or by which access is to be controlled) is, and at what level of primitiveness it exists, has been the subject of two papers, both now rejected.
In discussing the ANSI paper, several worthwhile criticisms emerged, notably the absence of compound data structures and of language-dependent facilities at the subschema level.
The International Word Processing Exhibition (sponsored by IT - i.e. you and me and Mrs Thatcher) suggested several excellent ways of spending £40,000 and quite a few fairly rotten ones. I have acquired a reasonable quantity of glossy literature about the products which took my eye and only given my name and address to a small number of salesmen. Advent Data Products (who provided my free ticket) were busy demonstrating their amazing printer to someone else but I watched anyway: it certainly will print just about anything, but only very slowly. It is still possible to buy acceptable matrix printers for £600 or less. The most visually attractive standalone system is undoubtedly Data Recall's Diamond, which has a screen about the size of an LP record in tasteful (EEC-recommended) amber-on-brown. Undoubtedly the best from the communications point of view is ICL's DRS-based Wordskil 8800; inevitably this suffers from the usual ICL disadvantage of not quite existing yet. Wang systems also look very good, but only support IBM protocols. My feet were starting to hurt by the time I found Wordplex and Philips which may be why they seemed comparatively lacklustre. An aggressive Burroughs person demonstrated a large box they call a 'content-addressable file store' and got quite cross when I quibbled that this should imply rather more than a large ISAM file. The best gadget on display (not available yet though) was undoubtedly Sony's typecorder. This fits into the average hard-pressed executive's briefcase and enables him to dictate messages and wordprocess documents on the same minute cassette while flying by Concorde, or presumably waiting for the bus.
After lunch I arrived at the 1851 Census workshop in time to hear Prof. Michael Anderson explain how he had encoded his sample of the 1851 returns for analysis by SPSS. This was one of a series of workshops sponsored by the SSRC Survey Archive in which historians (economic and otherwise) are exposed to the possibilities of using the computer for secondary analysis of large existing data sets. I thought I should attend not only in order to preach the gospel of proper data analysis and recoding by Spitbol, but also to keep in touch with the SSRC Survey Archive, which performs for social science data a function similar to that which the Oxford Archive performs for textual data, only rather better.
With this in mind, I came away with a number of documents which will be reworded to act as publicity material for our own Archive.
More than 50 happy BASIS users attended this first official meeting of the newly constituted UK User Group. After constitutional discussion, elections of officers etc., there was an hour and a half presentation concerning Release L of BASIS, due for beta test in January of 1989. Somewhat to my surprise, this new release promises to rectify lots of the more irritating fundamental shortcomings of the package. For a start, somewhere in Dublin (Ohio) a nettle had been grasped, and the internal format of BASIS files will never be the same again: this means that true proximity matching will be possible in release L (i.e. the index postings will give word position as well as context position within document); it also means that minor irritations such as the horrendous updating process are completely removed, and that the system changes necessary to facilitate real-time update in a future release are already in place.
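For what it is worth, the difference word positions make is easy to see in miniature: if the index records word positions as well as document numbers, 'true' proximity matching becomes a simple arithmetic test on the postings. The sketch below is my own and has nothing to do with BASIS internals; the postings data is invented.

```python
# Sketch of positional postings: each term maps to {document: [word positions]}.
# With word positions stored, "A within n words of B" is an arithmetic test.
postings = {
    "data":       {1: [3, 17], 2: [5]},
    "management": {1: [4], 2: [40]},
}

def near(term_a, term_b, max_distance):
    """Return documents where term_a occurs within max_distance words of term_b."""
    hits = set()
    docs = set(postings.get(term_a, {})) & set(postings.get(term_b, {}))
    for doc in docs:
        for pa in postings[term_a][doc]:
            for pb in postings[term_b][doc]:
                if abs(pa - pb) <= max_distance:
                    hits.add(doc)
    return sorted(hits)

print(near("data", "management", 1))  # [1] : the terms are adjacent in document 1 only
```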
Other encouraging trends are: a commitment to support more "document architecture" type facilities (sections within full text documents; extensions to the current 'hidden string' features to support non-textual objects; support for SGML markup on input; increases on all current limits) and a variety of improvements to the user interfaces to the package. Three levels of interface are identified: at the corporate level, specialist programming will continue to use high-level languages, for which COBOL and FORTRAN preprocessors will be provided in addition to the current "CALL" facilities; at the departmental level, the current facilities will be maintained, expanded somewhat so that for example the VT100-based screen handling available with Long Text is available with all data types; at the individual end-user level a new module called EXPRESS will offer a PC-style interface with somewhat reduced capabilities, using ring menus, full screen working (for VT100s) and integrated query and update of documents. And the really good news is that all of this comes free for existing customers as an incentive to migrate databases from K to L.
After an excellent lunch (during which I met fellow academic users from Imperial and Aston) we were entertained by a man from Wapping describing how News International had eventually decided to build up their own in-house story database, rather than relying on bureau services (too expensive and too slow) or old-fashioned cuttings libraries (too heavily picketed). Two interesting things emerged from this: firstly that it had at first been thought cheaper to have the copy re-keyed in the West Indies than to use the stuff typed in by the journos themselves; secondly that even on The Times it was necessary to add keywords to identify story contents, since headlines were increasingly too jokey and elliptical for the purpose. Other fields added to each story during the hectic 12 hours between its appearance in print and its inclusion in the database include any corrections advised by the legal department; these will both presumably be of greater importance when the Sun is included in the database later next year. I also learned that, new technology or not, post-printout editing (a practice still carried out at Wapping) is known as "cutting on the stone". News International is also keenly aware of the commercial potential of their database and is already considering licensing such spin-offs as an online sports service, law reports on CD-ROM etc.
The remainder of the day was given over to very dull self-congratulatory salesmanspeak, listing all the new customers gained over the year (of which News International was probably the most significant, though Oxford University also featured, I regret to say). When John Townsend began droning on about the importance of addressing the marketplace in terms of the way he drove product, I slunk away. But the news about release L was worth hearing: if only half of it is true (and most of it is "committed") BASIS does have a real competitive edge.
CAFS-ISP represents ICL's first step towards a machine architecture capable of supporting "fifth generation" processors and so its launch was associated with a certain amount of quiet ballyhoo. Speakers included a person from the DoI, the charismatic d'Agapayeff and even R. Wilmott himself. d'Agapayeff held up a shiny metal box the size of a postcard and said that the Japanese were about to flood the market with similar articles which would have the processing capability of a PDP-11 or two and that the computer industry had better watch out. The home computer of the future, he also opined, would be programmed in something like Prolog and you could forget your BBC Basic. Wilmott produced a number of curious statistics about such things as the productivity of workers in the information processing industry and the numbers of terminals per square foot of office space, and drew some rather forced analogies with the Agricultural Revolution before assuring us that ICL were indeed investigating Prolog for internal use. The great thing about CAFS (of course) is that it makes Prolog-type programming feasible on a more than experimental scale, and thus opens the way to all sorts of startling innovations in natural-language processing, text searching and database query languages. The chief drawback of CAFS is that (as a result of Wilmott's otherwise admirably ruthless rationalisation of ICL's product range) it will only ever run on S-series processors. Despite much talk of co-operation with Universities (all of which so far as I know have P-series processors), therefore, our only hopes of getting a toe into this particular hot bath at present seem to be to wait for the X25 interface to CAFS devices promised for the future, unless of course we can get a new processor …
After an excellent lunch, two current CAFS-800 users were wheeled on to report their experiences: one being W.H. Smith, who use it for their 70 Mb mail order file, and the other being Thorn EMI TV Rentals, whose representative said "the CAFS concept is a basic means of getting at the grass roots information", thus testifying at least to the excellence of the lunch. Hamish Carmichael (ICL) then gave more precise details. CAFS-ISP would be a standard product, supplied as a hardware module additional to the standard DCM and connected via standard DCU2s. A reading rate in excess of 1Mb/sec was claimed, while a dual 2988 with its full complement of CAFS drives could access 80+ gigabytes, be they standard IDMS databases, RECMAN files, relational tables, text files or (a CAFS special) "self-identifying format files". This capability arose from the integration of CAFS into the standard IPA jigsaw puzzle, of which only one piece now remains to be revealed.
This was the first national conference on databases ever held. It was jointly organised by the BCS, Aberdeen University, Middlesex Polytechnic and Cambridge University, with money from IBM, and drew an unexpectedly large attendance of nearly 150, with equal numbers of delegates from industry and academia. The standard of presentations (with one or two exceptions) and of discussion was unusually high for such encounters, and I was able to continue discussions with some delegates I had previously met in Cardiff at the rather less high-powered gathering I attended there last week. Another unusual feature of the conference was that all the papers had been submitted in full in advance, and copies of the proceedings were thus available before the papers were given. (My copy is available for consultation.)
Of the 11 papers presented, Deen gives a very useful summary of the state of the art in database research, at least as viewed from Aberdeen - which is probably a pretty good place to view it from; Zahran (LSE) proposes some less than revolutionary methods of extending the capability of Data Dictionary systems; Brown et al (id) give another glimmer of the long-awaited pot of gold which is ICL's RADS - an automatic program generator driven by the Data Dictionary. (I rather enjoyed this because I've never seen Jackson structures actually being used for anything, but the computer scientists in the audience seemed rather restless.) Gray (Aberdeen) continued his promotion of ASTRID, concentrating here on the GROUP-BY operator with which he claims to have completed the relational algebra. This was quite impressive, and it is probable that he will be coming to Oxford to test it on some of our databases in the near future. Tagg (independent consultant) gave a rather dull precis of the BCS's even duller report on query languages - dull because it leaves out all recent developments. Longstaff et al (Open U) described their teaching system, which translates queries from the relational calculus back into a restricted set of English or other natural language so that the user may check whether he has asked what he intended to ask, and also to teach by example the more obscure niceties of the calculus. Crowe et al (Thames Poly) describe a similarly motivated system of infinitely less sophistication. Gray (Cambridge) outlines some of the problems of dealing with imprecise or unknown values in a database, using three-valued logic and lattice theory (a minimal sketch of the three-valued idea follows below); Clauert et al (Cambridge) describe their implementation of a CODD database distributed across several minicomputers (this was amazingly complicated, like most Cambridge procedures, but apparently quite effective). Ho (Hong Kong)'s account of optimal search sequence, i.e. an algorithm for determining how different file organisations should be ranked when evaluating a query which will use several files, was both inscrutable and, when understood, redundant. Finally Martin (Liverpool) presented the results of comparing one dead DBMS (Robot), one stillborn (RDBMS), one adolescent (RAPPORT) and one real one (IDMS). The chief result was that Rapport queries were easier to write than Cobol DML programs … varied immensely, with IDMS usually coming out on top.
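As a minimal sketch of the three-valued approach Gray was describing (my example, not his, and nothing to do with his actual system): treat 'unknown' as a third truth value and define the connectives so that a comparison involving an unknown value yields unknown rather than false.

```python
# Three-valued logic sketch: True, False, and None standing for 'unknown'.
def and3(a, b):
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def or3(a, b):
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False

def eq3(x, y):
    """Comparison against a missing value is unknown, not false."""
    if x is None or y is None:
        return None
    return x == y

# A row whose salary is unrecorded neither satisfies nor fails 'salary > 10000';
# whether it appears in an answer depends on how the query treats 'unknown'.
row = {"name": "SMITH", "salary": None}
print(and3(eq3(row["name"], "SMITH"), None))   # None, i.e. unknown
```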
This SSRC-sponsored workshop brought together a number of experts from the fields of computer science and social science who (not surprisingly perhaps) were mutually rather baffling. The general tone of the encounter was set by Dr Peter Stocker (East Anglia) for the computer scientists and Stephen Clark (SSRC Survey Archive) for the social scientists respectively. Clark evaluated software which existed (the S.I.R. front-end to SPSS) and did nearly everything anyone might want nearly all of the time, while Stocker described software which had existed but failed to please and software that didn't exist yet but would satisfy just about everyone. Stocker also explained how good computers were not only as a source of entertainment but also as a source of income. The reaction to this topical jest (much of the conversation in the bar hinged on comparisons of letters from the UGC) having subsided, he proceeded to dole out some rather antique platitudes on the subject of database systems, relational models, size and scope thereof etc. before finally touching on his latest fund-raising exercise, a distributed database system. This currently takes the form of lengthy arguments about some way of interfacing various relational query processors to a common target, and bears a strong resemblance to ICL's Data Dictionary philosophy. John Welford (ERCC) was supposed to be speaking about an application of IDMS to a complex record-linkage (i.e. parish record) problem, but was rapidly side-tracked into a rather empty discussion of the usability or otherwise of Cobol, which managed to raise the hackles of both camps in his audience, the computer scientists by its matter and the social ones by its manner. The undisputed star of the workshop was chubby Dr Tim King from Cambridge, whose infamously large and complex database of parish registers we hope to acquire for the Archive before the whole project gets the chop this August. His database is accessed by means of his own home-made relational system CODD and its query language CHIPS (hem hem); this software is written entirely in BCPL, and King and Jardine were evidently a touch miffed that the SSRC had refused ipso facto to support the package; several social scientists were also rather disappointed at this, since his was the only system described at the workshop which could be said to generate much enthusiasm. (This enthusiasm was however later dispelled by a remarkably inept presentation by another user of CODD, Ryan Kemp from UCL, whose talk was barely visible, scarcely literate and utterly devoid of intellectual content.) A more competent speaker, Ron Cuff from Essex, now c/o IBM, speculated about future directions in query languages, spouting out acronyms like a burst water main but also several useful references. Expert systems and natural language query processors are very nearly with us (the best of the latter was called ROBOT in the States but is now known as OLE (On Line English), presumably to avoid confusion with Liverpool's antiquated DBMS rather than to appeal to the Spanish market). The narrowing of the gap between A.I. systems and query processors continues to lead to flourishing hybrids (and mutual mystifications). What the social scientists made of all this I cannot imagine; probably rather more however than they did of the next speaker, Dr Peter Gray from Aberdeen, whose ASTRID system is an automatic program generator for accessing a Codasyl database by means of the relational algebra.
The same system was being developed to interface to Zloof's Query-by-Example, which Gray felt typified the query processor of the future. The social scientists finally managed to get in a word when Dr N Gilbert (Surrey) described in heart-rending detail the extremes to which they had to go in order to translate the inescapably hierarchically structured datasets derived from the General Household Survey into flat SPSS-processable files. This (to my mind utterly futile) task was being carried out in collaboration with the Survey Archive at Essex; inevitably some linkages are lost, and others ('as many as possible') are then painfully reinstated by hand. In later conversation with Dr Gilbert and the Archive's representatives I argued the case for a database solution to this problem, which may have some effect.
The Welsh team, in the person of one Fiddian, gave a paper on their current fund-raising exercise, which has the ambitious name of SENSE; quite what else it has remains to be seen. It is apparently something like CONSISTENT on Multics, or Software Tools on UNIX, or (dare I say it) VME/B. Quite why the government should be funding the development of a machine-independent operating system is rather beyond me, but then I may have missed something crucial in Mr Fiddian's extraordinarily soporific talk. The third day of the conference began with Mr Kemp, on whom I have already trampled, and continued with Dr Ron Stamper (LSE), who seemed by comparison to come from a different planet. He described the LEGOL project, an ambitious scheme somewhere between A.I. and data modelling. Its aim is to provide a knowledge base representing (currently) the DHSS bureaucracy, in terms of the law the Department is supposed to embody rather than the clerical procedures (which would form the basis of a conventional computerisation). This was fascinating, and LEGOL does actually work: there is a POP2 interpreter for it at Edinburgh. However I cannot imagine that the social scientists received much from this talk apart from the pleasing sensation of having their brains gently squeezed through a fine sieve. Finally, however, Stephen Tagg (Strathclyde) came trailing clouds of glory to assert the claims of SPSS as a data management system (sic). This he actually managed to do remarkably well, with no dodging of unpleasant truths, no vague platitudes and a considerable display of knowing How To Get Things Done within the constraints of the real world.
Finally despite the flippant tone of this report, I should say how very useful it was to be able to meet the people behind the projects of which vague rumour has been reaching me for some time (e.g. King & Jardine, Stamper, Tannenbaum) and actually discuss matters of mutual interest and concern.
British Telecom gave an account of the query processing system they have set up to interrogate IDMS databases. It is an interpretive processor using non-standard terminology and obtaining its data through the back door (i.e. a copy of the schema source) rather than from the IDMS directory; functionally it offers little if anything that Data Display 250 does not, and in some respects (output formatting etc.) is markedly inferior. On the other hand it cost BT one man-year as opposed to whatever outlandish sum DD costs.
Tom Wansborough (ICL) gave a good straightforward account of the IDMS patch-up utilities Restructure and Reorganise, both of which we have used here, the former less successfully than the latter. He also gave some useful tips on how to simplify restructuring when it was needed and explained patiently several times why PRIOR pointers should always be used in volatile sets. Some figures on the performance of Restructure were given - for "small" databases (a mere 200 Meg or so), a restructure time of 20-30 hours seemed to be acceptable. I was too dumbfounded, at hearing of installations where an uninterrupted run of 20-30 hours was feasible, let alone acceptable, to comment.
After lunch, Richard Barker (ICL) gave a presentation on the facilities which might be expected in IDMS 300. Considerable improvements in performance had been implemented for the commonest of DML verbs, and also great improvements in security. It would be possible to run IDMS and its associated components at ACR 10, and access to a database except via IDMS would be impossible. Various minor user-requested enhancements (one of them mine!) would be included. The most significant change however is the introduction of separate storage schemas; in particular, the ability to default everything but the logical description of data items in the schema will reduce the design-to-use time by an order of magnitude in our environment.
I raised three major problems we are currently having or anticipate: one proved insoluble, one unpopular, and one will go in as an enhancement request.
ICL had arranged a hands-on demonstration of the new Personal Data System (a relational system developed at Southampton) largely for our benefit; attendees were: R.Stratton, L.Burnard, S.Hockey (OUCS), S.George (QMC), C.Massey (SWURCC) and D.Doulton (Southampton).
There was a preliminary briefing session on PDS concepts - tables, fields, etc. The system manages to give a full relational capability without the jargon. It is used entirely interactively. The user defines one or more tables, which have names and fixed attributes. Each attribute is defined by name and type (possible types are integer, decimal, character string, date and century date). Checking rules may be specified to validate data further on input. At present all input is interactive, but plans for bulk loading of data were murmured about.
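As a rough sketch of the ideas involved (in no way PDS's actual command syntax, which is interactive and looks nothing like this; the table, attributes and rule below are invented): a named table with typed attributes, and checking rules applied on input.

```python
# Sketch of the PDS ideas only: named tables, typed attributes, and
# checking rules that validate data further on input.
from datetime import date

class Table:
    def __init__(self, name, attributes, checks=None):
        self.name = name
        self.attributes = attributes      # {attribute name: expected type}
        self.checks = checks or []        # extra validation rules
        self.rows = []

    def insert(self, row):
        for attr, expected in self.attributes.items():
            if not isinstance(row.get(attr), expected):
                raise ValueError(f"{attr} must be of type {expected.__name__}")
        for check in self.checks:
            if not check(row):
                raise ValueError("checking rule failed")
        self.rows.append(row)

    def select(self, condition):
        return [r for r in self.rows if condition(r)]

# Hypothetical table: character-string, integer and date attributes,
# plus a checking rule restricting grades to the range 1-10.
staff = Table("STAFF",
              {"NAME": str, "GRADE": int, "JOINED": date},
              checks=[lambda r: 1 <= r["GRADE"] <= 10])

staff.insert({"NAME": "JONES", "GRADE": 4, "JOINED": date(1980, 1, 7)})
print(staff.select(lambda r: r["GRADE"] > 3))
```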
The command syntax is quite simple to use and, after some initial problems with the computer (a 2903 pretending to be a 1902), three of us were able to get a reasonable impression of the system in operation. It seemed flexible and intelligible to the naive user in most important respects. We were all able to create and update tables without difficulty, and to make some simple enquiries. There was of course no way to investigate the performance of the machine when more heavily loaded.
There was some discussion of facilities we felt to be lacking: chiefly more sophistication in its handling of character sets, more flexible sorting options, more sophisticated conditionals and a macro facility closer to that of Data Display. Compatibility with IDMS was also mentioned as an important requirement. ICL are not of course in a position to give any firm dates, or even weak ones, but around the end of this year it might be worth complaining about the product's non-availability on VME/B again. Meanwhile, there is an excellent manual, and the product is running on the 2903 and ME29.
This two day meeting began with reports on recent activities in the field of literary and linguistic computing from the various ALLC Representatives around the world: about ten different countries (all European except for the ubiquitous J.Joyce (USA-East)) were represented in person, while nearly 40 written reports had been submitted. The report of the ALLC Working Party on networking and databases was also tabled; it includes a recommendation that a database of information about machine-readable texts and software for processing them should be established along the lines discussed at an earlier meeting of the working party in London. Both at this meeting and at the ALLC's subsequent AGM I briefly described the proposed format of this database and requested information for it. Initial validation of the designed system will be carried out in early 1981, by using it to keep track of issues of OCP. It will then be expanded to include issues of texts from the Oxford Archive. Data collection from other sources will continue in parallel, in collaboration with University College Swansea, using the UMRCC filestore as a staging post.
There were two formal sessions of invited papers, one on machine translation and the other on computer-aided lexicography. The MT session was opened by a speaker from the EEC with the disarming statement that SYSTRAN (the product in which 5 million units of account - about $20m - will be invested over the next 5 years on your behalf and mine) is obsolete from both the linguistic and the computational standpoints. Nevertheless (as is often said) it WORKS, for translation of scientific and technical literature only, for certain host/target pairs of languages only, and with no post-editing. An experimental batch system for translation of scientific abstracts will be made available via EURONET within the coming year. Meanwhile research towards a European replacement for Systran (EUROTRA) would continue, though funding was not yet committed. Subsequent speakers summarised the main features of the Eurotra project. Keill (UMIST) demonstrated that Systran (which is entirely written in uncommented IBM Assembler) could not easily be enhanced except by adding greatly to the complexity of the existing dictionaries. Eurotra by contrast would be a modular system of great flexibility, deriving from a dynamic 'strategic' component which interfaces its (fairly traditional) parsing algorithms with its equally traditional static dictionaries. Eurotra however has no 'real world knowledge' built into it and is light years away from an 'understanding' system. Maegaard and colleagues (Copenhagen) described Eurotra's interface structure, which is essentially a simple dependency tree the nodes of which are labelled for four levels of analysis (morpho-syntactic, syntactic function, logico-semantic and semantic-relational); a sketch of the idea follows below. Ambiguity is more easily resolved when these four levels of linguistic description are available simultaneously, while the loss of word order inherent in a tree representation actually (it is claimed) aids translation. There is of course nothing particularly novel in this formalism, and its limitations (chiefly the absence of a knowledge component) have been known for many years. M. King (Cambridge)'s description of the so-called Semantic Component in Eurotra was clearly aimed at a non-specialist audience and did little more than exemplify some of these limitations. Nevertheless, for a system designed to deal only with technical writing, Eurotra seems a great advance on Systran, if only because it has a clear underlying linguistic model, the inadequacies of which are clearly defined and understood. It is in no sense an experimental design, and appears to have learned very little from even comparatively recent advances in AI.
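To give a flavour of the interface structure as I understood it from the talk (my own reconstruction, not the Eurotra formalism itself; the example sentence and all labels are invented): each node of the dependency tree carries labels at all four levels simultaneously, which is what allows ambiguity to be resolved across levels.

```python
# Sketch (not the Eurotra formalism) of a dependency-tree node labelled
# simultaneously at four levels of analysis.
class Node:
    def __init__(self, word, morpho, function, logico, semantic, children=()):
        self.word = word
        self.labels = {
            "morpho-syntactic":    morpho,    # e.g. noun, singular
            "syntactic-function":  function,  # e.g. subject, object
            "logico-semantic":     logico,    # e.g. agent, patient
            "semantic-relational": semantic,  # e.g. human, artefact
        }
        self.children = list(children)

# "The engineer repairs the pump" as a much simplified, invented tree:
tree = Node("repairs", "verb:present", "governor", "action", "process", children=[
    Node("engineer", "noun:singular", "subject", "agent", "human"),
    Node("pump",     "noun:singular", "object",  "patient", "artefact"),
])

def show(node, depth=0):
    # Note that word order is not represented: only the tree and its labels.
    print("  " * depth + node.word, node.labels)
    for child in node.children:
        show(child, depth + 1)

show(tree)
```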
For no very good reason, delegates were then treated to a short presentation of ADA by one C.Gram (Copenhagen). The main features of this latest attempt to emulate the perfection of Algol68 were however clearly of little interest to the majority of the somewhat bemused audience and of no interest at all to anyone who has read an article in Computing.
The next day's session was opened by W.Kartin (Liege) at his most magisterial with a panoramic survey of the various activities that might be described as computational lexicography. As is often the case with such surveys, I gleaned little from it save the classic assertion that "AI people are concerned with how to understand language, i.e. how not to misunderstand language". The level of computational expertise involved is best illustrated by a lengthy discussion of how the text of dictionary entries could be sorted on secondary fields in order to produce lists of synonyms or 'pseudo-synonyms'. Zettersten (Copenhagen) gave an informative account of the newly-revived Dictionary of Early Modern English Pronunciation Project (DEMEP). Dictionary slips indicating pronunciation are being gathered in a fairly conventional form by teams of scholars based at Stockholm, Bergen, Berlin and Aachen from hundreds of early printed sources to produce what will (if it comes to fruition) be an invaluable reference work on the development of English pronunciation over the period 1500-1800. The project is linguistically unfashionable and computationally unexciting; nevertheless it has the far from insignificant merit of a well-defined and worthwhile goal.
Winfried Lenders (Bonn) gave a workmanlike description of the 6 or 7 computer-held lexicographic databases available in Germany (where such activities are now co-ordinated by a government agency, of course). His account was informative rather than analytic. As might have been expected there was little common ground amongst the materials described, since they had all been tailor-made for different purposes. The problems of integrating such disparate sources (which seemed the obvious next step) were only touched upon. Finally Marie Bonner (Saarbrucken) described the six-year-old Old Icelandic Dictionary project, in which half a million words of legal texts are to be the basis of the first ever lemmatised word index of Old Icelandic. The project had evidently benefited from a clear linguistic analysis of the process of lemmatisation (particularly difficult when dealing with old manuscript material); but no evidence was given of any comparably sophisticated computational analysis during the development of the project.
The keynote of these proceedings was ICL Hospitality, a curious commodity which was supplied in various forms and large quantities at regular intervals. Forms in which ICL Hospitality is supplied include psychedelic slide shows, online demonstrations and outbursts of fervent patriotism, but it usually comes in a bottle. The patriotism was largely concentrated in my first encounter with Dr Peter Aylett who, perhaps feeling rather tired and emotional after paying for so much ICL Hospitality, embraced me warmly before the proceedings began and then tore my ear off for having mentioned some disquiet with the current serviceability of our 2980. This to the point of alarming an innocent bystander from RACAL.
The next day's presentations might be divided into three categories: polemic (Aylett, Ellis), apologetic (the three user presentations) and summary (the rest). Of the user presentations (as one delegate rather sharply pointed out), one only used data management techniques because of deficiencies in VME/B (Murphy), one had a user population most of whom had no data to manage (Burnard) and the other had written their own DBMS and was now regretting it (Philbrick). Murphy's account of the problems of managing the literally thousands of megabytes of data pouring from the monitoring devices used in fusion experiments at Culham had a familiar ring to it. IDMSX is used to keep track of their file usage and archiving facilities, while they plan to use STATUS to analyse the free-format documentation about the various experiments, which accumulates at roughly the same rate as the data. Apart from Liverpool, Oxford was the only university site represented; my own presentation therefore concentrated on the peculiarities of the University Computing Environment, concerning which I waxed so eloquent that I had just about reached the topic of data management when the chairman started waving blue cards at me. Sun Life Assurance's experience of having written their own DBMS confirmed the popular wisdom - it had taken them five man-years or so to replace an existing conventional system, and the replacement was already inadequate.
ICL speakers on this first day were perhaps forgivably rather desultory in their presentations of DDS, TPMS, IDMS, PDS and the wholesome nature of data management. Harry Ellis explained how proper use of the products would keep our respective companies' names out of the papers; Richard Barker rushed through the recovery and security aspects of IDMS; Yvette Ascher managed to summarise the facilities of DDS, TPMS and IDMS in a record-breaking 45 minutes and Jackie Sansom did her best to make sense of PDS without once mentioning the word 'relational'. Apart from PDS itself (concerning which I made a lot of interested noises in what I hope were the right places), the chief new departure I noted was a greater stress on the integration of the products, typified by the merging of TP with IDMS now known as TPMS, and in the new acrobuzz IPA. This has nothing to do with pale ale but is short for INFORMATION PROCESSING ARCHITECTURE. Unlike most machine architectures, this is really a question of how you feel about what you're doing. If you see a computer system dynamically as information flowing between nodes (e.g. terminals, databases, processors) of comparatively less intrinsic interest, then you are into IPA; if however you see it just as a machine you are trying to kick into life to do something for you (and never mind the rest of the world), then you are probably not. The definition is however mine and may well be quite wrong.
In the evening ICL Hospitality took the shape of an excellent dinner (starring beef Wellington) followed by speeches of increasing incoherence and a general rush for the bar. I apologised to various ICL speakers for having run out of time before getting round to saying just how marvellous the 2% of our users who use IDMS think it is. Large amounts of malt whisky fuelled further discussions with assorted ICL directors and Jim Alty long into the night, the exact nature of which eludes me; data management was not excluded from them, but the younger generation, Liverpool football club and L. Van Beethoven also made guest appearances.
Next morning, at 9 sharp, in body if not spirit, we reconvened for the unacceptable face of ICL Hospitality in the form of a half-hour-long slideshow, complete with music, special effects and a commentator who would have been more persuasive selling soap. It resembled nothing so much as an animated glossy brochure from which, had I been more awake, a fine bag of trophies for Pseuds Corner might have been gathered. Its gist was that ICL Computers (Information Processors, sorry) will save you money, increase productivity and make you irresistible to the opposite sex. The company's latest offerings (2956 and 2966) slotted sharply into some unexplained scale of competitiveness next to IBM's most recent offering, and on them (after DME and VME) you could now have CME, which could be both, or either. A delegate later proposed the Friendly User Machine Environment (FUME) as the obvious next development.
Fortunately the remaining ICL presentations were of quite a different nature. Hamish Carmichael gave an impressively thorough and technically detailed account of CAFS which, by moving intelligence from the main frame to a file access component (a 'back end processor'), allows quite staggering improvements in throughput when large volumes of loosely organised data are to be searched. ICL's own personnel records are now held on a CAFS-based system as a slightly modified IMS database; the heretical implication that IMS storage structures could actually be improved upon was neither stressed nor shirked. Bob Gifford's presentation of ICL's experimentation with Viewdata was also quite impressive, if only because it explained just what is going on during those odd pauses when one phones SMC about a bug. At present Viewdata access to the Known Errors Database is only available "in house" or to some selected foreign sites; several delegates were sufficiently impressed by its speed to request that it be provided on a Public viewdata service. Finally, Jim Alty gave a stirring summary of why data management was a Good Thing and laid into various popular fictions about it with enviable assurance. He also asserted that Liverpool currently supported 26 database applications, but later reduced the figure somewhat.
Apart from confirmation of every suspicion that OUCS is a very untypical VME/B user and from an intriguing insight into how the rest of the world does things, it cannot honestly be said that my attendance at this gathering was of much direct benefit to the Service. It was however almost entirely funded by ICL (gratitude for which is hereby expressed) and something might just come of my noises about PDS. And maybe universities have a duty to remind the rest of the world about what they are up to while it is busy making money, saluting the flag and making programmers redundant.
I visited the RNIB Computing Centre in London at the invitation of their chief systems analyst to discuss the possibility of co-operating in exchanges of machine-readable text. The RNIB now produce all new Braille titles on their GEC-based computer system. About half a dozen people are employed typing into terminals in the normal way, but they have also been using a KDEM (indistinguishable from ours) for the last six months. One copy of the text is pre-edited (i.e. marked up for homonyms, paragraph markers etc.), while another is scanned by the KDEM. Quite a lot of pre-editing is done directly on the KDEM as the text is scanned. The resulting text is then processed by a program developed at Warwick called DOTSYS. This outputs the text in a format which can be displayed as braille on a specially built terminal for further editing and is ultimately used to produce the zinc plates from which braille books are still embossed. The KDEM, I was told, worked far too fast for the DOTSYS processor to keep up, though it greeted my arrival by going into 'Panic' mode amidst familiar cries of "Where's Paul?"; judging by the amount of error correction going on at the braille terminal, DOTSYS has shortcomings additional to its slowness. Of course the KDEM is only used on modern printed English texts (mostly pulp romances etc.).
We discussed copyright problems and agreed to exchange lists and other information; particular interest was expressed in database systems - they were considering Rapport - and in the possibility of using publisher's tapes as the basis of a new Braille dictionary.
Klett is one of Germany's leading publishers, particularly of school text books and dictionaries. I was invited to visit them to discuss methods of analysing machine-readable dictionaries with their computing consultant, G.Lieser. Klett had previously obtained a copy of the Hornby OALDCE transcript from the Archive, and much of my time in Stuttgart was spent discussing the problems of 'parsing' entries in this dictionary by computer, in order to extract all the subfields while still retaining the ability to reconstitute the original entry. I was able to offer Klett a copy of the new Collins dictionary (which poses similar but different problems) in return for their hospitality, and also for a copy of the so-called WEBMARC format transcript of Webster's 7th - a dictionary we have been trying with little success to obtain from the publishers for two or three years. Having defined a suitable data structure for dictionary entries, Klett's objective is to merge entries from different dictionaries in order to build up an authoritative lexicographical database against which new bilingual dictionaries might be validated.
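The parsing problem can be stated quite simply, as the toy sketch below tries to show. The entry format is invented and far tidier than anything in the OALDCE transcript itself: split an entry into labelled subfields in such a way that putting the subfields back together gives back exactly the original entry.

```python
# Toy sketch of 'parsing' a dictionary entry into subfields while retaining
# the ability to reconstitute the original exactly. The entry format here is
# invented; real OALDCE entries are far messier.
import re

# headword, then part of speech in square brackets, then the sense(s)
ENTRY_PATTERN = re.compile(r"^(?P<headword>\S+) (?P<pos>\[[^\]]+\]) (?P<senses>.+)$")

def parse(entry):
    m = ENTRY_PATTERN.match(entry)
    if not m:
        raise ValueError("unparsed entry: " + entry)
    return m.groupdict()

def reconstitute(fields):
    return "{headword} {pos} {senses}".format(**fields)

entry = "pump [n] 1 machine for raising water. 2 light shoe."
fields = parse(entry)
print(fields["headword"], "|", fields["pos"])
assert reconstitute(fields) == entry   # the round trip preserves the original
```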
Klett is also active in the field of computer typesetting; their equipment (a Digiset) is driven by the German-written DOSY system, which appears to offer a (fairly) user-friendly interface similar to that proposed for the Lasercomp. One difference is that the Digiset allows point size to be selected independently (within certain ranges) of font, so that the number of fonts stored is smaller. Using this equipment and special software written by Lieser, Klett have produced very high quality Arabic output. Their mainframe is an IBM 370, running under DOS (?), which has very little flexibility (I was told that only VSAM files were supported), but they also have a DELTA system supporting about 50 interactive terminals, and an antiquated Philips machine doing the payroll. They also have an under-utilised OCR which reads only OCR typefaces.
I also spent a day at nearby Tubingen where I was able to visit Dr W.Ott, who provides at the Computing Service of the University of Tubingen a service very similar to that provided by Susan here at Oxford. Two things particularly impressed me at Tubingen: the first was the recently published concordance to the Vulgate. This was produced entirely at the Computing Centre, using Ott's programs to generate the concordance and format it for direct input to Klett's Digiset. The resulting plates were sold at commercial rates to the publishers by a holding company called Pagina (which also publishes the new journal Data Verarbeitung). The concordance is being sold at an astounding DM700 for each of its five vols; Ott told me he estimated the production cost of each volume at 30 marks. The second impressive thing I saw at Tubingen was a 17-year-old CDC machine (due for retirement next year) which takes up about as much space as the average German tram and is full of spaghetti, but still supports the computing requirement of one of Germany's larger universities, with over 95% serviceability.
This working party was set up by the ALLC Committee about a year ago to investigate the feasibility of distributing text processing applications amongst a number of sites internationally, evaluating the effectiveness of existing comms networks, and to prepare an international directory of resources such as machine-readable texts and programs. Representatives from the U.S.A., Italy, Germany and the U.K. form its membership. I was co-opted to the Working Party at the Cambridge ALLC Symposium, and, together with a programmer from Swansea, presented a draft report on the structure of a database to hold information about machine-readable texts at this meeting. The proposed directory would include information about characteristics, availability and applications of existing machine-readable texts everywhere; as a first step it will enable the contents of the Oxford Archive to be catalogued more effectively than at present. The structure proposed, five interlinked Famulus files, has the advantage that it can be easily implemented at Swansea, and subsequently interfaced to an IDMS database at Oxford for general access. The directory would hopefully be updated on-line, and might be published using the Lasercomp, or on fiche. Swansea have agreed to provide labour, and are investigating the possibility of getting further funding for the project. More details in the minutes, when produced.
[hand annotation: NB Only Joan showed up!! (Tonino was delayed; Jim couldn't make it; John mysteriously absent; Winfried too busy.) Hey ho.]
ICL gave a presentation of the next release (DD250) of Data Display, which is due out this summer. This product now presents a very user-friendly interface to an IDMS database. There are a few new features, in particular the ability to constrain selection by boolean expressions of arbitrary complexity and to match using dummy characters - this they somewhat grandiosely call "fuzzy matching", and a rough sketch of the idea appears below - and an extensive 'HELP' command; but most of the changes from DD200 are essentially mopping up and extending what was half done already. The full facilities now available for user-defined macros in particular give DD considerable power, and it should prove very useful in simplifying use of IDMS databases by 'non-programmers'. ICL were also due to give a presentation on the probable contents of IDMS 300 (the next major release of the base product) but were unable to do so because of internal bickerings. I discovered that there might be problems making any new facilities in the base product automatically available from Fortran, and complained accordingly, hopefully to some effect. Over a lengthy liquid lunch I discussed my ideas for an Algol68 interface with delegates from ERCC and QMC, among other things. After lunch the group discussed ICL's response to the shortfalls paper, which was by and large satisfactory, if non-committal.
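As a rough illustration of what the "fuzzy matching" amounts to (my own sketch, not Data Display syntax; the records and field names are invented): dummy characters are ordinary wildcards, and the boolean constraint is simply a predicate combining several such tests.

```python
# Sketch only: 'dummy character' matching is ordinary wildcard matching, and
# the arbitrarily complex boolean constraint is a combination of such tests.
from fnmatch import fnmatch

records = [
    {"SURNAME": "BURNARD", "DEPT": "OUCS"},
    {"SURNAME": "BARNARD", "DEPT": "MOD"},
    {"SURNAME": "BERNARD", "DEPT": "OUCS"},
]

# '?' stands for any single character, '*' for any run of characters.
def constraint(rec):
    return fnmatch(rec["SURNAME"], "B?RNARD") and rec["DEPT"] == "OUCS"

print([r["SURNAME"] for r in records if constraint(r)])  # ['BURNARD', 'BERNARD']
```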
The general quality of work reported on at this symposium was perhaps slightly more technical and strictly 'computational' than at some previous ones in the series. There were noteworthy papers in the usual areas of lexicography, prosody, bibliography, concordancing, attribution studies, literary statistics and so forth, but also some sessions on database techniques, natural language processing and - whisper who dares - machine translation. The latter was the subject of a paper by Professor Liu Yong-quan from the Chinese Academy of Social Sciences in Peking, where machine translation has been something of a necessity and appears to work (into if not out of Chinese) despite the limitations of available hardware. Jitze Couperus (CDC & Codasyl) gave what was intended to be a keynote address on the concurrence of linguistics and computing in current database trends. Semiotics, he averred, was more fashionable than cybernetics. No-one disagreed, but (as a subsequent two-hour discussion session demonstrated) this may have been because few understood. This 'Open Forum' was probably the most disappointing event of the Conference - the understanding gap between database specialists and computational humanists shows no signs of shrinking.
As an indication of the range of subjects and representatives at the conference, I shall just mention a few papers which interested me. Among impressive new textual projects reported on were the collation of the six editions of Burton's Anatomy of Melancholy (Faulkner, Washington State); the indexing of two early 17th century German newspapers by topic (Ries, Cambridge); the production of a lemmatised concordance to Ibsen (Hofland, Bergen) and the problems of concording the textually complex 1616 folio of Ben Jonson (Howard-Hill, South Carolina). There was of course a paper on OCP (Hockey, Oxford) and another on a very quick and very dirty indexing program called CODOC (Niblett, Swansea). There was little else on software of note, though there was much informal praise for SPIRES, UNIX and other unattainable goodies. The most impressive hardware on view was that attached to the Chinese Languages Transposition Project (Nancarrow, Cambridge), which hooks a Tektronix up to a rotating cylinder for transput of any of several thousand characters in Chinese, Tibetan etc. Among more technical papers, Cercone (British Columbia) surveyed current methods of storing lexicons for natural language applications and Skolnik (Amsterdam) gave a good account of storage mechanisms well suited to them. Another new statistical measure of lexical diversity was proposed by Delcourt, Mathonet and Mersh (Liege), and the poetic style of W.B. Yeats resisted all attempts to analyse its variations with EYEBALL (Jaynes, Minnesota). Not so Dostoevski, who has now come under the searching eyes of Geir Kjetsaa (Oslo) and his attribution algorithms.
Abstracts of all the papers are available from me. Apart from the papers I can report a great deal of interest in OCP and a great deal of information about machine readable texts scattered around the globe. I was co-opted to an international working party which aims to catalogue all such materials. Finally there was an Elizabethan Feast, complete with madrigals, which almost (but not quite) made up for the appalling standards of the Newnham cuisine.
Security at Bureau West was 'Condition Black' for this meeting, so I am allowed to divulge the following: there were several new attendees from organisations as disparate as Western Gas and RACAL Management Services; ICL gave a long and minutely detailed presentation on techniques of IDMS sizing, including details of useful publications available only from project teams; no response was available from ICL on the subgroup's report (presented at the last meeting), but one was promised for the next. In a more informal context, I received advance documentation of a new 'user friendly' relational system from Edinburgh (called ENUF), good advice on making one of the current IDMS applications here more secure, and interesting gossip on what Oxfordshire County Council are up to (implementing an IDMS-based Local Authority Financial Information System jointly with Doncaster and ICL).
L. Burnard
This meeting was essentially a mopping-up operation. All twelve papers (bar one or two) are in their last or last-but-one draft. The chief area as yet unresolved is that of statistics: some discussion on how population stats and performance-monitoring stats could be collected and used to improve performance and diagnose needs for restructuring led to some firm suggestions.
An interesting document was circulated on the subject of security of IDMS databases: it proposed that (after 5X32) IDMS inline code should be made to run at ACR9, thus preventing unauthorised access completely.
I have two sections of the report to revise; at the next meeting (December) all the sections will be collected and collated. ICL are to give a presentation on the next generation of IDMS (200); it should be the same as the one they recently tried to charge £250 for, but this one is free.
This group continues the task of reviewing the entire range of IDMS-centred software on 2900 systems, aiming to prepare a report by the end of the year. Each product is briefly described, any problems in its operability/performance/usability outlined and a list of required enhancements drawn up, usually after much discussion within the group. The group's report will be ready in time to affect ICL's plans for IDMS 300. The main bodies represented on the group are: BBC, BNOC, BOG, CAA, MOD, Post Office, WMG, RACAL. Universities are unusually heavily represented on this subgroup: in addition to myself, the secretary is from QMC and there is usually a representative from ERCC.
At this meeting, reports were presented on the IDMS base product, Pfix, Database Load, Data Display, Recovery, IDMSX facilities and the Fortran Interface. I also got the opportunity to report a new bug in the Fortran interface and to pick up news of a new facility in Data Display.
About 300 people from over 70 different British Universities and Polytechnics attended this year's IUCC Symposium at the University of Warwick. OUCS was represented by FAS (this year's chairman), LH, SL and LB (who contributed a paper on IDMS). As at most such gatherings, there was ample opportunity to exchange important gossip, horror stories and absurd boasts with colleagues from other computing centres, quite apart from the scheduled sessions. There were five of these, on micro-computers, distributed and large scale computing, databases and graphics. Each session consisted of a keynote address by an invited speaker, and four or five shorter presentations of varying quality. Of the invited speakers, Dr Jim Alty (Liverpool) was probably the most impressive. He described in some detail the range and quality of support offered to micro users at Liverpool. Liverpool run quick practical courses for Intel 8080, 8085, Motorola, TI 9900 and the ubiquitous PET machines, programming in BASIC and PASCAL. Applications include data capture (often in inhospitable regions such as the Sahara), control systems, communications and data processing. Dr Alty mentioned in passing that only one tenth of the £2m allocated to the DES for micro-supported research was used last year. Other papers in this session included one by D.Holdsworth (Leeds) describing a new Pascal interpreter, which occupies 1K on the 6800, and one by S.Jameson (Aston) on the programming of an intelligent terminal to interface with a standard commercial (i.e. unintelligent) information retrieval system. The session on distributed computing was opened by C. Whitby-Strevens (INMOS) who, quoting extensively from Jensen, urged us to forget everything we already knew about computing, stressed that conventional algorithmic languages (except maybe ADA) and souped-up sequential algorithms were more hindrance than help, and generally annoyed everyone. Of the other speakers in this session only J.C.Boarder (Oxford Poly) appeared to agree; he described a language, LZ, with some novel parallel features based on work by Dijkstra. The session on large scale computing was opened by Dr G.R.Field (UMRCC) who, after a brief political history of university computing, asked several questions (are 60 micros in a ring worth 2/3 of a CRAY-1? how many Atlases do you need to lay end to end in order to make a Cyber 203? is it a meaningful exercise anyway? and above all, is it worth the money?) but provided no answers. In the same session, P.Gray (European Centre for Medium Range Weather Forecasts) gave an interesting account of experience with the CRAY-1. Mean time between faults is, it was revealed, over 60 hours, which happens to be crucial for the Weathermen, one of whose applications runs for over 6 hours. To some extent confirming Whitby-Strevens, he stressed that to get the best out of the machine some re-design of the algorithms employed was necessary - doing more arithmetic could actually reduce processor time. The database session was opened by Dr M.Atkinson (Aberdeen), substituting at very short notice for a speaker from Grenoble, with the heretical propositions that 'databases are for programmers' and that a fully generalised Database Management System was impossible. His main interest was in Computer Aided Design systems, for which existing DBMS are undeniably cumbersome.
He described a system currently being designed which included a component called the Data Curator to control data security and integrity and handle transmission of it to and from various CAD micros on a ring. Other speakers at this session included H.Robinson (Hatfield Poly) on data modelling techniques; M.Newton and A.Gawronski (Open) on a new relational system to be used in the new (& highly recommended) OU database course; and L.Burnard (Oxford) who, asserting that IDMS is after all only a package, attempted to describe how to set up a database in 10 minutes and was suitably penalised by being cut off in mid-example. The session on graphics was opened by K.Brodlie (Leicester) with a comprehensive survey of the various systems available. He pointed out that standard general-purpose software tended inevitably to lag behind the capabilities of new hardware, and also gave a flavour of the new NAG graphics chapter promised for Mark 8. This final session also included another screening of a rather fine piece of computer animation made using GHOST in 1965, and accounts of two interesting applications. One (Owen & Earnshaw, Leeds) uses an interactive graphic display as a means of rapidly inputting musical scores in any of the various internal formats commonly used. These can then be displayed or printed on conventional staves. The other (Best, Aston) uses a composite elliptical graph to display the interaction of eight pathological parameters as an aid in diagnosis. The system is used experimentally at the patient's bedside. Full details and abstracts of the papers given at the event are available on request.
This subgroup was reformed in May to take over outstanding work from earlier groups of the old IDMS User Group, and to investigate the facilities currently available with 2900 IDMS/X, to report on shortcomings and to press ICL for enhancements. Organisations represented include BBC, BR, BNOC, MOD, Racal, W.M. Gas, Sainsburys, Post Office, QMC and various local authorities. It is also a good place to trade experience, horror stories &c, and a pressure group to which ICL is unusually responsive. The only IDMS facility not included in the group's terms of reference is the Fortran Interface, largely because the FISS subgroup got going before this group was formed. I proposed to disband the FISS group (current membership 3½) not only for the obvious reasons of convenience, but also to ensure that the Fortran and Cobol DML facilities would not diverge any further than they already show signs of doing. Subject to the agreement of the other 2½ members of the FISS subgroup, the IDMS group agreed to include the Fortran Interface in its list of concerns. About ten reports on various IDMS facilities and products are to be tabled, about half of which were discussed at this meeting. I have copies of papers presented, which include: descriptions of the facilities of IDMSX, Restructure, Reorganise, TP, Recovery, Statistics; possible areas for improvement, etc. Some of these are due to be re-drafted; papers yet to be presented include accounts of FAME, Data Display, Database Load and Integrity Checker. BR and the Post Office also reported on some performance problems in IDMSX. The next meeting will be at Oxford, and further reports will be presented.
Some definite progress towards agreement on what sort of conceptual view the projected end user facility should support was made at this meeting, and not before time. There was general agreement that entity modelling (as per ICL's Data Dictionary System) was the most effective technique currently available for mapping any end-user view onto whatever storage structures a particular DBMS employed (i.e. IDMS structures in practice, though theoretically any other DBMS could be used). Such techniques would, however, result in a global data model inappropriate to all but a few end-users. The group proposed that 'local' data models appropriate to various categories of end user should be supported, with the mapping from each of these, via the global model, to the storage model being the responsibility of the EUF proper. Such local models would not necessarily be simple subsets of the global model; in some cases relationships present in the global model might be suppressed, in others relationships not actually implemented in the storage model (because of e.g. volumetric considerations) might be present. I stressed the importance to University users of providing a relational local model, with some grudging support. It seemed probable that the DDS would/could be enhanced to provide support for local data models in this way, especially since construction of a global model for the DDS inevitably involves as a first step the construction of local models. The DDS subgroup would investigate.
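By way of illustration only (the group of course produced no code, and the entity and relationship names below are invented), the kind of local/global mapping under discussion might be sketched in modern notation roughly as follows:

    # Hypothetical sketch of a global entity model and two 'local' models
    # derived from it: one simply suppresses entities and relationships,
    # the other adds a derived relationship not held in the storage model.
    global_model = {
        "entities": {"CUSTOMER", "ORDER", "PRODUCT", "DEPOT"},
        "relationships": {
            ("CUSTOMER", "places", "ORDER"),
            ("ORDER", "contains", "PRODUCT"),
            ("DEPOT", "stocks", "PRODUCT"),
        },
    }

    def local_model(model, keep_entities, extra_relationships=()):
        """Derive a local end-user view: keep only the named entities,
        drop relationships touching anything else, and optionally add
        relationships the storage model does not actually implement."""
        rels = {r for r in model["relationships"]
                if r[0] in keep_entities and r[2] in keep_entities}
        return {"entities": set(keep_entities),
                "relationships": rels | set(extra_relationships)}

    # One category of user sees customers and orders only; another also
    # sees a derived CUSTOMER-to-DEPOT relationship not stored as such.
    sales_view = local_model(global_model, {"CUSTOMER", "ORDER"})
    stock_view = local_model(global_model, {"CUSTOMER", "DEPOT", "PRODUCT"},
                             {("CUSTOMER", "served_by", "DEPOT")})

The mapping from each such local view, back through the global model to the underlying storage structures, would on the group's proposal be the business of the EUF itself.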
Presentations given included: a man from MOD explaining a little of how they currently support their requirement for a query language suitable for use by Admirals, Flight Commanders and such like when playing toy soldiers.
The current system apparently uses lots of discrete files, some of which (we were told with a straight face) were designed 'as long ago as 1812'. It seemed that the similarity of their query language to English was possible largely because of the limited type of queries the system had to support. They were thinking of expanding from queries about the positions of friendly/unfriendly vessels to queries about something called 'signals', but did not seem to anticipate any problems. As far as their account of it went, there seemed to be very little syntactic component in their processor. M.R. Jackson (ICL) spoke on FOCUS, a highly impressive American software-house package which provides an interface between just about everything and everything else. Since it interfaces to IMS and to conventional files as well as IDMS, the data model used is limited to hierarchies, which makes for some contortions when handling IDMS databases. Its range of facilities (graphics, stats etc.) was staggering, as was its price. Richard Barker (ICL) gave a short outline of the new facilities in ICL's own Data Display at the next release (DD200), some of which look very nice, to see whether the group considered DD a suitable base for a query language. No firm conclusion was reached: despite its nice new facilities, DD is probably best used as a DML programmer's aid.
After initial false starts, discussion centred on the type of conceptual view the EUF should support. It was not clear whether a separate view was required for the interface between the conceptual view(s) and the IDMS structures: whether we were discussing an hourglass (pace Tim Bourne) or a pyramid (pace everyone else). Favourite candidate for the pinnacle (or the waist) is currently the Entity Model already used by DDS, to no-one's particular surprise; I shall certainly do my best to push the case of the relational view, preferably binary, at the next meeting, when current work will be summarised and discussed. Meanwhile ICL is still committed to marketing RDBMS 'at some stage' but not (it now appears) as a product integrated with other Data Management products.
Brief presentations were given on the conceptual views supported by some existing front ends to IDMS systems. R. Haydock (Plessey) described the structure supported by their homegrown interface to 1900 IDMS; Ed Dee (ERCC) summarised work done by M. Kay on a relational subschema; and two speakers from ICL gave short accounts of the FAME ('hierarchic') interface and of the FORAL ('binary') interface respectively. The MOD (OPCON) spec for a query language mentioned at the last meeting was distributed, and I have a copy if anyone is interested. C. Chang of ICL also spoke on GPLAN, a simple interface to IDMS which however makes some unwarranted assumptions about the structures represented. Of these presentations, only that on the relational view seemed to me to be of much use, the others being somewhat perfunctory and ad hoc. The Plessey product was also of interest in that it had evolved quite naturally from software originally written to support their conventional flat files during their changeover to IDMS. Dataskil were believed to be marketing a similar product.
There was some (fairly murderous) criticism of the draft section-headings for the final report proposed by J.West (Oxfordshire CC); much of this was re-worded and re-ordered. Various people were nominated to write sections describing the conceptual views discussed at the meeting: I got stuck with the binary view, as a prize for having read the article in question more recently than the ICL speaker.
This Sub Group has recently reconstituted itself to investigate primarily the sort of conceptual view end users should/would like to have of databases supported by ICL software. The meeting was well attended (about 25 people from organisations such as BBC, Post Office, CAA, Plesseys, Sainsburys, West Midland Gas, an Insurance Company, Oxfordshire County Council and the Ministry of Defence in various forms).
The terms of reference proposed at the last meeting were discussed, and a rewording which removed some woolliness from the original draft was agreed. Some discussion of Data Display (the query facility currently supplied for IDMS) indicated that it would not fall within the province of this group but rather within that of the IDMS group.
Harry Ellis responded on behalf of ICL to the terms of reference, thus giving some weight to the proceedings.
After lunch it was agreed to begin by considering existing types of conceptual view (relational, binary, hierarchic &c), and in particular the interfaces proposed by recent articles in the literature between a relational conceptual view and an IDMS-structured internal view. Links between IDMS and other pieces of software (STATS, FAME &c) were also suggested as possible areas for exploration. This seemed a more profitable starting point than the description of yet another all-purpose query language, parameters for which had already been defined by the last EUFSG. A part of MOD called OPCON (they fight phantasy wars for NATO) had a document on the setting up of such a query language which might be made available to the group; CAA had also produced a report.
The most interesting revelation of the day was that ICL are 'productizing' (Mr Ellis's word) the relational database system RDBMS designed by Andrew Hutt at Southampton and implemented experimentally on their 1900 some time ago. Whether this would be a stand-alone product (linked into the DDS of necessity but discrete from IDMS) or whether it could be made to 'front-end' onto (existing?) IDMS databases was not yet clear. ICL were aware of the requirements Universities and others have for simplicity of access to databases, and appeared to think that the relational view was the answer. If this is the line taken, most, if not all, of the problems currently being experienced with IDMS usability (4-month learning process etc.) should disappear. The relational view is also (unlike the Codasyl structures) academically respectable. The work of this subgroup as a pressure group on ICL seems to me to be very important, and its effects may be far-reaching.
2900 Software
About six weeks were spent producing a new IO system for Kent's 2900 Snobol interpreter, which involved some tinkering with the goodies SIMPLEIO module and much pestering of those (not a few) who know more S3 than I do. The new IO system can halve the execution time of IO-bound programs, largely because the old version used FORTRAN for all IO. Thorough testing of the interpreter revealed two or three bugs, which were reported to Kent. New documentation for 2900 Snobol was written and then ROFFed.
An Algol 68/GHOST interface was written and appeared to work.
Work on OXMTREAD continued. Facilities for reading 7-track tapes and tapes written with Fortran formatted writes were added.
IDMS
The Wytham database went operational in March, after some extensive tuning and recoding, following consultations with ICL. Fifteen years' worth of the private life of Parus major was loaded into a 10 megabyte database over one week in February, and I wrote a short report on the experience.
I discussed a new project, to do with Roman burial sites, with two people from the External Studies Dept., and designed a schema for them.
I chaired two meetings of the Fortran Users DMUG sub-group here, at which ERCC unveiled plans for the new improved FISS product, and various proposals for enhancements (chiefly in the area of usability) were discussed. FISS is apparently still only being used by ourselves, ERCC and British Aerospace (Liverpool will have the 1900 version shortly), though copies have been sold to various other ICL customers. A valuable side effect of all this was that we now have a pre-pre-release of FISS FI120, which allows Fortran access to IDMSX, the extended version of IDMS with many more bells, whistles and coloured streamers. This is the version which the Assize Court database will use. I wrote and tested an IDMSX schema for the latter at the end of March.
Archive
Plans to obtain Webster's Dictionary were knocked on the head by a point-blank refusal from the publishers; prospects of obtaining the new A.L.D. however look much more promising. The last of the Howard-Hill Shakespeare tapes arrived at the beginning of March, closely followed by the man himself. OUP are interested in using the texts as the basis of a new edition, and will be correspondingly helpful when it comes to giving us copies of other texts, we hope. I did some preliminary work on reformatting the Shakespeare texts using 2900 Snobol, and discovered that the folio text of Julius Caesar appears to have been mislaid. Richard Sabido (OUP) handed over three tapes, one of which was unreadable (they couldn't read it either), the other two being the OUP General Catalogue and the African Encyclopaedia. These are readable on the 1906A but not yet on the 2980. Susan's indefatigable girls continued to plod through some minor English classics for me, completing Wordsworth's Lyrical Ballads and Akenside's Pleasures of the Imagination in a record two weeks. We received a new copy of the Gill Corpus from the University of Warwick, which has aroused the interest of a lexicographer working for Heinemanns. Enquiries about texts were received from the USA, Holland, Germany and Newcastle; a sampler tape was issued to the Tyneside Linguistic Survey.
1900 Software
Birmingham uncovered one bug and suggested one enhancement in Famulus, both of which were acted upon. New versions of the package and documentation were released to various places, including the Royal Greenwich Observatory and the D.E.S. Aston reported a gross error in Oxeye, which was corrected; new version released to Aston and to Nottingham.
Any Other Business
I re-joined the DMUG End User Facility Sub-Group: a dedicated band which plans to define a really usable interface to IDMS for ICL to implement, by the end of 1979 (see visit reports). I submitted a paper on our experience using IDMS to ICCH/4, which was accepted, and wrote a second article on Snobol for publication in the ALLC Bulletin.
The theme of this year's International Meeting was 'Computational Stylistics' and four related papers were delivered. The first, by Prof R. Bailey (Ann Arbor, Michigan), outlined three different types of application for computational methods in literary studies. The first (data retrieval) was flourishing to the point (Prof Bailey confidently asserted) where all the concordances man could ever need would soon have been produced. The second was in the construction of models, where he felt the computer tended to be used solely in support of the "limited conceptual framework of the New Criticism" without exploring alternative representations of the creative process such as those of the Moscow school of Soviet semioticians. Prof Bailey also made the (to my mind somewhat tendentious) assertion that the study of the Humanities, like that of the Sciences, is cumulative and hierarchic. His third type, the use of statistical and other methods for testing widely-held if unproven hypotheses, was exemplified by the third speaker, Dr Kenny.
The second speaker was Dr P. Boyde (St John's College, Cambridge), who spoke in rather more general terms about the Computer's Temptations. He made the interesting suggestion that a poetic sensibility, such as his own, was almost of necessity innumerate, and also propounded Boyle's Law, viz that three examples of anything were necessary to convince a Humanist - one or two being too few, and four far too many. Unfortunately, since he then proceeded to give something like 5 examples of ways in which computers confused him, I fear I lost interest. Dr A. Kenny (Balliol) provided a sharp contrast by distributing three pages of statistical tables derived from his current research into the authorship of the 14 books of Aristotle's Metaphysics, using similar techniques to those described in his recent book on the authorship of the Ethics.
After lunch, during which I tried but failed to persuade a representative of the Kraus-Thomson Organisation that the future lay in fiche, Prof James Joyce (no relation) of the University of California described his application of a computational network model to the rhyming scheme of the mediaeval poem Pearl. Prof Joyce considered that the most frequently traversed nodes in his network corresponded with words of semantic significance in the poem. The argument was somewhat vitiated by Prof Joyce's inability to remember what some of the words in the poem meant, and also by his refusal to recognise the formulaic nature of much mediaeval verse, but the applicability of such a model seemed good.
The AGM of the ALLC which followed consisted of the usual tabling of reports from representatives around the globe, the usual fulsome praise for the officers of the society, and the usual promises that the Bulletin of the Society would appear more regularly in the future. Since Susan was appointed editor at the last committee meeting, it seems probable that this may actually occur.
ASSOCIATION FOR LITERARY AND LINGUISTIC COMPUTING
INTERNATIONAL MEETING : ANNUAL GENERAL MEETING
FRIDAY 15 DECEMBER 1978 KING'S COLLEGE, UNIVERSITY OF LONDON, ENGLAND
The Sixth International Meeting/Annual General Meeting will be held in Room 3B20 (3rd basement, new building), King's College, Strand, London WC2R 2LS, England. (The nearest tube station is Aldwych, open only at peak periods; Temple, Charing Cross and Holborn underground stations are also within easy walking distance.) The host will be Professor R. Wisbey.
INTERNATIONAL MEETING Provisional Programme
10.15 - 10.30  Professor R. Wisbey (ALLC Chairman): Introduction
10.30 - 15.00  Computational Stylistics (Chairman: Dr D.S. Brewer, Master of Emmanuel College, Cambridge)
10.35 - 11.05  Professor R.W. Bailey (Ann Arbor, Michigan): A Survey of Computational Stylistics
11.35 - 12.05  Dr P. Boyde (St. John's College, Cambridge): The Computer and Stylistics, or Lead us not into Temptation
12.05 - 12.35  Dr A. Kenny (Master of Balliol College, Oxford): A Stylometric Study of Aristotle's Metaphysics
- Lunch -
14.30 - 15.00  Professor J. Joyce (San Francisco): Rhymes and Individual Style in Some Middle English Poetry
15.15 ANNUAL GENERAL MEETING
The Address will be given by Professor R.A. Wisbey (King's College, London), who will also chair the business meeting, at which reports will be presented by ALLC representatives in various countries and by chairmen of specialist groups.
CHEESE AND WINE PARTY
The party will commence at 17.30 after the close of the AGM. It will be held in the (upper) Senior Common Room, to which all members and guests will be welcome. There will be a charge of approximately £2 per head.
TRAVEL ARRANGEMENTS
For special travel arrangements, valid from any country of departure, contact Mr M.J. Whiteley or Mrs L. Hill, M.D. Travel Limited, Special Events Division, 19 Old Hall Street, Liverpool L3 9JQ, England (telephone 051-227 3565). Hotel accommodation may also be booked through M.D. Travel, or arranged privately. The Imperial Hotel Group has several hotels at various prices. Reservations may be made for all hotels by contacting The Reservations Managers, Imperial Hotels, Russell Square, London WC1B 5BB, England (telephone 01-278 7871). There are many cheaper hotels in the immediate proximity of Russell Square and no prior reservation would be necessary.
This was the first meeting of the main group since I joined its executive. About 50 to 60 people were present, representing some forty different organisations, the large majority being commercial or government installations. The minutes, including a list of attendees, will be available when I get my copy.
The main business of the meeting was to discuss the proposed changes in the Group's structure and the change in its name. It was agreed that the group should now be known as the Data Management User Group or DMUG. I gave a (necessarily) brief outline of the activities of the Fortran sub-group, and was subsequently approached by two possible future users of the Fortran interface (BNOC and WHSmith).
In the afternoon, Geoffrey Baker, a consultant for a firm called CACI, gave a brief outline of a system he had developed for the NERC's Marine Agency, for collecting and storing scientific data about the sea. The novel feature of the data scheme was that descriptive elements were included in the database, giving unusual flexibility. The system was, effectively, a combined data dictionary and network database. Such a system might well be of use to some of our users.
The purpose of the meeting, organised by ICL, was to give a fairly comprehensive and technical presentation of the new range of data-management products now on release, in particular of IDMSX (the souped-up version of IDMS with extra bells and whistles, at no decrease in performance) and DDS. In order to add extra facilities without degrading the performance, a substantial re-write of the base code has been done, which resulted in a performance improvement of 10%. Adding the extra facilities reduces this improvement to 7%, it was claimed.
A nifty plastic binder with lots of technical details, and a summary of the presentations was handed out, and is available for inspection in my office.
I found the meeting useful, because many of the extra facilities in IDMSX solve usability problems IDMS users are experiencing now, e.g. the provision of record type indexes and of set pointer arrays. It was also a very valuable opportunity to meet the people who are developing the product and to discuss their future plans on an informal basis.
Much time has been spent in connexion with OUP's new edition of Shakespeare. All the Howard-Hill folio texts have been processed into usable form; the task of proof reading and adding COCOA references continued throughout this period, and has now accelerated with the appearance of an editorial assistant from OUP. Programs to produce the compositorial index were written ages ago and still seem to be working.
During August I got myself into a terrible state preparing a paper for presentation at a Conference on Database Applications in the Humanities & Social Sciences to be held after ICCH4 at Dartmouth College. Efforts to get IDMSX running under 5X32 in order to substantiate the claims in the paper (which concerned our only IDMSX application) were unsuccessful, so they (the claims) had to be moderated, but they (the efforts) continue. ICCH4 was an impressive experience; I heard news of all sorts of projects likely and unlikely, acquired 3 magtapes of text and several inches of printed blurb, and met too many people to list here.
September began with my getting into a terrible state tidying up the previous paper for publication (camera ready copy produced by the Qume was sent off two days before the deadline), and preparing another, entirely different one, for presentation at the IUCC Symposium. A report on the symposium will appear in the next Newsletter. The last two weeks of September I spent on holiday in Italy, which was ok till the gelati ran out.
Somehow during this period, at least two IDMS projects made considerable progress: the clinical trials database for :RPM is well into testing, and initial loading of the dictionary database for :BROADBENT seems to be complete. October saw the start of a new project to hold information about vascular surgical operations (:KIDSON). I discussed another possible archaeological project, but decided it was not worth using IDMS for. D. Doulton came from Southampton to discuss the ways and means of supporting IDMS (they plan to do so).
OXMTREAD has been rewritten and now has a user hook. Documentation of this and of OXMTWRITE is still incomplete.
Texts acquired for the Archive include complete works of Donne, Marlowe and the Gawayne Poet, major novels of Woolf, Joyce, Dostoevski, Lessing & Plath, and also the Lancaster Corpus of British English. We might have Chaucer too, if I could get round the parity error on the tape. Keats, re-edited to modern standard and nearly complete, came from Newcastle. About 5 texts of various sorts were issued.
A new version of 2900 Snobol, with the bugs I reported in February reputedly fixed, arrived but has not yet been tested. Preliminary talks with QMC suggest that they will be able to implement 2900 SPITBOL later this year.
I wrote papers for the IDMS User Group on Restructure, Statistics and Fortran Interface. Meetings were attended in London and Birmingham. Nothing much happened at any of them.
I attended a British Academy lecture by C.S.Butler who has been analysing Sylvia Plath with a computer; he gave short plugs for Cocoa, Oxeye &c to a largely incredulous audience.
I completed a series of four articles on Snobol for publication in the ALLC Bulletin.
This was the first meeting of the new executive committee of the ICL IDMS User Group. I have represented Oxford at this group sporadically since it was set up in 1977, and was recently elected Chairman of the Fortran Users' Subgroup, which qualifies me to sit on the Executive, ta ran ta ra.
The main business of the meeting was to re-define the structure of the group and apportion responsibilities for that new structure. It was decided that the group's name should be changed to 'Data Management' group, reflecting the wider interests of the majority of the group and also providing a better interface to ICL: the group could function better as a pressure group if it covered a range of ICL products.
At the end of the meeting, four sub-groups were defined: TP (chairman John Ladley, M.O.D.) to deal with all aspects of data management in a TP environment, particularly with respect to recovery; DDS (chairman P.M. Shearer, West Midlands Gas) to deal with all aspects of the Data Dictionary System and allied products; IDMS DDL (chairman Geoff Baker, Consultant, CACI) to deal with DDL aspects of IDMS, i.e. the in-line software; IDMS DML (chairman Me) to deal with DML aspects of IDMS, i.e. the programmer interface, from all host languages. The four chairmen, together with a chairman (John Dicker, BBC), treasurer and 1900 Representative, constitute the Executive.
Terms of reference for all 4 subgroups were discussed. They are to be submitted to the main user group for ratification, as are the proposed changes in the group's structure.
An application for affiliation with the group from the South African IDMS Users' Group was believed to be imminent. The executive voted to reject the application 3 to 1; fortunately no application had in fact been received.
A discussion of the group's relationship with the 2900 User Group established that there was little commonality, and hence, little conflict, of interest between the two bodies: ICL however required that members of working parties set up by the Data Management group (to which confidential information might be made available) should also be members of the 2900 User Group.
IDMS
The quarter began promisingly with me commuting to Bracknell in a suit to get some hands-on experience of using FISS (the Fortran interface to IDMS) and indeed of 5x27 on a 2980 which worked more often than not as well. The swan schema was successfully implemented there and one test program run before pressure of other work recalled me to Oxford. At the end of the quarter I gave a talk to the Zoology Dept. on database techniques and the use of IDMS which seemed to arouse considerable enthusiasm.
2980 Initial Service
I did my best to make a lot of noise about the Initial User Service, drafting a User Notice, rewriting much of the Initial Use documentation and delivering a well-received if inaccurate presentation to the User Group Meeting. I wrote various macros to make life a little easier for package-users (and others), viz OXCAF, OX-CARD-INPUT, SNOBOL and FAMULUS. Documentation on Snobol and on Famulus was written and delivered to User Liaison last week. Initial testing of SNOBOL indicates considerable scope for improvement particularly as regards the I/O. A disturbing number of bugs in the first release of Famulus also surfaced and were reported to Edinburgh: a new release is due shortly.
Archive
This quarter the number of words in the Archive passed the 3 million mark, largely as a result of our acquiring a large corpus from Warwick, containing a most extraordinary mischmasch of newspaper articles, committee reports, recipes, instruction leaflets etc. I also received from the University of Manchester a partial transcript of the A.L.D. and another of written speech. Negotiations with the firm L. Urdang Inc. are progressing; if successful they should lead to our acquiring not only a complete transcript of a new English dictionary but also of various other reference works. Texts of Keats and Hopkins are expected to arrive by September, when the Coleridge corpus should also be completed.
1900 Software
Much time was spent on VERA, a new program for analysing variant states in texts which interfaces to the cluster analysis routines of GENSTAT. This was finished and tested in time for Susan to advertise it at an International Meeting on Textual Criticism in Paris, just. Much time was also spent on programs for analysing the movements of giant tortoises on the Isle of Aldabra. Apart from this, and routine software maintenance, the 1906 did not see much of me this quarter. I tested the new version of CLOC, and found a few restrictions and one bug, which were duly reported to Birmingham. I have also just obtained a copy of the Bradford 1900 LISP system.
Any Other Business
I continued to write superb minutes for the Programmers Meeting, and also designed a somewhat less superb form for logging the activities of Advisory Service. I was involved (together with Susan and the Professor of General Linguistics) in the organisation of a series of seminars on Computational Linguistics to take place next Michaelmas term. My main extramural activity was attending the ALLC International Symposium at Aston at the beginning of April, which (despite the highly uncongenial surroundings) proved to be one of the more successful of the series to date. I was invited to write a series of articles on Spitbol programming, recommended to produce a machine-independent Oxeye, engaged in a full and frank exchange of views on the organisation of machine-readable archives with both American and Norwegian experts in the field, and I also picked up some interesting new techniques in collocation analysis.
The International Meeting of the ALLC for 1977 had a unifying theme - the role and organisation of a corpus in literary and linguistic study, hence my presence there. About 70 people, mostly European, attended. In the morning there were brief presentations from (1) Theo Bungarten, which was all in German and incomprehensible; (2) J. Neuhaus, who is associated with Spevack at Munster, where a lot of interesting work is being done on the Shakespeare corpus, with the aid of the Chronological English Dictionary; and (3) Sture Allen of the Swedish Logotek, an ongoing archive of all current Swedish newspapers, at Goteborg. These were followed by a general panel discussion, in which Charles Muller (Strasbourg) gave a few useful cautions on the indiscriminate use of statistics which I had never heard said in French before, and Prof. Al (Amsterdam) stressed the relevance of a corpus of linguistic practice in view of current thinking about socio-linguistics. There was much discussion on the well-worn theory of corporal homogeneity, but no firm conclusions were reached.
There was a break for lunch during which the serious button-holing of the day began. I met Stan Bootle (who used to work for Univac and now sells a sort of primitive artificial intelligence package to business men) and A. Van Wijngaarden. I tried my best to get Spevack and Neuhaus interested in visiting Oxford, since their research seemed the most interesting currently being undertaken with the use of computers. I exchanged names and addresses with various others.
After lunch we were reconvened as the ALLC AGM, an altogether more parochial affair. There was a certain amount of mutual self-congratulation in evidence, reflecting the various political machinations that have been proceeding in the executive of the Association; Susan has now been elected to the Committee with a large majority, and several people also expressed admiration for the work being done at Oxford under her mantle. I spoke briefly on the Archive, and distributed copies of the Catalogue. I also learned the whereabouts of some other texts previously unknown to me (in Denmark and in Belgium) and made efforts to obtain copies of them.
In the evening the buttonholing resumed, over large quantities of Belgian beer, fruit and nuts, and broke up for dinner around 2100h. I was then taken on a tour of the nightspots by John Allen (Manitoba) and his Belgian chums which, however, did not constitute official business.
Up to 200 people attended the Summer School, mostly from Italy, France, USA, Canada and W. Germany; smaller national groupings included the English, Belgian, Dutch, Czech, Yugoslavian and E. German (all about half a dozen each). There were also at least one Pole, one Vietnamese and an Indian from Moscow. Over two-thirds of those present were linguists, and probably half of these had no computational experience. Almost all were engaged in academic research, mostly at post-doctoral or doctoral level. The summer school consisted of a dozen formal courses of lectures of varying lengths, workshops and informal discussions.
The courses reflected the interests of the majority in that the strictly computational element was slim. The star in this respect was M.A.K. Halliday (long regarded as the Grand Old Man of modern linguistics), whose course demonstrated the enormous complexity of understanding and producing natural language, and the complete lack of homogeneity in the types of knowledge needed to do so. By contrast, Yorick Wilks (equally undoubtedly the Machine Intelligence Superstar of the Seventies) gave an optimistic and remarkably thorough overview of the various systems for understanding natural language developed over the last ten years. Wilks (a philosopher by training) has a witty lecturing style and maintains a lordly indifference to the petty squabbles of nomenclature which bedevil the field. He brought out the intrinsic similarities of much recent work, and stressed that the relation between these systems and recent developments in theoretical linguistics was closer than suspected.
Wilks' course was complemented by a two-week course taught (in a ruthlessly pedagogic manner) by Luc Steels (U of Antwerp, now at MIT). Though called 'an introduction to computational linguistics', this course in fact simply presented in some detail the formalisms used to express meaning in the field and gave more technical details of many of the systems described by Wilks. Typical material covered included LISP-type lists, phrase-structure grammars, case-structure grammars, semantic structures (e.g. Schank's "conceptual dependency" graphs, Winograd's PLANNER) and, most important, transition networks. Steels gave sufficient detail of how different parsing systems operated to suggest that implementation was comparatively trivial, and also discussed in more general terms other cognitive formalisms such as frames, scripts and the Knowledge Representation Language of Bobrow & Winograd. A specific example of a transition network was discussed in another course, given by Ron Kaplan (Xerox, Palo Alto) who, with R. Woods, designed the first successful ATN-based parser and has now expanded it considerably. (An ATN is a method of representing a given syntax or grammar in a Transition Network; the A stands for Augmented, and indicates that some or all of the transitions may include predicate tests, case functions, structure-building acts etc., thus including a semantic component in the system.)
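As a purely illustrative aside (not part of the course material): in modern notation, a toy transition network with 'augmented' arcs might look something like the sketch below; the two-state noun-phrase grammar, the tests and the register names are all invented for the example.

    # A minimal, hypothetical sketch of an augmented transition network:
    # states are nodes, and each arc carries a predicate test (the
    # 'augmentation') plus a structure-building action applied as the
    # parse proceeds. Not Kaplan's system, just the principle.
    NETWORK = {
        "S0": [  # arcs: (test, action, next state)
            (lambda w, reg: w in {"the", "a"},
             lambda w, reg: reg.update(det=w), "S1"),
        ],
        "S1": [
            (lambda w, reg: w.isalpha(),
             lambda w, reg: reg.update(noun=w), "S2"),
        ],
        "S2": [],  # accepting state for this toy noun-phrase network
    }

    def parse(tokens, state="S0"):
        """Walk the network over the tokens, applying arc tests and
        structure-building actions; return the registers on success."""
        registers = {}
        for word in tokens:
            for test, action, next_state in NETWORK[state]:
                if test(word, registers):     # predicate test on the arc
                    action(word, registers)   # structure-building act
                    state = next_state
                    break
            else:
                return None                   # no arc accepts this word
        return registers if state == "S2" else None

    print(parse(["the", "pearl"]))   # {'det': 'the', 'noun': 'pearl'}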
In addition to these full-length courses, I attended parts of a course given by George Lakoff (UC Berkeley) on linguistic "gestalts" - a new concept in theoretical linguistics suspiciously similar to the AI "frame" concept - an informal discussion at which I spoke on the Oxford Archive, and an introductory LISP course. I also found a couple of potential Oxeye users and a lot of texts for the Archive. There was an attempt to organise a football match "Linguists v. Computers", but it failed through lack of computational support.
Approx 5 kg of printed paper came out of the Summer School, and is on its way hither if anyone wishes to know more about it. I also have course notes for Steels' course and a useful bibliography.
English Texts
Work on the Archive continued through the summer. Texts were received from Cambridge, Edinburgh and Minnesota, and sent to Warwick, Edinburgh, Leuven and Swansea. From Minnesota came the nucleus of the Coleridge corpus, currently being completed. Lovely Mrs. J. Healey began typing the rest of this in July and had got to page 400 by September, when the money ran out. Proof reading and correction of the bulk of this corpus is now well advanced. Other texts received included three from UMRCC which came in the unlikely form of 7-track Atlas coded paper tape. I wrote a program to convert these to EBCDIC, which appears to be working. Another conversion program was written to translate 1900 code with shift characters (which are ignored by MT-READ) into EBCDIC. A machine-readable catalogue of the Archive was begun in December. Considerable interest in the Archive was expressed at the various international meetings Susan and I attended last year, and large amounts of texts were, as ever, promised. Work on a program for collating textual variants and grouping them for cluster analysis began in October.
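For what it is worth, the principle behind both conversion programs (a table-driven translation in which 'shift' characters change state rather than producing output) can be sketched in modern notation as follows; the code values and the plain-character target are invented for the illustration, and are not the real Atlas, 1900 or EBCDIC tables.

    # Hypothetical sketch of a table-driven character-code conversion:
    # each source byte is looked up in a translation table, and shift
    # bytes switch case state instead of emitting anything themselves.
    ALPHA_SHIFT, BETA_SHIFT = 0x74, 0x75        # invented shift codes
    TABLE = {
        # source byte: (unshifted character, shifted character)
        0x21: ("A", "a"),
        0x22: ("B", "b"),
        0x10: (" ", " "),
    }

    def translate(raw: bytes) -> str:
        """Translate a stream of codes, honouring shift characters."""
        shifted = False
        out = []
        for b in raw:
            if b == ALPHA_SHIFT:
                shifted = False                 # shifts emit no output
            elif b == BETA_SHIFT:
                shifted = True
            elif b in TABLE:
                out.append(TABLE[b][1 if shifted else 0])
            # codes absent from the table are dropped in this sketch
        return "".join(out)

    print(translate(bytes([0x21, 0x10, 0x75, 0x22])))   # prints "A b"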
SPITBOL
Most of September was spent nervously preparing myself to give a course encouragingly titled 'Advanced use of Spitbol'. In the event, this proved quite successful, in that attendance actually increased over the four weeks, and there has been continued demand for the course notes. Some of those who came were not arts-users. A new version of the SPITBOL compiler is expected daily. A tape bearing 2900 Snobol arrived this month from Kent, just too late to be of much use for the present.
FAMULUS
Three or four minor bugs were fixed and a major enhancement carried out to the sorting routines last year. The enhancement allows non-sorting characters (e.g. accents) to be specified, and necessitated the recompilation and re-organization of the subroutine library. Two programs were written to assist particular users get their heart's desire (as opposed to more or less what they wanted) out of the package. 2900 Famulus did not materialise.
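Not FAMULUS itself, of course, but the principle of the enhancement (drop the specified non-sorting characters when building the collation key, while leaving the stored entry untouched) can be sketched as follows; the set of non-sorting marks and the sample entries are invented.

    # Hypothetical sketch of sorting with user-specified non-sorting
    # characters: they are ignored in the sort key but preserved in
    # the output.
    NON_SORTING = set("'^`")          # e.g. apostrophes and accent marks

    def sort_key(entry: str) -> str:
        """Build a collation key by dropping the non-sorting characters."""
        return "".join(ch for ch in entry.lower() if ch not in NON_SORTING)

    entries = ["d'Arcy", "Darwin", "cote", "co^te"]
    print(sorted(entries, key=sort_key))
    # ['cote', 'co^te', "d'Arcy", 'Darwin']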
IDMS
Progress has not been rapid, nor indeed clearly visible. Some real swan data has been punched and re-formatted a few times. My knowledge of COBOL has increased from the pathetically inadequate to the merely risible. I attended the AGM of the User Group at which, inter alia, it was agreed to set up a FORTRAN users' sub-group. I will be going to Bracknell to get some experience of the FORTRAN interface (FISS) this month by helping in something known as 'beta-testing'.
Any Other Business
OXEYE was issued to Aston and the installation test run successfully last month. I became a budget advisor at the end of September. I wrote impeccably biassed minutes for the Programmers monthly meetings. I attended the 4th International Summer School in Computational and Mathematical Linguistics in August in Pisa, a meeting of the BCS where I hoped (but failed) to see blood in a confrontation between implementors of Spitbol and Snobol, and the International Meeting/AGM of the ALLC in Liege in December. Visit reports were written.
April - June 1977
2½ weeks this quarter were taken up by holiday and official jubilation.
Copies of OXEYE were sent to Minnesota and to Birmingham. The SPITBOL system has remained unchanged and bug-free so far this quarter. I gave a seminar on its use and on good SNOBOL practice at University College Swansea; I also wrote a review of a book on SNOBOL, which is to be published next term.
The Oxford Archive of English Literature, a data bank of literary texts in machine readable form, was conceived and reached parturition this quarter, despite endless problems with gummy pool tapes, defective tape transports and inadequate software. Several letters have been written, a meeting was held, tapes have been transferred. All texts known to be in the 1906A filestore were collected together, dumped to tape, and lost in a general restore. Currently about 50 texts are being maintained and a catalogue is in production. Hardware problems form the chief obstacles to progress.
The IDMS installation test was successfully run at Bath on May 6th. The swan database was moved there successfully the same month, and a program to monitor its performance ran once or twice before the machine went off the air. I am now re-designing the entire system. I gave a Tuesday morning chat on IDMS and also wrote a paper for the 2900 Technical Committee on the same topic.
Notable visits this quarter include two from ICL, Mr. Dick Jukes of Management Training who gave valuable advice on training in the use of IDMS and Mr. Tim Bourne who came to cast a fatherly eye over the progress of the RSL Serials Project (currently HALTED - WAITING FOR A STUDENT). I attended a course on literary statistics run by Susan and the next Master of Balliol, and a fascinating seminar on computational linguistics given by Professor Yorick Wilks at St. Cross.
Alan Purvis (CGF, Bracknell) gave a brief presentation of some of the features of DMUS and drew attention to some of its deficiencies. He had used all of the utilities, but discussion centred on the use of DATAVALIDATE, EXTRACT, REPORT and SORT. I have a copy of a summary report he made. Deficiencies noted include: ambiguous syntax in FML (e.g. use of the continuation symbol obligatory before THEN and forbidden after it; unsatisfactory IF...THEN...ELSE nesting), absence of data type (character/numeric) conversion on output, and, more seriously, absence of any facilities for group items (COBOL 'arrays'). The latter deficiency will however be removed in the next release, B64. A serious inefficiency is that each line of a BASIC PROCESSING module has to be interpreted afresh for each record processed; BASIC PROCESSING modules are not however necessary for most DMUS applications. Mr Purvis confirmed, unofficially, a rumour that ICL had no plans to provide the 'translated ML' facilities advertised in the manuals within the foreseeable future. Amidst universal lamentation the meeting broke for lunch.
After lunch, possible applications of DMUS in the University environment were discussed. Both Southampton and SWURCC were thinking of using such facilities as SORT and REPORT to analyse user-metering records. Edinburgh pointed out that SORT, RECORDLIST etc. were of obvious utility to the generality. Kent were concerned that many DMUS facilities were not available with VME/K. It was also pointed out that SORT in particular should be more efficient than any user-written code as it calls MAMPHY etc. directly. There are no SCL equivalents for such commonly used facilities as COPY, BLOCKCOPY, RECORDLIST or BLOCKLIST. It is probably too early to say exactly which applications will use DMUS most, but the meeting clearly felt that it would have to be used in the University environment, despite its deficiencies. Another meeting would be held in 6 months to exchange experience gained using the package.
Minutes of the meeting are not yet available but will be produced on request when they are. As normal, brief reports were made from those currently implementing IDMS projects - which now include BBC, Plessey, HSA, Water Data Unit, ERCC and the MOD in various guises. No one had anything but praise for the IDMS code, nor much to grumble about except the eccentricities of System B. Other interested parties present included representatives of the GPO, UKAEA, Mardon Son & Hall, Anglian Water Authority, Short Bros and the Hong Kong Government. The main business of the day was (a) a presentation by one D.A. "Nobby" Clarke of ICL on the newly released Data Dictionary System, and (b) proposals to re-organise the group - now an unwieldy 30+ in size.
(a) The Data Dictionary System is a newly released, large (120+ Kb mainstore), expensive (no figures available yet), largely undocumented (but TP 6426 is due in June) piece of software that no system offering database facilities can afford to do without. Such was my impression. It is essentially a database containing information about all aspects (data structure, physical placement, processes used, etc.) of another database or databases, which can therefore be continuously updated, and in which modifications to database design can be exhaustively tested without corrupting the status quo. As a design tool (and especially when starting a database from scratch) there is little doubt that the DDS methodology cannot as yet be improved upon. One can of course use that methodology without using the associated software! "Nobby"'s presentation demonstrated this quite effectively by showing the simplicity of translation from an 'entity model' to DDS input, though he was less forthcoming about output from the package. He particularly stressed how the use of an 'entity model' would draw attention to otherwise obfuscated contradictions inherent in the structure of data. The DDS is machine and application independent, and grand things were promised of it in the future. In particular, the next release (DDS200) will be able to generate IDMS Schema DDL or FML file DDL directly from the 'entity model'. CRISP (the Navy Stores & Transport Inventory Project) have been evaluating the DDS by carrying out field trials for ICL, and they have also produced a comprehensive report on the package. To quote their conclusions: "The application of the DDS is relevant to all ADP projects and the authors recommend that all 2900 series users adopt it at the earliest opportunity" and "The DDS is still in its infancy with regard to its potential usefulness to the user". I have a copy of this report.
(b) The IDMS user group formally agreed to split into several smaller working parties dealing with distinct areas where it was felt enhancements were necessary, to discuss and make recommendations to ICL on those enhancements, and to report back to the main group, which would meet at less frequent intervals to discuss the subgroups' recommendations. First meetings of the subgroups would be held in the next month, and they will report to the next general group meeting in two months' time. I propose to attend a subgroup dealing with query language and report generator.
This quarter I have kept a diary.
1 1900 Activities
The Oxeye Users Guide was published on 24th January and I gave a presentation of the package at one of Susan's Computing in the Arts seminars.
A new SPITBOL compiler arrived on 17th February. This version operates by default in a fundamentally different way from previous ones. The choice was between leaving existing programs un-modified but increasingly inefficient and carrying out a substantial amount of re-writing.
As far as OXEYE was concerned, various modifications had to be made to improve the package's portability in any case, so I chose the latter alternative. At the request of the original Eyeball's designer, I wrote an Oxeye/Eyeball interface and incorporated a Fortran program into the system. The new version (1.1) of Oxeye was put up on 30th March. Copies are being distributed to sites as distant as Swansea and Minnesota.
With the new SPITBOL compiler there came a large normal mode file, which proved to be very comprehensive documentation. I added a chapter on the macro facilities etc. local to Oxford, and (after some problems with exclamation marks) a clean copy of the document was produced on the PDP8 printer, from which manuals are currently being made by the Print Unit.
2 2900 Activities
I have been using the Putney machine extensively this quarter.
IDMS is not yet available at Bath, though it should be after Easter.
I ran the IDMS installation test successfully on 18th January, and used the SORT and DATAVALIDATE modules of the DMUS successfully a few days later.
There then followed a week of lost blocks, during which the Project Team showed itself remarkably well-tempered. By 9th February however I had successfully compiled my own schema and DMCL for the pilot IDMS project. This is to be a study of the last 15 years of observations of swans on the Thames; however, since most of the data is still stored in several shoeboxes of grubby notebooks, I wrote a program to generate some test data. The first attempt to store any of it in the IDMS database failed because I had made the page size too small (128 bytes as opposed to a suggested 1-1½ thousand), but at the second attempt 25 of my synthetic swans fluttered home to roost before a logical error in my schema design caused run-unit abortion. Using IDMS is a complicated business, so I spent the next week or two on a MAC terminal writing an SCL macro to make it less of a feat of memory. This was successful up to a point, though much remains to be understood. The whole application program ran successfully on MAC on April 4th.
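With hindsight the arithmetic is easy enough to sketch; the record size and overheads below are invented for illustration, not the actual figures for the swan schema.

    # Hypothetical illustration of why a 128-byte page was too small:
    # each page carries some fixed overhead, and each stored record
    # needs its data plus an index/pointer entry, so a tiny page holds
    # at most one record. All figures are assumed, not measured.
    PAGE_OVERHEAD = 16       # bytes of page header/footer (assumed)
    RECORD_OVERHEAD = 8      # bytes of line index per record (assumed)

    def records_per_page(page_size, record_size):
        usable = page_size - PAGE_OVERHEAD
        return max(0, usable // (record_size + RECORD_OVERHEAD))

    for page in (128, 1024, 1536):
        print(page, records_per_page(page, record_size=100))
    # 128 -> 1, 1024 -> 9, 1536 -> 14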
Other Activities
I presented a less than revolutionary paper on the allocation of transient budgets at a User group meeting on 12th January.
I attended an IDMS User Group meeting at the Ministry of Agriculture, Fish and Food on February 2nd, where there was a presentation on the Data Dictionary System. Mr. T. Bourne (of GPCD S and TS) subsequently came to Oxford to discuss the DDS launch and agreed to consider any application of the package we might find as a candidate for his assistance.
Possible future applications of IDMS were discussed with Dr. Shaw (Radcliffe Science Library), Mr. Wright (Forestry) and Miss Pickbourne (Genetics).
I attended a somewhat dispiriting seminar on the DMUS run by the 2900 Applications Committee, and a presentation by Dr. Coles (Northwick Park) on an automated SNOP system for coding histopathological data at the Regional Health Authority.