Collected Reports

2. 1995-1999

2.1. XML Europe 99
Granada

At the end of April 99, I spent a week in Granada courtesy of the GCA, where I attended the XML Europe 99 conference. The price of this privilege was that I had to pay attention during all the sessions, and then give a closing address at the end of the conference, after a very sybaritic lunch (see below).

I duly attended, and took assiduous notes, (and then lost them all which is why I don't have a visit report to post). I gave my closing plenary from a series of bullet points on my laptop, which lead several people to assume that I had actually prepared a serious talk, and, more seriously, to ask me for copies of the slides afterwards.

I have therefore attempted to confect a more or less plausible version of what I actually said, written out in my best joined-up XML, using the newly minted XML version of the TEI Lite DTD, so you should be able to read it with almost anything....

Figure 1. The view from the podium

2.2. The 1999 EBTI, ECAI, SEER and PNC Joint Meeting
Academia Sinica, Taipei
18-21 January
an unofficial report

Maybe I should start with the acronyms...

EBTI The Electronic Buddhist Text Initiative is a loose international confederation dedicated to converting into electronic form the immense canon of Buddhist teachings and literature, currently surviving in all manner of exotic languages and formats.
ECAI The Electronic Cultural Atlas Initiative is a similarly unstructured group of scholars aiming to set up a distributed database of information about ‘cultural heritage’ sites relating chiefly but not exclusively to Asian cultures and accessible via an integrated GIS interface;
SEER apparently stands for Scholars Engaged in Electronic Resources , and is another loose federation of scholars interested in Asian literatures and languages, particularly ancient ones, with a broader and more library-oriented focus than EBTI;
PNC The Pacific Neighborhood Consortium is a rather more technologically-focussed collaboration amongst a number of academic computing centres located in the "Asian Tiger" countries which has already established an effective regional wide area network and is now looking for content.

Presiding genius and visionary of both ECAI and EBTI is Lewis Lancaster from the Department of East Asian Studies at UC Berkeley, but both groups have an extraordinary roster of distinguished and energetic scholars and institutions scattered around the Pacific rim, co-operating in a number of equally impressive digitization projects. This was the third of their international conferences at which I have had the honour of being invited to present a TEI Workshop; previous events were held at Haiensa Monastery in Korea in 1994, and Otani University in Kyoto in 1996, and the next will be at Berkeley in California next year. The sheer scale of EBTI's interest, embracing canons of ancient texts in Pali, Sanskrit, Tibetan, Korean, Chinese, Japanese, and other less well-known languages, makes for a pretty rigorous test of the TEI's claimed ability to cope with all texts in all languages of all times. Put this together with a characteristically Buddhist atmosphere of mutual tolerance and altruistic scholarship (to say nothing of the lure of their exotic locations) and it is understandable why such invitations are hard to resist.

The 1999 event was hosted by Academia Sinica, a Taiwanese government funded research institution and the Taiwanese Ministry of Education. Academia Sinica has a long history of interest in SGML and XML; one of its chief luminaries being Professor Ching-chun Hsieh, architect of (amongst other things) the modifications to Unicode needed for it to handle the full range of ancient Chinese characters. Indeed, should the TEI decide that it needs a host organization in this part of the world, I think that Academia Sinica would be a natural first choice for a site to approach. Also in Taipei, at the National Taiwan University but in close contact with Academia Sinica, the CBETA project, is creating a TEI-conformant corpus of the Chinese Tripitaka. involving Christian Wittern, inventor of the Kanjibase is a recent recruit to Academia Sinica, and is converting several large structured-text databases to TEI. Both Christian and Rick are enthusiastic promoters of the TEI.

The conference proper was preceded by four tutorial workshops on Metadata and the Dublin Core, by Helen Jarvis (University of New South Wales); TEI (me); GIS by Lawrence Crissman (Griffith University); and Image Data by Jan Glowski (Ohio State). I missed all of these except my own, owing to a pressing need to revise my overheads and drink as much of the Academia Sinica's excellent coffee as I could afford, but the handouts indicate that they all provided a useful technical overview of these four key topics. After the workshops, Rick Jelliffe kindly shepherded a group of us long-noses around the famous night-markets of downtown Taipei, where I bargained furiously for brass ornaments, and eat lots of tasty sausages, and freshly cooked chestnuts. My appetite for exploration thus whetted, I must confess to taking a whole day off to wander around Taipei on my own which was an exhausting but extraordinary experience.

The conference itself spread over four days, with up to four parallel sessions running throughout, each devoted to progress reports and position papers about text creation projects, networking and digital library projects, and GIS applications in Asian studies. Most of the presenters came from Taiwan, Japan, USA, Australia or Korea, but India, Nepal, Thailand, Russia, Germany, France, Mexico, and the UK were also represented. We were fed at regular intervals with excellent Chinese buns and other goodies, quite apart from plentiful Chinese lunches, and there was much extra-session activity and discussion, involving the exchange of business cards. There was also a splendid Chinese banquet, held in the Taipei Business Club (a building that has to be experienced to be believed), complete with witty meditations on the approaching millenium from a very distinguished Taiwanese professor and a welcoming speech from the newly elected mayor of Taipei, whose election pledges apparently include Internet access for one in three inhabitants of the city, quite apart from finishing the rapid transit system that Taipei desperately needs. I'm not sure what we eat (apart from the Sharks Fin Soup) but it was all very tasty, and plentiful.

An opening plenary by Roy Weber from AT&T gave a taste of some of AT&T's more futuristic products, based on close integration of internet and conventional telephony systems: these included WISL, allowing for management of distributed telephone sales staff (did you know that nearly half of all calls to AT&T are to 800 numbers?); a deluxe form of video-conferencing called "virtual presence"; and the endearingly-titled "cyber fridge", (named after the fridge on which all good US families post family-oriented information) which we will all use to keep in touch when our homes have permanently active internet connexions, and flat display panels are built into refrigerator doors.

One of the two other Brits at the conference, Susan Whitfield, reported progress on the International Dunhuang Project at the British Library, home of the "sponsor a sutra" digitization scheme. Not content with digitizing this extraordinary collection of 20,000 manuscripts and manuscript fragments, all now catalogued, the project is now developing an integrated catalogue of the Stein photo archive. This links extracts from Stein's diary of his turn of the century travels along the Silk Road with the photographs he took; its use of a geographic or spatial metaphor as a way of accessing disparate collections of digital resources seemed genuinely innovative to me and was a recurrent theme of the conference.

Tom Duncan, from the Museum Informatics Project at Berkeley, described some of the technology underlying their project, based on the use of Sybase as a data manager for 50,000 images, delivered via JTIP compression, and also as a means of providing full text retrieval from the Korean Buddhist text canon. They deploy a traditional thesaurus style access to these resources via a Sybase implementation of a range of standard thesauri. Howie Lan, also from Berkeley, gave an overview of other Digital Library research activities at Berkeley, but ran out of time before getting down to much real detail of his advertised topic of "multivalent documents" (I think this is mostly about combining digital images, transcriptions, and other views of a document)

I attended a business meeting of ECAI, my curiosity about which had been sparked by an introduction to the power of GIS systems which Ian Johnson kindly gave me during a long conversation about the relative merits of XML and RDBMS for organizing data. ECAI has an active technical group apparently largely driven by said Johnson and other Australians, and an interesting programme of work. The plan appears to be to collect metadata describing any and all datasets relating to Asian cultural materials using a modified (surprise!) set of Dublin Core descriptors, and to provide access to the corresponding distributed datasets via a new desktop mapping system called Timemap. I muttered about OAIS and the Warwick framework, and perhaps more significantly infiltrated a late night drinking session hosted by the imperturable Helen Jarvis (who is, incidentally, co-author of the definitive study of Cambodian atrocities ) and a frock-coated Matthew Ciolek (of WWW Virtual library fame). I hope to persuade some of those involved in ECAI's technical group to present a session at this year's DRH conference, since I think we have much to learn from the scale and scope of ECAI, quite apart from its intrinsic interest.

A recurrent theme of the EBTI sessions on ancient Chinese and Japanese texts concerned the difficulty of encoding such texts in a standard way resulting from the Unicode Consortium's decision to "unify" CJK scripts. Shigeki Moro, for example, reported that in encoding the Taisho Tripitaka (which includes both ancient Japanese and Chinese marterial) they had so far needed 5840 distinct charavcters, of which 1264 were unavailable in JIS and 338 were missing from Unicode. These so-called Gaiji have to be represented by SGML-like entity references, using numbers assigned by inventories such as Wittern's Kanjibase; when their textbase is converted to XML, they propose to use empty elements for the purpose. In the same session I learned of the existence of the Mojikyo Font Center , an admirable organization which provides free Truetype fonts for over 80,000 Japanese and Chinese characters.

Technical solutions to linguistic diversity was a recurrent theme in the EBTI sessions I attended. Amongst many others, I noted the following: Jost Gippert presented the TITUS (Thesaurus Indogermanischer Text- und Sprachmaterialien) project at the University of Frankfurt, demonstrating alignment of several texts from this immense collection using WordCruncher; the TITUS site also includes an ingenious set of pages for testing the ability of your system to deal with Unicode. Dhanajay Chavan from the Vipassana Research Institute in India demonstrated the new version of the Chattha Sangayana CD which allows the entire Pali canon to be viewed in Devanagari, Roman, Myanmar, Thai, Sinhalese, Cambodian or Mongolian scripts. Marvin Moser from Lucent Technologies in Chicago presented an input system for Tibetan script which (I learned) has similar "stacking" problems to classical Arabic. The Asian Classics Input Project , with which the OTA already has links, continues to create and distribute a vast library of Indo-Tibetan literature, recently expanded to include digital images of collections of such material in St Petersburg. Meanwhile, a team from Dongguk University in Korea has been quietly developing their own XML-aware Unicode editor for inputting the Korean Tipitaka, which may well turn out to have wider application, if it ever gets out of Korea; unfortunately I couldn't find out as much about this as I wanted to: this paper describing it is in Korean.

In another tribute to the lure of XML, Charles Muller (Toyo Gakuen University) recounted his experience in converting his Dictionary of East Asian Buddist terms into XML, using XSL to render it via Internet Explorer 5. The possibilities thus opened up of linking this dictionary with other resources such as the Rangjung Yeshe Tibetan English dictionary of Buddhist culture attracted much interest. Christian Wittern gave an excellent account of how encoding projects like CBETA could make use of the TEI which last he described in a memorable phrase as being "a travel guide not a catechism"; he uses UltraEdit, a programmable two-byte aware editor, to input and tag the Chinese corpus according to a modified TEI Lite, and delivers the result for browsing via Microsoft's free HTML Help wizard (an ingenious idea I propose to steal).

My account necessarily omits whole strands ofthe conference devoted to broader non-textual collections of cultural artefacts, and much discussion of the opportunities afforded by the fusion of networking and digital library, as well as some of the more specialist topics discussed, simply because one cannot be in two places at once. Nevertheless, I hope I have given some sense of the diversity and richness of this exceptional conference.

2.3. ICAME 19
Slieve Donard Hotel, Newcastle, Co. Down.
>20-24 May

ICAME, the annual get together for corpus linguists, was held this year in a luxurious Victorian hotel in Newcastle, Co. Down, where the mountains of Mourne come down to the sea (and they do), organized with great panache by John Kirk from Queens University Belfast. The food was outstanding in both quantity and quality, there were magnificent sea views, the weather was splendid, and the conference programme was full of substance and variety. With about a hundred participants, the conference was not too big to allow for plenty of interaction and socializing, even with a fairly crowded programme of sixty formal papers and a dozen posters spread across three days. Also, I counted nearly twenty presentations of one kind or another explicitly derived from work using the BNC, which was (as they say) nice. In what follows, I summarize briefly only those sessions I attended, passing over in silence a few I could not attend either through an inability to be in two places at once, or through conference-fatigue.

For the third time, it had been decided to hold a separate one-day Historical corpora Workshop as a curtain raiser to the main event. This was opened by Matti Rissanen's annual survey of projects and resources for diachronic corpus studies, now available or underway, in which he also said some nice things about the Oxford Text Archive. Irma Taavitsainen from Helsinki reported on a study of what she termed "metatextual" comments in a corpus of Early English medical texts dated 1375 to 1550: the claim was that both form and incidence of such comments as It is to be seen that, I will now demonstrate etc change over this period to reflect other changes from personal to impersonal modes of narrative, or from an oral to a literary style. Terttu Nevalainen, also from Helsinki, presented CEEC, an interesting attempt to apply sociological criteria in the design of a Collection of Early English (1417-1681) Correspondence. Of necessity, the 6000 letters making it up had all been taken from published editions, (thus introducing a rats-nest of copyright problems), and also thus to some extent reflecting modern editorial practices in such matters as spelling or selection procedure (less than a sixth of the letters are from women, for example). Nevertheless, the corpus is clearly of immense interest as a way of quantifying and detailing hypotheses about such phenomena as the social distribution of you/ye forms, or the changes from -s to -th verbal forms, or the use of multiple negation.

In the first of two papers about the Lampeter corpus, I rattled through an account of its design and encoding, focussing mostly on the problems of getting from a semi-SGML form of markup to one which actually validated against a TEI dtd, before Claudia Claridge (Chemnitz) reported on some real work done using it as a source of information to substantiate theories about the development of scientific thought; for example, the lexical patterns typifying the empirical approach, and the gradual replacement of active voice reporting forms by passive constructions over the century that this corpus samples.

After our first experience of a Slieve Donard lunch, Douglas Biber (Arizona) author of a recently-published book on corpus linguistics, described an application of factor analysis techniques to 18th century English. Unlike previous work in which the parameters studied had been derived from 20th century material, this work began by constructing a number of specifically 18th century "dimensions" along which the rates of occurrences for a wide range of linguistic features were shown to cluster in different ways in texts taken from ARCHER corpus. Different genres are then identifiable by their location along these dimensions.

Ann Curzan (Michigan) reported on a study of the shift from grammatical to natural gender agreement for anaphoric pronouns across the period 1150-1215 in the Helsinki corpus: her study shows that this was by no means a simple transition, involving both factors relating to both lexis (some nouns remaining stubbornly gendered) and syntax (eg distance between anaphor and antecedent). Christer Geisler (Uppsala) presented a mass of data relating to postmodified clauses in the tree-banked version of the Helsinki corpus, the purpose of which went straight over my head. Gerry Knowles (Lancaster) was also fairly recondite as to methodology, but the purpose was plain enough: to identify the origins of northern varieties of English by analysis of evidence from dialect maps rather than on the assumption that they derived from some homogenous Middle English dialect.

The Historical Workshop closed with a discussion as to whether or not it was A Good Thing, or whether it would be better to roll its contributions in with the rest of ICAME. Since ICAME's expansion is now officially International Computer Archive of Mediaeval and Modern English, I felt that it probably should; others, perhaps more territorially minded, disagreed, and we all adjourned to the bar.

Next day ICAME proper began with John Kirk explaining the structure of the event: there would be a a series of themed sessions, focussing on major corpus development initiatives: today ICE, tomorrow the BNC, then Birmingham, and so on. He'd also planned a session on dialectology, but the dialectologists had not co-operated. John reminded us that the result of the Northern Ireland Referendum would be due around tea time on Saturday, and that we all had a copy of the consultation document in our packs, so that we could consider whether or not the verbs exercise and discharge were in fact used synonymously. A special excursion in the evening would take us to a secret location near Belfast where the joint Anglo-Irish secretariat would ply us with drink before the Referendum result hopefully removed it (the secretariat, not the drink) from existence.

The ICE session began with a presentation by Baz Arts and Sean Wallis from UCL of the ICECUP annotation and search software. This is a classy piece of Windows software which allows you to search the completely parsed ICE-GB corpus of one million words in terms of its linguistic annotation. The interface looked a bit like the SARA query builder screen, or the Linguistic Database (LDB) searching software developed by Hans van Halteren for the TOSCA project many years ago -- but on steroids. The idea is that the user defines fuzzy tree fragments -- templates for nodes which are to be matched in the complete syntactic tree. Each node has three parts (function, POS category, feature) and the arcs can be directional, or non-directional. The system performed well and looked good, but Sean spent rather a lot of his time explaining how model based systems query front ends like this were an improvement on those based on logical expressions, which whether true or not was fairly uninteresting to non computer-scientists. There were promises of an enhanced and extended ICE-GB corpus to be developed later, with the software bundled, presumably contingent on grants coming through. I asked how the system handled contextual queries (since metadata wasn't included within the nodes) but didn't understand the answer. Chuck Meyer, a real linguist, then reported his experience of using the new system, and in particular of comparing its usability with his analyses of the same corpus published in 1996: he focussed however on minutiae of the results rather than usability issues which meant that I rapidly fell asleep.

I awoke (briefly) to hear Atro Voutilainen and Pasi Tapainenen from Finnish language engineering company Conexor, newly formed to exploit the run away success of their English Constraints Grammar Parser, probably the most widely used and amongst the most successful of current automatic tagging systems for English. A new version called Englite is now available on the web. Most of the technical detail of the Finnish team's impressive work is available in publications elsewhere, to which he gave several useful pointers He also demonstrated some Unix tools for processing the parser output, for example to produce new groupings of idiomatic phrases. The tools look good, but you have to be a true believer in dependency grammar to get the best out of them.

Jim Cowie from New Mexico State repeated some fairly well-worn observations about the role of corpora in improving automatic translation, citing some nice examples (how to distinguish storms of ice cream from snowstorms in a Spanish text) and also showing some nice software. He said there was a need to enhance the lexica used by your average MT system with frequency and contextual information, which is as true now as it was when Mike Lesk said it in 1986. Ylva Berglund presented World Wide Web Access to Corpora from Essex, a JTAP project which aims to demonstrate how corpora can be used in language teaching. Her presentation was meticulous but the project remains seriously underwhelming from my point of view. Its future remains unclear and there were some politely critical remarks from the floor about the need for such a project to be a little less self-promoting, and maybe more extensive in its coverage. The afternoon was given over to software demos, which went without a hitch: I demonstrated some of the spiffier bits of Sara, in particular how to use it to examine differing usage patterns for the word pretty as an adverb by men and women; the Zurich team demonstrated their impressive web front end to SARA, which they promised we could distribute for them when it was ready; and I went for a walk along the beach with Tony Mcenery. As to the secret reception, we went by bus for miles along twisty Northern Irish roads to a place surrounded with barbed wire which looked rather like a converted school hall, where everyone had a lot to drink, and I explained at least seven times to different people what the state of the BNC currently was.

Next morning began with an hour of deeply statistical discussion by Professor Nakumura and colleagues from Tokushima University, concerned with various methods of automatically identifying text-type within the LOB corpus. The rest of the morning was largely devoted to papers reporting work done with the BNC. It began however with two serious papers from Douglas Biber and Geoffrey Leech both derived from their Corpus Grammar work. Biber's was mostly on the ways lexical patterns explained explain different usage rates in different registers for apparently interchangeable constructions, in this case complement classes (verb+that vs verb+to). Leech's focused on conversation, and proposed some interpretations for the characteristic patterns of difference found amongst the four basic genres analysed in their grammar. Speech is characterized by shared knowledge, an avoidance of elaboration, a plethora of interactions, a need to express emotion and stance, freqent repetition of set phrases and (because of its time-based nature) frequent front or back loading of syntactic structures. All of these can be shown to underly the significantly different syntactic patterns found in speech.

After coffee, Tony McEnery lowered the tone of the proceedings by reporting with relish his investigations into the naughty words used throughout the BNC. He produced a number of examples to demolish various intuitive claims about who swears about what to whom published by one G. Hughes, and also remarked in passing how odd it was that the Norwegians came in for so much invective in the spoken part of the BNC.

Hans Martin Lehmann and Gunnel Tottie from Zurich reported a technique and some results for the automatic retrieval of adverbial relatives (e.g. this is the place where/at which/that/0 he found it) and for investigating their different usage patterns. Automatically retrieving zero-marked relatives is particularly tricky, even for the ingenious Lehmann, involving running the untagged text through the Helsinki parser, to identify potentially appropriate patterns, which are then manually checked. Apparently, the manual search missed 20% of the cases found by the automated process -- but the program was entirely at the mercy of tagging errors in the parser.

Sebastian Hoffman, also from Zurich gave a thoughtful presentation about the collocational evidence available from the BNC. Most people believe that native speakers know many complex lexical items: but empirical evidence showed that for relatively infrequent words there were disparities between predicted collocates and those actually attested in the corpus (using log-likelihood measurement of the collocational strength). The question of how speakers recognize such combinations in rare words remains open.

At this point, and in a major departure from ICAME norms, the conference split into two parallel strands. I sat tight for two more presentations on the BNC. Roberta Facchinetti (Verona) presented a study of preferences for will vs going to in speech and writing, somewhat marred by the observation that the written part of the BNC did not reliably distinguish reported speech. Jurgen Gerner (Berlin) discussed the increasing use of they as what should grammatically be a singular pronoun, used anaphorically to refer to everyone or somebody (as in everybody should do what he/they can): there seemed to be a preference for the singular only with the some- form.

After the usual extravagant and irresistible lunch, we settled down for an afternoon of presentations from Birmingham. Ramesh Krishnamurthy briefly described TRACTOR, a research archive for the various corpora and resources being created by and for the TELRI project, which has recently obtained a further three years funding for its pioneering work in corpus-ifying the languages of Eastern and Central Europe. The archive will include a number of interesting tools, integrated into a single framework, as well as the various corpora already produced and in production by TELRI member sites. Ramesh was at pains to distinguish the project from Multext East (less specific) and Parole (more specific), but TELRI still sounds more of a club than a project to me. Not that there's anything wrong with clubs (especially if they will admit me as a member, which I did my best to persuade them they should).

Geoff Barnbrook made deprecating remarks about the Bank of English and COBUILD, briefly touching on the political fallout from the recent massive "downsizing" of the latter, before giving a fairly anodyne description of a parsing system under development for the definition texts of the COBUILD dictionaries. He was followed by Oliver Mason, who again discussed the notion of lexical gravity in collocation together with some useful techniques for its automatic calculation: this remains no less impressive than when I saw it presented at PALC last year, but does not seem to have advanced much since then. Sue Blackwell discussed how the words look and well are used as markers of discourse function in a range of examples from the Bank of English; Willem Meijs gave a fairly thorough overview of national stereotypes as revealed by mutual information scores, but came to no firm conclusions that I noticed, perhaps because my attention was beginning to flag by this stage of the day.

Revived by tea, I plumped for the parallel session for annotation-dweebs (thus regrettably missing three papers on the slightly unlikely topic of dialect studies in the BNC) and managed to follow quite a lot of Eric Atwell and Clive Souter's discussion of the problems of mapping between the outputs of different parsing systems. As part of the Amalgam project, they had attempted mappings between nine different parsers and (despite the best efforts of Expert Advisory Groups for Language Engineering Standards) concluded that for syntactic analyses at least there simply is no obvious or even non-obvious interlingua. Even something as simple as labelled bracketting is controversial if you happen to be mapping an analysis based on a dependency rather than a phrase structure grammar. This dispiriting news was followed by an interesting paper from Yibin Ni (Singapore) who had been trying to make explicit by tagging some fairly recondite co-referential relations in discourse, but who did not seem to have hit on any notational scheme adequate to the purpose. Final paper in the session was from Geoff Sampson, presenting with characteristic clarity some of the problems in trying to define an annotation scheme that can guarantee consistency if application across corpora of transcribed speech: such common features of speech as repair and truncation wreak havoc with the best designed syntactic tagging schemes, to say nothing of the gaps in an analysis caused by <unclear> elements. The day concluded with a mammoth session in which each of the poster presenters got five minutes to announce themselves, and a reception at which John Kirk tried to explain some salient features of Northern Irish linguistic history. I think.

My notes on the next day begin with Vincent Ooi (Singapore) who promised to explore the different "reality" evidenced by collocational data from Singaporean and Malaysian English, but instead gave only what I found a rather impressionistic account of some multi-word phrases in English as she is spoke in the Straits. I did however learn that Singaporean lifts are equipped with devices which sound an alarm should anyone be taken short whilst inside: these are rather unimaginatively known as urine detectors.

Martin Wynne (Lancaster) presented the results of an interesting comparison between two part-of-speech taggers: CLAWS, from Lancaster, and QTAG, from Birmingham. The comparison was effected by running both parsers on the same corpus (the written half of the BNC sampler), mapping the results into the EAGLES recommended annotation scheme, and comparing the results. Martin conceded that this was grossly unfair on several counts: the EAGLES scheme is much closer to CLAWS in the first place; CLAWS was trained on the BNC; in cases of ambiguity, CLAWS uses portmanteau tagging, whereas QTAG gives a prioritized list from which they always took the first. He also spent a lot of time saying that he didn't regard the results (in a corpus of over a million words, CLAWS disagreed with the reference scheme 2% of the time, and QTAG about 15%) as proving anything, in which case one couldnt help wondering why they were being presented.

Antoinette Renouf (Liverpool) reported on the next phase of her unit's continuing and (I think) unique work on neologisms. The AVIATOR project, which monitored the appearance of new words in newspaper texts over a four year period, demonstrated that such words tended to have very low frequencies, thus requiring the development of rather rarefied statistical techniques for their detection and classification. Such techniques having been developed in collaboration with Harald Baayen from Nijmegen, Antoinette announced the arrival of a new project called Analysis and Prediction of Innovation in the Lexicon (APRIL), the aim of which is to develop a system of automated classification, accounting qualitatively and quantitatively for the features found in huge amounts of text, and then extrapolate from this to predict the structure of the future lexicon. Allegedly.

Graeme Kennedy (Victoria University of Wellington), author of the other newly published book on corpus linguistics, presented an intriguing paper on Maori borrowings into contemporary New Zealand English, couched largely as a comparison between those reported in the new dictionary of NZ english and those attested in a recently compiled corpus of spoken New Zealand English. Apparently, 77% of the words in the dictionary are not attested in the corpus, while 86% of forms found in the corpus don't appear in the dictionary.

Under the title It's enough funny, man, Anna-Brita Stenström reported on some features of teenage talk familiar to those with teenage daughters (e.g. use of enough and well as adjective pre-modifiers) but maybe not to others. In traditional ICAME fashion, she presented a mass of useful and interesting evidence for these usages, and their typical contexts based on searches of the COLT and the BNC; she had also used the online OED as a source of comparative diachronic information, enabling her to reveal that enough as a premodifier appears about 800 times in OED citations, while well is well frequent in Old English, thus suggesting that teenagers have merely rediscovered an enough established feature of Early Modern English.

Due to other committments, I had to make my excuses and leave at this point, thus missing amongst other things a report on the Lancaster multimedia corpus of children's writing as well as the closing celebrations. Despite this sacrifice, I still somehow managed to missed my flight home and had to spend an extra night at the Belfast airport hotel. Even this did not dampen my enthusiasm for the event: one of the best of a long series, and a hard act to follow.

2.4. Language Learning and Computers
Chemnitz University of Technology, 20-23 Feb

Chemnitz is a large town in Saxony (you may have heard of it under the name of Karl-Marx-Stadt, when it was a large town in the DDR) notable for a well-established University, which has just had the good sense to elect as its pro-vice-rector (sounds less implausible in German)Prof. Dr. Josef Schmied, director of the REAL English language centre. Josef is also a corpus nut, and the originator of the Lampeter corpus project, which is why I was pleased to receive an invitation to visit his centre for a two day workshop on language teaching with computers. funded by the British Council in Germany.

I was one of four invited Brits, each of whom was allocated 90 minutes talk on the first day; the audience consisted of about 40 assorted teachers from other German Universities and Language Centres some of whom were allowed to speak (but only for 20 minutes each) on the second day, before a round up discussion session.

Before all this began, however, there was a little ceremony in which a Very Important Person from the British Council in Berlin and the Vice Rector himself signed a renewed cultural exchange agreement which would ensure that Chemnitz students might continue to enjoy partnership arrangements with a variety of UK universities in the interests of greater mutual understanding, the spread of the true British language, and so on. We also had a little talk from Frank Frankel, now retired from the Council but retained as a consultant in co-ordination of language centres throughout Germany, and a long-time enthusiast for CALL, as to how its wider use might change the roles of teacher and learner.

Josef Schmied gave a brief overview of curent projects at TUC involving language-learning with computers (LLC), of which probably the most interesting is the Internet Grammar project: a web-based system for language learners, teachers, and linguists to collaborate in developing a kind of translator tool for the teaching of English grammar. Unlike the UCL project of the same name, which is monolingual, the Chemnitz project is aimed very specifically at translators' syntactic needs; it also aims to synthesize inductive and deductive procedures to establish rules.

Hilary Nesi, from Warwick, gave a well focussed and informative review of the pedagogic usefulness (or otherwise) of a range of English language dictionaries available on CD-ROM; she also waxed lyrical on the potential classroom usefulness of the humble hand-held dictionary-type device, if only their manufacturers could be induced to give them more linguistically-oriented features instead of overloading them with electronic organizer type facilities.

Geoffrey Leech, from Lancaster, gave a very interesting talk about the grammar of spoken English, derived from a major new work on which he is collaborating with Stig Johansson, Doug Biber, Ed Finegan and others. The corpus underlying this is a 20 million word corpus comprising the spoken part of the BNC and a parallel corpus of American spoken material, collected according to the same principles by Longman, but not (so far as I know) available to anyone outside the project. Beside presenting and analysing an impressive mass of comparative data, with far more detailed examples than I can present here, he made some very cogent suggestions about their potential relevance to language teaching. The availability of well based frequency information for various syntactic constructions could be used not just to determine actual usage, and thus to provide authentic examples of usage, but also perhaps to help determine which grammatical structures should be taught, and which were most appropriate for different varieties of English, or types of discourse -- both areas in which language learners have most difficulty -- thus (to use Geoff's phrase) "maximizing communicative payoff" both in production and reception.

I gave the standard talk about the BNC again, including a live demonstration of the BNC Sampler, and a little coda stolen from Guy Aston about how the BNC can be used by learners to challenge their teachers, which seemed to go down well.

Gary Motteram from Manchester's Centre for English Language Studies in Education discussed some of the pedagogic issues in using IT for language teaching, based on his extensive experience in running a Masters degree in TESOL, which is now done entirely online, and sounded to my relatively untutored ear like a model case study in IT-based distance learning. He demonstrated, and advocated, the use of Toolbook as a low price, easy access, tool kit for the construction of powerful courseware.

At the end of this long day, the British Council took us all to an a rather unusual performance at the Schauspielhaus. Das Ballhaus is a kind of musical-cum-ballet, presenting fifty years of German history through the device of a dance hall, in which a massive and highly skilled cast waltzed, tangoed, drank, and generally overacted through the terrible twenties, the even worse thirties, the unspeakable forties, and the unbelievably depressing fifties, before succumbing to the degenerate sixties, the cacophonous seventies, and the incomprehensible eighties. The piece lasted three and a half hours, with an interval, and the theatre was unbearably hot, but it was oddly compelling, being full of humour and incident, and quite the most extraordinary display of really bad tailoring I have ever seen. Although entirely non-verbal, and although almost all of the music was entirely familiar and accessible, it also demonstrated how far shared knowledge and experience is essential to understanding a culture -- the second half in particular being full of incidents which the East German contingent found riotously funny or touching but which left the rest of us entirely bewildered.

The next day was given over to a mixed bag of presentations from other happy recipients of the British Council's bounty (in the shape of exhange agreements with UK universities). Regrettably I missed the first of these -- a group from Rostock describing a proposed diachronic newspaper corpus comsisting of samples of high, middle, and low brow British newspapers from three centuries. Speakers from Magdeburg's Otto von Guericke University and from various units of Potsdam University described their experiences in setting up IT-based learning resource centres, aka Self Access Centres or SACs, and in using the internet as a source for such things. There were also brief consumer reports on a various workshops attended (thanks to British Council funding) at a number of UK institutions. Sound principles (teachers should act as guides not experts; teach how rather than what etc.) were articulated and many favourite web sites cited.

Dr Thomas Bellman, from Leipzig's Hochschule fur Technik Wirtschaft un Kultur, demonstrated some nice software for cataloguing ELT resources: it looked like your standard bibliographic database, but included hot links which would start up bits of courseware directly from the record, screens which enabled you to pre-define subsets of records according to ELT-relevant categories, and buttons for access to a web browser, a wordprocessor, ond even a virus checker, all embedded within the same interface. I did not have the heart to murmur "OLE?"

Dr Jurgen Martini (also from the Otto von Guericke University in Magdeburg, but a different part of it) gave a defiantly non-IT-based presentation about the difficultiues of teaching cultural studies which kept promising to open up a new area of discussion, but didn't quite do so. The courses he teaches sound very interesting though.

The Workshop's final speaker, Bernd Rueschoff (from Karlsruhe), has the unusual ability to make CALL-theory sound interesting, as well as knowing how to make a very impressive presentation. He covered a very wide range of pedagogic issues, ranging from the need for authenticity in language learning and cultural studies, the use of technology as a means of enriching rather than replacing the traditional learning environment, and the drive towards active participation in learning rather than passive acquisition of information. He also had some practical advice to offer on how exactly to achieve these motherhoods, based on substantial experience.

I concluded from all this that, like the rest of us, German institutions have both Language Centres and Linguistics Departments, and are not quite sure whether IT has a role to play in either, neither, or both. They are also rather vague about whether corpus linguistics belongs in linguistics or cultural studies, or both, or somewhere else. All of these constituencies were represented, eying each other somewhat nervously, at the workshop; and if there wasn't a lot of rapprochement, there was a respectable amount of talk. Recurrent topics in the discussion included the impact of "self-access" as a way of learning rather than simply as a means of doing pre-set assignments, and doubts as to its general applicability for all students; anciety was also expressed about the possible dangers of allowing students access to politically incorrect notions over the internet, and the need for reliable assessment of networked resources, at which I felt a plug for Humbul would not be out of place.

I stayed on for an extra day after the workshop ended in order to show the Chemnitz team what I had been doing with their Lampeter corpus, and to discuss plans for future collaboration. This took so long that I never got round to doing any respectable amount of sight seeing, which is why this report is regrettably deficient in my reactions to the new Germany, the architectural oddities of Chemnitz (yes, Karl Marx's head is still there, just across the road from McDonalds) and Dresden (wonderful palace -- shame about the traffic system), or the delights of Saxon cookery. But it's probably long enough as it is.

2.5. TEI Workshop
SSLMIT Forlì
16-22 April 1997

The Scuola Superiore di Lingue Moderne per Interpretore e Tradutori is one of the very small number of University-level institutions in Italy dedicated to the training of interpretors and translators. It has a high national reputation, only partly inherited from its illustrious parent, the University of Bologna, since it is in fact located at Forlì, a pleasant Emilian town some 30 kilometres from Bologna,on the edge of one of Italy's major wine growing areas, and close to Predappio, birthplace of Il Duce.

I was invited to teach a full TEI workshop to a mixed group of about fifteen 3rd and 4th year students, all of whom were fluent in English, and had already had some exposure to computing methods and results by virtue of using the BNC and SARA. The main objectives were to explain some basic markup principles, to give some hands-on experience of other SGML software, to demonstrate the extent to which the usability of a computer corpus is determined by its markup; and get the students thinking about how they might prepare their own corpora. The workshop consisted of eight 90 minute lectures, three two hour practical sessions, and two discussion sessions, somehow squeezed into six days of fairly concentrated effort.

Before the workshop proper I gave as curtain raiser an open lecture on the British National Corpus, remarkably similar in content to the one I had given the week before in Lòdz, though couched in somewhat different terms. The emphasis was, naturally enough, on how the BNC actually used TEI-like markup. It was followed by the following sessions:

Theory of Text Encoding: a lecture on the motivation for encoding and the varieties, benefits, and dangers of markup.
Document Analysis: a brief presentation on what document analysis is, how to do it, and why you should bother.
Tagging Workshop: an exercise in document analysis, based on a document one of the students had worked on previously. The discussion was a little inhibited at this stage, as many of the students were still reeling from the shock of being asked to consider using a computer for something other than word processing.
Introduction to SGML and DTDs: this was a whistlestop tour through the syntax of SGML, requiring considerable amounts of stamina; it was probably the least successful of the straight lectures.
TEI Lite overview: straightforward introduction to the most generally useful core element tags in TEI Lite, basic notions of the TEI scheme etc.
Basic TEI Encoding Practical: the first of three 2 hour practical sessions. Importing an "Ascii-only" text into Author/Editor and adding some basic tagging, for titles, divisions, paragraphs, phrases, etc. Everyone worked on the same text (a very short story by Kate Chopin downloaded from the internet) to the same script, which was exhausting, but apparently enjoyable.
TEI Architecture: a lecture on the organization of the TEI scheme, how to mix and match tagsets etc. (For an Italian audience, the pizza model is a particular treat)
Using special tagsets: a second practical session, marking up a short passage of transcribed speech with a view of the TEI specialized for spoken texts, using Author/Editor. In this case, the text being imported alreadty contained markup for turns, overlap, etc. which had to be converted either manually or using macros (the brighter students found out how to do this for themselves).
The TEI Header: lecture on its motivation and contents, necessarily focussing on bibliographic matters of somewhat marginal interest to corpus builders, though I did try to draw some parallels with the needs of corpus builders for documentation.
Building a TEI Header: by the time of this third practical session, most of the students were getting quite confident in their use of A/E; they were now challenged to create a TEI Header for either of the two texts created previously, starting from a blank screen.
Overview of TEI tools: this lecture tried to explain the varieties of SGML and TEI software available: but focussed particularly on using nsgmls, various convertors, and Panorama. I wrote a Perl program to count sentence-initial patterns, and used Jade to turn the Chopin text into RTF.
TEI Tools Practical: this was largely an exercise in learning and using Panorama to display and search the marked-up texts prepared earlier.
My final lecture touched on some of the more recondite aspects of the TEI likely to be of interest to linguists (segmentation, feature structures, alignment etc.), in the overall context of what the TEI had to offer corpus builders. It also summarized very briefly what the course as a whole had been trying to demonstrate: that you get out of your corpus only what you put into its markup, and that it was up to the corpus builder to decide just what that should be.

In a final round up session, over coffee and cakes, the group voiced some concerns about the relevance of all this to the practical problems they will face as translators: some of them wondered if a TEI corpus would help them preserve private corpora of translated works; others were more interested in the availability of large public corpora like the BNC.

I must add that these students were a real pleasure to teach. They worked very hard to grapple ideas and methods initially quite unfamiliar to them, and (particularly in the practical sessions) worked with great enthusiasm. And they put up very politely with my hectoring style of teaching too. By the end of the week, they had definitely earned the certificates we handed out.

2.6. PALC97: Practical Applications of Language Corpora
University of Lódz
April 12-14 1997

It's somewhat of a cliché to describe Lódz (pronounced, approximately, Wootch) as the Manchester of Poland: this doesn't mean so much that it is blessed by an excellent football team, as that it is cursed with a major industrial past. At the turn of the century, it was a rich city, built on cloth, and with one of the most prosperous middle class communities east of Berlin. The centre of the town, and the University quarter, still boasts a number of the fine houses they built, some of them now being carefully restored; others still hidden beneath the grey dust and neglect that seems to live over the whole of this region of Europe. But then, of course, came the thirties, and the appropriation of Poland by invaders, first from the West, and then from the East. Lódz, I learned from my Guidebook, has the dubious honour of having been the first city in modern times to establish a ghetto and the ghost of that absence still haunts the place. Curiously, for there are now no Jews to be seen here, it's the star of David which is now daubed on walls, in contexts where elsewhere one might find the swastika daubed, in association with the swirling iconography of urban rage imitated from the inner cities of further west.

The teaching of English is a major growth area in Polish universities; for every student learning Russian or German, there are ten wishing to learn the language of McDonalds and Marks and Spencer, whose emblems now dominate the centre of Warsaw as well as that of High Wycombe. The British Council appears to be playing a major role in satisfying this demand, as witness its funding of this conference, and a number of other educational programmes aimed at secondary and tertiary English language teaching. An unusual and interesting aspect of these programmes is the recognition that access to (and study of) language corpora are of particular importance if there is to be a systematic improvement in the quality of English spoken (and taught) by Poles. Hence the organization of this well-attended international conference.

The four day event was hosted jointly by the British Council and the University of Lódz Institute of English Studies, represented respectively by Susan Maingay and Barbara Lewandowska, with assistance from James Melia, and took place in the University's recently constructed and comfortable conference centre. Each day began with a brisk walk to the University's council chamber for a plenary session, held beneath the stern depicted gaze of assorted dignatories in funny hats, followed by parallel sessions at the conference centre, combining project reports and research papers from a wide spread of corpus applications and interests. Evenings were given over to relaxation, discussion, and the opportunity to sample some excellent Polish hospitality.

Plenary sessions

The conference began on a high note, with an excellent lecture by Professor Michael Hoey (Liverpool) on the nature of the corpus linguistics enterprise, the questions it asked, and the answers it might provide. Asserting that corpora could be used to provide evidence of lexical patterns, of semantic prosodies, of syntactic patterns, of colligation, and even of text grammars, Hoey proceeded to discuss striking examples from each of these categories, of which I will summmarize here only what he called "the drinking problem principle". If someone has difficulty in drinking, it will generally not be referred to as a drinking problem because the more common collocational sense is inappropriate. In the same way, corpus evidence, rather surprisingly, demonstrates that following a posessive adjective (my, our, his etc) the plural form reasons is always preferred to the singular reason when the intended sense is "cause" (our reasons for doing this...) rather than "rationality" (to lose one's reason).

The second plenary speaker was Tony McEnery (Lancaster Univ) introducing the notion of what he called multimedia corpora. He stressed the need to introduce visual information as a context for understanding verbal material, and showed us a few pages from the corpus of children's writing and drawings currently being created at Lancaster, which will be distributed freely over the web.

The final plenary speaker was Patrick Hanks (OUP) who gave a bravura demonstration of the problems that corpora give lexicographers. What exactly is it that lexicographers do when they go through the lines of a concordance assigning each one to some sense or another of a word? and how on earth do they do it? Hanks has published several papers on this, and worked with some of the best computational names in the business (Atkins, Fillmore..) but he's honest enough to say he still has no definitive answers. His presentation focussed on a few interesting examples: the word "baked" for example, which seems to require certain lexical classes (not just edible food stuff, but specific categories of them) and the word .

Parallel sessions

In the nature of things, I couldn't attend all of these. The exigencies of time and space meant that I had to choose to miss presentations from inter alia: Bengt Altenberg (Lund), Wieslaw Babik (Krakow), Michael Barlow (Rice), Simon Botley (Lancaster), Igor Burkhanov (Rzeszów), Doug Coleman (Toledo), Martha Jones (Nottingham), Dorothy Kenny (Dublin), Bernhard Kettermann (Graz), Przemyslaw Kaszubski (Poznan), Anne Lawson (Birmingham), Barbara Lewandowska-Tomaszczyk (Lódz) , Belinda Maia (Oporto), Michal Pawica (Kraków), Margherita Ulrych (Trieste), and Maciej Widawski (Gdansk), whose names I list to give at least some indication of the geographical spread of participants. What follows by way of review should therefore be regarded only as a sample of the concerns raised and materials discussed -- though not, I hope, too unrepresentative a one.

Stig Johansson (Oslo) reported on the progress of the English-Norwegian parallel corpus project, now expanding to include up to a hundred texts in other European languages, notably German, Dutch, and Portuguese. The well thought out design of this corpus allows comparison both between texts translated from language A into language B, and the reverse, although it is not always easy to find sufficient texts to do this (there are far more English-Norwegian translations, for example, than the reverse, simply because it is hard to find comparable texts) In a separate evening session, he described some of the software developed for the project, in particular the automatic alignment procedure developed at Bergen by Knut Hofland and a Windows retrieval package developed by Jarle Eberling at Oslo.

Michael Rundell (Longman) gave a pleasant presentation about corpus evidence for the British fondness of understatement, in particular the phrase not exactly and similar ironic uses. It's probably not too controversial to say that this was not exactly unfamiliar to those who had heard him speak at TALC last year, but none the worse for that.

Guy Aston (Forlì) contrasted the pedagogic usefulness of large corpora such as Cobuild or the BNC with that of small specialized corpora drawn from, specific text types, such as a 14,000 word "hepatitis corpus" in use at Forlì. Small corpora are more easily managed by the language learner, and their lexis can be studied in extenso; learners can use them to practice their inductive powers, hypothesizing lexical, colligational or collocational patterns, which may or may not be confirmed by examination of large reference corpora. In this respect, it is possible to get the best of both worlds.

Akua Anokye (Toledo) described some interpretative problems in analyzing the transcriptions of Afro-American folk narratives recorded on aluminum disk by Hurston, Lomax et al in the late twenties and now stored in the Library of Congress. She had transcribed some of these recordings, using her own scheme, and presented a largely impressionistic account of the interplay between their phonological and contextual features.

Sylvia Shaw (Middlesex) described how corpora had influenced the production and format of the third edition of Longman's Dictionary of Contemporary English. This had included both the use of frequency information, derived from large corpora, and particular attention to typical language learner errors, derived from corpora of language learners' production. Thus the student can be advised, for example, of the range of things to which words such as beautiful are typically applied by native speakers, which is much smaller than that used by language learners, and given advice on how to choose between near synonyms such as error and mistake.

Raphael Salkie (Brighton) quoted a number of French writers' opinions about the differences between French and English, intended to help translators as rules of thumb (e.g. ). His paper reported some interesting work on the extent to which these perceptions were born out by corpus evidence, and gave a brief overview of the Intersect project.

Chris Tribble, (Lancaster, Reading, Warsaw) picking up Guy Aston's paper on the benefits of small corpora, suggested that for class room use, small corpora were of more use than large as well as being more accessible. His paper reported on some experiments using Microsoft's Encarta as a language resource (as well as a source of factual information) noting that the type of language it contains is very similar to that which language learners are typically required to produce: brief factual articles.

Oliver Mason (né Jakobs) from Birmingham presented what was in many ways an exemplary research report about some very interesting work he has been doing on identifying statistically the size of collocation spans, by calculating the type-token ratio of the words appearing in each position to the left and to the right of the keyword. The results are striking: different node words exhibit strikingly different patterns of influence on their neighbours, giving a visual hint of the extent to which they construct fixed phrases, for which he proposed the term lexical gravity.

Sylvia Scheur (Poznan) discussed several aspects of her research into the pronunciation difficulties faced by Polish-speaking learners of English. She had recorded 17 Polish language-learners reading the same English texts at the start of their course and a year later, and was transcribing these phonetically (using the SAMPA writing system for the International Phonetic Alphabet, developed by John Wells). Students were also asked to assess their own performance, which produced some interesting comments about their perceptions of English prosody.

I gavethat paper about the BNC and SARA again, spiced up somewhat for this audience by the addition of hints about forthcoming availability of the corpus outside the EU, and also with the first ever live demonstration of the sampler corpus.

Philip King (Birmingham) gave an overview of the Lingua multilingual parallel concordancing project, now in its second phase of existence, with a particular focus on some of the pedagogic software being developed at Birmingham for its exploitation, its use in generating course material for student use, and the ability to browse and search parallel corpora. It would be interesting to compare the methods and results of this project with those of the ENPC described by Stig Johansson, but no-one had the temerity to do so, in public at least.

Social events

Like other academic conferences this one was oiled by a couple of very pleasant evenings, drinking, dining, and discussing. Particularly memorable was an evening concert of baroque music by Teleman and Handel played on original instruments, followed by a splendid buffet dinner. This was held in one of the afore mentioned bourgeois palaces and much appreciated by all. After a couple of glasses of very drinkable Hungarian wine, McEnery, Ketteman and I were able successfully to empress all the help we needed to make TALC 98 (hopefully, to be held in Oxford next July) a reality.

I also took the opportunity of a free Sunday at the end of the conference to visit Arkadia: this is an ornamental garden full of picturesque gothic ruins, classical statues, and the like, originally laid out by the local aristocracy in the 18th century, and now a pleasant enough place for a Sunday afternoon stroll. Getting there involved a detailed and educational study of Polish regional railways and bus services, to say nothing of the refreshment room at Lowicz railway station (which I won't).

2.7. Trip Reports from various places in the USA (12-20 Jan 1997)
[Eastern Michigan Uni; UM Ann Arbor; Indiana U; Berkeley]
A visit to the USA (12-20 Jan 1997)

I spent ten days on the other side of the Atlantic last week, much of it in the frozen midwest, quite a lot of it hanging around Detroit Metro airport, waiting for planes which were either not due for hours, or which were, but had been delayed for similar periods of time. The only place I found there capable of serving a decent cup of coffee had no spot to plug in my laptop, and the food doesn't bear thinking about. Fortunately, when not in airports I was well looked after, and well fed, by a number of friendly natives. I was rained on in snowy Indiana, and saw some nice tourist sights in foggy San Francisco. For more details, follow the links below...

The Linguist List, Eastern Michigan University
Humanities Text Initiative, University of Michigan
Variations Project, Indiana University
University of California at Berkeley

2.7.1. Consulting with the Linguist List
12-14 January 1997

For those who don't know it, the Linguist list is a well-established ListServ list, read by tens of thousands of linguists of all varieties world wide. The list is hosted by Eastern Michigan University, a small (by American standards) campus just up the road from the University of Michigan. Ever anxious to improve and extend it, Linguist's presiding deities Helen Dry and Antony Aristar last year applied for and received a grant from the American National Science Foundation to move the list to a new plateau of serviceability, including bibliographies and source texts, abstracts, and all manner of wonderful things. To their credit, they decided that SGML was just the ticket for this purpose, and solicited consultancy and advice from their readership. Somewhat to my surprise, my response to this request lead to my being invited last autumn to come and discuss the project with them and their colleagues, which I duly did last week.

The consultation took place in the Aristar-Dry's rather splendid architect-designed home in snow-bound Ypsilanti in the shape of a two day informal workshop. The three moderators of Linguist (Aristar, Dry, and Dan Seely) were present, along with John Remmers, their technical editor,and several of the graduate `associate editors' who currently have the thankless task of editing and controlling the hundreds of messages which arrive for the list every day. Also participating were another SGML consultant, in the shape of Gary Simons of the Summer Institute of Linguistics, with whom I have had the honour of working previously on TEI matters, and Ron Reck, billed as Linguist technical editor, a former EMU linguistics graduate, now moving to a better paid job in Washington.

We began with an overview of the software they currently use. The whole operation is hosted on the University's unix system, but is about to move to its own Digital Unix box. The software used, developed by Remmers, is a home-grown but functional mixture of shell scripts and C programs, embedded within the Unix mailer. Considerable care is needed to ensure that files are correctly processed, keyworded, and proof read, to say nothing of vetting for obnoxious content, but the procedures for doing all this are well understood and documented. Remmers then presented a good overview of where they would like to go in developing a new system, and some suggestions about how they might get there. The chief problem areas highlighted were in character handling, the need for various format conversions (the list is currently distributed both as email and in HTML from its web site), and in making searching more sensitive, i.e. context aware. I gave, without benefit of overheads, a brief sermon on the advantages of SGML, focussing in particular on these issues, and on the desirability of document analysis before proceeding much further, and we then all adjourned for a rather tasty Japanese dinner in Ann Arbor.

The following morning, Remmers came to take Simons and me out to a decent breakfast, over which we were able to get a little more specific about his currently proposed SGML dtd, the chief drawback of which is its lack of any structure within the body. Simons then gave an illuminating overview of the key issues in making the world's character sets uniformly accessible by today's computing systems. He concluded, unsurprisingly, that the only workable long term answer is to use Unicode/ISO 10646, but not without having given at least as much time as they deserved to eight other possible approaches, and giving as much technical detail as anyone might reasonably require about the current state of such mysteries as the Basic Multilingual Plane, how to shoehorn Unicode characters into Windows code points, whether and where to use Unicode entity references, and how to ship UCS-2 documents as Mime attachments.

In a final discussion session, the group reviewed the state of affairs, and started trying to identify what kinds of document their DTD should handle, which services their new system would be expected to provide, and what software would be appropriate. I tried, not very successfully, to get the concepts of document analysis across, and rather more successfully to persuade them that individual messages should be their primary data objects. This would enable them to produce personalized issues and automatically retrieved summaries in addition to email and HTML versions of the current "Linguist Issue", particularly given the availability of more detailed and accurate indexing of the message contents than exists at present. There was some discussion of what should go into these indexes, and how it should be controlled. A consensus was established in favour of Open Text 5 (which they have been promised for a paltry $1500) as a suitable indexing tool.

The project has only just begun, and its grant is small. However, this is an exciting time to be beginning such a project, with the announcement of XML and other relevant developments. It has a daunting task ahead of it, given the need to maintain the current level of service, but the team seems both highly motivated and technically competent. I will be watching it develop with interest.

2.7.2. University of Michigan, Ann Arbor
14-16 January

The Humanities Text Initiative at Michigan University is an organization not entirely unlike our very own Humanities Computing Unit, but based within the University library and rather more solidly funded. It is run by John Price-Wilkin, with whom the Oxford Text Archive has long co-operated. I arrived unannounced in the evening and was pleasantly surprised to find a room in which a pair of industrious graduate students were tending the electronic production line that is the HTI American Verse project. At one end were real books, each with a little card marking its current status; at the other, proof-read and consistently TEI encoded versions of them: on the table was a well thumbed copy of TEI P3. The OCR software, called Typrereader, looked rather sophisticated to me: it is used to produce a markup-free text, proof read carefully against the original. Markup is introduced at the next stage, using Author/Editor. Texts are loaded into Panorama, and tagged printout is proof read again. I did not get any figures about the throughput of this part of the highly professional Digital Library Production Service, but it is clearly doing well enough to keep the HTI's web site busy, as well as servicing a number of other products. It is interesting to note the shift in emphasis away from collection and research support towards content creation and service provision.

Price-Wilkin had a busy schedule the next day, during which he somehow made time to give me a brief overview of the organization, and describe in full detail some work he has been doing with Dr Johnson's dictionary. He also allowed me all too brief access to the riches of their collection, and the use of his office for an unrelated TEI telephone conversation before taking me to lunch, where I was introduced to the dubious delights of the chilli dog.

A number of different approaches have been taken at Michigan to the perennial problem of providing good quality humanities computing support. Initially, they had set up a `collaboratory' -- a personalized computing facility which particular named scholars could apply to use for suitable projects over a fixed period --- but this had not been altogether successful. Take-up and productivity of the HTI, which included an open access facility-rich room, staffed by experts in a number of different fields, and engaged in resource creation for specific projects, were much greater. As well as superior OCR and tagging services, they offer a Kontron camera (a Progress 3012) for high quality image capture, free for internal use.

Other facilities available to library users include a `Knowledge Navigation Center' specifically developed to provide assistance on IT to humanities students, faculty and staff: this has a number of machines and support staff. The Office of Instructional Technology also provides a service aimed at developing and adapting teaching and learning software to faculty's needs: I formed the impression that this unit had a poor record of co-operation with other more resource-focussed centres.

The HTI forms part of the University's Digital Library Production Services, along with a number of complementary and well-funded projects: MESL (the Museum Educational Site Licencing Project), the venerable TULIP electronic journals project (currently co-operating with JSTOR) and a new service known as the SGML Server Program, the object of which is to sell SGML consultancy and expertise in setting up electronic text centres to other universities at a knock down price. Income thus generated is intended to go into the content creation activities. The DLPS reports to a board on which the Library, the Information Technology Division (a large Unix based central computing service), and CAEN (another large Unix-based central computing service) are all represented. (I suspect that there is an interesting history to be written about how Michigan came to have two competing computing services). The HTI gets funding from the University's Press, the Library, the Office of the President (of the University) as well as from grant giving bodies like NEH and NSF, which it uses to carry out prestigious content creation projects like the Making of America (which will produce 1.5 million page images, combined with OCR'd text in TEI format). It currently has six full time staff and its activities are expanding to include not only images (unlike the Library of Congress, which delivers TIFF format page images, theirs are dynamically converted to GIF for web delivery) but also music and film (as a natural outgrowth of the MESL project); they are even contemplating numeric data (for the ICPSR, no less).

Michigan has an HFS system like ours, run by CAEN, which the DLPS is planning to use for large datasets, such as GIS data. Otherwise they rely on their own largescale RAID system, which gives them six 72 gigabyte disks, mounted on a Sun server. They use a product called DLT for backup. They use their own software to interface web users with the underlying text search engine, which is Open Text release 5; this software is also supported by the SGML Server Program mentioned above.

I could have spent a lot longer in Ann Arbor, had my itinerary permitted, since I think we have a lot to learn from their successes. I also discussed with Price-Wilkin the idea of organizing a TEI-header users-only workshop, which he seemed to think a good idea; he also suggested that the TEI really should get into the business of selling consultancy services, which I did not quarrel with since he was buying the dinner (and it was a very good one). Afterwards, I was re-united with an old acquaintance, Professor Richard W. Bailey, whom I last saw in the late seventies and with whom I spent a very entertaining evening reminding each other of past follies and embarassments, cut short only by my need to get up early the following day for a flight to Indianopolis.

2.7.3. Indiana University, Bloomington
17-18 January

I took a side trip to Indiana University's new Music Library, on 17-18 January. This is home to the Variations project, a state-of-the-art real-time music delivery system, which reportedly makes use of IBM's Digital Library products.

The project director, David Fenske, introduced me to the systems support person, Jon Dunn, and the librarian responsible for the digitization process, Constance Mayer, all of whom kindly gave up a lot of time to making sure I saw as much as possible during my brief visit. The goal of the project is to digitize substantial quantities of the library's holdings of recorded music, held on CD, LP, and cassette. Real-time delivery of digital sound currently requires non standard disk access and storage methods, and the system at Bloomington relies on an IBM proprietary method known as tiger shark (apparently because data is `striped' across the media, rather than being stored in discreet blocks) for storage, and on an internal ATM network for delivery.

Although the project is not currently using the IBM Digital Library (henceforth, DL) software, Fenske assured me that several components of the software were already in use, while others would soon be upgraded to a state where they would deliver what was needed. The principal gap was in support for the afore-mentioned TigerShark file system, which could not be accessed via the current ADSM software. Consequently, at present, Indiana are using ADSM for backup, and as a repository only. Audio files are manually moved between the ADSM and a separate `playback' server. This is a conventional `pull' type unix server, running under AIX, with its own filestore which is accessed via a product called Multimedia Server. It is planned to replace this with a new IBM product called Video Charger, due for release in September, which will interface directly with ADSM. Mention was also made of a similar product called Media Streamer, designed to handle real time audio broadcasting. For our purposes, the most relevant forthcoming component of the Digital Library system will be Visual Info. This is a database product designed for storage and retrieval of images and text which sits on top of the well established (not to say venerable) db2 engine. It will also have some extras called db2 `extenders' because they extend the searching capabilities of db2, apparently using black magic to do things like searching images by colour, shape etc.

For metadata and cataloguing purposes, Indiana plans to replace its current OPAC system (NOTIS) with a new one, currently being developed by a company called Ameritech Library Services. This OPAC is shared by nine distinct campuses, so the upgrade will be slow. The new product, called Horizon, will interface directly with Digital Library (according to an agreement between IBM and ALS). It is designed for inter-operability, and has Z39.50 support. Fenske pointed out that Z39.50 does not address the realtime networking issues critical to their needs, which made integration with the Digital Library product correspondingly more important. We discussed the relative wisdom of rolling your own solution versus waiting while manufacturers roll one for you: Fenske said that his concern was always to make sure his needs were addressed by vendors' announced and supported product plans. He had found IBM very responsive, and was confident in their ability to deliver the required functionality in the long run. He is currently working very closely with the company, and will be working part-time as a consultant at the Santa Teresa laboratory where DL development is based.

Bloomington is nationally and internationally famous for the quality of its music teaching: out of approximately 7000 applications, they admit about 350 a year; their current enrollment is about 1500 music ``majors'' and a hundred part-time students. The music library occupies four floors of a new purpose-built (and privately funded) Performing Arts Center, with several large reading rooms, and purpose-built stacks. Round the edges of the reading rooms I saw rows of carrels, some with outmoded (and under-used) analog playback systems, others with newer equipment (typically a MAC or PC workstation, with a Kurzweil keyboard, MIDI, and playback systems). In total there are 70 such workstations, of which 30 can currently access the Variations system. (This is partly because the current Variations software runs only on PCs: it's planned to switch over to NT4 as soon as ATM drivers for NT are available). The front end software is Netscape: a page specifies the lists of musical works allocated to particular courses, with direct links to the digitized audio itself, where this is available. Clicking on one of these activates the Variations player, which is configured as a Netscape helper application. The player allows you to select particular tracks from the work, randomly and with impressively smooth and rapid access. The sound quality is comparable with what you would expect from a good domestic CD-player over headphones. Fenske told me that their server logged about 17,000 audio file deliveries per month.

In addition to these carrels, the library has three seminar rooms, and a LARC-like room, equipped with about 50 macs and PCs, which is run by the central University Computing Service. These facilities are all linked to the ATM network, and so can all access the Variations system provided that they are able to run its software. There are fifteen full time library staff and two full time technicians.

I then visited the office where the digitization and cataloguing is actually carried out (this also doubles as the control room for a small recording studio). Digitization is done largely by part time student labour, under Mayer's direction. The procedure is only partially automated, needing a certain amount of manual intervention. Up to twelve hours of music get processed each day: limiting factors are the time taken to compress the WAV files to JPEG (this is done in batch overnight) and the amount of disk space available. Operators have to check that space is available to hold the material they are creating, and also to create manually a ``tracks file'' which records title and composer information for each track digitized. This is taken directly from the CD or LP, rather than from the existing catalogue records, for a number of reasons, ranging from variability in the level of cataloguing details actually available (MARC cataloguing practice for published music varies greatly in what gets included, and where), to political and programming difficulties in getting direct access to the centrally-maintained catalogue records. Consequently, the operators' keyboarding instructions have to specify exactly how proper names of composers should be entered --- there is no other authority control --- and include the depressing note that all foreign accented characters should be ignored.

Five cataloguers are employed to enter the data into a simple line-mode shell script, taking up to 30 minutes per CD. The only automated part of the process appears to be the reading and detection of track duration times directly from the CD: there is no workflow program to check, for example, that the catalogue records are correctly updated. A filename is allocated to each piece of music, derived from its identifier in the NOTIS system. When the whole piece has been digitized and is ready for compression, it is backed up to the AHDS and a catalogue record update is requested, apparently by hand. This will (eventually) insert an entry in the MARC 856 field, containing the URL at which the digitized track will be accessible, assuming that it is available from the server. For example, the piece of music with NOTIS identifier ABE7278 will gain an 856 field containing something like the following:

$1 /.../cgi-bin/var/access?ABE7278 $2 http

I had some private conversation with Jon Dunn, who provided some more technical details of the present and future system. In future he expected that the OPAC would link to detailed metadata held on a library server, which would in turn point to digital objects held on an object server. As noted, their present system relied on AIX and the Multimedia server, using the ADSM only for backup and as an archival store (for both WAV and MPEG versions of the files). The glue holding this together was all developed in house: in particular, the Variations player was written in Visual C++ and the web interface material written in Perl. A half FTE post had just been established to port the Player to a 32 bit environment. There had been no particular planning exercise or formal acceptance procedure.

At present, Jon said, the system only has to handle 30 concurrent accesses over the ATM network, but it should be able to handle up to 100 such, if the number of workstations expands. The playback server is an IBM RS6000 series, specifically a 59H uniprocessor (apparently, Multimedia Server does not work on multiprocessor environments) with 512 Mb of main memory, running AIX 4.1.1 and ADSM 2.1. The server addresses a total of 120 GB of SCSI disk storage, and manages the ``striping'' (this precludes using RAID to increase the amount of available disk space). The disks are accessible by NFS and the in house ATM network used for delivery is connected to the campus wide area network, so in principal the system could be accessed from anywhere on campus. However, since the campus network is a conventional FDDI ring running at 1200 Mbits/second, and most buildings have an ethernet running at only 10 Mb/sec this is not regarded as a practical possibility. The campus network is run by the University Computing Service which is reluctant to risk degenerating performance in this way.

As noted above, after about ten hours of music has been digitized, the holding area is full and all data has to be compressed. Up to 120 Gb of data can be held in the play back area, but they have digitized ``much more'' than that. With the new video charger software, transfer between ADSM and the playback area can be automated, but at present it has to be done manually, on the basis of course requirements. The new DL will also include a defined API for applications such as the Variations player and its associated web pages, to which Dun expects to be writing code. He mentioned the existence of a detailed IBM technical paper describing the internals of the Video Charger software.

In a final discussion, Fenske gave me some other US contacts which might be worth following up. These are all members of something called the ``Renaissance Consortium'' --- a club of early DL users run by IBM, loosely under the aegis of Michel Bizy. (spelling?)

Library of Congress (Herb Becker). They are currently using Visual Info.
Marist College (John Mc Ginty). Amongst other things, they are managing the Roosevelt Archives and providing web access to them at www.newdeal.marist.edu.
Virginia Commonwealth U (Jim, Mulvaney, Phyllis Self). Their application reportedly involves delivery of medical x-ray images to remote sites, compounding privacy issues with image delivery problems, but I could not find it on the web.
Case Western (Jim Baker). They are researching rights management issues, and have developed prototype software for object-level access-control. Fenske was of the opinion that this was too fine grained for most purposes: he mentioned the idea of a cryptolope, in which only the container of a packet of digital data would be encrypted as a more promising approach.

Unfortunately, bad weather meant that I had to leave Bloomington earlier than planned, but I don't think I would have learned a great deal more by staying. It seems clear to me that we should continue to be deeply skeptical about the claimed abilities of the IBM digital library software.

I did manage to visit the LETRS electronic library project while I was there: this is another TEI based project, firmly located in the University main library, run by Perry Willet. LETRS has adopted similar solutions to the provision of digital texts online as HTI, though on a smaller scale. It is a joint venture of the library and computing services, with five parttime graduate consultants and one full time technical consultant. LETTRS provides access to a number of networked CDs, and the OT5 software used at Michigan. They have also created a sizeable amount of TEI-conformant text as part of an ongoing Victorian Women Writers Project.

2.7.4. University of California at Berkeley
19-20 January

I cut short my visit to Indiana (which was now getting seriously cold) in order to get back to Detroit before too many of the planes out of it had been cancelled by bad weather. This turned out to be a Good Idea, since the one I eventually caught to San Francisco left two hours late, and managed to lose my luggage in the process. But California is still a warmer place to be, even when you have no clean shirts and only a substitute toothbrush.

I had been invited to the Library at Berkeley by a campus-wide working group, jointly sponsored by the Townsend Center for the Humanities and the UCB Library, which has a campus-wide remit to promote interest and information on computer usage across the Humanities. Amongst other interested (and interesting) parties, this brings together the Library's own Electronic Text Unit, the Bancroft Library's Technical Services Division, the Library's Photographic Service, and several academic departments, notably those of English and Linguistics. Berkeley Library is, of course, the home of EAD: the Encoded Archival Description, now being developed at the Library of Congress, as well as many other good things. My talk, on the relationship between scholarship and the digital library, attracted a gratifyingly large audience, many of whom did not appear to know each other, and nearly all of whom managed to survive until the coffee break. I had been asked to stimulate enthusiasm for the TEI, and did my best: for at least an hour.

After a pleasant lunch with my hosts (Daniel Pitti, Tim Hoyer, and others), and a little light tourism (I have now been up to the top of Berkeley's Campanile and admired as much of the view as the fog would permit) we reconvened for a discussion session, which like many such ranged comfortably over a number of issues without settling any of them. I did learn however that the Electronic Text Unit had developed a dtd for managing digital images along with their transcriptions, calledthe EBind dtd, and that according to Prof Lewis Lancaster, double keying is a far more cost effective method of data capture than OCR. (Lancaster also brought me up to date on the activities of the Electronic Buddhist Text Inititiative, which is still going strong.) In the evening, I was then taken out for an excellent Japanese dinner by the Linguistics Department, in the shape of Prof C. Fillmore, J. B. Lowe, Jane Edwards, and two graduate students, and we all got to discuss corpus linguistics late into the night. To round off a perfect day, on returning to my hotel, I found that my luggage had finally caught up with me.

Next day, rejoicing in a clean shirt, I set off to visit Uli Heid, currently a visiting fellow at the International Computer Science Institute in Berkeley. We spent an hour or two discussing corpus retrieval software and I at last saw the Corpus tools developed in Stuttgart by Oliver Christ. Alas, since this prestigious international institution did not have a single PC running Windows in it, I was unable to respond by demonstrating SARA, other than over the web (which worked).

Fillmore is about to start on a new project which involves annotation of a corpus with detailed lexical information: we talked a little about how that might be supported in the TEI scheme. After lunch, I visited the Berkeley Linguistics Department proper, where I saw some of the impressive work Lowe is doing in bringing together and (eventually) marking up components of dozens of African language dictionaries. This project, known as CBOLD (Comparative Bantu Online Dictionary) looked like an excellent TEI prospect.

I spent the rest of the day engaged in tourism in San Francisco, at last. I can now report that I have crossed the Golden Gate Bridge, visited the City Lights bookshop, eaten at Max's Diner, and seen the cars wiggling their way through Lombard Street (the wiggly block of 7th street). Oh, and also seen the remains of the 1910 Worlds Fair -- some utterly implausible red sandstone ruins put up by William Randolph Hearst. I should express my thanks to Jane Edwards for introducing me to those delights, to my Berkeley hosts for allowing this tour to end on such a high note, and indeed to all the people I visited for allowing me to disrupt their routine with so many impertinent questions.

2.8. Conference on MS encoding
Studley Priory
2-3 Nov 1996

This was a hard working weekend conference, rather on the TEI model, organized by Peter Robinson of de Montfort University and the Canterbury Tales Project, in collaboration with Dr Hope Mayo from the newly Mellon-funded project on Electronic Access to Manuscripts. It assembled representatives from several major European manuscript collections, specifically the British Library; the Arnamagnaean Institute at the University of Copenhagen; the Institut de Recherche et Histoire des Textes in Paris; the Vatican Library; the Czech national library in Prague; the Welcome Institute; the National Library of the Netherlands at the Hague; the Marburg Institute and the Bodleian, together with experts on MARC, on the Berkeley Finding Aids project, the Text Encoding Initiative and the Dublin Core. The charmingly-named Merilee Proffitt was also present as a representative of the joint Berkeley-Columbia Digital Scriptorium project. The object of the two days was to formulate a basis for minimum level descriptive cataloguing, specirfically of western medieval manuscripts, and agree a context in which this could be developed, with an eye to future European funding.

This was an unusually well-prepared and documented meeting: brief bibliographic and biographical details of all the participants were circulated before hand, and I came away laden with paper and replete with technical detail. A full report was promised, so this should be taken only as a summary stop gap, composed from my personal and probably ursine standpoint.

The first day was taken up by presentations from each of the cataloguing experts present, recounting their own institutional policies and current practices, which proved a most instructive mixture. At the BL, for example, Michelle Brown spends one day a week "up a ladder" looking for illuminated mss, which when found are given minimal level descriptive cataloguing "at shelf" using a dtd developed by Richard Masters, while Rachel Stockdale and her colleagues are contemplating the horrors of retrospectively converting an estimated million printed textual descriptions to a structured format. Mathew Driscoll and Dominik Wujastyk gave good introductions to what was special about their own kinds of mss (Icelandic and Sanskrit respectively); Elizabeth Lalou described the latest incarnation of Medium, the medieval ms database developed at the IRHT in the 70s; we learned a little about the Vatican's 150,000 estimated mss, of which only 15% are catalogued, from Piazzoni, and also about the thinking behind the Czech National Libary's Unesco-funded digitization project from Adolf Kroll. Thomas Brandt described the Marburg Institute's MIDAS system, which underlies a very impressive publishing programme, and also a very interesting collaborative cataloguing venture called Diskus, involving several major German museums and art historical institutions. Anne Korteweg described the cataloguing scheme used for manuscripts at theDutch Royal Library, mentioning in particular its use of PICA and of Iconclass Finally, Richard Gartner outlined theBodleian's mss cataloguing projects, using EAD and a set of TEI extensions.

Day two began with some real computer demonstrations from David Cooper, showing the Bodleian's Celtic manuscript project, and from IBM's digital library projects, in the shape of Peter Elliott and Uschi Reber. None of this had a lot to do with the stated business of the meeting, but it was nice to see some really good digital images (or would have been had I not been busily getting my own presentation ready). Real business resumed with three brief technical presentations about MARC (from Larry Creider), EAD (from Daniel Pitti), the TEI Header (me), and the Dublin Core as a discovery vehicle (from Jennifer Trant). The rest of the day was devoted to discussion of what exactly should constitute a Dublin-style core set of descriptors for manuscript materials. This began with a rather unsuccessful attempt to get everyone to agree on a top ten list of fields that must be present, got bogged down somewhat, but eventually cohered around the notion of identifying a key set of descriptive categories, within which more specific fields might be mandated. The discussion was finally given some coherence by a rather neat taxonomy proposed by Jennifer Trant, which I reproduce below. Starred items indicate those which were agreed on as essential, according to my notes.

Ownership (where the object is now and has been)
- *** repository name
- *** shelf mark
- previous owner/repository
- previous owners mark
Creation (how the object came into being)
- * date of production
- * place of production
- scribe/hand
- colophon
Physical Description (observable characteristics)
- ** dimensions
- format
- ** extent
- ** materials
- binding
- illuminations
- collation
Contents (i.e. interpretative features: the meaning conveyed by the object)
- * author, and role e.g. commentator, author, translator
- * title, subclassed as transcribed, supplied, unknown (or incipit)
- incipit
- explicit
- * language and writing system
- subject + scheme
- form or genre
- iconographic subjects
- complete?
Relationship of this object to others
- references
Reproductions (mandatory, where these formed the source of cataloguing)
- format
- institution and identifier
- date
- rights
Record History (cf TEI revision History)
- cataloguer
- sources consulted
- * original vs reproduction

A fuller and clearer version of this list is to be produced in the official report of the meeting. With this consensus established, the real work of defining the nitty gritty of how (for example) this consensus might be mapped on to a MARC record structure or a TEI header was left to another day. The plan is to prepare a bid for a two year development project, funded by the EU libraries programme and co-ordinated from de Montfort, which will define such implementations, systems for producing such records in technologically-challenged libraries, and a central repository for them, as well as a detailed set of Guidelines. Watch this space.

In all honesty, I must add that some credit for the hard thinking and energetic discussion which characterized this meeting may be due also to the sybaritic nature of its surroundings. Studley Priory is a beautiful building, with an excellent chef, a well-stocked bar, and some rather fine gardens for stomping around in, even on a windy November weekend.

2.9. Belux96
Business Faculty, Brussells
31 Oct 1996

I was invited to deliver the opening address at A decade of power, the third annual conference of the Belgium-Luxembourgian SGML Users' Group, held at the Business Faculty of Brussells University, just down the road from the airport (rather inconveniently for us Eurostar-fanatics). The day long event was well attended, with about fifty delegates, mostly from the industrial/publishing community in Belgium and the Netherlands, though there were also some academics present. There was a small exhibition, notably featuring Omnimark, Balise, Adobe, and Fotek, and a very friendly atmosphere. The day was given over to presentations, which I summarise below.

Lou Burnard (Oxford University) SGML on the Web

You can downloadthe latest version of my presentation from my web site, including the overheads. As far as I remember, I explained at some length why HTML was a Bad Thing for electronic publishers (this is what the Americans call ‘preaching to the choir’), and rather more briefly why SGML was a Bad Thing for the Websurfer in the Street (which is probably what the Americans call ‘making waves’). I also made a few incautious remarks about month, which provoked some interest.

Elizabeth Gower (Adobe) Practical issues in SGML Publishing

The practical issues concerned all related to conversion of legacy data: how to do it, how to make sure you don't get ripped off by the company you sub-contract to do it, and why it's going to cost you more than you think. The presentation was slick, professional, thorough, and aimed fair and square at senior management: an invaluable checklist of practical things to do (make sure you've got enough MIPS and enough disk space! make sure your network can stand the load!) and how to benchmark the process, but not much on different software strategies you might adopt to do the conversion, or manage the results of doing it.

Norbert Mikula (Philips Semiconductors) Electronic databooks: proof of concept

Not content with writing an SGML parser (of sorts) in JAVA (imaginatively called Cappuccino), Mikula has also turned his hand to the production of Yet Another DSSSL Engine (or YADE), which uses Milowski's Kawa scheme interpreter, also written in Java. The context for these tools is the Philips Semiconductors Electronic Databook, an application of PCIS, the dtd Philips have developed within the Pinnacles framework, and forms the basis of Mikula's research at the University of Klagenfurt in Austria. His presentation was impressive, and although only in prototype form, the work he outlined shows great potential.

Milena Dobreva (Inst of Maths and CompSci, Sofia, Bulgaria) Use of SGML by philologists: experiences gained during the Medieval Slavic Manuscripts Encoding Project

This paper gave an overview of the trials and tribulations experienced by a group of Bulgarian philogists in applying the TEI Guidelines to the problems of describing and encoding medieval Slavic manuscripts. Dobreva's presentation focussed more on the organizational and training aspects of the project than its technical content, though she did briefly present the set of extensions the group had found it necessary to make to the TEI dtds, which were of some interest. The facility with which previously SGML-naive users were able to make useful progress, even in a technologically challenged environment, was very heartening, while the survey of common problems encountered was reassuringly familiar.

Benoit LaSalle (Omnimark Technologies) Using microdocuments and hybrid distributed databases for building up hypertext rich content online services

Exoterica Corporation has changed its name to Omnimark Technologies, and this paper was presented by its European sales head, rather than by Eric Skinner or John McFadden. Otherwise, this was much the same story as presented at Munich earlier this year: the future lies in microdocuments (crystals, or document fragments) which can be embedded within conventional table-like record structure. This is undoubtedly true, but would be more persuasive if there were evidence of implementations not simply hacked together from a mass of Omnimark scripts doing clever stuff behind the scenes.

Paul Hermans (ProText) Questor: publishing social law to different media

This was also a user-report, this time on an electronic publishing project concerned with course materials for social law. The material needed different organization for electronic access: for the former, a browser based on Synex viewport (the engine behind Panorama) was used, for the latter, Framemaker. The necessary data massage was carried out by Omnimark. They had begun with three separate dtds, for capture, hardcopy, and online, one of which also sported ICADD attributes subsequently abandoned for performance reasons; these had then been ingeniously combined into a single dtd. Hermans spent some time outlining the ways in which Hytime linking strategies had enabled them to organize the material for online searching in an effective way, before commenting that TEI extended pointers had proved a much easier option when it came to actually implementing the system with current SGML tools, which was also heartening.

Jacques Desseyne (Sema Group Belgium) The SGML Tree Transformation Process (STTP): processing SGML documents in an absolutely standardized way

This was a good introduction to the STTP side of the DSSSL specification, giving a clear overview of how it works, and why it is a distinct component of the DSSSL architecture. Deseyne rather fudged the issue of implementation, and did not make the expected announcement of a new Sema product, or not in public at least, but rumours were abundant during the coffee break.

Raf Schietekat (Fotek NV) DSSSL: the promise FOSI did not fulfill

This was the last paper of a long day, and I found it rather disappointing. It covered some basic principles of DSSSL in a rather mechanical way, and did not add much to my understanding of the overall scope of its formatting language. In particular, I had hoped for some explanation of what FOSIs are, before they become extinct, but did not receive one.

2.10. Colloqium on Hypertexts and Electronic Editing
École Normale Superieure, Paris
12-14 September 1996

This three day colloqium was hosted in Paris by Jean-Louis Lebrave of the Institut des textes et manuscrits modernes (ITEM), with funding from the CNRS under a programme called Le Patrimoine Ecrit or Written Heritage, managed by Almuth Grésillon. It brought together an unusually wide-ranging group of European researchers and implementors, many of whom were previously unknown to me, and all of whom had something interesting to demonstrate or to say about using hypertextual methods in the process of critical editing or electronic publication. Most of the presentations involved web-hosted demonstrations, displayed on a huge screen in a darkened room, fitted with exceedingly comfortable cinema-style armchairs. I learned that the French for "web" is toile, which also means canvas, as well as being a slang term for the cinema screen. After a rather good lunch, it is testimony to the rhetorical skills of the presenters that no-one visibly fell asleep in such surroundings.

Unfortunately, local committments made it impossible for me to attend the first day's sessions. According to David Robey (who turned out to be the only other brit present), they were rather theoretical and lacking in focus, but I was sorry to have missed meeting Bernard Stiegler. The second day began with Etienne Brunet, (UPRESA, University of Nice) a founder member of the ALLC and author of the Hyperbase concordance system, describing his Balzac cd-rom project. This will make available 90% of La Comédie Humaine (Balzac's monster sequence of 19th c novels) --- once the copyright problems have been ironed out. More interestingly, the project is a collaborative one, in which 33 international Balzac scholars are scheduled to produce notes, commentary, and assorted links. Brunet suggested that the online text was really there as a way of indexing the collection of associated Balzacian commentary. The web site has some interesting Balzaciana, including photos of the great man's walking sticks as well as his manuscripts. It also has a KWIC concordance, carved up into HTML, with each line of context linked automatically to the passage in the complete text from which it comes, and some rather impressive statistical displays.

Luca Toschi from the CRAIAT research group at the University of Florence prefixed his demonstrations with some suggestive thoughts on the ways in which digitizations facilitated many views of a text, based on his experience with the publication of an electronic Goldoni variorum. I found the software, when we finally saw it, rather impressive: it displays each state of the text separately, but with variant passages colour coded. Clicking on one of these then brings up the state of the text in which that variant appears, as a new window. They use an IBM product called Linkway Live (?).

Paolo d'Iorio (Paris) described how the web had effectively created an enormous, multinational, multilingual, Nietzche database. His chief focus was on the way the web made co-operative authoring ventures possible, but he also gave some nice examples of ms images linked to transcriptions. Asked whysomeone didn't just publish everything on CD, he stressed the open, dynamic, nature of the enterprise; in similar vein, when asked whether he didn't fear that this kind of community would create some new kind of orthodoxy, he pointed out that the dissidents were perfectly free to set up webs linked to the same material, displaying it in their prefered way.

Daniel Ferrer (also from ITEM) gave a considered presentation of the advantages of Storyspace in presenting the evolution of the text of Joyce's Ulysses through the various states of manuscript, typescript, correction, page proofs, etc. The database includes Gabler's transcription as well as a sets of page images; much play was made of the difficulty of linking non-rectangular zones of Joyce's scrawl in the latter with parts of the former. The Joycean theme was picked up by Marlena Corcoran, who demonstrated the importance of good design and a simple interface in a system (developed with mediatool) which allowed one to see the corrected original page proofs (placards) ofUlysses , as they appeared before they were cut up by some over-zealous conservator, and thus to infer something about the correction process itself.

Nicole Moulinoux from the University of Rennes described another scholarly collaboration centred on the works of William Faulkner, in particular a web based (but password-protected for copyright reasons) electronic variorum of the Sanctuary . Another member of the team showed a comprehensive set of retrieval and statistical tools aimed (I think) at content analysis, which had used the text: the unusual feature of this was the provision of dynamically generated tables of hits retrieved according to an unexplained factor analysis. This seemed to offer a lot of possibilities for stylistic analysis.

Alain Giffart (IMEC) discussed a new project at the Collège de France to create a hypertextual edition of Roland Barthes' Comment vivre ensemble . His discussion was wide ranging and highly theoretical. If hypertext has a theory (and it must have, for the French to take it seriously) then Barthes anticipated most of it. It will be interesting to see what the project produces: the Barthes archive includes video and audio of the lectures from which his published works derive, made at different times. Extending the geneticist school of textual criticism to cope with such materials poses interesting challenges.

Peter Szendi (IRCAM) presented what he called a maquette diaboliciel (or diabolical pilot program) -- a prototype for an exploration of the variations within and between versions of Beethoven's Diabelli Variations , in which the scores were linked to an audio track. For some reason his comment that it should be possible to find some auditory equivalent of the clickable button (to indicate an anchor within an audio track) provoked heated dissent from the audience.

I opened the final day of the Colloqium by presenting the usual overview of the TEI architecture in my best A-level French, somewhat enhanced by illustrations of how to do hypertext-y things in TEI, including the Comenius example, and also for good measure two different ways of encoding the start of the Beowulf ms. I was also able to wave a copy of the newly published edition of Cahiers Gutenberg containing Francois Role's French translation of the TEI Lite tutorial. Which was, as they say, nice.

David Robey (Manchester) followed this up by describing his experience as a member of a TEI work group, and presenting his views on how to tag the rhyme and alliteration patterns in Dante's Inferno . This provoked an interesting argument about the extent to which algorithmically or procedurally determined patterns needed to be made explicit in an encoding.

David Pietrowski and Georges Vignaux (INALF/LIMSI) described a prototype system for use in tracking reading of a dictionary, specifically the Grand Robert .This was presented as a way of creating a new perspective on language rather than a neat way of capturing the reading process itself, which might have been more persuasive.

A team from Saarbrucken, led by one Wender, gave a brief overview of a rather fine Goethe project. This web site generates HTML on the fly from their internal markup, combining text, manuscripts, and critical apparatus in an impressive way. They are also working on Musil and Buechner. The methodological problem they foregrounded was that of making explicit the temporal stages of a series of variants.

Alessandro Pamini (Istituto Metacultura) tried (twice) to present a new hypertext system being developed in collaboration with the Cultural Anthopology department of the University of Rome. His talk was given in Italian, and all I gleaned from it was that he felt everyone else had misunderstood the fundamental issues. (But he did later hand out a French translation, which looks a lot more interesting)

The colloquium concluded with a long and wide ranging round table discussion in which several distinguished luminaries picked up some themes of the colloqium, and introduced some new ones. Hans-Walter Gabler (Munich), whose computer-assisted re-editing of Joyce provoked some controversy a few years back, was quietly optimistic about the possibilities offered by a true computer-based editing: I learned later that he is currently planning a new graduate seminar on textual criticism with a strong computational component. Yannick Maignien of the Bibliotheque Nationale and Robert Martin of the Institut Nationale de la Langue Française raised several key issues about the opportunities and difficulties of electrronic text provision (role of libraries, need for metadata, publishing, copyright, etc). Christine Coutoure, documentalist at the École Normale, spoke feelingly of the librarian's perspective on the untamed wilderness of the web. Eric Lochard from Montpellier, Alain Giffard, Jean-Louis Lebrave, and Daniel Ferrer all in their separate ways questioned some of the methodological implications of the hypertext method, and the seductive availability of apparently neutral digital resources. There was much debate, but I found most persuasive Toschi's eloquent reminder that whatever else scholars do, they should not abnegate their responsibility to spin a plausible tale.

2.11. Algebraic Document Processing: Project DAVID workshop
University of Minho, Braga
2-4 September 1996

DAVID is a newly funded project in structured programming and document manipulation, based at the University of Minho's computer science department and funded by JNICT (Junta Nacional de Investigação Científica e Tecnológica: i.e. the Portuguese national research council). Its goals are to build on the department's expertise in formal methods and grammars, exploring new ways of applying them to text processing and information handling. To start the project off, they organized a three day workshop on "Algebraic document processing" in Braga last week, inviting a small number of guest speakers (myself, James Clark, and Sebastian Rahtz) to share their ideas in a pleasantly relaxed environment. There were about a dozen participants, mostly from the department itself, but also from other Portuguese universities.

I was asked to give an overview of SGML and then to describe the TEI architecture and ways of using it. I took the opportunity of reorganizing the standard TEI workshop slightly, also introducing a little more technical content than usual. Following the now-traditional account of the problems of realizing the full potential of electronic texts, I presented a version of the Gentle Introduction to SGML expanded to include all (but only) the SGML features actually used in the TEI scheme. Sebastian Rahtz followed this with a nicely contrasting account of the practical difficulties of applying SGML in the production-line world of Elsevier's journals division. Elsevier has committed itself to using SGML for the archival storage of their several hundred scientific and technical journals, using a version of the ISO 12083 dtd with various varyingly satisfactory accomodations to cope with the enormous amounts of maths, graphics etc. needed. This policy is expected to pay off in terms of re-usability, with such ventures as the Science Online database due to start next year, which will deliver full text of 300 scientific journals on the web, complete with links to relevant bibliographic databases and abstracting services, for those well-enough funded to access them, at least.

After lunch (pizza of course), I gave the traditional TEI architecture presentation, focussing rather more explicitly on the nitty gritty of its class system and how the modularity of the dtd is implemented. I then gave the document analysis presentation, followed by a group tagging exercise. This last was quite successful, unexpectedly, since the text chosen as vehicle was an obscure 17th century English political pamphlet, which several participants found linguistically rather challenging. The first day concluded with a brief account of the motivation for, and contents of, the TEI header, after which I was more than ready for a cool beer and the obligatory bacalhao.

Day two began with the audience's choice of TEI tagsets: offered a choice of TEI-ana, TEI-spoken, and TEI-lite, they plumped for the first two, which is what I duly gave them. This was a pleasant change, for me at least; in retrospect it's a great shame that I had not prepared more on the feature structure tagset, since such mention as I was able to make of it was clearly of considerable interest to this audience. I concluded my presentations with a brisk canter through available SGML tools and strategies for handling TEI texts.

James Clark then improved considerably on my superficial account of what a parser does by giving a detailed presentation of his SP parser, complete with some glimpses of its 60,000 lines of C++ code. The new version of SP implements everything defined in ISO 8879, even the silly bits. Unlike his earlier parser, sgmls, it supports multi-byte character sets such as Unicode, maintains non-ESIS information such as comments and use of white space, and also allows for modification of the concrete syntax. It is also reportedly at least twice as fast. SP is actually a general purpose C++ class library, with a well defined entity manager, and comes with a number of useful implementations; it is completely free of charge, and has already been incorporated into a number of leading SGML products (notably Technoteacher, Balise, and Microstar's SGML Author for Word).

After lunch, Sebastian Rahtz' second presentation described an ingenious approach to the rather unusual problem created by Elsevier's document management policies: the need to convert from LaTeX into SGML (rather than the more usual reverse). Authors' use of LaTeX varied so much, and LaTeX itself was so flexible, that in some cases it would be cheaper to throw away the author's source and retype it from scratch, or to throw away the TeX markup and retag it. Automatic conversions based on parser technology inevitably fell foul of the macro-processing nature of the TeX system sooner or later, and so the optimum solution seemed to be to use TeX itself to process the input, but tweak its macros so as to emit appropriate SGML tagging along with the formatted text. Probably you have to be a TeX hacker of Sebastian's expertise to even think of this solution, let alone to implement it, but it does apparently work more reliably than any of the alternatives. Sebastian also outlined various strategies for presenting maths on the web and tried very hard to persuade us that using PDF was a good idea.

Day three was mostly devoted to DSSSL, the new Document Style Specification and Semantics Language, ISO 10179. José Ramalho from Braga began with a brief overview of the contents of this important new standard, which complements the standard syntax offered by SGML with a standard way of defining its semantics. DSSSL has four components: a language and a processing model for transforming one SGML document into another; a query language called SDQL for identifying portions of an SGML document; a style language for applying formatting characteristics to an SGML document, expressed in terms of things called "flow objects"; and a powerful expression language derived from SCHEME which ties the whole thing together.

James Clark then presented his new piece of software: JADE (believed to be short for James's Awesome DSSL Engine). He began with an impressive proof of concept which involved converting the 300 pages of the DSSSL standard itself directly into RTF before our very eyes, and showing the results, but then descended (or ascended) to showing us large amounts of the code in Jade. I regret to report that my comprehension of object-oriented techniques, already somewhat overstretched, did not allow me to follow as much of this as it should: but the availability of a DSSSL style sheet for TEI texts (on which work is currently proceeding) will no doubt change this.

After lunch, Jocelyn Payne described a system he has developed called Web-o-matic, which appears to be a way of doing what most people do with CGI scripts but using object-oriented Rexx. The system is used by a do-it-yourself economic modelling web site at the Institute for Fiscal Studies. It wasn't clear to me what the point of this approach might be, but it's always nice to hear what Jocelyn is actually up to. I wonder whether he knows that at least three internet service providers use the term web-o-matic to describe their tacky home-page creating scripts?

Finally, Pedro Henriquez and José João Almeida gave an overview of their plans forthe DAVID project. They have three years funding from JNICT, the Portuguese government research funding agency, to explore new techniques and ideas in textual information and processing. Their strong background in language processing, with a corresponding emphasis on representation of meaning as abstract syntax trees, together with their large resources of programming skills (they have already developed a specification language and prototyping environment called Camila, and applied it to the task of defining a `literate programming' application) clearly give them a head start (over me at least) in understanding DSSSL and applying it to new and interesting areas.

This was an unusual workshop, with a high degree of technical content in a pleasantly relaxed atmosphere. For me, it was also a comparatively painless way of finding out what the excitement about DSSSL is all about; my thanks are due to the organizers and to my co-speakers for making this an occasion to remember.

Lou Burnard

2.12. Munich
SGML Europe 1996 Conference

SGML Europe 96 was held this year in a very large and expensive hotel in Munich, adjacent to what would have been very pleasant park if the weather had been less damp. I travelled there from the Tyrol by train, thus missing the opening ceremonies (by all accounts a rather lachrymose affair, this being the first major conference since the death of Yuri Rubinsky to whose memory the event was dedicated), but saving the TEI the cost of an extra night's stay in said hotel. I did arrive in time to hear Charles Goldfarb give his "Inventor's Keynote", from which I learned that Charles has been thinking about the World Wide Web, and perhaps regretting that he didn't take the chance to set Tim Berners-Lee right about SGML when he had it. When titans meet, Dr Goldfarb opined, one should find another field -- sound advice, with reference to web browser wars, but rather defeatist for those of us who think that the SGML community might have something to learn from the runaway success of the web. Key events of the year were the publication of DSSSL; and some reorganization and realignment of various competing areas of HyTime and DSSSL activities, notably the definition of the Standard Document Query Language and of the HyTime "general facilities" (aka the useful bits -- architectural forms, property sets, groves, formal system identifiers etc). Charles also proudly announced his "Purity Test" for so-called SGML-conformant applications, on which see http://www.SGMLsource.com/Goldfarb/purity.html, if you care about such things.

The real function of conferences like this one is not however to listen to presentations, however inspirational, but to hobnob with the vendors, who were there in force. There were three exhibition halls full of variously sized booths to do this in, with almost all major players represented (conspicuously absent were Microsoft and Novell) and many minor ones, throughout the proceedings, pausing only for the evening reception (on the 24th floor, commanding magnificent views of some very damp tree tops). I duly hobnobbed, to the point of exhaustion, as did most of the other 300 or so delegates, when not drifting in and out of sessions.

Here are a few of the software products that made some impression on me: some newish SGML authoring tools, notably InContext (the new version is now reduced in price to the point where we could actually afford it), and Stilo, which is nearly available -- and both of which successfully processed the TEI dtds under my suspicious gaze. Folio Views had put a lot of money into free mugs and pads proclaiming "Folio does SGML" (I am told that this is somewhat economical of the truth). Astoria, the new SGML object database from Xerox, had an immense and very busy stand. At the STEP booth, they were busily producing a daily news bulletin directly from an online news feed, converting it automagically to SGML and formatting and printing it in real time. Synex, makers of Panorama, are now actively marketing their Viewport engine at a price we certainly cannot afford. Jouve have a (comparatively) cheap and cheerful CD-ROM production system called GTI publisher which could give DynaText a run for its money. AIS have a new version of Balise (with a new logo) and a new English manual on the stocks. At the cheaper end of the market, two new application development toolkits were in evidence, one called SGML-C, from Bruce Hunter, and the other, called NSL originally developed for the Multext project by Henry Thompson. Both worth hunting down on the net.

Even had I not been somewhat pre-occupied with getting my part of the closing plenary session ready, it would have been impossible to take in all this as well as the three parallel tracks, so my report is necessarily somewhat fragmentary. I listened to a session on document management systems, which included a good overview of issues in document database design from John Chelsom, and a characteristically pragmatic discussion of ways of building hybrid distributed document databases from John McFadden (see http://www.exoterica.com/pres/hddb). I dipped in and out of a major overview of SGML software tools organized by Steve Pepper (Falch) and Robin Tomlin (SGML Open), in the hope that this would stop defining criteria and start evaluating products (it didn't -- but the overview was very thorough). I attended a session about the role of SGML on the internet, which featured Eric van Herwijnen and Martin Bryan, neither of whom had anything new to say on the subject (though Eric had some nice pictures); Jon Bosak, in a different session, also addressed the web and had some rather sharper comments to make: (see http://www.sgmlopen.org/sgml/docs/library/dssslo.htm). I was rather disappointed to find that no-one apparently had anything to say about the importance of SGML as an archival format, or a metadata format, and I don't think I heard the phrase "digital library" mentioned once in four days, which seems odd. The conversations I heard in the halls were all about document production and document management, the intranet and improving your organization's information flow. I heard nothing about ICADD or the TEI, or how SGML might be of use to the rest of us, which was all rather depressing.

The final plenary began with Dale Waldt explaining at some length just why commercial publishers love (or should love) SGML, which somewhat reinforced my prejudices against them. I found myself speaking up for academia and reminding the conference that maybe SGML had a more significant role to play than just helping publishers and consultants get rich, and that its true Destiny was to preserve our cultural heritage. I also suggested that it really shouldn't be so difficult to get started with SGML, the basic notions of which can be jotted down on the back of an envelope, a theme which I was glad to see Tim Bray pick up in his masterly wrap up of the whole proceedings (see http://andor.opentext.com/~tbray/SGML-Europe/index.html). Tim stopped in the middle of his presentation (done like almost everyone else's with Powerpoint), exported it as an HTML file with one click of the mouse, and asked the question the SGML Industry really needs to answer: why is it so difficult for vendors to build true SGML support in with comparable user friendliness and simplicity?

Outside the conference, for the first time since I arrived in Germany, the sun came out. It was a public holiday, so the park was full of cheerful Bavarians drinking beer and eating sausages to the accompaniment of an oom-pa band. Time to go home...

2.13. Goteborg, Bergen, and Innsbruck
May 1996

At the invitation of Daniel Ridings, I visited the Sprakdata Unit at Goteborgs University at the beginning of May. This unit, which is distinct from the Linguistics Dept, was set up originally by Swedish Acdemician Sture Allen, and is the major lexicographical and NLP research centre in Sweden. I gave a two hour lecture on the subject of the British National Corpus, followed by an extended demonstration of the capabalities of SARA to an audience of about 15 lexicographers and others, which seemed to be much appreciated. On the second day of my visit, I gave a boiled-down version of the standard TEI workshop to students on the second year of the University's Computational Linguistics course, with an emphasis on the implications of TEI markup for lexical data, at the invitation of Riding's colleague Pernilla Danielsson. Danielsson and Ridings are collaborating on a project called PEDANT, which is an aligned multilingual corpus, using TEI markup and the toolkit developed by Henry Thomson as part of the Multext project, which we discussed at some length. I also met Steffan Larsson and colleagues from the Linguistics Dept at Goteborg, who have developed their own complex annotation scheme for the markup of a spoken Swedish corpus, and with whom I discussed the possibility of converting their markup scheme to TEI conformance. My BNC lecture was also attended by Gary Ward, a systems programmer from the University of Vaxjo, which is also attempting to set up a SARA server. With assistance from technicians at Goteborg, we were able to make considerable progress in getting the current version working, to the point where we were able to determine that there really was a bug in the server (since fixed by Scotty, I am glad to report).

The third day of my visit to Sweden, being Sunday, I spent visiting the Goteborg museum (splendid collection of late 19th century soft porn) and getting rained on in various picturesque locations, before heading to Bergen to participate in an encoding seminar being organized by the NCCH's Wittgenstein Archive. Part of the reason for this workshop, it was explained to me, was to prepare local staff and staff from other Norwegian centres for the forthcoming ALLC conference; I suspect it was also to field-trial the splendid new facilities being made available for said conference in Bergen's new Law Faculty building. Whatever the reason, I very much enjoyed the opportunity of re-running the boiled-down TEI workshop for the benefit of an intelligent and alert audience. In the time available, it was not possible to do a full hands-on session, but I was able to run through the exercises I had prepared (loading an untagged file into A/E, tagging it, building a header, exporting it, viewing it with Panorama, etc.) which seemed to be much appreciated. After the seminar, Claus Huitfeldt, Peter Cripps and I discussed somewhat inconclusively how the TEI/MECS Workshop planned for the start of the ALLC conference should be organized. I think we decided that we'd do them in parallel, sort of. I also met Lars Johnsen, the new head of NCCH, a former corpus linguist turned administrator, as well as old friends from Bergen. The second day of the workshop was given over to Allen Renear, from Brown, who gave an intriguing historical overview of the development of what he called the ontology of texts, from platonism (as characterized by the Brownian OHCO model) through liberalism (the view that there are many possible hierarchies) to anti-realism (the view that text has no reality aside from that projected onto them). Renear characterized the last as being "fashionable", which I take to mean "wrong".

Committments in Oxford meant that I was regrettably unable to stay for the remainder of this Workshop, which was to include a discussion of MECS and how it differed from TEI, so I took the customary large lunch at Hotel Norge before returning home on Tuesday evening.

After two days' frenetic but parochial activity in Oxford, I set off for Innsbruck, via Munich, in the company of Mike Popham. Our objective was partly to find out what exactly the NOLA project might be or might become, and partly to get a little time quietly to discuss the two other meetings to which I had committed Mike's time this month. Once arrived in Innsbruck, I developed a streaming cold, not helped by the rain which had followed me from Goteborg, which may have somewhat coloured my view of the ensuing meeting.

The meeting was organized by Allan Janik, of the Brenner Archiv, and attended by Heinz Hauffe of the Innsbruck University Library, Donald Broady of the Royal Institute for Technoloy at Stockholm, Lars Johnsen from Bergen in addition to Mike and me. Its purpose was to review the the NOLA proposal for funding under the EU's Libraries Programme, recasting it in the light of criticisms received so far. There was a general feeling that much work needed to be done, particularly in recasting the relationship of the project to the TEI, and in involving more prominent library activities. I had hoped that the work being done at the Bodleian in extending the TEI Header to handle manuscript materials might be of some relevance here, but this remains to be seen. Mike Popham's excellent minutes (soon to be available from the NOLA web page at http://???) summarize the chief areas discussed and the future plans of the group.

Mike and I then spent a very useful day preparing an initial position paper for the Oxford Text Archive to present at the first Service Providers Forum meeting of the AHDS, and in discussing Mike's presentation at the Princeton workshop on Software for the Humanities, before going our separate ways -- he back to Munich and thence to Princeton, I into the Tyrol and thence to Munich for the SGML96 conference. It continued to rain.

2.14. Surfnet TEI Workshop
University of Groningen

A TEI Workshop was organized by Harry Gaylord at the University of Groningen in the Netherlands 21-22 November 1995. The workshop was hosted by Harry's department at the University, with additional funding from SURFNET (the Dutch equivalent of JANET) and attracted an impressively large number of delegates, not all of them Dutch (there was a small group of Francophones, Swiss, Belgian and French; also some Spanish and Portuguese; but no visible Germans, no doubt owing to the similar event in Tuebingen the week before). I counted about 120 during the opening plenary, and even by the final session there were well over fifty. Although dogged by administrative and technical mishaps, ranging from the collapse of one of the presenters with influenza 24 hours beforehand to the dispatch of the printed course materials to some other conference in an adjacent province, I'm glad to report that the event appeared to have been very well received by all who attended, with much discussion and questioning, both enthusiastic and intelligent, during the sessions and afterwards.

Day 1 began with a series of plenary lectures: I gave the celebrated "Gentle introduction to SGML", followed by two other invited speakers: Bert Bos from the W3 Consortium spoke illuminatingly about the relationship between SGML and HTML, and the great hope afforded by the proposed stylesheet mechanism in HTML3; Gert van der Steen gave an exhaustive overview of SGML software, characterized by its functions. After a break we were joined (virtually) by Rich Giordano, who gave a very impressive lecture on the TEI Header and his thoughts about its relationship to projects in resource discovery, by video tape recorded in his office in Berkeley. We knew it was live because at one point the phone rang (it was Harry reminding him to make the video) and at another someone tried to walk into the office.

The TEI Workshop proper began with a presentation of the TEI architecture (regrettably using my fallback set of overheads, rather than the nice new version I'd prepared for use with Panorama), followed by a group document analysis, using as vehicle some pages of the Sherlock Holmes story "The Red Headed League". In the absence of printed handouts, at that time finding their way back to us from the aforesaid adjacent province, Alan Morrison and I had to perform this text as well as elicit discussion of its chief features -- which turned out to be quite an amusing experience for all concerned. Whether for this reason or not, this session was very lively raising a number of fundamental encoding issues in a satisfactory manner. The day concluded with a presentation of the standard TEI document analysis slides, which I gave from paper notes in the absence of anything flashier.

Day 2 was to have begun with the first of three graded exercises on using Author/Editor and Panorama, but it became apparent fairly rapidly that the teaching room allocated for this purpose was simply not set up correctly. Delegates seemed philosophical about this, occupying the time in such characteristically Dutch habits as checking email and strolling down the corridor for a smoke, while we re-arranged the timetable to allow a further hour's installation time. I gave (again from backup slides) the overview of TEI Lite and discussed briefly ways of using it to implement the features we'd identified the previous day. After lunch we divided the group up, with those uninterested in Author Editor being free to go and inspect the various demonstrations of student projects and other software (notably Near and Far and Explorer).

The rest repaired to the PC room and drove out a few hopeful students who thought it was theirs. Alan and I then walked through our Author Editor exercises successfully. In the first, we imported a "plain ASCII" version of the Holmes text, tagging it as a single div, then as a paragraph, then split the paragraphs. This taught use of the surround, remove, split, and joint options. Then we exported the text, looked at it in Write, and reimported it. In the second exercise, we started from a blank new document, and built the skeleton of a full TEI conformant text, using the "Insert markup" option to teach something of the TEI header at the same time. In the third exercise, we imported a version of the tale which Alan and I had tagged in some detail earlier, with all direct speech identified and allocated to speakers. We then showed how this markup could be displayed in Panorama, using previously prepared stylesheets and navigators. This programme of work was fairly exhausting for the 30 or so participants who stayed the course, and entirely so for the two of us trying to give it, but it seemed to go down well. I did finally manage to see some of the work which students at Groningen have been doing -- an illustrated version of the Christmas Carol; a morphologically analysed version of Hiawatha -- but not as much as I'd have liked.

I was greatly impressed by the energy and enthusiasm of the Workshop participants. They were very tolerant of the technical mishaps which prevented the event from being as great a success as it deserved to be. The local team, particularly the students, also deserve commendation for their great resourcefulness in the face of adversity. I hope that Surfnet will be willing to sponsor another such event in the near future.

2.15. Oh no, it's another dining report
[Memoria project meeting etc.]
Institute for Computational Linguistics, Pisa
5-7 March 1995

The Memoria Project's latest meeting was in Pisa on Monday and Tuesday this week, but I decided to go early, partly to save the project 250 pounds in airfare, and partly in order to discuss with the indispensable A. Zampolli the chance of getting more funds for the TEI from Brussells under a call for proposals due to expire on 15 March (gak).

I took a fly (as Alitalia likes to call them) from Gatwick at 0945 (which meant getting the 0558 train from Oxford) on Saturday, found the leaning tower still unfallen, and basking in spring sunshine, and Italy just as it should be. In the evening, after some prevarication, Zampolli and I managed to dine and dine well at the Trattoria opposite my hotel, but I was concentrating so much on the need to sort out the politics of the TEI proposal that I don't remember any thing except some truly unforgettable gnocchettini with cheese and basil sauce, and a piece of fish with rather more olives attached than I had bargained for.

Sunday I went to Florence, partly for touristic reasons, but mainly to persuade my old friend Dr A to write the BNC Handbook for me, which I duly did over an excellent lunch in an obscure, but very crowded fiaschetteria. This was the sort of Italian eating establishment I like -- hams hanging from the ceiling, elbow-to-elbow diners all shouting at the top of their voices, small children misbehaving with indulgent grandparents, plates of food and bottles of chianti flying everywhere. You eat what's on offer, and don't regret it. I had a thick vegetable soup, a plate of prosciutto, and what would have been a monstrously thick pork chop had it not been beef of a most succulent and delicious nature. To walk it off, Dr A and I staggered through the Boboli Gardens up to the Belvedere commanding magnificent views over the town, before calling at someone's house for tea and cake as the sun set, and we went our separate ways.

On Monday, the Memoria meeting proper took place: Catherine Toussaint from CAP who has taken charge of the project summarised what had happened at the Project Review meeting in Luxembourg last week (not a lot) and we discussed what exactly should go into Work package 4 (quite a bit). Lunch was unremarkable, but for dinner we went to my other favourite sort of Italian restaurant -- the kind where the proprietor welcomes you like a long lost relative and proceeds to foist exotic and unlikely dishes on you, saying "but you must taste this" and "what do you think of this?". Such establishments rarely have more than four customers at a time, and are buried in obscure side streets, but well worth the extra walk. I was seduced into trying the home-smoked tunnyfish, followed by just a little oyster-mushroom soup with crostini, followed by a main course composed mostly of small pieces of roast marinated goose, rounded off with a pudding of baked ricotta in lemon sauce, and an aromatic grappa. Exquisite is the word that springs to mind. Also, filling.

Day two of the meeting took place at the ILC in Pisa, and consisted largely of a demo of image processing software by Andrea Bozzi and friends. With the aid of a Quadra 7/50 and a box of transputers, they can figure out where the words in a line of digitized medieval manuscript begin and end and tie them up with a transcription of it. There was also some discussion of what exactly should be presented to the first meeting of the Special Interest group (to be constituted next week), and a very instructive discussion of what exactly Yannick Maignien meant in his description of the Memoria "maquette". Also some bleating from me about how exactly OUCS was supposed to carry out Work Package 3 in the absence of any input from the other partners. Lunch was a modest affair, (pasta in cream sauce, followed by salad and chips) and very late but none the worse for that. The rest of the afternoon I spent trying to connect to the BNC server, with modest sucess, eventually; also trying to extract a copy of the TEI conformant 30 million word corpus of modern Italian -- I nearly got it, but not quite, before returning this evening.

2.16. Portable Documents: visit report
UK TeX User Group: "Portable documents: Acrobat, SGML and TeX"
Bridewell Theatre, London
19 Jan 95

This joint meeting of the UK TeX Users Group and the BCS Electronic Publishing Specialist Group attracted a large and mixed audience of academics, TeX hackers, publishers, and software developers, with representatives from most UK organizations active in the field of electronic publishing and document management. I was expecting rather more disagreement about the relative merits of the various approaches now available for the creation of portable documents; in the event, the path of SGML-based righteousness, with appropriate concessions to the practical merits of PostScript-based systems, was apparently endorsed by the consensus.

First of the seven speakers was David Brailsford from Nottingham University, who described Adobe's Acrobat as "a de facto industry standard". His presentation of exactly how the various components of this product worked together, and could be made to interact with both LaTeX and SGML, was very clear and refreshingly free of hype. The choice of PDF (which is effectively a searchable and structured form of Postscript, in which logical structure and hypertextual links are preserved along with the imaging information) as an archival format was a pragmatic one for journals such as EPodd where fidelity to every detail of presentation was crucial. The availability of a free Acrobat reader was also a plus point. He characterized the difficulties of mapping the logical links of a LaTeX or SGML document on to the physical links instantiated in a PDF document as a classic case of the importance of "late binding", and revealed the open secret that Adobe's free PDF reader would soon be upgraded to recognise and act on HTML-style anchors. A demonstration of the Acrobat-based electronic journal project CAJUN is already available online at http://quill.cs.nott.ac.uk.

David Barron, from Southampton, gave an excellent overview of what exactly is implied by the phrase "portable document". Documents are not files, but compound objects, combining text, images, time-based media. There is a growing awareness that electronic resources should be regarded as virtual documents, repositories of information from which many different actual documents may be generated. These developments all make "portability" (defined as the ability to render documents -- with varying degrees of visual fidelity -- in different hardware or software environments) very difficult. Portability was of crucial importance, not only for publishers wishing to distribute in the electronic medium, and not only for specific user communities wishing to pool information, but also for all of us. Information available only in a non-portable electronic form was information at the mercy of technological change. He cited as portability success stories the widespread use of PostScript and LaTeX as a distribution medium by the research community, referring to the Physics preprint library at Los Alamos as a case where this had now become the normal method of publication. By contrast, the success of the World Wide Web seemed to be partly due to its use of a single markup language (HTML) which effectively takes rendering concerns entirely out of the hands of authors. From the archival point of view, however, none of the available standards seemed a natural winner: hypertext was still too immature a technology, and there were still many intractable problems in handling multiple fonts and character sets. Professor Barron concluded with a brief summary of the merits of SGML as providing a formal, verifiable and portable definition for a document's structure, mentioning in passing that Southampton are developing a TEI-based document archive with conversion tools going in both directions betweeen SGML and RTF, and SGML and LaTex. Looking to the future, he saw the IBM/Apple Opendoc architecture as offering the promise of genuinely portable dynamic documents, which could be archived in an SGML form once static.

The third speaker of the morning, Jonathan Fine, began by insisting that the spaces between words were almost as important as the words themselves. I felt that he wasted rather a lot of his time on this point, as he did later on explaining how to pronounce "TeX" (surely unnecessary for this audience) before finally describing a product he is developing called "Simsim" (Arabic for sesame, which is a trademark of British Petroleum we learned). This appears to be a set of TeX macros for formatting SGML documents directly, using components of the ESIS to drive the formatter, but I did not come away with any clear sense of how his approach differed from that already fairly widely used elsewhere.

Peter Flynn, from University College Cork, did his usual excellent job of introducing the Wondrous Web World, focussing inevitably on some of its shortcomings from the wider SGML perspective, while holding out the promise that there is a real awareness of the need to address them. What the Web does best, in addition to storage and display of portable documents, is to provide ways of hypertextually linking them. Its success raises important and difficult issues about the nature of publishing in the electronic age: who should control the content and appearance of documents -- the user, the browser vendor, or the originator? Publishing on the Web also raises a whole range of fundamental and so far unresolved problems in the area of intellectual property rights, despite the availability of effective authentication and charging mechanisms. He highlighted some well-known "attitude" problems -- not only are most existing HTML documents invalid, but no-one really cares -- and concluded that the availability of better browsers, capable of handling more sophisticated DTDs, needed to be combined with better training of the Web community for these to be resolved.

The three remaining presentations, we were told after a somewhat spartan lunch, would focus on the real world, which seemed a little harsh on the previous speakers. Geeti Granger from John Wiley described the effect on a hard-pressed production department of going over to the use of SGML in the creation of an eight volume Chemical Encyclopaedia. Her main conclusions appeared to be that it had necessitated more managerial involvement than anticipated, largely because of the increased complexity of the production process. She attributed this partly to the need for document analysis, proper data flow procedures, progress reports etc., though why these should be a consequence of using SGML I did not fully understand. More persuasively, she reported the difficulty the project had had in finding SGML-aware suppliers, in designing a DTD in advance of the material it described, in agreeing on an appropriate level of encoding and in getting good quality typeset output.

Martin Kay, from Elsevier, described in some detail the rationale and operation of the Computer Aided Production system used for Elsevier's extensive stable of academic journals. Authors are encouraged to submit material in a variety of electronic forms, including LaTeX, for which Elsevier provide a generic style sheet. Other formats are converted and edited using an inhouse SGML-aware system (apparently implemented in WordPerfect 5, though I may have misheard this). This uses their own dtd, based on Majour, with extensions for maths, which seemed to be a major source of difficulty. Documents will be archived in SGML or PDF in something called an electronic warehouse, of which no details were vouchsafed. Both PDF and SGML were seen as entirely appropriate formats for online journals, CD-ROM and other forms of electronic delivery. The advantages of SGML lay in its independence of the vagaries of technological development, and its greater potential. However, potential benefits always had to be weighed against current costs; like any other business, Elsevier was not interested in experimentation for its own sake.

The last speaker was Michael Popham, formerly of the SGML Project at Exeter, and now of the CTI Centre for Textual Studies at Oxford. His presentation did a fairly thorough demolition job on the popular notion that there is still not much SGML-aware software in the world, starting with a useful overview of the SGML context -- the ways in which SGML tools might fit into particular parts of an enterprise -- and then listing a number of key products organized by category. It was nice to hear the names of so many real SGML products (auto-taggers, authoring aids, page layout systems, transformation tools, document management systems, browsers and parsers) being aired, after a long day obsessed by Acrobat and LaTex. He concluded with a useful list of places where up-to-date product information can be found, and a reminder that the field is rapidly expanding, with new tools appearing all the time.

The day concluded with an informal panel session, onto which I was press ganged, which effectively prevented me from taking notes, but also gave me the chance to promote the recently-published DynaText version of the TEI Guidelines, which I did shamelessly. I also remember Malcolm Clark asking, tongue firmly in cheek, why everyone couldn't just use Word, and being somewhat agreeably surprised by the number of people in the audience who were able to tell him the answer, and in no uncertain terms. Other topics addressed included auto-tagging, whether maths and formulae should be encoded descriptively or presentationally, whether Microsoft will still be around in the next century, and whether we would ever learn how to format documents for electronic presentation as well as we could on paper.

3. 1991-1994

3.1. Conference on Encoding and Corpora
A visit to Oslo
University of Oslo
14-16 Nov 1994

I was invited to give a number of talks at the University of Oslo as part of a small conference organized by the new inter-faculty Text Laboratory set up there, in collaboration with the Department of English and American studies, but with visiting guests from other linguistics departments at Oslo, and from the Universities of Jykslava (Finland) and Lund (Sweden). The emphasis was on corpus linguistics and encoding; between twenty and thirty staff and research students attended over the three days of the conference.

Willard McCarty from the University of Toronto's Centre for Computing in the Humanities began the first day with a detailed presentation of his forthcoming electronic edition of Ovid's Metamorphoses, which continues to be a fascinating example of just how far the humanities scholar can go with an ad hoc encoding scheme. I then gave the usual rapid canter through the TEI Guidelines, their milieu and architecture, which gave rise to some quite useful discussion before we broke for a substantial lunch. In the afternoon, Willard and I spent some time in the Text Laboratory, trying to install the very first BNC starter set (in my case) and checking email (in his). The Lab has a large Unix fileserver (some kind of DEC machine, since it runs Ultrix), and a room full of Windows and MACs connected to it via ethernet. We saw no-one else trying to use the equipment while we were there, but the Lab has only just begun operations.

On day two of the conference, Willard gave a talk which began promisingly by outlining the history of concordancing and concordances, from the middle ages onwards, but then became an overview of the features of TACT, which did little to improve my opinion of the design of that loose baggy monster of a concordance program. I then gave the usual rapid canter through the BNC, which aroused considerable interest. There were several intelligent questions about the design and construction of the corpus, and the accuracy of its linguistic tagging. I was also able to do my bit for the European Union by pointing out that a "no" vote in Norway might make it more difficult for us to distribute copies of the BNC there (the day before I arrived the Swedish referendum had confirmed Swedish membership; while we were there, rival campaigns on either side of the Norwegian referendum were in full swing).

During the afternoon, Willard and I were (independently) esconced in offices to act as consultants for a couple of hours: I spent most of my time re-assuring a lady from the German department that the TEI really could handle very simple encodings as well as complex ones, and rehearsing with her the TEI solutions to the usual corpus-encoding problems. Oslo is collaborating with Finnish and Swedish linguistic researchers in the development of a set of bilingual corpora (English-Finnish, English-Norwegian, and English-Swedish), so I also spent some time discussing and reviewing the project's proposed usage of the TEI Guidelines. Bergen and Oslo have developed a procedure for automatically aligning parallel texts in English and Norwegian, which appears to work reasonably well, perhaps because the languages are not so dissimilar. I rather doubt whether automatic alignment of English and Finnish will be as easy, but the Finns seemed quite cheerful about the prospect. In the evening we were taken out for a traditional Norwegian Christmas dinner, comprising rotten fish, old potatoes, and boiled smoked sheep's head, washed down with lots of akvavit: not as nasty as it sounds, but twice as filling.

The final day of the conference began with an excellent talk by Doug Biber, from Northern Arizona University, describing the use of factor analysis in the identification of register within a large corpus of materials in three languages (English, Korean and Somali). Biber's use of statistics is persuasive and undogmatic; the basic method was outlined in his book on speech and writing (1988) but its application to cross-linguistic (or diachronic) corpora is new and provoked considerable discussion.

This was followed by my swan song at the conference, a real seat-of-the-pants nail-biting event, being my first ever attempt to describe and then demonstrate the BNC retrieval software running (on Willard's laptop) live and in real time. As a result of careful pre-selection and late night rehearsal, I'm relieved to say that the software did not crash once, though my ability to control Willard's laptop's track-ball in public was frankly pitiful. SARA herself attracted favourable reaction, in particular because of the system design. Interest was expressed in the idea of extending her functionality to cope with the display and searching of parallel TEI-encoded corpora: not a task I think we will be undertaking ourselves in the near future.

This was a relaxed but far from vacuous three days, with ample opportunity for discussion and debate in pleasant surroundings. Sincere thanks are due to my host, Stig Johansson, and his department for arranging it and funding my participation.

3.2. University of Cork
TEI-WWW workshop

As originally proposed at ACH-ALLC in Washington earlier this year, Peter Flynn of the Curia Project at the University of Cork organized a two day meeting with a general view of creating dialogue between the TEI and the developers of World Wide Web, one of the most rapidly growing computer systems since the Internet itself. WWW is a distributed hypertext system running at some improbably large number of sites worldwide, which uses a very simple SGML tagset called HTML (it has been rather unkindly characterized as "Pidgin-SGML"). WWW itself consists of a markup language (HTML), a set of Internet protocols (FTP, HTTP etc) and a naming scheme for objects or resources (the "Universal Resource Locator" or URL). A number of browsers are now available which use these components. Mosaic, developed at NCSA, is probably the most impressive: running on Mac, X and Windows it offers a fully graphical interface with just about everything current technology can support. Lynx, developed at the Computer Science dept at U of Kansas, is at the opposite extreme, assuming only a VT100 (there is also a WWW-mode for EMACS!). I will not attempt here to describe WWW in operation. Web browsers are freely available by anonymous FTP all over the place: if you haven't tried it out already, and can't see what all the fuss is about, then you should stop reading now, get yourself a browser and do so forthwith.

The two day meeting was attended by Chris Wilson (NCSA); Lou Montulli (Lynx, U Kansas); Bill Perry (EMACS, Indiana University); Dave Ragget (Hewlett-Packard; HTML+) and myself for TEI. Various representatives of the Curia project, notably Patricia Kelly from the Royal Irish Academy, were also present. I gave a short presentation about the TEI, focussing mostly on contextual issues but also including some detailed technical stuff about bases and toppings and X-pointer syntax, which seemed to be well received. Dave Ragget then talked us through the current HTML+ draft which started off a very wide ranging discussion. This continued during the second day of the meeting, but was at least partially nailed down in the shape of a brief report (see below) which should be somewhere in the Web by the time you read this one.

To their credit, most WWW people seem painfully aware of the limitations of the current HTML specification, which was very much an experimental dtd hacked together in haste and ignorance of the finer points of SGML. (or indeed the blunter ones). HTML+, which Dave Raggett has been working on for the last year or so, attempts to improve on it without sacrificing too much of its flexibility. This draft will eventually progress to Internet RFC status; there is also talk of an IETF working group co-chaired by Ragget and Tim Berners-Lee (of CERN; onlie begettor of the Web) to steer this process through.

The Cork meeting was an interesting opportunity for the developers of three of the major Web browsers to meet face to face and argue over some of the design decisions implicit in the HTML+ spec. To some extent this did happen, though the discussion was rather anarchic and unstructured. It was also a good opportunity for the TEI to encourage development of HTML+ in a TEI convergent manner, and this I think was achieved. Several of the changes accepted, at least in principle, will make it much easier to transform TEI documents into HTML, if not vice versa. Some practical issues about how WWW should handle TEI conformant documents were also resolved.

Outside the meeting, this was also a good opportunity to find out more about the Curia project itself. My hasty assessment is that this project has still some way to go. There is a clear awareness of the many different ways in which it could develop, and a tremendous enthusiasm. I think the project would benefit from some detailed TEI consultancy before too much more P1-conformant material is created. It also offers interesting contrastive opportunities with other corpus-building activities, chiefly because of its enormous diachronic spread, and its polyglot nature.

Lou Burnard, Cork, 21 Nov 93

========= Concluding statement of the WWW/TEI Meeting follows ==========

<!-- This uses the HTML dtd --> <title>WWW/TEI Meeting</title> <h1>Notes from WWW/TEI Meeting</h1> <h3>Action Items/Recommendations</h3> <list> <li>HTML 1.0 should be documented to define the behavior of existing browsers, and should be frozen as agreed upon at the WWW Developers' Conference. <list> <li>Features to be documented, implemented and specified include collapsing spaces, underline, alt attribute, BR, HR, ISMAP... <li>HTML IETF spec needs to be updated by CERN, as well as existing documentation </list> <li>HTML+ future browsers need not support HTML 1.0 features after a reasonable amount of time. As an aid in transition, the HTML+ spec/DTD will not include any deprecated features of HTML 1.0. <list> <li>HTML 1.0 deprecated features <list> <li>nextid <li>method, rel, rev, effect from &lt;A&gt; tag (but not from the &lt;LINK&gt; tag) <li>blockquote --&gt; quote <li>There was a feeling that the &lt;img&gt; tag will be superceded by the &lt;fig&gt; tag, although its deprecation was not agreed upon. <li>menu list --&gt; ul <li>dir list--&gt; ul </list> </list> <li>The intention of HTML+ is to support generic SGML-compliant authoring tools, and authors are recommended to use this software with the HTML+ DTD for the creation/maintenance of documents. <li>Browsers may implement different levels of HTML+ conformance. <list> <li>Level 0 implementation <list> <li>HTML 1.0 spec referenced above </list> <li>Level 1 implementation <list> <li>Partial fill-out forms <li>New entity definitions (in section 5.1 of HTML+ draft) </list> <li>Level 2 implementation <list> <li>Additional presentation tags (sub, sup, strike) &amp; logical emphasis <li>Full forms support (incl. type checking) <li>Generic emphasis tag </list> <li>Level 3 implementation <list> <li>Figures <li>NOTEs and admonishments </list> <li>Further levels to be specified </list> <li>Authoring tools are expected to conform to the HTML+ DTD and are <b>NOT</b> to support deprecated features. <li>We expect the HTML+ DTD to be developed incrementally. The HTML+ internet draft will make clear which features are now stable and which are still subject to change. The DTD will be structured to reflect this. <ol> <li>HTML+ will work with the SGML reference concrete syntax. <li>The entity sets will be user-specifiable (in the long run). <li>HTML+ will support nested divisions or containers. <li>There will be a number of new features <dl> <dt><b>Figures & Images</b> <dd>&lt;fig&gt; may be able to subsume the role of &lt;img&gt;. <dt><b>Generic highlighting tag</b> <dd>The &lt;em&gt; tag will be used with a set of three or four defined attributes to present a guaranteed-distinct presentation of these attributes. <dt><b>Generic roles</b> <dt><b>Support for undefined elements</b> (user extensions) (render) <dd> <dt><b>Tables</b> <dd>This is now stable. <dt>Math</b> <dd>for research </dl> </ol> <li>HTML/TEI <list> <li>It was felt the correct way to convert between TEI and HTML was to do it on the server side using a conversion filter. <li>This server will also provide a hypertext link to download the raw TEI text. <li>We (WWW developers and TEI people) will strive together to converge functionality between HTML* and TEI, as well as to produce this server/filter system. </list> <li>Links to: <list> <li>HTML spec <li>HTML+ spec <li><ref target="http://curia.ucc.ie/curia/doc/tei.html">TEI overview</a> </list> </list>

3.3. EAGLES meeting
Paris
Oct 1993

One benefit accruing from my attendance at the Eagles meeting in Paris is that I got some more information about what exactly is supposed to be going on in Eagles. I also doubled the size of my collection of NERC reports by acquiring a copy of a very sensible TEI review produced by Dominique Vignaud and Pierre Lafon.

There are five EAGLES groups: on Resources, Language Corpora, Speech, the Lexicon, and Evaluation & Assessment (with which Henry Thompson is involved). The group on language corpora has five subgroups. One is chaired by John Sinclair on text typology, and I have no information on its progress other than verbal rumblings from Birmingham. One is on linguistic annotation and is chaired by Geoffrey Leech who has circulated a detailed work paper which rather overlaps the work presented in AI2 W2, but has some interesting differences. A third is supposed to address something called "text representation issues" and is chaired by Gregoire Clemencin of GSI/ERLI; the Paris meeting was called to discuss a work paper submitted to this group by Nancy Ide and Jean Veronis. There seemed to be some doubt about the responsibilities of the two remaining subgroups: one concerns documentation and distribution, for which Pisa is responsible; the other concerns "tools" for which Wolf Paprotte at Muenster is fingered.

Members of the TR subgroup are Veronis, Clemencin, Ramesh Krishnamurthy from Birmingham and Henry Thompson from Edinburgh. Nancy Ide was also present at the meeting as a TEI spokesperson, officially; Ramesh was unable to attend; Antonio was also unable to attend, through illness (He did however speak to us from his hospital bed by telephone during the afternoon; the word is that he should be back at work next week.)

EAGLES groups have two years in which to produce their recommendations; a mid term report (to be jointly edited by Nicoletta Calzolari and Jock McNaught) is due in April 1994, with final reports one year later. I'm not sure whether this report is supposed to cover all Eagles activities, or just those of the Corpus group, but believe the latter. The TR subgroup plans to complete its input to this report by February, and the document previously circulated by NI/JV was a first stab at producing draft for it.

The task of this workgroup was to provide a set of recommendations for the encoding of corpora, based on a review of the TEI. Most of the meeting was spent reviewing some of the formal decisions already taken by the TEI with a view to generality of application; I will not repeat them here since I am sure they will also appear in the minutes of the meeting. The chief point made and made frequently was that it was up to this workgroup to pin down those generalities and recommend specific solutions. There was a clear understanding of the extension and modification mechanisms provided within the TEI framework, and an evident willingness to apply them to the task in hand. It was less clear whether EAGLES would decide to propose a single 'boiled-down' TEI dtd, or instead a package of customizations of the full TEI.2 dtd.

It was at least tentatively agreed that Eagles-conformance implied two things: adherence to a specific set of editorial principles (HT likes to call these "invariants") governing such things as punctuation, normalization etc., as documented in the TEI <editDecls>, or a list of such specified sorts. And secondly use of an SGML dtd which could be derived in a TEI conformant way from TEI.3. The perceived value of sgml was in the validation it offered, which goes some way to explain the concern expressed in the NI/JV document about "polysematic combinatorics" etc.

Unfortunately I had to leave the meeting before it finished (someone was on strike at Roissy) so I do not yet know whether (or how) the work group plans to descend from these general preliminaries to the tough task of actually deciding whether to recommend <div> or <div1> or either, what sort of <w> tag to use and so on. With my BNC hat on, it seems to me that we could possibly offer some help in this respect; with my TEI hat on, it seems to me essential to monitor closely the work of this group, as it tests the viability of the TEI scheme in a variety of new application areas.

3.4. NSC93: Second Network Services Conference
Warsaw
October 1993

The Network Services Conference is an annual get together organized by the European Association of Research Networks (EARN, aka the European end of Bitnet). I was invited to the first NSC, held last year in Pisa, but was unable to attend; I therefore felt morally bound to accept the invitation to this year's conference, held last week in Warsaw. Poland has recently (in the last six months) become a full member of EARN and has good connexions with the outside world, though not, as yet, much of an internal communications infrastructure. For the duration of this conference however, happy networking nerds from the whole of Europe, including many from countries of the former Soviet Union, were to be found pumping email messages from the room full of terminals thoughtfully provided by a number of local computer firms, through Polands new's 60 megabit gateway and thence to homes in Georgia, Russia, Belarus, the Ukraine, Germany, Italy, France, the Netherlands, the UK and even the US.

The need for "quality" networked information was one recurrent motif of the conference; another (intimately linked with it by some) was the need for commercialization of network facilities. To quote the first speaker I heard: "we need to find a way for people to be paid for putting information on the internet". By contrast, there was mercifully little gung-ho technobabble, though the closing speaker did assure us that "every human being on the planet" would have access to the network by the end of this century, and that future conferences would require the use of virtual reality helmets. In its place, there was a surprisingly stubborn anglo-saxon stress on "bringing services to end-users".

The session on training began with a presentation on the UK's mailbase service, funded by JISC and hosted at Newcastle: its un-surprising conclusion was that electronic communities need a lot of hand-nurturing and training, a theme picked up by another speaker from Newcastle, who described an ITTI project which is producing generic training materials for network awareness. How not to do it was well exemplified by the third paper in this session - which described a protoype international EARN helpdesk: having relied on member states to provide publicity for its services, this was singularly underused.

Favourite software of the conference was undoubtedly Mosaic, which was demonstrated live in a bravura display of networking nerve by Robert Cailliau from CERN: pointing and clicking from Honolulu to Oslo to Israel before our very eyes. Mosaic is also probably the nearest thing we currently have to the "User network interface to everything (UNITE)" which was proposed by an equally charismatic speaker from Northern Ireland, while mention of it also elicited my favourite mixed metaphor of the conference from the speaker (George Brett) who said that although it was essential to keep pushing the envelope and going in at the deepend, it was necessary at the same time to keep the VT100 user on board (who is this user? I think we should be told).

Yet another Newcastle speaker (Jill Foster) drew a rather neat analogy between the current state of network publishing and the Sorcerer's Apprentice episode in Walt Disney's Fantasia. In this analogy, the techies currently trying to cope with the unmanageable explosion of networked information are Mickey Mouse contending with all those mops and buckets of water, while the librarians are the sorcerer deus ex machina who will finally restore order, by applying their classificatory, cataloguing, filtering, and archiving skills.

Other sessions I attended included a German speaker on something called UDINE which sounded like another universal interface, but with the added attraction that it would take advantage of ATM (nothing to do with holes in the wall, this stands for 'asynchronous transfer mode' and is apparently the very next thing in broadband ISDN). A Norwegian speaker described an interesting experiment in creating a distributed archive of photographs. An American speaker droned on about the problems of running a Listserv discussion group. In the same session, I missed (unfortunately) hearing the legendary Eric Thomas who was promoting a new user guide explaining why Listserv is the true and only path to righteous network communication. Rich Giordano flew in from Manchester just in time to give a thoughtful paper about the need for interactivity in information generating communities, and how poorly current models of electronic libraries supported it, which had the unusual distinction of citing some real evidence.

I spoke third in a session on "publishing on the network", shared with Terry Morrow (who runs the BIDS service at Bath) and Ann Mumford (who chairs the UK SGML Project and is a CGM whizz). The SGML banner, in particular the TEI flavour, seemed to go down reasonably well -- everyone agreed that the best way of improving the "quality" of networked information was to improve the richness of its markup.

On the whole, however, I did not find the papers presented here particularly impressive. The real business of the conference was evidently being conducted in a number of official and not so official networking special interest groups and workgroups meeting there. Not being a member of RARE or IETF or ISUS or others, there wasn't much opportunity for me to participate in these politickings; I was also disadvantaged by the fact that I had been placed in a monstrous hotel some 20 minutes walk from the conference venue. I did however spend some time impressing on George Brett (from CNIDR) the importance of the TEI Header as a starting point for one Internet Task Force which is supposedly coming up with proposals to improve Gopher's usability by providing associated documentatioon files. If the TEI Header can't be used for that purpose, there's something seriously wrong with it -- and likewise, if the IETF Task Force so charged completely ignores the TEI proposals. This is something which perhaps others in the Steering Committee with the ear of CNIDR might like to pursue (this means you Susan H!).

Secondly I spent some time bending the ear of Robert Cailliau (apparently a largeish cheese --or spider?-- in the World Wide Web) on the subject of what's wrong with HTML. Unlike most HTMLers I've met, this one at least was ready to listen, and something may just come of it.

Lou

3.5. SALT Club meeting
[London]
January 1993

The SALT Club is an informal group of researchers, from both academic and commercial sectors, with a common interest in Speech and Language Technology (nothing to do with Strategic Arms Limitation). It is organised and minimally funded by the UK Government's Department of Trade and Industry, as a side effect of the funding it provides for research work in this area together with the Science and Engineering Research Council (SERC), under something called the Joint Framework for Information technology (JFIT). Aside from a useful bulletin distributed to JFIT-funded projects, SALT organizes an annual workshop and provides a nexus of useful contacts. It also constitutes an expert group with some influence on government priorities for research funding in this area.

This year's SALT Club Workshop was concerned with Multimedia. Under the rather grand title "Paradigm Shift in Speech and Language Technology: Integrating with other Media", this two-day event combined a series of carefully-chosen plenary presentations and a number of small working group sessions, collectively addressing the issue of what, if anything, there might be of interest to SALT hackers in multimedia and vice versa.

The workshop began with a presentation from Graham Howard about the US Art of Memory project, which (due to the hostility of London Transport) I missed. I did however arrive in time to enjoy Barry Arons' (MIT Media Lab) round-up of work concerned with hypertext-like ways of interacting with recorded speech, ranging from intelligent telephone answering machines to 'virtual conversations'. The technically interesting part of this concerned the design of an appropriate user interface for a database of sound bites accessible by voice only. The database consisted of 13 minutes of monologue from five speakers, gathered over the telephone, which Arons had analysed into a network of some 80 discrete nodes, linked by 750 links of various types (e.g. 'summary', 'detail', 'supporting argument' etc.). It could be queried using a simple 17 word command language. The technology was impressive and sophisticated but mechanically organized snippets of spoken language cannot really be said to be conversations, although Arons did make passing reference to Grice's work in discourse analysis as providing useful models for how 'reports' from such a database might be organized.

This was followed by a more low-key presentation from Paola Fabrizi of the RNIB about the Electronic Daily Newspaper. I found this easily the most impressive project described in terms of real world applications of multimedia. Every night while the compositors are busy setting up the printed text of the Guardian, an electronic version of the same text in a specially encrypted form is broadcast via the unusued lines in teletext transmissions. Subscribers download the text into their pcs overnight, and can then `read' the paper at the same time as their sighted nbeighbours, using a voice synthesizer, large character display or transitory braille display. Paola demonstrated the system with a bottom end of the market speech synthesiser which came as somewhat of a shock after the realism of MIT's synthetic voice, but the simplicity of the interface had much to commend it. Users could scroll, and search, switch betweeen paragraph, line, word and letter mode, the latter being useful to spell out unfamiliar words. Selected stories or prtions of them can be pasted to a scrapbook etc.

Janet Cauldwell (OUP) presented the electronic OED of which any further praise would be otiose. I shall therefore remark only that this product is so wonderful its presentation and marketing could be safely left to idiots.

After lunch, we were organized into a five work groups, on a variety of occasionally overlapping topics (Organizational structures within language; database interfaces; use of nonlinguistic information in linguistics; educational applications; communications facilities). Each group had an assigned leader, a small number of thought-provoking presentations and a rapporteur charged with making proposals for ways in which new SALT-related activities could be realigned within a multimedia paradign. I joined the first group, and enjoyed brief presentations from Adam Kilgariff and Lynne Pemberton, both, in different ways, concerned with ways of structuring complex texts. The group invented an interesting application area: an intelligent agent capable of summarising or expanding the information content of individual components of a multimedia system as well as identifying and categorising the links between them. I waxed lyrical on the wonders of HyTime, TEI etc, and found myself appointed rapporteur for my pains.

Day two began with a brief hectoring from Gerry Gavigan of the DTI, the gist of which was that the SALT progrmme was likely to be axed if there was not a more enthusiastic response to its next call for proposals, provoking a certain amount of muttering about the difficulty of persuading hard-pressed commercial partners to join and the complexities of the application procedure. This was followed by a video presentation displaying results from an ESPRIT project called MMI-squared (for MultiModal Interfaces for Man-Machine Interaction) which I found slick but curiously uninventive. The previous day's workgroups then resumed.

Two further plenary presentations addressed topics of some relevance to TEI interests: one, from Arnold Smith (SRI) discussed the ubiquity of modelling techniques, (in dbms, CAD, spreadsheets, robotics etc) and the need to achieve mapping between them, which, he opined, could be done by something called "abductive equivalential translation", of which I understood just about enough to see similarities with the use of SGML: translating models clearly requires the addressing of semantic issues rather than simple data format comnversion. It also facilitates re-use of encoded knowledge: as an unexpected application area he mentioned integrated manufacturing systems for which he claimed that language translation methiods were directly applicable. The second, from John Lee, (Edinburgh) talked at some length on the different communicative aspects of different modalities and appropriate bases for choosing amongst them, for which he proposed 'specivity' as an interesting metric. Like Smith (and others) he saw the significant contribution that SALT could makein terms of the application of a discourse model.

After lunch the rapporteurs were invited to present the findings of their groups. These are to be edited up into a report for the DTI, which I will circulate to anyone interested when it is completed. The information handling group had seemingly decided that linguistic description was a better way of querying large databases (specific instances included images and distributed dbs) than graphical methods. Re-expression of queries in different modes was a good way of checking that they were being correctly interpreted. Their new project was a voice input/recognition system for attaching annotation to large quantities of images or paperwork. The Education workgroup, on which Nick Ostler reported, had surveyed a number of interesting applications, but had chiefly highlighted a checklist of the problems of multimedia developmt (high cost, dangers of rigidity, difficulty in carryover, copyright problems etc). The workgroup on 'nonlinguistic issues' (Chris Mellish reported) had also identified a large number of problem areas and research opportunities, from which I recall as particularly interesting the following: combining automatic lip-reading with speech recognition (useful in noisy discos); research on the semantics of body language and gesture; application of linguistic anmalysis to existing graphical interfces etc. The telecoms group, reported on by Martin Crossley (BT), began by positing the widespread availability of a number of emergent technologies (wideband networking into the home, videophones, teleconferencing, teleworking, teleshopping etc). In the short term they predicted speech controlled telephones with more intelligent conversational interfaces; in the medium term, bettwer interactive educational tools (necessary for any expansion of distance learning). In the longer term they foresaw proper multimodal access to distributed computer systems, linguistic interaction with intelligent search agents, ways of intelligent topical indexing of video data.

The workshop wound up, like many others, with discussion of ways in which funding might be obtained for further SALT/Multimedia research, with five possible being proposed. The DARPA model, in which there is a predefined goal for which consortia are invited to compete, with payment contingent on their achieving specified targets, was liked by some as much as it was disliked by others. Alternatives included the notion of a centrally funded institute and more precisely defined contractual arrangements (as in LRE). There was a familiar call for outreach to other related communities. Ostler closed with a review of the current status of the SALT club itself, now no longer bankrolled by DTI, it transpired, but by his own company.

3.6. AIS/Berger-Levrault, Paris
On 4 Sept 92

I travelled to Paris at the invitation of Francois Chahuneau, md of AIS, to see the prototype of the system his company has developed on behalf of the Bibliotheque de France, in connexion with their ambitious PLAO (Poste de Lecteur Assiste par Ordinateur) scheme, on which I reported last year. The prototype is for a scholarly workstation, which brings together a lot of interesting ideas about how scholars interact with electronic text, both in transcribed form and as digital images. Essentially it provides an integrated environment for the management of texts, including their annotation, closely modelled on traditional scholarly practice. The software runs on SPARC stations under X-windows and uses PAT as its main retrieval engine, which means that its performance is very impressive. Texts, both transcribed and in image form, can be organised into (possibly overlapping) typed logical 'zones', annotated and given hypertextual links. Text and text image can be synchronised, though only at a relatively coarse level. The texts themselves are read-only, while annotations and structuring information are dynamic, as is rendering. It uses SGML (of course) though with a very simple dtd based essentially on the use of typed milestones to mark zone boundaries, but can take advantage of whatever markup is present in a text. I had supplied Chahuneau with a TEI-style marked up text which he was able to import directly into the system, with impressive results. The query language used is particularly powerful, and takes full advantage of the structuring capabilities of SGML. The prototype will also be demonstrated at the Waterloo conference next month, where I expect it to arouse considerable interest: it combines much of the functionality of Dynatext with the power of PAT and the user-friendliness of Lector. Licensing and distribution arrangements are not yet clear, but it looks as if it will be considerably cheaper and more 'open' than any of those products. AIS are also planning to release a general-purpose tool kit for converting SGML texts into other formats, known as Balise 2, which will sit on top of the public domain SGML parser sgmls: this looks particularly interesting.

While in Paris I spent an afternoon with Dominique Vignaud who has been commissioned by Quemada and Tournier to assess TEI proposals for the encoding of corpora as the main French contribution to the NERC project. Vignaud is one of France's leading SGML experts (she was responsible for an attempt to create a French version of the AAP standard and also for my favourite expansion of 'SGML' -- Surement Genial Mais Laborieux) and so I was much heartened by her enthusiastic praise for the general design principles of the TEI. We discussed in some detail the difficulty of reconciling the incompatible goals of different research projects within an effective interchange framework. The 'base plus topping' method advocated by the TEI seemed the best theoretical solution, although for practical reasons it seemed likely that a 'lowest common denominator' approach would be followed, with Vignaud proposing as candidate for standardisation a simple subset of TEI recommendations, not dissimilar to that used by the BNC, on which I was also able to bring her up to date.

3.7. Social History: the challenge of Technology
University of Essex

The University of Essex hosted an interesting one day workshop with the title `Social History: the challenge of Technology' on June 1st. It adopted a format new to me, in which ten invited speakers were allowed a maximum of five minutes to highlight key issues in their previously-circulated papers, followed by a a 15 minute commentary on each pair of papers by an invited `discussant' and a general discussion. The programme was carefully arranged to include five complementary pairs of papers, and the whole affair proved remarkably successful in generating a fruitful and stimulating exchange of views amongst the hand- picked participants, most of whom were computing historians, data archivists or similar.

The first session dealt with the creation of `public use data sets' with two speakers (Steven Ruggles from Minnesota and Liam Kennedy from Queens, Belfast) presenting basically similar projects in the integration and harmonisation of large scale existing data sets for re-use, in Ruggles' case derived from US census data over a large number of years; in Kennedy's a whole spectrum of 19th century Irish statistical data. Both called attention to the very practical difficulties of harmonising the differing analytic preconceptions of the original data collectors, while stressing the need to make the datasets (which Ruggles described as a `national treasure') more accessible and user friendly. Prof Michael Anderson (Edinburgh) in commenting, made several practical points from the ESRC viewpoint, which he characterised as sceptical about the usefulness of secondary analyses. He stressed the need to set realistic and achievable targets as well as the importance of making explicit the theoretical basis for combining datasets, the need for long term institutional support and the difficulties of making such data sets user-accessible for casual enquiry. The discussion indicated wide support for these concerns, though Ruggles stoutly defended the general usefulness of public use datasets in social science, claiming that in the few cases where available they were the most widely used resources. Several of the historians present expressed anxiety about the difficulty of linking datasets using different `codebooks'. Kennedy noted the `spurious consistency' of terminology such as `general labour' in occupational classifications; Bob Morris (Edinburgh) pointed out that this variability reflected important historical differences; Dan Greenstein (Glasgow) noting that historical sources provided an `opaque window onto the past' as well as being objects of interest in their own right, stressed the need for explicit data modelling of the researcher's interpretations of them. It occurred to me that many of these problems were strikingly similar to those faced by linguists trying to unify different linguistic annotation schemes and might therefore benefit from the kind of approaches currently being discussed within TEI AI1.

The second session dealt with the need for standardisation in data collection and interchange. My paper gave a brief summary of the TEI, stressing its attempts to avoid prescriptivism, and promising great things for SGML as a powerful notation system. Manfred Thaller (Gottingen) distinguished four levels of description appropriate to historical material: numerical data, factual data, running text and bit-mapped images. Interpretation differed at each level but all had a common core of problems. He shrewdly observed that standardisation was becoming more difficult as researchers tended to define themselves by a particular technology. A successful standard must be descriptive, based on a conceptual analysis rather than any technology, but could only succeed if it was backed up by well-designed and acceptable technology. Responding, Peter Denley (Westfield) stressed that standardisation was not a mere abstraction, and deplored the lack of recognition given in research culture to collaborative effort. While the TEI's proposals needed to be made more accessible to the non-professional, there was a danger that computing historians were about to propose a `third pope' if they continued to ignore the very real and highly relevant progress made by other research communities, while their datasets would be increasingly marginalised. This view seemed to have general acceptance, though some, notably Anderson, insisted that quick and dirty methods would always prevail in the long run.

The third session dealt specifically with the role of data archives in historical research, with papers from Paul de Guchteneire (Steinmetz Archive, Amsterdam) and Hans-Jorgen Marker (Danish Data Archive, Odense) and a particularly useful synthesis of them from Bridget Winstanley (ESRC Archive, Essex). De Guchteneire remarked on the highly skewed usage distribution of archived material (very small numbers of items being used very frequently), the archives' reluctance, through lack of resources, to provide ancillary support facilities, the need to preserve data sets currently being produced by government and other agencies, and on the need to formalise citation of datasets. Marker, picking up Thaller's four-fold characterisation of historical datasets, agreed that most archives were really only capable of dealing with survey type material, and that their methods might not be generalisable. Winstanley reiterated the need for standardisation in citing and cataloguing datasets; addressing the palpable discomfort of historians for the social science model of archiving, she stressed the need for proper bibliographic description and control. In the ensuing discussion, I drew attention to the TEI's recommendations for bibliographic description, and noted the interesting contradiction between an archive's dual role as repository (foregrounding idiosyncrasy) and as source of reusable resources (foregrounding integration). Anderson made the good point that traditional social science archives were about to be engulfed with the results of `qualitative' surveys which could mean a narrowing of the distinction between textual and numeric data.

In the fourth session, dealing with the use and potential of online historical databases, Humphrey Southall (QMW) described the trials and tribulations of providing JANET-wide access to a large Ingres database of information about local labour markets via a Hypercard-based front-end and Heiko Tjalsma (Leiden) a project called Chronos which provides integrated access to a variety of SAS datasets and their documentation. Both talks tended to concentrate overmuch on the technical details of networking, which also dominated the discussion. The rapporteur, Don Spaeth (Glasgow) contrasted traditional mainframe-based online services with the workstation model, the provision of networked access to CD-ROMs etc. and the discussion tended to drift into rather ill-informed speculation about likely technological changes, though Eric Tannenbaum (Essex) (who was unfortunately present only for this session) did make the interesting observation that it would be the need of environmental and other researchers for terabytes of data within seconds which would determine the likely development of new networks.

The last session of the day dealt with the topic of IT in the teaching of history. Frank Colson (Southampton) gave an optimistic acount, based on the impressive success record of his HIDES system, which is designed to complement rather than replace traditional library resources. Deian Hopkin (Aberystwyth) was less sanguine: his analysis concentrated on the impossibility of funding new IT-based teaching methods without proper institutional and financial support. As Rick Trainor (Glasgow) pointed out, the papers were nicely complementary, in that the teaching methods described by one were precisely those which present funding arrangements made it difficult or impossible to provide. Hides had been justly praised for the way it enhanced the traditional teaching role by providing students with the opportunity to carry out systematic analysis of major problems, on their own terms, while the National Curriculum appeared to wish to undermine or undervalue that very potential by using IT as a low-cost way of providing transferable skills. Computer- based teaching was an effective way of bringing a closer rapprochement between teaching and research as well as in breaking artificial discipline-based distinctions. He felt that it was better to focus on the institutional problems preventing its wider acceptance than on the purely technological ones, and that the methodological differences between social science and history had been overstated.

In the following discussion, Morris probably spoke for several when expressing disquiet at the way the morning's discussions of the technical possibilities and the intellectual challenge that they posed seemed to have gone adrift. He feared that technophoria would distract from the fact that only some models of enquiry were well served by IT. For Hopkin, IT was of importance because it enforced a `confrontation with the data' and reminded historians of the need for a methodology to handle that confrontation. Greenstein agreed on the crucial importance of a formalised model of enquiry (for which he gracefully gave credit to the TEI), and noted that without one teaching history (as opposed to transferable skills which could be got anywhere) was very difficult indeed. Spaeth remarked that revolutionary fervour had been the downfall of the quantifying school of historians and agreed that the misuse of IT for its own sake lead to poor teaching practice.

I came away very favourably impressed by the format of the day: tightly-focussed small group discussions can sometimes be a little incestuous, but the programme had been carefully arranged to provoke ample scope for controversy and some fruitful argument had ensued. From the TEI perspective, I was particularly encouraged by the evident willingness of the computing historians to confront standardisation problems at the heart of the TEI agenda, and to restate them in their own terms.

LB June 6th 91

3.8. SGML Update: consultancy, tools, courses'
Amsterdam
16-17 May 1991

The Dutch SGML Users Group hosted a two day international conference in Amsterdam 16-17 May under the general title `SGML Update: consultancy, tools, courses'. This attracted over a hundred delegates, by no means all from the Benelux area, though mostly from European publishing and software houses. There were two keynote speakers (Sperling Martin for the AAP, and myself for the TEI), about a dozen presentations from manufacturers or consultants and a well-arranged software exhibit in which all the major SGML software vendors were represented, with the conspicuous exception of Software Exoterica who had apparently had to withdraw at the last minute. There was ample opportunity for discussion and argument between presentations, over an excellent buffet lunch and in the evenings.

Sperling Martin as one of the chief progenitors of the AAP standard was happy to report that it was now in use by more than 25 major publishers, with a further forty planning to adopt it over the next twelve months. He gave brief overviews of three particularly successful applications on the fringes of conventional publishing. Firstly, the Association for Computing Machinery, which has just developed a five year strategic plan with the AAP standard at the centre of several dozen new print products, on demand reprint facilities, optically stored databases, hypertext products etc. Perhaps more interestingly, the ACM plans to mandate the AAP standard as the interchange format of preference for its army of unpaid professional contributors, reviewers and referees in the future. Secondly, the Society of Automative Engineers, which is adapting the AAP standard for use in something called a `Global Mobility Technology Information Center' or in plainer English, a database of information about all sorts of transport systems. The interesting thing here was the convergence between SGML and object-oriented databases -- as well as manuals of technical information, SGML was being used as the vehicle for data to be transferred directly into CAD/CAM systems. Sperling's third AAP success story was a similarly hybrid development: a new legal database system developed for the Clark Boardman Company, providing integrated information services derived from legal journals, statutes and regulations, a body of case law together with interpretation and annotation, usable by traditional print journals or electronic hypertexts. Of course, the AAP project had not been an unmitigated success: it had begun at a time when SGML was barely established, and some aspects, notably those concerned with maths, formulae and tables have never been finished properly. Moreover, there are a few deliberate errors in the standard, introduced (said Sperling ingenuously) as `reader tests'. He also called attention to some image problems -- all too familiar to TEI ears -- such as the perceived conflict between TeX and SGML, or ODA and SGML, and the intimidating nature of SGML so long as its cause is left to the purists and the evangelists. Looking to the future, Martin predicted an increased awareness of SGML within the library community as a practical means of coping with the explosive growth of published materials, particularly in Science and Medicine. The AAP standard was to be assessed for suitability as a `non-proprietary information exchange vehicle' for electronically networked journals, by the 110-member Association of Research Libraries, under a scheme for which the National Science Foundation had recently provided $0.75m seed funding. His presentation concluded with some sound advice for those developing a strategic business plan in which SGML featured (concentrate on the business asset, don't expect technology to do everything, expect to spend at least $5 a page to get electronically tractable text...) and some predictions for future AAP work. A corrected version of the AAP standard would be re-submitted to ANSI and a summary of needed corrections to the published dtds would appear in EPSIG news at the end of this year.

Seamus McCague gave an impressively detailed description of two practical applications of SGML in work undertaken by his company, ICPC, a fifteen year old Dublin-based specialist typesetting company. One, for Elsevier, involved the production of about 100,000 pages of high quality camera-ready copy from SGML encoded text annually; the other, for Delmar, the conversion of an existing reference book into an electronic resource. Details of the two projects provided interesting contrasts in production methods; they also showed how the SGML solution was equally applicable to two very different scale operations. For Elsevier, the use of SGML greatly simplified both process and quality control, by facilitating the automatic extraction of data for the publisher's control database; for Delmar, it had made possible significant improvements to the product (a drug handbook) by automating the production of a variety of indexes.

Francois Chahuneau of AIS, the thinking man's Antoine de Caunes, gave a characteristically ebullient presentation about the relationship between SGML documents and database systems. He distinguished four characteristic modes of action: simple storage of documents in a database, where typically only a limited amount of header type information is visible to the database; database- driven document extraction, where documents are synthesized from information held in a database as a specialised form of report; tightly coupled systems in which highly volatile document and database systems share information; and the true document database in which all the information and structure of a document are represented by isomorphic database constructs, thus combining the well-understood strengths of database systems in such matters as concurrency control, security and resilience with the flexibility and multiple-indexing capabilities of document processing systems. As examples of this last mode, he then described in some detail two products: his own company's SGML- Search, which is based on PAT, and Electronic Book Technologies' Dynatext, and also demonstrated a beta-test version of the MS- Windows version of the latter. It uses an interesting scripting language based in part on DSSSL, which enables it to be configured to look more or less like anything, whereas SGML Search is command-line driven, using a fairly rebarbative syntax.

The interface between SGML and database systems was also touched on by Jan Grootenhuis of CIRCE, the doyen of Dutch SGML consultancies. Speaking of his experience in teaching SGML, he remarked that people with a typographic background found SGML almost as difficult to understand as people with a computer science background found the requirements of typography, which struck a familiar chord. He then briefly described a recent project in which documents had been converted automatically into an Oracle database, using a database model defined by Han Schouten. The project had shown that database definitions could be automatically generated from a DTD; the complete suite of Oracle manuals, created as Ventura or WordPerfect documents, had been loaded into an Oracle-Freetext database, using SGML as an intermediary. He noted that the tendency of technical writers to use descriptive tagging to bring about formatting effects had made this task unnecessarily difficult, and argued for better enforcement of descriptive standards. He also outlined some experiences in using SGML for CD-ROM publication of journals at Samson, and of legal and other regulations published by the Dutch government, and the updating problems involved. His conclusion was that SGML was now past the point of no return. It was no longer being used in pilot projects only, but as an integral part of real work. Its use was no longer regarded as worthy of comment; moreover, because its evangelists were too busy doing real work to try to publicise it, the task was being taken on by professional teachers and educators.

The first day of the conference concluded with manufacturers' presentations. Tim Toussaint(MID) and Paul Grosso (Arbortext) gave a joint presentation. Toussaint revealed that MID, formerly Dutch and now German, is now 26% French. They used Arbortext as an SGML editor, and Exoterica's XTRAN to convert it for loading into an unspecified relational database. Applications included standard reference works such as the Brockhaus Duden and a database of standards documentation. Grosso gave a good sales pitch for Arbortext, which is a luxuriously appointed SGML editor intended for use primarily in an electronic publishing environment and described as non-intimidating and user-congenial. It includes a specialised WYSYWG editor for tables and formulae from which AAP-conformant marked up text is generated, has good browsing and outlining facilities and its own script language.

Hugo Sleimer, European Sales Director for Verity (a spinoff from Advanced Decision Systems) gave a classy presentation of a product called TOPIC, the only relevance of which seemed to be that it supported a wide variety of document formats, including SGML. Much of his presentation dealt exhaustively with the problems of text retrieval by boolean logic, at a level which did not show much respect for his audience's intelligence. Tibor Tscheke, from Sturtz Electronic Publishing, was due to talk about his company's work in creating an electronic version of the Brockhaus Encyclopedia, but had unfortunately been forbidden to do so by Brockhaus. He was therefore reduced to some generalities about the role of information within an enterprise, the integration of SGML systems into mainstream information processing and so forth, which was a pity.

I opened the second day of the conference by summarising the current status of the TEI and discussing some of the technical problem areas we had so far identified, in particular those raised by historians and linguists for whom any tagging is an interpretation which must be defensible. This being the second time I had done it in two weeks, I managed to get through most of my material within a reasonable approximation to the time allocated me.

Yuri Rubinsky (SoftQuad Inc) gave an entertaining and wide- ranging talk, picking up in passing some of the technical issues I had raised rather than simply presenting a product review, though he did mention in passing (and also demonstrated) that Author/Editor was now available under Windows and Motif as well as for the MAC. The theme of his talk was that SGML could be used to describe more than just documents, and that several of its capabilities were under-used. There was more to an SGML document than its element structure. Among specific examples he mentioned were customised publication, for example by extracting `technical data packages' geared to a specific maintenance task from CALS- compliant documentation in the Navair database; using attribute values to generate documentation at different user levels from a common source; an ingenious use of entity references within `boiler plate text fragments' in General Motors manuals; and the assembly of customised DTDs from sets of DTD fragments by a use of parameter entities strikingly similar to that proposed by the TEI, or by use of marked sections. For the GM application, this approach had reportedly saved the cost of its implementation within six months.

Pamela Gennusa (Database Publishing Systems) also picked up the recurrent theme of this conference: that SGML was uniquely appropriate to database publishing. She gave a good description of the major issues in preparing text for publication in database format and the strengths of SGML as a means of making explicit the information content of texts in a neutral way, which was essential given that authors and consumers had different requirements of it and touching on the problems of security, high volume and time sensitivity which characterise database publishing as an industry. She also gave a good overview of the capabilities of the new version of Datalogics' set of SGML products, notably WriterStation, an impressive authoring tool with several new facilities and DMA (Document Management Architecture) a complex set of object-oriented tools providing database management facilities for SGML material which also includes full text searching facilities like those described earlier by Chahuneau.

Ruud Loth (IBM Netherlands) gave a workmanlike presentation of IBM's SGML product range, which now includes an context sensitive editor for OS/2 called TextWrite, a formatter for VM or MVS called BookMaster and a new range of products called Book Manager to deal with `softcopy books' (IBMese for `electronic texts'). Book manager Build runs under VM and MVS and generates `softcopy' from GML or SGML documents; BookManager Read runs additionally under DOS or OS/2 and has impressive facilities for hypertext- style browsing, intelligent text retrieval, indexing and annotation. IBM documentation (47,000 titles, 9 milliard pages) would soon be available in this new form.

Bruce Wolman of Texcel AS then gave a detailed product description of the Avalanche `FastTag' automatic tagging system which, it is claimed, can handle almost any kind of text and automatically insert usable markup into it. The product has two components, a `visual recognition engine' which searches for visually distinct entities in a document, as defined by a set of rules encoded in a language confusingly called Inspec, and another language, called Louise, which defines the form in which these objects should be encoded. Things like tables, footnotes, horizontal lines, running headers or footers or special control sequences could all be automatically tagged as well as objects defined by regular expressions or specific keywords in the text. The product had just been launched in Europe and was available for MSDOS, VMS, Ultrix and Macintosh.

John Mackenzie Owen of the Dutch consultancy Pandata gave a brief description of the SGML handling capabilities of BasisPlus, stressing however its strengths as a document management system rather than its admittedly limited SGML features. Bev Nichols of Shafstall described the Shafstall-6000, an all-singing all- dancing document conversion system based on a package called CopyMaster which included SGML among its 800,000 claimed `document-to-document' pairings but which (I had the impression) would really rather be operating on a proprietary format called the Shaffstall Document Standard. The last presentation of the day was from Ian Pirie of Yard Software Systems who described the successful Protos project carried out by Sema Group and Pandata for the CEC. The project handled proposals for funding from DG 13 which had to be distributed to member states for comment and the ensuing comments. MarkIt had been used to validate the format of the messages passed in either direction, its regular expression facilities being particularly useful in automatically encoding the content of telex messages, and its application language to encode the messages for storage in a Basis database. The whole operation had been carried out with minimal disruption of the message system.

Aside from the presentations, the conference provided an excellent opportunity to catch up on the expanding world of SGML- aware software. Among products demonstrated were new versions of MarkIt and WriteIt from Sema Group, of Author/Editor from Softquad, Arbortext, Writerstation from Datalogics and an interesting new product, an SGML editor called EASE from a Dutch company called E2S. Delegates were also given a copy of the first fruits from the European Work group on SGML, a consortium of European publishers which has been working on a set of AAP- inspired dtds for scientific journals which took the form of a very well designed and produced booklet documenting a DTD for scientific article headers. I came away from the conference reassured that SGML was alive and well and living somewhere in Europe.

Lou Burnard

Text Encoding Initiative

A postscript to the above

My report on last month's Amsterdam SGML UserGroup conference, recently posted on TEI-L and on comp.text.sgml was, like most such reports, written with timeliness and liveliness as primary objectives, rather than considered sober opinion. Consequently, it contains some phrases which I would certainly not wish to stand as matters of official published record, and also a few inaccuracies that I'd like to correct. I've recently received a letter from Sperling Martin drawing attention to some of these, most of which is quoted below. This is partly a way of expressing gratitude to Sperling for having taken the time to correct my misrepresentations so thoroughly and with such good humour. His reply also provides some fascinating background detail about those rugged pioneering days of the SGML revolution - — I for one would like to know what became of the Atari SGML parser!

Lou Burnard

There are three points about which I want to provide further explanation. The first concerns the "planted" errors in the early AAP DTDs; the second relates to ACM; and, the third pertains to the Association of Research Libraries' activities.

As to the condition of the AAP DTDs, I hope you can recall that I said that the development of what ultimately became the AAP Standard was begun before SGML had even achieved formal ISO draft status. This was done, with obvious risk - - what if ISO had not approved SGML?? Just think were we would all be now!

SGML, of course, was in some form of ANSI evolution from the late 1970's. Fortunately, by the early 80's the core of SGML had reached a fairly solid condition. Much work remained in refining and enhancing that core. And that was the focus of the ANSI/ISO committee efforts during the period 1982-1986. In addition, once the standard had reached a nearly complete form, the ANSI/ISO community moved rather quickly to get SGML through the draft and final approval cycles, saving us on the AAP Project significant embarrassment.

To give you a calendar metric, the AAP project was launched in late 1983. It produced its final report and initial set of DTDs in February 1986 -- about eight months before the officially approved ISO version of SGML. The earliest attempts to use the primitive SGML tools to describe the AAP document structures were useful in getting us headed in the right direction. It was a bit of a juggling act, however, to keep the AAP technical efforts completely synchronous with the evolving SGML. (You folks on the TEI project have it so much easier -- he says truly ingenuously!)

The point I was trying to make was that we on the AAP Project were working with a bit of a moving target. And, in the later AAP project phases, as SGML began to solidify and become more widely circulated as a draft ISO document, its complexity was a bit of an impediment to understanding its richness and utility. Still a problem today.

As we were obligated to share the draft AAP DTDs with a panel of publishing technology and "SGML" experts, we wanted to be certain that the material we were presenting for review was being thoroughly read. Our simple "test" was to plant a few obvious errors in the DTDs to see if our reviewers were paying attention. There were no SGML parsers in 1985 save a very limited "toy" that Charles Goldfarb had built to operate on his home ATARI! That meant that the only way to catch an error was to read the whole DTD character by character.

The result of the review drill was that most errors were caught. In fact, most of the errors were ones that we made by being unfamiliar with SGML applications development and not correctly interpreting SGML's metalinguistic rules. There were even a few instances where we discovered syntactic conflicts in the draft versions of SGML that were subsequently rectified -- contributing to the refinement of SGML. (In its final year, the AAP Project did serve a valuable role in "testing" some aspects of SGML as it too was taking final shape.)

Anyhow, whatever the "planted" errors, and I recall only three egregious instances, they were removed long before the AAP DTDs received any form of wide circulation. The more important issue today, that I apparently did not make clear, was the revision to the AAP DTDs that is now underway to correct errors and ambiguities that were unintentional. And, as you reported, revised, corrected versions of the DTDs are likely to be available later this year.

While on the AAP theme, let me add that the DTDs for math and tabular material are not quite as rife with problems as you may have thought I suggested. The AAP math material has been very useful in many commercial publishing applications that are alive and well. The TeX vs SGML debate continues apace independently of the AAP Standard. The tabular component of the AAP Standard has seen even wider use.

The SAE project that I described, uses the AAP tabular approach as the basis of the engineering data tables that are part of their aerospace and ground vehicle standards publications. Many CALS compliant applications have used the AAP tabular material approach. What I was trying to emphasize was that work remains in improving the AAP math and tabular components. EPSIG is now soliciting comments and suggestions about those components. I hope that those who were at the Amsterdam meeting who are interested in contributing ideas, will respond to the solicitation.

The one point about my description of the ACM project that needs clarification concerns the use of the AAP standard for manuscript submissiomn. I am certain that I didn't say that for electronic submissions, ACM will mandate the use of an SGML application. The overhead that I used and was part of my handout, clearly shows that full SGML application tagging will occur at the receiving end -- that is at the ACM headquarters. SGML application tagging can certainly be done by the authors and editors, but it will not be mandated. There are basic guidelines that are suggested for electronic submissions, but they cannot be followed without requiring authors to do comprehensive source document tagging. As the ACM project evolves and suitable tools become more prevalent, the groundrules for electronic submissions can be expected to change.

Finally, about the Association of Research Libraries activitis, the electronic journal effort that I described is a very recent initiative. Any "seed" funding from the National Science Foundation or others has not yet been established. The only thing that should be reported is that project funding will be addressed upon completion of the formal project plan -- and that is still to be completed. At this stage it is assumed that the technical basis for the collaborative information interchange will likely build on widely accepted standards, including the AAP Standard.

3.9. ICAME Conference Report
Craiglands Hotel, Ilkley, Yorks
6-12 May 1991

ICAME is the annual get together of corpus linguists. This year's, (the twelfth) was hosted by Leeds University at a rather nice decayed Victorian hotel on the edge of Ilkley Moor and enjoyed excellent weather, the usual relaxed atmosphere and the usual extraordinary array of research reports, which can only be very briefly noticed in this report. As usual, there were about 50 invited delegates, most of whom knew each other well, and a few rather bemused looking non-Europeans, notably Mitch Marcus (Penn State) and Louise Guthrie (New Mexico SU). The social programme included an outing to historic Haworth by steam train which, alas, your correspondent had to forgo in order to attend to other TEI business, and large amounts of good Yorkshire cooking, which he did not.

For the first time, the organising committee had included a so- called open day, to which a number of interested parties, supposedly keen to find out what this corpus-linguistics racket was all about, had been invited. As curtain raiser to this event, I was invited to present a TEI status report, which I did at break neck speed, and Jeremy Clear (OUP) to describe the British National Corpus project, which he did at a more relaxed pace. The open day itself included brief presentations from Stig Johansson (Oslo), on the history of ICAME since its foundation in 1973, from Antoinette Renouf (Birmingham) on the basic design problems of corpus building, from Sid Greenbaum (London) on the design and implementation of the new co-operative International Corpus of English project, from Eric Atwell (Leeds) on the kinds of parsing systems which corpus linguistics made possible, from Jan van Aarts (Nijmegen) on the Nijmegen approach to computational linguistics, from John Sinclair (Birmingham) on the revolutionary effect of corpus linguistics on lexicography and on language teaching, from Gerry Knowles (Lancaster) on the particular problems of representing spoken language in a corpus and from Knut Hofland (Bergen) on the technical services provided for ICAME at Bergen. While none of these speakers said anything particularly new, several of them (notably van Aarts, Renouf and Sinclair) managed to convey very well what is distinctive and important about the field. As far as I could tell, most of the ICAME community was a bit dubious about the usefulness of the Open Day. For outsiders wishing to get up to speed on why corpus linguistics is interesting and why it matters however, I would judge it a notable success.

Corpus linguistics is, of course, all about analysing large corpora of real world texts. To do this properly, you probably need a good lexicon, and you will certainly finish up with one, if you do the job properly. Not surprisingly therefore, the conference proper began with a series of papers about electronic lexica of various flavours, ranging from the CELEX database (Richard Piepenbrock, Nijmegen) in which a vast array of information about three languages (Dutch, English and German) is stored in a relational database, to the experimental word-sense lattices traced by Willem Meijs' Amsterdam research teams from the LDOCE definitions. Work based on this, surely by now the most analysed of all mrds, was also described by Jacques Noel (Liege) and by Louise Guthrie (NMSU). The former had been comparing word-senses in Cobuild and LDOCE, while the latter had been trying to distinguish word senses by collocative evidence from the LDOCE definition texts: although well presented and argued, her conclusions were rather unsurprising (highly domain specific texts are easier to disambiguate than the other sort), and to base any conclusions about language in general on the very artificial language of the LDOCE definition texts seems rather dubious.

The traditional ICAME researcher first quantifies some unsuspected pattern of variation in linguistic usage and then speculates as to its causes. Karin Aijmer (Lund), for example, reported on various kinds of `opener's in the 100 or so telephone conversations in the London-Lund Corpus, in an attempt to identify what she called routinisation patterns. In a rather more sophisticated analysis, Bent Altenberg (Lund) reported on a frequency analysis of recurrent word class combinations in the same corpus, and Pieter de Haan (Nijmegen) on patterns of sentence length occurrences within various kinds of written texts.

Although attendance at ICAME is by invitation only, an honourable tradition is to extend that invitation to anyone who is doing something at all related to corpus work, even a mere computer scientist like Jim Cowie (Stirling) who began his very interesting paper on automatic indexing with the heretical assertion that restructing the type of text analysed was essential if you wanted to do anything at all in NLP. The object of his research was to identify birds, plants etc. by means of descriptive fragments of text and his method, which relied on identifying roles for parts of the text as objects, parts, properties and values, both highly suggestive for other lines of research and eminently pragmatic. A similarly esoteric, but only potentially fruitful, line of enquiry was suggested by Eric Atwell's report on some attempts to apply neural networks to the task of linguistic parsing.

Another nice ICAME tradition is the encouragement of young turks and research assistants, who, when not acutely terrified, are often very good at presenting new approaches and techniques. This year's initiates included Simon Botley (Lancaster), who presented a rather dodgy formalism for the representation of anaphoric chains, Paul Gorman (Aberystwyth) who had translated CLAWS2 into ADA and almost persuaded me that this was a good idea, Christine Johansson (Uppsala) who had been comparing `of which' with `whose' - almost certainly not a good idea and Paul Rayson & Andrew Wilson (also Lancaster) who had souped up General Enquirer to do some rather more sophisticated content analysis of market research survey results by using Claws2 to parse it.

Two immaculately designed and presented papers concerned work at the boundary between spoken speech as recorded by an acoustic trace and by transcription: Anne Wichmann (IBM) presented an analysis of `falls' in the London-Lund corpus, a notorious area of disagreement between transcribers. Her elicitation experiment tended to show that there was a perceived continuity between high and low falls which transcribers could not therefore categorise. Gerry Knowles (Lancaster) proposed a model for speech transcription, in which perceived phonemic categories formed an intermediate mapping between text and acoustic data. Speech transcriptions require a compromise between patterns that can be computed from text and interpretations derived from acoustic data.

High spots of the conference for me were the presentations from O'Donoghue (Leeds) and Marcus. If there is anyone around who still doesn't believe in systemic functional grammar, Tim O'Donoghue's presentation should have converted him or her. He reported the results of comparing statistical properties of a set of parse- trees randomly generated from the systemic grammar developed by Fawcett and Tucker for the Polytechnic of Wales Corpus with the parse trees found in the same (hand-)parsed corpus itself. The high degree of semantic knowledge in the grammar was cited to explain some very close correlations while some equally large disparities were attributed to the specialised nature of the texts in the corpus.

Mitch Marcus (Penn State) gave a whirlwind tour of the new burgeoning of corpus linguistics (they call it `stochastic methods') in the US, and made no bones about its opportunistic nature or or funding priorities. Incidentally providing the conference with one of its best jokes, when remarking of the ACL/DCI, the Linguistics Data Consortium etc. "People want to do this work extremely badly, and they need syntactic corpora to do it", he described the methods and design goals of the Penn Treebank project, stressing its engineering aspects and providing some very impressive statistics about its performance.

Several presentations and one evening discussion session concerned the new `International Corpus of English' or ICE project. Laurie Bauer (Victoria University) described its New Zealand component in one presentation, while Chuck Meyer (UMass) described some software developed to tag it (using Interleaf) in another. The most interesting of these however was from And Rosta (London) who is largely responsible for ICE's original and, for my taste, rather baroque encoding scheme: itvtook the form of a detailed point by point comparison between this and the TEI scheme with a view to assessing the possibility of converting between them. The verdict was largely positive, though he identified several points where TEI was lacking, some of which (notably the inability to tag uncertainty of tag assignment and a whole raft of problems in tagging spoken material) should certainly be addressed and all of which provided very useful and constructive criticisms.

There was a general feeling that standardisation of linguistic annotation (which corpus linguists confusingly insist on calling `tagging') was long overdue. Marcus pointed out that the LOB corpus had used 87 different tags for part of speech, LOB had upped this to 135, the new UCREL set had 166 and the London Lund Corpus 197. In Nijmegen, the TOSCA group has an entirely different tagset of around 200 items which has been adopted and, inevitably, increased by the ICE project. It seems to me that someone should at least try to see whether these various tagsets can in fact be harmonised using the TEI recommendations, or at least compared with the draft TEI starter set described in TEI AI1 W2. I also think that someone should at least try to see how successful the feature-structure mechanisms are at dealing with systemic networks of the POW kind.

LB, 14 May 90

3.10. Eurotra Workshop
Luxembourg
18 April 1991

Between 80 and 100 people, from a large variety of European software houses and research centres were invited to this "Information Day", at the European Commission in Luxembourg, the goal of which was to stimulate interest in the CEC's Call for Tenders for the next phase (1992-4) of the Eurotra II Project. I attended in my role as consultant to Sema Group (Brussels), whose presentation was partly based on a report I wrote for them some time ago. Several background papers were provided and are listed at the end of this report; copies are available from me on request.

Background Presentations

After brief initial background presentations from F. De Bruine and Sergei Perschke, placing the Eurotra project in its political and historical contexts respectively, Nino Varile (CEC) gave a brief overview of the last three decades of what is now known as NLP. He stressed chiefly the way in which idiosyncratic procedural systems based on Transformational Grammar had given way to declarative systems based on lexical unification theories. He argued that such systems, being inherently more robust, would speed up progress in MT systems during the 90s, as would the notion of reusable modular lexical and processing resources. The object should be systems of high quality; the days of what Nino disparagingly referred to as ‘do-it-yourself’ NLP were over. Multi-functional resources, shareable between projects, would be the norm.

Roberto Cencione (CEC) then introduced the main business of the day: initial reports from four feasibility studies commissioned by the CEC, each of which had been briefed to investigate a distinct aspect of Eurotra II. Unlike Eurotra I, this would be a kind of ‘NLP workbench’, modular, formalism- independent and capable of evolving to meet new requirements, but currently solidly based in industry standards (POSIX, X/OPEN, WINDOWS, NFS, SQL, SGML...). Each of the four studies had been asked to assess the current prototype, consider existing relevant formalisms, and specify new tools, formalisms or strategies as appropriate. Each had involved collaboration between academic and industrial partners: final reports are not due until July, but initial versions of all were presented at the meeting. Cencione also highlighted Eurotra's drift into professionalism (my phrase): until 1987 all R&D had been in the hands of academics; during the last two years a central software team had taken this over. The next phase, until 1994, would be characterised by turnkey projects carried out under contract. After 1994, cost- sharing projects would become the norm.

Feasibility Studies

The System Architecture

Jörg Schütz (Inst for Applied Information Science, Saarbrücken) picked up the theme of open modular architecture. By contrast with Eurotra I, the new system takes an object-oriented approach. He presented the various layers of the architecture, from man-machine interface down to database storage, by way of user agent and object manager. The latter handles interactions with lexical resources and rules; the former oversees a number of co-operative software agents or ‘toolboxes’, for example for text-handling. A need was identified for a formalism-independent ‘lexical interface representation language’. A speaker from CAP-Gemini gave some further detail of the MMI agent: it would have a distinctive Eurotra ‘look and feel’, but would be built on X/Windows. As for system control -- if all else failed, there would always be Unix.

ET6-1 The Linguistic Formalism

Steve Pullman of SRI International began by noting that the linguistic formalism must be usable for general NLP, in a multilingual as well as a monolingual context, as well as for MT. Other desiderata were that it should be declarative, reversible (i.e. usable for generation as well as analysis) and theory independent. It should have an easily implemented core, to which equally monotonic and declarative extensions could be hooked, and should use ‘typed feature logic’. There was some discussion of the interfaces between the Language Analysis agent (LA) and the user agent's virtual machine (VM) on the one hand and with the Text Handling agent (TH) on the other: LA/VM was well defined: primitive functions included Parse, Generate, Refine (i.e. further transform the output from Generate) and Transfer (i.e. translate), with in each case appropriate parameters such as language or grammar. LA/TH was rather more fuzzy, with some linguistic functions such as morphological analysis being done by TH rather than LA.

ET6-3: The Text Handling Component

Christian Devillers of SEMA reviewed the text- handling design study. This component interfaces the Linguistic Analysis system with real texts, both during input and generation. The study had involved a brief survey of existing office document handling systems, SGML systems, and systems used within the ‘literary and linguistic computing’ paradigm. A simple SGML dtd for texts passing across the TH/LA interface had been defined (EDIF - Eurotra Document Interchange Format). Recognising that TH tools would probably be of wide interest outside Eurotra, EDIF has been designed with an eye on TEI conformance.

The TH component's main function is to translate between a formatted document and whatever linguistic structures are used for input to (or output from) the LA component. At present, LA requires input of single sentences, with no nested quotations etc., as well as some quite detailed morphological analysis, and that is therefore what TH must produce. Devillers stressed that the segmentation performed by TH was determined entirely by the LA: if this were enhanced for example to deal with paragraphs, then that would be passed across the interface. Some rendition features of the input text are passed through to LA; the majority however are filtered out and stored somewhere unspecified, so that they can be re-associated with the output text.

This presentation was followed by a few desultory questions about other related CEC-funded projects and about the range of material anticipated for translation. I spoke very briefly about the TEI in response to a prod from Cencione. It felt like a very long morning (no coffee break) by the time we all went, thankfully, to lunch in what is unquestionably one of the biggest and best office canteens in the world.

ET7: Reusability of lexical resources

Ulrich Heid (University of Stuttgart) presented some initial results from this rather different feasibility study. Like Varile, he stressed the economic argument in favour of re- using resources, which follow from the imperative need to ‘size up’ NLP projects. A toy system can demonstrate anything: you only demonstrate what is really feasible with a realistically sized system. Reusability might mean simply re-cycling of a resource prepared for some other purpose, or it might mean designing resources with multiple applications in mind from the start.

Most of the presentation dealt with questions specific to re- usable machine-readable dictionaries (MRDs). Heid touched briefly on the existence of a number of related projects and initiatives (e.g. Acquilex, Genelex, Multilex, which are concerned with acquisition, formal description and integration of MRDs respectively). MRDs contain vastly larger amounts of information than electronic lexica, but they are not available for many languages and their underlying structures are not explicit. The study group's approach to unification of lexicographic information across different MRDs was to try to define some primitive level of description, expressed in a type feature logic, corresponding with the linguistic phenomena which the dictionaries purported to describe. Note: This sounds, on reflection, very like the approach proposed within TEI Working Group AI1, so it must be right. Among the 20 or 30 assorted research groups, publishers and software vendors involved in the study, there had been no dispute about the use of an attribute-value representation scheme, though its practical viability had yet to be demonstrated. The use of SGML as an exchange mechanism and the need for interaction with TH and TEI had been equally noncontroversial, though some unspecified concerns had been expressed about character set problems. At the end of the year, another feasibility study would be reporting on some pilot projects demonstrating the benefits and methods of standardisation.

This talk provoked some disagreement from the floor, in the shape of Wolf Paperote (Munster), who asserted on behalf of corpus linguists everywhere (there were none present) that MRDs were a lot less useful than corpora as a source of linguistic information, and that since parsing corpora was marginally easier than parsing MRDs, and much cheaper, wouldn't the money be better spent on the former? To judge from the icy silence that greeted this remark, his was a minority view.

Contractual matters

Cencione rounded off the day by setting out the procedure to be adopted for the call for tenders, set out in ET9. Tenders were invited for two distinct projects: first the implementation of a Eurotra II development environment, as described by the four study papers; second the provision of maintenance and software support facilities for all Eurotra Project researchers (currently 17 sites located in 12 countries). The contracts would run to 1993, with the possibility of extensions under the new Language Research Engineering (LRE) programme for a further two years from 1994.

Nearly a hundred different companies and institutions had expressed interest in tendering. The formal invitation to tender would be published towards the end of May; the deadline for bids would be the end of July; contractor/s would be chosen by late September with a view to concluding contractual negotiations by the end of the year, and starting work early in Jan 1992. Contractors could organise the work as they deemed fit, bid for one or both projects, subcontract work etc, but the CEC would contract with only one member of a consortium, who should moreover be responsible for at least half of the work on the project. Software developed under ET9-1 would remain the property of the CEC, and must be shareable with any future research projects funded by the Commission. Their estimate of the costs for both projects over two years was around 30 man/years, with two thirds of the approximately 15 man/years for the first year being allocated to ET9-1.

TEI Relevance

At the risk of stating the obvious, I would like to stress the importance of the CEC's Linguistic Research Engineering project to the future of the TEI, and not just because of the amounts of money involved (several millions of ECU over the next few years). Over the next few weeks I will be writing up an assessment of the TH study, as part of the OUCS/SEMA consultancy project. Any input or comments from the TEI perspective would be most useful. I see three chief areas of overlapping interest, briefly summarised below.

Reuse of lexical resources

There is a lot of effort and money going into initiatives to standardise lexical resources such as MRDs, with which TEI is already involved by virtue of overlapping membership (Nicoletta Calzolari is, of course, a significant contributor to ET-7); however, it was clear from talking to Heid that closer collaboration would be both possible and welcome. As a first step I have requested copies of ET-7's detailed working papers, several of which include surveys of existing encoding schemes and recommendations for standardisation which should be brought to the attention of the relevant WGs. I think the new WGs on lexical resources and on terminology in particular should be encouraged to build on this European work rather than go their own way.

Language formalisms

Someone competent to judge the issues within the AI1 should be asked to assess the linguistic formalism of ET6-1 and consider ways of representing it using TEI style feature sets. In my report on ET6 I would like to suggest that LA should be able to output the results of its analysis in a TEI conformant way: it would be nice to have some specific arguments and examples to support this, but I am not confident of my competence to produce them.

Text Handling tools

Despite the general lack of enthusiasm within the Eurotra project for corpus linguistics, it seems to me that some of the tools developed as part of TH may prove to be of particular interest to several TEI projects. TH will (for example) have to develop ways of automatically detecting and tagging sentences and morphological structure in the full range of European languages in SGML. If properly designed and implemented, such tools would be of great general applicability. CEC's policy as regards making such tools availabile freely to the research community, at least within Europe, sounds distinctly encouraging, as does their declared intention of working within an open Unix environment.

Document List

The following documents were made available at the workshop.

An introduction to the Eurotra Machine Translation System. 38 pp.
Eurotra-I Development Environment: a broad overview. March 1991. 26 pp.
The Eurotra-II Development Environment: a broad overview. March 1991. 21 pp.
Eurotra 6-1: Rule formalism and virtual machine design study. Summary and main findings to date. April 1991. 81 pp.
A feasibility and design study on the software environment within the framework of the Eurotra programme. Document 3: interim report. Co-ordination: Jörg Schütz. March 1991. 126 pp.
The Eurotra text handling subsystem: a broad overview. March 1991. 24 pp.
A short report on the Eurotra 7 Study. Ulrich Heid. April 1991. 47 pp.
Call for tenders ET-9. Description of the work packages. April 1991.
Copies of overheads for each of the five main talks summarized above.
Address lists for attendants at workshop and for the 12 Eurotra centres.

4. 1980-1990

4.1. Information Technology and the Research Process.
Cranfield Institute of Technology
18-21 July 1989

All academic communities define themselves partly by regular gatherings dedicated to self examination; the community of ‘information scientists’, i.e. those skilled in the management and exploitation of library and analogous resources in research, is no exception. During the seventies there had been a regular series of such gatherings known as the Cranfield Conference. These having now fallen into desuetude, when Brian Perry, head of the British Library's Research and Development Department, welcomed us to this reborn version he naturally proposed that it should be called ‘Not the Cranfield Conference’. The four day event, jointly sponsored by the British Library, the University of Pittsburgh's Department of library Science, and the UK Computer Board, attracted a small but agreeably heterogenous audience. Attendance at sessions averaged 60 from a total registration of just under a hundred, largely composed of information science professionals, computerate librarians, human- factors computing theoreticians, a sprinkling of civil servants and various other varieties of professional research support people, drawn fairly even handedly from universities and polytechnics, with even a few token representatives of industrial concerns such as Shell. Although the British formed the majority, followed by the Americans and the French, several other countries were represented including Sweden, Eire, Canada, Netherlands, Turkey and Bophutatswana. The conference bore every sign of having been carefully arranged to maximise opportunities for informal contact and discussion: there were no parallel sessions, and the timetable was not a tight one, with five keynote speakers, one panel session and a paltry 20 presentations spread over four and a half days. The venue, Cranfield Institute of Technology, notorious for its sybaritic charm as a conference centre, contributed something to this end. As befits experts in the research process, the organisers had gone out of their way to create a stimulating, agreeable, thought-provoking envirorunent in which creativity and information flow would flourish. But what were we supposed to talk *about*?

In the initial session, Jack Meadows (Loughborough) surveyed several recurrent themes of the conference: the types of application found by researchers for IT, which he viewed historically as shifting from storage and retrieval, to communication and in the future to creativity itself. He asserted that take-up of any new technology lasted for about a decade between the first 10% of potential users and the last 10%, and pointed out that because acceptance of IT must be a communal decision it would necessarily be a slow one. He said that good human interfaces implied a loss of computational efficiency;- that researchers required different levels of information; that IT facilitates informal communication better than formal and various other varyingly "untenable generalisations" (his phrase), presumably in order to provoke discussion.

The panel discussants were Richard Rowe (FAXON) who had brought a four pound $4000 NEC portable with built in modem to show us (this was instantly eclipsed by Brian Shackel who had brought his two pound f:200 Cambridge Z88); he also remarked on the importance of serendipitous browsing. John Clements (National Academy of Sciences, Washington) drew our attention to the importance of information processing expertise in the synthesis which characterised such major new endeavours as the human genome project, and also said we were within five years of making a completely detailed computer simulation of a living organism of the complexity of E Coli. Ted Herbert (Computer Board) saw JANET and its future development as crucial to scientific communication. He identified a trend towards simulation over actual experimentation in science; and a window of opportunity where unit cost of computer power was dropping faster than the demand was rising; he also summarised rather amusingly some of the difficulties inherent in negotiating with Whitehall. Gérard Losfeld (Université de Lille) had had to leave shortly before the discussion but in an impressive demonstration of IT in action, had FAXd his comments on Meadows' paper from the airport. These cast doubt on the likelihood that better software necessarily meant better research and made a good case for the fear that bibliographic databases encourage productivity at the expense of creativity. Finally, Nicholas Ostler (Department of Trade and Industry) drew a rather curious parallel between IT and money as a store of value, medium of exchange and a unit of account before making some rather Vague generalisations about computing in linguistics and drawing an even more curious parallel between librarianship and espionage.

The discussion following this panel was largely focused on browsing methods. Creativity and innovation is based on hitting on the unexpected, but the narrowing focus research means that less and less is unexpected. IT should open up possibilities ways of reducing the information overload, perhaps by automating the filtering process needed for intelligent browsing.

The first full day of the conference was concemed with that elusive feature of the best research: creativity. Proceedings began a remarkable keynote speech by Yves Le Coadic of the Conservatoire Nationale des Arts et Métiers, Paris, substituting for Jean Claude Gardin. He began by dismissing the common notion that, viewed diachronically, there is a close connexion between scientific development and prevailing political ideologies. It turns out science has always had a central theoretical core, few arts people ever penetrate to the inner workings of a scientific community and return to explain them. The manufacture of scientific ideas depends on networks of communication and Baconian deduction, in contrast with the notion of scientific inspiration popularised by Koestler. Turning to the Humanities, Le Coadic identified a shift from the use of IT for information storage to its use as a means classifying information, and finally as an integrative force, a means of extracting the rules implicit in a given universe of discourse. Ideas, he concluded, are created or manufactured by precisely the kinds of social and technical networks which IT facilitates.

After this heady Gallic stuff, the remaining presentations seemed somewhat tame. Michael Brittain, of the NHS Training Authority, listed six influences of IT on research, ranging from time-saving to re-ordering the canonical pattern of the research process. He had noticed that most researchers were unaware of the process of research as such and closed with some rhetorical questions about whether there were areas that couldn't be done without benefit of IT (several members of the audience quickly identified some for him) and whether not its application was always cost effective. No one had any ideas on that, but it sparked off a very interesting discussion concluded that social processes (such as co-operation) were often more important technological aids.

Goodier (Department of the environment) provided a management perspective on the research process, based on his own experience at the Agricultural and Food Research Council. Much of his paper was a plea for proper documentation of research activities in progress, which, he thought, would benefit from the sort of controlled keywording that typifies bibliographic databases, and for some sort of quality control mechanism more effective than simple peer review — or at least the appearance of one.

Chris Turner (Brighton Poly) had an uncomplicated answer to the problem of increasing creativity in his own IT faculty: the creation of a uniform IT environment, based on Macintosh hardware, with Sun workstations for intensive numerical work and a VAX which was increasingly regarded as a giant disk used for archival backup purposes. He restated two crucial elements in a creative IT strategy: high connectivity, and an awareness of HCI factors.

John Weiner (University of Southern California, Department of Medicine) gave a very impressive paper which argued an unfashionable view: that information processing can be formalised and that the creative process is definable. The methods he outlined — for 'ideas-analysis' in clinical trials concerned with paediatric oncology were based on a knowledge representation intended to capture 'ideas' from the literature, which could then be manipulated by a rulebased deductive system to simulate creativity without (in his phrase) ‘any need for the wine or the hot tub’. Some of the success he reported may have derived from the 'ideas' involved between a given given outcome) but the paper was very well presented and provocatively argued.

The next speaker (Johan van Halm, an independent consultant) thus had a somewhat punch drunk audience with which to engage. His paper on IT in Library Science, effectively a summary of a report prepared for the Dutch Library Council last year and recently published in English, was uncontentious and its conclusions (that widespread acceptance of IT depends on such factors as public acceptability and a satisfactory communications infrastructure) unsurprising.

The next session was on 'Collection and analysis of information'. The keynote speaker, Harold Borko (Graduate School of Library and Information Science, University of California at Los Angeles ) gave a rather 'gee-whizz' style overview of the history of IT developments in the library community up to circa 1970, which so dispirited me that I played truant from the next two papers, given by Chris Batt (a real librarian, from Croydon) and Lisbeth Bjoerklund (University of Linkoping, Sweden) to check my e-mail and inspect the Poster session in the adjoining room. Batt's paper, as given in the PrePrints, seems to consists of speculations about matters of library management, while BiOerklund has hit on the notion of hyper-textualising an OPAC, but not apparently done a great deal about it. I returned in time to hear Andrew Dillon, the first Of three speakers from the HUSAT Research Centre at Loughborough, present some results of an analysis of reading behaviour undertaken as part of project Quartet. It demonstrated that researchers placed different kinds of texts at different points along three axes (how it is to be read, why it is worth reading, and what it is likely to contain), with clear design implications f6r optimal reading and retrieval software.

After tea, Mike Lynch described some of the basic 'information studies' research carried out at Sheffield's prestigious Department of same, ranging from automatic indexing algorithms to heuristics for analyzing threedimensional chemical structures, and the increased . complexity made possible by ,advances in computer hardware and software. Though' interesting, this seemed only marginally relevant to the rest of the Conference: it was followed by a paper by Patricia Wright (Medical Research Council) which almost caricatured the poor image that 'information studies' has in some quarters. Dr Wright had asked about 200 research workers (nearly all psycholinguists) whether they used Computers to work on at home and if so what for. The mind-boggling results of her survey were that most people did, and mostly for word processing. Another revelation provided by this fundamental research was that far fewer people in the UK used email from home computers than did North Americans. Dr Wright suggested this might be because telephone charges were higher on this side of the pond, or then again, it might not. The questionnaire design was good; the paper was well presented and well argued, but almost entirely pointless. Most delegates promptly adjourned to Cranfield's exceptionally well-appointed bar (over 120 different single malts) for lengthy if inconclusive discussions about creativity and the research process.

The next day began with a good session on Information Exchange within the research community. The keynote speaker, Prof Brian Shackel, head of the HUSAT research team at Loughborough, resisted the temptation to speculate about the future, availability o ia' or Zlirec cantered at a various modes o (mail, conferencin comparing tiona and versionsfunctional and Pragmatic terms. There was good possibility that email would supplant the conventional variety entirely, but he was less sanguine about electronic conferencing or journal publishing.

The acceptability of new technology, as much else, hinged on human factors problems, for which he recommended some specific solutions: at least A4 and preferably A3 sized screens, hypertextual structures; ways of filtering junk mail; standardisation of formats and protocols; integrated international networks... Many of these had been the subject of the basic research carried out within Project Quartet, but there was no reason to assume that all its results could be transferred from research into reality. Maintaining the ‘invisible college’, for example, implied a need for local IT expertise; a novel way of funding this might be by a small with-holdable surplus on all research grants. Lapsing eventually into futurology, Shackel advised us to watch out for high-definition TV and ISDN, and keep an eye on the ‘electronic campus’ project at Aston University. 93% of Loughborough academics already have a PC on their desk, so the future may be nearer than we think.

Elaine Davis-Smith, IR specialist for a scrupulously un-named chemical company, then gave what was regrettably an almost inaudible paper about Potential applications of IT within large (un-named) Chemical companies concerned with hazardous chemicals, which provided an object lesson in unsuccessful communication.

Constance Gould (Research Libraries Group) then described how a survey of American scholarly users' needs had - indicated two major areas where bibliographic information was conspicuously lacking: data about research progress and data about machine in readable data files. In both cases, the need was crucial in all disciplines, and particularly acute in inter-disciplinary fields. There was a widening professional gap between the unpublished research in progress was 'haves' and the 'have-nots' as far as access to concerned. The difficulties of getting reliable access to and bibliographic control of machine-readable datafiles were even worse: files are not catalogued or encoded in any consistent way, so researchers often don't even know they exist. The paper gave a clear presentation of the problem area, and it is od to equally ready to clean these, particular gean stables is less certain. Lindsay (Kingston Poly) then provided intriguing, and in some ways view of the problems of n-nation management from a third world based on a project undertaken for the Development Planning Unit of College London. He described the political and organisational fficulties involved in bringing together access to the scattered 'gray literature' in this His conclusions were cautious: the new Is made available by IT in some ways existing social, financial and iti( problems more than the ysolved hn Richardson, the second HUSAT speaker, a well argued and very detailed survey the available wisdom on the efficacy of ectronically mediated conferencing, highlighting some problems with which readers (among others) are famliar. Although a high degree of mmunication was clearly a necessary ition for a productive research irom-nent, the low bandwidth of most tror,ic communications oft@,n introduced, as complications as their greater speed distribution removed. Electronic mail to be less affected by these problems, successful electronic discussion groups, he ncluded, need a skilled moderator, strong motivation and opportunities for face face contact. s Reynolds (CODIL Language Systems) scussed some of the more lunatic lications of the Data Protection Act with to electronic communications. He stulated various unlikely scenarios, in hich the usual business of acaden-dc mmunication n-dght well fall foul of the much to the bemusement of all present.

The remainder of this third day of the conference was given over to relaxation, in form of visits to various IT based mpames in the Milton Keynes area, of which British Telecom was reportedly most popular, because it was air conditioned, and of course to the conference banquet, for which Cranfield's cooks excelled themselves. It was enlivened by an occasionally coherent dinner speech from Murray Laver, who said (as a good after dinner speaker should) several things we would all like to say but t, notably that IT was silting up the search process by making it more difficult

I had the dubious privilege of giving the keynote address to the final session of the conference, the morning after, which concerned publishing, presenting and archiving the results of research. My paper began by casting fashionable doubts on the notion of research as a process, and stressed the importance of decoupling data both from its containing media and the processes applied to it, before expiring in a flurry of humanistic verbiage about multiple interpretations, hypertext etc. I also questioned whether the library community was in fact capable of responding to the challenges offered by the new technologies, an issue directly addressed by Michael Buckland (School of Library and Information Studies, University of California at Berkeley)'s paper, which outlined the radically different constraints and possibilities inherent in the application of IT to library services. He argued persuasively that the dispensing role of collections, and the relationships between catalogues, bibliographies and library materials allke needed rethinking.

David Ellis (Department of Information Studies, University of Sheffield) presented the results of a very detailed analysis of the information seeking habits of a sample of academic researchers in four social science research groups at Sheffield, with a view to deriving a general behavioural model which could be used to optimise the design of retrieval systems. Key features of the model included the ability to ‘chain’, that is, to follow links from one citation to another and to differentiate sources in terms of their relevance. No existing software seemed to offer the full range of desired features, but hypertext systems seemed to offer most promise.

Marcia Taylor (University of Essex) traced the development of the Essex Data Archive, from its origins 25 years ago as a survey databank to its current pre-eminence as an archive for, and source of, social science data, both deposited by individual researchers and provided by central government. She summarised some of the services it offers, and gave a brief overview of the research it undertakes, notably its initiative in formulating guidelines for the standardisation of descriptions of machine readable datasets.

The most unusual paper of the conference was given by Micheline Hancock-Beaulieu (City University) and concerned the creation of a database of information about The Athenaeum, a leading 19th century review and the richness of its contents as a source for 19th century literary history, but in the existence of a 'marked file' in which each of the approximately 140,000 anonymous reviews it contains is tagged with a short form of the author's name. An interdisciplinary team of librarians, computer scientists and historians at City is now constructing a relational database to hold this invaluable source material in an integrated way, using a package called TINman.

Cliff McKnight (the last of the HUSAT speakers) gave the last formal presentation, which returned to the major concern of the conference: the reading of scholarly journals. As part of Project Quartet, they had converted an eight year run of a journal called Behaviour and Information Technology into a hypertext usmg GUIDE to provide good quality searching and browsing capabilities as well as the usual ability to @ parts of texts, pop up figures and references etc. Although the formal structure of acaden-dc discourse typically mimics a linear process (introduction, methods, results, discussion), there was abundant evidence that skilled readers use this framework only as a point of departure, hopping from point to point in a way easily supported by hypertext, nrovided that the @ underlying ; @ctural metaphors (cross reference, citation, etc.) are clearly marked.

As a coda to the event, Brian Perry chaired a discussion at which participants were invited to comment on the success of the conference as a whole. Most of us however were feeling too limp from the heat to do more than agree that parallel sessions were not a good idea and that the @g and content had indeed encouraged a satisfactory exchange of views. The idea was floated that pre-prints should have been made available early enough for participants to have read them before the event, so that formal presentations might be replaced by informal seminars, but did not gain much support, though several felt that there had been insufficient time for discussion in the sessions. As a case study in how to organise acaden-dc conferences, I felt that "Not the Cranfield Conference' could not easily be faulted. The progranune, which at first glance looked rather dull, was unusually varied containing many unexpectedly stimulating papers and only a few dodos. My only quahn is that too great a success may lead to yet more research into research, a depressingly incestuous and unproductive activity.

Lou Burnard Oxford Univenity C@ng Service

4.2. SGML Database SIG Meeting
Samson-Sitjthoff BV, Alphen an der Rijn
18 May 1989

Location and attendance

The meeting was hosted at Alphen an der Rijn (Netherlands) by Samson-Sitjthoff BV, part of a major Dutch printing and publishing conglomerate, whose Information Services division is directed by Jan Maasdam, chairman of the Dutch SGML users' group. The SIG itself was set up very recently by its current chairman, Martin Bryan (author of the only readable book on SGML) who works for a division of the SEMA group called Yard Software Systems but is also closely associated with SOBEMAP and MARKIT, their SGML parser. The SIG has about half a dozen active members drawn largely from major European software houses with an interest in the field. Its chief remit seems to be discussion of the interface between SGML and database design, but this was only the third meeting and the group has not yet felt the need to create a formal constitution or agenda.

Agenda for meeting

The agenda was as follows:

Application of linguistic methods and tools to database management of SGML coded texts. Gert van der Steen from MID, a Dutch software house
The TEI: an application of SGML in scholarly research.Me
Performance comparisons of some UNIX-based RDBMS François Chahuneau of AIS, a French software house
General discussion of a paper on conceptual modelling for a document database tabled at the previous meeting by Han Schouten of the TFDL (the Dutch Agricultural Ministry)

Presentatations

Van der Steen's presentation was overlong and rather rambling for the occasion, but raised some interesting points about the benign influence of computational linguistics in the development of SGML (a DTD -he said confidently- is a formal grammar) and the appropriateness of hierarchic database systems to it. His company is developing an "Integrated Publishing Management System" entirely dependent on SGML as transfer mechanism, which had necessitated a detailed specification for an ideal text retrieval system. He also described his own PARSPAT system which uses recognition of syntactic patterns as a database search mechanism (he has recently published a book on the unification of pattern-matching, recognition, parsing and transduction) and gave examples of its use for analysing the Brown Corpus and a database of 18th century Delft Estate inventories

My presentation simply outlined the structure of the TEI, gave some examples of the horrors of unchecked scholarly markup and discussed the relationship between text and databases.

Chahuneau 's company has the task of constructing a document database to support EEC legislative and other documents in nine languages in parallel. Its scale (15,000 pp in the printed annual form) and complexity set it apart from any other SGML applications I have yet come across. Because the database is constantly changing, sophisticated version control and integrity checks are essential to maintain all nine views of it in parallel. This ruled out any of the traditional text retrieval database systems; hence the case study of available UNIX RDBMS. Decisive factors in narrowing these down proved to be their degree of support for the 8-bit characters of ISO8859 (essential for the 9 languages); and the way in which the software implemented crucial database operations. Only three of the nine systems investigated allowed the manipulation of 8-bit characters as well as their storage. As to software performance, it seems that the ideal system would combine INGRES' query optimiser (which made an order of magnitude difference to the speed of join operations) with the SYBASE file-access engine (which had a similar effect on most other operations). An investigation of the various hardware platforms available showed perhaps unsurprisingly that although a RISC-architecture machine such as the new DEC Station gave enormous performance improvements, the low-end 386-based machine was a better price/performance option for development than any of the other available workstations.

Discussion of Schouten's paper was less focussed than it merited, largely due to the lateness of the hour. Schouten had advocated using a conceptual modelling technique known as NIAM, rather than the more usual Entity-Relationship model (which would have pleased Chahuneau and me) or a straight hierarchic datamodel derived from the DTD (which would have pleased van der Steen). He had not paid much attention to such specific problems as version control, which seemed to imply the need for a formalism superior to the DTD, the semantic adequacy of which was already in question.

Contacts made

Both Chahuneau and Bryan expressed interest in the work of the TEI, and a willingness to participate if invited. I took the opportunity of rehearsing some of the current Committee 4 arguments with them (both agreed that attributes were not formally necessary, but still extremely usefull). If we do decide to involve either of them, Bryan might be a better SOBEMAP representative than Gaspart, while Chahuneau or his nominee would be a good substitute for Dendien on committee 4. Either of these would be self-financing. It is important to stimulate SOBEMAP interest since they have EEC funding for their MARKIT product which is the only structured editor I have so far come across that runs in the MS-DOS environment we all know and loathe.

I met Jens Erlandsen, whose company TEXTware A/S based in Copenhagen is developing Gestorlex, which seems to be yet another SGML-based structured editor for dictionaries and other reference books. They also market a small free text browser of the Gofer-type, and are involved in an ESPRIT project to develop a multi-media publisher's workbench. The novelty in the latter is that they plan to implement Salton's "space-vector model" for the full text indexing requirement.

Kluwers, publishers of CHum, turn out to be a part of the same publishing empire as our hosts. They were represented by Drs van Wijnen, who seemed quite taken with the notion that CHum contributors might be amenable to supplying their material complete with SGML tags. She agreed that this was worth suggesting to the CHum editorial board at any rate.

Collected Reports

Table of contents

1. 2000-2005

1.1. Rencontres TEI francophonesATILF, Nancy20-21 octobre 2005

1.2. Extreme Markup Languages 2004Montreal8 August 2004

1.3. ALLC Roadmap MeetingPisaApril 2002

2. 1995-1999

2.1. XML Europe 99Granada

2.2. The 1999 EBTI, ECAI, SEER and PNC Joint MeetingAcademia Sinica, Taipei18-21 Januaryan unofficial report

2.3. ICAME 19Slieve Donard Hotel, Newcastle, Co. Down.>20-24 May

2.4. Language Learning and ComputersChemnitz University of Technology, 20-23 Feb

2.5. TEI WorkshopSSLMIT Forlì16-22 April 1997

2.6. PALC97: Practical Applications of Language CorporaUniversity of LódzApril 12-14 1997

2.7. Trip Reports from various places in the USA (12-20 Jan 1997)[Eastern Michigan Uni; UM Ann Arbor; Indiana U; Berkeley]A visit to the USA (12-20 Jan 1997)

2.7.1. Consulting with the Linguist List12-14 January 1997

2.7.2. University of Michigan, Ann Arbor14-16 January

2.7.3. Indiana University, Bloomington17-18 January

2.7.4. University of California at Berkeley19-20 January

2.8. Conference on MS encodingStudley Priory2-3 Nov 1996

2.9. Belux96Business Faculty, Brussells31 Oct 1996

2.10. Colloqium on Hypertexts and Electronic EditingÉcole Normale Superieure, Paris12-14 September 1996

2.11. Algebraic Document Processing: Project DAVID workshopUniversity of Minho, Braga2-4 September 1996

2.12. MunichSGML Europe 1996 Conference

2.13. Goteborg, Bergen, and InnsbruckMay 1996

2.14. Surfnet TEI WorkshopUniversity of Groningen

2.15. Oh no, it's another dining report[Memoria project meeting etc.]Institute for Computational Linguistics, Pisa5-7 March 1995

2.16. Portable Documents: visit reportUK TeX User Group: "Portable documents: Acrobat, SGML and TeX"Bridewell Theatre, London19 Jan 95

3. 1991-1994

3.1. Conference on Encoding and CorporaA visit to OsloUniversity of Oslo14-16 Nov 1994

3.2. University of CorkTEI-WWW workshop

3.3. EAGLES meetingParisOct 1993

3.4. NSC93: Second Network Services ConferenceWarsawOctober 1993

3.5. SALT Club meeting[London]January 1993

3.6. AIS/Berger-Levrault, ParisOn 4 Sept 92

3.7. Social History: the challenge of Technology University of Essex

3.8. SGML Update: consultancy, tools, courses'Amsterdam16-17 May 1991

3.9. ICAME Conference ReportCraiglands Hotel, Ilkley, Yorks6-12 May 1991

3.10. Eurotra WorkshopLuxembourg18 April 1991

4. 1980-1990

4.1. Information Technology and the Research Process. Cranfield Institute of Technology18-21 July 1989

4.2. SGML Database SIG MeetingSamson-Sitjthoff BV, Alphen an der Rijn18 May 1989

4.3. Computers and Teaching in the Humanities, 1988 University of Southampton[Dec 13 to 15, 1988]

4.4. SGML Users GroupBSI London2 December 1988

4.5. International Student House (London)25 October 1988ESRC Seminar Series on Cataloguing Computer Files

4.6. Report on the Cologne Computer ConferenceKoln7-10 Sept 1988

4.7. VISIT REPORT Lou BurnardSouthampton University 10-11 April 1987Computers and Teaching in the Humanities

4.8. VISIT REPORTLou BurnardMemex Information Engines LtdEast Kilbride26-27 March 1987

4.9. Westfield College 20-21 March 1987History and Computing - 2nd Annual conference

4.10. Leeds 19 March 1987IUSC Working Party on Free Text Retrieval

4.11. VISIT REPORTNovotel, Hammersmith 17 February 1987The Sequent Lectures

4.12. Lou BurnardBatelle SoftwareSeminar on DM/BASIS5 December 1986WITH P.Salotti

4.13. National Gallery, LondonSecond annual CHArt Conference27-28 November 1986

4.14. Universities of Toronto and WaterlooConference on New Oxford English DictionaryNovember 7-13 1986

4.15. UEA, NorwichAssociation for Lit & Ling Computing xiii internationalsymposium2-5 April 1986

4.16. VISIT REPORT TO Westfield College March 21-23 March 1986 History and Computing: The inaugural conference of the Association for History and Computing.

4.17. Oxford[Bodleian Curators Meeting; Belinda's Birthday]A NOT VERY USUAL SATURDAY9 November 1985

4.18. (a) CINECA, Bologna and (b) Inst. Linguistica Computazionale, PisaOctober 24-31 1985

4.19. Visit ReportHotel Metropole, LondonFor INDEPOL Launch15 Oct 85

4.20. ICL (West Gorton)Presentation of the Queens Award for Industry to ICL10 Oct 1985

4.21. City University 13-14 Sept. 1985Databases for the Theatre

4.22. ICL Slough1 Aug 1985CAFS SIG WP LBurnard

4.23. VISIT REPORT L.BurnardMax-Planck-Institut fur Geschichte, Gottingen 14-19 July 1985International Workshop on the Creation Linkage and Usage of large-scale interdisciplinary sourcebanks in the historical disciplines

4.24. VISIT REPORT 10 Royai National Hotel 13 June 1985ICL DMUG/LSUG

4.25. VISIT REPORTUniversity College London2 April 85Workshop on Computers and Art History L. Burnard

4.26. VISIT REPORTCEGB StockportCUA CAFS SIG WP9 Nov 84

4.27. VISIT REPORT LBurnardThe Burn, GleneskBCS/Codasyl DBAWG6-9 September

4.28. VISIT REPORT LBurnardLeeds Polytechnic July 11-13 1984BNCOD 3 also present: P Salotti

4.29. Low Wood Hotel (Windermere)19-23 May 1984Lancaster University Conference (ICAME)

4.30. VISIT REPORT LBurnardSperry House, Stonebridge Park, London BCS Codasyl DBAWG 19 January 1984

4.31. VISIT REPORT LBurnardUniversity of Surrey IUSC Workshop ON 3-4 Jan 1984

4.32. Visit ReportStrand Palace Hotel, LondonICLCUA CAFS SIG Open Meeting30 November 1983ALSO PRESENT P Salotti

4.33. VISIT REPORT L. BurnardThe Burn, Glenesk BCS/Codasyl DBAWG Sept 1-4, 1983

4.34. Report by Mrs. S. M. Hockey26/7/83IUSC Workshop on Non-Numeric ComputingOther Staff Present: Mrs. L. Hayes, L. D. Burnard

4.35. VISIT REPORTUMRCCIUSC Famulus Working Party22 June 1983 L.Burnard[UMRCC]

4.36. VISIT REPORT L. BurnardU.M.R.C.C.IUSC Working Party on Development of Famulus

4.37. VISIT REPORT L.BurnardUMRCCJanuary 22Famulus meeting

4.38. BP Britannia House. 25.11.82BCS information Retrieval Specialist Group Seminar on Integrated Text and Data Management Systems.

4.39. Date of Report: 29.11.82ICL Euston

4.40. VISIT REPORT L BurnardICL BracknellBCS/CODASYL DBAWGOn 18-19 November 82

4.41. 20.9.82Codasyl/BCS DBAWGThe Burn

1.1. Rencontres TEI francophones
ATILF, Nancy
20-21 octobre 2005

1.2. Extreme Markup Languages 2004
Montreal
8 August 2004

1.3. ALLC Roadmap Meeting
Pisa
April 2002

2.1. XML Europe 99
Granada

2.2. The 1999 EBTI, ECAI, SEER and PNC Joint Meeting
Academia Sinica, Taipei
18-21 January
an unofficial report

2.3. ICAME 19
Slieve Donard Hotel, Newcastle, Co. Down.
>20-24 May

2.4. Language Learning and Computers
Chemnitz University of Technology, 20-23 Feb

2.5. TEI Workshop
SSLMIT Forlì
16-22 April 1997

2.6. PALC97: Practical Applications of Language Corpora
University of Lódz
April 12-14 1997

2.7. Trip Reports from various places in the USA (12-20 Jan 1997)
[Eastern Michigan Uni; UM Ann Arbor; Indiana U; Berkeley]
A visit to the USA (12-20 Jan 1997)

2.7.1. Consulting with the Linguist List
12-14 January 1997

2.7.2. University of Michigan, Ann Arbor
14-16 January

2.7.3. Indiana University, Bloomington
17-18 January

2.7.4. University of California at Berkeley
19-20 January

2.8. Conference on MS encoding
Studley Priory
2-3 Nov 1996

2.9. Belux96
Business Faculty, Brussells
31 Oct 1996

2.10. Colloqium on Hypertexts and Electronic Editing
École Normale Superieure, Paris
12-14 September 1996

2.11. Algebraic Document Processing: Project DAVID workshop
University of Minho, Braga
2-4 September 1996

2.12. Munich
SGML Europe 1996 Conference

2.13. Goteborg, Bergen, and Innsbruck
May 1996

2.14. Surfnet TEI Workshop
University of Groningen

2.15. Oh no, it's another dining report
[Memoria project meeting etc.]
Institute for Computational Linguistics, Pisa
5-7 March 1995

2.16. Portable Documents: visit report
UK TeX User Group: "Portable documents: Acrobat, SGML and TeX"
Bridewell Theatre, London
19 Jan 95

3.1. Conference on Encoding and Corpora
A visit to Oslo
University of Oslo
14-16 Nov 1994

3.2. University of Cork
TEI-WWW workshop

3.3. EAGLES meeting
Paris
Oct 1993

3.4. NSC93: Second Network Services Conference
Warsaw
October 1993

3.5. SALT Club meeting
[London]
January 1993

3.6. AIS/Berger-Levrault, Paris
On 4 Sept 92

3.7. Social History: the challenge of Technology
University of Essex

3.8. SGML Update: consultancy, tools, courses'
Amsterdam
16-17 May 1991

3.9. ICAME Conference Report
Craiglands Hotel, Ilkley, Yorks
6-12 May 1991

3.10. Eurotra Workshop
Luxembourg
18 April 1991

4.1. Information Technology and the Research Process.
Cranfield Institute of Technology
18-21 July 1989

4.2. SGML Database SIG Meeting
Samson-Sitjthoff BV, Alphen an der Rijn
18 May 1989

4.3. Computers and Teaching in the Humanities, 1988
University of Southampton
[Dec 13 to 15, 1988]

4.4. SGML Users Group
BSI London
2 December 1988

4.5. International Student House (London)
25 October 1988
ESRC Seminar Series on Cataloguing Computer Files

4.6. Report on the Cologne Computer Conference
Koln
7-10 Sept 1988

4.7. VISIT REPORT Lou Burnard
Southampton University
10-11 April 1987
Computers and Teaching in the Humanities

4.8. VISIT REPORTLou Burnard
Memex Information Engines Ltd
East Kilbride
26-27 March 1987

4.9.
Westfield College
20-21 March 1987
History and Computing - 2nd Annual conference

4.10.
Leeds
19 March 1987
IUSC Working Party on Free Text Retrieval

4.11. VISIT REPORT
Novotel, Hammersmith
17 February 1987
The Sequent Lectures

4.12. Lou Burnard
Batelle Software
Seminar on DM/BASIS
5 December 1986
WITH P.Salotti

4.13.
National Gallery, London
Second annual CHArt Conference
27-28 November 1986

4.14. Universities of Toronto and Waterloo
Conference on New Oxford English Dictionary
November 7-13 1986

4.15.
UEA, Norwich
Association for Lit & Ling Computing xiii internationalsymposium
2-5 April 1986

4.16. VISIT REPORT
TO Westfield College March
21-23 March 1986
History and Computing: The inaugural conference of the Association for History and Computing.

4.17. Oxford
[Bodleian Curators Meeting; Belinda's Birthday]
A NOT VERY USUAL SATURDAY
9 November 1985

4.18. (a) CINECA, Bologna and (b) Inst. Linguistica Computazionale, Pisa
October 24-31 1985

4.19. Visit Report
Hotel Metropole, London
For INDEPOL Launch
15 Oct 85

4.20. ICL (West Gorton)
Presentation of the Queens Award for Industry to ICL
10 Oct 1985

4.21.
City University
13-14 Sept. 1985
Databases for the Theatre

4.22.

ICL Slough
1 Aug 1985
CAFS SIG WP LBurnard

4.23. VISIT REPORT L.Burnard
Max-Planck-Institut fur Geschichte, Gottingen
14-19 July 1985
International Workshop on the Creation Linkage and Usage of large-scale interdisciplinary sourcebanks in the historical disciplines

4.24. VISIT REPORT 10 Royai National Hotel 13 June 1985
ICL DMUG/LSUG

4.25. VISIT REPORT
University College London
2 April 85
Workshop on Computers and Art History L. Burnard

4.26. VISIT REPORT
CEGB Stockport
CUA CAFS SIG WP
9 Nov 84

4.27. VISIT REPORT LBurnard
The Burn, Glenesk
BCS/Codasyl DBAWG
6-9 September

4.28. VISIT REPORT LBurnard
Leeds Polytechnic July 11-13 1984
BNCOD 3 also present: P Salotti

4.29. Low Wood Hotel (Windermere)
19-23 May 1984
Lancaster University Conference (ICAME)

4.30. VISIT REPORT LBurnard
Sperry House, Stonebridge Park, London
BCS Codasyl DBAWG
19 January 1984

4.31. VISIT REPORT LBurnard
University of Surrey
IUSC Workshop ON 3-4 Jan 1984

4.32. Visit Report
Strand Palace Hotel, London
ICLCUA CAFS SIG Open Meeting
30 November 1983
ALSO PRESENT P Salotti

4.33. VISIT REPORT L. Burnard
The Burn, Glenesk
BCS/Codasyl DBAWG
Sept 1-4, 1983

4.34. Report by Mrs. S. M. Hockey
26/7/83
IUSC Workshop on Non-Numeric Computing
Other Staff Present: Mrs. L. Hayes, L. D. Burnard

4.35. VISIT REPORT
UMRCC
IUSC Famulus Working Party
22 June 1983 L.Burnard
[UMRCC]

4.36. VISIT REPORT L. Burnard
U.M.R.C.C.
IUSC Working Party on Development of Famulus

4.37. VISIT REPORT L.Burnard
UMRCC
January 22
Famulus meeting

4.38.
BP Britannia House.
25.11.82
BCS information Retrieval Specialist Group Seminar on Integrated Text and Data Management Systems.

4.39. Date of Report: 29.11.82
ICL Euston

4.40. VISIT REPORT L Burnard
ICL Bracknell
BCS/CODASYL DBAWG
On 18-19 November 82

4.41. 20.9.82
Codasyl/BCS DBAWG
The Burn

4.42.
(a) International Word Processing Exhibition, Wembley (b) SSRC Workshop on 1851 Census
London
28 May 1982

4.43. Information Dimensions User Group Meeting
Hotel Intercontinental (London)
18 May 1988

4.44. VISIT REPORT L. BURNARD
Strand Palace Hotel
ICL CUA Presentation on CAFS-ISP
6 April 1982

4.45. VISIT REPORT
Jesus College, Cambridge
13-14 July 81
British National Conference on Databases

4.46.
University College Cardiff
Workshop on Databases in the Social Sciences

4.47. VISIT REPORT L.Burnard
West Midlands Gas (Solihull)
ICL.DMUG.2900 IDMS User Group Meeting 10Ipril81