Modelling meaning: a short history of text markup

Lou Burnard

A naive realist's manifesto

Modelling matters

A very long time ago

Let's start in the unfamiliar world of the mid-1980s...

...but also a familiar one

Modelling the data vs modelling the text

By the end of the 1970s, methods variously called ‘data modelling’, ‘conceptual analysis’, ‘database design’, or similar had become common practice.

An inherently reductive process

How applicable are such methods to the complexity of historical data sources?

The 1980s were a period of technological enthusiasm

Re-invention of Quellenkritik (source criticism)

‘History that is not quantifiable cannot claim to be scientific’ (Le Roy Ladurie 1972)

Theorizing Humanities Computing

(Hughes 2012)

Isn't the modelling of textual data at the heart of all these?

Serious computing meets text

The Textual Trinity (Burnard 1987)

In interpreting text, the trained human brain operates quite successfully on three distinct levels; not surprisingly, three distinct types of computer software have evolved to mimic these capabilities.

Text is little boxes

(Preliminary description of TEX: D Knuth, May 13, 1977)

No, text is data

Database orthodoxy

But what are the "important entities" we might wish to identify in a textual resource?

Assize court records, for example

(An application of CODASYL techniques to research in the humanities, 1980)

Scribe

Scribe, developed by Brian K Reid at Carnegie Mellon in the 1980s, was one of the earliest successful document production systems to separate content from format and to use a formal document specification language. Its commercial exploitation was short-lived, but its ideas were very influential.

(S)GML

(A Brief History of the Development of SGML (C)1990 SGML Users' Group )

Charles Goldfarb and others developed a "Generalized Markup Language" for IBM, which subsequently evolved into SGML, an ISO standard (ISO 8879:1986).

SGML was designed to enable the sharing and long-term preservation of machine-readable documents for use in large-scale projects in government, the law, the military, and other industrial-scale publishing sectors.

SGML is the ancestor of HTML and of XML ... it defined for a whole generation a new way of thinking about what text really is

Motivations

What is a text?

Separating content, structure, and presentation means:

This is not a new idea! But it's a good one...

Some ambitious claims ensued

(Presentation for Oxford IT Support Staff Conference, 1994)

A fuller example...

<carte n="0010">
  <recto url="19800726_001r.jpg">
    <desc>Vue d'un cours d'eau avec un pont en pierre et des
      petites maisons de style mexicain. Un homme et une femme
      navigue un pédalo en premier plan a gauche.</desc>
    <head>San Antonio River</head>
  </recto>
  <verso url="19800726_001v.jpg">
    <obliteration>
      <lieu>El Paso TX 799</lieu>
      <date>18-08-1980</date>
    </obliteration>
    <message>
      <p>26 juill 80</p>
      <p>Chère Madame , après New-York et Washington dont le
        gigantisme m'a beaucoup séduite, nous avons commencé
        notre conquête de l'Ouest par New Orleans, ville folle
        en fête perpétuelle. Il fait une chaleur torride au
        Texas mais le coca-cola permet de résister –
        l'Amérique m'enchante ! Bientôt, le grand Canyon, le
        Colorado et San Francisco... </p>
      <p> En espérant que vous passez de bonnes vacances,
        affectueusement. </p>
      <signature> Sylvie </signature>
      <signature>François </signature>
    </message>
    <destinataire>Madame Lefrère
      4, allée George Rouault
      75020 Paris
      France
    </destinataire>
  </verso>
</carte>

A digital text may be ...

a ‘substitute’ (surrogate) simply representing the appearance of an existing document

... or it may be

a representation of its linguistic content and structure, with additional annotations about its meaning and context.

What does the markup do?

... and it also expresses the encoder's view of what matters in this document, thus determining how it can subsequently be analysed.
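As a rough illustration of that last point, here is a minimal Python sketch, assuming an abbreviated copy of the postcard encoding shown earlier. Because the encoder chose to tag the postmark and the signatures, a program can ask for them directly; anything left untagged would have to be recovered by other means. The function name `summarise_card` and the cut-down sample are hypothetical, introduced only for this example.

```python
import xml.etree.ElementTree as ET

# Abbreviated, hypothetical copy of the postcard example.
CARTE = """
<carte n="0010">
  <recto url="19800726_001r.jpg">
    <head>San Antonio River</head>
  </recto>
  <verso url="19800726_001v.jpg">
    <obliteration>
      <lieu>El Paso TX 799</lieu>
      <date>18-08-1980</date>
    </obliteration>
    <message>
      <p>26 juill 80</p>
      <signature> Sylvie </signature>
      <signature>François </signature>
    </message>
  </verso>
</carte>
"""

def summarise_card(xml_text):
    """Collect the facts this particular encoding makes directly queryable."""
    card = ET.fromstring(xml_text)
    return {
        "number": card.get("n"),
        "title": card.findtext("recto/head"),
        "postmark_place": card.findtext("verso/obliteration/lieu"),
        "postmark_date": card.findtext("verso/obliteration/date"),
        # Strip the stray whitespace the transcription left inside <signature>.
        "signatures": [s.text.strip() for s in card.iter("signature")],
    }

summary = summarise_card(CARTE)
print(summary)
```

A different encoder, interested only in the picture side, might never have tagged the postmark at all, and the same query would then return nothing.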

Where is the textual data and where is the markup?

BL MS Cotton Vitellius A xv, fol. 129r

Where is the textual data and where is the markup?

Beowulf, ed. C L Wrenn (with student annotations)

Which textual data matters?

A ‘document’ is something that exists in the world, which we can digitize.

A ‘text’ is an abstraction, created by or for a community of readers, which we can encode.

The document as ‘Text-Bearing Object’ (TBO)

Materia appetit formam ut virum foemina (matter desires form as a woman desires a man)

But don't forget... digital texts are also TBOs!

Markup is a scholarly activity

Compare the markup

<hi rend="dropcap">H</hi>
<g ref="#wynn">W</g>ÆT WE GARDE
<lb/>na in gear-dagum þeod-cyninga
<lb/>þrym gefrunon, hu ða æþelingas
<lb/>ellen fremedon. oft scyld scefing sceaþe <add>na</add>
<lb/>þreatum, moneg<expan>um</expan> mægþum meodo-setl <add>a</add>
<lb/>of<damage>
  <desc>blot</desc>
</damage>teah ...

<lg>
  <l>Hwæt! we Gar-dena in gear-dagum</l>
  <l>þeod-cyninga þrym gefrunon,</l>
  <l>hu ða æþelingas ellen fremedon,</l>
</lg>
<lg>
  <l>Oft <persName>Scyld Scefing</persName>
    sceaþena þreatum,</l>
  <l>monegum mægþum meodo-setla ofteah;</l>
  <l>egsode <orgName>Eorle</orgName>, syððan ærest wearþ</l>
  <l>feasceaft funden...</l>
</lg>

... and

<s>
  <w pos="interj" lemma="hwaet">Hwæt</w>
  <w pos="pron" lemma="we">we</w>
  <w pos="npl" lemma="gar-denum">Gar-dena</w>
  <w pos="prep" lemma="in">in</w>
  <w pos="npl" lemma="gear-dagum">gear-dagum</w>
  ...
</s>
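Word-level annotation of this kind is what makes simple linguistic queries possible. A minimal Python sketch, assuming the sentence above with `pos` and `lemma` written as separate attributes and the elided words left out (the variable names are illustrative only):

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample: the annotated sentence, minus the elided words.
SENTENCE = """
<s>
  <w pos="interj" lemma="hwaet">Hwæt</w>
  <w pos="pron" lemma="we">we</w>
  <w pos="npl" lemma="gar-denum">Gar-dena</w>
  <w pos="prep" lemma="in">in</w>
  <w pos="npl" lemma="gear-dagum">gear-dagum</w>
</s>
"""

sentence = ET.fromstring(SENTENCE)

# Tally the part-of-speech codes the encoder assigned.
pos_counts = Counter(w.get("pos") for w in sentence.iter("w"))

# List the lemmas in text order, e.g. as input to a concordance.
lemmas = [w.get("lemma") for w in sentence.iter("w")]

print(pos_counts)
print(lemmas)
```

Neither query would be possible against the diplomatic transcription, which records line breaks and damage but says nothing about word classes.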

or even

<w pos="npl" corresp="#w2">Gar-dena</w>
<w pos="prep" corresp="#w3">in</w>
<w pos="npl" corresp="#w4">gear-dagum</w>
<!-- ... -->
<w xml:id="w2">armed danes</w>
<w xml:id="w3">in</w>
<w xml:id="w4">days of yore</w>
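The `corresp` pointers can be followed mechanically to align each Old English word with its gloss. A minimal Python sketch, assuming the fragment is wrapped in a hypothetical `<root>` element so that it parses as a single document, and that `pos` and `corresp` are separate attributes; the function name `align` is likewise an assumption for illustration:

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under its expanded namespace name.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical wrapper element around the fragment.
SAMPLE = """
<root>
  <w pos="npl" corresp="#w2">Gar-dena</w>
  <w pos="prep" corresp="#w3">in</w>
  <w pos="npl" corresp="#w4">gear-dagum</w>
  <w xml:id="w2">armed danes</w>
  <w xml:id="w3">in</w>
  <w xml:id="w4">days of yore</w>
</root>
"""

def align(xml_text):
    """Pair each word carrying a corresp pointer with the text it points at."""
    root = ET.fromstring(xml_text)
    # Index every <w> that has an xml:id.
    by_id = {w.get(XML_ID): w.text for w in root.iter("w") if w.get(XML_ID)}
    # Follow each "#id" pointer into that index.
    return [
        (w.text, by_id[w.get("corresp").lstrip("#")])
        for w in root.iter("w")
        if w.get("corresp")
    ]

pairs = align(SAMPLE)
for source, gloss in pairs:
    print(f"{source} -> {gloss}")
```

The same lookup pattern applies to any local pointer of the form `#identifier`, such as a `ref` on a personal name pointing at a `person` entry.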

.. not to mention ...

<!-- ... --><l>Oft <persName ref="https://en.wikipedia.org/wiki/Skj%C3%B6ldr">Scyld Scefing</persName> sceaþena þreatum,</l>

or even

<l>Oft <persName ref="#skioldus">Scyld Scefing</persName> sceaþena þreatum,</l>
<!-- ... -->
<person xml:id="skioldus">
  <persName source="#beowulf">Scyld Scefing</persName>
  <persName xml:lang="lat">Skioldus</persName>
  <persName xml:lang="non">Skjöld</persName>
  <occupation>Legendary Norse King</occupation>
  <ref target="https://en.wikipedia.org/wiki/Skj%C3%B6ldr">Wikipedia entry</ref>
  <!-- ... -->
</person>

Wait ...

... did someone mention the TEI?

The Text Encoding Initiative

Vassar College, Poughkeepsie

The obvious question

Well .... maybe ....

Why the TEI?

The TEI provides

Relevance

Why would you want those things?

The virtuous circle of encoding

The scope of intelligent markup

The TEI provides -- amongst others -- recommended markup for

Is it possible to delimit encyclopaedically all possible kinds of markup?

Why use a common framework?

The TEI was designed to support multiple views of the same resource

Conformance issues

A document is TEI Conformant if and only if it:

TEI conformance does not mean ‘Do what I do’, but rather ‘Explain what you do in terms I can understand’

Why is the TEI still here?

Because it is a model of textual data which is ...