Modelling meaning: a short history of text markup
Lou Burnard
A naive realist's manifesto
Modelling matters
A very long time ago
Let's start in the unfamiliar world of the mid-1980s...
- the world wide web did not exist
- the tunnel beneath the English Channel was still being built
- a state called the Soviet Union had just launched a space station called Mir
- serious computing was done on mainframes
- the world was managing nicely without the DVD, the mobile phone, cable tv, or Microsoft Word
...but also a familiar one
- corpus linguistics and ‘artificial intelligence’ had created a demand for large scale textual resources in academia and beyond
- advances in text processing were beginning to affect lexicography and document management systems (e.g. TeX, Scribe, (S)GML ...)
- the Internet existed for academics and for the military; theories about how to use it ‘hypertextually’ abounded
- books, articles, and even courses in something called "Computing in the Humanities" were beginning to appear
Modelling the data vs modelling the text
By the end of the 1970s, methods variously called ‘data modelling’, ‘conceptual analysis’, ‘database design’ vel sim. had become common practice.
- remember: a centralised mainframe world dominated by IBM
- spread of office automation and consequent data integration
- ANSI SPARC three level model
An inherently reductive process
how applicable are such methods to the complexity of historical data sources?
The 1980s were a period of technological enthusiasm
- Digital methods and digital resources, despite their perceived strangeness, were increasingly evident in the Humanities
- There was some public funding of infrastructural activities, both at national and European levels: in the UK, for example, the Computers in Teaching Initiative and the Arts and Humanities Data Service
- Something radically new, or just an update?
- Humanities Computing (aka Digital Humanities) gets a foothold, by establishing courses
Re-invention of Quellenkritik (source criticism)
‘History that is not quantifiable cannot claim to be scientific’ (Le Roy Ladurie 1972)
- In the UK, a series of History and Computing conferences (1986-1990) showed historians already using commercial DBMSs, data analysis tools developed for survey analysis, "personal database systems" ...
- In France, J.-P. Genet and others, influenced by the Annales school, proposed a programme of digitization of historical source records
- Further pursued by Manfred Thaller with the program kleio (1982) -- a tool for transcribing and analysing (extracts from) historical sources, which also supported annotation of their content and significance
- Thaller also (in 1989) challenged advocates of Humanities Computing to define its underlying theory
Theorizing Humanities Computing
- What are the underlying principles of the tools used in Humanities Computing (then) or Digital Humanities (now)?
- Unsworth and others eventually (by 2002) began using the phrase ‘scholarly primitives’ to characterise a core set of procedures, e.g.
- searching on the basis of externally-defined features
- analysis in terms of internally-defined features
- association according to shared readings
Isn't the modelling of textual data at the heart of all these?
Serious computing meets text
In interpreting text, the trained human brain operates quite successfully on three distinct levels; not surprisingly, three distinct types of computer software have evolved to mimic these capabilities.
Text is little boxes
- TeX was developed by Donald Knuth, a Stanford mathematician, to produce high quality typeset output from annotated text
- Knuth also developed the associated idea of literate programming: that software and its documentation should be written and maintained as an integrated whole
- TeX is still widely used, particularly in the academic community: it is open source and there are several implementations
Database orthodoxy
- identify important entities which exist in the real world and the relationships amongst them
- formally define a conceptual model of that universe of discourse
- map the conceptual model to a storage model (network, relational, whatever...)
But what are the "important entities" we might wish to identify in a textual resource?
Assize court records, for example
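A purely hypothetical sketch (the element names, dates, and values below are invented for illustration, not drawn from any real schema or project) shows how reductive the entity-centred view can be: the accused, the offence, and the verdict become fields, while everything else in the source record -- marginalia, abbreviations, scribal corrections, uncertain readings -- has nowhere to go.
<!-- hypothetical sketch: invented element names and values -->
<trial date="1788-03-12">
  <accused>
    <name>Mary Smith</name>
    <occupation>servant</occupation>
  </accused>
  <offence>theft of a silver spoon</offence>
  <verdict>guilty</verdict>
  <sentence>transportation for seven years</sentence>
</trial>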
Scribe
Scribe, developed by Brian K. Reid at Carnegie Mellon in the late 1970s, was one of the earliest successful document production systems to separate content and format, and to use a formal document specification language. Its commercial exploitation was short-lived, but its ideas were very influential.
(S)GML
Charles Goldfarb and others developed a "Generalized Markup Language" for IBM, which subsequently became an ISO standard (ISO 8879:1986)
SGML was designed to enable the sharing and long-term preservation of machine-readable documents for use in large-scale projects in government, law, the military, and other industrial-scale publishing contexts.
SGML is the ancestor of HTML and of XML ... it defined for a whole generation a new way of thinking about what text really is
Motivations
- an enormous increase in the quantity of technical documentation : the aircraft carrier story
- an enormous increase in its complexity : the Gare de Lyon story
- a proliferation of mutually incompatible document formats
- an almost evangelical desire for centrally-defined standards
- a mainframe-based, not yet distributed, world in transition
What is a text?
- content: the components (words, images, etc.) which make up a document
- structure: the organization and inter-relationship of those components
- presentation: how a document looks and what processes are applied to it
- .. and possibly many readings
Separating content, structure, and presentation means:
- the content can be re-used
- the structure can be formally validated
- the presentation can be customized for
- different media
- different audiences
- in short, the information can be uncoupled from its processing
This is not a new idea! But it's a good one...
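As a minimal sketch of that uncoupling (the element and stylesheet names here are invented for illustration), the same content and structure can be handed to different stylesheets for different media or audiences, without the text itself changing:
<?xml-stylesheet type="text/xsl" href="screen.xsl"?>
<!-- swap screen.xsl for print.xsl or largeprint.xsl:
     the content and structure below remain untouched -->
<letter>
  <salutation>Dear Madam,</salutation>
  <p>Thank you for your postcard from Texas ...</p>
</letter>
The structure, meanwhile, can be validated against a schema quite independently of how any particular stylesheet chooses to render it.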
Some ambitious claims ensued
A fuller example...
<carte n="0010">
<recto url="19800726_001r.jpg">
<desc>Vue d'un cours d'eau avec un pont en pierre et des
petites maisons de style mexicain. Un homme et une femme
navigue un pédalo en premier plan a gauche.</desc>
<head>San Antonio River</head>
</recto>
<verso url="19800726_001v.jpg">
<obliteration>
<lieu>El Paso TX 799</lieu>
<date>18-08-1980</date>
</obliteration>
<message>
<p>26 juill 80</p>
<p>Chère Madame , après New-York et Washington dont le
gigantisme m'a beaucoup séduite, nous avons commencé
notre conquête de l'Ouest par New Orleans, ville folle
en fête perpétuelle. Il fait une chaleur torride au
Texas mais le coca-cola permet de résister –
l'Amérique m'enchante ! Bientôt, le grand Canyon, le
Colorado et San Francisco... </p>
<p> En espérant que vous passez de bonnes vacances,
affectueusement. </p>
<signature> Sylvie </signature>
<signature>François </signature>
</message>
<destinataire>Madame Lefrère
4, allée George Rouault
75020 Paris
France
</destinataire>
</verso>
</carte>
A digital text may be ...
a ‘substitute’ (surrogate) simply representing the appearance of an existing document
... or it may be
a representation of its linguistic content and structure, with additional annotations about its meaning and context.
What does the markup do?
- It makes explicit to a processor how something should be processed.
- In the past, ‘markup’ was what told a typesetter how to deal with a manuscript
- Nowadays, it is what tells a computer program how to deal with a stream of textual data.
... and it also expresses the encoder's view of what matters in this document, thus determining how it can subsequently be analysed.
Where is the textual data and where is the markup?
Which textual data matters ?
- the shape of the letters and their layout?
- the presumed creator of the writing?
- the (presumed) intentions of the creator?
- the stories we read into the writing?
A ‘document’ is something that exists in the world, which we can digitize.
A ‘text’ is an abstraction, created by or for a community of readers, which we can encode.
The document as ‘Text-Bearing Object’ (TBO)
Materia appetit formam ut virum foemina (‘matter desires form as a woman desires a man’)
- Traditionally, we distinguish form and content
- In the same way, we might think of an inscription or a manuscript as the bearer or container or form instantiating an abstract notion -- a text
But don't forget... digital texts are also TBOs!
Markup is a scholarly activity
- The application of markup to a document is an intellectual activity
- Deciding exactly what markup to apply and why is much the same as editing a text
- Markup is rarely neutral, objective, or deterministic: interpretation is needed
- Because it obliges us to confront difficult ontological questions, markup can be considered a research activity in itself
- Good textual encoding is never as easy or as quick as people would like to believe -- do things better, not necessarily quicker
- The markup scheme used for a project should result from a detailed analysis of the properties of the objects the project aims to use or create
Compare the markup
<hi rend="dropcap">H</hi>
<g ref="#wynn">W</g>ÆT WE GARDE <lb/>na in
gear-dagum þeod-cyninga <lb/>þrym gefrunon, hu ða æþelingas <lb/>ellen
fremedon. oft scyld scefing sceaþe
<add>na</add>
<lb/>þreatum, moneg<expan>um</expan> mægþum meodo-setl
<add>a</add>
<lb/>of<damage>
<desc>blot</desc>
</damage>teah ...
<lg>
<l>Hwæt! we Gar-dena in gear-dagum</l>
<l>þeod-cyninga þrym gefrunon,</l>
<l>hu ða æþelingas ellen fremedon,</l>
</lg>
<lg>
<l>Oft <persName>Scyld Scefing</persName>
sceaþena þreatum,</l>
<l>monegum mægþum meodo-setla ofteah;</l>
<l>egsode <orgName>Eorle</orgName>, syððan ærest wearþ</l>
<l>feasceaft funden...</l>
</lg>
... and
<s>
<w pos="interj" lemma="hwaet">Hwæt</w>
<w pos="pron" lemma="we">we</w>
<w pos="npl" lemma="gar-denum">Gar-dena</w>
<w pos="prep" lemma="in">in</w>
<w pos="npl" lemma="gear-dagum">gear-dagum</w> ...
</s>
or even
<w pos="npl" corresp="#w2">Gar-dena</w>
<w pos="prep" corresp="#w3">in</w>
<w pos="npl" corresp="#w4">gear-dagum</w>
<w xml:id="w2">armed danes</w>
<w xml:id="w3">in</w>
<w xml:id="w4">days of yore</w>
.. not to mention ...
<l>Oft <persName ref="https://en.wikipedia.org/wiki/Skj%C3%B6ldr">Scyld Scefing</persName>
sceaþena þreatum,</l>
or even
<l>Oft <persName ref="#skioldus">Scyld Scefing</persName>
sceaþena þreatum,</l>
<person xml:id="skioldus">
<persName source="#beowulf">Scyld Scefing</persName>
<persName xml:lang="lat">Skioldus</persName>
<persName xml:lang="non">Skjöld</persName>
<occupation>Legendary Norse King</occupation>
<ref target="https://en.wikipedia.org/wiki/Skj%C3%B6ldr">Wikipedia entry</ref>
</person>
Wait ...
- How many markup systems does the world need?
- One size fits all?
- Let a thousand flowers bloom?
- Roll your own!
- We've been here before...
- One construct and many views
- modularity and extensibility
... did someone mention the TEI ?
The Text Encoding Initiative
- Spring 1987: European workshops on standardisation of historical data (J.-P. Genet, M. Thaller)
- Autumn 1987: In the US, the NEH funds an exploratory international workshop on the feasibility of defining "text encoding guidelines"
The obvious question
- So the TEI is very old!
- Not much in computing survives 5 years, never mind 20
- Why is it still here, and how has it survived?
- What relevance can it possibly have today?
- And with XML everyone can create their own markup system and still share data!
- And in the Semantic Web, XML systems will all understand each other's data!
- RDF can describe every kind of markup; SPARQL can search it!
Well .... maybe ....
Why the TEI?
The TEI provides
- a language-independent framework for defining markup languages
- a very simple consensus-based way of organizing and structuring textual (and other) resources...
- ... which can be enriched and personalized in highly idiosyncratic or specialised ways (a minimal customization is sketched after this list)
- a very rich library of existing specialised components
- an integrated suite of standard stylesheets for delivering schemas and documentation in various languages and formats
- a large and active, open-source-style user community
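To make the ‘enriched and personalized’ point concrete: a TEI customization is itself expressed in TEI, as an ODD (‘One Document Does it all’) specification. The heart of a minimal customization might look like the following sketch (the schema name is invented; the module names are real TEI modules), from which the standard stylesheets generate both a schema and its documentation:
<schemaSpec ident="myProjectTEI" start="TEI">
  <!-- the minimal set of TEI modules -->
  <moduleRef key="tei"/>
  <moduleRef key="header"/>
  <moduleRef key="core"/>
  <moduleRef key="textstructure"/>
  <!-- add further modules (or remove unwanted elements) as the project requires -->
  <moduleRef key="namesdates"/>
</schemaSpec>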
Relevance
Why would you want those things?
- because we need to interchange resources
- between people
- (increasingly) between machines
- because we need to integrate resources
- of different media types
- from different technical contexts
- because we need to preserve resources
- cryogenics is not the answer!
- we need to preserve metadata as well as data
The virtuous circle of encoding
The scope of intelligent markup
The TEI provides -- amongst other things -- recommended markup for
- basic structural and functional components of text
- diplomatic transcription, images, annotation
- links, correspondence, alignment
- data-like objects such as dates, times, places, persons, and events (named entity recognition) -- see the sketch below
- meta-textual annotations (correction, deletion, etc.)
- linguistic analysis at all levels
- contextual metadata of all kinds
- ... and so on and so on and so forth
Is it possible to delimit encyclopaedically all possible kinds of markup?
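As a small illustration of the ‘data-like objects’ point above (the content and identifiers here are invented; the elements and attributes are standard TEI), dates, places, and persons can be captured in machine-tractable form alongside the running text:
<p>We left <placeName ref="#paris">Paris</placeName> on
  <date when="1980-07-26">26 July 1980</date>, in the company of
  <persName ref="#sylvie">Sylvie</persName>.</p>
The when and ref attributes are what make such markup usable as data: the values can be collated, sorted, and linked quite independently of the prose they annotate.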
Why use a common framework?
- re-usability and repurposing of resources
- modular software development
- lower training costs
- ‘frequently answered questions’ — common technical solutions for different application areas
The TEI was designed to support multiple views of the same resource
Conformance issues
A document is TEI Conformant if and only if it:
- is a well-formed XML document
- can be validated against a TEI Schema, that is, a schema derived from the TEI Guidelines
- conforms to the TEI Abstract Model
- uses the TEI Namespace (and other namespaces where relevant) correctly
- is documented by means of a TEI Conformant ODD file which refers to the TEI Guidelines
TEI conformance does not mean ‘Do what I do’, but rather ‘Explain what you do in terms I can understand’
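For orientation, the skeleton of a well-formed TEI document in the TEI namespace looks roughly like this (all titles and content here are placeholders); full conformance additionally requires the schema validation and ODD documentation listed above:
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>A placeholder title</title>
      </titleStmt>
      <publicationStmt>
        <p>Placeholder publication statement.</p>
      </publicationStmt>
      <sourceDesc>
        <p>Born digital: no pre-existing source.</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>Some text, encoded according to a project-specific ODD.</p>
    </body>
  </text>
</TEI>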
Why is the TEI still here?
Because it is a model of textual data which is ...
- customisable,
- self-descriptive,
- and user-driven