[Put logo here]

Modelling meaning: the role of data in the humanities

He who sees the Infinite in all things sees God. He who sees the Ratio only, sees himself only. (William Blake)

Lou Burnard

The brief

(Sorry, I won't answer all these questions)

The status of data in current SHS (humanities and social sciences) research

(see e.g. Flanders and Jannidis 2019)

Encoding data for discourse studies

Which level/s of description do we favour?

Multiple levels can coexist in XML
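For instance (a minimal TEI-style sketch; the line-break positions are invented for illustration), verse structure can be encoded as elements while the physical lineation of a witness coexists on the same text as empty milestone tags:

```xml
<!-- verse lines as <l> elements; manuscript line-breaks as <lb/> milestones -->
<lg>
  <l>Hwæt! we Gar-<lb/>dena in gear-dagum</l>
  <l>þeod-cyninga þrym <lb/>gefrunon</l>
</lg>
```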

But in practice, linguists seem to prefer fairly simple -- reductive -- data categorisations

A choice must be made...

Is there any such thing as a "pure" transcription?

A language corpus consists of samples of authentic language productions ...

Can there be any re-presentation without interpretation?

Annotation: necessary evil or fundamental?

‘Annotation ... is anathema to corpus-driven linguists.’ (Aarts, 2002)

‘The interspersing of tags in a language text is a perilous activity, because the text thereby loses integrity.’ (Sinclair, 2004)

‘… the categories used to annotate a corpus are typically determined before any corpus analysis is carried out, which in turn tends to limit ... the kind of question that usually is asked.’ (Hunston, 2002)

‘... all encoding interprets, all encoding mediates. There is no 'pure' reading experience to sully. We don't carry messages, we reproduce them – a very different kind of involvement. We are not neutral; by encoding a written text we become part of the communicative act it represents.’ (Caton 2000)

A naive realist's manifesto

How do we keep the virtuous hermeneutic circle turning?

Modelling matters

How did we get here from there?

Let's (briefly) go back to the unfamiliar world of the mid-1980s...

...but also a familiar one

Modelling the data vs modelling the text

By the end of the 1970s, methods variously called ‘data modelling’, ‘conceptual analysis’, ‘database design’ vel sim. had become common practice.

An inherently reductive process

How applicable are such methods to the complexity of humanities data sources?

The 1980s were a period of technological enthusiasm

Where did these digital methods originally thrive?

Corpus Linguistics: searching for meaning

How do we identify the components of a discourse which give it meaning?

the text is the data

Stylometrickery

the text is the data

The re-invention of Quellenkritik

‘History that is not quantifiable cannot claim to be scientific’ (Le Roy Ladurie 1972)

the data is extracted from the text

How should we model textual data?

The Textual Trinity (Burnard 1987)

In interpreting text, the trained human brain operates quite successfully on three distinct levels; not surprisingly, three distinct types of computer software have evolved to mimic these capabilities.

Text is little boxes

(Preliminary description of TeX: D. Knuth, May 13, 1977)

No, text is data

Database orthodoxy

But what are the "important entities" we might wish to identify in a textual resource?

Assize court records, for example

(An application of CODASYL techniques to research in the humanities, 1980)

What is a text (really)?

  • content: the components (words, images, etc.) which make up a document
  • structure: the organization and inter-relationship of those components
  • presentation: how a document looks and what processes are applied to it
  • context: how the document was produced, circulated, processed, and understood
  • ... and possibly many other readings

For example:

Separating content, structure, presentation, and context means:
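Schematically (the element names are standard TEI; the content is invented), the four levels map onto different parts of an encoded document:

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <!-- context: source, provenance, encoding decisions -->
  </teiHeader>
  <text>
    <body>
      <div type="chapter">                  <!-- structure -->
        <head rend="italic">A heading</head><!-- presentation hint -->
        <p>Some content ...</p>             <!-- content -->
      </div>
    </body>
  </text>
</TEI>
```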

This is not a new idea! But is it a good one?

Some ambitious claims ensued

(Presentation for Oxford IT Support Staff Conference, 1994)

A digital text may be ...

a ‘substitute’ (surrogate) simply representing the appearance of an existing document

... or it may be

a representation of its linguistic content and structure, with additional annotations about its meaning and context.

Functions of encoding

... and thus expresses the encoder's view of what matters in this document, determining how it can subsequently be analysed.

Which textual data matters?

A ‘document’ is something that exists in the world, which we can digitize.

A ‘text’ is an abstraction, created by or for a community of readers, which we can encode.

The document as ‘Text-Bearing Object’ (TBO)

Materia appetit formam ut virum foemina (‘matter desires form as woman desires man’)

And don't forget ... digital texts are also TBOs!

Markup is a scholarly activity

... though considerations of scale may have an effect ...

Because ...

Good markup (like good scholarship) is expensive

Big data vs. curated data

Choices (1)

Consider this kind of object:

BL Ms Cotton Vitelius A xv, fol. 129r

Some typical varieties of curated markup

<hi rend="dropcap">H</hi><g ref="#wynn">W</g>ÆT WE GARDE
<lb/>na in gear-dagum þeod-cyninga
<lb/>þrym gefrunon, hu ða æþelingas
<lb/>ellen fremedon. oft scyld scefing sceaþe<add>na</add>
<lb/>þreatum, moneg<expan>um</expan> mægþum meodo-setl<add>a</add>
<lb/>of<damage><desc>blot</desc></damage>teah ...
<lg>
  <l>Hwæt! we Gar-dena in gear-dagum</l>
  <l>þeod-cyninga þrym gefrunon,</l>
  <l>hu ða æþelingas ellen fremedon,</l>
</lg>
<lg>
  <l>Oft <persName>Scyld Scefing</persName> sceaþena þreatum,</l>
  <l>monegum mægþum meodo-setla ofteah;</l>
  <l>egsode <orgName>Eorle</orgName>, syððan ærest wearþ</l>
  <l>feasceaft funden...</l>
</lg>

... and

<s>
  <w pos="interj" lemma="hwaet">Hwæt</w>
  <w pos="pron" lemma="we">we</w>
  <w pos="npl" lemma="gar-denum">Gar-dena</w>
  <w pos="prep" lemma="in">in</w>
  <w pos="npl" lemma="gear-dagum">gear-dagum</w>
  ...
</s>

or even

<w pos="npl" corresp="#w2">Gar-dena</w>
<w pos="prep" corresp="#w3">in</w>
<w pos="npl" corresp="#w4">gear-dagum</w>
<!-- ... -->
<w xml:id="w2">armed danes</w>
<w xml:id="w3">in</w>
<w xml:id="w4">days of yore</w>

.. not to mention ...

<!-- ... --><l>Oft <persName ref="https://en.wikipedia.org/wiki/Skj%C3%B6ldr">Scyld Scefing</persName> sceaþena þreatum,</l>

or even

<l>Oft <persName ref="#skioldus">Scyld Scefing</persName> sceaþena þreatum,</l>
<!-- ... -->
<person xml:id="skioldus">
  <persName source="#beowulf">Scyld Scefing</persName>
  <persName xml:lang="lat">Skioldus</persName>
  <persName xml:lang="non">Skjöld</persName>
  <occupation>Legendary Norse King</occupation>
  <ref target="https://en.wikipedia.org/wiki/Skj%C3%B6ldr">Wikipedia entry</ref>
  <!-- ... -->
</person>

Choices (2)

How about this kind of object ...

A random shelf from the London Library

The digital library model

What can you do with a million books?

Distant Reading

(From Ryan Heuser on Twitter)

"Designing a text-analysis program is necessarily an interpretative act, not a mechanical one, even if running the program becomes mechanistic." (Joanna Drucker, Why Distant Reading Isn’t(2017)

Choices (3)

... or this kind of object

The linked data model

Hype?

‘LOD creates a store of machine-actionable data on which improved services can be built... facilitate the breakdown of the tyranny of domain silos ... provide direct access to data in ways that are not currently possible ... provide unanticipated benefits that will emerge later.’ (Anon, passim)

LOD is about linking web pages together...

Wait ...

... did someone mention the TEI?

Impact and effects of data-driven research

Representativeness ... of what?

Are there more novels published by men than by women in the 19th century?

Data from http://www.victorianresearch.org/atcl/

How should we create a representative sample of this population?
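One common answer is stratified sampling: draw from each category in proportion to its share of the population. A minimal sketch (the author-gender labels and counts here are hypothetical, not figures from the ATCL data):

```python
import random
from collections import defaultdict

def stratified_sample(records, key, n, seed=0):
    """Draw a sample of size n whose strata proportions match the population's."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in records:
        strata[key(r)].append(r)
    sample = []
    for group, members in strata.items():
        # quota for this stratum, proportional to its population share
        quota = round(n * len(members) / len(records))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample

# Hypothetical population: 700 novels by men, 300 by women
population = [{"author_gender": "m"}] * 700 + [{"author_gender": "f"}] * 300
sample = stratified_sample(population, key=lambda r: r["author_gender"], n=100)
```

A proportional sample of 100 titles would then contain 70 and 30 titles from the two strata respectively; the harder question, taken up below, is what to do when several categorisations must be balanced at once.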

But what if we want to consider multiple categories?

Aiming for a balanced corpus

Cultural difference or sampling error?

Hungarian

Is the TEI really an ontology?

But before we can extract or model their content, documents must be interpreted...

Data science and textual data

.... and (maybe) the same applies to data

"All data is historical data: the product of a time, place, political, economic, technical, & social climate. If you are not considering why your data exists, and other data sets don’t, you are doing data science wrong”

[Melissa Terras, Opportunities, barriers, and rewards in digitally-led analysis of history, culture and society. Turing Lecture 2019-03-03, https://youtu.be/4yYytLUViI4]

A recent example: global reach versus situated context

"attempts to create a single map of all possible elements and attributes, and to provide provenance of internal structures while grouping object by type and subtype, raised significant ontological issues" (Beals, M. H. et al The Atlas of Digitised Newspapers and Metadata, 2020)

Different institutional catalogues

All institutional collections

Conclusion

Still from Behind the Doors (Wehn-Damisch, 2012)

Umberto says:

We conclude...

‘Text is not a special kind of data: data is a special kind of text’