3 Text as data

Books, paintings, plays, performances, novels, poetry and films are the objects of study in the humanities. Although we don’t usually refer to them as data, we could say that the text of a poem, the dialogues of a play or the visual elements of a Dalí painting are data. This meaning corresponds to a first definition of data: information that allows us to know something. In this sense, however, we are referring to analogue data, that is, data that is not explicitly represented as separate and distinct values and therefore cannot be analyzed computationally (Schöch, 2013). For this reason, a second definition is added to the first: data is information arranged in a way that is suitable for processing by a computer.

There are different types of data, according to their formal structure: structured data, unstructured data and semi-structured data.

Types            Definition                Formats
Unstructured     no formal structure       TXT
Semi-structured  partial formal structure  XML, HTML
Structured       formal structure          CSV, Excel, RIS, BibTeX

3.1 Unstructured Data

In general, textual data (any kind of written expression represented as text: a poem, a novel, a paper, a bibliography) is considered unstructured data because it is not organized into separate and distinct semantic units:

1605 Miguel de Cervantes published the first part of Don Quixote.

A simple reading allows us to understand this text, to delimit its parts and to make explicit what is implicit: which author is being discussed, what the work is and in what year it was published. For a computer, however, it is difficult (though not impossible) to analyze this information because it is not structured: the limits of each element, its relations and its meaning are not explicit. It is just a sequence of characters. For the storage and exchange of unstructured data between programs, plain text files are often used, recognizable by the .TXT extension. The text contained in such files is represented as a sequence of characters without any additional structure or formatting. Plain text files are compatible with most software for text processing.
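To see what "just a sequence of characters" means in practice, here is a minimal Python sketch. The variable names and the four-digit pattern are assumptions made for illustration: the pattern only guesses at what might be a year, and nothing in the text itself marks it as a date.

```python
import re

text = "1605 Miguel de Cervantes published the first part of Don Quixote."

# To the computer the sentence is only a sequence of characters.
first_chars = list(text[:6])
print(first_chars)

# A pattern can guess at the year, but nothing in the text says "this is a date".
year_candidates = re.findall(r"\b\d{4}\b", text)
print(year_candidates)
```

The guess works here only because the sentence happens to contain a single four-digit number. Finding the author or the title in the same way would be much harder.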

3.2 Structured Data

Structured data usually follows a specific data model, which explicitly defines the data. It is typically organized as variables and observations. Think of a table with data divided into columns and rows:

Variable     Variable     Variable     Year  Place
observation  observation  observation  2007  Madrid
observation  observation  observation  1605  Paris
observation  observation  observation  2009  London

One variable in this table is Place, with the observations Madrid, Paris and London. Another variable is Year, with the observations 2007, 1605 and 2009. The data is therefore structured according to a model that defines each line of the table as a bibliographic entry. The file format with the extension .CSV is used to structure data in the form of a table divided into columns and rows. The content is still plain text, but it follows certain conventions (commas, quotation marks, line breaks) to structure the data: commas separate cells (CSV = Comma Separated Values); line breaks separate rows; the first row can hold the variable names (the headers of the table); quotation marks surround an observation if there is a comma in it.

Miguel de Cervantes,Quijote,Madrid,1605
Adam Mickiewicz,Pan Tadeusz,Paris,1835
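As a rough illustration of how a program can rely on these conventions, the following sketch reads the two rows above with the csv module from the Python standard library. The column names (author, title, place, year) are an assumption, since the example has no header row. The last lines show the quotation-mark convention when a value contains a comma.

```python
import csv
import io

# The two example rows from the text, held in memory instead of a file.
csv_text = (
    "Miguel de Cervantes,Quijote,Madrid,1605\n"
    "Adam Mickiewicz,Pan Tadeusz,Paris,1835\n"
)

# The example has no header row, so the column names are supplied here (assumed).
fieldnames = ["author", "title", "place", "year"]
entries = list(csv.DictReader(io.StringIO(csv_text), fieldnames=fieldnames))
places = [entry["place"] for entry in entries]
print(places)

# Writing a value that contains a comma: the writer adds the quotation marks.
buffer = io.StringIO()
csv.writer(buffer).writerow(["Cervantes, Miguel de", "Quijote", "Madrid", 1605])
quoted_line = buffer.getvalue().strip()
print(quoted_line)
```

Because every cell has an explicit position, asking for all the places of publication is a one-line operation, which was not the case for the unstructured sentence.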

CSV files are compatible with most software programs for text processing, in the same way as files with the TXT extension. Files generated by spreadsheets, e.g. Excel, are also structured data, but they use their own conventions; most of them can, however, be exported in CSV format. Sometimes other more specific conventions can be used for particular data. For example, the .bib or .bibtex file format, also based on plain text, uses other conventions for defining lists of bibliographic items.

@book{burnard_what_2014, 
location={Marseille}, 
title={What is the Text Encoding Initiative?}, 
author={Burnard, Lou}, 
year={2014},
doi = {10.4000/books.oep.426}
}
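The conventions of this format can also be exploited by a program. The sketch below is a deliberately simplified illustration, not a full BibTeX parser: it assumes that every field has the form name={value} and that values contain no nested braces, which holds for the entry above but not for BibTeX files in general.

```python
import re

bibtex_entry = """@book{burnard_what_2014,
location={Marseille},
title={What is the Text Encoding Initiative?},
author={Burnard, Lou},
year={2014},
doi = {10.4000/books.oep.426}
}"""

# Entry type and citation key come before the first comma.
header = re.match(r"@(\w+)\{([^,]+),", bibtex_entry)
entry_type, cite_key = header.group(1), header.group(2)

# Simplification: each field is name={value} with no nested braces.
fields = dict(re.findall(r"(\w+)\s*=\s*\{([^{}]*)\}", bibtex_entry))
print(entry_type, cite_key, fields["author"], fields["year"])
```

For real bibliographies a dedicated BibTeX library or a reference manager is the safer choice.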

3.3 Semi-structured data

Semi-structured data is often expressed with markup languages, which annotate separate and distinct elements within otherwise unstructured data.

In <date>1605</date> <persName>Miguel de Cervantes</persName> 
published the first part of <title>Don Quixote</title> 

Semantic information is added to unstructured text by enclosing the elements of interest in pairs of tags. The proper name, Miguel de Cervantes, is surrounded by an opening tag <persName> and a closing tag </persName>. In this way, parts of the unstructured text can be processed by a computer to extract, for example, a list of all the proper names for creating an index. This type of markup is usually stored in files with the extension .XML, which stands for eXtensible Markup Language. It is called extensible because it can use elements (tags) that are already available or define new ones. TEI (Text Encoding Initiative) is a specific vocabulary of XML elements, as we are going to see in the next chapter.
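The extraction of proper names described above can be sketched with the xml.etree.ElementTree module from the Python standard library. One assumption is made: the fragment is wrapped in a <p> element, because a well-formed XML document needs a single root element.

```python
import xml.etree.ElementTree as ET

# The fragment from the text, wrapped in a <p> element (an assumption) so that
# it has the single root element a well-formed XML document needs.
xml_text = (
    "<p>In <date>1605</date> <persName>Miguel de Cervantes</persName> "
    "published the first part of <title>Don Quixote</title></p>"
)

root = ET.fromstring(xml_text)

# Collect the content of every <persName> element, e.g. as raw material for an index.
names = [element.text for element in root.iter("persName")]
print(names)

# The same approach works for the other annotated elements.
year = root.findtext("date")
work = root.findtext("title")
print(year, work)
```

Unlike a pattern-based guess, this lookup depends only on the tags, so it keeps working however the surrounding sentence is worded.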

HTML (Hypertext Markup Language), used widely for the representation of web pages, also stores semi-structured data with tags, although it is more focused on the visual representation of the data than on the semantic structure, which characterizes TEI.

In 1605 <b>Miguel de Cervantes</b> 
published the first part of <i>Don Quixote</i> 
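The difference between visual and semantic markup shows up as soon as a program reads the fragment. The following sketch uses the html.parser module from the Python standard library; the class name EmphasisCollector is hypothetical and chosen only for this illustration.

```python
from html.parser import HTMLParser


class EmphasisCollector(HTMLParser):
    """Collects the text found inside <b> and <i> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.current_tag = None
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag in ("b", "i"):
            self.current_tag = tag

    def handle_endtag(self, tag):
        if tag == self.current_tag:
            self.current_tag = None

    def handle_data(self, data):
        if self.current_tag is not None:
            self.found.append((self.current_tag, data))


html_text = (
    "In 1605 <b>Miguel de Cervantes</b> "
    "published the first part of <i>Don Quixote</i>"
)

collector = EmphasisCollector()
collector.feed(html_text)
collector.close()
print(collector.found)
```

The program can tell that one phrase is bold and another is italic, but not that the first is a person and the second a title. That information is what tags such as <persName> and <title> add.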

3.4 Exercise

To manipulate data you will use a text editor. The most widespread operating systems have one installed by default (TextEdit, Notepad, gedit). Although any of those would do the job, please install Visual Studio Code. It is a cross-platform text and code editor that we are also going to use for editing XML-TEI throughout this course. See Appendix A for more on the editor.

3.4.1 Create a table in CSV

You can use the animated picture below as a guide:

3.4.2 Create an HTML document

<html>
    <head>
        <title>Simple HTML Document</title>
    </head>
    <body>
        <h1>About Cervantes</h1>
        <p>In 1605 <b>Miguel de Cervantes</b>
        published the first part of <i>Don Quixote</i>
        </p>
    </body>
</html>

References

Schöch, Christof (2013): “Big? Smart? Clean? Messy? Data in the Humanities”, Journal of Digital Humanities, 2, 3, <http://journalofdigitalhumanities.org/2-3/big-smart-clean-messy-data-in-the-humanities/>.