Using digitised library collections to help locate literary sources

Over the past decade, a number of organisations have invested heavily in the large-scale digitisation of library collections. Perhaps the most prominent of these initiatives is Google Books (previously known as Google Book Search), which since 2004 has been digitising books and other content supplied by a large number of library partners, but a wide range of other text digitisation initiatives exist. While many earlier text digitisation efforts were based on the creation of high-quality texts using structured document formats like the TEI (Text Encoding Initiative) guidelines, large-scale digitisation programmes have generally taken the approach of linking page images directly to the text output generated by automated OCR (optical character recognition) tools. This approach, perhaps first used on a large scale for the content of scholarly journals by JSTOR, enables the full text of digitised content to be searched and supports the generation of derivative versions in formats like PDF or EPUB (or print).

For legal reasons, the full-text content currently made available by these initiatives tends to be material that is clearly out of copyright. Even so, the resulting collections collectively constitute a resource of enormous potential value for scholars studying the past [1]. A recent paper has concluded that the “full view” holdings in Google Books compare reasonably well with the nineteenth-century holdings of some major research libraries [2]. While there remain some serious reservations about the dominance of one particular commercial player in this space, as well as about the quality of the scans, metadata and OCR output produced by these initiatives [3], there is a general acceptance that they have the potential to be extremely useful. For example, Dan Cohen of George Mason University has commented on how Google Books amplifies the collections made routinely available to scholars at smaller institutions:

Although I was trained at places with large research libraries of Google Books scale, I’m now at an institution that is far more typical of higher ed, with a mere million volumes and few rare works. At places like Mason, Google Books is a savior, enabling research that could once only be done if you got into the right places. I regularly have students discover new topics to study and write about through searches on Google Books. You can only imagine how historical researchers and all students and scholars feel in even less privileged places. Despite its flaws, it will be the source of much historical scholarship, from around the globe, over the coming decades. It is a tremendous leveler of access to historical resources. [4]

And this is not just true of history. I was reminded of this in a recent discussion of Thomas Hardy’s literary sources. The starting point for this was a query on the Thomas Hardy Association’s hardy-l discussion forum asking subscribers if they knew whether there was any mention of the geologist Charles Lyell in two key sources: the Collected Letters edited by R. L. Purdy and M. Millgate (7 vols, Oxford: Clarendon Press, 1978-1988) and the Literary Notebooks edited by L. A. Björk (2 vols, London: Macmillan, 1985). The question of where Hardy gained his knowledge of Lyell is one which has been explored before by Hardy scholars, not least by Patricia Ingham in her paper “Hardy and the Wonders of Geology,” where she has argued that Hardy’s debt to Lyell was largely mediated through Gideon Mantell’s popular account of The Wonders of Geology (1838) [5]. The Bohn edition of this (1848) duly appears in Millgate’s attempted reconstruction of Hardy’s Max Gate library [6].

Ingham’s paper (p. 60) also mentions the one entry in the Literary Notebooks that refers to Lyell. Surrounded by entries taken from the Cornhill Magazine, the Saturday Review, and the Dorset County Chronicle, this consists of a short anecdote about Lyell illustrating his enthusiasm for geology, followed by a quotation from Goethe which is stated to be from the same source, a speech by A. P. Stanley, the well-known Dean of Westminster [7]:

1078 “The enthusiasm of Sir Charles Lyell, who when travelling along a cutting gazed out of the railway carriage as if the sides were hung with beautiful pictures.” Dean Stanley (speech)

1079 “‘There are many echoes in the world,’ said a great German poet ‘but few voices’.” id. (p. 119)

I happen to have a copy of the original (1974) edition of the first volume of the Literary Notebooks (Gothenburg Studies in English, 29), but Björk’s only comment there (p. 343) was that the source of entry 1078 was unidentified. Similarly, the edition described the immediate source for the Goethe quotation as unknown, but added that on a loose notebook sheet, now in the Dorset County Museum, Hardy had “jotted down the following (in pencil): ‘In this world there are few voices & many echoes. G. H. Lewes,’” and that the “wording of entry 1079 shows, however, that Lewes’s biography of Goethe … is not the direct source.” When Björk was editing the Literary Notebooks, it would have been extremely time-consuming to explore the voluminous writings of Arthur Penrhyn Stanley for these two entries, with the added complication that they might have been based on a recollection of a speech that had never been published.

However, the existence of Google Books and the excellent Internet Archive collection of texts meant that, within a relatively short amount of time, I was able to track down a plausible source for both the Lyell anecdote and the Goethe quotation. I focused on A. P. Stanley (1815-81), who as Dean of Westminster was the preacher of Lyell’s funeral sermon. An initial search of Google Books directed me to an undated collection of Stanley’s writings entitled Thoughts that Breathe, which had been posthumously published in Boston [8]. The full text of this work was not available to me through Google Books, and the corresponding metadata record on the Internet Archive directed me to an unconnected text (thus proving that it is not just Google Books that has quality control problems with its metadata), but I was luckily able to track down a copy directly from the University of Michigan. This definitely included both the anecdote about Lyell and a version of the notebook entry on Goethe. The problem remained, however, that this work, published in the United States and (at least according to COPAC) not extant in any major UK library collection, was extremely unlikely to have been Hardy’s own source. A quick look at the volume itself, however, demonstrated that Thoughts that Breathe was merely a collection of unattributed extracts (for copyright reasons?) from Stanley’s previous writings.

The Goethe reference gave me my next clue. Searching Google Books for the fuller version of the quote published in Thoughts that Breathe showed that it had also been used in Stanley’s inaugural rectorial address delivered before the University of St Andrews on 31 March 1875, a speech entitled “The Study of Greatness” [9]. This time, only part of the text was available from Google Books, but the complete work was available from the Internet Archive. The wording of the Goethe reference was slightly different to that in Thoughts that Breathe, but the St Andrews address seemed a far more credible source than the Boston compilation.

Remember the pregnant saying of Goethe – ‘There are many echoes in the world but few voices’ – and let it be your constant effort to distinguish the voices from the echoes, and to respond accordingly … (p. 23)

However, there was absolutely no sign of the Lyell anecdote in this or any other of the St Andrews addresses.

Returning once more to the Lyell story, and searching Google Books using the exact wording of the anecdote given in Thoughts that Breathe, I finally found the Lyell quotation in two periodicals: Popular Science Monthly and (much more promisingly) Macmillan’s Magazine. Given Hardy’s strong links with Macmillan, the latter was a far more plausible source. Yet again, however, Google Books would not provide me with access to the full text of the periodical, so I once again had to browse the Internet Archive for the relevant volume of Macmillan’s Magazine. The quality of the full text was a little variable, but I was finally able to find the Lyell anecdote in vol. 37, no. 218, published in December 1877 [10]:

I have heard people say who had travelled with Sir Charles Lyell, that to see him hanging out of the window of a railway carriage, to watch the geological formations as he passed through a railway-cutting was as if he saw the sides hung with beautiful pictures. (p. 100)

Stanley’s article was entitled “The Education of After Life” and was an address delivered on the occasion of the new session of (the recently formed) University College, Bristol on 27 October 1877. The Goethe quotation also featured (pp. 102-103), proving to my satisfaction (at least) that this was the source of both entries 1078 and 1079 in Björk’s edition of the Literary Notebooks. The Literary Notebooks elsewhere demonstrate that Hardy did regularly use Macmillan’s Magazine as a source of information. The dates also fit with other entries in the notebooks.

This example is relatively trivial and, naturally, I cannot rule out that others might have worked out these links already. However, I feel that it does show how using a combination of Google Books (for matching sentences) and the Internet Archive’s text collections (for retrieval) might enable some simple literary puzzles to be solved fairly easily.
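
The matching step can also be illustrated in code. The following Python sketch is not the method used above (the searches were run through Google Books itself). It simply assumes that candidate passages have already been retrieved as plain text, and ranks them by the share of a notebook entry's content words that they contain. The function names, the stopword list and the scoring measure are all illustrative assumptions; the sample texts are the passages quoted in this post.

```python
import re

# A small, hand-picked stopword list (an assumption for this sketch only).
STOPWORDS = {
    "the", "a", "an", "of", "to", "in", "and", "that", "was", "as", "if",
    "who", "when", "he", "him", "his", "it", "be", "but", "with", "had",
    "have", "i",
}


def content_words(text):
    """Lower-case the text and return its set of non-stopword words."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w not in STOPWORDS}


def containment(note, passage):
    """Fraction of the note's content words that also occur in the passage."""
    note_words = content_words(note)
    if not note_words:
        return 0.0
    return len(note_words & content_words(passage)) / len(note_words)


def rank_candidates(note, candidates):
    """Return (label, score) pairs sorted from best to worst match."""
    scored = [(label, containment(note, text)) for label, text in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hardy's notebook entry 1078, as quoted earlier in this post.
NOTE_1078 = (
    "The enthusiasm of Sir Charles Lyell, who when travelling along a cutting "
    "gazed out of the railway carriage as if the sides were hung with "
    "beautiful pictures."
)

# Candidate passages, also as quoted in this post (labels are hypothetical).
CANDIDATES = {
    "macmillan_1877": (
        "I have heard people say who had travelled with Sir Charles Lyell, "
        "that to see him hanging out of the window of a railway carriage, to "
        "watch the geological formations as he passed through a "
        "railway-cutting was as if he saw the sides hung with beautiful pictures."
    ),
    "st_andrews_1875": (
        "Remember the pregnant saying of Goethe - 'There are many echoes in "
        "the world but few voices' - and let it be your constant effort to "
        "distinguish the voices from the echoes, and to respond accordingly"
    ),
}

if __name__ == "__main__":
    for label, score in rank_candidates(NOTE_1078, CANDIDATES):
        print(f"{label}: {score:.2f}")
```

On these two passages the Macmillan’s Magazine text ranks first for entry 1078, which is what one would expect given how closely Hardy’s paraphrase follows Stanley’s wording. A word-overlap measure of this kind is crude, and OCR errors in real full text would lower the scores, so it is best treated as a way of shortlisting passages to read rather than as proof of a source.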

More puzzling was the relationship between Google’s digitisation programme and the Internet Archive’s text collections. Metadata for some Google Books content was included in the Internet Archive’s catalogue. Some content that was not available (at least to me) through Google Books was available through the Internet Archive, but not in every advertised format. Typically, for example, I could not download texts in formats like PDF, but I could retrieve plain text (i.e. that generated by OCR) or view page images through the Internet Archive’s BookReader. This is why Google Books turned out to be the more powerful tool for searching, while actual retrieval depended on the use of other services.


