1 Digital Philology meets premodern Chinese texts

Electronic texts for researchers, by researchers.


One of the challenges posed to us as scholars and editors by our times is to preserve the cultural heritage that has come down to us through the ages on its transition to digital formats. These new formats do away with many of the limitations of the print medium and open up exciting new possibilities. These can be realized in many different ways, however, based on our assumptions about what is the best way to achieve the desired result, what aspects of a work should be preserved and what aspects can be changed or discarded, and how access to the digital products should be organized.

Some of these assumptions may be due to a lack of awareness of the new freedoms allowed by today’s technology. However, while most of these assumptions are not technical in themselves, they nonetheless involve decisions about technical implementation that at some point will influence the design of the interfaces and tools used to access the cultural heritage.

Here I will discuss the basic assumptions concerning the Kanseki Repository 漢籍リポジトリ (Kanripo). Firstly, this preservation should not be achieved by fixing the cultural artifact in display cases, like those of a museum, such that modern users cannot readily interact with it, but rather by making it available for exploration, research and all kinds of interactions, imaginable and as yet unimaginable. Another assumption is that the process of preservation should be as open and transparent as possible, based on the ideal of a free, non-authoritarian discourse that embraces

These assumptions concern the design of the technical side of the system, as well as how decisions about this process are reached, what goals should be envisioned, and so forth. All users should be involved and have a say, including future generations.

This is obviously not practicable, however, and since access to and use of the cultural heritage is fundamental, a key goal is to design a system that enables as many and as diverse uses as possible. This implies that there will be no central authority over the digital work, so the implementation will need to work in a way that does not require one. On the other hand, trust and reputation are important in allowing users to assess authoritativeness for themselves. Thus we will strive to develop a system that fosters a strong and supportive community.

Philology and digital text

The Kanseki Repository serves as a repository for premodern Chinese texts. These texts are provided in a way that facilitates their use as a reliable source for any kind of research for which they are needed. While the system will support a wide variety of different research approaches and methodologies, an awareness of the assumptions the repository is based on will help researchers use it more efficiently. These methodological assumptions are outlined below.

Text and document

Scholars studying almost any aspect of premodern Chinese culture depend on textual “witnesses” of things past. These witnesses are transmitted to us in the form of material objects, but in most cases we are interested not in the objects themselves, but rather in the content they transmit. It is useful here to distinguish between the text as an abstract entity and the medium through which the text is transmitted. A text can be transmitted in the form of many different material objects, each with differences that might also extend to the content of the text. We thus need a model that can take account of these different aspects of an artifact.

One useful model to consider is that developed by the International Federation of Library Associations and Institutions (IFLA, see frbr-report-1997). It describes the "Functional Requirements for Bibliographic Records" (abbreviated as FRBR) and is quite complex, describing all aspects of bibliographic data. Of relevance here is only the first set of terms that denote the "products of intellectual or artistic endeavor" as four related entities: work, expression, manifestation, and item (See Figure 1). These four terms are elaborated below.

  • Work is a "distinct intellectual or artistic creation" (frbr-report-1997, p. 17). For example, the Analects of Confucius or the Dao De jing are works. Works are independent of the rendering in a specific edition or language. Thus, an English translation is considered a representation of the work in the same way as the Chinese original. A work according to this model is an abstract entity that does not rely on a specific expression in words.
  • Expression is "the specific rendering the work takes each time it is 'realized'" (frbr-report-1997, p. 19). A specific formulation of the text of the Analects or a specific translation thereof, or a version of the text of the Dao De jing with the commentary by 王弼 Wang Bi are all examples of expressions. An expression is still independent of a specific edition, however.
  • Manifestation is "the physical embodiment of an expression of a work." (frbr-report-1997, p. 21). For textual works, this is usually a specific edition published by a specific publisher, for example D.C. Lau's translation of the Analects published by Simon & Brown in 2012 or the edition of Wang Bi's commentary included in the 四部叢刊 Sibu congkan of 1929.
  • Item is "a single exemplar of a manifestation. The entity defined as item is a concrete entity." (frbr-report-1997, p. 21). My own exemplar of D.C. Lau's translation or my photocopy of Wang Bi's Sibu congkan edition are examples of items.


Figure 1: Entities of the first group and their relations

As scholarly editors we work with items that carry evidence of the work through the expression that the item records. If there are multiple textual witnesses of the work, we usually deal with multiple manifestations of this expression. On the other hand, in bibliographic discussions concerning specific texts and the way they have been transmitted historically, the focus is usually on expressions and manifestations, i.e., levels of abstraction above the level of the individual item but below the level of work.

In the Kanseki Repository, this model has been used to determine whether two "digital items" should be related to separate expressions, and thus have separate entries in the catalog, or merely to separate manifestations of the same expression, thereby representing two witnesses (editions) under the same entry in the catalog.
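The cataloging decision described here can be pictured with a minimal sketch. It assumes that each digital item is tagged with the work, expression and manifestation it belongs to; the class name, the function name and the sample editions marked "hypothetical" are invented for illustration and do not describe how Kanripo actually stores its catalog.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DigitalItem:
    """One digital witness, tagged with the FRBR levels it belongs to."""
    work: str           # abstract creation, e.g. "Dao De jing"
    expression: str     # specific rendering, e.g. "with Wang Bi commentary"
    manifestation: str  # specific edition, e.g. "Sibu congkan 1929"


def catalog_entries(items):
    """Group items so that each (work, expression) pair forms one catalog entry.

    Manifestations of the same expression become witnesses under that entry.
    """
    entries = defaultdict(list)
    for item in items:
        entries[(item.work, item.expression)].append(item.manifestation)
    return dict(entries)


# Sample data: only the Sibu congkan edition is taken from the text above;
# the other two editions are hypothetical placeholders.
items = [
    DigitalItem("Dao De jing", "with Wang Bi commentary", "Sibu congkan 1929"),
    DigitalItem("Dao De jing", "with Wang Bi commentary", "hypothetical edition B"),
    DigitalItem("Dao De jing", "root text only", "hypothetical edition C"),
]

for (work, expression), witnesses in catalog_entries(items).items():
    print(f"{work} / {expression}: {len(witnesses)} witness(es)")
```

In this sketch the two editions with the Wang Bi commentary share one entry as two witnesses, while the root text alone receives an entry of its own.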

A useful simplification of this terminology might be to employ the more familiar, though slightly ambiguous term text when we speak of the "work", and document for "item". Multiple witnesses of a work would then be multiple "items", which might or might not represent the same "expression/manifestation". Since this second set of terms, while not as formal and precise as the FRBR set, is sufficiently expressive and also more familiar to philologists, I will use the terms text and document where possible.

So how should we understand the relationship between text and document? And what does this understanding mean for our task of producing scholarly reliable, philologically grounded digital editions of a text? Before exploring this further, yet another important clarification needs to be made.

Record and interpretation

In a seminal article, the Swiss scholar Hans Zeller (zeller_record-and-interpretation_1995) emphasized that all scholarly editing should make a clear distinction between the record of what has been transmitted in the "documents" and the scholarly interpretation thereof. In this terminology, the record is the essentially objective evidence of a text in the form of a material document, while the interpretation is the way in which a specific reader of this document, acting as editor of an edition, subjectively reads and understands the text, based on knowledge of the writing style and any other supporting evidence that is derived from knowledge about the text but is not necessarily part of this specific document itself. Equipped with such knowledge, an editor might be able to point out errors in the characters that make up the text, missing or misplaced pages, and additions or omissions in the textual content, and even attempt to produce a new edition (manifestation) of the text that improves on perceived shortcomings. Zeller stresses that we should not only be aware of this subjectivity and objectivity in editing, but also communicate it to the users of the editions we produce, to prevent users from having unreasonable expectations and trying avenues of analysis that are simply not supported by the edition. This is not a new distinction. As Zeller himself points out, in the European tradition it goes back to at least Aristarchus of Samothrace (Ἀρίσταρχος; c. 220 – c. 143 BC), an influential scholar of Homeric poetry, who established the first critical edition of Homer's works, rejecting and marking lines he considered doubtful, but always making it clear what was based on his own judgment and what was evidence of the documents available to him.

While this distinction can be blurry at times, and some question the very notion of objectivity in the record, it still seems useful and it has informed the design of the Kanseki Repository, which arranges the editions of a text it represents into those that strive to faithfully reproduce a text according to some textual witness (record) and those that make alterations or additions to the text, for example by adding punctuation, normalizing characters, collating from other evidence, etc. (interpretation).

Additional requirements

Studying textual content (i.e., just text) in isolation severely limits understanding. In many cases it is desirable to have more context or supporting external evidence. Peter Shillingsburg (shillingsburg-view_2015) outlines the following requirements of a digital edition (slightly edited for clarity):

a. Digitized images of all the documents. This makes it possible to view any document side-by-side with any other document, from anywhere in the world, without traveling from Tokyo to Marburg and New York.

b. A table of variants to show how all the documentary texts differ from one another.

c. A textual history that explains the relationships between the variant documents and why we should care about them.

d. A transcription of at least one of the documents, so that the variants list can be used more easily. Or else transcriptions of all the documents so that readers can select and read any of them. This would also enable machine collation.

e. Edited copies of the transcriptions to correct errors, to create new editions based on individual interpretation. While the other editions of the record will preserve the text as a historical documentary text, such revisions will help readers avoid the distractions caused by scribal or compositorial errors.
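Requirements (b) and (d) can be illustrated with a small sketch of machine collation. It is not the collation method used by the Kanseki Repository; it simply uses the `difflib` module from Python's standard library to compare two transcriptions character by character. The function name `variant_table` and the two sample strings are made up for this illustration and are not transcriptions of actual witnesses.

```python
from difflib import SequenceMatcher


def variant_table(base, other):
    """List the places where `other` differs from `base`.

    Each row is (position in base, reading in base, reading in other).
    """
    matcher = SequenceMatcher(None, base, other, autojunk=False)
    variants = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            variants.append((i1, base[i1:i2], other[j1:j2]))
    return variants


# Illustrative strings only: one character substitution and one omission.
witness_a = "道可道非常道名可名非常名"
witness_b = "道可道非恆道名可名非常"

for position, reading_a, reading_b in variant_table(witness_a, witness_b):
    print(position, reading_a or "(none)", reading_b or "(none)")
```

A real table of variants would also have to decide which character differences count as meaningful variants and which are merely graphic, a decision that belongs to the interpretation rather than to the record.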

As outlined above, a strict division is maintained between the "record", i.e., a descriptive and neutral representation of a text, and the "interpretation" of that record, i.e., any edited version of, commentary to, or other enhancement of the text. For this purpose, anything that has been done before the activity of the modern editors, including previous modern editions of a text, is considered part of the record.

This should allow users to clearly distinguish at any moment between views or changes introduced within the tradition and those introduced later.

The record: Two faces of digital text

What has been said so far could be applied in a similar way to printed texts (Shillingsburg) and in some cases it has been written about exclusively with printed editions in mind (Zeller). But there are also issues that are specific to digital editions.

Every edition that enters the record can be represented in two distinct but related digital forms: as a digital facsimile [1] and as a digital transcription. Both necessarily introduce alterations due to the medium used, but they should still be kept as close to the original as possible. Thus, a digital facsimile might be reduced to monochrome (black and white) images, with the white margins cut away and the contrast enhanced, but still be considered a faithful representation of the original text. [2] In the same way, the transcription of a text might strive to reproduce the text as closely as possible to the characters as seen on the facsimile (or more correctly, as read and interpreted by the typist or input program), without ever achieving complete identity. (This means that if several variants of a character are available in the coded character set, the one closest to the character on the page is selected.)


Figure 2: A digital facsimile of a page from the 太平經抄 (right) with the transcribed text (left)

To be complete, the record also comprises descriptions of the available editions that are not yet part of the repository, ideally including all known editions.

Some conventions to easily distinguish these two types of texts are outlined below.

The Interpretation

While the record tries to present the evidence in a form that is as neutral and unbiased as possible, it is in the interpretation that real scholarly editing takes place. The record is scrutinized and amended, commented on, etc. Such modifications and interpretation ensure the living evolution of the text that has continued ever since it was first fixed in writing.

One important use of this concept in the context of the Kanseki Repository is to provide an edition that employs modern character usage, rather than the characters used in the editions documented in the record. This allows us to cater to the needs of modern users but still provides a record of the existing textual tradition through reference to the editions in the "record".

Premodern Chinese texts (古籍 guji in Chinese, 漢籍 kanseki in Japanese – hereinafter I will use the Japanese designation) that were transmitted in the form of beautiful woodblock prints for many centuries reveal numerous traces of this kind of evolution, as their owners took possession of the texts as objects. As well as affixing collector's stamps, they also freely added marginal notes, reading marks, punctuation and more, thus making almost every copy of a work a unique item reflecting a specific trajectory of readership over the generations. Similarly, all users of the repository can create their own copies, marked with their individual readings and comments. As yet, we do not have the digital equivalent of a collector's stamp, however.

In addition to this ad hoc commenting on individual items, there is also the traditional exegesis, in which commentary, explication and sound-glosses were inserted interlinearly into the root text, albeit usually in smaller characters to set them off. The result was then published as a new edition. In other texts, the root text and the commentary might also be set apart by other typographical means, by indenting the first line of the commentary for example. It is thus a fact that outside of the Buddhist scriptures, almost all texts of any standing as a classic have been transmitted not as raw, unadulterated texts, but rather wrapped in layers of commentary and annotation. Within the Kanseki Repository, such 'texts with commentary' are usually treated as separate texts with their own entries in the catalog. Rather than an interpretation of the root text, in the philological sense used here, they become a record in their own right.

Details of the technical implementation

The texts contained in the repository are presented with only the markup that is absolutely necessary for them to function within the supporting system. Every edition of the text is established separately and curated alongside other editions. The format for the texts is an extension to the Emacs Org-mode text format [3], and the technical means of handling the editions alongside one another is 'version control software', as outlined in the following section.

This approach is quite contrary to the current practice in Scholarly Editing and Digital Philology, which uses the XML-based textual model of the Text Encoding Initiative (TEI, see tei) as the standard, and therefore requires an explanation.

First of all, it is important to note that there is not just one true rendition of an existing text in digital form. Every rendition will have to develop its own methodology, make decisions about how to represent certain distinct features of a source text and what to consider as ancillary baggage to be left behind during the transition to the digital medium. No one digital edition will therefore be the same as another, except where one is a direct digital copy of the other.

One important feature distinguishing such editions is of course the technology used. In Digital Philology, most of the work done since the turn of the century is based on the text model developed by the Text Encoding Initiative (TEI) and syntactically based on XML, a technology that also drives most of the World Wide Web, among other things. With its sophisticated model of texts and many ways to record textual features of interest to the encoder, it is an important technology and its expressivity goes light years beyond the modest means of the syntax employed here.

The fact that TEI and XML are not used for the repository here should not be seen as a statement of their unsuitability for the task at hand. They are very suitable in fact, and can be used in conjunction with what is presented here to great advantage. Alternatively, they can be used on their own in the usual way. In more than 20 years of working with these methodologies, however, I have found that they are not always an ideal fit for every station on the path of a text from the reader (as editor) to the producer and back to the reader. There is also room for a simpler yet well-defined textual model, which can be algorithmically "up-converted" to a more expressive syntax, if desired, to fit into a production pipeline. Additionally, it is useful to provide a text in plain format as the final stage of the publication process.

While the simple format used here works well for establishing the text, and for reading, translating and annotating, it might prove inadequate for more sophisticated use cases, like linguistic annotations, or the markup of names, persons and locations. [4] In such cases, a conversion to an XML representation of the texts is feasible.

The long and short of all this is that there is no single representation that is technically superior in all cases. Pragmatically, the one that best serves the needs of the text and the reader should be chosen at any given time.

This leads quite naturally to another observation: A digital text cannot be sufficiently appreciated by simply looking at the sequence of bits or bytes that represent the text; it is necessary to take into account the whole system that supports the presentation of the text to the reader. The application software for rendering the texts, the underlying operating system and its properties, the fonts used for rendering text, and the size, pixel density and luminosity of the screen (or paper!) on which the text is rendered all contribute to the system that presents the text to the user. Each of these components can vary widely, depending on the purpose of viewing the text. Users of the Kanseki Repository are thus encouraged to experiment with this liberty and use what best fits their needs. Chapter 3 will provide further details of this.

How texts and editions appear in the Kanseki Repository

As mentioned, a minor extension of the Emacs Org-mode format, which is almost plain text, serves both as the first stage of preparing a text in the digital format, for example by typing it out, and as the final stage where the text is published and distributed. This is a format that makes the fewest assumptions about the receiving system by restricting itself technically to a lowest common denominator, i.e., "plain text" [5].

Kanripo implements the principles outlined above in the following way:

  • A text "repository" is created for every text for holding all information related to the text, including links to facsimile renderings of pages and the content of the text itself [6].
  • The repository is named according to a convention that gives the text an appropriate place in the textual and conceptual universe of premodern Chinese texts. In principle, the conventions used here follow those for classification of traditional Chinese cataloging, as exemplified by such great works as the 四庫全書総目提要 Siku quanshu zongmu tiyao or the bibliographical treatises of the dynastic histories, with a few alterations as outlined in the next chapter. The name of the repository serves as an identifier for the text and this identifier is used to uniquely and unambiguously refer to the text.
  • The repository is created using software called "git" [7]. This software was designed to help software developers collaborate across different time zones and locations. The details are unimportant, but the essential point is the concept of "branches", which enables work on different versions of a text to be done simultaneously and allows switching back and forth between multiple versions. The software is capable of dealing with as many different versions of a text as needed and offers many functions, such as efficient storage and retrieval, display of differences, etc.
  • Every one of the editions in the "record" that has a digital representation in Kanripo will have a branch that represents this edition. Such a branch is given a name to identify it. We will use names that include only uppercase letters (and a hyphen "-") to point to branches that represent editions in the record, such as WYG (for the reprint of the 文淵閣 Wenyuan ge edition of the 四庫全書 Siku quanshu) or CK-KZ (for the 考正 Kaozheng reprint of the 重刊道藏輯要 Chongkan Daozang jiyao). These branch names can be arbitrarily chosen, but they should be documented somewhere in the record.
  • Other branches represent editions or versions of the text that do not form part of the record, but are added by editors or researchers for specific purposes, which might be the creation of a new critical edition, reading notes, translation, commentary, etc. If such an edition is based on an edition in the record, the referencing system of the base edition is carried over to allow better integration into Kanripo. These branches are usually named with a combination of uppercase and lowercase letters or lowercase letters only (such as WYG-chris or trans-en) to clearly distinguish them from the non-interpretative branches [8].
  • One of the editions in the record will be selected as a "base edition". The page numbers and line markers of this base edition will be used as a reference framework in all editions of this text. This is not necessarily the philologically most valuable edition, but rather a pragmatic choice for the purpose of supplying a reference framework for the other editions in the "record". This allows a location in one edition to be cross-referenced with the corresponding location in any other edition. All editions have a reference to the base edition as part of the information at the beginning of each file.
  • Among the other branches, there are two which have a special meaning within the Kanseki Repository: the master and _data branches.
    • master is the branch which has the most up-to-date text as chosen by the editors of the Kanseki Repository. This might be the one in the base edition, but might also be a different one, if that seems to be a better starting point. It will usually also introduce (different) punctuation, character normalization, structural markup, and other characteristics. As this shows, the master branch represents an interpretation of the text, not a record, hence the lowercase letters used for its name.
    • _data is a branch that holds data used by the system. These data might pertain to one or more of the other branches. For example, information about which facsimile page belongs to which text location is maintained here.
  • Within every branch of the repository, there will be a file called "Readme.org". This file is not part of the textual content itself, but gives the minimum necessary information about the record (metadata), most notably the specific edition that is described in the branch in which the file Readme.org is located. For this reason, the content of this file will vary across different editions. By convention, the Readme.org file of the master branch, at least, will contain information about all the other branches as well to serve as a quick reference. More detailed information about the text will also be found in the Catalog.
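The naming conventions in the list above lend themselves to a simple mechanical check. The following is a minimal sketch, not part of the Kanripo software: the function name `classify_branch` and the three category labels are invented for illustration, and the rule for record branches (uppercase letters, optionally joined by hyphens) is only one reading of the convention as described here. Branch names of the T@SONG kind, used for texts derived from the Taishō edition, are deliberately left out and would need a rule of their own.

```python
import re

# Assumption: record branches consist of uppercase letters, optionally
# joined by single hyphens (e.g. WYG, CK-KZ).
RECORD_NAME = re.compile(r"[A-Z]+(?:-[A-Z]+)*")

# The two branches with a special meaning in the Kanseki Repository.
SPECIAL_BRANCHES = {"master", "_data"}


def classify_branch(name):
    """Return 'special', 'record' or 'interpretation' for a branch name."""
    if name in SPECIAL_BRANCHES:
        return "special"
    if RECORD_NAME.fullmatch(name):
        return "record"
    # Anything containing lowercase letters counts as an interpretation.
    return "interpretation"


for name in ["WYG", "CK-KZ", "WYG-chris", "trans-en", "master", "_data"]:
    print(name, "->", classify_branch(name))
```

Note that master is reported as "special" here only to set it apart; as explained above, it is itself an interpretation of the text.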

Texts in the Kanseki Repository

One of the distinguishing features of the Kanseki Repository compared to many other projects is that it primarily attempts to document as many editions of a text as feasible. It is therefore not based on one specific edition by default. For every edition, a transcribed version of the edition and a digital facsimile can be kept and these can be read side by side.

It should also be mentioned here that sometimes texts with the same title and author/editor can still be quite different. An example of this can be seen in the 雲笈七籤 Yunji qiqian (KR5d0055), which has 120 巻 juan in the Daozang, but only 40 juan in the Daozang jiyao. Should these be considered to be the same text, or rather different ones? Good arguments can be made for both points of view. Currently, the Kanseki Repository considers them to be versions of the same text and will put them side by side as branches in the same repository, since this arrangement facilitates research on the differences between these texts.

In addition to documenting historical editions, many additional editions of a text can be created in the Kanseki Repository, for example by collating existing editions, and adding punctuation, further markup, annotations, translations, etc.


The catalog is used to maintain information about the texts. This is held in a format similar to the texts and distributed in the same way. Users of the repository can maintain their own versions of the catalog to accommodate specific analytic needs.

At the time of this writing, the Catalog can be accessed at kr-catalog but is not yet integrated into the system.

Notes

[1] The use of the term "digital facsimile" might need some explication here. Due to the malleability of digital entities, a digital image is in many senses not "made similar" to the text-bearing object that served as its source. While there are some ways to make up for this, for example including dimensions and a color bar for calibration in photographs of documents, wherever software is involved there is always the possibility of unaccounted change. Some curators of digital text images therefore prefer the term "digital avatar", to emphasize the fact that they represent re-created resemblances in the digital realm rather than simple reproductions. In the current context, however, I have chosen the term digital facsimile because I think it is less confusing and still sufficiently precise.
[2] It goes without saying that for archival storage, the facsimiles should be kept in as high a resolution as possible and retain as much ancillary information as possible. A raw version as captured by the input device should also be included, as well as a record of how the image was reproduced. (A detailed description of how to digitize a text is beyond the scope of this introduction.)
[3] A detailed description of this format is available in the appendix. There is also a description online, which will be updated as necessary, at mandoku-format-en.
[4] Where exactly this line should be drawn is also up to the user, of course, and the text format used here does provide for simple annotations.
[5] When I first arrived at the CBETA (Chinese Buddhist Electronic Text Association) project, everything was done using a plain text format. Over time, I injected a certain amount of XML into the workflow. However, there are still widely used plain text renditions in circulation, published from the XML source of the texts by automated scripts.
[6] "Repository" is used here in the specific technical sense of a folder on a computer, the content of which is subject to a version control system.
[7] git belongs to the category of version control software used widely by software developers. There are many introductions to git, but they are mostly geared towards programmers. A useful starting point for more information about how to use git is git-docu.
[8] There is yet another form of branch names, which is specific to the Buddhist texts derived from the Taishō edition. Since this edition reports other textual witnesses in text-critical footnotes, these have been used to reconstitute the texts in separate branches. It has not yet been possible to actually tie them to the original text witnesses here, however, so the naming of the branches reflects this indirect way of constituting the text: T@SONG or 大→宋 are both ways to indicate a so-called Song edition (there are in fact several, but the Taishō editors grouped them together into one), without actually having a reference to the source available. It is of course hoped that some day these stop-gap measures can be replaced with references to the actual witnesses of these editions.