Friday, 28 August 2009

How Complex is Tangut ?

Last year my friend Nathan Hill kindly invited me to give a talk on Tangut at my Alma Mater. I accepted with some trepidation because I am still very much at the start of a long and steep learning curve with regard to Tangut, but I hoped that by the time the talk was due to be given in May this year I would have something interesting and exciting to talk about. Unfortunately I got tied up with other stuff (Tangut, ironically), so in the end my talk turned out to be more of a general introduction to the structure of the Tangut script and some of the issues that I have faced over the last year or so in preparing an encoding proposal for Tangut. But anyway, the talk didn't go too badly, and so I thought that I would convert my PowerPoint slides into a four-part series of blog posts.



Notes for an introductory talk on the Tangut script given at SOAS on 21st May 2009



1.1 The Age of New Scripts

During the 10th to 13th centuries a number of new scripts were devised by peoples who had come into contact with (and conflict with) China, and who wanted to assert their national identity and cultural superiority by means of their own, unique and distinct writing systems (colour-coded to show their current Unicode status):


[See Documents relating to the encoding of the Tangut, Jurchen and Khitan scripts for Unicode encoding proposals]


Three of these scripts, Large Khitan, Jurchen and Tangut, are structurally similar to Chinese, and I will look at their similarities and differences, both amongst themselves and in relation to Chinese, below.



1.2 Khitan Large Script

  • Closely modelled on Chinese
  • Many characters borrowed directly from Chinese
  • Some with the same meaning (e.g. 皇帝 in the text below)
  • Some as phonetic borrowings
  • Many other characters derived from Chinese characters by adding or removing strokes (e.g. 東 with two extra strokes on the 6th line from the right in the text below)
  • Few or no characters composed of multiple elements with large numbers of strokes (i.e. no characters like Chinese 雙)
  • Uses exactly the same stroke types as Chinese
  • Largely undeciphered

Transcription of a Khitan Memorial Stone

Source: Mínzú Yǔwén 民族语文 2005 no.4 page 54




1.3 Jurchen

  • Very similar to Khitan Large Script
  • Many characters derived from Khitan and/or Chinese
  • Relatively few direct borrowings from Chinese compared with Khitan
  • No characters with large numbers of strokes or composed from multiple complex elements
  • Uses exactly the same stroke types as Chinese
  • Largely deciphered

Drawing of a "Medallion" with a Jurchen inscription

Source: S. W. Bushell, "Inscriptions in the Juchen and Allied Scripts" in Actes du Onzième Congrès International des Orientalistes (1897) 2nd section page 21
(originally from Fāngshì Mòpǔ 方氏墨譜 [Mr. Fang's Catalogue of Ink Cakes] (1588) vol. 1 folio 33)


Table of Chinese, Khitan and Jurchen Numerals

Source: Daniel Kane, The Sino-Jurchen Vocabulary of the Bureau of Interpreters (1989) page 21



1.4 Tangut

  • Only superficially similar to Chinese
  • Characters are not obviously derived directly from Chinese or Khitan characters, although they are clearly influenced by Chinese
  • Discrete elements arranged into a square character
  • Appears crowded compared with Chinese, with few non-complex characters
  • Most characters composed of two or three distinct components, and only a few characters are themselves elemental components
  • Mostly written using the same stroke types as used for writing Chinese, but some stroke types and stroke constructions are unique to Tangut
  • Higher proportion of diagonal and oblique strokes than in Chinese
  • No closed elements (i.e. no box elements like Chinese 口 and 囗)

Chrysographic Edition of the Lotus Sutra

Source: 中国少数民族文字字符总集


Fragment of a Memorial Stone from the Western Xia Royal Tombs

Source: 大夏寻踪——西夏文物特展 (Vanished Exhibition on Western Xia artefacts at the National Museum of China)

[Can you spot the characters meaning "one" and "three" ?]



1.5 Stroke Complexity

Tangut is renowned as being very complex in terms of the structure of its individual characters, but I wanted to try to determine exactly how complex Tangut is, and how it compares with Chinese, Khitan and Jurchen, so I produced the following graphs to show the distribution of characters by stroke count in these various scripts.


Distribution of Tangut Characters by Stroke Count

Data derived from Proposal for a revised Tangut character set for encoding in the SMP of the UCS (SC2/WG2/N3577) Appendix A.


Distribution of Traditional CJK Characters by Stroke Count

Data derived from the kTotalStrokes field of the Unihan Database for those characters defined in Unicode 1.0 (i.e. U+4E00 through U+9FA5), excluding simplified characters (mostly those characters with a kTraditionalVariant field).


Distribution of Simplified CJK Characters by Stroke Count

Data derived from the kTotalStrokes field of the Unihan Database for those characters defined in Unicode 1.0 (i.e. U+4E00 through U+9FA5) that have the kXHC1983 field but do not have the kSimplifiedVariant field (i.e. most simplified characters in the 1983 edition of Xiàndài Hànyǔ Cídiǎn 现代汉语词典).
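The two CJK selections described above can be sketched in Python as follows. This is a minimal illustration, not the script actually used for the graphs: it assumes Unihan-style input lines of the form code point, tab, field name, tab, value, and the function names and the tiny SAMPLE list (including its placeholder kXHC1983 values) are invented for the example. Some Unihan releases list more than one kTotalStrokes value per character; the sketch simply takes the first.

```python
from collections import Counter, defaultdict

FIRST, LAST = 0x4E00, 0x9FA5  # the Unicode 1.0 CJK range named in the text


def load_fields(lines):
    """Collect {code point: {field: value}} from Unihan-style lines."""
    fields = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        code, field, value = line.split("\t", 2)
        fields[int(code[2:], 16)][field] = value  # "U+4E00" -> 0x4E00
    return fields


def stroke_histograms(lines):
    """Return (traditional, simplified) Counters mapping strokes -> characters."""
    traditional, simplified = Counter(), Counter()
    for cp, f in load_fields(lines).items():
        if not (FIRST <= cp <= LAST) or "kTotalStrokes" not in f:
            continue
        strokes = int(f["kTotalStrokes"].split()[0])  # first value only
        # Traditional set: everything without a kTraditionalVariant field.
        if "kTraditionalVariant" not in f:
            traditional[strokes] += 1
        # Simplified set: has kXHC1983 but no kSimplifiedVariant field.
        if "kXHC1983" in f and "kSimplifiedVariant" not in f:
            simplified[strokes] += 1
    return traditional, simplified


# Hypothetical sample; the kXHC1983 values are placeholders, not real entries.
SAMPLE = [
    "U+4E00\tkTotalStrokes\t1",
    "U+4E00\tkXHC1983\tPLACEHOLDER",
    "U+99AC\tkTotalStrokes\t10",
    "U+99AC\tkSimplifiedVariant\tU+9A6C",
    "U+9A6C\tkTotalStrokes\t3",
    "U+9A6C\tkTraditionalVariant\tU+99AC",
    "U+9A6C\tkXHC1983\tPLACEHOLDER",
]

traditional, simplified = stroke_histograms(SAMPLE)
print(dict(traditional), dict(simplified))
```

On the sample, 一 lands in both histograms, 馬 only in the traditional one, and 马 only in the simplified one.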


Distribution of Large Khitan Characters by Stroke Count

Data derived from the transcription of a Khitan memorial stone given in Mínzú Yǔwén 民族语文 2005 no.4 page 54 and page 55.


Distribution of Jurchen Characters by Stroke Count

Data derived from Jin Qizong 金啓孮, Nüzhenwen Cidian 女真文辞典 [Dictionary of Jurchen Characters] (Beijing: Wenwu Chubanshe, 1984).


Stroke Count Data for Traditional CJK, Simplified CJK, Tangut, Jurchen and Khitan

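| Strokes | CJK Traditional | CJK Simplified | Tangut | Jurchen | Khitan |
| 1 | 10 | 2 | 0 | 3 | 0 |
| 2 | 37 | 22 | 0 | 6 | 6 |
| 3 | 80 | 60 | 0 | 25 | 28 |
| 4 | 157 | 143 | 3 | 165 | 52 |
| 5 | 240 | 215 | 32 | 287 | 60 |
| 6 | 386 | 351 | 65 | 401 | 41 |
| 7 | 664 | 568 | 160 | 293 | 34 |
| 8 | 957 | 759 | 310 | 147 | 18 |
| 9 | 1,125 | 851 | 524 | 37 | 10 |
| 10 | 1,369 | 923 | 773 | 13 | 4 |
| 11 | 1,555 | 901 | 847 | 0 | 2 |
| 12 | 1,636 | 870 | 885 | 0 | 0 |
| 13 | 1,546 | 761 | 782 | 0 | 0 |
| 14 | 1,446 | 594 | 640 | 0 | 0 |
| 15 | 1,502 | 534 | 473 | 0 | 0 |
| 16 | 1,251 | 409 | 336 | 0 | 0 |
| 17 | 1,020 | 311 | 173 | 0 | 0 |
| 18 | 793 | 175 | 106 | 0 | 0 |
| 19 | 716 | 168 | 60 | 0 | 0 |
| 20 | 519 | 105 | 29 | 0 | 0 |
| 21 | 394 | 79 | 15 | 0 | 0 |
| 22 | 304 | 47 | 6 | 0 | 0 |
| 23 | 240 | 40 | 1 | 0 | 0 |
| 24 | 149 | 21 | 1 | 0 | 0 |
| 25 | 107 | 22 | 0 | 0 | 0 |
| 26 | 54 | 6 | 0 | 0 | 0 |
| 27 | 52 | 1 | 0 | 0 | 0 |
| 28 | 26 | 1 | 0 | 0 | 0 |
| 29 | 13 | 1 | 0 | 0 | 0 |
| 30 | 8 | 0 | 0 | 0 | 0 |
| 31 | 5 | 0 | 0 | 0 | 0 |
| 32 | 3 | 1 | 0 | 0 | 0 |
| 33 | 4 | 1 | 0 | 0 | 0 |
| 34 | 0 | 0 | 0 | 0 | 0 |
| 35 | 1 | 0 | 0 | 0 | 0 |
| 36 | 1 | 1 | 0 | 0 | 0 |
| 37 | 0 | 0 | 0 | 0 | 0 |
| 38 | 0 | 0 | 0 | 0 | 0 |
| 39 | 1 | 0 | 0 | 0 | 0 |
| 40 | 0 | 0 | 0 | 0 | 0 |
| 41 | 0 | 0 | 0 | 0 | 0 |
| 42 | 0 | 0 | 0 | 0 | 0 |
| 43 | 0 | 0 | 0 | 0 | 0 |
| 44 | 0 | 0 | 0 | 0 | 0 |
| 45 | 0 | 0 | 0 | 0 | 0 |
| 46 | 0 | 0 | 0 | 0 | 0 |
| 47 | 0 | 0 | 0 | 0 | 0 |
| 48 | 1 | 0 | 0 | 0 | 0 |
| Total | 18,373 | 8,943 | 6,221 | 1,377 | 255 |
| Mean | 13.46 | 11.49 | 12.09 | 6.01 | 5.43 |
| Mode | 12 | 10 | 12 | 6 | 5 |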
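
The summary rows can be recomputed directly from the per-stroke counts. The sketch below does this for the two shortest columns, Jurchen and Khitan, using the figures from the table above; the function name summarise is just an illustrative choice.

```python
# Jurchen and Khitan columns from the table (stroke count -> number of characters).
# Rows with a count of zero are omitted.
JURCHEN = {1: 3, 2: 6, 3: 25, 4: 165, 5: 287, 6: 401, 7: 293, 8: 147, 9: 37, 10: 13}
KHITAN = {2: 6, 3: 28, 4: 52, 5: 60, 6: 41, 7: 34, 8: 18, 9: 10, 10: 4, 11: 2}


def summarise(hist):
    """Return (total characters, mean stroke count, modal stroke count)."""
    total = sum(hist.values())
    # Weighted mean: each stroke count is weighted by how many characters have it.
    mean = sum(strokes * count for strokes, count in hist.items()) / total
    # Mode: the stroke count shared by the largest number of characters.
    mode = max(hist, key=hist.get)
    return total, round(mean, 2), mode


print(summarise(JURCHEN))  # (1377, 6.01, 6)
print(summarise(KHITAN))   # (255, 5.43, 5)
```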

Comparison of CJK, Tangut, Jurchen and Khitan Stroke Counts


Jurchen and Large Khitan are the two scripts that appear to be most similar to Chinese, yet actually they are the most different when it comes to stroke count, both having on average fewer than half as many strokes as traditional CJK characters (means of 6.01 and 5.43 against 13.46). This difference is probably due to the fact that Large Khitan and Jurchen characters do not have any high stroke count radicals such as 言 "speech" (7 strokes), 金 "gold" (8 strokes), 馬 "horse" (10 strokes) and 鳥 "bird" (11 strokes) that are very common in Chinese characters.

On the other hand, it was a surprise (to me at least) to see how closely the contour of Tangut matches that of traditional Chinese, as I had always assumed that Tangut characters must, on average, be much more complex than Chinese characters. But although Tangut does not have any characters with very few strokes (fewer than 4 strokes) or very many strokes (more than 24 strokes), which distinguishes it from Chinese, if you ignore the lower and upper ends of the graph the distribution of stroke counts for Tangut is very close to that of traditional Chinese. Why then does Tangut text look so much more complex and more crowded than Chinese? That could be answered with another graph that takes into account each character's frequency of occurrence. A large proportion of high frequency Chinese characters have very few strokes (e.g. 一二三人女山火水大小中), and conversely Chinese characters with very many strokes tend to occur less frequently, with the result that normal Chinese text always has a large proportion of characters with few strokes. In contrast to the situation with Chinese, there does not appear to be any relationship between frequency and stroke count for Tangut characters, so that normal Tangut text is uniformly composed of characters with 12±6 strokes, with the result that it appears denser and more crowded than Chinese.
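
The difference between counting each character once and counting it every time it occurs can be illustrated with a small sketch. The stroke counts are the standard ones for these Chinese characters, but the sample string is made up for the example and is far too short to say anything about real text; the function names are likewise hypothetical.

```python
from collections import Counter

# Stroke counts for a handful of Chinese characters.
STROKES = {"一": 1, "人": 2, "山": 3, "水": 4, "學": 16}


def type_mean(text, strokes):
    """Mean stroke count over distinct characters (each counted once)."""
    kinds = {ch for ch in text if ch in strokes}
    return sum(strokes[ch] for ch in kinds) / len(kinds)


def token_mean(text, strokes):
    """Mean stroke count over running text (weighted by frequency)."""
    freq = Counter(ch for ch in text if ch in strokes)
    return sum(strokes[ch] * n for ch, n in freq.items()) / sum(freq.values())


SAMPLE_TEXT = "一人一山一水學"  # invented string in which 一 occurs three times

print(type_mean(SAMPLE_TEXT, STROKES))   # 5.2
print(token_mean(SAMPLE_TEXT, STROKES))  # 4.0
```

Because the low-stroke character 一 is repeated, the frequency-weighted mean falls below the per-character mean. If frequency and stroke count were unrelated, as appears to be the case for Tangut, the two figures would be expected to stay close to each other.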



1.6 Structure of Tangut Characters

  • Individual Tangut characters not obviously derived directly from Chinese or Khitan characters
  • Limited set of component elements
  • Elements are themselves built from simpler elements by the addition of 1 or 2 strokes
  • Most characters constructed from 2 or 3 component elements
  • Very few basic elements are also characters in their own right

Series of components are constructed from a basic element, on the one hand by the addition of strokes to the basic element to make other simple components (vertical progression in the diagrams below), and on the other hand by combining these simple components with other components to make complex components (horizontal progression in the diagrams below).


Series of Tangut Components (Example A)


Series of Tangut Components (Example B)


Due to this incremental process many character components are very similar to each other, and when two or three such similar components (coloured red in the diagram below) are combined together in different combinations to make different characters (coloured blue in the diagram below), the results are confusingly confusable.


Eleven Characters composed from different combinations of Five Components



1.7 Tangut Radicals

  • Not true radicals (determinatives)
  • But simply aids to character lookup
  • Chinese dictionaries select leftmost or topmost character element as the radical
  • Most Russian dictionaries base the radical on the character element at the bottom right corner of the character

In the example below, the same radical is used in both Li Fanwen's dictionary and Kychanov's dictionary, but in the former it is a lefthand radical, and in the latter it is a bottom right radical. This shows how most horizontally aligned components can occur equally on the left side or on the right side of a character, and it is largely an arbitrary decision of dictionary compilers as to whether it is treated as a lefthand side radical or a righthand side radical.

Li Fanwen 2008 Kychanov 2006

The proposed Unicode character ordering is based on 527 left-based radicals (including some top, bottom and enclosing radicals where there is no lefthand component). The advantage of this system of ordering is that it is consistent and allows for deterministic lookup of characters, but the disadvantage is that there are some high stroke-count radicals with very few members.

N3577 Appendix A



1.8 Structural Analysis

  • Because Tangut characters are composed of a limited set of component elements arranged in different configurations they are very amenable to structural analysis
  • Nishida’s 1966 dictionary gives structural analysis of each character

Table of Tangut Component Configurations identified by Nishida

Source: Nishida Tatsuo 西田龍雄, Seikago no kenkyū 西夏語の研究 (1964) page 246


Entry in Nishida's 1966 Tangut Dictionary

Source: Nishida Tatsuo 西田龍雄, Seikabun Shōjiten 西夏文小字典 (1966) no. 10-103


The Unicode proposal gives an Ideographic Description Sequence (IDS) for each proposed character. This borrows a character description syntax designed for CJK characters (but which will no longer be restricted to CJK characters from Unicode 6.0).
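
As a rough illustration of how such a sequence can be processed, the sketch below parses an IDS into a nested structure and lists its leaf components. It covers only the twelve Ideographic Description Characters U+2FF0 through U+2FFB, of which ⿲ and ⿳ take three operands and the rest take two. The function names are hypothetical, and the examples use CJK characters rather than the sequences given in the proposal.

```python
# Ideographic Description Characters U+2FF0..U+2FFB and their operand counts.
ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
ARITY["\u2FF2"] = 3  # ⿲ left / middle / right
ARITY["\u2FF3"] = 3  # ⿳ top / middle / bottom


def parse_ids(ids):
    """Parse an IDS string into a component, or (operator, [operands])."""

    def parse(pos):
        if pos >= len(ids):
            raise ValueError("sequence ends before all operands are supplied")
        ch = ids[pos]
        if ch not in ARITY:
            return ch, pos + 1  # a plain component
        parts = []
        pos += 1
        for _ in range(ARITY[ch]):
            part, pos = parse(pos)  # operands may themselves be sequences
            parts.append(part)
        return (ch, parts), pos

    tree, end = parse(0)
    if end != len(ids):
        raise ValueError("trailing characters after a complete sequence")
    return tree


def components(tree):
    """Flatten a parsed IDS into its leaf components, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for part in tree[1] for leaf in components(part)]


print(parse_ids("⿰女子"))                  # ('⿰', ['女', '子'])
print(components(parse_ids("⿱艹⿰日月")))  # ['艹', '日', '月']
```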


6 comments:

David Boxenhorn said...

I love Ideographic Description Sequence! I want to order by that, and use regular expressions to search it. Can we do it already in your database?

David Boxenhorn said...

"Jurchen and Large Khitan are the two scripts that appear to be most similar to Chinese, yet actually they are the most different when it comes to stroke count"

That's not how I see the graphs. What is most different between the scripts is the number of characters (the area under the graph). If you look at the left side, they ascend in a very similar manner, and then taper off as the need for characters is exhausted. Looked at this way, Tangut is clearly the exceptional script - it is right-shifted compared to the others.

You can look at this in terms of information density. The total number of possible characters for each stroke number is an exponential curve. For low values, the non-Tangut scripts are very information dense - they closely approximate the total number of possible characters. The reason that the right side of the curve drops off slowly (rather than using up the total possible number of characters and then dropping off suddenly) is internal morphology. What the right-shifted Tangut curve illustrates is the high degree of internal morphology of Tangut characters.

David Boxenhorn said...

Another way to look at it: I think IDS would work very well as an input method for Tangut. It wouldn't work as well for Chinese because there are too many basic elements.

Andrew West said...

David,

You're quite right that the problem with an IDS-based input method for Chinese is that there are too many basic elements, making it somewhat impractical to display all the possible elements. In addition there are quite a few elements that are not encoded as characters (for example, the lefthand side of 师, the righthand side of 铅, the righthand side of 拣, the top of 览). On the other hand, Tangut has a relatively small number of basic elements, all of which will be encoded, so an IDS-based input method would be more practical, and is something I want to work on (next year maybe).

YH said...
This comment has been removed by the author.
Andrew West said...

"But the Ideographic Description Characters makes me afraid: will people code Tangut script in a dynamic-composition way? That sounds terrible! I hope they use precomposed Tangut like Chinese: ... And now someone want to code Tangut in that way ..."

You misunderstand. I use Ideographic Description Characters to help analyse the structure of Tangut characters, and in the future to create an input method for Tangut (the user can enter the component elements of a Tangut character, but the output is a single Tangut character). But no-one is proposing to encode Tangut as decomposed character components that need to be dynamically composed by the user. We are proposing to:

A) Encode 6,221 individual Tangut characters. This includes all characters found in modern dictionaries, including variant characters.
B) Encode a set of Tangut radicals and components for use in dictionary indexing and discussion of Tangut character structure by scholars.

I hope this clarifies the situation.