Zhi felt greatly encouraged. His solitary work was running parallel to these larger efforts. Most of them, though, still had not been able to free themselves from clunky keyboards. While breaking down characters into components had worked well enough for specific character retrieval indexes and typewriter keyboard designs, it did not translate directly into programming such a process for a computing machine.

Zhi remembered the advantage of the shape-based approach, where character parts helped to identify the whole character directly. To integrate that useful principle into his encoding scheme, Zhi decided to index characters by their components—the simpler characters within each ideograph—using the first letter of each component’s pinyin spelling.

The idea took another two years to flesh out. On average, characters can be broken into two to four components, and there are 300 to 400 components in total. The majority of characters can be divided into two halves—vertical or horizontal—along with other possible geometries. This yielded a two-to-four-letter alphabetic code for each character, which meant each character required at most four keystrokes on a conventional English keyboard. The average English word length, by comparison, is close to 4.8 letters. Zhi thus made the alphabet work more efficiently for individual ideographs than it did for English. The system also cleverly worked around the problem of dialect difference and homophones. Because the code took only the first letter, rather than the complete sound of the character, most regional speech variations did not matter. The four-letter code worked like an acronym of the different parts of the character. Zhi essentially used the alphabet as a proxy to spell by components rather than words.

He sequenced each character’s components in the order they would have been written by hand. Coding by components gave context and important cues that reduced ambiguity and the risk of duplicated codes. The chances of having the same components—or even components starting with the same letter—occur in the exact same order in two different characters are low.

Zhi’s way of indexing the Chinese character by its alphabetized components made it easier for humans to input Chinese, as long as they knew how to write the language, and created a more systematic human-machine interface. For instance, in his system, the character for “road,” 路 (lu), which takes 13 strokes to write by hand, can be broken up into a mere four components: 口 (kou), 止 (zhi), 攵 (pu), and 口 (kou). Isolating the first letter of each component gives the character code KZPK. Or take the character 吴 (wu), a common last name, which can be quickly decomposed into two parts, 口 (kou) and 天 (tian), yielding the character code KT.
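The component-code idea can be illustrated with a minimal Python sketch. It is not a reconstruction of Zhi's actual system: the `DECOMPOSITIONS` table and the `component_code` function are hypothetical names introduced here, and the table holds only the two characters decomposed in the text above.

```python
# Hypothetical lookup table (an assumption for illustration): each character
# maps to the pinyin of its components, listed in handwriting order.
# Only the two examples given in the text are included.
DECOMPOSITIONS = {
    "路": ["kou", "zhi", "pu", "kou"],
    "吴": ["kou", "tian"],
}


def component_code(character: str) -> str:
    """Return the code built from the first letter of each component's pinyin."""
    components = DECOMPOSITIONS[character]
    return "".join(pinyin[0].upper() for pinyin in components)


print(component_code("路"))  # KZPK
print(component_code("吴"))  # KT
```

Because only the first letter of each component is kept, the code never exceeds the number of components, which is what bounds it at four keystrokes for a four-part character.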

Alphabetic spelling, once mediated by Chinese in this way, is no longer a phonetic but a semantic spelling system, where each letter actually stands for a character rather than a sound. This method of indexing can also be extended to represent groups of characters. Take, for instance, “socialism,” or shehui zhuyi: 社会主义. By tagging the first letter of each of the four characters in the phrase, the phrase can be coded in a four-letter sequence, SHZY. Or consider another frequently invoked phrase, the seven characters that make up “People’s Republic of China”—Zhonghua renmin gongheguo: 中华人民共和国. It can simply be typed in as ZHRMGHG.
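The phrase-level abbreviation works the same way one level up, with each letter standing for a whole character. The sketch below assumes a small hypothetical `PINYIN` table covering only the characters in the two phrases quoted above; `phrase_code` is likewise a name invented for this illustration.

```python
# Hypothetical pinyin table (an assumption for illustration), limited to the
# characters that appear in the two example phrases from the text.
PINYIN = {
    "社": "she", "会": "hui", "主": "zhu", "义": "yi",
    "中": "zhong", "华": "hua", "人": "ren", "民": "min",
    "共": "gong", "和": "he", "国": "guo",
}


def phrase_code(phrase: str) -> str:
    """Return the code built from the first pinyin letter of each character."""
    return "".join(PINYIN[character][0].upper() for character in phrase)


print(phrase_code("社会主义"))        # SHZY
print(phrase_code("中华人民共和国"))  # ZHRMGHG
```

A real input method would need a far larger table and a way to choose among phrases that share the same initials; neither is attempted here.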