Corpus Hanku

The Hanku is a monolingual, synchronic corpus of Chinese (in simplified Chinese characters) accessible through a web interface on the website of the Confucius Institute at Comenius University in Bratislava: <konfuciovinstitut.sk>. Construction began in spring 2016 and was supported by the Chinese National Office for Teaching Chinese as a Foreign Language in 2016. We would like to thank Mr. Vladimír Benko (Slovak Academy of Sciences) for providing us with Chinese language data.

The corpus can be used free of charge, but registration is required. To register, please send us the following information by e-mail to lubos.gajdos(at)uniba.sk:

  • Your Name
  • Your Academic Affiliation
  • Your E-mail

By registering you agree to use the Hanku corpus and its resources solely for study, research, teaching and other non-commercial purposes.

The Hanku uses an open-source version of the Sketch Engine corpus manager (NoSketch Engine) as well as open-source tools for tokenization (ZPar) and POS tagging (based on the Penn Chinese Treebank). The corpus reached a size of 800 million tokens in June 2016 and is equipped with bibliographic, POS, style and genre, and phonetic annotation. Syntactic annotation is in preparation (autumn 2016). So far, the Hanku corpus carries the following style and genre annotation: (1) baokan (journalistic texts from the PRC), (2) falv (legal texts from the PRC; texts of laws and regulations), (3) none (texts from the Internet). Texts from other registers will follow (e.g. professional texts, texts of Modern Chinese literature, etc.).

Structure of the Corpus

The logical structure of the corpus (as presented to the end user) is based on documents. Typically, one document corresponds to one webpage (referenced by a URL), a newspaper article, a book, etc. The set of documents forms the corpus directly; there is no higher level in the hierarchy. Lower levels of the hierarchy comprise text structures (paragraphs, sentences) and tokens with their positional attributes.

The basic building block of the corpus is a token: one single position in the text. Traditionally in corpus linguistics, one token represents one word in the source text, together with additional information such as lemma, part of speech or syntactic function. For Chinese, there are two possibilities: to tokenize a text into characters (Hanzi) or into words. Even though the division of text into words is often fuzzy and subject to individual interpretation, it was decided to tokenize texts into words. In the Hanku, each token is annotated with its part of speech (POS), its composition into characters and its Hanyu pinyin transcription. The POS annotation and the tokenization are the results of automatic processing.
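To make the document, sentence and token hierarchy more concrete, the following minimal sketch parses a small sample in the one-token-per-line "vertical" text format commonly used as input to NoSketch Engine. The sample data, the `Token` class, the `parse_vertical` function and the column order (word, POS tag, pinyin) are assumptions made for illustration only; the actual attribute names and layout of the Hanku may differ.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """One corpus position; the attribute set here is hypothetical."""
    word: str
    pos: str      # POS tag, e.g. NN, VV, PN, PU
    pinyin: str   # Hanyu pinyin transcription

    @property
    def characters(self):
        # Composition of the word into single characters (Hanzi).
        return list(self.word)


# Invented sample: structure tags on their own lines, one token per line,
# attributes separated by tabs (assumed order: word, POS, pinyin).
SAMPLE = """<doc url="http://example.org/1" genre="baokan">
<p>
<s>
我\tPN\twǒ
学习\tVV\txuéxí
汉语\tNN\thànyǔ
。\tPU\t。
</s>
</p>
</doc>
"""


def parse_vertical(text):
    """Return a list of sentences, each a list of Token objects."""
    sentences = []
    current = None
    for line in text.splitlines():
        if not line:
            continue
        if line == "<s>":
            current = []                 # a new sentence starts
        elif line == "</s>":
            sentences.append(current)    # the sentence is complete
            current = None
        elif line.startswith("<"):
            continue                     # other structure tags (doc, p) are skipped
        elif current is not None:
            word, pos, pinyin = line.split("\t")
            current.append(Token(word, pos, pinyin))
    return sentences


sentences = parse_vertical(SAMPLE)
for token in sentences[0]:
    print(token.word, token.pos, token.pinyin, token.characters)
```

The sketch deliberately ignores document and paragraph attributes and would misread a token that itself begins with `<`; a real loader would need to handle both.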

Usage

The corpus may be used in linguistic research and language teaching. Common usage scenarios for the Hanku in language teaching are as follows:

  • basic word usage – KWIC
  • collocation preferences of a word
  • sentence pattern search
  • register-specific usage of a word
  • register-specific preferences among synonyms, etc.

The corpus system (under the query type “lema”) allows a user to search for Chinese words or characters by typing them in Hanyu pinyin, with or without the tones.
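What searching "with or without the tones" means can be illustrated with a short sketch. It assumes that the pinyin annotation uses tone diacritics (for example hànyǔ) and that a query without diacritics should match any tone; `strip_tones` and `pinyin_matches` are hypothetical helpers written for this illustration and do not describe how the Hanku interface is actually implemented.

```python
import unicodedata

# Combining marks used for the four pinyin tones:
# macron (1st), acute (2nd), caron (3rd), grave (4th).
TONE_MARKS = {"\u0304", "\u0301", "\u030c", "\u0300"}


def strip_tones(pinyin):
    """Remove tone diacritics but keep the diaeresis of ü."""
    decomposed = unicodedata.normalize("NFD", pinyin)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


def pinyin_matches(query, annotated):
    """Match exactly if the query carries tones, otherwise ignore tones."""
    query = unicodedata.normalize("NFC", query.lower())
    annotated = unicodedata.normalize("NFC", annotated.lower())
    if strip_tones(query) != query:
        # The query contains tone marks, so the tones must agree.
        return query == annotated
    # Toneless query: compare against the toneless form of the annotation.
    return query == strip_tones(annotated)


print(strip_tones("hànyǔ"))
print(pinyin_matches("hanyu", "hànyǔ"))
print(pinyin_matches("hànyú", "hànyǔ"))
```

Under these assumptions a toneless query is the broader search, while a query with tone marks narrows the result to one specific reading.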

KWIC and collocation searches are basic tasks that any ordinary corpus user is familiar with. Using regular expressions, e.g. to search for a sentence pattern, might be regarded as advanced usage. Results may be saved directly from the interface as txt or XML files.
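As an illustration of a sentence pattern search, the sketch below assembles a query string in CQL, the corpus query language of Sketch Engine and NoSketch Engine, for a simple approximation of the 把 construction: the word 把, followed by up to five arbitrary tokens, followed by a verb, all inside one sentence. The attribute names `word` and `tag`, the structure name `s`, the tag value `VV` and the helper functions are assumptions for this example and should be checked against the attributes actually offered by the Hanku interface.

```python
def cql_token(**attrs):
    """Build one CQL token expression, e.g. [word="把" & tag="P"]."""
    if not attrs:
        return "[]"  # matches any single token
    conditions = " & ".join(f'{name}="{value}"' for name, value in attrs.items())
    return f"[{conditions}]"


def cql_gap(min_len, max_len):
    """Build a gap of min_len to max_len arbitrary tokens, e.g. []{0,5}."""
    return f"[]{{{min_len},{max_len}}}"


def ba_construction(max_gap=5):
    """Hypothetical pattern: 把 ... verb, restricted to a single sentence."""
    parts = [cql_token(word="把"), cql_gap(0, max_gap), cql_token(tag="VV")]
    return " ".join(parts) + " within <s/>"


print(ba_construction())
# prints: [word="把"] []{0,5} [tag="VV"] within <s/>
```

The resulting string would be pasted into the CQL query field of the interface. The helpers do not escape quotation marks or regular-expression metacharacters in attribute values, so they are suitable only for simple literal values.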

If you use this corpus in your research, please refer to:

Gajdoš, Ľ., Garabík, R., Benická, J. The New Chinese Webcorpus Hanku – Origin, Parameters, Usage. In Studia Orientalia Slovaca, Vol. 15, No. 1 (2016), pp. 21–33.

If you have any questions, please feel free to contact lubos.gajdos(at)uniba.sk.