Corpora

Download a copy of the CQPweb tutorial (in Chinese) here.

The corpora developed by BFSU CRG members can be accessed at the BFSU CQPweb corpus portal with user ID 'test' and password 'test'.

  1. China English Corpus: 40 million words, developed under the leadership of Wenzhong Li.
  2. China Pear Narrative Interlanguage Corpus: a corpus of spoken and written English/Chinese narrative discourse produced by EFL college students in response to the film 'The Pear Stories', used as a video prompt. The corpus was designed and developed by Jiajin Xu, and will be made publicly available soon.
  3. CLOB corpus: A Brown family British English corpus of one million words published largely in 2009, developed under the leadership of Jiajin Xu and Maocheng Liang. An article describing the corpus was published in the 2013 issue of ICAME Journal. Download CLOB (18.2MB). Publications based on the Crown and CLOB corpora can be found here. A detailed description of the CLOB corpus is available at the CoRD corpus resource database of the University of Helsinki.
  4. Crown corpus: A Brown family American English corpus of one million words published largely in 2009, developed under the leadership of Jiajin Xu and Maocheng Liang. An article describing the corpus was published in the 2013 issue of ICAME Journal. Download Crown (18.2MB). Publications based on the Crown and CLOB corpora can be found here. A detailed description of the Crown corpus is available at the CoRD corpus resource database of the University of Helsinki.
  5. GPEC: English and Chinese bidirectional parallel corpus, developed under the leadership of Kefei Wang.
  6. PACCEL (Parallel Corpus of Chinese EFL Learners), created by Qiufang Wen and Jinquan Wang, and published on CDs by Foreign Language Teaching and Research Press.
  7. PATTIE (Preschoolers- and Teenagers-oriented Texts in English) corpus, compiled by Dr. Jie Ji. The construction of the corpus was completed in late 2014. PATTIE will be available soon via BFSU CQPweb. Download the PATTIE wordlist created by PowerConc, and the PATTIE wordlist created by AntConc. The corpus can be downloaded here for personal research only.
  8. TED English Chinese parallel corpus of speeches: 6,187,849 English words and Chinese characters, collated by Jiajin Xu based on Web Inventory of Transcribed and Translated Talks. The corpus can be downloaded here.
  9. ToRCH2009 Corpus (ToRCH2009现代汉语平衡语料库): Texts of Recent CHinese corpus 2009 (A Brown family Chinese corpus of one million words) developed under the leadership of Jiajin Xu. Download ToRCH2009 here.
  10. ToRCH2014 Corpus (ToRCH2014现代汉语平衡语料库): Texts of Recent CHinese corpus 2014 (A Brown family Chinese corpus of one million words) developed under the leadership of Jiajin Xu. Download ToRCH2014 here.
  11. SCOUT: Spoken Chinese Of Urban Teenagers (created by Jiajin Xu around 2004)
  12. SWECCL (Spoken and Written English Corpus of College Learners): developed under the leadership of Qiufang Wen, Lifei Wang, Maocheng Liang and Xiaoqin Yan, and published on CDs by Foreign Language Teaching and Research Press.
  13. MedAca (Medical English discourse of Academia) corpus -- Clinical medicine component (MedAca医学学术英语语料库-临床医学子库) contains five million words of medical English research articles covering 18 subject areas of clinical medicine. The building of the corpus was proposed by Jiajin Xu, and the text gathering was undertaken by a group of English teachers (namely, Feng Xin, Qi Hui, Wu Jingjing, Ye He, Wan Ling, and You Sheng) at the School of Foreign Languages, Fujian Medical University. The first release, i.e. Version 1 of the MedAca corpus, was compiled in 2015. You can download the MedAca 1.0 word list here. Version 2 of the corpus was finalised on 8 August 2017. The Version 1 data was incorporated into the new Version 2 MedAca corpus (which consists of 5,041,631 tokens and 99,765 types in 1,186 files). The MedAca V2.0 word list can be downloaded at the link. The one-million-word MedAca corpus V1.0 is now searchable online at http://111.200.194.212/cqp/.
  14. The TECCL corpus V1.1 (Ten-thousand English Compositions of Chinese Learners, Version 1.1, 中国学生万篇英语作文语料库) is a corpus of 1,817,335 words written by Chinese EFL learners at different levels of schooling and from almost all over China, covering a great variety of writing prompts. The essays were produced between 2010 and 2015, some in class and others at home. The texts of the TECCL corpus can be downloaded from here, and concordanced online at http://111.200.194.212/cqp/. A Stanford Parser version of the TECCL treebank is available for download here. (The accuracy of the parsed version has not been checked. Some people warn against using parsers to analyse interlanguage, especially the English texts of underachievers; others have found that novice writers tend to use simple syntax, and that parsers therefore work well with learner English texts.)
  15. TIME Magazine Corpus (1923-2008) of about 196 million words, which is twice the size of the BYU Time magazine corpus (by Mark Davies). The text collection was obtained a few years ago and mounted on BFSU CQPweb in late 2015.
  16. The Independent Corpus contains texts published in The Independent, a British national morning newspaper, between 2009 and 2015. The corpus size is about 231 million words.
  17. ToRCH2014, an update of ToRCH2009, is under construction.
  18. NESSIE Corpus 1st release (NESSIEv1, Native English Speakers Similarly or Identically-prompted Essays). Download the corpus here.
  19. The DEAP (Database of English for Academic Purposes) Corpus (under construction) aims to collect texts of 100 million words covering 20 or more disciplines.
  20. Conference Interpreting Corpus under construction (as of 29 March 2017).
  21. The ECCE Corpus 1.0: ECCE is pronounced as ['eki]. The ECCE (English Chinese Corpus of Editorials) corpus 1.0 was created by Linwei Yang and his MA students at Yantai University before Linwei joined the PhD programme at the National Research Centre for Foreign Language Education of Beijing Foreign Studies University. The bilingual texts of ECCE were originally extracted from The Financial Times website, and sentence-aligned by Linwei's team. ECCE 1.0 contains 238,363 English words and 424,921 Chinese characters.

 

Corpora or text collections prepared by people outside the BFSU CRG.

  1. 85 translations of "Tao Te Ching", "Laozi", "Dao De Jing".
  2. Download the BNC XML edition from here.