Download a copy of the CQPweb tutorial (in Chinese) here.
The corpora developed by BFSU CRG members can be accessed at the BFSU CQPweb corpus portal with user ID 'test' and password 'test'.
- China English Corpus: 40 million words, developed under the leadership of Wenzhong Li.
- China Pear Narrative Interlanguage Corpus: a corpus of spoken and written English/Chinese narrative discourse produced by EFL college students in response to the film 'The Pear Stories', used as a video prompt. The corpus was designed and developed by Jiajin Xu, and will be made publicly available soon.
- CLOB corpus: A Brown family British English corpus of one million words (published largely in 2009), developed under the leadership of Jiajin Xu and Maocheng Liang. An article describing the corpus was published in the 2013 issue of ICAME Journal. Download CLOB (18.2MB). Publications based on the Crown and CLOB corpora can be found here. A detailed description of the CLOB corpus is available in the CoRD corpus resource database of Helsinki University.
- Crown corpus: A Brown family American English corpus of one million words (published largely in 2009), developed under the leadership of Jiajin Xu and Maocheng Liang. An article describing the corpus was published in the 2013 issue of ICAME Journal. Download Crown (18.2MB). Publications based on the Crown and CLOB corpora can be found here. A detailed description of the Crown corpus is available in the CoRD corpus resource database of Helsinki University.
- GPEC: English and Chinese bidirectional parallel corpus, developed under the leadership of Kefei Wang.
- PACCEL (Parallel Corpus of Chinese EFL Learners), created by Qiufang Wen and Jinquan Wang, and published on CDs by Foreign Language Teaching and Research Press.
- PATTIE (Preschoolers- and Teenagers-oriented Texts in English) corpus, compiled by Dr. Jie Ji. The construction of the corpus was completed in late 2014. PATTIE will be available soon via BFSU CQPweb. Download the PATTIE wordlist created by PowerConc, and the PATTIE wordlist created by AntConc. The corpus can be downloaded here for personal research only.
- TED English Chinese parallel corpus of speeches: 6,187,849 English words and Chinese characters, collated by Jiajin Xu based on the Web Inventory of Transcribed and Translated Talks.
- ToRCH2009 corpus: Texts of Recent CHinese corpus 2009 (A Brown family Chinese corpus of one million words) developed under the leadership of Jiajin Xu. Download ToRCH2009 here (4.6MB).
- SCOUT: Spoken Chinese Of Urban Teenagers (created by Jiajin Xu around 2004)
- SWECCL (Spoken and Written English Corpus of College Learners): developed under the leadership of Qiufang Wen, Lifei Wang, Maocheng Liang and Xiaoqin Yan, and published on CDs by Foreign Language Teaching and Research Press.
- MedAca (Medical English discourse of Academia) corpus -- The clinical medicine component contains one million words of medical English texts (from 18 different subject areas of clinical medicine), compiled in 2015. The building of the corpus was proposed by Jiajin Xu, and the text gathering was undertaken by a group of English teachers (namely, Feng Xin, Qi Hui, Wu Jingjing, Ye He, Wan Ling, and You Sheng) at the School of Foreign Languages, Fujian Medical University. Download the MedAca word list here.
- The TECCL corpus V1.1 (Ten-thousand English Compositions of Chinese Learners, Version 1.1, 中国学生万篇英语作文语料库) is a corpus of 1,817,335 words written by Chinese EFL learners at different levels of schooling and from almost all over China, covering a great variety of writing prompts. The essays were produced between 2010 and 2015, some in class and others at home. The texts of the TECCL corpus can be downloaded from here, and concordanced online at http://188.8.131.52/cqp/. A Stanford Parser version of the TECCL treebank is available for download here. (The accuracy of the parsed version has not been checked. Some people warn against using parsers to analyse interlanguage, especially underachievers' English texts; others have found that novice writers tend to use simple syntax, and that parsers therefore work well with learner English texts.)
- TIME Magazine Corpus (1923-2008) of about 196 million words, twice the size of the BYU Time magazine corpus (by Mark Davies). The text collection was obtained a few years ago and mounted on BFSU CQPweb in late 2015.
- The Independent Corpus comprises texts from The Independent, a British national morning newspaper, published between 2009 and 2015. The corpus size is about 231 million words.
- ToRCH2014, an update of ToRCH2009, is under construction.
- NESSIE Corpus 1st release (NESSIEv1, Native English Speakers Similarly or Identically-prompted Essays). Download the corpus here.
- The DEAP (Database of English for Academic Purposes) Corpus (under construction) aims to collect texts of 100 million words covering 20 or more disciplines.
Corpora or text collections prepared by people outside the BFSU CRG:
- 85 translations of "Tao Te Ching", "Laozi", "Dao De Jing".
- Download the BNC XML edition here.