The TECCL Corpus

Ten-thousand English Compositions of Chinese Learners (the TECCL Corpus)

Version 1.1

2015-12-28

 

Download the TECCL corpus here.

A version of the TECCL treebank parsed with the Stanford Parser is available for download here. (The accuracy of the parsed version has not been checked. Some researchers caution against using parsers to analyse interlanguage, especially texts by underachieving learners; others have found that novice writers tend to use simple syntax, so parsers work well on learner English texts.)

 

Key information about the TECCL Corpus

Corpus name: Ten-thousand English Compositions of Chinese Learners (the TECCL corpus) (Version 1.1)

Text contributors: Xue, Xizhe (the pinyin romanisation of the Chinese word for "learner")

Project initiator: Jiajin Xu (the National Research Centre for Foreign Language Education, Beijing Foreign Studies University)

Year of corpus creation: 2015

Formats of the corpus: The TECCL corpus is available in two forms, raw texts and part-of-speech (POS) tagged texts, stored in the folders 01TECCL_RAW and 02TECCL_POS respectively. The POS-tagged texts were annotated with the CLAWS POS tagger, developed at UCREL, Lancaster University, UK, using tagset version 7 (C7; cf. http://ucrel.lancs.ac.uk/claws7tags.html).

Citation: Xue, Xizhe. 2015. Ten-thousand English Compositions of Chinese Learners (The TECCL corpus), Version 1.1. The National Research Centre for Foreign Language Education, Beijing Foreign Studies University.

 

The TECCL corpus: Its background and highlights

The TECCL corpus contains approximately 10,000 writing samples by Chinese EFL learners, totalling 1,817,472 words. (Note: we count as words all alphanumeric strings, including hyphenated strings, as matched by the regular expression [a-zA-Z0-9-]+.) Initially, 10,127 texts were sampled from an online writing and scoring system. Of these, 262 were removed by hand: blank texts, texts written in Chinese, translated English texts, and duplicated and/or plagiarised texts. As a result, the final version of the TECCL corpus consists of 9,865 texts. When submitting their texts to the online system, all contributors agreed to share them for future academic use. Further anonymisation was carried out to minimise the possibility of disclosing writers' identities. The sampling frame of the corpus was drawn up by Jiajin Xu, who also undertook all the text cleaning and POS tagging. Liangping Wu assisted with the text cleaning at the early stage of the project.
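The word definition above can be applied directly when recounting tokens. The following is a minimal sketch rather than the compilers' own script: the function names are hypothetical, and the assumption that the raw texts are UTF-8 .txt files inside 01TECCL_RAW is ours, not something stated in this document.

```python
import re
from pathlib import Path

# Word definition quoted in the corpus description: alphanumeric strings,
# including hyphenated strings.
WORD_RE = re.compile(r"[a-zA-Z0-9-]+")


def count_words(text: str) -> int:
    """Count the tokens in `text` that match the corpus's word definition."""
    return len(WORD_RE.findall(text))


def count_words_in_folder(folder: str, pattern: str = "*.txt") -> int:
    """Sum word counts over all files in `folder` (hypothetical helper).

    Assumption: files are plain text readable as UTF-8; undecodable bytes
    are replaced rather than raising an error.
    """
    total = 0
    for path in sorted(Path(folder).rglob(pattern)):
        total += count_words(path.read_text(encoding="utf-8", errors="replace"))
    return total


# Missing spaces after punctuation do not change the count, because
# commas and full stops are not part of the word pattern.
sample = "My well-known friend is 20 years old.He likes English,too."
print(count_words(sample))
```

Note that the pattern also matches a stand-alone hyphen, so counts obtained this way may differ slightly from those of tools that tokenise differently.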

The TECCL corpus stands out not for its size but for its representativeness in the following five respects.

1)      Compared with other available Chinese learner corpora, the TECCL corpus was more up to date as of 2015. The material was produced between 2011 and 2015, and the corpus was compiled to mirror Chinese EFL learners' English of that period.

2)      The corpus features a wide range of topics or prompts. A rough estimate puts the number of different essay topics at over 1,000.

3)      The writers in the corpus range from elementary school pupils to postgraduate students, with undergraduates the overwhelming majority. The ratio of so-called 985/211 to non-985/211 universities in the corpus largely corresponds to their actual proportion among Chinese universities.

4)      The geographical spread of the writers in the TECCL corpus is by far the widest of all corpora of Chinese EFL learners' English. The corpus includes texts from 32 provinces and (autonomous) regions, including Hong Kong and Taiwan.

5)      In stark contrast to other corpora of Chinese EFL learners' English, the TECCL corpus comprises both texts written under time pressure, in class or in tests, and texts written after class. The corpus even includes some collaborative writing samples. Most previous corpora of Chinese EFL learners' English consist of compositions produced in high-stakes standardised English tests, such as CET-4/6, TEM4/8 and PETS.

 

A known problem with text typography

Chinese learners often type the next word immediately after a comma or full stop, without a space. This spacing problem has not been corrected in the released version of the corpus. Fortunately, it affects neither the count of word tokens nor the part-of-speech tagging. Users of the corpus can add a space after punctuation marks if necessary.
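For users who do want to normalise the spacing, one possible approach is a regular-expression substitution. This is only a sketch under our own assumptions: the function name is hypothetical, and the rule (insert a space after a comma or full stop that is immediately followed by a letter) is one reasonable choice, not a procedure prescribed by the corpus compilers.

```python
import re

# A comma or full stop directly followed by a letter. The look-ahead is
# limited to letters so that numbers such as 3.50 or 1,000 stay intact.
MISSING_SPACE_RE = re.compile(r"([,.])(?=[A-Za-z])")


def add_space_after_punctuation(text: str) -> str:
    """Insert a single space after commas and full stops that lack one."""
    return MISSING_SPACE_RE.sub(r"\1 ", text)


before = "I like apples,pears and tea.They cost 3.50 yuan."
after = add_space_after_punctuation(before)
print(after)
```

This rule will also split abbreviations such as "e.g." and any web addresses in the texts, so it is worth inspecting a sample of the output before applying it to the whole corpus.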

 

Disclaimer

The TECCL corpus may be downloaded for personal research, but may not be used for any commercial purpose.

 

Contact

Please feel free to report any problems with the texts to bfsucrg@sina.com.

More information about the corpus is available at the official site of the Corpus Research Group, National Research Centre for Foreign Language Education, Beijing Foreign Studies University, http://www.bfsu-corpus.org.

Web-based concordancing of the TECCL corpus is available at BFSU CQPweb, http://111.200.194.212/cqp/.

 

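Documentation for "Ten-thousand English Compositions of Chinese Learners (V1.1)" (中国学生万篇英语作文语料库), translated from the Chinese version

(2015-12-28)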

 

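Basic information about the TECCL corpus

Chinese name: 中国学生万篇英语作文语料库 (V1.1)

Text contributors: 薛熙哲 (Xue, Xizhe)

Planning and compilation: 许家金 (Jiajin Xu)

Year of creation: 2015

Corpus formats: The TECCL corpus is released in two formats, raw texts and POS-tagged texts, stored in the folders 01TECCL_RAW and 02TECCL_POS respectively. POS tagging was carried out with the CLAWS tagger using the C7 tagset (see http://ucrel.lancs.ac.uk/claws7tags.html).

Citation: 薛熙哲, 2015, 中国学生万篇英语作文语料库 (V1.1) (Ten-thousand English Compositions of Chinese Learners, Version 1.1, abbreviated as The TECCL corpus).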

 

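Background and features of the TECCL corpus

The TECCL corpus contains about 10,000 compositions, totalling 1,817,335 words (note: a word is defined as [a-zA-Z0-9-]+). At the start of collection there were 10,127 compositions; after removing blank documents, documents in Chinese, translation assignments, duplicated compositions, and texts clearly beyond learner level, 9,864 remained. All texts come from an online composition marking system. All compositions included in TECCL have been authorised by their authors. Further anonymisation was carried out when the corpus was built. The text collection scheme was drawn up by Jiajin Xu, who also completed the subsequent cleaning, processing and annotation, with the assistance of Liangping Wu.

The TECCL corpus is not large, but its sampling is well distributed and representative. Its features can be summarised as follows:

1) Recent material. All texts were produced between 2011 and 2015.

2) Many topics. By a rough count, the TECCL corpus involves more than 1,000 different composition topics.

3) Wide range of school levels. The compositions cover three levels, university, secondary school and primary school, with university texts the most numerous. The proportion of 985/211 and non-985/211 universities included is close to the actual make-up of China's universities.

4) Wide geographical coverage. The texts come from 32 provinces, municipalities, autonomous regions and special administrative regions, including Hong Kong and Taiwan.

5) Varied tasks. The writing task types include timed in-class compositions, after-class homework, mid-term and final examination compositions, scripts prepared for classroom speeches, and collaborative group compositions. These are academic tasks within the English curriculum rather than compositions from high-stakes standardised tests. In this respect the TECCL corpus clearly differs from the English learner corpora previously built in China (e.g. the corpus of CET-4/6 compositions and oral tests, the corpus of TEM-4/8 compositions and oral tests for English majors, and the corpus of the Public English Test System, PETS).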

 

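Notes

Missing spaces after punctuation marks are common in the corpus texts. This was not corrected when the corpus was released. It reflects the insensitivity of Chinese users of English to spacing between words. The missing spaces affect neither the word count nor the POS tagging. If necessary, corpus users can add the spaces themselves by find and replace.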

  

Disclaimer

The corpus may be used for academic research only and must not be used for any form of commercial activity.

 

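Contact

Every effort has been made to clean unsuitable material from the corpus. If you find other problems, please contact: bfsucrg@sina.com

The TECCL corpus is also deployed on BFSU CQPweb; it can be searched online at http://111.200.194.212/cqp/.

For more corpus-related information, visit the website of the corpus linguistics team at Beijing Foreign Studies University: http://www.bfsu-corpus.org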
