Annotated List of Useful Corpora, Tools, and Resources


2020-04-23 12:29 tesol


All free corpora and corpus tools are marked with an asterisk (*).

Large General Corpora

British National Corpus* (BNC) was developed jointly by three publishers (Oxford University Press, Longman, and W & R Chambers), two universities (Oxford University and Lancaster University), and the British Library. It contains a little over 100 million running words of language data produced between 1980 and 1993 and is divided into seven registers or subcorpora: spoken, fiction, magazine, newspapers, academic writing, nonacademic writing, and miscellaneous. As such, BNC is an excellent source for studying contemporary British English. The corpus can be downloaded at http://ota.ox.ac.uk/desc/2554. It can also be accessed and searched online at several portals, including Mark Davies’s BYU portal, which is equipped with a powerful and user-friendly search engine that can perform many query functions, such as finding and generating collocations, keywords in context, and frequency lists.

Corpus of Contemporary American English* (COCA) is a continuously expanding corpus developed by Mark Davies. Currently, it contains 520 million words, with approximately 20 million words for each year from 1990 to 2015 (more data in the same proportion are expected to be added in future years). Like BNC, COCA contains several subcorpora, including spoken, fiction, magazine, newspaper, and academic writing. Its search engine is the same as the one used on Mark Davies’s BNC portal.

Corpus of Global Web-based English* (GloWbE) is another corpus provided by Mark Davies. It consists of 1.9 billion words gathered from 1.8 million web pages from 20 English-speaking countries, including both inner-circle English-speaking countries (e.g., Great Britain, United States) and outer-circle English-speaking countries (e.g., India, Singapore).

Small General Corpora

Brown Corpus of Standard American English*, developed in the 1960s, is the first of the modern, computer-readable, general corpora.
It contains one million words from 500 texts of 2,000 words each, covering 15 categories of texts, including news reports and fiction.

Lancaster-Oslo/Bergen (LOB) Corpus of British English was developed as a British English counterpart to the Brown Corpus, with exactly the same number of texts, the same text length, and the same number of categories.

Crown and CLOB corpora* were developed at Beijing Foreign Studies University. Mirroring the structure and size of the Brown and LOB corpora, with approximately one million running words each, the Crown and CLOB corpora are intended to provide sets of representative samples of contemporary American and British English, respectively.

Spoken Corpora

Corpus of Spoken Professional American English is a two-million-word corpus, with one million words from presentations and discussions at professional conferences and one million from question-and-answer sessions at White House press briefings and conferences.

Michigan Corpus of Academic Spoken English* (MICASE) is a two-million-word corpus developed by the English Language Institute at the University of Michigan. The corpus is composed of 152 transcripts that are well balanced in terms of academic disciplines and functions. It covers academic spoken English in various functions, including classroom instruction, group discussion, presentation, and advising sessions. The corpus is therefore of particular use for studying academic spoken English.

Santa Barbara Corpus of Spoken American English* is a 249,000-word corpus composed of transcribed naturally occurring spoken interactions from all over the United States. With conversational English as its data, this corpus is particularly useful for learners interested in conversational American English.

Spoken subcorpora of BNC and COCA* (freely available as noted above) are each a large spoken English corpus.
Note, however, that while the spoken subcorpus of BNC covers different forms of spoken English, including both formal speeches and informal private conversations, the spoken subcorpus of COCA contains exclusively broadcast English.

Written English Corpora

British Academic Written English Corpus* (BAWE) is a corpus of academic writing by native British English-speaking students at both the undergraduate and master’s levels. It consists of 2,761 texts from arts and humanities, social sciences, life sciences, and physical sciences. BAWE is a valuable resource for the study of tertiary-level academic English writing.

Michigan Corpus of Upper-Level Student Papers* (MICUSP) is a written English corpus developed by the English Language Institute at the University of Michigan. The corpus has 2.6 million words made up of 829 A-graded papers written by fourth-year undergraduate students and graduate students across 16 disciplines. It is searchable by a variety of categories such as topic, genre, and discipline. As such, the corpus should be very useful for ESL/EFL writers learning to write academic papers.

Academic English subcorpora of BNC and COCA* (freely available as noted above) are each a written English corpus. Similarly, the fiction, magazine, and newspaper subcorpora in both BNC and COCA are also each a written English corpus, but they represent different writing genres and registers.

Business Letter Corpus* (BLC) is a corpus of business letters, with both U.S. and U.K. samples, developed by Yasumasa Someya in 2000 for his master’s project. The corpus consists of roughly one million words. It is searchable and useful for instructors and students of business English.

Learner English Corpora

International Corpus of Learner English (version 2, ICLEv2) is arguably the best-known learner corpus of EFL/ESL students’ writing. It was compiled by Granger and colleagues at the Université Catholique de Louvain.
The ICLEv2 consists of argumentative and descriptive essays by higher intermediate and advanced EFL learners from 16 first language backgrounds, such as Chinese, Czech, Dutch, French, German, Japanese, and Russian. Thus, it is particularly valuable for comparing the language of EFL learners from different language backgrounds.

International Corpus Network of Asian Learners of English* (ICNALE) is a 1.2-million-word written corpus developed by Shin'ichiro Ishikawa. It contains essays written by EFL learners from more than 10 Asian countries.

Louvain International Database of Spoken English Interlanguage (LINDSEI) is a corpus of spoken English by EFL speakers of 11 languages, including Bulgarian, Chinese, French, German, Japanese, and Spanish. Developed by Gaëtanelle Gilquin, Sylvie De Cock, and Sylviane Granger at the Université Catholique de Louvain, the corpus, available for purchase on CD, contains systematically collected representative ESL spoken English and comes with a search program and a handbook.

Ten-thousand English Compositions of Chinese Learners* (TECCL Corpus) is a learner corpus of written English by Chinese EFL students developed at Beijing Foreign Studies University. It contains approximately 10,000 essays written by Chinese secondary school and college EFL students.

Corpus Search, Analysis, and Tagging Tools

AntConc* is a corpus search and analysis software program developed by Laurence Anthony at Waseda University. AntConc provides modules for concordancing corpus queries, developing word lists and keyword lists, compiling lists of clusters/N-grams (formulaic expressions), and retrieving collocations of target words. Because of its various functions, AntConc is a valuable tool for analyzing corpus data for language teaching and learning purposes.

AntWordProfiler* is another program designed by Laurence Anthony.
AntWordProfiler calculates the number and coverage (percentage) of the words in a corpus that belong to the first and second 1,000 high-frequency words of the General Service List (West, 1953) and to the Academic Word List (Coxhead, 2000). AntWordProfiler may be considered an alternative, with a modern interface, to the Range software developed by Professor Paul Nation.

CLAWS WWW Tagger* is a part-of-speech tagger. You can enter a corpus of up to 100,000 running words into its designated space, and it will tag your corpus for parts of speech for free. For a corpus larger than 100,000 words, you can purchase a full version of the tagger.

MonoconcEsy, MonoConc Pro, Collocate, and ParaConc (http://athel.com/index.php) are AntConc-like corpus analysis tools developed by Michael Barlow. MonoConc Pro 2.2 (the most current version of MonoConc Pro) has functions comparable to those of AntConc, such as identification of word collocations and wordlist keyword comparisons. MonoconcEsy is a simplified version (free for individual use) of MonoConc Pro; it does not have the advanced query and analysis functions found in MonoConc Pro 2.2. ParaConc is a tool for comparing data in two corpora of different languages.

Range* is a program by Professor Paul Nation for measuring the distribution of a word across different texts in a corpus. The program measures the number and coverage (percentage) of words in a corpus. It is especially useful for determining which frequency band a word belongs to according to existing word lists, for example, whether a word falls into the first 1,000, second 1,000, or fifth 1,000 most frequent words (up to the 25th 1,000) based on the 25 sublists of BNC/COCA words.

WordSmith Tools is an integrated set of corpus analysis programs developed by Mike Scott. Its most current version is Version 7.
WordSmith provides search and analysis functions similar to those offered by AntConc and MonoConc Pro, such as data concordancing and the development of word, keyword, and cluster/N-gram lists. WordSmith also offers special functions not found in AntConc, one of which is ConcGram, a tool for retrieving concgrams (i.e., nonconsecutive clusters or N-grams; Cheng, Greaves, & Warren, 2006).

Corpus-Based Language Learning Resources and Tools

Compleat Lexical Tutor* is a website developed by Tom Cobb for language learning and teaching. It provides a variety of resources (e.g., corpora such as Brown and BNC), tools (e.g., Range, VocabProfile), and corpus-based learning functions or activities (e.g., corpus-driven error correction and Concord Writer, which allows learners to write while assisted by lexico-grammatical information accessible online).

Word and Phrase* is a multi-function tool for vocabulary learning. It allows you to check the frequency of any word in the entire COCA and across its registers, and across its various academic disciplines if the Academic function is selected. You can also use it to check the vocabulary profile of any text you enter. By selecting the information desired, you can find out how many of the words in the text are in the first 500 words of the Academic Vocabulary List and how many are in the 500–3,000 range of the list. Also, by clicking on any word in the text, you can obtain its frequency information and concordance examples of its use in COCA. Moreover, you can check whether any string of words in the text is an established phrase or multi-word unit; if it is, concordance examples of its use in COCA will be displayed.

WordNet* is an English vocabulary database provided by Princeton University. It can function as a dictionary with much more detailed information, including corpus examples when you select the right display option.
It is especially useful for understanding synonyms, including synonymous multi-word expressions such as phrasal verbs, because it provides very detailed information (with examples) about the relationships among the synonyms or hypernyms/hyponyms in a set. Generally, for language learning and teaching purposes, the “Show key senses” display option should be selected.

Key Word Lists

Academic Word List* (AWL) is a list of 570 highly frequent word families in academic English developed by Averil Coxhead in 2000. The words in the word families are those that occur in a wide range of academic texts. The AWL may be the most influential list of academic words and is a very useful word list for instructors and students of English for academic purposes.

Academic Vocabulary List* (AVL) is a list developed by Dee Gardner and Mark Davies in 2014. It contains over 3,000 word lemmas (not word families) that occur frequently across all the academic disciplines in the academic subcorpus of COCA. The AVL differs from the AWL in several ways. For information about their differences, visit the AVL website.

General Service List (GSL), developed by Michael West in 1953, may be the most influential list of high-frequency words of general English. It includes the first and second 1,000 most frequent words (or word families) in English, and it is of great use for English teaching and material development for beginner learners. However, it has been criticized for its age and its subjective selection of words. Thus, a New General Service List (see below) has been compiled recently by Brezina and Gablasova (2015).

New General Service List (New-GSL), developed by Vaclav Brezina and Dana Gablasova in 2015, is composed of more than 2,000 high-frequency words in contemporary English. In contrast to its predecessor (GSL), the New-GSL was developed from large sets of corpus data with strict criteria for word extraction.

Mike Nelson's Business English Lexis Site* provides a series of business English word lists, such as “100 Most ‘Key’ Words in the Business English Corpus,” “Positive Business English Key Words,” and “Negative Business English Key Words.” The lists were developed from a business English corpus that Mike Nelson built for his doctoral dissertation research. The site also provides free downloadable teaching materials.
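
Two operations recur throughout the tool descriptions above: concordancing (keywords in context) and measuring how much of a text is covered by a word list. The Python sketch below illustrates both ideas in their simplest form. It is not how AntConc, Range, AntWordProfiler, or any other tool named here is implemented; the function names (tokenize, kwic, coverage) and the tiny word bands are hypothetical stand-ins, not actual GSL, AWL, or AVL content.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def kwic(tokens, keyword, width=3):
    """Return one keyword-in-context line per hit, with `width` tokens on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


def coverage(tokens, bands):
    """Percentage of running words in each band; tokens in no band count as 'off-list'."""
    counts = Counter()
    for tok in tokens:
        for name, words in bands.items():
            if tok in words:
                counts[name] += 1
                break
        else:
            counts["off-list"] += 1
    total = len(tokens)
    if total == 0:
        return {}
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# Illustrative input only: a made-up sentence and two made-up word bands.
sample = "The corpus data show that the analysis of the data is a method."
demo_bands = {
    "first_1000": {"the", "that", "of", "is", "a", "show"},
    "academic": {"data", "analysis", "method"},
}

sample_tokens = tokenize(sample)
for line in kwic(sample_tokens, "data", width=2):
    print(line)
print(coverage(sample_tokens, demo_bands))
```

Real profilers work with word families or lemmas rather than raw word forms, so a faithful profile would need the published lists and some form of lemmatization; this sketch matches exact forms only.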

Resource URLs

Web addresses for the resources above, as given in the Chinese version of this article:

British National Corpus (BNC): http://corpus.byu.edu/bnc/ (download: http://ota.ox.ac.uk/desc/2554)
Corpus of Contemporary American English (COCA): http://corpus2.byu.edu/coca
Corpus of Global Web-based English (GloWbE): http://corpus2.byu.edu/glowbe
Brown Corpus of Standard American English: http://www.lextutor.ca/conc/eng
Lancaster-Oslo/Bergen (LOB) Corpus: http://clu.uni.no/icame/manuals/LOB/INDEX.HTM
Corpus of Spoken Professional American English: http://www.athel.com/cpsa.html
MICASE: http://quod.lib.umich.edu/cgi/c/corpus/corpus?c=micase;page=simple
Santa Barbara Corpus of Spoken American English: http://www.linguistics.ucsb.edu/research/santa-barbara-corpus
BAWE: http://ota.ahds.ac.uk/headers/2539.xml
MICUSP: http://micusp.elicorpora.info/
Business Letter Corpus (BLC): http://www.someya-net.com/concordancer
ICNALE: http://language.sakura.ne.jp/icnale/download.html
LINDSEI: http://www.uclouvain.be/en-cecl-lindsei.html
TECCL Corpus: http://www.bfsu-corpus.org/content/teccl-corpus&
AntConc and AntWordProfiler: http://www.laurenceanthony.net/software.html
CLAWS WWW Tagger: http://ucrel.lancs.ac.uk/claws/trial.html
MonoconcEsy, MonoConc Pro, Collocate, and ParaConc: http://athel.com/index.php
Range: http://www.victoria.ac.nz/lals/about/staff/paul-nations
WordSmith Tools: http://lexically.net/wordsmith/index.html
Compleat Lexical Tutor: http://www.lextutor.ca/
Word and Phrase: http://www.wordandphrase.info/
WordNet: http://wordnetweb.princeton.edu/perl/webwn
Academic Word List (AWL): http://www.victoria.ac.nz/lals/resources/academicwordlist
Academic Vocabulary List (AVL): http://www.academicvocabulary.info/
New General Service List (New-GSL): http://www.newgeneralservicelist.org/
Mike Nelson's Business English Lexis Site: http://users.utu.fi/micnel/business_english_lexis_site.htm
