Basics of Using Corpora

How Much Do You Know About Corpora?

2020-04-30 22:14 tesol

A corpus is a collection, or body, of language. Though usually text-based, corpora (the plural of corpus) can include collections of spoken language as well. In fact, some of the most popular examples of corpora include TV news and U.S. Supreme Court transcripts. Other collections include religious texts, academic papers, Wikipedia, and, arguably the largest corpus of all, the Internet. Using a corpus to learn vocabulary can be a much more active experience than traditional, passive approaches to learning vocabulary.

The Advantages of Corpora

One of the great advantages of a corpus is that it presents language in context. A listing of each occurrence of a word together with its surrounding text is known as a concordance, and it allows learners to recognize relationships among words, phrases, sentences, and paragraphs. In particular, this extended context allows us to see collocations, the words that habitually occur together, in the various ways they may be used. For example, we can get a better idea of which adjectives are commonly associated with a particular noun and which prepositions are associated with a particular verb.

If you think of how you may use a dictionary to learn new words, you realize that there is typically a single sentence that serves as the example for any particular word. With a corpus, you may have dozens or even hundreds of examples. Further, these are likely to be authentic language rather than the one contrived sentence that is likely to be included in a dictionary. Having access to multiple authentic examples provides learners with lexical as well as grammatical models. Corpora may be most useful for encouraging learners to experiment with different sentence constructions.

Traditionally, corpora have been very expensive and time-consuming to construct, and this has limited their accessibility for learning purposes.
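To make the ideas of a concordance and a collocation more concrete, here is a minimal Python sketch. It is only an illustration: the sample sentences, the function names (tokenize, concordance, collocates), and the very simple tokenizing rule are all assumptions made for this example, and the context window ignores sentence boundaries. Real concordance tools are far more sophisticated.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())

def concordance(tokens, keyword, width=3):
    """Return a (left context, keyword, right context) tuple for each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

def collocates(tokens, keyword, window=1):
    """Count the words that appear within `window` tokens of the keyword."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            neighbors = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbors)
    return counts

# Invented sample text standing in for a real corpus.
sample_text = (
    "She made a strong argument. He drinks strong coffee every morning. "
    "They depend on strong coffee. The team made a strong start."
)
tokens = tokenize(sample_text)

# Print each occurrence of the keyword with its surrounding words.
for left, kw, right in concordance(tokens, "strong"):
    print(f"{left:>25} [{kw}] {right}")

# List the immediate neighbors of the keyword, most frequent first.
print(collocates(tokens, "strong").most_common(3))
```

Lining the keyword up in a column, as the print loop does, is the usual way to display a concordance, because it makes repeated neighbors such as coffee easy to spot.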
Because corpora were so costly to build, they were used primarily by researchers rather than language teachers. Technology has since made it easier to gather, code, and archive large bodies of text, and institutions and instructors have created numerous new corpora, including collections of their own students’ work.

The Corpus of Contemporary American English

The Corpus of Contemporary American English (COCA) is easy to use through a freely available website. COCA is a good example of conventional corpus-driven concordance tools. With larger corpora like COCA and iWeb, users can find more examples of any given word, including numerous examples of context, collocations, and phraseology. This allows learners to observe various authentic examples of a given word in order to develop a richer and more sophisticated understanding of the varied uses of a word or word root.

Users can search for the various morphological forms of a word root by placing an asterisk at the end of the root. For example, a search for reach* lists the forms of the word found in the corpus. The results allow me to select the form I would like to explore further, so I select reached, and I can then see 55,905 examples of this word in context.

Recently, a new English Corpora website was launched that combines COCA with a number of other corpora, including a corpus of TV and movies and the new Intelligent Web corpus (iWeb). The site allows you to create a “virtual corpus” that is customized and still retains these powerful functions. Teachers and learners can gather a variety of texts into customized collections based on their own interests or around a particular academic topic. This can be useful if a class is organized around thematic topics or if students are preparing for a particular academic discipline.
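The asterisk search described above (reach*) can be imitated on a small collection of your own texts. The Python sketch below is only an approximation for illustration: the sample sentences and the function names (wildcard_to_regex, search_forms) are invented for this example, and it does not reflect how COCA or iWeb work internally.

```python
import re
from collections import Counter

def wildcard_to_regex(pattern):
    """Turn a query such as 'reach*' into a compiled whole-word regex.

    Only the asterisk is treated as a wildcard (zero or more letters).
    """
    parts = [re.escape(p) for p in pattern.lower().split("*")]
    return re.compile(r"\b" + "[a-z]*".join(parts) + r"\b")

def search_forms(text, pattern):
    """Count every word form in `text` that matches the wildcard pattern."""
    regex = wildcard_to_regex(pattern)
    return Counter(regex.findall(text.lower()))

# Invented sample texts standing in for a small custom collection.
texts = [
    "The climbers reached the summit before noon.",
    "Prices are reaching record highs, and they may reach further.",
    "Her reach exceeds her grasp, but she reaches anyway.",
    "We reached an agreement after the outreach team reached out.",
]

# List each matching form with its frequency, most frequent first.
forms = search_forms(" ".join(texts), "reach*")
for form, count in forms.most_common():
    print(form, count)
```

Because the pattern is anchored at word boundaries, a word such as outreach is not counted as a form of reach. From a list like this, a learner could pick one form, such as reached, and then look at its examples in context.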
Virtual corpora can be particularly useful for disciplines that have unique writing conventions or incorporate a lot of technical jargon. They can be saved for continued use, and users can also save a history of their previous activity for future reference.

The iWeb corpus includes 14 billion words that were systematically selected from across the Internet. The site offers users a lot of functionality for free as long as they make fewer than 250 queries per day. Additional searching and features are available as part of a paid individual or institutional site license.

Google can also be used as a basic concordance tool with the entire Internet as a corpus. However, such use lacks the robust and sophisticated features of a tagged corpus. In a future entry, I will share some practical suggestions for such use.

Additional Resources

Here are some additional resources:

Corpora as an Authentic Resource of Language and Beyond (http://www.myenglishonline.ca/wp-content/uploads/2014/11/Corpora-as-an-Authentic-Resource-of-the-Language1.pdf)
CorpusEye is another large collection of different corpora
TESOL Press Resource: Using Corpora for Language Learning and Teaching
The Use of Corpora in the Vocabulary Classroom
Corpora in English Language Teaching
Giampieri, P. (2019). The web as corpus in ESL classes: A case study. International Journal of Language Studies, 13(2), 91–108.

How do you use corpora in your language classroom? Please share in the comments below.