Build a corpus from the web


2020-04-23 12:25 sketchengine



The web is a great source of readily available textual data, but it is also a bottomless warehouse of spam, machine-generated content and duplicated content unsuitable for linguistic analysis. This may raise some uncertainty about the quality of the language included in corpora built from the web. At Sketch Engine, we are very well aware of the problems associated with building web corpora. This is why we never blindly include just anything the web offers. Typically, we discard between 40% and 60% of the textual content we download. The data unsuitable for linguistic analysis are identified using a sophisticated procedure with a special focus on the following issues.

Duplicated content

It is not uncommon for identical or nearly identical content to be found on several websites, or even on different pages of the same website. For example, media conglomerates often post the same news article, sometimes with minor changes or in a shortened or extended version, on all the sites they own. Similarly, travel agencies often include tourist resort descriptions on their websites. However, these descriptions are typically not written from scratch but copied, and sometimes slightly adapted, from other tourism websites.

The fact that the same text appears several times on the web does not mean that it was written several times. Including each instance of the article in a general language corpus would distort the information derived from the corpus about how language is used. To put it simply, the corpus would show that the language in the duplicated content is used more frequently than it really is. It is extremely easy for a web corpus to suffer from this issue, which makes it useless.

How is duplicated content avoided?

Sketch Engine uses a deduplication procedure which detects perfect duplicates as well as texts which were slightly adapted, shortened or extended. This means that if a text is shared with only minor changes, only one instance is kept in the corpus. The deduplication is carried out at paragraph level: whole paragraphs are compared for similarity, and if two paragraphs anywhere in the corpus are identified as identical, one of them is removed. Therefore, the web corpus may contain documents where one or more paragraphs are missing.

Deduplication with user corpora

When users build their own corpora in Sketch Engine using the built-in WebBootCaT tool, deduplication is deactivated by default because there are many scenarios in which the user may need to include repetitive content. For example, the user might want to analyse how many times the same piece of news appeared in the corpus.

The deduplication can be activated during corpus compilation, and it is also possible to select the level at which it is carried out: sentence level, paragraph level, document level or any other structure. For example, with the sentence level selected, individual sentences are compared and, if identical sentences are found, all but one are removed from the corpus. A minimal sketch of the idea follows.
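The article does not spell out the matching algorithm, so the following is only a minimal sketch, assuming exact matching of whitespace-normalized paragraphs. The real procedure also catches slightly adapted, shortened or extended texts, which typically requires comparing word n-grams rather than whole-paragraph hashes. The function names and data layout here are illustrative, not Sketch Engine's API.

```python
import hashlib

def normalize(paragraph):
    # Lowercase and collapse whitespace so trivial formatting
    # differences do not defeat the comparison.
    return " ".join(paragraph.lower().split())

def deduplicate(documents):
    """Paragraph-level deduplication over a whole collection.

    documents: a list of documents, each a list of paragraph strings.
    If the same paragraph occurs anywhere in the collection more than
    once, only the first occurrence is kept, so a document may come
    out with one or more paragraphs missing, as described above.
    """
    seen = set()
    cleaned = []
    for doc in documents:
        kept = []
        for para in doc:
            digest = hashlib.sha1(normalize(para).encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                kept.append(para)
        cleaned.append(kept)
    return cleaned
```

Selecting a different deduplication level, as WebBootCaT allows, amounts to running the same comparison over a different unit: split the text into sentences, or treat each whole document as a single unit, instead of iterating over paragraphs.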
Unwanted content

The internet is full of textual content which has hardly any linguistic value (unless one wants to study this type of text specifically). This may include:

• texts made up of incomplete sentences (post comments, reviews and discussions)
• on-page advertisements
• repetitive content found on each webpage of a site (navigation menus, top menus, end matter, legal clauses)
• text snippets (beginnings of posts or pages inviting the user to read the complete article or page)

These types of unwanted content are eliminated with JusText, a specialized tool able to identify and remove them from a downloaded web page (a minimal usage sketch follows below). This tool is also applied to user corpora when using the integrated corpus-building tool with the web option.

Applying JusText to a web page removes everything but the main textual content. From a home page of a news site, which carries only news headlines or news snippets without sufficient context, nothing will make it into the corpus, because no piece of text is long enough to provide the context necessary for linguistic analysis. From a blog post or news article, the main body will be included in the corpus and the remaining content will be ignored.
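A minimal sketch of this kind of boilerplate removal, using the open-source Python implementation of jusText (pip install justext) together with the requests package; the URL is a placeholder:

```python
import requests
import justext

# Download a page and keep only the paragraphs jusText classifies
# as good (main content); navigation menus, ads, snippets and other
# boilerplate are flagged as is_boilerplate and dropped.
response = requests.get("https://example.com/some-article")  # placeholder URL
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
main_text = "\n".join(p.text for p in paragraphs if not p.is_boilerplate)
print(main_text)
```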
Spam

By spam we refer to text found on the internet which was not produced by a human, or which was produced only once but automatically replicated in many other places on the web. Spam may include:

• texts whose abundance on the internet is highly disproportionate to their frequency outside the internet (pornographic and adult sites, and sites selling products for slimming, muscle gain, hair growth and other health products); these sites often get duplicated automatically under various URLs, which increases their presence even further
• machine-generated text, often not intended to communicate any meaningful information
• machine-translated texts

The presence of spam in our web corpora is partly eliminated by deduplication: in the worst-case scenario, a maximum of one copy of each page will be present in the corpora. However, the main method of avoiding spam in our corpora is the use of seed URLs.

Seed URLs

The process of web crawling is not completely random. Before the crawling starts, a list of respectable, high-quality websites is compiled, and the web crawler starts by downloading the content of these seed URLs. They can be media sites, blogs, professional sites and other sites from which we downloaded good content in the past. If a link leading to another website is found, the web crawler will follow it, but only up to a previously defined depth. Since most of the unwanted web content is in English, the depth has to be set low when building an English web corpus. It can be higher for other major languages, and higher still for smaller languages, where the danger of reaching spam is much lower.

Seed URLs cannot be used when building user corpora with the built-in WebBootCaT tool; they are only used in web corpus building carried out by the Sketch Engine team. Users can, however, build a corpus by downloading websites one at a time using WebBootCaT with the website option.

Additional criteria

The process of web crawling to obtain a general language web corpus also includes various additional criteria.

Length

A document is only kept in the corpus if the downloaded web page contains enough data after the cleaning tools above have been applied. If the document is too short, for example only one sentence, it will not be included, because a lone sentence out of context is rarely linguistically valuable. On the other hand, if the document is too long, for example many thousands of words, this might indicate that the content is not a standard web page or that it may not be of a linguistic nature at all. Such documents are also not included. When building user corpora with the built-in WebBootCaT tool, these parameters can be set to different values or even disabled to include absolutely all text in the corpus.

Language detection

During web crawling, the language of the downloaded text is detected and only texts in the desired language are included. This means that an English corpus can contain pages published on German, Spanish, French, Japanese and other websites, as long as the pages themselves are in English.

Where are the texts from?

Despite the use of seed URLs as the starting points for the web crawling, it is not easy to generalise and give a simple answer to this question. However, each document (a downloaded web page) in the corpus comes with metadata such as the source website and the exact URL from which the text was downloaded. The user can generate a list of all the websites or URLs together with the number of documents or tokens downloaded from each source, which provides some insight into where the data come from. Similarly, it is possible to display this information for each concordance line, or to narrow the search to only certain websites, so the user is always in control of where the results come from. Text types and subcorpora are the functionalities designed to achieve this.

How to build your own web corpus

There is little point in building your own general-purpose web corpus because there are plenty of them in Sketch Engine already; the largest ones have a size of 40 billion words, and the Timestamped corpus in 18 languages is even updated daily. If you need to build a specialized corpus, use the built-in web corpus building tool with one or more of these options:

• build a corpus from a web search
• build a corpus from web links
• download a website

For geeks

Most of the tools integrated into Sketch Engine, into its web crawling pipeline and into the WebBootCaT tool are open source and can be downloaded individually from http://corpus.tools. Their installation, configuration and use may, however, require advanced IT skills. It is generally far more effective to work with the corpora preloaded in Sketch Engine or to use the built-in corpus building tool. For illustration, a toy version of such a pipeline is sketched below.
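The following is a heavily simplified sketch combining the steps described in this article: seed URLs, a depth-limited crawl, jusText cleaning, a length filter and language detection. It uses the third-party packages requests, justext, beautifulsoup4 and langdetect; the seed list, depth and word thresholds are illustrative placeholders, not Sketch Engine's settings, and a real crawler would also need deduplication, politeness delays and robots.txt handling.

```python
from collections import deque
from urllib.parse import urljoin

import requests
import justext
from bs4 import BeautifulSoup
from langdetect import detect, LangDetectException

SEED_URLS = ["https://example.com/"]  # placeholder; real seed lists are curated
MAX_DEPTH = 1                         # kept low, as for English, to avoid spam
MIN_WORDS, MAX_WORDS = 50, 10_000     # illustrative length thresholds

def crawl(seeds, max_depth):
    queue = deque((url, 0) for url in seeds)
    visited, corpus = set(), []
    while queue:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        if "text/html" not in response.headers.get("Content-Type", ""):
            continue
        # Boilerplate removal: keep only paragraphs jusText marks as good.
        try:
            paragraphs = justext.justext(response.content,
                                         justext.get_stoplist("English"))
        except Exception:
            continue  # unparseable page; a real pipeline would log this
        text = "\n".join(p.text for p in paragraphs if not p.is_boilerplate)
        # Length and language filters, as described above.
        if MIN_WORDS <= len(text.split()) <= MAX_WORDS:
            try:
                if detect(text) == "en":
                    corpus.append({"url": url, "text": text})
            except LangDetectException:
                pass  # too little text to guess the language
        # Follow links, but only down to the configured depth.
        soup = BeautifulSoup(response.text, "html.parser")
        for a in soup.find_all("a", href=True):
            queue.append((urljoin(url, a["href"]), depth + 1))
    return corpus

if __name__ == "__main__":
    for doc in crawl(SEED_URLS, MAX_DEPTH):
        print(doc["url"], len(doc["text"].split()), "words")
```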

