Web corpus construction

The World Wide Web constitutes the largest existing source of texts written in a great variety of languages. A feasible and sound way of exploiting this data for linguistic research is to compile a static corpus for a given language. There are several advantages of this approach: (i) Working with su...

Bibliographic Details
Main Author: Schäfer, Roland
Other Authors: Bildhauer, Felix
Format: eBook
Language: English
Published: San Rafael, Calif. (1537 Fourth Street, San Rafael, CA 94901 USA) : Morgan & Claypool, c2013.
Series: Synthesis digital library of engineering and computer science.
Synthesis lectures on human language technologies ; # 22.
Online Access: Abstract with links to full text
Table of Contents:
  • 1. Web corpora
  • 2. Data collection
  • 2.1 Introduction
  • 2.2 The structure of the web
  • 2.2.1 General properties
  • 2.2.2 Accessibility and stability of web pages
  • 2.2.3 What's in a (national) top level domain?
  • 2.2.4 Problematic segments of the web
  • 2.3 Crawling basics
  • 2.3.1 Introduction
  • 2.3.2 Corpus construction from search engine results
  • 2.3.3 Crawlers and crawler performance
  • 2.3.4 Configuration details and politeness
  • 2.3.5 Seed URL generation
  • 2.4 More on crawling strategies
  • 2.4.1 Introduction
  • 2.4.2 Biases and the PageRank
  • 2.4.3 Focused crawling
  • 3. Post-processing
  • 3.1 Introduction
  • 3.2 Basic cleanups
  • 3.2.1 HTML stripping
  • 3.2.2 Character references and entities
  • 3.2.3 Character sets and conversion
  • 3.2.4 Further normalization
  • 3.3 Boilerplate removal
  • 3.3.1 Introduction to boilerplate
  • 3.3.2 Feature extraction
  • 3.3.3 Choice of the machine learning method
  • 3.4 Language identification
  • 3.5 Duplicate detection
  • 3.5.1 Types of duplication
  • 3.5.2 Perfect duplicates and hashing
  • 3.5.3 Near duplicates, Jaccard coefficients, and shingling
  • 4. Linguistic processing
  • 4.1 Introduction
  • 4.2 Basics of tokenization, part-of-speech tagging, and lemmatization
  • 4.2.1 Tokenization
  • 4.2.2 Part-of-speech tagging
  • 4.2.3 Lemmatization
  • 4.3 Linguistic post-processing of noisy data
  • 4.3.1 Introduction
  • 4.3.2 Treatment of noisy data
  • 4.4 Tokenizing web texts
  • 4.4.1 Example: missing whitespace
  • 4.4.2 Example: emoticons
  • 4.5 POS tagging and lemmatization of web texts
  • 4.5.1 Tracing back errors in POS tagging
  • 4.6 Orthographic normalization
  • 4.7 Software for linguistic post-processing
  • 5. Corpus evaluation and comparison
  • 5.1 Introduction
  • 5.2 Rough quality check
  • 5.2.1 Word and sentence lengths
  • 5.2.2 Duplication
  • 5.3 Measuring corpus similarity
  • 5.3.1 Inspecting frequency lists
  • 5.3.2 Hypothesis testing with χ²
  • 5.3.3 Hypothesis testing with Spearman's rank correlation
  • 5.3.4 Using test statistics without hypothesis testing
  • 5.4 Comparing keywords
  • 5.4.1 Keyword extraction with χ²
  • 5.4.2 Keyword extraction using the ratio of relative frequencies
  • 5.4.3 Variants and refinements
  • 5.5 Extrinsic evaluation
  • 5.6 Corpus composition
  • 5.6.1 Estimating corpus composition
  • 5.6.2 Measuring corpus composition
  • 5.6.3 Interpreting corpus composition
  • 5.7 Summary
  • Bibliography
  • Authors' biographies.