Web corpus construction
The World Wide Web constitutes the largest existing source of texts written in a great variety of languages. A feasible and sound way of exploiting this data for linguistic research is to compile a static corpus for a given language. There are several advantages of this approach: (i) Working with su...
Format: eBook
Language: English
Published: San Rafael, Calif. (1537 Fourth Street, San Rafael, CA 94901 USA): Morgan & Claypool, c2013.
Series: Synthesis digital library of engineering and computer science. Synthesis lectures on human language technologies; #22.
Table of Contents:
- 1. Web corpora
- 2. Data collection
- 2.1 Introduction
- 2.2 The structure of the web
- 2.2.1 General properties
- 2.2.2 Accessibility and stability of web pages
- 2.2.3 What's in a (national) top level domain?
- 2.2.4 Problematic segments of the web
- 2.3 Crawling basics
- 2.3.1 Introduction
- 2.3.2 Corpus construction from search engine results
- 2.3.3 Crawlers and crawler performance
- 2.3.4 Configuration details and politeness
- 2.3.5 Seed URL generation
- 2.4 More on crawling strategies
- 2.4.1 Introduction
- 2.4.2 Biases and the PageRank
- 2.4.3 Focused crawling
- 3. Post-processing
- 3.1 Introduction
- 3.2 Basic cleanups
- 3.2.1 HTML stripping
- 3.2.2 Character references and entities
- 3.2.3 Character sets and conversion
- 3.2.4 Further normalization
- 3.3 Boilerplate removal
- 3.3.1 Introduction to boilerplate
- 3.3.2 Feature extraction
- 3.3.3 Choice of the machine learning method
- 3.4 Language identification
- 3.5 Duplicate detection
- 3.5.1 Types of duplication
- 3.5.2 Perfect duplicates and hashing
- 3.5.3 Near duplicates, Jaccard coefficients, and shingling
- 4. Linguistic processing
- 4.1 Introduction
- 4.2 Basics of tokenization, part-of-speech tagging, and lemmatization
- 4.2.1 Tokenization
- 4.2.2 Part-of-speech tagging
- 4.2.3 Lemmatization
- 4.3 Linguistic post-processing of noisy data
- 4.3.1 Introduction
- 4.3.2 Treatment of noisy data
- 4.4 Tokenizing web texts
- 4.4.1 Example: missing whitespace
- 4.4.2 Example: emoticons
- 4.5 POS tagging and lemmatization of web texts
- 4.5.1 Tracing back errors in POS tagging
- 4.6 Orthographic normalization
- 4.7 Software for linguistic post-processing
- 5. Corpus evaluation and comparison
- 5.1 Introduction
- 5.2 Rough quality check
- 5.2.1 Word and sentence lengths
- 5.2.2 Duplication
- 5.3 Measuring corpus similarity
- 5.3.1 Inspecting frequency lists
- 5.3.2 Hypothesis testing with χ²
- 5.3.3 Hypothesis testing with Spearman's rank correlation
- 5.3.4 Using test statistics without hypothesis testing
- 5.4 Comparing keywords
- 5.4.1 Keyword extraction with χ²
- 5.4.2 Keyword extraction using the ratio of relative frequencies
- 5.4.3 Variants and refinements
- 5.5 Extrinsic evaluation
- 5.6 Corpus composition
- 5.6.1 Estimating corpus composition
- 5.6.2 Measuring corpus composition
- 5.6.3 Interpreting corpus composition
- 5.7 Summary
- Bibliography
- Authors' biographies.
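The near-duplicate detection approach named in section 3.5.3 (shingling plus Jaccard coefficients) can be sketched briefly. This is a minimal illustration, not the book's implementation: the function names, the word-based shingles, and the k=3 shingle length are all choices made here for the example.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word n-grams) of a text."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two shingle sets."""
    if not a and not b:
        return 1.0  # two empty documents count as identical
    return len(a & b) / len(a | b)

# Two documents differing in a single word share most of their shingles,
# so their Jaccard coefficient is high; a threshold on this value flags
# near-duplicate pairs.
doc1 = "the quick brown fox jumps over the lazy dog near the river bank"
doc2 = "the quick brown fox jumps over the lazy cat near the river bank"
similarity = jaccard(shingles(doc1), shingles(doc2))  # 8 shared of 14 distinct shingles
```

In practice, corpora are too large to compare all shingle sets pairwise, which is why such sketches are usually combined with hashing tricks (e.g. MinHash) to approximate the Jaccard coefficient efficiently.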