Resources on the web
There are many interesting pages for corpus linguists on the Internet. The following selection of links includes websites with information on corpora and corpus analysis tools and some free, web-accessible corpora.
Helpful websites
- surveys of corpora and text archives: Looking for a corpus containing data on a specific variety or from a particular period? Interested in what corpora or text collections are available "out there"? Have a look at these lists/link collections!
- Martin Weisser's Bookmarks for Corpus-based Linguists: ca. 1,000 categorized and annotated links to resources (corpora, text archives, ...) for corpus-based linguists.
- Corpus Resource Database (CoRD): an open-access online resource where academic corpus compilers can publish descriptions of their corpora (offered and maintained by the Research Unit for Variation, Contacts and Change in English, University of Helsinki); currently lists 55 English-language corpora, subcorpora and databases; includes links and a Corpus Finder tool
- the companion website to Corpus-based Language Studies: An Advanced Resource Book (2005; London: Routledge) by Tony McEnery, Richard Xiao and Yukio Tono: includes a Corpus Survey and links to further resources
- List of texts, text centres and web resources (maintained by the International Computer Archive of Modern and Medieval English)
- Learner corpora around the world: a list of more than 100 learner corpora, maintained by the Centre for English Corpus Linguistics, Université de Louvain
Please note: You will not be able to access all corpora linked to in these databases. The extent to which corpus data can be accessed varies greatly: in some cases you may be allowed to download the corpus in its entirety, while other websites offer access to the data via a web interface only (you may also need to register as a user). Many corpora are not available unless a license fee is paid (in such cases, please check whether the Linguistics section owns a license).
- information on corpus analysis software
- Laurence Anthony's AntConc: AntConc for download, video tutorials that teach you how to use the software, links to online help (including a discussion forum for questions), documentation/manual, books/papers related to AntConc
- WordSmith Tools (Oxford University Press & Lexical Analysis Software; Mike Scott): the Support section includes 'get-started guides' in a number of languages, answers to FAQs, online help and a link to the online WordSmith discussion group
- help with quantitative data/statistics
- Log-likelihood calculator (University of Lancaster)
- Information on using statistics in "Einführung in die Korpuslinguistik: Praktische Grundlagen und Werkzeuge" [German]
- Sample Size Calculator - Helps you to determine how large your sample needs to be in order to represent the corresponding population at a given level of precision.
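If you would like to see what calculators of this kind compute, here is a minimal Python sketch. It is not taken from any of the linked sites and makes its own assumptions: the log-likelihood function uses the two-term form commonly applied when comparing a word's frequency in two corpora, and the sample size function uses Cochran's formula with a finite population correction, which may differ from the method of the linked calculator. The function and parameter names are hypothetical.

```python
import math


def log_likelihood(freq1, freq2, size1, size2):
    """Log-likelihood (G2) for one word's frequency in two corpora.

    freq1, freq2: observed frequencies of the word in corpus 1 and 2.
    size1, size2: total number of words in corpus 1 and 2.
    Assumption: two-term form often used for keyword comparison.
    """
    total = freq1 + freq2
    expected1 = size1 * total / (size1 + size2)
    expected2 = size2 * total / (size1 + size2)
    g2 = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:  # a zero frequency contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


def sample_size(population, margin=0.05, z=1.96, proportion=0.5):
    """Sample size from Cochran's formula with finite population correction.

    Assumptions: z=1.96 corresponds to a 95% confidence level, and
    proportion=0.5 is the most conservative choice.
    """
    n0 = z ** 2 * proportion * (1 - proportion) / margin ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / population))


if __name__ == "__main__":
    # Hypothetical counts: 20 vs. 10 hits in two corpora of 1,000 words each.
    print(round(log_likelihood(20, 10, 1000, 1000), 2))
    # Hypothetical population of 1,000 texts.
    print(sample_size(1000))
```

A log-likelihood value above 3.84 corresponds to p < 0.05 (chi-square distribution, one degree of freedom). For real analyses, please check your results against the tools linked above.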
Some free, web-accessible corpora
- AusNC - The Australian National Corpus (also includes ICE-Australia). Requires an account. Limited access to some of the corpora (e.g. Monash, ICE-AUS).
- BNC - The British National Corpus. Requires an account.
- CMSW - Corpus of Modern Scottish Writing, and SCOTS - Scottish Corpus of Texts and Speech. Free download and full access. No registration required.
- EEBO - Early English Books Online (British and American books published between 1475 and 1700). Full access granted from the JLU intranet.
- FALKO - Fehlerannotiertes Lernerkorpus (can be searched using the ANNiS³ web interface). Free full access without registration.
- OBC - The Old Bailey Corpus (spoken English in the 18th and 19th centuries). Full access with JLU login.
- SBC - The Santa Barbara Corpus of Spoken American English. Free download, full access, no registration.
- VOICE - The Vienna-Oxford International Corpus of English (1 million words of spoken English used as a lingua franca). Free download. Web interface requires registration.
- various corpora at corpus.byu.edu (site maintained by Mark Davies), e.g. the Corpus of Contemporary American English (COCA), the Corpus of Historical American English (COHA), the Corpus of Global Web-Based English (GloWbE) and the Corpus of American Soap Operas. Limited number of daily queries and other limitations for free users. JLU does not have a commercial licence for the BYU web interfaces.