Corpus linguistics and BNCweb


On October 30 Prof. Sebastian Hoffmann taught the seminar “Corpus Linguistics with BNCweb”.

CV details

Prof. Dr. Sebastian Hoffmann has been Professor in English Linguistics at the University of Trier (Germany) since 2009. Prior to this, he was Lecturer in English Linguistics at Lancaster University (UK) for three years (2006-2009), having worked for several years at the University of Zürich, first as a research assistant to Prof. Gunnel Tottie and then as a 'Wissenschaftlicher Mitarbeiter'. He also obtained his PhD at the University of Zürich.

Prof. Hoffmann is a renowned corpus linguist with expertise in methodological and usage-based approaches to the study of language. He is one of the creators of BNCweb, the user-friendly web interface to the British National Corpus, and co-authored its manual with Stefan Evert, Nick Smith, David YW Lee and Ylva Berglund (Corpus Linguistics with BNCweb – A Practical Guide, 2008).

He also has an interest in practical issues involved in using Internet-derived data for corpus linguistic analyses, in relation to which he has published the article ‘Processing Internet-Derived Text - Creating a Corpus of Usenet Messages’ (2007). In addition, he has recently co-edited two volumes on corpus linguistics with Geoffrey Leech and Paul Rayson: English Corpus Linguistics: Looking Back, Moving Forward (2012) and Methodological and Historical Dimensions of Corpus Linguistics (2011), both collections of papers from the ICAME conference held at Lancaster University (2009). Prof. Hoffmann’s research also includes other topics such as diachronic and synchronic syntactic change, in particular grammaticalization of complex prepositions, on which he has written the monograph Grammaticalization and English Complex Prepositions. A Corpus-Based Study (Routledge, 2005); the use of tag questions in British and American English, a topic on which he has collaborated with Prof. Gunnel Tottie (e.g. 2006, 2009); and World Englishes, for instance on verb complementation in British and Indian English, in particular ditransitive verbs (see Mukherjee & Hoffmann 2006). Most recently he has started working on a large-scale project on Singapore English.

Outline of the seminar

The seminar provided an introduction to the British National Corpus through its web application, BNCweb. It consisted of three main parts. First, Prof. Hoffmann gave a general description of the corpus (design and contents), pointing out the crucial difference between the corpus selection criteria – the ‘[s]pecifications defining the kind and proportion of material to be included for the compilation of the corpus’ (e.g. domain, medium, time) – and the corpus descriptive features – the ‘[c]lassificatory features of the corpus that were not part of the selection criteria, but [were] added post hoc – on the basis of observed evidence’ (e.g. age, gender, dialect).

The second part of the seminar addressed ‘some basic [yet necessary] methodological issues’ through a number of case studies which attendees could try out hands-on. Regarding (relative) frequency, Prof. Hoffmann stressed the importance of normalisation, so that counts drawn from subcorpora of different sizes can be compared, and of choosing an appropriate frequency metric. The difference between precision and recall was discussed at length, including methods for optimising both. Next came annotation at word level (parts of speech), which he exemplified with a case study of intensifiers. At this point, Prof. Hoffmann explained the use of a variety of tools available in BNCweb to perform and refine searches, including queries with specific tags, query expressions for less restrictive searches, and wildcards.
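The two notions of normalised frequency and precision/recall can be illustrated with a minimal Python sketch. It is not taken from the seminar materials or from BNCweb itself; the function names and all figures are hypothetical and serve only to show the arithmetic.

```python
def per_million(raw_count, corpus_size):
    """Normalise a raw hit count to occurrences per million words.

    corpus_size is the number of words in the (sub)corpus searched.
    """
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * 1_000_000 / corpus_size


def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) for a corpus query.

    precision: share of retrieved hits that are relevant.
    recall:    share of relevant instances that the query retrieved.
    """
    retrieved = true_positives + false_positives
    relevant = true_positives + false_negatives
    precision = true_positives / retrieved if retrieved else 0.0
    recall = true_positives / relevant if relevant else 0.0
    return precision, recall


# Hypothetical figures: the same word in two subcorpora of unequal size.
spoken = per_million(500, 10_000_000)
written = per_million(1_500, 90_000_000)

# Hypothetical query: 80 relevant hits, 20 false hits, 40 instances missed.
precision, recall = precision_recall(80, 20, 40)
```

With these invented numbers the word is three times as frequent in absolute terms in the larger subcorpus, yet its normalised frequency there is lower (about 16.7 per million against 50), which is why raw counts alone can mislead.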

The third part of the seminar focused on collocations: raw frequency vs. statistical measures of collocational strength, register/domain-specific collocations, and how to investigate collocations with BNCweb. Sample cases included, for instance, say/tell/express an opinion, cause + nouns with negative denotation, and collocations with dangerous. As with the previous tasks, Prof. Hoffmann provided useful tips for making the most of the methodological tools available in BNCweb: expected and observed collocate frequencies and the different association measures that can be quickly retrieved from the application, e.g. Z-score, T-score, log-likelihood and mutual information.
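As a rough illustration of how observed and expected frequencies feed into such measures, the sketch below computes the four scores using their common textbook definitions. It is an assumption-laden simplification: the function name and the figures are hypothetical, the expected frequency ignores the size of the collocation window, and BNCweb's own calculations may differ in this and other details.

```python
import math


def association_measures(observed, f_node, f_collocate, corpus_size):
    """Textbook association scores for a node/collocate pair.

    observed:    co-occurrence frequency of node and collocate (must be > 0)
    f_node:      total frequency of the node word
    f_collocate: total frequency of the collocate
    corpus_size: number of tokens in the corpus
    Assumption: the window span is ignored when computing expected values.
    """
    if observed <= 0:
        raise ValueError("observed must be positive")
    n = corpus_size

    # Observed 2x2 contingency table.
    o11 = observed
    o12 = f_node - observed
    o21 = f_collocate - observed
    o22 = n - f_node - f_collocate + observed

    # Expected values under independence of node and collocate.
    e11 = f_node * f_collocate / n
    e12 = f_node * (n - f_collocate) / n
    e21 = (n - f_node) * f_collocate / n
    e22 = (n - f_node) * (n - f_collocate) / n

    def ll_term(o, e):
        # A cell with zero observations contributes nothing to the sum.
        return o * math.log(o / e) if o > 0 else 0.0

    log_likelihood = 2 * (
        ll_term(o11, e11) + ll_term(o12, e12)
        + ll_term(o21, e21) + ll_term(o22, e22)
    )
    return {
        "expected": e11,
        "mutual_information": math.log2(o11 / e11),
        "t_score": (o11 - e11) / math.sqrt(o11),
        "z_score": (o11 - e11) / math.sqrt(e11),
        "log_likelihood": log_likelihood,
    }


# Hypothetical counts: the pair co-occurs 100 times in a 1,000,000-token corpus.
scores = association_measures(100, 1_000, 2_000, 1_000_000)
```

Here the expected co-occurrence frequency is only 2, so all four scores come out clearly positive. The measures rank collocates differently, though: mutual information tends to favour rare, exclusive pairings, whereas the T-score favours frequent ones.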