Corpus Stylistic Analysis of Jane Austen's 'Pride and Prejudice'

Studies that approach literary texts with corpus linguistic methods are developing, the use of corpora in stylistics has increased in recent years, and the term 'corpus stylistics' has become substantially popular. The Latin word corpus (plural corpora) means “body” and refers to a collection of texts, which are stored in an electronic database. Baker, Hardie & McEnery argue that “although a corpus does not contain new information about language by using software packages which process data, we can obtain a new perspective on the familiar” (48-49).

As Wales (1989) points out, corpus stylistics is a branch of computational linguistics that was developed in the late 1960s. It uses statistical and computer-aided tools to investigate measurable characteristics of the data, such as the length of words and sentences, in order to study a number of issues related to style (85).

According to McIntyre, 'corpus stylistics is simply corpus linguistics with a different object of study (literature as opposed to non-literary language)'. He adds that corpus stylistics does not only borrow tools from corpus linguistics: it makes itself distinctive by applying the qualitative tools and techniques of stylistics to texts with the help of computational methods (McIntyre 60).

This paper presents a corpus stylistic analysis of Jane Austen's novel “Pride and Prejudice”, which is analysed according to a corpus stylistic approach. In general, a corpus is a collection of written and spoken texts. News is information about current events. This may be provided through many different media: word of mouth, printing, postal systems, broadcasting, electronic communication, or through the testimony of observers and witnesses to events.

The current paper focuses on a single novel in electronic form. The analysis is based on the recurrent word combinations found in the text by the corpus software. Mahlberg sees corpus stylistics as “a way of bringing the study of language and literature closer together” (2007: 3).
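Recurrent word combinations (clusters) can be illustrated with a minimal sketch. It assumes a simple lowercase tokenizer and a hypothetical sample string; the function names are illustrative and do not reproduce the corpus software used in this study.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())

def clusters(tokens, n=3):
    """Count every run of n consecutive tokens (an n-gram, or cluster)."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)

# Hypothetical sample text, not a quotation from the novel.
sample = "I do not know. I do not care. I do not mind."
counts = clusters(tokenize(sample), n=3)
print(counts.most_common(1))
```

In this sketch sentence boundaries are ignored, so a cluster may span two sentences; a fuller analysis might split the text into sentences first.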

Research Methodologies

The methodology of the study follows Mahlberg & McIntyre's (2012) method. This model focuses on one literary text by one author. They acknowledge that a single text may be considered a 'small sample of data', but they assert that such a text is still regarded as part of a corpus (206).

The majority of stylistic studies, including articles, books and other works, use qualitative methodologies to analyse literary texts. For this study, the writer took the data from the electronic book “Pride and Prejudice” by Jane Austen and converted it to plain text so that it could be analysed by software, which can deal with large texts in a relatively short time. In addition, software can achieve the objectivity that stylisticians seek, and it can reveal crucial features that may be missed in manual analysis.

This work aims to examine keywords, key semantic domains and clusters. Keywords can be defined as the words that are unusually frequent in a single text or group of texts in comparison with a reference corpus. Words are the crucial part of any corpus study. There are three groups of words in general: proper nouns, content words and function words. Mahlberg & McIntyre point out that the most common words are function words, which work as the constituents of any text. Content words, however, are the carriers of meaning and of the writer's message, and for this reason they are important to study (384).
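The comparison with a reference corpus can be sketched with a keyness statistic. The example below assumes the two-term log-likelihood measure that is commonly used for keyword comparison; the text above does not state which formula the software applies, and all counts are hypothetical.

```python
import math

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness of one word: study corpus versus reference corpus."""
    total = size_study + size_ref
    combined = freq_study + freq_ref
    # Expected frequencies if the word were equally common in both corpora.
    expected_study = size_study * combined / total
    expected_ref = size_ref * combined / total
    ll = 0.0
    for observed, expected in ((freq_study, expected_study), (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll

# Hypothetical counts, for illustration only.
key = log_likelihood(freq_study=50, size_study=100_000, freq_ref=10, size_ref=1_000_000)
same = log_likelihood(freq_study=10, size_study=100_000, freq_ref=100, size_ref=1_000_000)
print(round(key, 1), round(same, 1))
```

A word with the same relative frequency in both corpora scores zero, while a word that is proportionally much more frequent in the study text receives a high score and would be ranked as a keyword.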

Gliozzo & Strapparava define semantic domains as 'fields characterized by lexically coherent words. The lexical coherence assumption can be exploited for computational purposes because it allows us to define automatic acquisition' (5).


Corpus stylistics brings the methods of corpus linguistics to the practice of stylistics. The term ‘corpus stylistics’ is often used specifically to refer to the study of literary texts. Some researchers in this field use ‘corpus stylistics’ to refer to literary criticism only; Mahlberg characterises it as a methodology which combines different approaches but is fundamentally 'a way of bringing the study of language and literature closer together' (Mahlberg, 2007, p. 219). Wynne (2006) also implicitly assumes that corpus stylistics is a stylistic inquiry into literary language. Others use the term more generally; Semino and Short include news reports and autobiographies in their book, Corpus Stylistics (2004).

Pride and Prejudice is a novel by Jane Austen, published on January 28, 1813. It tells a story of love among the upper middle class in England in the early 19th century. The novel describes the events surrounding its main character, Elizabeth Bennet, who lives at Longbourn, England. Elizabeth is described as a cheerful and polite woman who has a sharp intelligence and refuses to be intimidated by anyone.

The Bennet family has five unmarried daughters. The family is quite prosperous, but because there are no sons, all of its property must pass on Mr. Bennet's death to Mr. Collins, a cousin of Mr. Bennet. This situation worries Mrs. Bennet, so her ambition is to marry her daughters to rich men.

This particular novel was chosen because Pride and Prejudice has long fascinated readers, consistently appearing near the top of lists of 'most-loved books' among literary scholars and the general public. It has become one of the most popular novels in English literature, with over 20 million copies sold, and it paved the way for many archetypes that abound in modern literature.

Certain words are used to indicate, directly or indirectly, the main theme of the text; these are called 'thematic signals'. These words have importance over others: they are the carriers of meaning, or the clues to the hidden codes in the text. The novel has romantic and psychological themes. The novel contains 122,007 words.

The word 'married', for example, occurs 57 times throughout the text. It leads to the basic theme of the novel: as mentioned before, the story is about love among the upper middle class in England in the early 19th century. Similarly, the word 'trust', with 28 occurrences, has noticeable thematic value. The word 'pride' occurs 56 times and also points to the theme of love in the novel. The occurrences, or concordances, of this word in the text lead the reader to this interpretation.
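Word counts of this kind can be reproduced with a short script. The sketch below assumes a simple lowercase tokenizer and uses the novel's opening sentence as sample input; it will not necessarily match the counts reported by the corpus software, which may tokenize differently.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count how often each lowercase word form occurs in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Opening sentence of the novel, used here only as a small sample.
sample = ("It is a truth universally acknowledged, that a single man in possession "
          "of a good fortune, must be in want of a wife.")
freq = word_frequencies(sample)
print(freq["a"], freq["in"], freq["married"])  # prints: 4 2 0
```

Applied to the full plain-text file of the novel, the same function would give a count for any word of interest, such as 'married', 'trust' or 'pride'.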

Here are examples of the associations of the word within the context:

Screenshot (3) Thirty-Three Examples of Successive Concordances of the Word “married.”

From the above screenshot, the right context makes it clear that the word “advantageously” is associated with the reason why the Bennet family wishes to marry a daughter to a man from a wealthy family. This co-occurrence of words is significant. In line (4), the phrase “point of being most advantageously” refers to the benefit the family expects from their daughter's marriage. In line (6), the phrase “delightful thing” indicates Mrs. Bennet's desire.
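Concordance lines of the kind discussed above show a node word with its left and right context. A minimal keyword-in-context sketch follows; the `kwic` helper and the sample sentences are hypothetical and are not the output of the software used for the screenshot.

```python
import re

def kwic(text, node, width=4):
    """Return (left context, node word, right context) for each hit of `node`."""
    tokens = re.findall(r"[A-Za-z']+", text)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == node.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

# Hypothetical sentences, not quotations from the novel.
sample = "She hoped to see her daughter married well. A married woman had more security."
hits = kwic(sample, "married")
for left, word, right in hits:
    print(f"{left:>30} | {word} | {right}")
```

Aligning the node word in a central column makes it easier to scan the right context for recurring collocates such as “advantageously”.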

Fictional world and thematic signal keywords, with examples of subgroups, based on Mahlberg and McIntyre (2012, 210).

Category           Subgroup                  Example keywords
Fictional world    Characters                Elizabeth, Mr. Darcy, Bennet family, Mr. Collins
                   Body parts                Face, eyes, mouth
                   Clothes and accessories   Gown, wedding clothes, coat, stocking
                   Settings and props        House, Netherfield, Pemberley House, Longbourn
Thematic signals                             Pride, admiration, prejudice, trust, married

Keywords in Pride and Prejudice

The following table contains the top 22 keywords produced by the Wmatrix3 tool when the selected novel is compared with the BNC (British National Corpus) Sampler Written. The table also shows the frequency of each keyword in the novel.

Table (2) The Top 22 Keywords of Pride and Prejudice Compared to BNC Sampler Written.

Table (2) shows that words such as 'the', 'to', 'of' and 'and' are dominant in terms of frequency. Nevertheless, this does not mean that researchers should neglect a word like 'advantageously', which was illustrated previously. Computationally retrieved keyness should be checked and interpreted by manual examination in order to reach a reasonable and subtle reading of the text by means of corpus stylistics.

It is worth mentioning that Mahlberg and McIntyre maintain that the analysis of concordance lines is necessary to identify keywords related to 'thematic signals' and the 'fictional world', in order to search for meaning within the context (209).


The study of corpus stylistics can be very important for researchers. It connects quantitative and qualitative methods, and it helps to draw the reader's attention to the literary text. In addition, using computational tools to analyse novels not only saves time and gives fast results with one click; the tools also count words that a human reader might pass over. However, the computed results are not meaningful without manual study. Thus, quantitative methods should be supported by qualitative analysis. This is done by identifying important items, examining them, and linking them to the meaning of the text. There must also be some kind of sorting technique that separates guiding words (those that point to themes) from words that are less important and provide little information about the literary text.


  1. Eman Adil Jaafar. 2017. Corpus Stylistic Analysis of Thomas Harris' The Silence of the Lambs. University of Baghdad, Baghdad, Iraq.
  2. Kristina Bujanova. 2013. A Corpus-Stylistic Analysis of Mitchell's Gone with the Wind and Hemingway's A Farewell to Arms. University of Oslo.
  6. WordSmith Tools 7.0 software.
