Visit my new site

I am abandoning this site in favour of a new site https://scalablereading.northwestern.edu

Posted in Uncategorized

Stanley Fish and the Digital Humanities

I have written a longish blog on Stanley Fish’s recent tri(b)logy about the Digital Humanities. You can find it on the web site of Northwestern’s new Center for Scholarly Communication and Digital Curation.

Posted in Uncategorized

Nero, inmost, legitimate, pop

‘Nero’, ‘inmost’, ‘legitimate’, and ‘pop’ set up a web of associations that link King John and its source play, The Troublesome Raigne of Iohn, to Hamlet and King Lear.

The Troublesome Raigne opens not unlike Richard II. A weak king is introduced by hearing a suit. Robert Falconbridge brings a suit against his mother and older brother Philip, who he claims is the illegitimate son of the former king. Partway through the hearing, Philip decides that being the bastard son of a king is a better deal than being the legitimate son of a dull country squire. Queen Eleanor adopts him as an honorary grandson, but he still wants to know the truth from his mother. When he is alone with her after the trial, he seeks to force a confession from her and threatens to act like Nero:

And here by heauens eternall lampes I sweare,
As cursed Nero with his mother did,
So I with you, if you resolue me not.
(Troublesome Raigne 368-370)

The mother’s confession of infidelity paradoxically wins the son’s gratitude.
The image of Nero hovers over this dispute as a whole. Queen Eleanor had earlier chided Robert for his conduct toward his mother:

Ungracious youth, to rip thy mothers shame
the wombe from whence thou didst thy being take,
All honest eares abhorre thy wickednes,
But gold I see doth beate downe natures law.
(Troublesome Raigne 131-134)

And the mother echoes this rhetoric:

Let not these eares receiue the hissing sound
Of such a viper, who with poysoned words
Doth masserate the bowels of my soule.
(Troublesome Raigne 140-142)

Shakespeare’s Bastard is less violent with his mother, but an echo of the older play seems to show up in his tirade against the treacherous English barons, which collocates ‘Nero’ with ‘rip’, ‘shame’, and ‘womb’:

And you degenerate, you ingrate revolts,
You bloody Neroes, ripping up the womb
Of your dear mother England, blush for shame;
(KiJ 5.2.151-152)

It seems likely to me that the mother-son scene in the Troublesome Raigne shaped the closet scene in Hamlet. One clue is Hamlet’s comparison of himself to Nero:

O heart, lose not thy nature! let not ever
The soul of Nero enter this firm bosom,
Let me be cruel, not unnatural;
I will speake daggers to her, but use none.
(Ham. 3.2.394-396)

Hamlet as Nero generates the King’s name as Claudius, the predecessor and uncle of Nero. Gertrude thinks of her son as a kind of Nero when he says:

Come, come, and sit you down, you shall not boudge;
You go not till I set you up a glass
Where you may see the inmost part of you.
(Ham. 3.4.18-20)

This is very much like “masserate the bowels of my soul” in The Troublesome Raigne, but Gertrude understands it literally and exclaims “What wilt thou do? Thou wilt not murther me?” (Ham. 3.4.21). “Inmost” occurs only once in other Shakespearean plays, when the mad Titus looks for justice:

‘Tis you must dig with mattock and with spade,
And pierce the inmost centre of the earth;
(Tit. 4.3.11-12)

So ‘inmost’ is associated with the earth/mother.
There are additional verbal and scenic links between the mother-son scenes in The Troublesome Raigne and Hamlet. Hamlet dwells in great detail on the physical difference between his father and his uncle, and Philip points to the physical differences between him and his brother Robert. As Philip works up to the climax of his threat against his mother, he utters a phrase that sounds like a famous line in Hamlet:

Nay, what is he, or what am I to him?
When any one that knoweth how to carpe,
Will scarcely iudge vs both one Countrey borne.
This Madame, this, hath droue me from myselfe:
And here by heauens eternall lampes I sweare,
As cursed Nero with his mother did,
So I with you, if you resolue me not.
(The Troublesome Raigne 364-370)

Here is Hamlet’s response to the Player’s speech:

What’s Hecuba to him or he to Hecuba
That he should weep for her?
(Ham. 2.2.559-560).

The dispute between Robert and Philip Falconbridge about land and legitimacy left clear traces in the quarrel of Edmund and Edgar and may stand behind the curious remark by Edgar/Poor Tom: “Frateretto calls me, and tells me Nero is an angler in the lake of darkness” (KiL 3.6.6-7). Commentators refer this to a passage in Chaucer’s Monk’s Tale where we are told that

Nettes of gold threed hadde he greet plentee
To fisshe in Tybre, whan hym liste pleye.

But clearly Edgar’s phrase refers to Nero’s obscene curiosity, which Chaucer summarizes a few lines later:

His mooder made he in pitous array,
For he hire wombe slitte to biholde
Where he conceyved was

Does Edgar remember Nero because Shakespeare remembered the quarrel of Robert and Philip? Consider the associative links of the words ‘legitimate’ and ‘legitimation’. Legitimacy is a key concept in Edmund’s soliloquy:

Well then,
Legitimate Edgar, I must have your land.
Our father’s love is to the bastard Edmund
As to th’ legitimate. Fine word, “legitimate” !
Well, my legitimate, if this letter speed
And my invention thrive, Edmund the base
Shall [top] th’ legitimate. I grow, I prosper:
Now, gods, stand up for bastards!

‘Legitimate’ twice collocates with ‘land’ in King John. Robert asks for “my father’s land,” but the king replies that “your brother is legitimate” (KiJ 1.1.115-116). In the scene with his mother, the Bastard says:

But, mother, I am not Sir Robert’s son,
I have disclaim’d Sir Robert and my land,
Legitimation, name, and all is gone;
Then, good my mother, let me know my father;
Some proper man, I hope. Who was it, mother?
(KiJ 1.1.246-50)

There is of course nothing particularly surprising about the collocation of ‘land’ and ‘legitimacy’ since in an agricultural society disputes about legitimacy are inescapably bound to disputes about land. But the resemblances stretch further. Edmund makes much of the fact that “the lusty stealth of nature” (KiL 1.2.11) makes for men of “more composition and fierce quality” than “the whole tribe of fops” “got ‘tween asleep and wake” in the “dull, stale, tired bed” of marriage (KiL 1.2.11-15). Similarly Philip argues that Sir Robert could not have begotten him:

Madam, I was not old Sir Robert’s son;
Sir Robert might have eat his part in me
Upon Good Friday and ne’er broke his fast.
Sir Robert could do well — marry, to confess —
Could [he] get me. Sir Robert could not do it;
We know his handiwork. Therefore, good mother,
To whom am I beholding for these limbs?
Sir Robert never holp to make this leg.
(KiJ 1.1.233)

This theme does not recur elsewhere in Shakespeare, and a little detail clinches the case. Robert relates his father’s dying testimony that Philip was not his son

And if he were, he came into the world
Full fourteen weeks before the course of time.
(KiJ 1.1.112-113)

Edmund expostulates:

Wherefore should I
Stand in the plague of custom, and permit
The curiosity of nations to deprive me,
For that I am some twelve or fourteen moonshines
Lag of a brother? Why bastard? Wherefore base?
(KiL 1.2.2-6)

Since Robert’s lines immediately precede the exchange between him and the king quoted above (KiJ 1.1.115-116), the two plays share the triple collocation of ‘fourteen’, ‘legitimate’, and ‘land’ within the space of four and eleven lines. That is fairly powerful evidence for textual interdependence.
We may also observe the three Shakespearean occurrences of the unremarkable verb ‘pop,’ in which an adulterous ‘popping in’ pops somebody else out of his property. The sexual meaning is most explicit in Troilus and Cressida when Patroclus says to Menelaos:

But that’s no argument for kissing now,
For thus popp’d Paris in his hardiment,
And parted thus you and your argument.
(Tro. 4.5.27-29)

In King John, there is no explicit sexual reference, but the mother’s honor is very much at stake when the Bastard objects to Queen Elinor’s accusation that he casts shame on his mother:

I, madam? No, I have no reason for it;
That is my brother’s plea and none of mine,
The which if he can prove,’a pops me out
At least from fair five hundred pound a year.
Heaven guard my mother’s honor, and my land!

These two passages illuminate the most famous occurrence of the word in Hamlet’s description of Claudius as

He that hath kill’d my king and whor’d my mother,
Popp’d in between th’ election and my hopes,
(Ham. 5.2.64-65)

Posted in Quirky words

Shakespearean n-grams

 

The following is about 86 pairwise combinations of EMD plays that include Shakespeare on one side and sit 2.5 standard deviations above the average for all pairwise combinations. This is a crude and arbitrary cut-off and includes less than 1% of the some 11,000 pairwise combinations that include Shakespeare on one side. There is a “proof in the pudding” aspect to this list. It deserves our attention to the extent that it frames surprising results within a context of the clearly expected.

We observe first that seven of the top ten links, but only 44 of all 86, or barely more than half, involve same-author combinations. Second, about 40% of the links involve combinations of history plays, although such combinations account for just half a percent of all pairwise links. Third, plays with many shared n-grams are close to each other in time, as is apparent from an Anderson-Darling Normality Test for elapsed time between plays:

 

[Figure: summary for timelapse]

These are all very obvious findings and might prompt Horatio to mutter “There needs no ghost, my lord, come from the grave to tell us this.” The largest temporal outlier, Margaret Cavendish’s dramatic sketches, does not prima facie appear to be of much interest.

Things get a little more interesting if we group these links by authors. Shakespeare shares far more n-grams with Marlowe or Thomas Heywood than with any other writer.  In the case of Shakespeare-Heywood links, Heywood’s play is typically later and often much later. In the case of Marlowe, the n-grams involve links between Edward II or The Massacre at Paris with the three parts of Henry VI, Richard II and King John. The exact interrelation of the dates of all those plays is not entirely clear, but the exploration of shared n-grams between Marlowe and Shakespeare deserves a separate blog entry.

It is mildly interesting to look at Shakespeare’s plays in terms of the number of high-value links they share with each other or with plays by other writers. Here is a list of all of Shakespeare’s plays in chronological order with summary data:

circulationDate abb all_links same_author_links different_author_links
1589 1h6 5 4 1
1590 2h6 12 6 6
1590 3h6 15 7 8
1592 com 0 0 0
1592 ri3 9 6 3
1592 tam 2 0 2
1592 tgv 0 0 0
1592 tit 3 2 1
1594 kij 3 2 1
1594 lll 0 0 0
1595 mnd 0 0 0
1595 ri2 7 5 2
1595 roj 1 1 0
1596 1h4 6 5 1
1596 mev 1 1 0
1597 2h4 8 8 0
1597 mww 10 3 7
1598 man 11 7 4
1599 ayl 2 2 0
1599 he5 3 2 1
1599 juc 3 3 0
1600 ham 3 3 0
1601 tro 3 2 1
1601 twn 5 3 2
1602 aww 3 2 0
1603 oth 5 4 1
1604 mem 2 1 1
1605 kil 0 0 0
1606 ant 2 2 0
1606 mac 0 0 0
1607 cor 3 3 0
1607 per 0 0 0
1607 tim 0 0 0
1609 cym 1 1 0
1610 win 1 1 0
1611 tem 0 0 0
1612 he8 1 1 0

If you think of the aggregate of pairwise combinations as components of a citational network, the list makes it very clear that the network is biased towards history plays and towards the earlier part of Shakespeare. The list may be less interesting for what is on it than for what is not on it. You might have expected that some Shakespeare plays were cited or parodied by contemporaries, as the Spanish Tragedy was. But from the data in the EMD corpus it appears that this was not the case. None of the highly canonical plays (by our standards) figures prominently in a citational network. Leaving aside the history plays, the two plays with the highest scores are The Merry Wives and Much Ado.

An intelligible and also mildly interesting pattern emerges if we filter out history plays and look in descending order at the Shakespeare pairs with the highest z-scores for shared n-grams:


Rank comboid z-score linkweight2
18 juc_ant 3.97 12.59
21 3h6_tit 3.86 10.14
26 man_oth 3.51 9.52
30 tro_oth 3.41 10.40
34 2h4_man 3.15 8.33
37 mww_man 3.04 8.19
42 cym_win 2.95 7.84
43 ham_aww 2.95 8.56
44 roj_man 2.95 8.40
48 twn_aww 2.88 8.44
56 ayl_twn 2.82 6.85
60 2h6_tit 2.76 7.70
63 man_mem 2.72 8.85
65 mww_twn 2.70 7.81
66 juc_tro 2.69 8.45
71 aww_oth 2.61 7.76
76 ant_cor 2.60 7.56
77 2h4_ham 2.60 7.88
78 mev_man 2.60 7.54
80 juc_cor 2.56 7.65
83 2h4_cor 2.53 7.41
84 man_ayl 2.52 6.89
86 ham_oth 2.51 6.70

Plays share n-grams because they are close in time or subject matter. The frequent occurrence of Much Ado on this table is worth a closer look.

Posted in Early Modern Drama, Scalable Reading

Did Nicholas Udall write the History of Jacob and Esau?

 

The authorship of Jacob and Esau, an interlude from the mid-sixteenth century, is disputed. In a recent essay John Curran writes that Nicholas Udall and William Hunnis are “the most prevalent candidates,” and after reviewing the evidence he says that he inclines towards Hunnis (Jacob and Esau and the Iconoclasm of Merit, SEL, 49, 2009, 285-309).

I will confess that I have only glanced at Jacob and Esau, but I think I have some pretty compelling evidence that Udall is much the likelier choice. In my tabulation of 48,618 pairwise combinations of plays by different authors in the EMD corpus, the combination of Udall’s Ralph Roister Doister and “anon’s” Jacob and Esau ranks fourth in the link weight of shared n-grams. Perhaps even more to the point, if we exclude the egregiously self-repeating Thomas Killigrew, this pairwise combination shares more or weightier n-grams than all but 17 out of 2367 combinations of two plays by the same author. It is just barely behind the link of the second and third parts of Henry VI, which is the weightiest play link in the Shakespeare canon.

A different way of stating the same facts is that the Roister-Jacob link sits ten standard deviations above the average for links between different-author plays and three standard deviations above the average for links between same-author plays. Something is going on here.

There is nothing particularly striking about the single hexagram, four pentagrams and 22 tetragrams that occur in these, and only in these, two plays:

  • now I will get me in
  • I did it for no
  • I am at all assays
  • I come to see if
  • may be in a readiness
  • but as for my
  • a mischief on all
  • a shame take him
  • as good as ere
  • word in thy ear
  • yea that I have
  • that thou hast me
  • when thou wilt I
  • none other life with
  • not by and by
  • be found such a
  • say what you lust
  • speak out aloud I
  • I am a lad
  • to God I vow
  • I will again to
  • I shall to my
  • I pray you sweet
  • him in very deed
  • I feel no manner
  • I hear them speak
  • since I was borne

But their frequency rules out coincidence. As I showed in the discussion of two works by Gascoigne, the odds of getting between 10 and 15 shared tetra- or pentagrams in a random draw are on the order of 1:10,000. The odds for getting two dozen are considerably worse.

If the odds don’t speak to you, look at the n-grams shared between Othello and either Hamlet or Measure for Measure:


Hamlet and Othello:

  • I know not is it
  • good my lord I would
  • my lord what ho
  • from you do you
  • it will away again
  • it so fell out
  • father and his friends
  • sight let it be
  • with the act of
  • what make you from

Othello and Measure for Measure:

  • I think thou dost and
  • here is a change indeed
  • he must die be it
  • why very well then
  • adieu we must not
  • but hark what noise


But perhaps Ralph Roister Doister and Jacob and Esau share a lot of n-grams because there is less variety in plays before 1560. There is something to that. The EMD corpus has only eleven texts before 1565, and their 55 pairwise combinations have relatively high values:

N Minimum 25% Median 75% 90% Maximum
55 0.990 2.890 5.900 10.070 18.16 29.860

Compared with the interquartile range for all plays (2.75-5.79), the interquartile range for the early plays is shifted somewhat to the right, especially at the top (2.89-10.07). But Roister-Jacob is the middle of a top trio with link weight values of 29.9, 28.9 and 26.4, which translate into 2.5 to 3 standard deviations above the average for plays before 1565. Roister-Jacob has marginally fewer shared n-grams than the translations of the Senecan Thyestes and Trojan Women and marginally more shared n-grams than the Three Laws of Nature and Chief Promises of God Unto Man, both attributed to John Bale.

To sum up: the shared n-grams of Jacob and Esau and Ralph Roister Doister put this combination of plays into the company of plays that either exhibit clear traces of A borrowing from B or are known to be by the same author. Which is the case here? The unremarkable nature of shared n-grams points strongly to the second alternative.

Posted in Uncategorized

The Top Fifty n-gram heavy play links

This blog entry continues the entry on “Authors are trumps” and looks at the top fifty play links, which score at the 99.9th percentile of shared n-grams. What can we learn from this list without actually looking at the plays? Or, if we think about it as an exercise in scalable reading, what can we learn about the plays from just looking at the pairwise combinations that we wouldn’t learn from actually reading the plays?

Here is a link to a list of the fifty play combinations that have the highest linkweight value for shared n-grams. The top three entries are in a class of their own. Jonson’s Fortunate Isles is a short masque that recycles substantial chunks from Neptune’s Triumph, another masque produced in the previous year. The first part of Killigrew’s Thomaso has a mountebank scene that liberally borrows from Volpone.  Shirley’s Contention for honour and riches (1631) and Honoria and Mammon tell you through their titles that they are close cousins.

Ten of the top fifty pairwise combinations come from two-part plays, giving rise to another “duh moment.” But we also observe that six of them have values that sit below the median for shared n-grams within the same play. Repetition declines sharply with distance.

23 or almost half of the outlier values involve same play repetitions by Thomas Killigrew. They also add up to almost half (23/55) of the possible links between Killigrew’s eleven plays. I have never read a play by Killigrew, and I am not sure I ever will. Here may be a case where scalable reading from a distance tells you most of what you want to know. Not only do you get the expected Cicillia I and II or Bellamira I and II, you also get pairwise combinations of Cicillia and Bellamira. We leave Killigrew with the recognition that his plays have a lot of Bellamiras, Cicillias, Claracillas, parsons, pilgrims, and prisoners.

Philip Massinger has 14 plays in the EMD corpus. Six of the 91 possible links between his plays show up in this top list. He, too, appears to be an author who likes to repeat himself.

Gascoigne’s Supposes and Glass of Government

Gascoigne’s Supposes and Glass of Government share two heptagrams and two nonagrams as well as an octagram. Here they are:

  • from bodily perils in the cradle from danger of
  • this is somewhat yet for by this means I
  • oh that I could tell where to find
  • my mind gives that I shall
  • but be you sure that I shall

These are not especially remarkable phrases, but there are only 1630 repeated n-grams that are longer than six words. The odds are low that five of them will show up at random in one of some 50,000 possible play links. A similar point can be made about the 41 tetragrams and pentagrams that are shared between the two plays. There are about 150,00 such n-grams in the EMD corpus. A simulation of their random distribution across all possible play combinations gives this pattern:

 

[Figure: simulated random distribution of shared tetragrams and pentagrams across play links]

You would have to play this game for a very long time before generating a random result with a play link that shares 41 hits.
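
A simulation of this kind can be sketched in a few lines. This is only a minimal sketch under a strong assumption, namely that every shared n-gram lands on one of the 51,040 possible play links uniformly at random. The function names and the figure of 15,000 n-grams are placeholders for illustration and are not taken from the corpus or from the procedure behind the chart.

```python
import random
from collections import Counter

def simulate_hits(n_ngrams, n_links, seed=0):
    """Drop each n-gram on one play link chosen uniformly at random.

    Returns a Counter mapping link index -> number of n-grams it received.
    Uniform placement is an assumption of this sketch.
    """
    rng = random.Random(seed)
    return Counter(rng.randrange(n_links) for _ in range(n_ngrams))

def hit_histogram(hits, n_links):
    """Map 'number of hits' -> 'number of links with that many hits'."""
    histogram = Counter(hits.values())
    histogram[0] = n_links - len(hits)  # links that received nothing
    return dict(sorted(histogram.items()))

N_LINKS = 320 * 319 // 2   # 51,040 pairwise combinations of 320 plays
N_NGRAMS = 15_000          # placeholder count; substitute the real figure

hits = simulate_hits(N_NGRAMS, N_LINKS)
histogram = hit_histogram(hits, N_LINKS)
print(histogram)
print("largest number of hits on one link:", max(histogram))
```

Rerunning with different seeds gives a feel for how rarely a random draw piles more than a handful of hits on a single link.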

Two-Part Plays

Here is a quick tabulation of two-part plays, ignoring Thomas Killigrew:

author playlink linkweight
marl tamb1_tamb2 63.78
heyt 1edw4_2edw4 31.44
sha 2h6_3h6 29.12
heyt ironage1_ironage2 28.86
sha 1h6_2h6 26.74
heyt maidwest1_maidwest2 25.49
sha 1h6_3h6 21.14
sha 1h4_2h4 18.64
dekker honwhore1_honwhore2 14.83
heyt knownotme1_knownotme2 11.21

Unsurprisingly, all these play links share many n-grams. All of them sit above the 99th percentile. Marlowe’s Tamburlaine stands out as a play of unusual repetitiveness. The two plays do not share many long n-grams. The hexagram “the Turk and his great empress” is the longest, and there are only five pentagrams, including

  • region of the air and
  • the great and mighty Tamburlaine
  • a terror to the world
  • Turke and his great empress
  • and terror of the world
  • fill all the world with
  • the monarch of the east

But there are 33 tetragrams and 56 trigrams that occur only in the two parts of Tamburlaine.

 

It is instructive to compare this with the second and third parts of Henry VI. There are more long repetitions, though they are rather bland:

  • what news why comest thou in such
  • out some other chase for I myself
  • the son of Henry the fifth
  • the duke of York is
  • the duke of York I
  • king at nine months old
  • in the heart of France
  • for the duke of York
  • and myself with all the

There are only thirteen shared tetragrams and 35 trigrams. Some basic facts about the distinct rhetorical timbre of Marlowe’s play emerge very clearly from these simple comparisons. The point is underscored by the fact that with the exception of Edward II and the Massacre at Paris, the other same-author values for Marlowe sit below the median. That is the topic for another blog.

 

Posted in Early Modern Drama, Scalable Reading

Authors are trumps

What do repeated phrases or n-grams tell us about how distant from or close to each other pairs of early modern plays are?  Do n-grams provide  dependable measures of distance, and can we learn from them about the weight of various factors that differentiate between one play and another, whether by date, genre, or author?

The following is a report on various experiments with n-grams. My quantitative skills are very limited, and the techniques I use are crude and idiosyncratic, but from a proof of the pudding perspective, they work well enough. That is to say, where the results show very little or a whole lot of X or Y, they typically confirm what a knowledgeable observer knows already. The very obviousness of such results inspires confidence that the method works, at least some of the time.

The major finding of my n-gram analysis is that authors are trumps.  Plays that are known to be by the same author share on average twice as many n-grams as plays that are related by genre or are written in the same five-year span regardless of author. The “author effect” is much stronger than any other factor that establishes links between plays.

But strong as the author effect is, it is of relatively little help in determining whether play X was written by author Y.  We can say with great confidence that plays by the same author share more n-grams, just as we say with great confidence that men are on average taller than women by a percentage that can be specified with some precision. But we cannot argue that a person who is five foot nine inches tall is a man. Some women are much taller than many men, and for reasons that may or may not be worthy of further inquiry, some plays by different authors share many more n-grams than plays by the same author.  That is the old story of statistical analysis as being very helpful in general and not so helpful in determining the individual case.

How I got my n-grams

I have a table of some 730,000 distinct n-grams, extracted from the 320 plays in the EMD corpus. The n-grams range in length from three to 77 words and are distributed between one and 301 plays. Document frequency is the technical term for the number of distinct plays in which an n-gram appears. Collection frequency is the technical term for the total count of an n-gram in a collection. “I will not,” which occurs 2,320 times in 301 plays, is the most common trigram in terms of both document and collection frequency.
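
The two frequency measures can be illustrated with a toy example. This is a minimal sketch, assuming that each play is already available as a list of standardized words. The play ids, the texts, and the function names are invented for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def frequencies(plays, n):
    """Count collection and document frequency for every n-gram.

    plays maps a play id to its list of tokens.
    """
    collection_freq = Counter()  # total occurrences across all plays
    document_freq = Counter()    # number of distinct plays containing it
    for tokens in plays.values():
        grams = ngrams(tokens, n)
        collection_freq.update(grams)
        document_freq.update(set(grams))  # count each play only once
    return collection_freq, document_freq

# Toy data, invented for illustration.
plays = {
    "play_a": "i will not go and i will not stay".split(),
    "play_b": "i will not yield".split(),
    "play_c": "you and i have met".split(),
}
cf, df = frequencies(plays, 3)
print(cf[("i", "will", "not")], df[("i", "will", "not")])  # 3 2
```

Here “i will not” has a collection frequency of 3 because it occurs twice in one play and once in another, and a document frequency of 2 because it appears in two distinct plays.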

The n-grams were extracted in a tedious manner, which I will describe elsewhere, and they are based on a rather abstract representation of the underlying texts. The EMD corpus was linguistically annotated, and each “surface form” or spelling was mapped to a combination of a particular lemma and POS tag. From that combination I derived a standard spelling.  There are a few problems with that method, but they are outweighed by the advantage of leveling orthographic variance to a common standardized form of each word that has been linguistically defined in the same way. In this method, differences between texts are functions of those texts and not of some printer’s spelling habits.
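
The mapping from spelling to standard form might look roughly like the following. It is a hedged sketch, not the actual pipeline: the two lookup tables, their entries, and the tag names are hypothetical, and the sketch ignores the fact that the same spelling can map to different lemmas or POS tags depending on context.

```python
# Hypothetical lookup tables; the entries and tag names are invented.
SURFACE_TO_LEMMA_POS = {
    "heauens": ("heaven", "n2"),
    "heavens": ("heaven", "n2"),
    "sweare": ("swear", "vvb"),
    "swear": ("swear", "vvb"),
}
STANDARD_SPELLING = {
    ("heaven", "n2"): "heavens",
    ("swear", "vvb"): "swear",
}

def standardize(tokens):
    """Replace each spelling with the standard form of its lemma/POS pair.

    Tokens missing from the tables are passed through in lower case.
    """
    out = []
    for token in tokens:
        key = SURFACE_TO_LEMMA_POS.get(token.lower())
        out.append(STANDARD_SPELLING.get(key, token.lower()))
    return out

print(standardize("by heauens eternall lampes I sweare".split()))
```

Once both “heauens” and “heavens” come out as the same standardized word, an n-gram containing either spelling counts as the same n-gram.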

Play links and how to weigh them

Every n-gram that is shared between one play and another establishes a link, however tenuous, between those two plays. The value of that link depends on four factors:

  1. The length of the n-gram. The frequency of n-grams declines very rapidly with length. There are some 460,000 distinct trigrams but only 1,360 heptagrams. You could define the value of an n-gram as 1 divided by the number of n-grams of that type. But such a way of counting assigns enormous weight to the occasional long n-gram. I used the square root of n-gram length instead. This gives a value of 1.73 to a trigram and a value of 8.66 to a 75-gram.
  2. The document frequency of the n-gram. If we think of a given link as having a force equal to 1, we can derive its value from the number of pairwise combinations among which it is shared. A link that is unique to two plays has a value of 1. A link shared among three plays generates three pairwise combinations. The number rises rapidly according to the formula n(n-1)/2. Thus the power of an n-gram shared among 16 plays is dissipated across 120 or 15*8 pairwise combinations. In this inquiry I ignore n-grams that are shared among more than 16 plays because their values are too small and scattered to make a difference.
  3. There is a difference between n-grams that string together common function words, such as “I will not” or “you and I” and n-grams that include weighty or rare words, such as “veni, vidi, vici.” I put words in log(10) bins and assigned them a value of 1/exponent. Thus a word that occurs five times has a value of 1/1, while a word that occurs 500 times has a value of 1/3. Then I added the log values of each word to get a weighted word value for the entire n-gram.
  4. A link that is shared between two short works is a rarer thing than a link shared between two long works. I treated each pair of plays as a single text with the aggregate word count of the two, counted the links of a certain type, and normalized their values to relative frequencies per 10,000 words.

You put those four things together and you get a simple formula: divide the length value of an n-gram by its document frequency value, multiply it first by the weighted word value of the n-gram and then by the relative frequency of its occurrences in a given pairwise combination of plays. You can add these values for all the n-grams in a pairwise combination and arrive at an overall weighted value for each link between two plays.
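
The formula can be sketched as follows. This is one possible reading of the four factors rather than the exact computation behind the reported values: the function names are invented, the bin edges for word counts (number of decimal digits) are an assumption that agrees with the two examples given, and the inputs at the bottom are made-up numbers.

```python
import math

def length_value(n_words):
    """Factor 1: square root of the n-gram's length in words."""
    return math.sqrt(n_words)

def pair_count(document_frequency):
    """Factor 2: pairwise combinations among the plays sharing the n-gram.

    document_frequency must be at least 2 for a shared n-gram.
    """
    return document_frequency * (document_frequency - 1) // 2

def word_value(corpus_count):
    """Factor 3: 1/bin, where the bin is the number of decimal digits.

    5 -> 1/1 and 500 -> 1/3, as in the text; the bin edges are an assumption.
    """
    return 1 / len(str(corpus_count))

def relative_frequency(occurrences, words_in_pair):
    """Factor 4: occurrences per 10,000 words of the two plays combined."""
    return occurrences / words_in_pair * 10_000

def ngram_value(word_counts, document_frequency, occurrences, words_in_pair):
    """Combine the four factors for one n-gram in one play pair."""
    weighted_words = sum(word_value(c) for c in word_counts)
    return (length_value(len(word_counts)) / pair_count(document_frequency)
            * weighted_words
            * relative_frequency(occurrences, words_in_pair))

def link_weight(ngrams_in_pair, words_in_pair):
    """Sum the values of all n-grams shared by one pair of plays."""
    return sum(ngram_value(counts, df, occ, words_in_pair)
               for counts, df, occ in ngrams_in_pair)

# Invented inputs: (corpus counts of each word, document frequency, occurrences).
shared = [([3, 12, 7, 40], 2, 2),              # rare tetragram unique to the pair
          ([25_000, 18_000, 9_000], 16, 2)]    # common trigram found in 16 plays
print(round(link_weight(shared, 40_000), 4))
```

In this toy case nearly all of the link weight comes from the rare tetragram, which is the behaviour the four factors are meant to produce.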

 

This is a pretty crude method of weighing n-grams, but it works reasonably well. It tells you, for instance, that “I will tell you,” which occurs in 178 plays, has a value below the measuring threshold, that “you and I have,” which occurs in 16 plays, has the barely measurable value of 0.01, and that “quem facient aliena pericula,” which occurs in two plays, has the maximum value of 8.

A few observations about repetitions within and between plays

Before turning to n-grams that occur in two or more plays, it helps to know something about the frequency of repetition within plays. An n-gram is more likely to be repeated within a given work than in some other work. It is possible to add some quantitative precision to this unsurprising pronouncement. In the EMD corpus, there are about 25,000 n-grams of three or more words that are repeated at least once in a given play but do not occur in any other play. Since there are 320 plays, you expect a play to have on average about 75 n-grams that are unique to it. By contrast, the EMD corpus has about 385,000 n-grams of three or more words that occur only in two works and are shared between 51,040 pairwise combinations of plays. That works out to an average of 7.5 shared n-grams between any pairwise combination of plays. Leaving aside the actual distribution of n-grams across these “play links,” as I will call them, it appears that a whole order of magnitude separates repetition within and across plays.

The table below gives a “seven number summary” of the difference between repeated n-grams within a play or shared across two plays. It shows pretty clearly that play-crossing n-grams are actually quite rare. If you find more than a handful of them in a given pair of plays the odds are that something is going on.
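
The arithmetic behind these averages is easy to check. The counts below are the rounded figures from the paragraph above; only the variable names are invented.

```python
from math import comb

PLAYS = 320
pairs = comb(PLAYS, 2)            # pairwise combinations of plays
within_play_ngrams = 25_000       # repeated only inside a single play
cross_play_ngrams = 385_000       # found in exactly two plays

print(pairs)                                   # 51040
print(round(within_play_ngrams / PLAYS, 1))    # 78.1
print(round(cross_play_ngrams / pairs, 1))     # 7.5
```

With the rounded inputs the within-play average comes out nearer 78 than 75, which does not change the order-of-magnitude gap between the two figures.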

Value within one play across two plays
min 5.32 0
10% 20.61 0.56
25% 24.16 1.05
50% 36.45 1.67
75% 51.02 2.46
90% 68.38 3.32
max 214.9 35.98
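
A summary of this shape can be produced with the standard library. This is a minimal sketch: the function name is invented, the sample weights are made up, and the choice of the "inclusive" interpolation method is an assumption, since the tool behind the table is not stated.

```python
from statistics import quantiles

def seven_number_summary(values):
    """Return min, 10%, 25%, 50%, 75%, 90% and max of a list of numbers."""
    # quantiles(n=20) returns 19 cut points at 5%, 10%, ..., 95%.
    cuts = quantiles(values, n=20, method="inclusive")
    return {
        "min": min(values),
        "10%": cuts[1],
        "25%": cuts[4],
        "50%": cuts[9],
        "75%": cuts[14],
        "90%": cuts[17],
        "max": max(values),
    }

# Invented link weights, for illustration only.
weights = [0.0, 0.4, 0.9, 1.1, 1.6, 1.7, 2.2, 2.5, 3.1, 3.4, 35.9]
for label, value in seven_number_summary(weights).items():
    print(f"{label:>4} {value:.2f}")
```

The long gap between the 90th percentile and the maximum in the toy data mimics the pattern in the table, where a few extreme links sit far above the bulk.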

There is one other unsurprising, but in its way quite striking, fact about n-grams repeated within a play: their frequency diminishes rapidly with distance. Because plays differ markedly in length it makes sense to express distances in a normalized manner. You can express the distance between the two occurrences of an n-gram as a fraction of the total length of the work in which they occur. Repeated n-grams will thus be distributed in some fashion across deciles of distance. A chart of that distribution looks like this:

[Figure: repeated n-grams within a play by decile of normalized distance]

It is a reasonable inference from this chart that the occurrence of n-gram repetition within a play is strongly motivated by scenic context. If about half of all repetitions occur within the first decile and the drop from the first to the second decile is a factor of six, we can see why repetitions across plays are so much rarer than repetitions within plays.
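
The decile count can be sketched as follows. This is a minimal sketch under assumptions of my own: distance is measured in words between consecutive occurrences of the same n-gram, and the function name and toy text are invented.

```python
from collections import Counter

def repetition_deciles(tokens, n):
    """Count repeated n-grams by the decile of their normalized distance.

    Distance is measured between consecutive occurrences of the same n-gram
    and divided by the length of the play, so it lies between 0 and 1.
    """
    last_seen = {}
    deciles = Counter()
    total = len(tokens)
    for i in range(total - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in last_seen:
            fraction = (i - last_seen[gram]) / total
            deciles[min(int(fraction * 10), 9) + 1] += 1  # deciles 1..10
        last_seen[gram] = i
    return dict(sorted(deciles.items()))

# Toy text of 40 words: one close repetition and one distant repetition.
filler = [f"w{k}" for k in range(31)]
tokens = "i will not i will not".split() + filler + "i will not".split()
print(repetition_deciles(tokens, 3))  # {1: 1, 9: 1}
```

Run over a real play, the counts per decile would give the kind of distribution the chart describes.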

What can you learn from tabulating the results?

N-grams shared between different plays come in two flavours. Some of them are unique to a particular pairwise combination. Others are shared among several plays. For now I add their values. The “link weight” that is generated by the various multiplications, divisions, and additions is a rather abstract “rep value unit,” to be valued only in the proportions it points to. This point is made even clearer if you “standardize” the weights and create “z-scores” by subtracting the average from a given value and dividing the result by the standard deviation. The result of that operation tells you how many standard deviations a given value sits above or below the average. In a normal distribution 95% of values are found within two standard deviations of the average. Thus a person with a height z-score of +2 is quite tall, while a person with a z-score of -2 is quite short. The weighted value for n-grams in pairwise combinations of plays ranges from 0 to 1393. The corresponding z-scores are -0.4 and 212.3, which is an astronomical number. This outlier comes from Jonson’s Neptune’s Triumph and the Fortunate Isles, two masques that recycle a lot of materials.
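
Standardizing link weights takes only a few lines. This is a minimal sketch with invented play-link ids and weights. It uses the population standard deviation, which is an assumption, since the text does not say which variant was used.

```python
from statistics import mean, pstdev

def z_scores(weights):
    """Map each play link to (weight - mean) / standard deviation."""
    values = list(weights.values())
    average, spread = mean(values), pstdev(values)
    return {link: (w - average) / spread for link, w in weights.items()}

# Invented link weights, for illustration only.
weights = {"a_b": 2.0, "a_c": 4.0, "a_d": 4.0, "b_c": 4.0,
           "b_d": 5.0, "c_d": 5.0, "c_e": 7.0, "d_e": 9.0}
scores = z_scores(weights)
outliers = {link: round(z, 2) for link, z in scores.items() if z > 1.5}
print(outliers)  # {'d_e': 2.0}
```

The toy weights have a mean of 5 and a standard deviation of 2, so the link with weight 9 sits two standard deviations above the average.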

In the following descriptive statistics the most interesting results are not found in the outliers, but in the shifting values of the interquartile range from the 25th to the 75th percentile. Here is the table that reveals differences when you group pairwise combinations of plays by author, genre, or date:

play links by category n min 10% 25% 50% 75% 90% 99% 99.90% max
All plays 51040 0 1.63 2.75 4.16 5.79 7.66 13.83 26.54 1451.72
same author 2356 0 3.83 5.62 8.01 11.73 16.38 30.94 70.31 1451.72
different authors 48618 0 1.61 2.71 4.06 5.6 7.26 11.1 16.5 172.6
less than five years apart 6938 0 2.1 3.44 5.04 7.02 9.9 22 42.92 1451.72
less than five years apart, different authors 6047 0 2.01 3.27 4.71 6.35 8.19 12.68 20.69 28.88
more than thirty years apart 14845 0.01 0.31 2.11 3.28 4.69 6.19 9.75 13.99 172.6
same genre 15727 0 2.43 3.51 4.95 6.73 8.83 16.6 32.92 1451.72
same genre, same author 871 0 5.17 7.11 10.14 13.6 19.14 39.97 1471.52

From this table you learn that the interquartile range for plays by the same author (5.6-11.7) is twice as high as the comparable range for plays by different authors (2.71-5.6). So there is a quite marked author effect, which shows up throughout the range. Plays less than five years apart share more n-grams (3.44-7.02) than play pairs that are more than thirty years apart (2.11-4.69), but the time effect is considerably less than the author effect. The same is true of genre. The same genre or a narrow time gap will shift the interquartile ranges upward by 15% to 25%.
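
The size of these shifts can be read off the table with a little arithmetic. The quartile values below are copied from the table above. The helper name `shift` is invented, and dividing quartile by quartile is just one simple way of expressing the shift.

```python
# (25th percentile, median, 75th percentile) copied from the table above.
ROWS = {
    "all plays": (2.75, 4.16, 5.79),
    "same author": (5.62, 8.01, 11.73),
    "different authors": (2.71, 4.06, 5.60),
    "less than five years apart": (3.44, 5.04, 7.02),
    "more than thirty years apart": (2.11, 3.28, 4.69),
    "same genre": (3.51, 4.95, 6.73),
}

def shift(row, baseline):
    """Ratio of each quartile in `row` to the same quartile in `baseline`."""
    return tuple(round(a / b, 2) for a, b in zip(ROWS[row], ROWS[baseline]))

print(shift("same author", "different authors"))
print(shift("less than five years apart", "all plays"))
print(shift("same genre", "all plays"))
```

The same-author ratios all come out close to 2, while the time and genre ratios stay well below 1.3.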

There is a big difference between 25% and 100%. Authors are trumps. There is more to be said about how to make sense of these crude things when looking at particular plays. But that is a story for another blog.

Posted in Early Modern Drama, Scalable Reading