Corpus and Text Linguistics with R (CTL-R)

A very simple introduction for beginners, by Cornelius Puschmann.

Preface

Why another introduction to R? I created this tutorial because I felt there was a genuine lack of materials that a) don’t presume any knowledge of programming or statistics and b) are aimed at people who want to use R for working with texts, rather than numbers. Of course, R’s statistical capabilities make it unique, but its potential for text-wrangling is equally impressive, especially because of its ability to create graphs and the many excellent packages available for language-based scholarship. My objective was to create a resource for linguists, but also for other researchers working with texts (social scientists, literary scholars, historians), to allow them to get their feet wet using R without the danger of drowning. If CTL-R is a useful starting point for doing that, I’ve succeeded.

Get in touch with me with your praise, criticism, or cookies. Note that this is still a work in progress.

Table of contents

  1. Downloading and installing R
  2. Talking to R
  3. Functions
  4. Vectors
  5. Working with strings
  6. Reading and writing files
  7. Plotting
  8. Doing text analysis
    1. Simple word frequencies
    2. Concordancing/key words in context
    3. Clustering texts by their lexicon
    4. Lexical density

Downloading and installing R

R is open source software, developed cooperatively by developers from around the world. It is based on the older S language and can be imagined as a combination of spreadsheet software and programming language, giving it a huge amount of flexibility.

After starting R for the first time, you’ll be faced with something similar to the screenshot below.

[Screenshot: The R Console]

The console’s interface might look slightly different on your platform (the screenshot is taken from the Mac version using the German locale). A number of useful programs are available for interacting with R: Tinn-R and RStudio are among the most notable contenders.

Talking to R

Initially it’s helpful to imagine your conversation with R like a conversation with a person. R will answer in a very coarse and predictable way, but when you type a command into the R console, you’ll get a response. Let’s start by saying hello.

> Hello R!
Error: unexpected symbol in "Hello R"
>

Unfortunately that doesn’t work. Let’s try something else instead.

> 42
[1] 42
> 

It’s important to understand why R responds differently to 42 than to Hello R!. To R, everything you type in is either an object, a function, or an operator. We’ll get back to what that means later. For now, it’s enough to note that when something looks like any of these things, but isn’t, R will complain. The second thing to note is that when using the console, R will respond to us after many (but not all) commands. Finally, note how R puts [1] in front of its answer. This is R’s way of telling us that what follows is the first (and, here, the only) value of the result.

> "Hello R!"
[1] "Hello R!"
> 

Apparently there is a difference between Hello R! and “Hello R!”. We’ll get back to that. Let’s try some basic arithmetic.

> 2+2
[1] 4
>
> 2-2
[1] 0
>
> 2/2
[1] 1
>
> 2*2
[1] 4
>
> 1:2
[1] 1 2
> 

The output of the very last command looks a bit odd. It’s not really the result of a math operation; rather, we are using 1 and 2 as the start and end points of a range. Let’s be a bit bolder.

> 1:50
[1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
[27] 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
> 

Note the [27] at the start of the second line. The number in square brackets tells us the position of the first value on that line, which is handy when R’s response runs over more than one line. We’ll get back to what ranges are good for later. In the arithmetic examples, rather than just parroting the input we’ve given it, R works out the result. Sure, the calculations are not exactly exciting, but the principle works the same with more complex requests.
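
If you’d like to play with ranges and arithmetic a little more, here is a small sketch to type in. seq() is a base R function that hasn’t been introduced yet; I’m using it here only to show that a range doesn’t have to go up in steps of one.

```r
# A range can also count downwards.
5:1

# seq() builds a range with a step size of your choosing (here: 2).
seq(2, 10, by = 2)

# Operators follow the usual rules of arithmetic:
# multiplication is applied before addition ...
2 + 2 * 3

# ... unless parentheses say otherwise.
(2 + 2) * 3
```

The last two lines give 8 and 12 respectively, so it pays to use parentheses whenever you’re unsure about the order in which R will work things out.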

Functions

Now that we’ve established the barest basics on how to have a chat with R, let’s look at functions.

> round(1.75)
[1] 2
> 

As said before, R expects things to be an object, function or operator. round() is a function that, as the name suggests, is used for rounding numbers. R has hundreds of functions in its vocabulary (thousands if you count external libraries), which do all sorts of things, from statistical analysis to drawing graphs. Don’t worry, initially you’ll fare just fine with a mere handful of them.

> toupper("Hello R!")
[1] "HELLO R!"
> 

You can probably guess that toupper() takes the text “Hello R!” and converts it to upper case “HELLO R!”. But while toupper() does something quite different from round(), both work by the same principle. A function takes one or more arguments and then returns a result. Functions are used to round numbers, transform text to upper case, split data into smaller units, sort data, draw charts, and many other things. A few functions work fine without any arguments, and a few don’t return anything, but they’re in the minority. Want to know what a given function does? Try this:

> help(round)

help() is a very useful function that explains how other functions work. The help texts in R can be a bit cryptic because it is assumed that you’re a statistics buff, but a very useful part of each help page that anyone can grasp is the function syntax. For round(), it looks like this:

round(x, digits = 0)

x   
a numeric vector. Or, for round and signif, a complex vector.

digits  
integer indicating the number of decimal places (round) or significant digits (signif) to be used. Negative values are allowed (see ‘Details’). 

This is less cryptic than it sounds at first. round() takes two arguments, but only the first is required, while the second is optional. x is just a placeholder for the thing that you want to apply the function to, while digits specifies the number of decimal places to round to.

> round(1.75, digits=1)
[1] 1.8
> 

Here, round() responds with the result 1.8 rather than 2, because we’ve specified the number of digits via the digits argument. Before, we left digits out, so round() fell back on its default of 0.
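
Here are a few more calls to try, as a rough sketch of how arguments can be passed. The numbers are made up for illustration; signif() is the sibling function mentioned on the help page for round().

```r
# Naming the optional argument makes the call easy to read.
round(3.14159, digits = 2)    # 3.14

# Arguments can also be matched by position: the second one is digits.
round(3.14159, 2)             # 3.14

# Negative values round to the left of the decimal point.
round(1234.567, digits = -2)  # 1200

# signif() counts significant digits rather than decimal places.
signif(1234.567, digits = 2)  # 1200
```

Naming arguments is optional, but it tends to make commands easier to understand when you come back to them later.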

Vectors

So far, R is really just a glorified typewriter. But things are about to get a lot more interesting. Try this:

> my_number <- 10
> 

Read the <- directly to the right of my_number as an arrow pointing towards it. The first thing you might notice is that R doesn’t respond to your command in any visible way. But behind the scenes something has happened.

> my_number
[1] 10
>

R will respond with 10 if we ask it what value the object my_number has. We’ve created a vector! A vector is a basic type of object that’s used for a lot of different purposes. Sometimes objects are also referred to as data structures or variables, depending on who you ask. The arrow assigns the vector a value, in this case 10. That value is now stored in my_number, like a passenger riding on a train.

> another_number <- 4
> my_number+another_number
[1] 14
> 

We created another vector called another_number and assigned it a value of 4. Then we’ve added the two together. Obviously we can also do other things.

> my_number*another_number
[1] 40
> my_number/another_number
[1] 2.5
> my_number^another_number
[1] 10000
>

You get the idea. You might wonder how to change the value assigned to a vector you’ve previously created. It’s very simple:

> my_number <- 20
> my_number
[1] 20
> 

You’ve changed my_number from 10 to 20 by assigning it a new value. It’s likely that while learning R, you’ll overwrite vectors by accident with incorrect values. The arrow keys are your friends when that happens: just use arrow up to go back to where you assigned the correct value and arrow down to go back to your most recent command.
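
A vector can also be updated using its own current value. The following is a minimal sketch; backup is a name I’ve made up for this example.

```r
my_number <- 10

# The right-hand side is worked out first, then stored back in my_number.
my_number <- my_number + 5
my_number              # 15

# Copying a vector creates an independent object:
# changing my_number afterwards leaves the copy alone.
backup <- my_number
my_number <- 0
backup                 # still 15
```

Keeping a copy like this before experimenting is a cheap way of protecting yourself against accidental overwrites.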

Vectors don’t have to be numbers; they can also contain text.

> some_person <- "Mike"
> another_person <- "Jane"
> 

The vectors some_person and another_person are character vectors. While those are not inherently different from numerical vectors, certain functions only work on character vectors, while others only work on numbers.

> round(some_person)
Error in round(some_person) : 
non-numeric argument to mathematical function
> 
> some_person + another_person
Error in some_person + another_person : 
non-numeric argument to binary operator
> 

This doesn’t work, for obvious reasons: you can round a number, but not a piece of text. You also can’t “add” two pieces of text, at least not mathematically. Joining two pieces of text (called character vectors or strings) together is possible though (see the next chapter).
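
For the impatient, here is a quick sketch of how joining works, using the base R functions paste() and paste0(), which haven’t been introduced in this tutorial yet.

```r
some_person <- "Mike"
another_person <- "Jane"

# paste() joins strings and puts a space between them by default.
paste(some_person, another_person)                 # "Mike Jane"

# The sep argument changes what goes in between.
paste(some_person, another_person, sep = " and ")  # "Mike and Jane"

# paste0() joins strings with nothing in between.
paste0(some_person, another_person)                # "MikeJane"
```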

> toupper(some_person)
[1] "MIKE"
> 

This function converts our character vector some_person to upper case (duh).

A few things about vectors. The names used so far (my_number, another_number, some_person) are completely arbitrary and could also be different (x, y, z or maybe Homer, Marge and Bart, or perhaps my.favorite.vector and my_not_so_favorite_vector). Remember when we typed in Hello R! at the beginning? R expected Hello to be an object, as it is not a known operator (things like + - / * but also <-) and not a known function (for that, the function name would also have to be followed by brackets, as in round() or toupper()). Remember that quotes tell R that something is a string. Very often when things go wrong, R is confused about where a string ends, because you’ve forgotten a quote.

> toupper(10)
[1] "10"
> 

Perhaps this takes you by surprise: instead of responding with an error, R gives a response when asked to convert a number to upper case. This is because it’s trying to help us by assuming that we really intend our number to be a string. To R, 10 and “10” are two different things: only the first is a mathematical value, while the second is a lump of characters.
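
If you’re ever unsure which of the two you are dealing with, you can ask R. The sketch below uses the base R functions class(), as.character() and as.numeric(), none of which have been covered so far.

```r
# class() reports what kind of value something is.
class(10)     # "numeric"
class("10")   # "character"

# Converting explicitly, instead of relying on R's guesswork:
as.character(10)       # "10"
as.numeric("10") + 1   # 11
```

Converting on purpose like this is usually safer than hoping that R guesses your intentions correctly.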

So far we’ve only stored a single value in a vector, which you can picture like a train with only a single car. To create a vector with more than one value (a train with a bunch of cars) we need to learn a new function.

> highscores <- c(15, 32, 44, 75)

The function c() is used to concatenate several elements into one vector. A vector can contain virtually any number of values. Yep, that means that all the words in your massive corpus can be stored in a single vector. The function length() is useful to determine the number of elements in a vector.

> length(highscores)
[1] 4
> 
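
A short sketch of a few more things c() can do. The name more_scores and the extra values are made up for this example.

```r
highscores <- c(15, 32, 44, 75)

# c() also accepts existing vectors, so it can be used to extend one.
more_scores <- c(highscores, 90, 12)
length(more_scores)   # 6

# Arithmetic is applied to every element of a vector at once.
highscores + 5        # 20 37 49 80

# A vector holds only one type of value: mixing text and
# numbers turns everything into text.
c("a", 1)             # "a" "1"
```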

To see the entire contents of a vector, you simply type in its name. What if you just want a specific element, not everything?

> highscores[2]
[1] 32

Much of the time you don’t want all elements stored in a vector, but only specific ones. This is where the index vector (the number in the square brackets) comes in handy. highscores[2] refers to the second value stored in highscores, which is 32. Imagine telling R that you want just the second car from the highscores train, rather than everything. The way of expressing this may look weird at first, but after a while you’ll appreciate indexes.

> highscores[4]
[1] 75
> 
> highscores[3:4]
[1] 44 75
> 
> highscores[-1]
[1] 32 44 75
> 
> highscores[c(1,4)]
[1] 15 75
> 
> highscores[highscores>40]
[1] 44 75
> 

Remember ranges? In highscores[3:4] we’ve made use of one by selecting the third to fourth values in the highscores vector. highscores[-1] slices the first value off, while highscores[c(1,4)] is a bit tricky: it selects the first and fourth values from highscores. How does it do this? By creating another vector using the c() function. Actually, all indexes are really vectors themselves. Don’t worry if that seems quirky to you right now, you’ll get used to R’s logic if you work with vectors on a regular basis.

Now you probably know why R generally responds with [1] …: it’s providing an answer to our questions in the form of a vector containing (usually) one value, and [1] marks the first value in that vector.
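
To see what is going on inside highscores[highscores>40], it helps to look at the pieces separately. This is a sketch; which() is a base R function not covered above, and the replacement value 33 is made up.

```r
highscores <- c(15, 32, 44, 75)

# The comparison on its own gives one answer per element.
highscores > 40          # FALSE FALSE  TRUE  TRUE

# which() turns those answers into positions.
which(highscores > 40)   # 3 4

# length() can be used to pick the last element, whatever the size.
highscores[length(highscores)]   # 75

# An index also works on the left of the arrow, to replace a single value.
highscores[2] <- 33
highscores               # 15 33 44 75
```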

Here are a few more functions for analyzing the contents of a vector:

> min(highscores)
[1] 15
> 
> max(highscores)
[1] 75
> 
> mean(highscores)
[1] 41.5
> 
> sort(highscores)
[1] 15 32 44 75
> 

What they do shouldn’t be too hard to guess: min() returns the smallest value in the vector, max() the largest value, mean() calculates the arithmetic mean and sort() orders the values from the smallest to the largest.
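
A handful of related base R functions, sketched here for completeness (none of them are needed for the rest of this tutorial):

```r
highscores <- c(15, 32, 44, 75)

# sort() can also order from largest to smallest.
sort(highscores, decreasing = TRUE)   # 75 44 32 15

sum(highscores)      # 166: all values added up
median(highscores)   # 38: the middle of the sorted values
range(highscores)    # 15 75: smallest and largest value together
```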

> str(highscores)
num [1:4] 15 32 44 75
> 

If you need more information about a vector, the str() function is your friend. It will show you the vector’s type (num or chr), its index range, and the elements inside it.
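
For comparison, this is roughly what the same check looks like for a character vector. The vector people is made up for this sketch, and is.numeric() and is.character() are base R functions not discussed above.

```r
people <- c("Mike", "Jane")

# For text, str() reports chr instead of num.
str(people)            # chr [1:2] "Mike" "Jane"

# These return TRUE or FALSE, which is handy for a quick check.
is.character(people)   # TRUE
is.numeric(people)     # FALSE
```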

Working with strings

You’ll start to see R’s potential working with character vectors containing more than one element.

> food <- c("bacon", "cheese", "anchovis", "milk")
> 

Let’s start by bringing our food vector into alphabetical order.

> sort(food)
[1] "anchovis" "bacon"    "cheese"   "milk"
>

Note that like with other functions (round(), toupper()) we haven’t actually changed a vector by applying a function to it. If we wanted to keep the change we’re making we would have to overwrite the existing vector or store it in a new one.

> food.sorted <- sort(food)
>

Have a look at what’s inside food. Note that I’ll continue to work with food, rather than with food.sorted in the steps below.
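
The sketch below puts the two vectors side by side. order() is a base R function I’m adding for illustration: instead of the sorted values, it returns the positions they came from.

```r
food <- c("bacon", "cheese", "anchovis", "milk")
food.sorted <- sort(food)

food          # still in the original order
food.sorted   # alphabetical order

# The two vectors hold the same elements, but not in the same order.
identical(food, food.sorted)   # FALSE

# order() shows where each sorted element sits in the original vector.
order(food)   # 3 1 2 4
```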

> match("cheese", food)
[1] 2

The function match() returns the position of the first element in a vector that matches a certain value. 2 here is the place where “cheese” is located: it’s the second element in food.

> match("apples", food)
[1] NA
> 

Looking for “apples” in our food vector does not return a result because there is no such element in the vector. Instead NA is returned, which is more accurate than returning 0 – that would suggest that apples is the 0th element. If you’ve used other programming languages you might be used to counting starting with 0. In R, indexes always start at 1.
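
If all you want to know is whether something is in the vector at all, a yes/no answer is often more convenient than a position. A small sketch, using the base R operator %in% and the function is.na(), neither of which has been introduced above:

```r
food <- c("bacon", "cheese", "anchovis", "milk")

# %in% answers with TRUE or FALSE instead of a position.
"cheese" %in% food   # TRUE
"apples" %in% food   # FALSE

# is.na() checks whether match() came back empty-handed.
is.na(match("apples", food))   # TRUE

# match() can look up several values in one go.
match(c("milk", "apples", "bacon"), food)   # 4 NA 1
```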

> food == "cheese"
[1] FALSE  TRUE FALSE FALSE
> 

By using the == operator we can check whether or not two given elements are the same. In the example above, this check is returned for each element in the vector. It returns FALSE for the first, third and fourth value in food and TRUE for the second element. Sometimes TRUE is shortened to T and FALSE to F for convenience. This kind of result (T or F) is called a logical vector and it’s a type of vector just like numeric and character vectors are.
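
Logical vectors become useful once you count them or use them as an index. The following is a sketch; is_cheese is a made-up name, and any() and nchar() are base R functions that haven’t been covered yet.

```r
food <- c("bacon", "cheese", "anchovis", "milk")
is_cheese <- food == "cheese"

# TRUE counts as 1 and FALSE as 0, so sum() counts the matches.
sum(is_cheese)     # 1

# any() asks whether there is at least one TRUE.
any(is_cheese)     # TRUE

# ! flips TRUE and FALSE, so this selects everything except cheese.
food[!is_cheese]   # "bacon" "anchovis" "milk"

# nchar() counts characters; here we keep words longer than four letters.
food[nchar(food) > 4]   # "bacon" "cheese" "anchovis"
```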

Enough about logical vectors – let’s start working with sentences. Below is a sequence of steps to turn a sentence into a table of word frequencies.

> sentence <- "Mike and Jen and Sue like pie, and they also like chocolate."
> sentence
[1] "Mike and Jen and Sue like pie, and they also like chocolate."
> 
> words.list <- strsplit(sentence, " ")
> words.list
[[1]]
 [1] "Mike"       "and"        "Jen"        "and"        "Sue"        "like"      
 [7] "pie,"       "and"        "they"       "also"       "like"       "chocolate."

> words <- unlist(words.list)
> words
[1] "Mike"       "and"        "Jen"        "and"        "Sue"        "like"      
[7] "pie,"       "and"        "they"       "also"       "like"       "chocolate."
> 
> words.table <- table(words)
> words.table
words
   also        and chocolate.        Jen       like       Mike       pie, 
      1          3          1          1          2          1          1 
    Sue       they 
      1          1 
> 
> words.table.sorted <- sort(words.table, decreasing=T)
> words.table.sorted
words
   and       like       also chocolate.        Jen       Mike       pie, 
     3          2          1          1          1          1          1 
   Sue       they 
     1          1 
>  

Let me explain the above step by step. First, we create a vector called sentence and assign the text “Mike and Jen and Sue like pie, and they also like chocolate.” to it. We then create a vector called words.list by applying the function strsplit(). strsplit() simply splits up a chunk of text into smaller pieces using a separator, in this case a blank (“ ”). Why words.list? strsplit() doesn’t return a vector as we might expect, but something called a list. For now we can ignore this: lists are tricky to work with and in this case keeping the data in that format has no real advantages. So we convert words.list to a simple character vector by using the unlist() function, giving us the words vector. We then apply the table() function to tabulate the elements in the vector (we count how often they occur). Finally, we sort the table using the decreasing=T argument, which results in the most frequent word coming first and the least frequent words last.
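
You may have noticed that “pie,” and “chocolate.” still carry their punctuation, and that “Mike” would be counted separately from “mike”. One possible way of tidying this up, sketched with the base R functions gsub() and tolower() (the cleaning step is my addition, not part of the steps above; cleaned is a made-up name):

```r
sentence <- "Mike and Jen and Sue like pie, and they also like chocolate."

# Remove punctuation, then convert everything to lower case.
cleaned <- tolower(gsub("[[:punct:]]", "", sentence))

# Split on blanks and flatten the resulting list into a vector.
words <- unlist(strsplit(cleaned, " ", fixed = TRUE))

# Count the words and put the most frequent ones first.
words.table <- sort(table(words), decreasing = TRUE)
words.table

# Individual counts can be looked up by name.
words.table["like"]
```

Whether stripping punctuation and case is appropriate depends on your research question, so treat this as one option rather than a fixed recipe.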

> barplot(words.table.sorted)
> 

…and with this function, we plot a barchart of the word frequencies.

[Figure: Barchart of word frequencies]
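
barplot() accepts a number of optional arguments for dressing up the chart. Here is a sketch with a few of them; the title, the axis label and the small words vector are made up for this example.

```r
# A small made-up vector of words, so the example runs on its own.
words <- c("and", "and", "and", "like", "like", "pie")
freq <- sort(table(words), decreasing = TRUE)

# main sets the title, ylab labels the vertical axis, and
# las = 2 turns the bar labels sideways so long words don't overlap.
barplot(freq, main = "Word frequencies", ylab = "Count", las = 2)
```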

(to be continued soon)


last modified 2011-02-12