Chapter 3 Basic operations
3.1 Basics
3.1.1 <-
This is the assignment operator: the expression to its right is evaluated (if applicable) and then assigned to the object on the left of the operator. Hence the expression
a <- 10
means that the object a, a single number, “gets” the value 10, i.e. the value 10 is assigned to a. The symbol resembles an arrow pointing in the direction of assignment. The assignment may also be in the other direction, with symbol -> (see also note 4).
There should be no space between the two characters making up the arrow.
Use spaces or brackets to avoid ambiguities and errors:
x <- 10 # assignment of atomic value 10 to object x
x < - 10 # is value of x less than -10 ?
## [1] FALSE
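As a minimal sketch of the points above, the following lines show assignment in the other direction and the use of brackets to keep a comparison with a negative number unambiguous. The object names y and z are hypothetical and serve only as illustration.

```r
# Hypothetical objects y and z, for illustration only
10 -> y      # rightward assignment: y gets the value 10
z <- 5       # leftward assignment: z gets the value 5
z < (-10)    # brackets make clear that this is a comparison
## [1] FALSE
z            # z is unchanged by the comparison
## [1] 5
```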
3.1.2 #
This symbol indicates a comment: everything following it on the same line of input is ignored.
3.1.3 scan
This command reads a simple vector from the keyboard. Make sure to assign the result to a new object! Read in the numbers 1 to 10, and assign them to a new object.
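Because scan normally waits for keyboard input, the sketch below supplies the numbers through its text argument instead, so that it can also run in a script. The object name nums is an assumption made for this illustration.

```r
# Interactive use would be:  nums <- scan()   (end the input with an empty line)
# Here the input is given as text, so the example runs non-interactively.
nums <- scan( text="1 2 3 4 5 6 7 8 9 10", quiet=TRUE )
length(nums)   # number of values read
## [1] 10
```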
3.1.4 objects
This command shows a list of all objects in memory (similar to the contents of the Praat Objects window). With objects(pattern="abc") the list is filtered so that only the objects matching the pattern string "abc" are shown.
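A small sketch of this filtering, assuming an otherwise empty session. The three objects are hypothetical and are created only to have something to list.

```r
# Hypothetical objects, created only for this illustration
abc1 <- 1
abc2 <- "two"
other <- 3
objects( pattern="abc" )   # only names that match "abc"
## [1] "abc1" "abc2"
```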
3.1.6 print
Contents of an object can be inspected with this command, or by just entering the name of the object, as in some examples above.
3.1.7 summary
This command offers a summary of an object. The result depends on the data class of the object, as illustrated in section REF above.
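To sketch how the result depends on the data class, the lines below apply summary to a numeric vector and to a factor. Both objects contain hypothetical values.

```r
num <- c( 2, 4, 4, 10 )               # numeric vector (hypothetical values)
fac <- factor( c("a","b","a","a") )   # factor (hypothetical values)
summary(num)   # minimum, quartiles, mean, maximum
summary(fac)   # number of observations per level
```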
3.1.8 Workspace:
R holds its objects in memory. The whole workspace, containing all data objects, can be stored from the RStudio window (Session > Save Workspace As...). This allows you to save a session, and continue your work later (Session > Load Workspace...).
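The same can also be done with commands. The sketch below is one possible approach, not the only one: it stores all objects of the current environment with save and restores them with load. The file location and the object keepme are assumptions made for this illustration. At top level, save.image() is the usual shortcut for saving the whole workspace.

```r
wsfile <- tempfile( fileext=".RData" )   # hypothetical file location
keepme <- 1:3                            # hypothetical object to be stored
save( list=ls(), file=wsfile )           # store all objects of this environment
rm( keepme )                             # remove the object from memory
load( wsfile )                           # restore the stored objects
keepme
## [1] 1 2 3
```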
3.2 Subselection
Subselection within an object is a very powerful tool in R. The subselection operator x[…] selects only those data from object x that match the expression within square brackets. This expression can be a single index number, a sequence or list of numbers, or an evaluated expression.
In the following example, variable x contains 30 numbers, but 3 of these are NA. Notice that the output of is.na is the input of table.
# is.na() returns TRUE/FALSE for each element of 'x'.
# table() summarizes categorical data
table( is.na(x) )
##
## FALSE  TRUE
##    27     3
ok <- !is.na( x ) # exclamation mark means NOT
which( !ok ) # which index numbers are NOT ok? inspect!
## [1] 11 13 19
mean( x[ok] ) # select ok values, compute mean, display
## [1] 1.015252
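The different kinds of expression within square brackets can be sketched with a short vector. The vector v and its values are hypothetical; they are not the data used above.

```r
# Hypothetical data: 6 numbers, one of which is missing
v <- c( 4, 8, NA, 15, 16, 23 )
v[2]             # single index number
## [1] 8
v[4:6]           # sequence of index numbers
## [1] 15 16 23
v[ c(1,6) ]      # several separate index numbers
## [1]  4 23
v[ !is.na(v) ]   # evaluated expression: keep the non-missing values
## [1]  4  8 15 16 23
```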
Subselection can also be achieved by using the function subset(data, subset, select). The first argument is the input data (set), the second argument is the selector condition, and the optional third argument indicates which columns of a data frame should be kept in the output.
require(hqmisc)
data(talkers)
subset( talkers, subset=( age<45 & region=="W" ) )
##     id sex age region syldur  nsyl
## 1   60   1  38      W 0.1940 13.56
## 3   62   1  36      W 0.2331 11.73
## 4  112   1  33      W 0.2633 11.67
## 45 153   0  39      W 0.2676  6.36
## 50 158   1  40      W 0.2131  7.99
## 51 159   0  25      W 0.2152  8.11
## 52 160   0  26      W 0.2104  8.54
## 53 161   0  27      W 0.2459  8.89
## 55 163   0  33      W 0.2287  7.60
## 80 391   1  34      W 0.2225  8.89
This command selects the rows of data frame talkers (from the package hqmisc, see 8) that correspond to speakers who are under 45 years of age, and who are from the West region.
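The optional third argument select is not used in the example above. The sketch below illustrates it with a hypothetical mini data frame, loosely modelled on some columns of talkers, so that it runs without the package hqmisc.

```r
# Hypothetical mini data frame, for illustration only
mini <- data.frame( id=1:4,
                    age=c(38, 51, 25, 47),
                    region=c("W","W","N","S"),
                    syldur=c(0.19, 0.23, 0.21, 0.26) )
# rows with age under 45, keeping only the columns id and syldur
subset( mini, subset=( age<45 ), select=c(id, syldur) )
##   id syldur
## 1  1   0.19
## 3  3   0.21
```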
3.3 Split, merge, reshape
There are useful functions available to split and merge data frames. First we create two example data frames. The first data frame has a list of English vowels, with a phonological feature for each vowel, and with the average frequency of the second formant (note 5) of each vowel (Peterson and Barney 1952) spoken by male speakers. The second data frame has a partially overlapping list of vowels, with key words by John Wells (note 6).
vowelsymb <- c( "i","I","e","E","ae", "A","V","o","U","u", "@" )
v1df <- data.frame( vowel=vowelsymb,
                    feat=factor( c(rep("front",5),
                                   rep("back",5),
                                   "central" ) ),
                    F2=c( 2290,1990,NA,1840,1720,
                          1090,1190,NA,1020,870,
                          NA ) )
v2df <- data.frame( vowel=vowelsymb[1:10],
                    word=c("fleece","kit","face","dress","trap",
                           "lot","strut","goat","foot","goose") )
3.3.1 split
This command divides the data in the first argument (column F2 in data frame v1df) into the groups defined by the second argument.
with( v1df, split(F2,feat) ) -> v1list
This is particularly helpful in combination with lapply to apply a function to these grouped data, e.g. to compute the mean of F2 for each vowel category separately:
with(v1df, lapply( split(F2,feat), mean, na.rm=TRUE ) )
## $back
## [1] 1042.5
##
## $central
## [1] NaN
##
## $front
## [1] 1960
See also unsplit.
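A brief sketch of how split and unsplit relate to each other, with hypothetical values and a hypothetical grouping factor. It uses sapply, which behaves like lapply but returns a named vector where possible.

```r
# Hypothetical values and grouping factor
val <- c( 2, 10, 4, 20, 6 )
grp <- factor( c("a","b","a","b","a") )
parts <- split( val, grp )   # list with one vector per group
sapply( parts, mean )        # mean per group, as a named vector
##  a  b
##  4 15
unsplit( parts, grp )        # back to a single vector in the original order
## [1]  2 10  4 20  6
```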
3.3.2 merge
This command merges two data frames, by common columns. Specify argument all=TRUE if you want to include non-matching rows in the output, with NAs in the appropriate columns.
m1 <- merge(v1df, v2df)
The resulting merged data frame is also sorted on the common columns, unless argument sort=FALSE is specified.
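The effect of all=TRUE can be sketched with two small data frames. The data frames left and right and their contents are hypothetical, chosen so that only some rows match.

```r
# Hypothetical data frames with a common column 'key'
left  <- data.frame( key=c("a","b","c"), x=1:3 )
right <- data.frame( key=c("b","c","d"), y=c(20,30,40) )
merge( left, right )             # matching rows only: b and c
##   key x  y
## 1   b 2 20
## 2   c 3 30
merge( left, right, all=TRUE )   # all rows, with NA where there is no match
##   key  x  y
## 1   a  1 NA
## 2   b  2 20
## 3   c  3 30
## 4   d NA 40
```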
3.3.3 reshape
If you need to perform a Repeated Measures (within-subjects) analysis of variance (RM-ANOVA) in SPSS, your data have to be in “wide” data layout, with all observations from one subject on a single data line. R on the other hand uses the “long” data layout, with one observation per line, and with all descriptors of that observation repeated for each line. There is a convenient command reshape to convert data between the wide layout (of SPSS RM-ANOVA) and the long layout (of R).
To illustrate, first we read a wide data set:
widedata <- read.table(
    file=url("http://www.hugoquene.nl/emlar/widedata.txt"),
    header=TRUE)
The wide data show the subject id, between-subject group, and three within-subject observations, for 6 subjects (with leading row numbers):
head(widedata)
##   subject group item1 item2 item3
## 1       1     1     2     3     4
## 2       2     1     3     4     6
## 3       3     1     1     3     6
## 4       4     2     2     4     5
## 5       5     2     4     5     6
## 6       6     2     2     5     7
These data are then reshaped to long layout with the following command:
longdata <- reshape( widedata, direction="long",
    varying=c("item1","item2","item3"),
    timevar="item", times=c("1","2","3"),
    v.names="resp", idvar="subject")
The observations from the multiple columns named in varying are collected into a single new column named by v.names, using the identifiers in column idvar. The information contained in the multiple column names of varying is stored in a single new column named by timevar, using the values in times.
Inspect the two data frames to verify this.
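For readers who cannot download the file, the sketch below enters the six rows printed above by hand and applies the same reshape call. The object names wide and long are assumptions made for this illustration.

```r
# The six rows printed above, entered by hand so that no download is needed
wide <- data.frame( subject=1:6,
                    group=c(1,1,1,2,2,2),
                    item1=c(2,3,1,2,4,2),
                    item2=c(3,4,3,4,5,5),
                    item3=c(4,6,6,5,6,7) )
long <- reshape( wide, direction="long",
                 varying=c("item1","item2","item3"),
                 timevar="item", times=c("1","2","3"),
                 v.names="resp", idvar="subject" )
dim(wide)   # 6 subjects, 5 columns
## [1] 6 5
dim(long)   # 18 observations, 4 columns
## [1] 18  4
```

Each subject now occupies three lines in long, one per item, with the values of subject and group repeated on each line.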