-
Notifications
You must be signed in to change notification settings - Fork 0
/
04-data-structures-part1.Rmd
580 lines (478 loc) · 17.3 KB
/
04-data-structures-part1.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
---
title: "Data Structures"
teaching: 40
exercises: 15
questions:
- "How can I load a package in R?"
- "What is the 'tidyverse'?"
- "How can I read data in R?"
- "What are the basic data types in R?"
- "How do I represent categorical information in R?"
objectives:
- "To be aware of the different types of data."
- "To begin exploring data frames, and understand how they are related to vectors, factors and lists."
- "To be able to ask questions from R about the type, class, and structure of an object."
keypoints:
- "Use `readr::read_csv` to read tabular data in R."
- "The basic data types in R are double, integer, complex, logical, and character."
- "Use factors to represent categories in R."
source: Rmd
---
```{r, include=FALSE}
source("bin/chunk-options.R")
knitr_fig_path("04-")
```
# Data Structures
One of R's most powerful features is its ability to deal with tabular data -
such as you may already have in a spreadsheet or a CSV file. Let's start by
looking at a toy dataset in your `data/` directory, called `feline-data.csv`:
The contents of the new file, `feline-data.csv`:
```{r, eval=FALSE}
coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1
```
## The *readr* package
To read the data into R, we are going to use our first package, called *readr*.
*readr* is part of a suite of packages called the "tidyverse" which were designed
to work nicely together and to ease many common data operations.
The first time you use a package, you will need to install it (like installing
an app on your phone from the app store). Additionally, it is a good idea to
periodically check for updates to that package:
```{r, eval=FALSE}
install.packages("readr")
```
Everytime we want to use that package, you must load it into your R session, by
using the `library` function:
```{r}
library(readr)
```
Now we can load this data into R via the following using the `read_csv` function,
and assign it to an object called `cats`:
```{r}
cats <- read_csv(file = "data/feline-data.csv")
cats
```
The `read_csv` function is used for reading in tabular data stored in a text
file where the columns of data are separated by punctuation characters such as
CSV files (csv = comma-separated values). There is a base version of this called
`read.csv`, but the `readr` version (`read_csv`) is a bit more user-friendly, and
uses more sensible defaults. Tabs and commas are the most common
punctuation characters used to separate or delimit data points in text files.
The object that is created by `read_csv` is called a "data.frame" - a rectangular
table-like object with rows and columns.
We can begin exploring our dataset right away, first by looking at the whole thing:
```{r}
cats
```
And pulling out individual columns by specifying them using the `$` operator:
```{r}
cats$weight
cats$coat
```
We can do other operations on the columns:
```{r}
## Say we discovered that the scale weighs two Kg light:
cats$weight + 2
paste("My", cats$coat, "cat weighs", cats$weight + 2, "kilograms")
```
But what about
```{r error=TRUE}
cats$weight + cats$coat
```
Understanding what happened here is key to successfully analyzing data in R.
## Data Types
If you guessed that the last command will return an error because `2.1` plus
`"black"` is nonsense, you're right - and you already have some intuition for an
important concept in programming called *data types*. We can ask what type of
data something is:
```{r}
typeof(cats$weight)
```
There are 4 main types:
- `double`/`numeric` (decimal numbers),
- `integer` (counting numbers),
- `logical` (True/False),
- `character` (free text)
```{r}
typeof(3.14)
typeof(1L) # The L suffix forces the number to be an integer, since by default R uses double (decimal) numbers
typeof(TRUE)
typeof('banana')
```
No matter how
complicated our analyses become, all data in R is interpreted as one of these
basic data types. This strictness has some really important consequences.
The table that R loaded our cats data into is something called a
*data.frame*, and it is our first example of something called a *data
structure* - that is, a structure which R knows how to build out of the basic
data types.
We can see that it is a *data.frame* by calling the `class` function on it:
```{r}
class(cats)
```
## Vectors and Type Coercion
To better understand this behavior, let's meet another of the data structures:
the *vector*.
If we are creating vectors on our own, we will normally use the `c` (combine)
function:
```{r}
my_vector <- c(1, 3, 5 ,7 ,9)
my_vector
```
A vector in R is essentially an ordered list of things, with the special
condition that *everything in the vector must be the same basic data type*.
```{r}
class(my_vector)
```
This command indicates the basic data type found in this vector - in this case `numeric`.
We can use the logical operators that we learned earlier with vectors:
```{r}
my_vector > 4
```
Vectors can be any data type that we've already learned about. Let's make a
character vector:
```{r}
my_other_vector <- c("Harry", "Ron", "Hermione", "Ginny")
my_other_vector
```
```{r}
class(my_other_vector)
```
```{r}
my_other_vector == "Harry"
```
If we similarly do
```{r}
class(cats$weight)
```
we see that `cats$weight` is a vector, too - *the columns of data we load into R
data.frames are all vectors*, and that's the root of why R forces everything in
a column to be the same basic data type.
> ## Discussion 1
>
> Why is R so opinionated about what we put in our columns of data?
> How does this help us?
>
> > ## Discussion 1
> >
> > By keeping everything in a column the same, we allow ourselves to make simple
> > assumptions about our data; if you can interpret one entry in the column as a
> > number, then you can interpret *all* of them as numbers, so we don't have to
> > check every time. This consistency is what people mean when they talk about
> > *clean data*; in the long run, strict consistency goes a long way to making
> > our lives easier in R.
Given what we've learned so far, what do you think the following will produce?
```{r}
quiz_vector <- c(2,6,'3')
```
This is something called *type coercion*, and it is the source of many surprises
and the reason why we need to be aware of the basic data types and how R will
interpret them. When R encounters a mix of types (here numeric and character) to
be combined into a single vector, it will force them all to be the same
type. Consider:
```{r}
coercion_vector <- c('a', TRUE)
coercion_vector
another_coercion_vector <- c(0, TRUE)
another_coercion_vector
```
The coercion rules go: `logical` -> `integer` -> `double`/`numeric` -> `complex` ->
`character`, where -> can be read as *are transformed into*. You can try to
force coercion against this flow using the `as.` functions:
```{r}
character_vector_example <- c('0','2','4')
character_vector_example
character_coerced_to_numeric <- as.numeric(character_vector_example)
character_coerced_to_numeric
numeric_coerced_to_logical <- as.logical(character_coerced_to_numeric)
numeric_coerced_to_logical
```
As you can see, some surprising things can happen when R forces one basic data
type into another! Nitty-gritty of type coercion aside, the point is: if your
data doesn't look like what you thought it was going to look like, type coercion
may well be to blame; make sure everything is the same type in your vectors and
your columns of data.frames, or you will get nasty surprises!
But coercion can also be very useful! For example, in our `cats` data
`likes_string` is numeric, but we know that the 1s and 0s actually represent
`TRUE` and `FALSE` (a common way of representing them). We should use the
`logical` datatype here, which has two states: `TRUE` or `FALSE`, which is
exactly what our data represents. We can 'coerce' this column to be `logical` by
using the `as.logical` function:
```{r}
cats$likes_string
cats$likes_string <- as.logical(cats$likes_string)
cats$likes_string
```
## Data Frames
We said that columns in data.frames were vectors:
```{r}
str(cats$weight)
str(cats$likes_string)
```
These make sense. But what about
```{r}
str(cats$coat)
```
## Factors
Another important data structure is called a *factor*. Factors usually look like
character data, but are typically used to represent categorical information that
have a defined set of values. For
example, let's make a vector of strings labelling cat colorations for all the
cats in our study:
```{r}
coats <- c('tabby', 'tortoiseshell', 'tortoiseshell', 'black', 'tabby')
coats
```
We can turn a vector into a factor like so:
```{r}
CATegories <- factor(coats)
class(CATegories)
CATegories
```
Now R has noticed that there are three possible categories in our data - but it
also did something surprising; instead of printing out the strings we gave it,
we got a bunch of numbers instead. R has replaced our human-readable categories
with numbered indices under the hood, this is necessary as many statistical
calculations utilise such numerical representations for categorical data:
```{r}
typeof(coats)
typeof(CATegories)
```
> ## Challenge 1
>
> The default behaviour of `read.csv` is to convert all character columns to
> factors, which is one of the biggest reasons we prefer `read_csv` from the
> `readr` package.
> Look at the help for `?read.csv` to figure out how to keep text columns as character
> vectors instead of factors; then write a command or two to show that the factor
> in `cats` is actually a character vector when loaded in this way.
>
> > ## Solution to Challenge 1
> >
> > One solution is use the argument `stringAsFactors`:
> >
> > ```{r, eval = FALSE}
> > cats <- read.csv(file="data/feline-data.csv")
> > str(cats$coat)
> > cats <- read.csv(file="data/feline-data.csv", stringsAsFactors=FALSE)
> > ```
> >
> > Another solution is use the argument `colClasses`
> > that allow finer control.
> >
> > ```{r, eval = FALSE}
> > cats <- read.csv(file="data/feline-data.csv", colClasses=c(NA, NA, "character"))
> > str(cats$coat)
> > ```
> >
> > Note: new students find the help files difficult to understand; make sure to let them know
> > that this is typical, and encourage them to take their best guess based on semantic meaning,
> > even if they aren't sure.
In modelling functions, it's important to know what the baseline levels are. This
is assumed to be the first factor, but by default factors are labelled in
alphabetical order. You can change this by specifying the levels:
```{r}
mydata <- c("case", "control", "control", "case")
factor_ordering_example <- factor(mydata, levels = c("control", "case"))
str(factor_ordering_example)
```
In this case, we've explicitly told R that "control" should be represented by 1, and
"case" by 2. This designation can be very important for interpreting the
results of statistical models!
## Lists
Another data structure you'll want in your bag of tricks is the `list`. A list
is simpler in some ways than the other types, because you can put anything you
want in it:
```{r}
list_example <- list(1, "a", TRUE, 1+4i)
list_example
another_list <- list(title = "Numbers", numbers = 1:10, data = TRUE )
another_list
```
We can now understand something a bit surprising in our data.frame; what happens if we run:
```{r}
typeof(cats)
```
We see that data.frames look like lists 'under the hood' - this is because a
data.frame is really a list of vectors and factors, as they have to be - in
order to hold those columns that are a mix of vectors and factors, the
data.frame needs something a bit more flexible than a vector to put all the
columns together into a familiar table. In other words, a `data.frame` is a
special list in which all the vectors must have the same length.
In our `cats` example, we have a character, a double and a logical variable. As
we have seen already, each column of data.frame is a vector.
```{r}
cats$coat
cats[,1]
typeof(cats[,1])
str(cats[,1])
```
Each row is an *observation* of different variables, itself a data.frame, and
thus can be composed of elements of different types.
```{r}
cats[1,]
typeof(cats[1,])
str(cats[1,])
```
> ## Challenge 3
>
> There are several subtly different ways to call variables, observations and
> elements from data.frames:
>
> - `cats[1]`
> - `cats[[1]]`
> - `cats$coat`
> - `cats["coat"]`
> - `cats[1, 1]`
> - `cats[, 1]`
> - `cats[1, ]`
>
> Try out these examples and explain what is returned by each one.
>
> *Hint:* Use the function `typeof()` to examine what is returned in each case.
>
> > ## Solution to Challenge 3
> > ```{r, eval=TRUE, echo=TRUE}
> > cats[1]
> > ```
> > We can think of a data frame as a list of vectors. The single brace `[1]`
> returns the first slice of the list, as another list. In this case it is the
> first column of the data frame.
> > ```{r, eval=TRUE, echo=TRUE}
> > cats[[1]]
> > ```
> > The double brace `[[1]]` returns the contents of the list item. In this case
> it is the contents of the first column, a _vector_ of type _factor_.
> > ```{r, eval=TRUE, echo=TRUE}
> > cats$coat
> > ```
> > This example uses the `$` character to address items by name. _coat_ is the
> first column of the data frame, again a _vector_ of type _factor_.
> > ```{r, eval=TRUE, echo=TRUE}
> > cats["coat"]
> > ```
> > Here we are using a single brace `["coat"]` replacing the index number with
> the column name. Like example 1, the returned object is a _list_.
> > ```{r, eval=TRUE, echo=TRUE}
> > cats[1, 1]
> > ```
> > This example uses a single brace, but this time we provide row and column
> coordinates. The returned object is the value in row 1, column 1. The object
> is an _integer_ but because it is part of a _vector_ of type _factor_, R
> displays the label "calico" associated with the integer value.
> > ```{r, eval=TRUE, echo=TRUE}
> > cats[, 1]
> > ```
> > Like the previous example we use single braces and provide row and column
> coordinates. The row coordinate is not specified, R interprets this missing
> value as all the elements in this _column_ _vector_.
> > ```{r, eval=TRUE, echo=TRUE}
> > cats[1, ]
> > ```
> > Again we use the single brace with row and column coordinates. The column
> coordinate is not specified. The return value is a _list_ containing all the
> values in the first row.
## Matrices
Last but not least is the matrix. We can declare a matrix full of zeros:
```{r}
matrix_example <- matrix(0, ncol=6, nrow=3)
matrix_example
```
And similar to other data structures, we can ask things about our matrix:
```{r}
class(matrix_example)
typeof(matrix_example)
str(matrix_example)
dim(matrix_example)
nrow(matrix_example)
ncol(matrix_example)
```
> ## Challenge 4
>
> What do you think will be the result of
> `length(matrix_example)`?
> Try it.
> Were you right? Why / why not?
>
> > ## Solution to Challenge 4
> >
> > What do you think will be the result of
> > `length(matrix_example)`?
> >
> > ```{r}
> > matrix_example <- matrix(0, ncol=6, nrow=3)
> > length(matrix_example)
> > ```
> >
> > Because a matrix is a vector with added dimension attributes, `length`
> > gives you the total number of elements in the matrix.
> ## Challenge 5
>
> Make another matrix, this time containing the numbers 1:50,
> with 5 columns and 10 rows.
> Did the `matrix` function fill your matrix by column, or by
> row, as its default behaviour?
> See if you can figure out how to change this.
> (hint: read the documentation for `matrix`!)
>
> > ## Solution to Challenge 5
> >
> > Make another matrix, this time containing the numbers 1:50,
> > with 5 columns and 10 rows.
> > Did the `matrix` function fill your matrix by column, or by
> > row, as its default behaviour?
> > See if you can figure out how to change this.
> > (hint: read the documentation for `matrix`!)
> >
> > ```{r, eval=FALSE}
> > x <- matrix(1:50, ncol=5, nrow=10)
> > x <- matrix(1:50, ncol=5, nrow=10, byrow = TRUE) # to fill by row
> > ```
> ## Challenge 6
> Create a list of length two containing a character vector for each of the sections in this part of the workshop:
>
> - Data types
> - Data structures
>
> Populate each character vector with the names of the data types and data
> structures we've seen so far.
>
> > ## Solution to Challenge 6
> > ```{r}
> > dataTypes <- c('double', 'complex', 'integer', 'character', 'logical')
> > dataStructures <- c('data.frame', 'vector', 'factor', 'list', 'matrix')
> > answer <- list(dataTypes, dataStructures)
> > ```
> > Note: it's nice to make a list in big writing on the board or taped to the wall
> > listing all of these types and structures - leave it up for the rest of the workshop
> > to remind people of the importance of these basics.
> >
> ## Challenge 7
>
> Consider the R output of the matrix below:
> ```{r, echo=FALSE}
> matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE)
> ```
> What was the correct command used to write this matrix? Examine
> each command and try to figure out the correct one before typing them.
> Think about what matrices the other commands will produce.
>
> 1. `matrix(c(4, 1, 9, 5, 10, 7), nrow = 3)`
> 2. `matrix(c(4, 9, 10, 1, 5, 7), ncol = 2, byrow = TRUE)`
> 3. `matrix(c(4, 9, 10, 1, 5, 7), nrow = 2)`
> 4. `matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE)`
>
> > ## Solution to Challenge 7
> >
> > Consider the R output of the matrix below:
> > ```{r, echo=FALSE}
> > matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE)
> > ```
> > What was the correct command used to write this matrix? Examine
> > each command and try to figure out the correct one before typing them.
> > Think about what matrices the other commands will produce.
> > ```{r, eval=FALSE}
> > matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE)
> > ```