[Solved] What does subset(df, !duplicated(x)) do?

Accepted Answer

The duplicated function traverses its argument sequentially and returns TRUE for each element that is identical to an earlier element, and FALSE otherwise. It is a generic function, so it has a default method (for vectors) as well as methods for other classes, such as data.frame. The subset function evaluates the expression passed as its second argument (subset) or third argument (select) inside the data frame, so column names can be used as though they were ordinary variables. This is called “non-standard evaluation”. The negation operator ! flips the result, so the expression is TRUE only for the first occurrence of each value. This call to subset therefore returns the rows of the data.frame that hold the first occurrence of each distinct value in the column named “x”. The result has exactly as many rows as there are unique values in x.
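
To see what duplicated returns on its own, here is a minimal sketch on a plain vector. The values are simply the first six entries of the x column printed in the example data; the variable names x and dup are only illustrative.

```r
# First six values of the x column shown in the example data
x <- c(2, 2, 2, 5, 4, 1)

# TRUE marks a value that has already appeared earlier in the vector
dup <- duplicated(x)
print(dup)
# [1] FALSE  TRUE  TRUE FALSE FALSE FALSE

# Negating the result keeps only the first occurrence of each value
print(x[!dup])
# [1] 2 5 4 1
```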

> dat <- data.frame(x = sample(1:5, 20, replace = TRUE), y = 1:5, z = 1:4)
> dat
   x y z
1  2 1 1
2  2 2 2
3  2 3 3
4  5 4 4
5  4 5 1
6  1 1 2
7  2 2 3
8  2 3 4
9  5 4 1
10 1 5 2
11 2 1 3
12 4 2 4
13 5 3 1
14 4 4 2
15 3 5 3
16 3 1 4
17 4 2 1
18 4 3 2
19 1 4 3
20 1 5 4

> subset(dat, !duplicated(x))
   x y z
1  2 1 1
4  5 4 4
5  4 5 1
6  1 1 2
15 3 5 3
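
For comparison, the same selection can be written with ordinary bracket indexing, which avoids non-standard evaluation. The sketch below rebuilds the data with the x values hard-coded from the printout above (an assumption made so the result is reproducible, since the original uses sample); the names nse and se are only illustrative.

```r
# Rebuild the printed data with fixed x values; y and z are recycled to 20 rows
dat <- data.frame(
  x = c(2, 2, 2, 5, 4, 1, 2, 2, 5, 1, 2, 4, 5, 4, 3, 3, 4, 4, 1, 1),
  y = 1:5,
  z = 1:4
)

# Non-standard evaluation: x is looked up inside dat
nse <- subset(dat, !duplicated(x))

# Standard evaluation: the column is referenced explicitly
se <- dat[!duplicated(dat$x), ]

# Both approaches should select the same rows
stopifnot(isTRUE(all.equal(nse, se)))

# One row per distinct value of x
stopifnot(nrow(nse) == length(unique(dat$x)))

print(nse)
```

The bracket form is usually the safer choice inside functions, because subset looks up names in the data frame first and can silently pick up a column where a local variable was intended.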