Sunday, February 18, 2018

R Syntax Explained

Aggregate: This is used to apply a function (like mean) across a data set that is subset according to your needs.  For example, if you have a table of car sales details and prices (called mydata) and you want to know average sale prices, you can't just average the entire price column: you need the average price for each type of car.  Let's say your table has these columns: make, model, and price.

Aggregate takes a few inputs.  The first item is the dataset you care about: in this case, the table mydata, but specifically the price column.  Second item is a list of what subsets you'd like to create.  For example, we want to subset every row that matches "Ford" and "F150" and average their price.  So our second item is what categories we want to break the data out into: in this case, we want to see every unique combination of make and model.  The last item is the function we want to apply to the subset: average, median, etc.

result <- aggregate="" by="list(mydata$MAKE," mean="" mydata="" p="">

Filter and Select:  One of my favorite combinations. 
Filter takes two inputs: your dataset, and how you'd like to subset it.  So first input is our table mydata, easy enough.  Second input is a test: we give it the column Make, and test the values to see if they equal (==) Ford.  If the row's Make column contains Ford, filter will keep that row.  Otherwise, it's tossed. 

Select then is being given the result from filter.  Filter has snagged every row from our original table (mydata) and includes every column.  In other words, mydata started with columns make, model, and price, and filter also has all those columns. 

Select takes two inputs: one is your complete data set, the other is the column(s) you want to keep.  In this case, we want to keep the price column.  So this command eliminates every row that isn't a Ford sale and gives you a 1-column table of the prices of those Fords. 

In other words, the command below answers the question "give me just the prices from every Ford sale in the table."

select(filter(mydata, Make=='Ford'), Price)

Native dataframe manipulation: Sometimes you don't need to use commands like filter or aggregate to get the subset of data you want.  Let's say you have a 1-column table (let's call it car_returns) of the prices of all the cars that were brought back from a customer and had to be refunded.  How would you identify the make and model of the cars that were returned, just from the price?   So let's say the question is "give me all the rows (including make, model, and price) from my original table (mydata) that match these prices."

In general, you can subset a dataframe with a [] after the name: mydata[].  Inside the bracket, we'll need to pass two pieces of information: first, what column in mydata will correspond to the values in car_returns?  Obviously, price.  Now, we aren't comparing mydata's price column to a single price: we need to compare it to all prices that are in the car_returns table.  So we will use %in% to say "we want all the rows from mydata where Price equals one of the values from car_returns." 

mydata[mydata$PRICE %in% car_returns, ]

The other thing you'll notice is the comma after car_returns.  What's going on there is that we are comparing all the values of the column Price.  If we wanted to compare and subset based a row, we would put that after the comma.  For example, if we just wanted the Make of the car, we could do this:

mydata[mydata$PRICE %in% car_returns, mydata$MAKE]