This information may be out there already, but I couldn’t find it quickly. If we are using data from baseball-reference.com, we will come across situations often when a player played for two or more teams in the same season. How this is dealt with on that site is to put a “TOT” row first which totals the data from the other rows, and then follows that data with the information from the various teams. Most of the time, in my experience, what I want is just the “TOT” data, and I want to ignore the team specific data as superfluous. Here’s how to do that in Excel, and in R. Sadly, this functionality is missing in OpenOffice/LibreOffice Calc, and the path for removing duplicates is much to complicated for me to put up with for normal workflow.
The example I’m using is the .CSV of all of the batting statistics for players in 2016. The data is from https://www.baseball-reference.com/leagues/MLB/2016-standard-batting.shtml, and you either copy and paste the data into a spreadsheet and save it as a CSV for R, or you can use the “Share & More” button above the table to convert the table to a comma-delimited version, copy that into a spreadsheet, and use the “Text to Columns” function (under ‘Data” in Excel).
You start with the data loaded from the link above (watch out to delete the LgAvg or “League Average” row), and so will have 1,611 rows.
1) Select all cells using the sweet multi-selector to left of Column A and above Row 1.
2) Under the “Data” tab, select “Remove Duplicates”
3) Click on “Unselect All” and Click on the Box next to “Name”
4) Click “Ok”
You should have a message box pop up and let you know that 258 duplicate rows have been removed, leaving you with 1,354 rows (including the header row).
1) Load the .csv into R with a command like:
stats2016 <- read.csv("c:/2016stats.csv", header=TRUE)
(This is obviously for Windows, but is fairly similar on Linux and Mac, you may have to use slashes of a different direction.)
2) If you type:
You’ll see we have our 1,611 rows.
3) There are multiple options to get rid of the rows we don’t need, but one simple way is to type something like:
nonDupes <- stats2016[!duplicated(stats2016["Name"]),]
(This removes all of the rows in which the second column has a duplicate.)
4) If we type:
This tells us that our variable has 1,353 rows (it doesn’t count the header row, since we told it there was a header row when we loaded it).
5) So, if you wanted to isolate stolen bases from this data set, you would type something like:
SBs <- data.frame(nonDupes$Name, nonDupes$SB)
There are loads of tutorials for both Excel and R if these were more complicated than you’re ready for, but if you’re like me and just need the workflow steps to work with baseball-reference.com data, I hope this helps.