Steve Bellan (slides borrowed from John Muschelli)
x and X are different)
## get the working directory
getwd()
setwd('~/Documents/R Repos/EPID7500/')
Note that the dir() function interfaces with your operating system and can show you which files are in your current working directory.
You can try some directory navigation:
dir() # shows directory contents
[1] "Data_IO.html"
[2] "Data_IO.pdf"
[3] "Data_IO.R"
[4] "index.html"
[5] "index.md"
[6] "index.pdf"
[7] "index.R"
[8] "index.Rmd"
[9] "lab"
[10] "makefile"
[11] "Youth_Tobacco_Survey_YTS_Data.csv"
[12] "YouthTobacco_newNames.csv"
[13] "yts_data.rda"
[14] "yts_dataset.rds"
dir("..") # shows up one directory
[1] "1.1-RStudio"
[2] "1.2-Basic_R"
[3] "1.3-Data_IO"
[4] "all_the_functions.csv"
[5] "all_the_packages.txt"
[6] "Arrays_Split"
[7] "Basic_R"
[8] "Best_Model_Coefficients.csv"
[9] "Best_Model_Coefficients.xlsx"
[10] "bibliography.bib"
[11] "black_and_white_theme.pdf"
[12] "bloomberg.logo.small.horizontal.blue.png"
[13] "data"
[14] "Data_Classes"
[15] "Data_Cleaning"
[16] "Data_IO"
[17] "Data_Summarization"
[18] "Data_Visualization"
[19] "data.zip"
[20] "Day 1"
[21] "dhs"
[22] "Functions"
[23] "HW"
[24] "ifelse_stata_way.R"
[25] "index.html"
[26] "index.Rmd"
[27] "install_all_packages.R"
[28] "Intro"
[29] "intro_to_r.Rproj"
[30] "Knitr"
[31] "LICENSE"
[32] "list_all_packages.R"
[33] "live_code"
[34] "makefile"
[35] "makefile_old"
[36] "makefile.copy"
[37] "Manipulating_Data_in_R"
[38] "my_tab.txt"
[39] "ProjectExample"
[40] "README.md"
[41] "render.R"
[42] "renderFile.R"
[43] "replace_css.R"
[44] "RStudio"
[45] "run_labs.R"
[46] "scratch.R"
[47] "shiny_knitr"
[48] "shiny_knitr.zip"
[49] "Simple_Knitr"
[50] "Statistics"
[51] "styles.css"
[52] "Subsetting_Data_in_R"
[53] "Syllabus-student.doc"
dir("../..") # shows up two directories
[1] "CarcCapRecap" "cRCT_vs_iRCT"
[3] "datasets" "EbolaVaccSim"
[5] "EPID7500" "HIVClinicTanzMoH"
[7] "ICI3D" "lmtacc"
[9] "MathModelsMedPH" "measlesImmunomodulation"
[11] "MMED2017" "MMEDparticipants"
[13] "RakaiLatentHet" "RTutorials"
[15] "SDPSimulations" "sshfsTip.txt"
[17] "TB_MAC_UGA" "untitled folder"
[19] "ZikaTrial"
An absolute or full path points to the same location in a file system, regardless of the current working directory.
dir('~/Documents', full.names = TRUE)[1:3]
[1] "/Users/stevenbellan/Documents/Adobe"
[2] "/Users/stevenbellan/Documents/Dragon"
[3] "/Users/stevenbellan/Documents/DreamPlan Sample Projects"
This means that if I run your code and it uses absolute paths, it won’t work unless we have exactly the same folder structure where R is looking (bad).
A relative path starts from the current working directory. The working directory has to be set correctly, but relative paths make a project portable.
dir('~/Documents', full.names = FALSE)[1:3]
[1] "Adobe" "Dragon"
[3] "DreamPlan Sample Projects"
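To make the absolute/relative distinction concrete, here is a minimal sketch that builds a throwaway folder inside tempdir(), so it does not depend on anyone's real files. The folder name path_demo and the file toy.csv are made up for illustration.

```r
# Build a small temporary "project" folder (hypothetical names throughout)
proj <- file.path(tempdir(), "path_demo")
dir.create(file.path(proj, "data"), recursive = TRUE, showWarnings = FALSE)
writeLines("a,b\n1,2", file.path(proj, "data", "toy.csv"))

old_wd <- setwd(proj)                      # setwd() returns the previous working directory

rel_path <- file.path("data", "toy.csv")   # relative: resolved from getwd()
abs_path <- normalizePath(rel_path)        # absolute: the same file, full location

file.exists(rel_path)                      # resolves while the working directory is proj
dir("data")                                # lists the contents of data/

setwd(old_wd)                              # restore the original working directory
file.exists(rel_path)                      # now resolved from old_wd, so it normally fails
file.exists(abs_path)                      # the absolute path still points at the file
```

The relative path only works when the working directory is the project folder, but it keeps working if the whole folder is copied to another computer. The absolute path works from anywhere on this machine and nowhere else.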
In RStudio, go to Session --> Set Working Directory --> To Source File Location
RStudio should put code in the Console, similar to this:
setwd("~/Lectures/Data_IO/lecture")
Again, if you open an R file with a new RStudio session, it does this for you. You may need to make this a default.
For any function, you can write ?FUNCTION_NAME, or help("FUNCTION_NAME") to look at the help file:
?dir
help("dir")
Everything we do in class will use real, publicly available data; there are few ‘toy’ example datasets and little ‘simulated’ data
Youth Tobacco Survey (YTS) Dataset
Middle/high school students tobacco use, exposure to environmental tobacco smoke, smoking cessation, school curriculum, knowledge and attitudes about tobacco, etc…
Within RStudio: Session --> Set Working Directory --> To Source File Location
RStudio features
library(readr) # provides read_csv() and write_csv()
mydat = read_csv("https://github.com/sbellan61/EPID7500-IntroToR/raw/gh-pages/1.3-Data_IO/Youth_Tobacco_Survey_YTS_Data.csv")
head(mydat)
# A tibble: 6 x 31
YEAR LocationAbbr LocationDesc TopicType
<int> <chr> <chr> <chr>
1 2015 AZ Arizona Tobacco Use – Survey Data
2 2015 AZ Arizona Tobacco Use – Survey Data
3 2015 AZ Arizona Tobacco Use – Survey Data
4 2015 AZ Arizona Tobacco Use – Survey Data
5 2015 AZ Arizona Tobacco Use – Survey Data
6 2015 AZ Arizona Tobacco Use – Survey Data
# ... with 27 more variables: TopicDesc <chr>, MeasureDesc <chr>,
# DataSource <chr>, Response <chr>, Data_Value_Unit <chr>,
# Data_Value_Type <chr>, Data_Value <dbl>,
# Data_Value_Footnote_Symbol <chr>, Data_Value_Footnote <chr>,
# Data_Value_Std_Err <dbl>, Low_Confidence_Limit <dbl>,
# High_Confidence_Limit <dbl>, Sample_Size <int>, Gender <chr>,
# Race <chr>, Age <chr>, Education <chr>, GeoLocation <chr>,
# TopicTypeId <chr>, TopicId <chr>, MeasureId <chr>,
# StratificationID1 <chr>, StratificationID2 <chr>,
# StratificationID3 <chr>, StratificationID4 <chr>, SubMeasureID <chr>,
# DisplayOrder <int>
dat = read_csv("../data/Youth_Tobacco_Survey_YTS_Data.csv")
Parsed with column specification:
cols(
.default = col_character(),
YEAR = col_integer(),
Data_Value = col_double(),
Data_Value_Std_Err = col_double(),
Low_Confidence_Limit = col_double(),
High_Confidence_Limit = col_double(),
Sample_Size = col_integer(),
DisplayOrder = col_integer()
)
See spec(...) for full column specifications.
The data is now read into your R workspace, just as if you had used the dropdown menu.
TidyVerse (loads as a tibble): read_csv, read_table, read_delim
Tibbles print more cleanly and have slightly different subsetting syntax
dat
# A tibble: 9,794 x 31
YEAR LocationAbbr LocationDesc TopicType
<int> <chr> <chr> <chr>
1 2015 AZ Arizona Tobacco Use – Survey Data
2 2015 AZ Arizona Tobacco Use – Survey Data
3 2015 AZ Arizona Tobacco Use – Survey Data
4 2015 AZ Arizona Tobacco Use – Survey Data
5 2015 AZ Arizona Tobacco Use – Survey Data
6 2015 AZ Arizona Tobacco Use – Survey Data
7 2015 AZ Arizona Tobacco Use – Survey Data
8 2015 AZ Arizona Tobacco Use – Survey Data
9 2015 AZ Arizona Tobacco Use – Survey Data
10 2015 AZ Arizona Tobacco Use – Survey Data
# ... with 9,784 more rows, and 27 more variables: TopicDesc <chr>,
# MeasureDesc <chr>, DataSource <chr>, Response <chr>,
# Data_Value_Unit <chr>, Data_Value_Type <chr>, Data_Value <dbl>,
# Data_Value_Footnote_Symbol <chr>, Data_Value_Footnote <chr>,
# Data_Value_Std_Err <dbl>, Low_Confidence_Limit <dbl>,
# High_Confidence_Limit <dbl>, Sample_Size <int>, Gender <chr>,
# Race <chr>, Age <chr>, Education <chr>, GeoLocation <chr>,
# TopicTypeId <chr>, TopicId <chr>, MeasureId <chr>,
# StratificationID1 <chr>, StratificationID2 <chr>,
# StratificationID3 <chr>, StratificationID4 <chr>, SubMeasureID <chr>,
# DisplayOrder <int>
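One place where the subsetting difference shows up is single-column selection with [ , j]. The sketch below uses a made-up two-row table rather than the YTS data, and assumes the tibble package is installed (it comes with the tidyverse).

```r
library(tibble)  # assumption: the tibble package is installed

df  <- data.frame(year = c(2015, 2016), state = c("AZ", "GA"))  # made-up toy data
tbl <- as_tibble(df)

# [ , j] with one column: a data.frame drops to a plain vector,
# while a tibble stays a one-column tibble
class(df[, "year"])
class(tbl[, "year"])

# [[ ]] and $ return the column as a vector for both kinds of object
identical(df[["year"]], tbl[["year"]])
identical(df$state, tbl$state)
```

If you need a vector from a tibble, [[ ]] or $ is the reliable choice; if you need a one-column table from a data.frame, add drop = FALSE.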
While many online resources use the base R tools, the latest version of RStudio switched to these new readr data import tools, so we will use them in the class slides. They are also up to two times faster for reading in large datasets, and they show a progress bar, which is nice.
Here is how to read in the same dataset using base R functionality, which returns a data.frame directly:
dat2 = read.csv("../data/Youth_Tobacco_Survey_YTS_Data.csv", as.is = TRUE)
head(dat2)
YEAR LocationAbbr LocationDesc TopicType
1 2015 AZ Arizona Tobacco Use – Survey Data
2 2015 AZ Arizona Tobacco Use – Survey Data
3 2015 AZ Arizona Tobacco Use – Survey Data
4 2015 AZ Arizona Tobacco Use – Survey Data
5 2015 AZ Arizona Tobacco Use – Survey Data
6 2015 AZ Arizona Tobacco Use – Survey Data
TopicDesc
1 Cessation (Youth)
2 Cessation (Youth)
3 Cessation (Youth)
4 Cessation (Youth)
5 Cessation (Youth)
6 Cessation (Youth)
MeasureDesc DataSource
1 Percent of Current Smokers Who Want to Quit YTS
2 Percent of Current Smokers Who Want to Quit YTS
3 Percent of Current Smokers Who Want to Quit YTS
4 Quit Attempt in Past Year Among Current Cigarette Smokers YTS
5 Quit Attempt in Past Year Among Current Cigarette Smokers YTS
6 Quit Attempt in Past Year Among Current Cigarette Smokers YTS
Response Data_Value_Unit Data_Value_Type Data_Value
1 % Percentage NA
2 % Percentage NA
3 % Percentage NA
4 % Percentage NA
5 % Percentage NA
6 % Percentage NA
Data_Value_Footnote_Symbol
1 *
2 *
3 *
4 *
5 *
6 *
Data_Value_Footnote
1 Data in these cells have been suppressed because of a small sample size
2 Data in these cells have been suppressed because of a small sample size
3 Data in these cells have been suppressed because of a small sample size
4 Data in these cells have been suppressed because of a small sample size
5 Data in these cells have been suppressed because of a small sample size
6 Data in these cells have been suppressed because of a small sample size
Data_Value_Std_Err Low_Confidence_Limit High_Confidence_Limit
1 NA NA NA
2 NA NA NA
3 NA NA NA
4 NA NA NA
5 NA NA NA
6 NA NA NA
Sample_Size Gender Race Age Education
1 NA Overall All Races All Ages Middle School
2 NA Male All Races All Ages Middle School
3 NA Female All Races All Ages Middle School
4 NA Overall All Races All Ages Middle School
5 NA Male All Races All Ages Middle School
6 NA Female All Races All Ages Middle School
GeoLocation TopicTypeId TopicId MeasureId
1 (34.865970280000454, -111.76381127699972) BEH 105BEH 170CES
2 (34.865970280000454, -111.76381127699972) BEH 105BEH 170CES
3 (34.865970280000454, -111.76381127699972) BEH 105BEH 170CES
4 (34.865970280000454, -111.76381127699972) BEH 105BEH 169QUA
5 (34.865970280000454, -111.76381127699972) BEH 105BEH 169QUA
6 (34.865970280000454, -111.76381127699972) BEH 105BEH 169QUA
StratificationID1 StratificationID2 StratificationID3 StratificationID4
1 1GEN 8AGE 6RAC 1EDU
2 2GEN 8AGE 6RAC 1EDU
3 3GEN 8AGE 6RAC 1EDU
4 1GEN 8AGE 6RAC 1EDU
5 2GEN 8AGE 6RAC 1EDU
6 3GEN 8AGE 6RAC 1EDU
SubMeasureID DisplayOrder
1 YTS01 1
2 YTS02 2
3 YTS03 3
4 YTS04 4
5 YTS05 5
6 YTS06 6
We will use the TidyVerse readr functions because TidyVerse is the wave of the future.
nrow() displays the number of rows of a data frame
ncol() displays the number of columns
dim() displays a vector of length 2: # rows, # columns
colnames() displays the column names (if any) and rownames() displays the row names (if any)
dim(dat2)
[1] 9794 31
nrow(dat2)
[1] 9794
ncol(dat2)
[1] 31
colnames(dat2)
[1] "YEAR" "LocationAbbr"
[3] "LocationDesc" "TopicType"
[5] "TopicDesc" "MeasureDesc"
[7] "DataSource" "Response"
[9] "Data_Value_Unit" "Data_Value_Type"
[11] "Data_Value" "Data_Value_Footnote_Symbol"
[13] "Data_Value_Footnote" "Data_Value_Std_Err"
[15] "Low_Confidence_Limit" "High_Confidence_Limit"
[17] "Sample_Size" "Gender"
[19] "Race" "Age"
[21] "Education" "GeoLocation"
[23] "TopicTypeId" "TopicId"
[25] "MeasureId" "StratificationID1"
[27] "StratificationID2" "StratificationID3"
[29] "StratificationID4" "SubMeasureID"
[31] "DisplayOrder"
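The same functions can be tried on a tiny data frame where the answers are easy to check by eye. The data frame toy below is made up for illustration; it only borrows a few column names from the YTS file.

```r
# A made-up 3 x 3 data frame (not the real YTS data)
toy <- data.frame(
  YEAR         = c(2015, 2015, 2016),
  LocationAbbr = c("AZ", "GA", "AZ"),
  Data_Value   = c(NA, 12.5, 10.1)
)

nrow(toy)       # number of rows
ncol(toy)       # number of columns
dim(toy)        # rows first, then columns
colnames(toy)   # column names
rownames(toy)   # default row names are the row numbers, stored as character

# dim() is just nrow() and ncol() glued together
identical(dim(toy), c(nrow(toy), ncol(toy)))
```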
Changing variable names in data.frames works using the names() function, which is analogous to colnames() for data frames (they can be used interchangeably):
names(dat)[1] = "year"
names(dat)
[1] "year" "LocationAbbr"
[3] "LocationDesc" "TopicType"
[5] "TopicDesc" "MeasureDesc"
[7] "DataSource" "Response"
[9] "Data_Value_Unit" "Data_Value_Type"
[11] "Data_Value" "Data_Value_Footnote_Symbol"
[13] "Data_Value_Footnote" "Data_Value_Std_Err"
[15] "Low_Confidence_Limit" "High_Confidence_Limit"
[17] "Sample_Size" "Gender"
[19] "Race" "Age"
[21] "Education" "GeoLocation"
[23] "TopicTypeId" "TopicId"
[25] "MeasureId" "StratificationID1"
[27] "StratificationID2" "StratificationID3"
[29] "StratificationID4" "SubMeasureID"
[31] "DisplayOrder"
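Renaming by position, as in names(dat)[1], breaks silently if the column order ever changes. A slightly more defensive sketch looks the column up by its current name instead; the data frame toy and the new name state are made up for illustration.

```r
# Made-up data frame borrowing two YTS column names
toy <- data.frame(YEAR = c(2015, 2016), LocationAbbr = c("AZ", "GA"))

# Rename by matching the old name rather than relying on column position
names(toy)[names(toy) == "YEAR"] <- "year"

# colnames() works the same way on a data frame
colnames(toy)[colnames(toy) == "LocationAbbr"] <- "state"

names(toy)
```

If the old name is not present, the logical index is all FALSE and nothing is renamed, so it is worth checking names() afterwards.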
write_csv(x, path, na = "NA", append = FALSE)
?write_csv
x: the R object (tibble) to write
path: the file name (with absolute or, ideally, relative path); readr 1.4.0 and later name this argument file
names(dat)[1] = "Year"
write_csv(dat, path="YouthTobacco_newNames.csv")
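A quick way to convince yourself that the file was written correctly is to read it straight back in. The sketch below does this round trip on a made-up data frame in a temporary folder, assuming readr is installed. The file name is passed by position because readr renamed that argument from path to file in version 1.4.0, and the positional form works either way.

```r
library(readr)  # assumption: readr is installed

# Made-up data; the temporary file name is also just for illustration
toy <- data.frame(Year = c(2015, 2016), LocationAbbr = c("AZ", "GA"))
out_file <- file.path(tempdir(), "toy_newNames.csv")

write_csv(toy, out_file)     # second argument (the file name) passed by position
back <- read_csv(out_file)   # read the file back in as a tibble

names(back)                  # column names survive the round trip
nrow(back)                   # and so does the number of rows
```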
Excel to R options
read_csv() (after saving the sheet as a .csv file)
An add-on package: xlsx, openxlsx, readxl (can handle multiple worksheets), e.g. read_excel() from readxl
save R object(s) into an “R data file”: .rda or .RData
x <- 5
yts <- read_csv('../data/Youth_Tobacco_Survey_YTS_Data.csv')
save(yts, x, file = "yts_data.rda")
ls() lists the items in the workspace/environment and rm() removes them:
ls() # list things in the workspace
[1] "bad" "bogus" "cn" "dat" "dat2"
[6] "days" "df" "fe" "fit5" "fit6"
[11] "grouped" "hw1" "icol" "keep" "mat"
[16] "mods" "mydat" "show_keys" "simbias" "swiss"
[21] "x" "x1hist" "y" "yts" "z"
rm(list = c("x", "yts"))
ls()
[1] "bad" "bogus" "cn" "dat" "dat2"
[6] "days" "df" "fe" "fit5" "fit6"
[11] "grouped" "hw1" "icol" "keep" "mat"
[16] "mods" "mydat" "show_keys" "simbias" "swiss"
[21] "x1hist" "y" "z"
z <- load("yts_data.rda")
ls()
[1] "bad" "bogus" "cn" "dat" "dat2"
[6] "days" "df" "fe" "fit5" "fit6"
[11] "grouped" "hw1" "icol" "keep" "mat"
[16] "mods" "mydat" "show_keys" "simbias" "swiss"
[21] "x" "x1hist" "y" "yts" "z"
print(z)
[1] "yts" "x"
Note, z is a character vector of the names of the objects loaded, not the objects themselves.
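Because load() restores objects under their saved names, it can overwrite things already in your workspace. One cautious pattern, sketched here with made-up objects and a temporary file, is to load into a separate environment and inspect it first.

```r
# Made-up objects and a temporary file, for illustration only
x <- 5
y <- c("a", "b")
rda_file <- file.path(tempdir(), "demo.rda")
save(x, y, file = rda_file)

rm(x, y)                                    # remove them from the workspace

scratch <- new.env()                        # a separate place to load into
loaded <- load(rda_file, envir = scratch)   # load() returns the object names

loaded                                      # character vector of names, not the objects
ls(scratch)                                 # the objects themselves live in scratch
scratch$x                                   # access one of them
```

The return value plays the same role as z above: it tells you what was in the file, and the objects are retrieved by those names.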