Uploading & Cleaning Data

Data preparation is a crucial step in any data analysis workflow. In this guide, we explore the scgUtils R package, specifically focusing on three functions: get_data(), process_factors(), and get_file(). These functions streamline the process of loading, cleaning, and preprocessing data, thereby simplifying your workflow:

get_data(): A helper function to load datasets into your R environment, allowing you to name the data frame.
process_factors(): Cleans your data frames by removing unused factor levels, while keeping non-factor columns intact.
get_file(): Retrieves and processes files from various sources, including local storage, OneDrive, and the web, and performs initial preprocessing based on file type.

Uploading Package Data with `get_data()`

The get_data() function in scgUtils is designed to simplify the process of importing datasets into your R environment. This function is particularly useful when working with packaged data or when you want to load data seamlessly without worrying about file paths or data import syntax.

Example Usage:
Suppose you have a dataset named “survey” within your package or want to access a dataset from another package. The get_data() function makes this process straightforward:

# Loading a sample dataset
df <- get_data("survey")

# Inspecting the dataset's class and a categorical column
class(df)
#> [1] "tbl_df"     "tbl"        "data.frame"

class(df$turnoutUKGeneral)
#> [1] "haven_labelled" "vctrs_vctr"     "double"

head(df[, 1:6])

id	wt	turnoutUKGeneral	generalElectionVote	partyIdStrength	partyId
7	0.3755288	5	4	2	4
14	0.5528756	5	1	3	2
15	0.7122303	5	12	2	12
18	0.4396403	4	4	3	10
19	0.3613798	5	2	1	2
24	1.6864884	5	7	NA	10

The function loads the dataset and assigns it to a variable, in this case, df. It then allows you to perform initial inspections and manipulations. This functionality is essential for quickly setting up your data for analysis, especially in exploratory data analysis or educational settings.

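Additionally, notice that columns such as turnoutUKGeneral initially have the ‘haven_labelled’ class, which stores numeric codes alongside their value labels. To convert these columns to standard R types such as factors, we use the labelled package: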

# Converting 'haven_labelled' data to standard format
df <- labelled::unlabelled(df)

# View class
class(df)
#> [1] "tbl_df"     "tbl"        "data.frame"

class(df$turnoutUKGeneral)
#> [1] "factor"

head(df[, 1:6])

id	wt	turnoutUKGeneral	generalElectionVote	partyIdStrength	partyId
7	0.3755288	Very likely that I would vote	Scottish National Party (SNP)	Fairly strong	Scottish National Party (SNP)
14	0.5528756	Very likely that I would vote	Conservative	Not very strong	Labour
15	0.7122303	Very likely that I would vote	Brexit Party/Reform UK	Fairly strong	Brexit Party/Reform UK
18	0.4396403	Fairly likely	Scottish National Party (SNP)	Not very strong	No - none
19	0.3613798	Very likely that I would vote	Labour	Very strong	Labour
24	1.6864884	Very likely that I would vote	Green Party	NA	No - none
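For intuition about what a loader of this kind does, here is a minimal base-R sketch. It is not the scgUtils implementation: load_dataset() is a hypothetical helper, and it uses the built-in mtcars data from the datasets package as a stand-in because the “survey” dataset is only available with scgUtils installed.

```r
# Hypothetical helper (not part of scgUtils): load a packaged dataset into a
# private environment and return it, so the caller chooses the variable name.
load_dataset <- function(name, package = "datasets") {
  env <- new.env()
  utils::data(list = name, package = package, envir = env)
  get(name, envir = env)
}

# Stand-in example using a dataset that ships with base R
cars_df <- load_dataset("mtcars")
class(cars_df)
#> [1] "data.frame"
```

Returning the object, rather than attaching it to the global environment under its original name, is what lets you assign it to any name you like.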

Cleaning Data with `process_factors()`

The process_factors() function is a data-cleaning tool for factor variables. Datasets derived from surveys or categorisations often contain factor levels that no observation uses, for example a level that only served to screen out disqualified respondents.

These unused levels can be misleading and may affect analyses if not handled properly. The process_factors() function simplifies this task by automatically identifying and removing unused factor levels, streamlining your dataset for analysis.

Example Usage:
Consider a dataset with a factor variable ageGroup that includes levels like “Under 18”, “18-25”, “26-35”, etc. If your dataset does not have any entries for “Under 18”, this level is redundant.

# Examining factor levels before cleaning
levels(df$ageGroup)
#> [1] "Under 18" "18-25"    "26-35"    "36-45"    "46-55"    "56-65"   
#> [7] "66+"

By applying process_factors(), we can clean up these unused levels:

# Cleaning the dataset with process_factors
df <- process_factors(df)

# Verifying that "Under 18" is removed
levels(df$ageGroup)
#> [1] "18-25" "26-35" "36-45" "46-55" "56-65" "66+"

# Checking the metadata of the 'ageGroup' column
attr(df$ageGroup, "label")
#> [1] "Age group"

# Inspecting the class of cleaned columns
class(df)
#> [1] "data.frame"

class(df$turnoutUKGeneral)
#> [1] "factor"
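The behaviour shown above can be approximated in base R. The sketch below is an assumption about the general technique, not the actual scgUtils code: drop_unused_levels() is a hypothetical helper that applies droplevels() to each factor column, re-attaches the "label" attribute, and leaves non-factor columns untouched.

```r
# Hypothetical helper (not the scgUtils implementation): remove unused factor
# levels column by column while keeping each column's "label" attribute.
drop_unused_levels <- function(data) {
  for (col in names(data)) {
    if (is.factor(data[[col]])) {
      label <- attr(data[[col]], "label", exact = TRUE)  # NULL if no label
      data[[col]] <- droplevels(data[[col]])
      attr(data[[col]], "label") <- label
    }
  }
  data
}

# Toy data: "Under 18" is a declared level with no observations
toy <- data.frame(
  id = 1:4,
  ageGroup = factor(c("18-25", "26-35", "18-25", "26-35"),
                    levels = c("Under 18", "18-25", "26-35"))
)
attr(toy$ageGroup, "label") <- "Age group"

cleaned <- drop_unused_levels(toy)

levels(cleaned$ageGroup)
#> [1] "18-25" "26-35"

attr(cleaned$ageGroup, "label")
#> [1] "Age group"
```

The label is saved and restored explicitly because rebuilding a factor does not, in general, carry custom attributes across.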

Using `get_file()` for Diverse Data Sources

The get_file() function is a versatile tool for importing data into R from various sources. This function is particularly useful when working with data stored in different locations or formats. It not only retrieves the data but also performs initial preprocessing based on the file type, such as handling special characters in CSV files or dealing with complexities in .sav (SPSS) files.
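To illustrate what preprocessing “based on the file type” can look like, here is a minimal sketch that dispatches on the file extension. It is an assumption about the general approach rather than the get_file() source: read_by_extension() is a hypothetical helper, and the .sav branch only runs if the haven package is installed.

```r
# Hypothetical helper (not part of scgUtils): choose a reader from the
# file extension.
read_by_extension <- function(path) {
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    csv = utils::read.csv(path, stringsAsFactors = FALSE),
    sav = {
      if (!requireNamespace("haven", quietly = TRUE)) {
        stop("Package 'haven' is required to read .sav files")
      }
      haven::read_sav(path)
    },
    stop("Unsupported file type: ", ext)
  )
}

# Demonstration with a temporary CSV file
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = c(7, 14), wt = c(0.38, 0.55)), tmp, row.names = FALSE)

toy <- read_by_extension(tmp)
names(toy)
#> [1] "id" "wt"
```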

Local

For files stored locally, get_file() can directly access them given the correct path. This is useful for datasets stored within your project or elsewhere on your system.

df <- get_file(file_path = "inst/extdata/survey.sav",
               source = "local") # default

# View class
class(df)
#> [1] "data.frame"

class(df$turnoutUKGeneral)
#> [1] "factor"

head(df[, 1:6])

id	wt	turnoutUKGeneral	generalElectionVote	partyIdStrength	partyId
7	0.3755288	Very likely that I would vote	Scottish National Party (SNP)	Fairly strong	Scottish National Party (SNP)
14	0.5528756	Very likely that I would vote	Conservative	Not very strong	Labour
15	0.7122303	Very likely that I would vote	Brexit Party/Reform UK	Fairly strong	Brexit Party/Reform UK
18	0.4396403	Fairly likely	Scottish National Party (SNP)	Not very strong	No - none
19	0.3613798	Very likely that I would vote	Labour	Very strong	Labour
24	1.6864884	Very likely that I would vote	Green Party	NA	No - none

OneDrive

get_file() can also interface with OneDrive, allowing for seamless integration of data stored in the cloud. This feature is particularly useful for collaborative projects or when accessing data across multiple devices.

OneDrive Authentication:

df <- get_file(file_path = "scgUtils_examples_folder/survey.sav",
               source = "onedrive")

Figure: Microsoft Office 365 login screen.

After authentication, the file is downloaded and made available in your R environment.

#> Loading Microsoft Graph login for default tenant

# View class
class(df)
#> [1] "data.frame"

class(df$turnoutUKGeneral)
#> [1] "factor"

head(df[, 1:6])

id	wt	turnoutUKGeneral	generalElectionVote	partyIdStrength	partyId
7	0.3755288	Very likely that I would vote	Scottish National Party (SNP)	Fairly strong	Scottish National Party (SNP)
14	0.5528756	Very likely that I would vote	Conservative	Not very strong	Labour
15	0.7122303	Very likely that I would vote	Brexit Party/Reform UK	Fairly strong	Brexit Party/Reform UK
18	0.4396403	Fairly likely	Scottish National Party (SNP)	Not very strong	No - none
19	0.3613798	Very likely that I would vote	Labour	Very strong	Labour
24	1.6864884	Very likely that I would vote	Green Party	NA	No - none

Websites

Retrieving files from web sources is another key feature. This allows you to directly import datasets hosted online without the need to download them manually.

df <- get_file(file_path = "https://github.com/sarahcgall/scgUtils/blob/master/inst/extdata/survey.csv",
               source = "web")

# View class
class(df)
#> [1] "tbl_df"     "tbl"        "data.frame"

class(df$turnoutUKGeneral)
#> [1] "character"

Troubleshooting .sav Files with `get_file()`

Working with .sav files (SPSS format) can sometimes lead to challenges due to their complex structure and encoding. The get_file() function in the scgUtils package, which utilises the haven package for handling .sav files, is well-equipped to manage these challenges. However, users may occasionally encounter issues related to encoding or formatting.

Common Issues and Solutions:

Encoding Errors:

- Problem: .sav files may contain characters or symbols not correctly encoded, leading to warnings or errors during the import process.
- Solution: The get_file() function tries to manage these by attempting to read the file with different encodings. If the default reading fails, it attempts with ‘latin1’ encoding. This approach handles a wide range of encoding issues that are commonly encountered in .sav files.
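The retry logic described above can be sketched with tryCatch(). This is an illustration under stated assumptions, not the get_file() source: read_with_fallback() is a hypothetical helper, and fake_reader() is a stand-in that fails unless an encoding is supplied, so the example runs without a real .sav file. In practice the reader could be haven::read_sav, which accepts an encoding argument.

```r
# Hypothetical helper: try the reader with its defaults, then retry with a
# fallback encoding if the first attempt raises an error.
read_with_fallback <- function(path, reader, fallback_encoding = "latin1") {
  tryCatch(
    reader(path),
    error = function(e) {
      message("Default read failed (", conditionMessage(e),
              "); retrying with ", fallback_encoding)
      reader(path, encoding = fallback_encoding)
    }
  )
}

# Stand-in reader for illustration only: errors unless an encoding is given
fake_reader <- function(path, encoding = NULL) {
  if (is.null(encoding)) stop("invalid byte sequence")
  data.frame(path = path, encoding = encoding, stringsAsFactors = FALSE)
}

result <- read_with_fallback("survey.sav", fake_reader)
result$encoding
#> [1] "latin1"
```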

Handling NA Values:

- Problem: SPSS files often have unique representations for missing values, which may not align with R’s standard NA.
- Solution: The get_file() function includes steps to handle these discrepancies. For instance, if a .sav file represents missing values as the string “__NA__”, the following code snippet can be used to convert them to R’s NA:

# Handling NA values represented as "__NA__" in .sav files
df[df == "__NA__"] <- NA

This step ensures that R recognises missing values correctly, allowing for accurate data analysis and manipulation.
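One detail worth knowing when the affected column is a factor: replacing the values with NA does not remove “__NA__” from the factor's levels. The toy example below (made-up data, base R only) shows the replacement followed by an unused-level cleanup, here with droplevels() as a stand-in for process_factors().

```r
# Toy data with "__NA__" placeholders in a factor and a character column
toy <- data.frame(
  vote = factor(c("Labour", "__NA__", "Conservative")),
  region = c("Scotland", "Wales", "__NA__"),
  stringsAsFactors = FALSE
)

# Convert the placeholder strings to real missing values
toy[toy == "__NA__"] <- NA

colSums(is.na(toy))           # one missing value per column

# "__NA__" is still a declared (now unused) level of the factor
"__NA__" %in% levels(toy$vote)
#> [1] TRUE

# Dropping unused levels removes it
levels(droplevels(toy$vote))
#> [1] "Conservative" "Labour"
```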

Best Practices for Troubleshooting:

Check File Formatting: Before attempting to import a .sav file, ensure it is correctly formatted. Pay special attention to any proprietary encoding or unique representations of data within the file.
Read Warnings Carefully: When using get_file(), carefully read any warnings or error messages that appear. These messages can provide vital clues for troubleshooting and resolving issues.
Consult Documentation: For more detailed guidance on handling specific .sav file issues, refer to both the scgUtils and haven package documentation. These resources can offer additional insights and solutions for complex cases.
