

Data preparation is a crucial step in any data analysis workflow. This guide covers three functions from the scgUtils R package: get_data(), process_factors(), and get_file(). Together they streamline loading, cleaning, and preprocessing data:

  • get_data(): A helper function to load datasets into your R environment, allowing you to name the data frame.
  • process_factors(): Cleans your data frames by removing unused factor levels, while keeping non-factor columns intact.
  • get_file(): Retrieves and processes files from various sources, including local storage, OneDrive, and the web, and performs initial preprocessing based on file type.


Loading Package Data with get_data()

The get_data() function imports datasets into your R environment. It is particularly useful for packaged data, where you want to load a dataset without worrying about file paths or import syntax.

Example Usage:
Suppose you have a dataset named “survey” within your package or want to access a dataset from another package. The get_data() function makes this process straightforward:

# Loading a sample dataset
df <- get_data("survey")

# Inspecting the dataset's class and a categorical column
class(df)
#> [1] "tbl_df"     "tbl"        "data.frame"

class(df$turnoutUKGeneral)
#> [1] "haven_labelled" "vctrs_vctr"     "double"

head(df[, 1:6])
#> id        wt turnoutUKGeneral generalElectionVote partyIdStrength partyId
#>  7 0.3755288                5                   4               2       4
#> 14 0.5528756                5                   1               3       2
#> 15 0.7122303                5                  12               2      12
#> 18 0.4396403                4                   4               3      10
#> 19 0.3613798                5                   2               1       2
#> 24 1.6864884                5                   7              NA      10

The function returns the dataset, which is assigned here to df and can then be inspected and manipulated. This makes it quick to set up data for analysis, especially in exploratory or teaching settings.
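
For readers curious what a loader of this kind might do internally, here is a minimal base-R sketch. It is an assumption for illustration only, not the scgUtils source: load_packaged_data() is a hypothetical name, and the example uses the mtcars dataset that ships with R's datasets package rather than the survey data.

```r
# Hypothetical helper (not part of scgUtils): load a dataset that ships
# with an installed package and return it, so the caller picks the name.
load_packaged_data <- function(name, package) {
  env <- new.env()
  # data() places the object into `env` instead of the global environment
  utils::data(list = name, package = package, envir = env)
  get(name, envir = env)
}

# Usage: the caller decides what the data frame is called
cars_df <- load_packaged_data("mtcars", "datasets")
class(cars_df)
```

Returning the object, rather than attaching it under a fixed name, is what lets you choose the variable name at the call site.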

Note that columns such as turnoutUKGeneral in the survey data initially have the ‘haven_labelled’ class. To convert them to standard R types (here, factors), we use the labelled package:

# Converting 'haven_labelled' data to standard format
df <- labelled::unlabelled(df)

# View class
class(df)
#> [1] "tbl_df"     "tbl"        "data.frame"

class(df$turnoutUKGeneral)
#> [1] "factor"

head(df[, 1:6])
#> id  wt         turnoutUKGeneral               generalElectionVote            partyIdStrength  partyId
#>  7  0.3755288  Very likely that I would vote  Scottish National Party (SNP)  Fairly strong    Scottish National Party (SNP)
#> 14  0.5528756  Very likely that I would vote  Conservative                   Not very strong  Labour
#> 15  0.7122303  Very likely that I would vote  Brexit Party/Reform UK         Fairly strong    Brexit Party/Reform UK
#> 18  0.4396403  Fairly likely                  Scottish National Party (SNP)  Not very strong  No - none
#> 19  0.3613798  Very likely that I would vote  Labour                         Very strong      Labour
#> 24  1.6864884  Very likely that I would vote  Green Party                    NA               No - none
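
To see roughly what this conversion involves, the base-R sketch below mimics it on a toy vector. It assumes, as haven does, that value labels are stored in a "labels" attribute as a named vector. labels_to_factor() is a hypothetical helper, and labelled::unlabelled() covers more cases than this sketch does.

```r
# Hypothetical helper: turn a labelled numeric vector into a factor by
# using the stored value labels as factor levels.
labels_to_factor <- function(x) {
  value_labels <- attr(x, "labels", exact = TRUE)
  factor(as.vector(x),                      # strip attributes from the values
         levels = unname(value_labels),     # the numeric codes
         labels = names(value_labels))      # the human-readable labels
}

# Toy data using two codes that appear in the survey output above
turnout <- c(5, 5, 4, 5)
attr(turnout, "labels") <- c("Fairly likely" = 4,
                             "Very likely that I would vote" = 5)

turnout_factor <- labels_to_factor(turnout)
levels(turnout_factor)
#> [1] "Fairly likely"                 "Very likely that I would vote"
```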



Cleaning Data with process_factors()

The process_factors() function is a data-cleaning tool for factor variables. Datasets derived from surveys or categorisations often contain factor levels that are never used (for example, a level that served only to disqualify respondents).

These unused levels can be misleading and may affect analyses if not handled properly. The process_factors() function simplifies this task by automatically identifying and removing unused factor levels, streamlining your dataset for analysis.

Example Usage:
Consider a dataset with a factor variable ageGroup that includes levels like “Under 18”, “18-25”, “26-35”, etc. If your dataset does not have any entries for “Under 18”, this level is redundant.

# Examining factor levels before cleaning
levels(df$ageGroup)
#> [1] "Under 18" "18-25"    "26-35"    "36-45"    "46-55"    "56-65"   
#> [7] "66+"

By applying process_factors(), we can clean up these unused levels:

# Cleaning the dataset with process_factors
df <- process_factors(df)

# Verifying that "Under 18" is removed
levels(df$ageGroup)
#> [1] "18-25" "26-35" "36-45" "46-55" "56-65" "66+"

# Checking the metadata of the 'ageGroup' column
attr(df$ageGroup, "label")
#> [1] "Age group"

# Inspecting the class of cleaned columns
class(df)
#> [1] "data.frame"

class(df$turnoutUKGeneral)
#> [1] "factor"
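
The behaviour shown above (unused levels removed, the "label" attribute kept, non-factor columns untouched) can be approximated in base R. The sketch below is an assumption about the general technique, not the scgUtils implementation, and drop_unused_levels() is a hypothetical name.

```r
# Hypothetical helper: drop unused levels from every factor column while
# keeping any "label" attribute; other columns are left as they are.
drop_unused_levels <- function(data) {
  for (col in names(data)) {
    if (is.factor(data[[col]])) {
      label <- attr(data[[col]], "label", exact = TRUE)
      data[[col]] <- droplevels(data[[col]])
      # Restore the label explicitly so it survives the rebuild
      attr(data[[col]], "label") <- label
    }
  }
  data
}

# Toy data: "Under 18" is a declared level with no observations
toy <- data.frame(
  id = 1:3,
  ageGroup = factor(c("18-25", "26-35", "18-25"),
                    levels = c("Under 18", "18-25", "26-35"))
)
attr(toy$ageGroup, "label") <- "Age group"

cleaned <- drop_unused_levels(toy)
levels(cleaned$ageGroup)
#> [1] "18-25" "26-35"
attr(cleaned$ageGroup, "label")
#> [1] "Age group"
```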



Using get_file() for Diverse Data Sources

The get_file() function is a versatile tool for importing data into R from various sources. This function is particularly useful when working with data stored in different locations or formats. It not only retrieves the data but also performs initial preprocessing based on the file type, such as handling special characters in CSV files or dealing with complexities in .sav (SPSS) files.
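
As a rough illustration of preprocessing "based on the file type", the sketch below dispatches on the file extension. It is a simplified assumption, not how get_file() is actually written: read_by_extension() is a hypothetical helper, and the .sav branch only runs if the haven package is installed.

```r
# Hypothetical helper: choose a reader from the file extension.
read_by_extension <- function(path) {
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    csv = utils::read.csv(path, stringsAsFactors = FALSE),
    sav = {
      if (!requireNamespace("haven", quietly = TRUE)) {
        stop("Package 'haven' is needed to read .sav files")
      }
      haven::read_sav(path)
    },
    stop("Unsupported file type: ", ext)
  )
}

# Usage with a temporary CSV file
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = c(7, 14), wt = c(0.38, 0.55)), tmp, row.names = FALSE)
csv_df <- read_by_extension(tmp)
names(csv_df)
#> [1] "id" "wt"
```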

Local

For files stored locally, get_file() can directly access them given the correct path. This is useful for datasets stored within your project or elsewhere on your system.

df <- get_file(file_path = "inst/extdata/survey.sav",
               source = "local") # default

# View class
class(df)
#> [1] "data.frame"

class(df$turnoutUKGeneral)
#> [1] "factor"

head(df[, 1:6])
#> id  wt         turnoutUKGeneral               generalElectionVote            partyIdStrength  partyId
#>  7  0.3755288  Very likely that I would vote  Scottish National Party (SNP)  Fairly strong    Scottish National Party (SNP)
#> 14  0.5528756  Very likely that I would vote  Conservative                   Not very strong  Labour
#> 15  0.7122303  Very likely that I would vote  Brexit Party/Reform UK         Fairly strong    Brexit Party/Reform UK
#> 18  0.4396403  Fairly likely                  Scottish National Party (SNP)  Not very strong  No - none
#> 19  0.3613798  Very likely that I would vote  Labour                         Very strong      Labour
#> 24  1.6864884  Very likely that I would vote  Green Party                    NA               No - none


OneDrive

get_file() can also interface with OneDrive, allowing for seamless integration of data stored in the cloud. This feature is particularly useful for collaborative projects or when accessing data across multiple devices.

OneDrive Authentication:

df <- get_file(file_path = "scgUtils_examples_folder/survey.sav",
               source = "onedrive")
[Image: Microsoft Office 365 login screen]


After authentication, the file is downloaded and made available in your R environment.

#> Loading Microsoft Graph login for default tenant

# View class
class(df)
#> [1] "data.frame"

class(df$turnoutUKGeneral)
#> [1] "factor"

head(df[, 1:6])
#> id  wt         turnoutUKGeneral               generalElectionVote            partyIdStrength  partyId
#>  7  0.3755288  Very likely that I would vote  Scottish National Party (SNP)  Fairly strong    Scottish National Party (SNP)
#> 14  0.5528756  Very likely that I would vote  Conservative                   Not very strong  Labour
#> 15  0.7122303  Very likely that I would vote  Brexit Party/Reform UK         Fairly strong    Brexit Party/Reform UK
#> 18  0.4396403  Fairly likely                  Scottish National Party (SNP)  Not very strong  No - none
#> 19  0.3613798  Very likely that I would vote  Labour                         Very strong      Labour
#> 24  1.6864884  Very likely that I would vote  Green Party                    NA               No - none


Websites

Retrieving files from web sources is another key feature. This allows you to directly import datasets hosted online without the need to download them manually.

df <- get_file(file_path = "https://github.com/sarahcgall/scgUtils/blob/master/inst/extdata/survey.csv",
               source = "web")

# View class
class(df)
#> [1] "tbl_df"     "tbl"        "data.frame"

class(df$turnoutUKGeneral)
#> [1] "character"
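
Note that the URL above is a GitHub "blob" link, which in a browser points to an HTML page rather than the raw file. Whether get_file() rewrites such links internally is not stated here. If you ever need the raw-file form yourself, a minimal sketch (github_blob_to_raw() is a hypothetical helper, not part of scgUtils) is:

```r
# Hypothetical helper: rewrite a GitHub "blob" URL to its raw-content form.
# URLs that do not match the pattern are returned unchanged.
github_blob_to_raw <- function(url) {
  sub("^https://github\\.com/([^/]+)/([^/]+)/blob/",
      "https://raw.githubusercontent.com/\\1/\\2/",
      url)
}

raw_url <- github_blob_to_raw(
  "https://github.com/sarahcgall/scgUtils/blob/master/inst/extdata/survey.csv"
)
raw_url
#> [1] "https://raw.githubusercontent.com/sarahcgall/scgUtils/master/inst/extdata/survey.csv"
```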



Troubleshooting .sav Files with get_file()

Working with .sav files (SPSS format) can sometimes lead to challenges due to their complex structure and encoding. The get_file() function in the scgUtils package, which utilises the haven package for handling .sav files, is well-equipped to manage these challenges. However, users may occasionally encounter issues related to encoding or formatting.

Common Issues and Solutions:

  1. Encoding Errors:
    • Problem: .sav files may contain characters or symbols not correctly encoded, leading to warnings or errors during the import process.
    • Solution: The get_file() function tries to manage these by attempting to read the file with different encodings. If the default reading fails, it attempts with ‘latin1’ encoding. This approach handles a wide range of encoding issues that are commonly encountered in .sav files.
  2. Handling NA Values:
    • Problem: SPSS files often have unique representations for missing values, which may not align with R’s standard NA.
    • Solution: The get_file() function includes steps to handle these discrepancies. For instance, if a .sav file represents missing values as “__NA__”, the following code snippet can be used to convert them to R’s NA:
# Handling NA values represented as "__NA__" in .sav files
df[df == "__NA__"] <- NA
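
If the data frame mixes factor, character, and numeric columns, a column-by-column variant can be easier to reason about. The sketch below is one possible approach under that assumption, not code taken from scgUtils, and replace_sentinel_na() is a hypothetical helper. For factors it also removes the sentinel from the level set.

```r
# Hypothetical helper: convert a sentinel string to NA, one column at a time.
replace_sentinel_na <- function(data, sentinel = "__NA__") {
  for (col in names(data)) {
    x <- data[[col]]
    if (is.factor(x)) {
      # Setting a level to NA removes it and turns matching entries into NA
      levels(x)[levels(x) == sentinel] <- NA
    } else if (is.character(x)) {
      x[!is.na(x) & x == sentinel] <- NA
    }
    data[[col]] <- x
  }
  data
}

# Toy data with a sentinel in a factor and in a character column
toy <- data.frame(
  vote = factor(c("Labour", "__NA__", NA, "Green Party")),
  note = c("ok", "__NA__", NA, "ok"),
  wt   = c(0.4, 0.6, 1.2, 0.9),
  stringsAsFactors = FALSE
)

fixed <- replace_sentinel_na(toy)
is.na(fixed$vote)
#> [1] FALSE  TRUE  TRUE FALSE
levels(fixed$vote)
#> [1] "Green Party" "Labour"
```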


This step ensures that R recognises missing values correctly, allowing for accurate data analysis and manipulation.
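
Returning to the encoding fallback described under Encoding Errors, the general retry pattern can be sketched as follows. This is an assumption about the technique, not the get_file() source: read_with_fallback() is a hypothetical helper, and a stand-in reader is used so the example runs without an actual .sav file. With haven installed, haven::read_sav (which accepts an encoding argument) could be passed as the reader instead.

```r
# Hypothetical helper: try a reader with each encoding in turn and return
# the first result that does not raise an error.
read_with_fallback <- function(path, reader, encodings = list(NULL, "latin1")) {
  last_error <- NULL
  for (enc in encodings) {
    result <- tryCatch(reader(path, encoding = enc), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    last_error <- result
  }
  stop("Could not read '", path, "': ", conditionMessage(last_error))
}

# Stand-in reader that fails with the default encoding, for demonstration
fake_reader <- function(path, encoding = NULL) {
  if (is.null(encoding)) stop("invalid multibyte string")
  data.frame(path = path, encoding = encoding, stringsAsFactors = FALSE)
}

out <- read_with_fallback("survey.sav", fake_reader)
out$encoding
#> [1] "latin1"
```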

Best Practices for Troubleshooting:

  • Check File Formatting: Before attempting to import a .sav file, ensure it is correctly formatted. Pay special attention to any proprietary encoding or unique representations of data within the file.
  • Read Warnings Carefully: When using get_file(), carefully read any warnings or error messages that appear. These messages can provide vital clues for troubleshooting and resolving issues.
  • Consult Documentation: For more detailed guidance on handling specific .sav file issues, refer to both the scgUtils and haven package documentation. These resources can offer additional insights and solutions for complex cases.