Using scgUtils with Survey Data

This article walks through uploading, processing, and exploring survey data with the scgUtils package in R, from the initial loading of the data through to analysis and visualisation.

Step 1: Upload the Sample Dataset

The process begins with loading your dataset. The scgUtils package offers two main functions for this purpose: get_data() and get_file().

Using get_data(): This function is ideal for loading datasets directly from R packages. It streamlines the process of importing and naming your dataset in the R environment.

# Example of loading and preprocessing a dataset
library(scgUtils)
library(dplyr) # provides the %>% pipe used throughout this article

df <- get_data("survey") %>%
  labelled::unlabelled() %>% # Convert 'haven_labelled' data to standard format
  process_factors() # Remove unused factor levels

Note: The sample data is a subset of the British Election Study. For the full data, visit the British Election Study website.
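To illustrate what "removing unused factor levels" means in the pipeline above, here is a minimal base R sketch using droplevels(). It assumes process_factors() behaves in a broadly similar way, which is not confirmed here; the toy values are invented for illustration.

```r
# Toy factor with a level that never occurs in the data (invented values)
vote <- factor(c("Labour", "Conservative", "Labour"),
               levels = c("Conservative", "Labour", "Green Party"))

levels(vote)                   # still lists the unused "Green Party" level
vote_clean <- droplevels(vote)
levels(vote_clean)             # only levels that appear in the data remain

# droplevels() also works on a whole data frame, factor column by factor column
toy <- data.frame(id = 1:3, vote = vote)
toy_clean <- droplevels(toy)
levels(toy_clean$vote)
```

Unused levels matter in survey work because they otherwise show up as empty rows in frequency tables and empty categories in plots.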

Using get_file(): When working with external data files, such as .sav or .csv, get_file() becomes invaluable. It not only imports the data but also preprocesses it: it handles special characters and converts specialised data types, such as haven_labelled, into standard R formats, making it a robust choice for various data sources.

# Using get_file(), which includes the preprocessing shown above
df <- get_file("inst/extdata/survey.sav")

head(df[, 1:6])

id	wt	turnoutUKGeneral	generalElectionVote	partyIdStrength	partyId
7	0.3755288	Very likely that I would vote	Scottish National Party (SNP)	Fairly strong	Scottish National Party (SNP)
14	0.5528756	Very likely that I would vote	Conservative	Not very strong	Labour
15	0.7122303	Very likely that I would vote	Brexit Party/Reform UK	Fairly strong	Brexit Party/Reform UK
18	0.4396403	Fairly likely	Scottish National Party (SNP)	Not very strong	No - none
19	0.3613798	Very likely that I would vote	Labour	Very strong	Labour
24	1.6864884	Very likely that I would vote	Green Party	NA	No - none

For detailed uploading instructions, refer to the Uploading & Cleaning Data article.

Step 2: Viewing the Full Dataset

After uploading the data, it’s important to understand its structure and content. The sjPlot package’s view_df() function provides an interactive HTML view of your dataset, allowing for an immediate and comprehensive examination of the data’s attributes, frequencies, and percentages. This step is crucial for identifying the nature of variables, understanding their distribution, and planning further data processing strategies.

sjPlot::view_df(df[, 1:10],  # NB: only the first 10 variables are shown in this example
                weight.by = "wt",
                show.type = TRUE, # show whether variable is numeric or categorical
                show.wtd.frq = TRUE, # display weighted frequency
                show.wtd.prc = TRUE # display weighted %
)

Step 3: Processing the Data

Survey data often requires specific processing steps to ensure it is analysis-ready. This may involve creating new variables, recoding factors, handling missing values, and more.
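The processing steps named above can be sketched in base R as follows. This is a minimal illustration on an invented toy data frame (the column names mirror the sample data, the values do not), not a prescribed scgUtils workflow; the age bands and the "Other" grouping are arbitrary choices made for the example.

```r
# Toy survey extract; column names mirror the sample data, values are invented
toy <- data.frame(
  age     = c(23, 41, 67, 58, NA),
  partyId = factor(c("Labour", "No - none", "Conservative", "Labour", "Green Party"))
)

# 1. Create a new variable: age bands from a numeric column
toy$ageGroup <- cut(toy$age,
                    breaks = c(18, 35, 55, Inf),
                    labels = c("18-34", "35-54", "55+"),
                    right = FALSE) # intervals are [18, 35), [35, 55), [55, Inf)

# 2. Recode a factor: collapse smaller parties into "Other"
toy$partyGroup <- as.character(toy$partyId)
toy$partyGroup[!toy$partyGroup %in% c("Conservative", "Labour", "No - none")] <- "Other"
toy$partyGroup <- factor(toy$partyGroup)

# 3. Handle missing values: count them, then keep rows with a known age
sum(is.na(toy$age))
toy_complete <- toy[!is.na(toy$age), ]
```

Whether to drop, impute, or keep missing values as their own category depends on the question being asked; dropping rows is shown only because it is the simplest option.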

Step 4: Weighting the Data

Survey data analysis sometimes necessitates weighting to address issues like sample design or response biases. To understand and implement weighting, view the Weighting Your Data article.
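As a rough intuition for what a weight does, the sketch below computes simple cell weights (population share divided by sample share) in base R. It is not taken from the Weighting Your Data article and does not use any scgUtils function; the sample and the population targets are invented for illustration.

```r
# Toy sample: 6 men and 4 women (invented values)
toy <- data.frame(gender = c(rep("Male", 6), rep("Female", 4)),
                  stringsAsFactors = FALSE)

# Hypothetical population targets; real targets would come from census data or similar
target <- c(Male = 0.49, Female = 0.51)

# Observed share of each group in the sample
sample_share <- prop.table(table(toy$gender))

# Weight = population share / sample share, looked up per respondent by name
toy$wt <- as.numeric(target[toy$gender] / sample_share[toy$gender])

# Weighted shares now match the targets, and the weights average to 1
shares <- tapply(toy$wt, toy$gender, sum) / sum(toy$wt)
shares
mean(toy$wt)
```

Over-represented groups receive weights below 1 and under-represented groups receive weights above 1, which is why weighted and unweighted results can differ noticeably.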

Step 5: Exploring the Data

Exploring survey data effectively demands a nuanced approach to both numeric and categorical data. The scgUtils package, complemented by base R functionalities, offers a comprehensive toolkit for this exploration.

Numeric data

Numeric data, such as age, income, or survey ratings, can reveal significant trends and patterns when analysed correctly.

Summary Statistics
Begin with summary() for a quick overview, offering key statistical measures.

summary(df$age)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   18.00   55.00   66.00   63.58   73.00   93.00

For group-wise insights, tapply() combined with summary() allows you to dissect the data based on categories like gender or education level, providing a clearer understanding of distribution across different segments.

# By group:
tapply(df$age, df$gender, summary)
#> $Male
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   30.00   56.00   66.00   64.15   74.00   93.00 
#> 
#> $Female
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   18.00   54.00   66.00   62.97   73.00   93.00

Mean Calculation
Average calculations, both weighted and unweighted, are crucial in survey analysis. Use mean() for unweighted averages and weighted.mean() when survey weights need to be applied.

# Unweighted:
mean(df$age)
#> [1] 63.5824

# Weighted:
weighted.mean(df$age, df$wt)
#> [1] 60.17577
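For reference, the weighted mean is the sum of each value multiplied by its weight, divided by the sum of the weights. The sketch below checks this against weighted.mean() and shows how missing values are handled; the ages and weights are invented toy values, not taken from the sample data.

```r
# Toy values (invented): ages and survey weights for five respondents
age <- c(25, 40, 62, 71, NA)
wt  <- c(1.5, 1.0, 0.5, 0.8, 1.2)

# Manual calculation: sum(w * x) / sum(w), using respondents with a known age
ok <- !is.na(age)
manual <- sum(wt[ok] * age[ok]) / sum(wt[ok])

# na.rm = TRUE drops missing x values together with their weights
builtin <- weighted.mean(age, wt, na.rm = TRUE)

all.equal(manual, builtin)

# Without na.rm, a single missing age makes the result NA
weighted.mean(age, wt)
```

Note that na.rm = TRUE only removes missing values in the variable being averaged; missing weights still produce NA and need to be dealt with separately.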

Grouped Mean with grp_mean()
For advanced analysis, grp_mean() elegantly calculates group-wise means. It simplifies the process of aggregating data across one or more categorical variables, offering an efficient alternative to more verbose methods like those in dplyr.

# By a single group:
grp_mean(df,
         meanVar = "age",
         groups = "gender",
         weight = "wt" # optional
)

# `dplyr` equivalent:
# df %>%
#   group_by(gender) %>%
#   summarise(Mean = weighted.mean(age, wt)) %>%
#   ungroup()

gender	Mean
Male	60.14175
Female	60.21559

# By many groups:
grp_mean(df,
         meanVar = "age",
         groups = c("gender", "partyId"),
         weight = "wt", # optional
         set_names = c("Gender", "Party Identification", "Average Age"), # optional: change names
         round_decimals = 2 # optional: round decimal places to 2 digits
) %>%
  head()

# `dplyr` equivalent:
# df %>%
#   group_by(gender, partyId) %>%
#   summarise(`Average Age` = round(weighted.mean(age, wt), 2)) %>%
#   ungroup() %>%
#   rename(Gender = gender, `Party Identification` = partyId) %>%
#   head()

Gender	Party Identification	Average Age
Male	Conservative	64.60
Female	Conservative	65.86
Male	Labour	59.81
Female	Labour	59.29
Male	Liberal Democrat	60.44
Female	Liberal Democrat	59.71
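For readers who prefer to avoid extra dependencies, a grouped weighted mean can also be sketched in base R with split() and sapply(). This illustrates the calculation only and is not how grp_mean() is implemented; the toy values are invented.

```r
# Toy data (invented values); column names mirror the sample dataset
toy <- data.frame(
  gender = c("Male", "Male", "Female", "Female"),
  age    = c(30, 60, 40, 50),
  wt     = c(1, 3, 2, 2),
  stringsAsFactors = FALSE
)

# Split the rows by group, then take a weighted mean within each piece
by_gender <- split(toy, toy$gender)
means <- sapply(by_gender, function(g) weighted.mean(g$age, g$wt))

# Tidy into a data frame with one row per group
result <- data.frame(gender = names(means), Mean = unname(means))
result
```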

Categorical data

Categorical data typically includes demographics or multiple-choice responses. Analysing these effectively unlocks insights into respondent behaviours and preferences.

Grouped Frequencies with grp_freq()
grp_freq() returns frequency and percentage breakdowns across one or more groups. It handles both weighted and unweighted data.

# By a single group:
grp_freq(df,
         groups = "partyId",
         weight = "wt", # optional
         addPercent = TRUE # optional
) %>%
  head()

# `dplyr` equivalent:
# df %>%
#   group_by(partyId) %>%
#   summarise(Freq = sum(wt)) %>%
#   ungroup() %>%
#   mutate(Perc = Freq / sum(Freq) * 100) %>%
#   head()

partyId	Freq	Perc
Conservative	1167.9802	29.264551
Labour	963.5791	24.143139
Liberal Democrat	232.9101	5.835724
Scottish National Party (SNP)	101.8282	2.551375
Plaid Cymru	14.6336	0.366655
Green Party	100.3786	2.515055

# By many groups:
grp_freq(df,
         groups = c("partyId", "gender"),
         weight = "wt", # optional
         groupsPercent = "partyId", # optional
         round_decimals = 2 # optional: round decimal places to 2 digits
) %>%
  head()

# `dplyr` equivalent:
# df %>%
#   group_by(partyId, gender) %>%
#   summarise(Freq = sum(wt)) %>%
#   ungroup() %>%
#   group_by(partyId) %>%
#   mutate(Perc = round(Freq / sum(Freq) * 100, 2), Freq = round(Freq, 2)) %>%
#   ungroup() %>%
#   head()

partyId	gender	Freq	Perc
Conservative	Male	663.95	56.85
Labour	Male	496.94	51.57
Liberal Democrat	Male	129.14	55.45
Scottish National Party (SNP)	Male	48.73	47.85
Plaid Cymru	Male	8.81	60.20
Green Party	Male	60.48	60.25
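The same kind of table (weighted frequencies, with percentages calculated within one grouping variable) can be sketched in base R using xtabs() and prop.table(). This is an illustration of the calculation on invented toy data, not a description of how grp_freq() works internally.

```r
# Toy data (invented values); column names mirror the sample dataset
toy <- data.frame(
  partyId = c("Labour", "Labour", "Labour", "Conservative", "Conservative"),
  gender  = c("Male", "Female", "Female", "Male", "Female"),
  wt      = c(1.0, 0.5, 0.5, 1.5, 0.5),
  stringsAsFactors = FALSE
)

# Weighted frequencies: xtabs() sums the left-hand side within each cell
freq <- xtabs(wt ~ partyId + gender, data = toy)

# Percentages within each party (each row sums to 100)
perc <- round(prop.table(freq, margin = 1) * 100, 2)

# Long format, one row per party/gender combination
out <- merge(as.data.frame(freq, responseName = "Freq"),
             as.data.frame(perc, responseName = "Perc"))
out
```

Changing margin = 1 to margin = 2 would instead give percentages within each gender, which answers a different question, so it is worth stating the base explicitly when reporting results.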

Crosstabs
Crosstabulation is a fundamental technique in survey analysis, especially when examining relationships between categorical variables. For an overview of using crosstab functions within the scgUtils package, view the Conducting Cross-Tabulation Analysis article.

Grid data

Grid questions, common in surveys, pose unique analytical challenges due to their format.

Handling Grid Data with grid_vars()
grid_vars() is tailor-made for such data, turning complex grid questions into analysable formats. It’s particularly adept at handling “select all that apply” questions or grid-type responses, transforming them into a format conducive to comparison and visualisation with libraries such as ggplot2.

# Create a named list of the columns that relate to the question
vars <- list(likeSunak = "Rishi Sunak",
             likeStarmer = "Keir Starmer",
             likeCon = "Conservative Party",
             likeLab = "Labour Party",
             likeLD = "Lib Dems",
             likeSNP = "SNP",
             likePC = "Plaid Cymru",
             likeBrexitParty = "Brexit Party",
             likeGrn = "Green Party"
)

grid_vars(df,
          vars = vars,
          weight = "wt" # optional
) %>%
  head()

Question	Response	Freq	Perc
Brexit Party	Strongly dislike	1369.44	34.31
Conservative Party	Strongly dislike	1260.06	31.57
Green Party	Strongly dislike	810.10	20.30
Keir Starmer	Strongly dislike	850.59	21.31
Labour Party	Strongly dislike	839.85	21.04
Lib Dems	Strongly dislike	776.15	19.45
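To show the kind of transformation involved, the base R sketch below stacks a small grid question from wide to long format and computes weighted frequencies and percentages per item. It is a conceptual illustration on invented toy data and makes no claim about the internals of grid_vars().

```r
# Toy grid question (invented values): one column per item, same response scale
toy <- data.frame(
  likeCon = c("Strongly dislike", "Like", "Strongly dislike"),
  likeLab = c("Like", "Like", "Strongly dislike"),
  wt      = c(1.0, 2.0, 1.0),
  stringsAsFactors = FALSE
)
vars <- list(likeCon = "Conservative Party", likeLab = "Labour Party")

# Stack the item columns into long format: one row per respondent and item
long <- do.call(rbind, lapply(names(vars), function(col) {
  data.frame(Question = vars[[col]], Response = toy[[col]], wt = toy$wt,
             stringsAsFactors = FALSE)
}))

# Weighted frequency per item and response
out <- aggregate(wt ~ Question + Response, data = long, FUN = sum)
names(out)[names(out) == "wt"] <- "Freq"

# Percentage of each item's weighted total
out$Perc <- round(100 * out$Freq / ave(out$Freq, out$Question, FUN = sum), 2)
out
```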

Implementing grid_vars() by Group
Enhance the functionality of grid_vars() by applying it with a group variable. This allows for dissecting responses across different demographic or categorical segments, providing richer, more targeted insights.

grid_vars(df,
          vars = vars,
          group = "gender", # optional
          weight = "wt" # optional
) %>%
  head()

Question	Response	gender	Freq	Perc
Brexit Party	Strongly dislike	Male	825.56	38.36
Conservative Party	Strongly dislike	Male	706.28	32.81
Green Party	Strongly dislike	Male	548.62	25.49
Keir Starmer	Strongly dislike	Male	511.98	23.79
Labour Party	Strongly dislike	Male	478.74	22.24
Lib Dems	Strongly dislike	Male	498.82	23.18
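Because the output is in long format, it can be passed straight to ggplot2. The sketch below plots the six rows shown above (Male respondents, "Strongly dislike"), typed in by hand so that the example is self-contained; in practice the data frame returned by grid_vars() would be used directly. The choice of a horizontal bar chart is an assumption made for illustration.

```r
library(ggplot2)

# The six rows of grouped output shown above (Male, "Strongly dislike")
grid_out <- data.frame(
  Question = c("Brexit Party", "Conservative Party", "Green Party",
               "Keir Starmer", "Labour Party", "Lib Dems"),
  Response = "Strongly dislike",
  gender   = "Male",
  Perc     = c(38.36, 32.81, 25.49, 23.79, 22.24, 23.18)
)

# Horizontal bars, ordered by percentage
p <- ggplot(grid_out, aes(x = reorder(Question, Perc), y = Perc)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Weighted % (Male respondents)",
       title = "Share answering \"Strongly dislike\"")
p
```

With the full output, mapping Response to fill (or faceting by gender) would extend the same plot to all response categories and groups.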

Step 6: Visualising the Data

Effective visualisation is key in survey data analysis, offering a way to intuitively understand and communicate complex data patterns. The scgUtils package provides specialised functions like plot_popn() for demographic analysis and plot_sankey() for flow visualisation, helping you to not only understand your data but also to present it in a compelling and insightful manner.

For a broader spectrum of visualisation techniques and detailed guidance on effectively using colour in your plots, refer to the Visualising Data article and Mastering Colour Selection article. These resources provide additional insights into making the most of the scgUtils package for visualising complex survey data.

Step 7: Presenting the Results

Presenting the results of your survey analysis in a clear and impactful way is crucial. The scgUtils package offers functionalities that aid in creating detailed and informative presentations.

Tables

Tables are fundamental tools for presenting complex data in a structured and easily interpretable format. View the Conducting Cross-Tabulation Analysis article for more information on how to compile tables.

PowerPoint Integration

In future updates, scgUtils aims to incorporate capabilities for directly exporting analysis results into PowerPoint presentations. This functionality will facilitate seamless integration of your data findings into professional and engaging presentation formats, suitable for various audiences.

Interactive Dashboards

Another upcoming feature is the ability to create interactive dashboards directly from your survey data. Dashboards offer a dynamic way to explore and present data, allowing users to interact with the information, drill down into specifics, and gain a deeper understanding of the underlying patterns. This functionality will be a significant enhancement, providing a powerful tool for data storytelling and decision-making processes.