
Using scgUtils with Survey Data
Source:vignettes/articles/using-survey-data.rmd
This article walks through uploading, processing, and exploring survey
data with the scgUtils package in R, from the initial loading of the
data through to analysis and visualisation.
Step 1: Upload the Sample Dataset
The process begins with loading your dataset. The scgUtils package
offers two main functions for this purpose: get_data() and get_file().
Using get_data(): This function is ideal for loading datasets directly
from R packages. It streamlines the process of importing and naming your
dataset in the R environment.
# Example of loading and preprocessing a dataset
library(scgUtils)
library(dplyr) # provides the %>% pipe used throughout this article
df <- get_data("survey") %>%
  labelled::unlabelled() %>% # Convert 'haven_labelled' data to standard format
  process_factors() # Remove unused factor levels
Note: The sample data is a subset of the British Election Study. For
the full data, visit the British Election Study.
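As a small illustration of what "remove unused factor levels" means, the base R sketch below drops a level that no longer occurs after subsetting. It uses droplevels() on a made-up factor; it is not the implementation of process_factors(), whose internals are not shown in this article.

```r
# Hypothetical factor with a level that disappears after subsetting
party <- factor(c("Labour", "Conservative", "Green Party", "Labour"))
subset_party <- party[party != "Green Party"]

# The unused level is still attached to the factor
levels(subset_party)

# droplevels() keeps only the levels that actually occur
cleaned <- droplevels(subset_party)
levels(cleaned)
```

Unused levels matter because they otherwise show up as empty rows in frequency tables and empty categories in plots.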
Using get_file(): When working with external data files, such as .sav
or .csv, get_file() becomes invaluable. It not only imports the data but
also preprocesses it, handling special characters and converting
specialised data types, such as haven_labelled, into standard R formats,
making it a robust choice for various data sources.
# Using get_file(), which includes the preprocessing shown above
df <- get_file("inst/extdata/survey.sav")
head(df[, 1:6])
id | wt | turnoutUKGeneral | generalElectionVote | partyIdStrength | partyId |
---|---|---|---|---|---|
7 | 0.3755288 | Very likely that I would vote | Scottish National Party (SNP) | Fairly strong | Scottish National Party (SNP) |
14 | 0.5528756 | Very likely that I would vote | Conservative | Not very strong | Labour |
15 | 0.7122303 | Very likely that I would vote | Brexit Party/Reform UK | Fairly strong | Brexit Party/Reform UK |
18 | 0.4396403 | Fairly likely | Scottish National Party (SNP) | Not very strong | No - none |
19 | 0.3613798 | Very likely that I would vote | Labour | Very strong | Labour |
24 | 1.6864884 | Very likely that I would vote | Green Party | NA | No - none |
For detailed uploading instructions, refer to the Uploading & Cleaning
Data article.
Step 2: Viewing the Full Dataset
After uploading the data, it’s important to understand its structure
and content. The sjPlot package’s view_df() function provides an
interactive HTML view of your dataset, allowing for an immediate and
comprehensive examination of the data’s attributes, frequencies, and
percentages. This step is crucial for identifying the nature of
variables, understanding their distribution, and planning further data
processing strategies.
sjPlot::view_df(df[, 1:10], # NB: only the first 10 variables are shown in this example
                weight.by = "wt",
                show.type = TRUE, # show whether variable is numeric or categorical
                show.wtd.frq = TRUE, # display weighted frequency
                show.wtd.prc = TRUE # display weighted %
)
Step 3: Processing the Data
Survey data often requires specific processing steps to ensure it is analysis-ready. This may involve creating new variables, recoding factors, handling missing values, and more.
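The processing steps named above can be sketched in base R. The example below is a minimal illustration on made-up data: the data frame, column names, and age bands are all hypothetical, and it does not use any scgUtils function.

```r
# Hypothetical toy data; column names are illustrative only
toy <- data.frame(
  age = c(23, 47, NA, 68, 35),
  partyId = c("Labour", "Conservative", "Labour", "No - none", "Green Party"),
  stringsAsFactors = FALSE
)

# Create a new variable: banded age groups
toy$ageGroup <- cut(toy$age,
                    breaks = c(17, 34, 54, Inf),
                    labels = c("18-34", "35-54", "55+"))

# Recode a factor: collapse smaller categories into "Other"
toy$partyGroup <- toy$partyId
toy$partyGroup[!toy$partyGroup %in% c("Labour", "Conservative")] <- "Other"
toy$partyGroup <- factor(toy$partyGroup,
                         levels = c("Conservative", "Labour", "Other"))

# Handle missing values: count them, then keep rows with a known age
n_missing <- sum(is.na(toy$age))
toy_complete <- toy[!is.na(toy$age), ]
```

Whether to drop, impute, or keep missing values as their own category depends on the question being analysed, so the last step is one option among several.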
Step 4: Weighting the Data
Survey data analysis sometimes necessitates weighting to address issues like sample design or response biases. To understand and implement weighting, view the Weighting Your Data article.
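To give a feel for what a weight does, here is a minimal sketch of simple cell weighting on a single variable in base R. The sample and the population targets are invented for illustration, and real weighting schemes (as covered in the Weighting Your Data article) are usually more involved.

```r
# Hypothetical sample: men are over-represented relative to the target
toy <- data.frame(gender = c("Male", "Male", "Male", "Female", "Female"),
                  stringsAsFactors = FALSE)

# Assumed population shares (illustrative numbers only)
target <- c(Male = 0.5, Female = 0.5)

# Share of each category in the sample
sample_share <- prop.table(table(toy$gender))

# Weight = population share / sample share, looked up per respondent
toy$wt <- as.numeric(target[toy$gender] / sample_share[toy$gender])

# Weighted shares now match the targets
weighted_share <- tapply(toy$wt, toy$gender, sum) / sum(toy$wt)
weighted_share
```

Over-represented respondents receive weights below 1 and under-represented respondents receive weights above 1, while the weights still sum to the sample size.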
Step 5: Exploring the Data
Exploring survey data effectively demands a nuanced approach to both
numeric and categorical data. The scgUtils package, complemented by
base R functionalities, offers a comprehensive toolkit for this
exploration.
Numeric data
Numeric data, such as age, income, or survey ratings, can reveal
significant trends and patterns when analysed correctly.
Summary Statistics
Begin with summary() for a quick overview, offering key statistical
measures.
summary(df$age)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 18.00 55.00 66.00 63.58 73.00 93.00
For group-wise insights, tapply() combined with summary() allows you to
dissect the data based on categories like gender or education level,
providing a clearer understanding of distribution across different
segments.
# By group:
tapply(df$age, df$gender, summary)
#> $Male
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 30.00 56.00 66.00 64.15 74.00 93.00
#>
#> $Female
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 18.00 54.00 66.00 62.97 73.00 93.00
Mean Calculation
Average calculations, both weighted and unweighted, are crucial in
survey analysis. Use mean() for simple averages and weighted.mean()
when survey weights need to be applied.
# Unweighted:
mean(df$age)
#> [1] 63.5824
# Weighted:
weighted.mean(df$age, df$wt)
#> [1] 60.17577
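The weighted figure above differs from the unweighted one because each respondent contributes in proportion to their weight. As a quick check of the arithmetic, the sketch below compares weighted.mean() with the definition sum(w * x) / sum(w), using made-up values rather than the survey data.

```r
# Hypothetical values and weights
x <- c(20, 40, 60)
w <- c(0.5, 1.0, 2.5)

# Weighted mean by definition: sum of weight * value over sum of weights
manual <- sum(w * x) / sum(w)

# Built-in equivalent from the stats package
builtin <- weighted.mean(x, w)

all.equal(manual, builtin)
```

Note that weighted.mean() accepts na.rm = TRUE to drop missing values in x, but missing weights are not handled specially and lead to a missing result.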
Grouped Mean with grp_mean()
For advanced analysis, grp_mean() elegantly calculates group-wise
means. It simplifies the process of aggregating data across one or more
categorical variables, offering an efficient alternative to more verbose
methods like those in dplyr.
# By a single group:
grp_mean(df,
         meanVar = "age",
         groups = "gender",
         weight = "wt" # optional
)
# `dplyr` equivalent:
# df %>%
#   group_by(gender) %>%
#   summarise(Mean = weighted.mean(age, wt)) %>%
#   ungroup()
gender | Mean |
---|---|
Male | 60.14175 |
Female | 60.21559 |
# By many groups:
grp_mean(df,
         meanVar = "age",
         groups = c("gender", "partyId"),
         weight = "wt", # optional
         set_names = c("Gender", "Party Identification", "Average Age"), # optional: change names
         round_decimals = 2 # optional: round decimal places to 2 digits
) %>%
  head()
# `dplyr` equivalent:
# df %>%
#   group_by(gender, partyId) %>%
#   summarise(`Average Age` = weighted.mean(age, wt)) %>%
#   ungroup() %>%
#   rename(Gender = gender, `Party Identification` = partyId) %>%
#   head()
Gender | Party Identification | Average Age |
---|---|---|
Male | Conservative | 64.60 |
Female | Conservative | 65.86 |
Male | Labour | 59.81 |
Female | Labour | 59.29 |
Male | Liberal Democrat | 60.44 |
Female | Liberal Democrat | 59.71 |
Categorical data
Categorical data typically includes demographics or multiple-choice
responses. Analysing these effectively unlocks insights into respondent
behaviours and preferences.
Grouped Frequencies with grp_freq()
grp_freq() shines in its ability to provide detailed frequency and
percentage breakdowns across various groups. It’s capable of handling
both weighted and unweighted data, adding depth and precision to your
categorical data analysis.
# By a single group:
grp_freq(df,
         groups = "partyId",
         weight = "wt", # optional
         addPercent = TRUE # optional
) %>%
  head()
# `dplyr` equivalent:
# df %>%
#   group_by(partyId) %>%
#   summarise(Freq = sum(wt)) %>%
#   ungroup() %>%
#   mutate(Perc = Freq / sum(Freq) * 100) %>% # percentages on a 0-100 scale, as in the output
#   head()
partyId | Freq | Perc |
---|---|---|
Conservative | 1167.9802 | 29.264551 |
Labour | 963.5791 | 24.143139 |
Liberal Democrat | 232.9101 | 5.835724 |
Scottish National Party (SNP) | 101.8282 | 2.551375 |
Plaid Cymru | 14.6336 | 0.366655 |
Green Party | 100.3786 | 2.515055 |
# By many groups:
grp_freq(df,
         groups = c("partyId", "gender"),
         weight = "wt", # optional
         groupsPercent = "partyId", # optional
         round_decimals = 2 # optional: round decimal places to 2 digits
) %>%
  head()
# `dplyr` equivalent:
# df %>%
#   group_by(partyId, gender) %>%
#   summarise(Freq = sum(wt)) %>%
#   ungroup() %>%
#   group_by(partyId) %>%
#   mutate(Perc = Freq / sum(Freq) * 100) %>% # percentages within each partyId
#   ungroup() %>%
#   head()
partyId | gender | Freq | Perc |
---|---|---|---|
Conservative | Male | 663.95 | 56.85 |
Labour | Male | 496.94 | 51.57 |
Liberal Democrat | Male | 129.14 | 55.45 |
Scottish National Party (SNP) | Male | 48.73 | 47.85 |
Plaid Cymru | Male | 8.81 | 60.20 |
Green Party | Male | 60.48 | 60.25 |
Crosstabs
Crosstabulation is a fundamental technique in survey analysis,
especially when examining relationships between categorical variables.
For an overview of using crosstab functions within the scgUtils package,
view the Conducting Cross-Tabulation Analysis article.
Grid data
Grid questions, common in surveys, pose unique analytical challenges
due to their format.
Handling Grid Data with grid_vars()
grid_vars() is tailor-made for such data, turning complex grid
questions into analysable formats. It’s particularly adept at handling
“select all that apply” questions or grid-type responses, transforming
them into a format conducive to comparison and visualisation with
libraries such as ggplot2.
# Create a named list of the columns that relate to the question
vars <- list(likeSunak = "Rishi Sunak",
             likeStarmer = "Keir Starmer",
             likeCon = "Conservative Party",
             likeLab = "Labour Party",
             likeLD = "Lib Dems",
             likeSNP = "SNP",
             likePC = "Plaid Cymru",
             likeBrexitParty = "Brexit Party",
             likeGrn = "Green Party"
)
grid_vars(df,
          vars = vars,
          weight = "wt" # optional
) %>%
  head()
Question | Response | Freq | Perc |
---|---|---|---|
Brexit Party | Strongly dislike | 1369.44 | 34.31 |
Conservative Party | Strongly dislike | 1260.06 | 31.57 |
Green Party | Strongly dislike | 810.10 | 20.30 |
Keir Starmer | Strongly dislike | 850.59 | 21.31 |
Labour Party | Strongly dislike | 839.85 | 21.04 |
Lib Dems | Strongly dislike | 776.15 | 19.45 |
Implementing grid_vars() by Group
Enhance the functionality of grid_vars() by applying it with a group
variable. This allows for dissecting responses across different
demographic or categorical segments, providing richer, more targeted
insights.
Question | Response | gender | Freq | Perc |
---|---|---|---|---|
Brexit Party | Strongly dislike | Male | 825.56 | 38.36 |
Conservative Party | Strongly dislike | Male | 706.28 | 32.81 |
Green Party | Strongly dislike | Male | 548.62 | 25.49 |
Keir Starmer | Strongly dislike | Male | 511.98 | 23.79 |
Labour Party | Strongly dislike | Male | 478.74 | 22.24 |
Lib Dems | Strongly dislike | Male | 498.82 | 23.18 |
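As a rough illustration of what a grouped grid summary involves, the sketch below reshapes a small wide-format grid into long format and computes weighted percentages by group in base R. It is not the implementation of grid_vars(): the toy data is invented, and the percentage base (within each question and gender) is an assumption about how the figures above are derived.

```r
# Hypothetical wide-format grid data: one column per grid item
toy <- data.frame(
  gender  = c("Male", "Male", "Female", "Female"),
  wt      = c(1, 1, 2, 2),
  likeCon = c("Dislike", "Like", "Dislike", "Dislike"),
  likeLab = c("Like", "Like", "Dislike", "Like"),
  stringsAsFactors = FALSE
)
vars <- list(likeCon = "Conservative Party", likeLab = "Labour Party")

# Stack the grid columns: one row per respondent per grid item
long <- do.call(rbind, lapply(names(vars), function(v) {
  data.frame(Question = vars[[v]], Response = toy[[v]],
             gender = toy$gender, wt = toy$wt, stringsAsFactors = FALSE)
}))

# Weighted frequency for each Question x Response x gender cell
out <- aggregate(wt ~ Question + Response + gender, data = long, FUN = sum)
names(out)[names(out) == "wt"] <- "Freq"

# Percentage within each Question x gender combination (assumed base)
out$Perc <- round(100 * out$Freq /
                    ave(out$Freq, out$Question, out$gender, FUN = sum), 2)
out
```

The long format, with one row per question, response, and group, is what makes the result straightforward to pass to plotting libraries such as ggplot2.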
Step 6: Visualising the Data
Effective visualisation is key in survey data analysis, offering a way
to intuitively understand and communicate complex data patterns. The
scgUtils package provides specialised functions like plot_popn() for
demographic analysis and plot_sankey() for flow visualisation, helping
you to not only understand your data but also to present it in a
compelling and insightful manner.
For a broader spectrum of visualisation techniques and detailed
guidance on effectively using colour in your plots, refer to the
Visualising Data article and Mastering Colour Selection article. These
resources provide additional insights into making the most of the
scgUtils package for visualising complex survey data.
Step 7: Presenting the Results
Presenting the results of your survey analysis in a clear and impactful
way is crucial. The scgUtils package offers functionalities that aid in
creating detailed and informative presentations.
Tables
Tables are fundamental tools for presenting complex data in a
structured and easily interpretable format. View the Conducting
Cross-Tabulation Analysis article for more information on how to compile
tables.
PowerPoint Integration
In future updates, scgUtils aims to incorporate capabilities for
directly exporting analysis results into PowerPoint presentations. This
functionality will facilitate seamless integration of your data findings
into professional and engaging presentation formats, suitable for
various audiences.
Interactive Dashboards
Another upcoming feature is the ability to create interactive dashboards directly from your survey data. Dashboards offer a dynamic way to explore and present data, allowing users to interact with the information, drill down into specifics, and gain a deeper understanding of the underlying patterns. This functionality will be a significant enhancement, providing a powerful tool for data storytelling and decision-making processes.