This article walks through uploading, processing, and exploring survey data with the scgUtils package in R, from the initial loading of the data through to analysis and visualisation.

Step 1: Upload the Sample Dataset

The process begins with loading your dataset. The scgUtils package offers two main functions for this purpose: get_data() and get_file().

Using get_data(): This function is ideal for loading datasets directly from R packages. It streamlines the process of importing and naming your dataset in the R environment.

# Example of loading and preprocessing a dataset
library(scgUtils)
library(dplyr) # provides the %>% pipe used throughout

df <- get_data("survey") %>%
  labelled::unlabelled() %>% # Convert 'haven_labelled' data to standard format
  process_factors() # Remove unused factor levels

Note: The sample data is a subset of the British Election Study. For the full data, visit the British Election Study.

Using get_file(): When working with external data files, such as .sav or .csv, get_file() becomes invaluable. It not only imports the data but also preprocesses it, handling special characters and converting specialised data types, such as haven_labelled, into standard R formats. This makes it a robust choice for a variety of data sources.

# Using get_file(), which includes the above preprocessing
df <- get_file("inst/extdata/survey.sav")

head(df[, 1:6])
#> id | wt | turnoutUKGeneral | generalElectionVote | partyIdStrength | partyId
#> 7 | 0.3755288 | Very likely that I would vote | Scottish National Party (SNP) | Fairly strong | Scottish National Party (SNP)
#> 14 | 0.5528756 | Very likely that I would vote | Conservative | Not very strong | Labour
#> 15 | 0.7122303 | Very likely that I would vote | Brexit Party/Reform UK | Fairly strong | Brexit Party/Reform UK
#> 18 | 0.4396403 | Fairly likely | Scottish National Party (SNP) | Not very strong | No - none
#> 19 | 0.3613798 | Very likely that I would vote | Labour | Very strong | Labour
#> 24 | 1.6864884 | Very likely that I would vote | Green Party | NA | No - none

For detailed uploading instructions, refer to the Uploading & Cleaning Data article.


Step 2: Viewing the Full Dataset

After uploading the data, it’s important to understand its structure and content. The sjPlot package’s view_df() function provides an interactive HTML view of your dataset, allowing for an immediate and comprehensive examination of the data’s attributes, frequencies, and percentages. This step is crucial for identifying the nature of variables, understanding their distribution, and planning further data processing strategies.

sjPlot::view_df(df[, 1:10],  # NB: only the first 10 variables are shown in this example
                weight.by = "wt",
                show.type = TRUE, # show whether variable is numeric or categorical
                show.wtd.frq = TRUE, # display weighted frequency
                show.wtd.prc = TRUE # display weighted %
)

Step 3: Processing the Data

Survey data often requires specific processing steps to ensure it is analysis-ready. This may involve creating new variables, recoding factors, handling missing values, and more.
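
As a minimal sketch of these steps in base R, assuming a small made-up data frame (the `toy` data and the derived column names below are hypothetical and not part of the scgUtils sample data):

```r
# Hypothetical toy data; column names and values are illustrative only
toy <- data.frame(
  age     = c(23, 41, 67, NA, 35),
  partyId = factor(c("Labour", "Conservative", "No - none", "Labour", "Green Party"))
)

# Create a new variable: band ages into groups
toy$ageGroup <- cut(toy$age,
                    breaks = c(17, 34, 54, Inf),
                    labels = c("18-34", "35-54", "55+"))

# Recode a factor: collapse the smaller categories into "Other"
toy$partyGroup <- as.character(toy$partyId)
toy$partyGroup[!toy$partyGroup %in% c("Labour", "Conservative")] <- "Other"
toy$partyGroup <- factor(toy$partyGroup,
                         levels = c("Conservative", "Labour", "Other"))

# Handle missing values: keep rows with a recorded age,
# then drop any factor levels that are no longer used
toy_complete <- droplevels(toy[!is.na(toy$age), ])
```

Whether to drop, impute, or keep missing values as their own category depends on the question being asked; dropping rows is used here only because it is the simplest to show.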


Step 4: Weighting the Data

Survey data analysis sometimes necessitates weighting to address issues like sample design or response biases. To understand and implement weighting, view the Weighting Your Data article.
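
To illustrate the idea, here is a rough sketch of a simple post-stratification weight in base R, where each respondent's weight is their group's population share divided by its sample share. The data and the population shares are made up for illustration, and this is not necessarily the approach described in the Weighting Your Data article.

```r
# Hypothetical sample: 6 men and 4 women
toy <- data.frame(gender = c(rep("Male", 6), rep("Female", 4)),
                  stringsAsFactors = FALSE)

# Assumed population shares (illustrative figures only)
popn_share <- c(Male = 0.49, Female = 0.51)

# Share of each respondent's group in the sample (0.6 for men, 0.4 for women)
sample_share <- ave(rep(1, nrow(toy)), toy$gender, FUN = sum) / nrow(toy)

# Weight = population share / sample share
toy$wt <- unname(popn_share[toy$gender]) / sample_share

# After weighting, the group shares match the assumed population shares
weighted_share <- tapply(toy$wt, toy$gender, sum) / sum(toy$wt)
```

Over-represented groups receive weights below 1 and under-represented groups receive weights above 1, so the weights in this sketch still sum to the sample size.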


Step 5: Exploring the Data

Exploring survey data effectively demands a nuanced approach to both numeric and categorical data. The scgUtils package, complemented by base R functionalities, offers a comprehensive toolkit for this exploration.

Numeric data

Numeric data, such as age, income, or survey ratings, can reveal significant trends and patterns when analysed correctly.

Summary Statistics
Begin with summary() for a quick overview, offering key statistical measures.

summary(df$age)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   18.00   55.00   66.00   63.58   73.00   93.00


For group-wise insights, tapply() combined with summary() allows you to dissect the data based on categories like gender or education level, providing a clearer understanding of distribution across different segments.

# By group:
tapply(df$age, df$gender, summary)
#> $Male
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   30.00   56.00   66.00   64.15   74.00   93.00 
#> 
#> $Female
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   18.00   54.00   66.00   62.97   73.00   93.00


Mean Calculation
Average calculations, both weighted and unweighted, are crucial in survey analysis. Utilise mean() for simple averages and weighted.mean() for more complex scenarios where survey design needs to be accounted for.

# Unweighted:
mean(df$age)
#> [1] 63.5824

# Weighted:
weighted.mean(df$age, df$wt)
#> [1] 60.17577
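
For intuition, the weighted mean is the sum of each value multiplied by its weight, divided by the total weight. The short sketch below checks this against weighted.mean() using made-up values (`x`, `w`, and `x_na` are hypothetical) and shows how missing values can be handled.

```r
# Hypothetical values and weights, for illustration only
x <- c(20, 40, 60)
w <- c(0.5, 1.0, 2.5)

# Manual calculation: weight-scaled sum divided by the total weight
manual <- sum(x * w) / sum(w)
builtin <- weighted.mean(x, w)

# With missing values, na.rm = TRUE drops each NA in x together with its weight
x_na <- c(20, NA, 60)
mean_na <- weighted.mean(x_na, w, na.rm = TRUE)
```

Note that na.rm = TRUE only handles missing values in the data vector; a missing weight still produces NA.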


Grouped Mean with grp_mean()
For advanced analysis, grp_mean() elegantly calculates group-wise means. It simplifies the process of aggregating data across one or more categorical variables, offering an efficient alternative to more verbose methods like those in dplyr.

# By a single group:
grp_mean(df,
         meanVar = "age",
         groups = "gender",
         weight = "wt" # optional
)

# `dplyr` equivalent:
# df %>%
#   group_by(gender) %>%
#   summarise(Mean = weighted.mean(age, wt)) %>%
#   ungroup()
#> gender | Mean
#> Male | 60.14175
#> Female | 60.21559

# By many groups:
grp_mean(df,
         meanVar = "age",
         groups = c("gender", "partyId"),
         weight = "wt", # optional
         set_names = c("Gender", "Party Identification", "Average Age"), # optional: change names
         round_decimals = 2 # optional: round decimal places to 2 digits
) %>%
  head()

# `dplyr` equivalent:
# df %>%
#   group_by(gender, partyId) %>%
#   summarise(`Average Age` = weighted.mean(age, wt)) %>%
#   ungroup() %>%
#   rename(Gender = gender, `Party Identification` = partyId) %>%
#   head()
#> Gender | Party Identification | Average Age
#> Male | Conservative | 64.60
#> Female | Conservative | 65.86
#> Male | Labour | 59.81
#> Female | Labour | 59.29
#> Male | Liberal Democrat | 60.44
#> Female | Liberal Democrat | 59.71


Categorical data

Categorical data typically includes demographics or multiple-choice responses. Analysing these effectively unlocks insights into respondent behaviours and preferences.

Grouped Frequencies with grp_freq()
grp_freq() shines in its ability to provide detailed frequency and percentage breakdowns across various groups. It’s capable of handling both weighted and unweighted data, adding depth and precision to your categorical data analysis.

# By a single group:
grp_freq(df,
         groups = "partyId",
         weight = "wt", # optional
         addPercent = TRUE # optional
) %>%
  head()

# `dplyr` equivalent:
# df %>%
#   group_by(partyId) %>%
#   summarise(Freq = sum(wt)) %>%
#   ungroup() %>%
#   mutate(Perc = Freq / sum(Freq) * 100) %>% # express as a percentage
#   head()
#> partyId | Freq | Perc
#> Conservative | 1167.9802 | 29.264551
#> Labour | 963.5791 | 24.143139
#> Liberal Democrat | 232.9101 | 5.835724
#> Scottish National Party (SNP) | 101.8282 | 2.551375
#> Plaid Cymru | 14.6336 | 0.366655
#> Green Party | 100.3786 | 2.515055

# By many groups:
grp_freq(df,
         groups = c("partyId", "gender"),
         weight = "wt", # optional
         groupsPercent = "partyId", # optional
         round_decimals = 2 # optional: round decimal places to 2 digits
) %>%
  head()

# `dplyr` equivalent:
# df %>%
#   group_by(partyId, gender) %>%
#   summarise(Freq = sum(wt)) %>%
#   ungroup() %>%
#   group_by(partyId) %>%
#   mutate(Perc = Freq / sum(Freq) * 100) %>% # percentage within each party
#   ungroup() %>%
#   head()
#> partyId | gender | Freq | Perc
#> Conservative | Male | 663.95 | 56.85
#> Labour | Male | 496.94 | 51.57
#> Liberal Democrat | Male | 129.14 | 55.45
#> Scottish National Party (SNP) | Male | 48.73 | 47.85
#> Plaid Cymru | Male | 8.81 | 60.20
#> Green Party | Male | 60.48 | 60.25


Crosstabs
Crosstabulation is a fundamental technique in survey analysis, especially when examining relationships between categorical variables. For an overview of using crosstab functions within the scgUtils package, view the Conducting Cross-Tabulation Analysis article.
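
As a point of comparison, a weighted crosstab can also be sketched in base R with xtabs() and prop.table(). The example below uses made-up data (`toy` and its values are hypothetical) and does not reproduce the scgUtils crosstab functions.

```r
# Hypothetical toy data; values and weights are illustrative only
toy <- data.frame(
  partyId = c("Labour", "Labour", "Conservative", "Conservative", "Conservative"),
  gender  = c("Male", "Female", "Male", "Female", "Female"),
  wt      = c(1.0, 3.0, 2.0, 1.0, 1.0)
)

# Weighted counts: the left-hand side of the formula supplies the weights
tab <- xtabs(wt ~ partyId + gender, data = toy)

# Row percentages: each party's row sums to 100
row_pct <- round(prop.table(tab, margin = 1) * 100, 2)
```

Using margin = 2 instead gives column percentages, which answers a different question: the party split within each gender rather than the gender split within each party.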

Grid data

Grid questions, common in surveys, pose unique analytical challenges due to their format.

Handling Grid Data with grid_vars()
grid_vars() is tailor-made for such data, turning complex grid questions into analysable formats. It’s particularly adept at handling “select all that apply” questions or grid-type responses, transforming them into a format conducive to comparison and visualisation with libraries such as ggplot2.

# Create a named list of the columns that relate to the question
vars <- list(likeSunak = "Rishi Sunak",
             likeStarmer = "Keir Starmer",
             likeCon = "Conservative Party",
             likeLab = "Labour Party",
             likeLD = "Lib Dems",
             likeSNP = "SNP",
             likePC = "Plaid Cymru",
             likeBrexitParty = "Brexit Party",
             likeGrn = "Green Party"
)

grid_vars(df,
          vars = vars,
          weight = "wt" # optional
) %>%
  head()
#> Question | Response | Freq | Perc
#> Brexit Party | Strongly dislike | 1369.44 | 34.31
#> Conservative Party | Strongly dislike | 1260.06 | 31.57
#> Green Party | Strongly dislike | 810.10 | 20.30
#> Keir Starmer | Strongly dislike | 850.59 | 21.31
#> Labour Party | Strongly dislike | 839.85 | 21.04
#> Lib Dems | Strongly dislike | 776.15 | 19.45
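
To make the reshaping idea concrete, the following base R sketch stacks two grid columns into long format and computes weighted frequencies and percentages per question. It uses made-up data and is only an illustration of the general technique, not the internal implementation of grid_vars(); the `toy` and `long` objects are hypothetical.

```r
# Hypothetical toy data: two grid items answered on the same scale
toy <- data.frame(
  likeCon = c("Like", "Dislike", "Dislike", "Like"),
  likeLab = c("Dislike", "Like", "Like", "Like"),
  wt      = c(1, 2, 1, 1),
  stringsAsFactors = FALSE
)
vars <- c(likeCon = "Conservative Party", likeLab = "Labour Party")

# Stack the grid columns into long format: one row per respondent and question
long <- data.frame(
  Question = rep(unname(vars), each = nrow(toy)),
  Response = unlist(toy[names(vars)], use.names = FALSE),
  wt       = rep(toy$wt, times = length(vars)),
  stringsAsFactors = FALSE
)

# Weighted frequency per question and response
out <- aggregate(wt ~ Question + Response, data = long, FUN = sum)
names(out)[names(out) == "wt"] <- "Freq"

# Percentage within each question
out$Perc <- round(out$Freq / ave(out$Freq, out$Question, FUN = sum) * 100, 2)
```

The long format, with one row per question and response, is the shape that plotting libraries such as ggplot2 generally expect.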


Implementing grid_vars() by Group
Enhance the functionality of grid_vars() by applying it with a group variable. This allows for dissecting responses across different demographic or categorical segments, providing richer, more targeted insights.

grid_vars(df,
          vars = vars,
          group = "gender", # optional
          weight = "wt" # optional
) %>%
  head()
#> Question | Response | gender | Freq | Perc
#> Brexit Party | Strongly dislike | Male | 825.56 | 38.36
#> Conservative Party | Strongly dislike | Male | 706.28 | 32.81
#> Green Party | Strongly dislike | Male | 548.62 | 25.49
#> Keir Starmer | Strongly dislike | Male | 511.98 | 23.79
#> Labour Party | Strongly dislike | Male | 478.74 | 22.24
#> Lib Dems | Strongly dislike | Male | 498.82 | 23.18

Step 6: Visualising the Data

Effective visualisation is key in survey data analysis, offering a way to intuitively understand and communicate complex data patterns. The scgUtils package provides specialised functions like plot_popn() for demographic analysis and plot_sankey() for flow visualisation, helping you to not only understand your data but also to present it in a compelling and insightful manner.

For a broader spectrum of visualisation techniques and detailed guidance on effectively using colour in your plots, refer to the Visualising Data article and Mastering Colour Selection article. These resources provide additional insights into making the most of the scgUtils package for visualising complex survey data.


Step 7: Presenting the Results

Presenting the results of your survey analysis in a clear and impactful way is crucial. The scgUtils package offers functionalities that aid in creating detailed and informative presentations.

Tables

Tables are fundamental tools for presenting complex data in a structured and easily interpretable format. View the Conducting Cross-Tabulation Analysis article for more information on how to compile tables.

PowerPoint Integration

In future updates, scgUtils aims to incorporate capabilities for directly exporting analysis results into PowerPoint presentations. This functionality will facilitate seamless integration of your data findings into professional and engaging presentation formats, suitable for various audiences.

Interactive Dashboards

Another upcoming feature is the ability to create interactive dashboards directly from your survey data. Dashboards offer a dynamic way to explore and present data, allowing users to interact with the information, drill down into specifics, and gain a deeper understanding of the underlying patterns. This functionality will be a significant enhancement, providing a powerful tool for data storytelling and decision-making processes.