Retrieve and Process File from Various Sources

get_file retrieves and preprocesses a file from different sources, including local storage, OneDrive, and the web. It supports multiple file formats such as CSV, Excel (XLSX and XLS), SPSS (SAV), and ZIP archives containing these file types. The function applies preprocessing based on the file type and, for ZIP files, processes and combines data from supported files within the archive.

Usage

get_file(
  file_path,
  source = c("local", "onedrive", "web"),
  row_no = 0,
  sheet_no = 1,
  file_name = NULL,
  add_name = FALSE,
  file_type = NULL
)

Arguments

file_path: The path, ID, or URL of the file to be retrieved.
source: The source of the file: 'local', 'onedrive', or 'web' (default is 'local').
row_no: The number of rows to skip at the beginning of the file, applicable for CSV and Excel files (default is 0).
sheet_no: The sheet number to read from Excel files (XLSX or XLS), where 1 is the first sheet (default is 1). Ignored for non-Excel files.
file_name: Optional; for ZIP files, the name of a specific file within the archive to process. If NULL, all supported files in the ZIP are processed (default is NULL).
add_name: Optional; for ZIP files only. If TRUE, the name of each unzipped file is added to the processed data frame in a column called "file_name" (default is FALSE).
file_type: Optional; overrides the file type otherwise inferred from the extension. This is particularly useful for URLs that contain no extension (default is NULL).
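
How source and file_type might be resolved can be illustrated with a minimal sketch. It assumes that the default for source is picked with base R's match.arg and that the extension is read with tools::file_ext; neither is confirmed by this page, and resolve_args is a hypothetical helper that is not part of the package.

```r
# Hypothetical helper, for illustration only: not part of the package.
resolve_args <- function(file_path,
                         source = c("local", "onedrive", "web"),
                         file_type = NULL) {
  # match.arg() picks the first choice ("local") when source is not supplied
  # and raises an error for values outside the listed choices.
  source <- match.arg(source)

  # An explicit file_type overrides whatever the path suggests.
  ext <- if (!is.null(file_type)) file_type else tools::file_ext(file_path)
  ext <- tolower(ext)

  supported <- c("csv", "xls", "xlsx", "sav", "zip")
  if (!ext %in% supported) {
    stop("Unsupported file type: '", ext, "'", call. = FALSE)
  }
  list(source = source, file_type = ext)
}

# Source defaults to "local"; the type comes from the extension.
resolve_args("path/to/local/file.csv")

# The URL carries no extension, so file_type supplies it.
resolve_args("https://example.com/data", source = "web", file_type = "zip")
```

Under these assumptions, a URL without an extension and without file_type would stop with the unsupported-type error, which is the case the file_type override is meant to cover.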

Value

A data frame containing the contents of the file after preprocessing. For ZIP files, it returns a combined data frame from all processed files; when add_name is TRUE, an additional 'file_name' column indicates the source file within the archive.

Details

The function determines the file type from its extension (e.g., 'csv', 'xls', 'xlsx', 'sav', 'zip'), or from file_type when it is supplied, retrieves the file from the specified source using authenticate_source, and preprocesses it with preprocess_file_type. Supported file types are handled as follows:

CSV: Reads the file with automatic delimiter detection, skips the specified number of rows, and reads all columns as character before auto-detecting their types. Special characters are removed from text columns.
Excel (XLSX and XLS): Reads the specified sheet (via sheet_no), skips the specified number of rows, and processes columns similarly to CSV files.
SAV (SPSS): Reads the file, attempting default encoding first and falling back to 'latin1' if needed. It removes labels and processes factors for cleaner output.
ZIP: Extracts supported files (CSV, XLS, XLSX, SAV) from the archive into a unique temporary directory, processes them, and combines the data into a single data frame, adding a 'file_name' column when add_name is TRUE. If file_name is specified, only that file is processed.

If the file type is not supported, an error is thrown.
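
The ZIP handling described above can be sketched with base R alone. This is a simplified illustration under stated assumptions, not the package's implementation: it handles only CSV members, it assumes all files share the same columns, and combine_csvs and read_zip_csvs are hypothetical names.

```r
# Hypothetical sketch, restricted to CSV members so that it needs only base R.
combine_csvs <- function(files, add_name = FALSE) {
  pieces <- lapply(files, function(f) {
    # Read everything as character first, then let R infer column types.
    df <- utils::read.csv(f, colClasses = "character", check.names = FALSE)
    df <- utils::type.convert(df, as.is = TRUE)
    if (add_name) df$file_name <- basename(f)
    df
  })
  # rbind() requires the files to share the same columns.
  do.call(rbind, pieces)
}

read_zip_csvs <- function(zip_path, file_name = NULL, add_name = FALSE) {
  # Extract into a unique temporary directory and remove it on exit.
  out_dir <- tempfile("unzipped_")
  dir.create(out_dir)
  on.exit(unlink(out_dir, recursive = TRUE), add = TRUE)
  utils::unzip(zip_path, exdir = out_dir)

  files <- list.files(out_dir, pattern = "\\.csv$", ignore.case = TRUE,
                      recursive = TRUE, full.names = TRUE)
  # Keep a single member when file_name is given.
  if (!is.null(file_name)) {
    files <- files[basename(files) == file_name]
  }
  if (length(files) == 0) {
    stop("No supported files found in the archive", call. = FALSE)
  }
  combine_csvs(files, add_name = add_name)
}
```

Splitting the combining step from the extraction step keeps the row-binding logic usable on files that are already on disk, and the on.exit cleanup ensures the temporary directory does not outlive the call.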

Examples

if (FALSE) { # \dontrun{
  # Retrieve a local CSV file
  data <- get_file("path/to/local/file.csv")

  # Retrieve a local Excel file, reading the second sheet
  data <- get_file("path/to/local/file.xlsx", sheet_no = 2)

  # Retrieve a file from OneDrive
  data <- get_file("file-id", source = "onedrive")

  # Retrieve and preprocess a OneDrive file, skipping the first row
  data <- get_file("file-id", source = "onedrive", row_no = 1)

  # Retrieve a file from the web
  data <- get_file("https://example.com/data.csv", source = "web")

  # Retrieve and process all supported files from a local ZIP archive
  data <- get_file("path/to/local/archive.zip")

  # Retrieve a file from the web with no extension in the url
  data <- get_file("https://example.com/data", source = "web", file_type = "zip")

  # Retrieve and process a specific file from a ZIP archive
  data <- get_file("path/to/local/archive.zip", file_name = "specific_file.csv")
} # }