get_file retrieves and preprocesses a file from different sources, including
local storage, OneDrive, and the web. It supports multiple file formats such as
CSV, Excel (XLSX and XLS), SPSS (SAV), and ZIP archives containing these file types.
The function applies preprocessing based on the file type and, for ZIP files, processes
and combines data from supported files within the archive.
Usage
get_file(
file_path,
source = c("local", "onedrive", "web"),
row_no = 0,
sheet_no = 1,
file_name = NULL,
add_name = FALSE,
file_type = NULL
)
Arguments
- file_path
The path, ID, or URL of the file to be retrieved.
- source
The source of the file: 'local', 'onedrive', or 'web' (default is 'local').
- row_no
The number of rows to skip at the beginning of the file, applicable for CSV and Excel files (default is 0).
- sheet_no
The sheet number to read from Excel files (XLSX or XLS), where 1 is the first sheet (default is 1). Ignored for non-Excel files.
- file_name
Optional; for ZIP files, the name of a specific file within the archive to process. If NULL, all supported files in the ZIP are processed (default is NULL).
- add_name
Optional; for ZIP files. If TRUE, the name of each unzipped file is added in a column called "file_name" within the processed data frame (default is FALSE).
- file_type
Optional; overrides the file type that would otherwise be taken from the extension. This is particularly useful for URLs that contain no extension (default is NULL).
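The Usage block lists `source = c("local", "onedrive", "web")` while the argument description gives 'local' as the default. A minimal sketch of how such a default is conventionally resolved in R follows. It assumes `match.arg()` is used, which this page does not confirm, and `resolve_source` is a hypothetical stand-in rather than part of the package.

```r
# Hypothetical stand-in for get_file(); it only resolves `source`.
resolve_source <- function(source = c("local", "onedrive", "web")) {
  # match.arg() returns the first choice when the argument is left at its
  # default, and raises an error for any value outside the listed choices.
  match.arg(source)
}

default_source <- resolve_source()       # falls back to the first choice
web_source     <- resolve_source("web")  # an explicit choice is kept
```

Under this assumption, a value outside the three listed sources is rejected with an error instead of being silently ignored.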
Value
A data frame containing the contents of the file after preprocessing.
For ZIP files, it returns a combined data frame from all processed files; when add_name is TRUE, an additional column 'file_name' indicates the source file within the archive.
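The shape of the combined result for a ZIP archive can be illustrated with a small base R sketch. `combine_files` is a hypothetical helper, not the package's implementation, and it assumes that every file in the archive yields the same columns.

```r
# Hypothetical helper: combine a named list of data frames, optionally
# tagging each row with the name of the file it came from.
combine_files <- function(frames, add_name = FALSE) {
  if (add_name) {
    frames <- Map(function(df, nm) {
      df$file_name <- rep(nm, nrow(df))
      df
    }, frames, names(frames))
  }
  combined <- do.call(rbind, frames)
  rownames(combined) <- NULL  # drop the "a.csv.1"-style row names from rbind
  combined
}

# Stand-ins for the data frames read from two files inside an archive.
frames <- list(
  "a.csv" = data.frame(id = 1:2, value = c("x", "y"), stringsAsFactors = FALSE),
  "b.csv" = data.frame(id = 3L, value = "z", stringsAsFactors = FALSE)
)

combined <- combine_files(frames, add_name = TRUE)
```

Here `combined` has one row per input row, and its `file_name` column records which archive member each row came from.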
Details
The function determines the file type from its extension (e.g., 'csv', 'xls', 'xlsx', 'sav', 'zip'),
retrieves the file from the specified source using authenticate_source, and preprocesses
it with preprocess_file_type. Supported file types are handled as follows:
CSV: Reads the file with automatic delimiter detection, skips the specified number of rows, and converts all columns to character initially before auto-detecting types. Special characters are removed from text columns.
Excel (XLSX and XLS): Reads the specified sheet (via sheet_no), skips the specified number of rows, and processes columns similarly to CSV files.
SAV (SPSS): Reads the file, attempting default encoding first and falling back to 'latin1' if needed. It removes labels and processes factors for cleaner output.
ZIP: Extracts supported files (CSV, XLS, XLSX, SAV) from the archive into a unique temporary directory, processes them, and combines the data into a single data frame, with an additional 'file_name' column when add_name is TRUE. If file_name is specified, only that file is processed.
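The CSV steps described above (skip rows, read everything as character, clean the text, then detect types) can be sketched in base R. This is an illustration under stated assumptions, not the package's code: it fixes the delimiter to a comma instead of detecting it, and the set of characters treated as "special" is invented for the example.

```r
# Inline CSV with one leading junk row, standing in for a real file.
csv_text <- c("exported report", "id,score,label", "1,2.5,a#1", "2,3.0,b!2")

# Skip the leading row (the equivalent of row_no = 1) and read every column
# as character so that nothing is coerced prematurely.
raw <- utils::read.csv(text = csv_text, skip = 1, colClasses = "character")

# Assumed cleaning rule: keep letters, digits, spaces, dots, underscores,
# and hyphens; drop everything else.
cleaned <- lapply(raw, function(col) gsub("[^[:alnum:] ._-]", "", col))

# Let R infer a type for each column from the cleaned character values.
typed <- as.data.frame(
  lapply(cleaned, utils::type.convert, as.is = TRUE),
  stringsAsFactors = FALSE
)
```

Reading as character first means the cleaning step sees the raw text of every cell before any numeric or logical conversion takes place.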
If the file type is not supported, an error is thrown.
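A minimal sketch of extension-based type detection with a file_type override and an error for unsupported types. `detect_file_type` and `supported_types` are hypothetical names introduced for illustration; the package's internal logic may differ.

```r
# File types listed as supported on this page.
supported_types <- c("csv", "xls", "xlsx", "sav", "zip")

# Hypothetical helper: pick the type from `file_type` when given, otherwise
# from the extension of the path or URL.
detect_file_type <- function(file_path, file_type = NULL) {
  # Drop any query string or fragment so "data.csv?dl=1" still yields "csv".
  clean_path <- sub("[?#].*$", "", basename(file_path))
  type <- if (is.null(file_type)) tools::file_ext(clean_path) else file_type
  type <- tolower(type)
  if (!type %in% supported_types) {
    stop("Unsupported file type: '", type, "'")
  }
  type
}

local_type    <- detect_file_type("path/to/local/file.CSV")
override_type <- detect_file_type("https://example.com/data", file_type = "zip")
```

Without the override, the second call would fail in this sketch, because the URL has no extension to inspect.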
Examples
if (FALSE) { # \dontrun{
# Retrieve a local CSV file
data <- get_file("path/to/local/file.csv")
# Retrieve a local Excel file, reading the second sheet
data <- get_file("path/to/local/file.xlsx", sheet_no = 2)
# Retrieve a file from OneDrive
data <- get_file("file-id", source = "onedrive")
# Retrieve and preprocess a OneDrive file, skipping the first row
data <- get_file("file-id", source = "onedrive", row_no = 1)
# Retrieve a CSV file from the web
data <- get_file("https://example.com/data.csv", source = "web")
# Retrieve and process all supported files from a local ZIP archive
data <- get_file("path/to/local/archive.zip")
# Retrieve a file from the web with no extension in the url
data <- get_file("https://example.com/data", source = "web", file_type = "zip")
# Retrieve and process a specific file from a ZIP archive
data <- get_file("path/to/local/archive.zip", file_name = "specific_file.csv")
} # }