R for Data Science: Import, Tidy, Transform, Vi...
This data is now tidy, but we could make future computation a bit easier by converting values of week from character strings to numbers using mutate() and readr::parse_number(). parse_number() is a handy function that will extract the first number from a string, ignoring all other text.
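As a minimal sketch of that conversion, the snippet below builds a small made-up table (the track names and ranks are hypothetical, not taken from the Billboard data) whose week column holds strings such as "wk1", and then turns it into numbers. It assumes the dplyr and readr packages are installed.

```r
library(dplyr)
library(readr)

# Hypothetical long-format rows, as they might look after pivoting
# the weekly columns into a single `week` column.
songs <- tibble(
  track = c("Song A", "Song A", "Song B"),
  week  = c("wk1", "wk2", "wk10"),
  rank  = c(87, 82, 45)
)

# parse_number() keeps the first number in each string and drops the
# surrounding text, so "wk10" becomes 10.
songs_numeric <- songs %>%
  mutate(week = parse_number(week))

songs_numeric
```

With week stored as a number, sorting and arithmetic behave as expected; for example, week 10 sorts after week 2 rather than before it.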
This format is also used to record regularly spaced observations over time. For example, the Billboard dataset records the date a song first entered the Billboard Top 100. It has variables for artist, track, date.entered, rank and week. The rank in each week after it enters the top 100 is recorded in 75 columns, wk1 to wk75. This form of storage is not tidy, but it is useful for data entry. It reduces duplication, since otherwise each song in each week would need its own row, and song metadata like title and artist would need to be repeated. This will be discussed in more depth in multiple types.
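To make the wide layout concrete, here is a small sketch of how such a table could be tidied with tidyr::pivot_longer(). The data frame is a hypothetical three-week stand-in with invented artists and ranks, not the real 75-column dataset, and it assumes the tidyr and dplyr packages are installed.

```r
library(tidyr)
library(dplyr)

# Hypothetical miniature of the wide layout: one row per song,
# one column per week on the chart.
billboard_small <- tibble(
  artist       = c("Artist A", "Artist B"),
  track        = c("Track 1", "Track 2"),
  date.entered = as.Date(c("2000-01-01", "2000-02-05")),
  wk1          = c(87, 91),
  wk2          = c(82, 87),
  wk3          = c(72, NA)   # NA: the song was no longer on the chart
)

# Gather the wk* columns into a `week` / `rank` pair, one row per
# song per week, dropping weeks in which the song had no rank.
billboard_long <- billboard_small %>%
  pivot_longer(
    cols           = wk1:wk3,
    names_to       = "week",
    values_to      = "rank",
    values_drop_na = TRUE
  )

billboard_long
```

Setting values_drop_na = TRUE is a design choice: the missing values here only exist because of the wide layout, so removing them loses no information.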
This dataset is mostly tidy, but the element column is not a variable; it stores the names of variables. (Not shown in this example are the other meteorological variables prcp (precipitation) and snow (snowfall).) Fixing this requires widening the data: pivot_wider() is the inverse of pivot_longer(), pivoting element and value back out across multiple columns:
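A minimal sketch of that widening step follows. The station name, dates, and temperatures are made up for illustration; only the column names element and value come from the text. It assumes the tidyr and dplyr packages are installed.

```r
library(tidyr)
library(dplyr)

# Hypothetical weather readings: `element` holds variable names
# (tmax, tmin) rather than values, so each day spans two rows.
weather_long <- tibble(
  id      = "Station A",
  date    = as.Date(c("2010-01-30", "2010-01-30",
                      "2010-02-02", "2010-02-02")),
  element = c("tmax", "tmin", "tmax", "tmin"),
  value   = c(27.8, 14.5, 27.3, 14.4)
)

# Spread `element` into column names and fill them from `value`,
# giving one row per station per day.
weather_wide <- weather_long %>%
  pivot_wider(names_from = element, values_from = value)

weather_wide
```

After widening, each row is one day at one station, and tmax and tmin are ordinary columns.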
In order to get from the native/raw data to the OMOP Common Data Model (CDM) we have to create an extract, transform, and load (ETL) process. This process should restructure the data to the CDM, and add mappings to the Standardized Vocabularies, and is typically implemented as a set of automated scripts, for example SQL scripts. It is important that this ETL process is repeatable, so that it can be rerun whenever the source data is refreshed.