Load the package

library(RShinyHelpers)

Intro and training data

The temporal functions aim to complete missing time series data to ensure functions like window functions like rolling averages and auto regressive algebra are represented correctly.

For a variety of reasons, dates may not be present, distinctly different from being present and their values having NA or null. Such instances shouldn’t really exist, but instead be explicitly mark data as missing or zero on that time stamp.

Construct a clean time series with some variables and take out random dates to see how a real life example might look.

require(ggplot2)
require(dplyr)
 
# construct a time series
ts <-  as.Date("2022-01-03"):as.Date(Sys.Date())

#sample some dates of the original so its 'patchy'
ts <- sort(
  as.Date(
    ts[sample(c(TRUE,FALSE),size = 101,replace = TRUE,prob = c(0.7,0.3))],
    origin='1970-01-01')
  )
 
#generate a corresponding factor alongside
df <- data.frame(time=as.Date(ts),
             col1=sample(replace=TRUE,letters[c(1:5)],size=length(ts)),
             col2=sample(replace=TRUE,1:26,size=length(ts))
             )

The output from our patchy data.frame will look something like this. Note the missing dates.

Also perhaps not evident is that not all factors are represented for every timestamp.

kableExtra::kable(head(df,10))
time col1 col2
2022-01-03 a 1
2022-01-04 d 6
2022-01-05 a 24
2022-01-07 e 11
2022-01-08 e 8
2022-01-09 d 10
2022-01-10 d 6
2022-01-11 d 25
2022-01-13 e 16
2022-01-14 c 22

Use

Our strategy for dealing with raw data like this is similar to expand.grid in base R - but in expand.grid the algorithm does not infer missing dates from the time series our function does !!

Repair the time series using the complete_time_factors from RShinyHelpers

new_df <- complete_time_factors(df)

Example 1 - window functions

Now observe the difference in window operations but computing a simple lagged difference between two time periods.

operates on consecutive time differences without considering the nature of non present dates

df%>%
    dplyr::count(time,wt=col2)%>%
    dplyr::mutate(n-lag(n,1))%>%head(10)
#>          time  n n - lag(n, 1)
#> 1  2022-01-03  1            NA
#> 2  2022-01-04  6             5
#> 3  2022-01-05 24            18
#> 4  2022-01-07 11           -13
#> 5  2022-01-08  8            -3
#> 6  2022-01-09 10             2
#> 7  2022-01-10  6            -4
#> 8  2022-01-11 25            19
#> 9  2022-01-13 16            -9
#> 10 2022-01-14 22             6

while the completed timeseries displays the correct lagged difference with the dates filled out.

It is noted that the in the improved strategy for dealing with italics{complete time series that now that every date timestamp is defined, each also has a value attached to it - notably some of these values are non- trivial as other datetimes are present ond non zero.

new_df%>%
    dplyr::count(time,wt=col2)%>%
    dplyr::mutate(n-lag(n,1))%>%head(10)
#>          time  n n - lag(n, 1)
#> 1  2022-01-03  1            NA
#> 2  2022-01-04  6             5
#> 3  2022-01-05 24            18
#> 4  2022-01-06  0           -24
#> 5  2022-01-07 11            11
#> 6  2022-01-08  8            -3
#> 7  2022-01-09 10             2
#> 8  2022-01-10  6            -4
#> 9  2022-01-11 25            19
#> 10 2022-01-12  0           -25

Example 2 - Plotting

Plotting is similarly improved. The patchy timeseries below is implicitly filled along the x axis as ggplot is intelligent enough to recognise a datetime variable. However the factors are still linearly interpolated.

Compare BEFORE we’ve treated the time series

  ggplot2::ggplot(df)+
  ggplot2::geom_line(ggplot2::aes(time,col2,col=col1))+
  ggplot2::facet_wrap(~col1)+
  RShinyHelpers::custom_theme()

… and after

  ggplot2::ggplot(new_df)+
  ggplot2::geom_line(ggplot2::aes(time,col2,col=col1))+
  ggplot2::facet_wrap(~col1)+
  RShinyHelpers::custom_theme()

Data is represented more fairly - particularly in this instance representing its sparsity better with our improved regime.