Time Series Classification Synthetic vs Real Financial Time Series

This article discusses distinguishing between real financial time series and synthetic time series using XGBoost.

By Matthew Smith, Complutense University Madrid

Note: This is a long post in which I walk through the procedure I followed for a specific time series classification task.

I was given a “Data Science” challenge as part of an interview, in which I had to distinguish between real financial time series and synthetic time series. I document the results here. The data was anonymised, so I have no idea which assets were which or where the time series came from.

To give the conclusion up front: I obtained an in-sample test accuracy of 67% and an out-of-sample test accuracy of 65% (based on what the interviewers told me).

All I knew was that I had 12,000 real time series and 12,000 synthetically created time series, 24,000 observations in total. (Apologies for not sharing the raw data, but it was the company's data and not mine. I have uploaded the train and test data sets discussed later here, so you should be able to run the final XGBoost model.) I show the code here for methodological purposes, and in case you are interested in visualising time series in R with ggplot2. The time series features used here are taken from the following papers:

Large Scale Unusual Time Series Detection by R. Hyndman, E. Wang and N. Laptev
Visualising forecasting algorithm performance using time series instance spaces by Y. Kang, R. Hyndman and K. Smith-Miles

You can check out my Jupyter Notebook version here.

I added a lot of notes to the code throughout the document which might be of additional interest.

A brief overview of the notebook:

Part 1 of the notebook:

Cleans the data and puts it into a better format for analysis. The data I received had all dates, asset names etc. removed for anonymity.
Simple plot of some returns for the Synthetic and Real financial time series.
Box-plots of average returns and standard deviations.
Computes the Durbin-Watson test statistics for both Synthetic and Real time series and box-plots.
Plot the 10 day rolling mean and standard deviations for a random time series for Synthetic and real data.
Dickey Fuller test on both the Synthetic and real time series.
Jarque-Bera Test For Normality on the Synthetic and real time series.
ACF Plots for both the Synthetic and real time series.

Part 2 of the notebook:

Creates the time series features.
Splits the train.csv into “train” and “validation” data sets.
Puts the data into the correct format for XGBoost.
Sets up and searches over a parameter space to find the best parameters for this data set (on the train data).
Outputs these parameters into a data frame.
Train the model using the optimal parameters found from the grid-search.
Plot the feature importance scores - i.e. the most “important” variables that the model found when making its predictions.
Assign a cut-off on the probability scores (> 0.5 then assign a 1 - real time series, <= 0.5 then assign a 0 for Synthetic).
Compute the Confusion Matrix and analyse the ‘in-sample’ validation results.
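
The cut-off and confusion matrix steps in the list above can be sketched in base R. This is a minimal illustration only: the probabilities and labels below are made up, and the names pred_prob, true_class and cutoff are hypothetical rather than taken from the notebook.

```r
# Hypothetical predicted probabilities and true labels (1 = real, 0 = synthetic).
pred_prob  <- c(0.81, 0.35, 0.50, 0.62, 0.12, 0.55)
true_class <- c(1,    0,    1,    0,    0,    1)

# Probabilities above the cut-off become class 1, everything else class 0.
cutoff     <- 0.5
pred_class <- ifelse(pred_prob > cutoff, 1, 0)

# Confusion matrix: rows are predictions, columns are the true classes.
conf_mat <- table(
  predicted = factor(pred_class, levels = c(0, 1)),
  actual    = factor(true_class, levels = c(0, 1))
)

# Accuracy is the share of observations on the diagonal.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

print(conf_mat)
print(accuracy)
```

Note that a probability of exactly 0.5 falls into class 0 under this rule, matching the "<= 0.5" wording in the list.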

Part 3 of the notebook:

Create the “test.csv” features just as before and save as “TSfeatures_test.csv”.
Load in the “TSfeatures_train_val.csv” and “TSfeatures_test.csv” which were created from “train.csv” and “test.csv”.
Set up and run the XGBoost model using the optimal parameters found from the cross-validation grid search in “Part 2”.
Plot the predicted probability density plot as before in “Part 2”.
Set the cut-off threshold as the mean prediction score (0.465) which is close to the (0.500) score from “Part 2”.
Save the results as “submission.csv”.

Let's get started…

I often clear my environment beforehand and turn scientific notation off, which is what the first 2 lines do. The shhh alias is useful in Jupyter Notebooks, which print all the start-up messages when packages are loaded; wrapping a library() call in shhh suppresses them. (In R Markdown I can set warning = FALSE, but as far as I know there is no equivalent option in Notebooks.)

rm(list = ls())
options(scipen=999)
setwd('C:/Users/Matt/Desktop/Data Science Challenge')
shhh <- suppressPackageStartupMessages

shhh(library(dplyr))
library(readr)
library(TSrepr)
library(ggplot2)
library(data.table)
library(cluster)
library(clusterCrit)
library(fractalrock)
library(cowplot)
library(tidyr)
library(tidyquant)
library(lmtest)
library(aTSA)
library(tsoutliers)
library(tsfeatures)
library(xgboost)
library(caret)
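library(purrr)
library(magrittr)    # provides the %T>% tee pipe used when generating the features
library(knitr)       # kable()
library(kableExtra)  # kable_styling()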

train_val <- read_csv("train.csv")
test <- read_csv("test.csv")

NOTE:

I have 2 data sets: the training and validation data (train.csv, loaded as train_val) and the test.csv data set. I do not touch test.csv until the very end in Part 3. All the analysis and optimisation is performed only on the train_val data. Each of the two data sets contains 12,000 observations.

Part 1

The data was given to me in this format:

head(train_val[, 1:5], 1)

## # A tibble: 1 x 5
##   feature1 feature2 feature3 feature4 feature5
##      <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1  0.00629  0.00441  -0.0381   0.0253 -0.00658

The names of the columns are as follows:

colnames(train_val) %>%
  data.frame() %>%
  setNames(c("features")) %>%
  split(as.integer(gl(nrow(.), 20, nrow(.)))) %>%
  kable(caption = "Time series variables") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), font_size = 12)

Table 1: Time series variables (rows 4 to 258 follow the same pattern, feature4 to feature258)

	features
1	feature1
2	feature2
3	feature3
...	...
259	feature259
260	feature260
261	class

There are 260 “features” in the training data, along with a class variable which is excluded from the test data. With ~253 trading days in a year, I took feature1, feature2, … featureN to be daily observations of a time series. From my initial observations (and plots) I believed this to be “returns” data. I first clean the data a little, since time series tools do not work well with column names such as feature1, feature2, … featureN. I chose a year at random and renamed the columns with the function getTradingDates (there is always an R package for everything…).

######################################################################
################# Clean the data #####################################

# Since the "features" are daily time series, I just choose a random year and rename the features to more meaningful names
# such as "2010-01-01", "2010-01-02", "2010-01-03" instead of "feature1", "feature2", "feature3" etc.
# There is a "trading dates" function in R that returns only the dates which are trading dates.
colnames(train_val) <- getTradingDates('2010-01-01', obs = 260)
colnames(train_val)[ncol(train_val)] <- "class"
colnames(test) <- getTradingDates('2010-01-01', obs = 260)
test$dataset <- "test"
train_val$dataset <- "train"

Here (if I were to do things differently) I would keep to tidy data principles and use test %>% add_column(dataset = "test") and train_val %>% add_column(dataset = "train") instead of test$dataset <- "test" and train_val$dataset <- "train". But that doesn’t matter much.
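
The renaming step can also be sketched in base R without the trading-dates package. This is only an approximation under a stated assumption: it treats every weekday as a trading day and ignores market holidays, so it will not match getTradingDates exactly. The toy data frame and the names toy, candidates and trading_dates are hypothetical.

```r
# Toy stand-in for the wide data: 3 series, 260 daily columns plus a class label.
set.seed(1)
toy <- as.data.frame(matrix(rnorm(3 * 260, sd = 0.01), nrow = 3))
toy$class <- c(0, 1, 0)

# Build 260 weekday dates starting from 2010-01-01 (holidays are NOT removed).
# "%u" gives the ISO weekday number (1 = Monday, ..., 7 = Sunday).
candidates    <- seq(as.Date("2010-01-01"), by = "day", length.out = 400)
weekdays_only <- candidates[as.integer(format(candidates, "%u")) <= 5]
trading_dates <- as.character(weekdays_only[1:260])

# Rename only the 260 return columns, leaving "class" untouched.
colnames(toy)[1:260] <- trading_dates

head(colnames(toy), 3)
tail(colnames(toy), 1)
```

Indexing the first 260 column names explicitly avoids overwriting the class column name and then having to restore it.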

How the training data looks after cleaning:

Table 2: How the training set looks now (cleaned)
2009-01-05	2009-01-06	2009-01-07	2009-01-08	2009-01-09
0.0062865	0.0044074	-0.0380887	0.0252850	-0.0065788
0.0008491	0.0025729	0.0013584	-0.0054742	-0.0098234
0.0142292	-0.0252533	-0.0100752	-0.0319871	-0.0065087
-0.0215930	-0.0102866	-0.0210674	-0.0086876	0.0371876
0.0092523	-0.0235778	0.0170582	0.0037303	0.0171185
0.0143528	0.0094828	0.0042109	-0.0038064	0.0084914

How the testing data looks after cleaning:

Table 3: How the testing data looks (cleaned)
2009-01-05	2009-01-06	2009-01-07	2009-01-08	2009-01-09
0.0331039	0.0086225	0.0040622	0.0082554	0.0558741
0.0020681	-0.0034293	0.0134305	-0.0109182	-0.0184851
0.0147834	-0.0113800	-0.0046055	-0.0008757	-0.0011536
-0.0094855	0.0113410	-0.0213286	0.0033220	-0.0111519
0.0381690	-0.0037092	-0.0010865	-0.0062307	0.0232117
0.0004257	-0.0042553	0.0029915	0.0017043	0.0012760

The goal was to classify which financial time series were real and which were synthetically created (by some algorithm whose workings I have no knowledge of).

I re-arranged the data using the melt function in R; however, I suggest anybody reading this use the pivot_longer function from the tidyverse (tidyr) instead. The pivot_longer function was released a few weeks after I wrote the code for this problem.

######################################################################
################# Rearrange the data #################################

# I melt the data for easier analysis, now the data is in a long format.

# "Class" corresponds to whether the asset is Synthetic or Real
# "Dataset" tells me where the data came from
# "row_id" - corresponds to a unique ID assigned to each asset both "(Synthetic & Real)"
# "Variable" is the column names of the original dataset (feature1, feature2, ... , featureN) converted to some date
# "Value" is the daily returns

df <- train_val %>%
  mutate(row_id = row_number()) %>%
  melt(., measure.vars = 1:260) %>%
  arrange(row_id)

head(df)

##   class dataset row_id   variable        value
## 1     0   train      1 2009-01-05  0.006286455
## 2     0   train      1 2009-01-06  0.004407363
## 3     0   train      1 2009-01-07 -0.038088652
## 4     0   train      1 2009-01-08  0.025285012
## 5     0   train      1 2009-01-09 -0.006578773
## 6     0   train      1 2009-01-12  0.005713677

dim(df)

## [1] 3120000       5
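
For readers who prefer pivot_longer, a rough equivalent of the melt step might look like the sketch below. It runs on a small made-up tibble (toy_wide is a hypothetical name, and the values are only illustrative), and it assumes the dplyr and tidyr packages are installed.

```r
library(dplyr)
library(tidyr)

# Toy wide data: 2 assets, 3 daily return columns, plus class and dataset labels.
toy_wide <- tibble(
  `2009-01-05` = c(0.0063, 0.0008),
  `2009-01-06` = c(0.0044, 0.0026),
  `2009-01-07` = c(-0.0381, 0.0014),
  class        = c(0, 1),
  dataset      = "train"
)

toy_long <- toy_wide %>%
  mutate(row_id = row_number()) %>%
  pivot_longer(
    cols      = -c(class, dataset, row_id),  # every date column
    names_to  = "variable",                  # same column names as the melt output
    values_to = "value"
  ) %>%
  arrange(row_id)

toy_long
```

Selecting the columns by exclusion keeps the call independent of how many date columns there are.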

Note: I call the training data df, which in hindsight is probably bad practice; it should be called something related to the train_val data set. Just keep in mind that df refers to the train_val data set (and does not include the test.csv data).

As we can see the data has 3,120,000 rows which is 12,000 assets * 260 trading days. Next I plot the returns series using ggplot.

# Plot some returns - I only plot a random sample of 20 assets for each Synthetic vs Real.

ret_plot0 <- df %>%
  filter(class == 0) %>%
  group_by(row_id) %>%
  nest() %>%
  ungroup() %>% 
  sample_n(20) %>%
  unnest() %>% 
  ggplot(aes(x = variable, y = value)) +
  geom_line(aes(group = factor(row_id), color = factor(row_id))) +
  ggtitle("Synthetic Financial Time Series") +
  theme_classic() +
  theme(axis.text.x = element_blank(), legend.position = "bottom", legend.title = element_blank())


ret_plot1 <- df %>%
  filter(class == 1) %>%
  group_by(row_id) %>%
  nest() %>%
  ungroup() %>%
  sample_n(20) %>%
  unnest() %>%
  ggplot(aes(x = variable, y = value)) +
  geom_line(aes(group = factor(row_id), color = factor(row_id))) +
  ggtitle("Real Financial Time Series") +
  theme_classic() +
  theme(axis.text.x = element_blank(), legend.position = "bottom", legend.title = element_blank())

plot_grid(ret_plot0, ret_plot1)

Next I draw box plots of the average returns and of the standard deviations for each class.

ave_box <- df %>%
  group_by(class, row_id) %>%
  summarise(mean = mean(value)) %>%
  ggplot(aes(x = factor(class), y = mean, color = factor(class))) +
  geom_boxplot(show.legend = FALSE) +
  ggtitle("Syn vs Real Average Returns") +
  xlab("Class") +
  ylab("Average Returns") +
  theme_tq()

sd_box <- df %>%
  group_by(class, row_id) %>%
  summarise(sd = sd(value)) %>%
  ggplot(aes(x = factor(class), y = sd, color = factor(class))) +
  geom_boxplot(show.legend = FALSE) +
  ggtitle("Syn vs Real Standard Deviations") +
  xlab("Class") +
  ylab("Standard Deviation") +
  theme_tq()

plot_grid(ave_box, sd_box)

I next calculate the Durbin-Watson statistic. I mostly code using R’s tidy data principles and therefore use the tidy function from the broom package to tidy the output of the DW statistic up a little. I do this for both the synthetic time series and real time series.

# I calculate the Durbin-Watson statistic and use the "tidy()" function to summarise the key information from the calculation.

dw_test_class_zero <- df %>%
  dplyr::filter(class == 0) %>%
  nest(-row_id) %>%
  mutate(dw_res = map(data, ~ broom::tidy(lmtest::dwtest(value ~ 1, data = .x)))) %>%
  unnest(dw_res) %>%
  mutate(class = "0")

dw_test_class_zero %>% 
  head()

## # A tibble: 6 x 7
##   row_id         data statistic p.value method      alternative            class
##    <int> <list<df[,4>     <dbl>   <dbl> <chr>       <chr>                  <chr>
## 1      1    [260 x 4]      1.98   0.426 Durbin-Wat~ true autocorrelation ~ 0    
## 2      2    [260 x 4]      2.01   0.521 Durbin-Wat~ true autocorrelation ~ 0    
## 3      4    [260 x 4]      2.08   0.747 Durbin-Wat~ true autocorrelation ~ 0    
## 4      5    [260 x 4]      2.49   1.000 Durbin-Wat~ true autocorrelation ~ 0    
## 5      6    [260 x 4]      1.90   0.214 Durbin-Wat~ true autocorrelation ~ 0    
## 6      9    [260 x 4]      1.87   0.138 Durbin-Wat~ true autocorrelation ~ 0

# Here I do the exact same thing as above but this time for the class == 1 data.

dw_test_class_one <- df %>%
  filter(class == 1) %>%
  nest(-row_id) %>%
  mutate(dw_res = map(data, ~ broom::tidy(lmtest::dwtest(value ~ 1, data = .x)))) %>%
  unnest(dw_res) %>%
  mutate(class = "1")

dw_test_class_one %>%
  head()

## # A tibble: 6 x 7
##   row_id         data statistic p.value method      alternative            class
##    <int> <list<df[,4>     <dbl>   <dbl> <chr>       <chr>                  <chr>
## 1      3    [260 x 4]      2.08  0.728  Durbin-Wat~ true autocorrelation ~ 1    
## 2      7    [260 x 4]      1.81  0.0654 Durbin-Wat~ true autocorrelation ~ 1    
## 3      8    [260 x 4]      1.93  0.296  Durbin-Wat~ true autocorrelation ~ 1    
## 4     13    [260 x 4]      2.05  0.644  Durbin-Wat~ true autocorrelation ~ 1    
## 5     15    [260 x 4]      2.07  0.715  Durbin-Wat~ true autocorrelation ~ 1    
## 6     16    [260 x 4]      2.07  0.709  Durbin-Wat~ true autocorrelation ~ 1
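
To make the statistic in these tables less of a black box, here is a small base R sketch of how a Durbin-Watson statistic can be computed by hand for an intercept-only model such as value ~ 1. The series is simulated rather than taken from the challenge data, and the variable names are hypothetical.

```r
set.seed(42)
returns <- rnorm(260, mean = 0, sd = 0.01)  # simulated returns, not the challenge data

# Residuals of an intercept-only model are simply deviations from the mean.
resid <- returns - mean(returns)

# DW = sum of squared successive differences / sum of squared residuals.
dw_stat <- sum(diff(resid)^2) / sum(resid^2)

# Lag-1 autocorrelation of the residuals, for the familiar 2 * (1 - r1) approximation.
r1 <- sum(resid[-1] * resid[-length(resid)]) / sum(resid^2)

print(dw_stat)
print(2 * (1 - r1))
```

Values near 2 indicate little first-order autocorrelation, which is why most statistics in the tables above sit close to 2.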

Next I draw box plots of the Durbin-Watson statistics for each class.

# I bind the rows together and plot a box-plot.

bind_rows(dw_test_class_zero, dw_test_class_one) %>%
  group_by(class) %>%
  ggplot(aes(x = factor(class), y = statistic, color = factor(class))) +
  geom_boxplot(show.legend = FALSE) +
  ggtitle("Durbin Watson Box Plot Statistics") +
  xlab("Class") +
  ylab("Durbin Watson") +
  theme_tq()

I compute the 10 day rolling mean and standard deviation using the tq_mutate function from the tidyquant package. value corresponds to the returns of the financial time series and is plotted in blue, with the 10 day rolling mean and standard deviation plotted over the returns. (I use melt again here, but look into the pivot_longer function for a more intuitive approach.)

# Rolling mean and standard deviations
# I only use a random sample of 1 series from each class to save on memory and to make the plot more readable.
# The rolling window is 10 days
# I use the tq_mutate functionality from the "tidyquant" package to keep things in a "tidy" format as per the "tidyverse" 'rules'.
# In the plot "value" is the returns, "mean_10" is the 10 day rolling mean and "sd_10" is the 10 day rolling standard deviation.

plot0 <- df %>%
  filter(class == 0) %>%
  as_tibble() %>%
  group_by(row_id) %>%
  nest() %>%
  ungroup() %>% 
  sample_n(1) %>%
  unnest() %>%
  mutate(variable = as.Date(variable)) %>%
  tq_mutate(
    select     = value,
    mutate_fun = rollapply,
    width      = 10,
    align      = "right",
    FUN        = mean,
    na.rm      = TRUE,
    col_rename = "mean_10"
    ) %>%
  tq_mutate(
    select     = value,
    mutate_fun = rollapply,
    width      = 10,
    align      = "right",
    FUN        = sd,
    na.rm      = TRUE,
    col_rename = "sd_10"
    ) %>%
  melt(measure.vars = 5:7) %>%
  setNames(c("row_id", "class", "data set", "date", "variable", "value")) %>%
  ggplot(aes(x = date)) +
  geom_line(aes(y = value, colour = variable)) +
  ggtitle("Synthetic Financial Time Series Rolling Mean and Standard Deviation") +
  theme_classic() +
  scale_colour_manual(values = c("#1f77b4", "red", "black")) +
  theme(axis.text.x = element_blank(), legend.position = "bottom", legend.title = element_blank())

plot1 <- df %>%
  filter(class == 1) %>%
  as_tibble() %>%
  group_by(row_id) %>%
  nest() %>%
  ungroup() %>% 
  sample_n(1) %>%
  unnest() %>%
  mutate(variable = as.Date(variable)) %>%
  tq_mutate(
    select     = value,
    mutate_fun = rollapply,
    width      = 10,
    align      = "right",
    FUN        = mean,
    na.rm      = TRUE,
    col_rename = "mean_10"
  ) %>%
  tq_mutate(
    select     = value,
    mutate_fun = rollapply,
    width      = 10,
    align      = "right",
    FUN        = sd,
    na.rm      = TRUE,
    col_rename = "sd_10"
  ) %>%
  melt(measure.vars = 5:7) %>%
  setNames(c("row_id", "class", "data set", "date", "variable", "value")) %>%
  ggplot(aes(x = date)) +
  geom_line(aes(y = value, colour = variable)) +
  ggtitle("Real Financial Time Series Rolling Mean and Standard Deviation") +
  theme_classic() +
  scale_colour_manual(values = c("#1f77b4", "red", "black")) +
  theme(axis.text.x = element_blank(), legend.position = "bottom", legend.title = element_blank())  

plot_grid(plot0, plot1)
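
As a rough illustration of what a right-aligned 10 day rolling window computes, the following base R sketch uses a plain loop instead of rollapply. The helper roll_stat is hypothetical, the data are simulated, and the choice to leave the first nine positions as NA is an assumption of the sketch.

```r
set.seed(7)
returns <- rnorm(30, sd = 0.01)  # simulated returns
width   <- 10

# Right-aligned windows: position i summarises observations (i - width + 1) to i.
roll_stat <- function(x, width, fun) {
  out <- rep(NA_real_, length(x))
  if (length(x) < width) return(out)  # not enough data for a single window
  for (i in seq(width, length(x))) {
    out[i] <- fun(x[(i - width + 1):i])
  }
  out
}

mean_10 <- roll_stat(returns, width, mean)
sd_10   <- roll_stat(returns, width, sd)

head(cbind(returns, mean_10, sd_10), 12)
```

The first value of each rolling series appears at position 10, once a full window of returns is available.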

An important note on the code here is that I randomly sample by group; that is, I do not take a random sample from all observations across all groups. Instead I group_by() each time series (each of the 6,000 observations left after filtering by class == 0, and likewise for class == 1) and then nest() the data to collapse the daily time series for each asset into a list. This leaves 6,000 rows, each of which has its time series nested inside a list-column. I can then sample 1 of the 6,000 rows and unnest() to obtain the full time series of one randomly selected asset, instead of sampling individual days across all assets' time series (which would be completely wrong).

For example, the following commented-out code applies group_by() to the ID variable, nest()s the data, takes a random sample_n() of the nested rows and then unnest()s the data back to its original form, this time containing only the randomly sampled IDs.

#  group_by(row_id) %>%
#  nest() %>%
#  ungroup() %>% 
#  sample_n(1) %>%
#  unnest() %>%

Next I compute the Dickey-Fuller test on a single random series from each class, hence the sample_n(1) argument (it is too computationally expensive to run on all 12,000 observations).

For the synthetically created series.

# Dickey Fuller test on the 0 class
# I only randomly sample 1 of the assets for the 0 class to save on output space

df %>%
  filter(class == 0) %>%
  group_by(row_id) %>%
  nest() %>%
  ungroup() %>% 
  sample_n(1) %>%
  unnest() %>%
  nest(-row_id) %>%
  mutate(adf_res = map(data, ~ adf.test(.x$value))) %>%
  unnest(adf_res)

## Augmented Dickey-Fuller Test 
## alternative: stationary 
##  
## Type 1: no drift no trend 
##      lag    ADF p.value
## [1,]   0 -17.94    0.01
## [2,]   1 -11.75    0.01
## [3,]   2  -8.66    0.01
## [4,]   3  -7.62    0.01
## [5,]   4  -7.13    0.01
## Type 2: with drift no trend 
##      lag    ADF p.value
## [1,]   0 -17.94    0.01
## [2,]   1 -11.76    0.01
## [3,]   2  -8.67    0.01
## [4,]   3  -7.64    0.01
## [5,]   4  -7.15    0.01
## Type 3: with drift and trend 
##      lag    ADF p.value
## [1,]   0 -18.00    0.01
## [2,]   1 -11.83    0.01
## [3,]   2  -8.77    0.01
## [4,]   3  -7.74    0.01
## [5,]   4  -7.26    0.01
## ---- 
## Note: in fact, p.value = 0.01 means p.value <= 0.01

## # A tibble: 3 x 3
##   row_id           data adf_res          
##    <int> <list<df[,4]>> <named list>     
## 1   7807      [260 x 4] <dbl[,3] [5 x 3]>
## 2   7807      [260 x 4] <dbl[,3] [5 x 3]>
## 3   7807      [260 x 4] <dbl[,3] [5 x 3]>

The same but on the real financial series.

# Dickey Fuller test on the 1 class
# I only randomly sample 1 of the assets for the 1 class to save on output space

df %>%
  filter(class == 1) %>%
  group_by(row_id) %>%
  nest() %>%
  ungroup() %>% 
  sample_n(1) %>%
  unnest() %>%
  nest(-row_id) %>%
  mutate(adf_res = map(data, ~ adf.test(.x$value))) %>%
  unnest(adf_res)

## Augmented Dickey-Fuller Test 
## alternative: stationary 
##  
## Type 1: no drift no trend 
##      lag    ADF p.value
## [1,]   0 -15.99    0.01
## [2,]   1 -10.71    0.01
## [3,]   2  -9.12    0.01
## [4,]   3  -8.74    0.01
## [5,]   4  -7.58    0.01
## Type 2: with drift no trend 
##      lag    ADF p.value
## [1,]   0 -16.10    0.01
## [2,]   1 -10.84    0.01
## [3,]   2  -9.27    0.01
## [4,]   3  -8.93    0.01
## [5,]   4  -7.81    0.01
## Type 3: with drift and trend 
##      lag    ADF p.value
## [1,]   0 -16.27    0.01
## [2,]   1 -10.99    0.01
## [3,]   2  -9.46    0.01
## [4,]   3  -9.18    0.01
## [5,]   4  -8.06    0.01
## ---- 
## Note: in fact, p.value = 0.01 means p.value <= 0.01

## # A tibble: 3 x 3
##   row_id           data adf_res          
##    <int> <list<df[,4]>> <named list>     
## 1  10833      [260 x 4] <dbl[,3] [5 x 3]>
## 2  10833      [260 x 4] <dbl[,3] [5 x 3]>
## 3  10833      [260 x 4] <dbl[,3] [5 x 3]>
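
For intuition about the ADF column in these outputs, the simplest Dickey-Fuller statistic (lag 0, no drift, no trend) can be sketched as the t-value from regressing the differenced series on its own lagged level. This is a simplified stand-in written in base R, not the aTSA implementation, and it does not compute p-values; the helper df_stat and both simulated series are hypothetical.

```r
# Dickey-Fuller statistic, lag 0, no drift and no trend:
# regress diff(y) on lagged y without an intercept and read off the t-value.
df_stat <- function(y) {
  dy    <- diff(y)
  y_lag <- y[-length(y)]
  fit   <- lm(dy ~ 0 + y_lag)
  unname(summary(fit)$coefficients["y_lag", "t value"])
}

set.seed(2024)
white_noise <- rnorm(260, sd = 0.01)          # stationary, like a returns series
random_walk <- cumsum(rnorm(260, sd = 0.01))  # non-stationary, like a price path

df_stat(white_noise)
df_stat(random_walk)
```

A returns-like series gives a strongly negative statistic, in line with the results above where the unit root null is rejected for both classes, while a random walk typically does not.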

Next the Jarque-Bera tests for normality. Firstly on the synthetically created series.

# For both classes I take a random sample of 1 observation from each class (Synthetic and Real financial series)

jb_zero <- df %>%
  filter(class == 0) %>%
  group_by(row_id) %>%
  nest() %>%
  ungroup() %>% 
  sample_n(1) %>%
  unnest() %>%
  nest(-row_id) %>%
  mutate(JarqueBeraTest = map(data, ~ JarqueBera.test(.x$value)))

print("Jarque-Bera Test on the 0 - Synthetic class")

## [1] "Jarque-Bera Test on the 0 - Synthetic class"

jb_zero$JarqueBeraTest

## [[1]]
## 
##  Jarque Bera Test
## 
## data:  .x$value
## X-squared = 0.3088, df = 2, p-value = 0.8569
## 
## 
##  Skewness
## 
## data:  .x$value
## statistic = 0.045794, p-value = 0.7631
## 
## 
##  Kurtosis
## 
## data:  .x$value
## statistic = 2.8582, p-value = 0.6406

Also on the real financial series.

jb_one <- df %>%
  filter(class == 1) %>%
  group_by(row_id) %>%
  nest() %>%
  ungroup() %>% 
  sample_n(1) %>%
  unnest() %>%
  nest(-row_id) %>%
  mutate(JarqueBeraTest = map(data, ~ JarqueBera.test(.x$value)))

print("Jarque-Bera Test on the 1 - Real class")

## [1] "Jarque-Bera Test on the 1 - Real class"

jb_one$JarqueBeraTest

## [[1]]
## 
##  Jarque Bera Test
## 
## data:  .x$value
## X-squared = 25.14, df = 2, p-value = 0.000003474
## 
## 
##  Skewness
## 
## data:  .x$value
## statistic = 0.084191, p-value = 0.5794
## 
## 
##  Kurtosis
## 
## data:  .x$value
## statistic = 4.514, p-value = 0.0000006251
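
The Jarque-Bera statistic combines skewness S and kurtosis K as JB = n/6 * (S^2 + (K - 3)^2 / 4) and is compared against a chi-squared distribution with 2 degrees of freedom. The base R sketch below applies that formula; the helper names are hypothetical, and using moments divided by n is an assumption of the sketch rather than something stated in the post.

```r
# Jarque-Bera statistic and p-value from a sample size, skewness and kurtosis.
jb_from_moments <- function(n, skewness, kurtosis) {
  statistic <- n / 6 * (skewness^2 + (kurtosis - 3)^2 / 4)
  p_value   <- pchisq(statistic, df = 2, lower.tail = FALSE)
  list(statistic = statistic, p_value = p_value)
}

# The same test starting from raw data, using moments that divide by n.
jb_from_sample <- function(x) {
  n  <- length(x)
  d  <- x - mean(x)
  m2 <- mean(d^2)
  jb_from_moments(n, skewness = mean(d^3) / m2^1.5, kurtosis = mean(d^4) / m2^2)
}

# Plugging in the skewness and kurtosis printed for the two series (n = 260 each).
jb_from_moments(260, 0.045794, 2.8582)  # first series: statistic close to 0.31
jb_from_moments(260, 0.084191, 4.514)   # second series: statistic close to 25.14
```

With 260 observations, the excess kurtosis of about 1.5 in the second series is what drives the rejection of normality; the skewness term contributes very little.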

Autocorrelation plots:

I plot the autocorrelation function for a “random” sample of the time series. I selected 4 observations and filtered the data by their IDs.

######################################################################
################# ACF plots ##########################################

# I only use 4 observations for these plots, 2 from the "synthetic" class and 2 from the "real" class.

df %>%
  filter(row_id %in% c(6422, 8967, 6080, 5734)) %>%
  mutate(date = as.Date(variable)) %>%
  ggplot(aes(x = date)) +
  geom_line(aes(y = value), color = "red", alpha = 0.4) +
  geom_hline(yintercept = 0) +
  facet_wrap(~ row_id + class) +
  theme_tq()

acf_data <- df %>%
  dplyr::filter(row_id %in% c(6422, 8967, 6080, 5734)) %>%
  mutate(date = as.Date(variable))

df_acf <- acf_data %>%
  group_by(row_id) %>% 
  summarise(list_acf = list(acf(value, plot=FALSE))) %>%
  mutate(acf_vals = purrr::map(list_acf, ~as.numeric(.x$acf))) %>% 
  select(-list_acf) %>% 
  unnest() %>% 
  group_by(row_id) %>% 
  mutate(lag = row_number() - 1)

df_ci <- acf_data %>% 
  group_by(row_id) %>% 
  summarise(ci = qnorm((1 + 0.95)/2)/sqrt(n()))

ggplot(df_acf, aes(x = lag, y = acf_vals)) +
  geom_bar(stat="identity", width=.05) +
  geom_hline(yintercept = 0) +
  geom_hline(data = df_ci, aes(yintercept = -ci), color="blue", linetype="dotted") +
  geom_hline(data = df_ci, aes(yintercept = ci), color="blue", linetype="dotted") +
  labs(x="Lag", y="ACF") +
  facet_wrap(~ row_id) +
  theme_tq()
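
The quantities behind this plot can be checked by hand. The base R sketch below computes sample autocorrelations directly and the same approximate 95% band, qnorm(0.975) / sqrt(n), used for the dotted lines. The series is simulated and acf_manual is a hypothetical helper.

```r
set.seed(99)
returns <- rnorm(260, sd = 0.01)  # simulated returns

# Sample autocorrelation at lag k, with the same normalisation as stats::acf().
acf_manual <- function(x, k) {
  n <- length(x)
  d <- x - mean(x)
  if (k == 0) return(1)
  sum(d[(k + 1):n] * d[1:(n - k)]) / sum(d^2)
}

lags     <- 0:10
acf_vals <- sapply(lags, function(k) acf_manual(returns, k))

# Approximate 95% band under the white-noise null.
ci <- qnorm((1 + 0.95) / 2) / sqrt(length(returns))

round(acf_vals, 3)
ci
```

For 260 observations the band is roughly plus or minus 0.12, so bars inside it are not distinguishable from white noise at the 5% level.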

That's enough exploratory data analysis. I could also add PACF plots and a few more checks, but I move on to generating the time series features using the tsfeatures package.

In the code below I take a random sample of 5 groups (using the whole data set takes too long to compute the time series features) and then apply all the functions in the tsfeatures package to each asset's time series, which is done by mapping over each asset's data and computing the time series features.

This section takes some time to compute (especially on the whole sample), so I already saved the results as a csv and will simply load in the pre-computed time series features.

################# Generate Time Series Features ######################

# I create some time series features from the package "tsfeatures". There are 40+ functions in the "tsfeatures" package
# which can generate approximately 106 time series features.
# Due to memory issues I am only able to create a few of the features, therefore I randomly sample 10 features from the
# "tsfeatures" package. We could also add in technical indicators from the "PerformanceAnalytics" or "TTR" packages (I omit these
# here, however creating 'functions2 <- ls("package:TTR")' and adding it to the 'summarise' command will work.)

functions <- ls("package:tsfeatures")[1:42]
# functions <- sample(functions, 20)

Stats <- df %>%
  group_by(row_id, class) %>%
  nest() %>%
  ungroup() %>%
  sample_n(5) %>%
  unnest() %>%
  nest(-row_id, -class) %>%
  group_by(row_id, class) %T>%
  {options(warn = -1)} %>%
  summarise(Statistics = map(data, ~ data.frame(
    bind_cols(
      tsfeatures(.x$value, functions))))) %>%
  unnest(Statistics)

# I saved the features for the whole data set and load them back in as "Stats"; next I split it between training and test.
Stats <- read.csv("C:/Users/Matt/Desktop/Data Science Challenge/TSfeatures_train_val.csv")

Note: Again, bad practice on my part. I simply called the data Stats, which consists only of the time series features. This still refers only to the train_val data and not the test.csv data.

After computing the time series features, each asset has been collapsed from ~260 days down to a single observation of time series features.

Recall the goal here was to classify synthetic vs real time series, not to predict the next day's price. For each asset I have a single observation, and based on this I can train a classification algorithm to distinguish between real and synthetic time series.
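
The collapse from many daily rows to one feature row per asset can be illustrated with a handful of hand-picked summaries in base R. These four features are only a stand-in chosen for the sketch; they are not the output of the tsfeatures package, and the toy data and the helper series_features are hypothetical.

```r
# Toy long data: 3 assets with 260 simulated daily returns each.
set.seed(5)
toy <- data.frame(
  row_id = rep(1:3, each = 260),
  class  = rep(c(0, 1, 0), each = 260),
  value  = rnorm(3 * 260, sd = 0.01)
)

# Hand-picked summary features; a stand-in for tsfeatures, not its output.
series_features <- function(x) {
  c(
    mean     = mean(x),
    sd       = sd(x),
    acf1     = acf(x, lag.max = 1, plot = FALSE)$acf[2],      # lag-1 autocorrelation
    abs_acf1 = acf(abs(x), lag.max = 1, plot = FALSE)$acf[2]  # lag-1 autocorrelation of |returns|
  )
}

# One row per asset: the ID, the label and the feature values.
feature_rows <- lapply(split(toy, toy$row_id), function(d) {
  data.frame(row_id = d$row_id[1], class = d$class[1], t(series_features(d$value)))
})
features <- do.call(rbind, feature_rows)

features
```

The resulting data frame has the shape a classifier expects: one row per asset, a class label, and one column per feature.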

How the training data looks:

Table 4: tsfeatures package features
X	row_id	class	ac_9_ac_9	acf_features_x_acf1	acf_features_x_acf10	acf_features_diff1_acf1	acf_features_diff1_acf10	acf_features_diff2_acf1	acf_features_diff2_acf10	ARCH.LM	autocorr_features_embed2_incircle_1	autocorr_features_embed2_incircle_2	autocorr_features_ac_9	autocorr_features_firstmin_ac	autocorr_features_trev_num	autocorr_features_motiftwo_entro3	autocorr_features_walker_propcross	binarize_mean_binarize_mean	binarize_mean_NA	compengine_embed2_incircle_1	compengine_embed2_incircle_2	compengine_ac_9	compengine_firstmin_ac	compengine_trev_num	compengine_motiftwo_entro3	compengine_walker_propcross	compengine_localsimple_mean1	compengine_localsimple_lfitac	compengine_sampen_first	compengine_std1st_der	compengine_spreadrandomlocal_meantaul_50	compengine_spreadrandomlocal_meantaul_ac2	compengine_histogram_mode_10	compengine_outlierinclude_mdrmd	compengine_fluctanal_prop_r1	crossing_points	dist_features_histogram_mode_10	dist_features_outlierinclude_mdrmd	embed2_incircle	entropy	firstmin_ac	firstzero_ac	flat_spots	fluctanal_prop_r1_fluctanal_prop_r1	arch_acf	garch_acf	arch_r2	garch_r2	histogram_mode	alpha	beta	hurst	hw_parameters_hw_parameters	hw_parameters_NA	localsimple_taures	lumpiness	max_kl_shift	time_kl_shift	max_level_shift	time_level_shift	max_var_shift	time_var_shift	motiftwo_entro3	nonlinearity	outlierinclude_mdrmd	x_pacf5	diff1x_pacf5	diff2x_pacf5	pred_features_localsimple_mean1	pred_features_localsimple_lfitac	pred_features_sampen_first	sampen_first_sampen_first	sampenc	scal_features_fluctanal_prop_r1	spreadrandomlocal_meantaul	stability	station_features_std1st_der	station_features_spreadrandomlocal_meantaul_50	station_features_spreadrandomlocal_meantaul_ac2	std1st_der_std1st_der	seasonal_period	trend	spike	linearity	curvature	e_acf1	e_acf10	trev_num	tsfeatures_frequency	tsfeatures_seasonal_period	tsfeatures_trend	tsfeatures_spike	tsfeatures_linearity	tsfeatures_curvature	tsfeatures_e_acf1	tsfeatures_e_acf10	tsfeatures_entropy	tsfeatures_x_acf1	tsfeatures_x_acf10	tsfeatures_diff1_acf1	tsfeatures_diff1_acf10	tsfeatures_diff2_acf1	tsfeatures_diff2_acf10	unitroot_kpss	unitroot_pp	walker_propcross
1	1	0	-0.0675275	0.0097094	0.0526897	-0.5005299	0.3297018	-0.6772403	0.6124739	0.0627825	0.3929961	0.6147860	-0.0675275	1	0.1208750	2.071663	0.5405405	1	1	0.3929961	0.6147860	-0.0675275	1	0.1208750	2.071663	0.5405405	1	1	1.788841	1.408737	1.68	1.43	-0.25	-0.2865385	0.1627907	132	-0.25	-0.2865385	0.3929961	0.9840151	1	3	4	0.1627907	0.0652585	0.0154406	0.0627825	0.0253367	-0.25	0.0013330	0.0013330	0.5000458	NA	NA	1	0.3556536	1.783636	103	1.297736	97	2.819828	46	2.071663	0.0752319	-0.2865385	0.0108653	0.4457792	1.0525222	1	1	1.788841	1.788841	1.788841	0.1627907	1.76	0.0562693	1.408737	1.74	1.36	1.408737	1	0.0043052	0.0000261	0.8421403	-0.7069160	0.0052389	0.0588324	0.1208750	1	1	0.0043052	0.0000261	0.8421403	-0.7069160	0.0052389	0.0588324	0.9840151	0.0097094	0.0526897	-0.5005299	0.3297018	-0.6772403	0.6124739	0.0993829	-249.7732	0.5405405
2	2	0	-0.0421577	-0.0075902	0.0387481	-0.5171529	0.3129147	-0.6727897	0.5379301	0.0558032	0.4285714	0.6563707	-0.0421577	1	-0.4765229	2.077581	0.5019305	1	1	0.4285714	0.6563707	-0.0421577	1	-0.4765229	2.077581	0.5019305	1	1	1.780390	1.419266	1.95	1.00	0.50	0.2615385	0.1627907	123	0.50	0.2615385	0.4285714	0.9864332	1	1	4	0.1627907	0.0664358	0.0657859	0.0558032	0.0554355	0.50	0.0001000	0.0001000	0.5000458	NA	NA	1	0.4636768	1.733008	247	1.311861	141	2.625772	221	2.077581	0.0273335	0.2615385	0.0256032	0.4606850	1.0171377	1	1	1.780390	1.780390	1.780390	0.1627907	2.05	0.0892206	1.419266	2.12	1.00	1.419266	1	0.0177460	0.0000399	0.9249561	0.7665407	-0.0218053	0.0411861	-0.4765229	1	1	0.0177460	0.0000399	0.9249561	0.7665407	-0.0218053	0.0411861	0.9864332	-0.0075902	0.0387481	-0.5171529	0.3129147	-0.6727897	0.5379301	0.0414599	-256.0485	0.5019305
3	3	1	0.0099598	-0.0405929	0.0449036	-0.5026683	0.3471209	-0.6718885	0.6109006	0.0325470	0.4671815	0.7065637	0.0099598	1	-0.8755173	2.069233	0.5328185	1	0	0.4671815	0.7065637	0.0099598	1	-0.8755173	2.069233	0.5328185	1	1	1.706841	1.443315	1.38	1.00	-0.50	-0.2538462	0.1395349	132	-0.50	-0.2538462	0.4671815	0.9868568	1	1	6	0.1395349	0.0388513	0.0039162	0.0325470	0.0041902	-0.50	0.0014557	0.0014557	0.5000458	NA	NA	1	1.2670493	7.746711	95	1.403784	87	5.235499	84	2.069233	0.2436499	-0.2538462	0.0223069	0.5356408	0.9954919	1	1	1.706841	1.706841	1.706841	0.1395349	1.42	0.0716499	1.443315	1.42	1.00	1.443315	1	0.0141368	0.0000929	0.8414359	-0.0259311	-0.0547484	0.0492987	-0.8755173	1	1	0.0141368	0.0000929	0.8414359	-0.0259311	-0.0547484	0.0492987	0.9868568	-0.0405929	0.0449036	-0.5026683	0.3471209	-0.6718885	0.6109006	0.0775698	-258.1295	0.5328185
4	4	0	-0.0428748	-0.0443619	0.0615867	-0.4571442	0.3184053	-0.5906478	0.4361178	0.1275576	0.4555985	0.7027027	-0.0428748	2	-0.9943808	2.068744	0.4903475	0	0	0.4555985	0.7027027	-0.0428748	2	-0.9943808	2.068744	0.4903475	1	1	1.660825	1.445807	1.24	1.00	0.25	0.0153846	0.1395349	127	0.25	0.0153846	0.4555985	0.9790521	2	1	7	0.1395349	0.0694296	0.0112709	0.0579144	0.0123884	0.25	0.0480021	0.0001000	0.5000458	NA	NA	1	1.0068624	4.994753	132	1.258758	173	5.886911	156	2.068744	0.3840091	0.0153846	0.0503205	0.5402603	1.1070217	1	1	1.660825	1.660825	1.660825	0.1395349	1.10	0.1065111	1.445807	1.14	1.00	1.445807	1	0.0283540	0.0000482	-1.2297854	0.2921899	-0.0728152	0.0752389	-0.9943808	1	1	0.0283540	0.0000482	-1.2297854	0.2921899	-0.0728152	0.0752389	0.9790521	-0.0443619	0.0615867	-0.4571442	0.3184053	-0.5906478	0.4361178	0.2129633	-262.0781	0.4903475
5	5	0	0.0259312	-0.2447835	0.1469130	-0.5810073	0.4796508	-0.6799229	0.6232529	0.2014861	0.6563707	0.7992278	0.0259312	1	-0.7167079	2.059764	0.5289575	1	0	0.6563707	0.7992278	0.0259312	1	-0.7167079	2.059764	0.5289575	1	1	1.347789	1.580825	1.08	0.98	-0.50	0.7961538	0.1627907	133	-0.50	0.7961538	0.6563707	0.9723766	1	1	9	0.1627907	0.2718058	0.2229375	0.1765130	0.1330761	-0.50	0.0001000	0.0001000	0.5000458	NA	NA	1	2.8846415	11.474426	80	1.772392	229	8.468236	236	2.059764	0.2143595	0.7961538	0.1008392	0.7538746	1.2926800	1	1	1.347789	1.347789	1.347789	0.1627907	1.08	0.0797924	1.580825	1.06	0.98	1.580825	1	0.0121072	0.0001568	-0.5488436	0.2255538	-0.2599764	0.1558209	-0.7167079	1	1	0.0121072	0.0001568	-0.5488436	0.2255538	-0.2599764	0.1558209	0.9723766	-0.2447835	0.1469130	-0.5810073	0.4796508	-0.6799229	0.6232529	0.1506344	-323.5672	0.5289575
6	6	0	-0.0761166	0.0468556	0.0858348	-0.5253131	0.3438031	-0.6901570	0.6130725	0.0432628	0.4352941	0.6627451	-0.0761166	1	0.0898648	2.068914	0.5250965	1	1	0.4352941	0.6627451	-0.0761166	1	0.0898648	2.068914	0.5250965	1	1	1.751575	1.381854	2.69	1.71	-0.25	-0.0846154	0.3488372	134	-0.25	-0.0846154	0.4352941	0.9806218	1	5	5	0.3488372	0.0500806	0.0502154	0.0627968	0.0620877	-0.25	0.0286244	0.0001000	0.5188805	NA	NA	1	0.2189481	3.145763	141	1.447883	80	2.077936	84	2.068914	0.0137733	-0.0846154	0.0172321	0.4345976	1.0881798	1	1	1.751575	1.751575	1.751575	0.3488372	2.61	0.1479673	1.381854	2.63	1.81	1.381854	1	0.0077481	0.0000329	-0.5473782	0.4505809	0.0410068	0.0873468	0.0898648	1	1	0.0077481	0.0000329	-0.5473782	0.4505809	0.0410068	0.0873468	0.9806218	0.0468556	0.0858348	-0.5253131	0.3438031	-0.6901570	0.6130725	0.0259414	-262.3484	0.5250965

## [1] 12000   109

The data still has 12,000 rows, now with 109 columns: the features created by the tsfeatures package plus the X, row_id and class columns. That is, we have 6,000 synthetic and 6,000 real financial time series (12,000 * ~260 = 3,120,000 observations, but applying tsfeatures collapses the ~260 observations of each asset down to a single row).

I collapsed this problem down from a time series prediction problem to a pure classification problem. Next I split the data into a training set and a validation set, and then into X_train, Y_train, etc.

I split the df/Stats data set into a train set of 75% of the observations and an in-sample test data set of 25% of the observations.

######################################################################
################# Train an XGBoost model on the TS Features ##########

#Stats <- Stats %>%
#  select_if(~sum(!is.na(.)) > 0)

# Split the training set up between train and a small validation set
smp_size <- floor(0.75 * nrow(Stats))
#set.seed(123)
train_ind <- sample(seq_len(nrow(Stats)), size = smp_size)

train <- Stats[train_ind, ]
val <- Stats[-train_ind, ]

# We have 106 time series features for the model to learn from.

x_train <- train %>%
  ungroup() %>%
  select(-class, -row_id, -X) %>%
  as.matrix()

x_val <- val %>%
  ungroup() %>%
  select(-class, -row_id, -X) %>%
  as.matrix()

y_train <- train %>%
  ungroup() %>%
  pull(class)

y_val <- val %>%
  ungroup() %>%
  pull(class)

How the training X (input variables) data looks:

Table 5: How the X_train data look
	ac_9_ac_9	acf_features_x_acf1	acf_features_x_acf10	acf_features_diff1_acf1	acf_features_diff1_acf10	acf_features_diff2_acf1	acf_features_diff2_acf10	ARCH.LM	autocorr_features_embed2_incircle_1	autocorr_features_embed2_incircle_2	autocorr_features_ac_9	autocorr_features_firstmin_ac	autocorr_features_trev_num	autocorr_features_motiftwo_entro3	autocorr_features_walker_propcross	binarize_mean_binarize_mean	binarize_mean_NA	compengine_embed2_incircle_1	compengine_embed2_incircle_2	compengine_ac_9	compengine_firstmin_ac	compengine_trev_num	compengine_motiftwo_entro3	compengine_walker_propcross	compengine_localsimple_mean1	compengine_localsimple_lfitac	compengine_sampen_first	compengine_std1st_der	compengine_spreadrandomlocal_meantaul_50	compengine_spreadrandomlocal_meantaul_ac2	compengine_histogram_mode_10	compengine_outlierinclude_mdrmd	compengine_fluctanal_prop_r1	crossing_points	dist_features_histogram_mode_10	dist_features_outlierinclude_mdrmd	embed2_incircle	entropy	firstmin_ac	firstzero_ac	flat_spots	fluctanal_prop_r1_fluctanal_prop_r1	arch_acf	garch_acf	arch_r2	garch_r2	histogram_mode	alpha	beta	hurst	hw_parameters_hw_parameters	hw_parameters_NA	localsimple_taures	lumpiness	max_kl_shift	time_kl_shift	max_level_shift	time_level_shift	max_var_shift	time_var_shift	motiftwo_entro3	nonlinearity	outlierinclude_mdrmd	x_pacf5	diff1x_pacf5	diff2x_pacf5	pred_features_localsimple_mean1	pred_features_localsimple_lfitac	pred_features_sampen_first	sampen_first_sampen_first	sampenc	scal_features_fluctanal_prop_r1	spreadrandomlocal_meantaul	stability	station_features_std1st_der	station_features_spreadrandomlocal_meantaul_50	station_features_spreadrandomlocal_meantaul_ac2	std1st_der_std1st_der	seasonal_period	trend	spike	linearity	curvature	e_acf1	e_acf10	trev_num	tsfeatures_frequency	tsfeatures_seasonal_period	tsfeatures_trend	tsfeatures_spike	tsfeatures_linearity	tsfeatures_curvature	tsfeatures_e_acf1	tsfeatures_e_acf10	tsfeatures_entropy	tsfeatures_x_acf1	tsfeatures_x_acf10	tsfeatures_diff1_acf1	tsfeatures_diff1_acf10	tsfeatures_diff2_acf1	tsfeatures_diff2_acf10	unitroot_kpss	unitroot_pp	walker_propcross
6801	0.0498492	-0.0642025	0.0542648	-0.4423482	0.2575236	-0.5981303	0.4149592	0.0271444	0.4710425	0.7181467	0.0498492	2	0.8754566	2.057333	0.5598456	0	1	0.4710425	0.7181467	0.0498492	2	0.8754566	2.057333	0.5598456	1	1	1.704503	1.460466	1.33	1.00	-0.50	0.1115385	0.8604651	139	-0.50	0.1115385	0.4710425	0.9888208	2	1	3	0.8604651	0.0332257	0.0244434	0.0370423	0.0287773	-0.50	0.0001000	0.0001000	0.5000458	NA	NA	1	0.7769640	3.827223	209	1.027671	131	3.254518	195	2.057333	0.0695918	0.1115385	0.0474059	0.5669070	1.0663179	1	1	1.704503	1.704503	1.704503	0.8604651	1.41	0.0639649	1.460466	1.42	1.00	1.460466	1	0.0069481	0.0000643	-0.8628963	0.2636951	-0.0719026	0.0587799	0.8754566	1	1	0.0069481	0.0000643	-0.8628963	0.2636951	-0.0719026	0.0587799	0.9888208	-0.0642025	0.0542648	-0.4423482	0.2575236	-0.5981303	0.4149592	0.1777957	-246.9618	0.5598456
4209	-0.0037257	-0.0166400	0.0302609	-0.5444182	0.3391695	-0.7025401	0.5898760	0.0369855	0.3976834	0.6409266	-0.0037257	1	0.0772589	2.065480	0.5598456	1	1	0.3976834	0.6409266	-0.0037257	1	0.0772589	2.065480	0.5598456	1	1	1.752028	1.427591	1.39	1.00	-0.25	-0.1000000	0.4651163	137	-0.25	-0.1000000	0.3976834	0.9866480	1	1	4	0.4651163	0.0328564	0.0286941	0.0369855	0.0347972	-0.25	0.0008843	0.0008843	0.5000458	NA	NA	1	0.2267605	3.549229	215	1.390319	3	2.017745	143	2.065480	0.0236440	-0.1000000	0.0060988	0.4859730	1.0685267	1	1	1.752028	1.752028	1.752028	0.4651163	1.49	0.0831999	1.427591	1.53	1.00	1.427591	1	0.0431696	0.0000288	-0.6356332	1.0362897	-0.0608160	0.0358936	0.0772589	1	1	0.0431696	0.0000288	-0.6356332	1.0362897	-0.0608160	0.0358936	0.9866480	-0.0166400	0.0302609	-0.5444182	0.3391695	-0.7025401	0.5898760	0.0372919	-268.4757	0.5598456
11168	0.0236704	-0.0269749	0.0299079	-0.4943006	0.2640054	-0.6626027	0.4906038	0.1265569	0.4401544	0.6640927	0.0236704	2	-0.4569401	2.075666	0.4633205	1	1	0.4401544	0.6640927	0.0236704	2	-0.4569401	2.075666	0.4633205	1	1	1.709466	1.431144	1.52	1.00	0.25	-0.0961538	0.1627907	122	0.25	-0.0961538	0.4401544	0.9882937	2	1	4	0.1627907	0.1453674	0.1490540	0.1265569	0.1247021	0.25	0.0411075	0.0001000	0.5000458	NA	NA	1	0.3863291	2.834691	227	1.096209	123	2.760158	197	2.075666	0.1218026	-0.0961538	0.0088598	0.4643608	1.0505751	1	1	1.709466	1.709466	1.709466	0.1627907	1.61	0.0691848	1.431144	1.50	1.00	1.431144	1	0.0134781	0.0000342	-0.6468298	-1.1770328	-0.0419291	0.0376999	-0.4569401	1	1	0.0134781	0.0000342	-0.6468298	-1.1770328	-0.0419291	0.0376999	0.9882937	-0.0269749	0.0299079	-0.4943006	0.2640054	-0.6626027	0.4906038	0.1743418	-260.0758	0.4633205
5794	-0.0007087	0.1194830	0.0616705	-0.4062897	0.2206195	-0.6016700	0.4137913	0.1556551	0.4806202	0.6782946	-0.0007087	2	-0.5797405	2.066637	0.4787645	1	0	0.4806202	0.6782946	-0.0007087	2	-0.5797405	2.066637	0.4787645	1	1	1.558307	1.328565	2.03	1.18	-0.25	-0.3000000	0.2325581	120	-0.25	-0.3000000	0.4806202	0.9815963	2	2	5	0.2325581	0.2198692	0.0941053	0.1406280	0.0756639	-0.25	0.0125856	0.0001000	0.5477543	NA	NA	1	0.7772726	8.411092	48	1.573682	146	3.802986	149	2.066637	0.1381103	-0.3000000	0.0193037	0.3959500	0.9255264	1	1	1.558307	1.558307	1.558307	0.2325581	1.98	0.1331827	1.328565	2.01	1.27	1.328565	1	0.0139233	0.0000358	-0.8988748	0.9389128	0.1079346	0.0661260	-0.5797405	1	1	0.0139233	0.0000358	-0.8988748	0.9389128	0.1079346	0.0661260	0.9815963	0.1194830	0.0616705	-0.4062897	0.2206195	-0.6016700	0.4137913	0.1182423	-224.0670	0.4787645
8693	-0.0814496	-0.0984498	0.1142883	-0.4688008	0.3181153	-0.6166136	0.4555893	0.1508792	0.4054054	0.6602317	-0.0814496	2	0.3988370	2.060571	0.5250965	0	1	0.4054054	0.6602317	-0.0814496	2	0.3988370	2.060571	0.5250965	1	1	1.651243	1.484233	1.19	1.00	-0.50	-0.0576923	0.3488372	136	-0.50	-0.0576923	0.4054054	0.9745764	2	1	6	0.3488372	0.0946062	0.0937635	0.1057152	0.1052409	-0.50	0.0269522	0.0001000	0.5000458	NA	NA	1	0.5495742	7.853783	195	1.039641	191	4.458772	187	2.060571	0.1164590	-0.0576923	0.0467339	0.5896074	1.1095330	1	1	1.651243	1.651243	1.651243	0.3488372	1.24	0.0998210	1.484233	1.35	1.00	1.484233	1	0.0033231	0.0000574	0.1887497	0.4564879	-0.1022983	0.1171558	0.3988370	1	1	0.0033231	0.0000574	0.1887497	0.4564879	-0.1022983	0.1171558	0.9745764	-0.0984498	0.1142883	-0.4688008	0.3181153	-0.6166136	0.4555893	0.0391658	-262.9010	0.5250965
1073	-0.1253873	0.1511912	0.0608605	-0.3832523	0.2048003	-0.5832067	0.3861283	0.0876692	0.4031008	0.6356589	-0.1253873	2	0.2463431	2.061698	0.4594595	1	1	0.4031008	0.6356589	-0.1253873	2	0.2463431	2.061698	0.4594595	1	1	1.763381	1.304792	2.44	1.13	-0.25	0.1230769	0.1395349	121	-0.25	0.1230769	0.4031008	0.9867903	2	2	4	0.1395349	0.0779468	0.0618625	0.0695878	0.0601294	-0.25	0.0778294	0.0001000	0.5663347	NA	NA	1	0.3151884	7.528904	185	2.069230	177	2.340804	169	2.061698	0.0279574	0.1230769	0.0310540	0.3527793	0.8978003	1	1	1.763381	1.763381	1.763381	0.1395349	2.45	0.0816322	1.304792	2.35	1.23	1.304792	1	0.0213244	0.0000306	-0.5577693	0.6111726	0.1329904	0.0758345	0.2463431	1	1	0.0213244	0.0000306	-0.5577693	0.6111726	0.1329904	0.0758345	0.9867903	0.1511912	0.0608605	-0.3832523	0.2048003	-0.5832067	0.3861283	0.0849681	-208.4546	0.4594595

How the training Y (response variable) data looks:

Table 6: Y_train
.
1
0
1
0
0
1

I set the data up for an XGBoost model:
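The grid search and training code that follow refer to dtrain and dval, which are not defined in the code shown here. A minimal sketch of how such objects might be built, assuming the standard xgb.DMatrix() constructor from the xgboost package. The *_toy objects are hypothetical stand-ins for x_train, y_train, x_val and y_val from the split above:

```r
library(xgboost)

# Toy stand-ins (assumption): numeric feature matrices and 0/1 label vectors,
# shaped like x_train / y_train / x_val / y_val but much smaller.
set.seed(1)
x_train_toy <- matrix(rnorm(75 * 4), nrow = 75)
y_train_toy <- rbinom(75, 1, 0.5)
x_val_toy   <- matrix(rnorm(25 * 4), nrow = 25)
y_val_toy   <- rbinom(25, 1, 0.5)

# xgb.DMatrix bundles the features and labels into XGBoost's own data structure.
dtrain_toy <- xgb.DMatrix(data = x_train_toy, label = y_train_toy)
dval_toy   <- xgb.DMatrix(data = x_val_toy,   label = y_val_toy)

dim(dtrain_toy)  # rows and columns of the training DMatrix
```

With the real data, the same two calls would presumably take x_train / y_train and x_val / y_val in place of the toy objects.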

I create a grid search to look over a parameter space for the best parameters for this data set. It needs a little more work, but it’s a pretty good starting point. Extending it only means adding values to the expand.grid function: to try more tree depths, I can change max_depth = c(5, 8, 14) to max_depth = c(5, 8, 14, 1, 2, 3, 4, 6, 7).

Note: adding values to the grid search increases computation time multiplicatively, because the model has to be fitted for every combination of parameter values. With eta = c(0.1) and max_depth = c(5) there is a single combination to fit, eta = 0.1 with max_depth = 5. With eta = c(0.1, 0.3) and max_depth = c(5) there are two: eta = 0.1 with max_depth = 5 and eta = 0.3 with max_depth = 5. With eta = c(0.1, 0.3, 0.4) all three values are paired with max_depth = 5. Adding values to max_depth multiplies the count again, so three eta values and three max_depth values already give nine combinations. Since an XGBoost model has many parameters to tune, the grid can quickly become very expensive to search. Understanding the statistics behind the model therefore helps in choosing which parameters to search over, and in avoiding a local minimum, which any greedy algorithm using gradient descent optimisation can get stuck in.
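To make the cost of the grid concrete, here is a small base R sketch that counts the combinations expand.grid produces for the examples just described. The object names are made up for illustration; the eta and max_depth values are the ones from the text and from the grid used in this post:

```r
# The number of models a grid search must fit is the product of the number of
# candidate values per parameter (times nfold under cross-validation).
small_grid <- expand.grid(eta = c(0.1), max_depth = c(5))
wider_eta  <- expand.grid(eta = c(0.1, 0.3, 0.4), max_depth = c(5))
full_grid  <- expand.grid(eta = c(0.1, 0.05, 0.3), max_depth = c(5, 8, 14))

nrow(small_grid)  # 1 combination
nrow(wider_eta)   # 3 combinations
nrow(full_grid)   # 9 combinations

nfold <- 10
nrow(full_grid) * nfold  # 90 model fits for a 10-fold cross-validation
```

Each extra value on one parameter adds a full copy of every combination of the others, which is why I only vary a couple of parameters here.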

######################################################################
################# XGBoost Grid Search to locate Optimal Parameters ###

##############################################################################################################################
# NOTE: This section was taken from the first chapter of my PhD where I needed to search over a parameter space to locate the
# most optimal parameters - I have just adapted it for this problem of Time Series Classification.
# Its simple enough to add parameters and different values - I just optimise a few important parameters from domain knowledge
# of the XGBoost model for this task, i.e depth and eta are quite important in gradient boosting.

# 1) I create a "grid" with different parameter values or combinations of parameter values
# 2) I apply cross validation over the parameter space to find the optimal values for the XGBoost model.
# 3) I print the model parameters which give the best train / (in-sample test) results in a data table.
##############################################################################################################################

# Grid Search Parameters:
# 1)
searchGridSubCol <- expand.grid(subsample = c(1), #Range (0,1], default = 1, set to 0.5 will prevent overfitting
                                colsample_bytree = c(1), #Range (0,1], default = 1
                                max_depth = c(5, 8, 14), #Range (0, inf], default = 6
                                min_child = c(1), #Range (0, inf], default = 1
                                eta = c(0.1, 0.05, 0.3), #Range (0,1], default = 0.3
                                gamma = c(0), #Range (0, inf], default = 0
                                lambda = c(1), #Default = 1, L2 regularisation on weights, higher the more conservative the model
                                alpha = c(0), #Default = 0, L1 regularisation on weights, higher the more conservative the model
                                max_delta_step = c(0), #Range (0, inf], default = 0 (Helpful for logistic regression when class is extremely imbalanced, set to value 1-10 may help control the update)
                                colsample_bylevel = c(1) #Range (0,1], default = 1
                                )

ntrees = 200
nfold <- 10                             # I use nfold = 10 which is probably too many folds, 5 should be sufficient.
watchlist <- list(train = dtrain, test = dval)

# 2)
system.time(
  AUCHyperparameters <- apply(searchGridSubCol, 1, function(parameterList){
    #Extract Parameters to test
    currentSubsampleRate <- parameterList[["subsample"]]   # column is named "subsample" in the grid
    currentColsampleRate <- parameterList[["colsample_bytree"]]
    currentDepth <- parameterList[["max_depth"]]
    currentEta <- parameterList[["eta"]]
    currentMinChild <- parameterList[["min_child"]]
    gamma <- parameterList[["gamma"]]
    lambda <- parameterList[["lambda"]]
    alpha <- parameterList[["alpha"]]
    max_delta_step <- parameterList[["max_delta_step"]]
    colsample_bylevel <- parameterList[["colsample_bylevel"]]
    xgboostModelCV <- xgb.cv(data =  dtrain,
                             nrounds = ntrees,
                             nfold = nfold,
                             showsd = TRUE,
                             metrics = c("auc", "logloss", "error"),
                             verbose = TRUE,
                             "eval_metric" = c("auc", "logloss", "error"),
                             "objective" = "binary:logistic", #Outputs a probability "binary:logitraw" - outputs score before logistic transformation
                             "max.depth" = currentDepth,
                             "eta" = currentEta,
                             "gamma" = gamma,
                             "lambda" = lambda,
                             "alpha" = alpha,
                             "subsample" = currentSubsampleRate,
                             "colsample_bytree" = currentColsampleRate,
                             print_every_n = 50, # print every 50 trees to reduce the outputs printed.
                             "min_child_weight" = currentMinChild,
                             booster = "gbtree", #booster = "dart"  #using dart can help improve accuracy.
                             early_stopping_rounds = 10,
                             watchlist = watchlist,
                             seed = 1234)
    xvalidationScores <<- as.data.frame(xgboostModelCV$evaluation_log)
    train_auc_mean <- tail(xvalidationScores$train_auc_mean, 1)
    test_auc_mean <- tail(xvalidationScores$test_auc_mean, 1)
    train_logloss_mean <- tail(xvalidationScores$train_logloss_mean, 1)
    test_logloss_mean <- tail(xvalidationScores$test_logloss_mean, 1)
    train_error_mean <- tail(xvalidationScores$train_error_mean, 1)
    test_error_mean <- tail(xvalidationScores$test_error_mean, 1)
    # Return the last-round CV scores, the full evaluation log and the parameters tried.
    return(c(train_auc_mean, test_auc_mean, train_logloss_mean, test_logloss_mean, train_error_mean, test_error_mean, xvalidationScores, currentSubsampleRate, currentColsampleRate, currentDepth, currentEta, gamma, lambda, alpha, max_delta_step, colsample_bylevel, currentMinChild))
    }))

The output of the grid search can be put into a nice data frame using the following code. However, I did not save this output to file and therefore cannot read it in. You can view the output in the original Jupyter Notebook (In [49]) here.

# 3)
output <- as.data.frame(t(sapply(AUCHyperparameters, '[', c(1:6, 20:29))))
varnames <- c("TrainAUC", "TestAUC", "TrainLogloss", "TestLogloss", "TrainError", "TestError", "SubSampRate", "ColSampRate", "Depth", "eta", "gamma", "lambda", "alpha", "max_delta_step", "col_sample_bylevel", "currentMinChild")
colnames(output) <- varnames
data.table(output)

According to the results at the time the optimal parameters were:

ntrees = 95,
eta = 0.1,
max_depth = 5,

With the other parameters left to default settings for simplicity.

Plug the optimal parameters into the model.

#################################################################################
################# XGBoost Optimal Parameters from Cross Validation ##############

# This is the final training model where I use the most optimal parameters found over the grid space and plug them in here.

watchlist <- list("train" = dtrain)

# Note: gamma = 1 here, although the grid search above only tried gamma = 0 (the default).
params <- list("eta" = 0.1, "max_depth" = 5, "colsample_bytree" = 1, "min_child_weight" = 1, "subsample"= 1,
               "objective"="binary:logistic", "gamma" = 1, "lambda" = 1, "alpha" = 0, "max_delta_step" = 0,
               "colsample_bylevel" = 1, "eval_metric"= "auc",
               "set.seed" = 176)

nround <- 95

Now that I have the optimal parameters from the cross-validation grid search, I can train the final XGBoost model on the whole train_val.csv data set. (Before, the optimal parameters were obtained from models fitted on different folds of the data. More info on k-fold cross validation here.)
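For readers unfamiliar with k-fold cross validation, the following base R sketch shows the idea on row indices only. It is an illustration under simple assumptions (random fold assignment, 20 rows, 5 folds) and not the exact fold logic that xgb.cv uses internally:

```r
set.seed(42)
n <- 20  # number of rows (toy value)
k <- 5   # number of folds (toy value)

# Randomly assign each row to one of k folds of equal size.
fold_id <- sample(rep(seq_len(k), length.out = n))

for (fold in seq_len(k)) {
  held_out <- which(fold_id == fold)  # rows used to score the model
  fit_rows <- which(fold_id != fold)  # rows used to fit the model
  cat(sprintf("Fold %d: fit on %d rows, evaluate on %d rows\n",
              fold, length(fit_rows), length(held_out)))
}
```

Every row is held out exactly once, and the k scores are averaged. That average is what the *_mean columns of the cross-validation log report.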

# Train the XGBoost model

xgb.model <- xgb.train(params, dtrain, nround, watchlist)

## [1]  train-auc:0.700790 
## [2]  train-auc:0.720114 
## [3]  train-auc:0.735281 
## [4]  train-auc:0.741159 
## [5]  train-auc:0.748016 
## [6]  train-auc:0.752070 
## [7]  train-auc:0.754637 
## [8]  train-auc:0.759151 
## [9]  train-auc:0.762538 
## [10] train-auc:0.769652 
## [11] train-auc:0.776582 
## [12] train-auc:0.780015 
## [13] train-auc:0.782065 
## [14] train-auc:0.782815 
## [15] train-auc:0.788966 
## [16] train-auc:0.791026 
## [17] train-auc:0.793545 
## [18] train-auc:0.797363 
## [19] train-auc:0.799069 
## [20] train-auc:0.802015 
## [21] train-auc:0.802583 
## [22] train-auc:0.806938 
## [23] train-auc:0.808239 
## [24] train-auc:0.811255 
## [25] train-auc:0.813142 
## [26] train-auc:0.816767 
## [27] train-auc:0.817697 
## [28] train-auc:0.820239 
## [29] train-auc:0.821589 
## [30] train-auc:0.823343 
## [31] train-auc:0.823939 
## [32] train-auc:0.825701 
## [33] train-auc:0.827316 
## [34] train-auc:0.829365 
## [35] train-auc:0.832646 
## [36] train-auc:0.833297 
## [37] train-auc:0.837006 
## [38] train-auc:0.838857 
## [39] train-auc:0.839923 
## [40] train-auc:0.842968 
## [41] train-auc:0.844877 
## [42] train-auc:0.845940 
## [43] train-auc:0.846583 
## [44] train-auc:0.847330 
## [45] train-auc:0.848292 
## [46] train-auc:0.850215 
## [47] train-auc:0.851641 
## [48] train-auc:0.852670 
## [49] train-auc:0.854706 
## [50] train-auc:0.855752 
## [51] train-auc:0.856772 
## [52] train-auc:0.857806 
## [53] train-auc:0.860245 
## [54] train-auc:0.861337 
## [55] train-auc:0.864178 
## [56] train-auc:0.865290 
## [57] train-auc:0.865808 
## [58] train-auc:0.866386 
## [59] train-auc:0.867751 
## [60] train-auc:0.870032 
## [61] train-auc:0.870500 
## [62] train-auc:0.872442 
## [63] train-auc:0.873391 
## [64] train-auc:0.875188 
## [65] train-auc:0.877767 
## [66] train-auc:0.879196 
## [67] train-auc:0.880079 
## [68] train-auc:0.879969 
## [69] train-auc:0.880638 
## [70] train-auc:0.881389 
## [71] train-auc:0.882066 
## [72] train-auc:0.882515 
## [73] train-auc:0.883854 
## [74] train-auc:0.884654 
## [75] train-auc:0.885104 
## [76] train-auc:0.885922 
## [77] train-auc:0.887100 
## [78] train-auc:0.888646 
## [79] train-auc:0.889833 
## [80] train-auc:0.890387 
## [81] train-auc:0.891815 
## [82] train-auc:0.892281 
## [83] train-auc:0.894417 
## [84] train-auc:0.895006 
## [85] train-auc:0.897079 
## [86] train-auc:0.899254 
## [87] train-auc:0.901114 
## [88] train-auc:0.902460 
## [89] train-auc:0.902939 
## [90] train-auc:0.903763 
## [91] train-auc:0.903792 
## [92] train-auc:0.904433 
## [93] train-auc:0.904986 
## [94] train-auc:0.907339 
## [95] train-auc:0.907761

# Note: Plot AUC for the in-sample train / validation scores - this was a note for me at the time of writing this R file - I never did get around to plotting the AUC for the in-sample train / validation scores...

What is nice about tree based models is that we can obtain importance scores from the model and find which variables contributed most to the gain in the model. The original paper explains more about the gain in Algorithm 1 and Algorithm 3 here.

# We can obtain "feature" importance results from the model.
xgb.imp <- xgb.importance(model = xgb.model)
xgb.plot.importance(xgb.imp, top_n = 10)

That is, the XGBoost model found that spike was the most important variable. The spike feature comes from the stl_features function of the tsfeatures package in R, which computes various measures of trend and seasonality based on a Seasonal and Trend decomposition using Loess (STL). Spike measures the spikiness of a time series as the variance of the leave-one-out variances of the remainder component e_t.
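To give some intuition for "variance of the leave-one-out variances", here is a base R sketch applied directly to a simulated vector. It is only an illustration: tsfeatures applies the idea to the remainder of an STL decomposition, whereas the helper loo_var_spread and the simulated series below are made up for this example:

```r
# Variance of the leave-one-out variances of a vector e:
# drop each observation in turn, compute the variance of the rest,
# then take the variance of those n values.
loo_var_spread <- function(e) {
  loo_vars <- sapply(seq_along(e), function(i) var(e[-i]))
  var(loo_vars)
}

set.seed(7)
smooth_resid <- rnorm(260, sd = 0.01)  # well-behaved noise
spiky_resid  <- smooth_resid
spiky_resid[100] <- 0.2                # inject one large shock

loo_var_spread(smooth_resid)
loo_var_spread(spiky_resid)  # larger: dropping the shock changes the variance a lot
```

A single extreme observation makes one leave-one-out variance very different from the rest, which is what pushes the feature up.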

The second variable is interesting also and comes from the compengine feature set from the CompEngine database. It groups variables as autocorrelation, prediction, stationarity, distribution and scaling.

ARCH.LM comes from the arch_stat function of the tsfeatures package and is based on the Lagrange Multiplier test for Autoregressive Conditional Heteroscedasticity (ARCH) of Engle (1982).
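The idea behind an ARCH LM statistic can be sketched in a few lines of base R: regress the squared, demeaned series on its own lags and look at the R-squared of that regression. This is a simplified illustration, not the tsfeatures source. The helper name arch_lm_r2, the choice of 12 lags and the simulated series are all assumptions made for the example:

```r
# R-squared from regressing the squared (demeaned) series on its own lags.
arch_lm_r2 <- function(x, lags = 12) {
  x2 <- (x - mean(x))^2
  lagged <- embed(x2, lags + 1)  # column 1 = current value, columns 2.. = lags 1..lags
  fit <- lm(lagged[, 1] ~ lagged[, -1])
  summary(fit)$r.squared
}

set.seed(11)
iid_returns <- rnorm(260, sd = 0.01)  # no volatility clustering

# Simple ARCH(1) simulation: today's variance depends on yesterday's squared return.
arch_returns <- numeric(260)
for (t in 2:260) {
  sigma2 <- 1e-5 + 0.8 * arch_returns[t - 1]^2
  arch_returns[t] <- rnorm(1, sd = sqrt(sigma2))
}

arch_lm_r2(iid_returns)
arch_lm_r2(arch_returns)  # typically higher when volatility clusters
```

A value close to zero means past squared returns say little about the current squared return. Volatility clustering, which is common in real financial returns, tends to push the value up.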

These are just a few of the variables the XGBoost model found to be the most important. A full overview and more information of the variables used in the model can be found here.

Predictions using the in-sample test set

Now that I have trained the model using the optimal parameters, I want to see whether it scores the same as or better than it did in the cross-validation phase. I use dval, the validation data set from the training split, to test the model.

# I next make the predictions on the 'in-sample' held out test set, that is, originally I took the 12,000 training samples
# and split them between 75% training and 25% 'in-sample' testing (9000 training vs 3000 in-sample testing)

# I plot the probabilities from the model - the "dashed" line is the average predicted probability.
xgb.pred <- predict(xgb.model, dval, type = 'prob')

results <- cbind(y_val, xgb.pred)

results %>%
 as_tibble() %>%
 ggplot(aes(x = xgb.pred)) + 
 geom_density(color = "darkblue", fill = "lightblue") +
 geom_vline(aes(xintercept = mean(xgb.pred)),
            color = "blue", linetype = "dashed", size = 1) +
 geom_histogram(aes(y = ..density..), colour = "black", fill = "white", alpha = 0.1, position = "identity") +
 ggtitle("Predicted probability density plot") +
 theme_tq()

# The average predicted probability sits around 0.48 / 0.49; for simplicity I will just select 0.50 as the cut-off threshold.
# That is, all observations <= 0.50 are assigned class "0" ("synthetic" data) and all observations > 0.50 are assigned class "1"
# ("real" data).
# Finally I output the confusion matrix on the 'in-sample' testing data.

results <- results %>%
  as_tibble() %>%
  mutate(pred = case_when(
    xgb.pred > 0.5 ~ 1,
    xgb.pred <= 0.5 ~ 0
  ))

confusionMatrix(factor(results$pred), factor(results$y_val))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1041  537
##          1  465  957
##                                              
##                Accuracy : 0.666              
##                  95% CI : (0.6488, 0.6829)   
##     No Information Rate : 0.502              
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.3319             
##                                              
##  Mcnemar's Test P-Value : 0.0249             
##                                              
##             Sensitivity : 0.6912             
##             Specificity : 0.6406             
##          Pos Pred Value : 0.6597             
##          Neg Pred Value : 0.6730             
##              Prevalence : 0.5020             
##          Detection Rate : 0.3470             
##    Detection Prevalence : 0.5260             
##       Balanced Accuracy : 0.6659             
##                                              
##        'Positive' Class : 0                  
##

A balanced accuracy score of 67% isn’t so bad considering I threw the kitchen sink at the classification problem and that this is a time series (stock market) classification problem. By kitchen sink I refer to all the time series functions found in the tsfeatures package.
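The headline numbers in the confusion matrix output can be reproduced by hand from the four cell counts. A short base R sketch, with the counts copied from the table above and class 0 treated as the positive class, as in the caret output:

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference; matrix() fills column by column).
cm <- matrix(c(1041, 465, 537, 957), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)        # correct predictions / all predictions
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # share of true class 0 predicted as 0
specificity <- cm["1", "1"] / sum(cm[, "1"])  # share of true class 1 predicted as 1
balanced    <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced = balanced), 4)
# accuracy 0.6660, sensitivity 0.6912, specificity 0.6406, balanced 0.6659
```

Because the two classes are almost evenly split in the validation set, accuracy and balanced accuracy come out nearly identical here.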

This ends the training and validation stage. I have obtained the optimal values based on the training and validation data sets, and now I want to test the model on the unseen data in test.csv.

I read in the test data and compute the time series features from the tsfeatures package just as I did with the training data.

 test_final <- read.csv("C:/Users/Matt/Desktop/Data Science Challenge/test.csv") %>%
   mutate(row_id = row_number()) %>%
   melt(., measure.vars = 1:260) %>%
   arrange(row_id)

How the test features look - (they look similar to the train data set):

Table 7: Test feature data set
row_id	variable	value
1	feature1	0.0331039
1	feature2	0.0086225
1	feature3	0.0040622
1	feature4	0.0082554
1	feature5	0.0558741
1	feature6	-0.0061266

I call this test_final and not test for no reason whatsoever; it’s the same test.csv from the beginning.

Next I create the same time series features on the test data set as I do on the training data set. I save this as TSfeatures_test.csv.

functions <- sample(functions, 20)

test_final <- test_final %>%
  group_by(row_id) %>%
#  nest() %>%
#  sample_n(5) %>%
#  ungroup() %>%
#  unnest() %>%
  nest(-row_id) %>%
  group_by(row_id) %T>%
  {options(warn = -1)} %>%
  summarise(Statistics = map(data, ~ data.frame(
    bind_cols(
      tsfeatures(.x$value, functions))))) %>%
  unnest(Statistics)

#print("Generated 106 Time Series features")
#write.csv(test_final, "TSfeatures_test.csv")

I have computed all the tsfeatures for the train data set and also for the test data set. I saved these two as TSfeatures_train_val.csv and TSfeatures_test.csv.

Load in the train and test features data sets

I uploaded these files here

# I have already created the features for the training dataset so I can just load them right back in as 
train_final <- read.csv("C:/Users/Matt/Desktop/Data Science Challenge/TSfeatures_train_val.csv")
test_final <- read.csv("C:/Users/Matt/Desktop/Data Science Challenge/TSfeatures_test.csv")

The final data for the training and test looks like:

train_final %>%
  head() %>%
  kable(caption = "Final training data set") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), font_size = 12)

Table 8: Final training data set
X	row_id	class	ac_9_ac_9	acf_features_x_acf1	acf_features_x_acf10	acf_features_diff1_acf1	acf_features_diff1_acf10	acf_features_diff2_acf1	acf_features_diff2_acf10	ARCH.LM	autocorr_features_embed2_incircle_1	autocorr_features_embed2_incircle_2	autocorr_features_ac_9	autocorr_features_firstmin_ac	autocorr_features_trev_num	autocorr_features_motiftwo_entro3	autocorr_features_walker_propcross	binarize_mean_binarize_mean	binarize_mean_NA	compengine_embed2_incircle_1	compengine_embed2_incircle_2	compengine_ac_9	compengine_firstmin_ac	compengine_trev_num	compengine_motiftwo_entro3	compengine_walker_propcross	compengine_localsimple_mean1	compengine_localsimple_lfitac	compengine_sampen_first	compengine_std1st_der	compengine_spreadrandomlocal_meantaul_50	compengine_spreadrandomlocal_meantaul_ac2	compengine_histogram_mode_10	compengine_outlierinclude_mdrmd	compengine_fluctanal_prop_r1	crossing_points	dist_features_histogram_mode_10	dist_features_outlierinclude_mdrmd	embed2_incircle	entropy	firstmin_ac	firstzero_ac	flat_spots	fluctanal_prop_r1_fluctanal_prop_r1	arch_acf	garch_acf	arch_r2	garch_r2	histogram_mode	alpha	beta	hurst	hw_parameters_hw_parameters	hw_parameters_NA	localsimple_taures	lumpiness	max_kl_shift	time_kl_shift	max_level_shift	time_level_shift	max_var_shift	time_var_shift	motiftwo_entro3	nonlinearity	outlierinclude_mdrmd	x_pacf5	diff1x_pacf5	diff2x_pacf5	pred_features_localsimple_mean1	pred_features_localsimple_lfitac	pred_features_sampen_first	sampen_first_sampen_first	sampenc	scal_features_fluctanal_prop_r1	spreadrandomlocal_meantaul	stability	station_features_std1st_der	station_features_spreadrandomlocal_meantaul_50	station_features_spreadrandomlocal_meantaul_ac2	std1st_der_std1st_der	seasonal_period	trend	spike	linearity	curvature	e_acf1	e_acf10	trev_num	tsfeatures_frequency	tsfeatures_seasonal_period	tsfeatures_trend	tsfeatures_spike	tsfeatures_linearity	tsfeatures_curvature	tsfeatures_e_acf1	tsfeatures_e_acf10	tsfeatures_entropy	tsfeatures_x_acf1	tsfeatures_x_acf10	tsfeatures_diff1_acf1	tsfeatures_diff1_acf10	tsfeatures_diff2_acf1	tsfeatures_diff2_acf10	unitroot_kpss	unitroot_pp	walker_propcross
1	1	0	-0.0675275	0.0097094	0.0526897	-0.5005299	0.3297018	-0.6772403	0.6124739	0.0627825	0.3929961	0.6147860	-0.0675275	1	0.1208750	2.071663	0.5405405	1	1	0.3929961	0.6147860	-0.0675275	1	0.1208750	2.071663	0.5405405	1	1	1.788841	1.408737	1.68	1.43	-0.25	-0.2865385	0.1627907	132	-0.25	-0.2865385	0.3929961	0.9840151	1	3	4	0.1627907	0.0652585	0.0154406	0.0627825	0.0253367	-0.25	0.0013330	0.0013330	0.5000458	NA	NA	1	0.3556536	1.783636	103	1.297736	97	2.819828	46	2.071663	0.0752319	-0.2865385	0.0108653	0.4457792	1.0525222	1	1	1.788841	1.788841	1.788841	0.1627907	1.76	0.0562693	1.408737	1.74	1.36	1.408737	1	0.0043052	0.0000261	0.8421403	-0.7069160	0.0052389	0.0588324	0.1208750	1	1	0.0043052	0.0000261	0.8421403	-0.7069160	0.0052389	0.0588324	0.9840151	0.0097094	0.0526897	-0.5005299	0.3297018	-0.6772403	0.6124739	0.0993829	-249.7732	0.5405405
2	2	0	-0.0421577	-0.0075902	0.0387481	-0.5171529	0.3129147	-0.6727897	0.5379301	0.0558032	0.4285714	0.6563707	-0.0421577	1	-0.4765229	2.077581	0.5019305	1	1	0.4285714	0.6563707	-0.0421577	1	-0.4765229	2.077581	0.5019305	1	1	1.780390	1.419266	1.95	1.00	0.50	0.2615385	0.1627907	123	0.50	0.2615385	0.4285714	0.9864332	1	1	4	0.1627907	0.0664358	0.0657859	0.0558032	0.0554355	0.50	0.0001000	0.0001000	0.5000458	NA	NA	1	0.4636768	1.733008	247	1.311861	141	2.625772	221	2.077581	0.0273335	0.2615385	0.0256032	0.4606850	1.0171377	1	1	1.780390	1.780390	1.780390	0.1627907	2.05	0.0892206	1.419266	2.12	1.00	1.419266	1	0.0177460	0.0000399	0.9249561	0.7665407	-0.0218053	0.0411861	-0.4765229	1	1	0.0177460	0.0000399	0.9249561	0.7665407	-0.0218053	0.0411861	0.9864332	-0.0075902	0.0387481	-0.5171529	0.3129147	-0.6727897	0.5379301	0.0414599	-256.0485	0.5019305
3	3	1	0.0099598	-0.0405929	0.0449036	-0.5026683	0.3471209	-0.6718885	0.6109006	0.0325470	0.4671815	0.7065637	0.0099598	1	-0.8755173	2.069233	0.5328185	1	0	0.4671815	0.7065637	0.0099598	1	-0.8755173	2.069233	0.5328185	1	1	1.706841	1.443315	1.38	1.00	-0.50	-0.2538462	0.1395349	132	-0.50	-0.2538462	0.4671815	0.9868568	1	1	6	0.1395349	0.0388513	0.0039162	0.0325470	0.0041902	-0.50	0.0014557	0.0014557	0.5000458	NA	NA	1	1.2670493	7.746711	95	1.403784	87	5.235499	84	2.069233	0.2436499	-0.2538462	0.0223069	0.5356408	0.9954919	1	1	1.706841	1.706841	1.706841	0.1395349	1.42	0.0716499	1.443315	1.42	1.00	1.443315	1	0.0141368	0.0000929	0.8414359	-0.0259311	-0.0547484	0.0492987	-0.8755173	1	1	0.0141368	0.0000929	0.8414359	-0.0259311	-0.0547484	0.0492987	0.9868568	-0.0405929	0.0449036	-0.5026683	0.3471209	-0.6718885	0.6109006	0.0775698	-258.1295	0.5328185
4	4	0	-0.0428748	-0.0443619	0.0615867	-0.4571442	0.3184053	-0.5906478	0.4361178	0.1275576	0.4555985	0.7027027	-0.0428748	2	-0.9943808	2.068744	0.4903475	0	0	0.4555985	0.7027027	-0.0428748	2	-0.9943808	2.068744	0.4903475	1	1	1.660825	1.445807	1.24	1.00	0.25	0.0153846	0.1395349	127	0.25	0.0153846	0.4555985	0.9790521	2	1	7	0.1395349	0.0694296	0.0112709	0.0579144	0.0123884	0.25	0.0480021	0.0001000	0.5000458	NA	NA	1	1.0068624	4.994753	132	1.258758	173	5.886911	156	2.068744	0.3840091	0.0153846	0.0503205	0.5402603	1.1070217	1	1	1.660825	1.660825	1.660825	0.1395349	1.10	0.1065111	1.445807	1.14	1.00	1.445807	1	0.0283540	0.0000482	-1.2297854	0.2921899	-0.0728152	0.0752389	-0.9943808	1	1	0.0283540	0.0000482	-1.2297854	0.2921899	-0.0728152	0.0752389	0.9790521	-0.0443619	0.0615867	-0.4571442	0.3184053	-0.5906478	0.4361178	0.2129633	-262.0781	0.4903475
5	5	0	0.0259312	-0.2447835	0.1469130	-0.5810073	0.4796508	-0.6799229	0.6232529	0.2014861	0.6563707	0.7992278	0.0259312	1	-0.7167079	2.059764	0.5289575	1	0	0.6563707	0.7992278	0.0259312	1	-0.7167079	2.059764	0.5289575	1	1	1.347789	1.580825	1.08	0.98	-0.50	0.7961538	0.1627907	133	-0.50	0.7961538	0.6563707	0.9723766	1	1	9	0.1627907	0.2718058	0.2229375	0.1765130	0.1330761	-0.50	0.0001000	0.0001000	0.5000458	NA	NA	1	2.8846415	11.474426	80	1.772392	229	8.468236	236	2.059764	0.2143595	0.7961538	0.1008392	0.7538746	1.2926800	1	1	1.347789	1.347789	1.347789	0.1627907	1.08	0.0797924	1.580825	1.06	0.98	1.580825	1	0.0121072	0.0001568	-0.5488436	0.2255538	-0.2599764	0.1558209	-0.7167079	1	1	0.0121072	0.0001568	-0.5488436	0.2255538	-0.2599764	0.1558209	0.9723766	-0.2447835	0.1469130	-0.5810073	0.4796508	-0.6799229	0.6232529	0.1506344	-323.5672	0.5289575
6	6	0	-0.0761166	0.0468556	0.0858348	-0.5253131	0.3438031	-0.6901570	0.6130725	0.0432628	0.4352941	0.6627451	-0.0761166	1	0.0898648	2.068914	0.5250965	1	1	0.4352941	0.6627451	-0.0761166	1	0.0898648	2.068914	0.5250965	1	1	1.751575	1.381854	2.69	1.71	-0.25	-0.0846154	0.3488372	134	-0.25	-0.0846154	0.4352941	0.9806218	1	5	5	0.3488372	0.0500806	0.0502154	0.0627968	0.0620877	-0.25	0.0286244	0.0001000	0.5188805	NA	NA	1	0.2189481	3.145763	141	1.447883	80	2.077936	84	2.068914	0.0137733	-0.0846154	0.0172321	0.4345976	1.0881798	1	1	1.751575	1.751575	1.751575	0.3488372	2.61	0.1479673	1.381854	2.63	1.81	1.381854	1	0.0077481	0.0000329	-0.5473782	0.4505809	0.0410068	0.0873468	0.0898648	1	1	0.0077481	0.0000329	-0.5473782	0.4505809	0.0410068	0.0873468	0.9806218	0.0468556	0.0858348	-0.5253131	0.3438031	-0.6901570	0.6130725	0.0259414	-262.3484	0.5250965

test_final %>%
  head() %>%
  kable(caption = "Final testing data set") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), font_size = 12)

Table 9: Final testing data set
X	row_id	ac_9_ac_9	acf_features_x_acf1	acf_features_x_acf10	acf_features_diff1_acf1	acf_features_diff1_acf10	acf_features_diff2_acf1	acf_features_diff2_acf10	ARCH.LM	autocorr_features_embed2_incircle_1	autocorr_features_embed2_incircle_2	autocorr_features_ac_9	autocorr_features_firstmin_ac	autocorr_features_trev_num	autocorr_features_motiftwo_entro3	autocorr_features_walker_propcross	binarize_mean_binarize_mean	binarize_mean_NA	compengine_embed2_incircle_1	compengine_embed2_incircle_2	compengine_ac_9	compengine_firstmin_ac	compengine_trev_num	compengine_motiftwo_entro3	compengine_walker_propcross	compengine_localsimple_mean1	compengine_localsimple_lfitac	compengine_sampen_first	compengine_std1st_der	compengine_spreadrandomlocal_meantaul_50	compengine_spreadrandomlocal_meantaul_ac2	compengine_histogram_mode_10	compengine_outlierinclude_mdrmd	compengine_fluctanal_prop_r1	crossing_points	dist_features_histogram_mode_10	dist_features_outlierinclude_mdrmd	embed2_incircle	entropy	firstmin_ac	firstzero_ac	flat_spots	fluctanal_prop_r1_fluctanal_prop_r1	arch_acf	garch_acf	arch_r2	garch_r2	histogram_mode	alpha	beta	hurst	hw_parameters_hw_parameters	hw_parameters_NA	localsimple_taures	lumpiness	max_kl_shift	time_kl_shift	max_level_shift	time_level_shift	max_var_shift	time_var_shift	motiftwo_entro3	nonlinearity	outlierinclude_mdrmd	x_pacf5	diff1x_pacf5	diff2x_pacf5	pred_features_localsimple_mean1	pred_features_localsimple_lfitac	pred_features_sampen_first	sampen_first_sampen_first	sampenc	scal_features_fluctanal_prop_r1	spreadrandomlocal_meantaul	stability	station_features_std1st_der	station_features_spreadrandomlocal_meantaul_50	station_features_spreadrandomlocal_meantaul_ac2	std1st_der_std1st_der	seasonal_period	trend	spike	linearity	curvature	e_acf1	e_acf10	trev_num	tsfeatures_frequency	tsfeatures_seasonal_period	tsfeatures_trend	tsfeatures_spike	tsfeatures_linearity	tsfeatures_curvature	tsfeatures_e_acf1	tsfeatures_e_acf10	tsfeatures_entropy	tsfeatures_x_acf1	tsfeatures_x_acf10	tsfeatures_diff1_acf1	tsfeatures_diff1_acf10	tsfeatures_diff2_acf1	tsfeatures_diff2_acf10	unitroot_kpss	unitroot_pp	walker_propcross
1	1	-0.0262073	-0.0396281	0.0429784	-0.4964245	0.3379915	-0.6704837	0.6178088	0.1425744	0.5482625	0.7528958	-0.0262073	1	-0.5824739	2.063564	0.4826255	1	1	0.5482625	0.7528958	-0.0262073	1	-0.5824739	2.063564	0.4826255	1	1	1.383933	1.437946	1.91	1.00	0.50	0.4307692	0.1395349	117	0.50	0.4307692	0.5482625	0.9817288	1	1	7	0.1395349	0.1906443	0.0422059	0.1425744	0.0417531	0.50	0.0440489	0.0001000	0.5000458	NA	NA	1	1.1617874	4.857530	130	1.031623	230	3.967385	214	2.063564	0.0716802	0.4307692	0.0271516	0.5270423	0.9564642	1	1	1.383933	1.383933	1.383933	0.1395349	1.80	0.0804590	1.437946	1.89	1.00	1.437946	1	0.0355541	0.0000573	-2.6210355	-0.0981868	-0.0740868	0.0651438	-0.5824739	1	1	0.0355541	0.0000573	-2.6210355	-0.0981868	-0.0740868	0.0651438	0.9817288	-0.0396281	0.0429784	-0.4964245	0.3379915	-0.6704837	0.6178088	0.8820380	-252.2509	0.4826255
2	2	-0.0047799	0.0544155	0.0423445	-0.4931653	0.3114689	-0.6980787	0.6597427	0.1111625	0.4513619	0.6964981	-0.0047799	3	0.2147570	2.068849	0.5250965	1	0	0.4513619	0.6964981	-0.0047799	3	0.2147570	2.068849	0.5250965	1	1	1.611106	1.375120	2.15	1.40	0.25	0.1211538	0.1627907	142	0.25	0.1211538	0.4513619	0.9856808	3	3	6	0.1627907	0.1313081	0.0468159	0.0939769	0.0402163	0.25	0.0063703	0.0001000	0.5012778	NA	NA	1	0.5347516	6.848494	91	1.360520	80	3.586240	75	2.068849	0.0618461	0.1211538	0.0344415	0.4336405	0.9510320	1	1	1.611106	1.611106	1.611106	0.1627907	2.14	0.0796936	1.375120	1.82	1.34	1.375120	1	0.0216068	0.0000391	0.1351482	-0.3430376	0.0339344	0.0578569	0.2147570	1	1	0.0216068	0.0000391	0.1351482	-0.3430376	0.0339344	0.0578569	0.9856808	0.0544155	0.0423445	-0.4931653	0.3114689	-0.6980787	0.6597427	0.0722224	-226.9463	0.5250965
3	3	0.0370364	-0.0041963	0.1781209	-0.3838557	0.3158431	-0.5535087	0.3948373	0.3450202	0.6138996	0.7915058	0.0370364	2	2.9002534	2.067845	0.5598456	1	0	0.6138996	0.7915058	0.0370364	2	2.9002534	2.067845	0.5598456	1	1	1.436472	1.414575	1.24	1.00	0.50	0.7230769	0.1627907	139	0.50	0.7230769	0.6138996	0.9627133	2	1	6	0.1627907	0.4731295	0.0342727	0.2247245	0.0323111	0.50	0.0001000	0.0001000	0.5000458	NA	NA	1	3.9022555	33.656077	240	1.695947	222	9.122984	232	2.067845	0.7040489	0.7230769	0.0685939	0.5171369	1.0433489	1	1	1.436472	1.436472	1.436472	0.1627907	1.39	0.1088905	1.414575	1.43	1.00	1.414575	1	0.0058644	0.0001243	-1.1897947	-0.4762066	-0.0084531	0.1814633	2.9002534	1	1	0.0058644	0.0001243	-1.1897947	-0.4762066	-0.0084531	0.1814633	0.9627133	-0.0041963	0.1781209	-0.3838557	0.3158431	-0.5535087	0.3948373	0.1757311	-235.0780	0.5598456
4	4	-0.0576029	-0.0338906	0.0251717	-0.4963752	0.2570591	-0.6694337	0.4910006	0.0471296	0.3899614	0.6332046	-0.0576029	3	-0.1053821	2.075447	0.5366795	0	1	0.3899614	0.6332046	-0.0576029	3	-0.1053821	2.075447	0.5366795	1	1	1.785628	1.436827	1.52	1.00	-0.25	0.0769231	0.1860465	137	-0.25	0.0769231	0.3899614	0.9886539	3	1	3	0.1860465	0.0511246	0.0516446	0.0471296	0.0470911	-0.25	0.0025845	0.0025845	0.5000458	NA	NA	1	0.2161135	2.534373	34	1.404765	154	2.213233	205	2.075447	0.0681473	0.0769231	0.0179401	0.4720756	0.9626432	1	1	1.785628	1.785628	1.785628	0.1860465	1.44	0.0499953	1.436827	1.42	1.00	1.436827	1	0.0042080	0.0000286	0.9969942	0.1863847	-0.0370368	0.0269840	-0.1053821	1	1	0.0042080	0.0000286	0.9969942	0.1863847	-0.0370368	0.0269840	0.9886539	-0.0338906	0.0251717	-0.4963752	0.2570591	-0.6694337	0.4910006	0.0860264	-241.6752	0.5366795
5	5	-0.1236994	0.0086381	0.0308039	-0.5025363	0.3330186	-0.6693011	0.5835466	0.1157603	0.4202335	0.7003891	-0.1236994	1	-0.0489352	2.058889	0.4864865	1	0	0.4202335	0.7003891	-0.1236994	1	-0.0489352	2.058889	0.4864865	1	1	1.722492	1.396172	1.69	1.32	-0.50	-0.0076923	0.8139535	120	-0.50	-0.0076923	0.4202335	0.9908616	1	3	6	0.8139535	0.0537820	0.0583484	0.1157603	0.1120523	-0.50	0.0001609	0.0001609	0.5090878	NA	NA	1	0.6488028	3.045684	97	1.287940	14	4.338131	240	2.058889	0.0094165	-0.0076923	0.0059114	0.4457371	0.9190563	1	1	1.722492	1.722492	1.722492	0.8139535	1.63	0.1107442	1.396172	1.75	1.35	1.396172	1	0.0229286	0.0000550	-0.6149100	0.2128084	-0.0125452	0.0317617	-0.0489352	1	1	0.0229286	0.0000550	-0.6149100	0.2128084	-0.0125452	0.0317617	0.9908616	0.0086381	0.0308039	-0.5025363	0.3330186	-0.6693011	0.5835466	0.1169027	-266.1451	0.4864865
6	6	0.0137566	-0.0889224	0.0668615	-0.5649436	0.4404459	-0.7097820	0.7128451	0.0752299	0.5366795	0.6447876	0.0137566	1	0.3033072	2.064104	0.5328185	1	0	0.5366795	0.6447876	0.0137566	1	0.3033072	2.064104	0.5328185	1	1	1.464977	1.477767	1.53	1.00	0.25	0.3269231	0.1627907	136	0.25	0.3269231	0.5366795	0.9835850	1	1	6	0.1627907	0.1033936	0.0236197	0.0740159	0.0248339	0.25	0.0001000	0.0001000	0.5000458	NA	NA	1	0.7510236	12.688453	197	1.217490	189	2.987989	194	2.064104	0.0649001	0.3269231	0.0200688	0.5201834	1.0761503	1	1	1.464977	1.464977	1.464977	0.1627907	1.35	0.0814814	1.477767	1.36	1.00	1.477767	1	0.0081147	0.0000469	0.6555116	-0.0489727	-0.0976177	0.0700199	0.3033072	1	1	0.0081147	0.0000469	0.6555116	-0.0489727	-0.0976177	0.0700199	0.9835850	-0.0889224	0.0668615	-0.5649436	0.4404459	-0.7097820	0.7128451	0.0869913	-279.8920	0.5328185
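
Since both feature tables are converted to plain matrices before modelling, it is worth confirming that they contain the same feature columns in the same order. Below is a minimal base-R sketch of such a check on toy stand-ins for train_final and test_final. The two data frames and their values are hypothetical; only the column names X, row_id and class follow the tables above.

```r
# Toy stand-ins for train_final and test_final (made-up values).
train_toy <- data.frame(X = 1:2, row_id = 1:2, class = c(0, 1),
                        entropy = c(0.98, 0.99), hurst = c(0.50, 0.52))
test_toy  <- data.frame(X = 1:2, row_id = 1:2,
                        hurst = c(0.51, 0.50), entropy = c(0.97, 0.99))

# Feature columns are everything except the identifiers and the label.
train_cols <- setdiff(names(train_toy), c("X", "row_id", "class"))
test_cols  <- setdiff(names(test_toy),  c("X", "row_id"))

# Features present in one set but not the other (both should be empty).
missing_in_test  <- setdiff(train_cols, test_cols)
missing_in_train <- setdiff(test_cols, train_cols)

# Reorder the test features to the training column order before as.matrix().
x_test_toy <- as.matrix(test_toy[, train_cols])
colnames(x_test_toy)
```

In the toy data the test columns arrive in a different order, and indexing by train_cols puts them back in line with the training matrix.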

Finally we can run the final model on the held-out test set and obtain our predictions based on the training data and the optimal parameters.

# Use the optimal parameters found previously and run the final training model (to make predictions on the out-of-sample test data)

x_train_final <- train_final %>%
  ungroup() %>%
  select(-class, -row_id, -X) %>%
  as.matrix()

x_test_final <- test_final %>%
  ungroup() %>%
  select(-row_id, -X) %>%
  as.matrix()

y_train_final <- train_final %>%
  ungroup() %>%
  pull(class)

# x_train_final and x_test_final are already matrices; NA marks missing values
# (e.g. the all-NA hw_parameters columns).
dtrain_final <- xgb.DMatrix(data = x_train_final, label = y_train_final, missing = NA)
dtest_final <- xgb.DMatrix(data = x_test_final, missing = NA)

watchlist <- list("train" = dtrain_final)

params <- list("eta" = 0.1, "max_depth" = 5, "colsample_bytree" = 1, "min_child_weight" = 1, "subsample"= 1,
               "objective"="binary:logistic", "gamma" = 1, "lambda" = 1, "alpha" = 0, "max_delta_step" = 0,
               "colsample_bylevel" = 1, "eval_metric"= "auc")

# "set.seed" is not an xgboost parameter; seed R's random number generator instead.
set.seed(176)

nround <- 95

xgb.model_final <- xgb.train(params, dtrain_final, nround, watchlist)

## [1]  train-auc:0.708604 
## [2]  train-auc:0.721700 
## [3]  train-auc:0.723230 
## [4]  train-auc:0.729888 
## [5]  train-auc:0.735542 
## [6]  train-auc:0.738081 
## [7]  train-auc:0.740926 
## [8]  train-auc:0.744105 
## [9]  train-auc:0.746320 
## [10] train-auc:0.748644 
## [11] train-auc:0.754211 
## [12] train-auc:0.756892 
## [13] train-auc:0.761524 
## [14] train-auc:0.763882 
## [15] train-auc:0.767216 
## [16] train-auc:0.772009 
## [17] train-auc:0.772943 
## [18] train-auc:0.774261 
## [19] train-auc:0.775471 
## [20] train-auc:0.777801 
## [21] train-auc:0.780629 
## [22] train-auc:0.784384 
## [23] train-auc:0.787112 
## [24] train-auc:0.788946 
## [25] train-auc:0.791835 
## [26] train-auc:0.793142 
## [27] train-auc:0.795289 
## [28] train-auc:0.798502 
## [29] train-auc:0.799893 
## [30] train-auc:0.802186 
## [31] train-auc:0.804981 
## [32] train-auc:0.805649 
## [33] train-auc:0.807120 
## [34] train-auc:0.809020 
## [35] train-auc:0.810318 
## [36] train-auc:0.812637 
## [37] train-auc:0.814760 
## [38] train-auc:0.816024 
## [39] train-auc:0.817956 
## [40] train-auc:0.819350 
## [41] train-auc:0.821653 
## [42] train-auc:0.822729 
## [43] train-auc:0.824029 
## [44] train-auc:0.824765 
## [45] train-auc:0.826924 
## [46] train-auc:0.827804 
## [47] train-auc:0.828475 
## [48] train-auc:0.831018 
## [49] train-auc:0.832247 
## [50] train-auc:0.833265 
## [51] train-auc:0.834168 
## [52] train-auc:0.835535 
## [53] train-auc:0.836093 
## [54] train-auc:0.837008 
## [55] train-auc:0.837715 
## [56] train-auc:0.839537 
## [57] train-auc:0.840310 
## [58] train-auc:0.841701 
## [59] train-auc:0.842480 
## [60] train-auc:0.843106 
## [61] train-auc:0.844495 
## [62] train-auc:0.845348 
## [63] train-auc:0.845932 
## [64] train-auc:0.847843 
## [65] train-auc:0.849445 
## [66] train-auc:0.850345 
## [67] train-auc:0.851337 
## [68] train-auc:0.852121 
## [69] train-auc:0.852663 
## [70] train-auc:0.854132 
## [71] train-auc:0.855949 
## [72] train-auc:0.856758 
## [73] train-auc:0.857115 
## [74] train-auc:0.857954 
## [75] train-auc:0.858849 
## [76] train-auc:0.859527 
## [77] train-auc:0.859917 
## [78] train-auc:0.860590 
## [79] train-auc:0.861264 
## [80] train-auc:0.862359 
## [81] train-auc:0.863101 
## [82] train-auc:0.863794 
## [83] train-auc:0.864911 
## [84] train-auc:0.866293 
## [85] train-auc:0.866976 
## [86] train-auc:0.867436 
## [87] train-auc:0.869036 
## [88] train-auc:0.869469 
## [89] train-auc:0.869931 
## [90] train-auc:0.870681 
## [91] train-auc:0.872326 
## [92] train-auc:0.873706 
## [93] train-auc:0.875704 
## [94] train-auc:0.876178 
## [95] train-auc:0.876789
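
Note that the watchlist contains only dtrain_final, so the AUC values above are measured on the training data itself and keep rising with each round. They show how well the model fits the training set, not how it performs out of sample. As a reminder of what the metric measures, AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. Below is a base-R sketch using the rank formula. The name auc_rank and the scores are made up for illustration.

```r
# AUC from ranks (Mann-Whitney U statistic scaled to [0, 1]).
auc_rank <- function(scores, labels) {
  r     <- rank(scores)          # average ranks handle tied scores
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical scores and labels, for illustration only.
scores <- c(0.10, 0.40, 0.35, 0.80)
labels <- c(0, 0, 1, 1)

# 3 of the 4 (positive, negative) pairs are ordered correctly.
auc_rank(scores, labels)
```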

I make the final predictions on the test.csv data. R's predict function is convenient: it dispatches on the class of the model, so we only need to supply the model and the test data. Because the model was trained with the binary:logistic objective, the predictions are probability scores. I also plot the density of the predicted probabilities.

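# Make the final predictions on the 'test.csv' data and plot the probability density function.
# With objective = "binary:logistic", predict() already returns probabilities,
# so no extra 'type' argument is needed.

xgb.pred_final <- predict(xgb.model_final, dtest_final)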

xgb.pred_final %>%
 as_tibble() %>%
 setNames(c("Prediction")) %>%
 ggplot(aes(x = Prediction)) + 
 geom_density(color = "darkblue", fill = "lightblue") +
 geom_vline(aes(xintercept = mean(Prediction)),
            color = "blue", linetype = "dashed", size = 1) +
 geom_histogram(aes(y = ..density..), colour = "black", fill = "white", alpha = 0.1, position = "identity") +
 ggtitle("(Out of sample) - Predicted probability density plot") +
 theme_tq()

Finally! I make the submission file based on the predicted probabilities.

# Convert the probabilities into a binary class of 0 or 1 by a decision threshold of 0.465.
# Write the predictions to "submission.csv"

xgb.pred_final %>%
  as_tibble() %>%
  setNames(c("Prediction")) %>%
  summarise(mean = mean(Prediction))

## # A tibble: 1 x 1
##    mean
##   <dbl>
## 1 0.465

xgb.pred_final %>%
  as_tibble() %>%
  setNames(c("Prediction")) %>%
  mutate(pred = case_when(
    Prediction > 0.465 ~ 1,
    Prediction <= 0.465 ~ 0
  )) %>%
  write.csv("submission.csv")
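
The threshold of 0.465 above is the mean predicted probability, typed in by hand. Below is a small sketch of computing the cut-off from the predictions instead, so that it stays in sync if the model is retrained. The probabilities are hypothetical, and using the mean as the cut-off simply mirrors the choice made in this post rather than being a general recommendation.

```r
# Hypothetical predicted probabilities, for illustration only.
probs <- c(0.20, 0.40, 0.60, 0.80)

# The mean prediction plays the role of 0.465 in the code above.
threshold <- mean(probs)

# 1 above the threshold, 0 otherwise (same rule as the case_when above).
pred <- as.integer(probs > threshold)

submission <- data.frame(Prediction = probs, pred = pred)
submission
```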

I made the following final remark in the Jupyter Notebook I sent as part of the interview process:

Quote:: Final footnote: Hopefully the out-of-sample predictions will obtain a 67% accuracy (the predictions in the “submission.csv” file).

After I sent my scores as part of the interview process, I was told how they were evaluated (translated from Spanish):

*So that you know how the scoring works:

A score between 0.4 and 0.6 is considered a random result.

From 0.6 upwards the algorithm classifies correctly, and above 0.7 the algorithm is great.

Below 0.4 the algorithm can tell synthetic series apart from real ones, but with the labels swapped.*
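
The scoring bands in this message (below 0.4, between 0.4 and 0.6, from 0.6, above 0.7) translate directly into a small lookup. The sketch below is illustrative only: the function name interpret_score is made up, and how the exact boundary values 0.4, 0.6 and 0.7 are treated is an assumption, since the message does not say.

```r
# Map a score to the interviewers' bands.
# Assumption: boundaries at exactly 0.4, 0.6 and 0.7 are handled as coded here.
interpret_score <- function(score) {
  if (score < 0.4) {
    "separates the classes, but with labels swapped"
  } else if (score < 0.6) {
    "random"
  } else if (score <= 0.7) {
    "classifies correctly"
  } else {
    "great"
  }
}

interpret_score(0.649636)  # the held-out result reported in this post
interpret_score(0.35)
```

A score below 0.4 is still informative: flipping every predicted label would turn a score of s into 1 - s.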

I was informed that on the held-out test set I obtained a score of 0.649636, or roughly 65% (a little lower than my in-sample result of 67%). This is consistent with the methodology I was using (i.e. no leaking of test data into the training data), given that I was simply throwing the time series features book/kitchen sink at the problem. Further reading into time series features should strengthen this approach and improve the prediction accuracy. Recall that my feature selection consisted of applying every feature in the tsfeatures package, using functions <- ls("package:tsfeatures")[1:42] and then mapping over the data with summarise(Statistics = map(data, ~ data.frame(bind_cols(tsfeatures(.x$value, functions))))). So there is plenty of room to improve the current model.

Conclusion: A combination of time series feature extraction and classification models can do pretty well on time series classification tasks such as the one I faced.

Any errors are my own!

Bio: Matthew Smith (@MatthewSmith786) is a PhD student at the Complutense University Madrid. His research focuses on Machine Learning methods applied to Economics and Finance. He writes about topics in R, Python and C++.

Original. Reposted with permission.
