<< Go back to Posts
Warning : This document is still a draft

Napoleon X Challenge - Part I - Matching Pieces Together

Introduction

I wanted to get in touch with financial time series. I found one dataset dealing with cryptocurrencies, that you can find on Challenge data, with the details of the challenge.

Prices are recorded every hours. You get chunks of 21 days, with the 24th hour price missing.

Additionally, time series are grouped into “clusters”. You need to predict the 24th hour price for the cluster (not for the individual time series). revious hour points the average change of the 24th hours.

The description of the dataset is obfuscated:

Each chunk of 21 days is anonymized, you don’t know if its ethereum or dodge coin
How clusters are created is unknown
There are undescribed quantities: md and bc

I found some regularities in the dataset, an I was able to:

reconstruct the full history (i.e., find which assets is after another),
use this information to accurately predict the average return of a cluster
with external data, retrieve the dates and which crypto asset is what
find what is the meaning of bc and md
get the first place of the challenge.

I will detail in this article and the followings how I did it.

This one is dedicated to how to reconstruct the full history.

Table of Content

How to use the secret quantities ?
- Visual analysis
- Searching matching pairs
Full signal Reconstruction
- Approaches complexity
- Cross correlation to the rescue
Analysis of the test set
- Matching clusters
- Matching assets
Conclusion

Secret quantities

Visual Analysis

Starting with the secret quantities md and bc. The acronymes are not standard (e.g. mean/std/var), so that useless to guess without analysis.

The first idea is to look at histograms to study the distribution.

For bc, the maximal value is 1 while the minimal value is not clearly visible (there is not a strong bottom line). It might be a rate 1==100% for instance, but negative rate are often not allowed when capping to 100%. It might be a correlation coefficient
For md, because of the negative values, it seems to be log values.

The second thing we can do is to look at the values over one cluster. Here, the values are sorted by asset ID. As we have 21 chunks per cluster, we have 21 values per asset. (because the dataset is recorded week by week) Looking at the following plot, the asset value is unchanged over the cluster lifetime. It seems that asset IDs are assigned based on the decreasing value of md. For most of the clusters, it is true, but there are a few exceptions.

Searching Matching Pairs

If md and bc are unique and constant over 21 days, I wanted to see how similar two time series with a similar md or bc looks like.

I studied first how unique are md and bc values:

We have 23,607 unique pairs of (cluster, asset) (given the IDs)
We only have 17,864 unique md values
We only have 17,870 unique bc values

Therefore, we can find 5,750 pairs of time series with exactly the same bc or the same md.

So, we picked at random md and bc pairs, and we obtained that:

If you measure correlation, you get a perfect 1

Knowing that two assets in two different clusters are exactly the same, it is possible to “align” clusters, i.e., finding clusters which occur during the same period. The 1464 train clusters were grouped into 311 time related groups. The following plot show the size of the groups obtained, in terms of cluster number:

Over the 1464, only \(\approx 100\) clusters are alone. Therefore, we can learn plenty of stuff using the 90% others.

Connecting Clusters

In the previous part, we were able to find which clusters occur at the same time.

If we can group clusters that way, it might be possible to find overlap between time periods, and therefore find which group occurs after another one.

Complexity

Because assets are crypto-assets, they are likely to not be independent of each other, and to be subject to the same market variations. If we look at several clusters occurring at the same date, we can see the market impact. Globally, the market is increasing. We see two major event at time \(\approx 150\) and \(\approx 250\).

Notes:

The black line is the cluster average.
We took the cumulated returns

Even if in each cluster, there is only one asset which is shared, the correlation is very high between the means.

In the dataset, assuming that clusters overlap, we would find asset \(X_1\) and \(X_2\) where \(X_1(t+k) = X_2(t)\). There are two ways to identify these match:

working at the asset level
working at the cluster level

When working at the asset level, we will try to find which asset matches a given asset. Because we have about \(18,000\) unique asset time series, the search cost (without optimization) is very expensive (\(\approx 18000^2/2\)).

When working at the cluster level, we take advantage of the market correlation. Next, after finding which cluster follows another one, we would have to find which assets match. In terms of complexity, this is much faster as we have only 311 cluster groups.

Cross-Correlation

The easiest way to find matching time series is by looking at their cross-correlation coefficient (CC). The CC is simply the correlation coefficient of the two considered segments, separated by a lag \(dt\):

\[CC(X, Y; dt) = \frac{\sum_{i=1:n-dt}(x_i - \mu_X)(y_{i+dt}-\mu_Y)}{\sqrt{\sum_{i=1:n-dt}(x_i - \mu_X)^2 \times \sum_{i=1:n-dt}(y_{i+dt} - \mu_Y)^2}}\]

The interpretation is the same as for Pearson correlation coefficient:

1: signals are correlated
0: signals are not correlated
-1: signals are anti-correlated (unlikely to occurs with the market hypothesis)

To compute the cross-correlation, we take the mean return rate of a cluster (i.e. non-integrated signals). Compared to the integrated signal, the resulting correlation is much more discriminative (because trends can fool the results).

\[\hat{R}_C(t) = \sum_{A \in C} X_A(t)\]

Finding Aligned Groups

The following image shows the cross correlation with lag=0 for all clusters pairs.

Diagonal elements are 1 (hopefully)
We have some white lines, which are due to missing values for some assets. (We will exploit this information after)
Green dots represent pairs with sufficient correlation (\(CC > 0.5\))

Setting an acceptance threshold to 0.5, we find 122 matching pairs, which move our number of meta groups from 311 to 226. We verified that the clusters pairs match correctly by plotting their average integrated return rate:

Looking at the size of the meta groups, we can see that many of the singleton clusters are now assigned to a larger group.

Finding Non-Aligned Groups

Looking at the groups where we could not measure the cross correlation (white lines and columns on the previous matrix), we can see two things:

There is one or two weeks of missing values (i.e. the cut is neat)
Calling \(R_1\) the mean of the cluster with one week of missing values, and \(R_2\) with two weeks of missing values, \(CC(R_1, R_2, \text{one week}) > 0.5\).

In other word, the lag between clusters is likely to be exactly 1 week, which greatly reduces the size of our search space. We repeated the same operation that in the previous sub-section, but with a lag of 1 week (we also used a lag of 2 weeks for the unmatched clusters). For validation, we checked if we could find at least one asset in both groups separated of one week.

With some graph analysis, we recovered the full order, where element with missing values were either the start, either the end of the dataset. In total, we have 217 full weeks of consecutive data.

You can check yourself with the clusters of the 10 first weeks:

// (weekNbr: [cluster_ID])
{
[136, 956],
[1169, 1172],
[512, 811, 951, 867, 639, 576, 93, 994],
[857, 1123, 92, 1129, 653, 1165],
[568, 1265, 515, 427, 156, 14, 1216, 542],
[752, 583, 430, 924, 492, 499, 55],
[268, 844, 77, 455, 237],
[648, 1448, 242, 614, 518, 325, 756],
[234, 842, 1378, 46, 1196, 691],
[1393, 538, 1055, 883, 1358, 859, 190],
[768, 732, 181, 1384, 511, 1115, 302, 1226],
}

Given this history, we created a “Frankenstein” time-series, which is the average price by connecting all clusters together.

Average return rate

About the Testing Set

Now that we have the ordered sequence, we can study the testing set. It is very likely that the testing set overlap the training part. We would do just as before:

Exploit md and bc;
Study the cross-correlation to find where each cluster goes;
Match assets together.

Matching Clusters

After computing the CC, we can visually check that results at the cluster level are coherent:

For all test clusters, we were able to align them with a train group.

With this information, we can stop there and propose a submission, without a lot of intelligence:

For a given day:

find the train clusters which cover this date
given the train prediction (ytrain), compute the average return
find the test clusters which cover this date
assign the mean return value to this date.

With this approach, I got rank 7:

7 Sept. 2, 2022, 3:18 p.m. japoneris MSE 0.0061

Matching Assets

This is not the end of the analysis. Let’s finish with asset matching.

Now that we know which clusters are recorded during the same period, we can search for matching assets. When looking at md values, we can see that some assets are present in both training and testing set.

Previously, we found how clusters follow each others. The goal now is to reconstruct for one assets its full price sequence over time. With a longer history, it would be easier to study the regularities / behavior.

We selected one test cluster at random, the 1472, identified at week 34. If we compute the cross-correlation with the train assets of this week, we get this correlation matrix, where we have only a strong correlation for asset 11 (yellow color):

Now, we can look at past and future events. If we look at assets starting the week before (33), we get many more matching candidates. For all but assets 9 and 10, we could find their relatives:

And looking one week after (35), each asset has an analog asset:

With these information, we can put all pieces together, and obtain a perfect overlap:

If we aligned them correctly by changing the starting value of each segment, we get:

We have identified 1262 unique time series. A few assets cover all the dataset, but many of them are singleton (408 occurs only once.)

The following plot is the length of an assets versus its rank:

The longest signal we have covers all the dataset, and is very similar to the “Frankenstein” signal we have ploted before:

Conclusion

We analyzed the Napoleon X dataset, and we were able to recover the relative time for all clusters. We were also able to recreate assets signals that were cut into pieces. The longest asset cover all the dataset.

Without any sophisticated algorithms, we were able to get a good MSE score.

We still don’t know what are md and bc, which will be the topic of the next post on the series.

You can read the next article on the topic here

>> You can subscribe to my mailing list here for a monthly update. <<