# Napoleon X Challenge - Part I - Matching Pieces Together
# Introduction
I wanted to get in touch with financial time series.
I found one dataset dealing with cryptocurrencies, that you can find [on Challenge data](https://challengedata.ens.fr/participants/challenges/71/), with the details of the challenge.
Prices are recorded every hours.
You get chunks of 21 days, with the 24th hour price missing.
Additionally, time series are grouped into "clusters".
You need to predict the 24th hour price for the **cluster** (not for the individual time series).
revious hour points the average change of the 24th hours.
The description of the dataset is obfuscated:
- Each chunk of 21 days is anonymized, you don't know if its ethereum or dodge coin
- How clusters are created is unknown
- There are undescribed quantities: `md` and `bc`
I found some regularities in the dataset, an I was able to:
- reconstruct the full history (i.e., find which assets is after another),
- use this information to accurately predict the average return of a cluster
- with external data, retrieve the dates and which crypto asset is what
- find what is the meaning of `bc` and `md`
- get the first place of the challenge.
I will detail in this article and the followings how I did it.
This one is dedicated to how to reconstruct the full history.
## Table of Content
- [How to use the secret quantities ?](#secret-quantities)
- [Visual analysis](#visual-analysis)
- [Searching matching pairs](#searching-matching-pairs)
- [Full signal Reconstruction](#connecting-clusters)
- [Approaches complexity](#complexity)
- [Cross correlation to the rescue](#cross-correlation)
- [Analysis of the test set](#about-the-testing-set)
- [Matching clusters](#matching-clusters)
- [Matching assets](#matching-assets)
- [Conclusion](#conclusion)
----
# Secret quantities
## Visual Analysis
Starting with the secret quantities `md` and `bc`.
The acronymes are not standard (e.g. mean/std/var), so that useless to guess without analysis.
The first idea is to look at histograms to study the distribution.
- For `bc`, the maximal value is `1` while the minimal value is not clearly visible (there is not a strong bottom line). It might be a rate `1==100%` for instance, but negative rate are often not allowed when capping to `100%`. It might be a correlation coefficient
- For `md`, because of the negative values, it seems to be `log` values.
![](/images/napoleon_X/P1/hist_md_bc.svg)
The second thing we can do is to look at the values over one cluster.
Here, the values are sorted by asset ID. As we have 21 chunks per cluster, we have 21 values per asset. (because the dataset is recorded week by week)
Looking at the following plot, the asset value is unchanged over the cluster lifetime.
It seems that asset IDs are assigned based on the decreasing value of `md`.
For most of the clusters, it is true, but there are a few exceptions.
![](/images/napoleon_X/P1/cluster_11_md_bc.svg)
## Searching Matching Pairs
If `md` and `bc` are unique and constant over 21 days, I wanted to see how similar two time series with a similar `md` or `bc` looks like.
I studied first how unique are `md` and `bc` values:
- We have `23,607` unique pairs of `(cluster, asset)` (given the IDs)
- We only have `17,864` unique `md` values
- We only have `17,870` unique `bc` values
Therefore, we can find `5,750` pairs of time series with exactly the same `bc` or the same `md`.
So, we picked at random `md` and `bc` pairs, and we obtained that:
![](/images/napoleon_X/P1/clusters_176_630_asset_0_1.svg)
If you measure correlation, you get a perfect `1`
Knowing that two assets in two different clusters are exactly the same, it is possible to "align" clusters, i.e., finding clusters which occur during the same period.
The `1464` train clusters were grouped into `311` time related groups.
The following plot show the size of the groups obtained, in terms of cluster number:
![](/images/napoleon_X/P1/groups_md_alignement.svg)
Over the `1464`, only $$\approx 100$$ clusters are alone.
Therefore, we can learn plenty of stuff using the `90%` others.
# Connecting Clusters
In the previous part, we were able to find which clusters occur at the same time.
If we can group clusters that way, it might be possible to find overlap between time periods, and therefore find which group occurs after another one.
## Complexity
Because assets are crypto-assets, they are likely to not be independent of each other, and to be subject to the same market variations.
If we look at several clusters occurring at the same date, we can see the market impact.
Globally, the market is increasing. We see two major event at time $$\approx 150$$ and $$\approx 250$$.
*Notes*:
- The black line is the cluster average.
- We took the cumulated returns
![](/images/napoleon_X/P1/group_ID1_2_clust_361.svg)
![](/images/napoleon_X/P1/group_ID1_2_clust_404.svg)
![](/images/napoleon_X/P1/group_ID1_2_clust_501.svg)
Even if in each cluster, there is only one asset which is shared, the correlation is very high between the means.
In the dataset, assuming that clusters overlap, we would find asset $$X_1$$ and $$X_2$$ where $$X_1(t+k) = X_2(t)$$.
There are two ways to identify these match:
- working at the **asset** level
- working at the **cluster** level
When working at the *asset level*, we will try to find which asset matches a given asset.
Because we have about $$18,000$$ unique asset time series, the search cost (without optimization) is very expensive ($$\approx 18000^2/2$$).
When working at the *cluster level*, we take advantage of the market correlation.
Next, after finding which cluster follows another one, we would have to find which assets match.
In terms of complexity, this is much faster as we have only `311` cluster groups.
## Cross-Correlation
The easiest way to find matching time series is by looking at their cross-correlation coefficient (CC).
The CC is simply the correlation coefficient of the two considered segments, separated by a lag $$dt$$:
$$CC(X, Y; dt) = \frac{\sum_{i=1:n-dt}(x_i - \mu_X)(y_{i+dt}-\mu_Y)}{\sqrt{\sum_{i=1:n-dt}(x_i - \mu_X)^2 \times \sum_{i=1:n-dt}(y_{i+dt} - \mu_Y)^2}}$$
The interpretation is the same as for Pearson correlation coefficient:
- `1`: signals are correlated
- `0`: signals are not correlated
- `-1`: signals are anti-correlated (unlikely to occurs with the market hypothesis)
To compute the cross-correlation, we take the mean return rate of a cluster (i.e. non-integrated signals).
Compared to the integrated signal, the resulting correlation is much more discriminative (because trends can fool the results).
$$\hat{R}_C(t) = \sum_{A \in C} X_A(t)$$
## Finding Aligned Groups
The following image shows the cross correlation with `lag=0` for all clusters pairs.
- Diagonal elements are `1` (hopefully)
- We have some white lines, which are due to missing values for some assets. (We will exploit this information after)
- Green dots represent pairs with sufficient correlation ($$CC > 0.5$$)
![](/images/napoleon_X/P1/correlation_0.svg)
Setting an acceptance threshold to `0.5`, we find `122` matching pairs, which move our number of meta groups from `311` to `226`.
We verified that the clusters pairs match correctly by plotting their average integrated return rate:
![](/images/napoleon_X/P1/correlation_0_series.svg)
Looking at the size of the meta groups, we can see that many of the singleton clusters are now assigned to a larger group.
![](/images/napoleon_X/P1/groups_cc_alignement.svg)
## Finding Non-Aligned Groups
Looking at the groups where we could not measure the cross correlation (white lines and columns on the previous matrix), we can see two things:
- There is one or two weeks of missing values (i.e. the cut is neat)
- Calling $$R_1$$ the mean of the cluster with one week of missing values, and $$R_2$$ with two weeks of missing values, $$CC(R_1, R_2, \text{one week}) > 0.5$$.
In other word, the lag between clusters is likely to be exactly `1 week`, which greatly reduces the size of our search space.
We repeated the same operation that in the previous sub-section, but with a lag of `1 week` (we also used a lag of `2 weeks` for the unmatched clusters).
For validation, we checked if we could find at least one asset in both groups separated of one week.
With some graph analysis, we recovered the full order, where element with missing values were either the start, either the end of the dataset.
In total, we have `217` full weeks of consecutive data.
You can check yourself with the clusters of the 10 first weeks:
```json
// (weekNbr: [cluster_ID])
{
0: [136, 956],
1: [1169, 1172],
2: [512, 811, 951, 867, 639, 576, 93, 994],
3: [857, 1123, 92, 1129, 653, 1165],
4: [568, 1265, 515, 427, 156, 14, 1216, 542],
5: [752, 583, 430, 924, 492, 499, 55],
6: [268, 844, 77, 455, 237],
7: [648, 1448, 242, 614, 518, 325, 756],
8: [234, 842, 1378, 46, 1196, 691],
9: [1393, 538, 1055, 883, 1358, 859, 190],
10: [768, 732, 181, 1384, 511, 1115, 302, 1226],
}
```
Given this history, we created a "Frankenstein" time-series, which is the average price by connecting all clusters together.
![Average return rate](/images/napoleon_X/P1/global_rate.svg)
---
# About the Testing Set
Now that we have the ordered sequence, we can study the testing set.
It is very likely that the testing set overlap the training part.
We would do just as before:
1. Exploit `md` and `bc`;
2. Study the cross-correlation to find where each cluster goes;
3. Match assets together.
## Matching Clusters
After computing the CC, we can visually check that results at the cluster level are coherent:
![](/images/napoleon_X/P1/TT_comparison_112.svg)
![](/images/napoleon_X/P1/TT_comparison_158.svg)
For all test clusters, we were able to align them with a train group.
With this information, we can stop there and propose a submission, without a lot of intelligence:
For a given day:
- find the train clusters which cover this date
- given the train prediction (`ytrain`), compute the average return
- find the test clusters which cover this date
- assign the mean return value to this date.
With this approach, I got rank 7:
> 7 Sept. 2, 2022, 3:18 p.m. japoneris **MSE 0.0061**
## Matching Assets
This is not the end of the analysis.
Let's finish with asset matching.
Now that we know which clusters are recorded during the same period, we can search for matching assets.
When looking at `md` values, we can see that some assets are present in both training and testing set.
Previously, we found how clusters follow each others.
The goal now is to reconstruct for one assets its full price sequence over time.
With a longer history, it would be easier to study the regularities / behavior.
We selected one test cluster at random, the `1472`, identified at week `34`.
If we compute the cross-correlation with the train assets of this week, we get this correlation matrix, where we have only a strong correlation for asset `11` (yellow color):
![](/images/napoleon_X/P1/CC_TT_1472_same.svg)
Now, we can look at past and future events.
If we look at assets starting the week before (`33`), we get many more matching candidates.
For all but assets `9` and `10`, we could find their relatives:
![](/images/napoleon_X/P1/CC_TT_1472_before.svg)
And looking one week after (`35`), each asset has an analog asset:
![](/images/napoleon_X/P1/CC_TT_1472_after.svg)
With these information, we can put all pieces together, and obtain a perfect overlap:
![](/images/napoleon_X/P1/asset_7871_unaligned.svg)
If we aligned them correctly by changing the starting value of each segment, we get:
![](/images/napoleon_X/P1/asset_7871_aligned.svg)
We have identified `1262` unique time series.
A few assets cover all the dataset, but many of them are singleton (`408` occurs only once.)
The following plot is the length of an assets versus its rank:
![](/images/napoleon_X/P1/asset_rank_length.svg)
The longest signal we have covers all the dataset, and is very similar to the "Frankenstein" signal we have ploted before:
![](/images/napoleon_X/P1/asset_15045_full_reconstruction.svg)
# Conclusion
We analyzed the Napoleon X dataset, and we were able to recover the relative time for all clusters.
We were also able to recreate assets signals that were cut into pieces.
The longest asset cover all the dataset.
Without any sophisticated algorithms, we were able to get a good MSE score.
We still don't know what are `md` and `bc`, which will be the topic of the next post on the series.
You can read the next article on the topic [here]({% post_url 2022-09-08-Napoleon_II %})