# Napoleon X Challenge - Matching Pieces Together

# Introduction

In my learning path, I wanted to work with financial time series.
I found one dataset dealing with cryptocurrencies, that you can find [on Challenge data](https://challengedata.ens.fr/participants/challenges/71/), with the details of the challenge.
You are given groups of time series (called clusters) where the goal is to predict using the 23 previous hour points the average change of the 24th hours.

The description of the dataset is obfuscated:

- they say time series are clustered, but we don't know on which basis
- time series with the same asset number within two different clusters do not correspond necessarily to the same crypto asset,
- There are undescribed quantities: `md` and `bc`

I found some regularities in the dataset.
I did not find yet how the time series are clustered, but I am able to:

- Find the relative date of each cluster
- Find which asset is which, whatever the cluster

I will detail in this article how I did it.


## Table of Content

- [How to use the secret quantities ?](#secret-quantities)
    - [Visual analysis](#visual-analysis)
    - [Searching matching pairs](#searching-matching-pairs)
- [Full signal Reconstruction](#connecting-clusters)
    - [Approaches complexity](#complexity)
    - [Cross correlation to the rescue](#cross-correlation)
- [Analysis of the test set](#about-the-testing-set)
    - [Matching clusters](#matching-clusters)
    - [Matching assets](#matching-assets)
- [Conclusion](#conclusion)


# Secret quantities

## Visual Analysis

Starting with the secret quantities `md` and `bc`.
The acronymes are not standard (e.g. mean/std/var), so that useless to guess without analysis.

The first idea is to look at histograms, to know the range.

- For `bc`, the maximal value is `1` while the minimal value is not clearly visible (there are no strong bottom lines). It might be a rate `1==100%` for instance, but negative rate are often not allowed when capping to `100%`. It might be a correlation coefficient
- For `md`, because of the negative values, it seems that these are log values.


![](/images/napoleon_X/P1/hist_md_bc.svg)

The second thing we can do is to look at the values over one cluster.
Here, the values are sorted by asset ID. As we have 21 chunks per cluster, we have 21 values per asset.
Looking at the following plot, the asset value is unchanged over the cluster lifetime.
It seems that asset IDs are assigned based on the decreasing value of `md`. For most of the clusters, it is true, but there are a few exceptions.

![](/images/napoleon_X/P1/cluster_11_md_bc.svg)


## Searching Matching Pairs

Because the value of `md` and `bc` is constant for one asset over the cluster lifetime,
my hypothesis was that `md` (and `bc`) is independent of the cluster ID, but depend on the time.

When we count the number of unique pairs `(cluster, asset)`, we have `23,607` pairs.
However, if we look at the number of unique `md` values, we have only `17,864` distinct values (and `17,870` unique `bc` values), which means, if the hypothesis is correct, that we have `5,750` duplicates.

We searched for time-series with the same `md` values.
Looking at the correlation between two selected time series, the value is `1`, for all pairs we found.
As an example, we have:

![](/images/napoleon_X/P1/clusters_176_630_asset_0_1.svg)

Knowing that two assets in two different clusters are exactly the same, it is possible to group the cluster together.
Therefore, we can group the `1464` train clusters into `311` time related groups.
The following plot show the size of the groups obtained, in terms of cluster number.

![](/images/napoleon_X/P1/groups_md_alignement.svg)

*Note*: To obtain these plot, we simply merge all the 23-hours chunks together, adding a 24th hour with a return rate equal to $$0$$.
To get this series, we just took the cumulated return `X.cumsum()`. This is not equal to the unscaled asset value `(1+X).cumprod()`, but this difference has no impact for this part.


# Connecting Clusters


In the previous part, we were able to find which clusters occurs at the same time.
Now, we would like to find the order, i.e. the relative date of all clusters.


## Complexity

Because assets are cryptoassets, they are likely to not be independent of each other, and to be subject to the same market variations.
If we look at several clusters occurring at the same date, we can see the market impact.
Globally, the market is increasing. We see two major event at time $$\approx 150$$ and $$\approx 250$$.

*Note*: The black line is the cluster average.


![](/images/napoleon_X/P1/group_ID1_2_clust_361.svg)

![](/images/napoleon_X/P1/group_ID1_2_clust_404.svg)

![](/images/napoleon_X/P1/group_ID1_2_clust_501.svg)

Even if in each cluster, there is only one asset which is shared, the correlation is high.

In the dataset, assuming that clusters overlap, we would find asset $$X_1$$ and $$X_2$$ where $$X_1(t+k) = X_2(t)$$.
There are two ways to identify these match:

- working at the **asset** level
- working at the **cluster** level

When working at the *asset level*, we will try to find which asset match which other asset.
Because we have about $$18,000$$ unique assets time series, the search cost (without optimization) is very expensive ($$\approx 18000^2/2$$).

When working at the *cluster level*, we take advantage of the market correlation, knowing that events would occurs at the same time, whatever the asset.
Next, after finding which cluster follows another one, we would have to find which assets match.
In terms of complexity, this is much faster as we have only `311` cluster groups.


## Cross-Correlation

The easiest way to find matching time series is by looking at their cross-correlation (CC).
The CC is  simply the correlation coefficient of the two considered segments, separated by a lag $$dt$$:

$$CC(X, Y; dt) = \frac{\sum_{i=1:n-dt}(x_i - \mu_X)(y_{i+dt}-\mu_Y)}{\sqrt{\sum_{i=1:n-dt}(x_i - \mu_X)^2 \times \sum_{i=1:n-dt}(y_{i+dt} - \mu_Y)^2}}$$

The interpretation is the same as for Pearson correlation coefficient:

- `1`: signals are correlated
- `0`: signals are not correlated
- `-1`: signals are anti-correlated (unlikely to occurs with the market hypothesis)


To compute the cross-correlation, we take the mean return rate of a cluster (i.e. non-integrated signals).
Compared to the integrated signal, the resulting correlation is much more discriminative (because trends fool the results).

$$\hat{R}_C(t) \sum_{A \in C} X_A(t)$$


## Finding Aligned Groups

The following image show the cross correlation with `lag=0` for all clusters pairs.
Of course, the diagonal is `1`.
We have some white lines, which are due to missing values for some assets.
We will exploit this information after.

![](/images/napoleon_X/P1/correlation_0.svg)

Setting an acceptance threshold to `0.5`, we find `122` matching pairs, which move our number of meta groups from `311` to `226`.
We verify that the clusters pairs match correctly by plotting their average integrated return rate:

![](/images/napoleon_X/P1/correlation_0_series.svg)

Looking at the size of the meta groups, we can see that many of the singleton clusters have been assigned to a larger group.


![](/images/napoleon_X/P1/groups_cc_alignement.svg)


## Finding Non-Aligned Groups

Looking at the groups where we couldn't measure the cross correlation, we can see two things:

- There is one or two weeks of missing values (i.e. the cut is neat)
- calling $R_1$ the mean of the cluster with one week of missing values, and $$R_2$$ with two weeks of missing values, $$CC(R_1, R_2, \text{one week}) > 0.5$$.

In other word, the lag between clusters is likely to be `1 week`, and not random values like `59 hours`, which reduces greatly our search space.
We repeated the same operation that in the previous sub-section, but with a lag of `1 week` (we also used a lag of `2 weeks`).
For validation, we checked if we could find at least one asset in both groups separated of one week.

With some graph analysis, we found the start, knowing that the end was made of the clusters with missing data.


![Average return rate](/images/napoleon_X/P1/global_rate.svg)


In total, we have `217` full weeks of consecutive data.

You can check yourself with the clusters of the 10 first weeks:

```json
// (weekNbr: [cluster_ID])
{
  0: [136, 956],
  1: [1169, 1172],
  2: [512, 811, 951, 867, 639, 576, 93, 994],
  3: [857, 1123, 92, 1129, 653, 1165],
  4: [568, 1265, 515, 427, 156, 14, 1216, 542],
  5: [752, 583, 430, 924, 492, 499, 55],
  6: [268, 844, 77, 455, 237],
  7: [648, 1448, 242, 614, 518, 325, 756],
  8: [234, 842, 1378, 46, 1196, 691],
  9: [1393, 538, 1055, 883, 1358, 859, 190],
  10: [768, 732, 181, 1384, 511, 1115, 302, 1226],
}
```


# About the Testing Set

Now that we have the ordered sequence, we can study the testing set.
It is very likely that the testing set overlap the training part.
We would just do as before:

1. Exploit `md` and `bc`;
2. Study the cross-correlation to find where each cluster goes;
3. Match assets together.

## Matching Clusters

After computing the CC, we can visually check that results at the cluster level are coherent:

![](/images/napoleon_X/P1/TT_comparison_112.svg)

![](/images/napoleon_X/P1/TT_comparison_158.svg)

For all test clusters, we are able to give a unique date.

With this information, we can stop there and propose a submission, without a lot of intelligence:

For a given day:

- find the train clusters which cover this date
- given the train prediction (`ytrain`), compute the average return
- find the test clusters which cover this date
- assign the mean return value to this date.

With this approach, I got rank 7:

> 7 	Sept. 2, 2022, 3:18 p.m. 	japoneris 	**MSE 0.0061**


## Matching Assets

This is not the end of the analysis.
Let's finish with asset matching.

Now that we know which clusters are recorded at the same time, we have to search which assets is which.
When looking at `md` values, we can see that some assets are present in both training and testing set.

Previously, we found how clusters follow each others.
The goal now is to find the asset pieces over time.
With a longer history, it would be easier to study the regularities / behavior.

We selected one test cluster at random, the `1472`, identified at week `34`.
If we compute the cross-correlation with the train assets of this week, we get this correlation matrix, where we have only a strong correlation for asset `11` (yellow color):

![](/images/napoleon_X/P1/CC_TT_1472_same.svg)


Now, we can look at past and future events.
If we look at assets starting the week before (`33`), we get many more matching candidates.
For all but assets `9` and `10`, we could find their relatives:

![](/images/napoleon_X/P1/CC_TT_1472_before.svg)

And looking one week after (`35`), each asset has an analog asset:

![](/images/napoleon_X/P1/CC_TT_1472_after.svg)

With these information, we can put all pieces together, and obtain a perfect overlap:


![](/images/napoleon_X/P1/asset_7871_unaligned.svg)

![](/images/napoleon_X/P1/asset_7871_aligned.svg)

With this analysis, we can put assets pieces together.
We identified `1262` unique time series (which is an upper bound. Some assets are not continuously recorded, so some relationships are not observable).
A few of them cover all the dataset duration, and many are singleton (`408` occurs only once.)

![](/images/napoleon_X/P1/asset_rank_length.svg)


The longest signal we have covers all the dataset, and is very similar to the cluster-average signal:

![](/images/napoleon_X/P1/asset_15045_full_reconstruction.svg)


# Conclusion

We analyzed the Napoleon X dataset, and we were able to recover the relative time for all clusters.
We were also able to recreate assets signals that were cut into pieces.
The longest asset cover all the dataset.

Without any sophisticated algorithms, we were able to get a good MSE score.

We still don't know what are `md` and `bc`, which will be the topic of the next post on the series.