# Napoleon X Challenge - Part II - Time Series Disambiguation

# Introduction

In the previous post, I connected the dots between the different time series, which made it possible to recover the full time order and to assign a unique identifier to each asset. In this part, I will use external data to disambiguate the assets and understand the magic quantities.

The data and details of the challenge are still available [here](https://challengedata.ens.fr/participants/challenges/71/).

# Visual Analysis

The first thing we can do to learn what the two quantities `md` and `bc` are is to look at their distribution and their evolution over time.

## Distribution

![](/images/napoleon_X/P2/meta_bc_md_distrib.svg)

Neither quantity looks like a usual distribution (Normal, Poisson, Cauchy, ...). At least we know the range of the values.

For `bc`, the values lie mainly in `[0, 1]`, but part of the distribution leaks below `0`, so this is probably not a rate in `%`. The values are also oddly capped at 1, which does not look natural given that negative values are possible. `bc` might be a correlation coefficient, but we don't know how it is calculated.

For `md`, the first hypothesis was the log of the true asset price relative to Bitcoin. Because no cryptoasset is more expensive than Bitcoin, the values would be in $$(0, 1]$$, leading to log values in $$(-\infty, 0]$$.

## Temporal Analysis

As we know the relative time of each asset, we can see how these two quantities evolve over time. For `md` we get:

![](/images/napoleon_X/P2/meta_md_time_week.svg)

And `bc`:

![](/images/napoleon_X/P2/meta_bc_time_week.svg)

We can see that there are major events that affect the `md` and `bc` time-series globally. This is very clear for `md`, where the lines seem parallel to each other, with a jump near `week 20` and a progressive drop with recovery between `week 160-180`. The behavior of `bc` is clearly different. It is less stable, with short-term events of large amplitude.
The dates of the events of `md` and `bc` don't seem to be related.

## Evolution for Assets

Looking at the history of each asset helps us understand the long-term variations more precisely. For `md`, because the top items are so close to each other, we expect them to belong to a single asset. Showing the lineage confirms this hypothesis:

![](/images/napoleon_X/P2/meta_md_time_asset_150.svg)

We can study `bc` the same way:

![](/images/napoleon_X/P2/meta_bc_time_asset_150.svg)

The results are noisier and more difficult to understand. The value of `bc` is less stable over time. For the legibility of the plot, we only represented assets with a history of at least 150 weeks. We can see that, even though the series are noisy, there is still a "top" series.

## Asset Values and Return

Selecting one asset at random, we get the following reconstructed signal:

![](/images/napoleon_X/P2/meta_md_asset_0_rec.svg)

The corresponding `md` values are:

![](/images/napoleon_X/P2/meta_md_asset_0_md.svg)

You can see that the relationship between the asset value and the `md` value is not straightforward. Given that the asset variations are quite large, one way to study them is to move to the log scale. Superposing the two curves (with different scaling), we get the following:

![](/images/napoleon_X/P2/meta_md_asset_0_md_logrec.svg)

Here, we can see how similar the two curves are. The shape is almost the same, and short-term events occur at the same time. Some differences remain visible in the long term, as the two curves do not match exactly. Looking at other assets, we get similar results:

![](/images/napoleon_X/P2/meta_md_asset_1_md_logrec.svg)

The hypothesis that `md` represents the **mean price relative to Bitcoin**, on a log scale, is highly probable. The observed difference can be due to the 24th missing point of each day.

### Log: Base

Here, we used the **natural log** to convert the asset time-series.
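As a minimal sketch of that conversion, the log series can be built directly from the returns by accumulating `log(1 + R)`. The function name and the toy returns below are illustrative assumptions, not part of the challenge code, and the unknown initial value is set to 1:

```python
import numpy as np

def log_series_from_returns(returns, x0=1.0):
    """Return log(X_t) for t = 1..T from returns R_1..R_T (hypothetical helper)."""
    returns = np.asarray(returns, dtype=float)
    # Cumulating log(1 + R_i) is the log of the cumulative product of (1 + R_i).
    # log1p stays accurate when the returns are small.
    return np.log(x0) + np.cumsum(np.log1p(returns))

# Toy returns (made-up numbers, not taken from the challenge data)
toy_returns = np.array([0.10, -0.05, 0.02])

log_x = log_series_from_returns(toy_returns)
# Same quantity computed the long way: integrate first, then take the log
direct = np.log(np.cumprod(1.0 + toy_returns))
```

Both routes give the same series; working with the cumulative sum simply avoids very large or very small intermediate products on long histories.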
As a reminder, the value of an asset at time $$t$$ given the previous results is:

$$X_t = X_0 \prod_{i=1}^t (1 + R_i)$$

where $$X_0$$ is the value of the asset at time $$t_0$$, and $$R_i$$ the return at time $$t_i$$, i.e. $$R_i = \frac{X_i}{X_{i-1}} - 1$$. Because we don't know $$X_0$$, we set it to $$1$$ to study the behavior of $$\mathbf{X}$$.

Moving to the log, everything becomes simpler:

$$\log(X_t) = \log(X_0) + \sum_{i=1}^t \log(1 + R_i)$$

To convert a log from one base to another, we have a simple formula:

$$\log_b(a) = \frac{\log(a)}{\log(b)}$$

The log in base $$b$$ is simply the natural log of the value divided by the natural log of the base, so changing the base only rescales the series. Here, we observed that the natural log is the best match:

$$md \approx A + \sum_{i=1}^t \log(1 + R_i)$$

Centering the asset log series on the `md` mean, we can see that the scale is similar:

![](/images/napoleon_X/P2/meta_md_asset_0_md_logrec_ss.svg)

# Looking at True Asset Prices

Because cryptocurrencies are "open systems", prices are well recorded and available for free in multiple databases. To confirm our hypothesis about the interpretation of `md`, we wanted to compare the values to true historical asset prices. We found [Coinmetrics](https://coinmetrics.io/community-network-data/), which provides a day-by-day history for many assets. For each asset, we have the price relative to the Bitcoin price, and the price of the asset in USD or Euro. There is a lot of other information, but we won't exploit it here.

The only issue with this dataset is the sampling: the Napoleon dataset is hourly, while the Coinmetrics database gives daily prices. We need to adjust for this by averaging the returns we have over one day.

## ETH Relative to Bitcoin

With this dataset in hand, we can quickly check whether our hypothesis is valid. The most valuable crypto asset after Bitcoin is Ethereum.
If we look at its price **relative to Bitcoin**, we get:

![](/images/napoleon_X/P2/crypto_eth_lin.svg)

And if we move to the log scale, we get this signal:

![](/images/napoleon_X/P2/crypto_eth_log.svg)

Averaging over one-week chunks and moving to the log scale, we get:

![](/images/napoleon_X/P2/crypto_eth_log_md_w1.svg)

where the signal exactly overlaps the `md` values of the best asset. We can now put a date on it:

![](/images/napoleon_X/P2/crypto_eth_log_md_date.svg)

We are quite happy with this result, but it required a small adjustment: we had to **shift** the time series up by `1.8` points, and we have no clear explanation for it. Using the wrong log base would change the amplitude of the time-series movements. However, the amplitude was correct and didn't need to be adjusted. It means that:

$$MD(t) = A + \log(\mathrm{ETH/BTC\ price}(t)) = \log(\exp(A) \times \mathrm{ETH/BTC\ price}(t))$$

We do not clearly understand where this factor comes from. Additionally, the factor is not the same for all crypto-assets. For `XLM`, we needed to readjust by `7` points, and for `DOGE` by `8.8` points. The factor doesn't seem to be related to the initial asset value. Fortunately, we found that Ethereum is the top one, but Dash, for instance, which is pricey, has a negative coefficient of `-0.69`.

We did not find a general law to explain these coefficients. It seems that, for each asset, a random factor was selected to transform the true price, so we cannot trivially recover it using the dataset alone.

### Dates

To identify which dates best fit our dataset, we studied the cross-correlation.

To find the recording date of `md`, we tested averaging the asset's values over one, two and three weeks. We tested different starting days of the week, because it is possible that `day 1` is not a Monday. Additionally, in some countries the first day of the week is Sunday, so we checked for that as well. The best averaging window was `1 week`, and the best starting date was `2017-08-02` (i.e. averaging between this date and `2017-08-09`). The 2nd of August 2017 is a Wednesday. This date matches the earliest `md` record.

To find the recording date of the asset value, we integrated our series and under-sampled it once every day:

```python
# R: hourly returns of one asset; integrate them, then keep one point every 24 samples
(1 + R).cumprod()[::24]
```

We did not pay attention to the possible hour lag, as it is of limited interest. By searching for the best cross-correlation over different chunks, we found that the best starting day is `2017-07-19`. Knowing that there are 216 weeks recorded, the recording ends on `2021-09-07`.

In other words, `md` is the log of the asset's average value over the 3rd week of a cluster.

# Filling the Dataset

As we were able to find that `md` was the mean log price, we tried to fill the missing 24th hour of each day by adjusting for the difference between the log returns and the `md` values.

However, this approach was unsuccessful. The gap between the returns and `md` is too large to be compensated using `5%` of the dataset. We obtained a strongly modified distribution. Therefore, we did not try any submission with this information.
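For reference, the start-date search described in the Dates section can be sketched on synthetic data. This is only an illustration under stated assumptions, not the exact procedure used above: the helper names are hypothetical, the price series is a simulated random walk rather than Coinmetrics data, the score is a plain Pearson correlation computed for each candidate start day, and `md` is modelled as the log of the weekly mean price plus a constant shift.

```python
import numpy as np

def weekly_log_mean(daily_price, start, n_weeks):
    """Log of the mean price over consecutive 7-day windows from index `start`."""
    window = np.asarray(daily_price[start:start + 7 * n_weeks], dtype=float)
    return np.log(window.reshape(n_weeks, 7).mean(axis=1))

def best_start(daily_price, md, max_offset):
    """Return (start, correlation, shift) for the best-correlated start day."""
    md = np.asarray(md, dtype=float)
    best = None
    for start in range(max_offset + 1):
        candidate = weekly_log_mean(daily_price, start, len(md))
        corr = np.corrcoef(candidate, md)[0, 1]
        if best is None or corr > best[1]:
            # The additive constant A is estimated as the mean residual.
            shift = float(np.mean(md - candidate))
            best = (start, corr, shift)
    return best

# Synthetic data: a random-walk log price (assumption, not real market data)
rng = np.random.default_rng(0)
daily_price = np.exp(np.cumsum(rng.normal(0.0, 0.05, size=400)))

# Build a fake `md` with a known start day and a known shift
true_start, true_shift, n_weeks = 9, 1.8, 40
md = weekly_log_mean(daily_price, true_start, n_weeks) + true_shift

start, corr, shift = best_start(daily_price, md, max_offset=20)
```

Because the correlation is insensitive to an additive constant, the start day can be found first and the per-asset shift estimated afterwards, which mirrors the order in which the two quantities were identified above.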