
# Introduction

In the previous post, I connected the dots between the different time series, which made it possible to recover the full time order and to get a unique identifier for each asset.

In this part, I will use external data to disambiguate the assets and understand the magic quantities.

The data and details of the challenge are still available here.

# Visual Analysis

The first thing we can do to learn what the two quantities md and bc are is to look at their distributions and their evolution over time.

## Distribution

Neither of these two quantities looks like a usual distribution (Normal, Poisson, Cauchy, …). At least we know the range of the values.

For bc, the values lie mainly in [0, 1], but part of the distribution leaks below 0, so this is probably not a rate in %. The values are also oddly capped at 1, which doesn’t look natural given that negative values are possible. bc might be a correlation coefficient, but we don’t know how it is calculated.

For md, the first hypothesis was the log of the true asset price relative to Bitcoin. Because no crypto-asset is more expensive than Bitcoin, the price ratios would be in $$(0, 1]$$, leading to log values in $$(-\infty, 0]$$.

## Temporal Analysis

As we know the relative time of each asset, we can see how these two quantities evolve over time.

For md we get:

And bc:

We can see that there are major events that affect the md and bc time-series globally. This is very clear for md, where the lines seem parallel to each other, with a jump near week 20 and a progressive drop followed by a recovery between weeks 160 and 180.

The behavior of bc is clearly different. It is less stable, with short-term events of large amplitude. The dates of the events of md and bc don’t seem to be related.

## Evolution for Assets

Looking at the history of each asset helps to understand the long-term variations more precisely. For md, because the top items are so close to each other, we expect them to belong to a single asset. Showing the lineage confirms this hypothesis:

We can study bc the same way:

The results are noisier and more difficult to understand. The value of bc is less stable over time. For legibility, the plot only shows assets with a history of at least 150 weeks. We can see that even though the series are noisy, the same series remains the “top” one.

## Asset Values and Return

Selecting one asset at random, we get the following reconstructed signal:

The corresponding md values are:

You can see that the relationship between the asset value and the md value is not straightforward.

Given that the asset variations are quite large, one way to study them is to move to the log scale.

Superposing the two curves (with different scaling), we get the following:

Here, we can see how similar the two curves are. The shape is almost the same, and short-term events occur at the same time. There are still some differences, visible in the long term, as the two curves do not match exactly.

Looking at other assets, we get similar results:

The hypothesis that md represents the mean price relative to Bitcoin is highly probable. The observed difference may be due to the missing 24th point of each day.

### Log: Basis

Here, we used the natural log to convert the asset time-series. As a reminder, the value of an asset at time $$t$$ given the previous results is:

$X_t = X_0 \prod_{i=1}^t (1 + R_i)$

where $$X_0$$ is the value of the asset at time $$t_0$$, and $$R_i$$ the return at time $$t_i$$, i.e. $$R_i = \frac{X_i}{X_{i-1}} - 1$$. Because we don’t know $$X_0$$, we set it to $$1$$ to study the behavior of $$\mathbf{X}$$.

When moving to the log, everything becomes simpler:

$\log(X_t) = \log(X_0) + \sum_{i=1}^t \log(1 + R_i)$
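The equivalence between the product form and the sum form can be checked numerically. The sketch below is only an illustration: the returns are randomly generated (they are not the challenge data), and the variable names are my own.

```python
import numpy as np

# Hypothetical hourly returns, for illustration only (not the challenge data).
rng = np.random.default_rng(0)
returns = rng.normal(loc=0.0, scale=0.01, size=1000)

x0 = 1.0  # unknown in practice, set to 1 as in the text

# Price path from the product formula: X_t = X_0 * prod(1 + R_i)
prices = x0 * np.cumprod(1.0 + returns)

# Same path in log scale from the sum formula:
# log(X_t) = log(X_0) + sum(log(1 + R_i))
log_prices = np.log(x0) + np.cumsum(np.log1p(returns))

# Largest disagreement between the two computations (floating-point noise only)
max_gap = float(np.max(np.abs(np.log(prices) - log_prices)))
print(max_gap)
```

Working with the cumulative sum of `log1p` also avoids overflow or underflow on very long series, which the raw product can suffer from.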

To convert a log from one base to another, we have a simple formula:

$\log_b(a) = \frac{\log(a)}{\log(b)}$

The log in base $$b$$ is simply the value of the log in the natural base divided by the value of the log of the base. Here, we observed that the natural log is the best match $$md \propto A + \sum_{i=1}^t \log(1 + R_i)$$. Centering the asset log series to the md mean, we can see that the scale is similar:

# Looking at True Asset Prices

Because cryptocurrencies are “open systems”, prices are well recorded and available in multiple databases for free. To confirm our hypothesis about the interpretation of md, we wanted to compare the values to true historical asset prices. We found Coinmetrics, which provides a day-by-day history for many assets. For each asset, we have its price relative to the Bitcoin price and its price relative to USD or Euro. There is much more information available, but we won’t exploit it here.

The only issue we have with this dataset is the sampling: Napoleon’s dataset is hourly, while the Coinmetrics database gives us daily prices. We need to adjust for this by averaging our returns over one day.
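One possible way to bring an hourly series to a daily resolution is sketched below. It assumes exactly 24 points per day and averages the reconstructed price inside each day; both choices are assumptions made for the illustration, and the returns are synthetic.

```python
import numpy as np

# Hypothetical hourly returns covering 14 days (synthetic, for illustration).
rng = np.random.default_rng(1)
hourly_returns = rng.normal(loc=0.0, scale=0.01, size=24 * 14)

# Reconstructed hourly price path with X_0 = 1
hourly_prices = np.cumprod(1.0 + hourly_returns)

# One row per day, 24 hourly points per row, then average within each day.
daily_prices = hourly_prices.reshape(-1, 24).mean(axis=1)

print(daily_prices.shape)
```

If the number of points per day is not constant (for instance with a missing hour), the reshape no longer applies and the grouping has to be done on explicit timestamps instead.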

## ETH Relative to Bitcoin

With this dataset in hand, we can quickly check whether our hypothesis is valid. The most valuable crypto-asset after Bitcoin is Ethereum. If we look at its price relative to Bitcoin, we get:

And if we move to the log scale, we get this signal.

Averaging over one week’s chunk and moving to the log scale, we get:

where the signal exactly overlaps the md values of the best asset.

We can now put a date on it:

We are quite happy, but we needed to make a small adjustment: we had to shift the time series up by 1.8 points, and there is no clear explanation for it.

Using the wrong log base would affect the amplitude of the time-series movements. However, the amplitude was correct and didn’t need to be adjusted.

It means that $$MD(t) = A + \log(\text{ETH/BTC price}(t)) = \log(\exp(A) \times \text{ETH/BTC price}(t))$$. We do not clearly understand where this factor comes from. Additionally, the factor is not the same for all crypto-assets: for XLM, we needed to readjust by 7 points, and for DOGE by 8.8 points.
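The offset and the log base can be separated with a simple linear fit of md against the log price: the slope reveals the base (1 for the natural log) and the intercept estimates $$A$$. This is a sketch on synthetic data, not the procedure actually used; the price ratios are made up, and only the 1.8 shift comes from the text.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical weekly price ratios in (0, 1], not real market data.
ratio = np.exp(rng.uniform(-6.0, -2.0, size=200))

true_offset = 1.8  # shift reported for Ethereum in the text
md = true_offset + np.log(ratio)  # synthetic md built with the natural log

# Fit md = slope * log(ratio) + intercept
slope, intercept = np.polyfit(np.log(ratio), md, deg=1)

# With the wrong base the slope moves away from 1 (ln(10) for log10),
# which is how a base mismatch would show up as an amplitude error.
slope10, _ = np.polyfit(np.log10(ratio), md, deg=1)

print(slope, intercept, slope10)
```

A slope close to 1 together with a non-zero intercept is consistent with the observation that the amplitude was right and only a constant shift was needed.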

The factor doesn’t seem to be related to the initial asset value. Ethereum happens to be the top one, but Dash, for instance, which is also pricey, has a negative coefficient of -0.69. We did not find a general law to explain these coefficients. It seems that for each asset, a random factor has been selected to transform the true price, so we cannot recover it trivially using the dataset alone.

### Dates

To identify which dates best suit our dataset, we studied the cross-correlation.

To find the recording date of md, we tested averaging the asset’s values over one, two and three weeks. We tested different starting days of the week, because it is possible that day 1 is not a Monday. Additionally, in some countries the first day of the week is Sunday, so we checked for that too. The best averaging window was 1 week, and the best starting date was 2017-08-02 (i.e. averaging between this date and 2017-08-09). The 2nd of August is a Wednesday. This date matches the earliest md record.
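The search over starting days can be sketched as follows. This is a simplified stand-in for the cross-correlation study, under several assumptions of my own: the daily prices are a synthetic random walk, the window is fixed to one week, and candidates are scored by the correlation of week-to-week changes rather than of the raw levels. The helper `weekly_log_mean` is hypothetical.

```python
import numpy as np

def weekly_log_mean(daily_prices, start, n_weeks):
    """Log of the mean price over consecutive 7-day windows starting at `start`."""
    chunk = daily_prices[start:start + 7 * n_weeks]
    return np.log(chunk.reshape(n_weeks, 7).mean(axis=1))

rng = np.random.default_rng(3)

# Hypothetical daily price path (random walk in log scale).
daily_prices = np.exp(np.cumsum(rng.normal(0.0, 0.05, size=7 * 60)))

n_weeks = 50
true_start = 3  # hidden start day used to build the synthetic md
md = 1.8 + weekly_log_mean(daily_prices, true_start, n_weeks)

# Score every candidate start day by the correlation of week-to-week changes.
scores = {}
for start in range(14):
    candidate = weekly_log_mean(daily_prices, start, n_weeks)
    scores[start] = np.corrcoef(np.diff(candidate), np.diff(md))[0, 1]

best_start = max(scores, key=scores.get)
print(best_start)
```

Differencing before correlating is a design choice: two trending series correlate strongly in level even when misaligned, while their changes only line up when the windows coincide.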

To find the recording date of the asset value, we integrated our series and under-sampled it once every day:

(1 + R).cumprod()[::24]


We did not pay attention to the possible hour lag, as it is of limited interest. By searching for the best cross-correlation and testing different chunks, we found that the best starting day is 2017-07-19. Knowing that there are 216 weeks recorded, the recording ends on 2021-09-07.

In other words, md is the log of the average asset value over the 3rd week of a cluster.
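The dates above can be cross-checked with a few lines of date arithmetic. The sketch assumes that the last recorded day is the day before the end of the 216th week.

```python
from datetime import date, timedelta

start = date(2017, 7, 19)     # best starting day of the asset values
md_start = date(2017, 8, 2)   # best starting day of the md averaging window
n_weeks = 216                 # number of recorded weeks

# Last recorded day: 216 full weeks after the start, minus one day
last_day = start + timedelta(weeks=n_weeks) - timedelta(days=1)

# weekday() counts from Monday = 0, so Wednesday is 2
md_weekday = md_start.weekday()

# Index (1-based) of the week in which the md window starts
md_week_index = (md_start - start).days // 7 + 1

print(last_day, md_weekday, md_week_index)
```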

# Filling the Dataset

As we were able to find that md was the mean log price, we tried to fill the missing 24th hour of each day by adjusting to the difference between the log returns and the md values. However, this approach was unsuccessful. The gap between the returns and md is too large to be compensated using 5% of the dataset, and we obtained a strongly modified distribution. Therefore, we did not try any submission with this information.
