
Python Time Series Forecast on Bitcoin Data (Part II)


A time series is essentially tabular data with the special feature of having a time index. The common forecasting task is 'knowing the past (and sometimes the present), predict the future'. This task, taken as a principle, shows itself in several ways: in how to interpret your problem, in feature engineering, and in which forecast strategy to take.

This is the second article in our series. In the first article we discussed how to create features out of a time series using lags and trends. Today we follow the opposite direction by highlighting trends as something you want directly deducted from your model. 

The reason is, Machine Learning models work in different ways. Some are good with subtractions, others are not.

For example, for any feature you include in a Linear Regression, the model will automatically detect whether to infer it from the actual data or not. A Tree Regressor (and its variants) will not behave in the same way and will usually ignore a trend in the data.
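A minimal sketch on synthetic data (not from our dataset) makes the difference visible:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.arange(100).reshape(-1, 1)            # time steps 0..99
y = 2.0 * X.ravel() + rng.normal(size=100)   # upward trend plus noise

X_future = np.arange(100, 110).reshape(-1, 1)

linear = LinearRegression().fit(X, y)
tree = DecisionTreeRegressor().fit(X, y)

print(linear.predict(X_future)[:3])  # keeps following the trend upward
print(tree.predict(X_future)[:3])    # flat at roughly the last value seen in training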

Therefore, whenever using the latter type of model, one usually resorts to a hybrid model, meaning: we use a linear(ish) first model to detect global periodic patterns and then apply a second Machine Learning model to infer more refined behavior.

We use the Bitcoin sentiment analysis data we captured in the last article as a proof of concept.

The hybrid model part of this article is heavily based on Kaggle's Time Series Crash Course; however, we intend to automate the process and discuss the DeterministicProcess class in more depth.

Trends, as something you don't want to have

(Or that you want deducted from your model)

A streamlined way to deal with trends and seasonality is using, respectively, DeterministicProcess and CalendarFourier from statsmodels. Let us start with the former. 

DeterministicProcess aims at creating features to be used in a regression model to determine trend and periodicity. It takes your DatetimeIndex and a few other parameters and returns a DataFrame full of features for your ML model.

A typical instance of the class will read like the one below. We use the sentic_mean column as an example.

from statsmodels.tsa.deterministic import DeterministicProcess

y = dataset['sentic_mean'].copy()

dp = DeterministicProcess(
    index=y.index,
    constant=True,
    order=2
)

X = dp.in_sample()

X
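With constant=True and order=2, the returned DataFrame holds exactly the three columns described below:

print(X.columns.tolist())  # ['const', 'trend', 'trend_squared']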

We can use X and y as features and target to train a LinearRegression model. In this way, the LinearRegression will learn whatever trends of y can be inferred (in our case) solely out of:

  • the number of elapsed time periods (the trend column);
  • the same number squared (trend_squared); and
  • a bias term (const).

Check out the result:

import pandas as pd
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(X, y)

predictions = pd.DataFrame(
                    model.predict(X),
                    index=X.index,
                    columns=['Deterministic Curve']
)

Comparing predictions and actual values gives:

import matplotlib.pyplot as plt

plt.figure()
ax = plt.subplot()
y.plot(ax=ax, legend=True)
predictions.plot(ax=ax)
plt.show()

Even the quadratic term seems ignorable here. The DeterministicProcess class also helps us with future predictions, since it carries a method that provides the appropriate future form of the chosen features.

Specifically, the out_of_sample method of dp takes the number of time periods we want to predict as input and generates the needed features for you.

We use 60 days below as an example:

X_out = dp.out_of_sample(60)

predictions_out = pd.DataFrame(
                        model.predict(X_out),
                        index=X_out.index,
                        columns=['Future Predictions']
)


plt.figure()
ax = plt.subplot()
y.plot(ax=ax, legend=True)
predictions.plot(ax=ax)
predictions_out.plot(ax=ax, color="purple")
plt.show()

Let us repeat the process with sentic_count to get a feeling for a higher-order trend.

👍 As a rule of thumb, the order should be one plus the total number of (trending) hills and valleys in the graph, but not much more than that.

We choose 3 for sentic_count and compare the output with the order=2 result (we don't write the code twice, though).

y = dataset['sentic_count'].copy()

from statsmodels.tsa.deterministic import DeterministicProcess, CalendarFourier


dp = DeterministicProcess(
    index=y.index, constant=True, order=3
)
X = dp.in_sample()


model = LinearRegression().fit(X, y)

predictions = pd.DataFrame(
                    model.predict(X),
                    index=X.index,
                    columns=['Deterministic Curve']
)


X_out = dp.out_of_sample(60)

predictions_out = pd.DataFrame(
                        model.predict(X_out),
                        index=X_out.index,
                        columns=['Future Predictions']
)


plt.figure()
ax = plt.subplot()
y.plot(ax=ax, legend=True)
predictions.plot(ax=ax)
predictions_out.plot(ax=ax, color="purple")
plt.show()

Although the order-three polynomial fits the data better, use discretion in deciding whether the sentiment count will really decrease so drastically in the next 60 days. In general, trust short-term predictions rather than long ones.

DeterministicProcess accepts other parameters, making it a very interesting tool. Find a description of the almost complete list below.

dp = DeterministicProcess(
    index,       # the DatetimeIndex of your data
    period: int or None,  # in case the data shows some periodicity, include the size of the periodic cycle here: 7 would mean 7 days in our case
    constant: bool,  # includes a constant feature in the returned DataFrame, i.e., a feature with the same value for every row. It is the equivalent of a bias term in Linear Regression
    order: int,      # order of the polynomial you believe best approximates your trend: the simpler the better
    seasonal: bool,  # make it True if you believe the data has some periodicity. If you make it True and do not specify the period, dp will try to infer the period from the index
    additional_terms: tuple of statsmodels DeterministicTerms,  # we come back to this next
    drop: bool       # drops resulting features that are collinear with others. If you are going to use a linear model, make it True
)

Seasonality

As a hardened mathematician, seasonality is my favorite part because it deals with Fourier analysis (and wave functions are just… cool!):

Do you remember your first ML course, when you heard Linear Regression can fit arbitrary functions, not only lines? So, why not a wave function? We just did it for polynomials and didn't even feel like it 😉

In general, for any expression f which is a function of a feature or of your DatetimeIndex, you can create a feature column whose ith row is the value of f corresponding to the ith index.

Then linear regression finds the constant coefficient multiplying f that best fits your data. Again, this procedure works in general, not only with Datetime indexes – the trend_squared term above is an example of it.
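As a quick, self-contained illustration (the 7-day wave and all names here are hypothetical):

import numpy as np
import pandas as pd

idx = pd.date_range('2022-01-01', periods=90, freq='D')
day_number = np.arange(len(idx))

# each column is f(index) for some function f; linear regression then
# learns one coefficient per column
features = pd.DataFrame({
    'sin_weekly': np.sin(2 * np.pi * day_number / 7),
    'cos_weekly': np.cos(2 * np.pi * day_number / 7),
}, index=idx)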

For seasonality, we use a second of statsmodels' wonderful classes: CalendarFourier. It is another statsmodels DeterministicTerm class (i.e., with the in_sample and out_of_sample methods) and instantiates with two parameters, 'frequency' and 'order'.

As a 'frequency', the class expects a string such as 'D', 'W', 'M' for day, week, or month, respectively, or any of the fairly comprehensive Pandas Datetime offset aliases.

The 'order' is the Fourier expansion order, which should be understood as the number of waves you expect within your chosen frequency (count the number of ups and downs – one wave would be understood as one up and one down).
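You can inspect what CalendarFourier produces by calling in_sample on an index directly; a small sketch over a made-up daily index:

from statsmodels.tsa.deterministic import CalendarFourier
import pandas as pd

idx = pd.date_range('2022-01-01', periods=60, freq='D')
fourier = CalendarFourier(freq='W', order=2)  # two sine/cosine pairs within each week
print(fourier.in_sample(idx).head())          # columns like sin(1,freq=W-SUN), cos(1,freq=W-SUN), ...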

CalendarFourier integrates smoothly with DeterministicProcess: just include an instance of it in the list of additional_terms.

Here is the full code for sentic_mean:

from statsmodels.tsa.deterministic import DeterministicProcess, CalendarFourier

y = dataset['sentic_mean'].copy()

fourier = CalendarFourier(freq='A', order=2)

dp = DeterministicProcess(
    index=y.index, constant=True, order=2, seasonal=False, additional_terms=[fourier], drop=True
)
X = dp.in_sample()


from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(X, y)

predictions = pd.DataFrame(
                    model.predict(X),
                    index=X.index,
                    columns=['Prediction']
)


X_out = dp.out_of_sample(60)

predictions_out = pd.DataFrame(
                        model.predict(X_out),
                        index=X_out.index,
                        columns=['Prediction']
)


plt.figure()
ax = plt.subplot()
y.plot(ax=ax, legend=True)
predictions.plot(ax=ax)
predictions_out.plot(ax=ax, color="purple")
plt.show()

If we set seasonal=True inside DeterministicProcess, we get a crisper line:

Including ax.set_xlim(('2022-08-01', '2022-10-01')) before plt.show() zooms the graph in:
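For reference, the seasonal variant only changes the DeterministicProcess call; a minimal sketch, assuming the y and fourier objects from the sentic_mean example above:

dp = DeterministicProcess(
    index=y.index, constant=True, order=2,
    seasonal=True,                  # seasonal dummies; the period is inferred from the index
    additional_terms=[fourier], drop=True
)
X = dp.in_sample()
# ...refit and plot as before, then zoom in with:
# ax.set_xlim(('2022-08-01', '2022-10-01'))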

Although I suggest using the seasonal=True parameter with care, it does find interesting patterns (with a large RMSE, though).

As an example, look at this zoomed chart of the BTC percentage change:

Here period is set to 30 and seasonal=True. I also manually rescaled the predictions to be better visible in the graphic. Although the predictions are far away from the truth, thinking as a trader, isn't it impressive how many peaks and hills it gets right? At least for this zoomed month…

To keep the workflow promise, I prepared code that does everything so far in one shot:

def deseasonalize(df: pd.Series, season_freq='A', fourier_order=0,
                  constant=True, dp_order=1, dp_drop=True,
                  model=LinearRegression(),
                  fourier=None,
                  dp=None,
                  **DeterministicProcesskwargs) -> (pd.Series, plt.Axes, DeterministicProcess):
    """
    Returns a deseasonalized and detrended df, a seasonal plot, and the fitted DeterministicProcess instance.
    """

    if fourier is None:
        fourier = CalendarFourier(freq=season_freq, order=fourier_order)

    if dp is None:
        dp = DeterministicProcess(
            index=df.index,
            constant=constant,
            order=dp_order,
            additional_terms=[fourier],
            drop=dp_drop,
            **DeterministicProcesskwargs
        )

    X = dp.in_sample()
    model = model.fit(X, df)
    y_pred = pd.Series(
                        model.predict(X),
                        index=X.index,
                        name=df.name + '_pred'
    )

    ax = plt.subplot()
    df.plot(ax=ax, legend=True)
    y_pred.plot(ax=ax)

    y_pred.name = df.name
    y_deseason = df - y_pred
    y_deseason.name = df.name + '_deseasoned'
    return y_deseason, ax, dp


The sentic_mean analysis gets reduced to:

y_deseason, ax, dp = deseasonalize(y,
        season_freq='A',
        fourier_order=2,
        constant=True,
        dp_order=2,
        dp_drop=True,
        model=LinearRegression())

Cycles and Hybrid Models

Let us move on to a full Machine Learning prediction. We use XGBRegressor and compare its performance among three instances: 

  1. Predict sentic_mean directly using lags;
  2. The same prediction, adding the seasonal/trend features from a DeterministicProcess;
  3. A hybrid model, using LinearRegression to infer and remove seasons/trends, and then applying an XGBRegressor.

The first part will be the bulkiest, since the other two follow from simple modifications of the resulting code. 

Preparing the data

Before any analysis, we split the data into train and test sets. Since we are dealing with time series, this means we set the 'present date' as a point in the past and try to predict its respective 'future'. Here we pick 22 days in the past.

s = dataset['sentic_mean']

s_train = s[:'2022-09-01']

We made this first split in order not to leak data while doing any analysis.

Next, we prepare target and feature sets. Recall our SentiCrypto data was set to be available every day at 8 AM. Imagine we are doing the prediction by 9 AM.

In this case, anything up to the present data (the 'lag_0') can be used as features, and our target is s_train's first lead (which we define as a -1 lag). To choose other lags as features, we examine their statsmodels partial autocorrelation plot:

from statsmodels.graphics.tsaplots import plot_pacf

plot_pacf(s_train, lags=20)

We use the first four for sentic_mean and the first seven plus the eleventh for sentic_count (you can easily test different combinations with the code below).
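If you prefer to pick the lags programmatically, here is a small sketch of the idea (the 0.1 threshold is an assumption; the automated function later in this article uses the same criterion):

from statsmodels.tsa.stattools import pacf

s_pacf = pd.Series(pacf(s_train, nlags=20))
chosen_lags = s_pacf[s_pacf > 0.1].index.tolist()  # lags whose partial autocorrelation exceeds 0.1
print(chosen_lags)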

Now that we have finished choosing features, we return to the full series for engineering. We apply to sentic_mean and sentic_count the make_lags function we defined in the last article (which we transcribe here for convenience). 

def make_lags(df, n_lags=1, lead_time=1):
    """
    Compute lags of a pandas.Series from lead_time to lead_time + n_lags. Alternatively, a list can be passed as n_lags.
    Returns a pd.DataFrame whose ith column is either the i+lead_time lag or the ith element of n_lags.
    """
    if isinstance(n_lags, int):
        lag_list = list(range(lead_time, n_lags + lead_time))
    else:
        lag_list = n_lags
    lags = {
        f'{df.name}_lag_{i}': df.shift(i) for i in lag_list
        }

    return pd.concat(lags, axis=1)

X = make_lags(s, [0,1,2,3,4])

y = make_lags(s, [-1])

display(X)
y

Now a train-test split with sklearn is handy (notice the shuffle=False parameter – that's key for time series):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=22, shuffle=False)

X_train

(Note that the final date is set correctly, in accordance with our analysis split.)

Applying the regressor:

from xgboost import XGBRegressor
from sklearn.metrics import r2_score

xgb = XGBRegressor(n_estimators=50)

xgb.fit(X_train, y_train)

predictions_train = pd.DataFrame(
                    xgb.predict(X_train),
                    index=X_train.index,
                    columns=['Prediction']
)


predictions_test = pd.DataFrame(
                    xgb.predict(X_test),
                    index=X_test.index,
                    columns=['Prediction']
)

print(f'R2 train score: {r2_score(y_train[:-1], predictions_train[:-1])}')

plt.figure()
ax = plt.subplot()
y_train.plot(ax=ax, legend=True)
predictions_train.plot(ax=ax)
plt.show()

plt.figure()
ax = plt.subplot()
y_test.plot(ax=ax, legend=True)
predictions_test.plot(ax=ax)
plt.show()

print(f'R2 test score: {r2_score(y_test[:-1], predictions_test[:-1])}')

You can reduce overfitting by decreasing the number of estimators, but the R2 test score remains negative.
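A quick sanity check along these lines (a sketch; exact scores vary from run to run):

for n in (10, 25, 50):
    m = XGBRegressor(n_estimators=n).fit(X_train, y_train)
    print(n, r2_score(y_test[:-1], m.predict(X_test)[:-1]))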

We can replicate the process for sentic_count (or whatever you want). Below is a function to automate it.

from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from statsmodels.tsa.stattools import pacf




def apply_univariate_prediction(series, test_size, to_predict=1, nlags=20, minimal_pacf=0.1, model=XGBRegressor(n_estimators=50)):
    '''
    Starting from series, breaks it into train and test subsets;
    chooses which lags to use based on pacf > minimal_pacf;
    and applies the given sklearn-type model.
    Returns the resulting features and targets and the trained model.
    It plots the graph of the training and prediction, together with their r2_score.
    '''
    s = series.iloc[:-test_size]

    if isinstance(to_predict, int):
        to_predict = [to_predict]

    s_pacf = pd.Series(pacf(s, nlags=nlags))

    column_list = s_pacf[s_pacf > minimal_pacf].index

    X = make_lags(series, n_lags=column_list).dropna()

    y = make_lags(series, n_lags=[-x for x in to_predict]).loc[X.index]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, shuffle=False)

    model.fit(X_train, y_train)

    predictions_train = pd.DataFrame(
                        model.predict(X_train),
                        index=X_train.index,
                        columns=['Train Predictions']
    )

    predictions_test = pd.DataFrame(
                        model.predict(X_test),
                        index=X_test.index,
                        columns=['Test Predictions']
    )

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5), sharey=True)

    y_train.plot(ax=ax1, legend=True)
    predictions_train.plot(ax=ax1)
    ax1.set_title('Train Predictions')

    y_test.plot(ax=ax2, legend=True)
    predictions_test.plot(ax=ax2)
    ax2.set_title('Test Predictions')
    plt.show()

    print(f'R2 train score: {r2_score(y_train[:-1], predictions_train[:-1])}')

    print(f'R2 test score: {r2_score(y_test[:-1], predictions_test[:-1])}')

    return X, y, model

apply_univariate_prediction(dataset['sentic_count'],22)
apply_univariate_prediction(dataset['BTC-USD'], 22)

Predicting with Seasons

Since the features created by DeterministicProcess are only time-dependent, we can harmlessly add them to the feature DataFrame we automatically get from our univariate predictions.

The predictions, though, are still univariate. We use the deseasonalize function to obtain the season features. The data preparation is as follows:

s = dataset['sentic_mean']

X, y, _ = apply_univariate_prediction(s, 22);

s_deseason, _, dp = deseasonalize(s,
        season_freq='A',
        fourier_order=2,
        constant=True,
        dp_order=2,
        dp_drop=True,
        model=LinearRegression());
X_f = dp.in_sample().shift(-1)  # shift(-1) aligns the time features with the one-step-ahead target

X = pd.concat([X, X_f], axis=1, join='inner').dropna()

With a bit of copy and paste, we arrive at:
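(The copied code, sketched here for completeness: the same XGBRegressor recipe as before, now on the augmented feature set.)

y = y.loc[X.index]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=22, shuffle=False)

xgb = XGBRegressor(n_estimators=50).fit(X_train, y_train)
print(f'R2 test score: {r2_score(y_test[:-1], xgb.predict(X_test)[:-1])}')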

And we actually perform way worse! 😱

Deseasonalizing

On the other hand, the right-hand graphic illustrates its shortcoming in grasping trends. Our last shot is a hybrid model.

Here we follow three steps:

  1. We use the LinearRegression to capture the seasons and trends, rendering the series y_s. Then we acquire a deseasonalized target y_ds = y - y_s;
  2. Train an XGBRegressor on y_ds and the lagged features, resulting in deseasonalized predictions y_pred;
  3. Finally, we incorporate y_s back into y_pred to compare the final result (see the miniature sketch right after this list).
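Here is the recipe in miniature on synthetic data, before the full function (everything below is hypothetical and only illustrates the three steps):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
idx = pd.date_range('2022-01-01', periods=200, freq='D')
t = np.arange(200)
y = pd.Series(0.05 * t + np.sin(2 * np.pi * t / 7) + rng.normal(scale=0.1, size=200),
              index=idx, name='toy')

# 1. a linear model captures the trend (it would capture seasons too, with Fourier terms)
X_time = pd.DataFrame({'const': 1.0, 'trend': t}, index=idx)
y_s = pd.Series(LinearRegression().fit(X_time, y).predict(X_time), index=idx)
y_ds = y - y_s                                   # deseasonalized/detrended target

# 2. a tree-based model learns the residual from its own lags
X_lags = pd.concat({f'lag_{i}': y_ds.shift(i) for i in range(1, 4)}, axis=1).dropna()
target = y_ds.loc[X_lags.index]
y_pred_ds = XGBRegressor(n_estimators=25).fit(X_lags, target).predict(X_lags)

# 3. add the seasonal part back to get the final prediction
y_pred = pd.Series(y_pred_ds, index=X_lags.index) + y_s.loc[X_lags.index]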

Although Bitcoin-related data are hard to predict, there was a huge improvement in the r2_score (finally something positive!). We define the function used below.

get_hybrid_univariate_prediction(dataset['sentic_mean'], 22,
                                 season_freq='A',
                                 fourier_order=2,
                                 constant=True,
                                 dp_order=2,
                                 dp_drop=True,
                                 model1=LinearRegression(),
                                 fourier=None, is_seasonal=True, season_period=7,
                                 dp=None,
                                 to_predict=1,
                                 nlags=20,
                                 minimal_pacf=0.1,
                                 model2=XGBRegressor(n_estimators=50)
                                 )

Instead of going through every detail, we will also automate this code. In order to get the code running smoothly, we revisit the deseasonalize and apply_univariate_prediction functions to remove their plotting parts.

The final function only plots graphs and returns nothing. It intends to give you a baseline for a hybrid model score. Change the function at will to make it return whatever you need.

def get_season(series: pd.Series,
               test_size,
               season_freq='A',
               fourier_order=0,
               constant=True,
               dp_order=1,
               dp_drop=True,
               model1=LinearRegression(),
               fourier=None,
               is_seasonal=False,
               season_period=None,
               dp=None):
    """
    Decomposes series into a deseasonalized and a seasonal part. The parameters are relative to the Fourier and DeterministicProcess terms used.
    Returns y_ds and y_s.
    """

    se = series.iloc[:-test_size]

    if fourier is None:
        fourier = CalendarFourier(freq=season_freq, order=fourier_order)

    if dp is None:
        dp = DeterministicProcess(
            index=se.index,
            constant=constant,
            order=dp_order,
            additional_terms=[fourier],
            drop=dp_drop,
            seasonal=is_seasonal,
            period=season_period
        )

    X_in = dp.in_sample()
    X_out = dp.out_of_sample(test_size)

    model1 = model1.fit(X_in, se)

    X = pd.concat([X_in, X_out], axis=0)

    y_s = pd.Series(
                        model1.predict(X),
                        index=X.index,
                        name=series.name + '_pred'
    )

    y_s.name = series.name
    y_ds = series - y_s
    y_ds.name = series.name + '_deseasoned'
    return y_ds, y_s




def prepare_data(series,
                 test_size,
                 to_predict=1,
                 nlags=20,
                 minimal_pacf=0.1):
    '''
    Creates a feature dataframe by making lags, and a target series by a negative to_predict-shift.
    Returns X, y.
    '''
    s = series.iloc[:-test_size]

    if isinstance(to_predict, int):
        to_predict = [to_predict]

    s_pacf = pd.Series(pacf(s, nlags=nlags))

    column_list = s_pacf[s_pacf > minimal_pacf].index

    X = make_lags(series, n_lags=column_list).dropna()

    y = make_lags(series, n_lags=[-x for x in to_predict]).loc[X.index].squeeze()

    return X, y
    
    
def get_hybrid_univariate_prediction(series: pd.Series,
                                     test_size,
                                     season_freq='A',
                                     fourier_order=0,
                                     constant=True,
                                     dp_order=1,
                                     dp_drop=True,
                                     model1=LinearRegression(),
                                     fourier=None,
                                     is_seasonal=False,
                                     season_period=None,
                                     dp=None,
                                     to_predict=1,
                                     nlags=20,
                                     minimal_pacf=0.1,
                                     model2=XGBRegressor(n_estimators=50)
                                     ):
    """
    Applies the hybrid model strategy by deseasonalizing/detrending a time series with model1 and investigating the resulting series with model2. It plots the respective graphs and computes r2_scores.
    """

    y_ds, y_s = get_season(series, test_size, season_freq=season_freq, fourier_order=fourier_order, constant=constant, dp_order=dp_order, dp_drop=dp_drop, model1=model1, fourier=fourier, dp=dp, is_seasonal=is_seasonal, season_period=season_period)

    X, y_ds = prepare_data(y_ds, test_size=test_size)

    X_train, X_test, y_train, y_test = train_test_split(X, y_ds, test_size=test_size, shuffle=False)

    y = y_s.squeeze() + y_ds.squeeze()

    model2 = model2.fit(X_train, y_train)

    predictions_train = pd.Series(
                        model2.predict(X_train),
                        index=X_train.index,
                        name="Prediction"
    ) + y_s[X_train.index]

    predictions_test = pd.Series(
                        model2.predict(X_test),
                        index=X_test.index,
                        name="Prediction"
    ) + y_s[X_test.index]

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5), sharey=True)

    y_train_ps = y.loc[y_train.index]
    y_test_ps = y.loc[y_test.index]

    y_train_ps.plot(ax=ax1, legend=True)
    predictions_train.plot(ax=ax1)
    ax1.set_title('Train Predictions')

    y_test_ps.plot(ax=ax2, legend=True)
    predictions_test.plot(ax=ax2)
    ax2.set_title('Test Predictions')
    plt.show()

    print(f'R2 train score: {r2_score(y_train_ps[:-to_predict], predictions_train[:-to_predict])}')

    print(f'R2 test score: {r2_score(y_test_ps[:-to_predict], predictions_test[:-to_predict])}')

A note of caution: if you don't expect your data to follow time patterns, do pay attention to cycles! The hybrid model succeeds well for many tasks, but it actually decreases the R2 score of our earlier Bitcoin prediction:

get_hybrid_univariate_prediction(dataset['BTC-USD'], 22,
                                 season_freq='A',
                                 fourier_order=4,
                                 constant=True,
                                 dp_order=5,
                                 dp_drop=True,
                                 model1=LinearRegression(),
                                 fourier=None, is_seasonal=True, season_period=30,
                                 dp=None,
                                 to_predict=1,
                                 nlags=20,
                                 minimal_pacf=0.05,
                                 model2=XGBRegressor(n_estimators=20)
                                 )

The previous score was around 0.31.

Conclusion

This article aims at presenting functions for your time series workflow, namely for lags and deseasonalization. Use them with care, though: apply them to obtain baseline scores before delving into more refined models.

In future articles we will bring forth multi-step predictions (predicting more than one day ahead) and compare the performance of different models, both univariate and multivariate.

