mercredi 27 septembre 2023

Speeding up by removing col and row for-loops (Randomization with condition)

For a financial application using bootstrapped resampling, I would like to speed up (vectorize) this function without the need of for-loops.

The set up. I have a data with multiple gvkeys (ids), returns and dates (large T, large N). Additionally SB_dates with bootstrapped random dates (in blocks) and newBL a boolean array matrix which is true each new Block. SB_dates and newBL have the same size (in the example 20 x 6).

For each, each new block select 5 (n) random stocks from data and calculate the average return on that date. If it is not a new block, keep the same random stocks that were selected previously, and calculate the average return on that date.

So far, this is what I got. But I am having struggle in speeding things up mainly due to the loops. Any ideas?

@jit
def calculate_average_returns(data, SB_dates, newBL, n=5):
    n_rows, n_columns = np.shape(SB_dates)
    df_sb = pd.DataFrame()

    for col in range(n_columns):
        date_column = SB_dates[:, col]
        newBL_column = newBL[:, col]
        average_returns_col = []

        for i in range(n_rows):
            if newBL_column[i]:
                selected_stocks = data[data['yyyymm1'] == date_column.iloc[i]]
                random_stock_ids = np.random.choice(selected_stocks['gvkey'], n, replace=False)
                selected_stocks_df = data[(data['yyyymm1'] == date_column.iloc[i]) & (data['gvkey'].isin(random_stock_ids))]
                ret_mean = selected_stocks_df['ret_w'].mean()
                average_returns_col.append(ret_mean)
            else:
                selected_stocks_df = data[(data['yyyymm1'] == date_column.iloc[i]) & (data['gvkey'].isin(random_stock_ids))]
                ret_mean = selected_stocks_df['ret_w'].mean()
                average_returns_col.append(ret_mean)

        df_sb[col] = average_returns_col
    return df_sb

The data. Generates data

import pandas as pd
import numpy as np
import random
from datetime import datetime, timedelta

n_rows = 1000
start_date = datetime(2000, 1, 1)
end_date = datetime(2023, 12, 31)

# Generate random dates within a range
dates = [start_date + timedelta(days=random.randint(0, (end_date - start_date).days)) for _ in range(n_rows)]

# Generate random returns between -1 and 1
returns = np.random.uniform(-0.01, 0.01, n_rows)

# Generate random gvkey values
gvkey = [random.randint(1000, 9999) for _ in range(n_rows)]

# Create a DataFrame
data = pd.DataFrame({'yyyymm1': dates, 'ret_w': returns, 'gvkey': gvkey})

and for the SB_dates

           0          1          2          3          4          5   \
0  2015-01-31 2015-02-28 2014-09-30 2017-10-31 2010-12-31 2018-05-31   
1  2015-02-28 2015-03-31 2010-06-30 2017-11-30 2011-01-31 2018-06-30   
2  2015-03-31 2015-04-30 2010-07-31 2017-12-31 2011-02-28 2018-07-31   
3  2015-04-30 2015-05-31 2010-08-31 2018-01-31 2011-03-31 2018-08-31   
4  2015-05-31 2015-06-30 2010-09-30 2018-02-28 2011-04-30 2018-09-30   
5  2015-06-30 2015-07-31 2010-10-31 2018-03-31 2011-05-31 2018-10-31   
6  2010-05-31 2015-08-31 2013-10-31 2015-07-31 2011-06-30 2018-11-30   
7  2014-08-31 2015-09-30 2013-11-30 2015-08-31 2011-07-31 2018-12-31   
8  2010-11-30 2015-10-31 2013-12-31 2015-09-30 2011-08-31 2019-01-31   
9  2010-12-31 2015-11-30 2014-01-31 2015-10-31 2011-09-30 2019-02-28   
10 2011-01-31 2015-12-31 2014-02-28 2015-11-30 2011-10-31 2019-03-31   
11 2010-05-31 2016-01-31 2011-11-30 2015-12-31 2011-11-30 2019-04-30   
12 2010-06-30 2016-02-29 2011-12-31 2016-01-31 2011-12-31 2019-05-31   
13 2010-07-31 2016-03-31 2007-07-31 2016-02-29 2012-01-31 2019-06-30   
14 2008-11-30 2016-04-30 2007-08-31 2016-03-31 2012-07-31 2019-07-31   
15 2008-12-31 2016-05-31 2007-09-30 2016-04-30 2012-08-31 2017-04-30   
16 2009-01-31 2016-06-30 2007-10-31 2016-05-31 2018-11-30 2017-05-31   
17 2017-07-31 2016-07-31 2007-11-30 2016-06-30 2018-12-31 2017-06-30   
18 2017-08-31 2016-08-31 2007-12-31 2016-07-31 2019-01-31 2017-07-31   
19 2017-09-30 2016-09-30 2008-01-31 2016-08-31 2019-02-28 2017-08-31   

and for the newBL

        0      1      2      3      4      5
0    True   True   True   True   True   True
1   False  False   True  False  False  False
2   False  False  False  False  False  False
3   False  False  False  False  False  False
4   False  False  False  False  False  False
5   False  False  False  False  False  False
6    True  False   True   True  False  False
7    True  False  False  False  False  False
8    True  False  False  False  False  False
9   False  False  False  False  False  False
10  False  False  False  False  False  False
11   True  False   True  False  False  False
12  False  False  False  False  False  False
13  False  False   True  False  False  False
14   True  False  False  False   True  False
15  False  False  False  False  False   True
16  False  False  False  False   True  False
17   True  False  False  False  False  False
18  False  False  False  False  False  False
19  False  False  False  False  False  False 

For example in row 6 of the first column, there is a new Block which is marked as true in newBL. The first rows start always with a new BL.

Sorry for the long data example. But the last two are generated from another function.




Aucun commentaire:

Enregistrer un commentaire