For a financial application using bootstrapped resampling, I would like to speed up (vectorize) this function without the need of for-loops.
The set up. I have a data with multiple gvkeys (ids), returns and dates (large T, large N). Additionally SB_dates with bootstrapped random dates (in blocks) and newBL a boolean array matrix which is true each new Block. SB_dates and newBL have the same size (in the example 20 x 6).
For each, each new block select 5 (n) random stocks from data and calculate the average return on that date. If it is not a new block, keep the same random stocks that were selected previously, and calculate the average return on that date.
So far, this is what I got. But I am having struggle in speeding things up mainly due to the loops. Any ideas?
@jit
def calculate_average_returns(data, SB_dates, newBL, n=5):
n_rows, n_columns = np.shape(SB_dates)
df_sb = pd.DataFrame()
for col in range(n_columns):
date_column = SB_dates[:, col]
newBL_column = newBL[:, col]
average_returns_col = []
for i in range(n_rows):
if newBL_column[i]:
selected_stocks = data[data['yyyymm1'] == date_column.iloc[i]]
random_stock_ids = np.random.choice(selected_stocks['gvkey'], n, replace=False)
selected_stocks_df = data[(data['yyyymm1'] == date_column.iloc[i]) & (data['gvkey'].isin(random_stock_ids))]
ret_mean = selected_stocks_df['ret_w'].mean()
average_returns_col.append(ret_mean)
else:
selected_stocks_df = data[(data['yyyymm1'] == date_column.iloc[i]) & (data['gvkey'].isin(random_stock_ids))]
ret_mean = selected_stocks_df['ret_w'].mean()
average_returns_col.append(ret_mean)
df_sb[col] = average_returns_col
return df_sb
The data. Generates data
import pandas as pd
import numpy as np
import random
from datetime import datetime, timedelta
n_rows = 1000
start_date = datetime(2000, 1, 1)
end_date = datetime(2023, 12, 31)
# Generate random dates within a range
dates = [start_date + timedelta(days=random.randint(0, (end_date - start_date).days)) for _ in range(n_rows)]
# Generate random returns between -1 and 1
returns = np.random.uniform(-0.01, 0.01, n_rows)
# Generate random gvkey values
gvkey = [random.randint(1000, 9999) for _ in range(n_rows)]
# Create a DataFrame
data = pd.DataFrame({'yyyymm1': dates, 'ret_w': returns, 'gvkey': gvkey})
and for the SB_dates
0 1 2 3 4 5 \
0 2015-01-31 2015-02-28 2014-09-30 2017-10-31 2010-12-31 2018-05-31
1 2015-02-28 2015-03-31 2010-06-30 2017-11-30 2011-01-31 2018-06-30
2 2015-03-31 2015-04-30 2010-07-31 2017-12-31 2011-02-28 2018-07-31
3 2015-04-30 2015-05-31 2010-08-31 2018-01-31 2011-03-31 2018-08-31
4 2015-05-31 2015-06-30 2010-09-30 2018-02-28 2011-04-30 2018-09-30
5 2015-06-30 2015-07-31 2010-10-31 2018-03-31 2011-05-31 2018-10-31
6 2010-05-31 2015-08-31 2013-10-31 2015-07-31 2011-06-30 2018-11-30
7 2014-08-31 2015-09-30 2013-11-30 2015-08-31 2011-07-31 2018-12-31
8 2010-11-30 2015-10-31 2013-12-31 2015-09-30 2011-08-31 2019-01-31
9 2010-12-31 2015-11-30 2014-01-31 2015-10-31 2011-09-30 2019-02-28
10 2011-01-31 2015-12-31 2014-02-28 2015-11-30 2011-10-31 2019-03-31
11 2010-05-31 2016-01-31 2011-11-30 2015-12-31 2011-11-30 2019-04-30
12 2010-06-30 2016-02-29 2011-12-31 2016-01-31 2011-12-31 2019-05-31
13 2010-07-31 2016-03-31 2007-07-31 2016-02-29 2012-01-31 2019-06-30
14 2008-11-30 2016-04-30 2007-08-31 2016-03-31 2012-07-31 2019-07-31
15 2008-12-31 2016-05-31 2007-09-30 2016-04-30 2012-08-31 2017-04-30
16 2009-01-31 2016-06-30 2007-10-31 2016-05-31 2018-11-30 2017-05-31
17 2017-07-31 2016-07-31 2007-11-30 2016-06-30 2018-12-31 2017-06-30
18 2017-08-31 2016-08-31 2007-12-31 2016-07-31 2019-01-31 2017-07-31
19 2017-09-30 2016-09-30 2008-01-31 2016-08-31 2019-02-28 2017-08-31
and for the newBL
0 1 2 3 4 5
0 True True True True True True
1 False False True False False False
2 False False False False False False
3 False False False False False False
4 False False False False False False
5 False False False False False False
6 True False True True False False
7 True False False False False False
8 True False False False False False
9 False False False False False False
10 False False False False False False
11 True False True False False False
12 False False False False False False
13 False False True False False False
14 True False False False True False
15 False False False False False True
16 False False False False True False
17 True False False False False False
18 False False False False False False
19 False False False False False False
For example in row 6 of the first column, there is a new Block which is marked as true in newBL. The first rows start always with a new BL.
Sorry for the long data example. But the last two are generated from another function.
Aucun commentaire:
Enregistrer un commentaire