macrosynergy.management.utils.df_utils#

Utility functions for working with DataFrames.

is_categorical_qdf(df)[source]#

Check if a column in a DataFrame is categorical.

Parameters:
  • df (pd.DataFrame) – The DataFrame to be checked.

  • column (str) – The column to be checked.

Returns:

True if the column is categorical, False otherwise.

Return type:

bool

standardise_dataframe(df)[source]#

Applies the standard JPMaQS Quantamental DataFrame format to a DataFrame.

Parameters:

df (pd.DataFrame) – The DataFrame to be standardized.

Raises:
  • TypeError – If the input is not a pandas DataFrame.

  • ValueError – If the input DataFrame is not in the correct format.

Returns:

The standardized DataFrame.

Return type:

pd.DataFrame

drop_nan_series(df, column='value', raise_warning=False)[source]#

Drops any series that are entirely NaNs. Raises a user warning if any series are dropped and the raise warning flag is set to true.

Parameters:
  • df (pd.DataFrame) – The dataframe to be cleaned.

  • column (str) – The column to be used as the value column, defaults to “value”.

  • raise_warning (bool) – Whether to raise a warning if any series are dropped.

Raises:
  • TypeError – If the input is not a pandas DataFrame.

  • ValueError – If the input DataFrame is not in the correct format.

Returns:

The cleaned DataFrame.

Return type:

pd.DataFrame | QuantamentalDataFrame

qdf_to_ticker_df(df, value_column='value')[source]#

Converts a standardized JPMaQS DataFrame to a wide format DataFrame with each column representing a ticker.

Parameters:
  • df (pd.DataFrame) – A standardised quantamental dataframe.

  • value_column (str) – The column to be used as the value column, defaults to “value”. If the specified column is not present in the DataFrame, a column named “value” will be used. If there is no column named “value”, the first column in the DataFrame will be used instead.

Returns:

The converted DataFrame.

Return type:

pd.DataFrame

ticker_df_to_qdf(df, metric='value')[source]#

Converts a wide format DataFrame (with each column representing a ticker) to a standardized JPMaQS DataFrame.

Parameters:

df (pd.DataFrame) – A wide format DataFrame.

Returns:

The converted DataFrame.

Return type:

pd.DataFrame

concat_single_metric_qdfs(df_list, errors='ignore')[source]#

Combines a list of Quantamental DataFrames into a single DataFrame.

Parameters:
  • df_list (List[QuantamentalDataFrame]) – A list of Quantamental DataFrames.

  • errors (str) – The error handling method to use. If ‘raise’, then invalid items in the list will raise an error. If ‘ignore’, then invalid items will be ignored. Default is ‘ignore’.

Returns:

The combined DataFrame.

Return type:

QuantamentalDataFrame

apply_slip(df, slip, cids=None, xcats=None, tickers=None, metrics=['value'], extend_dates=False, raise_error=True)[source]#

Applies a “slip” to the DataFrame for the given cross-sections and categories, on the given metrics. A slip shifts the specified category n-days fowards in time, where n is the slip value. This is identical to a lag, but is measured in days, and must always be applied before any resampling.

Parameters:
  • target_df (QuantamentalDataFrame) – DataFrame to which the slip is applied.

  • slip (int) – Slip to be applied.

  • cids (List[str]) – List of cross-sections.

  • xcats (List[str]) – List of target categories.

  • metrics (List[str]) – List of metrics to which the slip is applied.

  • extend_dates (bool) – If True, includes the dates added by the slip in the DataFrame. If False, only the input dates are included. Default is False.

  • raise_error (bool) – If True, raises an error if the slip cannot be applied to all xcats in the target DataFrame. If False, raises a warning instead.

Raises:
  • TypeError – If the provided parameters are not of the expected type.

  • ValueError – If the provided parameters are semantically incorrect.

Returns:

DataFrame with the slip applied.

Return type:

QuantamentalDataFrame

downsample_df_on_real_date(df, groupby_columns=[], freq='M', agg='mean')[source]#

Downsamples JPMaQS DataFrame.

Parameters:
  • df (pd.Dataframe) – standardized JPMaQS DataFrame with the necessary columns: ‘cid’, ‘xcat’, ‘real_date’ and at least one column with values of interest.

  • groupby_columns (List) – a list of columns used to group the DataFrame.

  • freq (str) – frequency option. Per default the correlations are calculated based on the native frequency of the datetimes in ‘real_date’, which is business daily. Downsampling options include weekly (‘W’), monthly (‘M’), or quarterly (‘Q’) mean.

  • agg (str) – aggregation method. Must be one of “mean” (default), “median”, “min”, “max”, “first” or “last”.

Returns:

the downsampled DataFrame.

Return type:

pd.DataFrame

update_df(df, df_add, xcat_replace=False)[source]#

Append a standard DataFrame to a standard base DataFrame with ticker replacement on the intersection.

Parameters:
  • df (pd.DataFrame) – standardised base JPMaQS DataFrame with the following necessary columns: ‘cid’, ‘xcat’, ‘real_date’ and ‘value’.

  • df_add (pd.DataFrame) – another standardised JPMaQS DataFrame, with the latest values, to be added with the necessary columns: ‘cid’, ‘xcat’, ‘real_date’, and ‘value’. Columns that are present in the base DataFrame but not in the appended DataFrame will be populated with NaN values.

  • xcat_replace (bool) – all series belonging to the categories in the added DataFrame will be replaced, rather than just the added tickers.

Returns:

standardised DataFrame with the latest values of the modified or newly defined tickers added.

Return type:

pd.DataFrame

..note::

Tickers are combinations of cross-sections and categories.

update_tickers(df, df_add)[source]#

Method used to update aggregate DataFrame on a ticker level.

Parameters:
  • df (pd.DataFrame) – aggregate DataFrame used to store all tickers.

  • df_add (pd.DataFrame) – DataFrame with the latest values.

update_categories(df, df_add)[source]#

Method used to update the DataFrame on the category level.

Parameters:
  • df (pd.DataFrame) – base DataFrame.

  • df_add (pd.DataFrame) – appended DataFrame.

reduce_df(df, xcats=None, cids=None, start=None, end=None, blacklist=None, out_all=False, intersect=False)[source]#

Filter DataFrame by xcats and cids and notify about missing xcats and cids.

Parameters:
  • df (pd.Dataframe) – standardized JPMaQS DataFrame with the necessary columns: ‘cid’, ‘xcat’, ‘real_date’ and ‘value’.

  • xcats (Union[str, List[str]]) – extended categories to be filtered on. Default is all in the DataFrame.

  • cids (List[str]) – cross sections to be checked on. Default is all in the dataframe.

  • start (str) – string representing the earliest date. Default is None.

  • end (str) – string representing the latest date. Default is None.

  • blacklist (dict) – cross-sections with date ranges that should be excluded from the data frame. If one cross-section has several blacklist periods append numbers to the cross-section code.

  • out_all (bool) – if True the function returns reduced dataframe and selected/ available xcats and cids. Default is False, i.e. only the DataFrame is returned

  • intersect (bool) – if True only retains cids that are available for all xcats. Default is False.

Returns:

reduced DataFrame that also removes duplicates or (for out_all True) DataFrame and available and selected xcats and cids.

Return type:

pd.Dataframe

reduce_df_by_ticker(df, ticks=None, start=None, end=None, blacklist=None)[source]#

Filter dataframe by xcats and cids and notify about missing xcats and cids

Parameters:
  • df (pd.Dataframe) – standardized dataframe with the following columns: ‘cid’, ‘xcat’, ‘real_date’.

  • ticks (List[str]) – tickers (cross sections + base categories)

  • start (str) – string in ISO 8601 representing earliest date. Default is None.

  • end (str) – string ISO 8601 representing the latest date. Default is None.

  • blacklist (dict) – cross sections with date ranges that should be excluded from the dataframe. If one cross section has several blacklist periods append numbers to the cross section code.

Returns:

reduced dataframe that also removes duplicates

Return type:

pd.Dataframe

categories_df_aggregation_helper(dfx, xcat_agg)[source]#

Helper method to down-sample each category in the DataFrame by aggregating over the intermediary dates according to a prescribed method.

Parameters:
  • dfx (List[str]) – standardised DataFrame defined exclusively on a single category.

  • xcat_agg (List[str]) – associated aggregation method for the respective category.

categories_df(df, xcats, cids=None, val='value', start=None, end=None, blacklist=None, years=None, freq='M', lag=0, fwin=1, xcat_aggs=['mean', 'mean'])[source]#

In principle, create custom two-categories DataFrame with appropriate frequency and, if applicable, lags.

Parameters:
  • df (pd.Dataframe) – standardized JPMaQS DataFrame with the following necessary columns: ‘cid’, ‘xcat’, ‘real_date’ and at least one column with values of interest.

  • xcats (List[str]) – extended categories involved in the custom DataFrame. The last category in the list represents the dependent variable, and the (n - 1) preceding categories will be the explanatory variables(s).

  • cids (List[str]) – cross-sections to be included. Default is all in the DataFrame.

  • val (str) – name of column that contains the values of interest. Default is ‘value’.

  • start (str) – earliest date in ISO 8601 format. Default is None, i.e. earliest date in DataFrame is used.

  • end (str) – latest date in ISO 8601 format. Default is None, i.e. latest date in DataFrame is used.

  • blacklist (dict) – cross-sections with date ranges that should be excluded from the DataFrame. If one cross section has several blacklist periods append numbers to the cross section code.

  • years (int) – number of years over which data are aggregated. Supersedes the “freq” parameter and does not allow lags, Default is None, i.e. no multi-year aggregation.

  • freq (str) – letter denoting frequency at which the series are to be sampled. This must be one of ‘D’, ‘W’, ‘M’, ‘Q’, ‘A’. Default is ‘M’. Will always be the last business day of the respective frequency.

  • lag (int) – lag (delay of arrival) of explanatory category(s) in periods as set by freq. Default is 0.

  • fwin (int) – forward moving average window of first category. Default is 1, i.e no average. Note: This parameter is used mainly for target returns as dependent variable.

  • xcat_aggs (List[str]) – exactly two aggregation methods. Default is ‘mean’ for both. The same aggregation method, the first method in the parameter, will be used for all explanatory variables.

Returns:

custom DataFrame with category columns. N.B.: The number of explanatory categories that can be included is not restricted and will be appended column-wise to the returned DataFrame. The order of the DataFrame’s columns will reflect the order of the categories list.

Return type:

pd.DataFrame

estimate_release_frequency(timeseries=None, df_wide=None, atol=None, rtol=None)[source]#

Estimates the release frequency of a timeseries, by inferring the frequency of the timeseries index. Before calling pd.infer_freq, the function drops NaNs, and rounds values as specified by the tolerance parameters to allow dropping of “duplicate” values.

Parameters:
  • timeseries (pd.Series, optional) – The timeseries to be used to estimate the release frequency. Only one of timeseries or df_wide must be passed.

  • df_wide (pd.DataFrame, optional) – The wide DataFrame to be used to estimate the release frequency. This mode processes each column of the DataFrame as a timeseries. Only one of timeseries or df_wide must be passed.

  • atol (float, optional) – The absolute tolerance for the difference between two values. If None, no rounding is applied.

  • rtol (float, optional) – The relative tolerance for the difference between two values. If None, no rounding is applied.

Returns:

The estimated release frequency. If df_wide is passed, a dictionary with the column names as keys and the estimated frequencies as values is returned.

Return type:

str or dict

years_btwn_dates(start_date, end_date)[source]#

Returns the number of years between two dates.

Return type:

int

quarters_btwn_dates(start_date, end_date)[source]#

Returns the number of quarters between two dates.

Return type:

int

months_btwn_dates(start_date, end_date)[source]#

Returns the number of months between two dates.

Return type:

int

weeks_btwn_dates(start_date, end_date)[source]#

Returns the number of business weeks between two dates.

Return type:

int

get_eops(dates=None, start_date=None, end_date=None, freq='M')[source]#

Returns a series of end-of-period dates for a given frequency. Dates can be passed as a series, index, a generic iterable or as a start and end date.

Parameters:
  • freq (str) – The frequency string. Must be one of “D”, “W”, “M”, “Q”, “A”.

  • dates (pd.DatetimeIndex | pd.Series | Iterable[pd.Timestamp]) – The dates to be used to generate the end-of-period dates. Can be passed as a series, index, a generic iterable or as a start and end date.

  • start_date (str | pd.Timestamp) – The start date. Must be passed if dates is not passed.

Return type:

Series

merge_categories(df, xcats=None, new_xcat=None, cids=None, hierarchy=None, backfill=False, start=None)[source]#

Merges categories into a new category, given a list of categories to be merged. The merging is done in a preferred order, i.e. the first category in the list will be the preferred value for each real_date and if the first category does not have a value for a given real_date, the next category in the list will be used, etc…

Parameters:
  • df (pd.DataFrame) – standardized JPMaQS DataFrame with the columns ‘cid’, ‘xcat’, ‘real_date’ and ‘value’.

  • xcats (List[str]) – extended categories to be merged, in preferred order. Alias for hierarchy; provide one or the other.

  • new_xcat (str) – name of the new category to be created.

  • cids (List[str], optional) – cross sections to be included. Default is all in the DataFrame.

  • hierarchy (List[str], optional) – alias for xcats. Provided for parity with the previous extend_history API.

  • backfill (bool, optional) – If True, the new xcat is backfilled with its first valid value to the date specified by start. Default is False.

  • start (str, optional) – ISO date. If backfill is True, the first valid value is propagated back to this date. If backfill is False and start is provided, the output is trimmed to dates >= start.

Returns:

DataFrame with the merged category.

Return type:

pd.DataFrame

get_sops(dates=None, start_date=None, end_date=None, freq='M')[source]#

Returns a series of start-of-period dates for a given frequency. Dates can be passed as a series, index, a generic iterable or as a start and end date.

Parameters:
  • freq (str) – The frequency string. Must be one of “D”, “W”, “M”, “Q”, “A”.

  • dates (pd.DatetimeIndex | pd.Series | Iterable[pd.Timestamp]) – The dates to be used to generate the start-of-period dates. Can be passed as a series, index, a generic iterable or as a start and end date.

  • start_date (str | pd.Timestamp) – The start date. Must be passed if dates is not passed.

Return type:

Series

concat_categorical(df1, df2)[source]#

Concatenate two DataFrames with categorical columns. The dtypes of the of the second DataFrame will be cast to the dtypes of the first. The columns of the DataFrames must be identical.

Parameters:
  • df1 (pd.DataFrame) – The first DataFrame.

  • df2 (pd.DataFrame) – The second DataFrame.

Returns:

The concatenated DataFrame with the same columns as the input.

Return type:

pd.DataFrame

forward_fill_wide_df(df, blacklist=None, n=1)[source]#

Forward fills NaN values in a wide DataFrame using the last valid value in each column. It will not forward fill gaps in the data, only the next n periods after the last valid value.

Parameters:
  • df (pd.DataFrame) – The DataFrame to be forward filled in wide format, where each column represents a cross-section and the index are dates.

  • blacklist (dict, optional) – A dictionary where keys are column names and values are lists of two elements, representing the start and end dates of periods to be excluded from filling.

  • n (int, optional) – The number of periods to fill forward. Default is 1, meaning only the next period

rotate_cid_xcat(df, direction, xcat_template, fixed_value)[source]#

Rotate a panel DataFrame between cid-per-row and xcat-per-row representations.

Two directions are supported:

  • “to_xcats”: for each row, replaces “cid” with a per-stock xcat derived from xcat_template (substituting the cid value into the “{cid}” placeholder) and sets “cid” to fixed_value.

  • “to_cids”: the inverse — extracts the stock identifier from “xcat” using the template as a regex, writes it into “cid”, and replaces “xcat” with fixed_value.

Parameters:
  • df (pd.DataFrame or QuantamentalDataFrame) – Panel DataFrame with at least “cid” and “xcat” columns.

  • direction (str) – Transformation direction: “to_xcats” or “to_cids”.

  • xcat_template (str) – Template string containing the placeholder “{cid}” that maps between a stock identifier and an xcat name, e.g. “EQXR_{cid}_NSA”.

  • fixed_value (str) – Value assigned to the column being collapsed. When direction is “to_xcats”, all rows will have cid set to fixed_value; when direction is “to_cids”, all rows will have xcat set to fixed_value.

Returns:

A copy of df with “cid” and “xcat” updated according to direction.

Return type:

pd.DataFrame

Raises:

ValueError – If direction is not “to_xcats” or “to_cids”.