Outliers

class rackio_AI.Outliers()

In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set.[3] An outlier can cause serious problems in statistical analyses.

Attributes

outliers: (dict) Its keys are the dataframe columns that contain outliers. Keys:
- column_name: (dict) Contains the following keys:
  - locs: (list) locations where the outliers were added
  - values: (list) outlier values added
detected: (dict)
- column_name: (dict) Contains the following keys:
  - locs: (list) locations where the outliers were detected
  - values: (list) outlier values detected
  - performance: (float)
optimizer_result: (pandas.DataFrame)

add(self, df, percent=5, method='tf', cols=None)

Creates outlier values in a dataframe using the given method.

Parameters

:param df: (pandas.DataFrame) Data to which outliers are added
:param percent: (float) percentage of outliers to add
:param method: (str) name of the function used to calculate the outliers
- "tf": Tukey fence method
:param cols: (list) names of the columns to add outliers to, default None
- If None, outliers are added to all columns

returns

df: (pandas.DataFrame) Data with outliers added

Snippet code

>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> import pandas as pd
>>> from rackio_AI import Outliers
>>> df = pd.DataFrame(np.random.randn(100,2), columns=["a", "b"])
>>> out = Outliers()
>>> df = out.add(df)
>>> ax = plt.plot(df["a"], '-r', df["b"], '-b', out.outliers["a"]["locs"], out.outliers["a"]["values"], 'rD', out.outliers["b"]["locs"], out.outliers["b"]["values"], 'bD')
>>> ax = plt.legend(["a", "b", "a outliers", "b outliers"])
>>> plt.show()

Add Outlier

tukey_fence(self, subset, k_min=2, k_max=5, q_min=0.25, q_max=0.75)

A nonparametric outlier detection method. It creates 'fence' boundaries at a distance of k * IQR (interquartile range) beyond the 1st and 3rd quartiles. Any data beyond these fences are considered outliers.

Outliers are values below q_min - k(q_max - q_min) or above q_max + k(q_max - q_min), where q_min and q_max here denote the values of the subset at the lower and upper quantiles.

Parameters

:param subset: (np.ndarray) values used to calculate the outlier from the interquartile range
:param k_min: (float) lower boundary for the Tukey fence
:param k_max: (float) upper boundary for the Tukey fence
:param q_min: (float) lower quantile, between 0 and 1
:param q_max: (float) upper quantile, between 0 and 1

returns

value: (float) outlier value
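
As an illustration of the fence formula above, here is a minimal sketch that computes the lower and upper fences with NumPy. The helper name `tukey_fences` and the single multiplier `k` are assumptions made for this example; how `tukey_fence` chooses k between `k_min` and `k_max`, and how it generates the returned outlier value, is not specified here.

```python
import numpy as np

def tukey_fences(values, k=1.5, q_min=0.25, q_max=0.75):
    """Return (lower_fence, upper_fence) for a 1-D array (hypothetical helper)."""
    # Values of the data at the lower and upper quantiles
    lower_q, upper_q = np.quantile(values, [q_min, q_max])
    iqr = upper_q - lower_q
    # Fences sit k * IQR beyond the two quantile values
    return lower_q - k * iqr, upper_q + k * iqr

values = np.arange(1.0, 10.0)  # 1.0, 2.0, ..., 9.0
low, high = tukey_fences(values, k=2.0)
beyond_fences = values[(values < low) | (values > high)]  # empty for this data
```

Any value below `low` or above `high` would count as an outlier under this definition.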

z_score(self, df, threshold=3)

Rejects outlier values based on the modified z-score

Parameters

:param df: (pandas.DataFrame)
:param threshold: (float)

returns

y: (list)
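
For reference, the sketch below shows the commonly used definition of the modified z-score, which replaces the mean and standard deviation with the median and the median absolute deviation (MAD). It assumes the standard 0.6745 scaling constant and operates on a 1-D array; the library's own `z_score` may differ in its details, and the helper name `modified_z_scores` is hypothetical.

```python
import numpy as np

def modified_z_scores(values):
    """Modified z-score of each value, based on the median and the MAD (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        # Simplification for this sketch: avoid dividing by zero by scoring everything as 0
        return np.zeros_like(values)
    return 0.6745 * (values - median) / mad

data = np.array([10.0, 11.0, 9.0, 10.0, 12.0, 10.0, 50.0])
scores = modified_z_scores(data)
is_outlier = np.abs(scores) > 3  # same default threshold as the signature above
```

With this data only the last value (50.0) exceeds the threshold.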

iqr(self, subset, q_min=0.25, q_max=0.75)

Computes the lower and upper quartiles of a subset and its interquartile range (IQR), the quantity that nonparametric fence methods scale by k to set their boundaries.

Parameters

:param subset: (np.ndarray) values used to calculate the interquartile range
:param q_min: (float) lower quartile
:param q_max: (float) upper quartile

returns

iqr: (tuple) (q_min, q_max, iqr)
- q_min: lower quartile of the subset
- q_max: upper quartile of the subset
- iqr: interquartile range

detect(self, df, win_size=30, step=1, conf=0.95, cols=None)

Detects outlier values in a dataframe. Any outliers found are imputed.

Parameters

:param df: (pandas.DataFrame)
:param win_size: (int)
:param step: (int)
:param conf: (float)
:param cols: (list)

returns

df: (pandas.DataFrame)

Snippet code

>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> import pandas as pd
>>> from rackio_AI import Outliers
>>> df = pd.DataFrame(np.random.randn(1000,2), columns=["a", "b"])
>>> out = Outliers()
>>> df = out.add(df, percent=1)
>>> df_imputed = out.detect(df, win_size=30)
>>> ax = plt.plot(df["a"], '-r', df["b"], '-b', out.outliers["a"]["locs"], out.outliers["a"]["values"], 'rD', out.outliers["b"]["locs"], out.outliers["b"]["values"], 'bo', out.detected["a"]["locs"], out.detected["a"]["values"], 'kD', out.detected["b"]["locs"], out.detected["b"]["values"], 'ko')
>>> ax = plt.legend(["a", "b", "a outliers", "b outliers", "a detected", "b detected"])
>>> plt.show()

Detect Outlier
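
To make the `win_size` and `step` parameters concrete, the sketch below shows one plausible way to cut a dataframe into sliding windows. It assumes that `win_size` is the number of rows per window and `step` is the number of rows the window advances each time; the generator `sliding_windows` is a hypothetical helper, not part of rackio_AI.

```python
import numpy as np
import pandas as pd

def sliding_windows(df, win_size=30, step=1):
    """Yield consecutive windows of win_size rows, advancing step rows each time."""
    for start in range(0, len(df) - win_size + 1, step):
        yield df.iloc[start:start + win_size]

df = pd.DataFrame({"a": np.arange(100.0)})
windows = list(sliding_windows(df, win_size=30, step=10))
# 100 rows with win_size=30 and step=10 give windows starting at rows 0, 10, ..., 70
```

A smaller `step` produces more overlapping windows, so each value is examined in more contexts at a higher computational cost.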

check(self, value, subsets, col)

Checks whether a value is an outlier within the subsets, using sliding windows and the modified z-score

Parameters

:param value: (float) Value to check as a possible outlier
:param subsets: (list) List of dataframes, one per sliding window
:param col: (str) Name of the dataframe column that the value belongs to

returns

status_outlier: (bool) True if the value is an outlier

impute(self, value, sample, conf=0.95)

Imputes outlier values using an autoregressive (AR) method with two lags

Parameters

:param value: (float)
:param sample: (pd.Series)
:param conf: (float)

returns

value: (float)
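
The idea of a two-lag autoregressive imputation can be sketched as follows: fit x[t] = c + a1 * x[t-1] + a2 * x[t-2] on the sample by ordinary least squares, then use the fitted model to predict the value that replaces the outlier. This is only an illustration under those assumptions. The helper `ar2_predict_next` is hypothetical, it ignores the `conf` parameter, and the library may fit its model differently.

```python
import numpy as np

def ar2_predict_next(sample):
    """Fit an AR(2) model by least squares and predict the value after the sample."""
    x = np.asarray(sample, dtype=float)
    # Design matrix columns: intercept, lag 1, lag 2
    X = np.column_stack([np.ones(len(x) - 2), x[1:-1], x[:-2]])
    y = x[2:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    # One-step-ahead prediction from the last two observations
    return coef[0] + coef[1] * x[-1] + coef[2] * x[-2]

# A sine wave follows an exact two-lag recursion, so the prediction should match it closely
sample = np.sin(0.3 * np.arange(40))
replacement = ar2_predict_next(sample)
```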

best_win_size_step(self, df, grid_type, *args, **kwargs)

Grid search over window and step sizes for sliding-window problems

Parameters

:param df: (pandas.DataFrame)
:param grid_type: (str)
:param args:
- win_sizes: (list)
- steps: (list)
:param percent: (float)

returns

df: (pandas.DataFrame)
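
A grid search of this kind evaluates every combination of candidate window and step sizes and keeps the best one. The sketch below shows the general pattern only. The function `grid_search`, the column names, the toy scoring function, and the convention that a higher score is better are all assumptions for illustration; they do not describe the metric rackio_AI uses.

```python
from itertools import product

import pandas as pd

def grid_search(win_sizes, steps, score):
    """Score every (win_size, step) pair and return the results as a dataframe."""
    rows = [(w, s, score(w, s)) for w, s in product(win_sizes, steps)]
    return pd.DataFrame(rows, columns=["win_size", "step", "score"])

def toy_score(win_size, step):
    # Hypothetical score: prefers windows near 30 rows and small steps
    return -abs(win_size - 30) - step

results = grid_search([10, 30, 50], [1, 5], toy_score)
best = results.loc[results["score"].idxmax()]
best_win_size, best_step = int(best["win_size"]), int(best["step"])
```

Keeping the full results table, rather than only the best pair, makes it possible to inspect how sensitive the score is to each parameter.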

get_best_window_step_size(self)

Gets the best window and step size after optimization

Parameters

None

returns

best_win_size, best_step_size: (tuple of int)