LSTM Data Preparation
rackio_AI.preprocessing.LSTMDataPreparation()Documentation here
split_sequences(self, df, timesteps, stepsize=1, input_cols=None, output_cols=None, maxlen=None, dtype='int32', padding='pre', truncating='pre', value=0.0)Splits dataframe in a 3D numpy array format supported by LSTM architectures using sliding windows concept.
Parameters
- :param df: (pandas.DataFrame) Contains inputs and outputs data
- :param timesteps: (list or int) Timestep for each input variable.
- If timestep is an int value, all input columns will be the same timestep
- If timestep is a list, must be same lenght that input_cols argument
- :param stepsize: (int, default = 1) step size for the sliding window
- :param input_cols: (list, default = None) Column names that represents the input variables to LSTM
- If input_cols is None the method assumes that inputs are all column except the last one.
- :param output_cols: (list, default = None) Column names that represents the output variables to LSTM
- If output_cols is None the method assumes that output is the last column.
The rest of parameters represent the parameters for pad_sequences method, see its description.
returns
sequences (3D numpy array) dimensions (df.shape[0] - max(timesteps), max(timesteps), features)
>>> import numpy as np
>>> from rackio_AI import RackioAI
>>> a = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90]).reshape(-1,1)
>>> b = np.array([15, 25, 35, 45, 55, 65, 75, 85, 95]).reshape(-1,1)
>>> c = np.array([a[i]+b[i] for i in range(len(a))]).reshape(-1,1)
>>> data = np.hstack((a,b,c))
>>> data
array([[ 10, 15, 25],
[ 20, 25, 45],
[ 30, 35, 65],
[ 40, 45, 85],
[ 50, 55, 105],
[ 60, 65, 125],
[ 70, 75, 145],
[ 80, 85, 165],
[ 90, 95, 185]])
>>> df = pd.DataFrame(data, columns=['a', 'b', 'c'])
>>> preprocess = RackioAI.get("Preprocessing", _type="Preprocessing")
>>> x, y = preprocess.lstm_data_preparation.split_sequences(df, 2)
>>> x.shape
(8, 2, 2)
>>> x
array([[[10., 15.],
[20., 25.]],
<BLANKLINE>
[[20., 25.],
[30., 35.]],
<BLANKLINE>
[[30., 35.],
[40., 45.]],
<BLANKLINE>
[[40., 45.],
[50., 55.]],
<BLANKLINE>
[[50., 55.],
[60., 65.]],
<BLANKLINE>
[[60., 65.],
[70., 75.]],
<BLANKLINE>
[[70., 75.],
[80., 85.]],
<BLANKLINE>
[[80., 85.],
[90., 95.]]])
>>> y.shape
(8, 1, 1)
>>> y
array([[[ 45.]],
<BLANKLINE>
[[ 65.]],
<BLANKLINE>
[[ 85.]],
<BLANKLINE>
[[105.]],
<BLANKLINE>
[[125.]],
<BLANKLINE>
[[145.]],
<BLANKLINE>
[[165.]],
<BLANKLINE>
[[185.]]])
pad_sequences(self, sequences, maxlen=None, dtype='int32', padding='pre', truncating='pre', value=0.0)Pads sequences to the same length.
This function transforms a list (of length num_samples) of sequences (lists of integers)
into a 2D Numpy array of shape (num_samples, num_timesteps).
num_timesteps is either the maxlen argument if provided, or the length of the longest
sequence in the list.
Sequences that are shorter than num_timesteps are padded with value until they are
num_timesteps long.
Sequences longer than num_timesteps are truncated so that they fit the desired length.
The position where padding or truncation happens is determined by the arguments padding
and truncating, respectively.
Pre-padding or removing values from the beginning of the sequence is the
default.
Parameters
- :param sequences: (list) List of sequences (each sequence is a list of integers).
- :param maxlen: (Optional Int), maximum length of all sequences. If not provided, sequences will be padded to the length of the longest individual sequence.
- :param dtype: (Optional, defaults to int32). Type of the output sequences.
To pad sequences with variable length strings, you can use
object. - :param padding: (String, 'pre' or 'post') (optional, defaults to 'pre'): pad either before or after each sequence.
- :param truncating: (String, 'pre' or 'post') (optional, defaults to 'pre'):
remove values from sequences larger than
maxlen, either at the beginning or at the end of the sequences. - :param value: (Float or String), padding value. (Optional, defaults to 0.)
returns:
- Numpy array with shape
(len(sequences), maxlen)
Raises:
- ValueError: In case of invalid values for
truncatingorpadding, or in case of invalid shape for asequencesentry.
>>> from rackio_AI import RackioAI
>>> sequence = [[1], [2, 3], [4, 5, 6]]
>>> preprocessing = RackioAI.get("Preprocessing", _type="Preprocessing")
>>> preprocessing.lstm_data_preparation.pad_sequences(sequence)
array([[0, 0, 4],
[0, 2, 5],
[1, 3, 6]])
>>> preprocessing.lstm_data_preparation.pad_sequences(sequence, value=-1)
array([[-1, -1, 4],
[-1, 2, 5],
[ 1, 3, 6]])
>>> preprocessing.lstm_data_preparation.pad_sequences(sequence, padding='post')
array([[1, 2, 4],
[0, 3, 5],
[0, 0, 6]])
>>> preprocessing.lstm_data_preparation.pad_sequences(sequence, maxlen=2)
array([[0, 2, 5],
[1, 3, 6]])