DataFrame

This module provides the DataFrame class, designed to host data in the context of multi-objective scoring activities.

The DataFrame class resembles the pandas DataFrame class, but does not inherit from it. It is designed to be more lightweight and tailored to the specific needs of the package.

An instance of DataFrame hosts information about a collection of items. Each row corresponds to an item, and each column represents a property of these items.

Simple Usage Examples:

The module can be imported as follows:

>>> from pumas.dataframes.dataframe import DataFrame

The data can be provided as a list of dictionaries. Each dictionary in the list represents an item and contributes a row to the data frame; each key contributes a column name.

>>> input_data = [{"A": 1, "B": 4, "C": "x"}, {"A": 2, "B": 4, "C": "y"}]

1. Initialization

>>> df = DataFrame(row_data=input_data)
>>> df.num_rows
2
>>> df.num_columns
3
>>> df.shape
(2, 3)
>>> df.size
6
>>> df.columns
['A', 'B', 'C']
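
The relationship between the input rows and the reported attributes can be sketched in plain Python. This is an illustrative stand-in, not the package's implementation; the helper name `summarize` is hypothetical.

```python
# Illustrative sketch only: derives row count and column names from a
# list of row dictionaries. `summarize` is a hypothetical helper.
input_data = [{"A": 1, "B": 4, "C": "x"}, {"A": 2, "B": 4, "C": "y"}]


def summarize(rows):
    """Return (number of rows, column names in first-seen order)."""
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    return len(rows), columns


num_rows, columns = summarize(input_data)
shape = (num_rows, len(columns))     # rows first, then columns
size = num_rows * len(columns)       # total number of cells
```

Under these assumptions, `shape` is the pair (rows, columns) and `size` is their product, matching the values shown in the doctest above.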

2. Index Management

Each DataFrame instance has an index attribute that can be used to access the index values.

>>> df.index.values
[0, 1]

It is possible to reset the index of the DataFrame.

>>> df.rebuild_index(strategy="range")
>>> df.index.values
[0, 1]

>>> df.rebuild_index(strategy="uuids")
>>> df.index.values
[UUID('...'), UUID('...')]
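
As a rough illustration of what the two strategies produce, the sketch below builds index values with the standard library. The function `build_index` is a hypothetical stand-in and is not claimed to mirror how `rebuild_index` works internally.

```python
import uuid


def build_index(num_rows, strategy="range"):
    """Hypothetical helper: return a list of index values for num_rows rows."""
    if strategy == "range":
        # Consecutive integers starting at zero.
        return list(range(num_rows))
    if strategy == "uuids":
        # One random UUID per row.
        return [uuid.uuid4() for _ in range(num_rows)]
    raise ValueError(f"Unknown strategy: {strategy!r}")


range_index = build_index(2, strategy="range")
uuid_index = build_index(2, strategy="uuids")
```

A range index is reproducible across runs, whereas random UUIDs differ on every call, which is why the doctest output above elides them.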

It is possible to set the index from an existing column.

>>> df.set_index_from_column(column_name="A")
>>> df.index.values
[1, 2]

The column, however, must have unique values; otherwise the following error is raised:

>>> try:
...     df.set_index_from_column(column_name="B")
... except DuplicateValuesError as e:
...     print(e)
Index values must be unique.

This exception, if raised, can be caught and handled as needed. Alternatively, it is possible to avoid it by checking the uniqueness of the values of a column before setting it as an index.

>>> df.column_has_unique_values(column_name="B")
False
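
The "check first, then set" pattern can be illustrated without the package. The sketch below assumes the column values are hashable; `column_values_are_unique` and `pick_index_column` are hypothetical helpers, not part of the DataFrame API.

```python
def column_values_are_unique(rows, column_name):
    """Return True if no value in the column appears more than once."""
    values = [row[column_name] for row in rows]
    # A set drops duplicates, so equal lengths mean all values are distinct.
    return len(set(values)) == len(values)


def pick_index_column(rows, candidates):
    """Return the first candidate column with unique values, or None."""
    for name in candidates:
        if column_values_are_unique(rows, name):
            return name
    return None


rows = [{"A": 1, "B": 4, "C": "x"}, {"A": 2, "B": 4, "C": "y"}]
chosen = pick_index_column(rows, candidates=["B", "A"])
```

Here column "B" is skipped because the value 4 appears twice, so the sketch falls back to "A".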

3. Metadata Management

Metadata are additional attributes that do not find a place in the main data structure. Metadata are useful to store additional information about the data, such as units, descriptions, or other properties.

The dataframe initialization creates blank metadata for the columns and rows.

>>> df.column_metadata_map
{'A': ColumnMetadata(uid='A', properties=None), 'B': ColumnMetadata(uid='B', properties=None), 'C': ColumnMetadata(uid='C', properties=None)}

If metadata is provided during initialization, it will be used to populate the metadata map.

>>> column_metadata_map = {
...                       "A": {"uid": "1", "properties": {"unit": "count"}},
...                       "B": {"uid": "2", "properties": {"unit": "count"}},
...                       "C": {"uid": "3", "properties": {"unit": "count"}}
...                       }
>>> df = DataFrame(row_data=input_data, column_metadata_map=column_metadata_map)
>>> df.column_metadata_map
{'A': ColumnMetadata(uid='1', properties={'unit': 'count'}), 'B': ColumnMetadata(uid='2', properties={'unit': 'count'}), 'C': ColumnMetadata(uid='3', properties={'unit': 'count'})}
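
One way to picture the default-versus-provided behaviour is a small dataclass with a builder function. This is a sketch under assumptions: `ColumnMetadataSketch` and `build_metadata_map` are hypothetical names, and the real `ColumnMetadata` class may be defined differently.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ColumnMetadataSketch:
    """Hypothetical stand-in for a column metadata record."""
    uid: str
    properties: Optional[Dict[str, Any]] = None


def build_metadata_map(columns, overrides=None):
    """Use provided metadata where available, blank metadata otherwise."""
    overrides = overrides or {}
    result = {}
    for name in columns:
        if name in overrides:
            result[name] = ColumnMetadataSketch(**overrides[name])
        else:
            # Blank metadata: the column name doubles as the uid.
            result[name] = ColumnMetadataSketch(uid=name)
    return result


blank = build_metadata_map(["A", "B"])
custom = build_metadata_map(
    ["A", "B"],
    overrides={"A": {"uid": "1", "properties": {"unit": "count"}}},
)
```

In this sketch, columns without an entry in `overrides` keep blank metadata, which is an assumption about partial maps rather than documented behaviour.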

4. Data Type Management

Before using the data frame, it might be necessary to inspect or change the data types of the data it contains. All the data in a column must have the same data type, which is used as a column flag.

If the column has mixed data types, the column flag will be set to UnspecifiedDataType.

>>> input_data = [{"A": 1, "B": 4, "C": "5"}, {"A": 2, "B": "4", "C": "y"}]
>>> df = DataFrame(row_data=input_data)
>>> df.dtypes_map
{'A': <class 'int'>, 'B': <class 'pumas.dataframes.dataframe.UnspecifiedDataType'>, 'C': <class 'str'>}
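
The rule "one type per column, otherwise unspecified" can be sketched as follows. The `Unspecified` class and `infer_column_type` function are hypothetical stand-ins used only for illustration.

```python
class Unspecified:
    """Hypothetical marker for a column with mixed (or no) value types."""


def infer_column_type(values):
    """Return the single type shared by all values, else Unspecified."""
    types = {type(v) for v in values}
    if len(types) == 1:
        return types.pop()
    return Unspecified


input_data = [{"A": 1, "B": 4, "C": "5"}, {"A": 2, "B": "4", "C": "y"}]
inferred = {
    name: infer_column_type([row[name] for row in input_data])
    for name in input_data[0]
}
```

Column "B" mixes an integer and a string, so the sketch flags it as unspecified, while "A" and "C" each resolve to a single type.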

It is possible to set the data types of the columns using a dictionary. This attempts to cast the data in each column to the specified data type.

>>> dtype_map = {"A": int, "B": float, "C": str}
>>> df = DataFrame(row_data=input_data, dtypes_map=dtype_map)
>>> df.dtypes_map
{'A': <class 'int'>, 'B': <class 'float'>, 'C': <class 'str'>}

If casting fails on any element of a column, neither the values nor the data type of that column are changed, and a warning is issued.

>>> dtype_map = {"A": int, "B": str, "C": int}
>>> df = DataFrame(row_data=input_data, dtypes_map=dtype_map)
>>> df.dtypes_map
{'A': <class 'int'>, 'B': <class 'str'>, 'C': <class 'str'>}
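
The all-or-nothing casting behaviour described above can be approximated with a small helper. This is a sketch, not the package's code: `cast_column` is a hypothetical function, and the warning text is invented for illustration.

```python
import warnings


def cast_column(values, target_type):
    """Cast every value, or leave the column untouched if any cast fails.

    Returns (values, succeeded).
    """
    try:
        return [target_type(v) for v in values], True
    except (TypeError, ValueError):
        warnings.warn(
            f"Could not cast column to {target_type.__name__}; "
            "values left unchanged."
        )
        return list(values), False


# Column "B" from the example: both values convert cleanly to float.
cast_b, ok_b = cast_column([4, "4"], float)

# Column "C": "y" cannot become an int, so the column is left as-is.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    cast_c, ok_c = cast_column(["5", "y"], int)
```

Building the converted list inside the `try` block means a failure on any element discards the partial result, so a column is never left half-converted.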

It is possible to cast only some of the columns. In this case, only the columns specified in the dtype_map will be cast.

>>> dtype_map = {"A": float}
>>> df = DataFrame(row_data=input_data, dtypes_map=dtype_map)
>>> df.dtypes_map
{'A': <class 'float'>, 'B': <class 'pumas.dataframes.dataframe.UnspecifiedDataType'>, 'C': <class 'str'>}

5. Applying Functions to Data

The DataFrame class provides methods to apply functions to the data contained in the DataFrame.

5.1 Elementwise Operations

Elementwise operations allow you to apply a function to each element of a column. The function can be applied in parallel. This is useful when the function encoding the operation does not work on vectorized data or requires special handling. The index of the original DataFrame is maintained in the new DataFrame.

>>> input_data = [{"A": 1, "B": 4}, {"A": 2, "B": 5}, {"A": 3, "B": 6}]
>>> df = DataFrame(row_data=input_data)
>>> df.set_index_from_column(column_name="A")
>>> df.index.values
[1, 2, 3]

We will apply the square function to column 'A'. This function does not require any additional parameters.

>>> def square(x):
...     return x ** 2
>>> df_squared = df.apply_elementwise_column(column_name='A', new_column_name='A_squared', func=square)
>>> df_squared.row_data
[{'A_squared': 1}, {'A_squared': 4}, {'A_squared': 9}]
>>> df_squared.index.values
[1, 2, 3]

We will apply the add function to column 'A', adding 5 to each value. This function requires an additional parameter.

>>> def add(x, amount):
...     return x + amount
>>> df_added = df.apply_elementwise_column(column_name='A', new_column_name='A_added', func=add, func_kwargs={'amount': 5})
>>> df_added.row_data
[{'A_added': 6}, {'A_added': 7}, {'A_added': 8}]
>>> df_added.index.values
[1, 2, 3]

This method ensures the index of the original DataFrame is maintained in the new DataFrame, regardless of the parallelization strategy used.
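
To show how results can stay aligned with the index under parallel execution, here is a minimal sketch using the standard library. It assumes a thread pool, which may differ from the parallelization strategy the package uses; `apply_elementwise` is a hypothetical helper.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def add(x, amount):
    return x + amount


def apply_elementwise(index, values, func, func_kwargs=None, max_workers=2):
    """Apply func to each value in parallel and pair results with the index."""
    bound = partial(func, **(func_kwargs or {}))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, whatever order
        # the individual calls happen to finish in.
        results = list(pool.map(bound, values))
    return dict(zip(index, results))


added = apply_elementwise(
    index=[1, 2, 3],
    values=[1, 2, 3],
    func=add,
    func_kwargs={"amount": 5},
)
```

Because `Executor.map` returns results in the order of its inputs, zipping them with the original index keeps every result attached to the right row.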

6. Concatenate DataFrames