Module markov.api.data.data_segment
Classes
class DataSegment (segment_type: SegmentType, name: str = '', **kwargs)
-
Object model for a DataSegment. DataSegments are:
1. Train
2. Test
3. Validate
4. Unknown (for a generic dataset that is not assigned to any of the segments above)
Args
data_frame
:DataFrame
- the data_set in DataFrame format
segment_type
:SegmentType
- the type of segment this data_set belongs to
name
:str
- optional name for this data segment
**kwargs
- additional keyword arguments
Static methods
def create_datasegment(segment_type: SegmentType, ds_id: str, x_indexes: List[int], y_index: int, x_col_names: List[str], y_col_name: str, data_format: str, is_analyzed: bool = False, num_rows: int = 0, num_cols: int = 0, exists: bool = True, delimiter: str = ',') ‑> DataSegment
-
Create a Data Segment
Args
segment_type
:SegmentType
- the segment type (Train/Test/Validate/Unknown)
ds_id
:str
- dataset id that is uniquely assigned to the parent dataset
x_indexes
:list
- indexes of the feature columns; provide either x_indexes or x_col_names, but not both
y_index
:int
- index of the target column for a labeled dataset; provide either y_index or y_col_name, but not both
x_col_names
:list
- names of the feature columns; provide either x_col_names or x_indexes, but not both
y_col_name
:str
- name of the target column; provide either y_col_name or y_index, but not both
data_format
:str
- format of data (csv/tsv etc.)
is_analyzed
:bool
- True if the dataset has been analyzed, False otherwise
num_rows
:int
- number of rows in this dataset segment
num_cols
:int
- number of columns in this dataset segment
exists
:bool
- True if this segment exists, False otherwise
delimiter
:str
- delimiter for this dataset
Returns
The DataSegment object
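A minimal sketch of creating a train segment with create_datasegment. The dataset id, column layout, and the SegmentType member name used here are illustrative assumptions, not values taken from this reference; it also assumes SegmentType is importable alongside DataSegment.
>>> from markov.api.data.data_segment import DataSegment, SegmentType
>>> train_segment = DataSegment.create_datasegment(
...     segment_type=SegmentType.TRAIN,  # assumed member name for the Train segment
...     ds_id="my-dataset-id",           # hypothetical id of the parent dataset
...     x_indexes=[0, 1, 2],             # feature columns by index ...
...     y_index=3,                       # ... and target column by index
...     x_col_names=[],                  # left empty because indexes are used
...     y_col_name="",
...     data_format="csv",
...     num_rows=1000,
...     num_cols=4,
... )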
def from_pickle_file(file_path) ‑> DataSegment
-
Read a DataSegment object from a binary (pickled) file created by save. Do not use pickle directly to unpickle these files; use this method instead.
Args
file_path
:str
- the location of the serialized binary file created by calling save on the DataSegment
Returns
The DataSegment object restored from the pickled data stored at this path.
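A short sketch of loading a segment back from disk; the file path is a placeholder for a file previously written with save (see the save method below).
>>> from markov.api.data.data_segment import DataSegment
>>> segment = DataSegment.from_pickle_file("train_segment.pkl")  # placeholder path
>>> segment.num_rows, segment.num_cols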
Instance variables
prop ds_id
-
unique identifier to identify this dataset
Returns
The ds_id (DataSetID) associated with the dataset. Every dataset registered with MarkovML gets a unique identifier that identifies it within MarkovML.
prop features
-
Returns
A copy of the feature columns as a DataFrame
prop is_analyzed
-
Returns
True if the dataset has been analyzed, False (the default) otherwise
prop num_cols
-
Returns
Number of columns in the data segment
prop num_rows
-
Returns
Number of rows present in the data segment
prop segment_type
-
Returns
The segment type of this data segment: one of Train | Test | Validate | Unknown
prop target
-
Returns
A copy of the target column as a DataFrame
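A brief sketch of pulling features and target together for model training, assuming a dataset fetched with markov.dataset.get_by_id as in the normalized_df example below; the id is a placeholder.
>>> import markov
>>> dataset = markov.dataset.get_by_id("id")   # placeholder dataset id
>>> X = dataset.train.features                 # copy of the feature columns
>>> y = dataset.train.target                   # copy of the target column
>>> X.shape, y.shape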
Methods
def as_df(self, force_download=False) ‑> pandas.core.frame.DataFrame
-
Returns the segment as a DataFrame. The data is downloaded the first time this is called and cached for reuse within the session.
Args
force_download
:bool
- By default, MarkovML downloads the dataset only once to make loading faster. Set force_download to True to re-download the dataset from storage.
Returns
DataFrame containing the data for this DataSegment
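A minimal usage sketch, assuming dataset is a registered dataset fetched as in the normalized_df example below:
>>> df = dataset.train.as_df()                     # downloaded once, then cached for the session
>>> df = dataset.train.as_df(force_download=True)  # force a fresh download from storage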
def download_as_csv(self, filepath: str = None) ‑> None
-
Download the dataset segment as a CSV file at the given filepath. If filepath is not provided, it is downloaded to the current working directory with the same name as the uploaded file. For example:
>>> dataset.train.download_as_csv(filepath="train.csv")
>>> dataset.test.download_as_csv(filepath="test.csv")
>>> dataset.test.download_as_csv()  # will download with the same filename as the uploaded file
def normalized_df(self, filters: List, replace_symbol: str = ' ')
-
Normalize the dataframe with different filters such as stop words, emojis, and URLs.
Returns
Normalized dataframe
>>> import markov
>>> dataset = markov.dataset.get_by_id("id")
>>> dataset.train.normalized_df(filters=["url", "email"])
<dataframe>
def sample(self, n: int, strategy='uniform') ‑> DataSegment
-
Sample the dataframe applying a specific strategy
Args
n
:int
- number of points to sample
strategy
:str
- sampling strategy applied to get the sample. Supported strategies are Stratified, Cochens, Weighted, and Uniform
Returns
The sampled DataSegment
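A minimal sketch of sampling a segment, assuming dataset is fetched as in the examples above; the sample size is arbitrary and the lowercase "uniform" string mirrors the default in the signature, so other strategy spellings may differ.
>>> sampled = dataset.train.sample(n=1000, strategy="uniform")  # returns a new DataSegment
>>> sampled.as_df()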
def save(self, file_path)
-
Save this DataSegment at the user provided path.
Args
file_path
- Full path, including the file name, on your local system where the DataSegment object will be saved.
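A short sketch pairing save with from_pickle_file; the local path is a placeholder.
>>> from markov.api.data.data_segment import DataSegment
>>> dataset.train.save("train_segment.pkl")                      # placeholder local path
>>> segment = DataSegment.from_pickle_file("train_segment.pkl")  # load it back later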
class DataSegmentPath (segment_type: str = 'unknown', path: Optional[str] = '', multi_file: bool = False, num_records: int = 0, num_cols: int = 0, data_source: Union[pd.DataFrame, str] = None, data_source_type: DataSourceType = None, data_source_format: DataSourceFormat = None)
-
Path of this DataSegment
Class variables
var data_source : Union[pandas.core.frame.DataFrame, str]
-
Data source to register: a DataFrame or a path string
var data_source_format : DataSourceFormat
-
Datasource format
var data_source_type : DataSourceType
-
Datasource type
var multi_file : bool
-
Whether this path points to a folder with multiple files
var num_cols : int
-
Number of columns in this data segment
var num_records : int
-
Number of records in this data segment
var path : Optional[str]
-
Actual path / URI to a data_set resource on external storage
var segment_type : str
-
Segment type of this data segment (train/test/validate/unknown)
Static methods
def from_dict(kvs: Union[dict, list, str, int, float, bool, ForwardRef(None)], *, infer_missing=False) ‑> ~A
def from_json(s: Union[str, bytes, bytearray], *, parse_float=None, parse_int=None, parse_constant=None, infer_missing=False, **kw) ‑> ~A
def schema(*, infer_missing: bool = False, only=None, exclude=(), many: bool = False, context=None, load_only=(), dump_only=(), partial: bool = False, unknown=None) ‑> dataclasses_json.mm.SchemaF[~A]
Methods
def clone(self, ds_path: DataSegmentPath)
def get_dict(self) ‑> Dict
def get_json(self) ‑> str
def to_dict(self, encode_json=False) ‑> Dict[str, Union[dict, list, str, int, float, bool, ForwardRef(None)]]
def to_json(self, *, skipkeys: bool = False, ensure_ascii: bool = True, check_circular: bool = True, allow_nan: bool = True, indent: Union[int, str, ForwardRef(None)] = None, separators: Optional[Tuple[str, str]] = None, default: Optional[Callable] = None, sort_keys: bool = False, **kw) ‑> str
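An illustrative construction of a DataSegmentPath, assuming lowercase segment type strings (the constructor default is 'unknown') and a hypothetical storage URI:
>>> from markov.api.data.data_segment import DataSegmentPath
>>> seg_path = DataSegmentPath(
...     segment_type="train",                   # assumed lowercase, mirroring the 'unknown' default
...     path="s3://bucket/datasets/train.csv",  # hypothetical storage URI
...     num_records=1000,
...     num_cols=4,
... )
>>> seg_path.get_dict()                         # plain-dict view of this path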
class DataSegmentTuple (ds_id: str, segment_types: List)
-
Lightweight reference pairing a dataset id (ds_id) with the list of segment types that belong to it.
Class variables
var ds_id : str
var segment_types : List
Static methods
def from_dict(kvs: Union[dict, list, str, int, float, bool, ForwardRef(None)], *, infer_missing=False) ‑> ~A
def from_json(s: Union[str, bytes, bytearray], *, parse_float=None, parse_int=None, parse_constant=None, infer_missing=False, **kw) ‑> ~A
def schema(*, infer_missing: bool = False, only=None, exclude=(), many: bool = False, context=None, load_only=(), dump_only=(), partial: bool = False, unknown=None) ‑> dataclasses_json.mm.SchemaF[~A]
Methods
def get_dict(self) ‑> Dict
def get_json(self) ‑> str
def to_dict(self, encode_json=False) ‑> Dict[str, Union[dict, list, str, int, float, bool, ForwardRef(None)]]
def to_json(self, *, skipkeys: bool = False, ensure_ascii: bool = True, check_circular: bool = True, allow_nan: bool = True, indent: Union[int, str, ForwardRef(None)] = None, separators: Optional[Tuple[str, str]] = None, default: Optional[Callable] = None, sort_keys: bool = False, **kw) ‑> str
class DownloadDatasetInfo (url: str, storage_format: StorageFormatType, storage_type: StorageType)
-
Signed URL to download the dataset
Class variables
var storage_format : StorageFormatType
var storage_type : StorageType
var url : str
Static methods
def from_dict(kvs: Union[dict, list, str, int, float, bool, ForwardRef(None)], *, infer_missing=False) ‑> ~A
def from_json(s: Union[str, bytes, bytearray], *, parse_float=None, parse_int=None, parse_constant=None, infer_missing=False, **kw) ‑> ~A
def schema(*, infer_missing: bool = False, only=None, exclude=(), many: bool = False, context=None, load_only=(), dump_only=(), partial: bool = False, unknown=None) ‑> dataclasses_json.mm.SchemaF[~A]
Methods
def to_dict(self, encode_json=False) ‑> Dict[str, Union[dict, list, str, int, float, bool, ForwardRef(None)]]
def to_json(self, *, skipkeys: bool = False, ensure_ascii: bool = True, check_circular: bool = True, allow_nan: bool = True, indent: Union[int, str, ForwardRef(None)] = None, separators: Optional[Tuple[str, str]] = None, default: Optional[Callable] = None, sort_keys: bool = False, **kw) ‑> str