Module markov.api.data.data_segment

Classes

class DataSegment (segment_type: SegmentType, name: str = '', **kwargs)

Object model for a DataSegment. A DataSegment is one of:

1. Train
2. Test
3. Validate
4. Unknown (for a generic dataset that is not assigned to any of the segments above)

Args

data_frame : DataFrame
dataset in DataFrame format
segment_type : SegmentType
the type of segment this dataset belongs to

name : str
optional name for this data segment
**kwargs
additional keyword arguments
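
A minimal construction sketch for illustration; the SegmentType import location and the TRAIN member name are assumptions here, and data_frame is passed via **kwargs as documented above:

>>> import pandas as pd
>>> from markov.api.data.data_segment import DataSegment, SegmentType  # SegmentType location assumed
>>> df = pd.DataFrame({"feature": [1, 2, 3], "label": [0, 1, 0]})
>>> segment = DataSegment(segment_type=SegmentType.TRAIN, name="train_split", data_frame=df)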

Static methods

def create_datasegment(segment_type: SegmentType, ds_id: str, x_indexes: List[int], y_index: int, x_col_names: List[str], y_col_name: str, data_format: str, is_analyzed: bool = False, num_rows: int = 0, num_cols: int = 0, exists: bool = True, delimiter: str = ',') ‑> DataSegment

Create a Data Segment

Args

segment_type : SegmentType
the segment type (Train/Test/Validate/Unknown)
ds_id : str
dataset id uniquely assigned to the parent dataset
x_indexes : list
indexes of the feature columns; provide either the indexes or the names of the feature columns, not both
y_index : int
index of the target column for a labeled dataset; provide either the index or the name of the target column
x_col_names : list
names of the feature columns; provide either the names or the indexes of the feature columns, not both
y_col_name : str
name of the target column; provide either the name or the index of the target column
data_format : str
format of the data (e.g. csv, tsv)
is_analyzed : bool
True if the dataset has been analyzed, False otherwise
num_rows : int
number of rows in this dataset segment
num_cols : int
number of columns in this dataset segment
exists : bool
True if this segment exists, False otherwise
delimiter : str
delimiter for this dataset

Returns

The DataSegment object
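
A hedged sketch of a call using placeholder values; the SegmentType member name is an assumption, and the unused column-name arguments are passed empty since the signature has no defaults for them:

>>> segment = DataSegment.create_datasegment(
...     segment_type=SegmentType.TRAIN,  # member name assumed
...     ds_id="my-dataset-id",           # placeholder dataset id
...     x_indexes=[0, 1, 2],             # feature columns by index
...     y_index=3,                       # target column by index
...     x_col_names=[],                  # unused: indexes provided instead
...     y_col_name="",                   # unused: index provided instead
...     data_format="csv",
... )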

def from_pickle_file(file_path) ‑> DataSegment

Read a DataSegment object from a binary (pickled) file. Use this method rather than unpickling the file with pickle directly.

Args

file_path : str
location of the serialized binary file created by calling save on a DataSegment

Returns

DataSegment object from the pickled dataset stored at this path.
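
For example, loading a segment previously written with save (the file name is a placeholder):

>>> segment = DataSegment.from_pickle_file("train_segment.bin")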

Instance variables

prop ds_id

unique identifier for this dataset

Returns

ds_id (DataSetID) that uniquely identifies this dataset within MarkovML. Every dataset registered with MarkovML is assigned such an identifier.

prop features

Returns

A copy of the feature columns as a DataFrame

prop is_analyzed

Returns

True if the dataset has been analyzed, False (the default) otherwise

prop num_cols

Returns

Number of columns in the data segment

prop num_rows

Returns

Number of rows present in the data segment

prop segment_type

Returns

The segment type of this data segment. The segment types are (Train | Test | Validate | Unknown)

prop target

Returns

A copy of the target column as a DataFrame
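
For example, assuming a dataset fetched by id as in the normalized_df example below:

>>> import markov
>>> dataset = markov.dataset.get_by_id("id")
>>> X = dataset.train.features  # copy of the feature columns
>>> y = dataset.train.target    # copy of the target column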

Methods

def as_df(self, force_download=False) ‑> pandas.core.frame.DataFrame

Returns the segment as a dataframe. The data is downloaded on first access and cached for reuse within the session.

Args

force_download : bool
By default, MarkovML downloads the dataset only once to make loading faster. If you want to re-download the dataset from storage, set force_download to True.

Returns

DataFrame containing the data for this DataSegment
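
Continuing the example above:

>>> df = dataset.train.as_df()                     # downloaded once, then cached
>>> df = dataset.train.as_df(force_download=True)  # force a re-download from storage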

def download_as_csv(self, filepath: str = None) ‑> None

Download the dataset segment as a csv file at the given filepath. If filepath is not provided, the file is downloaded to the current working directory under the same name as the originally uploaded file. For example:

>>> dataset.train.download_as_csv(filepath="train.csv")
>>> dataset.test.download_as_csv(filepath="test.csv")
>>> dataset.test.download_as_csv() # downloads using the same filename as the uploaded file
def normalized_df(self, filters: List, replace_symbol: str = ' ')

Normalize the dataframe with different filters such as stop words, emoji, and URLs.

Returns

Normalized dataframe

>>> import markov
>>> dataset = markov.dataset.get_by_id("id")
>>> dataset.train.normalized_df(filters=["url", "email"])
<dataframe>
def sample(self, n: int, strategy='uniform') ‑> DataSegment

Sample the dataframe using the specified strategy

Args

n : int
number of points to sample
strategy : str
sampling strategy applied to get the sample; one of Stratified, Cochens, Weighted, or Uniform (default: uniform)

Returns

The sampled DataSegment
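
For example (the lowercase spelling of the non-default strategy name is an assumption):

>>> subset = dataset.train.sample(n=1000)                         # uniform sampling by default
>>> subset = dataset.train.sample(n=1000, strategy="stratified")  # strategy string assumed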

def save(self, file_path)

Save this DataSegment at the user provided path.

Args

file_path
full path, including the file name, on your local system where the DataSegment object should be saved
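
For example, a save/load round trip (the path is a placeholder):

>>> dataset.train.save("/tmp/train_segment.bin")
>>> restored = DataSegment.from_pickle_file("/tmp/train_segment.bin")
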
class DataSegmentPath (segment_type: str = 'unknown', path: Optional[str] = '', multi_file: bool = False, num_records: int = 0, num_cols: int = 0, data_source: Union[pd.DataFrame, str] = None, data_source_type: DataSourceType = None, data_source_format: DataSourceFormat = None)

Path of this DataSegment

Class variables

var data_source : Union[pandas.core.frame.DataFrame, str]

Dataframe (or path string) to register as the data source

var data_source_format : DataSourceFormat

Datasource format

var data_source_type : DataSourceType

Datasource type

var multi_file : bool

True if this path points to a folder with multiple files

var num_cols : int

Number of columns in this data segment

var num_records : int

Number of records in this data segment

var path : Optional[str]

Actual path / URI to a data_set resource on external storage

var segment_type : str

Segment type (train/test/validate/unknown)

Static methods

def from_dict(kvs: Union[dict, list, str, int, float, bool, ForwardRef(None)], *, infer_missing=False) ‑> ~A
def from_json(s: Union[str, bytes, bytearray], *, parse_float=None, parse_int=None, parse_constant=None, infer_missing=False, **kw) ‑> ~A
def schema(*, infer_missing: bool = False, only=None, exclude=(), many: bool = False, context=None, load_only=(), dump_only=(), partial: bool = False, unknown=None) ‑> dataclasses_json.mm.SchemaF[~A]

Methods

def clone(self, ds_path: DataSegmentPath)
def get_dict(self) ‑> Dict
def get_json(self) ‑> str
def to_dict(self, encode_json=False) ‑> Dict[str, Union[dict, list, str, int, float, bool, ForwardRef(None)]]
def to_json(self, *, skipkeys: bool = False, ensure_ascii: bool = True, check_circular: bool = True, allow_nan: bool = True, indent: Union[int, str, ForwardRef(None)] = None, separators: Optional[Tuple[str, str]] = None, default: Optional[Callable] = None, sort_keys: bool = False, **kw) ‑> str
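
These serialization helpers come from dataclasses_json; a small round-trip sketch with placeholder values:

>>> seg_path = DataSegmentPath(segment_type="train", path="s3://bucket/train.csv",
...                            num_records=100, num_cols=5)
>>> payload = seg_path.to_json()
>>> DataSegmentPath.from_json(payload).path
's3://bucket/train.csv'
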
class DataSegmentTuple (ds_id: str, segment_types: List)

A lightweight pairing of a dataset id with its segment types.

Class variables

var ds_id : str
var segment_types : List

Static methods

def from_dict(kvs: Union[dict, list, str, int, float, bool, ForwardRef(None)], *, infer_missing=False) ‑> ~A
def from_json(s: Union[str, bytes, bytearray], *, parse_float=None, parse_int=None, parse_constant=None, infer_missing=False, **kw) ‑> ~A
def schema(*, infer_missing: bool = False, only=None, exclude=(), many: bool = False, context=None, load_only=(), dump_only=(), partial: bool = False, unknown=None) ‑> dataclasses_json.mm.SchemaF[~A]

Methods

def get_dict(self) ‑> Dict
def get_json(self) ‑> str
def to_dict(self, encode_json=False) ‑> Dict[str, Union[dict, list, str, int, float, bool, ForwardRef(None)]]
def to_json(self, *, skipkeys: bool = False, ensure_ascii: bool = True, check_circular: bool = True, allow_nan: bool = True, indent: Union[int, str, ForwardRef(None)] = None, separators: Optional[Tuple[str, str]] = None, default: Optional[Callable] = None, sort_keys: bool = False, **kw) ‑> str
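
A small usage sketch with placeholder values (plain segment-type strings are an assumption):

>>> dst = DataSegmentTuple(ds_id="my-dataset-id", segment_types=["train", "test"])
>>> dst.to_dict()
{'ds_id': 'my-dataset-id', 'segment_types': ['train', 'test']}
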
class DownloadDatasetInfo (url: str, storage_format: StorageFormatType, storage_type: StorageType)

Signed url to download the dataset

Class variables

var storage_format : StorageFormatType
var storage_type : StorageType
var url : str

Static methods

def from_dict(kvs: Union[dict, list, str, int, float, bool, ForwardRef(None)], *, infer_missing=False) ‑> ~A
def from_json(s: Union[str, bytes, bytearray], *, parse_float=None, parse_int=None, parse_constant=None, infer_missing=False, **kw) ‑> ~A
def schema(*, infer_missing: bool = False, only=None, exclude=(), many: bool = False, context=None, load_only=(), dump_only=(), partial: bool = False, unknown=None) ‑> dataclasses_json.mm.SchemaF[~A]

Methods

def to_dict(self, encode_json=False) ‑> Dict[str, Union[dict, list, str, int, float, bool, ForwardRef(None)]]
def to_json(self, *, skipkeys: bool = False, ensure_ascii: bool = True, check_circular: bool = True, allow_nan: bool = True, indent: Union[int, str, ForwardRef(None)] = None, separators: Optional[Tuple[str, str]] = None, default: Optional[Callable] = None, sort_keys: bool = False, **kw) ‑> str