Module markov.api.data.data_segment
Classes
class DataSegment (segment_type: SegmentType, name: str = '', **kwargs)
-
Object model for a DataSegment. DataSegments are:
1. Train
2. Test
3. Validate
4. Unknown (for a generic dataset that is not assigned to any of the segments above)
Args
data_frame
:DataFrame
- the data_set in DataFrame format
segment_type
:SegmentType
- the type of segment this data_set belongs to
name
:str
- optional name for this data segment
**kwargs
- additional keyword arguments
Static methods
def create_datasegment(segment_type: SegmentType, ds_id: str, x_indexes: List[int], y_index: int, x_col_names: List[str], y_col_name: str, data_format: str, is_analyzed: bool = False, num_rows: int = 0, num_cols: int = 0, exists: bool = True, delimiter: str = ',') ‑> DataSegment
-
Create a Data Segment
Args
segment_type
:SegmentType
- the segment type (Train/Test/Validate/Unknown)
ds_id
:str
- dataset id that is uniquely assigned to the parent dataset
x_indexes
:list
- indexes of the feature columns; provide either x_indexes or x_col_names, but not both
y_index
:int
- index of the target column for a labeled dataset; provide either y_index or y_col_name, but not both
x_col_names
:list
- names of the feature columns; provide either x_col_names or x_indexes, but not both
y_col_name
:str
- name of the target column; provide either y_col_name or y_index, but not both
data_format
:str
- format of data (csv/tsv etc.)
is_analyzed
:bool
- True if the dataset has been analyzed, False otherwise
num_rows
:int
- number of rows in this dataset segment
num_cols
:int
- number of columns in this dataset segment
exists
:bool
- True if this segment exists, False otherwise
delimiter
:str
- delimiter for this dataset
Returns
The DataSegment object
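A minimal sketch of creating a train segment with create_datasegment. The dataset id, column layout, and the SegmentType member name used here are illustrative assumptions, not values taken from this reference; it also assumes SegmentType is importable alongside DataSegment.
>>> from markov.api.data.data_segment import DataSegment, SegmentType
>>> train_segment = DataSegment.create_datasegment(
...     segment_type=SegmentType.TRAIN,  # assumed member name for the Train segment
...     ds_id="my-dataset-id",           # hypothetical id of the parent dataset
...     x_indexes=[0, 1, 2],             # feature columns by index ...
...     y_index=3,                       # ... and target column by index
...     x_col_names=[],                  # left empty because indexes are used
...     y_col_name="",
...     data_format="csv",
...     num_rows=1000,
...     num_cols=4,
... )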
def from_pickle_file(file_path) ‑> DataSegment
-
Read a DataSegment object from a binary (pickled) file created by save. Do not use pickle directly to unpickle these files; use this method instead.
Args
file_path
:str
- the location of the serialized binary file created by calling save on the DataSegment
Returns
The DataSegment object restored from the pickled data stored at this path.
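A short sketch of loading a segment back from disk; the file path is a placeholder for a file previously written with save (see the save method below).
>>> from markov.api.data.data_segment import DataSegment
>>> segment = DataSegment.from_pickle_file("train_segment.pkl")  # placeholder path
>>> segment.num_rows, segment.num_cols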
Instance variables
prop ds_id
-
unique identifier to identify this dataset
Returns
The ds_id (DataSetID) associated with the dataset. Every dataset registered with MarkovML gets a unique identifier that identifies it within MarkovML.
prop features
-
Returns
A copy of the feature columns as a DataFrame
prop is_analyzed
-
Returns
True if the dataset has been analyzed, False (the default) otherwise
prop num_cols
-
Returns
Number of columns in the data segment
prop num_rows
-
Returns
Number of rows present in the data segment
prop segment_type
-
Returns
The segment type of this data segment: one of Train | Test | Validate | Unknown
prop target
-
Returns
A copy of the target column as a DataFrame
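A brief sketch of pulling features and target together for model training, assuming a dataset fetched with markov.dataset.get_by_id as in the normalized_df example below; the id is a placeholder.
>>> import markov
>>> dataset = markov.dataset.get_by_id("id")   # placeholder dataset id
>>> X = dataset.train.features                 # copy of the feature columns
>>> y = dataset.train.target                   # copy of the target column
>>> X.shape, y.shape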
Methods
def as_df(self, force_download=False) ‑> pandas.core.frame.DataFrame
-
Returns the segment as a DataFrame. The data is downloaded the first time this is called and cached for reuse within the session.
Args
force_download
:bool
- By default, MarkovML downloads the dataset only once to make loading faster. Set force_download to True to re-download the dataset from storage.
Returns
DataFrame containing the data for this DataSegment
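A minimal usage sketch, assuming dataset is a registered dataset fetched as in the normalized_df example below:
>>> df = dataset.train.as_df()                     # downloaded once, then cached for the session
>>> df = dataset.train.as_df(force_download=True)  # force a fresh download from storage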
def download_as_csv(self, filepath: str = None) ‑> None
-
Download the dataset segment as a CSV file at the given filepath. If filepath is not provided, it is downloaded to the current working directory with the same name as the uploaded file. For example:
>>> dataset.train.download_as_csv(filepath="train.csv")
>>> dataset.test.download_as_csv(filepath="test.csv")
>>> dataset.test.download_as_csv()  # will download with the same filename as the uploaded file
def normalized_df(self, filters: List, replace_symbol: str = ' ')
-
Normalize the dataframe with different filters such as stop words, emojis, and URLs.
Returns
Normalized dataframe
>>> import markov
>>> dataset = markov.dataset.get_by_id("id")
>>> dataset.train.normalized_df(filters=["url", "email"])
<dataframe>
def sample(self, n: int, strategy='uniform') ‑> DataSegment
-
Sample the dataframe applying a specific strategy
Args
n
:int
- number of points to sample
strategy
:str
- sampling strategy applied to get the sample. Supported strategies are Stratified, Cochens, Weighted, and Uniform
Returns
The sampled DataSegment
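A minimal sketch of sampling a segment, assuming dataset is fetched as in the examples above; the sample size is arbitrary and the lowercase "uniform" string mirrors the default in the signature, so other strategy spellings may differ.
>>> sampled = dataset.train.sample(n=1000, strategy="uniform")  # returns a new DataSegment
>>> sampled.as_df()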
def save(self, file_path)
-
Save this DataSegment at the user provided path.
Args
file_path
- Full path, including the file name, on your local system where the DataSegment object will be saved.
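A short sketch pairing save with from_pickle_file; the local path is a placeholder.
>>> from markov.api.data.data_segment import DataSegment
>>> dataset.train.save("train_segment.pkl")                      # placeholder local path
>>> segment = DataSegment.from_pickle_file("train_segment.pkl")  # load it back later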
class DataSegmentPath (segment_type: str = 'unknown', path: Optional[str] = '', multi_file: bool = False, num_records: int = 0, num_cols: int = 0, data_source: Union[pd.DataFrame, str] = None, data_source_type: DataSourceType = None, data_source_format: DataSourceFormat = None)
-
Path of this DataSegment
Class variables
var data_source : Union[pandas.core.frame.DataFrame, str]
-
Data source to register: a DataFrame or a path string
var data_source_format : DataSourceFormat
-
Datasource format
var data_source_type : DataSourceType
-
Datasource type
var multi_file : bool
-
Whether this path points to a folder with multiple files
var num_cols : int
-
Number of columns in this data segment
var num_records : int
-
Number of records in this data segment
var path : Optional[str]
-
Actual path / URI to a data_set resource on external storage
var segment_type : str
-
Segment type of this data segment (train/test/validate/unknown)
Static methods
def from_dict(kvs: Union[dict, list, str, int, float, bool, ForwardRef(None)], *, infer_missing=False) ‑> ~A
def from_json(s: Union[str, bytes, bytearray], *, parse_float=None, parse_int=None, parse_constant=None, infer_missing=False, **kw) ‑> ~A
def schema(*, infer_missing: bool = False, only=None, exclude=(), many: bool = False, context=None, load_only=(), dump_only=(), partial: bool = False, unknown=None) ‑> dataclasses_json.mm.SchemaF[~A]
Methods
def clone(self, ds_path: DataSegmentPath)
def get_dict(self) ‑> Dict
def get_json(self) ‑> str
def to_dict(self, encode_json=False) ‑> Dict[str, Union[dict, list, str, int, float, bool, ForwardRef(None)]]
def to_json(self, *, skipkeys: bool = False, ensure_ascii: bool = True, check_circular: bool = True, allow_nan: bool = True, indent: Union[int, str, ForwardRef(None)] = None, separators: Optional[Tuple[str, str]] = None, default: Optional[Callable] = None, sort_keys: bool = False, **kw) ‑> str
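An illustrative construction of a DataSegmentPath, assuming lowercase segment type strings (the constructor default is 'unknown') and a hypothetical storage URI:
>>> from markov.api.data.data_segment import DataSegmentPath
>>> seg_path = DataSegmentPath(
...     segment_type="train",                   # assumed lowercase, mirroring the 'unknown' default
...     path="s3://bucket/datasets/train.csv",  # hypothetical storage URI
...     num_records=1000,
...     num_cols=4,
... )
>>> seg_path.get_dict()                         # plain-dict view of this path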
class DataSegmentTuple (ds_id: str, segment_types: List)
-
Lightweight reference pairing a dataset id (ds_id) with the list of segment types that belong to it.
Class variables
var ds_id : str
var segment_types : List
Static methods
def from_dict(kvs: Union[dict, list, str, int, float, bool, ForwardRef(None)], *, infer_missing=False) ‑> ~A
def from_json(s: Union[str, bytes, bytearray], *, parse_float=None, parse_int=None, parse_constant=None, infer_missing=False, **kw) ‑> ~A
def schema(*, infer_missing: bool = False, only=None, exclude=(), many: bool = False, context=None, load_only=(), dump_only=(), partial: bool = False, unknown=None) ‑> dataclasses_json.mm.SchemaF[~A]
Methods
def get_dict(self) ‑> Dict
def get_json(self) ‑> str
def to_dict(self, encode_json=False) ‑> Dict[str, Union[dict, list, str, int, float, bool, ForwardRef(None)]]
def to_json(self, *, skipkeys: bool = False, ensure_ascii: bool = True, check_circular: bool = True, allow_nan: bool = True, indent: Union[int, str, ForwardRef(None)] = None, separators: Optional[Tuple[str, str]] = None, default: Optional[Callable] = None, sort_keys: bool = False, **kw) ‑> str
class DownloadDatasetInfo (url: str, storage_format: StorageFormatType, storage_type: StorageType)
-
Signed URL to download the dataset
Class variables
var storage_format : StorageFormatType
var storage_type : StorageType
var url : str
Static methods
def from_dict(kvs: Union[dict, list, str, int, float, bool, ForwardRef(None)], *, infer_missing=False) ‑> ~A
def from_json(s: Union[str, bytes, bytearray], *, parse_float=None, parse_int=None, parse_constant=None, infer_missing=False, **kw) ‑> ~A
def schema(*, infer_missing: bool = False, only=None, exclude=(), many: bool = False, context=None, load_only=(), dump_only=(), partial: bool = False, unknown=None) ‑> dataclasses_json.mm.SchemaF[~A]
Methods
def to_dict(self, encode_json=False) ‑> Dict[str, Union[dict, list, str, int, float, bool, ForwardRef(None)]]
def to_json(self, *, skipkeys: bool = False, ensure_ascii: bool = True, check_circular: bool = True, allow_nan: bool = True, indent: Union[int, str, ForwardRef(None)] = None, separators: Optional[Tuple[str, str]] = None, default: Optional[Callable] = None, sort_keys: bool = False, **kw) ‑> str