Module markov.api.data

Sub-modules

markov.api.data.analysis
markov.api.data.catalog_handler
markov.api.data.data_family
markov.api.data.data_segment
markov.api.data.data_set
markov.api.data.embedding
markov.api.data.run_tracker
markov.api.data.storage_uploader

Functions

def create_datafamily(name: str, notes: str, **kwargs) ‑> DataFamily

Create a DataFamily object in memory. This is not registered with MarkovML; if you want to register it, you'll need to explicitly call register().

Args

name : str
name of the datafamily.
notes : str
text details about this data family that you want to persist
**kwargs
variable key-value args (key: value pairs that you want to store alongside the datafamily)

Returns

DataFamily object
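A minimal usage sketch. The import path and the `owner` metadata key are assumptions for illustration; the import guard keeps the example self-contained when the MarkovML SDK is not installed:

```python
# Plain key-value arguments matching the signature above; no SDK needed to build them.
family_args = {
    "name": "customer-reviews",
    "notes": "Datasets that share the customer review schema",
    "owner": "data-team",  # example **kwargs entry persisted with the datafamily
}

try:
    from markov.api.data import create_datafamily  # assumed import path
    family = create_datafamily(**family_args)
    family.register()  # registration is explicit, as noted above
except ImportError:
    family = None  # MarkovML SDK not installed; family_args shows the expected shape
```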

def create_dataset(name: str, notes: str, data_category: str, delimiter: str, datafamily_id: str, storage_type: str, data_segment_path: List[DataSegmentPath], storage_format: str = StorageFormatType.CSV, x_indexes: List[int] = None, y_index: int = -1, x_col_names: List[str] = None, y_name: str = '', info: dict = None) ‑> DataSet

Create a DataSet object in memory. To register it with MarkovML, you'll need to explicitly call register().

Args

name : str
unique name of the dataset.
notes : str
Any free form text detail you want to store along with this dataset in MarkovML backend
data_category : str
one of:
"Text": text data, usually used for NLP
"Numerical": each feature is a numerical value
"Categorical": each feature value is categorical
"OneHot": one-hot encoding representation
"TimeSeries": this dataset represents time series data
"Mixed": this dataset contains both numerical and categorical values
delimiter : str
delimiter which separates one feature from another
datafamily_id : str
datafamily id this dataset belongs to. DataFamilyId represents the collection of DataSets
that have similar logical characteristics/schema/properties. Check DataFamily
storage_type : str
Type of storage the actual data is located on. Currently, only S3 storage is supported
data_segment_path : List[DataSegmentPath]
valid absolute paths of the data segments on storage.
DataSegments are Train/Test/Validate/Unknown
storage_format : str
Type of data format, supported formats are: "CSV", "TSV", "YOUR_CUSTOM_DELIMITER"
x_indexes : list[int]
list of indexes of feature columns. If you have named columns, please see x_col_names
y_index : int
index of the target column. If you have named columns, please see y_name
x_col_names : list[str]
name of the columns that form the feature vector
y_name : str
name of the target column
info : dict
any key-value information that you want to persist alongside the dataset.

Returns

DataSet object
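A sketch of assembling a DataSet in memory under stated assumptions: the import paths, the `DataSegmentPath` constructor arguments, and the ids/S3 paths are illustrative, not verified against the SDK:

```python
# Keyword arguments matching the documented signature; plain data, no SDK needed.
dataset_args = {
    "name": "reviews-v1",
    "notes": "Labelled customer reviews",
    "data_category": "Text",
    "delimiter": ",",
    "datafamily_id": "df-123",   # id of an existing, registered DataFamily
    "storage_type": "s3",        # S3 is currently the only supported storage
    "x_indexes": [0],            # feature column index
    "y_index": 1,                # target column index
}

try:
    from markov.api.data import create_dataset
    from markov.api.data.data_segment import DataSegmentPath  # assumed import path

    # Segment types mirror the Train/Test/Validate/Unknown split described above;
    # the constructor arguments here are hypothetical.
    segments = [
        DataSegmentPath(segment_type="Train", path="s3://my-bucket/reviews/train.csv"),
        DataSegmentPath(segment_type="Test", path="s3://my-bucket/reviews/test.csv"),
    ]
    dataset = create_dataset(data_segment_path=segments, **dataset_args)
    dataset.register()  # registration is explicit, as noted above
except ImportError:
    dataset = None  # MarkovML SDK not installed; dataset_args shows the expected shape
```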
def register_datafamily(name: str, notes: str, **kwargs) ‑> DataFamilyRegistrationResponse

Register the datafamily with MarkovML Backend

Args

name : str
name of the datafamily.
notes : str
text details about this data family that you want to persist

**kwargs
variable key-value args (key: value pairs that you want to store alongside the datafamily)

Returns

DataFamilyRegistrationResponse
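A hedged usage sketch; the import path and the `team` metadata key are assumptions, and the response fields are SDK-defined:

```python
registration_args = {
    "name": "customer-reviews",
    "notes": "Datasets that share the customer review schema",
    "team": "data-platform",  # example **kwargs entry stored with the datafamily
}

try:
    from markov.api.data import register_datafamily  # assumed import path
    response = register_datafamily(**registration_args)
    # response is a DataFamilyRegistrationResponse
except ImportError:
    response = None  # MarkovML SDK not installed; registration_args shows the expected shape
```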

def register_dataset(name: str, notes: str, data_category: str, delimiter: str, datafamily_id: str, storage_type: str, data_segment_path: List[DataSegmentPath], credentials: Union[str, GenericCredential], storage_format: str = StorageFormatType.CSV, x_indexes: List[int] = None, y_index: int = -1, x_col_names: List[str] = None, y_name: str = '', info: dict = None, source: str = '', should_analyze=False) ‑> DataSetRegistrationResponse

Register the dataset with MarkovML backend. Dataset has two components, the DataSetProperties, which define characteristics of the dataset, and List[DataSegmentPath] which contains the location of this specific dataset on a cloud storage.

Args

should_analyze : bool
set to True if you want the dataset analyzed during registration
name : str
unique name of the dataset.
notes : str
Any free form text detail you want to store along with this dataset in MarkovML backend
data_category : str
one of:
"Text": text data, usually used for NLP
"Numerical": each feature is a numerical value
"Categorical": each feature value is categorical
"OneHot": one-hot encoding representation
"TimeSeries": this dataset represents time series data
"Mixed": this dataset contains both numerical and categorical values
delimiter : str
delimiter which separates one feature from another
datafamily_id : str
datafamily id this dataset belongs to. DataFamilyId represents the collection of DataSets
that have similar logical characteristics/schema/properties. Check DataFamily
storage_type : str
Type of storage the actual data is located on. Currently, only S3 storage is supported
data_segment_path : List[DataSegmentPath]
valid absolute paths of the data segments on storage.
DataSegments are Train/Test/Validate/Unknown
credentials : Union[str, GenericCredential]
If you are reusing a credential registered with Markov, use its credential_id.
If you have not registered your credentials with Markov, pass the Credential object
storage_format : str
Type of data format, supported formats are: "CSV", "TSV", "YOUR_CUSTOM_DELIMITER"
x_indexes : list[int]
list of indexes of feature columns. If you have named columns, please see x_col_names
y_index : int
index of the target column. If you have named columns, please see y_name
x_col_names : list[str]
name of the columns that form the feature vector
y_name : str
name of the target column
source : str
source of the dataset. e.g. Kaggle, 3P Dataset, etc.
info : dict
any key-value information that you want to persist alongside the dataset.

Returns

DataSetRegistrationResponse
Registration response

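A hedged registration sketch. The import paths, the `DataSegmentPath` constructor arguments, and all ids and S3 paths are assumptions for illustration; `credentials` uses the string `credential_id` form described above:

```python
register_args = {
    "name": "reviews-v1",
    "notes": "Labelled customer reviews",
    "data_category": "Text",
    "delimiter": ",",
    "datafamily_id": "df-123",
    "storage_type": "s3",
    "credentials": "cred-456",  # a previously registered credential_id (hypothetical)
    "source": "Kaggle",
    "should_analyze": True,     # analyze the dataset during registration
}

try:
    from markov.api.data import register_dataset
    from markov.api.data.data_segment import DataSegmentPath  # assumed import path

    # Constructor arguments are hypothetical; segment types follow the split above.
    segments = [
        DataSegmentPath(segment_type="Train", path="s3://my-bucket/reviews/train.csv"),
    ]
    response = register_dataset(data_segment_path=segments, **register_args)
    # response is a DataSetRegistrationResponse
except ImportError:
    response = None  # MarkovML SDK not installed; register_args shows the expected shape
```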
def register_embedding(name: str, notes: str, delimiter: str, ds_id: str, storage_type: str, data_segment_path: List[DataSegmentPath], credentials: Union[str, GenericCredential], storage_format: str = StorageFormatType.CSV, x_indexes: List[int] = None, embedding_index: int = -1, x_col_names: List[str] = None, embedding_name: str = '', info: dict = None, source: str = '', should_analyze=False) ‑> EmbeddingRegistrationResponse

Register the Embedding with MarkovML backend. Embedding has two components, the EmbeddingProperties, which define characteristics of the embedding, and List[DataSegmentPath] which contains the location of this specific embedding on a cloud storage.

Args

name : str
Unique name for the embedding
notes : str
Any free form text detail you want to store along with this embedding in MarkovML backend
delimiter : str
delimiter which separates one feature from another
ds_id : str
Dataset id this embedding is mapped to
storage_type : str
Type of storage the actual data is located on. Currently, only S3 storage is supported
data_segment_path : List[DataSegmentPath]
valid absolute paths of the data segments on storage.
credentials : Union[str, GenericCredential]
If you are reusing a credential registered with Markov, use its credential_id.
If you have not registered your credentials with Markov, pass the Credential object
storage_format : str
Type of data format, supported formats are: "CSV", "TSV", "YOUR_CUSTOM_DELIMITER"
x_indexes : list[int]
list of indexes of feature columns. If you have named columns, please see x_col_names
embedding_index : int
index of the embedding column. If you have named columns, please see embedding_name
x_col_names : list[str]
names of the columns that form the feature vector
embedding_name : str
Name of the embedding column. The format expected is a list of values corresponding to each embedding
info : dict
Any key-value information that you want to persist alongside the embedding.
source : str
Source of the embedding.
should_analyze : bool
Set to True if you want the embedding analyzed during registration

Returns

EmbeddingRegistrationResponse
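A hedged sketch of registering an embedding mapped to an existing dataset. As above, the import paths, the `DataSegmentPath` constructor arguments, and all ids and S3 paths are illustrative assumptions:

```python
embedding_args = {
    "name": "reviews-embeddings-v1",
    "notes": "Sentence embeddings for the reviews dataset",
    "delimiter": ",",
    "ds_id": "ds-789",          # dataset the embedding is mapped to (hypothetical id)
    "storage_type": "s3",
    "credentials": "cred-456",  # a previously registered credential_id (hypothetical)
    "embedding_index": 1,       # column holding the list of embedding values
}

try:
    from markov.api.data import register_embedding
    from markov.api.data.data_segment import DataSegmentPath  # assumed import path

    segments = [
        DataSegmentPath(segment_type="Train", path="s3://my-bucket/reviews/embeddings.csv"),
    ]
    response = register_embedding(data_segment_path=segments, **embedding_args)
    # response is an EmbeddingRegistrationResponse
except ImportError:
    response = None  # MarkovML SDK not installed; embedding_args shows the expected shape
```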