Module markov.api.data
Sub-modules
markov.api.data.analysis
markov.api.data.catalog_handler
markov.api.data.data_family
markov.api.data.data_segment
markov.api.data.data_set
markov.api.data.embedding
markov.api.data.run_tracker
markov.api.data.storage_uploader
Functions
def create_datafamily(name: str, notes: str, **kwargs) ‑> DataFamily-
Create a DataFamily object in memory. This is not registered with MarkovML; if you want to register it, you'll need to call register() explicitly.
Args
name:str- name of the datafamily.
notes:str- text details about this data family that you want to persist
**kwargs- variable key-value args ({key: value}) that you want to store alongside the datafamily
Returns
DataFamily object
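A minimal usage sketch of the two-step flow described above (create in memory, then register explicitly). The import path, family name, and extra metadata are assumptions for illustration, not verified against a specific SDK version:

```python
# Sketch: create a DataFamily in memory, then register it explicitly.
# The import is guarded so the sketch stays inspectable without the SDK.
try:
    from markov.api.data import create_datafamily  # assumed import path
except ImportError:
    create_datafamily = None  # SDK not available; illustration only

# Extra key/value metadata stored alongside the datafamily via **kwargs.
extra_metadata = {"owner": "data-eng", "domain": "churn"}

if create_datafamily is not None:
    family = create_datafamily(
        name="customer-churn-families",
        notes="Datasets that share the churn prediction schema",
        **extra_metadata,
    )
    family.register()  # creation alone does not register with MarkovML
```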
def create_dataset(name: str,
notes: str,
data_category: str,
delimiter: str,
datafamily_id: str,
storage_type: str,
data_segment_path: List[DataSegmentPath],
storage_format: str = StorageFormatType.CSV,
x_indexes: List[int] = None,
y_index: int = -1,
x_col_names: List[str] = None,
y_name: str = '',
info: dict = None) ‑> DataSet-
Create DataSet object in memory. To register it with MarkovML you'll need to explicitly call register().
Args
name:str- unique name of the dataset.
notes:str- any free-form text detail you want to store along with this dataset in the MarkovML backend
data_category:str- one of:
"Text": text data, usually used for NLP
"Numerical": each feature is a numerical value
"Categorical": each feature value is categorical
"OneHot": a one-hot-encoded representation
"TimeSeries": the dataset represents time-series data
"Mixed": the dataset contains both numerical and categorical values
delimiter:str- delimiter that separates one feature from another
datafamily_id:str- id of the datafamily this dataset belongs to. A DataFamily represents a collection of DataSets that share similar logical characteristics/schema/properties. See DataFamily
storage_type:str- the type of storage the actual data is located on. Currently, only S3 storage is supported
data_segment_path:List[DataSegmentPath]- valid absolute paths of the data segments on storage. DataSegments are Train/Test/Validate/Unknown
storage_format:str- type of data format; supported formats are "CSV", "TSV", "YOUR_CUSTOM_DELIMITER"
x_indexes:List[int]- indexes of the feature columns. If you have named columns, see x_col_names
y_index:int- index of the target column. If you have named columns, see y_name
x_col_names:List[str]- names of the columns that form the feature vector
y_name:str- name of the target column
info:dict- any key-value information that you want to persist alongside the dataset.
Returns
DataSet object
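A hedged sketch of assembling a DataSet in memory from the arguments above. Import paths, the DataSegmentPath constructor, and all ids/names are assumptions from this page's signatures:

```python
# Sketch: build a DataSet in memory from S3-hosted segments.
# Guarded import so the sketch runs even without the markov SDK installed.
try:
    from markov.api.data import create_dataset
    from markov.api.data.data_segment import DataSegmentPath  # assumed
except ImportError:
    create_dataset = DataSegmentPath = None

dataset_args = dict(
    name="churn-2023-q4",
    notes="Quarterly churn snapshot",
    data_category="Mixed",      # numerical + categorical features
    delimiter=",",
    datafamily_id="df-1234",    # hypothetical id of an existing family
    storage_type="S3",          # only S3 is currently supported
    storage_format="CSV",
    x_indexes=[0, 1, 2, 3],     # positional feature columns
    y_index=4,                  # positional target column
)

if create_dataset is not None:
    # One DataSegmentPath per Train/Test/Validate segment; the
    # constructor's exact arguments depend on your SDK version.
    segments = [DataSegmentPath(...)]
    ds = create_dataset(data_segment_path=segments, **dataset_args)
    ds.register()  # registration is a separate, explicit step
```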
def register_datafamily(name: str, notes: str, **kwargs) ‑> DataFamilyRegistrationResponse-
Register the datafamily with MarkovML Backend
Args
name:str- name of the datafamily.
notes:str- text details about this data family that you want to persist
**kwargs- variable key-value args ({key: value}) that you want to store alongside the datafamily
Returns
DataFamilyRegistrationResponse
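Unlike the create-then-register flow, this call registers directly and returns the response object. A short sketch, with an assumed import path and placeholder values:

```python
# Sketch: register a datafamily in one call (no intermediate object).
try:
    from markov.api.data import register_datafamily  # assumed import path
except ImportError:
    register_datafamily = None  # SDK not available; illustration only

family_kwargs = {"owner": "data-eng"}  # extra **kwargs metadata

if register_datafamily is not None:
    response = register_datafamily(
        name="customer-churn-families",
        notes="Datasets that share the churn prediction schema",
        **family_kwargs,
    )  # returns a DataFamilyRegistrationResponse
```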
def register_dataset(name: str,
notes: str,
data_category: str,
delimiter: str,
datafamily_id: str,
storage_type: str,
data_segment_path: List[DataSegmentPath],
credentials: str | GenericCredential,
storage_format: str = StorageFormatType.CSV,
x_indexes: List[int] = None,
y_index: int = -1,
x_col_names: List[str] = None,
y_name: str = '',
info: dict = None,
source: str = '',
should_analyze=False) ‑> DataSetRegistrationResponse-
Register the dataset with the MarkovML backend. A Dataset has two components: DataSetProperties, which defines the characteristics of the dataset, and List[DataSegmentPath], which contains the locations of this specific dataset on cloud storage.
Args
should_analyze:bool- set to True if you want the dataset analyzed while registering
name:str- unique name of the dataset.
notes:str- any free-form text detail you want to store along with this dataset in the MarkovML backend
data_category:str- one of:
"Text": text data, usually used for NLP
"Numerical": each feature is a numerical value
"Categorical": each feature value is categorical
"OneHot": a one-hot-encoded representation
"TimeSeries": the dataset represents time-series data
"Mixed": the dataset contains both numerical and categorical values
delimiter:str- delimiter that separates one feature from another
datafamily_id:str- id of the datafamily this dataset belongs to. A DataFamily represents a collection of DataSets that share similar logical characteristics/schema/properties. See DataFamily
storage_type:str- the type of storage the actual data is located on. Currently, only S3 storage is supported
data_segment_path:List[DataSegmentPath]- valid absolute paths of the data segments on storage. DataSegments are Train/Test/Validate/Unknown
credentials:Union[str, GenericCredential]- if you are reusing a credential registered with Markov, use its credential_id. If you have not registered your credentials with Markov, pass the Credential object
storage_format:str- type of data format; supported formats are "CSV", "TSV", "YOUR_CUSTOM_DELIMITER"
x_indexes:List[int]- indexes of the feature columns. If you have named columns, see x_col_names
y_index:int- index of the target column. If you have named columns, see y_name
x_col_names:List[str]- names of the columns that form the feature vector
y_name:str- name of the target column
source:str- source of the dataset. e.g. Kaggle, 3P Dataset, etc.
info:dict- any key-value information that you want to persist alongside the dataset.
Returns
DataSetRegistrationResponse- Registration response
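A hedged sketch of one-shot registration, reusing a stored credential by its credential_id as described above. All ids, column names, and the DataSegmentPath constructor are placeholders:

```python
# Sketch: register a dataset directly, reusing a registered credential.
try:
    from markov.api.data import register_dataset
    from markov.api.data.data_segment import DataSegmentPath  # assumed
except ImportError:
    register_dataset = DataSegmentPath = None

registration_args = dict(
    name="churn-2023-q4",
    notes="Quarterly churn snapshot",
    data_category="Mixed",
    delimiter=",",
    datafamily_id="df-1234",        # hypothetical existing family id
    storage_type="S3",              # only S3 is currently supported
    credentials="cred-5678",        # credential_id of a stored credential,
                                    # or pass a GenericCredential object
    x_col_names=["tenure", "plan", "region", "spend"],
    y_name="churned",
    source="internal-warehouse",
    should_analyze=True,            # analyze during registration
)

if register_dataset is not None:
    segments = [DataSegmentPath(...)]  # Train/Test/Validate locations
    response = register_dataset(data_segment_path=segments,
                                **registration_args)
    # response is a DataSetRegistrationResponse
```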
def register_embedding(name: str,
notes: str,
delimiter: str,
ds_id: str,
storage_type: str,
data_segment_path: List[DataSegmentPath],
credentials: str | GenericCredential,
storage_format: str = StorageFormatType.CSV,
x_indexes: List[int] = None,
embedding_index: int = -1,
x_col_names: List[str] = None,
embedding_name: str = '',
info: dict = None,
source: str = '',
should_analyze=False) ‑> EmbeddingRegistrationResponse-
Register the Embedding with the MarkovML backend. An Embedding has two components: EmbeddingProperties, which defines the characteristics of the embedding, and List[DataSegmentPath], which contains the locations of this specific embedding on cloud storage.
Args
name- unique name for the embedding
notes- any free-form text detail you want to store along with this embedding in the MarkovML backend
delimiter- delimiter that separates one feature from another
ds_id- id of the dataset this embedding is mapped to
storage_type- the type of storage the actual data is located on. Currently, only S3 storage is supported
data_segment_path- valid absolute paths of the data segments on storage
credentials- if you are reusing a credential registered with Markov, use its credential_id. If you have not registered your credentials with Markov, pass the Credential object
storage_format- type of data format; supported formats are "CSV", "TSV", "YOUR_CUSTOM_DELIMITER"
x_indexes- indexes of the feature columns. If you have named columns, see x_col_names
embedding_index- index of the embedding column. If you have named columns, see embedding_name
x_col_names- names of the columns that form the feature vector
embedding_name- name of the embedding column. The expected format is a list of values corresponding to each embedding
info- any key-value information that you want to persist alongside the embedding
source- source of the embedding
should_analyze- set to True if you want the embedding analyzed while registering
Returns
EmbeddingRegistrationResponse
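A hedged sketch of registering an embedding mapped to an existing dataset via ds_id. Import paths, ids, and column names are illustrative assumptions:

```python
# Sketch: register an embedding tied to a previously registered dataset.
try:
    from markov.api.data import register_embedding
    from markov.api.data.data_segment import DataSegmentPath  # assumed
except ImportError:
    register_embedding = DataSegmentPath = None

embedding_args = dict(
    name="churn-2023-q4-bert",
    notes="Sentence embeddings of support tickets",
    delimiter=",",
    ds_id="ds-9012",             # hypothetical id of the mapped dataset
    storage_type="S3",
    x_col_names=["ticket_id"],
    embedding_name="embedding",  # column holding a list of values per row
)

if register_embedding is not None:
    segments = [DataSegmentPath(...)]  # embedding file locations
    response = register_embedding(
        data_segment_path=segments,
        credentials="cred-5678",  # or a GenericCredential object
        **embedding_args,
    )  # returns an EmbeddingRegistrationResponse
```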