Module markov.api.data
Sub-modules
markov.api.data.analysis
markov.api.data.catalog_handler
markov.api.data.data_family
markov.api.data.data_segment
markov.api.data.data_set
markov.api.data.embedding
markov.api.data.run_tracker
markov.api.data.storage_uploader
Functions
def create_datafamily(name: str, notes: str, **kwargs) ‑> DataFamily-
Create a DataFamily object in memory. This is not registered with MarkovML; if you want to register it, you'll need to call register() explicitly.
Args
name:str- name of the datafamily.
notes:str- text details about this data family that you want to persist
**kwargs- variable key-value args ({key: value}) that you want to store alongside the datafamily
Returns
DataFamily object
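A minimal usage sketch of the two-step flow described above (create in memory, then register explicitly). The import path, family name, and extra metadata are assumptions for illustration, not verified against a specific SDK version:

```python
# Sketch: create a DataFamily in memory, then register it explicitly.
# The import is guarded so the sketch stays inspectable without the SDK.
try:
    from markov.api.data import create_datafamily  # assumed import path
except ImportError:
    create_datafamily = None  # SDK not available; illustration only

# Extra key/value metadata stored alongside the datafamily via **kwargs.
extra_metadata = {"owner": "data-eng", "domain": "churn"}

if create_datafamily is not None:
    family = create_datafamily(
        name="customer-churn-families",
        notes="Datasets that share the churn prediction schema",
        **extra_metadata,
    )
    family.register()  # creation alone does not register with MarkovML
```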
def create_dataset(name: str,
notes: str,
data_category: str,
delimiter: str,
datafamily_id: str,
storage_type: str,
data_segment_path: List[DataSegmentPath],
storage_format: str = StorageFormatType.CSV,
x_indexes: List[int] = None,
y_index: int = -1,
x_col_names: List[str] = None,
y_name: str = '',
info: dict = None) ‑> DataSet-
Create DataSet object in memory. To register it with MarkovML you'll need to explicitly call register().
Args
name:str- unique name of the dataset.
notes:str- any free-form text detail you want to store along with this dataset in the MarkovML backend
data_category:str- one of:
"Text": text data, usually used for NLP
"Numerical": each feature is a numerical value
"Categorical": each feature value is categorical
"OneHot": a one-hot-encoded representation
"TimeSeries": the dataset represents time-series data
"Mixed": the dataset contains both numerical and categorical values
delimiter:str- delimiter that separates one feature from another
datafamily_id:str- id of the datafamily this dataset belongs to. A DataFamily represents a collection of DataSets that share similar logical characteristics/schema/properties. See DataFamily
storage_type:str- the type of storage the actual data is located on. Currently, only S3 storage is supported
data_segment_path:List[DataSegmentPath]- valid absolute paths of the data segments on storage. DataSegments are Train/Test/Validate/Unknown
storage_format:str- type of data format; supported formats are "CSV", "TSV", "YOUR_CUSTOM_DELIMITER"
x_indexes:List[int]- indexes of the feature columns. If you have named columns, see x_col_names
y_index:int- index of the target column. If you have named columns, see y_name
x_col_names:List[str]- names of the columns that form the feature vector
y_name:str- name of the target column
info:dict- any key-value information that you want to persist alongside the dataset.
Returns
DataSet object
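A hedged sketch of assembling a DataSet in memory from the arguments above. Import paths, the DataSegmentPath constructor, and all ids/names are assumptions from this page's signatures:

```python
# Sketch: build a DataSet in memory from S3-hosted segments.
# Guarded import so the sketch runs even without the markov SDK installed.
try:
    from markov.api.data import create_dataset
    from markov.api.data.data_segment import DataSegmentPath  # assumed
except ImportError:
    create_dataset = DataSegmentPath = None

dataset_args = dict(
    name="churn-2023-q4",
    notes="Quarterly churn snapshot",
    data_category="Mixed",      # numerical + categorical features
    delimiter=",",
    datafamily_id="df-1234",    # hypothetical id of an existing family
    storage_type="S3",          # only S3 is currently supported
    storage_format="CSV",
    x_indexes=[0, 1, 2, 3],     # positional feature columns
    y_index=4,                  # positional target column
)

if create_dataset is not None:
    # One DataSegmentPath per Train/Test/Validate segment; the
    # constructor's exact arguments depend on your SDK version.
    segments = [DataSegmentPath(...)]
    ds = create_dataset(data_segment_path=segments, **dataset_args)
    ds.register()  # registration is a separate, explicit step
```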
def register_datafamily(name: str, notes: str, **kwargs) ‑> DataFamilyRegistrationResponse-
Register the datafamily with MarkovML Backend
Args
name:str- name of the datafamily.
notes:str- text details about this data family that you want to persist
**kwargs- variable key-value args ({key: value}) that you want to store alongside the datafamily
Returns
DataFamilyRegistrationResponse
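Unlike the create-then-register flow, this call registers directly and returns the response object. A short sketch, with an assumed import path and placeholder values:

```python
# Sketch: register a datafamily in one call (no intermediate object).
try:
    from markov.api.data import register_datafamily  # assumed import path
except ImportError:
    register_datafamily = None  # SDK not available; illustration only

family_kwargs = {"owner": "data-eng"}  # extra **kwargs metadata

if register_datafamily is not None:
    response = register_datafamily(
        name="customer-churn-families",
        notes="Datasets that share the churn prediction schema",
        **family_kwargs,
    )  # returns a DataFamilyRegistrationResponse
```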
def register_dataset(name: str,
notes: str,
data_category: str,
delimiter: str,
datafamily_id: str,
storage_type: str,
data_segment_path: List[DataSegmentPath],
credentials: str | GenericCredential,
storage_format: str = StorageFormatType.CSV,
x_indexes: List[int] = None,
y_index: int = -1,
x_col_names: List[str] = None,
y_name: str = '',
info: dict = None,
source: str = '',
should_analyze=False) ‑> DataSetRegistrationResponse-
Register the dataset with the MarkovML backend. A Dataset has two components: DataSetProperties, which defines the characteristics of the dataset, and List[DataSegmentPath], which contains the locations of this specific dataset on cloud storage.
Args
should_analyze:bool- set to True if you want the dataset analyzed while registering
name:str- unique name of the dataset.
notes:str- any free-form text detail you want to store along with this dataset in the MarkovML backend
data_category:str- one of:
"Text": text data, usually used for NLP
"Numerical": each feature is a numerical value
"Categorical": each feature value is categorical
"OneHot": a one-hot-encoded representation
"TimeSeries": the dataset represents time-series data
"Mixed": the dataset contains both numerical and categorical values
delimiter:str- delimiter that separates one feature from another
datafamily_id:str- id of the datafamily this dataset belongs to. A DataFamily represents a collection of DataSets that share similar logical characteristics/schema/properties. See DataFamily
storage_type:str- the type of storage the actual data is located on. Currently, only S3 storage is supported
data_segment_path:List[DataSegmentPath]- valid absolute paths of the data segments on storage. DataSegments are Train/Test/Validate/Unknown
credentials:Union[str, GenericCredential]- if you are reusing a credential registered with Markov, use its credential_id. If you have not registered your credentials with Markov, pass the Credential object
storage_format:str- type of data format; supported formats are "CSV", "TSV", "YOUR_CUSTOM_DELIMITER"
x_indexes:List[int]- indexes of the feature columns. If you have named columns, see x_col_names
y_index:int- index of the target column. If you have named columns, see y_name
x_col_names:List[str]- names of the columns that form the feature vector
y_name:str- name of the target column
source:str- source of the dataset. e.g. Kaggle, 3P Dataset, etc.
info:dict- any key-value information that you want to persist alongside the dataset.
Returns
DataSetRegistrationResponse- Registration response
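A hedged sketch of one-shot registration, reusing a stored credential by its credential_id as described above. All ids, column names, and the DataSegmentPath constructor are placeholders:

```python
# Sketch: register a dataset directly, reusing a registered credential.
try:
    from markov.api.data import register_dataset
    from markov.api.data.data_segment import DataSegmentPath  # assumed
except ImportError:
    register_dataset = DataSegmentPath = None

registration_args = dict(
    name="churn-2023-q4",
    notes="Quarterly churn snapshot",
    data_category="Mixed",
    delimiter=",",
    datafamily_id="df-1234",        # hypothetical existing family id
    storage_type="S3",              # only S3 is currently supported
    credentials="cred-5678",        # credential_id of a stored credential,
                                    # or pass a GenericCredential object
    x_col_names=["tenure", "plan", "region", "spend"],
    y_name="churned",
    source="internal-warehouse",
    should_analyze=True,            # analyze during registration
)

if register_dataset is not None:
    segments = [DataSegmentPath(...)]  # Train/Test/Validate locations
    response = register_dataset(data_segment_path=segments,
                                **registration_args)
    # response is a DataSetRegistrationResponse
```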
def register_embedding(name: str,
notes: str,
delimiter: str,
ds_id: str,
storage_type: str,
data_segment_path: List[DataSegmentPath],
credentials: str | GenericCredential,
storage_format: str = StorageFormatType.CSV,
x_indexes: List[int] = None,
embedding_index: int = -1,
x_col_names: List[str] = None,
embedding_name: str = '',
info: dict = None,
source: str = '',
should_analyze=False) ‑> EmbeddingRegistrationResponse-
Register the Embedding with the MarkovML backend. An Embedding has two components: EmbeddingProperties, which defines the characteristics of the embedding, and List[DataSegmentPath], which contains the locations of this specific embedding on cloud storage.
Args
name- unique name for the embedding
notes- any free-form text detail you want to store along with this embedding in the MarkovML backend
delimiter- delimiter that separates one feature from another
ds_id- id of the dataset this embedding is mapped to
storage_type- the type of storage the actual data is located on. Currently, only S3 storage is supported
data_segment_path- valid absolute paths of the data segments on storage
credentials- if you are reusing a credential registered with Markov, use its credential_id. If you have not registered your credentials with Markov, pass the Credential object
storage_format- type of data format; supported formats are "CSV", "TSV", "YOUR_CUSTOM_DELIMITER"
x_indexes- indexes of the feature columns. If you have named columns, see x_col_names
embedding_index- index of the embedding column. If you have named columns, see embedding_name
x_col_names- names of the columns that form the feature vector
embedding_name- name of the embedding column. The expected format is a list of values corresponding to each embedding
info- any key-value information that you want to persist alongside the embedding
source- source of the embedding
should_analyze- set to True if you want the embedding analyzed while registering
Returns
EmbeddingRegistrationResponse
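A hedged sketch of registering an embedding mapped to an existing dataset via ds_id. Import paths, ids, and column names are illustrative assumptions:

```python
# Sketch: register an embedding tied to a previously registered dataset.
try:
    from markov.api.data import register_embedding
    from markov.api.data.data_segment import DataSegmentPath  # assumed
except ImportError:
    register_embedding = DataSegmentPath = None

embedding_args = dict(
    name="churn-2023-q4-bert",
    notes="Sentence embeddings of support tickets",
    delimiter=",",
    ds_id="ds-9012",             # hypothetical id of the mapped dataset
    storage_type="S3",
    x_col_names=["ticket_id"],
    embedding_name="embedding",  # column holding a list of values per row
)

if register_embedding is not None:
    segments = [DataSegmentPath(...)]  # embedding file locations
    response = register_embedding(
        data_segment_path=segments,
        credentials="cred-5678",  # or a GenericCredential object
        **embedding_args,
    )  # returns an EmbeddingRegistrationResponse
```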