Module markov.api.data
Sub-modules
markov.api.data.analysis
markov.api.data.catalog_handler
markov.api.data.data_family
markov.api.data.data_segment
markov.api.data.data_set
markov.api.data.embedding
markov.api.data.run_tracker
markov.api.data.storage_uploader
Functions
def create_datafamily(name: str, notes: str, **kwargs) ‑> DataFamily
-
Create a DataFamily object in memory. This is not registered with MarkovML. If you want to register it, you'll need to explicitly call register().
Args
name
:str
- name of the datafamily.
notes
:str
- text details about this data family that you want to persist
**kwargs
- variable key-value args ({key: value}) that you want to store alongside the datafamily
Returns
DataFamily object
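A minimal usage sketch (assuming create_datafamily is importable from markov.api.data as listed on this page; the argument values are illustrative):

    from markov.api.data import create_datafamily  # import path assumed from this page

    # Create a DataFamily in memory; this does not register it with MarkovML.
    family = create_datafamily(
        name="customer-reviews",
        notes="Datasets derived from customer review exports",
        owner="nlp-team",  # arbitrary extra key-value stored alongside the datafamily
    )

    # Registration is a separate, explicit step.
    family.register()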
def create_dataset(name: str, notes: str, data_category: str, delimiter: str, datafamily_id: str, storage_type: str, data_segment_path: List[DataSegmentPath], storage_format: str = StorageFormatType.CSV, x_indexes: List[int] = None, y_index: int = -1, x_col_names: List[str] = None, y_name: str = '', info: dict = None) ‑> DataSet
-
Create a DataSet object in memory. To register it with MarkovML, you'll need to explicitly call register().
Args
name
:str
- unique name of the dataset.
notes
:str
- Any free-form text you want to store along with this dataset in the MarkovML backend
data_category
:str
- one of:
  "Text": text data, usually used for NLP
  "Numerical": each feature is a numerical value
  "Categorical": each feature value is categorical
  "OneHot": one-hot encoding representation
  "TimeSeries": the dataset represents time-series data
  "Mixed": the dataset contains both numerical and categorical values
delimiter
:str
- delimiter which separates one feature from another
datafamily_id
:str
- ID of the datafamily this dataset belongs to. A DataFamily represents a collection of DataSets that share similar logical characteristics/schema/properties. See DataFamily.
storage_type
:str
- type of storage the actual data is located on. Currently, only S3 storage is supported.
data_segment_path
:List[DataSegmentPath]
- valid absolute paths of the data segments on storage.
- DataSegments are Train/Test/Validate/Unknown
storage_format
:str
- type of data format. Supported formats are: "CSV", "TSV", "YOUR_CUSTOM_DELIMITER"
x_indexes
:list[int]
- list of indexes of the feature columns. If you have named columns, please see x_col_names
y_index
:int
- index of the target column. If you have named columns, please see y_name
x_col_names
:list[str]
- names of the columns that form the feature vector
y_name
:str
- name of the target column
info
:dict
- any key-value information that you want to persist alongside the dataset.
Returns
DataSet object
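A sketch of building a DataSet in memory and registering it explicitly. The import paths, the DataSegmentPath constructor arguments, and the "S3" storage_type literal are assumptions; only the parameter names come from the signature above:

    from markov.api.data import create_dataset  # import path assumed from this page
    from markov.api.data.data_segment import DataSegmentPath  # import path assumed

    # DataSegmentPath constructor arguments below are illustrative assumptions.
    segments = [
        DataSegmentPath(segment_type="Train", path="s3://my-bucket/reviews/train.csv"),
        DataSegmentPath(segment_type="Test", path="s3://my-bucket/reviews/test.csv"),
    ]

    dataset = create_dataset(
        name="reviews-v1",
        notes="Labelled customer reviews",
        data_category="Text",
        delimiter=",",
        datafamily_id="<datafamily-id>",
        storage_type="S3",            # only S3 storage is currently supported
        data_segment_path=segments,
        storage_format="CSV",
        x_col_names=["review_text"],
        y_name="sentiment",
        info={"language": "en"},
    )

    # As with create_datafamily, registration is an explicit call.
    dataset.register()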
def register_datafamily(name: str, notes: str, **kwargs) ‑> DataFamilyRegistrationResponse
-
Register the datafamily with the MarkovML backend.
Args
name
:str
- name of the datafamily.
notes
:str
- text details about this data family that you want to persist
**kwargs
- variable key-value args ({key: value}) that you want to store alongside the datafamily
Returns
DataFamilyRegistrationResponse
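A minimal sketch of registering a datafamily directly (import path assumed; the extra keyword argument is illustrative):

    from markov.api.data import register_datafamily  # import path assumed from this page

    response = register_datafamily(
        name="customer-reviews",
        notes="Datasets derived from customer review exports",
        owner="nlp-team",  # extra key-value stored alongside the datafamily
    )
    # response is a DataFamilyRegistrationResponse; its fields are not documented on this page.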
def register_dataset(name: str, notes: str, data_category: str, delimiter: str, datafamily_id: str, storage_type: str, data_segment_path: List[DataSegmentPath], credentials: Union[str, GenericCredential], storage_format: str = StorageFormatType.CSV, x_indexes: List[int] = None, y_index: int = -1, x_col_names: List[str] = None, y_name: str = '', info: dict = None, source: str = '', should_analyze=False) ‑> DataSetRegistrationResponse
-
Register the dataset with the MarkovML backend. A dataset has two components: DataSetProperties, which defines the characteristics of the dataset, and List[DataSegmentPath], which contains the location of this specific dataset on cloud storage.
Args
should_analyze
:bool
- set to True if you want the dataset to be analyzed while registering
name
:str
- unique name of the dataset.
notes
:str
- Any free-form text you want to store along with this dataset in the MarkovML backend
data_category
:str
- one of:
  "Text": text data, usually used for NLP
  "Numerical": each feature is a numerical value
  "Categorical": each feature value is categorical
  "OneHot": one-hot encoding representation
  "TimeSeries": the dataset represents time-series data
  "Mixed": the dataset contains both numerical and categorical values
delimiter
:str
- delimiter which separates one feature from another
datafamily_id
:str
- ID of the datafamily this dataset belongs to. A DataFamily represents a collection of DataSets that share similar logical characteristics/schema/properties. See DataFamily.
storage_type
:str
- type of storage the actual data is located on. Currently, only S3 storage is supported.
data_segment_path
:List[DataSegmentPath]
- valid absolute paths of the data segments on storage.
- DataSegments are Train/Test/Validate/Unknown
credentials
:Union[str, GenericCredential]
- If you are reusing a credential registered with Markov, use its credential_id. If you've not registered your credentials with Markov, pass the GenericCredential object.
storage_format
:str
- type of data format. Supported formats are: "CSV", "TSV", "YOUR_CUSTOM_DELIMITER"
x_indexes
:list[int]
- list of indexes of the feature columns. If you have named columns, please see x_col_names
y_index
:int
- index of the target column. If you have named columns, please see y_name
x_col_names
:list[str]
- names of the columns that form the feature vector
y_name
:str
- name of the target column
source
:str
- source of the dataset, e.g., Kaggle, 3P Dataset, etc.
info
:dict
- any key-value information that you want to persist alongside the dataset.
Returns
DataSetRegistrationResponse
- Registration response
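A sketch of a one-shot registration call. The import paths, DataSegmentPath constructor arguments, and the "S3" literal are assumptions; credentials is shown as a previously registered credential_id string, as described above:

    from markov.api.data import register_dataset  # import path assumed from this page
    from markov.api.data.data_segment import DataSegmentPath  # import path assumed

    segments = [
        # DataSegmentPath constructor arguments are illustrative assumptions.
        DataSegmentPath(segment_type="Train", path="s3://my-bucket/reviews/train.csv"),
        DataSegmentPath(segment_type="Test", path="s3://my-bucket/reviews/test.csv"),
    ]

    response = register_dataset(
        name="reviews-v1",
        notes="Labelled customer reviews",
        data_category="Text",
        delimiter=",",
        datafamily_id="<datafamily-id>",
        storage_type="S3",
        data_segment_path=segments,
        credentials="<registered-credential-id>",  # or a GenericCredential object
        storage_format="CSV",
        x_col_names=["review_text"],
        y_name="sentiment",
        source="Kaggle",
        should_analyze=True,  # analyze the dataset as part of registration
    )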
def register_embedding(name: str, notes: str, delimiter: str, ds_id: str, storage_type: str, data_segment_path: List[DataSegmentPath], credentials: Union[str, GenericCredential], storage_format: str = StorageFormatType.CSV, x_indexes: List[int] = None, embedding_index: int = -1, x_col_names: List[str] = None, embedding_name: str = '', info: dict = None, source: str = '', should_analyze=False) ‑> EmbeddingRegistrationResponse
-
Register the embedding with the MarkovML backend. An embedding has two components: EmbeddingProperties, which defines the characteristics of the embedding, and List[DataSegmentPath], which contains the location of this specific embedding on cloud storage.
Args
name
- Unique name for the embedding
notes
- Any free form text detail you want to store along with this embedding in MarkovML backend
delimiter
- delimiter which separates one feature from another
ds_id
- Dataset id this embedding is mapped to
storage_type
- type of storage the actual data is located on. Currently, only S3 storage is supported.
data_segment_path
- valid absolute paths of the data segments on storage.
credentials
- If you are reusing a credential registered with Markov, use its credential_id. If you've not registered your credentials with Markov, pass the GenericCredential object
storage_format
- type of data format. Supported formats are: "CSV", "TSV", "YOUR_CUSTOM_DELIMITER"
x_indexes
- list of indexes of the feature columns. If you have named columns, please see x_col_names
embedding_index
- index of the embedding column. If you have named columns, please see embedding_name
x_col_names
- names of the columns that form the feature vector
embedding_name
- name of the embedding column. The expected format is a list of values corresponding to each embedding
info
- Any key-value information that you want to persist alongside the embedding.
source
- Source of the embedding.
should_analyze
- set to True if you want the embedding to be analyzed while registering
Returns
EmbeddingRegistrationResponse
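A sketch of registering an embedding mapped to an existing dataset (import paths and DataSegmentPath constructor arguments are assumptions; parameter names follow the signature above):

    from markov.api.data import register_embedding  # import path assumed from this page
    from markov.api.data.data_segment import DataSegmentPath  # import path assumed

    segments = [
        # DataSegmentPath constructor arguments are illustrative assumptions.
        DataSegmentPath(segment_type="Train", path="s3://my-bucket/reviews/train_embeddings.csv"),
    ]

    response = register_embedding(
        name="reviews-v1-embeddings",
        notes="Sentence embeddings for the reviews dataset",
        delimiter=",",
        ds_id="<dataset-id>",                 # the dataset this embedding is mapped to
        storage_type="S3",
        data_segment_path=segments,
        credentials="<registered-credential-id>",  # or a GenericCredential object
        storage_format="CSV",
        x_col_names=["review_text"],
        embedding_name="embedding",           # column holding a list of values per row
        should_analyze=True,
    )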