sarus.Dataset

class sarus.Dataset(id: int, client: Client, dataspec: DataSpec, is_bigdata: bool | None = None, type_metadata: str | None = None, human_description: str | None = None, marginals: str | None = None, policy: dict | None = None, synthetic: Dict[str, Synthetic] | None = None)

A class representing a Sarus Dataset.

This class is the interface to the protected data. It lets you inspect the Sarus dataset metadata, manipulate synthetic data locally, prepare processing steps, and identify the dataset when executing remote private queries.

Parameters:
  • id – The dataset id.

  • client – The Sarus client where the dataset is defined.

  • type_metadata – A serialized JSON string holding the dataset metadata.

  • marginals – A serialized JSON string holding the dataset marginals.

  • human_description – A short human-readable description.

  • policy

as_pandas(randomize_bigdata_sampling=True) DataFrame

Create a DataFrame wrapper ready to be used for pandas operations.

If the source dataset is big data, a limited number of rows is selected when needed so that external transformations (pandas, sklearn, etc.) remain possible.

Parameters:
  • randomize_bigdata_sampling (bool, optional) – Determines whether the limited rows should be selected randomly or if only the first rows should be used. This is only applicable if sampling is performed. Defaults to True.

Returns:

The Sarus DataFrame wrapper.

Return type:

DataFrame
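The two row-selection modes described for randomize_bigdata_sampling can be illustrated with a small standalone sketch. This is not the Sarus implementation: select_rows, limit and seed are hypothetical names used only to show the difference between random sampling and taking the first rows.

```python
import random


def select_rows(rows, limit, randomize=True, seed=None):
    """Return at most `limit` rows (hypothetical helper, not part of sarus).

    randomize=True  -> rows are drawn at random without replacement.
    randomize=False -> only the first `limit` rows are kept.
    """
    if len(rows) <= limit:
        # Small dataset: no sampling is performed, so the flag has no effect.
        return list(rows)
    if randomize:
        rng = random.Random(seed)
        return rng.sample(rows, limit)
    return list(rows[:limit])


rows = list(range(1000))
first = select_rows(rows, 5, randomize=False)
sampled = select_rows(rows, 5, randomize=True, seed=0)
print(first)         # [0, 1, 2, 3, 4]
print(len(sampled))  # 5
```

As in the parameter description, the flag only matters when sampling actually happens; a dataset that already fits within the limit is returned unchanged.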

as_tensorflow(max_download_size: int | None = None, original: bool = False) Dataset

Return the corresponding sarus.tensorflow.Dataset object.

This allows the Sarus Dataset to be manipulated as a TensorFlow dataset.

Parameters:
  • max_download_size (int, optional) – Maximum number of synthetic data rows to download locally from the Sarus server. It never exceeds the number of available synthetic data rows. If None, all the synthetic data is downloaded. If not None, all local computations are done on the local synthetic sample, so local results will differ from remote results.

  • original (bool) – Whether to return categories as their original values. If True, categories are returned as original values. If False, categories are encoded as integers.

Returns:

A sarus.tensorflow.Dataset.
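The rule stated for max_download_size (never more than what is available, everything when None) amounts to a simple minimum. The sketch below only restates that rule; rows_to_download and available are hypothetical names and are not part of the Sarus API.

```python
from typing import Optional


def rows_to_download(available: int, max_download_size: Optional[int] = None) -> int:
    """Number of synthetic rows that would be fetched (hypothetical helper)."""
    if max_download_size is None:
        # No cap requested: all available synthetic rows are downloaded.
        return available
    # A cap can only lower the count; it cannot exceed what is available.
    return min(max_download_size, available)


print(rows_to_download(10_000))          # 10000
print(rows_to_download(10_000, 500))     # 500
print(rows_to_download(10_000, 50_000))  # 10000
```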

property epsilon: float

Retrieve the remaining global privacy budget (epsilon) of the current access rule.

Returns:

The remaining privacy budget (global epsilon) of the access rule.

Return type:

float

property features: Dict[str, Dict] | None

Features of the Sarus dataset and associated metadata.

Returns:

A dictionary holding metadata where each key is a table name and each value is a dict with features.

Return type:

Dict[str, Dict]

property max_epsilon: float

Retrieve the maximum global privacy budget (epsilon) granted by the Data preparator for the current access rule.

Returns:

The maximum privacy budget (global epsilon) of the access rule.

Return type:

float
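Since epsilon is the remaining budget and max_epsilon the maximum granted, the two properties can be combined to estimate how much of the budget has been used. The sketch below assumes that the consumed budget equals max_epsilon minus epsilon. BudgetView is a hypothetical stand-in that mimics only these two properties so the example runs without a Sarus server; consumed_fraction is likewise illustrative.

```python
from dataclasses import dataclass


@dataclass
class BudgetView:
    """Hypothetical stand-in exposing the two documented properties."""
    epsilon: float      # remaining global privacy budget
    max_epsilon: float  # maximum budget granted for the access rule


def consumed_fraction(ds) -> float:
    """Share of the granted budget already spent (assumed: max - remaining)."""
    if ds.max_epsilon <= 0:
        # Nothing was granted, so nothing can have been consumed.
        return 0.0
    return (ds.max_epsilon - ds.epsilon) / ds.max_epsilon


view = BudgetView(epsilon=2.5, max_epsilon=10.0)
print(consumed_fraction(view))  # 0.75
```

Any object exposing epsilon and max_epsilon, including a real sarus.Dataset, could be passed to consumed_fraction in the same way.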

sql(query: str, sarus_default_output=None) Dataset[RecordBatch]

Apply an SQL query to the dataset.

Parameters:

query (str) – SQL query

Returns:

An instance of Sarus Dataset.

table(table_name: List[str]) Dataset[RecordBatch]

Get a table from the dataset.

Parameters:

table_name (List[str]) – Name of the form ['namespace_1', 'namespace_2', …, 'table_name']. Namespaces can be omitted if the table name is not ambiguous.

Returns:

Table fitting the given name.

Return type:

Table
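The rule that namespaces may be omitted when the table name is unambiguous can be pictured as a suffix match against the full table paths. The sketch below is one possible reading of that rule, not the library's actual resolution logic; resolve_table and the example paths are hypothetical.

```python
from typing import List, Sequence


def resolve_table(name: Sequence[str], tables: List[List[str]]) -> List[str]:
    """Return the unique full path whose trailing components equal `name`.

    Hypothetical helper: raises KeyError when nothing matches and
    ValueError when the shortened name matches several tables.
    """
    name = list(name)
    matches = [t for t in tables if name and t[-len(name):] == name]
    if not matches:
        raise KeyError(f"no table matches {name}")
    if len(matches) > 1:
        raise ValueError(f"ambiguous table name {name}: {matches}")
    return matches[0]


# Example paths in the List[List[str]] shape returned by tables().
all_tables = [
    ["sales", "eu", "orders"],
    ["sales", "us", "orders"],
    ["hr", "employees"],
]

print(resolve_table(["employees"], all_tables))     # ['hr', 'employees']
print(resolve_table(["eu", "orders"], all_tables))  # ['sales', 'eu', 'orders']
```

Here ["orders"] alone would raise ValueError because two paths end with that component, which is the case where the namespaces have to be spelled out.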

tables() List[List[str]]

Return the list of Sarus tables in the dataset.