Python SDK Documentation

Overview

The Sarus Python SDK packages the Sarus API so that analysts and data scientists using Python can easily work with sensitive data in a secure way.

The Sarus Python SDK can be installed with pip:
pip install sarus
(or python -m pip install sarus if you are in a conda environment, for instance).

Note: The Sarus Python SDK requires Python 3.8.

Once the library is installed, you can connect to your Sarus instance to check the list of datasets that were made available to you by the Data Preparator and select the one you want to analyze.

>>> from sarus import Client

>>> client = Client(url="https://yoursarusinstance.sarus.tech:5000", email="your_email")

>>> client.list_datasets()
[<Sarus Dataset slugname=census id=1>,
 <Sarus Dataset slugname=your_dataset_name id=2>]

>>> remote_dataset = client.dataset(slugname="your_dataset_name")

Once the dataset is selected, you can run analyses directly on the remote data using the same code you would write if the data were on the local filesystem.

To use the standard data libraries, you just need to change the import lines to point to the Sarus wrapper of the selected library. To retrieve a dataframe-like object from a remote Sarus dataset you can use .as_pandas() on the dataset object. This will return a sarus.pandas.DataFrame that behaves just like a pandas.DataFrame but on the remote data.

NB: The SDK lets you manipulate the remote data with the version of the library that is installed on the Sarus instance. To make sure your local version of the library is compatible with the remote version, you can install it by specifying the corresponding extra in pip (e.g. pip install sarus[pandas]).

For example:

>>> import sarus.pandas as pd
>>> import sarus.numpy as np

>>> dataframe = remote_dataset.as_pandas()

That’s it!

The rest of the analysis experience is exactly the same as if you were manipulating a pandas DataFrame without Sarus.

Under the hood, the Sarus proxy computes the graph of operations that you apply to the remote Sarus dataset. You actually manipulate Sarus objects without even noticing. Each time your code expects an output (e.g.: extract of rows, aggregates, model weights etc.), the proxy evaluates the current node of the graph by:

  1. checking what’s authorized by the Privacy Policy that was assigned to you by the Data Preparator/Admin (see note below)

  2. compiling the graph version that meets the privacy constraints

  3. executing the graph on the remote data

  4. returning the secured result, always with the objective of maximizing the accuracy given the privacy constraints.
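The four steps above can be pictured with a small toy model. This is only a sketch under stated assumptions: `Node`, `WHITELIST` and `evaluate` are hypothetical names invented for illustration, not part of the Sarus SDK, and the toy does not apply any real privacy mechanism. It only shows that operations are recorded first and evaluated later, once an output is requested.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical toy model; none of these names come from the Sarus SDK.
WHITELIST = {"fit"}  # operations assumed to be allowed exact results


@dataclass(frozen=True)
class Node:
    """One step of the recorded graph: an operation applied to a parent node."""
    op: str
    fn: Optional[Callable] = None
    parent: Optional["Node"] = None

    def apply(self, op: str, fn: Callable) -> "Node":
        # Recording an operation only extends the graph; nothing is computed.
        return Node(op, fn, self)


def evaluate(node: Node, source_data) -> Tuple[str, object]:
    """Mimic the four steps: check policy, compile, execute, return."""
    # 1. Check what the (toy) policy authorizes for the requested output.
    mode = "exact" if node.op in WHITELIST else "dp_estimate"
    # 2. "Compile": collect the chain of operations from the node back to the source.
    chain = []
    current = node
    while current.parent is not None:
        chain.append(current.fn)
        current = current.parent
    # 3. Execute the chain on the data, oldest operation first.
    value = source_data
    for fn in reversed(chain):
        value = fn(value)
    # 4. Return the result with its evaluation mode. A real system would
    #    apply a privacy mechanism in "dp_estimate" mode; this toy does not.
    return mode, value


source = Node("source")
doubled = source.apply("double", lambda rows: [2 * x for x in rows])
total = doubled.apply("sum", sum)
result = evaluate(total, [1, 2, 3])
```

In this toy, building `doubled` and `total` touches no data; only the call to `evaluate` walks the graph and produces a value together with the mode under which it was computed.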

All this works for the libraries and operations that are supported by Sarus (see further).

NB: Please note that this current version of the SDK actually comes with a default Privacy Policy defined as follows: the analyst only gets DP estimates except for operations that are whitelisted; in the case of whitelisted operations, they get exact results. Today, all supported model fits and performance evaluation metrics are whitelisted, meaning the analyst gets the real model weights and performance, fitted on the real data. A feature allowing the Data Admin/Preparator to whitelist specific operations on an exception basis is on the roadmap and will be available soon. See Privacy Policies definition in section Introduction to main Sarus concepts for more details.

Supported libraries and operations

The Sarus team continuously works on supporting more Python libraries and operations. Currently, a subset of the operations of the following libraries is supported:

  • numpy

  • pandas

  • sklearn

  • xgboost

  • tensorflow

NB: The SDK lets you manipulate the remote data with the version of the library that is installed on the Sarus instance. To make sure your local version of the library is compatible with the remote version, you can install it by specifying the corresponding extra in pip (e.g. pip install sarus[pandas]).

For all of these libraries, the Sarus wrapper should be imported instead of the standard library via import sarus.library (e.g. import sarus.numpy as np).

Please find the current list of supported operations here: Supported transformations.

Can’t find the library or operation you need?
Contact your Sarus representative so that we add it for you.

Please note that:

  • Object mutation, Python functions with side effects, multiple assignments and operations taking several Sarus objects as arguments are not supported yet

  • The described execution logic applies only for supported operations. So we encourage you to only use supported ops (see list here: Supported transformations) and contact us to extend the list of supported ops if you need an extra library or operation.

  • Using an unsupported operation of one of the libraries listed above results in an INFO log on stdout, so that you can notice when you have applied an unsupported method and consequently received a standard Python object (NB: displaying INFO-level logs usually requires some configuration)

  • Using an unsupported library on a Sarus object should result in the evaluation of this object based on the Privacy Policy that was assigned to you (so you get a standard Python object)

  • At any time, you can check if an object is a Sarus object with type(object)

  • Evaluating an object can result either in a DP estimate (synthetic or not) or in a result computed on the real data without DP, depending on the Privacy Policy (for example, if the operation is whitelisted). At any time, you can check which evaluation policy was applied to an object with sarus.eval_policy(object)
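Two of the notes above rely on standard Python tooling: showing INFO-level logs and inspecting an object's type. The sketch below uses only the standard library. The helper names `describe` and `looks_like_sarus_object` are hypothetical, and the check assumes that Sarus wrapper classes live in a module whose name starts with `sarus` (as `sarus.pandas.DataFrame` suggests); it is not an SDK function.

```python
import logging
import sys

# Send INFO-level records to stdout. force=True (Python 3.8+) replaces any
# handlers already attached to the root logger.
logging.basicConfig(level=logging.INFO, stream=sys.stdout, force=True)


def describe(obj) -> str:
    """Return the fully qualified class name of obj."""
    cls = type(obj)
    return f"{cls.__module__}.{cls.__qualname__}"


def looks_like_sarus_object(obj) -> bool:
    # Assumption: Sarus wrapper classes are defined in the "sarus" package,
    # so their module name is "sarus" or starts with "sarus.".
    module = type(obj).__module__
    return module == "sarus" or module.startswith("sarus.")
```

Calling `describe(x)` after each step of an analysis is a quick way to see whether `x` is still a wrapper or has already become a standard Python object.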

Advanced concepts

Sarus Dataspecs

A Dataspec is a piece of data for which the computation graph is known. A Dataspec is either a source Dataset made available through the Sarus client or the result of applying a transformation to another Dataspec.

It is called Dataspec instead of Dataset because it is more general. For instance, the weights of a trained model are a Dataspec but are not a Dataset.

A Dataspec is merely a description of how to compute a piece of data. It is a sequence of computation instructions. It is not to be confused with the Dataspec’s value, which is the result of the computation.
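The distinction between a description and its value can be illustrated with a toy class. This is a sketch only: `ToyDataspec` and `value` are hypothetical names, not SDK classes, and the rows are made-up sample data.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ToyDataspec:
    """Hypothetical stand-in: a recipe for computing data, not the data itself."""
    name: str
    transform: Optional[Callable] = None      # None for a source dataset
    parent: Optional["ToyDataspec"] = None    # None for a source dataset

    def describe(self) -> str:
        # The description is available without touching any data.
        if self.parent is None:
            return self.name
        return f"{self.name}({self.parent.describe()})"


def value(spec: ToyDataspec, source_rows):
    """Compute the dataspec's value by replaying its instructions."""
    if spec.parent is None:
        return source_rows
    return spec.transform(value(spec.parent, source_rows))


dataset = ToyDataspec("census")
ages = ToyDataspec("select_age", lambda rows: [r["age"] for r in rows], dataset)
mean_age = ToyDataspec("mean", lambda xs: sum(xs) / len(xs), ages)
```

Here `mean_age` is a dataspec in the toy sense but not a dataset: its description is `mean(select_age(census))`, while its value is a single number that exists only once `value` is called on some rows.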

Dataspec wrappers

In the Sarus python SDK, you don’t directly manipulate Dataspecs. Instead, you manipulate Sarus objects emulating standard data science objects (e.g. pandas DataFrame). These Sarus objects wrap dataspecs. The Sarus classes inherit from the DataspecWrapper class.

A DataspecWrapper acts as a view of the Dataspec. It keeps a reference to the underlying Dataspec and provides a way to interact with the Dataspec’s value. It also defines which operations, if applied to a Dataspec, will produce another Dataspec.

Example

For instance, the Sarus python SDK defines a sarus.pandas.DataFrame object. Calling the mean method on it will yield a sarus.pandas.Series object. This is because Sarus supports the mean method on DataFrames and registers a new Dataspec.

If you call a method that is not supported by Sarus, the method will be applied on the underlying Dataspec value object. For example, calling an unsupported method on a sarus.pandas.DataFrame will yield a standard pandas.DataFrame.

The transition between Sarus objects and standard objects is designed to be seamless for the data scientist. However, this implies that you need to be careful about which methods are supported and which are not. For clarity, an INFO log is printed on stdout to help you identify when you have applied an unsupported method and will consequently receive a standard Python object.
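The fallback behaviour described above can be mimicked with Python's `__getattr__` hook. The sketch below is an assumption-laden illustration, not the SDK's implementation: `ToyWrapper`, `sorted_copy` and the logger name are hypothetical, and a plain list stands in for the Dataspec's value.

```python
import logging

logger = logging.getLogger("toy_wrapper")  # hypothetical logger name


class ToyWrapper:
    """Hypothetical wrapper around a plain list standing in for a dataspec value."""

    def __init__(self, value):
        self._value = value

    def sorted_copy(self):
        # "Supported" operation: the result stays wrapped.
        return ToyWrapper(sorted(self._value))

    def __getattr__(self, name):
        # Called only when normal lookup fails, i.e. for unsupported methods.
        if name.startswith("_"):
            # Avoid infinite recursion if _value has not been set yet.
            raise AttributeError(name)
        attr = getattr(self._value, name)
        logger.info("`%s` is not supported; returning a standard Python object", name)
        return attr


wrapped = ToyWrapper([3, 1, 2])
still_wrapped = wrapped.sorted_copy()  # stays a ToyWrapper
plain = wrapped.count(3)               # falls back to list.count
```

In this toy, `still_wrapped` remains a wrapper while `plain` is an ordinary `int`, and the fallback emits an INFO record that is visible once INFO-level logging is configured.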

Members

SDK classes and functions.

Client([url, google_login, email, password, ...])

Entry point for the Sarus API client.

Dataset(id, client, dataspec[, is_bigdata, ...])

A class representing a Sarus Dataset.