Python SDK Documentation
Overview
The Sarus Python SDK packages the Sarus API so that analysts and data scientists using Python can easily work with sensitive data in a secure way.
Install the SDK with pip:
pip install sarus
or, if you are in a conda environment for instance:
python -m pip install sarus
Note: the Sarus Python SDK requires Python 3.8.
Once the library is installed, you can connect to your Sarus instance to check the list of datasets that were made available to you by the Data Preparator and select the one you want to analyze.
>>> from sarus import Client
>>> client = Client(url="https://yoursarusinstance.sarus.tech:5000", email="your_email")
>>> client.list_datasets()
[<Sarus Dataset slugname=census id=1>,
<Sarus Dataset slugname=your_dataset_name id=2>]
>>> remote_dataset = client.dataset(slugname="your_dataset_name")
Once the dataset is selected, you can directly run analyses on the remote data using the same code you would write if the data were on the local filesystem.
To use the standard data libraries, you just need to change the import lines to point to the Sarus wrapper of the selected library. To retrieve a dataframe-like object from a remote Sarus dataset, you can use .as_pandas() on the dataset object. This will return a sarus.pandas.DataFrame that behaves just like a pandas.DataFrame but on the remote data.
NB: The SDK will let you manipulate the remote data with the version of the library that is installed on the Sarus instance. To make sure the local version of the library is compatible with the remote version, you can download it by setting the right target in pip (e.g.: pip install sarus[pandas]).
For example:
>>> import sarus.pandas as pd
>>> import sarus.numpy as np
>>> dataframe = remote_dataset.as_pandas()
That’s it!
The rest of the analysis experience is exactly the same as if you were manipulating a pandas dataframe without Sarus!
Under the hood, the Sarus proxy computes the graph of operations that you apply to the remote Sarus dataset. You actually manipulate Sarus objects without even noticing. Each time your code expects an output (e.g.: extract of rows, aggregates, model weights etc.), the proxy evaluates the current node of the graph by:
checking what’s authorized by the Privacy Policy that was assigned to you by the Data Preparator/Admin (see note below)
compiling the graph version that meets the privacy constraints
executing the graph on the remote data
returning the secured result, always with the objective of maximizing the accuracy given the privacy constraints.
All this works for the libraries and operations that are supported by Sarus (see further).
NB: Please note that this current version of the SDK actually comes with a default Privacy Policy defined as follows: the analyst only gets DP estimates except for operations that are whitelisted; in the case of whitelisted operations, they get exact results. Today, all supported model fits and performance evaluation metrics are whitelisted, meaning the analyst gets the real model weights and performance, fitted on the real data. A feature allowing the Data Admin/Preparator to whitelist specific operations on an exception basis is on the roadmap and will be available soon. See Privacy Policies definition in section Introduction to main Sarus concepts for more details.
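The default policy described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not how the Sarus proxy is implemented: the names laplace_noise, dp_mean and evaluate, the whitelist argument, the clipping bounds and the epsilon value are all hypothetical and introduced here for illustration. It shows the two possible outcomes of an evaluation: the exact result when the operation is whitelisted, and a differentially private estimate (here a clipped mean with Laplace noise) otherwise.

```python
import random
from statistics import fmean


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def dp_mean(values, lower, upper, epsilon, rng):
    """Differentially private mean of values clipped to [lower, upper].

    Assumption: the number of rows is public, so replacing one row changes
    the clipped mean by at most (upper - lower) / n.
    """
    clipped = [min(max(v, lower), upper) for v in values]
    sensitivity = (upper - lower) / len(clipped)
    return fmean(clipped) + laplace_noise(sensitivity / epsilon, rng)


def evaluate(op, values, whitelist, rng, lower=0.0, upper=100.0, epsilon=1.0):
    """Toy policy check: exact result if `op` is whitelisted, DP estimate otherwise."""
    if op != "mean":
        raise ValueError(f"unsupported toy operation: {op}")
    if op in whitelist:
        return "exact", fmean(values)
    return "dp", dp_mean(values, lower, upper, epsilon, rng)


ages = [23, 35, 41, 58, 62]
rng = random.Random(0)  # seeded only to make the sketch reproducible

exact_policy, exact_value = evaluate("mean", ages, whitelist={"mean"}, rng=rng)
dp_policy, dp_value = evaluate("mean", ages, whitelist=set(), rng=rng)

print(exact_policy, exact_value)
print(dp_policy, round(dp_value, 2))
```

In this toy model the analyst code is identical in both cases; only the policy decides which variant of the computation is run.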
Supported libraries and operations
The Sarus team continuously works on supporting more Python libraries and operations. Currently, a subset of the operations of the following libraries is supported:
numpy
pandas
sklearn
xgboost
tensorflow
NB: The SDK will let you manipulate the remote data with the version of the library that is installed on the Sarus instance. To make sure the local version of the library is compatible with the remote version, you can download it by setting the right target in pip (e.g.: pip install sarus[pandas]).
For all those libraries, the Sarus wrapper should be imported instead of the standard library via import sarus.library (e.g.: import sarus.numpy as np).
Please find the current list of supported operations here: Supported transformations.
Please note that:
Object mutation, Python functions with side effects, multi-assignments and operations taking several Sarus objects as arguments are not supported yet.
The described execution logic applies only for supported operations. So we encourage you to only use supported ops (see list here: Supported transformations) and contact us to extend the list of supported ops if you need an extra library or operation.
Using an unsupported operation of one of the libraries listed above will result in an INFO log in stdout, so that you can notice when you apply an unsupported method and consequently receive a standard Python object (NB: displaying the INFO log level usually requires some configuration).
Using an unsupported library on a Sarus object should result in the evaluation of this object based on the Privacy Policy that was assigned to you (so you get a standard Python object).
At any time, you can check if an object is a Sarus object with type(object).
Evaluating an object can result either in a DP estimate (synthetic or not) or in a result computed on the real data without DP, depending on the Privacy Policy (for example, if the operation is whitelisted). At any time, you can check which evaluation policy was applied to an object with sarus.eval_policy(object).
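The two checks mentioned in the notes above can be sketched with the standard library only. The assumptions here are that the SDK emits its INFO messages through Python's logging module and that Sarus classes are defined under the top-level sarus package; neither is confirmed by this page, and is_sarus_object is a hypothetical helper, not part of the SDK.

```python
import logging
import sys

# Send INFO-level records to stdout. force=True (Python 3.8+) replaces any
# handlers already attached to the root logger, e.g. in a notebook.
logging.basicConfig(level=logging.INFO, stream=sys.stdout, force=True)


def is_sarus_object(obj) -> bool:
    """Guess whether `obj` is a Sarus wrapper from the module of its type.

    Assumption: Sarus classes are defined under the top-level `sarus` package.
    """
    module = type(obj).__module__
    return module == "sarus" or module.startswith("sarus.")


# A plain list comes from the `builtins` module, so it is not a Sarus object.
print(is_sarus_object([1, 2, 3]))
```

Inspecting type(object) directly, as suggested above, gives the same information; the helper only turns it into a boolean.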
Advanced concepts
Sarus Dataspecs
A Dataspec is a piece of data for which the computation graph is known. A Dataspec is either a source Dataset made available through the Sarus client or the result of applying a transformation to such a Dataspec.
It is called Dataspec instead of Dataset because it is more general. For instance, the weights of a trained model are a Dataspec but are not a Dataset.
A Dataspec is merely a description of how to compute a piece of data. It is a sequence of computation instructions. It is not to be confused with the Dataspec’s value, which is the result of the computation.
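The difference between a description and its value can be sketched with a small stand-in class. ToyDataspec, its fields and its methods are hypothetical and only mirror the idea described above; they are not the SDK's classes.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ToyDataspec:
    """Description of a computation: a source, or a transform of a parent."""

    name: str
    parent: Optional["ToyDataspec"] = None
    transform: Optional[Callable] = None

    def describe(self) -> str:
        """Return the chain of instructions without computing anything."""
        if self.parent is None:
            return self.name
        return f"{self.parent.describe()} -> {self.name}"

    def value(self, sources: dict):
        """Run the recorded instructions to obtain the value."""
        if self.parent is None:
            return sources[self.name]
        return self.transform(self.parent.value(sources))


# Building the description touches no data.
census = ToyDataspec("census")
ages = ToyDataspec("select_age", census, lambda rows: [r["age"] for r in rows])
oldest = ToyDataspec("max", ages, max)

print(oldest.describe())
# The value only exists once the instructions are executed on actual data.
print(oldest.value({"census": [{"age": 31}, {"age": 47}]}))
```

Note that oldest is not a dataset (its value is a single number), yet it is described the same way as census and ages, which is the sense in which a Dataspec is more general than a Dataset.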
Dataspec wrappers
In the Sarus Python SDK, you don’t directly manipulate Dataspecs. Instead, you manipulate Sarus objects emulating standard data science objects (e.g. pandas DataFrame). These Sarus objects wrap Dataspecs. The Sarus classes inherit from the DataspecWrapper class.
A DataspecWrapper acts as a view of the Dataspec. It keeps a reference to the underlying Dataspec and provides a way to interact with the Dataspec’s value. It also defines which operations, if applied to a Dataspec, will produce another Dataspec.
Example
For instance, the Sarus Python SDK defines a sarus.pandas.DataFrame object. Calling the mean method on it will yield a sarus.pandas.Series object. This is because Sarus supports the mean method on DataFrames and registers a new Dataspec.
If you call a method that is not supported by Sarus, the method will be applied on the underlying Dataspec value object. For example, calling an unsupported method on a sarus.pandas.DataFrame will yield a standard pandas.DataFrame.
The transition between Sarus objects and standard objects is designed to be seamless for the data scientist. However, this implies that you need to be careful about which methods are supported and which are not. For clarity, an INFO log is printed in stdout to help you identify when you have applied an unsupported method and will consequently receive a standard Python object.
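The fallback behaviour described above can be mimicked with a toy wrapper around a plain list. This is a sketch of the pattern only: ToyWrapper, its methods and the log message are hypothetical, and the real DataspecWrapper may be implemented quite differently.

```python
import logging

logger = logging.getLogger("toy_wrapper")


class ToyWrapper:
    """Toy stand-in for a wrapper whose underlying value is a list of numbers."""

    def __init__(self, value):
        self._value = list(value)

    def sorted(self):
        # Supported operation: the result stays wrapped.
        return ToyWrapper(sorted(self._value))

    def __getattr__(self, name):
        # Called only when normal lookup fails, i.e. for unsupported methods.
        if name.startswith("_"):
            raise AttributeError(name)
        attr = getattr(self._value, name)
        logger.info("'%s' is not supported; returning a standard Python object", name)
        return attr


wrapped = ToyWrapper([3, 1, 2])
still_wrapped = wrapped.sorted()  # supported: another ToyWrapper
plain = wrapped.count(3)          # unsupported: delegated to list.count

print(type(still_wrapped).__name__, type(plain).__name__)
```

Checking the type of each intermediate result, as in the last line, is a simple way to notice where an analysis leaves the wrapped objects.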
Members
SDK classes and functions.
Entry point for the Sarus API client.
A class representing a Sarus Dataset.