Data Science & AI

Introduction

The Sarus SDK allows the analyst to manipulate datasets with standard data science libraries. It endeavors to provide a single unified interface around the familiar NumPy, Pandas, and Scikit-Learn APIs: any user who knows the Pandas API should feel at home with the SDK.

The general framework is the following:

  1. The analyst writes Python code in their usual development environment (local machine, remote notebook…)

  2. At any step along the way, to get a result, they can request the evaluation of their program

  3. When an evaluation is requested, the SDK creates a graph of transformations based on the code. The graph of transformations corresponds to the data processing job defined by the analyst

  4. The SDK sends the graph to the Sarus application for evaluation

  5. The server checks if the graph of operations is authorized, with respect to the analyst’s privacy policy:

    • If so, the graph is evaluated on the remote data and the result is sent back to the analyst

    • If not, the server creates an alternative graph that complies with the privacy policy (meeting the constraints on authorized operations, privacy budget, or synthetic data access for instance). Then the server evaluates the alternative graph and sends the result back to the analyst

[Image: the different steps of an evaluation request]

The guide below presents each of these steps in detail.

SDK installation and imports

To use the SDK, first the analyst needs to install the Sarus package:

>>> pip install sarus==sarus_version

Note that network security rules sometimes prevent downloading packages from PyPI (the Python Package Index). If no exception can be granted to download an external package, please contact your Sarus account manager. NB: use the SDK version that matches the version of Sarus installed on the server.

In order to use pandas or numpy packages on the remote dataset, it is necessary to load the corresponding API from the Sarus SDK:

>>> import sarus.pandas as pd
>>> import sarus.numpy as np
>>> from sarus.sklearn.compose import ColumnTransformer

Dataset selection

Once the library is installed, the analyst can connect to the Sarus instance using the Client object, list the datasets that have been made available, and select the one they want to work on:

>>> from sarus import Client
>>> client = Client(
...     url="https://your_url",
...     email="your_email@example.com",
...     password="your_password",
... )
>>> client.list_datasets()
[<Sarus Dataset slugname=census id=1>,
  <Sarus Dataset slugname=dataset_name id=2>]

>>> dataset = client.dataset(slugname="dataset_name")
>>> dataset.tables()
[`dataset_name`, `schema`, `first_table`]

SQL analyses

The Sarus SDK can be used to run SQL analyses. The method dataset.sql() enables the analyst to run SQL queries on Sarus datasets.

List of SQL functions and operators coming soon.

The sql() method is mostly used for two things:

  • Run statistical SQL analyses on the data:

### Run a basic SQL analysis computing the number
### of rows and the average amount per country

query = """
  SELECT country, COUNT(*) AS nb_rows, AVG(amount) AS avg_amount
  FROM dataset.schema.table
  GROUP BY country
"""
dataset.sql(query).as_pandas()

  • Define an extract of the dataset to be used in a data science workflow:

query = """
  SELECT *
  FROM dataset.schema.table
  WHERE age > 56
"""
df_extract = dataset.sql(query).as_pandas()

Then, df_extract can be used as a pandas DataFrame, for example to build an ML job, as sketched below.
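For instance, a minimal sketch of such an ML job, assuming the sarus.sklearn wrappers mirror the scikit-learn API and that the extract has hypothetical columns age and amount (features) and churned (label):

from sarus.sklearn.linear_model import LogisticRegression

### Hypothetical feature and label columns
X = df_extract[['age', 'amount']]
y = df_extract['churned']

### Assuming sarus.sklearn mirrors scikit-learn,
### fit() runs remotely on the Sarus server
model = LogisticRegression()
model.fit(X, y)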

Working with Pandas, Numpy, Scikit-learn…

Once the dataset is selected and the extract of interest defined, the analyst can run analyses on the remote data with the same code they would write if the data were on their local file system. Most Pandas, Numpy, and Scikit-learn methods can be used.
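For example, NumPy functions imported from sarus.numpy apply to remote data like their local counterparts (a sketch assuming np.log is among the supported operations, reusing the transactions table from the pandas examples below):

### np is sarus.numpy, imported in the SDK imports section above
df = dataset.table(['transactions']).as_pandas()
log_amount = np.log(df['amount'])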

Supported libraries & transformations

The Sarus team continuously works on supporting more Python libraries and transformations. Currently, a subset of operations from the following libraries is supported:

  • Pandas

  • Numpy

  • Scikit-learn

  • Scipy

  • Shap

  • XGBoost

Supported transformations

Note that if the analyst uses a transformation which is not supported, the result will be computed against the synthetic data - check Understand outputs to learn more.

NB: a specific version of each library is installed on the server side. For instance, the pandas version is 1.3.5. To avoid any incompatibility issues, please check and install the matching versions of the packages in your development environment.

You can automatically install the matching version using the syntax pip install sarus[package_name].
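For instance, to install the pandas version matching the server (the quotes prevent shells such as zsh from expanding the square brackets):

>>> pip install "sarus[pandas]"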

Working in pandas

To work in pandas, the analyst applies the as_pandas() method to the selected table. It can be used directly on an original table of the dataset or after a .sql() call. Some examples:

### as_pandas() directly on the dataset
df_transactions = dataset.table(['transactions']).as_pandas()

----------------

### as_pandas() after a .sql()
query = """
  SELECT *
  FROM dataset.schema.table
  WHERE age > 56
"""
df_extract = dataset.sql(query).as_pandas()
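From there, standard pandas manipulations apply to the remote dataframe (a sketch with hypothetical column names country and amount, assuming groupby and mean are among the supported transformations):

### Usual pandas operations on the remote dataframe
avg_by_country = df_transactions.groupby('country')['amount'].mean()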

Understand outputs

At any step along the way, the analyst can check the value of their data processing job. They can do so explicitly by calling sarus.eval(), or implicitly by using a remote object in an unsupported function (e.g. print(remote_object) implicitly calls sarus.eval() first, then applies print()). Either way, the computation graph is sent to the server for evaluation. The server makes sure the graph is compliant with the privacy policy (if not, it creates a compliant alternative) and sends the result back to the analyst.

import sarus

### Compute the mean of the column amount
mean_amount = df_transactions.amount.mean(axis=0, numeric_only=True)
sarus.eval(mean_amount)

Output:
Differentially-private evaluation (epsilon=1.90)
135.56147013575
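The same evaluation happens implicitly when the remote object is passed to an unsupported function such as print:

### Implicit evaluation: print() calls sarus.eval() first
print(mean_amount)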

There are 3 types of results:

  1. Evaluated from Synthetic Data only: the result is computed using the synthetic data only. No additional DP mechanism is applied and the computation does not count against the privacy budget.

  2. Differentially-private evaluation: the result is computed on the remote real data through a differentially-private (DP) alternative of the requested graph. The result of the evaluation is noisy and the computation counts against the privacy budget.

  3. Whitelisted: the result is computed against the remote real data without any modification to the computation graph and without additional protection.

The type of result depends on the transformations used by the analyst in their code. Transformations can have the following properties:

  • Supported transformation (see the list): all the methods and functions supported by the SDK that can be used on remote data.

  • Protected-Unit Preserving (PUP) transformation: a subset of the supported transformations where the protected unit remains traceable in the output of the operation, which is required to apply subsequent differentially-private mechanisms (e.g. apply() is PUP because the protected entity is unchanged by the operation).

  • With DP implementation: a subset of the supported transformations for which an equivalent differentially-private mechanism exists (e.g. count() has a DP mechanism).

To get DP or whitelisted results, it is necessary to use only supported transformations; otherwise the result is computed using the synthetic data. The main way to obtain a DP result is for all transformations to be PUP except the last one, which must have a DP implementation, as in the sketch below.
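A minimal sketch of a graph that qualifies for DP evaluation, assuming row filtering is PUP and that mean() has a DP implementation (column names are hypothetical):

df = dataset.table(['transactions']).as_pandas()
### Row filtering keeps each row attached to its protected unit (PUP)
adults = df[df['age'] > 18]
### The last transformation, mean(), has a DP implementation, so the
### server can rewrite the graph into a differentially-private alternative
sarus.eval(adults['amount'].mean())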

Epsilon consumption of DP results

Every DP result consumes some of the privacy budget (epsilon - see DP theory to learn more). The analyst can control the consumption by setting target_epsilon in the sarus.eval() function. The specified target_epsilon cannot be higher than the per-query limit fixed in the privacy policy by the data owner. When not specified, the consumption of a DP query is the default_target_epsilon determined by the privacy policy - the value set in “Limit Value per query” in the Privacy Policy configuration.
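For example, to request a DP evaluation with an explicit budget (0.5 is an arbitrary value; it must not exceed the per-query limit of the privacy policy):

### Request a DP evaluation with an explicit privacy budget
sarus.eval(mean_amount, target_epsilon=0.5)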

If the DP budget is fully consumed or if the target_epsilon is greater than the authorized value, then the result will be computed against the synthetic data.