Quickstart tutorial

Sarus provides a privacy-first proxy between sensitive data and data consumers so that Analytics and AI projects can start on day 1. Data Analysts and Scientists work on the real, unaltered data without directly seeing it. Data is safe and insights are powerful.
As a Data Owner/Admin, you can use the Sarus UI or the admin API to list your sensitive datasets on your Sarus instance and make them available for Data Practitioners to carry out BI analyses or AI with privacy guarantees, thanks to privacy policy templates that can be enforced in a few clicks.
As a Data Practitioner (Scientist, Analyst…), you can use the Sarus Learning API, the Python SDK, or the BI connector to leverage the datasets made available to you by the Data Owner/Admin. Use your usual data & ML tools to explore sensitive data, carry out analyses, or build your models seamlessly.

In this tutorial you will see how to:

As a Data Owner/Admin:
  • Sign-up/Login to your Sarus instance

  • Invite users

  • Add a dataset to your Sarus instance

  • Select who can query your dataset and with which privacy policy

As a Data Practitioner:
  • Connect to your Sarus instance and see available datasets

  • Submit SQL queries on remote datasets

  • Preprocess data and train an ML model on remote datasets

Part 1: How to prepare a dataset as a Data Owner or Admin

Sign up / Login

If you don’t have an account yet (and you’re not a Sarus app Admin), create one using one of the following methods:

  • Using the invitation link shared by an admin: Visit the link; you’ll see a prefilled single-use token. Just enter your information and validate sign-up. It will create your account and let you in.

  • Using a token: If you were given a Sarus token to create your account, click “Sign up”, enter your information and the single-use token that you received. It will create your account and let you in.

  • Using your Google account: If the instance supports signing up with Google accounts, click “Sign up with Google” and follow the instructions.

  • Using your email address: Fill in your email address and password. It will create your account and let you in.

If you’re the Sarus app Admin logging in for the first time after installation, go to your Sarus instance’s UI login page and use the credentials specified in the .env file.

Invite other users

You can invite other users from the “Users” section in the left menu of the user interface.

Provide their email address and assign them at least one role among:

  • Data Preparator: users who are authorised to see some sensitive data and who manage data access

  • Data Practitioner: Analysts, Data Scientists, etc.

  • User Manager: users who can invite and manage other users

  • Admin: users who have full rights

All roles come with a set of permissions. See the definition page for more info.

Validating will generate an invitation link. Share it with the user so that they can finalize their account creation and start using Sarus.

Add a dataset

The dataset is the central object in Sarus: it is what you will safely make available to Data Practitioners.
Note that adding a dataset doesn’t mean moving or copying your data: thanks to read access to your data sources, the Sarus app opens safe and scalable access to the data, which always remains in your environment.
To add a dataset so it can be used for private analytics and machine learning:
  1. Set up a Data Connection in the left menu: it is a link to a remote repository or database (GCS, S3, Redshift, PostgreSQL, …) where the source data you want to share is located. Can’t find your connector? Contact your Sarus representative so that we can add it for you.

  2. Add a new dataset by clicking the “+ Add” button in the “Datasets” section. Fill in your dataset info and select a data source. You have two options:

  • Upload file: Drop or browse your CSV file. This will upload the data onto your Sarus instance and keep a copy of it.

  • Use data connection: Select an existing connection or create a new one.

    • For file-based storage connections (S3, GCS, …), paste the URI of your data source file. For now, Sarus only supports adding a single CSV file.

    • For SQL sources, browse and select one or more source table(s).

  3. Launch the import by clicking the “Next” button.

    This triggers the detection of the schema, i.e. the data types and ranges needed to generate a useful synthetic dataset and run Differentially-Private queries.

  4. If the detected schema is satisfactory, click “Validate”. This launches synthetic data generation, which may take a while depending on the size and type of your data.

As soon as the synthetic data is ready, the dataset status becomes “Ready” and authorised users are able to work on it. By default, only the Data Owner has access. Let’s make it more interesting and grant access to this new dataset.

Select who can query the remote dataset and how

While the synthetic data is being generated, you can decide who is authorised to run analyses on your dataset and under which privacy policy, from the Access Rules tab of the current dataset.
Grant access by selecting a user or a group of users and assigning them a Privacy Policy. Click “Add Rule” to confirm.
Privacy Policies define how users can access the datasets: access to synthetic data, to aggregated results, Differential-Privacy access with different sets of parameters, etc. As the Data Preparator/Admin, you can also whitelist specific operations on an exception basis when you consider the risk acceptable; for those operations, the Data Practitioner gets the exact result, computed on the real data. The whitelisting feature will be available soon; meanwhile, all supported sklearn model fits and evaluation performance metrics are whitelisted, and all other results sent to the Data Practitioner are DP estimates.

You can create your own privacy policy or use any of the predefined templates:

  • Synthetic data & aggregates: With this access policy, users can’t directly see the real rows of data but can compute as many aggregated queries as they want. If rows are requested, synthetic data is returned. Note that all queries are logged for monitoring and to deter misuse.

  • Anonymized data (per-query DP): Differentially-Private access with query limitation. For each query (SQL or model fitting), the output is anonymized, with a maximum authorised privacy loss of 2. If this per-query limit is exceeded, the DP query is rejected.

  • Anonymized data (per-user DP): Differentially-Private access capped at the user level with a per-user privacy loss limit of 20. A user under this policy will not be able to extract personal information no matter which series of queries they run (see the sketch after this list). This is a very strong guarantee as long as many users do not collude to combine their respective results. If your dataset has many users and there is a risk that they craft sophisticated attacks together, you should also cap the privacy loss at the group level.
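
To make the privacy-loss arithmetic concrete, here is a minimal sketch of a budget accountant (plain Python, illustrative only; this is not part of the Sarus SDK, and Sarus performs this accounting server-side):

# Toy privacy-loss accountant, for illustration only.
PER_QUERY_CAP = 2.0   # maximum privacy loss allowed for a single query
PER_USER_CAP = 20.0   # total privacy loss a single user may accumulate

def submit(query_epsilon, spent_so_far):
    """Return the user's new total privacy loss, or reject the query."""
    if query_epsilon > PER_QUERY_CAP:
        raise ValueError("rejected: per-query privacy loss exceeded")
    if spent_so_far + query_epsilon > PER_USER_CAP:
        raise ValueError("rejected: per-user privacy budget exhausted")
    return spent_so_far + query_epsilon

spent = 0.0
for _ in range(10):        # ten queries at the per-query maximum...
    spent = submit(2.0, spent)
print(spent)               # ...exactly exhaust the per-user budget of 20.0
# An eleventh query with any positive privacy loss would be rejected.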

You’re all set!

When the synthetic data generation completes, the dataset status turns to “Ready” and the dataset is available for safe analysis by Data Practitioners!

At any time, you can check out the dataset information (description, source etc.), view its schema, manage access rules, or check out synthetic data from the dataset page.

Part 2: How to analyze a protected dataset as a Data Practitioner

Once in “Ready” status, a dataset can be analyzed by any Sarus user who has been granted access to it (through an access rule).

As a Data Practitioner, to start playing with a dataset, you can use the API directly (see the API documentation), but we have done most of the heavy lifting for you and created a simple Python SDK to analyze data remotely from any Python environment.
You can access an example notebook here. It is a template to easily run all the operations below on the US Census adult dataset, which we assume is sensitive and, as a result, only accessible through Sarus.

You can also follow the steps below directly:

Install the Sarus Python SDK from PyPI: pip install sarus "sarus[sklearn]"

(or python -m pip install sarus "sarus[sklearn]" if you are in a conda environment, for instance; the quotes protect the brackets in shells such as zsh).

NB: For now, sarus requires Python 3.8.

Now you can connect to the API from your favorite Python environment with your email address and password:

>>> from sarus import Client
>>> client = Client(url="https://yoursarusinstance.sarus.tech/gateway/", email="your_email")

The Client object allows you to interact with the remote data in several ways. Below are some of the main features you may want to try.

Select a dataset

The first thing you need to do is to select a dataset to work on.

>>> client.list_datasets()
[<Sarus Dataset slugname=census id=1>,
 <Sarus Dataset slugname=your_dataset_name id=2>]

>>> remote_dataset = client.dataset(slugname="your_dataset_name")

You can list the names of the dataset’s tables with:

>>> remote_dataset.tables()
[['your_dataset_name', 'schema', 'firsttablename'],
 ['your_dataset_name', 'schema', 'secondtablename']]

Submit SQL queries

The dataset’s tables can be queried with SQL. Here is an example.

>>> import pandas as pd
>>> r = client.query("SELECT AVG(age) FROM your_dataset_name.schema.firsttablename")
>>> pd.DataFrame(r['result'], columns=r['columns'])
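
Grouped aggregates work the same way. A short sketch, assuming the table has columns named sex and age (adapt the names to your own schema):

>>> r = client.query(
...     "SELECT sex, COUNT(*), AVG(age) "
...     "FROM your_dataset_name.schema.firsttablename "
...     "GROUP BY sex"
... )
>>> pd.DataFrame(r['result'], columns=r['columns'])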

Explore private data with pandas

The pandas API can be used to explore and analyze the remote tables. See the Python SDK documentation for more details.

>>> dataframe = remote_dataset.table(remote_dataset.tables()[0]).as_pandas()
>>> dataframe.head(10) # Falls back to synth data as seeing real rows is forbidden
>>> dataframe.describe()
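
Aggregates are computed on the real data and returned as DP estimates (unless whitelisted), while row-level outputs fall back to synthetic data. A minimal sketch, assuming the wrapper mirrors the pandas API and that the table has a numeric column named age:

>>> dataframe['age'].value_counts()  # aggregate computed remotely on the real data
>>> dataframe['age'].mean()          # returned as a DP estimate, not the exact value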

Preprocess & model private data with pandas, numpy and sklearn

>>> from sarus.sklearn.model_selection import train_test_split
>>> from sarus.sklearn.ensemble import RandomForestClassifier
>>> from sarus.sklearn.metrics import accuracy_score

>>> df = dataframe.fillna(0)
>>> target_colname = 'target'
>>> X = df.drop([target_colname], axis=1)
>>> y = df[target_colname]

>>> df.target.value_counts()/df.shape[0]
True     0.507101
False    0.492899
Name: target, dtype: float64

>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

>>> model = RandomForestClassifier()
>>> fitted_model = model.fit(X_train, y_train)

>>> y_pred = fitted_model.predict(X_test)
>>> accuracy_score(y_pred, y_test)
0.897

Your model has been trained on the original unaltered data, for maximum performance!