Quickstart tutorial
In this tutorial you will learn how to:
- As a Data Owner/Admin:
  - Sign up/log in to your Sarus instance
  - Invite users
  - Add a dataset to your Sarus instance
  - Select who can query your dataset and with which privacy policy
- As a Data Practitioner:
  - Connect to your Sarus instance and see available datasets
  - Submit SQL queries on remote datasets
  - Preprocess and train an ML model on remote datasets
Part 1: How to prepare a dataset as a Data Owner or Admin
Sign up / Login
If you don’t have an account yet (and you’re not a Sarus app Admin), create one using one of the following methods:
Using the invitation link shared by an admin: Visit the link; you’ll see a prefilled single-use token. Just enter your information and validate sign-up. It will create your account and let you in.
Using a token: If you were given a Sarus token to create your account, click “Sign up”, enter your information and the single-use token that you received. It will create your account and let you in.
Using your Google account: If the instance supports signing up with Google accounts, click “Sign up with Google” and follow the instructions.
Using your email address: Fill in your email address and password. It will create your account and let you in.
If you’re the Sarus app Admin and are logging in for the first time after Sarus installation, go to your Sarus instance’s UI login page and use the credentials specified in the .env file.
Invite other users
You can invite other users from the “Users” section in the left menu of the user interface.
Provide their email address and assign them at least one role among:
- Data Preparator: users who are authorised to see some sensitive data and will manage data access
- Data Practitioner: analysts, data scientists, etc.
- User Manager
- Admin: users who have full rights
All roles come with a set of permissions. See the definition page for more info.
Validating will generate an invitation link. Share it with the user so that they can finalize their account creation and start using Sarus.
Add a dataset
You can set up a Data Connection in the left menu: it is a link to a remote repository or database (GCS, S3, Redshift, PostgreSQL, etc.) where the source data you want to share is located. Can’t find your connector? Contact your Sarus representative so that we can add it for you.
Add a new dataset by clicking the “+ Add” button in the “Datasets” section. Fill in your dataset info and select a data source. You have two options:
- Upload file: Drop or browse for your CSV file. This will upload the data onto your Sarus instance and keep a copy of it.
- Use data connection: Select an existing connection or create a new one.
  - For file-based storage connections (S3, GCS, etc.), paste the URI of your data source file (e.g. s3://your-bucket/path/data.csv). For now, Sarus only supports adding a single CSV file.
  - For SQL sources, browse and select one or more source tables.
Launch the import by clicking the “Next” button.
This triggers schema detection, i.e. inferring the data types and value ranges needed to generate a useful synthetic dataset and run Differentially-Private queries.
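For intuition, the sketch below shows roughly what such a schema could look like for a CSV file, computed with plain pandas. It is an illustrative approximation, not the actual Sarus implementation, and source.csv is a placeholder:

import pandas as pd

df = pd.read_csv("source.csv")  # hypothetical source file
schema = {
    col: {
        "dtype": str(df[col].dtype),
        # Value ranges are needed to bound the sensitivity of DP aggregates
        "min": df[col].min() if pd.api.types.is_numeric_dtype(df[col]) else None,
        "max": df[col].max() if pd.api.types.is_numeric_dtype(df[col]) else None,
    }
    for col in df.columns
}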
If the detected schema is satisfactory, click “Validate”. It launches synthetic data generation. This may take a while depending on the size and type of your data.
Select who can query the remote dataset and how
You can create your own privacy policy or use any of the predefined templates:
- Synthetic data & aggregates: With this access policy, users can’t directly see the real rows of data but can compute as many aggregated queries as they want. If rows are requested, they get synthetic data. Note that all queries are logged to allow for monitoring and to deter abuse.
- Anonymized data (per-query DP): Differentially-Private access with a per-query limit. For each query (SQL or model fitting), the output is anonymized, with a maximum authorised privacy loss of 2. If this per-query limit is exceeded, the DP query is rejected.
- Anonymized data (per-user DP): Differentially-Private access capped at the user level with a per-user privacy loss limit of 20. A user under this policy will not be able to extract personal information no matter which series of queries they run. This is a very strong guarantee as long as many users do not collude to combine their respective results. If you have many users of your dataset and there is a risk that they may craft sophisticated attacks together, you should also cap at the group level. A minimal sketch of the budget accounting behind these DP policies follows this list.
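For intuition only, here is a hypothetical Python sketch of the budget accounting the two DP policies imply, combining the per-query and per-user checks in one function. Every name in it (PER_QUERY_LIMIT, authorize, ...) is illustrative and not part of the Sarus API:

PER_QUERY_LIMIT = 2.0   # per-query policy: max privacy loss (epsilon) per query
PER_USER_LIMIT = 20.0   # per-user policy: max cumulative privacy loss per user

spent_by_user = {}      # epsilon consumed so far, keyed by user id

def authorize(user, query_epsilon):
    """Accept a DP query only if both limits are respected."""
    if query_epsilon > PER_QUERY_LIMIT:
        return False    # rejected under the per-query policy
    spent = spent_by_user.get(user, 0.0)
    if spent + query_epsilon > PER_USER_LIMIT:
        return False    # rejected under the per-user policy
    spent_by_user[user] = spent + query_epsilon
    return True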
You’re all set!
At any time, you can check out the dataset information (description, source, etc.), view its schema, manage access rules, or browse synthetic data from the dataset page.
Part 2: How to analyze a protected dataset as a Data Practitioner
Once in “Ready” status, a dataset can be analyzed by any Sarus user who has been granted access to it (through an access rule).
To follow the steps below, first install the Sarus Python SDK with its scikit-learn extra:
pip install "sarus[sklearn]"
(or python -m pip install "sarus[sklearn]" if you are in a conda environment, for instance).
NB: For now, sarus requires Python 3.8.
Now you can connect to the API from your favorite Python environment with your email address and password:
>>> from sarus import Client
>>> client = Client(url="https://yoursarusinstance.sarus.tech/gateway/", email="your_email")
The Client object allows you to interact with the remote data in several ways. Below are some of the main features you may want to try.
Select a dataset
The first thing you need to do is to select a dataset to work on.
>>> client.list_datasets()
[<Sarus Dataset slugname=census id=1>,
<Sarus Dataset slugname=your_dataset_name id=2>]
>>> remote_dataset = client.dataset(slugname="your_dataset_name")
You can list the names of the dataset’s tables with:
>>> remote_dataset.tables()
[['your_dataset_name','schema','firsttablename'],
['your_dataset_name','schema','secondtablename']]
Submit SQL queries
The dataset’s tables can be queried with SQL. Here is an example.
>>> import pandas as pd
>>> r = client.query("SELECT AVG(age) FROM your_dataset_name.schema.firsttablename")
>>> pd.DataFrame(r['result'], columns=r['columns'])
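Grouped aggregates can be submitted the same way. In this hypothetical example, gender is a placeholder for a column from your own schema:
>>> r = client.query(
...     "SELECT gender, COUNT(*) AS n, AVG(age) AS avg_age "
...     "FROM your_dataset_name.schema.firsttablename "
...     "GROUP BY gender"
... )
>>> pd.DataFrame(r['result'], columns=r['columns'])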
Explore private data with pandas
The pandas API can be used to explore and analyze the remote tables. See the Python SDK documentation for more details.
>>> dataframe = remote_dataset.table(remote_dataset.tables()[0]).as_pandas()
>>> dataframe.head(10)  # Falls back to synthetic data since viewing real rows is forbidden
>>> dataframe.describe()
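Other common pandas operations should work the same way on the remote dataframe, assuming the wrapper supports them; the age column below is a hypothetical placeholder for a column from your own schema:
>>> dataframe['age'].mean()
>>> dataframe[dataframe['age'] > 40].head()  # again falls back to synthetic rows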
Preprocess & model private data with pandas, numpy and sklearn
>>> from sarus.sklearn.model_selection import train_test_split
>>> from sarus.sklearn.ensemble import RandomForestClassifier
>>> from sarus.sklearn.metrics import accuracy_score
>>> df = dataframe.fillna(0)
>>> target_colname = 'target'
>>> X = df.drop([target_colname], axis=1)
>>> y = df[target_colname]
>>> df.target.value_counts()/df.shape[0]
True 0.507101
False 0.492899
Name: target, dtype: float64
>>> result = train_test_split(X, y, test_size=0.2)  # result holds [X_train, X_test, y_train, y_test]
>>> X_train = result[0]
>>> X_test = result[1]
>>> y_train = result[2]
>>> y_test = result[3]
>>> model = RandomForestClassifier()
>>> fitted_model = model.fit(X_train, y_train)
>>> y_pred = fitted_model.predict(X_test)
>>> accuracy_score(y_pred, y_test)
0.897
Your model has been trained on the original unaltered data, for maximum performance!