Introduction
Sarus’s mission is to unlock analytics, AI and GenAI on sensitive data without privacy risk. This is achieved by building a privacy layer between the sensitive data and the data practitioners: data scientists can query the data any way they want, and all returned results are made privacy-safe by the application.
Key concepts
Access Rule
An Access Rule is the application of a Privacy Policy to a Dataset for one user or group of users. It defines the queries that the user(s) will be able to execute on the Dataset.
Admin
Computational graph
A Graph is a specific computation that a data practitioner would like to perform on a Dataset. It can be a single SQL query (possibly with nested queries) or a more complex data processing task with intermediate results (e.g.: sequence of python transformations).
A Graph is sent to the Application API by using the BI Connector or the SDK. The Application will analyze the Graph and possibly rewrite it to comply with the privacy policy of the data practitioner before executing it on the source data.
Evaluation
To get a data result, the Data Analyst needs to evaluate the graph. It can be done using the sarus.eval() function. Sometimes the evaluation is implicit, meaning that the sarus.eval() function is used under the hood. For instance, the Python function print() will first evaluate the object and then display it.
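The explicit-versus-implicit evaluation behavior can be illustrated with a toy lazy-evaluation wrapper. This is a sketch only (the `LazyValue` class is hypothetical and not part of the Sarus SDK); it just shows how a computational graph can defer execution until `eval()` is called, and how printing can trigger evaluation under the hood.

```python
# Toy sketch of lazy evaluation (illustrative only; not the Sarus SDK).
# A LazyValue records a computation but only runs it when eval() is called,
# mirroring how sarus.eval() triggers execution of a computational graph.
class LazyValue:
    def __init__(self, fn, *deps):
        self.fn = fn
        self.deps = deps

    def eval(self):
        # Evaluate dependencies first, then apply this node's function.
        return self.fn(*(d.eval() for d in self.deps))

    def __repr__(self):
        # Implicit evaluation: printing the object evaluates it first,
        # just as print() implicitly calls sarus.eval() in the SDK.
        return repr(self.eval())


def constant(x):
    return LazyValue(lambda: x)


a = constant(2)
b = constant(3)
total = LazyValue(lambda x, y: x + y, a, b)
print(total)         # implicit evaluation via __repr__
print(total.eval())  # explicit evaluation
```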
To understand the different types of results, you can check :ref:`this section <understand-outputs>`.
Privacy Policy
A Privacy Policy defines the outputs of data processing tasks that can be retrieved by a user. It can include:
Outputs derived exclusively from synthetic data
Outputs from differentially-private mechanisms
Outputs of data processing tasks that have been whitelisted by exception
Today, all Privacy Policies include at least the first right: synthetic data. The right to query with differential privacy comes with parameters that control the level of protection. They define a privacy loss budget that limits the maximum amount of information that may be retrieved under a privacy policy. Budgets can be set at the query level, the user level, or the Access Rule level. The Application keeps an account of all queries at a given level and enforces that the cumulated privacy consumption stays below the authorized thresholds.
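The budget accounting described above can be sketched as follows. The class and method names are hypothetical (not the Sarus API); the sketch only shows the principle of cumulative consumption against a fixed threshold.

```python
# Minimal sketch of privacy-budget accounting (illustrative; not the Sarus
# API). Each differentially-private query consumes some epsilon; the
# accountant rejects queries once the cumulated consumption would exceed
# the budget attached to, e.g., an Access Rule.
class BudgetAccountant:
    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.consumed = 0.0

    def can_spend(self, epsilon):
        return self.consumed + epsilon <= self.total_epsilon

    def spend(self, epsilon):
        if not self.can_spend(epsilon):
            raise PermissionError("privacy budget exhausted")
        self.consumed += epsilon


accountant = BudgetAccountant(total_epsilon=1.0)
accountant.spend(0.4)  # first DP query
accountant.spend(0.4)  # second DP query
print(accountant.can_spend(0.4))  # False: not enough budget left
```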
Finally, the user may have the right to retrieve outputs of whitelisted queries or data processing tasks. Whitelists can be attached to a given computation (e.g.: one particular SQL query) or apply to any result of a particular transformation (e.g.: all performance metrics of ML models).
Note that even if the synthetic data has been generated with differential privacy, the privacy consumption for the synthetic data is not accounted for in any of the budgets. It is as if it had been made public by the data owner in addition to any differentially-private query right.
Protected Unit
A protected unit (PU) is an entity one wants to keep private. It may be an individual, a group of individuals, a single financial transaction, or any data object that should not be revealed in the analyses. With Sarus, an analyst can carry out analyses on a dataset with mathematical guarantees that information about any given protected unit remains protected.
In more technical terms, the protected unit is the basis for the dataset distance used in the definition of differential privacy: two datasets are neighboring if they differ by at most one protected unit.
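The neighboring relation can be sketched in a few lines. This is a simplified illustration (function names are hypothetical): rows are keyed by a protected-unit identifier, and two datasets are neighbors when they differ by exactly one protected unit; note that one protected unit may own several rows.

```python
# Sketch of the neighboring-dataset relation from differential privacy,
# with each row keyed by a protected-unit identifier (illustrative only;
# it compares the sets of protected units, not the row contents).
def protected_units(dataset):
    return {pu_id for pu_id, _ in dataset}


def are_neighbors(d1, d2):
    """True if the datasets differ by exactly one protected unit."""
    diff = protected_units(d1) ^ protected_units(d2)  # symmetric difference
    return len(diff) == 1


alice = [("alice", 10), ("alice", 12)]  # one PU may own several rows
bob = [("bob", 7)]
print(are_neighbors(alice + bob, bob))  # True: they differ by "alice" only
print(are_neighbors(alice + bob, []))   # False: they differ by two units
```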
Sarus Dataset
A Dataset is a collection of rows with a predefined schema. It can be one or several two-dimensional tables.
Each Dataset points to a source that can be a file uploaded by the data preparator or tables from a Data Connection. A Dataset has Access rules that define who can query the source data and under which privacy policy.
Schema of a dataset
The schema of a dataset contains basic properties regarding the structure of the data (e.g.: column names, column types, relationships between tables, privacy keys for each table, or ranges of possible values). It is defined during the onboarding process by the Sarus application. Schema details come from the data source for typed sources (e.g.: SQL-like) or are inferred in the case of text files (e.g.: CSV). They can then be refined by the data preparator (e.g.: adjusting the ranges of possible values, adding primary or foreign keys, or setting protection keys; learn more about user input).
Once the Dataset is ready, the schema becomes immutable and all information that it contains will be visible to all the users that can see the dataset irrespective of the underlying privacy policy.
Synthetic Data
During the onboarding process, the application trains a generative AI model or a trigram model to capture the statistical distribution of the data. The model seeks to preserve the statistical distribution of columns and, when possible, the joint distribution of multiple columns. The generative model is trained with differential privacy to ensure that no private information may leak. Once training is complete, the application samples from the learned distribution to generate new fake records. Each synthetic record formally respects the original schema. Collectively, they have a realistic statistical distribution. Synthetic data is a very important element for data scientists to get familiar with a dataset they cannot see.
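A drastically simplified version of this idea can be sketched by sampling each column independently from noise-perturbed value counts. This toy sketch (all names are hypothetical) is a stand-in for the DP-trained generative model: the real application learns joint distributions and calibrates the noise properly, while this one only shows that synthetic rows respect the schema without copying real records.

```python
import math
import random

random.seed(0)  # deterministic for the example


def laplace(scale):
    # Draw one sample from a Laplace(0, scale) distribution
    # (stdlib random has no built-in Laplace sampler).
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))


def synthesize(rows, n, epsilon=1.0):
    """Sample n fake rows, column by column, from noisy marginal counts.

    Toy stand-in for a DP generative model: per-column counts are
    perturbed with Laplace noise, then used as sampling weights.
    """
    columns = list(zip(*rows))
    out_cols = []
    for col in columns:
        counts = {}
        for v in col:
            counts[v] = counts.get(v, 0) + 1
        # Noisy, clipped-to-positive weights for sampling.
        weights = {v: max(c + laplace(1.0 / epsilon), 1e-6)
                   for v, c in counts.items()}
        values = list(weights)
        out_cols.append(
            random.choices(values, [weights[v] for v in values], k=n))
    return list(zip(*out_cols))


real = [("FR", 34), ("FR", 29), ("US", 51)]
fake = synthesize(real, n=5)
# Each fake row respects the (country, age) schema but is sampled,
# not copied, from the data.
```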
Whitelisting
In a privacy policy, it is possible to specify certain transformations as exceptions. When a transformation has been added to the whitelist, a Data Scientist can use it in any data processing task without any protection. Therefore, it is crucial to be cautious when declaring exceptions, as they may be used to extract private information. Exceptions are best suited for trusted data practitioners who will not try to exploit them to carry out privacy attacks.
For instance, it can be useful to whitelist the fit() method of Machine Learning models so that the fitted model is as accurate as possible.
Note that all data processing tasks are logged, which enables the data owner to audit all use of exceptions.
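The enforcement logic can be sketched as a simple dispatch (all names here are hypothetical, not the Sarus implementation): a whitelisted transformation runs on the real data, while anything else falls back to a protected result such as one computed on synthetic data.

```python
# Sketch of whitelist enforcement (hypothetical names, not the Sarus
# implementation). A whitelisted transformation returns the unprotected
# result computed on real data; others fall back to synthetic data.
def run(transform_name, transform, real_data, synthetic_data, whitelist):
    if transform_name in whitelist:
        return transform(real_data)    # exception: unprotected result
    return transform(synthetic_data)   # default: protected, synthetic result


real = [1, 2, 3, 100]
synthetic = [2, 2, 4, 90]
whitelist = {"mean"}
mean = lambda xs: sum(xs) / len(xs)

print(run("mean", mean, real, synthetic, whitelist))  # 26.5 (real data)
print(run("max", max, real, synthetic, whitelist))    # 90 (synthetic fallback)
```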
Core components
Administration API/UI
The application can be managed either by a web-app or directly by a REST API. It enables the data owner to create users, prepare datasets, define Privacy Policies, and grant Access Rules to users.
This API and web-app also provide a detailed and searchable list of logs to monitor and audit queries that have been submitted.
Python SDK
The Python SDK makes it easy for data practitioners to submit data processing tasks. They define computational graphs using the tools and libraries they are most accustomed to, including a .sql() method and APIs that wrap the most common data science libraries (e.g.: pandas, numpy, sklearn).
It is also the easiest way to pull synthetic data from the application.
BI Connector
The BI Connector is an API that implements the HiveQL/SparkSQL protocol so that BI tools (e.g.: Metabase, PowerBI, Tableau) and SQL query editors can submit SQL queries to Sarus.