Python SDK Documentation
========================

Overview
________

The Sarus Python SDK packages the Sarus API so that analysts and data scientists
using Python can easily work with sensitive data in a secure way.

| The Sarus Python SDK can be installed with the usual:
| ``pip install sarus``
| (or ``python -m pip install sarus`` if you are in a conda environment, for instance).

**Note: The Sarus Python SDK requires Python 3.8.**

Once the library is installed, you can connect to your Sarus instance to check the
list of datasets that the Data Preparator made available to you, and select
the one you want to analyze.

.. code-block::

   >>> from sarus import Client
   >>> client = Client(url="https://yoursarusinstance.sarus.tech:5000", email="your_email")
   >>> client.list_datasets()
   [, ]
   >>> remote_dataset = client.dataset(slugname="your_dataset_name")

Once the dataset is selected, you can directly run analyses on the remote data
**using the same code you would write if the data were on the local filesystem**.
To use the standard data libraries, you just need to change the import lines to
point to the Sarus wrapper of the selected library.

To retrieve a dataframe-like object from a remote Sarus dataset, call
``.as_pandas()`` on the dataset object. This returns a ``sarus.pandas.DataFrame``
that behaves just like a ``pandas.DataFrame`` but operates on the remote data.

NB: The SDK will let you manipulate the remote data with the version of the
library that is installed on the Sarus instance. To make sure the local version
of the library is compatible with the remote version, you can install it by
specifying the matching extra in pip (e.g. ``pip install sarus[pandas]``).

For example:

.. code-block::

   >>> import sarus.pandas as pd
   >>> import sarus.numpy as np
   >>> dataframe = remote_dataset.as_pandas()

That's it!
**The rest of the analysis experience is exactly the same as if you were manipulating a pandas DataFrame without Sarus!**

Under the hood, the Sarus proxy computes the graph of operations that you apply
to the remote Sarus dataset. You actually manipulate Sarus objects without even
noticing. Each time your code expects an output (e.g. an extract of rows,
aggregates, model weights), the proxy evaluates the current node of the
graph by:

1. checking what is authorized by the Privacy Policy that was assigned to you
   by the Data Preparator/Admin (see note below)
2. compiling the version of the graph that meets the privacy constraints
3. executing the graph on the remote data
4. returning the secured result, always with the objective of maximizing
   accuracy given the privacy constraints.

All this works for the libraries and operations that are supported by Sarus
(see below).

NB: **Please note that the current version of the SDK comes with a default
Privacy Policy defined as follows: the analyst only gets DP estimates, except
for whitelisted operations, for which they get exact results.**
Today, all supported model fits and performance evaluation metrics are
whitelisted, meaning the analyst gets the real model weights and performance,
fitted on the real data. A feature allowing the Data Admin/Preparator to
whitelist specific operations on an exception basis is on the roadmap.
See the Privacy Policies definition in section *Introduction to main Sarus
concepts* for more details.

Supported libraries and operations
__________________________________

The Sarus team continuously works on supporting more Python libraries and
operations. Currently, a subset of the operations of the following libraries is
supported:

- numpy
- pandas
- sklearn
- xgboost
- tensorflow

NB: The SDK will let you manipulate the remote data with the version of the
library that is installed on the Sarus instance.
To make sure the local version of the library is compatible with the remote
version, you can install it by specifying the matching extra in pip
(e.g. ``pip install sarus[pandas]``).

For all those libraries, the Sarus wrapper should be imported instead of the
standard library via ``import sarus.library`` (e.g. ``import sarus.numpy as np``).

.. toctree::
   :hidden:

   op_list

Please find the current list of supported operations here:
:ref:`supported_transformations`.

| Can't find the library or operation you need?
| Contact your Sarus representative so that we can add it for you.

**Please note that**:

- **Object mutation, Python functions with side effects, multi-assignments and
  operations taking several Sarus objects as arguments are not supported yet.**
- The described execution logic **applies only to supported operations**. We
  therefore encourage you to use only supported ops (see the list here:
  :ref:`supported_transformations`) and to contact us to extend the list if you
  need an extra library or operation.
- Using an unsupported operation of one of the libraries listed above results
  in an INFO log in ``stdout``, so that you can notice when you apply an
  unsupported method and consequently receive a standard Python object
  (NB: displaying INFO-level logs usually requires some configuration).
- Using an unsupported library on a Sarus object should result in the object
  being evaluated according to the Privacy Policy that was assigned to you
  (so you get a standard Python object).
- At any time, you can check whether an object is a Sarus object with
  ``type(object)``.
- Evaluating an object can result either in a DP estimate (synthetic or not) or
  in a result computed on the real data without DP, depending on the Privacy
  Policy (for example, if the op is whitelisted).
- At any time, you can check which evaluation policy was applied to an object
  with ``sarus.eval_policy(object)``.

Advanced concepts
_________________

Sarus Dataspecs
^^^^^^^^^^^^^^^

A Dataspec is a piece of data whose computation graph is known. A Dataspec is
either a source Dataset made available through the Sarus client, or the result
of applying a transformation to such a Dataspec.

It is called a Dataspec rather than a Dataset because it is more general. For
instance, the weights of a trained model are a Dataspec but are not a Dataset.

A Dataspec is merely a description of how to compute a piece of data. It is a
sequence of computation instructions. It is not to be confused with the
Dataspec's value, which is the result of the computation.

Dataspec wrappers
^^^^^^^^^^^^^^^^^

In the Sarus Python SDK, you don't directly manipulate Dataspecs. Instead, you
manipulate Sarus objects emulating standard data science objects (e.g. a pandas
DataFrame). These Sarus objects wrap Dataspecs.

The Sarus classes inherit from the ``DataspecWrapper`` class. A
``DataspecWrapper`` acts as a view of the Dataspec. It keeps a reference to the
underlying Dataspec and provides a way to interact with the Dataspec's value.
It also defines which operations, if applied to a Dataspec, will produce
another Dataspec.

Example
^^^^^^^

For instance, the Sarus Python SDK defines a ``sarus.pandas.DataFrame`` object.
Calling the ``mean`` method on it will yield a ``sarus.pandas.Series`` object.
This is because Sarus supports the ``mean`` method on DataFrames and registers
a new Dataspec.

If you call a method that is not supported by Sarus, the method is applied to
the value of the underlying Dataspec. For example, calling an unsupported
method on a ``sarus.pandas.DataFrame`` will yield a standard
``pandas.DataFrame``.

The transition between Sarus objects and standard objects is designed to be
seamless for the data scientist.
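
The distinction between a Dataspec, its value, and a wrapper can be illustrated
with a toy model. The sketch below is not the Sarus implementation:
``ToyDataspec``, ``ToyWrapper``, ``SUPPORTED_OPS`` and ``evaluate`` are
hypothetical names invented for this illustration, and a plain Python list
stands in for the remote data. It only assumes the behaviour described in this
section: a supported operation registers a new Dataspec, while an unsupported
one is applied to the evaluated value and logged at INFO level.

.. code-block:: python

    import logging
    import statistics

    logger = logging.getLogger("toy_dataspec")

    # Hypothetical set of "supported" operations for this toy model.
    SUPPORTED_OPS = {"mean": statistics.mean, "sorted": sorted}


    class ToyDataspec:
        """A description of how to compute a value, not the value itself."""

        def __init__(self, op, parent=None, source=None):
            self.op = op          # "source" or a key of SUPPORTED_OPS
            self.parent = parent  # upstream ToyDataspec, if any
            self.source = source  # raw data, only set on source nodes

        def value(self):
            """Walk the graph back to the source and compute the result."""
            if self.op == "source":
                return self.source
            return SUPPORTED_OPS[self.op](self.parent.value())


    class ToyWrapper:
        """A view over a ToyDataspec that mimics the wrapped object's methods."""

        def __init__(self, dataspec):
            self._dataspec = dataspec

        def __getattr__(self, name):
            # Only called for attributes not found on ToyWrapper itself.
            if name in SUPPORTED_OPS:
                # Supported: register a new node, compute nothing yet.
                return lambda: ToyWrapper(ToyDataspec(name, parent=self._dataspec))
            # Unsupported: log, evaluate, and fall back to the plain value.
            logger.info("%r is not supported; returning a standard object", name)
            return getattr(self._dataspec.value(), name)


    def evaluate(wrapper):
        """Compute the value described by a wrapper's dataspec."""
        return wrapper._dataspec.value()


    data = ToyWrapper(ToyDataspec("source", source=[3, 1, 2]))
    ordered = data.sorted()  # still a ToyWrapper: the graph grew by one node
    count = data.count(3)    # plain int: list.count ran on the evaluated value
    print(type(ordered).__name__, evaluate(ordered), count)

In this toy, ``sorted`` stays lazy and returns another wrapper, while ``count``
is not in the supported set, so the wrapper evaluates the value and hands back
a plain ``int``.
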
However, this implies that you need to be careful about which methods are
supported and which are not. For clarity, an INFO log is printed to ``stdout``
to help you identify when you have applied an unsupported method and will
consequently receive a standard Python object.

Members
_______

.. automodule:: sarus

.. autosummary::
   :toctree: generated

   Client
   Dataset