A brief review of data safety in building an AI system.

What you can find in this summary note:

Related link: A starter guide towards building a trustworthy AI

Most of the content here is adapted from the Content Sharing Call of the OpenMined Community, one of the most active open groups working on privacy-preserving data science and AI. Please go to the original blogs for more details if you have any questions. Thanks to Nick from Miratrix/OpenMined for sharing this information with me.

In some industries, such as healthcare, operational logistics, and finance, most data has traditionally been private and/or sensitive. At the same time, these industries still need machine learning techniques to solve problems. Take healthcare as an example: applications could improve patient outcomes through better diagnostics and reduce the time doctors need to make clinical decisions.

Source: ShutterStock

Nevertheless, data privacy has become one of the most challenging barriers to innovation. People do not know whether their data has been sold on, used without their consent, or stored longer than intended, and users are becoming more aware of such misuse. On the other side, it is difficult to manage consent when gathering data. All of this makes it difficult for conventional industries to move forward with data protection policies, e.g. GDPR or the California Consumer Privacy Act (CCPA).

Big picture: The past, the present and the future

Why would we look at safety from the perspective of car manufacturing and design?


Source: Seatbeltplus.com

The past: Lessons

The three-point seat belt was introduced in 1959 by Nils Bohlin, the chief safety officer of Volvo. It was the result of studying tens of thousands of accidents and running hundreds of experiments. This is a great demonstration of a commitment to the public good. With this example in mind, we might ask ourselves whether there is a better way to promote an open culture of such innovation.

The present: Attention

There are three key points when working towards an AI system.

  1. The importance of data.
  2. The trend of implementing and updating safety-critical features.
  3. Optimization over an entire dataset.


An example of selecting an image with a slicing function. The result is a photo satisfying rain and rain. Source: OpenMined Blog.

Here, we pay more attention to the second point, which raises three questions:

  1. How to release a model without losing competitive advantage.
  2. How to release enough data without losing privacy.
  3. How to do both with Federated Learning, especially by identifying the application and the critical data subset with slicing functions.
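
To make the idea of a slicing function more concrete, here is a minimal sketch in Python. It assumes that each camera frame carries some metadata; the `Frame` fields, the two slicing functions, and the sample frames are all hypothetical and are not taken from the OpenMined blog. A slicing function is treated here as a plain predicate, and the critical subset is whatever satisfies every predicate.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    """Hypothetical metadata attached to one camera frame."""
    frame_id: int
    weather: str
    road_type: str
    has_pedestrian: bool


def is_rainy(frame):
    """Slicing function: keep frames recorded in rain."""
    return frame.weather == "rain"


def is_country_lane(frame):
    """Slicing function: keep frames recorded on a country lane."""
    return frame.road_type == "country_lane"


def apply_slices(frames, slicing_functions):
    """Return the frames for which every slicing function is True."""
    return [f for f in frames if all(sf(f) for sf in slicing_functions)]


# Made-up sample data, for illustration only.
frames = [
    Frame(1, "rain", "motorway", False),
    Frame(2, "rain", "country_lane", True),
    Frame(3, "clear", "country_lane", True),
    Frame(4, "rain", "city", False),
]

critical_subset = apply_slices(frames, [is_rainy, is_country_lane])
print([f.frame_id for f in critical_subset])
```

Keeping each slicing function small and named makes it easier for someone else to review what data a request would actually touch.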


A computer vision system in a self-driving car. Source: NVIDIA/VICTOR TANGERMANN

The future: An ecosystem

Here is a hypothetical example of Federated Learning in a Tesla computer vision system that identifies pedestrians on country lanes.

  1. Create a slicing function.
  2. Send the function to an ecosystem governed by federated learning.
  3. Ecosystem members approve or reject the function.
  4. The model is federated to the approving members.
  5. The models are gathered and insight is gained.
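
The five steps above can be sketched as a toy round in Python. This is only an illustration under strong assumptions: the `Member` class, its `approves` flag, and the sample records are hypothetical, and the "model" is reduced to a single number (the pedestrian rate on the slice) standing in for real model weights. The aggregation weights each member by its sample count, in the spirit of federated averaging.

```python
def country_lane_slice(record):
    """Step 1: a slicing function that selects country-lane records."""
    return record["road_type"] == "country_lane"


class Member:
    """Hypothetical ecosystem member that keeps its records private."""

    def __init__(self, name, records, approves):
        self.name = name
        self.records = records
        self.approves = approves  # stand-in for a real review policy

    def review(self, slicing_function):
        """Step 3: approve or reject the proposed slicing function."""
        return self.approves

    def train_locally(self, slicing_function):
        """Step 4: fit a toy 'model' on the local slice only."""
        subset = [r for r in self.records if slicing_function(r)]
        if not subset:
            return None
        rate = sum(r["pedestrian"] for r in subset) / len(subset)
        return rate, len(subset)  # only this summary leaves the member


def federated_round(members, slicing_function):
    """Steps 2 to 5: send the function out, then gather and average."""
    updates = []
    for member in members:
        if not member.review(slicing_function):
            continue  # rejecting members contribute nothing
        update = member.train_locally(slicing_function)
        if update is not None:
            updates.append(update)
    total = sum(count for _, count in updates)
    if total == 0:
        return None
    # Weighted average of the local results, weighted by sample count.
    return sum(value * count for value, count in updates) / total


def rec(road_type, pedestrian):
    return {"road_type": road_type, "pedestrian": pedestrian}


members = [
    Member("A", [rec("country_lane", 1), rec("country_lane", 0),
                 rec("country_lane", 0), rec("country_lane", 0),
                 rec("motorway", 0)], approves=True),
    Member("B", [rec("country_lane", 1)], approves=False),
    Member("C", [rec("country_lane", 1), rec("country_lane", 1),
                 rec("city", 0)], approves=True),
]

global_estimate = federated_round(members, country_lane_slice)
print(global_estimate)
```

In this sketch member B never runs the function at all, which mirrors the idea that approval happens before any computation on private data.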


Source: OpenMined Blog

In such a system, the ecosystem would achieve three goals:

Please go to the blog for further information. Blog

Case study 1: Data storage

Introduce SplitNN

The critical pain point is the centralization of data, which leads us to a decentralized approach. Here we introduce a concept called Split Neural Network (SplitNN), as presented by OpenMined. This technique was introduced by MIT in December 2018 and opened a new branch of privacy-preserving research in machine learning. In simple words, we split the network, and with it the training process, across two or more hosts. Each host holds a self-contained segment and feeds its output to the next segment.
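
As a rough illustration of the split described above, here is a minimal Python sketch with two segments. It is not the OpenMined or MIT implementation: each segment is reduced to a single scalar weight, the gradients are written out by hand, and the class names, data, and learning rate are assumptions chosen for the example. The point it tries to show is the message flow: the client sends only activations forward, and the server sends only the gradient of the loss with respect to those activations back.

```python
class ClientSegment:
    """First segment: lives with the data owner and sees raw inputs."""

    def __init__(self, weight):
        self.weight = weight
        self._last_input = None

    def forward(self, x):
        self._last_input = x
        return self.weight * x  # only this activation leaves the client

    def backward(self, grad_activation, lr):
        # Chain rule: dL/dw1 = dL/dh * dh/dw1, with h = w1 * x.
        grad_weight = grad_activation * self._last_input
        self.weight -= lr * grad_weight


class ServerSegment:
    """Second segment: sees activations and labels, never raw inputs."""

    def __init__(self, weight):
        self.weight = weight

    def train_step(self, activation, target, lr):
        prediction = self.weight * activation
        error = prediction - target
        loss = error ** 2
        grad_prediction = 2 * error
        grad_weight = grad_prediction * activation
        # Gradient w.r.t. the incoming activation, sent back to the client.
        grad_activation = grad_prediction * self.weight
        self.weight -= lr * grad_weight
        return loss, grad_activation


# Made-up data following y = 2 * x, for illustration only.
samples = [(0.5, 1.0), (1.0, 2.0), (1.5, 3.0), (2.0, 4.0)]
client = ClientSegment(weight=0.5)
server = ServerSegment(weight=1.0)

losses = []
for epoch in range(100):
    epoch_loss = 0.0
    for x, y in samples:
        activation = client.forward(x)                        # client -> server
        loss, grad = server.train_step(activation, y, 0.02)   # server step
        client.backward(grad, 0.02)                           # server -> client
        epoch_loss += loss
    losses.append(epoch_loss)

print(client.weight * server.weight)
```

In this simple configuration the server also holds the labels; other configurations keep the labels with the data owner as well.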


A diagram of how SplitNN works. Source

Why would we use a SplitNN?


An illustration of how effective it is. Source

Training & implementation of a SplitNN

Please go to the blog for further information. Blog

Case study 2: An application of using data

An introduction to recommendation systems: the conventional approach and the federated approach.

“Recommendation system”: what is it?

An engine that allows websites and mobile applications to suggest information based on pre-existing data.

Some examples are Facebook Ads, Amazon product suggestions, and YouTube recommended videos. The upside is the discovery of new ideas and products. The downside is unwanted decisions, e.g. purchasing a service that you don’t need or spending too much time on cat and dog videos. Most of these are driven by algorithms that use your personal data.

Technically, there are three main domains of such algorithms:

  1. Content filtering: Build a profile of each item and user from a set of features. It aims to understand the user’s interests.
  2. Collaborative filtering: A common way to build a recommendation system, typically by applying a technique called matrix factorization. It is based solely on recorded interactions between users and a platform, e.g. an e-commerce website.
  3. Hybrid filtering: A combination of the above.
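
To illustrate the matrix factorization mentioned under collaborative filtering, here is a small Python sketch. It is a generic, centralized version trained with stochastic gradient descent, not the method of any particular platform; the ratings, the rank, the learning rate, and the regularization strength are all made-up values for the example. Each user and each item gets a short vector of latent factors, and a rating is predicted as their dot product.

```python
import random

# Hypothetical explicit ratings: (user, item, rating on a 1-5 scale).
ratings = [
    (0, 0, 5.0), (0, 1, 3.0),
    (1, 0, 4.0), (1, 2, 1.0),
    (2, 1, 2.0), (2, 2, 5.0),
]
n_users, n_items, rank = 3, 3, 2

rng = random.Random(0)  # fixed seed so the run is repeatable
user_factors = [[rng.uniform(0.1, 0.5) for _ in range(rank)]
                for _ in range(n_users)]
item_factors = [[rng.uniform(0.1, 0.5) for _ in range(rank)]
                for _ in range(n_items)]


def predict(u, i):
    """Predicted rating: dot product of user and item factors."""
    return sum(p * q for p, q in zip(user_factors[u], item_factors[i]))


def squared_error():
    """Total squared error over the observed ratings."""
    return sum((r - predict(u, i)) ** 2 for u, i, r in ratings)


initial_error = squared_error()
learning_rate, reg = 0.02, 0.01

for _ in range(500):
    for u, i, r in ratings:
        err = r - predict(u, i)
        for k in range(rank):
            p, q = user_factors[u][k], item_factors[i][k]
            # Gradient step on the regularized squared error.
            user_factors[u][k] += learning_rate * (err * q - reg * p)
            item_factors[i][k] += learning_rate * (err * p - reg * q)

final_error = squared_error()
print(initial_error, final_error)
```

Once trained, `predict(u, i)` can also be called for user and item pairs that have no observed rating, which is what produces a recommendation.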

Other domains:

  1. Learning to rank, used across the ranking, recommendation, and search domains.
  2. Neural-based recommendation systems: neural collaborative filtering, wide and deep networks.

Privacy preserving recommendation systems

This post is an example of how to build a federated collaborative filtering model via matrix factorization.

How to do federated collaborative filtering (Federated ALS)
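
As a rough sketch of the idea, and not the implementation from the linked post, the Python example below splits a matrix factorization between user devices and a server. It makes several simplifying assumptions that should be read as such: ratings are explicit, the factors have rank 1 so the least-squares step is a scalar formula, and the `Client` class, the data, and the hyperparameters are hypothetical. Each client solves for its own user factor locally (the ALS-style step) and shares only gradients for the item factors; the server aggregates those gradients and updates the shared item factors.

```python
class Client:
    """One user's device: ratings and user factor stay local."""

    def __init__(self, ratings):
        self.ratings = ratings  # {item_id: rating}, never sent out
        self.user_factor = 0.0

    def local_error(self, item_factors):
        """Squared error on this user's ratings (demo diagnostic only)."""
        return sum((r - self.user_factor * item_factors[i]) ** 2
                   for i, r in self.ratings.items())

    def local_round(self, item_factors, reg):
        # ALS-style step: closed-form regularized least squares for the
        # user factor while the item factors are held fixed.
        numerator = sum(r * item_factors[i] for i, r in self.ratings.items())
        denominator = sum(item_factors[i] ** 2 for i in self.ratings) + reg
        self.user_factor = numerator / denominator
        # Gradient of the squared error w.r.t. each rated item's factor.
        # Only these gradients are shared with the server.
        return {
            i: -2.0 * (r - self.user_factor * item_factors[i]) * self.user_factor
            for i, r in self.ratings.items()
        }


def server_round(clients, item_factors, reg, learning_rate):
    """Aggregate client gradients and take one step on the item factors."""
    total_grad = [2.0 * reg * y for y in item_factors]
    for client in clients:
        for i, grad in client.local_round(item_factors, reg).items():
            total_grad[i] += grad
    return [y - learning_rate * g for y, g in zip(item_factors, total_grad)]


# Made-up ratings for three users over three items.
clients = [
    Client({0: 4.0, 1: 2.0}),
    Client({0: 2.0, 1: 1.0, 2: 3.0}),
    Client({1: 2.0, 2: 6.0}),
]
item_factors = [1.0, 1.0, 1.0]

initial_error = sum(c.local_error(item_factors) for c in clients)
for _ in range(200):
    item_factors = server_round(clients, item_factors,
                                reg=0.1, learning_rate=0.01)
final_error = sum(c.local_error(item_factors) for c in clients)
print(initial_error, final_error)
```

The server in this sketch never sees a rating or a user factor, only per-item gradients. Whether gradients alone leak information is a separate question that this example does not address.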


Training & implementation of Federated ALS

Please go to the blog for further information. Blog
