oboulant commented 1 year ago

Zama Bounty Program: Credit Scoring

Please give us as much information as possible on the bounty you would like to submit. You can find inspiration from our existing list of bounties here.

Bounty name: Credit Scoring

Bounty type: major_bounty
Category: Application
Overview: We propose to showcase, through a real-world credit scoring application, how Zama's technology can help address the privacy issues raised by exposing sensitive personal information. We propose:
- from the modelling perspective, to start from an existing dataset and model
- from the user experience perspective, to package the whole thing and deploy a server and a frontend client
Library targeted: Concrete-ML
Reward: 13500$ if planned as described by macro sizing

Description

Credit Scoring

A credit score is a numerical expression based on a level analysis of a person's credit files, to represent the creditworthiness of an individual. A credit score is primarily based on a credit report, information typically sourced from credit bureaus. — Wikipedia

Introductory Brief

Credit scoring has traditionally been reserved for banking institutions and the like, which use it to assess their customers' likelihood of repaying a loan, or to decide whether to grant credit to a potential customer.

Users lack a way to assess their own credit score, as doing so would require them to submit their private, sensitive credit data to a third-party service. This concern over user data privacy opens a great use case for FHE: it allows a machine-learning model to be built and to assess user credit scores without exposing either the users' banking data or their actual credit score.

As such, this project aims to provide a concrete, usable application that assesses users' credit scores while respecting their data privacy.

Application goals

This application has 2 main goals: provide a hands-on approach to Zama FHE, and showcase how its encryption works in a more user-friendly way.

Provide a hands-on experience to users

FHE is a difficult concept to grasp. Non-technical users fail to understand how it works, or how it can work at all, and more technical ones doubt there can be an actual working implementation beyond just a technical proof. This credit-score app is a pretext to deliver an interactive experience of FHE. As such, the focus will be on showcasing FHE encryption rather than building a full-fledged credit-score app (e.g. no business model, no “premium” features…). Nonetheless, the model, results and overall behaviour should be immersive enough for users to understand that FHE is no longer a theoretical concept: FHE is ready to reshape our concept of data privacy.

Encryption showcase

The main goal of FHE applied to machine learning is to enforce user data privacy. Uninitiated users will probably miss the difference between FHE and HTTPS, or fail to grasp how data privacy can be preserved server-side. This application should help non-technical users understand how the privacy of their data is preserved.

Our take is that an uninitiated user needs 2 things in order to accept a change of paradigm (in our case, a new form of data privacy behaviour): a representation they can grasp, and the existence of a proof.

Visual representation of the encrypted data

A first step is to show the user a visual representation of their submitted data in plain form, how it presents a risk to their privacy (i.e. banking/judicial data), and how readable it is. Then show them how the same data is unreadable once encrypted, so that users can see for themselves that their data is protected.

Converting the user data to base64 or showing the HTTPS-encrypted messages would provide the same sense of security without enforcing data privacy by any means, so this step helps build trust through a visual medium but does not prove anything.
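To make the base64 point concrete, here is a minimal Python sketch. The record and its field names are hypothetical and only serve the illustration: the encoded payload looks opaque, yet it can be reversed by anyone because no key is involved.

```python
import base64
import json

# Hypothetical form payload; the field names are illustrative only.
record = {"monthly_income": 4200, "late_payments": 2}

# Base64 makes the payload look unreadable to a human...
encoded = base64.b64encode(json.dumps(record).encode("utf-8"))

# ...but anyone can reverse it: no secret key is needed at any point.
decoded = json.loads(base64.b64decode(encoded).decode("utf-8"))

print(encoded.decode("ascii"))
print(decoded)
```

This is why the visual step alone cannot count as evidence of privacy: an unreadable string and an undecryptable one look the same on screen.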

Proof that encrypted data is unreadable by the server

Stating that the data cannot be decrypted by the server is insufficient to build trust. After helping the user visualise that their data is “unreadable”, the next step is to provide a technical proof in a user-friendly format (e.g. Zama documentation), which can be one of the delivery items.

Delivering this technical proof is outside the scope of this application, as the Zama core team is by far the best suited to do so (and probably has it already, in the form of a white paper or the like). We should focus only on referencing it for the more technical users, and on providing links to any work that assesses it.

Missions

The tasks are grouped in 3 categories:

Machine Learning, which covers everything related to setting up the model,
Web Application, which covers the development of the user interface,
Deployment, which covers the application delivery.

Several deliverables will be produced:

The main application (client and server), deployed and publicly accessible,
The source code, in a public git repository,
Notebooks, showcasing the inner workings of the model,
Markdown write-ups, which can be used as documentation or blog post material.

The tasks will be converted into GitHub issues and the deliverables into GitHub milestones to help track the project's progress.

Machine Learning

Set up the ML project

Set up a machine-learning project that performs credit scoring on behalf of users. The goal here is not to develop a high-performing model from scratch but rather to draw inspiration from existing models and quickly bootstrap a working model that complies with the specific needs of FHE models.

As of 06/04/2023, the main source of inspiration (models + data) is the following Kaggle competition: https://www.kaggle.com/competitions/GiveMeSomeCredit/overview

Emphasis will be placed on ensuring that the selected Kaggle model works and scales well once converted to its Concrete-ML counterpart.

Task	Set up a credit scoring machine-learning model
Deliverable	A notebook summarizing the model, why it was selected and how it should be configured
Macro Sizing	4 days

Build the Concrete-ML equivalent

Task	Convert the model from the previous step into its FHE equivalent with Concrete-ML
Deliverable	A notebook summarizing the conversion steps and showcasing a Python script that can later be used for the model deployment
Macro Sizing	5 days

Performance Benchmarking

Analyse the performance gap between a base ML model (e.g. a scikit-learn implementation) and its Concrete-ML counterparts. Measure training/compilation time as well as prediction time. The goal here is to show the difference in performance (which we foresee to be very large) but also to emphasise that this “drop” in performance is not much of a concern at the user level, as the execution time remains acceptable.

Task	Analyse the model performance
Deliverable	A written analysis (markdown) of the FHE model's performance
Macro Sizing	3 days
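
As a rough idea of what the prediction-time comparison could look like, here is a minimal timing harness in Python. It is only a sketch: `clear_stub` and `fhe_stub` are hypothetical stand-ins (the latter merely sleeps to imitate a slower encrypted prediction), and in the real notebooks they would be replaced by the scikit-learn and Concrete-ML prediction calls.

```python
import statistics
import time
from typing import Callable, Dict, Sequence

Row = Sequence[float]


def time_predict(predict: Callable[[Row], float], rows: Sequence[Row],
                 repeats: int = 3) -> float:
    """Return the median wall-clock time (in seconds) of a single prediction."""
    timings = []
    for _ in range(repeats):
        for row in rows:
            start = time.perf_counter()
            predict(row)
            timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def compare(clear_predict: Callable[[Row], float],
            fhe_predict: Callable[[Row], float],
            rows: Sequence[Row], repeats: int = 3) -> Dict[str, float]:
    """Time both predictors on the same rows and report the slowdown factor."""
    clear_s = time_predict(clear_predict, rows, repeats)
    fhe_s = time_predict(fhe_predict, rows, repeats)
    slowdown = fhe_s / clear_s if clear_s > 0 else float("inf")
    return {"clear_s": clear_s, "fhe_s": fhe_s, "slowdown": slowdown}


# Hypothetical stand-ins: a clear model and a much slower "encrypted" one.
def clear_stub(row: Row) -> float:
    return sum(row)


def fhe_stub(row: Row) -> float:
    time.sleep(0.005)  # imitates the extra latency of encrypted inference
    return sum(row)


report = compare(clear_stub, fhe_stub, rows=[[1.0, 2.0, 3.0]] * 2)
print(report)
```

Reporting the median rather than the mean keeps a single slow outlier (e.g. a first call paying warm-up costs) from dominating the figure shown to readers.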

Web Application

Core application

Build a web app that allows users to submit their banking data through a simple form and displays the resulting credit score. The application should provide the following pages:

A main page with a submission form
A summary page that displays the user credit score

The emphasis will be on getting a working example quickly rather than spending time on complex UX/UI. The app structure will allow other functionalities to be added later.

The data will be mocked (interfacing the web app with the model will be done in a later stage, once the model is deployed).

Task	Build a single-page web application for interacting with a remote credit-scoring model.
Deliverable	A functional web app which works with mock data.
Macro Sizing	4 days

Visual representation of encryption

Add an intermediary step in the submission form, demonstrating to the user that their data is encrypted before being sent to the server. This should be done by replacing the form's “send” button with an “Encrypt” button, which redirects to a page showing that the data is encrypted. As stated in the Application goals section, building trust at this stage is limited to showing the encrypted data. The interface will also provide links to Zama's most suitable “proof” content.

If Zama has any simple visual material (e.g. infographics, diagrams…), it can be included in the page.

The page will also provide a “Send to server” button to resume the flow.

Task	Add an intermediary encryption page
Deliverable	The updated web app with mock data.
Macro Sizing	1 day

Encryption with TFHE-rs

Replace the mock implementation of the encryption on the encryption page with the Wasm TFHE-rs implementation. Depending on how displayable the encrypted data is (binary?), the encrypted data visualizer might have to be adjusted (scrolling, showing only the first bits, etc.).

Task	Implement the encryption with the WASM API
Deliverable	The updated web app with working encryption.
Macro Sizing	3 days (depending on how easy the WASM API is to use)
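
If the ciphertext turns out to be a large binary blob, the visualizer could fall back to a truncated preview. The sketch below is in Python for illustration only (the web client itself would need an equivalent in its own language); `preview_ciphertext` is a hypothetical helper, and random bytes stand in for a real ciphertext.

```python
import os


def preview_ciphertext(blob: bytes, head: int = 16) -> str:
    """Render the first `head` bytes as spaced hex, followed by the total size."""
    shown = blob[:head].hex(" ")
    suffix = " ..." if len(blob) > head else ""
    return f"{shown}{suffix} ({len(blob)} bytes)"


# Random bytes stand in for a real ciphertext here (assumption for the demo).
fake_ciphertext = os.urandom(4096)
print(preview_ciphertext(fake_ciphertext))
```

Showing the total size next to the first bytes also gives users a feel for how much larger the encrypted payload is than the handful of form fields they typed.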

Interfacing with the server

Once the server has been deployed, interface with its API so that the web client actually sends the user's encrypted data and receives results.

Build a mirror page of the previous client encryption page: display the encrypted response data received from the server and provide a “decrypt” button in the client interface. Once the data is decrypted, redirect the user to the summary page, which displays their credit score.

Task	Interface the client with the server
Deliverable	The updated web app.
Macro Sizing	2 days

Deployment

Set up the production deployment

Follow the production deployment workflow described in the documentation. Depending on practicality, we foresee the following:

a script that builds the model and produces the production artifacts (client.zip, server.zip and serialized_processing.json),
a small HTTP API to interact with the production model,
a CI script (Github Actions) with Docker to automate the workflow.

These steps require some more info and will be split into more specific tasks once the complete workflow is determined.

Task	Set up the production deployment
Deliverable	The source code and a production deployment.
Macro Sizing	5 days
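
To give an idea of how small the HTTP API could be, here is a sketch using only the Python standard library. Everything in it is an assumption to be revisited once the workflow is known: the `/predict` route, the `handle_predict` and `make_server` names, and above all the `evaluate` callable, which is a placeholder for whatever runs the compiled model on the encrypted payload. Key exchange is left out entirely.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Tuple

# Placeholder type: takes an encrypted request body, returns an encrypted result.
Evaluate = Callable[[bytes], bytes]


def handle_predict(body: bytes, evaluate: Evaluate) -> Tuple[int, bytes]:
    """Map an encrypted request body to an (HTTP status, response body) pair."""
    if not body:
        return 400, b"empty payload"
    return 200, evaluate(body)


def make_handler(evaluate: Evaluate):
    """Build a request handler class bound to the given evaluation function."""

    class PredictHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/predict":  # hypothetical route name
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            status, payload = handle_predict(self.rfile.read(length), evaluate)
            self.send_response(status)
            self.send_header("Content-Type", "application/octet-stream")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return PredictHandler


def make_server(evaluate: Evaluate, port: int = 8000) -> HTTPServer:
    """Create (but do not start) a local server; call serve_forever() to run it."""
    return HTTPServer(("127.0.0.1", port), make_handler(evaluate))
```

Keeping `handle_predict` free of any HTTP machinery means the request logic can be unit-tested without opening a socket, and the server never touches anything other than opaque bytes.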

Co-written with @robinstraub

oboulant commented 1 year ago

A first step would be to validate that we can move forward with https://www.kaggle.com/competitions/GiveMeSomeCredit/overview ?

Since, from the ML perspective, the goal is not to start from scratch but rather to build upon a reasonably good model for that particular problem, I already had a look at what exists for that dataset and problem. Before going any further, it would be nice to validate that we can move forward with those data and model ⬆️ ?

Usable resources if we validate that we work on this dataset and problem:

Most upvoted. It tests a lot of things, including several pre-processing approaches, outlier removal, etc. https://www.kaggle.com/code/riteshrhyme/starter-credit-card-scoring-bbe98584-0/notebook
Second most upvoted. Nice since it shows how well the trained model generalizes, but it does not use AUC as a performance metric. https://www.kaggle.com/code/prasadposture121/financial-distress-prediction
Very simple approach with a nice AUC. But it does not have many upvotes. https://www.kaggle.com/code/dhruv1234/givemesomecredit-auc-0-86721
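
Since AUC is the yardstick used to compare these notebooks, here is a small pure-Python sketch of how it can be computed from labels and scores, so the clear and FHE models could be compared on the same footing. It is illustrative only: the pairwise loop is quadratic, and a library implementation would be preferable in the notebooks.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count for one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: 3 of the 4 positive/negative pairs are ranked correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```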

aquint-zama commented 1 year ago

You can proceed with the dataset mentioned (our goal is to have a real-life use case).
WebApp and deployment should be considered as follow-ups once we already have the app running in FHE (⚠️ the TFHE-rs wasm client and Concrete for the server part are not yet compatible), and deployment will be greatly improved in the coming months.

zama-ai / bounty-program

Credit Scoring #8
