zama-ai / bounty-program

Zama Bounty Program: Contribute to the FHE space and Zama's open source libraries and get rewarded šŸ’°
https://zama.ai
244 stars 14 forks source link

Create a string library that works on encrypted data using TFHE-rs #80

Closed zaccherinij closed 9 months ago

zaccherinij commented 1 year ago

Winners

šŸ„‡ 1st place: A submission by JoseSK999 šŸ„ˆ 2nd place A submission by Tomtau šŸ„‰ 3rd place : A submission by M-Bln


Overview

This TFHE-rs bounty looks to provide a set of APIs working over encrypted string reproducing APIs from the rust std library str type (see documentation at https://doc.rust-lang.org/std/primitive.str.html)

This document will specify the expected structure of the example that you will produce and specify various constraints on the types that will be introduced and the APIs that are expected on the primitives.

How to participate?

1ļøāƒ£ Register here. 2ļøāƒ£ When ready, submit your code here. šŸ—“ļø Submission deadline: December 17, 2023.

Description

String encoding

The input string encoding is expected to be ASCII.

To avoid leaking the length of the input string we may want to encrypt zeros after the string actual content, to easily mark the end of the string we will make the FheString null terminated, i.e., it must always end by the value 0u8 encrypted, like C Strings are null terminated.

Note [ADDED 23.10.2023] A null character "\0" may mark the early end of a string but is not required (as was first asked in the bounty) as it can make some algorithms more complicated and slower. The option to have encrypted strings with a padding made with 0s is still required (to leave the option to the user to obfuscate the string's length), differentiating between the two kind of strings (padded or not) can and should be used to chose better algorithms when possible.

Functions to implement

Your submission should at least implement, the following method for an encrypted string:

API to use for the bounty

This bounty should make use of the integer API from TFHE-rs, more precisely we expect you to use RadixCiphertexts with the default parallelized functions available in the crate.

Structure of the example directory

You will put your code in the directory tfhe/src/examples/fhe_strings

This directory will contain the following:

main.rs:

The code for a command line executable that will take an input string to be encrypted and a pattern for string functions which require it. This executable will encrypt the first string and run all the available APIs on the input string using the pattern when necessary. The results will be compared to the clear APIs provided by rustā€™s std::str, the ouptuts will be nicely formatted for the user to see that the results match (if there are errors then the bounty would not be considered valid) with timing information for the FHE version.

Use clap (the version pinned in TFHE-rs) to build the command line.

ciphertext.rs:

This module will contain:

client_key.rs:

Provide a ClientKey type that can be built from shortint parameters (that are also used by the integer API) or from an IntegerClientKey directly, realistically this ClientKey type will wrap the IntegerClientKey and provide primitives to encrypt a rust str (validating before encryption that the str is a valid ASCII str), decrypt it in a fresh String. There should be an encryption primitive that allows the user to specify how many ā€œpaddingā€ encryption of zeros should be appended after the 0-terminated string that will be encrypted from the provided input string.

ClientKey must derive serde::Serialize, serde::Deserialize and Clone.

directory server_key:

mod.rs: Provide a ServerKey that wraps an Integer ServerKey that can be built from a ClientKey from client_key.rs or directly from an Integer ServerKey.

ServerKey must derive serde::Serialize, serde::Deserialize and Clone.

Create files for families of functions that share similar algorithmic needs, e.g., IF IT MAKES SENSE (here we are speculating, this is just an example supposing similar functions will require similar algorithms)

split.rs

Will contain an impl Block for the above ServerKey dedicated to split functions like

and all relevant helper functions. If some functions are needed everywhere then those functions can be put in the mod.rs file from the directory.

trim.rs

Will contain an impl Block for the above ServerKey dedicated to trim functions like

Note that the above functions will not literally trim blocks off of the provided FheAsciiString but rather will zero out the correct blocks to make the null terminated string the trimmed version as appropriate.

Also, it could make sense to separate functions depending on the type of the pattern (i.e., encrypted or clear) in separate files.

Other requirements

Each function should have a docstring describing what it does (those can be adapted from the rust std docs) with a doctest demonstrating the usage on simple hard coded cases. We expect standard algorithms if it makes sense (with links to the papers describing them or publicly available content on the web like a Wikipedia page for example), if some tricks are used proper comments are expected to be provided in the code.

[!NOTE] As submissions are evaluated on a 64 cores machine, your code should heavily use parallelization

A Makefile target (see Makefile for the template of our targets) for running the tests of the example is expected. A command to run the binary easily from the terminal is also requested. The code must compile on the latest stable rust. No #[allow(clippy::...)] are authorized unless absolutely necessary but that should never be required. The code must pass the make pcc (pre commit check) available from our Makefile, if you get errors then the code needs to be fixed to satisfy the pcc checks. Good luck!

Reward

šŸ„‡Best submission: up to ā‚¬10,000.

To be considered best submission, a solution must be efficient, effective and demonstrate a deep understanding of the core problem. Alongside the technical correctness, it should also be submitted with a clean code, clear explanations and a complete documentation.

šŸ„ˆSecond-best submission: up to ā‚¬3,500.

For a solution to be considered the second best submission, it should be both efficient and effective. The code should be neat and readable, while its documentation might not be as exhaustive as the best submission, it should cover the key aspects of the solution.

šŸ„‰Third-best submission: up to ā‚¬1,500.

The third best submission is one that presents a solution that effectively tackles the challenge at hand, even if it may have certain areas of improvement in terms of efficiency or depth of understanding. Documentation should be present, covering the essential components of the solution.

Reward amounts are decided based on code quality and speed performance on a m6i.metal AWS server.

Related links and references

Submission

Apply directly to this bounty by opening an application here.

Questions?

Do you have a specific question about this bounty? Join the live conversation on the FHE.org discord server here. You can also send us an email at: bounty@zama.ai

leyuskckiran1510 commented 1 year ago

Could you provide me the time frame,I will be working on it after two weeks; will that be okay?

aquint-zama commented 1 year ago

Could you provide me the time frame,I will be working on it after two weeks; will that be okay?

Deadline for submission is 17th December (you could see it with the associated milestone

tomtau commented 1 year ago

I have a few questions:

  1. patterns: I assume it should compile on the stable Rust, right? So in that case, it's not possible to use the pattern feature https://github.com/rust-lang/rust/issues/27721 so we'd just need to make a custom type or trait for it -- I also assume the patterns could only be ASCII (like strings in that bounty), is that correct? And should empty patterns be allowed?
  2. All those functions should be on the ServerKey, or should they also be on FheString? And they will all need ServerKey anyway, so it's not possible to implement e.g. std::ops traits (like for +) directly on FheString... hence those functions would be just standalone methods on ServerKey, is that correct or should those traits be somehow implemented (e.g. implement them for (ServerKey, FheString) or something like that)?
  3. split: similarly to patterns, I guess it's ok to create custom types for those different Split return types / iterators (RSplit, RSplitN, SplitAsciiWhitespace, SplitInclusive...), is that correct? For splitn and rsplitn, only the pattern can either be clear or encrypted, n is always clear, right?
  4. outputs: (I'm new to tfhe and) I assume outputs of all those functions would be encrypted, right? Should they use some special types from tfhe, or ok just to use a type alias or a wrapper around RadixCiphertext? And for methods that "branch" (i.e. I assume one cannot know with a ServerKey if a pattern was matched or not in a ciphertext, so needs to compute both options), should it return a vector of all computed ciphertexts (while ClientKey would have a method that can process it and pick the correct output)? I saw that in the regex tutorial: https://docs.zama.ai/tfhe-rs/application-tutorials/regex#matching. so I was wondering whether that's what's expected... and if not, do you have an example of a way how to handle it, e.g. somehow combine boolean and integer circuits for this kind of branching output aggregation?
  5. trim:

    Note that the above functions will not literally trim blocks off of the provided FheAsciiString but rather will zero out the correct blocks to make the null terminated string the trimmed version as appropriate.

this makes sense for me with trim_end, but I'm not sure about trim or trim_start... is it ok to just treat 0-byte as empty anywhere in the string (rather than just at its end / null-termination), so ClientKey's decryption would just filter out these 0s to get a string? Or should it still respect C-style and what to do in that case? Should it use a special value to mark empty places (e.g. 256 outside of u8's range) before 0-termination?

tomtau commented 1 year ago

@aquint-zama one follow-up question depending on the answer to 4. about function outputs:

  1. repeat: how should the encrypted number of repetitions be handled? Again, based on my newbie understanding, one cannot know with a ServerKey whether a given string should be repeated 1-time or 1000000000-times... is it in that situation ok to return a "lazy" repeated string iterator (which will most likely return two encrypted values: a flag whether the number of repetitions was reached and the repeated encrypted string constructed up to that point)? ClientKey can then get the final string by iterating and checking the decrypted flags?
IceTDrinker commented 1 year ago

hello @tomtau

  1. Yes this should compile on stable rust, ASCII patterns indeed, empty patterns should be handled the same way the std rust library handles them I guess ?
  2. On the ServerKey is perfectly fine, managing traits are not in the scope of this bounty
  3. n could be encrypted but with obvious limitations to small integer types (probably u8) as oblivious execution would mean a worst case scenario with respect to n which becomes impossible for large integers, if it proves impossible to implement or has very bad runtime this would be taken into account. Custom return types are fine as long as they are easy to manipulate and make sense (the std rust library has been built the way it is for good reasons), but in the FHE world some of those may not be expressible as you don't know where some of the sub strings end, this is part of why this is a bounty, some unknowns around the algorithms/data handling
  4. For the branching you have the if_then_else function for integer ciphertexts, we expect the return types to be FheStrings where it makes sense, and RadixCiphertexts for boolean values which will in the end (if the algorithm is correct) just encrypt a 0 or a 1
  5. This is a good point, so trim_end can fill with 0s while trim_start may probably need to shuffle blocks back to the start of the string to be valid, there may be something smarter to do with special characters but that may not be encodable if rust uses extended ascii (8 bits) vs standard ascii (7 bits)
  6. For encrypted n several ways, the one you propose seems interesting, to note that depending on the type of n (u8 is doable, u16 less so already, u32 is probably impossible on commodity hardware and would be way too slow) it could be doable fully
tomtau commented 1 year ago

Thanks @IceTDrinker . And for "replace" / "replacen" functions, should the "to" argument be FheString or a plain string/slice?

IceTDrinker commented 1 year ago

I believe both are doable, with a complication/performance degradation if "to" is also encrypted

Lcressot commented 1 year ago

@aquint-zama Some You wrote about the main : "This executable will encrypt the first string and run all the available APIs on the input string using the pattern when necessary" Should the pattern be encrypted or not, or both ? This would change the speed of the computation so it is important every one does the same for fairness. Thanks

IceTDrinker commented 1 year ago

Hello @Lcressot submissions providing both clear and encrypted patterns would be rated better than a solution with only clear patterns, if some cases look impossible (see encrypted n for repeatn for large integer types which does not look like it can realistically work beyond u8 (and even then will be most likely very slow)) then obviously we would take that into account and in that case the relative speed/technical solutions would be evaluated. If you manage to do something clever where other submissions just ignore the difficulty by marking it as Ā« wonā€™t implement Ā» then the submission proposing a solution will likely be rated better

Lcressot commented 1 year ago

@IceTDrinker Ok great thanks, so we need to measure and display the computation time for each operation and not just globally

IceTDrinker commented 1 year ago

Absolutely @Lcressot

Lcressot commented 1 year ago

Hi, should the encrypted pattern also have a padding ? I think using padding to the pattern would lead to a huge additional complexity to some algorithms. Maybe we can implement both padding and no padding functions.

IceTDrinker commented 1 year ago

This is a good question, in the general case the pattern length can leak some amount of information about what users would be searching for.

With that said the performance could be abysmal indeed, potential ideas, we will be discussing on our end at Zama but if you have some inputs on that please feel free to share :

Lcressot commented 1 year ago

Hello, thanks for answering. My current opinion on this:

This is a good question, in the general case the pattern length can leak some amount of information about what users would be searching for.

With that said the performance could be abysmal indeed, potential ideas, we will be discussing on our end at Zama but if you have some inputs on that please feel free to share :

  • also have the length of the string as an encrypted value in the FheString struct (we can make it a 32 bits integer), not sure if that helps or notNot sure the encrypted length would help because because we would have to deal with all possible length value as well. The padding zeros already contain de encrypted length information in my opinion.
  • have an enum in the FheString, to know if the encrypted string is padded with 0s at the end or not allowing to match on both the encrypted string and the pattern to use better algorithms if they exist in the various implementations (less error prone for the users as this enum would be set during encryption)I used a boolean has_padding that is set at encryption. I consider it a trade-off between security and performance for the user to decide.
  • have different functions to manage all the cases, I'm not a fan of this option as it would clutter the API with a lot of function variations in addition to being error prone for the users

I think the algorithms could integrate an if has_padding for the pattern, maybe panic if there is padding and they donā€™t work in this case, and keep the same for the string (wether it has padding or not) considering it has an ending /0 in all cases.

LoĆÆcĀ 

IceTDrinker commented 1 year ago

So after discussion @Lcressot and @tomtau it is acceptable to have a flag indicating if the string is padded with zeros at the end or if it has a single 0, meaning the string length can be deduced by the vector it stores.

Having better algorithms for when there is no padding is accepted, submissions which cover the most difficult cases (e.g. padded input and padded pattern) would likely be rated better as the other algorithms should be more straightforward

Lcressot commented 1 year ago

Instead of duplicating all functions for both types FheString and String (encrypted or clear pattern), can we make template function that take FheString where T can be CipherText or char ? This will require in the main.rs to build both encrypted and clear FheString versions of the pattern string. Or do you guys absolutly want duplicate functions that can take directly a String as argument for the pattern ? I am asking because doing templates would greatly simplify the code.

IceTDrinker commented 1 year ago

having generic functions would be a nice touch I would say, they can then dispatch to the proper algorithm

matthiasgeihs commented 1 year ago

[ADDED 23.10.2023] A null character "\0" may mark the early end of a string but is not required (as was first asked in the bounty) as it can make some algorithms more complicated and slower.

Regarding this note: Is it OK if the FheString has a cleartext flag isZeroPadded (as suggested in https://github.com/zama-ai/bounty-program/issues/80#issuecomment-1765966098) to signal whether it is padded with zeros or not? Without a cleartext flag, I don't think we can hope for much algorithmic improvement, as we can't branch on encrypted values. @IceTDrinker

IceTDrinker commented 1 year ago

@matthiasgeihs a flag indicating the padding status is fine yes, or an enum works as well

JoseSK999 commented 1 year ago

strip_prefix and strip_suffix return the string without the matched pattern wrapped in Some, but if thereā€™s no match they return None. Should we just return the string without prefix/suffix (if thereā€™s match) and the original string (if there isnā€™t)? If we returned special characters representing None we would probably need to decrypt the output to handle a potential None before continuing with subsequent FHE operations.

LokartBC commented 1 year ago

In my opinion you can return the result along with a boolean telling wether the prefix pattern was found or not

IceTDrinker commented 1 year ago

From reading the std docs to emulate the None you have two choices, the boolean flag and returning the original string, doing a cmux on the first character setting it to the null byte (early string termination means empty string here) with said boolean flag while still encrypted, given one probably wants to chain operations, that latter option can be interesting but is slightly more costly

JoseSK999 commented 1 year ago

Do you mean treating the None as a empty string?

IceTDrinker commented 1 year ago

Yes thatā€™s a possibility, maybe there is something Iā€™m missing but it does not seem like such a bad idea from a usability perspective

tomtau commented 1 year ago

How would one distinguish between Some("") (a result of when prefix/suffix matched the entire string input) and None? Or is it ok to divert from the std behaviour?

IceTDrinker commented 1 year ago

Or two variants of that function, one matching the std 1:1 with a boolean flag, another returning an empty string

I really can't say both have advantages and disadvantages, as long as this is documented/explained I don't think it would make or break a bounty

JoseSK999 commented 1 year ago

I would say operating with str functions over a None doesnā€™t make much sense and users would probably want to work with either the string with pattern removed or the original string in the None case. Returning the boolean flag along with the correct string makes more sense to me. Or if they only wanted to get that boolean they can use starts_with and ends_with.

IceTDrinker commented 1 year ago

sounds perfectly reasonable

MakisChristou commented 11 months ago

A clarification regarding the cli arguments main.rs should expect. From

The code for a command line executable that will take an input string to be encrypted and a pattern for string functions which require it. This executable will encrypt the first string and run all the available APIs on the input string using the pattern when necessary. The results will be compared to the clear APIs provided by rustā€™s std::str, the ouptuts will be nicely formatted for the user to see that the results match (if there are errors then the bounty would not be considered valid) with timing information for the FHE version.

It looks like only 2 arguments should be provided. But to cover all cases we also need an n, from and to off the top of my head for the specific algorithms that require them (e.g. repeat requires n and replace requires from and to). It makes sense to include those as well in the required cli arguments of main.rs but asking here for clarification/sanity check. Thank you in advance.

IceTDrinker commented 11 months ago

Of course, any missing arguments are ok to take on the command line, as we did not implement the bounty ourselves there are some stuff we have not covered/thought about this thread is here for that !

MakisChristou commented 11 months ago

Another clarification regarding the added note.

A null character "\0" may mark the early end of a string but is not required (as was first asked in the bounty) as it can make some algorithms more complicated and slower. The option to have encrypted strings with a padding made with 0s is still required (to leave the option to the user to obfuscate the string's length), differentiating between the two kind of strings (padded or not) can and should be used to chose better algorithms when possible.

I am unsure whether both options are required or not. The first sentence suggests that padded strings are not required as they may make some algorithms slower. The second one suggests that a user should be able to choose whether to use padded strings or not. In my solution for example all algorithms are with padded strings except for 2. Is the point to make it explicitly clear to the user which version of strings they should use? Or should we implement both options for all algorithms? Thanks!

IceTDrinker commented 11 months ago

@MakisChristou the user must have the option to pad the input string, algorithms can use \0 to indicate the early end of a string, however a string is valid even if it does not have a \0

MakisChristou commented 11 months ago

So we can have algorithms that don't work with padded strings, but I presume they will be less favourable than their padded counterpart?

IceTDrinker commented 11 months ago

What matters at the end of the day for padded inputs and non padded inputs will be the runtime, so having an algorithm working on padded strings will most likely be more generic but slower, meaning you likely want both variants (and one likely is a derivative of the other with stronger assumptions/invariants)

MakisChristou commented 11 months ago

As for the submission do we just make our repo public after we add the link? Also I have cloned the state of tfhe-rs as it was 2 or so months ago and added my tfhe_strings folder. Is that the expectation?

aquint-zama commented 11 months ago

As for the submission do we just make our repo public after we add the link? Also I have cloned the state of tfhe-rs as it was 2 or so months ago and added my tfhe_strings folder. Is that the expectation?

Forking is indeed the way to go, you could keep the repository private until the deadline (17th of December) but still submit it through the submission form. Then, on the 18th of December you turn it public.

IceTDrinker commented 11 months ago

@MakisChristou does your code work with tfhe 0.4.1 ?

MakisChristou commented 11 months ago

@IceTDrinker Its currently using version 0.4.0 but I could upgrade it. The thing is, last I checked I could not fork and then private the repo with a free Github account. @aquint-zama.

IceTDrinker commented 11 months ago

that's a good point, 0.4 is fine I think you should be able to upgrade to 0.4.1 easily.

For the repo forking and private setting it is indeed a problem, so I guess you can fork for the moment as a public repo and have something on your work/dev machine without pushing to git yet if you are afraid your code will leak/be used by others

IceTDrinker commented 11 months ago

Hello @MakisChristou so for the version we would recommend rebasing on top of the main commit https://github.com/zama-ai/tfhe-rs/commit/ad41fdf5a5060c0a981cd0c35bf998feafe68e02 (or later) the reason being that main has had improvements in low level performance and some other algorithms which should improve the runtime of your solution

Cheers

MakisChristou commented 11 months ago

So should we use the BooleanBlock instead of BaseRadixCiphertext<Ciphertext>? This is what I was currently using as an inner to the FheAsciiChar implementation.

IceTDrinker commented 11 months ago

So should we use the BooleanBlock instead of BaseRadixCiphertext<Ciphertext>? This is what I was currently using as an inner to the FheAsciiChar implementation.

@MakisChristou No BooleanBlock is just a type we introduced that you can easily convert to Radix, the new type is to convey that some values encode a boolean value (i.e. a 0 or a 1) and should make some API usage less error prone, I think some functions may have had optimizations linked to BooleanBlock but don't take my word for it.

IceTDrinker commented 11 months ago

@MakisChristou Though for some functions if it makes sense to return a Boolean then feel free to return a BooleanBlock !

MakisChristou commented 11 months ago

I see, is the convertion one way or 2 way? Like can I convert a BaseRadixCiphertext<Ciphertext> to a BooleanBlock? My if_then_else implementation requires it for example unless I will make changes to the API.

IceTDrinker commented 11 months ago

So you should be able to convert the boolean block to a radix with into_radix, you can use convert on a RadixCiphertext to a BooleanBlock, this will do the C-like conversion where

bool my_bool = some_int != 0; 

for further support please use the FHE.org Discord channel for TFHE-rs :)

MakisChristou commented 11 months ago

So this has been somewhat discussed before but wanted to make sure I am getting the requirements right. For example in many functions like repeat or trim_start we have 2 options after applying the algorithm in the padded case. Either shuffle zeroes at the end of the string or carefully place some special character such that the client can trivially get the correct result at decryption time. Just wanted to make sure that option 2 is considered valid since it's a lot faster than the former.

MakisChristou commented 11 months ago

For example we can use a non ascii byte as a placeholder for bytes to be removed by the client. The resulting string will be the valid unpadded result.

IceTDrinker commented 11 months ago

well the special character of option 2 is a \0 right ?

MakisChristou commented 11 months ago

well the special character of option 2 is a \0 right ?

Not nessesarily. Lets say we run repeat on ā€œhello/0ā€ 2 times. Instead of for example having ā€œhello/0hello/0ā€ and applying shuffling to get ā€œhellohello/0/0ā€ we can do ā€œhelloxhelloxā€ where x can be 255u8 or something out of the ASCII range. Then the client can trivially remove the 255u8s getting ā€œhellohelloā€. In my code /0 marks the end of the string whereas this special char can mark ā€œremove this byte on the client sideā€.

IceTDrinker commented 11 months ago

can you call another of the API on the string with the special character and have a correct result ? šŸ¤” feels like a \0 with extra steps as other algorithms will need to handle that anyways ? Maybe I'm wrong there