Identity/AttachedServices/EncryptedUserData
This proposal by Brian Warner is under review as of March 15, 2013. Updates will be made as needed.
Overview
PiCL: Should User Data be Encrypted?
As the PiCL project gets closer to having real code, we need to make some decisions about how each user's data is made available to them (and, hopefully, nobody else). At one level, this means decisions about when and how to encrypt their data. I've written some proposals here:
https://wiki.mozilla.org/Identity/CryptoIdeas/03-ID-Attached-Data#Data_Protection_Classes
That page describes 3 different classes of data protection, from the point of view of the user who wants to get their data back, which form a spectrum from "available" to "confidential":
- (most available)
- class A: you need a Persona assertion to retrieve the data
- class B: you need a password (which isn't shared directly with an IdP)
- class C: you need a paired device (à la FF Sync)
- (most confidential)
Class A is the most "available": the user has the best chance of getting their data back even when they forget passwords and lose devices. Class C is the most confidential: an attacker has the worst chance of reading the user's data. Neither A nor C requires passwords. Current FF Sync uses class C for everything.
(There are variations within each class that don't strictly affect the user's access options. For example, the "A-" scheme described in that document (storing data in plaintext on servers that give it to anyone with an assertion) provides the same user-relative protection as the "A+" scheme (encrypting data with a strong key, and storing that key in a keyserver that gives it to anyone with an assertion). Likewise, encrypting the data at rest (where the decryption key is stored by the same service as the encrypted data) is good for defense-in-depth, but doesn't improve the protection class. These variations reduce the number of computers that you have to rely upon, but since the user never sees those computers anyway, they're just invisible implementation details.)
We need to decide which data goes into which class by default. Our working assumption has been:
- passwords: B
- form fill: A
- bookmarks: A
- open tabs: A
- history: A
- themes: A
with detailed/advanced preferences to let the user move any or all of them into class A, B, or C as they see fit.
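The default assignment and the per-user overrides can be pictured as a small lookup table. The sketch below is illustrative only: the names `ProtectionClass`, `DEFAULT_CLASSES`, and `effective_classes` are hypothetical and do not come from the proposal or from any PiCL code.

```python
from enum import Enum


class ProtectionClass(Enum):
    A = "A"  # retrievable with a Persona assertion
    B = "B"  # requires a password
    C = "C"  # requires a paired device


# Working-assumption defaults, copied from the list above.
DEFAULT_CLASSES = {
    "passwords": ProtectionClass.B,
    "form_fill": ProtectionClass.A,
    "bookmarks": ProtectionClass.A,
    "open_tabs": ProtectionClass.A,
    "history": ProtectionClass.A,
    "themes": ProtectionClass.A,
}


def effective_classes(overrides=None):
    """Return the defaults with the user's per-datatype overrides applied."""
    classes = dict(DEFAULT_CLASSES)
    for datatype, cls in (overrides or {}).items():
        if datatype not in classes:
            raise KeyError(f"unknown datatype: {datatype}")
        # Accept either a ProtectionClass member or its letter ("A", "B", "C").
        classes[datatype] = ProtectionClass(cls)
    return classes


# A user who moves history into class C and leaves everything else alone.
custom = effective_classes({"history": "C"})
```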
That means that a user who loses all their devices and forgets their sign-into-the-browser password, but who can still somehow read their email (maybe their email provider uses security questions to let someone gain/regain control of an account) will be able to get back all of their data except the saved passwords. Conversely, an attacker who manages to read the user's email will be able to see all of their browser data except the saved passwords. The email provider will also be able to see that class-A data, as will anyone who breaks into their systems, or coerces them (via subpoena, bribery, or other threat), or sweet-talks their tech support staff into providing access.
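The same reasoning applies to the user and to an attacker: what matters is which credentials a party holds. The sketch below makes that explicit under a simplifying assumption, namely that each class is unlocked by exactly one credential. The names `REQUIRED_CREDENTIAL` and `readable_datatypes` are hypothetical.

```python
# Simplifying assumption: one credential unlocks each class.
REQUIRED_CREDENTIAL = {
    "A": "email_access",   # enough to obtain a Persona assertion
    "B": "password",
    "C": "paired_device",
}

# Working-assumption defaults from the proposal.
DEFAULT_CLASSES = {
    "passwords": "B",
    "form_fill": "A",
    "bookmarks": "A",
    "open_tabs": "A",
    "history": "A",
    "themes": "A",
}


def readable_datatypes(held, classes=DEFAULT_CLASSES):
    """Datatypes readable by a party holding the given set of credentials."""
    return sorted(d for d, c in classes.items() if REQUIRED_CREDENTIAL[c] in held)


# A user who lost every device and forgot the password, or equally an
# attacker who can read the user's email:
print(readable_datatypes({"email_access"}))
# prints ['bookmarks', 'form_fill', 'history', 'open_tabs', 'themes']
```

The function does not care who is asking, which is the point: with these defaults, anything the user can recover through email alone is also visible to anyone else who can read that email.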
(We can probably do something clever with device-resident tokens to allow someone who has forgotten their password, but still has their cellphone, to reset the password and recover their class-B data.)
Setting everything to "B" would mean that the password is necessary to recover data: if you forget the password, that data is lost. On the flip side, an attacker reading your email doesn't get to see the class-B data either, nor does the IdP or someone who coerces them. Some number of systems (the "keyserver", in the proposal linked above) will have the ability to do a dictionary attack against the password, so the user's protection against those systems will vary according to how strong a password they're willing to manage. Studies show that most users will choose passwords that are easy to guess, but motivated users at least have the option to do better.
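To see why password strength is the deciding factor, here is a minimal sketch of a dictionary attack by a system that holds a password-derived value. The proposal does not say which key-derivation function is used or what the keyserver stores; PBKDF2-HMAC-SHA256, the salt, the iteration count, and the function names below are all assumptions made for illustration.

```python
import hashlib
import hmac

# Illustrative only: kept low so the example runs quickly.
ITERATIONS = 10_000


def stretch(password: str, salt: bytes) -> bytes:
    """Derive a fixed-length value from a password (stand-in for the real scheme)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)


def dictionary_attack(target: bytes, salt: bytes, guesses):
    """Return the first guess whose derived value matches target, else None."""
    for guess in guesses:
        if hmac.compare_digest(stretch(guess, salt), target):
            return guess
    return None


salt = b"user@example.com"  # hypothetical per-user salt
common_passwords = ["password", "123456", "qwerty"]

weak = stretch("123456", salt)
strong = stretch("correct horse battery staple 7391", salt)

print(dictionary_attack(weak, salt, common_passwords))    # prints 123456
print(dictionary_attack(strong, salt, common_passwords))  # prints None
```

Key stretching raises the cost of each guess, but it cannot rescue a password that appears near the top of every attacker's list.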
By putting data into class C, the user ties that data to a small number of paired devices. Like FF Sync, you'd use a short one-time J-PAKE code like "9502" to transfer the encryption key from one device to another. You could also print out (and type in) a 256-bit long-term key, which would look like this: T4Ec9wHtUB13Ey38rg5pPH9OljwwJGu4uFSXnWTGz+Q= . If you lose all of your devices at the same time (and didn't back up that key), you lose access to the class-C data. On the flip side, no one in the world will be able to brute-force the data: it is confidential to your devices alone. And you don't need to remember any passwords.
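The printed key above has the shape of 32 random bytes encoded in base64, which always comes out as 44 characters ending in a single "=". A minimal sketch of producing and re-reading such a key follows; the function names are hypothetical, and the choice of standard base64 is inferred from the example string rather than stated by the proposal.

```python
import base64
import secrets

KEY_BYTES = 32  # 256 bits


def new_long_term_key() -> bytes:
    """Generate a fresh 256-bit key from the OS random source."""
    return secrets.token_bytes(KEY_BYTES)


def printable(key: bytes) -> str:
    """Encode a key as text suitable for printing on paper."""
    return base64.b64encode(key).decode("ascii")


def parse_printed(text: str) -> bytes:
    """Decode a typed-in key, rejecting anything that is not exactly 256 bits."""
    key = base64.b64decode(text.strip(), validate=True)
    if len(key) != KEY_BYTES:
        raise ValueError("expected a 256-bit key")
    return key


key = new_long_term_key()
assert parse_printed(printable(key)) == key
```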
Arguments To Encrypt Everything
Crystal made the point that much data (e.g. history, open tabs, and form-fill data) is inherently ephemeral, and users would not be seriously inconvenienced or surprised if it were lost when they perform a password reset or lose all their devices. And Chris points out that history and open tabs are extremely sensitive data for many people (e.g. it might reveal my embarrassing fascination with Pokemon), even more so than bookmarks, since you can always just type in URLs from memory.
So, the argument goes, we should lean towards the "confidential" side of the spectrum for these, using class-C if available. Bookmarks and passwords are more precious, so may deserve to live closer to the "available" side (class-B or even class-A). In this approach, users would be presented with some "opt-in to availability" setup-time question like:
"Your data is encrypted to keep attackers from reading it. (C) To access your data from a new device, you must pair it with one of your old devices. (or B) To access your data, you must remember a password. Otherwise neither you (nor anyone else) will be able to see your data. You can _press here_ to make it easier for you to access your data (at the cost of making it easier for attackers to access your data too)"
If any data is class-B, the user must establish a password before sync can begin. If everything were in class-C, then no password would be necessary, but new devices would need to be paired with an existing device before use.
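The setup-time consequences follow mechanically from the class assignment. The sketch below is one way to express that rule; `setup_requirements` is a hypothetical name, and treating "any class-C data" as requiring pairing is an extrapolation from the all-class-C case described above.

```python
def setup_requirements(classes):
    """Work out what the user must do before sync can begin.

    classes maps each datatype to "A", "B", or "C".
    """
    used = set(classes.values())
    return {
        "needs_password": "B" in used,  # any class-B data requires a password
        "needs_pairing": "C" in used,   # assumption: any class-C data requires pairing
    }


# Everything in class C: no password, but new devices must be paired.
all_c = {d: "C" for d in ("passwords", "bookmarks", "history")}
print(setup_requirements(all_c))
# prints {'needs_password': False, 'needs_pairing': True}
```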
Arguments To Not Encrypt Anything
Chris made the point that many users will want class-A availability for their data (regardless of whether this is an informed decision or not), and will do work to get it, even if we don't honor that desire. If we make it hard for them (e.g. by defaulting to class-B and making them set up a password), we can predict that they'll express their "I-want-class-A!" opinion by choosing the weakest password that the system will accept (usually "123456"). Then they'll get none of the security that class-B is trying to offer, and they'll still have to type in some extra busy-work stuff all the time, making nobody happy.
So, the argument goes, we should make it easy for these folks to pick class-A, perhaps by making all the data class-A by default and having them "opt-in to confidentiality" with a setup-time question like:
"As long as you can read email for <USERID@DOMAIN>, you can get back to this data. This means that attackers who can read your email can also see your data, as well as the operators of <DOMAIN>, anyone who breaks into their system, and anyone who coerces them into providing access. You can _press here_ to add password protection to your data, or lock it down to devices that you authorized (at the cost of making it impossible for you to access your data if you forget a password or lose all of your devices)"
In fact, if all of the data is class-A, the user would not even be asked to establish a password during the setup process. Type in your email address, get a Persona assertion as usual, and you'd be done.
So, How Should We Build This?
"I want to get access to my data even if I forget those stupid passwords or lose my only device" vs "I don't want anyone else to get access to my data even if they guess my lousy password or coerce my IdP"
I respect both attitudes, and I think both represent valid definitions of the word "safe" (as in "please keep my data safe"). So the best we can do is:
- provide as much of both Availability and Confidentiality as we can without making the user pay attention or decide anything
- then, when they must decide, give the user a fighting chance of learning enough about the tradeoffs to make it an informed decision that will make them happy in the long run
By using the keyserver described in the document above, we can remove the storage servers from the set of machines that class-A data has to rely upon, leaving just the IdP and attackers who can compromise it (or read the user's email). This is the same level of protection that most web services provide today. It's probably the most confidentiality we can offer while still providing the email-based level of availability that most web services offer (and is thus what a lot of users expect).
But for users who want more confidentiality, or who can tolerate less availability, we have the technology to do better. The keyserver is the only machine capable of trying a dictionary attack on class-B data, and the class-C data is protected even against that (and has no passwords to remember). I'm proud to have helped with the design of FF Sync's class-C approach, and every single security-minded person I've talked to loves it. I'm really hoping PiCL will make it an option for folks like them.
I expect that we'll wind up with some kind of cartoon that tries to explain the differences, and the choice that is available to the user. Maybe a setup page that says "Are you more worried about losing access to your data, or knowing that other people might look at your data?". And we might force the issue by offering a radio-box with no default setting. But I expect that any approach will impose some sort of bias, so we have to make *some* kind of choices for the users.
Which should they be?
What Do You Think?
This is a thorny problem, and (if I've explained it well) ought to provoke a lot of controversy. Please help us! If you were setting up browser data synchronization, which data-protection class (A, B, or C) would you pick for each datatype? What would you recommend to your family, your friends, your grandparents? Can you think of better ways to explain this choice? Or recovery options that we haven't thought of?