In this explainer, science communicator Anjana Nair spells out what federated learning is, why it is important and how it is being used. The research team at the Open Data Institute (ODI) is exploring how federated learning can be deployed to support responsible data stewardship. We want to understand the scope of such technologies to facilitate safe access to sensitive data. Ultimately, this research will help developers, researchers and organisations to follow best practice guidelines for data sharing and management.
A huge amount of data is generated every day. This is increasing exponentially due to the use of technologies like social media, cloud services, and the Internet of Things (IoT). All of this data has immense value potential for businesses and society, but those who hold data can be reluctant to share it due to concerns over privacy, mistrust and other risks. This means a lot of this valuable data remains unavailable. Advances in privacy-preserving technologies such as federated learning can offer a promising solution to these issues, and provide a more secure infrastructure for data sharing.
What is federated learning?
The traditional process for developing a machine learning application is to gather a large dataset, train a model on that data, and run the trained model on a cloud server. To do this, a user’s data has to be sent to the server that hosts the model. However, not all data can be shared in this way – especially sensitive personal data. This is where federated learning comes into play.
Federated learning does not send raw data to the machine learning model, but instead brings the model to the data. The model is trained locally on each device, and the data never leaves its original location.
So now we know what federated learning is, but where does it apply?
A popular example of federated learning in action is Gboard. This is an Android keyboard which uses federated learning to enable next-word prediction. The keyboard learns new words and phrases from the user’s typing patterns, and does so without needing to send the data to the central server (in this case, Google’s). But not every federated learning model works in the same way.
Based on the participating clients and training scale, the research team at the ODI has found it useful to divide federated learning into two basic types: cross-device and cross-silo. The differences are based on the number of parties involved and the data each party stores.
Let’s understand the two in detail…
Cross-device federated learning
Cross-device federated learning is popularly used in consumer digital products – like Google’s Gboard mobile keyboard. Apple also uses cross-device federated learning for its voice recognition application, Siri.
In this model, clients (devices or organisations where raw data is held) are typically small distributed entities – like smartphones, wearables, and IoT devices. These devices are likely to hold a relatively small amount of local data, which means a large number of devices (sometimes even millions) need to participate in the training process.
Data here is generated locally and mostly remains decentralised: each client stores its own data and cannot read the data of other clients. Because each device only sees its own user’s activity, the data is not identically distributed across clients, and only a fraction of clients are available at any one time. A central server organises the training, but it never sees the raw data. The advantage here is that there is no exchange of raw data.
So how does it work in action? Let’s look at the Gboard example. Gboard learns from your typing and dictation. Instead of your writing patterns and texts being sent to a central server, the server sends an initial central model to the device (your phone). The device, in turn, trains this central model with its own data. Once the model is trained locally on the device (in this case, your phone), only the trained model – and not the underlying data – is sent to the central server.
This means your private messages stay on your phone, and have not been sent to the server. The server then combines the updated models from each of the devices and creates an updated central model. This process can then start over again to further improve the model. You see, it is the model that is being shared and not your data.
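To make this concrete, here is a minimal sketch in Python of the ‘federated averaging’ pattern described above. It is purely illustrative and not Google’s actual implementation: the ‘model’ is just a list of numbers, the clients are simulated in one process, and names such as local_train and federated_averaging are hypothetical.

```python
# Minimal, illustrative federated averaging sketch (not a real FL framework).

def local_train(global_weights, local_data, learning_rate=0.1):
    """Simulate one client nudging the global model towards its own data.
    The raw local_data never leaves this function - only updated weights do."""
    weights = list(global_weights)
    for example in local_data:
        for i in range(len(weights)):
            # Toy update step: move each weight a little towards the example
            weights[i] += learning_rate * (example - weights[i])
    return weights

def federated_averaging(client_datasets, rounds=5):
    """The central server coordinates training but never sees raw data."""
    global_weights = [0.0, 0.0]  # initial central model
    for round_num in range(rounds):
        # Each device trains locally and returns only its model weights
        client_updates = [local_train(global_weights, data) for data in client_datasets]
        # The server averages the returned weights into a new central model
        global_weights = [
            sum(update[i] for update in client_updates) / len(client_updates)
            for i in range(len(global_weights))
        ]
        print(f"Round {round_num + 1}: central model = {global_weights}")
    return global_weights

# Three simulated phones, each holding a small amount of private data
clients = [[1.0, 1.2, 0.9], [2.1, 1.9], [1.5, 1.4, 1.6, 1.3]]
federated_averaging(clients)
```

The key point of the sketch is that only the weights travel to the server; the lists of raw examples stay on each simulated device.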
Cross-silo federated learning
Cross-silo federated learning is most popular in domains like financial risk prediction, pharmaceutical drug discovery, electronic health record mining, medical data segmentation, and even smart manufacturing.
Compared with cross-device federated learning, the number of participants is small, usually ranging from two to 100, with each client expected to participate in the entire training process.
The cross-silo model involves training a model on siloed data – data stored in standalone systems that are often incompatible with other datasets. Although the clients are fewer in number, the volume of data each carries is huge, and they typically have substantial computational capacity. An example of this is the collaboration between Moorfields Eye Hospital and Bitfount for secure healthcare research to improve early detection of eye diseases. The hospital benefited from rich patient datasets without centralising the data in one place, while maintaining patient confidentiality.
The simplest way to understand the difference between the two is that cross-device federated learning is mostly associated with a large constellation of relatively homogenous devices, and cross-silo federated learning is mostly associated with organisations.
Since these two types of federated learning have rather different set-ups and requirements, the team at the ODI is exploring them separately. Based on the research done so far, we have identified that currently the most established type is cross-device federated learning.
Now that we have an idea of the basic architecture of federated learning, what makes this technology special?
Unlike traditional machine learning practices, federated learning opens up a world of new opportunities for training models while helping to maintain data confidentiality. In fact, there are a number of reasons why federated learning is appealing.
First, it is collaborative. Federated learning allows devices such as mobile phones to learn a shared prediction model together. This approach keeps the training data on the device rather than needing the data to be uploaded and stored on a central server.
Second, it saves time. In federated learning, datasets are stored locally, which reduces time lag: the device can access its data and use the model without connecting to a central server, and the data stays in its original location even when there is no internet connection.
Third, it can be extremely diverse. Federated learning lends itself to more data diversity, as the centralised model is continuously learning from different devices or organisations rather than one dataset. This results in a more generalisable and inclusive model.
A good example of cross-device use is the IoT environment, where many devices, such as microcomputers and mini PCs, are involved. These devices hold data from a wide variety of users in different geographic locations, and their contributions keep the central model continuously updated, improving the overall system.
Lastly, federated learning could also have potential environmental benefits. This is one of the many interesting advantages that we have found. Unlike traditional training of machine learning models, federated learning involves training models across a large number of individual machines. A team of researchers at the University of Cambridge found that this can lead to lower carbon emissions than traditional learning methods. Federated learning may therefore be relevant to more environmentally sustainable training of large machine learning models in the future.
All this sounds amazing, but…
While federated learning opens the door to more collaborative machine learning, it also comes with its own unique set of challenges:
The privacy challenge: Communication is a serious bottleneck in federated learning networks, where data created on every device remains local. Although federated learning helps to protect data generated on a device by sharing model updates rather than raw data, the updates communicated throughout the training process can still reveal sensitive information, either to a third party or to the central server. So it doesn’t fully solve the privacy problem when used in isolation from other technologies and processes.
Our findings so far suggest that privacy is not the main benefit of federated learning. The connection between federated learning and privacy (or security) is an open theme that is subject to ongoing research. Uncertainties around the privacy guarantees in federated learning can make it difficult to assess the risks of using the technology, and therefore also to decide whether to develop and deploy it in the first place.
The security issue: In federated learning, opportunities still exist for an attacker to get information about the data. There is a range of potential attacks against federated learning systems, such as backdoor attacks, poisoning attacks, and inference attacks – some of which do not even exist in traditional machine learning settings (or are harder to detect and mitigate). However, federated learning combined with other privacy technologies such as differential privacy or secure aggregation can keep the data more secure.
That said, the ODI’s preliminary findings suggest that cross-silo federated learning, where heterogeneous data (data that varies in type and format) is being used from various collaborating organisations, can potentially increase the security of the training process. This is because the coordinating organisation doing the training would be aware of the different data formats from the different data sources, which could pose an additional layer of complexity for would-be attackers to contend with.
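As a rough illustration of how a client might protect its update before sharing it – combining federated learning with a differential-privacy-style step, as mentioned above – the sketch below clips an update and adds Gaussian noise. It is a simplified, hypothetical example using only Python’s standard library (the function names are our own), and it does not by itself provide a formal privacy guarantee.

```python
import math
import random

def clip_update(update, max_norm=1.0):
    """Limit the overall size (L2 norm) of a client's update so no single
    client dominates the average or leaks an unusually strong signal."""
    norm = math.sqrt(sum(w * w for w in update))
    if norm > max_norm:
        update = [w * max_norm / norm for w in update]
    return update

def add_gaussian_noise(update, noise_std=0.1):
    """Add random noise to each weight, in the spirit of differential privacy,
    so the server cannot reconstruct exactly what a client learned."""
    return [w + random.gauss(0.0, noise_std) for w in update]

def privatise_update(raw_update):
    """What a client might do before sending its update to the server."""
    return add_gaussian_noise(clip_update(raw_update))

# Example: a client's raw model update, then the version actually shared
raw_update = [0.8, -1.5, 0.3]
print("raw update:   ", raw_update)
print("shared update:", privatise_update(raw_update))
```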
Too much heterogeneity: The devices participating in the training process may significantly differ in terms of storage, computational ability, power supply, and network connectivity capabilities. Therefore, the approach must ensure fault tolerance (ie systems’ ability to keep functioning) as user devices may drop out before finishing the training cycle.
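To show why fault tolerance matters in practice, here is a small, hypothetical sketch of a server that samples a subset of clients each round and averages only the updates that actually come back, so devices dropping out mid-round do not break training. It reuses the toy-model idea from the earlier sketch and is not based on any particular framework.

```python
import random

def train_round(global_weights, client_datasets, sample_size=3, dropout_rate=0.3):
    """One round of training that tolerates clients dropping out mid-round."""
    # The server only contacts a sample of clients, not every device
    selected = random.sample(client_datasets, min(sample_size, len(client_datasets)))
    completed_updates = []
    for local_data in selected:
        # Some devices lose power or connectivity before finishing
        if random.random() < dropout_rate:
            continue  # skip clients that dropped out this round
        local_mean = sum(local_data) / len(local_data)
        update = [w + 0.1 * (local_mean - w) for w in global_weights]
        completed_updates.append(update)
    if not completed_updates:
        return global_weights  # nobody finished; keep the previous model
    # Average only the updates that actually arrived
    return [sum(u[i] for u in completed_updates) / len(completed_updates)
            for i in range(len(global_weights))]

clients = [[1.0, 1.2], [2.0, 1.8], [1.5, 1.6], [0.9, 1.1], [2.2, 2.4]]
model = [0.0, 0.0]
for round_num in range(3):
    model = train_round(model, clients)
    print(f"After round {round_num + 1}: {model}")
```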
The future of federated learning
Federated learning offers great opportunities for machine learning models to retain their accuracy without risking users’ confidentiality. Its applications, from everyday instances of better predictive text to more critical use cases of improving clinical diagnostics, are just scratching the surface. We see the Brave browser using this technology to recommend news to privacy-conscious users. Personalisation and recommendation services are classic examples of cross-device federated learning.
In the coming years, we are likely to see more use cases of cross-silo federated learning. NVIDIA has been working on implementing cross-silo federated learning in the medical imaging space already. Cross-silo federated learning has the potential to establish itself as a major competitive, technological, and scientific advantage for organisations. We also anticipate that we will see other applications of federated learning models and will better understand specific aspects of its development process.
Take part in the ODI’s research
We’re looking to produce a range of case studies that demonstrate current challenges and capabilities of federated learning and other privacy-preserving technologies.
We may also want to conduct a set of additional interviews with stakeholders and data institutions to clarify outstanding questions and explore certain areas in more depth. Some open questions for further exploration include the regulation and ethics of federated learning, choosing the right governance model, and exploring its use for educational and charitable purposes.
If you want to get involved, get in touch by emailing [email protected]