Capture Hidden Trends - Use Cases for Private and Decentralized ML Training
2025 May 30

Since the beginning of this year, I've been exploring the intersection of cryptography and machine learning and thinking about what's important to work on in the long term. In my last post, I shared a technical overview of the first iteration of my new project: Publicly Verifiable, Private & Collaborative AI Training (for brevity, I'll call it private & decentralized ML model training from now on).
To summarize, I prototyped a system that allows mutually distrusting parties in a decentralized protocol to collaboratively train a machine learning model, without exposing their private dataset to one another. All participants in the system use zero-knowledge proofs to verify the integrity of their local computations, including client-side training and server-side model aggregation.
In this post, I will explore potential use cases and social implications of this technology that I've been reflecting on.
Table of Contents
- Capture Hidden Trends
  1.1 Private Data Exists in Silos
  1.2 Structural Privilege in Data Collection
  1.3 Pull-style → Push-style Data Science
- Usecase 1: Crowdsourced Health Data Analysis
- Usecase 2: Private Fine-tuning for Vulnerable Subgroups
  3.1 Tailor-made Models for Marginalized Communities
  3.2 Exporting Crypto Credit Score to TradFi for the Unbanked
  3.3 Model Merging for Intersectional Demographics
- Usecase 3: Recommendation System for dApps
- Usecase 4: Privacy-preserving Model Training for Decoding Biometric Data
- Note on Verifiability and Bonus Project Idea
  6.1 Verifiability for Malicious Adversary
  6.2 Bandwidth-Efficient Edge Device Training
- End Note
Capture Hidden Trends
Before diving into each use case idea, I want to talk about a recurring theme among them, which has shaped the direction of this project.
Private Data Exists in Silos
Definition. Data point (noun): an identifiable element in a dataset.
Definition. Data / Dataset (noun): facts and statistics collected together for reference or analysis.
First, data points, by definition, become meaningful in relation to other data points. Let's say I step on a scale today and see some number. If there are no other weights (either mine or other people's) to compare it to, this number alone gives me no insight. When individual data points are grouped together, they form a dataset: something that can be analyzed to extract patterns or insights.
Second, some data points exist in silos, and only those in positions of power and those with access to sufficient infrastructure are able to collect them (not necessarily with proper consent from the data owners, but that's another point) and form a dataset.
For example, imagine I wanted to compare my income to that of other female, Asian cryptography researchers living in Europe. This would be extremely difficult for the following reasons:
- As an individual unaffiliated with any scientific institution, I have no way to directly coordinate with people in this specific demographic to collect such data.
- Even if a global income dataset existed, filtering it by such personal attributes (female, Asian, based in Europe) would be nearly impossible due to privacy concerns.
Structural Privilege in Data Collection
I see a structural privilege here. Data tends to get collected and studied when powerful institutions decide to do so, and in a way that they design.
For those of you who are unfamiliar with the concept of structural privilege (or oppression): historically, various systems in society have been designed by a dominant group in a way that serves their own interests, intentionally or unintentionally. As a result, marginalized groups have faced implicit systemic disadvantages, often because their needs are not reflected during the design process.
A prime example is the history of voting rights in the US. More specific to data science, there is a category of diseases called Neglected Tropical Diseases (NTDs), which are common in low-income populations in developing countries. They affect over 1 billion people worldwide, yet remain under-studied due to a lack of market incentive, since pharma companies make little profit from treating poor populations.
Another example of structural disadvantage appears in automotive safety testing, where crash tests have long prioritized dummies modeled on the "average male body." Since the 1960s, when testing started, average female dummies were either absent or used in ways that ignored key anatomical differences, often justified by funding constraints. As a result, research has shown that women have faced up to 73% higher risk of fatality or serious injury in car crashes. It is reasonable to infer that this systematic exclusion of women from safety design decisions is closely linked to the male-dominated nature of the automotive industry (another resource that explains the historical context is here).
As we can see from these examples, the disparity between prioritized and ignored data stems from the combination of the pursuit of profitable research (which isn't limited to industry; academic research also depends on funding, often prioritizing data collection and analysis aligned with industry interests) and the dominance of privileged groups in decision-making positions.
Pull-style → Push-style Data Science
What happens if we can gain more agency over which of our data gets collected and how we make use of it? More precisely, what if we complemented¹ the conventional pull-style data science, where institutions decide which data is worth collecting, with a push style, where individuals proactively contribute their data (in a privacy-preserving way, otherwise this doesn't work)? Such a shift could enable collaborative data analysis among people who share similar interests, goals, or curiosities.
I believe there are many hidden patterns within private data scattered across the world. There are invisible trends embedded in missing datasets: datasets that should exist but don't yet, because of structural oppression, and because the individuals who hold this data haven't had the means to coordinate effectively due to privacy concerns.
Perhaps what society ignores, or actively hides, tells us more about the world than what it highlights. With decentralized & private ML model training, we can extract these patterns without exposing the underlying data itself, and make the invisible visible, on our terms².
(1: I use the word complement intentionally here. I don't mean to dismiss the work that institutional data scientists have done so far, nor am I trying to create a dichotomy where centralized data science is "bad" and decentralized data science is inherently "good." However, I believe more and more individuals without formal academic training or institutional affiliation will become capable of conducting valuable experiments and data analysis. I'm curious to see what hidden truths might emerge if independent researchers with more diverse backgrounds and original perspectives are given free access to whatever datasets they are curious to study.)
(2: This type of ground-up data collection isn't a completely new initiative. Scholars have coined terms such as counterdata, data that is collected to contest a dominant institution or ideology, to describe the concept.)
Usecase 1: Crowdsourced Health Data Analysis
This use case idea represents the theme I described in the above section quite clearly. It enables individuals to contribute ("push") their data in a privacy-preserving manner to uncover patterns within a specific demographic. Data contributors could verify that they belong to a target demographic (again, preserving privacy, for instance via ZK) and perform local training on their own data. This would exactly allow us to "capture hidden patterns" within private datasets that have traditionally been difficult to collect in one place. That said, I still need to think more about whether we can assume each individual holds enough data to train a meaningful model, which depends on the specific use case. If they only hold a single data point, which is obviously insufficient to train a model alone, then contributors might instead submit their data to MPC nodes and delegate training over the larger volume of data collected from various contributors. That shifts the architecture closer to Hashcloak's noir-mpc-ml rather than my prototype based on "zk-federated-learning."
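To make the flow concrete, here is a minimal sketch of the push-style contribution loop: each contributor trains on their own data locally, and only the resulting model (together with, in the full protocol, a zero-knowledge proof) leaves the device for aggregation. This is purely illustrative; `prove_training` is a hypothetical placeholder, not an API from my prototype or from noir-mpc-ml.

```python
# Minimal sketch of the "push"-style flow: contributors train locally and only
# share model updates plus (in the real system) a proof of correct training.
import numpy as np

def local_train(X, y, w, lr=0.1, epochs=50):
    """Plain logistic-regression training on a contributor's private data."""
    for _ in range(epochs):
        preds = 1.0 / (1.0 + np.exp(-X @ w))
        grad = X.T @ (preds - y) / len(y)
        w = w - lr * grad
    return w

def prove_training(w_before, w_after):
    """Hypothetical placeholder for a succinct proof of correct local training."""
    return {"claimed_update": w_after - w_before}  # a real ZK proof would go here

def aggregate(models, sizes):
    """FedAvg-style weighted average of local models."""
    total = sum(sizes)
    return sum(w * (n / total) for w, n in zip(models, sizes))

# toy run with synthetic "health" data from three contributors
rng = np.random.default_rng(0)
global_w = np.zeros(4)
local_models, local_sizes = [], []
for _ in range(3):
    X = rng.normal(size=(20, 4))                      # private features
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)   # private labels
    w = local_train(X, y, global_w.copy())
    proof = prove_training(global_w, w)               # verified by other parties
    local_models.append(w)
    local_sizes.append(len(y))

global_w = aggregate(local_models, local_sizes)
print("aggregated model:", np.round(global_w, 3))
```

In the MPC-delegation variant mentioned above, the `local_train` step would instead run jointly across MPC nodes over the pooled (still secret) data, but the aggregation logic stays conceptually the same.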
Usecase 2: Private Fine-tuning for Vulnerable Subgroups
This is the idea I'm personally most excited about. Suppose we have a pre-trained foundation model (like an LLM) out there, and some blockchain nodes hold a specific dataset representing a marginalized, smaller community. This kind of dataset is difficult to collect in a "pull" style, due to sensitive attributes such as race, gender, disability status, sexual orientation, etc., as I explained in the first section. (Guy Rothblum, a research scientist at Apple, explains that "it can be perfectly appropriate and necessary to use sensitive features (for ML), but frustratingly, it's sometimes difficult for legal reasons in the US" in this lecture from the Graduate Summer School on Algorithmic Fairness.) So instead, what if each client with a private dataset could locally fine-tune a foundation model and capture nuanced patterns unique to this specific subgroup? Those are patterns that are often overlooked or averaged out in a global model trained on a vast dataset.
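As a rough illustration of this client-side step, here is a minimal sketch in which a client keeps a shared pre-trained backbone frozen and fine-tunes only a small head on its subgroup's private data. The backbone and data below are toy placeholders (not a real foundation model), and the proof and aggregation layers are out of scope here.

```python
# Sketch of local fine-tuning on private subgroup data: freeze the shared
# backbone, train only a small head, and keep the raw data on the client.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(16, 32), nn.ReLU())  # stand-in for a pre-trained model
head = nn.Linear(32, 2)                                  # small, locally trained layer

for p in backbone.parameters():
    p.requires_grad = False                              # keep shared knowledge fixed

opt = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# private subgroup data (synthetic here); it never leaves the client
X = torch.randn(64, 16)
y = torch.randint(0, 2, (64,))

for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(head(backbone(X)), y)
    loss.backward()
    opt.step()

# only the fine-tuned head (plus, in the full protocol, a proof of correct
# training) would ever be shared or aggregated
print({k: v.shape for k, v in head.state_dict().items()})
```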
(Note: Initially I was vaguely thinking that decentralized AI training can reduce algorithmic bias. I still believe it could mitigate the problem, but I think "reduce" is the wrong phrasing. I would say machine learning is inherently a technology for creating bias. It generalizes some patterns within a group and predicts outcomes for unseen data points, assuming that this pattern persists. This directly fits the definition of creating and using bias. So I would argue the only way we can make fair use of it (at the cost of more customization/less automation) is to narrow down the scope of its usage and carefully design the training dataset accordingly.)
Tailor-made Models for Marginalized Communities
For example, this "narrowly scoped, tailored" model can be used in tasks such as financial risk assessment, medical detection, and hiring for marginalized communities. Institutions that care about creating more fairness and equal opportunities for these communities, such as Community Development Financial Institutions (CDFIs) or Out for Undergrad (O4U), would be interested in building this tailor-made model without having to collect the required training data containing sensitive attributes. What's even cooler is that companies and public institutions would be able to publicly verify their design of a training dataset tailored to specific communities, so that they can prove their commitment to fairness towards these groups.
Exporting Crypto Credit Score to TradFi for the Unbanked
Another potentially impactful idea is to privately build a credit scoring model for the unbanked, including their real-world sensitive attributes while keeping them private. This model could then be exported to traditional financial institutions, signaling the patterns of real-world personal attributes for those who have been responsibly borrowing money in crypto. This would create new financial opportunities and pathways to high-street banks, even for those who began with zero credit in traditional finance.
Model Merging for Intersectional Demographics
Additional idea: I'm curious to see what happens if we merge each of these fine-tuned models and build an intersectional model. I suppose such a combined model would generalize patterns in intersectional identities better than the individual models alone. For example, in a hiring context, merging a model that identifies strong candidates from one minority group (e.g. Hispanic people) with another focused on a different group (e.g. women) could improve performance for those who belong to both groups (e.g. Hispanic women). As another example, you could ask a question like "What's the likelihood of White male vegans developing osteoporosis?" This kind of question involves overlapping personal identity factors that single-group models may not capture well.
Following these examples, I believe model merging techniques could be extremely powerful. If we have access to models trained on private data from smaller demographic groups, we can combine them to build custom models tailored to even more niche communities we want to make predictions about.
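As a naive illustration of the basic idea (and nothing more), here is a sketch that merges two fine-tuned models with identical architectures by element-wise weighted averaging of their parameters. Real merging methods, such as the evolutionary approach mentioned next, are considerably more sophisticated.

```python
# Naive model merging: interpolate the parameters of two models fine-tuned on
# different subgroups. Both models must share the same architecture.
import torch

def merge_state_dicts(sd_a, sd_b, alpha=0.5):
    """Element-wise interpolation: alpha * A + (1 - alpha) * B."""
    return {k: alpha * sd_a[k] + (1 - alpha) * sd_b[k] for k in sd_a}

# model_a: fine-tuned on one subgroup, model_b: fine-tuned on another
model_a = torch.nn.Linear(16, 2)
model_b = torch.nn.Linear(16, 2)

merged = torch.nn.Linear(16, 2)
merged.load_state_dict(merge_state_dicts(model_a.state_dict(), model_b.state_dict()))
# 'merged' can now be evaluated on the intersectional group of interest
```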
On that note, one interesting model-merging method is Evolutionary Model Merge. What's special about this method is that it automates merging models with different specialties and modalities, such as a Japanese LLM with mathematical reasoning capability or an image generation model.
Usecase 3: Recommendation System for dApps
This idea may be less novel, but it's likely the most realistic use case in my opinion. As we all know, decentralized applications (dApps) have pseudonymous or anonymous users who are often privacy-conscious, which makes it difficult for dApp service providers to collect personal user profiles or track in-app behavior. This creates a challenge for building personalized recommendation systems, which traditionally depend on collecting a large volume of personal data on a central server for training ML models. If decentralized & private model training can scale to support millions of clients, or allow delegating such training to MPC nodes (which is more realistic), dApps could deliver personalized experiences without compromising user privacy. (My attempt at developing such an application)
Usecase 4: Privacy-preserving Model Training for Decoding Biometric Data
This idea is a bit of a jump from the others, but it was actually the initial motivation that led me to research private ML model training. At the end of 2024, I was introduced to the field of brain-computer interfaces (BCI). I learned that after capturing brain signals with whatever method (e.g. EEG or ultrasound), BCIs typically involve a "decoding" process that interprets raw brain wave data into meaningful labels, such as physiological states, based on frequency. (For example, delta waves at 0.5-4 Hz are associated with deep sleep or unconsciousness, while beta waves at 13-30 Hz are linked to alertness and active thinking.) This decoding is generally powered by machine learning model inference. Based on currently public information, companies seem to rely on labeled datasets collected in clinical or research environments to train these models. However, it's reasonable to assume they may eventually seek to collect training data directly from end users. This may merely be my speculation, but if it actually happens, it would raise serious privacy concerns and be subject to strict regulation. (You might remember WorldCoin was suspended in some European/African/Asian countries for failing to demonstrate proper handling of iris data.) Even in a world where "privacy doesn't sell," regardless of how end users would feel, it won't be easy for private companies to collect such sensitive biometric data and use it for business. In the near future, I believe there will be demand for introducing privacy-preserving training methods to commercial companies that handle biometric data, enabling model improvement without forcing users to compromise their sensitive data.
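To give a feel for what this "decoding" step looks like, here is a toy sketch that converts raw EEG-like signals into per-band spectral features (delta, theta, alpha, beta) and trains a small classifier on them. The signals and labels are synthetic, and a real BCI pipeline would be far more involved; this only shows the kind of model that would need to be trained privately in the scenario above.

```python
# Toy EEG "decoding": band-power features via Welch's method + a small classifier.
import numpy as np
from scipy.signal import welch
from sklearn.linear_model import LogisticRegression

FS = 256  # sampling rate in Hz
BANDS = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}

def band_powers(signal, fs=FS):
    """Average spectral power in each frequency band."""
    freqs, psd = welch(signal, fs=fs, nperseg=fs * 2)
    return [psd[(freqs >= lo) & (freqs < hi)].mean() for lo, hi in BANDS.values()]

# synthetic data: label 0 ~ "deep sleep" (delta-heavy), label 1 ~ "alert" (beta-heavy)
rng = np.random.default_rng(0)
t = np.arange(0, 4, 1 / FS)
X, y = [], []
for label, freq in [(0, 2.0), (1, 20.0)]:
    for _ in range(20):
        sig = np.sin(2 * np.pi * freq * t) + 0.5 * rng.normal(size=t.size)
        X.append(band_powers(sig))
        y.append(label)

clf = LogisticRegression(max_iter=1000).fit(X, y)
print("training accuracy:", clf.score(X, y))
```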
Note on Verifiability and Bonus Project Idea
Verifiability for Malicious Adversary
I've been exploring additional motivations for adding verifiability to federated learning (FL), beyond the aforementioned cases of deploying FL on a decentralized network (where participants are mutually distrusting and thus require proof of correct local computations).
In the cryptography world, this is a setting that demands security against malicious adversaries, as opposed to the semi-honest (or honest-but-curious) adversary model. (A helpful explanation can be found here.) Traditionally, federated learning has been applied in collaborations where a baseline level of trust or business alignment already exists (mostly equivalent to the "semi-honest" setting), such as between different branches of the same bank (e.g., U.S. and European divisions), or across hospitals within the same region. In these cases, FL is often used not because the parties distrust each other, but because data sharing is restricted by regulations like GDPR. However, a general trend in ML training is that architectures have been shifting toward distributed edge-device training for better scalability. Edge-device training fits exactly the definition of a setting that requires security against malicious adversaries.
Bandwidth-Efficient Edge Device Training
And here is a new idea to gain even more efficiency by utilizing verifiability: in some cases, local models trained on edge devices can reach comparable accuracy even if their parameters differ slightly. That means they may not need to synchronize with the central server to build a global model as frequently. During these "idle" periods, each edge device could instead submit a succinct proof attesting that:
- Their model was trained correctly, and
- The resulting accuracy remains within an acceptable bound.
This approach could significantly reduce the bandwidth and computational cost required to aggregate local models on the central server, compared to transmitting full model updates each round.
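Here is a minimal sketch of what the end-of-round decision on an edge device might look like under this idea. Everything here is hypothetical: the tolerance value, the message format, and especially `prove_local_training`, which stands in for whatever proving backend would generate the succinct proof.

```python
# Sketch of the per-round decision on an edge device: send only a succinct
# proof during "idle" periods, fall back to a full update when accuracy drifts.
from dataclasses import dataclass

ACCURACY_TOLERANCE = 0.02  # agreed-upon bound for skipping synchronization

@dataclass
class RoundMessage:
    kind: str        # "proof" or "full_update"
    payload: object  # proof bytes or serialized model weights

def prove_local_training(weights, accuracy):
    """Hypothetical placeholder: succinct proof of correct training and of the
    claimed accuracy bound."""
    return b"...proof bytes..."

def end_of_round_message(weights, local_acc, global_acc):
    if abs(local_acc - global_acc) <= ACCURACY_TOLERANCE:
        # "idle" period: attest correctness and accuracy, skip the heavy upload
        return RoundMessage("proof", prove_local_training(weights, local_acc))
    # drifted too far: send a normal federated update for aggregation
    return RoundMessage("full_update", weights)

# example: the device's model is within tolerance, so only a proof is sent
msg = end_of_round_message(weights=[0.1, -0.4, 0.7], local_acc=0.91, global_acc=0.90)
print(msg.kind)  # -> "proof"
```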
End Note
In this post, I listed potential use cases and project ideas for Publicly Verifiable, Private & Collaborative AI Training. I'd immensely appreciate feedback from experts in the relevant fields. I'm also currently conducting this research independently and looking for organizations that can host me to further develop this work in partnership with external teams or clients. If you're interested, please reach out to: yuriko dot nishijima at google mail.
Special thanks to Shintaro, Lucas (you can check his commentary on this post, written from his molecular biology research background), and Octavio for valuable feedback and insightful discussions.
If you have any feedback or comments on this post and are willing to engage in a meaningful discussion, please leave them in the HackMD draft: https://hackmd.io/@yuriko/Bk28WMRxgl