Zupeng Zeng, Ashlynn Parker, Thomas Le Gouellec, Ronaldo Lopez
DATA 202 - TR Reddit Group
Reddit Policy Moderation Report - Final Report
Problem and Recommended Solution
Reddit recently changed its API terms so that it can monetize what it sees as the future of its business: licensing Reddit’s data to train large language models (LLMs) from companies such as Google, OpenAI, and Microsoft. The change has caused a host of problems. It hurts independent developers, and it raises data privacy concerns, especially about EU residents' data and compliance with the EU's GDPR. It also raises the thorny question of training AI on social media datasets, a practice that has led to AI systems producing racist and violent output in the past.
Recommended Policy Changes
We propose a comprehensive revision of data handling policies to ensure transparency, user control, and compliance with international data privacy laws. This includes clear mechanisms for obtaining explicit user consent, transparent communication about data usage, and easy opt-out options. Each component is explained below:
Explicit Consent: Provide clear, accessible, and user-friendly mechanisms for obtaining consent.
Transparent Communication: Communicate the purposes of data collection, how data will be used, and the benefits Reddit receives from user data, including monetary gains.
Opt-Out Option: Offer easy ways for users to opt out of data collection and usage for AI training, respecting user preferences and privacy rights.
Background for EU GDPR & AI Regulations
The Reddit API controversy becomes even more complicated in light of European Union data privacy laws, which are among the strictest in the world. Reddit’s plan to profit from its large trove of data by selling it to train large language models runs up against two sizable pieces of European legislation: the General Data Protection Regulation (GDPR) and the soon-to-be-ratified EU AI Act. GDPR is a body of regulations that governs the processing of personal data of people in the EU/EEA, including processing that takes place outside Europe. It is aimed largely at multinationals, at a time when enormous amounts of user data are stored and processed in data centers located mainly in the United States.
Essentially, GDPR focuses on giving individuals control over their data. It limits the types of user data that companies can process and restricts data targeting through requirements for deletion, opt-outs, and consent. The EU AI Act is a separate body of legislation, currently being voted on in the European Parliament, that would restrict and control the types of AI being created and their uses. Both, and especially GDPR, present significant obstacles to Reddit's plan to sell its user data for training large language models, since that data would inherently be processed overseas by non-European companies or on non-European servers.
The largest issue in the GDPR and Reddit API controversy is that none of Reddit’s EU users consented to the sale and use of their information to train these LLMs, which would directly violate GDPR's transparency and consent requirements. To avoid running afoul of GDPR and potential EU AI Act regulations, Reddit would need to institute a stringent policy of user consent, opt-out, and data protection and deletion. Otherwise it risks massive fines of up to 4% of its global annual revenue per violation. The policy could be presented both at account creation and in a pop-up the first time existing users access the site after the change.
Justifications
We have chosen to address this issue due to its significant impact on the legal, ethical, and public relations dimensions of Reddit’s operations. Safeguarding data privacy and ensuring ethical AI practices not only positions Reddit favorably in terms of international legal standards but also fosters user trust, which is paramount for maintaining a strong and reputable platform.
The existing policies are inadequate in providing the necessary transparency and protection of user privacy, and they fall short of complying with stringent laws such as the GDPR. Our recommended policy changes actively work to resolve these issues. They ensure that Reddit and developers continue to profit from user data while simultaneously giving users control over their information, aligning with privacy concerns and legal requirements. This approach enhances transparency and user trust, resulting in higher customer satisfaction.
Our proposed changes offer a comprehensive solution. Other potential changes might address only specific aspects of the issue. For example, creating a repository of sanitized, anonymized datasets shared openly with the public and the developer community for AI training would please Reddit users but harm the company's profitability. Our policy recommendations, by contrast, provide a balanced and sustainable strategy. They ensure continued profitability for Reddit and its developers from user data while securing user trust through increased transparency and control. This not only satisfies legal requirements but also addresses user privacy concerns, so that Reddit’s reputation remains intact and its user base remains content and secure.
Implementations
The policy changes will roll out in phases, starting with new users and followed by notifying existing users via email and in-app notifications. An interactive guide and modular control options will be introduced, alongside a ‘Data Usage & Permissions’ section in user dashboards.
For the first stage, design an interactive guide, perhaps using infographics and concise explanations, that engages users while informing them. Provide users with modular options to control what data they are comfortable sharing, along with the right to delete all or part of their data at any time. Enable users to selectively remove data types such as comments, upvotes, or saved posts. Giving users control over their digital footprint aligns with privacy-first strategies and offers them peace of mind about their online presence. Existing users' data will be anonymized in the same way as new users' data. This anonymized data will remain accessible for AI training unless users opt out, balancing privacy with the utility of data for development (Goldman, 2023).
Feasibility Test: Google lets users choose what information the platform may use and gives them the right to download and delete their data at any time.
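As an illustration of the modular controls described in this stage, the following is a minimal Python sketch of how per-user sharing choices might be stored and applied before data is exported for AI training. It is not Reddit's actual implementation. The class name `DataPermissions`, the function `filter_for_training`, and the data type labels are all hypothetical, and the default (data stays available until the user opts out) is an assumption taken from the treatment of existing users proposed above.

```python
from dataclasses import dataclass, field

# Hypothetical data types a user can control; labels are illustrative only.
DATA_TYPES = ("comments", "upvotes", "saved_posts")


@dataclass
class DataPermissions:
    """Per-user sharing choices for AI training (hypothetical model)."""

    # True once the user opts out of AI training entirely.
    ai_training_opt_out: bool = False
    # Individual data types the user has chosen to withhold.
    withheld: set = field(default_factory=set)

    def withhold(self, data_type: str) -> None:
        """Record that the user does not want this data type shared."""
        if data_type not in DATA_TYPES:
            raise ValueError(f"unknown data type: {data_type}")
        self.withheld.add(data_type)

    def allows(self, data_type: str) -> bool:
        """Return True if this data type may go into a training export."""
        return not self.ai_training_opt_out and data_type not in self.withheld


def filter_for_training(items, permissions):
    """Keep only the items whose owner permits that data type.

    items: iterable of dicts with 'user' and 'type' keys.
    permissions: dict mapping a user name to DataPermissions.
    Users without a stored record fall back to the default settings
    (assumption: available until the user opts out).
    """
    default = DataPermissions()
    return [
        item
        for item in items
        if permissions.get(item["user"], default).allows(item["type"])
    ]
```

In this sketch a full opt-out overrides every per-type choice, so a user who opts out never has to review individual settings. A consent-first variant would only need the default flipped so that nothing is exported until the user agrees.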
For the second stage, add terms to the user agreement that require explicit user consent for data usage in AI training. Users must be informed of, and agree to, two things: that their data will be used for such purposes after proper preprocessing (see anonymization below), and that the platform will receive monetary benefits from their data. Enhance user dashboards or account settings with a dedicated 'Data Usage & Permissions' section. This section would detail how Reddit uses their data, the third parties involved, and any financial gains derived from that data. As collaborations and data use cases evolve, communicate changes in plain language through in-app notifications or emails.
Feasibility Test: Apple introduced privacy labels on its App Store, requiring app developers to disclose their data collection and usage practices clearly. This move was generally well-received, with users appreciating the increased transparency regarding their data.
(https://developer.apple.com/app-store/app-privacy-details/)
For the third stage, remove personally identifiable information such as IP addresses, user-agent strings, and device information. Describe this practice in the 'Data Usage & Permissions' section, emphasizing the rigorous steps taken to maintain individual privacy.
Feasibility Test: Apple adds statistical noise to individual data points, so the raw data can't be reconstructed. (https://www.apple.com/privacy/docs/Differential_Privacy_Overview.pdf)
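The two ideas in this stage can be sketched in a few lines of Python. This is a hedged illustration, not a production design. The field names in `PII_FIELDS` are assumptions about what a stored record might contain, and `noisy_count` uses the textbook Laplace mechanism on an aggregate count, which is a simpler stand-in for the noise-adding idea and not a reproduction of Apple's system.

```python
import random

# Assumed field names for illustration; a real log schema will differ.
PII_FIELDS = {"ip_address", "user_agent", "device_info"}


def strip_pii(record: dict) -> dict:
    """Return a copy of the record without the listed PII fields."""
    return {key: value for key, value in record.items() if key not in PII_FIELDS}


def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Add Laplace noise with scale 1/epsilon to a count (sensitivity 1).

    The difference of two independent exponential samples with rate
    epsilon follows a Laplace distribution with scale 1/epsilon.
    Smaller epsilon means more noise and stronger privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + rng.expovariate(epsilon) - rng.expovariate(epsilon)


if __name__ == "__main__":
    record = {
        "comment": "Great post!",
        "subreddit": "r/example",
        "ip_address": "203.0.113.5",  # documentation-only address
        "user_agent": "ExampleBrowser/1.0",
        "device_info": "example-phone",
    }
    print(strip_pii(record))
    print(noisy_count(100, epsilon=1.0, rng=random.Random()))
```

Dropping these fields does not make a record anonymous by itself, because the text of a comment can still identify its author. That residual risk is one reason the consent and opt-out steps remain necessary.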
Potential Challenges
Technical Complexity and Costs
Implementing data anonymization processes is a technical endeavor that requires significant investment and expertise. This complexity and the associated costs are unavoidable challenges in our bid to enhance user privacy and data security. By removing personally identifiable information such as IP addresses, user-agent strings, and device information, we aim to protect user anonymity, but this comes at the expense of increased complexity in our data processing systems.
Data Availability for AI Training
Our actions might restrict the amount of data available for developers, particularly for AI training purposes. The removal of certain data points to ensure anonymity could potentially limit the effectiveness and accuracy of AI models trained on Reddit’s data. This is a trade-off between user privacy and the richness of the data available for development purposes.
Dissatisfaction from Small Developers
Small developers may express dissatisfaction due to the costs associated with accessing Reddit’s data, despite the privacy enhancements. We recognize that the monetization of Reddit’s data, while beneficial for the company’s revenue, could pose a financial burden on smaller entities in the developer community.
Compliance and Legal Concerns
Ensuring full compliance with the GDPR and the upcoming EU AI Act is of paramount importance. Non-compliance could result in severe financial penalties, up to 4% of Reddit’s global annual revenue per violation, and irreparable damage to our reputation. Gaining explicit user consent for data usage in AI training, offering opt-out options, and ensuring data protection and deletion capabilities are integral steps in adhering to these regulations. Because the data is not necessarily fully anonymized and could be de-anonymized, these extra steps are needed to achieve full compliance and to get ahead of the tougher rules expected under the EU AI Act.
Anticipated Public Response
The public’s response to these changes is an essential consideration. While we anticipate an overall positive reception due to increased transparency and user control, there may be concerns regarding the limitations placed on data availability for development purposes. Clear communication and education on the reasons behind these changes and their benefits are vital to garnering public support and understanding. For EU users, a more stringent approach to data privacy and protection that goes beyond the bare minimum of GDPR also gives us a strong public relations point, both to highlight our commitment to privacy and to reassure EU users.
Assessment
Success could be evaluated through user feedback, compliance audits, and monitoring of user engagement and trust levels. Key metrics include user opt-in rates, positive feedback, compliance audit results, and stable or increased user engagement levels. A comprehensive assessment will take place one year after implementation, with interim reviews every quarter to ensure ongoing compliance and user satisfaction.
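To show how the opt-in metric might be tracked across the quarterly reviews, here is a small Python sketch. The function names and the figures in the usage example are hypothetical and are not real Reddit data.

```python
def opt_in_rate(opted_in: int, prompted: int) -> float:
    """Share of prompted users who agreed to AI-training use of their data."""
    if prompted <= 0:
        raise ValueError("prompted must be positive")
    if not 0 <= opted_in <= prompted:
        raise ValueError("opted_in must be between 0 and prompted")
    return opted_in / prompted


def quarterly_changes(rates):
    """Return the change in opt-in rate from each quarter to the next."""
    return [later - earlier for earlier, later in zip(rates, rates[1:])]


if __name__ == "__main__":
    # Hypothetical counts: (users who opted in, users shown the prompt).
    quarters = [(600, 1000), (1300, 2000), (2100, 3000)]
    rates = [opt_in_rate(yes, total) for yes, total in quarters]
    print([round(rate, 3) for rate in rates])
    print([round(delta, 3) for delta in quarterly_changes(rates)])
```

A falling rate between quarters would be an early signal to revisit how the consent prompt is worded before the one-year assessment.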
There's some good research in the opening section here, but you need to include sourcing/citations. Choose a formal citation style. Convert the unstructured list of sources and related articles at the end of this document into a real works cited page.
Elephants in the room to address:
- GDPR doesn't apply to anonymized data, so you should clarify this and explain why you would still go a step beyond with the consent structures.
- Clarify what happens to data access for existing users following implementation. Does that data get anonymized but remain in use unless the user opts out? Unavailable until they opt in?
Bibliography
Church, Ezra D. “Data Privacy and AI Regulation in Europe, the UK, and US.” Morgan Lewis, July 7, 2023. https://www.morganlewis.com/pubs/2023/07/data-privacy-and-ai-regulation-in-europe-the-uk-and-us.
“Data Privacy Settings & Controls.” Google Safety Center. Accessed November 22, 2023. https://safety.google/privacy/privacy-controls/.
“Differential Privacy Overview.” Apple. Accessed November 23, 2023. https://www.apple.com/privacy/docs/Differential_Privacy_Overview.pdf.
Frederiksen, Jonas. “The EU GDPR and AI Systems: What Issuers Need to Know.” LinkedIn, April 4, 2023. https://www.linkedin.com/pulse/eu-gdpr-ai-systems-what-issuers-need-know-jonas-frederiksen.
Goldman, Sharon. “Generative AI’s Secret Sauce - Data Scraping - under Attack.” VentureBeat, July 6, 2023. https://venturebeat.com/ai/generative-ai-secret-sauce-data-scraping-under-attack/.
Inc., Apple. “App Privacy Details - App Store.” Apple Developer. Accessed November 22, 2023. https://developer.apple.com/app-store/app-privacy-details/.
Isaac, Mike. “Reddit Wants to Get Paid for Helping to Teach Big A.I. Systems.” The New York Times, April 18, 2023. https://www.nytimes.com/2023/04/18/technology/reddit-ai-openai-google.html?action=click&module=RelatedLinks&pgtype=Article.
Vigliarolo, Brandon. “Reddit: If You Want to Slurp Our API to Train That LLM, You Better Pay for It, Pal.” The Register, April 18, 2023. https://www.theregister.com/2023/04/18/reddit_charging_ai_api/#:~:text=End%20of%20free%20money%20era,for%20building%20billion%2Ddollar%20models&text=In%20a%20move%20seemingly%20designed,of%20its%20data%2Ddownloading%20API.