Hey there,

Another week, another discussion about the ethics of research and of compiling massive datasets about online communities. This is one of the largest Discord datasets we've ever seen, and Matthew does a good job of breaking down what's going on here.

-Jason

Researchers published a massive database of more than 2 billion Discord messages that they say they scraped using Discord’s public API. The data was pulled from 3,167 servers and covers posts made between 2015 and 2024, nearly the entire time Discord has been active. Though the researchers claim they’ve anonymized the data, it’s hard to imagine anyone is comfortable with almost a decade of their Discord messages sitting in public JSON files online. Separately, a different programmer released a Discord tool called "Searchcord," based on a different data set, that shows non-anonymized chat histories.
These two separate events have created some panic in some Discord communities, with server moderators and users worrying about their privacy.

A team of 15 researchers at the Federal University of Minas Gerais in Brazil conducted the scrape as part of a research project. The team explained the how and why of the project in a paper titled Discord Unveiled: A Comprehensive Dataset of Public Communication (2015 - 2024), which they say was created so that other teams of researchers could have a database of online discussions to use when studying mental health and politics or training bots.

“Throughout every step of our data collection process, we prioritized adherence to ethical standards,” they wrote in a section called ‘Ethical Concerns.’ “Precautions were taken to collect data responsibly. All data was sourced from groups that are explicitly considered public according to Discord’s terms of use, which every user agrees to upon signing up. The data was anonymized, and the methodology was detailed to promote reproducibility and transparency.”

That may be the case, but Discord is designed to be a series of chatrooms which are not universally searchable, and which in their design feel far less public than, say, tweeting something or posting it to Reddit.

The amount of data is massive. “This paper introduces the most extensive Discord dataset available to date, comprising 2,052,206,308 messages from 4,735,057 unique users across 3,167 servers—approximately 10% of the servers listed in Discord’s Discovery tab.”

The researchers have published the database online as a series of JSON files. Within the database, one JSON file represents a single Discord server and all of the messages it contained. A compressed sample of the data is 6.2GB and unfurls into a 108GB database. The complete database is 118GB compressed and, if it expands at the same rate as the sample, would unfurl into roughly 2TB.
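As a back-of-the-envelope check on those sizes, the short Python sketch below scales the full archive by the sample's expansion ratio. It treats 6.2GB as the sample's download size and 108GB as its expanded size, and it assumes the full 118GB archive expands at the same rate, which the paper does not confirm. The function and variable names are illustrative only.

```python
# Figures quoted in the article, in gigabytes.
SAMPLE_DOWNLOAD_GB = 6.2
SAMPLE_EXPANDED_GB = 108
FULL_DOWNLOAD_GB = 118


def estimate_expanded_gb(download_gb: float, ratio: float) -> float:
    """Estimate the expanded size of an archive for a given expansion ratio."""
    return download_gb * ratio


# Assumption: the full archive expands at the same ratio as the sample.
ratio = SAMPLE_EXPANDED_GB / SAMPLE_DOWNLOAD_GB
full_expanded_gb = estimate_expanded_gb(FULL_DOWNLOAD_GB, ratio)

print(f"expansion ratio: about {ratio:.1f}x")
print(f"estimated full size: about {full_expanded_gb / 1000:.1f} TB")
```

Under that assumption the full dataset lands at around 2TB once expanded, roughly 19 times the size of the expanded sample.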
The researchers said they created the dataset so that other researchers could study bots, politics, and mental health. “Our dataset enables researchers to explore the impact of digital platforms on political discourse, the propagation of misinformation, and the development of effective moderation and regulation strategies tailored to such environments,” the paper said in a section near the end. They also said the database could be of help “identifying patterns of at-risk behavior and explore [sic] critical questions such as the prevalence of harm behaviors or supportive interactions” and “facilitate the creation of domain-specific chatbots.”

The way that the Brazilian researchers scraped these messages differs from the way that a tool we reported on last year did something similar. In 2024, a service called Spy.pet scraped Discord servers en masse by placing bots into specific servers, which then archived the messages. This allowed the creators of Spy.pet to target specific servers and to archive the messages within servers that were not public. It also did not anonymize the messages in any way. Days after 404 Media broke the Spy.pet story, Discord banned accounts associated with the service.

The Brazilian researchers say that they scraped the messages using Discord’s API. Discord servers are user generated and can be set to public or private, and newcomers can find the public servers using Discord’s “Discovery” feature. In their paper, the researchers said they used this discovery feature to map every public Discord server, discovering a total of 31,673 as of November 17, 2024. Then they randomly selected 10 percent of those servers to scrape.

The researchers accomplished this using Discord’s own public API to put in calls for all the data on the servers. Bots are popular on Discord and users stand them up for a variety of reasons including moderating channels, playing music, and rolling dice.
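The selection step described in the paper, drawing 10 percent of the 31,673 discoverable servers at random, can be sketched in a few lines of Python. This is only an illustration of uniform sampling without replacement, not the researchers' actual code: the integer server IDs are placeholders, and the `sample_servers` helper and its `seed` parameter are hypothetical.

```python
import random

TOTAL_PUBLIC_SERVERS = 31_673  # count reported in the paper
SAMPLE_FRACTION = 0.10         # share of servers the researchers say they scraped


def sample_servers(server_ids, fraction, seed=None):
    """Pick a uniform random subset of servers, without replacement."""
    rng = random.Random(seed)
    sample_size = int(len(server_ids) * fraction)
    return rng.sample(server_ids, sample_size)


# Placeholder IDs standing in for the servers listed in the Discovery tab.
server_ids = list(range(TOTAL_PUBLIC_SERVERS))
chosen = sample_servers(server_ids, SAMPLE_FRACTION, seed=0)

print(len(chosen))
```

Rounding 10 percent of 31,673 down gives 3,167, which matches the number of servers in the published dataset.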
User-designed bots are a ubiquitous part of the Discord experience and the company offers its public API, in part, to make the bots easy to launch and maintain.

In their paper, the researchers insist that the project was conducted within the bounds of Discord’s API policies. They said that before publication, they replaced usernames with generated pseudonyms, hashed and truncated user and message IDs, and removed other identifying features entirely. “All data collection adhered strictly to Discord’s API guidelines, and anonymization techniques were applied to ensure compliance with privacy standards,” the paper said.

The paper also pointed out that all these messages were scraped from public spaces. “All data was sourced from groups that are explicitly considered public according to Discord’s terms of use, which every user agrees to upon signing up.”

It should be noted, however, that almost no one reads end-user license agreements and many of Discord’s users are children and teenagers. Discord is, first and foremost, a platform for gamers to organize communities and it’s not plausible that a 15-year-old looking for a Fortnite meme server ever thought their dumb jokes about Tomato Town would end up in a public database five years later.

Even with the pains taken to anonymize the data, the scrape appears to be against Discord’s Terms of Service. The Discord Developer Policy, which covers the use of its API, is clear. “Do not mine or scrape any data, content, or information available on or through Discord services,” it says. Some form of this prohibition against scraping has been in place since at least 2020.

Discord did not respond to 404 Media’s request for comment on this issue.
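For readers wondering what "replaced usernames with generated pseudonyms, hashed and truncated user and message IDs" might look like in practice, here is a minimal Python sketch. It is not the researchers' pipeline. The salted SHA-256 hash, the 16-character truncation, the `user_0000001` naming scheme, and the message field names (`id`, `author_id`, `author_name`, `content`) are all assumptions made for illustration.

```python
import hashlib
import secrets

# Assumption: a random salt kept private by whoever publishes the data.
SALT = secrets.token_bytes(16)


def hash_and_truncate(identifier: str, salt: bytes, length: int = 16) -> str:
    """Hash an ID with salted SHA-256 and keep only the first `length` hex chars."""
    digest = hashlib.sha256(salt + identifier.encode("utf-8")).hexdigest()
    return digest[:length]


class PseudonymGenerator:
    """Hand out a stable generated name for each distinct username."""

    def __init__(self):
        self._names = {}

    def name_for(self, username: str) -> str:
        if username not in self._names:
            self._names[username] = f"user_{len(self._names) + 1:07d}"
        return self._names[username]


def anonymize_message(message: dict, salt: bytes, names: PseudonymGenerator) -> dict:
    """Return a copy of a message with IDs hashed and the username replaced."""
    return {
        "message_id": hash_and_truncate(message["id"], salt),
        "author_id": hash_and_truncate(message["author_id"], salt),
        "author_name": names.name_for(message["author_name"]),
        "content": message["content"],  # message text is left untouched here
    }


# Hypothetical message record; real Discord messages carry many more fields.
raw = {"id": "1001", "author_id": "42", "author_name": "tomato_fan", "content": "gg"}
names = PseudonymGenerator()
anonymized = anonymize_message(raw, SALT, names)
print(anonymized["author_name"], anonymized["author_id"])
```

Note that a scheme like this only changes identifiers. The message text passes through as written.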