Researchers quantify bias in Reddit content sometimes used to train AI

August 9, 2020   Big Data

In a paper published on the preprint server Arxiv.org, scientists at the King’s College London Department of Informatics used natural language processing to show evidence of pervasive gender and religious bias in Reddit communities. This alone isn’t surprising, but the problem is that data from these communities are often used to train large language models like OpenAI’s GPT-3. That in turn is important because, as OpenAI itself notes, this sort of bias leads to placing words like “naughty” or “sucked” near female pronouns and “Islam” near words like “terrorism.”

The scientists’ approach uses representations of words called embeddings to discover and categorize language biases, which could enable data scientists to trace the severity of bias in different communities and take steps to counteract it. To spotlight examples of potentially offensive content in Reddit subcommunities, the method takes a language model trained on a community’s text and two sets of words representing the concepts to compare, then identifies the words in that community most biased toward each concept. It also ranks the words from least to most biased using a scoring equation, producing an ordered list and an overall view of the bias distribution in the community.
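As a rough sketch of that idea (not the authors’ exact formulation), the example below trains word embeddings on a hypothetical set of tokenized community comments, scores each remaining vocabulary word by how much closer it sits to one concept word set than to the other, and prints the words ordered from least to most biased. The corpus, the concept word lists, and the similarity-difference score are all illustrative assumptions.

```python
# Illustrative sketch only: score words by their relative closeness to two
# concept word sets in embeddings trained on (hypothetical) community comments.
import numpy as np
from gensim.models import Word2Vec

# Hypothetical tokenized comments from a single subreddit.
comments = [
    ["she", "was", "judged", "only", "on", "appearance"],
    ["he", "was", "praised", "for", "being", "tough"],
    # ... many more tokenized comments in practice ...
]

model = Word2Vec(sentences=comments, vector_size=100, window=5, min_count=1, seed=1)

concept_a = ["she", "her", "woman"]  # illustrative concept set A
concept_b = ["he", "his", "man"]     # illustrative concept set B

def centroid(words):
    """Average embedding of the concept words that appear in the vocabulary."""
    vectors = [model.wv[w] for w in words if w in model.wv]
    return np.mean(vectors, axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

center_a, center_b = centroid(concept_a), centroid(concept_b)

# Bias score per word: positive means closer to concept A, negative to concept B.
scores = {
    w: cosine(model.wv[w], center_a) - cosine(model.wv[w], center_b)
    for w in model.wv.index_to_key
    if w not in concept_a and w not in concept_b
}

# Ordered list from least to most biased toward concept A, mirroring the ranking step.
for word, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{word:>12s}  {score:+.3f}")
```

In practice the concept sets would be the gender or religion word lists used in the study, and the embeddings would be trained on a full dump of a subreddit’s comments rather than a toy corpus.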

Reddit has long been a popular source for machine learning model training data, but it’s an open secret that some groups within the network are unfixably toxic. In June, Reddit banned roughly 2,000 communities for consistently breaking its rules by allowing users to harass others with hate speech. But in accordance with the site’s policies on free speech, Reddit’s admins maintain they don’t ban communities solely for featuring controversial content, such as those advocating white supremacy, mocking perceived liberal bias, and promoting demeaning views on transgender women, sex workers, and feminists.

To further specify the biases they encountered, the researchers took the negativity and positivity (also called “sentiment polarity”) of biased words into account. And to facilitate analyses of biases, they combined semantically related terms under broad rubrics like “Relationship: Intimate/sexual” and “Power, organizing” that they modeled on the UCREL Semantic Analysis System (USAS) framework for automatic semantic and text tagging. (USAS has a multi-tier structure, with 21 major discourse fields subdivided into fine-grained categories like “People,” “Relationships,” or “Power.”)
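A minimal sketch of that polarity step might look like the following, assuming a generic sentiment lexicon (VADER via NLTK here, not necessarily what the researchers used) and a hand-made category mapping standing in for the USAS tagger; the cluster labels and words are hypothetical.

```python
# Illustrative sketch only: attach sentiment polarity to clusters of biased words.
# VADER (via NLTK) stands in for whatever sentiment lexicon the researchers used,
# and the hand-made cluster labels stand in for USAS semantic tagging.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

# Hypothetical biased words grouped under USAS-style rubrics.
clusters = {
    "Judgement of appearance": ["ugly", "attractive", "pathetic"],
    "Power, organizing": ["dominant", "leader", "controlling"],
}

for label, words in clusters.items():
    # Average VADER compound score in [-1, 1] across the cluster's words.
    polarity = sum(sia.polarity_scores(w)["compound"] for w in words) / len(words)
    print(f"{label:<26s} average sentiment = {polarity:+.2f}")
```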

One of the communities the researchers examined — /r/TheRedPill, ostensibly a forum for the “discussion of sexual strategy in a culture increasingly lacking a positive identity for men” — had 45 clusters of biased words. (/r/TheRedPill is currently “quarantined” by Reddit’s admins, meaning users have to bypass a warning prompt to visit or join.) Sentiment scores indicated that the first three biased clusters toward women (“Anatomy and Physiology,” “Intimate sexual relationships,” and “Judgement of appearance”) carried negative sentiments, whereas most of the clusters related to men contained neutral or positively connotated words. Perhaps unsurprisingly, labels such as “Egoism” and “Toughness; strong/weak” weren’t even present in female-biased labels.

Another community — /r/Dating_Advice — exhibited negative bias toward men, according to the researchers. Biased clusters included the words “poor,” “irresponsible,” “erratic,” “unreliable,” “impulsive,” “pathetic,” and “stupid,” with words like “abusive” and “egotistical” among the most negative in terms of sentiment. Moreover, the category “Judgment of appearance” was more frequently biased toward men than women, and physical stereotyping of women was “significantly” less prevalent than in /r/TheRedPill.

The researchers chose the community /r/Atheism, which calls itself “the web’s largest atheism forum,” to evaluate religious biases. They note that all of the biased labels toward Islam had an average negative polarity except for geographical names. Categories such as “Crime, law and order,” “Judgement of appearance,” and “Warfare, defense, and the army” aggregated words with evidently negative connotations like “uncivilized,” “misogynistic,” “terroristic,” “antisemitic,” “oppressive,” “offensive,” and “totalitarian.” By contrast, none of these labels appeared among the Christianity-biased clusters, and most of the words in those clusters (e.g., “Unitarian,” “Presbyterian,” “Episcopalian,” “unbaptized,” “eternal”) didn’t carry negative connotations.

The coauthors assert their approach could be applied by legislators, moderators, and data scientists to trace the severity of bias in different communities and to take steps to actively counteract this bias. “We view the main contribution of our work as introducing a modular, extensible approach for exploring language biases through the lens of word embeddings,” they wrote. “Being able to do so without having to construct a-priori definitions of these biases renders this process more applicable to the dynamic and unpredictable discourses that are proliferating online.”

There’s a real and present need for tools like these in AI research. Emily Bender, a professor in the University of Washington’s NLP group, recently told VentureBeat that even carefully crafted language data sets can carry forms of bias. A study published last August by researchers at the University of Washington found evidence of racial bias in hate speech detection algorithms developed by Google parent company Alphabet’s Jigsaw. And Facebook AI head Jerome Pesenti found a rash of negative statements targeting Black people, Jewish people, and women produced by an AI system built to generate humanlike tweets.

“Algorithms are like convex mirrors that refract human biases, but do it in a pretty blunt way. They don’t permit polite fictions like those that we often sustain our society with,” Kathryn Hume, Borealis AI’s director of product, said at the Movethedial Global Summit in November. “These systems don’t permit polite fictions. … They’re actually a mirror that can enable us to directly observe what might be wrong in society so that we can fix it. But we need to be careful, because if we don’t design these systems well, all that they’re going to do is encode what’s in the data and potentially amplify the prejudices that exist in society today.”


Big Data – VentureBeat
