• Home
  • About Us
  • Contact Us
  • Privacy Policy
  • Special Offers
Business Intelligence Info
  • Business Intelligence
    • BI News and Info
    • Big Data
    • Mobile and Cloud
    • Self-Service BI
  • CRM
    • CRM News and Info
    • InfusionSoft
    • Microsoft Dynamics CRM
    • NetSuite
    • OnContact
    • Salesforce
    • Workbooks
  • Data Mining
    • Pentaho
    • Sisense
    • Tableau
    • TIBCO Spotfire
  • Data Warehousing
    • DWH News and Info
    • IBM DB2
    • Microsoft SQL Server
    • Oracle
    • Teradata
  • Predictive Analytics
    • FICO
    • KNIME
    • Mathematica
    • Matlab
    • Minitab
    • RapidMiner
    • Revolution
    • SAP
    • SAS/SPSS
  • Humor

AI models from Microsoft and Google already surpass human performance on the SuperGLUE language benchmark

January 6, 2021   Big Data
 AI models from Microsoft and Google already surpass human performance on the SuperGLUE language benchmark

Transform 2021

Join us for the world’s leading event about accelerating enterprise transformation with AI and Data, for enterprise technology decision-makers, presented by the #1 publisher in AI and Data

Learn More


In late 2019, researchers affiliated with Facebook, New York University (NYU), the University of Washington, and DeepMind proposed SuperGLUE, a new benchmark for AI designed to summarize research progress on a diverse set of language tasks. Building on the GLUE benchmark, which had been introduced one year prior, SuperGLUE includes a set of more difficult language understanding challenges, improved resources, and a publicly available leaderboard.

When SuperGLUE was introduced, there was a nearly 20-point gap between the best-performing model and human performance on the leaderboard. But as of early January, two models — one from Microsoft called DeBERTa and a second from Google called T5 + Meena — have surpassed the human baselines, becoming the first to do so.

Sam Bowman, assistant professor at NYU’s center for data science, said the achievement reflected innovations in machine learning including self-supervised learning, where models learn from unlabeled datasets with recipes for adapting the insights to target tasks. “These datasets reflect some of the hardest supervised language understanding task datasets that were freely available two years ago,” he said. “There’s no reason to believe that SuperGLUE will be able to detect further progress in natural language processing, at least beyond a small remaining margin.”

But SuperGLUE isn’t a perfect — nor a complete test of human language ability. In a blog post, the Microsoft team behind DeBERTa themselves noted that their model is “by no means” reaching the human-level intelligence of natural language understanding. They say this will require research breakthroughs — along with new benchmarks to measure them and their effects.

SuperGLUE

As the researchers wrote in the paper introducing SuperGLUE, their benchmark is intended to be a simple, hard-to-game measure of advances toward general-purpose language understanding technologies for English. It comprises eight language understanding tasks drawn from existing data and accompanied by a performance metric as well as an analysis toolkit.

The tasks are:

  • Boolean Questions (BoolQ) requires models to respond to a question about a short passage from a Wikipedia article that contains the answer. The questions come from Google users, who submit them via Google Search.
  • CommitmentBank (CB) tasks models with identifying a hypotheses contained within a text excerpt from sources including the Wall Street Journal and determining whether this hypothesis holds true.
  • Choice of plausible alternatives (COPA) provides a premise sentence about topics from blogs and a photography-related encyclopedia from which models must determine either the cause or effect from two possible choices.
  • Multi-Sentence Reading Comprehension (MultiRC) is a question-answer task where each example consists of a context paragraph, a question about that paragraph, and a list of possible answers. A model must predict which answers are true and false.
  • Reading Comprehension with Commonsense Reasoning Dataset (ReCoRD) has models predict masked-out words and phrases from a list of choices in passages from CNN and the Daily Mail, where the same words or phrases might be expressed using multiple different forms, all of which are considered correct.
  • Recognizing Textual Entailment (RTE) challenges natural language models to identify whenever the truth of one text excerpt follows from another text excerpt.
  • Word-in-Context (WiC) provides models two text snippets and a polysemous word (i.e., word with multiple meanings) and requires them to determine whether the word is used with the same sense in both sentences.
  • Winograd Schema Challenge (WSC) is a task where models, given passages from fiction books, must answer multiple-choice questions about the antecedent of ambiguous pronouns. It’s designed to be an improvement on the Turing Test.

SuperGLUE also attempts to measure gender bias in models with Winogender Schemas, pairs of sentences that differ only by the gender of one pronoun in the sentence. However, the researchers note that Winogender has limitations in that it offers only positive predictive value: While a poor bias score is clear evidence that a model exhibits gender bias, a good score doesn’t mean the model is unbiased. Moreover, it doesn’t include all forms of gender or social bias, making it a coarse measure of prejudice.

To establish human performance baselines, the researchers drew on existing literature for WiC, MultiRC, RTE, and ReCoRD and hired crowdworker annotators through Amazon’s Mechanical Turk platform. Each worker, which was paid an average of $ 23.75 an hour, completed a short training phase before annotating up to 30 samples of selected test sets using instructions and an FAQ page.

Architectural improvements

The Google team hasn’t yet detailed the improvements that led to its model’s record-setting performance on SuperGLUE, but the Microsoft researchers behind DeBERTa detailed their work in a blog post published earlier this morning. DeBERTa isn’t new — it was open-sourced last year — but the researchers say they trained a larger version with 1.5 billion parameters (i.e., the internal variables that the model uses to make predictions). It’ll be released in open source and integrated into the next version of Microsoft’s Turing natural language representation model, which supports products like Bing, Office, Dynamics, and Azure Cognitive Services.

DeBERTa is pretrained through masked language modeling (MLM), a fill-in-the-blank task where a model is taught to use the words surrounding a masked “token” to predict what the masked word should be. DeBERTa uses both the content and position information of context words for MLM, such that it’s able to recognize “store” and “mall” in the sentence “a new store opened beside the new mall” play different syntactic roles, for example.

Unlike some other models, DeBERTa accounts for words’ absolute positions in the language modeling process. Moreover, it computes the parameters within the model that transform input data and measure the strength of word-word dependencies based on words’ relative positions. For example, DeBERTa would understand the dependency between the words “deep” and “learning” is much stronger when they occur next to each other than when they occur in different sentences.

DeBERTa also benefits from adversarial training, a technique that leverages adversarial examples derived from small variations made to training data. These adversarial examples are fed to the model during the training process, improving its generalizability.

The Microsoft researchers hope to next explore how to enable DeBERTa to generalize to novel tasks of subtasks or basic problem-solving skills, a concept known as compositional generalization. One path forward might be incorporating so-called compositional structures more explicitly, which could entail combining AI with symbolic reasoning — in other words, manipulating symbols and expressions according to mathematical and logical rules.

“DeBERTa surpassing human performance on SuperGLUE marks an important milestone toward general AI,” the Microsoft researchers wrote. “[But unlike DeBERTa,] humans are extremely good at leveraging the knowledge learned from different tasks to solve a new task with no or little task-specific demonstration.”

New benchmarks

According to Bowman, no successor to SuperGLUE is forthcoming, at least not in the near term. But there’s growing consensus within the AI research community that future benchmarks, particularly in the language domain, must take into account broader ethical, technical, and societal challenges if they’re to be useful.

For example, a number of studies show that popular benchmarks do a poor job of estimating real-world AI performance. One recent report found that 60%-70% of answers given by natural language processing models were embedded somewhere in the benchmark training sets, indicating that the models were usually simply memorizing answers. Another study — a meta-analysis of over 3,000 AI papers — found that metrics used to benchmark AI and machine learning models tended to be inconsistent, irregularly tracked, and not particularly informative.

Part of the problem stems from the fact that language models like OpenAI’s GPT-3, Google’s T5 + Meena, and Microsoft’s DeBERTa learn to write humanlike text by internalizing examples from the public web. Drawing on sources like ebooks, Wikipedia, and social media platforms like Reddit, they make inferences to complete sentences and even whole paragraphs.

As a result, language models often amplify the biases encoded in this public data; a portion of the training data is not uncommonly sourced from communities with pervasive gender, race, and religious prejudices. AI research firm OpenAI notes that this can lead to placing words like “naughty” or “sucked” near female pronouns and “Islam” near words like “terrorism.” Other studies, like one published by Intel, MIT, and Canadian AI initiative CIFAR researchers in April, have found high levels of stereotypical bias from some of the most popular models, including Google’s BERT and XLNet, OpenAI’s GPT-2, and Facebook’s RoBERTa. This bias could be leveraged by malicious actors to foment discord by spreading misinformation, disinformation, and outright lies that “radicalize individuals into violent far-right extremist ideologies and behaviors,” according to the Middlebury Institute of International Studies.

Most existing language benchmarks fail to capture this. Motivated by the findings in the two years since SuperGLUE’s introduction, perhaps future ones might.

VentureBeat

VentureBeat’s mission is to be a digital townsquare for technical decision makers to gain knowledge about transformative technology and transact.

Our site delivers essential information on data technologies and strategies to guide you as you lead your organizations. We invite you to become a member of our community, to access:

  • up-to-date information on the subjects of interest to you,
  • our newsletters
  • gated thought-leader content and discounted access to our prized events, such as Transform
  • networking features, and more.

Become a member

Let’s block ads! (Why?)

Big Data – VentureBeat

already, Benchmark, from, Google, human, language, Microsoft, Models, Performance, SuperGLUE, surpass
  • Recent Posts

    • Database version control: Getting started with Flyway
    • Support CRM with New Dynamics 365 Field Service Mobile App
    • 6 Strategies for Achieving Your Business Goals in the New Year
    • Researchers propose using the game Overcooked to benchmark collaborative AI systems
    • Oracle Launches Version 21c
  • Categories

  • Archives

    • January 2021
    • December 2020
    • November 2020
    • October 2020
    • September 2020
    • August 2020
    • July 2020
    • June 2020
    • May 2020
    • April 2020
    • March 2020
    • February 2020
    • January 2020
    • December 2019
    • November 2019
    • October 2019
    • September 2019
    • August 2019
    • July 2019
    • June 2019
    • May 2019
    • April 2019
    • March 2019
    • February 2019
    • January 2019
    • December 2018
    • November 2018
    • October 2018
    • September 2018
    • August 2018
    • July 2018
    • June 2018
    • May 2018
    • April 2018
    • March 2018
    • February 2018
    • January 2018
    • December 2017
    • November 2017
    • October 2017
    • September 2017
    • August 2017
    • July 2017
    • June 2017
    • May 2017
    • April 2017
    • March 2017
    • February 2017
    • January 2017
    • December 2016
    • November 2016
    • October 2016
    • September 2016
    • August 2016
    • July 2016
    • June 2016
    • May 2016
    • April 2016
    • March 2016
    • February 2016
    • January 2016
    • December 2015
    • November 2015
    • October 2015
    • September 2015
    • August 2015
    • July 2015
    • June 2015
    • May 2015
    • April 2015
    • March 2015
    • February 2015
    • January 2015
    • December 2014
    • November 2014
© 2021 Business Intelligence Info
Power BI Training | G Com Solutions Limited