Research shows natural language benchmarks don’t measure AI models’ general knowledge well

August 12, 2020 | Big Data

Open-domain question-answering models — models theoretically capable of responding to novel questions with novel answers — often simply memorize answers found in the data on which they’re trained, to a degree that depends on the data set. That’s the assertion of a team of researchers affiliated with Facebook and University College London, who in a preprint paper present evidence that 60%-70% of answers given by models tested on open-domain benchmarks are embedded somewhere in the training sets.

Open-domain question-answering has received attention in the AI community for its practical applications, and more recently as a method to analyze language models’ grasp of factual knowledge. But a deep understanding of what kinds of questions models can answer remains elusive; unknowns about how questions and answers are distributed in benchmark corpora make it hard to contextualize the results.

In their study, the researchers sought to evaluate the test sets of popular open-domain question-answering data sets including WebQuestions, TriviaQA, and Open Natural Questions. They identified classes of questions a model should be able to answer and annotated 1,000 question-answer pairs from each test set for questions repeated in the respective training sets. Then they computed the performance of several models on the benchmarks using open-book approaches (which leverage retrieval from a large corpus of documents) and closed-book approaches (which rely on large models trained with no external knowledge).

The three data sets in question aren’t much alike, which was the point — testing across all three guaranteed robustness. WebQuestions contains 3,778 training and 2,032 test question-answer pairs from a search engine, while TriviaQA has 78,785 training and 11,313 test question-answer pairs from free trivia websites. Meanwhile, Open Natural Questions comprises 79,168 training and 3,610 test question-answer pairs from a combination of search engines and Wikipedia articles.

The team theorizes open-domain question-answering models should be able to (1) recall the answer to a question seen at training time, (2) answer novel questions at test time and choose an answer from the set of answers seen during training, and (3) answer novel questions that have answers not contained within the training data set. To determine whether the aforementioned benchmarks measure any of these behaviors, the coauthors split the test data in each corpus by whether the answers appeared somewhere in the training sets. Around 58%-71% of test answers were also somewhere in the training data, according to the researchers, demonstrating that the majority of the test data didn’t probe for answer generalization.
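
To make that split concrete, here is a minimal sketch of the test/train answer-overlap check, assuming each split is available as a list of (question, answers) pairs; the function and variable names are illustrative rather than taken from the paper’s code.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(text.lower().split())


def split_by_answer_overlap(train_pairs, test_pairs):
    """Partition test items by whether any acceptable answer also appears
    as an answer somewhere in the training set."""
    train_answers = {normalize(a) for _, answers in train_pairs for a in answers}
    overlapping, non_overlapping = [], []
    for question, answers in test_pairs:
        if any(normalize(a) in train_answers for a in answers):
            overlapping.append((question, answers))
        else:
            non_overlapping.append((question, answers))
    return overlapping, non_overlapping


# Example: the fraction of overlapping test answers is what the paper
# reports as roughly 58%-71%, depending on the data set.
# seen, unseen = split_by_answer_overlap(train_pairs, test_pairs)
# print(len(seen) / (len(seen) + len(unseen)))
```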

The team also probed the benchmarks for paraphrased questions in training data, using the set of 1,000 annotated questions. They say that 28%-34% of the questions were paraphrased, the majority being near-duplicates differing only by one or two words. “This result implies that 30% of the test set of these datasets only probe for how well models can simply memorize question-answer pairs seen at training,” the coauthors wrote.
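
The paper’s paraphrase figures come from human annotation of the 1,000 sampled questions per data set, but a rough automatic pass with plain string similarity illustrates the kind of one-or-two-word rewordings involved; the 0.8 cutoff below is an assumption, not a value from the paper.

```python
from difflib import SequenceMatcher


def is_near_duplicate(q1: str, q2: str, threshold: float = 0.8) -> bool:
    """Flag two questions as near-duplicates when their character-level
    similarity exceeds the threshold (the 0.8 cutoff is an assumption)."""
    return SequenceMatcher(None, q1.lower(), q2.lower()).ratio() >= threshold


# A one-word rewording is flagged; an unrelated question is not.
print(is_near_duplicate("who wrote the novel dracula",
                        "who wrote the book dracula"))   # True
print(is_near_duplicate("who wrote the novel dracula",
                        "what is the capital of peru"))  # False
```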

The researchers selected several “open book” models — dense passage retrieval, retrieval-augmented generation, and fusion-in-decoder — and “closed book” models (Facebook’s BART and Google’s T5) to test, as well as nearest-neighbor models that store all available answers and classify new answers based on a similarity measure. Results on the benchmark corpora imply that all models memorized questions well, with an untrained nearest-neighbor model answering 20% of the test questions correctly. But they performed poorly on questions that couldn’t be memorized from training sets, with a mean absolute performance difference of 63% between repeated and non-repeated data. And when it came to generalization, one model that reliably memorized questions — T5 — struggled, achieving only a 22% match score.
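
For a sense of how such a nearest-neighbor baseline might be built, the sketch below answers each test question with the stored answer of its most similar training question under TF-IDF cosine similarity; it uses scikit-learn and is illustrative rather than the authors’ implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def nearest_neighbor_answers(train_pairs, test_questions):
    """Answer each test question with the answer attached to the most
    similar training question under TF-IDF cosine similarity."""
    train_questions = [q for q, _ in train_pairs]
    vectorizer = TfidfVectorizer().fit(train_questions)
    train_vecs = vectorizer.transform(train_questions)
    test_vecs = vectorizer.transform(test_questions)
    similarities = cosine_similarity(test_vecs, train_vecs)
    return [train_pairs[row.argmax()][1] for row in similarities]
```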

“It is clear that performance on these data sets cannot be properly understood by overall question-answer accuracy,” the researchers wrote. “We suggest that in future, a greater emphasis be placed on more behavior-driven evaluation rather than pursuing single-number overall accuracy figures.”

Source: Big Data – VentureBeat
