Tag Archives: language

Researchers find that large language models struggle with math

March 9, 2021   Big Data

Mathematics is the foundation of countless sciences, allowing us to model things like planetary orbits, atomic motion, signal frequencies, protein folding, and more. Moreover, it’s a valuable testbed for the ability to problem solve, because it requires problem solvers to analyze a challenge, pick out good methods, and chain them together to produce an answer.

It’s revealing, then, that as sophisticated as machine learning models are today, even state-of-the-art models struggle to answer the bulk of math problems correctly. A new study published by researchers at the University of California, Berkeley finds that large language models including OpenAI’s GPT-3 can only complete 2.9% to 6.9% of problems from a dataset of over 12,500. The coauthors believe that new algorithmic advancements will likely be needed to give models stronger problem-solving skills.

Prior research has demonstrated the usefulness of AI that has a firm grasp of mathematical concepts. For example, OpenAI recently introduced GPT-f, an automated prover and proof assistant for the Metamath formalization language. GPT-f found new short proofs that have been accepted into the main Metamath library, the first time a machine learning-based system contributed proofs that were adopted by a formal mathematics community. For its part, Facebook also claims to have experimented successfully with math-solving AI algorithms. In a blog post last January, researchers at the company said they’d taught a model to view complex mathematical equations “as a kind of language and then [treat] solutions as a translation problem.”

“While most other text-based tasks are already nearly solved by enormous language models, math is notably different. We showed that accuracy is slowly increasing and, if trends continue, the community will need to discover conceptual and algorithmic breakthroughs to attain strong performance on math,” the coauthors wrote. “Given the broad reach and applicability of mathematics, solving math datasets with machine learning would be of profound practical and intellectual significance.”

To measure the problem-solving ability of large and general-purpose language models, the researchers created a dataset called MATH, which consists of 12,500 problems taken from high school math competitions. Given a problem from MATH, language models must generate a sequence that reveals the final answer.

Above: A comparison of a MATH dataset problem with problems from DeepMind’s Mathematics Dataset and a Metamath module.

Image Credit: MATH

Problems in MATH are labeled by difficulty from 1 to 5 and span seven subjects, including geometry, algebra, calculus, statistics, linear algebra, and number theory. They also come with step-by-step solutions so that language models can learn to answer new questions they haven’t seen before.

Training models on the fundamentals of mathematics required the researchers to create a separate dataset with hundreds of thousands of solutions to common math problems. This second dataset, the Auxiliary Mathematics Problems and Solutions (AMPS), comprises more than 100,000 problems from Khan Academy with solutions and over 5 million problems generated using Mathematica scripts based on 100 hand-designed modules. In total, AMPS contains 23GB of content.

As the researchers explain, the step-by-step solutions in the datasets allow the language models to use a “scratch space” much like a human mathematician might. Rather than having to arrive at the correct answer right away, models can first “show their work” in partial solutions that step toward the right answer.

Even with the solutions, the coauthors found that accuracy remained low for the large language models they benchmarked: GPT-3 and GPT-2, GPT-3’s predecessor. Having the models generate their own solutions before producing an answer actually degraded accuracy because, while many of the steps were related to the question, they were illogical. Moreover, simply increasing the amount of training time and the number of parameters in the models, which sometimes improves performance, proved to be impractically costly. (In machine learning, parameters are the internal variables of a model whose values are learned from training data.)

That said, the researchers showed that step-by-step solutions still provide benefits in the form of improved performance. In particular, providing models with solutions at training time increased accuracy substantially, with pretraining on AMPS boosting accuracy by around 25%, equivalent to a 15 times increase in model size.

“Despite these low accuracies, models clearly possess some mathematical knowledge: they achieve up to 15% accuracy on the easiest difficulty level, and they are able to generate step-by-step solutions that are coherent and on-topic even when incorrect,” the coauthors wrote. “Having models train on solutions increases relative accuracy by 10% compared to training on the questions and answers directly.”
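The difference between a relative and an absolute accuracy gain is easy to miss in figures like these. The short Python sketch below works through the arithmetic; the 6.0% baseline is a hypothetical value chosen only for illustration, and the function names are not from the study.

```python
def relative_gain(baseline, improved):
    """Relative change: 0.10 means the new score is 10% higher than the old one."""
    return (improved - baseline) / baseline


def apply_relative_gain(baseline, gain):
    """Score that results from raising a baseline by a relative gain."""
    return baseline * (1 + gain)


# Hypothetical baseline accuracy of 6.0%, not a number reported in the study.
baseline = 0.060
improved = apply_relative_gain(baseline, 0.10)

print(f"baseline accuracy: {baseline:.1%}")
print(f"after a 10% relative gain: {improved:.1%}")
print(f"absolute gain: {(improved - baseline) * 100:.2f} percentage points")
```

When the baseline is in the single digits, a 10% relative gain amounts to well under one percentage point of absolute accuracy.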

The researchers have released MATH and AMPS in open source to spur further research in this direction, alongside existing mathematics datasets like DeepMind’s.

VentureBeat

VentureBeat’s mission is to be a digital town square for technical decision-makers to gain knowledge about transformative technology and transact.

Our site delivers essential information on data technologies and strategies to guide you as you lead your organizations. We invite you to become a member of our community, to access:

  • up-to-date information on the subjects of interest to you
  • our newsletters
  • gated thought-leader content and discounted access to our prized events, such as Transform 2021: Learn More
  • networking features, and more

Become a member

Big Data – VentureBeat

Frequency of letters used in English language

February 19, 2021   Humor

Posted by Krisgo

About Krisgo

I’m a mom who has worn many different hats in this life: scout leader, camp craft teacher, parents group president, colorguard coach, member of the community band, and stay-at-home mom turned full-time worker. I’ve done it all, almost! I still love learning new things, especially creating and cooking. Most of all I love to laugh! Thanks for visiting. Come back soon.


Deep Fried Bits

Facebook researchers propose ‘pre-finetuning’ to improve language model performance

February 2, 2021   Big Data
Machine learning researchers have achieved remarkable success with language model pretraining, which uses self-supervision, a training technique that doesn’t require labeled data. Pretraining refers to training a model with one task to help it recognize patterns that can be applied to a range of other tasks. In this way, pretraining imitates the way human beings process new knowledge. That is, using parameters of tasks that have been learned before, models learn to adapt to new and unfamiliar tasks.

For many natural language tasks, however, training examples for related problems exist. In an attempt to leverage these, researchers at Facebook propose “pre-finetuning,” a methodology of training language models that involves a learning step with over 4.8 million training examples performed on around 50 classification, summarization, question-answering, and commonsense reasoning datasets. They claim that pre-finetuning consistently improves performance for pretrained models while also significantly improving sample efficiency during fine-tuning.

It’s an approach that has been attempted before, often with success. In a 2019 study, researchers at the Allen Institute noticed that pre-finetuning a BERT model on a multiple choice question dataset appeared to teach the model something about multiple choice questions in general. A subsequent study found that pre-finetuning increased a model’s robustness to name swaps, where the names of different people were swapped in a sentence the model had to answer questions about.

In order to ensure that their pre-finetuning stage incorporated general language representations, the researchers included tasks in four different domains: classification, commonsense reasoning, machine reading comprehension, and summarization. They call their pre-finetuned models MUPPET, which roughly stands for “Massive Multi-task Representation with Pre-finetuning.”

After pre-finetuning RoBERTa and BART, two popular pretrained models for natural language understanding, the researchers tested their performance on widely used benchmarks including RTE, BoolQ, RACE, SQuAD, and MNLI. Interestingly, the results show that pre-finetuning can hurt performance when too few tasks are used, up to a critical point that is usually around 15 tasks. But pre-finetuning beyond this point leads to performance improvements correlated with the number of language tasks. MUPPET models outperform their vanilla pretrained counterparts, and leveraging representations with 34-40 tasks enables the models to reach even higher accuracies with less data than a baseline RoBERTa model.

“These [performance] gains are particularly strong in the low resource regime, where there is relatively little labeled data for fine-tuning,” the researchers wrote in a paper describing their work. “We show that we can effectively learn more robust representations through multitask learning at scale. … Our work shows how even seemingly very different datasets, for example, summarization and extractive QA, can help each other by improving the model’s representations.”

Big Data – VentureBeat

Google trained a trillion-parameter AI language model

January 12, 2021   Big Data

Parameters are the key to machine learning algorithms. They’re the part of the model that’s learned from historical training data. Generally speaking, in the language domain, the correlation between the number of parameters and sophistication has held up remarkably well. For example, OpenAI’s GPT-3 — one of the largest language models ever trained, at 175 billion parameters — can make primitive analogies, generate recipes, and even complete basic code.

In what might be one of the most comprehensive tests of this correlation to date, Google researchers developed and benchmarked techniques they claim enabled them to train a language model containing more than a trillion parameters. They say their 1.6-trillion-parameter model, which appears to be the largest of its kind to date, achieved an up to 4 times speedup over the previously largest Google-developed language model (T5-XXL).

As the researchers note in a paper detailing their work, large-scale training is an effective path toward powerful models. Simple architectures, backed by large datasets and parameter counts, surpass far more complicated algorithms. But effective, large-scale training is extremely computationally intensive. That’s why the researchers pursued what they call the Switch Transformer, a “sparsely activated” technique that uses only a subset of a model’s weights, or the parameters that transform input data within the model.

The Switch Transformer builds on mixture of experts, an AI model paradigm first proposed in the early ’90s. The rough concept is to keep multiple experts, or models specialized in different tasks, inside a larger model and have a “gating network” choose which experts to consult for any given data.
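To make the gating idea concrete, here is a minimal sketch in plain Python, assuming the simplest case in which each input is sent to a single expert. The toy experts, the gate weights, and the name `route_top1` are hypothetical illustrations, not the researchers’ implementation.

```python
import math


def softmax(scores):
    """Convert raw gate scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top1(x, gate_weights, experts):
    """Send input vector x to the single expert with the highest gate probability.

    gate_weights holds one weight vector per expert (hypothetical values).
    Returns (expert_index, output), with the output scaled by the gate probability.
    """
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in gate_weights]
    probs = softmax(scores)
    best = max(range(len(experts)), key=lambda i: probs[i])
    # Only the chosen expert runs; the others are skipped, which is the sparsity.
    output = [probs[best] * y for y in experts[best](x)]
    return best, output


# Two toy "experts": one doubles its input, the other negates it.
experts = [lambda v: [2 * t for t in v], lambda v: [-t for t in v]]
gate_weights = [[1.0, 0.0], [0.0, 1.0]]

chosen, out = route_top1([3.0, 1.0], gate_weights, experts)
print("chosen expert:", chosen)
print("output:", out)
```

Because only one expert runs per input, adding more experts grows the total parameter count without growing the work done for any single input.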

The novelty of the Switch Transformer is that it efficiently leverages hardware designed for dense matrix multiplications — mathematical operations widely used in language models — such as GPUs and Google’s tensor processing units (TPUs). In the researchers’ distributed training setup, their models split unique weights on different devices so the weights increased with the number of devices but maintained a manageable memory and computational footprint on each device.

In an experiment, the researchers pretrained several different Switch Transformer models using 32 TPU cores on the Colossal Clean Crawled Corpus, a 750GB-sized dataset of text scraped from Reddit, Wikipedia, and other web sources. They tasked the models with predicting missing words in passages where 15% of the words had been masked out, as well as other challenges, like retrieving text to answer a list of increasingly difficult questions.
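The masked-word objective described above can be sketched in a few lines of Python. This is a simplified illustration that masks individual words at random. The `[MASK]` placeholder, the fixed seed, and the function name are assumptions made for the example, and the researchers’ actual preprocessing may differ.

```python
import random


def mask_words(words, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Replace roughly mask_rate of the words with a mask token.

    Returns the masked sequence and a dict mapping each masked position
    to the original word the model would be asked to predict.
    """
    rng = random.Random(seed)
    n_to_mask = min(len(words), max(1, round(len(words) * mask_rate)))
    positions = sorted(rng.sample(range(len(words)), n_to_mask))
    masked = list(words)
    targets = {}
    for pos in positions:
        targets[pos] = words[pos]
        masked[pos] = mask_token
    return masked, targets


sentence = (
    "the quick brown fox jumps over the lazy dog near the old "
    "stone bridge at dawn while birds sing softly"
).split()

masked, targets = mask_words(sentence)
print(" ".join(masked))
print(targets)
```

During pretraining, the model sees only the masked sequence and is scored on how well it recovers the words stored in `targets`.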

The researchers claim their 1.6-trillion-parameter model with 2,048 experts (Switch-C) exhibited “no training instability at all,” in contrast to a smaller model (Switch-XXL) containing 395 billion parameters and 64 experts. However, on one benchmark, the Stanford Question Answering Dataset (SQuAD), Switch-C scored lower (87.7) than Switch-XXL (89.6), which the researchers attribute to the opaque relationship between fine-tuning quality, computational requirements, and the number of parameters.

That said, the Switch Transformer led to gains in a number of downstream tasks. For example, it enabled an over 7 times pretraining speedup while using the same amount of computational resources, according to the researchers, who demonstrated that the large sparse models could be used to create smaller, dense models fine-tuned on tasks with 30% of the quality gains of the larger model. In one test where a Switch Transformer model was trained to translate between over 100 different languages, the researchers observed “a universal improvement” across 101 languages, with 91% of the languages benefitting from an over 4 times speedup compared with a baseline model.

“Though this work has focused on extremely large models, we also find that models with as few as two experts improve performance while easily fitting within memory constraints of commonly available GPUs or TPUs,” the researchers wrote in the paper. “We cannot fully preserve the model quality, but compression rates of 10 to 100 times are achievable by distilling our sparse models into dense models while achieving ~30% of the quality gain of the expert model.”

In future work, the researchers plan to apply the Switch Transformer to “new and across different modalities,” including image and text. They believe that model sparsity can confer advantages in a range of different media, as well as multimodal models.

Unfortunately, the researchers’ work didn’t take into account the impact of these large language models in the real world. Models often amplify the biases encoded in this public data; a portion of the training data is not uncommonly sourced from communities with pervasive gender, race, and religious prejudices. AI research firm OpenAI notes that this can lead to placing words like “naughty” or “sucked” near female pronouns and “Islam” near words like “terrorism.”  Other studies, like one published in April by Intel, MIT, and Canadian AI initiative CIFAR researchers, have found high levels of stereotypical bias from some of the most popular models, including Google’s BERT and XLNet, OpenAI’s GPT-2, and Facebook’s RoBERTa. This bias could be leveraged by malicious actors to foment discord by spreading misinformation, disinformation, and outright lies that “radicalize individuals into violent far-right extremist ideologies and behaviors,” according to the Middlebury Institute of International Studies.

It’s unclear whether Google’s policies on published machine learning research might have played a role in this. Reuters reported late last year that researchers at the company are now required to consult with legal, policy, and public relations teams before pursuing topics such as face and sentiment analysis and categorizations of race, gender, or political affiliation. And in early December, Google fired AI ethicist Timnit Gebru, reportedly in part over a research paper on large language models that discussed risks, including the impact of their carbon footprint on marginalized communities and their tendency to perpetuate abusive language, hate speech, microaggressions, stereotypes, and other dehumanizing language aimed at specific groups of people.

Big Data – VentureBeat

AI models from Microsoft and Google already surpass human performance on the SuperGLUE language benchmark

January 6, 2021   Big Data
In late 2019, researchers affiliated with Facebook, New York University (NYU), the University of Washington, and DeepMind proposed SuperGLUE, a new benchmark for AI designed to summarize research progress on a diverse set of language tasks. Building on the GLUE benchmark, which had been introduced one year prior, SuperGLUE includes a set of more difficult language understanding challenges, improved resources, and a publicly available leaderboard.

When SuperGLUE was introduced, there was a nearly 20-point gap between the best-performing model and human performance on the leaderboard. But as of early January, two models — one from Microsoft called DeBERTa and a second from Google called T5 + Meena — have surpassed the human baselines, becoming the first to do so.

Sam Bowman, assistant professor at NYU’s center for data science, said the achievement reflected innovations in machine learning including self-supervised learning, where models learn from unlabeled datasets with recipes for adapting the insights to target tasks. “These datasets reflect some of the hardest supervised language understanding task datasets that were freely available two years ago,” he said. “There’s no reason to believe that SuperGLUE will be able to detect further progress in natural language processing, at least beyond a small remaining margin.”

But SuperGLUE is neither a perfect nor a complete test of human language ability. In a blog post, the Microsoft team behind DeBERTa themselves noted that their model is “by no means” reaching the human-level intelligence of natural language understanding. They say this will require research breakthroughs, along with new benchmarks to measure them and their effects.

SuperGLUE

As the researchers wrote in the paper introducing SuperGLUE, their benchmark is intended to be a simple, hard-to-game measure of advances toward general-purpose language understanding technologies for English. It comprises eight language understanding tasks drawn from existing data and accompanied by a performance metric as well as an analysis toolkit.

The tasks are:

  • Boolean Questions (BoolQ) requires models to respond to a question about a short passage from a Wikipedia article that contains the answer. The questions come from Google users, who submit them via Google Search.
  • CommitmentBank (CB) tasks models with identifying a hypothesis contained within a text excerpt from sources including the Wall Street Journal and determining whether this hypothesis holds true.
  • Choice of plausible alternatives (COPA) provides a premise sentence about topics from blogs and a photography-related encyclopedia from which models must determine either the cause or effect from two possible choices.
  • Multi-Sentence Reading Comprehension (MultiRC) is a question-answer task where each example consists of a context paragraph, a question about that paragraph, and a list of possible answers. A model must predict which answers are true and false.
  • Reading Comprehension with Commonsense Reasoning Dataset (ReCoRD) has models predict masked-out words and phrases from a list of choices in passages from CNN and the Daily Mail, where the same words or phrases might be expressed using multiple different forms, all of which are considered correct.
  • Recognizing Textual Entailment (RTE) challenges natural language models to identify whenever the truth of one text excerpt follows from another text excerpt.
  • Word-in-Context (WiC) provides models two text snippets and a polysemous word (i.e., word with multiple meanings) and requires them to determine whether the word is used with the same sense in both sentences.
  • Winograd Schema Challenge (WSC) is a task where models, given passages from fiction books, must answer multiple-choice questions about the antecedent of ambiguous pronouns. It’s designed to be an improvement on the Turing Test.

SuperGLUE also attempts to measure gender bias in models with Winogender Schemas, pairs of sentences that differ only by the gender of one pronoun in the sentence. However, the researchers note that Winogender has limitations in that it offers only positive predictive value: While a poor bias score is clear evidence that a model exhibits gender bias, a good score doesn’t mean the model is unbiased. Moreover, it doesn’t include all forms of gender or social bias, making it a coarse measure of prejudice.
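The paired-sentence idea can be illustrated with a small Python sketch. Everything here is hypothetical: the templates are made up rather than taken from Winogender, and `toy_predict` is a deliberately biased stand-in for a real model.

```python
def make_pair(template, pronouns=("he", "she")):
    """Fill the {pronoun} slot twice to get sentences that differ in one word."""
    return tuple(template.format(pronoun=p) for p in pronouns)


def pair_disagreement_rate(templates, predict):
    """Fraction of pairs whose prediction changes when only the pronoun changes."""
    if not templates:
        return 0.0
    changed = 0
    for template in templates:
        first, second = make_pair(template)
        if predict(first) != predict(second):
            changed += 1
    return changed / len(templates)


def toy_predict(sentence):
    """Stand-in for a real model: a deliberately biased rule, for illustration only."""
    return "technician" if " he " in f" {sentence} " else "customer"


# Made-up templates, not taken from the Winogender dataset.
templates = [
    "The technician told the customer that {pronoun} could pay with cash.",
    "The technician told the customer that {pronoun} had finished the repair.",
]

print(make_pair(templates[0]))
print("disagreement rate:", pair_disagreement_rate(templates, toy_predict))
```

A nonzero rate shows the prediction depends on the pronoun alone. As the researchers caution, a rate of zero on a small probe like this would not show that a model is unbiased.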

To establish human performance baselines, the researchers drew on existing literature for WiC, MultiRC, RTE, and ReCoRD and hired crowdworker annotators through Amazon’s Mechanical Turk platform. Each worker, who was paid an average of $23.75 an hour, completed a short training phase before annotating up to 30 samples of selected test sets using instructions and an FAQ page.

Architectural improvements

The Google team hasn’t yet detailed the improvements that led to its model’s record-setting performance on SuperGLUE, but the Microsoft researchers behind DeBERTa detailed their work in a blog post published earlier this morning. DeBERTa isn’t new — it was open-sourced last year — but the researchers say they trained a larger version with 1.5 billion parameters (i.e., the internal variables that the model uses to make predictions). It’ll be released in open source and integrated into the next version of Microsoft’s Turing natural language representation model, which supports products like Bing, Office, Dynamics, and Azure Cognitive Services.

DeBERTa is pretrained through masked language modeling (MLM), a fill-in-the-blank task where a model is taught to use the words surrounding a masked “token” to predict what the masked word should be. DeBERTa uses both the content and position information of context words for MLM, such that it’s able to recognize that “store” and “mall” in the sentence “a new store opened beside the new mall” play different syntactic roles, for example.

Unlike some other models, DeBERTa accounts for words’ absolute positions in the language modeling process. Moreover, it computes the parameters within the model that transform input data and measure the strength of word-word dependencies based on words’ relative positions. For example, DeBERTa would understand that the dependency between the words “deep” and “learning” is much stronger when they occur next to each other than when they occur in different sentences.
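One way to picture relative-position information is as a matrix of offsets between every pair of tokens, clipped so that all distant pairs look alike. The Python sketch below is a simplified illustration under that assumption. The function name and the `max_distance` parameter are hypothetical, and this is not DeBERTa’s actual code.

```python
def relative_positions(num_tokens, max_distance):
    """Build a matrix whose entry [i][j] is the offset j - i, clipped to
    the range [-max_distance, max_distance].

    Nearby word pairs get distinct offsets; far-apart pairs share the
    clipped value, so a model would treat them alike.
    """
    matrix = []
    for i in range(num_tokens):
        row = []
        for j in range(num_tokens):
            offset = j - i
            row.append(max(-max_distance, min(max_distance, offset)))
        matrix.append(row)
    return matrix


tokens = ["deep", "learning", "is", "fun"]
for token, row in zip(tokens, relative_positions(len(tokens), max_distance=2)):
    print(f"{token:>8}: {row}")
```

In this picture, “deep” and “learning” sit at offset 1 wherever the pair appears in a sentence, which lets a model learn one dependency strength for adjacent words rather than one per absolute position.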

DeBERTa also benefits from adversarial training, a technique that leverages adversarial examples derived from small variations made to training data. These adversarial examples are fed to the model during the training process, improving its generalizability.

The Microsoft researchers hope to next explore how to enable DeBERTa to generalize to novel tasks composed of familiar subtasks or basic problem-solving skills, a concept known as compositional generalization. One path forward might be incorporating so-called compositional structures more explicitly, which could entail combining AI with symbolic reasoning, in other words, manipulating symbols and expressions according to mathematical and logical rules.

“DeBERTa surpassing human performance on SuperGLUE marks an important milestone toward general AI,” the Microsoft researchers wrote. “[But unlike DeBERTa,] humans are extremely good at leveraging the knowledge learned from different tasks to solve a new task with no or little task-specific demonstration.”

New benchmarks

According to Bowman, no successor to SuperGLUE is forthcoming, at least not in the near term. But there’s growing consensus within the AI research community that future benchmarks, particularly in the language domain, must take into account broader ethical, technical, and societal challenges if they’re to be useful.

For example, a number of studies show that popular benchmarks do a poor job of estimating real-world AI performance. One recent report found that 60%-70% of answers given by natural language processing models were embedded somewhere in the benchmark training sets, indicating that the models were usually simply memorizing answers. Another study — a meta-analysis of over 3,000 AI papers — found that metrics used to benchmark AI and machine learning models tended to be inconsistent, irregularly tracked, and not particularly informative.
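A rough version of the overlap check described above can be written in a few lines of Python. This is a sketch using simple verbatim matching after normalization. The example texts are made up, and the study cited may have measured overlap differently.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences don't hide matches."""
    return " ".join(text.lower().split())


def answer_overlap_rate(test_answers, training_texts):
    """Share of test answers that appear verbatim inside any training text."""
    if not test_answers:
        return 0.0
    corpus = [normalize(t) for t in training_texts]
    hits = sum(
        1 for answer in test_answers
        if any(normalize(answer) in doc for doc in corpus)
    )
    return hits / len(test_answers)


# Made-up examples, not drawn from any real benchmark.
training_texts = [
    "Paris is the capital of France.",
    "Water boils at 100 degrees Celsius at sea level.",
]
test_answers = ["paris", "100 degrees Celsius", "Canberra"]

print(f"overlap: {answer_overlap_rate(test_answers, training_texts):.0%}")
```

A high overlap rate suggests that a model could score well by recalling training text rather than by reasoning about the question.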

Part of the problem stems from the fact that language models like OpenAI’s GPT-3, Google’s T5 + Meena, and Microsoft’s DeBERTa learn to write humanlike text by internalizing examples from the public web. Drawing on sources like ebooks, Wikipedia, and social media platforms like Reddit, they make inferences to complete sentences and even whole paragraphs.

As a result, language models often amplify the biases encoded in this public data; a portion of the training data is not uncommonly sourced from communities with pervasive gender, race, and religious prejudices. AI research firm OpenAI notes that this can lead to placing words like “naughty” or “sucked” near female pronouns and “Islam” near words like “terrorism.” Other studies, like one published by Intel, MIT, and Canadian AI initiative CIFAR researchers in April, have found high levels of stereotypical bias from some of the most popular models, including Google’s BERT and XLNet, OpenAI’s GPT-2, and Facebook’s RoBERTa. This bias could be leveraged by malicious actors to foment discord by spreading misinformation, disinformation, and outright lies that “radicalize individuals into violent far-right extremist ideologies and behaviors,” according to the Middlebury Institute of International Studies.

Most existing language benchmarks fail to capture this. Motivated by the findings in the two years since SuperGLUE’s introduction, perhaps future ones might.

Big Data – VentureBeat

Uber researchers propose AI language model that emphasizes positive and polite responses

January 5, 2021   Big Data
AI-powered assistants like Siri, Cortana, Alexa, and Google Assistant are pervasive. But for these assistants to engage users and help them to achieve their goals, they need to exhibit appropriate social behavior and provide informative replies. Studies show that users respond better to social language in the sense that they’re more responsive and likelier to complete tasks. Inspired by this, researchers affiliated with Uber and Carnegie Mellon developed a machine learning model that injects social language into an assistant’s responses while preserving their integrity.

The researchers focused on the customer service domain, specifically a use case where customer service personnel helped drivers sign up with a ride-sharing provider like Uber or Lyft. They first conducted a study to suss out the relationship between customer service representatives’ use of friendly language and drivers’ responsiveness and completion of their first ride-sharing trip. Then, they developed a machine learning model for an assistant that includes a social language understanding component and a language generation component.

In their study, the researchers found that the “politeness level” of customer service representative messages correlated with driver responsiveness and completion of their first trip. Building on this, they trained their model on a dataset of over 233,000 messages from drivers and corresponding responses from customer service representatives. The responses had labels indicating how generally polite and positive they were, chiefly as judged by human evaluators.

Post-training, the researchers used automated and human-driven techniques to evaluate the politeness and positivity of their model’s messages. They found it could vary the politeness of its responses while preserving the meaning of its messages, but that it was less successful in maintaining overall positivity. They attribute this to a potential mismatch between what they thought they were measuring and manipulating and what they actually measured and manipulated.

“A common explanation for the negative association of positivity with driver responsiveness in … and the lack of an effect of positivity enhancement on generated agent responses … might be a discrepancy between the concept of language positivity and its operationalization as positive sentiment,” the researchers wrote in a paper detailing their work. “[Despite this, we believe] the customer support services can be improved by utilizing the model to provide suggested replies to customer service representatives so that they can (1) respond quicker and (2) adhere to the best practices (e.g. using more polite and positive language) while still achieving the goal that the drivers and the ride-sharing providers share, i.e., getting drivers on the road.”

The work comes as Gartner predicts that by the year 2020, only 10% of customer-company interactions will be conducted via voice. According to the 2016 Aspect Consumer Experience Index research, 71% of consumers want the ability to solve most customer service issues on their own, up 7 points from the 2015 index. And according to that same Aspect report, 44% said that they would prefer to use a chatbot for all customer service interactions compared with a human.

Big Data – VentureBeat

Allen Institute researchers find pervasive toxicity in popular language models

September 27, 2020   Big Data

Researchers at the Allen Institute for AI have created a data set — RealToxicityPrompts — that attempts to elicit racist, sexist, or otherwise toxic responses from AI language models, as a way of measuring the models’ preferences for these responses. In experiments, they claim to have found that no current machine learning technique sufficiently protects against toxic outputs, underlining the need for better training sets and model architectures.

It’s well-established that models amplify the biases in data on which they were trained. That’s problematic in the language domain, because a portion of the data is often sourced from communities with pervasive gender, race, and religious prejudices. AI research firm OpenAI notes that this can lead to placing words like “naughty” or “sucked” near female pronouns and “Islam” near words like “terrorism.” Other studies, like one published by Intel, MIT, and Canadian AI initiative CIFAR researchers in April, have found high levels of stereotypical bias from some of the most popular models, including Google’s BERT and XLNet, OpenAI’s GPT-2, and Facebook’s RoBERTa.

The Allen Institute researchers designed RealToxicityPrompts to measure the risk of “toxic degeneration” by pretrained language models, or models fed data sets containing thousands to billions of documents. They compiled a list of 100,000 naturally occurring prompts extracted from a large corpus of English Reddit text (the open source Open-WebText Corpus) and paired them with toxicity scores from Google’s Perspective API, which uses machine learning models to detect the potential toxicity of a comment.

The coauthors evaluated five language models using RealToxicityPrompts, specifically three models from OpenAI (GPT-1, GPT-2, and GPT-3) and two models from Salesforce (CTRL and CTRL-Wiki). They found that while toxic prompts (prompts offensive or stereotypically biased on their face) were 70% or more likely to yield toxic content from the language models, even non-toxic prompts resulted in offensive responses. The results show that all models were 49% or more likely to answer non-toxic content with toxic responses, even models like CTRL-Wiki that were only trained on Wikipedia data.
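
One way to read percentages like these is as the share of prompts for which a model produced at least one toxic continuation. The Python sketch below illustrates that calculation under stated assumptions: the scores are made-up stand-ins for toxicity scores between 0 and 1, the 0.5 cutoff and the "at least one of several continuations" rule are assumptions rather than details given in this article, and `toxic_generation_rate` is a hypothetical helper name.

```python
from typing import Dict, List

# Assumed cutoff for calling a continuation toxic; the article does not state one.
TOXICITY_THRESHOLD = 0.5


def toxic_generation_rate(scores_by_prompt: Dict[str, List[float]],
                          threshold: float = TOXICITY_THRESHOLD) -> float:
    """Fraction of prompts with at least one continuation scored at or above threshold."""
    if not scores_by_prompt:
        return 0.0
    hits = sum(
        1 for scores in scores_by_prompt.values()
        if any(score >= threshold for score in scores)
    )
    return hits / len(scores_by_prompt)


# Hypothetical toxicity scores: three non-toxic prompts, three continuations each.
example_scores = {
    "prompt_a": [0.05, 0.12, 0.61],
    "prompt_b": [0.02, 0.08, 0.10],
    "prompt_c": [0.55, 0.71, 0.30],
}
rate = toxic_generation_rate(example_scores)
print(f"Share of prompts with a toxic continuation: {rate:.2f}")
```

In this toy data, two of the three prompts draw at least one continuation above the cutoff, so the rate is two thirds. A real evaluation would replace the hand-written scores with scores from a toxicity classifier.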

To uncover the potential reasons for this, the researchers investigated the corpora used to pretrain several of the language models: OpenAI-WT (GPT-2’s training data) and OWTC (an open source fork of OpenAI-WT). OWTC contains text from Reddit posts with a karma of 3 or higher and 38GB of English documents, including news articles. OpenAI-WT — which has a 29% overlap with OWTC, such that at least 2.3 million documents in OpenAI-WT also appear in OWTC — contains about 8 million documents filtered using a blocklist of sexually explicit and otherwise offensive subreddits.

The researchers found that OWTC and OpenAI-WT contain “non-negligible” amounts of toxicity as identified by the Perspective API. About 2.1% of documents in OWTC were offensive compared with 4.3% in OpenAI-WT, or twice that of OWTC despite the blocklist. Unreliable news sites were another major source of toxicity in the data sets, as were posts from banned or quarantined subreddits. In fact, 63,000 documents in OpenAI-WT and OWTC came from links shared on problematic Reddit communities; GPT-2 was pretrained on at least 40,000 documents from the quarantined /r/The_Donald and 4,000 documents from the banned /r/WhiteRights.

“Overall, our investigations demonstrate that toxicity is a prevalent issue in both neural language generation and web text corpora,” the coauthors wrote in a paper describing their work. “Although they show some reduction in toxicity, steering methods do not fully protect neural models from toxic degeneration. Additionally, the corpora that language models are pretrained on contain non-negligible amounts of toxic, abusive, and untrustworthy content.”

Big Data – VentureBeat

AI Weekly: Cutting-edge language models can produce convincing misinformation if we don’t stop them

September 19, 2020   Big Data

It’s been three months since OpenAI launched an API underpinned by cutting-edge language model GPT-3, and it continues to be the subject of fascination within the AI community and beyond. Portland State University computer science professor Melanie Mitchell found evidence that GPT-3 can make primitive analogies, and Columbia University’s Raphaël Millière asked GPT-3 to compose a response to the philosophical essays written about it. But as the U.S. presidential election nears, there’s growing concern among academics that tools like GPT-3 could be co-opted by malicious actors to foment discord by spreading misinformation, disinformation, and outright lies. In a paper published by the Middlebury Institute of International Studies’ Center on Terrorism, Extremism, and Counterterrorism (CTEC), the coauthors find that GPT-3’s strength in generating “informational,” “influential” text could be leveraged to “radicalize individuals into violent far-right extremist ideologies and behaviors.”

Bots are increasingly being used around the world to sow the seeds of unrest, either through the spread of misinformation or the amplification of controversial points of view. An Oxford Internet Institute report published in 2019 found evidence of bots disseminating propaganda in 50 countries, including Cuba, Egypt, India, Iran, Italy, South Korea, and Vietnam. In the U.K., researchers estimate that half a million tweets about the country’s proposal to leave the European Union sent between June 5 and June 12 came from bots. And in the Middle East, bots generated thousands of tweets in support of Saudi Arabia’s crown prince Mohammed bin Salman following the 2018 murder of Washington Post opinion columnist Jamal Khashoggi.

Bot activity perhaps most relevant to the upcoming U.S. elections occurred last November, when cyborg bots spread misinformation during the local Kentucky elections. VineSight, a company that tracks social media misinformation, uncovered small networks of bots retweeting and liking messages casting doubt on the gubernatorial results before and after the polls closed.

But bots historically haven’t been sophisticated; most simply retweet, upvote, or favorite posts likely to prompt toxic (or violent) debate. GPT-3-powered bots or “cyborgs” — accounts that attempt to evade spam detection tools by fielding tweets from human operators — could prove to be far more harmful given how convincing their output tends to be. “Producing ideologically consistent fake text no longer requires a large corpus of source materials and hours of [training]. It is as simple as prompting GPT-3; the model will pick up on the patterns and intent without any other training,” the coauthors of the Middlebury Institute study wrote. “This is … exacerbated by GPT-3’s impressively deep knowledge of extremist communities, from QAnon to the Atomwaffen Division to the Wagner Group, and those communities’ particular nuances and quirks.”

Above: A question-answer thread generated by GPT-3.

In their study, the CTEC researchers sought to determine whether people could color GPT-3’s knowledge with ideological bias. (GPT-3 was trained on trillions of words from the internet, and its architectural design enables fine-tuning through longer, representative prompts like tweets, paragraphs, forum threads, and emails.) They discovered that it only took a few seconds to produce a system able to answer questions about the world consistent with a conspiracy theory, in one case falsehoods originating from the QAnon and Iron March communities.

“GPT-3 can complete a single post with convincing responses from multiple viewpoints, bringing in various different themes and philosophical threads within far-right extremism,” the coauthors wrote. “It can also generate new topics and opening posts from scratch, all of which fall within the bounds of [the communities’] ideologies.”

CTEC’s analysis also found GPT-3 is “surprisingly robust” with respect to multilingual language understanding, demonstrating an aptitude for producing Russian-language text in response to English prompts that show examples of right-wing bias, xenophobia, and conspiracism. The model also proved “highly effective” at creating extremist manifestos that were coherent, understandable, and ideologically consistent, communicating how to justify violence and instructing on anything from weapons creation to philosophical radicalization.

Above: GPT-3 writing extremist manifestos.

“No specialized technical knowledge is required to enable the model to produce text that aligns with and expands upon right-wing extremist prompts. With very little experimentation, short prompts produce compelling and consistent text that would believably appear in far-right extremist communities online,” the researchers wrote. “GPT-3’s ability to emulate the ideologically consistent, interactive, normalizing environment of online extremist communities poses the risk of amplifying extremist movements that seek to radicalize and recruit individuals. Extremists could easily produce synthetic text that they lightly alter and then employ automation to speed the spread of this heavily ideological and emotionally stirring content into online forums where such content would be difficult to distinguish from human-generated content.”

OpenAI says it’s experimenting with safeguards at the API level including “toxicity filters” to limit harmful language generation from GPT-3. For instance, it hopes to deploy filters that pick up antisemitic content while still letting through neutral content talking about Judaism.

Another solution might lie in a technique proposed by Salesforce researchers including former Salesforce chief scientist Richard Socher. In a recent paper, they describe GeDi (short for “generative discriminator”), a machine learning algorithm capable of “detoxifying” text generation by language models like GPT-3’s predecessor, GPT-2. During one experiment, the researchers trained GeDi as a toxicity classifier on an open source data set released by Jigsaw, Alphabet’s technology incubator. They claim that GeDi-guided generation resulted in significantly less toxic text than baseline models while achieving the highest linguistic acceptability.

But technical mitigation can only achieve so much. CTEC researchers recommend partnerships between industry, government, and civil society to effectively manage and set the standards for use and abuse of emerging technologies like GPT-3. “The originators and distributors of generative language models have unique motivations to serve potential clients and users. Online service providers and existing platforms will need to accommodate for the impact of the output from such language models being utilized with the use of their services,” the researchers wrote. “Citizens and the government officials who serve them may empower themselves with information about how and in what manner creation and distribution of synthetic text supports healthy norms and constructive online communities.”

It’s unclear to what extent this will be possible ahead of the U.S. presidential election, but CTEC’s findings make the urgency apparent. GPT-3 and similar models have destructive potential if not properly curtailed, and it will require stakeholders from across the political and ideological spectrum to figure out how they might be deployed both safely and responsibly.

For AI coverage, send news tips to Khari Johnson and Kyle Wiggers — and be sure to subscribe to the AI Weekly newsletter and bookmark our AI Channel.

Thanks for reading,

Kyle Wiggers

AI Staff Writer

Big Data – VentureBeat

Researchers find cutting-edge language models fall short in basic reasoning

September 9, 2020   Big Data

Even sophisticated language models such as OpenAI’s GPT-3 struggle with socially important topics like morality, history, and law. That’s the top-line finding from a new paper coauthored by Columbia, University of Chicago, and University of California, Berkeley researchers that proposes a 57-task test to measure models’ ability to reason. Models must possess problem-solving abilities and extensive knowledge about the world to perform well on the test. But in experiments, the coauthors found that the models they benchmarked — including GPT-3 — frequently didn’t know when they were wrong.

The goal of the novel test set is to bridge the gap between the knowledge that models see during training and existing measures of success in natural language processing. Like all machine learning models, language models learn patterns from vast data sets often sourced from Wikipedia, Reddit, ebooks, and other web sources. Some recently introduced benchmarks attempt to capture the linguistic skills of models, but so far, there’s little evidence to suggest a correlation between benchmark performance and a model’s grasp of commonsense reasoning.

The researchers claim their test is different in that it assesses models across subjects humans commonly learn, like mathematics, history, and ethics. To craft it, graduate and undergraduate students collected 15,908 questions from freely available sources online, including practice exams for undergraduate courses, quizzes for readers of Oxford University Press publications, and tests like the Graduate Record Examination, U.S. Medical Licensing Examination, and Examination for Professional Practice in Psychology. The tasks range in difficulty from an elementary level to an “advanced professional level,” a sampling the coauthors argue is sufficient for identifying a model’s blind spots.

Above: Example questions from the researchers’ test set.

“We measure arbitrary real-world text understanding,” they wrote, noting that each subject contains at least 100 test examples. “Since models are pretrained on the internet, this enables us to test how well they can extract useful knowledge from massive corpora.”

In addition to GPT-3, the researchers benchmarked Google’s T5 and the Allen Institute for AI’s UnifiedQA question-answering model against their test set. The results show that meaningful progress has only become possible in recent months, with models containing up to 13 billion parameters achieving 25% accuracy and 175-billion-parameter models like GPT-3 reaching 43.9% accuracy. (Parameters are parts of the model learned from historical training data.) Even so, GPT-3 failed to excel at any single subject; its performance on the test set was lopsided, with almost 70% accuracy for its best subject (U.S. foreign policy) but “near-random” performance for several other subjects (e.g., college chemistry).
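
To make "lopsided" and "near-random" concrete, here is a minimal Python sketch of how per-subject accuracy could be tallied and compared against chance. It assumes multiple-choice questions with four options, so that random guessing scores about 25%; the article does not spell out the question format. The records and the helper name `accuracy_by_subject` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

NUM_CHOICES = 4  # assumption: four answer options per question
CHANCE_ACCURACY = 1 / NUM_CHOICES


def accuracy_by_subject(records: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """Each record is (subject, predicted_choice, correct_choice)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, predicted, gold in records:
        total[subject] += 1
        if predicted == gold:
            correct[subject] += 1
    return {subject: correct[subject] / total[subject] for subject in total}


# Hypothetical predictions, for illustration only.
records = [
    ("us_foreign_policy", "A", "A"),
    ("us_foreign_policy", "B", "B"),
    ("us_foreign_policy", "C", "D"),
    ("college_chemistry", "A", "B"),
    ("college_chemistry", "C", "C"),
    ("college_chemistry", "D", "A"),
    ("college_chemistry", "B", "C"),
]
per_subject = accuracy_by_subject(records)
for subject, accuracy in sorted(per_subject.items()):
    gap = accuracy - CHANCE_ACCURACY
    print(f"{subject}: {accuracy:.2f} ({gap:+.2f} vs. chance)")
```

Reporting the gap to chance per subject, rather than a single overall number, is what exposes a profile where one subject scores well and another is indistinguishable from guessing.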

“Overall, GPT-3 does poorly on highly procedural problems,” the researchers explained. “It is notably poor at modeling human (dis)approval, as evident by the low performance on the professional law and moral scenarios tasks, [and it] also has difficulty performing calculations, so much so that it exhibits poor performance on elementary mathematics and many other STEM subjects with ‘plug and chug’ problems … We speculate that is in part because GPT-3 acquires declarative knowledge more readily than procedural knowledge.”

The findings imply that current models have room for improvement, but it’s unclear whether existing techniques will suffice. As the researchers point out, previous research indicates that a 10 times increase in model size must be accompanied by an approximately 5 times increase in data, which might be logistically prohibitive.

“Aside from the tremendous expense in creating multi-trillion parameter language models, data may also become a bottleneck,” the researchers continued. “There is far less written about esoteric branches of knowledge than about everyday text.”
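
The 10-to-5 ratio cited above can be turned into a rough rule of thumb if one assumes the relationship is a power law, which this article does not state explicitly. Under that assumption, data needs grow as model size raised to the power log10(5), or roughly 0.70. The Python sketch below works through the arithmetic; `data_multiplier` is a hypothetical helper, and the exponent is fitted only to the single ratio quoted here.

```python
import math

# Assumption: data needs follow a power law in model size, fitted to the
# single ratio quoted in the article (10x parameters -> about 5x data).
DATA_EXPONENT = math.log10(5)  # roughly 0.699


def data_multiplier(model_size_multiplier: float) -> float:
    """Estimated growth in training data for a given growth in model size."""
    return model_size_multiplier ** DATA_EXPONENT


for growth in (10, 100, 1000):
    print(f"{growth}x parameters -> about {data_multiplier(growth):.0f}x data")
```

Under this assumption, a 100-fold increase in parameters would call for roughly 25 times as much data, which is why the researchers flag data availability as a possible bottleneck.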

Big Data – VentureBeat

Research shows natural language benchmarks don’t measure AI models’ general knowledge well

August 12, 2020   Big Data

Open-domain question-answering models (models theoretically capable of responding to novel questions with novel answers) often simply memorize answers found in the data on which they’re trained, depending on the data set. That’s the assertion of a team of researchers affiliated with Facebook and University College London, who in a preprint paper present evidence that 60%-70% of answers given by models tested on open-domain benchmarks are embedded somewhere in the training sets.

Open-domain question-answering has received attention in the AI community for its practical applications, and more recently as a method to analyze language models’ grasp of factual knowledge. But a deep understanding of what kinds of questions models can answer remains elusive; unknowns about how questions and answers are distributed in benchmark corpora make it hard to contextualize the results.

In their study, the researchers sought to evaluate the test sets of popular open-domain question-answering data sets including WebQuestions, TriviaQA, and Open Natural Questions. They identified classes of questions a model should be able to answer and annotated 1,000 question-answer pairs from each test set for repeated questions in their respective training sets. Then they computed the performance of several models on the benchmarks using open-book approaches (which leverage retrieval from a large corpus of documents) and closed-book approaches (which focus on training large models with no external knowledge).

The three data sets in question aren’t much alike, which was the point; testing across all three guaranteed robustness. WebQuestions contains 3,778 training and 2,032 test question-answer pairs from a search engine, while TriviaQA has 78,785 training and 11,313 test question-answer pairs from free trivia websites. Meanwhile, Open Natural Questions comprises 79,168 training and 3,610 test question-answer pairs from a combination of search engines and Wikipedia articles.

The team theorizes open-domain question-answering models should be able to (1) recall the answer to a question seen at training time, (2) answer novel questions at test time and choose an answer from the set of answers seen during training, and (3) answer novel questions that have answers not contained within the training data set. To determine whether the aforementioned benchmarks measure any of these behaviors, the coauthors split the test data in each corpus by whether the answers appeared somewhere in the training sets. Around 58%-71% of test answers were also somewhere in the training data, according to the researchers, demonstrating that the majority of the test data didn’t probe for answer generalization.
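
The split described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the researchers’ code: the normalization (lowercasing and collapsing whitespace) is an assumption, the question-answer pairs are invented, and `split_by_answer_overlap` is a hypothetical helper name.

```python
from typing import Iterable, List, Tuple

QAPair = Tuple[str, str]


def normalize(text: str) -> str:
    """Assumed normalization: lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def split_by_answer_overlap(
    train_answers: Iterable[str], test_pairs: Iterable[QAPair]
) -> Tuple[List[QAPair], List[QAPair]]:
    """Separate test pairs whose answer appears in training from those that do not."""
    seen = {normalize(answer) for answer in train_answers}
    overlapping: List[QAPair] = []
    novel: List[QAPair] = []
    for question, answer in test_pairs:
        bucket = overlapping if normalize(answer) in seen else novel
        bucket.append((question, answer))
    return overlapping, novel


# Hypothetical data, for illustration only.
train_answers = ["Paris", "Mount Everest", "1969"]
test_pairs = [
    ("What is the capital of France?", "paris"),
    ("What is the tallest mountain on Earth?", "Mount  Everest"),
    ("Who wrote Hamlet?", "William Shakespeare"),
]
overlapping, novel = split_by_answer_overlap(train_answers, test_pairs)
overlap_share = len(overlapping) / len(test_pairs)
print(f"{overlap_share:.0%} of test answers also appear in training")
```

Only the pairs in `novel` can tell you anything about answer generalization, which is the behavior the researchers say most of the test data fails to probe.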

The team also probed the benchmarks for paraphrased questions in training data, using the set of 1,000 annotated questions. They say that 28%-34% of the questions were paraphrased, the majority being near-duplicates differing only by one or two words. “This result implies that 30% of the test set of these datasets only probe for how well models can simply memorize question-answer pairs seen at training,” the coauthors wrote.

The researchers selected several “open book” models — dense passage retrieval, retrieval-augmented generation, and fusion-in-decoder — and “closed book” models (Facebook’s BART and Google’s T5) to test, as well as nearest-neighbor models that store all available answers and classify new answers based on a similarity measure. Results on the benchmark corpora imply that all models memorized questions well, with an untrained nearest-neighbor model answering 20% of the test questions correctly. But they performed poorly on questions that couldn’t be memorized from training sets, with a mean absolute performance difference of 63% between repeated and non-repeated data. And when it came to generalization, one model that reliably memorized questions — T5 — struggled, achieving only a 22% match score.
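
A nearest-neighbor baseline of the kind mentioned above can be sketched as follows. This is a minimal illustration, not the researchers’ implementation: it assumes word-overlap (Jaccard) similarity between questions, whereas the article does not say which similarity measure was used, and the training pairs and the name `nearest_neighbor_answer` are hypothetical.

```python
from typing import List, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lowercased word set; question marks are dropped."""
    return set(text.lower().replace("?", " ").split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Assumed similarity measure: shared words divided by total distinct words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def nearest_neighbor_answer(question: str, train_pairs: List[Tuple[str, str]]) -> str:
    """Return the stored answer of the most similar training question (no training step)."""
    query = tokens(question)
    _, best_answer = max(train_pairs, key=lambda pair: jaccard(query, tokens(pair[0])))
    return best_answer


# Hypothetical training pairs, for illustration only.
train_pairs = [
    ("Who wrote the play Hamlet?", "William Shakespeare"),
    ("What is the capital of France?", "Paris"),
    ("When did Apollo 11 land on the moon?", "1969"),
]
paraphrased = nearest_neighbor_answer("Who was the author of the play Hamlet?", train_pairs)
unseen = nearest_neighbor_answer("What is the capital of Peru?", train_pairs)
print(paraphrased)
print(unseen)
```

The paraphrased question retrieves the right stored answer, while the unseen question gets the answer to its closest look-alike, which is wrong. That gap between repeated and non-repeated questions is the pattern the study reports.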

“It is clear that performance on these data sets cannot be properly understood by overall question-answer accuracy,” the researchers wrote. “We suggest that in future, a greater emphasis be placed on more behavior-driven evaluation rather than pursuing single-number overall accuracy figures.”

Big Data – VentureBeat

© 2021 Business Intelligence Info