Tag Archives: Text

Researchers claim that AI-translated text is less ‘lexically’ rich than human translations

February 3, 2021   Big Data

Human interpreters make choices unique to them, consciously or unconsciously, when translating one language into another. They might explicate, normalize, or condense and summarize, creating fingerprints known informally as “translationese.” In machine learning, generating accurate translations has been the main objective thus far. But this might be coming at the expense of translation richness and diversity.

In a new study, researchers at Tilburg University and the University of Maryland attempt to quantify the lexical and grammatical diversity of “machine translationese” — i.e., the fingerprints made by AI translation algorithms. They claim to have found a “quantitatively measurable” difference between the linguistic richness of machine translation systems’ training data and their translations, which could be a product of statistical bias.

The researchers looked at a range of machine learning model architectures, including Transformer, neural machine translation, long short-term memory networks, and phrase-based statistical machine translation. In experiments, they tasked each with translating between English, French, and Spanish and compared the original text with the translations using nine different metrics.

The researchers report that in experiments, the original training data — a collection of reference translations — always had a higher lexical diversity than the machine translations regardless of the type of model used. In other words, the reference translations were consistently more diverse in terms of vocabulary and synonym usage than the translations from the models.
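The paper's nine metrics are not reproduced here, but as a rough, generic illustration of what measuring lexical diversity can look like (this is not the authors' method, and the two sample sentences below are invented), a simple type-token ratio can be computed in a few lines of R:

# Minimal sketch: type-token ratio (unique words / total words), one simple lexical diversity measure
# The example sentences are hypothetical and are not taken from the study's data
ttr <- function(text) {
  tokens <- unlist(strsplit(tolower(text), "[^a-z]+"))
  tokens <- tokens[tokens != ""]
  length(unique(tokens)) / length(tokens)
}
reference_translation <- "The delegates convened, deliberated at length, and eventually reached a fragile consensus."
machine_translation   <- "The delegates met, talked for a long time, and finally reached an agreement they all accepted."
ttr(reference_translation)  # a higher ratio suggests a richer vocabulary
ttr(machine_translation)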

The coauthors point out that while the loss of lexical diversity could be a desirable side effect of machine translation systems (in terms of simplification or consistency), the loss of morphological richness is problematic as it can prevent systems from making grammatically correct choices. Bias can emerge, too, with machine translation systems having a stronger negative impact in terms of diversity and richness on morphologically richer languages like Spanish and French.

“As [machine translation] systems have reached a quality that is (arguably) close to that of human translations and as such are being used widely on a daily basis, we believe it is time to look into the potential effects of [machine translation] algorithms on language itself,” the researchers wrote in a paper describing their work. “All [of our] metrics indicate that the original training data has more lexical and morphological diversity compared to translations produced by the [machine translation] systems … If machine translationese (and other types of ‘NLPese’) is a simplified version of the training data, what does that imply from a sociolinguistic perspective and how could this affect language on a longer term?”

The coauthors propose no solutions to the machine translation problems they claim to have uncovered. However, they believe their metrics could drive future research on the subject.


OpenAI debuts DALL-E for generating images from text

January 6, 2021   Big Data


OpenAI today debuted two multimodal AI systems that combine computer vision and NLP. One of them, DALL-E, generates images from text. For example, the image above for this story was generated from the text prompt “an illustration of a baby daikon radish in a tutu walking a dog.” DALL-E uses a 12-billion-parameter version of GPT-3 and, like GPT-3, is a Transformer language model. The name is meant to hearken to the artist Salvador Dalí and the robot WALL-E.

Above: Examples of images generated from the text prompt “A stained glass window with an image of a blue strawberry”

Image Credit: OpenAI

Tests shared by OpenAI today appear to demonstrate that DALL-E can manipulate and rearrange objects in generated imagery, and can also create things that simply don’t exist, like a cube with the texture of a porcupine or a cube of clouds. Depending on the text prompt, some images generated by DALL-E look as if they were taken in the real world, while others depict works of art. Visit the OpenAI website to try a controlled demo of DALL-E.

Above: cloud cube

“We recognize that work involving generative models has the potential for significant, broad societal impacts. In the future, we plan to analyze how models like DALL·E relate to societal issues like economic impact on certain work processes and professions, the potential for bias in the model outputs, and the longer term ethical challenges implied by this technology,” OpenAI said in a blog post about DALL-E today.

OpenAI also introduced CLIP today, a multimodal model trained on 400 million pairs of images and text collected from the internet. CLIP uses zero-shot learning capabilities akin to GPT-2 and GPT-3 language models.

“We find that CLIP, similar to the GPT family, learns to perform a wide set of tasks during pre-training including optical character recognition (OCR), geo-localization, action recognition, and many others. We measure this by benchmarking the zero-shot transfer performance of CLIP on over 30 existing datasets and find it can be competitive with prior task-specific supervised models,” a paper about the model by 12 OpenAI coauthors reads.

Although testing found CLIP was proficient at a number of tasks, testing also found that CLIP falls short in specialization tasks like satellite imagery classification or lymph node tumor detection.

“This preliminary analysis is intended to illustrate some of the challenges that general purpose computer vision models pose and to give a glimpse into their biases and impacts. We hope that this work motivates future research on the characterization of the capabilities, shortcomings, and biases of such models, and we are excited to engage with the research community on such questions,” the paper reads.

OpenAI chief scientist Ilya Sutskever was a coauthor of the paper detailing CLIP and seems to have alluded to its coming release when he told deeplearning.ai recently that multimodal models would be a major machine learning trend in 2021. Google AI chief Jeff Dean made a similar prediction for 2020 in an interview with VentureBeat.

The release of DALL-E follows the release of a number of generative models with the power to mimic or distort reality or predict how people paint landscape and still life art. Some, like StyleGAN, have demonstrated a propensity for racial bias.

OpenAI researchers working on CLIP and DALL-E called for additional research into the potential societal impact of both systems. GPT-3 displayed significant anti-Muslim bias and negative sentiment scores for Black people, so the same shortcomings could be embedded in DALL-E. A bias test included in the CLIP paper found that the model was most likely to miscategorize people under 20 as criminals or non-human, that people classified as men were more likely to be labeled as criminals than people classified as women, and that some label data contained in the dataset is heavily gendered.

How OpenAI made DALL-E and additional details will be shared in an upcoming paper. Large language models that use data scraped from the internet have been criticized by researchers who say the AI industry needs to undergo a culture change.


Text Mining and Sentiment Analysis: Analysis with R

May 20, 2020   BI News and Info

The series so far:

  1. Text Mining and Sentiment Analysis: Introduction
  2. Text Mining and Sentiment Analysis: Power BI Visualizations
  3. Text Mining and Sentiment Analysis: Analysis with R

This is the third article of the “Text Mining and Sentiment Analysis” series. The first article introduced Azure Cognitive Services and demonstrated setting up and using the Text Analytics APIs to extract key phrases and sentiment scores from text data. The second article demonstrated Power BI visualizations for analyzing key phrases and sentiment scores and interpreting them to gain insights. This article explores R for text mining and sentiment analysis. I will demonstrate several common text analytics techniques and visualizations in R.

Note: This article assumes basic familiarity with R and RStudio. Please see the References section for more information on installing R and RStudio. The demo raw text file and R script are available for download from my GitHub repository; the link is also in the References section.

R is a language and environment for statistical computing and graphics. It provides a wide variety of statistical and graphical techniques and is highly extensible. R is available as free software. It is easy to learn and use and can produce well-designed, publication-quality plots. For the demos in this article, I am using R version 3.5.3 (2019-03-11) and RStudio version 1.1.456.

The input file for this article is a text file with only one column: the raw text of survey responses.

A sample of the first few rows is shown in Notepad++ (with all characters displayed) in Figure 1.


Figure 1. Sample of the input text file

The demo R script and demo input text file are available on my GitHub repo (please find the link in the References section).

R has a rich set of packages for Natural Language Processing (NLP) and generating plots. The foundational steps involve loading the text file into an R Corpus, then cleaning and stemming the data before performing analysis. I will demonstrate these steps, along with analyses such as word frequency, word cloud, word association, sentiment scores, and emotion classification, using various plots and charts.

Installing and loading R packages

The following packages are used in the examples in this article:

  • tm for text mining operations like removing numbers, special characters, punctuation and stop words (Stop words in any language are the most commonly occurring words that have very little value for NLP and should be filtered out. Examples of stop words in English are “the”, “is”, “are”.)
  • SnowballC for stemming, which is the process of reducing words to their base or root form. For example, a stemming algorithm would reduce the words “fishing”, “fished” and “fisher” to the stem “fish”.
  • wordcloud for generating the word cloud plot.
  • RColorBrewer for color palettes used in various plots
  • syuzhet for sentiment scores and emotion classification
  • ggplot2 for plotting graphs

Open RStudio and create a new R Script. Use the following code to install and load these packages.

# Install
install.packages("tm")           # for text mining
install.packages("SnowballC")    # for text stemming
install.packages("wordcloud")    # word-cloud generator
install.packages("RColorBrewer") # color palettes
install.packages("syuzhet")      # for sentiment analysis
install.packages("ggplot2")      # for plotting graphs
# Load
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
library("syuzhet")
library("ggplot2")

Reading file data into R

The R base function read.table() is generally used to read a file in table format and imports the data as a data frame. Several variants of this function are available for importing different file formats:

  • read.csv() is used for reading comma-separated value (csv) files, where a comma “,” is used as the field separator
  • read.delim() is used for reading tab-separated value (.txt) files (a brief sketch of both appears after this list)
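As a hedged illustration of these tabular readers (the file names below are hypothetical and are not part of this article's demo data):

# Hypothetical examples of the tabular readers; not used for this article's input file
df_csv <- read.csv("survey_results.csv", header = TRUE, stringsAsFactors = FALSE)
df_tab <- read.delim("survey_results.txt", header = TRUE, stringsAsFactors = FALSE)
str(df_csv)  # both functions return a data frame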

The input file has multiple lines of text and no columns/fields (the data is not tabular), so you will use the readLines function. This function takes a file (or URL) as input and returns a vector containing as many elements as the number of lines in the file. The readLines function simply extracts the text from its input source and returns each line as a character string. The n= argument is useful for reading a limited number (subset) of lines from the input source (its default value is -1, which reads all lines). When using a filename in this function’s argument, R assumes the file is in your current working directory (you can use the getwd() function in the R console to find your current working directory). You can also choose the input file interactively, using the file.choose() function within the argument. The next step is to load that vector as a Corpus. In R, a Corpus is a collection of text document(s) to apply text mining or NLP routines to. Details of using the readLines function are sourced from: https://www.stat.berkeley.edu/~spector/s133/Read.html .
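As a hedged example of the filename and n= variants (the file name below is hypothetical; this article's demo instead uses file.choose(), shown next):

getwd()                                                     # confirm the current working directory
# "team_health_survey.txt" is a hypothetical file name for illustration only
text_all   <- readLines("team_health_survey.txt")           # read every line (n = -1 by default)
text_first <- readLines("team_health_survey.txt", n = 10)   # read only the first 10 lines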

In your R script, add the following code to load the data into a corpus.

# Read the text file from the local machine; choose the file interactively

text <- readLines(file.choose())

# Load the data as a corpus

TextDoc <- Corpus(VectorSource(text))

Upon running this, you will be prompted to select the input file. Navigate to your file and click Open as shown in Figure 2.


Figure 2. Select input file

Cleaning up Text Data

Cleaning the text data starts with making transformations like removing special characters from the text. This is done using the tm_map() function to replace special characters like /, @ and | with a space. The next step is to remove the unnecessary whitespace and convert the text to lower case.

Then remove the stopwords. They are the most commonly occurring words in a language and have very little value in terms of gaining useful information. They should be removed before performing further analysis. Examples of stopwords in English are “the, is, at, on”. There is no single universal list of stop words used by all NLP tools. The stopwords() function used with tm_map() supports several languages, such as English, French, German, Italian, and Spanish. Please note the language names are case sensitive. I will also demonstrate how to add your own list of stopwords, which is useful in this team health example for removing non-default stop words like “team”, “company”, “health”. Next, remove numbers and punctuation.

The last step is text stemming. It is the process of reducing a word to its root form. The stemming process simplifies the word to its common origin. For example, the stemming process reduces the words “fishing”, “fished” and “fisher” to the stem “fish”. Please note stemming uses the SnowballC package. (You may want to skip the text stemming step if your users indicate a preference to see the original “unstemmed” words in the word cloud plot.)
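For a quick, hedged look at the stemmer on its own (the exact stems depend on SnowballC's Porter rules, so outputs are not promised here):

# Standalone stemming example using SnowballC; stems follow the Porter algorithm
library(SnowballC)
wordStem(c("fishing", "fished", "fisher"), language = "english")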

In your R script, add the following code and run it to transform and clean up the text data.

# Replacing "/", "@" and "|" with space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
TextDoc <- tm_map(TextDoc, toSpace, "/")
TextDoc <- tm_map(TextDoc, toSpace, "@")
TextDoc <- tm_map(TextDoc, toSpace, "\\|")
# Convert the text to lower case
TextDoc <- tm_map(TextDoc, content_transformer(tolower))
# Remove numbers
TextDoc <- tm_map(TextDoc, removeNumbers)
# Remove English common stopwords
TextDoc <- tm_map(TextDoc, removeWords, stopwords("english"))
# Remove your own stop words
# specify your custom stopwords as a character vector
TextDoc <- tm_map(TextDoc, removeWords, c("s", "company", "team"))
# Remove punctuation
TextDoc <- tm_map(TextDoc, removePunctuation)
# Eliminate extra white spaces
TextDoc <- tm_map(TextDoc, stripWhitespace)
# Text stemming - reduces words to their root form
TextDoc <- tm_map(TextDoc, stemDocument)

Building the term document matrix

After cleaning the text data, the next step is to count the occurrence of each word to identify popular or trending topics. Using the TermDocumentMatrix() function from the text mining (tm) package, you can build a term-document matrix: a table containing the frequency of each word.

In your R script, add the following code and run it to see the top 5 most frequently found words in your text.

# Build a term-document matrix

TextDoc_dtm <- TermDocumentMatrix(TextDoc)

dtm_m <- as.matrix(TextDoc_dtm)

# Sort by decreasing value of frequency

dtm_v <- sort(rowSums(dtm_m),decreasing=TRUE)

dtm_d <- data.frame(word = names(dtm_v),freq=dtm_v)

# Display the top 5 most frequent words

head(dtm_d, 5)

The following table of word frequency is the expected output of the head command on RStudio Console.

[Output: a table of the five most frequent words and their frequencies]

Plotting the top 5 most frequent words using a bar chart is a good basic way to visualize this word frequency data. In your R script, add the following code and run it to generate a bar chart, which will display in the Plots section of RStudio.

# Plot the most frequent words
barplot(dtm_d[1:5,]$freq, las = 2, names.arg = dtm_d[1:5,]$word,
        col = "lightgreen", main = "Top 5 most frequent words",
        ylab = "Word frequencies")

The plot can be seen in Figure 3.


Figure 3. Bar chart of the top 5 most frequent words

One could interpret the following from this bar chart:

  • The most frequently occurring word is “good”. Also notice that negative words like “not” don’t feature in the bar chart, which indicates there are no negative prefixes to change the context or meaning of the word “good” (in short, this indicates most responses don’t mention negative phrases like “not good”).
  • “work”, “health” and “feel” are the next three most frequently occurring words, which indicate that most people feel good about their work and their team’s health.
  • Finally, the root “improv” for words like “improve”, “improvement”, “improving”, etc. is also on the chart, and further analysis is needed to infer whether its context is positive or negative.

Generate the Word Cloud

A word cloud is one of the most popular ways to visualize and analyze qualitative data. It’s an image composed of keywords found within a body of text, where the size of each word indicates its frequency in that body of text. Use the word frequency data frame (table) created previously to generate the word cloud. In your R script, add the following code and run it to generate the word cloud and display it in the Plots section of RStudio.

# Generate the word cloud
set.seed(1234)
wordcloud(words = dtm_d$word, freq = dtm_d$freq, min.freq = 5,
          max.words = 100, random.order = FALSE, rot.per = 0.40,
          colors = brewer.pal(8, "Dark2"))

Below is a brief description of the arguments used in the wordcloud function:

  • words – words to be plotted
  • freq – frequencies of words
  • min.freq – words whose frequency is at or above this threshold value are plotted (in this case, I have set it to 5)
  • max.words – the maximum number of words to display on the plot (in the code above, I have set it to 100)
  • random.order – I have set it to FALSE, so the words are plotted in order of decreasing frequency
  • rot.per – the percentage of words that are displayed as vertical text (with 90-degree rotation). I have set it to 0.40 (40%); please feel free to adjust this setting to suit your preferences
  • colors – changes word colors going from lowest to highest frequencies

You can see the resulting word cloud in Figure 4.


Figure 4. Word cloud plot

The word cloud shows additional words that occur frequently and could be of interest for further analysis. Words like “need”, “support” and “issu” (the root for “issue(s)”), etc. could provide more context around the most frequently occurring words and help to gain a better understanding of the main themes.

Word Association

Correlation is a statistical technique that can demonstrate whether, and how strongly, pairs of variables are related. This technique can be used effectively to analyze which words occur most often in association with the most frequently occurring words in the survey responses, which helps to see the context around these words.

In your R script, add the following code and run it.

# Find associations

findAssocs(TextDoc_dtm, terms = c("good", "work", "health"), corlimit = 0.25)

You should see the results as shown in Figure 5.


Figure 5. Word association analysis for the top three most frequent terms

This script shows which words are most frequently associated with the top three terms (corlimit = 0.25 is the lower limit/threshold I have set; you can set it lower to see more words, or higher to see fewer). The output indicates that “integr” (the root of the word “integrity”) and “synergi” (the root of the words “synergy”, “synergies”, etc.) occur 28% of the time with the word “good”. You can interpret this as the context around the most frequently occurring word (“good”) being positive. Similarly, the root of the word “together” is highly correlated with the word “work”. This indicates that most responses say teams “work together” and can be interpreted in a positive context.

You can modify the above script to find terms associated with words that occur at least 50 times, instead of having to hard-code the terms in your script.

# Find associations for words that occur at least 50 times

findAssocs(TextDoc_dtm, terms = findFreqTerms(TextDoc_dtm, lowfreq = 50), corlimit = 0.25)


Figure 6: Word association output for terms occurring at least 50 times

Sentiment Scores

Sentiments can be classified as positive, neutral or negative. They can also be represented on a numeric scale, to better express the degree of positive or negative strength of the sentiment contained in a body of text.

This example uses the Syuzhet package for generating sentiment scores. The package has four sentiment dictionaries and offers a method for accessing the sentiment extraction tool developed in the NLP group at Stanford. The get_sentiment function accepts two arguments: a character vector (of sentences or words) and a method. The selected method determines which of the four available sentiment extraction methods will be used. The four methods are syuzhet (the default), bing, afinn and nrc. Each method uses a different scale and hence returns slightly different results. Please note that the output of the nrc method is more than just a numeric score; it requires additional interpretation and is out of scope for this article. The description of the get_sentiment function has been sourced from: https://cran.r-project.org/web/packages/syuzhet/vignettes/syuzhet-vignette.html?
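As a quick, hedged illustration of how the scales differ (the sentence below is invented, and the exact scores depend on the dictionary versions installed, so no outputs are shown):

# Hypothetical one-sentence comparison of the three numeric methods
sample_sentence <- "The team is doing great work and morale is good"
get_sentiment(sample_sentence, method = "syuzhet")  # decimal word scores, summed
get_sentiment(sample_sentence, method = "bing")     # -1/+1 word scores, summed
get_sentiment(sample_sentence, method = "afinn")    # integer word scores from -5 to +5, summed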

Add the following code to the R script and run it.

# regular sentiment score using get_sentiment() function and method of your choice

# please note that different methods may have different scales

syuzhet_vector <- get_sentiment(text, method = "syuzhet")

# see the first row of the vector

head(syuzhet_vector)

# see summary statistics of the vector

summary(syuzhet_vector)

Your results should look similar to Figure 7.


Figure 7. Syuzhet vector

An inspection of the Syuzhet vector shows the first element has the value of 2.60, meaning the sentiment scores of all meaningful words in the first response (line) of the text file add up to 2.60. With the syuzhet method, each word receives a decimal score ranging from -1 (most negative) to +1 (most positive), and the line score is the sum of these word scores, which is why it can fall outside that range, as it does here. Note that the summary statistics of the syuzhet vector show a median value of 1.6, which is above zero and can be interpreted to mean that the overall average sentiment across all the responses is positive.

Next, run the same analysis for the remaining two methods and inspect their respective vectors. Add the following code to the R script and run it.

# bing
bing_vector <- get_sentiment(text, method = "bing")
head(bing_vector)
summary(bing_vector)
# afinn
afinn_vector <- get_sentiment(text, method = "afinn")
head(afinn_vector)
summary(afinn_vector)

Your results should resemble Figure 8.


Figure 8. bing and afinn vectors

Please note the scale of sentiment scores generated by:

  • bing – binary scale with -1 indicating negative and +1 indicating positive sentiment
  • afinn – integer scale ranging from -5 to +5

The summary statistics of the bing and afinn vectors also show that the median sentiment score is above 0, which can be interpreted to mean that the overall average sentiment across all the responses is positive.

Because these different methods use different scales, it’s better to convert their output to a common scale before comparing them. This basic scale conversion can be done easily using R’s built-in sign function, which converts all positive numbers to 1, all negative numbers to -1, and leaves all zeros as 0.
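For instance, a quick sanity check of sign() on made-up values (not taken from the survey data):

# Hypothetical values, just to show the conversion
sign(c(2.6, -0.75, 0))   # returns  1 -1  0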

Add the following code to your R script and run it.

#compare the first row of each vector using sign function

rbind(

  sign(head(syuzhet_vector)),

  sign(head(bing_vector)),

  sign(head(afinn_vector))

)

Figure 9 shows the results.


Figure 9. Normalize scale and compare three vectors

Note the first element of each row (vector) is 1, indicating that all three methods have calculated a positive sentiment score, for the first response (line) in the text.

Emotion Classification

Emotion classification is built on the NRC Word-Emotion Association Lexicon (aka EmoLex). The definition of “NRC Emotion Lexicon”, sourced from http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm is “The NRC Emotion Lexicon is a list of English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive). The annotations were manually done by crowdsourcing.”

To understand this, explore the get_nrc_sentiment function, which returns a data frame with each row representing a sentence from the original file. The data frame has ten columns (one column for each of the eight emotions, one column for positive sentiment valence and one for negative sentiment valence). The data in the columns (anger, anticipation, disgust, fear, joy, sadness, surprise, trust, negative, positive) can be accessed individually or in sets. The definition of get_nrc_sentiment has been sourced from: https://cran.r-project.org/web/packages/syuzhet/vignettes/syuzhet-vignette.html?

Add the following lines to your R script and run them to generate and inspect the data frame returned by the get_nrc_sentiment function.

# run nrc sentiment analysis to return data frame with each row classified as one of the following

# emotions, rather than a score:

# anger, anticipation, disgust, fear, joy, sadness, surprise, trust

# It also counts the number of positive and negative emotions found in each row

d<-get_nrc_sentiment(text)

# head(d,10) – to see top 10 lines of the get_nrc_sentiment dataframe

head(d, 10)

The results should look like Figure 10.


Figure 10. Data frame returned by get_nrc_sentiment function

The output shows that the first line of text has:

  • Zero occurrences of words associated with the emotions of anger, disgust, fear, sadness and surprise
  • One occurrence each of words associated with the emotions of anticipation and joy
  • Two occurrences of words associated with the emotion of trust
  • A total of one occurrence of words associated with negative sentiment
  • A total of two occurrences of words associated with positive sentiment
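Since the columns can be accessed individually or in sets (as noted above), here is a small, hedged example; run it only after d has been created by the code above:

# Inspect individual emotion columns or sets of columns from the data frame d
head(d$trust)                          # one emotion column
head(d[, c("positive", "negative")])   # a set of columns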

The next step is to create two charts to help visually analyze the emotions in this survey text. First, perform some data transformation and clean-up steps before plotting the charts. The first plot shows the total number of instances of words in the text associated with each of the eight emotions. Add the following code to your R script and run it.

# Transpose
td <- data.frame(t(d))
# rowSums computes the sum of each row; after transposing, each row corresponds to one emotion/sentiment
td_new <- data.frame(rowSums(td[2:253]))
# Transformation and cleaning
names(td_new)[1] <- "count"
td_new <- cbind("sentiment" = rownames(td_new), td_new)
rownames(td_new) <- NULL
td_new2 <- td_new[1:8,]
# Plot One - count of words associated with each sentiment
quickplot(sentiment, data = td_new2, weight = count, geom = "bar", fill = sentiment, ylab = "count") + ggtitle("Survey sentiments")

You can see the bar plot in Figure 11.


Figure 11. Bar Plot showing the count of words in the text, associated with each emotion

This bar chart demonstrates that words associated with the positive emotion of “trust” occurred about five hundred times in the text, whereas words associated with the negative emotion of “disgust” occurred less than 25 times. A deeper understanding of the overall emotions occurring in the survey responses can be gained by comparing these numbers as a percentage of the total number of meaningful words. Add the following code to your R script and run it.

# Plot two - count of words associated with each sentiment, expressed as a percentage
barplot(
  sort(colSums(prop.table(d[, 1:8]))),
  horiz = TRUE,
  cex.names = 0.7,
  las = 1,
  main = "Emotions in Text", xlab = "Percentage"
)

The emotions bar plot can be seen in Figure 12.


Figure 12. Bar Plot showing the count of words associated with each sentiment expressed as a percentage

This bar plot allows for a quick and easy comparison of the proportion of words associated with each emotion in the text. The emotion “trust” has the longest bar and shows that words associated with this positive emotion constitute just over 35% of all the meaningful words in this text. On the other hand, the emotion of “disgust” has the shortest bar and shows that words associated with this negative emotion constitute less than 2% of all the meaningful words in this text. Overall, words associated with the positive emotions of “trust” and “joy” account for almost 60% of the meaningful words in the text, which can be interpreted as a good sign of team health.

Conclusion

This article demonstrated reading text data into R, along with data cleaning and transformations. It showed how to create a word frequency table and plot a word cloud to identify prominent themes occurring in the text. Word association analysis using correlation helped gain context around those prominent themes. It explored four methods for generating sentiment scores, which proved useful in assigning a numeric value to the strength (of positivity or negativity) of sentiment in the text and supported the interpretation that the average sentiment across the text trends positive. Lastly, it demonstrated how to implement emotion classification with the NRC lexicon and created two plots to analyze and interpret the emotions found in the text.

References:


IBM’s Watson Assistant for Citizens answers coronavirus questions by phone or text

April 2, 2020   Big Data

IBM today announced the launch of Watson Assistant for Citizens, a new chatbot solution available to government agencies, health care institutions, and academic organizations free of charge for 90 days. The hope is that by tapping AI technologies like natural language processing, it’ll triage residents looking for guidance on COVID-19, which has affected 204 countries to date.

Online, by text, or by phone, the Watson Assistant for Citizens virtual agent — which brings together IBM’s Watson Assistant and Watson Discovery services and AI capabilities from IBM Research — draws on the Centers for Disease Control and Prevention and local sources like links to school closings, news and documents on state websites, and more to answer natural language questions about the novel coronavirus. For instance, Watson Assistant for Citizens automates responses to commonly posed queries like “What are symptoms?,” “How do I clean my home properly?,” and “How do I protect myself?”

Watson Assistant for Citizens includes 15 pretrained intents (i.e., queries) and dialog flows out of the box, and it can integrate with backend enterprise resource planning systems to incorporate information related to specific cities or regions. For instance, state government agencies can choose to have the virtual agent address questions like “What are cases in my neighborhood?,” “How long are schools shut down?,” and “Where can I get tested?”


Above: A screenshot of IBM’s Watson Assistant for Citizens.

Image Credit: IBM

Watson Assistant for Citizens is available in English and Spanish, but it can be tailored to up to 13 different languages. IBM says it’s already being used by government and health care agencies across the U.S., as well as by organizations in the Czech Republic, Finland, Greece, Italy, Poland, Spain, U.K., and more.

Watson Assistant for Citizens’ debut comes after IBM made available a map on The Weather Channel to track the spread of COVID-19, mainly using data from governments as well as the World Health Organization. The company also built a dashboard on top of its Cognos Analytics suite that’s designed to help researchers, data scientists, and media analyze and filter coronavirus information down to the county level.


IBM last week announced it would coordinate an effort to make supercomputing capacity available to researchers in order to help identify treatments, viable mitigation strategies, and vaccines for COVID-19. It also launched a new Call for Code Global Challenge that will encourage developers to build open source technologies that address several areas, including crisis communication during an emergency, ways to improve remote learning, and how to inspire cooperative local communities.

IBM isn’t the only organization deploying chatbots to keep folks informed of COVID-19 developments, of course.

Building atop Microsoft’s Healthcare Bot service, the U.S. Centers for Disease Control and Prevention (CDC) released a COVID-19 assessment bot that can assess symptoms, provide information, and suggest next courses of action. Elsewhere, startup Quiq collaborated with the city of Knoxville, Tennessee to deploy a chatbot via its website and a mobile app, and Jefferson City, Missouri announced that it’s working on a bot that can answer questions online.

Overseas, the Indian government teamed up with Facebook’s WhatsApp to launch a COVID-19 informational chatbot called MyGov Corona Helpdesk. The U.K.’s National Health Service is also in talks with WhatsApp to set up a dedicated chatbot. And Pakistan collaborated with startup Botsify to create a bot that connects users with the Ministry of National Health Services, Regulations & Coordination in Islamabad.


MIT CSAIL’s TextFooler generates adversarial text to strengthen natural language models

February 9, 2020   Big Data

AI and machine learning algorithms are vulnerable to adversarial samples: inputs with subtle alterations from the originals that cause models to make incorrect predictions. That’s especially problematic as natural language models become capable of generating humanlike text, because of their attractiveness to malicious actors who would use them to produce misleading media. In pursuit of a technique that illustrates the extent to which adversarial text can affect model prediction, researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), the University of Hong Kong, and Singapore’s Agency for Science, Technology, and Research developed TextFooler, a baseline framework for synthesizing adversarial text examples. They claim in a paper that it was able to successfully attack three leading target models, including Google’s BERT.

“If those tools are vulnerable to purposeful adversarial attacking, then the consequences may be disastrous,” said Di Jin, MIT Ph.D. student and lead author on the paper, who noted that the adversarial examples produced by TextFooler could improve the robustness of AI models trained on them. “These tools need to have effective defense approaches to protect themselves, and in order to make such a safe defense system, we need to first examine the adversarial methods.”

The researchers assert that besides the ability to fool AI models, the outputs of a natural language “attacking” system like TextFooler should meet certain criteria: human prediction consistency, such that human predictions remain unchanged; semantic similarity, such that crafted examples bear the same meaning as the source; and language fluency, such that generated examples look natural and grammatical. TextFooler meets all three even when no model architecture or parameters (values that influence model performance) are available — i.e., black-box scenarios.

It achieves this by identifying the most important words for the target models and replacing them with semantically similar and grammatically correct words until the prediction is altered. TextFooler is applied to two different tasks — text classification and entailment (the relationship between text fragments in a sentence) — with the goal of changing the classification or invalidating the entailment judgment of the original models. For instance, given the input “The characters, cast in impossibly contrived situations, are totally estranged from reality,” TextFooler might output “The characters, cast in impossibly engineered circumstances, are fully estranged from reality.”

To evaluate TextFooler, the researchers applied it to text classification data sets with various properties, including news topic classification, fake news detection, and sentence- and document-level sentiment analysis, where the average text length ranged from tens of words to hundreds of words. For each data set, they trained the aforementioned state-of-the-art models on a training set before generating adversarial examples semantically similar to the test set to attack those models.

The team reports that on the adversarial examples, they managed to reduce the accuracy of almost all target models in all tasks to below 10% with fewer than 20% of the original words perturbed. Even for BERT, which attained relatively robust performance compared with the other models tested, TextFooler reduced its prediction accuracy by about 5 to 7 times on a classification task and about 9 to 22 times on an entailment task (where the goal was to judge whether a sentence could be derived from entailment, contradiction, or a neutral relationship).

“The system can be used or extended to attack any classification-based NLP models to test their robustness,” said Jin. “On the other hand, the generated adversaries can be used to improve the robustness and generalization of deep learning models via adversarial training, which is a critical direction of this work.”


Google open-sources LaserTagger, an AI model that speeds up text generation

February 1, 2020   Big Data

Sequence-to-sequence AI models, which were introduced by Google in 2014, aim to map fixed-length input (usually text) to a fixed-length output where the length of the input and output might differ. They’re used in text-generating tasks including summarization, grammatical error correction, and sentence fusion, and recent architectural breakthroughs have made them more capable than before. But they’re imperfect in that they (1) require large amounts of training data to reach acceptable levels of performance and (2) typically generate the output word by word, which makes them inherently slow.

That’s why researchers at Google developed LaserTagger, an open source text-editing model that predicts a sequence of edit operations to transform a source text into a target text. They assert that LaserTagger tackles text generation in a fashion that’s less error-prone — and that’s easier to train and faster to execute.

The release of LaserTagger follows on the heels of notable contributions from Google to the field of natural language processing and understanding. This week, the tech giant took the wraps off of Meena, a neural network with 2.6 billion parameters that can handle multiturn dialogue. And earlier this month, Google published a paper describing Reformer, a model that can process the entirety of novels.

LaserTagger takes advantage of the fact that for many text-generation tasks, there’s often an overlap between the input and the output. For instance, when detecting and fixing grammatical mistakes or when fusing several sentences, most of the input text can remain unchanged — only a small fraction of words needs to be modified. LaserTagger, then, produces a sequence of edit operations instead of actual words, like keep (which copies a word to the output), delete (which removes a word), and keep-addx or delete-addx (which add phrase X before the tagged word and optionally delete the tagged word).

Added phrases come from a restricted vocabulary that’s been optimized to minimize vocabulary size and maximize the number of training examples. The only words necessary to add to the target text come from the vocabulary alone, preventing the model from adding arbitrary words and mitigating the problem of hallucination (i.e., producing outputs that aren’t supported by the input text).  And LaserTagger can predict edit operations in parallel with high accuracy, enabling an end-to-end speedup compared with models that perform predictions sequentially.

Evaluated on several text generation tasks, LaserTagger performed “comparably strong” with, and up to 100 times faster than, a baseline model that used a large number of training examples. Even when trained using only a few hundred or a few thousand training examples, it produced “reasonable” results that could be manually edited or curated.

“The advantages of LaserTagger become even more pronounced when applied at large scale, such as improving the formulation of voice answers in some services by reducing the length of the responses and making them less repetitive,” wrote the team. “The high inference speed allows the model to be plugged into an existing technology stack, without adding any noticeable latency on the user side, while the improved data efficiency enables the collection of training data for many languages, thus benefiting users from different language backgrounds.”


Google Brain’s AI achieves state-of-the-art text summarization performance

December 24, 2019   Big Data

Summarizing text is a task at which machine learning algorithms are improving, as evidenced by a recent paper published by Microsoft. That’s good news — automatic summarization systems promise to cut down on the amount of message-reading enterprise workers do, which one survey estimates amounts to 2.6 hours each day.

Not to be outdone, a Google Brain and Imperial College London team built a system — Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence, or Pegasus — that leverages Google’s Transformers architecture combined with pretraining objectives tailored for abstractive text generation. They say it achieves state-of-the-art results in 12 summarization tasks spanning news, science, stories, instructions, emails, patents, and legislative bills, and that it shows “surprising” performance on low-resource summarization, surpassing previous top results on six data sets with only 1,000 examples.

As the researchers point out, text summarization aims to generate accurate and concise summaries from input documents. In contrast to extractive techniques, which merely copy fragments from the input, abstractive summarization might produce novel words or cover principal information such that the output remains linguistically fluent.

Transformers are a type of neural architecture introduced in a paper by researchers at Google Brain, Google’s AI research division. As do all deep neural networks, they contain functions (neurons) arranged in interconnected layers that transmit signals from input data and slowly adjust the synaptic strength (weights) of each connection — that’s how all AI models extract features and learn to make predictions. But Transformers uniquely have attention. Every output element is connected to every input element, and the weightings between them are calculated dynamically.

The team devised a training task in which whole, and putatively important, sentences within documents were masked. The AI had to fill in the gaps by drawing on web and news articles, including those contained within a new corpus (HugeNews) the researchers compiled.

In experiments, the team selected their best-performing Pegasus model — one with 568 million parameters, or variables learned from historical data — trained on either 750GB of text extracted from 350 million web pages (Common Crawl) or on HugeNews, which spans 1.5 billion articles totaling 3.8TB collected from news and news-like websites. (The researchers say that in the case of HugeNews, a whitelist of domains ranging from high-quality news publishers to lower-quality sites was used to seed a web-crawling tool.)

Pegasus achieved high linguistic quality in terms of fluency and coherence, according to the researchers, and it didn’t require countermeasures to mitigate disfluencies. Moreover, in a low-resource setting with just 100 example articles, it generated summaries at a quality comparable to a model that had been trained on a full data set ranging from 20,000 to 200,000 articles.


Google launches AutoML Natural Language with improved text classification and model training

December 13, 2019   Big Data

Earlier this year, Google took the wraps off of AutoML Natural Language, an extension of its Cloud AutoML machine learning platform to the natural language processing domain. After a months-long beta, AutoML Natural Language today launched in general availability for customers globally, with support for tasks like classification, sentiment analysis, and entity extraction, as well as a range of file formats, including native and scanned PDFs.

By way of refresher, AutoML Natural Language taps machine learning to reveal the structure and meaning of text from emails, chat logs, social media posts, and more. It can extract information about people, places, and events both from uploaded and pasted text or Google Cloud Storage documents, and it allows users to train their own custom AI models to classify, detect, and analyze things like sentiment, entities, content, and syntax. It furthermore offers custom entity extraction, which enables the identification of domain-specific entities within documents that don’t appear in standard language models.

AutoML Natural Language has over 5,000 classification labels and allows training on up to 1 million documents up to 10MB in size, which Google says makes it an excellent fit for “complex” use cases like comprehending legal files or document segmentation for organizations with large content taxonomies. It has been improved in the months since its reveal, specifically in the areas of text and document entity extraction — Google says that AutoML Natural Language now considers additional context (such as the spatial structure and layout information of a document) for model training and prediction to improve the recognition of text in invoices, receipts, resumes, and contracts.

Additionally, Google says that AutoML Natural Language is now FedRAMP-authorized at the Moderate level, meaning it has been vetted according to U.S. government specifications for data where the impact of loss is limited or serious. It says that this — along with newly introduced functionality that lets customers create a data set, train a model, and make predictions while keeping the data and related machine learning processing within a single server region — makes it easier for federal agencies to take advantage.

Already, Hearst is using AutoML Natural Language to help organize content across its domestic and international magazines, and Japanese publisher Nikkei Group is leveraging AutoML Translate to publish articles in different languages. Chicory, a third early adopter, tapped it to develop custom digital shopping and marketing solutions for grocery retailers like Kroger, Amazon, and Instacart.

The ultimate goal is to provide organizations, researchers, and businesses who require custom machine learning models a simple, no-frills way to train them, explained product manager for natural language Lewis Liu in a blog post. “Natural language processing is a valuable tool used to reveal the structure and meaning of text,” he said. “We’re continuously improving the quality of our models in partnership with Google AI research through better fine-tuning techniques, and larger model search spaces. We’re also introducing more advanced features to help AutoML Natural Language understand documents better.”

Notably, the launch of AutoML follows on the heels of AWS Textract, Amazon’s machine learning service for text and data extraction, which debuted in May. Microsoft offers a comparable service in Azure Text Analytics.


Facebook trains AI to generate worlds in a fantasy text adventure

November 24, 2019   Big Data

Procedurally generating an interesting video game environment isn’t just challenging — it’s incredibly time-consuming. Tools like Promethean AI, which tap machine learning to generate scenes, promise to ease the design burden somewhat. But barriers remain.

That’s why researchers at Facebook, the University of Lorraine, and University College London investigated an AI approach to creating game worlds in a preprint research paper. Using content from LIGHT, a fantasy text-based multiplayer adventure, they designed models that could compositionally arrange locations and characters and generate new content on the fly.

“We show how [machine learning] algorithms can learn to assemble … different elements, arranging locations and populating them with characters and objects,” wrote the study’s coauthors. “[Furthermore, we] demonstrate that these … tools can aid humans interactively in designing new game environments.”

By way of refresher, LIGHT — which was proposed in a March paper published by the same team of scientists — is a research environment in the form of a text-based game within which AI and humans interact as player characters. All told, it comprises crowdsourced natural language descriptions of 663 locations based on a set of regions and biomes, along with 3,462 objects and 1,755 characters.


In this latest study, the team built a model to generate game worlds, which entail crafting location names and descriptions including background information. They trained it using example neighboring locations partitioned into test and validation sets, such that the locations were distinct in each set. Two ranking models were considered — one where models had access to the location name only and a second where they had access to the location description information — and architected so that when a new world was constructed at test time, the placed location was the highest scoring candidate of several.

To create a map for a new game, the models predicted the neighboring locations of each existing location, and for each location added, they filled in the surroundings. A location could connect to up to four neighboring locations (though not all connections needed to be filled), and locations couldn’t appear multiple times in one map.

A separate set of models produced objects, or items with which characters could interact. (Each object has a name, description, and a set of affordances that represent object properties, such as “gettable” and “drinkable.”) Using characters and objects associated with locations from LIGHT, the researchers created data sets to train algorithms that placed both objects and characters in locations, as well as objects within objects (e.g., coins inside a wallet).

Yet another family of models that had been fed the corpora from the world construction task created new game elements — either a location, character, or object — by leveraging a Transformer architecture pretrained on 2 billion Reddit comments, which were chosen because of their “closeness to natural human conversation” and because they exhibit “elements of creativity and storytelling.” It predicted a background and description given a location name; a persona and description given an object name; or a description and affordances given an object name.


So how did it all work in concert? First, an empty map grid was initialized to represent the number of possible locations, with a portion of grid positions marked inaccessible to make exploration more interesting. The central location was populated randomly, and the best-performing model iteratively filled in neighboring locations until the entire grid was populated. Then, for each placed location, a model predicted which characters and objects should populate that location before another model predicted if objects should be placed inside existing objects.

The researchers also propose a human-aided design paradigm, where the models could provide suggestions for which elements to place. If human designers enter names of game elements not present in the data set, the generative models would write descriptions, personas, and affordances.

In experiments, the team used their framework to generate 5,000 worlds with a maximum size of 50 arranged locations. Around 65% and 60% of characters and objects in the data set, respectively, were generated after the full 5,000 maps. The most commonly placed location was “the king’s quarters” (in 34% of the generated worlds), while the least commonly placed location was “brim canal,” and 80% of the worlds had more than 30 locations.

Despite the fact that the generative models didn’t tap the full range of entities available to them, the researchers say that the maps they produced were generally cohesive, interesting, and diverse. “These steps show a path to creating cohesive game worlds from crowd-sourced content, both with model-assisted human creation tooling and fully automated generation,” they wrote.


IBM’s Lambada AI generates training data for text classifiers

November 15, 2019   Big Data

What’s a data scientist to do if they lack sufficient data to train a machine learning model? One potential avenue is synthetic data generation, which researchers at IBM Research advocate in a newly published preprint paper. They used a pretrained machine learning model to artificially synthesize new labeled data for text classification tasks. They claim that their method, which they refer to as language-model-based data augmentation (Lambada for short), improves classifiers’ performance on a variety of data sets and significantly improves upon state-of-the-art techniques for data augmentation.

“Depending upon the problem at hand, getting a good fit for a classifier model may require abundant labeled data. However, in many cases, and especially when developing AI systems for specific applications, labeled data is scarce and costly to obtain,” wrote the paper’s coauthors.

Generating synthetic training data tends to be more challenging in the text domain than the visual domain, the researchers note, because the transformations used in simpler methods usually distort the text, making it grammatically and semantically incorrect. That’s why most text data augmentation techniques — including those detailed in the paper — involve replacing a single word with a synonym, deleting a word, or changing the word order.

Lambada leverages a generative model (OpenAI’s GPT) that’s pretrained on large bodies of text, enabling it to capture the structure of language such that it produces coherent sentences. The researchers fine-tuned their model on an existing, small data set, and used the fine-tuned model to synthesize new labeled sentences. Independently, they trained a classifier on the aforementioned data set and had it filter the synthesized corpus, retaining only data that appeared to be “qualitative enough” before re-training the classifier on both the existing and synthesized data.

To validate their approach, the researchers tested three different classifiers — BERT, a support vector machine, and a long short-term memory network — on three data sets by running experiments in which they varied the training samples per class. The corpora in question contained queries on flight-related information, open-domain and fact-based questions in several categories, and data from telco customer support systems.

They report that Lambada statistically improved all three classifiers’ performance on small data sets, which they attribute in part to its controls over the number of samples per class. Said controls allowed them to invest more time in generating samples for classes that are under-represented in the original data set, they said.

“Our augmentation framework does not require additional unlabeled data … Surprisingly, for most classifiers, LAMBADA achieves better accuracy compared to a simple weak labeling approach,” wrote the coauthors. “Clearly, the generated data set contributes more to improving the accuracy of the classifier than … samples taken from the original data set.”
