Microsoft explains how it improved automatic image captioning in Azure Cognitive Services

October 14, 2020 | Big Data

Microsoft today launched a new computer vision service it claims can generate image captions that are, in some cases, more accurate than human-written descriptions. The company calls the service, which is available as part of Azure Cognitive Services Computer Vision, a “significant research breakthrough” and an example of its commitment to accessible AI.

Automatic image captioning has a number of broad use cases, first and foremost assisting users with disabilities. According to the World Health Organization, the number of people of all ages who are visually impaired is estimated to be 285 million, of whom 39 million are blind.

Accuracy becomes all the more critical when vision-impaired users rely on captioning for daily tasks. According to a study by researchers at Indiana University, the University of Washington, and Microsoft, blind people tend to place a great deal of trust in automatically generated captions, constructing unsupported narratives to reconcile differences between image context and incongruent captions. When asked to identify potentially incorrect captions of images on Twitter, even blind users who described themselves as skilled and consistent about double-checking tended to trust the automatic captions, the researchers found, regardless of whether those captions made sense.

In early 2017, Microsoft updated Office 365 apps like Word and PowerPoint with automatic image captioning, drawing on Cognitive Services Computer Vision. (Cognitive Services is a cloud-based suite of APIs and SDKs available to developers building AI and machine learning capabilities into their apps and services.) More recently, the company launched Seeing AI, a mobile app designed to help low-vision and vision-impaired users navigate the world around them.

But while Office 365 and Seeing AI could automatically caption images better than some AI baselines, Microsoft engineers pursued new techniques to improve them further.

The engineers describe their technique in a September paper published on arXiv.org, a preprint server. Called visual vocabulary pretraining, or VIVO for short, it leverages large quantities of photos without annotations to learn a vocabulary for image captioning. (Typically, training automatic captioning models requires corpora that contain annotations provided by human labelers.) The vocabulary comprises an embedding space where features of image regions and tags of semantically similar objects are mapped into vectors that are close to each other (e.g., “person” and “man,” “accordion” and “instrument”). Once the visual vocabulary is established, an automatic image captioning model can be fine-tuned using a data set of images and corresponding captions.
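
To illustrate what such a shared embedding space buys, here is a toy sketch (not Microsoft's model): tags and image-region features live in one vector space, and a region is matched to tags by cosine similarity, so semantically close tags such as "person" and "man" score similarly against the same region. All vectors below are invented for the example.

```python
import numpy as np

# Toy illustration of a joint visual vocabulary (invented vectors, not VIVO's
# learned embeddings): region features and tag embeddings share one space and
# are compared by cosine similarity.

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical tag embeddings; semantically similar tags sit close together.
tag_embeddings = {
    "person":     np.array([0.92, 0.10, 0.05]),
    "man":        np.array([0.88, 0.15, 0.08]),
    "accordion":  np.array([0.05, 0.95, 0.20]),
    "instrument": np.array([0.10, 0.90, 0.25]),
}

# A hypothetical feature vector extracted from one image region.
region_feature = np.array([0.90, 0.12, 0.06])

# Rank tags by similarity to the region; "person" and "man" score highest.
for tag, emb in sorted(tag_embeddings.items(),
                       key=lambda kv: cosine(region_feature, kv[1]),
                       reverse=True):
    print(f"{tag:12s} similarity = {cosine(region_feature, emb):.3f}")
```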

Above: Image captioning results on nocaps. B: a baseline without VIVO pretraining; V: with VIVO pretraining. Red text represents novel objects; bounding box colors are brighter where the similarity is higher. (Image credit: Microsoft)

During the model training process, one or more tags are randomly masked and the model is asked to predict the masked tags conditioned on the image region features and the other tags. Even though the dataset used for fine-tuning only covers a small subset of the most common objects in the visual vocabulary, the VIVO-pretrained model can generalize to any images that depict similar scenes (e.g., people sitting on a couch together). In fact, it’s one of the few caption-generating pretraining methods that doesn’t rely on caption annotations, enabling it to work with existing image data sets developed for image tagging and object detection tasks.
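
To make the masked-tag objective concrete, the following is a minimal, illustrative PyTorch sketch. It is not the paper's architecture: the dimensions, the two-layer transformer encoder, and the random data are placeholder assumptions; only the objective itself (recovering a masked tag from region features and the remaining tags) mirrors the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative masked-tag objective with toy dimensions and random data.
VOCAB, DIM, N_REGIONS, N_TAGS, BATCH = 1000, 64, 10, 4, 2

tag_embed = nn.Embedding(VOCAB, DIM)             # tag vocabulary embeddings
region_proj = nn.Linear(2048, DIM)               # project detector features to DIM
mask_vec = nn.Parameter(torch.zeros(1, 1, DIM))  # learned [MASK] embedding
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True),
    num_layers=2)
to_vocab = nn.Linear(DIM, VOCAB)                 # classify the masked position

# Fake batch: detector features for 10 regions plus 4 tag ids per image.
regions = torch.randn(BATCH, N_REGIONS, 2048)
tags = torch.randint(0, VOCAB, (BATCH, N_TAGS))

# Randomly choose one tag position per image and swap in the [MASK] embedding.
mask_pos = torch.randint(0, N_TAGS, (BATCH,))
is_masked = F.one_hot(mask_pos, N_TAGS).bool().unsqueeze(-1)
tag_vecs = torch.where(is_masked, mask_vec.expand(BATCH, N_TAGS, DIM),
                       tag_embed(tags))

# Encode regions and tags jointly, then predict the tag id at the masked slot.
hidden = encoder(torch.cat([region_proj(regions), tag_vecs], dim=1))
logits = to_vocab(hidden[torch.arange(BATCH), N_REGIONS + mask_pos])
loss = F.cross_entropy(logits, tags[torch.arange(BATCH), mask_pos])
loss.backward()  # a real training loop would follow with an optimizer step
print("masked-tag prediction loss:", float(loss))
```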

Microsoft benchmarked the VIVO-pretrained model on nocaps, a test designed to encourage the development of image captioning models that can learn visual concepts from alternative sources of data. Evaluated on tens of thousands of human-generated captions describing thousands of images, the model achieved state-of-the-art results with substantial improvement for objects it hadn’t seen before. Moreover, on a metric called consensus-based image description evaluation (CIDEr), which aims to measure the similarity of a generated caption against ground truth sentences written by humans, the model surpassed human performance by a statistically significant margin.
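
For context on how a consensus-based score of this kind works, below is a simplified, single-image sketch: the candidate caption and each reference caption are turned into TF-IDF weighted n-gram vectors and compared by cosine similarity. The official CIDEr metric averages n-gram lengths 1 through 4 across a whole corpus and adds a length penalty; this toy version uses unigrams and bigrams only, with the references standing in for the corpus.

```python
import math
from collections import Counter

# Simplified CIDEr-style score: TF-IDF weighted n-gram cosine similarity between
# a candidate caption and reference captions (toy version, unigrams + bigrams).

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf_vector(tokens, doc_freq, n_docs, n):
    counts = Counter(ngrams(tokens, n))
    total = sum(counts.values()) or 1
    # Down-weight n-grams that appear in many reference captions.
    return {g: (c / total) * math.log(n_docs / max(1.0, doc_freq.get(g, 0)))
            for g, c in counts.items()}

def cosine(u, v):
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cider_like(candidate, references, max_n=2):
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    score = 0.0
    for n in range(1, max_n + 1):
        # Document frequency of each n-gram across the reference "corpus".
        df = Counter(g for r in refs for g in set(ngrams(r, n)))
        cand_vec = tfidf_vector(cand, df, len(refs), n)
        score += sum(cosine(cand_vec, tfidf_vector(r, df, len(refs), n))
                     for r in refs) / len(refs)
    return score / max_n

refs = ["a man plays an accordion on the street",
        "a person playing an accordion outdoors",
        "a street musician plays an accordion"]
print(cider_like("a man playing an accordion on the street", refs))  # higher
print(cider_like("a dog sleeping on a couch", refs))                 # lower
```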

In addition to the latest version of the Cognitive Services Computer Vision API, Microsoft says the model is now included in Seeing AI. It will roll out to Microsoft products and services, including Word and Outlook for Windows and Mac, and PowerPoint for Windows, Mac, and the web, later this year, replacing an image captioning model that has been used since 2015.
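
For developers, a caption can be requested from the Computer Vision service over REST. The snippet below is a minimal sketch assuming the v3.2 "describe" endpoint and its standard JSON response; the endpoint URL, subscription key, and image URL are placeholders to replace with your own values, and the exact API version may differ for your resource.

```python
import requests

# Minimal sketch: ask Azure Computer Vision to describe (caption) an image.
# ENDPOINT, KEY, and IMAGE_URL are placeholders; the API version assumed here
# is v3.2 and may differ for your deployment.
ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"
KEY = "<your-subscription-key>"
IMAGE_URL = "https://example.com/photo.jpg"

response = requests.post(
    f"{ENDPOINT}/vision/v3.2/describe",
    params={"maxCandidates": 1, "language": "en"},
    headers={"Ocp-Apim-Subscription-Key": KEY,
             "Content-Type": "application/json"},
    json={"url": IMAGE_URL},
)
response.raise_for_status()

# The service returns candidate captions with confidence scores.
for caption in response.json()["description"]["captions"]:
    print(f'{caption["text"]} (confidence {caption["confidence"]:.2f})')
```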

“Given the benefit of this, we’ve worked to accelerate the integration of this research breakthrough and get it into production and Azure AI,” Eric Boyd, corporate vice president of AI platform at Microsoft, told VentureBeat via phone earlier this week. “It’s one thing to have a breakthrough of something that works in a delicate setup in the lab. But to have something that [in a few months] we can have pressure-tested and operating at scale and part of Azure … showcases how we’re able to go from the research breakthrough to getting things out into production.”

Big Data – VentureBeat
