{"id":2526,"date":"2020-07-09T14:08:55","date_gmt":"2020-07-09T05:08:55","guid":{"rendered":"https:\/\/cinnamon.ai\/new-cinnamon\/?post_type=ideas&#038;p=565"},"modified":"2023-11-02T07:51:37","modified_gmt":"2023-11-01T22:51:37","slug":"bootcamp-tech-blog-1-overview-vqa-problem","status":"publish","type":"ideas","link":"https:\/\/cinnamon.ai\/en\/ideas\/bootcamp-tech-blog-1-overview-vqa-problem\/","title":{"rendered":"Bootcamp Tech Blog #1: Overview of the VQA problem"},"content":{"rendered":"<div class=\"wp-block-image\">\n<figure class=\"aligncenter\"><img decoding=\"async\" src=\"https:\/\/cinnamon.ai\/new-cinnamon\/wp-uploads\/sites\/5\/2020\/06\/8d3c49141b71b88092feee74657fadf0.png\" alt=\"\" class=\"wp-image-421\"\/><\/figure>\n<\/div>\n\n\n<p class=\"wp-block-paragraph\">Hello. I am in charge of public relations at Cinnamon AI.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Today, I would like to share a Japanese translation of a blog post about the &quot;Boot Camp&quot; internship that is regularly held at Cinnamon AI&#039;s Vietnam base.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The blog is run by Cinnamon AI&#039;s Vietnam team, and a link to it is also posted at the end of this article. Please take a look.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Bootcamp Tech Blog #1: Overview of the VQA problem<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"magicdomid246\"><strong>Prologue<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Remember the boom in deep learning applications since the ImageNet contest?<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Deep learning is now a fundamental approach for most machine learning tasks, including computer vision, natural language processing, and speech recognition. Until now, many AI systems could handle only one content type, such as images or text.
However, to approximate human behavior, we need an engine that combines these content types to handle problems that span more than one of them. Tasks that involve both visual and textual content include text searches in images, image captioning, and visual question answering (VQA). In this blog, I would like to give an overview of the VQA problem, its challenges, and efforts toward practical application.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter\"><img decoding=\"async\" src=\"https:\/\/cinnamon.ai\/new-cinnamon\/wp-uploads\/sites\/5\/2020\/06\/1_fuc0Y8fY_YBfKd9sLrN2Yg.jpeg\" alt=\"\" class=\"wp-image-419\"\/><\/figure>\n<\/div>\n\n\n<h3 class=\"wp-block-heading\" id=\"magicdomid9\"><strong>The VQA problem<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">VQA stands for &quot;Visual Question Answering&quot;.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><img decoding=\"async\" src=\"https:\/\/cinnamon.ai\/new-cinnamon\/wp-uploads\/sites\/5\/2020\/06\/VN002.png\" alt=\"\"> Figure 2: Illustration of a VQA System<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We define VQA here as the task of finding answers to questions about a given image or video (visual content). Specifically, a VQA system takes visual content and related text-based questions as input and outputs text-based answers (Figure 2).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><img decoding=\"async\" src=\"https:\/\/cinnamon.ai\/new-cinnamon\/wp-uploads\/sites\/5\/2020\/06\/VN003.png\" alt=\"\"> Figure 3: Some examples of a VQA system&#039;s input<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">With earlier technology, developing a VQA system that could answer arbitrary questions was considered difficult. However, this ability is now considered the core value of VQA systems. The questions are arbitrary and cover many sub-problems in computer vision. For example, consider the following questions about Figure 4:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Q: Object recognition.
What kind of food is there in the center?<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Q: Object detection. Is there any meat?<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Q: Attribute classification. What color is the avocado?<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Q: Counting. How many types of food are there in total?<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Q:\u2026<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><img decoding=\"async\" src=\"https:\/\/cinnamon.ai\/new-cinnamon\/wp-uploads\/sites\/5\/2020\/06\/VN004.jpeg\" alt=\"\"> Figure 4: Arbitrary questions can be asked and some are related to a sub-problem in computer vision.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Additionally, more complex questions, such as those about spatial relationships between objects, events, actions, and common-sense reasoning, require more advanced text comprehension.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><img decoding=\"async\" src=\"https:\/\/cinnamon.ai\/new-cinnamon\/wp-uploads\/sites\/5\/2020\/06\/VN005.png\" alt=\"\"> Figure 5: Examples of some complex questions.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"magicdomid266\">Algorithm<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">In recent years, various algorithms have been proposed. Their common structure consists of three main parts: visual feature extraction, textual feature extraction, and a component that integrates these two features to generate an answer. Answer generation is usually framed as a classification problem, where each unique answer is treated as a separate category.
The main difference between the methods is how they combine visual and textual features.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><img decoding=\"async\" src=\"https:\/\/cinnamon.ai\/new-cinnamon\/wp-uploads\/sites\/5\/2020\/06\/VN006.png\" alt=\"\"> Figure 6: The flow of a VQA system.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"magicdomid280\"><strong>Challenges<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Since 2014, research on VQA systems has run into numerous issues. The main ones are listed below.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>\u2460 Expertise<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">First of all, many challenges come from the prerequisite knowledge needed for system development. After all, &quot;vision&quot; has traditionally been the realm of computer vision, and &quot;question answering&quot; is a problem of natural language understanding. That&#039;s why I think it&#039;s a good challenge.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>\u2461 Lack of semantic consistency between image and text<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">A VQA system works with two different data streams (textual data and visual data), which must be used and combined correctly to ensure robust performance. Therefore, to learn cross-modal representations, current state-of-the-art techniques on the VQA-v2 dataset pre-train large-scale models on a large number of visual-text pairs.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>\u2462 Limited answers \u2013 not as free as human thinking<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Most VQA algorithms view the process of generating an answer as a classification problem. An answer dictionary typically contains a pool of K possible answers, and the algorithm calculates the probability of each answer for a given question.
The generated answers can be made more diverse as K increases, but this requires a larger model and a larger training dataset.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>\u2463 Ability to answer complex questions<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Machines still have a long way to go to catch up with human cognitive abilities. Complex questions such as &quot;Why?&quot; and questions that require advanced knowledge (e.g. Q. Who is the person in the photo? - A. Donald Trump) are typical examples of high difficulty. Figure 8 shows an example.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><img decoding=\"async\" src=\"https:\/\/cinnamon.ai\/new-cinnamon\/wp-uploads\/sites\/5\/2020\/06\/VN008.png\" alt=\"\"> Figure 8: An example of a hard question: identifying the position of the \u201cglobal optimum for non-convex function\u201d requires a (potentially) very vast knowledge base, one that humans may not have reached yet!<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"magicdomid307\"><strong>Applications<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The appeal of VQA lies in its relevance to our daily lives. Asking and answering questions is an important part of life, and that will always be the case. To answer a question, a VQA system must understand visual and textual information, combine the two data streams, and use advanced knowledge appropriately. In some respects, this decision-making process is similar to ours.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">There is a range of potential applications that integrate VQA systems. Currently, the leading example is a free application provided by Microsoft that is in practical use to help visually impaired people.
(Seeing AI 2016 Prototype \u2013 Microsoft Research Project) Many applications that convert between visual and textual information have been released and have improved the lives of many people.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Another application for VQA systems is human-computer interaction. Specifically, such a system lets users obtain information about visual content by asking questions. For example, kids can ask the system various questions to learn the names of real objects while looking at them, or ask the camera questions about the weather outside when they are indoors.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">That concludes this overview of the VQA problem. You can try VQA&#039;s online demo here. In the next article, we will review other approaches we have studied and our suggestions for improving VQA systems. Please stay tuned.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>References<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/arxiv.org\/pdf\/1610.01465.pdf\">Visual Question Answering: Datasets, Algorithms, and Future Challenges<\/a><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Summary<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">In this article, we have presented a Japanese translation of the tech blog post that introduces the contents of the internship conducted at Cinnamon AI&#039;s Vietnam base.
We hope it helps you learn more about the overseas talent and AI research that Cinnamon AI is focusing on.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The English version of this article is available <a href=\"https:\/\/medium.com\/@cinnamonai\/overview-of-the-vqa-problem-f96ba63f6fdf\">here<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"http:\/\/www.deepl.com\/Translator%EF%BC%88%E7%84%A1%E6%96%99%E7%89%88%EF%BC%89%E3%81%A7%E7%BF%BB%E8%A8%B3%E3%81%97%E3%81%BE%E3%81%97%E3%81%9F%E3%80%82\" target=\"_blank\" rel=\"noreferrer noopener\">Translated with www.DeepL.com\/Translator (free version).<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For inquiries regarding this article or product consultations, please contact us <a href=\"https:\/\/cinnamon.is\/\">here<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Cinnamon AI also regularly <a href=\"https:\/\/cinnamon.is\/seminar\/\">holds seminars<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Hello. This is the PR representative for Cinnamon AI. Today, I would like to introduce the Japanese translation of a blog that introduces the content of the internship program &quot;Boot Camp,&quot; which is held regularly at Cinnamon AI&#039;s Vietnam base.
This blog [...]<\/p>","protected":false},"featured_media":0,"template":"","ideas-cat":[59],"class_list":["post-2526","ideas","type-ideas","status-publish","hentry","ideas-cat-tech"],"acf":[],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/cinnamon.ai\/en\/wp-json\/wp\/v2\/ideas\/2526","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cinnamon.ai\/en\/wp-json\/wp\/v2\/ideas"}],"about":[{"href":"https:\/\/cinnamon.ai\/en\/wp-json\/wp\/v2\/types\/ideas"}],"wp:attachment":[{"href":"https:\/\/cinnamon.ai\/en\/wp-json\/wp\/v2\/media?parent=2526"}],"wp:term":[{"taxonomy":"ideas-cat","embeddable":true,"href":"https:\/\/cinnamon.ai\/en\/wp-json\/wp\/v2\/ideas-cat?post=2526"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}