An article published this week on the Google Research Blog shows that one of the company's internal teams is trying to make artificial intelligence (AI) work like the human brain: actively focusing on one sound source while filtering out the others, just as you do when talking to friends at a party.
Google's approach uses an audio-visual model that lets it focus on specific sounds in a video. The company also released several YouTube videos demonstrating the technology in action.
Google said the technology works on videos with a single audio track. It can algorithmically separate the speech of the different people in a video, or let the user manually select a face in the video and listen specifically to that person's voice.
Google said the visual element is key: the technology watches a person's lip movements to better judge which part of the sound it should focus on at a given moment, which lets it produce more accurate isolated audio tracks over longer videos.
Google researchers built the model by collecting 100,000 "speech videos" from YouTube, from which they extracted roughly 2,000 hours of content. They then mixed the audio tracks together and added artificial background noise.
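The mixing step can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the function name `mix_tracks`, the `noise_gain` value, and the random numbers standing in for real speech and noise clips are hypothetical, not details from Google's post.

```python
import random

def mix_tracks(speech_a, speech_b, noise, noise_gain=0.3):
    """Sum two clean speech signals and scaled background noise, sample by sample."""
    n = min(len(speech_a), len(speech_b), len(noise))
    return [speech_a[i] + speech_b[i] + noise_gain * noise[i] for i in range(n)]

rng = random.Random(0)
# Hypothetical stand-ins for two clean speech clips and one noise clip
# (one second each at an assumed 16 kHz sample rate).
speech_a = [rng.uniform(-1, 1) for _ in range(16000)]
speech_b = [rng.uniform(-1, 1) for _ in range(16000)]
noise = [rng.uniform(-1, 1) for _ in range(16000)]

# The mixture is what a model would hear; the clean clips are the targets
# it would be asked to recover.
mixture = mix_tracks(speech_a, speech_b, noise)
```

Because the clean tracks are known before they are mixed, a synthetic "cocktail party" of this kind comes with its own ground truth for training.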
Google then trained the model to separate the mixed audio by looking at the faces in each frame of the video together with the spectrogram of the video's audio track. The system can determine which sound source belongs to which face at a given moment and create a separate audio track for each person.
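The article does not say how the separated tracks are produced from the spectrogram. A common approach in speech separation, assumed here, is to multiply the mixture's spectrogram by one mask per speaker. The sketch below shows only that masking step on toy numbers. The masks are computed from known sources rather than predicted from faces by a trained network, and the names `ratio_masks` and `apply_mask` are hypothetical.

```python
def ratio_masks(source_specs, eps=1e-8):
    """Give each speaker a share of every time-frequency cell, proportional to its magnitude."""
    n_frames = len(source_specs[0])
    n_bins = len(source_specs[0][0])
    masks = []
    for spec in source_specs:
        mask = []
        for t in range(n_frames):
            row = []
            for f in range(n_bins):
                total = sum(s[t][f] for s in source_specs) + eps
                row.append(spec[t][f] / total)
            mask.append(row)
        masks.append(mask)
    return masks

def apply_mask(mixture_spec, mask):
    """Multiply the mixture spectrogram by a mask, cell by cell."""
    return [[m * x for m, x in zip(mask_row, mix_row)]
            for mask_row, mix_row in zip(mask, mixture_spec)]

# Toy magnitude spectrograms (2 frames x 2 frequency bins) for two speakers.
speaker_a = [[4.0, 0.0], [1.0, 3.0]]
speaker_b = [[0.0, 2.0], [1.0, 1.0]]
# Simplifying assumption: magnitudes add when the speakers talk at once.
mixture_spec = [[a + b for a, b in zip(row_a, row_b)]
                for row_a, row_b in zip(speaker_a, speaker_b)]

mask_a, mask_b = ratio_masks([speaker_a, speaker_b])
estimate_a = apply_mask(mixture_spec, mask_a)
estimate_b = apply_mask(mixture_spec, mask_b)
```

In a trained system the masks would have to be estimated from the mixture and the video alone, which is where the faces come in.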
Google believes closed-captioning systems will be a major application area for the technology. The company is also considering a wider range of uses and is exploring opportunities to integrate it into various Google products. For example, adding it to the Google Home smart speaker could let the device distinguish between instructions issued by different users.
However, the model needs video to work well, so it may be better suited to a device like the Amazon Echo Show. Google opened Google Assistant to smart displays similar to the Echo Show earlier this year, but the company itself has not yet introduced such a product.
The technology may also raise privacy concerns. Although its real-world performance falls far short of what the demonstration videos show, with some minor adjustments it could indeed become a powerful surveillance tool.