Build Your Own Audio/Video Analytics App With HPE Haven OnDemand – Part 1

In this first part of a two-part tutorial, learn how to use HPE Haven OnDemand's Machine Learning APIs to build an audio/video analytics app with minimal time and effort.

By Phong Vu, HPE Haven OnDemand Developer Evangelist.

Rich Media Analysis

Wouldn’t it be useful to make video and audio content searchable by what is said in it, to analyze it, and to surface related information that enhances the user experience? In this blog, we will walk through the steps needed to build an app that analyzes audio and video content to extract text and actionable insights using Haven OnDemand APIs. Once we’re finished, we’ll have an app that will:

  • Transcribe speech to text from audio and video files using one of more than 20 language models.
  • Allow users to search a large media gallery for a word or phrase, using natural language.
  • Play media content with a styled, synchronized transcript.
  • Enable users to interact with the media content from the transcript by double-clicking any word in the text to jump forward or backward to that word.
  • Highlight positive and negative opinions in the transcript, based on sentiment analysis of the text.
  • Highlight key concepts of the selected media content and allow users to easily navigate to related information from Wikipedia and online news channels.
  • Let users browse summaries of famous people, places, and companies mentioned in the transcript.

This demo application is built for Windows Universal, iOS, and Android. Throughout this blog, we will use C# for Windows, with the HODClient library for sending requests to Haven OnDemand APIs and the HODResponseParser library for parsing the APIs’ responses. Please note that you can build this application in any programming language you like using the client libraries provided by Haven OnDemand. If you have never heard of or used the libraries, please read this article to learn more about them.

Project's source code

Code snippets in this blog are for illustration only. To follow the application development, you may want to download the entire project source code for Windows Universal 8.1, iOS and Android. You can also install the demo app from the Windows Store onto your PC, laptop, or Windows phone if your device runs Windows/WP 8.x or 10.

If you want to build your own app using the source code, remember to use your own API key to access Haven OnDemand APIs. If you don’t have an API key, click here to sign up with Haven OnDemand to get one.

Preparing the content

Before we start, make sure that we have some media content with good speech quality. The media content can be in .mp3, .wav, .m4a, .wma, .wmv, or .mp4 format.
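As a small illustration, a file picker or upload step could filter on the extensions listed above before any API call is made. The sketch below is not part of the demo project; the class and method names (MediaFormats, IsSupported) are hypothetical, and the extension set is simply the one named in this blog.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not taken from the demo project.
public static class MediaFormats
{
    // Extensions named in this tutorial; compared case-insensitively.
    private static readonly HashSet<string> Supported =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            ".mp3", ".wav", ".m4a", ".wma", ".wmv", ".mp4"
        };

    // Returns true when the file name ends with one of the listed extensions.
    public static bool IsSupported(string fileName)
    {
        if (string.IsNullOrWhiteSpace(fileName))
        {
            return false;
        }
        return Supported.Contains(Path.GetExtension(fileName));
    }
}
```

Calling MediaFormats.IsSupported("demomediafile.mp4") accepts the file, while a name such as "notes.txt" is rejected. The check looks only at the file name, not at the actual encoding of the content.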

To enable the demo app on different platforms to access and play the same media content from our media gallery, we will upload and store the media content on our online server. For your own project, if you want to enable this feature, you must use your own online server to host your media content. Otherwise, you can store and play the media content directly from your device.

Note: For the purposes of this tutorial, only the demo app for the Windows platform can upload media files to this remote server. For simplicity, we won’t discuss the process of uploading a file to a remote server. If you are interested, have a look at the implementation in the UploadEngine.cs file in the Windows project. You can also copy the uploadmedia.php file and place it on the online server where you want to store the media content.

Extract text from speech in audio or video content

HPE Speech Recognition

The first thing we need to do is to extract the text from media content. This can be done by calling the Speech Recognition API. It is a simple process, particularly when using the HODClient library to call the API. Let’s have a look at the following code in C#:

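// Parameters for the Speech Recognition API
var Params = new Dictionary<string, object>();
// Placeholder: URI of demomediafile.mp4 on your own remote server
Params.Add("url", "http://your-server.example/demomediafile.mp4");
// "0" returns the text word by word, each word with its own timestamp
Params.Add("interval", "0");
Params.Add("language", "en-US");
// Send the request in asynchronous mode
hodClient.PostRequest(ref Params, HODApps.RECOGNIZE_SPEECH, HODClient.REQ_MODE.ASYNC);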

From the code above, we can see how the Speech Recognition API can be called using the HODClient library.

  • First, we need to create a Params variable to specify the parameters required by the Speech Recognition API. The first parameter in the Params variable is “url”, which is the input data source. In this case, it is the URI pointing to a media file on our remote server (provided that we have uploaded a media file named demomediafile.mp4 to the remote server).
  • The second parameter in the Params variable is “interval”, which tells the Speech Recognition API how to segment the extracted text. We set the interval value to 0 because we want to receive the extracted text as an array of words, with the timestamp of every single word in the offset array. We will discuss the purpose of these two arrays later in the Playback media content with Rich Media Analytics section.
  • The third parameter in the Params variable is “language”, which specifies the language of the speech in the media content. Please click here to get full information about the Speech Recognition API’s parameters and supported languages.
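To see why word-level timestamps are useful, consider a player that needs to know which word is being spoken at the current playback position. The sketch below is only an illustration under stated assumptions: it models the two arrays as plain C# arrays, assumes the offsets are sorted in ascending order and use the same time unit as the playback position, and uses hypothetical names (TranscriptLookup, WordIndexAt) that do not come from the demo project or from the HODClient library.

```csharp
using System;

// Hypothetical helper, not taken from the demo project.
public static class TranscriptLookup
{
    // Returns the index of the word being spoken at the given position,
    // or -1 if the position is before the first word.
    // Assumption: offsets is sorted ascending, one entry per word, and
    // uses the same time unit as position.
    public static int WordIndexAt(long[] offsets, long position)
    {
        int i = Array.BinarySearch(offsets, position);
        // An exact match returns its index. Otherwise BinarySearch returns
        // the bitwise complement of the index of the first larger offset,
        // so the word in progress is the one just before that index.
        return i >= 0 ? i : ~i - 1;
    }
}
```

With the offsets { 0, 450, 600, 1100 } for the words "welcome", "to", "haven", "ondemand", a position of 700 maps to index 2 ("haven"). The reverse direction is simpler: when a user double-clicks word k in the transcript, the player can seek to offsets[k].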

To reuse the text of a media file later for analysis, we need to store it somewhere it can be easily retrieved, instead of calling the Speech Recognition API again for the same media content.

We will use the Haven OnDemand Text Indexing APIs to store the text extracted from each media file.

Note that the main purpose of Haven OnDemand text indexing is not content management. It lets us build a structured database for unstructured data and perform advanced searches against that index. Therefore, using text indexing will not only help us quickly retrieve the text of a media file, but also enable us to search for any single word or phrase spoken in the media content that we have extracted and stored in the index. We can also search by media type, such as audio or video, or even search for similar content in our media gallery.
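The store-once, reuse-later idea can be sketched independently of any particular storage. In the minimal sketch below, an in-memory dictionary stands in for the real storage, and a delegate stands in for the call to the Speech Recognition API. All names (TranscriptStore, GetTranscript, RecognitionCalls) are hypothetical and are not part of the demo project or the HODClient library.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: an in-memory stand-in for the transcript storage.
public class TranscriptStore
{
    private readonly Dictionary<string, string> _textByMediaUrl =
        new Dictionary<string, string>();

    // Stand-in for the speech recognition call: media URL in, text out.
    private readonly Func<string, string> _recognize;

    // Counts how often the recognition delegate was actually invoked.
    public int RecognitionCalls { get; private set; }

    public TranscriptStore(Func<string, string> recognize)
    {
        if (recognize == null)
        {
            throw new ArgumentNullException("recognize");
        }
        _recognize = recognize;
    }

    // Returns the stored text if present; otherwise runs recognition
    // once, stores the result, and returns it.
    public string GetTranscript(string mediaUrl)
    {
        string cached;
        if (_textByMediaUrl.TryGetValue(mediaUrl, out cached))
        {
            return cached;
        }
        RecognitionCalls++;
        string text = _recognize(mediaUrl);
        _textByMediaUrl[mediaUrl] = text;
        return text;
    }
}
```

Asking the store for the same media URL twice triggers the recognition delegate only once; the second call is served from storage. A persistent, searchable store gives the same benefit across app sessions and devices, which an in-memory dictionary cannot.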