Build Your Own Audio/Video Analytics App With HPE Haven OnDemand – Part 2

In the conclusion to this two-part tutorial, learn how to leverage HPE Haven OnDemand's Machine Learning APIs to build an audio/video analytics app with minimal time and effort.

By Phong Vu, HPE Haven OnDemand Developer Evangelist.

Editor's note: See part 1 of this tutorial, which was posted yesterday on KDnuggets.

Play back media content with Rich Media Analytics

List and search media content

Now it’s time to explore the power of Haven OnDemand text indexing, which lets us search our media gallery for words or phrases.

Query text index

To list all media content, we just call the Query Text Index API with the search argument defined as an asterisk “*”.

var arg = (searchVideos.Text.Length == 0) ? "*" : searchVideos.Text;

var Params = new Dictionary<string, object>()
{
    {"indexes", "speechindex"},
    {"text", arg},
    {"print_fields", "medianame,mediatype,filename"},
    {"absolute_max_results", 100}
};
hodClient.PostRequest(ref Params, hodApp, HODClient.REQ_MODE.SYNC);

Users enter their search terms in a text input field named “searchVideos”. To list the media titles, we don’t need to read everything from the text index, only the data we are interested in. The “print_fields” parameter specifies which fields to retrieve; in this case, they are the media name, media type and media file name.

We can also let the user search only audio or only video content by setting the “field_text” filter, as shown in the code below. This works because we defined the mediatype field as a parametric field when we created our “speechindex” text index. If you want to restrict a search to content in certain languages, you can define the “language” field as a parametric field when you create the index and filter on it with the same syntax.

if (media == "video")
    Params.Add("field_text", "MATCH{video/mp4}:mediatype");
else if (media == "audio")
    Params.Add("field_text", "MATCH{audio/mpeg,audio/mp4}:mediatype");
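The two snippets above can be combined into one helper that builds the whole parameter set. The following is a minimal sketch, not the demo's actual code: the class and method names (MediaQuery, Build) are hypothetical, while the index name, field names and filter strings are the ones used in this tutorial.

```csharp
using System.Collections.Generic;

public static class MediaQuery
{
    // Builds the parameters for a Query Text Index request.
    // "speechindex" and the field names come from the tutorial;
    // the class and method names are illustrative assumptions.
    public static Dictionary<string, object> Build(string searchText, string media)
    {
        // An empty search box means "list everything".
        var text = string.IsNullOrWhiteSpace(searchText) ? "*" : searchText.Trim();

        var parameters = new Dictionary<string, object>
        {
            { "indexes", "speechindex" },
            { "text", text },
            { "print_fields", "medianame,mediatype,filename" },
            { "absolute_max_results", 100 }
        };

        // Restrict results by media type only when the caller asks for it.
        if (media == "video")
            parameters["field_text"] = "MATCH{video/mp4}:mediatype";
        else if (media == "audio")
            parameters["field_text"] = "MATCH{audio/mpeg,audio/mp4}:mediatype";

        return parameters;
    }
}
```

Any other value of media (for example an empty string) leaves the filter out, so the query returns both audio and video items.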

The Query Text Index API also includes optional parameters for autocomplete and spellcheck of user input search terms, which can further help improve the search experience for users.

Play media content with synchronized transcript

There are many techniques, on different platforms, for synchronizing the transcript while playing back media content. In this demo, we will examine how the feature is implemented on the Windows platform.

When a user selects an item from the media list, we call the Get Content API to retrieve the text and metadata of that media file from the index.

var Params = new Dictionary<string, object>();
Params.Add("index_reference", item.reference);
Params.Add("indexes", item.index);
Params.Add("print_fields", "offset,text,content,concepts,occurrences,language");
hodClient.GetRequest(ref Params, hodApp, HODClient.REQ_MODE.SYNC);

When we called the Query Text Index API, it automatically returned each matched item with mandatory fields such as the reference id and the name of the index where the item is stored. Now we pass the reference id as “index_reference” and the index name as “indexes” to identify the item whose content we want.

Again, we use “print_fields” to define what information we want to retrieve. Let’s go through the purpose of each field.

  • The offset and text arrays: each timestamp in the offset array tells us when the word at the same position in the text array is spoken, so we can highlight that word while the media is playing back.
  • The content: this is the whole text string of the speech in the media content. We will use this text for sentiment analysis and for finding interesting entities.
  • The concepts and occurrences arrays: we will list the key concepts and highlight them in different colors and font sizes based on how often they occur in the content.
  • The language: we will use it to specify the language code for sentiment analysis and to choose a database for that language when we find similar content in Haven OnDemand public indexes.
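As an illustration of the third bullet, occurrence counts can be mapped to font sizes with simple linear scaling. This is only a sketch under assumptions: the ConceptStyle class, the FontSizes method and the 14 to 40 size range are hypothetical and are not taken from the demo's source code.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class ConceptStyle
{
    // Maps each concept's occurrence count to a font size between
    // minSize and maxSize using linear scaling. The size range is an
    // arbitrary choice for illustration.
    public static List<double> FontSizes(
        IReadOnlyList<int> occurrences, double minSize = 14, double maxSize = 40)
    {
        if (occurrences.Count == 0) return new List<double>();

        int lo = occurrences.Min();
        int hi = occurrences.Max();

        // If every count is the same there is nothing to scale,
        // so every concept gets the minimum size.
        if (hi == lo) return occurrences.Select(_ => minSize).ToList();

        return occurrences
            .Select(n => minSize + (maxSize - minSize) * (n - lo) / (hi - lo))
            .ToList();
    }
}
```

The result has one size per concept, in the same order as the occurrences array, so it can be paired with the concepts array by position.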

On the Windows platform, we display the content text string in a RichTextBlock, which allows us to apply different styles to different parts of the displayed text. This way, we can show the text already read in green, the word being spoken in red and the unread text in gray. We also use data binding, so the display updates whenever we change the value of the read text, the spoken word or the unread text.

<RichTextBlock Name="runningText" SelectionChanged="runningText_SelectionChanged" TextWrapping="Wrap">
    <Paragraph FontSize="26">
        <Run Foreground="Green" Text="{Binding ReadText}"/>
        <Run Foreground="Red" Text="{Binding Word}"/>
        <Run Foreground="Gray" Text="{Binding UnreadText}"/>
    </Paragraph>
</RichTextBlock>
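For the bindings to refresh the display, the bound object has to raise change notifications. A minimal sketch of such an object follows, assuming the standard INotifyPropertyChanged pattern; the class name TranscriptViewModel is hypothetical, and only the three property names are taken from the markup.

```csharp
using System.ComponentModel;
using System.Runtime.CompilerServices;

public class TranscriptViewModel : INotifyPropertyChanged
{
    private string readText = "";
    private string word = "";
    private string unreadText = "";

    public event PropertyChangedEventHandler PropertyChanged;

    // Property names match the {Binding ...} paths in the markup.
    public string ReadText { get => readText; set => Set(ref readText, value); }
    public string Word { get => word; set => Set(ref word, value); }
    public string UnreadText { get => unreadText; set => Set(ref unreadText, value); }

    // Updates the backing field and notifies the UI only when the value
    // actually changes. CallerMemberName supplies the property name.
    private void Set(ref string field, string value, [CallerMemberName] string name = null)
    {
        if (field == value) return;
        field = value;
        PropertyChanged?.Invoke(this, new PropertyChangedEventArgs(name));
    }
}
```

An instance of this class would be assigned as the DataContext of the RichTextBlock (or of a parent element), after which setting Word, ReadText or UnreadText updates the corresponding Run.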

On the iOS and Android platforms, there is no rich text UI component, so we use a WebView to display the rich content. If you are interested, please have a look at the source code to see how it was implemented.

Now, let’s sync the spoken word while playing back the media. To do that, we just need a timer that fires an event every 100 milliseconds and calls a handler function.

var delayTimer = new DispatcherTimer();
delayTimer.Tick += DelayTimer_Tick;
delayTimer.Interval = TimeSpan.FromMilliseconds(100);

We then start the timer every time the media player starts playing back the media content. Every time the DelayTimer_Tick function is called, we check the current position of the playing media and compare it to the timestamp in the offset array to find the word currently being spoken. Once that word is found, its index in the text array tells us where the read text ends and where the unread text begins.
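Before looking at the demo's handler, here is the lookup in isolation as a pure function. This is a variant sketch, not the author's method: it uses a binary search instead of the running index used in the handler, so it also gives the right word after the user seeks. The TranscriptSync class, its method names and the long element type for offsets are assumptions; it also assumes the offsets are sorted in ascending order and that the two arrays have the same length.

```csharp
using System.Collections.Generic;

public static class TranscriptSync
{
    // Returns the index of the last word whose offset is <= positionMs,
    // or -1 if playback has not reached the first word yet.
    // Assumes offsets are sorted ascending (in milliseconds).
    public static int FindSpokenIndex(IReadOnlyList<long> offsets, double positionMs)
    {
        int lo = 0, hi = offsets.Count - 1, found = -1;
        while (lo <= hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (offsets[mid] <= positionMs) { found = mid; lo = mid + 1; }
            else { hi = mid - 1; }
        }
        return found;
    }

    // Splits the transcript into read text, spoken word and unread text
    // around the given word index (an index of -1 means nothing is read yet).
    public static (string Read, string Word, string Unread) Split(
        IReadOnlyList<string> words, int index)
    {
        if (index < 0) return ("", "", string.Join(" ", words));
        var read = string.Join(" ", Slice(words, 0, index));
        var unread = string.Join(" ", Slice(words, index + 1, words.Count - index - 1));
        return (read, words[index], unread);
    }

    // Yields count words starting at start.
    private static IEnumerable<string> Slice(IReadOnlyList<string> words, int start, int count)
    {
        for (int i = start; i < start + count; i++) yield return words[i];
    }
}
```

In a timer handler, the result of FindSpokenIndex for the player's current position would be passed to Split, and the three returned strings assigned to the bound ReadText, Word and UnreadText properties.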

private async void DelayTimer_Tick(object sender, object e)
    if (contentResponse != null && contentResponse.documents != null)
        GetContentResponse.Document doc = contentResponse.documents[0];
        await this.Dispatcher.RunAsync(Windows.UI.Core.CoreDispatcherPriority.Normal, () =>
            var pos = mplayer.Position.TotalMilliseconds;
            if (index < doc.offset.Count)
                var check = doc.offset[index];
                if (pos > check)
                    string word = doc.text[index];
                    textItem.ReadText = readText;

                    int start = index + 1;
                    var sub = doc.text.GetRange(start, doc.offset.Count - start);
                    string leftOver = String.Join(" ", sub);
                    readText += " " + word;
                    textItem.Word = word;
                    textItem.UnreadText = leftOver;