Building an intelligent Digital Assistant

In this second part we outline our own experience building an AI application and reflect on why we chose not to utilise deep learning as our core technology.

By Dr Vladimir Dobrynin, Dr Xiwu Han, Mr Alexey Mishenin, Dr David Patterson, Dr Niall Rooney, Mr Julian Serdyuk, Aiqudo

In part 1 of this article we discussed the industry trend of companies wanting to brand themselves as "AI first", often positioning deep learning as their core technology. We highlighted some of the problems that building and deploying a deep learning solution presents, and suggested that other machine learning approaches can often provide a solution in a simpler and more cost effective way. In this second part we want to outline our own experience building an AI application and reflect on why we chose not to utilise deep learning as our core technology.



At Aiqudo we have built a personal digital assistant for smartphones. Our goal is to understand what users are saying, figure out their intent and execute the correct action for them on their devices. An example command could be "Book me a 3-star hotel in New York near Central Park for Friday night." By voice enabling their phones we save users time and remove the need to physically interact with their devices. It may seem tempting to go down the deep learning route when building technology that understands a user's intent from what they are saying but, in addition to the challenges outlined in part 1 of this article, we felt there were linguistic reasons why deep learning was not the best choice of technology. We believe it is important to build our intelligent algorithms on linguistic principles relating to how we as humans understand meaning and use language to communicate.


How do we form meaning from language?

Language can be considered a model of the real world, as built by society. While everyone uses language, no one person controls it. Where it is incomplete it improves in an evolutionary manner, step by step over time, as new terms are introduced and existing terms fall out of use. Specific individuals may coin new terms, but they cannot control whether those terms catch on or disappear. This process is largely unpredictable and determined by society as a whole.

Language is complex - It is naive to think that each word we use when we communicate corresponds directly to an object in the real world. There are different theories that try to explain the relationship between language and the real world. One of them, semiotics, uses the concept of signs. Here, a word we use is one part of the sign, and the related mental concept (mental image) it maps to in our brain is the other part. In different contexts, a sign can change in meaning. For example, consider the skull and crossbones on the flag below:

Most people will interpret this as meaning ‘Pirates’. But if we change the context to a bottle (a different mental concept),



the meaning also changes: now we understand that it refers to poison. Similarly, words can map to different mental concepts, and this explains why some words have many meanings.
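The idea that a sign's meaning depends on the context it appears in can be sketched in a few lines. This is purely an illustration of the semiotic point above, not Aiqudo's implementation; the symbol and context names are invented for the example.

```python
# Illustrative sketch: the same symbol resolves to different meanings
# depending on the context in which it appears (a flag vs. a bottle).
SIGN_MEANINGS = {
    ("skull_and_crossbones", "flag"): "pirates",
    ("skull_and_crossbones", "bottle"): "poison",
}

def interpret(symbol: str, context: str) -> str:
    """Return the mental concept a symbol maps to in a given context."""
    return SIGN_MEANINGS.get((symbol, context), "unknown")

print(interpret("skull_and_crossbones", "flag"))    # pirates
print(interpret("skull_and_crossbones", "bottle"))  # poison
```

The key point of the sketch is that meaning is a function of both the symbol and its context, never of the symbol alone.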



Personal context is important to understanding meaning - Which mental concept in our brain a particular word maps to (and therefore how we interpret and assign meaning to it) depends on our personal background, experiences and context. This makes interpretation a very person-dependent process. If we try to use a machine learning approach (such as neural networks) to "understand" the meaning of a text, we ignore this personal background and context. Instead, we are implicitly forced to adopt the background, context and biases of the human experts who labelled the training dataset, along with the meanings they assigned to terms. This is acceptable if we train and deploy the algorithm in a very specific domain, such as hotel bookings, because we all have very similar background experiences relating to hotels, and each word from the domain fires very similar mental concepts in all our brains. The context we share when booking a hotel is also very similar: we want to book the best hotel we can within our budget, in a good location.

This is why a linguistic neural network can be very successful in a very specific domain (such as booking a restaurant), but the same model can't be reused in other domains (for example booking a hotel or hiring a car). The difference in context and background experience between these domains means the same word may fire one mental concept in one domain and a very different concept in another.

Discourses in Language - Nobody knows a language in its entirety. The average person knows only a fraction of the total number of words that make up a language and uses even fewer on a regular basis (English is estimated to contain about 170,000 words, with the average person knowing 20,000 to 30,000 and regularly using only around 3,000). We become familiar with the subset of a language defined by our needs: an economist doesn't need to know all the names of the parts of the body, whereas a doctor does. In this way people belong to one or more communities in which they communicate with other people who use similar words with the same meanings. These are called discourse communities. People develop and adhere to specific language within these communities to improve the effectiveness of their communication, and this new language can be adopted by the community, or even by other communities that find it useful.

The interesting thing is that the same term can have different meanings in different communities. A software engineer will assign a different meaning to the term "java" than a coffee grower in Brazil (they have different mental concepts for the same term), yet when communicating with each other Brazilian farmers don't need to clarify what they mean when they use the term "java". To be truly effective and reflect this complexity, a deep learning model would need to be built and maintained over time for each discourse community it was used in, to ensure each term was assigned its correct meaning within each discourse based on its context.
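The "java" example above amounts to resolving a term against the lexicon of the speaker's discourse community rather than against a single global vocabulary. A minimal sketch of that idea follows; the community names and lexicon entries are invented for illustration and are not part of any real system.

```python
# Hypothetical sketch: each discourse community has its own lexicon,
# so the same term resolves to different mental concepts per community.
DISCOURSE_LEXICONS = {
    "software_engineers": {"java": "programming language"},
    "coffee_growers": {"java": "variety of coffee"},
}

def resolve(term: str, community: str) -> str:
    """Look up a term's meaning within one community's lexicon."""
    lexicon = DISCOURSE_LEXICONS.get(community, {})
    return lexicon.get(term.lower(), "unknown in this discourse")

print(resolve("java", "software_engineers"))  # programming language
print(resolve("java", "coffee_growers"))      # variety of coffee
```

A model trained on one community's labelled data effectively bakes in only one of these lexicons, which is why it would need to be rebuilt per discourse community.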

Implicit knowledge - Focusing further on Aiqudo's voice application, consider the command 'I want to do online shopping'. On hearing this, a voice assistant should start the Amazon application, for example. But how does the assistant know that Amazon is relevant to online shopping? Nothing in the command itself, or in the language, contains this information. The assistant needs external knowledge about the world the user lives in. Again, this is problematic for a neural network, as it has no means of encapsulating this knowledge. As already mentioned in part 1 of this article, you can't "look inside" a deep learning model to understand what it knows or to expand its knowledge on demand. Even if this were theoretically possible, how would you go about adding this data to cover all the real-life scenarios of millions of people?
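The gap described above can be made concrete: the link between "online shopping" and Amazon has to live in an external, inspectable knowledge base that can be extended on demand. The sketch below assumes a simple concept-to-apps mapping; it is an illustration of the principle, not Aiqudo's actual matching algorithm.

```python
# Hypothetical sketch: external world knowledge linking concepts to apps.
# Nothing in the command mentions Amazon; the knowledge base supplies it,
# and new entries can be added without retraining anything.
CONCEPT_TO_APPS = {
    "online shopping": ["Amazon", "eBay"],
    "ride hailing": ["Uber", "Lyft"],
}

def apps_for_command(command: str) -> list:
    """Match concepts mentioned in the command against the knowledge base."""
    text = command.lower()
    matches = []
    for concept, apps in CONCEPT_TO_APPS.items():
        if concept in text:
            matches.extend(apps)
    return matches

print(apps_for_command("I want to do online shopping"))  # ['Amazon', 'eBay']
```

Because the knowledge lives in an explicit structure rather than in opaque network weights, it can be inspected and expanded entry by entry, which is exactly what a trained deep learning model does not allow.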

So, we had three goals when building our voice application. The first was to build a platform that removes the development burden placed on engineering teams to voice enable their apps (as is the case with Google Home, Alexa and so on). The second was to build a truly natural language interface that understands a user's precise intent from the commands they speak and seamlessly executes the action within the app on their phone that best meets that intent. The third was to use an unsupervised approach, eradicating the impact of human labelling biases on algorithm training. For the reasons discussed in this article and the previous one, we built our own intelligent algorithms on the principles of discourse communities and semiotics. For those interested in how we achieve this technically, more detail can be found in our next article.