Domain-Specific Language Processing Mines Value From Unstructured Data

Processing unstructured text data in real-time is challenging when applying NLP or NLU. Find out how Domain-Specific Language Processing can also help mine valuable information from data by following your guidance and using the language of your business.

By John Harney, CTO and Janet Dwyer, CEO of DataScava.

Real-time mining of unstructured textual content isn’t simple. To work effectively, a solution must be fine-tuned to meet your organization’s specific needs and address the quirks in your company’s information. To add value, applications require a vocabulary that accurately captures the definitions, context, and nuance of your business and the way it uses language.

Consequently, unstructured textual data needs to be organized before it can be put to use. That’s a huge challenge. For data-driven systems based on artificial intelligence, machine learning or other advanced applications, natural language processing and natural language understanding are supposed to be the solution, either by themselves or as hybrid models that plug in industry terms or use complex Boolean logic. But none of these are easy to use or implement, and without extensive programming and training, they often process data incorrectly.

There is, however, a simpler approach to addressing these challenges, one that does not require NLP or NLU. It’s called “Domain-Specific Language Processing,” or DSLP, and it uses the language of your business to mine unstructured textual data.

Why is Natural Language Processing So Hard?

In his lecture Why is NLP Hard? Dr. Dragomir Radev discusses the difficulties with human language that make it so hard for NLP systems to understand and process them. Going further, he acknowledges that language is difficult even for humans.

Language is difficult because words, sentences, and paragraphs can be ambiguous and hard to interpret because their meaning is so often dependent on context. At the same time, even the most advanced systems are challenged to “understand” the cornerstone of human language: nuance. For example, consider this sentence: “The blood-red sun birthed a new day as it rose over the horizon.” How would an NLP system translate this?

In general, NLP is useful for relatively simple tasks, such as automating a phone attendant’s call routing or a chatbot’s responses. However, processing the more complex undertakings human beings perform every day is a challenge. How does NLP process “The DJIA rose 4 points to an all-time high of 27,332.03 in July 2019. It sank to its record low of 41.20 in July of 1932”? Or what about, “Drinking herbal tea cures cancer”?

Unlike NLP, Domain-Specific Language Processing does not attempt to disambiguate natural language. It processes unstructured textual content at the document level, not the sentence level. DLSP finds and measures keywords and phrases associated with domain-specific topics controlled by the user. It searches these topic scores to find highly precise on-topic documents and eliminate irrelevant ones.

Domain-Specific Language Processing: A Simpler Approach

Many tools support NLP tasks, but customizing models for a specific domain requires retraining of the NLU layer. Extensive re-programming is a complex, expensive effort. And even if a technical solution is deployed, most non-technical users won’t understand how it impacts the underlying logic of the system or their search. Thus, the potential for inefficiency, lost productivity, and incorrect insights remain high.

“Many experts in our survey argued that the problem of natural language understanding (NLU) is central, as it is a prerequisite for many tasks such as natural language generation (NLG). The consensus was that none of our current models exhibit ‘real’ understanding of natural language,” said Sebastian Ruder, a research scientist at DeepMind, London with a Ph.D. in NLP and Deep Learning.

This brings us to Domain-Specific Language Processing, which leverages the language of your business so you can mine unstructured data using scored topics, keywords, and phrases that are in context, visible and under your control. DSLP doesn’t infer what you’re looking for. Instead, it uses your commands to find what you are looking for. And you don’t need to be a data scientist to use it.

Power from Unstructured Data

DSLP extracts highly precise, domain-specific information and presents it in a business’s context. Data-agnostic, the approach is most useful when it’s made accessible to a range of technical and non-technical users. When both business people (who may not be experts in data) and data professionals (who may not be experts in business) can collaborate across disciplines, efficiency, and productivity quickly improve.

As its name implies, Domain-Specific Language Processing leverages the user’s subject matter expertise. Ideally, users can set up their customized search filters before the system goes to work. Then, operating around the clock and at the direction of expert users, DSLP continually refines the system’s capabilities in a measurable way.

Because DSLP is an approach to processing—not to software architecture or hardware design—its solutions can stand on their own, integrate with other systems, or curate the data needed for AI and ML projects. DSLP becomes the foundation for the automated, simple and precise searching, indexing, scoring, and matching of unstructured textual content.

DSLP and Weighted Topic Scoring

DSLP is especially effective when combined with Weighted Topic Scoring, or WTS. Using DSLP, WTS shows your found topic scores, keywords, and phrases at the file level to surface key results and the most relevant documents from large datasets—in context.

With WTS, users select, prioritize and segment “required” and “nice-to-have” topics, setting the minimum scores in each topic files must meet, and ranking the resulting output. WTS only matches files that meet or exceed all required topic scores.

Non-technical users can adjust DSLP and WTS to add or edit topics, keywords and phrases on the fly, hone in further using multi-level sort and other criteria, and set the WTS search filters to work continuously to mine new textual data as it arrives.

DSLP and WTS work together in DataScava to mine unstructured textual data, in TalentBrowser to unlock job boards, databases or the ATS “black hole,” and in MedNasc for Orthopedics to match orthopedics specialists to opportunities.

As Flexible as Language

To work properly, data-driven systems require a tremendous amount of standardized, labeled, and otherwise “structured” data. Yet by 2022, more than 90% of the world’s electronic information will be unstructured, all of it written in different styles and using terms whose meaning differs from sector to sector.

DSLP helps bridge those gaps by ensuring that input is relevant, thus improving the quality of results while reducing the risk of inappropriate analysis and badly informed decisions. All the while, it increases your business and data teams’ efficiency.

Because it makes unstructured data more accessible, more understandable, and, above all, more useful, DSLP is a powerful alternative or adjunct to NLP, NLU, Semantic and Boolean Search.

Bio: John Harney and Janet Dwyer are co-founders of DataScava and TalentBrowser software solutions. John brings expertise in Visual Studio, SQL Server, cloud computing, and full life cycle software development, including 20+ years of experience in software architecture, project management, and development. His career began in Ireland, where he created financial decision support software for Fixed Income/FX/short-term cash markets, providing software to 19 of 23 banks in Dublin.

Janet has 15 years of combined experience in software product development, I.T. recruitment, and account management, with a focus on Wall Street’s financial services firms, the Fortune 1000, and other hi-tech companies. Prior to her current roles, she was a Partner in a NYC startup I.T. staffing firm and District Manager in a boutique I.T. consulting firm specializing in derivatives and capital markets I.T. staffing.