3.2.2023
While there are many excellent tools for data mining structured, numerical data, unstructured text is the holy grail of modern artificial intelligence (AI) research. In the last decade or so, the mainstream focus has been on natural language processing (NLP), which aims to teach machines how words are correctly linked in a human language, mostly using the concept of “subject-predicate-object” triples. In essence, this amounts to teaching machines the grammar of a language. Large language models, like ChatGPT, are nothing other than the ultimate expression of this approach. Correspondingly, these tools have an uncanny ability to string words together correctly, but that is all!
Unfortunately, grammar has very little to do with meaning, as any kid's WhatsApp message immediately shows. A single emoji can say more than a hundred words, without any grammar at all! Correspondingly, tools like ChatGPT often write beautiful but nonsensical text. NLP and large language models are not a promising approach for text AI. Both need extensive training, and both deal only with the relative positioning of small groups of words. This is like studying small groups of trees while completely missing the forest. Meaning is an emergent phenomenon; it is hopeless to address this problem word by word.
At InfoCodex Semantic Technologies, we believe in a different approach: content recognition. We teach machines the “meaning of meaning” in the form of an extremely large semantic structure in which synonyms are aggregated into groups, so-called semantic clouds, and assigned to a hierarchical structure of ever more generic hypernyms, i.e. categories of concepts. This structure is formulated in a meta-language from which an unlimited number of new languages can be instantiated. Information-theoretic algorithms are used to create a mathematical model of a text in a 100-dimensional space of the most relevant high-level concepts, and self-organizing neural networks are then used to formulate a similarity of meaning on these high-level topics. Single words and their combinations are not relevant in this approach; only the high-level semantic concepts in the text matter.
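To make the idea of a concept space more tangible, here is a minimal toy sketch in Python. It is not the InfoCodex implementation: the word list, the three categories, and the function names `concept_vector` and `cosine_similarity` are all hypothetical, and plain cosine similarity stands in for the self-organizing neural networks mentioned above. It only illustrates how two texts with no words in common can still be judged similar once their words are mapped to shared high-level categories.

```python
import math
from collections import Counter

# Hypothetical toy lexicon: each word belongs to a synonym group that is
# assigned to one generic category (hypernym). A real semantic structure
# would be vastly larger and hierarchical.
SEMANTIC_CLOUDS = {
    "car": "vehicle", "automobile": "vehicle", "truck": "vehicle",
    "doctor": "medicine", "physician": "medicine", "drug": "medicine",
    "bank": "finance", "loan": "finance", "credit": "finance",
}

# Fixed axis order of the toy concept space: finance, medicine, vehicle.
CATEGORIES = sorted(set(SEMANTIC_CLOUDS.values()))


def concept_vector(text):
    """Count how often each high-level category is touched by the text."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    counts = Counter(SEMANTIC_CLOUDS[w] for w in words if w in SEMANTIC_CLOUDS)
    return [float(counts[c]) for c in CATEGORIES]


def cosine_similarity(a, b):
    """Cosine of the angle between two concept vectors (0.0 if either is empty)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


doc_a = concept_vector("The doctor prescribed a drug.")
doc_b = concept_vector("A physician was consulted.")
doc_c = concept_vector("The bank approved the loan.")

print(cosine_similarity(doc_a, doc_b))  # same category, different words
print(cosine_similarity(doc_a, doc_c))  # unrelated categories
```

The first two texts share no content word, yet they land on the same axis of the toy concept space, while the third text points in a different direction.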
This approach permits applications that are beyond the reach of NLP and language models. One of these is knowledge discovery, as opposed to information retrieval. In knowledge discovery, the aim is to identify facts that were not previously known. This is of course impossible by definition with NLP methods: they are constructed to understand sentences written by humans, so they can only retrieve information already known to those who wrote the sentences in the first place. This is why, for example, the InfoCodex semantic engine was possibly the first to actually discover new biomarkers for diabetes just by analyzing large collections of biomedical research documents, with no human intervention or training whatsoever! To give a concrete example of what we mean by content recognition, InfoCodex could identify by itself, with no training or human feedback, that “Hctz” is an acronym for “hydrochlorothiazide” and that this is a “diuretic drug”. Such a feat would have been impossible with a focus on triples of words or on sentence completion.
A second application made possible by content recognition is the summarization of documents of any length, even entire books, with no training or feedback. This is beyond the reach of language models like GPT-3 or ChatGPT, which require constant human feedback beyond the limit of a few sentences. As in many applications of AI, the need for training is the real bottleneck for successful commercial implementations. The associated costs in time and money are simply too high. However, a tool that can instantly summarize large flows of even very long documents, like legal or medical reports, is urgently needed. We are aware of many organizations, private and public, that outsource this task to legions of professional readers at huge cost. InfoCodex Summarizer can do this by using its underlying content-recognition technology to identify the high-level concepts touched upon in a document and extract the sentences that are most representative of these topics. This is another example of how semantics, i.e. meaning, takes precedence over grammar, i.e. the organization of words.
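The extraction step described above can be illustrated with a small Python sketch. Again, this is only a toy under stated assumptions, not the InfoCodex Summarizer: the lexicon `CONCEPTS`, the scoring rule, and the functions `sentence_concepts` and `summarize` are hypothetical. The sketch builds a concept profile of the whole document and then keeps the sentences whose concepts weigh most in that profile.

```python
import re
from collections import Counter

# Hypothetical toy lexicon mapping words to high-level concepts.
CONCEPTS = {
    "contract": "law", "court": "law", "liability": "law",
    "patient": "medicine", "diagnosis": "medicine", "therapy": "medicine",
}


def sentence_concepts(sentence):
    """Count the high-level concepts touched upon by one sentence."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return Counter(CONCEPTS[w] for w in words if w in CONCEPTS)


def summarize(text, max_sentences=1):
    """Return the sentences most representative of the document's concepts."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    per_sentence = [sentence_concepts(s) for s in sentences]

    # Concept profile of the whole document.
    profile = Counter()
    for counts in per_sentence:
        profile.update(counts)

    # A sentence scores high when it mentions concepts that dominate the document.
    scores = [sum(profile[c] * n for c, n in counts.items()) for counts in per_sentence]

    # Pick the best-scoring sentences, then restore their original order.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:max_sentences])]


report = (
    "The patient received a diagnosis and started therapy. "
    "The weather was pleasant that day. "
    "The court reviewed the contract."
)
print(summarize(report, max_sentences=2))
```

Note that no grammar is involved at any point: the sentence about the weather is dropped simply because it touches none of the concepts that characterize the document.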
For a demo of InfoCodex please contact: zdavatz@ywesee.com