
Generative models vs. Content Recognition

By Carlo Trugenberger, 26.1.2023

Generative language models, like ChatGPT, are all the rage these days. At their core, these models are probability distributions over sequences of words: given a sequence of (n-1) words, the next word is predicted as the one that makes the resulting sequence of n words the most probable, given the shorter sequence.
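The mechanism can be sketched with a deliberately simplified model: a bigram (n = 2) predictor that counts word pairs in a corpus and picks the most frequent continuation. This is a toy illustration of the "most probable next word" idea, not how large models like ChatGPT are actually implemented.

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count word pairs; P(next | prev) is proportional to the pair count."""
    counts = defaultdict(Counter)
    words = corpus.lower().split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def most_likely_next(counts, prev):
    """Pick the word maximizing P(next | prev) under the bigram model."""
    return counts[prev.lower()].most_common(1)[0][0]

# Hypothetical miniature training corpus, for illustration only.
corpus = "the tiger is an endangered species and the tiger is a large cat"
model = train_bigrams(corpus)
print(most_likely_next(model, "tiger"))  # → "is"
```

Real language models replace the raw counts with a neural network estimating the same conditional probability, but the prediction principle is the same.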

Two facts should immediately catch the eye. First, language models need extensive training on a specific corpus before they can be used. Second, and related to this, they are not able to disambiguate. Once trained on a biology-focused corpus, a language model will predict that “Tiger” should be followed by words like “species” or “endangered” rather than “Woods”, as would be more likely in a sports-related document.
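The corpus dependence is easy to demonstrate with the same kind of toy next-word counter: trained on two different miniature corpora (both invented here for illustration), the model makes opposite predictions for the word “tiger”.

```python
from collections import Counter, defaultdict

def next_word_counts(corpus):
    """Map each word to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    words = corpus.lower().split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

# Two hypothetical training corpora.
biology = "the tiger species is endangered the tiger species lives in asia"
sports  = "tiger woods won the masters tiger woods is a golfer"

print(next_word_counts(biology)["tiger"].most_common(1)[0][0])  # → "species"
print(next_word_counts(sports)["tiger"].most_common(1)[0][0])   # → "woods"
```

The model has no notion of which sense of “tiger” is meant; it simply reproduces the statistics of whatever corpus it was trained on.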

Language models are prone to pique the interest of the wider public, given their human-like abilities to answer questions and generate essays on the topics on which they have been trained. However, "The cost of training is the real bottleneck in the application of AI/NLP. Companies spend millions of dollars weekly just to train and fine-tune their AI workloads," (Prof. A. Shrivastava, Rice University, Houston, April 2021). And when a language model like ChatGPT is asked questions outside its training domain, the answers are, as was to be expected, often badly wrong.

Tools like ChatGPT surely have many potential commercial applications, as the investments of companies like Microsoft testify. One for which they are truly unsuited, however, is creating summaries, especially of very long documents. As is well explained at https://www.width.ai/post/gpt3-summarizer, there are two types of summaries: extractive summaries, which select the exact sentences and keywords of the original text that best represent its overall content, and abstractive summaries, which replace the original text with new, more compact sentences supposedly conveying the same content. While abstractive summaries might superficially seem more appealing, they are far too “dangerous” for productive applications. What is the use of a summarization tool you cannot trust 100%?
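The extractive approach can be sketched in a few lines: score each sentence by the document-wide frequency of its words and keep the top-scoring sentences verbatim, in their original order. This is a minimal frequency-based sketch of the general technique, not any particular product's algorithm.

```python
import re
from collections import Counter

def extractive_summary(text, n_sentences=2):
    """Score each sentence by the frequency of its words in the whole
    document and return the top-scoring sentences in original order."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'\w+', text.lower()))
    scored = sorted(range(len(sentences)),
                    key=lambda i: -sum(freq[w] for w in
                                       re.findall(r'\w+', sentences[i].lower())))
    keep = sorted(scored[:n_sentences])  # restore document order
    return ' '.join(sentences[i] for i in keep)

# Hypothetical example text.
text = "Tigers are big cats. Tigers live in Asia. Pizza is food. Tigers hunt deer."
print(extractive_summary(text, 2))  # → "Tigers are big cats. Tigers live in Asia."
```

Because every sentence in the output occurs verbatim in the source, an extractive summary can never "hallucinate" content, which is exactly the trust property the paragraph above argues for.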

Language models use the fancy names “zero-shot” and “few-shot” (or “multiple-shot”) summarization to distinguish summarization that needs no examples or additional training from summarization that must first be primed with examples or fine-tuned, in various possible forms (see https://www.width.ai/post/gpt3-summarizer). But what is the commercial use of a summarization tool that needs to be trained each time on the very documents whose comprehension it should facilitate? Of course, only zero-shot summarization makes any sense! And, as the page above clearly states, zero-shot summarization with language models like GPT-3 fails for long documents, as expected. This is particularly true for book-length documents, for which language models like GPT-3 need constant human feedback (https://openai.com/blog/summarizing-books/), a contradiction in terms for productive applications in which the human users know nothing about the text and expect to actually gain knowledge from the AI tool! And documents this long are ubiquitous in several fields, such as legal and medical expert reports.
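The difference between the two regimes is visible in the prompts themselves. The sketch below builds hypothetical zero-shot and few-shot summarization prompts as plain strings (no API call is made; the function names and prompt wording are illustrative assumptions, not any vendor's official format).

```python
def zero_shot_prompt(document):
    # Zero-shot: the bare instruction, with no example summaries supplied.
    return f"Summarize the following text:\n\n{document}\n\nSummary:"

def few_shot_prompt(examples, document):
    # Few-shot ("multiple-shot"): prepend example text/summary pairs,
    # which must be prepared by a human for the domain at hand.
    shots = "\n\n".join(f"Text: {d}\nSummary: {s}" for d, s in examples)
    return f"{shots}\n\nText: {document}\nSummary:"

print(zero_shot_prompt("A long legal report..."))
print(few_shot_prompt([("An example report...", "An example summary.")],
                      "A long legal report..."))
```

The few-shot variant is exactly the per-domain preparation effort the paragraph above objects to: someone must already understand the documents well enough to write the example summaries.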

Unsupervised summarization, even of very long documents, is the realm of content-recognition AI, a direction orthogonal to generative language models. Here the focus is not on producing text but on the unsupervised identification of the meaning of existing text, with no training and on generic topics. In contrast to generative language models, this paradigm excels at summarization, even of very long documents. It is also ideal for grouping and finding documents by similarity and for knowledge discovery, the holy grail of text mining (Hahn U, Cohen KB, Garten Y, Shah NH, Mining the pharmacogenomics literature: a survey of the state of the art. Brief Bioinform. 2012, 13(4):460–494), which requires the ability to identify previously unnoticed, hidden correlations and is paramount for investigative and research work, e.g. for drug discovery.
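Grouping documents by similarity without any training can be illustrated with the simplest possible device: bag-of-words term-frequency vectors compared by cosine similarity. This is a toy baseline to make the idea concrete, not the method used by any specific content-recognition engine.

```python
import math
import re
from collections import Counter

def tf_vector(text):
    """Bag-of-words term-frequency vector of a document."""
    return Counter(re.findall(r'\w+', text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical example documents.
d1 = tf_vector("tigers are endangered species in asia")
d2 = tf_vector("many species in asia are endangered")
d3 = tf_vector("the stock market fell sharply today")

print(cosine_similarity(d1, d2))  # high: shared vocabulary
print(cosine_similarity(d1, d3))  # → 0.0: no words in common
```

No training step is involved: similarity falls out of the document contents alone, which is the sense in which such methods are unsupervised.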

InfoCodex is such a content-recognition engine. It is based on probably the largest linguistic database in the world, combined with information theory and self-organizing neural networks. Its linguistic knowledge base is universal: it covers generic topics and is organized hierarchically. This permits the information-theory algorithms to identify by themselves the relevant topics of a document and consequently to disambiguate possibly equivocal words automatically. “Tiger” will not be identified as an “animal” in a document about “golf”. Context is paramount, and no training is necessary: the tool trains itself. The self-organizing neural networks then provide the pattern-recognition capability to identify similarity of meaning, i.e. automatic content recognition without prior training and on generic topics.
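The disambiguation idea can be made concrete with a toy sketch: pick the word sense whose vocabulary (drawn from a small hand-built knowledge base) overlaps the document's context the most. This is an illustration of the principle only; the sense inventory below is invented, and InfoCodex's actual algorithms are far richer.

```python
# Hypothetical miniature sense inventory, standing in for a
# hierarchical linguistic knowledge base.
SENSES = {
    ("tiger", "animal"): {"species", "endangered", "jungle", "cat", "wildlife"},
    ("tiger", "golfer"): {"golf", "woods", "masters", "tournament", "pga"},
}

def disambiguate(word, context_words):
    """Pick the sense whose associated vocabulary overlaps the context most."""
    context = {w.lower() for w in context_words}
    return max((s for (w, s) in SENSES if w == word),
               key=lambda s: len(SENSES[(word, s)] & context))

print(disambiguate("tiger", ["golf", "tournament", "woods"]))    # → "golfer"
print(disambiguate("tiger", ["endangered", "species", "asia"]))  # → "animal"
```

The key point the paragraph makes survives even in this toy: the decision is driven by the surrounding context of the document at hand, not by whatever corpus a model happened to be trained on.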

InfoCodex, contrary to GPT-3 or ChatGPT, can instantly provide summaries of texts of any length, even of entire books, on any topic and without prior training or human feedback. While generative models surely have a wealth of applications, they are clearly not ideal for summarization and knowledge discovery. For these crucial applications you need unsupervised content recognition, i.e. a tool like InfoCodex.

For a demo of InfoCodex please contact: zdavatz@ywesee.com

Page last modified on January 26, 2023, at 05:24 PM