Step 1: InfoCodex converts each source document (for example, a PDF file) into a temporary text file using the program pdf2txt. Where possible, metadata are extracted at the same time (author, title, document date, file type, percentage of graphical content, etc.).
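As an illustration of this conversion step, the sketch below uses the open-source pdfminer.six library (whose bundled command-line tool is also called pdf2txt.py). InfoCodex's own converter and metadata extraction are not public, so the function shown here is only a stand-in under that assumption.

```python
# Sketch of step 1: convert a PDF to a temporary text file and pull basic
# metadata. Uses pdfminer.six (which ships the command-line tool pdf2txt.py);
# InfoCodex's actual converter is proprietary, so this is only illustrative.
import tempfile
from pdfminer.high_level import extract_text
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

def convert_pdf(path: str) -> tuple[str, dict]:
    # Extract the plain text and write it to a temporary file.
    text = extract_text(path)
    tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False)
    tmp.write(text)
    tmp.close()

    # Read the document information dictionary (author, title, dates, ...).
    with open(path, "rb") as fh:
        doc = PDFDocument(PDFParser(fh))
        info = doc.info[0] if doc.info else {}
    metadata = dict(info)
    return tmp.name, metadata
```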
Step 2: Sentences in the converted text files are recognized by their delimiters (commas, full stops, semicolons, etc.). All words are identified and looked up in the linguistic database (2.7 million entries in German, English, French, Italian and Spanish). If necessary, a lexical analysis is performed on endings such as plurals ("es", "'s"), conjugations, etc. Simultaneously, InfoCodex performs collocated word recognition (CWR), checking whether several consecutive words form a single term. For example, "European Court of Justice" is recognized as one term rather than as the four separate words "European", "Court", "of" and "Justice".
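A minimal sketch of how CWR can work, assuming a greedy longest-match against a multi-word term lexicon; the tiny lexicon and whitespace tokenization below are illustrative stand-ins for the actual 2.7-million-entry linguistic database.

```python
# Sketch of collocated word recognition (CWR): greedy longest-match of
# consecutive tokens against a multi-word term lexicon. The tiny LEXICON
# here stands in for InfoCodex's linguistic database.
LEXICON = {
    ("european", "court", "of", "justice"),
    ("new", "york"),
}
MAX_TERM_LEN = max(len(t) for t in LEXICON)

def recognize_terms(tokens: list[str]) -> list[str]:
    terms, i = [], 0
    while i < len(tokens):
        # Try the longest possible span first, shrinking until a match;
        # a span of one token always matches (the word itself).
        for span in range(min(MAX_TERM_LEN, len(tokens) - i), 0, -1):
            candidate = tuple(w.lower() for w in tokens[i:i + span])
            if span == 1 or candidate in LEXICON:
                terms.append(" ".join(tokens[i:i + span]))
                i += span
                break
    return terms

print(recognize_terms("The European Court of Justice ruled".split()))
# -> ['The', 'European Court of Justice', 'ruled']
```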
The part of speech (proper name, noun, verb, number, etc.), significance, language, synonym group and link into the ontology tree are retrieved from the linguistic database and stored for later use. In parallel with this analysis, a continuous language recognition is performed.
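The following sketch illustrates what such a lookup, combined with a running language vote, might look like; the entry fields and sample records are assumptions for illustration, not the real database schema.

```python
# Sketch of the per-word lookup and continuous language recognition.
# Each lexicon entry carries part of speech, synonym group, ontology node
# and language; the running vote picks the dominant language so far.
from collections import Counter
from dataclasses import dataclass

@dataclass
class LexEntry:
    pos: str            # part of speech: noun, verb, proper name, ...
    synonym_group: int  # id of the synonym group the word belongs to
    ontology_node: str  # link into the ontology/taxonomy tree
    language: str       # G, E, F, I or S

LINGUISTIC_DB = {
    "court":   LexEntry("noun", 4711, "law/institutions", "E"),
    "gericht": LexEntry("noun", 4711, "law/institutions", "G"),
}

def analyse(tokens: list[str]) -> tuple[list[LexEntry], str]:
    votes: Counter = Counter()
    entries = []
    for tok in tokens:
        entry = LINGUISTIC_DB.get(tok.lower())
        if entry:
            entries.append(entry)
            votes[entry.language] += 1
    # Continuous language recognition: majority vote over recognized words.
    language = votes.most_common(1)[0][0] if votes else "unknown"
    return entries, language
```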
Step 3: The cumulative frequencies of all taxonomy-tree nodes referenced by the documents' content give a picture of the thematic emphases of the document collection. Using cluster analysis, a 100-dimensional content space is constructed that best represents the content of the document collection at hand. At the same time, the entropies (the "uncertainties") of the different words and terms are calculated for later use.
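One way to read "entropy" here is the Shannon entropy of a term's frequency distribution across the collection: a term spread evenly over many documents (high entropy) discriminates poorly between topics. A minimal sketch of that calculation follows, with the caveat that InfoCodex's exact weighting is not stated in the text.

```python
# Sketch of step 3's entropy calculation: the Shannon entropy of each
# term's frequency distribution across the document collection. The
# interpretation as plain Shannon entropy is an assumption.
import math
from collections import defaultdict

def term_entropies(docs: list[list[str]]) -> dict[str, float]:
    # counts[term][doc_index] = occurrences of term in that document
    counts: dict[str, dict[int, int]] = defaultdict(lambda: defaultdict(int))
    for d, tokens in enumerate(docs):
        for t in tokens:
            counts[t][d] += 1

    entropies = {}
    for term, per_doc in counts.items():
        total = sum(per_doc.values())
        entropies[term] = -sum(
            (n / total) * math.log2(n / total) for n in per_doc.values()
        )
    return entropies
```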
Step 4: Using the information collected in step 2 together with these entropies, all text documents are projected into the content space created in step 3. Each document is thereby turned into a 100-dimensional vector accompanied by 20 descriptors.
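A sketch of such a projection, under two illustrative assumptions: that the step-3 clustering yields a mapping from terms to content-space dimensions, and that term weights are damped by their entropy; neither formula is InfoCodex's published one.

```python
# Sketch of step 4: project a document into the 100-dimensional content
# space and keep its 20 strongest terms as descriptors. The entropy-damped
# weight and the term->dimension mapping are illustrative assumptions.
from collections import Counter

def project(tokens: list[str],
            term_to_dim: dict[str, int],      # from the step-3 clustering
            entropy: dict[str, float],
            dims: int = 100,
            n_descriptors: int = 20):
    vector = [0.0] * dims
    weights = {}
    for term, freq in Counter(tokens).items():
        if term not in term_to_dim:
            continue
        # High-entropy (unspecific) terms are down-weighted.
        w = freq / (1.0 + entropy.get(term, 0.0))
        weights[term] = w
        vector[term_to_dim[term]] += w

    descriptors = sorted(weights, key=weights.get, reverse=True)[:n_descriptors]
    return vector, descriptors
```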
Step 5: These vectors form the input to a self-organizing neural network (Kohonen map). The model determines the logical arrangement of the documents in an information map, i.e., it orders the documents by thematic aspects and places them on the map accordingly. At the same time, the similarity measure used for comparing document contents is established.
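As an illustration, the sketch below trains a small Kohonen map with the third-party MiniSom library and uses cosine similarity as one possible content-comparison measure; the grid size, training length and choice of similarity are assumptions, not InfoCodex's actual configuration.

```python
# Sketch of step 5: arrange the document vectors on a Kohonen map. MiniSom
# is a stand-in for InfoCodex's own self-organizing network; random data
# stands in for the step-4 document vectors.
import numpy as np
from minisom import MiniSom

vectors = np.random.rand(500, 100)      # stand-in for the step-4 vectors

som = MiniSom(x=10, y=10, input_len=100, sigma=1.5, learning_rate=0.5)
som.random_weights_init(vectors)
som.train_random(vectors, num_iteration=5000)

# Each document lands on the map cell whose weight vector is closest:
# thematically similar documents end up on neighbouring cells.
positions = [som.winner(v) for v in vectors]

# One possible similarity measure for content comparison: cosine similarity.
def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```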
Step 6: Finally, descriptors are generated (labelling), document families (nearly identical documents) are identified, and document abstracts are generated automatically.
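A rough sketch of the last two of these operations, assuming document families are defined by a high cosine-similarity threshold on the step-4 vectors and abstracts are built extractively from descriptor-rich sentences; both criteria are illustrative guesses, as InfoCodex's actual methods are not described here.

```python
# Sketch of step 6: group near-identical documents into families via a
# cosine-similarity threshold, and build a crude extractive abstract from
# the sentences richest in descriptors. Threshold and scoring are illustrative.
import numpy as np

def document_families(vectors: np.ndarray, threshold: float = 0.98):
    families, assigned = [], set()
    norms = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = norms @ norms.T                 # pairwise cosine similarities
    for i in range(len(vectors)):
        if i in assigned:
            continue
        members = [j for j in range(len(vectors))
                   if sims[i, j] >= threshold and j not in assigned]
        assigned.update(members)
        families.append(members)
    return families

def extract_abstract(sentences: list[str], descriptors: list[str], k: int = 3):
    # Score each sentence by how many descriptor words it contains,
    # then keep the k highest-scoring sentences as the abstract.
    desc = {d.lower() for d in descriptors}
    scored = sorted(sentences,
                    key=lambda s: sum(w.lower() in desc for w in s.split()),
                    reverse=True)
    return scored[:k]
```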