Step 1: InfoCodex converts each source document (for example, a PDF file) into a temporary text file using the program pdf2txt. Where possible, metadata are extracted at the same time (author, title, document date, file type, percentage of graphical content, etc.).
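As an illustration of this conversion step, the sketch below uses the open-source pdfminer.six library (whose bundled command-line tool is also called pdf2txt.py). InfoCodex's own converter and metadata extraction are not public, so the function shown here is only a stand-in under that assumption.

```python
# Sketch of step 1: convert a PDF to a temporary text file and pull basic
# metadata. Uses pdfminer.six (which ships the command-line tool pdf2txt.py);
# InfoCodex's actual converter is proprietary, so this is only illustrative.
import tempfile
from pdfminer.high_level import extract_text
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

def convert_pdf(path: str) -> tuple[str, dict]:
    # Extract the plain text and write it to a temporary file.
    text = extract_text(path)
    tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False)
    tmp.write(text)
    tmp.close()

    # Read the document information dictionary (author, title, dates, ...).
    with open(path, "rb") as fh:
        doc = PDFDocument(PDFParser(fh))
        info = doc.info[0] if doc.info else {}
    metadata = dict(info)
    return tmp.name, metadata
```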
Step 2: Sentences in the converted text files are recognized by their delimiters (commas, full stops, semicolons, etc.). All words are identified and looked up in the linguistic database (2.7 million entries in German, English, French, Italian and Spanish). If necessary, a lexical analysis is performed on endings such as plurals ("es", "'s"), conjugations, etc. Simultaneously, InfoCodex performs collocated word recognition (CWR), checking whether several consecutive words form a single term. For example, "European Court of Justice" is recognized as one term rather than as the four separate words "European", "Court", "of" and "Justice".
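A minimal sketch of how CWR can work, assuming a greedy longest-match against a multi-word term lexicon; the tiny lexicon and whitespace tokenization below are illustrative stand-ins for the actual 2.7-million-entry linguistic database.

```python
# Sketch of collocated word recognition (CWR): greedy longest-match of
# consecutive tokens against a multi-word term lexicon. The tiny LEXICON
# here stands in for InfoCodex's linguistic database.
LEXICON = {
    ("european", "court", "of", "justice"),
    ("new", "york"),
}
MAX_TERM_LEN = max(len(t) for t in LEXICON)

def recognize_terms(tokens: list[str]) -> list[str]:
    terms, i = [], 0
    while i < len(tokens):
        # Try the longest possible span first, shrinking until a match;
        # a span of one token always matches (the word itself).
        for span in range(min(MAX_TERM_LEN, len(tokens) - i), 0, -1):
            candidate = tuple(w.lower() for w in tokens[i:i + span])
            if span == 1 or candidate in LEXICON:
                terms.append(" ".join(tokens[i:i + span]))
                i += span
                break
    return terms

print(recognize_terms("The European Court of Justice ruled".split()))
# -> ['The', 'European Court of Justice', 'ruled']
```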
The part of speech (proper name, noun, verb, number, etc.), significance, language, synonym group and link into the ontology tree are retrieved from the linguistic database and stored for later use. In parallel with this analysis, a continuous language recognition is performed.
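The following sketch illustrates what such a lookup, combined with a running language vote, might look like; the entry fields and sample records are assumptions for illustration, not the real database schema.

```python
# Sketch of the per-word lookup and continuous language recognition.
# Each lexicon entry carries part of speech, synonym group, ontology node
# and language; the running vote picks the dominant language so far.
from collections import Counter
from dataclasses import dataclass

@dataclass
class LexEntry:
    pos: str            # part of speech: noun, verb, proper name, ...
    synonym_group: int  # id of the synonym group the word belongs to
    ontology_node: str  # link into the ontology/taxonomy tree
    language: str       # G, E, F, I or S

LINGUISTIC_DB = {
    "court":   LexEntry("noun", 4711, "law/institutions", "E"),
    "gericht": LexEntry("noun", 4711, "law/institutions", "G"),
}

def analyse(tokens: list[str]) -> tuple[list[LexEntry], str]:
    votes: Counter = Counter()
    entries = []
    for tok in tokens:
        entry = LINGUISTIC_DB.get(tok.lower())
        if entry:
            entries.append(entry)
            votes[entry.language] += 1
    # Continuous language recognition: majority vote over recognized words.
    language = votes.most_common(1)[0][0] if votes else "unknown"
    return entries, language
```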
Step 3: The cumulative frequencies of all taxonomy-tree nodes referenced by the documents' content give a picture of the thematic emphases of the document collection. Using cluster analysis, a 100-dimensional content space is constructed that best represents the content of the document collection at hand. At the same time, the entropies (the "uncertainties") of the different words and terms are calculated for later use.
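One way to read "entropy" here is the Shannon entropy of a term's frequency distribution across the collection: a term spread evenly over many documents (high entropy) discriminates poorly between topics. A minimal sketch of that calculation follows, with the caveat that InfoCodex's exact weighting is not stated in the text.

```python
# Sketch of step 3's entropy calculation: the Shannon entropy of each
# term's frequency distribution across the document collection. The
# interpretation as plain Shannon entropy is an assumption.
import math
from collections import defaultdict

def term_entropies(docs: list[list[str]]) -> dict[str, float]:
    # counts[term][doc_index] = occurrences of term in that document
    counts: dict[str, dict[int, int]] = defaultdict(lambda: defaultdict(int))
    for d, tokens in enumerate(docs):
        for t in tokens:
            counts[t][d] += 1

    entropies = {}
    for term, per_doc in counts.items():
        total = sum(per_doc.values())
        entropies[term] = -sum(
            (n / total) * math.log2(n / total) for n in per_doc.values()
        )
    return entropies
```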
Step 4: Using the information collected in step 2 together with these entropies, all text documents are projected into the content space created in step 3. Each document is thereby turned into a 100-dimensional vector accompanied by 20 descriptors.
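A sketch of such a projection, under two illustrative assumptions: that the step-3 clustering yields a mapping from terms to content-space dimensions, and that term weights are damped by their entropy; neither formula is InfoCodex's published one.

```python
# Sketch of step 4: project a document into the 100-dimensional content
# space and keep its 20 strongest terms as descriptors. The entropy-damped
# weight and the term->dimension mapping are illustrative assumptions.
from collections import Counter

def project(tokens: list[str],
            term_to_dim: dict[str, int],      # from the step-3 clustering
            entropy: dict[str, float],
            dims: int = 100,
            n_descriptors: int = 20):
    vector = [0.0] * dims
    weights = {}
    for term, freq in Counter(tokens).items():
        if term not in term_to_dim:
            continue
        # High-entropy (unspecific) terms are down-weighted.
        w = freq / (1.0 + entropy.get(term, 0.0))
        weights[term] = w
        vector[term_to_dim[term]] += w

    descriptors = sorted(weights, key=weights.get, reverse=True)[:n_descriptors]
    return vector, descriptors
```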
Step 5: These vectors form the input to a self-organizing neural network (Kohonen map). The model determines the logical arrangement of the documents in an information map, i.e., it orders the documents by thematic aspects and places them on the map accordingly. At the same time, the similarity measure used for comparing document contents is established.
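As an illustration, the sketch below trains a small Kohonen map with the third-party MiniSom library and uses cosine similarity as one possible content-comparison measure; the grid size, training length and choice of similarity are assumptions, not InfoCodex's actual configuration.

```python
# Sketch of step 5: arrange the document vectors on a Kohonen map. MiniSom
# is a stand-in for InfoCodex's own self-organizing network; random data
# stands in for the step-4 document vectors.
import numpy as np
from minisom import MiniSom

vectors = np.random.rand(500, 100)      # stand-in for the step-4 vectors

som = MiniSom(x=10, y=10, input_len=100, sigma=1.5, learning_rate=0.5)
som.random_weights_init(vectors)
som.train_random(vectors, num_iteration=5000)

# Each document lands on the map cell whose weight vector is closest:
# thematically similar documents end up on neighbouring cells.
positions = [som.winner(v) for v in vectors]

# One possible similarity measure for content comparison: cosine similarity.
def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```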
Step 6: Finally, descriptors are generated (labelling), document families (nearly identical documents) are identified, and document abstracts are generated automatically.
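A rough sketch of the last two of these operations, assuming document families are defined by a high cosine-similarity threshold on the step-4 vectors and abstracts are built extractively from descriptor-rich sentences; both criteria are illustrative guesses, as InfoCodex's actual methods are not described here.

```python
# Sketch of step 6: group near-identical documents into families via a
# cosine-similarity threshold, and build a crude extractive abstract from
# the sentences richest in descriptors. Threshold and scoring are illustrative.
import numpy as np

def document_families(vectors: np.ndarray, threshold: float = 0.98):
    families, assigned = [], set()
    norms = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = norms @ norms.T                 # pairwise cosine similarities
    for i in range(len(vectors)):
        if i in assigned:
            continue
        members = [j for j in range(len(vectors))
                   if sims[i, j] >= threshold and j not in assigned]
        assigned.update(members)
        families.append(members)
    return families

def extract_abstract(sentences: list[str], descriptors: list[str], k: int = 3):
    # Score each sentence by how many descriptor words it contains,
    # then keep the k highest-scoring sentences as the abstract.
    desc = {d.lower() for d in descriptors}
    scored = sorted(sentences,
                    key=lambda s: sum(w.lower() in desc for w in s.split()),
                    reverse=True)
    return scored[:k]
```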