Text mining is the discovery of new, useful information by having
computers read text resources and automatically identify the relevant
patterns in them.
Those pieces of information need to be properly associated in order to
produce new facts and rules about the domain being analyzed.
Text mining is a subset of data mining, and it is a particularly
challenging field because of the peculiarities of human languages
(irregularities, implied statements).
Text mining approaches can be
- statistically based
- rule-based, with a focus on linguistic knowledge
- rule-based, with a focus on semantics and existing ontologies from
  which the system draws conclusions
- or a combination of any of these main approaches
Even statistically based approaches currently rely on some basic
linguistic processing, and the more of it the better. Other approaches
rely more heavily on an initial analysis of linguistic patterns. The
linguistic processing takes place at different levels: it can range
from simple part-of-speech tagging to deeper syntactic analysis at
phrase and sentence level, anaphora resolution (identifying what open
references such as pronouns refer to) and discourse analysis, and on
into the realm of semantics. Semantics is necessary for any non-trivial
attempt at knowledge representation and reasoning.
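To make anaphora resolution more concrete, here is a minimal sketch of
the simplest possible strategy: link each pronoun to the most recent
preceding mention with matching gender features. The feature tables and
the function name resolve_pronouns are made up for this illustration;
real resolvers also weigh syntax, salience and world knowledge.

```python
# Hypothetical feature tables, for illustration only.
MENTION_FEATURES = {"Maria": "fem", "John": "masc", "report": "neut"}
PRONOUN_FEATURES = {
    "she": "fem", "her": "fem",
    "he": "masc", "him": "masc",
    "it": "neut",
}


def resolve_pronouns(tokens):
    """Map the index of each pronoun to its guessed antecedent.

    Strategy: pick the most recent preceding mention whose features
    match the pronoun. Pronouns with no candidate are left unresolved.
    """
    links = {}
    seen = []  # (index, token) for every mention encountered so far
    for i, tok in enumerate(tokens):
        if tok in MENTION_FEATURES:
            seen.append((i, tok))
        elif tok.lower() in PRONOUN_FEATURES:
            wanted = PRONOUN_FEATURES[tok.lower()]
            for _, mention in reversed(seen):
                if MENTION_FEATURES[mention] == wanted:
                    links[i] = mention
                    break
    return links


tokens = "Maria sent John the report because he requested it".split()
print(resolve_pronouns(tokens))  # {6: 'John', 8: 'report'}
```

A heuristic like this fails as soon as two candidates share the same
features, which is one reason deeper syntactic and semantic analysis is
needed.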
Each approach has its own pros and cons. Statistically based methods
are said to require less development time, but they also have
particular limitations. Special attention has to be paid to defining
what the training material for the statistics should be. Without a deep
knowledge of how real texts diverge, one can end up with very biased
statistics no matter how sophisticated the algorithms are. Approaches
that analyze texts on the basis of linguistic and semantic knowledge,
on the other hand, can work very well at resolving specifics, where
statistical and probabilistic methods often fail. The challenge with
these methods is that they also require huge quantities of linguistic
and semantic data that are difficult to obtain. Many companies in
the area of natural language processing and text mining have found
themselves confronted with huge costs and very long development times
to build up a decent database. Companies that want to use linguistic
and semantic knowledge need to develop sophisticated mechanisms to
generate reliable and very complete data while keeping production
costs low.
In any case, text mining requires a pre-processing stage, in which the
software extracts the actual text from documents of different types,
recognizes the real tokens and labels them with some initial
information such as part of speech and base form or stem. The
algorithms have to take into account that punctuation rules differ
from one language to another and that abbreviations are handled
differently. They also have to deal with ambiguity, as many words need
a different part-of-speech tag and a different stem depending on
context.
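The pre-processing steps described above can be sketched roughly as
follows. This is a toy illustration, not a description of any
production system: the abbreviation list, the lexicon, the tag names
and the functions tokenize, stem and tag are all assumptions made for
the example, and the stemmer is deliberately naive.

```python
import re

# Hypothetical abbreviation list; a real system needs one per language.
ABBREVIATIONS = {"dr.", "mr.", "e.g.", "etc."}

# Hypothetical lexicon: each word form maps to its possible tags.
LEXICON = {
    "the": {"DET"},
    "smith": {"PROPN"},
    "books": {"NOUN", "VERB"},  # ambiguous without context
    "flights": {"NOUN"},
    ".": {"PUNCT"},
}


def tokenize(text):
    """Split text into tokens, keeping known abbreviations intact."""
    tokens = []
    for chunk in text.split():
        if chunk.lower() in ABBREVIATIONS:
            tokens.append(chunk)
        else:
            # Separate word characters from punctuation marks.
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens


def stem(token):
    """Very naive suffix stripping, for illustration only."""
    word = token.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def tag(tokens):
    """Return (token, tag, stem) triples, using the previous tag as context."""
    tagged, prev = [], None
    for tok in tokens:
        candidates = LEXICON.get(tok.lower(), {"UNK"})
        if len(candidates) == 1:
            pos = next(iter(candidates))
        elif prev == "DET" and "NOUN" in candidates:
            pos = "NOUN"  # "the books"
        elif prev in {"NOUN", "PROPN"} and "VERB" in candidates:
            pos = "VERB"  # "Smith books"
        else:
            pos = sorted(candidates)[0]  # arbitrary deterministic fallback
        tagged.append((tok, pos, stem(tok)))
        prev = pos
    return tagged


print(tokenize("Dr. Smith books flights."))
# ['Dr.', 'Smith', 'books', 'flights', '.']
print(tag(tokenize("The books arrived."))[1])
# ('books', 'NOUN', 'book')
```

The same word form "books" receives a different tag in "Smith books
flights." than in "The books arrived.", which is the kind of
context-dependent ambiguity the pre-processing stage has to handle.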
After this, text mining software may or may not go deeper into
analyzing the text before the actual recognition and extraction of new
data. At Crossminder we go deep into linguistic, semantic and
statistical analysis because we believe that is the best way of
delivering the most useful information to the user.
The next level is the actual recognition of known items and
then the most important part: the discovery of new information based on
the objects found, the relationships existing between them and their
relationship to the world.
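As a rough sketch of these two steps, the example below first
recognizes known items with a small lookup table and surface patterns,
and then applies one inference rule to derive a fact that is not stated
in the text. The entity table, the patterns, the relation names and the
rule itself are hypothetical and chosen only to keep the example short.

```python
import re

# Hypothetical table of known entities and their types.
KNOWN_ENTITIES = {"Alice": "PERSON", "Acme": "ORG", "Berlin": "PLACE"}

# Hypothetical surface patterns mapped to relation names.
PATTERNS = [
    (re.compile(r"(\w+) works for (\w+)"), "works_for"),
    (re.compile(r"(\w+) is based in (\w+)"), "based_in"),
]


def extract_relations(text):
    """Recognition step: find stated relations between known entities."""
    facts = set()
    for pattern, relation in PATTERNS:
        for subj, obj in pattern.findall(text):
            if subj in KNOWN_ENTITIES and obj in KNOWN_ENTITIES:
                facts.add((subj, relation, obj))
    return facts


def infer(facts):
    """Discovery step, using a single illustrative rule:
    works_for(p, o) and based_in(o, c) suggest may_work_in(p, c).
    """
    inferred = set()
    for person, rel1, org in facts:
        if rel1 != "works_for":
            continue
        for org2, rel2, place in facts:
            if rel2 == "based_in" and org2 == org:
                inferred.add((person, "may_work_in", place))
    return inferred


text = "Alice works for Acme. Acme is based in Berlin."
facts = extract_relations(text)
print(infer(facts))  # {('Alice', 'may_work_in', 'Berlin')}
```

The inferred fact appears nowhere in the input; it follows only from
combining the extracted relations with a rule about the world, and it
is a hypothesis rather than a certainty.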
We at Crossminder have decided to embrace the best of the different
approaches. We reject the belief that there might be one magic
solution based on a single approach, whether statistical, linguistic
or anything else. We work on harmonizing the different results to
arrive at the meaning of the text, identify the objects and their
relationships, and understand the implications of the message in
question. We believe the human brain acts in a similar way: smaller
modules contribute to the recognition of the bigger, more abstract
picture.
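One simple way to picture the harmonizing of results from several
modules is weighted voting. The sketch below is only an illustration of
that general idea and does not describe how Crossminder actually
combines its analyses; the module names, weights and the function
combine are invented for the example.

```python
from collections import defaultdict


def combine(predictions, weights):
    """Pick the label with the highest total weight across modules.

    predictions: {module_name: label}
    weights:     {module_name: weight}; missing modules default to 1.0
    Ties are broken alphabetically so the result is deterministic.
    """
    scores = defaultdict(float)
    for module, label in predictions.items():
        scores[label] += weights.get(module, 1.0)
    return min(scores, key=lambda label: (-scores[label], label))


# Three hypothetical modules disagree on the type of one entity.
predictions = {
    "statistical": "ORG",
    "linguistic": "PERSON",
    "semantic": "PERSON",
}
weights = {"statistical": 1.5, "linguistic": 1.0, "semantic": 1.0}
print(combine(predictions, weights))  # PERSON
```

Here no single module decides the outcome: the statistical module has
the highest individual weight, but the two knowledge-based modules
together outvote it.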
Although some people consider identifying entities in a text to be a
form of text mining, Crossminder considers text mining proper to be
more than mere information extraction. Text mining implies that new
knowledge can be drawn out of the text.
One of the basic problems of text mining is that the real information
one is trying to retrieve is not completely encoded in the text. It
needs to be deduced by drawing on what is in the text and its
connection with the real world.
Optimal results are obtained by using all possible dimensions of text
analysis.
Crossminder builds on its expertise in computational linguistics,
statistics and semantic engineering to go beyond identifying and
retrieving known information and move towards inferring and deducing
new knowledge. Among other things, we enable software that helps in
learning more about unknown people by gathering as much data as
possible about their background.