Crossminder
SOFTWARE BUSINESS SOLUTIONS






Text mining is the discovery of new, useful information by having computers read text resources and automatically identify relevant patterns in them. Those pieces of information need to be properly associated in order to produce new facts and rules about the domain being analyzed. Text mining is a subset of data mining, and it is a particularly challenging field because of the peculiarities of human languages (irregularities, implied statements). Text mining approaches can be

  • statistically based
  • rule-based with a focus on linguistic knowledge
  • rule-based with a focus on semantics and existing ontologies from which the system draws conclusions
  • or a combination of any of these main approaches
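
The first two approaches in the list can be contrasted on a toy example. The sketch below is only an illustration under stated assumptions: the sentences, the company names and the functions `statistical_pairs` and `rule_based_acquisitions` are hypothetical and do not describe any particular product. The statistical function counts which capitalized names co-occur in a sentence; the rule-based function looks for one explicit linguistic pattern.

```python
import re
from collections import Counter

# Hypothetical sample sentences, invented for this illustration.
SENTENCES = [
    "Acme acquired Bolt in 2010.",
    "Acme and Bolt announced a merger.",
    "Acme acquired Core last year.",
]

NAME = re.compile(r"\b[A-Z][a-z]+\b")
ACQUISITION = re.compile(r"\b([A-Z][a-z]+) acquired ([A-Z][a-z]+)\b")


def statistical_pairs(sentences):
    """Statistical view: count how often two names share a sentence."""
    counts = Counter()
    for sentence in sentences:
        names = sorted(set(NAME.findall(sentence)))
        for i, first in enumerate(names):
            for second in names[i + 1:]:
                counts[(first, second)] += 1
    return counts


def rule_based_acquisitions(sentences):
    """Rule-based view: match the explicit pattern '<Name> acquired <Name>'."""
    return [m.groups() for s in sentences for m in ACQUISITION.finditer(s)]


if __name__ == "__main__":
    print(statistical_pairs(SENTENCES).most_common(1))
    print(rule_based_acquisitions(SENTENCES))
```

The counts show that two names are strongly associated without saying how, while the rule names the relation but only where its pattern occurs. That is one reason for combining approaches.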

Even statistically based approaches currently rely on some basic linguistic processing, and the more the better. Other approaches rely more on an initial analysis of linguistic patterns. Linguistic processing takes place at different levels: it can range from simple part-of-speech tagging to deeper syntactic analysis at phrase and sentence level, anaphora resolution (identifying what open references such as pronouns refer to) and discourse analysis, and on into the realm of semantics. Semantics is relevant not only to linguistics and computational linguistics, but also to knowledge representation and semantic engineering (ontology management, reasoning over semantic databases) and to other areas of computer science and beyond.

Each approach has its own pros and cons. Statistically based methods are said to require less development time, but they also run into particular limitations. Special attention has to be paid to defining what the training material for the statistics should be. Without a deep knowledge of how real texts diverge, one can end up with very biased statistics, no matter how sophisticated the algorithms are. Approaches that analyze texts on the basis of linguistic and semantic knowledge, on the other hand, can work very well at resolving specifics, where statistical and probabilistic methods often fail. The challenge with these methods is that they also require huge quantities of linguistic and semantic data that are difficult to obtain. Many companies in the area of natural language processing and text mining have found themselves confronted with huge costs and very long development times to build up decent databases. Companies that do want to use linguistic and semantic knowledge need to develop sophisticated mechanisms to generate reliable and very complete data while keeping production costs low.

Text mining always requires a pre-processing stage, in which the software extracts the actual text from documents of different types, recognizes the real tokens and labels them with initial information such as part of speech and base or stem form. The algorithms have to take into account that punctuation rules differ from one language to another and that abbreviations are handled differently. They also have to deal with ambiguity, since many words need a different part-of-speech tag and a different stem depending on context.
After this, text mining software may or may not analyze the text more deeply before the actual recognition and extraction of new data. At Crossminder we go deep into linguistic, semantic and statistical analysis because we believe that is the best way to deliver the most useful information to the user.
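
A minimal sketch of the pre-processing stage described above, assuming tiny hand-made resources. The abbreviation lists, the determiner list, the `LEMMAS` table and the function names are all hypothetical; a real system would need far larger, language-specific data and a proper tagger. The sketch shows two of the points made: abbreviations depend on the language, and the same word form can need a different tag and base form depending on context.

```python
import re

# Hypothetical toy resources, invented for this illustration.
ABBREVIATIONS = {"en": {"dr.", "e.g.", "etc."}, "de": {"dr.", "z.b.", "usw."}}
DETERMINERS = {"the", "a", "an", "these", "those"}
LEMMAS = {("leaves", "NOUN"): "leaf", ("leaves", "VERB"): "leave"}

TRAILING_PUNCT = re.compile(r"^(.*?)([.,;:!?]*)$")


def tokenize(text, lang="en"):
    """Split on whitespace and detach trailing punctuation,
    except on tokens listed as abbreviations for the given language."""
    tokens = []
    for raw in text.split():
        if raw.lower() in ABBREVIATIONS.get(lang, set()):
            tokens.append(raw)
            continue
        word, punct = TRAILING_PUNCT.match(raw).groups()
        if word:
            tokens.append(word)
        tokens.extend(punct)  # one token per punctuation character
    return tokens


def tag_and_lemmatize(tokens):
    """Toy context rule: an ambiguous word is a noun right after a
    determiner and a verb otherwise; the base form follows the tag."""
    ambiguous = {word for word, _ in LEMMAS}
    result = []
    for i, token in enumerate(tokens):
        low = token.lower()
        previous = tokens[i - 1].lower() if i > 0 else ""
        if low in DETERMINERS:
            pos = "DET"
        elif low in ambiguous:
            pos = "NOUN" if previous in DETERMINERS else "VERB"
        else:
            pos = "UNK"
        result.append((token, pos, LEMMAS.get((low, pos), low)))
    return result


if __name__ == "__main__":
    print(tokenize("Dr. Smith leaves."))
    print(tag_and_lemmatize(tokenize("The leaves fall.")))
```

With these toy tables, "leaves" is reduced to "leaf" after a determiner and to "leave" elsewhere, which is the kind of context-dependent decision the text refers to.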

The next level is the actual recognition of known items, followed by the most important part: the discovery of new information based on the objects found, the relationships between them and their relationship to the world.
 
We at Crossminder have decided to embrace the best of different approaches. We reject the belief that there might be one magic solution based on a single approach, whether statistical, linguistic or something else. We work on harmonizing the different results to arrive at the meaning of the text, identifying the objects, their relationships and the implications of the message in question. We believe the human brain acts in a similar way: smaller modules contribute to the recognition of the bigger, more abstract picture.
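
One simple way to picture the harmonizing of results from several modules is weighted voting. The sketch below is an assumption made for illustration, not a description of Crossminder's method: the analyzer names, labels, confidences and the function `combine` are all hypothetical.

```python
from collections import defaultdict


def combine(predictions, weights=None):
    """Merge labels proposed by several analyzers.

    predictions: {analyzer_name: (label, confidence)}
    weights:     {analyzer_name: weight}, defaulting to 1.0 per analyzer
    Returns the winning label and its share of the total weighted score.
    """
    weights = weights or {}
    scores = defaultdict(float)
    for name, (label, confidence) in predictions.items():
        scores[label] += weights.get(name, 1.0) * confidence
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("no positive evidence to combine")
    best = max(scores, key=scores.get)
    return best, scores[best] / total


if __name__ == "__main__":
    # Hypothetical outputs of three modules for the same token.
    votes = {
        "statistical": ("ORGANIZATION", 0.6),
        "linguistic": ("PERSON", 0.9),
        "semantic": ("PERSON", 0.7),
    }
    print(combine(votes))
```

In this toy setting no single module decides alone; two moderately confident modules can outweigh one that disagrees, and the weights can express how much each module is trusted.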

Although some people consider identifying entities in a text to be a form of text mining, Crossminder considers text mining proper to be more than mere information extraction. Text mining implies that new knowledge can be drawn from the text. One of the basic problems of text mining is that the information one is trying to retrieve is not completely encoded in the text. It needs to be deduced by drawing on what is in the text and on its connection with the real world. The best results are obtained by using all available dimensions of text analysis.
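
The difference between extraction and deduction can be sketched with a naive forward-chaining loop over subject-predicate-object triples. Everything here is a hypothetical illustration: the facts, the predicate names and the single rule are invented, and a real system would use a proper knowledge base and reasoner. One fact is assumed to come from the text, one from background knowledge about the world, and the new fact appears in neither.

```python
def likely_workplace(facts):
    """Hypothetical rule: if X works_for Y and Y is based_in Z,
    then X likely_works_in Z."""
    return {
        (person, "likely_works_in", place)
        for (person, p1, employer) in facts if p1 == "works_for"
        for (company, p2, place) in facts
        if p2 == "based_in" and company == employer
    }


def infer(facts, rules):
    """Naive forward chaining: apply every rule until nothing new appears."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            # Each rule sees a frozen snapshot, so adding facts is safe.
            for new_fact in rule(frozenset(known)):
                if new_fact not in known:
                    known.add(new_fact)
                    changed = True
    return known


if __name__ == "__main__":
    text_facts = {("Anna", "works_for", "Acme")}    # assumed extracted from a text
    world_facts = {("Acme", "based_in", "Ghent")}   # assumed background knowledge
    derived = infer(text_facts | world_facts, [likely_workplace])
    print(derived - text_facts - world_facts)
```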

Crossminder builds on its expertise in computational linguistics, statistics and semantic engineering to go beyond identifying and retrieving known information and move towards inferring and deducing new knowledge. Among other things, we enable software that helps users learn more about unknown people by gathering as much data as possible about their background.



[Diagram: linguistic knowledge, probabilistic knowledge, knowledge base, logic]


Our technology also makes it possible to carry out cross-lingual searches, online or offline, and to obtain translations of the new or old items identified in the data set you want to analyze. You might know a given foreign language, but being able to do research with queries in your own language AND the foreign one, and to use semantic approximations to find "the needle out there", gives you a huge advantage.
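
A very reduced picture of cross-lingual search is query expansion through a bilingual dictionary. The sketch below is an assumption for illustration only: the English-to-Dutch word list, the documents and the functions `expand_query` and `search` are hypothetical, and the semantic approximations mentioned above are not modelled.

```python
import re

# Hypothetical toy dictionary (English -> Dutch), invented for this illustration.
TRANSLATIONS = {"needle": ["naald"], "haystack": ["hooiberg"]}


def expand_query(terms, translations):
    """For each query term, build the set of acceptable forms:
    the term itself plus its known translations."""
    groups = []
    for term in terms:
        low = term.lower()
        groups.append({low} | {t.lower() for t in translations.get(low, [])})
    return groups


def search(documents, terms, translations):
    """Return the ids of documents that contain, for every query term,
    either the term or one of its translations."""
    groups = expand_query(terms, translations)
    hits = []
    for doc_id, text in documents.items():
        words = set(re.findall(r"\w+", text.lower()))
        if all(group & words for group in groups):
            hits.append(doc_id)
    return hits


if __name__ == "__main__":
    docs = {
        "en1": "A needle in a haystack",
        "nl1": "Een naald in een hooiberg",
        "nl2": "Een hooiberg zonder iets",
    }
    print(search(docs, ["needle", "haystack"], TRANSLATIONS))
```

A single query in the user's own language then reaches documents in both languages, which is the advantage the paragraph above describes.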





Questions or comments about the web site? Contact us at webmaster@crossminder.com.
Crossminder BVBA © 2011
