Abstract:
|
The main goal of this doctoral thesis is to investigate the techniques and models
applied in information extraction and to provide computational support for
processing natural language texts from the culinary and gastronomy domain.
Information extraction is a subfield of computational linguistics that comprises
natural language processing techniques for finding relevant pieces of information,
determining their meaning, and establishing relations between them. Special
attention is given to ontology-based information extraction. It consists of the
following: recognition of instances of ontology concepts in unstructured or
semi-structured texts written in natural language, reasoning over the identified
instances based on the rules defined in the ontology, and use of the recognized
instances to instantiate the appropriate ontology concepts.
The main result of the thesis is a new model for ontology-based information
extraction. Besides solving information extraction tasks, the new model covers both
the upgrading of existing lexical resources and ontologies and the creation of new
ones. Its application resulted in the development of a system for extracting
information related to the culinary domain, but the model can be used in other
fields as well. In addition, a food ontology has been developed, the Serbian WordNet
has been extended with 1,404 synsets from the culinary domain, and the electronic
dictionary of Serbian has been enlarged with 1,248 entries. The significance of the
model's application lies in the fact that the new and enriched linguistic resources
can be used in other natural language processing systems.
The opening chapter of the thesis elaborates the need for a computational model for
processing a large linguistic corpus from the culinary and gastronomy domain,
through a methodologically precise and sound approach that integrates pieces of
information about the domain. The formalization of the basic research subject, text
in electronic form, is also presented. The chapter then describes the approximations
of natural language introduced to enable modern information technologies to process
texts written in natural languages, and it emphasizes the need to characterize the
language of a text by its corresponding corpus and sublanguage.
The first chapter then defines the task of information extraction and the models
for computational processing of unstructured or semi-structured texts, which a
computer uses to interpret the meaning that the author (not necessarily a human)
intended to convey when writing the text. The chapter also describes the methods
used in the field of information extraction: rule-based methods and methods based on
machine learning. Their advantages and shortcomings are listed, together with the
reasons why techniques based on linguistic knowledge are used in this thesis. The
introductory chapter concludes with special attention to ontologies, WordNet, and
the significance of using WordNet as an ontology.
The second chapter presents the linguistic resources and tools used in this thesis.
It describes the morphological dictionaries and local grammars used for solving the
problem of information extraction from texts written in Serbian. A review of
information extraction systems follows. The chapter ends with a description of the
stages in processing texts written in Serbian during information extraction in the
software systems Unitex and GATE.
The main result of the thesis is presented in the third chapter. It is a model for
solving the problem of information extraction by integrating linguistic resources
and tools. The model comprises the creation of a text corpus, the definition of
information extraction tasks, the construction and application of finite-state
models for information extraction, the iterative enlargement of electronic
morphological dictionaries, the enrichment and enhancement of WordNet, and the
creation of new ontologies. Each of these steps is described thoroughly. Although
the model was initially conceived as a solution for problems in processing Serbian,
it can equally be applied to texts written in other languages, provided that
suitable language resources are developed.
The implementation of the steps explained above is described in the fourth chapter,
through a system for information extraction from culinary texts written in Serbian.
The chapter then describes how the lexical resources are developed together and
complement one another through the steps of creating a domain corpus, identifying
the culinary lexicon, expanding and upgrading WordNet and the electronic
morphological dictionaries, and developing domain ontologies: the food ontology, the
approximate measure ontology, and the ontology of ingredients that can replace one
another in the culinary domain. The information extraction system served as the
basis for an advanced search system which, based on a corpus of culinary texts,
generates all possible answers to user queries. Within this system, a specific
method is implemented for creating links between different recipes. It is used when
a user reading the text of a recipe notices that the preparation description
contains a part that has already appeared in another recipe, but with an additional
or different explanation. Another contribution of this thesis is the application of
the developed ontologies to the tasks of converting approximate measures into
standard measures and establishing similarities among recipes. The similarity of
recipes is defined as the similarity of the texts that describe the process of
preparing a dish according to a specific recipe.
The last chapter contains final conclusions and directions for future research. |