Abstract:
|
Proteins with intrinsically disordered regions are involved in a large number of key
cellular processes, including signaling, transcription, and chromatin remodeling. On the
other hand, such proteins have been observed in people suffering from neurological and
cardiovascular diseases, as well as various malignancies.
Experimentally determining disordered regions in proteins is a very expensive and
time-consuming process. As a consequence, various computer programs for predicting the
positions of disordered regions in proteins have been developed and are constantly being
improved.
In this thesis, a new method for determining amino acid sequences that characterize
ordered/disordered regions is presented. The material used in the research includes
4076 viruses with more than 190000 proteins. The proposed method is based on defining a
correspondence between the characteristics of n-grams (including both repeats and
palindromic sequences) and their membership in ordered/disordered protein regions. The
positions of ordered/disordered regions are predicted using three different predictors.
The features of the repetitive strings used in the research include mole fractions,
fractional differences, and z-values. In addition, the data mining techniques of
association rules and classification were applied to both repeats and palindromes. The
results obtained by all techniques show a high level of agreement for short sequences
(length less than 6), and the level of agreement rises to its maximum as the length of
the sequences increases.
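
The n-gram features named above can be illustrated with a minimal sketch. It is not the
implementation used in the thesis: the function names are hypothetical, a palindrome is
assumed to be a string that reads the same in both directions, the mole fraction of an
n-gram is assumed to be its count divided by the total number of n-grams of that length,
and the fractional difference is assumed to be (f_disordered - f_ordered) / f_ordered.
The z-value is omitted because its definition is not given here.

```python
from collections import Counter


def ngram_counts(segments, n):
    """Count all overlapping n-grams of length n in a list of sequence segments."""
    counts = Counter()
    for seg in segments:
        for i in range(len(seg) - n + 1):
            counts[seg[i:i + n]] += 1
    return counts


def mole_fractions(counts):
    """Assumed definition: n-gram count divided by the total n-gram count."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: c / total for gram, c in counts.items()}


def fractional_differences(f_disordered, f_ordered):
    """Assumed definition: (f_dis - f_ord) / f_ord, for n-grams seen in both sets."""
    return {
        gram: (f_disordered[gram] - f_ordered[gram]) / f_ordered[gram]
        for gram in f_disordered
        if gram in f_ordered
    }


def is_palindrome(gram):
    """Assumed definition: the n-gram equals its own reversal."""
    return gram == gram[::-1]


# Toy segments (made up for illustration, not data from the thesis).
ordered = ["AAK", "KAA"]
disordered = ["SSA", "ASS"]

f_ord = mole_fractions(ngram_counts(ordered, 1))
f_dis = mole_fractions(ngram_counts(disordered, 1))
fd = fractional_differences(f_dis, f_ord)
```

Under these assumptions, a negative fractional difference marks an n-gram that is
relatively depleted in the disordered segments, and a positive one marks an n-gram that
is enriched there.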
The high reliability of the results obtained by the data mining techniques shows that
there are n-grams, both repeating sequences and palindromes, that uniquely characterize
the disordered/ordered regions of proteins. The obtained results were verified by
comparison with results based on n-grams from the DisProt database, which contains the
positions of experimentally verified disordered regions of proteins. The results can be
used both for the fast localization of disordered/ordered regions in proteins and for
further improving existing programs for their prediction. |