1. Introduction
2. The symptom analysis
2.1. Symptom and super symptom
2.2. The concept of canonical correlation and its definition in the aspect of symptom analysis
2.3. Canonical symptom analysis
2.3.1. The difference between canonical symptom analysis and some other multivariate statistical methods
2.3.2. Selection algorithm and interpretation of results
2.3.3. Data description
2.3.4. Canonical symptom analysis and biometrical example
2.4. Classification methods based on symptom analysis
2.4.1. Classifier
2.4.2. Prediction errors
2.4.3. Least squares classifier
2.5. Methods
2.6. Artificial Neural Network
2.6.1. Models of artificial neural networks and methods of training and programming
2.6.2. Neural network algorithm related to back propagation
2.7. Discriminant Analysis
2.8. Random Forest
2.9. Application to Real Data
2.10. Experimentation
2.10.1. Artificial Neural Network Model
2.10.2. Discriminant Analysis Model
2.10.3. Random Forest Model
2.11. Experiment and Results
2.11.1. Comparing classifier model based on symptom analysis with other classifier learning systems (Full Training Mode)
2.12. Iterative Algorithm for Constructing Super Symptoms
2.13. Iterative procedure
2.14. Iterative procedure for selecting the most informative symptoms
2.15. Computational details of the iteration method (algorithm)
2.16. Applying the iterative method
2.17. The symptom analysis by using iterative method based on uncertainty coefficient
2.17.1. Super symptoms of the second order syndrome
2.17.2. Iterative procedure algorithm
2.18. Majorized symptoms and syndromes
2.18.1. Strong majorization
2.18.2. Examples
2.18.3. Measures for performance evaluation
3. Cluster Method for Reducing Labor Intensity in Symptom Syndrome Analysis of Multidimensional Binary Data
3.1. Syndrome definition
3.2. Clustering Ek points by syndrome
3.3. Restoring symptoms from a set of prefigurations
3.4. Vector α of supersymptom terms
3.5. Vector β of supersymptom values
3.6. Linear relation between α and β
3.7. An example of finding the most significant syndrome
4. About approximation of a prognosis by partial predictions in conditions of incomplete data
4.1. Statistical Model of the Dependence Structure of Partial Predictions
4.2. Mathematical expectation of a special determinant with random components
4.3. Estimation of the full forecast from partial predictions
4.4. An example of estimation of the full forecast from partial predictions based on incomplete breast cancer data
4.5. Flowchart and Algorithm
4.6. Algorithm
4.7. Simulation process
4.8. Experimental results
4.9. Example 1: numerical simulation of random correlation matrix A_n by using the algorithm when N = 1, n = 3, h = 0.2, r = 0.6
4.10. Example 2: numerical simulation of random correlation matrix A_n by using the algorithm when N = 500, n = 5, h = 0.2, r = 0.6
4.11. Estimates of the parameters of the regression model with variable coefficients
4.11.1. Example
4.12. An example of forecasting from incomplete data
5. A New Method of Poisson Regression Estimator in The Presence of a Multicollinearity Problem: Simulation and Application
5.1. Properties of Poisson Distribution
5.2. Assumption of Poisson Regression Model
5.3. Multicollinearity Problem
5.3.1. Effects of Multicollinearity Problem
5.3.2. Detecting of Multicollinearity Problem
5.4. Estimation of The Parameters for Poisson Regression Model
5.4.1. Maximum Likelihood Estimators Method
5.5. Estimate The Parameters for Poisson Regression Model With Multicollinearity Problem
5.5.1. Ridge Regression Estimators Method
5.5.2. Ridge Parameter Estimators
5.5.3. Liu Estimators Method [37]
5.5.4. Liu Biased Parameter Estimators
5.6. Proposed Method
5.6.1. Proposed Method
5.6.2. Proposed Estimators for The Bias Parameter
5.7. The Monte Carlo Simulation
5.7.1. The Design of The Experiment
5.7.2. Simulation Results
5.7.3. Results Discussion
5.7.4. Interpretation of the results according to the change in the sample size n
5.7.5. Interpretation of the results according to the change in the value of the correlation matrix ρ
5.7.6. Interpretation of the results according to the change in the number of independent variables p
5.8. Application
5.8.1. Test Data for The Dependent Variable Y
5.8.2. Multicollinearity Problem and Application Data
5.8.3. Application Results
6. Conclusion
Motivation. The primary motivation of this thesis is the development of classification and clustering methods for multidimensional incomplete data, with applications in oncology. Accurate knowledge in oncology is inferred from patient observation. The precision and sophistication of the observational methods developed by many generations of biologists have made it possible to analyze the accumulated information mathematically. Methods based on mathematical modeling have contributed greatly to various biological problems, such as modeling the development of cancer tumors [2] and modeling the human circulatory system and other systems of the human organism [1].
This is an area where techniques for exploring incomplete multidimensional data can have a large impact on scientific outcomes. Separate factors are sometimes not enough to describe a risk group. If many factors are taken into consideration, the major problem becomes dimensionality reduction, that is, the search for functions of the factors that lose as little information as possible. Therefore, the statistical task of comparing a single dependent variable with a complex set of several independent dichotomous variables remains relevant, especially when the effect of each factor on the dependent variable is studied separately and none of the individual relationships is significant.
The linear combinations of dichotomous variables over the field F2, which are called symptoms, form a projective space from which it is possible to select the most informative subspaces in order to reduce the dimensionality of binary data. We describe the structure of the relationship between two sets of categorical variables by analogy with canonical correlation analysis, on the basis of symptom analysis.
A novel approach is proposed for classification techniques such as the Artificial Neural Network (ANN), Linear Discriminant Analysis (LDA) and Random Forest (RF), based on factor (dichotomous) variables. We search for the most informative finite linear combinations (symptoms) of variables over the finite field on the basis of Fisher’s exact test and uncertainty coefficients, and accurately predict the target class for each case in the data [15]. A super symptom is a linear combination of various products of k dichotomous variables, without repetition, over a field of characteristic 2. In algebra, such functions are called Zhegalkin polynomials or algebraic normal forms. This procedure necessarily yields a new variable of the same nature (e.g. a factor).
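As an illustration of the search for informative symptoms, the following minimal Python sketch forms symptoms as sums over F2 of selected dichotomous columns and ranks them by the uncertainty coefficient with respect to a class variable. It is not the thesis implementation: the function names, the exhaustive search over column subsets and the toy data are assumptions made for this example, and Fisher's exact test is left out for brevity.

```python
from collections import Counter
from itertools import combinations
from math import log


def symptom(rows, idx):
    """Sum over F2 (XOR) of the columns listed in idx, for every row."""
    return [sum(r[i] for i in idx) % 2 for r in rows]


def entropy(values):
    """Shannon entropy (natural log) of a list of categorical values."""
    n = len(values)
    return -sum(c / n * log(c / n) for c in Counter(values).values())


def uncertainty_coefficient(x, y):
    """U(Y|X) = (H(Y) - H(Y|X)) / H(Y): share of the entropy of y explained by x."""
    h_y = entropy(y)
    if h_y == 0:
        return 0.0
    n = len(x)
    h_y_given_x = 0.0
    for value, count in Counter(x).items():
        subset = [yi for xi, yi in zip(x, y) if xi == value]
        h_y_given_x += count / n * entropy(subset)
    return (h_y - h_y_given_x) / h_y


def best_symptom(rows, y, max_size=None):
    """Exhaustive search (an assumption of this sketch) for the best column subset."""
    k = len(rows[0])
    max_size = max_size or k
    candidates = (idx for m in range(1, max_size + 1)
                  for idx in combinations(range(k), m))
    return max(candidates,
               key=lambda idx: uncertainty_coefficient(symptom(rows, idx), y))


# Hypothetical toy data: the class equals x0 + x1 over F2, x2 is irrelevant.
rows = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
y = [a ^ b for a, b, c in rows]
best = best_symptom(rows, y)
print(best, uncertainty_coefficient(symptom(rows, best), y))
```

On this toy data no single variable carries information about the class, while the symptom built from the first two columns determines it completely, which is the situation the symptom search is designed to detect.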
In 2013, Filippo Amato studied artificial neural networks in medical diagnosis, discussing the problem of simplifying the diagnostic process in the daily routine and avoiding misdiagnosis. Artificial intelligence methods for computer-assisted diagnosis, in particular artificial neural networks and adaptive learning algorithms, can handle different types of medical data and integrate them into categorized outputs, and the work gives examples of how the capabilities of artificial neural networks can be used in medical diagnosis. The conclusion was that artificial neural networks (ANN) are able to process large amounts of disease data, increasing diagnostic accuracy and patient satisfaction [54].
In 2017, Jasmina D. Novakovic studied the solution of classification problems using radial basis function (RBF) networks and filter methods, discussing how to avoid diagnostic errors by using machine learning to diagnose tumors, heart disease, hepatitis and some other medical problems. The aim was to present and compare different algorithmic approaches to constructing a system that learns from experience, makes decisions and predictions, and reduces the expected number or percentage of errors. The conclusion was that dimension reduction techniques, such as clustering methods and feature extraction, should be used, and that the analysis and comparison of these techniques can also improve the performance of classifier learning algorithms [55].
In 2012, different classification techniques were compared for accuracy on three different breast cancer datasets, using confusion matrices based on 10-fold cross-validation [56].
In 2012, it was noted that machine learning algorithms are effective because they search for a model function that explains and differentiates the classes or concepts in the data; the model is determined by analyzing training data, that is, objects whose class label is already known. The learning algorithms considered are Linear Discriminant Analysis, Support Vector Machine, Logistic Regression, Naive Bayes, Neural Network, Decision Tree and K-Nearest Neighbor [57].
In 2020, N. Alexeyeva presented an application example of symptom analysis of multidimensional categorical data [58].
As a result of the tremendous developments that the world has witnessed in various fields, especially in information technology, it has become logical and even necessary to use information systems approaches in medicine. The use of ANN in medicine is the best evidence of the introduction of information technology into health services. After training, these networks can take a pathological sign as input and retrieve the complete stored pattern that represents the best diagnosis of the pathological case [59].
Neural networks have been widely used for breast cancer diagnosis [60], [61]. Feed-forward neural networks (FFNN) are commonly used for classification and are trained with the standard back propagation algorithm [62].
In this work, the performance of the classifiers is evaluated using a breast cancer data set for training. The three algorithms, each of which has been studied extensively, are applied on the basis of symptom analysis, and we focus on which of them gives the best results.
The application of linear statistical methods or the analysis of individual factors is not sufficient to choose or describe a risk group. To solve these problems we can use symptom-syndrome analysis [3]. The method consists of constructing new factors in the form of linear combinations over the field F2, which form a finite projective space [4].
In many cases it is useful to create new super symptoms for each observation or experimental unit so that the units can be compared more easily. These super symptoms are functions of all the original variables selected in the iterative procedure. The functional method [4], [6], [9]-[11] was intended to detect the most significant combinations of factors only in the form of their linear combinations over the fields F2 and F3 and was used to solve a number of biometric problems [17]-[22].
The method of super symptom analysis was used together with an iterative procedure to find the ideal syndrome in a large set of variables.
The practical value of the results obtained in the work.
In chapter 1
The symptom-syndrome method is proposed for detecting the most informative variables over the finite field (symptoms) on the basis of Fisher’s exact test and uncertainty coefficients, and for accurately predicting the target class for each case in the data. Entropy is also used to determine the strength of association between categorical variables based on super symptoms, and the approach is applied to a breast cancer data set.
We have proposed a novel technique for categorical classifier learning. The proposed algorithm is tested on a real-life problem, the diagnosis of cancer (medical diagnosis). Our proposal is based on using symptom analysis to learn a classifier such as ANN, LDA or RF. Comparing the classifier models based on symptom analysis with other classifier learning systems shows that the best result was obtained with the proposed method.
The emphasis here is on the role of classifier learning based on symptom analysis in cancer management and prognosis. Further work is needed to increase the accuracy of classification in breast cancer diagnosis.
As a result of the work, a symptom-syndrome analysis was performed on data on breast cancer in women, using the proposed iterative procedure to construct the ideal syndrome from a large set of variables. The convergence analysis of this procedure showed that the steady state can be reached not only through a decrease in the dimension of the syndromes but also through their majorization.
In chapter 2
It is proved that Dβ = α is obtained from Dα = β, from which the super symptom can be determined. If the number of clusters is more than two, then super syndrome analysis must be applied. When observations are classified by dichotomous features, the set of observations can obviously be divided into subsets according to the values of the corresponding multidimensional dichotomous space, and within each of these subsets the frequencies of the class variable can be estimated.
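The pair of linear relations between the vector of supersymptom terms and the vector of supersymptom values can be illustrated with a short Python sketch. It rests on an assumption that is not stated in this summary: that D is the matrix of the binary Möbius (Zhegalkin) transform, which maps the coefficients of an algebraic normal form to its truth table and is its own inverse over F2. The function name and the bitmask indexing are choices made for this example only.

```python
def mobius_transform(v):
    """Binary Möbius transform over F2 of a vector whose length is a power of two.

    Index i is read as a bitmask of variables: applied to ANF coefficients it
    returns the truth table, and applied to the truth table it returns the
    coefficients again, because the transform is an involution over F2.
    """
    v = list(v)
    n = len(v)
    step = 1
    while step < n:
        for i in range(n):
            if i & step:
                v[i] ^= v[i ^ step]  # add (XOR) the entry with this bit cleared
        step <<= 1
    return v


# Hypothetical example with two variables, indices 0..3 = {1, x1, x2, x1*x2}.
alpha = [0, 1, 0, 1]             # the polynomial x1 + x1*x2
beta = mobius_transform(alpha)   # its values at (x1, x2) = 00, 10, 01, 11
print(beta)
print(mobius_transform(beta) == alpha)
```

Under this assumption the same routine serves in both directions, so the terms of a supersymptom can be restored from its values on the clusters without inverting a matrix explicitly.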
After the clusters are selected, the problem of describing them arises. One of the main techniques employed to describe a cluster as a subset of the multidimensional space over F2 is symptom analysis, which makes it possible to select a complex factor for prediction. Thus, the class probability is estimated in the context of diagnosis based on high-dimensional data.
We find the maximum and minimum number of clusters based on the syndrome order m0.
Further work is needed to identify risk groups that are described by a logical combination of factors. It should be noted that the results obtained are of great practical importance for further medical research in the field of breast cancer (medical diagnosis).
In chapter 3
Thus, recurrent and explicit expressions for algebraic complements of a matrix with random off-diagonal elements are obtained. This made it possible to introduce multiplicative corrections that improve forecasting by averaging partial predictions. The simplest expression has estimate fa, which is the product of the average partial prediction by a coefficient depending on the degree of connectivity of the partial predictions and their number. More complex estimates in the form of linear combinations of partial predictions with unequal weights are objectively more variable. This approach makes it possible to solve a regression problem with a variable number of predictors and can be used to correct any meta-estimates that use averaging of partial solutions. In the future, it is planned to conduct a more detailed study of the effectiveness of the estimates obtained. In addition, it is planned to adapt the method for estimating a posteriori probabilities from incomplete data in a linear classification problem.
In practical terms, the advantage of the method is that it is possible to cover a larger number of observations without artificially filling gaps. This is especially important when working with biomedical data of the so-called observational type, in which, for objective reasons, it is impossible to achieve data completeness, and it is very difficult to convince a doctor who analyzes the medical records of his patients that missing data can be imputed for a patient, no matter how good the statistical properties of the imputed values are.
In chapter 4
The ridge regression and Liu estimators have each been applied to the Poisson regression model to deal with multicollinearity. In this study, we developed a new estimator, established its statistical properties and carried out theoretical comparisons with the estimators mentioned above.
An increase in the sample size or in the number of independent variables poses no obstacle to the efficiency of the proposed method in estimating the parameters of the Poisson regression model, while these factors affect the efficiency of some of the previous estimation methods. The proposed estimation method, across all bias parameters and especially H1, represents the optimal solution when the value of the correlation coefficient between the independent variables increases.
Furthermore, a relative efficiency of less than 1 (or a relative efficiency ratio close to zero) under the effect of n, p and ρ indicates that the competing estimator is not as efficient as the proposed estimator β_sug, which estimates the parameter value with a smaller mean square error.
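For orientation, the classical ridge estimator for the Poisson regression model, one of the baselines mentioned above, can be sketched in a few lines of dependency-free Python. This is not the estimator proposed in the thesis: it is the standard form (X'WX + kI)^(-1) X'WX β_ML with W = diag(μ), fitted here without an intercept on hypothetical, nearly collinear toy data. The helper names, the fixed number of IRLS iterations and the value k = 1 are assumptions of this sketch.

```python
from math import exp, sqrt


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def weighted_gram(X, w):
    """Return X' W X for a diagonal weight vector w."""
    p = len(X[0])
    return [[sum(wi * r[j] * r[k] for r, wi in zip(X, w)) for k in range(p)]
            for j in range(p)]


def fitted_means(X, beta):
    """mu_i = exp(x_i' beta) under the log link."""
    return [exp(sum(b * v for b, v in zip(beta, r))) for r in X]


def poisson_mle(X, y, iters=25):
    """Maximum likelihood fit by iteratively reweighted least squares."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        mu = fitted_means(X, beta)
        eta = [sum(b * v for b, v in zip(beta, r)) for r in X]
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]  # working response
        rhs = [sum(m * r[j] * zi for r, m, zi in zip(X, mu, z)) for j in range(p)]
        beta = solve(weighted_gram(X, mu), rhs)
    return beta


def poisson_ridge(X, y, k):
    """Ridge estimator (X'WX + kI)^(-1) X'WX beta_ML with W = diag(mu)."""
    beta_ml = poisson_mle(X, y)
    p = len(beta_ml)
    A = weighted_gram(X, fitted_means(X, beta_ml))
    rhs = [sum(A[j][c] * beta_ml[c] for c in range(p)) for j in range(p)]
    A_k = [[A[j][c] + (k if j == c else 0.0) for c in range(p)] for j in range(p)]
    return solve(A_k, rhs)


def norm(v):
    return sqrt(sum(c * c for c in v))


# Hypothetical toy data with two highly correlated predictors.
x1 = [-1.0, -0.7, -0.4, -0.1, 0.2, 0.5, 0.8, 1.0]
x2 = [-0.9, -0.8, -0.3, -0.2, 0.3, 0.4, 0.9, 0.9]
X = [list(r) for r in zip(x1, x2)]
y = [1, 0, 1, 2, 2, 3, 4, 6]
beta_ml = poisson_mle(X, y)
beta_ridge = poisson_ridge(X, y, k=1.0)
print(beta_ml, beta_ridge)
```

With k = 0 the ridge estimator coincides with the maximum likelihood estimator, and for any k > 0 it is shorter in Euclidean norm, which is the shrinkage that trades a small bias for a reduction in variance under multicollinearity.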
[1] Muller LO, Toro EF. A global multiscale mathematical model for the human circulation with emphasis on the venous system. Int J Numer Method Biomed Eng. 2014 Jul;30(7):681-725. doi: 10.1002/cnm.2622. Epub 2014 Jan 15. PMID: 24431098.
[2] Benzekry S., Lamont C., Beheshti A. et al. Classical mathematical models for description and prediction of experimental tumor growth. PLoS Computational Biology. 2014. Vol. 10, no. 8. P. e1003800.
[3] Alexeyeva N. (2013). Analysis of biomedical systems. Reciprocity. Ergodicity. Synonymy. Publishing of the Saint-Petersburg State University, Saint-Petersburg.
[4] Alexeyeva N., Gracheva P., Podkhalyuzina E., Usevich K. (2010). Symptom and syndrome analysis of categorial series, logical principles and forms of logic. In Proceedings, 3rd International Conference on BioMedical Engineering and Informatics BMEI, pages 2603-2606. China.
[5] Alekseeva N. P., Konradi A. O., Bondarenko B. Symptom analysis in long-term clinical prognosis research // Arterial Hypertension. 2008. Vol. 14, no. 1. P. 38-43. (In Russian).
[6] Alexeyeva N., Gracheva P., Martynov B., Smirnov I. (2009). The finitely geometric symptom analysis in the glioma survival. In The 2nd International Conference on BioMedical Engineering and Informatics (BMEI09). Oct. 2009. DOI: 10.1109/BMEI.2009.5305560. China.
[7] Liu, K. (2003). Using Liu-type estimator to combat collinearity. Communications in Statistics—Theory and Methods 32:1009-1020.
[8] Hoerl, A. E. and Kennard, R. W. (1970). "Ridge Regression: Application to Non-Orthogonal Problems", Technometrics, Vol. 12, pp. 69-82.
[9] Alekseeva N.P., Alekseev A.O. On the role of finite geometries in the correlation analysis of binary features. In: M.K.Chirkova (ed.) Mathematical models. Theory and applications, iss. 4. St. Petersburg, 102-117 (2004). (In Russian).
[10] Alekseeva N. P., Konradi A.O., Bondarenko B.B. Symptom Analysis in Long-Term Clinical Prognosis Research. Arterial hypertension 14 (1), 38-43 (2008). (In Russian).
[11] Alexeyeva N.P., Al-Juboori F. S., Skurat E. P. Symptom analysis of multidimensional categorical data with applications. Periodicals of Engineering and Natural Sciences 8 (3), 1517-1524 (2020).
[12] R Core Team (2020). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, https://www.R-project.org.
[13] Dorugade, A. V. Modified two parameter estimator in linear regression. J. Stat. Trans. New Ser. 15(1), 23-36 (2014).
[14] Lukman, A. F., Ayinde, K., Binuomote, S. and Onate, A. C. Modified ridge-type estimator to combat multicollinearity: application to chemical data. J. Chemomet. https://doi.org/10.1002/cem.3125 (2019).
[15] Wolczuk, D., Norman, D. (2012). Introduction to Linear Algebra for Science and Engineering, 2/E. Pearson.