Current Research in Bioorganic & Organic Chemistry Computational Modeling of the Anticonvulsant Activity of 3-Aminopropane-1,2-Diol and 1-Aminoethane-1,2-Diol Derivatives

Quantitative Structure-Activity Relationship (QSAR) modeling was conducted on some 3-aminopropane-1, 2-diol and 1-aminoethane-1, 2-diol derivatives with anticonvulsant activity against maximal electroshock induced seizure using Genetic Function Algorithm-Multiple Linear Regression (GFA-MLR) method. The data set (37 molecules), was divided into 26 training and 11 test subsets by Modified-K-mediods clustering method. The models built by the GFA-MLR method provided satisfactory statistical results with LOF (0.087 to 0.097), R 2 (0.963 to 0.980), Q 2 (0.948 to 0.971), F (139.3 to 258.3), R 2pred (0.861 to 0.931) and MAE (95%) (0.059 to 0.066). Descriptors contained in these models suggested that increment in molecular mass and polarizability of dataset molecules was favorable for improving their anticonvulsant activity values. Intelligent consensus modeling applied to the models gave a representative model with improved MAE (95%) of 0.054. Applicability domain of the models was well defined and therefore, the models can be used to screen molecules for anticonvulsant activity.


Introduction
The continuous effort to investigate new molecules with anticonvulsant properties is important because about 30% of epileptic patients with convulsion as a major symptom do not respond to marketed Antiepileptic Drugs (AEDS) [1]. In addition, almost all marketed AEDs had attendant side effects [2]. Therefore, developing new anticonvulsant improved quality in term of potency and safety is a continuous task for medicinal scientist. Modern computational chemistry as an evolving discipline provided rational approaches to drug design. It accelerates and reduces the cost of drug discovery process via obliteration of classical trial and error approach [3]. An example of computational approach to drug discovery is Quantitative Structure-Activity Relationship (QSAR) modeling which aids in identifying structure feature of a molecule which correlates mathematically with the observed biological activity of the molecule [7-8] [4-6]. This approach is now widely used as an aid to or an outright substitute for experimental studies to predict the activity of the molecules from their structure [7]. It reduces the number of animals needed for experiment and reduces cost in term of funds and time [8].
3-aminopropane-1, 2-diol and 1-aminoethane-1, 2-diol derivatives were reported to have anticonvulsant activity in Maxima Electroshock Seizure (MES) test, which one of the golden test for preliminary screening of molecules for anticonvulsant activity. To the best of our knowledge there were no QSAR reports on this group of molecules using a combination of density function theory quantum mechanical method and chemo metric principles. The objective of this study is to explore the structural features that are responsible for observed anticonvulsant activity of these groups of molecules through QSAR methodology.

Materials and Method Data Set
The data set were derivatives of 3-aminopropane-1, 2-diol and 1-aminoethane-1, 2-diol whose IUPAC name and anticonvulsant activity (against Maxima Electroshock (MES) induced seizure) values were obtained from literature [9]. The activity value reported as amount of molecule (mg kg -1 ) that is effective to prevent convulsion in fifty percent of the tested animals (ED 50 ) was transformed to ED 50 (mol kg -1 ) and later to log (1/ED 50 ) to abate the deviation to the normal distribution of the data set activity values [10].

Molecular Structure Generation, Optimization and Descriptors Calculation
Spartan 14 [11] software was used to draw and optimize the equilibrium geometry of each molecule in the dataset. Density function theory B3LYP/6-31G** quantum mechanical method was employed for optimization calculation using Pulay DIIS algorithm and direct geometric minimization. This method gave the most stable molecule associated with absolute minima in the potential energy hypersurface which represents the most probable structure of the molecule [12]. DFT also gave reliable information on electronic properties of molecule [13].
The optimized structures were ported to PaDEL-Descriptor software [14] to compute around 1875 different physicochemical, topological and structural molecular descriptors. Molecular structure and the corresponding anticonvulsant activity value of dataset molecules were presented in (Table 1)

Dataset Division
Modified K-Medoid clustering algorithm proposed by (Park & Jun, 2009) [15] available in Modified KMedoid version 1.2 was used to divide the database into training set for model development and test set for model validation. The algorithm proceeds via three main steps which are, selection of initial Mediod, update of selected mediods and assignment of object to mediods. In the first step, given n numbers of objects having p number of variables (descriptors) each, they were grouped into given k clusters, where k < n. Defining the variable of object i as X ia (i = 1,. . .,n; a = 1,. . .,p). The Euclidean distance (d ij ) between two object i and object j was calculated: i = 1, …….., n and j = 1 ….. n (1) Scaled Euclidean distance ( ) for each object was calculated by dividing the distances by sum of the entire distances. The for objects in each cluster k was arranged in ascending order and objects with smallest values in each group are selected as the initial most middle objects in a cluster (mediods). The objects were re-shuffled to obtain initial cluster by assigning each object to the nearest medoid. The sum of distances from all objects to their medoids was calculated and kept for comparison.
To update the mediods, a new medoid of each cluster was found, which is the object minimizing the total distance to other objects in its cluster. Then the current medoid in each cluster is updated by replacing with the new medoid. Then (the third step), each object is assigned to the nearest medoid resulting to the formation of new k clusters. The sum of distance of objects to their mediods (total cost distance) was re-calculated. Now, if the total cost distance is equal to the previous one, the algorithm stops, otherwise, it goes back to the second step [15].

Transformation of Descriptor Values
Molecular properties are often measured in different unit and regression analysis frequently produces equation that favors property with higher magnitude of measurement [10]. To give all properties (descriptors) equal chance of appearing in the models produced in the study, descriptor values were transformed by normalization method using the equation below: (4) Where is the normalized descriptor, X i is the original descriptor value, X max and X min are the maximum and minimum descriptor value respectively in a descriptor column of the database [16].

Selection of Most Desirable Descriptor
Using the training set data only, combinations of descriptors that were optimally correlated with the anticonvulsant activity of the dataset molecules were selected using Genetic Function Algorithm (GFA) available in Materials Studio 7.0. GFA is a frequently used method that utilizes genetic algorithm to select combinations of descriptors that can be used to produce models and multivariate adaptive regression splines algorithm to evaluate the fitness of the models [17]. It has the advantage of producing of multiple models via repeated runs and automatically selects and determines the exact number of descriptors needed to build a fullsize model.

Construction and Validation of QSAR Models
The combinations of training set descriptors reported by the GFA variable selection were collected in separate spreadsheets for both training and test sets. These spread sheets were imported into the MLRplusValidation1.3 [18] to calculate various internal and external validation parameters. Furthermore, the presence of multicolinearity problem among descriptor blend that made up a model was checked with Variance Inflation Factor (VIF) value for each descriptor i: Where is the coefficient of determination of the regression of descriptor i on all the other descriptors. VIF value greater than 10 indicates high degree of correlation among descriptors (multicolinearity problem) [19]. Full explanations of the various validation parameters used were presented in (Table 2).

Symbol
Definition and allowed threshold Ref. Predicted determination coefficient for the test set data, R 2 (pred) > 0.6 indicate good predictive ability (Golbraikh and Tropsha, 2002) [21] are the square of correlation coefficient for the plot of predicted versus observed activity for test set with and without intercept respectively. If the value of the parameter is < 0.1 then, the model is predictive

Models Applicability Domain (AD)
The AD is the structural and chemical space of a QSAR model where it can make a reliable prediction [23]. Degree of extrapolation method was used to define AD in the study. It uses leverage (h i ) values for each compound obtained as the diagonal elements of a hat matrix and standardized residual (SDR) produced by the models. Williams plot (graph of SDR versus h i ) gives a quick pictorial representation of AD in this method. Hat matrix H was computed with the equation below: Where X represents the descriptors matrix and X T is the transpose of the matrix. The diagonal elements of H are leverages for each compound. Leverage threshold (h*) for a model was computed with the equation below: where n is the number of compounds in the training set only and k is the number of descriptors in the model. SDR was computed with the equation below: Where n is the number of compounds in the training set. Y pred and Y obs are the predicted and experimentally observed activity values respectively. A compound with h i > h* for a model is structurally dissimilar to other members of the model training set i.e. an influential data and prediction for such compound by the model are not reliable. A compound with SDR > ± 3 is an outlier in the response space of the model [23].

Training Set and Test Set Data Structure
The clustering method used divivded the entire data into 26 training set (70% of the entire dataset) and 16 test set (30% of the entire dataset). The test compounds were marked with asterisk in Table 1. The plot of normalized mean distance against the observed anticonvulsant activity for both training and test set (Figure 1) showed that test set data was distributed within the descriptor space of the training set data. This showed that the data division method used performed well.

QSAR Models and Validation Parameters
The top three models produced by the GFA-MLR method used in the study were presented in Equation 4 to 6. These QSAR models were obtained from 26 training set data and contained 4 descriptors each, meaning their Toplis ratio was 6.5. Hence, they do not violate the QSAR semi-empirical rule of thumb [24].
Correlation analysis carried out on the descriptors contained in each model showed the highest absolute correlation coefficient between descriptors was 0.667 (Table 3). This indicated that descriptors contained in the models were relatively orthogonal to one another. Their VIF values (Table 3) further confirmed there was no multi-co-linearity problem in the models reported. Detailed quality and validation parameters values for the models were presented in (Table 4) These results showed that the models had good internal and external predictive ability and were void of systematic error.   Although the models reported were of good quality, the aim of all QSAR practitioners to improve the quality of prediction by reducing predicted residuals for test/query compounds to the barest minimum. To achieve this aim, intelligent consensus modeling [25] available in Intelligent Consensus Predictor version 1.1 software was applied on the models. Intelligent consensus modeling combined the proposed validated individual models (Equation 4 to Equation 6), and it carefully accounted for carefully accounting for the different assumptions characterizing each model. The optimized software setting for the study was without the entire additional criteria (i.e. Euclidean distance cutoff, applicability domain criteria and Dixon Q-test), a similar condition reported in literature [25].
The test-set validation parameters for individual models as well as consensus models obtained were reported in (Table 5). In the table, IM1, IM2 and IM3 represent the Eq. 4, Eq. 5 and Eq. 6 respectively. While, CM0 is the ordinary consensus model which uses simple average of prediction of individual model for all compounds in the test set; CM1 is the intelligent consensus model1 which uses the average of predictions from all qualified individual models; CM2 is the intelligent consensus model 2 which uses Weighted Average Predictions (WAPs) from all qualified individual models; and CM3 is the intelligent consensus model 3 which uses the best selection of predictions (compound-wise) from individual models [25].  In the table, CM0 was ordinary consensus model in which simple average of prediction of individual model for test set compounds as used. CM1 was intelligent consensus 1in which the average of predictions from all qualified individual models for a given compounds were used. CM2 was intelligent consensus 2 in which uses weighted average predictions (WAPs) from all qualified individual models for a given test set compounds was used. Finally, CM3 was intelligent consensus 3 in which uses the best selection of predictions (compound-wise) from individual models was used [25].
Comparing the three individual models (IM1-IM3) with the three intelligent consensus models (CM1-CM3), it was obvious that the values of external validation parameters were better in almost all the cases for consensus models. The mean absolute error MAE (95%) metric for intelligent consensus models CM1 to CM3 were lower compare to that of individual models (Table  5). CM2 emerged as superior to all other models with MAE (95%) 0.054 (Table 5). CM2 was used to predict the activity of the entire data and the predicted activity values were reported in Table 1. The predicted test set activity values for the entire dataset by the individual models (IM1-IM3) and the intelligent consensus models (CM0-CM3) were presented in Table S2 of the Supplementary file.
Linear relationship existed between the experimental and predicted activity values by the CM2 (Figure 2) and there was even of its predicted activity residuals around the line standardized residual equal zero (Figure 3). These observations indicated that the model had good internal and external predictive capability and also void of systematic error. Therefore, it can be used to make prediction for known molecule without activity, provided the molecule is in the applicability domain (AD) of the developed models.

Models Applicability Domain
The William plots for the models (Figures 4-6) showed that all dataset molecules had leverage value less than less than the models threshold leverage (h* = 0.57) and their standardized residual (SDR) were less than ±2.5. This indicated that all molecules were within the applicability domain of the models defined by the square area 0 < hi < h* and -2.5 < SDR < 2.5. Hence, the models reported were able to predict the activity values for all dataset molecules with high level of reliability. In summary, the models had high-quality parameters and great predictive power for molecules within their AD.

Descriptors Interpretation
A QSAR model can be used as knowledge generator to improve the biological activity under consideration for any molecule. Interpretation of the model descriptors usually played a major role in this endeavor. Therefore, attempt was made in the study to a brief interpretation for descriptors contained in the reported QSAR models. Table 6 contained definition of descriptors shared by reported models; their average regression coefficient and incidence. Topological distance based autocorrelation -lag 9 / weighted by polarizability 1.555 (1) Note: ARC (I) is average regression coefficient (incidence). AATSC8m, AATS2p and MATS3m were 2D spatialdependent autocorrelation descriptors calculated on a molecular graph with the use of Broto-Moreau coefficient (in the case AATSC8m and AATS2p) and Moran coefficient (in the case of MATS3m) [26]. AATSC8m measures the strength of the connection between relative atomic masses of two atoms in a molecular space separated by eight bonds (lag 8). It had positive average regression coefficient and appeared in the three models (Table 6). AATS2p measures the strength of the connection between polarizability of two atoms in a molecular space separated by two bonds (lag 2). Also had positive average regression coefficient and with one incidence in the entire models (Table 6). While MATS3m measures the strength of the connection between relative atomic masses of two atoms in a molecular space separated by three bonds (lag 3), it was negatively correlated with the anticonvulsant activity of the studied dataset (Table 6). It also appeared in the three models. Therefore, increment in values of AATSC8m and AATS2p augments the anticonvulsant activity value of dataset molecules, while, that of MATS3m diminishes the activity.
Mp was a 2D constitutional descriptor defined as mean atomic polarizability scaled on Carbon atom [14]. It was positively correlated to the anticonvulsant activity of the studied dataset and occurred in one of the model (Table 6). SHCsats is a 2D electrotopological-state index of an atom which unifies in a single index both electronic and topological description of a molecule [27]. It is defined as Sum of atom-type H on C sp3 bonded to another saturated C. It had positive regression coefficient and incidence of three (Table 6). TDB2p is 3D topological distance based autocorrelation -lag 2 / molecular polarizability. It is a member of the 3D autocorrelation descriptors [28] which uses both Euclidean (geometric) and topological distances to encode information about molecular structure. TDB is an index of shape and branching of molecules [26]. It occurs in one of the model reported and it's positively correlated to the anticonvulsant activity of dataset molecules.
In summary, the descriptors contained in the reported models suggested that increment in the molecular mass and polarizability will improve the anticonvulsant activity of the dataset molecules. This can be achieved via chain elongation to increase the value of SHCsats, AATSC8m and addition of electronegative elements which will be favorable to the values of AATS2p, Mp and TDB2p.

Conclusion
Anticonvulsant activity of some 3-aminopropane-1,2diol and 1-aminoethane-1,2-diol derivatives were successfully model via QSAR strategy. The QSAR models obtained had good statistical quality: LOF (0.087 to 0.097), R 2 (0.963 to 0.980), Q 2 (0.948 to 0.971), F (139.3 to 258.3), R 2 pred (0.861 to 0.931) and mean absolute error after removal of 5% data i.e. MAE (95%) (0.059 to 0.066). Intelligent consensus 2 (CM2) with MAE (95%) of 0.054 was the golden model for making prediction in the study. The result in the study showed that AATSC8m, AATS2p, MATS3m, Mp, SHCsats and TDB2p descriptors had influence on the anticonvulsant activity values of dataset molecules. Therefore, increase in molecular mass and polarizability of dataset molecules is favorable for improving their anticonvulsant activity values. The models reported were robust and with good predictive ability. Their applicability domains were well defined and they can have used to virtually design and screen molecules for anticonvulsant activity.