Comparative Analysis of the C4.5 Algorithm and the Nearest Neighbor for the Number of Prospective New Student Registrants

In 2015, the number of new student registrants at Muhammadiyah University of Gorontalo (UMG) increased by about 20% to 50% compared with 2014. In the 2017/2018 academic year, however, only around 4,713 bachelor's candidates and 1,256 bachelor's degree candidates registered, and in the 2018/2019 academic year there were only 765 bachelor's degree candidates and around 4,187 bachelor's candidates, a decline from the previous year. This study aims to help predict the number of prospective new students who will enroll in the following academic year by comparing the C4.5 and Nearest Neighbor algorithms to determine which gives the best results. Both algorithms need to find patterns in the data about prospective students in order to produce predictions of their number, and these predictions can help increase the number of prospective students in line with the target set by UMG.


Introduction
Muhammadiyah University of Gorontalo is one of the private universities in Gorontalo and, as of 2018, has been established for more than one decade. In the 2015/2016 academic year it experienced a very significant increase in student numbers, with around six hundred (600) students accepted. In addition, every new academic year the university holds New Student Admissions activities. Because this activity is carried out routinely, it indirectly collects a large amount of data from prospective new students. The university therefore needs to process the data from all prospective students so that it becomes important information. One item that can be derived from this information is the target number of new student candidates.
Data mining is a sequence of processes for finding added value in a data set, in the form of knowledge that could not previously be discovered manually (Retnosari & Jananto, 2013; Kurgan & Musilek, 2006). Prediction, in turn, is the process of forecasting future events based on certain parameters in order to reduce the uncertainty of a condition, and of creating a benchmark for predicting future events based on patterns that have occurred in the past (Hartatik, 2015; Mariscal et al., 2010).
This study applies one of the methods of data mining, namely prediction, by comparing the C4.5 and Nearest Neighbor algorithms. The prediction model is expected to identify the better of the two algorithms, judged by the higher level of accuracy, which Muhammadiyah University of Gorontalo can later use to predict the number of new student candidates who re-register and to determine policies for upcoming new student admissions.

Methods
This research uses an analysis method based on CRISP-DM (CRoss-Industry Standard Process for Data Mining) (Kusrini, 2009).

The Business Understanding
At this stage, the research aims to identify needs in detail, namely by identifying patterns in the previous student admissions dataset based on the selected variables. This part explains the use of the data mining technique with the prediction method, using the C4.5 and Nearest Neighbor algorithms, which produces the prediction rules and the most influential variables in predicting the entry of prospective new students.

Evaluation
This stage evaluates the quality and effectiveness of the model used. Prediction results are obtained for each prediction algorithm and are then tested for accuracy using the confusion matrix method. After that, the accuracy levels of the algorithms are compared to determine which algorithm has the highest accuracy.
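As an illustration of the confusion matrix evaluation described above, the following is a minimal sketch rather than the procedure actually used in the study. The function names and the toy label lists are hypothetical; "Y" and "N" stand for the two prediction classes.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count how often each (predicted, actual) label pair occurs."""
    return Counter(zip(predicted, actual))


def accuracy(counts):
    """Share of all cases whose predicted label equals the actual label."""
    correct = sum(n for (pred, act), n in counts.items() if pred == act)
    return correct / sum(counts.values())


def class_precision(counts, label):
    """Share of cases predicted as `label` that really belong to `label`."""
    predicted_as_label = sum(n for (pred, _), n in counts.items() if pred == label)
    if predicted_as_label == 0:
        return 0.0
    return counts[(label, label)] / predicted_as_label


# Hypothetical toy labels, not data from the study.
actual = ["Y", "Y", "Y", "N", "N", "Y", "N", "Y"]
predicted = ["Y", "Y", "N", "N", "Y", "Y", "N", "Y"]

counts = confusion_counts(actual, predicted)
print("accuracy:", accuracy(counts))
print("precision Y:", class_precision(counts, "Y"))
print("precision N:", class_precision(counts, "N"))
```

Running the same functions on the predictions of each algorithm gives accuracy values that can be compared directly.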

Deployment Phase
At this stage, the results are disseminated in the form of reports, which can be implemented at Muhammadiyah University of Gorontalo as a reference for predicting the number of incoming new student candidates in the future.

The C4.5 Algorithm Calculation Results
The C4.5 algorithm is one of the most effective decision tree algorithms for classification. The decision tree is built by recursively dividing the data until each part consists of data from the same class (Iskandar & Suprapto, 2016). Specifically, the C4.5 decision tree algorithm uses a modified split criterion called Gain Ratio in the split attribute selection process (Jovanovic et al., 2012; Dongming et al., 2016; Mishra et al., 2016; Wang et al., 2019). The algorithm forms a decision tree that consists of a set of rules for dividing a population into smaller ones.
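For reference, the standard textbook definitions behind the Gain Ratio criterion can be written as follows. The notation is an assumption introduced here, not taken from the study: S is the set of cases, A is an attribute with n distinct values, S_i is the subset of S with the i-th value of A, and p_j is the proportion of cases in S that belong to class j out of k classes.

```latex
\begin{align}
\mathrm{Entropy}(S) &= -\sum_{j=1}^{k} p_j \log_2 p_j \\
\mathrm{Gain}(S, A) &= \mathrm{Entropy}(S) - \sum_{i=1}^{n} \frac{|S_i|}{|S|}\,\mathrm{Entropy}(S_i) \\
\mathrm{SplitInfo}(S, A) &= -\sum_{i=1}^{n} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|} \\
\mathrm{GainRatio}(S, A) &= \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInfo}(S, A)}
\end{align}
```

Dividing the gain by the split information penalizes attributes that split the data into many small subsets.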

Attribute
The total row of the Entropy column is calculated using the entropy equation, while the gain value for gender is calculated using the gain equation. From the calculation results in the node calculation table, the highest gain is found for the total test attribute, at 0.44763, so total test becomes the root node. The total test attribute has 4 attribute values, from which a temporary decision tree can be drawn. The same results are obtained when testing with the RapidMiner tool, as shown in the following image:
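The root node selection described above can be sketched as follows. This is a minimal illustration under stated assumptions: the rows, the attribute names (`gender`, `total_test`, `passed`) and their values are hypothetical toy data, not the study's dataset, and the sketch ranks attributes by plain gain, as the text does, rather than by gain ratio.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical toy rows, not data from the study.
rows = [
    {"gender": "M", "total_test": "high", "passed": "Y"},
    {"gender": "F", "total_test": "high", "passed": "Y"},
    {"gender": "M", "total_test": "low", "passed": "N"},
    {"gender": "F", "total_test": "low", "passed": "N"},
    {"gender": "M", "total_test": "medium", "passed": "Y"},
    {"gender": "F", "total_test": "medium", "passed": "N"},
]

gains = {a: gain(rows, a, "passed") for a in ("gender", "total_test")}
root = max(gains, key=gains.get)  # the attribute with the highest gain becomes the root
print(gains, root)
```

In this toy data the test attribute separates the classes far better than gender, so it is chosen as the root, mirroring the pattern reported in the text.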

Nearest Neighbor Algorithm Calculation Results
This algorithm searches for cases by calculating the closeness between a new case and old cases, based on matching the weights of a number of existing features. To predict whether a prospective new student will pass the exam, the following steps are used. Calculate the closeness between the new case and each old case. From steps 1, 2, and 3, the highest value is found for case number 3; therefore, the case closest to the new case is case 3.

Use the classification of the case with the closest proximity.
Based on the result of step 4, case number 3 is used to predict the new case, with the possibility that the new student will pass the exam. The following are the results of the Nearest Neighbor algorithm applied to real data using Excel:
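The steps above can be sketched as a weighted similarity calculation. This is an assumption about the form of the closeness measure (the sum of attribute proximities multiplied by attribute weights, divided by the sum of weights); the attribute names, weights, proximity values, and cases below are all hypothetical and are not taken from the study.

```python
def similarity(new_case, old_case, weights, proximity):
    """Weighted closeness: sum(proximity * weight) / sum(weights)."""
    weighted = 0.0
    for attribute, weight in weights.items():
        pair = (new_case[attribute], old_case[attribute])
        weighted += proximity[attribute][pair] * weight
    return weighted / sum(weights.values())


# Hypothetical attribute weights and proximity tables.
weights = {"gender": 0.5, "total_test": 1.0}
proximity = {
    "gender": {("M", "M"): 1.0, ("F", "F"): 1.0, ("M", "F"): 0.5, ("F", "M"): 0.5},
    "total_test": {
        ("high", "high"): 1.0, ("low", "low"): 1.0,
        ("high", "low"): 0.25, ("low", "high"): 0.25,
    },
}

# Hypothetical old cases with known outcomes, keyed by case number.
old_cases = {
    1: {"gender": "M", "total_test": "low", "passed": "N"},
    2: {"gender": "F", "total_test": "low", "passed": "N"},
    3: {"gender": "F", "total_test": "high", "passed": "Y"},
}
new_case = {"gender": "M", "total_test": "high"}

# Steps 1 to 3: closeness of the new case to each old case.
scores = {n: similarity(new_case, case, weights, proximity) for n, case in old_cases.items()}
# Step 4: take the classification of the closest case.
closest = max(scores, key=scores.get)
prediction = old_cases[closest]["passed"]
print(scores, closest, prediction)
```

The proximity tables play the role of tables such as the attribute value proximity tables used in the manual calculation, where each pair of attribute values is assigned a closeness between 0 and 1.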

Proximity of Religious Attribute Values

Algorithm C4.5
The result for the C4.5 algorithm is shown in the picture below. The accuracy obtained is 96.47%. For Prediction Yes (Y), true yes is 4743 and true no is 76, with a class precision of 98.42%. For Prediction No (N), true yes is 122 and true no is 670, with a class precision of 84.60%.
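The reported C4.5 figures can be checked by hand from the four confusion matrix counts. The worked arithmetic below assumes the usual reading of the matrix, in which each prediction row is split into true yes and true no columns; under that reading it reproduces the stated accuracy and class precision values.

```latex
\begin{align}
\text{Accuracy} &= \frac{4743 + 670}{4743 + 76 + 122 + 670} = \frac{5413}{5611} \approx 96.47\% \\
\text{Precision}(Y) &= \frac{4743}{4743 + 76} = \frac{4743}{4819} \approx 98.42\% \\
\text{Precision}(N) &= \frac{670}{670 + 122} = \frac{670}{792} \approx 84.60\%
\end{align}
```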

Nearest Neighbor Algorithm
In the picture below, the accuracy of the Nearest Neighbor algorithm is 86.02%, with class precision values of 86.05% and 50.00%.
Based on both images above, it is clear that the C4.5 algorithm has the higher accuracy. According to the researchers' analysis in the findings and discussion of this research, the C4.5 algorithm, particularly when processed with the RapidMiner tool, does not require transforming the original data, which contains both letters and numbers. For the Nearest Neighbor algorithm, by contrast, transforming the data is necessary because the algorithm requires numerical input. A further reason is that the Nearest Neighbor algorithm is more widely used for classification according to the proximity of values.
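To illustrate the kind of data transformation the Nearest Neighbor algorithm needs, the following is a minimal sketch of mapping categorical values to numeric codes. It is not the transformation used in the study; the attribute names (`gender`, `school_major`, `score`) and the records are hypothetical.

```python
def build_encoder(values):
    """Map each distinct categorical value to an integer code (sorted for stable codes)."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


# Hypothetical records mixing letters and numbers, not data from the study.
records = [
    {"gender": "F", "school_major": "Science", "score": 78.5},
    {"gender": "M", "school_major": "Social", "score": 64.0},
    {"gender": "F", "school_major": "Social", "score": 81.0},
]
categorical = ("gender", "school_major")

# One encoder per categorical attribute, built from the values seen in the data.
encoders = {a: build_encoder(r[a] for r in records) for a in categorical}

# Each record becomes a purely numeric row: encoded categories, then the score.
numeric_records = [
    [encoders[a][r[a]] for a in categorical] + [r["score"]]
    for r in records
]
print(numeric_records)
```

Integer codes of this kind impose an arbitrary order on the categories, which is one reason a proximity table per attribute can be a better fit when closeness between values matters.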

Conclusion
The comparison of the two algorithms clearly shows that the C4.5 algorithm is superior, with the total test attribute as the top node of its decision tree. However, when the researchers and their assistant collected the real data, it differed considerably from what had been expected: there was a large amount of blank and redundant data entered when new students filled in their details, especially for the attributes of school department of origin and Pure Ebtanas Score (NEM), or final scores, which are important attributes in determining the best results.