Text Data

Download Text Data (college1) Download Text Data (college2) Download Text Data (college3) Download Text Data (politics1) Download Text Data (politics2) Download Text Data (politics3)

Python Part

Naive Bayes estimates the probability of each word stem's count given the article's topic (college tuition or politics), and from these it computes the probability of each label for a test article given its word-stem frequencies. The test dataset contains two articles, and both are predicted correctly. In this case, therefore, Naive Bayes is a good model for predicting the label of text data.
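The calculation described above can be sketched in a few lines of plain Python. This is only a minimal illustration of multinomial Naive Bayes with add-one smoothing, not the downloadable project code: the stemmed toy documents, the function names `train_multinomial_nb` and `predict`, and the smoothing value `alpha` are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit log priors and smoothed log P(stem | label) from stemmed documents."""
    vocab = sorted({stem for doc in docs for stem in doc})
    class_counts = Counter(labels)
    stem_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        stem_counts[label].update(doc)
    model = {}
    for label, n_docs in class_counts.items():
        # Add-alpha smoothing keeps unseen stems from getting zero probability.
        total = sum(stem_counts[label].values()) + alpha * len(vocab)
        log_lik = {s: math.log((stem_counts[label][s] + alpha) / total) for s in vocab}
        model[label] = (math.log(n_docs / len(labels)), log_lik)
    return model

def predict(model, doc):
    """Return the label with the highest log posterior for one stemmed document."""
    scores = {}
    for label, (log_prior, log_lik) in model.items():
        # Stems that never appeared in training are skipped.
        scores[label] = log_prior + sum(log_lik[s] for s in doc if s in log_lik)
    return max(scores, key=scores.get)

# Toy, made-up stemmed articles (NOT the project's data).
train_docs = [
    ["tuition", "colleg", "student", "loan"],
    ["colleg", "tuition", "cost", "student"],
    ["elect", "senat", "vote", "parti"],
    ["presid", "vote", "elect", "polici"],
]
train_labels = ["college", "college", "politics", "politics"]

model = train_multinomial_nb(train_docs, train_labels)
test_docs = [["student", "tuition", "cost"], ["senat", "elect", "polici"]]
predictions = [predict(model, doc) for doc in test_docs]
print(predictions)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters once real articles with hundreds of stems are used.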

R Part

Confusion Matrix

Result of NB

The Naive Bayes model in the R code does not perform as well as the one in Python. It correctly predicted only half of the test data: one of the articles about college tuition was predicted to be about politics.
Download Python Code (for text data) Download R Code (for text data)

Record Data

Download Record Data

Python Part

First, there are more expensive universities than non-expensive ones. Second, because the data contain more samples of expensive schools than of non-expensive schools, it is understandable that Naive Bayes predicts expensive universities better than non-expensive ones. The prediction is based on the distribution of features such as tuition, acceptance rate, and type of college, conditioned on whether a college is expensive or not.

R Part

Confusion Matrix

Result of NB

The R results for the record data match the Python results. Many universities in the test data are predicted as expensive, and most of the expensive universities are predicted correctly. However, only half of the non-expensive universities are predicted correctly.
Download Python Code (for record data) Download R Code (for record data)
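The gap between the two classes is easier to see when the confusion matrix is summarized as per-class recall. The sketch below assumes a hypothetical confusion matrix whose counts are invented purely to mirror the pattern described (most expensive universities correct, half of the non-expensive ones correct); they are not the project's actual results, and `per_class_recall` is an illustrative helper name.

```python
def per_class_recall(confusion):
    """confusion[actual][predicted] -> count; return recall for each actual class."""
    return {actual: row[actual] / sum(row.values()) for actual, row in confusion.items()}

# Hypothetical counts for illustration only (NOT the project's confusion matrix).
confusion = {
    "expensive": {"expensive": 18, "non-expensive": 2},
    "non-expensive": {"expensive": 5, "non-expensive": 5},
}

recall = per_class_recall(confusion)
total = sum(sum(row.values()) for row in confusion.values())
accuracy = sum(confusion[label][label] for label in confusion) / total

print(recall)
print(round(accuracy, 3))
```

With imbalanced classes, overall accuracy can look acceptable while the minority class is predicted poorly, so reporting recall per class gives a fairer picture.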

Conclusion

Since the number of expensive colleges is higher than the number of non-expensive colleges in the United States, the best guess for a random college, when no other information about it is available, is that it is expensive: this guess has the highest probability of being correct. On the other hand, if the acceptance rate, tuition, and other basic information about the college are given, it is better to predict based on this information. Tuition, of course, is the most important indicator of whether a university is expensive. However, whether a college is expensive can be predicted from other factors as well. For example, a university with a high acceptance rate is very likely to be a non-expensive university.
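The reasoning in this conclusion can be sketched with Bayes' rule: with no features, the prediction falls back to the class prior, and a feature such as acceptance rate can overturn it. Every number below (the priors and the Gaussian mean and standard deviation of acceptance rate per class) is an assumption chosen for illustration, not an estimate from the record data.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

# Assumed values for illustration only (NOT fitted to the project's data).
PRIORS = {"expensive": 0.7, "non-expensive": 0.3}
ACCEPT_RATE = {"expensive": (0.45, 0.20), "non-expensive": (0.75, 0.15)}  # (mean, std)

def posterior(accept_rate=None):
    """Posterior over labels; with no feature it is just the prior."""
    scores = {}
    for label, prior in PRIORS.items():
        if accept_rate is None:
            likelihood = 1.0
        else:
            likelihood = gaussian_pdf(accept_rate, *ACCEPT_RATE[label])
        scores[label] = prior * likelihood
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

def predict(accept_rate=None):
    """Return the most probable label under the assumed model."""
    probs = posterior(accept_rate)
    return max(probs, key=probs.get)

print(predict())      # no information: falls back to the majority class
print(predict(0.9))   # a high acceptance rate shifts the posterior
```

Under these assumed parameters, the prediction without features is the majority class, while an acceptance rate of 0.9 is enough evidence to flip the prediction to non-expensive despite the smaller prior.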