MehtA+ AI/ML Research Bootcamp '22 midterm project

Arjun Somnali, Spencer Anderson, and Yuriy Bidochko

Identifying Genders of Science Authors in the 1600s

When attempting to solve this problem, the first question is: what distinguishes the writing styles of men and women? Our first thought was Latent Dirichlet Allocation (LDA), a topic model that groups texts by subject matter, but we realized that because our dataset covers a narrow range of scientific topics, the topics it would discover would look much the same for male and female authors. LDA would therefore not be an effective strategy for this prediction. After running tests on the data and reviewing the research, we concluded that the main difference is that female authors use pronouns noticeably more often than male authors do. Female authors also use significantly more adjectives and adverbs than male authors. These are all features we can quantify and use to identify the gender of a document's author.
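As an illustration, the sketch below shows one way these part-of-speech frequencies could be quantified. It assumes a Python setup with spaCy and its `en_core_web_sm` model, which may differ from the tools actually used in the project; the example sentence is a placeholder.

```python
# Minimal sketch (not the project's actual code): compute the relative
# frequency of pronouns, adjectives, and adverbs in a document with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed

def style_features(text: str) -> dict:
    """Return the fraction of tokens that are pronouns, adjectives, or adverbs."""
    doc = nlp(text)
    tokens = [tok for tok in doc if tok.is_alpha]
    n = max(len(tokens), 1)
    counts = {"PRON": 0, "ADJ": 0, "ADV": 0}
    for tok in tokens:
        if tok.pos_ in counts:
            counts[tok.pos_] += 1
    return {pos: count / n for pos, count in counts.items()}

print(style_features("She carefully described her remarkable experiment."))
```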


Methods

Before we were aware of this distinction, we were content to put the whole text through a model as a bag of words and let it figure things out, which worked surprisingly well. However, that approach does little to help the user understand what the model is actually picking up on, and it wastes resources such as energy, since the computer has to process much larger, very sparse feature matrices. We therefore used the natural language processing libraries we had been taught in class to tag the text by part of speech and then give the model only the pronouns, adjectives, and adverbs.
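The sketch below illustrates this filtering step under the assumption of a spaCy plus scikit-learn setup; the tiny example texts, labels, and classifier choice are placeholders rather than the project's actual data or code.

```python
# Minimal sketch: keep only pronouns, adjectives, and adverbs, then feed the
# filtered text into a bag-of-words classifier.
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

nlp = spacy.load("en_core_web_sm")
KEEP = {"PRON", "ADJ", "ADV"}

def filter_pos(text: str) -> str:
    """Drop every token except pronouns, adjectives, and adverbs."""
    return " ".join(tok.text.lower() for tok in nlp(text) if tok.pos_ in KEEP)

# Placeholder documents and labels (0 = male author, 1 = female author).
texts = ["He observed the bright comet.", "She carefully noted her elegant results."]
labels = [0, 1]

model = make_pipeline(CountVectorizer(), SVC())
model.fit([filter_pos(t) for t in texts], labels)
print(model.predict([filter_pos("Her thorough and precise observations.")]))
```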

Model Accuracies

With a Support Vector Machine, we achieved about 95-97% accuracy.
With K-Nearest Neighbours, our accuracy was 85-93%.
With linear regression, we achieved 92-96% accuracy.
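For reference, a minimal sketch of how such a comparison can be run with scikit-learn is shown below. The generated data is a stand-in for our pronoun/adjective/adverb features, and the linear-regression result is sketched here as LogisticRegression, which is an assumption since the original model code is not shown on this page.

```python
# Minimal sketch (placeholder data, not the project's experiment): train the
# three reported model types and print their test accuracies.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic features standing in for the part-of-speech feature vectors.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Support Vector Machine": SVC(),
    "K-Nearest Neighbours": KNeighborsClassifier(n_neighbors=5),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: {accuracy_score(y_test, model.predict(X_test)):.2%}")
```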

View Our Code!