This project aimed to practice TF-IDF for sentiment classification on a large dataset.

Stages of the project

  1. Finding data. This is the dataset I found on Kaggle.

  2. Preparing data. As a basic approach, I used a labeled (supervised) dataset, deleted the columns I did not want to use, and converted floats to integers.

  3. I applied the TF-IDF algorithm.

  4. I used Accuracy and F1-Score to measure how the model performed.
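Stages 3 and 4 can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes scikit-learn, a logistic regression classifier, and a handful of made-up reviews in place of the Kaggle data, none of which are confirmed by the notes above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical toy reviews standing in for the real dataset.
train_texts = [
    "great product, works well",
    "excellent quality, very happy",
    "love it, great value",
    "terrible, broke after a day",
    "awful quality, very disappointed",
    "bad product, waste of money",
]
train_labels = [1, 1, 1, 0, 0, 0]  # 1 = positive, 0 = negative

test_texts = ["great quality, love it", "terrible product, waste of money"]
test_labels = [1, 0]

# Stage 3: turn each review into a vector of TF-IDF weights.
# The vocabulary and IDF values are learned from the training texts only.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# The classifier is an assumption; any model that accepts sparse input would do.
clf = LogisticRegression()
clf.fit(X_train, train_labels)
predictions = clf.predict(X_test)

# Stage 4: score the predictions with accuracy and F1.
accuracy = accuracy_score(test_labels, predictions)
f1 = f1_score(test_labels, predictions)
print(f"accuracy={accuracy:.2f} f1={f1:.2f}")
```

On imbalanced data, F1 is usually the more informative of the two numbers, since accuracy can look good even when the smaller class is mostly misclassified.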

Main takeaways

  1. SQLite is very useful for preparing large quantities of data.

  2. You have to convert the labels to integers ('positive' to 1, 'Negative' to 0) for the TF-IDF algorithm to work. Floats do not work either: a label of 1.0 will fail.

  3. The gap between the number of positive and negative reviews is pretty large, so the model has to be weighted to avoid errors.

  4. Adding another class to the classification would probably be a good idea.
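The first three takeaways can be illustrated with a short sketch using Python's built-in sqlite3 module. The table name, column names, and rows below are hypothetical, and the weighting formula is the common "balanced" heuristic (the same one scikit-learn uses for class_weight="balanced"), which may differ from the weighting actually used in the project.

```python
import sqlite3
from collections import Counter

# Hypothetical table layout; the real dataset's columns are not shown here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reviews (id INTEGER, text TEXT, sentiment TEXT, score REAL)"
)
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?, ?)",
    [
        (1, "great product", "positive", 5.0),
        (2, "love it", "positive", 4.0),
        (3, "works well", "positive", 5.0),
        (4, "broke after a day", "Negative", 1.0),
    ],
)

# Keep only the columns we need and map the text labels to integers in SQL.
# LOWER() makes the mapping tolerant of 'Negative' vs 'negative'.
rows = conn.execute(
    """
    SELECT text,
           CASE LOWER(sentiment)
               WHEN 'positive' THEN 1
               WHEN 'negative' THEN 0
           END AS label
    FROM reviews
    WHERE LOWER(sentiment) IN ('positive', 'negative')
    ORDER BY id
    """
).fetchall()
conn.close()

texts = [text for text, _ in rows]
labels = [label for _, label in rows]  # plain Python ints, not floats

# "Balanced" class weights: n_samples / (n_classes * count_of_class).
# The rarer class gets the larger weight.
counts = Counter(labels)
class_weights = {
    label: len(labels) / (len(counts) * count) for label, count in counts.items()
}
print(labels, class_weights)
```

With three positive reviews and one negative, the negative class ends up weighted three times as heavily as the positive class, which is the kind of correction an imbalanced dataset calls for.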

Click here to access the GitHub repository <3