The following are two term projects from AC209A/B at Harvard University.
Donald Trump Tweet Analysis
My team and I were tasked with analyzing President Trump's Twitter feed. This was an open-ended project, and we decided to use sentiment analysis to predict two key Twitter variables: the number of favorites a tweet receives, and the number of followers gained on the following day. All of this data is public and can be downloaded or scraped. We started by visualizing several variables and their interactions, then modeled the two response variables using three techniques: linear regression, XGBoost, and neural networks. Our predictors included the length of the tweet in characters, the time of day, the source of the tweet (iPhone, desktop, etc.), and numerical scores for polarity (the emotional direction of the tweet, positive vs. negative) and subjectivity (how opinionated the tweet is, factual vs. subjective). The sentiment analysis was conducted using TextBlob. Our models were not very accurate, but we did find, with high statistical significance, that lower polarity and higher subjectivity are associated with higher engagement numbers. We also found that polarity and subjectivity are positively correlated with each other.
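The per-tweet predictors described above can be sketched roughly as follows. This is an illustration, not the project's actual code: the function names, the dictionary keys, and the idea of passing the scorer in as a parameter are all assumptions made here. The only TextBlob call used is `TextBlob(text).sentiment`, which returns a polarity in [-1, 1] and a subjectivity in [0, 1].

```python
from datetime import datetime
from typing import Callable, Dict, Tuple

# A scorer maps tweet text to (polarity, subjectivity).
Scorer = Callable[[str], Tuple[float, float]]


def textblob_scorer(text: str) -> Tuple[float, float]:
    """Score text with TextBlob (third-party; imported only when called)."""
    from textblob import TextBlob

    sentiment = TextBlob(text).sentiment
    return sentiment.polarity, sentiment.subjectivity


def tweet_features(
    text: str,
    created_at: datetime,
    source: str,
    scorer: Scorer = textblob_scorer,  # hypothetical parameter, for easy swapping
) -> Dict[str, object]:
    """Build one row of predictors for a single tweet."""
    polarity, subjectivity = scorer(text)
    return {
        "length": len(text),        # tweet length in characters
        "hour": created_at.hour,    # time of day, as hour 0-23
        "source": source,           # e.g. "iPhone" or "Desktop"
        "polarity": polarity,
        "subjectivity": subjectivity,
    }
```

With TextBlob installed, `tweet_features(text, timestamp, "iPhone")` scores the tweet directly; passing a different `scorer` makes it possible to try another sentiment library without changing the rest of the pipeline.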
Predicting Project Success
My team and I used several modeling techniques to predict the success of government projects. Federal, state, and local governments begin hundreds of new projects every day, but many of them will end over budget or behind schedule, or might not be completed at all. An accurate model of project success would be useful for two reasons: first, an accurate prediction of success could allow a government to cancel failing projects earlier and move funding to more promising ones; second, an interpretable model could yield important inferences about what makes a project successful.
We started by visualizing several project datasets covering New York City, Washington, DC, and federal IT projects. Next, we focused on the New York City dataset and applied several modeling techniques, from simpler models like k-Nearest Neighbors and linear regression to more complex models like generalized additive models and neural networks. Overall, we did not achieve great accuracy with our models, but we did glean important inferences about what determines project success. For example, the team carrying out a project had a large effect on whether it succeeded.
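One way to judge a simple model such as k-Nearest Neighbors is to compare it against a baseline that always predicts the training mean. The sketch below is a dependency-free stand-in written for illustration, not the project's code: it assumes a numeric target (for example, a hypothetical schedule-delay measure) and numeric feature vectors, and every function name is made up here.

```python
import math
from statistics import mean
from typing import Dict, Sequence


def knn_predict(
    train_X: Sequence[Sequence[float]],
    train_y: Sequence[float],
    query: Sequence[float],
    k: int = 3,
) -> float:
    """Average the targets of the k training rows closest to query (Euclidean)."""
    nearest = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    return mean(y for _, y in nearest)


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean of the absolute differences between actual and predicted values."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def compare_to_baseline(
    train_X: Sequence[Sequence[float]],
    train_y: Sequence[float],
    test_X: Sequence[Sequence[float]],
    test_y: Sequence[float],
    k: int = 3,
) -> Dict[str, float]:
    """Held-out error of k-NN next to a predict-the-training-mean baseline."""
    baseline = mean(train_y)
    knn_preds = [knn_predict(train_X, train_y, x, k) for x in test_X]
    return {
        "baseline_mae": mean_absolute_error(test_y, [baseline] * len(test_y)),
        "knn_mae": mean_absolute_error(test_y, knn_preds),
    }
```

If the model's held-out error is not clearly below the baseline's, the features carry little predictive signal, which is a useful check before moving to more complex models.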
Platforms, tools, and systems:
- Python
- NumPy, pandas, Matplotlib, scikit-learn, TensorFlow, Keras