School Projects

Harvard University

COVID-19 News Article Analysis

During "CS205: Computing Foundations for Computational Science", I worked with a team of two other students to analyze a growing collection of COVID-19 news articles. The news article dataset is hosted by The GDELT Project and is a collection of web-scraped news articles relating to the COVID-19 pandemic. My team and I used parallel programming tools we learned in the class to develop a data pipeline for analysis.

The steps of the pipeline are as follows:

Preprocessing: remove stop words and meaningless words like 'said', 'like', 'also', etc.
Top word matrix: for each article in the dataset, count the frequency of words appearing in the article. Initially we did this for the full set of words in all articles, but later we reduced this to a much smaller set of words, like 'coronavirus', 'covid19', 'lockdown', 'health', 'cases', 'pandemic', 'virus', etc. Using a smaller set of words allowed us to process many more articles in a much shorter time.
Co-occurrence matrix: if X is the top word matrix, then the co-occurrence matrix is X^T X, also known as the Gram matrix of X. This is a square n × n matrix, where n is the size of the dictionary used above, and it tells us how frequently each word occurs together with each other word across the set of articles.
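
The last two steps can be sketched in a few lines of plain Python. This is only a small illustration, not the project code (which used PySpark, OpenMP, and MPI); the vocabulary, stop word list, sample articles, and function names below are all made up for the example.

```python
from collections import Counter

# Hypothetical vocabulary, stop words, and articles, for illustration only.
VOCAB = ["coronavirus", "lockdown", "health", "cases"]
STOP_WORDS = {"said", "like", "also", "the", "a", "in", "of"}


def preprocess(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def top_word_matrix(articles, vocab):
    """Return X, where X[i][j] counts how often vocab[j] appears in article i."""
    rows = []
    for text in articles:
        counts = Counter(preprocess(text))
        rows.append([counts[word] for word in vocab])
    return rows


def gram_matrix(X):
    """Return X^T X as an n x n list of lists, where n is the number of columns."""
    n = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(n)]
            for i in range(n)]


articles = [
    "coronavirus cases rise in lockdown",
    "health officials said coronavirus cases also fell",
    "lockdown health health rules",
]
X = top_word_matrix(articles, VOCAB)
C = gram_matrix(X)
```

Here C is symmetric, and its diagonal holds the sum of squared counts for each word. Because each entry of C is an independent sum over articles, the computation splits naturally across threads or processes.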

We implemented the preprocessing step using PySpark and MapReduce. The top word matrix step was implemented using PySpark as well as OpenMP and MPI in C++. Finally, the co-occurrence matrix step was implemented using OpenMP and MPI in C++.

More about the pipeline and our results can be viewed here: https://shravnages.github.io/cs205_fp/.

Platforms, tools, and systems:

C, C++
OpenMP, MPI
Hadoop MapReduce
Spark
Amazon Web Services

AC209A/B: Trump Tweet and Government Project Success

The following are two term projects from AC209A/B at Harvard University.

Donald Trump Tweet Analysis

My team and I were tasked with analyzing President Trump's Twitter feed. This was an open-ended project, and we decided to use sentiment analysis to predict two key Twitter variables: the number of favorites a tweet receives, and the number of followers gained on the following day. All of this data is public and can be downloaded or scraped. We started by visualizing several variables and their interactions, then modeled the two response variables using three techniques: linear regression, XGBoost, and neural networks. Our predictors included the length of the tweet in characters, the time of day, the source of the tweet (iPhone, Desktop, etc.), and numerical scores for polarity (the emotion of the tweet, happy vs. sad or angry) and subjectivity (how emotionally charged the tweet is, factual vs. subjective). The sentiment analysis was conducted using TextBlob. Our models were not very accurate, but we did find, with high statistical significance, that lower polarity and higher subjectivity are associated with higher engagement numbers. We also found that polarity and subjectivity are positively correlated with each other.

Predicting Project Success

My team and I used several modeling techniques to predict the success of government projects. Federal, state, and local governments begin hundreds of new projects every day, but many of them end up over budget or delayed, or are never completed at all. An accurate model of project success would be useful for two reasons: first, an accurate prediction of success could allow a government to cancel bad projects earlier and move funding to more successful projects; second, an interpretable model could yield important inferences about what makes a successful project.

We started by visualizing several project datasets from New York City, Washington DC, and Federal IT projects. Next, we focused on the New York City project dataset and applied several modeling techniques, from simpler models like k-Nearest Neighbors and linear regression to more complex models like generalized additive models and neural networks. Overall, we did not achieve great accuracy with our models, but we did glean important inferences about what determines project success. For example, the team performing a project had a large impact on that project's success.

Platforms, tools, and systems:

Python
Numpy, Pandas, Matplotlib, Scikit-learn, TensorFlow, Keras

Framingham Heart Study Analysis

For my Generalized Linear Models class at Harvard University, I performed an analysis of data from the Framingham Heart Study using generalized linear models, generalized additive models, and tree-based models. I modeled the probability of developing heart disease using logistic regression as a function of several variables, like smoker/nonsmoker, diabetes, blood pressure, age, male/female, etc. Several variables were found to be significant, like age, cigarettes smoked per day, cholesterol, blood pressure, and sex.

Platforms, tools, and systems:

R, RStudio

Time Series Analysis of Agricultural Market Data

For my Time Series class at Harvard University, I performed an analysis of agricultural market data from the last 40 years using time series methods. I modeled the prices of corn, cotton, soybeans, and wheat using ARIMA models. I also modeled the price of each crop as a function of each of the others to see how correlated each pair is. The most correlated pairs are corn-soybeans, corn-wheat, and soybeans-wheat. Cotton was much less correlated with the other three, likely because cotton is a fundamentally different good from the other three.

Platforms, tools, and systems:

R, RStudio

Automatic Differentiation Python Package

For my "Systems Development for Computational Science" class at Harvard University, my team and I developed an automatic differentiation package for Python. Rather than calculating an approximation of a derivative, we calculate the exact derivative to machine precision using the chain rule. We implemented automatic differentiation for functions of one or multiple variables, as well as the ability to calculate derivatives at multiple x values. In addition, for ease of use, we implemented operator overloading and several elementary functions: trigonometric, inverse trigonometric, hyperbolic, logarithmic, square root, and logistic. We also used our package to implement three algorithms: root finding with Newton's method, Broyden–Fletcher–Goldfarb–Shanno (BFGS), and gradient descent.
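
To give a feel for the idea, here is a minimal sketch of forward-mode automatic differentiation with operator overloading, plus Newton's method built on top of it. It is not the code from our package: the class name `Dual` and the helpers `derivative` and `newton` are invented for this example, and only addition, subtraction, multiplication, and sine are covered.

```python
import math


class Dual:
    """A value paired with its derivative (forward-mode AD)."""

    def __init__(self, val, der=0.0):
        self.val = val
        self.der = der

    @staticmethod
    def _lift(other):
        """Treat plain numbers as constants with derivative 0."""
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = Dual._lift(other)
        return Dual(self.val + other.val, self.der + other.der)

    __radd__ = __add__

    def __sub__(self, other):
        other = Dual._lift(other)
        return Dual(self.val - other.val, self.der - other.der)

    def __rsub__(self, other):
        return Dual._lift(other) - self

    def __mul__(self, other):
        other = Dual._lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)

    __rmul__ = __mul__


def sin(x):
    """Chain rule: d/dx sin(u) = cos(u) * u'."""
    x = Dual._lift(x)
    return Dual(math.sin(x.val), math.cos(x.val) * x.der)


def derivative(f, x):
    """Evaluate f'(x) by seeding the input derivative with 1."""
    return f(Dual(x, 1.0)).der


def newton(f, x0, tol=1e-12, max_iter=50):
    """Find a root of f with Newton's method, using the AD derivative."""
    x = x0
    for _ in range(max_iter):
        y = f(Dual(x, 1.0))
        if abs(y.val) < tol:
            break
        x -= y.val / y.der
    return x


root = newton(lambda x: x * x - 2, 1.0)           # converges toward sqrt(2)
slope = derivative(lambda x: x * sin(x), 1.0)     # sin(1) + cos(1) analytically
```

Each overloaded operator applies one differentiation rule, so the derivative is carried along with the value instead of being approximated by finite differences.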

We also hosted our package on (Test)PyPI here: https://test.pypi.org/project/superdifferentiator-cs207-harvarduniv/

Platforms, tools, and systems:

Python
Setuptools, Pip

Centre College

Parallel Graph Theory/Graph Labeling Algorithm

For my "Research in Graph Labelings" and "Parallel Computing" classes at Centre College, I developed a tool for my classmates to use which implements a brute force algorithm to find subtractive vertex magic labelings of lemniscate graphs. It was very difficult to find these by hand, but the algorithm is able to find every labeling for small graphs in seconds, and for large graphs in a few hours or days. We used this program on a 64-core Linux server at Centre College, as well as the Comet cluster at the San Diego Supercomputer Center through an XSEDE grant. This program gave crucial insight into the nature of these labelings and graphs. This allowed us to find patterns in the labelings to attempt to generalize the labeling to any lemniscate graph. A paper is in progress publishing results for magic labelings and lemniscate graphs.

Platforms, tools, and systems:

C++
OpenMP, MPI
Python

Movie Recommendation System

During the summers of 2017 and 2018, I had the opportunity to conduct research with Dr. Michael Lamar and another student at Centre College. During this research project, we developed and implemented a machine learning recommender system model applied to the Netflix Prize dataset. We applied the Co-Occurrence Data Embedding (CODE) algorithm to the Likert-scale data of the Netflix Prize. We learned an embedding of the users and movies, and used this embedding to predict a user's rating of a movie. The algorithm is fast, simple, and intuitive, and the predictions it makes match those of other more complex approaches.

At a high level, the algorithm works as follows. We have a set of observations, each consisting of a user, a movie, and the rating given to the movie in stars from one to five; an observation looks like {user_id: 5, movie_id: 8, rating: 3}. Imagine a unit sphere, or a 3D ball, like a basketball or a soccer ball. For every unique user, we put a random point on the ball. For every movie, we put five random points, one for each rating. We iterate through the data, and for every user giving a movie a rating, we move the user point and the movie-rating point a little bit closer together. To prevent all the points from converging to one spot, we randomly pick a user and a movie-rating and move them apart. We also do some ad-hoc repulsions between pairs of users and pairs of movie-ratings, which seemed to improve the performance. We slow down the movements over time, and after many iterations of the algorithm, we converge to a useful embedding of the data points. We use the distance between a user and each of a movie's rating points to determine the likelihood of the user giving that rating.
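
The attract-and-repel loop described above can be sketched roughly as follows. This is a loose illustration, not our research code: the step size schedule, the function names, and the nearest-point prediction rule are assumptions made for this example, and the ad-hoc user-user and movie-movie repulsions are left out.

```python
import math
import random


def normalize(v):
    """Project a vector back onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def random_unit_vector(rng, dim=3):
    """Sample a random point on the unit sphere."""
    return normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])


def move(p, q, step):
    """Move p toward q (step > 0) or away from q (step < 0)."""
    return normalize([a + step * (b - a) for a, b in zip(p, q)])


def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def train(observations, n_users, n_movies, epochs=200, seed=0):
    """observations is a list of (user_id, movie_id, rating) tuples."""
    rng = random.Random(seed)
    users = [random_unit_vector(rng) for _ in range(n_users)]
    # One point per (movie, rating) pair, for ratings 1 through 5.
    items = {(m, r): random_unit_vector(rng)
             for m in range(n_movies) for r in range(1, 6)}
    keys = list(items)
    for epoch in range(epochs):
        step = 0.1 / (1 + epoch)  # assumed schedule: slow down over time
        for u, m, r in observations:
            # Attract the observed user and movie-rating points.
            old_user = users[u]
            users[u] = move(users[u], items[(m, r)], step)
            items[(m, r)] = move(items[(m, r)], old_user, step)
            # Repel a randomly chosen user and movie-rating point.
            ru, rk = rng.randrange(n_users), rng.choice(keys)
            old_user = users[ru]
            users[ru] = move(users[ru], items[rk], -step)
            items[rk] = move(items[rk], old_user, -step)
    return users, items


def predict(users, items, u, m):
    """Predict the rating whose point lies closest to the user's point."""
    return min(range(1, 6), key=lambda r: distance(users[u], items[(m, r)]))
```

A call such as `train([(0, 0, 5), (1, 0, 1)], n_users=2, n_movies=1)` returns the two sets of points, and `predict` then picks one of the five rating points for a given user and movie.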

We presented a poster at the 2019 Joint Math Meetings, which is shown below.

Platforms, tools, and systems:

C++
Python

TubeMaster Database Application

I worked with a team of students in our Database Systems class to develop an application for TubeMaster, a real-life client in Louisville, KY. We designed the database and implemented the application using PostgreSQL and PHP.

Platforms, tools, and systems:

HTML, CSS, JavaScript, Bootstrap
PHP
PostgreSQL
Google Cloud Platform

Department Assessment Project

I worked with a team of students in our Software Engineering class to develop a tool which simplifies and standardizes the yearly assessment report for each of the major departments at Centre College.

Platforms, tools, and systems:

HTML, CSS, JavaScript, Bootstrap
MEAN: MongoDB, Express.js, Angular.js, Node.js

AI Projects

In my "Artificial Intelligence" class (CSC 339), we explored many topics of AI, such as A* and other search techniques, boolean satisfiability (SAT) and SAT solvers, Bayesian networks, and more. We had projects due approximately every two weeks. I did all of these projects in Java, but we were allowed to use any language we wanted to.

Modeling and solving the game Back2Back using A*
Writing my own SAT solver, using both DPLL and WalkSAT
Reducing hexadecimal Sudoku to SAT and solving it using a SAT solver
Writing a Bayesian network
Writing a 2x2 Rubik's Cube solver using A*
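
As an illustration of the SAT solver project, here is a minimal sketch of the DPLL procedure. The original project was written in Java; this Python version is written for this page only, and the function names and the DIMACS-style clause encoding are choices made for the example.

```python
def simplify(clauses, literal):
    """Set literal to True: drop satisfied clauses and remove the
    negated literal from the remaining ones."""
    result = []
    for clause in clauses:
        if literal in clause:
            continue
        result.append([l for l in clause if l != -literal])
    return result


def dpll(clauses, assignment=None):
    """Return a satisfying assignment {var: bool}, or None if unsatisfiable.

    Clauses are lists of nonzero ints: 3 means x3, -3 means NOT x3.
    Variables left out of the result can take either value.
    """
    assignment = dict(assignment or {})
    # Unit propagation: a one-literal clause forces that literal.
    while True:
        if any(len(c) == 0 for c in clauses):
            return None  # empty clause means a conflict
        units = [c[0] for c in clauses if len(c) == 1]
        if not units:
            break
        lit = units[0]
        assignment[abs(lit)] = lit > 0
        clauses = simplify(clauses, lit)
    if not clauses:
        return assignment  # every clause is satisfied
    # Branch on the first literal of the first remaining clause.
    lit = clauses[0][0]
    for choice in (lit, -lit):
        trial = dict(assignment)
        trial[abs(choice)] = choice > 0
        result = dpll(simplify(clauses, choice), trial)
        if result is not None:
            return result
    return None


model = dpll([[1, 2], [-1, 3], [-2, -3]])
unsat = dpll([[1, 2], [1, -2], [-1, 2], [-1, -2]])  # None: no assignment works
```

A puzzle such as Sudoku can be handed to a solver like this once each cell-value choice is encoded as a Boolean variable and each rule as a set of clauses.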

Platforms, tools, and systems:

Java, SAT4J library

Machine Learning Projects

In my "Statistical Modeling" (MAT 205) and "Machine Learning" (CSC 420) classes, we learned the theory behind several mathematical machine learning models: linear and logistic regression, Naïve Bayes, k-Nearest Neighbors, Support Vector Machines, neural networks, and others. We also had several projects applying these concepts to real-world data. I have included these projects, and some others that I explored on my own, here.

Platforms, tools, and systems:

R, RStudio
Python
Scikit-learn, TensorFlow, Keras

Dungeon Adventure Game

This is my final project for my "Software Development" class (CSC 300). It is a first-person game set in a dungeon. There are monsters to fight and items you can pick up, such as health potions and weapons. The class worked on this project for a large part of the semester, with weekly deadlines for each stage of the work.

Platforms, tools, and systems:

Java

Word Predictor Program

This is my final project for my "Data Structures and Intermediate Programming" class (CSC 223). It reads in a large amount of text from a file (we used the full text of Moby Dick), then lets you enter letters, and based on the frequency of matching words in the book, tries to predict your word.
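
The prediction idea can be sketched as follows. The original program was written in Java and its data structure is not described here, so this Python version, with its function names and simple linear scan over a word-count table, is only an assumed illustration.

```python
from collections import Counter
import re


def build_counts(text):
    """Count how often each lowercase word appears in the text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def predict(counts, prefix, k=3):
    """Return up to k words starting with prefix, most frequent first."""
    prefix = prefix.lower()
    matches = [(w, c) for w, c in counts.items() if w.startswith(prefix)]
    # Sort by descending count, then alphabetically to break ties.
    matches.sort(key=lambda wc: (-wc[1], wc[0]))
    return [w for w, _ in matches[:k]]


# Tiny stand-in for the book text, for illustration only.
sample = "the whale the white whale and the whaling ship"
counts = build_counts(sample)
suggestions = predict(counts, "wh")  # ['whale', 'whaling', 'white']
```

For a full book, a trie keyed on letters would avoid scanning every word on each keystroke, but the ranking by frequency would stay the same.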

Platforms, tools, and systems:

Java

Sayre School

Sayre Blue Gold

Sayre Blue Gold is a website I developed with a classmate for an independent study at my high school. It is a website for hosting the WSMS podcast that the school puts out.

Platforms, tools, and systems:

HTML, CSS, JavaScript, Bootstrap, jQuery
PHP
MySQL