LAS VEGAS – A team of students from Université Laval won an international student competition with a very wordy analysis at the SAS Global Forum.
Analytics software powerhouse SAS challenged post-secondary students to use SAS Analytics to process big data. Each student team, guided by one faculty member, could select one of eight publicly available data sets and define a problem it would attempt to solve with SAS software. Each team also had to submit a paper detailing its solution.
The team from Université Laval, called GTDStat, won with an analysis of Google’s Ngram Data Set. Titled “A General Method to Take the Growth of the Scientific Literature Into Account,” the paper addresses a fundamental problem with Google’s data set – it’s skewed towards scientific literature.
Ngram is a tool that invites users to compare the popularity of words in published works over the past two centuries, so the data set needs a balanced representation of literary works to give an accurate answer. That’s not the case, because Google’s data set has a big spike in the number of scientific works published towards the latter half of the 20th century. The student team was able to prove that using a simple approach, with only five variables available to them.
The word “Figure,” used most commonly to denote charts and graphs, became incredibly popular with the rise of scientific literature, and this was reflected in the data set. The finding shows that users of Google’s Ngram tool must consider the answers they’re receiving with care.
We spoke with Université Laval PhD candidate Aurélien Nicosia about winning the competition and the potential future applications of the team’s solution in the business world.