Big Data meets a social network
So how much data is needed for it to be considered “Big Data”?
Are 22 million rows of data on your computer screen enough?
That’s the unfathomable amount of information Warrington Ph.D. student Mohammadmahdi (Mahdi) Moqri has managed over the past nine months while analyzing the world’s most popular open source software community. Moqri will present his findings at the Teradata University Network (TUN) 2015 Business Analytics Competition beginning Sunday in Anaheim, Calif.
TUN’s Business Analytics Competition is an opportunity for students to “present their business analytics research or application cases to professionals in the Business Analytics community.” This sharing of knowledge is at the heart of Moqri’s research regarding open source software communities.
Open source software communities are online sites where programmers and developers can collaborate on source code without ownership, copyright or licensing restrictions. A programmer posts source code he or she has been working on, and fellow programmers offer insights, suggestions and modifications. Moqri said some of the contributors to these forums are highly sought-after programmers who shun working for major corporations because they enjoy the collaboration and transparency of open source communities.
Moqri has examined the comments made by these programmers on GitHub, the world’s most popular open source software community. In total, Moqri has accounted for the comments of four million developers over the past seven years, resulting in 6 terabytes of data, which equals roughly 2.5 billion single-spaced typewritten pages.
“Imagine a network of these people where you have four or five million nodes, and every node is connected to several other nodes,” said Moqri, a third-year Ph.D. student in Warrington’s Department of Information Systems & Operations Management. “There could be a billion connections between them.”
While previous studies have examined what motivates these programmers to contribute to these communities, Moqri is attempting to learn what effects, if any, social factors have on the number of contributions. For instance, do GitHub users contribute more when they pick up a new follower? Moqri, who is collaborating with Warrington professors Subhajyoti Bandyopadhyay, Ira Horowitz and Liangfei Qiu, said the research hasn’t yet established a direct causal link between contributions and new followers, but there is certainly a relationship between the two.
Conducting this research was a formidable task. Moqri said the standard software programs he had used in the past could not handle data of this size. So he used Google’s powerful cloud-based query service, BigQuery, to organize the data, and UF’s $3.4 million supercomputer, HiPerGator, to process it.
“By itself, when you look at the data and the scale of it, this is more or less unprecedented for research involving open source communities,” said Dr. Bandyopadhyay, Susan Cameron Professor and Moqri’s faculty advisor for this project. “I don’t think we have seen any research that looks at data at this kind of scale.”
Big Data has been a passion for Moqri for some time, and his enthusiasm for the topic is beginning to excite students and faculty in the ISOM department. For Moqri, Big Data is one of the most important and powerful tools in business.
“The amount of growth in data is higher than the capacity to store it, as are people’s skills to work with this data,” Moqri said. “So we’re for sure falling behind this growth both in terms of people’s capacity to manage them and hardware capacity.
“For the hardware, we can’t do anything. For the skills, now is the best time to get those skills and work with data as it is growing. It is a power. You can see what is being done with Big Data in any industry.”
Did You Know?
• Mahdi, along with Dr. Bandyopadhyay and former Warrington Ph.D. student Brent Kitchens, received the Most Promising Research/Advancing Science Award at TUN’s 2014 PARTNERS Conference in Nashville.
• Mahdi was born and raised in Iran. He earned a Bachelor of Science in Computer Science from Iran’s Sharif University of Technology and a Master of Science in Industrial Engineering from Iran University of Technology. He also has an MBA from the University of Massachusetts-Boston.
• Mahdi’s next research project involves Twitter. Thus far, he has collected more than 20 million tweets with a projected goal of about 45 million.
• Mahdi is a Data Ambassador for DataKind, a non-profit that connects data scientists with other non-profits to advance social change.