Modeling Latent Open Source Developer Communities: Evaluation of Relational Metrics
Reconstructing the social network of developers who create open source software is non-trivial. In the past, researchers have applied various methods to unearth these interactions. In this paper we evaluate several community discovery methods.
We propose a two-stage community discovery strategy that consists of creating a Developer Profile and calculating Developer Similarity. The Developer Profile is a vector that contains some measure of a developer's contribution to the files within a source code management tool, such as Subversion. Developer Similarity is calculated by relating two Developer Profiles using a secondary metric. We then compare the results to a baseline community taken from the Apache Software Foundation. We find that using Cosine Similarity and Pearson's Product-Moment Correlation Coefficient as Developer Similarity metrics produces the best results when paired with a Developer Profile created from the number of lines removed from files within the source code management tool.
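The two-stage strategy described above can be illustrated with a minimal sketch. It assumes a Developer Profile is a plain list of per-file counts (here, lines removed) in a fixed file order; the function names and the sample profiles are hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length profile vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch: empty profile -> 0
    return dot / (norm_a * norm_b)


def pearson_correlation(a, b):
    """Pearson's product-moment correlation of two equal-length vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    dev_a = [x - mean_a for x in a]
    dev_b = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in dev_a)) * math.sqrt(
        sum(y * y for y in dev_b)
    )
    if denom == 0:
        return 0.0  # convention chosen for this sketch: constant profile -> 0
    return sum(x * y for x, y in zip(dev_a, dev_b)) / denom


# Hypothetical Developer Profiles: lines removed from four files,
# listed in the same file order for every developer.
alice = [10, 0, 5, 0]
bob = [20, 0, 10, 0]
carol = [0, 7, 0, 3]

sim_alice_bob = cosine_similarity(alice, bob)
sim_alice_carol = cosine_similarity(alice, carol)
corr_alice_bob = pearson_correlation(alice, bob)
corr_alice_carol = pearson_correlation(alice, carol)
```

In this toy data, alice and bob touch the same files in the same proportions, so both metrics rate them as highly similar, while alice and carol share no files and score low. How such pairwise scores are turned into communities is outside the scope of this sketch.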
Apache Commits: Social Network Dataset
Building non-trivial software is a social endeavor. Therefore, understanding the social network of developers is key to the study of software development organizations. We present a graph representation of the commit behavior of developers within the Apache Software Foundation for 2010 and 2011. Relationships between developers in the network represent collaborative commit behavior. Several similarity and summary metrics have been pre-calculated.
Reflexivity, Raymond, and the Success of Open Source Software Development: A SourceForge Empirical Study
Context: Conventional wisdom, inspired in part by Eric Raymond, suggests that open source developers should—and primarily do—develop software for developers like themselves. We refer to the production of software primarily for the benefit of developers as reflexivity, and we evaluate the applicability of this concept to open source software (OSS) by studying SourceForge projects. Objective: The goal of this research is to test Eric Raymond's assertions with respect to OSS success factors. Method: We present four criteria by which to assess project reflexivity in SourceForge. These criteria are based on three specific indicators: intended audiences, relevant topics, and supported operating systems. Results: We show in this short paper that 68% of SourceForge projects are likely reflexive (in the sense described by Raymond). Further, 76% of projects exceeding one million downloads are reflexive, 79% for projects exceeding ten million downloads, and 89% for projects exceeding one hundred million downloads. Conclusion: These results tentatively support Raymond's assertions that 1) OSS projects tend to be reflexive and 2) reflexive OSS projects tend to be more successful than irreflexive projects. Causality, however, is not addressed.
Prevalence of Reflexivity and Its Impact on Success in Open Source Software Development: An Empirical Study
Conventional wisdom, inspired in part by Eric Raymond, suggests that open source developers primarily develop software for developers like themselves. In our studies we distinguish between reflexive software (software written primarily for other developers) and irreflexive software (software written primarily for passive users). In the first study, we present four criteria which we then use to assess project reflexivity in SourceForge. These criteria are based on three specific indicators: intended audience, relevant topics, and supported operating systems. Based on our criteria, we find that 68% of SourceForge projects are reflexive (in the sense described by Raymond). In the second study, we randomly sample and statistically estimate reflexivity within SourceForge. Our results support Raymond's assertions that 1) OSS projects tend to be reflexive and 2) reflexive OSS projects tend to be more successful than irreflexive projects. We also find a decrease in reflexivity from a high in 2001 to a low in 2011.
Cliff Walls: Threats to Validity in Empirical Studies of Open Source Forges
Artifact-based research provides a mechanism whereby researchers may study the creation of software yet avoid many of the difficulties of direct observation and experimentation. Open source software forges are of great value to the software researcher, because they expose many of the artifacts of software development. However, many challenges affect the quality of artifact-based studies, especially those studies examining software evolution. This thesis addresses one of these threats: the presence of very large commits, which we refer to as "Cliff Walls." Cliff walls are a threat to studies of software evolution because they do not appear to represent incremental development. In this thesis we demonstrate the existence of cliff walls in open source software projects and discuss the threats they present. We also seek to identify key causes of these monolithic commits, and begin to explore ways that researchers can mitigate the threats of cliff walls.
Commit Patterns and Threats to Validity in Analysis of Open Source Software Repositories
In the course of studying the effects of programming in multiple languages, we unearthed troubling trends in SourceForge artifacts. Our initial studies suggest that programming in multiple languages concurrently negatively affects developer productivity. While addressing our initial question of interest, we discovered a pattern of monolithic commits in the SourceForge community. Consequently, we also report on the effects that this pattern of commits can have when using SourceForge as a data source for temporal analysis of open source projects or for studies of individual developers.
Analysis and Characterization of Author Contribution Patterns in Open Source Software Development
Software development is a process fraught with unpredictability, in part because software is created by people. Human interactions add complexity to development processes, and collaborative development can become a liability if not properly understood and managed. Recent years have seen an increase in the use of data mining techniques on publicly-available repository data with the goal of improving software development processes, and by extension, software quality. In this thesis, we introduce the concept of author entropy as a metric for quantifying interaction and collaboration (both within individual files and across projects), present results from two empirical observational studies of open-source projects, identify and analyze authorship and collaboration patterns within source code, demonstrate techniques for visualizing authorship patterns, and propose avenues for further research.
Python is a modern scripting language that has embraced a largely object-oriented framework, but has also supported a number of functional programming constructs. In previous work, we introduced extensions to increase the functional programming capabilities of the language and we also introduced a novel pure-Python module that implemented a logic programming style pseudo-syntax. That module was purely academic and was significantly limited in scope and expressiveness. In this paper, we present the newly updated PyLogical module by first reviewing the philosophy behind the mixing of the two paradigms, then giving a brief overview of the updated pseudo-syntax, and finally comparing this syntax with that of Prolog. We note that our new module is capable of expressing almost all Prolog language features, including DCGs, with minimal syntactic overhead.
Report from the 2nd International Workshop on Replication in Empirical Software Engineering Research (RESER 2011)
The RESER workshop provides a venue in which empirical software engineering researchers can discuss the theoretical foundations and methods of replication, as well as present the results of specific replicated studies. In 2011, the workshop was co-located with the International Symposium on Empirical Software Engineering and Measurement (ESEM) in Banff, Alberta, Canada. In addition to several outstanding paper sessions, highlights of the 2011 workshop included a keynote address by Dr. Victor R. Basili, in which he addressed the question, "What's so hard about replication of software engineering experiments?" The workshop also featured a joint replication panel session discussing the first cooperative joint replication ever conducted in empirical software engineering research and a planning session for next year's joint replication project addressing Conway's Law.
Programming Language Fragmentation and Developer Productivity: An Empirical Study
In an effort to increase both the quality of software applications and the efficiency with which applications can be written, developers often incorporate multiple programming languages into software projects. Although language specialization arguably introduces benefits, the total impact of the resulting language fragmentation (working concurrently in multiple programming languages) on developer performance is unclear. For instance, developers may solve problems more efficiently when they have multiple language paradigms at their disposal. However, the overhead of maintaining efficiency in more than one language may outweigh those benefits.
This thesis represents a first step toward understanding the relationship between language fragmentation and programmer productivity. We address that relationship within two different contexts: 1) the individual developer, and 2) the overall project. Using a data-centered approach, we 1) develop metrics for measuring productivity and language fragmentation, 2) select data suitable for calculating the needed metrics, 3) develop and validate statistical models that isolate the correlation between language fragmentation and individual programmer productivity, 4) develop additional methods to mitigate threats to validity within the developer context, and 5) explore limitations that need to be addressed in future work for effective analysis of language fragmentation within the project context using the SourceForge data set. Finally, we demonstrate that within the open source software development community, SourceForge, language fragmentation is negatively correlated with individual programmer productivity.
Knowledge Homogeneity and Specialization in the Apache HTTP Server Project
We present an analysis of developer communication in the Apache HTTP Server project. Using topic modeling techniques we expose latent conceptual sub-communities arising from developer specialization within the greater developer population. However, we found that among the major contributors to the project, very little specialization exists. We present theories to explain this phenomenon, and suggest further research.
Open Source: From Mythos to Meaning
Free open source software (FOSS) projects expose rich development, evolutionary, and collaborative data from which researchers have formed theories and conclusions about the FOSS development ecosystem. However, little work has been done to determine whether FOSS projects are analogous to proprietary development efforts. We propose several axes along which taxonomies of FOSS and proprietary projects may be created and compared, and preview several future studies that will begin to populate these taxonomies.
Cliff Walls: An Analysis of Monolithic Commits Using Latent Dirichlet Allocation
Artifact-based research provides a mechanism whereby researchers may study the creation of software yet avoid many of the difficulties of direct observation and experimentation. However, there are still many challenges that can affect the quality of artifact-based studies, especially those studies examining software evolution. Large commits, which we refer to as "Cliff Walls," are one significant threat to studies of software evolution because they do not appear to represent incremental development. We used Latent Dirichlet Allocation to extract topics from over 2 million commit log messages, taken from 10,000 SourceForge projects. The topics generated through this method were then analyzed to determine the causes of over 9,000 of the largest commits. We found that branch merges, code imports, and auto-generated documentation were significant causes of large commits. We also found that corrective maintenance tasks, such as bug fixes, did not play a significant role in the creation of large commits.
An Analysis of Author Contribution Patterns in Eclipse Foundation Project Source Code
Collaborative development is a key tenet of open source software, but if not properly understood and managed, it can become a liability. We examine author contribution data for the newest revision of 251,633 Java source files in 592 Eclipse projects. We use this observational data to analyze collaboration patterns within files, and to explore relationships between file size, author count, and code authorship. We calculate author entropy to characterize the contributions of multiple authors to a given file, with an eye toward understanding the degree of collaboration and the most common interaction patterns.
Design Team Perception of Development Team Composition: Implications for Conway's Law
Conway's law, the idea that a software system reflects the structure of the organization that built it, is one of the most well-known "laws" in software engineering. However, the seemingly straightforward phenomenon described by Conway appears to be subject to nuances of personal and organizational dynamics as well as contextual factors, most of which are neither well-understood nor well-studied. As a pilot study intended to foster discussion within the RESER community, we performed a small and somewhat informal qualitative study designed to elucidate some of these nuances. We posited that the designers' perception of the ultimate composition of the development team would affect the resultant system architecture more so than would the actual composition of the design team. The results of the pilot study support this hypothesis and are intended as a motivator for on-going discussion, as well as a catalyst for more thorough and formal differentiated replications, to explore and elucidate the nuances of Conway's law.
Design Patterns in Software Maintenance: An Experiment Replication at Brigham Young University
In 2001 Prechelt et al. published the results of a controlled experiment in software maintenance comparing design patterns to simpler solutions. Since that time, only one replication of the experiment has been performed (published in 2004). The replication found remarkably (though not surprisingly) different results. In this paper we present the results of another replication of Prechelt's experiment, conducted at Brigham Young University (BYU) in 2010. This replication was performed as part of a joint replication project hosted by the 2011 Workshop on Replication in Empirical Software Engineering Research (RESER). The data and results from this experiment are meant to be considered in connection with the results of other contributions to the joint replication project.
The Problem of Private Information in Large Software Organizations
Coordination of project stakeholders is critical to timely and consistent software delivery. In this short paper we present the problem of private information as a guiding framework or lens through which to interpret coordination dynamics within software organizations. We provide evidence of this problem in the form of specific challenges, collected via interviews from a diverse set of extended (i.e., non-development) stakeholders in a globally distributed software development organization.
A Reusable Persistence Framework for Replicating Empirical Studies on Data From Open Source Repositories
Empirical research is inexact and error-prone, leading researchers to agree that replication of experiments is a necessary step in validating empirical results. Unfortunately, replicating experiments requires substantial investments in manpower and time. These resource requirements can be reduced by incorporating component reuse when building tools for empirical experimentation. Bokeo is an initiative within the Sequoia Lab of the BYU Computer Science Department to develop a platform to assist in the empirical study of software engineering. The i3Persistence Framework is a component of Bokeo which enables researchers to easily build and rapidly deploy tools for empirical experiments by providing an easy-to-use database management service. We introduce the i3Persistence Framework of Bokeo to assist in the development of software to replicate experiments and conduct studies on data from open-source repositories.
Report from the 1st International Workshop on Replication in Empirical Software Engineering Research (RESER 2010)
The RESER 2010 Workshop, held on May 4, 2010 in Cape Town, South Africa, was co-located with the 32nd International Conference on Software Engineering (ICSE 2010). The workshop provided a venue in which empirical Software Engineering researchers could present and discuss the theoretical foundations and methods of replication, as well as the results of specific replicated studies.
Applications of Data Mining in Software Engineering
Software engineering processes are complex, and the related activities often produce a large number and variety of artefacts, making them well-suited to data mining. Recent years have seen an increase in the use of data mining techniques on such artefacts with the goal of analysing and improving software processes for a given organisation or project. After a brief survey of current uses, we offer insight into how data mining can make a significant contribution to the success of current software engineering efforts.
Impact of Programming Language Fragmentation on Developer Productivity: a SourceForge Empirical Study
Programmers often develop software in multiple languages. In an effort to study the effects of programming language fragmentation on productivity—and ultimately on a developer's problem-solving abilities—we present a metric, language entropy, for characterizing the distribution of a developer's programming efforts across multiple programming languages. We then present an observational study examining the project contributions of a random sample of 500 SourceForge developers. Using a random coefficients model, we find a statistically (alpha level of 0.001) and practically significant correlation between language entropy and the size of monthly project contributions. Our results indicate that programming language fragmentation is negatively related to the total amount of code contributed by developers within SourceForge, an open source software (OSS) community.
Trends That Affect Temporal Analysis Using SourceForge Data
SourceForge is a valuable source of software artifact data for researchers who study project evolution and developer behavior. However, the data exhibit patterns that may bias temporal analyses. Most notable are cliff walls in project source code repository timelines, which indicate large commits that are out of character for the given project. These cliff walls often hide significant periods of development and developer collaboration—a threat to studies that rely on SourceForge repository data. We demonstrate how to identify these cliff walls, discuss reasons for their appearance, and propose preliminary measures for mitigating their effects in evolution-oriented studies.
1st International Workshop on Replication in Empirical Software Engineering Research (RESER)
The RESER 2010 workshop provides a venue in which empirical Software Engineering researchers may present and discuss theoretical foundations and methods of replication, as well as the results of replicated studies.
A Case for Replication: Synthesizing Research Methodologies in Software Engineering
Software Engineering (SE) problems are—from both practical and theoretical standpoints—immensely complex, involving interactions between technical, behavioral, and social forces. In an effort to dissect this complexity, SE researchers have incorporated a variety of research methods. Recently, the field has entered a paradigm shift—a broad awakening to the social aspects of software development. As a result, and in concert with an ongoing struggle to establish SE research as an empirical discipline, SE researchers are increasingly appropriating methodologies from other fields. In the wake of this self-discovery, the field is entering a period of methodological flux, during which it must establish for itself effective research practices. We present a unifying framework for organizing research methods in SE. In the process of elucidating this framework, we dissect the current literature on replication methods and place replication appropriately within the framework. We also further clarify, from a high level and with respect to SE, the mechanisms through which science builds usable knowledge.
Threats to Validity in Analysis of Language Fragmentation on SourceForge Data
Reaching general conclusions through analysis of SourceForge data is difficult and error prone. Several factors conspire to produce data that is sparse, biased, masked, and ambiguous. We explore these factors and the negative effect that they had on the results of "Impact of Programming Language Fragmentation on Developer Productivity: a SourceForge Empirical Study." In addition, we question the validity of evolutionary or temporal analysis of development practices based on this data.
Mining Programming Language Vocabularies from Source Code
We can learn much from the artifacts produced as the by-products of software development and stored in software repositories. Of all such potential data sources, one of the most important from the perspective of program comprehension is the source code itself. While other data sources give insight into what developers intend a program to do, the source code is the most accurate human-accessible description of what it will do. However, the ability of an individual developer to comprehend a particular source file depends directly on his or her familiarity with the specific features of the programming language being used in the file. This is not unlike the difficulties second-language learners may encounter when attempting to read a text written in a new language. We propose that by applying the techniques used by corpus linguists in the study of natural language texts to a corpus of programming language texts (i.e., source code repositories), we can gain new insights into the communication medium that is programming language. In this paper we lay the foundation for applying corpus linguistic methods to programming language by 1) defining the term “word” for programming language, 2) developing data collection tools and a data storage schema for the Java programming language, and 3) presenting an initial analysis of an example linguistic corpus based on version 1.5 of the Java Development Kit.
Language Entropy: A Metric for Characterization of Author Programming Language Distribution
Programmers are often required to develop in multiple languages. In an effort to study the effects of programming language fragmentation on productivity—and ultimately on a programmer’s problem solving abilities—we propose a metric, language entropy, for characterizing the distribution of an individual’s development efforts across multiple programming languages. To evaluate this metric, we present an observational study examining all project contributions (through August 2006) of a random sample of 500 SourceForge developers. Using a random coefficients model, we found a statistically significant correlation (alpha level of 0.05) between language entropy and the size of monthly project contributions (measured in lines of code added). Our results indicate that language entropy is a good candidate for characterizing author programming language distribution.
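As a rough illustration of how a metric of this kind can be computed, the sketch below assumes language entropy is the Shannon entropy (base 2) of the share of a developer's contributed lines in each language. The abstract does not give the formula, so this definition, the function name, and the sample counts are all assumptions made for illustration.

```python
import math


def language_entropy(lines_by_language):
    """Shannon entropy (in bits) of a developer's per-language line counts.

    Assumed definition for this sketch: -sum(p * log2(p)), where p is the
    fraction of the developer's lines written in each language.
    """
    total = sum(lines_by_language.values())
    if total == 0:
        return 0.0  # no contributions: treat entropy as zero
    entropy = 0.0
    for count in lines_by_language.values():
        if count > 0:  # languages with no lines contribute nothing
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical monthly contributions (lines of code added per language).
single_language = language_entropy({"Java": 100})
two_even = language_entropy({"Java": 50, "Python": 50})
four_even = language_entropy({"Java": 25, "Python": 25, "C": 25, "Perl": 25})
```

Under this assumed definition, a developer working in one language has entropy 0, and the value grows as effort is spread more evenly over more languages.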
Author Entropy Vs. File Size in the Gnome Suite of Applications
We present the results of a study in which author entropy was used to characterize author contributions per file. Our analysis reveals three patterns: banding in the data, uneven distribution of data across bands, and file size dependent distributions within bands. Our results suggest that when two authors contribute to a file, large files are more likely to have a dominant author than smaller files.
The 20-Minute Genealogist: A Context-Preservation Metaphor for Assisted Family History Research
What can you possibly do to be productive as a family history researcher in 20 minutes per week? Our studies suggest that currently the answer is, "Nothing." In 20 minutes a would-be researcher can’t even remember what happened last week, let alone what they were planning to do next. The 20-Minute Genealogist is a powerful metaphor within which software solutions must consider context preservation as the fundamental domain of the system, thus freeing the researcher to do research while the software manages the tasks that computers do best. Two survey-based studies were conducted that indicate a significant disconnect between the values espoused by would-be researchers and the actual level of time spent by those same individuals. Our preliminary results suggest that the overhead involved in context preservation is the predominant inhibitor of family history research productivity among those who claim that such work is very important, yet fail in their efforts.
Author Entropy: A Metric for Characterization of Software Authorship Patterns
We propose the concept of author entropy and describe how file-level entropy measures may be used to understand and characterize authorship patterns within individual files, as well as across an entire project. As a proof of concept, we compute author entropy for 28,955 files from 33 open-source projects. We explore patterns of author entropy, identify techniques for visualizing author entropy, and propose avenues for further study.
Programming Language Trends in Open Source Development: An Evaluation Using Data from All Production Phase SourceForge Projects
In this work, we analyze data collected from the CVS repositories of 9,997 Open Source projects hosted on SourceForge in an effort to understand trends in programming language usage in the Open Source community between 2000 and 2005. The trends we consider include: 1) the relative popularity of the ten most popular programming languages over time, 2) the use of multiple programming languages by individual programmers and by individual projects, and 3) the programming languages most often used in combination.
Studying Production Phase SourceForge Projects: A Case Study Using cvs2mysql and SFRA+
A wealth of data can be extracted from the natural byproducts of software development processes and used in empirical studies of software engineering. However, the size and accuracy of such studies depend in large part on the availability of tools that facilitate the collection of data from individual projects and the combination of data from multiple projects. To demonstrate this point, we present our experience gathering and analyzing data from nearly 10,000 open source projects hosted on SourceForge. We describe the tools we developed to collect the data and the ways in which these tools and data may be used by other researchers. We also provide examples of statistics that we have calculated from these data to describe interesting author- and project-level behaviors of the SourceForge community.
Do Programming Languages Affect Productivity? A Case Study Using Data from Open Source Projects
Brooks and others long ago suggested that on average computer programmers write the same number of lines of code in a given amount of time regardless of the programming language used. We examine data collected from the CVS repositories of 9,999 open source projects hosted on SourceForge.net to test this assumption for 10 of the most popular programming languages in use in the open source community. We find that for 24 of the 45 pairwise comparisons, the programming language is a significant factor in determining the rate at which source code is written, even after accounting for variations between programmers and projects.
Observational Studies of Software Engineering Using Data from Software Repositories
Data for empirical studies of software engineering can be difficult to obtain. Extrapolations from small controlled experiments to large development environments are tenuous, and observation tends to change the behavior of the subjects. In this thesis we propose the use of data gathered from software repositories in observational studies of software engineering. We present tools we have developed to extract data from CVS repositories and the SourceForge Research Archive. We use these tools to gather data from 9,999 Open Source projects. By analyzing these data we are able to provide insights into the structure of Open Source projects. For example, we find that the vast majority of the projects studied have never had more than three contributors and that the vast majority of authors studied have never contributed to more than one project. However, there are projects that have had up to 120 contributors in a single year and authors who have contributed to more than 20 projects, which raises interesting questions about team dynamics in the Open Source community. We also use these data to empirically test the belief that productivity is constant in terms of lines of code per programmer per year regardless of the programming language used. We find that yearly programmer productivity is not constant across programming languages, but rather that developers using higher level languages tend to write fewer lines of code per year than those using lower level languages.
Design Dysphasia and the Pattern Maintenance Cycle
Software developers use design methods that enable them to manipulate conceptual structures that correlate to programming language features. However, language evolution weakens the design-implementation interface, introducing what we call "design dysphasia," a partial disability in the use of programming language because of incongruous design methods.
Software design patterns are a popular design method that capture elements of reusable design within a specific context. When the programming languages that are part of pattern context evolve, patterns must adapt to the language change or they may reinforce design dysphasia in the practitioner. We assert that the current "capture/recapture" pattern maintenance model is suboptimal for adapting patterns to language evolution and propose a new "capture/modify/recapture" maintenance cycle as a more effective approach. We then suggest a concrete "modify" phase for current patterns to be adapted to OO++ language trends and present an OO++ Iterator pattern example.
OO++ Design Patterns: GOF Revisited
Programming languages and the programming paradigms they embody co-evolve over time. In many circles, for example, object-oriented programming has evolved from and effectively replaced imperative programming. More recently, many object-oriented languages have assimilated features from other programming paradigms, evolving into multiparadigm languages we refer to as "object-oriented plus-plus" or OO++. However, language evolution, like that seen in OO++, weakens the design-implementation interface, introducing what we call "design dysphasia," a partial disability in the use of programming language because of incongruous design methods. Design dysphasia persists until design methods are extended to match evolved language features.
One popular contemporary design method is the use of software design patterns. These patterns capture elements of design that can be reused within a specific context. When the programming languages that are part of pattern context evolve, patterns must adapt to the language changes. Otherwise they may reinforce design dysphasia in the practitioner. Because of this, the current pattern maintenance model of "capture/recapture" is suboptimal.
This thesis presents an investigation of the shift in contemporary object-oriented languages to OO++ and analyzes the characteristics of the OO++ paradigm. The nature of design dysphasia is defined and discussed generally and in the context of software design patterns. A "capture/modify/recapture" maintenance model is presented as an effective replacement for the "capture/recapture" cycle. A concrete "modify" phase is defined for the adaptation of existing object-oriented patterns to OO++ languages, illustrated through the adaptation of the 23 patterns presented in Design Patterns by Gamma et al. to OO++ variants.
OO++: Exploring the Multiparadigm Shift
Programming languages and the programming paradigms they embody co-evolve over time. Within industrial and academic circles, for example, object-oriented programming has evolved from and effectively replaced imperative programming. More recently, many object-oriented languages have assimilated features from other programming paradigms, evolving into multiparadigm languages we refer to as "object-oriented plus-plus" or OO++. In this paper we survey the capabilities of six OO++ languages, present OO++ code samples in Python, and propose key characteristics of an OO++ programming paradigm.