Publications

Language Entropy: A Metric for Characterization of Author Programming Language Distribution

Jonathan L. Krein and Alexander C. MacLean and Daniel P. Delorey and Dennis L. Eggett and Charles D. Knutson
Fourth International Workshop on Public Data about Software Development (WoPDaSD '09)
June, 2009

Programmers are often required to develop in multiple languages. In an effort to study the effects of programming language fragmentation on productivity—and ultimately on a programmer’s problem solving abilities—we propose a metric, language entropy, for characterizing the distribution of an individual’s development efforts across multiple programming languages. To evaluate this metric, we present an observational study examining all project contributions (through August 2006) of a random sample of 500 SourceForge developers. Using a random coefficients model, we found a statistically significant correlation (alpha level of 0.05) between language entropy and the size of monthly project contributions (measured in lines of code added). Our results indicate that language entropy is a good candidate for characterizing author programming language distribution.
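
As an illustration only, the idea of characterizing how a developer's effort is spread across languages can be sketched as a Shannon entropy over per-language contribution shares. The abstract does not give the formula, so the definition below (entropy in bits over the fraction of lines added per language) is an assumption, and the function name and sample data are hypothetical.

```python
import math


def language_entropy(lines_by_language):
    """Shannon entropy (in bits) of contribution shares per language.

    Assumed definition: p_i is the fraction of a developer's added
    lines written in language i, and entropy is -sum(p_i * log2(p_i)).
    """
    total = sum(lines_by_language.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in lines_by_language.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical monthly contributions (lines of code added per language).
single = {"Java": 1200}                      # all effort in one language
even_split = {"Java": 600, "Python": 600}    # effort split evenly in two
skewed = {"Java": 900, "Python": 300}        # one dominant language

print(language_entropy(single))      # 0.0
print(language_entropy(even_split))  # 1.0
print(language_entropy(skewed))      # between 0 and 1
```

Under this assumed definition, a developer working in a single language scores 0, and the score grows as effort is spread more evenly over more languages.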

@inproceedings{Krein:2009,
     author = {Jonathan L. Krein and Alexander C. MacLean and Daniel P. Delorey and Dennis L. Eggett and Charles D. Knutson},
     booktitle = {Fourth International Workshop on Public Data about Software Development (WoPDaSD '09)},
     keywords = {authors, metrics, version control, software evolution, software repositories, public domain software, SourceForge.net},
     location = {Sk{\"o}vde, Sweden},
     month = {June},
     pages = {6},
     title = {Language Entropy: A Metric for Characterization of Author Programming Language Distribution},
     url = {http://libresoft.es/activities/wopdasd-2009-4th-workshop-on-public-data-about-software-development},
     year = {2009}
}

Mining Programming Language Vocabularies from Source Code

Daniel P. Delorey and Charles D. Knutson and Mark Davies
Proceedings of the Psychology of Programming Interest Group Conference (PPIG 2009)
June, 2009

We can learn much from the artifacts produced as the by-products of software development and stored in software repositories. Of all such potential data sources, one of the most important from the perspective of program comprehension is the source code itself. While other data sources give insight into what developers intend a program to do, the source code is the most accurate human-accessible description of what it will do. However, the ability of an individual developer to comprehend a particular source file depends directly on his or her familiarity with the specific features of the programming language being used in the file. This is not unlike the difficulties second-language learners may encounter when attempting to read a text written in a new language. We propose that by applying the techniques used by corpus linguists in the study of natural language texts to a corpus of programming language texts (i.e., source code repositories), we can gain new insights into the communication medium that is programming language. In this paper we lay the foundation for applying corpus linguistic methods to programming language by 1) defining the term “word” for programming language, 2) developing data collection tools and a data storage schema for the Java programming language, and 3) presenting an initial analysis of an example linguistic corpus based on version 1.5 of the Java Developers Kit.

@inproceedings{Delorey:2009,
     author = {Daniel P. Delorey and Charles D. Knutson and Mark Davies},
     title = {Mining Programming Language Vocabularies from Source Code},
     booktitle = {Proceedings of the Psychology of Programming Interest Group Conference (PPIG 2009)},
     location = {Limerick, Ireland},
     month = {June},
     year = {2009}
}

Author Entropy Vs. File Size in the Gnome Suite of Applications

Jason R. Casebolt and Jonathan L. Krein and Alexander C. MacLean and Charles D. Knutson and Daniel P. Delorey
International Workshop on Mining Software Repositories
May, 2009

We present the results of a study in which author entropy was used to characterize author contributions per file. Our analysis reveals three patterns: banding in the data, uneven distribution of data across bands, and file size dependent distributions within bands. Our results suggest that when two authors contribute to a file, large files are more likely to have a dominant author than smaller files.

@inproceedings{10.1109/MSR.2009.5069484,
     author = {Jason R. Casebolt and Jonathan L. Krein and Alexander C. MacLean and Charles D. Knutson and Daniel P. Delorey},
     title = {Author Entropy Vs. File Size in the Gnome Suite of Applications},
     booktitle = {International Workshop on Mining Software Repositories},
     location = {Vancouver, BC, Canada},
     month = {May},
     year = {2009},
     isbn = {978-1-4244-3493-0},
     pages = {91--94},
     url = {http://doi.ieeecomputersociety.org/10.1109/MSR.2009.5069484},
     publisher = {IEEE Computer Society},
     address = {Los Alamitos, CA, USA}
}

The 20-Minute Genealogist: A Context-Preservation Metaphor for Assisted Family History Research

Charles D. Knutson and Jonathan Krein
Proceedings of the 9th Annual Workshop on Technology for Family History and Genealogical Research
March, 2009

What can you possibly do to be productive as a family history researcher in 20 minutes per week? Our studies suggest that currently the answer is, “Nothing.” In 20 minutes a would-be researcher can’t even remember what happened last week, let alone what they were planning to do next. The 20-Minute Genealogist is a powerful metaphor within which software solutions must consider context preservation as the fundamental domain of the system, thus freeing the researcher to do research while the software manages the tasks that computers do best. Two survey-based studies were conducted that indicate a significant disconnect between the values espoused by would-be researchers and the actual level of time spent by those same individuals. Our preliminary results suggest that the overhead involved in context preservation is the predominant inhibitor of family history research productivity among those who claim that such work is very important, yet fail in their efforts.

@inproceedings{Knutson:2009,
     author = {Charles D. Knutson and Jonathan Krein},
     title = {The 20-Minute Genealogist: A Context-Preservation Metaphor for Assisted Family History Research},
     keywords = {family history, genealogist},
     booktitle = {Proceedings of the 9th Annual Workshop on Technology for Family History and Genealogical Research},
     location = {Provo, Utah},
     month = {March},
     year = {2009}
}

Author Entropy: A Metric for Characterization of Software Authorship Patterns

Quinn C. Taylor and James E. Stevenson and Daniel P. Delorey and Charles D. Knutson
Third International Workshop on Public Data about Software Development (WoPDaSD '08)
September, 2008

We propose the concept of author entropy and describe how file-level entropy measures may be used to understand and characterize authorship patterns within individual files, as well as across an entire project. As a proof of concept, we compute author entropy for 28,955 files from 33 open-source projects. We explore patterns of author entropy, identify techniques for visualizing author entropy, and propose avenues for further study.

@inproceedings{Taylor:2008,
     author = {Quinn C. Taylor and James E. Stevenson and Daniel P. Delorey and Charles D. Knutson},
     booktitle = {Third International Workshop on Public Data about Software Development (WoPDaSD '08)},
     keywords = {authors, metrics, version control, software evolution, software repositories, public domain software, SourceForge.net},
     location = {Milan, Italy},
     month = {September},
     pages = {6},
     title = {Author Entropy: A Metric for Characterization of Software Authorship Patterns},
     year = {2008}
}

Programming Language Trends in Open Source Development: An Evaluation Using Data from All Production Phase SourceForge Projects

Daniel P. Delorey and Charles D. Knutson and Christophe Giraud-Carrier
Second International Workshop on Public Data about Software Development (WoPDaSD '07)
June, 2007

In this work, we analyze data collected from the CVS repositories of 9,997 Open Source projects hosted on SourceForge in an effort to understand trends in programming language usage in the Open Source community between 2000 and 2005. The trends we consider include: 1) the relative popularity of the ten most popular programming languages over time, 2) the use of multiple programming languages by individual programmers and by individual projects, and 3) the programming languages most often used in combination.

@inproceedings{Delorey:2007a,
     author = {Daniel P. Delorey and Charles D. Knutson and Christophe Giraud-Carrier},
     booktitle = {Second International Workshop on Public Data about Software Development (WoPDaSD '07)},
     keywords = {software engineering, metrics, data mining, software repositories, programming language popularity, SourceForge.net},
     location = {Limerick, Ireland},
     month = {June},
     title = {Programming Language Trends in Open Source Development: An Evaluation Using Data from All Production Phase SourceForge Projects},
     year = {2007}
}

Studying Production Phase SourceForge Projects: A Case Study Using cvs2mysql and SFRA+

Daniel P. Delorey and Charles D. Knutson and Alex MacLean
Second International Workshop on Public Data about Software Development (WoPDaSD '07)
June, 2007

A wealth of data can be extracted from the natural byproducts of software development processes and used in empirical studies of software engineering. However, the size and accuracy of such studies depend in large part on the availability of tools that facilitate the collection of data from individual projects and the combination of data from multiple projects. To demonstrate this point, we present our experience gathering and analyzing data from nearly 10,000 open source projects hosted on SourceForge. We describe the tools we developed to collect the data and the ways in which these tools and data may be used by other researchers. We also provide examples of statistics that we have calculated from these data to describe interesting author- and project-level behaviors of the SourceForge community.

@inproceedings{Delorey:2007b,
     author = {Daniel P. Delorey and Charles D. Knutson and Alex MacLean},
     booktitle = {Second International Workshop on Public Data about Software Development (WoPDaSD '07)},
     keywords = {software engineering, metrics, data mining, software repositories, CVS, cvs2mysql, SFRA+, SourceForge.net},
     location = {Limerick, Ireland},
     month = {June},
     title = {Studying Production Phase SourceForge Projects: A Case Study Using cvs2mysql and SFRA+},
     year = {2007}
}

Do Programming Languages Affect Productivity? A Case Study Using Data from Open Source Projects

Daniel P. Delorey and Charles D. Knutson and Scott Chun
First International Workshop on Emerging Trends in FLOSS Research and Development (FLOSS '07)
May, 2007

Brooks and others long ago suggested that on average computer programmers write the same number of lines of code in a given amount of time regardless of the programming language used. We examine data collected from the CVS repositories of 9,999 open source projects hosted on SourceForge.net to test this assumption for 10 of the most popular programming languages in use in the open source community. We find that for 24 of the 45 pairwise comparisons, the programming language is a significant factor in determining the rate at which source code is written, even after accounting for variations between programmers and projects.

@inproceedings{Delorey:2007,
     author = {Daniel P. Delorey and Charles D. Knutson and Scott Chun},
     booktitle = {First International Workshop on Emerging Trends in FLOSS Research and Development (FLOSS '07)},
     doi = {10.1109/FLOSS.2007.5},
     keywords = {programming, programming languages, public domain software, SourceForge.net},
     location = {Minneapolis, MN},
     month = {May},
     title = {Do Programming Languages Affect Productivity? A Case Study Using Data from Open Source Projects},
     url = {http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4273079},
     year = {2007}
}

Observational Studies of Software Engineering Using Data from Software Repositories

Daniel P. Delorey
April, 2007

Data for empirical studies of software engineering can be difficult to obtain. Extrapolations from small controlled experiments to large development environments are tenuous, and observation tends to change the behavior of the subjects. In this thesis we propose the use of data gathered from software repositories in observational studies of software engineering. We present tools we have developed to extract data from CVS repositories and the SourceForge Research Archive. We use these tools to gather data from 9,999 Open Source projects. By analyzing these data we are able to provide insights into the structure of Open Source projects. For example, we find that the vast majority of the projects studied have never had more than three contributors and that the vast majority of authors studied have never contributed to more than one project. However, there are projects that have had up to 120 contributors in a single year and authors who have contributed to more than 20 projects, which raises interesting questions about team dynamics in the Open Source community. We also use these data to empirically test the belief that productivity is constant in terms of lines of code per programmer per year regardless of the programming language used. We find that yearly programmer productivity is not constant across programming languages, but rather that developers using higher level languages tend to write fewer lines of code per year than those using lower level languages.

@mastersthesis{Delorey:2007c,
     author = {Daniel P. Delorey},
     month = {April},
     school = {Brigham Young University},
     title = {Observational Studies of Software Engineering Using Data from Software Repositories},
     year = {2007}
}

Design Dysphasia and the Pattern Maintenance Cycle

Seth James Nielson and Charles D. Knutson
Information and Software Technology
August, 2006

Software developers utilize design methods that enable them to manipulate conceptual structures that correlate to programming language features. However, language evolution weakens the design-implementation interface, introducing what we call "design dysphasia": a partial disability in the use of programming language because of incongruous design methods.

Software design patterns are a popular design method that captures elements of reusable design within a specific context. When the programming languages that are part of pattern context evolve, patterns must adapt to the language change or they may reinforce design dysphasia in the practitioner. We assert that the current "capture/recapture" pattern maintenance model is suboptimal for adapting patterns to language evolution and propose a new "capture/modify/recapture" maintenance cycle as a more effective approach. We then suggest a concrete "modify" phase for current patterns to be adapted to OO++ language trends and present an OO++ Iterator pattern example.

@article{Nielson:2006,
     author = {Seth James Nielson and Charles D. Knutson},
     doi = {10.1016/j.infsof.2005.07.004},
     journal = {Information and Software Technology},
     keywords = {dysphasia, pattern maintenance cycle, multiparadigm},
     month = {August},
     number = {8},
     pages = {660--675},
     title = {Design Dysphasia and the Pattern Maintenance Cycle},
     url = {http://linkinghub.elsevier.com/retrieve/pii/S0950584905001102},
     volume = {48},
     year = {2006}
}

OO++ Design Patterns: GOF Revisited

Seth J. Nielson
December, 2004

Programming languages and the programming paradigms they embody co-evolve over time. In many circles, for example, object-oriented programming has evolved from and effectively replaced imperative programming. More recently, many object-oriented languages have assimilated features from other programming paradigms, evolving into multiparadigm languages we refer to as "object-oriented plus-plus" or OO++. However, language evolution, like that seen in OO++, weakens the design-implementation interface, introducing what we call "design dysphasia": a partial disability in the use of programming language because of incongruous design methods. Design dysphasia persists until design methods are extended to match evolved language features.

One popular contemporary design method is the use of software design patterns. These patterns capture elements of design that can be reused within a specific context. When the programming languages that are part of pattern context evolve, patterns must adapt to the language changes. Otherwise they may reinforce design dysphasia in the practitioner. Because of this, the current pattern maintenance model of "capture/recapture" is suboptimal.

This thesis presents an investigation of the shift in contemporary object-oriented languages to OO++ and analyzes the characteristics of the OO++ paradigm. The nature of design dysphasia is defined and discussed generally and in the context of software design patterns. A "capture/modify/recapture" maintenance model is presented as an effective replacement for the "capture/recapture" cycle. A concrete "modify" phase is defined for the adaptation of existing object-oriented patterns to OO++ languages, illustrated through the adaptation of the 23 patterns presented in Design Patterns by Gamma et al. to OO++ variants.

@mastersthesis{Nielson:2004a,
     author = {Seth J. Nielson},
     month = {December},
     school = {Brigham Young University},
     title = {OO++ Design Patterns: GOF Revisited},
     year = {2004}
}

OO++: Exploring the Multiparadigm Shift

Seth J. Nielson and Charles D. Knutson
Workshop on Multiparadigm Programming with Object-Oriented Languages (MPOOL 2004)
June, 2004

Programming languages and the programming paradigms they embody co-evolve over time. Within industrial and academic circles, for example, object-oriented programming has evolved from and effectively replaced imperative programming. More recently, many object-oriented languages have assimilated features from other programming paradigms, evolving into multiparadigm languages we refer to as "object-oriented plus-plus" or OO++. In this paper we survey the capabilities of six OO++ languages, present OO++ code samples in Python, and propose key characteristics of an OO++ programming paradigm.

@inproceedings{Nielson:2004,
     author = {Seth J. Nielson and Charles D. Knutson},
     booktitle = {Workshop on Multiparadigm Programming with Object-Oriented Languages (MPOOL 2004)},
     location = {Oslo, Norway},
     month = {June},
     title = {OO++: Exploring the Multiparadigm Shift},
     url = {http://www.cs.rice.edu/~sethn/multiparadigm/},
     year = {2004}
}