Apache Software Foundation Commit Social Network

This graph represents the social network of Apache committers for the three-month period starting January 1st, 2010.

Description

The data presented here represents the commit behavior of Apache Software Foundation (ASF) developers over a two-year period (2010 through 2011). It is stored in Neo4j graph databases that relate developers to their respective commits, and commits to the files that they modify. The graphs also contain precalculated metrics that relate committers to one another (see Apache Commits: Social Network Dataset).

Using the Data

Neo4j is a graph database engine written in Java. We recommend using the Java API (see Using Neo4j embedded in Java applications) to interact directly with the graph. This is how we created the data, and it is much faster than working through the REST API.

The Neo4j server (see Neo4j Server) also provides a REST API, and the Neo4j community has implemented language-specific drivers to interact with that API (see Language Drivers). This approach requires running a Neo4j server to support the REST API.

Structure of the Data

We'll describe the structure of the data here rather than the schema because technically the schema of a graph in Neo4j is simply a single node type with relationships. The structure is imposed by adding attributes to nodes and creating custom relationship types. A thorough description of the structure of the data can be found in Apache Commits: Social Network Dataset listed below.

The basic structure of the data contained in the Neo4j graphs consists of three node types ("author", "revision", and "file") and four basic relationship types. Each of the node types corresponds to a conceptual entity within the Apache Subversion repository. Because there is no concept of node type within Neo4j, the node types are indicated by the "node-type" property on the node. Each relationship in Neo4j, however, must have a specified relationship type.

Neo4j Graph Structure
node-type Description
author An author node represents a committer in the Apache Software Foundation Subversion repository.
revision A single revision to the repository.
file A unique path within the repository.

Relationship Type Description
AUTHOR_ON_REVISION Indicates that the author created the revision (author --> revision).
AUTHOR_TO_AUTHOR Contains precalculated metrics that relate the two developers (author --> author). Since there are no bidirectional edges in Neo4j, the graph contains two duplicate edges in opposite directions.
AUTHOR_TOUCHES_FILE The author modifies the file in at least one revision contained in the graph (author --> file).
ADD The revision adds lines to the file (revision --> file).
DELETE The revision removes lines from the file (revision --> file).
MODIFY The revision both adds and deletes lines from the file (revision --> file).
REPLACE The revision replaces the file with a new file (revision --> file). We include this type for completeness, but there are no edges of this type in the data set.

Note that because Neo4j is a Java database engine, each of the relationship types is a member of the RelTypes enum, whereas each of the node-types is simply a string. This distinction will become more important when we consider methods of querying the data.
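
The enum-versus-string distinction can be illustrated with a small standalone sketch. It does not use Neo4j or the Neo4JConnector library: the class NodeTypeSketch, its helper methods, and the "name" property key are hypothetical, and only the enum members and the "node-type" property come from the description above.

```java
import java.util.HashMap;
import java.util.Map;

// Standalone illustration only: these types are hypothetical stand-ins,
// not the Neo4j API or the Neo4JConnector library.
class NodeTypeSketch {

    // Relationship types are enum members, so the compiler checks every use.
    // The members mirror the relationship table above.
    enum RelTypes {
        AUTHOR_ON_REVISION, AUTHOR_TO_AUTHOR, AUTHOR_TOUCHES_FILE,
        ADD, DELETE, MODIFY, REPLACE
    }

    // Node types are plain strings stored under the "node-type" property.
    static final String NODE_TYPE = "node-type";

    // A node is modeled here as a simple property map.
    static Map<String, Object> node(String nodeType, String name) {
        Map<String, Object> props = new HashMap<>();
        props.put(NODE_TYPE, nodeType);
        props.put("name", name); // "name" is an assumed property key
        return props;
    }

    // Checking a node type is a string comparison, so a typo only shows up at runtime.
    static boolean isFileNode(Map<String, Object> node) {
        return "file".equals(node.get(NODE_TYPE));
    }

    public static void main(String[] args) {
        Map<String, Object> author = node("author", "steve");
        Map<String, Object> file = node("file", "/trunk/README");

        System.out.println(isFileNode(author)); // false
        System.out.println(isFileNode(file));   // true

        // A misspelled node-type string silently matches nothing,
        // whereas a misspelled enum member would not compile.
        System.out.println(isFileNode(node("flie", "/trunk/README"))); // false
        System.out.println(RelTypes.valueOf("MODIFY") == RelTypes.MODIFY); // true
    }
}
```

In practice this means relationship types can be passed around as typed constants, while node-type checks need care with spelling and case.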

Querying the Data

There are three methods for querying the data: direct access (via the Neo4JConnector library), the Neo4j traversal engine, and the Cypher Query Language. In each example, we query for "steve", a fictional committer.

Neo4JConnector Library

We have provided a Java library that connects to these graphs. For a description and examples, go here.

Traversal Engine

In addition to following links defined in the classes of the Neo4JConnector library, Neo4j also provides a rich traversal engine (see the traversal documentation). To find all of the files for an author:

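// Neo4j classes; the package names below assume the Neo4j 1.x embedded API.
import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.Path;
import org.neo4j.graphdb.traversal.Evaluation;
import org.neo4j.graphdb.traversal.Evaluator;
import org.neo4j.graphdb.traversal.Evaluators;
import org.neo4j.graphdb.traversal.TraversalDescription;
import org.neo4j.graphdb.traversal.Traverser;
import org.neo4j.kernel.Traversal;
import org.neo4j.kernel.Uniqueness;
// DeveloperNeo4JConnector, AuthorNode, FileNode, and RelTypes come from the
// Neo4JConnector library; import them from that library's packages.

// The class name is illustrative.
public class FilesForAuthor {

    public static void main(String... args) {
        DeveloperNeo4JConnector connector = new DeveloperNeo4JConnector("../dbs/apache-db_2010-01-01_2010-04-01/");

        // Follow author --> revision, then revision --> file along any change type.
        TraversalDescription desc = Traversal
                .description()
                .breadthFirst()
                .relationships(RelTypes.AUTHOR_ON_REVISION, Direction.OUTGOING)
                .relationships(RelTypes.ADD, Direction.OUTGOING)
                .relationships(RelTypes.DELETE, Direction.OUTGOING)
                .relationships(RelTypes.MODIFY, Direction.OUTGOING)
                .relationships(RelTypes.REPLACE, Direction.OUTGOING)
                .evaluator(Evaluators.excludeStartPosition())
                // author --> revision --> file is two hops, so a limit of 3 is sufficient.
                .evaluator(Evaluators.toDepth(3))
                .evaluator(new FileEvaluator())
                .uniqueness(Uniqueness.NODE_PATH);

        Traverser traverser = desc.traverse(AuthorNode.get(connector, "steve").getUnderlyingItem());

        for (Path path : traverser) {
            /* Do something with the path; path.endNode() is the file node. */
        }
    }

    // Returns only paths that end at a file node and stops expanding beyond them.
    private static class FileEvaluator implements Evaluator {
        @Override
        public Evaluation evaluate(Path path) {
            if (FileNode.isFileNode(path.endNode())) {
                return Evaluation.INCLUDE_AND_PRUNE;
            }
            return Evaluation.EXCLUDE_AND_CONTINUE;
        }
    }
}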

Cypher

Lastly, Neo4j implements the Cypher Query Language for querying graphs (see the Cypher documentation). To find all of the files for an author:

START author=node:`author-index`(name = "steve")
MATCH (author)-[:AUTHOR_ON_REVISION]->(revision)-[:ADD|DELETE|MODIFY|REPLACE]->(file)
RETURN file
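
Both the traversal and the Cypher query express the same two-hop pattern: follow AUTHOR_ON_REVISION from the author to each revision, then follow any change relationship from the revision to a file. The sketch below walks that pattern over an in-memory edge list in plain Java. It is an illustration only, not the Neo4j API: TwoHopSketch, Edge, sampleEdges, and the sample node names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Standalone illustration of the author -> revision -> file pattern.
class TwoHopSketch {

    enum RelTypes { AUTHOR_ON_REVISION, ADD, DELETE, MODIFY, REPLACE }

    // A directed edge: from --type--> to. Node ids are plain strings here.
    static final class Edge {
        final String from;
        final RelTypes type;
        final String to;

        Edge(String from, RelTypes type, String to) {
            this.from = from;
            this.type = type;
            this.to = to;
        }
    }

    // Collects every file reachable from the author in exactly two hops.
    static Set<String> filesFor(String author, List<Edge> edges) {
        Set<String> files = new LinkedHashSet<>();
        for (Edge first : edges) {
            if (!first.from.equals(author) || first.type != RelTypes.AUTHOR_ON_REVISION) {
                continue; // first hop must be author --> revision
            }
            for (Edge second : edges) {
                // second hop: any change relationship leaving that revision
                if (second.from.equals(first.to) && second.type != RelTypes.AUTHOR_ON_REVISION) {
                    files.add(second.to);
                }
            }
        }
        return files;
    }

    // Hypothetical sample data: two authors, three revisions, three files.
    static List<Edge> sampleEdges() {
        List<Edge> edges = new ArrayList<>();
        edges.add(new Edge("steve", RelTypes.AUTHOR_ON_REVISION, "r1"));
        edges.add(new Edge("steve", RelTypes.AUTHOR_ON_REVISION, "r2"));
        edges.add(new Edge("alice", RelTypes.AUTHOR_ON_REVISION, "r3"));
        edges.add(new Edge("r1", RelTypes.ADD, "/a"));
        edges.add(new Edge("r1", RelTypes.MODIFY, "/b"));
        edges.add(new Edge("r2", RelTypes.MODIFY, "/b"));
        edges.add(new Edge("r3", RelTypes.DELETE, "/c"));
        return edges;
    }

    public static void main(String[] args) {
        System.out.println(filesFor("steve", sampleEdges())); // [/a, /b]
    }
}
```

Note that the sketch collects files into a set, so each file appears once. The queries above can yield the same file once per revision that touches it, so deduplicate their results if only distinct files are needed.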

Generating Data

More information about SCM2DB, the tool that generated the data, can be found here.

Publications

The data described here have been published in the following paper:

Apache Commits: Social Network Dataset

Alexander C. MacLean and Charles D. Knutson
Proceedings of the 10th Working Conference on Mining Software Repositories
May, 2013

Building non-trivial software is a social endeavor. Therefore, understanding the social network of developers is key to the study of software development organizations. We present a graph representation of the commit behavior of developers within the Apache Software Foundation for 2010 and 2011. Relationships between developers in the network represent collaborative commit behavior. Several similarity and summary metrics have been pre-calculated.

Data

Each of the data files listed below contains data for a three-month period.

Name Last Modified Size Description
2010-01-01_2010-04-01.tar.bz2 2013-02-22 187.6M January 1st, 2010 to April 1st, 2010
2010-02-01_2010-05-01.tar.bz2 2013-02-22 202.3M February 1st, 2010 to May 1st, 2010
2010-03-01_2010-06-01.tar.bz2 2013-02-22 168.8M March 1st, 2010 to June 1st, 2010
2010-04-01_2010-07-01.tar.bz2 2013-02-22 193.6M April 1st, 2010 to July 1st, 2010
2010-05-01_2010-08-01.tar.bz2 2013-02-22 161.0M May 1st, 2010 to August 1st, 2010
2010-06-01_2010-09-01.tar.bz2 2013-02-22 161.8M June 1st, 2010 to September 1st, 2010
2010-07-01_2010-10-01.tar.bz2 2013-02-22 140.6M July 1st, 2010 to October 1st, 2010
2010-08-01_2010-11-01.tar.bz2 2013-02-22 269.2M August 1st, 2010 to November 1st, 2010
2010-09-01_2010-12-01.tar.bz2 2013-02-22 295.4M September 1st, 2010 to December 1st, 2010
2010-10-01_2011-01-01.tar.bz2 2013-02-22 295.5M October 1st, 2010 to January 1st, 2011
2010-11-01_2011-02-01.tar.bz2 2013-02-22 184.3M November 1st, 2010 to February 1st, 2011
2010-12-01_2011-03-01.tar.bz2 2013-02-22 175.7M December 1st, 2010 to March 1st, 2011
2011-01-01_2011-04-01.tar.bz2 2013-02-22 164.8M January 1st, 2011 to April 1st, 2011
2011-02-01_2011-05-01.tar.bz2 2013-02-22 143.6M February 1st, 2011 to May 1st, 2011
2011-03-01_2011-06-01.tar.bz2 2013-02-22 149.5M March 1st, 2011 to June 1st, 2011
2011-04-01_2011-07-01.tar.bz2 2013-02-22 168.1M April 1st, 2011 to July 1st, 2011
2011-05-01_2011-08-01.tar.bz2 2013-02-22 178.6M May 1st, 2011 to August 1st, 2011
2011-06-01_2011-09-01.tar.bz2 2013-02-22 164.1M June 1st, 2011 to September 1st, 2011
2011-07-01_2011-10-01.tar.bz2 2013-02-22 161.6M July 1st, 2011 to October 1st, 2011
2011-08-01_2011-11-01.tar.bz2 2013-02-22 168.8M August 1st, 2011 to November 1st, 2011
2011-09-01_2011-12-01.tar.bz2 2013-02-22 208.5M September 1st, 2011 to December 1st, 2011
2011-10-01_2012-01-01.tar.bz2 2013-02-22 230.2M October 1st, 2011 to January 1st, 2012
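
Each archive name encodes its window as two ISO dates separated by an underscore, and consecutive archives overlap because the window slides forward one month at a time. A minimal sketch of recovering the window from a file name follows. ArchiveWindowSketch and its methods are hypothetical helpers, and treating the end date as exclusive is an assumption rather than something stated above.

```java
import java.time.LocalDate;
import java.time.Period;

// Standalone helper for interpreting the archive file names in the table above.
class ArchiveWindowSketch {

    static final String SUFFIX = ".tar.bz2";

    // Returns {start, end} parsed from a name like "2010-01-01_2010-04-01.tar.bz2".
    static LocalDate[] window(String fileName) {
        if (!fileName.endsWith(SUFFIX)) {
            throw new IllegalArgumentException("Unexpected archive name: " + fileName);
        }
        String stem = fileName.substring(0, fileName.length() - SUFFIX.length());
        String[] parts = stem.split("_");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected <start>_<end>: " + fileName);
        }
        return new LocalDate[] { LocalDate.parse(parts[0]), LocalDate.parse(parts[1]) };
    }

    // Assumption: the start date is inclusive and the end date is exclusive.
    static boolean covers(String fileName, LocalDate date) {
        LocalDate[] w = window(fileName);
        return !date.isBefore(w[0]) && date.isBefore(w[1]);
    }

    public static void main(String[] args) {
        String name = "2010-01-01_2010-04-01.tar.bz2";
        LocalDate[] w = window(name);
        System.out.println(w[0] + " to " + w[1] + " spans " + Period.between(w[0], w[1]));
        // prints: 2010-01-01 to 2010-04-01 spans P3M

        // A mid-February date falls inside both of the first two overlapping windows.
        LocalDate day = LocalDate.of(2010, 2, 15);
        System.out.println(covers(name, day));                            // true
        System.out.println(covers("2010-02-01_2010-05-01.tar.bz2", day)); // true
    }
}
```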

Discussion

Due to space limitations in our peer-reviewed articles, we aren't able to address every question regarding the data. Instead, we have listed here a few of the questions that have been asked.

How do these graphs treat changes that were authored by one person and committed by another?

CHANGES file and Subversion logs provides guidelines on how a committer to the Apache Subversion repository should provide attribution to the author of a patch when it has been submitted by a non-committer. These commits are attributed by including some variation of Submitted by: Jane Doe <janedoe example.com>.

The major issue with these types of changes is that they affect how we interpret a developer's behavior. In our analyses we have assumed that commit behavior can be used to determine similarity between two developers, and therefore ascertain the day-to-day working relationships within the network. However, if some changes attributed to a developer were written by someone else, are our assumptions invalid?

Luckily, no. Although we would like to take these changes into account the next time we gather data, for the two-year period currently provided, non-committer authorship accounts for zero new author-to-file connections within the graph. Development activity committed by a proxy changed 7,410 unique paths within the Apache Subversion repository between January 1st, 2010 and December 31st, 2011. In each case, the changes committed by a proxy corresponded to a subset of the paths that the proxy had also committed to himself or herself.

Therefore, for our purposes, these new paths do not significantly change the outcome of our analyses.
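
The argument above rests on a subset check: if every path changed on behalf of a non-committer is already among the paths the proxy changed in their own commits, then the proxied changes introduce no author-to-file connection that would not exist anyway. A minimal sketch of that check, with hypothetical paths and a hypothetical ProxySubsetSketch class (this is not the analysis code used to produce the figures above):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Standalone illustration of the subset check described above.
class ProxySubsetSketch {

    // Paths touched only through proxied commits, i.e. the author-to-file
    // connections that exist solely because of non-committer authorship.
    static Set<String> newConnections(Set<String> ownPaths, Set<String> proxiedPaths) {
        Set<String> extra = new HashSet<>(proxiedPaths);
        extra.removeAll(ownPaths);
        return extra;
    }

    public static void main(String[] args) {
        // Hypothetical paths for a single committer.
        Set<String> own = new HashSet<>(Arrays.asList("/trunk/a", "/trunk/b", "/trunk/c"));
        Set<String> proxied = new HashSet<>(Arrays.asList("/trunk/a", "/trunk/c"));

        // Proxied paths are a subset of the committer's own paths,
        // so no new connections appear.
        System.out.println(newConnections(own, proxied).isEmpty()); // true
    }
}
```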

A quarter of developers in the Apache Foundation spend less than three months contributing.

Some of those must be developers who started less than three months before the data was collected, but how many?

Not as many as you might think. Although developer participation has increased steadily since the Apache Foundation began, the number of short-term (less than three months) developers active at any given time is a small percentage of the total.

The Active Developers graph illustrates the number of developers who are actively contributing to the foundation at any given time. The blue line represents the total number of developers. The red line represents the short-term developers. Although there is a rise in the number of short-term developers at the end of the collection period, 80% of short-term developers stopped contributing more than six months before the end of that period.

For completeness, the precipitous decline of developer participation (the blue line) towards the end of the graph is simply an artifact of our collection method. However, immediately prior to the sharp decline, it appears that the active population of developers contracts slightly. The contraction may be a fluke or another artifact, but we'd be interested to delve into that further.