MSR 2004: International Workshop on Mining Software Repositories

2004.msrconf.org

9:00-9:15	Welcome and Introduction [slides] Ahmed E. Hassan, Richard C. Holt, and Audris Mockus
9:15-10:30	Session 1: Infrastructure and Extraction
	Research Infrastructure for Empirical Science of FOSS [slides] Les Gasser, Gabriel Ripoche, and Robert Sandusky (University of Illinois at Urbana-Champaign)
	Preprocessing CVS Data for Fine-Grained Analysis [slides] Thomas Zimmermann (Saarland University) and Peter Weißgerber (Catholic University of Eichstätt-Ingolstadt)
	Discussion Leader: Daniel German (University of Victoria) [slides]
10:30-11:00	Coffee Break
10:30-11:15	Session 2: Integration and Presentation
	Using CVS historical information to understand how students develop software [slides] Ying Liu, Eleni Stroulia, Ken Wong (University of Alberta), and Daniel German (University of Victoria)
	Discussion Leader: Katsuro Inoue (Osaka University) [slides]
11:15-12:00	Session 3: System Understanding and Change Patterns
	Four Interesting Ways in Which History Can Teach Us About Software [slides] Michael Godfrey, Cory Kapser, Xinyi Dong, and Lijie Zou (University of Waterloo)
	Discussion Leader: Annie Ying (IBM T.J. Watson Research Center) [slides]
12:30-1:30	Lunch
1:30-2:30	Demos and Walkaround Presentations
	Email msr2004@msr.uwaterloo.ca to register. All participants and authors of accepted papers are encouraged to present their MSR research and tools:
	Mining the Software Change Repository of a Legacy Telephony System - Jelber Sayyad Shirabad
	Database Techniques for the Analysis and Exploration of Software Repositories - Omar Alonso
	ROSE: Mining Version Histories to Guide Software Changes in Eclipse - Thomas Zimmermann
	Hackystat: Support for Software Telemetry - Philip Johnson
	CVSAnalY: Analysis and results - Gregorio Robles and Jesus M. Gonzalez-Barahona
	Augur: Unifying Activity, Artifacts and Authors in a Visual Tool for Distributed Software Development Teams - Jon Froehlich and Paul Dourish
2:30-3:30	Session 4: Defect Analysis
	Towards Understanding the Rhetoric of Small Changes [slides] Ranjith Purushothaman (Dell Computer Corporation) and Dewayne Perry (University of Texas at Austin)
	Bug Driven Bug Finders [slides] Chadd Williams and Jeff Hollingsworth (University of Maryland)
	Discussion Leader: Thomas Ostrand (AT&T Labs - Research)
3:30-4:00	Coffee Break
4:00-4:30	Session 5: Process and Community Analysis
	Applying Social Network Analysis to the Information in CVS Repositories [slides] Luis Lopez-Fernandez, Gregorio Robles, and Jesus M. Gonzalez-Barahona (Rey Juan Carlos University)
	Discussion Leader: Chris Jensen (University of California, Irvine) [slides]
4:30-5:00	Session 6: Software Reuse
	A Case Study on Recommending Reusable Software Components using Collaborative Filtering [slides] Frank McCarey, Mel Ó Cinnéide, and Nicholas Kushmerick (University College Dublin)
	Discussion Leader: Pankaj Garg (Zeesource) [slides]
5:00-5:30	Wrap-up: Common Themes and Future Direction [slides] Ahmed E. Hassan, Richard C. Holt and Audris Mockus

Papers:

Infrastructure and Extraction:

Preprocessing CVS Data for Fine-Grained Analysis by Zimmermann, Weißgerber
All analyses of version archives have one phase in common: the preprocessing of data. Preprocessing has a direct impact on the quality of the results returned by an analysis. In this paper we discuss four essential preprocessing tasks necessary for a fine-grained analysis of CVS archives: (a) data extraction, (b) transaction recovery, (c) mapping of changes to fine-grained entities, and (d) data cleaning. We formalize the concept of sliding time windows and show how commit mails can relate revisions to transactions. We also present two approaches that map changes to the affected building blocks of a file, e.g. functions or sections.
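The sliding time window idea for transaction recovery can be illustrated with a minimal sketch. It assumes that revisions belong to one transaction when they share an author and log message and each check-in follows the previous one within a fixed gap. The `Revision` record, the `group_transactions` helper, and the 200-second default are illustrative assumptions, not the paper's formalization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Revision:
    # Hypothetical record for one file revision taken from a CVS log.
    path: str
    author: str
    message: str
    timestamp: int  # seconds since the epoch


def group_transactions(revisions, window=200):
    """Group revisions into transactions with a sliding time window.

    A revision joins the current transaction when author and log message
    match and it was checked in at most `window` seconds after the
    previous revision of that transaction. The window slides, so the
    whole transaction may span more than `window` seconds.
    """
    ordered = sorted(revisions, key=lambda r: (r.author, r.message, r.timestamp))
    transactions = []
    for rev in ordered:
        current = transactions[-1] if transactions else None
        if (current is not None
                and current[-1].author == rev.author
                and current[-1].message == rev.message
                and rev.timestamp - current[-1].timestamp <= window):
            current.append(rev)
        else:
            transactions.append([rev])
    return transactions


revs = [
    Revision("a.c", "alice", "fix parser", 1000),
    Revision("b.c", "alice", "fix parser", 1150),
    Revision("c.c", "alice", "fix parser", 1300),
    Revision("d.c", "alice", "fix parser", 2000),
    Revision("a.c", "bob", "add tests", 1100),
]
transactions = group_transactions(revs)
for t in transactions:
    print([r.path for r in t])
```

In this example the first three revisions by alice form one transaction even though they span 300 seconds, because each consecutive gap stays within the window; a fixed window anchored at the first check-in would have split them.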
The perils and pitfalls of mining SourceForge by Howison, Crowston
SourceForge provides abundant accessible data from Open Source Software development projects, making it an attractive data source for software engineering research. However it is not without theoretical peril and practical pitfalls. In this paper, we outline practical lessons gained from our spidering, parsing and analysis of SourceForge data.
SourceForge can be practically difficult: projects are defunct, data from earlier systems has been dumped in and crucial data is hosted outside SourceForge, dirtying the retrieved data. These practical issues play directly into analysis: decisions made in screening projects can reduce the range of variables, skewing data and biasing correlations.
SourceForge is theoretically perilous because it provides easily accessible data items for each project, tempting researchers to fit their theories to these limited data. Worse, few of these items are plausible dependent variables. Studies are thus likely to test the same hypotheses even if they start from different theoretical bases. To avoid these problems, analyses of SourceForge projects should go beyond project-level variables and carefully consider which variables are used for screening projects and which for testing hypotheses.
Research Infrastructure for Empirical Science of FOSS by Gasser, Ripoche, Sandusky
F/OSS research faces a new and unusual situation: the traditional difficulties of gathering enough empirical data have been replaced by issues of dealing with enormous amounts of freely available data from many disparate sources (forums, code, bug reports, etc.). At present no means exist for assembling these data under common access points and frameworks for comparative, longitudinal, and collaborative research. Gathering and maintaining large F/OSS data collections reliably and making them usable present several research challenges. For example, current projects usually rely on "web scraping" or on direct access to raw data from groups that generate it, and both of these methods require unique effort for each new corpus, or even for updating existing corpora. In this paper we identify several common needs and critical factors in F/OSS empirical research, and suggest orientations and recommendations for the design of a shared research infrastructure.
Mining CVS repositories, the softChange experience by German
CVS logs are a rich source of software trails (information left behind by the contributors to the development process, usually in the forms of logs). This paper describes how softChange extracts these trails, and enhances them. This paper also addresses some challenges that CVS fact extraction poses to researchers.
Text is Software Too by Dekhtyar, Huffman Hayes, Menzies
Software compiles and therefore is characterized by a parseable grammar. Natural language text rarely conforms to prescriptive grammars and therefore is much harder to parse. Mining parseable structures is easier than mining less structured entities. Therefore, most work on mining repositories focuses on software, not natural language text. Here, we report experiments with mining natural language text (requirements documents) suggesting that: (a) mining natural language is not too difficult, so (b) software repositories should routinely be augmented with all the natural language text used to develop that software.

Integration and Presentation:

GluTheos: Automating the Retrieval and Analysis of Data from Publicly Available Software Repositories by Robles, Gonzalez-Barahona, Ghosh
For efficient, large-scale data mining of publicly available information about libre (free, open source) software projects, automating the retrieval and analysis processes is a must. A system implementing such automation must take into account the many kinds of repositories with interesting information (each with its own structure and access methods), and the many kinds of analysis which can be applied to the retrieved data. In addition, such a system should be capable of interfacing with and reusing as much existing software for both retrieving and analyzing data as possible.
As a proof of concept of what such a system could look like, we started some time ago to implement the GlueTheos system, featuring a modular, flexible architecture which has already been used in several of our studies of libre software projects. In this paper we show its structure, how it can be used, and how it can be extended.
Using CVS historical information to understand how students develop software by Liu, Stroulia, Wong, German
Software engineering courses are expected to teach students a wide range of knowledge, e.g. software development methodologies, tools, work habits, collaboration skills, and a good sense of scheduling. In this paper, we present a method to track the progress of the students in the development of a term project using the historical information stored in their CVS repository. This information is analyzed and presented to the instructor in a variety of forms. The goal of this analysis is, first, to understand how students interact, and second, to find out if there is any correlation between their grades and the nature of their collaboration. Understanding these factors will allow instructors to detect potential problems early in the course, so they can concentrate their help on those teams that need it the most.
Database Techniques for the Analysis and Exploration of Software Repositories by Alonso, Devanbu, Gertz
In a typical software engineering project there is a large and diverse body of documents that a development team produces, including requirement documents, specifications, designs, code, and bug reports. Documents have different formats and are managed in several repositories. The heterogeneity among document formats and the diversity of repositories often make it infeasible to query and explore the repositories in a transparent fashion during the phases of the software development process.
In this paper, we present a framework for the analysis and exploration of software repositories. Our approach applies database techniques to integrate and manage different documents produced by a team. Tools that exploit the database functionality then allow for the processing of complex queries against a document collection to extract trends and analyze correlations, which provide important insights into the software development process.
We present a prototype implementation using the Apache Web-server project as a case study.
Empirical Project Monitor: A Tool for Mining Multiple Project Data by Ohira, Yokomori, Sakai, Matsumoto, Inoue, Torii
Project management for effective software process improvement must be achieved based on quantitative data. However, because data collection for measurement requires high costs and collaboration with developers, it is difficult to collect coherent, quantitative data continuously and to utilize the data for practicing software process improvement. In this paper, we describe Empirical Project Monitor (EPM) which automatically collects and measures data from three kinds of repositories in widely used software development support systems such as configuration management systems, mailing list managers and issue tracking systems. Providing integrated measurement results graphically, EPM helps developers/managers keep projects under control in real time.

System Understanding and Change Patterns:

Mining Version Control Systems for FACs (Frequently Applied Changes) by Van Rysselberghe, Demeyer
Today, programmers are forced to maintain a software system based on their gut feeling and experience. This paper makes an attempt to turn the software maintenance craft into a more disciplined activity, by mining for frequently applied changes in a version control system. In addition to some initial results, we show how this technique allows us to recover and study successful maintenance strategies adopted for the redesign of long-lived systems.
Mining the Software Change Repository of a Legacy Telephony System by Sayyad Shirabad, Lethbridge, Matwin
The ability to predict whether a change in one file may require a change in another can be extremely helpful to a software maintainer. Software change repositories store historic changes applied to a software system. They therefore inherently contain a wealth of information regarding (hidden) interactions between different components of the system, including the files that have changed together in the past. Data mining techniques can be employed to learn from this software change experience. We will report on our research into mining the software change repository of a legacy system to learn a relation that maps file pairs to a value indicating whether changing one may require a change in the other.
Four Interesting Ways in Which History Can Teach Us About Software by Godfrey, Kapser, Dong, Zou
In this position paper, we outline four kinds of studies that we have undertaken in trying to understand various aspects of a software system's evolutionary history. In each instance, the studies have involved detailed examination of real software systems based on "facts" extracted from various kinds of source artifact repositories, as well as the development of accompanying tools to aid in the extraction, abstraction, and comprehension processes. We briefly discuss the goals, results, and methodology of each approach.
Predicting Source Code Changes by Mining Revision History by Ying, Murphy, Ng, Chu-Carroll
Software developers are often faced with modification tasks that involve source which is spread across a code base. Some dependencies between source, such as the dependencies between platform dependent fragments, cannot be determined by existing static and dynamic analyses. To help developers identify relevant source code during a modification task, we have developed an approach that applies data mining techniques to determine change patterns (files that were changed together frequently in the past) from the revision history of the code base. Our hypothesis is that the change patterns can be used to recommend potentially relevant source code to a developer performing a modification task. We show that this approach can reveal valuable dependencies by applying the approach to the Eclipse and Mozilla open source projects, and by evaluating the predictability and interestingness of the recommendations produced for actual modification tasks on these systems.
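As a rough illustration of change patterns, the sketch below counts how often pairs of files were changed in the same transaction and recommends the files that most often accompanied a given file. It is a deliberately simplified stand-in (pairwise counting with a minimum support threshold), not the mining algorithm used in the paper; `co_change_pairs`, `recommend`, and the sample history are hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_change_pairs(transactions, min_support=2):
    """Count file pairs changed together; keep pairs seen at least min_support times."""
    counts = Counter()
    for files in transactions:
        # Sort so that each unordered pair has one canonical key.
        for pair in combinations(sorted(set(files)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


def recommend(changed_file, patterns):
    """Return files that co-changed with changed_file, most frequent first."""
    scores = Counter()
    for (a, b), n in patterns.items():
        if a == changed_file:
            scores[b] = n
        elif b == changed_file:
            scores[a] = n
    # Ties are broken alphabetically to keep the output deterministic.
    return [f for f, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]


history = [
    ["ui.c", "ui.h"],
    ["ui.c", "ui.h", "main.c"],
    ["main.c", "util.c"],
    ["ui.c", "main.c"],
]
patterns = co_change_pairs(history)
print(recommend("ui.c", patterns))
```

A developer editing `ui.c` would be pointed at `main.c` and `ui.h`, each of which changed together with it twice in the sample history.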
Mining Software Usage Data by El-Ramly, Stroulia
Many software systems collect or can be instrumented to collect data about how users use them, i.e., system-user interaction data. Such data can be of great value for program understanding and reengineering purposes. In this paper we demonstrate that sequential data mining methods can be applied to discover interesting patterns of user activities from system-user interaction traces. In particular, we developed a process for discovering a special type of sequential patterns, called interaction patterns. These are sequences of events with noise, in the form of spurious events that may occur anywhere in a pattern instance. In our case studies, we applied interaction pattern mining to systems with considerably different forms of interaction: Web-based systems and legacy systems. We used the discovered patterns for user interface reengineering, and personalization. The method is promising and generalizable to other systems with different forms of interaction.

Defect Analysis:

Bug Driven Bug Finders by Williams, Hollingsworth
We describe a method of creating tools to find bugs in software that is driven by analysis of previous bugs. We present a study of bug databases and software repositories that characterizes commonly occurring types of bugs. Based on the types of bugs that were commonly reported and fixed in the code, we determine what types of bug finding tools should be developed. We have implemented one static checker, a return value usage checker. Novel features of this checker include the use of information from the software repository to try to improve its false positive rate by identifying patterns that have resulted in previous bug fixes.
Mining Repositories to Assist in Project Planning and Resource Allocation by Menzies, DiStefano, Chapman, Cunanan
Software repositories plus defect logs are useful for learning defect detectors. Such defect detectors could be a useful resource allocation tool for software managers. One way to view our detectors is that they are a V&V tool for V&V; i.e. they can be used to assess if "too much" of the testing budget is going to "too little" of the system. Finding such detectors could serve as the basis of the business case that building a local repository is useful.
Three counter arguments to such a proposal are (1) no general conclusions have been reported in any such repository despite years of effort; (2) if such general conclusions existed then there would be no need to build a local repository; (3) no such general conclusions will ever exist, according to many researchers. This article is a reply to these three arguments.
Bug Report Networks: Varieties, Strategies, and Impacts in a F/OSS Development Community by Sandusky, Gasser, Ripoche
Our empirical research has shown that a predominant structural feature of defect tracking repositories is the evolving "bug report network" (BRN). Community members create BRNs by progressively asserting various formal and informal relationships between bug reports (BRs). In one F/OSS bug repository under study, participants assert two formal relationships (duplications and dependencies) and various informal relationships (like "see also" references).
BRNs can be interpreted as (1) information ordering strategies that support collocation of related BRs, decreasing cognitive and organizational effort; (2) sense-making strategies wherein BRNs provide more refined representations of software and work-organization issues; (3) social ordering strategies that rearrange collective relationships among community members. This paper presents findings from an investigation of the nature, extent, and impact of BRNs in one F/OSS development community. We investigate whether and how specific classes of BRNs influence problem management within the community, and identify several new research questions.
A Tool for Mining Defect-Tracking Systems to Predict Fault-Prone Files by Ostrand, Weyuker
In earlier research we identified characteristics of files in large software systems that tend to make them particularly likely to contain faults. We then developed a statistical model that uses historical fault information and file characteristics to predict which files of a system are likely to contain the largest numbers of faults. Testers can use that information to prioritize their testing and focus their efforts to make the testing process more efficient and the resulting software more dependable. In this paper we describe a proposed new tool to automate this prediction process, and discuss issues involved in its design and implementation. The goal is to produce an automated tool that mines the project defect tracking system and that can be used by testers without requiring any particular statistical expertise or subjective judgements.
Towards Understanding the Rhetoric of Small Changes by Purushothaman, Perry
Understanding the impact of software changes has been a challenge since software systems were first developed. With the increasing size and complexity of systems, this problem has become more difficult. There are many ways to identify change impact from the plethora of software artifacts produced during development and maintenance. We present the analysis of the software development process using change and defect history data. Specifically, we address the problem of small changes. The studies revealed that (1) there is less than 4 percent probability that a one-line change will introduce an error in the code; (2) nearly 10 percent of all changes made during the maintenance of the software under consideration were one-line changes; (3) the phenomena of change differ for additions, deletions and modifications as well as for the number of lines affected.

Process and Community Analysis:

Data Mining for Software Process Discovery in Open Source Software Development Communities by Jensen, Scacchi
Software process discovery has historically been an intensive task, either done through exhaustive empirical studies or in an automated fashion using techniques such as logging and analysis of command shell operations. While empirical studies have been fruitful, data collection has proven to be tedious and time consuming. Existing automated approaches have expedited collection of fine-grained data, but do so at the cost of impinging on the developer's work environment, and few developers may be observed. In this paper, we explore techniques for discovering development processes from publicly available open source software development repositories that exploit advances in artificial intelligence. Our goal is to facilitate process discovery in ways that are less cumbersome than empirical techniques and offer a more holistic, task-oriented view of the process than current automated systems provide.
Applying Social Network Analysis to the Information in CVS Repositories by Lopez-Fernandez, Robles, Gonzalez-Barahona
The huge quantities of data available in the CVS repositories of large, long-lived libre (free, open source) software projects, and the many interrelationships among those data, offer opportunities for extracting large amounts of valuable information about their structure, evolution and internal processes. Unfortunately, the sheer volume of that information renders it almost unusable without applying methodologies which highlight the relevant information for a given aspect of the project. In this paper, we propose the use of a well known set of methodologies (social network analysis) for characterizing libre software projects, their evolution over time and their internal structure. In addition, we show how we have applied such methodologies to real cases, and extract some preliminary conclusions from that experience.
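One simple way to picture a social network derived from CVS data is a graph whose nodes are committers and whose edges link committers who touched the same module. The sketch below builds such a graph and computes node degree. The choice of nodes, edges, and weights is an assumption made for illustration, and `committer_network`, `degree`, and the sample commits are hypothetical rather than the authors' definitions.

```python
from collections import defaultdict
from itertools import combinations


def committer_network(commits):
    """Build weighted edges between committers who worked on the same module.

    commits: iterable of (committer, module) pairs.
    Returns {(committer_a, committer_b): number_of_shared_modules}.
    """
    by_module = defaultdict(set)
    for committer, module in commits:
        by_module[module].add(committer)
    weights = defaultdict(int)
    for members in by_module.values():
        for a, b in combinations(sorted(members), 2):
            weights[(a, b)] += 1
    return dict(weights)


def degree(edges):
    """Number of distinct collaborators per committer."""
    deg = defaultdict(int)
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return dict(deg)


commits = [
    ("ana", "core"), ("ben", "core"), ("ana", "docs"), ("cy", "docs"),
    ("ben", "core"), ("ana", "ui"), ("ben", "ui"),
]
edges = committer_network(commits)
print(edges)
print(degree(edges))
```

Edge weights here count shared modules, so repeated commits by the same person to one module do not inflate the relationship.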
Mining a Software Developer’s Local Interaction History by Schneider, Gutwin, Penner, Paquette
Although shared software repositories are commonly used during software development, it is typical that a software developer browses and edits a local snapshot of the software under development. Developers periodically check their changes into the software repository; however, their interaction with the local copy is not recorded. Local interaction histories are a valuable source of information and should be considered when mining software repositories.
In this paper we discuss the benefits of analyzing local interaction histories and present a technique and prototype implementation for their capture and analysis. As well, we discuss the implications of local interaction histories and the infrastructure of software repositories.

Reuse:

LASER: A Lexical Approach to Analogy in Software Reuse by Amin, Ó Cinnéide, Veale
Software reuse involves creating a software system using existing software components, rather than creating it entirely from scratch. With the increase in size and complexity of existing software repositories, the need to provide intelligent support to the programmer becomes more pressing. An analogy is a comparison of certain similarities between things that are otherwise unlike. This concept has been shown to be valuable in developing UML-level reuse techniques. In the LASER project we apply lexically-driven Analogy at the code level, rather than at the UML-level, in order to retrieve matching components from a repository of existing components. Using the standard ontology WordNet, we have conducted a case study to assess if class and method names in open source applications are used in a semantically meaningful way. Our results demonstrate that both hierarchical reuse and parallel reuse can be enhanced through the use of lexically-driven Analogy.
A Case Study on Recommending Reusable Software Components using Collaborative Filtering by McCarey, Ó Cinnéide, Kushmerick
The demand for quality, highly functional software reinforces the need for reusable software components. However, as repositories of reusable components increase in size and complexity, the challenge for developers to remain conversant with all components becomes greater. This paper proposes a software recommendation system based on collaborative filtering, which has been shown to be effective in other domains. Based on the usage patterns of existing classes and the class currently being developed, our system proposes a set of reuse candidates to the programmer. We present the results of our analysis of the usage of Swing classes in several open-source applications and find that the collaborative filtering technique is promising in providing recommendations in this context.
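A minimal sketch of neighbourhood-based collaborative filtering in this setting: existing classes play the role of "users", the components they use play the role of "items", and candidates for the class under development come from its most similar neighbours. The cosine similarity over binary usage sets, the neighbourhood size `k`, and all class names below are illustrative assumptions, not the configuration evaluated in the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sets of used components (binary vectors)."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def recommend_components(current, usage, k=2):
    """Suggest components used by the k most similar classes but not yet by `current`.

    current: set of components the class under development already uses.
    usage:   {class_name: set of components that class uses}.
    """
    neighbours = sorted(
        usage.items(), key=lambda kv: (-cosine(current, kv[1]), kv[0])
    )[:k]
    scores = {}
    for _name, used in neighbours:
        sim = cosine(current, used)
        for comp in used - current:
            # Each neighbour votes for its unseen components, weighted by similarity.
            scores[comp] = scores.get(comp, 0.0) + sim
    return sorted(scores, key=lambda c: (-scores[c], c))


usage = {
    "LoginDialog": {"JDialog", "JButton", "JTextField", "JLabel"},
    "SearchPanel": {"JPanel", "JButton", "JTextField"},
    "ReportView": {"JPanel", "JTable", "JScrollPane"},
}
current = {"JButton", "JTextField"}
suggestions = recommend_components(current, usage)
print(suggestions)
```

`SearchPanel` is the closest neighbour of the class being written, so its unseen component `JPanel` ranks first, followed by the two components contributed by `LoginDialog`.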
Template Mining in Source-Code Digital Libraries by Yusof, Rana
As a greater number of software developers make their source code available, there is a need to store such open-source applications in a repository, and facilitate search over the repository. The objective of this research is to build a digital library of Java source code, to enable search and selection of source code. We believe that such a digital library will enable better sharing of experience amongst developers, and facilitate reuse of code segments. Information retrieval is often considered to be essential for the success of digital libraries, so they can achieve a high level of effectiveness while at the same time affording ease of use to a diverse community of users. Four different matching mechanisms (exact, generalization, reduction and nameOnly) are used to retrieve Java programs based on information extracted through template mining.
Multi-Project Software Engineering: An Example by Garg, Gschwind, Inoue
In this paper we present an approach for developers to benefit from multi-project software knowledge. As we show in this paper, this can be achieved by gathering information about how numerous software projects are being built, and about the interrelation of the modules within the projects. Compared to approaches that only monitor a single project, the contribution of our approach is that it supports not only the reuse of isolated software modules or libraries but also the reuse of the knowledge surrounding the code and individual projects. For instance, if a component is replaced with another, probably better, implementation within a project, this knowledge can be shared with all relevant projects. In this paper, we show how the collection of such data allows developers to learn about such decisions from other projects, and hence how to benefit from such "multi-project" knowledge.

Last Modified by Ahmed E. Hassan on April 20 2004
