MSR 2004: International Workshop on Mining Software Repositories

9:00-9:15 Welcome and Introduction [slides]
              Ahmed E. Hassan, Richard C. Holt, and Audris Mockus
9:15-10:30 Session 1: Infrastructure and Extraction
10:30-11:00 Coffee Break
10:30-11:15 Session 2: Integration and Presentation
11:15-12:00 Session 3: System Understanding and Change Patterns
12:30-1:30 Lunch
1:30-2:30 Demos and Walkaround Presentations
2:30-3:30 Session 4: Defect Analysis
3:30-4:00 Coffee Break
4:00-4:30 Session 5: Process and Community Analysis
4:30-5:00 Session 6: Software Reuse
5:00-5:30 Wrap-up: Common Themes and Future Direction [slides]
             Ahmed E. Hassan, Richard C. Holt and Audris Mockus


Infrastructure and Extraction: (Schedule)

Integration and Presentation: (Schedule)

  • GlueTheos: Automating the Retrieval and Analysis of Data from Publicly Available Software Repositories by Robles, Gonzalez-Barahona, Ghosh

    For efficient, large scale data mining of publicly available information about libre (free, open source) software projects, automating the retrieval and analysis processes is a must. A system implementing such automation must take into account the many kinds of repositories with interesting information (each with its own structure and access methods), and the many kinds of analysis which can be applied to the retrieved data. In addition, such a system should be capable of interfacing with and reusing as much existing software for both retrieving and analyzing data as possible.

    As a proof of concept of how such a system could work, we started some time ago to implement the GlueTheos system, featuring a modular, flexible architecture which has already been used in several of our studies of libre software projects. In this paper we show its structure, how it can be used, and how it can be extended.

  • Using CVS historical information to understand how students develop software by Liu, Stroulia, Wong, German

    Software engineering courses are expected to teach students a wide range of knowledge, e.g. software development methodologies, tools, work habits, collaboration skills, and a good sense of scheduling. In this paper, we present a method to track the progress of the students in the development of a term project using the historical information stored in their CVS repository. This information is analyzed and presented to the instructor in a variety of forms. The goal of this analysis is, first, to understand how students interact, and second, to find out if there is any correlation between their grades and the nature of their collaboration. Understanding these factors will allow instructors to detect potential problems early in the course, so they can concentrate their help on the teams that need it most.

  • Database Techniques for the Analysis and Exploration of Software Repositories by Alonso, Devanbu, Gertz

    In a typical software engineering project there is a large and diverse body of documents that a development team produces, including requirement documents, specifications, designs, code, and bug reports. Documents have different formats and are managed in several repositories. The heterogeneity among document formats and the diversity of repositories often make it infeasible to query and explore the repositories in a transparent fashion during the phases of the software development process.

    In this paper, we present a framework for the analysis and exploration of software repositories. Our approach applies database techniques to integrate and manage different documents produced by a team. Tools that exploit the database functionality then allow for the processing of complex queries against a document collection to extract trends and analyze correlations, which provide important insights into the software development process.

    We present a prototype implementation using the Apache Web-server project as a case study.

  • Empirical Project Monitor: A Tool for Mining Multiple Project Data by Ohira, Yokomori, Sakai, Matsumoto, Inoue, Torii

    Project management for effective software process improvement must be based on quantitative data. However, because data collection for measurement requires high costs and collaboration with developers, it is difficult to collect coherent, quantitative data continuously and to use the data in practicing software process improvement. In this paper, we describe the Empirical Project Monitor (EPM), which automatically collects and measures data from three kinds of repositories in widely used software development support systems: configuration management systems, mailing list managers and issue tracking systems. By presenting integrated measurement results graphically, EPM helps developers and managers keep projects under control in real time.

System Understanding and Change Patterns: (Schedule)

  • Mining Version Control Systems for FACs (Frequently Applied Changes) by Van Rysselberghe, Demeyer

    Today, programmers are forced to maintain a software system based on their gut feeling and experience. This paper makes an attempt to turn the software maintenance craft into a more disciplined activity, by mining for frequently applied changes in a version control system. Next to some initial results, we show how this technique makes it possible to recover and study successful maintenance strategies, adopted for the redesign of long-lived systems.

  • Mining the Software Change Repository of a Legacy Telephony System by Sayyad Shirabad, Lethbridge, Matwin

    The ability to predict whether a change in one file may require a change in another can be extremely helpful to a software maintainer. Software change repositories store historic changes applied to a software system. They therefore inherently contain a wealth of information regarding (hidden) interactions between different components of the system, including the files that have changed together in the past. Data mining techniques can be employed to learn from this software change experience. We will report on our research into mining the software change repository of a legacy system to learn a relation that maps file pairs to a value indicating whether changing one may require a change in the other.

  • Four Interesting Ways in Which History Can Teach Us About Software by Godfrey, Kapser, Dong, Zou

    In this position paper, we outline four kinds of studies that we have undertaken in trying to understand various aspects of a software system's evolutionary history. In each instance, the studies have involved detailed examination of real software systems based on "facts" extracted from various kinds of source artifact repositories, as well as the development of accompanying tools to aid in the extraction, abstraction, and comprehension processes. We briefly discuss the goals, results, and methodology of each approach.

  • Predicting Source Code Changes by Mining Revision History by Ying, Murphy, Ng, Chu-Carroll

    Software developers are often faced with modification tasks that involve source which is spread across a code base. Some dependencies between source, such as the dependencies between platform dependent fragments, cannot be determined by existing static and dynamic analyses. To help developers identify relevant source code during a modification task, we have developed an approach that applies data mining techniques to determine change patterns---files that were changed together frequently in the past---from the revision history of the code base. Our hypothesis is that the change patterns can be used to recommend potentially relevant source code to a developer performing a modification task. We show that this approach can reveal valuable dependencies by applying the approach to the Eclipse and Mozilla open source projects, and by evaluating the predictability and interestingness of the recommendations produced for actual modification tasks on these systems.

  • Mining Software Usage Data by El-Ramly, Stroulia

    Many software systems collect or can be instrumented to collect data about how users use them, i.e., system-user interaction data. Such data can be of great value for program understanding and reengineering purposes. In this paper we demonstrate that sequential data mining methods can be applied to discover interesting patterns of user activities from system-user interaction traces. In particular, we developed a process for discovering a special type of sequential patterns, called interaction patterns. These are sequences of events with noise, in the form of spurious events that may occur anywhere in a pattern instance. In our case studies, we applied interaction pattern mining to systems with considerably different forms of interaction: Web-based systems and legacy systems. We used the discovered patterns for user interface reengineering, and personalization. The method is promising and generalizable to other systems with different forms of interaction.
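
A minimal sketch of the co-change idea that recurs in this session's abstracts (files that were changed together frequently in the past suggest files that may need to change together again). It is an illustration only and not the method of any paper listed here: it counts file pairs per commit instead of using the data mining techniques the authors describe, and the function names, the `min_support` threshold, and the sample history are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_change_pairs(commits, min_support=2):
    """Count how often each pair of files appears in the same commit.

    Returns only the pairs seen at least `min_support` times.
    """
    pair_counts = Counter()
    for files in commits:
        # Sort so that (a, b) and (b, a) are counted as the same pair.
        for pair in combinations(sorted(set(files)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


def recommend(changed_file, patterns):
    """List files that frequently changed together with `changed_file`."""
    scores = Counter()
    for (a, b), n in patterns.items():
        if a == changed_file:
            scores[b] += n
        elif b == changed_file:
            scores[a] += n
    return [name for name, _ in scores.most_common()]


# Hypothetical revision history: one list of file names per commit.
history = [
    ["parser.c", "parser.h"],
    ["parser.c", "parser.h", "lexer.c"],
    ["lexer.c", "README"],
    ["parser.c", "lexer.c"],
]
patterns = co_change_pairs(history, min_support=2)
suggestions = recommend("parser.c", patterns)
print(suggestions)
```

Raising `min_support` trades recall for fewer spurious pairs; a real study would also need to decide how to group individual check-ins into commits, which this sketch takes as given.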

Defect Analysis: (Schedule)

  • Bug Driven Bug Finders by Williams, Hollingsworth

    We describe a method of creating tools to find bugs in software that is driven by analysis of previous bugs. We present a study of bug databases and software repositories that characterize commonly occurring types of bugs. Based on the types of bugs that were commonly reported and fixed in the code, we determine what types of bug finding tools should be developed. We have implemented one static checker, a return value usage checker. Novel features of this checker include the use of information from the software repository to try to improve its false positive rate by identifying patterns that have resulted in previous bug fixes.

  • Mining Repositories to Assist in Project Planning and Resource Allocation by Menzies, DiStefano, Chapman, Cunanan

    Software repositories plus defect logs are useful for learning defect detectors. Such defect detectors could be a useful resource allocation tool for software managers. One way to view our detectors is that they are a V&V tool for V&V; i.e. they can be used to assess if "too much" of the testing budget is going to "too little" of the system. Finding such detectors could be used as the basis of the business case that building a local repository is useful.

    Three counter arguments to such a proposal are (1) no general conclusions have been reported in any such repository despite years of effort; (2) if such general conclusions existed then there would be no need to build a local repository; (3) no such general conclusions will ever exist, according to many researchers. This article is a reply to these three arguments.

  • Bug Report Networks: Varieties, Strategies, and Impacts in a F/OSS Development Community by Sandusky, Gasser, Ripoche

    Our empirical research has shown that a predominant structural feature of defect tracking repositories is the evolving "bug report network" (BRN). Community members create BRNs by progressively asserting various formal and informal relationships between bug reports (BRs). In one F/OSS bug repository under study, participants assert two formal relationships (duplications and dependencies) and various informal relationships (like "see also" references).

    BRNs can be interpreted as (1) information ordering strategies that support collocation of related BRs, decreasing cognitive and organizational effort; (2) sense-making strategies wherein BRNs provide more refined representations of software and work-organization issues; (3) social ordering strategies that rearrange collective relationships among community members. This paper presents findings from an investigation of the nature, extent, and impact of BRNs in one F/OSS development community. We investigate whether and how specific classes of BRNs influence problem management within the community, and identify several new research questions.

  • A Tool for Mining Defect-Tracking Systems to Predict Fault-Prone Files by Ostrand, Weyuker

    In earlier research we identified characteristics of files in large software systems that tend to make them particularly likely to contain faults. We then developed a statistical model that uses historical fault information and file characteristics to predict which files of a system are likely to contain the largest numbers of faults. Testers can use that information to prioritize their testing and focus their efforts to make the testing process more efficient and the resulting software more dependable. In this paper we describe a proposed new tool to automate this prediction process, and discuss issues involved in its design and implementation. The goal is to produce an automated tool that mines the project defect tracking system and that can be used by testers without requiring any particular statistical expertise or subjective judgements.

  • Towards Understanding the Rhetoric of Small Changes by Purushothaman, Perry

    Understanding the impact of software changes has been a challenge since software systems were first developed. With the increasing size and complexity of systems, this problem has become more difficult. There are many ways to identify change impact from the plethora of software artifacts produced during development and maintenance. We present the analysis of the software development process using change and defect history data. Specifically, we address the problem of small changes. The studies revealed that (1) there is less than 4 percent probability that a one-line change will introduce an error in the code; (2) nearly 10 percent of all changes made during the maintenance of the software under consideration were one-line changes; (3) the phenomenon of change differs for additions, deletions and modifications as well as for the number of lines affected.

Process and Community Analysis: (Schedule)

  • Data Mining for Software Process Discovery in Open Source Software Development Communities by Jensen, Scacchi

    Software process discovery has historically been an intensive task, either done through exhaustive empirical studies or in an automated fashion using techniques such as logging and analysis of command shell operations. While empirical studies have been fruitful, data collection has proven to be tedious and time consuming. Existing automated approaches have expedited collection of fine-grained data, but do so at the cost of impinging on the developer's work environment, few of whom may be observed. In this paper, we explore techniques for discovering development processes from publicly available open source software development repositories that exploit advances in artificial intelligence. Our goal is to facilitate process discovery in ways that are less cumbersome than empirical techniques and offer a more holistic, task-oriented view of the process than current automated systems provide.

  • Applying Social Network Analysis to the Information in CVS Repositories by Lopez-Fernandez, Robles, Gonzalez-Barahona

    The huge quantities of data available in the CVS repositories of large, long-lived libre (free, open source) software projects, and the many interrelationships among those data, offer opportunities for extracting large amounts of valuable information about their structure, evolution and internal processes. Unfortunately, the sheer volume of that information renders it almost unusable without applying methodologies which highlight the relevant information for a given aspect of the project. In this paper, we propose the use of a well known set of methodologies (social network analysis) for characterizing libre software projects, their evolution over time and their internal structure. In addition, we show how we have applied such methodologies to real cases, and extract some preliminary conclusions from that experience.

  • Mining a Software Developer’s Local Interaction History by Schneider, Gutwin, Penner, Paquette

    Although shared software repositories are commonly used during software development, it is typical that a software developer browses and edits a local snapshot of the software under development. Developers periodically check their changes into the software repository; however, their interaction with the local copy is not recorded. Local interaction histories are a valuable source of information and should be considered when mining software repositories.

    In this paper we discuss the benefits of analyzing local interaction histories and present a technique and prototype implementation for their capture and analysis. As well, we discuss the implications of local interaction histories and the infrastructure of software repositories.

Reuse: (Schedule)

  • LASER: A Lexical Approach to Analogy in Software Reuse by Amin, Ó Cinnéide, Veale

    Software reuse involves creating a software system using existing software components, rather than creating it entirely from scratch. With the increase in size and complexity of existing software repositories, the need to provide intelligent support to the programmer becomes more pressing. An analogy is a comparison of certain similarities between things that are otherwise unlike. This concept has been shown to be valuable in developing UML-level reuse techniques. In the LASER project we apply lexically-driven Analogy at the code level, rather than at the UML-level, in order to retrieve matching components from a repository of existing components. Using the standard ontology WordNet, we have conducted a case study to assess if class and method names in open source applications are used in a semantically meaningful way. Our results demonstrate that both hierarchical reuse and parallel reuse can be enhanced through the use of lexically-driven Analogy.

  • A Case Study on Recommending Reusable Software Components using Collaborative Filtering by McCarey, Ó Cinnéide, Kushmerick

    The demand for quality, highly functional software reinforces the need for reusable software components. However, as repositories of reusable components increase in size and complexity, the challenge for developers to remain conversant with all components becomes greater. This paper proposes a software recommendation system based on collaborative filtering, which has been shown to be effective in other domains. Based on the usage patterns of existing classes and the class currently being developed, our system proposes a set of reuse candidates to the programmer. We present the results of our analysis of the usage of Swing classes in several open-source applications and find that the collaborative filtering technique is promising in providing recommendations in this context.

  • Template Mining in Source-Code Digital Libraries by Yusof, Rana

    As a greater number of software developers make their source code available, there is a need to store such open-source applications in a repository, and facilitate search over the repository. The objective of this research is to build a digital library of Java source code, to enable search and selection of source code. We believe that such a digital library will enable better sharing of experience amongst developers, and facilitate reuse of code segments. Information retrieval is often considered to be essential for the success of digital libraries, so they can achieve a high level of effectiveness while at the same time affording ease of use to a diverse community of users. Four different matching mechanisms (exact, generalization, reduction and nameOnly) are used to retrieve Java programs based on information extracted through template mining.

  • Multi-Project Software Engineering: An Example by Garg, Gschwind, Inoue

    In this paper we present an approach for developers to benefit from multi-project software knowledge. As we show in this paper, this can be achieved by gathering information about how numerous software projects are being built, and about the interrelation of the modules within the projects. Compared to approaches that only monitor a single project, the contribution of our approach is that it not only supports the reuse of isolated software modules or libraries but also the knowledge surrounding the code and individual projects. For instance, if a component is replaced with another probably better implementation within a project, this knowledge can be shared with all relevant projects. In this paper, we show how the collection of such data allows developers to learn about such decisions from other projects, and hence how to benefit from such "multi-project" knowledge.
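
A minimal sketch of how a collaborative-filtering recommender of the kind described in the McCarey, Ó Cinnéide and Kushmerick abstract above might work. It is an illustration under stated assumptions, not the authors' system: the Jaccard similarity measure, the scoring rule, the function names, and the sample usage data are hypothetical choices made for this example; only the Swing class names are real.

```python
def jaccard(a, b):
    """Similarity of two sets of component names (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_components(current, usage, top_n=3):
    """Suggest components used by existing classes that resemble `current`.

    `current` is the set of components the class under development already
    uses; `usage` maps each existing class to the set of components it uses.
    """
    scores = {}
    for used in usage.values():
        similarity = jaccard(current, used)
        if similarity == 0.0:
            continue
        # Each similar class votes for the components `current` lacks.
        for component in used - current:
            scores[component] = scores.get(component, 0.0) + similarity
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [component for component, _ in ranked[:top_n]]


# Hypothetical usage data: existing classes and the components they use.
usage = {
    "LoginDialog": {"JFrame", "JButton", "JTextField", "JLabel"},
    "SettingsPanel": {"JPanel", "JButton", "JCheckBox"},
    "ReportWriter": {"FileWriter", "StringBuilder"},
}
current = {"JFrame", "JButton"}
candidates = recommend_components(current, usage)
print(candidates)
```

Classes that share nothing with the class under development contribute no votes, so unrelated components such as `FileWriter` never appear among the candidates here.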


Last Modified by Ahmed E. Hassan on April 20 2004
