MSR Mining Challenge 2008

May 10-11, 2008
Leipzig
Germany  

Special track within MSR 2008,
5th Working Conference on Mining Software Repositories
http://2008.msrconf.org  

Co-located with ICSE 2008,
IEEE International Conference on Software Engineering
http://icse08.upb.de/


Organizers

Sung Kim (chair)
MIT, USA
Ahmed E. Hassan
Queen's University, Canada
Michele Lanza
University of Lugano, Switzerland
Michael W. Godfrey
University of Waterloo, Canada

Jury

Thomas Zimmermann
(University of Calgary, Canada)
Marco D'Ambros
(University of Lugano, Switzerland)
Peter Weißgerber
(University of Trier, Germany)
Christian Bird
(University of California, Davis, USA)
Shivkumar Shivaji
(Yahoo)
Miryung Kim
(University of Washington, USA)

Location

Co-located with ICSE 2008,
Leipzig, Germany

Sponsor

We are grateful to our generous sponsor, Coverity Inc. ("Static Analysis, Software Quality for C, C++, and Java").

Submissions for the Challenge are open!

Overview

This year's Working Conference on Mining Software Repositories (MSR 2008) will host a mining challenge. The MSR Mining Challenge brings together researchers and practitioners who are interested in applying, comparing, and challenging their mining tools and approaches on the software repositories of a common open source project: Eclipse.

There will be two challenge tracks: #1 (general) and #2 (prediction). The winner of each track will receive the Coverity MSR 2008 Challenge Award; in addition, the winner of track #1 will receive an iPod Nano and the winner of track #2 an iPod Touch.

Challenge #1: General

In this category you can demonstrate the usefulness of your mining tools. The main task is to find interesting insights by analyzing the software repositories of Eclipse. Eclipse is large, has several years of history, and provides plenty of input for mining tools. The idea of this track is that, because everyone uses the same data set, the data set serves as a benchmark and researchers can compare their results with each other. Submissions are limited to 4 pages and will be included in the MSR proceedings.

Participation is straightforward:

  1. Select your mining area (one of bug analysis, change analysis, architecture and design, process analysis, team structure).
  2. Get project data of Eclipse.
  3. Formulate your mining questions.
  4. Use your mining tool(s) to answer them (a toy sketch follows this list).
  5. Write up and submit your 4-page challenge report.
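
To make step 4 concrete, here is a minimal, hypothetical sketch of a mining tool. It counts how often each file path appears in a change log; the one-path-per-line input format is an assumption for illustration only, since real CVS/SVN logs need a proper parser.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.HashMap;
    import java.util.Map;

    // Toy change-frequency miner. Input format (an assumption for this
    // sketch): a plain-text log with one changed file path per line.
    public class ChangeCounter {
        public static void main(String[] args) throws Exception {
            Map<String, Integer> counts = new HashMap<String, Integer>();
            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            String path;
            while ((path = in.readLine()) != null) {
                path = path.trim();
                if (path.length() == 0) continue;          // skip blank lines
                Integer n = counts.get(path);
                counts.put(path, n == null ? 1 : n + 1);   // increment count
            }
            in.close();
            // Files changed most often are candidates for closer inspection.
            for (Map.Entry<String, Integer> e : counts.entrySet())
                System.out.println(e.getValue() + "\t" + e.getKey());
        }
    }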

The challenge report should describe the results of your work and cover the following aspects: questions addressed, input data, approach and tools used, derived results and their interpretation, and conclusions. Keep in mind that the report will be evaluated by the jury. Reports must be at most 4 pages long and in the ICSE format.

Data

Feel free to use any data source for the Mining Challenge. For your convenience, we provide mirrors for some of the repositories of Eclipse.

Challenge #2: Prediction

This year, the MSR Mining Challenge has a special task: for Eclipse, predict the number of bugs per module that will be reported between 2008/2/7 and 2008/5/7 (both days included). Suppose we are interested in some JDT modules and the numbers of bug reports between 2001/10/10 and 2007/12/14. Your job is to predict the future bug numbers (per module) using all possible resources, such as previous bug report counts, change counts, or your intuition. (The numbers of bug reports are counted using a Java program at http://bugminer.googlecode.com/svn/trunk/. Feel free to check it out and use it for your project.)

Components    Bug reports from              Bug reports from
              2001/10/10 to 2007/12/14      2008/2/7 to 2008/5/7
JDT.APT       246                           ?
JDT.Core      10880                         ?
JDT.Debug     6657                          ?
...           ...                           ?

Participation is as follows:

  1. Pick a team name, e.g., SCHNITZL.
  2. Come up with predictions for bug reports based on some criteria or prediction model. A very simple model is, for instance, the number of past changes/bugs (a minimal baseline sketch follows this list).
  3. Annotate the corresponding files with your predictions:
    • Predict the numbers of bug reports for the modules listed in components.html.
    • Write a paragraph (max 200 words) that describes how you computed your predictions.
    • Submit everything before Feb 7 (Apia time) by email to hunkim@csail.mit.edu.
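
For illustration, here is a minimal sketch of such a simple baseline. The constant-rate assumption and the hardcoded day counts are our own simplifications, not a recommended model.

    // Naive rate-based baseline (an illustrative assumption, not an
    // official model): assume bug reports arrive at a constant rate and
    // scale the historical count to the length of the target window.
    public class RateBaseline {
        // 2001/10/10..2007/12/14 spans roughly 2256 days; the target
        // window 2008/2/7..2008/5/7 (both included, 2008 is a leap year)
        // spans 91 days.
        static final double PAST_DAYS = 2256.0;
        static final double FUTURE_DAYS = 91.0;

        static long predict(long pastCount) {
            return Math.round(pastCount * (FUTURE_DAYS / PAST_DAYS));
        }

        public static void main(String[] args) {
            // Historical counts taken from the table above.
            System.out.println("JDT.APT   -> " + predict(246));     // ~10
            System.out.println("JDT.Core  -> " + predict(10880));   // ~439
            System.out.println("JDT.Debug -> " + predict(6657));    // ~269
        }
    }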

Obviously, the team with the best predictions will win. However, to spice up the competition, we will also provide a set of "benchmark" predictions.

Bug Prediction

The predictions for bugs should be at the component level. A component is specified directly in the bug reports; for instance, bug report 42233 was reported for the component "UI" of the product "JDT". For the challenge, we will consider the core products of Eclipse: Equinox, JDT, PDE, and Platform. A complete list of relevant products and components is in the file components.html. Note that we will not remove duplicates from the final counts.

Frequently Asked Questions

  • Do I need to give a presentation at the MSR conference? For challenge #1, the jury will select finalists who are expected to give a short presentation at the conference; the audience will then select a winner. For challenge #2, there is no presentation at the conference. The winners will be determined with statistical methods (correlation analysis; a sketch follows this list) and announced at the conference.
  • Does the challenge report have to be four pages? No, of course you can submit fewer than four pages. The page limit was set to ease the presentation of space-intensive results such as visualizations.
  • Wow, Eclipse data is soooo big! My tool won't finish in time. What can I do? Just run your tool on a subset of the projects. For instance, you could use only the JDT project of Eclipse. Especially when you are doing visualizations, it is almost impossible to show everything.
  • Predicting bugs? But I have no clue how to build prediction models. That's the fun thing about this category: you don't need to build sophisticated models. Of course, some people will, but others will just build simple predictors. In the end, we will see (a) whether we can predict future development events and (b) who does it best.
  • My cat is a visionary... can I submit its predictions, or is challenge #2 only for tools? Of course, go ahead and submit its predictions as a benchmark. However, your cat will be out of competition: only predictions generated by tools or by humans in a systematic way are eligible to win challenge #2.
  • For challenge #2 (prediction), is it acceptable if our team submits more than one prediction file? Only one submission per team (person) is allowed.
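
As a rough illustration of what a correlation-based evaluation might look like (the jury's exact procedure is not specified here, so Pearson correlation and the sample numbers below are assumptions), this sketch compares hypothetical predicted and actual per-component counts.

    // Sketch of a correlation-based evaluation. The jury's exact method
    // is not specified; Pearson correlation is an assumption here.
    public class CorrelationEval {
        static double pearson(double[] x, double[] y) {
            int n = x.length;
            double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
            for (int i = 0; i < n; i++) {
                sx += x[i];  sy += y[i];
                sxx += x[i] * x[i];  syy += y[i] * y[i];
                sxy += x[i] * y[i];
            }
            // Covariance and variances, each up to a common 1/n factor.
            double cov = sxy - sx * sy / n;
            double vx  = sxx - sx * sx / n;
            double vy  = syy - sy * sy / n;
            return cov / Math.sqrt(vx * vy);
        }

        public static void main(String[] args) {
            double[] predicted = {10, 439, 269};   // hypothetical submission
            double[] actual    = {14, 410, 250};   // hypothetical true counts
            System.out.println("r = " + pearson(predicted, actual));
        }
    }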

Important Dates

  • Submission of reports and predictions: 7th February 2008 (Apia time)
  • Acceptance notification: 14th February 2008
  • Camera-ready: 21st February 2008
  • Conference date: 10-11 May 2008

Previous Challenges

