ycoady / UVic-Software-Evolution


Lab 4: Data Collection and Experimentation #10

davidsjohnson commented 9 years ago

Read this article for insights on data collection for open source projects.

Since the article link isn't working, here are links to the project's website, which includes some datasets and metrics: http://maisqual.squoring.com/wiki/index.php/Data_sets and http://maisqual.squoring.com/wiki/index.php/Maisqual_Projects/Ant

Conduct an experiment based on your research question from last week:

  1. Come up with an assertion
  2. Perform data collection - Define specific metrics that will help analyze your assertion
  3. Analyze the collected data to support/refute the assertion

Post your findings for each step as a response to this issue and update your project repository README with the updated metrics.

Hoverbear commented 9 years ago

With @fraserd, @brodyholden

Assertion

Moving or refactoring code into distinct modules tames complexity and reduces the cognitive load required to contribute.

Metrics

The Connect project went from being (mostly) one repository to being several.

You can see some of these changes in this log: [screenshot of the git log, 2015-02-03]

Prior to 99b9feb449578c28064bd630b429e89d5aa4e47a, there were approximately 9398 lines of code:

      17 ./docs/docs.js
       3 ./docs/jquery.js
      34 ./examples/basicAuth.js
      39 ./examples/bodyParser.js
      41 ./examples/cookieSession.js
      36 ./examples/csrf.js
      12 ./examples/directory.js
      19 ./examples/error.js
      11 ./examples/favicon.js
      11 ./examples/helloworld.js
      40 ./examples/limit.js
      13 ./examples/logger.fast.js
      61 ./examples/logger.format.js
      14 ./examples/logger.js
      32 ./examples/mounting.js
      17 ./examples/profiler.js
      64 ./examples/rollingSession.js
     198 ./examples/session.js
      13 ./examples/static.js
      36 ./examples/upload-stream.js
      28 ./examples/upload.js
      31 ./examples/vhost.js
       3 ./index.js
      81 ./lib/cache.js
      92 ./lib/connect.js
      49 ./lib/index.js
     106 ./lib/middleware/basicAuth.js
      67 ./lib/middleware/bodyParser.js
     193 ./lib/middleware/compress.js
      67 ./lib/middleware/cookieParser.js
     122 ./lib/middleware/cookieSession.js
     163 ./lib/middleware/csrf.js
     343 ./lib/middleware/directory.js
      86 ./lib/middleware/errorHandler.js
      80 ./lib/middleware/favicon.js
      87 ./lib/middleware/json.js
      88 ./lib/middleware/limit.js
     342 ./lib/middleware/logger.js
      58 ./lib/middleware/methodOverride.js
     171 ./lib/middleware/multipart.js
      47 ./lib/middleware/query.js
      32 ./lib/middleware/responseTime.js
     128 ./lib/middleware/session/cookie.js
     129 ./lib/middleware/session/memory.js
     116 ./lib/middleware/session/session.js
      84 ./lib/middleware/session/store.js
     358 ./lib/middleware/session.js
     102 ./lib/middleware/static.js
     238 ./lib/middleware/staticCache.js
      55 ./lib/middleware/timeout.js
      77 ./lib/middleware/urlencoded.js
      40 ./lib/middleware/vhost.js
      89 ./lib/patch.js
     233 ./lib/proto.js
     409 ./lib/utils.js
      48 ./support/app.js
      45 ./support/docs.js
      17 ./test/app.listen.js
     115 ./test/basicAuth.js
     301 ./test/bodyParser.js
     202 ./test/compress.js
      72 ./test/cookieParser.js
     480 ./test/cookieSession.js
      79 ./test/csrf.js
     183 ./test/directory.js
      81 ./test/errorHandler.js
      25 ./test/exports.js
     279 ./test/json.js
      29 ./test/limit.js
      27 ./test/logger.js
      43 ./test/methodOverride.js
     202 ./test/mounting.js
     300 ./test/multipart.js
     182 ./test/patch.js
      29 ./test/query.js
      23 ./test/responseTime.js
      70 ./test/rollingSession.js
      92 ./test/server.js
     624 ./test/session.js
      25 ./test/shared/index.js
     370 ./test/static.js
     116 ./test/support/http.js
      88 ./test/timeout.js
      38 ./test/urlencoded.js
      32 ./test/utils.js
      76 ./test/vhost.js
    9398 total

Now, at 3645741b86f12474cb6b90fefad3f2c405eef7f1, there are only 981 lines of code in the Connect library itself:

       2 ./index.js
      34 ./lib/connect.js
     231 ./lib/proto.js
      19 ./test/app.listen.js
     132 ./test/fqdn.js
     282 ./test/mounting.js
     279 ./test/server.js
       2 ./test/support/env.js
     981 total

All of the rest of the code has been split off into separate repositories.
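
As a rough sketch (assuming a local clone of the connect repo; these are not the exact commands we ran), the counts above can be regenerated with something like:

```python
# Sketch: check out a commit and count lines across the repo's .js files.
import subprocess
from pathlib import Path

def count_js_lines(repo, commit):
    """Return ({file: lines}, total) for all .js files at `commit`."""
    subprocess.run(["git", "-C", repo, "checkout", "--quiet", commit], check=True)
    counts = {}
    for path in Path(repo).rglob("*.js"):
        # Skip git metadata and vendored dependencies.
        if ".git" in path.parts or "node_modules" in path.parts:
            continue
        with open(path, errors="replace") as f:
            counts[str(path)] = sum(1 for _ in f)
    return counts, sum(counts.values())

# before = count_js_lines("connect", "99b9feb449578c28064bd630b429e89d5aa4e47a")
# after = count_js_lines("connect", "3645741b86f12474cb6b90fefad3f2c405eef7f1")
```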

eburdon commented 9 years ago
  1. Assertion: source code download sizes for Python releases have increased over time
  2. Metrics: I collected version names, release dates, and file sizes from the Python downloads site, parsing the data into an Excel file. The tabular results were:
| Version | Release Date (DD/MM/YYYY) | File Size (Bytes) |
|---|---|---|
| Python 2.7.9 | 10/12/2014 | 16657930 |
| Python 3.4.2 | 13/10/2014 | 19257270 |
| Python 3.3.6 | 12/10/2014 | 16887234 |
| Python 3.2.6 | 12/10/2014 | 13135239 |
| Python 2.7.8 | 02/07/2014 | 14846119 |
| Python 2.7.7 | 01/06/2014 | 14809415 |
| Python 3.4.1 | 19/05/2014 | 19113124 |
| Python 3.4.0 | 17/03/2014 | 19222299 |
| Python 3.3.5 | 09/03/2014 | 16881688 |
| Python 3.3.4 | 09/02/2014 | 16843278 |
| Python 3.3.3 | 17/11/2013 | 16808057 |
| Python 2.7.6 | 10/11/2013 | 14725931 |
| Python 2.6.9 | 29/10/2013 | 12700000 |
| Python 3.3.2 | 15/05/2013 | 16530940 |
| Python 3.2.5 | 15/05/2013 | 13123323 |
| Python 2.7.5 | 12/05/2013 | 14492759 |
| Python 2.7.4 | 06/04/2013 | 14489063 |
| Python 3.2.4 | 06/04/2013 | 13121703 |
| Python 3.3.1 | 06/04/2013 | 16521332 |
| Python 3.3.0 | 29/09/2012 | 16327785 |
| Python 3.2.3 | 10/04/2012 | 12787688 |
| Python 2.6.8 | 10/04/2012 | 13282574 |
| Python 3.1.5 | 09/04/2012 | 11798798 |
| Python 2.7.3 | 09/04/2012 | 14135620 |
| Python 3.2.2 | 03/09/2011 | 12732276 |
| Python 3.2.1 | 09/07/2011 | 12713430 |
| Python 2.7.2 | 11/06/2011 | 14091337 |
| Python 3.1.4 | 11/06/2011 | 11795512 |
| Python 2.6.7 | 03/06/2011 | 13322372 |
| Python 2.5.6 | 26/05/2011 | 11100000 |
| Python 3.2.0 | 20/02/2011 | 12673043 |
| Python 3.1.3 | 27/11/2010 | 11769584 |
| Python 2.7.1 | 27/11/2010 | 14058131 |
| Python 2.6.6 | 24/08/2010 | 13318547 |
| Python 2.7.0 | 03/07/2010 | 14026384 |
| Python 3.1.2 | 20/03/2010 | 11661773 |
| Python 2.6.5 | 18/03/2010 | 13209175 |
| Python 2.5.5 | 31/01/2010 | 11100000 |
| Python 2.6.4 | 26/10/2009 | 13322131 |
| Python 2.6.3 | 02/10/2009 | 13319447 |
| Python 3.1.1 | 17/08/2009 | 11525876 |
| Python 3.1.0 | 26/06/2009 | 11359455 |
| Python 2.6.2 | 14/04/2009 | 13281177 |
| Python 3.0.1 | 13/02/2009 | 11258272 |
| Python 2.5.4 | 23/12/2008 | 11604497 |
| Python 2.4.6 | 19/12/2008 | 9550168 |
| Python 2.5.3 | 19/12/2008 | 11605520 |
| Python 2.6.1 | 04/12/2008 | 13046455 |
| Python 3.0.0 | 03/12/2008 | 11191348 |
| Python 2.6.0 | 02/10/2008 | 13023860 |
| Python 2.4.5 | 11/03/2008 | 9625509 |
| Python 2.3.7 | 11/03/2008 | 8694077 |
| Python 2.5.2 | 21/02/2008 | 11584231 |
| Python 2.5.1 | 19/04/2007 | 11060830 |
| Python 2.3.6 | 01/11/2006 | 8610359 |
| Python 2.4.4 | 18/10/2006 | 9531474 |
| Python 2.5.0 | 19/09/2006 | 11019675 |
| Python 2.4.3 | 15/04/2006 | 9348239 |
| Python 2.4.2 | 27/09/2005 | 9239975 |
| Python 2.4.1 | 30/03/2005 | 9219882 |
| Python 2.3.5 | 08/02/2005 | 8535749 |
| Python 2.4.0 | 30/11/2004 | 9198035 |
| Python 2.3.4 | 27/05/2004 | 8502738 |
| Python 2.3.3 | 19/12/2003 | 8491380 |
| Python 2.3.2 | 03/10/2003 | 8459427 |
| Python 2.3.1 | 23/09/2003 | 8558611 |
| Python 2.3.0 | 29/07/2003 | 8436880 |
| Python 2.2.3 | 30/05/2003 | 6709556 |
| Python 2.2.2 | 14/10/2002 | 6669400 |
| Python 2.2.1 | 10/04/2002 | 6535104 |
| Python 2.1.3 | 09/04/2002 | 6194432 |
| Python 2.2.0 | 21/12/2001 | 6542443 |
| Python 2.0.1 | 22/06/2001 | 3900000 |
  3. Analysis:

Trend graph of growing Python download sizes: [trend graph image]

The graph shows that while there is some variation from release to release, the general trend is that the size of Python's source/release downloads has grown over time. Does this mean Python has become more powerful? Or simply more complex? To be discovered later ...
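
For anyone who wants to reproduce the trend line, a minimal plotting sketch (the releases.csv file name and column headers are assumptions; the data is the table above):

```python
# Plot Python release download size against release date.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("releases.csv")  # columns: Version, ReleaseDate, Bytes
df["ReleaseDate"] = pd.to_datetime(df["ReleaseDate"], format="%d/%m/%Y")
df = df.sort_values("ReleaseDate")

plt.plot(df["ReleaseDate"], df["Bytes"] / 1e6, "o")
plt.xlabel("Release date")
plt.ylabel("Download size (MB)")
plt.title("Python source download size over time")
plt.show()
```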

chrisjcook commented 9 years ago

Although our experiment was not completed within the lab time, this experiment sheds some light on how we will define periods of high and low growth within the software projects we are analyzing. In addition, we get a better understanding of how we are testing our hypothesis.

Assertion - The growth of a software project is inversely proportional to the growth of its coupling/dependencies.
  1. Use GitStats on a software project (Django, for example) and consult the graphs for "File Count by Date" and "Lines of Code" to determine periods of high and low growth in the project's history.
  2. Take multiple samples from each of the high and low growth periods and run snakefood to get snapshots (as well as numerical values for dependency connections) of the project's dependencies at those times (a rough sketch of this step follows the list).
  3. Compare GitStats growth graphs with the dependency snapshots to determine if there is a correlation between changes in dependency/coupling growth rate and changes in a project's file count and/or lines of code produced.
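
A rough sketch of step 2, assuming snakefood's sfood tool and its one-dependency-tuple-per-line output format (the --internal flag, restricting output to dependencies within the project, is our understanding of the tool, not something we have verified end to end):

```python
# Count internal dependency edges reported by snakefood for one checkout.
import subprocess
from ast import literal_eval

def dependency_count(project_dir):
    out = subprocess.run(["sfood", "--internal", project_dir],
                         capture_output=True, text=True, check=True).stdout
    edges = [literal_eval(line) for line in out.splitlines() if line.strip()]
    # Entries whose target is (None, None) are files with no dependencies.
    return sum(1 for src, dst in edges if dst != (None, None))
```

Running this at each sampled commit gives the numerical dependency values to plot against the GitStats growth curves.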

--- Chris Cook, Richard Claus, Sarah Nicholson

PolloDiablo commented 9 years ago

Jeremy Kroeker

1. Come up with an assertion: After the community finds a bug, a patch will soon be released to fix it.

2. Perform data collection - Define specific metrics that will help analyze your assertion

During this lab session I analyzed the data sources (Reddit posts and patch notes) and determined how I would bring them into an intermediate data format. I designed a rough database table (deciding which columns would be required). The next phase of my project will involve scraping data from the websites and populating a database. More details here: https://github.com/PolloDiablo/SENG-371-Project-1/blob/master/docs/data.txt

Once I have a populated database, it will be easier to inspect with a text analytics tool.

The key metrics from the Reddit posts will be the frequency and popularity of bug reports for a given topic over a given time frame. These can be compared against the patch notes, which should mention a bugfix (for the same topic) in a subsequent time frame.
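
As an illustration only (the actual schema is in the data.txt linked above; the table and column names here are hypothetical), the kind of query I have in mind looks like:

```python
# Hypothetical schema plus a frequency/popularity query per topic per month.
import sqlite3

conn = sqlite3.connect("bugs.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS reports (
        source TEXT,    -- 'reddit' or 'patch_notes'
        topic  TEXT,    -- what the post or note is about
        score  INTEGER, -- post popularity (e.g. upvotes)
        posted TEXT     -- ISO-8601 date
    )""")

rows = conn.execute("""
    SELECT topic, strftime('%Y-%m', posted) AS month, COUNT(*), SUM(score)
    FROM reports
    WHERE source = 'reddit'
    GROUP BY topic, month
    ORDER BY month""").fetchall()
```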

3. Analyze the collected data to support/refute the assertion

Coming Soon...

knowlesc commented 9 years ago

Colin Knowles and Ryan McDonald

Assertion: The number of lines of code in a file correlates with the number of common programming errors found by the PMD tool.

Method: Ran PMD, generated XML, created a Python script to translate the XML into an Excel file, created a graph, and added a trendline.

Analysis: [results graph]
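
For reference, a sketch of the translation step (this assumes PMD's XML report layout of `<file name="...">` elements containing `<violation>` children; our actual script also wrote the Excel output):

```python
# Pair each file's line count with its PMD violation count for graphing.
import xml.etree.ElementTree as ET

def violations_per_file(report_path):
    root = ET.parse(report_path).getroot()
    return {f.get("name"): len(f.findall("violation"))
            for f in root.findall("file")}

def loc(path):
    with open(path, errors="replace") as f:
        return sum(1 for _ in f)

points = [(loc(name), count)
          for name, count in violations_per_file("pmd.xml").items()]
```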

guand commented 9 years ago

Jian Guan, Jonathan Lam, Paul Moon

Software Evolution Experiment

1) Hypothesis

The rate of feature additions will decrease as the project increases in size.

2) Data collection

We obtained a list of all closed issues (with the "enhancement" label) and pull requests from the Backbone.js repo via the GitHub API. A Python parser was written to print out the pull requests that were merged into master and linked with an "enhancement" issue; these represent the pull requests that add feature enhancements to the repo. We will be researching a way to track how the LOC (lines of code) of the repo changes over time, as well as gathering data from other repos such as Bootstrap.
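
A minimal sketch of the collection step (the endpoint and parameters are GitHub's real issues API; pagination is handled naively and error handling is omitted):

```python
# Fetch all closed issues labelled "enhancement" for a repository.
import requests

def closed_enhancements(repo):
    url = "https://api.github.com/repos/%s/issues" % repo
    params = {"state": "closed", "labels": "enhancement",
              "per_page": 100, "page": 1}
    while True:
        batch = requests.get(url, params=params).json()
        if not batch:
            break
        for issue in batch:
            yield issue
        params["page"] += 1

# for issue in closed_enhancements("jashkenas/backbone"):
#     print(issue["number"], issue["title"])
```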

3) Data Analysis

Data analysis cannot be done yet because the LOC data still needs to be collected. Once obtained, a graph of LOC vs. time and a graph of total number of features vs. time will be plotted. We will also explore the possibility of visualizing the rate of feature additions with the Gource tool.

gregnr commented 9 years ago

Parker Atkins, Rabjot Aujla, Greg Richardson, Jordan Heemskerk

1. Hypothesis

The number of bugs a project has will decrease as the volume of unit tests increases.

2. Data collection

We use the number of issues in a GitHub repository to quantify the number of bugs. We use the number of lines of code in test files to quantify the volume of unit tests.

Here is a (raw) data set of the number of issues found over time for jQuery: [graph image]

We plan to refine this to count only bugs - currently it includes all issues, some of which aren't bugs (e.g. pull requests).

We are in the process of developing a tool to count the lines of code in test files.
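
A first sketch of that tool (it assumes test files live under directories named test or tests, and counts every physical line, comments included):

```python
# Sum lines across all test files in a checkout.
from pathlib import Path

def test_loc(repo_root, extensions=(".js",)):
    total = 0
    for path in Path(repo_root).rglob("*"):
        if path.suffix in extensions and {"test", "tests"} & set(path.parts):
            with open(path, errors="replace") as f:
                total += sum(1 for _ in f)
    return total

# print(test_loc("jquery"))  # against a local jQuery clone
```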

3. Analysis

To analyze the data, the number of bugs and the number of test lines will be graphed next to each other over time. From our hypothesis, we expect the number of bugs to decrease as the number of lines of tests increases. Our tool is not yet ready to collect/graph this data.

Jsyro commented 9 years ago

Jason Syrotuck, Evan Hildebrandt, Keith Rollans.

Assertion:

Organizing source code files by functionality allows a project to evolve more readily.

Data Collection:

We used Gource to visually inspect the file structure and find points in time where many files were refactored. We then checked out builds from before and after each discovered change to thoroughly understand the update, looking for the addition (or removal) of folders and files.

Data Analysis:

We found an instance in the jQuery project where a collection of source code files was distributed into many subfolders. Presumably the developers deemed this structure more manageable.

BEFORE: [Gource screenshot]

AFTER: [Gource screenshot]

Diff of the files contained in the project code: [diff screenshot]

devinc13 commented 9 years ago

Assertion

The more anti-regressive changes there are, the more bugs will be reported in the following months.

Data

We are using a Python script we wrote that hits GitHub's API in monthly increments to count the issues or pull requests matching certain queries and/or labels. Today we only had time to run it on Ruby on Rails.

We ran it three times: first searching for bugs (search term "bug"), then for anti-regressive changes (search term "refactor OR rewrite"), and finally for progressive changes (search term "feature"). Since Rails doesn't use labels to identify these categories, these runs just search for the word in the pull request or issue text. In repositories that do use labels, the searches will be more precise.
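
The core of the script looks roughly like this (the /search/issues endpoint is GitHub's real search API; unauthenticated requests are heavily rate-limited, which the real script has to respect):

```python
# Count issues/PRs matching a search term, created in a given date window.
import requests

def monthly_count(repo, term, start, end):
    query = "repo:%s %s created:%s..%s" % (repo, term, start, end)
    resp = requests.get("https://api.github.com/search/issues",
                        params={"q": query})
    return resp.json()["total_count"]

# Bugs reported against Rails in April 2011:
# monthly_count("rails/rails", "bug", "2011-04-01", "2011-04-30")
```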

Running the script gave us the number of bugs, anti-regressive changes, and progressive changes for every month of the project, which for Rails spans April 2011 to February 2015. We then created the graphs below: [four graph images]

Analysis

Our data from this experiment hasn't given us enough of a pattern/trend to support or refute our assertion, but we are hoping that running the same process on other repositories will yield more conclusive results.

PS - If anyone thinks our tool might be helpful to them, come talk to us!

mitchellri commented 9 years ago

Hypothesis: Development time per feature release will be shorter for open source projects using agile methods than for open source projects that do not.

Metrics: We have access to the commits of any project on GitHub; the projects we found that would help analyze our assertion were Titan, SonarQube, and Linux. Firefox could also be an option, as it uses an agile open source development methodology.

Data Analysis: We are unable to confirm whether Titan uses any particular development methodology, so we cannot effectively analyze data from that project yet. However, we have posted a question about it on their questions page for future analysis.

SonarQube uses an agile open source development methodology. According to SonarQube's Jira, most tasks were completed in approximately four days, with some outlier tasks lasting multiple months.

To accurately verify our hypothesis we would need to take the average over all of the completed tasks and compare across projects. This takes too much time to do by inspection, so we could not complete the verification within the lab.
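
For later: the averaging itself is cheap once the data is exported. A minimal sketch, assuming a CSV export from Jira with Created and Resolved columns (the file and column names are guesses):

```python
# Mean completion time over all resolved tasks.
import pandas as pd

df = pd.read_csv("sonarqube_issues.csv", parse_dates=["Created", "Resolved"])
completed = df.dropna(subset=["Resolved"])
print((completed["Resolved"] - completed["Created"]).mean())
```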

Mitchell Rivett, Tyler Potter

Brayden-Arthur commented 9 years ago

Assertion

Interest in a software system can also contribute to sales of a product utilizing that system.

Data Collection

I've got a Dropbox link with some Excel data that I can import into R; however, this is just product interest and a revenue approximation, since it's difficult to determine patch size and to get all the relevant patch data.

Data Analysis

I've made some graphs of the average sale prices of Apple, Google, and Microsoft stocks over time. They're in the imgur album here. I plan to see whether the prices correlate with either market share or total sales.

Aside

I feel the initial project question is too broad and does not have enough obtainable data. The question needs further consideration.

Bleech94 commented 9 years ago

Conduct an experiment based on your research question from last week:

  1. Come up with an assertion: Large changes in code follow large exchanges of communication.
  2. Perform data collection - Define specific metrics that will help analyze your assertion:
    • metric: the number of contributions over time, correlated with traffic on the Ant developers' mailing list.
  3. Analyze the collected data to support/refute the assertion:
    • we have seen at least a suggestion of a connection between major communications and subsequent major changes in the code base.

[correlation graph]
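
A minimal sketch of how such a correlation can be checked, assuming two aligned monthly series (the values below are illustrative, not our data):

```python
# Pearson correlation between mailing-list traffic and code contributions.
import numpy as np

messages = np.array([120, 340, 90, 410, 150])  # mails per month (illustrative)
commits  = np.array([ 45, 110, 30, 150,  60])  # commits per month (illustrative)

print(np.corrcoef(messages, commits)[0, 1])           # same-month correlation
print(np.corrcoef(messages[:-1], commits[1:])[0, 1])  # mail leading by a month
```

The lagged version matters here, since the assertion is that large changes follow large exchanges of communication.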

Jorin Weatherston, Brandon Leech

RobertLeahy commented 9 years ago

My assertion is that a change in programming paradigm from an older paradigm/language/library to a more modern paradigm/language/library leads to:

I'm analyzing GCC to attempt to prove this.

GCC recently transitioned from being written in C to C++, thereby making the following transitions:

I was informed by this article, which suggests in a totally anecdotal way that GCC has benefited from both the paradigm and the new library (specifically C++'s comprehensive library of generic collections). Moreover, the aforementioned article gave me a start date for the transitional period: June 18, 2008, when this presentation was given at one of GCC's summits.

I deemed the end of this transitional period to be when GCC's main trunk stopped building under pure C compilers, which occurred at the time of this post to their mailing list (i.e. August 2nd, 2012).

I wrote a simple tool in Python which aggregates the output of git --no-pager log --shortstat, to obtain a high-level overview of repo activity.
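
The core of the tool boils down to something like this (a sketch, not the exact script):

```python
# Aggregate commits, insertions and deletions over a git revision range.
import re
import subprocess

def aggregate(repo, rev_range):
    log = subprocess.run(
        ["git", "--no-pager", "-C", repo, "log", "--shortstat",
         "--pretty=oneline", rev_range],
        capture_output=True, text=True, check=True).stdout
    commits = insertions = deletions = 0
    for line in log.splitlines():
        if re.match(r"^[0-9a-f]{40} ", line):  # one oneline header per commit
            commits += 1
        m = re.search(r"(\d+) insertion", line)
        if m:
            insertions += int(m.group(1))
        m = re.search(r"(\d+) deletion", line)
        if m:
            deletions += int(m.group(1))
    return commits, insertions, deletions
```

Running the tool over the three periods yields the following results: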

Post-Transition

Commits: 18088 Insertions: 4050566 Deletions: 2238396 Duration: 903 days

During Transition

Commits: 30170 Insertions: 8483389 Deletions: 5495633 Duration: 1506 days

Before Transition

Commits: 88050 Insertions: 22460802 Deletions: 12568041 Duration: 7147 days

This yields the following rates:

Post-Transition

Commits/day: 20 Insertions/day: 4486 Deletions/day: 2479 Growth/day: 2007

During Transition

Commits/day: 20 Insertions/day: 5633 Deletions/day: 3649 Growth/day: 1984

Before Transition

Commits/day: 12 Insertions/day: 3143 Deletions/day: 1759 Growth/day: 1384

Where "growth" is defined as deletions subtracted from insertions.

We note that because "before transition" includes periods of relatively lower activity, the number of commits per day is 40% lower. We adjust the values to compensate, scaling by the commit-rate ratio (20/12):

Before Transition (Adjusted)

Commits/day: 20 Insertions/day: 5238 Deletions/day: 2932 Growth/day: 2306

We observe from the data the following: