davidsjohnson opened this issue 9 years ago
With @fraserd, @brodyholden
Moving or refactoring code into distinct modules is one way to tame complexity and reduce the cognitive load required to contribute.
The Connect project went from being (mostly) one repository to being several.
You can see some of these changes in this log:
Prior to 99b9feb449578c28064bd630b429e89d5aa4e47a, there were approximately 9,398 lines of code.
17 ./docs/docs.js
3 ./docs/jquery.js
34 ./examples/basicAuth.js
39 ./examples/bodyParser.js
41 ./examples/cookieSession.js
36 ./examples/csrf.js
12 ./examples/directory.js
19 ./examples/error.js
11 ./examples/favicon.js
11 ./examples/helloworld.js
40 ./examples/limit.js
13 ./examples/logger.fast.js
61 ./examples/logger.format.js
14 ./examples/logger.js
32 ./examples/mounting.js
17 ./examples/profiler.js
64 ./examples/rollingSession.js
198 ./examples/session.js
13 ./examples/static.js
36 ./examples/upload-stream.js
28 ./examples/upload.js
31 ./examples/vhost.js
3 ./index.js
81 ./lib/cache.js
92 ./lib/connect.js
49 ./lib/index.js
106 ./lib/middleware/basicAuth.js
67 ./lib/middleware/bodyParser.js
193 ./lib/middleware/compress.js
67 ./lib/middleware/cookieParser.js
122 ./lib/middleware/cookieSession.js
163 ./lib/middleware/csrf.js
343 ./lib/middleware/directory.js
86 ./lib/middleware/errorHandler.js
80 ./lib/middleware/favicon.js
87 ./lib/middleware/json.js
88 ./lib/middleware/limit.js
342 ./lib/middleware/logger.js
58 ./lib/middleware/methodOverride.js
171 ./lib/middleware/multipart.js
47 ./lib/middleware/query.js
32 ./lib/middleware/responseTime.js
128 ./lib/middleware/session/cookie.js
129 ./lib/middleware/session/memory.js
116 ./lib/middleware/session/session.js
84 ./lib/middleware/session/store.js
358 ./lib/middleware/session.js
102 ./lib/middleware/static.js
238 ./lib/middleware/staticCache.js
55 ./lib/middleware/timeout.js
77 ./lib/middleware/urlencoded.js
40 ./lib/middleware/vhost.js
89 ./lib/patch.js
233 ./lib/proto.js
409 ./lib/utils.js
48 ./support/app.js
45 ./support/docs.js
17 ./test/app.listen.js
115 ./test/basicAuth.js
301 ./test/bodyParser.js
202 ./test/compress.js
72 ./test/cookieParser.js
480 ./test/cookieSession.js
79 ./test/csrf.js
183 ./test/directory.js
81 ./test/errorHandler.js
25 ./test/exports.js
279 ./test/json.js
29 ./test/limit.js
27 ./test/logger.js
43 ./test/methodOverride.js
202 ./test/mounting.js
300 ./test/multipart.js
182 ./test/patch.js
29 ./test/query.js
23 ./test/responseTime.js
70 ./test/rollingSession.js
92 ./test/server.js
624 ./test/session.js
25 ./test/shared/index.js
370 ./test/static.js
116 ./test/support/http.js
88 ./test/timeout.js
38 ./test/urlencoded.js
32 ./test/utils.js
76 ./test/vhost.js
9398 total
Now, at 3645741b86f12474cb6b90fefad3f2c405eef7f1, there are only 981 lines of code in the connect library.
2 ./index.js
34 ./lib/connect.js
231 ./lib/proto.js
19 ./test/app.listen.js
132 ./test/fqdn.js
282 ./test/mounting.js
279 ./test/server.js
2 ./test/support/env.js
981 total
The rest of the code has been split off into separate repositories.
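The per-file counts above can be reproduced with a short script along these lines (a sketch of a `wc -l`-style counter; run it from a checkout of the repo at the commit of interest):

```python
import os

def count_lines(root, exts=(".js",)):
    """Raw line counts per file, similar to `wc -l`."""
    counts = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(exts):
                path = os.path.join(dirpath, name)
                with open(path, errors="ignore") as f:
                    counts[path] = sum(1 for _ in f)
    return counts

for path, n in sorted(count_lines(".").items()):
    print(f"{n:6d} {path}")
```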
Version | Release Date (DD/MM/YYYY) | File Size (Bytes)
---|---|---
Python 2.7.9 | 10/12/2014 | 16657930 |
Python 3.4.2 | 13/10/2014 | 19257270 |
Python 3.3.6 | 12/10/2014 | 16887234 |
Python 3.2.6 | 12/10/2014 | 13135239 |
Python 2.7.8 | 02/07/2014 | 14846119 |
Python 2.7.7 | 01/06/2014 | 14809415 |
Python 3.4.1 | 19/05/2014 | 19113124 |
Python 3.4.0 | 17/03/2014 | 19222299 |
Python 3.3.5 | 09/03/2014 | 16881688 |
Python 3.3.4 | 09/02/2014 | 16843278 |
Python 3.3.3 | 17/11/2013 | 16808057 |
Python 2.7.6 | 10/11/2013 | 14725931 |
Python 2.6.9 | 29/10/2013 | 12700000 |
Python 3.3.2 | 15/05/2013 | 16530940 |
Python 3.2.5 | 15/05/2013 | 13123323 |
Python 2.7.5 | 12/05/2013 | 14492759 |
Python 2.7.4 | 06/04/2013 | 14489063 |
Python 3.2.4 | 06/04/2013 | 13121703 |
Python 3.3.1 | 06/04/2013 | 16521332 |
Python 3.3.0 | 29/09/2012 | 16327785 |
Python 3.2.3 | 10/04/2012 | 12787688 |
Python 2.6.8 | 10/04/2012 | 13282574 |
Python 3.1.5 | 09/04/2012 | 11798798 |
Python 2.7.3 | 09/04/2012 | 14135620 |
Python 3.2.2 | 03/09/2011 | 12732276 |
Python 3.2.1 | 09/07/2011 | 12713430 |
Python 2.7.2 | 11/06/2011 | 14091337 |
Python 3.1.4 | 11/06/2011 | 11795512 |
Python 2.6.7 | 03/06/2011 | 13322372 |
Python 2.5.6 | 26/05/2011 | 11100000 |
Python 3.2.0 | 20/02/2011 | 12673043 |
Python 3.1.3 | 27/11/2010 | 11769584 |
Python 2.7.1 | 27/11/2010 | 14058131 |
Python 2.6.6 | 24/08/2010 | 13318547 |
Python 2.7.0 | 03/07/2010 | 14026384 |
Python 3.1.2 | 20/03/2010 | 11661773 |
Python 2.6.5 | 18/03/2010 | 13209175 |
Python 2.5.5 | 31/01/2010 | 11100000 |
Python 2.6.4 | 26/10/2009 | 13322131 |
Python 2.6.3 | 02/10/2009 | 13319447 |
Python 3.1.1 | 17/08/2009 | 11525876 |
Python 3.1.0 | 26/06/2009 | 11359455 |
Python 2.6.2 | 14/04/2009 | 13281177 |
Python 3.0.1 | 13/02/2009 | 11258272 |
Python 2.5.4 | 23/12/2008 | 11604497 |
Python 2.4.6 | 19/12/2008 | 9550168 |
Python 2.5.3 | 19/12/2008 | 11605520 |
Python 2.6.1 | 04/12/2008 | 13046455 |
Python 3.0.0 | 03/12/2008 | 11191348 |
Python 2.6.0 | 02/10/2008 | 13023860 |
Python 2.4.5 | 11/03/2008 | 9625509 |
Python 2.3.7 | 11/03/2008 | 8694077 |
Python 2.5.2 | 21/02/2008 | 11584231 |
Python 2.5.1 | 19/04/2007 | 11060830 |
Python 2.3.6 | 01/11/2006 | 8610359 |
Python 2.4.4 | 18/10/2006 | 9531474 |
Python 2.5.0 | 19/09/2006 | 11019675 |
Python 2.4.3 | 15/04/2006 | 9348239 |
Python 2.4.2 | 27/09/2005 | 9239975 |
Python 2.4.1 | 30/03/2005 | 9219882 |
Python 2.3.5 | 08/02/2005 | 8535749 |
Python 2.4.0 | 30/11/2004 | 9198035 |
Python 2.3.4 | 27/05/2004 | 8502738 |
Python 2.3.3 | 19/12/2003 | 8491380 |
Python 2.3.2 | 03/10/2003 | 8459427 |
Python 2.3.1 | 23/09/2003 | 8558611 |
Python 2.3.0 | 29/07/2003 | 8436880 |
Python 2.2.3 | 30/05/2003 | 6709556 |
Python 2.2.2 | 14/10/2002 | 6669400 |
Python 2.2.1 | 10/04/2002 | 6535104 |
Python 2.1.3 | 09/04/2002 | 6194432 |
Python 2.2.0 | 21/12/2001 | 6542443 |
Python 2.0.1 | 22/06/2001 | 3900000 |
Trend graph of growing Python download sizes:
The graph shows that, while the download sizes vary from release to release, the general trend is that the size of Python's source/release download has grown over time. Does this mean Python has become more powerful? Or simply more complex? To be discovered later ...
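As a rough check on that trend, a least-squares slope can be fitted to a few representative (date, size) points read off the table (decimal years, bytes; the point selection here is ours, not from the original analysis):

```python
# Least-squares slope of release size vs. release date for a sample of
# Python releases spanning the table above.
data = [
    (2001.5, 3_900_000),    # 2.0.1
    (2004.9, 9_198_035),    # 2.4.0
    (2008.9, 13_023_860),   # 2.6.0
    (2010.5, 14_026_384),   # 2.7.0
    (2014.2, 19_222_299),   # 3.4.0
]
n = len(data)
sx = sum(x for x, _ in data)
sy = sum(y for _, y in data)
sxx = sum(x * x for x, _ in data)
sxy = sum(x * y for x, y in data)
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
print(f"average growth: ~{slope / 1e6:.2f} MB per year")
```

On this sample the fitted growth rate comes out to roughly a megabyte per year.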
Although our experiment was not completed within the lab time, this experiment sheds some light on how we will define periods of high and low growth within the software projects we are analyzing. In addition, we get a better understanding of how we are testing our hypothesis.
--- Chris Cook, Richard Claus, Sarah Nicholson
Jeremy Kroeker
1. Come up with an assertion: After the community finds a bug, a patch will soon be released to fix it.
2. Perform data collection - Define specific metrics that will help analyze your assertion
During this lab session I analyzed the data sources (Reddit posts and patch notes) and determined how I would bring them into an intermediate data format. I designed a rough database table (which columns would be required). The next phase of my project will involve scraping data from the websites and populating a database. More details here: https://github.com/PolloDiablo/SENG-371-Project-1/blob/master/docs/data.txt
Once I have a populated database, it will be easier to inspect with a text analytics tool.
The key metrics from the Reddit posts will be the frequency and popularity of bug reports for a given topic over a given time frame. These can be compared against the patch notes, which should mention a bugfix for the same topic in a subsequent time frame.
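A minimal sketch of such an intermediate format, using SQLite (the column names are placeholders, not the actual schema from the linked data.txt):

```python
import sqlite3

# Rough intermediate schema: one table per data source.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE reddit_posts (
    id INTEGER PRIMARY KEY,
    posted_at TEXT,       -- ISO date
    topic TEXT,
    score INTEGER,        -- popularity proxy
    is_bug_report INTEGER
);
CREATE TABLE patch_notes (
    id INTEGER PRIMARY KEY,
    released_at TEXT,
    topic TEXT,
    mentions_bugfix INTEGER
);
""")

# Example metric query: bug-report frequency per topic per month.
rows = conn.execute("""
    SELECT topic, substr(posted_at, 1, 7) AS month, COUNT(*)
    FROM reddit_posts
    WHERE is_bug_report = 1
    GROUP BY topic, month
""").fetchall()
print(rows)
```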
3. Analyze the data collection to support/refute the assertion
Coming Soon...
Colin Knowles and Ryan McDonald
Assertion: The number of lines of code in a file relates to the number of common programming errors found by the PMD tool.
Method: Ran PMD, generated XML, created a Python script to translate the XML into an Excel file, created a graph, and added a trendline.
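The XML-translation step might look roughly like this, assuming a simplified PMD report shape (real PMD reports carry an XML namespace and more attributes):

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a PMD XML report.
SAMPLE = """<pmd>
  <file name="src/Foo.java">
    <violation beginline="3" rule="UnusedLocalVariable"/>
    <violation beginline="9" rule="EmptyCatchBlock"/>
  </file>
  <file name="src/Bar.java">
    <violation beginline="1" rule="UnusedImports"/>
  </file>
</pmd>"""

def violations_per_file(xml_text):
    """Map each analyzed file to its violation count."""
    root = ET.fromstring(xml_text)
    return {f.get("name"): len(f.findall("violation"))
            for f in root.findall("file")}

print(violations_per_file(SAMPLE))
```

The resulting dict pairs naturally with per-file LOC counts for the graph.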
Analysis:
Jian Guan, Jonathan Lam, Paul Moon
The rate of feature additions will decrease as the project increases in size.
We obtained a list of all closed issues (with "enhancement" label) and pull requests from the Backbone.js repo via the GitHub API. A Python parser was written to print out the pull requests merged into master and linked with an "enhancement" issue. This represents the pull requests which are feature enhancements to the repo. We will be researching a way to obtain how the LOC (Lines of Code) of the repo changes over time, as well as gathering data from other repos such as Bootstrap.
Data analysis cannot be done yet because LOC data still needs to be collected. Once obtained, a graph of LOC vs. time and a graph of total number of features vs. time will be plotted. We will also explore the possibility of visualizing the rate of feature additions with the Gource tool.
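The filtering step described above might look roughly like this, run against a simplified version of a GitHub issues API payload (the `merged_at` field on `pull_request` is an assumption here; in the real API, merge status may require a follow-up request):

```python
import json

# Simplified shape of a GitHub issues API response; real responses
# carry many more fields.
SAMPLE = json.loads("""[
  {"number": 101, "state": "closed",
   "labels": [{"name": "enhancement"}],
   "pull_request": {"merged_at": "2014-03-01T12:00:00Z"}},
  {"number": 102, "state": "closed",
   "labels": [{"name": "bug"}],
   "pull_request": null}
]""")

def merged_enhancements(issues):
    """Numbers of merged pull requests labeled 'enhancement'."""
    out = []
    for issue in issues:
        labels = {label["name"] for label in issue.get("labels", [])}
        pr = issue.get("pull_request") or {}
        if "enhancement" in labels and pr.get("merged_at"):
            out.append(issue["number"])
    return out

print(merged_enhancements(SAMPLE))
```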
Parker Atkins, Rabjot Aujla, Greg Richardson, Jordan Heemskerk
The number of bugs a project has will decrease as the volume of unit tests increases.
We use the number of issues in a GitHub repository to quantify the number of bugs. We use the number of lines of code in test files to quantify the volume of unit tests.
Here is a (raw) data set of the number of issues found over time for jQuery:
We plan to refine this to count only bugs; currently it includes all issues, some of which may not be bugs (e.g. pull requests).
We are in the process of developing a tool to count the lines of code in test files.
To analyze the data, the number of bugs and the number of test lines will be graphed next to each other over time. From our hypothesis, we expect the number of bugs to decrease as the number of lines of tests increases. Our tool is not yet ready to collect/graph this data.
Jason Syrotuck, Evan Hildebrandt, Keith Rollans.
Assertion:
Organizing source code files by functionality allows a project to evolve more readily.
Data Collection:
We used Gource to visually inspect the file structure and find points in time where many files were refactored. We then checked out builds before and after each discovered change to thoroughly understand the update, looking for the addition (or removal) of folders and files.
Data Analysis:
We found an instance in the jQuery project where a collection of source files was distributed into many subfolders, presumably because the developers deemed the new layout more manageable.
BEFORE:
AFTER:
Diff of the files contained in the project code:
The more anti-regressive changes there are, the more bugs will be reported in the following months.
We are using a Python script we wrote that hits GitHub's API in monthly increments to count the number of issues or pull requests matching certain queries and/or labels. Today we only had time to run it on Ruby on Rails.
We ran it three times: first searching for bugs (search term "bug"), then for anti-regressive changes (search term "refactor OR rewrite"), and finally for progressive changes (search term "feature"). Since Rails didn't use labels to identify these categories, we are simply searching for the word in the pull request or issue text. In other repositories that do use labels, these searches will be more precise.
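A sketch of how such monthly queries could be constructed against GitHub's search API (URL shape abbreviated; the real endpoint also needs an `Accept` header and rate-limit handling, which we omit here):

```python
from datetime import date

def month_ranges(start, end):
    """Yield (first_of_month, first_of_next_month) for each month in [start, end]."""
    y, m = start.year, start.month
    while (y, m) <= (end.year, end.month):
        ny, nm = (y + 1, 1) if m == 12 else (y, m + 1)
        yield date(y, m, 1), date(ny, nm, 1)
        y, m = ny, nm

def search_url(repo, term, first, next_first):
    # e.g. repo:rails/rails bug created:2011-04-01..2011-05-01
    q = f"repo:{repo} {term} created:{first:%Y-%m-%d}..{next_first:%Y-%m-%d}"
    return "https://api.github.com/search/issues?q=" + q.replace(" ", "+")

for first, nxt in month_ranges(date(2011, 4, 1), date(2011, 6, 1)):
    print(search_url("rails/rails", "refactor OR rewrite", first, nxt))
```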
This gave us the number of bugs, anti-regressive changes, and progressive changes in every month of the project, which for Rails spans April 2011 to February 2015. We then created the graphs below:
Our data from this experiment hasn't given us enough of a pattern/trend to support or refute our assertion, but we are hoping that running the same process on other repositories will yield more conclusive results.
PS - If anyone thinks our tool might be helpful to them, come talk to us!
Hypothesis: Development time per feature release will be shorter for agile open source projects than for non-agile open source projects.
Metrics: We have access to the commits of any project on GitHub, but the projects we found that would help analyze our assertion were Titan, SonarQube, and Linux. Firefox could also be an option, as it uses an agile open source development methodology.
Data Analysis: We are unable to confirm whether Titan uses any particular development methodology, so we cannot effectively analyze that data yet. However, we have posted a question on their questions page for future analysis.
SonarQube uses an agile open source development methodology. According to SonarQube's Jira, most features were completed in approximately four days, with some outlier tasks lasting multiple months.
To verify our hypothesis accurately we would need to average the completion times of all completed tasks and compare. This takes too long to do by inspection, so we could not complete the verification within the lab.
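For completeness, the averaging itself is simple once created/resolved timestamps are exported from Jira; a sketch with hypothetical dates:

```python
from datetime import datetime

# Hypothetical (created, resolved) dates for completed tasks.
tasks = [
    ("2015-01-05", "2015-01-09"),
    ("2015-01-10", "2015-01-14"),
    ("2014-11-01", "2015-01-20"),  # an outlier spanning months
]

def avg_days(pairs):
    """Mean completion time in days over (created, resolved) date pairs."""
    fmt = "%Y-%m-%d"
    deltas = [(datetime.strptime(done, fmt) - datetime.strptime(opened, fmt)).days
              for opened, done in pairs]
    return sum(deltas) / len(deltas)

print(f"mean completion time: {avg_days(tasks):.1f} days")
```

Note how a single multi-month outlier dominates the mean; a median may be a fairer summary.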
Mitchell Rivett, Tyler Potter
I've got a Dropbox link with some Excel data that I can import into R; however, this covers only product interest and a revenue approximation, since it's difficult to determine patch size and gather all relevant patch data.
I've made some graphs of the average sale prices of Apple, Google, and Microsoft stocks over time. They're in the Imgur album here; I plan to see whether the prices correlate with either market share or total sales.
I feel the initial project question is too large and does not have enough obtainable data available. The question needs further thought.
Conduct an experiment based on your research question from last week:
Jorin Weatherston, Brandon Leech
My assertion is that a change in programming paradigm from an older paradigm/language/library to a more modern paradigm/language/library leads to:
I'm analyzing GCC to attempt to prove this.
GCC recently transitioned from being written in C to C++, thereby making the following transitions:
I was informed by this article, which suggests, in an admittedly anecdotal way, that GCC has benefited from both the new paradigm and the new library (specifically C++'s comprehensive library of generic collections). The article also gave me a start date for the transitional period: June 18, 2008, when this presentation was given at one of GCC's summits.
I deemed the end of this transitional period to be when GCC's main trunk stopped building under pure C compilers, which occurred at the time of this post to their mailing list (i.e. August 2nd, 2012).
I wrote a simple tool in Python which aggregates the output of `git --no-pager log --shortstat` to obtain a high-level overview of repo activity. This yields the following results:
After transition: Commits: 18088 Insertions: 4050566 Deletions: 2238396 Duration: 903 days
During transition: Commits: 30170 Insertions: 8483389 Deletions: 5495633 Duration: 1506 days
Before transition: Commits: 88050 Insertions: 22460802 Deletions: 12568041 Duration: 7147 days
This yields the following rates:
After transition: Commits/day: 20 Insertions/day: 4486 Deletions/day: 2479 Growth/day: 2007
During transition: Commits/day: 20 Insertions/day: 5633 Deletions/day: 3649 Growth/day: 1984
Before transition: Commits/day: 12 Insertions/day: 3143 Deletions/day: 1759 Growth/day: 1384
Where "growth" is defined as deletions subtracted from insertions.
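The aggregation the tool performs can be sketched as follows (the log excerpt is illustrative, not real GCC history):

```python
import re

# Sample `git log --shortstat` output (shape only; real logs also
# interleave author lines and commit messages).
LOG = """\
commit aaaa
 3 files changed, 120 insertions(+), 40 deletions(-)
commit bbbb
 1 file changed, 10 insertions(+)
commit cccc
 2 files changed, 5 deletions(-)
"""

def aggregate(log_text):
    """Return (commits, insertions, deletions, growth) over the log."""
    commits = ins = dels = 0
    for line in log_text.splitlines():
        if line.startswith("commit "):
            commits += 1
        m = re.search(r"(\d+) insertions?\(\+\)", line)
        if m:
            ins += int(m.group(1))
        m = re.search(r"(\d+) deletions?\(-\)", line)
        if m:
            dels += int(m.group(1))
    return commits, ins, dels, ins - dels

print(aggregate(LOG))  # commits, insertions, deletions, growth
```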
We note that because "before transition" includes periods of relatively lower activity, its commits per day are 40% lower. Adjusting the before-transition rates to the same 20 commits/day gives:
Before transition (adjusted): Commits/day: 20 Insertions/day: 5238 Deletions/day: 2932 Growth/day: 2306
We observe from the data the following:
Read this article for insights on data collection for open source projects.
Since the article link isn't working, here is a link to the projects website which includes some datasets and metrics: http://maisqual.squoring.com/wiki/index.php/Data_sets and http://maisqual.squoring.com/wiki/index.php/Maisqual_Projects/Ant
Post your findings for each step as a response to this issue and update your project repository README with updated metrics.