woc-hack / diversity-innovation

Potentially useful data in WoC #1

Open audrism opened 3 years ago

audrism commented 3 years ago

The 128 tables b2cPtaPkgRPY.*.s in /da0_data/play/PYthruMaps/ have the API import data for all versions of all Python files.

zcat /da0_data/play/PYthruMaps/b2cPtaPkgRPY.24.s | head -1
180000048a7ec70ed3e798f936b3cf63a696630e;3e0d624850d30ac75eb1bcfbf8f71b827a44d464;ppp0_openbroadcast;1386864075;ohrstrom <jonas@anorg.net>;django.forms;postman.utils.WRAP_WIDTH;django.db.transaction;__future__.unicode_literals;django.conf.settings;postman.models.Message;django.utils.translation.ugettext

the format is

blob;commit;deforked project;time;author;pkg1;...;pkgn
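
A minimal sketch of reading one of these tables in Python, assuming the fields are separated by `;` exactly as in the format line above, that the author field never contains a `;`, and that the time field is an integer (it looks like a Unix timestamp in the sample). The names `ImportRecord`, `parse_record` and `read_records` are made up for illustration and are not part of WoC.

```python
import gzip
from typing import Iterator, List, NamedTuple


class ImportRecord(NamedTuple):
    """One line of a b2cPtaPkgRPY table (hypothetical container type)."""
    blob: str
    commit: str
    project: str       # deforked project name
    time: int          # assumed to be an integer timestamp
    author: str
    packages: List[str]


def parse_record(line: str) -> ImportRecord:
    """Split a 'blob;commit;project;time;author;pkg1;...;pkgn' line."""
    fields = line.rstrip("\n").split(";")
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 fields, got {len(fields)}")
    blob, commit, project, time, author = fields[:5]
    # Everything after the fifth field is an imported package/API name.
    return ImportRecord(blob, commit, project, int(time), author, fields[5:])


def read_records(path: str) -> Iterator[ImportRecord]:
    """Stream records from one gzip-compressed table (the thread uses zcat)."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            yield parse_record(line)
```

For example, `next(read_records("/da0_data/play/PYthruMaps/b2cPtaPkgRPY.24.s"))` should correspond to the `head -1` output shown above, provided the assumptions hold.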

audrism commented 3 years ago

As discussed, repo to package mappings are quite unreliable in PyPI, so I created them by parsing all versions of setup.py and setup.cfg files: /da0_data/play/PYthruMaps/PkgName2PFullS.s and /da0_data/play/PYthruMaps/P2PkgNameFullS.s

  1. There are still a few package names that cannot be resolved by parsing alone (resolving them would require running the scripts), such as names given as variables or function calls

  2. The package may be implemented in multiple places: while P takes care of the forks, there are still often multiple repos that implement the same package, for example if the repo does not rely on PyPI and copies/version-controls external code. Not sure how many such instances there are, but these could be identified by a) an unusual number of packages they implement, b) low centrality (in terms of, e.g., authors shared with other repos)
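
Heuristic a) could be sketched roughly as follows. This is only an illustration: the on-disk layout of P2PkgNameFullS.s is not described in this thread, so the function takes already-parsed (project, package) pairs, and both the name `flag_multi_package_projects` and the "more than `factor` times the median" cutoff are assumptions rather than anything WoC prescribes.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, Set, Tuple


def flag_multi_package_projects(
    pairs: Iterable[Tuple[str, str]], factor: float = 10
) -> Dict[str, int]:
    """Return projects that implement unusually many distinct packages.

    'Unusual' is an assumed rule of thumb: more than `factor` times the
    median number of packages per project.
    """
    packages_by_project: Dict[str, Set[str]] = defaultdict(set)
    for project, package in pairs:
        packages_by_project[project].add(package)
    if not packages_by_project:
        return {}
    typical = median(len(pkgs) for pkgs in packages_by_project.values())
    cutoff = typical * factor
    return {
        project: len(pkgs)
        for project, pkgs in packages_by_project.items()
        if len(pkgs) > cutoff
    }
```

Projects flagged this way would still need a manual look, since a large monorepo can legitimately implement many packages.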

SAMFYB commented 3 years ago

Hi Audris, thank you for the help! In addition to Python projects, we are now also running co-occurrence on JS projects from these tables: /da0_data/play/JSthruMaps/b2cPtaPkgJJS.*.gz.
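
For readers unfamiliar with the term, one way to count co-occurrence is sketched below. The thread does not say how co-occurrence is defined in this project, so this assumes the simplest reading: two packages co-occur when they are imported by the same file version (one table row). The function name `cooccurrence` is hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Tuple


def cooccurrence(package_lists: Iterable[List[str]]) -> Counter:
    """Count how often each unordered pair of packages appears in one row.

    Each element of `package_lists` is the pkg1..pkgn part of one record.
    Duplicates within a row are counted once, and pairs are stored in
    sorted order so (a, b) and (b, a) share a single key.
    """
    counts: Counter = Counter()
    for packages in package_lists:
        unique = sorted(set(packages))
        for pair in combinations(unique, 2):
            counts[pair] += 1
    return counts
```

The number of pairs grows quadratically with the number of imports per row, which is worth keeping in mind when estimating the storage needed for the output.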

SAMFYB commented 3 years ago

How much data can we store on the server?

SAMFYB commented 3 years ago

Also, is it the case that for these tables, entries from the same project will only appear in one of the tables, not multiple?

audrism commented 3 years ago

SAMFYB commented 3 years ago

Thank you! From running the script on a small sample, we estimate the storage we need is just about 50G.

SAMFYB commented 3 years ago

@audrism Hi Audris, just a heads-up: we are currently using ~720G of disk space on the server. It is unlikely we will use much more than that.