Open audrism opened 4 years ago
As discussed, repo-to-package mappings are quite unreliable in PyPI, so I created them by parsing all versions of setup.py and setup.cfg files: /da0_data/play/PYthruMaps/PkgName2PFullS.s /da0_data/play/PYthruMaps/P2PkgNameFullS.s
There are still a few package names that cannot be resolved by parsing alone (we would need to run the scripts), such as names set through variables or function calls.
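To illustrate why some names cannot be resolved by parsing, here is a minimal sketch of static name extraction. It is not the script used to build the maps above; the function names are hypothetical, and it only handles a literal `name` in `[metadata]` of setup.cfg or in a `setup(...)` call.

```python
import ast
import configparser


def name_from_setup_cfg(text):
    """Return the [metadata] name from setup.cfg text, or None if absent."""
    parser = configparser.ConfigParser(interpolation=None)
    try:
        parser.read_string(text)
    except configparser.Error:
        return None
    return parser.get("metadata", "name", fallback=None)


def name_from_setup_py(source):
    """Return the literal name= passed to setup(), or None if unresolved."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # e.g. old Python 2 setup.py files that Python 3 cannot parse
        return None
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Match both setup(...) and setuptools.setup(...)
        called = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if called != "setup":
            continue
        for kw in node.keywords:
            if kw.arg == "name":
                value = kw.value
                if isinstance(value, ast.Constant) and isinstance(value.value, str):
                    return value.value
                # name=SOME_VARIABLE or name=get_name(): needs execution
                return None
    return None
```

A call such as `setup(name=NAME)` returns `None` here, which is the kind of case that would require actually running the script.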
The package may be implemented in multiple places: while P takes care of the forks, there are still often multiple repos that implement the same package, for example if the repo does not rely on PyPI and copies/version-controls external code. Not sure how many such instances there are, but these could be identified by a) an unusual number of packages they implement, b) low centrality (in terms of, e.g., authors shared with other repos).
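Criterion (a) could be sketched roughly as follows. This assumes the project-to-package map has already been read into (project, package) pairs; the on-disk layout of the .s files is not assumed here, and the function name and the "multiple of the median" threshold are hypothetical choices, not something settled in this thread.

```python
from collections import defaultdict
from statistics import median


def flag_many_package_projects(pairs, factor=10):
    """Return {project: count} for projects with an unusually high package count.

    "Unusual" is taken to mean more than `factor` times the median number
    of distinct packages per project (an assumed heuristic).
    """
    packages = defaultdict(set)
    for project, package in pairs:
        packages[project].add(package)
    if not packages:
        return {}
    counts = {project: len(names) for project, names in packages.items()}
    threshold = factor * median(counts.values())
    return {project: n for project, n in counts.items() if n > threshold}
```

Projects flagged this way would still need a manual look or the centrality check in (b), since a high count alone does not prove that the code was copied.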
Hi Audris, Thank you for the help! In addition to Python projects, we are now also running co-occurrence on JS projects from these tables: /da0_data/play/JSthruMaps/b2cPtaPkgJJS.*.gz
How much data can we store on the server?
Also, is it the case that for these tables, entries from the same project will only appear in one of the tables, not multiple?
Version J is more than two years old; perhaps use the more recent version R: /da0_data/play/JSthruMaps/b2cPtaPkgRJS.0.s
Storage: I created a folder on da0 where you can store project data: /data/play/diversity-innovation
Please let me know if you plan to use more than 1TB of disk space so that I can arrange it.
Tables: tables are grouped by blob, so projects will be distributed over all tables. You might want to use PtaPkgRJS.*.s if you want to group by project.
Thank you! From running the script on a small sample, we estimate the storage we need is just about 50G.
@audrism Hi Audris, just a heads-up, we are currently using ~720G of disk space on the server. It is unlikely we will use too much more than that.
The 128 tables b2cPtaPkgRPY.*.s in /da0_data/play/PYthruMaps/ have the API import data for all versions of all Python files.
The format is