marcelo-g-simas closed this 1 month ago
Gizem in the meeting said she wants...
Note: Look into the GraphQL syntax for getting the length of a list object, such as labels above.
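On the list-length question: GitHub's connection types (including the `labels` connection on `Repository`) expose a `totalCount` field, so the length can be requested directly without paging through the nodes. A minimal Python sketch follows; the fragment text, the `label_count` helper, and the sample response are illustrative and are not taken from the project's code.

```python
# Hypothetical fragment: ask the connection for its size via totalCount.
LABEL_COUNT_FRAGMENT = """
... on Repository {
  nameWithOwner
  labels { totalCount }
}
"""


def label_count(repo_node: dict) -> int:
    """Return the number of labels reported for one repository node."""
    labels = repo_node.get("labels") or {}
    return labels.get("totalCount", 0)


# Illustrative response shape, not real data.
sample = {"nameWithOwner": "owner/name", "labels": {"totalCount": 9}}
print(label_count(sample))
```

The same pattern is what the query already uses for `watchers`, `releases`, and `issues`.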
Thoughts from the call:
Here's a working query that gets most of these new items:
fragment A on SearchResultItemConnection {
  pageInfo { endCursor hasNextPage }
  edges {
    node {
      ... on Repository {
        id
        createdAt
        nameWithOwner
        description
        primaryLanguage { name }
        languages(first: 5) {
          nodes { name }
        }
        defaultBranchRef {
          id
          target {
            ... on Commit {
              history(until: "2024-01-01T00:00:00Z") { totalCount }
            }
          }
        }
        labels(first: 10) {
          edges {
            node { name }
          }
        }
        forkCount
        forkingAllowed
        hasIssuesEnabled
        hasDiscussionsEnabled
        hasWikiEnabled
        isInOrganization
        stargazerCount
        url
        watchers { totalCount }
        releases { totalCount }
        issues { totalCount }
      }
    }
  }
}
query Search() {
  _1: search(query: "is:public fork:false mirror:false archived:false license:LGPL-3.0 created:2023-12-29T00:00:00Z..2024-01-01T00:00:00Z", type: REPOSITORY, first: 1) { ...A }
}
Still missing from @jeremycorry's comments:
I could not find sources for these data, so I think the query is done and we can move on to changing the deserialization code and the target DBs.
@jeremycorry, this is a helpful resource: https://medium.com/@tharshita13/github-graphql-api-cheatsheet-38e916fe76a3
Did a bit more work and got dependencies to download from the API. The result is fairly nested, so now I need to figure out how to save this:
"dependencyGraphManifests": {
  "totalCount": 2,
  "edges": [
    {
      "node": {
        "dependencies": {
          "edges": [
            { "node": { "packageName": "breathe", "repository": { "nameWithOwner": "breathe-doc/breathe" } } },
            { "node": { "packageName": "myst-parser", "repository": { "nameWithOwner": "executablebooks/MyST-Parser" } } },
            { "node": { "packageName": "sphinx", "repository": { "nameWithOwner": "sphinx-doc/sphinx" } } },
            { "node": { "packageName": "sphinx-rtd-theme", "repository": { "nameWithOwner": "readthedocs/sphinx_rtd_theme" } } },
            { "node": { "packageName": "tomli", "repository": { "nameWithOwner": "hukkin/tomli" } } },
            { "node": { "packageName": "tomli-w", "repository": null } },
            { "node": { "packageName": "voluptuous", "repository": { "nameWithOwner": "alecthomas/voluptuous" } } }
          ]
        }
      }
    },
    {
      "node": {
        "dependencies": {
          "edges": [
            { "node": { "packageName": "hyper", "repository": { "nameWithOwner": "hyperium/hyper" } } },
            { "node": { "packageName": "tokio", "repository": { "nameWithOwner": "tokio-rs/tokio" } } }
          ]
        }
      }
    }
  ]
}
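One option for saving this is to collapse the two levels of edges into a flat list of strings, which would fit a single array column. The Python sketch below assumes that shape is acceptable and that falling back to `packageName` when `repository` is null is the desired behavior; `flatten_dependencies` is a hypothetical helper, not the project's code.

```python
from typing import List, Optional


def flatten_dependencies(manifests: Optional[dict]) -> List[str]:
    """Collapse dependencyGraphManifests into a sorted, de-duplicated list.

    Each entry is the dependency's repository nameWithOwner when GitHub
    resolved one, otherwise the bare packageName (assumed fallback).
    """
    names = set()
    for manifest_edge in (manifests or {}).get("edges") or []:
        dependencies = (manifest_edge.get("node") or {}).get("dependencies") or {}
        for dep_edge in dependencies.get("edges") or []:
            dep = dep_edge.get("node") or {}
            repo = dep.get("repository")
            name = repo["nameWithOwner"] if repo else dep.get("packageName")
            if name:
                names.add(name)
    return sorted(names)


# Trimmed from the response above: one unresolved and one resolved dependency.
sample = {
    "totalCount": 2,
    "edges": [
        {"node": {"dependencies": {"edges": [
            {"node": {"packageName": "tomli-w", "repository": None}},
        ]}}},
        {"node": {"dependencies": {"edges": [
            {"node": {"packageName": "hyper",
                      "repository": {"nameWithOwner": "hyperium/hyper"}}},
        ]}}},
    ],
}
print(flatten_dependencies(sample))  # ['hyperium/hyper', 'tomli-w']
```

Flattening loses which manifest each dependency came from; if that matters, a separate table keyed by repository and manifest would be the alternative.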
Here's a revised query that includes all of the new fields we will need to add to repos:
fragment A on SearchResultItemConnection {
  pageInfo { endCursor hasNextPage }
  edges {
    node {
      ... on Repository {
        id
        createdAt
        nameWithOwner
        description
        primaryLanguage { name }
        languages(first: 5) {
          nodes { name }
        }
        defaultBranchRef {
          id
          target {
            ... on Commit {
              history(until: "2024-01-01T00:00:00Z") { totalCount }
            }
          }
        }
        repositoryTopics(first: 10) {
          edges {
            node {
              topic { name }
            }
          }
        }
        forkCount
        isInOrganization
        stargazerCount
        homepageUrl
        dependencyGraphManifests(first: 10) {
          totalCount
          edges {
            node {
              dependencies {
                edges {
                  node {
                    packageName
                    repository { nameWithOwner }
                  }
                }
              }
            }
          }
        }
        watchers { totalCount }
        releases { totalCount }
        issues { totalCount }
      }
    }
  }
}
query Search() {
  _1: search(query: "is:public fork:false mirror:false archived:false license:LGPL-3.0 created:2023-12-29T00:00:00Z..2024-01-01T00:00:00Z", type: REPOSITORY, first: 2) { ...A }
}
We now have code that queries and parses the following new attributes for repos:
- topics::text[]
- forks::integer
- isinorganization::boolean
- homepageurl::text
- dependencies::text[]
- stargazers::integer
- watchers::integer
- releases::integer
- issues::integer
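For reference, the mapping from one `Repository` node in the search response to the attributes listed above could look roughly like this. It is a Python sketch, not the project's parser: `parse_repo` is a hypothetical name, and the choice to store each dependency as `nameWithOwner` (falling back to `packageName`) is an assumption.

```python
from typing import Any, Dict


def parse_repo(node: Dict[str, Any]) -> Dict[str, Any]:
    """Map one Repository node from the search response to the new columns."""

    def total(field: str) -> int:
        # Connections such as watchers/releases/issues carry totalCount.
        return (node.get(field) or {}).get("totalCount", 0)

    topic_edges = (node.get("repositoryTopics") or {}).get("edges") or []
    manifests = (node.get("dependencyGraphManifests") or {}).get("edges") or []
    # Assumed representation: owner/name when resolved, else the package name.
    dependencies = sorted({
        (dep["node"].get("repository") or {}).get("nameWithOwner")
        or dep["node"]["packageName"]
        for manifest in manifests
        for dep in (manifest["node"].get("dependencies") or {}).get("edges") or []
    })
    return {
        "topics": [edge["node"]["topic"]["name"] for edge in topic_edges],
        "forks": node.get("forkCount"),
        "isinorganization": node.get("isInOrganization"),
        "homepageurl": node.get("homepageUrl"),
        "dependencies": dependencies,
        "stargazers": node.get("stargazerCount"),
        "watchers": total("watchers"),
        "releases": total("releases"),
        "issues": total("issues"),
    }
```

Every lookup tolerates a missing or null field, since the API returns null for unresolved objects such as a dependency's `repository`.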
Deployed this, and some queries are now failing with a 502 from GitHub:
Error: HTTP.Exceptions.StatusError(502, "POST", "/graphql", HTTP.Messages.Response:
From worker 7: │ """
From worker 7: │ HTTP/1.1 502 Bad Gateway
From worker 7: │ Date: Wed, 18 Sep 2024 18:13:39 GMT
From worker 7: │ Content-Type: application/json
From worker 7: │ Access-Control-Expose-Headers: ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset
From worker 7: │ Access-Control-Allow-Origin: *
From worker 7: │ Vary: Accept-Encoding, Accept, X-Requested-With
From worker 7: │ Transfer-Encoding: chunked
From worker 7: │ X-GitHub-Request-Id: B0D8:39A2D6:62C5F1:BE194C:66EB183E
From worker 7: │ Server: github.com
From worker 7: │
From worker 7: │ {
From worker 7: │ "data": null,
From worker 7: │ "errors":[
From worker 7: │ {
From worker 7: │ "message":"Something went wrong while executing your query. This may be the result of a timeout, or it could be a GitHub bug. Please include `B0D8:39A2D6:62C5F1:BE194C:66EB183E` when reporting this issue."
From worker 7: │ }
From worker 7: │ ]
From worker 7: │ }
From worker 7: │ """)
From worker 7: └ @ GHOST ~/Git/GHOST.jl/src/01_BaseUtils.jl:107
I think that the added attributes are making the queries too big, and we will need to recompute them using smaller time windows.
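As a rough illustration of what smaller time windows means for the search string, the Python sketch below splits one `created:` range into equal sub-ranges. The helper names and the even split are assumptions for illustration; this is not how GHOST.jl computes its intervals.

```python
from datetime import datetime, timezone
from typing import List, Tuple

Window = Tuple[datetime, datetime]


def split_window(start: datetime, stop: datetime, parts: int) -> List[Window]:
    """Split the span from start to stop into `parts` contiguous equal windows."""
    if parts < 1 or stop <= start:
        raise ValueError("need parts >= 1 and stop > start")
    step = (stop - start) / parts
    bounds = [start + i * step for i in range(parts)] + [stop]
    return list(zip(bounds[:-1], bounds[1:]))


def created_qualifier(window: Window) -> str:
    """Format a window as a GitHub search `created:` range qualifier."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return f"created:{window[0].strftime(fmt)}..{window[1].strftime(fmt)}"


start = datetime(2023, 12, 29, tzinfo=timezone.utc)
stop = datetime(2024, 1, 1, tzinfo=timezone.utc)
for window in split_window(start, stop, parts=3):
    print(created_qualifier(window))
```

Adjacent windows share their boundary timestamp, so a repository created exactly on a boundary could show up in both; deduplicating on `id` downstream would cover that case.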
Smaller time windows did not change the outcome, so I am going to drop the dependency tree from the query.
I am still seeing a high rate of failure on queries, so we may need to drop even more of the new attributes and consider doing this in multiple passes.
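One shape the multi-pass idea could take: keep the search query light and store each repository's `id`, then re-fetch the heavy attributes in small batches through the API's root `nodes(ids: ...)` field. The Python sketch below only builds the second-pass query text; the batch size, the helper names, and the choice of which fields to defer are assumptions, not a tested fix for the 502s.

```python
import json
from typing import Iterator, List

# Heavy fields deferred to a second pass (taken from the revised query).
SECOND_PASS_FIELDS = """
... on Repository {
  id
  dependencyGraphManifests(first: 10) {
    totalCount
    edges { node { dependencies { edges { node {
      packageName
      repository { nameWithOwner }
    } } } } }
  }
}
"""


def batches(ids: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` ids."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for i in range(0, len(ids), size):
        yield ids[i:i + size]


def second_pass_query(ids: List[str]) -> str:
    """Build a query that re-fetches the given node ids with the heavy fields."""
    # json.dumps yields a double-quoted list literal, which is valid GraphQL here.
    id_list = json.dumps(ids)
    return f"query SecondPass {{ nodes(ids: {id_list}) {{ {SECOND_PASS_FIELDS} }} }}"


# Placeholder ids for illustration only.
for batch in batches(["ID_1", "ID_2", "ID_3"], size=2):
    print(second_pass_query(batch))
```

A failed batch can then be retried or halved on its own without repeating the search pass.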
Going to declare this resolved as we move to wrap up repo collection over the next week.
Given that we are about to collect data for all repos, there is an opportunity to request additional data from the GH GraphQL API. Potential candidates for inclusion are:
Let's use this issue to discuss and decide which ones to add, and to document the changes to the code and DB structure needed to support them.