westat-oss / GHOST.jl

Fork of https://github.com/uva-bi-sdad/GHOST.jl.
ISC License

Add Attributes to Harvested Repo Data #1

Closed marcelo-g-simas closed 1 month ago

marcelo-g-simas commented 2 months ago

Given that we are about to collect data for all repos, there is an opportunity to request additional data from the GitHub GraphQL API. Potential candidates for inclusion are:

Let's use this issue to discuss and decide which ones to add and document the changes to code and db structure to support this.

jeremycorry commented 2 months ago

Gizem in the meeting said she wants...

Note: Look into GraphQL syntax to find the length of a list object like labels above.
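
On the list-length question: GitHub connection fields such as `labels` expose a `totalCount`, so the length can be requested without fetching the items themselves. A minimal sketch, assuming the usual `{"data": {...}}` response envelope. The `connection_count` helper and the sample response (including its count) are made up for illustration:

```python
# Ask only for the size of the list, not for its items.
LABEL_COUNT_QUERY = """
query {
  repository(owner: "westat-oss", name: "GHOST.jl") {
    labels { totalCount }
  }
}
"""


def connection_count(repo_node, field):
    """Return totalCount for a connection field, or None if the field is absent."""
    connection = repo_node.get(field)
    if connection is None:
        return None
    return connection.get("totalCount")


# Hypothetical response body, only to show the shape (the value 9 is invented).
sample = {"data": {"repository": {"labels": {"totalCount": 9}}}}
label_count = connection_count(sample["data"]["repository"], "labels")
print(label_count)
```

The same pattern applies to any other connection on the repository node.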

marcelo-g-simas commented 2 months ago

Thoughts from the call:

marcelo-g-simas commented 2 months ago

Here's a working query that gets most of these new items:

fragment A on SearchResultItemConnection {
  pageInfo { endCursor hasNextPage }
  edges {
    node {
      ... on Repository {
        id
        createdAt
        nameWithOwner
        description
        primaryLanguage { name }
        languages(first: 5) {
          nodes { name }
        }
        defaultBranchRef {
          id
          target {
            ... on Commit {
              history(until: "2024-01-01T00:00:00Z") { totalCount }
            }
          }
        }
        labels(first: 10) {
          edges {
            node { name }
          }
        }
        forkCount
        forkingAllowed
        hasIssuesEnabled
        hasDiscussionsEnabled
        hasWikiEnabled
        isInOrganization
        stargazerCount
        url
        watchers { totalCount }
        releases { totalCount }
        issues { totalCount }
      }
    }
  }
}

query Search(){_1:search(query:"is:public fork:false mirror:false archived:false license:LGPL-3.0 created:2023-12-29T00:00:00Z..2024-01-01T00:00:00Z",type:REPOSITORY, first:1){...A}}

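
For reference, a sketch of how one page of this response could be read on the client side, using the `_1` alias and the `pageInfo` fields from the query above. The `read_page` name and the sample payload are hypothetical, and this is not the code in GHOST.jl:

```python
def read_page(response, alias="_1"):
    """Split one search response into repository nodes and the next cursor."""
    connection = response["data"][alias]
    nodes = [edge["node"] for edge in connection["edges"]]
    page_info = connection["pageInfo"]
    # Only hand back a cursor when there is another page to fetch.
    next_cursor = page_info["endCursor"] if page_info["hasNextPage"] else None
    return nodes, next_cursor


# Made-up payload with the same shape as the query result.
sample = {
    "data": {
        "_1": {
            "pageInfo": {"endCursor": "CURSOR_1", "hasNextPage": True},
            "edges": [{"node": {"nameWithOwner": "owner/repo", "forkCount": 0}}],
        }
    }
}
repos, cursor = read_page(sample)
print(len(repos), cursor)
```

The returned cursor would go into the `after:` argument of the next `search` call, and `None` signals that the window is exhausted.
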
marcelo-g-simas commented 2 months ago

Still missing from @jeremycorry's comments:

marcelo-g-simas commented 2 months ago

I could not find sources for these data, so I think the query is done. The next step is to change the deserialization code and the target DBs.

marcelo-g-simas commented 2 months ago

@jeremycorry, this is a helpful resource: https://medium.com/@tharshita13/github-graphql-api-cheatsheet-38e916fe76a3

marcelo-g-simas commented 2 months ago

Did a bit more work and got dependencies to download from the API. The result is fairly nested, so now I need to figure out how to save this:

"dependencyGraphManifests": {
              "totalCount": 2,
              "edges": [
                {
                  "node": {
                    "dependencies": {
                      "edges": [
                        {
                          "node": {
                            "packageName": "breathe",
                            "repository": {
                              "nameWithOwner": "breathe-doc/breathe"
                            }
                          }
                        },
                        {
                          "node": {
                            "packageName": "myst-parser",
                            "repository": {
                              "nameWithOwner": "executablebooks/MyST-Parser"
                            }
                          }
                        },
                        {
                          "node": {
                            "packageName": "sphinx",
                            "repository": {
                              "nameWithOwner": "sphinx-doc/sphinx"
                            }
                          }
                        },
                        {
                          "node": {
                            "packageName": "sphinx-rtd-theme",
                            "repository": {
                              "nameWithOwner": "readthedocs/sphinx_rtd_theme"
                            }
                          }
                        },
                        {
                          "node": {
                            "packageName": "tomli",
                            "repository": {
                              "nameWithOwner": "hukkin/tomli"
                            }
                          }
                        },
                        {
                          "node": {
                            "packageName": "tomli-w",
                            "repository": null
                          }
                        },
                        {
                          "node": {
                            "packageName": "voluptuous",
                            "repository": {
                              "nameWithOwner": "alecthomas/voluptuous"
                            }
                          }
                        }
                      ]
                    }
                  }
                },
                {
                  "node": {
                    "dependencies": {
                      "edges": [
                        {
                          "node": {
                            "packageName": "hyper",
                            "repository": {
                              "nameWithOwner": "hyperium/hyper"
                            }
                          }
                        },
                        {
                          "node": {
                            "packageName": "tokio",
                            "repository": {
                              "nameWithOwner": "tokio-rs/tokio"
                            }
                          }
                        }
                      ]
                    }
                  }
                }
              ]
            }
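
One possible way to save this is to flatten the nesting into (package, repository) rows. This is only a sketch: the `flatten_dependencies` name and the row layout are assumptions, not a decision on the db structure. It keeps `None` where `repository` is null, as it is for `tomli-w` above:

```python
def flatten_dependencies(manifests):
    """Turn the nested manifest/dependency edges into flat (package, repo) pairs."""
    rows = []
    for manifest_edge in manifests.get("edges", []):
        for dep_edge in manifest_edge["node"]["dependencies"]["edges"]:
            dep = dep_edge["node"]
            repo = dep.get("repository")  # may be null, e.g. tomli-w above
            rows.append((dep["packageName"], repo["nameWithOwner"] if repo else None))
    return rows


# Trimmed copy of the response above (three of the nine dependencies).
sample = {
    "totalCount": 2,
    "edges": [
        {"node": {"dependencies": {"edges": [
            {"node": {"packageName": "tomli",
                      "repository": {"nameWithOwner": "hukkin/tomli"}}},
            {"node": {"packageName": "tomli-w", "repository": None}},
        ]}}},
        {"node": {"dependencies": {"edges": [
            {"node": {"packageName": "hyper",
                      "repository": {"nameWithOwner": "hyperium/hyper"}}},
        ]}}},
    ],
}
rows = flatten_dependencies(sample)
print(rows)
```

Keeping both values per row means the choice of what goes into a single text column can be made later.
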
marcelo-g-simas commented 2 months ago

Here's a revised query that includes all of the new fields we will need to add to repos:

fragment A on SearchResultItemConnection {
  pageInfo { endCursor hasNextPage }
  edges {
    node {
      ... on Repository {
        id
        createdAt
        nameWithOwner
        description
        primaryLanguage { name }
        languages(first: 5) {
          nodes { name }
        }
        defaultBranchRef {
          id
          target {
            ... on Commit {
              history(until: "2024-01-01T00:00:00Z") { totalCount }
            }
          }
        }
        repositoryTopics(first: 10) {
          edges {
            node {
              topic { name }
            }
          }
        }
        forkCount
        isInOrganization
        stargazerCount
        homepageUrl
        dependencyGraphManifests(first: 10) {
          totalCount
          edges {
            node {
              dependencies {
                edges {
                  node {
                    packageName
                    repository { nameWithOwner }
                  }
                }
              }
            }
          }
        }
        watchers { totalCount }
        releases { totalCount }
        issues { totalCount }
      }
    }
  }
}

query Search(){_1:search(query:"is:public fork:false mirror:false archived:false license:LGPL-3.0 created:2023-12-29T00:00:00Z..2024-01-01T00:00:00Z",type:REPOSITORY, first:2){...A}}

marcelo-g-simas commented 2 months ago

We now have code that queries and parses the following new attributes for repos:

    - topics::text[]
    - forks::integer
    - isinorganization::boolean
    - homepageurl::text
    - dependencies::text[]
    - stargazers::integer
    - watchers::integer
    - releases::integer
    - issues::integer
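
To make the mapping concrete, here is an illustrative sketch of how a `Repository` node from the revised query could be turned into these nine attributes. It is not the GHOST.jl code. `RepoAttributes`, `parse_repo` and the sample node are hypothetical, and storing `packageName` in `dependencies` is an assumption:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RepoAttributes:
    topics: List[str]
    forks: int
    isinorganization: bool
    homepageurl: Optional[str]
    dependencies: List[str]
    stargazers: int
    watchers: int
    releases: int
    issues: int


def parse_repo(node):
    """Map one Repository node from the revised query onto the new columns."""
    manifests = node.get("dependencyGraphManifests") or {}
    return RepoAttributes(
        topics=[e["node"]["topic"]["name"] for e in node["repositoryTopics"]["edges"]],
        forks=node["forkCount"],
        isinorganization=node["isInOrganization"],
        homepageurl=node.get("homepageUrl"),
        # Assumption: keep the package name as the text value.
        dependencies=[
            d["node"]["packageName"]
            for m in manifests.get("edges", [])
            for d in m["node"]["dependencies"]["edges"]
        ],
        stargazers=node["stargazerCount"],
        watchers=node["watchers"]["totalCount"],
        releases=node["releases"]["totalCount"],
        issues=node["issues"]["totalCount"],
    )


# Made-up node with the same shape as the query result.
sample_node = {
    "repositoryTopics": {"edges": [{"node": {"topic": {"name": "docs"}}}]},
    "forkCount": 2,
    "isInOrganization": True,
    "homepageUrl": None,
    "dependencyGraphManifests": {"totalCount": 1, "edges": [
        {"node": {"dependencies": {"edges": [
            {"node": {"packageName": "sphinx",
                      "repository": {"nameWithOwner": "sphinx-doc/sphinx"}}},
        ]}}},
    ]},
    "stargazerCount": 5,
    "watchers": {"totalCount": 1},
    "releases": {"totalCount": 0},
    "issues": {"totalCount": 3},
}
record = parse_repo(sample_node)
print(record)
```
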
marcelo-g-simas commented 1 month ago

Deployed this, and GitHub is returning 502 errors on some queries:

Error: HTTP.Exceptions.StatusError(502, "POST", "/graphql", HTTP.Messages.Response:
      From worker 7:    │ """
      From worker 7:    │ HTTP/1.1 502 Bad Gateway
      From worker 7:    │ Date: Wed, 18 Sep 2024 18:13:39 GMT
      From worker 7:    │ Content-Type: application/json
      From worker 7:    │ Access-Control-Expose-Headers: ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset
      From worker 7:    │ Access-Control-Allow-Origin: *
      From worker 7:    │ Vary: Accept-Encoding, Accept, X-Requested-With
      From worker 7:    │ Transfer-Encoding: chunked
      From worker 7:    │ X-GitHub-Request-Id: B0D8:39A2D6:62C5F1:BE194C:66EB183E
      From worker 7:    │ Server: github.com
      From worker 7:    │ 
      From worker 7:    │ {
      From worker 7:    │    "data": null,
      From worker 7:    │    "errors":[
      From worker 7:    │       {
      From worker 7:    │          "message":"Something went wrong while executing your query. This may be the result of a timeout, or it could be a GitHub bug. Please include `B0D8:39A2D6:62C5F1:BE194C:66EB183E` when reporting this issue."
      From worker 7:    │       }
      From worker 7:    │    ]
      From worker 7:    │ }
      From worker 7:    │ """)
      From worker 7:    └ @ GHOST ~/Git/GHOST.jl/src/01_BaseUtils.jl:107
marcelo-g-simas commented 1 month ago

I think that the added attributes are making the queries too big, and we will need to recompute them using smaller time windows.
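
A rough sketch of what recomputing with smaller windows could look like, using the `created:` range format from the search query. The function names and the number of parts are assumptions:

```python
from datetime import datetime


def split_window(start, end, parts):
    """Split the span from start to end into `parts` equal, contiguous windows."""
    step = (end - start) / parts
    bounds = [start + i * step for i in range(parts)] + [end]
    return list(zip(bounds[:-1], bounds[1:]))


def created_filter(start, end):
    """Format one window as the created: qualifier used in the search string."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return f"created:{start.strftime(fmt)}..{end.strftime(fmt)}"


# The three-day window from the query above, cut into one-day windows.
windows = split_window(datetime(2023, 12, 29), datetime(2024, 1, 1), 3)
filters = [created_filter(a, b) for a, b in windows]
print(filters)
```

If the `..` range turns out to be inclusive at both ends, adjacent windows share a boundary timestamp, so de-duplicating on `id` is a cheap safeguard.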

marcelo-g-simas commented 1 month ago

Smaller time windows did not change the outcome, so I am going to drop the dependency tree from the query.

marcelo-g-simas commented 1 month ago

I am still seeing a high rate of failure on queries, so we may need to drop even more of the new attributes and consider doing this in multiple passes.
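
One way a multi-pass approach could be sketched: the first pass keeps the search query light and stores each repo `id`, and a later pass requests the extra attributes by id through the `nodes(ids: ...)` root field. Which fields move to the second pass, the batch size, and the helper names are all assumptions:

```python
import json

# Assumption: these are the attributes deferred to the second pass.
SECOND_PASS_FIELDS = (
    "homepageUrl repositoryTopics(first: 10) { edges { node { topic { name } } } }"
)


def chunks(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def second_pass_query(ids):
    """Build a query that looks up the deferred attributes for a batch of node ids."""
    id_list = json.dumps(ids)  # a JSON array of strings is also a valid GraphQL list
    return (
        "query { nodes(ids: " + id_list + ") { ... on Repository { id "
        + SECOND_PASS_FIELDS + " } } }"
    )


# Placeholder ids; real ones come from the `id` field collected in the first pass.
collected_ids = ["R_1", "R_2", "R_3"]
queries = [second_pass_query(batch) for batch in chunks(collected_ids, 2)]
print(queries[0])
```

Smaller batches mean more requests but less work per request, which is the trade-off to tune against the failure rate.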

marcelo-g-simas commented 1 month ago

Going to declare this resolved as we move to wrap up repo collection over the next week.