westat-oss / GHOST.jl

Fork of https://github.com/uva-bi-sdad/GHOST.jl.
ISC License

Add Attributes to Harvested Repo Data #1

Open marcelo-g-simas opened 1 week ago

marcelo-g-simas commented 1 week ago

Given that we are about to collect data for all repos, there is an opportunity to request additional data from the GitHub GraphQL API. Potential candidates for inclusion are:

Let's use this issue to discuss and decide which ones to add and document the changes to code and db structure to support this.

jeremycorry commented 1 week ago

Gizem in the meeting said she wants...

Note: Look into GraphQL syntax to find the length of a list object, like labels above.
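
On the note above: GitHub connection types expose a `totalCount` field, so the length of a list such as labels can be requested directly instead of counting the returned nodes. A minimal Python sketch, where `LABEL_COUNT_FIELD` and `label_count` are hypothetical names and the sample response is made up for illustration:

```python
# Hypothetical names; only `labels` and `totalCount` come from the GitHub schema.
LABEL_COUNT_FIELD = "labels { totalCount }"


def label_count(repo_node: dict) -> int:
    """Return the number of labels reported by the connection's totalCount field."""
    labels = repo_node.get("labels") or {}
    return labels.get("totalCount", 0)


# Made-up response fragment shaped like a Repository node.
sample_node = {"nameWithOwner": "owner/repo", "labels": {"totalCount": 9}}
print(label_count(sample_node))
```

The same pattern should apply to any other connection field whose size is needed but whose items are not.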

marcelo-g-simas commented 1 week ago

Thoughts from the call:

marcelo-g-simas commented 2 days ago

Here's a working query that gets most of these new items:

fragment A on SearchResultItemConnection {
  pageInfo { endCursor hasNextPage }
  edges {
    node {
      ... on Repository {
        id
        createdAt
        nameWithOwner
        description
        primaryLanguage { name }
        languages(first: 5) {
          nodes { name }
        }
        defaultBranchRef {
          id
          target {
            ... on Commit {
              history(until: "2024-01-01T00:00:00Z") { totalCount }
            }
          }
        }
        labels(first: 10) {
          edges {
            node { name }
          }
        }
        forkCount
        forkingAllowed
        hasIssuesEnabled
        hasDiscussionsEnabled
        hasWikiEnabled
        isInOrganization
        stargazerCount
        url
        watchers { totalCount }
        releases { totalCount }
        issues { totalCount }
      }
    }
  }
}

query Search() {
  _1: search(
    query: "is:public fork:false mirror:false archived:false license:LGPL-3.0 created:2023-12-29T00:00:00Z..2024-01-01T00:00:00Z",
    type: REPOSITORY,
    first: 1
  ) { ...A }
}
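
The fragment requests `pageInfo { endCursor hasNextPage }`, which is what cursor pagination needs. A rough Python sketch of paging through the `_1` search alias follows. `fetch_page` is a hypothetical callable standing in for the HTTP call, passing the cursor as the search's `after` argument is an assumption about how the query would be extended, and the fake pages are invented so the loop can run offline:

```python
from typing import Callable, Iterator, Optional


def iter_repos(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield Repository nodes, following endCursor until hasNextPage is false.

    fetch_page(cursor) is assumed to return the parsed JSON response for the
    search above, with `cursor` passed as the `after` argument (None for page 1).
    """
    cursor = None
    while True:
        search = fetch_page(cursor)["data"]["_1"]
        for edge in search["edges"]:
            yield edge["node"]
        page_info = search["pageInfo"]
        if not page_info["hasNextPage"]:
            break
        cursor = page_info["endCursor"]


# Invented two-page response used only to exercise the loop.
FAKE_PAGES = {
    None: {"data": {"_1": {"pageInfo": {"endCursor": "c1", "hasNextPage": True},
                           "edges": [{"node": {"nameWithOwner": "a/one"}}]}}},
    "c1": {"data": {"_1": {"pageInfo": {"endCursor": None, "hasNextPage": False},
                           "edges": [{"node": {"nameWithOwner": "b/two"}}]}}},
}

names = [repo["nameWithOwner"] for repo in iter_repos(FAKE_PAGES.__getitem__)]
print(names)
```

Injecting `fetch_page` keeps the paging logic separate from authentication and error handling, which would live in the real request function.
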
marcelo-g-simas commented 2 days ago

Still missing from @jeremycorry's comments:

marcelo-g-simas commented 2 days ago

I could not find sources for these data, so I think the query is done. The next step is to update the deserialization code and the target databases.

marcelo-g-simas commented 2 days ago

@jeremycorry, this is a helpful resource: https://medium.com/@tharshita13/github-graphql-api-cheatsheet-38e916fe76a3

marcelo-g-simas commented 2 days ago

Did a bit more work and got dependencies to download from the API. It's a bit nested so now I need to figure out how to save this:

            "dependencyGraphManifests": {
              "totalCount": 2,
              "edges": [
                {
                  "node": {
                    "dependencies": {
                      "edges": [
                        {
                          "node": {
                            "packageName": "breathe",
                            "repository": {
                              "nameWithOwner": "breathe-doc/breathe"
                            }
                          }
                        },
                        {
                          "node": {
                            "packageName": "myst-parser",
                            "repository": {
                              "nameWithOwner": "executablebooks/MyST-Parser"
                            }
                          }
                        },
                        {
                          "node": {
                            "packageName": "sphinx",
                            "repository": {
                              "nameWithOwner": "sphinx-doc/sphinx"
                            }
                          }
                        },
                        {
                          "node": {
                            "packageName": "sphinx-rtd-theme",
                            "repository": {
                              "nameWithOwner": "readthedocs/sphinx_rtd_theme"
                            }
                          }
                        },
                        {
                          "node": {
                            "packageName": "tomli",
                            "repository": {
                              "nameWithOwner": "hukkin/tomli"
                            }
                          }
                        },
                        {
                          "node": {
                            "packageName": "tomli-w",
                            "repository": null
                          }
                        },
                        {
                          "node": {
                            "packageName": "voluptuous",
                            "repository": {
                              "nameWithOwner": "alecthomas/voluptuous"
                            }
                          }
                        }
                      ]
                    }
                  }
                },
                {
                  "node": {
                    "dependencies": {
                      "edges": [
                        {
                          "node": {
                            "packageName": "hyper",
                            "repository": {
                              "nameWithOwner": "hyperium/hyper"
                            }
                          }
                        },
                        {
                          "node": {
                            "packageName": "tokio",
                            "repository": {
                              "nameWithOwner": "tokio-rs/tokio"
                            }
                          }
                        }
                      ]
                    }
                  }
                }
              ]
            }
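
One possible way to save this is to flatten it into a single list of names per repo. The Python sketch below assumes the `repository.nameWithOwner` value is preferred and `packageName` is the fallback when `repository` is null (as for `tomli-w` above). `flatten_dependencies` is a hypothetical helper, not existing GHOST.jl code:

```python
def flatten_dependencies(manifests: dict) -> list[str]:
    """Collapse dependencyGraphManifests into a sorted, de-duplicated list of names."""
    names = set()
    for manifest in manifests.get("edges", []):
        deps = manifest["node"]["dependencies"]["edges"]
        for dep in deps:
            node = dep["node"]
            repo = node.get("repository")
            # Assumption: fall back to the package name when no repository is linked.
            names.add(repo["nameWithOwner"] if repo else node["packageName"])
    return sorted(names)


# Trimmed copy of the response above.
sample = {
    "totalCount": 2,
    "edges": [
        {"node": {"dependencies": {"edges": [
            {"node": {"packageName": "tomli", "repository": {"nameWithOwner": "hukkin/tomli"}}},
            {"node": {"packageName": "tomli-w", "repository": None}},
        ]}}},
        {"node": {"dependencies": {"edges": [
            {"node": {"packageName": "tokio", "repository": {"nameWithOwner": "tokio-rs/tokio"}}},
        ]}}},
    ],
}
print(flatten_dependencies(sample))
```

A flat list like this would fit a single array column; keeping one list per manifest is the alternative if the manifest boundaries matter.
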
marcelo-g-simas commented 2 days ago

Here's a revised query that includes all of the new fields we will need to add to repos:

fragment A on SearchResultItemConnection {
  pageInfo { endCursor hasNextPage }
  edges {
    node {
      ... on Repository {
        id
        createdAt
        nameWithOwner
        description
        primaryLanguage { name }
        languages(first: 5) {
          nodes { name }
        }
        defaultBranchRef {
          id
          target {
            ... on Commit {
              history(until: "2024-01-01T00:00:00Z") { totalCount }
            }
          }
        }
        repositoryTopics(first: 10) {
          edges {
            node {
              topic { name }
            }
          }
        }
        forkCount
        isInOrganization
        stargazerCount
        homepageUrl
        dependencyGraphManifests(first: 10) {
          totalCount
          edges {
            node {
              dependencies {
                edges {
                  node {
                    packageName
                    repository { nameWithOwner }
                  }
                }
              }
            }
          }
        }
        watchers { totalCount }
        releases { totalCount }
        issues { totalCount }
      }
    }
  }
}

query Search() {
  _1: search(
    query: "is:public fork:false mirror:false archived:false license:LGPL-3.0 created:2023-12-29T00:00:00Z..2024-01-01T00:00:00Z",
    type: REPOSITORY,
    first: 2
  ) { ...A }
}
marcelo-g-simas commented 1 day ago

We now have code that queries and parses the following new attributes for repos:

    - topics::text[]
    - forks::integer
    - isinorganization::boolean
    - homepageurl::text
    - dependencies::text[]
    - stargazers::integer
    - watchers::integer
    - releases::integer
    - issues::integer
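
As a rough illustration of how a Repository node from the revised query could map onto these columns, here is a Python sketch. It is not the GHOST.jl parsing code: `to_row` and the sample node are hypothetical, and storing dependencies as package names is an assumption.

```python
def to_row(node: dict) -> dict:
    """Map one Repository node onto the new column names listed above."""
    topics = [e["node"]["topic"]["name"] for e in node["repositoryTopics"]["edges"]]
    dependencies = [
        dep["node"]["packageName"]  # assumption: store the package name
        for manifest in node["dependencyGraphManifests"]["edges"]
        for dep in manifest["node"]["dependencies"]["edges"]
    ]
    return {
        "topics": topics,
        "forks": node["forkCount"],
        "isinorganization": node["isInOrganization"],
        "homepageurl": node["homepageUrl"],
        "dependencies": dependencies,
        "stargazers": node["stargazerCount"],
        "watchers": node["watchers"]["totalCount"],
        "releases": node["releases"]["totalCount"],
        "issues": node["issues"]["totalCount"],
    }


# Invented node with the same shape as the query result.
sample_node = {
    "repositoryTopics": {"edges": [{"node": {"topic": {"name": "julia"}}}]},
    "forkCount": 3,
    "isInOrganization": True,
    "homepageUrl": None,
    "dependencyGraphManifests": {"edges": [
        {"node": {"dependencies": {"edges": [{"node": {"packageName": "tokio"}}]}}},
    ]},
    "stargazerCount": 12,
    "watchers": {"totalCount": 4},
    "releases": {"totalCount": 1},
    "issues": {"totalCount": 7},
}
row = to_row(sample_node)
print(row)
```
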
marcelo-g-simas commented 2 hours ago

Deployed this, and GitHub is returning 502 Bad Gateway for some queries:

Error: HTTP.Exceptions.StatusError(502, "POST", "/graphql", HTTP.Messages.Response:
      From worker 7:    │ """
      From worker 7:    │ HTTP/1.1 502 Bad Gateway
      From worker 7:    │ Date: Wed, 18 Sep 2024 18:13:39 GMT
      From worker 7:    │ Content-Type: application/json
      From worker 7:    │ Access-Control-Expose-Headers: ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset
      From worker 7:    │ Access-Control-Allow-Origin: *
      From worker 7:    │ Vary: Accept-Encoding, Accept, X-Requested-With
      From worker 7:    │ Transfer-Encoding: chunked
      From worker 7:    │ X-GitHub-Request-Id: B0D8:39A2D6:62C5F1:BE194C:66EB183E
      From worker 7:    │ Server: github.com
      From worker 7:    │ 
      From worker 7:    │ {
      From worker 7:    │    "data": null,
      From worker 7:    │    "errors":[
      From worker 7:    │       {
      From worker 7:    │          "message":"Something went wrong while executing your query. This may be the result of a timeout, or it could be a GitHub bug. Please include `B0D8:39A2D6:62C5F1:BE194C:66EB183E` when reporting this issue."
      From worker 7:    │       }
      From worker 7:    │    ]
      From worker 7:    │ }
      From worker 7:    │ """)
      From worker 7:    └ @ GHOST ~/Git/GHOST.jl/src/01_BaseUtils.jl:107
marcelo-g-simas commented 2 hours ago

I think that the added attributes are making the queries too big, and we will need to recompute them using smaller time windows.
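
Splitting the `created:` range into smaller windows could look like the following Python sketch. `split_window` and `search_string` are hypothetical helpers; the qualifiers are copied from the search string used earlier in this issue, and equal-width windows are an assumption.

```python
from datetime import datetime, timezone

FMT = "%Y-%m-%dT%H:%M:%SZ"


def split_window(start: datetime, end: datetime, parts: int) -> list[tuple[datetime, datetime]]:
    """Split [start, end] into `parts` consecutive windows of equal width."""
    step = (end - start) / parts
    edges = [start + i * step for i in range(parts)] + [end]
    return list(zip(edges[:-1], edges[1:]))


def search_string(start: datetime, end: datetime, license_key: str = "LGPL-3.0") -> str:
    """Build the search string for one window."""
    return (
        "is:public fork:false mirror:false archived:false "
        f"license:{license_key} created:{start.strftime(FMT)}..{end.strftime(FMT)}"
    )


start = datetime(2023, 12, 29, tzinfo=timezone.utc)
end = datetime(2024, 1, 1, tzinfo=timezone.utc)
for lo, hi in split_window(start, end, 3):
    print(search_string(lo, hi))
```

Adjacent windows share a boundary timestamp, so results may need de-duplicating by `id` if a repo was created exactly on a boundary.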

marcelo-g-simas commented 13 minutes ago

Smaller time windows did not change the outcome, so I am going to drop the dependency tree from the query.