GitHub API Recursion - Githubissues

brucewrks commented 9 years ago

Related to #11 and #2

The GitHub API Class currently relies on the recursive=1 option on line#166.

This should be changed to remove this option, and instead recursively query items with the type of tree.

See: https://developer.github.com/v3/git/trees/

TODO list added by @jaswsinc for @brucewsinc

[x] Determine if we need to use a dated folder structure to make this easier to deal with.
[ ] Determine if we need to store a transient last_sub_tree value so the GitHub CRON Processor can use this to help it pick back up where it left off at. On very large repos with many thousands of KB articles, pulling the entire tree in a single CRON process (whether we process all of the entries or not), is less than ideal. We would need to deal with one sub-tree at a time, and then pull each sub-tree as part of the processing recursion when/if time allows within each individual process.
[x] Modify the retrieve_tree() method so that it's no longer recursive.
[x] Add a new method; retrieve_sub_tree() that will retrieve a specific sub-tree — also without recursion.
[x] The retrieve_articles() method should be modified to accept a $tree parameter. This way we can pull articles from a specific tree.

jaswrks commented 9 years ago

@brucewsinc In your testing did you notice what the limit actually is with this? i.e. when it is truncated exactly? I realize that is very difficult to test for, just thought I would ask.

Also, if I understand this correctly, a sub-tree would be a sub-folder. Is that correct?

cc @raamdev If that's the case, then perhaps it is a good idea for us build KB articles and nest them into a dated folder structure. This way whenever we add the recursive functionality to deal with larger KBs, we won't run into as much trouble. I think nesting the KB articles into dated sub-directories inside each of the categories that we have now would work OK. Thoughts?

e.g. /questions/YY-MM/123-my-article.md

brucewrks commented 9 years ago

In your testing did you notice what the limit actually is with this? i.e. when it is truncated exactly? I realize that is very difficult to test for, just thought I would ask.

No, and GitHub doesn't make that limit clear in the documentation for some reason.

Also, if I understand this correctly, a sub-tree would be a sub-folder. Is that correct?

Right.

raamdev commented 9 years ago

I think nesting the KB articles into dated sub-directories inside each of the categories that we have now would work OK. Thoughts?

e.g. /questions/YY-MM/123-my-article.md

Ugh, I can see why this would be helpful (even required on bigger KBs with hundreds or even thousands of articles), but that's butt-ugly from a UX perspective.

For one, KB articles generally don't have dates associated with them. They are not blog posts or news items. They are questions, or tutorials (which could be dated, I agree; but questions? no). So, suggesting (or requiring) that users use YY-MM sub-folders for their articles just feels messy, confusing, and inconvenient. It also makes finding KB articles in the repo much harder--"When did I write that KB article... 15-01? or was it 14-12?" That doesn't even begin to explain how challenging organization would be if you were working with a team of people on the KB...

If sub-directories are required... hmm, I'm not sure what I'd recommend. When I first envisioned us maintaining the KB in a GitHub repo, I figured we'd have a directory structure that reflected categories (e.g., /questions/billing/), however that has its own problems (how would something get filed in multiple categories?). I thought maybe we could use a per-author subdirectory (/questions/raamdev/), but that doesn't address the issue of a single directory growing too large.

I suppose a date-based sub-directory structure is the only option. I really don't like it, especially from a user experience perspective, but if that's what's required to make GitHub Integration work for larger KBs, so be it.

raamdev commented 9 years ago

If we do go with date-based subdirectories, maybe we could come up with a way of automatically generating an index to navigate the Repo... something that just lists all the Markdown files in the repo, regardless of what subdirectory they're in...

/questions/15-01/example1.md
/questions/14-12/example2.md
/questions/14-12/example3.md
/questions/14-11/example4.md

This could even just be something that the WP KB Articles plugin can spit out (accessed by a special URL).

I'm just trying to figure out how we can better find stuff, once it's all buried inside date-based subdirectories...

jaswrks commented 9 years ago

Yep. It's a bummer. I did not think about (or realize) how this was going to impact us and anyone who uses WPKBA together with GitHub. I'm going to run some tests soon and see if I can determine what the limit actually is. I read over a few comments and I'm seeing people say it's everything from 1000 files to 100,000.

I'm just trying to figure out how we can better find stuff, once it's all buried inside date-based subdirectories...

It looks like GitHub offers a similar feature. https://github.com/owner/repo/find/

Example: https://github.com/websharks/zencache-kb/find/master

2015-01-19_12-56-52

raamdev commented 9 years ago

It looks like GitHub offers a similar feature.

Oh, that's awesome. I've never seen or used that feature. That actually looks perfect to me. I wouldn't bother adding any sort of indexing feature to the plugin since users can just use that.

determine what the limit actually is

Yes, I think this is somewhat important to determine, especially given that our plugin could break (or behave unexpectedly) if a user goes over that limit in a single directory. It would also be helpful to be able to warn users on the front-end when the plugin detects more than X number of files in a given GitHub Repo directory...

raamdev commented 9 years ago

@brucewsinc This is labeled help wanted but I see it's assigned to you. Are you taking care of this issue? If so, please remove the help wanted label.

jaswrks commented 9 years ago

@brucewsinc This issue is now a top priority in this repo. Would you please provide an update here so that we know if you are planning to work on this issue any longer, or not?

jaswrks commented 9 years ago

This has been implemented. WPKBA has been updated to support an unlimited number of files/directories through recursion that is not dependent upon the GitHub API.

raamdev commented 9 years ago

@jaswsinc Woohoo! How does that work? Does it require any changes to the way markdown files are stored (as was discussed earlier in this issue)?

jaswrks commented 9 years ago

Important Note

While WPKBA now supports an unlimited number of files/directories, there is an important issue worth noting about the GitHub API. It pertains to the max number of items in any given directory tree.

Any tree that contains more than 1000 items is going to create a problem. You can have two gazillion directories with 1000 items each. However, if you have 1 directory with 1001 items this will cause an issue, because any single API call to retrieve a tree will return at most 1000 items.

For that reason, it is suggested that you use date-based subdirectories to prevent any single tree from containing more than 1000 items. 1000 being the most that a single API call will return.

The limit of 1000 is per-directory (i.e., tree) and not a total for all subdirectories. In other words, you can have one folder with 1000 subfolders, each with 1000 more subfolders, each with 1000 files. The key is that you don't allow more than 1000 sub-directories or files inside any given directory, regardless of it's depth.

To clarify further, a "tree" is a directory (regardless of it's depth). Each tree returns a list of items that consist of other trees (subdirectories), or files inside the tree. Trees can be nested as deep as you like, in any quantity that you like. However, there can never be more than 1000 items in any given tree.

That means that even the following directory structure may eventually hit a limit over time:

- tutorials/
  - 0101/
  - 0102/
  - ... up to 1000 items max in the `tutorials` tree.

In this example, tutorials is a tree, 0101 is another tree, and 0102 is another tree. Each of these can contain up to 1000 items each (i.e., subdirectories or files).

raamdev commented 9 years ago

any single API call to retrieve a tree will return at most 1000 items.

Is there any way to check how many items exist in a given subdirectory so that site owners can be warned on the front-end that a particular directory contains more than 1000 items?

jaswrks commented 9 years ago

Yes, there is a truncated flag that we could look for. https://developer.github.com/v3/git/trees/

wpsharks / wp-kb-articles

GitHub API Recursion #12

Important Note