@brucewsinc In your testing did you notice what the limit actually is with this? i.e. when it is truncated exactly? I realize that is very difficult to test for, just thought I would ask.
Also, if I understand this correctly, a sub-tree would be a sub-folder. Is that correct?
cc @raamdev If that's the case, then perhaps it is a good idea for us to build KB articles and nest them into a dated folder structure. This way, whenever we add the recursive functionality to deal with larger KBs, we won't run into as much trouble. I think nesting the KB articles into dated sub-directories inside each of the categories that we have now would work OK. Thoughts?
e.g. /questions/YY-MM/123-my-article.md
> In your testing did you notice what the limit actually is with this? i.e. when it is truncated exactly? I realize that is very difficult to test for, just thought I would ask.
No, and GitHub doesn't make that limit clear in the documentation for some reason.
> Also, if I understand this correctly, a sub-tree would be a sub-folder. Is that correct?
Right.
> I think nesting the KB articles into dated sub-directories inside each of the categories that we have now would work OK. Thoughts?
> e.g. /questions/YY-MM/123-my-article.md
Ugh, I can see why this would be helpful (even required on bigger KBs with hundreds or even thousands of articles), but that's butt-ugly from a UX perspective.
For one, KB articles generally don't have dates associated with them. They are not blog posts or news items. They are questions, or tutorials (which could be dated, I agree; but questions? no). So, suggesting (or requiring) that users use `YY-MM` sub-folders for their articles just feels messy, confusing, and inconvenient. It also makes finding KB articles in the repo much harder ("When did I write that KB article... `15-01`? Or was it `14-12`?"). And that doesn't even begin to cover how challenging organization would be if you were working with a team of people on the KB...
If sub-directories are required... hmm, I'm not sure what I'd recommend. When I first envisioned us maintaining the KB in a GitHub repo, I figured we'd have a directory structure that reflected categories (e.g., `/questions/billing/`); however, that has its own problems (how would something get filed in multiple categories?). I thought maybe we could use a per-author subdirectory (`/questions/raamdev/`), but that doesn't address the issue of a single directory growing too large.
I suppose a date-based sub-directory structure is the only option. I really don't like it, especially from a user experience perspective, but if that's what's required to make GitHub Integration work for larger KBs, so be it.
If we do go with date-based subdirectories, maybe we could come up with a way of automatically generating an index to navigate the Repo... something that just lists all the Markdown files in the repo, regardless of what subdirectory they're in...
/questions/15-01/example1.md
/questions/14-12/example2.md
/questions/14-12/example3.md
/questions/14-11/example4.md
This could even just be something that the WP KB Articles plugin can spit out (accessed by a special URL).
I'm just trying to figure out how we can better find stuff, once it's all buried inside date-based subdirectories...
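Here's a rough sketch of what I have in mind, just to make the idea concrete. The post type name, meta key, and query parameter below are placeholders I made up for the example, not anything WPKBA actually defines:

```php
<?php
// Hypothetical sketch only: 'kb_article', 'github_path', and 'kb_github_index'
// are assumed names, not the plugin's real post type / meta key / endpoint.
add_action('init', function () {
    if (empty($_GET['kb_github_index'])) {
        return; // Only respond to the special URL, e.g. /?kb_github_index=1
    }
    $articles = get_posts([
        'post_type'      => 'kb_article', // assumed KB article post type
        'posts_per_page' => -1,
        'post_status'    => 'publish',
    ]);
    header('Content-Type: text/plain; charset=utf-8');

    foreach ($articles as $article) {
        // Assumed meta key storing the article's path in the GitHub repo.
        $path = get_post_meta($article->ID, 'github_path', true);
        echo ($path ? $path : $article->post_name) . "\n";
    }
    exit;
});
```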
Yep. It's a bummer. I did not think about (or realize) how this was going to impact us and anyone who uses WPKBA together with GitHub. I'm going to run some tests soon and see if I can determine what the limit actually is. I read over a few comments and I'm seeing people say it's anywhere from 1,000 files to 100,000.
> I'm just trying to figure out how we can better find stuff, once it's all buried inside date-based subdirectories...
It looks like GitHub offers a similar feature.
https://github.com/owner/repo/find/
Example: https://github.com/websharks/zencache-kb/find/master
> It looks like GitHub offers a similar feature.
Oh, that's awesome. I've never seen or used that feature. That actually looks perfect to me. I wouldn't bother adding any sort of indexing feature to the plugin since users can just use that.
> determine what the limit actually is
Yes, I think this is somewhat important to determine, especially given that our plugin could break (or behave unexpectedly) if a user goes over that limit in a single directory. It would also be helpful to be able to warn users on the front-end when the plugin detects more than X number of files in a given GitHub Repo directory...
@brucewsinc This is labeled `help wanted` but I see it's assigned to you. Are you taking care of this issue? If so, please remove the `help wanted` label.
@brucewsinc This issue is now a top priority in this repo. Would you please provide an update here so that we know whether or not you are still planning to work on it?
This has been implemented. WPKBA has been updated to support an unlimited number of files/directories through recursion that is not dependent upon the GitHub API.
@jaswsinc Woohoo! How does that work? Does it require any changes to the way markdown files are stored (as was discussed earlier in this issue)?
While WPKBA now supports an unlimited number of files/directories, there is an important issue worth noting about the GitHub API. It pertains to the max number of items in any given directory tree.
Any tree that contains more than 1000 items is going to create a problem. You can have two gazillion directories with 1000 items each. However, if you have one directory with 1001 items, this will cause an issue, because any single API call to retrieve a tree will return at most 1000 items.
For that reason, it is suggested that you use date-based subdirectories to prevent any single tree from containing more than 1000 items, 1000 being the most that a single API call will return.
The limit of 1000 is per-directory (i.e., tree) and not a total for all subdirectories. In other words, you can have one folder with 1000 subfolders, each with 1000 more subfolders, each with 1000 files. The key is that you don't allow more than 1000 sub-directories or files inside any given directory, regardless of its depth.
To clarify further, a "tree" is a directory (regardless of its depth). Each tree returns a list of items that consist of other trees (subdirectories) or files inside the tree. Trees can be nested as deep as you like, in any quantity that you like. However, there can never be more than 1000 items in any given tree.
That means that even the following directory structure may eventually hit a limit over time:
- tutorials/
- 0101/
- 0102/
- ... up to 1000 items max in the `tutorials` tree.
In this example, `tutorials/` is a tree, `0101/` is another tree, and `0102/` is another tree. Each of these can contain up to 1000 items (i.e., subdirectories or files).
> any single API call to retrieve a tree will return at most 1000 items.
Is there any way to check how many items exist in a given subdirectory so that site owners can be warned on the front-end that a particular directory contains more than 1000 items?
Yes, there is a `truncated` flag that we could look for. https://developer.github.com/v3/git/trees/
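Something along these lines could check for it and surface a warning on the admin side. This is only a rough sketch; the function name is made up, the owner/repo/branch values are placeholders, and a branch name is assumed to be acceptable as the tree ref:

```php
<?php
// Rough sketch only: owner, repo, and branch are placeholders for the connected repo.
function kb_github_tree_is_truncated($owner, $repo, $branch = 'master')
{
    // The Trees API endpoint; passing a branch name as the tree ref is assumed to work here.
    $url      = "https://api.github.com/repos/{$owner}/{$repo}/git/trees/{$branch}";
    $response = wp_remote_get($url, ['headers' => ['Accept' => 'application/vnd.github.v3+json']]);

    if (is_wp_error($response)) {
        return false; // Network failure; we can't tell, so don't warn.
    }
    $body = json_decode(wp_remote_retrieve_body($response), true);

    // The Trees API sets `truncated` when it could not return every item
    // in the tree; i.e., the directory holds more items than one call returns.
    return !empty($body['truncated']);
}

// Usage sketch: check on admin pages and queue a warning notice if needed.
add_action('admin_init', function () {
    if (kb_github_tree_is_truncated('websharks', 'zencache-kb')) { // placeholder repo
        add_action('admin_notices', function () {
            echo '<div class="notice notice-warning"><p>';
            echo 'A directory in your GitHub KB repo contains more items than a single API ';
            echo 'call can return; some articles may not sync. Consider splitting it up.';
            echo '</p></div>';
        });
    }
});
```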
Related to #11 and #2
The GitHub API Class currently relies on the `recursive=1` option on line #166. This should be changed to remove this option, and instead recursively query items with the `type` of `tree`. See: https://developer.github.com/v3/git/trees/
TODO list added by @jaswsinc for @brucewsinc
- A `last_sub_tree` value, so the GitHub CRON Processor can use this to help it pick back up where it left off. On very large repos with many thousands of KB articles, pulling the entire tree in a single CRON process (whether we process all of the entries or not) is less than ideal. We would need to deal with one sub-tree at a time, and then pull each sub-tree as part of the processing recursion when/if time allows within each individual process.
- Update the `retrieve_tree()` method so that it's no longer recursive.
- Add a `retrieve_sub_tree()` method that will retrieve a specific sub-tree, also without recursion.
- The `retrieve_articles()` method should be modified to accept a `$tree` parameter. This way we can pull articles from a specific tree.
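A rough, non-committal sketch of how those pieces might fit together. The `retrieve_tree()` and `retrieve_sub_tree()` names come from the TODO list above; the class name, option key, and request helper are placeholders:

```php
<?php
// Rough sketch only; class name, 'kb_last_sub_tree' option key, and request()
// helper are assumptions for illustration, not the plugin's actual code.
class GitHub_Tree_Walker
{
    protected $owner, $repo;

    public function __construct($owner, $repo)
    {
        $this->owner = $owner;
        $this->repo  = $repo;
    }

    // Retrieve the root tree WITHOUT the `recursive=1` option.
    public function retrieve_tree($ref = 'master')
    {
        return $this->request("/git/trees/{$ref}");
    }

    // Retrieve one specific sub-tree by its SHA, also without recursion.
    public function retrieve_sub_tree($sha)
    {
        return $this->request("/git/trees/{$sha}");
    }

    // Process a single sub-tree per CRON run, remembering where we left off.
    public function process_next_sub_tree()
    {
        $last_path = get_option('kb_last_sub_tree', ''); // assumed option key
        $resumed   = ($last_path === ''); // No saved position yet; start at the top.
        $root      = $this->retrieve_tree();
        $items     = isset($root['tree']) ? $root['tree'] : [];

        foreach ($items as $item) {
            if ($item['type'] !== 'tree') {
                continue; // Only sub-trees (sub-folders) are queued here.
            }
            if (!$resumed) {
                $resumed = ($item['path'] === $last_path);
                continue; // Skip sub-trees already handled in earlier runs.
            }
            $sub_tree = $this->retrieve_sub_tree($item['sha']);
            // ... hand $sub_tree['tree'] off to the article processor here ...
            update_option('kb_last_sub_tree', $item['path']);
            return; // One sub-tree per run keeps each CRON process short.
        }
        delete_option('kb_last_sub_tree'); // Full pass complete; start over next time.
    }

    protected function request($endpoint)
    {
        $url      = "https://api.github.com/repos/{$this->owner}/{$this->repo}{$endpoint}";
        $response = wp_remote_get($url);

        return is_wp_error($response)
            ? ['tree' => []]
            : json_decode(wp_remote_retrieve_body($response), true);
    }
}
```

Under this approach, the GitHub CRON Processor would simply call `process_next_sub_tree()` on each run, so no single process ever has to pull or walk the entire repo at once.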