x-CK-x / Dataset-Curation-Tool

A tool for downloading from public image boards (those that allow scraping), previewing your images and tags, and editing your images and tags. Additional tabs let you download other code repositories as well as state-of-the-art (S.O.T.A.) diffusion and auto-tag/caption models. Custom datasets can be added!
GNU General Public License v3.0

Dataset-Curation-Tool

A tool for downloading from public image boards (those that allow scraping), previewing your images and tags, and editing your tags. Additional tabs let you download other code repositories as well as state-of-the-art (S.O.T.A.) diffusion and CLIP models. Custom datasets can be added!

WIKI-Page / Tutorial for this Repository HERE

General Config

Installation Requirements

Make sure you have git installed!

Download the run file for Windows, macOS, or Linux (the repo will be installed for you):

Windows Download

Linux Download

MacOS Download

Mac and Linux users should make the run file executable with the matching terminal command:

chmod +x linux_run.sh

OR

chmod +x mac_run.sh

Other System Install Options

(Linux)

sudo apt-get install unzip

How to Run Program

DO NOT run the file with admin/sudo permissions!

DO NOT put the manually downloaded run file (from the installation step above) in the Data-Curation-Tool folder!

DO NOT use the run files inside the Data-Curation-Tool folder! Use the manually downloaded run file from the installation step above to install and/or update the repo.

DO NOT move the generated "dataset_curation_path.txt" file out of the Data-Curation-Tool folder!

The duplicate run files (run.bat, mac_run.sh, linux_run.sh) inside the Data-Curation-Tool folder are intentionally deleted when the program is run.

Double-click the run file to run with the default settings.

To update the dependencies listed in the YAML file, run the following command (make sure to use the most recent YAML file from the repo: https://raw.githubusercontent.com/x-CK-x/Dataset-Curation-Tool/main/environment.yml):

./RUN_FILE --update

Below are several additional run options to choose from

Run with sharing turned on: provides a live link that anyone can use

./RUN_FILE --share

Run password protected: requires the user to enter a username and password to access the web UI

./RUN_FILE --server_port 7860 --username NAME --password PASS

Run on a specified port: serves the web UI on the specified port

./RUN_FILE --server_port 7860

Or choose any combination of the options above.
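As a rough illustration of combining the options above, here is a minimal Python sketch that assembles a command line from the flags shown in this README (`--share`, `--server_port`, `--username`, `--password`). The helper name `build_run_command` is hypothetical and not part of the tool, and the rule that a username and password must be given together is an assumption based on the example above.

```python
from typing import List, Optional


def build_run_command(
    run_file: str,
    *,
    share: bool = False,
    server_port: Optional[int] = None,
    username: Optional[str] = None,
    password: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: combine the README's run flags into one command."""
    # Assumption: credentials only make sense as a pair.
    if (username is None) != (password is None):
        raise ValueError("--username and --password must be given together")

    command = [run_file]
    if share:
        command.append("--share")
    if server_port is not None:
        command += ["--server_port", str(server_port)]
    if username is not None and password is not None:
        command += ["--username", username, "--password", password]
    return command


# Example: sharing, a fixed port, and password protection combined.
print(" ".join(build_run_command(
    "./RUN_FILE", share=True, server_port=7860,
    username="NAME", password="PASS",
)))
# prints: ./RUN_FILE --share --server_port 7860 --username NAME --password PASS
```

The resulting list can be passed to `subprocess.run` if you prefer to launch the run file from a script instead of typing the command by hand.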

Important Information

Bug Reporting & Troubleshooting

Create a Support Ticket or Bug Report here: https://github.com/x-CK-x/Dataset-Curation-Tool/issues

New Feature Requests

Feel free to suggest new feature/s here: https://github.com/x-CK-x/Dataset-Curation-Tool/discussions/categories/ideas

Future Objectives/Features

New features are paused as of 09/05/2023, unless willing contributors step in to develop any of the other features.

New image-board-specific tagging/captioning models will be supported as they are released. (There is currently no ETA on the progress of those models, which are being developed by others.)

Contributors are welcome to open a Pull Request for their work, and I will promptly review it for inclusion.

Additional Information

Default folder directory tree
base_folder/
├─ batch_folder/
│  ├─ downloaded_posts_folder/
│  │  ├─ png_folder/
│  │  ├─ jpg_folder/
│  │  ├─ gif_folder/
│  │  ├─ webm_folder/
│  │  ├─ swf_folder/
│  ├─ resized_img_folder/
│  ├─ tag_count_list_folder/
│  │  ├─ tags.csv
│  │  ├─ tag_category.csv
│  ├─ save_searched_list_path.txt
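
The tool creates this layout itself; the following Python sketch only illustrates the folder structure shown above. The function name `make_default_tree` is hypothetical, and the folder names are the placeholder names from the tree, not necessarily the names the tool uses on disk.

```python
from pathlib import Path
from typing import List

# Per-format folders under downloaded_posts_folder, as listed in the tree.
MEDIA_FOLDERS = ["png_folder", "jpg_folder", "gif_folder", "webm_folder", "swf_folder"]


def make_default_tree(base_folder: Path, batch_name: str = "batch_folder") -> List[Path]:
    """Create the folders from the tree above and return them (illustrative only)."""
    batch = base_folder / batch_name
    posts = batch / "downloaded_posts_folder"

    folders = [posts / name for name in MEDIA_FOLDERS]
    folders += [batch / "resized_img_folder", batch / "tag_count_list_folder"]

    for folder in folders:
        # exist_ok=True leaves existing folders untouched instead of failing.
        folder.mkdir(parents=True, exist_ok=True)
    return folders
```

Files such as tags.csv, tag_category.csv, and save_searched_list_path.txt are written by the tool during a run, so the sketch leaves them out.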

Any file path parameter that is left empty will use the default path.

Files/folders that use the same path are merged, not overwritten. For example, using the same path for save_searched_list_path for every batch will result in a combined searched list of every batch in one .txt file.
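
The two rules above (an empty path falls back to the default, and a shared path is merged rather than overwritten) can be sketched in Python as follows. This is an illustration under assumptions, not the tool's actual code: the helper names `resolve_path` and `append_entries` are hypothetical, and the one-entry-per-line file format is assumed.

```python
from pathlib import Path
from typing import Iterable, Optional


def resolve_path(user_value: Optional[str], default: Path) -> Path:
    """Return the user's path, or the default when the parameter is empty."""
    if user_value is None or not user_value.strip():
        return default
    return Path(user_value)


def append_entries(list_path: Path, entries: Iterable[str]) -> None:
    """Append entries to the list file so earlier batches are preserved."""
    list_path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "a" appends to the end of the file; mode "w" would overwrite it.
    with list_path.open("a", encoding="utf-8") as handle:
        for entry in entries:
            handle.write(entry + "\n")
```

Calling `append_entries` with the same path for two batches leaves one file that contains the entries of both batches, in the order they were written.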

Notes

For more information or help with the downloading script, please see the original image board downloader script: https://github.com/pikaflufftuft/pikaft-e621-posts-downloader

License

MIT

Usage conditions

By using this downloader, the user agrees that the author is not liable for any misuse of this downloader. This downloader is open-source and free to use.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.