dulaku opened this issue 5 years ago
Thanks for doing that! Mine finished recently too; I'll check the diff between ours to verify when I get the time. After that it would be nice to put up a tgz of the scraped results, but I'm not sure whether that counts as fair use, so until I can be certain I'd rather be safe. Any ideas about that?
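For the diff itself, something like this quick set comparison is probably enough (the file names here are just placeholders for wherever each of us dumped our URL lists):

```python
# Compare two scraped URL lists, assuming each run wrote one URL per line.
def load_urls(path):
    with open(path, "r", encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

mine = load_urls("urls_run_a.txt")    # placeholder file names
theirs = load_urls("urls_run_b.txt")

print("only in run A:", len(mine - theirs))
print("only in run B:", len(theirs - mine))
print("shared:", len(mine & theirs))
```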
Yeah, I wasn't sure about the copyright on the scraped material either. The strategy ImageNet uses for their main collection is probably the safest way to play - distributing the URLs but not the content.
I'm almost positive fair use would apply under United States law:
1) The nature of the dataset is noncommercial and research.
2) The copyrighted works are freely distributed already, rather than sold in order to generate revenue.
3) We're not scraping entire websites, just pages, and we're redistributing only part of the data from each scraped page.
4) The scraped data can't replace the original copyright holder's web application, so it shouldn't have a meaningful impact on the original work's value.
Additionally, a legal problem is probably unlikely, but it's possible. I'm not a legal professional, so the above shouldn't be taken as legal advice. I can look into what distinguishes the ImageNet contest datasets, which include images, from the whole collection, and what gave them the confidence to distribute those.
Presumably, with these URLs and the downloaded content, you could generate the 40 GB training file? Did anyone do this? What are the next steps to get closer to the OpenAI parameters?
The biggest problem right now is that nobody has implemented the training code yet (as far as I can tell), so if any of you manage to do that I'd be willing to give it a shot on the 345M model.
May I ask, what's the difference between this repo and https://github.com/jcpeterson/openwebtext?
Hi dulaku, thank you for your great work. I'm now using the 4 GB dataset. I wonder whether it's possible to get the 40 GB dataset?
Don't know about that - I certainly can't distribute the dataset I have by itself (by its nature the content is generally going to be copyrighted). Your best bet would be to download the URL list and then retrieve the contents yourself; unfortunately, a lot of the content has probably expired by now, and you also have to worry about getting blocked by websites that don't want you scraping (for which I recommend spinning up several cloud machines to run your downloads, if you can afford it). I'll try to make sure I'm seeding that URL list this weekend in case you don't already have it.
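To be concrete about what I mean by retrieving the contents yourself, here's a rough sketch (not the script I actually ran; the file names, user agent, timeout, and worker count are all placeholder choices you'd want to tune):

```python
# Rough sketch of re-downloading pages from the shared URL list. Many links
# will have expired, so failures are printed rather than retried aggressively.
import concurrent.futures
import hashlib
import os

import requests

URL_LIST = "urls.txt"       # assumed: one URL per line
OUT_DIR = "scraped_html"    # placeholder output directory
os.makedirs(OUT_DIR, exist_ok=True)

def fetch(url):
    try:
        resp = requests.get(url, timeout=30,
                            headers={"User-Agent": "research-scraper"})
        if resp.status_code == 200 and resp.text:
            # Name each file by a hash of its URL so reruns are idempotent.
            name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
            with open(os.path.join(OUT_DIR, name), "w", encoding="utf-8") as f:
                f.write(resp.text)
            return url, "ok"
        return url, f"status {resp.status_code}"
    except requests.RequestException as exc:
        return url, f"error: {exc}"

with open(URL_LIST, encoding="utf-8") as f:
    urls = [line.strip() for line in f if line.strip()]

# A modest thread pool keeps the per-site request rate low; the real speedup
# comes from splitting the URL list across several machines.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    for url, status in pool.map(fetch, urls):
        if status != "ok":
            print(url, status)
```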
The big challenge for me is that I don't know what processing steps happened between downloading all the text and winding up with that 40 GB dataset. Since earlier this year I've been drifting into other topics and haven't had a lot of time to spend on this, so updates may have happened that answer that question.
Hi dulaku, thank you for your reply. I'm now downloading the contents. I have a question about the number of URLs. There are 84,545,438 links. However, the original paper says: "webtext contains the text subset of 45 million links." The paper also says that only "8 million documents" are actually kept from those 45 million links. I wonder why the two numbers (8 million and 84,545,438) are so different?
My speculation is that the authors performed additional filtering. There are a lot of results that are substantially similar, with only a few characters different, which it makes sense to remove. There were also several dead links in my experience. However, to get the real answer you'd have to reach out to the original authors, I think; I certainly don't know it.
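Just to illustrate the kind of duplicate filtering I'm imagining (pure speculation about their pipeline, not anything from the paper or this repo), an exact-dedup pass over normalized text would look something like the sketch below; catching results that differ by only a few characters would need something fuzzier like MinHash/LSH on top of it.

```python
# Speculative sketch: drop documents whose normalized text hashes identically.
import hashlib
import re

def normalize(text):
    # Lowercase and collapse whitespace so trivial formatting differences
    # don't count as distinct documents.
    return re.sub(r"\s+", " ", text.lower()).strip()

def deduplicate(docs):
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

docs = ["Hello   world!", "hello world!", "Something else entirely."]
print(len(deduplicate(docs)))  # 2
```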
Noticed you mentioned possibly uploading a tar.gz of the URLs; after about 4 days the script finished running for me and I've done the same. I've also got a torrent set up since I don't really have good dedicated hosting right now. Torrent description is extremely barebones; happy to add/modify anything if you'd like. Hopefully these help.
The torrent's at Academic Torrents and the file is hosted on Google Drive
Thanks for sharing this repo!