Usenet Archive Toolkit

The Usenet Archive Toolkit project aims to provide a set of tools to process various sources of usenet messages into a coherent, searchable archive.

Typically you will have two usage patterns:

  1. There is an already created archive file that you want to read. To do so, you only need to download or build the tbrowser utility.
  2. You want to create an archive file from the sources available to you. You will need to use most of the provided utilities. Following the workflow graph is a good starting point.

TL;DR: Download available archive files, use tbrowser to read them.

List of UAT archive files

Motivation

Usenet is dead. You may believe it's not, but it really is.

Polish usenet message count from 2016.08 to 2016.12 (approximate).

People went away to various forums, facebooks and twitters and seem fine there. Meanwhile, the old discussions slowly rot away. Google Groups is a sad, unusable joke. The Archive.org dataset, at least with regard to Polish usenet archives, is vastly incomplete. There is no easy way to get the data, browse it, or search it. So, maybe something needs to be done. How hard can it be anyway? (Not very: one month for a working prototype, another one for polish and bugfixing.)

Advantages

Why use UAT? Why not use existing solutions, like Google Groups, archives from archive.org, or NNTP servers with a long history?

Toolkit description

UAT provides a multitude of utilities, each specialized for its own task. You can find a brief description of each one below.

Import Formats

Usenet messages may be retrieved from a number of different sources. Currently we support:

Imported messages are stored in a per-message LZ4 compressed meta+payload database.
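
To illustrate what a per-message compressed meta+payload store can look like, here is a minimal Python sketch. It is not UAT's on-disk format: the record layout (offset, compressed size, original size) and the function names are assumptions made for this example, and zlib stands in for LZ4 because LZ4 is not part of the Python standard library.

```python
import struct
import zlib

# Hypothetical meta record: payload offset, compressed size, original size.
META_RECORD = struct.Struct("<QII")


def pack_messages(messages):
    """Compress each message on its own and build meta + payload buffers."""
    meta = bytearray()
    payload = bytearray()
    for raw in messages:
        blob = zlib.compress(raw)  # stand-in for LZ4 (assumption, see above)
        meta += META_RECORD.pack(len(payload), len(blob), len(raw))
        payload += blob
    return bytes(meta), bytes(payload)


def read_message(meta, payload, index):
    """Fetch one message by index without decompressing any other message."""
    offset, packed_size, size = META_RECORD.unpack_from(
        meta, index * META_RECORD.size
    )
    raw = zlib.decompress(payload[offset:offset + packed_size])
    if len(raw) != size:
        raise ValueError("size mismatch, store is corrupted")
    return raw
```

Compressing every message separately costs some compression ratio, but any single message can be read with one lookup in the meta table and one decompression call.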

Data Processing

Raw imported messages have to be processed to be of any use. We provide the following utilities:

Data Filtering

Raw data right after import is highly unfit for direct use. Messages are duplicated, there's spam. These utilities help clean it up:
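
As a rough illustration of duplicate removal, the Python sketch below keeps the first copy of every message. How the UAT utilities actually decide that two messages are the same is not described here, so the rule used (same Message-ID header, with a hash of the raw bytes as a fallback) and the function names are assumptions for this example only.

```python
import hashlib
from email import message_from_bytes


def message_key(raw):
    """Prefer the Message-ID header; fall back to a hash of the whole message."""
    msgid = message_from_bytes(raw).get("Message-ID")
    if msgid:
        return msgid.strip()
    return hashlib.sha256(raw).hexdigest()


def drop_duplicates(messages):
    """Return the messages in their original order, first occurrence only."""
    seen = set()
    unique = []
    for raw in messages:
        key = message_key(raw)
        if key not in seen:
            seen.add(key)
            unique.append(raw)
    return unique
```

Keying on Message-ID also catches copies of one message that arrived through different servers and therefore differ in transport headers.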

Data Search

Searching the archive is performed with the help of a word lexicon. The following tools are used for its preparation:
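
The general idea of a word lexicon can be sketched in a few lines of Python: map every word to the set of messages that contain it, then intersect those sets for a multi-word query. This only shows the concept. The tokenization rule, the in-memory layout and the function names are assumptions, and the real lexicon tools are certainly more involved.

```python
import re
from collections import defaultdict

WORD = re.compile(r"\w+")


def build_lexicon(messages):
    """Map each lowercased word to the indices of messages containing it."""
    lexicon = defaultdict(set)
    for index, text in enumerate(messages):
        for word in WORD.findall(text.lower()):
            lexicon[word].add(index)
    return lexicon


def search(lexicon, query):
    """Return sorted indices of messages that contain every query word."""
    words = WORD.findall(query.lower())
    if not words:
        return []
    hits = set(lexicon.get(words[0], set()))
    for word in words[1:]:
        hits &= lexicon.get(word, set())
    return sorted(hits)
```

With the lexicon built once, a query only touches the entries for the words it contains instead of scanning every message.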

Data Access

These tools provide access to archive data:

End-user Utilities

Future work ideas

Here are some viable ideas that I'm not really planning to do any time soon, but which would be nice to have:

Workflow

Usenet Archive Toolkit operates on a couple of distinct databases. Each utility requires a specific set of these databases and either produces its own database or creates a completely new database indexing schema, which invalidates the rest of the databases.

groups.google.com → google-groups → produces: maildir tree
nntp server → nntp-get → produces: maildir tree
maildir directory → import-source-maildir → produces: LZ4
maildir compressed → import-source-maildir-7z → produces: LZ4
mbox file → import-source-mbox → produces: LZ4
LZ4, msgid → export-messages → produces: separate message files
LZ4 → kill-duplicates → produces: LZ4
LZ4 → extract-msgid → adds: msgid
LZ4, msgid → connectivity → adds: conn
LZ4, conn → filter-newsgroups → produces: LZ4
LZ4, msgid, conn, str → filter-spam → produces: LZ4
LZ4 → extract-msgmeta → adds: str
(LZ4, msgid) + (LZ4, msgid) → merge-raw → produces: LZ4
(LZ4, msgid) + (LZ4, msgid) → relative-complement → produces: LZ4
LZ4 → utf8ize → produces: LZ4
LZ4 → repack-zstd → adds: zstd
zstd → repack-lz4 → adds: LZ4
(zstd, msgid) + (LZ4, msgid) → update-zstd → produces: zstd
LZ4, conn → lexicon → adds: lex
lex → lexsort → modifies: lex
lex → lexdist → adds: lexdist
lex → lexstats → user interaction
LZ4, msgid → query-raw → user interaction
zstd, msgid, conn, str, lex → libuat → user interaction
everything but LZ4 → package → one file archive
everything but LZ4 → threadify → modifies: conn, invalidates: lex
archive → sort → modifies: archive
collection of archives → galaxy-util → archive galaxy
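
Because every utility needs certain databases to exist before it can run, the listing above implies an ordering. The Python sketch below encodes a handful of the "adds:" rows and works out which utilities to run, and in what order, to obtain a given database. It is only an illustration: the dictionary layout and the plan function are made up for this example and are not part of UAT.

```python
# A few "adds:" rows from the workflow listing: utility -> (inputs, database added).
UTILITIES = {
    "extract-msgid": ({"LZ4"}, "msgid"),
    "connectivity": ({"LZ4", "msgid"}, "conn"),
    "extract-msgmeta": ({"LZ4"}, "str"),
    "repack-zstd": ({"LZ4"}, "zstd"),
    "lexicon": ({"LZ4", "conn"}, "lex"),
}

# Reverse lookup: database -> (utility that adds it, inputs of that utility).
PRODUCER = {adds: (name, needs) for name, (needs, adds) in UTILITIES.items()}


def plan(available, wanted):
    """Return the utilities to run, in order, to obtain every wanted database."""
    have = set(available)
    order = []

    def ensure(db):
        if db in have:
            return
        if db not in PRODUCER:
            raise ValueError(f"no listed utility adds {db!r}")
        name, needs = PRODUCER[db]
        for dep in sorted(needs):  # resolve prerequisites first
            ensure(dep)
        order.append(name)
        have.add(db)

    for db in wanted:
        ensure(db)
    return order
```

For example, starting from a freshly imported LZ4 database, `plan({"LZ4"}, ["lex"])` lists extract-msgid, connectivity and lexicon, in that order.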

Additional, optional information files, not created by any of the above utilities, but used in user-facing programs:

Typical Workflow

Notes

utf8ize doesn't compile on MSVC. Either compile it on Cygwin, or have fun banging glib and gmime into submission. Your choice.

UAT only works on 64-bit machines.

License

Usenet Archive
Copyright (C) 2016-2023  Bartosz Taudul <wolf@nereid.pl>

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as
published by the Free Software Foundation, either version 3 of the
License, or (at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.