Supercazzola is my own scraper tar pit, designed to dynamically generate an endless graph of web pages.
I wrote it to poison web crawlers that ignore my robots.txt.
This software requires cmake, pkg-config and libevent >= 2 as
dependencies.
It has been tested to work under GNU/Linux and FreeBSD.
License
3-Clause BSD License. See COPYING.txt
Binaries
- mchain(1) - Compile a Markov chain from one or more text files
- spamgen(1) - Generate random sentences out of a compiled Markov chain
- spamd(8) - Web daemon generating random HTML pages out of a compiled Markov chain
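To give an idea of what "compiling a Markov chain" and "generating random sentences" mean, here is a minimal sketch in Python. It is not the implementation used by mchain(1) or spamgen(1): the function names `build_chain` and `random_sentence` are made up for illustration, the chain is a plain word-to-successors dictionary rather than the compiled file format, and the length limit is assumed to be counted in words.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def random_sentence(chain, max_len=40, rng=random):
    """Random walk on the chain; stops at max_len words or at a dead end."""
    word = rng.choice(sorted(chain))  # hypothetical choice of start state
    sentence = [word]
    while len(sentence) < max_len and word in chain:
        word = rng.choice(chain[word])
        sentence.append(word)
    return " ".join(sentence)


if __name__ == "__main__":
    corpus = "the monster saw the creator and the creator saw the monster flee"
    chain = build_chain(corpus)
    print(random_sentence(chain, max_len=12, rng=random.Random(84)))
```

A walk ends early when it reaches a word with no recorded successor, which is why generated sentences can be shorter than the configured maximum.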
How-To
The following instructions cover provisioning and installation on FreeBSD, but they can easily be adapted to other operating systems (e.g. GNU/Linux).
- Build the software
  - Install cmake, pkg-config and libevent2:
        root@freebsd:~ # pkg install -y devel/pkgconf devel/libevent devel/cmake-core
  - Unpack, build and install the package:
        root@freebsd:~ # tar -xzf ./supercazzola-*.tar.gz
        root@freebsd:~ # cmake -S ./supercazzola-*/ -B ./build
        root@freebsd:~ # cmake --build ./build
        root@freebsd:~ # cmake --install ./build
- Create and install Markov chain
  - Get some long text, e.g. Frankenstein from Gutenberg.org and turn it into a Markov chain:
        root@freebsd:~ # fetch 'https://www.gutenberg.org/ebooks/84.txt.utf-8'
        84.txt.utf-8 438 kB 589 kBps 00s
        root@freebsd:~ # mkdir /usr/local/share/spamd
        root@freebsd:~ # mchain ./84.txt.utf-8 /usr/local/share/spamd/default.markov
        mchain: number of states: 42181 (build-time max: 81920)
        mchain: number of edges: 65106
        mchain: spamd(8) mallocs: 858296 bytes
  - Sample results with spamgen(1):
        root@freebsd:~ # spamgen -k /usr/local/share/spamd/default.markov
  - Test result by running daemon in foreground:
        root@freebsd:~ # spamd -f
        spamd 2171 - - listening on localhost:7180
        spamd 2171 - - listening on localhost:7181
- Configure spamd(8)
  - spamd(8) will try to read configuration data from /usr/local/etc/spamd/spamd.conf or from the file specified with -c on the command line.
  - If the default file does not exist, and if no alternative file is specified, spamd(8) will be configured with default settings.
  - See "Configuration" below.
- Start spamd
  - Enable the spamd(8) service:
        root@freebsd:~ # service spamd enable
        spamd enabled in /etc/rc.conf
        root@freebsd:~ # service spamd start
        Starting spamd.
  - The daemon will log via syslog(3) on the "daemon" facility (check "/var/log/daemon.log" if needed).
        root@freebsd:~ # tail -n2 /var/log/daemon.log
        Dec 8 23:53:23 freebsd spamd[3500]: listening on localhost:7180
        Dec 8 23:53:23 freebsd spamd[3500]: listening on localhost:7181
- Sit and enjoy some spam
  - spamd(8) will serve spam on the spam endpoint (http://localhost:7180 by default) and provide information about the visitors on the info endpoint (http://localhost:7181 by default).
Intended use
The recommended setup consists of forwarding requests from a web server acting
as a reverse proxy to the spam endpoint. spamd(8) honours the
X-Forwarded-For header when determining the IP address of the peer for
statistical purposes, and allows a prefix to be stripped from the request URI
(see spam_ep.uri_prefix below).
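The two reverse-proxy features can be pictured with a short Python sketch. This is only an illustration under stated assumptions, not how spamd(8) is written: `client_address` and `strip_prefix` are hypothetical helpers, taking the first X-Forwarded-For entry is an assumption, and so is leaving a URI untouched when it does not start with the prefix.

```python
def client_address(headers, socket_peer):
    """Prefer the first X-Forwarded-For entry, else the TCP peer address.

    Assumption: the left-most entry is taken as the original client.
    """
    forwarded = headers.get("X-Forwarded-For", "")
    first = forwarded.split(",")[0].strip()
    return first or socket_peer


def strip_prefix(uri, prefix):
    """Remove the configured prefix from the request URI, if present.

    Assumption: a URI that does not start with the prefix is left as-is.
    """
    if prefix and uri.startswith(prefix):
        rest = uri[len(prefix):]
        return rest if rest.startswith("/") else "/" + rest
    return uri


if __name__ == "__main__":
    headers = {"X-Forwarded-For": "203.0.113.7, 10.0.0.1"}
    print(client_address(headers, "127.0.0.1"))
    print(strip_prefix("/spam/some/page", "/spam"))
```

With a proxy that maps /spam/ to the spam endpoint, the daemon would then see the request for "/spam/some/page" as "/some/page", while the statistics still refer to the address of the original visitor rather than to the proxy.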
The purpose of spamd(8) is to mess with greedy AI bots that violate
netiquette. It is therefore highly recommended to list the URI path
leading to the spam endpoint in your robots.txt, so that well-behaved
crawlers stay out of it:
User-agent: *
Disallow: /spam/
Configuration
The configuration file of spamd(8) contains key-value pairs or key-only
toggles, one per line. Empty lines and lines starting with # are treated as
comments. Key-only lines (toggles) are permitted only for settings
of boolean type, and are interpreted as true.
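The rules above can be sketched as a tiny parser in Python. This is an illustration, not the parser shipped with spamd(8): the function name `parse_config` is hypothetical, and the separator between key and value is assumed to be whitespace, which is a guess, so check it against a real configuration file.

```python
def parse_config(text, boolean_keys=("daemon.foreground",)):
    """Parse key-value lines and key-only toggles into a dictionary.

    Assumption: key and value are separated by whitespace.
    """
    settings = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # empty lines and comments carry no setting
        parts = line.split(None, 1)
        key = parts[0]
        if len(parts) == 1:
            # Key-only line: allowed only for boolean settings, means true.
            if key not in boolean_keys:
                raise ValueError(f"line {lineno}: {key} requires a value")
            settings[key] = True
        else:
            settings[key] = parts[1].strip()
    return settings


if __name__ == "__main__":
    sample = "# run in foreground\ndaemon.foreground\n\nspam_ep.bind localhost:7180\n"
    print(parse_config(sample))
```

Values are kept as strings here; converting them to the integer or boolean types listed for each key is left out to keep the sketch short.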
The accepted keys and their meanings are listed below:
- daemon.foreground
  - Tells spamd(8) not to invoke daemon(3).
  - Type: boolean
  - Default: false
- daemon.gid
  - Tells spamd(8) to drop privileges via setgid(2) to the supplied gid. If daemon.gid is not specified, spamd(8) will not try to use setgid(2), but it will still ensure the process is not executed with gid 0.
  - Type: string
  - Default: undefined
- daemon.pidfile
  - Location of the pidfile on the filesystem. The pidfile is created before privileges are dropped, and is therefore not unlinked when the daemon terminates. The init system is responsible for unlinking this file.
  - Type: string
  - Default: /var/run/spamd.pid
- daemon.uid
  - Tells spamd(8) to drop privileges via setuid(2) to the supplied uid. If daemon.uid is not specified, spamd(8) will not try to use setuid(2), but it will still ensure the process is not executed with uid 0.
  - Type: string
  - Default: undefined
- info_ep.backlog
  - TCP backlog of the info endpoint. See listen(2).
  - Type: integer
  - Default: 32
- info_ep.bind
  - Bind address of the info endpoint. See bind(2).
  - Type: string
  - Default: localhost:7181
- spam_ep.backlog
  - TCP backlog of the spam endpoint. See listen(2).
  - Type: integer
  - Default: 32
- spam_ep.bind
  - Bind address of the spam endpoint. See bind(2).
  - Type: string
  - Default: localhost:7180
- spam_ep.max_sentence_len
  - Maximum length of a pseudo-random sentence served on the spam endpoint. Sentences are allowed to be shorter, according to the length of the random walk on the Markov chain.
  - Type: integer
  - Default: 40
- spam_ep.mkvchain
  - File system path of the Markov chain. Markov chain files are constructed via mchain(1).
  - Type: string
  - Default: /usr/local/share/spamd/default.markov
- spam_ep.n_paragraphs
  - Number of paragraphs in each page served by the spam endpoint.
  - Type: integer
  - Default: 3
- spam_ep.n_references
  - Number of outbound links in each page served by the spam endpoint.
  - Type: integer
  - Default: 7
- spam_ep.n_sentences
  - Number of pseudo-random sentences per paragraph served by the spam endpoint.
  - Type: integer
  - Default: 5
- spam_ep.uri_prefix
  - Expected prefix of URIs in the spam endpoint. This option is useful when spamd(8) is made reachable through a reverse proxy that prepends a prefix to each request URI.
  - Type: string
  - Default: undefined
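To picture how spam_ep.n_paragraphs, spam_ep.n_sentences and spam_ep.n_references shape a page, here is a rough Python sketch. It is not the page template used by spamd(8): `render_page` is a hypothetical name, the HTML layout is invented, and the random hexadecimal link targets are only an assumption about how an endless graph of pages could be addressed.

```python
import html
import random


def render_page(sentence, n_paragraphs=3, n_sentences=5, n_references=7, rng=random):
    """Build one page: n_paragraphs paragraphs, then n_references links.

    `sentence` is any callable returning a pseudo-random sentence.
    """
    paragraphs = []
    for _ in range(n_paragraphs):
        body = " ".join(html.escape(sentence()) for _ in range(n_sentences))
        paragraphs.append(f"<p>{body}</p>")
    links = []
    for _ in range(n_references):
        # Hypothetical link scheme: every target is a fresh random path,
        # so each page leads to pages that do not exist until requested.
        target = f"/{rng.getrandbits(64):016x}"
        links.append(f'<a href="{target}">{html.escape(sentence())}</a>')
    return "<html><body>" + "".join(paragraphs) + "".join(links) + "</body></html>"


if __name__ == "__main__":
    print(render_page(lambda: "It was a dreary night.", rng=random.Random(0)))
```

With the default values each page carries 15 sentences of body text and 7 links, so a crawler following every link finds 7 new pages per page visited.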
TODO (higher priority first)
- control panel pages + bake in git version
- reload config file upon SIGHUP
- spamd(8) handle more gracefully missing file (e.g. default page?)
- mchain option to pick output format (dot or binary)
- Add sanitizers to cmake (default ON, possibly OFF, only in Debug build-type)