
Supercazzola - Generate spam for web scrapers

Source: Lobsters: Newest Stories
Published: February 13, 2026 at 3:49 PM

Supercazzola is my own scraper tar pit, designed to dynamically generate an endless graph of web pages.

I wrote it to poison web crawlers that ignore my robots.txt.

This software requires cmake, pkg-config and libevent >= 2 as dependencies. It has been tested to work under GNU/Linux and FreeBSD.

License

3-Clause BSD License. See COPYING.txt

Binaries

  • mchain(1) - Compile a Markov chain from one or more text files

  • spamgen(1) - Generate random sentences out of a compiled Markov chain

  • spamd(8) - Web daemon generating random HTML pages out of a compiled Markov chain
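
To make the split between mchain(1) and spamgen(1) more concrete, here is a minimal Python sketch of the general technique: build a Markov chain from text, then produce a sentence with a random walk over it. This is not Supercazzola's implementation. The word-level, first-order chain, the whitespace tokenisation and the names build_chain and random_sentence are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words observed right after it."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def random_sentence(chain, max_len, rng=random):
    """Random walk over the chain, at most max_len words long.

    The walk stops early when it reaches a word with no known successor.
    """
    if not chain or max_len <= 0:
        return ""
    word = rng.choice(sorted(chain))
    sentence = [word]
    while len(sentence) < max_len and word in chain:
        word = rng.choice(chain[word])
        sentence.append(word)
    return " ".join(sentence)


if __name__ == "__main__":
    corpus = "the cat sat on the mat and the dog sat on the cat"
    print(random_sentence(build_chain(corpus), max_len=40))
```

Repeated successors are kept in the lists on purpose, so that frequent word pairs are picked more often during the walk.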

How-To

The following instructions describe provisioning and installation on FreeBSD, but they can easily be adapted to other operating systems (e.g. GNU/Linux).

  1. Build the software

    • Install pkg-config, libevent2 and cmake:

      root@freebsd:~ # pkg install -y devel/pkgconf devel/libevent devel/cmake-core
    • Unpack, build and install the package:

      root@freebsd:~ # tar -xzf ./supercazzola-*.tar.gz
      root@freebsd:~ # cmake -S ./supercazzola-*/ -B ./build
      root@freebsd:~ # cmake --build ./build
      root@freebsd:~ # cmake --install ./build
  2. Create and install a Markov chain

    • Get some long text, e.g. Frankenstein from Gutenberg.org, and turn it into a Markov chain:

      root@freebsd:~ # fetch 'https://www.gutenberg.org/ebooks/84.txt.utf-8'
      84.txt.utf-8                                           438 kB  589 kBps    00s
      root@freebsd:~ # mkdir /usr/local/share/spamd
      root@freebsd:~ # mchain ./84.txt.utf-8 /usr/local/share/spamd/default.markov
      mchain: number of states:  42181 (build-time max: 81920)
      mchain: number of edges:   65106
      mchain: spamd(8) mallocs:  858296 bytes
    • Sample the results with spamgen(1):

      root@freebsd:~ # spamgen -k /usr/local/share/spamd/default.markov
    • Test the result by running the daemon in the foreground:

      root@freebsd:~ # spamd -f
      spamd 2171 - - listening on localhost:7180
      spamd 2171 - - listening on localhost:7181
  3. Configure spamd(8)

    spamd(8) will try to read configuration data from /usr/local/etc/spamd/spamd.conf or from the file specified with -c on the command line.

    If the default file does not exist, and if no alternative file is specified, spamd(8) will be configured with default settings.

    See "Configuration" below.

  4. Start spamd

    • Enable the spamd(8) service

      root@freebsd:~ # service spamd enable
      spamd enabled in /etc/rc.conf
      root@freebsd:~ # service spamd start
      Starting spamd.
    • The daemon will log via syslog(3) on the "daemon" facility (check "/var/log/daemon.log" if needed).

      root@freebsd:~ # tail -n2 /var/log/daemon.log
      Dec  8 23:53:23 freebsd spamd[3500]: listening on localhost:7180
      Dec  8 23:53:23 freebsd spamd[3500]: listening on localhost:7181
  5. Sit back and enjoy some spam

    spamd(8) will serve spam on the spam endpoint (http://localhost:7180 by default) and provide information about the visitors on the info endpoint (http://localhost:7181 by default).

Intended use

The recommended setup is to forward requests from a web server acting as a reverse proxy to the spam endpoint. spamd(8) supports the X-Forwarded-For header when determining the IP address of the peer for statistical purposes, and lets you specify a prefix to strip from the request URI (see spam_ep.uri_prefix below).
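
The two proxy-related behaviours described above can be sketched in Python as follows. This only illustrates the idea and is not taken from spamd(8): the function names are hypothetical, and two details are assumptions, namely that the first X-Forwarded-For entry is taken as the client and that a URI without the expected prefix is rejected.

```python
def strip_uri_prefix(uri, prefix):
    """Return uri without the leading prefix, or None if it does not match.

    Assumption: a request lacking the expected prefix is rejected (None).
    """
    if not prefix:
        return uri
    if not uri.startswith(prefix):
        return None
    rest = uri[len(prefix):]
    # Keep the result an absolute path.
    return rest if rest.startswith("/") else "/" + rest


def peer_address(headers, socket_addr):
    """Pick the client address to use for statistics.

    Assumption: the first X-Forwarded-For entry is the original client.
    For simplicity, headers is a plain dict with exact-case keys.
    """
    forwarded = headers.get("X-Forwarded-For", "")
    first = forwarded.split(",")[0].strip()
    return first or socket_addr


if __name__ == "__main__":
    print(strip_uri_prefix("/spam/some/page", "/spam"))
    print(peer_address({"X-Forwarded-For": "203.0.113.7, 10.0.0.1"}, "127.0.0.1"))
```

Without the forwarded header, every request would appear to come from the proxy itself, which is why the header matters for per-visitor statistics.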

The purpose of spamd(8) is to mess with greedy AI bots that violate netiquette. It is therefore highly recommended to list the URI path leading to the spam endpoint in your robots.txt, so that well-behaved crawlers stay away from it:

User-agent: *
Disallow: /spam/
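
One way to check that these rules keep compliant crawlers out of the tar pit is Python's standard urllib.robotparser module. The sketch below feeds it the rules above; the user agent name ExampleBot and the example.org URLs are placeholders, not part of Supercazzola.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /spam/
"""


def allowed(robots_txt, user_agent, url):
    """Tell whether a crawler honouring robots_txt may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Placeholder crawler name and URLs, used only for illustration.
    for url in ("https://example.org/spam/page", "https://example.org/blog/"):
        print(url, allowed(ROBOTS_TXT, "ExampleBot", url))
```

A crawler that respects the file never reaches the spam endpoint, so only those ignoring it end up in the generated page graph.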

Configuration

The configuration file of spamd(8) contains key-value pairs or key-only toggles, one per line. Empty lines and lines starting with # are ignored. Key-only lines (toggles) are permitted only for boolean settings, and are interpreted as true.

The accepted keys and their meanings are listed below:
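
The rules above can be sketched as a small Python parser. It is only an illustration and not the parser used by spamd(8). In particular, the separator between key and value is not stated here, so the sketch assumes whitespace; the function name parse_conf is hypothetical, and values are left as strings.

```python
# Keys documented with boolean type; only these may appear as toggles.
BOOLEAN_KEYS = {"daemon.foreground"}


def parse_conf(text):
    """Parse key-value pairs and key-only toggles, one per line.

    Assumption: key and value are separated by whitespace.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # empty lines and comment lines are skipped
        parts = line.split(None, 1)
        key = parts[0]
        if len(parts) == 2:
            settings[key] = parts[1].strip()
        elif key in BOOLEAN_KEYS:
            settings[key] = True  # a key-only line means "true"
        else:
            raise ValueError("toggle not allowed for non-boolean key: " + key)
    return settings


if __name__ == "__main__":
    sample = "# run in foreground\ndaemon.foreground\n\nspam_ep.n_paragraphs 4\n"
    print(parse_conf(sample))
```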

  • daemon.foreground

    Tells spamd(8) not to invoke daemon(3)

    • Type: boolean

    • Default: false

  • daemon.gid

    Tells spamd(8) to drop privileges via setgid(2) to the supplied gid. If daemon.gid is not specified, spamd(8) will not try to use setgid(2), but it will still ensure the process is not executed with gid 0.

    • Type: string

    • Default: undefined

  • daemon.pidfile

    Location on the filesystem of the pidfile. The pidfile is created before privileges are dropped, so the daemon cannot unlink it when it terminates. The init system is responsible for unlinking this file.

    • Type: string

    • Default: /var/run/spamd.pid

  • daemon.uid

    Tells spamd(8) to drop privileges via setuid(2) to the supplied uid. If daemon.uid is not specified, spamd(8) will not try to use setuid(2), but it will still ensure the process is not executed with uid 0.

    • Type: string

    • Default: undefined

  • info_ep.backlog

    TCP backlog of the info endpoint. See listen(2).

    • Type: integer

    • Default: 32

  • info_ep.bind

    Bind address of the info endpoint. See bind(2).

    • Type: string

    • Default: localhost:7181

  • spam_ep.backlog

    TCP backlog of the spam endpoint. See listen(2).

    • Type: integer

    • Default: 32

  • spam_ep.bind

    Bind address of the spam endpoint. See bind(2).

    • Type: string

    • Default: localhost:7180

  • spam_ep.max_sentence_len

    Maximum length of a pseudo-random sentence served on the spam endpoint. Sentences are allowed to be shorter, according to the length of the random walk on the Markov chain.

    • Type: integer

    • Default: 40

  • spam_ep.mkvchain

    File system path of the Markov chain. Markov chain files are constructed via mchain(1).

    • Type: string

    • Default: /usr/local/share/spamd/default.markov

  • spam_ep.n_paragraphs

    Number of paragraphs in each page served by the spam endpoint.

    • Type: integer

    • Default: 3

  • spam_ep.n_references

    Number of outbound links in each page served by the spam endpoint.

    • Type: integer

    • Default: 7

  • spam_ep.n_sentences

    Number of pseudo-random sentences per paragraph served by the spam endpoint.

    • Type: integer

    • Default: 5

  • spam_ep.uri_prefix

    Expected prefix of URIs in the spam endpoint. This option is useful when spamd(8) is reached through a reverse proxy that prepends a prefix to each request URI.

    • Type: string

    • Default: undefined
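
To show how the spam_ep.n_paragraphs, spam_ep.n_sentences and spam_ep.n_references settings shape a page, here is a rough Python sketch using their default values. The HTML layout, the random hexadecimal link names, the use of the URI prefix in links and the function name build_page are all assumptions for illustration; the pages actually served by spamd(8) may look different.

```python
import random


def build_page(make_sentence, n_paragraphs=3, n_sentences=5,
               n_references=7, uri_prefix="", rng=random):
    """Assemble one HTML page from generated sentences and random links."""
    paragraphs = [
        "<p>" + " ".join(make_sentence() for _ in range(n_sentences)) + "</p>"
        for _ in range(n_paragraphs)
    ]
    # Hypothetical link scheme: random hex identifiers under uri_prefix.
    links = [
        '<a href="{}/{:08x}">more</a>'.format(uri_prefix, rng.getrandbits(32))
        for _ in range(n_references)
    ]
    body = "\n".join(paragraphs + links)
    return "<html><body>\n" + body + "\n</body></html>"


if __name__ == "__main__":
    # A fixed sentence stands in for a random walk on the Markov chain.
    print(build_page(lambda: "Lorem ipsum dolor.", uri_prefix="/spam"))
```

Since every page links to several further generated pages, a crawler that follows the links never runs out of pages to fetch.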

TODO (higher priority first)

  • control panel pages + bake in git version

  • reload config file upon SIGHUP

  • spamd(8): handle a missing file more gracefully (e.g. default page?)

  • mchain option to pick output format (dot or binary)

  • Add sanitizers to cmake (default ON, possibly OFF, only in Debug build-type)