Running JRPG Character Popularity Contests From the Command Line

I have come up with a quick way to rank the popularity of a bunch of characters from a popular JRPG franchise using people’s replies to a Reddit post, all from the command line.

My approach involves extracting all the comment content from the desired Reddit thread (this one happens to be from the thriving Atelier JRPG community subreddit) and counting up the number of times each character’s name is mentioned. Simple, isn’t it?

First, I fetched the HTML document of the Reddit post from an alternative, server-side rendered web frontend for Reddit called teddit. This saved me the trouble of using headless Chrome to interact with the scrape-hostile official Reddit website. The official frontend renders most of the page content via JavaScript and obfuscates DOM attributes, making automated data extraction very difficult.

wget -O reddit-answers.html \ 

Naturally, fetching the raw comment content from the official Reddit API would be more efficient and elegant, but I really didn’t feel like setting up an entire OAuth client and authentication flow for a one-off data extraction job. Moreover, Reddit recently announced they will begin charging for third-party usage of their APIs soon, so neither of these approaches may work that well in the future anyway.

Now we need to scrape all the user comments off the HTML document to process the contents. We can use XPath to achieve that. I used a CLI XPath processor called xidel for this. It should be readily available through the package managers of most *nix systems.

xidel \
    --html ./reddit-answers.html \
    -e '//div[contains(@class, "comment")]/*/div[@class="body"]/div/p/text()' \
    > reddit-answers.txt

This command will extract all the text from the DOM elements containing user comments and save it to a plaintext file we can now tokenise.

cat reddit-answers.txt \
    | tr -cd "[:alpha:][:space:]-'" \
    | tr ' [:upper:]' '\n[:lower:]' \
    | tr -s '\n' \
    | sed "s/^['-]*//;s/['-]*$//" \
    | sort > reddit-answers-tokenised.txt

This command produces a text file where every word appears on a new line. This makes it easier to identify and count up each word.
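As a quick sanity check, the tokenising steps can be tried on a single made-up comment. The sample sentence below is an assumption for illustration only and is not taken from the thread:

```shell
# Hypothetical sample comment, used only to illustrate the tokenising step.
sample="Ryza is great, but Firis' story is better!"

tokens=$(printf '%s\n' "$sample" \
    | tr -cd "[:alpha:][:space:]-'" \
    | tr ' [:upper:]' '\n[:lower:]' \
    | tr -s '\n' \
    | sed "s/^['-]*//;s/['-]*$//" \
    | sort)

# One lowercase word per line, sorted, with punctuation stripped.
printf '%s\n' "$tokens"
```

The punctuation disappears, the trailing apostrophe in "Firis'" is stripped, and repeated words such as "is" end up next to each other, which is what the counting step later relies on.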

Now that we have the file in a suitable format for counting, it should be plain sailing from here, right? Not quite. At the moment, the file contains every word written on the thread, which means the vast majority of the text is irrelevant to us as we’re just interested in the character names. Now I said I wanted to keep things simple here, so rather than trying to somehow identify all the character names, I will discard all the non-character words in the source text using an English dictionary.

To do this, I employed ripgrep, which is a faster alternative to the text search tool grep. grep would work just as well though. Both of these tools allow you to use a wordlist document to supply the list of specific terms to search or exclude in your input. This is a simple text document containing a list of newline-separated words. I will use a simple English dictionary wordlist. If you’re using Debian or Ubuntu, there may be a dictionary wordlist preinstalled in /usr/share/dict/ so you can just reference that file instead. Debian publish all of their wordlist packages here.

rg -Niwv \
    --regex-size-limit 800M \
    --dfa-size-limit 1G \
    -f ./wordlist reddit-answers-tokenised.txt \
    > reddit-names.txt

This ripgrep command matches each line of the tokenised file against the supplied English dictionary file ./wordlist, case-insensitively (-i) and on whole words only (-w). Because the match is inverted (-v), only the tokens that do not appear in the dictionary are written to reddit-names.txt, and -N keeps line numbers out of the output. I also pass in the --regex-size-limit and --dfa-size-limit flags, which raise ripgrep’s limits on the size of the compiled regex and of its DFA cache, since a whole dictionary’s worth of patterns can exceed the defaults. My settings are not overly scientific here, mind you. I have plenty of memory to spare on my 32GB RAM machine, so I just use large figures to guarantee memory will be plenty.
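For readers without ripgrep, here is a minimal sketch of the same filtering idea using plain grep. The tiny wordlist and token list are made up for illustration, and the -F flag (treat every dictionary entry as a fixed string) is a choice of this sketch rather than something the ripgrep command does:

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)

# Hypothetical miniature dictionary and tokenised input.
printf '%s\n' is great but story better > "$workdir/wordlist"
printf '%s\n' better but firis great is is ryza story > "$workdir/tokens.txt"

# -v inverts the match, -i ignores case, -w matches whole words only,
# -F reads the patterns as fixed strings, -f loads them from the wordlist.
names=$(grep -v -i -w -F -f "$workdir/wordlist" "$workdir/tokens.txt")

printf '%s\n' "$names"
rm -r "$workdir"
```

Only the tokens missing from the dictionary survive, which in this toy input are the two character names.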

At this point, we should have a reasonably clean set of words, with most of the noisy, redundant English terms filtered out. Now we just need to run this final command to count up the number of times each word appears, exclude unpopular words (2 or fewer mentions) from the ranking, and display the results in descending numerical order:

cat ./reddit-names.txt \
    | uniq -c \
    | sort -nr \
    | rg -v '^ *[12] '

This is the final output:

16 ryza
13 firis
 8 suelle
 8 lol
 6 meruru
 5 totori
 5 shallie
 4 plachta
 3 rorona
 3 jrpgs
 3 escha

As you can see, it’s far from perfect. A few undesired words may be counted in if your dictionary wordlist is not comprehensive, there is a lot of jargon, or there are many alternative or incorrect spellings in the source text. These should appear infrequently enough on the clean dataset so as not to become an issue.
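If stragglers like these do get in the way, one option is to filter a small hand-written stoplist before counting. The sketch below is only an illustration: the stoplist, the token list and the file names are all made up, and it uses awk for the threshold as an alternative to the regex filter used above:

```shell
workdir=$(mktemp -d)

# Hypothetical leftovers after the dictionary filter, already sorted.
printf '%s\n' firis firis firis lol lol lol ryza ryza ryza ryza sophie \
    > "$workdir/names.txt"
# Hypothetical extra terms that the dictionary did not cover.
printf '%s\n' lol jrpgs > "$workdir/stoplist"

# Drop stoplisted lines (-x matches whole lines), count adjacent duplicates,
# sort by count, and keep only entries with more than two mentions.
ranking=$(grep -v -i -x -F -f "$workdir/stoplist" "$workdir/names.txt" \
    | uniq -c \
    | sort -nr \
    | awk '$1 > 2 { print $1, $2 }')

printf '%s\n' "$ranking"
rm -r "$workdir"
```

Comparing the count numerically in awk makes the cut-off easy to change, whereas the regex version needs rewriting once the threshold reaches double digits.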

If you’d like to run your own nerdy rankings on Reddit like me, this script puts it all together so that you can produce rankings with a single command:

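#!/usr/bin/env bash

# Assumption: TEDDIT_URL and WORDLIST_PATH are taken from the
# command-line arguments, as they are not defined anywhere else.
TEDDIT_URL="${1:?usage: $0 TEDDIT_URL WORDLIST_PATH}"
WORDLIST_PATH="${2:?usage: $0 TEDDIT_URL WORDLIST_PATH}"
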
xidel \
    --html \
    --data="${TEDDIT_URL}" \
    -e '//div[contains(@class, "comment")]/*/div[@class="body"]/div/p/text()' \
    | tr -cd "[:alpha:][:space:]-'" \
    | tr ' [:upper:]' '\n[:lower:]' \
    | tr -s '\n' \
    | sed "s/^['-]*//;s/['-]*$//" \
    | sort \
    | rg -Niwv --regex-size-limit 800M --dfa-size-limit 1G -f "${WORDLIST_PATH}" \
    | uniq -c \
    | sort -nr \
    | rg -v '^ *[12] '