Running JRPG Character Popularity Contests From the Command-Line

I have come up with a quick way to rank the popularity of a bunch of characters from a popular JRPG franchise using people’s replies to a Reddit post, all from the command-line.

My approach involves extracting all the comment content from the desired Reddit thread (this one happens to be from the thriving Atelier JRPG community subreddit) and counting up the number of times each character’s name is mentioned. Simple, isn’t it?

First, I fetched the HTML document of the Reddit post from an alternative, server-side rendered web frontend for Reddit called teddit. This saved me the trouble of using headless Chrome to interact with the scrape-hostile official Reddit website. The official frontend renders most of the page content via JavaScript and obfuscates DOM attributes, making automated data extraction very difficult.

wget -O reddit-answers.html \
    https://teddit.net/r/Atelier/comments/kk3ux6/who_is_your_favourite_atelier_protagonist/ 

Naturally, fetching the raw comment content from the official Reddit API would be more efficient and elegant, but I really didn’t feel like setting up an entire OAuth client and authentication flow for a one-off data extraction job. Moreover, Reddit recently announced they will begin charging for third-party usage of their APIs soon, so neither of these approaches may work that well in the future anyway.
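As an aside, Reddit threads have long been retrievable as JSON without any OAuth dance by appending .json to the thread URL, though rate limits and the upcoming API pricing may well curtail this. A rough sketch, assuming you have jq installed (this only grabs top-level comments; replies are nested under .data.replies):

# Fetch the thread's JSON representation (a custom User-Agent avoids
# Reddit's blanket blocking of default client UAs) and print the body
# of each top-level comment.
curl -s -A 'popularity-contest/0.1' \
    'https://www.reddit.com/r/Atelier/comments/kk3ux6/who_is_your_favourite_atelier_protagonist/.json' \
    | jq -r '.[1].data.children[].data | select(.body != null) | .body'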

Now we need to extract all the user comments from the HTML document so we can process their contents. We can use XPath to achieve that; I used a CLI XPath processor called xidel for this. It should be readily available in the package managers of most *nix systems.

xidel \
    --html ./reddit-answers.html \
    -e '//div[contains(@class, "comment")]/*/div[@class="body"]/div/p/text()' \
    > reddit-answers.txt

This command will extract all the text from the DOM elements containing user comments and save it to a plaintext file we can now tokenise.

cat reddit-answers.txt \
    | tr -cd "[:alpha:][:space:]-'" \
    | tr ' [:upper:]' '\n[:lower:]' \
    | tr -s '\n' \
    | sed "s/^['-]*//;s/['-]$//" \
    | sort > reddit-answers-tokenised.txt

This command produces a text file with every word on its own line, which makes each word easier to identify and count.
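To illustrate what the pipeline does, here it is applied to a made-up comment:

echo "Ryza, obviously! She's my favourite." \
    | tr -cd "[:alpha:][:space:]-'" \
    | tr ' [:upper:]' '\n[:lower:]' \
    | tr -s '\n' \
    | sed "s/^['-]*//;s/['-]$//" \
    | sort

which prints:

favourite
my
obviously
ryza
she's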

Now that we have the file in a suitable format for counting, it should be plain sailing from here, right? Not quite. At the moment, the file contains every word written in the thread, which means some 95% of the text is irrelevant to us, as we're only interested in the character names. I said I wanted to keep things simple here, so rather than trying to somehow identify all the character names, I will discard all the non-character words from the source text using an English dictionary.

To do this, I employed ripgrep, a faster alternative to the classic text search tool grep (plain grep would work just as well, though). Both tools let you supply a wordlist file listing the specific terms to search for, or exclude from, your input: a simple text document containing newline-separated words. I will use a simple English dictionary wordlist. If you're using Debian or Ubuntu, a dictionary wordlist may already be preinstalled under /usr/share/dict/, so you can just reference that file instead. Debian publishes all of its wordlist packages in its package archive.
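For example, on Debian or Ubuntu, something along these lines should get you a wordlist (wamerican is one of several dictionary packages; siblings like wbritish work too):

# Install an English dictionary wordlist (Debian/Ubuntu)
sudo apt install wamerican

# The package drops its wordlist under /usr/share/dict/
ls /usr/share/dict/

With a wordlist in hand, the filtering step looks like this: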

rg -Niwv \
    --regex-size-limit 800M \
    --dfa-size-limit 1G \
    -f ./wordlist reddit-answers-tokenised.txt \
    > reddit-names.txt

This ripgrep command performs a case-insensitive, whole-word match against the supplied English dictionary file ./wordlist and, because of the -v flag, keeps only the lines that do not match, i.e. the words absent from the dictionary, writing them to reddit-names.txt. I also pass the --regex-size-limit and --dfa-size-limit flags to give ripgrep enough resources to hold the large compiled dictionary pattern in memory. My settings are not overly scientific here, mind you; I have plenty of memory to spare on my 32GB machine, so I just picked large figures to guarantee plenty of headroom.

At this point, we should have a reasonably clean set of words, with most of the noisy, redundant English terms filtered off. Now we just need to run this final command to count up the number of times each word appears, exclude unpopular words (2 or fewer mentions) from the ranking, and display the results in descending numerical order:

cat ./reddit-names.txt \
    | uniq -c \
    | sort -nr \
    | rg -v '^ *[12] '

The uniq -c invocation prefixes each line with a right-aligned occurrence count, sort -nr ranks the lines by that count, and the final rg -v drops any line whose count field is exactly 1 or 2. This is the final output:

16 ryza
13 firis
 8 suelle
 8 lol
 6 meruru
 5 totori
 5 shallie
 4 plachta
 3 rorona
 3 jrpgs
 3 escha

As you can see, it's far from perfect. A few undesired words may sneak in if your dictionary wordlist is not comprehensive, or if the source text contains a lot of jargon, alternative spellings, or misspellings. These should appear infrequently enough in the cleaned dataset not to become an issue.
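If recurring noise like lol or jrpgs bothers you, the easy fix is to append those terms to your wordlist so the ripgrep pass filters them out too:

# Treat these as ordinary dictionary words so rg -v removes them
printf 'lol\njrpgs\n' >> ./wordlist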

If, like me, you'd like to run your own nerdy Reddit rankings, this script puts it all together so that you can produce them with a single command:

#!/usr/bin/env bash
# Arguments: a teddit thread URL and a path to a dictionary wordlist.

TEDDIT_URL="$1"
WORDLIST_PATH="$2"

# Extract the comment text, tokenise it into one lowercase word per line,
# drop dictionary words, then count and rank whatever remains.
xidel \
    --html \
    --data="${TEDDIT_URL}" \
    -e '//div[contains(@class, "comment")]/*/div[@class="body"]/div/p/text()' \
    | tr -cd "[:alpha:][:space:]-'" \
    | tr ' [:upper:]' '\n[:lower:]' \
    | tr -s '\n' \
    | sed "s/^['-]*//;s/['-]$//" \
    | sort \
    | rg -Niwv --regex-size-limit 800M --dfa-size-limit 1G -f "${WORDLIST_PATH}" \
    | uniq -c \
    | sort -nr \
    | rg -v '^ *[12] '
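Save it under any name you like (rank-characters.sh below is just a placeholder), make it executable, and point it at a teddit thread URL and a wordlist:

chmod +x rank-characters.sh
./rank-characters.sh \
    'https://teddit.net/r/Atelier/comments/kk3ux6/who_is_your_favourite_atelier_protagonist/' \
    ./wordlist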

