Add word frequency analysis

2024-07-06 23:53:36 -04:00
parent 572d05471d
commit 512453d748
3 changed files with 2742 additions and 6 deletions
--- a/README.md
+++ b/README.md
@@ -19,14 +19,72 @@ As soon as I read this, I had to know more.

 `lang.dat.uk` and `lang.dat.us` are the original subtitle files for the UK and US releases, respectively. These are copyright Gremlin Interactive and included here for research purposes.

-`diffscript.py` is a simple Python script that examines the `lang.dat` files and produces two text files: `all-subtitles-uk-us.txt` and `diff-subtitles-uk-us.txt`. 
+`diffscript.py` is a simple Python script that examines the `lang.dat` files and produces three text files: `all-subtitles-uk-us.txt`, `diff-subtitles-uk-us.txt`, and `word-frequency-analysis.txt`.

-`all-subtitles-uk-us.txt` contains a full human-readable dump of the two subtitle files, with strings at the same index place next to one another. `diff-subtitles-uk-us.txt` contains only the strings that are different between the two files, sorted by a "difference score" - the biggest changes are at the beginning of the file.
+`all-subtitles-uk-us.txt` contains a full human-readable dump of the two subtitle files, with strings at the same index place next to one another. 
+
+`diff-subtitles-uk-us.txt` contains only the strings that are different between the two files, sorted by a "difference score" - the biggest changes are at the beginning of the file.
+
+`word-frequency-analysis.txt` lists how many times each word was added to or removed from a line. This allows us to see that, for example, the word "gnarly" was added to two lines in the US release, but removed from two other lines.

 ## Preliminary Findings

 The writer in both releases is credited as [Ade Carless](https://www.mobygames.com/person/5209/adrian-carless/). However, the US release has an additional credit for "Additional Script Writing", attributed to [Dennis M. Miller](https://www.mobygames.com/person/183890/dennis-m-miller/), who is also credited as the US producer. This suggests to me that he is the person primarily responsible for the changes to the US script. If anyone wanted to interview someone to get the full scoop, he would be the guy.

+`word-frequency-analysis.txt` has some interesting data that points to how Kent's character was changed. Some words that were inserted into the US script:
+
+* yeah: 122 times
+* whoa: 118 times
+* dude: 51 times
+* hey: 37 times
+* ha: 30 times
+* cool: 29 times
+* check: 23 times (generally appearing as "check it" or "check it out")
+* damn: 15 times
+* totally: 14 times
+* [bitchen](#bitchen): 14 times
+* buddy: 10 times
+* dudes: 9 times
+* groovy: 7 times
+* babe: 7 times
+* yo: 6 times
+* duh: 6 times
+* grodey: 5 times
+* righteous: 4 times
+* city: 4 times ("Singe City", "Lobotomy city")
+* righty: 3 times
+* radical: 3 times
+* [rama](#kent-o-rama): 3 times
+* dweebs: 3 times
+* bummer: 3 times
+* snot: 2 times
+* [poo](#poopoo): 2 times
+* guts: 2 times
+* gnarly: 2 times
+* [einstein](#einstein): 2 times
+* [clown](#bozo-the-clown): 2 times
+* [butt](#butt-bongo-fiesta): 2 times
+* schwing: 1 time
+* freakazoids: 1 time
+* baboom: 1 time
+
+Here are some interesting words that were removed from the UK script:
+
+* sheesh: 12 times
+* wow: 9 times
+* fancy: 5 times
+* yeah: 5 times
+* whoa: 2 times
+* whoah: 2 times
+* dude: 2 times
+* manjana: 2 times
+* ballcock: 1 time
+* boredsville: 1 time
+* farty: 1 time ("arty farty" replaced with "artsy fartsy")
+* haberdashery: 1 time
+* scareymongeroos: 1 time
+* schwarzenegger: 1 time (replaced with "Stallone")
+
 Scrolling through `diff-subtitles-uk-us.txt`, some themes emerge:

 ### Less British
@@ -75,6 +133,14 @@ _[2813] (diff score: 108)_
 **UK:** Sheesh!! A triple CD compilation of train noises! English import, from the land of shopkeepers and eccentric Lords.  
 **US:** This Blows!! A triple CD compilation of train noises! Who in their right mind would buy this?? ...only an engineer.  

+_[4290] (diff score: 85)_  
+**UK:** I can see the city out there in all its polluted splendour. Surely they could do something to brighten the place up. Something along the lines of the Great Fire of London perhaps.  
+**US:** I can see the city out there in all its polluted splendor. Sure they could  do something to brighten the place up. Possibly, make a city wide bonfire.
+
+_[5929] (diff score: 17)_  
+**UK:** Whizzz... Wheeee...  
+**US:** Snore, snore.  
+

 ### More Attitude

@@ -114,6 +180,10 @@ _[202] (diff score: 130)_
 **UK:** Hey, it's Barbara Barbie! I remember watching her on 'Minor-Celebrity Triangles'!  
 **US:** Hey, it's Barbara Barbie! What a hot chick. She was always a guest on that stupid show "Love Cruise." That was one of those Miss Spelling productions.  

+_[2370] (diff score: 26)_  
+**UK:** With two people sitting on it? Yeah, GREAT idea!  
+**US:** With two people sitting on it? Yeah, GREAT idea! You must be...<a id="einstein"></a>Einstein?? All righty.
+
 _[5366] (diff score: 70)_  
 **UK:** It's an intangible lattice of light, man! How can I pick it up?  
 **US:** Catch a clue, dude! It's a ray of light! How could I pick a ray of light up? It's just not possible.... Get a brain!  
@@ -207,6 +277,18 @@ _[3014] (diff score: 29)_
 **UK:** He can be a diversion. I just need to get him to bark.  
 **US:** He can be a diversion. I just need to get him to bark. Bark! Ruff...Ruff...that's what you're supposed to say...Ruff!  

+_[2631] (diff score: 40)_  
+**UK:** She looks very very bored.  
+**US:** She looks very bored. Time to wake up! Rise and Shine my little...puckadee!
+
+_[4329] (diff score: 43)_  
+**UK:** This is a family game. Plus, I'm not sure what it does.  
+**US:** Whoa! Hold on a second there! This is not <a id="butt-bongo-fiesta"></a>butt bongo fiesta! This is a family game. Plus, I'm not exactly sure what it does.
+
+_[5552] (diff score: 46)_  
+**UK:** Looking ahead, I don't see any use for it.  
+**US:** Call me Nuts.....<a id="bozo-the-clown"></a>Bozo the clown!...but I just don't see any use for it at all.
+

 ### Other People's Jokes Are Also The Soul Of Wit  

@@ -219,12 +301,16 @@ _[4271] (diff score: 15)_

 _[6172] (diff score: 47)_  
 **UK:** Someone let a dog do doo-doo, dude.  
-**US:** Oh, Grodey! Somebody let their dog take a big a'poo-poo.  
+**US:** Oh, Grodey! Somebody let their dog take a big <a id="poopoo"></a>a'poo-poo.  

 _[2400]_  
 **UK:** I'm too shy to date skeletons!  
 **US:** Yeah, right?! Like I could use a bunch of bones?!  

+_[5459] (diff score: 33)_  
+**UK:** What would I do with a bust? No, not THAT kind of a bust!! Behave!  
+**US:** What would I do with a bust? Wait! What am I saying?
+

 ### Just Needs More Words  

@@ -295,6 +381,10 @@ _[1318] (diff score: 62)_
 **UK:** Hey, respect the dead!  
 **US:** Hey, it's the King man! I dig that! I believe in him. I believe he's alive. I believe in those sightings!  

+_[5760] (diff score: 94)_  
+**UK:** Look, Brian...  
+**US:** Oh...God Forbid...I'm So sorry! Look, Brian, has all that heavy, loud, heavy metal music damaged the manner portions of your brain?!! ...Head banger!
+

 ### Kent-o-rama  

--- a/diffscript.py
+++ b/diffscript.py
@@ -36,8 +36,11 @@ def dumplines(filename, lines1, lines2):
        for line1, line2, lineno in zip(lines1, lines2, range(len(lines1))):
            f.write(f"[{lineno}]\nUK: {line1}\nUS: {line2}\n")

+def words_from_line(line):
+    return re.split(r"\W", line.lower())
+
 def diffscore(line1, line2):
-    diffset = set(re.split(r"\W", line1.lower())) ^ set(re.split(r"\W", line2.lower()))
+    diffset = set(words_from_line(line1)) ^ set(words_from_line(line2))
    return sum(len(word) for word in diffset)

 def scorelines(filename, lines1, lines2):
@@ -48,5 +51,52 @@ def scorelines(filename, lines1, lines2):
            if lineno not in IGNORED_LINES:
                f.write(f"[{lineno}] (diff score: {score})\nUK: {line1}\nUS: {line2}\n")

-dumplines("all-subtitles-uk-us.txt", parsedat("lang.dat.uk"), parsedat("lang.dat.us"))
-scorelines("diff-subtitles-uk-us.txt", parsedat("lang.dat.uk"), parsedat("lang.dat.us"))
+class WordCount:
+    def __init__(self):
+        self.words = {}
+    
+    def ensure(self, word):
+        if word not in self.words:
+            self.words[word] = 0
+
+    def add(self, word, count=1):
+        self.ensure(word)
+        self.words[word] += count
+    
+    def remove(self, word):
+        self.add(word, count=-1)
+    
+    def merge(self, wordcount, pred):
+        for word, count in wordcount.words.items():
+            if pred(count):
+                self.add(word, count)
+    
+    def sortedTuples(self, pred, descending=False):
+        return sorted(((count, word) for word, count in self.words.items() if pred(count, word)), reverse=descending)
+
+def wordfrequencies(filename, lines1, lines2):
+    removedWords = WordCount()
+    addedWords = WordCount()
+
+    for line1, line2 in zip(lines1, lines2):
+        linecount = WordCount()
+        for word in words_from_line(line1):
+            linecount.remove(word)
+        for word in words_from_line(line2):
+            linecount.add(word)
+        removedWords.merge(linecount, lambda c: c < 0)
+        addedWords.merge(linecount, lambda c: c > 0)
+    
+    with open(filename, "wt") as f:
+        f.write("Words added to lines in US version:\n")
+        for count, word in addedWords.sortedTuples(lambda c, w: len(w) > 1 and c > 0, descending=True):
+            f.write(f"{word}: {count}\n")
+        f.write("\nWords removed from lines in US version:\n")
+        for count, word in removedWords.sortedTuples(lambda c, w: len(w) > 1 and c < 0):
+            f.write(f"{word}: {count}\n")
+
+uk_lines = parsedat("lang.dat.uk")
+us_lines = parsedat("lang.dat.us")
+dumplines("all-subtitles-uk-us.txt", uk_lines, us_lines)
+scorelines("diff-subtitles-uk-us.txt", uk_lines, us_lines)
+wordfrequencies("word-frequency-analysis.txt", uk_lines, us_lines)
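
Review note on `wordfrequencies`: the per-line netting done by `WordCount` above could also be sketched with `collections.Counter`, which may make the behaviour easier to follow. This is a minimal sketch under stated assumptions, not the code in this commit. `net_changes` and `tally` are hypothetical names. The two sample pairs are lines [2370] and [5459] as quoted in the README, with the HTML anchor stripped. Unlike the commit, which writes removed counts as negative numbers, this sketch reports them as positive magnitudes.

```python
import re
from collections import Counter


def words_from_line(line):
    # Same tokenisation as the commit: lowercase, split on non-word characters.
    # This also yields empty strings wherever punctuation is adjacent.
    return re.split(r"\W", line.lower())


def net_changes(uk_line, us_line):
    """Return (added, removed) Counters for one UK/US line pair (hypothetical helper)."""
    uk = Counter(words_from_line(uk_line))
    us = Counter(words_from_line(us_line))
    # Counter subtraction keeps only positive results, so a word that appears
    # equally often in both versions of the line cancels out.
    return us - uk, uk - us


def tally(pairs):
    """Accumulate per-line net changes across all line pairs (hypothetical helper)."""
    added, removed = Counter(), Counter()
    for uk_line, us_line in pairs:
        line_added, line_removed = net_changes(uk_line, us_line)
        added.update(line_added)
        removed.update(line_removed)

    # Mirror the commit's len(w) > 1 filter, which drops the empty strings
    # produced by re.split as well as single-letter tokens.
    def keep(counter):
        return Counter({w: n for w, n in counter.items() if len(w) > 1})

    return keep(added), keep(removed)


# Sample data: two line pairs quoted from the README (anchor tag stripped).
pairs = [
    ("With two people sitting on it? Yeah, GREAT idea!",
     "With two people sitting on it? Yeah, GREAT idea! You must be...Einstein?? All righty."),
    ("What would I do with a bust? No, not THAT kind of a bust!! Behave!",
     "What would I do with a bust? Wait! What am I saying?"),
]

added, removed = tally(pairs)
for word, count in added.most_common():
    print(f"added   {word}: {count}")
for word, count in removed.most_common():
    print(f"removed {word}: {count}")
```

Because changes are netted per line before being accumulated, a word such as "gnarly" can appear in both the added and removed totals when it is inserted into some lines and dropped from others.
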
--- a/word-frequency-analysis.txt
+++ b/word-frequency-analysis.txt