Notes from a Linguistic Mystic

I’ve been too busy teaching, dissertating, and applying for academic jobs (hire me!) to post regularly, but I figured I’d post a quick update and respond to current events.

I’ve been running the Yosemite beta for a month or so now, and have had no problems at all. The OS is good, my previous tutorial on using IPA fonts with Mac OS X still works, my P2FA install guide still works, and Praat is good-to-go.

Putting aside the strangeness of 10.10 not being the same thing as “10.1” or just “11”, Mac OS X 10.10 “Yosemite” gets my seal of approval.

~ ə ~

I’ve just learned a wonderful term for a common state of mind: “precarious manhood”.

The authors report 5 studies that demonstrate that manhood, in contrast to womanhood, is seen as a precarious state requiring continual social proof and validation. Because of this precariousness, they argue that men feel especially threatened by challenges to their masculinity. (from this paper)

This term (and idea), first publicized by J.A. Vandello, describes the feeling that your very masculinity is always under threat, and that, much like the cooties of yore, one is liable to catch femininity from any association with the feminine (or effeminate).

I just happened upon a wonderful example of this fear in action. Below are two small pocketknives, both by Spyderco, a Colorado knifemaker:

Image from ‘dyril’ on EDCForums

They are both made of the same materials. They’re about the same size (although the bottom one is marginally bigger). They have the same function, mechanism, and overall design (although again, the bottom is slightly wider). What’s the difference, you may ask?

The top knife is a popular model from Spyderco, the “Ladybug”. The bottom is a recent addition to the lineup, called the “Manbug”.

What is a “manbug”? Google returns no results except for the knife, so it’s clearly not an insect. Well, as explained on Spyderco’s website for the Manbug:

Anyone who has ever used a LadyBug knows that it cuts with an authority far beyond its size. However, as part of Spyderco’s quest for C.Q.I. (Constant Quality Improvement), Glesser decided to make it even better. And, for anyone whose male ego made it difficult for them to carry a “LadyBug,” he also made life easier.

So, apparently, although the “Ladybug” cuts many things well, what it’s most effective at cutting is precarious masculinity, and that alone was worth creating a new product line.

I generally don’t go into gender dynamics in language, as I lack the sociological chops that some of my colleagues have (and which seem required to write about social topics without hurting yourself). But reading about the “manbug” on the same day I learned about “precarious masculinity”, I just couldn’t help myself.

~ ə ~

One of the advantages to being a linguist is that I get to see some of the strange things people do when they need “other language”, but don’t expect people to understand.

One of my favorite video games from childhood was GoldenEye 007 for the Nintendo 64. In the game, which is based (very loosely!) on the 1995 James Bond film “GoldenEye”, there’s a level set in a Soviet archive building. All throughout there are boxes, marked in big red type with “картофель”, pronounced “car-TOE-fell” ([kɐrˈtofʲɪlʲ]).

When I was a kid, playing the game, I always assumed it meant “records”, or maybe even something cool, like “nuclear weapons” or “spies who crossed the KGB”!

Nope. картофель is the Russian word for ‘potato’. Just nice.

~ ə ~

As part of my dissertation, I’m having to record a large number of subjects and do analyses on their speech. The biggest problem with doing that is that in order to do the analyses automatically, you need to time-align the words, creating files which tell your analysis software (in this case, Praat) where each sentence/word/sound starts and ends.

The fastest way to do this automatically is using what’s called “forced alignment”, and the current best forced aligner for English for phonetic use is the Penn Phonetics Lab Forced Aligner. In this post, I’ll describe how I got it working on my Mac running Mavericks (10.9), in a step-by-step sort of way.

There are four basic steps involved:

  1. Install HTK (the hard part!)
  2. Install the Penn Phonetics Lab Forced Aligner (henceforth P2FA)
  3. Install SoX (which is required by P2FA)
  4. Set it up for your data and run it to get aligned TextGrids


This post is up as a public service. I’ve done my absolute best to be comprehensive and clear, but your system/install/issue may vary, and they might update any of these tools at any time, and this post may not change when they do. I’m also mid-dissertation, so I’m unable to offer personal assistance setting up P2FA to commenters or by email.

Feel free to leave a comment if you have a question or issue, and maybe somebody can help, but nothing’s guaranteed. In short, the Linguistic Mystic is not responsible for any troubles, your mileage may vary, good luck and godspeed.

Step 0.5: Xcode Command Line Tools

If you’re doing anything code-y on a Mac, you need Xcode for the compilers and other useful tools it has.

  1. Download Xcode from the Mac App Store (it’s free).
  2. Follow these instructions to install the Xcode command line tools.

Step 1: Installing HTK

This is the hardest and most terrifying part if you’re not used to compiling and installing command-line tools. We’ll take it step by step, though.

The P2FA readme is very specific that you need version 3.4 of HTK, so let’s install that. The manual isn’t terribly helpful for a Mac install, so we’ll have to go it alone.

  1. Go over to the HTK website ( and register. It’s free and only takes a minute.
  2. Download HTK 3.4 from this page. Since you’re on a Mac, grab HTK-3.4.tar.gz.
  3. On your Mac, go to wherever the file downloaded to, and double-click the .tar.gz file to expand it. This will create a folder called “htk”, and for the rest of this tutorial, I’m going to pretend it’s on your desktop.
  4. Open up (/Applications/Utilities/
    • Any time you see a command inside a code block, that means “type the command into a terminal exactly”
  5. Enter the command cd ~/Desktop/htk
  6. Run ./configure -build=i686-apple-macos LDFLAGS=-L/opt/X11/lib CFLAGS='-I/opt/X11/include -I/usr/include/malloc' to configure the software for OS X. A major hat-tip to this post for helping me with that command.

    At this point, the HTK manual says you should be able to make all && make install. But it’s not that easy. If you run that command, you’ll get a couple of errors which look like:

    esignal.c:1184:25: error: use of undeclared identifier 'ARCH' architecture = ARCH;

    Translation: “I’m trying to prep the file HTKLib/esignal.c, but nobody told me what system architecture this code is gonna be run on. Unless I know that, I can’t build!” This is actually a problem with the way HTK is written, but luckily, we can fix it by manually specifying that the architecture is “darwin” (which it always is, for OS X). A major hat-tip to this post for helping me figure out some of these issues.

  7. Open the HTKLib/esignal.c file in a code-friendly text editor. You can use Xcode, or my personal favorite TextMate 2.
  8. Find and change the below lines:

    Change line 974: if (strcmp(architecture, ARCH) == 0) /* native architecture */

    To: if (strcmp(architecture, "darwin") == 0) /* native architecture */

    Change line 1184: architecture = ARCH;

    To: architecture = "darwin";

  9. Now, let’s build! Run make all in the terminal window. Some warnings will pop up, but we don’t care.

  10. Now we’ll install it. Run make install

  11. Just to test, run LMerge. It’ll pop up a message about USAGE, and that’s fine. That just tells us it installed OK.
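Incidentally, if you’d rather not edit esignal.c by hand in step 8, the same two changes can be made from the terminal with sed. This is just a sketch: the patterns match on content rather than on line numbers (which may drift between HTK source drops), and a .bak backup copy is kept in case something goes wrong.

```shell
# From inside the htk folder: patch HTKLib/esignal.c so the
# architecture is hard-coded to "darwin", keeping a .bak backup.
sed -i.bak \
  -e 's/strcmp(architecture, ARCH)/strcmp(architecture, "darwin")/' \
  -e 's/architecture = ARCH;/architecture = "darwin";/' \
  HTKLib/esignal.c
```

After that, make all && make install should proceed as in steps 9 and 10.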

Whew. HTK is installed. That was the tough part. Now let’s install P2FA.

Step 2: Installing Penn Phonetics Lab Forced Aligner

This part’s easier!

  1. Download P2FA.
  2. Double-click the .tgz file to open it up, giving a “p2fa” folder.
  3. Move that folder someplace easy to find.


Step 3: Installing SoX

P2FA depends on SoX to work. The easiest way to get SoX, by far, is using Homebrew, so we’ll do that. Homebrew is a great little program for easily and quickly installing all sorts of fun command-line tools. I love it.

  1. Open your terminal back up.
  2. Paste in the one-line install command from the Homebrew homepage ( — it’s a ruby -e "$(curl -fsSL …)" one-liner; copy it exactly as shown there, since it may change over time.
  3. Once that’s done, install sox using brew install sox


Step 4: Setting P2FA up for your data and running it

Unfortunately, P2FA needs a particular format for your data to work. In my case, I had a bunch of files of people saying the exact same things, as prompted by a script. So, my sound files started with:

The word is men

The word is mint

… and so forth for the other 318 words which they read in the script.

To force-align data like this:

  1. Make sure that any extraneous talk is trimmed out, such that the speech actually matches the script.
  2. Create a text file which it will be aligned against. To capture the above, this file would look like:

    {SL} {NS} sp THE sp WORD sp IS sp MEN {SL} {NS} sp THE sp WORD sp IS sp MINT {SL} {NS}

    … and then goes on to do the same thing for all the other ‘the word is (word)’ sentences. The {SL} stands for “silence”, and covers the silence after they finish the sentence. The {NS} means “noise”, which is there to pick up the click of the keyboard as they advance the slide. Then, each sp (small pause) is in case the person pauses again between words. In P2FA, these “small pauses” can be present or not, and they should be sprinkled liberally throughout your data. All words need to be capitalized.

    Save the file you’ve created. I’ll call it “alignscript.txt” in other examples.

  3. Make sure that all words are included in the dictionary. It’ll yell at you at runtime if you’ve asked it to align a word which isn’t in the dictionary, so, if you’re aligning non-words (or even odd, new words), you’ll need to add them. Let’s say you want to add “neighed”:

    1. To add a new word to dictionary, open the “model” folder, and then open “dict” in your text editor.
    2. Find the line for a word which rhymes with your new word, like “made”:

      MADE M EY1 D

    3. Modify the sounds for the new word, and add it to the dictionary as its own line. Since “neighed” is just “made” with /n/ in place of /m/, the new entry would be:

      NEIGHED N EY1 D

  4. Downsample the file you’re looking to align using Praat. I’ve had great luck using the suggested 11,025 Hz sampling rate. Save this as a .wav file.
    • Remember, you can always use the TextGrid with the full-quality file later; the downsampled file is just a temporary copy for alignment.
  5. Run the aligner, modifying the paths in the command to fit where you’ve got P2FA and your data. The command is: python /path/to/ sound_file.wav alignscript.txt output_name.TextGrid

    So, for my actual work, if I wanted to align the recording session file for a subject named “sarah”: python ~/data/p2fa/ ~/data/sarah_session.wav ~/data/alignscript.txt ~/data/sarah_session.TextGrid

  6. Go have coffee. For a 15 minute recording, it takes around 10 minutes for the forced aligner to run on my (fairly recent) Mac.

  7. Open up the newly-generated .TextGrid file in Praat alongside the sound file and see how it did.
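Incidentally, if all of your recordings follow the same carrier phrase, the alignment transcript from step 2 can be generated rather than typed out by hand. Here’s a minimal shell sketch, where words.txt is a hypothetical one-word-per-line list (already capitalized, as P2FA requires):

```shell
# Hypothetical word list, one word per line, already capitalized.
printf 'MEN\nMINT\n' > words.txt

# Build alignscript.txt: a leading {SL} {NS}, then
# "sp THE sp WORD sp IS sp <WORD> {SL} {NS}" for each word.
printf '{SL} {NS}' > alignscript.txt
while read -r word; do
  printf ' sp THE sp WORD sp IS sp %s {SL} {NS}' "$word" >> alignscript.txt
done < words.txt
printf '\n' >> alignscript.txt
```

Running this on the two-word list above reproduces exactly the transcript shown in step 2, and it’s easy to re-run if you decide to change the sp/{NS} pattern later.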

P2FA in Practice

So far, I’ve been really impressed with the results. It’s pretty good, with only one major error (missed word or complete mis-identification) in every two files. Individual sounds are missed more regularly (where it’ll cut off the /z/ in “meds” or the /n/ in “plan”). Vowel boundaries are off by 10 ms or so in around 1/3 of tokens.

I’ve been hand-correcting the data because I care a lot about those boundaries, but if I just wanted a measure at the center of the vowel, I wouldn’t even bother, as the midpoint of the aligned span falls quite reliably within the actual vowel. Regardless of these issues, using P2FA with hand-correction, I’m able to beautifully annotate data in around 1/4 of the time it takes to do it by hand. It’s an absolutely excellent tool, and I would recommend it to anybody.

So, I hope this was helpful, good luck, and good alignment!

~ ə ~

Much of my time these days is going into writing my doctoral dissertation, the big long paper I have to write before they’ll give me the Ph.D. and send me on my way. I’ve had a few people ask me for a concise explanation of what I’m actually doing which is understandable to non-linguists and to readers of all ages, so, here goes:

The goal of my dissertation, simply put, is to figure out how humans can hear the difference between “pat” and “pant” without resorting to magic.

Where’s the magic in nasality?

Say “pat”. Now say “pant”. Say them again, and listen to the vowel in the middle. Even before you start the “n”, there’s something funky and “nasally” about that vowel, and the “n” isn’t really that strong. That “nasal-ish” difference in sound is called vowel nasality, and it happens when you let some air escape through your nose while you’re making a vowel.

In English, although vowel nasality happens all the time (any time there’s an n, m, or the /ŋ/ sound in “ring”), we as listeners don’t really care whether vowels are nasalized; it’s just something that happens naturally. The word “pant”, said without the nasality, is still “pant”, but it’s a lot easier to make it with the nasality. And it turns out that it’s useful for English speakers to have it there, as it makes decisions about whether we heard, say, “cloud” or “clown” a bit easier and faster, especially if you didn’t hear the last part of the word.

In French, though, nasality is crucial to the language. The only difference between “beau” (‘beautiful’) and “bon” (‘good’) is whether the /o/ is nasal or not (non-nasal vowels are called “oral”), so French speakers need to listen for it. There, it’s contrastive, meaning that it can make the difference between different words. Nasality is contrastive in lots of other languages, like Hindi, or Lakota, or Brazilian Portuguese.

So, that’s vowel nasality. We know a bunch about it, and it’s useful to a bunch of speakers of a bunch of languages. The problem is, we as linguists don’t actually know what nasality sounds like.

The Proof is in the measurement

In phonetics, just like in any other sciencey field, we need to be able to measure something to be able to say intelligent things about it. We want to be able to say things like “Based on this study of how they sound, these vowels are more nasal than these other vowels”, or even just “This vowel is nasal, this one isn’t.”

Being able to detect vowel nasality from sound is also useful for non-linguists. It’s good for speech recognition. French Siri badly needs to do better at understanding French speakers. It’s also good for speech pathology. “Hypernasality” is a problem that some people have, where they’re not able to control the amount of air going through their noses during speech, and many things are nasal that shouldn’t be. At the moment, testing for hypernasality involves strapping air masks to people’s heads, and it would be much nicer to just set down a microphone on the table and measure it that way.

Right now, if I want to measure nasality, I’ve got to use a really complicated measurement looking at the relative strength of different frequencies in the signal (higher and lower pitches within the overall sound of the voice). This measure, called “A1-P0”, is great in some ways. If I’ve got 3000 vowels to look at, the measure’s good enough to say things like “Yeah, overall, these vowels over here are more nasal than those over there”. But if I look at any single vowel and ask “Hey, is this oral or nasal?”, it’s got something like a 54% chance of getting the answer right.

But that also points to something awesome: Even though we as linguists aren’t very good at measuring it, humans are REALLY good at hearing nasality. In fact, people are good enough at vowel nasality that languages all over the world have baked it into how they work, and use it every day without any problems. And if people can reliably hear nasality in speech, there must be something to hear, some acoustical feature, which is more reliable for detecting nasality than 54%. In short, we linguists have clearly missed something.

So, the goal of my dissertation, put less simply, is to figure out what, exactly, humans are listening to when they hear the difference between “pat” and “pant” in English, or “beau” and “bon” in French.

How do you do that?

To figure out what we’re actually listening to when we’re hearing nasality, I’ve got a few steps to take.

In short, I need to find cues. Cues are just things that tip you off for perception. Smoke, heat, and light are all cues to fire. I need to figure out what parts of the speech sound signal are cues to nasality.

First, I’m going to measure a bunch of other acoustical features, different parts of the speech sound signal, that people have said might be cues for nasality, and see how often they occur in nasal vowels (relative to oral vowels). I’ll also combine some of them, and see if a bunch of features together might make a better cue than any one thing alone (just like heat alone doesn’t mean fire, but heat and smoke does).

Then, once I’ve got some suspects, some elements of the sound that I think might be useful to humans in noticing nasality, I’ll try and teach a computer to perceive using those features (mostly because computers are much cheaper to experiment on than humans). I’ll give the computer a bunch of data, showing it all the different features I’m thinking about, have it learn from that data, then give it more vowels and ask it to decide whether each vowel is “oral” or “nasal”. By looking at how well the computer did using each individual feature, I’ll be able to narrow down the 30+ features I’m starting with to the ones I know to actually be useful in making decisions about nasality.

Now, I’ll have to get humans involved. I’ll drag a bunch of English and French speakers into the lab and make them listen to words and make choices (“Oral or nasal?”, or “Did you hear ‘pat’ or ‘pant’?”). But these won’t be just any words, they’ll be words I’ve messed with in some very important ways.

Some of the words will be nasal words (like “pant” or “bon”) where I’ve removed the parts of the sound which I think make people say “Aha! Nasal!”. My hope is that people, when those parts of the sound are missing, will think that it wasn’t a nasal word after all. If removing a feature makes people think a word isn’t nasal, we’ll know it was important in the perception process, and that it’s a cue.

On the other side, I’ll have oral words (like “pat” or “beau”) where I’ve added things that I think are nasal cues. My hope is that I can take “dote”, add some features to the signal, and people will hear that added stuff and say “Oh, that’s ‘don’t’!” If an added feature alone makes people hear a word as nasal, we’ll know it’s really a cue, because it’s proof of nasality all by itself.

By adding and subtracting parts of the sound signal, I’ll figure out what’s necessary for people to hear nasality (what must be present for them to call a nasal word nasal), and what’s sufficient for people to hear nasality (what, on its own, is enough to make them call a word nasal). And once I know that, I’ll know what people are actually listening for when they hear nasality.

Then it’s a question of seeing if I can use that knowledge to do nasality measurement (asking a computer to look for the same things humans are), and then saying a bit more about how people listen in French vs. English (if they’re any different). Then, I’ve just gotta write the thing up and convince my committee that it was as awesome as I think it is.

That’s it.

So, that’s my dissertation, and that’s what’s eating my time. I’m hoping to defend it next Spring (Spring 2015), and then, ideally, find a job where I can teach new people about the awesome of nasality, phonetics, Linguistics, and language in general.

But in the meantime, if you happen to be in France, India, or North Dakota, and you overhear native speakers discussing their secret to nasality perception, do me a favor and drop me a line.

~ ə ~