
From ja-goldsmith@uchicago.edu Tue Oct 31 09:21:45 2000
Date: Sun, 29 Oct 2000 13:58:43 -0800
From: John Goldsmith <ja-goldsmith@uchicago.edu>
To: tony@math.sunysb.edu
Subject: RE: information on information

Just a follow-up note to the previous one, which was terminated on
account of getting lunch for the kids. I'm attaching a graph of the
same information as in the previous message. 

What this little bit of analysis suggests is that there is relatively
little variation in information per letter. There are a couple of
reasons not to put too much weight on these results. The first is that
we'd like to see samples from languages with more variety, and
I just don't have the corpora on hand, but I'll look into it. Another,
more interesting reason is one that you will have noticed if you
had a chance to look at the paper I pointed you to, which is that
information in words is spread over morphemes, and a model which
doesn't take that into account (such as a model in which probability
is based on the preceding letter) is missing a good deal of the
structure of human language. In your message below, you mention
syllables, and while syllables are certainly important in language,
I suspect that bringing syllables into our model will have relatively
little impact on the results that we get (because the easier models
are based on looking at the one or two preceding letters as the
conditioning factor; and knowing the syllable boundary will never improve
that model in the case where you're looking at two preceding letters;
it will occasionally improve the case where you're conditioning
on only one preceding letter). 
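
To make that concrete, here is a toy sketch in Python (the sample text
is invented, purely to show the effect): conditioning on even a single
preceding letter already removes a lot of uncertainty, since after 'q',
for instance, English allows essentially only 'u'.

  import math
  from collections import Counter, defaultdict

  text = "the quick quiet queen quoted the quizzical squire"

  # For each letter, tally which letters follow it.
  following = defaultdict(Counter)
  for prev, nxt in zip(text, text[1:]):
      following[prev][nxt] += 1

  def entropy(counts):
      """Shannon entropy in bits; max() guards against a -0.0 float artifact."""
      total = sum(counts.values())
      return max(0.0, -sum((n / total) * math.log2(n / total)
                           for n in counts.values()))

  print(entropy(following["q"]))  # 0 bits: 'q' fully determines the next letter
  print(entropy(following["t"]))  # 1.5 bits: several letters can follow 't'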

Coming back to the main point, quantifying information is always
a matter of going back to Boltzmann's insight that entropy is the
log of the number of possible states, and in language, there are
always multiple ways of conceiving of the number of states. If we
think about things phonologically (or orthographically), then the
state space is the space of sequences of letters, whereas if we
think of things in terms of morphemes, the state space (for individual
words) consists of a single dimension over which you select one
stem, and then a more complex set of alternatives for suffixes (and
occasionally the odd prefix). So a large part of the morphological
information is bound up in (is quantified by) the entropy of the set
of stems of the language. 
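
In code, the quantity I have in mind is just the Shannon entropy of the
stem distribution. Here is a sketch in Python (the stems and their
counts are invented, purely for illustration):

  import math
  from collections import Counter

  # Invented stem frequencies, for illustration only.
  stems = Counter({"walk": 500, "talk": 300, "jump": 150, "quantify": 50})
  total = sum(stems.values())

  # Entropy of the stem choice, in bits: the information carried by
  # selecting one stem from the set.
  h_stems = -sum((n / total) * math.log2(n / total) for n in stems.values())
  print(h_stems)  # at most log2(4) = 2 bits; the skewed counts pull it lower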

This gets to the heart of the point about whether one would speak
slower or faster in a language in which the information was denser
or less dense: the phonological information is by no means the same
as the morphological information -- which in turn is not the same
as the syntactic information, or the semantic information. About those
last two we have less understanding at this point, but about the
phonological and morphological we have some grasp.

What do you have in mind to write about? What got you started in this
area?

best,
John Goldsmith



-----Original Message-----
From: tony@math.sunysb.edu [mailto:tony@math.sunysb.edu]
Sent: Thursday, October 26, 2000 5:15 PM
To: ja-goldsmith@uchicago.edu; tony@math.sunysb.edu
Subject: information on information


Hello John

I came across your Royaumont article while compiling a list of web
resources on information theory to put with my web column this month.
I'm planning it to be "The Mathematics of Communication" (to appear
in http://www.ams.org/new-in-math, which I edit and usually write).

I have been interested in natural languages all my life, and thought
they could come together with my love of mathematics during those crazy
days in the 50s when I worked with Yngve & Co. on MT as an MIT
undergrad. I was soon disabused (by Lees himself, who took me aside one
day, during my summer job at IBM, and said, as I remember it, that there
was no real hidden math and that MT was a chimera). But I did learn
about information theory, and that stuck with me. They had me cook up
an optimal code for Russian. In those days IBM had a contract with the
Air Force to provide a hard-wired Russian-English translating device.

What I'm hoping you can do for me, and soon if possible, is let me know
if there have been any useful studies of relative information content,
say of syllables, across languages. For example, since Mandarin has a
relatively small set of possible syllables (even counting tones)
compared to English, one might think that the information per syllable
must be lower in Mandarin, and that Mandarin speakers could/would speak
more quickly and still be understood. Or, would have to speak more
quickly to transmit the same information in the same amount of time.
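
A back-of-envelope version of that hunch (the inventory sizes below are
only my rough guesses, which I'd want to check): the entropy per
syllable can be at most the log of the number of distinct syllables.

  import math

  # Rough, unverified inventory sizes, just to compare orders of magnitude.
  mandarin_syllables = 1300   # ballpark count including tone distinctions
  english_syllables = 10000   # ballpark order-of-magnitude guess

  # log2 of the inventory size is an upper bound on entropy per syllable.
  print(math.log2(mandarin_syllables))  # about 10.3 bits
  print(math.log2(english_syllables))   # about 13.3 bits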

My personal axiom is that all spoken languages are equally efficient,
but this may be wrong. Does anyone know one way or the other? I mean
efficient in general. Clearly some particular things are more pithily
expressed in one language than in another.

Here's the kind of joke we used to tell: "The most interesting thing
about any language is the way it resembles Russian."

Tony Phillips

  [Part 2, Application/X-MSEXCEL  18KB]

From ja-goldsmith@uchicago.edu Tue Oct 31 09:22:20 2000
Date: Sun, 29 Oct 2000 09:33:24 -0800
From: John Goldsmith <ja-goldsmith@uchicago.edu>
To: Tony Phillips <tony@math.sunysb.edu>
Subject: RE: information on information

I ran the following numbers on a few languages, and am enclosing results.
I measured the letter entropy on a corpus, and then the bigram (2-letter
sequence) entropy. The difference between these is the conditional
entropy, that is, the weighted average of the entropy of a letter given
the preceding letter. That should be a reasonable measure of how much
information each letter provides. Then I multiplied by the average number
of letters per word. Now we need to factor in the average number of words
per sentence, but I haven't done that yet. (KiRundi is a Bantu language,
as is Swahili.)
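
In case the spreadsheet gives you trouble, the computation in rough
Python form looks like this (a sketch only; the toy word list stands in
for a real corpus, and the names are mine, not from the spreadsheet):

  import math
  from collections import Counter

  def entropy(counts):
      """Shannon entropy, in bits, of a table of frequencies."""
      total = sum(counts.values())
      return -sum((n / total) * math.log2(n / total) for n in counts.values())

  # Toy word list standing in for a real corpus.
  words = ["the", "cat", "sat", "on", "the", "mat"]
  text = " ".join(words)

  h_letter = entropy(Counter(text))                       # letter entropy
  h_bigram = entropy(Counter(text[i:i + 2]                # bigram entropy
                             for i in range(len(text) - 1)))
  h_cond = h_bigram - h_letter   # entropy of a letter given the one before

  avg_letters = sum(len(w) for w in words) / len(words)
  print(h_cond * avg_letters)    # rough bits of information per average word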

If you have trouble with the Excel attachment, let me know.

best,
John


-----Original Message-----
From: Tony Phillips [mailto:tony@math.sunysb.edu]
Sent: Thursday, October 26, 2000 5:40 PM
To: John Goldsmith
Cc: Tony Phillips
Subject: RE: information on information


Wow. Thanks for your speedy answer. I'll look up
the paper you mention and I'll be grateful for
more if you can send it. Tony

  [Part 2, Application/X-MSEXCEL  18KB]

