Tuesday, August 25, 2009

13th July_Animation Research

http://animation.about.com/od/flashanimationtutorials/a/animationphonem.htm

tutorial on animating mouth shapes for speech

‘For a quick fix, it's no problem to just animate the mouth opening and closing, and it's a simple shortcut, especially when animating for the web. But if you want to add actual expression and realistic mouth-movements, it helps to study how the shape of the mouth changes with each sound. There are dozens upon dozens of variations, but my sketches are renderings from the basic ten shapes of the Preston Blair phoneme series. (They're also an example of what happens when Adri dashes off ten-minute sketches from memory rather than detailed artwork.)

These ten basic phoneme shapes (click the image thumbnail in the right-hand column of this page, or click here for a full-size version) can match almost any sound of speech, in varying degrees of expression--and with the in-between frames moving from one to the other, are remarkably accurate. You may want to keep this for reference.

  • A and I: For the A and I vowel sounds, the lips are generally pulled a bit wider, teeth open, tongue visible and flat against the floor of the mouth.
  • E: The E phoneme is similar to the A and I, but the lips are stretched a bit wider, the corners uplifted more, and the mouth and teeth closed a bit more.
  • U: For the U sound, the lips are pursed outwards, drawn into a pucker but still somewhat open; the teeth open, and the tongue somewhat lifted.
  • O: Again the mouth is drawn to a pucker, but the lips don't purse outwards, and the mouth is rounder, the tongue flat against the floor of the mouth.
  • C, D, G, K, N, R, S, Th, Y, and Z: Long list, wasn't it? This configuration pretty much covers all the major hard consonants: lips mostly closed, stretched wide, teeth closed or nearly closed.
  • F and V: Mouth at about standard width, but teeth pressed down into the lower lip. At times there can be variations closer to the D/Th configuration.
  • L: The mouth is open and stretched apart much like the A/I configuration, but with the tip of the tongue raised to press against the roof of the mouth, just behind the upper teeth.
  • M, B, and P: These sounds are made with the lips pressed together; it's the duration that matters. "M" is a long hold, "mmm"; "B" is a shorter hold then part, almost a "buh" sound; P is a quick hold, puff of air.
  • W and Q: These two sounds purse the mouth the most, almost closing it over the teeth, with just the bottoms of the upper teeth visible, sometimes not even that. Think of a "rosebud mouth".
  • Rest Position: Think of this as the "slack" position, when the mouth is at rest--only with the thread of drool distinctly absent.

When you're drawing or modeling your animation, by listening to each word and the syllable combinations within it, you can usually break them down into a variation of these ten phoneme sets. Note that my drawings aren't perfectly symmetrical; that wasn't just shoddy sketching. No two people express themselves in an identical fashion, and each has individual facial quirks that make their speech and expressions asymmetrical.'
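To make the breakdown concrete, here is a minimal Python sketch of the ten-shape lookup the tutorial describes: letters stand in very roughly for phonemes (digraphs like "Th" are ignored), each mapping to one of the ten Preston Blair shapes, and a word becomes a list of keyframe descriptions. The shape names and descriptions paraphrase the bullet list above; everything else is an illustrative assumption, not part of the tutorial.

# Sketch: map rough letter-based "phonemes" to the ten Preston Blair
# mouth shapes described above, then break a word into keyframe poses.
# Shape keys and descriptions are illustrative, not a standard.

PRESTON_BLAIR_SHAPES = {
    "AI":          "lips pulled wide, teeth open, tongue flat",
    "E":           "lips wider, corners lifted, mouth more closed",
    "U":           "lips pursed outward, teeth open, tongue lifted",
    "O":           "rounded pucker, lips not pursed, tongue flat",
    "CDGKNRSThYZ": "lips mostly closed, stretched wide, teeth nearly closed",
    "FV":          "teeth pressed into the lower lip",
    "L":           "open like A/I, tongue raised to the roof of the mouth",
    "MBP":         "lips pressed together",
    "WQ":          "tight rosebud pucker over the teeth",
    "REST":        "slack, mouth at rest",
}

# Letter groups -> shape key (grouping follows the bullet list above).
PHONEME_TO_SHAPE = {}
for letters, key in [("AI", "AI"), ("E", "E"), ("U", "U"), ("O", "O"),
                     ("CDGKNRSYZ", "CDGKNRSThYZ"), ("FV", "FV"),
                     ("L", "L"), ("MBP", "MBP"), ("WQ", "WQ")]:
    for ch in letters:
        PHONEME_TO_SHAPE[ch] = key

def keyframes_for(word: str) -> list:
    """Very rough: treat each letter as a phoneme and look up its shape."""
    frames = []
    for ch in word.upper():
        key = PHONEME_TO_SHAPE.get(ch, "REST")
        frames.append((ch, key, PRESTON_BLAIR_SHAPES[key]))
    return frames

if __name__ == "__main__":
    for letter, shape, desc in keyframes_for("rosebud"):
        print(f"{letter}: {shape:12s} {desc}")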

http://en.wikipedia.org/wiki/Viseme

def of viseme:

“A viseme is a supposed basic unit of speech in the visual domain. The term viseme was introduced based on the interpretation of the phoneme as a basic unit of speech in the acoustic/auditory domain (Fisher, 1968). This is, however, at variance with the accepted definition of the phoneme as the smallest structural unit that distinguishes meaning within a given language, as a cognitive abstraction that is not bound to any sensory modality.

A "viseme" describes the particular facial and oral positions and movements that occur alongside the voicing of phonemes. The analogous term for the acoustic reflection of a phoneme would be "audieme", but this is not in use.

Phonemes and visemes do not always share a one-to-one correspondence; often, several phonemes share the same viseme. In other words, several phonemes look the same on the face when produced, such as /k/, /g/, /ŋ/ (viseme: /k/), or /ʧ/, /ʃ/, /ʤ/, /ʒ/ (viseme: /ch/). However, there can be differences in timing and duration during actual speech, in terms of the visual 'signature' of a given gesture, that cannot be captured in a single photograph.

Conversely, some sounds which are hard to distinguish acoustically are clearly distinguished by the face (Chen 2001). For example, acoustically speaking, English /l/ and /r/ can be quite similar (especially in clusters, such as 'grass' vs. 'glass'), yet visual information can show a clear contrast. This is demonstrated by the more frequent mishearing of words on the telephone than in person. Some linguists have argued that speech is best understood as bimodal (aural and visual), and that comprehension can be compromised if one of these two domains is absent (McGurk and MacDonald 1976). The comprehension of speech by visemes alone is known as speechreading or 'lip reading'. Applications for the study of visemes include speech processing, speech recognition and computer facial animation.”
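The many-to-one mapping is easy to show in code. The sketch below collapses a phoneme sequence into viseme classes using the groupings named in the article (/k/, /ch/, and so on); the dictionary is a fragment for illustration, not a complete viseme inventory.

# Many-to-one phoneme -> viseme mapping: several phonemes collapse to
# one visual class. Class labels follow the article; the dict is a
# partial sketch, not a full inventory.

PHONEME_TO_VISEME = {
    "k": "k", "g": "k", "ŋ": "k",            # look alike on the face
    "tʃ": "ch", "ʃ": "ch", "dʒ": "ch", "ʒ": "ch",
    "m": "p", "b": "p", "p": "p",            # lips pressed together
    "f": "f", "v": "f",                      # teeth on lower lip
}

def visemes(phonemes):
    """Collapse a phoneme sequence into its viseme sequence."""
    return [PHONEME_TO_VISEME.get(p, p) for p in phonemes]

# /k/ and /g/ are acoustically distinct but visually identical:
print(visemes(["k", "æ", "t"]))  # ['k', 'æ', 't']
print(visemes(["g", "æ", "p"]))  # ['k', 'æ', 'p']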

http://en.wikipedia.org/wiki/Computer_facial_animation#Speech_Animation

Speech Animation

Speech is usually treated differently from the animation of facial expressions, because simple keyframe-based approaches to animation typically provide a poor approximation to real speech dynamics. Often visemes are used to represent the key poses in observed speech (i.e. the position of the lips, jaw and tongue when producing a particular phoneme); however, there is a great deal of variation in the realisation of visemes during the production of natural speech. The source of this variation is termed coarticulation, which is the influence of surrounding visemes upon the current viseme (i.e. the effect of context). To account for coarticulation, current systems either explicitly take context into account when blending viseme keyframes or use longer units such as diphone, triphone, syllable or even word- and sentence-length units.
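As a baseline, here is the naive keyframe approach the paragraph calls a poor approximation: one target pose per viseme, linearly interpolated with no regard for context. The parameter names and pose values are made up for illustration.

# Naive keyframe speech animation: one pose per viseme, linear
# interpolation between poses, no coarticulation. Parameters
# [jaw_open, lip_width] in [0, 1] are illustrative.

import numpy as np

POSES = {"p": np.array([0.0, 0.4]),
         "ah": np.array([0.9, 0.5]),
         "f": np.array([0.2, 0.5])}

def linear_track(viseme_times, fps=30):
    """viseme_times: list of (viseme, time_in_seconds) keyframes."""
    end = viseme_times[-1][1]
    frames = []
    for i in range(round(end * fps) + 1):
        t = i / fps
        # find the surrounding keyframes and lerp between their poses
        for (v0, t0), (v1, t1) in zip(viseme_times, viseme_times[1:]):
            if t0 <= t <= t1:
                a = (t - t0) / (t1 - t0)
                frames.append((1 - a) * POSES[v0] + a * POSES[v1])
                break
    return np.array(frames)

track = linear_track([("p", 0.0), ("ah", 0.15), ("f", 0.3)])
print(track.shape)  # (10, 2): one [jaw_open, lip_width] pose per frame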

One of the most common approaches to speech animation is the use of dominance functions introduced by Cohen and Massaro. Each dominance function represents the influence over time that a viseme has on a speech utterance. Typically the influence will be greatest at the center of the viseme and will degrade with distance from the viseme center. Dominance functions are blended together to generate a speech trajectory in much the same way that spline basis functions are blended together to generate a curve. The shape of each dominance function will be different according to both which viseme it represents and what aspect of the face is being controlled (e.g. lip width, jaw rotation etc.) This approach to computer-generated speech animation can be seen in the Baldi talking head.
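A rough sketch of the dominance-function idea, for a single facial parameter. The exponential falloff and the constants are illustrative stand-ins, not the exact functions Cohen and Massaro published, but the blending itself follows the description above: the trajectory is a dominance-weighted average of viseme targets.

# Dominance blending for one facial parameter (say, lip width). Each
# viseme has a target value and a dominance curve peaking at its
# center; the trajectory is the dominance-weighted average of targets.

import numpy as np

def dominance(t, center, magnitude=1.0, rate=8.0):
    """Influence of a viseme at time t: maximal at its center,
    decaying with distance from it (illustrative exponential form)."""
    return magnitude * np.exp(-rate * np.abs(t - center))

# (target value for the parameter, viseme center time in seconds)
visemes = [(0.2, 0.10),   # narrow pucker
           (0.9, 0.30),   # wide open vowel
           (0.4, 0.50)]   # near-closed consonant

t = np.linspace(0.0, 0.6, 61)
num = sum(target * dominance(t, c) for target, c in visemes)
den = sum(dominance(t, c) for _, c in visemes)
trajectory = num / den   # smooth curve passing near each target

for ti, v in zip(t[::15], trajectory[::15]):
    print(f"t={ti:.2f}s  value={v:.2f}")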

Other models of speech use basis units which include context (e.g. diphones, triphones etc.) instead of visemes. As the basis units already incorporate the variation of each viseme according to context, and to some degree the dynamics of each viseme, no model of coarticulation is required. Speech is simply generated by selecting appropriate units from a database and blending them together. This is similar to concatenative techniques in audio speech synthesis. The disadvantage of these models is that a large amount of captured data is required to produce natural results, and whilst longer units produce more natural results, the size of the database required expands with the average length of each unit.
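The unit-selection idea in miniature: a toy "database" of diphone motion snippets, concatenated with a trivial cross-fade at the joins. Because each unit spans a phoneme pair, context is baked in. The snippets here are invented single-parameter tracks; real systems store captured motion data.

# Toy concatenative synthesis: look up a motion snippet per diphone
# (phoneme pair) and join consecutive snippets, averaging the
# boundary frame at each join.

DIPHONE_DB = {
    ("p", "ah"): [0.0, 0.3, 0.7, 0.9],   # lips part into an open vowel
    ("ah", "p"): [0.9, 0.6, 0.2, 0.0],   # vowel closing into pressed lips
    ("ah", "f"): [0.9, 0.5, 0.2],        # vowel into teeth-on-lip
}

def synthesize(phonemes):
    """Concatenate database units for consecutive phoneme pairs,
    cross-fading one frame at each join."""
    track = []
    for pair in zip(phonemes, phonemes[1:]):
        unit = DIPHONE_DB[pair]
        if track:
            track[-1] = (track[-1] + unit[0]) / 2
            unit = unit[1:]
        track.extend(unit)
    return track

print(synthesize(["p", "ah", "f"]))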

Finally, some models directly generate speech animation from audio. These systems typically use hidden Markov models or neural networks to transform audio parameters into a stream of control parameters for a facial model. The advantage of this method is that it handles voice context, natural rhythm, tempo, emotion and dynamics without complex approximation algorithms. The training database does not need to be labeled, since no phonemes or visemes are required; the only data needed are the voice and the animation parameters. An example of this approach is the Johnnie Talker system[2].
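A minimal sketch of the direct audio-to-animation mapping, with a least-squares linear map standing in for the HMM or neural network: paired audio features and animation parameters train the model, and no phoneme or viseme labels appear anywhere. All data here is synthetic, and the feature and parameter counts are arbitrary assumptions.

# Learn a direct mapping from per-frame audio features to facial
# control parameters; a linear least-squares fit stands in for the
# HMM/neural-network models named above.

import numpy as np

rng = np.random.default_rng(0)

# Training pair: audio features (e.g. 13 per frame) and the animation
# parameters (e.g. 4 blendshape weights) captured alongside them.
audio = rng.normal(size=(500, 13))          # 500 frames x 13 features
true_map = rng.normal(size=(13, 4))
anim = audio @ true_map + 0.01 * rng.normal(size=(500, 4))

# "Training": fit the linear map by least squares.
W, *_ = np.linalg.lstsq(audio, anim, rcond=None)

# Inference: new audio in, control parameters out, frame by frame.
new_audio = rng.normal(size=(5, 13))
controls = new_audio @ W
print(controls.shape)   # (5, 4): one parameter vector per audio frame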
