Wednesday, September 2, 2009

Meeting with Ciara O'Toole

Similar Applications Ciara mentioned:

  1. http://www.uiowa.edu/~acadtech/phonetics/# - Phonetics flash animation - The Sounds of Spoken Language
  2. SpeechViewer 3 with English voice recognition - no longer made
  3. Visispeech

Phonetics: The Sounds of Spoken Language
http://www.uiowa.edu/~acadtech/phonetics/#



VISISPEECH
An Autonomous Speech Rehabilitation System for Hearing Impaired People
Final Report
http://www.xanthi.ilsp.gr/olp/pdf/HARP_final_report.pdf


NOTES TAKEN WHILE READING:

  • aim: to develop a visual speech training aid for hearing-impaired speakers
  • suitable for use in schools
  • provides advice on pronunciation errors, thereby providing rehab assistance
  • guides the user through a set of lessons which gradually improve pronunciation (in my case, vocabulary)
  • provides a flexible, accessible and highly motivating teaching tool (I could add "extensible")
  • used to complement and extend services already offered (the primary curriculum, for example)
  • speech analysis provided
  • prototype coursework provided, integrating exercises in pitch, loudness, intonation, vowels & consonants
  • successful attempts to exploit recent advances in multimedia user interfaces such as interactive computer graphics & speech processing... the project completed a detailed investigation of how such multimedia schemes can be exploited for pronunciation teaching for hearing-impaired people
  • stimulating and motivational material
  • bilingual - the whole interface is designed to be bilingual

EXTRA: INDIANA SPEECH TRAINING - ISTRA
The path of speech technologies in computer assisted language learning: from ...
By V. Melissa Holland, F. Peter Fisher
  • stuff on HMM and common errors with training the corpus

NOTES (from meeting): it would be a good idea to integrate it with the primary-school curriculum. The primary-school books can be downloaded online, so I can check them. Try to use it as extension work from vocabulary learnt in the school classroom, and as an optional extension for extra work at home if they want.

Check the school curriculum for first and second class… APPLY IT!
And test to see if it works / is helpful in schools.


There are 2 types of learning: expressive and comprehension/understanding.
I’m aiming at expressive, so first- and second-class level.
Before that they just learn the actual words, but I’m interested in their use of them.


Books that may be of use:

Céad Focal: the first hundred words

Buntús Foclóra: a children’s Irish picture dictionary

Phonemes that are unique to Irish – get native speakers to pronounce them


For an articulation game (not language), the apps already in existence measure pitch & loudness


GOOD WAY OF LEARNING WOULD BE TO:

Have a range of acceptability for each word rather than a clear-cut right or wrong, and, if possible, a way of testing what they did incorrectly.
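As a sketch of the "range of acceptability" idea in ActionScript (the score value and the thresholds are entirely hypothetical - they would come from whatever pronunciation-scoring method ends up being used):

```actionscript
// Map a hypothetical pronunciation score in [0, 1] to feedback bands
// instead of a straight-cut right/wrong answer.
function feedbackFor(score:Number):String {
    if (score >= 0.8) return "Great!";
    if (score >= 0.5) return "Close - listen and try again";
    return "Let's practise this word together";
}
```

Banding like this also leaves room for the diagnostic part: each band could show a different hint about what went wrong.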

English App Learning Games:

  • http://www.manythings.org/

USER-INPUT AUDIO - EXTRACTION, STORAGE & REPLAY

Extracting Audio Files from user input

  • It is not possible to record and store microphone input locally in the Flash Player.
  • The input therefore has to be streamed to a server (a Flash Media Server, or a compatible alternative, is needed for this).
  • It is also difficult to extract only the audio from a .flv file.
http://www.gotoandlearnforum.com/viewtopic.php?f=29&t=21974&p=115759

"it's enough if I can record it, replay it and overwrite it again if the user records another sound clip.

So far I gather this is impossible in Flash without the use of a mediaserver. Then again if I'm right the only way a media server can help me is by actually streaming the microphone input to the mediaserver, actually storing the sound as an mp3 file or similar, and then serving the mp3 file back to the app. It seems a bit overkill for just replaying short soundclips while the app is up, soundclips which are going to be re-recorded over and over while the user is using the app and trashed when he stops using the app.

I'm afraid not. I spent some time looking around and there seems to be no way to do this purely with Flash alone. As far as I can tell the only way to record audio is by streaming the microphone input to a media server such as adobe's own media server or an open source alternative like Red5 and then play back the mp3 stored by the server.

Quite an ugly solution so we ended up building a simple java applet to take care of the temporary recording. (http://www.jsresources.org/examples/audio_playing_recording.html)"



Could use a Java applet (though this would raise privacy issues).

"Purpose. Plays a single audio file. Capable of playing some compressed audio formats (A-law, μ-law, maybe ogg vorbis, mp3, GSM06.10). Allows control over buffering and which mixer to use. Usage:

java AudioPlayer -l

java AudioPlayer [ -M mixername ] [ -e buffersize ] [ -i buffersize ] audiofile"








PROBLEM:

Have to record as a video file for the Nellymoser converter!!

http://blog.andrewpaulsimmons.com/labels/Flash%20Media%20Server.html

  • "My team is currently developing a series of interactive speech recognition applications. One of the applications requires us to create a web front end that allows us to record audio from a user microphone and return it to the server... We decided to use Flash and quickly found that we could not extract the audio from our recorded FLV files... But we simply could not get the audio out of the files that were being streamed to our Flash Media Server."
  • "We discovered that all files converted from another format to FLV store audio in an embedded MP3. Unfortunately, all FLV files recorded from the user’s microphone by the Flash Player use the Nellymoser audio format. Nellymoser is a highly proprietary mono audio format designed solely for streaming speech. When we looked for a program to decompress this format we found that Nellymoser offered a converter for $7,500."
  • "one other converter that would do our decoding, the Total Video Converter, for only $50"
  • "You can convert the FLV to many different audio or video formats... To convert an FLV that contains video, or video and audio, but not audio only, you may use the GUI, which is self-explanatory. Audio-only clips can not be converted with the GUI at this time. (The application simply locks up when we try to convert Nellymoser audio-only FLV files)."


Therefore, to record and extract audio from a user's microphone it is necessary to record video as well, and this can be done using the following code:

var bandwidth:int = 0; // Maximum bandwidth the outgoing video feed can use, in bytes per second. 0 lets the feed use whatever it needs to maintain quality; the default is 16384.
var quality:int = 50; // 0-100, where 1 is the lowest picture quality; 0 lets quality vary to stay within the bandwidth limit.
var camera:Camera = Camera.getCamera();
camera.setQuality(bandwidth, quality);
camera.setMode(320, 240, 15, true); // setMode(videoWidth, videoHeight, fps, favorArea)

// Now attach the webcam stream to a video object.
var video:Video = new Video();
video.attachCamera(camera);
addChild(video);


Depending on the project, you can change bandwidth, quality, and frame-rate settings to find the best combination.


http://blog.728media.com/2009/02/24/actionscript-3-webcam-configure/



Sending to Server:

Make a network connection to the server (localhost while testing)...


var nc:NetConnection = new NetConnection();
// NB: connect() is asynchronous - wait for the NetConnection.Connect.Success
// NetStatusEvent before creating the NetStream.
nc.connect("rtmp://YOUR_SERVER_URI/vod/");

var ns:NetStream = new NetStream(nc);
ns.play("NAME_OF_STREAM"); // play() retrieves a stream; use publish() to record
// ns.play("mp4:NAME_OF_STREAM.mp4");
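For completeness, a minimal sketch of the publishing side. Because of the Nellymoser issue above (audio-only FLVs can't be converted), the camera is attached alongside the microphone. The stream name "userClip", and the reuse of `nc` and `camera` from the snippets above once the connection has succeeded, are my assumptions:

```actionscript
// Sketch: publish mic + camera to the media server so the recording
// is stored server-side ("userClip" is a hypothetical stream name).
var mic:Microphone = Microphone.getMicrophone();
mic.setUseEchoSuppression(true);

var outStream:NetStream = new NetStream(nc); // nc must be connected first
outStream.attachAudio(mic);
outStream.attachCamera(camera);          // camera from the code above
outStream.publish("userClip", "record"); // "record" writes an FLV on the server
```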




Thursday, August 27, 2009

Lesson 1 Plan & Design

BEFORE:



Notes on Design for Lesson 1

  • Move clouds slowly in the background
  • Add detail to grass in front and sky behind
  • Birds in sky


BUTTONS NEEDED:

  1. Listen to word and watch mouth movement...(Maybe a zoom in option for close-up of visemes)
  2. Repeat Lesson
  3. Volume Level Pan
  4. Next Number
  5. Previous Button
  6. Back to Main Menu
  7. S&Ls Button on the last frame of lesson
  8. Understand? Button (help) which brings up English translation under the Irish word




ELEMENTS ON SCREEN:

  1. Figure of the number
  2. Object to associate number with (Eg: petals on a flower)
  3. Irish word, with the English translation underneath (set to invisible until help is requested) - spell it in the sky!
  4. Simple animation to illustrate meaning of word



"Each topic will have a lesson explaining the words in that vocabulary set, with their accompanying images. These combined with the interactive animated character create an optimal learning environment, also allowing for the user to check and recheck how to shape their mouths and how a word should sound." Project Proposal July, 09.


ADDITIONAL FEATURES:

  1. Accompanying song for revision style animation at the end of the lesson, prior to option of S&Ls game. (similar to those on youtube.com)





SCREENSHOT OF LESSON NEAR COMPLETION:

LIP-SYNCHING

Code for sound in Flash:
var snd:Sound = new Sound();
snd.load(new URLRequest("aHaon.mp3"));

var channel:SoundChannel = new SoundChannel();

snd.addEventListener(IOErrorEvent.IO_ERROR, onIOError, false, 0, true);
function onIOError(evt:IOErrorEvent):void{
trace("An error occurred when loading the sound;", evt.text);
}

//An event listener to ensure the sound only plays once fully loaded.
//(Do not call snd.play() straight after load() - that would start playback
//before the file has finished loading.)
snd.addEventListener(Event.COMPLETE, onLoadComplete, false, 0, true);
function onLoadComplete(evt:Event):void{
var localSnd:Sound = evt.target as Sound;
channel = localSnd.play();
}


Synched the character counting to 10 & added accompanying simple animations to depict meaning.
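The synching itself can be sketched as a cue list checked against the sound channel's playback position. The `mouth` MovieClip, its frame labels and the cue times below are hypothetical - they depend on the FLA and the recording:

```actionscript
// Sketch: switch mouth-shape frames in time with the playing sound.
// "channel" is the SoundChannel from the sound code above.
var cues:Array = [ {time: 0,   frame: "viseme_silence"},
                   {time: 300, frame: "viseme_AH"},
                   {time: 650, frame: "viseme_BA"} ];

addEventListener(Event.ENTER_FRAME, syncMouth);
function syncMouth(evt:Event):void {
    if (channel == null) return;
    var pos:Number = channel.position; // playback position in ms
    for (var i:int = cues.length - 1; i >= 0; i--) {
        if (pos >= cues[i].time) {
            mouth.gotoAndStop(cues[i].frame); // mouth is the mouth MovieClip
            break;
        }
    }
}
```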

Mouth Animation


http://www.gamasutra.com/features/20000406/lander_01.htm

"Early work focussed on the animation of geometrical facial models, which can be performed using predefined expression and mouth shape components [1][2][3]. The focus then shifted to physics-based anatomical face models, with animation performed using numerical simulation of muscle actions [4][5]. More recently an alternative approach has emerged which takes photographs of a real individual enunciating the different phonemes and concatenates or interpolates between them [6][7][8].

Incorrect or absent visual speech cues are an impediment to the correct interpretation of speech for both hearing and lip-reading viewers [10].

The resultant viseme images can then be integrated with a text-to-speech engine and animated to produce real-time photo-realistic visual speech.

The simplest approach to producing animated speech using visemes is to simply switch to the image that corresponds to the current phoneme being spoken. Several existing speech engines support concatenated viseme visual speech by reporting either the current phoneme or a suggested viseme. For example, the Microsoft Speech API (SAPI) will report which of the 21 Disney visemes is currently being spoken. We use a table to convert between our extended Ezzat and Poggio set and the Disney set. For slightly smoother visual speech we detect changes in viseme and form a 50/50 blend of the previous and current viseme.

Other head movements - blinking, nods and expression changes - are required for realism."
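The 50/50 blend described in the quote could be approximated in Flash by overlaying the outgoing and incoming viseme images at half alpha. The two DisplayObject parameters are assumptions (one display object per mouth shape, already on the stage):

```actionscript
// Sketch: cross-blending two viseme images as in the quoted 50/50 scheme.
function blendVisemes(prevViseme:DisplayObject, currViseme:DisplayObject):void {
    prevViseme.alpha = 0.5; // half weight for the outgoing mouth shape
    currViseme.alpha = 0.5; // half weight for the incoming one
    prevViseme.visible = true;
    currViseme.visible = true;
}
```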


Many animations are based on the Disney visemes:

http://www.gamasutra.com/features/20000406/lander_01.htm

Disney Visemes:

Phonemes and Visemes: No discussion of facial animation is possible without discussing phonemes. Jake Rodgers’s article “Animating Facial Expressions” (Game Developer, November 1998) defined a phoneme as an abstract unit of the phonetic system of a language that corresponds to a set of similar speech sounds. More simply, phonemes are the individual sounds that make up speech. A naive facial animation system may attempt to create a separate facial position for each phoneme. However, in English (at least where I speak it) there are about 35 phonemes. Other regional dialects may add more.

Now, that’s a lot of facial positions to create and keep organized. Luckily, the Disney animators realized a long time ago that using all phonemes was overkill. When creating animation, an artist is not concerned with individual sounds, just how the mouth looks while making them. Fewer facial positions are necessary to visually represent speech since several sounds can be made with the same mouth position. These visual references to groups of phonemes are called visemes. How do I know which phonemes to combine into one viseme? Disney animators relied on a chart of 12 archetypal mouth positions to represent speech as you can see in Figure 1.





Figure 1. The 12 classic Disney mouth positions.

Each mouth position or viseme represented one or more phonemes. I have seen many facial animation guidelines with different numbers of visemes and different organizations of phonemes. They all seem to be similar to the Disney 12, but also seem like they involved animators talking to a mirror and doing some guessing.

Along with the animator’s eye for mouth positions, there are the more scientific models that reduce sounds into visual components. For the deaf community, which does not hear phonemes, spoken language recognition relies entirely on lip reading. Lip-reading samples base speech recognition on 18 speech postures. Some of these mouth postures show very subtle differences that a hearing individual may not see.

http://www.generation5.org/content/2001/visemes.asp

Visemes: Representing Mouth Positions By James Matthews

SAPI provides the programmer with a very powerful feature - viseme notification. A viseme refers to the mouth position currently being "used" by the speaker. SAPI 5 uses the Disney 13 visemes:

typedef enum SPVISEMES
{
                        // English examples
                        //------------------
    SP_VISEME_0 = 0,    // silence
    SP_VISEME_1,        // ae, ax, ah
    SP_VISEME_2,        // aa
    SP_VISEME_3,        // ao
    SP_VISEME_4,        // ey, eh, uh
    SP_VISEME_5,        // er
    SP_VISEME_6,        // y, iy, ih, ix
    SP_VISEME_7,        // w, uw
    SP_VISEME_8,        // ow
    SP_VISEME_9,        // aw
    SP_VISEME_10,       // oy
    SP_VISEME_11,       // ay
    SP_VISEME_12,       // h
    SP_VISEME_13,       // r
    SP_VISEME_14,       // l
    SP_VISEME_15,       // s, z
    SP_VISEME_16,       // sh, ch, jh, zh
    SP_VISEME_17,       // th, dh
    SP_VISEME_18,       // f, v
    SP_VISEME_19,       // d, t, n
    SP_VISEME_20,       // k, g, ng
    SP_VISEME_21,       // p, b, m
} SPVISEMES;

Every time a viseme is used, the SAPI 5 engine can send your application a notification which it can use to draw the mouth position. Microsoft provides an excellent example of this with its SAPI 5 SDK, called TTSApp. TTSApp is written using the standard Win32 SDK and has additional features that bog down the code. Therefore, I created my own version using MFC that is hopefully a little easier to understand.

Using visemes is relatively simple; it is the graphical side of it that is the hard part. This is why, for demonstration purposes, I used the microphone character that was used in the TTSApp.
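In Flash there is no viseme notification, but the same idea reduces to a lookup from phoneme to viseme index. The table below is just a fragment of the SAPI mapping quoted in the enum above, and the function name is mine:

```actionscript
// Fragment of the SAPI phoneme-to-viseme mapping, as an AS3 lookup table
// (only a few of the 22 entries shown).
var phonemeToViseme:Object = {
    "ae": 1, "ax": 1, "ah": 1,
    "aa": 2,
    "s": 15, "z": 15,
    "f": 18, "v": 18,
    "p": 21, "b": 21, "m": 21
};

function visemeFor(phoneme:String):int {
    // Default to 0 (silence) for unknown phonemes.
    return phonemeToViseme.hasOwnProperty(phoneme) ? phonemeToViseme[phoneme] : 0;
}
```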

http://msdn.microsoft.com/en-us/library/ms720881(VS.85).aspx


The SpeechVisemeType enumeration lists the visemes supported by the SpVoice object. This list is based on the original Disney visemes.

Definition

Enum SpeechVisemeType
    SVP_0 = 0       'silence
    SVP_1 = 1       'ae ax ah
    SVP_2 = 2       'aa
    SVP_3 = 3       'ao
    SVP_4 = 4       'ey eh uh
    SVP_5 = 5       'er
    SVP_6 = 6       'y iy ih ix
    SVP_7 = 7       'w uw
    SVP_8 = 8       'ow
    SVP_9 = 9       'aw
    SVP_10 = 10     'oy
    SVP_11 = 11     'ay
    SVP_12 = 12     'h
    SVP_13 = 13     'r
    SVP_14 = 14     'l
    SVP_15 = 15     's z
    SVP_16 = 16     'sh ch jh zh
    SVP_17 = 17     'th dh
    SVP_18 = 18     'f v
    SVP_19 = 19     'd t n
    SVP_20 = 20     'k g ng
    SVP_21 = 21     'p b m
End Enum

Elements

Each SVP_n value is the viseme representing the phoneme group listed beside it in the enum above (SVP_0 for silence, through SVP_21 for p, b and m).

http://www.gamasutra.com/view/feature/3179/read_my_lips_facial_animation_.php?page=2

"Visemes

1. [p, b, m] - Closed lips.

2. [w] & [boot] - Pursed lips.

3. [r*] & [book] - Rounded open lips with corner of lips slightly puckered. If you look at Chart 1, [r] is made in the same place in the mouth as the sounds of #7 below. One of the attributes not denoted in the chart is lip rounding. If [r] is at the beginning of a word, then it fits here. Try saying “right” vs. “car.”

4. [v] & [f ] - Lower lip drawn up to upper teeth.

5. [thy] & [thigh] - Tongue between teeth, no gaps on sides.

6. [l] - Tip of tongue behind open teeth, gaps on sides.

7. [d,t,z,s,r*,n] - Relaxed mouth with mostly closed teeth with pinkness of tongue behind teeth (tip of tongue on ridge behind upper teeth).

8. [vision, shy, jive, chime] Slightly open mouth with mostly closed teeth and corners of lips slightly tightened.

9. [y, g, k, hang, uh-oh] - Slightly open mouth with mostly closed teeth.

10. [beat, bit] - Wide, slightly open mouth.

11. [bait, bet, but] - Neutral mouth with slightly parted teeth and slightly dropped jaw.

12. [boat] - very round lips, slight dropped jaw.

13. [bat, bought] - open mouth with very dropped jaw."

Emphasis and exaggeration are also very important in animation. You may wish to punch up a sound by the use of a viseme to punctuate the animation. This emphasis, along with the addition of secondary animation to express emotion, is key to a believable sequence. In addition to these viseme frames, you will want to have a neutral frame that you can use for pauses. In fast speech, you may not want to add the neutral frame between all words, but in general it gives good visual cues to sentence boundaries.

The list of phonemes above corresponds with the Disney visemes.

I have designed a Disney-equivalent set of visemes:



Corresponding with phonemes:

(From left to right across:)

  1. AH, H, I, A, ending TA
  2. BA, B, MMM, P
  3. C, G, J, K, S, Z, TS
  4. E
  5. O
  6. OOO
  7. Q, U
  8. R
  9. silence
  10. start D, T, THA, LA
  11. ending L, N; start WHA, Y
  12. vee, f, fah



  • I had to ensure that all elements of the face are identical in each viseme. This will aid animation later.
  • I designed the nose first and then the lips around it.
  • I decided to omit the eyes from this part of the animation as they don't particularly add to the mouth-shape guide.
  • Hopefully the character will be fully interactive (i.e. the user can choose its appearance), and so the eyes could be chosen from a range of pre-designed eyes.
  • With these visemes prepared for animation there's only one thing to do... Get animating! Synch up to sound, and deal with sound appropriately (i.e. error catching, ensuring it only plays once fully loaded, use of channels, etc.).







Extra Reading:

Prototyping and Transforming Visemes for Animated Speech
Bernard Tiddeman and David Perrett
School of Computer Science and School of Psychology,
University of St Andrews, Fife KY16 9JU.







Tuesday, August 25, 2009

Designing Interactive Character


Audio Recordings

Recording Plan:

Details of Recordings:

  • Numbers 1-20, each uttered 5 times, once with each syllable pronounced separately.

  • List of colours (white, black, blue, yellow, red, orange, green, purple, brown, grey, pink), each uttered 5 times, once with each syllable pronounced separately.

  • Animals (cow, bird, dog, cat, sheep, bee, frog, cuckoo, chicken, lion), each uttered 5 times, once with each syllable pronounced separately. Also, 3 recordings of each of the animal sounds, with variations of each, if necessary.

  • Body Parts (arms, hands, fingers, shoulders, legs, knees, feet, toes, chest, stomach, neck, head, back), each uttered 5 times, once with each syllable pronounced separately.

List of Words:

Numbers:

Aon

Dó

Trí

Ceathair

Cúig

Sé

Seacht

Ocht

Naoi

Deich

Aon Déag

Dó Dhéag

Trí Déag

Ceathair Déag

Cúig Déag

Sé Déag

Seacht Déag

Ocht Déag

Naoi Déag

Fiche

Colours:

bán

dubh

gorm

buí

dearg

oráiste

glas

corcra

donn

liath

bán-dearg

Animals & Associated Sounds:

bó moo

éan tweet

madra woof

cat meow

caora baa

beach buzz

froga croak

cuach cuckoo

sicín cluck

leon roar

Some introductions and various phrases in both English and Irish also.



------------------------------------------------------------------------------------------------------------


Recording Audio Files

University of Limerick

05.08.09

Studio Description: Recording Studio 1

An acoustically isolated Live Room of professional standard, with a clear-glass window allowing visual communication between the Live Room and the Control Room, where the producer works. With a Digidesign HD setup running ProTools HD 7.3, it was used to record 24-bit audio at 44.1 kHz in .wav format.

Mic Used: Neumann U87 Ai

It is equipped with a large dual-diaphragm capsule with three directional patterns - omnidirectional, cardioid and figure-8 - selectable with a switch below the head grille. I chose the cardioid pattern as I was in a static position throughout the entire recording. A 10 dB attenuation switch is located on the rear; it enables the microphone to handle sound pressure levels up to 127 dB without distortion. The U 87 Ai can be used as a main microphone for orchestra recordings, as a spot mic for single instruments, and extensively as a vocal microphone for all types of music and speech. As can be seen from the accompanying images, a pop shield was used in conjunction with the U 87 Ai. These are typically used in recording studios and serve as a noise-protection filter for microphones, preventing interference and protecting the mic from saliva.

Post-Production

· The files were recorded in the lossless .wav format using ProTools HD 7.3. It was necessary in post-production to convert all the recordings to .mp3, as Flash will not load .wav files at runtime; uncompressed WAV PCM files are also large and use up unnecessary hard-drive (or, in this case, server) space. Using .mp3 files here saves on both space and bandwidth.

"Although there are various sound file formats used to encode digital audio, ActionScript 3.0 and Flash Player support sound files that are stored in the mp3 format. They cannot directly load or play sound files in other formats like WAV or AIFF."

http://livedocs.adobe.com/flash/9.0/main/wwhelp/wwhimpl/common/html/wwhelp.htm?context=LiveDocs_Parts&file=00000284.html



· It was then necessary to edit and then organize the audio files into folders. This was done using Audacity.

--------------------------------------------------------------------------------------------------------------------------------

SCREENSHOTS



The audio files were recorded on one track & separated by category during the recording process.














The format & encoding options were chosen once the recordings had been made and the files were being bounced.