Cross Audio Visual Recognition – Lip Reading

Varsha C. Bendre, Prabhat Kumar Singh, Rohit Anand, Mayuri K. P.


Lip reading is the task of decoding text from the movement of a speaker's mouth. The task has two stages: designing or learning the visual features, and prediction; the system learns spatiotemporal visual features together with a sequence model. The three dominant model families used for lip reading are Convolutional Neural Networks (CNNs), LSTMs, and reinforcement learning. The one-to-many mapping between visemes and phonemes makes predicting words and phrases ambiguous. A 3-D convolutional model is used for cross audio-visual recognition. Since new technologies can improve communication with deaf people, this project collects random videos, which may be noisy or have low-quality audio, and maps them to words and sentences. The project builds on applications of 3-D Convolutional Neural Networks and reinforcement learning.
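The spatiotemporal feature extraction described above can be illustrated with a minimal sketch: a 3-D convolution slides a kernel over a stack of mouth-region frames, so each output value mixes information across both space and time. This is a toy NumPy implementation for illustration only; the `conv3d` helper and the input sizes are assumptions, not the paper's actual architecture.

```python
import numpy as np

def conv3d(volume, kernel):
    """Valid 3-D convolution (cross-correlation) of a (time, height, width)
    volume with a smaller kernel. Hypothetical helper for illustration."""
    T, H, W = volume.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                # Each output cell sees a 3-frame, 5x5-pixel neighbourhood,
                # i.e. a spatiotemporal receptive field.
                out[i, j, k] = np.sum(volume[i:i+t, j:j+h, k:k+w] * kernel)
    return out

# Toy input: 9 grayscale mouth-region frames of 20x20 pixels (assumed sizes).
frames = np.random.rand(9, 20, 20)
kernel = np.random.rand(3, 5, 5)   # spans 3 frames -> temporal as well as spatial
features = conv3d(frames, kernel)
print(features.shape)  # (7, 16, 16)
```

In a real network many such kernels are learned per layer (e.g. `torch.nn.Conv3d`), but the sliding-window arithmetic is the same: the temporal extent of the kernel is what lets the model capture lip motion rather than single-frame appearance.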








© International Journals of Advanced Research in Computer Science and Software Engineering (IJARCSSE)| All Rights Reserved | Powered by Advance Academic Publisher.