Windows Speech Recognition (WSR) is a speech recognition component developed by Microsoft for Windows Vista that enables the use of voice commands to control the desktop user interface; dictate text in electronic documents, forms and email; navigate websites; perform keyboard shortcuts; operate the mouse cursor; and create macros to perform additional tasks.
WSR is a locally processed speech recognition platform; it does not rely on cloud computing for accuracy, dictation, or recognition services, but instead adapts based on a user's continued input, grammar, speech samples, training sessions, and vocabulary. For dictation, it provides a personal dictionary that allows users to include or exclude words or expressions and to optionally record pronunciations to increase recognition accuracy. With Windows Search, WSR can also optionally analyze and collect text in email and documents, as well as handwritten input on a tablet PC, to contextualize and disambiguate user terms and so further adapt and personalize the recognizer. It also supports custom language models that adapt the recognizer to the context, phonetics, and terminology of users in particular occupational fields such as legal or medical.
WSR was developed to be an integrated component of Windows Vista, as the Windows operating system previously only included support for speech recognition capabilities that were limited to individual applications such as Windows Media Player. Microsoft Office XP introduced speech recognition, but this support was limited to Internet Explorer and Office. In Windows Vista, the majority of integrated applications can be controlled through speech, and Office 2007 and later versions rely on WSR, replacing the previously separate Office speech recognition functionality.
WSR relies on the Speech API developed by Microsoft, and third-party applications must support the Text Services Framework. It is also present in Windows 7, Windows 8, Windows 8.1, and Windows 10.
History
Precursors
Microsoft has been involved in speech recognition and speech synthesis research for many years. In 1993, Microsoft hired Xuedong Huang from Carnegie Mellon University to lead its speech development efforts. The company's research eventually led to the development of the Speech API introduced in 1994. Speech recognition was also used in Microsoft's products prior to WSR. Versions of Microsoft Office including Office XP and Office 2003 provided speech recognition capabilities in Internet Explorer and Office applications; installation of Office would also enable limited speech functionality in Windows NT 4.0, Windows 98 and Windows ME. The 2002 Tablet PC Edition of Windows XP included speech recognition with the Tablet PC Input Panel, and the Microsoft Plus! for Windows XP expansion package also enabled voice commands to be used in Windows Media Player. However, this support was limited to individual applications, and before Windows Vista, Windows did not include integrated speech recognition capabilities.
Development
Windows Vista
At the 2002 Windows Hardware Engineering Conference (WinHEC 2002), Microsoft announced that Windows Vista, then known by its codename "Longhorn," would include advances in speech recognition and features such as support for microphone arrays; these features were part of the company's goal to "provide a consistent quality audio infrastructure for natural (continuous) speech recognition and (discrete) command and control." Bill Gates expanded upon this information during the 2003 Professional Developers Conference (PDC 2003), where he stated that Microsoft would "build speech capabilities into the system -- a big advance for that in 'Longhorn,' in both recognition and synthesis, real-time." Pre-release builds throughout the development of Windows Vista included a speech engine with training features. A PDC 2003 developer presentation dedicated to user input stated that Windows Vista would also include a user interface for microphone feedback and control, as well as separate user configuration and training features. Microsoft later clarified its intent when it stated in a pre-release software development kit that "the common speech scenarios, like speech-enabling menus and buttons, will be enabled system-wide."
During WinHEC 2004, Microsoft listed WSR as part of its "Longhorn" mobile PC strategy to improve productivity. At WinHEC 2005, Microsoft emphasized accessibility, new mobility scenarios, and improvements to the speech user experience and revealed that, unlike the speech support included in Windows XP, which was integrated with the Tablet PC Input Panel and required switching between separate Commanding and Dictation modes, Windows Vista would introduce a dedicated interface for speech input on the desktop and unify the previously separate speech modes. In previous versions of Windows, users could not speak a command after dictating or vice versa without first switching between these two modes. Microsoft also stated that speech recognition in Windows Vista would improve dictation accuracy and support additional languages and microphone arrays. A demonstration at WinHEC 2005 focused on e-mail dictation with correction and editing commands, and a presentation about microphone arrays was also shown. Windows Vista Beta 1 later included an integrated speech recognition application. To incentivize company employees to analyze WSR for software glitches and provide feedback during its development, Microsoft offered an opportunity for testers to win a Premium model of its Xbox 360 video game console.
On July 27, 2006, before the operating system's release to manufacturing (RTM), a notable incident involving WSR occurred during a demonstration by Microsoft, which resulted in an unintended output of "Dear aunt, let's set so double the killer delete select all" when several attempts to dictate led to consecutive output errors; the incident was a subject of significant derision among analysts and journalists in the audience. Microsoft later revealed that these issues were due to an audio gain glitch that caused the speech recognizer to distort the dictated words. The glitch was fixed before Windows Vista's release.
Security report
In early 2007, reports surfaced that WSR might be vulnerable to an attack in which audio played through a target computer's speakers issues voice commands that perform undesired operations; it was the first vulnerability discovered after Windows Vista's general availability. Microsoft stated that such an attack is theoretically possible but would have to meet a number of prerequisites to succeed: the target system would need to have the speech recognition feature properly configured and activated; speakers and microphone(s) connected to the targeted system would need to be turned on; and the exploit would require the software to interpret commands without a user noticing, an unlikely scenario because the affected system would perform visible interface operations and produce audible feedback. Mitigating factors would include dictation clarity, and microphone feedback and placement. Because of User Account Control, an exploit of this nature would not be able to perform privileged operations for users or protected administrators without explicit consent.
Windows 7
With Windows 7, Microsoft introduced several changes to improve the user experience. The recognizer was updated to use Microsoft UI Automation--substantially enhancing its performance--and the recognition engine was updated to use the WASAPI audio stack, which enables support for microphone arrays and echo cancellation. The document harvester, which optionally analyzes and collects text in email and documents to contextualize and disambiguate user terms, was updated to run periodically in the background instead of only after starting the recognizer, and its performance was improved. The sleep mode of WSR also received performance improvements and, to address security issues, Windows 7 introduced a new "voice activation" option--enabled by default--that turns the recognizer off after users speak "stop listening" instead of putting it to sleep.
For applications that are not compatible with the Text Services Framework, Windows 7 introduces an optional dictation scratchpad interface that functions as a temporary document into which users can dictate or type text for insertion into these applications. In Windows Vista, WSR instead provided an "enable dictation everywhere" option.
Windows 7 also introduces an option to submit speech training data to Microsoft to improve future speech recognizer versions.
Overview and features
WSR allows a user to control a computer, including the operating system desktop user interface, through voice commands. Applications, including most of those bundled with Windows, can also be controlled through voice commands. By using speech recognition, users can dictate text within documents and e-mail messages, fill out forms, control the operating system user interface, perform keyboard shortcuts, and move the mouse cursor.
WSR uses a speech profile to store information about a user's voice. Accuracy of speech recognition increases through use, which helps the feature adapt to a user's grammar, speech patterns, vocabulary, and word usage. Speech recognition also includes a tutorial to improve accuracy, and can optionally review a user's personal documents, including e-mail messages, to improve its command and dictation accuracy. Individual speech profiles can be created on a per-user basis, and backups of profiles can be performed via Windows Easy Transfer. WSR supports the following languages: Chinese (Traditional), Chinese (Simplified), English (U.S.), English (U.K.), French, German, Japanese, and Spanish.
Interface
The WSR interface consists of a status area that shows instructions, information about commands (e.g., if a command is not heard by the recognizer), and the status of the speech recognizer; a voice meter also displays visual feedback about volume levels. The status area shows which of three modes WSR is currently in; the modes and their meanings are listed below:
- Listening: The speech recognizer is active and waiting for user input
- Sleeping: The speech recognizer will not listen for or respond to commands other than "Start listening"
- Off: The speech recognizer will not listen or respond to any commands; this mode can be enabled by speaking "Stop listening"
The status area can also display custom user information as part of Windows Speech Recognition Macros.
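The three modes above can be read as a small state machine. The following is a minimal sketch of that reading, not WSR's actual implementation: the names Mode and next_mode are hypothetical, and the stop_turns_off flag is an assumption modeled on the "voice activation" option described for Windows 7, under which "stop listening" turns the recognizer off rather than putting it to sleep.

```python
from enum import Enum


class Mode(Enum):
    LISTENING = "Listening"
    SLEEPING = "Sleeping"
    OFF = "Off"


def next_mode(mode, phrase, stop_turns_off=True):
    """Return the mode after `phrase` is spoken (hypothetical helper).

    stop_turns_off is an assumed switch: True sends "stop listening"
    to Off, False sends it to Sleeping.
    """
    spoken = phrase.strip().lower()
    if mode is Mode.OFF:
        # Off: the recognizer does not respond to any spoken command.
        return Mode.OFF
    if mode is Mode.SLEEPING:
        # Sleeping: only "Start listening" is honored.
        return Mode.LISTENING if spoken == "start listening" else Mode.SLEEPING
    # Listening: every phrase is processed; only "stop listening" changes mode.
    if spoken == "stop listening":
        return Mode.OFF if stop_turns_off else Mode.SLEEPING
    return Mode.LISTENING
```

In this sketch, a recognizer in the Off mode can only be reactivated by some non-voice action, which the function deliberately does not model.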
Alternates panel
A disambiguation alternates panel interface displays a list of items interpreted as being relevant to a user's spoken word(s); if the word or phrase that a user desired to insert into an application is listed among results, a user can speak the corresponding number of the word or phrase in the results and confirm this choice by speaking "OK" to insert it within the application.
The alternates panel will also appear when launching applications or speaking commands that refer to more than one item (e.g., speaking "Start Internet Explorer" may list the web browser and a version of it with browser add-ons disabled). However, an ExactMatchOverPartialMatch Windows Registry entry can limit commands to items with exact names if there is more than one instance included in results.
Common commands
Listed below are common WSR commands. Some words in a command are variables that stand for a desired item (e.g., the word "direction" in the "scroll direction" command can be substituted with the word "down" to scroll down). A "start typing" command enables WSR to interpret all dictation commands as keyboard shortcuts.
- Dictation commands: "New line," "new paragraph," "tab," "literal word," "numeral number," "go to word," "go after word," "no space," "go to start of sentence," "go to end of sentence," "go to start of paragraph," "go to end of paragraph," "go to start of document," "go to end of document," "go to field name" (e.g., go to address, cc, or subject). Special characters, such as a comma, can be dictated simply by stating the name of the special character.
- Navigation commands:
- Keyboard shortcuts: "Press keyboard key," "press Shift plus a," "press capital b."
- Keys that can be pressed without first giving the press command include: Backspace, Delete, End, Enter, Home, Page Down, Page Up, and Tab.
- Mouse commands: "Click," "click that," "double-click," "double-click that," "mark," "mark that," "right-click," "right-click that," "mousegrid."
- Window management commands: "Close (alternatively maximize, minimize, or restore) window," "close that," or "close application name," "switch applications," "switch to application name," "scroll direction," "scroll direction in number of pages," "show desktop," "show numbers."
- Speech recognition commands: "Start listening," "stop listening," "show speech options," "open speech dictionary," "move speech recognition," "minimize speech recognition." A list of applicable commands can be shown by speaking "What can I say?" This command is currently only available in English. Users can also query the recognizer about tasks in Windows by speaking "How can I task name," which opens related help documentation.
Mousegrid
A mousegrid command enables users to control the mouse cursor by overlaying numbers across nine regions on the screen; these regions gradually narrow as a user speaks the number(s) of the region on which to focus until the desired interface element is reached. The regions with which a user can interact are based on commands including "click number of region," which moves the mouse cursor to the desired region and then clicks it; and "mark number of region", which allows individual items (such as a computer icon) in a region to be selected--these items can then be clicked with the previous click command. A user can also simultaneously interact with multiple regions of the mousegrid.
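The narrowing behavior described above can be sketched as repeated subdivision of a rectangle into a three-by-three grid. This is an illustrative sketch only: the function names narrow and mousegrid_target are hypothetical, and the numbering of regions (1 to 9, left to right and top to bottom) is an assumption rather than something stated in this article.

```python
def narrow(region, number):
    """Return the sub-region of `region` selected by a spoken number 1-9.

    `region` is (left, top, width, height). Regions are assumed to be
    numbered left to right, top to bottom.
    """
    if not 1 <= number <= 9:
        raise ValueError("region number must be between 1 and 9")
    left, top, width, height = region
    row, col = divmod(number - 1, 3)
    cell_w, cell_h = width / 3, height / 3
    return (left + col * cell_w, top + row * cell_h, cell_w, cell_h)


def mousegrid_target(screen_w, screen_h, spoken_numbers):
    """Return the center point of the region reached by the spoken numbers."""
    region = (0.0, 0.0, float(screen_w), float(screen_h))
    for number in spoken_numbers:
        region = narrow(region, number)
    left, top, width, height = region
    return (left + width / 2, top + height / 2)
```

For example, on a 900 by 900 pixel screen, speaking "1" and then "9" narrows the grid to the top-left ninth of the screen and then to the bottom-right ninth of that region. Each spoken number shrinks the active region to one ninth of its area, so a few numbers are enough to reach a small interface element.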
Show numbers
Applications and interface elements that do not present identifiable commands can still be controlled by asking the system to overlay numbers on top of them through a show numbers command. Once active, speaking the overlaid number selects that item so a user can open it or perform other operations. Show numbers was designed so that users could interact with items that are not readily identifiable.
Dictation
WSR enables dictation of text in the operating system and applications. If a dictation mistake occurs, it can be corrected by speaking "correct word" or "correct that," after which the alternates panel appears and provides suggestions for correction; a suggestion can be selected by speaking its number in the list and then speaking "OK." If the desired item is not listed among suggestions, a user can speak it so that it might appear. Alternatively, users can speak "spell it" or "I'll spell it myself" to speak the desired item on a per-letter basis; users can use their personal alphabet or the NATO phonetic alphabet when spelling. Multiple words in a sentence can be corrected simultaneously (for example, if a user speaks "dictating" but the recognizer interprets this word as "the thing," a user can state "correct the thing" to correct both words). In the English language, over 100,000 words are recognized by default.
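Per-letter spelling with the NATO phonetic alphabet amounts to mapping each spoken code word to its initial letter. The sketch below illustrates that idea only; it does not describe how WSR itself processes spelling, the name spell is hypothetical, and accepting the spelling variants "alpha," "juliet," and "xray" is an assumption made for convenience. Personal alphabets are not modeled.

```python
# Standard NATO code words; each one stands for its first letter.
NATO_WORDS = (
    "alfa bravo charlie delta echo foxtrot golf hotel india juliett "
    "kilo lima mike november oscar papa quebec romeo sierra tango "
    "uniform victor whiskey x-ray yankee zulu"
).split()

NATO_TO_LETTER = {word: word[0] for word in NATO_WORDS}
# Assumed common spelling variants of three code words.
NATO_TO_LETTER.update({"alpha": "a", "juliet": "j", "xray": "x"})


def spell(tokens):
    """Turn spoken tokens (code words or single letters) into a word."""
    letters = []
    for token in tokens:
        key = token.strip().lower()
        if key in NATO_TO_LETTER:
            letters.append(NATO_TO_LETTER[key])
        elif len(key) == 1 and key.isalpha():
            # A plain letter name is taken as that letter.
            letters.append(key)
        else:
            raise ValueError(f"unrecognized spelling token: {token!r}")
    return "".join(letters)
```

For instance, the tokens "delta," "india," "charlie," and "tango" would yield the letters d, i, c, and t.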
Speech dictionary
WSR includes a personal dictionary that allows users to include or exclude certain words or expressions from dictation. When a user adds a word beginning with a capital letter to the dictionary, a user can specify whether it should always be capitalized during dictation or if capitalization depends on the context where the word is spoken. Users can also record pronunciations for words added to the dictionary to increase recognition accuracy; words written via a stylus on a tablet PC for the Windows handwriting recognition feature are also stored. Most of the information stored within a dictionary is included as part of a user's speech profile.
Macros
WSR supports custom macros through a supplementary application by Microsoft that enables additional natural language commands. As an example of this functionality, an e-mail macro released by Microsoft enables a natural language command where a user can state "send e-mail to contact about subject," which opens Microsoft Outlook to compose a new message with the designated contact and subject automatically inserted within the application. Microsoft has also released sample macros for the speech dictionary, for Windows Media Player, for Microsoft PowerPoint, for speech synthesis, to switch between multiple microphones, to customize various aspects of audio device configuration such as volume levels, and for general natural language queries such as "What is the weather forecast?" "What time is it?" and "What's the date?" Answers to these queries are spoken via a speech synthesizer.
Users and developers can create their own macros that can be based on text transcription and substitution; application execution (with support for command-line arguments); keyboard shortcuts; emulation of existing voice commands; or a combination of these items. XML, JScript and VBScript are supported. Macros can be limited to individual applications if desired, and rules for macros can be defined programmatically. In order for a macro to be loaded, it must be stored within a Speech Macros folder within the current user's Documents directory. All macros are digitally signed by default if a user certificate is available, to ensure that created commands are not loaded or tampered with by third parties; if one is not available, an administrator can create a certificate for use. The macros utility also includes security levels to prohibit unsigned macros from being loaded, to prompt users to sign macros, and to load unsigned macros if a user so chooses.
Performance
As of 2017, WSR uses Microsoft Speech Recognizer 8.0, which has not been changed since Windows Vista. Mark Hachman, a senior editor of PC World, found it to be 93.6% accurate for dictation without training, a lower rate than that of competing software; according to Microsoft, however, the accuracy rate is 99% when WSR is properly trained. Hachman commented that Microsoft does not publicly discuss WSR, attributing this to the 2006 demonstration incident during the development of Windows Vista, and that few users knew documents could be dictated within Windows before the introduction of other offerings such as Cortana.
See also
- List of speech recognition software
- Microsoft Narrator
- Microsoft Voice Command
- Technical features new to Windows Vista
- Windows HotStart
- Windows Mobility Center
- Windows SideShow
External links
- Windows Vista Speech Recognition demonstration at Microsoft Financial Analyst Meeting
Source of article : Wikipedia