Azure Cognitive Services
Customer Feedback & Ideas for Azure Cognitive Services
Catch up on the latest News and Updates
Share your Ideas and Feedback
To share your ideas on how we can make Cognitive Services better, click one of the categories underneath "Give Feedback" located in the sidebar menu to access the forum.
Documentation
API documentation is available here.
Contact Support
UserVoice is intended for product feedback. If you need product support, please contact Azure support (https://azure.microsoft.com/en-us/support/plans/) or ask a question on Stack Overflow (https://stackoverflow.com/questions/tagged/microsoft-cognitive).
Become a Cloud Design Insider!
Join Cloud Design Insiders, and help shape the future of Cognitive Services! As an insider, you’ll speak with program managers, designers & researchers, see new designs and ideas, provide feedback through surveys, and try out prototypes. Take the short survey to join the Cloud Design Insiders now, and we’ll see you in the community.
-
Generate accurate audio clip for each utterance
Getting an audio clip for each utterance will make it possible to generate a basis for a human-labeled transcript for training a custom model. This will make it possible to gradually improve the recognition accuracy after every "session", by checking the transcription and the corresponding audio clip and fixing the text for incorrect transcriptions.
Additionally the audio clip can be used as a live read-back of the original audio.
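The clipping step the post describes can be sketched client-side. This is a minimal sketch, not part of the service: it assumes the recognizer reports each utterance's offset and duration in 100-nanosecond ticks (as the Speech SDK's result objects do) and that the source audio is a plain WAV file.

```python
import wave

TICKS_PER_SECOND = 10_000_000  # Speech SDK offsets/durations are in 100-ns ticks

def clip_utterance(src_path, dst_path, offset_ticks, duration_ticks):
    """Cut one recognized utterance out of the original WAV file."""
    with wave.open(src_path, "rb") as src:
        rate = src.getframerate()
        start_frame = int(offset_ticks / TICKS_PER_SECOND * rate)
        frame_count = int(duration_ticks / TICKS_PER_SECOND * rate)
        src.setpos(start_frame)
        frames = src.readframes(frame_count)
        with wave.open(dst_path, "wb") as dst:
            dst.setnchannels(src.getnchannels())
            dst.setsampwidth(src.getsampwidth())
            dst.setframerate(rate)
            dst.writeframes(frames)
```

Each clip could then be paired with its recognized text to build the human-labeled training set the post describes.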
6 votes -
Speaker diarization for more than 2 speakers
Speaker diarization for more than 2 speakers.
I don't feel this should be marked as resolved. I would expect support for at least 10 speakers. Additionally, it's currently really poor and switches between speaker 1 and 2 almost randomly. Please make this more intelligent. It's a deal breaker for us and, I'm sure, for many others, especially considering the Google alternative can handle unlimited speakers and is far more accurate at identifying them.
https://cloud.google.com/speech-to-text/docs/multiple-voices
And no... expecting a sample to train it for each voice is not an option. We literally just need it to assign a number…
6 votes -
Need the new metric to check the number of characters used for text to speech on the Azure portal
We need to be able to check the number of characters used for text to speech.
Under the Metrics tab on the Azure portal, we can only see the number of requests that have been made.
5 votes -
Audio Offset / Duration for Best Result on normalized words
The JSON and/or result object needs to have the offset and duration of the whole normalized word.
I've reviewed the JSON and it still doesn't solve the problem. I need to know the relationship of the DisplayText words to the word timings in the detail. When the DisplayText outputs "007" and the word timings output "double", "oh", "seven" as three different words, I don't know that "007" equals those three words, as there is no reference. There needs to be a display-word reference to the audio word to track the offset/duration of an underlying audio file. The only option that…
3 votes -
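The mapping the post above asks for can be sketched. Assuming the detailed output's word list carries `Word`, `Offset`, and `Duration` keys in 100-ns ticks (as the detailed recognition result does), and assuming the display-to-lexical alignment is already known, the combined timing of a run of lexical words that renders as one display token would be:

```python
def span_timing(words, start, end):
    """Combined offset/duration for lexical words[start:end] that render
    as a single display token (e.g. "double oh seven" -> "007")."""
    first, last = words[start], words[end - 1]
    offset = first["Offset"]
    duration = last["Offset"] + last["Duration"] - offset
    return {"Offset": offset, "Duration": duration}
```

The hard part, which only the service can solve, is producing the `start`/`end` alignment itself; this sketch only shows what the exposed result would need to contain.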
Site banner when there is a known issue
Twice now the Speech portal has been broken by the owning Product Group.
Twice now I have wasted hours of my time as well as MS support personnel time trying to debug something only to find out that the portal (and associated APIs) were broken and it was known by the group.
Twice now the fix has been weeks in the deploying, so god knows how many other customers' time has been wasted.
If you have a known issue that affects your customers, especially given the woeful error messaging on the portal, then please add a banner on the…
2 votes -
Improve the Speech Studio Text Editor.
Being able to change the font type, color, and size, and even highlight text with colors in the text editor, would be very practical.
1 vote -
Dictionary function in Speech Studio to ignore words.
Add a dictionary function to Speech Studio that allows ignoring or changing the pronunciation of a word throughout the whole document. That is, once a word is added to the dictionary, it is not read even if it appears 100 times in the same document, without having to mark it one by one.
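Until such a feature exists in Speech Studio, the "ignore" half of the request can be approximated client-side. This is a hedged sketch, not a Speech Studio API: it strips every whole-word occurrence of each ignored word from the text before synthesis.

```python
import re

def apply_ignore_list(text, ignore_words):
    """Remove every whole-word occurrence of each ignored word,
    then collapse the leftover whitespace."""
    for word in ignore_words:
        text = re.sub(r"\b" + re.escape(word) + r"\b", "", text)
    return re.sub(r"\s{2,}", " ", text).strip()
```

Changing a pronunciation (rather than dropping the word) would instead require substituting an SSML `<phoneme>` or `<sub>` element for each occurrence, which this sketch does not attempt.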
1 vote -
Add speech profiles in Speech Studio.
Add the option of saving voice profiles for dialogues. These profiles would include voice, tone, rate, volume, and intonation. Then, to apply a profile, you would select the desired text and press the profile, and all the aforementioned values would apply.
1 vote -
Enterprise pricing for Speech to Text and Speech to Text Neural would extend the current pricing for large-volume users.
Enterprise pricing for Speech to Text and Speech to Text Neural would extend the current pricing for large-volume users. We have clients that currently use hundreds of millions of characters with traditional data capture methods, and we see the current pricing as not addressing their enterprise client market.
1 vote -
JavaScript support for keyword recognition
Not necessarily an idea, and please let me know if this is not the right place for this, but it would be great to have the JavaScript SDK support Keyword Recognition (specifically Custom Keywords).
1 vote -
Actionable Error Messaging in Speech Portal
When a Dataset upload fails the error messaging is literally "Failed" and clicking on the Dataset displays "Failed to upload data. Please check your data format and try to upload again."
This is not actionable error messaging. I have checked the data multiple times. I have been uploading this data, with additions, using an automated process for a year without issue.
Tell us why it failed. Give us a hint. I have 15,000 files and entries in the Trans.txt file. "It failed" is not useful information. Especially when it could easily be a problem server-side and Microsoft provides no validation…
1 vote -
I'd Like To Use C++ To Create An Exercise App With Voice Commands
I have a degree in Exercise and Sports Science and just got into coding 8 months ago, and I now want to create an exercise app that uses voice commands to run the app. For example, I'd like the user to be able to use the commands "What's today's workout?", "What's the first exercise?", "What's the next exercise?", and "I'm finished with the workout." I've used C++ before for simple projects, but I've never used it to create an app with voice commands. I'd really like to start from scratch and have somebody guide me through a Teams meeting…
1 vote -
Speaking some letters such as "A" and "E" using the English neural voices sounds bad.
The sound of the neural voices speaking some single letters, such as when reading multiple-choice test options, does not match other utterances by the same voice. One particular example: the letter "A" sounds very, very short and also lower in volume than "B", "C", and "D". It sticks out like a sore thumb, especially since much of the rest of the utterances of words and sentences sounds so good. Compare the single-letter utterances of Guy (neural) with Noah and you will find the latter are much more natural sounding and…
1 vote -
iOS Speech SDK: 'SPXDialogServiceConnector' class is missing
With ref. to https://github.com/Azure-Samples/cognitive-services-speech-sdk/issues/860#issuecomment-726436315 raising it here.
Missing Wrapper Class:
Connection to Bot service using 'SPXDialogServiceConnector' class is unavailable in iOS Speech SDK.
Note: It is available for the Windows SDK and Android SDK.
Alternative: Developers need to write their own Objective-C++ wrapper to utilize the core C++ SDK class.
If it were available natively in the iOS Speech SDK, no one would have to write their own wrapper!
And this is timely, since the SDK has much more potential when we connect Speech to the Bot service.
1 vote -
Azure TTS bug: <prosody rate="100%"> not handled correctly
Problem you have encountered:
<prosody rate="108%"> does not work as per the W3C spec for SSML.
Neither does <prosody rate="100%">.
Both result in the TTS being spoken at about twice the normal rate, which is not right.
What you expected to happen:
I expect the speaking rate to be DEFAULT with rate="100%", as the W3C spec your documentation references at: https://www.w3.org/TR/speech-synthesis/#S3.3.2
literally says: "For example, a value of 100% means no change in speaking rate."
However, if we instead used <prosody rate="+100%"> (with a '+'), THEN the speed should be doubled. The "+" and "-" are critical for relative…
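For reference, a minimal SSML fragment contrasting the two forms per the W3C spec the post cites (the voice name here is only illustrative):

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-US-GuyNeural">
    <!-- Unsigned percentage: an absolute multiplier.
         100% means no change in speaking rate. -->
    <prosody rate="100%">Spoken at the default rate.</prosody>
    <!-- Signed percentage: a relative change.
         +100% adds 100%, doubling the rate. -->
    <prosody rate="+100%">Spoken at twice the default rate.</prosody>
  </voice>
</speak>
```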
1 vote