Evaluation in Azure AI Studio: A Comprehensive Guide

Azure AI Studio, a product of Microsoft, is a comprehensive platform designed to simplify the process of building, training, and deploying machine learning models. It provides a user-friendly interface and a wide range of tools and features for both beginners and experienced data scientists.

Building AI tools isn't just about creating the model; it's also about how well the model works and how safe it is. That's where Evaluation in Azure AI Studio comes in. Imagine Azure AI Studio as a big toolbox that helps you build, train, and use AI models, even if you're new to data science.

Evaluation plays a crucial role in the lifecycle of AI applications. It provides a systematic way to assess the performance and safety of these applications. Evaluation in Azure AI Studio is not just about measuring the accuracy of a model, but also about understanding its behavior, identifying potential risks, and ensuring that it aligns with ethical and fairness principles. This process helps improve the model’s performance, enhance its reliability, and build trust.

In the following sections, we will explore evaluation in Azure AI Studio, discussing its various aspects and how they contribute to developing robust and reliable AI applications. We look more closely at model behavior in the context of both Generative AI Evaluation and Machine Learning Model Evaluation, including how to identify potential risks and check alignment with ethical and fairness principles. Ultimately, this leads to better model performance, greater reliability, and trust in your AI application.

Stay tuned for an insightful journey into the world of AI evaluation in Azure AI Studio.

What is Evaluation in Azure AI Studio?

Azure AI Studio offers a range of tools to help you evaluate the performance and safety of your generative AI models. Evaluation is a crucial step in the development process, allowing you to identify strengths and weaknesses in your model before deploying it for real-world use.

Why Evaluation Matters

Here are some key reasons why evaluation is essential in Azure AI Studio:

Improved Model Performance: Evaluation helps you identify areas where your model can be improved. You can pinpoint weaknesses and enhance accuracy, relevance, or other metrics.
Bias Detection: Generative models can inherit biases from the data they are trained on. Evaluation helps you uncover potential biases and allows you to take steps to mitigate them, ensuring fairer outcomes.
Model Safety and Trust: It's important to ensure that your model generates safe and appropriate outputs. Evaluation helps identify potential risks such as generating harmful content or outputs containing sensitive information.
Informed Decisions: Evaluation results provide valuable insights to guide decisions about deploying your model. You can assess its readiness for real-world scenarios and make informed choices about potential limitations or risks.

Evaluation Approaches in Azure AI Studio

In Azure AI Studio, you can evaluate the performance and safety of your AI models using two methods:

Metric Evaluation
Manual Evaluation

Metric Evaluation

Metric evaluation in Azure AI Studio is a process that allows you to assess the performance and safety of your generative AI models. It involves selecting appropriate metrics, setting up test data, and interpreting the results. Metric evaluation shows how the model performs and whether it meets the required standards.

Setting up a metric evaluation in Azure AI Studio involves several steps:

STEP 1. Navigate to the playground and select the "Evaluation" option.

STEP 2. Click "+ New Evaluation" to start a new evaluation process.

STEP 3. Now, enter the basic information:

Evaluation Name: Provide a unique name for this evaluation.

Evaluation Scenario: Specify the type of scenario you are evaluating. This could be one of the following:

Question and answer with context
Question and answer without context
Conversation with context

Select a Flow to evaluate (Optional): If applicable, select a flow for the evaluation.

After entering the details, click ‘Next’.

STEP 4. Select Metrics:

Performance and Quality Metrics curated by Microsoft: Choose from metrics curated by Microsoft such as

Groundedness
Relevance
Coherence
Fluency
GPT Similarity
F1 Score
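To give a feel for what a metric such as F1 Score measures, here is a minimal sketch of a word-overlap F1 between a model answer and a ground-truth answer. It is an illustration only, not Azure AI Studio's implementation: the function name `token_f1` and the lowercase, whitespace-based tokenization are assumptions made for this example.

```python
from collections import Counter


def token_f1(answer: str, ground_truth: str) -> float:
    """Word-overlap F1 between a model answer and a reference answer.

    Assumption: tokens are lowercase, whitespace-separated words.
    """
    answer_tokens = answer.lower().split()
    truth_tokens = ground_truth.lower().split()
    if not answer_tokens or not truth_tokens:
        return 0.0

    # The multiset intersection counts each shared word at most as often
    # as it appears in both texts.
    shared = sum((Counter(answer_tokens) & Counter(truth_tokens)).values())
    if shared == 0:
        return 0.0

    precision = shared / len(answer_tokens)  # share of answer words that match
    recall = shared / len(truth_tokens)      # share of reference words covered
    return 2 * precision * recall / (precision + recall)


print(token_f1("Paris", "The capital of France is Paris"))
```

A short answer that contains only the key word scores high on precision but low on recall, which is why F1 tends to reward answers that are phrased close to the ground truth.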

Select the Connection and Deployment Model

Select the appropriate connection and the deployment model for the evaluation.

Risk and Safety Metrics curated by Microsoft: Select from metrics curated by Microsoft that measure risk and safety, including

Self-harm-related content
Hateful and unfair content
Violent content
Sexual content

You can also set the threshold to calculate the defect rate:

Very Low
Low
Medium
High
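As a rough illustration of how a threshold turns per-response severity labels into a defect rate, consider the sketch below. It assumes that a response counts as a defect when its severity is at or above the chosen threshold; that rule, the function name `defect_rate`, and the sample labels are assumptions for this example rather than a description of the service's exact behavior.

```python
# Severity levels in increasing order, as listed in the threshold options.
SEVERITY_ORDER = ["Very Low", "Low", "Medium", "High"]


def defect_rate(severities: list[str], threshold: str) -> float:
    """Fraction of responses whose severity is at or above the threshold.

    Assumption: "at or above" is the defect rule used for this sketch.
    """
    if not severities:
        return 0.0
    cutoff = SEVERITY_ORDER.index(threshold)
    defects = sum(1 for s in severities if SEVERITY_ORDER.index(s) >= cutoff)
    return defects / len(severities)


# Hypothetical severity labels for four evaluated responses.
labels = ["Very Low", "Low", "Medium", "High"]
print(defect_rate(labels, threshold="Medium"))
print(defect_rate(labels, threshold="High"))
```

Lowering the threshold makes the check stricter, because more responses are counted as defects.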

STEP 5. Configure Test Data:

Select the test data to evaluate: Azure AI Studio provides two options for your dataset:

Use Existing Dataset: If you have a pre-existing dataset that fits your needs, you can use it for the evaluation.
Add Your Dataset: If you have a custom dataset that you’d like to use, Azure AI Studio allows you to add it to the platform.

Dataset Mapping for Evaluation: Map the fields in your dataset to the following:

Answer: The response to the question generated by the model. This should be a string.
Ground Truth: The response to the question written by a human, which serves as the true answer. This should also be a string.
Question: A query seeking specific information.

After making your selections, click ‘Next’.
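If you prepare your own dataset, a small script can check that every row has the three fields before you upload it. The sketch below assumes a JSON Lines layout (one JSON object per line); the column names `question`, `answer`, and `ground_truth` and the helper `to_jsonl` are hypothetical and only need to match whatever you map in the step above.

```python
import json

# Hypothetical column names; in the mapping step you would point
# Question, Answer, and Ground Truth at whichever columns you use.
REQUIRED_FIELDS = ("question", "answer", "ground_truth")


def to_jsonl(rows: list[dict]) -> str:
    """Serialize rows as JSON Lines after checking the required fields."""
    lines = []
    for index, row in enumerate(rows):
        for field in REQUIRED_FIELDS:
            value = row.get(field)
            # Each mapped field is expected to be a non-empty string.
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"row {index}: '{field}' must be a non-empty string")
        lines.append(json.dumps(row, ensure_ascii=False))
    return "\n".join(lines)


rows = [
    {
        "question": "What is the capital of France?",
        "answer": "The capital of France is Paris.",
        "ground_truth": "Paris",
    },
]
print(to_jsonl(rows))
```

Writing the returned string to a file gives you a dataset you can add to the platform; catching missing or empty fields locally avoids discovering them only after an evaluation run.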

STEP 6. Review all the information and configurations. If everything is correct, click ‘Finish’ to complete the evaluation setup.

Manual Evaluation

Manual evaluation in Azure AI Studio means reviewing the application’s generated outputs yourself. It is useful for tracking progress on a small set of priority issues. Here are the key steps involved:

Navigate to the playground and select the "Evaluation" option.

Switch to the Manual Evaluations section and click "+ New Manual Evaluation" to start a new evaluation process.

Scenario: Imagine you are developing a search assistant AI model and want to see how well it performs at finding information.

Here are the steps to use manual evaluation:

STEP 1: Input: In the “Input” column, enter a question that you expect the AI assistant to answer informatively. For example, you could type in “What is the capital of France?”.

STEP 2: Expected Response: In this column, enter the answer you expect. In this case, the expected response is “The capital of France is Paris”.

STEP 3: Run: Click the “Run” button. This will prompt the model to generate a response to the input question.

STEP 4: Output: The model will generate its answer in the “Output” column. In this case, the output is “The capital of France is Paris”.

STEP 5: Metric Evaluation (Optional): You can use the metric evaluation options to assess the model’s output based on various criteria.

Azure AI Studio offers options for human evaluation through

Data rated
Thumbs up
Thumbs down

[Figure: Manual evaluation result in Azure AI Studio]

STEP 6: Save Results: Click “Save Results” to save your manual evaluation for future reference or to share with your team.
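The manual evaluation steps above can be mirrored in a small script if you want to keep your own tally of ratings outside the Studio. This is a sketch under stated assumptions: the `ManualRow` class and `summarize` function are hypothetical, and "data rated" is interpreted here as the number of rows that have received a thumbs up or thumbs down.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ManualRow:
    input: str
    expected_response: str
    output: str
    thumbs_up: Optional[bool] = None  # None means the row is not rated yet


def summarize(rows: list[ManualRow]) -> dict[str, int]:
    """Count rated rows and split them into thumbs up and thumbs down."""
    rated = [row for row in rows if row.thumbs_up is not None]
    up = sum(1 for row in rated if row.thumbs_up)
    return {
        "data_rated": len(rated),
        "thumbs_up": up,
        "thumbs_down": len(rated) - up,
    }


rows = [
    ManualRow(
        input="What is the capital of France?",
        expected_response="The capital of France is Paris",
        output="The capital of France is Paris",
        thumbs_up=True,
    ),
    ManualRow(
        input="Who wrote the book “Pride and Prejudice”?",
        expected_response="Jane Austen",
        output="",  # hypothetical row that has not been run or rated yet
    ),
]
print(summarize(rows))
```

Keeping unrated rows separate from thumbs-down rows matters: an unrated row says nothing about quality, while a thumbs down is a recorded failure.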

Benefits of Manual Evaluation:

Targeted Evaluation: Manual evaluation allows you to focus on specific areas of your model's performance by designing inputs that target weaknesses you suspect.
Quick and Easy: Manual evaluation is a relatively simple process to set up and conduct, especially for small datasets.

Limitations of Manual Evaluation:

Time-Consuming: Manually evaluating large datasets can be very time-consuming.
Subjectivity: Manual evaluation can be subjective, as evaluators may have different expectations for the model's output.

Use cases for manual evaluation in Azure AI Studio include:

Spot-checking small datasets to track progress on priority issues.
Reviewing the model’s responses to specific prompts or questions.
Assessing the model’s ability to handle edge cases or unusual inputs.

Metric vs Manual Evaluation in Azure AI Studio

Here’s a comparison of Metric Evaluation and Manual Evaluation in Azure AI Studio:

Feature	Metric Evaluation	Manual Evaluation
Data used	Uses a large dataset for evaluation.	Uses a smaller, more focused dataset for evaluation.
Output	Provides a set of scores based on the selected metrics.	Provides a set of ratings based on human judgment.
Time Efficiency	As an automated process, metric evaluation can process large amounts of data, making it more time-efficient.	Manual evaluation, a human-driven process, can be time-consuming, especially for large datasets.
Subjectivity	Metric evaluation is objective, as it relies on predefined metrics and algorithms.	Manual evaluation involves human judgment, which can introduce subjectivity.
Scalability	Metric evaluation is highly scalable as it can handle large datasets and multiple evaluations simultaneously.	Manual evaluation may not scale well because it requires human effort and time.
Depth of Analysis	Metric evaluation provides a broad overview of the model’s performance across various metrics.	Manual evaluation can provide a deeper understanding of specific issues or nuances in the model’s responses.
Use of Resources	Metric evaluation primarily uses computational resources.	Manual evaluation primarily uses human resources.

Interpreting Evaluation Results

Understanding the output of metric evaluation is crucial for improving your AI model. The results provide insights into how well the model performs and whether it meets the required standards.

Here’s how you can use the evaluation results:

Performance and Quality Metrics: These metrics help you understand the model’s ability to generate relevant, coherent, and fluent responses. If these scores are low, you might need to improve your model’s training data or adjust its parameters.
Risk and Safety Metrics: These metrics assess the model’s predisposition toward different risks. If these scores are high, you might need to implement additional safety measures or retrain your model to reduce these risks.
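The two rules above (low quality scores and high risk scores both call for action) can be sketched as a simple check over exported results. Everything specific here is an assumption for illustration: the function `flag_metrics`, the metric names, the sample values, and the cutoffs `min_quality` and `max_risk` are hypothetical and should be replaced with the standards your own project sets.

```python
def flag_metrics(
    quality_scores: dict[str, float],
    risk_rates: dict[str, float],
    min_quality: float,
    max_risk: float,
) -> dict[str, list[str]]:
    """Return the metrics that fall outside the limits you chose."""
    # Quality metrics are a concern when they are too low.
    low_quality = sorted(
        name for name, score in quality_scores.items() if score < min_quality
    )
    # Risk metrics are a concern when they are too high.
    high_risk = sorted(
        name for name, rate in risk_rates.items() if rate > max_risk
    )
    return {"low_quality": low_quality, "high_risk": high_risk}


# Hypothetical results and limits, for illustration only.
report = flag_metrics(
    quality_scores={"coherence": 4.2, "fluency": 2.1},
    risk_rates={"violent_content": 0.0, "hateful_content": 0.12},
    min_quality=3.0,
    max_risk=0.05,
)
print(report)
```

Separating the two lists keeps the follow-up actions distinct: low quality points toward training data or parameters, while high risk points toward safety measures.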

Supported Evaluation Metrics in Azure AI Studio

Azure AI Studio supports evaluation metrics for different task types. Here are two of those task types:

Single-turn question answering without retrieval (non-RAG):

In this setup, users pose individual questions or prompts, and a generative AI model is employed to generate responses. For example:

User: Tell me a short story about a detective solving a mystery.
AI: Once upon a time in the bustling city of Noirville, Detective John Steele received a cryptic letter in the mail. The message was simple but perplexing…
Or:
User: Who wrote the book “Pride and Prejudice”?
AI: “Pride and Prejudice” was written by Jane Austen.

Multi-turn or single-turn chat with retrieval (RAG):

In this context, users engage in conversational interactions, either over a series of turns or in a single exchange. The generative AI model, equipped with retrieval mechanisms, not only generates responses but can also access and incorporate information from external sources, such as documents.

Conclusion

Evaluation in Azure AI Studio, whether it’s metric evaluation or manual evaluation, plays a crucial role in the development and deployment of AI models. By understanding and interpreting the evaluation results, you can continuously improve your AI model to meet your needs and standards.