
Best Practices and Guidelines for Preparing Your Dataset for Fine Tuning with Azure OpenAI

Updated: Oct 11, 2023

Fine tuning models using Azure OpenAI is an exciting venture, offering the potential to optimize your natural language processing (NLP) tasks, chatbots, and various AI-driven applications. However, the key to harnessing the full potential of fine tuning lies in the quality of your dataset. In this guide, we'll explore the best practices and guidelines to ensure your dataset is primed and ready for effective fine tuning with Azure OpenAI.

Fine tuning with Azure OpenAI

From data preprocessing to feature engineering, handling imbalances, and addressing data quality issues, these best practices and guidelines are designed to empower you to achieve superior results when fine tuning models with Azure OpenAI. Whether you're a data scientist, machine learning engineer, or AI enthusiast, optimizing your dataset is a crucial step toward unlocking the true potential of Azure OpenAI-powered models.



Best Practices

Fine tuning, whether it's for machine learning models, statistical analysis, or any data-driven task, is a critical step in enhancing performance and accuracy. However, the quality of the results you can achieve through fine tuning heavily relies on the quality of your dataset.


Below are the best practices to guide you through the process of preparing your dataset for fine tuning.


Best Practice 1: Use high-quality examples

This means that the examples should be well-written, relevant to the task, and free of errors. High-quality examples will help the model to learn the task more effectively and generate more accurate and informative results.


Example:

Suppose you are fine tuning a model to generate product descriptions. You could collect a few hundred examples of high-quality product descriptions from your website or from other websites. These examples should be well-written, informative, and relevant to the products that you want the model to generate descriptions for.


Best Practice 2: Provide a simple prompt

A simple prompt helps the model learn the task more effectively, and using the same prompt structure for every example also helps ensure that the model generates consistent results.


Example:

Suppose you are fine tuning a model to generate product descriptions. You could use the following prompt:

<Product Name>\n\n###\n\n

This prompt gives the model only the product name, followed by a fixed separator. During fine tuning, the matching completion contains the <Product Description> text, so the model learns to generate a relevant, informative description for whatever product name appears before the separator.
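
Put together, a single training example in JSONL form might look like the following sketch (the product name and description are invented for illustration):

{"prompt": "Contoso Trail Backpack 40L\n\n###\n\n", "completion": " A rugged 40-liter hiking backpack with a ventilated back panel, adjustable hip belt, and built-in rain cover."}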


Best Practice 3: Use a fixed separator

This will help the model to identify the different parts of the input and output. A fixed separator will also make it easier to use the model in production.


Example:

Suppose you are fine tuning a model to generate product descriptions. You could use the following separator:

### 

The separator marks the point where the input text ends. At inference time, you append the same ### to your prompt, and the model generates the output text after it.
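
For example, a completion request to a fine-tuned deployment might look like this; the resource name, deployment name, API key, and product name are all placeholders:

curl https://YOUR_RESOURCE_NAME.openai.azure.com/openai/deployments/YOUR_DEPLOYMENT_NAME/completions?api-version=2023-05-15 \
    -H 'Content-Type: application/json' \
    -H 'api-key: YOUR_API_KEY' \
    -d '{
        "prompt": "Contoso Trail Backpack 40L\n\n###\n\n",
        "max_tokens": 100
}'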


Guidelines to prepare your Dataset for Fine tuning

Fine tuning your models demands high-quality data. In the absence of thorough preparation, your efforts might yield less-than-optimal outcomes. To ensure your dataset is clean, well-organized, and relevant, we've compiled a set of guidelines.


Classifiers for Classification tasks:

Classifiers are the easiest machine learning models to get started with. They are well-suited for classification problems, where the goal is to predict a category for a given input.


Classification is a machine learning task where a model is trained to identify the category of a given input. This can be used for a variety of tasks, such as:

  • Spam filtering

  • Sentiment analysis

  • Image classification

  • Product categorization


Best Practices for Classification

When using classification models, it is important to follow these best practices:

  • Use a dataset that is representative of the types of inputs that you want to classify.

  • Choose classes that are mutually exclusive and exhaustive.

  • Use a separator at the end of the prompt and append it to subsequent requests.

  • Specify max_tokens=1 at inference time, since you only need the first token for classification.

  • Ensure that the prompt + completion does not exceed 2048 tokens, including the separator.

  • Aim for at least 100 examples per class.

  • To get class log probabilities, you can specify logprobs=5 (for five classes) when using your model.

  • Ensure that the dataset used for fine tuning is very similar in structure and type of task to what the model will be used for.

To use classifiers for classification tasks, OpenAI recommends using the ada model. It is generally faster than other models, while still performing very well.


Case study: How to use classifiers to detect untrue statements

Suppose you have a website that sells insurance, and you want to ensure that the ads on your website mention the correct product and company. You can fine-tune a classifier to filter out incorrect ads.


Your dataset might look something like this:

{   
    "prompt": "Company: BHFF insurance\nProduct: allround insurance\nAd:One stop shop for all your insurance needs!\nSupported:",
    "completion": " yes" 
} 
{   
    "prompt": "Company: Loft conversion specialists\nProduct: -\nAd:Straight teeth in weeks!\nSupported:",   
    "completion": " no" 
} 

In this example, the separator is \nSupported:. Note that each training example is a single JSON object on its own line (the JSONL format used for fine tuning); the examples above are shown expanded for readability.
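
Before uploading, it is worth checking that every example ends its prompt with the separator; a minimal shell sketch, assuming your examples live one per line in a file called train.jsonl:

# Both counts should match: every prompt should end with the \nSupported: separator.
grep -c '\\nSupported:"' train.jsonl
wc -l < train.jsonl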


Once you have trained the classifier, you can use it to query new ads to see if they are correct. For example, you could send the following request to the OpenAI API:

curl https://YOUR_RESOURCE_NAME.openai.azure.com/openai/deployments/YOUR_DEPLOYMENT_NAME/completions?api-version=2023-05-15 \
    -H 'Content-Type: application/json' \
    -H 'api-key: YOUR_API_KEY' \
    -d '{
        "prompt": "Company: Reliable accountants Ltd\nProduct: Personal Tax help\nAd:Best advice in town!\nSupported:",
        "max_tokens": 1
}'

This request will return either yes or no, indicating whether the ad is correct.
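
If you also want the class log probabilities mentioned in the best practices above, add the logprobs parameter to the same request; a sketch with the same placeholder names:

curl https://YOUR_RESOURCE_NAME.openai.azure.com/openai/deployments/YOUR_DEPLOYMENT_NAME/completions?api-version=2023-05-15 \
    -H 'Content-Type: application/json' \
    -H 'api-key: YOUR_API_KEY' \
    -d '{
        "prompt": "Company: Reliable accountants Ltd\nProduct: Personal Tax help\nAd:Best advice in town!\nSupported:",
        "max_tokens": 1,
        "logprobs": 5
}'

The response then includes the log probabilities of the top tokens, from which you can read off how confident the model is in yes versus no.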


Classifiers can be a powerful tool for detecting untrue statements, and they are relatively easy to use. If you have a task that involves classifying text into different categories, we recommend trying out classifiers.


Classification models can be biased and inaccurate, especially if the training data is not representative of the real world.


To reduce bias in classification models, it is important to use a diverse training dataset and to evaluate the model on a held-out test set.


Conditional Generation

Conditional generation is a type of machine learning task where a model is trained to generate text based on a given input. This can be used for a variety of tasks, such as:

  • Paraphrasing: Generating a new version of a text passage that has the same meaning, but is expressed in different words.

  • Summarizing: Generating a shorter version of a text passage that captures the main points.

  • Entity extraction: Identifying and extracting named entities from a text passage, such as people, places, and organizations.

  • Product description writing: Generating product descriptions based on a set of product specifications.

  • Chatbots: Generating text responses to user queries in a conversational manner.


Best practices for Conditional Generation

When using conditional generation models, it is important to follow these best practices:

  • Use a dataset that is of high quality and that is representative of the type of content you want to generate.

  • Use a separator at the end of the prompt and append it to subsequent requests.

  • Use an ending token at the end of the completion and add it as a stop sequence during inference.

  • Aim for at least 500 examples in your dataset.

  • Ensure that the prompt + completion does not exceed 2048 tokens, including the separator.

  • Use a lower learning rate and only 1-2 epochs for fine tuning.


Case study: Write an engaging ad based on a Wikipedia article.

This is a generative use case, so it is important to ensure that the samples you provide are of the highest quality. A good starting point is around 500 examples. A sample dataset might look like this:

{     
    "prompt": "<Product Name>\n<Wikipedia description>\n\n###\n\n",     
    "completion": " <engaging ad> END" 
} 

For example:

{
    "prompt": "Samsung Galaxy Feel\nThe Samsung Galaxy Feel is an Android smartphone developed by Samsung Electronics exclusively for the Japanese market. The phone was released in June 2017 and was sold by NTT Docomo. It runs on Android 7.0 (Nougat), has a 4.7 inch display, and a 3000 mAh battery.\nSoftware\nSamsung Galaxy Feel runs on Android 7.0 (Nougat), but can be later updated to Android 8.0 (Oreo).\nHardware\nSamsung Galaxy Feel has a 4.7 inch Super AMOLED HD display, 16 MP back facing and 5 MP front facing cameras. It has a 3000 mAh battery, a 1.6 GHz Octa-Core ARM Cortex-A53 CPU, and an ARM Mali-T830 MP1 700 MHz GPU. It comes with 32GB of internal storage, expandable to 256GB via microSD. Aside from its software and hardware specifications, Samsung also introduced a unique hole in the phone's shell to accommodate the Japanese perceived penchant for personalizing their mobile phones. The Galaxy Feel's battery was also touted as a major selling point since the market favors handsets with longer battery life. The device is also waterproof and supports 1seg digital broadcasts using an antenna that is sold separately.\n\n###\n\n",
    "completion": " Looking for a smartphone that can do it all? Look no further than Samsung Galaxy Feel! With a slim and sleek design, our latest smartphone features high-quality picture and video capabilities, as well as an award winning battery life. END"
}

Here, we used a multiline separator, as Wikipedia articles contain multiple paragraphs and headings. We also used a simple end token (END) to ensure that the model knows when the completion should finish.
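
At inference time, you would send the product name and Wikipedia text followed by the same separator, and pass the end token as a stop sequence so generation halts cleanly. A sketch with placeholder resource and deployment names:

curl https://YOUR_RESOURCE_NAME.openai.azure.com/openai/deployments/YOUR_DEPLOYMENT_NAME/completions?api-version=2023-05-15 \
    -H 'Content-Type: application/json' \
    -H 'api-key: YOUR_API_KEY' \
    -d '{
        "prompt": "<Product Name>\n<Wikipedia description>\n\n###\n\n",
        "max_tokens": 150,
        "stop": [" END"]
}'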


By following these best practices, you can use conditional generation models to create high-quality text content that is tailored to your specific needs.


Open-Ended Generation

Open-ended generation is a type of machine learning task where a model is trained to generate text without any given input. This can be used for a variety of tasks, such as:

  • Creative writing: Generating poems, stories, and other creative text formats.

  • Code generation: Generating code snippets based on natural language descriptions.

  • News article generation: Generating news articles based on a set of facts.


Best practices for Open-Ended Generation

When using open-ended generation models, it is important to follow these best practices:

  • Use a large number of examples, at least a few thousand.

  • Ensure that the examples cover the intended domain or the desired tone of voice.


Case study: Maintaining Company Voice

Many companies have a large amount of high-quality content generated in a specific voice. Ideally, all generations from the OpenAI API should follow that voice across different use cases. To achieve this, you can leave the prompt empty and feed in all the documents that are good examples of the company voice. A fine-tuned model can then be used to solve many different use cases with prompts similar to the ones used for base models, but the outputs will follow the company voice much more closely.


For example, you could fine-tune a model on a dataset of company blog posts and marketing materials. This model could then be used to generate new content in the same voice, such as product descriptions, social media posts, and email campaigns.
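
In dataset form, each such document simply becomes a completion with an empty prompt; a minimal sketch, with placeholder text:

{"prompt": "", "completion": " <full text of one blog post or marketing document>"}
{"prompt": "", "completion": " <full text of another document in the same voice>"}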


Generative tasks have the potential to leak training data when completions are requested from the model, so extra care needs to be taken to address this appropriately. For example, personal or sensitive company information should be replaced with generic information or excluded from the fine tuning data in the first place.
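
For instance, a record containing personal details could be redacted to generic placeholders before it enters the training set; a hypothetical before and after:

{"prompt": "", "completion": " Contact Jane Doe at jane.doe@contoso.com for a personalized quote."}
{"prompt": "", "completion": " Contact <NAME> at <EMAIL> for a personalized quote."}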


You can also use the following techniques to mitigate the risk of data leakage:

  • Use differential privacy to add calibrated noise during training, making it difficult to recover any individual training example.

  • Use generative models, such as generative adversarial networks (GANs), to produce synthetic training data that resembles the original data but does not contain any sensitive information.

  • Use a technique called federated learning to train the model on multiple devices without sharing the data between them.

By following these best practices and guidelines for preparing your dataset, you can use open-ended generation models to create high-quality, creative text content that is tailored to your specific needs.


Conclusion

By adhering to these practices, you're not only laying a robust foundation for your fine tuning endeavors but also enhancing the efficiency and accuracy of your AI models. Whether your goal is to supercharge your natural language processing tasks, refine your chatbots, or optimize any AI-driven application, a well-prepared dataset is your path to success.

