Merging pre-trained language models allows us to combine the strengths of multiple models into a single, optimized model without extensive retraining. This can enhance performance and adaptability, making merging a valuable technique in natural language processing. In this tutorial, we'll merge two Mistral-7B fine-tunes using the mergekit library, walking through each step to create your own custom merged model.

Step 1: Clone the MergeKit Repository

Begin by cloning the mergekit repository from GitHub and navigating into the directory:

!git clone https://github.com/arcee-ai/mergekit.git
%cd mergekit

Step 2: Install MergeKit

Install mergekit in editable mode to ensure that any changes in the source code are immediately reflected without the need for reinstallation:

!pip install -e .

Step 3: Authenticate with Hugging Face

To access models from the Hugging Face Hub, authenticate using the notebook_login function from the huggingface_hub library:

from huggingface_hub import notebook_login
notebook_login()
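
notebook_login works well in Colab or Jupyter. If you're running this as a plain Python script instead, you can authenticate programmatically; replace the placeholder with an access token from your Hugging Face account settings:

from huggingface_hub import login

# Non-notebook alternative: pass an access token directly
login(token="YOUR_HUGGING_FACE_TOKEN")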

Step 4: Define the Merge Configuration

Create a YAML configuration file that specifies the models to merge, the merge method, and other parameters. In this example, we use the Spherical Linear Interpolation (SLERP) method to merge two Mistral-7B fine-tunes across all 32 transformer layers. The interpolation factor t controls the blend: at t=0 the merge keeps the base model's weights, and at t=1 it takes the other model's. Listing several values defines a gradient of t across the layers, here with opposite schedules for the self-attention and MLP tensors and a default of 0.5 for everything else:

merge_config = """
slices:
- sources:
  - model: teknium/OpenHermes-2.5-Mistral-7B
    layer_range:
    - 0
    - 32
  - model: Open-Orca/Mistral-7B-OpenOrca
    layer_range:
    - 0
    - 32
merge_method: slerp
base_model: Open-Orca/Mistral-7B-OpenOrca
parameters:
  t:
  - filter: self_attn
    value:
    - 0
    - 0.5
    - 0.3
    - 0.7
    - 1
  - filter: mlp
    value:
    - 1
    - 0.5
    - 0.7
    - 0.3
    - 0
  - value: 0.5
dtype: bfloat16
"""

Save this configuration to a file named config.yaml:

with open('config.yaml', 'w') as f:
    f.write(merge_config)
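
As an optional sanity check, you can confirm the file parses as valid YAML before launching the merge (PyYAML is available as a mergekit dependency):

import yaml

# Raises an error if the configuration is malformed
with open('config.yaml') as f:
    print(yaml.safe_load(f))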

Step 5: Execute the Merge

Run the mergekit-yaml command with the configuration file to perform the merge:

!mergekit-yaml config.yaml ./merge --copy-tokenizer --allow-crimes --out-shard-size 1B --trust-remote-code

This command writes the merged model to the ./merge directory. A few notes on the flags: --copy-tokenizer copies the tokenizer from the base model into the output, --out-shard-size 1B splits the saved weights into shards of roughly one billion parameters each, --trust-remote-code allows loading models that ship custom modeling code, and --allow-crimes relaxes mergekit's compatibility checks (not strictly needed here, since both models share the Mistral-7B architecture).
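
If the merge succeeds, a quick directory listing should show the sharded weights alongside the config and tokenizer files (exact filenames vary by model and shard count):

import os

print(sorted(os.listdir("merge")))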

Step 6: Upload the Merged Model to Hugging Face

To share your merged model, upload it to the Hugging Face Hub. First, create a new repository:

from huggingface_hub import HfApi

# A write-scoped access token is required; omit token= to reuse the
# credentials stored earlier by notebook_login()
api = HfApi(token="YOUR_HUGGING_FACE_TOKEN")
username = "your_username"
MODEL_NAME = "orca-teknium-merge"
api.create_repo(
    repo_id=f"{username}/{MODEL_NAME}",
    repo_type="model",
    exist_ok=True,  # don't fail if the repository already exists
)

Then, upload the merged model to the repository:

api.upload_folder(
    repo_id=f"{username}/{MODEL_NAME}",
    folder_path="merge",  # the output directory produced in Step 5
)
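
You can verify the upload completed by listing the files now present in the repository:

# Should mirror the contents of the local merge directory
print(api.list_repo_files(repo_id=f"{username}/{MODEL_NAME}"))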

Step 7: Install Additional Dependencies

Ensure that bitsandbytes and gradio are installed for quantization and creating a user interface, respectively:

!pip install -U bitsandbytes
!pip install gradio

Step 8: Load and Quantize the Merged Model

Use the BitsAndBytesConfig to load the model with 4-bit quantization, which reduces memory usage and speeds up inference:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig


MODEL_NAME = "your_username/orca-teknium-merge"
# Enable 4-bit quantization with bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit precision
    bnb_4bit_quant_type="nf4",             # NormalFloat4, suited to normally distributed weights
    bnb_4bit_compute_dtype=torch.float16,  # run computations in fp16
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
)
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
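
Before wiring up a UI, a quick generation pass confirms the quantized model loads and runs (the prompt here is arbitrary):

prompt = "The key idea behind merging language models is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))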

Step 9: Create a Chat Interface with Gradio

Use gradio to build a simple web-based chat interface for interacting with your merged model:

import gradio as gr

def chat(message, history):
    # history is provided by gr.ChatInterface; this simple template ignores it
    prompt = f"### Instruction:\nRespond to the user's query concisely and helpfully.\n\n### User:\n{message}\n\n### Response:"

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        num_beams=5,             # beam search for more coherent answers (slower)
        early_stopping=True,
        no_repeat_ngram_size=2,  # discourage repetitive phrasing
    )
    # Decode only the newly generated tokens so the prompt isn't echoed back
    generated = outputs[0][inputs["input_ids"].shape[-1]:]
    response = tokenizer.decode(generated, skip_special_tokens=True)

    return response.strip()

# Gradio ChatInterface
iface = gr.ChatInterface(
    fn=chat,
    title="Custom Merged Model Chatbot",
    description="Chat with the AI! Type your message below.",
)
iface.launch()

This script sets up a chat interface where users can input messages and receive responses generated by the merged model.
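
By default, launch() serves the app locally. If you're in Colab or want to share a temporary public URL, pass share=True:

iface.launch(share=True)  # generates a temporary public gradio.live link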

Key Considerations

  • Matching architectures: The models being merged must share the same base architecture (e.g., Mistral-7B, Llama) for compatibility; merging models with different architectures will lead to errors or degraded performance.
  • mergekit simplifies model merging, allowing different pre-trained models to be combined with a short YAML configuration.
  • SLERP (Spherical Linear Interpolation) interpolates smoothly between layer weights along a spherical path, which helps maintain stability compared with plain averaging.
  • Quantization with bitsandbytes reduces memory usage and speeds up inference.
  • Uploading to the Hugging Face Hub makes your model accessible to the community, or available privately for your own use.
  • Gradio provides a simple UI for interacting with your model and testing its capabilities.

By following these steps, you've successfully merged two pre-trained language models into a single, optimized model using mergekit.