Model fine-tuning for summarisation using data distillation – dump

  1. CNN/DailyMail dataset: https://drive.google.com/file/d/1bf-IArddQIc7F9EzkykT041P7uXMF1oB/view?usp=sharing
  2. Data distillation script: used to distil the data, adding the teacher model's reasoning to each summary, with Groq as the model provider
    • Dataset size: 3,000 rows; training epochs set to 10
import csv
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from groq import Groq # Import Groq

# CONFIGS
INPUT_FILE = './cnn_dailymail/train.csv'
OUTPUT_FILE = './cnn_dailymail/train_with_generated_summary_deepseek-r1-distill-llama-70b.csv'
START_ROW = 1 # Skip header
END_ROW = 3000
NUM_WORKERS = 5
BATCH_SIZE = 10
TEACHER_MODEL = "deepseek-r1-distill-llama-70b" # Teacher model served through Groq

# Initialize the Groq client. With no arguments, Groq() reads the API key from
# the GROQ_API_KEY environment variable, which keeps the key out of the source.
groq_client = Groq()

PROMPT_TEMPLATE = """
You are an expert at summarization.
Read the following text and generate a concise summary. Also, explain your reasoning briefly.
Text: "{text}"
Respond in the format:
Reasoning: <reasoning text>
Summary: <summary text>
"""

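def parse_response(response_text):
    # Collect the "Reasoning:" and "Summary:" sections, including any
    # continuation lines. Text before the first label is ignored, and a
    # repeated label restarts its section so the last occurrence wins.
    sections = {"reasoning": [], "summary": []}
    current = None
    for line in response_text.strip().splitlines():
        lowered = line.strip().lower()
        if lowered.startswith("reasoning:") or lowered.startswith("summary:"):
            current = "reasoning" if lowered.startswith("reasoning:") else "summary"
            sections[current] = []
            line = line.split(":", 1)[1]
        if current:
            sections[current].append(line.strip())
    reasoning = " ".join(part for part in sections["reasoning"] if part)
    summary = " ".join(part for part in sections["summary"] if part)
    return reasoning, summary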

def call_teacher_llm(row_id, text):
    prompt = PROMPT_TEMPLATE.format(text=text)
    try:
        chat_completion = groq_client.chat.completions.create(
            messages=[
                {
                    "role": "user",
                    "content": prompt,
                }
            ],
            model=TEACHER_MODEL,
            temperature=0.7, # You can adjust temperature if needed
            max_tokens=500, # You can adjust max_tokens if needed
        )
        content = chat_completion.choices[0].message.content
        reasoning, summary = parse_response(content)
        return row_id, text, reasoning, summary
    except Exception as e:
        print(f"❌ Error with Groq API for row {row_id}: {e}")
        return row_id, text, "Error", "Error"

def process_csv():
    with open(INPUT_FILE, 'r', encoding='utf-8') as infile, open(OUTPUT_FILE, 'w', encoding='utf-8', newline='') as outfile:
        reader = csv.reader(infile)
        writer = csv.writer(outfile)

        rows = []
        for i, row in enumerate(reader):
            if i < START_ROW:
                continue
            if i > END_ROW:  # END_ROW is inclusive: rows 1..3000 give the stated 3,000 rows
                break
            if len(row) < 3:
                continue
            rows.append((row[0], row[1], row[2]))  # (id, text, existing_summary)

        print(f"🚀 Processing {len(rows)} rows using {NUM_WORKERS} threads...")

        batch_rows = []
        completed = 0

        try:
            with ThreadPoolExecutor(max_workers=NUM_WORKERS) as executor:
                # We need to create a list of futures and map them back to original row data
                futures = {executor.submit(call_teacher_llm, row_id, text): (row_id, text, existing_summary) for row_id, text, existing_summary in rows}

                for future in as_completed(futures):
                    original_row_data = futures[future]
                    row_id, text, existing_summary = original_row_data

                    result = future.result()
                    # result is (row_id, text, reasoning, summary) from call_teacher_llm
                    # We need to use the original text and existing_summary for the output row
                    batch_rows.append([row_id, text, existing_summary, result[2], result[3]])
                    completed += 1

                    print(f"[{completed}/{len(rows)}] Processed row {row_id}...")

                    if len(batch_rows) >= BATCH_SIZE:
                        writer.writerows(batch_rows)
                        batch_rows = []
                        outfile.flush()

        except KeyboardInterrupt:
            print("\n⚠️ Interrupted! Saving progress...")

        if batch_rows:
            writer.writerows(batch_rows)
            outfile.flush()

if __name__ == "__main__":
    start_time = time.time()
    process_csv()
    elapsed = time.time() - start_time
    print(f"\n✅ Processing complete in {elapsed:.2f} seconds! Output saved to {OUTPUT_FILE}")
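Reasoning models such as the DeepSeek-R1 distills can emit their chain of thought inside <think>...</think> tags before the final answer, and with max_tokens=500 that block may be cut off before it closes. The helper below is a minimal sketch, assuming that tag format, of cleaning a response before it is passed to parse_response. The name strip_think_block is hypothetical and not part of the original script.

```python
import re

# Assumed tag format: <think> ... </think>, as emitted by DeepSeek-R1 style models.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL | re.IGNORECASE)


def strip_think_block(response_text: str) -> str:
    """Remove <think>...</think> spans and anything after an unclosed <think>."""
    cleaned = _THINK_BLOCK.sub("", response_text)
    # If generation stopped at the token limit mid-thought, the closing tag
    # never arrives; in that case nothing after the opening tag is usable.
    unclosed = cleaned.lower().find("<think>")
    if unclosed != -1:
        cleaned = cleaned[:unclosed]
    return cleaned.strip()
```

It could be called on `content` in call_teacher_llm just before parse_response. A truncated response then yields an empty reasoning and summary, which is easy to filter out later.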
  • The data was split into training and validation sets using the script below
import pandas as pd
from sklearn.model_selection import train_test_split

# Config
INPUT_CSV = './cnn_dailymail/train_with_generated_summary_Qwen3-32B_new.csv'       # Path to your CSV file
VALIDATION_SPLIT = 0.15              # Validation fraction: 0.10 for 10%, or any value between 0.10 and 0.15
RANDOM_SEED = 42                     # For reproducibility

# Load dataset

df = pd.read_csv(INPUT_CSV)

# Split the data

train_df, val_df = train_test_split(df, test_size=VALIDATION_SPLIT, random_state=RANDOM_SEED)

# Save the splits
train_df.to_csv('train_set_Qwen3-32B.csv', index=False)
val_df.to_csv('validation_set_Qwen3-32B.csv', index=False)

print(f"Dataset split completed. Training size: {len(train_df)}, Validation size: {len(val_df)}")
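The distillation script writes the placeholder "Error" into the reasoning and summary columns when an API call fails, and parse_response returns empty strings when the labels are missing, so some rows may carry no usable teacher output. The sketch below shows one way to drop them before splitting. The function name and the column names in the usage note are hypothetical; the distillation script above writes no header row, so the real file may need `header=None` or explicit `names=[...]` when it is loaded.

```python
import pandas as pd


def drop_failed_rows(df: pd.DataFrame, reasoning_col, summary_col) -> pd.DataFrame:
    """Keep rows whose teacher output is neither the "Error" placeholder nor blank."""
    reasoning = df[reasoning_col].fillna("").astype(str).str.strip()
    summary = df[summary_col].fillna("").astype(str).str.strip()
    # A row is unusable if either field holds the placeholder or the summary is empty.
    failed = reasoning.eq("Error") | summary.eq("Error") | summary.eq("")
    return df[~failed].reset_index(drop=True)
```

For example, `df = drop_failed_rows(df, "reasoning", "generated_summary")` could run right after `pd.read_csv`, assuming the columns carry those names.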

The following script was used to test summarisation quality:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the model and tokenizer
model_name = "eprasad/t5-small-deepseek-distill-summarization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# New input text (≈300 words)
text = (
    "The Amazon rainforest, often referred to as the “lungs of the Earth,” is a vast tropical rainforest that spans across nine South American countries, "
    "with the majority located in Brazil. It covers approximately 5.5 million square kilometers and plays a crucial role in regulating the global climate by "
    "absorbing vast amounts of carbon dioxide. The Amazon is home to an incredibly diverse range of flora and fauna, with more than 390 billion individual trees "
    "representing around 16,000 different species. In addition to its biodiversity, the forest supports the livelihoods of millions of indigenous people who have "
    "lived there for centuries. Despite its importance, the Amazon faces numerous threats, primarily from deforestation caused by logging, agriculture, mining, and "
    "infrastructure development. Over the past few decades, large areas have been cleared for cattle ranching and soybean farming, contributing to habitat loss and "
    "increased greenhouse gas emissions. Climate change further exacerbates the challenges, leading to longer dry seasons and more frequent wildfires. Governments, "
    "environmental organizations, and indigenous groups are working together to implement conservation strategies, such as establishing protected areas, promoting "
    "sustainable land use, and enforcing anti-deforestation laws. Technological advancements like satellite monitoring have improved the ability to detect and prevent "
    "illegal logging activities. Public awareness and international pressure have also played a role in encouraging more sustainable practices. However, balancing economic "
    "development with environmental preservation remains a complex and ongoing challenge. The future of the Amazon will depend on the collective efforts of local communities, "
    "national governments, global stakeholders, and individual actions. Protecting this critical ecosystem is not only essential for preserving biodiversity but also for "
    "combating climate change and maintaining the health of the planet for future generations."
)

# Tokenize the input
inputs = tokenizer(text, return_tensors="pt", truncation=True)

# Generate summary (greedy decoding); **inputs passes the attention mask along with the input IDs
summary_ids = model.generate(
    **inputs,
    max_length=100,
    min_length=30,
    do_sample=False
)

# Decode and print the summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Summary:\n", summary)

Outputs

google-t5/t5-small
 ,,, and,, and promoting sustainable land use, and enforcing anti-deforestation laws. The future of the Amazon will depend on the collective efforts of local communities, national governments, global stakeholders, and individual actions. The future of the Amazon will depend on the collective efforts of local communities, national governments, global stakeholders, and individual actions.

AhilanPonnusamy/distilled-t5small-summarizer


 , with more than 390 billion individual trees representing around 16,000 different species. The Amazon rainforest, often referred to as the “lungs of the Earth,” covers approximately 5.5 million square kilometers. It covers approximately 5.5 million square kilometers and plays a crucial role in regulating the global climate by absorbing vast amounts of carbon dioxide.

eprasad/t5-small-qwen3-distill-summarization


 The Amazon rainforest, often referred to as the “lungs of the Earth,” is a vast tropical rainforest that spans across nine South American countries, with the majority located in Brazil. It covers approximately 5.5 million square kilometers and plays a crucial role in regulating the global climate by absorbing vast amounts of carbon dioxide. The Amazon is home to an incredibly diverse range of flora and fauna, with more than 390 billion individual trees representing around 16,000 different species

eprasad/t5-small-llama70b-distill-summarization

 The future of the Amazon will depend on the collective efforts of local communities, national governments, global stakeholders, and individual actions. Protecting this critical ecosystem is not only essential for preserving biodiversity but also for combating climate change and maintaining the health of the planet for future generations.

eprasad/t5-small-deepseek-distill-summarization

The Amazon rainforest, often referred to as the “lungs of the Earth,” is a vast tropical rainforest. It covers approximately 5.5 million square kilometers. The Amazon is home to an incredible range of flora and fauna.

eprasad/bart-base-qwen32-distill-summarization


 The Amazon rainforest, often referred to as the “lungs of the Earth,” is a vast tropical rainforest that spans across nine South American countries, with the majority located in Brazil. It covers approximately 5.5 million square kilometers and plays a role in encouraging more sustainable practices. However, balancing economic development with environmental preservation remains a complex and ongoing challenge. The future of the Amazon will depend on the collective efforts of local communities, national governments, global stakeholders,

MoverScore (https://arxiv.org/abs/1909.02622) was used as the evaluation metric.
