- CNN/DailyMail dataset: https://drive.google.com/file/d/1bf-IArddQIc7F9EzkykT041P7uXMF1oB/view?usp=sharing
- Dataset size: 3000 rows; training epochs set to 10
- Data distillation script: used to distill the data, adding reasoning to each summarization, with Groq as the model provider
import csv
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

from groq import Groq

# Configuration
INPUT_FILE = './cnn_dailymail/train.csv'
OUTPUT_FILE = './cnn_dailymail/train_with_generated_summary_deepseek-r1-distill-llama-70b.csv'
START_ROW = 1  # Skip header
END_ROW = 3000
NUM_WORKERS = 5  # Concurrent requests to the Groq API
BATCH_SIZE = 10  # Rows buffered before each write to disk
TEACHER_MODEL = "deepseek-r1-distill-llama-70b"  # Teacher model served by Groq

# Initialize the Groq client. With no arguments it reads the API key from the
# GROQ_API_KEY environment variable, which avoids hard-coding the key.
groq_client = Groq()

PROMPT_TEMPLATE = """
You are an expert at summarization.
Read the following text and generate a concise summary. Also, explain your reasoning briefly.
Text: "{text}"
Respond in the format:
Reasoning: <reasoning text>
Summary: <summary text>
"""


def parse_response(response_text):
    """Extract the single-line 'Reasoning:' and 'Summary:' fields from a response."""
    reasoning = ""
    summary = ""
    lines = response_text.strip().splitlines()
    for line in lines:
        if line.lower().startswith("reasoning:"):
            reasoning = line.split(":", 1)[1].strip()
        elif line.lower().startswith("summary:"):
            summary = line.split(":", 1)[1].strip()
    return reasoning, summary


def call_teacher_llm(row_id, text):
    """Ask the teacher model for a reasoning/summary pair for one row."""
    prompt = PROMPT_TEMPLATE.format(text=text)
    try:
        chat_completion = groq_client.chat.completions.create(
            messages=[{"role": "user", "content": prompt}],
            model=TEACHER_MODEL,
            temperature=0.7,  # Adjust if needed
            max_tokens=500,  # Adjust if needed
        )
        content = chat_completion.choices[0].message.content
        reasoning, summary = parse_response(content)
        return row_id, text, reasoning, summary
    except Exception as e:
        # On failure, record placeholders so the row still appears in the output.
        print(f"❌ Error with Groq API for row {row_id}: {e}")
        return row_id, text, "Error", "Error"


def process_csv():
    """Distill the selected input rows and write them, with the teacher output, to OUTPUT_FILE."""
    with open(INPUT_FILE, 'r', encoding='utf-8') as infile, \
            open(OUTPUT_FILE, 'w', encoding='utf-8', newline='') as outfile:
        reader = csv.reader(infile)
        writer = csv.writer(outfile)

        # Collect the input rows with index START_ROW..END_ROW-1.
        rows = []
        for i, row in enumerate(reader):
            if i < START_ROW:
                continue
            if i >= END_ROW:
                break
            if len(row) < 3:
                continue  # Skip malformed rows
            rows.append((row[0], row[1], row[2]))  # (id, text, existing_summary)

        print(f"🚀 Processing {len(rows)} rows using {NUM_WORKERS} threads...")
        batch_rows = []
        completed = 0
        try:
            with ThreadPoolExecutor(max_workers=NUM_WORKERS) as executor:
                # Map each future back to its original row data.
                futures = {
                    executor.submit(call_teacher_llm, row_id, text): (row_id, text, existing_summary)
                    for row_id, text, existing_summary in rows
                }
                for future in as_completed(futures):
                    row_id, text, existing_summary = futures[future]
                    # result is (row_id, text, reasoning, summary) from call_teacher_llm
                    result = future.result()
                    # Output columns: id, text, existing summary, reasoning, generated summary.
                    # No header row is written, and rows appear in completion order.
                    batch_rows.append([row_id, text, existing_summary, result[2], result[3]])
                    completed += 1
                    print(f"[{completed}/{len(rows)}] Processed row {row_id}...")
                    # Write in batches so an interruption loses little work.
                    if len(batch_rows) >= BATCH_SIZE:
                        writer.writerows(batch_rows)
                        batch_rows = []
                        outfile.flush()
        except KeyboardInterrupt:
            print("\n⚠️ Interrupted! Saving progress...")

        # Write any rows still buffered, after normal completion or an interrupt.
        if batch_rows:
            writer.writerows(batch_rows)
            outfile.flush()


if __name__ == "__main__":
    start_time = time.time()
    process_csv()
    elapsed = time.time() - start_time
    print(f"\n✅ Processing complete in {elapsed:.2f} seconds! Output saved to {OUTPUT_FILE}")
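
The `parse_response` function above keeps only the text on the same line as each label, so a reasoning or summary that spans several lines is cut short. Reasoning models such as deepseek-r1 may also emit a `<think>...</think>` block before the requested format. A minimal sketch of a more tolerant parser follows; the name `parse_response_multiline` and the stripping of `<think>` tags are assumptions for illustration, not part of the original script.

```python
import re

# Hypothetical helper (not in the original script): tolerates multi-line
# fields and an optional <think>...</think> preamble.
_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)
_FIELDS_RE = re.compile(
    r"reasoning:\s*(?P<reasoning>.*?)\s*summary:\s*(?P<summary>.*)",
    re.DOTALL | re.IGNORECASE,
)


def parse_response_multiline(response_text):
    """Return (reasoning, summary); empty strings if the format is missing."""
    # Drop any complete <think> block, then look for the two labelled fields.
    cleaned = _THINK_RE.sub("", response_text).strip()
    match = _FIELDS_RE.search(cleaned)
    if match is None:
        return "", ""
    return match.group("reasoning").strip(), match.group("summary").strip()


example = (
    "<think>\nplanning\n</think>\n"
    "Reasoning: The article covers A.\nIt also covers B.\n"
    "Summary: A and B happened.\nMore detail."
)
print(parse_response_multiline(example))
```

The reasoning field ends at the first `summary:` label, so a reasoning text that itself contains that label would be split early; for this prompt format that seems an acceptable trade-off.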
- Generated the following datasets
  - Distill-llama70b: https://drive.google.com/file/d/1pFIdo8JKn-YGMQhKIvXYyVCVGxbyBc5t/view?usp=drive_link
  - Distill-qwen3-32: https://drive.google.com/file/d/1-abCAzaBqgpgao0ZMevlWLaiC6fy6d_i/view?usp=drive_link
  - Distill-deepseek-r1-llama70b: https://drive.google.com/file/d/10yMuir1p0S2OI79VhrbG69pY1avEMUps/view?usp=drive_link
- Data split into training and validation sets using the script below
import pandas as pd
from sklearn.model_selection import train_test_split

# Config
INPUT_CSV = './cnn_dailymail/train_with_generated_summary_Qwen3-32B_new.csv'  # Path to the distilled CSV file
VALIDATION_SPLIT = 0.15  # Fraction of rows held out for validation (e.g. 0.10 to 0.15)
RANDOM_SEED = 42  # For reproducibility

# Load dataset
df = pd.read_csv(INPUT_CSV)

# Split the data
train_df, val_df = train_test_split(df, test_size=VALIDATION_SPLIT, random_state=RANDOM_SEED)

# Save the splits
train_df.to_csv('train_set_Qwen3-32B.csv', index=False)
val_df.to_csv('validation_set_Qwen3-32B.csv', index=False)
print(f"Dataset split completed. Training size: {len(train_df)}, Validation size: {len(val_df)}")
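
When a Groq call fails, the distillation script writes the placeholder `Error` for both the reasoning and the summary, and a response that does not match the requested format yields empty strings. Such rows would otherwise end up in the training and validation sets. A minimal sketch of filtering them out before the split, assuming the five-column row layout written by the distillation script (id, text, existing summary, reasoning, generated summary); `drop_failed_rows` is a hypothetical helper name:

```python
import csv
import io


def drop_failed_rows(rows):
    """Keep rows whose reasoning and generated summary are both usable."""
    kept = []
    for row in rows:
        if len(row) < 5:
            continue  # malformed row
        reasoning, summary = row[3].strip(), row[4].strip()
        if not reasoning or not summary:
            continue  # response did not match the requested format
        if reasoning == "Error" or summary == "Error":
            continue  # placeholder written after an API failure
        kept.append(row)
    return kept


# Usage with an in-memory CSV standing in for the distilled output file.
sample = io.StringIO(
    'a1,"Some article","Ref summary","Key points first","Short summary"\n'
    'a2,"Another article","Ref summary",Error,Error\n'
    'a3,"Third article","Ref summary",,\n'
)
clean_rows = drop_failed_rows(csv.reader(sample))
print([row[0] for row in clean_rows])
```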
- Fine-tuned the following models
  - eprasad/t5-small-qwen3-distill-summarization
  - eprasad/t5-small-deepseek-distill-summarization
  - eprasad/t5-small-llama70b-distill-summarization
  - eprasad/bart-base-qwen32-distill-summarization
Used the following script to test summarization quality:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# Load the model and tokenizer
model_name = "eprasad/t5-small-deepseek-distill-summarization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
# New input text (≈300 words)
text = (
"The Amazon rainforest, often referred to as the “lungs of the Earth,” is a vast tropical rainforest that spans across nine South American countries, "
"with the majority located in Brazil. It covers approximately 5.5 million square kilometers and plays a crucial role in regulating the global climate by "
"absorbing vast amounts of carbon dioxide. The Amazon is home to an incredibly diverse range of flora and fauna, with more than 390 billion individual trees "
"representing around 16,000 different species. In addition to its biodiversity, the forest supports the livelihoods of millions of indigenous people who have "
"lived there for centuries. Despite its importance, the Amazon faces numerous threats, primarily from deforestation caused by logging, agriculture, mining, and "
"infrastructure development. Over the past few decades, large areas have been cleared for cattle ranching and soybean farming, contributing to habitat loss and "
"increased greenhouse gas emissions. Climate change further exacerbates the challenges, leading to longer dry seasons and more frequent wildfires. Governments, "
"environmental organizations, and indigenous groups are working together to implement conservation strategies, such as establishing protected areas, promoting "
"sustainable land use, and enforcing anti-deforestation laws. Technological advancements like satellite monitoring have improved the ability to detect and prevent "
"illegal logging activities. Public awareness and international pressure have also played a role in encouraging more sustainable practices. However, balancing economic "
"development with environmental preservation remains a complex and ongoing challenge. The future of the Amazon will depend on the collective efforts of local communities, "
"national governments, global stakeholders, and individual actions. Protecting this critical ecosystem is not only essential for preserving biodiversity but also for "
"combating climate change and maintaining the health of the planet for future generations."
)

# Tokenize the input (truncated to the model's maximum input length)
inputs = tokenizer(text, return_tensors="pt", truncation=True)

# Generate the summary without sampling; **inputs also passes the attention mask
summary_ids = model.generate(
    **inputs,
    max_length=100,
    min_length=30,
    do_sample=False,
)

# Decode and print the summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Summary:\n", summary)
Outputs (each model name is followed by the summary it generated for the text above):
google-t5/t5-small
,,, and,, and promoting sustainable land use, and enforcing anti-deforestation laws. The future of the Amazon will depend on the collective efforts of local communities, national governments, global stakeholders, and individual actions. The future of the Amazon will depend on the collective efforts of local communities, national governments, global stakeholders, and individual actions.
AhilanPonnusamy/distilled-t5small-summarizer
, with more than 390 billion individual trees representing around 16,000 different species. The Amazon rainforest, often referred to as the “lungs of the Earth,” covers approximately 5.5 million square kilometers. It covers approximately 5.5 million square kilometers and plays a crucial role in regulating the global climate by absorbing vast amounts of carbon dioxide.
eprasad/t5-small-qwen3-distill-summarization
The Amazon rainforest, often referred to as the “lungs of the Earth,” is a vast tropical rainforest that spans across nine South American countries, with the majority located in Brazil. It covers approximately 5.5 million square kilometers and plays a crucial role in regulating the global climate by absorbing vast amounts of carbon dioxide. The Amazon is home to an incredibly diverse range of flora and fauna, with more than 390 billion individual trees representing around 16,000 different species
eprasad/t5-small-llama70b-distill-summarization
The future of the Amazon will depend on the collective efforts of local communities, national governments, global stakeholders, and individual actions. Protecting this critical ecosystem is not only essential for preserving biodiversity but also for combating climate change and maintaining the health of the planet for future generations.
eprasad/t5-small-deepseek-distill-summarization
The Amazon rainforest, often referred to as the “lungs of the Earth,” is a vast tropical rainforest. It covers approximately 5.5 million square kilometers. The Amazon is home to an incredible range of flora and fauna.
eprasad/bart-base-qwen32-distill-summarization
The Amazon rainforest, often referred to as the “lungs of the Earth,” is a vast tropical rainforest that spans across nine South American countries, with the majority located in Brazil. It covers approximately 5.5 million square kilometers and plays a role in encouraging more sustainable practices. However, balancing economic development with environmental preservation remains a complex and ongoing challenge. The future of the Amazon will depend on the collective efforts of local communities, national governments, global stakeholders,
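
Some of the outputs above repeat themselves; the google-t5/t5-small output, for instance, contains the same sentence twice. A rough sketch of flagging such repeats automatically is shown here. The naive sentence splitting is a simplifying assumption, and `repeated_sentences` is a hypothetical helper, not part of the original evaluation.

```python
import re
from collections import Counter


def repeated_sentences(summary):
    """Return sentences that occur more than once in a generated summary."""
    # Naive split after sentence-ending punctuation; adequate for a quick check.
    parts = re.split(r"(?<=[.!?])\s+", summary.strip())
    normalized = [p.strip().lower() for p in parts if p.strip()]
    counts = Counter(normalized)
    return [sentence for sentence, n in counts.items() if n > 1]


# Tail of the google-t5/t5-small output listed above.
t5_small_output = (
    "The future of the Amazon will depend on the collective efforts of local "
    "communities, national governments, global stakeholders, and individual actions. "
    "The future of the Amazon will depend on the collective efforts of local "
    "communities, national governments, global stakeholders, and individual actions."
)
print(repeated_sentences(t5_small_output))
```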
Used MoverScore (https://arxiv.org/abs/1909.02622) as the evaluation metric.