SELF-INSTRUCT — Step 1: Instruction Generation

Project: self-instruct-gpt4mini • Author: Dimitris Markopoulos • Date: 2025‑10‑15

Goal. Expand from a small human-written nucleus of 175 tasks to a large, diverse set of synthetic instructions that look and read like human prompts. This step does not create answers; it only invents new tasks for later stages.

Input: 175 seed instructions
Output: synthetic instructions (text only)
Model: GPT-4o-mini (API)
Budget target: < $100 (full pipeline)

Concept

Provide the model with a few high-quality seed instructions as in-context demonstrations and prompt it to continue the list with new, plausible tasks. Repeat to accumulate thousands of candidates. This bootstraps instruction diversity without additional manual authoring.

Inputs

• data/seed_tasks.jsonl: the 175 human-written seed tasks
• data/generated_tasks.jsonl: the growing pool of previously generated instructions (empty on the first run)

Prompt Template

Minimal list-continuation prompt (adapted to our repo):

Come up with a series of tasks:

Task 1: <seed instruction A>
Task 2: <seed instruction B>
Task 3: <seed instruction C>
Task 4: <seed instruction D>
Task 5: <seed instruction E>
Task 6: <seed instruction F>
Task 7: <seed instruction G>
Task 8: <seed instruction H>
Task 9:

The model continues the list until it exhausts the token budget (a hyperparameter; the original paper uses a token threshold), emits a natural stop, or reaches 16 items. The constraints are purposely lax to maximize the model's "creativity". We then parse the continuation into standalone instructions, as sketched below.
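
To make the parsing step concrete, here is a minimal sketch of the idea; the continuation text is made up for illustration, and the full version lives in parse_instructions in the reference implementation below.

# Minimal parsing sketch (illustrative continuation text, not real model output)
import re

continuation = (
    "Write a haiku about autumn.\n"
    "Task 10: Convert the following sentence to passive voice.\n"
    "Task 11: List three pros and cons of remote work."
)
text = "Task 9: " + continuation  # re-attach the label the prompt left open
tasks = re.findall(r"Task\s*\d+:\s*(.+?)(?=\s*Task\s*\d+:|$)", text, flags=re.S)
print(tasks)
# ['Write a haiku about autumn.', 'Convert the following sentence to passive voice.', 'List three pros and cons of remote work.']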

Algorithm

  1. Sample 8 distinct in-context demonstrations: 6 from the human-curated seed set (size 175) and 2 from the previously generated set.
  2. Render the prompt (an f-string over the sampled instructions) and call the API with mildly creative sampling settings.

    Hyperparameters:

    • temperature = 0.8 — controls sampling randomness; higher values let the model branch more freely beyond the demonstrated tasks.
    • top_p = 0.9 — nucleus sampling; tokens are drawn only from the smallest set whose cumulative probability reaches 0.9, which keeps outputs coherent.
    • max_tokens = 1024 — allows the model to produce multiple new task instructions per call.

    (The reference implementation below currently uses temperature = 0.7 and top_p = 0.5.)
  3. Parse the generated continuation into standalone instruction strings.
  4. Filter and postprocess generated instructions before adding them to the task pool (a minimal sketch of these filters appears after the note below).

    • Similarity filter: Add a new instruction only if its ROUGE-L or embedding similarity with existing tasks is below 0.7 to maintain diversity.
    • Keyword filter: Remove tasks mentioning unsupported modalities such as image, picture, or graph.
    • Length filter: Discard instructions that are too short (<5 words) or too long (>50 words).
    • Duplicate cleanup: Exclude instances with identical inputs or conflicting outputs.
  5. Repeat for N batches until the target count is reached (N depends on my budget); a driver-loop sketch appears after the reference implementation.

Note for step 4: I diverge from the original research paper here; I prefer to dump all generated responses first and then perform the post-processing in a separate pass.
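
Below is a minimal sketch of that post-processing pass, assuming the raw instructions were dumped to data/generated_instructions.jsonl; the output path, the helper names, and the use of the rouge-score package are my own choices, while the thresholds mirror the filters listed above.

# Sketch of the step-4 post-processing pass (file and helper names are assumptions)
# Requires: pip install rouge-score
import json
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
BLOCKED_KEYWORDS = {"image", "picture", "graph", "photo", "video"}  # unsupported modalities

def keep(instruction: str, pool: list) -> bool:
    words = instruction.split()
    if not (5 <= len(words) <= 50):                                # length filter
        return False
    if any(kw in instruction.lower() for kw in BLOCKED_KEYWORDS):  # keyword filter
        return False
    for existing in pool:                                          # similarity filter (ROUGE-L < 0.7)
        if scorer.score(existing, instruction)["rougeL"].fmeasure >= 0.7:
            return False
    return True

def postprocess(raw_path="data/generated_instructions.jsonl",
                out_path="data/filtered_instructions.jsonl"):
    pool = []  # accepted instructions so far
    with open(raw_path) as f, open(out_path, "w") as out:
        for line in f:
            task = json.loads(line)
            instr = task["instruction"].strip()
            if instr in pool:                                      # exact-duplicate cleanup
                continue
            if keep(instr, pool):
                pool.append(instr)
                out.write(json.dumps(task) + "\n")
    print(f"kept {len(pool)} of the raw instructions")

if __name__ == "__main__":
    postprocess()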

Outputs

Artifact                        Description                               Location
generated_instructions.jsonl   Raw synthetic instructions (unlabeled)    data/generated_instructions.jsonl
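
Each line of the file is one JSON record; an illustrative example (the instruction text is made up), with the keys produced by parse_instructions below:

{"instruction": "Convert the following sentence to passive voice.", "source": "gpt-4o-mini"}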

🔗 View all code on GitHub →

Reference Implementation (Python)


#==================================
# STEP 1 : INSTRUCTION GENERATION
#==================================
import random
import openai
import json
from pathlib import Path
import os
from dotenv import load_dotenv
import yaml
import re

# Load API key
load_dotenv("secrets.env") # load the secret key into the environment
with open("config.yaml") as f:
    raw_yaml = f.read()
    expanded_yaml = os.path.expandvars(raw_yaml) # substitute ${ENV_VAR} references (e.g., the API key)
    config = yaml.safe_load(expanded_yaml)
openai.api_key = config["openai"]["api_key"] # configure the OpenAI client

# Load the 175 human-written seed tasks
with open("data/seed_tasks.jsonl") as f:
    seed_tasks = [json.loads(line) for line in f]

##
def grab_subsample():
    """
    Pulls 6 human-written instructions and 2 LLM-generated instructions.
    Initializes with 8 human-written instructions if no generated tasks exist yet.
    """
    try:
        with open("data/generated_tasks.jsonl") as f:
            sampled_llm_tasks = [json.loads(line) for line in f]
    except FileNotFoundError:
        sampled_llm_tasks = []

    if len(sampled_llm_tasks) < 2:
        # fall back to all human-written instructions while the generated pool is too small
        return random.sample(seed_tasks, 8)

    llm_sample = random.sample(sampled_llm_tasks, 2)
    human_sample = random.sample(seed_tasks, 6)

    return llm_sample + human_sample

##
def create_prompt(instruction_sample: list) -> str:
    """
    Render the instruction-generation prompt (adapted from Table 5 of SELF-INSTRUCT).
    """
    header = "Come up with a series of tasks:\n\n"
    lines = []
    for i, task in enumerate(instruction_sample):
        lines.append(f'Task {i + 1}: {task["instruction"].strip()}')
    lines.append(f"Task {len(instruction_sample) + 1}:") # empty final slot for the LLM to fill in
    return header + "\n".join(lines)

##
def generate_instructions(prompt: str, model: str = "gpt-4o-mini") -> str:
    """
    Query the OpenAI API to generate new task instructions.
    """
    resp = openai.chat.completions.create(
        model=config["openai"].get("model", model), # config value wins; fall back to the argument
        messages=[
            # the system role was added because the model was being too helpful and did not
            # continue the list; this gives it further direction - the original paper did not do this
            {"role": "system", "content": (
                "Continue the numbered list of tasks below. "
                "Write 3 to 8 new Task entries that follow the same style. "
                "Do not explain, comment, or ask for clarification."
            )},
            {"role": "user", "content": prompt}
        ],
        temperature=0.7,
        top_p=0.5,
        max_tokens=1024
    )
    return resp.choices[0].message.content

##
def parse_instructions(text: str) -> list:
    """
    Parse raw model output (continuation of 'Task 9: ...') into a list of clean instruction dicts.
    """
    if not text.strip().startswith("Task"): # prepend "Task 9:" if model didn't include it
        text = "Task 9: " + text.strip()
    matches = re.findall(r"Task\s*\d+:\s*(.+?)(?=\s*Task\s*\d+:|$)", text, flags=re.S) # find all task blocks like "Task 9: some text"
    tasks = []
    for t in matches:
        cleaned = t.strip().replace("\n", " ").strip(" .")
        if cleaned:
            tasks.append({
                "instruction": cleaned,
                "source": "gpt-4o-mini",
            })
    return tasks

##
def create_task(model: str = "gpt-4o-mini") -> list[dict]:
    """
    Wrapper around the entire step-1 pipeline.
    Returns a list[dict] with keys 'instruction' and 'source'.
    Meant to be called in a loop to fill data/generated_tasks.jsonl.
    """
    sample_list = grab_subsample()
    prompt = create_prompt(sample_list)
    generated_instructions_str = generate_instructions(prompt, model=model)
    task_list = parse_instructions(generated_instructions_str)
    return task_list

if __name__ == '__main__':
    task_list = create_task()
    print(f"Dict : {task_list}\n")
    for task in task_list:
        print(task['instruction'])
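
The step-5 batching loop is not part of the reference implementation above; here is a minimal sketch of how it could be driven, assuming the code above lives in a module (the module name, TARGET value, and sleep interval below are placeholders, not tuned choices):

# Sketch of the step-5 batching loop (module name, TARGET, and sleep interval are assumptions)
import json
import time

from step1_instruction_generation import create_task  # hypothetical module name for the code above

TARGET = 5000                        # stop once this many instructions have been accumulated
POOL_PATH = "data/generated_tasks.jsonl"

def run_batches():
    count = 0
    try:
        with open(POOL_PATH) as f:   # resume from an existing pool if one is present
            count = sum(1 for _ in f)
    except FileNotFoundError:
        pass

    while count < TARGET:
        batch = create_task()        # one API call -> a handful of new instructions
        with open(POOL_PATH, "a") as f:
            for task in batch:
                f.write(json.dumps(task) + "\n")
        count += len(batch)
        print(f"pool size: {count}")
        time.sleep(1)                # be gentle with API rate limits

if __name__ == "__main__":
    run_batches()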