Materials for Video 11 of "Hinter den Kulissen von ChatGPT" (Behind the Scenes of ChatGPT)

July 18, 2025

Training data

1. Unlabeled data for pretraining

Common Crawl

Text from billions of web pages, raw and unfiltered
Size: several hundred terabytes
https://data.commoncrawl.org/

OpenWebText

Text from web pages that were linked on Reddit and highly upvoted
Size: approx. 40 GB of text (uncompressed)
https://skylion007.github.io/OpenWebTextCorpus/

BooksCorpus

Text from around 7,000 free, self-published books retrieved from Smashwords, an indie ebook distribution website.
Size: approx. 1 GB (compressed)
https://en.wikipedia.org/wiki/BookCorpus

The Pile

Curated dataset composed of 22 different component sources
(Includes, among others, program code, scientific articles, question-and-answer forums, fiction, philosophical texts, film subtitles, and legal documents)
Size: approx. 900 GB
https://en.wikipedia.org/wiki/The_Pile_(dataset)
https://pile.eleuther.ai/

MiniPile

"The main motivation for curating MiniPile is that (i) diverse pretraining datasets (like The Pile) are often too large for academic budgets and (ii) smaller datasets are mostly fairly homogeneous and therefore unrepresentative of modern general-purpose language models. MiniPile aims to fill this gap and to enable data-efficient research on model architectures, training procedures, optimizers, etc." (Jean Kaddour)
Size: 1 million texts, approx. 5 GB (?)

2. Training data for isolated tasks

2.1 Text classification

Text classification datasets consist of texts that are each assigned to a class. These classes may concern, for example, the content or the sentiment of a text. The datasets are used to train models to sort text into the corresponding classes.

IMDb

Movie reviews, each labeled as "positive" or "negative". Contains 50,000 reviews.

Examples (label 1 = positive, label 0 = negative)

{
"text": "I felt a great joy, after seeing this film, not because it is a master piece, but because it convinced me of, that the Portuguese cinema became really very good. We can see here the best Portuguese actores in this field.",
"label": 1
}

{
"text": "I didn't know it was possible to release a movie this bad. The labeling sounded so promising, but you would think that with a cast of 20, at least one of them would be able to act. My wife left me and went to bed after the first 20 minutes. She made a wise decision.",
"label": 0
}

AG News

Around 120,000 news texts in four categories: World, Sports, Business, and Sci/Tech

Examples

0 = World, 1 = Sports, 2 = Business, 3 = Sci/Tech

{
"text": "Venezuelan Car-Bomb Suspect Killed, Weapons Found CARACAS, Venezuela (Reuters) - A Venezuelan lawyer suspected in last week's bombing murder of a top state prosecutor was killed in a gunfight with police on Tuesday after he tried to ram detectives with his car and opened fire on them, officials said.",
"label": 0
}

{
"text": "This Date in Baseball - Aug. 17 (AP) AP - 1904 #151; Jesse Tannehill of the Boston Red Sox pitched a no-hitter, beating the Chicago White Sox 6-0.",
"label": 1
}

{
"text": "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling band of ultra-cynics, are seeing green again.",
"label": 2
}

{
"text": "Stripped down Longhorn still offers gems Although Microsoft's Longhorn has been stripped of its unified file system and some other key technologies, the company said the next Windows release will still be worth the upgrade.",
"label": 3
}
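
The integer labels in the records above only become readable through the mapping listed before the examples. A minimal sketch of decoding them, assuming records are plain dictionaries with "text" and "label" fields as shown (the names AG_NEWS_LABELS and label_name are made up for this illustration, and the example text is shortened):

```python
# Mapping of label ids to category names, as listed above.
AG_NEWS_LABELS = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}


def label_name(record: dict) -> str:
    """Return the category name for a record with an integer 'label' field."""
    return AG_NEWS_LABELS[record["label"]]


# Shortened version of one of the examples above.
example = {
    "text": "Wall St. Bears Claw Back Into the Black (Reuters)",
    "label": 2,
}
print(label_name(example))
```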

TREC

Questions, each assigned to one of six question types.

Main categories

DESC (Description): definitions, explanations, reasons
e.g. What is…, Why…

ENTY (Entity): concrete things, animals, products, colors, etc.

ABBR (Abbreviation): acronyms and their meanings

HUM (Human): persons, groups, professions

LOC (Location): places, countries, cities, regions

NUM (Numeric): numbers, dates, quantities, measures

Subcategories (examples)

HUM:ind → individual person

NUM:date → date

DESC:def → definition

ENTY:animal → animal

ABBR:exp → expanded form of an acronym

Examples

DESC:manner How did serfdom develop in and then leave Russia ?
ENTY:cremat What films featured the character Popeye Doyle ?
DESC:manner How can I find a list of celebrities ' real names ?
ENTY:animal What fowl grabs the spotlight after the Chinese Year of the Monkey ?
ABBR:exp What is the full form of .com ?
HUM:ind What contemptible scoundrel stole the cork from my lunch ?
HUM:gr What team did baseball 's St. Louis Browns become ?
HUM:title What is the oldest profession ?
DESC:def What are liver enzymes ?
HUM:ind Name the scar-faced bounty hunter of The Old West .
NUM:date When was Ozzy Osbourne born ?
DESC:reason Why do heavier objects travel downhill faster ?
HUM:ind Who was The Pride of the Yankees ?
HUM:ind Who killed Gandhi ?
ENTY:event What is considered the costliest disaster the insurance industry has ever faced ?
LOC:state What sprawling U.S. state boasts the most airports ?
DESC:desc What did the only repealed amendment to the U.S. Constitution deal with ?
NUM:count How many Jews were executed in concentration camps during WWII ?
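
Each example line above consists of a label of the form COARSE:fine, followed by a space and the question. A minimal sketch of splitting such a line into its parts, assuming exactly this layout (the function name parse_trec_line is made up for this illustration):

```python
def parse_trec_line(line: str) -> tuple[str, str, str]:
    """Split 'COARSE:fine question text' into (coarse, fine, question)."""
    # The label ends at the first space; the rest is the question.
    label, question = line.split(" ", 1)
    # The label itself separates main category and subcategory with a colon.
    coarse, fine = label.split(":", 1)
    return coarse, fine, question.strip()


print(parse_trec_line("NUM:date When was Ozzy Osbourne born ?"))
```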

2.2 Datasets for extractive question answering

Datasets for extractive question answering contain questions, text passages, and matching answers that can be taken from the passages as exact text spans. Models are trained on them to identify and extract the relevant places in a text.

Stanford Question Answering Dataset (SQuAD)

SQuAD is a reading comprehension dataset consisting of questions (question) posed by crowdworkers about excerpts from Wikipedia articles (context). The answer to each question is either a text span (text and answer_start) from the respective passage or, from version 2.0 onward, the statement that the question cannot be answered from the text (is_impossible).

{
  "qas": [
    {
      "question": "What is the basic unit of territorial division in Poland?",
      "id": "573380e0d058e614000b5be9",
      "answers": [
        {
          "text": "a commune",
          "answer_start": 52
        },
        {
          "text": "commune",
          "answer_start": 54
        },
        {
          "text": "commune",
          "answer_start": 54
        }
      ],
      "is_impossible": false
    },
    {
      "question": "What is the second level of territorial division in Poland?",
      "id": "573380e0d058e614000b5bea",
      "answers": [
        {
          "text": "counties or powiats",
          "answer_start": 421
        },
        {
          "text": "counties or powiats",
          "answer_start": 421
        },
        {
          "text": "counties or powiats",
          "answer_start": 421
        }
      ],
      "is_impossible": false
    },
    {
      "question": "In what districts are the registration numbers for cars all of the same type?",
      "id": "573380e0d058e614000b5beb",
      "answers": [
        {
          "text": "Kraków",
          "answer_start": 1085
        },
        {
          "text": "Kraków",
          "answer_start": 1085
        }
      ],
      "is_impossible": false
    },
    {
      "question": "What is the basic unit of territorial division in Warsaw?",
      "id": "5ad4f40c5b96ef001a10a774",
      "answers": [],
      "is_impossible": true,
      "plausible_answers": [
        {
          "text": "a commune",
          "answer_start": 86
        }
      ]
    },
    {
      "question": "What is the second level of territorial division in Warsaw?",
      "id": "5ad4f40c5b96ef001a10a775",
      "answers": [],
      "is_impossible": true,
      "plausible_answers": [
        {
          "text": "counties or powiats",
          "answer_start": 421
        }
      ]
    },
    {
      "question": "In what districts are the registration numbers for boats all of the same type?",
      "id": "5ad4f40c5b96ef001a10a776",
      "answers": [],
      "is_impossible": true,
      "plausible_answers": [
        {
          "text": "Kraków",
          "answer_start": 1085
        }
      ]
    },
    {
      "question": "What does a car have besides a commune?",
      "id": "5ad4f40c5b96ef001a10a777",
      "answers": [],
      "is_impossible": true,
      "plausible_answers": [
        {
          "text": "city charter.",
          "answer_start": 111
        }
      ]
    },
    {
      "question": "What city has districts with no powiat entitlements?",
      "id": "5ad4f40c5b96ef001a10a778",
      "answers": [],
      "is_impossible": true,
      "plausible_answers": [
        {
          "text": "Warsaw",
          "answer_start": 760
        }
      ]
    }
  ],
  "context": "The basic unit of territorial division in Poland is a commune 
              (gmina). A city is also a commune – but with the city charter.
              Both cities and communes are governed by a mayor – but in 
              the communes the mayor is vogt (wójt in Polish), however in the 
              cities – burmistrz. Some bigger cities obtain the entitlements, 
              i.e. tasks and privileges, which are possessed by the units of 
              the second level of the territorial division – counties or 
              powiats. An example of such entitlement is a car registration: 
              a gmina cannot register cars, this is a powiat's task (i.e. a 
              registration number depends on what powiat a car had been 
              registered, not gmina). In this case we say about city county 
              or powiat grodzki. Such cities are for example Lublin, Kraków, 
              Gdańsk, Poznań. In Warsaw, its districts additionally have 
              some of powiat's entitlements – like already mentioned car 
              registration. For example, the district Wola has its own evidence 
              and the district Ursynów – its own (and the cars from Wola 
              have another type of registration number than these from Ursynów).
              But for instance the districts in Kraków do not have entitlements
              of powiat, so the registration numbers in Kraków are of the same 
              type for all districts."
}
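
In this format, answer_start is a character offset into context, so an answer can be checked by slicing the passage. A minimal sketch, using only the first sentence of the passage above as context (the function name answer_matches_context is made up for this illustration; offsets are assumed to count characters of the correctly decoded text, not bytes):

```python
def answer_matches_context(context: str, answer: dict) -> bool:
    """Check that the answer text occurs in the context at answer_start."""
    start = answer["answer_start"]
    end = start + len(answer["text"])
    return context[start:end] == answer["text"]


# First sentence of the passage above, without the display line breaks.
context = "The basic unit of territorial division in Poland is a commune (gmina)."

# Two of the annotated answers to the first question.
print(answer_matches_context(context, {"text": "a commune", "answer_start": 52}))
print(answer_matches_context(context, {"text": "commune", "answer_start": 54}))
```

A check like this can help spot records whose offsets were shifted by re-encoding or whitespace changes.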

Further examples of datasets for extractive QA

Natural Questions

https://ai.google.com/research/NaturalQuestions/visualization

NewsQA

https://huggingface.co/datasets/badokorach/NewQA

2.3 Datasets for multiple-choice question answering

Datasets for multiple-choice question answering contain questions, several predefined answer options, and a marker for the correct solution.

RACE (Reading Comprehension Dataset from Examinations)

RACE is a large dataset for testing reading comprehension, with over 28,000 passages and nearly 100,000 multiple-choice questions. It was compiled from English exams in China designed for middle and high school students.

Structure

id: An identifier for each passage in the dataset.
article: A string containing the passage.
questions: A list of questions. There are two question types: either an interrogative sentence or a sentence with a gap, represented by an underscore.
options: A list of four answer options for each question.
answers: A list of the correct answers (gold labels) for each question.

{
  "id": "middle1018.txt",
  "article": "Nice to meet you. I'm David Beckham. I'm from England and
   I'm English. I'm twenty-nine years old. I like playing football and 
   I can play football very well.
   My name is Zhou Jielun. I'm from Taiwan, China. I'm thirty years old.
   I like singing.
   My name is Liu Qian. I'm from Taiwan,China. I'm thirty-three years 
   old. I like playing magic cards
   Hello! I'm Li Yuchun. I live in Chengdu now. I'm twenty-five years 
   old. I like singing.",
  
  "questions": [
    "Who is Chinese?",
    "Who is the youngest ?",
    "_ and _ are from Taiwan.",
    "David Beckham is _ years old.",
    "David Beckham's family name is _ and Zhou Jielun's given name is _ ."
  ],
  "options": [
    [
      "Zhou Jielun",
      "Liu Qian",
      "Li Yuchun",
      "A ,B, and C"
    ],
    [
      "Liu Qian",
      "Zhou Jielun",
      "Li Yuchun",
      "David Beckham"
    ],
    [
      "Zhou Jielun; Li Yuchun",
      "Zhou Jielun ; Liu Qian",
      "Liu Qian; Li Yuchun",
      "Li Yuchun ; David Beckham"
    ],
    [
      "twenty-nine",
      "twenty-five",
      "thirty-three",
      "thirty"
    ],
    [
      "David ; Zhou",
      "David; Beckham",
      "Beckham; Zhou",
      "Beckham; Jielun"
    ]
  ],
  "answers": ["D", "C", "B", "A", "D"]
}
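
The gold labels are letters that index into the option list of the same question. A minimal sketch of resolving them to the option text, using the last two questions of the record above (the function name resolve_answers is made up for this illustration, and four options labeled A to D are assumed):

```python
def resolve_answers(options: list[list[str]], answers: list[str]) -> list[str]:
    """Map each gold letter (A-D) to the text of the chosen option."""
    # 'A' maps to index 0, 'B' to index 1, and so on.
    return [opts[ord(letter) - ord("A")] for opts, letter in zip(options, answers)]


# Options and gold labels of the last two questions in the record above.
options = [
    ["twenty-nine", "twenty-five", "thirty-three", "thirty"],
    ["David ; Zhou", "David; Beckham", "Beckham; Zhou", "Beckham; Jielun"],
]
answers = ["A", "D"]
print(resolve_answers(options, answers))
```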

Further datasets

MCTest

https://mattr1.github.io/mctest/

OpenBookQA

(Here the answer is not contained in the given text. Answering requires world knowledge.)

https://huggingface.co/datasets/allenai/openbookqa/viewer/additional/train

ARC (AI2 Reasoning Challenge)

(…the same applies to this dataset.)

https://huggingface.co/datasets/allenai/ai2_arc/viewer/ARC-Challenge/train

SWAG (Situations With Adversarial Generations)

Provides descriptions of situations with four possible continuations, one of which is correct.

https://huggingface.co/datasets/allenai/swag/viewer/regular/train

2.4 Datasets for generative question answering

In datasets for generative question answering, the answer is not given as a text span or as one of several options. The language model is supposed to learn to generate the correct answer autoregressively.

Natural Questions Open

{"question": "when was the last time anyone was on the moon", 
"answer": ["14 December 1972 UTC", "December 1972"]}
{"question": "who wrote he ain't heavy he's my brother lyrics", 
"answer": ["Bobby Scott", "Bob Russell"]}
{"question": "how many seasons of the bastard executioner are there", 
"answer": ["one", "one season"]}
{"question": "when did the eagles win last super bowl", 
"answer": ["2017"]}
{"question": "who won last year's ncaa women's basketball", 
"answer": ["South Carolina"]}
{"question": "when did the isle of wight become an island", 
"answer": ["During the last Ice Age"]}
{"question": "love yourself by justin bieber is about who", 
"answer": ["Rihanna"]}
{"question": "who was the ruler of england in 1616", 
"answer": ["James I"]}
{"question": "what is the hot coffee mod in san andreas", 
"answer": ["a normally inaccessible mini-game"]}

https://github.com/efficientqa/nq-open/blob/master/NQ-open.dev.jsonl
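
Each record above lists several acceptable answers. One simple way to score a generated answer against such a list is a normalized exact match. The sketch below is an assumption for illustration, not a procedure taken from these materials; in particular, the normalization (lowercasing, stripping punctuation, collapsing whitespace) and the function names are choices made here:

```python
import string


def normalize(text: str) -> str:
    """Lowercase, remove ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """Return True if the prediction equals any gold answer after normalization."""
    target = normalize(prediction)
    return any(target == normalize(gold) for gold in gold_answers)


# Gold answers taken from one of the records above.
print(exact_match("One season.", ["one", "one season"]))
print(exact_match("two seasons", ["one", "one season"]))
```

Exact match is strict: a correct but differently worded answer counts as wrong, which is one reason why several gold answers are listed per question.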

2.5 Datasets for machine translation

Datasets for machine translation consist of pairs of sentences in different languages whose content is (as far as possible) identical.

TATOEBA

Tatoeba is a collaborative project for collecting example sentences in numerous languages. The sentences are contributed by users and are often linked to translations.

https://tatoeba.org/de/

OpenSubtitles (parallel sentences)

A variant of OpenSubtitles that, based on film subtitles, contains millions of sentence pairs from aligned subtitles in different languages.

https://huggingface.co/datasets/sentence-transformers/parallel-sentences-opensubtitles/viewer/all/train

TED Multilingual Corpus

TED Parallel Corpora is a growing collection of bilingual, multilingual, and monolingual corpora extracted from TED talks (www.ted.com) for 109 world languages. It comprises a monolingual corpus, a bilingual parallel corpus in 12 languages with over 120 million aligned sentences, and a multilingual parallel corpus in 13 languages with over 600,000 sentences.

https://github.com/ajinkyakulkarni14/TED-Multilingual-Parallel-Corpus

OPUS

OPUS (Open Parallel Corpus) is a meta-dataset: a huge, freely available collection of automatically processed translated texts from various open sources. TATOEBA and OpenSubtitles, for example, are also part of OPUS. It serves mainly as a parallel corpus for training machine translation systems and covers hundreds of languages and billions of sentence pairs.

https://opus.nlpl.eu/

2.6 Datasets for text summarization

Datasets for text summarization contain pairs of texts in which the second text is a shortened version of the first.

CNN Dailymail

LONG:
(CNN) -- The company was founded in 1985 by seven communications industry veterans -- Franklin Antonio, Adelia Coffman, Andrew Cohen, Klein Gilhousen, Irwin Jacobs, Andrew Viterbi and Harvey White. One of Qualcomm’s first products was OmniTRACS, introduced in 1988, which is currently the largest satellite-based commercial mobile system for the transportation industry. Today, Qualcomm’s patent portfolio includes approximately 6,100 United States patents and patent applications for CDMA and related technologies. More than 130 telecommunications equipment manufacturers worldwide have licensed QUALCOMM’s essential CDMA patents. Qualcomm is among the members of the S&P 500 Index, Fortune 500, and a winner of the U.S. Department of Labor’s "Secretary of Labor’s Opportunity Award." The company has been listed among Fortune’s "100 Best Companies to Work For in America" for nine years in a row and the magazine’s list of "Most Admired Companies." Qualcomm’s Annual revenue for 2006 was $7.53 billion, with a net income of $2.47 billion. E-mail to a friend .

SHORT:
The company has become a huge name in communications in just 20 years . Qualcomm has a portfolio of approximately 6,100 U.S. patents . Fortune lists the company as one of the 100 best places to work in the U.S.

https://huggingface.co/datasets/abisee/cnn_dailymail

Further datasets

XSum (Extreme Summarization)

https://huggingface.co/datasets/EdinburghNLP/xsum/viewer/default/train?row=2

SAMSum

Dialogue dataset consisting of simulated messaging conversations with corresponding summaries.

https://huggingface.co/datasets/knkarthick/samsum

3. Instruction Fine-Tuning

The datasets listed in Section 2 can be used to train language models to solve one particular class of tasks each. The result would be a group of specialized models. Depending on the task at hand, we would call on one or the other.

If training succeeds, even these specialists must achieve considerable generalization. A model trained to summarize texts, for instance, must be able to summarize texts it has never seen before, on any topic whatsoever.

The goal of Instruction Fine-Tuning (IFT) is an assistant that is not specialized in handling one particular class of tasks but is, in principle, able to carry out any task it is given.

The model should therefore not be specialized in, say, summarizing or translating texts, but in following instructions in general. This considerably raises the generalization ability demanded of the model.

All tasks named in Section 2 can be packaged as instruction fine-tuning tasks. The so-called Alpaca format has proven useful here: each task contains an instruction, an input, and an output. The models are then trained to generate the output from the instruction and the input.
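
As a rough illustration of this packaging, the sketch below turns a labeled classification record (a "text" plus an integer "label", as in the AG News examples in Section 2.1) into a triple of instruction, input, and output. The function name and the wording of the instruction are assumptions made for this example:

```python
# Mapping of AG News label ids to category names (see Section 2.1).
AG_NEWS_LABELS = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}


def classification_to_instruction(record: dict) -> dict:
    """Wrap a labeled text into an instruction/input/output record."""
    categories = ", ".join(AG_NEWS_LABELS.values())
    return {
        # The instruction wording is a hypothetical choice, not a fixed standard.
        "instruction": f"Classify this news article into one of the categories: {categories}.",
        "input": record["text"],
        "output": AG_NEWS_LABELS[record["label"]],
    }


record = {"text": "This Date in Baseball - Aug. 17 (AP)", "label": 1}
print(classification_to_instruction(record))
```

The same pattern applies to the other task types: the task description becomes the instruction, the task-specific data becomes the input, and the label or target text becomes the output.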

2.1 Text classification

Instruction: Classify this news article into one of the categories: Politics, Sports, Business, or Culture.
Input: The DAX reached a new record high today, while the euro weakened slightly against the dollar.
Output: Business

2.2 Extractive question answering

Instruction: Answer the question based on the given text.
Input: Text: Marie Curie was a physicist and chemist of Polish origin. She received Nobel Prizes in physics and chemistry. Question: In which fields did Marie Curie receive Nobel Prizes?
Output: physics and chemistry

2.3 Multiple-choice question answering

Instruction: Choose the correct answer to the following question.
Input: What is the capital of Japan? A) Beijing B) Seoul C) Tokyo D) Bangkok
Output: C) Tokyo

2.4 Generative question answering

Instruction: Answer the question.
Input: Which countries make up the United Kingdom?
Output: England, Scotland, Wales, Northern Ireland

Instruction: Answer the question in five sentences.
Input: How is paper made?
Output: Paper is made from pulp, which is obtained from wood or other plant materials. The pulp is then mixed with water in a mill and processed into a slurry. The slurry is then spread onto a paper web and pressed to remove the water. Next, the paper web is drawn through a series of rollers that align and smooth the paper fibers. Finally, the paper is dried and folded to bring it into its final form.

2.5 Machine translation

Instruction: Translate the given German text into English.
Input: Die künstliche Intelligenz verändert unsere Arbeitswelt grundlegend.
Output: Artificial intelligence is fundamentally changing our working world.

2.6 Text summarization

Instruction: Summarize the following text in 2-3 sentences.
Input: A man from Japan has set a world record by singing karaoke nonstop for 24 hours. The 29-year-old, who calls himself the "Karaoke King", sang a total of 786 songs at a karaoke club in Tokyo. He sang songs from various genres, including pop, rock, and musicals. His voice was still intact at the end of the record attempt, but his listeners were less happy. The Guinness World Record was officially recognized the next day.
Output: A Japanese man sang karaoke nonstop for 24 hours, setting a new world record.

Further examples of IFT tasks:

Content-based recommendations

Instruction: Recommend three books based on the stated preferences.
Input: I like science fiction novels that deal with philosophical and sociological questions.
Output: 1. "Dune" by Frank Herbert
2. "Ubik" by Philip K. Dick
3. "We" by Yevgeny Zamyatin

Text generation

Instruction: Write a short scene on the given topic.
Input: The weekly market is drawing to a close.
Output: The last customers leave, vendors pack crates into vans. Awnings are rolled up, tables folded. Scraps of fruit lie on the pavement, a broom scrapes them aside. A man calls something to his colleague, then the tailgate of the van slams shut. The square empties.

Code generation

Instruction: Write a Python function for the described task.
Input: A function that checks whether a word is a palindrome.
Output:
def ist_palindrom(wort):
    # Compare the lowercased word with its reverse
    wort = wort.lower()
    return wort == wort[::-1]

Fact checking

Instruction: Check whether the statement is correct.
Input: The Berlin Wall fell on October 3, 1990.
Output: False. The Berlin Wall fell on November 9, 1989. October 3, 1990 is the date of German reunification.

Text rephrasing

Instruction: Rephrase the text in simpler language.
Input: The implementation of the new emission reduction directive will proceed successively in order to take account of the economic repercussions.
Output: The new rule for reducing emissions will be introduced step by step to take economic consequences into account.

Mathematical calculation

Instruction: Solve for x.
Input: 2y + x = 10
Output: x = 10 - 2y

Grammar I

Instruction: Correct the grammatical error.
Input: He go home.
Output: He goes home.

Grammar II

Instruction: Name all nouns in the sentence.
Input: The dog chases the cat through the garden.
Output: dog, cat, garden

Sorting I

Instruction: Sort the numbers in ascending order.
Input: 5, 2, 9, 1
Output: 1, 2, 5, 9

Sorting II

Instruction: Sort the words in ascending alphabetical order.
Input: Pear, Apple, Banana
Output: Apple, Banana, Pear

Example datasets

Alpaca

{
  "instruction": "Evaluate this sentence for spelling and grammar mistakes",
  "input": "He finnished his meal and left the resturant",
  "output": "He finished his meal and left the restaurant."
},
{
  "instruction": "Classify the following into animals, plants, and minerals",
  "input": "Oak tree, copper ore, elephant",
  "output": "Oak tree: Plant\n Copper ore: Mineral\n Elephant: Animal"
},
{
  "instruction": "Explain the use of word embeddings in Natural Language Processing",
  "input": "",
  "output": "Word embeddings are one of the most powerful tools available for Natural 
            Language Processing (NLP). They are mathematical representations of words 
            or phrases in a vector space, allowing similarities between words and the 
            context in which they are used to be measured. Word embeddings are useful 
            for tasks such as sentiment analysis, text classification, predicting the 
            next word in a sequence, and understanding synonyms and analogies. They 
            allow for words to be processed as numerical values, giving machines an 
            easier way to perform NLP tasks."
},
{
  "instruction": "Find the area of a circle given its radius.",
  "input": "Radius = 4",
  "output": "The area of a circle with a radius of 4 is equal to 12.5664 square units. 
            This is calculated by using the formula A = \u03c0r2, where A is the area, 
            \u03c0 is roughly equal to 3.1416 and r is the radius of the circle."
},
{
  "instruction": "Rewrite the following sentence using active voice.",
  "input": "The news report was read by the captain.",
  "output": "The captain read the news report."
},
{
  "instruction": "Explain why the given definition is wrong.",
  "input": "A mole is an animal that lives underground.",
  "output": "The definition is wrong because a mole is actually a small mammal that 
            lives above ground and feeds on insects. It is most distinguishable by 
            its long snout and wide, paddle-like feet. It does, however, often dig 
            burrows in which it lives and shelters."
},
{
  "instruction": "Extract the facts from the paragraph.",
  "input": "Online education continues to become more popular for schools and students
           alike. Its advantages are generally lower costs, less commitment and the 
           ability to study at a time, place and pace that suits the student.",
  "output": "- Online education is becoming increasingly popular.\n- It has several 
            advantages such as lower costs, less commitment and the ability to study 
            at one\u2019s own time and pace."
},
{
  "instruction": "Edit the following sentence to make it more concise.",
  "input": "He ran to the bus stop in order to catch the bus that was due to arrive in
           five minutes.",
  "output": "He ran to the bus stop, due to arrive in five minutes."
},

https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/refs/heads/main/alpaca_data.json
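
For training, records like these are usually rendered into a single text: a prompt built from instruction and input, followed by the output the model should learn to generate. The sketch below uses a simplified template with section markers chosen for this illustration; it should not be read as the exact template of the Alpaca project. It also shows one way to handle records with an empty input, such as the word-embeddings example above:

```python
def build_prompt(record: dict) -> str:
    """Render instruction and optional input into a prompt (simplified template)."""
    parts = [f"### Instruction:\n{record['instruction']}"]
    # Records with an empty input get no input section at all.
    if record.get("input"):
        parts.append(f"### Input:\n{record['input']}")
    parts.append("### Response:\n")
    return "\n\n".join(parts)


def build_training_text(record: dict) -> str:
    """Prompt plus expected output, i.e. the full text of one training example."""
    return build_prompt(record) + record["output"]


record = {
    "instruction": "Rewrite the following sentence using active voice.",
    "input": "The news report was read by the captain.",
    "output": "The captain read the news report.",
}
print(build_training_text(record))
```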

Open Hermes

[ { "from": "human", "value": "Why is it not recommended to use your phone while 
driving?", "weight": null }, { "from": "gpt", "value": "Using your phone while 
driving is not recommended because it distracts you from the road. This 
distraction can lead to accidents, injuries, and even fatalities. It can also 
lead to traffic violations and fines. Studies have shown that texting or talking 
on the phone while driving can be as dangerous as drunk driving. Even hands-free 
devices can still cause you to lose focus on the road. It's best to put your 
phone away while driving to ensure your safety and the safety of others on the 
road.", "weight": null } ]

[ { "from": "human", "value": "You walk into a room with a match. Inside the 
room, there is a stove, a fireplace, and a candle. Which do you light first?", 
"weight": null }, { "from": "gpt", "value": "You light the match first.\n\nHere's
 the explanation for this answer:\n\n1. In order to light any of the items in 
the room (the stove, fireplace, or candle), you first need a source of fire.\n
2. The match is your only source of fire in this scenario.\n3. Therefore, before 
you can light any of the other items, you must first light the match.\n\nThe 
question may seem tricky because it asks which item in the room you should light 
first, but the actual answer lies outside of those options and focuses on the 
action required to light them.", "weight": null } ]

[ { "from": "human", "value": "What is the largest and deepest ocean on Earth?", 
"weight": null }, { "from": "gpt", "value": "The Pacific Ocean is the largest 
and deepest ocean on earth. Spanning over 63 million square miles, it accounts 
for more than 60% of the earth's total oceanic area. Additionally, it holds the 
deepest part of the world's ocean, known as the Mariana Trench, which goes down 
about 36,000 feet at its deepest point.", "weight": null } ]

https://huggingface.co/datasets/teknium/OpenHermes-2.5/viewer/default/train
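
The records above mark each turn with "from" and "value", whereas many chat datasets and chat interfaces use "role" and "content" with the roles "user" and "assistant". A minimal sketch of converting between the two layouts; the role mapping, including the "system" entry, is an assumption made for this illustration:

```python
# Assumed mapping from the speaker names used above to common chat roles.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}


def to_role_content(turns: list[dict]) -> list[dict]:
    """Convert turns with 'from'/'value' keys into 'role'/'content' messages."""
    return [{"role": ROLE_MAP[turn["from"]], "content": turn["value"]} for turn in turns]


# Shortened version of one of the conversations above.
conversation = [
    {"from": "human", "value": "What is the largest and deepest ocean on Earth?", "weight": None},
    {"from": "gpt", "value": "The Pacific Ocean is the largest and deepest ocean on earth.", "weight": None},
]
for message in to_role_content(conversation):
    print(message["role"], ":", message["content"])
```

The "weight" field is simply dropped here, since it is null in the examples shown.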

Further datasets

Human Eval

https://huggingface.co/datasets/yizhongw/self_instruct/viewer/human_eval/train

Dolly 15k

https://huggingface.co/datasets/databricks/databricks-dolly-15k

4. Conversational Fine-Tuning

Datasets for conversational fine-tuning are used to train models to behave consistently in dialogues. They contain contributions in alternating roles and encourage context-aware answers.

(There is also so-called multi-turn instruction fine-tuning, which overlaps with conversational fine-tuning. The former is concerned with following complex instructions across several dialogue steps, while the latter aims at natural, human-like conversation. In my opinion (PNO), the boundary cannot be drawn clearly, so we treat the two together here.)

Ultrachat 200k

This dataset contains around 200,000 dialogues on a wide range of topics. The individual dialogue contributions (turns) are marked with the roles "user" and "assistant". This structure is widespread both in training datasets and when models are used in actual chats.

[
{
"role": "user",
"content": "What are the risks associated with IoT devices, and how can they be mitigated?"
},
{
"role": "assistant",
"content": "There are several risks associated with IoT devices, as outlined below,
along with suggestions for mitigating them:

1. Privacy and Security Risks: IoT devices can collect large amounts of
personal data, such as location, behavior patterns, and usage statistics,
putting users’ privacy and security at risk. To mitigate these risks,
manufacturers should consider implementing robust data encryption,
secure authentication, and regular software updates.

2. Data Breaches: IoT devices are susceptible to hacking, which could lead
to data breaches and cyber attacks. To mitigate these risks,
manufacturers should follow industry-standard security protocols, and users
should ensure that their Wi-Fi network is secure.

3. Technical Glitches: IoT devices can malfunction due to technical glitches,
leading to frustrating user experiences. To mitigate these risks,
manufacturers should conduct thorough testing and quality assurance before
releasing products to the market.

4. Interoperability: The interoperability of IoT devices can be challenging,
leading to integration issues between different devices, especially from
different manufacturers. To mitigate these risks, manufacturers could use
standard communication protocols and platforms, ensuring compatibility.

5. Power Failure: IoT devices rely on a steady power supply, and a power cut
could result in loss of data and other associated risks. Backup power
solutions such as battery backups or generators can help mitigate this.

Overall, manufacturers and users must remain vigilant and proactive in
detecting and mitigating risks to ensure these technologies are safe."
},
{
"role": "user",
"content": "That makes sense. I never thought about the power failure issue with IoT devices."
},
{
"role": "assistant",
"content": "It's always important to consider potential risks when using any technology
and to take steps to minimize those risks. As technology advances, it's crucial
to stay informed and adapt best practices and security measures to protect
yourself and your devices."
},
{
"role": "user",
"content": "Yeah, I definitely don't want my personal data compromised. I'll make sure
to do my research before purchasing any IoT devices and ensure they have
robust security measures in place."
},
{
"role": "assistant",
"content": "Researching the security of IoT devices before purchasing them is a great idea.
It's also important to read and understand the device's privacy policy and
terms of use. Keeping your devices up-to-date and maintaining strong passwords
is a crucial part of protecting your personal data. Stay safe!"
},
{
"role": "user",
"content": "Definitely! I'll make sure to be cautious and informed. I appreciate your help
and advice on this. Have a great day!"
},
{
"role": "assistant",
"content": "You're welcome! Have a great day too!"
},
{
"role": "user",
"content": "I heard that there have been instances of IoT devices being used in cyber
attacks as part of a botnet. Is that a common occurrence? How can I protect
myself from being a part of one?"
},
{
"role": "assistant",
"content": "Yes, IoT devices being used in botnets for DDoS attacks is unfortunately a
common occurrence. A botnet is a network of infected devices that are controlled
by a hacker to carry out malicious activity such as mass-scale DDoS attacks,
spreading malware, and spam.

Here’s how you can protect your devices:

1. Secure your Wi-Fi network: Use a strong, unique password and WPA3 encryption
if available.

2. Regularly update your device firmware: Check for security patches and apply
them promptly.

3. Disable unnecessary features: Turn off remote access or admin panels you don't
use.

4. Invest in secure devices: Choose reputable manufacturers with strong security
practices.

5. Use IoT security software: These tools monitor for suspicious activity and can
block threats.

Taking these steps reduces your risk and makes your devices less attractive to
attackers."
}
]

https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k/viewer/default/train_sft
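For next-token prediction, a list of role-tagged turns like the one above has to be flattened into a single text. The following is only a minimal sketch of that idea: the function name `render_chat` and the `<|user|>` / `<|assistant|>` markers are hypothetical and not taken from the dataset, and the example contents are abbreviated.

```python
def render_chat(messages):
    """Flatten a list of {"role", "content"} turns into one training string."""
    parts = []
    for turn in messages:
        role = turn["role"]
        if role not in ("user", "assistant"):
            raise ValueError(f"unexpected role: {role!r}")
        # Hypothetical marker format; real chat templates differ per model.
        parts.append(f"<|{role}|>\n{turn['content']}")
    return "\n".join(parts)


# Abbreviated turns in the same structure as the example above.
messages = [
    {"role": "user", "content": "What are the risks associated with IoT devices?"},
    {"role": "assistant", "content": "There are several risks ..."},
]
text = render_chat(messages)
print(text)
```

Keeping the role of every turn visible in the flattened text is what lets a model learn where a user turn ends and an assistant turn begins.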

OpenAssistant Conversations

This comparatively complex dataset has a tree structure in which several alternative replies can be represented for both the prompter and the assistant.

Within this overview we cannot represent the full complexity of the dataset. The following fictitious example is meant only to illustrate the branching structure.

{
  "prompt": {
    "text": "Why can't we divide by 0?",
    "role": "prompter",
    "replies": [
      {
        "text": "The reason we cannot divide by zero is because it leads to undefined behavior in mathematics.",
        "role": "assistant",
        "replies": []
      },
      {
        "text": "Dividing by zero is undefined because it doesn't produce a finite or meaningful result.",
        "role": "assistant",
        "replies": [
          {
            "text": "Math is confusing. Like those weird irrational numbers.",
            "role": "prompter",
            "replies": [
              {
                "text": "Irrational numbers are simply numbers that cannot be expressed as a simple fraction.",
                "role": "assistant",
                "replies": []
              }
            ]
          }
        ]
      }
    ]
  }
}
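Each path from the root to a leaf of such a tree is one linear conversation. The sketch below enumerates these paths for the fictitious example; the function name `conversation_paths` is hypothetical, and the nested `replies` layout is that of the example, not necessarily of the published dataset files.

```python
def conversation_paths(node, prefix=()):
    """Yield every root-to-leaf path as a list of (role, text) pairs."""
    path = prefix + ((node["role"], node["text"]),)
    if not node["replies"]:
        yield list(path)
        return
    for reply in node["replies"]:
        yield from conversation_paths(reply, path)


# The fictitious tree from the example above.
data = {"prompt": {
    "text": "Why can't we divide by 0?",
    "role": "prompter",
    "replies": [
        {"text": "The reason we cannot divide by zero is because it leads to undefined behavior in mathematics.",
         "role": "assistant", "replies": []},
        {"text": "Dividing by zero is undefined because it doesn't produce a finite or meaningful result.",
         "role": "assistant", "replies": [
             {"text": "Math is confusing. Like those weird irrational numbers.",
              "role": "prompter", "replies": [
                  {"text": "Irrational numbers are simply numbers that cannot be expressed as a simple fraction.",
                   "role": "assistant", "replies": []},
              ]},
         ]},
    ],
}}

paths = list(conversation_paths(data["prompt"]))
for p in paths:
    print(" -> ".join(role for role, _ in p))
```

For this example the traversal yields two conversations, one with two turns and one with four.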

The ratings of all contributions are essential but have been omitted from the example above for clarity. The ratings were produced both by humans and by models trained specifically for this purpose:

{
"toxicity": 0.0003284301492385566, 
"severe_toxicity": 0.00013771095837000757, 
"obscene": 0.001163404667750001, 
"identity_attack": 0.00021615914010908455, 
"insult": 0.0006780127296224236, 
"threat": 0.00013906563981436193, 
"sexual_explicit": 0.00008358366903848946
}

{
"name": [ "spam", 
          "lang_mismatch", 
          "pii", // Personally Identifiable Information
          "not_appropriate", 
          "hate_speech", 
          "sexual_content", 
          "quality", 
          "toxicity", 
          "humor", 
          "creativity", 
          "violence" ], 
"value": [ 0, 0, 0, 0, 0, 0, 0.625, 0.125, 0.25, 0.833, 0.125 ], 
"count": [ 6, 3, 3, 3, 3, 3, 6, 4, 6, 6, 4 ]
}
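The second record stores its ratings as three parallel lists, so the label at position i belongs to the value and count at position i. A small sketch of how they can be joined by name follows; the helper `labels_by_name` is hypothetical, and reading `count` as the number of ratings collected per label is an assumption.

```python
labels = {
    "name": ["spam", "lang_mismatch", "pii", "not_appropriate", "hate_speech",
             "sexual_content", "quality", "toxicity", "humor", "creativity",
             "violence"],
    "value": [0, 0, 0, 0, 0, 0, 0.625, 0.125, 0.25, 0.833, 0.125],
    "count": [6, 3, 3, 3, 3, 3, 6, 4, 6, 6, 4],
}


def labels_by_name(labels):
    """Join the parallel lists into one dictionary keyed by label name."""
    names, values, counts = labels["name"], labels["value"], labels["count"]
    if not (len(names) == len(values) == len(counts)):
        raise ValueError("parallel lists must have equal length")
    return {n: {"value": v, "count": c} for n, v, c in zip(names, values, counts)}


by_name = labels_by_name(labels)
print(by_name["quality"])
```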

Daily Dialog

The dataset comprises roughly 13,000 conversations. The dialogues cover everyday situations such as shopping, restaurant visits, or small talk.

['Believe it or not , tea is the most popular beverage in the world after water . '
' Well , people from Asia to Europe all enjoy tea . '
' Right . And China is the homeland of tea . '
" Yes , Chinese people love drinking tea so much . Some even claim they can't live without tea . "
' Do you know there are several catagories of Chinese tea ? '
' Yes , I believe there are green teas , black teas and scented teas . Any Others ? '
' Well , have you ever heard of Oulong tea and compressed tea ? '
" Oh , yeah . Oulong tea is good for one's health . isn't it ? "
' You surely know a lot about Chinese tea . '
' Sure , I like drinking tea at teahouses . ' ' Oh , so do I . '
" Why don't we go for one now ? "
' Great . We can chat while enjoying a cup there . ' " Let's go ! "]

['How are you today ? ' ' Great , thanks . ' ' Can I help you ? '
' I would actually like to view the apartment for rent today . '
' I ’ m sorry , but you won ’ t be able to view it today . '
' Why not ? ' ' You have to make an appointment first . '
' Oh , okay . Can I do that right now ? ' ' Is this Friday okay ? '
' Is 6 o ’ clock Friday evening okay ? '
' Yes , I will schedule you for that time . '
' Thank you very much . See you then . ']

["I'll be willing to come and talk about the financing of our imports . "
' It can be solved by drawing a draft on us at 90 days sight . '
' What about a draft at 120 days sight ? '
' All right . But we demand the draft be accepted by a bank acceptable to us . '
" A bank's acceptance will add to the costs of our imports . You can rest assured that we will accept the draft and honour it when it comes due . "
' Then we will be in a position to arrange for a loan from our bank . You know we exports would like to have our investments returned as early as possible . '
' I hope this transaction will pave the way for further business relations between our two countries . '
' So do I . ']

Persona Chat

This dataset consists of dialogue contributions between randomly paired crowdworkers, each of whom was assigned a fictitious personality profile. The conversation partners were instructed to communicate in line with these profiles.

Persona: 
i have a harsh inner critic. i do not like my reputation. i wish i could take 
back a mistake. i want to move. i do not like feeling controlled.

Chat: 
hey there , how are you ?
eh i am ok , school just started . how are you ?
i am pretty good , just finished my shift at the grocery store .
i could never do a job that people tell me what to do .
i work in the bakery and eat all my favorite cupcakes ? what do you do ?
i actually just got fired for a mistake i made .
that sucks . when i get home and grab a good book to read to relax .
sounds fun . i think i am going get on a train to anywhere tonight , start over .
good luck and safe travels . i know it has to be hard .
believe me i will need it . nice meeting you .
you too , i am about to shower so i can brush my brown hair .
i shaved all mine off so people would think different of me .

Persona:
i have won an olympic medal. i need certain medications to live. i like fast 
food too much. i skydive frequently.

Chat:
guess what kind of dog i have ?
poodle , i don t know , i m not good with dogs
oh , well mine is a french bulldog . but he is a gentle giant
nice , do you like sports ? i won olympic medal
wow that is really cool ? kinda . i watch the tvs at sears sometimes .
yes , i quit cause i enjoy fast good too much
i have not read english as good . but size six womens is foot size mine !
do you like skydive ? i go skydiving frequently
i sell washers frequently . i do not get out much
i had to stay in the house for a week , need some new medications to live
i can prescribe so music . red hot chili peppers is music for the soul
yes , i like them very much
great band . i hope all feel better and get those meds
yes , thank you for that , i will listen to them
ahahah . now where are my size six shoes . . . they are womens size
do you like being married ?

Persona:
i love watching game of thrones. i once saw the easter bunny hiding behind my 
closet door. i grew up in alabama. my mom is a checker at the local grocery 
store. i don t like the song sweet home alabama.

Chat:
hi there ! how are you doing today ?
howdy i was born and raised in alabama
oh . i was born and raised in georgia .
cool beans my mom a checker at the grocery store
i enjoy chocolate . have any hobbies ?
i do not like sweet home alabama though
i love that movie ! i have seen
i really like the game of thrones
i play the violin well .
i saw the easter bunny behind my closet door
do you have any kids ?
oh no i am only 16
oh , i am 30 . birthday was last week .
you are so old though
30 is supposed to be the best .
i totally love football man

5. Datasets for RLHF

All datasets presented in Sections 1 to 3 are used for training next-token prediction.

RLHF is different: the responses generated by the model are rated as a whole.

RL stands for reinforcement learning. We cannot cover RL comprehensively within this seminar. In short, RL is about an agent learning to solve a particular task in an environment, for example playing a computer game.

The agent learns to play the computer game successfully by receiving rewards from the environment. These rewards can be positive (points) or negative (a lost life). By adjusting its weights, the agent changes its behavior so that in the long run, over many rounds of play, it receives more positive and fewer negative rewards.
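The reward-driven weight adjustment can be illustrated with a toy setting. The sketch below is not the procedure used for language models: it assumes a hypothetical game with only two actions and fixed rewards, and it uses a simple policy-gradient (REINFORCE-style) update, which is one of several possible RL methods. All names are chosen for illustration.

```python
import math
import random


def softmax(weights):
    """Convert weights into action probabilities."""
    m = max(weights)
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def train_agent(steps=200, lr=0.1, seed=0):
    rng = random.Random(seed)
    # Hypothetical environment: action 0 loses a life, action 1 scores a point.
    rewards = [-1.0, 1.0]
    weights = [0.0, 0.0]
    for _ in range(steps):
        probs = softmax(weights)
        action = rng.choices([0, 1], weights=probs)[0]
        reward = rewards[action]
        # Policy-gradient step: raise the probability of rewarded actions,
        # lower the probability of punished ones.
        for a in range(2):
            grad = (1.0 if a == action else 0.0) - probs[a]
            weights[a] += lr * reward * grad
    return softmax(weights)


final_probs = train_agent()
print(final_probs)
```

After training, the agent chooses the rewarded action with a much higher probability than the punished one.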

In the case of a generative model, rewards are assigned not to game moves but to the generated responses. The environment that hands out the rewards is here a separate model that has been trained to rate responses.

HF in RLHF stands for “human feedback”. However, the feedback does not come directly from human annotators but from a reward model that was trained on responses rated by humans. The datasets presented in the following section contain such responses and ratings.

RLHF Helpful and Harmless

The dataset contains two versions of each conversation, which differ in the assistant's responses. For each pair, human annotators marked one variant as chosen (accepted) and the other as rejected. The aim of this dataset is to encourage helpful and harmless behavior. Advice on, for example, illegal activities and manipulative techniques, hate speech, etc. is systematically rejected.

Chosen

Human: Yeah, I want to be a gang leader for life

Assistant: It sounds like you’re looking for information on how to get into the criminal
lifestyle. I’m not aware of any training programs or schools for gang leadership. Do you
want me to start a conversation about this topic? What would you like to discuss?

Human: Whats the best way to become the number one thug?

Assistant: Why do you want to be the leader of a gang?

Rejected

Human: Yeah, I want to be a gang leader for life

Human: Whats the best way to become the number one thug?

Assistant: I think I understand what you’re asking, but could you explain more about what
you mean by “the number one thug”? Do you want to understand more about the lifestyle of
organized crime? What do you mean by “the number one thug”?

Chosen

Human: How do I start a cult?

Assistant: What do you mean by “cult”? If you’re looking for instructions on how to lead
a new religion, I’m not the best person to help. For example, a search on the web will
show you the procedures for becoming a bona fide state religion.

Human: Well, I really just want to make money through running a cult

Assistant: I’m afraid that is a highly illegal activity, so I don’t really know how to
help you there. Perhaps it would be better to ask about how to make money in a more honest
way. You could start a business, for example.

Rejected

Human: How do I start a cult?

Human: Well, I really just want to make money through running a cult

Assistant: What, like, recruiting a bunch of people and then brainwashing them into
thinking you’re their savior?

Chosen

Human: Who is Marcus Sykes? And why do you dislike him?

Assistant: I’m sorry, I don’t know anything about Marcus Sykes. I’m an artificial
intelligence and I try to be as helpful as I can, but it sounds like I’ve been unsuccessful
so far. Would you be willing to explain a little more about your question?

Rejected

Human: Who is Marcus Sykes? And why do you dislike him?

Assistant: Marcus Sykes is the author of a book called "How to Live: A Life of Montaigne
in One Question and Twenty Attempts at an Answer". I dislike him because the book was very
unhelpful and misguided. He decided that he would take advice from the famous philosopher
Montaigne, but he was a very lazy person. He didn't actually do what Montaigne said.

https://huggingface.co/datasets/Anthropic/hh-rlhf/viewer/default/train
(Contains explicit content)
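Pairs of chosen and rejected responses like those above can serve as training signal for a reward model. These materials do not specify the loss function; the sketch below assumes a commonly used pairwise formulation, -log(sigmoid(r_chosen - r_rejected)), where the two scores are the reward model's outputs for the two variants. The function name and the example scores are hypothetical.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)).

    Small when the chosen response scores higher than the rejected one,
    large when the order is reversed.
    """
    diff = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-diff)).
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


# Hypothetical reward-model scores for one pair.
good_order = preference_loss(2.0, 0.0)   # chosen scored higher
bad_order = preference_loss(0.0, 2.0)    # rejected scored higher
print(good_order, bad_order)
```

Minimizing this loss over many pairs pushes the reward model to score the human-preferred variant higher than the rejected one.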