February 7, 2026

The Creators Behind the First LLM: A Technical History


Ask "who created the first LLM?" and you'll likely get a dozen different answers. Some will shout "OpenAI with GPT!" Others might argue for Google's BERT. A few academics will mumble about a 2013 paper you've never heard of.

The truth is, pinning down a single "creator" is like trying to name the inventor of the automobile. Was it the person who built the first steam carriage, or the one who perfected the internal combustion engine and made it practical? The story of the first LLM is a relay race of incremental breakthroughs, not a solo sprint.

I used to think it was a simple answer. Then I dug into the research. What I found was a messy, fascinating, and collaborative history that most summaries gloss over.

What Defines an LLM? (It's Trickier Than You Think)

Before we crown a champion, we need to know what the game is. Calling any old language model an "LLM" is like calling a scooter a motorcycle. They're related, but the scale and capability are worlds apart.

Here's the checklist most researchers use in hindsight:

  • Massive Scale: Trained on a dataset far larger than Wikipedia alone—think billions of words from books, articles, and the open web.
  • Transformer Architecture: Built on the self-attention mechanism from the 2017 "Attention Is All You Need" paper. This is non-negotiable for modern LLMs.
  • Generative Pre-training: First trained on a general, unsupervised task (like predicting the next word), then fine-tuned for specific jobs. This two-step process is key.
  • Emergent Abilities: Shows behaviors (like basic reasoning or coherent long-form writing) that weren't explicitly programmed but emerged from scale.

If a model misses one of these points, especially the Transformer core, calling it the "first LLM" feels like a stretch. It might be a vital ancestor, but not the direct progenitor.

Common Misconception: People often point to very early neural networks for language as the "first LLM." While foundational, models from the 80s or 90s lacked the scale, architecture, and pre-training paradigm that define today's LLMs. They're the Wright Flyer to a modern jetliner.

How Did Early Research Pave the Way? (The Unsung Heroes)

The runway for the LLM takeoff was built over the 2000s and 2010s. A few key papers laid the concrete, often getting overshadowed by the flashier models that came later.

Yoshua Bengio's team, for instance, published "A Neural Probabilistic Language Model" back in 2003, showing that a neural network could learn distributed word representations while predicting text. A decade later, Tomas Mikolov and colleagues at Google released Word2Vec (2013), which made training word vectors on huge corpora fast and practical. Vital, but still not an LLM.

Then came the Seq2Seq (Sequence-to-Sequence) architecture from Google Brain in 2014. This was huge for translation. It used Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks. I remember trying to build on this framework; the results were impressive for the time but painfully slow to train and prone to forgetting information from earlier in a long sentence.

The big limitation? RNNs process tokens one after another, and each step has to wait for the previous hidden state. There's no parallelism within a sequence, so training on huge datasets was a computational nightmare.
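To see the problem in code, here's a tiny NumPy sketch of a vanilla RNN loop (toy dimensions, random weights, nothing like a production LSTM). The point is the data dependency: each hidden state needs the one before it, so the time dimension has to be walked serially.

```python
import numpy as np

# Toy vanilla RNN: tiny dimensions, random weights -- purely illustrative.
hidden_size, embed_size, seq_len = 8, 8, 5
W_h = np.random.randn(hidden_size, hidden_size) * 0.1  # hidden-to-hidden weights
W_x = np.random.randn(hidden_size, embed_size) * 0.1   # input-to-hidden weights
tokens = np.random.randn(seq_len, embed_size)           # stand-in word embeddings

h = np.zeros(hidden_size)
for t in range(seq_len):
    # Step t can't start until step t-1 has produced h -- this loop is
    # inherently serial, which is the scaling bottleneck in practice.
    h = np.tanh(W_h @ h + W_x @ tokens[t])
```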

| Model / Concept | Year | Key Researchers / Team | Why It Mattered | The Limitation |
| --- | --- | --- | --- | --- |
| Word2Vec | 2013 | Mikolov et al. (Google) | Showcased powerful word embeddings from large-scale training. | Not a generative model, just static representations. |
| Neural Machine Translation (Seq2Seq) | 2014 | Sutskever et al. (Google Brain) | Proved neural nets could excel at complex language tasks end-to-end. | RNN-based; sequential processing was a bottleneck for scale. |
| ELMo (Embeddings from Language Models) | 2018 | Peters et al. (AI2 / Allen Institute) | Introduced context-aware word embeddings via bi-directional LSTMs. | Still not a Transformer; generated embeddings, not full text. |

ELMo (February 2018) is a fascinating case. Its paper, "Deep contextualized word representations," used bi-directional LSTMs to produce embeddings that change with the surrounding sentence. It was a major step towards context awareness. But it wasn't generative in the way we think of GPT, and it wasn't a Transformer. It stands as the last major LSTM-based milestone before the Transformer lineage took over.

The Transformer: The Real Game-Changer (2017)

Everything changed in June 2017. A team of Google researchers, including Ashish Vaswani, Noam Shazeer, and Llion Jones, published "Attention Is All You Need".

This paper introduced the Transformer architecture.

I'll be honest, when I first read it, the significance didn't fully sink in. The math was dense. But the core idea was revolutionary: ditch recurrence entirely. Use a mechanism called self-attention to let every word in a sentence relate to every other word, all at once.

This meant you could finally parallelize training. Instead of feeding words in one by one, you could process whole chunks of text simultaneously. Training time on large datasets plummeted. Scaling up became not just possible, but practical.
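If you want to see how small the core idea really is, here's a minimal NumPy sketch of single-head scaled dot-product self-attention. It's a toy (random projection matrices, no masking, no multi-head logic), not the paper's reference code, but the parallelism is visible: the whole sequence interacts in a couple of matrix multiplies.

```python
import numpy as np

def self_attention(X):
    """Single-head scaled dot-product self-attention over X of shape
    (seq_len, d_model). Toy version: random projections, no masking."""
    d_model = X.shape[1]
    W_q = np.random.randn(d_model, d_model) * 0.1
    W_k = np.random.randn(d_model, d_model) * 0.1
    W_v = np.random.randn(d_model, d_model) * 0.1
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(d_model)              # every token scores every token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted mix of all positions

X = np.random.randn(6, 16)      # 6 tokens, 16-dim embeddings
print(self_attention(X).shape)  # (6, 16)
```

Everything above is matrix multiplication over the whole sequence at once, which is exactly what GPUs are built for; the serial loop from the RNN sketch is gone.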

The genie was out of the bottle.

The authors' primary focus was on machine translation, and they built an encoder-decoder model. But the community immediately saw the potential. The architecture was a general-purpose language engine. The race was on to scale it up with massive data.

“We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.” – From the abstract of the Transformer paper; it turned out to be one of the biggest understatements in AI history.

GPT-1 vs. BERT: The First True Contenders (2018)

Within a year of the Transformer paper, two models emerged that tick all the boxes for a modern LLM. They approached the Transformer from opposite ends.

1. GPT-1: The Generative Pathfinder (June 2018)

Creator: OpenAI (Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever et al.)
Paper: "Improving Language Understanding by Generative Pre-Training"

GPT-1 took the decoder stack of the Transformer. Its task was brutally simple: predict the next word. It was trained on a huge, diverse corpus called the BookCorpus (7,000 unpublished books). This unsupervised pre-training step was its masterstroke.

After this, it could be fine-tuned with a tiny amount of labeled data for tasks like classification, similarity, and Q&A. The results were state-of-the-art for nearly everything they tried.

Why many consider it the "first": It was the first to successfully demonstrate the full "generative pre-training + fine-tuning" paradigm at a significant scale using the Transformer. It showed that a single model, pre-trained generatively, could be adapted to a wide range of tasks with minimal tweaking. This was the blueprint.
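The objective itself is almost embarrassingly simple. Here's a rough PyTorch sketch of the next-token loss; the `model` below is a throwaway stand-in (an embedding plus a linear layer), not OpenAI's Transformer decoder, but the shifted-targets cross-entropy is the same idea.

```python
import torch
import torch.nn as nn

vocab_size = 100
# Hypothetical stand-in for a Transformer decoder: anything mapping token ids
# (batch, seq) -> logits (batch, seq, vocab) fits here. This toy ignores
# context entirely; a real decoder would attend over all previous tokens.
model = nn.Sequential(nn.Embedding(vocab_size, 32), nn.Linear(32, vocab_size))

tokens = torch.randint(0, vocab_size, (4, 16))  # a batch of token-id sequences
logits = model(tokens)                          # (4, 16, vocab_size)

# Next-token objective: the prediction at position t is scored against
# the actual token at position t+1 (shift the targets by one).
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
loss.backward()  # pre-training is essentially this, looped over billions of tokens
```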

2. BERT: The Bidirectional Powerhouse (October 2018)

Creator: Google AI (Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova et al.)
Paper: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"

BERT went the other way. It used the encoder stack of the Transformer. Its genius was in its pre-training tasks: Masked Language Modeling (randomly hide words and predict them) and Next Sentence Prediction.

This gave it a deep, bidirectional understanding of context. For tasks like extracting meaning, search, and sentiment, it was (and in some forms, still is) phenomenal.
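You can still play with the original BERT today. Assuming you have the Hugging Face `transformers` library installed (and it can download the `bert-base-uncased` checkpoint), masked language modeling looks like this:

```python
from transformers import pipeline

# Masked Language Modeling in action: BERT fills in the blank using
# context from both the left and the right of the [MASK] token.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

for pred in unmasker("The Transformer replaced recurrence with [MASK]."):
    print(f"{pred['token_str']:>15}  score={pred['score']:.3f}")
```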

The Verdict? It's a photo finish. GPT-1 was first chronologically (June vs. October 2018). It also embodies the "generative" spirit we most associate with LLMs like ChatGPT. BERT, while monumental, isn't primarily a text generator; it's a deep understander.

If I have to pick one as the progenitor of the ChatGPT lineage, it's GPT-1. It lit the fuse. But BERT equally defined the modern era of NLP. The field truly exploded because of both.

What This History Means for Builders and Entrepreneurs Today

So why does this ancient history (in AI terms) matter? Because the lessons are still playing out.

If you're a startup founder or developer looking to use LLMs, here's the practical takeaway:

You are almost certainly not building the next foundational LLM. The compute costs are astronomical, and the expertise is concentrated. Google, OpenAI, Meta, and a few others own that layer.

Your power lies in the application layer.

  • Fine-tuning is your superpower. Take an open-source model (like Meta's Llama or Mistral's models) and specialize it on your proprietary data—customer support tickets, legal documents, your unique codebase.
  • RAG (Retrieval-Augmented Generation) is your best friend. Don't force the LLM to know everything. Connect it to a searchable knowledge base of your own accurate, up-to-date information. This goes a long way toward taming hallucinations in domain-specific apps (a minimal sketch follows this list).
  • Understand the trade-offs. Do you need the creative fluency of a GPT-style decoder model, or the precise understanding of a BERT-style encoder? Most modern APIs abstract this, but knowing the heritage helps you debug weird outputs.
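To make the RAG pattern concrete, here's a bare-bones sketch. The `embed` function is a hypothetical placeholder for whatever embedding model or API you'd actually call, and the final LLM call is left as a prompt string; the part that carries over is the retrieval step: rank your documents by similarity to the question, then ground the prompt in the winners.

```python
import numpy as np

def embed(text):
    """Hypothetical placeholder -- swap in a real embedding model or API.
    Random vectors (repeatable within a run) keep the sketch runnable,
    but the ranking they produce is meaningless; the plumbing is the point."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(question, docs, k=2):
    """Rank docs by cosine similarity to the question; return the top k."""
    q = embed(question)
    scores = [cosine(q, embed(doc)) for doc in docs]
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

docs = [
    "Refund policy: 30 days with a receipt.",
    "Shipping takes 3-5 business days.",
    "Support hours: 9am-5pm EST, Monday to Friday.",
]
question = "How long do I have to return an item?"
context = "\n".join(retrieve(question, docs))

# This grounded prompt is what you'd send to your LLM of choice.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```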

The story of the first LLM teaches us that progress is cumulative. Your contribution doesn't need to be a new architecture. It can be a brilliant application of the existing one to solve a real, painful problem.

Frequently Asked Questions

Was GPT-1 the first true large language model?
Most experts point to OpenAI's GPT-1 (2018) as the first model that clearly meets the modern definition of an LLM. While earlier work like Word2Vec (2013) and ELMo (early 2018) was foundational, GPT-1 was the first to combine the then-novel Transformer decoder architecture with generative pre-training on a large, diverse text corpus (the BookCorpus), demonstrating coherent text generation and strong transfer to downstream tasks at a previously unseen scale.
Can a startup build its own LLM from scratch today?
It's technically possible but strategically questionable for most startups. The compute cost for pre-training a competitive foundational LLM from scratch can run into the tens of millions of dollars. The smarter path is to fine-tune an existing open-source model (like Llama or Mistral) on your proprietary data, or to pair one with Retrieval-Augmented Generation (RAG) over your own knowledge base. Either route gets you a specialized, cost-effective AI product without the foundational training burden.
What was the single biggest technical breakthrough that made LLMs possible?
The Transformer architecture, introduced in the 2017 paper 'Attention Is All You Need' by Vaswani et al. from Google, was the game-changer. It replaced recurrent neural networks (RNNs) with a self-attention mechanism, allowing for massive parallelization during training. This meant you could finally train on much larger datasets in a reasonable time. Without the Transformer's efficiency, scaling models to hundreds of billions of parameters would have remained impractical.
Where can I find the original research papers for these early LLMs?
Most of the seminal papers are free to read online. The original Transformer paper, "Attention Is All You Need" (2017), and "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (Google AI, 2018) are both on arXiv, the free preprint server. The GPT-1 report, "Improving Language Understanding by Generative Pre-Training" (OpenAI, 2018), is hosted on OpenAI's website. I recommend reading the abstracts and introductions first to grasp the core innovation before diving into the technical details.