pre-training
download and preprocess the internet
tokenization
group 8 bits —> one byte (e.g. 01111100 —> 124) (compress) —> 256 unique symbols, like emojis
group frequent pairs of bytes —> mint new symbols (byte-pair encoding), as sketched below
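a minimal sketch of the byte-pair merging idea (toy corpus, not any production tokenizer):

```python
from collections import Counter

def most_frequent_pair(ids):
    """Count adjacent pairs of symbols and return the most common one."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` with the new symbol `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# start from raw bytes (0..255), mint new symbols from 256 upward
ids = list("hello hello hello".encode("utf-8"))
for step in range(3):               # a real tokenizer does tens of thousands of merges
    pair = most_frequent_pair(ids)
    ids = merge(ids, pair, 256 + step)
print(ids)                          # shorter sequence, larger vocabulary
```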
network training:
input: a sequence of tokens
output: a probability distribution over the next token (computed in parallel at every position, across the whole dataset)
nn internals:
the weights get adjusted so the correct next token becomes more likely (see the training-step sketch below)
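a hedged sketch of a single training step: input tokens in, next-token probabilities out, weights nudged by the gradient. the tiny model here is just a stand-in; a real LLM is a transformer with billions of parameters:

```python
import torch
import torch.nn as nn

vocab_size, dim = 1000, 64             # toy sizes, real models are far larger

model = nn.Sequential(                 # stand-in for a transformer
    nn.Embedding(vocab_size, dim),
    nn.Linear(dim, vocab_size),
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

tokens = torch.randint(0, vocab_size, (8, 32))   # a batch of token sequences
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict the *next* token at every position

logits = model(inputs)                           # (batch, seq, vocab) scores
loss = nn.functional.cross_entropy(              # lower loss = better next-token predictions
    logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                  # gradients w.r.t. every weight
opt.step()                                       # the weights change a little
opt.zero_grad()
```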
inference:
predict one token at a time, feeding each prediction back into the context (see the sampling loop below)
lower loss —> better network
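inference as a loop, continuing the toy model above: sample one token, append it, feed it back in:

```python
prompt = torch.randint(0, vocab_size, (1, 5))    # hypothetical prompt tokens
seq = prompt
for _ in range(20):
    logits = model(seq)[:, -1, :]                # scores for the next token only
    probs = torch.softmax(logits, dim=-1)
    next_token = torch.multinomial(probs, num_samples=1)
    seq = torch.cat([seq, next_token], dim=1)    # append and feed back in
```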
release of a model:
github: the source code of the model (the forward pass of the nn)
parameters: a list of billions of numbers (the weights)
examples: gpt2, llama3
parameters —> like a (lossy) zip file of the internet
in-context learning (with a few-shot prompt, see the example below)
—> base model
internet document simulator
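what a few-shot prompt to a base model might look like; the base model just continues the document, so the examples steer it (hypothetical translation task):

```python
few_shot_prompt = """English: cheese   French: fromage
English: bread   French: pain
English: water   French: eau
English: apple   French:"""
# a base model asked to continue this text will most likely emit " pomme"
# with no weight updates at all: the "learning" happens purely in the context
```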
post training: supervised finetuning
conversations are turned into tokens
(using a protocol of special tokens to mark the turns)
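a hedged sketch of how conversations get serialized with special tokens before tokenization; the token names here are illustrative, not any specific model's protocol:

```python
def render_conversation(turns):
    """Flatten a chat into one string with special delimiter tokens,
    which the tokenizer then maps to token ids like any other text."""
    parts = []
    for role, text in turns:
        parts.append(f"<|start|>{role}\n{text}<|end|>\n")
    return "".join(parts)

print(render_conversation([
    ("user", "What is 2 + 2?"),
    ("assistant", "2 + 2 = 4."),
]))
```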
how the dataset is built:
by llms and/or human labelers (e.g. Scale AI)
hallucination: —> teach the llm to say "I don't know" (idk)
1. take a paragraph —> construct questions from it; probe the model (e.g. 3 times) and compare its answers with the correct one
—> if it keeps getting the answer wrong, add a new QA pair whose answer is "I don't know" to the training set —> the llm learns to say idk (see the sketch after this list)
2. let the model search
tool use —> the model emits <search_start> xxx <search_end>, where xxx is the search query; the results come back into the context window
—> add such examples to the training data
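a sketch of recipe 1: probe the model on a question with a known answer and, if it keeps missing, emit an "I don't know" example for the finetuning set (model_sample and the output format are hypothetical):

```python
def probe_for_idk(model_sample, question, correct_answer, tries=3):
    """Ask the model several times; if it never matches the known answer,
    emit a training example whose target is an honest 'I don't know'."""
    answers = [model_sample(question) for _ in range(tries)]
    if not any(correct_answer.lower() in a.lower() for a in answers):
        return {"prompt": question, "response": "I don't know."}
    return None   # the model already knows this fact, no new example needed
```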
knowledge in the parameters == vague recollection
knowledge in the tokens of the context window == working memory
model of self
if the finetuning data has no examples of questions like "who are you", the model defaults to answering "ChatGPT" —> because there is so much ChatGPT-related text on the internet
models need tokens to think
let the model work toward the answer step by step instead of giving out the final answer directly.

(of the two example answers compared, the right one is better: it spreads the computation over many tokens before stating the result)
a single forward pass can only do a limited amount of computation
—> letting the model use code (a tool) might be a better way to answer math questions (see the sketch below)
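a sketch of the "use code instead of mental arithmetic" idea: the model emits a code block between tool tokens, a harness runs it, and the result goes back into the context (the token names and harness are made up here):

```python
import re
import subprocess
import sys

def run_tool_calls(model_output):
    """Execute any <code_start>...<code_end> block the model emitted and
    splice the stdout back into the context as the tool result."""
    match = re.search(r"<code_start>(.*?)<code_end>", model_output, re.S)
    if match is None:
        return model_output
    result = subprocess.run([sys.executable, "-c", match.group(1)],
                            capture_output=True, text=True, timeout=5)
    return model_output + f"<result>{result.stdout.strip()}</result>"

print(run_tool_calls("The answer: <code_start>print(123456 * 7890)<code_end>"))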
models are not good at spelling: they see tokens (text chunks), not individual letters
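to see what the model actually "sees", inspect the token ids for a word (using the tiktoken library here; other tokenizers split differently):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")
ids = enc.encode("strawberry")
print(ids)                               # a handful of ids, not 10 letters
print([enc.decode([i]) for i in ids])    # the chunks the model sees
# counting the letter 'r' requires reasoning across chunk boundaries,
# which is why spelling questions are surprisingly hard for LLMs
```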
after that we have the SFT (supervised finetuning) model
post training: reinforcement learning
pretraining: background knowledge
supervised finetuning: worked problems (problem + solution, for imitation)
reinforcement learning: practice problems (try many attempts —> reinforce the ones that reach the correct answer; see the sketch below)
deepseek r1 —> an RL-trained model
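a hedged sketch of RL on verifiable problems: sample many attempts, keep only the ones whose final answer checks out, train on those (sample / train_on are hypothetical placeholders):

```python
def rl_on_verifiable_problems(sample, train_on, problems, attempts=16):
    """For each problem, generate many candidate solutions and reinforce
    only the ones whose final answer checks out."""
    for prompt, correct_answer in problems:
        candidates = [sample(prompt) for _ in range(attempts)]
        good = [c for c in candidates if c.final_answer == correct_answer]
        if good:
            train_on(prompt, good)   # nudge the weights toward successful attempts
```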
rlhf
human feedback —> useful for unverifiable tasks, like writing jokes
—> humans rank the model's outputs —> a better training set
problem: it costs human time
method: collect human feedback —> train a nn simulator of human preferences (a reward model) —> use the reward model to give feedback at scale
caveat: don't run RL against the reward model for too long —> it gets gamed (the model finds outputs that score high but aren't actually good)
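a sketch of the reward-model idea: learn a scalar score from human preference pairs (a Bradley-Terry style pairwise loss), then use it in place of a live human; over-optimizing against it is what the caveat above warns about:

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a response embedding to a single scalar 'how much would a human like this'."""
    def __init__(self, dim=64):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):
        return self.score(x).squeeze(-1)

rm = RewardModel()
opt = torch.optim.AdamW(rm.parameters(), lr=1e-3)

# hypothetical embeddings of a preferred and a rejected joke for the same prompt
chosen, rejected = torch.randn(8, 64), torch.randn(8, 64)

# pairwise loss: the chosen response should score higher than the rejected one
loss = -torch.nn.functional.logsigmoid(rm(chosen) - rm(rejected)).mean()
loss.backward()
opt.step()
```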