pre-training
download and preprocess the internet
tokenization
group 8 bits —> one byte (e.g. 01111100 —> 124) (compress) —> 256 unique symbols, like emojis
group frequent pairs of bytes —> mint new symbols (byte-pair encoding), as sketched below
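a minimal sketch of the byte-pair merging idea (toy corpus, not any production tokenizer):

```python
from collections import Counter

def most_frequent_pair(ids):
    """Count adjacent pairs of symbols and return the most common one."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` with the new symbol `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# start from raw bytes (0..255), mint new symbols from 256 upward
ids = list("hello hello hello".encode("utf-8"))
for step in range(3):               # a real tokenizer does tens of thousands of merges
    pair = most_frequent_pair(ids)
    ids = merge(ids, pair, 256 + step)
print(ids)                          # shorter sequence, larger vocabulary
```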
network training:
input: a sequence of tokens
output: a probability distribution over the next token (computed in parallel at every position, across the whole dataset)
nn internals:
the weights get adjusted so the correct next token becomes more likely (see the training-step sketch below)
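a hedged sketch of a single training step: input tokens in, next-token probabilities out, weights nudged by the gradient. the tiny model here is just a stand-in; a real LLM is a transformer with billions of parameters:

```python
import torch
import torch.nn as nn

vocab_size, dim = 1000, 64             # toy sizes, real models are far larger

model = nn.Sequential(                 # stand-in for a transformer
    nn.Embedding(vocab_size, dim),
    nn.Linear(dim, vocab_size),
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

tokens = torch.randint(0, vocab_size, (8, 32))   # a batch of token sequences
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict the *next* token at every position

logits = model(inputs)                           # (batch, seq, vocab) scores
loss = nn.functional.cross_entropy(              # lower loss = better next-token predictions
    logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                  # gradients w.r.t. every weight
opt.step()                                       # the weights change a little
opt.zero_grad()
```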
inference:
predict one token at a time, feeding each prediction back into the context (see the sampling loop below)
lower loss —> better network
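inference as a loop, continuing the toy model above: sample one token, append it, feed it back in:

```python
prompt = torch.randint(0, vocab_size, (1, 5))    # hypothetical prompt tokens
seq = prompt
for _ in range(20):
    logits = model(seq)[:, -1, :]                # scores for the next token only
    probs = torch.softmax(logits, dim=-1)
    next_token = torch.multinomial(probs, num_samples=1)
    seq = torch.cat([seq, next_token], dim=1)    # append and feed back in
```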
release of a model:
github: the source code of the model (the forward pass of the nn)
parameters: a list of billions of numbers (the weights)
examples: gpt2, llama3
parameters —> like a (lossy) zip file of the internet
in-context learning (with a few-shot prompt, see the example below)
—> base model
internet document simulator
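what a few-shot prompt to a base model might look like; the base model just continues the document, so the examples steer it (hypothetical translation task):

```python
few_shot_prompt = """English: cheese   French: fromage
English: bread   French: pain
English: water   French: eau
English: apple   French:"""
# a base model asked to continue this text will most likely emit " pomme"
# with no weight updates at all: the "learning" happens purely in the context
```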
post training: supervised finetuning
conversations are turned into tokens
(using a protocol of special tokens to mark the turns)
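a hedged sketch of how conversations get serialized with special tokens before tokenization; the token names here are illustrative, not any specific model's protocol:

```python
def render_conversation(turns):
    """Flatten a chat into one string with special delimiter tokens,
    which the tokenizer then maps to token ids like any other text."""
    parts = []
    for role, text in turns:
        parts.append(f"<|start|>{role}\n{text}<|end|>\n")
    return "".join(parts)

print(render_conversation([
    ("user", "What is 2 + 2?"),
    ("assistant", "2 + 2 = 4."),
]))
```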
how the dataset is built:
by llms and/or human labelers (e.g. Scale AI)
hallucination: —> teach the llm to say "I don't know" (idk)
1. take a paragraph —> construct questions from it; probe the model (e.g. 3 times) and compare its answers with the correct one
—> if it keeps getting the answer wrong, add a new QA pair whose answer is "I don't know" to the training set —> the llm learns to say idk (see the sketch after this list)
2. let the model search
tool use —> the model emits <search_start> xxx <search_end>, where xxx is the search query; the results come back into the context window
—> add such examples to the training data
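a sketch of recipe 1: probe the model on a question with a known answer and, if it keeps missing, emit an "I don't know" example for the finetuning set (model_sample and the output format are hypothetical):

```python
def probe_for_idk(model_sample, question, correct_answer, tries=3):
    """Ask the model several times; if it never matches the known answer,
    emit a training example whose target is an honest 'I don't know'."""
    answers = [model_sample(question) for _ in range(tries)]
    if not any(correct_answer.lower() in a.lower() for a in answers):
        return {"prompt": question, "response": "I don't know."}
    return None   # the model already knows this fact, no new example needed
```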
knowledge in the parameters == vague recollection
knowledge in the tokens of the context window == working memory
model of self
if the finetuning data has no examples of questions like "who are you", the model defaults to answering "ChatGPT" —> because there is so much ChatGPT-related text on the internet
models need tokens to think
let the model work toward the answer step by step instead of giving out the final answer directly.

(of the two example answers compared, the right one is better: it spreads the computation over many tokens before stating the result)
a single forward pass can only do a limited amount of computation
—> letting the model use code (a tool) might be a better way to answer math questions (see the sketch below)
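a sketch of the "use code instead of mental arithmetic" idea: the model emits a code block between tool tokens, a harness runs it, and the result goes back into the context (the token names and harness are made up here):

```python
import re
import subprocess
import sys

def run_tool_calls(model_output):
    """Execute any <code_start>...<code_end> block the model emitted and
    splice the stdout back into the context as the tool result."""
    match = re.search(r"<code_start>(.*?)<code_end>", model_output, re.S)
    if match is None:
        return model_output
    result = subprocess.run([sys.executable, "-c", match.group(1)],
                            capture_output=True, text=True, timeout=5)
    return model_output + f"<result>{result.stdout.strip()}</result>"

print(run_tool_calls("The answer: <code_start>print(123456 * 7890)<code_end>"))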
models are not good at spelling: they see tokens (text chunks), not individual letters
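to see what the model actually "sees", inspect the token ids for a word (using the tiktoken library here; other tokenizers split differently):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")
ids = enc.encode("strawberry")
print(ids)                               # a handful of ids, not 10 letters
print([enc.decode([i]) for i in ids])    # the chunks the model sees
# counting the letter 'r' requires reasoning across chunk boundaries,
# which is why spelling questions are surprisingly hard for LLMs
```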
after that we have the SFT (supervised finetuning) model
post training: reinforcement learning
pretraining: background knowledge
supervised finetuning: worked problems (problem + solution, for imitation)
reinforcement learning: practice problems (try many attempts —> reinforce the ones that reach the correct answer; see the sketch below)
deepseek r1 —> an RL-trained model
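a hedged sketch of RL on verifiable problems: sample many attempts, keep only the ones whose final answer checks out, train on those (sample / train_on are hypothetical placeholders):

```python
def rl_on_verifiable_problems(sample, train_on, problems, attempts=16):
    """For each problem, generate many candidate solutions and reinforce
    only the ones whose final answer checks out."""
    for prompt, correct_answer in problems:
        candidates = [sample(prompt) for _ in range(attempts)]
        good = [c for c in candidates if c.final_answer == correct_answer]
        if good:
            train_on(prompt, good)   # nudge the weights toward successful attempts
```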
rlhf
human feedback —> useful for unverifiable tasks, like writing jokes
—> humans rank the model's outputs —> a better training set
problem: it costs human time
method: collect human feedback —> train a nn simulator of human preferences (a reward model) —> use the reward model to give feedback at scale
caveat: don't run RL against the reward model for too long —> it gets gamed (the model finds outputs that score high but aren't actually good)
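a sketch of the reward-model idea: learn a scalar score from human preference pairs (a Bradley-Terry style pairwise loss), then use it in place of a live human; over-optimizing against it is what the caveat above warns about:

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a response embedding to a single scalar 'how much would a human like this'."""
    def __init__(self, dim=64):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):
        return self.score(x).squeeze(-1)

rm = RewardModel()
opt = torch.optim.AdamW(rm.parameters(), lr=1e-3)

# hypothetical embeddings of a preferred and a rejected joke for the same prompt
chosen, rejected = torch.randn(8, 64), torch.randn(8, 64)

# pairwise loss: the chosen response should score higher than the rejected one
loss = -torch.nn.functional.logsigmoid(rm(chosen) - rm(rejected)).mean()
loss.backward()
opt.step()
```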