Devin AI
dc81cb368c
fix: ensure non-zero learning rate during warmup at iteration 0
The warmup learning rate calculation has been modified to use (it + 1)/(warmup_iters + 1)
instead of it/warmup_iters. This ensures a non-zero learning rate at iteration 0
while maintaining the same linear warmup behavior.
Fixes #443
2024-12-09 07:35:08 +00:00
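A minimal sketch of what the schedule looks like with this change, modeled on nanoGPT's get_lr in train.py (the default values here are illustrative, not authoritative):

```python
import math

def get_lr(it, learning_rate=6e-4, warmup_iters=2000, lr_decay_iters=600000, min_lr=6e-5):
    # 1) linear warmup: (it + 1) / (warmup_iters + 1) is already non-zero at it == 0
    if it < warmup_iters:
        return learning_rate * (it + 1) / (warmup_iters + 1)
    # 2) past the decay horizon, hold at the floor
    if it > lr_decay_iters:
        return min_lr
    # 3) cosine decay from learning_rate down to min_lr in between
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (learning_rate - min_lr)
```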
Kevin Slagle
5156fef93c
fix np.memmap memory leak
np.memmap doesn't free the memory it accesses, so once the dataset has been fully accessed the entire thing ends up resident in RAM. The simplest workaround on stackoverflow is to just recreate the memmap for each batch. The extra overhead is negligible.
https://stackoverflow.com/questions/45132940/numpy-memmap-memory-usage-want-to-iterate-once/61472122#61472122
2024-01-25 11:41:01 -08:00
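A sketch of the workaround, modeled on nanoGPT's get_batch; the path, dtype and default sizes are illustrative. Rebuilding the memmap each call lets the OS reclaim pages touched by earlier batches instead of growing resident memory over the whole run:

```python
import numpy as np
import torch

def get_batch(split, data_dir='data/openwebtext', block_size=1024, batch_size=12):
    # recreate the memmap every call so previously touched pages can be released
    data = np.memmap(f'{data_dir}/{split}.bin', dtype=np.uint16, mode='r')
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
    return x, y
```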
o
1eaceae193
Fix AssertionError on macOS - need to check CUDA availability for bf16
2023-06-19 18:05:09 -04:00
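A hedged sketch of the check: choose bfloat16 only when a CUDA device exists and reports bf16 support, otherwise fall back, so the cuda query path is skipped entirely on a CUDA-less Mac:

```python
import torch

# short-circuit: torch.cuda.is_bf16_supported() is only queried when CUDA is present
dtype = 'bfloat16' if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else 'float16'
```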
Andrej Karpathy
7339b904ef
use WORLD_SIZE instead of device_count; this supports both the case where the number of gpus we train on is smaller than the gpus available, and also multinode training. may be a bugfix
2023-06-14 23:33:07 +00:00
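A sketch of reading the process-group size from torchrun's environment instead of torch.cuda.device_count(); variable names loosely mirror train.py:

```python
import os

ddp = int(os.environ.get('RANK', -1)) != -1  # is this a ddp run launched by torchrun?
# WORLD_SIZE covers every process across all nodes, and respects launches that
# deliberately use fewer GPUs than the machine has
ddp_world_size = int(os.environ['WORLD_SIZE']) if ddp else 1
```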
Alexander Pivovarov
eb33b8bf1c
Use bf16 only if supported
2023-05-17 03:26:48 +00:00
Andrej
a6a708c7f1
Merge branch 'master' into grad_accum
2023-04-17 20:11:00 -07:00
Andrej
2457471c9c
Merge pull request #236 from ymurenko/master
fix "cuda out of memory" when resuming training
2023-04-12 22:09:42 -07:00
Andrej Karpathy
553f949f46
fix minor bug where we have to scale the loss to account for gradient accumulation, which sums before backprop. note that this is not a major bug because AdamW is scale invariant. however, this did affect gradient clipping
2023-04-13 04:59:11 +00:00
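A sketch of the fix inside a gradient-accumulation inner loop; model, scaler, get_batch and gradient_accumulation_steps are stand-ins for the train.py objects:

```python
for micro_step in range(gradient_accumulation_steps):
    X, Y = get_batch('train')
    logits, loss = model(X, Y)
    # backward() sums gradients across micro-steps, so divide the loss here to
    # recover the mean; AdamW is scale-invariant, but gradient clipping is not
    loss = loss / gradient_accumulation_steps
    scaler.scale(loss).backward()
```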
ymurenko
4ac2e8ce3a
fix "cuda out of memory" when resuming training
2023-04-05 17:28:55 -04:00
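A hedged sketch of the resume path; ckpt_path, device, model and optimizer mirror the train.py variables. Dropping the checkpoint reference once it has been loaded avoids keeping a second full copy of the weights alive when training resumes:

```python
import torch

checkpoint = torch.load(ckpt_path, map_location=device)
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
checkpoint = None  # free the loaded dict so its tensors can be garbage collected
```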
Otavio Good
978d4fe538
Fix for slow training with gradient_accumulation_steps
2023-03-25 00:04:45 -07:00
Otavio Good
086ebe1822
fix for training stability on single GPU
2023-02-13 10:42:44 -08:00
Andrej Karpathy
e58f0cfa94
oops, i should not be multiplying by world_size when calculating mfu
2023-02-07 21:38:39 +00:00
Andrej Karpathy
8b1e43209e
small tweaks: make the default WD 0.1 as is often cited, and remove the spurious init of LayerNorm, which is already initialized to weight=1, bias=0
2023-02-06 23:07:25 +00:00
Andrej Karpathy
ab21d6c15d
bugfix: we have to call the raw_model's estimate_mfu. ty @jprobichaud for the original PR
2023-02-06 19:55:35 +00:00
Andrej Karpathy
ab0718a7dd
add the estimation of model flops utilization (MFU), a commonly tracked metric that expresses achieved throughput as a fraction of A100 bfloat16 peak flops (312 TFLOPS). this gives us a sense of the hardware utilization we're achieving
2023-02-05 00:48:58 +00:00
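A hedged sketch of the MFU estimate described above (PaLM-appendix style flop count); the parameter names are placeholders for the model's actual dimensions:

```python
def estimate_mfu(n_params, n_layer, n_head, head_dim, block_size, fwdbwd_per_iter, dt):
    # flops per token: 6N for the weights plus the attention term 12*L*H*Q*T
    flops_per_token = 6 * n_params + 12 * n_layer * n_head * head_dim * block_size
    flops_per_iter = flops_per_token * block_size * fwdbwd_per_iter
    flops_achieved = flops_per_iter / dt  # flops per second actually realized
    flops_promised = 312e12               # A100 bfloat16 peak: 312 TFLOPS
    return flops_achieved / flops_promised
```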
Andrej Karpathy
a74e8363a2
clean up TODOs a bit, they are stale
2023-02-04 21:11:25 +00:00
Andrej Karpathy
25d95dbd65
mildly dramatic refactor for handling all these usage cases across all possible supported and unsupported devices for all the possible switches and flags
2023-02-04 21:06:17 +00:00
Andrej Karpathy
e108ffb973
very slight refactor, bit cleaner
2023-02-04 19:34:24 +00:00
Nan Yang
b8286f343e
Pin memory only when training on GPU
2023-02-04 11:16:26 -08:00
Andrej Karpathy
77e7e04c26
padding 50257 -> 50304 vocab_size, the nearest multiple of 64. the biggest-deal smallest optimization i've made in the recent past, about 25% faster. this is because the last layer is a major latency bottleneck, consuming about 40% of the latency due to the very high channel count.
2023-02-04 16:06:18 +00:00
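The padding itself is just rounding up to the next multiple of 64 so the lm_head matmul lands on well-shaped tiles; a tiny illustrative helper:

```python
def pad_vocab(vocab_size, multiple=64):
    # round up to the nearest multiple (50257 -> 50304 for the GPT-2 vocab)
    return ((vocab_size + multiple - 1) // multiple) * multiple

assert pad_vocab(50257) == 50304
```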
Andrej Karpathy
b3c17c6c6a
slight tweak compressing LOC
2023-02-04 15:57:29 +00:00
Ramtin Gharleghi
9da1627c7f
Explicitly set ddp device
2023-02-04 15:07:36 +11:00
Andrej Karpathy
3fd4c0c5ef
who needs a dataloader? overlap the prefetching of the next batch with GPU compute, hiding the data loading latency entirely. this saves about 1ms lol
2023-02-04 02:52:48 +00:00
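A sketch of the overlap; names loosely mirror train.py, and get_batch is assumed to end with pinned, non_blocking host-to-device copies (e.g. `x = x.pin_memory().to(device, non_blocking=True)`). CUDA launches are asynchronous, so the forward call returns quickly and the next batch is prepared and copied while the GPU is still busy:

```python
X, Y = get_batch('train')          # fetch the very first batch
while True:
    logits, loss = model(X, Y)     # queue forward kernels on the GPU
    X, Y = get_batch('train')      # prefetch the next batch during GPU compute
    loss.backward()                # then backward, optimizer step, etc.
```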
Andrej
7d44bdf6b5
Merge pull request #106 from YassineYousfi/master
use the ``enabled`` arg in GradScaler
2023-02-02 17:23:22 -08:00
Andrej Karpathy
d8b1a94519
change grad accum to default off because i think it just confuses everyone
2023-02-02 18:38:49 +00:00
Yassine Yousfi
40f4d6ff70
use the enabled arg in GradScaler
2023-01-31 21:12:49 -08:00
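A sketch of why the `enabled` arg helps: the scaler is only meaningful for float16, and with enabled=False it degenerates to a no-op, so the rest of the loop can call scale()/step()/update() unconditionally regardless of dtype:

```python
import torch

dtype = 'float16'  # or 'bfloat16' / 'float32'
scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))
```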
Andrej Karpathy
038ce89438
rename iter to it, because iter is a concrete Python builtin
2023-01-31 23:34:02 +00:00
Andrej Karpathy
924a0873eb
merge, make cleaner; be careful with gradient clipping when using the grad scaler for fp16 training
2023-01-30 23:40:35 +00:00
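A sketch of the ordering this commit is careful about: with a GradScaler, gradients must be unscaled before the norm is measured and clipped. scaler, loss, model, optimizer and grad_clip are stand-ins for the train.py objects:

```python
scaler.scale(loss).backward()
if grad_clip != 0.0:
    scaler.unscale_(optimizer)  # clip on real gradients, not scaled ones
    torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad(set_to_none=True)
```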
Andrej Karpathy
0e90ee9d48
based on my experiments these biases are indeed not needed. code runs faster, identical results. keeping the option just because it deviates from the gpt-2 setup
2023-01-30 08:07:58 +00:00
Andrej Karpathy
001c1e7be7
stay true to the README and set grad accum to 5, so the default batch size is about 0.5M tokens and reproduces gpt2
2023-01-27 20:51:50 +00:00
Andrej Karpathy
79dbe0086d
let me set bias=True until I validate it properly, but this should be ok to merge to master for now; it is equivalent to the previous functionality
2023-01-27 20:45:28 +00:00
Andrej Karpathy
e808a67149
bunch of plumbing of bias all around. measuring bias=False to be about 6% faster
2023-01-27 20:41:17 +00:00
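Part of that plumbing is a LayerNorm wrapper carrying the bias flag, since nn.LayerNorm did not offer a bias switch at the time; a sketch in the spirit of nanoGPT's model.py:

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class LayerNorm(nn.Module):
    """LayerNorm with an optional bias, controlled by the model config."""
    def __init__(self, ndim, bias):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(ndim))
        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None

    def forward(self, x):
        return F.layer_norm(x, self.weight.shape, self.weight, self.bias, 1e-5)
```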
Andrej Karpathy
3cb3fc059c
grad clipping seems to slightly speed up training in the beginning, but i can't see a big difference later in training. it costs non-negligible compute to clip. adding it for now because it is standard, and i think it becomes more necessary as the model gets larger. practitioners may consider turning it off for minor efficiency gains
2023-01-27 16:45:09 +00:00
johnwildauer
e0e94a1094
use GradScaler in model only if dtype is float16
2023-01-24 15:53:31 -07:00
Andrej
3611338959
Merge pull request #71 from cchan/patch-1
Zero-grad more aggressively to save memory
2023-01-20 14:38:10 -08:00
Andrej Karpathy
1f77d03024
make mentions of mps in docs. ty good people in issue #28
2023-01-20 21:28:20 +00:00
Clive Chan
67166079c9
Zero-grad more aggressively to save memory
2023-01-19 22:10:44 -08:00
Andrej Karpathy
46ce9971df
small tweaks to docs and variable names stylistically
2023-01-16 16:56:05 +00:00
Andrej Karpathy
684800dd87
clarify that these should be run on two separate machines
2023-01-16 06:02:46 +00:00
Andrej Karpathy
9352df23de
docs for multinode ddp
2023-01-16 05:57:33 +00:00
Andrej Karpathy
c3dddbff3d
get rid of gpu_id, the world is more complicated than that when world_size > 8
2023-01-16 05:44:50 +00:00
Andrej Karpathy
f5e6ac8b02
local rank -> rank
2023-01-16 05:13:13 +00:00
Andrej Karpathy
cf99914886
add gradient accumulation support to simulate larger batch sizes. ty @VHellendoorn for original PR
2023-01-15 17:49:55 +00:00
Andrej Karpathy
57735f532d
correctly propagate the vocab_size from the rendered dataset into the model args
2023-01-14 02:26:44 +00:00
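A sketch of that propagation; it assumes the dataset's prepare script wrote a meta.pkl with a 'vocab_size' key, and `dataset` / `model_args` stand in for the train.py globals:

```python
import os
import pickle

meta_path = os.path.join('data', dataset, 'meta.pkl')
if os.path.exists(meta_path):
    with open(meta_path, 'rb') as f:
        meta = pickle.load(f)
    # let the rendered dataset, not a hard-coded constant, decide the vocab size
    model_args['vocab_size'] = meta['vocab_size']
```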
Andrej Karpathy
8f85b83347
inference-time mini-optimization, low-hanging fruit. ty @jxtps for raising this: when we are running inference we can apply lm_head to only the very last token
2023-01-12 06:02:50 +00:00
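A fragment in the spirit of the forward pass after this change (x is the transformer output, targets may be None at inference time; self.lm_head is the model's output projection):

```python
from torch.nn import functional as F

if targets is not None:
    logits = self.lm_head(x)              # training: logits for every position
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
else:
    logits = self.lm_head(x[:, [-1], :])  # inference: last position only, list index keeps the time dim
    loss = None
```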
Andrej Karpathy
d17350a31d
add support for character-level language models, a new character-level shakespeare dataset, a new config file that shows how to train a character-level baby GPT on it, and adjust the sample function to figure out if it should decode with characters or GPT2 bpe tokens. The current implementation is a bit hacky and basically assumes just these two possibilities. In the future we may want to support more general encoders or decoders.
2023-01-11 05:27:19 +00:00
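A sketch of how the sample script can pick a decoder under the "just these two possibilities" assumption: if the dataset shipped a meta.pkl with stoi/itos it is character-level, otherwise fall back to GPT-2 BPE. meta_path is a stand-in for the resolved data/<dataset>/meta.pkl path:

```python
import os
import pickle
import tiktoken

if os.path.exists(meta_path):
    with open(meta_path, 'rb') as f:
        meta = pickle.load(f)
    stoi, itos = meta['stoi'], meta['itos']
    encode = lambda s: [stoi[c] for c in s]
    decode = lambda l: ''.join(itos[i] for i in l)
else:
    enc = tiktoken.get_encoding("gpt2")
    encode = lambda s: enc.encode(s, allowed_special={"<|endoftext|>"})
    decode = lambda l: enc.decode(l)
```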
Andrej Karpathy
c2a402f7f7
guess the config from globals() and log all of it with wandb
2023-01-11 01:00:22 +00:00
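A sketch of the globals() trick: every simple, non-underscore global set so far (which at that point includes any config-file or command-line overrides) is collected into a dict that can be handed straight to wandb as the run config:

```python
config_keys = [k for k, v in globals().items()
               if not k.startswith('_') and isinstance(v, (int, float, bool, str))]
config = {k: globals()[k] for k in config_keys}
# later: wandb.init(project=wandb_project, name=wandb_run_name, config=config)
```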
Andrej Karpathy
a855d316fd
add device and dtype support to train.py args
2023-01-08 19:20:38 +00:00
Luca Antiga
09f1f458e8
Move conditional import
2023-01-08 15:51:50 +01:00
Luca Antiga
aba47f0a35
Make wandb import conditional on wandb_log=True
2023-01-08 15:42:08 +01:00
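A sketch of the conditional import, which keeps wandb an optional dependency; wandb_log, wandb_project, wandb_run_name and config are stand-ins for the train.py globals:

```python
if wandb_log:
    import wandb  # only pulled in when logging is actually requested
    wandb.init(project=wandb_project, name=wandb_run_name, config=config)
```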