**Response to @karpathy::** Rough example, a decent GPT-2 (124M) pre-training ...

Andrej Karpathy / @karpathy (RSS Feed)

2023-01-11 19:04:24

**Response to @karpathy::**

Rough example, a decent GPT-2 (124M) pre-training reproduction would be 1 node of 8x A100 40GB for 32 hours, processing 8 GPU * 16 batch size * 1024 block size * 500K iters = ~65B tokens. I suspect this wall clock can still be improved ~2-3X+ without getting too exotic.
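The arithmetic in the post can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and it assumes the batch size of 16 is per GPU with no gradient accumulation, which the post's formula implies but does not state. The throughput and projected-hours figures are derived from the quoted numbers, not taken from the post.

```python
# Figures quoted in the post (variable names are illustrative, not from the post).
num_gpus = 8
batch_size_per_gpu = 16      # assumed per-GPU, no gradient accumulation
block_size = 1024            # tokens per sequence
iterations = 500_000
wall_clock_hours = 32

# Tokens consumed by one optimizer step across all GPUs.
tokens_per_iter = num_gpus * batch_size_per_gpu * block_size

# Total tokens over the whole run.
total_tokens = tokens_per_iter * iterations

# Implied aggregate throughput (derived, not stated in the post).
tokens_per_second = total_tokens / (wall_clock_hours * 3600)

# Wall clock if the suggested 2-3X improvement were realized.
hours_at_2x = wall_clock_hours / 2
hours_at_3x = wall_clock_hours / 3

print(f"tokens per iteration: {tokens_per_iter:,}")
print(f"total tokens:         {total_tokens / 1e9:.1f}B")
print(f"throughput:           {tokens_per_second:,.0f} tokens/s")
print(f"projected wall clock: {hours_at_3x:.1f} to {hours_at_2x:.1f} hours")
```

Under these assumptions the total comes to about 65.5B tokens, consistent with the "~65B" in the post.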

https://nitter.moomoo.me/karpathy/status/1613250489097027584#m

Author Public Key

npub1rj7u39tvjdgfpzg3c3xfym6vzalt34p7t5uvdsqhzgst9jtl7dgqs2ffmk

