46 comments.

  1. whydoesthisitch

    Deep learning research scientist here. I work on training algorithms for large scale foundational models, so I spend a lot of time working on exactly this kind of hardware.

    Basically, Dojo is still likely in relatively early development, and even when it is up and running, will still be 4+ years behind Nvidia in terms of performance. Tesla has really overplayed both the chip itself and how far along they are with development (which shouldn't surprise anyone). Even in what they've claimed will be their eventual cluster, it's going to take several generations to even come close to catching up with the big players in the field. For example, Tesla really played up Dojo breaking the exaflop barrier at AI day, implying that would make it among the most powerful compute clusters in the world. But in reality, Google is already running multiple 9 exaflop clusters with TPUv4, AWS is running 6 exaflop clusters with their internal Trainium chips, and 20 exaflop clusters with H100 GPUs.

    I'd be happy to answer any questions about the technical details around Dojo vs. alternative AI accelerators. But the big takeaway is, like all things Tesla, there are some interesting elements here, but it's not the world changing tech they imply with their marketing hype.

    1. ConcernedCitizens_

      u/whydoesthisitch

      I posted the below on the related thread that got shut down immediately thereafter and hoped that you might still have appetite to help educate me!

      Thanks for taking the time to share your expertise on Dojo.

      A few questions from an avid follower of the automotive industry who knows precious little about AI.

      1 - Tesla recently announced that they would be scaling to 100 exaflops (or what they say is the equivalent of 300k A100 GPUs) in around 16 months' time. Allowing for the usual Tesla-time, let's assume that they get there early 2025 as a guess.

      Am I right in thinking that you're saying that this level of compute is unremarkable vs what Tesla would be able to access economically via, for example, AWS or similar? To help me quantify it, do you have an idea of what the comparable level of compute would be for one of the leaders in the space (e.g. a Google or AWS) today/by 2025?

      2 - You've mentioned that Dojo is, more or less, pointless because it is behind the current industry best in class (H100?). For the purposes of the models that Tesla will need to run for its computer vision goals, does it matter that the Dojo tile/unit is inferior to the best that Nvidia has to offer in terms of specs if Tesla is able to make the costs work (i.e. if the cost is half of H100 but it's 70% as good for the job, can you just have two Dojo units and end up 40% better off than 1 Nvidia unit - crudely speaking)?

      If the numbers that they presented are broadly accurate, then it seems likely that the cost of scaling to that level will run into 10 figures in hardware alone. And although Tesla hypes and overpromises constantly, they do tend to be capital efficient (the hype has, to this point, never come along with the incineration of billions of dollars). So my expectation had been that Tesla internally thought this was worthwhile, particularly as they previously indicated that they weren't blindly committed to Dojo if it wasn't competitive with what they could get elsewhere.

      3 - In terms of Tesla conflating FP16 vs FP64 performance benchmarks: for the purposes of AI applications, is FP16 the relevant standard? And if so, do you have any information on how competitors' machines would perform on an FP16 test?

      Again, hopefully the question makes sense. The context here is that my (uninformed) understanding was that FP16 was the more relevant precision for training large neural nets and that, whilst an Nvidia unit would perform better in an FP64 benchmark, that wouldn't necessarily be relevant for the desired application.

      4 - Obviously I know nothing about this area, so when confronted with someone much more knowledgeable than myself making forward-looking predictions about what might happen, I always like to ask what needs to happen for what you've said here to be wrong, or for you to change your mind. I guess, for example, if Tesla were to scale broadly as they say they will, to 100 exaflops with Dojo (and not Nvidia GPUs) by 2025, does that change anything in your mind, or would you still consider the exercise to be a folly?

      5 - On the possible timeline for reaching a general autonomous driving solution: you've mentioned that solving 99% of the problem is 1% of the work, which makes sense, but how does compute factor into that equation? What I see is Tesla saying that they will do something like 100x their compute in the next 18 months (and presumably continue to grow it rapidly thereafter). I imagine that you don't just 100x your compute and thereby solve a problem 100x as fast(!), but presumably there is some kind of relationship there. So when you look at getting highway L4 by 2030 and then true FSD maybe in the 2040s, what kind of compute trajectory is that based on? Or is that the wrong way of thinking about the problem?

      6 - do you have any views on who is likely to be the leader in self driving technology over the coming decades (notwithstanding that you see true L5 or broad use case L4 being a long way off)?

      Cheers

      1. whydoesthisitch

        let's assume that they get there early 2025 as a guess.

        Using Tesla's usual timelines, I wouldn't expect them to get there anytime in the next few years. Remember, this is a company that's been saying they'll have fully autonomous cars "next year" since 2014.

        this level of compute is unremarkable

        Pretty much, yeah. AWS already has multiple clusters running at 20 EFLOPs each, and easily hosts hundreds of EFLOPs across clusters in their datacenters. By 2025, it will easily be several times larger, and by the time Tesla realistically has any chance of getting to 100 EFLOPs (I'd say more like 2027-2028), the major cloud providers will be in the zettaflop range.
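        As a sanity check on the scale being discussed, Tesla's own equivalence from the question (100 EFLOPs ≈ 300k A100s) can be checked against Nvidia's published A100 peak of 312 TFLOPS of dense FP16 tensor compute. A quick sketch, treating that spec number as a given:

```python
# Sanity-check Tesla's "100 exaflops ~= 300k A100s" framing using
# Nvidia's published A100 spec: 312 TFLOPS of dense FP16 tensor compute.
A100_FP16_TFLOPS = 312          # per-GPU peak, dense FP16 (tensor cores)
num_gpus = 300_000              # Tesla's stated A100 equivalence

total_tflops = A100_FP16_TFLOPS * num_gpus
total_eflops = total_tflops / 1_000_000   # 1 EFLOP = 1e6 TFLOPS

print(f"{total_eflops:.1f} EFLOPs")       # ~93.6 EFLOPs, roughly the claimed 100
```

        So the 100 EFLOP headline is internally consistent with the A100 comparison, at least at peak FP16 throughput; the point above is that the same arithmetic applied to cloud providers' fleets gives a much bigger number.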

        if the cost is half of H100 but it's 70% as good for the job

        Dojo is anywhere from 10-35% the performance of the H100. By the time it's up and running, Nvidia will likely be on the Blackwell generation, which will probably be 3-5x more powerful than the H100. Dojo really only makes sense if it's an order of magnitude cheaper than what Nvidia offers. The problem is, Tesla just doesn't operate on the scale necessary to make up that kind of R&D investment.
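        Crudely, the "half the cost, 70% as good" question above reduces to compute per dollar. A toy sketch of that comparison (every number here is an illustrative placeholder from this thread, not a real price or benchmark):

```python
# Illustrative price/performance comparison. The numbers are placeholders
# taken from the discussion above, not real prices or benchmark results.
def perf_per_dollar(relative_perf, relative_cost):
    """Throughput per unit cost, normalized to the reference chip (= 1.0)."""
    return relative_perf / relative_cost

h100 = perf_per_dollar(1.0, 1.0)                # reference: H100 at its own price
dojo_optimistic = perf_per_dollar(0.70, 0.50)   # the questioner's scenario
dojo_pessimistic = perf_per_dollar(0.20, 0.50)  # mid-range of the 10-35% estimate

print(dojo_optimistic)   # ~1.4 -> 40% more compute per dollar, as the question suggests
print(dojo_pessimistic)  # ~0.4 -> well behind even at half the price
```

        Which is the crux of the disagreement: at 70% of H100 performance the half-price math works, but at 10-35% it doesn't, before even counting R&D and the software ecosystem.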

        the hype has, to this point, never come along with the incineration of billions of dollars

        And that's really down to the fact that Tesla never delivers when it comes to AI projects. They hype, and promise it's around the corner, but it never materializes. That's why I want to see actual Dojo benchmarks, not just statements that it's "in production". Tesla has made various vague statements about already using Dojo since they first announced it, and it's turned out to be false every time.

        is FP16 the relevant standard?

        Yes, mostly, but that's Tesla's bait and switch. When they first announced Dojo, they made these big claims about it breaking the exaflop barrier, and comparing it to Fugaku. If you look at the talk page on the wildly incorrect Dojo wikipedia page, the author keeps comparing it to Fugaku as well. The implication was that Dojo would be the world's most powerful supercomputer. But that's not true, because Fugaku is measured in FP64 (which is the normal supercomputing standard). FP16 typically takes 32x less compute than FP64, so comparing the two is completely misleading. Realistically, in FP16, we've had exaflop scale machines since about 2017. My team was running exaflop scale ML training jobs in 2019. That's not really a big deal at this point. The problem I have with Tesla's presentation is they kept conflating the two precisions to make it sound like Dojo is way more cutting edge than it actually is.
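        The precision conflation can be made concrete with a quick conversion. The ~32x FP16:FP64 ratio mentioned above is hardware-dependent (it happens to roughly match the A100's 312 TFLOPS FP16 tensor vs 9.7 TFLOPS FP64), and the Fugaku figure is its published HPL result, so treat this as order-of-magnitude only:

```python
# Convert a marketing FP16 figure into a rough FP64 equivalent.
# The 32x ratio is hardware-dependent (roughly the A100's FP16-tensor to
# FP64 ratio), so this is an order-of-magnitude sketch, not a benchmark.
FP16_TO_FP64_RATIO = 32

dojo_fp16_eflops = 1.0                     # the "exaflop barrier" headline number
dojo_fp64_equiv = dojo_fp16_eflops / FP16_TO_FP64_RATIO

fugaku_fp64_eflops = 0.442                 # Fugaku's published HPL (FP64) result
print(f"{dojo_fp64_equiv:.3f} EFLOPs FP64-equivalent")   # ~0.031
print(dojo_fp64_equiv / fugaku_fp64_equiv if False else
      dojo_fp64_equiv / fugaku_fp64_eflops)              # a small fraction of Fugaku
```

        In other words, an FP16 "exaflop" is nowhere near a Top500-style FP64 exaflop, which is why comparing Dojo to Fugaku is misleading.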

        how does compute factor into that equation?

        Not much honestly. AV perception and planning algorithms aren't actually that hard to train, compared to things like GPT or PALM. The problem is, creating reliable models, knowing what data will give the best performance, and making models fast and efficient enough to run in the vehicles. We probably have all the pieces to make a reasonably reliable highway autonomous vehicle today. But from building the algorithms, testing, refining the sensors, and deploying the whole system, it's a 10 year process. It's not something that can happen in a few months, like Musk keeps claiming.

        who is likely to be the leader in self driving technology

        In terms of fully autonomous robotaxis, Waymo for the foreseeable future. Despite what people think, mapping cities is relatively easy compared to building reliable algorithms. I expect they'll keep slowly expanding in the next few years. In terms of consumer systems, Mobileye. My guess would be the first full speed L3 highway system will be them or Mercedes/Nvidia. Remember that Mobileye was the original Autopilot supplier for Tesla. But their approach is different: they only release an autonomous system when they're really sure of its high reliability (which is why AP1 still tends to be more reliable within its operational design domain than Tesla's newer systems).

        Overall, I would say Dojo is interesting, but nothing revolutionary. It's similar to the Graphcore IPU. It might be handling some small workloads in the next year or two, but I seriously doubt it'll be competitive with Nvidia or other cloud services anytime soon. I'm also not taking Tesla's word for anything about it being "in production" or running workloads, given that they've been so misleading about that in the past. Ideally, I want to see what it does on MLPerf, a standardized machine learning benchmark.

        1. ConcernedCitizens_

          Thank you so much for taking the time, that's both helpful and interesting. Will take the time to digest it properly 👍

        2. Nervous-Camera-6674

          Hello whydoesthisitch; thanks a lot for your responses.

          I have a question regarding Tesla continuing to purchase H100 chips now: why don't they aim for buying the new Blackwell chips? I assume Blackwell isn't fully available yet, but if it will be 30x better, why not wait until it is? I'm wondering whether buying H100s now is a waste of money.

        3. funkdrools

          1 year later your opinions are wrong about autonomous driving. LOL

          1. whydoesthisitch

            How so? Tesla still has zero autonomous cars on the road, and won’t anytime this decade.

          2. tigole

            I find that when people make a statement and end it with "LOL", the "LOL" basically means "ignore me, I'm an idiot."

          3. EnvironmentalTry1037

            I can drive from South Orange County to North LA on any route or streets I want with only a few disengagements, sometimes none. And it gets better with every update. I can't do that with Waymo or anything else and most likely never will.

          4. whydoesthisitch

            few disengagements

            So not autonomous.

        4. WildDogOne

          awesome insights, thanks!

        5. mendelseed

          Yes, and don't forget CUDA. It was developed back in 2006, so Tesla also has to develop something equivalent. AMD's ROCm development began in January 2022. So Nvidia is a decade ahead...

          I will stay with CUDA forever, and I guess most other computer scientists will too.

      2. laberdog

        Do we have evidence that Dojo actually exists?

        1. failinglikefalling

          Not really, and in an unusual hedge he literally signaled they may never even get past the test-bench phase when he announced it. It may have been the most realistic announcement he has ever made.

          1. laberdog

            And yet the stock jumps 10% because Morgan Stanley thinks Dodo might add $800 billion to the market cap

          2. failinglikefalling

            Still not back to post split so anything under 300 is just “nice try”

  2. zippy9002

    Because Dojo, at least for now, is worse than Nvidia.

    As of the last presentation they hadn't solved the main problems that everyone hits when trying to do what they're trying to do.

    Also, they don't "have Dojo"; as of the last presentation it didn't work. It's in development, and they are trying to do something nobody has been able to do yet. Even if they succeed, it doesn't mean it's going to be better than Nvidia.

  3. whateveridiot

    New company isn’t Tesla, it is a Twitter spin off.

    Tesla has no ability to allow external companies to use DOJO; it'd need a user-facing platform similar to AWS.

    They’ll switch to DOJO eventually.

    We don't know what they're doing internally at Tesla. Next AI day we may find out.

    1. [deleted]

      We know what they are doing internally. They are using GPUs while trying to get Dojo running in parallel. But Dojo is not ready yet.

    2. SuperNewk

      Just had me questioning everything Tesla has said

      1. [deleted]

        You clearly haven't read what Tesla has actually said. They have been super clear that Dojo is an attempt to beat current GPUs for this specific task, and that they aren't certain it will work, but that they are hopeful. You have clearly not followed the news at all.

        1. SuperNewk

          Solving the money problem, said it was far ahead of NVDA and everyone else

          1. [deleted]

            He is not your best source. And he is not Tesla. I listen to him sometimes, but he is full of hyperbole and overly bullish. What he says has some basis, but things are always more complicated.

            Dojo could really turn out amazingly and a lot of realistic Tesla bulls believe in this. But it is not certain. If you listen to what Elon has said on this topic you can see that it is more uncertain. Last year he said very clearly that it was not in use and that it wasn't certain that it would replace GPUs, but that the goal was that it should (for certain tasks) and that the proof would be when Tesla engineers used Dojo more than GPUs.

            Last year Elon was careful each time he talked about Dojo and was like "maybe it will work". The last couple of months he has sounded more bullish. I think the problem now is mainly software. For GPUs all the software exists; here they have to write it themselves and, as someone said in this thread already, some things have not been solved by anyone yet.

          2. whydoesthisitch

            Ignore literally everything that guy says. He has no clue what he’s talking about. In terms of chip level compute, Nvidia is 3-10x faster, depending on precision and sparsity. In terms of total cluster compute, the big cloud providers are 20x ahead. And those are all systems that are already up and running, compared to Dojo’s ultimate goal.

          3. Cosmacelf

            Well, there’s your problem. YouTube channels can be either way too optimistic about Tesla or way too pessimistic. Listen to the people writing here, since we actually listen to Elon’s presentations. Dojo is still very much a work in progress and it is speculative. It may or may not work in the end.

        2. whydoesthisitch

          Dojo is an attempt to beat current GPUs for this specific task

          That's what they've said, but it's not entirely true. They're using RISC-V cores in the D1 chip, which are actually even more general than GPUs.

      2. fuckswithboats

        I think we can assume most of it is bullshit at this point.

  4. zR0B3ry2VAiH

    Do they though? Has manufacturing been worked out?

    1. SuperNewk

      It’s been over 2 years so not a good sign?

      1. [deleted]

        2 years is a super short time. They are going very fast, but even Tesla isn't superhuman. It is still not certain that Dojo will work, but recent comments have been positive. It is not in much practical use yet, though.

      2. Lancaster61

        You know chip development is extremely slow, right? The chips in our devices today started development 10+ years ago.

      3. zR0B3ry2VAiH

        I can't imagine it is. Definitely interested to find out the answers to these questions.

  5. MartyBecker

    Elon said a long time ago that they’d switch to dojo when the engineers want to use it, and they’re not going to do that until it’s the better option.

  6. [deleted]

    Dojo isn't even in use yet! They don't have software for it yet. They can't use something they don't have.

  7. Dmpaden

    Where is Ganesh Venkataramanan?

  8. luckymethod

    Two things: first, Dojo is a server chip used for training, a different beast from running the software in the car. Second, it's also not ready; making a chip isn't easy and takes time, and I'm not sure Tesla picked the right fight here. IMHO they'll give up and just buy off the shelf like everyone else.

  9. Ambiwlans

    Why would dojo be better at llms?

    It is specifically designed for tesla.

    1. whydoesthisitch

      Because a large portion of Tesla's vision work is based on the same transformer architecture used in LLMs like GPT.

      1. Ambiwlans

        That doesn't necessarily mean a whole lot depending on how the chips are designed/optimized.

        "transformer" is really broad.

        1. whydoesthisitch

          Transformers? They’re all very similar architecturally. Just self attention and layer norm.

          These are RISC-V cores. Basically bigger Graphcore ipus. What kind of specialization are you expecting?
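          "Just self attention and layer norm" can be written out in a few dozen lines. A minimal single-head encoder block in NumPy (random weights and illustrative shapes only, no training) looks like:

```python
import numpy as np

# Minimal single-head transformer encoder block: the "self attention and
# layer norm" skeleton, with random weights purely for illustration.
rng = np.random.default_rng(0)
d = 64                                   # model (embedding) dimension
x = rng.standard_normal((10, d))         # sequence of 10 token embeddings

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# Random projection matrices stand in for learned parameters.
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
W1 = rng.standard_normal((d, 4 * d)) / np.sqrt(d)
W2 = rng.standard_normal((4 * d, d)) / np.sqrt(4 * d)

def block(x):
    # Self-attention sublayer with residual connection + layer norm
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(d)) @ v
    x = layer_norm(x + attn @ Wo)
    # Position-wise feed-forward sublayer with residual + layer norm
    ff = np.maximum(x @ W1, 0) @ W2      # ReLU MLP
    return layer_norm(x + ff)

y = block(x)
print(y.shape)                           # (10, 64)
```

          The point being that the compute is dominated by generic dense matrix multiplies, which is why the same accelerators serve both LLMs and vision transformers.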

          1. Ambiwlans

            I mean, even layer size can have a big impact here. I don't think there is enough public info to determine if dojo would be ideal, so it seems odd to assume it would be. It could be hyper tuned for the very structured data coming from the tesla fleet.

            (i haven't recently looked into how much has been released though, so i suppose it is possible to know the answer, albeit, i doubt op does)

          2. whydoesthisitch

            It could be hyper tuned for the very structured data coming from the tesla fleet.

            That just doesn't make any sense. The data Tesla is using isn't unique, and the models they're training are just simple ViTs. This isn't anything out of the ordinary. Dojo, on the other hand, is literally just a large cluster of RISC-V cores, nothing that new. And why would Tesla build a chip hyper-specialized for one particular model when they change their model about every year? It's pretty clear they're still trying to figure out how to get basic things working (remember, they don't develop their own models; they're using Google's old perception models).

          3. Ambiwlans

            Maybe enough data is public then, I'll retract my position as you seem more informed.

            I assumed the chips would have been much more specialized (there are a lot of ways this is possible) ... otherwise I don't know what the point of the project is if it underperforms COTS options.

          4. whydoesthisitch

            Publicly, the main point of the project is to get away from Nvidia, because their chips are so expensive. Realistically, it doesn't make much sense for Tesla to be building their own chip at this scale; it's likely to end up being more expensive. So I suspect Dojo was always more one of Tesla's vaporware programs designed to excite investors than a serious ML accelerator development effort.

          5. Ambiwlans

            Why must you bring my day so much realistic disappointment?

            .

            thanks i guess

          6. laberdog

            Exactly this. If the next “big” thing to trigger a stock run was nuclear fusion powered by coal and the lives of Ukrainian children Tesla would announce its program within hours
