
Disclaimer: I work on Google Cloud.

I saw a lot of "should we use Cloud? No, it's crazy, a GPU only costs $X" comments. The key is that if you believe GPUs are going to get updated every year, and/or the best thing for ML may change (see TPU and plenty of startups with custom hardware), then suddenly buying hardware for 24? 36? months isn't as obvious.

We (and AWS and Microsoft) have K80s because Maxwell wasn't a sufficiently friendly all around part. We're all going to offer Pascal P100s and in the future V100s. The challenge for buying your own is that P100s are available now-ish and V100s may be available in less than 12 months.

Buying a P100 or similar part today doesn't mean it won't still be working in a year, but it will suddenly mean you've bought a part that now has much worse !/$ in just N months. If you have an accounting team that is spreading your $Xk over those 36 months, the reality is that you have two options: tell everyone they have to make use of the old parts ("we're not buying new GPUs until it's been 36 months!") or realize you're going to get a lot less use out of them.

To be clear, the progress in this space is really impressive. And the same problem above applies to us (the cloud providers). Despite my obvious bias, if I were fired today, I'd be renting to do deep learning just based on the roadmap alone (not to mention the ability to suddenly spin up and down).

Again, Disclosure: I work on Google Cloud and want to sell you things that train ML models :).



Except buying hardware isn't a 24-36 month investment: you could buy new hardware every 3 months or so and it would still be cheaper than renting. Renting only makes sense when you need to scale horizontally, but for most people's use cases having an in-house machine is sufficient.


The problem is that a month of GPU time on a cluster can buy you the hardware itself. If you are doing serious deep learning work it's just not cost effective at this point. If the costs come down by half or more, it may start looking viable for people who need a lot of resources.


Can you explain your math? A K80 on GCE is $.7/hr x 730 => $511/month if you were really 24x7. A K80 (and really we sell them by the die not the board) is more than $1000.
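A rough sketch of that rental arithmetic, for anyone who wants to plug in their own numbers. Only the $0.70/hr rate and the 730 hours per month come from the comment above; the function name and the lower utilization levels are assumptions for illustration.

```python
HOURS_PER_MONTH = 730  # hours-per-month figure used in the comment above


def monthly_rental_cost(hourly_rate, utilization=1.0):
    """Dollars per month to rent one GPU at a given utilization (0.0 to 1.0)."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return hourly_rate * HOURS_PER_MONTH * utilization


k80_rate = 0.70  # $/hr per K80 die on GCE, as quoted above

# 100% is the "really 24x7" case; 50% and 25% are hypothetical duty cycles.
for utilization in (1.0, 0.5, 0.25):
    cost = monthly_rental_cost(k80_rate, utilization)
    print(f"{utilization:.0%} utilization: ${cost:.2f}/month")
```

At full utilization this gives the $511/month figure; the lower rows only show how quickly the bill shrinks if the GPU is not actually busy around the clock.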

I don't disagree that a consumer board is about that price, but they're not apples to apples. (Either in memory size, reliability or both). I'm fine with that being the real complaint: (major) cloud providers only sell the Tesla class boards, and they're really expensive ;).


> but they're not apples to apples

It would be nice to see Nvidia or someone expand on this, so that users who have to make this choice can do so without guessing. If Google or AWS or M$ could publish reliability information, that'd be cool too.

Illustrative case: I run Monte Carlo work on GPUs and administer a local compute cluster. I tested a workload on a 16 GB P100 and a GTX 1080. A 12 GB P100 costs (academic, EU) 5000 euros while the latter costs 700 euros, but the performance difference is about 2x. Still, when we ask Nvidia reps, they say not to bother installing GTX cards in our cluster, because they aren't designed for 24/7 work, not commercializable etc. Even so, the GTX would have to burn out 3 times before the choice of P100 breaks even. Burning out three times means GTX 1080, then 1180, 1280 etc.
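The break-even claim can be checked with a few lines. This is only a sketch using the prices and the 2x speedup quoted above; it ignores power, hosting, resale value and failure rates, and the helper name is made up for illustration.

```python
def cost_per_unit_performance(price, relative_speed):
    """Purchase price divided by speed relative to the baseline card."""
    return price / relative_speed


p100_price = 5000.0  # euros (academic, EU), as quoted above
gtx_price = 700.0    # euros, as quoted above
p100_speedup = 2.0   # P100 is about 2x a GTX 1080 on this workload

p100_cost = cost_per_unit_performance(p100_price, p100_speedup)  # 2500 euros
gtx_cost = cost_per_unit_performance(gtx_price, 1.0)             # 700 euros

# How many GTX purchases the P100 money covers for the same throughput.
gtx_purchases_per_p100 = p100_cost / gtx_cost  # about 3.57

print(f"P100: {p100_cost:.0f} euros per GTX-equivalent of throughput")
print(f"GTX:  {gtx_cost:.0f} euros per GTX-equivalent of throughput")
print(f"GTX purchases covered by the P100 budget: {gtx_purchases_per_p100:.2f}")
```

So the P100 budget covers the first GTX plus roughly two and a half replacements at equal throughput, which is in the same ballpark as the "burn out 3 times" estimate.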


Thank you for the answer. When you say performance difference is 2x, I presume the P100 is 2x faster than the 1080 in training epoch time?


yep


A GTX 1080 is like $550 at this point, and a K80 cannot really compete with that cost-effectiveness. And it is probably not going out of fashion in one year, so the price is totally worth it.

The real attraction of cloud at this point is if you are going to train your model on 8 or more GPUs, which is likely not feasible for individual enthusiasts, but demand for such a machine is rare among hobbyists anyway.


This is kind of the key. If P100s were common on gce, it would be a better comparison. But the k80 is very old by today's standards, and a consumer card will run circles around it.


Sure, in looking at what it would cost to replicate this:

https://arxiv.org/pdf/1704.01444.pdf

They say they used four Titan X cards for a month. I was actually looking at google cloud machine learning since I thought being targeted to tensorflow would be the most cost effective. But it's 0.49/hour per ML training unit, or 1.47/hour per GPU. A basic gpu gives 3 training units, so I thought that something approximately similar would require 12 training units, which comes to something like $4000+ a month. Maybe I completely misunderstood the resources being offered though, because you are right that cloud engine costs seem much lower.
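The estimate above works out as follows. This sketch takes the quoted $0.49 per training unit and 3 units per basic GPU at face value, and assumes (as the comment does) that one basic GPU roughly stands in for one Titan X. The 730 hours per month and continuous 24x7 use are additional assumptions.

```python
HOURS_PER_MONTH = 730  # assumption: one month of continuous use

unit_rate = 0.49   # $/hour per ML training unit, as quoted above
units_per_gpu = 3  # training units for a "basic GPU", as quoted above
gpus = 4           # four cards, matching the paper's four Titan X setup

per_gpu_hourly = unit_rate * units_per_gpu              # $1.47/hour per GPU
training_units = units_per_gpu * gpus                   # 12 training units
monthly_cost = unit_rate * training_units * HOURS_PER_MONTH

print(f"Per GPU: ${per_gpu_hourly:.2f}/hour")
print(f"{training_units} training units for a month: ${monthly_cost:.2f}")
```

That lands at a bit under $4,300 for the month, consistent with the "$4000+" figure.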

I'll have to revisit the math here, though it worries me that it's not at all clear that a K80 will be much faster than a Titan X for a given problem. E.g. https://www.amax.com/blog/?p=907. It would also be really nice to get some pricing and benchmarks for the new TPUs, assuming they are priced better.

Maybe part of the problem is that vendors are not making it remotely easy to even understand what performance you'll actually get for a given price.


The comparison should be between a machine off-premises with as many GPUs in it of the same type that you intend to use and power + cooling.

Payback time according to my own calculations depending on GPU model and machine used compared to the various cloud offerings is between 4 and 6 months when used continuously.
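A minimal sketch of how such a payback estimate can be computed. The function and every number in the usage example are hypothetical placeholders, not the figures behind the 4 to 6 month range above.

```python
def payback_months(hardware_cost, cloud_monthly_cost, own_monthly_running_cost):
    """Months of continuous use until buying beats renting.

    own_monthly_running_cost should cover power, cooling and hosting for
    the purchased machine. Returns infinity if renting is never more
    expensive per month than running your own hardware.
    """
    monthly_savings = cloud_monthly_cost - own_monthly_running_cost
    if monthly_savings <= 0:
        return float("inf")
    return hardware_cost / monthly_savings


# Hypothetical inputs, for illustration only.
months = payback_months(
    hardware_cost=2500.0,           # assumed price of a GPU workstation
    cloud_monthly_cost=511.0,       # one K80 die at 24x7, from earlier in the thread
    own_monthly_running_cost=50.0,  # assumed power + cooling
)
print(f"Payback after about {months:.1f} months of continuous use")
```

The result is very sensitive to utilization: if the machine sits idle half the time, the effective cloud bill it replaces halves and the payback period roughly doubles.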


Isn't there an abstraction cost though, since that price is only for half a board? In my experience (with AWS) I've had a machine with a Titan X outrun a cloud instance by 1.5-2x, which is significant.

There are also other hidden costs that aren't factored into that number. I'm having a hard time getting anything reasonable for under $750/month for 24/7 usage: https://cloud.google.com/products/calculator


Disclosure: My incentives compete with google cloud. We have had enormous cost savings with on premise customers. I think one thing that isn't being said here is: Most enterprise customers can't actually leverage that much GPU capacity anyways. We have found incremental addition of GPUs to hadoop clusters (yes this is a thing) to be great.

It's cheaper, allows gradual adoption of deep learning and is a familiar toolset for folks already doing some sort of machine learning.

I won't comment on research (not our domain). Depreciation on hardware is pretty standard - that being said, it also comes with established SKUs from Dell, HP, Cisco, ... with proper support.

Analytics clusters (while hard to manage) are fairly robust already to job failures. The cost savings just makes a ton more sense when you are doing continuous workloads for different use cases.


While it's easy enough to add GPUs to a Hadoop/Spark cluster (and we did so too via Dataproc [1]), are you just saying that means you assume closer to 100% utilization due to sharing?

If so, that's fine-ish, but then people have to wait (you're either full and people are waiting or you're at less than 100%). My preference is to run for XX minutes per job on-demand (per person). If you have tons of non-overlapping users, you can absolutely aggregate them. But how many do you buy and how quickly do you upgrade to newer hardware?

[1] https://cloud.google.com/dataproc/docs/concepts/gpu-clusters


I'm talking about on-prem clusters where people aren't using cloud. Many of these are just managed by central IT departments (e.g. most of enterprise). We've found folks are perfectly happy to add GPU nodes to an existing cluster managed by YARN or Mesos. An incremental upgrade via just another SKU from a hardware vendor that purchasing/procurement already works with is usually enough.

That being said - it can usually be every 6 months, with a renewal every 2 years or so. Your hardware timelines aren't far off. What I was getting at is that upgrade cycles don't matter as much.

GPU clusters are still enough of a new thing for enterprise that the jobs being run aren't even that high-spec yet.

My opinion, nothing more (not claiming this is fact): Google has marched so far ahead of the rest of the world that they aren't really paying attention to where current clusters and usage are. It seems a lot of the cloud usage is oriented towards startups and researchers (which isn't a bad thing, most folks in DL are researchers).

For enterprise, they might offload some workloads. There are definitely some workloads where cloud resources (spin up and shut down) make a ton of sense. Cloud servers are overly expensive otherwise.


The site is up again.

I have a prepared draft for a blog post exactly on the topic of cloud computing and deep learning. I did not finish it, as I thought there would not be much interest in the overall question since most people will just buy GTX cards. However, it seems there is quite a bit of confusion about what makes sense and what does not under given circumstances. I think I will finish that blog post now and post it in the next few days.

If you want me to discuss certain questions regarding deep learning hardware and cloud computing let me know here.


One problem with this argument is that there isn't enough history. When AWS added the K80, the price of the old ones did not decrease much; they just priced the new K80 higher than the K20. Theoretically what you said can happen, but Google still has to run the K80s and pay for them.

By the way, I appreciate the useful comments you've left in the past on your experience for gce.


Another option is to consider something that is even more abstracted from Google Cloud, AWS or Microsoft, such as https://www.floydhub.com/ (Heroku for deep learning). Ultimately, someone like them may be quicker to switch between different providers than individual companies.

(I have no association with Floyd)


Is this any different than advising 18 month depreciation for GPUs?



