
Introducing Thanos: Prometheus at Scale


Fabian Reinartz is a software engineer who enjoys building systems in Go and chasing tough problems. He is a Prometheus maintainer and co-founder of the Kubernetes SIG instrumentation. In the past, he was a production engineer at SoundCloud and led the monitoring team at CoreOS. Nowadays he works at Google.

Bartek Plotka is an Improbable infrastructure software engineer. He’s passionate about emerging technologies and distributed systems problems. With a low-level coding background at Intel, previous experience as a Mesos contributor, and production, global-scale SRE experience at Improbable, he is focused on improving the world of microservices. His three loves are Golang, open-source software and volleyball.

As you might guess from our flagship product SpatialOS, Improbable has a need for highly dynamic cloud infrastructure at a global scale, running dozens of Kubernetes clusters. To stay on top of them, we were early adopters of the Prometheus monitoring system. Prometheus is capable of tracking millions of measurements in real time and comes with a powerful query language to extract meaningful insights from them.

Prometheus’s simple and reliable operational model is one of its major selling points. However, past a certain scale, we’ve identified a few shortcomings. To resolve those, we’re today officially announcing Thanos, an open source project by Improbable to seamlessly transform existing Prometheus deployments in clusters around the world into a unified monitoring system with unbounded historical data storage.

Always-on metric monitoring

Our goals with Thanos

At a certain cluster scale, problems arise that go beyond the capabilities of a vanilla Prometheus setup. How can we store petabytes of historical data in a reliable and cost-efficient way? Can we do so without sacrificing responsive query times? Can we access all our metrics from different Prometheus servers from a single query API? And can we somehow merge replicated data collected via Prometheus HA setups?

We built Thanos as the solution to these questions. In the next sections, we describe how we mitigated the lack of these features prior to Thanos and explain the goals we had in mind in detail.

Global query view

Prometheus encourages a functional sharding approach. Even a single Prometheus server provides enough scalability to free users from the complexity of horizontal sharding in virtually all use cases.

While this is a great deployment model, you often want to access all the data through the same API or UI – that is, a global view. For example, Grafana graphs can only ever be pointed at a single data source, so you need multiple tabs to see graphs from multiple data sources at the same time. With Thanos, on the other hand, you can aggregate data from multiple Prometheus servers into a single Grafana tab.

Previously, to enable a global view at Improbable, we arranged our Prometheus instances in a multi-level hierarchical federation. That meant setting up a single meta-Prometheus server that scraped a portion of the metrics from each “leaf” server.

This proved problematic. It increased the configuration burden, added a potential failure point and required complex rules to expose only certain data on the federated endpoint. In addition, this kind of federation does not allow a truly global view, since not all data is available from a single query API.

Closely related to this is a unified view of data collected by high-availability (HA) pairs of Prometheus servers. Prometheus’s HA model independently collects data twice, which is as simple as it could be. However, a merged and de-duplicated view of both data streams is a huge usability improvement.

Undoubtedly, there is a need for highly available Prometheus servers. At Improbable we are really serious about monitoring every minute of data, but having a single Prometheus instance per cluster is a single point of failure. Any configuration error or hardware failure could potentially result in the loss of important insights. Even a simple rollout can briefly disrupt metric collection, because a restart can take significantly longer than the scrape interval.

Reliable historical data storage

One of our dreams (shared by most Prometheus users) is cheap, responsive, long-term metric storage. At Improbable, using Prometheus 1.8, we were forced to set our metric retention to an embarrassing nine days. This placed an obvious operational limit on how far back our graphs could look.

Prometheus 2.0 helps a lot in this area, as the total number of time series no longer impacts overall server performance (see Fabian’s KubeCon keynote about Prometheus 2). Still, Prometheus stores metric data on its local disk. While highly efficient data compression can get significant mileage out of a local SSD, there is ultimately a limit on how much historical data can be stored.

Additionally, at Improbable, we care about reliability, simplicity and cost. Larger local disks are harder to operate and back up; they are more expensive and require extra backup tooling, which introduces unnecessary complexity.


Once we started querying historical data, we soon realized that there are fundamental big-O complexities that make queries slower and slower as we retrieve weeks’, months’, and ultimately years’ worth of data.

The usual solution to that problem is called downsampling, a process of reducing the sampling rate of the signal. With downsampled data, we can “zoom out” to a larger time range and maintain the same number of samples, thus keeping queries responsive.

Downsampling old data is an inevitable requirement of any long-term storage solution and is beyond the scope of vanilla Prometheus.
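The core idea can be sketched in a few lines of Python. This is a toy illustration, not Thanos’s actual implementation: bucket raw `(timestamp, value)` samples into fixed windows and keep a handful of aggregates per window, so a long time range needs far fewer points.

```python
# Toy downsampling sketch: bucket raw (timestamp_ms, value) samples into
# fixed windows and keep min/max/sum/count aggregates per window.
def downsample(samples, window_ms):
    buckets = {}
    for ts, value in samples:
        start = ts - (ts % window_ms)  # align to window boundary
        agg = buckets.setdefault(start, {"min": value, "max": value, "sum": 0.0, "count": 0})
        agg["min"] = min(agg["min"], value)
        agg["max"] = max(agg["max"], value)
        agg["sum"] += value
        agg["count"] += 1
    return sorted(buckets.items())

# One hour of 15s-interval samples reduces to twelve 5-minute aggregates.
raw = [(i * 15_000, float(i)) for i in range(240)]
print(len(downsample(raw, 300_000)))  # 12
```

Keeping several aggregates (rather than just an average) is what lets different query functions pick a faithful representation later.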

Additional goals

One of the initial goals of the Thanos project was to integrate with any existing Prometheus setups seamlessly. A second goal was that operations should be simple, with a minimal barrier to entry. If there are any dependencies, they should be easy to satisfy for small- and large-scale users alike, which also implies a negligible baseline cost.

A global view

The architecture of Thanos

With our goals enumerated in the previous section, let’s work down that list and see how Thanos tackles them.

Global view

To get a global view on top of an existing Prometheus setup, we need to interconnect a central query layer with all our servers. The Thanos Sidecar component does just that and is deployed next to each running Prometheus server. It acts as a proxy that serves Prometheus’s local data over Thanos’s canonical gRPC-based Store API, which allows selecting time series data by labels and time range.

On the other end stands a horizontally scalable and stateless Querier component, which does little more than answer PromQL queries via the standard Prometheus HTTP API. Queriers, Sidecars and other Thanos components communicate via a gossip protocol.

  1. When the Querier receives a request, it fans out to relevant Store API servers, i.e. our Sidecars, and fetches the time series data from their Prometheus servers.
  2. It aggregates the responses together and evaluates the PromQL query against them. It can aggregate disjoint data as well as duplicated data from Prometheus high-availability pairs.
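The fan-out-and-merge step above can be sketched roughly as follows. This is a simplified Python illustration only; the real Querier merges at the series level with much smarter replica handling (e.g. penalizing replicas with gaps):

```python
# Toy sketch of the Querier's merge step: combine series from several
# Store API responses and naively de-duplicate HA replicas by dropping
# the replica label and keeping the first sample seen per timestamp.
def merge_series(responses, replica_label="replica"):
    merged = {}
    for response in responses:  # one response per Store API server
        for series in response:
            # Drop the replica label so HA pairs map to the same identity.
            labels = tuple(sorted((k, v) for k, v in series["labels"].items()
                                  if k != replica_label))
            samples = merged.setdefault(labels, {})
            for ts, value in series["samples"]:
                samples.setdefault(ts, value)  # first sample per timestamp wins
    return {labels: sorted(samples.items()) for labels, samples in merged.items()}

a = [{"labels": {"job": "api", "replica": "a"}, "samples": [(1, 10.0), (2, 11.0)]}]
b = [{"labels": {"job": "api", "replica": "b"}, "samples": [(2, 11.0), (3, 12.0)]}]
result = merge_series([a, b])
print(result[(("job", "api"),)])  # [(1, 10.0), (2, 11.0), (3, 12.0)]
```

Note how the overlapping sample at timestamp 2 is reported once, while each replica fills in the other’s gaps.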

This solves a central piece of our puzzle by unifying well-separated Prometheus deployments into a global view of our data. In fact, Thanos can be deployed like this to only make use of these features if desired. No changes to the existing Prometheus servers are necessary at all!

Unlimited retention!

Sooner or later, however, we will want to preserve some data beyond Prometheus’s regular retention time. To do this, we settled on an object storage system for backing up our historical data. It is widely available in every cloud and even most on-premise data centres, and is extremely cost efficient. Furthermore, virtually every object storage solution can be accessed through the well-known S3 API.

Prometheus’s storage engine writes its recent in-memory data to disk about every two hours. A block of persisted data contains all the data for a fixed time range and is immutable. This is rather useful since the Thanos Sidecar can simply watch Prometheus’s data directory and upload new blocks into an object storage bucket as they appear.

An additional advantage of having the Sidecar upload metric blocks to the object store as soon as they are written to disk is that it keeps the “scraper” (Prometheus with Thanos Sidecar) lightweight, which simplifies maintenance, cost and system design.
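That watch-and-upload loop can be sketched like this (illustrative Python, not Thanos’s code; `upload_block` stands in for a real S3/GCS client call, and the polling interval is an arbitrary choice):

```python
# Sketch of a sidecar-style upload loop: watch Prometheus's data directory
# and ship any block directory we have not seen before to object storage.
import os
import time

def watch_and_upload(data_dir, upload_block, seen=None, poll_seconds=30, once=False):
    seen = set() if seen is None else seen
    while True:
        for entry in sorted(os.listdir(data_dir)):
            block_dir = os.path.join(data_dir, entry)
            # Prometheus blocks are directories containing a meta.json file;
            # this skips the WAL directory and any loose files.
            if os.path.isfile(os.path.join(block_dir, "meta.json")) and entry not in seen:
                upload_block(block_dir)
                seen.add(entry)
        if once:  # single pass, handy for testing
            return seen
        time.sleep(poll_seconds)
```

Because completed blocks are immutable, a simple “upload anything new” policy like this is safe: a block never changes after it appears.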

Backing up our data is easy. What about querying data from the object store again?

The Thanos Store component acts as a data retrieval proxy for data inside our object storage. Just like the Thanos Sidecars, it participates in the gossip cluster and implements the Store API. This way, existing Queriers can treat it just like a Sidecar: as another source of time series data, with no special handling required.


The blocks of time series data consist of several large files. Downloading them on-demand would be rather inefficient and caching them locally would require huge memory and disk space.

Instead, the Store Gateway knows how to deal with the data format of the Prometheus storage engine. Through smart query planning and by caching only the necessary index parts of blocks, it can reduce complex queries to a minimal number of HTTP range requests against files in the object storage. This way it can reduce the number of naive requests by four to six orders of magnitude and achieve response times that are, in the big picture, hard to distinguish from queries against data on a local SSD.

As shown in the diagram above, the Thanos Querier significantly reduces per-request costs against object storage offerings by leveraging the Prometheus storage format, which co-locates related data within a block file. With that in mind, we can aggregate multiple byte fetches into a minimal number of bulk calls.
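The range-coalescing idea can be sketched as follows (a toy Python illustration; the real planning logic and thresholds are more involved): sort the requested byte ranges and merge any that are close enough, trading a little over-read for far fewer HTTP range requests.

```python
# Toy sketch of coalescing many small byte ranges into a few bulk fetches.
# Ranges within `max_gap` bytes of each other are merged into one request.
def coalesce(ranges, max_gap=512 * 1024):
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)  # extend previous request
        else:
            merged.append([start, end])  # start a new request
    return [tuple(r) for r in merged]

# Two nearby chunk reads collapse into one request; a distant one stays separate.
chunks = [(0, 100), (200, 300), (5_000_000, 5_000_100)]
print(coalesce(chunks))  # [(0, 300), (5000000, 5000100)]
```

Because object stores typically charge per request rather than per byte within reason, reading a small gap of unwanted bytes is usually cheaper than issuing an extra request.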

Compaction & downsampling

The moment a new block of time series data is successfully uploaded to the object storage, we treat it as “historical” data that is immediately available via the Store Gateway.

However, after some time, these blocks from a single source (i.e. Prometheus with a Sidecar) accumulate and no longer use the full potential of the index. To solve this, we introduced a separate singleton component called the Compactor. It simply applies Prometheus’s local compaction mechanism to historical data in the object storage and can be run as a simple periodic batch job.

Thanks to Prometheus’s efficient sample compression, querying many series from storage over a long time range is not problematic from a data size perspective. However, the potential cost of decompressing billions of samples and running them through query processing inevitably causes drastic increases in query latency. On the other hand, as there are hundreds of data points per available screen pixel, it becomes impossible to even render the full resolution data. Thus, downsampling is not only feasible but involves no noticeable loss of precision.

To produce downsampled data, the Compactor continuously aggregates series down to five-minute and one-hour resolutions. For each raw chunk, encoded with TSDB’s XOR compression, it stores different types of aggregations, e.g. min, max, or sum, in a single block. This allows the Querier to automatically choose the aggregate that is appropriate for a given PromQL query.

From the user perspective, no special configuration is required to use downsampled data. The Querier automatically switches between the different resolutions and raw data as the user zooms in and out. Optionally, the user can control it directly by specifying a custom “step” in the query parameters.
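A simplified sketch of that resolution choice (the thresholds and the rule itself are illustrative, not Thanos’s exact logic): pick the coarsest resolution that still yields several samples per query step.

```python
# Toy sketch of resolution selection: given a query step in milliseconds,
# use the coarsest downsampled resolution that still provides at least
# `samples_per_step` samples per step. 0 means raw data.
RESOLUTIONS_MS = [0, 300_000, 3_600_000]  # raw, 5 minutes, 1 hour

def pick_resolution(step_ms, resolutions=RESOLUTIONS_MS, samples_per_step=5):
    chosen = resolutions[0]
    for res in resolutions[1:]:
        if step_ms >= res * samples_per_step:
            chosen = res  # coarser data is still dense enough for this step
    return chosen

print(pick_resolution(30_000))     # 0: raw data for a zoomed-in graph
print(pick_resolution(3_600_000))  # 300000: 5m data for a week-long graph
```

A zoomed-in dashboard (small step) gets raw samples, while a months-long graph (large step) is served from the one-hour aggregates.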

Since the storage cost per GB is marginal, Thanos by default keeps the raw, five-minute and one-hour resolutions in storage; there is no need to delete the original data.

Recording rules

Even with Thanos, recording rules are an essential part of the monitoring stack. They reduce query complexity, latency and cost. They also provide users with convenient shortcuts for important aggregated views on metric data. Thanos builds upon vanilla Prometheus instances and, therefore, it is perfectly valid to keep recording and alerting rules in the existing Prometheus server. However, this might not be enough for the following cases:

  • Global alerts and rules (e.g: alert when service is down in more than two of three clusters).
  • Rules that need data beyond a single Prometheus server’s local retention.
  • The desire to store all rules and alerts in a single place.

For all these cases, Thanos includes a separate component called the Ruler, which evaluates rules and alerts against Thanos Queriers. Because the Ruler exposes the well-known Store API, the Querier can access the freshly evaluated metrics; later, they are also backed up to the object store and become accessible via the Store Gateway.

The power of Thanos

Thanos is flexible enough to be set up differently to fit your use cases. It is particularly useful when migrating from plain Prometheus. Let’s quickly recap what we learned about Thanos components through a quick example. Here’s how to migrate your own vanilla Prometheus to reach our shiny ‘unlimited retention metric’ world:

  1. Add Thanos Sidecar to your Prometheus servers – for example, a neighbouring container in the Kubernetes pod.
  2. Deploy a few replica Thanos Queriers to enable data browsing. At this point, it’s easy to set up gossip between your Scrapers and Queriers. Use the `thanos_cluster_members` metric to ensure all the components are connected.
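A quick way to check that metric from a script is to hit a Querier’s Prometheus-compatible HTTP API. The URL and port below are assumptions for your own deployment, and `parse_members` is a hypothetical helper, not part of Thanos:

```python
# Sketch: fetch `thanos_cluster_members` from a Querier's Prometheus-
# compatible HTTP API and report each component's view of the membership.
import json
import urllib.parse
import urllib.request

def parse_members(body):
    # Each result is one component's view of the cluster member count.
    return [(r["metric"].get("instance", "?"), float(r["value"][1]))
            for r in body["data"]["result"]]

def cluster_members(querier_url):
    query = urllib.parse.urlencode({"query": "thanos_cluster_members"})
    with urllib.request.urlopen(f"{querier_url}/api/v1/query?{query}") as resp:
        return parse_members(json.load(resp))

# Example usage against a hypothetical local Querier:
# for instance, members in cluster_members("http://localhost:10902"):
#     print(instance, members)
```

If every component reports the same member count, gossip has converged and the cluster is fully connected.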

Notably, these two steps alone are enough to enable a global view and seamless deduplication of results from potential Prometheus HA replicas! Just connect your dashboards to the Querier’s HTTP endpoint or use the Thanos UI directly.

However, if you want metric data backup and long-term retention, three more steps are needed:

  1. Create an AWS S3 or GCS bucket and configure your Sidecars to back up data there. You can now also reduce local retention to a minimum.
  2. Deploy a Store Gateway and connect it to your existing gossip cluster. With that in place, queries can access the backed-up data as well!
  3. Deploy the Compactor to improve long-term query responsiveness by applying compaction and downsampling.

If you want to learn more, feel free to check out our example Kubernetes manifests and getting started page!

With just five steps, we transformed Prometheus servers into a robust monitoring system that gives you a global view, unlimited retention and potential metric high availability.

Pull request: we need you!

Thanos has been an open-source project from the very beginning. Seamless integration with Prometheus and the ability to adopt only parts of the project make Thanos a perfect choice if you want to scale your monitoring system without superhuman effort.

GitHub Pull Requests and Issues are very welcome. At the same time, do not hesitate to contact us via GitHub issues or the improbable-eng #thanos Slack if you have any questions or feedback, or even if you want to share your use case! And, if you like what we do at Improbable, do not hesitate to reach out to us – we are always hiring!


The Best Training Plans You’ve Never Heard Of


I have no idea what training plan this random Italian guy is on, but it didn’t help him get through the steep section of the classic Rombo di Venti in Finale. Maybe he should try one of the ones listed below…

Everybody is training these days. Pros, recreational climbers, even folks who’ve only been at it for a couple of months think they need to train. And there’s no lack of information out there; in fact, maybe the hardest part of finding a training plan is sorting through all the noise to figure out what actually works.

My advice would be to look at something from Power Company, Steve Bechtel or the Anderson Brothers, but if you have tried all those and want something a little more outside the box, here are some unverified (and mostly untrue) methods that might (or might not) be worth checking out. Some I’ve even tried myself and can vouch for the effectiveness of (or not).

Get Sick & Injured for a Month

January started out strong for me, with a fun week in Red Rocks getting back into the flow after Christmas break. I came home ready to train hard for our trip to Italy, and then bam! I bruised my ribs out skiing and had to take three weeks off. During that time, I also got sick twice. When I was finally able to climb again, I felt like a rocket, lighter than air. I hadn’t done much that was physical in the month that I was healing, but I was so eager to climb again that stoke alone took me up the rock. The takeaway here is to make yourself feel so awful for a month that even just getting back to baseline feels like the best thing in the world!

The Train Only in the Gym and Never Climb Outside Plan

When I first got into “training” I will admit I got sucked down this road. Like with traditional weightlifting, in training you can track and analyze and record everything, so it’s easy to see progress. Way more so than climbing outside on the nebulous “grades” that some of us use to measure our egos, er, success with. Adding more weight on a hangboard, hitting a new PR on the campus board, these are tangible accomplishments, and can be addicting in their own right. Plus you never have to go outside, deal with bad weather, lack of a partner, etc. However, if this doesn’t sound like a good way to measure yourself against other climbers, you can always try what I do and use height. That automatically puts me way ahead of many people, regardless of how much stronger they may be.

The Weird Diet Program

Everybody has a diet these days. Talk about first world problems. I’m vegan, I’m Keto, I’m Paleo, I’m only eating the fat of baby dolphins that were raised in a spherical tank listening to Enya. Imagine trying to tell the 800 million people in the world who are starving every day what your diet is. Anyway, there are people who swear by these things, so find the one that will make you the most hip with your friend group and attribute all of your sending success to it. You can probably even write a magazine article about it; sometimes they have slow months and have to publish something.

Go Somewhere You Suck for a Couple of Weeks Program

This one has actually worked really well for me the last couple of springs. I am crimping impaired, and two years ago I spent a week in St George followed by a week in Red Rocks, and then last year two weeks in Siurana. After both those trips, where I never climbed anything respectable for me grade-wise, I came home and crushed some local hard routes very quickly. Surprisingly quickly, in fact, but after so much effort onsighting in a style that was hard for me, redpointing on familiar terrain was casual by comparison! Changing things up was just the thing I needed to get a new perspective and break through some mental barriers.

Good luck if you embark on one of these. I can only really recommend the last one, but if you try the others let us know how it goes (or not)!


In case you've forgotten: Microsoft is still a vile garbage fire of a company.

E-waste recycler Eric Lundgren loses appeal on computer restore disks, must serve 15-month prison term:

A California man who built a sizable business out of recycling electronic waste is headed to federal prison for 15 months after a federal appeals court in Miami rejected his claim that the "restore disks" he made to extend the lives of computers had no financial value, instead ruling that he had infringed Microsoft's products to the tune of $700,000.

The appeals court upheld a federal district judge's ruling that the disks made by Eric Lundgren to restore Microsoft operating systems had a value of $25 apiece, even though they could be downloaded free and could be used only on computers with a valid Microsoft license. [...]

But he said the court had set a precedent for Microsoft and other software-makers to pursue criminal cases against those seeking to extend the life span of computers. "I got in the way of their agenda," Lundgren said, "this profit model that's way more profitable than I could ever be."

Lundgren said he wasn't sure when he would be surrendering. He said prosecutors in Miami told him he could have a couple of weeks to put his financial affairs in order, including plans for his company of more than 100 employees. "But I was told if I got loud in the media, they'd come pick me up," Lundgren said. "If you want to take my liberty, I'm going to get loud."

"I am going to prison, and I've accepted it," Lundgren said Monday. "What I'm not okay with is people not understanding why I'm going to prison. Hopefully my story can shine some light on the e-waste epidemic we have in the United States, how wasteful we are. At what point do people stand up and say something? I didn't say something, I just did it."

Previously, previously, previously, previously.


Exploring container security: An overview


Containers are increasingly being used to deploy applications, and with good reason, given their portability, simple scalability and lower management burden. However, the security of containerized applications is still not well understood. How does container security differ from that of traditional VMs? How can we use the features of container management platforms to improve security?

This is the first in a series of blog posts that will cover container security on Google Cloud Platform (GCP), and how we help you secure your containers running in Google Kubernetes Engine. The posts in the series will cover the following topics:
  • Container networking security 
  • New security features in Kubernetes Engine 1.10
  • Image security 
  • The container software supply chain 
  • Container runtime security 
  • Multitenancy 
Container security is a huge topic. To kick off the series, here’s an overview of container security and how we think about it at Google.

At Google, we divide container security into three main areas:
  1. Infrastructure security, i.e., does the platform provide the necessary container security features? This is how you use Kubernetes security features to protect your identities, secrets, and network; and how Kubernetes Engine uses native GCP functionality, like IAM, audit logging and networking, to bring the best of Google security to your workloads. 
  2. Software supply chain, i.e., is my container image secure to deploy? This is how you make sure your container images are vulnerability-free, and that the images you built aren't modified before they're deployed. 
  3. Runtime security, i.e., is my container secure to run? This is how you identify a container acting maliciously in production, and take action to protect your workload.
Let’s dive a bit more into each of these.

Infrastructure security

Container infrastructure security is about ensuring that your developers have the tools they need to securely build containerized services. This covers a wide variety of areas, including:
  • Identity, authorization and authentication: How do my users assert their identities in my containers and prove they are who they say they are, and how do I manage these permissions?
    • In Kubernetes, Role-Based Access Control (RBAC) allows the use of fine-grained permissions to control access to resources such as the kubelet. (RBAC is enabled by default since Kubernetes 1.8.)
    • In Kubernetes Engine, you can use IAM permissions to control access to Kubernetes resources at the project level. You can still use RBAC to restrict access to Kubernetes resources within a specific cluster.
  • Logging: How are changes to my containers logged, and can they be audited?
    • In Kubernetes, Audit Logging automatically captures API audit logs. You can configure audit logging based on whether the event is metadata, a request or a request response.
    • Kubernetes Engine integrates with Cloud Audit Logging, and you can view audit logs in Stackdriver Logging or in the GCP Activity console. The most commonly audited operations are logged by default, and you can view and filter these.
  • Secrets: How does Kubernetes store secrets, and how do containerized applications access them?
  • Networking: How should I segment containers in a network, and what traffic flows should I allow?
    • In Kubernetes, you can use network policies to specify how to segment the pod network. When created, network policies define with which pods and endpoints a particular pod can communicate.
    • In Kubernetes Engine, you can create a network policy, currently in beta, and manage these for your entire cluster. You can also create Private Clusters, in beta, to use only private IPs for your master and nodes.
These are just some of the tools that Kubernetes uses to secure your cluster the way you want, making it easier to maintain the security of your cluster.

Software supply chain 

Managing the software supply chain, including container image layers that you didn't create, is about ensuring that you know exactly what’s being deployed in your environment, and that it belongs there. In particular, that means giving your developers access to images and packages that are known to be free of vulnerabilities, to avoid introducing known vulnerabilities into your environment.

A container runs on a server's OS kernel but in a sandboxed environment. A container's image typically includes its own operating system tools and libraries. So when you think about software security, there are in fact many layers of images and packages to secure:
  • The host OS, which is running the container 
  • The container image, and any other dependencies you need to run the container. Note that these are not necessarily images you built yourself—container images included from public repositories like Docker Hub also fall into this category 
  • The application code itself, which runs inside the container. This is outside of the scope of container security, but you should follow best practices and scan your code for known vulnerabilities. Be sure to review your code for security vulnerabilities and consider more advanced techniques such as fuzzing to find vulnerabilities. The OWASP Top Ten web application security risks is a good resource for knowing what to avoid. 

Runtime security 

Lastly, runtime security is about ensuring that your security response team can detect and respond to security threats to containers running in your environment. There are a few desirable capabilities here:
  • Detection of abnormal behaviour from the baseline, leveraging syscalls, network calls and other available information 
  • Remediation of a potential threat, for example, via container isolation on a different network, pausing the container, or restarting it 
  • Forensics to identify the event, based on detailed logs and the containers’ image during the event 
  • Run-time policies and isolation, limiting what kinds of behaviour are allowed in your environment 
All of these capabilities are fairly nascent across the industry, and there are many different ways today to perform runtime security.

A container isn’t a strong security boundary 

There’s one myth worth clearing up: containers do not provide an impermeable security boundary, nor do they aim to. They provide some restrictions on access to shared resources on a host, but they don’t necessarily prevent a malicious attacker from circumventing these restrictions. Although both containers and VMs encapsulate an application, the container is a boundary for the application, but the VM is a boundary for the application and its resources, including resource allocation.

If you're running an untrusted workload on Kubernetes Engine and need a strong security boundary, you should fall back on the isolation provided by the Google Cloud Platform project. For workloads sharing the same level of trust, you may get by with multi-tenancy, where containers run alongside other containers on the same node, or on other nodes in the same cluster.

Upcoming talks at KubeCon EU 

In addition to this blog post series, we’ll be giving several talks on container security at KubeCon Europe in Copenhagen. If you’ll be at the show, make sure to add them to your calendar.
Note that everything discussed above is really just focused at the container level; you still need a secure platform underlying this infrastructure, and you need application security to protect the applications you build in containers. To learn more about Google Cloud’s security, see the Google Infrastructure Security Design Overview whitepaper.

Stay tuned for next week’s installment about image security!


Docker is Dead


To say that Docker had a very rough 2017 is an understatement. Aside from Uber, I can’t think of a more utilized, hyped, and well funded Silicon Valley startup (still in operation) fumbling as badly as Docker did in 2017. People will look back on 2017 as the year Docker, a great piece of software, was completely ruined by bad business practices leading to its end in 2018. This is an outside-facing retrospective on how and where Docker went wrong and how Docker’s efforts to fix it are far too little, way too late.

Subscribe to DevOps’ish for updates on Docker as well as other DevOps, Cloud Native, and Open Source news.

Docker is Good Software

To be clear, Docker has helped revolutionize software development. Taking Linux primitives like cgroups, namespaces, process isolation, etc. and putting them into a single tool is an amazing feat. In 2012, I was trying to figure out how development environments could be more portable. Docker’s rise allowed a development environment to become a simple, version controllable Dockerfile. The tooling went from Packer, Vagrant, VirtualBox, and a ton of infrastructure to Docker. The Docker UI is actually pretty good too! It’s a good tool with many applications. The folks on the Docker team should be very proud of the tooling they built.

Docker is a Silicon Valley Darling

Docker’s early success led to the company building a big community around its product. That early success fueled funding round after funding round. Well known investors like Goldman Sachs, Greylock Partners, Sequoia Capital, and Insight Venture Partners lined up to give truckloads of money to Docker. To date, Docker has raised capital investments totaling somewhere between $242 million and over $250 million.

But, like most well funded, win at all cost start-ups of the 2010s, Docker made some human resources missteps. Docker has protected some crappy people along its rise. This led to my personal dislike of the company’s leadership. The product is still quality but it doesn’t excuse the company’s behavior AT ALL. Sadly, this is the case for a lot of Silicon Valley darlings and it needs to change.

Kubernetes Dealt Damage to Docker

Docker’s doom has been accelerated by the rise of Kubernetes. Docker did itself no favors in its handling of Kubernetes, the open source community’s darling container orchestrator. Docker’s competing product, Docker Swarm, was the only container orchestrator in Docker’s mind. This decision was made despite Kubernetes preferring Docker containers at first. Off the record, Docker Captains confirmed early in 2017 that Kubernetes discussions in articles, at meetups, and at conferences were frowned upon by Docker.

Through dockercon17 in Austin, this Kubernetes-less mantra held. Then, rather abruptly, at dockercon EU 17, Docker decided to go all in on Kubernetes. The sudden change was an obvious admission of Kubernetes’ rise and impending dominance. This was only underscored by the fact that Docker sponsored and had a booth at KubeCon + CloudNativeCon North America 2017.


No one understood what Docker was doing at dockercon17 in April when it announced Moby, described as the new upstream for the Docker project. The rollout of Moby was not announced in advance. It was as if millions of voices suddenly cried out in terror when the drastic shift from Docker to Moby occurred on GitHub while Solomon Hykes was speaking at dockercon17. This drastic and poorly-thought-through change required direct intervention from GitHub staff.

Not only was the change managed poorly, the messaging was given little consideration as well. This led to an apology and, later, hand-drawn explanations of the change. It further muddied the already cloudy container space and Docker (or is it Moby?) ecosystem. The handling of the Moby rollout continues to baffle those working in the industry, and the Docker brand has likely been tarnished because of it.

The Cold Embrace of Kubernetes

Docker’s late and awkward embrace of Kubernetes at the last possible moment is a sign of an impending downfall. When asked if Docker Swarm was dead, Solomon Hykes tweeted, “Docker will continue to support both Kubernetes and Swarm as first-class citizens, and encourage cross-pollination. Openness and choice create a healthier ecosystem for everyone.” The problem here is that Docker Swarm isn’t fully baked, and is quite far from it. The Docker Swarm product team and its handful of open source contributors will not be able to keep up with the Kubernetes community. As good as the Docker UI is, the Kubernetes UI is far superior. It’s almost as if Docker is resigning itself to being a marginal consulting firm in the container space.


The real problem with Docker is a lack of coherent leadership. There appears to have been a strategic focus around a singular person in the organization. This individual has been pushed further and further away from the core of the company but still remains. The company has reorganized and has shifted its focus to the enterprise. This shift makes sense for Docker’s investors (the company does have a fiduciary responsibility after all). But, this shift is going to reduce the brand’s cool factor that fueled its wild success. It is said that, “Great civilizations are not murdered. They commit suicide.” Docker has done just that.

Bonus: Conspiracy Theory

Conspiracy Theory: Docker knows it is over for them. The technical folk decided to roll out Moby drastically and embraced Kubernetes suddenly to make sure their work still lives on. #Docker #DevOps

— Chris Short (@ChrisShort) December 29, 2017

I floated out a theory on Twitter about the awkward moments for Docker in 2017. It is possible Docker knows the end is near for the company itself. As organizational changes have indicated a pending exit (likely through acquisition), the technical core of the company prioritized some changes. Donating containerd to the CNCF, making Moby the upstream of Docker, and embracing Kubernetes will immortalize the good work done by the folks at Docker. This allows a large organization like Oracle or Microsoft to come along and acquire the company without worrying about the technological advances made by Docker employees being locked behind licenses. This provides the best of both worlds for both the software teams and the company itself. Needless to say, 2018 will be an interesting year for Docker.

Subscribe to DevOps’ish for updates on Docker as well as other DevOps, Cloud Native, and Open Source news.
