Category Archives: CloudFlare

How to use Cloudflare for Service Discovery

Cloudflare runs 3,588 containers, making up 1,264 apps and services that all need to be able to find and discover each other in order to communicate — a problem solved with service discovery.

You can use Cloudflare for service discovery. By deploying microservices behind Cloudflare, their origins are masked, protected from DDoS attacks and L7 exploits, and authenticated, with service discovery natively built in. Cloudflare is also cloud platform agnostic: if you have infrastructure distributed across cloud platforms, you still get a holistic view of your services and the ability to manage your security and authentication policies in one place, independent of where services are actually deployed.

How it works

Service locations and metadata are stored in a distributed KV store deployed in all 100+ Cloudflare edge locations (the service registry).

Services register themselves with the service registry when they start up and deregister themselves when they spin down, via calls to Cloudflare’s API. Services provide data in the form of a DNS record, either by giving Cloudflare the address of the service in an A (IPv4) or AAAA (IPv6) record, or by providing more metadata, such as transport protocol and port, in an SRV record.

Services are also automatically registered and deregistered by health check monitors so that only healthy nodes are sent traffic. Health checks run over HTTP and can be set up with custom configuration so that responses to the health check must return a specific response body and/or response code; otherwise the nodes are marked as unhealthy.

Traffic is distributed evenly between redundant nodes using a load balancer. Clients of the service discovery query the load balancer directly over DNS. The load balancer receives data from the service registry and returns the corresponding service address. If services are behind Cloudflare, the load balancer returns a Cloudflare IP address to route traffic to the service through Cloudflare’s L7 proxy.

Traffic can also be sent to specific service nodes based on client geography, so the data replication service in North America, for example, can talk to a specific North American version of the billing service, or European data can stay in Europe.

Clients query the service registry over DNS, with service location and metadata packaged in A, AAAA, CNAME or SRV records. Because Cloudflare works natively over DNS, nothing beyond a DNS client needs to be installed on service nodes: there’s no extra software to install, manage, upgrade or patch.
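
For example, a client can look up a service’s location with nothing more than dig. The record name here is the hypothetical SRV record registered in the API example later in this post:

# Query the service registry over plain DNS for a service's SRV record
dig +short _http._tcp.name.badtortilla.com SRV
# Example answer: "1 1 80 staging.badtortilla.com." (priority weight port target)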

Usually, DNS TTLs mean that if a service changes location or deregisters, clients may still get stale information for a while. Cloudflare DNS keeps TTLs low (its distributed network lets it do this while maintaining fast performance), and if you are using Cloudflare as a proxy, the DNS answers always point back to Cloudflare even when the IPs of the services behind it change, removing the effect of cache staleness.

If your services communicate over HTTP(S) and WebSockets, you can additionally use Cloudflare as an L7 proxy for added security, authentication and optimization. Cloudflare prevents DDoS attacks from hitting your infrastructure, masks your IPs behind its network, and routes traffic through an optimized PoP-to-PoP path to shave latency off of the Internet.

Once service-to-service traffic is going through Cloudflare, you can use TLS client certificates to authenticate traffic between your services. Cloudflare can authenticate traffic at the edge by ensuring that the client certificate presented during the TLS handshake is signed by your root CA.
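
As an illustrative sketch, a calling service would present a client certificate signed by your root CA on each request; the hostname and file paths below are placeholders:

# Call another service through Cloudflare, presenting a client certificate
# signed by your internal root CA (hostname and file paths are placeholders)
curl https://billing.example.com/api/v1/charge \
  --cert /etc/ssl/service-client.pem \
  --key /etc/ssl/service-client-key.pem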

Setting it up

Sign up for a Cloudflare account.

During the signup process, add all your initial services as DNS records in the DNS editor.

To finish signing up, move DNS to Cloudflare by logging into your registrar and changing your nameservers to the Cloudflare nameservers assigned to you when you signed up. If you want traffic to those services to be proxied through Cloudflare, click on the cloud next to each DNS record to make it orange.

Run a script on each node so that:

On startup, the node sends a POST to the DNS record API to register itself and a PUT to the load balancing API to add itself to the origin pool.

On shutdown, the node sends a DELETE to the DNS record API to deregister itself and a PUT to the load balancing API to remove itself from the origin pool.

These can be accomplished via startup and shutdown scripts on Google Compute Engine, or via user data scripts and Auto Scaling lifecycle hooks on AWS. A sketch of such a lifecycle script follows below.
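
Here is a minimal sketch of such a lifecycle wrapper; the helper functions are hypothetical stand-ins for the curl calls shown in the following sections:

#!/bin/sh
# Hypothetical node lifecycle script: register on startup, deregister on shutdown.
# register_node wraps the POST (DNS record) and PUT (origin pool) calls below;
# deregister_node wraps the DELETE and PUT calls below.
register_node
trap 'deregister_node' EXIT INT TERM
/usr/local/bin/run-service   # placeholder for the actual service process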

Registration:

curl -X POST "https://api.cloudflare.com/client/v4/zones/023e105f4ecef8ad9ca31a8372d0c353/dns_records" \
  -H "X-Auth-Email: [email protected]" \
  -H "X-Auth-Key: c2547eb745079dac9320b638f5e225cf483cc5cfdda41" \
  -H "Content-Type: application/json" \
  --data '{"type":"SRV","data":{"service":"_http","proto":"_tcp","name":"name","priority":1,"weight":1,"port":80,"target":"staging.badtortilla.com"},"ttl":1,"zone_name":"badtortilla.com","name":"_http._tcp.name.","content":"SRV 1 1 80 staging.badtortilla.com.","proxied":false,"proxiable":false,"priority":1}'

De-Registration:

curl -X DELETE "https://api.cloudflare.com/client/v4/zones/023e105f4ecef8ad9ca31a8372d0c353/dns_records/372e67954025e0ba6aaa6d586b9e0b59" \
  -H "X-Auth-Email: [email protected]" \
  -H "X-Auth-Key: c2547eb745079dac9320b638f5e225cf483cc5cfdda41" \
  -H "Content-Type: application/json"

Add or remove an origin from an origin pool (this should be a unique IP per node added to the pool):

curl -X PUT "https://api.cloudflare.com/client/v4/user/load_balancers/pools/17b5962d775c646f3f9725cbc7a53df4" \
  -H "X-Auth-Email: [email protected]" \
  -H "X-Auth-Key: c2547eb745079dac9320b638f5e225cf483cc5cfdda41" \
  -H "Content-Type: application/json" \
  --data '{"description":"Primary data center - Provider XYZ","name":"primary-dc-1","enabled":true,"monitor":"f1aba936b94213e5b8dca0c0dbf1f9cc","origins":[{"name":"app-server-1","address":"1.2.3.4","enabled":true}],"notification_email":"[email protected]"}'

Create a health check. You can do this in the API or in the Cloudflare dashboard (in the Load Balancer card).

curl -X POST "https://api.cloudflare.com/client/v4/organizations/01a7362d577a6c3019a474fd6f485823/load_balancers/monitors" \
  -H "X-Auth-Email: [email protected]" \
  -H "X-Auth-Key: c2547eb745079dac9320b638f5e225cf483cc5cfdda41" \
  -H "Content-Type: application/json" \
  --data '{"type":"https","description":"Login page monitor","method":"GET","path":"/health","header":{"Host":["example.com"],"X-App-ID":["abc123"]},"timeout":3,"retries":0,"interval":90,"expected_body":"alive","expected_codes":"2xx"}'

Create an initial load balancer, either through the API or in the Cloudflare dashboard.

curl -X POST "https://api.cloudflare.com/client/v4/zones/699d98642c564d2e855e9661899b7252/load_balancers" \
  -H "X-Auth-Email: [email protected]" \
  -H "X-Auth-Key: c2547eb745079dac9320b638f5e225cf483cc5cfdda41" \
  -H "Content-Type: application/json" \
  --data '{"description":"Load Balancer for www.example.com","name":"www.example.com","ttl":30,"fallback_pool":"17b5962d775c646f3f9725cbc7a53df4","default_pools":["de90f38ced07c2e2f4df50b1f61d4194","9290f38c5d07c2e2f4df57b1f61d4196","00920f38ce07c2e2f4df50b1f61d4194"],"region_pools":{"WNAM":["de90f38ced07c2e2f4df50b1f61d4194","9290f38c5d07c2e2f4df57b1f61d4196"],"ENAM":["00920f38ce07c2e2f4df50b1f61d4194"]},"pop_pools":{"LAX":["de90f38ced07c2e2f4df50b1f61d4194","9290f38c5d07c2e2f4df57b1f61d4196"],"LHR":["abd90f38ced07c2e2f4df50b1f61d4194","f9138c5d07c2e2f4df57b1f61d4196"],"SJC":["00920f38ce07c2e2f4df50b1f61d4194"]},"proxied":true}'

(optional) Set up geographic routing rules. You can do this via the API or in the Cloudflare dashboard.

curl -X POST "https://api.cloudflare.com/client/v4/zones/699d98642c564d2e855e9661899b7252/load_balancers" \
  -H "X-Auth-Email: [email protected]" \
  -H "X-Auth-Key: c2547eb745079dac9320b638f5e225cf483cc5cfdda41" \
  -H "Content-Type: application/json" \
  --data '{"description":"Load Balancer for www.example.com","name":"www.example.com","ttl":30,"fallback_pool":"17b5962d775c646f3f9725cbc7a53df4","default_pools":["de90f38ced07c2e2f4df50b1f61d4194","9290f38c5d07c2e2f4df57b1f61d4196","00920f38ce07c2e2f4df50b1f61d4194"],"region_pools":{"WNAM":["de90f38ced07c2e2f4df50b1f61d4194","9290f38c5d07c2e2f4df57b1f61d4196"],"ENAM":["00920f38ce07c2e2f4df50b1f61d4194"]},"pop_pools":{"LAX":["de90f38ced07c2e2f4df50b1f61d4194","9290f38c5d07c2e2f4df57b1f61d4196"],"LHR":["abd90f38ced07c2e2f4df50b1f61d4194","f9138c5d07c2e2f4df57b1f61d4196"],"SJC":["00920f38ce07c2e2f4df50b1f61d4194"]},"proxied":true}'

(optional) Set up Argo for faster PoP-to-PoP transit in the Traffic app of the Cloudflare dashboard.

(optional) Set up rate limiting via the API or in the dashboard.

curl -X POST "https://api.cloudflare.com/client/v4/zones/023e105f4ecef8ad9ca31a8372d0c353/rate_limits" \
  -H "X-Auth-Email: [email protected]" \
  -H "X-Auth-Key: c2547eb745079dac9320b638f5e225cf483cc5cfdda41" \
  -H "Content-Type: application/json" \
  --data '{"id":"372e67954025e0ba6aaa6d586b9e0b59","disabled":false,"description":"Prevent multiple login failures to mitigate brute force attacks","match":{"request":{"methods":["GET","POST"],"schemes":["HTTP","HTTPS"],"url":"*.example.org/path*"},"response":{"status":[401,403],"origin_traffic":true}},"bypass":[{"name":"url","value":"api.example.com/*"}],"threshold":60,"period":900,"action":{"mode":"simulate","timeout":86400,"response":{"content_type":"text/xml","body":"<error>This request has been rate-limited.</error>"}}}'

(optional, Enterprise only) Set up TLS client authentication. Send your account manager your root CA certificate and the options you would like enabled.

Aquele Abraço Rio de Janeiro: Cloudflare's 116th Data Center!

Cloudflare is excited to announce our newest data center in Rio de Janeiro, Brazil. This is our eighth data center in South America, and expands the Cloudflare network to 116 cities across 57 countries. Our newest deployment will improve the performance and security of over six million Internet applications across Brazil, while providing redundancy to our existing São Paulo data center. As additional ISPs peer with us at the local internet exchange (IX.br), we’ll be able to provide even closer coverage to a growing share of Brazilian Internet users.


History

Rio de Janeiro played a central role in the history of the Internet in Brazil. In 1988, the National Laboratory of Scientific Computation (LNCC), headquartered in Rio de Janeiro, connected to the University of Maryland via BITNET, a network for exchanging messages between academic institutions. The next year, the Federal University of Rio de Janeiro also connected to BITNET, becoming the third Brazilian institution (along with the São Paulo State Foundation for Research Support, FAPESP) to have access to this technology.


CC BY-NC 2.0 image by Lau Rey

Today, the city of Rio de Janeiro is very well connected. Internet access can be found all over, and better connectivity can boost entrepreneurship. In some favelas, residents have created their own Wi-Fi ISPs, providing Internet access to users that the big ISPs are not able to reach.


LatAm expansion

We have an additional eight data centers in progress across Latin America. If managing the many moving parts of building a large global network interests you, come join our team!


-The Cloudflare team

Ninth Circuit Rules on National Security Letter Gag Orders

As we’ve previously discussed on this blog, Cloudflare has been challenging for years the constitutionality of the FBI’s use of national security letters (NSLs) to demand user data on a confidential basis. On Monday morning, a three-judge panel of the U.S. Ninth Circuit Court of Appeals released the latest decision in our lawsuit, and endorsed the use of gag orders that severely restrict a company’s disclosures related to NSLs.

CC-BY 2.0 image by a200/a77Wells

This is the latest chapter in a court proceeding that dates back to 2013, when Cloudflare initiated a challenge to the previous form of the NSL statute with the help of our friends at EFF. Our efforts regarding NSLs have already seen considerable success. After a federal district court agreed with some of our arguments, Congress passed a new law that addressed transparency, the USA FREEDOM Act. Under the new law, companies were finally permitted to disclose the number of NSLs they receive in aggregate bands of 250. But there were still other concerns about judicial review or limitation of gag orders that remained.

Today’s outcome is disappointing for Cloudflare. NSLs are “administrative subpoenas” that fall short of a warrant, and are frequently accompanied by nondisclosure requirements that restrict even bare disclosures regarding the receipt of such letters. Such gag orders hamper transparency efforts, and limit companies’ ability to participate in the political process around surveillance reform.

What did the Court say?

In its ruling, the Ninth Circuit upheld NSL gag orders, finding that the current system does not run afoul of the First Amendment. Currently, the laws governing the issuance of NSLs permit a nondisclosure requirement so long as the requesting official certifies that the absence of a prohibition “may result” in certain types of harm. However, there is no judicial scrutiny of these claims before the gag order goes into full effect. Only once the restriction has already been imposed can a company seek judicial review before a court. Furthermore, the FBI must only reassess the gag order three years into an investigation or when the investigation closes.

Along with our co-petitioner, CREDO Mobile, Cloudflare challenged the NSL gag orders as a “prior restraint” on free speech. In First Amendment law, prior restraints are judicial orders or administrative rules that function to suppress speech before it ever takes place. There is a heavy presumption against the constitutionality of prior restraints, but they can be justified in narrowly defined circumstances or if the restraint follows certain procedural safeguards. In the context of NSLs, we considered those safeguards to be lacking.

The Appeals Court disagreed: in its ruling, the Court determined that NSL gag orders were indeed a prior restraint subject to “strict” constitutional scrutiny, but that such orders were “narrowly tailored to a compelling state interest” and provided enough procedural safeguards to pass constitutional muster.

What’s Next?

While we are still reviewing the specifics of the court’s decision, Cloudflare will continue to report on NSLs to the extent permitted by law. We will also continue to work with EFF as we weigh how to proceed: the next steps may be to request an en banc appeal before all the members of the Ninth Circuit, or to petition the U.S. Supreme Court to take up the case.

Cloudflare’s approach to law enforcement requests will continue to be that while we are supportive of their work, any requests we receive must adhere to due process, and be subject to judicial oversight. When we first decided to challenge the FBI’s request for customer information through a confidential NSL, we were a much smaller company. It was not an easy decision, but we decided to contest a gag order that we felt was overbroad and in violation of our principles. We are grateful to our friends at EFF for taking our case, and applaud the excellent job they have done pushing this effort.

Getting started with Cloudflare Apps

We recently launched our new Cloudflare Apps platform, and we love to see the community it is building. To help people who run web services such as websites and APIs, we would like to make those services faster, safer and more reliable with our new Apps platform, leveraging our 115 points of presence around the world. (Skip ahead to the fun part if you already know how Cloudflare Apps works.)

How Cloudflare apps work

Here is a quick diagram of how Cloudflare apps work:

The “Origin” is the server that is providing your services, such as your website or API. The “Edge” represents a point of presence that is closest to your visitors. Cloudflare uses a routing method known as Anycast to ensure the end user, pictured on the far right, is routed through the best network path to our points of presence closest to them around the world.

Historically, to make changes or additions to a site at the edge, you needed to be a Cloudflare employee. Now with apps, anyone can quickly make changes to the pages rendered to their users via JavaScript and CSS. Today, you can do amazing things like add a donation button using PayPal, or inject a video intelligently, using CSS to position the objects wherever you like.

Awesome apps that you can already turn on today

A great way to explore existing apps is to browse the Cloudflare Apps store.

You can also reach it from your Cloudflare dashboard by opening the apps section, via the button at the far right-hand corner of the dashboard.

Creating an app (AKA the fun part)

Cloudflare has a simple example app that is easy to use. Feel free to fork our app to have fun with it. You can find it on GitHub here.

We recommend managing your app with a JavaScript package manager such as npm or yarn; their script runners make it easy to hook in linting and other error checking that helps ensure your code will scale. We will be using npm in our examples.

To start, from the directory containing the app’s install.json, install the project’s dependencies:

npm install

It’s also best practice to double-check the JavaScript to ensure there are no errors in the source:

npm run lint
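
The lint step just runs whatever script the project defines. A minimal package.json entry for it might look like the following (ESLint here is an assumption; use whatever linter the example app ships with):

{
  "scripts": {
    "lint": "eslint source/"
  },
  "devDependencies": {
    "eslint": "^4.3.0"
  }
}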

From here, your app’s files can be found in the source directory:

source/app.js

This is where the magic happens. Your app starts here.

source/app.css

Styles for your app.

media/**

A directory for icons, tile images, and screenshots.

The easiest way to test your app is to use our app creation dashboard.

From there, it’s as simple as pointing the app creator at your app’s folder and testing the app. You can modify the source/app.js file to change the JavaScript that gets injected, and source/app.css to control where those changes are applied. Once you’re happy with your app, simply click “create app” at the bottom left of the page and it will be reviewed based on the code created for your page.

Would you like to get community feedback for your app before submitting it for moderation? Share your work or work-in-progress with the Cloudflare Apps part of the community. We can’t wait to see what you build.

Cloudflare is very excited about the Apps platform, not only because it enables our users to do powerful new things with their Internet properties, but also because it gives our users the chance to create an app that will be available to more than 7 million websites around the world.

If you have any questions, feel free to join our new Cloudflare Community today to join in on the fun!

High-reliability OCSP stapling and why it matters

At Cloudflare our focus is making the internet faster and more secure. Today we are announcing a new enhancement to our HTTPS service: High-Reliability OCSP stapling. This feature is a step towards enabling an important security feature on the web: certificate revocation checking. Reliable OCSP stapling also improves connection times by up to 30% in some cases. In this post, we’ll explore the importance of certificate revocation checking in HTTPS, the challenges involved in making it reliable, and how we built a robust OCSP stapling service.

Why revocation is hard

Digital certificates are the cornerstone of trust on the web. A digital certificate is like an identification card for a website. It contains identity information including the website’s hostname along with a cryptographic public key. In public key cryptography, each public key has an associated private key. This private key is kept secret by the site owner. For a browser to trust an HTTPS site, the site’s server must provide a certificate that is valid for the site’s hostname and a proof of control of the certificate’s private key. If someone gets access to a certificate’s private key, they can impersonate the site. Private key compromise is a serious risk to trust on the web.

Certificate revocation is a way to mitigate the risk of key compromise. A website owner can revoke a compromised certificate by informing the certificate issuer that it should no longer be trusted. For example, back in 2014, Cloudflare revoked all managed certificates after it was shown that the Heartbleed vulnerability could be used to steal private keys. There are other reasons to revoke, but key compromise is the most common.

Certificate revocation has a spotty history. Most of the revocation checking mechanisms implemented today don’t protect site owners from key compromise. If you already know why revocation checking is broken, feel free to skip ahead to the OCSP stapling section below.

Revocation checking: a history of failure

There are several ways a web browser can check whether a site’s certificate is revoked or not. The most well-known mechanisms are Certificate Revocation Lists (CRL) and Online Certificate Status Protocol (OCSP). A CRL is a signed list of serial numbers of certificates revoked by a CA. OCSP is a protocol that can be used to query a CA about the revocation status of a given certificate. An OCSP response contains signed assertions that a certificate is not revoked.

Certificates that support OCSP contain the responder’s URL, and those that support CRLs contain a URL where the CRL can be obtained. When a browser is served a certificate as part of an HTTPS connection, it can use the embedded URL to download a CRL or an OCSP response and check that the certificate hasn’t been revoked before rendering the web page. The question then becomes: what should the browser do if the request for a CRL or OCSP response fails? As it turns out, both answers to that question are problematic.
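
You can reproduce what the browser does by hand with OpenSSL. This is a sketch: site.pem and issuer.pem stand in for a site’s certificate and its issuer’s certificate, and the responder URL is a placeholder:

# Print the OCSP responder URL embedded in the certificate
openssl x509 -in site.pem -noout -ocsp_uri

# Query the responder for the certificate's revocation status
openssl ocsp -issuer issuer.pem -cert site.pem \
  -url http://ocsp.example-ca.com -resp_text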

Hard-fail doesn’t work

When browsers encounter a web page and there’s a problem fetching revocation information, the safe option is to block the page and show a security warning. This is called a hard-fail strategy. This strategy is conservative from a security standpoint, but prone to false positives. For example, if the proof of non-revocation could not be obtained for a valid certificate, a hard-fail strategy will show a security warning. Showing a security warning when no security issue exists is dangerous because it can lead to warning fatigue and teach users to click through security warnings, which is a bad idea.

In the real world, false positives are unavoidable. OCSP and CRL endpoints are subject to service outages and network errors. There are also common situations where these endpoints are completely inaccessible to the browser, such as when the browser is behind a captive portal. For some access points used in hotels and airplanes, unencrypted traffic (like OCSP requests) is blocked. A hard-fail strategy forces users behind captive portals and other networks that block OCSP requests to click through unnecessary security warnings. This reality is unpalatable to browser vendors.

Another drawback to a hard-fail strategy is that it puts an increased burden on certificate authorities to keep OCSP and CRL endpoints available and online. A broken OCSP or CRL server becomes a central point of failure for all certificates issued by a certificate authority. If browsers followed a hard-fail strategy, an OCSP outage would be an Internet outage. Certificate authorities are organizations optimized to provide trust and accountability, not necessarily resilient infrastructure. In a hard-fail world, the availability of the web as a whole would be limited by the ability for CAs to keep their OCSP services online at all times; a dangerous systemic risk to the internet as a whole.

Soft-fail: it’s not much better

To avoid the downsides of a hard-fail strategy, most browsers take another approach to certificate revocation checking. Upon seeing a new certificate, the browser will attempt to fetch the revocation information from the CRL or OCSP endpoint embedded in the certificate. If the revocation information is available, they rely on it, and otherwise they assume the certificate is not revoked and display the page without any errors. This is called a “soft-fail” strategy.

The soft-fail strategy has a critical security flaw. An attacker with network position can block the OCSP request. If this attacker also has the private key of a revoked certificate, they can intercept the outgoing connection for the site and present the revoked certificate to the browser. Since the browser doesn’t know the certificate is revoked and is following a soft-fail strategy, the page will load without alerting the user. As Adam Langley described: “soft-fail revocation checks are like a seat-belt that snaps when you crash. Even though it works 99% of the time, it’s worthless because it only works when you don’t need it.”

A soft-fail strategy also makes connections slower. If revocation information for a certificate is not already cached, the browser will block the rendering of the page until the revocation information is retrieved, or a timeout occurs. This additional step causes a noticeable and unwelcome delay, with marginal security benefits. This tradeoff is a hard sell for the performance-obsessed web. Because of the limited benefit, some browsers have eliminated live revocation checking for at least some subset of certificates.

Live OCSP checking has an additional downside: it leaks private browsing information. OCSP requests are sent over unencrypted HTTP and are tied to a specific certificate. Sending an OCSP request tells the certificate authority which websites you are visiting. Furthermore, everyone on the network path between your browser and the OCSP server will also know which sites you are browsing.

Alternative revocation checking

Some clients still perform soft-fail OCSP checking, but it’s becoming less common due to the performance and privacy downsides described above. To protect high-value certificates, some browsers have explored alternative mechanisms for revocation checking.

One technique is to pre-package a list of revoked certificates and distribute it through browser updates. Because the list of all revoked certificates is so large, only a few high-impact certificates are included. This technique is called OneCRL by Firefox and CRLSets by Chrome. It has been effective for some high-profile revocations, but it is by no means a complete solution. Not only does it fail to cover all certificates, it also leaves a window of vulnerability between the time a certificate is revoked and the time the updated list reaches browsers.

OCSP Stapling

OCSP stapling is a technique to get revocation information to browsers that fixes some of the performance and privacy issues associated with live OCSP fetching. In OCSP stapling, the server includes (or “staples”) a current OCSP response for the certificate into the initial HTTPS connection. That removes the need for the browser to request the OCSP response itself. OCSP stapling is widely supported by modern browsers.
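
To see stapling in action, you can ask a server for a stapled response during the handshake with OpenSSL’s s_client; the hostname below is just an example:

# The -status flag requests a stapled OCSP response during the TLS handshake
echo | openssl s_client -connect www.cloudflare.com:443 \
  -servername www.cloudflare.com -status 2>/dev/null \
  | grep 'OCSP Response Status'
# A server that staples prints: OCSP Response Status: successful (0x0)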

Not all servers support OCSP stapling, so browsers still take a soft-fail approach when an OCSP response is not stapled. Some browsers (such as Safari, Edge and, for now, Firefox) check certificate revocation for certificates, so OCSP stapling can provide a performance boost of up to 30%. For browsers like Chrome that don’t check revocation for all certificates, OCSP stapling provides a proof of non-revocation that they would not have otherwise.

High-reliability OCSP stapling

Cloudflare started offering OCSP stapling in 2012. Cloudflare’s original implementation relied on code from nginx that was able to provide OCSP stapling for some, but not all, connections. As Cloudflare’s network grew, the implementation wasn’t able to scale with it, resulting in a drop in the percentage of connections with OCSP responses stapled. The architecture we had chosen had served us well, but we could definitely do better.

In the last year we redesigned our OCSP stapling infrastructure to make it much more robust and reliable. We’re happy to announce that we now provide reliable OCSP stapling for connections to Cloudflare. As long as the certificate authority has set up OCSP for a certificate, Cloudflare will serve a valid stapled OCSP response. All Cloudflare customers now benefit from much more reliable OCSP stapling.

OCSP stapling past

In Cloudflare’s original implementation of OCSP stapling, OCSP responses were fetched opportunistically. Given a connection that required a certificate, Cloudflare would check to see if there was a fresh OCSP response to staple. If there was, it would be included in the connection. If not, then the client would not be sent an OCSP response, and Cloudflare would send a request to refresh the OCSP response in the cache in preparation for the next request.

If a fresh OCSP response wasn’t cached, the connection wouldn’t get an OCSP staple. The next connection for that same certificate would get an OCSP staple, because the cache would by then have been populated.

This architecture was elegant, but not robust. First, there are several situations in which the client is guaranteed to not get an OCSP response. For example, the first request in every cache region and the first request after an OCSP response expires are guaranteed to not have an OCSP response stapled. With Cloudflare’s expansion to more locations, these failures were more common. Less popular sites would have their OCSP responses fetched less often resulting in an even lower ratio of stapled connections. Another reason that connections could be missing OCSP responses is if the OCSP request from Cloudflare to fill the cache failed. There was a lot of room for improvement.

Our solution: OCSP pre-fetching

In order to reliably include OCSP staples in all connections, we decided to change the model. Instead of fetching the OCSP response when a request came in, we would fetch it in a centralized location and distribute valid responses to all our servers. When a response started getting close to expiration, we’d fetch a new one. If the OCSP request fails, we put it into a queue to re-fetch at a later time. Since most OCSP staples are valid for around 7 days, there is a lot of flexibility in terms of refreshing expiring responses.

To keep our cache of OCSP responses fresh, we created an OCSP fetching service. This service ensures that there is a valid OCSP response for every certificate managed by Cloudflare. We constantly crawl our cache of OCSP responses and refresh those that are close to expiring. We also make sure to never cache invalid OCSP responses, as this can have bad consequences. This system has been running for several months now, and we are now reliably including OCSP staples for almost every HTTPS request.

Reliable stapling improves performance for browsers that would have otherwise fetched OCSP, but it also changes the optimal failure strategy for browsers. If a browser can reliably get an OCSP staple for a certificate, why not switch back from a soft-fail to a hard-fail strategy?

OCSP must-staple

As described above, the soft-fail strategy for validating OCSP responses opens up a security hole. An attacker with a revoked certificate can simply neglect to provide an OCSP response when a browser connects to it and the browser will accept their revoked certificate.

In the live OCSP fetching case, a soft-fail approach makes sense. There are many reasons the browser might not be able to obtain an OCSP response: captive portals, broken OCSP servers, network unreliability and more. However, as we have shown with our high-reliability OCSP fetching service, it is possible for a server to fetch OCSP responses without any of these problems. OCSP responses are re-usable and are valid for several days. When one is close to expiring, the server can fetch a new one out-of-band and reliably serve OCSP staples for all connections.

Public Domain

If the client knows that a server will always serve OCSP staples for every connection, it can apply a hard-fail approach, failing a connection if the OCSP response is missing. This closes the security hole introduced by the soft-fail strategy. This is where OCSP must-staple fits in.

OCSP must-staple is an extension that can be added to a certificate that tells the browser to expect an OCSP staple whenever it sees the certificate. This acts as an explicit signal to the browser that it’s safe to use the more secure hard-fail strategy.

Firefox enforces OCSP must-staple, returning the following error if such a certificate is presented without a stapled OCSP response.

Chrome provides the ability to mark a domain as “Expect-Staple”. If Chrome sees a certificate for the domain without a staple, it will send a report to a pre-configured report endpoint.

Reliability

As a part of our push to provide reliable OCSP stapling, we put our money where our mouths are and put an OCSP must-staple certificate on blog.cloudflare.com. Now if we ever fail to serve an OCSP staple, this page will fail to load on browsers like Firefox that enforce must-staple. You can identify a must-staple certificate by looking for the “1.3.6.1.5.5.7.1.24” OID (the TLS Feature extension) in the certificate details.
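
One way to check for the extension from the command line (a sketch; newer OpenSSL versions render this OID as a “TLS Feature” extension, so we match both forms):

# Fetch the certificate and look for the must-staple (TLS Feature) extension
echo | openssl s_client -connect blog.cloudflare.com:443 \
  -servername blog.cloudflare.com 2>/dev/null \
  | openssl x509 -noout -text \
  | grep -E 'TLS Feature|1\.3\.6\.1\.5\.5\.7\.1\.24'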

Cloudflare customers can choose to upload must-staple custom certificates, but we encourage them not to do so yet because there may be a multi-second delay between the certificate being installed and our ability to populate the OCSP response cache. This will be fixed in the coming months. Other than the first few seconds after uploading the certificate, Cloudflare’s new OCSP fetching is robust enough to offer OCSP staples for every connection thereafter.

As of today, an attacker with access to the private key for a revoked certificate can still hijack the connection. All they need to do is to place themselves on the network path of the connection and block the OCSP request. OCSP must-staple prevents that, since an attacker will not be able to obtain an OCSP response that says the certificate has not been revoked.

The weird world of OCSP responders

For browsers, an OCSP failure is not the end of the world. Most browsers are configured to soft-fail when an OCSP responder returns an error, so users are unaffected by OCSP server failures. Some Certificate Authorities have had massive multi-day outages in their OCSP servers without affecting the availability of sites that use their certificates.

There’s no strong feedback mechanism for broken or slow OCSP servers. This lack of feedback has led to an ecosystem of faulty or unreliable OCSP servers. We experienced this first-hand while developing high-reliability OCSP stapling. In this section, we’ll outline half a dozen unexpected behaviors we found when deploying high-reliability OCSP stapling. A big thanks goes out to all the CAs who fixed the issues we pointed out. CA names redacted to preserve their identities.

CA #1: Non-overlapping periods

We noticed CA #1 certificates frequently missing their refresh-deadline, and during debugging we were lucky enough to see this:

$ date -u
Sat Mar 4 02:45:35 UTC 2017
$ ocspfetch <redacted, customer ID>
This Update: 2017-03-04 01:45:49 +0000 UTC
Next Update: 2017-03-04 02:45:49 +0000 UTC
$ date -u
Sat Mar 4 02:45:48 UTC 2017
$ ocspfetch <redacted, customer ID>
This Update: 2017-03-04 02:45:49 +0000 UTC
Next Update: 2017-03-04 03:45:49 +0000 UTC

It shows that CA #1 had configured their OCSP responders to use an incredibly short validity period with almost no overlap between validity periods, which makes it functionally impossible to always have a fresh OCSP response for their certificates. We contacted them, and they reconfigured the responder to produce new responses every half-interval.

CA #2: Wrong signature algorithm

Several certificates from CA #2 started failing with this error:

bad OCSP signature: crypto/rsa: verification error

The issue was that the OCSP response claimed to be signed with SHA256-RSA when it was actually signed with SHA1-RSA (or the reverse: it indicated SHA1-RSA but was actually signed with SHA256-RSA).

CA #3: Malformed OCSP responses

When we first started the project, our tool was unable to parse dozens of certificates in our database because of this error:

asn1: structure error: integer not minimally-encoded

and many OCSP responses fetched from the same CA failed with:

parsing ocsp response: bad OCSP signature: asn1: structure error: integer not minimally-encoded

It turned out this CA had begun issuing a small share (<1%) of certificates with a minor formatting error that rendered them unparseable by Golang’s x509 package. After we contacted them directly, they quickly fixed the issue, but we still had to patch Golang’s parser to be more lenient about encoding bugs in responses that had already been issued.

CA #4: Failed responder behind a load balancer

A small number of CA #4’s OCSP responders fell into a “bad state” without the CA knowing, and would return 404 on every request. Since the CA used load balancing to round-robin requests across a number of responders, it looked like 1 in 6 requests would fail inexplicably.

Twice between January 2017 and May 2017, this CA also experienced some kind of large data-loss event that caused them to return persistent “Try Later” responses for a large number of requests.

CA #5: Delay between certificate and OCSP responder

When a certificate is issued by CA #5, there is a large delay between the time the certificate is issued and the time its OCSP responder is able to start returning signed responses. This has mostly been resolved, but the general pattern is dangerous for OCSP must-staple. There have been some recent changes discussed in the CA/B Forum, an organization that regulates the issuance and management of certificates, to require CAs to offer OCSP soon after issuance.

CA #6: Extra certificates

It is typical to embed only one certificate in an OCSP response, if any. The one embedded certificate is supposed to be a leaf specially issued for signing an intermediate’s OCSP responses. However, several CAs embed multiple certificates: the leaf they use for signing OCSP responses, the intermediate itself, and sometimes all the intermediates up to the root certificate.

Conclusions

We made OCSP stapling better and more reliable for Cloudflare customers. Despite the various strange behaviors we found in OCSP servers, we’ve been able to consistently serve OCSP responses for over 99.9% of connections since we’ve moved over to the new system. This is an important step in protecting the web community from attackers who have compromised certificate private keys.

Participate in the Net Neutrality Day of Action

We at Cloudflare strongly believe in network neutrality, the principle that networks should not discriminate against content that passes through them.  We’ve previously posted on our views on net neutrality and the role of the FCC here and here.

In May, the FCC took a first step toward revoking bright-line rules it put in place in 2015 to require ISPs to treat all web content equally. The FCC is seeking public comment on its proposal to eliminate the legal underpinning of the 2015 rules, revoking the FCC’s authority to implement and enforce net neutrality protections. Public comments are also requested on whether any rules are needed to prevent ISPs from blocking or throttling web traffic, or creating “fast lanes” for some internet traffic.

To raise awareness about the FCC’s efforts, July 12th will be “Internet-Wide Day of Action to save Net Neutrality.” Led by the group Battle for the Net, participating websites will show the world what the web would look like without net neutrality by displaying an alert on their homepage. Website users will be encouraged to contact Congress and the FCC in support of net neutrality.

We wanted to make sure our users had an opportunity to participate in this protest. If you install the Battle For The Net App, your visitors will see one of four alert modals — like the “spinning wheel of death” — and have an opportunity to submit a comment to the FCC or a letter to Congress in support of net neutrality. You can preview the app live on your site, even if you don’t use Cloudflare yet.

To participate, install the Battle For The Net App. The app will appear for your site’s visitors on July 12th, the Day of Action for Net Neutrality.

How to make your site HTTPS-only

The Internet is getting more secure every day as people enable HTTPS, the secure version of HTTP, on their sites and services. Last year, Mozilla reported that the percentage of requests made by Firefox using encrypted HTTPS passed 50% for the first time. HTTPS has numerous benefits that are not available over unencrypted HTTP, including improved performance with HTTP/2, SEO benefits for search engines like Google and the reassuring lock icon in the address bar.

So how do you add HTTPS to your site or service? That’s simple: Cloudflare offers free and automatic HTTPS support for all customers with no configuration. Sign up for any plan and Cloudflare will issue an SSL certificate for you and serve your site over HTTPS.

HTTPS-only

Enabling HTTPS does not mean that all visitors are protected. If a visitor types your website’s name into the address bar of a browser or follows an HTTP link, they will land on the insecure HTTP version of your website. To make your site HTTPS-only, you need to redirect visitors from the HTTP version to the HTTPS version of your site.

Going HTTPS-only should be as easy as a click of a button, so we literally added one to the Cloudflare dashboard. Enable the “Always Use HTTPS” feature and all visitors of the HTTP version of your website will be redirected to the HTTPS version. You’ll find this option just above the HTTP Strict Transport Security setting and it is of course also available through our API.
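
A sketch of that API call, assuming the standard zone settings endpoint; the zone ID and credentials below are the same placeholders used elsewhere in Cloudflare’s API examples:

curl -X PATCH "https://api.cloudflare.com/client/v4/zones/023e105f4ecef8ad9ca31a8372d0c353/settings/always_use_https" \
  -H "X-Auth-Email: [email protected]" \
  -H "X-Auth-Key: c2547eb745079dac9320b638f5e225cf483cc5cfdda41" \
  -H "Content-Type: application/json" \
  --data '{"value":"on"}'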

If you would like to redirect only a subset of your requests, you can do this by creating a Page Rule: simply apply the “Always Use HTTPS” setting to any URL pattern.

Securing your site: next steps

Once you have confirmed that your site is fully functional with HTTPS-only enabled, you can take it a step further and enable HTTP Strict Transport Security (HSTS). HSTS is a header that tells browsers that your site is available over HTTPS and will be for a set period of time. Once a browser sees an HSTS header for a site, it will automatically fetch the HTTPS version of HTTP pages without needing to follow redirects. HSTS can be enabled in the crypto app right under the Always Use HTTPS toggle.
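
Once enabled, you can confirm the header is being served from the command line; the max-age shown here is only an example value:

$ curl -sI https://www.example.com/ | grep -i strict-transport-security
strict-transport-security: max-age=31536000; includeSubDomains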

It’s also important to secure the connection between Cloudflare and your site. To do that, you can use Cloudflare’s Origin CA to get a free certificate for your origin server. Once your origin server is set up with HTTPS and a valid certificate, change your SSL mode to Full (strict) to get the highest level of security.

Three little tools: mmsum, mmwatch, mmhistogram

In a recent blog post, my colleague Marek talked about some SSDP-based DDoS activity we’d been seeing recently. In that blog post he used a tool called mmhistogram to output an ASCII histogram.

That tool is part of a small suite of command-line tools that can be handy when messing with data. Since a reader asked for them to be open sourced… here they are.

mmhistogram

Suppose you have the following CSV of the ages of major Star Wars characters at the time of Episode IV:

Anakin Skywalker (Darth Vader),42
Boba Fett,32
C-3PO,32
Chewbacca,200
Count Dooku,102
Darth Maul,54
Han Solo,29
Jabba the Hutt,600
Jango Fett,66
Jar Jar Binks,52
Lando Calrissian,31
Leia Organa (Princess Leia),19
Luke Skywalker,19
Mace Windu,72
Obi-Wan Kenobi,57
Palpatine,82
Qui-Gon Jinn,92
R2-D2,32
Shmi Skywalker,72
Wedge Antilles,21
Yoda,896

You can get an ASCII histogram of the ages as follows using the mmhistogram tool.

$ cut -d, -f2 epiv | mmhistogram -t "Age"
Age min:19.00 avg:123.90 med=54.00 max:896.00 dev:211.28 count:21
Age:
 value |-------------------------------------------------- count
     0 |                                                   0
     1 |                                                   0
     2 |                                                   0
     4 |                                                   0
     8 |                                                   0
    16 |************************************************** 8
    32 |*************************                          4
    64 |*************************************              6
   128 |******                                             1
   256 |                                                   0
   512 |************                                       2

Handy for getting a quick sense of the data. (These charts are inspired by the ASCII output from systemtap).

mmwatch

The mmwatch tool is handy if you want to look at output from a command-line tool that provides some snapshot of values, but need to have a rate.

For example, here’s df -H on my machine:

$ df -H
Filesystem      Size  Used  Avail  Capacity     iused    ifree  %iused  Mounted on
/dev/disk1      250G  222G    28G       89%  54231161  6750085     89%  /
devfs           384k  384k     0B      100%      1298        0    100%  /dev
map -hosts        0B    0B     0B      100%         0        0    100%  /net
map auto_home     0B    0B     0B      100%         0        0    100%  /home
/dev/disk4      7.3G   50M   7.2G        1%     12105  1761461      1%  /Volumes/LANGDON

Now imagine you were interested in understanding the rate of change in iused and ifree. You can with mmwatch. It’s just like watch but looks for changing numbers and interprets them as rates:

$ mmwatch 'df -H'

Here’s a short GIF showing it working:

mmsum

The final tool, mmsum, simply sums a list of floating point numbers (one per line).

Suppose you are downloading real-time rainfall data from the UK’s Environment Agency and would like to know the total current rainfall. mmsum can help:

$ curl -s 'https://environment.data.gov.uk/flood-monitoring/id/measures?parameter=rainfall' | jq -e '.items[].latestReading.value+0' | ./mmsum
40.2

All these tools can be found on the Cloudflare GitHub.

A container identity bootstrapping tool

Everybody has secrets. Software developers have many. Often these secrets—API tokens, TLS private keys, database passwords, SSH keys, and other sensitive data—are needed to make a service run properly and interact securely with other services. Today we’re sharing a tool that we built at Cloudflare to securely distribute secrets to our Dockerized production applications: PAL.

PAL is available on GitHub: https://github.com/cloudflare/pal.

Although PAL is not currently under active development, we have found it a useful tool and we think the community will benefit from its source being available. We believe it’s better to open source this tool and let others use the code than to leave it hidden from view and unmaintained.

Secrets in production

CC BY 2.0 image by Personal Creations

How do you get these secrets to your services? If you’re the only developer, or one of a few on a project, you might put the secrets with your source code in your version control system. But if you just store the secrets in plain text with your code, everyone with access to your source repository can read them and use them for nefarious purposes (for example, stealing an API token and pretending to be an authorized client). Furthermore, distributed version control systems like Git will download a copy of the secret everywhere a repository is cloned, regardless of whether it’s needed there, and will keep that copy in the commit history forever. For a company where many people (including people who work on unrelated systems) have access to source control this just isn’t an option.

Another idea is to keep your secrets in a secure place and then embed them into your application artifacts (binaries, containers, packages, etc.) at build time. This can be awkward for modern CI/CD workflows because it results in multiple parallel sets of artifacts for different environments (e.g. production, staging, development). Once you have artifacts with secrets, they become secret themselves, and you will have to restrict access to the “armed” packages that contain secrets after they’re built. Consider the discovery last year that the source code of Twitter’s Vine service was available in public Docker repositories. Not only was the source code for the service leaked, but the API keys that allow Vine to interact with other services were also exposed. Vine paid a bug bounty of over $10,000 for this discovery.

A more advanced technique to manage and deploy secrets is to use a secret management service. Secret management services can be used to create, store and rotate secrets as well as distribute them to applications. The secret management service acts as a gatekeeper, allowing access to some secrets for some applications as prescribed by an access control policy. An application that wants to gain access to a secret authenticates itself to the secret manager, the secret manager checks permissions to see if the application is authorized, and if authorized, sends the secret. There are many options to choose from, including Vault, Keywhiz, Knox, Secretary, dssss and even Docker’s own secret management service.

Secret managers are a good solution as long as an identity/authorization system is already in place. However, since most authentication systems involve the client already being in possession of a secret, this presents a chicken and egg problem.

Identity Bootstrapping

Once we have verified a service’s identity, we can make access control decisions about what that service can access. Therefore, the real problem we must solve is the problem of bootstrapping service identity.

This problem has many solutions when services are tightly bound to individual machines (for example, we can simply install host-level credentials on each machine, or even use a machine’s hardware to identify it; virtual machine platforms like Amazon AWS have machine-based APIs for host-level identity, such as IAM and KMS). Containerized services have a much more fluid lifecycle – instances may appear on many machines and may come and go over time. Furthermore, any number of trusted and untrusted containers might be running on the same host at the same time. So what we need instead is an identity that belongs to a service, not to a machine.

Every Application needs an ID to prove to the bouncer that they’re on the guest list for Club Secret.

Bootstrapping the identity of a service that lives in a container is not a solved problem, and most existing solutions are deeply integrated into a particular container orchestrator (Kubernetes, Docker Swarm, Mesos, etc.). We ran into the problem of container identity bootstrapping and wanted something that worked with our current application deployment stack (Docker/Mesos/Marathon) but wasn’t locked down to a given orchestration platform.

Enter PAL

We use Docker containers to deploy many services across a shared, general-purpose Mesos cluster. To solve the service identity bootstrapping problem in our Docker environment, we developed PAL, named after the Permissive Action Link, a security device for nuclear weapons. PAL makes sure secrets are only available in production, and only when jobs are authorized to run.

PAL makes it possible to keep only encrypted secrets in the configuration for a service while ensuring that those secrets can only be decrypted by authorized service instances in an approved environment (say, a production or staging environment). If those credentials serve to identify the service requesting access to secrets, PAL becomes a container identity bootstrapping solution that you can easily deploy a secret manager on top of.

How it works

The model for PAL is that secrets are provided in an encrypted form and either embedded in containers or provided as runtime configuration for jobs running in an orchestration framework such as Apache Mesos.

PAL allows secrets to be decrypted at runtime after the service’s identity has been established. These credentials could enable authenticated inter-service communication, which would allow you to keep service secrets in a central repository such as HashiCorp’s Vault, Keywhiz, or others. The credentials could also be used to issue service-level credentials (such as certificates for an internal PKI). Without PAL, you would have to distribute the identity credentials that such tools themselves require across your infrastructure.

PAL consists of two components: a small in-container initialization tool, pal, that requests secret decryption and installs the decrypted secrets, and a daemon called pald that runs on every node in the cluster. pal and pald communicate with each other via a Unix socket. The pal tool is set as each job’s entrypoint, and it sends pald the encrypted secrets. pald then identifies the process making the request and determines whether it is allowed to access the requested secret. If so, it decrypts the secret on behalf of the job and returns the plaintext to pal, which installs the plaintext within the calling job’s container as either an environment variable or a file.

PAL currently supports two methods of encryption – PGP and Red October – but it can be extended to support more.

PAL-PGP

PGP is a popular form of encryption that has been around since the early 90s. PAL allows you to use secrets that are encrypted with PGP keys that are installed on the host. The current version of PAL does not apply policies at a per-key level (e.g. only containers with Docker Label A can use key 1), but it could easily be extended to do so.

PAL-Red October

The Red October mode is used for secrets that are very high value and need to be managed manually or with multi-person control. We open sourced Red October back in 2013. It has the nice feature of being able to encrypt a secret in such a way that multiple people are required to authorize the decryption.

In the typical PAL-RO setup, each machine in your cluster is provisioned with a Red October account. Before a container is scheduled to run, the secret owners delegate the ability to decrypt the secret to the host on which the container is going to run. When the container starts, pal calls pald, which uses the machine’s Red October credentials to decrypt the secret via a call to the Red October server. Delegations can be limited in time or in number of decryptions. Once the delegations are used up, Red October has no way to decrypt the secret.
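
For illustration only, a delegation is an HTTP API call to the Red October server along these lines; consult the Red October documentation for the exact interface, as the host, account name and values here are made up:

# Delegate decryption rights to a machine account for 2 hours or 3 uses
curl --cacert ro-server.crt https://redoctober.example.com:8080/delegate \
  -d '{"Name":"machine-account-1","Password":"...","Time":"2h","Uses":3}'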

These two modes have been invaluable for protecting high-value secrets, where Red October provides additional oversight and control. For lower-value secrets, PGP provides a non-interactive experience that works well with ephemeral containers.

Authorization details

An important part of a secret management tool is ensuring that only authorized entities can decrypt a secret. PAL enables you to control which containers can decrypt a secret by leveraging existing code signing infrastructure. Both secrets and containers can be given optional labels that PAL will respect. Labels define which containers can access which secrets—a container must have the label of any secret it accesses. Labels are named references to security policies. An example label could be “production-team-secret” which denotes that a secret should conform to the production team’s secret policy. Labels bind cipher texts to an authorization to decrypt. These authorizations allow you to use PAL to control when and in what environment secrets can be decrypted.

By opening the Unix socket with the SO_PASSCRED option, we enable pald to obtain the process-level credentials (pid, uid, and gid) of the caller for each request. These credentials can be used to identify containers and assign them labels, which in turn lets PAL consult a predefined policy and authorize containers to receive secrets. To get the list of labels on a container, pald uses the process id (pid) of the calling process to look up its cgroups (by reading and parsing /proc/<pid>/cgroup). The cgroup names contain the Docker container id, which we can use to fetch container metadata via Docker’s inspect call. This metadata carries the list of labels assigned by the Docker LABEL directive at build time.
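
Here is a minimal Python 3 sketch of that identification step; the socket path is hypothetical, and the real pald does considerably more (the Docker inspect call, policy checks, and so on):

#!/usr/bin/env python3
# Sketch of pald's caller identification: read the peer's credentials
# from the Unix socket, then map its pid to a Docker container id via
# /proc/<pid>/cgroup. The socket path is hypothetical.
import re
import socket
import struct

server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind("/run/pald.sock")
server.listen(1)

conn, _ = server.accept()
conn.setsockopt(socket.SOL_SOCKET, socket.SO_PASSCRED, 1)
# SCM_CREDENTIALS arrives as three 32-bit integers: pid, uid, gid.
msg, ancdata, flags, addr = conn.recvmsg(64 * 1024, socket.CMSG_SPACE(12))
for level, ctype, data in ancdata:
    if level == socket.SOL_SOCKET and ctype == socket.SCM_CREDENTIALS:
        pid, uid, gid = struct.unpack("3i", data)
        # The cgroup names embed the 64-hex-digit Docker container id.
        with open("/proc/%d/cgroup" % pid) as f:
            match = re.search(r"docker[/-]([0-9a-f]{64})", f.read())
        if match:
            container_id = match.group(1)
            # From here, Docker's inspect call on container_id yields
            # the labels that pald checks against its policy.
            print(pid, uid, gid, container_id)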

Containers and their labels must be bound together using code integrity tools. PAL supports using Docker’s Notary, which confirms that a specific container hash maps to specific metadata like a container’s name and label.

PAL’s present and future

PAL represents one solution to identity bootstrapping for our environment. Other service identity bootstrapping tools work at the host level or are tightly coupled to a particular environment: AWS IAM, for example, only works at the level of virtual machines running on AWS, while Kubernetes secrets and Docker secrets management only work in Kubernetes and Docker Swarm, respectively. While we developed PAL to work alongside Mesos, we designed it to be usable as a service identity mechanism in many other environments, simply by plugging in new ways for PAL to read identities from service artifacts (containers, packages, or binaries).

Recall the issue where Vine disclosed their source code in their Docker containers on a public repository, Docker Hub. With PAL, Vine could have kept their API keys (or even the entire codebase) encrypted in the container, published that safe version of the container to Docker Hub, and decrypted the code at container startup in their particular production environment.

Using PAL, you can give your trusted containers an identity that allows them to safely receive secrets only in production, without the risks associated with other secret distribution methods. This identity can be a secret like a cryptographic key, allowing your service to decrypt its sensitive configuration, or it could be a credential that allows it to access sensitive services such as secret managers or CAs. PAL solves a key bootstrapping problem for service identity, making it simple to run trusted and untrusted containers side-by-side while ensuring that your secrets are safe.

Credits

PAL was created by Joshua Kroll, Daniel Dao, and Ben Burkert, with design prototyping by Nick Sullivan. This post was adapted from an internal blog post by Joshua Kroll and from presentations I gave at O’Reilly Security Amsterdam 2016 and at BSides Las Vegas.
Source: CloudFlare

Stupidly Simple DDoS Protocol (SSDP) generates 100 Gbps DDoS

Last month we shared statistics on some popular reflection attacks. Back then, the average SSDP attack size was ~12 Gbps, and the largest SSDP reflection we recorded was:

30 Mpps (millions of packets per second)
80 Gbps (billions of bits per second)
using 940k reflector IPs

This changed a couple of days ago when we noticed an unusually large SSDP amplification. It’s worth a deeper investigation, since it crossed the symbolic threshold of 100 Gbps.

[Chart: packets per second during the attack]

[Chart: bandwidth usage during the attack]

This packet flood lasted 38 minutes. According to our sampled netflow data, it utilized 930k reflector servers. We estimate that during the 38 minutes of the attack, each reflector sent 112k packets to Cloudflare.

The reflector servers are across the globe, with a large presence in Argentina, Russia and China. Here are the unique IPs per country:

$ cat ips-nf-ct.txt|uniq|cut -f 2|sort|uniq -c|sort -nr|head
439126 CN
135783 RU
74825 AR
51222 US
41353 TW
32850 CA
19558 MY
18962 CO
14234 BR
10824 KR
10334 UA
9103 IT

The reflector IP distribution across ASNs is typical. It pretty much follows the world’s largest residential ISPs:

$ cat ips-nf-asn.txt |uniq|cut -f 2|sort|uniq -c|sort -nr|head
318405 4837 # CN China Unicom
84781 4134 # CN China Telecom
72301 22927 # AR Telefonica de Argentina
23823 3462 # TW Chunghwa Telecom
19518 6327 # CA Shaw Communications Inc.
19464 4788 # MY TM Net
18809 3816 # CO Colombia Telecomunicaciones
11328 28573 # BR Claro SA
7070 10796 # US Time Warner Cable Internet
6840 8402 # RU OJSC "Vimpelcom"
6604 3269 # IT Telecom Italia
6377 12768 # RU JSC "ER-Telecom Holding"

What’s SSDP anyway?

The attack was composed of UDP packets with source port 1900. This port is used by SSDP, the discovery protocol of the UPnP family. UPnP is one of the zero-configuration networking protocols: most likely your home devices support it, allowing them to be easily discovered by your computer or phone. When a new device (like your laptop) joins the network, it can query the local network for specific devices, like internet gateways, audio systems, TVs, or printers. Read more on how UPnP compares to Bonjour.

UPnP is poorly standardised, but here’s a snippet from the spec about the M-SEARCH frame – the main method for discovery:

When a control point is added to the network, the UPnP discovery protocol allows that control point to search for devices of interest on the network. It does this by multicasting on the reserved address and port (239.255.255.250:1900) a search message with a pattern, or target, equal to a type or identifier for a device or service.

Responses to the M-SEARCH frame:

To be found by a network search, a device shall send a unicast UDP response to the source IP address and port that sent the request to the multicast address. Devices respond if the ST header field of the M-SEARCH request is “ssdp:all”, “upnp:rootdevice”, “uuid:” followed by a UUID that exactly matches the one advertised by the device, or if the M-SEARCH request matches a device type or service type supported by the device.

This works in practice. For example, my Chrome browser regularly asks for a Smart TV, I guess:

$ sudo tcpdump -ni eth0 udp and port 1900 -A
IP 192.168.1.124.53044 > 239.255.255.250.1900: UDP, length 175
M-SEARCH * HTTP/1.1
HOST: 239.255.255.250:1900
MAN: "ssdp:discover"
MX: 1
ST: urn:dial-multiscreen-org:service:dial:1
USER-AGENT: Google Chrome/58.0.3029.110 Windows

This frame is sent to a multicast IP address. Other devices listening on that address and supporting this specific ST (search-target) multiscreen type are supposed to answer.

Apart from queries for specific device types, there are two “generic” ST query types:

upnp:rootdevice: search for root devices
ssdp:all: search for all UPnP devices and services

To emulate these queries, you can run this Python script (based on this work):

#!/usr/bin/env python2
import socket
import sys

dst = "239.255.255.250"
if len(sys.argv) > 1:
    dst = sys.argv[1]
st = "upnp:rootdevice"
if len(sys.argv) > 2:
    st = sys.argv[2]

msg = [
    'M-SEARCH * HTTP/1.1',
    'Host:239.255.255.250:1900',
    'ST:%s' % (st,),
    'Man:"ssdp:discover"',
    'MX:1',
    '']

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
s.settimeout(10)
s.sendto('\r\n'.join(msg), (dst, 1900))

while True:
    try:
        data, addr = s.recvfrom(32*1024)
    except socket.timeout:
        break
    print "[+] %s\n%s" % (addr, data)

On my home network two devices show up:

$ python ssdp-query.py
[+] ('192.168.1.71', 1026)
HTTP/1.1 200 OK
CACHE-CONTROL: max-age = 60
EXT:
LOCATION: http://192.168.1.71:5200/Printer.xml
SERVER: Network Printer Server UPnP/1.0 OS 1.29.00.44 06-17-2009
ST: upnp:rootdevice
USN: uuid:Samsung-Printer-1_0-mrgutenberg::upnp:rootdevice

[+] ('192.168.1.70', 36319)
HTTP/1.1 200 OK
Location: http://192.168.1.70:49154/MediaRenderer/desc.xml
Cache-Control: max-age=1800
Content-Length: 0
Server: Linux/3.2 UPnP/1.0 Network_Module/1.0 (RX-S601D)
EXT:
ST: upnp:rootdevice
USN: uuid:9ab0c000-f668-11de-9976-000adedd7411::upnp:rootdevice

The firewall

Now that we understand the basics of SSDP, understanding the reflection attack should be easy. You see, there are in fact two ways of delivering the M-SEARCH frame:

as presented above, over the multicast address
directly to a UPnP/SSDP-enabled host at its normal unicast address

The latter method works too. We can target my printer’s IP address directly:

$ python ssdp-query.py 192.168.1.71
[+] ('192.168.1.71', 1026)
HTTP/1.1 200 OK
CACHE-CONTROL: max-age = 60
EXT:
LOCATION: http://192.168.1.71:5200/Printer.xml
SERVER: Network Printer Server UPnP/1.0 OS 1.29.00.44 06-17-2009
ST: upnp:rootdevice
USN: uuid:Samsung-Printer-1_0-mrgutenberg::upnp:rootdevice

Now the problem is easily seen: the SSDP protocol does not check whether the querying party is in the same network as the device. It will happily respond to an M-SEARCH delivered over the public Internet. All it takes is a tiny misconfiguration in a firewall – port 1900 UDP open to the world – and a perfect target for UDP amplification will be available.

Given a misconfigured target, our script will happily work over the internet:

$ python ssdp-query.py 100.42.x.x
[+] ('100.42.x.x', 1900)
HTTP/1.1 200 OK
CACHE-CONTROL: max-age=120
ST: upnp:rootdevice
USN: uuid:3e55ade9-c344-4baa-841b-826bda77dcb2::upnp:rootdevice
EXT:
SERVER: TBS/R2 UPnP/1.0 MiniUPnPd/1.2
LOCATION: http://192.168.2.1:40464/rootDesc.xml

The amplification

The real damage is done by the ssdp:all ST type, though. These responses are much larger:

$ python ssdp-query.py 100.42.x.x ssdp:all
[+] ('100.42.x.x', 1900)
HTTP/1.1 200 OK
CACHE-CONTROL: max-age=120
ST: upnp:rootdevice
USN: uuid:3e55ade9-c344-4baa-841b-826bda77dcb2::upnp:rootdevice
EXT:
SERVER: TBS/R2 UPnP/1.0 MiniUPnPd/1.2
LOCATION: http://192.168.2.1:40464/rootDesc.xml

[+] ('100.42.x.x', 1900)
HTTP/1.1 200 OK
CACHE-CONTROL: max-age=120
ST: urn:schemas-upnp-org:device:InternetGatewayDevice:1
USN: uuid:3e55ade9-c344-4baa-841b-826bda77dcb2::urn:schemas-upnp-org:device:InternetGatewayDevice:1
EXT:
SERVER: TBS/R2 UPnP/1.0 MiniUPnPd/1.2
LOCATION: http://192.168.2.1:40464/rootDesc.xml

… 6 more response packets….

In this particular case, a single SSDP M-SEARCH packet triggered 8 response packets. Here’s the tcpdump view:

$ sudo tcpdump -ni en7 host 100.42.x.x -ttttt
00:00:00.000000 IP 192.168.1.200.61794 > 100.42.x.x.1900: UDP, length 88
00:00:00.197481 IP 100.42.x.x.1900 > 192.168.1.200.61794: UDP, length 227
00:00:00.199634 IP 100.42.x.x.1900 > 192.168.1.200.61794: UDP, length 299
00:00:00.202938 IP 100.42.x.x.1900 > 192.168.1.200.61794: UDP, length 295
00:00:00.208425 IP 100.42.x.x.1900 > 192.168.1.200.61794: UDP, length 275
00:00:00.209496 IP 100.42.x.x.1900 > 192.168.1.200.61794: UDP, length 307
00:00:00.212795 IP 100.42.x.x.1900 > 192.168.1.200.61794: UDP, length 289
00:00:00.215522 IP 100.42.x.x.1900 > 192.168.1.200.61794: UDP, length 291
00:00:00.219190 IP 100.42.x.x.1900 > 192.168.1.200.61794: UDP, length 291

That target exposes 8x packet count amplification and 26x bandwidth amplification. Sadly, this is typical for SSDP.
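
Those factors fall straight out of the capture: one 88-byte request produced eight responses totalling 2,274 bytes. A quick check:

# Amplification implied by the tcpdump capture above.
request_bytes = 88
response_bytes = [227, 299, 295, 275, 307, 289, 291, 291]

print(len(response_bytes))                         # 8x packet amplification
print(sum(response_bytes) / float(request_bytes))  # ~25.8x bandwidth amplification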

IP Spoofing

The final step of the attack is to fool the vulnerable servers into flooding the target IP rather than the attacker’s. To do that, the attacker needs to spoof the source IP address on their queries.

We probed the reflector IPs used in the 100 Gbps+ attack described above. We found that out of the 920k reflector IPs, only 350k (38%) still respond to SSDP probes.

Out of the reflectors that responded, each sent on average 7 packets:

$ cat results-first-run.txt|cut -f 1|sort|uniq -c|sed -s 's#^ +##g'|cut -d " " -f 1| ~/mmhistogram -t "Response packets per IP" -p
Response packets per IP min:1.00 avg:6.99 med=8.00 max:186.00 dev:4.44 count:350337
Response packets per IP:
value |————————————————– count
0 | ****************************** 23.29%
1 | **** 3.30%
2 | ** 2.29%
4 |************************************************** 38.73%
8 | ************************************** 29.51%
16 | *** 2.88%
32 | 0.01%
64 | 0.00%
128 | 0.00%

The response packets had 321 bytes (+/- 29 bytes) on average. Our request packets had 110 bytes.

According to our measurements, with the ssdp:all M-SEARCH an attacker would be able to achieve:

7x packet number amplification
20x bandwidth amplification

We can estimate that the 43 Mpps/112 Gbps attack was generated with roughly:

6.1 Mpps of spoofing capacity
5.6 Gbps of spoofed bandwidth

In other words: a single well-connected 10 Gbps server able to perform IP spoofing can deliver a significant SSDP attack.
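
The arithmetic behind that estimate is simply the observed attack divided by the measured amplification factors:

# Attacker capacity implied by the 43 Mpps / 112 Gbps attack.
packet_amplification = 7      # measured packets out per packet in
bandwidth_amplification = 20  # measured bandwidth amplification

print(43e6 / packet_amplification)      # ~6.1 Mpps of spoofing capacity
print(112e9 / bandwidth_amplification)  # 5.6 Gbps of spoofed bandwidth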

More on the SSDP servers

Since we probed the vulnerable SSDP servers, here are the most common Server header values we received:

104833 Linux/2.4.22-1.2115.nptl UPnP/1.0 miniupnpd/1.0
77329 System/1.0 UPnP/1.0 IGD/1.0
66639 TBS/R2 UPnP/1.0 MiniUPnPd/1.2
12863 Ubuntu/7.10 UPnP/1.0 miniupnpd/1.0
11544 ASUSTeK UPnP/1.0 MiniUPnPd/1.4
10827 miniupnpd/1.0 UPnP/1.0
8070 Linux UPnP/1.0 Huawei-ATP-IGD
7941 TBS/R2 UPnP/1.0 MiniUPnPd/1.4
7546 Net-OS 5.xx UPnP/1.0
6043 LINUX-2.6 UPnP/1.0 MiniUPnPd/1.5
5482 Ubuntu/lucid UPnP/1.0 MiniUPnPd/1.4
4720 AirTies/ASP 1.0 UPnP/1.0 miniupnpd/1.0
4667 Linux/2.6.30.9, UPnP/1.0, Portable SDK for UPnP devices/1.6.6
3334 Fedora/10 UPnP/1.0 MiniUPnPd/1.4
2814 1.0
2044 miniupnpd/1.5 UPnP/1.0
1330 1
1325 Linux/2.6.21.5, UPnP/1.0, Portable SDK for UPnP devices/1.6.6
843 Allegro-Software-RomUpnp/4.07 UPnP/1.0 IGD/1.00
776 Upnp/1.0 UPnP/1.0 IGD/1.00
675 Unspecified, UPnP/1.0, Unspecified
648 WNR2000v5 UPnP/1.0 miniupnpd/1.0
562 MIPS LINUX/2.4 UPnP/1.0 miniupnpd/1.0
518 Fedora/8 UPnP/1.0 miniupnpd/1.0
372 Tenda UPnP/1.0 miniupnpd/1.0
346 Ubuntu/10.10 UPnP/1.0 miniupnpd/1.0
330 MF60/1.0 UPnP/1.0 miniupnpd/1.0

The most common ST header values we saw:

298497 upnp:rootdevice
158442 urn:schemas-upnp-org:device:InternetGatewayDevice:1
151642 urn:schemas-upnp-org:device:WANDevice:1
148593 urn:schemas-upnp-org:device:WANConnectionDevice:1
147461 urn:schemas-upnp-org:service:WANCommonInterfaceConfig:1
146970 urn:schemas-upnp-org:service:WANIPConnection:1
145602 urn:schemas-upnp-org:service:Layer3Forwarding:1
113453 urn:schemas-upnp-org:service:WANPPPConnection:1
100961 urn:schemas-upnp-org:device:InternetGatewayDevice:
100180 urn:schemas-upnp-org:device:WANDevice:
99017 urn:schemas-upnp-org:service:WANCommonInterfaceConfig:
98112 urn:schemas-upnp-org:device:WANConnectionDevice:
97246 urn:schemas-upnp-org:service:WANPPPConnection:
96259 urn:schemas-upnp-org:service:WANIPConnection:
93987 urn:schemas-upnp-org:service:Layer3Forwarding:
91108 urn:schemas-wifialliance-org:device:WFADevice:
90818 urn:schemas-wifialliance-org:service:WFAWLANConfig:
35511 uuid:IGD{8c80f73f-4ba0-45fa-835d-042505d052be}000000000000
9822 urn:schemas-upnp-org:service:WANEthernetLinkConfig:1
7737 uuid:WAN{84807575-251b-4c02-954b-e8e2ba7216a9}000000000000
6063 urn:schemas-microsoft-com:service:OSInfo:1

The vulnerable IPs seem to be mostly unprotected home routers.

Open SSDP is a vulnerability

It’s hardly news that allowing UDP port 1900 traffic from the Internet to your home printer or similar devices is a bad idea. This problem has been known since at least January 2013:

“Security Flaws in Universal Plug and Play: Unplug, Don’t Play”

The authors of SSDP clearly didn’t give any thought to its UDP amplification potential. There are a number of obvious recommendations about future use of the SSDP protocol:

The authors of SSDP should answer whether there is any real-world use of unicast M-SEARCH queries. From what I understand, M-SEARCH only makes practical sense as a multicast query on the local network.
Unicast M-SEARCH support should be either deprecated or at least rate limited, in a similar way to DNS Response Rate Limiting techniques.
M-SEARCH responses should only be delivered to the local network. Responses routed beyond it make little sense and open up the vulnerability described here.

In the meantime we recommend:

Network administrators should ensure inbound UDP port 1900 is blocked at their firewall.
Internet service providers should never allow IP spoofing to be performed on their network. IP spoofing is the true root cause of the issue. See the infamous BCP38.
Internet service providers should allow their customers to use BGP flowspec to rate limit inbound UDP source port 1900 traffic, to relieve congestion during large SSDP attacks.
Internet providers should internally collect netflow protocol samples. Netflow is needed to identify the true source of the attack. With netflow it’s trivial to answer questions like: “Which of my customers sent 6.4 Mpps of traffic to port 1900?”. Due to privacy concerns, we recommend collecting netflow samples with the largest possible sampling value: 1 in 64k packets. This will be sufficient to track DDoS attacks while preserving decent privacy for individual customer connections.
Developers should not roll out their own UDP protocols without carefully considering UDP amplification problems. UPnP should be properly standardized and scrutinized.
End users are encouraged to use the script above to scan their networks for UPnP-enabled devices. Consider whether these devices should be allowed access to the internet.

Furthermore, we have prepared an online checking website. Visit it if you want to know whether your public IP address has a vulnerable SSDP service:

https://badupnp.benjojo.co.uk/

Sadly, most of the unprotected routers we saw in the described attack were in China, Russia and Argentina, places not historically known for having the most agile internet service providers.

Summary

Cloudflare customers are fully protected from SSDP and other L3 amplification attacks. These attacks are nicely deflected by Cloudflare’s anycast infrastructure and require no special action. Unfortunately, the rise in SSDP attack sizes might be a tough problem for other internet citizens. We should encourage our ISPs to stop IP spoofing within their networks, support BGP flowspec, and configure netflow collection.

This article is a joint work of Marek Majkowski and Ben Cartwright-Cox.

Dealing with large attacks sounds like fun? Join our world famous DDoS team in London, Austin, San Francisco and our elite office in Warsaw, Poland.
Source: CloudFlare