Category Archives: CloudFlare

7 Cloudflare Apps Which Increase User Engagement on Your Site

Cloudflare Apps now lists 95 apps, from apps which grow email lists, to apps which acquire new customers, to apps which help site owners make more money. The great thing about these apps is that users don’t need any coding or development skills. They can just sign up for an app and start using it on their sites.

Let’s take a moment to highlight some apps which increase a site’s user engagement. Check out more Cloudflare Apps which grow your email list, make money on your site, and get more customers.

I hope you enjoy them and I hope you build (or use) great apps like these too.

Check out other Cloudflare Apps »

Build an app on Cloudflare Apps »

1. Privy

Over 100,000 businesses use Privy to capture and convert website visitors. Privy offers a free suite of email capture tools, including exit-intent driven website popups & banners and email list sign-up, covering a site’s online store, social media channels, mobile traffic, and in-store traffic.

In the left preview pane, you can view the different packages users may sign up for and their features, from the free tier to the “growth” option ($199/month).

In the right pane, you can preview your choices, seeing how Privy’s functionality interacts with the site and users and even play around with the popups yourself. I personally love the confetti.

2. AddThis Share Buttons

The Share Buttons app from AddThis is a super easy way to add sidebar share buttons to a website and get the site’s content distributed to over 200 social media channels, auto-personalized for users. Site owners can display between 1 and 10 share services, customized for their users, and can control the size and style of the buttons as well.

In the left pane, you can see the install options where you can customize the size, theme and position of the banner. You can adjust the number of social platforms listed and choose to list the number of content shares per platform or total.

In the right pane, you can preview choices, seeing what they’d look like on any website and experiment with placement and how it flows with the site. This is very similar to the tool that the app developer uses to test the app for how it behaves on a wide range of web properties.

3. Vimeo

This app embeds Vimeo videos directly onto sites, so people can easily find and view videos the site owners made, or maybe just a few of their favorites. The Vimeo app supports autoplay and multiple videos on one page, in multiple locations on the page. A site owner can put videos almost anywhere on their site, if they wish.

In the left pane, you can change the location of the video on the page, change its orientation in each location, switch Vimeo video links, and add multiple videos to the page.

In the right pane, you can preview your choices, seeing where each video will be displayed.

4. SoundCloud

You can’t grow as a musician, podcaster, or influencer if your favorite tracks are not heard. The SoundCloud app embeds SoundCloud tracks or playlists directly on sites to grow a site owner’s audience, help them find fans, or give the reader a taste of the music being reviewed in a blog post, all of which work towards keeping users engaged on the site longer.

This preview works very similarly to the Vimeo preview. You can change the location of the track/playlist on the page, its orientation in each location, switch audio tracks/playlists, and add multiple tracks/playlists to the page.

5. Drift

Drift uses messaging and artificial intelligence to catch leads that are falling through the cracks in sales funnels. Billions of people around the world are using messaging apps like Slack and HipChat to communicate in real-time these days and customers often don’t want to wait for follow-up emails or calls.

After installing Drift, site owners are able to engage with leads when they’re at their most interested – while they’re actually live on their sites. Using intelligent bots, qualified leads can be captured 24/7, even when no human is available to chat.

In Drift’s preview, you can view how the styling and messaging of the chat box may be adjusted. You can change the organization name, welcome message, and away message. You can also select the “automatically open chat” option to give users a more visible chat invitation.

6. Skype Live Chat

Skype Live Chat installs on websites easily. Site owners just add their Skype usernames or bot IDs to start chatting with their customers. Considering there are over 74 million Skype users, this app allows users to chat in a way that they’re comfortable and familiar with.

In the left pane, you can see options to change the color of the chat button and message bubble. In the right pane, you can see the results and how users will be prompted to log into their Skype accounts, if they aren’t already.

7. Weather Widget

WeatherWidget.io provides an easy, interactive interface with which site owners can create and customize a weather widget for any website. WeatherWidget.io is highly customizable. The labels, font, theme, number of forecast days, and icon sets may all be edited to maximize user engagement. The app is responsive to the size of its container and available in 20 languages, with more languages on the way.

There are a lot of options to play with here. In the left pane, you can see that the location of the banner can be changed. The weather location can be adjusted and custom labels assigned to it. I chose “SF, CA” in the preview. The language and units (Celsius vs. Fahrenheit) can be switched. Next, the font, icon type (I chose animated climacons), and the option to show a forecast vs. current weather can be changed. Finally, you can play with the theme, with options from “orange” to “blue mountains”, and adjust the widget’s margins.

In the right pane, you can see your options come to life.

I hope you enjoy playing with and using these apps. Cloudflare is encouraging app authors to write their apps on the app platform. I hope you’ll consider developing one of your own.

Build an app on Cloudflare Apps »

Source: CloudFlare

The Super Secret Cloudflare Master Plan, or why we acquired Neumob

We announced today that Cloudflare has acquired Neumob. Neumob’s team built exceptional technology to speed up mobile apps, reduce errors on challenging mobile networks, and increase conversions. Cloudflare will integrate the Neumob technology with our global network to give Neumob truly global reach.

It’s tempting to think of the Neumob acquisition as a point product added to the Cloudflare portfolio. But it actually represents a key part of a long term “Super Secret Cloudflare Master Plan”.

CC BY 2.0 image by Neil Rickards

Over the last few years Cloudflare has been building a large network of data centers across the world to help fulfill our mission of helping to build a better Internet. These data centers all run an identical software stack that implements Cloudflare’s cache, DNS, DDoS, WAF, load balancing, rate limiting, etc.

We’re now at 118 data centers in 58 countries and are continuing to expand with a goal of being as close to end users as possible worldwide.

The data centers are tied together by secure connections which are optimized using our Argo smart routing capability. Our Quicksilver technology enables us to update and modify the settings and software running across this vast network in seconds.

We’ve also been extending the network to reach directly into devices and servers. Our 2014 technology Railgun helped to speed up connections to origin HTTP servers. Our recently announced Warp technology is used to connect servers and services (such as those running inside a Kubernetes cluster) to Cloudflare without having a public HTTP endpoint. Our IoT solution, Orbit, enables smart devices to connect to our network securely.

The goal is that any end device (web browser, mobile application, smart meter, …) should be able to securely connect to our network and have secure, fast communication from device to origin server with every step of the way optimized and secured by Cloudflare.

While we’ve spent a lot of time on the latest encryption and performance technologies for the web browser and server, we had not done the same for mobile applications. That changes with our acquisition of Neumob.

Why Neumob

The Neumob software running on your phone changes how a mobile app interacts with an API running on an HTTP server. Without Neumob, that API traffic uses standard Internet protocols (such as HTTPS) and is prone to being negatively affected by the varying performance and availability of mobile networks.

With the Neumob software any API request from a mobile application is sent across an optimized set of protocols to the nearest Cloudflare data center. These optimized protocols are able to handle mobile network variability gracefully and securely.

Cloudflare’s Argo then optimizes the route across the long-haul portion of the network to our data center closest to the origin API server. Then Cloudflare’s Warp optimizes the path from the edge of our network to the origin server where the application’s API is running. End-to-end, Cloudflare can supercharge and secure the network experience.

Ultimately, the Neumob software is easily extended to operate as a “VPN” for mobile devices that can secure and accelerate all HTTP traffic from a mobile device (including normal web browsing and app API calls). Most VPN software, frankly, is awful. Using a VPN feels like a step back to the dial-up era of obscure error messages, slowdowns, and clunky software. It really doesn’t have to be that way.

And in an era where SaaS, PaaS, IaaS and mobile devices have blown up the traditional company ‘perimeter’ the entire concept of a Virtual Private Network is an anachronism.

Going Forward

At the current time the Neumob service has been discontinued as we move their server components onto the Cloudflare network. We’ll soon relaunch it under a new name and make it available to mobile app developers worldwide. Developers of iOS and Android apps will be able to accelerate and protect their applications’ connectivity by adding just two lines of code to take advantage of Cloudflare’s global network.

As a personal note, I’m thrilled that the Neumob team is joining Cloudflare. We’d been tracking their progress and development for a few years and had long wanted to build a Cloudflare Mobile App SDK that would bring the network benefits of Cloudflare right into devices. It became clear that Neumob’s technology and team were world-class and that it made more sense to abandon our own work on an SDK and adopt theirs.
Source: CloudFlare

Thwarting the Tactics of the Equifax Attackers

We are now three months on from one of the biggest, most significant data breaches in history, but has it redefined people’s awareness of security?

The answer to that is absolutely yes: awareness is at an all-time high. Awareness, however, does not always result in positive action. The fallacy which is often assumed is: “surely, if I keep my software up to date with all the patches, that’s more than enough to keep me safe?” It’s true that keeping software up to date defends against known vulnerabilities, but it’s a very reactive stance. The more important part is protecting against the unknown.

Something every engineer will agree on is that security is hard, and maintaining systems is even harder. Patching or upgrading systems can lead to unforeseen outages or unexpected behaviour due to other fixes which may be applied. In most cases this causes huge delays in the deployment of patches or upgrades, because regression testing or deployment in a staging environment is required first. While those processes are followed and tests are done, systems sit vulnerable, ready to be exploited if they are exposed to the Internet.

Looking at the wider landscape, an increase in security research has created a surge of CVEs (Common Vulnerabilities and Exposures) being announced. Compounded by GDPR, NIST, and other new data protection legislation, businesses are now forced to pay much more attention to security vulnerabilities that could potentially affect their software and ultimately put them on the ever-growing list of data breach victims.

Stats from cvedetails.com (November 2017)

Dissecting the Equifax tragedy: in testimony, the CEO mentioned that the reason for the breach was that a single person within the organisation was responsible for communicating the availability of the patch for Apache Struts, the software at the heart of the breach. The crucial lesson from Equifax is that we are all human and mistakes can happen; having multiple people responsible for communicating and notifying teams about threats is therefore crucial. In this case, the mistake almost destroyed one of the largest credit agencies in the world.

How could attacks and breaches like Equifax be avoided? The first step is understanding how these attacks happen. Two key attack types are often the source of data exfiltration through vulnerable software:

Remote Code Execution (RCE) – which is what was used in the Equifax Breach
SQL Injection (SQLi), which is delivering an SQL statement hidden in a payload, accessing a backend database powering a website.

Remote Code Execution

The Struts vulnerability, CVE-2017-5638, which is protected by rule 100054 in Cloudflare Specials, was quite simple. With a payload targeted at the web server, a specific command could be executed, as can be seen in the example below:

"(#context.setMemberAccess(#dm))))."
"(#cmd='touch /tmp/hacked')."
"(#iswin=(@java.lang.System@getProperty('os.name').toLowerCase().contains('win')))."
"(#cmds=(#iswin?{'cmd.exe','/c',#cmd}:{'/bin/bash','-c',#cmd}))."
"(#p=new java.lang.ProcessBuilder(#cmds))."

More critically, however, Apache Struts announced a further vulnerability earlier this year (CVE-2017-9805), which works by delivering a payload against the REST plugin combined with the XStream handler, which provides an XML ingest capability. By delivering a specially crafted XML payload, a shell command can be embedded and will be executed.

<next class="java.lang.ProcessBuilder">
<command>
"touch /tmp/hacked".
</command>
<redirectErrorStream>false</redirectErrorStream>
</next>

And the result from the test:

$ ls /tmp
hacked
$

In the last week, we have seen over 180,000 hits on our WAF rules protecting against Apache Struts across the Cloudflare network.

SQL Injection

SQL Injection (SQLi) is an attempt to inject nefarious queries into a GET or POST dynamic variable which is used to query a database. Cloudflare sees over 2.3 million SQLi attempts on our network on a day-to-day basis. Most commonly, we see SQLi attacks against WordPress sites, as it is one of the most widely used web applications on Cloudflare today. WordPress is used by some of the world’s giants, like Sony Music, all the way down to “mom & pop” businesses. The challenge with being the leader in the space is that you then become a hot target. Looking at the CVE list as we near the close of 2017, there have been 41 vulnerabilities found in multiple versions of WordPress which would force people to upgrade to the latest versions. To protect our customers, and buy them time to upgrade, Cloudflare works with a number of vendors to address vulnerabilities and then applies virtual patches using our WAF to prevent these vulnerabilities from being exploited.

The way a SQL injection works is by “breaking out” of or malforming a query when a web application needs data from a database. As an example, a Forgotten Password page has a single email input field, which is used to validate whether the username exists and, if so, send the user a “Forgotten Password” link. Below is a straightforward SQL query example which could be used in a web application:

SELECT user, password FROM users WHERE user = 'victim@example.com';

Which results in:

+---------------------+------------------------------+
| user                | password                     |
+---------------------+------------------------------+
| victim@example.com  | $2y$10$h9XJRX.EBnGFrWQlnt... |
+---------------------+------------------------------+

Without the right query validation, an attacker could escape out of this query and carry out some extremely malicious queries. For example, if an attacker looking to take over another user’s account found that the query validation was inadequate, he could escape the query and UPDATE the username, which is an email address in this instance, to his own. This can be done simply by entering the query string below into the email input field, instead of an email address.

victim@example.com';UPDATE users SET user = 'attacker@example.com' WHERE user = 'victim@example.com';

Due to the lack of validation, the query which the web application sends to the database will be:

SELECT user, password FROM users WHERE user = 'victim@example.com';UPDATE users SET user = 'attacker@example.com' WHERE user = 'victim@example.com';

Now that the record has been updated, the attacker can request a password reset using his own email address, gaining access to the victim’s account.

+----------------------+------------------------------+
| user                 | password                     |
+----------------------+------------------------------+
| attacker@example.com | $2y$10$h9XJRX.EBnGFrWQlnt... |
+----------------------+------------------------------+

Many SQLi attacks are launched against fields which are not often considered high risk, like an authentication form. To put the seriousness of SQLi attacks in perspective, in the last week we have seen over 2.4 million matches.
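
For completeness, the standard way to defeat this class of attack is to keep user input out of the query string entirely by using parameterised queries. The following Go sketch (the driver, DSN, and table layout are hypothetical, chosen only for illustration) shows the forgotten-password lookup written with a bound parameter, so the injected string above would be treated as a literal value rather than SQL:

package main

import (
    "database/sql"
    "log"

    _ "github.com/go-sql-driver/mysql" // hypothetical driver choice for this sketch
)

// lookupUser fetches the password hash for an account. The email value is passed
// as a bound parameter, so input like "victim@example.com';UPDATE users ..." is
// treated as a literal string and never parsed as SQL.
func lookupUser(db *sql.DB, email string) (string, error) {
    var hash string
    err := db.QueryRow("SELECT password FROM users WHERE user = ?", email).Scan(&hash)
    return hash, err
}

func main() {
    db, err := sql.Open("mysql", "app:secret@tcp(127.0.0.1:3306)/appdb")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    if _, err := lookupUser(db, "victim@example.com"); err != nil {
        log.Println("lookup failed:", err)
    }
}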

The Cloudflare WAF is built to not only protect customers against SQLi and RCE based attacks, but also add protection against Cross Site Scripting (XSS) and a number of other known attacks. On an average week, just on our Cloudflare Specials WAF ruleset, we see over 138 million matches.

The next important part is communication and awareness: understanding what you have installed, what versions you are running, and, most importantly, what announcements your vendor is making. Most notifications are received via email and are usually quite cumbersome to digest, but regardless of their complexity, it is crucial to try to understand them.

And, finally, the last line of defense is to have protection in front of your application, which is where Cloudflare can help. At Cloudflare, security is core to our values and was one of the foundational pillars we were founded upon. Even to this day, we are known as one of the most cost-effective ways to shore up your web applications, starting with our Pro plan at $20/mo.
Source: CloudFlare

Go, don't collect my garbage

Not long ago I needed to benchmark the performance of Golang on a many-core machine. I took several of the benchmarks that are bundled with the Go source code, copied them, and modified them to run on all available threads. In this case the machine has 24 cores and 48 threads.

CC BY-SA 2.0 image by sponki25

I started with ECDSA P256 Sign, probably because I have a warm feeling for that function, since I optimized it for amd64.
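
For illustration, a parallel version of such a signing benchmark might look roughly like the sketch below, which uses testing.B.RunParallel; this is a simplified stand-in, not the exact benchmark bundled with the Go source code:

package bench

import (
    "crypto/ecdsa"
    "crypto/elliptic"
    "crypto/rand"
    "crypto/sha256"
    "testing"
)

// BenchmarkECDSAP256SignParallel signs the same digest from many goroutines.
// By default RunParallel uses GOMAXPROCS goroutines, so on a 24-core/48-thread
// machine it exercises all hardware threads.
func BenchmarkECDSAP256SignParallel(b *testing.B) {
    priv, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
    if err != nil {
        b.Fatal(err)
    }
    digest := sha256.Sum256([]byte("benchmark message"))

    b.ResetTimer()
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            if _, _, err := ecdsa.Sign(rand.Reader, priv, digest[:]); err != nil {
                b.Error(err) // Error (not Fatal) is safe from parallel goroutines
            }
        }
    })
}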

First, I ran the benchmark on a single goroutine: ECDSA-P256 Sign,30618.50, op/s

That looks good; next I ran it on 48 goroutines: ECDSA-P256 Sign,78940.67, op/s.

OK, that is not what I expected. Just over a 2X speedup, from 24 physical cores? I must be doing something wrong. Maybe Go only uses two cores? I ran top; it showed 2,266% utilization. That is not the 4,800% I expected, but it is also way above 400%.

How about taking a step back, and running the benchmark on two goroutines? ECDSA-P256 Sign,55966.40, op/s. Almost double, so pretty good. How about four goroutines? ECDSA-P256 Sign,108731.00, op/s. That is actually faster than 48 goroutines, what is going on?

I ran the benchmark for every number of goroutines from 1 to 48:

Looks like the number of signatures per second peaks at 274,622, with 17 goroutines. And starts dropping rapidly after that.

Time to do some profiling.

(pprof) top 10
Showing nodes accounting for 47.53s, 50.83% of 93.50s total
Dropped 224 nodes (cum <= 0.47s)
Showing top 10 nodes out of 138
flat flat% sum% cum cum%
9.45s 10.11% 10.11% 9.45s 10.11% runtime.procyield /state/home/vlad/go/src/runtime/asm_amd64.s
7.55s 8.07% 18.18% 7.55s 8.07% runtime.futex /state/home/vlad/go/src/runtime/sys_linux_amd64.s
6.77s 7.24% 25.42% 19.18s 20.51% runtime.sweepone /state/home/vlad/go/src/runtime/mgcsweep.go
4.20s 4.49% 29.91% 16.28s 17.41% runtime.lock /state/home/vlad/go/src/runtime/lock_futex.go
3.92s 4.19% 34.11% 12.58s 13.45% runtime.(*mspan).sweep /state/home/vlad/go/src/runtime/mgcsweep.go
3.50s 3.74% 37.85% 15.92s 17.03% runtime.gcDrain /state/home/vlad/go/src/runtime/mgcmark.go
3.20s 3.42% 41.27% 4.62s 4.94% runtime.gcmarknewobject /state/home/vlad/go/src/runtime/mgcmark.go
3.09s 3.30% 44.58% 3.09s 3.30% crypto/elliptic.p256OrdSqr /state/home/vlad/go/src/crypto/elliptic/p256_asm_amd64.s
3.09s 3.30% 47.88% 3.09s 3.30% runtime.(*lfstack).pop /state/home/vlad/go/src/runtime/lfstack.go
2.76s 2.95% 50.83% 2.76s 2.95% runtime.(*gcSweepBuf).push /state/home/vlad/go/src/runtime/mgcsweepbuf.go

Clearly Go spends a disproportionate amount of time collecting garbage. All my benchmark does is generate signatures and then dump them.

So what are our options? The Go runtime states the following:

The GOGC variable sets the initial garbage collection target percentage. A collection is triggered when the ratio of freshly allocated data to live data remaining after the previous collection reaches this percentage. The default is GOGC=100. Setting GOGC=off disables the garbage collector entirely. The runtime/debug package’s SetGCPercent function allows changing this percentage at run time. See https://golang.org/pkg/runtime/debug/#SetGCPercent.

The GODEBUG variable controls debugging variables within the runtime. It is a comma-separated list of name=val pairs setting these named variables:

Let’s see what setting GODEBUG to gctrace=1 does.

gc 1 @0.021s 0%: 0.15+0.37+0.25 ms clock, 3.0+0.19/0.39/0.60+5.0 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
gc 2 @0.024s 0%: 0.097+0.94+0.16 ms clock, 0.29+0.21/1.3/0+0.49 ms cpu, 4->4->1 MB, 5 MB goal, 48 P
gc 3 @0.027s 1%: 0.10+0.43+0.17 ms clock, 0.60+0.48/1.5/0+1.0 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
gc 4 @0.028s 1%: 0.18+0.41+0.28 ms clock, 0.18+0.69/2.0/0+0.28 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
gc 5 @0.031s 1%: 0.078+0.35+0.29 ms clock, 1.1+0.26/2.0/0+4.4 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
gc 6 @0.032s 1%: 0.11+0.50+0.32 ms clock, 0.22+0.99/2.3/0+0.64 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
gc 7 @0.034s 1%: 0.18+0.39+0.27 ms clock, 0.18+0.56/2.2/0+0.27 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
gc 8 @0.035s 2%: 0.12+0.40+0.27 ms clock, 0.12+0.63/2.2/0+0.27 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
gc 9 @0.036s 2%: 0.13+0.41+0.26 ms clock, 0.13+0.52/2.2/0+0.26 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
gc 10 @0.038s 2%: 0.099+0.51+0.20 ms clock, 0.19+0.56/1.9/0+0.40 ms cpu, 4->5->0 MB, 5 MB goal, 48 P
gc 11 @0.039s 2%: 0.10+0.46+0.20 ms clock, 0.10+0.23/1.3/0.005+0.20 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
gc 12 @0.040s 2%: 0.066+0.46+0.24 ms clock, 0.93+0.40/1.7/0+3.4 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
gc 13 @0.041s 2%: 0.099+0.30+0.20 ms clock, 0.099+0.60/1.7/0+0.20 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
gc 14 @0.042s 2%: 0.095+0.45+0.24 ms clock, 0.38+0.58/2.0/0+0.98 ms cpu, 4->5->0 MB, 5 MB goal, 48 P
gc 15 @0.044s 2%: 0.095+0.45+0.21 ms clock, 1.0+0.78/1.9/0+2.3 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
gc 16 @0.045s 3%: 0.10+0.45+0.23 ms clock, 0.10+0.70/2.1/0+0.23 ms cpu, 4->5->0 MB, 5 MB goal, 48 P
gc 17 @0.046s 3%: 0.088+0.40+0.17 ms clock, 0.088+0.45/1.9/0+0.17 ms cpu, 4->4->0 MB, 5 MB goal, 48 P
.
.
.
.
gc 6789 @9.998s 12%: 0.17+0.91+0.24 ms clock, 0.85+1.8/5.0/0+1.2 ms cpu, 4->6->1 MB, 6 MB goal, 48 P
gc 6790 @10.000s 12%: 0.086+0.55+0.24 ms clock, 0.78+0.30/4.2/0.043+2.2 ms cpu, 4->5->1 MB, 6 MB goal, 48 P

The first round of GC kicks in at 0.021s; it then starts collecting every 3ms and eventually every 1ms. That is insane: the benchmark runs for 10 seconds, and I saw 6,790 rounds of GC. The number that starts with @ is the time since program start, followed by a percentage that supposedly states the amount of time spent collecting garbage. This number is clearly misleading, because the performance indicates at least 90% of the time is wasted (indirectly) on GC, not 12%; the synchronization overhead is not taken into account. What is really interesting are the three numbers separated by arrows. They show the size of the heap at GC start, at GC end, and the live heap size. Remember that a collection is triggered when the ratio of freshly allocated data to live data remaining after the previous collection reaches this percentage, which defaults to 100%.

I am running a benchmark where all allocated data is immediately discarded and collected at the next GC cycle. The live heap is limited to the fixed needs of the Go runtime, and having more goroutines does not add to it. In contrast, the freshly allocated data grows much faster with each additional goroutine, triggering increasingly frequent, and expensive, GC cycles.

Clearly what I needed to do next was to run the benchmark with the GC disabled, by setting GOGC=off. This led to a dramatic improvement: ECDSA-P256 Sign,413740.30, op/s.

But that is still not the number I was looking for, and running an application without garbage collection is unsustainable in the long run. I started playing with the GOGC variable. First I set it to 2,400, which made sense since we have 24 cores; perhaps collecting garbage 24 times less frequently would do the trick: ECDSA-P256 Sign,671538.90, op/s. Oh my, that is getting better.

What if I tried 4,800, for the number of threads? ECDSA-P256 Sign,685810.90, op/s. Getting warmer.

I ran a script to find the best value, from 100 to 20,000, in increments of 100. This is what I got:

It looks like the optimal value for GOGC in that case is 11,300, which gets us 691,054 signatures/second. That is 22.56X faster than the single-core score, and overall pretty good for a 24-core processor. Remember that when running on a single core, the CPU frequency is 3.0GHz, and only 2.1GHz when running on all cores.

Per-goroutine performance when running with GOGC=11300 now looks like this:

The scaling looks much better, and even past 24 goroutines, when we run out of physical cores, and start sharing cores with hyper-threading, the overall performance improves.

The bottom line here is that although this type of benchmarking is definitely an edge case for garbage collection, where 48 threads allocate large amounts of short lived data, this situation can occur in real world scenarios. As many-core CPUs become a commodity, one should be aware of the pitfalls.

Most languages with garbage collection offer some sort of garbage collection control. Go has the GOGC variable, which can also be controlled with the SetGCPercent function in the runtime/debug package. Don’t be afraid to tune the GC to suit your needs.
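
As a minimal sketch of that knob (the value below is just the one found for this particular benchmark and machine, not a general recommendation):

package main

import (
    "fmt"
    "runtime/debug"
)

func main() {
    // Equivalent to running the program with GOGC=11300: the heap may grow to
    // 113x the live set between collections.
    old := debug.SetGCPercent(11300)
    fmt.Println("previous GOGC:", old)

    // A negative value disables the collector entirely, like GOGC=off.
    // debug.SetGCPercent(-1)
}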

We’re always looking for Go programmers, so if you found this blog post interesting, why not check out our jobs page?
Source: CloudFlare

Cloudflare Wants to Buy Your Meetup Group Pizza

If you’re a web dev / devops / etc. meetup group that also works toward building a faster, safer Internet, I want to support your awesome group by buying you pizza. If your group’s focus falls within one of the subject categories below and you’re willing to give us a 30 second shout out and tweet a photo of your group and @Cloudflare, your meetup’s pizza expense will be reimbursed.

Get Your Pizza $ Reimbursed »

Developer Relations at Cloudflare & why we’re doing this

I’m Andrew Fitch and I work on the Developer Relations team at Cloudflare. One of the things I like most about working in DevRel is empowering community members who are already doing great things out in the world. Whether they’re starting conferences, hosting local meetups, or writing educational content, I think it’s important to support them in their efforts and reward them for doing what they do. Community organizers are the glue that holds developers together socially. Let’s support them and make their lives easier by taking care of the pizza part of the equation.

What’s in it for Cloudflare?

We want web developers to target the apps platform
We want more people to think about working at Cloudflare
Some people only know of Cloudflare as a CDN, but it’s actually a security company and does much more than that. General awareness helps tell the story of what Cloudflare is about.

What kinds of groups we most want to support

We want to work with groups that are closely aligned with Cloudflare, groups which focus on web development, web security, devops, or tech ops. We often work with language-specific meetups (such as Go London User Group) or diversity & inclusion meetups (such as Women Who Code Portland). We’d also love to work with college student groups and any other coding groups which are focused on sharing technical knowledge with the community.

To get sponsored pizza, groups must be focused on…

Web performance & optimization
Front-end frameworks
Language-specific (JavaScript / PHP / Go / Lua / etc.)
Diversity & inclusion meetups for web devs / devops / tech ops
Workshops / classes for web devs / devops / tech ops

How it works

Interested groups need to read our Cloudflare Meetup Group Pizza Reimbursement Rules document.
If the group falls within the requirements, the group should proceed to schedule their meetup and pull the Cloudflare Introductory Slides ahead of time.
At the event, an organizer should present the Cloudflare Introductory Slides at the beginning, giving us a 30 second shout out, and take a photo of the group in front of the slides. Someone from the group should tweet the photo and tag @Cloudflare, so our Community Manager may retweet it.
After the event, an organizer will need to fill out our reimbursement form where they will upload a completed W-9, their mailing address, group details, and the link to the Tweet.
Within a month, the organizer should have a check from Cloudflare (and some laptop stickers for their group, if they want) in their hands.

Here are some examples of groups we’ve worked with so far

Go London User Group (GLUG)

The Go London User Group is a London-based community for anyone interested in the Go programming language.

GLUG provides offline opportunities to:

Discuss Go and related topics
Socialize with friendly people who are interested in Go
Find or fill Go-related jobs

They want GLUG to be a diverse and inclusive community. As such, all attendees, organizers, and sponsors are required to follow the community code of conduct.

This group’s mission is very much in alignment with Cloudflare’s guidelines, so we sponsored the pizza for their October Gophers event.

Women Who Code Portland

Women Who Code’s mission statement

Women Who Code is a global nonprofit dedicated to inspiring women to excel in technology careers by creating a global, connected community of women in technology.

What they offer

Monthly Networking Nights
Study Nights (JavaScript, React, DevOps, Design + Product, Algorithms)
Technical Workshops
Community Events (Hidden Figures screening, Women Who Strength Train, Happy Hours, etc.)
Free or discounted tickets to conferences

Women Who Code is a great example of an inclusive technical group. Cloudflare reimbursed them for their pizza at their October JavaScript Study Night.

GDG Phoenix / PHX Android

The GDG Phoenix / PHX Android group is for anyone interested in Android design and development. All skill levels are welcome. Participants are welcome to come to the meetups to hang out and watch, or to be part of the conversation at events.

This group is associated with the Google Developer Group program and coordinates activities with the national organization. Group activities are not exclusively Android-related; they can cover any technology.

Cloudflare sponsored their October event “How espresso works . . . Android talk with Michael Bailey”.

I hope a lot of awesome groups around the world will take advantage of our offer. Enjoy the pizza!

Get Your Pizza $ Reimbursed »

Source: CloudFlare

On the dangers of Intel's frequency scaling

While I was writing the post comparing the new Qualcomm server chip, Centriq, to our current stock of Intel Skylake-based Xeons, I noticed a disturbing phenomenon.

When benchmarking OpenSSL 1.1.1dev, I discovered that the performance of the cipher ChaCha20-Poly1305 does not scale very well. On a single thread, it performed at approximately 2.89GB/s, whereas on 24 cores and 48 threads it performed at just over 35GB/s.

CC BY-SA 2.0 image by blumblaum

Now this is a very high number, but I would like to see something closer to 69GB/s. 35GB/s is just 1.46GB/s/core, or roughly 50% of the single core performance. AES-GCM scales much better, to 80% of single core performance, which is understandable, because the CPU can sustain higher frequency turbo on a single core, but not all cores.

Why is the scaling of ChaCha20-Poly1305 so poor? Meet AVX-512. AVX-512 is a new Intel instruction set that adds many new 512-bit wide SIMD instructions and promotes most of the existing ones to 512-bit. The problem with such wide instructions is that they consume power. A lot of power. Imagine a single instruction that does the work of 64 regular byte instructions, or 8 full blown 64-bit instructions.

To keep power in check Intel introduced something called dynamic frequency scaling. It reduces the base frequency of the processor whenever AVX2 or AVX-512 instructions are used. This is not new, and has existed since Haswell introduced AVX2 three years ago.

The scaling gets worse when more cores execute AVX-512 and when multiplication is used.

If you only run AVX-512 code, then everything is good. The frequency is lower, but your overall productivity is higher, because each instruction does more work.

OpenSSL 1.1.1dev implements several variants of ChaCha20-Poly1305, including AVX2 and AVX-512 variants. BoringSSL implements a different AVX2 version of ChaCha20-Poly1305. It is understandable then why BoringSSL achieves only 1.6GB/s on a single core, compared to the 2.89GB/s OpenSSL does.

So how does this affect you if you mix a little AVX-512 with your real workload? We use Xeon Silver 4116 CPUs, with a base frequency of 2.1GHz, in a dual-socket configuration. From a figure I found on WikiChip, it seems that running AVX-512 on even just one core of this CPU will reduce the base frequency to 1.8GHz. Running AVX-512 on all cores will reduce it to just 1.4GHz.

Now imagine you run a webserver with Apache or NGINX. In addition you have many other services, performing some real, important work. What happens if you start encrypting your traffic with ChaCha20-Poly1305 using AVX-512? That is the question I asked myself.

I compiled two versions of NGINX, one with OpenSSL1.1.1dev and the other with BoringSSL, and installed it on our server with two Xeon Silver 4116 CPUs, for a total of 24 cores.

I configured the server to serve a medium sized HTML page, and perform some meaningful work on it. I used LuaJIT to remove line breaks and extra spaces, and brotli to compress the file.

I then monitored the number of requests per second served under full load. This is what I got:

By using ChaCha20-Poly1305 over AES-128-GCM, the server that uses OpenSSL serves 10% fewer requests per second. And that is a huge number! It is equivalent to giving up on two cores, for nothing. One might think that this is due to ChaCha20-Poly1305 being inherently slower. But that is not the case.

First, BoringSSL performs equivalently well with AES-GCM and ChaCha20-Poly1305.

Second, even when only 20% of the requests use ChaCha20-Poly1305, the server throughput drops by more than 7%, and by 5.5% when 10% of the requests are ChaCha20-Poly1305. For reference, 15% of the TLS requests Cloudflare handles are ChaCha20-Poly1305.

Finally, according to perf, the AVX-512 workload consumes only 2.5% of the CPU time when all the requests are ChaCha20-Poly1305, and less than 0.3% when only 10% of the requests are ChaCha20-Poly1305. Regardless, the CPU throttles down, because that is what it does when it sees AVX-512 running on all cores.

It is hard to say just how much each core is throttled at any given time, but doing some sampling using lscpu, I found out that when executing the openssl speed -evp chacha20-poly1305 -multi 48 benchmark, it shows CPU MHz: 1199.963, for OpenSSL with all AES-GCM connections I got CPU MHz: 2399.926 and for OpenSSL with all ChaCha20-Poly1305 connections I saw CPU MHz: 2184.338, which is obviously 9% slower.
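
If you want to reproduce that kind of sampling without lscpu, one option on Linux is to read the "cpu MHz" fields from /proc/cpuinfo while the workload runs; the following Go sketch is an illustrative approximation, not the exact method used above:

package main

import (
    "bufio"
    "fmt"
    "os"
    "strconv"
    "strings"
)

// averageCPUMHz reports the mean of the "cpu MHz" fields in /proc/cpuinfo,
// which reflects the current operating frequency of each logical CPU.
func averageCPUMHz() (float64, error) {
    f, err := os.Open("/proc/cpuinfo")
    if err != nil {
        return 0, err
    }
    defer f.Close()

    var sum float64
    var n int
    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        line := scanner.Text()
        if !strings.HasPrefix(line, "cpu MHz") {
            continue
        }
        parts := strings.SplitN(line, ":", 2)
        if len(parts) != 2 {
            continue
        }
        mhz, err := strconv.ParseFloat(strings.TrimSpace(parts[1]), 64)
        if err != nil {
            continue
        }
        sum += mhz
        n++
    }
    if n == 0 {
        return 0, scanner.Err()
    }
    return sum / float64(n), nil
}

func main() {
    mhz, err := averageCPUMHz()
    if err != nil {
        fmt.Println("error:", err)
        return
    }
    fmt.Printf("average CPU frequency: %.0f MHz\n", mhz)
}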

Another interesting distinction is that ChaCha20-Poly1305 with AVX2 is slightly slower in OpenSSL but is the same in BoringSSL. Why might that be? The reason here is that the BoringSSL code does not use AVX2 multiplication instructions for Poly1305, and only uses simple xor, shift and add operations for ChaCha20, which allows it to run at the base frequency.

OpenSSL 1.1.1dev is still in development, therefore I suspect no one is affected by this issue yet. We switched to BoringSSL months ago, and our server performance is not affected by this issue.

What the future holds is unclear. Intel has announced very cool new ISA extensions for future generations of CPUs that are expected to improve crypto performance even further. Those extensions include AVX512+VAES, AVX512+VPCLMULQDQ and AVX512IFMA. But if the frequency scaling issue is not resolved by then, using those for general purpose cryptography libraries will do (much) more harm than good.

The problem is not with cryptography libraries alone. OpenSSL did nothing wrong by trying to get the best possible performance, on the contrary, I wrote a decent amount of AVX-512 code for OpenSSL myself. The observed behavior is a sad side effect. There are many libraries that use AVX and AVX2 instructions out there, they will probably be updated to AVX-512 at some point, and users are not likely to be aware of the implementation details. If you do not require AVX-512 for some specific high performance tasks, I suggest you disable AVX-512 execution on your server or desktop, to avoid accidental AVX-512 throttling.
Source: CloudFlare

Privacy Pass – “The Math”

This is a guest post by Alex Davidson, a PhD student in Cryptography at Royal Holloway, University of London, who is part of the team that developed Privacy Pass. Alex worked at Cloudflare for the summer on deploying Privacy Pass on the Cloudflare network.

During a recent internship at Cloudflare, I had the chance to help integrate support for improving the accessibility of websites that are protected by the Cloudflare edge network. Specifically, I helped develop an open-source browser extension named ‘Privacy Pass’ and added support for the Privacy Pass protocol within Cloudflare infrastructure. Currently, Privacy Pass works with the Cloudflare edge to help honest users to reduce the number of Cloudflare CAPTCHA pages that they see when browsing the web. However, the operation of Privacy Pass is not limited to the Cloudflare use-case and we envisage that it has applications over a wider and more diverse range of applications as support grows.

In summary, this browser extension allows a user to generate cryptographically ‘blinded’ tokens that can then be signed by supporting servers after some proof of authenticity (e.g. a CAPTCHA solution). The browser extension can then use these tokens to ‘prove’ honesty in future communications with the server, without having to solve more authenticity challenges.

The ‘blind’ aspect of the protocol means that it is infeasible for a server to link tokens that it signs to tokens that are redeemed in the future. This means that a client using the browser extension should not compromise their own privacy with respect to the server they are communicating with.

In this blog post we hope to give more of an insight into how we have developed the protocol and the security considerations that we have taken into account. We have made use of some interesting and modern cryptographic techniques that we believe could have a future impact on a wide array of problems.

Previously…

The research team released a specification last year for a “blind signing” protocol (very similar to the original proposal of Chaum) using a variant of RSA known as ‘blind RSA’. Blind RSA simply uses the homomorphic properties of the textbook RSA signature scheme to allow the user to have messages signed obliviously. Since then, George Tankersley and Filippo Valsorda gave a talk at Real World Crypto 2017 explaining the idea in more detail and how the protocol could be implemented. The intuition behind a blind signing protocol is also given in Nick’s blog post.

A blind signing protocol between a server A and a client B roughly takes the following form:

B generates some value t that they require a signature from A for.
B calculates a ‘blinded’ version of t that we will call bt
B sends bt to A
A signs bt with their secret signing key and returns a signature bz to B
B receives bz and ‘unblinds’ to receive a signature z for value t.

Due to limitations arising from the usage of RSA (e.g. large signature sizes, slower operations), there were efficiency concerns surrounding the extra bandwidth and computation time on the client browser. Fortunately, we received a lot of feedback from many notable individuals (full acknowledgments below). In short, this helped us to come up with a protocol with much lower overheads in storage, bandwidth and computation time using elliptic curve cryptography as the foundation instead.

Elliptic curves (a very short introduction)

An elliptic curve is defined over a finite field modulo some prime p. Briefly, an (x,y) coordinate is said to lie on the curve if it satisfies the following equation:

y^2 = x^3 + a*x + b (modulo p)

Nick Sullivan wrote an introductory blog post on the use of elliptic curves in cryptography a while back, so this may be a good place to start if you’re new to the area.

Elliptic curves have been studied for use in cryptography since the independent works of Koblitz and Miller (1984-85). However, EC-based ciphers and signature algorithms have rapidly started replacing older primitives in the Internet-space due to large improvements in the choice of security parameters available. What this translates to is that encryption/signing keys can be much smaller in EC cryptography when compared to more traditional methods such as RSA. This comes with huge efficiency benefits when computing encryption and signing operations, thus making EC cipher suites perfect for use on an Internet-wide scale.

Importantly, there are many different elliptic curve configurations, defined by the choice of p, a, and b in the equation above. These present different security and efficiency trade-offs; some have been standardized by NIST. In this work, we will be using the NIST-specified P-256 curve; however, this choice is largely independent of the protocol that we have designed.
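
As a quick sanity check of the curve equation above, Go’s crypto/elliptic package exposes the P-256 parameters (for which a = -3) and can confirm that the standard base point satisfies it; this sketch is only illustrative:

package main

import (
    "crypto/elliptic"
    "fmt"
)

func main() {
    curve := elliptic.P256()
    params := curve.Params()

    // The standard base point (Gx, Gy) must satisfy
    // y^2 = x^3 - 3x + b (mod p) for P-256.
    fmt.Println("base point on curve:", curve.IsOnCurve(params.Gx, params.Gy)) // true
    fmt.Println("prime p bit length:", params.P.BitLen())                      // 256
}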

Blind signing via elliptic curves

Translating our blind signing protocol from RSA to elliptic curves required deriving a whole new protocol. Some of the suggestions pointed out cryptographic constructions known as “oblivious pseudorandom functions”. A pseudorandom function or PRF is a mainstay of the traditional cryptographic arsenal and essentially takes a key and some string as input and outputs some cryptographically random value.

Let F be our PRF, then the security requirement on such a function is that evaluating:

y = F(K,x)

is indistinguishable from evaluating:

y’ = f(x)

where f is a randomly chosen function with outputs defined in the same domain as F(K,-). Choosing a function f at random undoubtedly leads to random outputs, however for F, randomness is derived from the choice of key K. In practice, we would instantiate a PRF using something like HMAC-SHA256.
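
A minimal sketch of that instantiation, where the key K is the HMAC key and the input x is the message, so F(K,x) = HMAC-SHA256(K, x):

package main

import (
    "crypto/hmac"
    "crypto/sha256"
    "encoding/hex"
    "fmt"
)

// prf evaluates F(K, x) = HMAC-SHA256(K, x). Without knowledge of the key, the
// outputs are computationally indistinguishable from those of a random function.
func prf(key, x []byte) []byte {
    mac := hmac.New(sha256.New, key)
    mac.Write(x)
    return mac.Sum(nil)
}

func main() {
    y := prf([]byte("secret key K"), []byte("input x"))
    fmt.Println(hex.EncodeToString(y))
}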

Oblivious PRFs

An oblivious PRF (OPRF) is actually a protocol between a server S and a client C. In the protocol, S holds a key K for some PRF F and C holds an input x. The security goal is that C receives the output y = F(K,x) without learning the key K and S does not learn the value x.

It may seem difficult to construct such a functionality without revealing the input x or the key K. However, there are numerous (and very efficient) constructions of OPRFs with applications to many different cryptographic problems such as private set intersection, password-protected secret-sharing and cryptographic password storage to name a few.

OPRFs from elliptic curves

A simple instantiation of an OPRF from elliptic curves was given by Jarecki et al. JKK14; we use this as the foundation for our blind signing protocol.

Let G be a cyclic group of prime-order
Let H be a collision-resistant hash function hashing into G
Let k be a private key held by S
Let x be a private input held by C

The protocol now proceeds as:

C sends H(x) to S
S returns kH(x) to C

Clearly, this is an exceptionally simple protocol; its security is established since:

The collision-resistant hash function prevents S from reversing H(x) to learn x
The hardness of the discrete log problem (DLP) prevents C from learning k from kH(x)
The output kH(x) is pseudorandom since G is a prime-order group and k is chosen at random.

Blind signing via an OPRF

Using the OPRF design above as the foundation, the research team wrote a variation that we can use for a blind signing protocol; we detail this construction below. In our ‘blind signing’ protocol we require that:

The client/user can have random values signed obliviously by the edge server
The client can ‘unblind’ these values and present them in the future for verification
The edge can commit to the secret key publicly and prove that it is used for signing all tokens globally

The blind signing protocol is split into two phases.

Firstly, there is a blind signing phase that is carried out between the user and the edge after the user has successfully solved a challenge. The result is that the user receives a number of signed tokens (default 30) that are unblinded and stored for future use. Intuitively, this mirrors the execution of the OPRF protocol above.

Secondly, there is a redemption phase where an unblinded token is used for bypassing a future iteration of the challenge.

Let G be a cyclic group of prime-order q. Let H_1,H_2 be a pair of collision-resistant hash functions; H_1 hashes into the group G as before, H_2 hashes into a binary string of length n.

In the following, we will use slightly different notation to make it consistent with the existing literature. Let x be a private key held by the server S. Let t be the input held by the user/client C. Let ZZ_q be the ring of integers modulo q. We write all operations in their scalar multiplication form to be consistent with EC notation. Let MAC_K() be a message authentication code algorithm keyed by a key K.

Signing phase

C samples a random ‘blind’ r ← ZZ_q
C computes T = H_1(t) and then blinds it by computing rT
C sends M = rT to S
S computes Z = xM and returns Z to C
C computes (1/r)*Z = xT = N and stores the pair (t,N) for some point in the future

We think of T = H_1(t) as a token; these objects form the backbone of the protocol that we use to bypass challenges.
Notice that the only difference between this protocol and the OPRF above is the blinding factor r that we use.
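
The following Go sketch walks through the signing-phase arithmetic on P-256. Note that toyHashToCurve is only a placeholder for H_1: it maps t to a scalar multiple of the base point, which is not a real hash-to-curve construction and would be insecure in practice; the blind/sign/unblind steps otherwise mirror the list above:

package main

import (
    "crypto/elliptic"
    "crypto/rand"
    "crypto/sha256"
    "fmt"
    "math/big"
)

// toyHashToCurve is a stand-in for H_1. It is NOT a proper hash-to-curve
// construction and must not be used in a real deployment: mapping t to a known
// multiple of the base point leaks relations a real H_1 is meant to hide.
func toyHashToCurve(curve elliptic.Curve, t []byte) (x, y *big.Int) {
    h := sha256.Sum256(t)
    k := new(big.Int).SetBytes(h[:])
    k.Mod(k, curve.Params().N)
    return curve.ScalarBaseMult(k.Bytes())
}

func main() {
    curve := elliptic.P256()
    q := curve.Params().N

    // Server's private key x.
    xKey, _ := rand.Int(rand.Reader, q)

    // Client: token seed t, point T = H_1(t), random blind r, blinded point M = rT.
    t := []byte("token seed")
    Tx, Ty := toyHashToCurve(curve, t)
    r, _ := rand.Int(rand.Reader, q) // non-zero with overwhelming probability
    Mx, My := curve.ScalarMult(Tx, Ty, r.Bytes())

    // Server: Z = xM.
    Zx, Zy := curve.ScalarMult(Mx, My, xKey.Bytes())

    // Client: unblind with 1/r mod q, giving N = (1/r)Z, which should equal xT.
    rInv := new(big.Int).ModInverse(r, q)
    Nx, Ny := curve.ScalarMult(Zx, Zy, rInv.Bytes())

    // Sanity check: N == xT.
    Ex, Ey := curve.ScalarMult(Tx, Ty, xKey.Bytes())
    fmt.Println(Nx.Cmp(Ex) == 0 && Ny.Cmp(Ey) == 0) // true
}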

Redemption phase

C calculates request binding data req and chooses an unspent token (t,N)
C calculates a shared key sk = H_2(t,N) and sends (t, MAC_sk(req)) to S
S recalculates req’ based on the request data that it witnesses
S checks that t has not been spent already and calculates T = H_1(t), N = xT, and sk = H_2(t,N)
Finally S checks that MAC_sk(req’) =?= MAC_sk(req), and stores t to check against future redemptions

If all the steps above pass, then the server validates that the user has a validly signed token. When we refer to ‘passes’ we mean the pair (t, MAC_sk(req)) and if verification is successful the edge server grants the user access to the requested resource.
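
Continuing the sketch above (same package, with crypto/hmac added and the toy H_1 helper reused), the redemption check reduces to the server recomputing N = xT and comparing MACs; modelling H_2 as SHA-256 here is an illustrative assumption:

// deriveKey models sk = H_2(t, N) as SHA-256 over the token seed and the
// unblinded point N.
func deriveKey(t []byte, Nx, Ny *big.Int) []byte {
    h := sha256.New()
    h.Write(t)
    h.Write(Nx.Bytes())
    h.Write(Ny.Bytes())
    return h.Sum(nil)
}

// redeemTag is computed by the client: MAC_sk(req) over the request binding data.
func redeemTag(t, req []byte, Nx, Ny *big.Int) []byte {
    mac := hmac.New(sha256.New, deriveKey(t, Nx, Ny))
    mac.Write(req)
    return mac.Sum(nil)
}

// verifyRedemption is run by the server: it recomputes N = xT from t and its own
// key x, derives sk, and checks the MAC in constant time. A real server would
// also check that t has not already been spent.
func verifyRedemption(curve elliptic.Curve, x *big.Int, t, req, tag []byte) bool {
    Tx, Ty := toyHashToCurve(curve, t)
    Nx, Ny := curve.ScalarMult(Tx, Ty, x.Bytes())
    return hmac.Equal(redeemTag(t, req, Nx, Ny), tag)
}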

Cryptographic security of protocol

There are many different ways in which we need to ensure that the protocol remains “secure”. Clearly one of the main features is that the user remains anonymous in the transaction. Furthermore, we need to show that the client is unable to leverage the protocol in order to learn the private key of the edge, or arbitrarily gain infinite tokens. We give two security arguments for our protocol that we can easily reduce to cryptographic assumptions on the hardness of widely-used problems. There are a number of other security goals for the protocol but we consider the two arguments below as fundamental security requirements.

Unlinkability in the presence of an adversarial edge

Similarly to the RSA blind signing protocol, the blind r is used to prevent the edge from learning the value of T, above. Since r is not used in the redemption phase of the protocol, there is no way that the server can link a blinded token rT in the signing phase to any token in a given redemption phase. Since S recalculates T during redemption, it may be tempting to think that S could recover r from rT. However, the hardness of the discrete log problem prevents S from launching this attack. Therefore, the server has no knowledge of r.

As mentioned and similarly to the JKK14 OPRF protocol above, we rely on the hardness of standard cryptographic assumptions such as the discrete log problem (DLP), and collision-resistant hash functions. Using these hardness assumptions it is possible to write a proof of security in the presence of a dishonest server. The proof of security shows that assuming that these assumptions are hard, then a dishonest server is unable to link an execution of the signing phase with any execution of the redemption phase with probability higher than just randomly guessing.

Intuitively, in the signing phase, C sends randomly distributed data due to the blinding mechanism and so S cannot learn anything from this data alone. In the redemption phase, C unveils their token, but the transcript of the signing phase witnessed by S is essentially random and so it cannot be used to learn anything from the redemption phase.

This is not a full proof of security but gives an idea as to how we can derive cryptographic hardness for the underlying protocol. We hope to publish a more detailed cryptographic proof in the near future to accompany our protocol design.

Key privacy for the edge

It is also crucial to prove that the exchange does not reveal the secret key x to the user. If this were to happen, then the user would be able to arbitrarily sign their own tokens, giving them an effectively infinite supply.

Notice that the only time when the client is exposed to the key is when they receive Z = xM. In elliptic-curve terminology, the client receives their blinded token scalar multiplied with x. Notice, that this is also identical to the interaction that an adversary witnesses in the discrete log problem. In fact, if the client was able to compute x from Z, then the client would also be able to solve the DLP — which is thought to be very hard for established key sizes. In this way, we have a sufficient guarantee that an adversarial client would not be able to learn the key from the signing interaction.

Preventing further deanonymization attacks using “Verifiable” OPRFs

While the proof of security above gives some assurances about the cryptographic design of the protocol, it does not cover the possibility of out-of-band deanonymization. For instance, the edge server could sign tokens with a new secret key each time. Ignoring the cost that this would incur, the server would then be able to link token signing and redemption phases by simply checking the validation for each private key in use.

There is a solution known as a ‘discrete log equivalence proof’ (DLEQ proof). Using this, a server commits to a secret key x by publicly posting a pair (G, xG) for a generator G of the prime-order group G. A DLEQ proof intuitively allows the server to prove to the user that the signed tokens Z = xrT and commitment xG both have the same discrete log relation x. Since the commitment is posted publicly (similarly to a Certificate Transparency Log) this would be verifiable by all users and so the deanonymization attack above would not be possible.

DLEQ proofs

The DLEQ proof objects take the form of a Chaum-Pedersen CP93 non-interactive zero-knowledge (NIZK) proof. Similar proofs were used in JKK14 to show that their OPRF protocol produced “verifiable” randomness, they defined their construction as a VOPRF. In the following, we will describe how these proofs can be augmented into the signing phase above.

The DLEQ proof verification in the extension is still in development and is not completely consistent with the protocol below. We hope to complete the verification functionality in the near future.

Let M = rT be the blinded token that C sends to S, let (G,Y) = (G,xG) be the commitment from above, and let H_3 be a new hash function (modelled as a random oracle for security purposes). In the protocol below, we can think of S playing the role of the ‘prover’ and C the ‘verifier’ in a traditional NIZK proof system.

S computes Z = xM, as before.
S also samples a random nonce k ← ZZ_q and commits to the nonce by calculating A = kG and B = kM
S constructs a challenge c ← H_3(G,Y,M,Z,A,B) and computes s = k-cx (mod q)
S sends (c,s) to the user C
C recalculates A’ = sG + cY and B’ = sM + cZ and hashes c’ = H_3(G,Y,M,Z,A’,B’).
C verifies that c’ =?= c.

Note that correctness follows since

A’ = sG + cY = (k-cx)G + cxG = kG and B’ = sM + cZ = r(k-cx)T + crxT = krT = kM

We write DLEQ(Z/M == Y/G) to denote the proof that is created by S and validated by C.
In summary, if both parties have a consistent view of (G,Y) for the same epoch then the proof should verify correctly. As long as the discrete log problem remains hard to solve, then this proof remains zero-knowledge (in the random oracle model). For our use-case the proof verifies that the same key x is used for each invocation of the protocol, as long as (G,Y) does not change.
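
To make the algebra concrete, here is a self-contained Go sketch of proof creation and verification on P-256. H_3 is modelled as SHA-256 over the concatenated point coordinates reduced mod q, and the blinded token M is generated as a random point; both are simplifying assumptions rather than the exact encodings used by the extension:

package main

import (
    "crypto/elliptic"
    "crypto/rand"
    "crypto/sha256"
    "fmt"
    "math/big"
)

// hashToScalar models H_3: SHA-256 over the concatenated coordinates, reduced mod q.
func hashToScalar(q *big.Int, coords ...*big.Int) *big.Int {
    h := sha256.New()
    for _, c := range coords {
        h.Write(c.Bytes())
    }
    out := new(big.Int).SetBytes(h.Sum(nil))
    return out.Mod(out, q)
}

func main() {
    curve := elliptic.P256()
    params := curve.Params()
    q, Gx, Gy := params.N, params.Gx, params.Gy

    // Server key x, commitment Y = xG, a stand-in blinded token M, and Z = xM.
    x, _ := rand.Int(rand.Reader, q)
    Yx, Yy := curve.ScalarBaseMult(x.Bytes())
    m, _ := rand.Int(rand.Reader, q)
    Mx, My := curve.ScalarBaseMult(m.Bytes())
    Zx, Zy := curve.ScalarMult(Mx, My, x.Bytes())

    // Prover: nonce k, commitments A = kG and B = kM, challenge c, response s = k - cx mod q.
    k, _ := rand.Int(rand.Reader, q)
    Ax, Ay := curve.ScalarBaseMult(k.Bytes())
    Bx, By := curve.ScalarMult(Mx, My, k.Bytes())
    c := hashToScalar(q, Gx, Gy, Yx, Yy, Mx, My, Zx, Zy, Ax, Ay, Bx, By)
    s := new(big.Int).Sub(k, new(big.Int).Mul(c, x))
    s.Mod(s, q)

    // Verifier: A' = sG + cY, B' = sM + cZ, then recompute the challenge.
    sGx, sGy := curve.ScalarBaseMult(s.Bytes())
    cYx, cYy := curve.ScalarMult(Yx, Yy, c.Bytes())
    A2x, A2y := curve.Add(sGx, sGy, cYx, cYy)
    sMx, sMy := curve.ScalarMult(Mx, My, s.Bytes())
    cZx, cZy := curve.ScalarMult(Zx, Zy, c.Bytes())
    B2x, B2y := curve.Add(sMx, sMy, cZx, cZy)
    c2 := hashToScalar(q, Gx, Gy, Yx, Yy, Mx, My, Zx, Zy, A2x, A2y, B2x, B2y)

    fmt.Println(c2.Cmp(c) == 0) // true: the proof verifies
}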

Batching the proofs

Unfortunately, a drawback of the proof above is that it has to be instantiated for each individual token sent in the protocol. Since we send 30 tokens by default, this would require the server to also send 30 DLEQ proofs (with two EC elements each) and the client to verify each proof individually.

Interestingly, Henry showed that it was possible to batch the above NIZK proofs into one object with only one verification required Hen14. Using this batching technique substantially reduces the communication and computation cost of including the proof.

Let n be the number of tokens to be signed in the interaction, so we have M_i = r_i*T_i for the set of blinded tokens corresponding to inputs t_i.

S generates corresponding Z_i = x*M_i
S also computes a seed z = H_3(G,Y,M_1,…,M_n,Z_1,…,Z_n)
S then initializes a pseudorandom number generator PRNG with the seed z and outputs c_1, … , c_n ← PRNG(z) where the output domain of PRNG is ZZ_q
S generates composite group elements:

M = (c_1*M_1) + … + (c_n*M_n), Z = (c_1*Z_1) + … + (c_n*Z_n)

S calculates (c,s) ← DLEQ(Z/M == Y/G) over the composite elements and sends (c,s) to C, where DLEQ(Z/M == Y/G) refers to the proof protocol used in the non-batching case.
C computes c’_1, … , c’_n ← PRNG(z) and re-computes M’, Z’ and checks that c’ =?= c

To see why this works, consider the reduced case where n = 2:

Z_1 = x(M_1),
Z_2 = x(M_2),
(c_1*Z_1) = c_1(x*M_1) = x(c_1*M_1),
(c_2*Z_2) = c_2(x*M_2) = x(c_2*M_2),
(c_1*Z_1) + (c_2*Z_2) = x[(c_1*M_1) + (c_2*M_2)]

Therefore, all the elliptic curve points will have the same discrete log relation as each other, and hence equal to the secret key that is committed to by the edge.
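
A short Go sketch of the composite computation (assuming the crypto/elliptic, crypto/sha256, and math/big imports from the earlier sketches): the coefficients c_i are derived here from SHA-256 over the seed and a counter, an illustrative stand-in for the PRNG the extension actually uses:

// coefficient derives c_i from the shared seed z and the token index, reduced into ZZ_q.
func coefficient(q *big.Int, seed []byte, i int) *big.Int {
    h := sha256.New()
    h.Write(seed)
    h.Write([]byte{byte(i >> 8), byte(i)})
    c := new(big.Int).SetBytes(h.Sum(nil))
    return c.Mod(c, q)
}

// compositePoints folds the blinded tokens M_i and signed tokens Z_i into the
// single pair (M, Z) = (sum c_i*M_i, sum c_i*Z_i) covered by one DLEQ proof.
func compositePoints(curve elliptic.Curve, seed []byte, Ms, Zs [][2]*big.Int) (Mx, My, Zx, Zy *big.Int) {
    q := curve.Params().N
    for i := range Ms {
        ci := coefficient(q, seed, i).Bytes()
        mx, my := curve.ScalarMult(Ms[i][0], Ms[i][1], ci)
        zx, zy := curve.ScalarMult(Zs[i][0], Zs[i][1], ci)
        if i == 0 {
            Mx, My, Zx, Zy = mx, my, zx, zy
            continue
        }
        Mx, My = curve.Add(Mx, My, mx, my)
        Zx, Zy = curve.Add(Zx, Zy, zx, zy)
    }
    return
}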

Benefits of V-OPRF vs blind RSA

While the blind RSA specification that we released fulfilled our needs, we make the following concrete gains:

Simpler, faster primitives
10x savings in pass size (~256 bits using P-256 instead of ~2048)
The only thing the edge has to manage is a private scalar. No certificates.
No need for public-key encryption at all, since the derived shared key used to calculate each MAC is never transmitted and cannot be found from passive observation without knowledge of the edge key or the user’s blinding factor.
Exponentiations are more efficient due to use of elliptic curves.
Easier key rotation. Instead of managing certificates pinned in TBB and submitted to CT, we can use the DLEQ proofs to allow users to positively verify they’re in the same anonymity set with regard to the edge secret key as everyone else.

Download

Privacy Pass v1.0 is available as a browser extension for Chrome and Firefox. If you find any issues while using it, let us know.

Source code

The code for the browser extension and server has been open-sourced and can be found at https://github.com/privacypass/challenge-bypass-extension and https://github.com/privacypass/challenge-bypass-server respectively. We are welcoming contributions if you happen to notice any improvements that can be made to either component. If you would like to get in contact with the Privacy Pass team then find us at our website.

Protocol details

More information about the protocol can be found here.

Acknowledgements

The creation of Privacy Pass has been a joint effort by the team made up of George Tankersley, Ian Goldberg, Nick Sullivan, Filippo Valsorda and myself.

I’d also like to thank Eric Tsai for creating the logo and extension design, Dan Boneh for helping us develop key parts of the protocol, as well as Peter Wu and Blake Loring for their helpful code reviews. We would also like to acknowledge Sharon Goldberg, Christopher Wood, Peter Eckersley, Brian Warner, Zaki Manian, Tony Arcieri, Prateek Mittal, Zhuotao Liu, Isis Lovecruft, Henry de Valence, Mike Perry, Trevor Perrin, Zi Lin, Justin Paine, Marek Majkowski, Eoin Brady, Aaran McGuire, and many others who were involved in one way or another and whose efforts are appreciated.

References

Cha82: Chaum. Blind signatures for untraceable payments. CRYPTO ’82.
CP93: Chaum, Pedersen. Wallet Databases with Observers. CRYPTO ’92.
Hen14: Ryan Henry. Efficient Zero-Knowledge Proofs and Applications. August 2014.
JKK14: Jarecki, Kiayias, Krawczyk. Round-Optimal Password-Protected Secret Sharing and T-PAKE in the Password-Only Model.
JKKX16: Jarecki, Kiayias, Krawczyk, Xu. Highly-Efficient and Composable Password-Protected Secret Sharing.

Cloudflare supports Privacy Pass

Enabling anonymous access to the web with privacy-preserving cryptography

Cloudflare supports Privacy Pass, a recently-announced privacy-preserving protocol developed in collaboration with researchers from Royal Holloway and the University of Waterloo. Privacy Pass leverages an idea from cryptography — zero-knowledge proofs — to let users prove their identity across multiple sites anonymously without enabling tracking. Users can now use the Privacy Pass browser extension to reduce the number of challenge pages presented by Cloudflare. We are happy to support this protocol and believe that it will help improve the browsing experience for some of the Internet’s least privileged users.

The Privacy Pass extension is available for both Chrome and Firefox. When people use anonymity services or shared IPs, it makes it more difficult for website protection services like Cloudflare to identify their requests as coming from legitimate users and not bots. Privacy Pass helps reduce the friction for these users—which include some of the most vulnerable users online—by providing them a way to prove that they are a human across multiple sites on the Cloudflare network. This is done without revealing their identity, and without exposing Cloudflare customers to additional threats from malicious bots. As the first service to support Privacy Pass, we hope to help demonstrate its usefulness and encourage more Internet services to adopt it.

Adding support for Privacy Pass is part of a broader initiative to help make the Internet accessible to as many people as possible. Because Privacy Pass will only be used by a small subset of users, we are also working on other improvements to our network in service of this goal. For example, we are making improvements in our request categorization logic to better identify bots and to improve the web experience for legitimate users who are negatively affected by Cloudflare’s current bot protection algorithms. As this system improves, users should see fewer challenges and site operators should see fewer requests from unwanted bots. We consider Privacy Pass a piece of this puzzle.

Privacy Pass is fully open source under a BSD license and the code is available on GitHub. We encourage anyone who is interested to download the source code, play around with the implementations and contribute to the project. The Pass Team have also open sourced a reference implementation of the server in Go if you want to test both sides of the system. Privacy Pass support at Cloudflare is currently in beta. If you find a bug, please let the team know by creating an issue on GitHub.

In this blog post I’ll be going into depth about the problems that motivated our support for this project and how you can use it to reduce the annoyance factor of CAPTCHAs and other user challenges online.

Enabling universal access to content

Cloudflare believes that the web is for everyone. This includes people who are accessing the web anonymously or through shared infrastructure. Tools like VPNs are useful for protecting your identity online, and people using these tools should have the same access as everyone else. We believe the vast collection of information and services that make up the Internet should be available to every person.

In a blog post last year, our CEO, Matthew Prince, spoke about the tension between security, anonymity, and convenience on the Internet. He posited that in order to secure a website or service while still allowing anonymous visitors, you have to sacrifice a bit of convenience for these users. This tradeoff is something that every website or web service has to make.

The Internet is full of bad actors. The frequency and severity of online attacks is rising every year. This turbulent environment not only threatens websites and web services with attacks, it threatens their ability to stay online. As smaller and more diverse sites become targets of anonymous threats, a greater percentage of the Internet will choose to sacrifice user convenience in order to stay secure and universally accessible.

The average Internet user visits dozens of sites and services every day. Jumping through a hoop or two when trying to access a single website is not that big of a problem for people. Having to do that for every site you visit every day can be exhausting. This is the problem that Privacy Pass is perfectly designed to solve.

Privacy Pass doesn’t completely eliminate this inconvenience. Matthew’s trilemma still applies: anonymous users are still inconvenienced for sites that want security. What Privacy Pass does is to notably reduce that inconvenience for users with access to a browser. Instead of having to be inconvenienced thirty times to visit thirty different domains, you only have to be inconvenienced once to gain access to thirty domains on the Cloudflare network. Crucially, unlike unauthorized services like CloudHole, Privacy Pass is designed to respect user privacy and anonymity. This is done using privacy-preserving cryptography, which prevents Cloudflare or anyone else from tracking a user’s browsing across sites. Before we go into how this works, let’s take a step back and take a look at why this is necessary.

Am I a bot or not?

[Image: D J Shin, Creative Commons Attribution-Share Alike 3.0 Unported]

Without explicit information about the identity of a user, a web server has to rely on fuzzy signals to guess which request is from a bot and which is from a human. For example, bots often use automated scripts instead of web browsers to do their crawling. The way in which scripts make web requests often differs in subtle ways from how a web browser would make the same request.

A simple way for a user to prove they are not a bot to a website is by logging in. By providing valid authentication credentials tied to a long-term identity, a user is exchanging their anonymity for convenience. Having valid authentication credentials is a strong signal that a request is not from a bot. Typically, if you authenticate yourself to a website (say by entering your username and password) the website sets what’s called a “cookie”. A cookie is just a piece of data with an expiration date that’s stored by the browser. As long as the cookie hasn’t expired, the browser includes it as part of the subsequent requests to the server that set it. Authentication cookies are what websites use to know whether you’re logged in or not. Cookies are only sent on the domain that set them. A cookie set by site1.com is not sent for requests to site2.com. This prevents identity leakage from one site to another.

A request with an authentication cookie is usually not from a bot, so bot detection is much easier for sites that require authentication. Authentication is by definition de-anonymizing, so putting this in terms of Matthew’s trilemma, these sites can have security and convenience because they provide no anonymous access. The web would be a very different place if every website required authentication to display content, so this signal can only be used for a small set of sites. The question for the rest of the Internet becomes: without authentication cookies, what else can be used as a signal that a user is a person and not a bot?

The Turing Test

One thing that can be used is a user challenge: a task that the server asks the user to do before showing content. User challenges can come in many forms, from a proof-of-work to a guided tour puzzle to the classic CAPTCHA. A CAPTCHA (an acronym for “Completely Automated Public Turing test to tell Computers and Humans Apart”) is a test to see if the user is a human or not. It often involves reading some scrambled letters or identifying certain slightly obscured objects — tasks that humans are generally better at than automated programs. The goal of a user challenge is not only to deter bots, but to gain confidence that a visitor is a person. Cloudflare uses a combination of different techniques as user challenges.

CAPTCHAs can be annoying and time-consuming to solve, so they are usually reserved for visitors with a high probability of being malicious.

The challenge system Cloudflare uses is cookie-based. If you solve a challenge correctly, Cloudflare will set a cookie called CF_CLEARANCE for the domain that presented the challenge. Clearance cookies are like authentication cookies, but instead of being tied to an identity, they are tied to the fact that you solved a challenge sometime in the past.

Person sends Request
Server responds with a challenge
Person sends solution
Server responds with set-cookie and bypass cookie
Person sends new request with cookie
Server responds with content from origin
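
For illustration only, a toy Go handler following the same shape (serve a challenge, then set a clearance-style cookie) might look like the sketch below. The cookie name, token format and validation logic here are entirely hypothetical stand-ins and are not how Cloudflare’s challenge system is implemented.

package main

import (
	"net/http"
	"time"
)

// challengeSolved and newClearanceToken are hypothetical stand-ins for real
// challenge validation and token minting.
func challengeSolved(r *http.Request) bool { return r.FormValue("answer") != "" }
func newClearanceToken() string            { return "opaque-signed-token" }

func handler(w http.ResponseWriter, r *http.Request) {
	// A valid clearance cookie means a challenge was solved in the past.
	if _, err := r.Cookie("clearance"); err == nil {
		w.Write([]byte("origin content"))
		return
	}
	// A correct challenge solution earns a clearance cookie for this domain.
	if r.Method == http.MethodPost && challengeSolved(r) {
		http.SetCookie(w, &http.Cookie{
			Name:     "clearance",
			Value:    newClearanceToken(),
			Expires:  time.Now().Add(30 * time.Minute),
			Path:     "/",
			HttpOnly: true,
		})
		w.Write([]byte("origin content"))
		return
	}
	// Otherwise serve the challenge page.
	w.WriteHeader(http.StatusForbidden)
	w.Write([]byte("challenge page"))
}

func main() {
	http.HandleFunc("/", handler)
	http.ListenAndServe(":8080", nil)
}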

Site visitors who are able to solve a challenge are much more likely to be people than bots; the harder the challenge, the more likely the visitor is a person. The presence of a valid CF_CLEARANCE cookie is a strong positive signal that a request is from a legitimate person.

How Privacy Pass protects your privacy: a voting analogy

You can use cryptography to prove that you have solved a challenge of a certain difficulty without revealing which challenge you solved. The technique that enables this is something called a Zero-knowledge proof. This may sound scary, so let’s use a real-world scenario, vote certification, to explain the idea.

In some voting systems the operators of the voting center certify every ballot before sending them to be counted. This is to prevent people from adding fraudulent ballots while the ballots are being transferred from where the vote takes place to where the vote is counted.

An obvious mechanism would be to have the certifier sign every ballot that a voter submits. However, this would mean that the certifier, having just seen the person that handed them a ballot, would know how each person voted. Instead, we can use a better mechanism that preserves voters’ privacy using an envelope and some carbon paper.

The voter fills out their ballot
The voter puts their ballot into an envelope along with a piece of carbon paper, and seals the envelope
The sealed envelope is given to the certifier
The certifier signs the outside of the envelope. The pressure of the signature transfers the signature from the carbon paper to the ballot itself, effectively signing the ballot.
Later, when the ballot counter unseals the envelope, they see the certifier’s signature on the ballot.

With this system, a voting administrator can authenticate a ballot without knowing its content, and then the ballot can be verified by an independent assessor.

Privacy Pass is like vote certification for the Internet. In this analogy, Cloudflare’s challenge checking service is the vote certifier, Cloudflare’s bot detection service is the vote counter and the anonymous visitor is the voter. When a user encounters a challenge on site A, they put a ballot into a sealed envelope and send it to the server along with the challenge solution. The server then signs the envelope and returns it to the client. Since the server is effectively signing the ballot without knowing its contents, this is called a blind signature.

When the user sees a challenge on site B, the user takes the ballot out of the envelope and sends it to the server. The server then checks the signature on the ballot, which proves that the user has solved a challenge. Because the server has never seen the contents of the ballot, it doesn’t know which site the challenge was solved for, just that a challenge was solved.

It turns out that with the right cryptographic construction, you can approximate this scenario digitally. This is the idea behind Privacy Pass.

The Privacy Pass team implemented this using a privacy-preserving cryptographic construction called an Elliptic Curve Verifiable Oblivious Pseudo-Random Function (EC-VOPRF). Yes, it’s a mouthful. From the Privacy Pass Team:

Every time the Privacy Pass plugin needs a new set of privacy passes, it creates a set of thirty random numbers t1 to t30, hashes them into a curve (P-256 in our case), blinds them with a value b and sends them along with a challenge solution. The server returns the set of points multiplied by its private key and a batch discrete logarithm equivalence proof. Each pair tn, HMAC(n,M) constitutes a Privacy Pass and can be redeemed to solve a subsequent challenge. Voila!

If none of these words make sense to you and you want to know more, check out the Privacy Pass team’s protocol design document.
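
To give a feel for the math in that quote, here is a toy Go sketch of the blinding and unblinding steps over P-256. Hashing a token to the curve is approximated by multiplying the base point by a hash of the token, which a real implementation must not do; treat this purely as an illustration of how the blinding factor b and its inverse cancel, not as the extension’s actual code.

package blindsketch

import (
	"crypto/elliptic"
	"crypto/rand"
	"crypto/sha256"
	"math/big"
)

var curve = elliptic.P256()

// blind maps a random token t to a curve point T and blinds it: M = b*T.
// The blinding factor b is returned so the client can later unblind.
func blind(t []byte) (b, Mx, My *big.Int) {
	th := sha256.Sum256(t)
	Tx, Ty := curve.ScalarBaseMult(th[:]) // stand-in for a proper hash-to-curve
	b, _ = rand.Int(rand.Reader, curve.Params().N)
	Mx, My = curve.ScalarMult(Tx, Ty, b.Bytes())
	return
}

// unblind recovers N = x*T from the server's response Z = x*M = x*b*T by
// multiplying with the inverse of the blinding factor: N = b^-1 * Z.
func unblind(b, Zx, Zy *big.Int) (Nx, Ny *big.Int) {
	bInv := new(big.Int).ModInverse(b, curve.Params().N)
	return curve.ScalarMult(Zx, Zy, bInv.Bytes())
}

The pair (t, N) is what a pass boils down to: only someone who knows the server’s key x could have produced N = xT, so a key derived from it can later be used to MAC a redemption request without revealing where the pass was issued.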

Making it work in the browser

It takes more than a nice security protocol based on solid cryptography to make something useful in the real world. To bring the advantages of this protocol to users, the Privacy Pass team built a client in JavaScript and packaged it using WebExtensions, a cross-browser framework for developing applications that run in the browser and modify website behavior. This standard is compatible with both Chrome and Firefox. A reference implementation of the server side of the protocol was also implemented in Go.

If you’re a web user and are annoyed by CAPTCHAs, you can download the Privacy Pass extension for Chrome here and for Firefox here. It will significantly improve your web browsing experience. Once it is installed, you’ll see a small icon on your browser with a number under it. The number is how many unused privacy passes you have. If you are running low on passes, simply click on the icon and select “Get More Passes,” which will load a CAPTCHA you can solve in exchange for thirty passes. Every time you visit a domain that requires a user challenge page to view, Privacy Pass will “spend” a pass and the content will load transparently. Note that you may see more than one pass spent when you load a site for the first time if the site pulls in subresources from multiple domains.

The Privacy Pass extension works by hooking into the browser and looking for HTTP responses that have a specific header that indicates support for the Privacy Pass protocol. When a challenge page is returned, the extension will either try to issue new privacy passes or redeem existing privacy passes. The cryptographic operations in the plugin were built on top of SJCL.

If you’re a Cloudflare customer and want to opt out from supporting Privacy Pass, please contact our support team and they will disable it for you. We are soon adding a toggle for Privacy Pass in the Firewall app in the Cloudflare dashboard.

The web is for everyone

The technology behind Privacy Pass is free for anyone to use. We see a bright future for this technology and think it will benefit from community involvement. The protocol is currently only deployed at Cloudflare, but it could easily be used across different organizations. It’s easy to imagine obtaining a Privacy Pass that proves that you have a Twitter or Facebook identity and using it to access other services on the Internet without revealing your identity, for example. There are a wide variety of applications of this technology that extend well beyond our current use cases.

If this technology is intriguing to you and you want to collaborate, please reach out to the Privacy Pass team on GitHub.

ARM Takes Wing: Qualcomm vs. Intel CPU comparison

One of the nicer perks I have here at Cloudflare is access to the latest hardware, long before it even reaches the market.

Until recently I mostly played with Intel hardware. For example Intel supplied us with an engineering sample of their Skylake based Purley platform back in August 2016, to give us time to evaluate it and optimize our software. As a former Intel Architect, who did a lot of work on Skylake (as well as Sandy Bridge, Ivy Bridge and Icelake), I really enjoy that.

Our previous generation of servers was based on the Intel Broadwell micro-architecture. Our configuration includes dual-socket Xeons E5-2630 v4, with 10 cores each, running at 2.2GHz, with a 3.1GHz turbo boost and hyper-threading enabled, for a total of 40 threads per server.

Since Intel was, and still is, the undisputed leader of the server CPU market with greater than 98% market share, our upgrade process until now was pretty straightforward: every year Intel releases a new generation of CPUs, and every year we buy them. In the process we usually get two extra cores per socket, and all the extra architectural features such upgrade brings: hardware AES and CLMUL in Westmere, AVX in Sandy Bridge, AVX2 in Haswell, etc.

In the current upgrade cycle, our next server processor ought to be the Xeon Silver 4116, also in a dual-socket configuration. In fact, we have already purchased a significant number of them. Each CPU has 12 cores, but it runs at a lower frequency of 2.1GHz, with 3.0GHz turbo boost. It also has a smaller last level cache: 1.375MiB/core, compared to the 2.5MiB/core the Broadwell processors had. In addition, the Skylake based platform supports 6 memory channels and the AVX-512 instruction set.

As we head into 2018, however, change is in the air. For the first time in a while, Intel has serious competition in the server market: Qualcomm and Cavium both have new server platforms based on the ARMv8 64-bit architecture (aka aarch64 or arm64). Qualcomm has the Centriq platform (codename Amberwing), based on the Falkor core, and Cavium has the ThunderX2 platform, based on the, ahm … ThunderX2 core?

The majestic Amberwing powered by the Falkor CPU CC BY-SA 2.0 image by DrPhotoMoto

Recently, both Qualcomm and Cavium provided us with engineering samples of their ARM based platforms, and in this blog post I would like to share my findings about Centriq, the Qualcomm platform.

The actual Amberwing in question

Overview

I tested the Qualcomm Centriq server, and compared it with our newest Intel Skylake based server and previous Broadwell based server.

Platform           Grantley (Intel)            Purley (Intel)              Centriq (Qualcomm)
Core               Broadwell                   Skylake                     Falkor
Process            14nm                        14nm                        10nm
Issue              8 µops/cycle                8 µops/cycle                8 instructions/cycle
Dispatch           4 µops/cycle                5 µops/cycle                4 instructions/cycle
# Cores            10 x 2S + HT (40 threads)   12 x 2S + HT (48 threads)   46
Frequency          2.2GHz (3.1GHz turbo)       2.1GHz (3.0GHz turbo)       2.5GHz
LLC                2.5 MB/core                 1.375 MB/core               1.25 MB/core
Memory Channels    4                           6                           6
TDP                170W (85W x 2S)             170W (85W x 2S)             120W
Other features     AES, CLMUL, AVX2            AES, CLMUL, AVX-512         AES, CLMUL, NEON, TrustZone, CRC32

Overall on paper Falkor looks very competitive. In theory a Falkor core can process 8 instructions/cycle, same as Skylake or Broadwell, and it has higher base frequency at a lower TDP rating.

Ecosystem readiness

Up until now, a major obstacle to the deployment of ARM servers was the lack of (or weak) support from the majority of software vendors. In the past two years, ARM’s enablement efforts have paid off, as most Linux distros, as well as most popular libraries, now support the 64-bit ARM architecture. Driver availability, however, is unclear at this point.

At Cloudflare we run a complex software stack that consists of many integrated services, and running each of them efficiently is top priority.

On the edge we have the NGINX server software, which does support ARMv8. NGINX is written in C, and it also uses several libraries written in C, such as zlib and BoringSSL, so solid C compiler support is very important.

In addition our flavor of NGINX is highly integrated with the lua-nginx-module, and we rely a lot on LuaJIT.

Finally a lot of our services, such as our DNS server, RRDNS, are written in Go.

The good news is that both gcc and clang not only support ARMv8 in general, but have optimization profiles for the Falkor core.

Go has official support for ARMv8 as well, and they improve the arm64 backend constantly.

As for LuaJIT, the stable version (2.0.5) does not support ARMv8, but the beta version (2.1.0) does. Let’s hope it gets out of beta soon.

Benchmarks

OpenSSL

The first benchmark I wanted to perform was of OpenSSL version 1.1.1 (development version), using the bundled openssl speed tool. Although we recently switched to BoringSSL, I still prefer OpenSSL for benchmarking, because it has almost equally well optimized assembly code paths for both ARMv8 and the latest Intel processors.

In my opinion handcrafted assembly is the best measure of a CPU’s potential, as it bypasses the compiler bias.

Public key cryptography

Public key cryptography is all about raw ALU performance. It is interesting, but not surprising to see that in the single core benchmark, the Broadwell core is faster than Skylake, and both in turn are faster than Falkor. This is because Broadwell runs at a higher frequency, while architecturally it is not much inferior to Skylake.

Falkor is at a disadvantage here. First, in a single core benchmark, the turbo is engaged, meaning the Intel processors run at a higher frequency. Second, in Broadwell, Intel introduced two special instructions to accelerate big number multiplication: ADCX and ADOX. These perform two independent add-with-carry operations per cycle, whereas ARM can only do one. Similarly, the ARMv8 instruction set does not have a single instruction that produces the full 128-bit result of a 64-bit multiplication; instead it uses a pair of MUL and UMULH instructions.

Nevertheless, at the SoC level, Falkor wins big time. It is only marginally slower than Skylake at an RSA2048 signature, and only because RSA2048 does not have an optimized implementation for ARM. The ECDSA performance is ridiculously fast. A single Centriq chip can satisfy the ECDSA needs of almost any company in the world.

It is also very interesting to see Skylake outperform Broadwell by a 30% margin, despite losing the single core benchmark, and only having 20% more cores. This can be explained by more efficient all-core turbo, and improved hyper-threading.

Symmetric key cryptography

Symmetric key performance of the Intel cores is outstanding.

AES-GCM uses a combination of special hardware instructions to accelerate AES and CLMUL (carryless multiplication). Intel first introduced those instructions back in 2010, with their Westmere CPU, and every generation since they have improved their performance. ARM introduced a set of similar instructions just recently, with their 64-bit instruction set, and as an optional extension. Fortunately every hardware vendor I know of implemented those. It is very likely that Qualcomm will improve the performance of the cryptographic instructions in future generations.

ChaCha20-Poly1305 is a more generic algorithm, designed in such a way as to better utilize wide SIMD units. The Qualcomm CPU only has the 128-bit wide NEON SIMD, while Broadwell has 256-bit wide AVX2, and Skylake has 512-bit wide AVX-512. This explains the huge lead Skylake has over both in single core performance. In the all-cores benchmark the Skylake lead lessens, because it has to lower the clock speed when executing AVX-512 workloads. When executing AVX-512 on all cores, the base frequency goes down to just 1.4GHz—keep that in mind if you are mixing AVX-512 and other code.

The bottom line for symmetric crypto is that although Skylake has the lead, Broadwell and Falkor both have good enough performance for any real life scenario, especially considering the fact that on our edge, RSA consumes more CPU time than all of the other crypto algorithms combined.

Compression

The next benchmark I wanted to see was compression. This is for two reasons. First, it is a very important workload on the edge, as having better compression saves bandwidth, and helps deliver content faster to the client. Second, it is a very demanding workload, with a high rate of branch mispredictions.

Obviously the first benchmark would be the popular zlib library. At Cloudflare we use an improved version of the library, optimized for 64-bit Intel processors, and although it is written mostly in C, it does use some Intel specific intrinsics. Comparing this optimized version to the generic zlib library wouldn’t be fair. Not to worry, with little effort I adapted the library to work very well on the ARMv8 architecture, with the use of NEON and CRC32 intrinsics. The result is twice as fast as the generic library for some files.

The second benchmark is the emerging brotli library. It is written in C, which allows for a level playing field for all platforms.

All the benchmarks are performed on the HTML of blog.cloudflare.com, in memory, similar to the way NGINX performs streaming compression. The size of the specific version of the HTML file is 29,329 bytes, making it a good representative of the type of files we usually compress. The parallel benchmark compresses multiple files in parallel, as opposed to compressing a single file on many threads, also similar to the way NGINX works.

gzip

When using gzip, at the single core level Skylake is the clear winner. Despite having a lower frequency than Broadwell, it seems that a lower branch misprediction penalty helps it pull ahead. The Falkor core is not far behind, especially at lower quality settings. At the system level Falkor performs significantly better, thanks to the higher core count. Note how well gzip scales on multiple cores.

brotli

With brotli on single core the situation is similar. Skylake is the fastest, but Falkor is not very much behind, and with quality setting 9, Falkor is actually faster. Brotli with quality level 4 performs very similarly to gzip at level 5, while actually compressing slightly better (8,010B vs 8,187B).

When performing many-core compression, the situation becomes a bit messy. For levels 4, 5 and 6 brotli scales very well. At levels 7 and 8 we start seeing lower performance per core, bottoming out at level 9, where running on all cores gives less than 3x the single-core performance.

My understanding is that at those quality levels Brotli consumes significantly more memory, and starts thrashing the cache. The scaling improves again at levels 10 and 11.

Bottom line for brotli: Falkor wins, since we would not consider going above quality 7 for dynamic compression.

Golang

Golang is another very important language for Cloudflare. It is also one of the first languages to offer ARMv8 support, so one would expect good performance. I used some of the built-in benchmarks, but modified them to run on multiple goroutines.
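
As a rough idea of the kind of modification involved (a sketch, not the exact harness used for these numbers), a library benchmark can be spread across all hardware threads with testing.B.RunParallel:

package bench

import (
	"bytes"
	"compress/gzip"
	"testing"
)

// sample is a stand-in payload; the real benchmarks use the library's own inputs.
var sample = bytes.Repeat([]byte("<html>hello, cloudflare</html>\n"), 1000)

// BenchmarkGzipParallel runs one compression loop per GOMAXPROCS goroutine.
func BenchmarkGzipParallel(b *testing.B) {
	b.SetBytes(int64(len(sample)))
	b.RunParallel(func(pb *testing.PB) {
		var buf bytes.Buffer
		for pb.Next() {
			buf.Reset()
			zw, _ := gzip.NewWriterLevel(&buf, gzip.BestSpeed)
			zw.Write(sample)
			zw.Close()
		}
	})
}

Running it with, for example, go test -bench . -cpu 46 exercises all logical CPUs at once.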

Go crypto

I would like to start the benchmarks with crypto performance. Thanks to OpenSSL we have good reference numbers, and it is interesting to see just how good the Go library is.

As far as Go crypto is concerned, ARM and Intel are not even on the same playing field. Go has very optimized assembly code for ECDSA, AES-GCM and ChaCha20-Poly1305 on Intel. It also has Intel optimized math functions, used in RSA computations. All of those are missing for ARMv8, putting it at a big disadvantage.

Nevertheless, the gap can be bridged with a relatively small effort, and we know that with the right optimizations, performance can be on par with OpenSSL. Even a very minor change, such as implementing the function addMulVVW in assembly, led to an over tenfold improvement in RSA performance, putting Falkor ahead of both Broadwell and Skylake, with 8,009 signatures/second.
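
For context, addMulVVW is the inner loop of big-number multiplication: multiply a vector of limbs by one word and add the result into an accumulator, propagating the carry. A rough pure-Go rendering of what the generic fallback computes (the real math/big internals differ) is below; RSA spends most of its time right here, which is why a few lines of assembly matter so much.

package sketch

import "math/bits"

// addMulVVW computes z += x*y where y is a single machine word, returning the
// final carry; it assumes len(x) >= len(z).
func addMulVVW(z, x []uint, y uint) (carry uint) {
	for i := range z {
		hi, lo := bits.Mul(x[i], y)      // full 2-word product of limb and y
		lo, c1 := bits.Add(lo, z[i], 0)  // add the existing limb
		lo, c2 := bits.Add(lo, carry, 0) // add the running carry
		z[i] = lo
		carry = hi + c1 + c2
	}
	return carry
}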

Another interesting thing to note is that on Skylake, the Go ChaCha20-Poly1305 code, which uses AVX2, performs almost identically to the OpenSSL AVX-512 code; this is again due to AVX2 running at higher clock speeds.

Go gzip

Next in Go performance is gzip. Here again we have a reference point to pretty well optimized code, and we can compare it to Go. In the case of the gzip library, there are no Intel specific optimizations in place.

Gzip performance is pretty good. The single core Falkor performance is way below both Intel processors, but at the system level it manages to outperform Broadwell, and lags behind Skylake. Since we already know that Falkor outperforms both when C is used, it can only mean that Go’s backend for ARMv8 is still pretty immature compared to gcc.

Go regexp

Regexp is widely used in a variety of tasks, so its performance is quite important too. I ran the builtin benchmarks on 32KB strings.

Go regexp performance is not very good on Falkor. In the medium and hard tests it takes second place, thanks to the higher core count, but Skylake is significantly faster still.

Doing some profiling shows that a lot of the time is spent in the function bytes.IndexByte. This function has an assembly implementation for amd64 (runtime.indexbytebody), but only a generic Go implementation for arm64. The easy regexp tests spend most of their time in this function, which explains the even wider gap there.

Go strings

Another important library for a webserver is the Go strings library. I only tested the basic Replacer class here.

In this test again, Falkor lags behind, and loses even to Broadwell. Profiling shows significant time is spent in the function runtime.memmove. Guess what? It has highly optimized assembly for amd64 that uses AVX2, but only very simple ARM assembly that copies 8 bytes at a time. By changing three lines in that code, and using the LDP/STP instructions (load pair/store pair) to copy 16 bytes at a time, I improved the performance of memmove by 30%, which resulted in 20% faster EscapeString and UnescapeString performance. And that is just scratching the surface.

Go conclusion

Go support for aarch64 is quite disappointing. I am very happy to say that everything compiles and works flawlessly, but on the performance side, things should get better. It seems like the enablement effort so far was concentrated on the compiler back end, and the library was left largely untouched. There are a lot of low hanging optimization fruits out there, as my 20 minute fix for addMulVVW clearly shows. Qualcomm and other ARMv8 vendors intend to put significant engineering resources into amending this situation, but really anyone can contribute to Go. So if you want to leave your mark, now is the time.

LuaJIT

Lua is the glue that holds Cloudflare together.

With the exception of the binary_trees benchmark, the performance of LuaJIT on ARM is very competitive. It wins two benchmarks and is almost tied in a third.

That being said, binary_trees is a very important benchmark, because it triggers many memory allocations and garbage collection cycles. It will require deeper investigation in the future.

NGINX

For the NGINX workload, I decided to generate a load that would resemble an actual server.

I set up a server that serves the HTML file used in the gzip benchmark, over https, with the ECDHE-ECDSA-AES128-GCM-SHA256 cipher suite.

It also uses LuaJIT to redirect the incoming request, remove all line breaks and extra spaces from the HTML file, while adding a timestamp. The HTML is then compressed using brotli with quality 5.

Each server was configured to work with as many workers as it has virtual CPUs: 40 for Broadwell, 48 for Skylake and 46 for Falkor.

As the client for this test, I used the hey program, running from 3 Broadwell servers.

Concurrently with the test, we took power readings from the respective BMC units of each server.

With the NGINX workload Falkor handled almost the same number of requests as the Skylake server, and both significantly outperform Broadwell. The power readings taken from the BMC show that it did so while consuming less than half the power of the other processors. That means Falkor managed 214 requests/watt vs. Skylake’s 99 requests/watt and Broadwell’s 77.

I was a bit surprised to see Skylake and Broadwell consume about the same amount of power, given both are manufactured with the same process, and Skylake has more cores.

The low power consumption of Falkor is not surprising: Qualcomm processors are known for their great power efficiency, which has allowed them to become a dominant player in the mobile phone CPU market.

Conclusion

The engineering sample of Falkor we got certainly impressed me a lot. This is a huge step up from any previous attempt at ARM based servers. Certainly core for core, the Intel Skylake is far superior, but when you look at the system level the performance becomes very attractive.

The production version of the Centriq SoC will feature up to 48 Falkor cores, running at frequencies of up to 2.6GHz, for potentially up to 8% better performance.

Obviously the Skylake server we tested is not the flagship Platinum unit that has 28 cores, but those 28 cores come with both a big price and over 200W TDP, whereas we are interested in improving our bang for buck metric and performance per watt.

Currently my main concern is weak Go language performance, but that is bound to improve quickly once ARM based servers start gaining some market share.

Both C and LuaJIT performance are very competitive, and in many cases Falkor outperforms the Skylake contender. In almost every benchmark Falkor shows itself as a worthy upgrade from Broadwell.

The largest win by far for Falkor is the low power consumption. Although it has a TDP of 120W, during my tests it never went above 89W (for the Go benchmark). In comparison Skylake and Broadwell both went over 160W, while the TDP of the two CPUs is 170W.

If you enjoy testing and selecting hardware on behalf of millions of Internet properties, come join us.

Perfect locality and three epic SystemTap scripts

In a recent blog post we discussed epoll behavior causing uneven load among NGINX worker processes. We suggested a work around – the REUSEPORT socket option. It changes the queuing from “combined queue model” aka Waitrose (formally: M/M/s), to a dedicated accept queue per worker aka “the Tesco superstore model” (formally: M/M/1). With this setup the load is spread more evenly, but in certain conditions the latency distribution might suffer.

After reading that piece, a colleague of mine, John, said: “Hey Marek, don’t forget that REUSEPORT has an additional advantage: it can improve packet locality! Packets can avoid being passed around CPUs!”

John had a point. Let’s dig into this step by step.

In this blog post we’ll explain the REUSEPORT socket option, how it can help with packet locality and its performance implications. We’ll show three advanced SystemTap scripts which we used to help us understand and measure the packet locality.

A shared queue

The standard BSD socket API model is rather simple. In order to receive new TCP connections a program calls bind() and then listen() on a fresh socket. This will create a single accept queue. Programs can share the file descriptor – pointing to one kernel data structure – among multiple processes to spread the load. As we’ve seen in a previous blog post connections might not be distributed perfectly. Still, this allows programs to scale up processing power from a limited single-process, single-CPU design.

Modern network cards split the inbound packets across multiple RX queues, allowing multiple CPUs to share interrupt and packet processing load. Unfortunately, in the standard BSD API the new connections will all be funneled back to a single accept queue, causing a potential bottleneck.

Introducing REUSEPORT

This bottleneck was identified at Google, where an application was reported to be dealing with 40,000 connections per second. Google kernel hackers fixed it by adding TCP support for the SO_REUSEPORT socket option in Linux kernel 3.9.

REUSEPORT allows the application to set multiple accept queues on a single TCP listen port. This removes the central bottleneck and enables the CPUs to do more work in parallel.
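
In Go, for example, a server can opt into this with a setsockopt() call from a net.ListenConfig control hook. A minimal sketch (error handling trimmed, not how NGINX does it) of several workers each owning their own accept queue on the same port:

package main

import (
	"context"
	"fmt"
	"net"
	"syscall"

	"golang.org/x/sys/unix"
)

// reusePortListener opens a TCP listener with SO_REUSEPORT set, so several
// workers can each get their own accept queue on the same port.
func reusePortListener(addr string) (net.Listener, error) {
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var serr error
			if err := c.Control(func(fd uintptr) {
				serr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
			}); err != nil {
				return err
			}
			return serr
		},
	}
	return lc.Listen(context.Background(), "tcp", addr)
}

func main() {
	for i := 0; i < 4; i++ { // one accept queue per worker
		ln, err := reusePortListener(":8181")
		if err != nil {
			panic(err)
		}
		go func(id int, ln net.Listener) {
			for {
				conn, err := ln.Accept()
				if err != nil {
					return
				}
				fmt.Printf("worker %d accepted %v\n", id, conn.RemoteAddr())
				conn.Close()
			}
		}(i, ln)
	}
	select {}
}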

REUSEPORT locality

Initially there was no way to influence the load balancing algorithm. While REUSEPORT allowed setting up a dedicated accept queue per each worker process, it wasn’t possible to influence what packets would go into them. New connections flowing into the network stack would be distributed using only the usual 5-tuple hash. Packets from any of the RX queues, hitting any CPU, might flow into any of the accept queues.

This changed in Linux kernel 4.4 with the introduction of the SO_INCOMING_CPU settable socket option. Now a userspace program could add a hint to make the packets received on a specific CPU go to a specific accept queue. With this improvement the accept queue won’t need to be shared across multiple cores, improving CPU cache locality and fixing lock contention issues.

There are other benefits – with proper tuning it is possible to keep the processing of packets belonging to entire connections local. Think about it like this: if a SYN packet was received on some CPU, it is likely that further packets for this connection will also be delivered to the same CPU[1]. Therefore, making sure the accept() is called by the worker running on that same CPU has strong advantages. With the right tuning all processing of the connection might be performed on a single CPU. This can help keep the CPU cache warm, reduce cross-CPU interrupts and boost the performance of memory allocation algorithms.

The SO_INCOMING_CPU interface is pretty rudimentary and was deemed unsuitable for more complex usage. It was superseded by the more powerful SO_ATTACH_REUSEPORT_CBPF option (and its extended variant, SO_ATTACH_REUSEPORT_EBPF) in kernel 4.6. These flags allow a program to specify a fully functional BPF program as the load balancing algorithm.

Beware that the introduction of SO_ATTACH_REUSEPORT_[CE]BPF broke SO_INCOMING_CPU. Nowadays there isn’t a choice – you have to use the BPF variants to get the intended behavior.

Setting CBPF on NGINX

NGINX in “reuseport” mode doesn’t set the advanced socket options that increase packet locality. John suggested that improving packet locality is beneficial for performance. We must verify such a bold claim!

We wanted to play with setting a couple of SO_ATTACH_REUSEPORT_CBPF BPF programs. We didn’t want to hack the NGINX sources though. After some tinkering we decided it would be easier to write a SystemTap script to set the option from outside the server process. This turned out to be a big mistake!

After plenty of work, and numerous kernel panics caused by our buggy scripts (running in “guru” mode), we finally got it into working order: a SystemTap script that calls setsockopt() with the right parameters. It’s one of the most complex scripts we’ve written so far. Here it is:

setcbpf.stp

We tested it on kernel 4.9. It sets the following CBPF (classical BPF) load balancing program on the REUSEPORT socket group. Connections received on the Nth CPU will be passed to the Nth member of the REUSEPORT group:

A = #cpu
A = A % <reuseport group size>
return A
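
If you control the server source, the same program can be attached directly with setsockopt() instead of being injected with SystemTap. A hedged Go sketch, assuming golang.org/x/net/bpf and golang.org/x/sys/unix, where fd is assumed to be the raw descriptor of one listening socket in the REUSEPORT group:

package cbpfattach

import (
	"golang.org/x/net/bpf"
	"golang.org/x/sys/unix"
)

// attachCPUBalancer installs "A = #cpu; A = A % groupSize; return A" as the
// REUSEPORT load balancing program on the listening socket fd.
func attachCPUBalancer(fd, groupSize int) error {
	prog, err := bpf.Assemble([]bpf.Instruction{
		bpf.LoadExtension{Num: bpf.ExtCPUID},                        // A = #cpu
		bpf.ALUOpConstant{Op: bpf.ALUOpMod, Val: uint32(groupSize)}, // A = A % groupSize
		bpf.RetA{},                                                  // return A
	})
	if err != nil {
		return err
	}
	filter := make([]unix.SockFilter, len(prog))
	for i, ins := range prog {
		filter[i] = unix.SockFilter{Code: ins.Op, Jt: ins.Jt, Jf: ins.Jf, K: ins.K}
	}
	fprog := unix.SockFprog{Len: uint16(len(filter)), Filter: &filter[0]}
	return unix.SetsockoptSockFprog(fd, unix.SOL_SOCKET, unix.SO_ATTACH_REUSEPORT_CBPF, &fprog)
}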

The SystemTap script takes three parameters: pid, file descriptor and REUSEPORT group size. To figure out the pid of a process and a file descriptor number use the “ss” tool:

$ ss -4nlp -t 'sport = :8181' | sort
LISTEN 0 511 *:8181 *:* users:(("nginx",pid=29333,fd=3),…
LISTEN 0 511 *:8181 *:* …

In this listing we see that pid=29333 fd=3 points to REUSEPORT descriptor bound to port tcp/8181. On our test machine we have 24 logical CPUs (including HT) and we run 12 NGINX workers – the group size is 12. Example invocation of the script:

$ sudo stap -g setcbpf.stp 29333 3 12

Measuring performance

Unfortunately on Linux it’s pretty hard to verify if setting CBPF actually does anything. To understand what’s going on we wrote another SystemTap script. It hooks into a process and prints all successful invocations of the accept() function, including the CPU on which the connection was delivered to the kernel and the current CPU, on which the application is running. The idea is simple – if they match, we’ll have good locality!

The script:

accept.stp

Before setting the CBPF socket option on the server, we saw this output:

$ sudo stap -g accept.stp nginx | grep "cpu=#12"
cpu=#12 pid=29333 accept(3) -> fd=30 rxcpu=#19
cpu=#12 pid=29333 accept(3) -> fd=31 rxcpu=#21
cpu=#12 pid=29333 accept(3) -> fd=32 rxcpu=#16
cpu=#12 pid=29333 accept(3) -> fd=33 rxcpu=#22
cpu=#12 pid=29333 accept(3) -> fd=34 rxcpu=#19
cpu=#12 pid=29333 accept(3) -> fd=35 rxcpu=#21
cpu=#12 pid=29333 accept(3) -> fd=37 rxcpu=#16

We can see accept()s done from a worker on CPU #12 returning client sockets received on some other CPUs like: #19, #21, #16 and so on.

Now, let’s run CBPF and see the results:

$ sudo stap -g setcbpf.stp `pidof nginx -s` 3 12
[+] Pid=29333 fd=3 group_size=12 setsockopt(SO_ATTACH_REUSEPORT_CBPF)=0

$ sudo stap -g accept.stp nginx | grep "cpu=#12"
cpu=#12 pid=29333 accept(3) -> fd=30 rxcpu=#12
cpu=#12 pid=29333 accept(3) -> fd=31 rxcpu=#12
cpu=#12 pid=29333 accept(3) -> fd=32 rxcpu=#12
cpu=#12 pid=29333 accept(3) -> fd=33 rxcpu=#12
cpu=#12 pid=29333 accept(3) -> fd=34 rxcpu=#12
cpu=#12 pid=29333 accept(3) -> fd=35 rxcpu=#12
cpu=#12 pid=29333 accept(3) -> fd=36 rxcpu=#12

Now the situation is perfect. All accept()s called from the NGINX worker pinned to CPU #12 got client sockets received on the same CPU.

But does it actually help with the performance?

Sadly: no. We’ve run a number of tests (using the setup introduced in a previous blog post) but we weren’t able to record any significant performance difference. Compared to other costs incurred by running a high level HTTP server, a couple of microseconds shaved off by keeping connections local to a CPU doesn’t seem to make a measurable difference.

Measuring packet locality

But no, we didn’t give up!

Not being able to measure an end to end performance gain, we decided to try another approach. Why not try to measure packet locality itself!

Measuring locality is tricky. In certain circumstances a packet can cross multiple CPUs on its way down the networking stack. Fortunately we can simplify the problem. Let’s define “packet locality” as the probability of a packet (to be specific: the Linux sk_buff data structure, skb) being allocated and freed on the same CPU.

For this, we wrote yet another SystemTap script:

locality.stp

When run without the CBPF option the script gave us these results:

$ sudo stap -g locality.stp 12
rx= 21% 29kpps tx= 9% 24kpps
rx= 8% 130kpps tx= 8% 131kpps
rx= 11% 132kpps tx= 9% 126kpps
rx= 10% 128kpps tx= 8% 127kpps
rx= 10% 129kpps tx= 8% 126kpps
rx= 11% 132kpps tx= 9% 127kpps
rx= 11% 129kpps tx= 10% 128kpps
rx= 10% 130kpps tx= 9% 127kpps
rx= 12% 94kpps tx= 8% 90kpps

During our test the HTTP server received about 130,000 packets per second and transmitted about as many. 10-11% of the received and 8-10% of the transmitted packets had good locality – that is, they were allocated and freed on the same CPU.

Achieving good locality is not that easy. On the RX side, this means the packet must be received on the same CPU as the application that will read() it. On the transmission side it’s even trickier: in the case of TCP, a piece of data must be sent() by the application, get transmitted, and receive back an ACK from the other party, all on the same CPU.

We performed a bit of tuning, which included inspecting:

the number of RSS queues and their interrupts being pinned to the right CPUs
the indirection table
correct XPS settings on the TX path
NGINX workers being pinned to the right CPUs
NGINX using the REUSEPORT bind option
and finally setting CBPF on the REUSEPORT sockets

We were able to achieve almost perfect locality! With all tweaks done the script output looked better:

$ sudo stap -g locality.stp 12
rx= 99% 18kpps tx=100% 12kpps
rx= 99% 118kpps tx= 99% 115kpps
rx= 99% 132kpps tx= 99% 129kpps
rx= 99% 138kpps tx= 99% 136kpps
rx= 99% 140kpps tx=100% 134kpps
rx= 99% 138kpps tx= 99% 135kpps
rx= 99% 139kpps tx=100% 137kpps
rx= 99% 139kpps tx=100% 135kpps
rx= 99% 77kpps tx= 99% 74kpps

Now the test runs at 138,000 packets per second received and transmitted. The packets have a whopping 99% packet locality.

As for the performance difference in practice – it’s too small to measure. Even though we received about 7% more packets, the end-to-end tests didn’t show a meaningful speed boost.

Conclusion

We weren’t able to prove definitively whether improving packet locality actually improves performance for a high-level TCP application like an HTTP server. In hindsight it makes sense – the added benefit is minuscule compared to the overhead of running an HTTP server, especially with logic in a high level language like Lua.

This hasn’t stopped us from having fun! We (myself, Gilberto Bertin and David Wragg) wrote three pretty cool SystemTap scripts, which are super useful when debugging Linux packet locality. They may come in handy for demanding users, for example those running high performance UDP servers or doing high frequency trading.

Most importantly, in the process we learned a lot about the Linux networking stack. We got to practice writing CBPF programs, and learned how to measure locality with hackish SystemTap scripts. We were reminded of the obvious – out of the box, Linux is remarkably well tuned.

Does dealing with the internals of Linux and NGINX sound interesting? Join our world famous team in London, Austin, San Francisco and our elite office in Warsaw, Poland.

[1] We are not taking into account aRFS – accelerated RFS.