Engineering @ Facebook 的Facebook随笔How to Hack Out an Election Counter 
Published on 2008-11-20 8:02:23 by Engineering @ Facebook

Published on 2008-11-20 8:02:23 by Engineering @ Facebook
As part of a team flushing out Facebook’s 2008 Election campaign designed to get more people to register to vote and then actually vote, I was responsible for the message on top of the News Feed. We had our design and messaging ready to go as the weekend before election day was approaching. The message was up and ready to be turned on. During the week, someone had suggested that we create an “I voted!” button that a user could click and tell his or her friends about the patriotic action of the day. I was excited; clearly this would encourage more people to go vote so they could tell their friends. So I buckled down, flushed out a News Feed and Minifeed story, and checked it in with plenty of time before Tuesday.
On Sunday someone threw out an idea to have a counter that people could watch and see that their vote was indeed being counted on Facebook as well. The election team decided this would be awesome functionality during a quick chat on Monday morning. I had about 12 hours to design a counter that could handle millions of clicks and live update via ajax regularly enough to be exciting on millions of browsers at the same time.
Here’s the last-second solution I came up with.
We pushed the counter Monday night with code to turn it on automatically and went to sleep. The next morning, I noticed the counter only updating every 30-40 seconds. While intriguing, most people wouldn’t see an update before they went to another page. It turns out that scribe was only flushing its logs to our mounts every 30 seconds. I made a quick change that forced the counter to update its number one-third of the difference between its shown value and the real value every seven seconds. Thus, it would show a number very close to the live value, but almost always have something to update. I committed, merged, and pushed by 10am.
The election page and counters came together quite quickly. It involved last-second commits, hotfixes, pushes, and scripts running on dev servers. We received over 5.4 million clicks and more than 3.6 billion counter update requests in under 24 hours. Thanks to Ben, Alex, Niket, and the rest of the election team for the great designs, ideas, and execution.
On Sunday someone threw out an idea to have a counter that people could watch and see that their vote was indeed being counted on Facebook as well. The election team decided this would be awesome functionality during a quick chat on Monday morning. I had about 12 hours to design a counter that could handle millions of clicks and live update via ajax regularly enough to be exciting on millions of browsers at the same time.
Here’s the last-second solution I came up with.
- Every time a user clicked “I voted!”, I logged a message to our internal open-source logging system Scribe that included time, user id, and which U.S. State the user is in.
- I wrote a quick aggregation script that tailed the Scribe log. The script aggregated by state and globally. Then, it updated a memcached key that was accessible from our web tiers.
- The script ran throughout the day on my dev box, sending a heartbeat to our monitoring tools so it would email me if it died.
- Since the counter was live updating, millions of browsers would be asking for the same key over and over. So, I replicated the key 20 times across different memcached servers to not topple any single memcached box.
- With a couple designers, we made a cool fade animation that would update the counter every 10 seconds using our animation library to make it interesting to watch.
We pushed the counter Monday night with code to turn it on automatically and went to sleep. The next morning, I noticed the counter only updating every 30-40 seconds. While intriguing, most people wouldn’t see an update before they went to another page. It turns out that scribe was only flushing its logs to our mounts every 30 seconds. I made a quick change that forced the counter to update its number one-third of the difference between its shown value and the real value every seven seconds. Thus, it would show a number very close to the live value, but almost always have something to update. I committed, merged, and pushed by 10am.
The election page and counters came together quite quickly. It involved last-second commits, hotfixes, pushes, and scripts running on dev servers. We received over 5.4 million clicks and more than 3.6 billion counter update requests in under 24 hours. Thanks to Ben, Alex, Niket, and the rest of the election team for the great designs, ideas, and execution.
Facebook's Scribe technology now open source 
Published on 2008-10-25 2:30:02 by Engineering @ Facebook

Published on 2008-10-25 2:30:02 by Engineering @ Facebook
Here at Facebook, we're constantly facing scaling challanges because of our enormous growth. One particular problem we encountered a couple of years ago was collection of data from our servers. We were collecting a few billion messages a day (which seemed like a lot at the time) for everything from access logs to performance statistics to actions that went to News Feed. We used a variety of different technologies for the different use cases, and all of them were bursting at the seams. We decided to build a unified system (called Scribe) to handle all of these cases, and do it in a way that would scale with Facebook's growth. The system we built turned out to be enormously useful, handling over 100 use cases and tens of billions of messages a day. It has also been battle tested by just about anything that can go wrong, so I encourage you to take a look at the newly opened Scribe source and see if it might be useful for you. To give the code some context, I'm going to go through the major design decisions we made to allow the system to scale.
The first decision we made was to not lock ourselves into a particular network topology. The Scribe servers are arranged in a directed graph, but each server only knows about the next server in the graph. This flexible topology allows for things like adding an extra layer of fan-in if the system grows too large, and batching messages before sending them between datacenters, but without having any code that explicitly needs to understand datacenter topology, only a simple configuration.
The second major design decision was about reliability. We chose was a middle ground here, reliable enough that we can expect to get all of the data almost all of the time, but not reliable enough to require heavyweight protocols and disk usage. More specifically, Scribe spools data to disk on any node to handle intermittent connectivity node failure, but it doesn't sync a log file for every message, so there's a possibility of a small amount of data loss in the event of a crash or catastrophic hardware failure. Basically, this is more reliability than you get with most logging systems, but not something you should use for database transactions. As it turned out, this is a reasonable level of reliability for a lot of use cases, and has made scaling much easier. It's also the source of a lot of the hard-learned lessons: getting the system to catch up seamlessly after a significant network problem is tricky, especially when there are tens or hundreds of gigabytes of data backed up.
The final design decision was about the data model. When you're building something that looks like a logging system there are a lot of things people expect: logging levels and rules about when they get sent, timestamping and ordering of messages, schemas for common messages, etc. We decided that this was a can of worms that shouldn't be mixed up with the asynchronous and mostly reliable delivery of data, so we made the data model very simple. A message is two strings: a category and the actual message. The category is the description of what the message is about, and the expectation is that messages of the same category end up in the same place. The message is the actual data to be logged. We also don't have any a priori list of categories that must be maintained. If you create a new category it shows up at a new file. This is following the Unix philosophy of doing exactly one thing and doing it well, and it has definitely paid off in ease of use and development. We started with four or five use cases in mind and now we have hundreds, but we didn't have to modify the Scribe source for any of them.
Another choice we made early on was to build Scribe using Thrift. This sped up development enormously because a lot of the hard parts were already taken care of, and it also made the resulting system much more flexible. We currently log messages to Scribe from PHP, python, C++, and Java code, and the list of possible languages is growing all the time from the contributions of developers around the world. So Scribe has already benefitted enormously from Thrift being open, and it will be even better having Scribe open too. I hope you find it as useful as we do.
The first decision we made was to not lock ourselves into a particular network topology. The Scribe servers are arranged in a directed graph, but each server only knows about the next server in the graph. This flexible topology allows for things like adding an extra layer of fan-in if the system grows too large, and batching messages before sending them between datacenters, but without having any code that explicitly needs to understand datacenter topology, only a simple configuration.
The second major design decision was about reliability. We chose was a middle ground here, reliable enough that we can expect to get all of the data almost all of the time, but not reliable enough to require heavyweight protocols and disk usage. More specifically, Scribe spools data to disk on any node to handle intermittent connectivity node failure, but it doesn't sync a log file for every message, so there's a possibility of a small amount of data loss in the event of a crash or catastrophic hardware failure. Basically, this is more reliability than you get with most logging systems, but not something you should use for database transactions. As it turned out, this is a reasonable level of reliability for a lot of use cases, and has made scaling much easier. It's also the source of a lot of the hard-learned lessons: getting the system to catch up seamlessly after a significant network problem is tricky, especially when there are tens or hundreds of gigabytes of data backed up.
The final design decision was about the data model. When you're building something that looks like a logging system there are a lot of things people expect: logging levels and rules about when they get sent, timestamping and ordering of messages, schemas for common messages, etc. We decided that this was a can of worms that shouldn't be mixed up with the asynchronous and mostly reliable delivery of data, so we made the data model very simple. A message is two strings: a category and the actual message. The category is the description of what the message is about, and the expectation is that messages of the same category end up in the same place. The message is the actual data to be logged. We also don't have any a priori list of categories that must be maintained. If you create a new category it shows up at a new file. This is following the Unix philosophy of doing exactly one thing and doing it well, and it has definitely paid off in ease of use and development. We started with four or five use cases in mind and now we have hundreds, but we didn't have to modify the Scribe source for any of them.
Another choice we made early on was to build Scribe using Thrift. This sped up development enormously because a lot of the hard parts were already taken care of, and it also made the resulting system much more flexible. We currently log messages to Scribe from PHP, python, C++, and Java code, and the list of possible languages is growing all the time from the contributions of developers around the world. So Scribe has already benefitted enormously from Thrift being open, and it will be even better having Scribe open too. I hope you find it as useful as we do.
The All-Night Hackathon Is Back! 
Published on 2008-10-24 12:51:17 by Engineering @ Facebook

Published on 2008-10-24 12:51:17 by Engineering @ Facebook
Every few months, our engineers unleash their talents in one epic, all-night coding session. These are the Facebook Hackathons. They start with takeout Chinese food around 8 p.m. and end with a dawn breakfast at any pancake house or donut shop that will have us. In between, dozens of Facebook engineers create working prototypes of projects that they always wanted to build but couldn’t ever pursue during their regular hours.
Hackathon XI – also known as The Presidential Hackathon – will take place the evening of Wednesday, Nov. 5. In past years, Facebook Hackathons have been the starting point for all sorts of new features that rapidly became mainstays of the site, such as Facebook Chat, internationalization, the type-ahead feature in search, the friend suggester, etc. etc. Not every experiment pays off right away, but it’s a low-risk, high-reward setting that encourages engineers to bring to life some of the great ideas that are floating around here.
We have plenty of in-house ideas already in the mix for Hackathon XI, but there’s no reason we can’t welcome outside input, too. If Facebook users have ideas about new features or site improvements that could be tackled in our one-night time frame, please accept our invitation to submit those concepts to the address hackathon-ideas@facebook.com. The deadline is noon Pacific time, Monday, Nov. 3. In doing so, you agree to our standard Terms of Use, which specify that Facebook is entitled to the unrestricted use of such submissions for any purpose, without acknowledgement or compensation to you.
We are awarding free Facebook T-shirts to the top 10 ideas, so please include your mailing address and shirt size. We hope that the ideas generated and carried out in this Hackathon will make the site more appealing and useful to everyone.
Watch this space next week for updates about the Hackathon itself.
Pedram Keyani is a Facebook engineer who is organizing The Presidential Hackathon.
Hackathon XI – also known as The Presidential Hackathon – will take place the evening of Wednesday, Nov. 5. In past years, Facebook Hackathons have been the starting point for all sorts of new features that rapidly became mainstays of the site, such as Facebook Chat, internationalization, the type-ahead feature in search, the friend suggester, etc. etc. Not every experiment pays off right away, but it’s a low-risk, high-reward setting that encourages engineers to bring to life some of the great ideas that are floating around here.
We have plenty of in-house ideas already in the mix for Hackathon XI, but there’s no reason we can’t welcome outside input, too. If Facebook users have ideas about new features or site improvements that could be tackled in our one-night time frame, please accept our invitation to submit those concepts to the address hackathon-ideas@facebook.com. The deadline is noon Pacific time, Monday, Nov. 3. In doing so, you agree to our standard Terms of Use, which specify that Facebook is entitled to the unrestricted use of such submissions for any purpose, without acknowledgement or compensation to you.
We are awarding free Facebook T-shirts to the top 10 ideas, so please include your mailing address and shirt size. We hope that the ideas generated and carried out in this Hackathon will make the site more appealing and useful to everyone.
Watch this space next week for updates about the Hackathon itself.
Pedram Keyani is a Facebook engineer who is organizing The Presidential Hackathon.
10 billion photos 
Published on 2008-10-15 9:03:58 by Engineering @ Facebook

Published on 2008-10-15 9:03:58 by Engineering @ Facebook
We recently hit a really cool milestone, our users have now uploaded over 10 billion photos to the site. Now, that’s a big number, but we actually store four image sizes for each uploaded photo, so that’s over 40 billion files.
To celebrate, we got a bunch of cupcakes and handed them out to our engineering and operations groups. One of our engineers calculated that if we had gotten one cupcake for each of our photos, and lined them up side by side, the line could reach halfway to the moon.
Here’s some other interesting recent stats on photos:
Doug is looking forward to the day we reach the moon, and then the day we make it back.
To celebrate, we got a bunch of cupcakes and handed them out to our engineering and operations groups. One of our engineers calculated that if we had gotten one cupcake for each of our photos, and lined them up side by side, the line could reach halfway to the moon.
Here’s some other interesting recent stats on photos:
- 2-3 Terabytes of photos are being uploaded to the site every day
- We have just over one petabyte of photo storage
- We serve over 15 billion photo images per day
- Photo traffic now peaks at over 300,000 images served per second
Doug is looking forward to the day we reach the moon, and then the day we make it back.
Cassandra – A structured storage system on a P2P Network 
Published on 2008-8-26 7:31:08 by Engineering @ Facebook

Published on 2008-8-26 7:31:08 by Engineering @ Facebook
When I joined Facebook I was eagerly looking forward to a new challenge. Fortunately, Facebook cannot be accused of a lack of challenging assignments. Prashant Malik,a colleague in Facebook from the Search team, was thinking about how to solve the Inbox Search problem. This challenge is about storing reverse indices of Facebook messages that Facebook users send and receive while communicating with their friends on the Facebook network. The amount of data to be stored, the rate of growth of the data and the requirement to serve it within strict SLAs made it very apparent that a new storage solution was absolutely essential. The solution needed to scale incrementally and in a cost effective fashion. Traditional data storage solutions just wouldn’t fit the bill. The aim was to design a solution that not only solved the Inbox Search problem but also provided a system as a storage infrastructure for many problems of the same nature. Hence was born Cassandra. To keep up with Facebook tradition, Prashant and I started the implementation of Cassandra about a year ago in one of our Hackthons.
Cassandra is a distributed storage system for managing structured data that is designed to scale to a very large size across many commodity servers, with no single point of failure. Reliability at massive scale is a very big challenge. Outages in the service can have significant negative impact. Hence Cassandra aims to run on top of an infrastructure of hundreds of nodes (possibly spread across different datacenters). At this scale, small and large components fail continuously; the way Cassandra manages the persistent state in the face of these failures drives the reliability and scalability of the software systems relying on this service. Cassandra has achieved several goals – scalability, high performance, high availability and applicability. In many ways Cassandra resembles a database and shares many design and implementation strategies with databases. Cassandra does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format. The rest of the material talks about the data model and the distributed properties, provided by the system.
Data Model
Distribution, Replication and Fault Tolerance
First deployment of Cassandra system within Facebook was for the Inbox search system. The system currently stores TB’s of indexes across a cluster of 600+ cores and 120+ TB of disk space. Performance of the system has been well within our SLA requirements and more applications are in the pipeline to use the Cassandra system as their storage engine. A beta version of Cassandra has been open sourced and can be found here. Systems of this nature are never really done. One starts by building certain core features that are absolutely required, achieves to get them right and proceeds to build the more fancy features. We have certain core features that needed to be built and have a lot more planned in the road ahead. If one is interested in solving hard distributed systems problems which affects the lives of millions of our users, feel free to contact us at Facebook
Cassandra is a distributed storage system for managing structured data that is designed to scale to a very large size across many commodity servers, with no single point of failure. Reliability at massive scale is a very big challenge. Outages in the service can have significant negative impact. Hence Cassandra aims to run on top of an infrastructure of hundreds of nodes (possibly spread across different datacenters). At this scale, small and large components fail continuously; the way Cassandra manages the persistent state in the face of these failures drives the reliability and scalability of the software systems relying on this service. Cassandra has achieved several goals – scalability, high performance, high availability and applicability. In many ways Cassandra resembles a database and shares many design and implementation strategies with databases. Cassandra does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format. The rest of the material talks about the data model and the distributed properties, provided by the system.
Data Model
- Every row is identified by a unique key. The key is a string and there is no limit on its size.
- An instance of Cassandra has one table which is made up of one or more column families as defined by the user.
- The number of column families and the name of each of the above must be fixed at the time the cluster is started. There is no limitation the number of column families but it is expected that there would be a few of these.
- Each column family can contain one of two structures: supercolumns or columns. Both of these are dynamically created and there is no limit on the number of these that can be stored in a column family.
- Columns are constructs that have a name, a value and a user-defined timestamp associated with them. The number of columns that can be contained in a column family is very large. Columns could be of variable number per key. For instance key K1 could have 1024 columns/super columns while key K2 could have 64 columns/super columns.
- “Supercolumns” are a construct that have a name, and an infinite number of columns assosciated with them. The number of “Supercolumns” associated with any column family could be infinite and of a variable number per key. They exhibit the same characteristics as columns.
Distribution, Replication and Fault Tolerance
- Data is distributed across the nodes in the cluster using Consistent Hashing based and on an Order Preserving Hash function. We use an Order Preserving Hash so that we could perform range scans over the data for analysis at some later point.
- Cluster membership is maintained via Gossip style membership algorithm. Failures of nodes within the cluster are monitored using an Accrual Style Failure Detector.
- High availability is achieved using replication and we actively replicate data across data centers. Since eventual consistency is the mantra of the system reads execute on the closest replica and data is repaired in the background for increased read throughput.
- System exhibits incremental scalability properties which can be achieved as easily as dropping nodes and having them automatically bootstrapped with data.
First deployment of Cassandra system within Facebook was for the Inbox search system. The system currently stores TB’s of indexes across a cluster of 600+ cores and 120+ TB of disk space. Performance of the system has been well within our SLA requirements and more applications are in the pipeline to use the Cassandra system as their storage engine. A beta version of Cassandra has been open sourced and can be found here. Systems of this nature are never really done. One starts by building certain core features that are absolutely required, achieves to get them right and proceeds to build the more fancy features. We have certain core features that needed to be built and have a lot more planned in the road ahead. If one is interested in solving hard distributed systems problems which affects the lives of millions of our users, feel free to contact us at Facebook
Scaling Out 
Published on 2008-8-21 2:05:34 by Engineering @ Facebook

Published on 2008-8-21 2:05:34 by Engineering @ Facebook
I joined Facebook in April 2007 and, after getting settled over the course of a few weeks, my manager Robert Johnson approached me. We talked for a while but the conversation boiled down to:
Bobby: "So, Jason, we're going to open a new datacenter in Virginia by 2008. Do you think you can help?"
Me: "Uh.... yes?"
Bobby: "Great!"
My first project at Facebook was a tad more involved then I was expecting, but I think that is one reason why we have such a great engineering organization; we have a lot of hard problems to solve and everyone here is excited to jump in and tackle them. I set out to really understand why we were building a new datacenter and what problems we had to overcome to make it work.
The primary reason for building a new datacenter on the east coast was latency. It takes about 70 milliseconds to send a packet across the country on a high-speed link, and it can be much longer for an average internet user. By putting servers in Virginia we could reduce the time to send a page to users on the east coast and in Europe by a noticeable amount.
Secondary concerns were space, power, and disaster recovery. We were running out of physical space in our primary datacenters in California and the Virginia site would give us lots of room to grow. We were having a similar problem with getting enough electricity to power all those servers. Finally, restricting ourselves to only one location meant that, in the event of a disaster (power failure, earthquake, Godzilla), Facebook could be unusable for extended periods of time.
Build It!
Bobby: "So, Jason, we're going to open a new datacenter in Virginia by 2008. Do you think you can help?"
Me: "Uh.... yes?"
Bobby: "Great!"
My first project at Facebook was a tad more involved then I was expecting, but I think that is one reason why we have such a great engineering organization; we have a lot of hard problems to solve and everyone here is excited to jump in and tackle them. I set out to really understand why we were building a new datacenter and what problems we had to overcome to make it work.
Why Bother?
The primary reason for building a new datacenter on the east coast was latency. It takes about 70 milliseconds to send a packet across the country on a high-speed link, and it can be much longer for an average internet user. By putting servers in Virginia we could reduce the time to send a page to users on the east coast and in Europe by a noticeable amount.
Secondary concerns were space, power, and disaster recovery. We were running out of physical space in our primary datacenters in California and the Virginia site would give us lots of room to grow. We were having a similar problem with getting enough electricity to power all those servers. Finally, restricting ourselves to only one location meant that, in the event of a disaster (power failure, earthquake, Godzilla), Facebook could be unusable for extended periods of time.
Build It!
Before we could go to work on the application level challenges our operations team put in a heroic effort to build out the servers and the physical space in Virginia. They also brought up the intra-datacenter network and the low latency inter-datacenter fiber channel link. This work was an enormous undertaking but our operations team is top-notch and made it all look easy.
With the network and hardware in place we set up our standard 3 tier architecture: web server, memcache server, and MySQL database. The MySQL databses in Virginia were going to run as slaves of the west coast databases, so we spent a couple weeks copying all the data across the country and setting up replication streams.
Now that the hardware, network, and basic infrastructure was set up it was time to face the two main application level challenges: cache consistency and traffic routing.
Cache Consistency
A bit of background on our caching model: when a user modifies a data object our infrastructure will write the new value in to a database and delete the old value from memcache (if it was present). The next time a user requests that data object we pull the result from the database and write it to memcache. Subsequent requests will pull the data from memcache until it expires out of the cache or is deleted by another update.
This setup works really well with only one set of databases because we only delete the value from memcache after the database has confirmed the write of the new value. That way we are guaranteed the next read will get the updated value from the database and put it in to memcache. With a slave database on the east coast, however, the situation got a little tricky.
When we update a west coast master database with some new data there is a replication lag before the new value is properly reflected in the east coast slave database. Normally this replication lag is under a second but in periods of high load it can spike up to 20 seconds.
Now let's say we delete the value from Virginia memcache tier at the time we update the master database in California. A subsequent read from the slave database in Virginia might see the old value instead of the new one because of replication lag. Then Virginia memcache would be updated with the old (incorrect) value and it would be "trapped" there until another delete. As you can see, in the worst case the Virginia memcache tier would always be one "version" behind of the correct data.
Consider the following example:
- I update my first name from "Jason" to "Monkey"
- We write "Monkey" in to the master database in California and delete my first name from memcache in California and Virginia
- Someone goes to my profile in Virginia
- We don't find my first name in memcache so we read from the Virginia slave database and get "Jason" because of replication lag
- We update Virginia memcache with my first name as "Jason"
- Replication catches up and we update the slave database with my first name as "Monkey"
- Someone else goes to my profile in Virginia
- We find my first name in memcache and return "Jason"
Until I update my first name again or it falls out of cache and we go back to the database, we will show my first name as "Jason" in Virginia and "Monkey" in California. Confusing? You bet. Welcome to the world of distributed systems, where consistency is a really hard problem.
Fortunately, the solution is a lot easier to explain than the problem. We made a small change to MySQL that allows us to tack on extra information in the replication stream that is updating the slave database. We used this feature to append all the data objects that are changing for a given query and then the slave database "sees" these objects and is responsible for deleting the value from cache after it performs the update to the database.
How'd we do it? MySQL uses a lex parser and a yacc grammar to define the structure of a query and then parse it. I've simplified the following for ease of explanation, but at the highest level this grammar looks like:
query:
statement END_OF_INPUT {};
statement:
alter
| analyze
| backup
| call
... (insert, replace, select, etc.)
Pretty straightforward, right? A query is a statement which breaks down to one of the MySQL expressions we all know and love. We modified this grammar to allow appending memcache keys to the end of any query, as follows:
query:
statement mc_dirty END_OF_INPUT {};
mc_dirty:
{}
| MEMCACHE_DIRTY mc_key_list;
mc_key_list:
mc_key_list ',' text_string { Lex->mc_key_list.push_back($3); }
| text_string { Lex->mc_key_list.push_back($1); };
A query now has an additional component; after the statement comes the mc_dirty which is either empty or a keyword MEMCACHE_DIRTY followed by a mc_key_list. A mc_key_list is just a comma-separated list of strings and the rule tells the parser to push all the strings one-by-one on to a vector named mc_key_list which is stored inside a per-query parser object.
As an example, an old query might look like:
REPLACE INTO profile (`first_name`) VALUES ('Monkey') WHERE `user_id`='jsobel'
and under the new grammar it would change to:
REPLACE INTO profile (`first_name`) VALUES ('Monkey') WHERE `user_id`='jsobel' MEMCACHE_DIRTY 'jsobel:first_name'
The new query is telling MySQL that, in addition to changing my first name to Monkey, it also needs to dirty a corresponding memcache key. This is easily implemented. Since the per-query parser object now stores all memcache keys we tack on to a query, we added a small piece of code at the end of mysql_execute_command that dirties those keys if the query is successful. Voila, we've hijacked the MySQL replication stream for our own purpose: cache consistency.
The new workflow becomes (changed items in bold):
- I update my first name from "Jason" to "Monkey"
- We write "Monkey" in to the master database in California and delete my first name from memcache in California but not Virginia
- Someone goes to my profile in Virginia
- We find my first name in memcache and return "Jason"
- Replication catches up and we update the slave database with my first name as "Monkey." We also delete my first name from Virginia memcache because that cache object showed up in the replication stream
- Someone else goes to my profile in Virginia
- We don't find my first name in memcache so we read from the slave and get "Monkey"
Page Routing
The other main problem we had to address was that only our master databases in California could accept write operations. This fact meant we needed to avoid serving pages that did database writes from Virginia because each one would have to cross the country to our master databases in California. Fortunately, our most frequently accessed pages (home page, profiles, photo pages) don't do any writes under normal operation. The problem thus boiled down to, when a user makes a request for a page, how do we decide if it is "safe" to send to Virginia or if it must be routed to California?
This question turned out to have a relatively straightforward answer. One of the first servers a user request to Facebook hits is called a load balancer; this machine's primary responsibility is picking a web server to handle the request but it also serves a number of other purposes: protecting against denial of service attacks and multiplexing user connections to name a few. This load balancer has the capability to run in Layer 7 mode where it can examine the URI a user is requesting and make routing decisions based on that information. This feature meant it was easy to tell the load balancer about our "safe" pages and it could decide whether to send the request to Virginia or California based on the page name and the user's location.
There is another wrinkle to this problem, however. Let's say you go to editprofile.php to change your hometown. This page isn't marked as safe so it gets routed to California and you make the change. Then you go to view your profile and, since it is a safe page, we send you to Virginia. Because of the replication lag we mentioned earlier, however, you might not see the change you just made! This experience is very confusing for a user and also leads to double posting. We got around this concern by setting a cookie in your browser with the current time whenever you write something to our databases. The load balancer also looks for that cookie and, if it notices that you wrote something within 20 seconds, will unconditionally send you to California. Then when 20 seconds have passed and we're certain the data has replicated to Virginia, we'll allow you to go back for safe pages.
Looking Back
Nine months after our first user viewed a page in the Virginia datacenter we're still running the same architecture with good success. There were bumps along the way, of course; for the first month or two the cache consistency infrastructure was very shaky and would periodically force us to divert traffic away from Virginia while we diagnosed and fixed bugs. Over time, however, we've ironed out the issues and now serve a substantial portion of Facebook's traffic out of this datacenter.
The main scaling challenge with this architecture is pretty obvious: all write operations must happen in one location. Going forward we're very excited to develop new technologies that will let us perform writes in any location. We're also thinking a lot about how to use our new datacenter as a disaster recovery site in case Godzilla decides to attack our California locations! Interested in helping us out? www.facebook.com/jobs!
Thrift: (slightly more than) one year later 
Published on 2008-6-13 6:36:50 by Engineering @ Facebook

Published on 2008-6-13 6:36:50 by Engineering @ Facebook
A little over a year ago, Facebook released Thrift as open source software. (See the original announcement.) Thrift is a lightweight software framework for enabling communication between programs written in different programming languages, running on different computers, or both. We decided to release it because we thought it could help other groups solve some of the same technical problems we have faced, and because we hoped that developers who found it useful would contribute improvements back to the project.
A lot has happened in the year since we released Thrift. First and foremost, Thrift has gained a lot of cool features:
I'm even more excited about some of the stuff being worked on now.
If you are interested in Thrift, the best documentation is still the original whitepaper. You can also check out Thrift's new homepage. Most of the interesting updates are on the Thrift mailing lists (subscription info on the homepage). Thrift's Subversion repository has just moved into the Apache Incubator repository. Information on accessing it is available on the Thrift homepage. A lot of experimental development is published in the unofficial Git repository.
A lot has happened in the year since we released Thrift. First and foremost, Thrift has gained a lot of cool features:
- Support for C#, Perl, Objective C, Erlang, Smalltalk, OCaml, and Haskell.
- More idiomatic style in Java and Ruby.
- Two new protocols: one for dense encoding and one using JSON syntax.
- Significant speed boost in Python using a C module and PHP using an extension.
I'm even more excited about some of the stuff being worked on now.
- Here at Facebook, we're working on a fully asynchronous client and server for C++. This server uses event-driven I/O like the current TNonblockingServer, but its interface to the application code is all based on asynchronous callbacks. This will allow us to write servers that can service thousands of simultaneous requests (each of which requires making calls to other Thrift or Memcache servers) with only a few threads.
- Powerset's Chad Walters and I are working on templatizing the C++ API. This will be an almost entirely backward-compatible change that will preserve all the flexibility of the Thrift API, but it will allow performance-conscious developers to add type-annotations that will improve serialization and deserialization speed by about 5x and 3x respectively.
- Thrift's Ruby mapping, which never got much attention here at Facebook, has had a surge of popularity amongst our external contributors. It's getting some much-needed attention from Powerset's Kevin Clark and our newest corporate contributor: RapLeaf. They've already got an accelerator extension in testing (which works like the existing Python and PHP accelerators) and are working on some serious style overhauls. At least, that's what they tell me. I don't know Ruby, so I mostly leave them alone. :)
- Ross McFarland has been working on a C mapping for Thrift using glib. A C mapping has been one of our oft-requested features, so it's great to see this finally taking shape.
- There are a host of other features that are "on the back burner" for now, but which I expect to be incorporated eventually. These include patches that we received for an asynchronous Perl client, an SSL transport for C++ (based on GNU TLS), and a more robust file-based transport.
If you are interested in Thrift, the best documentation is still the original whitepaper. You can also check out Thrift's new homepage. Most of the interesting updates are on the Thrift mailing lists (subscription info on the homepage). Thrift's Subversion repository has just moved into the Apache Incubator repository. Information on accessing it is available on the Thrift homepage. A lot of experimental development is published in the unofficial Git repository.
Hadoop 
Published on 2008-6-5 13:33:11 by Engineering @ Facebook

Published on 2008-6-5 13:33:11 by Engineering @ Facebook
With tens of millions of users and more than a billion page views every day, Facebook ends up accumulating massive amounts of data. One of the challenges that we have faced since the early days is developing a scalable way of storing and processing all these bytes since using this historical data is a very big part of how we can improve the user experience on Facebook. This can only be done by empowering our engineers and analysts with easy to use tools to mine and manipulate large data sets.
About a year back we began playing around with an open source project called Hadoop. Hadoop provides a framework for large scale parallel processing using a distributed file system and the map-reduce programming paradigm. Our hesitant first steps of importing some interesting data sets into a relatively small Hadoop cluster were quickly rewarded as developers latched on to the map-reduce programming model and started doing interesting projects that were previously impossible due to their massive computational requirements. Some of these early projects have matured into publicly released features (like the Facebook Lexicon) or are being used in the background to improve user experience on Facebook (by improving the relevance of search results, for example).
We have come a long way from those initial days. Facebook has multiple Hadoop clusters deployed now - with the biggest having about 2500 cpu cores and 1 PetaByte of disk space. We are loading over 250 gigabytes of compressed data (over 2 terabytes uncompressed) into the Hadoop file system every day and have hundreds of jobs running each day against these data sets. The list of projects that are using this infrastructure has proliferated - from those generating mundane statistics about site usage, to others being used to fight spam and determine application quality. An amazingly large fraction of our engineers have run Hadoop jobs at some point (which is also a great testament to the quality of technical talent here at Facebook).
The rapid adoption of Hadoop at Facebook has been aided by a couple of key decisions. First, developers are free to write map-reduce programs in the language of their choice. Second, we have embraced SQL as a familiar paradigm to address and operate on large data sets. Most data stored in Hadoop's file system is published as Tables. Developers can explore the schemas and data of these tables much like they would do with a good old database. When they want to operate on these data sets, they can use a small subset of SQL to specify the required dataset. Operations on datasets can be written as map and reduce scripts or using standard query operators (like joins and group-bys) or as a mix of the two. Over time, we have added classic data warehouse features like partitioning, sampling and indexing to this environment. This in-house data warehousing layer over Hadoop is called Hive and we are looking forward to releasing an open source version of this project in the near future.
At Facebook, it is incredibly important that we use the information generated by and from our users to make decisions about improvements to the product. Hadoop has enabled us to make better use of the data at our disposal. So we'd like to take this opportunity to say, "Thank you" to all the people who have contributed to this awesome open-source project.
Joydeep is a Facebook Engineer
About a year back we began playing around with an open source project called Hadoop. Hadoop provides a framework for large scale parallel processing using a distributed file system and the map-reduce programming paradigm. Our hesitant first steps of importing some interesting data sets into a relatively small Hadoop cluster were quickly rewarded as developers latched on to the map-reduce programming model and started doing interesting projects that were previously impossible due to their massive computational requirements. Some of these early projects have matured into publicly released features (like the Facebook Lexicon) or are being used in the background to improve user experience on Facebook (by improving the relevance of search results, for example).
We have come a long way from those initial days. Facebook has multiple Hadoop clusters deployed now - with the biggest having about 2500 cpu cores and 1 PetaByte of disk space. We are loading over 250 gigabytes of compressed data (over 2 terabytes uncompressed) into the Hadoop file system every day and have hundreds of jobs running each day against these data sets. The list of projects that are using this infrastructure has proliferated - from those generating mundane statistics about site usage, to others being used to fight spam and determine application quality. An amazingly large fraction of our engineers have run Hadoop jobs at some point (which is also a great testament to the quality of technical talent here at Facebook).
The rapid adoption of Hadoop at Facebook has been aided by a couple of key decisions. First, developers are free to write map-reduce programs in the language of their choice. Second, we have embraced SQL as a familiar paradigm to address and operate on large data sets. Most data stored in Hadoop's file system is published as Tables. Developers can explore the schemas and data of these tables much like they would do with a good old database. When they want to operate on these data sets, they can use a small subset of SQL to specify the required dataset. Operations on datasets can be written as map and reduce scripts or using standard query operators (like joins and group-bys) or as a mix of the two. Over time, we have added classic data warehouse features like partitioning, sampling and indexing to this environment. This in-house data warehousing layer over Hadoop is called Hive and we are looking forward to releasing an open source version of this project in the near future.
At Facebook, it is incredibly important that we use the information generated by and from our users to make decisions about improvements to the product. Hadoop has enabled us to make better use of the data at our disposal. So we'd like to take this opportunity to say, "Thank you" to all the people who have contributed to this awesome open-source project.
Joydeep is a Facebook Engineer
Facebook Chat 
Published on 2008-5-14 13:56:23 by Engineering @ Facebook

Published on 2008-5-14 13:56:23 by Engineering @ Facebook
One of the things I like most about working at Facebook is the ability to launch products that are (almost) immediately used by millions of people. Unlike a three-guys-in-a-garage startup, we don't have the luxury of scaling out infrastructure to keep pace with user growth; when your feature's userbase will go from 0 to 70 million practically overnight, scalability has to be baked in from the start. The project I'm currently working on, Facebook Chat, offered a nice set of software engineering challenges:
The most resource-intensive operation performed in a chat system is not sending messages. It is rather keeping each online user aware of the online-idle-offline states of their friends, so that conversations can begin.
The naive implementation of sending a notification to all friends whenever a user comes online or goes offline has a worst case cost of O(average friendlist size * peak users * churn rate) messages/second, where churn rate is the frequency with which users come online and go offline, in events/second. This is wildly inefficient to the point of being untenable, given that the average number of friends per user is measured in the hundreds, and the number of concurrent users during peak site usage is on the order of several millions.
Surfacing connected users' idleness greatly enhances the chat user experience but further compounds the problem of keeping presence information up-to-date. Each Facebook Chat user now needs to be notified whenever one of his/her friends
(a) takes an action such as sending a chat message or loads a Facebook page (if tracking idleness via a last-active timestamp) or
(b) transitions between idleness states (if representing idleness as a state machine with states like "idle-for-1-minute", "idle-for-2-minutes", "idle-for-5-minutes", "idle-for-10-minutes", etc.).
Note that approach (a) changes the sending a chat message / loading a Facebook page from a one-to-one communication into a multicast to all online friends, while approach (b) ensures that users who are neither chatting nor browsing Facebook are nonetheless generating server load.
Another challenge is ensuring the timely delivery of the messages themselves. The method we chose to get text from one user to another involves loading an iframe on each Facebook page, and having that iframe's Javascript make an HTTP GET request over a persistent connection that doesn't return until the server has data for the client. The request gets reestablished if it's interrupted or times out. This isn't by any means a new technique: it's a variation of Comet, specifically XHR long polling, and/or BOSH.
Having a large-number of long-running concurrent requests makes the Apache part of the standard LAMP stack a dubious implementation choice. Even without accounting for the sizeable overhead of spawning an OS process that, on average, twiddles its thumbs for a minute before reporting that no one has sent the user a message, the waiting time could be spent servicing 60-some requests for regular Facebook pages. The result of running out of Apache processes over the entire Facebook web tier is not pretty, nor is the dynamic configuration of the Apache process limits enjoyable.
Fault tolerance is a desirable characteristic of any big system: if an error happens, the system should try its best to recover without human intervention before giving up and informing the user. The results of inevitable programming bugs, hardware failures, et al., should be hidden from the user as much as possible and isolated from the rest of the system.
The way this is typically accomplished in a web application is by separating the model and the view: data is persisted in a database (perhaps with a separate in-memory cache), with each short-lived request retrieving only the parts relevant to that request. Because the data is persisted, a failed read request can be re-attempted. Cache misses and database failure can be detected by the non-database layers and either reported to the user or worked around using replication.
While this architecture works pretty well in general, it isn't as successful in a chat application due to the high volume of long-lived requests, the non-relational nature of the data involved, and the statefulness of each request.
For Facebook Chat, we rolled our own subsystem for logging chat messages (in C++) as well as an epoll-driven web server (in Erlang) that holds online users' conversations in-memory and serves the long-polled HTTP requests. Both subsystems are clustered and partitioned for reliability and efficient failover. Why Erlang? In short, because the problem domain fits Erlang like a glove. Erlang is a functional concurrency-oriented language with extremely low-weight user-space "processes", share-nothing message-passing semantics, built-in distribution, and a "crash and recover" philosophy proven by two decades of deployment on large soft-realtime production systems.
Despite those advantages, using Erlang for a component of Facebook Chat had a downside: that component needed to communicate with the other parts of the system. Glueing together PHP, Javascript, Erlang, and C++ is not a trivial matter. Fortunately, we have Thrift. Thrift translates a service description into the RPC glue code necessary for making cross-language calls (marshalling arguments and responses over the wire) and has templates for servers and clients. Since going open source a year ago (we had the gall to release it on April Fool's Day, 2007), the Thrift project has steadily grown and improved (with multiple iterations on the Erlang binding). Having Thrift available freed us to split up the problem of building a chat system and use the best available tool to approach each sub-problem.
The secret for going from zero to seventy million users overnight is to avoid doing it all in one fell swoop. We chose to simulate the impact of many real users hitting many machines by means of a "dark launch" period in which Facebook pages would make connections to the chat servers, query for presence information and simulate message sends without a single UI element drawn on the page. With the "dark launch" bugs fixed, we hope that you enjoy Facebook Chat now that the UI lights have been turned on.
Eugene is a Facebook Engineer
Real-time presence notification:
The most resource-intensive operation performed in a chat system is not sending messages. It is rather keeping each online user aware of the online-idle-offline states of their friends, so that conversations can begin.
The naive implementation of sending a notification to all friends whenever a user comes online or goes offline has a worst case cost of O(average friendlist size * peak users * churn rate) messages/second, where churn rate is the frequency with which users come online and go offline, in events/second. This is wildly inefficient to the point of being untenable, given that the average number of friends per user is measured in the hundreds, and the number of concurrent users during peak site usage is on the order of several millions.
Surfacing connected users' idleness greatly enhances the chat user experience but further compounds the problem of keeping presence information up-to-date. Each Facebook Chat user now needs to be notified whenever one of his/her friends
(a) takes an action such as sending a chat message or loads a Facebook page (if tracking idleness via a last-active timestamp) or
(b) transitions between idleness states (if representing idleness as a state machine with states like "idle-for-1-minute", "idle-for-2-minutes", "idle-for-5-minutes", "idle-for-10-minutes", etc.).
Note that approach (a) changes the sending a chat message / loading a Facebook page from a one-to-one communication into a multicast to all online friends, while approach (b) ensures that users who are neither chatting nor browsing Facebook are nonetheless generating server load.
Real-time messaging:
Another challenge is ensuring the timely delivery of the messages themselves. The method we chose to get text from one user to another involves loading an iframe on each Facebook page, and having that iframe's Javascript make an HTTP GET request over a persistent connection that doesn't return until the server has data for the client. The request gets reestablished if it's interrupted or times out. This isn't by any means a new technique: it's a variation of Comet, specifically XHR long polling, and/or BOSH.
Having a large-number of long-running concurrent requests makes the Apache part of the standard LAMP stack a dubious implementation choice. Even without accounting for the sizeable overhead of spawning an OS process that, on average, twiddles its thumbs for a minute before reporting that no one has sent the user a message, the waiting time could be spent servicing 60-some requests for regular Facebook pages. The result of running out of Apache processes over the entire Facebook web tier is not pretty, nor is the dynamic configuration of the Apache process limits enjoyable.
Distribution, Isolation, and Failover:
Fault tolerance is a desirable characteristic of any big system: if an error happens, the system should try its best to recover without human intervention before giving up and informing the user. The results of inevitable programming bugs, hardware failures, et al., should be hidden from the user as much as possible and isolated from the rest of the system.
The way this is typically accomplished in a web application is by separating the model and the view: data is persisted in a database (perhaps with a separate in-memory cache), with each short-lived request retrieving only the parts relevant to that request. Because the data is persisted, a failed read request can be re-attempted. Cache misses and database failure can be detected by the non-database layers and either reported to the user or worked around using replication.
While this architecture works pretty well in general, it isn't as successful in a chat application due to the high volume of long-lived requests, the non-relational nature of the data involved, and the statefulness of each request.
For Facebook Chat, we rolled our own subsystem for logging chat messages (in C++) as well as an epoll-driven web server (in Erlang) that holds online users' conversations in-memory and serves the long-polled HTTP requests. Both subsystems are clustered and partitioned for reliability and efficient failover. Why Erlang? In short, because the problem domain fits Erlang like a glove. Erlang is a functional concurrency-oriented language with extremely low-weight user-space "processes", share-nothing message-passing semantics, built-in distribution, and a "crash and recover" philosophy proven by two decades of deployment on large soft-realtime production systems.
Glueing with Thrift:
Despite those advantages, using Erlang for a component of Facebook Chat had a downside: that component needed to communicate with the other parts of the system. Glueing together PHP, Javascript, Erlang, and C++ is not a trivial matter. Fortunately, we have Thrift. Thrift translates a service description into the RPC glue code necessary for making cross-language calls (marshalling arguments and responses over the wire) and has templates for servers and clients. Since going open source a year ago (we had the gall to release it on April Fool's Day, 2007), the Thrift project has steadily grown and improved (with multiple iterations on the Erlang binding). Having Thrift available freed us to split up the problem of building a chat system and use the best available tool to approach each sub-problem.
Ramping up:
The secret for going from zero to seventy million users overnight is to avoid doing it all in one fell swoop. We chose to simulate the impact of many real users hitting many machines by means of a "dark launch" period in which Facebook pages would make connections to the chat servers, query for presence information and simulate message sends without a single UI element drawn on the page. With the "dark launch" bugs fixed, we hope that you enjoy Facebook Chat now that the UI lights have been turned on.
Eugene is a Facebook Engineer
Welcome to the Facebook Engineering Blog! 
Published on 2008-5-14 13:51:31 by Engineering @ Facebook

Published on 2008-5-14 13:51:31 by Engineering @ Facebook
We are going to use this space to tell you a little about the code and systems that power Facebook. We thought it would be fun to share what goes on behind the scenes to ensure that the site scales smoothly and that we continue to provide the best overall user experience. Expect to see technical details, architecture discussions and maybe even some code samples. You should also stay tuned to find out about ripstiks, daft punk and a lot of other small things that make Facebook Engineering tick.
Aditya is a Director of Engineering at Facebook
Aditya is a Director of Engineering at Facebook
10 items








Engineering @ Facebook的Facebook随笔