By Dana Gardner
January 23, 2017 12:00 PM EST
The next BriefingsDirect big data case study discussion explores how Etsy, a global e-commerce site focused on handmade and vintage items, uses data science to improve buyers' and sellers' discovery and shopping experiences.
We'll learn how mining big data at speed and volume helps Etsy define and distribute top trends, and allows those with specific interests to find items that will best appeal to them.
To learn more about leveraging big data in the e-commerce space, please join Chris Bohn aka “CB,” a Senior Data Engineer at Etsy, based in Brooklyn, New York. The discussion is moderated by me, Dana Gardner, Principal Analyst at Interarbor Solutions.
Here are some excerpts:
Gardner: Tell us about Etsy for those who aren’t familiar with it. I've heard it described as being like going through your grandmother's basement. Is that fair?
CB: Well, I hope it’s not as musty and dusty as my grandmother’s basement. The best way to describe it is that Etsy is a marketplace. We create a marketplace for sellers of handcrafted goods and the people who want to buy those goods.
We've been around for 10 years. We're the leader in this space and we went public in 2015. Just some quick metrics: The total value of the merchandise sold on Etsy in 2014 was about $1.93 billion. We have about 1.5 million sellers and about 22 million buyers.
Gardner: That's an awful lot of stuff that’s being moved around. What does the big data and analytics role bring to the table?
CB: It’s all about understanding more about our customers, both buyers and sellers. We want to know more about them and make the buying experience easier for them. We want them to be able to find products easier. Too much choice sometimes is no choice. You want to get them to the product they want to buy as quickly as possible.
We also want to know how people are different in their shopping habits across the geography of the world. There are some people in different countries that transact differently than we do here in the States, and big data lets us get some insight into that.
Gardner: Is this insight derived primarily from what they do via their clickstreams, what they're doing online? Or are there other ways that you can determine insights that you can then share among yourselves and also back to your users?
CB: I'll describe our data architecture a little bit. When Etsy started out, we had a monolithic Postgres database and we threw everything in there. We had listings, users, sellers, buyers, conversations, and forums. It was all in there, but we outgrew that really quickly, and so the solution to that was to shard horizontally.
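As an illustration of what horizontal sharding means in practice, here is a minimal sketch of hash-based shard routing. It is not Etsy's implementation: the shard names, the shard count, and the choice of hashing on a user ID are all assumptions made for the example.

```python
import hashlib

# Hypothetical shard names; the interview only says "many hundreds" of MySQL shards.
SHARDS = [f"mysql-shard-{i:03d}" for i in range(8)]


def shard_for(user_id: int, shards=SHARDS) -> str:
    """Map a key to one shard using a stable hash and modulo placement."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


def group_by_shard(user_ids):
    """Group keys by target shard so that each shard is queried only once."""
    plan = {}
    for uid in user_ids:
        plan.setdefault(shard_for(uid), []).append(uid)
    return plan
```

Every row for a given key lands on the same server, which keeps single-record lookups cheap. A query that spans all users has to visit every shard, which is one reason analytics on sharded data tends to need a separate system.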
Now we have many hundreds of sharded MySQL servers, horizontal. Then we decided that we needed to do some analytics on this stuff. So we scratched our heads. This was about five years ago. So we said, "Let’s just set up a Postgres server and we'll copy all the data from these shards into the Postgres server that we call BI server." And we got that done.
Then, we kind of scratched our heads and said, "Wait a minute. We just came full circle. We started with a monolithic database, then we went sharded, and now all the data is back monolithic."
It didn't perform well, because it's hard to get the volume of big data into that database. A relational database like Postgres just isn’t designed to do analytic-type queries. Those are big aggregations, and Postgres, even though it is a great relational database, is really tailored for single-record lookup.
So we decided to get something else going on. About three-and-a-half years ago, we set about searching for the replacement to our monolithic business-intelligence (BI) database and looked at what the landscape was. There were a number of very worthy products out there, but we eventually settled on HPE Vertica for a number of reasons.
One of those is that it derives, in large part, from Postgres. Postgres has a Berkeley license. So companies could take it private. They can take that code and they don’t have to republish it out to the community, unlike other types of open source copyright agreements.
So we found out that the parser was right out of Postgres and all the date handling and typecasting stuff that is usually different from database to database was exactly spot-on the same between Vertica and Postgres. Also, data ingestion via the copy command is the best way to bulk-load data, exactly the same in both, and it’s the same format.
We said, "This looks good, because we can get the data in quickly, and queries will probably not have to be edited much." So that's where we went. We experimented with it and we found exactly that. Queries would run unchanged, except they ran a lot faster and we were able to get the data in easily.
We built some data replication tools to get data from the shards and also some legacy Postgres databases that we had lying around for billing, and got all that data into HPE Vertica.
Then, we built some tools that allowed our analysts to bring over custom tables they had created on that old BI machine. We were able to get up to speed really quickly with Vertica, and boom, we had an analytics database that we were able to hit the ground running with.
Gardner: And is the challenge for you about the variety of that data? Is it about the velocity that you need to move it in and out? Is it about simply volume that you just have so much of it, or a little of some of those?
All of the above
CB: It’s really all of those problems. Velocity-wise, we want our replication system to be eventually consistent, and we want it to be as near real-time as possible. There is a challenge in that, because you really start to get into micro-batching data in.
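The micro-batching idea can be sketched as follows: instead of loading one row at a time or waiting for a nightly job, changes are grouped into small batches and each batch is loaded in one operation. This is a generic sketch, not Etsy's replication tool; `load_batch` is a hypothetical callback standing in for whatever bulk-load step the target database provides.

```python
from typing import Callable, Iterable, Iterator, List


def micro_batches(rows: Iterable[dict], max_rows: int) -> Iterator[List[dict]]:
    """Yield lists of at most max_rows rows, preserving arrival order."""
    batch: List[dict] = []
    for row in rows:
        batch.append(row)
        if len(batch) >= max_rows:
            yield batch
            batch = []
    if batch:
        # Flush the final, possibly smaller, batch.
        yield batch


def replicate(rows: Iterable[dict],
              load_batch: Callable[[List[dict]], None],
              max_rows: int = 1000) -> int:
    """Send rows to load_batch in micro-batches; return the number of batches."""
    count = 0
    for batch in micro_batches(rows, max_rows):
        load_batch(batch)
        count += 1
    return count
```

A production version would normally also flush on a timer, so that a slow trickle of changes does not wait indefinitely for a batch to fill.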
This is where we ended up having to pay off some technical debt, because years ago, disk storage was fairly pricey, and databases were designed to minimize storage. Practices grew up around that fact. So data would get deleted and updated. That's the policy that the early originators of Etsy followed when they designed the first database for it.
What we have ended up with is lossy data. If someone changes the description or the tags that are associated with a listing, the old ones go away. They are lost forever. And that's too bad, because if we had kept those, we could do analytics on a product that wasn’t selling for a long time and then all of a sudden started selling. What changed? We would love to do analytics on that, but we can't do it because of the loss of data. That's one thing that we learned in this whole process.
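One common way to avoid this kind of loss is to append a new version of a record on every change rather than updating it in place. The sketch below shows the pattern with Python's built-in `sqlite3` module. The table name, columns, and sample listings are hypothetical and are not Etsy's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE listing_versions (
        listing_id  INTEGER NOT NULL,
        version     INTEGER NOT NULL,
        description TEXT NOT NULL,
        tags        TEXT NOT NULL,
        PRIMARY KEY (listing_id, version)
    )
""")


def save_listing(conn, listing_id, description, tags):
    """Append a new version instead of overwriting the existing row."""
    (latest,) = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM listing_versions "
        "WHERE listing_id = ?",
        (listing_id,),
    ).fetchone()
    conn.execute(
        "INSERT INTO listing_versions VALUES (?, ?, ?, ?)",
        (listing_id, latest + 1, description, tags),
    )
    return latest + 1


def history(conn, listing_id):
    """Return every version of a listing, oldest first."""
    return conn.execute(
        "SELECT version, description, tags FROM listing_versions "
        "WHERE listing_id = ? ORDER BY version",
        (listing_id,),
    ).fetchall()


save_listing(conn, 1, "Wool scarf", "scarf,wool")
save_listing(conn, 1, "Hand-knit merino scarf", "scarf,merino,handmade")
```

With the full history retained, an analyst can line up the date of each description or tag change against sales and ask what changed when an item started selling. The cost is extra storage.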
But getting back to your question here about velocity and then also the volume of data, we have a lot of data from our production databases. We need to get it all into Vertica. We also have a lot of clickstream data. Etsy is a top 50 website, I believe, for traffic, and that generates a lot of clicks and that all gets put into Vertica.
We run big batch jobs every night to load that. It's important that we have that, because one of the biggest things that our analysts like to do is correlate clickstream data with our production data. Clickstream data doesn't have a lot of information about the user who is doing those clicks. It’s just information about their path through the site at that time.
To really get a value-add on that, you want to be able to join on your user details tables, so that you can know where this person lives, how old they are, or their buying history in the past. You need to be able to join those, too, and we do that in HPE Vertica.
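The join described here, enriching anonymous click events with user attributes, looks roughly like the following. The sketch uses Python's built-in `sqlite3` so that it runs anywhere; the tables, columns, and rows are made up for illustration, and a real deployment would run comparable SQL in the analytics database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (user_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE clicks (user_id INTEGER, page TEXT);
    INSERT INTO users VALUES (1, 'US'), (2, 'DE'), (3, 'US');
    INSERT INTO clicks VALUES
        (1, '/listing/10'), (1, '/cart'), (2, '/listing/10'), (3, '/search');
""")


def clicks_by_country(conn):
    """Join click events to user details and aggregate by country."""
    return conn.execute("""
        SELECT u.country, COUNT(*) AS clicks
        FROM clicks AS c
        JOIN users AS u ON u.user_id = c.user_id
        GROUP BY u.country
        ORDER BY u.country
    """).fetchall()
```

On their own, the click rows say only which pages were visited. The join is what turns them into a statement about where the visitors are.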
Gardner: CB, give us a sense about the paybacks, when you do this well, when you've architected, and when you've paid your technical debts, as you put it. How are your analysts able to leverage this in order to make your business better and make the experience of your users better?
CB: When we first installed Vertica, it was just a small group of analysts that were using it. Our analytics program was fairly new, but it just exploded. Everybody started to jump in on it, because all of a sudden, there was a database with which you could write good SQL, with a rich SQL engine, and get fantastic results quickly.
The results weren’t that different from what we were getting in the past, but they were just coming to us so fast, the cycle of getting information was greatly shortened. Getting result sets was so much better that it was like a whole different world. It’s like the Pony Express versus email. That’s the kind of difference it was. So everybody started jumping in on it.
Engineers who were adding new facets of the product wanted to have dashboards, more or less real time, so they could monitor what the thing was doing. For example, we added postage to Etsy, so that our sellers can have preprinted labels. We'd like to monitor that in real time to see how it's going. Is it going well or what?
That was something that took a long time to analyze before we got into big-data analytics. All of a sudden, we had Vertica and we could do that for them, and that pattern has repeated with other groups in the company.
We're doing different aspects of the site. All of a sudden, you have your marketing people, your finance people, saying, "Wow, I can run these financial reports that used to take days in literally seconds." There was a lot of demand. Etsy has about 750 employees and we have way more than 200 Vertica accounts. That shows you how popular it is.
One anecdotal story. I've been wanting to update Vertica for the past couple of months. The woman who runs our analytics team said, "Don't you dare. I have to run Q2 numbers. Everybody is working on this stuff. You have to wait until this certain week to be able to do that." It’s not just HPE Vertica, but big data is now relied on for so many things in the company.
Gardner: So the technology led to the culture. Many times we think it's the other way around, but having that ability to do those easy SQL queries and get information opened up people's imagination, but it sounds like it has gone beyond that. You have a data-driven company now.
CB: That's an astute observation. You're right. This is technology that has driven the culture. It's really changed the way people do their job at Etsy. And I hear that elsewhere also, just talking to other companies and stuff. It really has been impactful.
Gardner: Just for the sake of those of our readers who are on the operations side, how do you support your data infrastructure? Are you thinking about cloud? Are you on-prem? Are you split between different data centers? How does that work?
CB: We would run nightly jobs. We collected all of the search terms that were used and buying patterns and we fed these into MapReduce jobs. The output from that then went into MATLAB, and we would get a set of rules out of that, that then would drive our search engine, basically improving search.
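For readers unfamiliar with the MapReduce model, the following single-process sketch shows the three phases applied to counting search terms. It only illustrates the programming model. It is not Hadoop code and not Etsy's job, and the sample queries in the usage note are invented.

```python
from collections import defaultdict


def map_phase(query: str):
    """Map: emit a (term, 1) pair for every term in one search query."""
    for term in query.lower().split():
        yield term, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single total."""
    return key, sum(values)


def count_terms(queries):
    """Run map, shuffle, and reduce over a collection of search queries."""
    pairs = (pair for query in queries for pair in map_phase(query))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

For example, `count_terms(["wool scarf", "Wool hat"])` counts "wool" twice and the other two terms once each. In a real cluster the map and reduce calls run in parallel on many nodes and the shuffle moves data across the network.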
We did that for a while and then realized we were spending a lot of money in AWS. It was many thousands of dollars a month. We said, "Wait a minute. This is crazy. We could actually buy our own servers. This is commodity hardware that this can run on, and we can run this in our own data center. We will get the data in faster, because there are bigger pipes." So that's what we did.
We created what we call Etsydoop, which has got 200+ nodes and we actually save a lot of money doing it that way. That's how we got into it.
We really have a bifurcated data analytics, big-data system. On the one hand, we have Vertica for doing ad hoc queries, because the analysts and the people out there understand SQL and they demand it. But for batch jobs, Hadoop rocks, and it's really, really good for that.
But the tradeoff is that those are hard jobs to write. Even a good engineer is not going to get it right every time, and for most analysts, it's probably a little bit beyond their reach to get down, roll up their sleeves, and get into actual coding and that kind of stuff.
But they're great at SQL, and we want to encourage exploration and discovering new things. We've discovered things about our business just by some of these analysts wildcatting in the database, finding interesting stuff, and then exploring it, and we want to encourage that. That's really important.
Gardner: CB, in getting to understand Etsy a little bit more, I saw that you have something called Top Trends and Etsy Finds, ways that you can help people with affinity for a product or a craft or some interest to pursue that. Did that come about as a result of these technologies that you have put in place, or did they have a set of requirements that they wanted to be able to do this and then went after you to try to accommodate it? How do you pull off that Etsy Finds capability?
CB: A lot of that is cross-architecture. Some of our production data is used to find that. Then, a lot of the hard crunching is done in Vertica to find that. Some of it is MapReduce. There's a whole mix of things that go into that.
I couldn't claim for Etsy Finds, for example, that it’s all big data. There are other things that go in there, but definitely HPE Vertica plays a role in that stuff.
I'll give you another example, fraud. We fingerprint a lot of our users digitally, because we have problems with resellers. These are people who are selling resold mass-produced stuff on Etsy. It's not huge, but it's an annoyance. Those products compete against really quality handmade products that our regular sellers sell in their shops.
Sometimes it’s like a game of Whack-a-Mole. You knock one of these guys down -- sometimes they're from the Far East or other parts of the world -- and as soon as you knock one down, another one pops up. Being able to capture them quickly is really important, and we use Vertica for that. We have a team that works just on that problem.
Gardner: Thinking about the future, with this great architecture, with your ability to do things like fraud detection and affinity correlations, what's next? What can you do that will help make Etsy more impactful in its market and make your users more engaged?
CB: The whole idea behind databases and computing in general is just making things faster. When the first punch-card machines came out in the 1930s or whatever, the phone companies could do faster billing, because billing was just getting out of control. That’s where the roots of IBM lie.
As time went by, punch cards were slow and they wanted to go faster. So they developed magnetic tape, and then spinning rust disks. Now, we're into SSDs, the flash drives. And it’s the same way with databases and getting answers. You always want to get answers faster.
We do a lot of A/B testing. We have the ability to set the site so that maybe a small percentage of users get an A path through the site, and the others a B path, and there's control stuff on that. We analyze those results. This is how we test to see if this kind of button works better than this other one. Is the placement right? If we just skip this page, is it easier for someone to buy something?
So we do A/B testing. In the past, we've done it where we had to run the test, gather the data, and then comb through it manually. But now with Vertica, the turnaround time to iterate over each cycle of an A/B test has shrunk dramatically. We get our data from the clickstreams, which go into Vertica, and then the next day, we can run the A/B test results on that.
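The two halves of an A/B test, assigning users to a path and comparing outcomes, can be sketched as below. This is a generic illustration under stated assumptions: hashing the user ID for assignment and using a two-proportion z statistic for the comparison are common choices, not methods the interview attributes to Etsy, and the experiment name and percentage are hypothetical.

```python
import hashlib
import math


def assign_variant(user_id: str, experiment: str, percent_a: int = 10) -> str:
    """Deterministically send roughly percent_a% of users down path A."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "A" if bucket < percent_a else "B"


def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """z statistic for the difference in conversion rate between A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / std_err
```

Because the assignment depends only on the user and experiment names, a returning user always sees the same path. The z statistic is computed from four counts that a single aggregate query over the clickstream can produce, which is where a fast analytics database shortens each test cycle.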
The next step is shrinking that even more. One of the themes that’s out there at the various big data conferences is streaming analytics. That's a really big thing. There is a new database out there called PipelineDB, a fork of Postgres. It allows you to create an event stream into Postgres.
You can then create a view and a window on top of that stream. Then you can pump your event data, like your clickstream data, and you can join the data in that window to your regular Postgres tables, which is really great, because we could get A/B information in real time. You set up a one minute turnaround as opposed to one day. I think that’s where a lot of things are going.
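The idea of a window over an event stream can be shown without any database at all. The class below is a plain-Python sketch of a sliding-window counter. It does not use or imitate the PipelineDB API, and it assumes events arrive in timestamp order.

```python
from collections import deque


class SlidingWindowCounter:
    """Count events per key over the most recent window_seconds."""

    def __init__(self, window_seconds: float = 60.0):
        self.window = window_seconds
        self.events = deque()  # (timestamp, key) pairs, oldest first
        self.counts = {}

    def add(self, timestamp: float, key: str) -> None:
        """Record one event and drop anything that has left the window."""
        self.events.append((timestamp, key))
        self.counts[key] = self.counts.get(key, 0) + 1
        self._expire(timestamp)

    def _expire(self, now: float) -> None:
        # Remove events that are at least one full window old.
        while self.events and self.events[0][0] <= now - self.window:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]

    def snapshot(self, now: float) -> dict:
        """Return the per-key counts for the window ending at `now`."""
        self._expire(now)
        return dict(self.counts)
```

Feeding click events keyed by A/B variant into a counter like this gives a result that is at most one window old, instead of waiting for the next nightly load.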
If you just look at the history of big data, MapReduce started about 10 years ago at Google, and that was batch jobs, overnight runs. Then, we started getting into the columnar stores to make databases like Vertica possible, and it’s really great for aggregation. That kicked it up to the next level.
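Why a columnar layout suits aggregation can be seen in a toy example. The sketch below stores the same hypothetical orders two ways, as a list of rows and as one list per column, and sums one field under each layout. It illustrates the layout idea only and says nothing about Vertica's internals.

```python
# Hypothetical order data, used only to contrast the two layouts.
rows = [
    {"order_id": 1, "country": "US", "amount": 20.0},
    {"order_id": 2, "country": "DE", "amount": 35.5},
    {"order_id": 3, "country": "US", "amount": 12.25},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_row_store(rows, field):
    """Row layout: every whole row is visited to read one field."""
    return sum(row[field] for row in rows)


def total_column_store(columns, field):
    """Column layout: only the one column being aggregated is read."""
    return sum(columns[field])
```

Both functions return the same total. The difference is how much data has to be touched: the row version reads every record in full, while the column version reads a single contiguous list, which matters a great deal when the table has many columns and billions of rows.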
Another thing is real-time analytics. It’s not going to replace any of these things, just like Vertica didn't replace Hadoop. They're complementary. Real-time streaming analytics will be complementary. So we're continuing to add these tools to our big data toolbox.
Gardner: It has compressed those feedback loops. If we provide that capability to an innovative, creative organization, the technology might drive the culture, and who knows what sort of benefits they will derive from that.
All plugged in
CB: That's very true. You touched earlier on how we do our infrastructure. I'm in data engineering, and we're responsible for making sure that our big databases are healthy and running right. But we also have our operations department. They're working on the actual pipes and hardware and making sure it’s all plugged in. It's tough to get all this stuff working right, but if you have the right people, it can happen.
I mentioned AWS earlier. The reason we were able to move off of that and save money is because we have the people who can do it. When you start using AWS extensively, what you're doing is paying for a very high-priced but good IT staff at Amazon. If you have a good IT staff of your own, you're probably going to be able to realize some efficiencies there, and that's really why we moved over. We do it all ourselves.
Gardner: Having it as a core competency might be an important thing moving forward.
CB: Absolutely. You have to stay on top of all this stuff. A lot is made of the word disruption, and you don't go knocking on disruption’s door; it usually knocks on yours. And you had better be agile enough to respond to it.
I'll give you an example that ties back into big data. One of the most disruptive things that has happened to Etsy is the rise of the smartphone. When Etsy started back in 2005, the iPhone wasn't around yet; it was still two years out. Then, it came on the scene, and people realized that this was a suitable device for commerce.
It’s very easy to just be complacent and oblivious to new technologies sneaking up on you. But we started seeing that there was more and more commerce being done on smartphones. We actually fell a little bit behind, as a lot of companies did five years ago. But our management made decisions to invest in mobile, and now 60 percent of our traffic is on mobile. That's turned around in the past two years and it has been pretty amazing.
Big data helps us with that, because we do a lot of crunching of what these mobile devices are doing. Mobile is not the best device maybe for buying stuff because of the form factor, but it is a really good device for managing your store, paying your Etsy bill, and doing that kind of stuff. So we analyzed all that and crunched it in big data.
Gardner: And big data allowed you to know when to make that strategic move and then take advantage of it?
CB: Exactly. There are all sorts of crossover points that happen with technology, and you have to monitor it. You have to understand your business really well to see when certain vectors are happening. If you can pick up on those, you're going to be okay.