Big Data: Real-time analytics at scale with Druid
Guillaume Torche, Big Data Engineer, GumGum
10:30am (GC-150)
GumGum uses Druid to ingest more than 30 billion events every day, which can be queried almost as soon as they happen with a very low response time. This is a tell-all talk about GumGum's love story with Druid, how Druid works and how GumGum leverages Druid's capabilities.
Big Data: Portable Stream and Batch processing with Apache Beam and Google Cloud Dataflow
Eric Anderson, Product Manager, Google
11:10am (Library 4th Fl.)
This talk explores deploying a series of small and large batch and streaming pipelines locally, to Spark and Flink clusters, and to Google Cloud Dataflow services, to give the audience a feel for the portability of Beam, a new portable Big Data processing framework recently submitted by Google to the Apache Software Foundation. This talk will look at how the programming model handles late-arriving data in a stream with event time, windows, and triggers.
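The event-time concepts this session names (fixed windows, late data) can be sketched without Beam itself; the function below is a toy Python model of window assignment and an allowed-lateness check, not Beam's actual API:

```python
from collections import defaultdict

def window_start(event_time, width=60):
    """Assign an event timestamp to the start of its fixed window."""
    return event_time - (event_time % width)

def aggregate(events, width=60, watermark=None, allowed_lateness=0):
    """Count events per fixed event-time window, dropping any event that
    arrives after watermark - allowed_lateness (its window is finalized)."""
    counts = defaultdict(int)
    for event_time in events:
        if watermark is not None and event_time < watermark - allowed_lateness:
            continue  # too late: the window's result has already been emitted
        counts[window_start(event_time, width)] += 1
    return dict(counts)
```

In a real Beam pipeline the runner tracks the watermark and triggers fire per window; here both are collapsed into a single pass for illustration.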
Big Data: Building scalable enterprise data flows and IoT apps using Apache NiFi
Dhruv Kumar, Senior Solutions Architect, Hortonworks
11:50am (GC-130)
Connecting enterprise systems has always been a tough task. Modern IoT applications have exacerbated the issue by requiring legacy systems to integrate with novel high-velocity data streams. Various patterns like messaging, REST, etc. have been proposed, but they necessitate rearchitecting the integration layer, which is extremely arduous. In this talk we will show you how to use Apache NiFi to solve your data integration, movement, and ingestion problems. Next, we will examine how Apache NiFi can be used to construct durable, scalable, and responsive IoT apps in conjunction with other stream processing and messaging frameworks.
Big Data: Twitter Heron @ Scale
Karthik Ramasamy, Engineering Manager, Twitter
1:30pm (Library 4th Fl.)
Twitter generates billions and billions of events per day. Analyzing these events in real time presents a massive challenge. Twitter designed and deployed a new streaming system called Heron. Heron has been in production for nearly two years and is widely used by several teams for diverse use cases. This talk looks at Twitter's operating experiences and challenges of running Heron at scale, and the approaches taken to solve those challenges.
Big Data: Warner Bros. Digital Consumer Intelligence at Scale
Brian Kursar, VP - Data Strategy and Architecture, Warner Bros.
2:10pm (GC-160)
Warner Bros. processes billions of records each day globally across its web assets, digital content distribution, OTT streaming services, online and mobile games, technical operations, anti-piracy programs, social media, and retail point-of-sale transactions. Combining these datasets with content metadata, Warner Bros. is able to produce consumer insights and affinity models that result in highly accurate audience segments.
Big Data: Rapid Analytics @ Netflix LA (Updated and Expanded)
Chris Stephens, Senior Data Engineer, Netflix, Inc.
2:50pm (FA-100)
This talk explores how Netflix equips its engineers with the freedom to find and introduce the right software for the job - even if it isn't used anywhere else in-house. Examples include how Netflix has enabled analysts to fluidly switch between MPP RDBMS and an auto-scaling Presto cluster, how Spark + NoSQL stores are used when deploying data sets to internal web apps, and how data scientists are enabled to work in the ML framework of their choosing and deploy models as a service.
Big Data: Sponsored - Puree through Trillions of clicks in seconds
Jag Srawan, Engineer, Interana
3:50pm (GC-130)
Interana is a full-stack analytics solution that provides lightning-fast querying capabilities using a proprietary storage format. Interana was designed to utilize the best of both in-memory and disk architectures. This talk serves as an introduction to concepts in event data and to the advanced behavioral analysis built into Interana. Attendees will learn how to model their data effectively using our full-service solution.
Big Data: How To Use Impala and Kudu To Optimize Performance for Analytic Workloads
David Alves, Software Engineer, Cloudera
4:30pm (Library 4th Fl.)
This session describes how Impala integrates with Kudu for analytic SQL queries on Hadoop, and how this integration, by taking full advantage of the distinct properties of Kudu, yields significant performance benefits.
Big Data: Apply R in Enterprise Applications
Louis Bajuk-Yorgan, Sr. Director, Product Management, TIBCO Software Inc.
5:10pm (GC-150)
Prototypes are typically re-implemented in another language due to compatibility issues with R in the enterprise, but TIBCO Enterprise Runtime for R (TERR) allows the language to be run on several platforms. Enterprise-level scalability has been brought to the R language, enabling rapid iteration without the need to recode, re-implement and test. This presentation will delve further into these topics, highlighting specific use cases and the true value that can be gained from utilizing R. The session will be followed by a lively, open Q&A discussion.
Big Data: Fluentd and Embulk: Collect More Data, Grow Faster
Kazuki Ohta, Chief Technology Officer, Treasure Data
5:50pm (GC-130)
Since Doug Cutting invented Hadoop and Amazon Web Services released S3 ten years ago, we've seen quite a bit of innovation in large-scale data storage and processing. These innovations have enabled engineers to build data infrastructure at scale, but many of them fail to fill their scalable systems with useful data, struggling to unify data silos or to collect logs from thousands of servers and millions of containers. Fluentd and Embulk are two projects I've been involved in that solve the unsexy yet critical problem of data collection and transport. In this talk, I will give an overview of Fluentd and Embulk and survey how they are used at companies like Microsoft and Atlassian or in projects like Docker and Kubernetes.
Data Science: Data storytelling for impact
Dave Goodsmith, Data Scientist, DataScience Inc.
10:30am (GC-160)
How can our data make the biggest impact? How do we find the stories worth sharing buried in our analytics? How important are visuals, hooks, connections, content? As data science and journalism have co-evolved, the potential for effectively communicating with data has skyrocketed. We'll look at case studies of impactful data stories and share the process for developing data stories that drive action.
Data Science: Decision Making and Lambda Architecture
Girish Kathalagiri, Staff Engineer, Samsung SDS Research America
11:10am (GC-130)
Online decision making requires interacting with an ever-changing environment, and the underlying machine learning models need to change and adapt with it. This talk discusses a class of machine learning algorithms for this setting and details how the computation is parallelized using the Spark framework.
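The lambda architecture the title refers to serves queries by merging a complete-but-stale batch view with a fresh-but-partial speed view; a minimal sketch with hypothetical count data:

```python
def merge_views(batch_view, speed_view):
    """Lambda architecture serving layer: combine per-key counts from the
    batch layer (complete, recomputed periodically) with incremental deltas
    from the speed layer (real-time, covers only recent data)."""
    merged = dict(batch_view)
    for key, delta in speed_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged
```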
Data Science: Sponsored - Data Science + Hollywood
Todd Holloway, Director of Content Science & Algorithms, Netflix; Conor Dowling, Content Analytics Manager, Netflix
11:50am (FA-100)
Netflix will spend six billion dollars this year on content, making the company a major player in Hollywood. An increasing portion of this spend will be on original shows such as House of Cards, and original movies such as Beasts of No Nation. As we continue to expand our involvement with Hollywood, we want to leverage data and data science to make the best decisions possible. This talk will explore areas where we see the most opportunity to apply data science to Hollywood, and some early approaches we've taken.
Data Science: Enabling Cross-Screen Advertising with Machine Learning and Spark
Debajyoti (Deb) Ray, Chief Data Officer, VideoAmp
1:30pm (FA-100)
With content now viewed seamlessly across multiple screens, this shift in consumer behavior has collided with the way advertising is sold - separately, in TV and online silos - creating an opportunity to make advertising more effective using data and machine learning. This talk explores technological developments at VideoAmp that bring together data from disparate mediums: ML-based cross-screen audience models for bid optimization, graph-based audience models covering 150 million users across over a billion unique device IDs, and behavioral insights gleaned from observing such a large variety of data.
Data Science: The right tool for the job: Guidelines for algorithm selection in predictive modeling
Derek Wilcox, Senior Data Scientist, ZestFinance
2:10pm (Library 4th Fl.)
The goal of this talk is to lay out a framework for which algorithms work best in which situations, and why. Drawing on the results of hundreds of crowd-sourced predictive modeling contests, this talk shows examples of how problem structure informs the choice of algorithm. As an illustration of these concepts, ZestFinance's work with China's retail giant is used to describe how the right algorithms were applied to the right datasets to turn shopping data into credit data -- creating credit scores from scratch.
Data Science: Stream processing with R and Amazon Kinesis
Gergely Daroczi, Director of Analytics, CARD.com
2:50pm (GC-160)
This talk presents an original R client library for interacting with Amazon Kinesis: a simple daemon starts multiple R sessions on a machine or cluster of nodes to process data from theoretically any number of shards. It will also feature demo micro-applications that stream dummy credit card transactions, enrich the data, and then trigger other data consumers for various business needs, such as scoring transactions, updating real-time dashboards, and messaging customers. Besides the technical details, the motives behind choosing R and Kinesis will also be covered, including a quick overview of the related data infrastructure changes at CARD.
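The enrich-then-fan-out shape of the demo pipeline can be sketched generically (in Python here rather than R, with no real Kinesis calls; every name is illustrative):

```python
def process_stream(records, enrich, consumers):
    """Per-record pipeline: enrich each raw record, then hand the enriched
    record to every downstream consumer (e.g. scoring, dashboards, messaging)."""
    for record in records:
        enriched = enrich(record)
        for consume in consumers:
            consume(enriched)
```

In the real system the records would arrive continuously from Kinesis shards and each consumer would be a separate worker; the loop above collapses that to its essential flow.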
Data Science: Sponsored - The Evolving Data Science Landscape
Kyle Polich, Principal Consulting Engineer, DataScience
3:50pm (FA-100)
The impact of data science on business is undeniable, and the value it provides is growing without signs of slowing. To keep up with this rapidly evolving technology landscape, data scientists must adapt and specialize through continuous learning. This talk focuses on how they can do that in a way that maximizes the positive impact data science will have on their organization.
Data Science: Affinity Marketing Leveraging Crowdsourced Psychographics
Ravi Iyer, Chief Data Scientist, Ranker; Glenn Walker, COO, Ranker
4:30pm (GC-130)
The most important variables to use to discover your best future customers are increasingly psychological. Borrowing techniques from psychometrics, this talk shows how marketers can use disparate online data sources to measure the right psychographic variables in order to maximize both performance and scale.
Data Science: Intuit's Payments Risk Platform
Dusan Bosnjakovic, Data Scientist, Intuit; Boris Belyi, Manager - Risk Analytics, Intuit
5:10pm (FA-100)
This talk explores the path taken at Intuit, the maker of TurboTax, Mint, and QuickBooks, to operationalize predictive analytics, and highlights automations that have allowed Intuit to stay ahead of the fraud curve.
Data Science: Backstage to a Data Driven Culture: Your Data Science and Analytics Stack
Pauline Chow, Consultant / Lead Data Science Instructor, General Assembly
5:50pm (GC-150)
When you're the first data professional at an organization, there are technical, process, and qualitative considerations for analytics and data science (A/DS) to address. This talk is an overview of strategy, infrastructure, and tools for creating your first A/DS stack. At this stage, the range of problems you are able to solve spans organization, operations, data engineering, business intelligence, and communication. Creating the optimal A/DS stack can seamlessly pave the way to big data and to integrating the newest technologies in the future. Please share your stories and experience with us as well. Outline of the talk, where sections are intended to be interactive and gather feedback from the audience:
1. So you're the first Data Scientist
2. Setting Their Expectations
3. Lay of the Land - Data requirements and organizational survey
4. Setting Your Expectations
5. Infrastructure - Your Stack Options
6. Resources: Get Help, Get a Team
7. Discussion
Hadoop / Spark / Kafka: Real-time Aggregations, Approximations, Similarities, and Recommendations at Scale using Spark Streaming, ML, GraphX, Kafka, Cassandra, Docker, CoreNLP, Word2Vec, LDA, and Twitter Algebird
Chris Fregly, Research Scientist, PipelineIO
10:30am (FA-100)
Live, Interactive Recommendations Demo
Topics covered:
Types of Similarity - Euclidean vs. Non-Euclidean, Jaccard Similarity, Cosine Similarity, LogLikelihood Similarity, Edit Distance.
Text-based Similarities and Analytics - Word2Vec, LDA Topic Extraction, TextRank.
Similarity-based Recommendations - User-to-User; Content-based, Item-to-Item (Amazon); Collaborative-based, User-to-Item (Netflix); Graph-based, Item-to-Item "Pathways" (Spotify).
Aggregations, Approximations, and Similarities at Scale - Twitter Algebird, MinHash and Bucketing, Locality Sensitive Hashing (LSH), Bloom Filters, Count-Min Sketch, HyperLogLog.
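Two of the listed techniques, Jaccard similarity and its MinHash approximation, admit a compact from-scratch sketch; this is purely illustrative, not Algebird's or Spark's implementation:

```python
import hashlib

def jaccard(a, b):
    """Exact Jaccard similarity between two sets."""
    return len(a & b) / len(a | b)

def minhash_signature(items, num_hashes=128):
    """MinHash: for each of num_hashes seeded hash functions, keep the
    minimum hash value over the set's items."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{item}".encode()).digest()[:8], "big")
            for item in items
        ))
    return sig

def minhash_similarity(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard
    similarity, because each min-hash collides with probability J(a, b)."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

Signatures are tiny compared to the original sets, which is what makes similarity computable at scale (and bucketable via LSH).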
Hadoop / Spark / Kafka: Iterative Spark Development at Bloomberg
Nimbus Goehausen, Senior Software Engineer, Bloomberg
11:10am (FA-100)
This presentation will explore how Bloomberg uses Spark, with its formidable computational model for distributed, high-performance analytics, to take this process to the next level. It will also look into one of the innovative practices the team is currently developing to increase efficiency: the introduction of a logical signature for datasets.
Hadoop / Spark / Kafka: Alluxio (formerly Tachyon): An Open Source Memory Speed Virtual Distributed Storage
Gene Pang, Software Engineer / Founding Member, Alluxio
11:50am (Library 4th Fl.)
Alluxio, formerly Tachyon, is a memory speed virtual distributed storage system. The Alluxio open source community is one of the fastest growing open source communities in big data history with more than 300 developers from over 100 organizations around the world. In the past year, the Alluxio project experienced a tremendous improvement in performance and scalability and was extended with key new features including tiered storage, transparent naming, and unified namespace. Alluxio now supports a wide range of under storage systems, including Amazon S3, Google Cloud Storage, Gluster, Ceph, HDFS, NFS, and OpenStack Swift. This year, our goal is to make Alluxio accessible to an even wider set of users, through our focus on security, new language bindings, and further increased stability.
Hadoop / Spark / Kafka: Data Provenance Support in Spark
Matteo Interlandi, Postdoc, UCLA
1:30pm (GC-150)
Debugging data processing logic in Data-Intensive Scalable Computing (DISC) systems is a difficult and time-consuming effort. To aid this effort, we built Titian, a library that enables tracking data provenance through transformations in Apache Spark.
Hadoop / Spark / Kafka: Introduction to Kafka
Jesse Anderson, CEO, Smoking Hand
2:10pm (GC-130)
An introduction to what Kafka is, the concepts behind it, and its API.
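Kafka's central concept, an append-only partitioned log that each consumer reads at its own offset, can be modeled in a few lines; this toy class is a conceptual sketch, not the Kafka client API:

```python
class MiniLog:
    """Toy model of a single Kafka topic partition: producers append
    records, and each consumer independently tracks its read offset."""

    def __init__(self):
        self.records = []

    def produce(self, value):
        self.records.append(value)
        return len(self.records) - 1  # offset assigned to the new record

    def consume(self, offset, max_records=10):
        """Return a batch starting at `offset`, plus the next offset to poll."""
        batch = self.records[offset:offset + max_records]
        return batch, offset + len(batch)
```

Because offsets live with the consumer rather than the broker, many independent consumers can replay the same log, which is the property the real system builds on.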
Hadoop / Spark / Kafka: Building an Event-oriented Data Platform
Eric Sammer, CTO & Co-Founder, Rocana
2:50pm (Library 4th Fl.)
While we frequently talk about how to build interesting products on top of machine and event data, the reality is that collecting, organizing, providing access to, and managing this data is where most people get stuck. In this session, we’ll follow the flow of data through an end to end system built to handle tens of terabytes per day of event-oriented data, providing real time streaming, in-memory, SQL, and batch access to this data. We’ll go into detail on how open source systems such as Hadoop, Kafka, Solr, and Impala/Hive are actually stitched together; describe how and where to perform data transformation and aggregation; provide a simple and pragmatic way of managing event metadata; and talk about how applications built on top of this platform get access to data and extend its functionality. This session is especially recommended for data infrastructure engineers and architects planning, building, or maintaining similar systems.
Hadoop / Spark / Kafka: Panel - Interactive Applications on Spark?
Raj Babu, CEO, AgileISS (Moderator); Raymond Fu, Practice Architect, Trace3 (Panelist); David Levinger, Sr. Director of Information Technology, Paxata (Panelist)
4:00pm - 5:00pm (GC-150)
In this interactive panel discussion, you will hear from these Spark experts about why they chose to go "all-in" on Spark, leveraging the rich core capabilities that make Spark so exciting, and committing significant IP to turning Spark into a world-class enterprise data preparation engine.
Raymond and David will explain specific cases where capabilities were built on top of core Spark to provide a truly interactive data prep application experience. Innovations such as a Domain Specific Language (DSL), an optimizing compiler, a persistent columnar caching layer, application-specific Resilient Distributed Datasets (RDDs), and online aggregation operators solve the core memory, pipelining, and shuffling obstacles to produce a highly interactive application with the user and data-volume scale-out benefits of Spark.
Hadoop / Spark / Kafka: Why is my Hadoop cluster slow?
Bikas Saha, Software Engineer, Hortonworks
5:10pm (GC-160)
This talk draws on our experience debugging and analyzing Hadoop jobs to describe methodical approaches to the problem, and presents current and new tracing and tooling ideas that can help semi-automate parts of this difficult task.
Hadoop / Spark / Kafka: Deep Learning at Scale
Alexander Kern, Co-founder / CTO, Pavlov
5:50pm (FA-100)
The advent of modern deep learning techniques has given organizations new tools to understand, query, and structure their data. However, maintaining complex pipelines, versioning models, and tracking accuracy regressions over time remain ongoing struggles for even the most advanced data engineering teams. This talk presents a simple architecture for deploying machine learning at scale and offers suggestions for how companies can get their feet wet with open source technologies they already deploy.
NoSQL: Real Life IoT Architecture
Dinesh Srirangpatna, Big Data Strategist, Microsoft
10:30am (GC-130)
Learn how to benefit from IoT (internet of things) to reduce costs and spur transformation for your company and clients. Attendees will learn about building blocks to create an IoT solution, and walk through real life architectural decisions in building a solution.
NoSQL: Amazon DynamoDB - Focus on Your Data and Leave Ops to Someone Else
Michael Limcaco, Principal Solutions Architect, Amazon Web Services (AWS)
11:10am (GC-150)
This talk explores features and benefits of Amazon DynamoDB, a fully managed NoSQL database service, in detail, and discusses how to get the most out of DynamoDB, in addition to design best practices with DynamoDB across multiple use cases.
NoSQL: Sponsored - Spark and Couchbase: Augmenting the Operational Database with Spark
Matt Ingenthron, Senior Director of Engineering, Couchbase
11:50am (GC-160)
For an operational database, Spark is like Batman’s utility belt: it handles a variety of important tasks from data cleanup and migration to analytics and machine learning that make the operational database much more powerful than it would be on its own. In this talk, we describe the Couchbase Spark Connector that lets you easily integrate Spark with Couchbase Server, an open source distributed NoSQL document database that provides low latency data management for large scale, interactive online applications. We’ll start with common use cases for Spark and Couchbase, then cover the basics of creating, persisting, and consuming RDDs and DataFrames from Couchbase’s key/value and SQL interfaces.
NoSQL: Using Redis Data Structures to Make Your App Blazing Fast
Adi Foulger, Solutions Architect, Redis Labs
1:30pm (GC-130)
Open Source Redis is not only the fastest NoSQL database but also the most popular among the new wave of databases running in containers. This talk introduces the data structures used to speed up applications and solve the everyday use cases that are driving Redis' popularity.
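One of the data structures behind that speed is the sorted set; the class below is a toy Python model of its leaderboard-style usage (approximating ZADD/ZREVRANGE semantics, not the Redis client API):

```python
class SortedSet:
    """Toy model of a Redis sorted set: members keyed by a numeric score,
    read back in score order without re-sorting application-side data."""

    def __init__(self):
        self.scores = {}

    def zadd(self, member, score):
        """Insert a member or update its score (like ZADD)."""
        self.scores[member] = score

    def zrevrange(self, start, stop):
        """Members by descending score over an inclusive rank range,
        the classic top-N leaderboard query (like ZREVRANGE)."""
        ranked = sorted(self.scores, key=lambda m: -self.scores[m])
        return ranked[start:stop + 1]
```

Real Redis keeps this ordering incrementally (skip list plus hash), so top-N reads are cheap even under constant score updates.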
NoSQL: Apache Kudu: Fast Analytics on Fast Data
Dan Burkert, Software Engineer, Cloudera
2:10pm (GC-150)
Apache Kudu (incubating) is a new storage engine for the Hadoop ecosystem that enables extremely high-speed analytics without imposing data-visibility latencies. This talk provides an introduction to Kudu, and provides an overview of how, when, and why practitioners use Kudu as a platform for building analytics solutions.
NoSQL: Big Data and Real Estate
Jon Zifcak, MBA, MSIM, CEO, Zulloo Inc.; Anton Polishko, CTO, Zulloo Inc.
2:50pm (GC-150)
The real estate industry is generating terabytes of data, but only a very small percentage is being utilized or processed. ZULLOO Inc. is creating an artificial intelligence engine utilizing big data and machine learning. The question is: why aren't more data scientists exploring the real estate industry when it represents 15% of US GDP, measured in the trillions?
NoSQL: Sponsored - Analytics at the Speed of Light with Redis and Spark
Dave Neilsen, Developer Relations, Redis Labs
3:50pm (Library 4th Fl.)
Spark is in-memory; Redis is in-memory. The Spark-Redis connector gives Spark access to Redis' data structures as RDDs. Redis, with its blazing-fast performance and optimized in-memory data structures, reduces Spark processing time by up to 98%. In this talk, Dave will share the top use cases for Spark-Redis, such as time series, recommendations, and real-time bid management.
NoSQL: Introduction to Graph Databases
Oren Golan, VP of Engineering, Sanguine
4:30pm (FA-100)
Graph databases have been adopted across many sectors - IoT, health care, financial services, telecommunications, and government. This talk, based on our research and implementation of a graph database at Sanguine, a startup based in LA, dives into a few use cases and equips attendees with everything they need to start using a graph database.
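The multi-hop traversal that motivates graph databases can be illustrated over a plain edge list of (source, relation, destination) triples; a minimal sketch with hypothetical data, not any particular graph database's API:

```python
from collections import deque

def neighbors(edges, node):
    """Adjacency lookup over (src, relation, dst) triples, the
    property-graph shape most graph databases expose."""
    return [dst for src, _rel, dst in edges if src == node]

def reachable(edges, start):
    """Breadth-first traversal: the kind of multi-hop query that is
    painful as chained relational joins but cheap in a graph store."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors(edges, node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```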
NoSQL: MongoDB 3.2 Goodness!!!
Mark Helmstetter, Principal Consulting Engineer, MongoDB
5:10pm (GC-130)
This talk explores the new features of MongoDB 3.2, such as $lookup, document validation rules, and encryption at rest, and tools like the BI Connector, Ops Manager 2.0, and Compass.
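The semantics of the $lookup stage (a left outer join that attaches matching documents as an array field) can be sketched in plain Python; this mimics the behavior for illustration and is not MongoDB code:

```python
def lookup(local_docs, foreign_docs, local_field, foreign_field, as_field):
    """Mimic MongoDB 3.2's $lookup: for each local document, attach every
    foreign document whose foreign_field equals its local_field, as an
    array under as_field (empty array when nothing matches)."""
    out = []
    for doc in local_docs:
        matches = [f for f in foreign_docs
                   if f.get(foreign_field) == doc.get(local_field)]
        joined = dict(doc)
        joined[as_field] = matches
        out.append(joined)
    return out
```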
NoSQL: Privacy vs. Security in a Big Data World
Tamara Dull, Director of Emerging Technologies, SAS Institute
5:50pm (Library 4th Fl.)
The jury is still out on whether Edward Snowden was a hero, traitor, or schmuck. Regardless of the scarlet letter we want to hang around his neck, we should thank him for helping bring the discussion of big data privacy and security to the public square. This session examines the issues of big data privacy and security in the context of the six-stage (big) data lifecycle: create, store, use, share, archive, and destroy.
Use Case Driven: Reliable Media Reporting in an Ever-changing Data Landscape
Josh Andrews, Data & Analytics Architect, OnPrem; Rachel Kelley, Project Manager, OnPrem; Eric Avila, Senior Anti-Piracy Technologist, NBC Universal
10:30am (Library 4th Fl.)
OnPrem Solution Partners worked with NBCU to profile in-house data to determine data quality, and recommend process and quality improvements. We present our process for data import, improvements we want to make, and lessons learned regarding various tools used, including MariaDB, ElasticSearch, Cassandra, and others.
Use Case Driven: The Encyclopedia of World Problems
Christine Zhang, Data Journalist, Knight-Mozilla @ LA Times
11:10am (GC-160)
Born more than four decades ago from the partnership of two international NGOs in Brussels, the Encyclopedia of World Problems has hand-picked and refined profiles of tens of thousands of problems occurring around the world: from notorious global issues all the way down to very specific and peculiar ones. This talk presents an overview of the Encyclopedia and the interesting data science applications that have arisen from the Encyclopedia's body of work - notably, its database resources.
Use Case Driven: Sponsored - BI is broken
Dave Fryer, Product Advocate, Domo
11:50am (GC-150)
Not all BI solutions are created equal. The problem in most organizations is that disparate systems hold data hostage. Most systems create barriers between the data and the people who need the data to make decisions. We create silos of data that do not give us a holistic view of how the organization is operating. Domo is breaking down these silos and giving business users unparalleled access to the data they need to optimize their business.
Use Case Driven: Dealing with Data Discomfort: Getting Bureaucrats to Embrace Data and Analytics
Juan Vasquez, Communications & Data Analyst, Mayor's Operations Innovation Team, City of Los Angeles
1:30pm (GC-160)
Government is traditionally known for red tape, stuffy hierarchies, endless policies, and clashing priorities. These and other variables make it difficult for government entities to embrace change and innovation, and, more importantly, make them apprehensive about peeling back the layers and letting data tell the stories.

So how do you change that? In this talk we'll discuss how the Mayor's Operations Innovation Team is leveraging storytelling, education, public-private partnerships, and data visualization technologies to help LA embrace data.
Use Case Driven: Data and Hollywood: "Je t'Aime ... Moi Non Plus"
Yves Bergquist, Project Director, "Data & Analytics", USC Entertainment Technology Center
2:10pm (FA-100)
The application of machine learning to problems such as script and story analysis, audience segmentation, and security is revolutionizing the way Hollywood creates and markets entertainment.
Use Case Driven: Hydrator: Open Source, Code-Free Data Pipelines
Jon Gray, CEO, Cask Data
2:50pm (GC-130)
This talk will present how to build data pipelines with no code using the open-source, Apache 2.0-licensed Cask Hydrator. The talk will continue with a live demonstration of creating data pipelines for two use cases.
Use Case Driven: Sponsored - From Clusters to Clouds, Hardware Still Matters
Eric Lesser, Director of Operations, PSSC Labs
3:50pm (GC-160)
Today’s Software Defined environments attempt to remove the weaknesses of computing hardware from the operational equation. There is no doubt that this is a natural progression away from overpriced, proprietary compute and storage layers. However, at the heart of any Software Defined universe is an underlying hardware stack that must be robust, reliable, and cost effective. Our 20+ years of experience delivering over 2,000 clusters and clouds has taught us how to properly design and engineer the right hardware solution for Big Data, cluster, and cloud environments. This presentation will share this knowledge, allowing users to make better design decisions for any deployment.
Use Case Driven: How to Use Design Thinking to Jumpstart Your Big Data Projects
Peter Reale, Solutions Engineer, Datameer
4:30pm (GC-160)
There is a novel approach to identifying big data use cases, one which will ultimately lower the barrier to entry to big data projects and increase overall implementation success. This talk describes the approach used by big data pioneer and Datameer CEO Stefan Groschupf to drive over 200 production implementations.
Use Case Driven: Shaping the Role of Data Science: An Evolution towards Prescriptive Analytics as Key Driver in Revenue Acceleration
Thomas Sullivan, Chief Data Scientist, IRIS.TV
5:10pm (Library 4th Fl.)
At IRIS.TV, our business builds algorithmic solutions for video recommendation with the end goal to deliver a great user experience as evidenced by users viewing more video content. This talk outlines our reasons for expanding from a descriptive/predictive approach to data analytics toward a philosophy that features more prescriptive analytics, driven by our data science team.
Use Case Driven: Raising Venture Capital for Data Driven Startups
Austin Clements, Associate, TenOneTen Ventures
5:50pm (GC-160)
Get an inside look into how VCs evaluate your team, market, and product before making an investment decision. Learn how to identify the right investors for your business and how to stand out from the crowd.