Delta

Maintaining Your Data Lake At Scale With Spark

2019-06-17 · Data Engineering Podcast

podcast_episode

by Michael Armbrust (Databricks), Tobias Macey

AI/ML Analytics API Big Data Cloud Computing Data Analytics Data Engineering Data Lake Data Management Data Science Databricks Spark

Summary

Building and maintaining a data lake is a choose-your-own-adventure of tools, services, and evolving best practices. The flexibility and freedom that data lakes provide allow for generating significant value, but they can also lead to anti-patterns and inconsistent quality in your analytics. Delta Lake is an open source, opinionated framework built on top of Spark for interacting with and maintaining data lake platforms that incorporates the lessons learned at Databricks from countless customer use cases. In this episode Michael Armbrust, the lead architect of Delta Lake, explains how the project is designed, how you can use it for building a maintainable data lake, and some useful patterns for progressively refining the data in your lake. This conversation was useful for getting a better idea of the challenges that exist in large scale data analytics, and the current state of the tradeoffs between data lakes and data warehouses in the cloud.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!

And to keep track of how your team is progressing on building new pipelines and tuning their workflows, you need a project management system designed by engineers, for engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Data Engineering Podcast listeners get 2 months free on any plan by going to dataengineeringpodcast.com/clubhouse today and signing up for a free trial. Support the show and get your data projects in order!

You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.

Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.

To help other people find the show please leave a review on iTunes and tell your friends and co-workers.

Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat.

Your host is Tobias Macey and today I’m interviewing Michael Armbrust about Delta Lake, an open source storage layer that brings ACID transactions to Apache Spark and big data workloads.

Interview

Introduction How did you get involved in the area of data m

Found on Friday: 4 Indian Playback Singers and 2 Norteño Bandas Killing the YouTube Game

2019-06-14 · How Music Charts

podcast_episode

by Rutger (Chartmetric), Jason Joven (Chartmetric)

Singer

Highlights

Do you know what a playback singer is? Or how about that Mexican Norteño music has German polka in it? I sure didn’t, but our A&R tool did!

Mission

Good morning, it’s Jason here at Chartmetric with your 3-minute Data Dump where we upload charts, artists and playlists into your brain so you can stay up on the latest in the music data world.

Date

This is your Data Dump for Friday, June 14th, 2019.

Found on Friday: 4 Indian Playback Singers and 2 Norteño Bandas

So checking into our A&R tool which roams the Interwebs for the biggest delta, or change, in between now and 28 days ago, we focus on the singular metric of total YouTube views via their artist channel.

Looking at the Top 20 biggest gains, what’s not surprising? Billie Eilish at #5, that’s cool, Will Smith at #7 after the new Aladdin movie releasing, that’s also awesome…

But you know what’s really hot? Indian playback singers, because they occupy positions 1 through 4!

A playback singer in Bollywood masterfully records world-class vocals for songs for the on-camera actors to lip-sync to during shooting.

For us Westerners who are obsessed with authenticity, let’s just imagine a publicly accepted form of lip-sync that not only helps create great Indian movies, but also celebrates the playback singers themselves.

In the #1 spot is Calcutta-born Kumar Sanu with 30% YouTube view growth to 16.5M, who also just appeared on TV show Sa Re Ga Ma Pa L'il Champs, which pits 5-15 year olds against each other in a singing competition.

In the #2 position is Arijit Singh who saw 20% YouTube view growth to 18.7M, and just released “Bekhayali” from Indian dramatic film Kabir Singh on June 3rd.

Coming in at #3 on our list, but #1 in the Bollywood industry, is Lata Mangeshkar with 19% view growth to 9M, but it’s honestly a footnote to one of the most well-known and highly-respected playback singers ever.

Mangeshkar has been listed in the Guinness Book of World Records as the most recorded artist with over 30K tracks in 20 different languages, the recipient of the Bharat Ratna, India’s highest civilian honor (equivalent to the US Presidential Medal of Freedom), recipient of France’s Legion of Honour, and publicly selected as the 10th Greatest Indian of modern times.

How’s that for achievement? I really don’t think she cares about her YouTube views right now, nor should she. Hats off to her.

Moving to Mexico, Norteño music is a genre of Northern Mexico that blends German polka and waltz traditions with Mexican ones.

For all of us not familiar with Mexican music, the key instruments that define Norteño are the accordion (gracias a los europeos) and the bajo sexto, which translates to “sixth bass”, and looks like a 12-string guitar, but is used as a bass instrument.

Now in the #6 position is Los Invasores De Nuevo León, with 10% YouTube view growth to 26M.

The Latin Grammy-nominated Los Invasores, or “The Invaders of Nuevo León”, formed in 1978, and are currently on tour in south Texas.

In the #16 position is Los Tucanes De Tijuana, with 5% view growth to 132M.

“Los Tucanes”, or “The Toucans of Tijuana”, made history this year as the first norteño act to play Coachella, also getting keys to the city.

And if you want to catch up with some meme action, look up the “La Chona” challenge... their fast-paced 1994 record received a revival last year when uploaders recorded themselves dancing to “La Chona” outside their moving vehicles, a la Drake’s “In My Feelings”.

Outro

Bueno! That’s it for your Daily Data Dump for Friday, June 14th, 2019. This is Jason from Chartmetric.

Please give us a shout-out on iTunes. If you’re on an iPhone, dodge those crafty notifications and just scroll down on the Daily Data Dump page in your Apple Podcasts app or in the Ratings and Review tab in your iTunes app on your laptop, and show some love, Rutger and I appreciate it.

Free accounts are at chartmetric.com

And article links and show notes are at: podcast.chartmetric.com

Happy Friday, have a great weekend, and see you on Monday!

Brief Conversations From The Open Data Science Conference: Part 1 - Episode 30

2018-05-07 · Data Engineering Podcast

podcast_episode

by Stepan Pushkarev (Hydrosphere.io), Alan Anders (Applecart), Tobias Macey

AI/ML API Data Engineering Data Management Data Science Databricks Spark

Summary

The Open Data Science Conference brings together a variety of data professionals each year in Boston. This week’s episode consists of a pair of brief interviews conducted on-site at the conference. First up you’ll hear from Alan Anders, the CTO of Applecart about their challenges with getting Spark to scale for constructing an entity graph from multiple data sources. Next I spoke with Stepan Pushkarev, the CEO, CTO, and Co-Founder of Hydrosphere.io about the challenges of running machine learning models in production and how his team tracks key metrics and samples production data to re-train and re-deploy those models for better accuracy and more robust operation.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.

Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.

Your host is Tobias Macey and this week I attended the Open Data Science Conference in Boston and recorded a few brief interviews on-site. First up you’ll hear from Alan Anders, the CTO of Applecart about their challenges with getting Spark to scale for constructing an entity graph from multiple data sources. Next I spoke with Stepan Pushkarev, the CEO, CTO, and Co-Founder of Hydrosphere.io about the challenges of running machine learning models in production and how his team tracks key metrics and samples production data to re-train and re-deploy those models for better accuracy and more robust operation.

Interview

Alan Anders from Applecart

What are the challenges of gathering and processing data from multiple data sources and representing them in a unified manner for merging into single entities? What are the biggest technical hurdles at Applecart?

Contact Info

@alanjanders on Twitter LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Spark, Databricks, Databricks Delta, Applecart

Stepan Pushkarev from Hydrosphere.io

What is Hydrosphere.io? What metrics do you track to determine when a machine learning model is not producing an appropriate output? How do you determine which data points to sample for retraining the model? How does the role of a machine learning engineer differ from data engineers and data scientists?

Contact Info

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Hydrosphere, Machine Learning Engineer

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

CRDTs and Distributed Consensus with Christopher Meiklejohn - Episode 14

2018-01-15 · Data Engineering Podcast

podcast_episode

by Christopher Meiklejohn (LASP), Tobias Macey

Cassandra Data Engineering Data Management Docker DynamoDB GitHub Kubernetes Linux TensorFlow

Summary

As we scale our systems to handle larger volumes of data, geographically distributed users, and varied data sources the requirement to distribute the computational resources for managing that information becomes more pronounced. In order to ensure that all of the distributed nodes in our systems agree with each other we need to build mechanisms to properly handle replication of data and conflict resolution. In this episode Christopher Meiklejohn discusses the research he is doing with Conflict-Free Replicated Data Types (CRDTs) and how they fit in with existing methods for sharing and sharding data. He also shares resources for systems that leverage CRDTs, how you can incorporate them into your systems, and when they might not be the right solution. It is a fascinating and informative treatment of a topic that is becoming increasingly relevant in a data driven world.
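To make the idea of conflict-free replication concrete, here is a minimal sketch of one of the simplest CRDTs, a grow-only counter (G-Counter). It is not taken from the episode or from LASP; the class name `GCounter` and the replica ids `"a"` and `"b"` are illustrative assumptions. Each replica increments only its own slot, and replicas converge by taking the element-wise maximum when they merge.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class GCounter:
    """Grow-only counter CRDT: one non-negative slot per replica (illustrative)."""

    replica_id: str
    counts: Dict[str, int] = field(default_factory=dict)

    def increment(self, amount: int = 1) -> None:
        # A replica may only ever increase its own slot.
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self) -> int:
        # The counter's value is the sum over all replica slots.
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas converge regardless of merge order or repetition.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)


if __name__ == "__main__":
    a, b = GCounter("a"), GCounter("b")
    a.increment(3)   # updates applied independently on each replica
    b.increment(2)
    a.merge(b)       # exchange state in both directions
    b.merge(a)
    a.merge(b)       # merging again changes nothing
    print(a.value(), b.value())
```

Because merging never needs coordination, this style of data type trades expressiveness (the counter cannot decrease) for availability, which is one reason a CRDT is not always the right fit.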

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure.

When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.

Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.

You can help support the show by checking out the Patreon page which is linked from the site.

To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers.

Your host is Tobias Macey and today I’m interviewing Christopher Meiklejohn about establishing consensus in distributed systems.

Interview

Introduction

How did you get involved in the area of data management?

You have dealt with CRDTs with your work in industry, as well as in your research. Can you start by explaining what a CRDT is, how you first began working with them, and some of their current manifestations?

Other than CRDTs, what are some of the methods for establishing consensus across nodes in a system and how does increased scale affect their relative effectiveness?

One of the projects that you have been involved in which relies on CRDTs is LASP. Can you describe what LASP is and what your role in the project has been?

Can you provide examples of some production systems or available tools that are leveraging CRDTs?

If someone wants to take advantage of CRDTs in their applications or data processing, what are the available off-the-shelf options, and what would be involved in implementing custom data types?

What areas of research are you most excited about right now?

Given that you are currently working on your PhD, do you have any thoughts on the projects or industries that you would like to be involved in once your degree is completed?

Contact Info

Website, cmeiklejohn on GitHub, Google Scholar Citations

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Basho, Riak, Syncfree, LASP, CRDT, Mesosphere, CAP Theorem, Cassandra, DynamoDB, Bayou System (Xerox PARC), Multivalue Register, Paxos, Raft, Byzantine Fault Tolerance, Two Phase Commit, Spanner, ReactiveX, TensorFlow, Erlang, Docker, Kubernetes, Erleans, Orleans, Atom Editor, Automerge, Martin Kleppmann, Akka, Delta CRDTs, Antidote DB, Kops, Eventual Consistency, Causal Consistency, ACID Transactions, Joe Hellerstein


How to Calculate Options Prices and Their Greeks: Exploring the Black Scholes Model from Delta to Vega

2015-06-02 · O'Reilly Data Science Books

book

by Pierino Ursone

data data-science data-science-tasks statistics time-series

A unique, in-depth guide to options pricing and valuing their greeks, along with a four dimensional approach towards the impact of changing market circumstances on options.

How to Calculate Options Prices and Their Greeks is the only book of its kind, showing you how to value options and the greeks according to the Black Scholes model but also how to do this without consulting a model. You'll build a solid understanding of options and hedging strategies as you explore the concepts of probability, volatility, and put call parity, then move into more advanced topics in combination with a four-dimensional approach of the change of the P&L of an option portfolio in relation to strike, underlying, volatility, and time to maturity. This informative guide fully explains the distribution of first and second order Greeks along the whole range wherein an option has optionality, and delves into trading strategies, including spreads, straddles, strangles, butterflies, kurtosis, vega-convexity, and more. Charts and tables illustrate how specific positions in a Greek evolve in relation to its parameters, and digital ancillaries allow you to see 3D representations using your own parameters and volumes.

The Black and Scholes model is the most widely used option model, appreciated for its simplicity and ability to generate a fair value for options pricing in all kinds of markets. This book shows you the ins and outs of the model, giving you the practical understanding you need for setting up and managing an option strategy.

· Understand the Greeks, and how they make or break a strategy
· See how the Greeks change with time, volatility, and underlying
· Explore various trading strategies
· Implement options positions, and more

Representations of option payoffs are too often based on a simple two-dimensional approach consisting of P&L versus underlying at expiry. This is misleading, as the Greeks can make a world of difference over the lifetime of a strategy. How to Calculate Options Prices and Their Greeks is a comprehensive, in-depth guide to a thorough and more effective understanding of options, their Greeks, and (hedging) option strategies.
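As a rough illustration of what valuing an option and two of its Greeks under Black Scholes involves, the sketch below prices a European call and put and computes delta and vega. It is not taken from the book; it assumes a non-dividend-paying underlying and a continuously compounded risk-free rate, and the function name `black_scholes` and its parameter names are illustrative choices.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def norm_pdf(x: float) -> float:
    """Standard normal probability density function."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def black_scholes(spot: float, strike: float, rate: float,
                  vol: float, years: float) -> dict:
    """European option values and Greeks (assumes no dividends)."""
    sqrt_t = math.sqrt(years)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol ** 2) * years) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t
    discount = math.exp(-rate * years)  # present value of 1 paid at expiry
    return {
        "call": spot * norm_cdf(d1) - strike * discount * norm_cdf(d2),
        "put": strike * discount * norm_cdf(-d2) - spot * norm_cdf(-d1),
        "call_delta": norm_cdf(d1),        # sensitivity to the underlying
        "put_delta": norm_cdf(d1) - 1.0,
        # Vega per unit of volatility (1.00 = 100 vol points); same for call and put.
        "vega": spot * norm_pdf(d1) * sqrt_t,
    }


if __name__ == "__main__":
    result = black_scholes(spot=100.0, strike=100.0, rate=0.05, vol=0.20, years=1.0)
    for name, value in result.items():
        print(f"{name}: {value:.4f}")
```

A quick sanity check on any such implementation is put call parity: the call value minus the put value should equal the spot minus the discounted strike.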

Analytics at Work

2010-02-12 · O'Reilly Data Science Books

book

by Thomas Davenport (Babson College), Jeanne G. Harris, Robert Morison

Analytics Marketing analytics-platforms data data-science

Most companies have massive amounts of data at their disposal, yet fail to utilize it in any meaningful way. But a powerful new business tool - analytics - is enabling many firms to aggressively leverage their data in key business decisions and processes, with impressive results. In their previous book, Competing on Analytics, Thomas Davenport and Jeanne Harris showed how pioneering firms were building their entire strategies around their analytical capabilities. Rather than "going with the gut" when pricing products, maintaining inventory, or hiring talent, managers in these firms use data, analysis, and systematic reasoning to make decisions that improve efficiency, risk-management, and profits. Now, in Analytics at Work, Davenport, Harris, and coauthor Robert Morison reveal how any manager can effectively deploy analytics in day-to-day operations, one business decision at a time. They show how many types of analytical tools, from statistical analysis to qualitative measures like systematic behavior coding, can improve decisions about everything from what new product offering might interest customers to whether marketing dollars are being most effectively deployed. Based on all-new research and illustrated with examples from companies including Humana, Best Buy, Progressive Insurance, and Hotels.com, this implementation-focused guide outlines the five-step DELTA model for deploying and succeeding with analytical initiatives. You'll learn how to: · Use data more effectively and glean valuable analytical insights · Manage and coordinate data, people, and technology at an enterprise level · Understand and support what analytical leaders do · Evaluate and choose realistic targets for analytical activity · Recruit, hire, and manage analysts Combining the science of quantitative analysis with the art of sound reasoning, Analytics at Work provides a road map and tools for unleashing the potential buried in your company's data.

Get started with Data Engineering

· Data + AI Summit 2025

talk

Data Engineering Databricks DWH SQL

In this course, you will learn basic skills that will allow you to use the Databricks Data Intelligence Platform to perform a simple data engineering workflow and support data warehousing endeavors. You will be given a tour of the workspace and be shown how to work with objects in Databricks such as catalogs, schemas, volumes, tables, compute clusters and notebooks. You will then follow a basic data engineering workflow to perform tasks such as creating and working with tables, ingesting data into Delta Lake, transforming data through the medallion architecture, and using Databricks Workflows to orchestrate data engineering tasks. You’ll also learn how Databricks supports data warehousing needs through the use of Databricks SQL, DLT, and Unity Catalog.

talk-data.com
