talk-data.com

Topic: CI/CD (Continuous Integration/Continuous Delivery)

Tags: devops, automation, software_development, ci_cd

262 tagged

Activity Trend: peak of 21 activities per quarter (2020-Q1 to 2026-Q1)

Activities

262 activities · Newest first

Summary Building internal expertise around big data in a large organization is a major competitive advantage. However, it can be a difficult process due to compliance needs and the need to scale globally on day one. In this episode Jesper Søgaard and Keld Antonsen share the story of starting and growing the big data group at LEGO. They discuss the challenges of being at global scale from the start, hiring and training talented engineers, prototyping and deploying new systems in the cloud, and what they have learned in the process. This is a useful conversation for engineers, managers, and leadership who are interested in building enterprise big data systems.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat. Your host is Tobias Macey, and today I’m interviewing Keld Antonsen and Jesper Soegaard about the data infrastructure and analytics that powers LEGO.

Interview

Introduction How did you get involved in the area of data management? My understanding is that the big data group at LEGO is a fairly recent development. Can you share the story of how it got started?

What kinds of data practices were in place prior to starting a dedicated group for managing the organization’s data? What was the transition process like, migrating data silos into a uniformly managed platform?

What are the biggest data challenges that you face at LEGO? What are some of the most critical sources and types of data that you are managing? What are the main components of the data infrastructure that you have built to support the organization’s analytical needs?

What are some of the technologies that you have found to be most useful? Which have been the most problematic?

What does the team structure look like for the data services at LEGO?

Is that reflected in the types and numbers of systems that you support?

What types of testing, monitoring, and metrics do you use to ensure the health of the systems you support? What have been some of the most interesting, challenging, or useful lessons that you have learned while building and maintaining the data platforms at LEGO? How have the data systems at Lego evolved over recent years as new technologies and techniques have been developed? How does the global nature of the LEGO business influence the design strategies and technology choices for your platform? What are you most excited for in the coming year?

Contact Info

Jesper

LinkedIn

Keld

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

LEGO Group ERP (Enterprise Resource Planning) Predictive Analytics Prescriptive Analytics Hadoop Center Of Excellence Continuous Integration Spark

Podcast Episode

Apache NiFi

Podcast Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

In this podcast, Maksim, CDO of the City of San Diego, discusses the nuances of running big data for big cities. He shares his perspective on effectively building a central data office in a complex and highly collaborative environment like a big city, offers his thoughts on how to prioritize which projects to pursue, and explains how leadership and execution can blend to solve civic issues in cities big and small. A great practitioner podcast for folks seeking to build a robust data science practice across a large and collaborative ecosystem.

Timeline: 0:28 Maksim's journey. 6:45 Maksim's current role. 11:46 Collaboration process in creating a data inventory. 14:52 Working with the bureaucracy. 18:35 Dealing with unforeseen circumstances at work. 20:22 Prioritization at work. 22:58 Qualities of a good data leader. 26:15 Collaboration with other cities. 27:40 Cool data projects in other cities. 30:55 Shortcomings of other city representatives. 36:54 Use cases in AI. 39:00 What would Maksim change about himself? 40:50 Future cities and data. 43:55 Opportunities for private investors in the public sector. 45:53 Maksim's success mantra. 50:19 Closing remark.

Maksim's Book Recommendation: The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win by Gene Kim, Kevin Behr, George Spafford amzn.to/2MAu5Xv

Podcast Link: https://futureofdata.org/understanding-bigdata-for-bigcities-with-maksim-mrmaksimize-cityofsandiego-futureofdata-podcast/

Maksim's BIO: Maksim Pecherskiy: As the CDO for the City of San Diego, working in the Performance & Analytics Department, Maksim strives to bring the necessary components together to allow the City's residents to benefit from a more efficient, agile government that is as innovative as the community around it. He has been solving complex problems with technology for nearly a decade. He spent 2014 working as a Code For America fellow in Puerto Rico, focusing on economic development. His team delivered a product called PrimerPeso that provides business owners and residents a tool to search, and apply for, government programs for which they may be eligible.

Before moving to California, Maksim was a Solutions Architect at Promet Source in Chicago, where he built large web applications and designed complex integrations. He shaped workflow, configuration management, and continuous integration processes while leading and training international development teams. Before his work at Promet, he was a software engineer at AllPlayers, where he was instrumental in the design and architecture of its APIs and the development and documentation of supporting client libraries in various languages.

Maksim graduated from DePaul University with a bachelor of science degree in information systems and from Linköping University, Sweden, with a bachelor of science degree in international business. He is also certified as a Lean Six Sigma Green Belt.

About #Podcast:

The FutureOfData podcast is a conversation starter that brings leaders, influencers, and leading practitioners on the show to discuss their journeys in creating the data-driven future.

Wanna join? If you or anyone you know wants to join in, register your interest by mailing us at [email protected]

Want to sponsor? Email us at [email protected]

Keywords: FutureOfData, DataAnalytics, Leadership, Futurist, Podcast, BigData, Strategy

We revisit the 2018 Microsoft Build in this episode, focusing on the latest ideas in DevOps. Kyle interviews Cloud Developer Advocates Damian Brady, Paige Bailey, and Donovan Brown to talk about DevOps, data science, and databases. For a data scientist, what does it even mean to "build"? Packaging and deployment are things that a data scientist doesn't normally have to consider in their day-to-day work. The process of making an AI app is usually divided into two streams of work: data scientists building machine learning models and app developers building the application for end users to consume. DevOps includes all the parties involved in getting the application deployed and maintained and thinking about all the phases that follow and precede their part of the end solution. So what does DevOps mean for data science? Why should you adopt DevOps best practices? In the first half, Paige and Damian share their views on what DevOps for data science would look like and how it can be introduced to provide continuous integration, delivery, and deployment of data science models. In the second half, Donovan and Damian talk about the DevOps life cycle of putting a database under version control and carrying out deployments through a release pipeline.
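As a concrete illustration of the "continuous integration for data science models" idea discussed above, the sketch below shows the kind of quality gate a CI server could run on every commit: train a model, evaluate it, and fail the build if the metric drops below a threshold. The dataset, threshold, and script name are illustrative assumptions, not something described in the episode.

```python
# ci_model_gate.py: illustrative CI quality gate for a model (assumed example,
# not from the episode). Exits non-zero when accuracy falls below the bar, so a
# CI stage running "python ci_model_gate.py" fails and blocks deployment.
import sys

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

ACCURACY_THRESHOLD = 0.90  # assumed quality bar agreed with the team


def main() -> int:
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"accuracy={accuracy:.3f} threshold={ACCURACY_THRESHOLD}")
    # A non-zero exit code makes the CI stage fail, blocking promotion.
    return 0 if accuracy >= ACCURACY_THRESHOLD else 1


if __name__ == "__main__":
    sys.exit(main())
```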

Understanding Experimentation Platforms

Thanks to approaches such as continuous integration and continuous delivery, companies that once introduced new products every six months are now shipping software several times a day. Reaching the market quickly is vital today, but rapid updates are impractical unless they provide genuine customer value. With this ebook, you’ll learn how online controlled experiments can help you gain customer feedback quickly so you can maintain a speedy release cycle. Using examples from Google, LinkedIn, and other organizations, Adil Aijaz, Trevor Stuart, and Henry Jewkes from Split Software explain basic concepts and show you how to build a scalable experimentation platform for conducting full-stack, comprehensive, and continuous tests. You’ll learn practical tips on best practices and common pitfalls you’re likely to face along the way. This ebook is ideal for engineers, data scientists, and product managers. Build an experimentation platform that includes a robust targeting engine, a telemetry system, a statistics engine, and a management console. Dive deep into types of metrics, as well as metric frameworks, including Google’s HEART framework and LinkedIn’s 3-tiered framework. Learn best practices for building an experimentation platform, such as A/A testing, power measurement, and an optimal ramp strategy. Understand common pitfalls: how users are assigned across variants and control, how data is interpreted, and how metrics impact is understood.
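One of the pitfalls the ebook calls out is how users are assigned across variants and control. A common approach, shown in the hedged sketch below, is deterministic hashing of the user id and experiment name so assignments stay stable across sessions; the function and variant names here are illustrative, not the authors' implementation.

```python
# Illustrative variant assignment by hashing (assumed example, not from the ebook).
import hashlib


def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically map a user to a variant for a given experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]


# In an A/A test both "variants" serve identical code, which helps verify that
# the assignment logic and the statistics engine are unbiased.
print(assign_variant("user-42", "new-checkout-flow"))
```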

Summary

As software lifecycles move faster, the database needs to be able to keep up. Practices such as version controlled migration scripts and iterative schema evolution provide the necessary mechanisms to ensure that your data layer is as agile as your application. Pramod Sadalage saw the need for these capabilities during the early days of the introduction of modern development practices and co-authored a book to codify a large number of patterns to aid practitioners, and in this episode he reflects on the current state of affairs and how things have changed over the past 12 years.
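To make the idea of version controlled migration scripts concrete, here is a minimal sketch of a migration runner in the spirit of tools like Flyway or Liquibase (both linked below). It uses SQLite so it is self-contained; the migrations directory layout and tracking table are assumptions for illustration, not a description of Pramod's approach.

```python
# Minimal migration runner sketch (illustrative assumptions throughout):
# SQL files such as migrations/001_create_users.sql are applied in order and
# recorded in a schema_migrations table so each script runs exactly once.
import sqlite3
from pathlib import Path

MIGRATIONS_DIR = Path("migrations")  # assumed layout: 001_*.sql, 002_*.sql, ...


def apply_pending_migrations(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (version TEXT PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT version FROM schema_migrations")}
    scripts = sorted(MIGRATIONS_DIR.glob("*.sql")) if MIGRATIONS_DIR.is_dir() else []
    for script in scripts:
        if script.name in applied:
            continue  # already applied to this database
        conn.executescript(script.read_text())
        conn.execute(
            "INSERT INTO schema_migrations (version) VALUES (?)", (script.name,)
        )
        conn.commit()
        print(f"applied {script.name}")


if __name__ == "__main__":
    apply_pending_migrations(sqlite3.connect("app.db"))
```

Because the scripts live in version control alongside the application code, the same sequence can be replayed on every environment, which is what keeps the data layer as agile as the application.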

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers. Your host is Tobias Macey, and today I’m interviewing Pramod Sadalage about refactoring databases and integrating database design into an iterative development workflow.

Interview

Introduction How did you get involved in the area of data management? You first co-authored Refactoring Databases in 2006. What was the state of software and database system development at the time and why did you find it necessary to write a book on this subject? What are the characteristics of a database that make them more difficult to manage in an iterative context? How does the practice of refactoring in the context of a database compare to that of software? How has the prevalence of data abstractions such as ORMs or ODMs impacted the practice of schema design and evolution? Is there a difference in strategy when refactoring the data layer of a system when using a non-relational storage system? How has the DevOps movement and the increased focus on automation affected the state of the art in database versioning and evolution? What have you found to be the most problematic aspects of databases when trying to evolve the functionality of a system? Looking back over the past 12 years, what has changed in the areas of database design and evolution?

How has the landscape of tooling for managing and applying database versioning changed since you first wrote Refactoring Databases? What do you see as the biggest challenges facing us over the next few years?

Contact Info

Website pramodsadalage on GitHub @pramodsadalage on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Database Refactoring

Website Book

Thoughtworks Martin Fowler Agile Software Development XP (Extreme Programming) Continuous Integration

The Book Wikipedia

Test First Development DDL (Data Definition Language) DML (Data Manipulation Language) DevOps Flyway Liquibase DBMaintain Hibernate SQLAlchemy ORM (Object Relational Mapper) ODM (Object Document Mapper) NoSQL Document Database MongoDB OrientDB CouchBase CassandraDB Neo4j ArangoDB Unit Testing Integration Testing OLAP (On-Line Analytical Processing) OLTP (On-Line Transaction Processing) Data Warehouse Docker QA (Quality Assurance) HIPAA (Health Insurance Portability and Accountability Act) PCI DSS (Payment Card Industry Data Security Standard) Polyglot Persistence Toplink Java ORM Ruby on Rails ActiveRecord Gem

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Summary Sharing data across multiple computers, particularly when it is large and changing, is a difficult problem to solve. In order to provide a simpler way to distribute and version data sets among collaborators the Dat Project was created. In this episode Danielle Robinson and Joe Hand explain how the project got started, how it functions, and some of the many ways that it can be used. They also explain the plans that the team has for upcoming features and uses that you can watch out for in future releases.
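For background on the content-addressing idea that underpins protocols like Dat (the links below mention Merkle trees), the sketch here hashes a file's chunks into a Merkle root so collaborators can verify pieces independently. It illustrates the general technique only; it is not the Dat or hypercore wire format, and the chunk size is an assumption.

```python
# Conceptual Merkle-root sketch (illustrative, not Dat's actual data structure).
import hashlib

CHUNK_SIZE = 64 * 1024  # assumed chunk size


def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(path: str) -> str:
    """Hash a file chunk by chunk and fold the hashes into a single root."""
    with open(path, "rb") as f:
        level = [_sha256(chunk) for chunk in iter(lambda: f.read(CHUNK_SIZE), b"")]
    if not level:
        level = [_sha256(b"")]  # an empty file still gets a well-defined root
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last hash on odd levels
        level = [_sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0].hex()
```

Two peers holding the same data compute the same root, so a corrupted or tampered chunk can be detected without re-downloading the whole data set.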

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers. A few announcements:

There is still time to register for the O’Reilly Strata Conference in San Jose, CA March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20%. The O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York, it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%. If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data-driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.

Your host is Tobias Macey and today I’m interviewing Danielle Robinson and Joe Hand about Dat Project, a distributed data sharing protocol for building applications of the future

Interview

Introduction How did you get involved in the area of data management? What is the Dat project and how did it get started? How have the grants to the Dat project influenced the focus and pace of development that was possible?

Now that you have established a non-profit organization around Dat, what are your plans to support future sustainability and growth of the project?

Can you explain how the Dat protocol is designed and how it has evolved since it was first started? How does Dat manage conflict resolution and data versioning when replicating between multiple machines? One of the primary use cases that is mentioned in the documentation and website for Dat is that of hosting and distributing open data sets, with a focus on researchers. How does Dat help with that effort and what improvements does it offer over other existing solutions? One of the difficult aspects of building a peer-to-peer protocol is that of establishing a critical mass of users to add value to the network. How have you approached that effort and how much progress do you feel that you have made? How does the peer-to-peer nature of the platform affect the architectural patterns for people wanting to build applications that are delivered via dat, vs the common three-tier architecture oriented around persistent databases? What mechanisms are available for content discovery, given the fact that Dat URLs are private and unguessable by default? For someone who wants to start using Dat today, what is involved in creating and/or consuming content that is available on the network? What have been the most challenging aspects of building and promoting Dat? What are some of the most interesting or inspiring uses of the Dat protocol that you are aware of?

Contact Info

Dat

datproject.org Email @dat_project on Twitter Dat Chat

Danielle

Email @daniellecrobins

Joe

Email @joeahand on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Dat Project Code For Science and Society Neuroscience Cell Biology OpenCon Mozilla Science Open Education Open Access Open Data Fortune 500 Data Warehouse Knight Foundation Alfred P. Sloan Foundation Gordon and Betty Moore Foundation Dat In The Lab Dat in the Lab blog posts California Digital Library IPFS Dat on Open Collective – COMING SOON! ScienceFair Stencila eLIFE Git BitTorrent Dat Whitepaper Merkle Tree Certificate Transparency Dat Protocol Working Group Dat Multiwriter Development – Hyperdb Beaker Browser WebRTC IndexedDB Rust C Keybase PGP Wire Zenodo Dryad Data Sharing Dataverse RSync FTP Globus Fritter Fritter Demo Rotonde how to Joe’s website on Dat Dat Tutorial Data Rescue – NYTimes Coverage Data.gov Libraries+ Network UC Conservation Genomics Consortium Fair Data principles hypervision hypervision in browser

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


Liberty in IBM CICS: Deploying and Managing Java EE Applications

Abstract This IBM® Redbooks® publication is intended for IBM CICS® system programmers and IBM Z architects. It describes how to deploy and manage Java EE 7 web-based applications in an IBM CICS Liberty JVM server and access data on IBM Db2® for IBM z/OS® and IBM MQ for z/OS sub systems. In this book, we describe the key steps to create and install a Liberty JVM server within a CICS region. We then describe how to best use the different deployment techniques for Java EE applications and the specific considerations when deploying applications that use JDBC, JMS, and the new CICS link to Liberty API. Finally, we describe how to secure web applications in CICS Liberty, including transport-level security and request authentication and authorization by using IBM RACF® and LDAP registries. Information is also provided about how to build a high availability infrastructure and how to use the logging and monitoring functions that are available in the CICS Liberty environment. This book is based on IBM CICS Transaction Server (CICS TS) V5.4 that uses the embedded IBM WebSphere® Application Server Liberty technology. It is also applicable to CICS TS V5.3 with the fixes for the continuous delivery APAR PI77502 applied. Sample applications are used throughout this publication and are freely available for download from the IBM CICSDev GitHub organization along with detailed deployment instructions.

Summary

PostgreSQL has become one of the most popular and widely used databases, and for good reason. The level of extensibility that it supports has allowed it to be used in virtually every environment. At Citus Data they have built an extension to support running it in a distributed fashion across large volumes of data with parallelized queries for improved performance. In this episode Ozgun Erdogan, the CTO of Citus, and Craig Kerstiens, Citus Product Manager, discuss how the company got started, the work that they are doing to scale out PostgreSQL, and how you can start using it in your environment.
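For a flavor of what scaling out PostgreSQL with Citus can look like from application code, here is a hedged sketch using psycopg2. create_distributed_table() is the Citus function for sharding a table by a distribution column; the connection string, table definition, and the assumption of an already-configured coordinator with worker nodes are placeholders for illustration.

```python
# Sketch: shard a table with the Citus extension (placeholder DSN and schema;
# assumes a Citus coordinator with worker nodes already set up).
import psycopg2

conn = psycopg2.connect("host=localhost dbname=app user=postgres")  # placeholder
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS citus;")
    cur.execute(
        """
        CREATE TABLE IF NOT EXISTS events (
            tenant_id bigint NOT NULL,
            event_id  bigint NOT NULL,
            payload   jsonb,
            PRIMARY KEY (tenant_id, event_id)
        );
        """
    )
    # Distribute the table by tenant_id: queries filtered on the distribution
    # column route to a single shard, while others fan out in parallel.
    cur.execute("SELECT create_distributed_table('events', 'tenant_id');")
```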

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers. Your host is Tobias Macey, and today I’m interviewing Ozgun Erdogan and Craig Kerstiens about Citus, worry-free PostgreSQL.

Interview

Introduction How did you get involved in the area of data management? Can you describe what Citus is and how the project got started? Why did you start with Postgres vs. building something from the ground up? What was the reasoning behind converting Citus from a fork of PostGres to being an extension and releasing an open source version? How well does Citus work with other Postgres extensions, such as PostGIS, PipelineDB, or Timescale? How does Citus compare to options such as PostGres-XL or the Postgres compatible Aurora service from Amazon? How does Citus operate under the covers to enable clustering and replication across multiple hosts? What are the failure modes of Citus and how does it handle loss of nodes in the cluster? For someone who is interested in migrating to Citus, what is involved in getting it deployed and moving the data out of an existing system? How do the different options for leveraging Citus compare to each other and how do you determine which features to release or withhold in the open source version? Are there any use cases that Citus enables which would be impractical to attempt in native Postgres? What have been some of the most challenging aspects of building the Citus extension? What are the situations where you would advise against using Citus? What are some of the most interesting or impressive uses of Citus that you have seen? What are some of the features that you have planned for future releases of Citus?

Contact Info

Citus Data

citusdata.com @citusdata on Twitter citusdata on GitHub

Craig

Email Website @craigkerstiens on Twitter

Ozgun

Email ozgune on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Citus Data PostgreSQL NoSQL Timescale SQL blog post PostGIS PostgreSQL Graph Database JSONB Data Type PipelineDB Timescale Postgres-XL Aurora Postgres Amazon RDS Streaming Replication CitusMX CTE (Common Table Expression) HipMunk Citus Sharding Blog Post Wal-e Wal-g Heap Analytics HyperLogLog C-Store

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Summary

Data-oriented applications that need to operate on large, fast-moving streams of information can be difficult to build and scale due to the need to manage their state. In this episode Sean T. Allen, VP of engineering for Wallaroo Labs, explains how Wallaroo was designed and built to reduce the cognitive overhead of building this style of project. He explains the motivation for building Wallaroo, how it is implemented, and how you can start using it today.
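The heart of the episode is managing state in streaming applications. The plain-Python sketch below illustrates the general idea of partitioned, key-routed state that a framework like Wallaroo handles for you at scale; it is not Wallaroo's actual API, and the class and event names are invented for illustration.

```python
# Illustrative partitioned stateful processing (not Wallaroo's API).
from collections import defaultdict


class RunningTotals:
    """Per-partition state: a running total per key."""

    def __init__(self):
        self.totals = defaultdict(int)

    def update(self, key: str, amount: int) -> int:
        self.totals[key] += amount
        return self.totals[key]


NUM_PARTITIONS = 4
partitions = [RunningTotals() for _ in range(NUM_PARTITIONS)]


def handle_event(key: str, amount: int) -> int:
    # Routing by key keeps all state for a given key on one partition, which is
    # what lets a framework scale out workers without sharing mutable state.
    return partitions[hash(key) % NUM_PARTITIONS].update(key, amount)


for key, amount in [("alice", 3), ("bob", 5), ("alice", 2)]:
    print(key, handle_event(key, amount))
```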

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers. Your host is Tobias Macey, and today I’m interviewing Sean T. Allen about Wallaroo, a framework for building and operating stateful data applications at scale.

Interview

Introduction How did you get involved in the area of data engineering? What is Wallaroo and how did the project get started? What is the Pony language, and what features does it have that make it well suited for the problem area that you are focusing on? Why did you choose to focus first on Python as the language for interacting with Wallaroo and how is that integration implemented? How is Wallaroo architected internally to allow for distributed state management?

Is the state persistent, or is it only maintained long enough to complete the desired computation? If so, what format do you use for long term storage of the data?

What have been the most challenging aspects of building the Wallaroo platform? Which axes of the CAP theorem have you optimized for? For someone who wants to build an application on top of Wallaroo, what is involved in getting started? Once you have a working application, what resources are necessary for deploying to production and what are the scaling factors?

What are the failure modes that users of Wallaroo need to account for in their application or infrastructure?

What are some situations or problem types for which Wallaroo would be the wrong choice? What are some of the most interesting or unexpected uses of Wallaroo that you have seen? What do you have planned for the future of Wallaroo?

Contact Info

IRC Mailing List Wallaroo Labs Twitter Email Personal Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Wallaroo Labs Storm Applied Apache Storm Risk Analysis Pony Language Erlang Akka Tail Latency High Performance Computing Python Apache Software Foundation Beyond Distributed Transactions: An Apostate’s View Consistent Hashing Jepsen Lineage Driven Fault Injection Chaos Engineering QCon 2016 Talk Codemesh in London: How did I get here? CAP Theorem CRDT Sync Free Project Basho Wallaroo on GitHub Docker Puppet Chef Ansible SaltStack Kafka TCP Dask Data Engineering Episode About Dask Beowulf Cluster Redis Flink Haskell

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Summary

Time series databases have long been the cornerstone of a robust metrics system, but the existing options are often difficult to manage in production. In this episode Jeroen van der Heijden explains his motivation for writing a new database, SiriDB, the challenges that he faced in doing so, and how it works under the hood.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers. Your host is Tobias Macey, and today I’m interviewing Jeroen van der Heijden about SiriDB, a next generation time series database.

Interview

Introduction How did you get involved in the area of data engineering? What is SiriDB and how did the project get started?

What was the inspiration for the name?

What was the landscape of time series databases at the time that you first began work on Siri? How does Siri compare to other time series databases such as InfluxDB, Timescale, KairosDB, etc.? What do you view as the competition for Siri? How is the server architected and how has the design evolved over the time that you have been working on it? Can you describe how the clustering mechanism functions?

Is it possible to create pools with more than two servers?

What are the failure modes for SiriDB and where does it fall on the spectrum for the CAP theorem? In the documentation it mentions needing to specify the retention period for the shards when creating a database. What is the reasoning for that and what happens to the individual metrics as they age beyond that time horizon? One of the common difficulties when using a time series database in an operations context is the need for high cardinality of the metrics. How are metrics identified in Siri and is there any support for tagging? What have been the most challenging aspects of building Siri? In what situations or environments would you advise against using Siri?

Contact Info

joente on Github LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

SiriDB Oversight InfluxDB LevelDB OpenTSDB Timescale DB KairosDB Write Ahead Log Grafana

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Summary

To process your data you need to know what shape it has, which is why schemas are important. When you are processing that data in multiple systems it can be difficult to ensure that they all have an accurate representation of that schema, which is why Confluent has built a schema registry that plugs into Kafka. In this episode Ewen Cheslack-Postava explains what the schema registry is, how it can be used, and how they built it. He also discusses how it can be extended for other deployment targets and use cases, and additional features that are planned for future releases.
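To ground the discussion, the sketch below registers an Avro schema over the Schema Registry's REST interface (POST /subjects/&lt;subject&gt;/versions, as documented by Confluent). The registry URL, subject name, and example schema are placeholders.

```python
# Register an Avro schema with a Schema Registry over HTTP (placeholder values).
import json

import requests

REGISTRY_URL = "http://localhost:8081"  # placeholder registry address
SUBJECT = "user-events-value"           # conventionally <topic>-value

user_event_schema = {
    "type": "record",
    "name": "UserEvent",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "action", "type": "string"},
    ],
}

resp = requests.post(
    f"{REGISTRY_URL}/subjects/{SUBJECT}/versions",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    json={"schema": json.dumps(user_event_schema)},  # schema travels as a JSON string
)
resp.raise_for_status()
print("registered schema id:", resp.json()["id"])
```

Producers and consumers can then reference the returned schema id instead of shipping the full schema with every message, and the registry can enforce compatibility rules when a new version is registered.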

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers. Your host is Tobias Macey, and today I’m interviewing Ewen Cheslack-Postava about the Confluent Schema Registry.

Interview

Introduction How did you get involved in the area of data engineering? What is the schema registry and what was the motivating factor for building it? If you are using Avro, what benefits does the schema registry provide over and above the capabilities of Avro’s built in schemas? How did you settle on Avro as the format to support and what would be involved in expanding that support to other serialization options? Conversely, what would be involved in using a storage backend other than Kafka? What are some of the alternative technologies available for people who aren’t using Kafka in their infrastructure? What are some of the biggest challenges that you faced while designing and building the schema registry? What is the tipping point in terms of system scale or complexity when it makes sense to invest in a shared schema registry and what are the alternatives for smaller organizations? What are some of the features or enhancements that you have in mind for future work?

Contact Info

ewencp on GitHub Website @ewencp on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Kafka Confluent Schema Registry Second Life Eve Online Yes, Virginia, You Really Do Need a Schema Registry JSON-Schema Parquet Avro Thrift Protocol Buffers Zookeeper Kafka Connect

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Summary

We have tools and platforms for collaborating on software projects and linking them together, so wouldn’t it be nice to have the same capabilities for data? The team at data.world are working on building a platform to host and share data sets for public and private use that can be linked together to build a semantic web of information. The CTO, Bryon Jacob, discusses how the company got started, their mission, and how they have built and evolved their technical infrastructure.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers. This is your host Tobias Macey, and today I’m interviewing Bryon Jacob about the technology and purpose that drive data.world.

Interview

Introduction How did you first get involved in the area of data management? What is data.world and what is its mission and how does your status as a B Corporation tie into that? The platform that you have built provides hosting for a large variety of data sizes and types. What does the technical infrastructure consist of and how has that architecture evolved from when you first launched? What are some of the scaling problems that you have had to deal with as the amount and variety of data that you host has increased? What are some of the technical challenges that you have been faced with that are unique to the task of hosting a heterogeneous assortment of data sets that are intended for shared use? How do you deal with issues of privacy or compliance associated with data sets that are submitted to the platform? What are some of the improvements or new capabilities that you are planning to implement as part of the data.world platform? What are the projects or companies that you consider to be your competitors? What are some of the most interesting or unexpected uses of the data.world platform that you are aware of?

Contact Information

@bryonjacob on Twitter bryonjacob on GitHub LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

data.world HomeAway Semantic Web Knowledge Engineering Ontology Open Data RDF CSVW SPARQL DBPedia Triplestore Header Dictionary Triples Apache Jena Tabula Tableau Connector Excel Connector Data For Democracy Jonathan Morgan

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Summary With the wealth of formats for sending and storing data it can be difficult to determine which one to use. In this episode Doug Cutting, creator of Avro, and Julien Le Dem, creator of Parquet, dig into the different classes of serialization formats, what their strengths are, and how to choose one for your workload. They also discuss the role of Arrow as a mechanism for in-memory data sharing and how hardware evolution will influence the state of the art for data formats.
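To make the row-versus-column tradeoff concrete, the short pyarrow sketch below writes a toy table to Parquet and reads back a single column; because Parquet is columnar, the other columns are not touched. The file name and table contents are illustrative.

```python
# Columnar storage in practice with pyarrow (toy data, illustrative file name).
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "user_id": [1, 2, 3],
    "country": ["DK", "US", "DE"],
    "spend": [12.5, 7.0, 99.9],
})

pq.write_table(table, "events.parquet", compression="snappy")

# Column pruning: only the 'spend' column is read from the file.
spend_only = pq.read_table("events.parquet", columns=["spend"])
print(spend_only.to_pydict())
```

A row-oriented format like Avro makes the opposite tradeoff: records are written and read whole, which tends to suit streaming and write-heavy pipelines.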

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers. This is your host Tobias Macey, and today I’m interviewing Julien Le Dem and Doug Cutting about data serialization formats and how to pick the right one for your systems.

Interview

Introduction How did you first get involved in the area of data management? What are the main serialization formats used for data storage and analysis? What are the tradeoffs that are offered by the different formats? How have the different storage and analysis tools influenced the types of storage formats that are available? You’ve each developed a new on-disk data format, Avro and Parquet respectively. What were your motivations for investing that time and effort? Why is it important for data engineers to carefully consider the format in which they transfer their data between systems?

What are the switching costs involved in moving from one format to another after you have started using it in a production system?

What are some of the new or upcoming formats that you are each excited about? How do you anticipate the evolving hardware, patterns, and tools for processing data to influence the types of storage formats that maintain or grow their popularity?

Contact Information

Doug:

cutting on GitHub Blog @cutting on Twitter

Julien

Email @J_ on Twitter Blog julienledem on GitHub

Links

Apache Avro Apache Parquet Apache Arrow Hadoop Apache Pig Xerox Parc Excite Nutch Vertica Dremel White Paper

Twitter Blog on Release of Parquet

CSV XML Hive Impala Presto Spark SQL Brotli ZStandard Apache Drill Trevni Apache Calcite

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Summary

Buzzfeed needs to be able to understand how its users are interacting with the myriad articles, videos, etc. that they are posting. This lets them produce new content that will continue to be well-received. To surface the insights that they need to grow their business they need a robust data infrastructure to reliably capture all of those interactions. Walter Menendez is a data engineer on their infrastructure team and in this episode he describes how they manage data ingestion from a wide array of sources and create an interface for their data scientists to produce valuable conclusions.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers. Your host is Tobias Macey, and today I’m interviewing Walter Menendez about the data engineering platform at Buzzfeed.

Interview

Introduction How did you get involved in the area of data management? How is the data engineering team at Buzzfeed structured and what kinds of projects are you responsible for? What are some of the types of data inputs and outputs that you work with at Buzzfeed? Is the core of your system using a real-time streaming approach or is it primarily batch-oriented and what are the business needs that drive that decision? What does the architecture of your data platform look like and what are some of the most significant areas of technical debt? Which platforms and languages are most widely leveraged in your team and what are some of the outliers? What are some of the most significant challenges that you face, both technically and organizationally? What are some of the dead ends that you have run into or failed projects that you have tried? What has been the most successful project that you have completed and how do you measure that success?

Contact Info

@hackwalter on Twitter walterm on GitHub

Links

Data Literacy MIT Media Lab Tumblr Data Capital Data Infrastructure Google Analytics Datadog Python Numpy SciPy NLTK Go Language NSQ Tornado PySpark AWS EMR Redshift Tracking Pixel Google Cloud Don’t try to be google Stop Hiring DevOps Engineers and Start Growing Them

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

IBM DB2 12 for z/OS Technical Overview

IBM® DB2® 12 for z/OS® delivers key innovations that increase availability, reliability, scalability, and security for your business-critical information. In addition, DB2 12 for z/OS offers performance and functional improvements for both transactional and analytical workloads and makes installation and migration simpler and faster. DB2 12 for z/OS also allows you to develop applications for the cloud and mobile devices by providing self-provisioning, multitenancy, and self-managing capabilities in an agile development environment. DB2 12 for z/OS is also the first version of DB2 built for continuous delivery. This IBM Redbooks® publication introduces the enhancements made available with DB2 12 for z/OS. The contents help database administrators to understand the new functions and performance enhancements, to plan for ways to use the key new capabilities, and to justify the investment in installing or migrating to DB2 12.

Agile Data Warehousing for the Enterprise

Building upon his earlier book that detailed agile data warehousing programming techniques for the Scrum master, Ralph's latest work illustrates the agile interpretations of the remaining software engineering disciplines: Requirements management benefits from streamlined templates that not only define projects quickly, but ensure nothing essential is overlooked. Data engineering receives two new "hyper modeling" techniques, yielding data warehouses that can be easily adapted when requirements change without having to invest in ruinously expensive data-conversion programs. Quality assurance advances with not only a stereoscopic top-down and bottom-up planning method, but also the incorporation of the latest in automated test engines. Use this step-by-step guide to deepen your own application development skills through self-study, show your teammates the world's fastest and most reliable techniques for creating business intelligence systems, or ensure that the IT department working for you is building your next decision support system the right way. Learn how to quickly define scope and architecture before programming starts Includes techniques of process and data engineering that enable iterative and incremental delivery Demonstrates how to plan and execute quality assurance plans and includes a guide to continuous integration and automated regression testing Presents program management strategies for coordinating multiple agile data mart projects so that over time an enterprise data warehouse emerges Use the provided 120-day road map to establish a robust, agile data warehousing program
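As a small, hedged illustration of the continuous integration and automated regression testing the book advocates for data warehouse work, the pytest-style test below runs a toy transformation against an in-memory fixture and asserts on its output; the table names and logic are invented for the example.

```python
# Illustrative data warehouse regression test (invented tables and fixture).
import sqlite3


def load_daily_revenue(conn: sqlite3.Connection) -> None:
    """Toy transformation: aggregate staged order lines into a daily revenue fact."""
    conn.executescript(
        """
        CREATE TABLE fact_daily_revenue AS
        SELECT order_date, SUM(quantity * unit_price) AS revenue
        FROM stg_order_lines
        GROUP BY order_date;
        """
    )


def test_daily_revenue_totals():
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE stg_order_lines (order_date TEXT, quantity INT, unit_price REAL);
        INSERT INTO stg_order_lines VALUES
            ('2024-01-01', 2, 10.0),
            ('2024-01-01', 1, 5.0),
            ('2024-01-02', 3, 1.0);
        """
    )
    load_daily_revenue(conn)
    rows = dict(conn.execute("SELECT order_date, revenue FROM fact_daily_revenue"))
    assert rows == {"2024-01-01": 25.0, "2024-01-02": 3.0}
```

Run on every commit, tests like this catch regressions in warehouse logic the same way unit tests catch them in application code.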

Pro XAML with C#: Application Development Strategies

Pro XAML with C#: Application Development Strategies is your guide to real-world development practices on Microsoft’s XAML-based platforms, with examples in WPF, Windows 8.1, and Windows Phone 8.1. Learn how to properly plan and architect an application on one or more of these platforms for a robust, scalable solution. In Part I, authors Buddy James and Lori Lalonde introduce you to XAML and reveal proven techniques for developing successful line-of-business applications. You’ll also find out about some of the conflicting needs and interests that you might encounter as an enterprise XAML developer. Part II begins to lay the groundwork to help you properly architect your application, providing you with a deeper understanding of domain-driven design and the Model-View-ViewModel design pattern. You will also learn about proper exception handling and logging techniques, and how to cover your code with unit tests to reduce bugs and validate your design. Part III explores implementation and deployment details for each of Microsoft’s XAML UIs, along with advice on deploying and maintaining your application across different devices using version control repositories and continuous integration. Pro XAML with C# Application Development Strategies is for intermediate to experienced developers looking to improve their professional practice. Readers should have experience working with C# and at least one XAML-based technology (WPF, Silverlight, Windows Store, or Windows Phone).

Agile Analytics: A Value-Driven Approach to Business Intelligence and Data Warehousing

Using Agile methods, you can bring far greater innovation, value, and quality to any data warehouse, business intelligence, or analytics project. However, conventional Agile methodologies must be carefully adapted to address the unique characteristics of DW/BI projects. In Agile Analytics, Agile pioneer Ken Collier shows how to do just that. Collier introduces platform-agnostic Agile solutions for integrating infrastructures consisting of diverse operational, legacy, and specialty systems that mix commercial and custom code. Using working examples, he shows how to manage analytics development teams with widely diverse skill sets; support enormous and fast-growing data volumes; and more. Collier's techniques offer equal value whether your projects involve "back-end" data management, "front-end" business analysis, or both. Part I focuses on Agile project management techniques and delivery team coordination, introducing core practices that shape the way your agile DW/BI project community works together towards success. Part II presents technical methods for enabling continuous delivery of business value at production-quality levels, including evolving superior designs; test-driven DW development; version control; and project automation. Collier brings together proven solutions you can apply right now--whether you're an IT decision-maker, data warehouse professional, DBA, business intelligence specialist, or database developer. With his help, you can mitigate project risk, improve business alignment, achieve better results--and have fun along the way.

Computational Intelligence and Pattern Analysis in Biological Informatics

An invaluable tool in Bioinformatics, this unique volume provides both theoretical and experimental results, and describes basic principles of computational intelligence and pattern analysis while deepening the reader's understanding of the ways in which these principles can be used for analyzing biological data in an efficient manner. This book synthesizes current research in the integration of computational intelligence and pattern analysis techniques, either individually or in a hybridized manner. The purpose is to analyze biological data and enable extraction of more meaningful information and insight from it. Biological data for analysis include sequence data, secondary and tertiary structure data, and microarray data. These data types are complex and advanced methods are required, including the use of domain-specific knowledge for reducing search space, dealing with uncertainty, partial truth and imprecision, efficient linear and/or sub-linear scalability, incremental approaches to knowledge discovery, and increased level and intelligence of interactivity with human experts and decision makers. Chapters are authored by leading researchers in CI in biological informatics. Covers highly relevant topics: rational drug design; analysis of microRNAs and their involvement in human diseases. Supplementary material included: program code and relevant data sets corresponding to chapters. Note: The ebook version does not provide access to the companion files.

Build & deploy with Google Cloud Deploy! This hands-on lab equips you to create delivery pipelines, deploy container images to Artifact Registry, and promote applications across GKE environments.

If you register for a Learning Center lab, please ensure that you sign up for a Google Cloud Skills Boost account for both your work domain and personal email address. You will need to authenticate your account as well (be sure to check your spam folder!). This will ensure you can arrive and access your labs quickly onsite. You can follow this link to sign up!