DWH

Self Service Business Intelligence And Data Sharing Using Looker with Daniel Mintz - Episode 55

2018-11-05 · Data Engineering Podcast

podcast_episode

by Daniel Mintz (Looker) , Tobias Macey

AI/ML Airflow API Athena BI BigQuery Data Engineering Data Management DevOps ETL/ELT Hadoop Hive +10 more

Summary

Business intelligence is a necessity for any organization that wants to be able to make informed decisions based on the data that they collect. Unfortunately, it is common for different portions of the business to build their reports with different assumptions, leading to conflicting views and poor choices. Looker is a modern tool for building and sharing reports that makes it easy to get everyone on the same page. In this episode Daniel Mintz explains how the product is architected, the features that make it easy for any business user to access and explore their reports, and how you can use it for your organization today.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat. Your host is Tobias Macey and today I’m interviewing Daniel Mintz about Looker, a modern data platform that can serve the data needs of an entire company.

Interview

Introduction How did you get involved in the area of data management? Can you start by describing what Looker is and the problem that it is aiming to solve?

How do you define business intelligence?

How is Looker unique from other approaches to business intelligence in the enterprise?

How does it compare to open source platforms for BI?

Can you describe the technical infrastructure that supports Looker? Given that you are connecting to the customer’s data store, how do you ensure sufficient security? For someone who is using Looker, what does their workflow look like?

How does that change for different user roles (e.g. data engineer vs sales management)?

What are the scaling factors for Looker, both in terms of volume of data for reporting from, and for user concurrency? What are the most challenging aspects of building a business intelligence tool and company in the modern data ecosystem?

What are the portions of the Looker architecture that you would do differently if you were to start over today?

What are some of the most interesting or unusual uses of Looker that you have seen? What is in store for the future of Looker?

Contact Info

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Looker Upworthy MoveOn.org LookML SQL Business Intelligence Data Warehouse Linux Hadoop BigQuery Snowflake Redshift DB2 PostGres ETL (Extract, Transform, Load) ELT (Extract, Load, Transform) Airflow Luigi NiFi Data Curation Episode Presto Hive Athena DRY (Don’t Repeat Yourself) Looker Action Hub Salesforce Marketo Twilio Netscape Navigator Dynamic Pricing Survival Analysis DevOps BigQuery ML Snowflake Data Sharehouse

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Combining Transactional And Analytical Workloads On MemSQL with Nikita Shamgunov

2018-10-09 · Data Engineering Podcast

podcast_episode

by Nikita Shamgunov (Neon) , Tobias Macey

AI/ML API BI Cloud Computing Data Engineering Data Management Data Science SQL Tableau

Summary

One of the most complex aspects of managing data for analytical workloads is moving it from a transactional database into the data warehouse. What if you didn’t have to do that at all? MemSQL is a distributed database built to support concurrent use by transactional, application oriented, and analytical, high volume, workloads on the same hardware. In this episode the CEO of MemSQL describes how the company and database got started, how it is architected for scale and speed, and how it is being used in production. This was a deep dive on how to build a successful company around a powerful platform, and how that platform simplifies operations for enterprise grade data management.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science. And the team at Metis Machine has shipped a proof-of-concept integration between the Skafos machine learning platform and the Tableau business intelligence tool, meaning that your BI team can now run the machine learning models custom built by your data science team. If you think that sounds awesome (and it is) then join the free webinar with Metis Machine on October 11th at 2 PM ET (11 AM PT). Metis Machine will walk through the architecture of the extension, demonstrate its capabilities in real time, and illustrate the use case for empowering your BI team to modify and run machine learning models directly from Tableau. Go to metismachine.com/webinars now to register. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat. Your host is Tobias Macey and today I’m interviewing Nikita Shamgunov about MemSQL, a newSQL database built for simultaneous transactional and analytic workloads.

Interview

Introduction How did you get involved in the area of data management? Can you start by describing what MemSQL is and how the product and business first got started? What are the typical use cases for customers running MemSQL? What are the benefits of integrating the ingestion pipeline with the database engine? What are some typical ways that the ingest capability is leveraged by customers? How is MemSQL architected and how has the internal design evolved from when you first started working on it? Where does it fall on the axes of the CAP theorem? How much processing overhead is involved in the conversion from the column oriented data stored on disk to the row oriented data stored in memory? Can you describe the lifecycle of a write transaction? Can you discuss the techniques that are used in MemSQL to optimize for speed and overall system performance? How do you mitigate the impact of network latency throughout the cluster during query planning and execution? How much of the implementation of MemSQL is using custom built code vs. open source projects? What are some of the common difficulties that your customers encounter when building on top of or migrating to MemSQL? What have been some of the most challenging aspects of building and growing the technical and business implementation of MemSQL? When is MemSQL the wrong choice for a data platform? What do you have planned for the future of MemSQL?

Contact Info

@nikitashamgunov on Twitter LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

MemSQL NewSQL Microsoft SQL Server St. Petersburg University of Fine Mechanics And Optics C C++ In-Memory Database RAM (Random Access Memory) Flash Storage Oracle DB PostgreSQL Podcast Episode Kafka Kinesis Wealth Management Data Warehouse ODBC S3 HDFS Avro Parquet Data Serialization Podcast Episode Broadcast Join Shuffle Join CAP Theorem Apache Arrow LZ4 S2 Geospatial Library Sybase SAP Hana Kubernetes

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

A Primer On Enterprise Data Curation with Todd Walter - Episode 49

2018-09-24 · Data Engineering Podcast

podcast_episode

by Todd Walter , Tobias Macey

AI/ML API Big Data Cloud Computing Data Engineering Data Lake Data Management Data Science ETL/ELT

Summary

As your data needs scale across an organization the need for a carefully considered approach to collection, storage, organization, and access becomes increasingly critical. In this episode Todd Walter shares his considerable experience in data curation to clarify the many aspects that are necessary for a successful platform for your business. Using the metaphor of a museum curator carefully managing the precious resources on display and in the vaults, he discusses the various layers of an enterprise data strategy. This includes modeling the lifecycle of your information as a pipeline from the raw, messy, loosely structured records in your data lake, through a series of transformations and ultimately to your data warehouse. He also explains which layers are useful for the different members of the business, and which pitfalls to look out for along the path to a mature and flexible data platform.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat. Your host is Tobias Macey and today I’m interviewing Todd Walter about data curation and how to architect your data systems to support high quality, maintainable intelligence.

Interview

Introduction How did you get involved in the area of data management? How do you define data curation?

What are some of the high level concerns that are encapsulated in that effort?

How does the size and maturity of a company affect the ways that they architect and interact with their data systems? Can you walk through the stages of an ideal lifecycle for data within the context of an organization’s uses for it? What are some of the common mistakes that are made when designing a data architecture and how do they lead to failure? What has changed in terms of complexity and scope for data architecture and curation since you first started working in this space? As “big data” became more widely discussed the common mantra was to store everything because you never know when you’ll need the data that might get thrown away. As the industry is reaching a greater degree of maturity and more regulations are implemented there has been a shift to being more considerate as to what information gets stored and for how long. What are your views on that evolution and what is your litmus test for determining which data to keep? In terms of infrastructure, what are the components of a modern data architecture and how has that changed over the years?

What is your opinion on the relative merits of a data warehouse vs a data lake and are they mutually exclusive?

Once an architecture has been established, how do you allow for continued evolution to prevent stagnation and eventual failure? ETL has long been the default approac

Take Control Of Your Web Analytics Using Snowplow With Alexander Dean - Episode 48

2018-09-17 · Data Engineering Podcast

podcast_episode

by Alexander Dean (Snowplow Analytics) , Tobias Macey

AI/ML Analytics API AWS Amazon EMR Kinesis BI Cloud Computing CRM Data Collection Data Engineering Data Management +13 more

Summary

Every business with a website needs some way to keep track of how much traffic they are getting, where it is coming from, and which actions are being taken. The default in most cases is Google Analytics, but this can be limiting when you wish to perform detailed analysis of the captured data. To address this problem, Alex Dean co-founded Snowplow Analytics to build an open source platform that gives you total control of your website traffic data. In this episode he explains how the project and company got started, how the platform is architected, and how you can start using it today to get a clearer view of how your customers are interacting with your web and mobile applications.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat. This is your host Tobias Macey and today I’m interviewing Alexander Dean about Snowplow Analytics.

Interview

Introductions How did you get involved in the area of data engineering and data management? What is Snowplow Analytics and what problem were you trying to solve when you started the company? What is unique about customer event data from an ingestion and processing perspective? Challenges with properly matching up data between sources Data collection is one of the more difficult aspects of an analytics pipeline because of the potential for inconsistency or incorrect information. How is the collection portion of the Snowplow stack designed and how do you validate the correctness of the data?

Cleanliness/accuracy

What kinds of metrics should be tracked in an ingestion pipeline and how do you monitor them to ensure that everything is operating properly? Can you describe the overall architecture of the ingest pipeline that Snowplow provides?

How has that architecture evolved from when you first started? What would you do differently if you were to start over today?

Ensuring appropriate use of enrichment sources What have been some of the biggest challenges encountered while building and evolving Snowplow? What are some of the most interesting uses of your platform that you are aware of?

Keep In Touch

Alex

@alexcrdean on Twitter LinkedIn

Snowplow

@snowplowdata on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Snowplow

GitHub

Deloitte Consulting OpenX Hadoop AWS EMR (Elastic Map-Reduce) Business Intelligence Data Warehousing Google Analytics CRM (Customer Relationship Management) S3 GDPR (General Data Protection Regulation) Kinesis Kafka Google Cloud Pub-Sub JSON-Schema Iglu IAB Bots And Spiders List Heap Analytics

Podcast Interview

Redshift SnowflakeDB Snowplow Insights Googl

Mobile Data Collection And Analysis Using Ona And Canopy With Peter Lubell-Doughtie - Episode 41

2018-07-30 · Data Engineering Podcast

podcast_episode

by Peter Lubell-Doughtie (Ona) , Tobias Macey

Ansible API Chef Data Collection Data Engineering Data Management DataOps Docker Druid GitHub Kafka Superset +2 more

Summary

With the attention being paid to the systems that power large volumes of high velocity data it is easy to forget about the value of data collection at human scales. Ona is a company that is building technologies to support mobile data collection, analysis of the aggregated information, and user-friendly presentations. In this episode CTO Peter Lubell-Doughtie describes the architecture of the platform, the types of environments and use cases where it is being employed, and the value of small data.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat. Your host is Tobias Macey and today I’m interviewing Peter Lubell-Doughtie about using Ona for collecting data and processing it with Canopy.

Interview

Introduction How did you get involved in the area of data management? What is Ona and how did the company get started?

What are some examples of the types of customers that you work with?

What types of data do you support in your collection platform? What are some of the mechanisms that you use to ensure the accuracy of the data that is being collected by users? Does your mobile collection platform allow for anyone to submit data without having to be associated with a given account or organization? What are some of the integration challenges that are unique to the types of data that get collected by mobile field workers? Can you describe the flow of the data from collection through to analysis? To help improve the utility of the data being collected you have started building Canopy. What was the tipping point where it became worth the time and effort to start that project?

What are the architectural considerations that you factored in when designing it? What have you found to be the most challenging or unexpected aspects of building an enterprise data warehouse for general users?

What are your plans for the future of Ona and Canopy?

Contact Info

Email pld on Github Website

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

OpenSRP Ona Canopy Open Data Kit Earth Institute at Columbia University Sustainable Engineering Lab WHO Bill and Melinda Gates Foundation XLSForms PostGIS Kafka Druid Superset Postgres Ansible Docker Terraform

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

PrestoDB and Starburst Data with Kamil Bajda-Pawlikowski - Episode 32

2018-05-21 · Data Engineering Podcast

podcast_episode

by Kamil Bajda-Pawlikowski (Starburst Data) , Tobias Macey

Analytics API Cassandra Data Engineering Data Management Hadoop Hive Kafka Presto Redis SQL Teradata +1 more

Summary

Most businesses end up with data in a myriad of places with varying levels of structure. This makes it difficult to gain insights from across departments, projects, or people. Presto is a distributed SQL engine that allows you to tie all of your information together without having to first aggregate it all into a data warehouse. Kamil Bajda-Pawlikowski co-founded Starburst Data to provide support and tooling for Presto, as well as contributing advanced features back to the project. In this episode he describes how Presto is architected, how you can use it for your analytics, and the work that he is doing at Starburst Data.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Kamil Bajda-Pawlikowski about Presto and his experiences with supporting it at Starburst Data.

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what Presto is?

What are some of the common use cases and deployment patterns for Presto?

How does Presto compare to Drill or Impala? What is it about Presto that led you to building a business around it? What are some of the most challenging aspects of running and scaling Presto? For someone who is using the Presto SQL interface, what are some of the considerations that they should keep in mind to avoid writing poorly performing queries?

How does Presto represent data for translating between its SQL dialect and the API of the data stores that it interfaces with?

What are some cases in which Presto is not the right solution? What types of support have you found to be the most commonly requested? What are some of the types of tooling or improvements that you have made to Presto in your distribution?

What are some of the notable changes that your team has contributed upstream to Presto?

Contact Info

Website E-mail Twitter – @starburstdata Twitter – @prestodb

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Starburst Data Presto Hadapt Hadoop Hive Teradata PrestoCare Cost Based Optimizer ANSI SQL Spill To Disk Tempto Benchto Geospatial Functions Cassandra Accumulo Kafka Redis PostGreSQL

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Database Refactoring Patterns with Pramod Sadalage - Episode 22

2018-03-12 · Data Engineering Podcast

podcast_episode

by Pramod Sadalage , Tobias Macey

Agile/Scrum CI/CD Data Engineering Data Management DevOps Docker GitHub Java Linux MongoDB Neo4j NoSQL +1 more

Summary

As software lifecycles move faster, the database needs to be able to keep up. Practices such as version controlled migration scripts and iterative schema evolution provide the necessary mechanisms to ensure that your data layer is as agile as your application. Pramod Sadalage saw the need for these capabilities during the early days of the introduction of modern development practices and co-authored a book to codify a large number of patterns to aid practitioners, and in this episode he reflects on the current state of affairs and how things have changed over the past 12 years.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers. Your host is Tobias Macey and today I’m interviewing Pramod Sadalage about refactoring databases and integrating database design into an iterative development workflow.

Interview

Introduction How did you get involved in the area of data management? You first co-authored Refactoring Databases in 2006. What was the state of software and database system development at the time and why did you find it necessary to write a book on this subject? What are the characteristics of a database that make them more difficult to manage in an iterative context? How does the practice of refactoring in the context of a database compare to that of software? How has the prevalence of data abstractions such as ORMs or ODMs impacted the practice of schema design and evolution? Is there a difference in strategy when refactoring the data layer of a system when using a non-relational storage system? How has the DevOps movement and the increased focus on automation affected the state of the art in database versioning and evolution? What have you found to be the most problematic aspects of databases when trying to evolve the functionality of a system? Looking back over the past 12 years, what has changed in the areas of database design and evolution?

How has the landscape of tooling for managing and applying database versioning changed since you first wrote Refactoring Databases? What do you see as the biggest challenges facing us over the next few years?

Contact Info

Website pramodsadalage on GitHub @pramodsadalage on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Database Refactoring

Website Book

Thoughtworks Martin Fowler Agile Software Development XP (Extreme Programming) Continuous Integration

The Book Wikipedia

Test First Development DDL (Data Definition Language) DML (Data Modification Language) DevOps Flyway Liquibase DBMaintain Hibernate SQLAlchemy ORM (Object Relational Mapper) ODM (Object Document Mapper) NoSQL Document Database MongoDB OrientDB CouchBase CassandraDB Neo4j ArangoDB Unit Testing Integration Testing OLAP (On-Line Analytical Processing) OLTP (On-Line Transaction Processing) Data Warehouse Docker QA==Quality Assurance HIPAA (Health Insurance Portability and Accountability Act) PCI DSS (Payment Card Industry Data Security Standard) Polyglot Persistence Toplink Java ORM Ruby on Rails ActiveRecord Gem

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Dat: Distributed Versioned Data Sharing with Danielle Robinson and Joe Hand - Episode 16

2018-01-29 · Data Engineering Podcast

podcast_episode

by Danielle Robinson (Dat Project) , Joe Hand (Dat Project) , Tobias Macey

AI/ML CI/CD Data Engineering Data Management Data Science Git Linux Rust

Summary

Sharing data across multiple computers, particularly when it is large and changing, is a difficult problem to solve. In order to provide a simpler way to distribute and version data sets among collaborators the Dat Project was created. In this episode Danielle Robinson and Joe Hand explain how the project got started, how it functions, and some of the many ways that it can be used. They also explain the plans that the team has for upcoming features and uses that you can watch out for in future releases.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers. A few announcements:

There is still time to register for the O’Reilly Strata Conference in San Jose, CA March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20%. The O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%. If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.

Your host is Tobias Macey and today I’m interviewing Danielle Robinson and Joe Hand about Dat Project, a distributed data sharing protocol for building applications of the future

Interview

Introduction How did you get involved in the area of data management? What is the Dat project and how did it get started? How have the grants to the Dat project influenced the focus and pace of development that was possible?

Now that you have established a non-profit organization around Dat, what are your plans to support future sustainability and growth of the project?

Can you explain how the Dat protocol is designed and how it has evolved since it was first started? How does Dat manage conflict resolution and data versioning when replicating between multiple machines? One of the primary use cases that is mentioned in the documentation and website for Dat is that of hosting and distributing open data sets, with a focus on researchers. How does Dat help with that effort and what improvements does it offer over other existing solutions? One of the difficult aspects of building a peer-to-peer protocol is that of establishing a critical mass of users to add value to the network. How have you approached that effort and how much progress do you feel that you have made? How does the peer-to-peer nature of the platform affect the architectural patterns for people wanting to build applications that are delivered via dat, vs the common three-tier architecture oriented around persistent databases? What mechanisms are available for content discovery, given the fact that Dat URLs are private and unguessable by default? For someone who wants to start using Dat today, what is involved in creating and/or consuming content that is available on the network? What have been the most challenging aspects of building and promoting Dat? What are some of the most interesting or inspiring uses of the Dat protocol that you are aware of?

Contact Info

Dat

datproject.org Email @dat_project on Twitter Dat Chat

Danielle

Email @daniellecrobins

Joe

Email @joeahand on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Dat Project Code For Science and Society Neuroscience Cell Biology OpenCon Mozilla Science Open Education Open Access Open Data Fortune 500 Data Warehouse Knight Foundation Alfred P. Sloan Foundation Gordon and Betty Moore Foundation Dat In The Lab Dat in the Lab blog posts California Digital Library IPFS Dat on Open Collective – COMING SOON! ScienceFair Stencila eLIFE Git BitTorrent Dat Whitepaper Merkle Tree Certificate Transparency Dat Protocol Working Group Dat Multiwriter Development – Hyperdb Beaker Browser WebRTC IndexedDB Rust C Keybase PGP Wire Zenodo Dryad Data Sharing Dataverse RSync FTP Globus Fritter Fritter Demo Rotonde how to Joe’s website on Dat Dat Tutorial Data Rescue – NYTimes Coverage Data.gov Libraries+ Network UC Conservation Genomics Consortium Fair Data principles hypervision hypervision in browser

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

