talk-data.com

Topic

Data Science

machine_learning statistics analytics

1516 activities tagged · Newest first

Activity Trend: peak of 68 per quarter (2020-Q1 to 2026-Q1)

Hello, Data Hacker! Welcome to another episode of our community's Data Science and Machine Learning podcast!

In this episode, Paulo Vasconcellos, Allan Sene, and Gabriel Lages talk about the daily challenges of a data scientist and share tips for those who are just starting out. To help with the conversation, we invited two Data Hackers: Jones Madruga, Data Scientist at Stoodi, and Luis Otávio Martins, Data Scientist at Concepta Inc.

Participants:

  • Jones Madruga
  • Luis Otávio Martins
  • Paulo Vasconcellos
  • Allan Sene
  • Gabriel Lages

Episode links: https://medium.com/data-hackers/como-%C3%A9-o-dia-a-dia-de-um-cientista-de-dados-podcast-data-hackers-fa478eb8f009

Not part of the Data Hackers community yet? Sign up for our newsletter and join our Slack to keep up with the latest news: www.datahackers.com.br

Summary

One of the most complex aspects of managing data for analytical workloads is moving it from a transactional database into the data warehouse. What if you didn’t have to do that at all? MemSQL is a distributed database built to support concurrent use by transactional, application oriented, and analytical, high volume, workloads on the same hardware. In this episode the CEO of MemSQL describes how the company and database got started, how it is architected for scale and speed, and how it is being used in production. This was a deep dive on how to build a successful company around a powerful platform, and how that platform simplifies operations for enterprise grade data management.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science. And the team at Metis Machine has shipped a proof-of-concept integration between the Skafos machine learning platform and the Tableau business intelligence tool, meaning that your BI team can now run the machine learning models custom built by your data science team. If you think that sounds awesome (and it is) then join the free webinar with Metis Machine on October 11th at 2 PM ET (11 AM PT). Metis Machine will walk through the architecture of the extension, demonstrate its capabilities in real time, and illustrate the use case for empowering your BI team to modify and run machine learning models directly from Tableau. Go to metismachine.com/webinars now to register. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Nikita Shamgunov about MemSQL, a newSQL database built for simultaneous transactional and analytic workloads

Interview

Introduction How did you get involved in the area of data management? Can you start by describing what MemSQL is and how the product and business first got started? What are the typical use cases for customers running MemSQL? What are the benefits of integrating the ingestion pipeline with the database engine? What are some typical ways that the ingest capability is leveraged by customers? How is MemSQL architected and how has the internal design evolved from when you first started working on it? Where does it fall on the axes of the CAP theorem? How much processing overhead is involved in the conversion from the column oriented data stored on disk to the row oriented data stored in memory? Can you describe the lifecycle of a write transaction? Can you discuss the techniques that are used in MemSQL to optimize for speed and overall system performance? How do you mitigate the impact of network latency throughout the cluster during query planning and execution? How much of the implementation of MemSQL is using custom built code vs. open source projects? What are some of the common difficulties that your customers encounter when building on top of or migrating to MemSQL? What have been some of the most challenging aspects of building and growing the technical and business implementation of MemSQL? When is MemSQL the wrong choice for a data platform? What do you have planned for the future of MemSQL?

Contact Info

@nikitashamgunov on Twitter LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

MemSQL, NewSQL, Microsoft SQL Server, St. Petersburg University of Fine Mechanics And Optics, C, C++, In-Memory Database, RAM (Random Access Memory), Flash Storage, Oracle DB, PostgreSQL, Podcast Episode, Kafka, Kinesis, Wealth Management, Data Warehouse, ODBC, S3, HDFS, Avro, Parquet, Data Serialization Podcast Episode, Broadcast Join, Shuffle Join, CAP Theorem, Apache Arrow, LZ4, S2 Geospatial Library, Sybase, SAP Hana, Kubernetes

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Making Data Simple host Al Martin has a chance to discuss all things data with Laura Ellis, also known as Little Miss Data. Laura is an analytics architect for IBM Cloud as well as a frequent blogger. Together, they talk about how critical it is to understand your data in order to create specific calls to action, and what it means to build a data democracy. Show Notes 00:00 - Follow @IBMAnalyticsSupport on Twitter. 00:22 - Check out our YouTube channel. We're posting full episodes weekly. 00:24 - Connect with Al Martin on LinkedIn and Twitter. 01:20 - Check out littlemissdata.com. 01:22 - Connect with Laura Ellis on Twitter, Instagram, and LinkedIn. 02:20 - Curious to know more about analytics architecture? Check out this IBM article on the topic. 03:52 - Check out the Little Miss Data article Al referenced here. 04:45 - Learn more about Data Democracy here in Laura's blog post. 05:31 - Understand more about the importance of data for your business in this article. 09:11 - Find out more about the challenges of being a data scientist here. 12:45 - Working with good quality data is crucial. Check out this article for more details. 16:12 - Simple data can provide the most effective returns. Learn more here. 21:15 - Choosing the right, supportive environment for your data science journey will make sure you don't get burnt out. This article examines your options. 21:35 - Data is a fundamental step when working with AI. But do you know the difference between data analytics, AI and machine learning? This Forbes article walks you through it. 22:42 - Need to brush up on what a data dashboard is? Learn more here. Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

In this podcast, Maksim, CDO @ City of San Diego, discusses the nuances of running big data for big cities. He shares his perspective on effectively building a central data office in a complex and extremely collaborative environment like a big city, and his thoughts on how to prioritize which projects to pursue. He also explains how leadership and execution can blend to solve civic issues in big and small cities. A great practitioner podcast for folks seeking to build a robust data science practice across a large and collaborative ecosystem.

Timeline: 0:28 Maksim's journey. 6:45 Maksim's current role. 11:46 Collaboration process in creating a data inventory. 14:52 Working with the bureaucracy. 18:35 Dealing with unforeseen circumstances at work. 20:22 Prioritization at work. 22:58 Qualities of a good data leader. 26:15 Collaboration with other cities. 27:40 Cool data projects in other cities. 30:55 Shortcomings of other city representatives. 36:54 Use cases in AI 39:00 What would Maksim change about himself? 40:50 Future cities and data 43:55 Opportunities for private investors in the public sector. 45:53 Maksim's success mantra. 50:19 Closing remark.

Maksim's Book Recommendation: The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win by Gene Kim, Kevin Behr, George Spafford amzn.to/2MAu5Xv

Podcast Link: https://futureofdata.org/understanding-bigdata-for-bigcities-with-maksim-mrmaksimize-cityofsandiego-futureofdata-podcast/

Maksim's BIO: Maksim Pecherskiy: As the CDO for the City of San Diego, working in the Performance & Analytics Department, Maksim strives to bring the necessary components together to allow the City's residents to benefit from a more efficient, agile government that is as innovative as the community around it. He has been solving complex problems with technology for nearly a decade. He spent 2014 working as a Code For America fellow in Puerto Rico, focusing on economic development. His team delivered a product called PrimerPeso that provides business owners and residents a tool to search, and apply for, government programs for which they may be eligible.

Before moving to California, Maksim was a Solutions Architect at Promet Source in Chicago, where he built large web applications and designed complex integrations. He shaped workflow, configuration management, and continuous integration processes while leading and training international development teams. Before his work at Promet, he was a software engineer at AllPlayers, where he was instrumental in the design and architecture of its APIs and the development and documentation of supporting client libraries in various languages.

Maksim graduated from DePaul University with a bachelor of science degree in information systems and from Linköping University, Sweden, with a bachelor of science degree in international business. He is also certified as a Lean Six Sigma Green Belt.

About #Podcast:

The FutureOfData podcast is a conversation starter that brings leaders, influencers, and lead practitioners on the show to discuss their journey in creating the data-driven future.

Wanna Join? If you or anyone you know wants to join in, register your interest by emailing us @ [email protected]

Want to sponsor? Email us @ [email protected]

Keywords: FutureOfData, DataAnalytics, Leadership, Futurist, Podcast, BigData, Strategy

Summary

There are countless sources of data that are publicly available for use. Unfortunately, combining those sources and making them useful in aggregate is a time consuming and challenging process. The team at Enigma builds a knowledge graph for use in your own data projects. In this episode Chris Groskopf explains the platform they have built to consume large varieties and volumes of public data for constructing a graph for serving to their customers. He discusses the challenges they are facing to scale the platform and engineering processes, as well as the workflow that they have established to enable testing of their ETL jobs. This is a great episode to listen to for ideas on how to organize a data engineering organization.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Chris Groskopf about Enigma and how they are using public data sources to build a knowledge graph

Interview

Introduction How did you get involved in the area of data management? Can you give a brief overview of what Enigma has built and what the motivation was for starting the company?

How do you define the concept of a knowledge graph?

What are the processes involved in constructing a knowledge graph? Can you describe the overall architecture of your data platform and the systems that you use for storing and serving your knowledge graph? What are the most challenging or unexpected aspects of building the knowledge graph that you have encountered?

How do you manage the software lifecycle for your ETL code? What kinds of unit, integration, or acceptance tests do you run to ensure that you don’t introduce regressions in your processing logic?

What are the current challenges that you are facing in building and scaling your data infrastructure?

How does the fact that your data sources are primarily public influence your pipeline design and what challenges does it pose? What techniques are you using to manage accuracy and consistency in the data that you ingest?

Can you walk through the lifecycle of the data that you process from acquisition through to delivery to your customers? What are the weak spots in your platform that you are planning to address in upcoming projects?

If you were to start from scratch today, what would you have done differently?

What are some of the most interesting or unexpected uses of your product that you have seen? What is in store for the future of Enigma?

Contact Info

Email Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Enigma Chicago Tribune NPR Quartz CSVKit Aga

Python Data Science Essentials - Third Edition

Learn the essentials of data science with Python through this comprehensive guide. By the end of this book, you'll have an in-depth understanding of core data science workflows, tools, and techniques.

What this Book will help me do

  • Understand and apply data manipulation techniques with pandas and NumPy.
  • Build and optimize machine learning models with scikit-learn.
  • Analyze and visualize complex datasets for derived insights.
  • Implement exploratory data analysis to uncover trends in data.
  • Leverage advanced techniques like graph analysis and deep learning for sophisticated projects.

Author(s)

Alberto Boschetti and Luca Massaron combine their extensive expertise in data science and Python programming to guide readers effectively. With hands-on knowledge and a passion for teaching, they provide practical insights across the data science lifecycle.

Who is it for?

This book is ideal for aspiring data scientists, data analysts, and software developers aiming to enhance their data analysis skills. Suited for beginners familiar with Python and basic statistics, this guide bridges the gap to real-world applications. Advance your career by unlocking crucial data science expertise.

In this podcast, Jim Sterne shares how marketing has evolved through disruptive times. He shares some of the best practices in the marketing and digital analytics space. He sheds light on some opportunities in the marketing and analytics space and how machine learning is changing the face of digital and marketing. This is a great podcast for anyone looking to understand how AI is impacting marketing and what are some big opportunities in marketing and digital.

Timeline: 0:30 Jim's journey. 5:25 The evolution of marketing. 8:45 Breaking down the digital. 11:40 Marketing and analytics. 13:27 Misuse of analytics in marketing. 17:35 Resolving bad data and bias. 22:20 Good digital analyst vs. bad digital analyst. 28:06 Defining a well-oiled marketing machine. 30:33 Marketing industry's adoption of technology. 34:19 Technology adoption strategy. 38:23 Impact of machine learning and digital marketing. 42:19 Decision making, accountability, and AI. 47:08 Advice for start-ups. 48:52 Disruption opportunities in digital marketing. 55:57 Ethics and marketing. 58:52 What's next in digital marketing. 1:02:27 Jim's success mantra. 1:05:36 Jim's reading list. 1:07:30 Key takeaways.

Jim's Books: amzn.to/2KB1QCR

Jim's Current Read List: Shift: 19 Practical, Business-Driven Ideas for an Executive in Charge of Marketing but Not Trained for the Task by Sean Doyle amzn.to/2KG4K9d Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking by Foster Provost and Tom Fawcett amzn.to/2AWR3Dz

Podcast Link: https://futureofdata.org/future-of-data-in-marketing-digital-jimsterne/

Jim's BIO: Jim Sterne has focused his thirty-five years in sales and marketing on creating and strengthening customer relationships through digital communications. He sold business computers to companies that had never owned one in the 1980s, consulted and keynoted online marketing in the 1990s, and founded a conference and a professional association around digital analytics in the 2000s, following his humorous Devil's Data Dictionary. Sterne has just published his twelfth book Artificial Intelligence for Marketing: Practical Applications. Sterne produced the eMetrics Summit from 2002 - 2017 and now produces the Marketing Evolution Experience. He was co-founder and served for 17 years as the Board Chair of the Digital Analytics Association.

Jim was named one of the 50 most influential people in digital marketing by a top marketing magazine in the United Kingdom and identified as one of the top 25 Hot Speakers by the National Speakers Association.

About #Podcast:

The FutureOfData podcast is a conversation starter that brings leaders, influencers, and lead practitioners on the show to discuss their journey in creating the data-driven future.

Wanna Join? If you or anyone you know wants to join in, register your interest by emailing us @ [email protected]

Want to sponsor? Email us @ [email protected]

Keywords: FutureOfData, DataAnalytics, Leadership, Futurist, Podcast, BigData, Strategy

R Programming Fundamentals

Master the essentials of programming with R and streamline your data analysis workflow with 'R Programming Fundamentals'. This book introduces key R concepts like data structures and control flow, and guides you through practical applications such as data visualization with ggplot2. By the end, you will progress to completing a full data science project for practical hands-on experience.

What this Book will help me do

  • Learn to use R's core features, including package management, data structures, and control flow.
  • Process and clean datasets effectively within R, handling missing values and variable transformation.
  • Master data visualization techniques with ggplot2 to create insightful plots and charts.
  • Develop skills to import diverse datasets such as CSVs, Excel spreadsheets, and SQL databases into R.
  • Construct a data science project end-to-end, applying skills in analysis, visualization, and reporting.

Author(s)

Kaelen Medeiros is a dedicated teacher with a passion for making complex concepts accessible. Bringing years of experience in data science and statistical computing, Kaelen excels at helping learners understand and leverage R for their data analysis needs. With a focus on practical learning, Kaelen has designed this book to give you the hands-on experience and foundational knowledge you need.

Who is it for?

This book is perfect for analysts looking to enhance their data science toolkit by learning R. It's especially suited for those with little R programming experience looking to start with foundational concepts. Whether you're an aspiring data scientist or a seasoned professional seeking a refresher, this book offers a structured approach to mastering R effectively.

Summary As your data needs scale across an organization the need for a carefully considered approach to collection, storage, organization, and access becomes increasingly critical. In this episode Todd Walter shares his considerable experience in data curation to clarify the many aspects that are necessary for a successful platform for your business. Using the metaphor of a museum curator carefully managing the precious resources on display and in the vaults, he discusses the various layers of an enterprise data strategy. This includes modeling the lifecycle of your information as a pipeline from the raw, messy, loosely structured records in your data lake, through a series of transformations and ultimately to your data warehouse. He also explains which layers are useful for the different members of the business, and which pitfalls to look out for along the path to a mature and flexible data platform.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Todd Walter about data curation and how to architect your data systems to support high quality, maintainable intelligence

Interview

Introduction How did you get involved in the area of data management? How do you define data curation?

What are some of the high level concerns that are encapsulated in that effort?

How does the size and maturity of a company affect the ways that they architect and interact with their data systems? Can you walk through the stages of an ideal lifecycle for data within the context of an organization's uses for it? What are some of the common mistakes that are made when designing a data architecture and how do they lead to failure? What has changed in terms of complexity and scope for data architecture and curation since you first started working in this space? As “big data” became more widely discussed the common mantra was to store everything because you never know when you’ll need the data that might get thrown away. As the industry is reaching a greater degree of maturity and more regulations are implemented there has been a shift to being more considerate as to what information gets stored and for how long. What are your views on that evolution and what is your litmus test for determining which data to keep? In terms of infrastructure, what are the components of a modern data architecture and how has that changed over the years?

What is your opinion on the relative merits of a data warehouse vs a data lake and are they mutually exclusive?

Once an architecture has been established, how do you allow for continued evolution to prevent stagnation and eventual failure? ETL has long been the default approac

Summary

Every business with a website needs some way to keep track of how much traffic they are getting, where it is coming from, and which actions are being taken. The default in most cases is Google Analytics, but this can be limiting when you wish to perform detailed analysis of the captured data. To address this problem, Alex Dean co-founded Snowplow Analytics to build an open source platform that gives you total control of your website traffic data. In this episode he explains how the project and company got started, how the platform is architected, and how you can start using it today to get a clearer view of how your customers are interacting with your web and mobile applications.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat This is your host Tobias Macey and today I’m interviewing Alexander Dean about Snowplow Analytics

Interview

Introductions How did you get involved in the area of data engineering and data management? What is Snowplow Analytics and what problem were you trying to solve when you started the company? What is unique about customer event data from an ingestion and processing perspective? Challenges with properly matching up data between sources Data collection is one of the more difficult aspects of an analytics pipeline because of the potential for inconsistency or incorrect information. How is the collection portion of the Snowplow stack designed and how do you validate the correctness of the data?

Cleanliness/accuracy

What kinds of metrics should be tracked in an ingestion pipeline and how do you monitor them to ensure that everything is operating properly? Can you describe the overall architecture of the ingest pipeline that Snowplow provides?

How has that architecture evolved from when you first started? What would you do differently if you were to start over today?

Ensuring appropriate use of enrichment sources What have been some of the biggest challenges encountered while building and evolving Snowplow? What are some of the most interesting uses of your platform that you are aware of?

Keep In Touch

Alex

@alexcrdean on Twitter LinkedIn

Snowplow

@snowplowdata on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Snowplow

GitHub

Deloitte Consulting OpenX Hadoop AWS EMR (Elastic Map-Reduce) Business Intelligence Data Warehousing Google Analytics CRM (Customer Relationship Management) S3 GDPR (General Data Protection Regulation) Kinesis Kafka Google Cloud Pub-Sub JSON-Schema Iglu IAB Bots And Spiders List Heap Analytics

Podcast Interview

Redshift SnowflakeDB Snowplow Insights Googl

Summary

Elasticsearch is a powerful tool for storing and analyzing data, but when using it for logs and other time oriented information it can become problematic to keep all of your history. Chaos Search was started to make it easy for you to keep all of your data and make it usable in S3, so that you can have the best of both worlds. In this episode the CTO, Thomas Hazel, and VP of Product, Pete Cheslock, describe how they have built a platform to let you keep all of your history, save money, and reduce your operational overhead. They also explain some of the types of data that you can use with Chaos Search, how to load it into S3, and when you might want to choose it over Amazon Athena for your serverless data analysis.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Pete Cheslock and Thomas Hazel about Chaos Search and their effort to bring historical depth to your Elasticsearch data

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what you have built at Chaos Search and the problems that you are trying to solve with it?

What types of data are you focused on supporting? What are the challenges inherent to scaling an Elasticsearch infrastructure to large volumes of log or metric data?

Is there any need for an Elasticsearch cluster in addition to Chaos Search? For someone who is using Chaos Search, what mechanisms/formats would they use for loading their data into S3? What are the benefits of implementing the Elasticsearch API on top of your data in S3 as opposed to using systems such as Presto or Drill to interact with the same information via SQL? Given that the S3 API has become a de facto standard for many other object storage platforms, what would be involved in running Chaos Search on data stored outside of AWS? What mechanisms do you use to allow for such drastic space savings of indexed data in S3 versus in an Elasticsearch cluster? What is the system architecture that you have built to allow for querying terabytes of data in S3?

What are the biggest contributors to query latency and what have you done to mitigate them?

What are the options for access control when running queries against the data stored in S3? What are some of the most interesting or unexpected uses of Chaos Search and access to large amounts of historical log information that you have seen? What are your plans for the future of Chaos Search?

Contact Info

Pete Cheslock

@petecheslock on Twitter Website

Thomas Hazel

@thomashazel on Twitter LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tool

In this podcast, Jason Carmel (@defenestrate99), Chief Data Officer @ POSSIBLE, talks about his journey leading the data analytics practice of a digital marketing agency. He sheds light on some methodologies for building a sound data science practice and on using data science chops to do some good while creating traditional value. He shares his perspective on keeping a team high on creativity so that it keeps creating innovative solutions. This is a great podcast for anyone looking to understand the digital marketing landscape and how to create a sound data science practice.

Timelines: 0:29 Jason's journey. 6:40 Advantage of having a legal background for a data scientist. 9:15 Understanding emotions based on data. 13:54 The empathy model. 14:53 From idea to inception to execution. 23:40 The role of digital agencies. 30:20 Measuring the right amount of data. 32:40 Management in a creative agency. 34:40 Leadership qualities that promote creativity. 38:14 Leader's playbook in a digital agency. 40:50 Qualities of a great data science team in the digital agency. 44:30 Leadership's role in data creativity. 47:00 Opportunities as a data scientist in the digital agency. 49:18 Future of data in digital media. 51:38 Jason's success mantra. 53:30 Jason's favorite reads. 57:11 Key takeaways.

Jason's Recommended Read: Trendology: Building an Advantage through Data-Driven Real-Time Marketing by Chris Kerns amzn.to/2zMhYkV Venomous: How Earth's Deadliest Creatures Mastered Biochemistry by Christie Wilcox amzn.to/2LhqI76

Podcast Link: https://futureofdata.org/jason-carmel-defenestrate99-possible-leading-analytics-data-digital-marketing/

Jason's BIO: Jason Carmel is Chief Data Officer at Possible. With nearly 20 years of digital data and marketing experience, Jason has worked with clients such as Coca Cola, Ford, and Microsoft to evolve digital experiences based on real-time feedback and behavioral data. Jason manages a global team of 100 digital analysts across POSSIBLE, a digital advertising agency that uses traditional and unconventional data sets and models to help brands connect more effectively with their customers.

Of particular interest is Jason’s work using data and machine learning to define and understand the emotional components of human conversation. Jason spearheaded the creation of POSSIBLE’s Empathy Model, which translates the raw, unstructured content of social media into a quantitative understanding of what customers are actually feeling about a given topic, event, or brand.

About #Podcast:

FutureOfData podcast is a conversation starter that brings leaders, influencers, and lead practitioners on the show to discuss their journey in creating the data-driven future.

Wanna Join? If you or anyone you know wants to join in, register your interest by mailing us @ [email protected]

Want to sponsor? Email us @ [email protected]

Keywords: FutureOfData, DataAnalytics, Leadership, Futurist, Podcast, BigData, Strategy

Malware Data Science

Security has become a "big data" problem. The growth rate of malware has accelerated to tens of millions of new files per year while our networks generate an ever-larger flood of security-relevant data each day. In order to defend against these advanced attacks, you'll need to know how to think like a data scientist. In Malware Data Science, security data scientist Joshua Saxe introduces machine learning, statistics, social network analysis, and data visualization, and shows you how to apply these methods to malware detection and analysis. You'll learn how to:

• Analyze malware using static analysis
• Observe malware behavior using dynamic analysis
• Identify adversary groups through shared code analysis
• Catch 0-day vulnerabilities by building your own machine learning detector
• Measure malware detector accuracy
• Identify malware campaigns, trends, and relationships through data visualization

Whether you're a malware analyst looking to add skills to your existing arsenal, or a data scientist interested in attack detection and threat intelligence, Malware Data Science will help you stay ahead of the curve.

Summary

With the proliferation of data sources to give a more comprehensive view of the information critical to your business it is even more important to have a canonical view of the entities that you care about. Is customer number 342 in your ERP the same as Bob Smith on Twitter? Using master data management to build a data catalog helps you answer these questions reliably and simplify the process of building your business intelligence reports. In this episode the head of product at Tamr, Mark Marinelli, discusses the challenges of building a master data set, why you should have one, and some of the techniques that modern platforms and systems provide for maintaining it.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat. Your host is Tobias Macey and today I’m interviewing Mark Marinelli about data mastering for modern platforms.

Interview

Introduction How did you get involved in the area of data management? Can you start by establishing a definition of data mastering that we can work from?

How does the master data set get used within the overall analytical and processing systems of an organization?

What is the traditional workflow for creating a master data set?

What has changed in the current landscape of businesses and technology platforms that makes that approach impractical? What are the steps that an organization can take to evolve toward an agile approach to data mastering?

At what scale of company or project does it make sense to start building a master data set? What are the limitations of using ML/AI to merge data sets? What are the limitations of a golden master data set in practice?

Are there particular formats of data or types of entities that pose a greater challenge when creating a canonical format for them? Are there specific problem domains that are more likely to benefit from a master data set?

Once a golden master has been established, how are changes to that information handled in practice? (e.g. versioning of the data) What storage mechanisms are typically used for managing a master data set?

Are there particular security, auditing, or access concerns that engineers should be considering when managing their golden master that go beyond the rest of their data infrastructure? How do you manage latency issues when trying to reference the same entities from multiple disparate systems?

What have you found to be the most common stumbling blocks for a group that is implementing a master data platform?

What suggestions do you have to help prevent such a project from being derailed?

What resources do you recommend for someone looking to learn more about the theoretical and practical aspects of

Data Science with SQL Server Quick Start Guide

"Data Science with SQL Server Quick Start Guide" introduces you to leveraging SQL Server's most recent features for data science projects. You will explore the integration of data science techniques using R, Python, and Transact-SQL within SQL Server's environment.

What this Book will help me do

• Use SQL Server's capabilities for data science projects effectively.
• Understand and preprocess data using SQL queries and statistics.
• Design, train, and evaluate machine learning models in SQL Server.
• Visualize data insights through advanced graphing techniques.
• Deploy and utilize machine learning models within SQL Server environments.

Author(s)

Dejan Sarka is a data science and SQL Server expert with years of industry experience. He specializes in melding database systems with advanced analytics, offering practical guidance through real-world scenarios. His writing provides clear, step-by-step methods, making complex topics accessible.

Who is it for?

This book is tailored for professionals familiar with SQL Server who are looking to delve into data science. It is also ideal for data scientists aiming to incorporate SQL Server into their analytics workflows. The content assumes basic exposure to SQL Server, ensuring a straightforward learning curve for its audience.

In this podcast Mike Tamir (@MikeTamir, Head of #DataScience) talked about building a data science AI team. He shared his AI project (FakerFact.org). He shared the lifecycle of an AI project and some things that leaders could keep in mind to help create a successful data science AI team. This podcast is great for leaders learning to build a strong AI workforce.

TIMELINE: 0:28 Mike's journey. 2:36 Mike's current role. 3:18 AI and businesses. 5:28 Parameters to consider for AI adoption. 9:30 When do businesses invest in ML resources? 13:20 Tips for candidates in vetting data companies. 16:05 What's FakerFact? 20:45 Getting started on an AI product design. 24:58 Achieving accuracy in data. 27:40 AI the newsmaker and AI the fact-checker. 33:56 Tips for hiring the right data leader for a business. 35:32 Creating a great data science team. 37:19 Challenges in forming a data science team. 39:00 In-job training to achieve technological competence. 44:00 Ingredients of a good hire. 47:35 Mike's secret to success. 50:55 Mike's favorite reads. 54:20 Key takeaways.

Mike's Recommended Read: What Technology Wants by Kevin Kelly https://amzn.to/2MaNiuN Deep Learning by Ian Goodfellow and Yoshua Bengio and Aaron Courville http://www.deeplearningbook.org/

Podcast Link: https://futureofdata.org/building-data-science-ai-teams-by-miketamir-uberatg-futureofdata-podcast/

Mike's BIO: Mike serves as Head of Data Science at Uber ATG, UC Berkeley Data Science faculty, and head of Phronesis ML Labs. He has led teams of Data Scientists in the Bay Area as Chief Data Scientist for InterTrust and Takt, Director of Data Sciences for MetaScale/Sears, and CSO for Galvanize, where he founded the galvanizeU-UNH accredited Masters of Science in Data Science degree and oversaw the company's transformation from co-working space to Data Science organization. Mike's most recent passion in research has involved applying Machine Learning techniques to help combat fake news through the FakerFact.org project.

About #Podcast:

FutureOfData podcast is a conversation starter to bring leaders, influencers and lead practitioners to discuss their journey to create the data-driven future.

Wanna Join? If you or anyone you know wants to join in, register your interest @ https://analyticsweek.com/

Want to sponsor? Email us @ [email protected]

Keywords:

FutureOfData #DataAnalytics #Leadership #Podcast #BigData #Strategy

podcast_episode
by Val Kroll, Julie Hoyer, Tim Wilson (Analytics Power Hour - Columbus (OH)), Taylor Udell (Heap), Moe Kiss (Canva), Michael Helbling (Search Discovery)

Business Intelligence. It's a term that's been around for a few decades, but that is every bit as difficult to nail down as "data science," "big data," or a jellyfish. Think too hard about it, and you might actually find yourself struggling to define "analytics!" With the latest generation of BI tools, though, it's a topic that is making the rounds at cocktail parties the world over! (Cocktail parties just aren't what they used to be.) On this episode, the crew snags Taylor Udell from Heap to join in a discussion on the subject, and Moe (unsuccessfully) attempts to end the episode after six minutes. Possibly because neither Tableau nor Superset can definitively prove where avocado toast originated (but Wikipedia backs her up). But we all know Tim can't be shut up that quickly, right?! For complete show notes, including links to items mentioned in this episode and a transcript of the show, visit the show page.

Summary

The theory behind how a tool is supposed to work and the realities of putting it into practice are often at odds with each other. Learning the pitfalls and best practices from someone who has gained that knowledge the hard way can save you from wasted time and frustration. In this episode James Meickle discusses his recent experience building a new installation of Airflow. He points out the strengths, design flaws, and areas of improvement for the framework. He also describes the design patterns and workflows that his team has built to allow them to use Airflow as the basis of their data science platform.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat. Your host is Tobias Macey and today I’m interviewing James Meickle about his experiences building a new Airflow installation.

Interview

Introduction How did you get involved in the area of data management? What was your initial project requirement?

What tooling did you consider in addition to Airflow? What aspects of the Airflow platform led you to choose it as your implementation target?

Can you describe your current deployment architecture?

How many engineers are involved in writing tasks for your Airflow installation?

What resources were the most helpful while learning about Airflow design patterns?

How have you architected your DAGs for deployment and extensibility?

What kinds of tests and automation have you put in place to support the ongoing stability of your deployment? What are some of the dead-ends or other pitfalls that you encountered during the course of this project? What aspects of Airflow have you found to be lacking that you would like to see improved? What did you wish someone had told you before you started work on your Airflow installation?

If you were to start over would you make the same choice? If Airflow wasn’t available what would be your second choice?

What are your next steps for improvements and fixes?

Contact Info

@eronarn on Twitter Website eronarn on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

• Quantopian
• Harvard Brain Science Initiative
• DevOps Days Boston
• Google Maps API
• Cron
• ETL (Extract, Transform, Load)
• Azkaban
• Luigi
• AWS Glue
• Airflow
• Pachyderm (Podcast Interview)
• AirBnB
• Python
• YAML
• Ansible
• REST (Representational State Transfer)
• SAML (Security Assertion Markup Language)
• RBAC (Role-Based Access Control)
• Maxime Beauchemin (Medium Blog)
• Celery
• Dask (Podcast Interview)
• PostgreSQL (Podcast Interview)
• Redis
• Cloudformation
• Jupyter Notebook
• Qubole
• Astronomer (Podcast Interview)
• Gunicorn
• Kubernetes
• Airflow Improvement Proposals
• Python Enhancement Proposals (PEP)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

In this podcast, @BesaBauta from MercyFirst talks about the compliance and privacy challenges faced in a hyper-regulated industry. Drawing on her experience in health informatics, Besa shares some best practices and challenges faced by data science groups in health informatics and similar groups in regulated spaces. This podcast is great for anyone looking to learn about data science compliance and privacy challenges.

TIMELINE: 0:28 Besa's journey. 6:05 Besa's current role. 9:30 Privacy and compliance in health informatics. 14:44 Are the current privacy regulations sufficient? 16:15 Data management in different organizations. 22:37 The negatives for compliance policies on data. 26:28 Hiring a good chief data officer. 30:20 Vetting a company as a CDO. 32:38 Challenges for a startup in the healthcare sector. 36:25 Common challenges for data officers in the healthcare sector. 38:29 Millennials and technology. 40:05 Leadership dealing with compliance policies. 46:26 Requirements for working in health informatics. 49:18 Ingredients of a perfect hire. 50:40 Besa's success mantra. 52:35 How does Besa stay updated? 54:37 Besa's favorite read. 57:04 Key takeaway.

Besa's Recommended Read: The Art Of War by Sun Tzu and Lionel Giles https://amzn.to/2Jx2PYm

Podcast Link: https://futureofdata.org/compliance-and-privacy-in-health-informatics-by-besabauta/

Besa's BIO: Dr. Besa Bauta is the Chief Data Officer and Chief Compliance Officer for MercyFirst, a social service organization providing health and mental health services to children and adolescents in New York City. She oversees the Research, Evaluation, Analytics, and Compliance for Health (REACH) division, including data governance and security measures, analytics, risk mitigation, and policy initiatives. She is also an Adjunct Assistant Professor at NYU and previously worked as a Research Director for a USAID project in Afghanistan and as the Senior Director of Research and Evaluation at the Center for Evidence-Based Implementation and Research (CEBIR). She holds a Ph.D. in implementation science with a focus on health services, an MPH in Global Health, and an MSW. Her research has focused on health systems, mental health, and technology integration to improve population-level outcomes.

About #Podcast:

FutureOfData podcast is a conversation starter to bring leaders, influencers, and lead practitioners to discuss their journey to create the data-driven future.

Want to sponsor? Email us @ [email protected]

Keywords:

FutureOfData #DataAnalytics #Leadership #Podcast #BigData #Strategy

Healthcare Analytics Made Simple

Navigate the fascinating intersection of healthcare and data science with the book "Healthcare Analytics Made Simple." This comprehensive guide empowers you to use Python and machine learning techniques to analyze and improve real healthcare systems. Demystify intricate concepts with Python code and SQL to gain actionable insights and build predictive models for healthcare.

What this Book will help me do

• Understand healthcare incentives, policies, and datasets to ground your analysis in practical knowledge.
• Master the use of Python libraries and SQL for healthcare data analysis and visualization.
• Develop skills to apply machine learning for predictive and descriptive analytics in healthcare.
• Learn to assess quality metrics and evaluate provider performance using robust tools.
• Get acquainted with upcoming trends and future applications in healthcare analytics.

Author(s)

The authors, Kumar and Khader, are experts in data science and healthcare informatics. They bring years of experience teaching, researching, and applying data analytics in healthcare. Their approach is hands-on and clear, aiming to make complex topics accessible and engaging for their audience.

Who is it for?

This book is perfect for data science professionals eager to specialize in healthcare analytics. Additionally, clinicians aiming to leverage computing and data analytics in improving healthcare processes will find valuable insights. Programming enthusiasts and students keen to enter healthcare analytics will also greatly benefit. Tailored for beginners in this field, it is an educational yet robust resource.