talk-data.com

Topic: Data Science (tags: machine_learning, statistics, analytics), 1516 tagged activities

Activity trend: peak of 68 activities per quarter, 2020-Q1 to 2026-Q1

Activities: 1516 activities, newest first

Summary

One of the critical components of modern data infrastructure is a scalable and reliable messaging system. Publish-subscribe systems have been popular for many years, and recently stream-oriented systems such as Kafka have risen to prominence. This week Rajan Dhabalia and Matteo Merli discuss their work on Pulsar, which supports both models while remaining globally scalable and fast. They explain how Pulsar is architected, how to scale it, and how it fits into your existing infrastructure.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site. To help other people find the show you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers. A few announcements:

There is still time to register for the O’Reilly Strata Conference in San Jose, CA, March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20%. The O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York, it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%. If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world, then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data-driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.

Your host is Tobias Macey and today I’m interviewing Rajan Dhabalia and Matteo Merli about Pulsar, a distributed open source pub-sub messaging system

Interview

Introduction
How did you get involved in the area of data management?
Can you start by explaining what Pulsar is and what the original inspiration for the project was?
What have been some of the most challenging aspects of building and promoting Pulsar?
For someone who wants to run Pulsar, what are the infrastructure and network requirements that they should be considering, and what is involved in deploying the various components?
What are the scaling factors for Pulsar, and what aspects of deployment and administration should users pay special attention to?
What projects or services do you consider to be competitors to Pulsar, and what makes it stand out in comparison?
The documentation mentions that there is an API layer that provides drop-in compatibility with Kafka. Does that extend to supporting some of the plugins that have been developed on top of Kafka?
One of the popular aspects of Kafka is the persistence of the message log, so I’m curious how Pulsar manages long-term storage and reprocessing of messages that have already been acknowledged.
When is Pulsar the wrong tool to use?
What are some of the improvements or new features that you have planned for the future of Pulsar?
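Several of these questions turn on Pulsar supporting both consumption models at once: streaming (one consumer reads the whole log in order) and queueing (work shared across consumers). As a rough illustration of those semantics only, and not of Pulsar's actual API or architecture, here is a toy in-memory sketch; all names in it are invented:

```python
class Topic:
    """Minimal in-memory topic contrasting two subscription styles:
    'exclusive' (streaming: a single consumer reads every message) and
    'shared' (queueing: messages round-robined across consumers)."""

    def __init__(self):
        self.log = []            # append-only message log
        self.subscriptions = {}  # name -> subscription state

    def publish(self, msg):
        self.log.append(msg)

    def subscribe(self, name, consumer, mode="exclusive"):
        sub = self.subscriptions.setdefault(
            name, {"mode": mode, "consumers": [], "cursor": 0, "rr": 0})
        if sub["mode"] == "exclusive" and sub["consumers"]:
            raise RuntimeError("exclusive subscription already has a consumer")
        sub["consumers"].append(consumer)

    def dispatch(self):
        # Deliver every message past each subscription's cursor.
        for sub in self.subscriptions.values():
            while sub["cursor"] < len(self.log):
                msg = self.log[sub["cursor"]]
                if sub["mode"] == "exclusive":
                    sub["consumers"][0](msg)   # streaming: single reader
                else:
                    target = sub["consumers"][sub["rr"] % len(sub["consumers"])]
                    target(msg)                # queueing: round-robin
                    sub["rr"] += 1
                sub["cursor"] += 1


if __name__ == "__main__":
    topic = Topic()
    stream, worker_a, worker_b = [], [], []
    topic.subscribe("reader", stream.append, mode="exclusive")
    topic.subscribe("workers", worker_a.append, mode="shared")
    topic.subscribe("workers", worker_b.append, mode="shared")
    for i in range(4):
        topic.publish(i)
    topic.dispatch()
    print(stream)              # [0, 1, 2, 3]: streaming sees everything
    print(worker_a, worker_b)  # [0, 2] [1, 3]: shared splits the work
```

In real Pulsar these correspond roughly to subscription modes on a topic, with durable cursors tracked per subscription; consult the Pulsar documentation for the actual client API.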

Contact Info

Matteo

merlimat on GitHub @merlimat on Twitter

Rajan

@dhabaliaraj on Twitter rhabalia on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Pulsar, Publish-Subscribe, Yahoo, Streamlio, ActiveMQ, Kafka, BookKeeper, SLA (Service Level Agreement), Write-Ahead Log, Ansible, ZooKeeper, Pulsar Deployment

In this podcast, Venu Vasudevan (@ProcterGamble) talks about best practices for creating a research-led, data-driven data science team. He walks through his journey of building a robust and sustainable data science team, speaks about bias in data science, and shares practices that leaders and practitioners can adopt to create an impactful data science team. This podcast is great for future data science leaders and practitioners putting together a data science practice.

Timeline: 0:29 Venu's journey. 11:18 Venu's current role at P&G. 13:11 Standardization of technology and IoT. 17:18 The state of AI. 19:46 Running an AI and data practice for a company. 22:30 Building a data science practice in a startup in comparison to a transnational company. 24:05 Dealing with bias. 27:32 Culture: a block or an opportunity. 30:05 Dealing with data we've never dealt with before. 32:32 Sustainability vs. disruption. 36:17 Starting a data science team. 38:34 Data science as an art of doing and science of doing business. 41:37 Tips to improve storytelling for a data practitioner. 43:30 Challenges in Venu's journey. 44:55 Tenets of a good data scientist. 47:27 Diversity in hiring. 50:50 KPIs to look out for if you are running an AI practice. 51:37 Venu's favorite read.

Venu's Recommended Read: Isaac Newton: The Last Sorcerer - Michael White http://amzn.to/2FzGV0N Against the Gods: The Remarkable Story of Risk - Peter L. Bernstein http://amzn.to/2DRPveU

Podcast Link: https://futureofdata.org/venu-vasudevan-venuv62-proctergamble-on-creating-a-rockstar-data-science-team-futureofdata/

Venu's BIO: Venu Vasudevan is Research Director, Data Science & AI at Procter & Gamble, where he directs the Data Science & AI organization within P&G research. He is a technology leader with a track record of successful consumer and enterprise innovation at the intersection of AI, machine learning, big data, and IoT. Previously he was VP of Data Science at an IoT startup, a founding member of the Motorola team that created the Zigbee IoT standard, worked to create an industry-first zero-click interface for mobile with Dag Kittlaus (co-creator of Apple Siri), created an industry-first Google Glass experience for TV, built an ARRIS video analytics and big data platform recently acquired by Comcast, and developed a social analytics platform leveraging Twitter that was featured in Wired Magazine and on the BBC. Venu holds a Ph.D. (Databases & AI) from Ohio State University and was a member of Motorola's Science Advisory Board (top 2% of Motorola technologists). He is an Adjunct Professor at Rice University's Electrical and Computer Engineering department and was a mentor at Chicago's 1871 startup incubator.

About #Podcast:

FutureOfData podcast is a conversation starter to bring leaders, influencers, and lead practitioners to discuss their journey to create the data-driven future.

Wanna Join? If you or anyone you know wants to join in, register your interest @ http://play.analyticsweek.com/guest/

Want to sponsor? Email us @ [email protected]

Keywords:

#FutureOfData #DataAnalytics #Leadership #Podcast #BigData #Strategy

Regression Analysis with R

Dive into the world of regression analysis with this hands-on guide that covers everything you need to know about building effective regression models in R. You'll learn both the theoretical foundations and how to apply them using practical examples and R code. By the end, you'll be equipped to interpret regression results and use them to make meaningful predictions.

What this book will help me do: Master the fundamentals of regression analysis, from simple linear to logistic regression. Gain expertise in R programming for implementing regression models and analyzing results. Develop skills in handling missing data, feature engineering, and exploratory data analysis. Understand how to identify, prevent, and address overfitting and underfitting in modeling. Apply regression techniques in real-world applications, including classification problems and advanced methods like bagging and boosting.

Author: Giuseppe Ciaburro is an experienced data scientist and author with a passion for making complex technical topics accessible. With expertise in R programming and regression analysis, he has worked extensively in statistical modeling and data exploration. His writing combines clear explanations of theory with hands-on examples, ideal for learners and practitioners alike.

Who is it for? This book is perfect for aspiring data scientists and analysts eager to understand and apply regression analysis using R. It's suited for readers with a foundational knowledge of statistics and basic R programming experience. Whether you're delving into data science or aiming to strengthen existing skills, this book offers practical insights to help you reach your goals.
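The simple linear regression the book opens with reduces to a closed-form least-squares fit. A minimal sketch of that computation, written here in Python rather than the book's R, with an invented toy dataset:

```python
def fit_simple_linear(xs, ys):
    """Ordinary least squares for y = a + b*x, closed form:
    b = cov(x, y) / var(x), a = mean(y) - b * mean(x)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b


xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.0, 8.1, 9.9]   # roughly y = 2x
a, b = fit_simple_linear(xs, ys)
print(round(a, 2), round(b, 2))  # 0.06 1.98
```

In R the equivalent one-liner is `lm(y ~ x)`; the point of the sketch is to show what that call computes.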

Summary

Sharing data across multiple computers, particularly when it is large and changing, is a difficult problem to solve. In order to provide a simpler way to distribute and version data sets among collaborators, the Dat Project was created. In this episode Danielle Robinson and Joe Hand explain how the project got started, how it functions, and some of the many ways that it can be used. They also explain the plans that the team has for upcoming features and uses that you can watch out for in future releases.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production, and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site. To help other people find the show you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers. A few announcements:

There is still time to register for the O’Reilly Strata Conference in San Jose, CA, March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20%. The O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York, it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%. If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world, then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data-driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.

Your host is Tobias Macey and today I’m interviewing Danielle Robinson and Joe Hand about Dat Project, a distributed data sharing protocol for building applications of the future

Interview

Introduction
How did you get involved in the area of data management?
What is the Dat project and how did it get started?
How have the grants to the Dat project influenced the focus and pace of development that was possible?

Now that you have established a non-profit organization around Dat, what are your plans to support future sustainability and growth of the project?

Can you explain how the Dat protocol is designed and how it has evolved since it was first started?
How does Dat manage conflict resolution and data versioning when replicating between multiple machines?
One of the primary use cases mentioned in the documentation and website for Dat is hosting and distributing open data sets, with a focus on researchers. How does Dat help with that effort, and what improvements does it offer over other existing solutions?
One of the difficult aspects of building a peer-to-peer protocol is establishing a critical mass of users to add value to the network. How have you approached that effort, and how much progress do you feel that you have made?
How does the peer-to-peer nature of the platform affect the architectural patterns for people wanting to build applications that are delivered via Dat, vs. the common three-tier architecture oriented around persistent databases?
What mechanisms are available for content discovery, given the fact that Dat URLs are private and unguessable by default?
For someone who wants to start using Dat today, what is involved in creating and/or consuming content that is available on the network?
What have been the most challenging aspects of building and promoting Dat?
What are some of the most interesting or inspiring uses of the Dat protocol that you are aware of?
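Several of these questions touch on how Dat verifies and versions replicated content; its append-only logs are secured with Merkle trees (the whitepaper in the links covers the details). A toy sketch of the core idea, computing a Merkle root over content chunks so that any tampered chunk changes the root; this is an illustration of the general technique, not Dat's actual hashing scheme or wire format:

```python
import hashlib


def h(data: bytes) -> bytes:
    """SHA-256 digest of a byte string."""
    return hashlib.sha256(data).digest()


def merkle_root(chunks):
    """Merkle root of a list of byte chunks: hash each chunk, then
    repeatedly hash adjacent pairs bottom-up; an odd node at the end
    of a level is carried up unchanged."""
    level = [h(c) for c in chunks]
    while len(level) > 1:
        nxt = [h(level[i] + level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # carry the odd node forward
        level = nxt
    return level[0]


chunks = [b"chunk0", b"chunk1", b"chunk2"]
root = merkle_root(chunks)
# Modifying any chunk changes the root, which is how a peer can
# verify data received from an untrusted source:
assert merkle_root([b"chunk0", b"chunk1", b"tampered"]) != root
```

Because only the root needs to be known in advance, a peer can verify chunks as they arrive from anyone on the network, which is what makes untrusted peer-to-peer replication safe.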

Contact Info

Dat

datproject.org Email @dat_project on Twitter Dat Chat

Danielle

Email @daniellecrobins

Joe

Email @joeahand on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Dat Project Code For Science and Society Neuroscience Cell Biology OpenCon Mozilla Science Open Education Open Access Open Data Fortune 500 Data Warehouse Knight Foundation Alfred P. Sloan Foundation Gordon and Betty Moore Foundation Dat In The Lab Dat in the Lab blog posts California Digital Library IPFS Dat on Open Collective – COMING SOON! ScienceFair Stencila eLIFE Git BitTorrent Dat Whitepaper Merkle Tree Certificate Transparency Dat Protocol Working Group Dat Multiwriter Development – Hyperdb Beaker Browser WebRTC IndexedDB Rust C Keybase PGP Wire Zenodo Dryad Data Sharing Dataverse RSync FTP Globus Fritter Fritter Demo Rotonde how to Joe’s website on Dat Dat Tutorial Data Rescue – NYTimes Coverage Data.gov Libraries+ Network UC Conservation Genomics Consortium Fair Data principles hypervision hypervision in browser

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


In this podcast, Henry Eckerson and Stephen Smith discuss the movement to operationalize data science.

Smith is a well-respected expert in the fields of data science, predictive analytics and their application in the education, pharmaceutical, healthcare, telecom and finance industries. He co-founded and served as CEO of G7 Research LLC and the Optas Corporation which provided the leading CRM / Marketing Automation solution in the pharmaceutical and healthcare industries.

Smith has published journal articles in the fields of data mining, machine learning, parallel supercomputing, text understanding, and simulated evolution. He has published two books through McGraw-Hill on big data and analytics and holds several patents in the fields of educational technology, big data analytics, and machine learning. He holds a BS in Electrical Engineering from MIT and an MS in Applied Sciences from Harvard University. He is currently the research director of data science at Eckerson Group.

by Ryan Cabeen (Laboratory of Neuroimaging (LONI), USC), Farshid Sepherband (Laboratory of Neuroimaging (LONI), USC), Kyle Polich, Dr. Meng Law (Laboratory of Neuroimaging (LONI), USC), Dr. Arthur Toga (Laboratory of Neuroimaging (LONI), USC)

Last year, Kyle had a chance to visit the Laboratory of Neuroimaging, or LONI, at USC, and learn about how some researchers are using data science to study the function of the brain. We're going to be covering some of their work in two episodes on Data Skeptic. In this first part of our two-part episode, we'll talk about the data collection, brain imaging, and the LONI pipeline. We'll then continue our coverage in the second episode, where we'll talk more about how researchers can gain insights about the human brain and their current challenges. Next week, we'll also talk more about what all that has to do with data science, machine learning, and artificial intelligence. Joining us in this week's episode are members of the LONI lab, including principal investigators Dr. Arthur Toga and Dr. Meng Law, and researchers Farshid Sepherband, PhD and Ryan Cabeen, PhD.

Learning Google BigQuery

If you're ready to tap the potential of data analytics in the cloud, 'Learning Google BigQuery' will take you from foundational concepts to advanced techniques on this powerful platform. Through hands-on examples, you'll learn how to query and analyze massive datasets efficiently, develop custom applications, and integrate your results seamlessly with other tools.

What this book will help me do: Understand the fundamentals of Google Cloud Platform and how BigQuery operates within it. Migrate enterprise-scale data seamlessly into BigQuery for further analytics. Master SQL techniques for querying large-scale datasets in BigQuery. Enable real-time data analytics and visualization with tools like Tableau and Python. Learn to create dynamic datasets, manage partitioned tables, and use BigQuery APIs effectively.

Authors: Berlyant, Haridass, and Brown are specialists with years of experience in data science, big data platforms, and cloud technologies. They bring their expertise in data analytics and teaching to make advanced concepts accessible. Their hands-on approach and real-world examples ensure readers can directly apply the skills they acquire to practical scenarios.

Who is it for? This book is tailored for developers, analysts, and data scientists eager to leverage cloud-based tools for handling and analyzing large-scale datasets. If you want hands-on proficiency with BigQuery or to enhance your organization's data capabilities, this book is a fit. No prior BigQuery knowledge is needed, just a willingness to learn.
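BigQuery's day-to-day workflow is standard SQL over large tables. Running it for real requires a Google Cloud project and the BigQuery client, so as a rough stand-in the same aggregation pattern is shown below against an in-memory SQLite database; the table, columns, and data are invented for illustration, and only the SQL shape carries over:

```python
import sqlite3

# In-memory SQLite as a stand-in for a cloud warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, action TEXT, bytes INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    ("alice", "upload", 120), ("bob", "upload", 300),
    ("alice", "download", 50), ("alice", "upload", 80),
])

# The filter / group / aggregate / order pattern typical of BigQuery SQL.
rows = conn.execute("""
    SELECT user, COUNT(*) AS n, SUM(bytes) AS total
    FROM events
    WHERE action = 'upload'
    GROUP BY user
    ORDER BY total DESC
""").fetchall()
print(rows)  # [('bob', 1, 300), ('alice', 2, 200)]
```

With the real service you would issue the same statement through the BigQuery client library or console, with the engine handling distribution across the dataset.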

In this podcast, Paul Ballew (@Ford) talks about best practices for running a data science organization spanning multiple continents. He shares the importance of being Smart, Nice, and Inquisitive in creating tomorrow's workforce today, and sheds some light on the importance of appreciating culture when defining forward-looking policies. He also builds a case for a non-native group and discusses ways to implement data science as a central organization (with no hub-and-spoke model). This podcast is great for future data science leaders leading organizations with a broad consumer base and multiple geopolitical silos.

Timeline: 0:29 Paul's journey. 5:10 Paul's current role. 8:10 Insurance and data analytics. 13:00 Who will own the insurance in the time of automation? 18:22 Recruiting models in technologies. 21:54 Embracing technological change. 25:03 Will we have more analytics in Ford cars? 28:25 How does Ford stay competitive from a technology perspective? 30:30 Challenges for the analytics officer at Ford. 32:36 Ingredients of a good hire. 34:12 How is the data science team structured at Ford? 36:15 Dealing with shadow groups. 39:00 Successful KPIs. 40:33 Who owns data? 42:27 Who should own the security of data assets? 44:05 Examples of successful data science groups. 46:30 Practices for remaining bias-free. 48:55 Getting started running a global data science team. 52:45 How does Paul keep himself updated? 54:18 Paul's favorite read. 55:45 Closing remarks.

Paul's Recommended Read: The Outsiders – S. E. Hinton http://amzn.to/2Ai84Gl

Podcast Link: https://futureofdata.org/paul-ballewford-running-global-data-science-group-futureofdata-podcast/

Paul's BIO: Paul Ballew is vice president and Global Chief Data and Analytics officer, Ford Motor Company, effective June 1, 2017. At the same time, he also was elected a Ford Motor Company officer. In this role, he leads Ford’s global data and analytics teams for the enterprise. Previously, Ballew was Global Chief Data and Analytics Officer, a position to which he was named in December 2014. In this role, he has been responsible for establishing and growing the company’s industry-leading data and analytics operations that are driving significant business value throughout the enterprise. Prior to joining Ford, he was Chief Data, Insight & Analytics Officer at Dun & Bradstreet. In this capacity, he was responsible for the company’s global data and analytic activities along with the company’s strategic consulting practice. Previously, Ballew served as Nationwide’s senior vice president for Customer Insight and Analytics. He directed customer analytics, market research, and information and data management functions, and supported the company’s marketing strategy. His responsibilities included the development of Nationwide’s customer analytics, data operations, and strategy. Ballew joined Nationwide in November 2007 and established the company’s Customer Insights and Analytics capabilities.

Ballew sits on the boards of Neustar, Inc. and Hyatt Hotels Corporation. He was born in 1964 and has a bachelor’s and master’s degree in Economics from the University of Detroit.

About #Podcast:

FutureOfData podcast is a conversation starter to bring leaders, influencers, and lead practitioners to discuss their journey in creating the data-driven future.

Wanna Join? If you or anyone you know wants to join in, register your interest @ http://play.analyticsweek.com/guest/

Want to sponsor? Email us @ [email protected]

Keywords:

#FutureOfData #DataAnalytics #Leadership #Podcast #BigData #Strategy

D3.js in Action, Second Edition

D3.js in Action, Second Edition is completely revised and updated for D3 v4 and ES6. It's a practical tutorial for creating interactive graphics and data-driven applications using D3.

About the Technology: Visualizing complex data is hard. Visualizing complex data on the web is darn near impossible without D3.js. D3 is a JavaScript library that provides a simple but powerful data visualization API over HTML, CSS, and SVG. Start with a structure, dataset, or algorithm; mix in D3; and you can programmatically generate static, animated, or interactive images that scale to any screen or browser. It's easy, and after a little practice, you'll be blown away by how beautiful your results can be!

About the Book: D3.js in Action, Second Edition is a completely updated revision of Manning's bestselling guide to data visualization with D3. You'll explore dozens of real-world examples in full color, including force and network diagrams, workflow illustrations, geospatial constructions, and more! Along the way, you'll pick up best practices for building interactive graphics, animations, and live data representations. You'll also step through a fully interactive application created with D3 and React.

What's Inside: Rich full-color diagrams and illustrations. Updated for D3 v4 and ES6. Reusable layouts and components. Geospatial data visualizations. Mixed-mode rendering.

About the Reader: Suitable for web developers with HTML, CSS, and JavaScript skills. No specialized data science skills required.

About the Author: Elijah Meeks is a senior data visualization engineer at Netflix.

Quotes:
"From basic to complex, this book gives you the tools to create beautiful data visualizations." - Claudio Rodriguez, Cox Media Group
"The best reference for one of the most useful DataViz tools." - Jonathan Rioux, TD Insurance
"From toy examples to techniques for real projects. Shows how all the pieces fit together." - Scott McKissock, USAID
"A clever way to immerse yourself in the D3.js world." - Felipe Vildoso Castillo, University of Chile

In this podcast, Igor Volovich (@CyberIgor) talks about the strategic side of cybersecurity. He shares some practices that businesses could adopt to keep their infrastructure safe. Igor sheds some light on easy ways to measure security for your business and the leadership commitment needed to establish a security mindset. Igor also discusses the need for metrics-led strategies to quantify outcomes. This podcast is great for future information security leaders who want to understand a data science and metrics-led cybersecurity strategy.

Timeline: 0:29 Igor's journey. 10:37 Recognizing innovation in small companies. 16:30 Aligning with an incubator. 25:16 Creating robust risk metrics. 39:29 The right way of thinking about cybersecurity. 50:42 Can a company be offensive about security? 57:43 Igor's favorite read. 59:17 Igor's upcoming book.

Igor's Recommended Read: How to Measure Anything in Cybersecurity Risk by Douglas W. Hubbard, Richard Seiersen http://amzn.to/2BOoK6D

Podcast Link: https://futureofdata.org/563505-2/

Igor's BIO: Strategist, advisor, advocate, mentor, author, speaker, and cyber leader. Passionate about the craft of cybersecurity and its role in protecting the computing public, the integrity of global commerce and international trade, and defense of critical national infrastructure.

Internationally experienced cybersecurity executive and senior advisor with 20 years of service to the world's largest private and public-sector entities, Fortune 100s, US legislative and executive branches, and regulatory agencies.

About #Podcast:

FutureOfData podcast is a conversation starter to bring leaders, influencers, and lead practitioners to discuss their journey to create the data-driven future.

Wanna Join? If you or anyone you know wants to join in, register your interest @ http://play.analyticsweek.com/guest/

Want to sponsor? Email us @ [email protected]

Keywords:

#FutureOfData #DataAnalytics #Leadership #Podcast #BigData #Strategy

Learning Pentaho Data Integration 8 CE - Third Edition

"Learning Pentaho Data Integration 8 CE" is your comprehensive guide to mastering data manipulation and integration with Pentaho Data Integration (PDI) 8 Community Edition. Through step-by-step instructions and practical examples, you'll learn to explore, transform, validate, and integrate data from multiple sources, equipping you to handle real-world data challenges efficiently.

What this book will help me do: Install and understand the foundational concepts of Pentaho Data Integration 8 Community Edition. Organize, clean, and transform raw data from various sources into useful formats. Perform advanced data operations such as metadata injection, managing relational databases, and implementing ETL solutions. Design, create, and deploy comprehensive data warehouse solutions using modern best practices. Streamline daily data processing tasks with flexibility and accuracy while handling errors gracefully.

Author: Carina Roldán is an experienced professional in data science and ETL (Extract, Transform, Load) development. Her expertise with tools like Pentaho Data Integration has allowed her to contribute significantly to BI and data management projects. Her approach in this book reflects her commitment to simplifying complex topics for aspiring professionals.

Who is it for? This book is ideal for software developers, data analysts, business intelligence professionals, and IT students aiming to strengthen their skills in ETL processes with Pentaho Data Integration. Beginners who wish to learn PDI comprehensively and professionals looking to deepen their expertise will both find value here, as will anyone involved in data warehouse design and implementation.

In this podcast, George Corugedo (@RedpointCTO / @Redpoint) talks about the ingredients of a technologist in a data-driven world. He sheds light on technology and technologist bias and how companies can work progressively to respond in an unbiased manner. He shares insights on leading a data science product as a technologist along with takeaways for future technologists. This podcast is great for future technologists thinking of shaping their organization to take advantage of technological disruptions and stay competitive.

Timeline: 0:29 George's journey. 3:35 Challenges in George's journey. 7:22 The relevance of mathematics in this data-driven world. 13:02 Statisticians getting into the technology stack. 22:38 Data-driven customer engagement platform. 24:24 Challenges for a technologist to connect with various platforms and prospects. 28:52 Customer challenges for businesses. 31:55 What do businesses get about marketing? 34:04 Bridging the gap between data and analytics. 42:42 Hacks for mitigating bias. 46:18 Appification: a bane or an opportunity. 48:45 A candidate for a data analytics startup. 52:40 Important KPIs for a data-driven customer engagement company. 56:33 How does George keep himself updated? 57:58 What keeps George up at night? 59:15 George's favorite read. 1:01:05 Closing remarks.

Youtube: https://youtu.be/u6CtN-TYjXI iTunes: http://apple.co/2AJDnuz

George's Recommended Read: To Kill a Mockingbird by Harper Lee http://amzn.to/2hZnwwx Self-Reliance and Other Essays (Dover Thrift Editions) by Ralph Waldo Emerson http://amzn.to/2i0WcOx

Podcast Link: https://futureofdata.org/redpointcto-redpointglobal-on-becoming-an-unbiased-technologist-in-datadriven-world/

George's BIO: A former math professor and seasoned technology executive, RedPoint Chief Technology Officer and Co-Founder George Corugedo has more than two decades of business and technical experience. George is responsible for directing the development of the RedPoint Customer Engagement Hub, RedPoint’s leading enterprise customer engagement solution. George left academia in 1997 to co-found Accenture’s Customer Insights Practice, which specialized in strategic data utilization, analytics, and customer strategy. George’s previous positions include director of client delivery at ClarityBlue, Inc., a provider of hosted customer intelligence solutions, and COO/CIO of Riscuity, a receivables management company that specialized in using analytics to drive collections.

About #Podcast:

FutureOfData podcast is a conversation starter to bring leaders, influencers, and lead practitioners to discuss their journey to create the data-driven future.

Wanna Join? If you or any you know wants to join in, Register your interest @ http://play.analyticsweek.com/guest/

Want to sponsor? Email us @ [email protected]

Keywords:

FutureOfData #DataAnalytics #Leadership #Podcast #BigData #Strategy

R Data Mining

Dive into the world of data mining with 'R Data Mining' and discover how to use R's vast toolset to uncover insights in data. This hands-on guide immerses you in real-world cases, teaching both foundational concepts and advanced techniques like regression models and text mining. You'll emerge with a sharp understanding of how to transform raw data into actionable information.

What this book will help me do: Gain proficiency in R packages such as dplyr and ggplot2 for data manipulation and visualization. Master the CRISP-DM methodology to systematically approach data mining projects. Develop skills in data cleaning and validation to ensure quality analysis. Understand and implement multiple regression and classification techniques effectively. Learn to use ensemble learning methods and produce reports with R Markdown.

Author: Andrea Cirillo brings extensive expertise in data science and R programming to 'R Data Mining.' A practical approach, drawing on professional experience in various industries, makes complex techniques accessible and engaging, and a passion for teaching translates into a carefully crafted learning journey for aspiring data miners.

Who is it for? This book is ideal for beginner- to intermediate-level data analysts and aspiring data scientists eager to delve into data mining with R. If you're familiar with the basics of programming in R and want to move into practical applications of data mining methodologies, this is the resource for you, with hands-on experience from real-world datasets and scenarios.
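The ensemble methods the blurb mentions, bagging in particular, rest on a simple idea: train many weak models on bootstrap resamples of the data and combine their predictions by majority vote. A toy sketch of that idea, in Python rather than the book's R, using an invented one-dimensional dataset and 1-nearest-neighbour as the base model:

```python
import random


def bagging_predict(train, x, n_models=25, seed=0):
    """Toy bagging: each 'model' is a 1-nearest-neighbour classifier fit
    on a bootstrap resample of the training data; the ensemble prediction
    is the majority vote across models."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        sample = [rng.choice(train) for _ in train]         # bootstrap resample
        nearest = min(sample, key=lambda p: abs(p[0] - x))  # 1-NN prediction
        votes.append(nearest[1])
    return max(set(votes), key=votes.count)                 # majority vote


# Invented (feature, label) training pairs for illustration.
train = [(0.0, "low"), (0.2, "low"), (0.9, "high"), (1.1, "high")]
print(bagging_predict(train, 0.1))   # low
print(bagging_predict(train, 1.0))   # high
```

Real implementations (random forests being the best-known case) use stronger base learners and aggregate many trees, but the resample-then-vote structure is the same.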

In this podcast, @CRGutowski from @GE_Digital talks about the importance of data and analytics in transforming sales organizations. She sheds light on the challenges and opportunities of transforming the sales organization of a transnational enterprise using analytics and instilling a growth mindset. Cate shares some of the tenets of the transformation mindset. This episode is great for future leaders who are thinking about shaping their sales organizations and empowering them with a digital mindset.

Timeline:
0:29 Cate's journey.
7:40 Cate's typical day.
9:07 How does sales cope with disruption?
13:25 Data science in sales.
14:48 Planning digital software for a 25,000-person workforce.
18:00 The thin line between marketing and sales.
22:13 Safeguarding the workforce against tech disruption.
24:57 The culture of sales.
27:55 Designing a digitally connected strategy.
30:08 Designing customer experience.
33:48 Sales strategy for a startup.
36:43 Selling transformative sales strategies to executives.
40:55 How can organizations go digital?
43:25 Digital thread.
44:14 How can a sales organization deal with IT?
45:54 Pitfalls in the process of digitization.
48:44 Challenges for sales folks amid disruption.
50:30 How does Cate keep herself updated?
52:10 Cate's success mantra.
54:06 Closing remarks.

YouTube: https://youtu.be/3jcpYgvIli4
iTunes: http://apple.co/2hM9r5E

Cate's Recommended Read: Start with Why: How Great Leaders Inspire Everyone to Take Action by Simon Sinek http://amzn.to/2hGvc6w

Podcast Link: https://futureofdata.org/crgutowski-ge_digital-using-analytics-transform-sales/

Cate's BIO: Cate has 20 years of technical sales, marketing, and product leadership experience across various global divisions in GE. Cate is currently based in Boston, MA, and works as the VP – Commercial Digital Thread, leading the digital transformation of GE’s 25,000+ sales organization globally. Prior to relocating to Boston, Cate and her family lived in Budapest, Hungary, where she led product management, marketing, and commercial operations across EMEA for GE Current. Cate holds an M.B.A. from the University of South Florida and a Bachelor’s degree in Communications and Business Administration from the University of Illinois at Urbana-Champaign.

About #Podcast:

The FutureOfData podcast is a conversation starter that brings together leaders, influencers, and leading practitioners to discuss their journeys toward creating a data-driven future.

Wanna Join? If you or anyone you know wants to join in, register your interest @ http://play.analyticsweek.com/guest/

Want to sponsor? Email us @ [email protected]

Keywords:

#FutureOfData #DataAnalytics #Leadership #Podcast #BigData #Strategy

R Data Visualization Recipes

"R Data Visualization Recipes" is a valuable resource for data professionals who want to create clear and effective data visualizations using R. Through a series of practical recipes, the book walks you through techniques ranging from the basics to advanced, interactive dashboards. By following these recipes, you'll be equipped to use R's visualization packages to their full potential.

What this Book will help me do:
- Understand and effectively use R's diverse data visualization libraries.
- Create polished and informative graphics with ggplot2, ggvis, and plotly.
- Enhance plots with interactive and animated elements to tell a compelling story.
- Develop expertise in creating three-dimensional and multivariate visualizations.
- Design custom interactive dashboards using the power of Shiny.

Author(s): Bianchi Lanzetta is an expert in data visualization and programming, bringing years of experience in using R for applications in data analysis and graphics. With a background in software development, data science, and teaching, the author shares practical insights and clear instructions. Lanzetta's approachable and methodical writing style makes even complex topics accessible.

Who is it for? This book is perfect for data professionals, analysts, and scientists who know the basics of R and want to enhance their ability to communicate findings visually. Even if you are a beginner with some exposure to R's ggplot2 package or similar, you'll find the recipes approachable and methodical. It is ideal for readers who want practical, directly applicable techniques, whether you're looking to augment your reporting abilities or explore advanced data visualization.

Functional Data Structures in R: Advanced Statistical Programming in R

Get an introduction to functional data structures using R, write more effective code, and gain performance for your programs. Because data in functional languages is not mutable, this book teaches you workarounds: for example, you'll learn how to change variable-value bindings by modifying environments, which can be exploited to emulate pointers and implement traditional data structures. You'll also see how, by abandoning traditional data structures, you can manipulate structures by building new versions rather than modifying them. You'll discover how these so-called functional data structures differ from the traditional data structures you may know, and why they are worth understanding for serious algorithmic programming in a functional language such as R. By the end of Functional Data Structures in R, you'll understand the choices to make in order to work most effectively with data structures when you cannot modify the data itself. These techniques are especially applicable to algorithmic development in big data, finance, and other data science applications.

What You'll Learn:
- Carry out algorithmic programming in R
- Use abstract data structures
- Work with both immutable and persistent data
- Emulate pointers and implement traditional data structures in R
- Build new versions of known, traditional data structures

Who This Book Is For: Experienced or advanced programmers who are at least comfortable with R. Some experience with data structures is recommended.
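The core idea, building a new version of a structure rather than mutating it in place, is language-agnostic even though the book develops it in R. Here is a minimal sketch of a persistent singly linked list in Python (an illustration of the concept, not the book's code):

```python
# A persistent (immutable) singly linked list: every "update" returns a
# new version that shares structure with the old one, so old versions
# remain valid and cheap to keep around.
from typing import Optional, Tuple

# A list is either None (empty) or a (head, tail) pair.
PList = Optional[Tuple[int, "PList"]]

def cons(head: int, tail: PList) -> PList:
    """Prepend an element, producing a new list; the old list is untouched."""
    return (head, tail)

def to_python_list(lst: PList) -> list:
    """Walk the cells to materialize an ordinary Python list."""
    out = []
    while lst is not None:
        out.append(lst[0])
        lst = lst[1]
    return out

v1 = cons(2, cons(3, None))   # version 1: [2, 3]
v2 = cons(1, v1)              # version 2: [1, 2, 3]; v1 is still intact
```

Note that `v2` does not copy `v1`: its tail *is* `v1`, which is the structural sharing that makes persistent updates efficient.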

Statistics for Data Science

Dive into the world of statistics tailored to the needs of data science with 'Statistics for Data Science'. This book guides you from the fundamentals of statistical concepts to their practical application in data analysis, machine learning, and neural networks. Learn with clear explanations and practical R examples to fully grasp statistical methods for data-driven challenges.

What this Book will help me do:
- Understand foundational statistical concepts such as variance, standard deviation, and probability.
- Gain proficiency in using R to perform statistical computations for data science programmatically.
- Learn techniques for applying statistics in data cleaning, mining, and analysis tasks.
- Master methods for executing linear regression, regularization, and model assessment.
- Explore advanced techniques such as boosting, SVMs, and neural network applications.

Author(s): James D. Miller brings years of experience as a data scientist and educator. He has a deep understanding of how statistics foundationally supports data science and has worked across multiple industries applying these principles. Dedicated to teaching, James simplifies complex statistical concepts into approachable and actionable knowledge for developers aspiring to master data science applications.

Who is it for? This book is intended for developers aiming to transition into data science. If you have some basic programming knowledge and want to understand the statistics essentials for data science applications, this book is designed for you. It's perfect for those who wish to apply statistical methods to practical tasks like data mining and analysis. Prior hands-on experience with R is helpful but not mandatory, as the book explains R methodologies comprehensively.
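As a taste of the fundamentals the blurb mentions (variance and standard deviation), here is a short Python sketch using the standard library; the book itself works in R, so this is only an illustration of the concepts:

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = statistics.mean(data)      # arithmetic mean
var = statistics.pvariance(data)  # population variance: mean of squared deviations
std = statistics.pstdev(data)     # population standard deviation: sqrt of variance

print(mean, var, std)  # 5.0 4.0 2.0
```

For samples rather than full populations, `statistics.variance` and `statistics.stdev` apply the n-1 correction instead.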

Python for R Users

The definitive guide for statisticians and data scientists who understand the advantages of becoming proficient in both R and Python. The first book of its kind, Python for R Users: A Data Science Approach makes it easy for R programmers to code in Python and for Python users to program in R. Short on theory and long on actionable analytics, it provides readers with a detailed comparative introduction and overview of both languages and features concise tutorials with command-by-command translations of R to Python and Python to R, complete with sample code. Following an introduction to both languages, the author cuts to the chase with step-by-step coverage of the full range of pertinent programming features and functions, including data input, data inspection and data quality, data analysis, and data visualization. Statistical modeling, machine learning, and data mining, including supervised and unsupervised methods, are treated in detail, as are time series forecasting, text mining, and natural language processing.

- Features a quick-learning format with concise tutorials and actionable analytics
- Provides command-by-command translations of R to Python and vice versa
- Incorporates Python and R code throughout to make it easier for readers to compare and contrast features in both languages
- Offers numerous comparative examples and applications in both programming languages
- Designed for practitioners and students who know one language and want to learn the other
- Supplies slides useful for teaching and learning either software on a companion website

Python for R Users: A Data Science Approach is a valuable working resource for computer scientists and data scientists who know R and would like to learn Python, or who are familiar with Python and want to learn R. It also functions as a textbook for students of computer science and statistics.

A. Ohri is the founder of Decisionstats.com and currently works as a senior data scientist. He has advised multiple startups in analytics off-shoring, analytics services, and analytics education, as well as on using social media to enhance buzz for analytics products. Mr. Ohri's research interests include spreading open source analytics, analyzing social media manipulation with mechanism design, simpler interfaces for cloud computing, and investigating climate change and knowledge flows. His other books include R for Business Analytics and R for Cloud Computing.
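The flavor of command-by-command translation can be sketched as follows; this is a hypothetical example in the spirit of the book, not taken from its text, with R equivalents shown as comments above each Python line:

```python
from statistics import mean

# R: x <- c(1, 2, 3, 4)          -- a numeric vector
x = [1, 2, 3, 4]

# R: mean(x)                     -- arithmetic mean
x_mean = mean(x)

# R: x[x > 2]                    -- vectorized filtering
filtered = [v for v in x if v > 2]

# R: sapply(x, function(v) v^2)  -- apply a function element-wise
squared = [v ** 2 for v in x]

# R: paste("id", x, sep = "_")   -- vectorized string construction
labels = [f"id_{v}" for v in x]
```

The general pattern is that R's implicitly vectorized operations map onto Python list comprehensions (or NumPy/pandas operations in larger programs).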

In this podcast, Andrea Gallego, Principal & Global Technology Lead @ Boston Consulting Group, talks about her journey as a data science practitioner in the consulting space. She discusses some of the industry practices that up-and-coming data science professionals should adopt and shares operational hacks for building a robust data science team. It is a must-listen conversation for practitioners in the industry who are trying to build a data science team and deliver solutions for a service industry.

Timeline:
0:29 Andrea's journey.
5:41 Andrea's current role.
8:02 From seasoned data professional to COO role.
11:27 The essentials for having analytics at scale.
14:56 First steps to creating an analytics practice.
18:33 Defining an engineering-first company.
22:33 A different understanding of data engineering.
26:40 Mistakes businesses make in their data science practice.
30:21 Good business problems that data science can solve.
36:42 Democratization of data vs. privacy in companies.
38:04 Tech-to-business challenges.
40:11 Important KPIs for building a data science practice.
43:47 Hacks for hiring good data science candidates.
49:07 The art of doing business and the science of doing business.
52:16 Andrea's secret to success.
55:12 Andrea's favorite read.
58:35 Closing remarks.

Andrea's Recommended Reads:
Arrival by Ted Chiang http://amzn.to/2h6lJpv
Built to Last by Jim Collins http://amzn.to/2yMCsam
Designing Agentive Technology: AI That Works for People http://amzn.to/2ySDHGp

Podcast Link: https://futureofdata.org/andrea-gallego-bcg-managing-analytics-practice/

Andrea's BIO: Andrea is Principal & Global Technology Lead @ Boston Consulting Group. Prior to BCG, Andrea was COO of QuantumBlack's Cloud platform, where she also managed the cloud platform team and helped drive the vision and future of McKinsey Analytics' digital capabilities. Andrea has broad expertise in computer science, cloud computing, digital transformation strategy, and analytics solutions architecture. Before joining the firm, Andrea was a technologist at Booz Allen Hamilton. She holds a BS in Economics and an MS in Analytics (with a concentration in computing methods for analytics).

About #Podcast:

The FutureOfData podcast is a conversation starter that brings together leaders, influencers, and leading practitioners to discuss their journeys toward creating a data-driven future.

Wanna Join? If you or anyone you know wants to join in, register your interest @ http://play.analyticsweek.com/guest/

Want to sponsor? Email us @ [email protected]

Keywords: #FutureOfData #DataAnalytics #Leadership #Podcast #BigData #Strategy

Machine Learning with R Cookbook - Second Edition

Machine Learning with R Cookbook, Second Edition, is your hands-on guide to applying machine learning principles using R. Through simple, actionable examples and detailed step-by-step recipes, this book will help you build predictive models, analyze data, and derive actionable insights. Explore core topics in data science, including regression, classification, clustering, and more. What this Book will help me do Apply the Apriori algorithm for association analysis to uncover relationships in transaction datasets. Effectively visualize data patterns and associations using a variety of plots and graphing methods. Master the application of regression techniques to address predictive modeling challenges. Leverage the power of R and Hadoop for performing big data machine learning efficiently. Conduct advanced analyses such as survival analysis and improve machine learning model performance. Author(s) Yu-Wei, Chiu (David Chiu), the author, is an experienced data scientist and R programmer who specializes in applying data science and machine learning principles to solve real-world problems. David's pragmatic and comprehensive teaching style provides readers with deep insights and practical methodologies for using R effectively in their projects. His passion for data science and expertise in R and big data make this book a reliable resource for learners. Who is it for? This book is ideal for data scientists, analysts, and professionals working with machine learning and R. It caters to intermediate users who are versed in the basics of R and want to deepen their skills. If you aim to become the go-to expert for machine learning challenges and enhance your efficiency and capability in machine learning projects, this book is for you.