talk-data.com

Topic: Natural Language Processing (NLP)

Tags: ai, machine_learning, text_analysis

252 tagged activities

Activity Trend: peak of 24 activities per quarter, 2020-Q1 to 2026-Q1

Activities

252 activities · Newest first

Machine Learning with PySpark: With Natural Language Processing and Recommender Systems

Master the new features in PySpark 3.1 to develop data-driven, intelligent applications. This updated edition covers topics ranging from building scalable machine learning models, to natural language processing, to recommender systems. Machine Learning with PySpark, Second Edition begins with the fundamentals of Apache Spark, including the latest updates to the framework. Next, you will learn the full spectrum of traditional machine learning algorithm implementations, along with natural language processing and recommender systems. You’ll gain familiarity with the critical process of selecting machine learning algorithms, data ingestion, and data processing to solve business problems. You’ll see a demonstration of how to build supervised machine learning models such as linear regression, logistic regression, decision trees, and random forests. You’ll also learn how to automate the steps using Spark pipelines, followed by unsupervised models such as K-means and hierarchical clustering. A section on Natural Language Processing (NLP) covers text processing, text mining, and embeddings for classification. This new edition also introduces Koalas in Spark and how to automate data workflow using Airflow and PySpark’s latest ML library. 
After completing this book, you will understand how to use PySpark’s machine learning library to build and train various machine learning models, along with related components such as data ingestion, processing, and visualization, to develop data-driven intelligent applications. What you will learn: Build a spectrum of supervised and unsupervised machine learning algorithms; Use PySpark's machine learning library to implement machine learning and recommender systems; Leverage the new features in PySpark’s machine learning library; Understand data processing using Koalas in Spark; Handle issues around feature engineering, class balance, bias and variance, and cross validation to build optimally fit models. Who This Book Is For: Data science and machine learning professionals.

We talked about:

Mihail’s background; NLP and self-driving vehicles; Transitioning from academia to the industry; Machine learning researchers; Finding open-ended problems; Machine learning engineers; Is data science more engineering or research?; What can engineers and researchers learn from one another?; Bridging the disconnect between researchers and engineers; Breaking down silos; Fluid roles; Full-stack data scientists; Advice to machine learning researchers; Advice to machine learning engineers; Reading papers; Choosing between engineering or research if you’re just starting; Confetti.ai

Links:

https://twitter.com/mihail_eric http://confetti.ai/

Join DataTalks.Club: https://datatalks.club/slack.html

Our events: https://datatalks.club/events.html

Data Science For Dummies, 3rd Edition

Monetize your company’s data and data science expertise without spending a fortune on hiring independent strategy consultants to help. What if there was one simple, clear process for ensuring that all your company’s data science projects achieve a high return on investment? What if you could validate your ideas for future data science projects, and select the one idea that’s most primed for achieving profitability while also moving your company closer to its business vision? There is. Industry-acclaimed data science consultant Lillian Pierson shares her proprietary STAR Framework: a simple, proven process for leading profit-forming data science projects. Not sure what data science is yet? Don’t worry! Parts 1 and 2 of Data Science For Dummies will get all the bases covered for you. And if you’re already a data science expert? Then you really won’t want to miss the data science strategy and data monetization gems shared from Part 3 onward throughout this book. Data Science For Dummies demonstrates: The only process you’ll ever need to lead profitable data science projects; Secret, reverse-engineered data monetization tactics that no one’s talking about; The shocking truth about how simple natural language processing can be; How to beat the crowd of data professionals by cultivating your own unique blend of data science expertise. Whether you’re new to the data science field or already a decade in, you’re sure to learn something new and incredibly valuable from Data Science For Dummies. Discover how to generate massive business wins from your company’s data by picking up your copy today.

Want to be featured as a guest on Making Data Simple? Reach out to us at [[email protected]] and tell us why you should be next.

Abstract Hosted by Al Martin, VP, IBM Expert Services Delivery, Making Data Simple provides the latest thinking on big data, A.I., and the implications for the enterprise from a range of experts. This week on Making Data Simple, we have Alex Watson. Alex was previously a GM at AWS and is currently a Co-Founder at Gretel.ai. Gretel is a privacy startup that enables developers, researchers, and scientists to quickly create safe versions of data for use in pre-production environments and machine learning workloads, which are shareable across teams and organizations. These tools address head-on the massive data privacy bottleneck, which has stifled innovation across multiple industries for years, by equipping builders everywhere with the ability to create quality datasets that scale. In short, synthetic data levels the playing field for everyone. This democratization of data will foster competition, scientific discoveries, and the inventions that will drive the next revolution of our data economy. The company recently closed their Series A funding, led by Greylock, for another $12 million and brought Jason Warner, the current CTO of GitHub, on as an investor. Gretel also launched its latest public beta, Beta2, which offers privacy engineering as a service for everyone, not just developers. Show Notes 2:03 – Alex’s background 4:36 – What time frame was Harvest AI? 7:14 – How does NLP play into Harvest AI? 10:50 – How can we not have enough knowledge? 14:08 – Does the tech exist today for security? 18:14 – Privacy issues 20:42 – What does Gretel stand for? 27:42 – Do you increase the opportunity for bias? 31:18 – Where is the sweet spot for Gretel? 33:30 – When does synthetic data not work? 37:42 – What is practical privacy? Gretel Connect with the Team Producer Kate Brown - LinkedIn. Producer Steve Templeton - LinkedIn. Host Al Martin - LinkedIn and Twitter. Want to be featured as a guest on Making Data Simple?
Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

After performing several experiments with Airflow, we reached the best architectural design for processing medical text records at scale. Our hybrid solution uses Kubernetes, Apache Airflow, Apache Livy, and Apache cTAKES. Using Kubernetes containers has the benefit of a consistent, portable, and isolated environment for each component of the pipeline. With Apache Livy, you can run tasks in a Spark cluster at scale. Additionally, Apache cTAKES helps extract information from the clinical free text of electronic medical records by using natural language processing techniques to identify codable entities, temporal events, properties, and relations.
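The Livy piece of this hybrid pipeline is essentially a REST call from an Airflow task. As a rough sketch (the file path, argument names, and Spark settings below are illustrative, not taken from the article), a task might build a batch payload like this and POST it to Livy's /batches endpoint:

```python
import json

def livy_batch_payload(file, class_name=None, args=None, conf=None):
    """Build the JSON body for a POST to Livy's /batches endpoint."""
    payload = {"file": file}
    if class_name:
        payload["className"] = class_name
    if args:
        payload["args"] = list(args)
    if conf:
        payload["conf"] = dict(conf)
    return payload

# Hypothetical PySpark job that post-processes cTAKES annotations on the cluster.
body = livy_batch_payload(
    "local:///jobs/process_ctakes_output.py",
    args=["--input", "hdfs:///ehr/notes", "--output", "hdfs:///ehr/entities"],
    conf={"spark.executor.instances": "4"},
)
print(json.dumps(body, indent=2))
# An Airflow task would POST this body to http://<livy-host>:8998/batches,
# then poll /batches/{id}/state until the batch reaches a terminal state.
```

Running each of these components in its own Kubernetes container keeps the Airflow scheduler, Livy server, and cTAKES workers isolated while they communicate over HTTP.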

We talked about:

Yury’s background; Failing fast: Grammarly for science; Not failing fast: keyword recommender; Four steps to epiphany; Lessons learned when bringing XGBoost into production; When data scientists try to be engineers; Joining a fintech startup: doing NLP with thousands of GPUs; Working at a telco company; Having too much freedom; The importance of digital presence; Work-life balance; Quantifying the impact of failing projects on our CVs; Business trips to Perm: don’t work on the weekend; What doesn’t kill you makes you stronger

Links:

Yury's course: https://mlcourse.ai/ Yury's Twitter: https://twitter.com/ykashnitsky

Join DataTalks.Club: https://datatalks.club/slack.html

Our events: https://datatalks.club/events.html

Adding AI Cloud Services to Your On-Prem Data Workflows for NLP & Content Enrichment - Daniel Wrigley

Big Data Europe, onsite and online on 22-25 November 2022. Learn more about the conference: https://bit.ly/3BlUk9q

Join our next Big Data Europe conference on 22-25 November 2022, where you will be able to learn from global experts giving technical talks and hands-on workshops in the fields of Big Data, High Load, Data Science, Machine Learning, and AI. This time, the conference will be held in a hybrid setting, allowing you to attend workshops and listen to expert talks on-site or online.

Summary Most of the time when you think about a data pipeline or ETL job what comes to mind is a purely mechanistic progression of functions that move data from point A to point B. Sometimes, however, one of those transformations is actually a full-fledged machine learning project in its own right. In this episode Tal Galfsky explains how he and the team at Cherre tackled the problem of messy data for Addresses by building a natural language processing and entity resolution system that is served as an API to the rest of their pipelines. He discusses the myriad ways that addresses are incomplete, poorly formed, and just plain wrong, why it was a big enough pain point to invest in building an industrial strength solution for it, and how it actually works under the hood. After listening to this you’ll look at your data pipelines in a new light and start to wonder how you can bring more advanced strategies into the cleaning and transformation process.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask. RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. 
Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today. Your host is Tobias Macey and today I’m interviewing Tal Galfsky about how Cherre is bringing order to the messy problem of physical addresses and entity resolution in their data pipelines.

Interview

Introduction How did you get involved in the area of data management? Started as a physicist and evolved into data science. Can you start by giving a brief recap of what Cherre is and the types of data that you deal with? Cherre is a company that connects data. We’re not a data vendor, in that we don’t sell data, primarily. We help companies connect and make sense of their data. The real estate market is historically closed, gut-led, and behind on tech. What are the biggest challenges that you deal with in your role when working with real estate data? Lack of a standard domain model in real estate. Ontology: what is a property? Each data source thinks about properties in a very different way, therefore yielding similar but completely different data. QUALITY (even if the datasets are talking about the same thing, there are different levels of accuracy and freshness). HIERARCHY: when is one source better than another? What are the teams and systems that rely on address information? Any company that needs to clean or organize (make sense of) their data needs to identify people, companies, and properties. Our clients use address resolution in multiple ways, via the UI or via an API. Our service is both external and internal, so what I build has to be good enough for the demanding needs of our data science team, robust enough for our engineers, and simple enough that non-expert clients can use it. Can you give an example of the problems involved in entity resolution? Known entity example: the Empire State Building. To resolve addresses in a way that makes sense for the client you need to capture the real-world entities: lots, buildings, units.

Identify the type of the object (lot, building, unit) Tag the object with all the relevant addresses Relations to other objects (lot, building, unit)

What are some examples of the kinds of edge cases or messiness that you encounter in addresses? The first class is string problems, the second class is component problems, and the third class is geocoding. I understand that you have developed a service for normalizing addresses and performing entity resolution to provide canonical references for downstream analyses. Can you give an overview of what is involved? What is the need for the service? The main requirement here is connecting an address to a lot, building, or unit with latitude and longitude coordinates.

How were you satisfying this requirement previously? Before we built our model and dedicated service we had a basic prototype pipeline that only handled NYC addresses. What were the motivations for designing and implementing this as a service? The need to expand nationwide and to deal with client queries in real time. What are some of the other data sources that you rely on to be able to perform this normalization and resolution? Lot data, building data, unit data, and footprints and address points datasets. What challenges do you face in managing these other sources of information? Accuracy, hierarchy, standardization, a unified solution, and persistent IDs and primary keys.

Digging into the specifics of your solution, can you talk through the full lifecycle of a request to resolve an address and the various manipulations that are performed on it? String cleaning, parse and tokenize, standardize, match. What are some of the other pieces of information in your system that you would like to see addressed in a similar fashion? Our named entity solution, with connection to the knowledge graph and owner unmasking. What are some of the most interesting, unexpected, or challenging lessons that you learned while building this address resolution system? Scaling the NYC geocode example: the NYC model was exploding a subset of the options for messing up an address. Flexibility. Dependencies. Client exposure. Now that you have this system running in production, if you were to start over today what would you do differently? A lot, but at this point the module boundaries and client interface are defined in such a way that we are able to make changes or completely replace any given part of it without breaking anything client-facing. What are some of the other projects that you are excited to work on going forward? Named entity resolution and the Knowledge Graph.
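The string cleaning / parse and tokenize / standardize / match lifecycle described in the interview can be sketched in a few lines. This is a toy illustration with a handful of made-up abbreviation rules, not Cherre's actual service:

```python
import re

# Tiny illustrative abbreviation table -- not real standardization data.
STANDARD = {"st": "street", "ave": "avenue", "blvd": "boulevard",
            "e": "east", "w": "west", "n": "north", "s": "south"}

def clean(raw):
    """String cleaning: lowercase, drop punctuation noise, collapse whitespace."""
    s = re.sub(r"[^\w\s.]", " ", raw.lower())
    return re.sub(r"\s+", " ", s).strip()

def tokenize(s):
    return s.split()

def standardize(tokens):
    """Map abbreviations to canonical forms after stripping trailing periods."""
    return [STANDARD.get(t.rstrip("."), t.rstrip(".")) for t in tokens]

def normalize(raw):
    return " ".join(standardize(tokenize(clean(raw))))

def match(a, b):
    """Naive exact match on normalized strings; a real system would instead
    score candidates against lot / building / unit records."""
    return normalize(a) == normalize(b)

print(normalize("350 E. 57th St, "))                   # -> 350 east 57th street
print(match("350 E 57th St", "350 East 57th Street"))  # -> True
```

The value of wrapping this in a service is that the same normalized form is what gets matched against the lot, building, and unit records, so every client sees one canonical address.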

Contact Info

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today? BigQuery is a huge asset, in particular its UDFs, but they don’t support API calls or Python scripts.

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.init, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

Cherre

Podcast Episode

Photonics
Knowledge Graph
Entity Resolution
BigQuery
NLP (Natural Language Processing)
dbt

Podcast Episode

Airflow

Podcast.init Episode

Datadog

Podcast Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Data Science on AWS

With this practical book, AI and machine learning practitioners will learn how to successfully build and deploy data science projects on Amazon Web Services. The Amazon AI and machine learning stack unifies data science, data engineering, and application development to help level up your skills. This guide shows you how to build and run pipelines in the cloud, then integrate the results into applications in minutes instead of days. Throughout the book, authors Chris Fregly and Antje Barth demonstrate how to reduce cost and improve performance. Apply the Amazon AI and ML stack to real-world use cases for natural language processing, computer vision, fraud detection, conversational devices, and more Use automated machine learning to implement a specific subset of use cases with SageMaker Autopilot Dive deep into the complete model development lifecycle for a BERT-based NLP use case including data ingestion, analysis, model training, and deployment Tie everything together into a repeatable machine learning operations pipeline Explore real-time ML, anomaly detection, and streaming analytics on data streams with Amazon Kinesis and Managed Streaming for Apache Kafka Learn security best practices for data science projects and workflows including identity and access management, authentication, authorization, and more

Machine Reading Comprehension

Machine reading comprehension (MRC) is a cutting-edge technology in natural language processing (NLP). MRC has recently advanced significantly, surpassing human parity on several public datasets. It has also been widely deployed by industry in search engines and question answering systems. Machine Reading Comprehension: Algorithms and Practice performs a deep-dive into MRC, offering a resource on the complex tasks this technology involves. The title presents the fundamentals of NLP and deep learning before introducing the task, models, and applications of MRC. This volume gives a theoretical treatment of solutions, provides detailed analysis of code, and considers applications in real-world industry. The book includes basic concepts, tasks, datasets, NLP tools, deep learning models and architecture, and insight from hands-on experience. In addition, the title presents the latest advances from the past two years of research. Structured into three sections and eight chapters, this book presents the basics of MRC; MRC models; and hands-on issues in application. This book offers a comprehensive solution for researchers in industry and academia who are looking to understand and deploy machine reading comprehension within natural language processing. Presents the first comprehensive resource on machine reading comprehension (MRC) Performs a deep-dive into MRC, from fundamentals to latest developments Offers the latest thinking and research in the field of MRC, including the BERT model Provides theoretical discussion, code analysis, and real-world applications of MRC Gives insight from research which has led to surpassing human parity in MRC

Machine Learning and Data Science Blueprints for Finance

Over the next few decades, machine learning and data science will transform the finance industry. With this practical book, analysts, traders, researchers, and developers will learn how to build machine learning algorithms crucial to the industry. You'll examine ML concepts and over 20 case studies in supervised, unsupervised, and reinforcement learning, along with natural language processing (NLP). Ideal for professionals working at hedge funds, investment and retail banks, and fintech firms, this book also delves deep into portfolio management, algorithmic trading, derivative pricing, fraud detection, asset price prediction, sentiment analysis, and chatbot development. You'll explore real-life problems faced by practitioners and learn scientifically sound solutions supported by code and examples. This book covers: Supervised learning regression-based models for trading strategies, derivative pricing, and portfolio management Supervised learning classification-based models for credit default risk prediction, fraud detection, and trading strategies Dimensionality reduction techniques with case studies in portfolio management, trading strategy, and yield curve construction Algorithms and clustering techniques for finding similar objects, with case studies in trading strategies and portfolio management Reinforcement learning models and techniques used for building trading strategies, derivatives hedging, and portfolio management NLP techniques using Python libraries such as NLTK and scikit-learn for transforming text into meaningful representations
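For a concrete sense of the "transforming text into meaningful representations" step the book closes with, here is a pure-Python sketch of TF-IDF, the kind of weighting that NLTK and scikit-learn pipelines typically produce (scikit-learn's TfidfVectorizer uses a smoothed variant, so exact values differ):

```python
import math
from collections import Counter

def tfidf(docs):
    """Term frequency x inverse document frequency over whitespace tokens.
    Plain idf = log(N / df); scikit-learn smooths this, so numbers differ."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    df = Counter()                      # document frequency per term
    for toks in tokenized:
        df.update(set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (tf[t] / len(toks)) * math.log(n / df[t]) for t in tf})
    return vectors

docs = ["fraud detected in trading logs", "sentiment in earnings call logs"]
vecs = tfidf(docs)
# "logs" occurs in every document, so idf = log(2/2) = 0 and it gets no weight,
# while document-specific words like "fraud" keep a positive weight.
print(vecs[0]["logs"], vecs[0]["fraud"] > 0)  # -> 0.0 True
```

Down-weighting ubiquitous terms this way is what lets sentiment or fraud models focus on the words that actually discriminate between documents.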

AI and Machine Learning for Coders

If you're looking to make a career move from programmer to AI specialist, this is the ideal place to start. Based on Laurence Moroney's extremely successful AI courses, this introductory book provides a hands-on, code-first approach to help you build confidence while you learn key topics. You'll understand how to implement the most common scenarios in machine learning, such as computer vision, natural language processing (NLP), and sequence modeling for web, mobile, cloud, and embedded runtimes. Most books on machine learning begin with a daunting amount of advanced math. This guide is built on practical lessons that let you work directly with the code. You'll learn: How to build models with TensorFlow using skills that employers desire The basics of machine learning by working with code samples How to implement computer vision, including feature detection in images How to use NLP to tokenize and sequence words and sentences Methods for embedding models in Android and iOS How to serve models over the web and in the cloud with TensorFlow Serving
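The "tokenize and sequence words and sentences" step the book covers can be illustrated without TensorFlow. This pure-Python sketch mimics the behavior of a Keras-style tokenizer (index 1 reserved for out-of-vocabulary words, zero left-padding); the exact indices are an artifact of this toy corpus:

```python
from collections import Counter

def fit_vocab(texts):
    """Assign word indices by frequency, most frequent first; index 1 is
    reserved for out-of-vocabulary words (like Keras' oov_token convention)."""
    counts = Counter(w for t in texts for w in t.lower().split())
    return {w: i + 2 for i, (w, _) in enumerate(counts.most_common())}

def texts_to_sequences(texts, vocab):
    return [[vocab.get(w, 1) for w in t.lower().split()] for t in texts]

def pad(seqs, maxlen):
    """Zero-pad on the left to a fixed length, like Keras' pad_sequences."""
    return [[0] * (maxlen - len(s)) + s[-maxlen:] for s in seqs]

train = ["the cat sat", "the dog sat down"]
vocab = fit_vocab(train)
seqs = texts_to_sequences(["the bird sat"], vocab)  # "bird" is unseen -> 1
print(pad(seqs, 5))
```

The fixed-length integer sequences this produces are exactly the shape an embedding layer expects as input for the NLP models described in the book.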

One of the fastest-growing fields being applied in AI and Data Science is NLP, Natural Language Processing. But do you know what Natural Language Processing is and how to start applying it in your projects? In today's episode we dive head-first into this highly important subject for anyone who needs to work with text and voice data.

For this episode, we invited Flávio Clésio (Machine Learning Engineer at MyHammer) and Ahirton Lopes (PhD student at Mackenzie and Data Scientist at Magna Sistemas) to tell us about their experience applying NLP in industry and academia.

Check out our post on Medium for the resources we mention in the episode: https://medium.com/data-hackers/o-que-%C3%A9-natural-language-processing-o-tal-do-nlp-data-hackers-podcast-27-9819c1bed5bd

Machine Learning for Algorithmic Trading - Second Edition

Explore the intersection of machine learning and algorithmic trading with "Machine Learning for Algorithmic Trading" by Stefan Jansen. This comprehensive guide walks you through applying predictive modeling and data analysis to uncover financial signals and build systematic trading strategies. By the end, you'll be equipped to design and implement machine learning-driven trading systems. What this Book will help me do Develop data-driven trading strategies using supervised, unsupervised, and reinforcement learning methods. Master techniques for extracting predictive features from market and alternative datasets. Gain expertise in backtesting and validating ML-based trading strategies in Python. Apply text analysis techniques like NLP to news articles and transcripts for financial insights. Optimize portfolio risk and returns using advanced Python libraries. Author(s) Stefan Jansen is a quantitative researcher and data scientist with extensive experience in developing algorithmic trading solutions. He specializes in leveraging machine learning to extract financial insights and optimize investment strategies. His practical approach to applying ML in trading is reflected in this comprehensive guide, helping readers navigate complex trading challenges. Who is it for? This book is crafted for Python developers, data scientists, and finance professionals looking to integrate machine learning into algorithmic trading. Ideal for those with a basic understanding of Python and ML principles, it guides readers in crafting data-driven trading strategies. It's especially useful for analysts aiming to harness diverse data types for financial applications.

Want to be featured as a guest on Making Data Simple? Reach out to us at [[email protected]] and tell us why you should be next. Abstract On this week of Making Data Simple, we are joined by Christine Livingston, Managing Director and Chief Scientist at Perficient. Christine talks us through her approach to AI within the consulting and enterprise industries, along with how she and her team have been managing workloads during COVID-19. Tune in to hear more. Connect with Christine LinkedIn Perficient Show Notes 09:41 - Check out Watson Assistant here. 10:06 - Learn more about natural language processing here. 25:30 - Here are 5 reasons why you should choose to DIY. Connect with the Team Producer Liam Seston - LinkedIn. Producer Lana Cosic - LinkedIn. Producer Meighann Helene - LinkedIn. Producer Kate Brown - LinkedIn. Producer Allison Proctor - LinkedIn. Producer Mark Simmonds - LinkedIn. Producer Michael Sestak - LinkedIn. Host Al Martin - LinkedIn and Twitter. Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

This audio blog focuses on the increased usage of NLP to navigate different formats, languages, terminologies, and biases, and on how this technology will help analyze the fast-growing body of research on COVID-19. Originally published at: https://www.eckerson.com/articles/how-covid-19-will-drive-adoption-of-natural-language-processing

Want to be featured as a guest on Making Data Simple? Reach out to us at [[email protected]] and tell us why you should be next. Abstract Currently, COVID-19 is disrupting the world. In an effort to better provide updated information on the status of the pandemic, IBM and The Weather Channel have created a COVID-19 dashboard. Bill Higgins, IBM Distinguished Engineer, and Daniel Benoit, Program Director of Information Governance, have come on the podcast this week to discuss this new initiative. Together, with host Al Martin, they discuss the purpose of this project, their current findings, and how they personally have been impacted. Check out the dashboard here. Connect with Bill LinkedIn Connect with Daniel LinkedIn Show Notes 10:16 - Get up to speed on Natural Language Processing here. 14:47 - Not sure what a data lake is? Find out here. 21:49 - Learn more on why extensibility is important to your APIs here. Connect with the Team Producer Liam Seston - LinkedIn. Producer Lana Cosic - LinkedIn. Producer Meighann Helene - LinkedIn. Producer Kate Brown - LinkedIn. Producer Allison Proctor - LinkedIn. Producer Mark Simmonds - LinkedIn. Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

NOTE: This episode was recorded before the COVID-19 outbreak. Any comments made in this episode on travel are no longer relevant or took place during ordered quarantines. Please stay home and be safe. Want to be featured as a guest on Making Data Simple? Reach out to us at [[email protected]] and tell us why you should be next. Abstract Back for part 2 of his Making Data Simple series is Jad Chehlawi, CEO and Founder of TelosTouch. This episode follows up on part 1, where Jad explains how TelosTouch is aiming to redefine the client experience of investing. This episode focuses more on the emotional challenges that frequently alter decisions and how to better account for those changes. Connect with Jad LinkedIn TelosTouch Show Notes 03:47 - Check out these 6 hacks for fighting financial procrastination. 10:06 - The missing link. Learn about the void TelosTouch seeks to fill here. 17:37 - Need to get up to speed on natural language processing (NLP)? Find out more here. 25:49 - Learn more about modern portfolio theory here. Connect with the Team Producer Liam Seston - LinkedIn. Producer Lana Cosic - LinkedIn. Producer Meighann Helene - LinkedIn. Producer Kate Brown - LinkedIn. Producer Allison Proctor - LinkedIn. Producer Mark Simmonds - LinkedIn. Producer Michael Sestak - LinkedIn. Host Al Martin - LinkedIn and Twitter. Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

Did you know that there were monks in the 1400s doing text-based sentiment analysis? Can you name the 2016 movie that starred Amy Adams as a linguist? Have you ever laid awake at night wondering if stopword removal is ever problematic? Is the best therapist you ever had named ELIZA? The common theme across all of these questions is the broad and deep topic of natural language processing (NLP), a topic we've been wanting to form and exchange words regarding for quite some time. Dr. Joe Sutherland, the Head of Data Science at Search Discovery, joined the discussion and converted many of his thoughts on the subject into semantic constructs that, ultimately, were digitized into audio files for your auditory consumption. For complete show notes, including links to items mentioned in this episode and a transcript of the show, visit the show page.
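On the stopword question the episode teases: yes, removal can be problematic, because "not" sits on many stock stopword lists. A minimal illustration:

```python
# "not" appears on many stock stopword lists; dropping it flips negated sentiment.
STOPWORDS = {"the", "a", "is", "was", "not"}  # tiny illustrative list

def remove_stopwords(text):
    return [w for w in text.lower().split() if w not in STOPWORDS]

print(remove_stopwords("the movie was not good"))
# -> ['movie', 'good'] -- the negation is gone, so a bag-of-words sentiment
#    model now sees what looks like a positive review
```

This is why many NLP pipelines either keep negation words or fold them into the following token before removing anything.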

Next-Generation Machine Learning with Spark: Covers XGBoost, LightGBM, Spark NLP, Distributed Deep Learning with Keras, and More

Access real-world documentation and examples for the Spark platform for building large-scale, enterprise-grade machine learning applications. The past decade has seen an astonishing series of advances in machine learning. These breakthroughs are disrupting our everyday life and making an impact across every industry. Next-Generation Machine Learning with Spark provides a gentle introduction to Spark and Spark MLlib and advances to more powerful, third-party machine learning algorithms and libraries beyond what is available in the standard Spark MLlib library. By the end of this book, you will be able to apply your knowledge to real-world use cases through dozens of practical examples and insightful explanations. What You Will Learn Be introduced to machine learning, Spark, and Spark MLlib 2.4.x Achieve lightning-fast gradient boosting on Spark with the XGBoost4J-Spark and LightGBM libraries Detect anomalies with the Isolation Forest algorithm for Spark Use the Spark NLP and Stanford CoreNLP libraries that support multiple languages Optimize your ML workload with the Alluxio in-memory data accelerator for Spark Use GraphX and GraphFrames for Graph Analysis Perform image recognition using convolutional neural networks Utilize the Keras framework and distributed deep learning libraries with Spark Who This Book Is For Data scientists and machine learning engineers who want to take their knowledge to the next level and use Spark and more powerful, next-generation algorithms and libraries beyond what is available in the standard Spark MLlib library; also serves as a primer for aspiring data scientists and engineers who need an introduction to machine learning, Spark, and Spark MLlib.