Topic

Scikit-learn

machine_learning data_science data_analysis

Activities: 2 tagged

Activity Trend: peak of 6 per quarter, 2020-Q1 to 2026-Q2

Top Events

O'Reilly Data Science Books: 25
O'Reilly AI & ML Books: 6
PyData Paris 2025: 5
SciPy 2025: 3
Data + AI Summit 2025: 3
PyData Paris 2024: 3
DataTalks.Club: 3
Databricks DATA + AI Summit 2023: 2
Data Engineering Podcast: 2
O'Reilly Data Visualization Books: 2
PyConDE & PyData Berlin 2023: 2
Big Data & AI Paris 2025: 2

Top Speakers

Andrew Worsley: 2
Robert Thas John: 2
Robert Johansson: 2
Thomas Joseph: 2
Dr. Samuel Asare: 2
Daniel Y. Chen: 2
Stephen Klosterman: 2
Fabio Nelli: 2
Anthony So: 2
Jake VanderPlas: 2
Kirthi Raman: 2
Guillaume Lemaitre (scikit-learn): 2

Activities

Filtering by: Databricks DATA + AI Summit 2023

US Army Corp of Engineers Enhanced Commerce & National Sec Through Data-Driven Geospatial Insight

2023-07-26 · Databricks DATA + AI Summit 2023

video

by Jeff Mroz

AI/ML Analytics API Cloud Computing Data Engineering Data Governance Data Lake Data Lakehouse Data Management Data Quality Databricks Delta +12 more

The US Army Corps of Engineers (USACE) is responsible for maintaining and improving nearly 12,000 miles of shallow-draft (9'-14') inland and intracoastal waterways, 13,000 miles of deep-draft (14' and greater) coastal channels, and 400 ports, harbors, and turning basins throughout the United States. Because these components of the national waterway network are considered assets to both US commerce and national security, they must be carefully managed to keep marine traffic operating safely and efficiently.

The National DQM Program is tasked with providing USACE with a nationally standardized remote monitoring and documentation system across multiple vessel types, with timely data access, reporting, dredge certifications, data quality control, and data management. Government systems have often lagged behind commercial systems in modernization efforts, but the emergence of the cloud and Data Lakehouse architectures has empowered USACE to move successfully into the modern data era.

This session incorporates aspects of these topics:
Data Lakehouse Architecture: Delta Lake, platform security and privacy, serverless, administration, data warehouse, Data Lake, Apache Iceberg, Data Mesh
GIS: H3, MOSAIC, spatial analysis
Data engineering: data pipelines, orchestration, CDC, medallion architecture, Databricks Workflows, data munging, ETL/ELT, lakehouses, data lakes, Parquet, Data Mesh, Apache Spark™ internals
Data streaming: Apache Spark Structured Streaming, real-time ingestion, real-time ETL, real-time ML, real-time analytics, real-time applications, Delta Live Tables
ML: PyTorch, TensorFlow, Keras, scikit-learn, Python and R ecosystems
Data governance: security, compliance, RMF, NIST
Data sharing: sharing and collaboration, Delta Sharing, data cleanliness, APIs

Talk by: Jeff Mroz

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Predicting Repeat Admissions to Substance Abuse Treatment with Machine Learning

2022-07-19 · Databricks DATA + AI Summit 2023

video

AI/ML BI Dashboard Databricks Power BI TensorFlow

In our presentation, we will walk through a model created to predict repeat admissions to substance abuse treatment centers. The goal is to predict early who will be at high risk for relapse so care can be tailored to put additional focus on these patients. We used the Treatment Episode Data Set (TEDS) Admissions data set, which includes every publicly funded substance abuse treatment admission in the US.

Although longitudinal data is not available in the data set, we were able to predict which admissions were first or repeat admissions with 88% accuracy and an F-score of 0.85. Our solution used a scikit-learn Random Forest model and leveraged MLflow to track model metrics and choose the most effective model. Our pipeline tested over 100 models of different types, ranging from Gradient Boosted Trees to Deep Neural Networks in TensorFlow.
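The model-selection step described above can be sketched roughly as follows. This is an illustration only, not the presenters' pipeline: the data is synthetic (the TEDS features are not reproduced here), the three candidate models are a hypothetical stand-in for the 100+ models mentioned, and a plain dictionary takes the place of an experiment tracker such as MLflow.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Assumption: synthetic binary data standing in for "first vs. repeat admission".
X, y = make_classification(
    n_samples=1000, n_features=12, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical candidate set; names and hyperparameters are illustrative.
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# A plain dict stands in for a tracking server: one metrics record per model.
results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "f1": f1_score(y_test, pred),
    }

# Pick the candidate with the highest F-score on the held-out split.
best_name = max(results, key=lambda n: results[n]["f1"])
print(best_name, results[best_name])
```

Selecting on F-score rather than accuracy is one reasonable choice when the two classes are not equally common; the abstract reports both metrics but does not say which one drove the selection.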

To improve model interpretability, we used Shapley values to measure which variables were most important for predicting readmission. These model metrics along with other valuable data are visualized in an interactive Power BI dashboard designed to help practitioners understand who to focus on during treatment. We are in discussions with companies and researchers who may be able to leverage this model in substance abuse treatment centers in the field.
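To make the Shapley-value idea concrete, here is a minimal sketch that computes exact Shapley values for one prediction of a small Random Forest. It is not the presenters' code: the abstract does not say which implementation was used, the data is synthetic, and the value function (average prediction when the "absent" features are taken from a background sample) is one common convention that is assumed here. Exact enumeration is only feasible for a handful of features; dedicated libraries rely on faster approximations.

```python
from itertools import combinations
from math import factorial

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Assumption: a tiny synthetic problem with 4 features so enumeration stays cheap.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

background = X[:50]  # reference rows that supply values for "absent" features
x = X[100]           # the single row being explained
n_features = X.shape[1]


def value(subset):
    """Mean predicted probability of class 1 when only `subset` takes x's values."""
    rows = background.copy()
    idx = list(subset)
    if idx:
        rows[:, idx] = x[idx]
    return model.predict_proba(rows)[:, 1].mean()


def shapley_values():
    """Exact Shapley values: weighted marginal contribution over all coalitions."""
    phi = np.zeros(n_features)
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            weight = (
                factorial(size) * factorial(n_features - size - 1)
                / factorial(n_features)
            )
            for subset in combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi


phi = shapley_values()
print(phi)
```

The contributions sum to the difference between the prediction for `x` and the average prediction over the background rows, which is what makes them readable as a per-feature breakdown of a single prediction.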
