NY R Conference
Get ready to celebrate the 10th anniversary of the New York R Conference!
We're taking a trip down memory lane and looking back over the past nine years. Come listen to some of the all-time greats who will be gracing our stage once again, and we're also adding some fresh and exciting new voices to the mix!
Workshops: May 15, 2024 | Location: TBA
Conference: May 16-17, 2024 | Location: FIAF Manhattan
Speakers
Andrew Gelman
Professor
Department of Statistics and Department of Political Science, Columbia University
Talk: It’s About Time
Abigail Haddad
Lead Data Scientist
Capital Technology Group
Talk: Automating Tests for your RAG Chatbot or Other Generative Tool
Wes McKinney
Principal Architect
Posit
Talk: The Future Roadmap for the Composable Data Stack
Sean Taylor
Chief Scientist
Motif Analytics
Talk: Analyzing and Visualizing Event Sequence Data
Emily Zabor
Associate Staff Biostatistician
Cleveland Clinic, Department of Quantitative Health Sciences
Talk: Reporting Survival Analysis Results with the gtsummary and ggsurvfit Packages
Jared P. Lander
Chief Data Scientist
Lander Analytics
Talk: 15 Years of Data Science in NYC
Chang She
CEO & Cofounder
LanceDB
Talk: Building Data Tooling in Rust for Multimodal AI
Kelsey McDonald
Ticketing Director
Two Circles
Talk: R is for Retention: Using Regression Models to Increase Revenue in Sports
Walker Harrison
Analyst
New York Yankees
Talk: Kick or Receive? Determining Optimal NFL Playoff Overtime Strategy via Simulation
David Robinson
Director of Data Science
Contentsquare
Talk: The Science of Product Development: Bringing Causal Inference to Conversion and Retention Metrics
Jon Harmon
Executive Director
R4DS Online Learning Community
Talk: I Built a Robot to Write This Talk
Alan Feder
Senior Principal Data Scientist
Freelance
Talk: RAGtime in the Big Apple: Chat with a Decade of NYR Talks
More speakers coming soon…
Retrospective Panel
Join us for a captivating retrospective panel as we celebrate a decade of the New York R Conference, 15 years of the New York Open Statistical Programming Meetup, and the vibrant journey of the Data Science community. Dive into the highlights, memories, and collective achievements that have shaped our community’s remarkable evolution. Don’t miss this nostalgic journey reflecting on the past and embracing the exciting future of data science!
Drew Conway
Head of Data Science, Private Investments
Two Sigma
Workshops
Machine Learning in R
Hosted by Max Kuhn
Wednesday, May 15 | 9:00am - 5:00pm
Join Max Kuhn on a tour through Machine Learning in R, with emphasis on using the software as opposed to general explanations of model building. This workshop is an abbreviated introduction to the tidymodels framework for modeling. You'll learn about data preparation, model fitting, model assessment and predictions. The focus will be on data splitting and resampling, data pre-processing and feature engineering, model creation, evaluation, and tuning. This is not a deep learning course and will focus on tabular data. Prerequisites: some experience with modeling in R and the tidyverse (you don't need to be an expert); prior experience with lm is enough to get started and learn advanced modeling techniques. In case participants can’t install the packages on their machines, RStudio Server Pro instances will be available that are pre-loaded with the appropriate packages and GitHub repository.
(In-Person & Virtual)
Causal Inference in R
Hosted by Malcolm Barrett & Lucy D'Agostino McGowan
Wednesday, May 15 | 9:00am - 5:00pm
In both data science and academic research, prediction modeling is often not enough; to answer many questions, we need to approach them causally. In this workshop, we’ll teach the essential elements of answering causal questions in R through causal diagrams, and causal modeling techniques such as propensity scores and inverse probability weighting. We’ll also show that by distinguishing predictive models from causal models, we can better take advantage of both tools. You’ll be able to use the tools you already know (the tidyverse, regression models, and more) to answer the questions that are important to your work. This course is for you if you:
- Know how to fit a linear regression model in R
- Have a basic understanding of data manipulation and visualization using tidyverse tools
- Are interested in understanding the fundamentals behind how to move from estimating correlations to causal relationships
(In-Person & Virtual)
Exploratory Data Analysis with the Tidyverse
Hosted by David Robinson
Wednesday, May 15 | 9:00am - 5:00pm
The tidyverse is a powerful collection of packages following a standard set of principles for usability. During this workshop David will demonstrate an exploratory data analysis in R using tidy tools. He will demonstrate the use of tools such as dplyr and ggplot2 for data transformation and visualization, as well as other packages from the tidyverse as they're needed. He'll narrate his thought process as attendees follow along and offer their own solutions. The workshop expects some familiarity with dplyr and ggplot2, enough to work with data using functions like mutate, group_by, and summarize and to create graphs like scatterplots or bar plots in ggplot2. These concepts will be re-introduced to ensure a smooth workshop, but it isn't designed for brand new R programmers. The workshop is designed to be interactive and participants are expected to type along on their own keyboards.
(In-Person & Virtual)
More workshops coming soon…
Agenda
Wednesday, May 15
-
08:00 AM - 09:00 AM
Registration & Breakfast
-
09:00 AM - 05:00 PM
Workshop: Max Kuhn Scientist @ Posit
Machine Learning in R
-
09:00 AM - 05:00 PM
Workshop: Malcolm Barrett & Lucy D'Agostino McGowan
Causal Inference in R
-
09:00 AM - 05:00 PM
Workshop: David Robinson Director of Data Science @ Contentsquare
Exploratory Data Analysis with the Tidyverse
Thursday, May 16
-
08:00 AM - 08:50 AM
Registration & Breakfast
-
08:50 AM - 09:00 AM
Opening Remarks
-
09:00 AM - 09:20 AM
TBD
-
09:25 AM - 09:45 AM
TBD
-
09:50 AM - 10:10 AM
Chang She CEO & Cofounder @ LanceDB
Building Data Tooling in Rust for Multimodal AI
AI adoption is bringing a host of new challenges for data management and new workloads. This is especially true for multimodal AI, where data challenges extend far beyond embeddings and require new tooling for working with images, audio, video, PDFs, and more. Traditional formats and tooling are optimized for purely tabular data and cannot be used effectively to manage unstructured data types. Instead, a new set of infrastructure and tooling is being built in Rust. Rust makes high-performance data manipulation code much safer, which means developers can move more quickly and with more confidence. It's easy to bridge Rust into higher-level languages like Python and R, wrapping it in APIs that are much more familiar to data science and machine learning users. Finally, Rust offers powerful features for concurrency, which let developers parallelize data manipulation tasks much more easily. In this talk we'll use Lance and LanceDB as a source of examples on building high-performance data tools for AI in Rust. We'll show you how Rust is used to create blazing fast vector search with hardware acceleration, how Rust helps us create new data management tooling for unstructured data, and how these tools can be exposed in higher-level languages like Python and JavaScript.
-
10:10 AM - 10:40 AM
Break
-
10:40 AM - 11:00 AM
Emily Zabor Associate Staff Biostatistician @ Cleveland Clinic, Department of Quantitative Health Sciences
Reporting Survival Analysis Results with the gtsummary and ggsurvfit Packages
Survival analysis is an essential tool to handle censored time-dependent endpoints such as overall survival, which are common across a variety of biomedical and other applications. The survival package in R provides the most essential tools to conduct a survival analysis, including estimating survival probabilities, fitting Cox proportional hazards models, and plotting Kaplan-Meier curves. While the functions are powerful, user-friendly, and well documented, getting publication-ready tables and figures can still be a challenge. In this talk, I will review the basics of survival analysis, and will demonstrate how to take results from the console to the manuscript using the gtsummary and ggsurvfit packages.
-
11:05 AM - 11:25 AM
Jared P. Lander Chief Data Scientist @ Lander Analytics
15 Years of Data Science in NYC
-
11:30 AM - 11:50 AM
TBD
-
11:50 AM - 01:00 PM
Lunch
-
01:00 PM - 01:20 PM
Sean Taylor Chief Scientist @ Motif Analytics
Analyzing and Visualizing Event Sequence Data
Many business processes can be represented as event sequence data, especially from product instrumentation in web and mobile applications. However, low-level events are challenging to wrangle, model, and visualize. As a result, analysts typically aggregate data before visualization and estimation, discarding valuable information and introducing bias. In this talk I discuss how to work with event sequences directly, with a focus on exploratory analysis and hypothesis generation, and step through interactive visualizations that support these analysis goals.
-
01:25 PM - 02:05 PM
Andrew Gelman Professor @ Department of Statistics and Department of Political Science, Columbia University
It’s About Time
Statistical processes occur in time, but this is often not accounted for in the methods we use and the models we fit. Examples include imbalance in causal inference, generalization from A/B tests even when there is balance, sequential analysis, adjustment for pre-treatment measurements, poll aggregation, spatial and network models, chess ratings, sports analytics, and the replication crisis in science. The point of this talk is to motivate you to include time as a factor in your statistical analyses. This may change how you think about many applied problems!
-
02:05 PM - 02:35 PM
Break
-
02:35 PM - 02:55 PM
Jon Harmon Executive Director @ R4DS Online Learning Community
I Built a Robot to Write This Talk
Are large language models coming for your job? To examine both sides of that argument, I wrote {robodeck}, an R package that uses the OpenAI API to auto-generate a Quarto slide deck from as little as a title. See how it helped, where it failed miserably, and how I coerced it to work at least most of the time.
-
03:00 PM - 03:20 PM
David Robinson Director of Data Science @ Contentsquare
The Science of Product Development: Bringing Causal Inference to Conversion and Retention Metrics
-
03:25 PM - 03:45 PM
Alan Feder Senior Principal Data Scientist @ Freelance
RAGtime in the Big Apple: Chat with a Decade of NYR Talks
-
03:45 PM - 04:15 PM
Break
-
04:15 PM - 04:35 PM
Abigail Haddad Lead Data Scientist @ Capital Technology Group
Automating Tests for your RAG Chatbot or Other Generative Tool
Building a Retrieval Augmented Generation (RAG) chatbot that answers questions about a specific set of documents is straightforward. But how do you tell if it's working? Automated evaluation of generative tools for specific use cases is tricky, but it's also important if you want to easily compare performance using different underlying LLMs, system prompts, temperatures, or other parameters, or just to make sure you're not breaking something when you push your code. In this talk, I'll discuss why this kind of evaluation is challenging and review a few options for the kinds of assessments you can create, including using an LLM to evaluate your LLM-based tool. We'll then look at several ways to write automated LLM-led evaluations, including with a library that lets you create complex grading rubrics for your tests easily and with very little coding.
-
04:40 PM - 05:00 PM
Walker Harrison Analyst @ New York Yankees
Kick or Receive? Determining Optimal NFL Playoff Overtime Strategy via Simulation
This year's Super Bowl was the first to feature an overtime period under the NFL's new playoff rules, which guarantee that each team will possess the ball in the added time. The San Francisco 49ers opted to have the first possession, subsequently lost, and were roundly criticized for not forcing their opponent to start with the ball. But did they actually make a poor strategic decision? To answer this question, we can simulate overtime periods by re-sampling historical plays under some added constraints.
-
05:00 PM - 05:10 PM
Closing Remarks
-
05:10 PM - 06:30 PM
Happy Hour
Friday, May 17
-
09:00 AM - 09:50 AM
Registration & Breakfast
-
09:50 AM - 10:00 AM
Opening Remarks
-
10:00 AM - 10:20 AM
Kelsey McDonald Ticketing Director @ Two Circles
R is for Retention: Using Regression Models to Increase Revenue in Sports
A conversation about how we’ve used R in the sports world to build logistic regression models that predict season ticket member retention, and multinomial regression models to identify upsell opportunities.
-
10:25 AM - 10:45 AM
Zhangjun Zhou Lead Data Scientist @ Macy's
Personalized Customer Journey at Macy's
This talk will give an overview of how data science at Macy's has enabled and empowered data-driven business decision making and delivered impact on customer experience. Specifically, it will focus on how data and machine learning have been used to predict customer preferences and optimize customer journeys for personalized experiences.
-
10:45 AM - 11:15 AM
Break
-
11:15 AM - 11:35 AM
TBD
-
11:40 AM - 12:20 PM
Hadley Wickham Chief Scientist @ Posit
R in Production
-
12:20 PM - 01:30 PM
Lunch
-
01:30 PM - 01:50 PM
Wes McKinney Principal Architect @ Posit
The Future Roadmap for the Composable Data Stack
In this talk, I plan to review the progress we have made in the last 10 years developing composable, interoperable open standards for the data processing stack, from such infrastructure projects as Parquet and Arrow to user-facing interface libraries like Ibis for Python and the tidyverse for R. In discussing the current landscape of projects, I will dig into the different areas where more innovation and growth is needed, and where we would ideally like to end up in the coming years.
-
01:55 PM - 02:15 PM
Max Kuhn Scientist @ Posit
SHINYLIVE IS SO EASY
shinylive is an extension to the Quarto open-source scientific and technical publishing system. It uses WebAssembly to let Shiny applications run locally, without a Shiny server. I’ll show examples and discuss the limitations of using shinylive.
-
02:20 PM - 02:40 PM
Hilary Mason Co-Founder @ Hidden Door
-
02:40 PM - 03:10 PM
Break
-
03:10 PM - 04:10 PM
Retrospective Panel
Hosted by Jon Krohn, this retrospective panel includes special guests Drew Conway, Soumya Kalra, JD Long, and Jared Lander.
-
04:10 PM - 04:20 PM
Closing Remarks
Sponsors