Data Science Day @ Columbia University has ended
Columbia University’s Data Science Institute Presents:

Authors/Collaborators are listed in alphabetical order.

Lightning Talks
Wednesday, April 6

9:45am EDT

Mining Images, Speech, Text and Social Ties for Insights and Important Events

Shih-Fu Chang | Exploring Multimedia Recognition Tools in Big Data Applications
Advances in computer vision and the growth of digital photos and videos have created new opportunities to integrate content-recognition tools with mobile apps and large-scale systems. If you want more information about a building, product or bottle of wine, it’s now possible to search the Web with an image on your phone. New 3D sensors and search tools allow users to scan real-world objects and find matching models to make new products. Emerging multimedia-recognition tools are making it possible to track and summarize breaking news from streaming video and social media. This technology is also embedded in smart search engines that can mine video footage from sporting events, roads and security cameras to flag key events, from touchdowns to traffic accidents to criminal activity. I will give an overview of the novel technologies we are developing and discuss open issues.

Julia Hirschberg | Applications for Detecting Emotion in Text and Speech
Identifying the emotional content of written and spoken language is increasingly useful in business, medicine and security. Large data sets of text and speech, including social media, interviews and phone conversations, can be used to train systems to detect consumer reactions to products and services (and to flag ‘fake’ reviews), to diagnose medical conditions such as depression, and to identify deception in a wide variety of government, business and social service settings. Each application picks up subtle cues that may indicate whether a speaker is angry, happy, disgusted, afraid, sad or surprised. Similar approaches have been used to distinguish among personality traits, and to infer how tired, drunk or bored someone might be.

Kathy McKeown | Tracking Events Through Time: Objective and Personal Views
The chaos following Hurricane Sandy in 2012 brought home the need for a faster, more accurate way to filter the oceans of text streaming over social media and news sites during and after a crisis. We have been working on an automated method for monitoring and summarizing news as events unfold. Our method can flag new information as it becomes available, and generate updates. This can be extremely useful during emergencies as well as for tracking a wide variety of everyday events. In a related project, we’ve come up with a way to automatically identify the most compelling part of a personal narrative, what we call the “most reportable event.” I will discuss the natural language processing techniques that underlie this work, and future research directions.

Tian Zheng | Mapping Subpopulations within Big Networks
Estimating the size of stigmatized groups such as the homeless, people with HIV and commercial sex workers remains difficult, even in the digital age. Those belonging to marginalized subpopulations may be difficult to reach by phone, or in online surveys, or may simply prefer to keep sensitive personal information to themselves. Advances in network science are now allowing researchers to move past these obstacles to learn more about hard-to-reach demographic groups. My colleagues and I have developed a modeling framework to infer the size and other hidden features of subpopulations within a large study sample. Our method produces inferential results that are easy to interpret and relevant for visualizing, monitoring and understanding structures underlying large, complex networks.

Shih-Fu Chang

Senior Executive Vice Dean and Richard Dicker Professor of Telecommunications and Professor of Computer Science, Columbia Engineering
Shih-Fu Chang is Richard Dicker Chair Professor, Director of the Digital Video and Multimedia Lab, and Senior Executive Vice Dean of The Fu Foundation School of Engineering and Applied Science at Columbia University. He is an active researcher leading development of theories, algorithms...

Julia Hirschberg

Percy K. and Vida L. W. Hudson Professor of Computer Science and Department Chair, Columbia Engineering
Julia Hirschberg is the Percy K. and Vida L. W. Hudson Professor of Computer Science at Columbia University and chair of the department. She does research in prosody, spoken dialogue systems, and emotional and deceptive speech. She received her PhD in Computer Science from the University of Pennsylvania in 1985. She worked at Bell Laboratories and AT&T Labo...

Kathy McKeown

Director, Data Science Institute
A leading scholar and researcher in the field of natural language processing, McKeown focuses her research on big data; her interests include text summarization, question answering, natural language generation, multimedia explanation, digital libraries, and multilingual applications. Her research group's Columbia Newsblaster, which has been live since 2001, is...

Tian Zheng

Associate Professor of Statistics, Graduate School of Arts and Sciences
Tian Zheng is an associate professor of statistics at Columbia University. She obtained her PhD from Columbia in 2002. Her research develops novel methods, and improves existing ones, for exploring and analyzing interesting patterns in complex data from different application...

Wednesday April 6, 2016 9:45am - 10:30am EDT
Roone Arledge Auditorium Lerner Hall, Columbia University 2920 Broadway, New York, NY 10040

11:15am EDT

Developing Algorithms that Know Your Likes and Dislikes Better Than You

Shipra Agrawal | Explore and Exploit: Because You May Not Know What You're Missing
To improve its movie recommendations to subscribers, Netflix looks at what subscribers liked in the past to predict future preferences. But that method leaves out movies subscribers might like even better but don’t know about. Amazon faces a similar problem in recommending products to its customers. Discovering the full range of possibilities involves a trade-off between exploration and exploitation of data. Many sequential decision-making problems are rooted in this trade-off, including recommendation systems, online advertising, content optimization, revenue and inventory management, and even teaching computers to play games like Pong and Go. I will discuss how machine learning and optimization techniques can be combined to achieve near-optimal trade-offs between exploration and exploitation.

Olivier Toubia | Recommending Movies by Character Traits Featured
Current movie recommendation systems are largely based on viewers’ past preferences. We propose an alternative that taps into viewer preferences for stories that feature specific character traits, a finding documented in the media psychology literature. Borrowing from the positive psychology literature, we have developed a character-based classification system that is easy to interpret, communicate and act on. We have also developed a companion natural language processing tool that can infer character traits from movie summaries. In two online studies, we show that character traits are a strong predictor of what movies people like. Our results apply to films that achieve critical acclaim as well as box-office success. We show that character-based classification works for models that use content alone, and content with collaborative filtering, to predict viewer behavior.

Shipra Agrawal

Assistant Professor of Industrial Engineering and Operations Research, Columbia Engineering
Shipra Agrawal is an Assistant Professor in the Department of Industrial Engineering and Operations Research. Her research spans several areas of optimization and machine learning, including data-driven optimization under partial, uncertain, and online inputs, and related...

Olivier Toubia

Glaubinger Professor of Business, Columbia Business School
Olivier Toubia is the Glaubinger Professor of Business and the Faculty Director of the Lang Center for Entrepreneurship at Columbia Business School. His research focuses on various aspects of innovation (including idea generation, preference measurement, and the diffusion of innovation...

Wednesday April 6, 2016 11:15am - 11:40am EDT
Roone Arledge Auditorium Lerner Hall, Columbia University 2920 Broadway, New York, NY 10040

12:30pm EDT

Measuring and Addressing Social and Environmental Problems in Cities

Donald Davis | Mining Yelp Reviews to Measure Segregation in New York City
Until they were dismantled in the mid-1960s, the segregationist Jim Crow laws in the southern United States severely limited social interactions among ethnic groups. Despite the Civil Rights Act and later reforms, the U.S. remains deeply segregated, even in northern cities like New York. While standard measures of segregation exist for residences, jobs, and schools, we currently have no way of measuring how segregated common public activities, like going to restaurants, are. By studying five years of Yelp reviews in New York City, my colleagues and I provide the first estimate of diversity in city restaurants. Early results suggest that dining patterns are also segregated, though not as markedly as in housing.

Xiaofan (Fred) Jiang | Smart Systems for Monitoring Air Pollution and Personal Energy Use
Analyzing observations of the physical world can be a messy process. But the rise of sensors to measure air quality, ocean temperatures and any number of other changes is allowing us to study our environment and actions like never before. I will discuss two projects that use intelligent sensor systems to map the environment. In one, my colleagues and I combined inexpensive, custom-built Internet-connected sensors with cloud-based data analysis to measure and infer air quality at city scales. In a second project, here at Columbia, my lab is combining building energy-use monitoring with location data to estimate an individual’s energy footprint and provide real-time feedback that helps cut energy use.

Desmond Patton | Preventing Gang Violence through Social Media Analysis
Social media is often an extension of the street for gang-involved youth. They may taunt rival gang members, downplay shootings and brag about fights and drug deals. Sometimes the tough talk turns into real violence. To be able to intervene, social workers need to understand how likely a specific post on Twitter is to lead to violence. Doing so requires deciphering the coded language and culture of gang-involved youth. I have recently collaborated with social science researchers and data scientists to analyze Twitter posts by Chicago gang members. Our goal is to combine observations with natural language processing tools to detect and decode high-risk language. I will discuss our process and early results.

Andrew Smyth

Professor of Civil Engineering and Engineering Mechanics, Columbia Engineering
Andrew Smyth is a professor of civil engineering and engineering mechanics at Columbia Engineering. He specializes in structural health monitoring, using sensor information to determine the condition of critical infrastructure. Smyth has been involved with the sensor instrumentation...

Donald Davis

Ragnar Nurkse Professor of Economics and Department Chair, Graduate School of Arts and Sciences
Donald Davis has been a professor of economics at Columbia University since 1999. In 2001 he was appointed chairman of the Department of Economics at the University. Professor Davis's research interests include international trade, economic development in the open economy...

Xiaofan (Fred) Jiang

Assistant Professor of Electrical Engineering, Columbia Engineering
Xiaofan (Fred) Jiang is an Assistant Professor in the Electrical Engineering Department at Columbia University. Fred received his B.Sc. (2004) and M.Sc. (2007) in Electrical Engineering and Computer Science, and his Ph.D. (2010) in Computer Science, all from UC Berkeley. Before...

Desmond Patton

Assistant Professor of Social Work, School of Social Work
Dr. Desmond Upton Patton is an Assistant Professor at the Columbia School of Social Work and a Faculty Affiliate of the Social Intervention Group (SIG) and the Data Science Institute. His research uses qualitative and computational data collection methods to examine how and...

Wednesday April 6, 2016 12:30pm - 1:05pm EDT
Roone Arledge Auditorium Lerner Hall, Columbia University 2920 Broadway, New York, NY 10040

1:05pm EDT

The Moneyball Approach to Healthier Living

Hod Lipson | Data Smashing: Uncovering Order in Data Stream
From speech recognition to the discovery of new stars, almost all automated tasks involve comparing streams of data for similarities and outliers. Automated discovery methods, however, have not kept pace with the exponential growth in data. One reason is that most algorithms depend on humans to define what features to compare. Here, we propose a new way to match data streams from multiple sources without any prior learning. We show how this principle can be applied to challenging problems, including the interpretation of EEG patterns in epileptic seizures, the detection of abnormal heartbeats in ECG data and the classification of astronomical objects from light measurements. Our data smashing principles produce results as accurate as algorithms developed by domain experts, and could open the door to understanding increasingly complex observations that experts don’t yet know how to interpret.

David Madigan | Observational Studies: Promise and Peril
Randomized experiments are the gold standard in measuring the effects of interventions in medicine, education, social science and other areas. In reality, researchers often rely on observational studies, leading to vast numbers of contradictory findings published in scholarly journals and widely disseminated through the media. Decision makers and the public assume that a rigorous peer-review process guarantees that these results are valid. This is not always so. Well-intentioned analysts make design choices, run analyses and publish their results, overlooking the possibility that different choices might have produced entirely different results. I will provide an overview of the current state of the art in observational studies in healthcare and describe some promising research directions.

Olena Mamykina | Predicting Blood-Glucose Levels to Manage Diabetes
Advances in personal health tracking promise to help individuals gain deep insights into their health and behavior. Yet most health apps still rely on humans to identify trends, make discoveries and take action. In this research, we are building computational models and interactive decision-support tools to help people with type 2 diabetes improve their nutritional choices. Our decision-support tool forecasts how a planned meal will influence blood-glucose levels based on an individual’s physiology and past data. Early results suggest that this automated prediction tool may produce more accurate assessments than individuals or their healthcare providers can.

Adler Perotte | Predicting Kidney Disease Progression with Large-Scale Patient Data
Columbia University coordinates a global network of health databases known as the Observational Health Data Science and Informatics (OHDSI) collaborative. With hundreds of millions of patient records, OHDSI allows researchers to look for large-scale patterns that can reveal new ways to identify and treat disease. In a recent study, my colleagues and I used observational health data to build a model to predict how likely a patient with stage 3 kidney disease, in which the kidney has lost half of its function, is to progress to stage 4, with up to 90 percent loss. Our model, which incorporated patient lab test results and clinical records, outperformed models that did not include this information. Identifying patients at high risk for disease progression allows doctors to customize treatment that can stall or prevent that progression.

Hod Lipson

Professor of Mechanical Engineering, Columbia Engineering
Hod Lipson is a roboticist who works in the areas of artificial intelligence and digital manufacturing. He and his students love designing and building robots that do what you’d least expect robots to do: self-replicate, self-reflect, ask questions, and even be creative...

David Madigan

Executive Vice President for Arts and Sciences and Dean of the Faculty of Arts and Sciences, Professor of Statistics, Columbia University
David Madigan received a bachelor’s degree in Mathematical Sciences and a Ph.D. in Statistics, both from Trinity College Dublin. He has previously worked for AT&T Inc., Soliloquy Inc., the University of Washington, Rutgers University, and SkillSoft, Inc. He has over 100 publications in su...

Olena Mamykina

Assistant Professor of Biomedical Informatics, College of Physicians and Surgeons
Olena Mamykina is an Assistant Professor of Biomedical Informatics in the Department of Biomedical Informatics at Columbia University. Her primary research interests reside in the areas of Biomedical Informatics, Human-Computer Interaction, Ubiquitous and Pervasive Computing, and...

Adler Perotte

Associate Research Scientist in Biomedical Informatics, College of Physicians and Surgeons
Dr. Adler Perotte is an Associate Research Scientist in the Department of Biomedical Informatics. Dr. Perotte’s primary research area is the development and application of statistical machine learning methods, including probabilistic graphical models for biomedical informatics...

Wednesday April 6, 2016 1:05pm - 1:50pm EDT
Roone Arledge Auditorium Lerner Hall, Columbia University 2920 Broadway, New York, NY 10040