Graph-based Data Augmentation for Entity Resolution

Errors in upstream ASR (Automatic Speech Recognition) can degrade downstream tasks such as Entity Resolution. In this project, I designed and developed a graph-based data augmentation method to make Entity Resolution more robust to upstream errors. I evaluated the method on a use case with more than 500K weekly traffic and found that it improves accuracy by 5.19% overall and by 27.86% on harder cases. I also extended this method using graph neural networks.

Cross-query Ranker on ASR N-best for Entity Resolution

In this project, I developed a machine learning ranker that leverages the N-best results from upstream ASR (Automatic Speech Recognition) to make Entity Resolution results robust to ASR errors. I evaluated the ranker on two use cases, with 100K and 1 million examples respectively, and achieved about a 10% gain in accuracy.

Ranker-based Entity Exploration Model for Entity Resolution

In this project, I led, designed, and developed a ranker-based entity exploration model for entity resolution in Alexa. Our goal was to suggest alternative results to those of the production system and collect customer feedback on them without hurting the customer experience. The collected feedback would be useful for future model iterations.

I applied the model to a use case with more than 500K weekly traffic. Through an online A/B test and offline analysis, I demonstrated an improvement of 5.04% over the current production system.

Entity linking on Customer Reviews and Queries

In this project, I used natural language processing and learning-to-rank methods to develop an entity linking system for customer queries and reviews based on Wikipedia data, which would be useful for enriching knowledge for search. I designed evaluation methods on both Wikipedia data and labeled data collected through Mechanical Turk, and achieved about a 20% improvement compared to the baseline.

Quantifying Systemic Gender Inequality in Visual Art

From disparities in the number of exhibiting artists to auction opportunities, there is overwhelming evidence of women’s under-representation in visual art, rooted in a complex and largely unknown interplay between gender, artistic performance and institutional practices. In this project we explore the exhibitions and auction sales of 65,768 gender-identified contemporary artists in 20,389 institutions, confirming systemic gender differences in the artist population, exhibitions and auctions. We distinguish between gender-neutrality, when artists have gender-independent access to exhibition opportunities, and gender-balance, which strives for gender parity in representation. We find that 55% of institutions are gender-neutral but only 30% are gender-balanced, and that the fraction of man-preferred institutions increases with institutional prestige. Finally, we use machine learning to predict an artist’s access to the auction market, finding that co-exhibition gender, which captures the gender inequality of the institutions that an artist embraces, has a higher impact on success than the artist’s gender. These results help unveil and quantify the institutional forces that contribute to the persistent gender imbalance in the art world.
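To make the distinction between gender-neutral and gender-balanced concrete, here is a minimal sketch of one possible way to operationalize it. It is not the method used in the paper: the exact binomial test, the significance level `alpha`, the function names, and the 30% population share of women in the example are all illustrative assumptions.

```python
from math import comb


def binom_two_sided_p(k, n, p):
    """Exact two-sided binomial p-value: total probability of all
    outcomes that are no more likely than observing k successes."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)  # small tolerance for float ties
    return min(1.0, sum(q for q in pmf if q <= threshold))


def classify(women, total, population_share, alpha=0.05):
    """Return (gender_neutral, gender_balanced) for one institution.

    Assumption: "neutral" means the share of women is statistically
    consistent with their share in the artist population, while
    "balanced" means it is consistent with parity (50%).
    """
    neutral = binom_two_sided_p(women, total, population_share) >= alpha
    balanced = binom_two_sided_p(women, total, 0.5) >= alpha
    return neutral, balanced


# Hypothetical numbers: women make up 30% of the artist population.
mirrors_population = classify(women=30, total=100, population_share=0.3)
reaches_parity = classify(women=50, total=100, population_share=0.3)
```

Under these assumed numbers, an institution that mirrors the population counts as neutral but not balanced, while one that reaches parity counts as balanced but not neutral, which is why the two fractions reported above can differ.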


  • Wang, Xindi, Alex J. Gates, and Albert-László Barabási. “Quantifying systemic gender inequality in visual art.” Nature Communications (under review).

Information access equality on generative models of complex networks

It is well known that networks generated by common mechanisms such as preferential attachment and homophily can disadvantage the minority group by limiting their ability to establish links with the majority group. This has the effect of limiting minority nodes’ access to information. We present the results of an empirical study on the equality of information access in network models with different growth mechanisms and spreading processes. For growth mechanisms, we focus on the majority/minority dichotomy, homophily, preferential attachment, and diversity. For spreading processes, we investigate simple versus complex contagions, different transmission rates within and between groups, and various seeding conditions. We observe two phenomena. First, information access equality is a complex interplay between network structures and the spreading processes. Second, there is a trade-off between equality and efficiency of information access under certain circumstances (e.g., when inter-group edges are low and information transmits asymmetrically). Our findings can be used to make recommendations for mechanistic design of social networks with information access equality.
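As a rough illustration of the growth mechanisms mentioned above, the following sketch grows a two-group network in which each new node attaches to existing nodes with probability proportional to a homophily weight times the target's degree. It is a simplified toy model under my own assumptions, not the exact models studied in the paper; the function name, the parameters, and the `degree + 1` smoothing are hypothetical choices.

```python
import random


def grow_network(n, minority_fraction=0.2, homophily=0.8, m=2, seed=0):
    """Grow a network with a majority/minority split, homophily,
    and preferential attachment. Returns (group labels, edge list)."""
    rng = random.Random(seed)
    # 1 = minority, 0 = majority
    group = [1 if rng.random() < minority_fraction else 0 for _ in range(n)]
    degree = [0] * n
    edges = []
    for new in range(1, n):
        candidates = list(range(new))
        targets = set()
        while len(targets) < min(m, new):
            # Same-group targets get weight `homophily`, cross-group
            # targets get `1 - homophily`; both scale with degree + 1.
            weights = [
                0.0 if j in targets
                else (homophily if group[j] == group[new] else 1.0 - homophily)
                * (degree[j] + 1)
                for j in candidates
            ]
            if sum(weights) == 0:
                break  # no admissible target left (e.g. homophily = 1)
            targets.add(rng.choices(candidates, weights=weights, k=1)[0])
        for j in targets:
            edges.append((new, j))
            degree[new] += 1
            degree[j] += 1
    return group, edges


group, edges = grow_network(50)
inter_group = sum(1 for a, b in edges if group[a] != group[b])
```

Raising `homophily` toward 1 should reduce `inter_group`, which is the kind of structural change whose effect on information access the study examines.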


  • Wang, Xindi, Onur Varol, and Tina Eliassi-Rad. “Information access equality on generative models of complex networks.” Applied Network Science 7, no. 1 (2022): 1-20. link to paper

Success of Books and Authors

I worked on the project “Success of Books and Authors” in collaboration with Burcu Yucesoy, Onur Varol, Prof Tina Eliassi-Rad and Prof Albert-László Barabási. We are interested in why some books and authors become successful.

Our first paper in this project appeared in EPJ Data Science. We analyzed the New York Times Bestseller data and found many interesting patterns in it. We also have an interactive visualization website, and it’s fun to play with!

Our second paper in this project analyzes the factors that contribute to a book’s success. We built a machine learning model to predict book sales from various features. This task is challenging because book sales follow a heavy-tailed distribution, and traditional machine learning models are prone to underpredicting books with high sales. To tackle this, we built an algorithm called Learning to Place. We also analyzed which features contribute most to a book’s success.


  • Yucesoy, Burcu, Xindi Wang, Junming Huang, and Albert-László Barabási. “Success in books: a big data approach to bestsellers.” EPJ Data Science 7 (2018): 1-25. link to paper

  • Wang, Xindi, Burcu Yucesoy, Onur Varol, Tina Eliassi-Rad, and Albert-László Barabási. “Success in books: predicting book sales before publication.” EPJ Data Science 8, no. 1 (2019): 1-20. link to paper

  • Wang, Xindi, Onur Varol, and Tina Eliassi-Rad. “L2P: an algorithm for estimating heavy-tailed outcomes.” arXiv preprint arXiv:1908.04628 (2019).

  • Also check out the visualizations of NYT bestsellers

Understanding Music using Networks

This started as a course project in which I collaborated with Syed Haque to understand music purely from sheet music. We built one-step note transition matrices for music pieces and used these matrices to cluster music; we found that the matrices themselves are very distinct for Bach’s fugues and that the clusters we found align with musical eras. Check our paper!
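For readers unfamiliar with the idea, here is a minimal sketch of a one-step note transition matrix. It assumes a piece is given as a plain sequence of note names, which is a simplification (real sheet music also carries duration, chords, and voices), and the function name is hypothetical.

```python
def transition_matrix(notes):
    """Row-normalized one-step transition matrix over the distinct notes.

    Entry [i][j] is the fraction of times note i is immediately
    followed by note j in the sequence.
    """
    states = sorted(set(notes))
    index = {s: i for i, s in enumerate(states)}
    counts = [[0] * len(states) for _ in states]
    for current, following in zip(notes, notes[1:]):
        counts[index[current]][index[following]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Notes that never have a successor keep an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return states, matrix


melody = ["C", "D", "E", "C", "D", "C"]  # toy example, not a real piece
states, P = transition_matrix(melody)
```

Each piece then becomes a matrix that can be flattened into a feature vector and passed to any standard clustering method.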

During the Santa Fe Summer School, I pitched this music-related project and collaborated with Josefine Brask, Ricky Laishram and Carlos Marcelo to understand music by building higher-order networks. We came up with several metrics that quantify different characteristics of music from those higher-order networks, such as “branchingness” and “repetitiveness”, and tried to connect them with different music genres. Here are the related slides!