vices ecosystem. PyMatcher is intended for “power users” who possess knowledge of entity matching, programming, and basic machine learning, while CloudMatcher is targeted at “lay users” who may not know how to program or have machine learning knowledge. PyMatcher provides how-to guides that describe how to approach the development of entity matching workflows. These guides describe how to develop a solution on a small sample of data (by downsampling, blocking, and training a matcher) and how to scale the solution to work with production data. The entity matching workflow for CloudMatcher is similar to that of PyMatcher, except that CloudMatcher actively learns from the user how to block tuples. It then executes the learned blocking rules to obtain a set of candidate tuple pairs, and again actively learns from the user which candidate pairs match and which do not, before deriving a model that can be applied to match tuples across two tables. In short, Magellan makes it easy to develop an entity matching solution and to interoperate with other tools to form a larger data integration pipeline that solves larger problems. It is a showcase for practical software development tools that originate from data management research. It has been successfully applied to multiple real-world entity matching problems, is used in production at many data science groups and companies, and is being commercialized, demonstrating that using data science ideas to build entity matching systems is highly promising. For more details, see Magellan’s website at https://sites.google.com/site/anhaidgroup/projects/magellan.
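The development workflow described above (downsample, block, train a matcher, then apply it to the full tables) can be illustrated with a minimal sketch. It uses only the Python standard library and does not use PyMatcher's actual API; the function names, the toy tables, the choice of `city` as the blocking attribute, and the single-threshold "matcher" on token Jaccard similarity are all assumptions made for illustration.

```python
import random


def jaccard(a, b):
    """Token-level Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def down_sample(table, size, seed=0):
    """Draw a small random sample of rows to develop the workflow on."""
    rng = random.Random(seed)
    return rng.sample(table, min(size, len(table)))


def block(table_a, table_b, key):
    """Blocking: keep only pairs that agree on the blocking attribute."""
    index = {}
    for row in table_b:
        index.setdefault(row[key], []).append(row)
    return [(a, b) for a in table_a for b in index.get(a[key], [])]


def train_threshold_matcher(labeled_pairs):
    """Pick the name-similarity threshold that best fits the labels.

    A stand-in for a learned matcher: labeled_pairs holds
    (row_a, row_b, is_match) triples supplied by the user.
    """
    scored = [(jaccard(a["name"], b["name"]), y) for a, b, y in labeled_pairs]
    best_t, best_correct = 0.0, -1
    for t in sorted({s for s, _ in scored}):
        correct = sum((s >= t) == y for s, y in scored)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t


def match(pairs, threshold):
    """Apply the trained matcher to candidate pairs."""
    return [(a["id"], b["id"]) for a, b in pairs
            if jaccard(a["name"], b["name"]) >= threshold]


# Hypothetical toy tables.
table_a = [
    {"id": "a1", "name": "Dave Smith", "city": "Madison"},
    {"id": "a2", "name": "Joe Wilson", "city": "Madison"},
    {"id": "a3", "name": "Ann Lee", "city": "Chicago"},
]
table_b = [
    {"id": "b1", "name": "David Smith", "city": "Madison"},
    {"id": "b2", "name": "Joe Wilson", "city": "Madison"},
    {"id": "b3", "name": "Ann Lee", "city": "Chicago"},
    {"id": "b4", "name": "Dan Brown", "city": "Chicago"},
]

# Labels a user might provide for a few candidate pairs.
labeled = [
    (table_a[0], table_b[0], True),
    (table_a[0], table_b[1], False),
    (table_a[1], table_b[1], True),
    (table_a[2], table_b[3], False),
]

candidates = block(table_a, table_b, "city")   # prune obvious non-matches
threshold = train_threshold_matcher(labeled)   # "train" on the labeled sample
matches = match(candidates, threshold)         # apply to all candidates
```

On real data the labeled pairs would be drawn from candidates produced on the down-sampled tables (for example `down_sample(table_a, 1000)`), and the threshold rule would be replaced by a proper machine learning classifier over several similarity features.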
               