ORCID Profile
0000-0001-7239-5134
Current Organisations
Nagoya University
,
Osaka University
Does something not look right? The information on this page has been harvested from data sources that may not be up to date. We continue to work with information providers to improve coverage and quality. To report an issue, use the Feedback Form.
Publisher: IEEE
Date: 06-2014
Publisher: Information Processing Society of Japan
Date: 2019
Publisher: Information Processing Society of Japan
Date: 2016
Publisher: Association for Computing Machinery (ACM)
Date: 04-2013
Abstract: Query autocompletion is an important feature saving users many keystrokes from typing the entire query. In this paper we study the problem of query autocompletion that tolerates errors in users' input using edit distance constraints. Previous approaches index data strings in a trie, and continuously maintain all the prefixes of data strings whose edit distance from the query are within the threshold. The major inherent problem is that the number of such prefixes is huge for the first few characters of the query and is exponential in the alphabet size. This results in slow query response even if the entire query approximately matches only few prefixes. In this paper, we propose a novel neighborhood generation-based algorithm, IncNGTrie, which can achieve up to two orders of magnitude speedup over existing methods for the error-tolerant query autocompletion problem. Our proposed algorithm only maintains a small set of active nodes, thus saving both space and time to process the query. We also study efficient duplicate removal which is a core problem in fetching query answers. In addition, we propose optimization techniques to reduce our index size, as well as discussions on several extensions to our method. The efficiency of our method is demonstrated against existing methods through extensive experiments on real datasets.
Publisher: Springer Berlin Heidelberg
Date: 2012
Publisher: Springer International Publishing
Date: 2018
Publisher: Association for Computing Machinery (ACM)
Date: 08-2013
Abstract: Given a query string Q , an edit similarity search finds all strings in a database whose edit distance with Q is no more than a given threshold τ. Most existing methods answering edit similarity queries employ schemes to generate string subsequences as signatures and generate candidates by set overlap queries on query and data signatures. In this article, we show that for any such signature scheme, the lower bound of the minimum number of signatures is τ + 1, which is lower than what is achieved by existing methods. We then propose several asymmetric signature schemes, that is, extracting different numbers of signatures for the data and query strings, which achieve this lower bound. A basic asymmetric scheme is first established on the basis of matching q -chunks and q -grams between two strings. Two efficient query processing algorithms (IndexGram and IndexChunk) are developed on top of this scheme. We also propose novel candidate pruning methods to further improve the efficiency. We then generalize the basic scheme by incorporating novel ideas of floating q -chunks, optimal selection of q -chunks, and reducing the number of signatures using global ordering. As a result, the Super and Turbo families of schemes are developed together with their corresponding query processing algorithms. We have conducted a comprehensive experimental study using the six asymmetric algorithms and nine previous state-of-the-art algorithms. The experiment results clearly showcase the efficiency of our methods and demonstrate space and time characteristics of our proposed algorithms.
Publisher: Association for Computing Machinery (ACM)
Date: 31-03-2018
DOI: 10.1145/3200200
Abstract: In this article, we propose a novel indexing and querying method for trajectories constrained in a road network. We aim to provide efficient algorithms for various types of spatiotemporal queries that involve routing in road networks, such as (1) finding moving objects that have traveled along a given path during a given time interval, (2) extracting all paths traveled after a given spatiotemporal context, and (3) enumerating all paths between two locations traveled during a certain time interval. Unlike the existing methods in spatial database research, we employ indexing techniques and algorithms from string processing. This idea is based on the fact that we can represent spatial paths as strings, because trajectories in a network are represented as sequences of road segment IDs. The proposed SNT-index ( u s /u uffix-array-based u n /u etwork-constrained u t /u rajectory index) introduces two novel concepts to trajectory indexing. The first is FM-index, which is a compact in-memory data structure for pattern matching. The second is an inverse suffix array, which allows the FM-index to be integrated with the temporal information stored in a forest of B + -trees. Thanks to these concepts, we can reduce the number of B + -tree accesses required by the query processing algorithms to a constant number, something that cannot be achieved with existing methods. Although an FM-index is essentially a static index, we also propose a practical method of appending new data to the index. Finally, experiments show that our method can process the target queries for more than 1 million trajectories in a few tens of milliseconds, which is significantly faster than what the baseline algorithms can achieve without string algorithms.
Publisher: Springer Berlin Heidelberg
Date: 2012
Publisher: Association for Computing Machinery (ACM)
Date: 08-2008
Abstract: There has been considerable interest in similarity join in the research community recently. Similarity join is a fundamental operation in many application areas, such as data integration and cleaning, bioinformatics, and pattern recognition. We focus on efficient algorithms for similarity join with edit distance constraints. Existing approaches are mainly based on converting the edit distance constraint to a weaker constraint on the number of matching q -grams between pair of strings. In this paper, we propose the novel perspective of investigating mismatching q -grams. Technically, we derive two new edit distance lower bounds by analyzing the locations and contents of mismatching q -grams. A new algorithm, Ed-Join, is proposed that exploits the new mismatch-based filtering methods it achieves substantial reduction of the candidate sizes and hence saves computation time. We demonstrate experimentally that the new algorithm outperforms alternative methods on large-scale real datasets under a wide range of parameter settings.
Publisher: Springer Berlin Heidelberg
Date: 2006
DOI: 10.1007/11775300_6
Publisher: ACM
Date: 12-06-2011
Publisher: Springer Berlin Heidelberg
Date: 2006
DOI: 10.1007/11912873_50
Publisher: Springer Science and Business Media LLC
Date: 24-10-2018
Publisher: ACM
Date: 29-06-2009
Publisher: ACM
Date: 29-10-2023
Publisher: Springer Science and Business Media LLC
Date: 31-05-2022
Publisher: ACM
Date: 26-06-2016
Publisher: Springer Science and Business Media LLC
Date: 14-12-2020
Publisher: No publisher found
Date: 2019
Publisher: Springer Berlin Heidelberg
Date: 2006
DOI: 10.1007/11912873_36
Publisher: Oxford University Press (OUP)
Date: 06-01-2016
Publisher: IEEE
Date: 04-2018
Publisher: Springer Berlin Heidelberg
Date: 2006
DOI: 10.1007/11733836_33
Publisher: ACM
Date: 21-04-2008
Publisher: Springer Berlin Heidelberg
Date: 2013
Publisher: Springer International Publishing
Date: 2014
Publisher: Hindawi Limited
Date: 2014
DOI: 10.1155/2014/749028
Abstract: Along with the emergence of massive graph-modeled data, it is of great importance to investigate graph similarity joins due to their wide applications for multiple purposes, including data cleaning, and near duplicate detection. This paper considers graph similarity joins with edit distance constraints, which return pairs of graphs such that their edit distances are no larger than a given threshold. Leveraging the MapReduce programming model, we propose MGSJoin , a scalable algorithm following the filtering-verification framework for efficient graph similarity joins. It relies on counting overlapping graph signatures for filtering out nonpromising candidates. With the potential issue of too many key-value pairs in the filtering phase, spectral Bloom filters are introduced to reduce the number of key-value pairs. Furthermore, we integrate the multiway join strategy to boost the verification, where a MapReduce-based method is proposed for GED calculation. The superior efficiency and scalability of the proposed algorithms are demonstrated by extensive experimental results.
Publisher: ACM
Date: 25-06-2019
Publisher: Association for Computing Machinery (ACM)
Date: 08-2011
Abstract: With the increasing amount of data and the need to integrate data from multiple data sources, one of the challenging issues is to identify near-duplicate records efficiently. In this article, we focus on efficient algorithms to find a pair of records such that their similarities are no less than a given threshold. Several existing algorithms rely on the prefix filtering principle to avoid computing similarity values for all possible pairs of records. We propose new filtering techniques by exploiting the token ordering information they are integrated into the existing methods and drastically reduce the candidate sizes and hence improve the efficiency. We have also studied the implementation of our proposed algorithm in stand-alone and RDBMS-based settings. Experimental results show our proposed algorithms can outperform previous algorithms on several real datasets.
Publisher: Springer Science and Business Media LLC
Date: 03-2222
Publisher: Association for Computing Machinery (ACM)
Date: 18-03-2016
DOI: 10.1145/2877201
Abstract: Query autocompletion has become a standard feature in many search applications, especially for search engines. A recent trend is to support the error-tolerant autocompletion , which increases the usability significantly by matching prefixes of database strings and allowing a small number of errors. In this article, we systematically study the query processing problem for error-tolerant autocompletion with a given edit distance threshold. We propose a general framework that encompasses existing methods and characterizes different classes of algorithms and the minimum amount of information they need to maintain under different constraints. We then propose a novel evaluation strategy that achieves the minimum active node size by eliminating ancestor-descendant relationships among active nodes entirely. In addition, we characterize the essence of edit distance computation by a novel data structure named edit vector automaton (EVA). It enables us to compute new active nodes and their associated states efficiently by table lookups. In order to support large distance thresholds, we devise a partitioning scheme to reduce the size and construction cost of the automaton, which results in the universal partitioned EVA (UPEVA) to handle arbitrarily large thresholds. Our extensive evaluation demonstrates that our proposed method outperforms existing approaches in both space and time efficiencies.
Publisher: ACM
Date: 18-03-2013
Publisher: IEEE
Date: 04-2012
DOI: 10.1109/ICDE.2012.91
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Date: 08-2013
DOI: 10.1109/TKDE.2012.79
Publisher: Association for Computing Machinery (ACM)
Date: 11-2013
Abstract: Graphs are widely used to model complex data in many applications, such as bioinformatics, chemistry, social networks, pattern recognition, etc. A fundamental and critical query primitive is to efficiently search similar structures in a large collection of graphs. This paper studies the graph similarity queries with edit distance constraints. Existing solutions to the problem utilize fixed-size overlapping substructures to generate candidates, and thus become susceptible to large vertex degrees or large distance thresholds. In this paper, we present a partition-based approach to tackle the problem. By iding data graphs into variable-size non-overlapping partitions, the edit distance constraint is converted to a graph containment constraint for candidate generation. We develop efficient query processing algorithms based on the new paradigm. A candidate pruning technique and an improved graph edit distance algorithm are also developed to further boost the performance. In addition, a cost-aware graph partitioning technique is devised to optimize the index. Extensive experiments demonstrate our approach significantly outperforms existing approaches.
Publisher: Springer Berlin Heidelberg
Date: 2006
DOI: 10.1007/11912873_25
Publisher: ACM
Date: 31-05-2020
Publisher: ACM
Date: 19-05-2017
Publisher: IEEE
Date: 04-2018
Publisher: Springer Berlin Heidelberg
Date: 2011
Publisher: Springer Berlin Heidelberg
Date: 2012
Publisher: Springer Berlin Heidelberg
Date: 2006
DOI: 10.1007/11775300_40
No related grants have been discovered for Chuan Xiao.