ARDC Research Link Australia

Publication

Practising Scalable Graph Similarity Joins in MapReduce

Publisher: IEEE

Date: 06-2014

DOI: 10.1109/BIGDATA.CONGRESS.2014.25

Publication

Scope-aware Code Completion with Discriminative Modeling

Publisher: Information Processing Society of Japan

Date: 2019

DOI: 10.2197/IPSJJIP.27.469

Publication

Top-k Similarity Search over Gaussian Distributions Based on KL-Divergence

Publisher: Information Processing Society of Japan

Date: 2016

DOI: 10.2197/IPSJJIP.24.152

Publication

Efficient Error-tolerant Query Autocompletion

Publisher: Association for Computing Machinery (ACM)

Date: 04-2013

DOI: 10.14778/2536336.2536339

Abstract: Query autocompletion is an important feature that saves users many keystrokes by sparing them from typing the entire query. In this paper we study the problem of query autocompletion that tolerates errors in users' input using edit distance constraints. Previous approaches index data strings in a trie, and continuously maintain all the prefixes of data strings whose edit distance from the query is within the threshold. The major inherent problem is that the number of such prefixes is huge for the first few characters of the query and is exponential in the alphabet size. This results in slow query response even if the entire query approximately matches only a few prefixes. In this paper, we propose a novel neighborhood generation-based algorithm, IncNGTrie, which can achieve up to two orders of magnitude speedup over existing methods for the error-tolerant query autocompletion problem. Our proposed algorithm only maintains a small set of active nodes, thus saving both space and time to process the query. We also study efficient duplicate removal, which is a core problem in fetching query answers. In addition, we propose optimization techniques to reduce our index size and discuss several extensions to our method. The efficiency of our method is demonstrated against existing methods through extensive experiments on real datasets.
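
As a rough illustration of the matching condition this abstract describes, the sketch below checks by brute force whether some prefix of a data string lies within an edit distance threshold of the query. It is not IncNGTrie and uses no trie; the function names `prefix_edit_distance` and `autocomplete` are hypothetical, and the linear scan over all strings is assumed only for clarity.

```python
def prefix_edit_distance(query: str, data: str) -> int:
    """Smallest edit distance between `query` and any prefix of `data`."""
    # prev[j] = edit distance between the query prefix seen so far and data[:j]
    prev = list(range(len(data) + 1))
    for i, qc in enumerate(query, start=1):
        curr = [i]
        for j, dc in enumerate(data, start=1):
            cost = 0 if qc == dc else 1
            curr.append(min(prev[j] + 1,          # delete qc from the query
                            curr[j - 1] + 1,      # insert dc into the query
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    # The last row holds ed(query, data[:j]) for every j; take the best prefix.
    return min(prev)


def autocomplete(query: str, strings: list, tau: int) -> list:
    """Return the strings that have a prefix within edit distance `tau` of `query`."""
    return [s for s in strings if prefix_edit_distance(query, s) <= tau]
```

For example, `autocomplete("databse", ["database systems", "datalog", "graph"], 1)` keeps only `"database systems"`, because its prefix `"database"` is one insertion away from the query.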

Publication

Dynamic Set kNN Self-Join

Publisher: IEEE

Date: 04-2019

DOI: 10.1109/ICDE.2019.00078

Publication

Efficient Subgraph Similarity All-Matching

Publisher: Springer Berlin Heidelberg

Date: 2012

DOI: 10.1007/978-3-642-29038-1_33

Publication

Histogram Construction for Difference Analysis of Spatio-Temporal Data on Array DBMS

Publisher: Springer International Publishing

Date: 2018

DOI: 10.1007/978-3-319-92013-9_4

Publication

Asymmetric signature schemes for efficient exact edit similarity query processing

Publisher: Association for Computing Machinery (ACM)

Date: 08-2013

DOI: 10.1145/2508020.2508023

Abstract: Given a query string Q, an edit similarity search finds all strings in a database whose edit distance with Q is no more than a given threshold τ. Most existing methods answering edit similarity queries employ schemes to generate string subsequences as signatures and generate candidates by set overlap queries on query and data signatures. In this article, we show that for any such signature scheme, the lower bound of the minimum number of signatures is τ + 1, which is lower than what is achieved by existing methods. We then propose several asymmetric signature schemes, that is, extracting different numbers of signatures for the data and query strings, which achieve this lower bound. A basic asymmetric scheme is first established on the basis of matching q-chunks and q-grams between two strings. Two efficient query processing algorithms (IndexGram and IndexChunk) are developed on top of this scheme. We also propose novel candidate pruning methods to further improve the efficiency. We then generalize the basic scheme by incorporating novel ideas of floating q-chunks, optimal selection of q-chunks, and reducing the number of signatures using global ordering. As a result, the Super and Turbo families of schemes are developed together with their corresponding query processing algorithms. We have conducted a comprehensive experimental study using the six asymmetric algorithms and nine previous state-of-the-art algorithms. The experiment results clearly showcase the efficiency of our methods and demonstrate space and time characteristics of our proposed algorithms.

Publication

Enhanced Indexing and Querying of Trajectories in Road Networks via String Algorithms

Publisher: Association for Computing Machinery (ACM)

Date: 31-03-2018

DOI: 10.1145/3200200

Abstract: In this article, we propose a novel indexing and querying method for trajectories constrained in a road network. We aim to provide efficient algorithms for various types of spatiotemporal queries that involve routing in road networks, such as (1) finding moving objects that have traveled along a given path during a given time interval, (2) extracting all paths traveled after a given spatiotemporal context, and (3) enumerating all paths between two locations traveled during a certain time interval. Unlike the existing methods in spatial database research, we employ indexing techniques and algorithms from string processing. This idea is based on the fact that we can represent spatial paths as strings, because trajectories in a network are represented as sequences of road segment IDs. The proposed SNT-index (suffix-array-based network-constrained trajectory index) introduces two novel concepts to trajectory indexing. The first is FM-index, which is a compact in-memory data structure for pattern matching. The second is an inverse suffix array, which allows the FM-index to be integrated with the temporal information stored in a forest of B+-trees. Thanks to these concepts, we can reduce the number of B+-tree accesses required by the query processing algorithms to a constant number, something that cannot be achieved with existing methods. Although an FM-index is essentially a static index, we also propose a practical method of appending new data to the index. Finally, experiments show that our method can process the target queries for more than 1 million trajectories in a few tens of milliseconds, which is significantly faster than what the baseline algorithms can achieve without string algorithms.

Publication

Combination Skyline Queries

Publisher: Springer Berlin Heidelberg

Date: 2012

DOI: 10.1007/978-3-642-34179-3_1

Publication

Ed-Join: an efficient algorithm for similarity joins with edit distance constraints

Publisher: Association for Computing Machinery (ACM)

Date: 08-2008

DOI: 10.14778/1453856.1453957

Abstract: There has been considerable interest in similarity join in the research community recently. Similarity join is a fundamental operation in many application areas, such as data integration and cleaning, bioinformatics, and pattern recognition. We focus on efficient algorithms for similarity join with edit distance constraints. Existing approaches are mainly based on converting the edit distance constraint to a weaker constraint on the number of matching q-grams between a pair of strings. In this paper, we propose the novel perspective of investigating mismatching q-grams. Technically, we derive two new edit distance lower bounds by analyzing the locations and contents of mismatching q-grams. A new algorithm, Ed-Join, is proposed that exploits the new mismatch-based filtering methods; it achieves substantial reduction of the candidate sizes and hence saves computation time. We demonstrate experimentally that the new algorithm outperforms alternative methods on large-scale real datasets under a wide range of parameter settings.
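
The "weaker constraint on the number of matching q-grams" mentioned above can be sketched as the classic count filter: a single edit operation affects at most q q-grams, so two strings within edit distance τ must share at least max(|s|, |t|) − q + 1 − q·τ q-grams. The code below is a minimal filter-and-verify sketch of that existing baseline, not of Ed-Join's mismatch-based filters; all function names are hypothetical, and the nested loop over pairs stands in for a real index.

```python
from collections import Counter


def qgrams(s: str, q: int) -> Counter:
    """Multiset of all length-q substrings of s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def passes_count_filter(s: str, t: str, q: int, tau: int) -> bool:
    """Necessary (not sufficient) condition for ed(s, t) <= tau."""
    # One edit operation destroys at most q q-grams, so two strings within
    # edit distance tau share at least this many q-grams.
    required = max(len(s), len(t)) - q + 1 - q * tau
    shared = sum((qgrams(s, q) & qgrams(t, q)).values())
    return shared >= required


def edit_distance(s: str, t: str) -> int:
    """Standard dynamic-programming edit distance, used for verification."""
    prev = list(range(len(t) + 1))
    for i, sc in enumerate(s, start=1):
        curr = [i]
        for j, tc in enumerate(t, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (sc != tc)))
        prev = curr
    return prev[-1]


def similarity_join(strings, q: int, tau: int):
    """Filter candidate pairs by q-gram count, then verify with edit distance."""
    result = []
    for a in range(len(strings)):
        for b in range(a + 1, len(strings)):
            s, t = strings[a], strings[b]
            if passes_count_filter(s, t, q, tau) and edit_distance(s, t) <= tau:
                result.append((s, t))
    return result
```

The filter only prunes: every pair it rejects is guaranteed to exceed the threshold, while every pair it accepts still has to be verified with the exact edit distance.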

Publication

Efficient Evaluation of Multiple Queries on Streamed XML Fragments

Publisher: Springer Berlin Heidelberg

Date: 2006

DOI: 10.1007/11775300_6

Publication

Efficient exact edit similarity query processing with the asymmetric signature scheme

Publisher: ACM

Date: 12-06-2011

DOI: 10.1145/1989323.1989431

Publication

Region-Based Coding for Queries over Streamed XML Fragments

Publisher: Springer Berlin Heidelberg

Date: 2006

DOI: 10.1007/11912873_50

Publication

Efficient structure similarity searches: a partition-based approach

Publisher: Springer Science and Business Media LLC

Date: 24-10-2018

DOI: 10.1007/S00778-017-0487-0

Publication

Efficient approximate entity extraction with edit distance constraints

Publisher: ACM

Date: 29-06-2009

DOI: 10.1145/1559845.1559925

Publication

A Method of Image Dehazing Based on Atmospheric Veil Prediction by ResNet

Publisher: ACM

Date: 29-10-2023

DOI: 10.1145/3607540.3617136

Publication

Efficient processing of graph similarity queries with edit distance constraints

Publisher: Springer Science and Business Media LLC

Date: 31-05-2022

DOI: 10.1007/S00778-013-0306-1

Publication

Local Similarity Search for Unstructured Text

Publisher: ACM

Date: 26-06-2016

DOI: 10.1145/2882903.2915211

Publication

Efficient Query Autocompletion with Edit Distance-based Error Tolerance

Publisher: Springer Science and Business Media LLC

Date: 14-12-2020

DOI: 10.1007/S00778-019-00595-4

Publication

Indexing Trajectories for Travel-Time Histogram Retrieval

Publisher: No publisher found

Date: 2019

DOI: 10.5441/002/EDBT.2019.15

Publication

Document Fragmentation for XML Streams Based on Query Statistics

Publisher: Springer Berlin Heidelberg

Date: 2006

DOI: 10.1007/11912873_36

Publication

Frequent Subgraph Mining Based on Pregel

Publisher: Oxford University Press (OUP)

Date: 06-01-2016

DOI: 10.1093/COMJNL/BXV118

Publication

GPH: Similarity Search in Hamming Space

Publisher: IEEE

Date: 04-2018

DOI: 10.1109/ICDE.2018.00013

Publication

Efficient Query Processing for Streamed XML Fragments

Publisher: Springer Berlin Heidelberg

Date: 2006

DOI: 10.1007/11733836_33

Publication

Efficient similarity joins for near duplicate detection

Publisher: ACM

Date: 21-04-2008

DOI: 10.1145/1367497.1367516

Publication

Processing Probabilistic Range Queries over Gaussian-Based Uncertain Data

Publisher: Springer Berlin Heidelberg

Date: 2013

DOI: 10.1007/978-3-642-40235-7_24

Publication

Improving Performance of Graph Similarity Joins Using Selected Substructures

Publisher: Springer International Publishing

Date: 2014

DOI: 10.1007/978-3-319-05810-8_11

Publication

Efficient and Scalable Graph Similarity Joins in MapReduce

Publisher: Hindawi Limited

Date: 2014

DOI: 10.1155/2014/749028

Abstract: Along with the emergence of massive graph-modeled data, it is of great importance to investigate graph similarity joins due to their wide applications for multiple purposes, including data cleaning and near-duplicate detection. This paper considers graph similarity joins with edit distance constraints, which return pairs of graphs such that their edit distances are no larger than a given threshold. Leveraging the MapReduce programming model, we propose MGSJoin, a scalable algorithm following the filtering-verification framework for efficient graph similarity joins. It relies on counting overlapping graph signatures for filtering out nonpromising candidates. To address the potential issue of too many key-value pairs in the filtering phase, spectral Bloom filters are introduced to reduce the number of key-value pairs. Furthermore, we integrate the multiway join strategy to boost the verification, where a MapReduce-based method is proposed for graph edit distance (GED) calculation. The superior efficiency and scalability of the proposed algorithms are demonstrated by extensive experimental results.

Publication

Autocompletion for Prefix-Abbreviated Input

Publisher: ACM

Date: 25-06-2019

DOI: 10.1145/3299869.3319858

Publication

Efficient similarity joins for near-duplicate detection

Publisher: Association for Computing Machinery (ACM)

Date: 08-2011

DOI: 10.1145/2000824.2000825

Abstract: With the increasing amount of data and the need to integrate data from multiple data sources, one of the challenging issues is to identify near-duplicate records efficiently. In this article, we focus on efficient algorithms to find pairs of records such that their similarities are no less than a given threshold. Several existing algorithms rely on the prefix filtering principle to avoid computing similarity values for all possible pairs of records. We propose new filtering techniques that exploit the token ordering information; they are integrated into the existing methods and drastically reduce the candidate sizes, and hence improve the efficiency. We have also studied the implementation of our proposed algorithm in stand-alone and RDBMS-based settings. Experimental results show our proposed algorithms can outperform previous algorithms on several real datasets.
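
The prefix filtering principle mentioned above can be sketched for Jaccard similarity as follows: sort each record's tokens by a global order and keep only the first |x| − ⌈t·|x|⌉ + 1 tokens; two records whose similarity reaches the threshold t must share at least one token in these prefixes. This is a minimal sketch of that existing principle only, not of the new filtering techniques the article proposes. The function names are hypothetical, Jaccard is assumed as the similarity function, and the pairwise loop stands in for an inverted index.

```python
import math
from collections import Counter
from itertools import combinations


def jaccard(x: set, y: set) -> float:
    """Jaccard similarity of two non-empty token sets."""
    return len(x & y) / len(x | y)


def prefix_filter_join(records, threshold: float):
    """Find record pairs with Jaccard similarity >= threshold via prefix filtering."""
    # Global order: rarest tokens first, ties broken alphabetically.
    freq = Counter(tok for rec in records for tok in set(rec))
    ordered = [sorted(set(rec), key=lambda tok: (freq[tok], tok)) for rec in records]

    # Prefix length |x| - ceil(t * |x|) + 1; the small epsilon guards against
    # floating-point products that land just above an integer.
    prefixes = [
        set(toks[:len(toks) - math.ceil(threshold * len(toks) - 1e-9) + 1])
        for toks in ordered
    ]

    result = []
    for i, j in combinations(range(len(records)), 2):
        if prefixes[i] & prefixes[j]:  # candidate survives the filter
            if jaccard(set(records[i]), set(records[j])) >= threshold:
                result.append((i, j))
    return result
```

Ordering tokens from rarest to most frequent keeps the prefixes selective, so fewer pairs reach the verification step.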

Publication

Load Shedding for Window Joins over Streams

Publisher: Springer Science and Business Media LLC

Date: 03-2007

DOI: 10.1007/S11390-007-9024-8

Publication

BEVA

Publisher: Association for Computing Machinery (ACM)

Date: 18-03-2016

DOI: 10.1145/2877201

Abstract: Query autocompletion has become a standard feature in many search applications, especially for search engines. A recent trend is to support error-tolerant autocompletion, which increases the usability significantly by matching prefixes of database strings and allowing a small number of errors. In this article, we systematically study the query processing problem for error-tolerant autocompletion with a given edit distance threshold. We propose a general framework that encompasses existing methods and characterizes different classes of algorithms and the minimum amount of information they need to maintain under different constraints. We then propose a novel evaluation strategy that achieves the minimum active node size by eliminating ancestor-descendant relationships among active nodes entirely. In addition, we characterize the essence of edit distance computation by a novel data structure named edit vector automaton (EVA). It enables us to compute new active nodes and their associated states efficiently by table lookups. In order to support large distance thresholds, we devise a partitioning scheme to reduce the size and construction cost of the automaton, which results in the universal partitioned EVA (UPEVA) to handle arbitrarily large thresholds. Our extensive evaluation demonstrates that our proposed method outperforms existing approaches in both space and time efficiencies.

Publication

Trie-based similarity search and join

Publisher: ACM

Date: 18-03-2013

DOI: 10.1145/2457317.2457389

Publication

Top-k Set Similarity Joins

Publisher: IEEE

Date: 03-2009

DOI: 10.1109/ICDE.2009.111

Publication

Efficient Graph Similarity Joins with Edit Distance Constraints

Publisher: IEEE

Date: 04-2012

DOI: 10.1109/ICDE.2012.91

Publication

VChunkJoin: An Efficient Algorithm for Edit Similarity Joins

Publisher: Institute of Electrical and Electronics Engineers (IEEE)

Date: 08-2013

DOI: 10.1109/TKDE.2012.79

Publication

A Partition-Based Approach to Structure Similarity Search

Publisher: Association for Computing Machinery (ACM)

Date: 11-2013

DOI: 10.14778/2732232.2732236

Abstract: Graphs are widely used to model complex data in many applications, such as bioinformatics, chemistry, social networks, and pattern recognition. A fundamental and critical query primitive is to efficiently search similar structures in a large collection of graphs. This paper studies graph similarity queries with edit distance constraints. Existing solutions to the problem utilize fixed-size overlapping substructures to generate candidates, and thus become susceptible to large vertex degrees or large distance thresholds. In this paper, we present a partition-based approach to tackle the problem. By dividing data graphs into variable-size non-overlapping partitions, the edit distance constraint is converted to a graph containment constraint for candidate generation. We develop efficient query processing algorithms based on the new paradigm. A candidate pruning technique and an improved graph edit distance algorithm are also developed to further boost the performance. In addition, a cost-aware graph partitioning technique is devised to optimize the index. Extensive experiments demonstrate our approach significantly outperforms existing approaches.

Publication

Buffer-Preposed QoS Adaptation Framework and Load Shedding Techniques over Streams

Publisher: Springer Berlin Heidelberg

Date: 2006

DOI: 10.1007/11912873_25

Publication

Monotonic Cardinality Estimation of Similarity Selection: A Deep Learning Approach

Publisher: ACM

Date: 31-05-2020

DOI: 10.1145/3318464.3380570

Publication

A Framework for Presentation Slide Design Support

Publisher: ACM

Date: 19-05-2017

DOI: 10.1145/3093241.3093261

Publication

CiNCT: Compression and Retrieval for Massive Vehicular Trajectories via Relative Movement Labeling

Publisher: IEEE

Date: 04-2018

DOI: 10.1109/ICDE.2018.00102

Publication

Finding the Sites with Best Accessibilities to Amenities

Publisher: Springer Berlin Heidelberg

Date: 2011

DOI: 10.1007/978-3-642-20152-3_5

Publication

A Space-Efficient Indexing Algorithm for Boolean Query Processing

Publisher: Springer Berlin Heidelberg

Date: 2012

DOI: 10.1007/978-3-642-35063-4_47

Publication

Load Shedding for Window Joins over Streams

Publisher: Springer Berlin Heidelberg

Date: 2006

DOI: 10.1007/11775300_40

Chuan Xiao

Researcher

Related Links

Related Organisations

Nagoya University

Osaka University

Northeastern University

University Of New South Wales
