ORCID Profile
0000-0003-1206-3431
Current Organisations
Royal Melbourne Institute of Technology, Soochow University, University of Leeds
Publisher: ACM
Date: 20-07-2008
Publisher: Edinburgh University Library
Date: 31-10-2016
Abstract: This article introduces the provenance activities that are being carried out at the Australian National Data Service (ANDS). Since its beginning, ANDS has been promoting four data transformations so that Australia’s research data become more valuable and reusable by researchers. Among many other activities that enable the four transformations, ANDS has been encouraging ANDS partners to capture and describe rich context at the time when a data collection is created. In 2015, ANDS funded a number of external projects that had provenance components. In addition, ANDS is working on the interoperability between the schema that is used by the ANDS research data registration and discovery service – Research Data Australia (RDA) – and the W3C recommended provenance standard, Provenance Ontology (PROV-O), and investigating how to enrich the schema to access provenance information. The article concludes by discussing the lessons we learnt and our future planned activity.
Publisher: ACM
Date: 23-11-2009
Publisher: Cold Spring Harbor Laboratory
Date: 24-06-2014
Abstract: BLUP (best linear unbiased prediction) is widely used to predict complex traits in plant and animal breeding, and increasingly in human genetics. The BLUP mathematical model, which consists of a single random effect term, was adequate when kinships were measured from pedigrees. However, when genome-wide SNPs are used to measure kinships, the BLUP model implicitly assumes that all SNPs have the same effect-size distribution, which is a severe and unnecessary limitation. We propose MultiBLUP, which extends the BLUP model to include multiple random effects, allowing greatly improved prediction when the random effects correspond to classes of SNPs with distinct effect-size variances. The SNP classes can be specified in advance, for example, based on SNP functional annotations, and we also provide an adaptive procedure for determining a suitable partition of SNPs. We apply MultiBLUP to genome-wide association data from the Wellcome Trust Case Control Consortium (seven diseases), and from much larger studies of celiac disease and inflammatory bowel disease, finding that it consistently provides better prediction than alternative methods. Moreover, MultiBLUP is computationally very efficient: for the largest data set, which includes 12,678 individuals and 1.5 M SNPs, the total analysis can be run on a single desktop PC in less than a day and can be parallelized to run even faster. Tools to perform MultiBLUP are freely available in our software LDAK.
Publisher: Research Data Alliance
Date: 2020
DOI: 10.15497/RDA00041
Publisher: Proceedings of the National Academy of Sciences
Date: 21-04-2014
Abstract: Our knowledge of the domestication of animal and plant species comes from a diverse range of disciplines, and interpretation of patterns in data from these disciplines has been the dominant paradigm in domestication research. However, such interpretations are easily steered by subjective biases that typically fail to account for the inherent randomness of evolutionary processes, and which can be blind to emergent patterns in data. The testing of explicit models using computer simulations, and the availability of powerful statistical techniques to fit models to observed data, provide a scientifically robust means of addressing these problems. Here we outline the principles and argue for the merits of such approaches in the context of domestication-related questions.
Publisher: PeerJ
Date: 19-09-2016
DOI: 10.7717/PEERJ-CS.86
Abstract: Software is a critical part of modern research and yet there is little support across the scholarly ecosystem for its acknowledgement and citation. Inspired by the activities of the FORCE11 working group focused on data citation, this document summarizes the recommendations of the FORCE11 Software Citation Working Group and its activities between June 2015 and April 2016. Based on a review of existing community practices, the goal of the working group was to produce a consolidated set of citation principles that may encourage broad adoption of a consistent policy for software citation across disciplines and venues. Our work is presented here as a set of software citation principles, a discussion of the motivations for developing the principles, reviews of existing community practice, and a discussion of the requirements these principles would place upon different stakeholders. Working examples and possible technical solutions for how these principles can be implemented will be discussed in a separate paper.
Publisher: Springer Berlin Heidelberg
Date: 2004
Publisher: Oxford University Press (OUP)
Date: 25-07-2014
DOI: 10.1093/BRAIN/AWU206
Publisher: Zenodo
Date: 2022
Publisher: Wiley
Date: 22-08-2014
DOI: 10.1111/CONL.12124
Publisher: Elsevier BV
Date: 11-2014
Publisher: Association for Computing Machinery (ACM)
Date: 05-2011
Abstract: Searchers on the Web often aim to find key resources about a topic. Finding such results is called topic distillation. Previous research has shown that the use of sources of evidence such as page indegree and URL structure can significantly improve search performance on interconnected collections such as the Web, beyond the use of simple term distribution statistics. This article presents a new approach to improve topic distillation by exploring the use of external sources of evidence: link structure, including query-dependent indegree and outdegree, and web page characteristics, such as the density of anchor links. Our experiments with the TREC .GOV collection, an 18GB crawl of the US .gov domain from 2002, show that using such evidence can significantly improve search effectiveness, with combinations of evidence leading to significant performance gains over both full-text and anchor-text baselines. Moreover, we demonstrate that, at a different scope level, both local query-dependent outdegree and query-dependent indegree out-performed their global query-independent counterparts and at the same scope level, outdegree out-performed indegree. Adding query-dependent indegree or page characteristics to query-dependent outdegree could yield a small, but not significant, improvement.
Publisher: Wiley
Date: 28-03-2012
DOI: 10.1002/ASI.22639
Publisher: ACM
Date: 09-2001
Publisher: Springer Science and Business Media LLC
Date: 02-12-2014
DOI: 10.1038/NCOMMS6631
Abstract: In 2012, a skeleton was excavated at the presumed site of the Grey Friars friary in Leicester, the last-known resting place of King Richard III. Archaeological, osteological and radiocarbon dating data were consistent with these being his remains. Here we report DNA analyses of both the skeletal remains and living relatives of Richard III. We find a perfect mitochondrial DNA match between the sequence obtained from the remains and one living relative, and a single-base substitution when compared with a second relative. Y-chromosome haplotypes from male-line relatives and the remains do not match, which could be attributed to a false-paternity event occurring in any of the intervening generations. DNA-predicted hair and eye colour are consistent with Richard’s appearance in an early portrait. We calculate likelihood ratios for the non-genetic and genetic data separately, and combined, and conclude that the evidence for the remains being those of Richard III is overwhelming.
Publisher: No publisher found
Date: 2018
Publisher: Springer Berlin Heidelberg
Date: 2008
Publisher: ACM
Date: 19-07-2009
Publisher: BMJ
Date: 10-03-2014
Publisher: Wiley
Date: 10-2020
DOI: 10.1002/PRA2.291
Abstract: This panel will address the issues associated with the practice and service of open research data curation and discovery from a global perspective. The sub‐fields of information science such as information retrieval, information curation, information practices and human‐centered data science have approached the open research data initiatives from multiple lenses. The issues of data creation, capturing, curation, sharing, discovery and reuse cut across the sub‐fields. We will identify and discuss the emerging themes in open data curation and discovery drawing on active research projects, repository practices and research data capturing and reuse in a selection of disciplines from the health domain to archaeology and cultural heritage.
Publisher: Elsevier BV
Date: 05-2001
Publisher: Oxford University Press (OUP)
Date: 09-2014
DOI: 10.1534/GENETICS.114.165704
Abstract: Models for genome-wide prediction and association studies usually target a single phenotypic trait. However, in animal and plant genetics it is common to record information on multiple phenotypes for each individual that will be genotyped. Modeling traits individually disregards the fact that they are most likely associated due to pleiotropy and shared biological basis, thus providing only a partial, confounded view of genetic effects and phenotypic interactions. In this article we use data from a Multiparent Advanced Generation Inter-Cross (MAGIC) winter wheat population to explore Bayesian networks as a convenient and interpretable framework for the simultaneous modeling of multiple quantitative traits. We show that they are equivalent to multivariate genetic best linear unbiased prediction (GBLUP) and that they are competitive with single-trait elastic net and single-trait GBLUP in predictive performance. Finally, we discuss their relationship with other additive-effects models and their advantages in inference and interpretation. MAGIC populations provide an ideal setting for this kind of investigation because the very low population structure and large sample size result in predictive models with good power and limited confounding due to relatedness.
Publisher: Elsevier BV
Date: 2013
DOI: 10.1016/J.FSIGEN.2012.06.001
Abstract: We consider the comparison of hypotheses "parent-child" or "full siblings" against the alternative of "unrelated" for pairs of individuals for whom DNA profiles are available. This is a situation that occurs repeatedly in familial database searching. A decision rule that uses both the kinship index (KI), also known as the likelihood ratio, and the identity-by-state statistic (IBS) was advocated in a recent report as superior to the use of KI alone. Such a proposal appears to conflict with the Neyman-Pearson Lemma of statistics, which states that the likelihood ratio alone provides the most powerful criterion for distinguishing between any two simple hypotheses. We therefore performed a simulation study that was two orders of magnitude larger than in the previous report, and our results corroborate the theoretical expectation that KI alone provides a better decision rule than KI combined with IBS.
Publisher: ACM
Date: 25-07-2004
Publisher: Zenodo
Date: 2019
Publisher: MIT Press
Date: 2023
DOI: 10.1162/DINT_A_00186
Abstract: The increased number of data repositories has greatly increased the availability of open data. To enable broad discovery of and access to research datasets, some data repositories have begun leveraging the web architecture by embedding structured metadata markup in dataset web landing pages using vocabularies from Schema.org and extensions. This paper aims to examine metadata interoperability for supporting global data discovery. Specifically, the paper reports a survey on which metadata schema has been adopted by participating data repositories, and presents an analysis of crosswalks from fourteen research data schemas to Schema.org. The analysis indicates most descriptive metadata are interoperable among the schemas, the most inconsistent mapping is the rights metadata, and a large gap exists in the structural metadata and controlled vocabularies to specify various property values. The analysis and collated crosswalks can serve as a reference for data repositories when they develop crosswalks from their own schemas to Schema.org, and provide the research data community a benchmark of structured metadata implementation.
Publisher: International Information and Engineering Technology Association
Date: 08-2003
Publisher: Research Data Alliance
Date: 2021
DOI: 10.15497/RDA00069
Publisher: Public Library of Science (PLoS)
Date: 12-04-2018
Publisher: Research Data Alliance
Date: 2021
DOI: 10.15497/RDA00066
Publisher: Edinburgh University Library
Date: 04-07-2017
Abstract: The Australian National Data Service (ANDS) has been funded by the Australian Government since 2009, with a goal to increase the value of data to researchers, research institutions and the nation. To achieve this goal, ANDS has funded more than 200 projects under seven programs. This paper provides an overview of one of these programs, the Applications Program, which focused on funding software infrastructure to enable data reuse to demonstrate the value of making data available to researchers. The paper also presents some representative projects, a summary of what the program has achieved, and lessons learned.
Publisher: IEEE Comput. Soc
Date: 1999
Publisher: Wiley
Date: 11-10-2014
DOI: 10.1002/ASI.22951
Publisher: ACM
Date: 15-06-2009
Publisher: Elsevier BV
Date: 09-2014
Publisher: Proceedings of the National Academy of Sciences
Date: 07-2013
Abstract: Enhancements in sensitivity now allow DNA profiles to be obtained from only tens of picograms of DNA, corresponding to a few cells, even for samples subject to degradation from environmental exposure. However, low-template DNA (LTDNA) profiles are subject to stochastic effects, such as “dropout” and “dropin” of alleles, and highly variable stutter peak heights. Although the sensitivity of the newly developed methods is highly appealing to crime investigators, courts are concerned about the reliability of the underlying science. High-profile cases relying on LTDNA evidence have collapsed amid controversy, including the case of Hoey in the United Kingdom and the case of Knox and Sollecito in Italy. I argue that rather than the reliability of the science, courts and commentators should focus on the validity of the statistical methods of evaluation of the evidence. Even noisy DNA evidence can be more powerful than many traditional types of evidence, and it can be helpful to a court as long as its strength is not overstated. There have been serious shortcomings in statistical methods for the evaluation of LTDNA profile evidence, however. Here, I propose a method that allows for multiple replicates with different rates of dropout, sporadic dropins, different amounts of DNA from different contributors, relatedness of suspected and alternate contributors, “uncertain” allele designations, and degradation. R code implementing the method is open source, facilitating wide scrutiny. I illustrate its good performance using real cases and simulated crime scene profiles.
Publisher: ACM
Date: 05-12-2013
Publisher: Public Library of Science (PLoS)
Date: 08-10-2013
Publisher: Annual Reviews
Date: 03-01-2014
DOI: 10.1146/ANNUREV-STATISTICS-022513-115602
Abstract: The evaluation of weight of evidence for forensic DNA profiles has been a subject of controversy since their introduction over 20 years ago. Substantial progress has been made for standard DNA profiles, but new issues have arisen in recent years with the advent of more sensitive profiling techniques, allowing profiles to be recovered from minuscule amounts of possibly degraded DNA. These low-template DNA profiles suffer from enhanced stochastic effects, including dropin, dropout, and stutter, which pose problems for DNA profile evaluation. These problems are now beginning to be overcome with the emergence of several statistical models and software. We first review the general principles of statistical evaluation of DNA profile evidence, and we then focus on low-template DNA profiles, briefly reviewing the main statistical models and software. We cover methods that use allele presence/absence and those that use electropherogram peak heights, focusing on the likelihood ratio as a measure of evidential weight.
Publisher: ACM
Date: 20-07-2008
Publisher: Springer Berlin Heidelberg
Date: 1998
Publisher: ACM
Date: 07-2000
Publisher: Wiley
Date: 10-2021
DOI: 10.1002/PRA2.510
Abstract: The proposed panel will address the issues of the discovery and reuse of publicly available data on the web in the context of data service practices from a global perspective. Thousands of data discovery services have appeared around the world since the promotion of “open science”, reproducible research, and the FAIR (Findable, Accessible, Interoperable and Reusable) data principles in the research sector. However, there is also increasing demand for transparency of search algorithms, and in the design, development, evaluation, and deployment of current data search services this requires a better understanding of how users approach data discovery and interact with data in search settings. From a global perspective, we will identify and discuss the specific system design issues in data discovery and reuse, drawing on our organization of the NTCIR (NII Testbeds and Community for Information access Research) project of Data Search track, the design and evaluation of the data discovery service of the Australian Research Data Commons (ARDC), and studies examining researchers' practices of data discovery and reuse.
Publisher: Nomos Verlag
Date: 2021
DOI: 10.5771/0943-7444-2021-3-219
Abstract: In this paper, we present a case study of how well subject metadata (comprising headings from an international classification scheme) has been deployed in a national data catalogue, and how often data seekers use subject metadata when searching for data. Through an analysis of user search behaviour as recorded in search logs, we find evidence that users utilise the subject metadata for data discovery. Since approximately half of the records ingested by the catalogue did not include subject metadata at the time of harvest, we experimented with automatic subject classification approaches in order to enrich these records and to provide additional support for user search and data discovery. Our results show that automatic methods work well for well represented categories of subject metadata, and these categories tend to have features that can distinguish themselves from the other categories. Our findings raise implications for data catalogue providers: they should invest more effort to enhance the quality of data records by providing an adequate description of these records for under-represented subject categories.
Publisher: MIT Press
Date: 2023
DOI: 10.1162/DINT_A_00162
Abstract: Automated metadata annotation is only as good as the training dataset or rules that are available for the domain. It's important to learn what type of data content a pre-trained machine learning algorithm has been trained on to understand its limitations and potential biases. Consider what type of content is readily available to train an algorithm—what's popular and what's available. However, scholarly and historical content is often not available in consumable, homogenized, and interoperable formats at the large volume that is required for machine learning. There are exceptions such as science and medicine, where large, well documented collections are available. This paper presents the current state of automated metadata annotation in cultural heritage and research data, discusses challenges identified from use cases, and proposes solutions.
Publisher: Ubiquity Press, Ltd.
Date: 2021
DOI: 10.5334/DSJ-2021-012
Publisher: Wiley
Date: 17-09-2014
DOI: 10.1111/AHG.12081
Abstract: We estimate the population genetics parameter FST (also referred to as the fixation index) from short tandem repeat (STR) allele frequencies, comparing many worldwide human subpopulations at approximately the national level with continental-scale populations. FST is commonly used to measure population differentiation, and is important in forensic DNA analysis to account for remote shared ancestry between a suspect and an alternative source of the DNA. We estimate FST comparing subpopulations with a hypothetical ancestral population, which is the approach most widely used in population genetics, and also compare a subpopulation with a sampled reference population, which is more appropriate for forensic applications. Both estimation methods are likelihood-based, in which FST is related to the variance of the multinomial-Dirichlet distribution for allele counts. Overall, we find low FST values, with posterior 97.5 percentiles < 3% when comparing a subpopulation with the most appropriate population, and even for inter-population comparisons we find FST < 5%. These are much smaller than single nucleotide polymorphism-based inter-continental FST estimates, and are also about half the magnitude of STR-based estimates from population genetics surveys that focus on distinct ethnic groups rather than a general population. Our findings support the use of FST up to 3% in forensic calculations, which corresponds to some current practice.
Publisher: Elsevier BV
Date: 12-2014
Publisher: Ubiquity Press, Ltd.
Date: 2021
DOI: 10.5334/DSJ-2021-019
Publisher: Ubiquity Press, Ltd.
Date: 2019
DOI: 10.5334/DSJ-2019-003
Publisher: Emerald
Date: 27-04-2023
Abstract: With an explosion of datasets available on the Web, dataset search has gained attention as an emerging research domain. Understanding users' dataset behaviour is imperative for providing effective data discovery services. In this paper, the authors present a study on users' dataset search behaviour through the analysis of search logs from a research data discovery portal. Using query and session based features, the authors apply cluster analysis to discover distinct user profiles with different search behaviours. One particular behavioural construct of our interest is users' expertise that the authors generate via computing semantic similarity between users' search queries and the title of metadata records in the displayed search results. The findings revealed that there are six distinct classes of user behaviours for dataset search, namely Expert Research, Expert Search, Expert Explore, Novice Research, Novice Search and Novice Explore. The user profiles are derived based on analysis of the search log of the research data catalogue in this study. Further research is needed to generalise the user profiles to other dataset search settings. Future research can take on a confirmatory approach to verify these user groups and establish a deeper understanding of their information needs. The findings in this paper have implications for designing search systems that tailor search results matching the diverse information needs of different user groups. We propose for the first time a taxonomy of users for dataset search based on their domain expertise and search behaviour.
Location: United Kingdom of Great Britain and Northern Ireland
Location: Australia
No related grants have been discovered for Mingfang Wu.