
Statistical analysis of large scale data with perturbation subsampling

  • Additional information
    • Subject:
      2022
    • Collection:
      Columbia University: Academic Commons
    • Abstract:
      The past two decades have witnessed rapid growth in the amount of data available to us. Many fields, including physics, biology, and medical studies, generate enormous datasets with a large sample size, a high number of dimensions, or both. For example, some datasets in physics contain millions of records. A Statista survey forecasts that in 2022 there will be over 86 million users of health apps in the United States, generating massive mHealth data. In addition, more and more large studies have been carried out, such as the UK Biobank study. This gives us unprecedented access to data and allows us to extract and infer vital information. Meanwhile, it also poses new challenges for statistical methodologies and computational algorithms. For increasingly large datasets, computation can be a major hurdle for valid analysis. Conventional statistical methods lack the scalability to handle such large sample sizes. In addition, data storage and processing may exceed the capacity of a typical computer. The UK Biobank genotype and phenotype dataset contains about 500,000 individuals with more than 800,000 genotyped single nucleotide polymorphism (SNP) measurements per person, a size that may well exceed a computer's physical memory. Further, high dimensionality combined with a large sample size can lead to heavy computational cost and algorithmic instability. The aim of this dissertation is to provide statistical approaches that address these issues. Chapter 1 reviews the existing literature. In Chapter 2, a novel perturbation subsampling approach based on independent and identically distributed stochastic weights is developed for the analysis of large scale data. The method is justified for estimators obtained by optimizing convex criterion functions, through established asymptotic consistency and normality, and it provides consistent point and variance estimators simultaneously.
The method is also feasible for a distributed framework. The finite sample performance ...
    • Relation:
      https://doi.org/10.7916/5dgg-jr52
    • Identifier:
      10.7916/5dgg-jr52
    • Identifier:
      edsbas.5B5EA67B
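
The general perturbation-subsampling idea described in the abstract (draw a subsample, attach i.i.d. mean-one stochastic weights, refit a weighted convex objective, and use the replicate estimates for both point and variance estimation) can be illustrated with a toy least-squares sketch. This is hypothetical illustrative code, not the dissertation's actual algorithm; the function name, exponential weights, and the crude m/n variance rescaling are all assumptions made for the example.

```python
import numpy as np

def perturbation_subsample_ols(X, y, n_reps=200, subsample_frac=0.1, rng=None):
    """Toy sketch of perturbation subsampling for least squares.

    Each replicate draws a random subsample, attaches i.i.d.
    exponential(1) weights (positive, mean one), and solves the
    weighted least-squares problem. The mean of the replicate
    estimates serves as the point estimate; their empirical
    covariance, crudely rescaled by m/n, as a variance estimate.
    (Hypothetical helper; the proper scaling comes from the
    method's asymptotic theory, not reproduced here.)
    """
    rng = np.random.default_rng(rng)
    n, p = X.shape
    m = max(p + 1, int(subsample_frac * n))       # subsample size
    betas = np.empty((n_reps, p))
    for r in range(n_reps):
        idx = rng.choice(n, size=m, replace=False)  # random subsample
        w = rng.exponential(1.0, size=m)            # i.i.d. mean-1 weights
        sw = np.sqrt(w)                             # sqrt weights for lstsq
        betas[r] = np.linalg.lstsq(X[idx] * sw[:, None],
                                   sw * y[idx], rcond=None)[0]
    beta_hat = betas.mean(axis=0)                   # point estimate
    var_hat = np.cov(betas, rowvar=False) * m / n   # crude variance rescaling
    return beta_hat, var_hat
```

Because each replicate touches only m of the n observations, the replicates can be computed independently on separate machines, which is consistent with the abstract's remark that the method suits a distributed framework.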