Granular knowledge based search engine

Publication Date:
May 7, 2009

Additional Information
- Document Number:
  20090119281
- Appl. No:
  11/998222
- Application Filed:
  November 29, 2007
- Abstract:
  The application borrows terminology from data mining, association rule learning and topology. A geometric structure represents a collection of concepts in a document set. The geometric structure has a high-frequency keyword set that co-occurs closely which represents a concept in a document set. Document analysis seeks to automate the understanding of knowledge representing the author's idea. Granular computing theory deals with rough sets and fuzzy sets. One of the key insights of rough set research is that selection of different sets of features or variables will yield different concept granulations. Here, as in elementary rough set theory, by “concept” we mean a set of entities that are indistinguishable or indiscernible to the observer (i.e., a simple concept), or a set of entities that is composed from such simple concepts (i.e., a complex concept).
- Inventors:
  Wang, Andrew Chien-Chung (Chino, CA, US); Lin, Tsau Young (Chino, CA, US); Chiang, I-Jen (Chino, CA, US)
- Claim:
  1. A system of indexing documents comprising the steps of: a. preprocessing documents to extract words; b. then extracting keywords by calculating a TFIDF for each word, wherein the step of calculating a TFIDF further comprises the substeps of: i. calculating a term frequency; ii. calculating a document frequency; iii. calculating a total number of documents in which a term appears at least once; c. then comparing the TFIDF for each word with a TFIDF predefined threshold; d. then finding keyword association by generating a plurality of keyword sets, wherein the step of generating a plurality of keyword sets further comprises the sub steps of: i. filtering keyword sets that do not meet a predefined within distance threshold; and ii. filtering keyword sets that do not meet a predefined support threshold, wherein the support threshold is compared to a support level which is proportional to the percentage of documents that contain the keyword set; e. then providing a clustering of keyword sets and building a document index having a clustering of keyword sets; f. then providing a search result in the form of a document cluster.
- Claim:
  2. The system of claim 1, wherein the TFIDF for any particular term in a document equals the term frequency multiplied by the log of the total number of documents divided by the document frequency, wherein the term frequency is the number of appearances of a term in a document divided by the total number of words in the document.
- Claim:
  3. The system of claim 1, wherein the TFIDF for any particular term in a document equals the term frequency multiplied by the log of the total number of documents divided by the document frequency, wherein term frequency is equal to one plus the log of the frequency of a token in a document.
- Claim:
  4. The system of claim 1, further comprising the step of defining the predefined within distance having a value between 8 and 12.
- Claim:
  5. The system of claim 1, further comprising the step of defining TFIDF predefined threshold having a range of 0.01 to 0.001.
- Claim:
  6. A system of indexing documents comprising the steps of: a. preprocessing documents to extract words; b. then extracting keywords by calculating a TFIDF for each word, c. then comparing the TFIDF for each word with a TFIDF predefined threshold; d. then finding keyword association by generating a plurality of keyword sets, e. then providing a clustering of keyword sets and building a document index having a clustering of keyword sets; f. then allowing user selection of a query presented in the clustering of keyword sets; g. then receiving a user selection of a query presented in the clustering of keyword sets; h. then providing a search result in the form of a document cluster.
- Claim:
  7. The system of indexing documents according to claim 6, wherein the step of calculating a TFIDF further comprises the substeps of: calculating a term frequency; calculating a document frequency; and calculating a total number of documents in which a term appears at least once.
- Claim:
  8. The system of claim 7, wherein the TFIDF for any particular term in a document equals the term frequency multiplied by the log of the total number of documents divided by the document frequency, wherein the term frequency is the number of appearances of a term in a document divided by the total number of words in the document.
- Claim:
  9. The system of claim 7, wherein the TFIDF for any particular term in a document equals the term frequency multiplied by the log of the total number of documents divided by the document frequency, wherein term frequency is equal to one plus the log of the frequency of a token in a document.
- Claim:
  10. The system of claim 7, further comprising the step of defining the predefined within distance having a value between 8 and 12.
- Claim:
  11. The system of claim 1, further comprising the step of defining TFIDF predefined threshold having a range of 0.01 to 0.001.
- Claim:
  12. The system of indexing documents according to claim 6, wherein the step of generating a plurality of keyword sets further comprises the sub steps of: filtering keyword sets that do not meet a predefined within distance threshold; and filtering keyword sets that do not meet a predefined support threshold, wherein the support threshold is compared to a support level which is proportional to the percentage of documents that contain the keyword set.
- Claim:
  13. The system of claim 12, wherein the TFIDF for any particular term in a document equals the term frequency multiplied by the log of the total number of documents divided by the document frequency, wherein the term frequency is the number of appearances of a term in a document divided by the total number of words in the document.
- Claim:
  14. The system of claim 12, wherein the TFIDF for any particular term in a document equals the term frequency multiplied by the log of the total number of documents divided by the document frequency, wherein term frequency is equal to one plus the log of the frequency of a token in a document.
- Claim:
  15. The system of claim 12, further comprising the step of defining the predefined within distance having a value between 8 and 12.
- Claim:
  16. The system of claim 12, further comprising the step of defining TFIDF predefined threshold having a range of 0.01 to 0.001.
- Claim:
  17. The system of indexing documents according to claim 6, wherein the step of generating a plurality of keyword sets further comprises the sub steps of: filtering keyword sets that do not meet a predefined within distance threshold; and filtering keyword sets that do not meet a predefined support threshold, wherein the support threshold is compared to a support level which is proportional to the percentage of documents that contain the keyword set, wherein the step of calculating a TFIDF further comprises the substeps of: calculating a term frequency; calculating a document frequency; and calculating a total number of documents in which a term appears at least once.
- Claim:
  18. The system of claim 17, wherein the TFIDF for any particular term in a document equals the term frequency multiplied by the log of the total number of documents divided by the document frequency, wherein the term frequency is the number of appearances of a term in a document divided by the total number of words in the document.
- Claim:
  19. The system of claim 18, wherein the TFIDF for any particular term in a document equals the term frequency multiplied by the log of the total number of documents divided by the document frequency, wherein term frequency is equal to one plus the log of the frequency of a token in a document.
- Claim:
  20. The system of claim 18, further comprising the step of defining the predefined within distance having a value between 8 and 12.
- Current U.S. Class:
  707/5
- Current International Class:
  06; 06
- Accession Number:
  edspap.20090119281
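
The keyword-extraction and keyword-association steps recited in the claims (steps b through d of claim 1, with the two term-frequency variants of claims 2 and 3) can be illustrated with a minimal Python sketch. It is not the applicants' implementation. The function names, the natural logarithm, the reading of "frequency of a token" as a raw count, the restriction to keyword pairs, the token-position interpretation of "within distance", and the 0.3 support threshold are all assumptions made for illustration. The distance of 10 and the TFIDF threshold of 0.01 fall inside the ranges given in claims 4 and 5. The clustering and search steps (e and f) are not covered.

```python
import math
from collections import Counter
from itertools import combinations


def tf_relative(count, doc_len):
    """Claim 2 variant: appearances of the term divided by total words in the document."""
    return count / doc_len


def tf_log(count, doc_len):
    """Claim 3 variant: one plus the log of the token's count (natural log is an assumption)."""
    return 1.0 + math.log(count)


def tfidf_scores(docs, tf=tf_relative):
    """Return one {term: tfidf} dict per document; each document is a list of tokens."""
    n_docs = len(docs)
    # Document frequency: number of documents in which a term appears at least once.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append(
            {t: tf(c, len(doc)) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return scores


def keyword_pairs(docs, keywords, max_distance=10, min_support=0.3):
    """Keep keyword pairs that co-occur within max_distance token positions
    in at least a min_support fraction of the documents (hypothetical helper)."""
    containing = Counter()
    for doc in docs:
        positions = {}
        for i, tok in enumerate(doc):
            if tok in keywords:
                positions.setdefault(tok, []).append(i)
        for a, b in combinations(sorted(positions), 2):
            if any(abs(i - j) <= max_distance for i in positions[a] for j in positions[b]):
                containing[(a, b)] += 1
    support = {pair: n / len(docs) for pair, n in containing.items()}
    return {pair: s for pair, s in support.items() if s >= min_support}


if __name__ == "__main__":
    # Toy corpus, invented for illustration only.
    docs = [
        ["rough", "set", "theory"],
        ["fuzzy", "set", "theory"],
        ["rough", "set", "model"],
    ]
    scores = tfidf_scores(docs)
    # Step c: keep terms whose TFIDF exceeds the predefined threshold.
    keywords = {t for s in scores for t, v in s.items() if v > 0.01}
    print(sorted(keywords))
    print(keyword_pairs(docs, keywords))
```

In this toy corpus the term "set" appears in every document, so its inverse document frequency is log(3/3) = 0 and it is dropped at the threshold step, which is the behavior the TFIDF comparison in step c is meant to produce.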
