2024-06-11
17/948,172
2022-09-19
US 12,007,939 B1
2024-06-11
Yicun Wu
Kaplan IP Law, PLLC | Jonathan T. Kaplan
2042-09-19
Techniques are presented for producing demographics, in an automated fashion, from a search result of computer-accessible content. While the demographics can be determined for a research product that has been produced by any technique, they are particularly useful when applied to an automated frame-based search approach. Frame-based search engines are presented for technology profiling, healthcare-related search and brand research. Determination of a demographic proceeds at two levels: member and population. At the member level, a demographic characteristic can be determined applicable with either total or partial certainty. Each value assigned by a demographic, to a population member, has a confidence level associated with it and the assignments can be represented by a Confidence Distribution. Summarization of a demographic, at the population level, depends upon whether the certainty assignments, at the member level, are total or partial. Declarant Demographics are presented. Approaches, to determining Declarant Demographics, are presented.
G06F16/00 » CPC main
Information retrieval; Database structures therefor; File system structures therefor
G06F16/24578 » CPC further
Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data; Querying; Query processing with adaptation to user needs using ranking
G06F16/248 » CPC further
Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data; Querying Presentation of query results
G06F16/951 » CPC further
Information retrieval; Database structures therefor; File system structures therefor; Details of database functions independent of the retrieved data types; Retrieval from the web Indexing; Web crawling techniques
G06F16/2457 IPC
Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data; Querying; Query processing with adaptation to user needs
As provided for under 35 U.S.C. § 120, this application claims benefit of the filing date of the following U.S. patent application, herein incorporated by reference in its entirety:
“Method and Apparatus For Determining Search Result Demographics,” filed 2022 Feb. 16 (y/m/d), having inventors Michael Jacob Osofsky, Jens Erik Tellefsen, Wei Li, and Ranjeet Singh Bhatia and App. No. 17672707.
As provided for under 35 U.S.C. § 120, App. No. 17672707 claimed benefit of the filing date of the following U.S. patent application, herein incorporated by reference in its entirety:
“Method and Apparatus For Determining Search Result Demographics,” filed 2021 Jul. 5 (y/m/d), having inventors Michael Jacob Osofsky, Jens Erik Tellefsen, Wei Li, and Ranjeet Singh Bhatia and App. No. 17367612.
As provided for under 35 U.S.C. § 120, App. No. 17367612 claimed benefit of the filing date of the following U.S. patent application, herein incorporated by reference in its entirety:
“Method and Apparatus For Determining Search Result Demographics,” filed 2015 May 5 (y/m/d), having inventors Michael Jacob Osofsky, Jens Erik Tellefsen, Wei Li, and Ranjeet Singh Bhatia and App. No. 14704919 (now U.S. Pat. No. 11,055,295).
As provided for under 35 U.S.C. § 120, App. No. 14704919 claimed benefit of the filing date of the following U.S. patent application, herein incorporated by reference in its entirety:
“Method and Apparatus For Determining Search Result Demographics,” filed 2010 Apr. 22 (y/m/d), having inventors Michael Jacob Osofsky, Jens Erik Tellefsen, Wei Li, and Ranjeet Singh Bhatia and App. No. 12765848 (now U.S. Pat. No. 9,026,529).
The following U.S. patent applications are herein incorporated by reference in their entirety:
“Method and Apparatus For Frame-Based Search,” filed 2008 Jul. 21 (y/m/d), having inventors Wei Li, Michael Jacob Osofsky and Lokesh Pooranmal Bajaj and App. No. 12177122 (“the '122 Application”);
“Method and Apparatus For Frame-Based Analysis of Search Results,” filed 2008 Jul. 21 (y/m/d), having inventors Wei Li, Michael Jacob Osofsky and Lokesh Pooranmal Bajaj and App. No. 12177127 (“the '127 Application”); and
“Method and Apparatus For Automated Generation of Entity Profiles Using Frames,” filed 2009 Jul. 20 (y/m/d), having inventors Wei Li, Michael Jacob Osofsky and Lokesh Pooranmal Bajaj and App. No. 61227068 (“the '068 Application”).
This application is related to the following U.S. patent application(s), which are herein incorporated by reference in their entirety:
the '122 Application;
the '127 Application; and
the '068 Application.
The present invention relates to automated analysis, of computer-accessible content, to produce demographic data regarding such content. More particularly, the present invention relates to the production of demographic data from the information upon which a search result is based.
The product of a research project, whether performed by manual and/or automated means, can often be expressed as a “result” (or results), where each such result is supported by items drawn from the various content-sources searched. Herein, each result can be referred to as a “result-value” and the items supporting such result-value can be referred to as its “result-base.” The pair, of a result-value and its result-base, can be referred to as a “result-pair.”
Having obtained a result-value, there are many situations in which it is useful to know various demographics about its result-base. An example situation, where such demographics are often useful, is where the research product is a profile. If a collection of sought-for values (i.e., a collection of result-values) has been identified, where each relates back to a common entity (as used herein, an “entity” can refer to virtually anything, regardless of whether the item referred-to is completely abstract or more concrete), the collection can be referred-to as a “profile.”
The utility of a “profile,” for describing entities of various types, is well known: if there is a need to quickly obtain an understanding of a particular entity, the review of a profile, if available, can be an extremely effective tool for doing so.
Some example profiles are as follows:
In general, the faster demographic data can be made available, regarding the result-bases forming the basis of a research project's product, the faster a productive use, of such research product, can be accomplished. Since automated (or largely automated) processes are, in general, faster than those that are manual (or largely manual), there is a need for tools that can automatically generate demographic data regarding such result-bases.
The accompanying drawings, that are incorporated in and constitute a part of this specification, illustrate several embodiments of the invention and, together with the description, serve to explain the principles of the invention:
FIG. 1A presents an example computer screen on which is shown a generic search result.
FIG. 1B shows an example expansion of a result-value of FIG. 1A.
FIG. 1C is intended to show the same search results of FIG. 1A, except that a Demographic View is selected.
FIG. 1D shows an example data organization, upon which the results of FIGS. 1A-1C can be based.
FIG. 2A shows a screen where part of a profile, for the technology “rechargeable cell,” is shown.
FIG. 2B is a record view that presents the same profile as FIG. 2A, except that an innovator has been expanded.
FIG. 2C shows information regarding the same profile of FIGS. 2A-2B, except that a Demographic View is selected.
FIG. 2D shows an example data organization, upon which the results of FIGS. 2A-2C can be based.
FIG. 3A shows an example search result, for treatments to the condition “heart attack.”
FIG. 3B is a record view of the same profile shown in FIG. 3A, except that a treatment has been expanded.
FIG. 3C shows information regarding the same search result as FIGS. 3A-3B, except that a Demographic View is selected.
FIG. 3D shows an example data organization, upon which the results of FIGS. 3A-3C can be based.
FIG. 4A shows a screen where part of a profile for a mouthwash brand is shown.
FIG. 4B is a record view that presents the same profile as FIG. 4A, except that a con has been expanded.
FIG. 4C shows information regarding the same profile of FIGS. 4A-4B, except that a Demographic View is selected.
FIG. 4D shows an example data organization, upon which the results of FIGS. 4A-4C can be based.
FIGS. 5A-C depict Confidence Distributions.
FIGS. 6A-C depict techniques for combining Confidence Distributions.
FIGS. 7A-D depict processing stages, at the instance level, for the determination of an example technology profile.
FIGS. 8A-C depict processing stages, at the snippet level, for the determination of an example healthcare-related search.
FIGS. 8D-F depict processing stages, at the instance level, for the determination of an example healthcare-related search.
FIG. 9A shows an example that permits one to see some of the snippets forming the basis of a role-value-oriented search result.
FIG. 9B shows an example to motivate grouping of each result-value under its respective frame type.
FIG. 10 illustrates an example Healthcare Frame Set.
FIGS. 11A-D depict an example frame extraction rule for the Benefit Frame.
FIGS. 12A-D depict an example frame extraction rule for the Treatment Frame.
FIG. 13 depicts a generic frame structure.
FIGS. 14A-B present a generic Frame-Based Search Engine (FBSE).
FIG. 15 depicts an example production-level computer system design in which the techniques described herein can be applied.
FIG. 16 presents an example pseudo-coded procedure for performing Instance Merging.
FIG. 17 depicts an example pseudo-coded procedure for accomplishing Instance Selection.
Reference will now be made in detail to various embodiments of the invention, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts.
Please refer to the Glossary of Selected Terms, included at the end of the Detailed Description, for the definition of selected terms used below.
Section 2 presents some example products, of research projects, for which demographic data can be useful. In Section 3, several techniques for producing demographics, in an automated fashion, are presented. The techniques of Section 3 can be applied to a research product that has been produced by any technique, so long as the product satisfies the following two conditions:
However, the techniques of Section 3 are particularly useful when applied to an automated frame-based approach and this type of approach is presented in Section 4.
This section presents the following four types of example search results for which demographic determination can be suitable:
2.1 Generic Search
FIG. 1A presents an example computer screen 100, on which is shown a generic search result. The search result of screen 100 is hierarchically organized, where the highest level, called Search Type 110, indicates the basic type of search that has been performed. For each of Sections 2.2 to 2.4 below, one of each of the following example basic search types is discussed:
Within a search type, Search Object 111 represents the particular query for which the search is performed. For each of Sections 2.2 to 2.4 below, its Search Object is, respectively:
A “Search Aspect” provides a category, under which a collection of result-values can be organized. FIG. 1A shows two example Search Aspects: 120 and 130. For each of Search Aspects 120 and 130, several result-values are shown (i.e., result-values 121-123 are shown under 120 and 131-133 are shown under 130). For each of Sections 2.2 to 2.4 below, some Search Aspects are, respectively:
Screen 100 shows that there are two basic “modes” by which a search result can be displayed:
In screen 100, Record View 101 has been selected. For each result-value shown, a number of records, of its corresponding result-base, can also be displayed. In FIG. 1A, for example, consider the display of result-value 121. To the right of it is shown, in parentheses, the number of records in the result-base of result-value 121. This number of records is indicated, in FIG. 1A, as “num recs 151.” For each result-value shown, a user-interface can also be provided by which the records, of a result-value's result-base, can be displayed. Continuing with example result-value 121, consider the “+” (plus) sign displayed to the left of 121. The plus sign indicates that the display of the result-value can be expanded, such that the records of its result-base are shown. FIG. 1B, in fact, shows an example expansion of result-value 123. Of the records in the result-base for result-value 123, FIG. 1B shows two of them: record 160 and record 161. To the left of result-value 123 is shown a “-” (minus) sign, indicating that the display of records can be hidden from view (or “contracted”).
FIG. 1C is intended to show the same search results of FIG. 1A, except that Demographic View 102 has been selected. Further, result-values 121 and 122, of FIG. 1C, have already been selected for expansion so that the demographic data of each can be seen. For each of result-values 121 and 122, the following two demographic characteristics are shown:
A definition, of a demographic, is given below in the Glossary of Selected Terms. The particular form of data display, used for the demographics of FIG. 1C, is only for purposes of example. Any suitable technique, from the art of data display and/or data visualization, can be used to present the demographics. The type of demographic data display used in FIG. 1C is presented in more detail in Section 5.1 below (“Applicability Distributions”) as well as in Section 3 (“Determining Demographics”).
FIG. 1D shows an example data organization, upon which the results of FIGS. 1A-1C can be based. FIG. 1D is intended to most directly support FIG. 1C. As can be seen, Search Aspect 120 of FIG. 1C corresponds to data grouping 103 (see dashed outline 103) in FIG. 1D. In addition to containing Search Aspect 120 itself, data grouping 103 is depicted as containing data groupings 101 and 102 (see dashed outlines 101 and 102). Data groupings 101 and 102 each represent a result-pair, since each contains a result-value and its result-base.
Specifically, data grouping 101 contains result-value 121 and result-base 180. Result-value 121, in FIG. 1C, is depicted as being based upon the following number of records: <num recs 151>. As can be seen, the records of result-base 180 are enumerated by starting at 1 (for the leftmost record) and continuing up to <num recs 151>.
Data grouping 102 contains result-value 122 and result-base 190. Result-value 122, in FIG. 1C, is depicted as being based upon the following number of records: <num recs 152>. As can be seen, the records of result-base 190 are enumerated by starting at 1 (for the leftmost record) and continuing up to <num recs 152>.
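The data organization described above (a Search Aspect containing result-pairs, each of which pairs a result-value with its result-base of records) can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the class names Record, ResultPair and SearchAspect, the num_recs property, and the sample contents are hypothetical and are not taken from FIG. 1D.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One member of a result-base (e.g., an article or a snippet)."""
    text: str


@dataclass
class ResultPair:
    """A result-value together with the result-base that supports it."""
    result_value: str
    result_base: list[Record] = field(default_factory=list)

    @property
    def num_recs(self) -> int:
        # The count a Record View could show in parentheses beside a result-value.
        return len(self.result_base)


@dataclass
class SearchAspect:
    """A category under which a collection of result-pairs is organized."""
    name: str
    result_pairs: list[ResultPair] = field(default_factory=list)


# Hypothetical contents standing in for a data grouping such as 103 of FIG. 1D.
aspect = SearchAspect(
    name="Search Aspect 120",
    result_pairs=[
        ResultPair("result-value 121", [Record("record 1"), Record("record 2")]),
        ResultPair("result-value 122", [Record("record 1")]),
    ],
)

for rp in aspect.result_pairs:
    print(f"{rp.result_value} ({rp.num_recs})")
```

Keeping the result-base attached to its result-value means that any demographic computed over the records can be stored alongside, and displayed with, the result-value that the records support.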
2.2 Technology Profiling
The production of a profile, regarding a technology, can be useful as part of a technology scouting project. In technology scouting, a technology searcher begins with a problem (call it, in general, “P_1”) and looks for an existing technology (call it, in general, an “ET_1”) to solve or otherwise address P_1. If a technology search process (such as the search techniques discussed in the '122 and '127 Applications) has identified a candidate ET_1, a further evaluation, of the suitability of applying ET_1 to P_1, can be aided by having a profile of ET_1 (where the profile can be produced with the techniques of the '068 Application). If the technology searcher knows the demographics, of each result-base of the profile, the profile can be more useful to the evaluation of ET_1.
FIGS. 2A-2D present an example technology profile that is similar to those discussed in the '068 Application, but is supplemented with demographic information. Each of FIGS. 2A-2D corresponds to FIGS. 1A-1D. FIG. 2A shows a screen 200 where part of a profile, for the technology “rechargeable cell,” is shown. In particular, two aspects of rechargeable cells are shown:
FIG. 2B is a record view that presents the same profile as FIG. 2A, except that the innovator Hitachi has been expanded, such that some of the records, upon which the identification of Hitachi is based, are shown. In FIG. 2B, only the first two records, of the 27 records upon which Hitachi is based, are shown. The records on Hitachi are referred-to as being “articles,” but any appropriate information source can be displayed.
FIG. 2C shows information regarding the same profile of FIGS. 2A-2B, except that now a “Demographic View” has been selected. Further, result-values “Hitachi” and “Sony,” for the aspect of innovators, have already been selected for expansion so that the demographic data of each can be seen. For each of result-values “Hitachi” and “Sony,” the following two demographic characteristics are shown:
Definitions, respectively, for each of these demographic characteristics are as follows:
A brief discussion, of how the demographics depicted in FIG. 2C can be helpful to understanding the Technology Profile, follows. The Technology Profile produced has identified both Hitachi and Sony as innovators of rechargeable cells. The demographics, however, show that there are differences:
FIG. 2D shows an example data organization, upon which the results of FIGS. 2A-2C can be based. FIG. 2D is intended to most directly support FIG. 2C. As can be seen, the “Innovators” Search Aspect of FIG. 2C corresponds to data grouping 203 (see dashed outline 203) in FIG. 2D. In addition to containing the “Innovators” Search Aspect itself, data grouping 203 is depicted as containing result-pairs 201 and 202 (see dashed outlines 201 and 202).
Specifically, result-pair 201 contains result-value “Hitachi” and result-base 280. Result-pair 202 contains result-value “Sony” and result-base 290. As can be seen, the records of result-base 280 are enumerated by starting at 1 (for the leftmost record) and continuing up to 27. The records of result-base 290 are enumerated by starting at 1 (for the leftmost record) and continuing up to 18.
2.3 Healthcare-Related
Healthcare-related content is a knowledge domain that is of both great importance and vast size. Items sought-for, in a healthcare-related search, can include the following: a treatment for a condition, the causes and/or complications of a condition and the pros and/or cons of a treatment. For a set of records (the result-base) identified as addressing any of these sought-for items (the result-value), understanding the result-base's demographics can be useful.
An example search result, for treatments to the condition “heart attack,” is shown in FIG. 3A. For ease of organization, the treatments shown are divided into three sub-types:
FIG. 3B is a record view of the same profile shown in FIG. 3A, except that the treatment “Chocolate” (under the Food and Plants sub-type) has been expanded, such that some of the records, upon which the identification of Chocolate is based, are shown. In FIG. 3B, only the first two records, of the 818 records upon which Chocolate is based, are shown. The records regarding “Chocolate” are referred-to as being “articles,” but any appropriate information source can be displayed.
FIG. 3C shows information regarding the same search result as FIGS. 3A-3B, except that now a “Demographic View” has been selected. Further, result-values “Aspirin” (under the Drugs and Medications sub-type) and “Chocolate” have already been selected for expansion so that the demographic data of each can be seen. For each of result-values “Aspirin” and “Chocolate,” the following two demographic characteristics are shown:
These demographic characteristics are the same as those discussed above for a Technology Profile search. Therefore, please refer to Section 2.2 for definitions of them.
A brief discussion, of how the demographics depicted in FIG. 3C can be helpful to understanding the Healthcare-related search, follows. The Healthcare-related search has identified both Aspirin and Chocolate as potential treatments for a heart attack. The demographics, however, show that there are differences:
FIG. 3D shows an example data organization, upon which the results of FIGS. 3A-3C can be based. FIG. 3D is intended to most directly support FIG. 3C. As can be seen, the “Treatments” Search Aspect of FIG. 3C corresponds to data grouping 303 (see dashed outline 303) in FIG. 3D. In addition to containing the “Treatments” Search Aspect itself, data grouping 303 is depicted as containing result-pairs 301 and 302 (see dashed outlines 301 and 302).
Specifically, result-pair 301 contains result-value “Aspirin” and result-base 380. Result-pair 302 contains result-value “Chocolate” and result-base 390. As can be seen, the records of result-base 380 are enumerated by starting at 1 (for the leftmost record) and continuing up to 78. The records of result-base 390 are enumerated by starting at 1 (for the leftmost record) and continuing up to 818.
2.4 Brand Research
A professional of marketing research is constantly seeking to better understand the perception of brands, as seen by members of a relevant market. As part of achieving this, a set of records (i.e., a result-base) can be identified as addressing an important characteristic (i.e., a result-value) of a brand. In a manner similar to that discussed above for a Technology Profile, a collection of result-pairs (each addressing a distinct but important characteristic) can be determined and the results presented in the form of a Brand Profile.
Once a Brand Profile has been determined, a next task can be to better understand the market members responsible for the result-value. Demographics can be very useful to achieving this goal.
FIG. 4A shows a screen 400 where part of a profile, for the mouthwash brand “Mouthwash XYZ,” is shown. In particular, two aspects of Mouthwash XYZ are shown:
FIG. 4B is a record view that presents the same profile as FIG. 4A, except that the con “bad for dentures” has been expanded, such that some of the records, upon which the identification of “bad for dentures” is based, are shown. In FIG. 4B, only the first two records, of the 27 records upon which “bad for dentures” is based, are shown. The records on “bad for dentures” are referred-to as being “articles,” but any appropriate information source can be displayed.
FIG. 4C shows information regarding the same profile of FIGS. 4A-4B, except that now a “Demographic View” has been selected. Further, result-values “bad for dentures” and “removes lipstick,” for the aspect of cons, have already been selected for expansion so that the demographic data of each can be seen. For each of result-values “bad for dentures” and “removes lipstick,” the following two demographic characteristics are shown:
Definitions, respectively, for each of these demographic characteristics are as follows:
A brief discussion, of how the demographics depicted in FIG. 4C can be helpful to understanding the Brand Profile, follows. The Brand Profile produced has identified both “bad for dentures” and “removes lipstick” as cons of the mouthwash brand Mouthwash XYZ. The demographics, however, show that there are differences:
FIG. 4D shows an example data organization, upon which the results of FIGS. 4A-4C can be based. FIG. 4D is intended to most directly support FIG. 4C. As can be seen, the “Cons” Search Aspect of FIG. 4C corresponds to data grouping 403 (see dashed outline 403) in FIG. 4D. In addition to containing the “Cons” Search Aspect itself, data grouping 403 is depicted as containing result-pairs 401 and 402 (see dashed outlines 401 and 402).
Specifically, result-pair 401 contains result-value “bad for dentures” and result-base 480. Result-pair 402 contains result-value “removes lipstick” and result-base 490. As can be seen, the records of result-base 480 are enumerated by starting at 1 (for the leftmost record) and continuing up to 27. The records of result-base 490 are enumerated by starting at 1 (for the leftmost record) and continuing up to 18.
Having introduced the utility of demographics, for several example research projects, this Section addresses techniques for determining such demographics. Section 3.2 introduces “Confidence Distributions” as a way of representing the application of a demographic characteristic, both at the level of an individual record and for summarizing a population (or result-base) of records. Section 3.3 addresses techniques for combining Confidence Distributions. This is particularly useful for determining the summarizing Confidence Distribution of a population, since it can be produced by combining the Confidence Distributions produced for each individual record. Section 3.4 discusses types of demographic characteristics, and ways in which such demographic characteristics can be determined.
Determination of a demographic characteristic “DC_1,” with respect to a population “P_1,” involves at least the following two levels of determination:
At the Individual Member level, two main types of results, of applying a demographic characteristic DC_1 to a population member M_1, are addressed:
Total Certainty can be regarded as a sub-variety of Partial Certainty, where the following two limitations apply:
Summarization of the demographic characteristic, at the Whole Population level, depends upon whether individual members of the population have been assigned values with Total or Partial Certainty.
With Total Certainty at the member level, a population can be summarized with a histogram: for each value, from the range of potential values, the number of members assigned such value can be provided. If desired, the histogram can be normalized, so that each value is assigned a number in the range of 0.0 to 1.0 and the assigned numbers sum to 1.0.
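The histogram summarization described above can be sketched in Python as follows. This is a minimal illustration, assuming each member has already been assigned exactly one value; the function name summarize_total_certainty and the sample values are hypothetical.

```python
from collections import Counter


def summarize_total_certainty(assigned_values, normalize=False):
    """Summarize a population in which each member was assigned one value.

    Returns a histogram mapping each value to the number of members
    assigned that value. With normalize=True, each count is divided by
    the population size, so the entries lie in 0.0 to 1.0 and sum to 1.0.
    """
    counts = Counter(assigned_values)
    if not normalize:
        return dict(counts)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


# Hypothetical population of four members, each assigned one value.
values = ["female", "male", "female", "female"]
histogram = summarize_total_certainty(values)
normalized = summarize_total_certainty(values, normalize=True)
print(histogram)
print(normalized)
```

Normalizing makes result-bases of different sizes directly comparable, at the cost of hiding how many members each proportion is based upon.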
With Partial Certainty at the member level, summarization for the population is more complex. FIG. 5A depicts an example representation (any suitable representation can be used) for the application of a demographic characteristic DC_1 to a member M_1 (where M_1 is from a population P_1). FIG. 5A can be referred to herein as a “Confidence Distribution” (or “CD”) and is comprised of two axes:
If desired, the confidence levels can be normalized such that they all fit within a predetermined range, such as 0.0 to 1.0 (in the manner of probabilities) or 0.0 to 100.0 (like percentages), and the sum of the confidence levels typically equals (but does not exceed) the maximum value of the range.
FIG. 5B depicts an example Confidence Distribution 510, where the demographic characteristic has just two values (“1” or “2”) that it can assign to a member M_1. The confidence level, with which each value is determined applicable to M_1, is limited to the range 0.0 to 1.0. An example use, of a two-value Confidence Distribution, is presented in Section 2.4 (“Brand Research”). In Section 2.4, the examples address the gender of the Declarants of a population (or result-base) of records, rather than an individual record. The examples of Section 2.4 are illustrated in FIG. 4C and are identified as follows:
Each of these Confidence Distributions, however, can be interpreted, with respect to a single record, as follows:
FIG. 5C depicts an example Confidence Distribution 520, where the demographic characteristic has three or more values (four values are shown in FIG. 5C) that it can assign to a member M_1. The confidence level, with which each value is determined applicable to M_1, is limited to the range 0.0 to 1.0. Example uses, of a multi-value Confidence Distribution, are presented in Section 2.4 (“Brand Research”). In Section 2.4, the examples address the age of the Declarants of a population (or result-base) of records, rather than an individual record. The examples of Section 2.4 are illustrated in FIG. 4C and are identified as follows:
Each of these Confidence Distributions, however, can be interpreted, with respect to a single record, as follows:
Once the Confidence Distribution has been produced, for each member of a population P_1, any suitable technique can be used to combine such Confidence Distributions into a value or values that appropriately summarize P_1 with respect to a demographic characteristic DC_1. For purposes of example, one combining technique is presented herein.
The combining technique presented herein is depicted graphically in FIG. 6A and with pseudo-code in FIG. 6B.
FIG. 6A depicts two, three-value, Confidence Distributions “CD 1” and “CD 2.” Consider the case where CD 1 and CD 2 each represent a demographic characteristic DC_1 as applied to two distinct, but single, records. The records shall be referred to as, respectively, M_1 and M_2. CD 1 and CD 2 each have the following properties, regarding their confidence levels: they are in the range 0.0 to 1.0 and they sum to 1.0. Confidence Distribution “CD 3” represents the result of combining CD 1 and CD 2 by adding corresponding confidence levels. For example, in CD 1, the confidence level assigned to value 2, when DC_1 is applied to M_1, is 0.3. In CD 2, the confidence level assigned to value 2, when DC_1 is applied to M_2, is 0.1. In CD 3, the combined confidence level, assigned to value 2, is the sum of 0.3 and 0.1 or 0.4. As can be seen, just summing the corresponding values of two Confidence Distributions produces a Confidence Distribution CD 3 that no longer meets the properties of the confidence levels of CD 1 and CD 2 (i.e., the confidence levels of CD 3 are in the range 0.0 to 1.4 and the confidence levels sum to 2.0, since each of CD 1 and CD 2 sums to 1.0). However, CD 3 can be normalized to produce a “CD 4” where its confidence levels satisfy the same properties of CD 1 and CD 2 (i.e., the confidence levels of CD 4 are in the range 0.0 to 1.0 and the confidence levels sum to 1.0). If there are more records to be combined, such as a record M_3, its Confidence Distribution can be combined with CD 4 in the same manner as described above for combining CD 1 and CD 2.
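The add-then-normalize combination described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names add_cds, normalize_cd and combine_cd are hypothetical, a Confidence Distribution is represented as a dictionary from value to confidence level, and only the value-2 confidence levels (0.3 and 0.1) come from the text. The remaining levels are assumed for illustration, chosen so that each input sums to 1.0.

```python
def add_cds(cd_a, cd_b):
    """Add corresponding confidence levels (the CD 3 step)."""
    values = set(cd_a) | set(cd_b)
    return {v: cd_a.get(v, 0.0) + cd_b.get(v, 0.0) for v in sorted(values)}


def normalize_cd(cd):
    """Rescale confidence levels so that they sum to 1.0 (the CD 4 step)."""
    total = sum(cd.values())
    if total == 0:
        return dict(cd)  # nothing to rescale
    return {v: level / total for v, level in cd.items()}


def combine_cd(cd_a, cd_b):
    """Combine two Confidence Distributions by adding, then normalizing."""
    return normalize_cd(add_cds(cd_a, cd_b))


# Value 2 uses the levels given in the text; values 1 and 3 are assumed.
cd1 = {1: 0.6, 2: 0.3, 3: 0.1}  # DC_1 applied to record M_1
cd2 = {1: 0.8, 2: 0.1, 3: 0.1}  # DC_1 applied to record M_2

cd3 = add_cds(cd1, cd2)   # unnormalized sum of corresponding levels
cd4 = normalize_cd(cd3)   # levels back in the range 0.0 to 1.0
print(cd3)
print(cd4)
```

A further record M_3 could be folded in by calling combine_cd(cd4, cd_of_m3), where cd_of_m3 is a hypothetical name for that record's Confidence Distribution.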
The combining techniques of FIG. 6A can also be applied to the following situation: where there are two or more “Approaches” to determining the same demographic characteristic. In this situation, CD 1 and CD 2 can each represent the application of a different Approach, respectively, DC_1_A1 and DC_1_A2, to a same record M_1. Although representative of different approaches, CD 1 and CD 2 can still be combined in the same way, as described above, to produce CD 3 and CD 4.
Utilization of multiple Approaches, for determining a same demographic characteristic, can be useful in a variety of situations. For example, the Confidence Distributions of different Approaches can reinforce each other, thus leading to higher net confidence levels in the values identified for a record. Also, the impact of an erroneous Confidence Distribution, from one Approach, can be mitigated by other Approaches producing more accurate Confidence Distributions.
Once multiple Confidence Distributions, resulting from the application of multiple Approaches to a single record M_1, have been combined to produce a single Confidence Distribution, this single Confidence Distribution can be treated, for purposes of combining M_1 with other records, as the single Confidence Distribution for M_1. Thus, in the example discussed above, where CD 1 and CD 2 were each described as representative of different records, respectively, M_1 and M_2, it is possible that each of CD 1 and CD 2 has been produced by some prior combining process, in which the results of multiple Approaches were applied to M_1 and M_2.
FIG. 6B can be understood as determining a demographic output for the result-pairs of a Search Aspect. Specifically, FIG. 6B can be understood as producing the demographic output, in FIG. 4C, for the “Cons” Search Aspect (FIG. 4C is discussed above in Section 2.4 “Brand Research”). This can be accomplished by applying, as follows, the pseudo-code of FIG. 6B to data grouping 403 of FIG. 4D.
The variable current_search_aspect (line 1, FIG. 6B) is assumed to be set to Search Aspect 403 of FIG. 4D. Each iteration of the “for” loop (lines 1-20, FIG. 6B) sets current_rp (which stands for “current result pair”) to a successive result-pair of 403. First, current_rp is set to result-pair 401 and, second, current_rp is set to result-pair 402. For current_rp set to result-pair 401, where the result-value is “Bad for dentures,” demographics 470 and 472 (see FIG. 4C) can be produced as follows.
The “for” loop, of lines 3-8, FIG. 6B, can produce Confidence Distribution 470 for the “Gender” demographic characteristic, where “D1” is assumed to refer to a determination of the gender of a record's Declarant. This “for” loop applies two Approaches, called D1_A1 and D1_A2 (see lines 4 and 5), to produce two Confidence Distributions, called CD_A1 and CD_A2. A “combine_CD” function (see line 6) is then applied to combine the Confidence Distributions, resulting from the two Approaches, to produce a single Confidence Distribution for the current_record (called “current_record.D1_CD”). The operation of combine_CD can be very similar to that described above with respect to FIG. 6A. The combine_CD function can then be applied again (see line 7), so that a single Confidence Distribution, representative of the current result-pair, can be accumulated into current_rp.D1_CD.
In the same manner as described above (for the “for” loop of lines 3-8), the “for” loop, of lines 10-15, FIG. 6B, can produce Confidence Distribution 472 for the “Age” demographic characteristic, where “D2” is assumed to refer to a determination of the age of a record's Declarant. As with demographic characteristic D1, D2 is also assumed to have two different Approaches.
For result-pair 401, each of the “for” loops iterates over the 27 records, regarding “Bad for dentures,” to produce Confidence Distributions 470 and 472. Lines 17-18, FIG. 6B, can then be executed in order to display the Confidence Distributions (such as the display in FIG. 4C).
In its second iteration, the “for” loop of lines 1-20 sets the current result-pair to 402 (see FIG. 4D). Each of the two inner “for” loops (i.e., lines 3-8 and lines 10-15) iterates over the 18 records, regarding “Removes lipstick,” to produce Confidence Distributions 471 and 473.
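The per-record and per-result-pair accumulation described for FIG. 6B can be sketched as follows. This is an illustrative approximation only, since the pseudo-code of FIG. 6B is not reproduced here. The two Approach functions, the record texts and all confidence levels are hypothetical, and `combine_cd` is assumed to behave like the add-then-normalize combination of FIG. 6A.

```python
from typing import Callable, Dict, List, Optional

CD = Dict[str, float]  # a Confidence Distribution: value -> confidence level

def combine_cd(cd_a: Optional[CD], cd_b: CD) -> CD:
    """Sum corresponding levels and renormalize; None means nothing accumulated yet."""
    if cd_a is None:
        return dict(cd_b)
    summed = {value: cd_a[value] + cd_b[value] for value in cd_b}
    total = sum(summed.values())
    return {value: level / total for value, level in summed.items()}

def demographic_for_result_pair(records: List[str],
                                approaches: List[Callable[[str], CD]]) -> Optional[CD]:
    """Accumulate one Confidence Distribution for all records of a result-pair."""
    rp_cd: Optional[CD] = None
    for record in records:
        record_cd: Optional[CD] = None
        for approach in approaches:            # e.g., D1_A1 and D1_A2
            record_cd = combine_cd(record_cd, approach(record))
        rp_cd = combine_cd(rp_cd, record_cd)   # accumulate into the result-pair
    return rp_cd

def d1_a1(record: str) -> CD:
    # Hypothetical Approach 1: a crude self-reference cue.
    if "as a mother" in record:
        return {"Female": 0.9, "Male": 0.1}
    return {"Female": 0.5, "Male": 0.5}

def d1_a2(record: str) -> CD:
    # Hypothetical Approach 2: a fixed source-level distribution.
    return {"Female": 0.6, "Male": 0.4}

records = ["as a mother I find it bad for dentures", "bad for dentures in my experience"]
gender_cd = demographic_for_result_pair(records, [d1_a1, d1_a2])
```

Because each call renormalizes, the running result is given the same weight as each newly added record. An implementation that wants every record to count equally could instead accumulate raw sums and normalize once at the end.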
Within the general requirement, for a display or visualization that can represent a variation of confidence, as a demographic's values vary, any suitable technique, for data display or visualization, can be used. Some of these display or visualization techniques can include (but are in no way limited to) the following:
Example demographic characteristics, discussed above, are as follows:
Characteristics 1 and 2 can be put under the more general classification of “Declarant Demographics,” where a Declarant Demographic can be any demographic characteristic regarding the Declarants of a result-base's records. More example Declarant Demographics can include the following:
This section focuses on Approaches (where an “Approach” was introduced above in Section 3.3) for determining Declarant Demographics. The Approaches discussed herein can be used individually, in any combination with each other or in combination with other Approaches not addressed herein. The Approaches discussed herein can be summarized as follows:
Each of these Approaches is now addressed in greater detail.
The technique of “lexical-to-demographic association,” when applicable to a demographic characteristic “DC_1,” works as follows. If a particular lexical unit is present in a record, there is a certain (above zero) probability that the Declarant of the record has characteristic “DC_1.” Depending on the demographic sought and the lexical unit detected, the probability can range from low or inaccurate (e.g., 0.2) to high or accurate (e.g., 0.9). Even a low level of probability, however, can be useful, particularly if combined with probability information determined from other Approaches to the same demographic.
Lexical-to-demographic association can be used, for example, with regard to determining the geographic location of a Declarant. This is because certain lexical units are known to be more frequently utilized (or, perhaps, only utilized) in certain geographical areas. Thus, if lexical units are included in a record, where such lexical units are indicative of a geographical area “GA1,” there is a certain (above zero) probability that the Declarant of the record is from GA1.
Sources of geographically-indicative language include the web site (www.UrbanDictionary.com) and books (such as “Urban dictionary: fularious street slang defined,” Andrews McMeel Publishing, Kansas City, Missouri, 2005) by Aaron Peckham. For example, the word “hyphy” has been associated with the area of Oakland, in the San Francisco Bay Area, CA, U.S.A.
For some demographic characteristics, a record can be analyzed for statements wherein the Declarant describes himself or herself as having a sought-for characteristic “DC_1.” If a self-referential statement is found that has the sought-for properties, there is a certain (above zero) probability that the Declarant of the record has the characteristic DC_1. This technique can be referred to herein as “self-referential demographic identification.” As with “lexical-to-demographic association,” the probability can range from low or inaccurate (e.g., 0.2) to high or accurate (e.g., 0.9).
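A lexical-to-demographic association for geographic location can be sketched as a simple lookup. This is a minimal sketch under stated assumptions: the table `GEO_LEXICON` and the function `geographic_cues` are hypothetical names, and the probability 0.6 is an assumed placeholder. Only the association of “hyphy” with Oakland is taken from the text.

```python
from typing import Dict, List, Tuple

# Hypothetical association table: lexical unit -> (geographical area, probability).
# The "hyphy" -> Oakland association is from the text; 0.6 is an assumed placeholder.
GEO_LEXICON: Dict[str, Tuple[str, float]] = {
    "hyphy": ("Oakland, CA", 0.6),
}

def geographic_cues(record: str) -> List[Tuple[str, str, float]]:
    """Return (lexical unit, area, probability) for each indicative unit in a record."""
    cues = []
    for token in record.lower().split():
        unit = token.strip(".,!?;:\"'")   # drop surrounding punctuation
        if unit in GEO_LEXICON:
            area, probability = GEO_LEXICON[unit]
            cues.append((unit, area, probability))
    return cues

cues = geographic_cues("That show was hyphy!")
```

Each cue could then be converted into a Confidence Distribution over geographical areas and combined with the output of other Approaches.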
For example, a linguistic rule can be written that triggers upon a Logical Form that satisfies all of the following properties:
Such linguistic rules can be written in the form of “frame extraction rules,” as discussed below in Sections 4-6 and defined in Section 7.2 (“Frame Extraction Rules”). However, the “action” portion of a frame extraction rule, suitable for identification of a demographic characteristic, does not need to produce a frame instance when triggered. Instead, the action needs to indicate, with an appropriate Confidence Distribution, presence of the demographic characteristic.
Self-referential demographic identification can be used, for example, with regard to determining the geographic location of a Declarant. For example, a linguistic rule can be written that triggers upon a Logical Form satisfying all of the following properties:
Self-referential demographic identification can also be used, for example, with regard to determining the gender of a Declarant. For example, a linguistic rule can be written that triggers upon a Logical Form satisfying all of the following properties:
For the linguistic rules presented thus far, the confidence in the presence of the demographic characteristic, if found, is very high (e.g., 1.0 on a scale of 0.0 to 1.0). However, the confidence of a match, by a particular linguistic rule, can vary depending upon the particular lexical unit (or units) that are part of the match. In this case, lexical units associated with the detection of a particular value V_1 (such as “Female”), from the range of values (e.g., Male or Female) that can be assigned by a demographic characteristic DC_1 (e.g., gender) to a member M_1 of its population, can be paired with an appropriate confidence level that V_1 is, in fact, present. Any suitable data format, to represent such pairing, can be used. For purposes of simplicity of exposition herein, the above-given feature set for FEMALE can be expressed as follows:
The above feature set of pairs contains all of the same lexical units as present in the non-paired form, except the lexical unit “secretary” has been added. As can be seen, “secretary” is the one lexical unit shown that is not paired with a confidence level of 1.0. This is because a Declarant, describing himself or herself as a “secretary,” does not lead to Total Certainty (where “Total Certainty” is discussed above in Section 3.2 “Confidence Distributions”) that the Declarant is female (e.g., a confidence level of 0.7 is shown). Depending upon the application, the pairing can be between a lexical unit and a Confidence Distribution. In the case of the gender demographic, since only two values are possible, a Confidence Distribution need only have two values. For example, FEMALE can be expressed as follows (where each Confidence Distribution is ordered with the confidence values for Female, Male):
It is often the case that the producer (or publisher) of a content source keeps demographic data on its content contributors and users. Also, there are companies that specialize in producing demographic data on content providers.
For a record “M_1,” of a result-base, its content producer “C_1” can be identified and the demographics, of such content source, can be accessed. Such demographic information can be used to deduce a Confidence Distribution, for the Declarant of “M_1,” with respect to a particular demographic characteristic “DC_1.”
For example, DC_1 can be gender and the demographic data, for C_1, can be that 90% of its contributors are female while only 10% are male. Thus, in the particular case of record M_1, it can be reasonable to deduce that there is a 90% probability that its Declarant is female.
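The deduction of a Confidence Distribution from source-level demographics can be sketched as follows. The function name `cd_from_source` and the contributor counts are hypothetical. The counts are chosen only so that they reproduce the 90% and 10% proportions of the example above.

```python
from typing import Dict

def cd_from_source(contributor_counts: Dict[str, int]) -> Dict[str, float]:
    """Turn a content source's contributor counts into a Confidence Distribution."""
    total = sum(contributor_counts.values())
    if total == 0:
        raise ValueError("no contributor data available for this content source")
    return {value: count / total for value, count in contributor_counts.items()}

# Hypothetical counts for content producer C_1, matching the 90%/10% example.
cd_for_m1 = cd_from_source({"Female": 900, "Male": 100})
```

The resulting distribution would apply to every record drawn from C_1, so it is most useful when combined with record-specific Approaches.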
An example category of content source is online sources, such as Internet web sites. In this case, many web sites compile demographic data on their contributors and users. Also, there are companies that compile demographics across many online content sources.
Example web sites, that compile demographic data on their contributors and users, include:
An example company, that compiles demographics across many online content sources, is www.QuantCast.com, operated by Quantcast Corporation, San Francisco, CA, U.S.A. Quantcast provides a database wherein a Uniform Resource Locator (or “url”) can be input and a variety of demographics, describing that url, are output.
If explicit information about the Declarant of a record is available, it can be used to deduce demographics of the record's Declarant.
For example, the Declarant's name can be included as part of a record (sometimes in a specific field where “authors” are identified). Based on a Declarant's name, demographics, such as the Declarant's likely gender and/or age, can be determined.
For example, if a Declarant's name is “Mary,” it can be deduced, with a high level of probability, that the Declarant is female. However, since the name “Mary” has been popular for a long time, and remains popular, it is not useful for deducing the age of the Declarant. Names such as “Gertrude” or “Beatrice” are no longer popular and therefore it can be deduced, with a certain level of probability, that the Declarant is in an older age range (such as 50 years or older).
An example database, that provides detailed information on the popularity of names over a long period of time (e.g., over the past 100 years), is www.BabyNameWizard.com, operated by Laura Wattenberg, Wellesley, MA, U.S.A., and Generation Grownup, LLC.
Additional linguistic patterns for detection of demographics, that are not amenable to being manually deduced, can be created by the application of automated machine learning procedures. Any suitable machine learning procedures can be used, with such procedures executed on source (or “training”) corpora.
An example type of linguistic pattern, that can be deduced from machine learning, is as follows. It can be determined that the presence of a particular lexical unit, in a record, implies a certain (above zero) probability that the record's Declarant has a particular demographic characteristic. In this situation, machine learning can be used to produce additional lexical-to-demographic associations, as described above in Section 3.4.1.1.
Frames and frame-based search systems are discussed extensively in the following patent applications (see citations above): the '122 Application, the '127 Application and the '068 Application. The entirety of each of these applications is incorporated by reference in the present description. However, for purposes of convenience, certain information of such applications is repeated herein.
A key advantage, of a frame-based search system, is that result-pairs can be generated automatically. The following nomenclature can be used herein:
For any of the techniques described in Sections 2 (“Example Search Results”) and 3 (“Determining Demographics”), the result-pairs, result-values and result-bases can be replaced by, respectively, their frame-produced versions. In terms of determining demographics, as addressed above in Section 3 (“Determining Demographics”), the “items” or “records” processed, for purposes of evaluating a demographic characteristic, can be replaced by snippets. For example, with respect to FIG. 6B, discussed above in Section 3.3 (“Combining Confidence Distributions”), FIG. 6C presents the same pseudo-code, except that a current snippet is processed (current_snippet) rather than a current record (current_record).
A generic frame-based search engine (or FBSE) is described below in Section 5 (âFrame-Based Searchâ) and, more particularly, in Section 5.1.3. The following sub-Sections 4.1 to 4.3 show how to apply this FBSE to each of the three example search areas.
Technology profiling and, more broadly, the profiling of an entity, is addressed extensively in the '068 Application. While the entirety of the '068 Application has been incorporated by reference, for purposes of convenience, certain information of such application is repeated herein.
Described herein are techniques for generating a profile of an entity as it is addressed by a corpus of natural language (or “Source Corpus”). More particularly, the profile is generated by using a set of frames referred to as an “Entity Profile Frame Set.” Each frame, of an Entity Profile Frame Set, shares a role in common, called herein an “Anchor Role.” For each instance produced from an Entity Profile Frame Set, the value assigned to its Anchor Role is called herein an “Anchor Role Value.” The particular entity, that an Anchor Role Value indicates (or maps to) is called herein an “Anchor Entity.”
The profile of an entity (or an “Entity Profile”) is a set of instances (called herein an “Entity Profile Instance Set”) that satisfies the following two properties:
An “Anchor Entity” (as used herein) is an abstraction, defined, in practice, by the range of Anchor Role Values that are understood as indicating a same Anchor Entity.
With regard to the example technology profiling of Section 2.2 above, and its illustration in FIGS. 2A-2D, an “Aspect” contains instances that are all from a same frame.
An example Entity Profile Frame Set is presented below in Section 6.1 (“Technology Profiling”). In the remainder of this Section, the operation of this Entity Profile Frame Set is illustrated by pursuing an example Entity Profile Instance Set through the operation of the generic FBSE. For this example, it is assumed that the Entity Profile Instance Set corresponds to an Instance Superset.
An Instance Superset, to which Instance Merging can be applied, is assumed to have been already produced and is depicted in FIG. 7A. All the instances are produced from the frames, used for creating a profile for a Technology Candidate, presented in Section 6.1. It should be noted that the instances of FIG. 7A are assumed to have been produced from a general Source Corpus, prior to a user entry of a query for a particular technology. This is why instances, for technologies as divergent as “fuel cells” and “natural language processing” are shown. FIG. 7A can be assumed to be a small fragment of the set of instances that would be produced.
FIG. 7A has 10 instances, 701-710. Each instance is depicted as follows:
The frames upon which each instance is based are as follows:
FIG. 7B is the same as FIG. 7A, except that the following Instance Merging has been accomplished:
After the merging, the following instances, with their instance-mentions, are as follows:
FIG. 7C is the same as FIG. 7B, except that the only instances remaining are those whose Anchor Role Value sufficiently matches an input Anchor Role Value (or user's query). For this example, the user's query is assumed to be “fuel cells.”
Specifically, all the instances of FIG. 7B have been kept in FIG. 7C, except for instances 701 and 710. The Anchor Role Values of instances 701 and 710 (i.e., the string of lexical units “natural language processing”), are clearly identifying a different Anchor Entity (in this case, a different technology) than the technology identified by “fuel cells.”
Thus, the instances of FIG. 7C can all be part of a profile for the Technology Candidate âfuel cells.â FIG. 7D is the same as FIG. 7C, except that instances produced from a same frame have been grouped together. Such grouping can be useful for presentation of the profile.
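The selection of FIG. 7C and the grouping of FIG. 7D can be sketched as follows. This is a rough illustration under stated assumptions: the `Instance` fields, the frame names and the `matches` test are all hypothetical, and the instance numbers are used only as labels. The text does not define what counts as a sufficient match, so a simple case-insensitive comparison that ignores a trailing plural “s” stands in for it here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Instance:
    instance_id: int
    frame_name: str          # hypothetical frame labels
    anchor_role_value: str

def matches(anchor_role_value: str, query: str) -> bool:
    """Placeholder for 'sufficiently matches': ignore case and a trailing plural 's'."""
    def normalize(text: str) -> str:
        return text.lower().strip().rstrip("s")
    return normalize(anchor_role_value) == normalize(query)

def select_and_group(instances: List[Instance], query: str) -> Dict[str, List[Instance]]:
    """Keep instances whose Anchor Role Value matches the query, grouped by frame."""
    groups: Dict[str, List[Instance]] = defaultdict(list)
    for instance in instances:
        if matches(instance.anchor_role_value, query):
            groups[instance.frame_name].append(instance)
    return dict(groups)

instances = [
    Instance(701, "Benefit Frame", "natural language processing"),
    Instance(720, "Benefit Frame", "fuel cells"),
    Instance(722, "Benefit Frame", "fuel cell"),
    Instance(706, "Benefit Frame", "fuel cells"),
    Instance(710, "Problem Frame", "natural language processing"),
]
profile = select_and_group(instances, "fuel cells")
```

Each key of the returned dictionary then corresponds to one group of same-frame instances, which is the structure described for presentation of the profile.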
In terms of the example Technology Profiling discussed above in Section 2.2 (“Technology Profiling”) and illustrated in FIGS. 2A-2D, the general structure of FIG. 7D can be related as follows. (Since they are the result of different search queries, “rechargeable cell” for Section 2.2 and “fuel cells” for FIG. 7D, the particular instances and instance types, of the two results, differ.)
Each of 730, 731 and 732 corresponds to what is called an “Aspect” in FIGS. 2A-2D. More particularly, each of 730, 731 and 732 corresponds to Aspect 203 of FIG. 2D. Within, for example, Aspect 730 are instances 720, 722 and 706. In terms of FIG. 2D, each of these instances can be represented by a result-pair-F, such as 201 or 202. For example, the result-pair-F for instance 720, can comprise:
A search engine, specialized for the domain of healthcare-related computer-accessible content, can be referred to as a “healthcare-related search engine” (or “HRSE”). Currently available HRSE's include, for example, the following web sites: PubMed (provided by the United States National Library of Medicine of the National Institutes of Health), WebMD (provided by the WebMD Health Corporation) and Healthline (provided by Healthline Networks, Inc.).
An approach to a frame-based search engine (“FBSE”) is presented in this section (Section 4.2 “Healthcare-related”). More particularly, the principles of frame-based search are applied to the domain of healthcare-related knowledge. The resulting system can be referred to as a frame-based HRSE.
The development of a frame-based HRSE includes the development of a Frame Set (called a “Healthcare Frame Set” or “HFS”) that models concepts of particular importance to people working within the healthcare field (or “healthcare professionals”).
This section presents an example HFS, called “HFS52,” that contains 5 frames, with each frame having two roles. Each frame of HFS52 is depicted in FIG. 10 and is listed below:
Within each frame of FIG. 10 are its (two) role names in all capital letters. In other diagrams, for purposes of clarity, the suffix “_ROLE” may be added to a role name.
A set of values, that can be assigned to the roles of an “instance” of a frame, is indicated in FIG. 10 by all lower-case letters. The 6 sets of values depicted in FIG. 10 are:
For each frame of FIG. 10, one role is indicated as the “input role” (see below Glossary of Selected Terms for definition) and the other role serves as the output role.
More detailed discussion, of each of the five frames of HFS52, can be found in Section 6.2 (“Healthcare-related”).
An approach, to implementing a frame-based HRSE, is as follows: utilize four frame-based search engines (or FBSE's), where each such FBSE has been described, generically, in Section 5.1.3. Each of the four FBSE's accomplishes the following:
Each of the main steps of an FBSE, customized for Healthcare-related search, is described in more detail in the following sub-sections of Section 4.2.
To illustrate the principles of Instance Generation, presented in Section 5.2, this section (Section 4.2.4) presents an example of Instance Generation related to a specific healthcare-related search.
Specifically, the example of Instance Generation relates to a user seeking to find treatments for the condition “heart attack.”
As presented in Section 2.3 (“Healthcare-related”), the results of the search (shown in FIG. 3A) include the treatments of “Aspirin” (under the category of “Drugs and Medications”) and “Reduce cholesterol” (under the category of “Other Treatments”).
FIG. 3C shows that one of the results (specifically, the result relating to aspirin) is from such sources as MedicineNet (see 320) and Medical News Today (see 321).
According to the Pre-Query Processing of Section 5.2.2, the results of FIG. 3C can result from an FBDB for the Treatment Frame (or FBDB(Treatment Frame)) that is indexed to include snippets from MedicineNet and Medical News Today.
Applying the Post-Query Processing of Section 5.2.3.1 (“Producing A Query Selective Corpus”), with the FBDB being FBDB(Treatment Frame) and the query being “heart attack,” can produce a Query Selective Corpus that includes snippets 801-804 of FIG. 8A. Each of snippets 801-804 depicts, in dotted underline, a location where the query (“heart attack”) resulted in its retrieval.
Applying the Post-Query Processing of Section 5.2.3.2 (“Producing An Instance Superset”), to snippets 801-804, can produce an Instance Superset that includes instances 810-815 of FIG. 8D. The Query Selective Corpus of FIG. 8A can be processed on a UNL-by-UNL basis (e.g., on a sentence-by-sentence basis).
FIG. 8B shows where the third (or focus) sentence of each snippet has triggered a frame extraction rule to produce a Treatment Frame. Each dotted underline depicts a text fragment that is assigned to a CONDITION_ROLE (the input role) and each solid underline depicts a text fragment that is assigned to a TREATMENT_ROLE (output role). For each snippet of FIG. 8B, the triggering of a frame extraction rule produces an instance (shown in FIG. 8D) where the role value of the CONDITION_ROLE is “heart attack.” Specifically, snippets 801-804 of FIG. 8B correspond to, respectively, the following instances of FIG. 8D: 810, 811, 813 and 815.
FIG. 8C also shows where the third (or focus) sentence, of snippets 802-803, has triggered a frame extraction rule to produce a Treatment Frame. For each of snippets 802-803 of FIG. 8C, the triggering of a frame extraction rule produces an instance (shown in FIG. 8D) where the role value of the CONDITION_ROLE is “stroke.” Specifically, the snippets 802-803 of FIG. 8C correspond to, respectively, the following instances of FIG. 8D: 812 and 814. It is the production of instances such as 812 and 814, that do not relate to the query (“heart attack”), that justifies the name of Instance Superset for the output of Instance Generation.
An example Instance Superset (produced above in Section 4.2.4), to which Instance Merging can be applied, is depicted in FIG. 8D. All the instances are produced from a Treatment Frame. This Treatment Frame, along with other example frames for a frame-based HRSE, is addressed in Section 6.2 (“Healthcare-related”).
FIG. 8D has 6 instances, 810-815.
FIG. 8E is the same as FIG. 8D, except that the following Instance Merging has been accomplished:
After the merging, FIG. 8E has the following four instances (from top to bottom): 820, 812, 814 and 821.
Regarding instances 820 and 821, each contains the following instance-mentions:
FIG. 8F is the same as FIG. 8E, discussed above in Section 4.2.5 (“Example Instance Merging”), except that the only instances remaining are those whose input role value sufficiently matches a query. For this example, the query is “heart attack.”
Specifically, all the instances of FIG. 8E are kept in FIG. 8F, except for instances 812 and 814. The CONDITION role values of instances 812 and 814 (i.e., lexical units “strokes” and “stroke”), are clearly not a match to the query “heart attack.”
Thus, the instances of FIG. 8F can all be part of a search result for treatments to heart attacks.
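The progression from FIG. 8D through FIG. 8E to FIG. 8F can be sketched as a merge followed by a selection. This is a simplified illustration rather than the merging procedure itself: the merge key (identical condition and treatment strings) is an assumption, the treatment role values shown are hypothetical, and the instance numbers are used only as labels for instance-mentions. A practical merge would also need to recognize differently worded role values that express the same treatment.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Instance:
    condition: str                      # CONDITION_ROLE value (input role)
    treatment: str                      # TREATMENT_ROLE value (output role)
    mentions: List[int] = field(default_factory=list)  # labels of instance-mentions

def merge(instances: List[Instance]) -> List[Instance]:
    """Merge instances whose role values are identical, ignoring case (assumed rule)."""
    merged: Dict[Tuple[str, str], Instance] = {}
    for inst in instances:
        key = (inst.condition.lower(), inst.treatment.lower())
        if key in merged:
            merged[key].mentions.extend(inst.mentions)
        else:
            merged[key] = Instance(inst.condition, inst.treatment, list(inst.mentions))
    return list(merged.values())

def select(instances: List[Instance], query: str) -> List[Instance]:
    """Keep instances whose input role value matches the query (exact match assumed)."""
    return [inst for inst in instances if inst.condition.lower() == query.lower()]

# Treatment values here are hypothetical stand-ins for the role values of FIG. 8D.
superset = [
    Instance("heart attack", "aspirin", [810]),
    Instance("heart attack", "aspirin", [811]),
    Instance("strokes", "aspirin", [812]),
    Instance("heart attack", "reduce cholesterol", [813]),
    Instance("stroke", "reduce cholesterol", [814]),
    Instance("heart attack", "reduce cholesterol", [815]),
]
merged_superset = merge(superset)
search_result = select(merged_superset, "heart attack")
```

With these assumed inputs, six instances merge into four, and selection leaves two instances that each carry two instance-mentions.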
The role-value oriented approach, to Search Result presentation, can be illustrated with the example Search Result fragment of FIG. 8F. This search result is comprised of two instances (820 and 821), each of which contains two instance-mentions. For instance 820, an appropriate role-value-oriented search result is “aspirin,” since it cannot be simplified and it is the same role value for both its instance-mentions. For instance 821, an appropriate role-value-oriented search result is “reduce cholesterol,” which is a commonality or summarization of the role values of its instance-mentions. An example presentation, that includes both of these role-value-oriented search results, is shown in FIG. 3A (discussed above in Section 2.3 “Healthcare-related”), where “Aspirin” is the first result under “Drugs and Medications” and “Reduce cholesterol” is the second result under “Other Treatments.”
An example, of being able to see the snippets forming the basis of a role-value-oriented search result, is shown in FIG. 9A. FIG. 9A shows the search result “Aspirin” as having been expanded. As can be seen, the expansion of Aspirin has resulted in two snippets (indicated in FIG. 9A as 920 and 921) being displayed. Rather than displaying the full snippet, FIG. 9A just shows the “focus sentence” for each (see Section 7.4 “Snippet Formation” for a discussion of “focus sentence”). For each of 920 and 921, the portion that corresponds to the role-value-oriented search result is emphasized by solid underlining, while the portion that matches the input query is emphasized by dotted underlining.
For example, in the case of a frame-based HRSE, it can be useful to group the treatments, found for a condition, according to type. Example types, into which treatments can be grouped, include the following: “Drugs and Medications,” “Foods and Plants,” and “Other Treatments.” An example use of these three types, for grouping potential treatments for “heart attack,” is shown in FIG. 3A.
In the example of the treatments for a heart attack, as shown in FIG. 3A, it was not useful to use frame-type grouping because all the search results are based upon instantiation of just one frame: the Treatment Frame.
However, for the example of finding the pros and cons of using aspirin, as shown in part 902 of FIG. 9B, the following two frame types were used: Pro Frame and Con Frame. As can be seen in FIG. 9B, role-value-oriented search results instantiated from the Pro Frame are grouped under a “Pros” heading while the role-value-oriented search results instantiated from the Con Frame are grouped under a “Cons” heading.
Part 903, of FIG. 9B, is a definition of the treatment (in this case, aspirin) upon which the pros and cons search was performed. The automated generation of definitions is addressed in Section 5.2.4.1 (“Descriptor”) of the '068 Application. The '068 Application addressed the generation of definitions for a “technology.” However, essentially the same process can be used to generate a definition for a treatment (or a condition).
An FBSE that performs brand research, in the manner of the example of Section 2.4 (“Brand Research”), can be constructed very similarly to a technology profiling FBSE, as discussed above in Section 4.1 (“Technology Profiling”). Rather than producing a profile of a technology, the profile of a “brand” is produced instead.
To determine the pros and cons of a brand, essentially the same techniques can be used as those that were described for finding the pros and cons of a technology. Finding pros and cons of a technology is discussed in the following sections:
For any brand-related frame extraction rule, rather than use the feature TECHNOLOGY, as is used, for example, in the example frame extraction rule of Section 6.1.2.1 (“Benefits Frame”), a feature called BRAND can be substituted. It is possible to produce a useful brand research system where the definition, for the BRAND feature, is essentially the same as that given (see, for example, Section 7.3 “Features”) for TECHNOLOGY.
In addition to a Pros Frame and a Cons Frame, additional frames, that can be useful for brand research, include:
This section (i.e., Section 5) addresses how a search result can be produced using frames, where such search result uses the knowledge (or semantics) expressed in the corpus of natural language (or “Source Corpus”) that is searched.
In general, a frame is a structure for representing a concept, wherein such concept is also referred to herein as a “Frame Concept.” A frame specifies a concept in terms of a set of “roles.” Any type of concept can be represented by a frame, as long as the concept can be meaningfully decomposed (or modeled), for the particular application, by a set of roles. FIG. 13 shows a generic frame comprised of two roles. A frame can be referred to by a unique label called a “Frame Name” (in FIG. 13, a location, where a Frame Name can be stored, is shown as <Frame Name>). Each role of a frame can be represented by a collection of attributes, called herein “role attributes.” For the example frame of FIG. 13, each role is shown as having three attributes: role name, role value representation and role type.
The attribute “role name” stores a label for a role that is unique (at least within its frame). In FIG. 13, the two locations, where a role name can be stored, are shown as <role name 1> and <role name 2>.
A role's value requires some kind of representation, referred to herein as its “role value representation” (in FIG. 13, the two locations, where the role value representation can be specified, are shown as <rv rep 1> and <rv rep 2>).
Depending upon the role, and its function in representing a frame's Frame Concept, a particular “type” (or types) of role value can be assigned to it. Thus, among the full set of values that could otherwise be assigned to a role, a “role type” serves to limit the set of permissible values. The type of a role value can be specified by one or more attributes. In the example of FIG. 13, each role is shown as having one type-specifying role attribute: <type 1> for Role 1 and <type 2> for Role 2.
A set of frames, that serves as the semantic basis of a frame-based search, can be called the search's “Frame Set.” Example Frame Sets, for the example search types discussed herein, are presented below in Section 6 (“Example Frame Sets”).
A particular “invocation” (see below Glossary of Selected Terms for definition) of a Frame Concept, by a “UNL” (see below Glossary of Selected Terms for definition), can be represented by an “instance” of the frame (also called a “frame instance”). A frame instance is the same as the frame itself, except that, for each role, a value (also referred to herein as a “role value”) has been assigned (in FIG. 13, the two locations, where a role value can be stored, are shown as <role value 1> and <role value 2>). These role values (usually drawn from the invoking UNL) represent the specifics of how the Frame Concept is used at a particular location in a Source Corpus.
Identification, of when a frame's Frame Concept is invoked by a UNL, can be determined by a set of linguistic rules, each rule herein called a “frame extraction rule.” A set of frame extraction rules, that all relate to a particular frame, can be called the frame's “Rule Set.” Ideally, a frame's Rule Set is able to detect whenever the frame's Frame Concept is invoked, and thereby produce a frame instance representing each particular use of the Frame Concept. “Frame extraction,” as used herein, refers to the utilization of a frame extraction rule to determine whether a frame is invoked by a UNL.
Example frame extraction rules, for the example search types discussed herein, are presented below in Section 6 (“Example Frame Sets”).
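The frame and frame instance structures described above can be sketched as simple data types. This is one possible representation, not the structure of FIG. 13 itself: the class names and field names are assumptions, and the representation and type strings in the example are hypothetical. The role names CONDITION_ROLE and TREATMENT_ROLE are those of the Treatment Frame discussed earlier.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass(frozen=True)
class Role:
    name: str                  # role name, unique within its frame
    value_representation: str  # how a role value is represented
    role_type: str             # limits the set of permissible role values

@dataclass(frozen=True)
class Frame:
    frame_name: str
    roles: Tuple[Role, ...]

@dataclass
class FrameInstance:
    frame: Frame
    role_values: Dict[str, str]  # role name -> role value

    def __post_init__(self) -> None:
        # A frame instance assigns a value to each role of its frame.
        expected = {role.name for role in self.frame.roles}
        if set(self.role_values) != expected:
            raise ValueError(f"role values must cover exactly: {sorted(expected)}")

# Representation and type strings below are hypothetical placeholders.
treatment_frame = Frame(
    frame_name="Treatment Frame",
    roles=(
        Role("CONDITION_ROLE", "text fragment", "condition"),
        Role("TREATMENT_ROLE", "text fragment", "treatment"),
    ),
)
instance = FrameInstance(
    treatment_frame,
    {"CONDITION_ROLE": "heart attack", "TREATMENT_ROLE": "aspirin"},
)
```

Keeping the frame immutable and validating role coverage in the instance is a design choice of this sketch, made so that every instance is structurally identical to its frame apart from the assigned values.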
A Frame-Based Search Engine (FBSE), that accepts a user's query and outputs a search result, can be described as operating in three main steps (see FIG. 14A):
Instance Generation is performed before the steps of Instance Merging or Instance Selection. Instance Merging and Instance Selection, however, can be performed in either order, depending upon the particular application. For example, if the ordering is Instance Merging 1420 followed by Instance Selection 1430, then the input to Instance Merging 1420 is Instance Superset 1405 and the input to Instance Selection 1430 is Merged Superset 1406. Alternatively, if the ordering is Instance Selection 1430 followed by Instance Merging 1420, then the input to Instance Selection 1430 is Instance Superset 1405 and the input to Instance Merging 1420 is Search Result 1404 (with Merged Superset 1406, produced by Instance Merging 1420, serving as the actual search result).
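The two permissible orderings of the main steps can be sketched as follows. The function `run_fbse` and its parameters are hypothetical, and the three stand-in step functions are deliberately trivial: they exist only to show the data flow, with instances represented as assumed (input role value, output role value) pairs.

```python
from typing import Callable, List, Tuple

Instance = Tuple[str, str]  # hypothetical (input role value, output role value) pair

def run_fbse(source_corpus: List[Instance],
             generate: Callable[[List[Instance]], List[Instance]],
             merge: Callable[[List[Instance]], List[Instance]],
             select: Callable[[List[Instance]], List[Instance]],
             merge_first: bool = True) -> List[Instance]:
    """Run Instance Generation, then Merging and Selection in the chosen order."""
    instance_superset = generate(source_corpus)     # always performed first
    if merge_first:
        merged_superset = merge(instance_superset)  # Instance Merging
        return select(merged_superset)              # Instance Selection
    search_result = select(instance_superset)       # Instance Selection
    return merge(search_result)                     # Instance Merging

# Trivial stand-ins for the three steps (assumptions for illustration only).
def generate(corpus: List[Instance]) -> List[Instance]:
    return list(corpus)

def merge(instances: List[Instance]) -> List[Instance]:
    return list(dict.fromkeys(instances))           # drop exact duplicates, keep order

def select(instances: List[Instance]) -> List[Instance]:
    return [inst for inst in instances if inst[0] == "heart attack"]

corpus = [("heart attack", "aspirin"), ("stroke", "aspirin"), ("heart attack", "aspirin")]
result_merge_first = run_fbse(corpus, generate, merge, select, merge_first=True)
result_select_first = run_fbse(corpus, generate, merge, select, merge_first=False)
```

In this toy case both orderings give the same result. Selecting first can reduce the number of instances that merging must examine, which may matter when the Instance Superset is large.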
Production of a search result can be accomplished by using, for example, the computing environment described in Section 7.5.
The Instance Superset can be generated in accordance with any suitable technique, depending on the particular application. While the principles described herein can be applied to a small Source Corpus, this section (Section 5.2) will focus upon instance generation where the Source Corpus is large. Small and large Source Corpora are defined as follows:
Instance Generation is described below in conjunction with FIG. 14B. FIG. 14B represents a possible expansion of Instance Generation process box 1410 of FIG. 14A. In FIG. 14B, process box 1410 is represented by a dashed outline 1410, said outline 1410 receiving the same inputs, and producing the same outputs, as process box 1410 of FIG. 14A.
FIG. 14B divides process 1410 into two main phases: pre-query processing (enclosed in dotted outline 1416) and post-query processing (enclosed in dotted outline 1417). Each of these phases is described below.
The objective of pre-query processing is to produce a "Frame-Based DataBase" (FBDB) from the Source Corpus. An FBDB means that a Source Corpus has been analyzed for where (if at all) certain concepts are used within it. The concepts, for which the Source Corpus is analyzed, are the Frame Concepts of the Organizing Frames of the FBDB. An FBDB may be produced for just one Frame Concept as represented by one Organizing Frame.
Production of an FBDB means that, at least, an index has been produced. The index permits the fast location of occurrences, in the Source Corpus, of concepts modeled by the Organizing Frames.
Thus, in FIG. 14B, pre-query processing 1416 is represented by the production of an index 1412, by an index generation process 1411, given the inputs of a Source Corpus 1402 and frame extraction rules 1403.
Pre-query processing can be divided into the following two main operations:
Each of these operations is described below.
A large Source Corpus can be processed on a UNL-by-UNL basis (e.g., on a sentence-by-sentence basis) to produce an FBDB. For each UNL ("UNL_current") processed, each potentially applicable frame extraction rule ("rule_current") is evaluated for whether it is invoked by UNL_current to produce an instance ("I_current").
Whether a rule_current is evaluated depends upon the FBDB being generated and the frame(s) such FBDB includes as its Organizing Frames.
To determine whether a rule_current applies to a UNL_current, each UNL_current can be converted, by a semantic parser, into a representation known as "Logical Form." Logical Form is described in greater detail below in Section 7.1 ("Logical Form"). To present more detailed definitions, of example frame extraction rules, a pseudo-coded representation is defined below in Section 7.2 ("Frame Extraction Rules").
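By way of illustration only, the UNL-by-UNL processing described above can be sketched in Python as follows. The sketch assumes that each UNL is a sentence, and it substitutes a simple regular expression for a Logical Form-based frame extraction rule. The names split_into_unls, toy_benefit_rule and generate_instances are hypothetical and do not correspond to any figure.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Instance:
    frame: str                    # name of the frame that was invoked
    role_values: Dict[str, str]   # role name -> role value
    unl: str                      # the invoking UNL (here, a sentence)


# A frame extraction rule is modeled as a function from a UNL to an
# Instance (rule fired) or None (rule not invoked).
Rule = Callable[[str], Optional[Instance]]


def toy_benefit_rule(unl: str) -> Optional[Instance]:
    """Assumed stand-in for a real rule, which would test a Logical Form."""
    m = re.search(r"^(?P<tech>.+?) (?:improves|enhances) (?P<benefit>.+?)\.?$", unl)
    if m is None:
        return None
    return Instance(
        frame="Benefit Frame TP",
        role_values={"Technology_Role": m.group("tech"),
                     "Benefit_Role": m.group("benefit")},
        unl=unl,
    )


def split_into_unls(corpus: str) -> List[str]:
    """Naive sentence splitter; a real system would use a proper segmenter."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", corpus) if s.strip()]


def generate_instances(corpus: str, rules: List[Rule]) -> List[Instance]:
    instances = []
    for unl_current in split_into_unls(corpus):
        for rule_current in rules:
            i_current = rule_current(unl_current)
            if i_current is not None:
                instances.append(i_current)
    return instances
```

In this sketch, which rules are placed in the list passed to generate_instances plays the part of selecting the Organizing Frames of the FBDB.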
5.2.2.3 Pre-Query Indexing
For each UNL ("UNL_current") that produces an instance I_current, as a result of the UNL-by-UNL processing, an amount of content, referred to herein as a "snippet," that at least includes UNL_current, can be indexed for an FBDB. The index then makes it possible, during post-query processing, to provide a suitably fast response to a user's query.
Design of a Pre-query Indexing process involves the following choices:
For keyword indexing, any kind of conventional keyword index can be produced. For this type of index, each word of each snippet, except for "stop words" (see below Glossary of Selected Terms for definition), can be indexed.
For frame indexing, the following can be performed. Each time a focus UNL causes the invocation of a frame "F_1," to produce an instance "I_1," each word, of the role value of I_1's input role, can be indexed.
A frame index is likely to yield fewer snippets, in response to a user's query, than a keyword index. If the topic sought for searching is thoroughly discussed by the Source Corpus (e.g., it is a well-known condition or treatment), then utilization of a frame index will likely yield sufficient results. If the topic sought for search is infrequently discussed by the Source Corpus (e.g., it is a rare condition), then a frame index may not produce sufficient results. For such infrequently-referenced topics, a user may want to apply a keyword index as an addition to, or instead of, a frame index.
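The difference between the two index types can be illustrated with the following minimal Python sketch. It assumes an inverted index that maps each word to a set of snippet identifiers, a small illustrative stop word list, and that stop words are dropped for both index types (the description above states this only for the keyword index). All function names are hypothetical.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "an", "the", "of", "is", "for", "to"}  # illustrative only


def tokenize(text):
    """Lower-case words of `text`, excluding stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]


def build_keyword_index(snippets):
    """Index every non-stop word of every snippet."""
    index = defaultdict(set)
    for snippet_id, snippet in enumerate(snippets):
        for word in tokenize(snippet):
            index[word].add(snippet_id)
    return index


def build_frame_index(instances):
    """Index only the words of each instance's input-role role value.

    `instances` is a list of (snippet_id, input_role_value) pairs, one per
    instance produced during UNL-by-UNL Preprocessing.
    """
    index = defaultdict(set)
    for snippet_id, input_role_value in instances:
        for word in tokenize(input_role_value):
            index[word].add(snippet_id)
    return index
```

Because the frame index covers only snippets in which a frame was invoked with the indexed word in its input role, a lookup in it returns a subset of what the keyword index returns for the same word.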
Alternation, between UNL-by-UNL Preprocessing and Pre-query Indexing, can vary depending upon the particular application and the particular desired characteristics of the FBSE. For example, essentially all UNL-by-UNL Preprocessing can be completed first, before Pre-query Indexing is begun. As another example approach, pre-query processing can alternate (between UNL-by-UNL Preprocessing and Pre-query Indexing) on a UNL-by-UNL basis.
The objectives of post-query processing (shown within dotted outline 1417 of FIG. 14B) are twofold:
Each of these operations is further described below.
Production of a reduced corpus can also be called production of a query-selective corpus, since a query is the basis by which to select limited content from a Source Corpus.
Once produced, as described above (see Section 5.2.2.3 "Pre-query Indexing"), an index or indexes can be used (by, for example, Corpus Reduction process 1413 of FIG. 14B) to retrieve a set of snippets (also called a Reduced or Query Selective Corpus 1414), from an appropriate FBDB, that are likely to be of relevance for the search result to be generated. Such snippet selection represents a rapid way to reduce a large Source Corpus to a smaller Query Selective Corpus (such as the retrieved set of snippets) that can be processed in the time available for responding to a query (e.g., within a delay period that is acceptable to a user of the FBSE).
For example, when used with a keyword index, the query can be decomposed into a set of its constituent lexical units, excepting any stop words (while well known in the art, a definition of stop words is presented in the Glossary). Such set of lexical units is called herein a "constituent lexical unit set." Standard techniques can then be used, that access the keyword index with each member of the constituent lexical unit set and produce an initial set of snippets for a Query Selective Corpus. In essentially the same manner as for a keyword index, a frame index can be accessed, with the constituent lexical unit set of the query, to produce a Query Selective Corpus.
The snippets of the Query Selective Corpus can be ranked in order of decreasing match quality to the query (e.g., Query 1401). If the Query Selective Corpus is too large, only the first "n" snippets can be kept for further processing. As an example value for "n," only the first 3,000 snippets can be kept in the Query Selective Corpus.
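A minimal Python sketch of such corpus reduction follows. The measure of match quality is not specified above; the sketch assumes, purely for illustration, that a snippet's match quality is the number of constituent lexical units of the query that it contains. The names constituent_lexical_units and reduce_corpus are hypothetical.

```python
from collections import Counter

STOP_WORDS = {"a", "an", "the", "of", "is", "for", "to"}  # illustrative only


def constituent_lexical_units(query):
    """Decompose a query into its lexical units, excepting stop words."""
    return {w for w in query.lower().split() if w not in STOP_WORDS}


def reduce_corpus(query, index, n=3000):
    """Return up to `n` snippet ids, best match first.

    `index` maps a lexical unit to the set of ids of snippets containing it
    (a keyword index or a frame index can be used in the same manner).
    """
    hits = Counter()
    for unit in constituent_lexical_units(query):
        for snippet_id in index.get(unit, ()):
            hits[snippet_id] += 1
    # Assumed ranking: more matching units first, ties broken by snippet id.
    ranked = sorted(hits, key=lambda sid: (-hits[sid], sid))
    return ranked[:n]
```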
A similar process to that described above (see Section 5.2.2.2, "UNL by UNL Preprocessing"), which is applied to a Source Corpus, can be applied (by Instance Generation process 1415) to the Query Selective Corpus. The Query Selective Corpus can be processed on a UNL-by-UNL basis (e.g., on a sentence-by-sentence basis). For each UNL ("UNL_current") processed, each potentially applicable frame extraction rule ("rule_current") is evaluated for whether it is invoked by UNL_current to produce an instance ("I_current"). Each I_current produced can be added to the Instance Superset (such as Instance Superset 1405 of FIG. 14A).
The set of instances to which Instance Merging is applied can be the Instance Superset, if just Instance Generation (as described above in Section 5.2) has been performed. Alternatively, Instance Merging can be applied to a Search Result, if Instance Generation and Instance Selection (described below in Section 5.4) have been accomplished. For either case, in this section (Section 5.3), the input set of instances shall be referred to as the "Instance Set."
The Instance Merging described in this section assumes that each member, of the Instance Set, has just two roles. A subset of an Instance Set (called "Subset_1") can have its members merged together when such subset satisfies the following two conditions:
Each member of Subset_1 is instantiated from a same frame; and
For each unique role, when considered across all members of Subset_1, the same (or sufficiently similar) value has been assigned.
If all members of a Subset_1 are merged together, the subset can be regarded, for purposes of producing a search result, as a single instance. Each member, of a merged Subset_1, can be referred to as an instance-mention.
FIG. 16 presents an example pseudo-coded (see below Glossary of Selected Terms for definition) procedure, called "Instance_Merge," for performing Instance Merging on an Instance Set called "Instance_Set." The pseudo-code operates as follows, with line numbers being references to FIG. 16.
Instance_Set is assumed to have internal state (referred to herein as "sequence-state") whereby a function, such as "Next_Pair" (line 7), is able to generate a sequence of possible instance pairs of Instance_Set. Instance_Merge begins by re-setting the sequence-state of Instance_Set with a call to "Reset_Next_Pair" (line 4).
A "while" loop is then begun (line 7), that continues to execute while Next_Pair returns TRUE. Each call to Next_Pair causes the following. A pair of instances is selected from Instance_Set and assigned to: Instance_1 and Instance_2. For a given set of instances in Instance_Set, Next_Pair is defined as returning a sequence (one pair per invocation) of the possible instance pairs.
First and second tests are then performed which, if both satisfied, result in a merging of Instance_1 and Instance_2 by "Merge_Instances" (line 19). Merge_Instances is defined as replacing its two arguments, in Instance_Set, with the merger of Instance_1 and Instance_2. Such modification of Instance_Set creates the possibility for a new sequence of possible instance pairs. For this reason, following Merge_Instances, the sequence-state is reset by a call to Reset_Next_Pair (line 20).
The first test checks for whether Instance_1 and Instance_2 were produced from the same frame by calling "Same_Frame" (line 10). Same_Frame is defined to return TRUE if the instances are from a same frame. If the first test is satisfied, a second test checks for whether the corresponding roles, of Instance_1 and Instance_2, have sufficiently similar values. The second test is performed by two calls to "Match_Role_Values" (lines 13-14 and 16-17). In the first call to Match_Role_Values (lines 13-14), the role value assigned to the input role of Instance_1 is compared to the role value assigned to the input role of Instance_2. In the second call to Match_Role_Values (lines 16-17), the role value assigned to the output role of Instance_1 is compared to the role value assigned to the output role of Instance_2.
A discussion of Match_Role_Values is presented in the following section.
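For readers who prefer executable code to pseudo-code, the behavior described above for Instance_Merge can be approximated by the following Python sketch. It is not a transcription of FIG. 16. The sketch assumes two-role instances, replaces Next_Pair and Reset_Next_Pair with a restart of an itertools.combinations sequence after each merge, and uses a placeholder match_role_values (case-insensitive equality). The mentions field, which collects the instance-mentions of a merged instance, is a hypothetical addition.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import List


@dataclass(eq=False)  # identity-based equality, so list.remove() is exact
class Instance:
    frame: str
    input_value: str
    output_value: str
    mentions: List[str] = field(default_factory=list)  # e.g. source snippets


def match_role_values(a: str, b: str) -> bool:
    """Placeholder test of 'sufficiently similar' role values."""
    return a.strip().lower() == b.strip().lower()


def merge_instances(i1: Instance, i2: Instance) -> Instance:
    """Merge two instances, keeping the role values of the first."""
    return Instance(i1.frame, i1.input_value, i1.output_value,
                    mentions=i1.mentions + i2.mentions)


def instance_merge(instance_set: List[Instance]) -> List[Instance]:
    instances = list(instance_set)
    merged_any = True
    while merged_any:  # restart the pair sequence after every merge
        merged_any = False
        for i1, i2 in combinations(instances, 2):
            if (i1.frame == i2.frame
                    and match_role_values(i1.input_value, i2.input_value)
                    and match_role_values(i1.output_value, i2.output_value)):
                instances.remove(i1)
                instances.remove(i2)
                instances.append(merge_instances(i1, i2))
                merged_any = True
                break
    return instances
```

Restarting the pair sequence after each merge mirrors the reset of the sequence-state: the merged instance must itself be compared against every remaining instance.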
Matching, between role values, depends on the type of representation to be compared. The two main role value representations addressed herein are:
Matching, between role values, where each role value is treated as one or more lexical units, can proceed as follows.
If the two role values are identical, a match can be indicated.
If one role value is determined to be a substring of the other role value, or if both role values are determined to share a sufficiently substantial substring, a match can be indicated. Any suitable techniques for substring matching, known in the art, can be used. For example, the phrases "the fuel cell technology," "the fuel cell application" and "the fuel cell software" can all be regarded as sharing a sufficiently substantial substring, such that all can be regarded as referring to a "fuel cell."
If one role value is determined to be an acronym of the other, or if both role values are determined to be acronyms of a common term, a match can be indicated. Any suitable techniques for acronym matching, known in the art, can be used. For example, if one role value is "natural language processing," the role value "NLP" could be regarded as matching.
Each role value RV_1 can be replaced by a set of role values RVS_1, where each member of RVS_1 is believed to mean the same as RV_1, by a process called "lexical expansion." In lexical expansion, the following operation can be performed on any combination of the lexical unit or units forming RV_1: for each lexical unit, within RV_1, it can be replaced by another lexical unit that is known to be synonymous. For example, if a role value "fuel cell" is to be matched, the lexical unit "battery" could replace the lexical unit "cell." Such replacement would mean that the role value "fuel battery" could be regarded as matching the role value "fuel cell."
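The lexical-unit matching tests described above (identity, shared substring, acronym and lexical expansion) can be combined as in the following Python sketch. The thresholds and lexicons are assumptions chosen for illustration: lexical units are whitespace-separated words, a shared run of two or more units is treated as "sufficiently substantial," and SYNONYMS is a hypothetical one-entry synonym lexicon. The case where both role values are acronyms of a common term is not covered.

```python
from itertools import product

STOP_WORDS = {"the", "a", "an"}                        # illustrative only
SYNONYMS = {"cell": {"battery"}, "battery": {"cell"}}  # hypothetical lexicon


def units(value):
    """Split a role value into lower-cased lexical units, dropping stop words."""
    return [u for u in value.lower().split() if u not in STOP_WORDS]


def is_acronym(short, long):
    """True if `short` is formed from the initial letters of the units of `long`."""
    words = units(long)
    return len(words) > 1 and short.lower() == "".join(w[0] for w in words)


def shares_substring(a, b, min_units=2):
    """True if a and b share a long enough contiguous run of lexical units."""
    ua, ub = units(a), units(b)
    best = 0
    for i in range(len(ua)):
        for j in range(len(ub)):
            k = 0
            while i + k < len(ua) and j + k < len(ub) and ua[i + k] == ub[j + k]:
                k += 1
            best = max(best, k)
    shorter = min(len(ua), len(ub))
    # Either a substantial shared run, or the shorter value is wholly contained.
    return best > 0 and (best >= min_units or best == shorter)


def lexical_expansion(value):
    """All variants of `value` obtained by substituting known synonyms."""
    options = [sorted({u} | SYNONYMS.get(u, set())) for u in units(value)]
    return {" ".join(choice) for choice in product(*options)}


def match_role_values(a, b):
    if units(a) == units(b):
        return True
    if is_acronym(a, b) or is_acronym(b, a):
        return True
    if shares_substring(a, b):
        return True
    return bool(lexical_expansion(a) & lexical_expansion(b))
```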
Matching, between two role values, where each role value is represented as (or converted into) a Logical Form, can proceed as follows.
Matching can begin at the root node of each role value. Each corresponding pair of nodes can be selected by traversing each Logical Form in any appropriate order (e.g., depth first or breadth first). Each node, of a pair of corresponding nodes, has a fragment (comprised of one or more lexical units) of the UNL that triggered creation of the Logical Form. In a manner similar to that discussed above (Section 5.3.3.1 "Lexical Unit"), the pair of textual fragments can be compared.
As long as identity, or sufficient identity, is determined between each pair of textual fragments traversed, the two Logical Forms are considered to match.
As an example, two Logical Forms may only be traversed from the root (which can represent a verb) to the direct children (that can represent the object of the verb). For example, the phrases "increased density" and "increased bone density" will appear the same if the Logical Form for each is only compared from the root to the direct child. In each case, the root is the verb "increased" and the object (at the direct child level) is "density." The modifier "bone," for the phrase "increased bone density," appears in the Logical Form at the grandchild level.
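The depth-limited comparison of the preceding example can be sketched in Python as follows. The sketch assumes that corresponding nodes are paired by a constituent label (the labels "ROOT," "Undergoer" and "Modifier" are illustrative), that textual fragments are compared by case-insensitive equality, and that a node lacking a counterpart within the compared depth counts as a mismatch. The names LFNode and logical_forms_match are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LFNode:
    constituent: str                 # e.g. "ROOT", "Undergoer", "Modifier"
    text: str                        # fragment of the UNL assigned to the node
    children: List["LFNode"] = field(default_factory=list)


def logical_forms_match(n1: LFNode, n2: LFNode, max_depth: int) -> bool:
    """Compare two Logical Forms from the root down to `max_depth` levels.

    max_depth=0 compares only the roots; max_depth=1 also compares the
    direct children; and so on.
    """
    if n1.text.lower() != n2.text.lower():
        return False
    if max_depth == 0:
        return True
    c1 = {c.constituent: c for c in n1.children}
    c2 = {c.constituent: c for c in n2.children}
    if c1.keys() != c2.keys():
        return False
    return all(logical_forms_match(c1[k], c2[k], max_depth - 1) for k in c1)
```

With the two phrases of the example, a comparison limited to the direct children reports a match, while a comparison one level deeper does not, because "bone" has no counterpart.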
Once a suitable Instance Set has been generated, selection, of those instances corresponding to an input query, can be accomplished. As discussed above, such selection can be accomplished before or after Instance Merging.
FIG. 17 depicts an example pseudo-coded procedure, called "Instance_Select," for accomplishing Instance Selection. Line numbers, referenced below, refer to FIG. 17. The inputs to Instance_Select are as follows:
The outputs of Instance_Select are as follows:
Instance_Select begins by setting (at line 4) the output set (i.e., Output_Rep_Set), of query (or input role value) representations, to contain just the query passed to the procedure by Input_Rep.
Next, a "while-loop" is begun (line 7). The while-loop iterates through the instances of Input_Instance_Set by successively calling "Next_Instance." Next_Instance sets Instance_Current to a next instance of Input_Instance_Set, and Next_Instance sets state, associated with Input_Instance_Set, such that, after sufficient calls to Next_Instance, each instance of Input_Instance_Set has been assigned to Instance_Current.
Within each iteration of the while-loop, a "for-loop" is begun (line 9). The for-loop iterates through the query values stored in Output_Rep_Set, setting each such representation to Current_Rep. For each iteration of the for-loop, Match_Role_Values (line 10) compares the role value of the input role of Instance_Current with the query value assigned to Current_Rep. If the values are the same, or sufficiently similar, lines 11-16 are executed. Lines 11-16 perform the following.
The Instance_Current is moved from the Input_Instance_Set to the Output_Instance_Set (line 11). Next, a test is made of whether the two values, just compared by Match_Role_Values, are represented in exactly the same way (line 12). If the two representations are not exactly the same, then the input role value assigned to Instance_Current appears to represent a broadening of the set of possible representations for the query, and lines 13-14 are executed. Lines 13-14 perform the following.
The alternate representation of the query, assigned to Instance_Current, is added to the set Output_Rep_Set (line 13). Also, the iteration of the while-loop, through the instances of Input_Instance_Set, is reset (line 14). The reset is performed because the new representation of the query, added to the set Output_Rep_Set, means that each instance not previously added to Output_Instance_Set, because its input role value did not match the set of query representations of Output_Rep_Set, might now match the newly-added query representation.
Regardless of whether the two values match exactly, once a match has been determined, it is known that the for-loop need no longer be executed, since Instance_Current has already been added to the Output_Instance_Set. Therefore the for-loop is ended (line 16).
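The behavior described above for Instance_Select can be approximated in Python as shown below. This is a sketch written from the prose description, not a transcription of FIG. 17. The matching function is passed in as a parameter, and the substring-based substring_match is only a toy stand-in for Match_Role_Values. The Instance class and all function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Instance:
    frame: str
    input_value: str
    output_value: str


def substring_match(a: str, b: str) -> bool:
    """Toy stand-in for Match_Role_Values: one value contains the other."""
    a, b = a.lower(), b.lower()
    return a in b or b in a


def instance_select(
    input_rep: str,
    input_instance_set: List[Instance],
    match_role_values: Callable[[str, str], bool],
) -> Tuple[List[Instance], List[str]]:
    remaining = list(input_instance_set)
    output_instances: List[Instance] = []
    output_reps = [input_rep]          # starts with just the query
    i = 0
    while i < len(remaining):
        current = remaining[i]
        matched = next((rep for rep in output_reps
                        if match_role_values(current.input_value, rep)), None)
        if matched is None:
            i += 1
            continue
        # Move the instance from the input set to the output set.
        output_instances.append(remaining.pop(i))
        if current.input_value != matched:
            # A new representation of the query broadens the search, so
            # previously rejected instances are examined again.
            output_reps.append(current.input_value)
            i = 0
    return output_instances, output_reps
```

Each reset is accompanied by the removal of one instance from the remaining input, so the loop terminates.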
Once a Search Result has been produced, any appropriate technique(s) can be used to achieve a more effective presentation, to the user, of the instances of which it is comprised. This Section presents several example techniques. Any combination of the following techniques can be used, depending upon the particular application.
It is often useful to present to a user a Search Result that emphasizes the role values of the output roles. The usefulness of this presentation approach arises from the fact that it emphasizes the information the user is seeking. Also, because a single instance can represent multiple records (or snippets) that have matched a query, it also presents a more compact search result that a user can review more quickly.
A role-value oriented presentation, of a search result, can be achieved as follows: for each instance of the Search Result, its output-role role value is displayed (by using some appropriate character string) as a primary "result" of the search. The character string, representative of the role value of an instance's output role, can be referred to as a "role-value-oriented search result."
When an instance contains only one output-role role value, its role-value-oriented search result can be the same as its output-role role value. However, where an instance of a search result is comprised of multiple instance-mentions, a common (or summarizing) role-value-oriented search result is needed. Any suitable technique can be used to determine a role-value-oriented search result that represents an appropriate commonality and/or summarization of an instance's multiple output-role role values.
Once a user has identified a role-value-oriented search result of particular interest, the user can be provided with an option to view the records (or snippets) on which it is based. For each such snippet displayed, the portion that corresponds to the role-value-oriented search result can be highlighted (or otherwise emphasized). Also, it may be useful to display, in some other highlighted (or emphasized) way, the portion of each snippet that matches the input query.
When viewing a role-value oriented search presentation, it can be useful to group the role-value-oriented search results according to their type.
When viewing a search result, of a frame-based HRSE, it can be useful to group each search result according to the frame from which it is instantiated.
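As one hedged illustration of a role-value oriented presentation grouped by frame, consider the following Python sketch. The summarization technique is left open above ("any suitable technique"); the sketch assumes, purely as an example, that the most frequent output-role role value among an instance's instance-mentions serves as its role-value-oriented search result. The function names and the input layout are hypothetical.

```python
from collections import Counter, defaultdict


def role_value_oriented_result(output_values):
    """Pick the most frequent output-role value (case-insensitive).

    Ties are broken in favor of the value that occurs first.
    """
    counts = Counter(v.lower() for v in output_values)
    best = max(counts.values())
    for v in output_values:
        if counts[v.lower()] == best:
            return v


def present_by_frame(search_result):
    """Group role-value-oriented search results by originating frame.

    `search_result` is a list of (frame_name, output_values) pairs, where
    output_values holds one value per instance-mention of the instance.
    Returns frame_name -> list of (summary string, number of mentions).
    """
    grouped = defaultdict(list)
    for frame, output_values in search_result:
        summary = role_value_oriented_result(output_values)
        grouped[frame].append((summary, len(output_values)))
    return dict(grouped)
```

The number of mentions returned alongside each summary indicates how many records (or snippets) can be shown if the user chooses to view the basis for that result.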
This section describes an example Entity Profile Frame Set, where the entity for profiling is a Technology Candidate. Such Entity Profile Frame Set can be referred to as a "Technology Profile Frame Set."
Frames are first categorized according to their meta-type, which are:
Within each frame meta-type, each frame definition follows the following format:
For just the Benefits Frame, an example frame extraction rule is also presented in a pseudo-coded form. The pseudo-code format is defined in section 7.2 ("Frame Extraction Rules"). An example frame extraction rule, for each of the other frame types, can be found in the '068 Application.
Since the Anchor Role Value, for each frame of this section, is representative of a technology-type entity, all of the example frame extraction rules use a "feature" (see Section 7.3 "Features," for definition of feature) called TECHNOLOGY. An example definition for the TECHNOLOGY feature is also presented in Section 7.3. The example definition of TECHNOLOGY is intended to be broad. In this way, when a Source Corpus is subjected to Entity Profile processing, the set of entities with profiles will be broader and more likely to cover a Technology Candidate of the technology searcher.
The "Benefit Frame TP" is used to answer the question: "what are the benefits of this technology?" Benefit Frame TP (where "TP," when used as part of the name for a frame or a frame extraction rule, means Technology Profiling) is used as part of profiling a Technology Candidate. Compared with the Benefit Frame utilized for technology scouting in the '122 and '127 Applications (that is re-presented below in Section 6.2.1 "Benefit Frame"), Benefit Frame TP is simplified. The Instrument and Benefactor roles, of the Benefit Frame, become the Technology_Role in Benefit Frame TP. Of the other roles of Benefit Frame, just the Benefit role is used in Benefit Frame TP.
An example frame extraction rule for a Benefit Frame TP is shown in FIG. 11A. A general description of the rule's function is as follows. If IMPROVE is identified, with an Actor representing a TECHNOLOGY and an Undergoer, then the IMPROVE, along with the textual parts of its sub-tree, maps to the Benefit_Role and the TECHNOLOGY maps to the Technology_Role.
In terms of the pseudo-code of FIG. 11A, the frame extraction rule can be described, in more detail, as follows. As can be seen, the name for the rule is "IMPROVE_Rule_TP" (line 1). The rule matches a Logical Form where:
Once the conditional part of the IMPROVE_Rule_TP has been determined to fire, its action part can do the following:
FIG. 11B shows an example sentence that can be converted, by a semantic parser, to a Logical Form of FIG. 11C. The Logical Form of FIG. 11C will match the rule of FIG. 11A, producing the instance of FIG. 11D.
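The general function of such a rule can be illustrated with the Python sketch below, which is not a rendering of FIG. 11A. The sketch assumes that the features IMPROVE and TECHNOLOGY are simple word lists (both hypothetical here), that a Logical Form is a tree of nodes labeled with a semantic constituent, and that the value for the Benefit_Role is formed from the root's text followed by the text of the Undergoer sub-tree (the exact extent of the sub-tree text is an assumption).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical feature definitions, for illustration only.
IMPROVE = {"improve", "improves", "enhance", "enhances", "boost", "boosts"}
TECHNOLOGY = {"graphene", "fuel cell", "lidar"}


@dataclass
class LFNode:
    constituent: str                 # "ROOT", "Actor", "Undergoer", ...
    text: str
    children: List["LFNode"] = field(default_factory=list)


def subtree_text(node: LFNode) -> str:
    """Text of a node followed by the text of its descendants."""
    return " ".join([node.text] + [subtree_text(c) for c in node.children])


def child(node: LFNode, constituent: str) -> Optional[LFNode]:
    return next((c for c in node.children if c.constituent == constituent), None)


def improve_rule_tp(root: LFNode) -> Optional[Dict[str, str]]:
    """Return role values for a Benefit Frame TP instance, or None."""
    if root.text.lower() not in IMPROVE:
        return None
    actor, undergoer = child(root, "Actor"), child(root, "Undergoer")
    if actor is None or undergoer is None:
        return None
    if actor.text.lower() not in TECHNOLOGY:
        return None
    return {
        "Technology_Role": actor.text,
        "Benefit_Role": f"{root.text} {subtree_text(undergoer)}",
    }
```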
Name: "Problem Frame TP"
Technology_Role: <Technology Name>
Problem_Role: <Problem Here>
The Problem Frame TP is used to answer the question: "what are the problems with this technology?" Problem Frame TP is used as part of profiling a Technology Candidate. Compared with the Problem Frame utilized for "market scouting" in the '122 and '127 Applications (that is re-presented below in Section 6.2.2 "Problem Frame"), Problem Frame TP is simplified. The Adversary and Method roles, of the Problem Frame, become the Technology_Role in Problem Frame TP. Of the other roles of Problem Frame, just the Problem role (specified as Problem_Role) is used in Problem Frame TP.
Problem Frame TP allows one to identify problems, of the Technology Candidate itself, from users of a technology. (To highlight the differences, between the profiling of a technology and the technology scouting process by which candidate technologies can be identified, it is worth noting that the Problem role, of the Benefit Frame of technology scouting, is used to identify technologies that can solve a problem.) Knowledge of a technology's problems can be helpful for such activities as: the design of a new product or the improvement of an existing product.
The value for the Inventor role describes an entity that has developed or contributed to the development of the Technology. Typically the entity is a person. The Inventors Frame is used to answer the question: "who invents the technology?"
Name: "Experts Frame"
Technology_Role: <Technology Name>
Experts_Role: <Expert Here>
The value of the Expert role describes a person who has been noted for their expertise in the Technology. The Experts Frame is used to answer the question: "who are the experts on this technology?"
This frame is used to answer the question: Who makes or sells the technology?
Name: "Sellers Frame"
Technology_Role: <Technology Name>
Sellers_Role: <Seller Here>
The value of the Seller role describes an entity that sells the Technology. Typically the entity is a company. The Sellers Frame is used to answer the question: "who makes or sells the technology?"
Name: "Users Frame"
Technology_Role: <Technology Name>
Users_Role: <User Here>
The value of the User role describes an entity that uses the Technology. Typical entities can include an organization, person or location. The Users Frame is used to answer the question: "who uses this technology?"
Name: "DerivedProducts Frame"
Technology_Role: <Technology Name>
DerivedProducts_Role: <DerivedProduct Here>
The value of the Derived Products (or Products Based On) role describes a product that is based on the Technology. A product can be a branded commercial product such as "TOYOTA PRIUS" or a product category such as "staplers." The DerivedProducts Frame is used to answer the question: "which products are derived from this technology?"
Name: "Descriptor Frame"
Technology_Role: <Technology Name>
Descriptor_Role: <Definition Here>
The Descriptor Frame is used to produce a definition of the Technology Candidate indicated by <Technology Name>.
Name: "Pros Frame"
Technology_Role: <Technology Name>
Pros_Role: <Pro Here>
The Pros Frame is an example of a Modifier frame type, within the Categorical frame meta-type. It is used to represent GOOD features of the Anchor Entity and, in the case of a Technology Candidate, favorable modifiers of such technology.
Name: "Cons Frame"
Technology_Role: <Technology Name>
Cons_Role: <Con Here>
The Cons Frame is an example of a Modifier Frame type, within the Categorical frame meta-type. It is used to represent BAD features of the Anchor Entity and, in the case of a Technology Candidate, unfavorable modifiers of the technology.
An example HFS, HFS52, that can be used to produce a frame-based HRSE (Section 4.2), is presented in this section (Section 6.2).
In the following Sections 6.2.2 to 6.2.6, each of the five frames of HFS52 is defined. An example frame extraction rule is also presented in Section 6.2.2 for the Treatment Frame. The frame extraction rule is presented in a pseudo-coded form and is then used to produce an example instance from an example input sentence.
The frame extraction rule pseudo-coded format is defined in Section 7.2. Before being tested against a frame extraction rule, an example input sentence is converted (by a semantic parser) into a representation called "Logical Form." The Logical Form format used herein is defined in Section 7.1.
When considering example pseudo-coded frame extraction rules, a "feature" can be identified as follows:
In general, and as depicted by frame 1003 of FIG. 10, a Treatment Frame relates a condition (such as a condition of condition set 1010 of FIG. 10) to a treatment (such as a treatment of treatment set 1011 of FIG. 10).
FIG. 10 depicts Treatment Frame 1003 as comprised of two roles:
II. Condition
An example frame extraction rule for a Treatment Frame is shown in FIG. 12A. A general description of the rule's function is as follows.
Line 1 of FIG. 12A indicates the rule's name is:
First, a node must be found, in the Logical Form matched against the rule, that matches the root Logical Form rule of line 2 of FIG. 12A. If this root Logical Form rule is not matched, then greater computational efficiency can be achieved by avoiding the testing of any other of the rule's Logical Form rules. The root Logical Form rule is satisfied by a Logical Form node "n1," where n1 is the root of a Logical Form and the text represented by n1 matches (the feature) DECREASE. Note that the root Logical Form rule, even if triggered, does not have its action part assign a value to a role (indicated by "no_role" in the Logical Form rule's action part).
If the root Logical Form rule is satisfied, there are two mandatory Logical Form rules:
The frame extraction rule of FIG. 12A also includes an optional Logical Form rule:
FIG. 12B shows an example sentence that can be converted, by a semantic parser, to a Logical Form of FIG. 12C. The Logical Form of FIG. 12C will match the rule of FIG. 12A, producing the instance of FIG. 12D. Specifically, the following is accomplished:
In general, and as depicted by frame 101 of FIG. 1A, a Cause Frame relates a condition (such as a condition of condition set 110 of FIG. 1A) to a potential cause of such condition (such as a cause of cause set 112 of FIG. 1A).
FIG. 1A depicts Cause Frame 101 as being comprised of two roles:
I. Condition
The fact that the values for CONDITION and CAUSE are drawn from a same Condition Lexicon permits a cause to be, itself, input as a condition and its cause found with the Cause Frame. The process, of finding the "cause of a cause," can be applied indefinitely and/or in conjunction with finding the "effect of an effect" (see Section 3.3 "Effect Frame").
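The repeated application of the Cause Frame can be sketched in Python as follows. The sketch assumes that Cause Frame instances have already been reduced to (condition, cause) pairs and that role values are compared by exact equality. A depth limit and a set of visited conditions are added so that circular causes cannot loop forever. The function names are hypothetical.

```python
def causes_of(condition, cause_instances):
    """Causes recorded for `condition` among (condition, cause) pairs."""
    return [cause for cond, cause in cause_instances if cond == condition]


def cause_chain(condition, cause_instances, max_depth=3):
    """Find causes, causes of causes, etc., one list per level."""
    seen = {condition}
    frontier = [condition]
    levels = []
    for _ in range(max_depth):
        next_frontier = []
        for cond in frontier:
            for cause in causes_of(cond, cause_instances):
                if cause not in seen:        # skip circular references
                    seen.add(cause)
                    next_frontier.append(cause)
        if not next_frontier:
            break
        levels.append(next_frontier)
        frontier = next_frontier
    return levels
```

The same traversal, applied to (condition, effect) pairs, would find the "effect of an effect."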
In general, and as depicted by frame 102 of FIG. 1A, an Effect Frame relates a condition (such as a condition of condition set 110 of FIG. 1A) to a potential effect of such condition (such as an effect of effect set 113 of FIG. 1A).
FIG. 1A depicts Effect Frame 102 as being comprised of two roles:
I. Condition
The fact that the values for CONDITION and EFFECT are drawn from a same Condition Lexicon permits an effect to be, itself, input as a condition and its effect found with the Effect Frame. The process, of finding the "effect of an effect," can be applied indefinitely and/or in conjunction with finding the "cause of a cause" (see Section 3.2 "Cause Frame").
Rather than using the more generally known term of "effect," the healthcare profession generally refers to an effect as being either or both of the following: a "complication" or a "symptom." If needed for the particular HRSE, specialized lexicons, such as a Complication Lexicon of complications and a Symptom Lexicon of symptoms, can be used to appropriately categorize an effect.
In general, and as depicted by frame 104 of FIG. 1A, a Pro Frame relates a treatment (such as a treatment of treatment set 111 of FIG. 1A) to a potential "pro" (or positive aspect) of such treatment (such as a pro of pro set 114 of FIG. 1A).
FIG. 1A depicts Pro Frame 104 as being comprised of two roles:
I. Treatment
In general, and as depicted by frame 105 of FIG. 1A, a Con Frame relates a treatment (such as a treatment of treatment set 111 of FIG. 1A) to a potential "con" (or negative aspect) of such treatment (such as a con of con set 115 of FIG. 1A).
FIG. 1A depicts Con Frame 105 as being comprised of two roles:
I. Treatment
II. Con
In general, a Logical Form representation is produced from analysis of a UNL "UNL_current" (where the UNL focused-upon herein is a sentence).
A Logical Form can be produced by what is known as, in the field of natural language processing, a "semantic parser." A Logical Form is intended to represent the semantics of a UNL_current. For this reason, it is desirable to produce a Logical Form that is, as much as possible, "semantically canonical." This means the following:
For example, a semantically canonical semantic parser, when given a passive sentence and an active sentence that both express the same meaning, will translate both sentences, as much as possible, into a same Logical Form.
A Logical Form can be represented as a collection of nodes, with each node representing a particular semantic aspect of a UNL_current. Assigned to each node of a Logical Form can be a fragment, of a UNL_current, closely associated with the semantics represented by such node.
If arranged in a tree form, such nodes (with their links) can be referred to as a "logical dependency tree." Some characteristics, of a dependency tree, are as follows:
Semantic constituents comprise at least the following two types: core and modifier. Core semantic constituents specify key information, such as "who did what to whom." A core semantic constituent is also called (in the field of natural language processing) an "argument." Modifier semantic constituents carry information about other aspects of an action, that are optional or are only sometimes applicable.
Three core semantic constituents, and their definitions, follow:
Example modifier semantic constituents, and the types of questions they answer, include the following:
An important type of logical dependency tree (called herein an "ISA" tree) can be generated for what are called, at a more surface level, copula and appositive structures. Both copula and appositive structures refer to sentence forms that define a term (e.g., a noun phrase) by linking it to a definition (e.g., another noun phrase). For copula structures, the linking is performed by a verb (such as "to be" or "to become"). For appositive structures, the linking is indicated by a syntactic symbol (such as the comma) or by trigger words (such as "namely," "i.e." or "such as").
For an ISA dependency tree, the root node is the noun phrase that is being defined. One of the core semantic constituents is an "ISA" node that indicates a definitional noun phrase.
Examples, that help illustrate the above-listed semantic constituents, follow.
Because the Actor and Undergoer are logical, a passive and an active sentence, which both express the same meaning, will have the same Actor and Undergoer. For example, in both of the following sentences, "exercise" is the Actor and "bone density" is the Undergoer:
For both of the following sentences, "John" is the Actor, "book" is the Undergoer and "Mary" is the Complement:
For the following phrase, "somebody" is the Undergoer and "for something" is the Complement:
The modifier semantic constituent Cause can be identified by searching for particular expressions that are indicative of something being a cause. Such expressions can include: "due to," "thanks to," "because of" and "for the reason of." In one of the above example sentences, depending upon the semantic parser, "by exercise" may also be identified as the Cause for the action "can be enhanced."
FIG. 11C depicts an example Logical Form that can be produced from the example sentence of FIG. 11B, by application (to FIG. 11B) of the frame extraction rule of FIG. 11A. Each line of FIG. 11C represents a node, while the tree structure is indicated by the indentation of the lines. The greater the indentation of a line (i.e., the further a line is from the margin), the further it is from the tree's root. A Logical Form node "LN_1" and a Logical Form node "LN_2" are, respectively, in a parent and child relationship when LN_1 is the first Logical Form node that is both above LN_2 and has a lesser indentation than LN_2. For example, in FIG. 11C, each of lines 2 and 3 specifies a node that is a child of the node specified by line 1, while line 4 is a child of line 3. Logical Form nodes "LN_1" and "LN_2" are in a sibling relationship when the following conditions are satisfied:
Each node of a Logical Form, with the exception of the root node, can be represented by the following two parts:
The root node of a Logical Form can be represented by the following two parts:
In FIG. 11C, line 1 represents the root node, with the root indication being implicit (from the fact that line 1 has the least indentation) and the textual part comprising the text of the line.
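The indentation convention described above for FIG. 11C (a node's parent is the first line above it that has a lesser indentation) can be illustrated with a minimal sketch. The class name LFNode, the function name parse_logical_form, and the placeholder node texts are assumptions made for illustration only; they do not reproduce the content of FIG. 11C or any particular implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LFNode:
    text: str    # textual part of the node (hypothetical representation)
    indent: int  # number of leading spaces on the node's line
    parent: Optional["LFNode"] = field(default=None, repr=False, compare=False)
    children: List["LFNode"] = field(default_factory=list)


def parse_logical_form(lines: List[str]) -> LFNode:
    """Build a tree from indented lines; the least-indented first line is the root."""
    root: Optional[LFNode] = None
    stack: List[LFNode] = []  # path from the root to the most recently read line
    for raw in lines:
        if not raw.strip():
            continue  # ignore blank lines
        indent = len(raw) - len(raw.lstrip(" "))
        node = LFNode(text=raw.strip(), indent=indent)
        # Pop until the top of the stack has a lesser indentation: that node
        # is the first line above with a lesser indentation, i.e. the parent.
        while stack and stack[-1].indent >= indent:
            stack.pop()
        if stack:
            node.parent = stack[-1]
            stack[-1].children.append(node)
        elif root is None:
            root = node
        else:
            raise ValueError("line is not nested under the root node")
        stack.append(node)
    if root is None:
        raise ValueError("empty Logical Form")
    return root


# Placeholder lines with the same shape as the FIG. 11C example: lines 2 and 3
# are children of line 1, and line 4 is a child of line 3.
example = ["node 1", "  node 2", "  node 3", "    node 4"]
tree = parse_logical_form(example)
```

Under this sketch, two nodes that share the same parent appear in the same children list, which is one way the sibling relationship could be recovered.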
As discussed above, a frame extraction rule specifies a pattern that matches against a Logical Form which has been produced from an input statement (i.e., a UNL, such as a sentence). If the frame extraction rule matches, a frame instance is produced.
An overall structure, for a frame extraction rule, is that it expresses a tree pattern for matching against an input Logical Form. In general, a frame extraction rule has two main parts:
For purposes of organization, each frame extraction rule can be given a name.
A frame extraction rule can be expressed as a collection of simpler rules, each such simpler rule referred to herein as a "Logical Form rule." A Logical Form rule, like the overall frame extraction rule of which it is a part, can also have a conditional part and an action part. Logical Form rules can be of two main varieties: mandatory and optional. For a frame extraction rule to take action, all of its mandatory Logical Form rules must be satisfied. Any optional Logical Form rules, that are also satisfied when all mandatory Logical Form rules are satisfied, can specify additional action that can be taken by the frame extraction rule.
In order to further discuss frame extraction rules, in general, it will be useful to present a format for presenting such rules as pseudo-code. An example tree-structured frame extraction rule, shown in the pseudo-code, is presented in FIG. 11A. For the pseudo-coded rules presented herein, the name for the rule is provided in the first line (for the example rule of FIG. 11A, "IMPROVE_Rule_TP" is its name).
For the pseudo-coded frame extraction rules presented herein, each line (other than the line specifying a name for the frame extraction rule) represents a Logical Form rule. Each Logical Form rule is mandatory, unless the entire line is enclosed in a pair of parentheses.
For the type of Logical Form rule presented herein, its conditional part specifies the conditions under which it is satisfied by a node "n1" of the input Logical Form while its action part specifies the role, of a frame instance, that is assigned the value "n1."
The conditional part, of a Logical Form rule, can itself be comprised of two sub-parts (both of which must be satisfied by a single node of a Logical Form):
For each Logical Form rule presented herein, its syntax divides it into three parts (from left to right):
As can be seen, the node-based sub-part is separated from the text-based sub-part by a colon symbol, while the text-based sub-part is separated from the action by a right-pointing arrow symbol.
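The three-part layout just described (node-based sub-part, colon, text-based sub-part, right-pointing arrow, action) can be sketched as a small line parser. This is only an illustration under stated assumptions: the ASCII sequence "->" is accepted as a stand-in for the arrow symbol, the names LogicalFormRule and parse_rule_line are hypothetical, and the two sample lines are invented for the sketch rather than copied from FIG. 11A.

```python
import re
from dataclasses import dataclass


@dataclass
class LogicalFormRule:
    node_part: str      # node-based sub-part (left of the colon)
    text_part: str      # text-based sub-part (between the colon and the arrow)
    role: str           # role named by the action part
    core_subtree: bool  # True when the role name is enclosed in square brackets
    optional: bool      # True when the entire line is enclosed in parentheses


def parse_rule_line(line: str) -> LogicalFormRule:
    s = line.strip()
    # Assumption: a line that both starts with "(" and ends with ")" is wrapped
    # in one enclosing pair, which marks the rule as optional.
    optional = s.startswith("(") and s.endswith(")")
    if optional:
        s = s[1:-1].strip()
    # Split the conditional part from the action at the first arrow.
    parts = re.split(r"\s*(?:->|→)\s*", s, maxsplit=1)
    if len(parts) != 2:
        raise ValueError("rule line has no arrow separating condition and action")
    condition, action = parts
    # Split the conditional part at the first colon.
    node_part, sep, text_part = condition.partition(":")
    if not sep:
        raise ValueError("rule line has no colon separating the two sub-parts")
    role = action.strip()
    core_subtree = role.startswith("[") and role.endswith("]")
    if core_subtree:
        role = role[1:-1].strip()
    return LogicalFormRule(node_part.strip(), text_part.strip(), role,
                           core_subtree, optional)


# Hypothetical rule lines, written only to exercise the parser.
mandatory = parse_rule_line("Root: IMPROVE -> [Benefit_Role]")
optional = parse_rule_line("(Mod: Prep(for|in) ... HUMAN -> Beneficiary_Role)")
```

In this sketch the bracketed role name is recorded as a flag (core_subtree) rather than acted upon; how the core argument structure is collected from the sub-tree is left open.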
The node-based sub-part can specify either of the following two conditions:
The action specifies a role, of the frame instance created, that is assigned a value as a result of the Logical Form rule being satisfied. The value assigned to a role can comprise the textual part of the Logical Form node that satisfies the rule's node-based sub-part. Additional information, that can comprise the value assigned to a role, includes the following: if the node "n1," satisfying the node-based sub-part, is the root of a sub-tree, the textual parts of some essential child nodes, of such sub-tree, can be assigned to the role. For example, if n1 is the root of a verb phrase, it is typical for only the core argument structure, of such verb phrase, to be assigned to the role. The core argument structure of a verb phrase typically consists of the verb itself and, possibly, the undergoer and/or complement. Such core verb phrase typically excludes adverbial details, such as time and/or location. Assignment of the selected core textual parts, of a sub-tree's child nodes, is indicated herein by enclosing the role name in square brackets.
Regarding the specification of conditions, for matching the node-based sub-part of a Logical Form rule, line 2 of FIG. 11A depicts a Logical Form rule where the node-based sub-part requires a matching node to be a sub-tree root. By enclosing the role name (Benefit_Role) in square brackets, it is known that the verb phrase (and not just the verb matching IMPROVE) is assigned to the Benefit_Role. Line 3 of FIG. 11A depicts a Logical Form rule where the node-based sub-part requires a node to be of a certain semantic constituent type. For example, line 3 requires semantic constituent type "Actor."
Typically, only one Logical Form rule, of a frame extraction rule, uses a node-based sub-part that requires its matching node to serve as the sub-tree root. This Logical Form rule can be referred to as the "root Logical Form rule." The root Logical Form rule can be used as the entry point for a frame extraction rule: it can be tested, for matching against an input Logical Form, before any other Logical Form rules are tested. If the root Logical Form rule does not match, then no further Logical Form rules of the frame extraction rule need be tested.
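The control flow described above (test the root Logical Form rule first, then require every mandatory rule, then let satisfied optional rules add further role assignments) can be sketched as follows. This is a simplified illustration, not the matching procedure of any particular embodiment: the tree is flattened to a root plus its direct children, nodes are plain dictionaries, and the names Rule, apply_frame_rule, Instrument_Role and Time_Role are hypothetical.

```python
from typing import Callable, Dict, List, NamedTuple, Optional

Node = Dict[str, str]  # hypothetical node shape: {"type": ..., "text": ...}


class Rule(NamedTuple):
    condition: Callable[[Node], bool]  # conditional part, as a predicate
    role: str                          # role assigned by the action part
    optional: bool = False


def apply_frame_rule(root_rule: Rule, child_rules: List[Rule],
                     root: Node, children: List[Node]) -> Optional[Dict[str, str]]:
    """Return a frame instance (role -> text), or None if the rule does not match."""
    # Entry point: if the root rule fails, no other rule needs to be tested.
    if not root_rule.condition(root):
        return None
    instance = {root_rule.role: root["text"]}
    for rule in child_rules:
        match = next((n for n in children if rule.condition(n)), None)
        if match is not None:
            instance[rule.role] = match["text"]
        elif not rule.optional:
            return None  # a mandatory rule is unsatisfied, so no instance
    return instance


# Example using the Actor/Undergoer reading given earlier for "exercise"
# and "bone density"; the verb text and role names are illustrative only.
root = {"type": "Verb", "text": "enhance"}
children = [{"type": "Actor", "text": "exercise"},
            {"type": "Undergoer", "text": "bone density"}]
root_rule = Rule(lambda n: n["text"] == "enhance", "Benefit_Role")
child_rules = [Rule(lambda n: n["type"] == "Actor", "Instrument_Role"),
               Rule(lambda n: n["type"] == "Time", "Time_Role", optional=True)]
instance = apply_frame_rule(root_rule, child_rules, root, children)
```

Here the optional Time rule finds no matching node and is simply skipped, while a missing Actor node would cause the whole frame extraction rule to produce no instance.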
The text-based sub-part, of a Logical Form rule, specifies a pattern of lexical units and/or features that need to appear in the textual part of a Logical Form node, even if that node already matches the node-based sub-part of the Logical Form rule. A "feature" is represented, in the pseudo-coded frame extraction rules, by any word that is entirely capitalized. Please see section 6.4 ("Features") for a definition of a feature. The frame extraction rule of FIG. 11A contains the following feature: TECHNOLOGY (line 3).
One type of pattern, that can be specified by the text-based sub-part, is a prepositional phrase. In particular, the text-based sub-part can specify that a preposition must be followed by a specific noun or by a feature that represents a collection of nouns. For example, the text-based sub-part of line 6 of FIG. 11A requires that the preposition "for" or "in" be followed by a noun that satisfies the feature HUMAN. The exact syntax is: Prep(for|in) . . . HUMAN.
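A test of the form Prep(for|in) . . . HUMAN can be sketched as a check over the tokens of a node's textual part. Several points here are assumptions: that the ellipsis permits intervening words between the preposition and the noun, that a feature is looked up in a table of defining sets, and the contents of the tiny HUMAN set below, which is invented for illustration and is not the defining set used by the document.

```python
from typing import Dict, Iterable, Set

# Hypothetical defining set; not the document's actual HUMAN feature.
FEATURES: Dict[str, Set[str]] = {
    "HUMAN": {"patient", "patients", "people", "user", "users"},
}


def matches_prep_feature(tokens: Iterable[str], preps: Set[str], feature: str) -> bool:
    """True if a preposition in `preps` is later followed by a member of `feature`."""
    members = FEATURES[feature]
    seen_prep = False
    for tok in (t.lower() for t in tokens):
        if seen_prep and tok in members:
            return True
        if tok in preps:
            seen_prep = True
    return False


ok = matches_prep_feature("for elderly patients".split(), {"for", "in"}, "HUMAN")
no = matches_prep_feature("in the morning".split(), {"for", "in"}, "HUMAN")
```

A fuller implementation would likely work on parsed noun phrases rather than raw tokens; the sketch only shows the ordering constraint between the preposition and the feature member.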
The tree structure, specified by a pseudo-coded frame extraction rule, can be indicated by the indentation of its Logical Form rules and by the use, or non-use, of blank lines between such Logical Form rules. As with specifying the Logical Form itself, greater indentation of a line (i.e., further distance of a line from the left margin) is used herein to indicate a Logical Form rule calling for a node farther from the root.
A Logical Form rule "LF1" and a Logical Form rule "LF2" specify, respectively, two nodes in a parent and child relationship when LF1 is the first Logical Form rule that is both above LF2 and has a lesser indentation than LF2. For example, in FIG. 11A, each of lines 3-5 specifies a node that is a child of the node specified by line 2. Logical Form rules "LF1" and "LF2" specify two nodes in a sibling relationship when the following conditions are satisfied:
In certain cases, multiple Logical Form rules can be combined, with an appropriate logical operator, to form one compound Logical Form rule. For example, a group of Logical Form rules can be combined by the XOR operator. In this case, when one, and only one, of the Logical Form rules is satisfied, the compound Logical Form rule is also satisfied.
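The "one, and only one" condition stated above can be written as a short check. The function name xor_satisfied is hypothetical; its input is assumed to be the satisfied/unsatisfied result of each member Logical Form rule.

```python
from typing import Iterable


def xor_satisfied(results: Iterable[bool]) -> bool:
    """Compound rule holds when exactly one member rule is satisfied."""
    return sum(1 for r in results if r) == 1
```

Note that for three or more member rules this "exactly one" reading differs from chaining a binary XOR operator, which would also accept any odd number of satisfied rules.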
For the pseudo-coded example frame extraction rules presented herein, a pair of Logical Form rules "LF1" and "LF2" are implicitly combined with an XOR operator when the following conditions are satisfied:
This section presents an example defining set (i.e., a set of lexical units) for each feature utilized in the example benefit frame extraction rules presented herein. As discussed above, a "feature" is represented, in the pseudo-coded frame extraction rules, by any word that is entirely capitalized. A multi-word lexical unit, that is a member of a defining set, is connected with the underscore character.
ABSTRACT_NOUN
As discussed above, a snippet refers to the locality around the match of a frame to a location in computer-accessible content. More specifically, if a match of a frame has occurred in a UNL "UM1," the snippet comprises a copy of UM1 (also called the "focus" UNL) and may also comprise a copy of additional, surrounding, contextual content.
Choosing an appropriately-sized snippet depends on several factors. First, it can depend upon the UNL by which frame instances are identified (e.g., whether frames are identified within individual sentences or across larger units of text). Second, it can depend upon providing sufficient surrounding context for keyword searching. Third, snippet size can depend upon the amount of text necessary, for a user of a search system, such that a snippet can be read and evaluated, apart from its original source content.
A specific issue to consider, in determining snippet size, is pronoun resolution. In the context of snippet size determination, the pronoun resolution problem can be stated as follows. If a pronoun occurs in a UNL "U1," in which a frame instance has been identified, it is desirable that the pronoun's antecedent noun appear in the snippet context that surrounds "U1." The larger the snippet size, the more likely it is that all pronouns of "U1" will be resolved. Counterbalancing pronoun resolution, however, are such factors as making a snippet small enough for fast comprehension by the searcher.
If the UNL by which frame instances are identified is the sentence, a snippet size of five sentences has been experimentally determined as desirable. Once a frame instance has been identified in a focus sentence "S1," two sentences before S1 and two sentences after S1 can be added to the snippet to provide sufficient context for S1. While a desirable goal, depending upon the logical organization of the computer-accessible content from which snippets are being extracted, an individual snippet may comprise fewer than five sentences. For example, the computer-accessible content may be organized into separate documents. If S1 is at the beginning of a document, two sentences prior to S1 may not be available for addition to the snippet. Similarly, if S1 is at the end of a document, two sentences after S1 may not be available for addition to the snippet.
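The five-sentence window, including the shortened snippets that arise at the beginning or end of a document, can be sketched as a clipped slice. The function name build_snippet and the representation of a document as a list of sentence strings are assumptions made for illustration.

```python
from typing import List


def build_snippet(sentences: List[str], focus: int, context: int = 2) -> List[str]:
    """Return the focus sentence plus up to `context` sentences on each side."""
    if not 0 <= focus < len(sentences):
        raise IndexError("focus sentence index out of range")
    # Clip the window at the document boundaries, so a focus sentence near
    # either end yields fewer than 2 * context + 1 sentences.
    start = max(0, focus - context)
    end = min(len(sentences), focus + context + 1)
    return sentences[start:end]


document = ["s0", "s1", "s2", "s3", "s4", "s5", "s6"]
middle = build_snippet(document, 3)   # full five-sentence snippet
at_start = build_snippet(document, 0) # only following context is available
```

Because the window never crosses the list boundaries, sentences from a neighboring document are not pulled into the snippet when each document is segmented separately.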
FIG. 15 depicts an example production-level computer system design in which the techniques described herein can be applied.
Cloud 1530 represents data available via the Internet. Computer 1510 can execute a web crawling program, such as Heritrix, that finds appropriate web pages and collects them in an input database 1500. An alternative, or additional, route for collecting input database 1500 is to use user-supplied data 1531. For example, such user-supplied data 1531 can include the following: any non-volatile media (e.g., a hard drive, CD-ROM or DVD), record-oriented databases (relational or otherwise), an Intranet or a document repository. A computer 1511 can be used to process (e.g., reformat) such user-supplied data 1531 for input database 1500.
Computer 1512 can perform the indexing needed for formation of an appropriate FBDB (for example, an FBDB as discussed in section 5.2.2.3 "Pre-query Indexing"). The indexing phase scans the input database for sentences that refer to an organizing frame, produces a snippet around each such sentence and adds the snippet to the appropriate frame-based database. FIG. 15 depicts an example frame-based database 1501. For an example frame-based HRSE as described in Section 4.2.3 ("Frame-Based HRSE"), four such FBDB's could be produced. Each FBDB could have the following organizing frame(s):
Databases 1520 and 1521 represent, respectively, stable "snapshots" of databases 1500 and 1501. Databases 1520 and 1521 can provide stable databases that are available to service search queries entered by a user at a user computer 1533. Such user query can travel over the Internet (indicated by cloud 1532) to a web interfacing computer 1514 that can also run a firewall program. Computer 1513 can receive the user query and perform a search upon the contents of the appropriate FBDB (e.g., FBDB 1521). The search results can be stored in a database 1502 that is private to the individual user. When a snippet of interest is found in the search results, input database 1520 is available to the user to provide the full document from which the snippet was obtained.
In accordance with what is ordinarily known by those in the art, computers 1510, 1511, 1512, 1513, 1514 and 1533 contain computing hardware, and programmable memories, of various types.
The information (such as data and/or instructions) stored on computer-readable media or programmable memories can be accessed through the use of computer-readable code devices embodied therein. A computer-readable code device can represent that portion of a device wherein a defined unit of information (such as a bit) is stored and/or read.
While the invention has been described in conjunction with specific embodiments, it is evident that many alternatives, modifications, variations and equivalents will be apparent in light of the foregoing description. Accordingly, the invention is intended to embrace all such alternatives, modifications, variations and equivalents, as fall within the scope (both literally and by reason of the doctrine of equivalents) of the appended claims.
1. A method for frame-based search, performed by computing hardware and programmable memory, comprising the following steps:
receiving a first rule, for producing an instance in accordance with a first frame, wherein the first frame comprises a first input role and a first output role;
receiving a second rule, for producing an instance in accordance with a second frame, wherein the second frame comprises a second input role and a second output role;
receiving a first source corpus;
identifying first and second units of natural language from, respectively, first and second records of the first source corpus;
producing a first instance, from the first unit, by application of the first rule, wherein the first instance has a first input value assigned to its first input role and a first output value assigned to its first output role;
producing a second instance, from the second unit, by application of the second rule, wherein the second instance has a second input value assigned to its second input role and a second output value assigned to its second output role;
matching an input user query to the first input value of the first instance, and the second input value of the second instance;
determining the first and second instances represent a same third frame;
determining a same third value as representative of the first and second input values;
determining a same fourth value as representative of the first and second output values;
producing, as a result of computing hardware and programmable memory, a first result that contains a first result-value and a first result-base, wherein the first result-value is the fourth value and the first result-base contains the first and second records;
determining, as a result of computing hardware and programmable memory, a member-level demographic value for each member of the first result-base;
determining, as a result of computing hardware and programmable memory, a first demographic value, for the first result, by combining the member-level demographic values; and
displaying, to a user as part of a search result, the first demographic value as a demographic determined for the first result-value.
2. The method of claim 1, wherein the step of determining a member-level demographic value for each member of the first result-base further comprises:
determining whether a first lexical unit, with a demographic association, is present.
3. The method of claim 2, wherein the first lexical unit is indicative of a geographical area.
4. The method of claim 2, wherein the lexical unit is produced by application of automated machine learning procedures to training corpora.
5. The method of claim 1, wherein the step of determining a member-level demographic value for each member of the first result-base further comprises:
application of a third linguistic rule, that triggers upon a logical form, and, if triggered, has an action that indicates a presence of a demographic characteristic.
6. The method of claim 5, wherein the third linguistic rule tests for self-referential demographic identification.
7. The method of claim 5, wherein the third linguistic rule requires a logical form with a verb, actor, and undergoer.
8. The method of claim 5, wherein the third linguistic rule requires a logical form with a verb of a form of to be, and a self-referential actor.
9. The method of claim 5, wherein the third linguistic rule, if triggered, indicates the presence of the demographic characteristic with a confidence distribution.
10. The method of claim 1, further comprising:
producing a first Logical Form semantic representation for the first unit of natural language of the first corpus;
determining whether a first conditional part of the first rule matches the first Logical Form;
producing, if the first conditional part matches, in accordance with a first action part of the first rule, an assignment of values of the first Logical Form to the first input role and first output role of the first instance;
producing a second Logical Form semantic representation for the second unit of natural language of the first corpus;
determining whether a second conditional part of the second rule matches the second Logical Form; and
producing, if the second conditional part matches, in accordance with a second action part of the second rule, an assignment of values of the second Logical Form to the second input role and second output role of the second instance.