
Last Updated: April 19, 2024

Claims for Patent: 8,788,522





Summary for Patent: 8,788,522
Title: Pair character string retrieval system
Abstract: A data structure of index information for retrieving pair character strings on a computer at high speed is provided. A method of retrieving pair character strings appearing in close proximity to each other in a document at high speed using the index information is also provided. Bits of a suffix array of reference document data are rearranged, thereby creating index information (LSA) that is localizable, that is, usable as an index for a subregion of the document. Using this index, a process of dichotomizing a region, with the entire document designated as the initial region, is repeated, and the positions of the index information for a query character string in the reference document data are progressively refined. The distance between the pair is evaluated and candidates are narrowed down. Finally, positions where the pair character strings occur in close proximity to each other are identified.
Inventor(s): Kimura; Kouichi (Koganei, JP)
Assignee: Hitachi, Ltd. (Tokyo, JP)
Application Number: 13/264,037
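
As a point of comparison for the approach summarized in the abstract, the following Python sketch shows a plain baseline rather than the patented method: it builds a suffix array naively, finds the index range of each query string by binary search, and then compares every pair of occurrence positions directly. The function names are hypothetical, and measuring the distance between start positions is an assumption made for illustration.

```python
from bisect import bisect_left, bisect_right


def build_suffix_array(text):
    # Naive construction by sorting suffixes; adequate only for small inputs.
    return sorted(range(len(text)), key=lambda i: text[i:])


def index_range(text, sa, query):
    # Half-open range [lo, hi) of suffix-array indices whose suffixes
    # start with `query`. Truncated sorted suffixes remain sorted.
    prefixes = [text[i:i + len(query)] for i in sa]
    return bisect_left(prefixes, query), bisect_right(prefixes, query)


def close_pairs(text, first, second, max_distance):
    # Assumption: distance is the difference between start positions.
    sa = build_suffix_array(text)
    lo1, hi1 = index_range(text, sa, first)
    lo2, hi2 = index_range(text, sa, second)
    positions1 = sorted(sa[lo1:hi1])
    positions2 = sorted(sa[lo2:hi2])
    return [(p, q) for p in positions1 for q in positions2
            if abs(p - q) <= max_distance]


print(close_pairs("ACGTTTACGAAACGT", "ACG", "AAA", 3))
```

The all-pairs comparison at the end is the step that the localizable index is meant to avoid when each string occurs many times.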
Patent Claims:

1. A pair character string retrieval system, comprising:
a storing device storing a reference character string;
an input device inputting paired first and second character strings, and an upper limit value of a distance between the first and second character strings;
a suffix array construction processor constructing a suffix array of the reference character string;
an index range retrieval processor computing an index range in the suffix array for each of the first and second character strings in an independent manner;
an LSA construction processor that designates the suffix array as an initial block, divides the suffix array into two blocks corresponding to subregions that are first and second halves of the reference character string according to most significant bits thereof being zero or one, designates arrangement of the most significant bits, having been referred to, as a first column of LSA, further dichotomizes the acquired two blocks according to whether second significant bits are zero or one, thus dividing these two blocks into four blocks corresponding to subregions dividing the entire reference character string such as the entire reference genome sequence into four, designates arrangement of the second significant bits, having been referred to, as a second column of LSA, repeats analogous processing until the length of the subregion becomes one, computes each column of LSA in the processing, and thereby constructs localizable index information LSA;
an index range localization processor that designates the respective index ranges in the suffix array that correspond to the first and second character strings as index ranges in the initial block, divides the index range in the initial block into the index ranges in the blocks using information of the first column of the LSA according to division into the two blocks corresponding to the subregions that are the first and second halves of the reference character string, divides the index ranges in the two blocks into index ranges in the four blocks using information of the second column of the LSA according to division into the four blocks corresponding to the subregions dividing the entire reference character string into four, repeats analogous processing thereafter, and performs computation of localizing the index range in the initial block into the index range in the subregion in the reference character string;
an evaluation processor of the distance between index ranges that evaluates an upper limit of a distance between the first and second character strings localized into the subregions in the reference character string such that positional coordinates of the index ranges localized into the subregions on the reference character string have a remaining undetermined range corresponding to the subregions; and
an output device outputting positions on the reference character string of the first and second character strings paired in a distance not exceeding the upper limit value,
wherein the index range localization processor removes information of the index range incapable of creating a pair in a distance not exceeding the upper limit value, as a result of information of the index range which becomes null in the computation process and evaluation by the evaluation processor of the distance between index ranges.
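
One possible reading of the LSA construction and index range localization in claim 1 can be sketched in Python as below. This is an illustration under stated assumptions, not the patentee's implementation: the names `build_lsa`, `split_ranges`, and `close_pairs_lsa` are hypothetical, distance is taken between start positions, counts of one-bits are computed by summing list slices, and the pruning test uses the smallest distance that two subregions could still allow.

```python
def build_lsa(sa):
    """Return LSA bit columns for a suffix array (a permutation of 0..n-1)."""
    n = len(sa)
    bits = max(1, (n - 1).bit_length())
    blocks = [list(sa)]          # the whole suffix array is the initial block
    columns = []
    for level in range(bits):
        shift = bits - 1 - level  # most significant bit first
        column, next_blocks = [], []
        for block in blocks:
            column.extend((v >> shift) & 1 for v in block)
            # Stable split: bit 0 goes to the first half, bit 1 to the second.
            next_blocks.append([v for v in block if not (v >> shift) & 1])
            next_blocks.append([v for v in block if (v >> shift) & 1])
        columns.append(column)
        blocks = next_blocks
    return columns


def split_ranges(column, n, width, active):
    """Localize each (region_start, lo, hi) into the two halves of its region."""
    children = []
    for start, lo, hi in active:
        # Because sa is a permutation, a block starts at its region start.
        block = column[start:min(n, start + width)]
        ones_lo, ones_hi = sum(block[:lo]), sum(block[:hi])
        halves = ((start, lo - ones_lo, hi - ones_hi),
                  (start + width // 2, ones_lo, ones_hi))
        children.extend(h for h in halves if h[1] < h[2])  # drop null ranges
    return children


def close_pairs_lsa(sa, range1, range2, max_distance):
    """Positions (p, q) of the two strings with |p - q| <= max_distance."""
    n = len(sa)
    columns = build_lsa(sa)
    bits = len(columns)
    active1, active2 = [(0, *range1)], [(0, *range2)]
    for level, column in enumerate(columns):
        width = 1 << (bits - level)
        active1 = split_ranges(column, n, width, active1)
        active2 = split_ranges(column, n, width, active2)
        slack = width // 2 - 1  # positions are still undetermined by this much
        active1 = [a for a in active1 if any(
            abs(a[0] - b[0]) - slack <= max_distance for b in active2)]
        active2 = [b for b in active2 if any(
            abs(a[0] - b[0]) - slack <= max_distance for a in active1)]
    return [(a[0], b[0]) for a in active1 for b in active2
            if abs(a[0] - b[0]) <= max_distance]
```

In this sketch the index ranges are half-open `(lo, hi)` pairs relative to the start of their block, and a subregion is dropped as soon as no subregion of the other string can lie within `max_distance` of it.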

2. The pair character string retrieval system according to claim 1, wherein the index range localization processor applies a rank function to the block of each column of the LSA, and computes the positions of the index ranges in the two blocks that are divisions of the block concerned.
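
The rank-based split described in claim 2 can be illustrated with a small Python sketch. It assumes a bit column stored as a list with precomputed prefix counts; the class name `BitColumn` and its methods are hypothetical, and a practical system would more likely use a compressed rank structure.

```python
from itertools import accumulate


class BitColumn:
    """One LSA column with constant-time rank queries via prefix sums."""

    def __init__(self, bits):
        self.bits = list(bits)
        # prefix_ones[i] is the number of one-bits in bits[0:i].
        self.prefix_ones = [0, *accumulate(self.bits)]

    def rank1(self, i):
        return self.prefix_ones[i]

    def rank0(self, i):
        return i - self.prefix_ones[i]

    def split(self, block_start, lo, hi):
        """Split range [lo, hi) of the block beginning at block_start.

        Returns the ranges in the zero-bit block and the one-bit block,
        each relative to the start of its own block.
        """
        base = self.rank1(block_start)
        ones_lo = self.rank1(block_start + lo) - base
        ones_hi = self.rank1(block_start + hi) - base
        return (lo - ones_lo, hi - ones_hi), (ones_lo, ones_hi)


column = BitColumn([1, 0, 1, 0])
print(column.split(0, 0, 2))
```

A range whose two endpoints coincide after the split is null and can be discarded.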

3. The pair character string retrieval system according to claim 1, wherein the reference character string is reference genome sequence data, and the paired first and second character strings are first and second nucleotide sequences of DNA acquired by a DNA sequencer.

4. The pair character string retrieval system according to claim 1, wherein the reference character string is reference genome sequence data, the paired first and second character strings are a pair of first and second nucleotide sequences generated by dichotomizing a nucleotide sequence of cDNA acquired by a DNA sequencer, and the upper limit value is an upper limit value of an intron length, the system further comprising a consensus verification processor that verifies a consensus sequence at and around a splice site in the reference genome sequence data, and the consensus verification processor analyzes a part sandwiched by the first and second nucleotide sequences as an intron sequence, and removes a retrieval result in which a consensus sequence does not appear at or around the splice site.
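
The consensus verification step in claim 4 might look like the following Python sketch. The claim does not say which consensus sequence is checked; this example assumes the common GT donor and AG acceptor dinucleotides at the two ends of the intron, a forward-strand match, and a first sequence that ends exactly at the exon boundary. The function names are hypothetical.

```python
def has_canonical_splice_sites(genome, first_pos, first_len, second_pos,
                               donor="GT", acceptor="AG"):
    # The part sandwiched by the two matched sequences is treated as the intron.
    intron = genome[first_pos + first_len:second_pos]
    return (len(intron) >= len(donor) + len(acceptor)
            and intron.startswith(donor)
            and intron.endswith(acceptor))


def filter_spliced_hits(genome, hits, first_len, max_intron_length):
    """Keep (first_pos, second_pos) hits whose intron passes both checks."""
    kept = []
    for first_pos, second_pos in hits:
        intron_length = second_pos - (first_pos + first_len)
        if 0 < intron_length <= max_intron_length and has_canonical_splice_sites(
                genome, first_pos, first_len, second_pos):
            kept.append((first_pos, second_pos))
    return kept


genome = "AAACCC" + "GTAAGTTTAG" + "GGGTTT"
print(filter_spliced_hits(genome, [(3, 16)], first_len=3, max_intron_length=50))
```

Checking "at and around" the splice site, as the claim puts it, would require tolerating small offsets at each boundary, which this sketch leaves out.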

5. The pair character string retrieval system according to claim 1, further comprising: a protein identification processor, wherein the reference character string is a plurality of pieces of reference protein sequence data, the paired first and second character strings are two amino acid sequences of a fragmented protein acquired by a mass spectroscopic system, the suffix array construction processor connects amino acid sequences in the plurality of pieces of reference protein sequence data to each other via a delimiter, and computes a suffix array thereof, the protein identification processor identifies a protein including the paired two amino acid sequences, and the output device outputs a name of the identified protein, and information of occurring positions of the paired two amino acid sequences.

6. The pair character string retrieval system according to claim 1, further comprising: a document identification processor, wherein the reference character string is reference document data in which a plurality of pieces of document data are connected via a distinguishable delimiter, the document identification processor verifies whether the paired first and second character strings occur in an identical document or not, and the output device outputs a name of the document including the paired first and second character strings and information of occurring positions of the first and second character strings.
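
The delimiter-based concatenation and same-document check in claims 5 and 6 can be sketched in Python as follows. The delimiter character, the function names, and the use of a sorted list of start offsets with binary search are assumptions made for illustration; the claims only require that the delimiter be distinguishable.

```python
from bisect import bisect_right


def concatenate(documents, delimiter="\x00"):
    """Join named documents and record where each one starts."""
    names, starts, parts = [], [], []
    offset = 0
    for name, body in documents.items():
        names.append(name)
        starts.append(offset)
        parts.append(body)
        offset += len(body) + len(delimiter)
    return delimiter.join(parts), names, starts


def document_of(position, names, starts):
    # The last document whose start offset is <= position.
    return names[bisect_right(starts, position) - 1]


def same_document_pairs(pairs, names, starts):
    """Keep (p, q) pairs that fall in one document, tagged with its name."""
    kept = []
    for p, q in pairs:
        name = document_of(p, names, starts)
        if name == document_of(q, names, starts):
            kept.append((name, p, q))
    return kept


text, names, starts = concatenate({"doc1": "hello world", "doc2": "world hello"})
print(same_document_pairs([(0, 6), (0, 12), (18, 12)], names, starts))
```

The same bookkeeping applies to claim 5 if the documents are reference protein sequences and the names are protein names.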

Details for Patent 8,788,522

Applicant | Tradename | Biologic Ingredient | Dosage Form | BLA | Approval Date | Patent No. | Expiration Date
Merck Sharp & Dohme Corp. | INTRON A | interferon alfa-2b | For Injection | 103132 | 06/04/1986 | | 2029-04-13
Merck Sharp & Dohme Corp. | INTRON A | interferon alfa-2b | For Injection | 103132 | | | 2029-04-13
Merck Sharp & Dohme Corp. | INTRON A | interferon alfa-2b | Injection | 103132 | | | 2029-04-13

