🔗 Permalink

Patent application title:

Matrix operations in an integrated circuit device

Publication number:

Publication date:

2014-06-24

Application number:

13/296,360

Filed date:

2011-11-15

✅ Patent granted

Patent number:

US 8,762,443 B1

Grant date:

2014-06-24

PCT filing:

PCT publication:

Examiner:

Tan V. Mai

Agent:

Ropes & Gray LLP | Jeffrey H. Ingerman

Adjusted expiration:

2033-02-15

Smart Summary: Matrix operations can be done on smaller parts of a larger matrix using special circuitry in integrated circuits. This setup includes two types of memory: one for processing individual small sections of the matrix and another for handling groups of these sections after they have been processed. The size of the small sections is designed to match the best performance level of the memory used. This method helps with tasks like transposing large matrices, which can be broken down into smaller tasks for easier handling. Using this approach can improve speed and efficiency while reducing the need for expensive and power-hungry memory types. 🚀 TL;DR

Abstract:

Matrix operations circuitry for performing operations on submatrices of an input matrix includes a first working memory in which individual ones of the submatrices are operated on. The first working memory has a first submatrix size. The matrix operations circuitry also includes a second working memory in which a collection of the submatrices, that have been operated on in the first working memory, is operated on. The second working memory has an optimum burst size, and the first submatrix size is matched to the optimum burst size.

Inventors:

Brian L. Kurtz 3 🇺🇸 Addison, TX, United States

Assignee:

ALTERA CORPORATION 3,547 🇺🇸 San Jose, CA, United States

Applicant:

Brian L. Kurtz 🇺🇸 Addison, TX, United States

Interested in similar patents?

Get notified when new applications in this technology area are published.

Create Free Alert

Classification:

G06F7/32 IPC

Methods or arrangements for processing data by operating upon the order or content of the data handled; Arrangements for sorting or merging computer data on continuous record carriers, e.g. tape, drum, disc Merging, i.e. combining data contained in ordered sequence on at least two record carriers to produce a single carrier or set of carriers having all the original data in the ordered sequence merging methods in general

Description

FIELD OF THE INVENTION

This invention relates to performing matrix operations in integrated circuit devices, and particularly in programmable integrated circuit devices including, for example, programmable logic devices (PLDs).

BACKGROUND OF THE INVENTION

Matrix operations on large matrices are becoming more common. For some technical problems, solutions may involve matrices as large as 1000-by-1000. One common operation is matrix transposition. For example, it may be necessary to transpose a large matrix to perform a Fast Fourier Transform operation, an interleaving operation, or other linear algebraic operations.

Large transposition operations can be broken down into a series of smaller transposition operations. For example, to transpose an 8-by-8 matrix, one can break down the matrix into four 2-by-2 matrices. Each of the 2-by-2 matrices can be transposed individually in a series of “inner transposition” operations, after which the larger matrix can be treated as a 2-by-2 matrix, each of whose elements is one of the smaller 2-by-2 matrices. Transposing the positions of the smaller matrices in an “outer transposition” operation, after each of the smaller matrices has been transposed individually, results in a transpose of the larger 8-by-8 matrix.

Such a cascaded transposition technique can be used to transpose any size matrix. However, when the technique is implemented in hardware, memory speed limitations may come into play. For example, some types of memory, such as DDR SDRAM (Double Data Rate Synchronous Dynamic RAM) may be read much faster in one direction (vertically or horizontally) than in the other direction. Thus, for large matrices, performing the transposition within an acceptable duration may require fast memories that are expensive in terms of both price and power consumption. For example, if the remainder of the system uses double-data-rate (DDR) memory, it may be necessary to use quad-data-rate (QDR) memories for the transposition operation.

SUMMARY OF THE INVENTION

In accordance with the present invention, there is provided matrix operations circuitry for performing operations on submatrices of an input matrix. The matrix operations circuitry includes a first working memory in which individual ones of the submatrices are operated on. The first working memory has a first submatrix size. The matrix operations circuitry also includes a second working memory in which a collection of the submatrices, that have been operated on in the first working memory, is operated on. The second working memory has an optimum burst size, and the first submatrix size is matched to the optimum burst size.

A method of configuring such circuitry on a programmable device, and a machine-readable data storage medium encoded with software for performing the method, are also provided.

BRIEF DESCRIPTION OF THE DRAWINGS

Further features of the invention, its nature and various advantages will be apparent upon consideration of the following detailed description, taken in conjunction with the accompanying drawings, in which like reference characters refer to like parts throughout, and in which:

FIG. 1 shows an example of the definition of a matrix transposition;

FIG. 2 shows an example of transposition of elements within submatrices of a matrix;

FIG. 3 shows an example of transposition of submatrices considered as elements of a matrix;

FIG. 4 shows an example of a circuit structure 400 according to an embodiment of the invention;

FIG. 5 is a cross-sectional view of a magnetic data storage medium encoded with a set of machine-executable instructions for performing a method according to the present invention;

FIG. 6 is a cross-sectional view of an optically readable data storage medium encoded with a set of machine executable instructions for performing a method according to the present invention; and

FIG. 7 is a simplified block diagram of an illustrative system employing a programmable logic device incorporating the present invention.

DETAILED DESCRIPTION OF THE INVENTION

The present invention may be used to transpose, at acceptable speed, a matrix of any size, provided at least one of its dimensions is a non-prime number, by optimizing the size of the inner transposition operation to the burst speed of the memory being used. In practice, the non-prime number restriction may not come into play often, because most matrices that are operated on are large enough that it becomes unusual to find one with a prime dimension.

FIG. 1 shows an example of the definition of a matrix transposition. Although this example depicts the case of a 4×4 matrix, any size matrix may be transposed. As seen in FIG. 1, transposition involves reading each column 111 of a source matrix 101, and writing that column 111 as a row 122 of a transposed matrix 102. This can also be thought of as reading each row 121 of source matrix 101, and writing that row 121 as a column 112 of a transposed matrix 102. When performed electronically in a memory, one of the two orthogonal directions will be slower than (as slow as 5% of the speed of) the other direction. Whether one reads rows and write columns, or reads columns and writes rows, either the reading or the writing will be in the slow direction.

As seen in FIG. 1, for a square matrix, transposition results in the flipping of source matrix 101 about its upper-left-to-lower-right diagonal. However, any shape matrix can be transposed by writing each column, in order, as a row (or each row, in order, as a column). For a non-square matrix, this results in changing the dimensions of the matrix from m×n to n×m. (Technically, the same is true for a square matrix, except that for a square matrix m×n and n×m are the same.)

Matrix transposition can be cascaded—i.e., performed by breaking down the source matrix into smaller submatrices, and transposing each submatrix. Then, treating the submatrices as the elements of the source matrix, the positions of the submatrices can be transposed. FIGS. 2 and 3 show an example of transposition of a 4×4 matrix as a cascaded transposition of 2×2 submatrices.

As seen in FIG. 2, a 4×4 source matrix 201 may be broken down into four 2×2 submatrices 211, and each submatrix 211 may be transposed to yield an intermediate matrix 202 composed of transposed submatrices 212. Next, as shown in FIG. 3, treating transposed submatrices 212 as elements of matrix 202, the positions of transposed submatrices 212 are transposed within matrix 202 to yield transposed matrix 301. Comparison with FIG. 1 shows that source matrix 201 is identical to source matrix 101, and transposed matrix 301, derived using the cascaded transposition technique, is identical to transposed matrix 102, derived using the straightforward transposition technique.

As noted above, matrix transposition is used in many mathematical operations, including, but not limited to, Fast Fourier Transforms (FFTs), back-substitution for QR decomposition, interleaving, and various linear algebraic techniques.

For example, a 1-million-point FFT requires a 1,000×1,000 transposition with 32 bit I and Q data. In such a case, involving complex numbers each of which has a 32-bit real component and a 32-bit imaginary component, each element includes 64 bits. With 1 million elements, the total memory required may exceed 64 Mb, exclusive of buffering. That storage requirement may not present substantial difficulties with available DDR densities, but the data access may be very irregular. Depending on the type of memory used, as noted above, reading and writing may be substantially slower in one direction than in the direction orthogonal to the one direction. To maintain the desired throughput, this may require performing the transposition operation entirely in small, expensive, power hungry QDR external memories.

In accordance with embodiments of the present invention, the size of the submatrices is chosen to jointly optimize the amount of internal memory required (in this case, optimization may be minimization), and the burst length of the “external,” or main matrix, memory, which may be DDR memory (in this case, optimization may be maximization).

For example, if DDR memory has an optimal burst length of B=64 words, then the 1-million-point FFT described above may be optimized by selecting an inner (i.e., submatrix) transposition size of 8×8=64 words. With an inner square size of 8×8, a 1 Mb DDR memory (containing 2²⁰bits) can be broken down into 128×128 8×8 submatrices (2⁷×2⁷×2³×2³=2²⁰). If the transposition were performed using the straightforward technique, for a burst length of 1 the memory efficiency is less than 5%. However, for a DDR memory with a burst length of 64, the memory efficiency may exceed 90% with effective bank interleaving.

Thus, a submatrix size of 8×8 in such a case improves efficiency by a factor of more than 18. The optimum submatrix size may vary depending on the particular type of memory involved, and similarly the efficiency improvement will depend on not only the efficiency of the particular memory at the optimum submatrix size for that memory, but also on the efficiency for a burst length of 1 for that memory. However, the ability to use DDR memory (e.g., DDR3 memory) instead of QDR memory reduces power consumption by at least 50%, reduces cost by up to 90%, reduces the amount of board space consumed, and increases the total aggregate bandwidth.

As noted above, the cascading technique will work with any size matrix, whether square nor non-square, as long at least one of its dimensions is not a prime number. Therefore, when implementing the technique in hardware, a series of identical reusable blocks can be provided regardless of the size of the matrix to be transposed (although the external memory for storing the input and output matrix, and the scratchpad memory, may be of different sizes). This is particularly advantageous when the technique is implemented in a programmable logic device such as a field-programmable gate array (FPGA) which is built as a series of identical reusable blocks. Moreover, programming software for such an FPGA implementation can similarly be modular and scalable.

The technique also may extended beyond two-dimensional matrices. For example, a Super Sample Rate Fast Fourier Transform (SSFFT) can involve transposition of a second vector, which effectively is a rotation of a three-dimensional structure. Indeed, the transposition technique can be extended generically to n-dimensional structures, where it is more commonly referred to as permutation. For example, an N×M×P matrix can become an M×P×N matrix or a P×M×N or some other variant.

FIG. 4 shows an example of a circuit structure 400 according to an embodiment of the invention for implementing a two-dimensional transposition. Circuit structure 400 may be implemented in any dedicated circuitry—e.g., in an application-specific integrated circuit (ASIC)—or may be configured in a programmable device as discussed above. Either way, transposition engine 401 includes an “internal” or scratchpad memory 411 which is sized according to the size of the submatrices to be transposed in the first, or inner, transposition step, as well as an external memory interface 412. Internal address generator 421 generates the addressing for performing the inner transposition in internal memory 411, while external address generator 422 generates the addressing for external memory interface 412 to control the outer transposition. The actual memory write operations for the outer transposition are performed in external memory 402, which is external to transposition engine 401, and may be completely external to whatever device transposition engine 401 is implemented in.

Source memory 441 is external to transposition engine 401, and may be completely external to whatever device transposition engine 401 is implemented in. In operation, data are read into transposition engine 401 from source memory 441 in blocks of the inner transposition size, which are transposed in internal memory 411 under control of internal address generator 421 and transferred at 451 to external memory interface 412. External memory interface 412 transfers the inner transposition results at 452 to external memory 402, and reads them back via 452, under the control of external address generator 422 to perform the outer transposition. The results are read out at 453.

Internal memory 411 operates on single-element words. The size of internal memory 411 is selected with the goal of reducing that size, but also so that the number of elements in internal memory 411 form a larger word of a size for which the burst speed of external memory 402 is improved. For example, in the 1 Mb FFT example above, the size of internal memory 411 may be 64 single-element words, which correlates to the 64-word optimum burst size of external memory 402. Of course, these are only examples. If a particular DDR memory used as external memory 402 has a different optimum burst size, then internal memory 411 can be sized accordingly. The various parameters of a system can be traded off, and a user can select a burst size that allows the most flexible trade-offs.

As a comparison, in previously known matrix transposition architectures (not shown), internal memory 411 and internal address generator 421 would not be present, and the entire transposition operation would be carried out using external memory interface 412 and external memory 402, under the control of external address generator 422. Because in such an architecture there was no internal memory, transfers between external memory interface 412 and external memory 402 would have had to be in the form of single words, at only about 5% efficiency, as compared to 90% efficiency for the embodiment of the invention described above.

As noted above, the structures described above may be provided in fixed logic, in which case the sizes of the various components may be fixed to a particular application. Alternatively, the fixed logic circuitry could allow for limited parameterization.

Again as noted above, another potential use for the present invention may be in programmable integrated circuit devices such as programmable logic devices, where programming software can be provided to allow users to configure a programmable device to perform matrix operations.

Instructions for carrying out a method according to this invention for programming a programmable device to perform matrix transposition may be encoded on a machine-readable medium, to be executed by a suitable computer or similar device to implement the method of the invention for programming or configuring PLDs or other programmable devices to perform addition and subtraction operations as described above. For example, a personal computer may be equipped with an interface to which a PLD can be connected, and the personal computer can be used by a user to program the PLD using a suitable software tool, such as the QUARTUS® II software available from Altera Corporation, of San Jose, Calif.

FIG. 5 presents a cross section of a magnetic data storage medium 800 which can be encoded with a machine executable program that can be carried out by systems such as the aforementioned personal computer, or other computer or similar device. Medium 800 can be a floppy diskette or hard disk, or magnetic tape, having a suitable substrate 801, which may be conventional, and a suitable coating 802, which may be conventional, on one or both sides, containing magnetic domains (not visible) whose polarity or orientation can be altered magnetically. Except in the case where it is magnetic tape, medium 800 may also have an opening (not shown) for receiving the spindle of a disk drive or other data storage device.

The magnetic domains of coating 802 of medium 800 are polarized or oriented so as to encode, in manner which may be conventional, a machine-executable program, for execution by a programming system such as a personal computer or other computer or similar system, having a socket or peripheral attachment into which the PLD to be programmed may be inserted, to configure appropriate portions of the PLD, including its specialized processing blocks, if any, in accordance with the invention.

FIG. 6 shows a cross section of an optically-readable data storage medium 810 which also can be encoded with such a machine-executable program, which can be carried out by systems such as the aforementioned personal computer, or other computer or similar device. Medium 810 can be a conventional compact disk read-only memory (CD-ROM) or digital video disk read-only memory (DVD-ROM) or a rewriteable medium such as a CD-R, CD-RW, DVD-R, DVD-RW, DVD+R, DVD+RW, or DVD-RAM or a magneto-optical disk which is optically readable and magneto-optically rewriteable. Medium 810 preferably has a suitable substrate 811, which may be conventional, and a suitable coating 812, which may be conventional, usually on one or both sides of substrate 811.

In the case of a CD-based or DVD-based medium, as is well known, coating 812 is reflective and is impressed with a plurality of pits 813, arranged on one or more layers, to encode the machine-executable program. The arrangement of pits is read by reflecting laser light off the surface of coating 812. A protective coating 814, which preferably is substantially transparent, is provided on top of coating 812.

In the case of magneto-optical disk, as is well known, coating 812 has no pits 813, but has a plurality of magnetic domains whose polarity or orientation can be changed magnetically when heated above a certain temperature, as by a laser (not shown). The orientation of the domains can be read by measuring the polarization of laser light reflected from coating 812. The arrangement of the domains encodes the program as described above.

A PLD 90 programmed according to the present invention may be used in many kinds of electronic devices. One possible use is in a data processing system 900 shown in FIG. 7. Data processing system 900 may include one or more of the following components: a processor 901; memory 902; I/O circuitry 903; and peripheral devices 904. These components are coupled together by a system bus 905 and are populated on a circuit board 906 which is contained in an end-user system 907.

System 900 can be used in a wide variety of applications, such as computer networking, data networking, instrumentation, video processing, digital signal processing, or any other application where the advantage of using programmable or reprogrammable logic is desirable. PLD 90 can be used to perform a variety of different logic functions. For example, PLD 90 can be configured as a processor or controller that works in cooperation with processor 901. PLD 90 may also be used as an arbiter for arbitrating access to a shared resources in system 900. In yet another example, PLD 90 can be configured as an interface between processor 901 and one of the other components in system 900. It should be noted that system 900 is only exemplary, and that the true scope and spirit of the invention should be indicated by the following claims.

Various technologies can be used to implement PLDs 90 as described above and incorporating this invention.

It will be understood that the foregoing is only illustrative of the principles of the invention, and that various modifications can be made by those skilled in the art without departing from the scope and spirit of the invention. For example, the various elements of this invention can be provided on a PLD in any desired number and/or arrangement. One skilled in the art will appreciate that the present invention can be practiced by other than the described embodiments, which are presented for purposes of illustration and not of limitation, and the present invention is limited only by the claims that follow.

Claims

What is claimed is:

1. Matrix operations circuitry for performing operations on submatrices of an input matrix, said matrix operations circuitry comprising:

a first working memory in which individual ones of said submatrices are operated on, said first working memory having a first submatrix size; and

a second working memory in which a collection of said submatrices, that have been operated on in said first working memory, is operated on, said second working memory having an optimum burst size; wherein:

said first submatrix size is matched to said optimum burst size.

2. The matrix operations circuitry of claim 1 wherein said first submatrix size is equal to said optimum burst size.

3. The matrix operations circuitry of claim 1 wherein said input matrix has dimensions, at least one of said dimensions being other than a prime number.

4. The matrix operations circuitry of claim 1 wherein said operations comprise transposing said input matrix, said matrix operations circuitry further comprising:

first address generation circuitry for selection of one of said submatrices from said input matrix for reading into said first working memory, and for controlling transposition of said one of said submatrices in said first working memory.

5. The matrix operations circuitry of claim 4 further comprising second address generation circuitry for controlling transposition of positions of said submatrices within said input matrix in said second working memory.

6. The matrix operations circuitry of claim 5 wherein:

said first working memory is part of an integrated circuit device; and

said second working memory is external to said integrated circuit device; said matrix operations circuitry further comprising:

an external memory interface on said integrated circuit device, said external memory interface being coupled to said first working memory, said second working memory, and said second address generation circuitry, and reading and writing data to and from said second working memory under control of said second address generation circuitry.

7. The matrix operations circuitry of claim 6 further comprising input matrix storage external to said integrated circuit device and coupled to said first working memory and to said first address generation circuitry.

8. The matrix operations circuitry of claim 6 wherein said integrated circuit device is programmable.

9. The matrix operations circuitry of claim 8 wherein said programmable integrated circuit device is a programmable logic device.

10. A method of configuring a programmable integrated circuit device as matrix operations circuitry for performing operations on submatrices of an input matrix, said method comprising:

configuring memory of said programmable integrated circuit device as a first working memory in which individual ones of said submatrices are operated on, said first working memory having a first submatrix size; and

configuring a second working memory in which a collection of said submatrices, that have been operated on in said first working memory, is operated on, said second working memory having an optimum burst size; wherein:

said first submatrix size is configured to be matched to said optimum burst size.

11. The method of claim 10 wherein said first submatrix size is configured to be equal to said optimum burst size.

12. The method of claim 10 wherein said input matrix has dimensions, at least one of said dimensions being other than a prime number.

13. The method of claim 10 wherein said operations comprise transposing said input matrix, said method further comprising:

configuring logic of said programmable integrated circuit device as first address generation circuitry for selection of one of said submatrices from said input matrix for reading into said first working memory, and for controlling transposition of said one of said submatrices in said first working memory.

14. The method of claim 13 further comprising configuring logic of said programmable integrated circuit device as second address generation circuitry for controlling transposition of positions of said submatrices within said input matrix in said second working memory.

15. The method of claim 14 wherein:

said second working memory is external to said programmable integrated circuit device; said method further comprising:

configuring logic of said programmable integrated circuit device as an external memory interface that (a) is coupled to said first working memory, said second working memory, and said second address generation circuitry, and (b) reads and writes data to and from said second working memory under control of said second address generation circuitry.

16. The method of claim 15 wherein:

said input matrix is stored in input storage external to said integrated circuit device; and

said configuring logic of said programmable integrated circuit device as said first address generation circuitry comprises configuring said first address generation circuitry to be coupled to said first working memory and to said first address generation circuitry.

17. The method of claim 15 wherein said programmable integrated circuit device is a programmable logic device.

18. A non-transitory machine-readable data storage medium encoded with machine-executable instructions for configuring a programmable integrated circuit device as matrix operations circuitry for performing operations on submatrices of an input matrix, said instructions comprising:

instructions to configure memory of said programmable integrated circuit device as a first working memory in which individual ones of said submatrices are operated on, said first working memory having a first submatrix size; and

instructions to configure a second working memory in which a collection of said submatrices, that have been operated on in said first working memory, is operated on, said second working memory having an optimum burst size; wherein:

in said instructions to configure memory of said programmable integrated circuit device as a first working memory, said first submatirx size is configured to be matched to said optimum burst size.

19. The non-transitory machine-readable data storage medium of claim 18 wherein in instructions to configure memory of said programmable integrated circuit device as a first working memory, said first submatrix size is configured to be equal to said optimum burst size.

20. The non-transitory machine-readable data storage medium of claim 18 wherein said operations comprise transposing said input matrix, said instructions further comprising:

instructions to configure logic of said programmable integrated circuit device as first address generation circuitry for selection of one of said submatrices from said input matrix for reading into said first working memory, and for controlling transposition of said one of said submatrices in said first working memory.

21. The non-transitory machine-readable data storage medium of claim 20 wherein said instructions further comprise instructions to configure logic of said programmable integrated circuit device as second address generation circuitry for controlling transposition of positions of said submatrices within said input matrix in said second working memory.

22. The non-transitory machine-readable data storage medium of claim 21 wherein:

said instructions to configure said second working memory comprise instructions to configure memory external to said programmable integrated circuit device as said second working memory; said instructions further comprising:

instructions to configure logic of said programmable integrated circuit device as an external memory interface that (a) is coupled to said first working memory, said second working memory, and said second address generation circuitry, and (b) reads and writes data to and from said second working memory under control of said second address generation circuitry.

Resources