Yesterday, I went to a job interview. I was asked some difficult
problems and ways to solve them. One of them was: We have a list of a
million phone numbers on the standard input and we have a reduced memory
pc which we want to use it to sort them and use to check later if any
number given is in that list or not.
The solution offered was to use an array of bits and use the number as
an index, so if b[1234 ] == 0 would mean that the phone number 1234
wasn’t on the input list and 1 would mean that we had it.
I have been thinking about the solution they provided me.
I’ve been thinking for a while and would
like to extend the rationale behind.
Ok, so, Let’s imagine we have 1 million phone numbers. If we choose to
use the bit array approach, that means that, given the English phone
number system, we have numbers like these:
01273 121212
That is, eleven numbers. I guess, the initial 0 is always present, so we
can omit it right now. So that leaves us 10 numbers to deal with.
Supposing that there are no numbers starting with a 0 (after the initial
0), that leaves us with the possibility of having number in the range:
1000000000 to
9999999999 (=10 000 000 000, 10 thousand million numbers )
So, there can be a total of 10,000,000,000 – 1,000,000,000 numbers, which is
then, 9,000,000,000.
If you want an array of bits to use for storing if certain number is on
the list or not, that means that you need an array of 9,000,000,000
elements, or, 9,000,000,000 bits. That is 1,125,000,000 bytes, or, 1,072Mb
or a little bit more than 1 Gb, right?
Ok, that’s a lot of memory. And, the thing is that you would need to
have all that array even if it was the best case (that in which all the
numbers would be the same, one single number repeated one million
time). So the memory consumption would be fixed.
On the other hand, if we used a n-ary tree to hold all the information
as something like this:
0 0 0 1 1 1 2 2 2 3 3 3 4 /4--4 5/ 5 5 6 6 6 7 7 7 8 8 8 9 9 9
would mean that the number 544 (in a 3-number scheme) would exist.
For example, if we used a C struct like this one:
typedef struct Node Node;
struct Node {
Node *array[10]; /* */
};
In which each position of the array of pointers would represent a number
itself and a pointer to the next one.
+----+ +----+ +----+
|null| |null| |null|
+----+ +----+ +----+
|null| |null| |null|
+----+ +----+ +----+
|null| |null| |null|
+----+ +----+ +----+
|null| |null| |null|
+----+ +----+ +----+
| ------ \ |null|
+----+ +----+\ +----+
|null| |null| \ |null|
+----+ +----+ \ +----+
|null| |null| \ XXX |
+----+ +----+ +----+
|null| |null| |null|
+----+ +----+ +----+
|null| |null| |null|
+----+ +----+ +----+
| \ |null| |null|
+----+\ +----+ +----+
\
\ +----+ +----+
\ |null| |null|
\ +----+ +----+
\ null| |null|
+----+ +----+
|null| / XXX |
+----+ / +----+
|null| / |null|
+----+/ +----+
| / |null|
+----+ +----+
|null| |null|
+----+ +----+
|null| |null|
+----+ +----+
|null| |null|
+----+ +----+
|null| |null|
+----+ +----+
| \ |null|
+----+\ +----+
\
\
\ +----+
\ |null|
\+----+
XXX |
+----+
|null|
+----+
|null|
+----+
|null|
+----+
|null|
+----+
|null|
+----+
|null|
+----+
|null|
+----+
|null|
+----+
So that would give us numbers 446 and 912 and 992 in a 3 digit scheme.
We would set up all the pointers to null. If any pointer at any position
is different of null, that means that is a valid number and it points to
the following number (the next array of pointers), so we use the pointer
to both show the number and point to the next structure. The only
exception is for the last node (last digit), that we could make it point
to a invalid address (represented here as XXX).
So, I don’t know if I make myself clear enough, but using this method,
the thing is that in the worst case, we would have 1,000,000
ways to go from the root to the end of each branch, meaning that we
would need 1,000,000 * 10 (the number of digits each phone number has)
nodes to hold all that information. That is 10,000,000 nodes. In this
structure, thats about 40 bytes, which gives us the amount of
400,000,000 bytes, that’s is : 381 Mb approximately, which is 1/3 of the
other approach. But, that’s the worst case scenario. In the average case
scenario, many numbers would be repeated and the total size would be
reduced, actually. And on the best case scenario (one number repeated
one million times) we would only need 10 nodes.
On doing the search, using this n-ary tree, it is O(10) at most if we
need to check the whole number, but as soon we found a null on any
digit’s position, we can stop there, so it would be < O(10) which is a
good number anyway.
the thing is, Am I right, or not?
Ok, so, Let’s imagine we have 1 million phone numbers. If we choose to
use the bit array approach, that means that, given the English phone
number system, we have numbers like these:
01273 121212
That is, eleven numbers. I guess, the initial 0 is always present, so we
can omit it right now. So that leaves us 10 numbers to deal with.
Supposing that there are no numbers starting with a 0 (after the initial
0), that leaves us with the possibility of having number in the range:
1000000000 to
9999999999 (=10 000 000 000, 10 thousand million numbers )
So, there can be a total of 10,000,000,000 – 1,000,000,000 numbers, which is
then, 9,000,000,000.
If you want an array of bits to use for storing if certain number is on
the list or not, that means that you need an array of 9,000,000,000
elements, or, 9,000,000,000 bits. That is 1,125,000,000 bytes, or, 1,072Mb
or a little bit more than 1 Gb, right?
Ok, that’s a lot of memory. And, the thing is that you would need to
have all that array even if it was the best case (that in which all the
numbers would be the same, one single number repeated one million
time). So the memory consumption would be fixed.
On the other hand, if we used a n-ary tree to hold all the information
as something like this:
0 0 0 1 1 1 2 2 2 3 3 3 4 /4--4 5/ 5 5 6 6 6 7 7 7 8 8 8 9 9 9
would mean that the number 544 (in a 3-number scheme) would exist.
For example, if we used a C struct like this one:
typedef struct Node Node;
struct Node {
Node *array[10]; /* */
};
In which each position of the array of pointers would represent a number
itself and a pointer to the next one.
+----+ +----+ +----+
|null| |null| |null|
+----+ +----+ +----+
|null| |null| |null|
+----+ +----+ +----+
|null| |null| |null|
+----+ +----+ +----+
|null| |null| |null|
+----+ +----+ +----+
| ------ \ |null|
+----+ +----+\ +----+
|null| |null| \ |null|
+----+ +----+ \ +----+
|null| |null| \ XXX |
+----+ +----+ +----+
|null| |null| |null|
+----+ +----+ +----+
|null| |null| |null|
+----+ +----+ +----+
| \ |null| |null|
+----+\ +----+ +----+
\
\ +----+ +----+
\ |null| |null|
\ +----+ +----+
\ null| |null|
+----+ +----+
|null| / XXX |
+----+ / +----+
|null| / |null|
+----+/ +----+
| / |null|
+----+ +----+
|null| |null|
+----+ +----+
|null| |null|
+----+ +----+
|null| |null|
+----+ +----+
|null| |null|
+----+ +----+
| \ |null|
+----+\ +----+
\
\
\ +----+
\ |null|
\+----+
XXX |
+----+
|null|
+----+
|null|
+----+
|null|
+----+
|null|
+----+
|null|
+----+
|null|
+----+
|null|
+----+
|null|
+----+
So that would give us numbers 446 and 912 and 992 in a 3 digit scheme.
We would set up all the pointers to null. If any pointer at any position
is different of null, that means that is a valid number and it points to
the following number (the next array of pointers), so we use the pointer
to both show the number and point to the next structure. The only
exception is for the last node (last digit), that we could make it point
to a invalid address (represented here as XXX).
So, I don’t know if I make myself clear enough, but using this method,
the thing is that in the worst case, we would have 1,000,000
ways to go from the root to the end of each branch, meaning that we
would need 1,000,000 * 10 (the number of digits each phone number has)
nodes to hold all that information. That is 10,000,000 nodes. In this
structure, thats about 40 bytes, which gives us the amount of
400,000,000 bytes, that’s is : 381 Mb approximately, which is 1/3 of the
other approach. But, that’s the worst case scenario. In the average case
scenario, many numbers would be repeated and the total size would be
reduced, actually. And on the best case scenario (one number repeated
one million times) we would only need 10 nodes.
On doing the search, using this n-ary tree, it is O(10) at most if we
need to check the whole number, but as soon we found a null on any
digit’s position, we can stop there, so it would be < O(10) which is a
good number anyway.
the thing is, Am I right, or not?
Comment by Gufete — January 31, 2006 @ 21:19
Comment by Antonio — February 1, 2006 @ 00:09
Comment by tripu — February 1, 2006 @ 02:04